# Data Cleaning

In [1]:
import pandas as pd
import numpy as np

In [2]:
laptops = pd.read_csv("laptops.csv", encoding = "latin-1") # windows-1251

### info()
The .info() method prints a concise summary of a DataFrame: the column names, the number of non-null values in each column, the data types, and the memory usage.

In [3]:
laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Manufacturer              1303 non-null   object
 1   Model Name                1303 non-null   object
 2   Category                  1303 non-null   object
 3   Screen Size               1303 non-null   object
 4   Screen                    1303 non-null   object
 5   CPU                       1303 non-null   object
 6   RAM                       1303 non-null   object
 7    Storage                  1303 non-null   object
 8   GPU                       1303 non-null   object
 9   Operating System          1303 non-null   object
 10  Operating System Version  1133 non-null   object
 11  Weight                    1303 non-null   object
 12  Price (Euros)             1303 non-null   object
dtypes: object(13)
memory usage: 132.5+ KB


### head()
The .head() method displays the first rows of a DataFrame. It returns 5 rows by default; pass an integer, as in .head(10), to change that.

In [4]:
laptops.head()

,Manufacturer,Model Name,Category,Screen Size,Screen,CPU,RAM,Storage,GPU,Operating System,Operating System Version,Weight,Price (Euros)
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,133969
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,89894
2,HP,250 G6,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,57500
3,Apple,MacBook Pro,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,,1.83kg,253745
4,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,,1.37kg,180360


### strip()
The .strip() method is a built-in Python string method that removes leading and trailing whitespace from a string and returns the result as a new string. The pandas equivalent for a whole column is .str.strip().

In [5]:
"danial gauhar ".strip()

'danial gauhar'
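When .strip() is given an argument, it treats that argument as a set of characters to remove from both ends, not as a literal prefix or suffix. This matters for later cells that call .str.strip('GB') and .str.strip("()"). A small sketch with made-up strings (none of them come from the datasets):

```python
# Illustrative strings only.
padded = "  8GB \n"
no_whitespace = padded.strip()      # whitespace removed from both ends
print(repr(no_whitespace))

# With an argument, strip() removes any of the listed characters from both ends.
ram = "8GB".strip("GB")
print(ram)

# Because the argument is a character set, it can remove more than a single suffix.
surprise = "GB8GBG".strip("GB")
print(surprise)

# Characters in the middle of the string are never touched.
nationality = "(American)".strip("()")
print(nationality)
```

If only an exact suffix should go, str.removesuffix (Python 3.9+) is the stricter tool.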

### columns
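The .columns attribute holds the column labels of a DataFrame as an Index object. The labels can be read, or replaced by assigning a new set of labels to the attribute. Note the leading space in ' Storage' in the output below; the next cell cleans it up along with the other labels.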

In [6]:
laptops.columns

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', ' Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')

In [7]:
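# regex=False treats the parentheses as literal characters instead of regular-expression syntax
laptops.columns = laptops.columns.str.strip().str.replace(")","",regex=False).str.replace("(","",regex=False).str.replace(" ","_").str.replace("Operating_System","os").str.lower()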



In [8]:
laptops.columns

Index(['manufacturer', 'model_name', 'category', 'screen_size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'os', 'os_version', 'weight',
       'price_euros'],
      dtype='object')
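The same clean-up can be written with plain Python string methods, which always treat their arguments literally. The sketch below is an illustration on a toy DataFrame, not the notebook's own method; clean_col is a hypothetical helper name and the frame holds no data.

```python
import pandas as pd

# Toy frame whose column names mimic the messy labels above (no rows, made-up frame).
toy = pd.DataFrame(columns=[" Storage", "Operating System Version", "Price (Euros)"])

def clean_col(name):
    """Hypothetical helper: normalise a single column label."""
    name = name.strip()                      # drop leading/trailing whitespace
    for ch in "()":
        name = name.replace(ch, "")          # plain str.replace is always literal
    name = name.replace(" ", "_").replace("Operating_System", "os")
    return name.lower()

toy.columns = [clean_col(c) for c in toy.columns]
print(list(toy.columns))
```

Keeping the logic in one function makes it easy to reuse on another file with similar labels.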

The cell below removes the double quotation mark (the inch symbol) from each value in the screen_size column with .str.strip('"') and then converts the cleaned strings to floats with .astype(float).

In [9]:
laptops["screen_size"]=laptops["screen_size"].str.strip('"').astype(float)
laptops["screen_size"]

0       13.3
1       13.3
2       15.6
3       15.4
4       13.3
        ... 
1298    14.0
1299    13.3
1300    14.0
1301    15.6
1302    15.6
Name: screen_size, Length: 1303, dtype: float64
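.astype(float) raises an error if any value cannot be parsed. When that is a concern, pd.to_numeric with errors="coerce" turns unparseable entries into NaN instead. A minimal sketch on made-up values (the 'unknown' entry is an assumption for illustration and does not appear in the laptops data):

```python
import pandas as pd

# Made-up values; the last one cannot be converted to a number.
sizes = pd.Series(['13.3"', '15.6"', 'unknown'])

# Strip the inch mark, then convert; bad values become NaN rather than raising.
cleaned = pd.to_numeric(sizes.str.strip('"'), errors="coerce")
print(cleaned)
print(cleaned.isna().sum(), "value(s) could not be converted")
```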

The cell below renames the column screen_size to screen_size_inches with the rename() method. The dictionary {"screen_size":"screen_size_inches"} maps the old label to the new one, axis=1 applies the mapping to the column labels rather than the row labels (axis=0), and inplace=True modifies laptops directly instead of returning a new DataFrame.

In [10]:
laptops.rename({"screen_size":"screen_size_inches"},axis=1,inplace=True)
laptops

,manufacturer,model_name,category,screen_size_inches,screen,cpu,ram,storage,gpu,os,os_version,weight,price_euros
0,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,133969
1,Apple,Macbook Air,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,89894
2,HP,250 G6,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,57500
3,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,,1.83kg,253745
4,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,,1.37kg,180360
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1298,Lenovo,Yoga 500-14ISK,2 in 1 Convertible,14.0,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows,10,1.8kg,63800
1299,Lenovo,Yoga 900-13ISK,2 in 1 Convertible,13.3,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows,10,1.3kg,149900
1300,Lenovo,IdeaPad 100S-14IBR,Notebook,14.0,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows,10,1.5kg,22900
1301,HP,15-AC110nv (i7-6500U/6GB/1TB/Radeon,Notebook,15.6,1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows,10,2.19kg,76400
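rename() also accepts a columns= keyword, which avoids the axis argument, and without inplace=True it returns a new DataFrame and leaves the original untouched. A short sketch on a toy frame with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({"screen_size": [13.3, 15.6]})  # made-up values

# columns= targets the column labels directly; the result is a new DataFrame.
renamed = toy.rename(columns={"screen_size": "screen_size_inches"})

print(list(toy.columns))      # original labels are unchanged
print(list(renamed.columns))  # new labels
```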


In [11]:
# strip the trailing "GB" unit, then convert the remaining digits to integers
laptops["ram"]=laptops["ram"].str.strip('GB').astype(int)
laptops

,manufacturer,model_name,category,screen_size_inches,screen,cpu,ram,storage,gpu,os,os_version,weight,price_euros
0,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,133969
1,Apple,Macbook Air,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,89894
2,HP,250 G6,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,57500
3,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16,512GB SSD,AMD Radeon Pro 455,macOS,,1.83kg,253745
4,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8,256GB SSD,Intel Iris Plus Graphics 650,macOS,,1.37kg,180360
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1298,Lenovo,Yoga 500-14ISK,2 in 1 Convertible,14.0,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4,128GB SSD,Intel HD Graphics 520,Windows,10,1.8kg,63800
1299,Lenovo,Yoga 900-13ISK,2 in 1 Convertible,13.3,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16,512GB SSD,Intel HD Graphics 520,Windows,10,1.3kg,149900
1300,Lenovo,IdeaPad 100S-14IBR,Notebook,14.0,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2,64GB Flash Storage,Intel HD Graphics,Windows,10,1.5kg,22900
1301,HP,15-AC110nv (i7-6500U/6GB/1TB/Radeon,Notebook,15.6,1366x768,Intel Core i7 6500U 2.5GHz,6,1TB HDD,AMD Radeon R5 M330,Windows,10,2.19kg,76400


In [12]:
laptops["gpu"]

0       Intel Iris Plus Graphics 640
1             Intel HD Graphics 6000
2              Intel HD Graphics 620
3                 AMD Radeon Pro 455
4       Intel Iris Plus Graphics 650
                    ...             
1298           Intel HD Graphics 520
1299           Intel HD Graphics 520
1300               Intel HD Graphics
1301              AMD Radeon R5 M330
1302               Intel HD Graphics
Name: gpu, Length: 1303, dtype: object

The cell below splits each value in the gpu column on whitespace with .str.split(). The n=1 parameter limits the operation to a single split, so each string yields at most two parts: the first word and everything after it. The expand=True parameter returns the parts as separate columns of a DataFrame instead of a Series of lists.

In [13]:
gpu_split=laptops["gpu"].str.split(n=1, expand = True)
gpu_split

,0,1
0,Intel,Iris Plus Graphics 640
1,Intel,HD Graphics 6000
2,Intel,HD Graphics 620
3,AMD,Radeon Pro 455
4,Intel,Iris Plus Graphics 650
...,...,...
1298,Intel,HD Graphics 520
1299,Intel,HD Graphics 520
1300,Intel,HD Graphics
1301,AMD,Radeon R5 M330
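To see how n=1 and expand=True behave, including the case where a value has nothing after the first word, here is a minimal sketch on made-up strings. The column names gpu_manufacturer and gpu_model are chosen for illustration.

```python
import pandas as pd

# Made-up values; the last one has no second part.
gpus = pd.Series(["Intel HD Graphics 620", "AMD Radeon Pro 455", "Nvidia"])

# Split once on whitespace and spread the two parts over two columns.
parts = gpus.str.split(n=1, expand=True)
parts.columns = ["gpu_manufacturer", "gpu_model"]
print(parts)

# Rows without a second part get a missing value in the second column.
print(parts["gpu_model"].isna())
```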


In [14]:
gpu_split[0] # select the column labeled 0, which holds the first word of each value

0       Intel
1       Intel
2       Intel
3         AMD
4       Intel
        ...  
1298    Intel
1299    Intel
1300    Intel
1301      AMD
1302    Intel
Name: 0, Length: 1303, dtype: object

In [15]:
laptops["gpu_manufacturer"] = gpu_split[0] # create a new column named gpu_manufacturer

### value_counts()
The .value_counts() method counts how often each unique value occurs in a Series (a column of a DataFrame). It returns a new Series whose index holds the unique values and whose values are the counts, sorted from most to least frequent.

In [16]:
laptops["gpu_manufacturer"].value_counts()

Intel     722
Nvidia    400
AMD       180
ARM         1
Name: gpu_manufacturer, dtype: int64
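Two optional parameters are often useful here: dropna=False also counts missing values, and normalize=True returns proportions instead of counts. A sketch on a made-up Series:

```python
import pandas as pd

# Made-up values with one missing entry.
s = pd.Series(["Intel", "Nvidia", "Intel", None, "AMD", "Intel"])

counts = s.value_counts()                    # missing values are ignored by default
with_missing = s.value_counts(dropna=False)  # missing values get their own row
shares = s.value_counts(normalize=True)      # proportions of the non-missing values

print(counts)
print(with_missing)
print(shares)
```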

### unique()
The .unique() method returns the distinct values of a Series (a column of a DataFrame) as an array, in order of first appearance.

In [17]:
laptops["os"].unique()

array(['macOS', 'No OS', 'Windows', 'Mac OS', 'Linux', 'Android',
       'Chrome OS'], dtype=object)

The output above lists both 'macOS' and 'Mac OS', two spellings of the same operating system. The cell below merges them: the boolean condition laptops["os"] == "Mac OS" selects the matching rows, and .loc assigns "macOS" to the os column of those rows.

In [18]:
laptops.loc[laptops["os"] == "Mac OS", "os"] = "macOS"

In [19]:
laptops["os"].unique()

array(['macOS', 'No OS', 'Windows', 'Linux', 'Android', 'Chrome OS'],
      dtype=object)
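For a simple one-to-one substitution like this, Series.replace with a mapping is an alternative to boolean indexing. The sketch below runs both approaches on made-up values to show that they agree; it is an illustration, not a change to the cell above.

```python
import pandas as pd

values = ["macOS", "Mac OS", "Windows", "Mac OS"]  # made-up values

# Boolean indexing with .loc, as in the cell above.
toy = pd.DataFrame({"os": values})
toy.loc[toy["os"] == "Mac OS", "os"] = "macOS"

# Series.replace with a mapping gives the same result.
alt = pd.Series(values).replace({"Mac OS": "macOS"})

print(toy["os"].tolist())
print(alt.tolist())
```

The .loc form generalises better when the condition depends on a different column than the one being updated.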

In [20]:
# count the null values in each column of laptops
laptops.isnull().sum()

manufacturer            0
model_name              0
category                0
screen_size_inches      0
screen                  0
cpu                     0
ram                     0
storage                 0
gpu                     0
os                      0
os_version            170
weight                  0
price_euros             0
gpu_manufacturer        0
dtype: int64

In [21]:
# show the rows of laptops where os_version is null
laptops[laptops["os_version"].isnull()]

,manufacturer,model_name,category,screen_size_inches,screen,cpu,ram,storage,gpu,os,os_version,weight,price_euros,gpu_manufacturer
0,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,133969,Intel
1,Apple,Macbook Air,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,89894,Intel
2,HP,250 G6,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,57500,Intel
3,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16,512GB SSD,AMD Radeon Pro 455,macOS,,1.83kg,253745,AMD
4,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8,256GB SSD,Intel Iris Plus Graphics 650,macOS,,1.37kg,180360,Intel
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1267,Dell,Inspiron 3567,Notebook,15.6,1366x768,Intel Core i7 7500U 2.7GHz,8,1TB HDD,AMD Radeon R5 M430,Linux,,2.3kg,80599,AMD
1277,Acer,Aspire ES1-531,Notebook,15.6,1366x768,Intel Celeron Dual Core N3060 1.6GHz,4,500GB HDD,Intel HD Graphics 400,Linux,,2.4kg,28900,Intel
1281,Dell,Inspiron 3567,Notebook,15.6,1366x768,Intel Core i7 7500U 2.7GHz,8,1TB HDD,AMD Radeon R5 M430,Linux,,2.3kg,80599,AMD
1291,Acer,Aspire ES1-531,Notebook,15.6,1366x768,Intel Celeron Dual Core N3060 1.6GHz,4,500GB HDD,Intel HD Graphics 400,Linux,,2.4kg,28900,Intel


The cell below fills in os_version for laptops that ship without an operating system. The boolean condition laptops["os"] == "No OS" selects those rows, and .loc assigns "No Version" to their os_version column.

In [22]:
laptops.loc[laptops["os"] == "No OS","os_version"]="No Version"

In [23]:
laptops.isnull().sum()

manufacturer            0
model_name              0
category                0
screen_size_inches      0
screen                  0
cpu                     0
ram                     0
storage                 0
gpu                     0
os                      0
os_version            104
weight                  0
price_euros             0
gpu_manufacturer        0
dtype: int64

104 null values remain in os_version. The cell below assigns "Version Unknown" to os_version in every row where it is still null, selected with the condition laptops["os_version"].isnull().

In [24]:
laptops.loc[laptops["os_version"].isnull(),"os_version"] = "Version Unknown"

In [25]:
laptops.isnull().sum()

manufacturer          0
model_name            0
category              0
screen_size_inches    0
screen                0
cpu                   0
ram                   0
storage               0
gpu                   0
os                    0
os_version            0
weight                0
price_euros           0
gpu_manufacturer      0
dtype: int64
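The two-step fill (a specific label where the cause is known, then a generic label for whatever is left) can be sketched on a toy frame. The values are made up, and fillna is used for the second step as an equivalent of the .loc assignment above.

```python
import numpy as np
import pandas as pd

# Made-up rows: one without an OS, one with an unknown version, one complete.
toy = pd.DataFrame({
    "os": ["No OS", "macOS", "Windows"],
    "os_version": [np.nan, np.nan, "10"],
})
nulls_before = int(toy["os_version"].isnull().sum())

# Step 1: rows without an operating system get an explicit label.
toy.loc[toy["os"] == "No OS", "os_version"] = "No Version"

# Step 2: every remaining null gets a generic label.
toy["os_version"] = toy["os_version"].fillna("Version Unknown")

nulls_after = int(toy["os_version"].isnull().sum())
print(nulls_before, "->", nulls_after)
print(toy)
```

The order matters: running the generic fill first would leave nothing for the more specific rule to act on.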

The line of code given below exports the "laptops" DataFrame to a CSV (Comma-Separated Values) file named "laptops_cleaned.csv". The index=False parameter specifies that the index labels should not be included in the exported CSV file.

In [26]:
laptops.to_csv("laptops_cleaned.csv",index = False)
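The effect of index=False is easiest to see by comparing the two outputs. In this sketch, to_csv is called without a path so that it returns the CSV text as a string instead of writing a file; the frame is made up.

```python
import io
import pandas as pd

toy = pd.DataFrame({"ram": [8, 16]}, index=[10, 11])  # made-up values

with_index = toy.to_csv()                # index is written as an unnamed first column
without_index = toy.to_csv(index=False)  # only the data columns are written

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the index-free text back gives the original column and a fresh 0..n-1 index.
restored = pd.read_csv(io.StringIO(without_index))
print(restored)
```

Writing the index and then reading the file back without index_col produces an extra column named "Unnamed: 0", which is a common source of confusion.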

## American Gallery

In [27]:
american_gallery = pd.read_csv("AmericanGallery.csv")

In [28]:
american_gallery.head()

,Title,Artist,Nationality,BeginDate,EndDate,Gender,Date,Department
0,Dress MacLeod from Tartan Sets,Sarah Charlesworth,(American),-1947.0,-2013.0,(Female),1986,Prints & Illustrated Books
1,Duplicate of plate from folio 11 verso (supple...,Pablo Palazuelo,(Spanish),-1916.0,-2007.0,(Male),1978,Prints & Illustrated Books
2,Tailpiece (page 55) from SAGESSE,Maurice Denis,(French),-1870.0,-1943.0,(Male),1889-1911,Prints & Illustrated Books
3,Headpiece (page 129) from LIVRET DE FOLASTRIES...,Aristide Maillol,(French),-1861.0,-1944.0,(Male),1927-1940,Prints & Illustrated Books
4,97 rue du Bac,Eugène Atget,(French),-1857.0,-1927.0,(Male),1903,Photography


In [29]:
american_gallery.tail()

,Title,Artist,Nationality,BeginDate,EndDate,Gender,Date,Department
16724,Oval with Points,Henry Moore,(British),-1898.0,-1986.0,(Male),1968-1969,Painting & Sculpture
16725,"Cementerio de la Ciudad Abierta, Ritoque, Chile",Juan Baixas,(Chilean),-1942.0,,(Male),1975,Architecture & Design
16726,The Catboat,Edward Hopper,(American),-1882.0,-1967.0,(Male),1922,Prints & Illustrated Books
16727,Dognat' i peregnat' v tekhniko-ekonomicheskom ...,Unknown,(),,,(),1931,Prints & Illustrated Books
16728,Plate (page 11) from The Dive,Alex Katz,(American),-1927.0,,(Male),2011,Prints & Illustrated Books


In [30]:
american_gallery.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16729 entries, 0 to 16728
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Title        16728 non-null  object 
 1   Artist       16729 non-null  object 
 2   Nationality  16729 non-null  object 
 3   BeginDate    15787 non-null  float64
 4   EndDate      10475 non-null  float64
 5   Gender       16729 non-null  object 
 6   Date         16729 non-null  object 
 7   Department   16729 non-null  object 
dtypes: float64(2), object(6)
memory usage: 1.0+ MB


In [31]:
american_gallery.dtypes

Title           object
Artist          object
Nationality     object
BeginDate      float64
EndDate        float64
Gender          object
Date            object
Department      object
dtype: object

In [32]:
american_gallery.isnull().sum()

Title             1
Artist            0
Nationality       0
BeginDate       942
EndDate        6254
Gender            0
Date              0
Department        0
dtype: int64

In [33]:
american_gallery.columns

Index(['Title', 'Artist', 'Nationality', 'BeginDate', 'EndDate', 'Gender',
       'Date', 'Department'],
      dtype='object')

In [34]:
american_gallery["Nationality"] = american_gallery["Nationality"].str.strip("()")
american_gallery["Nationality"]

0        American
1         Spanish
2          French
3          French
4          French
           ...   
16724     British
16725     Chilean
16726    American
16727            
16728    American
Name: Nationality, Length: 16729, dtype: object

In [35]:
# convert the years to text so that the leading "-" can be stripped; NaN becomes the string "nan"
american_gallery["BeginDate"] = american_gallery["BeginDate"].astype(str)


In [36]:
# remove the leading minus sign, e.g. "-1947.0" becomes "1947.0"
american_gallery["BeginDate"] = american_gallery["BeginDate"].str.strip("-")


In [37]:
american_gallery["BeginDate"] = american_gallery["BeginDate"].astype(float)  # "1947.0" -> 1947.0, "nan" -> NaN
american_gallery["BeginDate"] = american_gallery["BeginDate"].replace([np.inf, -np.inf], np.nan)  # guard against infinities before the integer cast
american_gallery["BeginDate"] = american_gallery["BeginDate"].fillna(0)  # missing years become the placeholder 0
american_gallery["BeginDate"] = american_gallery["BeginDate"].astype(int)
american_gallery["BeginDate"]

0        1947
1        1916
2        1870
3        1861
4        1857
         ... 
16724    1898
16725    1942
16726    1882
16727       0
16728    1927
Name: BeginDate, Length: 16729, dtype: int32

In [38]:
# same steps as for BeginDate: text -> strip "-" -> float -> fill missing with 0 -> int
american_gallery["EndDate"] = american_gallery["EndDate"].astype(str)
american_gallery["EndDate"] = american_gallery["EndDate"].str.strip("-")
american_gallery["EndDate"] = american_gallery["EndDate"].astype(float)
american_gallery["EndDate"] = american_gallery["EndDate"].replace([np.inf, -np.inf], np.nan)
american_gallery["EndDate"] = american_gallery["EndDate"].fillna(0)
american_gallery["EndDate"] = american_gallery["EndDate"].astype(int)
american_gallery["EndDate"]

0        2013
1        2007
2        1943
3        1944
4        1927
         ... 
16724    1986
16725       0
16726    1967
16727       0
16728       0
Name: EndDate, Length: 16729, dtype: int32
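The cells above mark missing years with 0, which can later be mistaken for a real year. One alternative, shown here only as a sketch on made-up values and not as the notebook's method, is to take the absolute value directly and use the nullable integer dtype "Int64", which keeps missing entries as missing values.

```python
import numpy as np
import pandas as pd

# Made-up values shaped like the raw BeginDate column.
dates = pd.Series([-1947.0, -1916.0, np.nan])

# abs() drops the sign without a round trip through strings;
# "Int64" (capital I) is the pandas nullable integer dtype.
years = dates.abs().astype("Int64")

print(years)
print(int(years.isna().sum()), "missing year(s) preserved")
```

Whether a 0 placeholder or a missing value is the better choice depends on how the cleaned file is used downstream.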

In [39]:
american_gallery["Gender"] = american_gallery["Gender"].str.strip("()")
american_gallery["Gender"]

0        Female
1          Male
2          Male
3          Male
4          Male
          ...  
16724      Male
16725      Male
16726      Male
16727          
16728      Male
Name: Gender, Length: 16729, dtype: object

In [40]:
american_gallery["Date"] = american_gallery["Date"].str.strip("()").str.strip("-")

In [41]:
# remove the letters s, S, c, C and the characters . ( ) ' and space from the dates
american_gallery["Date"] = american_gallery["Date"].str.replace("[sScC.()' ]", "", regex=True)


In [42]:
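# drop the hyphen in ranges such as 1889-1911, then keep the first four characters (the starting year)
american_gallery['Date'] = american_gallery['Date'].str.replace('-', '', regex=False)
american_gallery['Date'] = american_gallery['Date'].str[:4]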
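Removing characters and slicing works when every value starts with a year once the noise is gone. A more direct option is to extract the first run of four digits with .str.extract. The sketch below uses made-up strings in the style of the Date column (only "1986" and "1889-1911" appear in the output shown earlier; the other values are assumptions).

```python
import pandas as pd

# Made-up examples of the formats a free-text date column might contain.
dates = pd.Series(["1986", "1889-1911", "c. 1917", "(1960s)", "Unknown"])

# Capture the first four consecutive digits; values without one become missing.
first_year = dates.str.extract(r"(\d{4})", expand=False)
print(first_year)
```

Values with no four-digit year come back as missing instead of a truncated fragment of text, which makes them easy to find with .isna().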

In [43]:
american_gallery

,Title,Artist,Nationality,BeginDate,EndDate,Gender,Date,Department
0,Dress MacLeod from Tartan Sets,Sarah Charlesworth,American,1947,2013,Female,1986,Prints & Illustrated Books
1,Duplicate of plate from folio 11 verso (supple...,Pablo Palazuelo,Spanish,1916,2007,Male,1978,Prints & Illustrated Books
2,Tailpiece (page 55) from SAGESSE,Maurice Denis,French,1870,1943,Male,1889,Prints & Illustrated Books
3,Headpiece (page 129) from LIVRET DE FOLASTRIES...,Aristide Maillol,French,1861,1944,Male,1927,Prints & Illustrated Books
4,97 rue du Bac,Eugène Atget,French,1857,1927,Male,1903,Photography
...,...,...,...,...,...,...,...,...
16724,Oval with Points,Henry Moore,British,1898,1986,Male,1968,Painting & Sculpture
16725,"Cementerio de la Ciudad Abierta, Ritoque, Chile",Juan Baixas,Chilean,1942,0,Male,1975,Architecture & Design
16726,The Catboat,Edward Hopper,American,1882,1967,Male,1922,Prints & Illustrated Books
16727,Dognat' i peregnat' v tekhniko-ekonomicheskom ...,Unknown,,0,0,,1931,Prints & Illustrated Books


In [44]:
american_gallery.to_csv("americangallery_cleaned.csv",index = False)