# __Example: Data Cleaning__

The example below uses a small dataset of just over 1,300 laptops with an initial 13 columns. In cleaning the dataset, I use the following concepts and tools:
* Functions
* Column header replacement
* The Series.unique() method
* Dropping missing values or identifying an alternative in dealing with them
* Mapping dictionary
* Regular expressions
* Boolean arrays

In the end, I write the cleaned data to a new CSV file so it can be used for analysis.

At the moment, this is not a completely clean dataset. Ultimately, the depth of cleaning will be determined by the questions we want to answer and, consequently, by the data needed to answer them. Because cleaning is often where data scientists spend most of their time, it makes sense to limit it to the parts of the dataset that are actually needed.

In [2]:
# Import the pandas library with an alias, plus the re module for regular expressions
import pandas as pd
import re

In [3]:
# Import the CSV file. The default UTF-8 encoding raised an error, so I use Latin-1 instead
laptops_df = pd.read_csv('laptops.csv', encoding = 'Latin-1')
laptops_df.info()
laptops_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
Manufacturer                1303 non-null object
Model Name                  1303 non-null object
Category                    1303 non-null object
Screen Size                 1303 non-null object
Screen                      1303 non-null object
CPU                         1303 non-null object
RAM                         1303 non-null object
 Storage                    1303 non-null object
GPU                         1303 non-null object
Operating System            1303 non-null object
Operating System Version    1133 non-null object
Weight                      1303 non-null object
Price (Euros)               1303 non-null object
dtypes: object(13)
memory usage: 132.4+ KB


Unnamed: 0,Manufacturer,Model Name,Category,Screen Size,Screen,CPU,RAM,Storage,GPU,Operating System,Operating System Version,Weight,Price (Euros)
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,"1339,69"
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,"898,94"
2,HP,250 G6,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,"575,00"
3,Apple,MacBook Pro,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,,1.83kg,"2537,45"
4,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,,1.37kg,"1803,60"
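
The encoding choice above came from trial and error. As a minimal sketch of the same idea, the loop below tries a list of candidate encodings in order and keeps the first one that parses. The sample bytes, the `read_with_fallback` helper, and the candidate list are all made up for illustration and are not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical two-line file, encoded as Latin-1 so that it is not valid UTF-8.
raw_bytes = "Manufacturer,Note\nAcer,Société\n".encode("latin-1")


def read_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Try each encoding in turn and return the first successful parse."""
    for encoding in encodings:
        try:
            return pd.read_csv(io.BytesIO(data), encoding=encoding), encoding
        except UnicodeDecodeError:
            # This encoding cannot decode the bytes; move on to the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


demo_df, used_encoding = read_with_fallback(raw_bytes)
print(used_encoding)
print(demo_df)
```

Latin-1 can decode any byte sequence, so a successful read does not prove the encoding is correct. It is still worth checking a few non-ASCII values by eye.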


In [4]:
# Define a column-header cleaning function that strips whitespace, shortens 'Operating System' to 'os', replaces spaces with underscores, removes parentheses, and lowercases the result
def clean_col(column):
    column = column.strip()
    column = column.replace("Operating System", "os")
    column = column.replace(" ", "_")
    column = column.replace("(","")
    column = column.replace(")","")
    column = column.lower()
    return column
# Run the header cleaning function
new_columns = []
for c in laptops_df.columns:
    clean_c = clean_col(c)
    new_columns.append(clean_c)
# Save the cleaned headers back to the original dataframe
laptops_df.columns = new_columns

laptops_df.head()

Unnamed: 0,manufacturer,model_name,category,screen_size,screen,cpu,ram,storage,gpu,os,os_version,weight,price_euros
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,"1339,69"
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,"898,94"
2,HP,250 G6,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,"575,00"
3,Apple,MacBook Pro,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,,1.83kg,"2537,45"
4,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,,1.37kg,"1803,60"
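
The same header cleanup can also be written without an explicit loop, because a DataFrame's column Index exposes the `.str` accessor. This is only a sketch of an alternative. The toy headers below are assumed samples shaped like the ones in laptops.csv.

```python
import pandas as pd

# Toy frame with headers shaped like those in laptops.csv (assumed samples).
demo_df = pd.DataFrame(
    columns=["Manufacturer", " Storage", "Operating System Version", "Price (Euros)"]
)

# Chain the same steps as clean_col(), applied to every header at once.
demo_df.columns = (
    demo_df.columns
    .str.strip()
    .str.replace("Operating System", "os", regex=False)
    .str.replace(" ", "_", regex=False)
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .str.lower()
)

print(list(demo_df.columns))
# ['manufacturer', 'storage', 'os_version', 'price_euros']
```

Passing `regex=False` keeps the parentheses from being interpreted as regular-expression syntax.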


In [5]:
# Check the ram column for unique values to determine what will need to be removed to make it a numeric column
unique_ram = laptops_df['ram'].unique()
print(unique_ram)

['8GB' '16GB' '4GB' '2GB' '12GB' '6GB' '32GB' '24GB' '64GB']


In [6]:
# Make the values in ram numeric
laptops_df['ram'] = laptops_df['ram'].str.replace('GB', '').astype(int)

# Rename the ram column. inplace=True modifies the dataframe directly instead of returning a renamed copy
laptops_df.rename({'ram':'ram_gb'}, axis = 1, inplace = True)

unique_ram = laptops_df['ram_gb'].unique()

# Check the data types to verify my work
print(laptops_df.dtypes)

manufacturer    object
model_name      object
category        object
screen_size     object
screen          object
cpu             object
ram_gb           int32
storage         object
gpu             object
os              object
os_version      object
weight          object
price_euros     object
dtype: object
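
Before stripping a unit label and casting, it can help to confirm that every value really follows the expected pattern, since `astype(int)` raises on anything unexpected. A minimal sketch, assuming pandas 1.1 or later (for `Series.str.fullmatch`) and a handful of sample values rather than the full column:

```python
import pandas as pd

# Assumed sample values in the same shape as the ram column.
ram = pd.Series(["8GB", "16GB", "4GB", "64GB"], name="ram")

# Every value should be one or more digits followed by 'GB'.
matches_pattern = ram.str.fullmatch(r"\d+GB")
assert matches_pattern.all(), ram[~matches_pattern].unique()

# Safe to strip the unit and convert now that the pattern is confirmed.
ram_gb = ram.str.replace("GB", "", regex=False).astype(int).rename("ram_gb")
print(ram_gb.tolist())  # [8, 16, 4, 64]
```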


In [7]:
# Split the manufacturers of the GPUs and CPUs off from their full description strings, and provide a count
laptops_df["gpu_manufacturer"] = (laptops_df["gpu"]
                                       .str.split()
                                       .str[0]
                              )

laptops_df['cpu_manufacturer'] = (laptops_df['cpu']
                             .str.split()
                             .str[0]
                               )

cpu_manufacturer_counts = laptops_df['cpu_manufacturer'].value_counts()
print(cpu_manufacturer_counts)

Intel      1240
AMD          62
Samsung       1
Name: cpu_manufacturer, dtype: int64
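
If the rest of the description is also of interest, `str.split` can return both pieces at once. The sketch below is an optional variation on the cell above, using three assumed sample strings. The `gpu_model` column name is hypothetical and does not appear in the original notebook.

```python
import pandas as pd

# Assumed sample values in the same shape as the gpu column.
gpu = pd.Series([
    "Intel Iris Plus Graphics 640",
    "AMD Radeon Pro 455",
    "Nvidia GeForce MX150",
])

# n=1 splits on the first run of whitespace only; expand=True returns a DataFrame.
parts = gpu.str.split(n=1, expand=True)
parts.columns = ["gpu_manufacturer", "gpu_model"]  # 'gpu_model' is a made-up name

print(parts)
```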


In [8]:
# Create a dictionary to use with the Series.map() method to standardize the operating system labels.
# Note: any label that is missing from the dictionary is mapped to NaN
mapping_dict = {
    'Android': 'Android',
    'Chrome OS': 'Chrome OS',
    'Linux': 'Linux',
    'Mac OS': 'macOS',
    'No OS': 'No OS',
    'Windows': 'Windows',
    'macOS': 'macOS'
}

s = laptops_df['os']
laptops_df['os'] = s.map(mapping_dict)
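
One thing to watch with `Series.map()` and a dictionary: any value that is not a key in the dictionary becomes NaN. The sketch below shows one way to spot such values and, if wanted, fall back to the original label. The 'FreeDOS' label is invented for the demonstration and is not claimed to be in the dataset.

```python
import pandas as pd

# 'FreeDOS' is a made-up label that is deliberately absent from the mapping.
os_labels = pd.Series(["macOS", "Mac OS", "Windows", "FreeDOS"])
mapping_dict = {"Mac OS": "macOS", "macOS": "macOS", "Windows": "Windows"}

mapped = os_labels.map(mapping_dict)

# Labels that the dictionary did not cover end up as NaN after mapping.
unmapped = os_labels[mapped.isnull()].unique()
print(unmapped)  # ['FreeDOS']

# Optional: keep the original label wherever the mapping had no entry.
mapped_with_fallback = mapped.fillna(os_labels)
print(mapped_with_fallback.tolist())  # ['macOS', 'macOS', 'Windows', 'FreeDOS']
```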


In [9]:
# Practice dropping rows and columns (respectively)

# laptops_no_null_rows = laptops_df.dropna()
# laptops_no_null_cols = laptops_df.dropna(axis =1)
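
To make the difference between the two commented-out calls concrete, here is a small sketch on a toy frame (assumed data, not the laptops dataset). Dropping rows keeps only complete records, while dropping columns removes any column that contains at least one missing value.

```python
import numpy as np
import pandas as pd

# Toy frame: 'os' is complete, 'os_version' has two missing values.
demo_df = pd.DataFrame({
    "os": ["macOS", "Windows", "Linux"],
    "os_version": [np.nan, "10", np.nan],
})

no_null_rows = demo_df.dropna()        # drops rows 0 and 2
no_null_cols = demo_df.dropna(axis=1)  # drops the 'os_version' column

print(no_null_rows.shape)  # (1, 2)
print(no_null_cols.shape)  # (3, 1)
```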

In [10]:
# A better way to take care of missing values: examine the data and see whether the missing values can be inferred
# This shows, for each os category, the number of rows that are missing a value in the 'os_version' column
value_counts_os = laptops_df.loc[laptops_df["os_version"].isnull(), "os"].value_counts()
print(value_counts_os)

No OS        66
Linux        62
Chrome OS    27
macOS        13
Android       2
Name: os, dtype: int64


In [12]:
# Seeing in the previous cell that there are multiple laptops with 'No OS', and assuming that all our laptops have come out since the 
#inception of OS X, we will enter 'X' as the os version of all listed as 'macOS', and 'Version Unknown' when listed as 'No OS' 
laptops_df.loc[laptops_df['os'] == "macOS", "os_version"] = "X"

laptops_df.loc[laptops_df['os'] == 'No OS', 'os_version'] = 'Version Unknown'

# Count and display the 'os' categories for rows that still have no value in the 'os_version' column
value_counts_after = laptops_df.loc[laptops_df['os_version'].isnull(), 'os'].value_counts()
print(value_counts_after)

Linux        62
Chrome OS    27
Android       2
Name: os, dtype: int64
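
The assignments above rely on boolean arrays: a comparison such as `laptops_df['os'] == "macOS"` produces one True/False per row, and `.loc` writes only where it is True. Boolean arrays can also be combined with `&`. The sketch below, on assumed toy data, fills a value only where the version is actually missing, so an existing entry is not overwritten. The 'Sierra' entry is invented for the example.

```python
import numpy as np
import pandas as pd

# Toy data; 'Sierra' is a made-up existing value that should be preserved.
demo_df = pd.DataFrame({
    "os": ["macOS", "macOS", "No OS", "Linux"],
    "os_version": [np.nan, "Sierra", np.nan, np.nan],
})

# One boolean per row: True where os_version is missing.
is_missing = demo_df["os_version"].isnull()

# Combine conditions with & (each side wrapped in parentheses).
demo_df.loc[(demo_df["os"] == "macOS") & is_missing, "os_version"] = "X"
demo_df.loc[(demo_df["os"] == "No OS") & is_missing, "os_version"] = "Version Unknown"

print(demo_df)
```

After this runs, row 1 keeps 'Sierra' and the Linux row is still missing, because neither mask selects it.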


In [13]:
# Convert the price_euros column to a numeric dtype. As I'm in the U.S., I'll use a decimal point instead of a decimal comma
laptops_df['price_euros'] = laptops_df['price_euros'].str.replace(',', '.').astype(float)

laptops_df.head()

Unnamed: 0,manufacturer,model_name,category,screen_size,screen,cpu,ram_gb,storage,gpu,os,os_version,weight,price_euros,gpu_manufacturer,cpu_manufacturer
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8,128GB SSD,Intel Iris Plus Graphics 640,macOS,X,1.37kg,1339.69,Intel,Intel
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8,128GB Flash Storage,Intel HD Graphics 6000,macOS,X,1.34kg,898.94,Intel,Intel
2,HP,250 G6,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8,256GB SSD,Intel HD Graphics 620,No OS,Version Unknown,1.86kg,575.0,Intel,Intel
3,Apple,MacBook Pro,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16,512GB SSD,AMD Radeon Pro 455,macOS,X,1.83kg,2537.45,AMD,Intel
4,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8,256GB SSD,Intel Iris Plus Graphics 650,macOS,X,1.37kg,1803.6,Intel,Intel
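
`astype(float)` stops with an error at the first value it cannot parse. When a column might contain stray entries, `pd.to_numeric` with `errors="coerce"` is a gentler option: unparseable values become NaN and can then be inspected. A minimal sketch with assumed sample prices, where the 'n/a' entry is invented to trigger the fallback:

```python
import pandas as pd

# Assumed sample prices; 'n/a' is a made-up bad entry.
prices = pd.Series(["1339,69", "575,00", "n/a"])

# Swap the decimal comma for a point, then convert; failures become NaN.
converted = pd.to_numeric(prices.str.replace(",", ".", regex=False), errors="coerce")

# List the original strings that could not be converted.
bad_values = prices[converted.isnull()]
print(converted.tolist())   # [1339.69, 575.0, nan]
print(bad_values.tolist())  # ['n/a']
```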


In [18]:
# Extract the screen resolution from the screen column
screen_resolution = []
for line in laptops_df['screen']:
    # Raw string avoids invalid-escape warnings; matches four digits, 'x', then two or more digits
    s = re.findall(r'\d{4}x\d{2,}', line)
    screen_resolution.append(s)
laptops_df['screen_resolution'] = screen_resolution

#Extract the screen type from the screen column
screen_type = []
for line in laptops_df['screen']:
    # Split at every digit and keep whatever text comes before the first one
    s = re.split(r'\d', line)
    screen_type.append(s[0])
laptops_df['screen_type'] = screen_type
# print(laptops_df['screen_type'])

#Extract whether or not the computer contains a touch screen
touchscreen = []
for line in laptops_df['screen_type']:
    if 'Touchscreen' in line:
        touchscreen.append('Yes')
    else:
        touchscreen.append('No')
laptops_df['touchscreen'] = touchscreen
# 19 is the first touchscreen in the dataset
laptops_df.head(19)

Unnamed: 0,manufacturer,model_name,category,screen_size,screen,cpu,ram_gb,storage,gpu,os,os_version,weight,price_euros,gpu_manufacturer,cpu_manufacturer,screen_resolution,screen_type,touchscreen
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8,128GB SSD,Intel Iris Plus Graphics 640,macOS,X,1.37kg,1339.69,Intel,Intel,[2560x1600],IPS Panel Retina Display,No
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8,128GB Flash Storage,Intel HD Graphics 6000,macOS,X,1.34kg,898.94,Intel,Intel,[1440x900],,No
2,HP,250 G6,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8,256GB SSD,Intel HD Graphics 620,No OS,Version Unknown,1.86kg,575.0,Intel,Intel,[1920x1080],Full HD,No
3,Apple,MacBook Pro,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16,512GB SSD,AMD Radeon Pro 455,macOS,X,1.83kg,2537.45,AMD,Intel,[2880x1800],IPS Panel Retina Display,No
4,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8,256GB SSD,Intel Iris Plus Graphics 650,macOS,X,1.37kg,1803.6,Intel,Intel,[2560x1600],IPS Panel Retina Display,No
5,Acer,Aspire 3,Notebook,"15.6""",1366x768,AMD A9-Series 9420 3GHz,4,500GB HDD,AMD Radeon R5,Windows,10,2.1kg,400.0,AMD,AMD,[1366x768],,No
6,Apple,MacBook Pro,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800,Intel Core i7 2.2GHz,16,256GB Flash Storage,Intel Iris Pro Graphics,macOS,X,2.04kg,2139.97,Intel,Intel,[2880x1800],IPS Panel Retina Display,No
7,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8,256GB Flash Storage,Intel HD Graphics 6000,macOS,X,1.34kg,1158.7,Intel,Intel,[1440x900],,No
8,Asus,ZenBook UX430UN,Ultrabook,"14.0""",Full HD 1920x1080,Intel Core i7 8550U 1.8GHz,16,512GB SSD,Nvidia GeForce MX150,Windows,10,1.3kg,1495.0,Nvidia,Intel,[1920x1080],Full HD,No
9,Acer,Swift 3,Ultrabook,"14.0""",IPS Panel Full HD 1920x1080,Intel Core i5 8250U 1.6GHz,8,256GB SSD,Intel UHD Graphics 620,Windows,10,1.6kg,770.0,Intel,Intel,[1920x1080],IPS Panel Full HD,No
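
The three loops above can also be expressed with pandas' vectorized string methods. The sketch below is an alternative, not the approach used in this notebook. It runs on three assumed sample strings and stores the resolution as a plain string rather than a one-element list.

```python
import pandas as pd

# Assumed sample values in the same shape as the screen column.
screen = pd.Series([
    "IPS Panel Retina Display 2560x1600",
    "1440x900",
    "IPS Panel Full HD / Touchscreen 1920x1080",
])

# Capture '<digits>x<digits>' as a string (NaN if there is no match).
resolution = screen.str.extract(r"(\d+x\d+)", expand=False)

# Remove the trailing resolution to leave only the screen description.
screen_type = screen.str.replace(r"\s*\d+x\d+$", "", regex=True)

# Boolean array first, then translate to the Yes/No labels used above.
touchscreen = screen.str.contains("Touchscreen", regex=False).map({True: "Yes", False: "No"})

print(resolution.tolist())   # ['2560x1600', '1440x900', '1920x1080']
print(screen_type.tolist())  # ['IPS Panel Retina Display', '', 'IPS Panel Full HD / Touchscreen']
print(touchscreen.tolist())  # ['No', 'No', 'Yes']
```

Removing only the trailing resolution, instead of splitting at the first digit, also keeps descriptions that themselves contain a digit intact.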


In [115]:
# Look at unique values to try and determine a pattern for use in regex
laptops_df['cpu'].unique()

# Extract the processor speed from the cpu column.
cpu_speed = []

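for chip in laptops_df['cpu']:
    # Keep the first whitespace-separated token that carries the 'GHz' label.
    # Appending None when nothing matches keeps the list aligned with the dataframe rows.
    speeds = [token for token in re.split(r'\s', chip) if 'GHz' in token]
    cpu_speed.append(speeds[0] if speeds else None)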
laptops_df['cpu_speed'] = cpu_speed

# Eliminate the 'GHz' label on all values in the cpu_speed column
laptops_df['cpu_speed'] = laptops_df['cpu_speed'].str.replace('GHz','')

# Modify the cpu_speed column header to record units of cpu_speed
laptops_df.rename({'cpu_speed':'cpu_speed_ghz'}, axis = 1, inplace = True)

#Check work
laptops_df.head()


Unnamed: 0,manufacturer,model_name,category,screen_size,screen,cpu,ram_gb,storage,gpu,os,os_version,weight,price_euros,gpu_manufacturer,cpu_manufacturer,screen_resolution,screen_type,touchscreen,cpu_speed_ghz,cpu_speed_ghz.1
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8,128GB SSD,Intel Iris Plus Graphics 640,macOS,X,1.37kg,1339.69,Intel,Intel,[2560x1600],IPS Panel Retina Display,No,2.3,2.3
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8,128GB Flash Storage,Intel HD Graphics 6000,macOS,X,1.34kg,898.94,Intel,Intel,[1440x900],,No,1.8,1.8
2,HP,250 G6,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8,256GB SSD,Intel HD Graphics 620,No OS,Version Unknown,1.86kg,575.0,Intel,Intel,[1920x1080],Full HD,No,2.5,2.5
3,Apple,MacBook Pro,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16,512GB SSD,AMD Radeon Pro 455,macOS,X,1.83kg,2537.45,AMD,Intel,[2880x1800],IPS Panel Retina Display,No,2.7,2.7
4,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8,256GB SSD,Intel Iris Plus Graphics 650,macOS,X,1.37kg,1803.6,Intel,Intel,[2560x1600],IPS Panel Retina Display,No,3.1,3.1
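
For comparison, the clock speed can be pulled out in one step with `Series.str.extract`, which captures the number immediately in front of 'GHz' wherever it sits in the string. This is a sketch on assumed sample values. It also converts the result to float, which the cell above does not do.

```python
import pandas as pd

# Assumed sample values in the same shape as the cpu column.
cpu = pd.Series([
    "Intel Core i5 2.3GHz",
    "Intel Core i5 7200U 2.5GHz",
    "AMD A9-Series 9420 3GHz",
])

# Capture an integer or decimal number that is directly followed by 'GHz'.
cpu_speed_ghz = cpu.str.extract(r"(\d+(?:\.\d+)?)GHz", expand=False).astype(float)

print(cpu_speed_ghz.tolist())  # [2.3, 2.5, 3.0]
```

Model numbers such as '7200U' or '9420' are skipped because they are not followed by 'GHz'.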


In [None]:
# Convert the values in the weight column to numeric values
laptops_df["weight"] = laptops_df["weight"].str.replace('kgs','').str.replace('kg','').astype(float)
                       
# Rename the weight column to weight_kg
laptops_df.rename(columns = {'weight':'weight_kg'}, inplace = True)
laptops_df.head()

# Use the DataFrame.to_csv() method to save the laptops dataframe to a CSV file laptops_cleaned.csv without index labels
laptops_df.to_csv('laptops_cleaned.csv', index = False)
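
A quick way to confirm that an export like this survives a round trip is to read it back and compare. The sketch below uses a tiny stand-in frame and an in-memory buffer instead of a file on disk. Both are assumptions made to keep the example self-contained.

```python
import io

import pandas as pd

# Small stand-in for the cleaned dataframe (assumed values).
cleaned = pd.DataFrame({"ram_gb": [8, 16], "weight_kg": [1.37, 1.83]})

# Write to an in-memory buffer rather than laptops_cleaned.csv.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Rewind and read the CSV text back in.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises AssertionError if columns, dtypes, or values differ.
pd.testing.assert_frame_equal(cleaned, reloaded)
print(reloaded.dtypes)
```

Because `index=False` was used on export, the reloaded frame gets a fresh default index and no extra 'Unnamed: 0' column.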
