# Cleaning Data & Imputation

In most cases, the data we work with has missing values or is not in an ideal format, and it is up to us to modify it so that it fits our use case. In this notebook we will fix the errors identified earlier and explore the concept of imputation.

### Import Basic Packages & Data

In [2]:
#Basics
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

We will continue working with the student grades dataset.

In [21]:
# Import data to a pandas dataframe
df_grades = pd.read_csv('student grades.csv')
df_grades.head()

Unnamed: 0,student_ID,first_name,last_name,grade_avg,faculty,tuition,OH_participated,classes_skipped
0,20123456.0,John,Park,B,Arts,44191.0,0,5.0
1,20123457.0,Alex,Great,B,Science,32245.0,"""4""",10.0
2,20123458.0,Sebastian,Taylor,B,Business,42679.0,6,7.0
3,20123459.0,Michael,Bay,A,Math,46478.0,15,2.0
4,20123460.0,Scott,Foster,A,Engineering,36784.0,5,8.0


### Dealing with identified errors

In the previous lesson, we discovered various errors and will now explore ways to deal with them.

Let's first look at the number of null values in our dataset, at both the column level and the row level.

In [3]:
# Identify missing values (NULLS) in the dataset
df_grades.isna().sum()

StudentID         1
FirstName         0
LastName          1
GradeAvg          0
Faculty           0
Tuition           4
OHParticipated    2
ClassesSkipped    2
dtype: int64
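
The cell above counts nulls per column. As a minimal sketch of the row-level count, the same call with `axis=1` can be used. The frame `toy` and its values are made up for illustration and are not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_grades (illustrative values only)
toy = pd.DataFrame({
    "FirstName": ["John", "Alex", "Joseph"],
    "LastName": ["Park", "Great", None],
    "Tuition": [44191.0, 32245.0, np.nan],
})

# Nulls per column (default axis=0)
nulls_per_column = toy.isna().sum()

# Nulls per row (axis=1), useful for spotting rows that are mostly empty
nulls_per_row = toy.isna().sum(axis=1)

print(nulls_per_column)
print(nulls_per_row)
```

A row with a high count here is a candidate for closer inspection before deciding whether to drop or impute it.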

In [4]:
df_grades.tail()

Unnamed: 0,StudentID,FirstName,LastName,GradeAvg,Faculty,Tuition,OHParticipated,ClassesSkipped
26,20123482.0,Joseph,Kim,A,Math,33376.0,12.0,6.0
27,20123483.0,Chris,Dang,F,Business,44737.0,,8.0
28,20123484.0,Robbie,Tee,B,Engineering,49682.0,10.0,6.0
29,20123485.0,Shelly,Yoon,A,Math,33585.0,5.0,10.0
30,,Joseph,,A,English,,2.0,4.0


The simplest way to get rid of null values is the `dropna` method. It drops either every row that contains a null value or every column that contains one. We can confirm that the rows or columns were dropped by looking at the shape of our dataframe.

In [None]:
# drop all rows or columns with nas (not recommended)
df_drop_na = df_grades.dropna()      #Use axis = 1 to drop columns with missing values.
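
To see the effect on the shape, here is a small sketch on a made-up frame (`toy` is hypothetical, not the grades data). It compares dropping rows with dropping columns.

```python
import numpy as np
import pandas as pd

# Toy frame with a single null in the Tuition column (illustrative values)
toy = pd.DataFrame({
    "FirstName": ["John", "Alex", "Joseph"],
    "Tuition": [44191.0, np.nan, 33376.0],
    "ClassesSkipped": [5.0, 10.0, 4.0],
})

rows_dropped = toy.dropna()        # drop every row containing a null
cols_dropped = toy.dropna(axis=1)  # drop every column containing a null

print(toy.shape)           # (3, 3)
print(rows_dropped.shape)  # (2, 3)
print(cols_dropped.shape)  # (3, 2)
```

Note that a single null is enough to remove a whole row or a whole column, which is why this approach is best kept as a last resort.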

We saw in our previous notebook that the last row of our dataset contains many errors, so we can safely decide to drop it.

To drop only the last row of our dataset, we can pass the index returned by the `tail` function to the `drop` function.
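
One possible way to do this, sketched on a made-up frame (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy frame whose last row is the faulty one (illustrative values)
toy = pd.DataFrame({
    "FirstName": ["John", "Alex", "Joseph"],
    "LastName": ["Park", "Great", None],
})

# tail(1) returns the last row; its .index holds that row's label
last_row_index = toy.tail(1).index

# drop returns a new frame without that label; the original is unchanged
toy_trimmed = toy.drop(last_row_index)

print(toy_trimmed)
```

Because `drop` returns a new dataframe by default, the result has to be assigned back to a variable to take effect.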

In [2]:
# dropping last row with the tail function


In the case of the dropped last row, we can safely assume that the row was an error, because it had no student ID or last name associated with it and several other values were missing.

However, in most instances we want to keep our data and drop it only as a last resort. We will explore other methods of cleaning our data using imputation further down in the notebook.

Looking at the output of the `value_counts` function, we saw earlier that one of the faculty values contains a typo. To fix it, we can use the `replace` function.

In [5]:
# Explore what unique values appear in the Faculty column
df_grades['faculty'].value_counts()

Business       9
Engineering    8
Arts           4
Science        4
Math           4
Art$           1
English        1
Name: faculty, dtype: int64
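
The output above shows a single `Art$` entry. A minimal sketch of the fix on a made-up Series follows; it assumes the intended value is `Arts`, which is inferred from the other categories rather than confirmed by the data source.

```python
import pandas as pd

# Toy faculty column containing the typo (illustrative values)
faculty = pd.Series(["Arts", "Art$", "Science", "Arts"], name="faculty")

# Series.replace matches whole values exactly, so "$" is not treated as a regex
faculty_fixed = faculty.replace("Art$", "Arts")

print(faculty_fixed.value_counts())
```

Running `value_counts` again afterwards is a quick way to confirm that the misspelled category has been merged into the correct one.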

In [1]:
# Are there any categorical values which have actually been entered incorrectly? Make them consistent with replace.


Finally, using the `describe` function, we saw that the minimum tuition value was 40 dollars, which doesn't make sense. Luckily, we were able to find out from the external data source that the student's tuition was actually 40000 dollars. We can fix this by correcting the student's tuition value.

In [10]:
# Explore what range of values exist for numerical columns
df_grades.describe()

Unnamed: 0,StudentID,Tuition,ClassesSkipped
count,30.0,27.0,28.0
mean,20123470.0,39727.592593,4.892857
std,8.803408,9749.186961,3.071244
min,20123460.0,40.0,0.0
25%,20123460.0,34898.5,2.75
50%,20123470.0,42679.0,4.5
75%,20123480.0,45734.0,7.25
max,20123480.0,49682.0,10.0
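
The two steps that follow (finding the row, then overwriting the value) could be sketched like this on a made-up frame. `toy` and its values are assumptions; only the 40 and 40000 figures come from the text.

```python
import pandas as pd

# Toy frame with one implausible tuition value (illustrative values)
toy = pd.DataFrame({
    "FirstName": ["John", "Alex", "Scott"],
    "Tuition": [44191.0, 40.0, 36784.0],
})

# Step 1: find the row label(s) where tuition equals the implausible minimum
bad_index = toy.index[toy["Tuition"] == 40]
print(bad_index)

# Step 2: overwrite the value at those labels using .loc[row, column]
toy.loc[bad_index, "Tuition"] = 40000

print(toy)
```

Using `.loc` with both the row label and the column name modifies the dataframe in place and avoids chained-assignment problems.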


In [15]:
#Identifying which index is the student's


In [16]:
# Specify the index in the tuition column to equal 40,000


### Identify Errors for Data Types

When importing data, some values may not be where or how they should be, whether they are in a different format or simply don't belong. For example, you may have noticed when we first imported the dataset that the "OHParticipated" column had quotation marks around some numbers. Because of this, when we use the `info` function, the data type is shown as an object rather than a float.

In [13]:
# Explore what errors exist in the data
df_grades.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 29
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   StudentID       30 non-null     float64
 1   FirstName       30 non-null     object 
 2   LastName        30 non-null     object 
 3   GradeAvg        30 non-null     object 
 4   Faculty         30 non-null     object 
 5   Tuition         27 non-null     float64
 6   OHParticipated  28 non-null     object 
 7   ClassesSkipped  28 non-null     float64
dtypes: float64(3), object(5)
memory usage: 3.2+ KB


To fix this error, we can use the `str.replace` function to remove the quotation marks and then convert the entire column into the appropriate data type. In this case, that is the float data type.
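
As a rough sketch of these two steps, on a made-up Series that mimics how such a column looks after being read from a CSV (the name `oh` and its values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy column: numbers stored as strings, one wrapped in quotation marks
oh = pd.Series(["0", '"4"', "6", np.nan, "15"], name="OHParticipated")

# Remove the literal quotation mark, then convert the whole column to float.
# regex=False treats the pattern as plain text; nulls stay null.
oh_clean = oh.str.replace('"', "", regex=False).astype(float)

print(oh_clean)
print(oh_clean.dtype)
```

Converting to float rather than int keeps the missing values representable as `NaN`.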

In [17]:
# Using the str.replace function to remove the quotation mark


# Once values have been stripped of the quotation mark, we will convert the entire column to a float


### Imputation

At a basic level, we can replace missing values or errors with any value we wish. But we should think carefully about whether the replacement values make sense. For example, a common way of filling in values is to use the average of the column. However, that average often varies with other factors, so it may not be the best way to fill in your data.

Let's explore better ways to fill in our columns. We will start by identifying the null values in each column of our dataframe using the `isna` and `sum` functions.

In [16]:
# Look at the count of null values to identify easily which columns contain null values
df_grades.isna().sum()

StudentID         0
FirstName         0
LastName          0
GradeAvg          0
Faculty           0
Tuition           3
OHParticipated    2
ClassesSkipped    2
dtype: int64

We can see that three columns have a null count greater than 0.

Some null values can be treated as zeros, whilst some may really be missing values.


We have confirmed with staff that OHParticipated and ClassesSkipped both appear as null when their values are zero. The team know the data well enough to verify this. Therefore, we can fill the nulls with zeros using the `fillna` function.
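
A minimal sketch of this on a made-up frame (`toy` is hypothetical; the column names follow the null-count output above):

```python
import numpy as np
import pandas as pd

# Toy frame with nulls that are known to mean zero (illustrative values)
toy = pd.DataFrame({
    "OHParticipated": [0.0, np.nan, 6.0],
    "ClassesSkipped": [5.0, 10.0, np.nan],
})

# fillna returns a new Series, so assign it back to the column
toy["OHParticipated"] = toy["OHParticipated"].fillna(0)
toy["ClassesSkipped"] = toy["ClassesSkipped"].fillna(0)

# Confirm that no nulls remain in either column
print(toy.isna().sum())
```

Filling with zero is only justified here because the nulls are known to stand for zero; it is not a general-purpose default.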

In [None]:
#Replace OH_participated nulls with zeros

#Replace ClassesSkipped nulls with zeros


In [None]:
#Check the presence of null values in all columns once again




Finally, let's look at the null values in the "Tuition" column.

In [18]:
# Identify the null values in "Tuition" column


In this instance, we cannot use zeros to fill in the values, since it wouldn't make sense for students to pay no tuition. One appropriate approach is to fill each missing value with the average tuition of that student's faculty. That way, the imputed value reflects the fact that each faculty may have different tuition rates.
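
One way this could be done is sketched below on a made-up frame. `toy`, its values, and the use of `map` to look up each row's faculty average are assumptions for illustration; other approaches would work equally well.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing tuition per faculty (illustrative values)
toy = pd.DataFrame({
    "Faculty": ["Arts", "Arts", "Arts", "Math", "Math", "Math"],
    "Tuition": [40000.0, 44000.0, np.nan, 30000.0, np.nan, 34000.0],
})

# Rows where tuition is missing
missing_tuition = toy[toy["Tuition"].isna()]
print(missing_tuition)

# Average tuition per faculty, returned as a Series indexed by faculty
faculty_means = toy.groupby("Faculty")["Tuition"].mean()
print(faculty_means)

# Look up each row's faculty average and use it only where tuition is null
toy["Tuition"] = toy["Tuition"].fillna(toy["Faculty"].map(faculty_means))
print(toy)
```

The mean skips nulls by default, so each faculty average is computed from the known tuition values only.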

In [19]:
# Find tuition averages (returned as series)


# Assigning values based on the faculty average


Finally, we can use the `info` function to check whether any null values remain.

In [None]:
# Check null values in each column


### Exercise 1 - Removing Unwanted Strings

Once again, we are working with the phone dataset. However, as you know, we identified a few problems with it.

Task:
- Explore the unique values in the marketplace column, identify the typos, and fix them.

In [3]:
# Import data to a pandas dataframe
df_phone = pd.read_csv('phone_marketplace_dataset_cleaning_set.csv')
df_phone

Unnamed: 0,price,year_made,name,battery_life_percentage,storage,magnet_charging,marketplace,years_owned,visible_scratches,pro,original_sale_price,#_of_previous_owners,megapixel
0,551.0,2019,iPhone_11,74,64,no,kijiji,2,9,no,747,1,12
1,822.0,2020,iPhone_12,94,128,yes,craigslist,2,6,no,888,1,16
2,1008.0,2022,iPhone_14,97,256,yes,craigslist!,0,2,no,1185,1,22
3,,2021,iPhone_13,90,128,yes,craigslist,2,2,no,887,1,20
4,839.0,2020,iPhone_12,91,256,yes,kijiji,1,5,no,969,1,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...
344,1326.0,2022,iPhone_14,91,64,yes,craigslist,0,0,no,1394,1,22
345,458.0,2019,iPhone_11,75,256,no,facebook,3,3,no,702,2,12
346,487.0,2019,iPhone_11,87,256,no,facebook,1,7,no,781,2,12
347,1340.0,2022,iPhone_14,100,256,yes,craigslist,0,0,no,1411,1,22


In [1]:
# Explore the unique values in the marketplace column


In [5]:
#Replace or remove all ! in the marketplace column


### Exercise 2 - Imputing Nulls with Appropriate Values

Identify the column that has null values and populate the values using a basic imputation method of your choice.

Task:
- Identify the columns that need to be dealt with
- Populate missing values with a basic imputation method of your choice
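
As one possible approach among several, the sketch below imputes missing prices with the average price of the same phone model using `np.where`. The frame `toy`, its values, and the choice to group by `name` are assumptions for illustration, and the exercise leaves the method up to you.

```python
import numpy as np
import pandas as pd

# Toy frame with missing prices (illustrative values, not the real dataset)
toy = pd.DataFrame({
    "name": ["iPhone_11", "iPhone_11", "iPhone_11", "iPhone_12", "iPhone_12"],
    "price": [500.0, np.nan, 600.0, 800.0, np.nan],
})

# Average price per model, returned as a Series indexed by model name
price_means = toy.groupby("name")["price"].mean()

# np.where(condition, value_if_true, value_if_false):
# use the model average where price is null, otherwise keep the price
toy["price"] = np.where(
    toy["price"].isna(),
    toy["name"].map(price_means),
    toy["price"],
)

print(toy)
print(toy.isna().sum())
```

Grouping by a related column usually gives a more plausible fill value than a single overall average, for the same reason faculty averages were used for tuition.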

In [None]:
df_phone.isna().sum()

In [2]:
# Check the rows that are null


In [6]:
# Find phone price averages (returned as series)


In [3]:
# Option 1: Using np.where with the loc function for phone averages


In [4]:
# Checking to see if any more null values are left
