# Data Validation

In this notebook, we go over several ways to validate our data. Validation matters because it gives us evidence that the data is reliable enough to work with.

We can validate our data in several ways, such as checking:

- against another trusted source
- data types
- unique values
- data ranges
- nulls
- consistency


### Import Basic Packages & Data

In [2]:
#Basics
import numpy as np
import pandas as pd

We will continue to work with the student grades data.

In [4]:
# Import data to a pandas dataframe
df_grades = pd.read_csv('student grades.csv')
df_grades.head()

Unnamed: 0,student_ID,first_name,last_name,grade_avg,faculty,tuition,OH_participated,classes_skipped
0,20123456.0,John,Park,B,Arts,44191.0,0,5.0
1,20123457.0,Alex,Great,B,Science,32245.0,"""4""",10.0
2,20123458.0,Sebastian,Taylor,B,Business,42679.0,6,7.0
3,20123459.0,Michael,Bay,A,Math,46478.0,15,2.0
4,20123460.0,Scott,Foster,A,Engineering,36784.0,5,8.0


### Check Data Types 

In [5]:
# Check the data types of each column
df_grades.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   student_ID       30 non-null     float64
 1   first_name       31 non-null     object 
 2   last_name        30 non-null     object 
 3   grade_avg        31 non-null     object 
 4   faculty          31 non-null     object 
 5   tuition          27 non-null     float64
 6   OH_participated  29 non-null     object 
 7   classes_skipped  29 non-null     float64
dtypes: float64(3), object(5)
memory usage: 2.1+ KB
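
The `info` output lists `OH_participated` as `object` even though it looks like a count, which suggests that at least one entry is not a plain number. One way to locate such entries is sketched below. The `sample` frame and its values are made up for illustration and only mimic the notebook's data.

```python
import pandas as pd

# Illustrative data only: one entry carries stray quote characters
sample = pd.DataFrame({
    "student_ID": [20123456.0, 20123457.0, 20123458.0],
    "OH_participated": ["0", '"4"', "6"],
})

# Coerce to numeric; entries that cannot be parsed become NaN
parsed = pd.to_numeric(sample["OH_participated"], errors="coerce")

# Keep rows where a value was present but failed to parse
bad_rows = sample[parsed.isna() & sample["OH_participated"].notna()]
print(bad_rows)
```

This only identifies the offending rows and leaves the original column unchanged.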


### Identify NULL or Missing values

The `isna` method checks for null or missing values in a dataframe. Chaining it with `sum` gives the number of null or missing values in each column.


In [6]:
# Identify missing values (NULLS) in the dataset
df_grades.isna().sum()

student_ID         1
first_name         0
last_name          1
grade_avg          0
faculty            0
tuition            4
OH_participated    2
classes_skipped    2
dtype: int64

Another thing to look out for: when several columns contain null values, there is a good chance that the errors exist not only at the column level but at the row level as well. Looking at the end of the dataset, we see that most of the last row is null, which may indicate a mistake when the data for this particular student was entered.

In [7]:
# Investigate the bottom of the dataset using tail()
df_grades.tail()

Unnamed: 0,student_ID,first_name,last_name,grade_avg,faculty,tuition,OH_participated,classes_skipped
26,20123482.0,Joseph,Kim,A,Math,33376.0,12.0,6.0
27,20123483.0,Chris,Dang,F,Business,44737.0,,8.0
28,20123484.0,Robbie,Tee,B,Engineering,49682.0,10.0,6.0
29,20123485.0,Shelly,Yoon,A,Math,33585.0,5.0,10.0
30,,Joseph,,A,English,,2.0,4.0
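
Row-level problems like this can also be found without scrolling through the data, by counting nulls across each row. The snippet below is a minimal sketch on a made-up three-row frame. The threshold of two nulls is an assumption, not a rule from this lesson.

```python
import numpy as np
import pandas as pd

# Illustrative data only, shaped like the tail of the grades dataset
sample = pd.DataFrame({
    "student_ID": [20123484.0, 20123485.0, np.nan],
    "first_name": ["Robbie", "Shelly", "Joseph"],
    "last_name": ["Tee", "Yoon", None],
    "tuition": [49682.0, 33585.0, np.nan],
})

# axis=1 sums across columns, giving the number of nulls in each row
nulls_per_row = sample.isna().sum(axis=1)

# Assumed threshold: rows with two or more nulls deserve a closer look
suspect_rows = sample[nulls_per_row >= 2]
print(suspect_rows)
```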


### Check unique values for categorical columns 

We can also use the `value_counts` method to find the unique values in a column and their frequencies. Here it reveals a typo for "Arts": the misspelling "Art$" appears with its own count.

In [8]:
# Explore what unique values appear in the Faculty column
df_grades['faculty'].value_counts()

Business       9
Engineering    8
Arts           4
Science        4
Math           4
Art$           1
English        1
Name: faculty, dtype: int64
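
When the set of allowed categories is known in advance, the same check can be automated with `isin`. In the sketch below, `valid_faculties` is an assumed list built from the values seen above, not an official list of faculties.

```python
import pandas as pd

# Illustrative values only, including the misspelled entry
faculty = pd.Series(
    ["Business", "Arts", "Art$", "Math", "Science", "Engineering", "English"],
    name="faculty",
)

# Assumption: these are the only valid faculty names
valid_faculties = {"Arts", "Business", "Engineering", "English", "Math", "Science"}

# Anything not in the allowed set is flagged for review
unexpected = faculty[~faculty.isin(valid_faculties)]
print(unexpected)
```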

### Check value ranges for numerical columns and identify clear errors

In [7]:
# Explore what range of values exist for numerical columns
df_grades.describe()

Unnamed: 0,student_ID,tuition,classes_skipped
count,30.0,27.0,29.0
mean,20123470.0,39727.592593,4.862069
std,8.803408,9749.186961,3.020456
min,20123460.0,40.0,0.0
25%,20123460.0,34898.5,3.0
50%,20123470.0,42679.0,4.0
75%,20123480.0,45734.0,7.0
max,20123480.0,49682.0,10.0


The `describe` output shows one value that sticks out: the minimum tuition is 40. That is highly unlikely, since tuition is generally far more expensive and this value is substantially lower than the others. There is a good chance that it is an error in the dataset.
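
A range check like this can also be written explicitly with `between`. The bounds in this sketch are assumptions chosen for illustration. In practice they would come from knowledge of what tuition can plausibly cost.

```python
import pandas as pd

# Illustrative values only, including one implausibly low amount
tuition = pd.Series([44191.0, 32245.0, 40.0, 46478.0], name="tuition")

# Assumed plausible bounds for a tuition fee
MIN_TUITION, MAX_TUITION = 10_000, 100_000

# Flag non-null values outside the bounds (between is inclusive on both ends)
out_of_range = tuition[tuition.notna() & ~tuition.between(MIN_TUITION, MAX_TUITION)]
print(out_of_range)
```

Missing values are excluded here on purpose, since they are already covered by the null check.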


We have identified several errors in our dataset, and we will deal with them in the next lesson.

### Compare Aggregated Data to Another Trusted DataSource

Another important way to validate our data is to compare it with another data source. Cross-validating our data against a source we trust is a simple yet powerful way to gain confidence in it.


Open the "External Report for Data Prep" file and use it to cross-reference whether the results below are correct or incorrect.

In [10]:
# Check whether the count of students matches our known source of truth
count_students = len(df_grades)
count_students

30

In [11]:
# Check whether the count of Business students matches our known source of truth
business_students = len(df_grades[df_grades['faculty'] == 'Business'])
business_students

9

In [13]:
# Check whether the total tuition fees match our known source of truth
sum_tuition = df_grades['tuition'].sum()
sum_tuition

1072645.0
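
Once the reference figures are known, the comparison itself can be scripted. The sketch below uses a small made-up frame, and the numbers in `expected` are hypothetical placeholders, not values taken from the external report.

```python
import math

import pandas as pd

# Illustrative data only
sample = pd.DataFrame({
    "faculty": ["Business", "Arts", "Business", "Math"],
    "tuition": [42679.0, 44191.0, 44737.0, 46478.0],
})

# Hypothetical reference figures standing in for the trusted report
expected = {"student_count": 4, "business_students": 2, "total_tuition": 178000.0}

# The same aggregates computed from our own data
actual = {
    "student_count": len(sample),
    "business_students": int((sample["faculty"] == "Business").sum()),
    "total_tuition": float(sample["tuition"].sum()),
}

# Collect (actual, expected) pairs for every aggregate that disagrees
mismatches = {
    key: (actual[key], expected[key])
    for key in expected
    if not math.isclose(actual[key], expected[key])
}
print(mismatches)
```

Any key that appears in `mismatches` points to an aggregate that should be investigated in either our data or the report.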

### Exercise 1 (Basic) - Understanding the data type of columns

Find the data type of each column. For each categorical column, list its unique values; for each numerical column, check its value range.

Task:
- Import data
- Explore data types
- Explore value ranges


In [34]:
# Import data to a pandas dataframe
df_phone = pd.read_csv('phone_marketplace_dataset_cleaning_set.csv')
df_phone

Unnamed: 0,price,year_made,name,battery_life_percentage,storage,magnet_charging,marketplace,years_owned,visible_scratches,pro,original_sale_price,#_of_previous_owners,megapixel
0,551.0,2019,iPhone_11,74,64,no,kijiji,2,9,no,747,1,12
1,822.0,2020,iPhone_12,94,128,yes,craigslist,2,6,no,888,1,16
2,1008.0,2022,iPhone_14,97,256,yes,craigslist!,0,2,no,1185,1,22
3,,2021,iPhone_13,90,128,yes,craigslist,2,2,no,887,1,20
4,839.0,2020,iPhone_12,91,256,yes,kijiji,1,5,no,969,1,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...
344,1326.0,2022,iPhone_14,91,64,yes,craigslist,0,0,no,1394,1,22
345,458.0,2019,iPhone_11,75,256,no,facebook,3,3,no,702,2,12
346,487.0,2019,iPhone_11,87,256,no,facebook,1,7,no,781,2,12
347,1340.0,2022,iPhone_14,100,256,yes,craigslist,0,0,no,1411,1,22


In [35]:
# Finding out data type of each column
df_phone.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 349 entries, 0 to 348
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   price                    343 non-null    float64
 1   year_made                349 non-null    int64  
 2   name                     349 non-null    object 
 3   battery_life_percentage  349 non-null    int64  
 4   storage                  349 non-null    int64  
 5   magnet_charging          349 non-null    object 
 6   marketplace              349 non-null    object 
 7   years_owned              349 non-null    int64  
 8   visible_scratches        349 non-null    int64  
 9   pro                      349 non-null    object 
 10  original_sale_price      349 non-null    int64  
 11  #_of_previous_owners     349 non-null    int64  
 12  megapixel                349 non-null    int64  
dtypes: float64(1), int64(8), object(4)
memory usage: 35.6+ KB


In [36]:
# Explore the range of values in the numerical columns
df_phone.describe()

Unnamed: 0,price,year_made,battery_life_percentage,storage,years_owned,visible_scratches,original_sale_price,#_of_previous_owners,megapixel
count,343.0,349.0,349.0,349.0,349.0,349.0,349.0,349.0,349.0
mean,879.09621,2020.538682,88.60745,148.17192,1.495702,3.002865,965.103152,1.2149,17.524355
std,304.855338,1.182642,7.550067,79.229727,1.242499,2.566919,277.363067,0.599047,3.899986
min,402.0,2019.0,70.0,64.0,0.0,0.0,408.0,1.0,12.0
25%,595.0,2019.0,84.0,64.0,0.0,1.0,745.0,1.0,12.0
50%,856.0,2021.0,91.0,128.0,1.0,2.0,923.0,1.0,20.0
75%,1121.5,2022.0,94.0,256.0,2.0,5.0,1200.0,1.0,22.0
max,1499.0,2024.0,100.0,256.0,4.0,10.0,1499.0,4.0,22.0


### Exercise 2 (Advanced) - Exploring Column Values

Find the value count of each categorical column

Task:
- For each categorical column, find the count of each value
- For each categorical column, return the unique values.
- Hint: use a for loop to go through each column

In [37]:
# Define a list of category columns
cat_cols = ['name', 'magnet_charging', 'marketplace', 'pro']

In [38]:
# Use a loop to find the unique values of each categorical column
for col in cat_cols:
    print(col, df_phone[col].unique())

name ['iPhone_11' 'iPhone_12' 'iPhone_14' 'iPhone_13']
magnet_charging ['no' 'yes']
marketplace ['kijiji' 'craigslist' 'craigslist!' 'facebook' 'facebook!' 'kijiji!']
pro ['no' 'yes']


In [41]:
# Use a for loop to count the instances of each unique value in each categorical column
for col in cat_cols:
    print(df_phone[col].value_counts(), "\n")

iPhone_14    94
iPhone_11    89
iPhone_12    85
iPhone_13    81
Name: name, dtype: int64 

yes    260
no      89
Name: magnet_charging, dtype: int64 

facebook       125
craigslist     110
kijiji         106
craigslist!      3
facebook!        3
kijiji!          2
Name: marketplace, dtype: int64 

no     271
yes     78
Name: pro, dtype: int64 
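
The counts above show marketplace names with a trailing "!", which look like data-entry errors. As a possible extension of the exercise, the sketch below selects the non-numeric columns automatically instead of listing them by hand, then flags values containing unexpected characters. The frame is made up, and the allowed character set (letters, digits, underscore) is an assumption.

```python
import pandas as pd

# Illustrative data only, shaped like the phone marketplace dataset
sample = pd.DataFrame({
    "price": [551.0, 822.0, 1008.0],
    "name": ["iPhone_11", "iPhone_12", "iPhone_14"],
    "marketplace": ["kijiji", "craigslist", "craigslist!"],
})

# Treat every non-numeric column as categorical
cat_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Assumption: valid values contain only letters, digits and underscores
flagged = {}
for col in cat_cols:
    mask = sample[col].str.contains(r"[^A-Za-z0-9_]", regex=True, na=False)
    if mask.any():
        flagged[col] = sample.loc[mask, col].unique().tolist()

print(flagged)
```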

