# Data Validation

In this notebook, we will go over several ways to validate our data. Validation matters because it gives us evidence that the data is reliable enough to work with.

We can validate our data in several ways, such as checking:

- against another trusted source
- data types
- unique values
- data ranges
- nulls
- consistency


### Import Basic Packages & Data

In [2]:
#Basics
import numpy as np
import pandas as pd

We will continue to work with the student grades data.

In [None]:
# Import data to a pandas dataframe
df_grades = pd.read_csv('student grades.csv')
df_grades.head()

### Check Data Types 
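
As a minimal sketch, the `dtypes` attribute lists the type pandas inferred for each column. The small frame below is a hypothetical stand-in for `df_grades`; its column names and values are assumptions made for illustration, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for df_grades; columns and values are illustrative only
df_demo = pd.DataFrame({
    "Faculty": ["Arts", "Business", "Science"],
    "Tuition": [12000, 15000, "13,500"],  # one number was stored as text
})

# dtypes reports the inferred type of every column
col_types = df_demo.dtypes
print(col_types)

# A numeric column that shows up as `object` usually hides non-numeric entries
tuition_is_object = col_types["Tuition"] == object
print(tuition_is_object)
```

On the real data, the same check would be `df_grades.dtypes`; `df_grades.info()` gives a similar overview and also shows the non-null count for each column.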

In [11]:
# Check the data types of each column


### Identify NULL or Missing values

The `isna` method checks for null or missing values in a dataframe. When we combine it with `sum`, we get the number of null or missing values in each column.
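
The combination of `isna` and `sum` can be sketched as follows. The frame is a hypothetical stand-in with made-up values, so the counts apply only to this example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_grades; values are illustrative only
df_demo = pd.DataFrame({
    "Faculty": ["Arts", "Business", None],
    "Tuition": [12000.0, np.nan, np.nan],
})

# isna() marks each missing cell as True; sum() then counts the Trues per column
null_counts = df_demo.isna().sum()
print(null_counts)

# Summing a second time gives the total for the whole dataframe
total_nulls = int(null_counts.sum())
print(total_nulls)
```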


In [10]:
# Identify missing values (NULLS) in the dataset


Another thing to look out for: when several columns contain nulls, there is a good chance that the errors are at the row level, not just the column level. Looking at the entire dataset, we see that most of the values in the last row are null, which may indicate that a mistake was made when the data was entered for this particular student.
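
One way to look for row-level problems is to count nulls across each row with `axis=1`. This is a sketch on a hypothetical frame; the "more than half of the fields missing" threshold is an assumption chosen for illustration, not a rule from this lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_grades; the last row is mostly empty on purpose
df_demo = pd.DataFrame({
    "Name": ["Student A", "Student B", "Student C"],
    "Faculty": ["Arts", "Business", np.nan],
    "Tuition": [12000.0, 15000.0, np.nan],
})

# axis=1 counts missing values across each row instead of down each column
nulls_per_row = df_demo.isna().sum(axis=1)
print(nulls_per_row)

# Keep the rows where more than half of the fields are missing (assumed threshold)
mostly_empty = df_demo[nulls_per_row > df_demo.shape[1] / 2]
print(mostly_empty)
```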

In [None]:
# Investigate the bottom of the dataset using tail()


### Check unique values for categorical columns 

We can also use the `value_counts` function to find the unique values in a column and their frequencies. Here it reveals a typo for "Arts": the misspelled version shows up with its own count.
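
A minimal sketch of this check, using a hypothetical `Faculty` column. The misspelling "Atrs" is invented for the example; the actual typo in the dataset may look different.

```python
import pandas as pd

# Hypothetical stand-in for the Faculty column; "Atrs" is an invented typo
df_demo = pd.DataFrame({
    "Faculty": ["Arts", "Business", "Arts", "Atrs", "Science", "Business"],
})

# value_counts() returns each distinct value with the number of times it occurs
faculty_counts = df_demo["Faculty"].value_counts()
print(faculty_counts)

# Values that occur only once are candidates for typos, not proof of one
rare_values = faculty_counts[faculty_counts == 1].index.tolist()
print(rare_values)
```

In this example the rare values include a legitimate faculty as well as the typo, so the list is a starting point for a manual review rather than an automatic filter.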

In [9]:
# Explore what unique values appear in the Faculty column


### Check value ranges for numerical columns and identify clear errors

In [8]:
# Explore what range of values exist for numerical columns


The output of `describe` shows something that stands out: the minimum value for tuition is 40. That is highly unlikely, since tuition is generally much more expensive and this value is substantially lower than the others. There is a good chance that it is an error in the dataset.
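
A range check like this can be sketched as follows. The tuition values and the plausibility floor `MIN_PLAUSIBLE_TUITION` are assumptions for illustration; only the outlying value of 40 comes from the lesson.

```python
import pandas as pd

# Hypothetical tuition values; only the 40 mirrors the value seen in the lesson
df_demo = pd.DataFrame({"Tuition": [12000, 15000, 13500, 40, 14000]})

# describe() summarizes count, mean, std, min, quartiles and max
summary = df_demo.describe()
print(summary)

# Assumed lower bound for a believable tuition fee; adjust to the real context
MIN_PLAUSIBLE_TUITION = 1000

# Rows below the floor are flagged for review rather than silently dropped
suspicious = df_demo[df_demo["Tuition"] < MIN_PLAUSIBLE_TUITION]
print(suspicious)
```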


We have identified several errors in our dataset, and we will deal with them in the next lesson.

### Compare Aggregated Data to Another Trusted DataSource

Another important way to validate our data is to compare it with another data source. It pays to use all of the resources available to us, and cross-validating our data can be a simple yet powerful way to gain confidence in it.


Open the "External Report for Data Prep" file and use it to check whether the results of the cells below are correct or incorrect.
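
The comparison can be sketched by computing each aggregate and checking it against the reported figure. Everything here is hypothetical: the frame, the `reported` numbers (which are not taken from the external report), and the assumption that each row represents one student.

```python
import pandas as pd

# Hypothetical stand-in for df_grades, assuming one row per student
df_demo = pd.DataFrame({
    "Faculty": ["Arts", "Business", "Business", "Science"],
    "Tuition": [12000, 15000, 15500, 13500],
})

# Aggregates computed from our own data
computed = {
    "student_count": len(df_demo),
    "business_count": int((df_demo["Faculty"] == "Business").sum()),
    "total_tuition": int(df_demo["Tuition"].sum()),
}

# Invented figures standing in for the numbers read from the external report
reported = {"student_count": 4, "business_count": 2, "total_tuition": 57000}

# Collect every aggregate that disagrees with the reported figure
mismatches = [key for key in computed if computed[key] != reported[key]]
for key, value in computed.items():
    status = "MISMATCH" if key in mismatches else "match"
    print(f"{key}: computed={value}, reported={reported[key]} -> {status}")
```

A mismatch does not say which source is wrong; it only marks where to investigate.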

In [1]:
# Confirm that the count of students matches our known source of truth


In [2]:
# Confirm that the count of Business-only students matches our known source of truth


In [3]:
# Confirm that the total tuition fees match our known source of truth


### Exercise 1 (Basic) - Understanding the data type of columns

Find the data type of each column. For each categorical column, list its unique values, and for each numerical column, check its value range.

Task:
- Import data
- Explore data types
- Explore value ranges

In [None]:
# Import data to a pandas dataframe
df_phone = pd.read_csv('phone_marketplace_dataset_cleaning_set.csv')
df_phone

In [4]:
# Finding out data type of each column


In [5]:
# Explore the range of values in the numerical columns


### Exercise 2 (Advanced) - Exploring Column Values

Find the value counts and unique values of each categorical column.

Task:
- For each categorical column, find the count of each value
- For each categorical column, return the unique values.
- Hint: use a for loop to go through each column
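
The loop suggested in the hint can be sketched on a small hypothetical frame. The column names `marketplace` and `pro` come from the exercise, but the values are invented and `demo_cols` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for df_phone; values are illustrative only
df_demo = pd.DataFrame({
    "marketplace": ["US", "CA", "US"],
    "pro": ["yes", "no", "no"],
})
demo_cols = ["marketplace", "pro"]

unique_by_col = {}
counts_by_col = {}
for col in demo_cols:
    # unique() returns the distinct values in order of first appearance
    unique_by_col[col] = df_demo[col].unique().tolist()
    # value_counts() returns how often each distinct value occurs
    counts_by_col[col] = df_demo[col].value_counts()
    print(col, unique_by_col[col])
    print(counts_by_col[col])
```

Storing the results in dictionaries keyed by column name keeps them available after the loop, instead of only printing them.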

In [37]:
# Define a list of category columns
cat_cols = ['name', 'magnet_charging', 'marketplace', 'pro']

In [6]:
# Use a loop to find all the unique values of each categorical column


In [7]:
# Use a for loop to find the number of instances of each unique value in each categorical column
