## Data Cleaning with Python
We have provided an example of data representing exam scores from 1000 students in an online math class.

The raw data is hard to work with. It is split across multiple tables, and the values don’t lend themselves well to analysis. For example, try to think about how you would plot the average exam score against the age of the students in the class. This would not be easy!

In [152]:
import pandas as pd
import glob

### 1. Dealing With Multiple Files

##### Exercise 1
We have 10 different files containing 100 students each. These files follow the naming structure:

- exams0.csv
- exams1.csv
- … up to exams9.csv

We are going to import each file using pandas, and combine all of the entries into one DataFrame.

First, create a variable called `student_files` and set it equal to the `glob()` of all of the csv files we want to import.

In [153]:
student_files = glob.glob("exams*.csv")

In [154]:
student_files

['exams0.csv',
 'exams1.csv',
 'exams2.csv',
 'exams3.csv',
 'exams4.csv',
 'exams5.csv',
 'exams6.csv',
 'exams7.csv',
 'exams8.csv',
 'exams9.csv']
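As a small aside on the pattern matching above: `glob.glob()` does not promise any particular order for the paths it returns, so wrapping the call in `sorted()` is one way to get a repeatable file order. The sketch below uses a temporary folder and empty placeholder files (both are assumptions for illustration, not part of the exercise data).

```python
import glob
import os
import tempfile

# Build a throwaway folder with a few empty files named like the exam files.
with tempfile.TemporaryDirectory() as folder:
    for i in (2, 0, 1):
        open(os.path.join(folder, f"exams{i}.csv"), "w").close()
    # An unrelated file that the pattern should not match.
    open(os.path.join(folder, "notes.txt"), "w").close()

    pattern = os.path.join(folder, "exams*.csv")
    # sorted() makes the order deterministic across platforms.
    matched = sorted(glob.glob(pattern))
    names = [os.path.basename(path) for path in matched]

print(names)
```

Note that `sorted()` orders strings lexicographically, which is fine for `exams0.csv` through `exams9.csv` but would place a hypothetical `exams10.csv` before `exams2.csv`.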

##### Exercise 2
Create an empty list called `df_list` that will store all of the DataFrames we make from the files `exams0.csv` through `exams9.csv`. Loop through the filenames in `student_files`, and create a DataFrame from each file. Append this DataFrame to `df_list`.

In [155]:
df_list = []
for filename in student_files:
    data = pd.read_csv(filename)
    df_list.append(data)
df_list

[    id         full_name gender_age fractions probability       grade
 0    0    Barrett Feragh        M14       76%         72%   9th grade
 1    1   Llewellyn Keech        M14       83%         NaN  12th grade
 2    2   Llewellyn Keech        M14       83%         NaN  12th grade
 3    3      Terrell Geri        M15       80%         86%  11th grade
 4    4    Gram Hallewell        M14       67%         78%  10th grade
 ..  ..               ...        ...       ...         ...         ...
 95  95     Halley Clunie        F14       73%         82%  12th grade
 96  96    Gale Mullender        F17       72%         81%  12th grade
 97  97        Ryun Denne        M17       74%         78%  11th grade
 98  98  Cazzie Potapczuk        M14       71%         78%  10th grade
 99  99     Verina Pasque        F18       48%         77%  10th grade
 
 [100 rows x 6 columns],
     id          full_name gender_age fractions probability       grade
 0    0  Roseanna Gwinnell        F15       89%  

##### Exercise 3

Concatenate all of the DataFrames in `df_list` into one DataFrame called `students`.

In [156]:
students = pd.concat(df_list)
students

Unnamed: 0,id,full_name,gender_age,fractions,probability,grade
0,0,Barrett Feragh,M14,76%,72%,9th grade
1,1,Llewellyn Keech,M14,83%,,12th grade
2,2,Llewellyn Keech,M14,83%,,12th grade
3,3,Terrell Geri,M15,80%,86%,11th grade
4,4,Gram Hallewell,M14,67%,78%,10th grade
...,...,...,...,...,...,...
95,95,Maxi Dew,F16,77%,71%,10th grade
96,96,Jewell Boas,F15,57%,90%,12th grade
97,97,Lebbie Twine,F17,72%,91%,12th grade
98,98,Garek Culbert,M14,64%,,11th grade
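One detail worth noticing in the output above is that `pd.concat()` keeps each file's own index, so the labels 0 to 99 repeat once per file. The sketch below shows this on two tiny made-up DataFrames (the names and values are placeholders) and how the optional `ignore_index=True` argument renumbers the rows instead.

```python
import pandas as pd

# Two tiny stand-ins for the per-file DataFrames (made-up rows).
first = pd.DataFrame({"id": [0, 1], "full_name": ["A One", "B Two"]})
second = pd.DataFrame({"id": [0, 1], "full_name": ["C Three", "D Four"]})

# Default behaviour: the original index labels are kept, so they repeat.
kept = pd.concat([first, second])

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([first, second], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Either form works for the exercises that follow, because `pd.melt()` produces a fresh index anyway.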


### 2. Reshaping Your Data

##### Exercise 1
Print out the columns of students.

In [157]:
students.columns

Index(['id', 'full_name', 'gender_age', 'fractions', 'probability', 'grade'], dtype='object')

##### Exercise 2
There is a column for the scores on the fractions exam, and a column for the scores on the probabilities exam.

We want to make each row an observation, so we want to transform this table to look like:

| full_name | exam | score | gender_age | grade |
|---|---|---|---|---|
| “First Student” | “Fractions” | score% | … | … |
| “First Student” | “Probabilities” | score% | … | … |
| “Second Student” | “Fractions” | score% | … | … |
| “Second Student” | “Probabilities” | score% | … | … |
| … | … | … | … | … |

Use `pd.melt()` to create a new table (still called `students`) that follows this structure.

In [158]:
students = pd.melt(frame=students, id_vars=["full_name", "gender_age", "grade"],
                   value_vars=["fractions", "probability"], value_name="score", var_name="exam")

##### Exercise 3
Print the `.head()` and the `.columns` of `students`.

Also, print out the `.value_counts()` of the column exam.

In [159]:
students.head()

Unnamed: 0,full_name,gender_age,grade,exam,score
0,Barrett Feragh,M14,9th grade,fractions,76%
1,Llewellyn Keech,M14,12th grade,fractions,83%
2,Llewellyn Keech,M14,12th grade,fractions,83%
3,Terrell Geri,M15,11th grade,fractions,80%
4,Gram Hallewell,M14,10th grade,fractions,67%


In [160]:
students.columns

Index(['full_name', 'gender_age', 'grade', 'exam', 'score'], dtype='object')

In [161]:
students["exam"].value_counts()

fractions      1000
probability    1000
Name: exam, dtype: int64
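To see what `pd.melt()` does on a smaller scale, here is a minimal sketch with a two-row wide table. The student names and scores are placeholders, and only `full_name` is used as an identifier column to keep the example short.

```python
import pandas as pd

# A tiny wide table: one row per student, one column per exam.
wide = pd.DataFrame({
    "full_name": ["First Student", "Second Student"],
    "fractions": ["76%", "83%"],
    "probability": ["72%", None],
})

# Each (student, exam) pair becomes its own row in the long table.
long = pd.melt(
    frame=wide,
    id_vars=["full_name"],
    value_vars=["fractions", "probability"],
    var_name="exam",
    value_name="score",
)

print(long)
```

The long table has twice as many rows as the wide one, which matches the 1000 rows per exam seen in the `value_counts()` output above.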

### 3. Dealing with Duplicates

##### Exercise 1
It looks like some rows were recorded twice during data collection. Use the `.duplicated()` method on the `students` DataFrame to create a Series called `duplicates`.

In [162]:
duplicates = students.duplicated()
duplicates

0       False
1       False
2        True
3       False
4       False
        ...  
1995    False
1996    False
1997    False
1998    False
1999    False
Length: 2000, dtype: bool

##### Exercise 2
Print out the `.value_counts()` of the duplicates Series to see how many rows are exact duplicates.

In [163]:
duplicates.value_counts()

False    1976
True       24
dtype: int64

##### Exercise 3
Update the value of students to be the students table with the duplicates dropped.

In [164]:
students.drop_duplicates(inplace=True)

##### Exercise 4
Use the `.duplicated()` method again to make a Series called `duplicates` after dropping the duplicates. Print out the value counts again. Are there any `True` values left?

In [165]:
duplicates = students.duplicated()
duplicates.value_counts()

False    1976
dtype: int64
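The behaviour of `.duplicated()` and `.drop_duplicates()` can be checked on a tiny made-up table. By default only the second and later copies of a row are flagged; the `keep=False` option, shown here as an optional extra, flags every copy.

```python
import pandas as pd

# Made-up rows: the second and third rows are exact copies of each other.
rows = pd.DataFrame({
    "full_name": ["A One", "B Two", "B Two", "C Three"],
    "score": ["76%", "83%", "83%", "80%"],
})

# Default (keep="first"): only the repeated copy is marked True.
flags = rows.duplicated()

# keep=False marks every member of a duplicated group.
all_copies = rows.duplicated(keep=False)

# Drop the repeats and renumber the remaining rows.
deduped = rows.drop_duplicates().reset_index(drop=True)

print(flags.tolist())       # [False, False, True, False]
print(all_copies.tolist())  # [False, True, True, False]
print(len(deduped))         # 3
```

The `reset_index(drop=True)` call is optional; without it, the dropped rows leave gaps in the index, as in the exercise output where label 2 disappears.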

### 4. Splitting by Index

##### Exercise 1
Print out the columns of the students DataFrame.

In [166]:
students.columns

Index(['full_name', 'gender_age', 'grade', 'exam', 'score'], dtype='object')

##### Exercise 2
The column `gender_age` sounds like it contains both gender and age!

Print out the `.head()` of the DataFrame to see what kind of data the column contains.

In [167]:
students.head()

Unnamed: 0,full_name,gender_age,grade,exam,score
0,Barrett Feragh,M14,9th grade,fractions,76%
1,Llewellyn Keech,M14,12th grade,fractions,83%
3,Terrell Geri,M15,11th grade,fractions,80%
4,Gram Hallewell,M14,10th grade,fractions,67%
5,Stephana Boots,F18,9th grade,fractions,


##### Exercise 3
It looks like the first character of the values in `gender_age` contains the gender, while the rest of the string contains the age. Let’s separate out the gender data into a new column called `gender`.

In [168]:
students["gender"] = students["gender_age"].str[:1]

In [169]:
students.head()

Unnamed: 0,full_name,gender_age,grade,exam,score,gender
0,Barrett Feragh,M14,9th grade,fractions,76%,M
1,Llewellyn Keech,M14,12th grade,fractions,83%,M
3,Terrell Geri,M15,11th grade,fractions,80%,M
4,Gram Hallewell,M14,10th grade,fractions,67%,M
5,Stephana Boots,F18,9th grade,fractions,,F


##### Exercise 4
Now, separate out the age data into a new column called `age`.

In [170]:
students["age"] = students["gender_age"].str[1:]
students.head()

Unnamed: 0,full_name,gender_age,grade,exam,score,gender,age
0,Barrett Feragh,M14,9th grade,fractions,76%,M,14
1,Llewellyn Keech,M14,12th grade,fractions,83%,M,14
3,Terrell Geri,M15,11th grade,fractions,80%,M,15
4,Gram Hallewell,M14,10th grade,fractions,67%,M,14
5,Stephana Boots,F18,9th grade,fractions,,F,18


##### Exercise 5
Now, we don’t need that `gender_age` column anymore.

Let’s set the `students` DataFrame to be the `students` DataFrame with all columns except `gender_age`.

In [171]:
students.drop(columns="gender_age", inplace=True)

In [172]:
students.head()

Unnamed: 0,full_name,grade,exam,score,gender,age
0,Barrett Feragh,9th grade,fractions,76%,M,14
1,Llewellyn Keech,12th grade,fractions,83%,M,14
3,Terrell Geri,11th grade,fractions,80%,M,15
4,Gram Hallewell,10th grade,fractions,67%,M,14
5,Stephana Boots,9th grade,fractions,,F,18
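The slicing steps above can be sketched on a small made-up column. Note that `.str[1:]` returns strings, so `age` stays an `object` column in this notebook; the sketch additionally converts it with `pd.to_numeric()`, which is an optional extra step and not part of the exercise.

```python
import pandas as pd

# Made-up values in the same "letter + digits" layout as gender_age.
people = pd.DataFrame({"gender_age": ["M14", "F18", "M15"]})

# First character is the gender.
people["gender"] = people["gender_age"].str[0]

# Everything after the first character is the age, still as text,
# so convert it to a number before doing arithmetic on it.
people["age"] = pd.to_numeric(people["gender_age"].str[1:])

# The combined column is no longer needed.
people = people.drop(columns="gender_age")

print(people)
```

A numeric `age` column would be needed for something like averaging scores per age group.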


### 5. Splitting by Character

##### Exercise 1
The students’ names are stored in a column called `full_name`.

We want to separate this data out into two new columns, `first_name` and `last_name`.

In [173]:
# Split once and reuse the result for both columns.
name_split = students["full_name"].str.split(" ", expand=True)
students["first_name"] = name_split[0]
students["last_name"] = name_split[1]

##### Exercise 2
Print out the `.head()` of students to see how the DataFrame has changed.

In [174]:
students.head()

Unnamed: 0,full_name,grade,exam,score,gender,age,first_name,last_name
0,Barrett Feragh,9th grade,fractions,76%,M,14,Barrett,Feragh
1,Llewellyn Keech,12th grade,fractions,83%,M,14,Llewellyn,Keech
3,Terrell Geri,11th grade,fractions,80%,M,15,Terrell,Geri
4,Gram Hallewell,10th grade,fractions,67%,M,14,Gram,Hallewell
5,Stephana Boots,9th grade,fractions,,F,18,Stephana,Boots


##### Exercise 3
Now, we don’t need that `full_name` column anymore.

Let’s set the students DataFrame to be the students DataFrame with all columns except `full_name`.

In [175]:
students.drop(columns="full_name", inplace=True)

In [176]:
students.head()

Unnamed: 0,grade,exam,score,gender,age,first_name,last_name
0,9th grade,fractions,76%,M,14,Barrett,Feragh
1,12th grade,fractions,83%,M,14,Llewellyn,Keech
3,11th grade,fractions,80%,M,15,Terrell,Geri
4,10th grade,fractions,67%,M,14,Gram,Hallewell
5,9th grade,fractions,,F,18,Stephana,Boots
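Splitting on every space works when each name has exactly two parts, as in the rows shown here. If a name had more parts, one possible approach is to limit the split with `n=1`, so that everything after the first space goes into the last name. The three-part name below is a hypothetical example, not taken from the data.

```python
import pandas as pd

# One name from the data plus a hypothetical three-part name.
names = pd.DataFrame({"full_name": ["Barrett Feragh", "Mary Ann Smith"]})

# n=1 splits at the first space only, so there are always two pieces.
parts = names["full_name"].str.split(" ", n=1, expand=True)
names["first_name"] = parts[0]
names["last_name"] = parts[1]

print(names)
```

Whether the middle part belongs with the first or the last name is a judgment call that depends on the data.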


### 6. Looking at Types

##### Exercise 1
Let’s inspect the dtypes in the students table.

Print out the `.dtypes` attribute.

In [177]:
students.dtypes

grade         object
exam          object
score         object
gender        object
age           object
first_name    object
last_name     object
dtype: object

### 7. String Parsing

##### Exercise 1
The dtypes in the last exercise show that `score` is stored as `object` (strings) rather than as numbers, so we cannot compute the mean of the column directly.

Use regex to take out the `%` signs in the `score` column.

In [178]:
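# Assign the result back: calling replace(..., inplace=True) on a selected
# column may not update the DataFrame in recent pandas versions.
students["score"] = students["score"].replace("%", "", regex=True)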

##### Exercise 2
Convert the score column to a numerical type using the `pd.to_numeric()` function.

In [179]:
students["score"] =  pd.to_numeric(students["score"])

In [180]:
students.dtypes

grade          object
exam           object
score         float64
gender         object
age            object
first_name     object
last_name      object
dtype: object
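The two steps above (strip the `%` sign, then convert) can be sketched on a small Series. The value `"absent"` is a hypothetical bad entry added to show `errors="coerce"`, an optional argument that turns unparseable values into `NaN` instead of raising an error.

```python
import pandas as pd

# Made-up scores: two valid, one missing, one hypothetical bad entry.
scores = pd.Series(["76%", "83%", None, "absent"])

# Remove the literal percent sign; no regex is needed for a single character.
cleaned = scores.str.replace("%", "", regex=False)

# Convert to numbers; values that cannot be parsed become NaN.
numeric = pd.to_numeric(cleaned, errors="coerce")

print(numeric)
```

The missing values are why the converted column is a float type rather than an integer type, just like `score` in the dtypes output above.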

### 8. More String Parsing

##### Exercise 1
Print out the first five rows of the `grade` column.

In [181]:
students["grade"].head()

0     9th grade
1    12th grade
3    11th grade
4    10th grade
5     9th grade
Name: grade, dtype: object

##### Exercise 2
Use regex to extract the number from each string in `grade` and store those values back into the `grade` column.

In [182]:
students["grade"] = students["grade"].str.split(r"(\d+)", expand=True)[1]

In [183]:
students.head()

Unnamed: 0,grade,exam,score,gender,age,first_name,last_name
0,9,fractions,76.0,M,14,Barrett,Feragh
1,12,fractions,83.0,M,14,Llewellyn,Keech
3,11,fractions,80.0,M,15,Terrell,Geri
4,10,fractions,67.0,M,14,Gram,Hallewell
5,9,fractions,,F,18,Stephana,Boots


##### Exercise 3
Print the dtypes of the `students` table.

In [184]:
students.dtypes

grade          object
exam           object
score         float64
gender         object
age            object
first_name     object
last_name      object
dtype: object

##### Exercise 4
Convert the `grade` column to be numerical values instead of objects.

In [185]:
students["grade"] = pd.to_numeric(students["grade"])

In [186]:
students.dtypes

grade           int64
exam           object
score         float64
gender         object
age            object
first_name     object
last_name      object
dtype: object

##### Exercise 5
Calculate the mean of `grade`, store it in a variable called `avg_grade`, and then print it out!

In [187]:
avg_grade = students["grade"].mean()

In [188]:
avg_grade

10.620445344129555
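An alternative to splitting on the digits is `.str.extract()`, which returns the text matched by a capture group directly. This is a sketch of that alternative on made-up values, not the approach used in the cells above.

```python
import pandas as pd

# Values in the same format as the grade column.
grades = pd.Series(["9th grade", "12th grade", "11th grade"])

# Capture the first run of digits; expand=False returns a Series
# because the pattern has exactly one capture group.
numbers = grades.str.extract(r"(\d+)", expand=False)

# The extracted values are still strings, so convert them.
as_int = pd.to_numeric(numbers)

print(as_int.tolist())  # [9, 12, 11]
```

The raw-string prefix `r` keeps Python from interpreting `\d` as a string escape.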

### 9. Missing Values
There are two common ways to handle missing values:

- **Method 1:** drop all of the rows with a missing value.
- **Method 2:** fill the missing values with the mean of the column, or with some other aggregate value.

##### Exercise 1
Get the mean of the `score` column. Store it in `score_mean` and print it out.

In [189]:
score_mean = students["score"].mean()
score_mean

77.69657422512235

##### Exercise 2
We will assume that everyone who doesn’t have a score for an exam missed the test, so we want to replace every `NaN` in the `score` column with 0.

First check how many values are missing, then fill all of the `NaN`s in `students['score']` with 0.

In [190]:
students["score"].isnull().value_counts()

False    1839
True      137
Name: score, dtype: int64

In [191]:
# Assign the result back so the DataFrame column is actually updated.
students["score"] = students["score"].fillna(0)

##### Exercise 3
Get the mean of the `score` column again. Store it in `score_mean_2` and print it out.

In [192]:
score_mean_2 = students["score"].mean()
score_mean_2

72.30971659919028
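The drop in the mean after filling with 0 (from about 77.70 to about 72.31) is expected: the zeros pull the average down. The sketch below compares the two methods listed at the start of this section with the zero-fill used in the exercise, on four made-up scores.

```python
import pandas as pd

# Made-up scores with one missing value.
scores = pd.DataFrame({"score": [80.0, 60.0, None, 100.0]})

# Method 1: drop rows where the score is missing.
dropped = scores.dropna(subset=["score"])

# Method 2: fill missing scores with the column mean (NaN is skipped by mean()).
filled_mean = scores["score"].fillna(scores["score"].mean())

# The exercise's choice: treat a missing score as 0.
filled_zero = scores["score"].fillna(0)

print(dropped["score"].mean())  # 80.0
print(filled_mean.mean())       # 80.0
print(filled_zero.mean())       # 60.0
```

Dropping rows and filling with the mean both leave the average unchanged here, while filling with 0 lowers it, so the right choice depends on what a missing score means in the data.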