## Student Data Analysis

In this activity, you will use the steps below to analyze a dataset of student test scores from schools in a fictional school district.

1. Collect the data.

2. Prepare the data.

3. Summarize the data. 

4. Drill down into the data. 

5. Make comparisons. 



### Import required libraries and dependencies


In [18]:
import pandas as pd
import os

## Step 1: Collect the data.

To collect the data that you’ll need, complete the following steps:

**1. Using the Pandas `read_csv` function and the `os.path.join` function, import the data from the `new_student_data.csv` file, and create a DataFrame called `student_df`.**

In [19]:
# Build the path from its parts so the separator matches the operating system
student_data = os.path.join('..', 'Resources', 'new_student_data.csv')
student_df = pd.read_csv(student_data)
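
The import pattern above can be tried without the real resource file. The following is a minimal sketch that assumes a hypothetical two-row CSV written to a temporary directory; the file name `sample_student_data.csv` and the student rows are made up for illustration and are not part of the activity's dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for new_student_data.csv so the sketch is self-contained.
sample_csv = (
    "student_id,student_name,grade,school_name,reading_score,math_score,school_type\n"
    "1,Student A,11th,Example High School,87.2,64.1,Public\n"
    "2,Student B,9th,Example High School,,,Public\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # os.path.join inserts the correct path separator between the parts.
    csv_path = os.path.join(tmp_dir, "sample_student_data.csv")
    with open(csv_path, "w", encoding="utf-8") as handle:
        handle.write(sample_csv)

    # read_csv turns empty fields into NaN by default.
    sample_df = pd.read_csv(csv_path)

print(sample_df.shape)  # (2, 7)
```

Passing each folder as its own argument is what lets `os.path.join` choose the separator; a single pre-joined string is returned unchanged.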

**2. Use the `head` (and/or the `tail`) function to confirm that Pandas properly imported the data.**

In [20]:
student_df.head()

| | student_id | student_name | grade | school_name | reading_score | math_score | school_type |
|---|---|---|---|---|---|---|---|
| 0 | 127008367 | Sarah Douglas | 11th | Chang High School | 87.2 | 64.1 | Public |
| 1 | 33365505 | Francisco Osborne | 9th | Fisher High School | NaN | NaN | Public |
| 2 | 44359500 | Ryan Haas | 12th | Campbell High School | 91.6 | 54.7 | Public |
| 3 | 24791243 | Kathryn Mack | 11th | Richard High School | 68.9 | 73.3 | Charter |
| 4 | 121467881 | Harold Reynolds | 12th | Chang High School | 68.7 | 43.4 | Public |


## Step 2: Prepare the data.

To prepare and clean your data for analysis, complete the following steps:
    
**1. Check for and drop all rows with `NaN`, or missing, values in the `student_df` DataFrame, and then drop any duplicate rows.**

Use the following methods and functions to complete this section:
* `count()`
* `dropna()`
* `duplicated()`
* `sum()`
* `drop_duplicates()`

In [21]:
# Check for null values
student_df.isnull().sum()

student_id          0
student_name        0
grade               0
school_name         0
reading_score    1414
math_score        705
school_type         0
dtype: int64

In [32]:
# Drop null values
student_df = student_df.dropna()
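
To see what the null check and `dropna()` do on a small scale, here is a minimal sketch on a made-up four-row DataFrame. The name `toy_df` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data: one row is missing both scores, one is missing only math.
toy_df = pd.DataFrame(
    {
        "student_id": [1, 2, 3, 4],
        "reading_score": [87.2, np.nan, 91.6, 68.9],
        "math_score": [64.1, np.nan, 54.7, np.nan],
    }
)

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# dropna() returns a new DataFrame without any row that contains a NaN.
clean_df = toy_df.dropna()
print(len(toy_df), "->", len(clean_df))  # 4 -> 2
```

Because `dropna()` removes a row if any column is missing, the number of rows dropped can be smaller than the sum of the per-column counts when a single row is missing several values.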

In [33]:
# Check for duplicate rows
student_df.duplicated().sum()

1299

In [38]:
# Drop duplicate rows
student_df = student_df.drop_duplicates()
student_df.duplicated().sum()

0
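
The duplicate check follows the same count-then-drop pattern. A minimal sketch, assuming a made-up DataFrame (`toy_df`) in which one row is repeated exactly:

```python
import pandas as pd

# Made-up data: the second and third rows are identical in every column.
toy_df = pd.DataFrame(
    {
        "student_id": [1, 2, 2, 3],
        "student_name": ["Student A", "Student B", "Student B", "Student C"],
        "math_score": [64.1, 54.7, 54.7, 73.3],
    }
)

# duplicated() flags every repeat of an earlier row; sum() counts the flags.
duplicate_count = toy_df.duplicated().sum()
print(duplicate_count)  # 1

# drop_duplicates() keeps the first occurrence of each row by default.
deduped_df = toy_df.drop_duplicates()
print(deduped_df.index.tolist())  # [0, 1, 3]
```

The original index labels are kept, which is why the index of a cleaned DataFrame can have gaps.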

**2. Use the `str.replace` function to remove the "th" from the grade levels in the grade column.**

In [47]:
# Check the type of the grade column with dtypes
student_df.dtypes

student_id         int64
student_name      object
grade              int32
school_name       object
reading_score    float64
math_score       float64
school_type       object
dtype: object

In [48]:
# View the grade column to look for a reason it isn't numeric
student_df["grade"]

0        11
2        12
3        11
4        12
5         9
         ..
13935    10
13936    10
13937     9
13938    10
13939    11
Name: grade, Length: 10604, dtype: int32

In [45]:
# Remove 'th' suffixes by replacing them with an empty string.
# Casting to str first keeps this cell safe to re-run after the column is already numeric.
student_df['grade'] = student_df['grade'].astype(str).str.replace('th', '', regex=False)
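
The `.str` accessor only works on string data, so calling `str.replace` on a grade column that has already been converted to integers raises an `AttributeError`. The sketch below illustrates both cases on a made-up Series; the names `grades`, `stripped`, and `numeric_grades` are hypothetical.

```python
import pandas as pd

# Made-up grade labels in the same "11th" style as the dataset.
grades = pd.Series(["11th", "9th", "12th", "10th"], name="grade")

# regex=False treats "th" as a literal substring rather than a pattern.
stripped = grades.str.replace("th", "", regex=False)
print(stripped.tolist())  # ['11', '9', '12', '10']

# Once the values are integers, the .str accessor is no longer available.
numeric_grades = stripped.astype("int64")
try:
    numeric_grades.str.replace("th", "", regex=False)
    raised = False
except AttributeError:
    raised = True
print(raised)  # True
```

In a notebook this typically happens when the suffix-removal cell is run again after the conversion cell, which is one reason to run cells from top to bottom.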

In [46]:
# View the grade column to ensure the suffixes were removed
student_df["grade"]

**3. Convert the data type of the "grade" column to `int`.**

In [None]:
# Assign the whole column directly; setting values through .loc[:, "grade"] can keep the old dtype
student_df["grade"] = student_df["grade"].astype("int")
student_df.dtypes
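
A minimal sketch of the conversion on made-up data, assuming the suffixes have already been removed. It uses `"int64"` rather than `"int"` so that the resulting dtype is the same on every platform (plain `"int"` can resolve to `int32` on some systems); `toy_df` and `above_ninth` are hypothetical names.

```python
import pandas as pd

# Made-up data: grade values are digits stored as strings.
toy_df = pd.DataFrame(
    {
        "grade": ["11", "9", "12"],
        "math_score": [64.1, 55.4, 54.7],
    }
)

# Replace the whole column with its integer version.
toy_df["grade"] = toy_df["grade"].astype("int64")
print(toy_df.dtypes)

# Numeric grades can now be compared and sorted as numbers.
above_ninth = toy_df[toy_df["grade"] > 9]
print(len(above_ninth))  # 2
```

Checking `dtypes` after the assignment confirms that the column is stored as integers before any numeric filtering or grouping.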

**4. Use the `head` (and/or the `tail`) function to preview the DataFrame.**

In [49]:
student_df.head()

| | student_id | student_name | grade | school_name | reading_score | math_score | school_type |
|---|---|---|---|---|---|---|---|
| 0 | 127008367 | Sarah Douglas | 11 | Chang High School | 87.2 | 64.1 | Public |
| 2 | 44359500 | Ryan Haas | 12 | Campbell High School | 91.6 | 54.7 | Public |
| 3 | 24791243 | Kathryn Mack | 11 | Richard High School | 68.9 | 73.3 | Charter |
| 4 | 121467881 | Harold Reynolds | 12 | Chang High School | 68.7 | 43.4 | Public |
| 5 | 79397676 | Kyle Brooks | 9 | Turner High School | 72.6 | 55.4 | Public |


## Good work!

You have finished preparing the data. Complete the next lesson before starting Step 3.