# Student Exam Performance Workshop
Use Python's pandas module to load a dataset containing student demographic information and test scores and find relationships between student attributes and test scores. This workshop will serve as an introduction to pandas and will allow students to practice the following skills: 

- Load a csv into a pandas DataFrame and examine summary statistics
- Rename DataFrame column names
- Add columns to a DataFrame
- Change values in DataFrame rows
- Analyze relationships between categorical features and test scores

**Bonus:**

Determine the relationship between the students' lunch classification and average test scores by creating a seaborn bar plot

In [15]:
# Import the python modules that we will need to use
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [16]:
def load_data(path):
    df = pd.read_csv(path)
    return df

Use the `load_data` function to load the StudentsPerformance.csv file into a pandas dataframe variable called `df`

__Hint__: Keep in mind where the csv file is in relation to this Jupyter Notebook. Do you need to provide an absolute or relative file path?

In [10]:
#Write python to call the function above and load the StudentsPerformance csv file into a pandas dataframe


#Keep this line so you can see the first five rows of your dataframe once you have loaded it!
df.head(5)

__Next step:__ Now that we have loaded our DataFrame, let's look at the summary statistics of our data. We can use the `describe` method to accomplish this:

In [11]:
df.describe(include='all')

By looking at this breakdown of our dataset, I can make at least the following observations:

1. Our DataFrame consists of eight columns, three of which are student test scores.
2. There are no missing values in our DataFrame!
3. The data appears to be pretty evenly distributed.
4. The column names are long and difficult to type
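Observations like the column count and the absence of missing values can also be checked directly rather than read off the `describe` output. The sketch below uses a small made-up DataFrame (the `toy` name, its columns, and its values are assumptions for illustration, not the workshop data):

```python
import pandas as pd

# Made-up stand-in data; one value is deliberately missing.
toy = pd.DataFrame({
    "group": ["a", "b", "a", None],
    "score": [70, 85, 90, 60],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape

# isna() marks missing cells; sum() counts them per column.
missing_per_column = toy.isna().sum()

print(n_rows, n_cols)
print(missing_per_column)
```

Running the same two calls on `df` should confirm the observations above, assuming the file loaded correctly.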

## Renaming DataFrame Columns

Let's change our column names so they are easier to work with!

__Hint__: Look into the pandas `columns` attribute to make the change!
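As a rough illustration of the hint, here are two common ways to rename columns, shown on a made-up DataFrame (the `toy` and `renamed` names and the column labels are assumptions for this sketch):

```python
import pandas as pd

# Made-up stand-in data with awkward column names.
toy = pd.DataFrame({"math score": [70, 85], "reading score": [72, 90]})

# Option 1: assign a full list of names to the columns attribute.
# The list must have one name per column, in the existing order.
toy.columns = ["mathScore", "readingScore"]

# Option 2: rename only selected columns with a mapping.
# rename returns a new DataFrame and leaves the original unchanged.
renamed = toy.rename(columns={"mathScore": "math"})

print(list(toy.columns))
print(list(renamed.columns))
```

Assigning to `columns` replaces every name at once, which suits this workshop; `rename` is handier when only a few names need to change.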

In [19]:
columns = 'gender', 'race', 'parentDegree', 'lunchStatus', 'courseCompletion', 'mathScore', 'readingScore', 'writingScore'

In [20]:
def renameColumns(df, columns):
    df.columns=columns
    return df

In [12]:
#Use the above function to rename the DataFrame's column names


df.head(10) #Look at the first ten rows of the DataFrame to ensure the renaming worked!

## Adding Columns to a DataFrame

Great! Next we want to add an `avgScore` column that is the average of the three given test scores (`mathScore`, `readingScore` and `writingScore`). This gives us a single measure of each student's performance and makes it simpler to examine how our features relate to that performance.
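One possible approach to a row-wise average is to select the relevant columns and call `mean` with `axis=1`. The sketch below uses made-up column names (`testA`, `testB`, `testC`) and values, so it illustrates the technique rather than the exact line the exercise asks for:

```python
import pandas as pd

# Made-up stand-in data: three test columns per student.
toy = pd.DataFrame({
    "testA": [60, 90],
    "testB": [70, 80],
    "testC": [80, 100],
})

# axis=1 averages across the columns of each row,
# producing one value per student.
toy["avg"] = toy[["testA", "testB", "testC"]].mean(axis=1)

print(toy)
```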

In [2]:
#Complete the following line of code to create an avgScore column
df['avgScore'] = 
df.head(5)

## Analyzing Feature Relationships
Now that our data is looking the way we want, let's examine how some of our features correlate with students' test performance. We will start by looking at how race and parent degree status relate to test scores.

__Hint__: Use pandas' `groupby` method to examine these relationships. The documentation for `groupby` can be found here: https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.groupby.html
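For readers new to `groupby`, here is a minimal sketch of grouping by two categorical columns and averaging a numeric one. The data and the `team`, `level`, and `points` names are made up for illustration:

```python
import pandas as pd

# Made-up stand-in data: two categorical columns and one numeric column.
toy = pd.DataFrame({
    "team": ["x", "x", "y", "y"],
    "level": ["low", "high", "low", "low"],
    "points": [10, 20, 30, 50],
})

# Group rows by each (team, level) combination, then average points
# within each group. The result is indexed by the group keys.
grouped = toy.groupby(["team", "level"])["points"].mean()

print(grouped)
```

Only combinations that actually occur in the data appear in the result, so this toy example yields three groups rather than four.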

In [3]:
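# numeric_only=True restricts the mean to the score columns; recent pandas
# versions raise an error when asked to average text columns
df.groupby(['race','parentDegree']).mean(numeric_only=True)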

From examining the above output, we can see that across all `race` groups, students with "high school" and "some high school" as their parent degree status (`parentDegree`) had lower test scores. 

__Next step__: Since there seems to be a clear distinction between students whose parents have some college education and those whose parents do not, let's simplify our DataFrame by creating a `degreeBinary` column based on the values in the `parentDegree` column. This new column will simply contain either "no_degree" or "has_degree". We can do this by writing a basic function and using pandas' `apply` method:
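The general pattern of deriving a new column with `apply` can be sketched on an unrelated, made-up example. The `pass_fail` function, the threshold of 60, and the `toy` data below are all assumptions for illustration:

```python
import pandas as pd

def pass_fail(score):
    """Map a numeric score to a label (hypothetical threshold of 60)."""
    return "pass" if score >= 60 else "fail"

# Made-up stand-in data.
toy = pd.DataFrame({"score": [45, 60, 88]})

# apply calls pass_fail once per value in the column and
# collects the returned labels into a new Series.
toy["result"] = toy["score"].apply(pass_fail)

print(toy)
```

The function passed to `apply` must return a value on every path; a branch that returns nothing leaves `None` in the new column.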

In [2]:
#Complete this function to return the proper strings to denote degree status

def degree_status(edu):
    if edu in {'high school', 'some high school'}:
        #Fill in your code here!

df['degreeBinary'] = df['parentDegree'].apply(degree_status)
df.head(10)

Great job! Now let's continue examining our features to find relationships in our data.

__Your turn:__ Use the `groupby` method again to examine relationships between other features and student test scores. What can we learn about the relationship between whether or not students have completed the course and their test scores? What about the relationship between gender and test scores?

In [5]:
##Use groupby to examine the relationship between course completion status and test scores

In [6]:
##Use groupby to examine the relationship between gender and test scores

## Bonus: Visualization

Great job making it this far! As a bonus exercise, we will create a simple data visualization. We have examined the relationship between all of our features and student test scores except for one: student lunch status, which is found in the `lunchStatus` column.

To explore this relationship, let's create a `barplot` with the students' `lunchStatus` on the x-axis and their average test scores (`avgScore`) on the y-axis.

We will use seaborn, which is a third-party library, to complete this visualization. If you do not already have seaborn installed, `pip install` it now! Follow the seaborn documentation to create the `barplot` in the cell below.
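As a starting point, the sketch below draws a seaborn `barplot` from a made-up DataFrame. The `toy` data and the `meal` and `score` column names are assumptions for illustration, and the code assumes seaborn, matplotlib, and pandas are installed:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in data: one categorical column and one numeric column.
toy = pd.DataFrame({
    "meal": ["a", "a", "b", "b"],
    "score": [60, 70, 80, 90],
})

fig, ax = plt.subplots()

# barplot aggregates the y values per x category (the mean by default)
# and draws one bar per category on the given axes.
sns.barplot(data=toy, x="meal", y="score", ax=ax)

ax.set_title("Mean score by meal (toy data)")
```

In a notebook the figure is typically displayed automatically at the end of the cell; in a script, call `plt.show()` or `fig.savefig(...)` afterwards.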

In [8]:
import seaborn as sns #import the seaborn module

sns.set(style='whitegrid')
    
def graph_data(data, xkey='lunchStatus', ykey='avgScore'):
    #Fill this in to create the barplot!
    
graph_data(df)