# Preppin Data Week 5
## The Prep School - Setting Grades
### Import Pandas

In [1]:
import pandas as pd

### Import data

In [2]:
file = 'PD 2022 WK 3 Grades.csv'
df = pd.read_csv(file)
df.head()

Unnamed: 0,Student ID,Maths,English,Spanish,Science,Art,History,Geography
0,1,66,97,85,75,76,94,76
1,2,84,85,62,87,68,75,74
2,3,88,68,69,81,92,89,75
3,4,65,97,96,89,98,77,62
4,5,86,97,94,98,67,77,97


### Divide the students into 6 evenly distributed groups for each subject & assign each subject a letter grade according to these groups

1. Use pandas function qcut to divide each subject (Maths, English, Spanish, etc.) and store each result in a new column
    - Labels for each bin (the letter score) can be passed as an argument for qcut
    
`WARNING` qcut always places tied values (duplicates that fall on a bin boundary) in the same bin. Unless you break those ties first, the groups will not be evenly sized. This is demonstrated below.

`WARNING` qcut returns an ordered categorical whose categories ascend in the order the labels were passed (F < E < D, etc.). This is not a problem for the letter grades themselves, but the categorical data type carries over to columns derived from the qcut results.

In [3]:
#Assign column using qcut without boundary case handling
df['Math Grade'] = pd.qcut(df.Maths, 6, labels=['F', 'E', 'D', 'C', 'B', 'A'])
#View value counts (to provide evidence of uneven groups)
df['Math Grade'].value_counts()

D    184
F    182
A    166
B    166
E    155
C    147
Name: Math Grade, dtype: int64

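To produce even groups, pass the ranks of the scores instead of the scores themselves, using `pd.Series.rank(method='first')`. Ties are then broken by order of appearance, so every rank is unique.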

In [4]:
df['Math Grade'] = pd.qcut(df.Maths.rank(method='first'), 6, labels=['F', 'E', 'D', 'C', 'B', 'A'])
df['Math Grade'].value_counts()

A    167
C    167
E    167
F    167
B    166
D    166
Name: Math Grade, dtype: int64
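
To see the effect in isolation, here is a minimal sketch on a made-up series with heavy ties (the values and the `Low`/`Mid`/`High` labels are my own assumptions, not the challenge data). Binning the raw values gives groups of 6, 2 and 4, while binning the ranks gives 4, 4 and 4.

```python
import pandas as pd

# Hypothetical toy scores with five tied 60s (not the challenge data)
scores = pd.Series([50, 60, 60, 60, 60, 60, 70, 80, 90, 90, 90, 100])
labels = ['Low', 'Mid', 'High']

# Binning the raw values: tied scores must all land in the same bin
raw_bins = pd.qcut(scores, 3, labels=labels)

# Binning the ranks: method='first' breaks ties by order of appearance,
# so every rank is unique and the bins can be filled evenly
ranked_bins = pd.qcut(scores.rank(method='first'), 3, labels=labels)

raw_counts = raw_bins.value_counts().sort_index()
ranked_counts = ranked_bins.value_counts().sort_index()
print(raw_counts)
print(ranked_counts)
```

The trade-off is that two students with the same score can end up with different grades, depending on which one appears first in the file.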

2. Repeat this method for the remaining subject columns

In [5]:
df['English Grade'] = pd.qcut(df.English.rank(method='first'), 6, labels=['F', 'E', 'D', 'C', 'B', 'A'])
df['Spanish Grade'] = pd.qcut(df.Spanish.rank(method='first'), 6, labels=['F', 'E', 'D', 'C', 'B', 'A'])
df['Science Grade'] = pd.qcut(df.Science.rank(method='first'), 6, labels=['F', 'E', 'D', 'C', 'B', 'A'])
df['Art Grade'] = pd.qcut(df.Art.rank(method='first'), 6, labels=['F', 'E', 'D', 'C', 'B', 'A'])
df['History Grade'] = pd.qcut(df.History.rank(method='first'), 6, labels=['F', 'E', 'D', 'C', 'B', 'A'])
df['Geography Grade'] = pd.qcut(df.Geography.rank(method='first'), 6, labels=['F', 'E', 'D', 'C', 'B', 'A'])
df.head()

Unnamed: 0,Student ID,Maths,English,Spanish,Science,Art,History,Geography,Math Grade,English Grade,Spanish Grade,Science Grade,Art Grade,History Grade,Geography Grade
0,1,66,97,85,75,76,94,76,F,A,C,D,D,A,D
1,2,84,85,62,87,68,75,74,C,C,F,C,E,D,D
2,3,88,68,69,81,92,89,75,B,E,E,D,B,B,D
3,4,65,97,96,89,98,77,62,F,A,A,B,A,D,F
4,5,86,97,94,98,67,77,97,B,A,B,A,E,D,A
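
The repeated assignments above could probably be collapsed into a loop. This is only a sketch on a made-up two-subject frame; note that it names the maths column `Maths Grade`, whereas the notebook uses `Math Grade`, so the later cells would need adjusting.

```python
import pandas as pd

# Hypothetical miniature of the grades table (values are made up)
df = pd.DataFrame({
    'Student ID': range(1, 13),
    'Maths':   [66, 84, 88, 65, 86, 70, 91, 55, 73, 60, 95, 78],
    'English': [97, 85, 68, 97, 97, 60, 72, 88, 91, 64, 79, 83],
})

labels = ['F', 'E', 'D', 'C', 'B', 'A']

# Same rank-then-qcut step as above, applied once per subject column
for subject in ['Maths', 'English']:
    df[f'{subject} Grade'] = pd.qcut(df[subject].rank(method='first'), 6, labels=labels)

print(df.head())
```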


### Figure out how many application points each subject is worth
An A is worth 10, a B is worth 8, a C 6, a D 4, an E 2, and an F 1.

1. Write a function that returns the application points for a given letter grade

In [6]:
def app_score(column):
    
    if column == 'A':
        return 10
    elif column == 'B':
        return 8
    elif column == 'C':
        return 6
    elif column == 'D':
        return 4
    elif column == 'E':
        return 2
    elif column == 'F':
        return 1

2. Apply the function to each grade column using `Series.apply`

Since we want to do arithmetic on these columns in a later step, we need to force a type conversion on the results. Otherwise, applying a function to a categorical column with `apply` returns another categorical column.

You can check which columns in a DataFrame are numeric, and therefore valid for mathematical operations, by using the function

`df._get_numeric_data()`

The leading underscore marks this as a private pandas method; `df.select_dtypes(include='number')` is a public alternative.
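
A minimal sketch of the numeric check using the public `select_dtypes` method, on a made-up frame with one numeric and one categorical column:

```python
import pandas as pd

# Hypothetical frame: one numeric column, one categorical column
df = pd.DataFrame({
    'Maths': [66, 84, 88],
    'Math Grade': pd.Categorical(['F', 'C', 'B']),
})

# select_dtypes keeps only the columns whose dtype is numeric
numeric_cols = df.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```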

In [8]:
df['app_points_math'] = df['Math Grade'].apply(app_score).astype('int64')
df['app_points_math'].value_counts()

10    167
6     167
2     167
1     167
8     166
4     166
Name: app_points_math, dtype: int64

In [9]:
df['app_points_english'] = df['English Grade'].apply(app_score).astype('int64')
df['app_points_spanish'] = df['Spanish Grade'].apply(app_score).astype('int64')
df['app_points_science'] = df['Science Grade'].apply(app_score).astype('int64')
df['app_points_art'] = df['Art Grade'].apply(app_score).astype('int64')
df['app_points_history'] = df['History Grade'].apply(app_score).astype('int64')
df['app_points_geography'] = df['Geography Grade'].apply(app_score).astype('int64')
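
As an alternative to the if/elif chain, the same lookup can be sketched with a dict and `Series.map`. The `points` dict below simply restates the values from `app_score`; the sample grades are made up.

```python
import pandas as pd

# Point values taken from app_score above
points = {'A': 10, 'B': 8, 'C': 6, 'D': 4, 'E': 2, 'F': 1}

# Hypothetical categorical grades, mimicking what qcut returns
grades = pd.Series(
    pd.Categorical(['F', 'C', 'B', 'A'], categories=list('FEDCBA'), ordered=True)
)

# map looks each grade up in the dict; astype forces a plain integer column
app_points = grades.map(points).astype('int64')

print(app_points.tolist())
print(app_points.dtype)
```

The `.astype('int64')` is still needed here, for the same reason as with `apply`.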

### Find out how many total application points each student has received

1. Add columns together

In [10]:
df['Total App Points'] = df['app_points_math'] + df['app_points_english'] + df['app_points_spanish'] + df['app_points_science'] + df['app_points_art'] + df['app_points_history'] + df['app_points_geography']
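
The long chain of `+` can also be written as a row-wise sum. A sketch on a made-up three-subject frame, assuming every points column shares the `app_points_` prefix:

```python
import pandas as pd

# Hypothetical points for two students across three subjects
df = pd.DataFrame({
    'Student ID': [1, 2],
    'app_points_math': [1, 6],
    'app_points_english': [10, 6],
    'app_points_art': [4, 2],
})

# Pick every column whose name starts with the shared prefix
point_cols = [c for c in df.columns if c.startswith('app_points_')]

# axis=1 sums across columns, giving one total per row (student)
df['Total App Points'] = df[point_cols].sum(axis=1)
print(df)
```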

### Work out the average total points per student by grade

e.g. for all students who got an A, how many points did they receive on average across all their subjects?
*Yes, I also struggled to figure out what this step was asking for*

1. Reshape the data to long format, so that each student's result in each subject gets its own row

intended result:

`Student ID | Total App Points | Subject | Score | App Points | Letter Grade` 

1. *A.* Perform three melt operations to 'unpivot' the data, one each for Score, App Points, and Letter Grade

In [12]:
df1 = df.melt(id_vars=['Student ID', 'Total App Points'], value_vars=['Maths', 'English', 'Spanish', 'Science', 'Art', 'History', 'Geography'], var_name='Subject', value_name='Score')
df1

Unnamed: 0,Student ID,Total App Points,Subject,Score
0,1,39,Maths,66
1,2,29,Maths,84
2,3,36,Maths,88
3,4,44,Maths,65
4,5,52,Maths,86
...,...,...,...,...
6995,996,49,Geography,64
6996,997,56,Geography,93
6997,998,41,Geography,84
6998,999,27,Geography,69


In [13]:
df2 = df.melt(id_vars='Student ID', value_vars=['app_points_math', 'app_points_english', 'app_points_spanish',
       'app_points_science', 'app_points_art', 'app_points_history',
       'app_points_geography'], var_name = 'Subject', value_name='App Points')
df2.head()

Unnamed: 0,Student ID,Subject,App Points
0,1,app_points_math,1
1,2,app_points_math,6
2,3,app_points_math,8
3,4,app_points_math,1
4,5,app_points_math,8


In [14]:
df3 = df.melt(id_vars='Student ID', value_vars=['Math Grade', 'English Grade', 'Spanish Grade',
       'Science Grade', 'Art Grade', 'History Grade', 'Geography Grade'], var_name='Subject', value_name='Grade')
df3.head()

Unnamed: 0,Student ID,Subject,Grade
0,1,Math Grade,F
1,2,Math Grade,C
2,3,Math Grade,B
3,4,Math Grade,F
4,5,Math Grade,B


2. Join the melted dataframes together to produce a single dataframe

*But wait! How can we ensure that the math grade (A) is matched to the correct math score (93) and the correct App Points (10)?*

2. *A.* Replace the label column values in df2 and df3 so that they correspond to the subject from df1

In [15]:
df2 = df2.replace('app_points_math', 'Maths')
df2 = df2.replace('app_points_english', 'English')
df2 = df2.replace('app_points_spanish', 'Spanish')
df2 = df2.replace('app_points_science', 'Science')
df2 = df2.replace('app_points_art', 'Art')
df2 = df2.replace('app_points_history', 'History')
df2 = df2.replace('app_points_geography', 'Geography')

In [16]:
df3 = df3.replace('Math Grade', 'Maths')
df3 = df3.replace('English Grade', 'English')
df3 = df3.replace('Spanish Grade', 'Spanish')
df3 = df3.replace('Science Grade', 'Science')
df3 = df3.replace('Art Grade', 'Art')
df3 = df3.replace('History Grade', 'History')
df3 = df3.replace('Geography Grade', 'Geography')
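
The one-at-a-time replacements can be condensed by passing a dict to `replace` on just the `Subject` column. A sketch on a made-up two-row slice of `df2` (the `subject_names` dict is my own helper name):

```python
import pandas as pd

# Hypothetical slice of df2 after the melt
df2 = pd.DataFrame({
    'Student ID': [1, 1],
    'Subject': ['app_points_math', 'app_points_english'],
    'App Points': [1, 10],
})

# One dict covers every rename; only the Subject column is touched
subject_names = {
    'app_points_math': 'Maths',
    'app_points_english': 'English',
}
df2['Subject'] = df2['Subject'].replace(subject_names)
print(df2)
```

Restricting the replacement to one column also avoids accidentally changing matching values elsewhere in the frame.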

2. *B.* Now join the dataframes on student ID and subject using pd.merge

In [18]:
dfc = pd.merge(df1, df2, on=['Student ID', 'Subject'])
dfc = pd.merge(dfc, df3, on=['Student ID', 'Subject'])
dfc

Unnamed: 0,Student ID,Total App Points,Subject,Score,App Points,Grade
0,1,39,Maths,66,1,F
1,2,29,Maths,84,6,C
2,3,36,Maths,88,8,B
3,4,44,Maths,65,1,F
4,5,52,Maths,86,8,B
...,...,...,...,...,...,...
6995,996,49,Geography,64,1,F
6996,997,56,Geography,93,8,B
6997,998,41,Geography,84,6,C
6998,999,27,Geography,69,2,E
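
One way to guard against a bad join is the `validate` argument of `pd.merge`. This sketch uses made-up frames and assumes each (student, subject) pair should appear exactly once on each side:

```python
import pandas as pd

# Hypothetical melted frames keyed by student and subject
scores = pd.DataFrame({'Student ID': [1, 1, 2, 2],
                       'Subject': ['Maths', 'Art', 'Maths', 'Art'],
                       'Score': [66, 76, 84, 68]})
grades = pd.DataFrame({'Student ID': [1, 1, 2, 2],
                       'Subject': ['Maths', 'Art', 'Maths', 'Art'],
                       'Grade': ['F', 'D', 'C', 'E']})

# validate='one_to_one' raises a MergeError if a key pair repeats on either side
combined = pd.merge(scores, grades, on=['Student ID', 'Subject'], validate='one_to_one')
print(combined)
```

If the row count after the merge equals the row count before it (7000 in the notebook), that is another quick sign that nothing was duplicated or dropped.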


3. Use a groupby to calculate the average total points per student per grade
    - Use `.mean()` to get the average of all numeric columns; Total App Points is the one we're interested in
    - Use `.reset_index()` to produce a single index dataframe rather than a multi-index
    - Rename columns using `.rename()` for clarity
    - Remove excess columns (We don't need to know the average score or the average student ID)

In [19]:
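# numeric_only=True leaves the text column Subject out of the mean (pandas 2.0 and later raise an error without it)
gb = dfc.groupby('Grade').mean(numeric_only=True).reset_index().rename(columns={'Total App Points' : 'Average Student Total Points Per Grade'})
gb = gb[['Grade', 'Average Student Total Points Per Grade']]
gb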

Unnamed: 0,Grade,Average Student Total Points Per Grade
0,A,41.072712
1,B,38.94062
2,C,36.853721
3,D,35.033563
4,E,33.34645
5,F,31.6929
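
A slightly tighter variant selects the one column of interest before averaging, so there are no excess columns to drop afterwards. A sketch on made-up rows:

```python
import pandas as pd

# Hypothetical long-format rows (one per student per subject)
dfc = pd.DataFrame({
    'Student ID': [1, 1, 2, 2],
    'Total App Points': [14, 14, 8, 8],
    'Subject': ['Maths', 'Art', 'Maths', 'Art'],
    'Grade': ['A', 'D', 'D', 'E'],
})

# Average only Total App Points within each grade, then name the result column
gb = (
    dfc.groupby('Grade')['Total App Points']
       .mean()
       .reset_index(name='Average Student Total Points Per Grade')
)
print(gb)
```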


4. Join the Average Points Per Grade to the full report (dfc)

In [20]:
dfc2 = pd.merge(dfc, gb, on='Grade', how='left')
dfc2

Unnamed: 0,Student ID,Total App Points,Subject,Score,App Points,Grade,Average Student Total Points Per Grade
0,1,39,Maths,66,1,F,31.692900
1,2,29,Maths,84,6,C,36.853721
2,3,36,Maths,88,8,B,38.940620
3,4,44,Maths,65,1,F,31.692900
4,5,52,Maths,86,8,B,38.940620
...,...,...,...,...,...,...,...
6995,996,49,Geography,64,1,F,31.692900
6996,997,56,Geography,93,8,B,38.940620
6997,998,41,Geography,84,6,C,36.853721
6998,999,27,Geography,69,2,E,33.346450
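
The groupby and the join back can also be done in one step with `transform`, which returns a value for every original row. A sketch on made-up rows:

```python
import pandas as pd

# Hypothetical long-format rows (one per student per subject)
dfc = pd.DataFrame({
    'Student ID': [1, 1, 2, 2],
    'Total App Points': [14, 14, 8, 8],
    'Grade': ['A', 'D', 'D', 'E'],
})

# transform('mean') broadcasts each grade's average back onto its rows
dfc['Average Student Total Points Per Grade'] = (
    dfc.groupby('Grade')['Total App Points'].transform('mean')
)
print(dfc)
```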


### Take the average total score for all students who received at least one A and remove anyone who scored less than this

1. Average total score for A's was calculated in the groupby.
    - Access via filtering and iloc

In [31]:
avg_a_df = gb[gb['Grade'] == 'A']
avg_a_score = avg_a_df.iloc[0, 1]
avg_a_score

41.07271171941831
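
Positional access with `iloc[0, 1]` depends on the column order. A label-based lookup is one alternative; this sketch uses made-up averages:

```python
import pandas as pd

# Hypothetical groupby result (the averages here are made up)
gb = pd.DataFrame({
    'Grade': ['A', 'B'],
    'Average Student Total Points Per Grade': [41.0, 39.0],
})

# Index by Grade, then look the value up by row and column label
avg_a_score = gb.set_index('Grade').loc['A', 'Average Student Total Points Per Grade']
print(avg_a_score)
```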

2. Remove everyone who scored less than `avg_a_score`

In [32]:
dfc2 = dfc2[dfc2['Total App Points'] >= avg_a_score]
dfc2

Unnamed: 0,Student ID,Total App Points,Subject,Score,App Points,Grade,Average Student Total Points Per Grade
3,4,44,Maths,65,1,F,31.692900
4,5,52,Maths,86,8,B,38.940620
7,8,42,Maths,82,6,C,36.853721
10,11,45,Maths,61,1,F,31.692900
13,14,45,Maths,63,1,F,31.692900
...,...,...,...,...,...,...,...
6984,985,51,Geography,75,4,D,35.033563
6992,993,42,Geography,87,6,C,36.853721
6993,994,42,Geography,73,4,D,35.033563
6995,996,49,Geography,64,1,F,31.692900


### Answer the question: How many students scored more than the average if you ignore their As?

1. Exclude all A grades via filtering

In [33]:
dfc2 = dfc2[dfc2['Grade'] != 'A']
dfc2

Unnamed: 0,Student ID,Total App Points,Subject,Score,App Points,Grade,Average Student Total Points Per Grade
3,4,44,Maths,65,1,F,31.692900
4,5,52,Maths,86,8,B,38.940620
7,8,42,Maths,82,6,C,36.853721
10,11,45,Maths,61,1,F,31.692900
13,14,45,Maths,63,1,F,31.692900
...,...,...,...,...,...,...,...
6984,985,51,Geography,75,4,D,35.033563
6992,993,42,Geography,87,6,C,36.853721
6993,994,42,Geography,73,4,D,35.033563
6995,996,49,Geography,64,1,F,31.692900


2. How many of the remaining students scored more than the average?

Use `pd.Series.nunique()` to calculate the distinct count of students

In [34]:
dfc2['Student ID'].nunique()

281
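
To make the counting logic concrete, here is a sketch on made-up rows. A student whose only remaining rows are A grades (student 8 below) disappears from the count entirely.

```python
import pandas as pd

# Hypothetical rows that survived the total-points filter
dfc2 = pd.DataFrame({
    'Student ID': [4, 4, 5, 5, 8],
    'Grade': ['A', 'F', 'B', 'A', 'A'],
})

# Drop the A rows, then count each remaining student once
non_a = dfc2[dfc2['Grade'] != 'A']
remaining_students = non_a['Student ID'].nunique()
print(remaining_students)
```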

### Output to csv

In [37]:
dfc2.to_csv('pandas_output.csv',index=False)

### FINAL NOTE ###

My solution has 1387 rows and 281 students, while the official solution has 1361 rows and 277 students. My averages by grade are also slightly different from the official solution (my 'A' average is 41.07 and the official one is 41.15).

I'm reasonably sure this is due to a different methodology in binning the data in step 1. 1000 does not break into 6 exactly equal parts, and pandas qcut chose to split the boundary cases differently than Tableau did.
(For example, the math score was binned so that B and D received 166 students, while A, C, E, and F received 167 students.) That then affected the average points per student by grade, causing a difference of 0.04 to 0.10 points.
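
The arithmetic behind the uneven split is easy to check. This sketch only shows how many bins must take an extra student; which bins get the extra one depends on the tool.

```python
# Number of students and bins from the challenge
students, bins = 1000, 6

# divmod gives the base bin size and how many bins need one extra student
base, extra = divmod(students, bins)

# Four bins of 167 and two bins of 166 add up to 1000
sizes = [base + 1] * extra + [base] * (bins - extra)
print(base, extra)
print(sizes, sum(sizes))
```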

(I can't tell you exactly how Tableau Prep breaks down its quantiles, because I'm running version 2019.4 which does not have the tile functionality.)

This is obviously not a huge difference, but the real-world implications are interesting. If this data really were used for high school admissions, a data analyst at my school using pandas would have calculated a lower 'A' average and admitted 4 more students than if they'd used Tableau Prep. That sounds pretty insignificant in terms of data, but it's very important if you happen to be one of those 4 children.