# DTSC 580: Data Manipulation

## Assignment: High School Students Merging Practice

### Name: 

## Overview

*Note: This optional assignment is worth up to 5% extra credit. I suggest that you complete all other assignments first and then come back to this one for extra practice.*

In this optional assignment, you will work with multiple CSV files, with the goal of merging their information into a single DataFrame. The data is made up and describes four imaginary high schools. The files that you will be working with are:

- <u>central.csv</u>: list of students that attend Central High School along with their class scores
- <u>columbia.csv</u>: list of students that attend Columbia High School along with their class scores
- <u>eastside.csv</u>: list of students that attend Eastside High School along with their class scores
- <u>greenwich.csv</u>: list of students that attend Greenwich High School along with their class scores
- <u>school_info.csv</u>: information about the four local schools
- <u>activities.csv</u>: list of students that participate in after-school activities
- <u>principal.csv</u>: information about the principals for all the schools in the district, not just the four high schools that we're analyzing

## Assignment

Your job is to load and merge the data into a final DataFrame that you must call `students_final`. The `students_final` DataFrame must meet the following requirements:
- It must be sorted by `Student_ID`.
- The index must run in order from `0` through `n - 1`, where `n` is the total number of students.
- Create a `Grade_Average` column (the sixth column) that holds the average of each student's Math, Science, English, and History scores.
- Create a `Letter_Grade` column (the seventh column): a categorical column holding the letter grade earned, based on the `Grade_Average` column. Averages of 0-59.99 earn an `F`, 60-69.99 earn a `D`, 70-79.99 earn a `C`, 80-89.99 earn a `B`, and 90 and above earn an `A`. The categories must be ordered, with the unknown category `None` listed first:
    - `Index(['None', 'F', 'D', 'C', 'B', 'A'], dtype='object')`
- Any missing values in the entire data set must be filled with the string `None`.
- As an extra check that you merged the DataFrames correctly, make sure that no student IDs are duplicated in your final DataFrame.
- Column names and data types must match the list below, in this exact order.
```
 #   Column              Dtype
 0   Student_ID          int64
 1   Math                int64
 2   Science             int64
 3   English             int64
 4   History             int64
 5   Grade_Average       float64
 6   Letter_Grade        category
 7   Activity            object
 8   School_Name         object
 9   Address             object
 10  Principal_Name      object
 11  Mascot              object
 12  Student_Population  int64
```
- Once complete, save your notebook as `students.ipynb` and submit to CodeGrade to check your work.
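
The requirements above can be checked mechanically before submitting. The sketch below is one possible self-check, not the CodeGrade grader: the helper name `check_students_final` and the two-row `toy` frame are hypothetical, and only the column names and dtypes come from the listing. It assumes text columns are stored with the `object` dtype.

```python
import pandas as pd

# Expected column order and dtypes, copied from the assignment listing.
EXPECTED_DTYPES = {
    "Student_ID": "int64", "Math": "int64", "Science": "int64",
    "English": "int64", "History": "int64", "Grade_Average": "float64",
    "Letter_Grade": "category", "Activity": "object", "School_Name": "object",
    "Address": "object", "Principal_Name": "object", "Mascot": "object",
    "Student_Population": "int64",
}


def check_students_final(df):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    if list(df.columns) != list(EXPECTED_DTYPES):
        problems.append("column names or order differ from the listing")
    else:
        for col, expected in EXPECTED_DTYPES.items():
            if str(df[col].dtype) != expected:
                problems.append(f"{col} has dtype {df[col].dtype}, expected {expected}")
        if not df["Student_ID"].is_monotonic_increasing:
            problems.append("rows are not sorted by Student_ID")
        if df["Student_ID"].duplicated().any():
            problems.append("duplicate Student_ID values found")
    if list(df.index) != list(range(len(df))):
        problems.append("index is not 0 through n - 1")
    if df.isna().any().any():
        problems.append("missing values remain")
    return problems


def text(values):
    # Force the object dtype so the check behaves the same across pandas versions.
    return pd.Series(values, dtype=object)


# Made-up two-row frame that satisfies every requirement.
toy = pd.DataFrame({
    "Student_ID": [1, 2],
    "Math": [70, 90], "Science": [74, 80], "English": [87, 85], "History": [63, 95],
    "Grade_Average": [73.5, 87.5],
    "Letter_Grade": pd.Categorical(["C", "B"],
                                   categories=["None", "F", "D", "C", "B", "A"],
                                   ordered=True),
    "Activity": text(["None", "Cheer"]),
    "School_Name": text(["Central", "Greenwich"]),
    "Address": text(["100 Central High Lane", "1 Greenwich Blvd"]),
    "Principal_Name": text(["Ray Smith", "Shannon Baker"]),
    "Mascot": text(["Eagle", "Bears"]),
    "Student_Population": [300, 1200],
})

print(check_students_final(toy))             # no problems expected
print(check_students_final(toy.iloc[::-1]))  # reversed rows break sorting and index
```

Calling `check_students_final(students_final)` on the real DataFrame would report any remaining mismatch as a readable message instead of a failed submission.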

In [97]:
# standard imports
import pandas as pd
import numpy as np

# Do not change this option; This allows the CodeGrade auto grading to function correctly
pd.set_option('display.max_columns', 20)

In [98]:
central = pd.read_csv('central.csv')
columbia = pd.read_csv('columbia.csv')
eastside = pd.read_csv('eastside.csv')
greenwich = pd.read_csv('greenwich.csv')
school_info = pd.read_csv('school_info.csv')
activities = pd.read_csv('activities.csv')
principals = pd.read_csv('principal.csv')

In [99]:
df = pd.concat([central, columbia, eastside, greenwich])
df

Unnamed: 0,Student_ID,School_Name,Math,Science,English,History
0,145581,Central,70,74,87,63
1,321209,Central,70,62,70,84
2,221982,Central,62,61,79,63
3,204249,Central,89,65,73,67
4,319950,Central,61,99,86,86
...,...,...,...,...,...,...
1022,213951,Greenwich,98,65,60,89
1023,205324,Greenwich,60,85,76,67
1024,209950,Greenwich,76,66,68,99
1025,364186,Greenwich,91,98,68,84
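
Note that the concatenated frame keeps each file's own row labels, so index values repeat across schools. A minimal sketch with two made-up frames shows the difference that `ignore_index=True` makes; the frames `a` and `b` are hypothetical stand-ins for the school files.

```python
import pandas as pd

# Two tiny made-up frames standing in for two of the school files.
a = pd.DataFrame({"Student_ID": [1, 2], "School_Name": ["Central", "Central"]})
b = pd.DataFrame({"Student_ID": [3, 4], "School_Name": ["Columbia", "Columbia"]})

stacked = pd.concat([a, b])                       # keeps each frame's own labels
renumbered = pd.concat([a, b], ignore_index=True) # builds a fresh 0..n-1 index

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

The repeated labels are harmless here because the later merges produce a new index anyway, but a fresh index avoids surprises with label-based selection such as `.loc[0]`.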


In [100]:
df2 = pd.merge(df, activities, left_on='Student_ID', right_on='ID', how='outer')
df2

Unnamed: 0,Student_ID,School_Name,Math,Science,English,History,ID,Activity
0,145581,Central,70,74,87,63,,
1,321209,Central,70,62,70,84,,
2,221982,Central,62,61,79,63,,
3,204249,Central,89,65,73,67,,
4,319950,Central,61,99,86,86,319950.0,Cheer
...,...,...,...,...,...,...,...,...
4072,213951,Greenwich,98,65,60,89,,
4073,205324,Greenwich,60,85,76,67,,
4074,209950,Greenwich,76,66,68,99,209950.0,Volleyball
4075,364186,Greenwich,91,98,68,84,,
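
An outer merge keeps unmatched rows from both sides, so an activity record with no matching student would add an extra row. The sketch below uses made-up data to contrast that with a left merge, which keeps exactly the student rows; the ID `999` is a hypothetical unmatched activity record, and `validate` and `indicator` are optional `pd.merge` parameters used here as a sanity check.

```python
import pandas as pd

# Made-up data: student 102 has an activity, ID 999 matches no student.
students = pd.DataFrame({"Student_ID": [101, 102, 103]})
activities = pd.DataFrame({"ID": [102, 999], "Activity": ["Cheer", "Chess"]})

outer = pd.merge(students, activities,
                 left_on="Student_ID", right_on="ID", how="outer")

left = pd.merge(students, activities,
                left_on="Student_ID", right_on="ID", how="left",
                validate="one_to_one",  # raises if a key repeats on either side
                indicator=True)         # adds a _merge column describing each row

print(len(outer), len(left))  # 4 3
print(left[["Student_ID", "Activity", "_merge"]])
```

With `how="left"` the row count always equals the number of students, provided each student has at most one activity record.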


In [101]:
df3 = pd.merge(df2, principals, left_on='School_Name', right_on='School', how='inner')
df3

Unnamed: 0,Student_ID,School_Name,Math,Science,English,History,ID,Activity,School,School_Address,Principal_Name
0,145581,Central,70,74,87,63,,,Central,100 Central High Lane,Ray Smith
1,321209,Central,70,62,70,84,,,Central,100 Central High Lane,Ray Smith
2,221982,Central,62,61,79,63,,,Central,100 Central High Lane,Ray Smith
3,204249,Central,89,65,73,67,,,Central,100 Central High Lane,Ray Smith
4,319950,Central,61,99,86,86,319950.0,Cheer,Central,100 Central High Lane,Ray Smith
...,...,...,...,...,...,...,...,...,...,...,...
4072,213951,Greenwich,98,65,60,89,,,Greenwich,1 Greenwich Blvd,Shannon Baker
4073,205324,Greenwich,60,85,76,67,,,Greenwich,1 Greenwich Blvd,Shannon Baker
4074,209950,Greenwich,76,66,68,99,209950.0,Volleyball,Greenwich,1 Greenwich Blvd,Shannon Baker
4075,364186,Greenwich,91,98,68,84,,,Greenwich,1 Greenwich Blvd,Shannon Baker


In [102]:
df4 = pd.merge(df3, school_info, left_on='School_Name', right_on='School', how='inner')
df4

Unnamed: 0,Student_ID,School_Name,Math,Science,English,History,ID,Activity,School_x,School_Address,Principal_Name,School_y,Address,Mascot,Student_Population
0,145581,Central,70,74,87,63,,,Central,100 Central High Lane,Ray Smith,Central,100 Central High Lane,Eagle,300
1,321209,Central,70,62,70,84,,,Central,100 Central High Lane,Ray Smith,Central,100 Central High Lane,Eagle,300
2,221982,Central,62,61,79,63,,,Central,100 Central High Lane,Ray Smith,Central,100 Central High Lane,Eagle,300
3,204249,Central,89,65,73,67,,,Central,100 Central High Lane,Ray Smith,Central,100 Central High Lane,Eagle,300
4,319950,Central,61,99,86,86,319950.0,Cheer,Central,100 Central High Lane,Ray Smith,Central,100 Central High Lane,Eagle,300
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4072,213951,Greenwich,98,65,60,89,,,Greenwich,1 Greenwich Blvd,Shannon Baker,Greenwich,1 Greenwich Blvd,Bears,1200
4073,205324,Greenwich,60,85,76,67,,,Greenwich,1 Greenwich Blvd,Shannon Baker,Greenwich,1 Greenwich Blvd,Bears,1200
4074,209950,Greenwich,76,66,68,99,209950.0,Volleyball,Greenwich,1 Greenwich Blvd,Shannon Baker,Greenwich,1 Greenwich Blvd,Bears,1200
4075,364186,Greenwich,91,98,68,84,,,Greenwich,1 Greenwich Blvd,Shannon Baker,Greenwich,1 Greenwich Blvd,Bears,1200
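
Because both lookup tables name their key column `School`, the second merge produces the suffixed columns `School_x` and `School_y`. One way to avoid them, sketched below on made-up rows, is to rename the key to `School_Name` and merge with `on=`; the `Westside` row is a hypothetical school outside the four being analyzed.

```python
import pandas as pd

students = pd.DataFrame({"Student_ID": [1, 2],
                         "School_Name": ["Central", "Greenwich"]})
principals = pd.DataFrame({
    "School": ["Central", "Greenwich", "Westside"],  # Westside is made up
    "School_Address": ["100 Central High Lane", "1 Greenwich Blvd", "unknown"],
    "Principal_Name": ["Ray Smith", "Shannon Baker", "unknown"],
})
school_info = pd.DataFrame({
    "School": ["Central", "Greenwich"],
    "Address": ["100 Central High Lane", "1 Greenwich Blvd"],
    "Mascot": ["Eagle", "Bears"],
    "Student_Population": [300, 1200],
})

# Rename the key and keep only the columns that are needed from each table.
principal_lookup = principals.rename(columns={"School": "School_Name"})[
    ["School_Name", "Principal_Name"]]
info_lookup = school_info.rename(columns={"School": "School_Name"})

merged = (students
          .merge(principal_lookup, on="School_Name", how="left")
          .merge(info_lookup, on="School_Name", how="left"))

print(merged.columns.tolist())
```

Merging on a shared column name keeps a single key column, so there is nothing to drop or de-suffix afterwards.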


In [103]:
df4 = df4[['Student_ID', 'Math', 'Science', 'English', 'History', 'Activity', 'School_Name', 'Address', 'Principal_Name', 'Mascot', 'Student_Population']]
df4

Unnamed: 0,Student_ID,Math,Science,English,History,Activity,School_Name,Address,Principal_Name,Mascot,Student_Population
0,145581,70,74,87,63,,Central,100 Central High Lane,Ray Smith,Eagle,300
1,321209,70,62,70,84,,Central,100 Central High Lane,Ray Smith,Eagle,300
2,221982,62,61,79,63,,Central,100 Central High Lane,Ray Smith,Eagle,300
3,204249,89,65,73,67,,Central,100 Central High Lane,Ray Smith,Eagle,300
4,319950,61,99,86,86,Cheer,Central,100 Central High Lane,Ray Smith,Eagle,300
...,...,...,...,...,...,...,...,...,...,...,...
4072,213951,98,65,60,89,,Greenwich,1 Greenwich Blvd,Shannon Baker,Bears,1200
4073,205324,60,85,76,67,,Greenwich,1 Greenwich Blvd,Shannon Baker,Bears,1200
4074,209950,76,66,68,99,Volleyball,Greenwich,1 Greenwich Blvd,Shannon Baker,Bears,1200
4075,364186,91,98,68,84,,Greenwich,1 Greenwich Blvd,Shannon Baker,Bears,1200


In [104]:
scores = df4[['Math', 'Science', 'English', 'History']]
scores

Unnamed: 0,Math,Science,English,History
0,70,74,87,63
1,70,62,70,84
2,62,61,79,63
3,89,65,73,67
4,61,99,86,86
...,...,...,...,...
4072,98,65,60,89
4073,60,85,76,67
4074,76,66,68,99
4075,91,98,68,84


In [105]:
df4['Grade_Average'] = scores.mean(axis=1)
df4['Grade_Average']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df4['Grade_Average'] = scores.mean(axis=1)


0       73.50
1       71.50
2       66.25
3       73.50
4       83.00
        ...  
4072    78.00
4073    72.00
4074    77.25
4075    85.25
4076    82.25
Name: Grade_Average, Length: 4077, dtype: float64
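
The warning above appears because `df4` was created by selecting columns from another DataFrame, so pandas cannot tell whether the assignment is meant to reach the original. A minimal sketch of one common remedy, taking an explicit `.copy()` of the selection, is shown below on made-up data; the column `Extra` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Student_ID": [1, 2],
    "Math": [70, 90], "Science": [74, 80], "English": [87, 85], "History": [63, 95],
    "Extra": ["a", "b"],  # made-up column that gets dropped by the selection
})
score_cols = ["Math", "Science", "English", "History"]

# .copy() makes the selection an independent DataFrame,
# so adding a column to it is unambiguous.
subset = df[["Student_ID"] + score_cols].copy()
subset["Grade_Average"] = subset[score_cols].mean(axis=1)

print(subset["Grade_Average"].tolist())  # [73.5, 87.5]
```

The original `df` is left untouched, which is usually what is intended when a subset is assigned back to a working variable.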

In [106]:
df4.head()

Unnamed: 0,Student_ID,Math,Science,English,History,Activity,School_Name,Address,Principal_Name,Mascot,Student_Population,Grade_Average
0,145581,70,74,87,63,,Central,100 Central High Lane,Ray Smith,Eagle,300,73.5
1,321209,70,62,70,84,,Central,100 Central High Lane,Ray Smith,Eagle,300,71.5
2,221982,62,61,79,63,,Central,100 Central High Lane,Ray Smith,Eagle,300,66.25
3,204249,89,65,73,67,,Central,100 Central High Lane,Ray Smith,Eagle,300,73.5
4,319950,61,99,86,86,Cheer,Central,100 Central High Lane,Ray Smith,Eagle,300,83.0


In [107]:
# left-closed bins: [0, 60) -> F, [60, 70) -> D, [70, 80) -> C, [80, 90) -> B, [90, inf) -> A
bins = [0, 60, 70, 80, 90, np.inf]
df4['Letter_Grade'] = pd.cut(df4.Grade_Average, bins, right=False, labels=['F', 'D', 'C', 'B', 'A'])

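
How the bin edges are chosen decides what happens at the boundaries. By default `pd.cut` uses right-closed intervals and excludes the lowest edge, so an average of exactly 0, or any average above the top edge, falls outside every bin and becomes missing. The sketch below compares a top edge below 100 (98 here, purely for illustration) with left-closed, open-ended bins on a few made-up averages.

```python
import numpy as np
import pandas as pd

labels = ["F", "D", "C", "B", "A"]
averages = pd.Series([0.0, 59.75, 60.0, 89.75, 90.0, 98.25, 100.0])  # made-up values

# Right-closed bins with a capped top edge: 0.0, 98.25 and 100.0 match no bin.
capped = pd.cut(averages, [0, 59.99, 69.99, 79.99, 89.99, 98], labels=labels)

# Left-closed bins with an open top: every average from 0 upward gets a grade.
open_ended = pd.cut(averages, [0, 60, 70, 80, 90, np.inf], right=False, labels=labels)

print(capped.isna().tolist())  # [True, False, False, False, False, True, True]
print(open_ended.tolist())     # ['F', 'F', 'D', 'B', 'A', 'A', 'A']
```

With `right=False` the edges read exactly like the grading rule: 60 starts the `D` range, 90 starts the `A` range, and no upper cap is needed.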


In [108]:
df4['Letter_Grade'] = df4['Letter_Grade'].cat.add_categories('None')
df4['Letter_Grade'] = df4['Letter_Grade'].cat.reorder_categories(['None', 'F', 'D', 'C', 'B', 'A'])
df4['Letter_Grade'] = df4['Letter_Grade'].fillna('None')


In [109]:
# check the resulting categorical column and its category order
df4['Letter_Grade']

0       C
1       C
2       D
3       C
4       B
       ..
4072    C
4073    C
4074    C
4075    B
4076    B
Name: Letter_Grade, Length: 4077, dtype: category
Categories (6, object): ['None' < 'F' < 'D' < 'C' < 'B' < 'A']
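
The category handling can be tried in isolation. The following sketch uses a made-up three-value series with one missing grade and walks through the same three steps: add the `None` category, move it to the front of the order, then fill the gap.

```python
import pandas as pd

# Made-up grades with one missing value.
grades = pd.Series(pd.Categorical(["C", None, "A"],
                                  categories=["F", "D", "C", "B", "A"],
                                  ordered=True))

grades = grades.cat.add_categories("None")  # appended at the end of the order
grades = grades.cat.reorder_categories(["None", "F", "D", "C", "B", "A"],
                                       ordered=True)
grades = grades.fillna("None")              # allowed only because the category exists

print(grades.cat.categories)
print(grades.tolist())  # ['C', 'None', 'A']
```

The order of the first two steps matters: `fillna` with a value that is not yet a category raises an error on a categorical column.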

In [110]:
df4 = df4[['Student_ID', 'Math', 'Science', 'English', 'History', 'Grade_Average', 'Letter_Grade','Activity', 'School_Name', 'Address', 'Principal_Name', 'Mascot', 'Student_Population' ]]


In [111]:
# cast the ID and score columns in a single call; astype returns a new DataFrame
df4 = df4.astype({'Student_ID': 'int64', 'Math': 'int64', 'Science': 'int64',
                  'English': 'int64', 'History': 'int64'})


In [112]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4077 entries, 0 to 4076
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Student_ID          4077 non-null   int64   
 1   Math                4077 non-null   int64   
 2   Science             4077 non-null   int64   
 3   English             4077 non-null   int64   
 4   History             4077 non-null   int64   
 5   Grade_Average       4077 non-null   float64 
 6   Letter_Grade        4077 non-null   category
 7   Activity            1711 non-null   object  
 8   School_Name         4077 non-null   object  
 9   Address             4077 non-null   object  
 10  Principal_Name      4077 non-null   object  
 11  Mascot              4077 non-null   object  
 12  Student_Population  4077 non-null   int64   
dtypes: category(1), float64(1), int64(6), object(5)
memory usage: 418.3+ KB


In [113]:
# confirm that no Student_ID appears more than once (expected result: 0)
df4.duplicated(subset='Student_ID').sum()

0
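
If the count were not zero, it would help to see which IDs repeat. A small sketch on made-up IDs, where the repeated `100213` is deliberately planted, shows `is_unique` as a quick yes/no check and `duplicated(keep=False)` for listing every offending row.

```python
import pandas as pd

# Made-up IDs with one deliberate repeat.
ids = pd.Series([100089, 100213, 100213, 100300], name="Student_ID")

print(ids.is_unique)  # False

# keep=False marks every occurrence of a repeated value, not just the later ones.
repeated = ids[ids.duplicated(keep=False)]
print(repeated.tolist())  # [100213, 100213]
```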

In [114]:
df4

Unnamed: 0,Student_ID,Math,Science,English,History,Grade_Average,Letter_Grade,Activity,School_Name,Address,Principal_Name,Mascot,Student_Population
0,145581,70,74,87,63,73.50,C,,Central,100 Central High Lane,Ray Smith,Eagle,300
1,321209,70,62,70,84,71.50,C,,Central,100 Central High Lane,Ray Smith,Eagle,300
2,221982,62,61,79,63,66.25,D,,Central,100 Central High Lane,Ray Smith,Eagle,300
3,204249,89,65,73,67,73.50,C,,Central,100 Central High Lane,Ray Smith,Eagle,300
4,319950,61,99,86,86,83.00,B,Cheer,Central,100 Central High Lane,Ray Smith,Eagle,300
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4072,213951,98,65,60,89,78.00,C,,Greenwich,1 Greenwich Blvd,Shannon Baker,Bears,1200
4073,205324,60,85,76,67,72.00,C,,Greenwich,1 Greenwich Blvd,Shannon Baker,Bears,1200
4074,209950,76,66,68,99,77.25,C,Volleyball,Greenwich,1 Greenwich Blvd,Shannon Baker,Bears,1200
4075,364186,91,98,68,84,85.25,B,,Greenwich,1 Greenwich Blvd,Shannon Baker,Bears,1200


In [115]:
# fill the remaining missing values (students with no activity) with the string 'None',
# then sort by Student_ID and renumber the index from 0
students_final = df4.fillna('None').sort_values(by='Student_ID').reset_index(drop=True)
students_final

Unnamed: 0,Student_ID,Math,Science,English,History,Grade_Average,Letter_Grade,Activity,School_Name,Address,Principal_Name,Mascot,Student_Population
0,100089,91,96,88,62,84.25,B,Other_Club,Central,100 Central High Lane,Ray Smith,Eagle,300
1,100213,85,72,70,76,75.75,C,None,Eastside,9755 Hwy 60,Dwayne Anderson,Raptors,1000
2,100300,65,99,77,76,79.25,C,Football,Eastside,9755 Hwy 60,Dwayne Anderson,Raptors,1000
3,100355,83,75,64,99,80.25,B,None,Greenwich,1 Greenwich Blvd,Shannon Baker,Bears,1200
4,100359,61,77,73,83,73.50,C,None,Greenwich,1 Greenwich Blvd,Shannon Baker,Bears,1200
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4072,399853,93,69,97,63,80.50,B,None,Columbia,19 East Avenue,Patricia Rogers,Tigers,700
4073,399872,83,95,76,81,83.75,B,Baseball,Columbia,19 East Avenue,Patricia Rogers,Tigers,700
4074,399905,88,96,66,98,87.00,B,None,Columbia,19 East Avenue,Patricia Rogers,Tigers,700
4075,399915,76,97,96,61,82.50,B,None,Central,100 Central High Lane,Ray Smith,Eagle,300
