# PyCity Schools Analysis

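 - Higher per-student spending does not appear to yield stronger test results. Schools spending under $585 per student average a 95.0% overall passing rate, compared with 73.6% for schools in the $645-675 range.
 - Larger schools correlate with lower test scores. Schools with 2,000-5,000 students average a 76.4% overall passing rate, against roughly 95% for small and medium schools. This is a correlation, not evidence that size itself lowers scores.
 - Charter schools on the whole outperform district schools (95.1% versus 73.7% overall passing rate). This could be due to a variety of factors, and a deeper analysis of student demographics at each type of school would be needed to explain the gap.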

## Setup

#### Import dependencies

In [1]:
import pandas as pd
import numpy as np

#### Load CSV files into dataframes

In [2]:
school_data_df = pd.read_csv('./Resources/schools_complete.csv')
student_data_df = pd.read_csv('./Resources/students_complete.csv')

## Join school and student data

In [3]:
full_data_df = pd.merge(student_data_df, school_data_df, how='left', on='school_name')
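
A left join like the one above silently keeps students whose school name has no match and duplicates rows if a school appears twice in the school table. The following is a minimal sketch of how those cases could be checked, using small made-up tables (`students`, `schools`) rather than the real CSV files. The `validate` and `indicator` arguments are standard `pd.merge` parameters.

```python
import pandas as pd

# Tiny stand-in tables (made-up values, not the real CSV contents)
students = pd.DataFrame({
    'student_name': ['A', 'B', 'C'],
    'school_name': ['North High', 'North High', 'South High'],
})
schools = pd.DataFrame({
    'school_name': ['North High', 'South High'],
    'size': [2, 1],
})

# validate='many_to_one' raises MergeError if a school name is duplicated
# in the right-hand table; indicator=True adds a '_merge' column.
merged = pd.merge(students, schools, how='left', on='school_name',
                  validate='many_to_one', indicator=True)

# Rows flagged 'left_only' are students whose school had no match
unmatched = merged[merged['_merge'] == 'left_only']
row_count_preserved = len(merged) == len(students)
print(len(unmatched), row_count_preserved)
```

If `unmatched` is empty and the row count is preserved, every student was paired with exactly one school.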

#### Clean column headers

In [4]:
# Rename a couple columns to avoid spaces and make column contents clearer
full_data_df = full_data_df.rename(columns={'Student ID': 'student_id', 'School ID': 'school_id', 'type': 'school_type', 'size': 'school_size', 'budget': 'school_budget'})

# Reorder columns to be in a more natural order
full_data_df=full_data_df[['student_id', 'student_name', 'gender', 'grade', 'reading_score', 'math_score', 'school_id', 'school_name', 'school_type', 'school_size', 'school_budget']]

#### Print head

In [5]:
full_data_df.head()

,student_id,student_name,gender,grade,reading_score,math_score,school_id,school_name,school_type,school_size,school_budget
0,0,Paul Bradley,M,9th,66,79,0,Huang High School,District,2917,1910635
1,1,Victor Smith,M,12th,94,61,0,Huang High School,District,2917,1910635
2,2,Kevin Rodriguez,M,12th,90,60,0,Huang High School,District,2917,1910635
3,3,Dr. Richard Scott,M,12th,67,58,0,Huang High School,District,2917,1910635
4,4,Bonnie Ray,F,9th,97,84,0,Huang High School,District,2917,1910635


## District summary

#### Count the students passing math and reading

In [6]:
# Define counter variables, set initial values to 0
numPassMath = 0
numPassReading = 0

# Define for loop to count # of students passing in each subject respectively
for x in full_data_df['math_score']:
    if x >= 70:
        numPassMath += 1

for x in full_data_df['reading_score']:
    if x >= 70:
        numPassReading += 1
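
The two counting loops above can also be written as boolean masks, which is the more common pandas idiom. This is a sketch on the five rows shown in the head output, not on the full dataset. The names `scores` and `PASSING_SCORE` are illustrative, and the threshold of 70 is taken from the loops.

```python
import pandas as pd

# The five math/reading scores from the head() output above
scores = pd.DataFrame({
    'math_score': [79, 61, 60, 58, 84],
    'reading_score': [66, 94, 90, 67, 97],
})

PASSING_SCORE = 70  # same threshold as the loops

# Comparing a column to a scalar gives a boolean Series
pass_math = scores['math_score'] >= PASSING_SCORE
pass_reading = scores['reading_score'] >= PASSING_SCORE

# True counts as 1, so sum() is the number passing
num_pass_math = int(pass_math.sum())
num_pass_reading = int(pass_reading.sum())

# mean() of a boolean Series is the fraction passing
pct_pass_math = 100 * pass_math.mean()
print(num_pass_math, num_pass_reading)
```

Because the mean of a boolean Series is already a fraction of rows, this form divides by the number of students by construction.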

#### Store variables for each metric which will populate the district summary dataframe

In [7]:
totalSchools = full_data_df.nunique().loc['school_id']
# Count distinct students (not schools) so the passing rates are fractions of the student body
totalStudents = full_data_df.nunique().loc['student_id']
totalBudget = full_data_df['school_budget'].unique().sum()
meanMathScore = full_data_df['math_score'].mean()
meanReadingScore = full_data_df['reading_score'].mean()
percentPassMath = numPassMath/totalStudents
percentPassReading = numPassReading/totalStudents
overallPassRate = (percentPassMath + percentPassReading)/2 

#### Build district summary dataframe

In [8]:
district_summary_df = pd.DataFrame(
    {
         'Total Schools': totalSchools,
         'Total Students': '{:,}'.format(totalStudents),
         'Total Budget': '${:,.2f}'.format(totalBudget),
         'Average Math Score': meanMathScore,
         'Average Reading Score': meanReadingScore,
         '% Passing Math': '{0:.6f}'.format(percentPassMath*100),
         '% Passing Reading': '{0:.6f}'.format(percentPassReading*100),
         '% Overall Passing Rate': '{0:.6f}'.format(overallPassRate*100)
    }, 
    
    index=['Stats'])

#### Print district summary dataframe

In [9]:
district_summary_df

,Total Schools,Total Students,Total Budget,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,% Overall Passing Rate
Stats,15,39170,"$24,649,428.00",78.985371,81.87784,74.980853,85.805463,80.393158


## School summary

#### School names, school types, and number of students

In [12]:
# Define relevant dataframe to pull this information from
name_type = full_data_df.groupby(['school_name', 'school_type']).count()
name_type_df = pd.DataFrame(name_type)

# Pull my school names, school types, and number of students data from this dataframe
schoolNames = name_type_df.index.get_level_values('school_name').tolist() 
schoolTypes = name_type_df.index.get_level_values('school_type').tolist() 
numStudents = name_type_df['student_id'].tolist()

#### School budget

In [13]:
# Find total school budget using a different groupby object transferred to data frame 
name_budget = full_data_df.groupby(['school_name', 'school_budget']).count() 
name_budget_df = pd.DataFrame(name_budget)
totalSchoolBudget = name_budget_df.index.get_level_values('school_budget').tolist()

# Calculate per student budget
perStudentBudget=[]
tracker = 0

for j in range(len(totalSchoolBudget)):
    tracker = totalSchoolBudget[j]/numStudents[j]
    perStudentBudget.append(tracker)
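
The per-student budget can also be derived in a single grouped aggregation, which keeps the budget and the head count on the same row instead of relying on two lists sharing the same order. This is a sketch under assumptions: the frame `df` and its values are made up, and only the column names mirror `full_data_df`.

```python
import pandas as pd

# Made-up stand-in for full_data_df: one row per student
df = pd.DataFrame({
    'school_name': ['North High'] * 4 + ['South High'] * 2,
    'school_budget': [2400] * 4 + [1300] * 2,
    'student_id': range(6),
})

# Named aggregation: count students and take the (repeated) budget once
per_school = df.groupby('school_name').agg(
    num_students=('student_id', 'count'),
    total_budget=('school_budget', 'first'),
)

# Column-wise division, aligned on the school_name index
per_school['per_student_budget'] = (
    per_school['total_budget'] / per_school['num_students']
)
print(per_school)
```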

#### Math and reading scores

In [14]:
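# Average only the score columns; calling .mean() on the whole frame fails on text columns in pandas 2.x
averages_df = full_data_df.groupby(['school_name'])[['reading_score', 'math_score']].mean()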
meanReadingScorePerSchool = averages_df['reading_score'].tolist()
meanMathScorePerSchool = averages_df['math_score'].tolist()

### Percent of students passing math and reading
- Step 1: Create dataframes which just have students passing in Math, and Reading, respectively.
- Step 2: Group each dataframe by school to count the students passing math and reading, respectively.
- Step 3: Take these values and divide by number of students at school (already calculated above).
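
The three steps above can be condensed into one grouped mean over boolean columns. This is a sketch on made-up scores, not the notebook's actual method. The names `passed` and `rates` are illustrative, and the overall rate is the average of the two subject rates, matching the definition used in this notebook.

```python
import pandas as pd

# Made-up scores for two schools
df = pd.DataFrame({
    'school_name': ['North High'] * 4 + ['South High'] * 2,
    'math_score': [90, 65, 72, 50, 80, 69],
    'reading_score': [70, 88, 60, 75, 95, 91],
})

# Steps 1 and 2: flag passing students instead of filtering them out
passed = df.assign(
    pass_math=df['math_score'] >= 70,
    pass_reading=df['reading_score'] >= 70,
)

# Step 3: the mean of a boolean column is the share of students passing
rates = passed.groupby('school_name')[['pass_math', 'pass_reading']].mean() * 100
rates['overall'] = (rates['pass_math'] + rates['pass_reading']) / 2
print(rates)
```

Keeping every student in the frame means a school with no passing students still appears, with a rate of 0, rather than dropping out of the grouped result.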

#### Step 1: Create dataframes which just have students passing in Math, and Reading, respectively

In [16]:
# Define booleans to filter on
passMath = full_data_df['math_score'] >= 70
passReading = full_data_df['reading_score'] >= 70

# Define dataframes
passing_math_df = full_data_df[passMath]
passing_reading_df = full_data_df[passReading]

#### Step 2: Group each dataframe by school to count the students passing math and reading, respectively.

#### Math

In [17]:
# Build initial dataframe
name_math = passing_math_df.groupby(['school_name', 'math_score']).count()
name_math_df = pd.DataFrame(name_math)
name_math_df = name_math_df.rename(columns={'student_id': 'num_students'})
name_math_df = pd.DataFrame(name_math_df['num_students'])

# Reduce index complexity
name_math_df = name_math_df.reset_index(level='math_score')
name_math_df = name_math_df.reset_index(level='school_name')

# Build final dataframe
sum_math = name_math_df.groupby(['school_name']).sum()
sum_math_df = pd.DataFrame(sum_math)
numPassMathPerSchool = sum_math_df['num_students'].tolist()

#### Reading 

In [18]:
# Build initial dataframe
name_read = passing_reading_df.groupby(['school_name', 'reading_score']).count()
name_read_df = pd.DataFrame(name_read)
name_read_df = name_read_df.rename(columns={'student_id': 'num_students'})
name_read_df = pd.DataFrame(name_read_df['num_students'])

# Reduce index complexity
name_read_df = name_read_df.reset_index(level='reading_score')
name_read_df = name_read_df.reset_index(level='school_name')

# Build final dataframe
sum_read = name_read_df.groupby(['school_name']).sum()
sum_read_df = pd.DataFrame(sum_read)
numPassReadingPerSchool=sum_read_df['num_students'].tolist()

#### Step 3: Take these values and divide by number of students at school (already calculated above).

In [19]:
percentPassMathPerSchool=[] # Define list of percentage of students passing math
percentPassReadingPerSchool=[] # Define list of percentage of students passing reading
overallPassRatePerSchool=[] # Define list of overall passing rates 
mathTracker=0 # Define math tracker variable
readingTracker=0 # Define reading tracker variable
overallTracker=0 # Define overall pass rate tracker variable

for j in range(len(schoolNames)):
    mathTracker=numPassMathPerSchool[j]/numStudents[j]
    readingTracker=numPassReadingPerSchool[j]/numStudents[j]
    overallTracker=(mathTracker+readingTracker)/2
    percentPassMathPerSchool.append(mathTracker)
    percentPassReadingPerSchool.append(readingTracker)    
    overallPassRatePerSchool.append(overallTracker)

#### Store variables for each metric which will populate the school summary dataframe

In [22]:
numStudentsFormatted = ['{:,}'.format(x) for x in numStudents]
totalSchoolBudgetFormatted = ['${:,.2f}'.format(x) for x in totalSchoolBudget]
perStudentBudgetFormatted = ['${:,.2f}'.format(x) for x in perStudentBudget]
percentPassMathPerSchoolFormatted = ['{0:.6f}'.format(100*x) for x in percentPassMathPerSchool]
percentPassReadingPerSchoolFormatted = ['{0:.6f}'.format(100*x) for x in percentPassReadingPerSchool]
overallPassRatePerSchoolFormatted = ['{0:.6f}'.format(100*x) for x in overallPassRatePerSchool]

#### Build school summary dataframe

In [23]:
# Build school summary dataframe
school_summary_df=pd.DataFrame(
    {
        'School Name': schoolNames,
        'School Type': schoolTypes,
        'Number of Students': numStudentsFormatted,
        'Total Budget': totalSchoolBudgetFormatted,
        'Budget Per Student': perStudentBudgetFormatted,
        'Average Math Score': meanMathScorePerSchool,
        'Average Reading Score': meanReadingScorePerSchool,
        '% Passing Math': percentPassMathPerSchoolFormatted,
        '% Passing Reading': percentPassReadingPerSchoolFormatted,
        '% Overall Passing Rate': overallPassRatePerSchoolFormatted
    })

#### Print school summary dataframe

In [24]:
school_summary_df

,School Name,School Type,Number of Students,Total Budget,Budget Per Student,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,% Overall Passing Rate
0,Bailey High School,District,4976,"$3,124,928.00",$628.00,77.048432,81.033963,66.680064,81.93328,74.306672
1,Cabrera High School,Charter,1858,"$1,081,356.00",$582.00,83.061895,83.97578,94.133477,97.039828,95.586652
2,Figueroa High School,District,2949,"$1,884,411.00",$639.00,76.711767,81.15802,65.988471,80.739234,73.363852
3,Ford High School,District,2739,"$1,763,916.00",$644.00,77.102592,80.746258,68.309602,79.299014,73.804308
4,Griffin High School,Charter,1468,"$917,500.00",$625.00,83.351499,83.816757,93.392371,97.138965,95.265668
5,Hernandez High School,District,4635,"$3,022,020.00",$652.00,77.289752,80.934412,66.752967,80.862999,73.807983
6,Holden High School,Charter,427,"$248,087.00",$581.00,83.803279,83.814988,92.505855,96.252927,94.379391
7,Huang High School,District,2917,"$1,910,635.00",$655.00,76.629414,81.182722,65.683922,81.316421,73.500171
8,Johnson High School,District,4761,"$3,094,650.00",$650.00,77.072464,80.966394,66.057551,81.222432,73.639992
9,Pena High School,Charter,962,"$585,858.00",$609.00,83.839917,84.044699,94.594595,95.945946,95.27027


## Top and bottom performing schools (passing rate)

#### Top 5 schools

In [27]:
school_summary_df['% Overall Passing Rate'] = pd.to_numeric(school_summary_df['% Overall Passing Rate'])
top5_df = school_summary_df.nlargest(5, ['% Overall Passing Rate']) 
top5_df.set_index('School Name')

School Name,School Type,Number of Students,Total Budget,Budget Per Student,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,% Overall Passing Rate
Cabrera High School,Charter,1858,"$1,081,356.00",$582.00,83.061895,83.97578,94.133477,97.039828,95.586652
Thomas High School,Charter,1635,"$1,043,130.00",$638.00,83.418349,83.84893,93.272171,97.308869,95.29052
Pena High School,Charter,962,"$585,858.00",$609.00,83.839917,84.044699,94.594595,95.945946,95.27027
Griffin High School,Charter,1468,"$917,500.00",$625.00,83.351499,83.816757,93.392371,97.138965,95.265668
Wilson High School,Charter,2283,"$1,319,574.00",$578.00,83.274201,83.989488,93.867718,96.539641,95.203679


#### Bottom 5 schools

In [28]:
bottom5_df = school_summary_df.nsmallest(5, ['% Overall Passing Rate']) 
bottom5_df.set_index('School Name')

School Name,School Type,Number of Students,Total Budget,Budget Per Student,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,% Overall Passing Rate
Rodriguez High School,District,3999,"$2,547,363.00",$637.00,76.842711,80.744686,66.366592,80.220055,73.293323
Figueroa High School,District,2949,"$1,884,411.00",$639.00,76.711767,81.15802,65.988471,80.739234,73.363852
Huang High School,District,2917,"$1,910,635.00",$655.00,76.629414,81.182722,65.683922,81.316421,73.500171
Johnson High School,District,4761,"$3,094,650.00",$650.00,77.072464,80.966394,66.057551,81.222432,73.639992
Ford High School,District,2739,"$1,763,916.00",$644.00,77.102592,80.746258,68.309602,79.299014,73.804308


## Math and reading scores by grade

#### Math scores

In [37]:
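# Select the score column before averaging and pass pivot arguments by keyword (required in pandas 2.x)
full_data_df.groupby(["school_name","grade"])["math_score"].mean().reset_index().pivot(index="school_name", columns="grade", values="math_score")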

school_name,10th,11th,12th,9th
Bailey High School,76.996772,77.515588,76.492218,77.083676
Cabrera High School,83.154506,82.76556,83.277487,83.094697
Figueroa High School,76.539974,76.884344,77.151369,76.403037
Ford High School,77.672316,76.918058,76.179963,77.361345
Griffin High School,84.229064,83.842105,83.356164,82.04401
Hernandez High School,77.337408,77.136029,77.186567,77.438495
Holden High School,83.429825,85.0,82.855422,83.787402
Huang High School,75.908735,76.446602,77.225641,77.027251
Johnson High School,76.691117,77.491653,76.863248,77.187857
Pena High School,83.372,84.328125,84.121547,83.625455


#### Reading scores

In [38]:
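# Select the score column before averaging and pass pivot arguments by keyword (required in pandas 2.x)
full_data_df.groupby(["school_name","grade"])["reading_score"].mean().reset_index().pivot(index="school_name", columns="grade", values="reading_score")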

school_name,10th,11th,12th,9th
Bailey High School,80.907183,80.945643,80.912451,81.303155
Cabrera High School,84.253219,83.788382,84.287958,83.676136
Figueroa High School,81.408912,80.640339,81.384863,81.198598
Ford High School,81.262712,80.403642,80.662338,80.632653
Griffin High School,83.706897,84.288089,84.013699,83.369193
Hernandez High School,80.660147,81.39614,80.857143,80.86686
Holden High School,83.324561,83.815534,84.698795,83.677165
Huang High School,81.512386,81.417476,80.305983,81.290284
Johnson High School,80.773431,80.616027,81.227564,81.260714
Pena High School,83.612,84.335938,84.59116,83.807273
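
In both tables the grade columns come out in string order (10th, 11th, 12th, 9th). A possible way to put them in school order is to reindex the pivoted columns, sketched here on made-up scores. The `grade_order` list is an assumption about the desired order, and `pivot_table` is used as an alternative to the groupby-then-pivot chain above.

```python
import pandas as pd

# Made-up scores: one student per school and grade
df = pd.DataFrame({
    'school_name': ['North High'] * 4 + ['South High'] * 4,
    'grade': ['9th', '10th', '11th', '12th'] * 2,
    'math_score': [70, 80, 90, 60, 75, 85, 95, 65],
})

# Assumed display order for the grade columns
grade_order = ['9th', '10th', '11th', '12th']

# pivot_table groups and averages in one call
by_grade = df.pivot_table(index='school_name', columns='grade',
                          values='math_score', aggfunc='mean')

# reindex rearranges the columns without changing any values
by_grade = by_grade.reindex(columns=grade_order)
print(by_grade)
```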


## Scores by school spending

#### Build dataframe

In [40]:
school_spending_df = pd.DataFrame({
        'Budget Per Student': perStudentBudget,
        'Average Math Score': meanMathScorePerSchool,
        'Average Reading Score': meanReadingScorePerSchool,
        '% Passing Math': [100*x for x in percentPassMathPerSchool],
        '% Passing Reading': [100* x for x in percentPassReadingPerSchool],
        '% Overall Passing Rate': [100*x for x in overallPassRatePerSchool]
    })

#### Define bins and add this data to dataframe

In [42]:
# Define bins and bin labels
spending_bins = [0, 585, 615, 645, 675]
group_names = ["<$585", "$585-615", "$615-645", "$645-675"]

# Bin the data and add an extra column to the data frame
school_spending_df['Spending Ranges (Per Student)'] = pd.cut(school_spending_df['Budget Per Student'], bins=spending_bins, labels=group_names)
group = school_spending_df.groupby('Spending Ranges (Per Student)')
group[['Average Math Score', 'Average Reading Score', '% Passing Math', '% Passing Reading', '% Overall Passing Rate']].mean()

Spending Ranges (Per Student),Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,% Overall Passing Rate
<$585,83.455399,83.933814,93.460096,96.610877,95.035486
$585-615,83.599686,83.885211,94.230858,95.900287,95.065572
$615-645,79.079225,81.891436,75.668212,86.106569,80.887391
$645-675,76.99721,81.027843,66.164813,81.133951,73.649382
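
One detail of `pd.cut` affects how the labels should be read: by default each bin includes its right edge, so a budget of exactly $585 lands in the "<$585" group. The sketch below shows this on a few made-up budget values, reusing the bins and labels from the cell above. The `right=False` variant is only an illustration of the alternative, not what the notebook uses.

```python
import pandas as pd

spending_bins = [0, 585, 615, 645, 675]
group_names = ["<$585", "$585-615", "$615-645", "$645-675"]

# Made-up per-student budgets, including values on bin edges
budgets = pd.Series([578.0, 585.0, 585.01, 645.0, 655.0])

# Default right=True: intervals are (0, 585], (585, 615], ...
default = pd.cut(budgets, bins=spending_bins, labels=group_names)

# right=False: intervals are [0, 585), [585, 615), ...
left_closed = pd.cut(budgets, bins=spending_bins, labels=group_names,
                     right=False)

print(pd.DataFrame({'budget': budgets,
                    'default': default,
                    'left_closed': left_closed}))
```

None of the per-student budgets visible in the school summary output sits exactly on a bin edge, so for those schools either setting gives the same grouping.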


## Scores by school size

#### Build dataframe

In [47]:
school_size_df=pd.DataFrame({
        'Number of Students': numStudents,
        'Average Math Score': meanMathScorePerSchool,
        'Average Reading Score': meanReadingScorePerSchool,
        '% Passing Math': [100*x for x in percentPassMathPerSchool],
        '% Passing Reading': [100* x for x in percentPassReadingPerSchool],
        '% Overall Passing Rate': [100*x for x in overallPassRatePerSchool]
    })

#### Define bins and add this data to dataframe

In [48]:
# Define bins and bin labels
size_bins = [0, 1000, 2000, 5000]
group_names = ["Small (<1000)", "Medium (1000-2000)", "Large (2000-5000)"]

# Bin the data and add an extra column to the data frame
school_size_df['School Size']=pd.cut(school_size_df['Number of Students'], bins=size_bins, labels=group_names)
group=school_size_df.groupby('School Size')
group[['Average Math Score', 'Average Reading Score', '% Passing Math', '% Passing Reading', '% Overall Passing Rate']].mean()

School Size,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,% Overall Passing Rate
Small (<1000),83.821598,83.929843,93.550225,96.099437,94.824831
Medium (1000-2000),83.374684,83.864438,93.599695,96.79068,95.195187
Large (2000-5000),77.746417,81.344493,69.963361,82.766634,76.364998


## Schools by school type

#### Build dataframe

In [49]:
school_type_df=pd.DataFrame({
        'School Type': schoolTypes,
        'Average Math Score': meanMathScorePerSchool,
        'Average Reading Score': meanReadingScorePerSchool,
        '% Passing Math': [100*x for x in percentPassMathPerSchool],
        '% Passing Reading': [100*x for x in percentPassReadingPerSchool],
        '% Overall Passing Rate': [100*x for x in overallPassRatePerSchool]
    })

#### Group by school type

In [50]:
group=school_type_df.groupby('School Type')
group[['Average Math Score', 'Average Reading Score', '% Passing Math', '% Passing Reading', '% Overall Passing Rate']].mean()

School Type,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,% Overall Passing Rate
Charter,83.473852,83.896421,93.62083,96.586489,95.10366
District,76.956733,80.966636,66.548453,80.799062,73.673757
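
The grouped tables in the spending, size, and type sections average school-level figures, so each school counts equally regardless of enrollment. The sketch below uses made-up numbers to show how that differs from a student-weighted average. The column names `num_students` and `avg_math` are illustrative.

```python
import pandas as pd

# Two made-up schools of the same type but very different sizes
schools = pd.DataFrame({
    'school_type': ['District', 'District'],
    'num_students': [100, 300],
    'avg_math': [80.0, 70.0],
})

# Mean of school means: each school counts once (the approach used above)
unweighted = schools.groupby('school_type')['avg_math'].mean()

# Student-weighted mean: each school counts in proportion to its size
weighted_sum = (schools['avg_math'] * schools['num_students']).groupby(
    schools['school_type']).sum()
weighted = weighted_sum / schools.groupby('school_type')['num_students'].sum()

print(unweighted['District'], weighted['District'])
```

Which figure is appropriate depends on the question: the unweighted mean describes the typical school in a group, while the weighted mean describes the typical student.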
