<H1> PyCity Schools Challenge </H1>

<p>Author: Alex Schanne
<p>This notebook will analyze school performances within PyCity. It will focus on math and reading performance using grades as the metric for evaluation. It will also consider school spending, size and type (i.e. charter vs district) as factors of performance.

In [1]:
#importing dependencies
import pandas as pd
import numpy as np

In [2]:
#Creating paths to the CSV files and reading them into pandas DataFrames
schools_path = "Resources/schools_complete.csv"
students_path = "Resources/students_complete.csv"

schools_df = pd.read_csv(schools_path)
students_df = pd.read_csv(students_path)

In [3]:
#merging the two datasets on their shared school_name column
school_complete = pd.merge(students_df, schools_df, how = "left", on = "school_name")
school_complete.head()

,Student ID,student_name,gender,grade,school_name,reading_score,math_score,School ID,type,size,budget
0,0,Paul Bradley,M,9th,Huang High School,66,79,0,District,2917,1910635
1,1,Victor Smith,M,12th,Huang High School,94,61,0,District,2917,1910635
2,2,Kevin Rodriguez,M,12th,Huang High School,90,60,0,District,2917,1910635
3,3,Dr. Richard Scott,M,12th,Huang High School,67,58,0,District,2917,1910635
4,4,Bonnie Ray,F,9th,Huang High School,97,84,0,District,2917,1910635
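
A left merge like the one above silently keeps students whose school has no match and duplicates rows if a school name appears twice in the schools table. The following is a minimal sketch of how that could be checked, using small made-up tables in place of `students_df` and `schools_df` (the rows and the names `students`, `schools`, and `unmatched` are illustrative assumptions, not part of the original notebook). It relies on the `validate` and `indicator` parameters of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for students_df and schools_df (hypothetical rows, not the real data)
students = pd.DataFrame({
    "student_name": ["A", "B", "C"],
    "school_name": ["Huang High School", "Huang High School", "Pena High School"],
})
schools = pd.DataFrame({
    "school_name": ["Huang High School", "Pena High School"],
    "type": ["District", "Charter"],
})

# validate="many_to_one" raises a MergeError if a school name is duplicated in `schools`.
# indicator=True adds a "_merge" column recording where each row found a match.
merged = pd.merge(students, schools, how="left", on="school_name",
                  validate="many_to_one", indicator=True)

# Students whose school is missing from the schools table would be marked "left_only"
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

If `unmatched` is empty and the merged row count equals the student row count, the merge did not drop or duplicate any student.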


## School Summary

Now that we have set up the data properly and merged the two datasets, we will create a summary for the district. It will include: 
    <li> Total Number of Schools
    <li> Total Number of Students
    <li> Total Budget
    <li> Average Math Score
    <li> Average Reading Score
    <li> The percentage of students with a passing math score (a score of 70 or higher)
    <li> The percentage of students with a passing reading score (a score of 70 or higher)
    <li> The percentage of students passing math and reading (% Overall passing)
        
 <p> <b> Please note that this data may contain errors. It was not cleaned for anomalies. </b>
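
Since the data was not cleaned, a few quick checks can show how much the caveat above matters. This is a minimal sketch on a made-up sample that reuses the notebook's column names. The sample rows and the assumption that valid scores fall between 0 and 100 are illustrative, not taken from the original data.

```python
import pandas as pd

# Hypothetical sample with the same column names as the merged data
sample = pd.DataFrame({
    "Student ID": [0, 1, 2, 2],
    "reading_score": [66, 94, None, 105],
    "math_score": [79, 61, 60, 58],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Number of rows that repeat an earlier Student ID
duplicate_ids = sample["Student ID"].duplicated().sum()

# Scores outside the assumed valid range of 0-100.
# Missing values also fail between(), so they are counted here as well.
bad_reading = (~sample["reading_score"].between(0, 100)).sum()

print(missing_per_column["reading_score"], duplicate_ids, bad_reading)
```

The same three checks could be run on `school_complete` before computing any summary metrics.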

In [4]:
#Calculating the district-wide metrics for the summary
tot_schools = school_complete["school_name"].nunique()
tot_students = school_complete["student_name"].count()
tot_budget = schools_df["budget"].sum()
avg_math = school_complete["math_score"].mean()
avg_read = school_complete["reading_score"].mean()
math_pass = (len(school_complete[school_complete["math_score"] >= 70]))/tot_students * 100
read_pass = (len(school_complete[school_complete["reading_score"] >= 70]))/tot_students * 100
overall_pass = len(school_complete.loc[(school_complete["math_score"] >= 70) & (school_complete["reading_score"] >= 70)])/tot_students * 100

#Creating the DataFrame, formatting the values for readability, and outputting it
dist_sum = [{"Total Schools":tot_schools,
            "Total Students":tot_students,
            "Total Budget": '${:,}'.format(tot_budget),
            "Average Math Score":'{:,.2f}'.format(avg_math),
            "Average Reading Score":'{:,.2f}'.format(avg_read),
            "% Passing Math":'{:.2f}%'.format(math_pass),
            "% Passing Reading":'{:.2f}%'.format(read_pass),
            "Overall Passing Rate":'{:.2f}%'.format(overall_pass)}]
dist_summary = pd.DataFrame(dist_sum)
dist_summary

,Total Schools,Total Students,Total Budget,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,Overall Passing Rate
0,15,39170,"$24,649,428",78.99,81.88,74.98%,85.81%,65.17%


Now that we have created the district-wide summary, we will break the same metrics down by school. This makes it easier to compare the performance of individual schools on our key metrics.

In [5]:
#Calculating and creating a dataframe grouped by each school
group_school = school_complete.groupby("school_name")

typeschool = schools_df.set_index('school_name')['type']
sch_stud = group_school['Student ID'].count()
sch_budg = schools_df.set_index('school_name')['budget']
budg_per_stud = sch_budg/sch_stud

sch_avgmath = group_school['math_score'].mean()
sch_avgread = group_school['reading_score'].mean()

sch_mathpass = school_complete[school_complete["math_score"] >= 70].groupby(['school_name'])['Student ID'].count()
per_mathpass = sch_mathpass/sch_stud * 100
sch_readpass = school_complete[school_complete["reading_score"] >= 70].groupby(['school_name'])['Student ID'].count()/sch_stud * 100
sch_overall = school_complete[(school_complete["math_score"] >= 70) & (school_complete["reading_score"] >= 70)].groupby(['school_name'])['Student ID'].count()/sch_stud * 100

#formatting and outputting the dataframe
schools_summary = pd.DataFrame({"School Type":typeschool,
            "Total Students":sch_stud,
            "Total School Budget":sch_budg,
            "Per Student Budget": budg_per_stud,
            "Average Math Score":sch_avgmath,
            "Average Reading Score":sch_avgread,
            "% Passing Math":per_mathpass,
            "% Passing Reading":sch_readpass,
            "Overall Passing Rate":sch_overall})

schools_summary

,School Type,Total Students,Total School Budget,Per Student Budget,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,Overall Passing Rate
Bailey High School,District,4976,3124928,628.0,77.048432,81.033963,66.680064,81.93328,54.642283
Cabrera High School,Charter,1858,1081356,582.0,83.061895,83.97578,94.133477,97.039828,91.334769
Figueroa High School,District,2949,1884411,639.0,76.711767,81.15802,65.988471,80.739234,53.204476
Ford High School,District,2739,1763916,644.0,77.102592,80.746258,68.309602,79.299014,54.289887
Griffin High School,Charter,1468,917500,625.0,83.351499,83.816757,93.392371,97.138965,90.599455
Hernandez High School,District,4635,3022020,652.0,77.289752,80.934412,66.752967,80.862999,53.527508
Holden High School,Charter,427,248087,581.0,83.803279,83.814988,92.505855,96.252927,89.227166
Huang High School,District,2917,1910635,655.0,76.629414,81.182722,65.683922,81.316421,53.513884
Johnson High School,District,4761,3094650,650.0,77.072464,80.966394,66.057551,81.222432,53.539172
Pena High School,Charter,962,585858,609.0,83.839917,84.044699,94.594595,95.945946,90.540541
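
The per-school passing rates can also be computed without dividing two filtered counts, because the mean of a True/False column is the share of True values. The sketch below shows this on made-up scores. The school names, scores, and the column names `pass_math`, `pass_read`, and `pass_both` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical scores for two schools
scores = pd.DataFrame({
    "school_name": ["X High", "X High", "X High", "Y High", "Y High"],
    "math_score": [80, 65, 90, 70, 50],
    "reading_score": [75, 72, 60, 88, 91],
})

# One boolean column per passing criterion (70 or higher, as in the notebook)
flags = scores.assign(
    pass_math=scores["math_score"] >= 70,
    pass_read=scores["reading_score"] >= 70,
)
flags["pass_both"] = flags["pass_math"] & flags["pass_read"]

# The mean of a boolean column is the fraction of True values in each group
rates = flags.groupby("school_name")[["pass_math", "pass_read", "pass_both"]].mean() * 100
print(rates)
```

This keeps the numerator and denominator in the same groupby, so the two cannot drift out of alignment.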


## Top Performing Schools (By % Overall Passing)

These are the top five performing schools as determined by overall percent of passing students.

In [6]:
#using the previously created schools_summary dataframe and sorting it by overall passing percentage

top_overallpass = schools_summary.sort_values("Overall Passing Rate", ascending = False)
top_overallpass.head(5)

,School Type,Total Students,Total School Budget,Per Student Budget,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,Overall Passing Rate
Cabrera High School,Charter,1858,1081356,582.0,83.061895,83.97578,94.133477,97.039828,91.334769
Thomas High School,Charter,1635,1043130,638.0,83.418349,83.84893,93.272171,97.308869,90.948012
Griffin High School,Charter,1468,917500,625.0,83.351499,83.816757,93.392371,97.138965,90.599455
Wilson High School,Charter,2283,1319574,578.0,83.274201,83.989488,93.867718,96.539641,90.582567
Pena High School,Charter,962,585858,609.0,83.839917,84.044699,94.594595,95.945946,90.540541


## Bottom Performing Schools (By % Overall Passing)

These are the five lowest-performing schools as determined by overall percentage of passing students.

In [7]:
#using the previously created schools_summary dataframe and sorting it by overall passing percentage in ascending order

bottom_overallpass = schools_summary.sort_values("Overall Passing Rate", ascending = True)
bottom_overallpass.head(5)

,School Type,Total Students,Total School Budget,Per Student Budget,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,Overall Passing Rate
Rodriguez High School,District,3999,2547363,637.0,76.842711,80.744686,66.366592,80.220055,52.988247
Figueroa High School,District,2949,1884411,639.0,76.711767,81.15802,65.988471,80.739234,53.204476
Huang High School,District,2917,1910635,655.0,76.629414,81.182722,65.683922,81.316421,53.513884
Hernandez High School,District,4635,3022020,652.0,77.289752,80.934412,66.752967,80.862999,53.527508
Johnson High School,District,4761,3094650,650.0,77.072464,80.966394,66.057551,81.222432,53.539172


## Math Scores by Grade

The following work will provide average math scores for each grade level (9th through 12th grade) as reported for the schools in the district.

In [8]:
#math scores by grade level
nine_math = students_df.loc[students_df['grade']== "9th"].groupby(['school_name'])['math_score'].mean()
ten_math = students_df.loc[students_df['grade']== "10th"].groupby(['school_name'])['math_score'].mean()
eleven_math = students_df.loc[students_df['grade']== "11th"].groupby(['school_name'])['math_score'].mean()
twelve_math = students_df.loc[students_df['grade']== "12th"].groupby(['school_name'])['math_score'].mean()

math = pd.DataFrame({
    "9th Grade": nine_math,
    "10th Grade": ten_math,
    "11th Grade": eleven_math,
    "12th Grade": twelve_math
})

math = math[['9th Grade', '10th Grade', '11th Grade', '12th Grade']]
math.index.name = "School"
math

School,9th Grade,10th Grade,11th Grade,12th Grade
Bailey High School,77.083676,76.996772,77.515588,76.492218
Cabrera High School,83.094697,83.154506,82.76556,83.277487
Figueroa High School,76.403037,76.539974,76.884344,77.151369
Ford High School,77.361345,77.672316,76.918058,76.179963
Griffin High School,82.04401,84.229064,83.842105,83.356164
Hernandez High School,77.438495,77.337408,77.136029,77.186567
Holden High School,83.787402,83.429825,85.0,82.855422
Huang High School,77.027251,75.908735,76.446602,77.225641
Johnson High School,77.187857,76.691117,77.491653,76.863248
Pena High School,83.625455,83.372,84.328125,84.121547
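
The same school-by-grade table can be produced in one step with `pivot_table` instead of four separate filters. The sketch below uses made-up scores; the data and the names `students`, `grade_order`, and `math_by_grade` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical scores: one student per school and grade
students = pd.DataFrame({
    "school_name": ["X High"] * 4 + ["Y High"] * 4,
    "grade": ["9th", "10th", "11th", "12th"] * 2,
    "math_score": [70, 80, 90, 100, 60, 65, 75, 85],
})

grade_order = ["9th", "10th", "11th", "12th"]

# Rows are schools, columns are grades, cells are mean math scores
math_by_grade = (
    students.pivot_table(index="school_name", columns="grade",
                         values="math_score", aggfunc="mean")
    # pivot_table sorts the grade labels as text ("10th" before "9th"),
    # so the columns are put back into grade order here
    .reindex(columns=grade_order)
)
print(math_by_grade)
```

Swapping `values="math_score"` for `values="reading_score"` would give the reading table in the same way.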


## Reading Score by Grade

The following work will provide average reading scores for each grade level (9th through 12th grade) as reported for the schools in the district.

In [9]:
#reading scores by grade level
nine_read = students_df.loc[students_df['grade']== "9th"].groupby(['school_name'])['reading_score'].mean()
ten_read = students_df.loc[students_df['grade']== "10th"].groupby(['school_name'])['reading_score'].mean()
eleven_read = students_df.loc[students_df['grade']== "11th"].groupby(['school_name'])['reading_score'].mean()
twelve_read = students_df.loc[students_df['grade']== "12th"].groupby(['school_name'])['reading_score'].mean()

read = pd.DataFrame({
    "9th Grade": nine_read,
    "10th Grade": ten_read,
    "11th Grade": eleven_read,
    "12th Grade": twelve_read
})

read = read[['9th Grade', '10th Grade', '11th Grade', '12th Grade']]
read.index.name = "School"
read

School,9th Grade,10th Grade,11th Grade,12th Grade
Bailey High School,81.303155,80.907183,80.945643,80.912451
Cabrera High School,83.676136,84.253219,83.788382,84.287958
Figueroa High School,81.198598,81.408912,80.640339,81.384863
Ford High School,80.632653,81.262712,80.403642,80.662338
Griffin High School,83.369193,83.706897,84.288089,84.013699
Hernandez High School,80.86686,80.660147,81.39614,80.857143
Holden High School,83.677165,83.324561,83.815534,84.698795
Huang High School,81.290284,81.512386,81.417476,80.305983
Johnson High School,81.260714,80.773431,80.616027,81.227564
Pena High School,83.807273,83.612,84.335938,84.59116


## Scores by School Spending

The following will analyze school performance based on per-student spending (each school's budget divided by its number of students).
To do so, the data will be binned into ranges of per-student spending.

In [10]:
#Creating the bins in which the data will be held
#(named spend_bins to avoid shadowing Python's built-in bin function)
spend_bins = [0, 584, 629, 644, 675]
bin_names = ["<$584", "$585-$629", "$630-$644", "$645-$675"]
school_complete['budg_bin'] = pd.cut(school_complete['budget']/school_complete['size'], spend_bins, labels = bin_names)
bin_spend = school_complete.groupby('budg_bin')


#Calculating the data to go in the new binned DataFrame
avgmath = bin_spend['math_score'].mean()
avgread = bin_spend['reading_score'].mean()
passmath = school_complete[school_complete['math_score'] >= 70].groupby(['budg_bin'])['Student ID'].count()/bin_spend['Student ID'].count() *100
passread = school_complete[school_complete['reading_score'] >= 70].groupby(['budg_bin'])['Student ID'].count()/bin_spend['Student ID'].count() * 100
overpass = school_complete[(school_complete['math_score'] >= 70) & (school_complete['reading_score'] >= 70)].groupby(['budg_bin'])['Student ID'].count()/bin_spend['Student ID'].count() * 100

#Creating the new binned DataFrame
bin_by_spend = pd.DataFrame({"Average Math Score": avgmath,
                            "Average Reading Score": avgread,
                            "% Passing Math": passmath,
                            "% Passing Reading": passread,
                            "Overall % Passing": overpass})

bin_by_spend

budg_bin,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,Overall % Passing
<$584,83.363065,83.964039,93.702889,96.686558,90.640704
$585-$629,79.982873,82.312643,79.109851,88.513145,70.939239
$630-$644,77.821056,81.301007,70.623565,82.600247,58.841194
$645-$675,77.049297,81.005604,66.230813,81.109397,53.528791
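
When reading the spending bins, it helps to know that `pd.cut` closes each interval on the right by default, so an edge value falls into the lower bin. The sketch below applies the notebook's edges and labels to a few made-up per-student amounts (the values in `per_student` are hypothetical) to show where the boundaries land.

```python
import pandas as pd

# Same edges and labels as the spending analysis
edges = [0, 584, 629, 644, 675]
labels = ["<$584", "$585-$629", "$630-$644", "$645-$675"]

# Hypothetical per-student spending values, including some that sit on an edge
per_student = pd.Series([581.0, 584.0, 584.5, 629.0, 644.0, 655.0])

# Default right=True gives intervals (0, 584], (584, 629], (629, 644], (644, 675]
binned = pd.cut(per_student, edges, labels=labels)
print(binned.tolist())
```

Note that a school spending exactly $584 per student lands in the bin labeled "<$584", and a value such as $584.50 lands in "$585-$629", so the labels are approximate descriptions of the intervals.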


## Scores by School Size

The following will analyze school performance based on school size. Again, this analysis requires binning the data into groups. In this instance, we will create three bins to represent small (<1000), medium (1000-2000), and large (>2000) schools.

In [11]:
#Creating the bins in which the data will be held
sizebins = [0, 999, 1999, 9999999999]
size_names = ["Small (<1000)", "Medium (1000-2000)" , "Large (>2000)"]
school_complete['size_bins'] = pd.cut(school_complete['size'], sizebins, labels = size_names)
sizes = school_complete.groupby('size_bins')
 
avgmath = sizes['math_score'].mean()
avgread = sizes['reading_score'].mean()
passmath = school_complete[school_complete['math_score'] >= 70].groupby('size_bins')['Student ID'].count()/sizes['Student ID'].count() * 100
passread = school_complete[school_complete['reading_score'] >= 70].groupby('size_bins')['Student ID'].count()/sizes['Student ID'].count() * 100
overpass = school_complete[(school_complete['reading_score'] >= 70) & (school_complete['math_score'] >= 70)].groupby('size_bins')['Student ID'].count()/sizes['Student ID'].count() * 100
            
bin_by_size = pd.DataFrame({"Average Math Score": avgmath,
    "Average Reading Score": avgread,
    '% Passing Math': passmath,
    '% Passing Reading': passread,
    "Overall Passing Rate": overpass
            
})          
bin_by_size

size_bins,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,Overall Passing Rate
Small (<1000),83.828654,83.974082,93.952484,96.040317,90.136789
Medium (1000-2000),83.372682,83.867989,93.616522,96.773058,90.624267
Large (>2000),77.477597,81.198674,68.65238,82.125158,56.574046


## Scores by School Type

The following will analyze school performance based on school type (District vs. Charter).

In [13]:
#Grouping the school district by type of school and analyzing performance
types = school_complete.groupby('type')
 
avgmath = types['math_score'].mean()
avgread = types['reading_score'].mean()
#Note: this mean is taken over student rows, so larger schools weigh more heavily
avgbudg = types['budget'].mean()
passmath = school_complete[school_complete['math_score'] >= 70].groupby('type')['Student ID'].count()/types['Student ID'].count() * 100
passread = school_complete[school_complete['reading_score'] >= 70].groupby('type')['Student ID'].count()/types['Student ID'].count() * 100
overpass = school_complete[(school_complete['reading_score'] >= 70) & (school_complete['math_score'] >= 70)].groupby('type')['Student ID'].count()/types['Student ID'].count() * 100
            
grouptype = pd.DataFrame({"Average Budget": avgbudg,
    "Average Math Score": avgmath,
    "Average Reading Score": avgread,
    '% Passing Math': passmath,
    '% Passing Reading': passread,
    "Overall Passing Rate": overpass
            
})          
grouptype


type,Average Budget,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,Overall Passing Rate
Charter,1024543.0,83.406183,83.902821,93.701821,96.645891,90.560932
District,2611175.0,76.987026,80.962485,66.518387,80.905249,53.695878
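
The "Average Budget" column is computed on the merged table, which has one row per student, so each school's budget is counted once per enrolled student. The sketch below uses two made-up schools (all names and numbers are hypothetical) to show how that student-weighted mean can differ from a plain average over schools.

```python
import pandas as pd

# Two hypothetical charter schools of very different size
schools = pd.DataFrame({
    "school_name": ["Small Charter", "Big Charter"],
    "type": ["Charter", "Charter"],
    "size": [1, 3],
    "budget": [100, 500],
})

# One row per student, mirroring the structure of the merged table
students = pd.DataFrame({
    "school_name": ["Small Charter", "Big Charter", "Big Charter", "Big Charter"],
})
merged = students.merge(schools, on="school_name", how="left")

# Mean over student rows: (100 + 500 + 500 + 500) / 4
student_weighted = merged.groupby("type")["budget"].mean()

# Mean over schools: (100 + 500) / 2
per_school = schools.groupby("type")["budget"].mean()

print(student_weighted["Charter"], per_school["Charter"])
```

In the notebook, grouping `schools_df` by `type` rather than `school_complete` would give the per-school average, if that is the figure of interest.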


# Written Observations

First, before drawing any conclusions, we should note that, as previously stated, this data was not cleaned for anomalies or outliers. The data may therefore contain errors that distort the patterns described in the following written analysis.

One thing we can observe from the data is that smaller schools tend to perform better. In large schools (more than 2,000 students), only about 56.6% of students pass both math and reading, whereas in small and medium-sized schools roughly 90% of students pass both.

Another thing we can observe is that charter schools outperform district schools: about 90.6% of charter students pass both subjects, compared with about 53.7% of district students. Although this assignment did not ask for the budget difference between the two types, charter schools also appear to have smaller total budgets on average.

A third observation is that higher spending is not associated with better performance. In this data the pattern runs the other way: schools in the lowest per-student spending bin have an overall passing rate of about 90.6%, while schools in the highest bin ($645-$675) have about 53.5%. Taken together with the first two observations, this suggests further research into whether a smaller student population has a causal effect on performance, and whether that helps explain the success of charter schools.
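
The relationship between spending and performance can be quantified with a correlation coefficient. The sketch below is a rough illustration only: it uses the per-student budget and overall passing rate of the ten schools visible in the per-school summary output (passing rates rounded to two decimals), not all 15 schools, and the names `shown` and `r` are introduced here for the example.

```python
import pandas as pd

# Values copied from the ten schools displayed in the per-school summary table
shown = pd.DataFrame({
    "per_student_budget": [628, 582, 639, 644, 625, 652, 581, 655, 650, 609],
    "overall_passing": [54.64, 91.33, 53.20, 54.29, 90.60,
                        53.53, 89.23, 53.51, 53.54, 90.54],
})

# Series.corr computes the Pearson correlation by default
r = shown["per_student_budget"].corr(shown["overall_passing"])
print(round(r, 2))
```

A negative value of `r` here is consistent with the observation above, but with so few schools, and with school type and size varying alongside spending, it describes an association rather than a cause.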

Unfortunately, we cannot draw conclusions about the causes of these trends from this data alone. However, this analysis recommends future research into why school size appears to be more closely tied to performance than budget is, particularly when comparing charter and district schools.