   # Education Project


   <img src='data/education_image.jpg' width="900">
   
   **Credit:**  [wsimag](https://wsimag.com/culture/60264-education-in-venezuela-the-americas-and-the-world)



In [1]:
# Load relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
import warnings

sns.set(style='ticks')

warnings.filterwarnings("ignore")  # Suppress all warnings

# Introduction

## Business Context
Research shows that high-poverty areas disproportionately educate children of color. A student's race/ethnicity and social class strongly determine the chances of ending up in a high-poverty or high-minority school. For instance, African American and Hispanic students, even if they are not poor, are much more likely than white or Asian students to be in high-poverty schools.

There is a growing body of evidence that increased investment in education yields better outcomes, and that the positive effects are even greater among low-income students. On the other hand, it costs more to educate low-income students and provide them with a robust education capable of overcoming their initial disadvantages.


### Goals
1. Understand the current demographics of schools, from wealthy to high-poverty, across the state of California.
2. Identify how much funding is available per pupil in wealthy vs high-poverty areas.
3. Learn what factors are most correlated with student performance.


#### Predictive modeling
- What is the average test score per school?
- What percentage of students pass versus do not pass?


# DATA WRANGLING

Data wrangling is the process of transforming and mapping data from a "raw" form into another format, with the intent of making it more appropriate and valuable for downstream purposes such as analytics.

- Extracting and cleaning relevant data. Let's start looking at the datasets!

## Assessment Data

- It contains assessment data for the Smarter Balanced Summative Assessments (2018-2019) for the state of California.

- Legend types can be found here: https://caaspp-elpac.cde.ca.gov/caaspp/research_fixfileformat19
- More information about the assessment setup: https://www.cde.ca.gov/ta/tg/ca/sbsummativefaq.asp

In [2]:
# load datafile
df_all = pd.read_csv('large_data/sb_ca2019_all_csv_v4.txt')

In [3]:
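# create dataset containing district level data
# (district rows have School Code == 0 and a non-zero District Code;
#  District Code == 0 would select county- and state-level rows instead)
df_district = df_all[(df_all['School Code'] == 0) & (df_all['District Code'] != 0)]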

In [4]:
# create dataset containing school level data
df_school = df_all.drop(df_all[df_all['School Code'] == 0].index)
df_school.head(10)

Unnamed: 0,County Code,District Code,School Code,Filler,Test Year,Subgroup ID,Test Type,Total Tested At Entity Level,Total Tested with Scores,Grade,...,Area 1 Percentage Below Standard,Area 2 Percentage Above Standard,Area 2 Percentage Near Standard,Area 2 Percentage Below Standard,Area 3 Percentage Above Standard,Area 3 Percentage Near Standard,Area 3 Percentage Below Standard,Area 4 Percentage Above Standard,Area 4 Percentage Near Standard,Area 4 Percentage Below Standard
1888,1,10017,112607,,2019,1,B,85,84,11,...,37.35,12.20,60.98,26.83,7.23,71.08,21.69,13.25,61.45,25.30
1889,1,10017,112607,,2019,3,B,42,42,11,...,35.71,14.29,50.00,35.71,7.14,76.19,16.67,9.52,64.29,26.19
1890,1,10017,112607,,2019,4,B,43,42,11,...,39.02,10.00,72.50,17.50,7.32,65.85,26.83,17.07,58.54,24.39
1891,1,10017,112607,,2019,6,B,79,78,11,...,33.77,13.16,63.16,23.68,7.79,71.43,20.78,14.29,63.64,22.08
1892,1,10017,112607,,2019,7,B,*,*,11,...,*,*,*,*,*,*,*,*,*,*
1893,1,10017,112607,,2019,8,B,38,38,11,...,36.84,18.42,68.42,13.16,7.89,73.68,18.42,18.42,65.79,15.79
1894,1,10017,112607,,2019,31,B,68,67,11,...,40.91,12.31,63.08,24.62,9.09,68.18,22.73,13.64,62.12,24.24
1895,1,10017,112607,,2019,51,B,85,84,11,...,37.35,12.20,60.98,26.83,7.23,71.08,21.69,13.25,61.45,25.30
1896,1,10017,112607,,2019,53,B,85,84,11,...,37.35,12.20,60.98,26.83,7.23,71.08,21.69,13.25,61.45,25.30
1897,1,10017,112607,,2019,74,B,30,29,11,...,28.57,7.41,59.26,33.33,3.57,67.86,28.57,7.14,71.43,21.43
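
The research file mixes several aggregation levels in one table, and the zero codes are what tell them apart. The sketch below labels each row by level on a toy frame. It assumes the convention that statewide rows have `County Code` 0, county rows have `District Code` 0, and district rows have `School Code` 0; that convention should be checked against the file-format legend linked above. The helper name `entity_level` and the `Level` column are hypothetical.

```python
import numpy as np
import pandas as pd


def entity_level(df: pd.DataFrame) -> pd.Series:
    """Label each row as state, county, district, or school (assumed coding)."""
    conditions = [
        df["County Code"] == 0,    # assumption: statewide summary rows
        df["District Code"] == 0,  # assumption: county summary rows
        df["School Code"] == 0,    # assumption: district summary rows
    ]
    labels = ["state", "county", "district"]
    # np.select takes the first matching condition; anything else is a school row.
    levels = np.select(conditions, labels, default="school")
    return pd.Series(levels, index=df.index, name="Level")


# Toy rows, one per level (codes borrowed from the preview above).
toy = pd.DataFrame({
    "County Code": [0, 1, 1, 1],
    "District Code": [0, 0, 10017, 10017],
    "School Code": [0, 0, 0, 112607],
})
toy["Level"] = entity_level(toy)
print(toy)
```

Filtering on a label such as `toy[toy["Level"] == "district"]` makes the intended level explicit instead of relying on a single zero-code comparison.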


In [5]:
# check columns' names
df_school.columns

Index(['County Code', 'District Code', 'School Code', 'Filler', 'Test Year',
       'Subgroup ID', 'Test Type', 'Total Tested At Entity Level',
       'Total Tested with Scores', 'Grade', 'Test Id',
       'CAASPP Reported Enrollment', 'Students Tested', 'Mean Scale Score',
       'Percentage Standard Exceeded', 'Percentage Standard Met',
       'Percentage Standard Met and Above', 'Percentage Standard Nearly Met',
       'Percentage Standard Not Met', 'Students with Scores',
       'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
       'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
       'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
       'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
       'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
       'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard'],
      dtype='object')

In [6]:
# check data type
df_school.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3013079 entries, 1888 to 3576490
Data columns (total 32 columns):
 #   Column                             Dtype  
---  ------                             -----  
 0   County Code                        int64  
 1   District Code                      int64  
 2   School Code                        int64  
 3   Filler                             float64
 4   Test Year                          int64  
 5   Subgroup ID                        int64  
 6   Test Type                          object 
 7   Total Tested At Entity Level       object 
 8   Total Tested with Scores           object 
 9   Grade                              int64  
 10  Test Id                            int64  
 11  CAASPP Reported Enrollment         object 
 12  Students Tested                    object 
 13  Mean Scale Score                   object 
 14  Percentage Standard Exceeded       object 
 15  Percentage Standard Met            object 
 16  Percentage Stan
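
Several columns that look numeric are listed as `object` above. One likely cause, visible in row 1892 of the preview, is the `*` placeholder used where values are not reported. Here is a minimal sketch of converting such columns to numbers, assuming `*` is the only non-numeric marker; the helper name `coerce_numeric` and the toy values are hypothetical.

```python
import pandas as pd


def coerce_numeric(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy where the given columns are numeric and '*' becomes NaN."""
    out = df.copy()
    for col in columns:
        # errors="coerce" turns anything unparseable (such as '*') into NaN.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


# Toy frame imitating the string-typed columns of the assessment file.
toy = pd.DataFrame({
    "Students Tested": ["84", "*", "42"],
    "Percentage Standard Met": ["27.38", "*", "18.11"],
})
clean = coerce_numeric(toy, ["Students Tested", "Percentage Standard Met"])
print(clean.dtypes)
print(clean.isna().sum())
```

After this step, `isnull().sum()` also counts the suppressed values, which it cannot do while they are stored as the string `*`.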

In [7]:
# Check for missing data
df_school.isnull().sum()

County Code                                0
District Code                              0
School Code                                0
Filler                               3013079
Test Year                                  0
Subgroup ID                                0
Test Type                                  0
Total Tested At Entity Level               0
Total Tested with Scores                   0
Grade                                      0
Test Id                                    0
CAASPP Reported Enrollment                 0
Students Tested                            0
Mean Scale Score                      797631
Percentage Standard Exceeded               0
Percentage Standard Met                    0
Percentage Standard Met and Above          0
Percentage Standard Nearly Met             0
Percentage Standard Not Met                0
Students with Scores                       0
Area 1 Percentage Above Standard           0
Area 1 Percentage Near Standard            0
Area 1 Per

In [8]:
# Number of rows where subgroup ID == 1
df_school[df_school['Subgroup ID'] == 1].count()

County Code                          87324
District Code                        87324
School Code                          87324
Filler                                   0
Test Year                            87324
Subgroup ID                          87324
Test Type                            87324
Total Tested At Entity Level         87324
Total Tested with Scores             87324
Grade                                87324
Test Id                              87324
CAASPP Reported Enrollment           87324
Students Tested                      87324
Mean Scale Score                     66727
Percentage Standard Exceeded         87324
Percentage Standard Met              87324
Percentage Standard Met and Above    87324
Percentage Standard Nearly Met       87324
Percentage Standard Not Met          87324
Students with Scores                 87324
Area 1 Percentage Above Standard     87324
Area 1 Percentage Near Standard      87324
Area 1 Percentage Below Standard     87324
Area 2 Perc

In [9]:
# Check number of unique schools
df_school['School Code'].nunique()

10300

- There are 10,300 unique schools!

### Reorganizing Subgroup ID 

The assessment dataset contains a lot of demographic information in the Subgroup ID column. We need to reorganize the dataset so that it has one variable per column and one observation per row. We also need to keep only the demographic information of interest.

#### Before merging:
- Filter variables of interest;
- Rearrange the data to have: 
    - one feature per column; 
    - one observation per row;
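
The reshaping described in the list above (one feature per column, one observation per row) can be sketched with `DataFrame.pivot` on a toy frame. This is only an illustration of the target layout and not the approach used in the cells that follow; the enrollment numbers and the `names` mapping are toy values.

```python
import pandas as pd

# Long format: one row per (school, subgroup), as in the assessment file.
long_df = pd.DataFrame({
    "School Code": [112607, 112607, 112607, 123968, 123968],
    "Subgroup ID": [1, 3, 4, 1, 3],
    "CAASPP Reported Enrollment": [90, 46, 44, 142, 70],  # toy values
})

# Subgroups of interest and the column names they should receive.
names = {1: "All Students", 3: "Male", 4: "Female"}

wide_df = (
    long_df[long_df["Subgroup ID"].isin(list(names))]  # filter variables of interest
    .pivot(index="School Code", columns="Subgroup ID",
           values="CAASPP Reported Enrollment")          # one column per subgroup
    .rename(columns=names)
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "Subgroup ID" axis label
print(wide_df)
```

A subgroup with no row for a given school shows up as `NaN` in the wide table. Note that `pivot` raises an error if a (school, subgroup) pair appears more than once.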

This dataset represents the Smarter Balanced Assessments for English Language Arts/Literacy and Mathematics (SB), identified by Test ID 1 and 2. More info about the test can be found here: https://www.caaspp.org/administration/about/testing/index.html

## Creating two datasets for modeling

- English Language Arts/Literacy: test_id == 1

    - 10,299 rows
    
    
- Mathematics: test_id == 2

    - 10,298 rows



In [10]:
# Filter Grade == 13 (summary of all grades per school)
all_grades = df_school[df_school['Grade'] == 13]

# Filter Subgroup ID == 1 (summary of all students)
all_students = all_grades[all_grades['Subgroup ID'] == 1]

## 1. df_language

In [11]:
# Create df_test1: English Language Arts/Literacy (Test Id == 1)
df_test1 = all_students[all_students['Test Id'] == 1]

# drop columns that won't be used
df_language = df_test1.drop(columns = ['Filler', 'Test Year', 'Test Type', 'County Code', 'District Code',
                             'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
                             'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
                             'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
                             'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
                             'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
                             'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard', 
                             'Total Tested At Entity Level', 'Total Tested with Scores', 'Grade',
                             'Test Id', 'Students Tested'])

df_language

Unnamed: 0,School Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores
1927,112607,1,90,,7.14,27.38,34.52,36.90,28.57,84
2234,123968,1,142,,7.09,18.11,25.20,29.92,44.88,127
2681,124172,1,239,,71.12,22.41,93.53,4.31,2.16,232
3126,125567,1,201,,28.13,17.71,45.83,18.75,35.42,192
3474,130401,1,*,,*,*,*,*,*,*
...,...,...,...,...,...,...,...,...,...,...
3575148,6056816,1,608,,13.95,37.41,51.36,24.32,24.32,588
3575555,6056832,1,146,,25.76,25.76,51.52,27.27,21.21,132
3575772,6056840,1,74,,31.08,22.97,54.05,18.92,27.03,74
3575957,6118806,1,57,,23.53,41.18,64.71,21.57,13.73,51


## 2. df_math

In [12]:
# Create df_test2 mathematics
df_test2 = all_students[all_students['Test Id'] == 2]

# drop columns with redundant information
df_math = df_test2.drop(columns = ['Filler', 'Test Year', 'Test Type', 'County Code', 'District Code',
                             'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
                             'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
                             'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
                             'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
                             'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
                             'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard',
                             'Total Tested At Entity Level', 'Total Tested with Scores', 'Grade',
                             'Test Id', 'Students Tested'])
df_math

Unnamed: 0,School Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores
2005,112607,1,90,,3.57,7.14,10.71,16.67,72.62,84
2465,123968,1,142,,6.67,11.11,17.78,28.15,54.07,135
2891,124172,1,239,,74.57,19.40,93.97,5.17,0.86,232
3367,125567,1,201,,17.10,18.65,35.75,18.13,46.11,193
3572,130401,1,34,,*,*,*,*,*,4
...,...,...,...,...,...,...,...,...,...,...
3575405,6056816,1,608,,13.78,27.04,40.82,32.48,26.70,588
3575698,6056832,1,146,,9.09,24.24,33.33,40.15,26.52,132
3575838,6056840,1,74,,12.16,31.08,43.24,27.03,29.73,74
3576079,6118806,1,57,,15.69,35.29,50.98,29.41,19.61,51


### Subgroup ID 

In the legend below, Demographic ID and Demographic ID Num correspond to the Subgroup ID column of the assessment dataset.

In [13]:
legend = pd.read_csv('data/Subgroups.txt')
legend

Unnamed: 0,Demographic ID,Demographic ID Num,Demographic Name,Student Group
0,1,1,All Students,All Students
1,3,3,Male,Gender
2,4,4,Female,Gender
3,6,6,Fluent English proficient and English only,English-Language Fluency
4,7,7,Initial fluent English proficient (IFEP),English-Language Fluency
5,8,8,Reclassified fluent English proficient (RFEP),English-Language Fluency
6,28,28,Migrant education,Migrant
7,31,31,Economically disadvantaged,Economic Status
8,50,50,Military,Military Status
9,51,51,Not Military,Military Status


## Next: 
1. Transform demographic information contained in subgroup id column into one variable per column.

In [14]:
# drop columns that won't be used at ALL_GRADES level
all_grades = all_grades.drop(columns = ['Filler', 'Test Year', 'Test Type', 'County Code', 'District Code',
                             'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
                             'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
                             'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
                             'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
                             'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
                             'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard',
                             'Mean Scale Score', 'Percentage Standard Exceeded', 'Percentage Standard Met',
                             'Percentage Standard Met and Above', 'Percentage Standard Nearly Met',
                             'Percentage Standard Not Met', 'Students with Scores', 'Total Tested At Entity Level',
                             'Total Tested with Scores', 'Students Tested', 'Grade'])

# English Language Arts/Literacy dataset (test_id == 1)

In [15]:
# Filter subgroup_id and test_id, rename column and merge into main df
def merge_column(subgroup_id, name):
    df = all_grades[(all_grades['Subgroup ID'] == subgroup_id) & (all_grades['Test Id'] == 1)]
    df = df.drop(columns=['Test Id', 'Subgroup ID'])
    df = df.rename({'CAASPP Reported Enrollment': name}, axis=1)
    global df_language
    df_language = df_language.merge(df, how='left', on='School Code')

In [16]:
# Map subgroup IDs to the column names used after merging
# (named subgroup_names so the built-in dict is not shadowed)
subgroup_names = {3:'Male', 4:'Female', 50:'Military', 51:'Non Military', 52:'Homeless', 53:'Non Homeless', 31:'Disadvantaged',
       111:'Not Disadvantaged', 74:'Black', 75:'Native American', 76:'Asian', 78:'Hispanic', 79:'Pacific Islander',
       80:'White', 144:'Two/More Races', 90:'< High School', 91:'High School Grad', 92:'Some College', 
       93:'College Grad', 94:'Graduate School'}

In [17]:
# Loop through subgroup_names and merge each subgroup column into df_language
for subgroup_id, column_name in subgroup_names.items():
    merge_column(subgroup_id, column_name)

In [18]:
df_language

Unnamed: 0,School Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores,...,Asian,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School
0,112607,1,90,,7.14,27.38,34.52,36.90,28.57,84,...,*,46,*,4,*,23,20,19,11,*
1,123968,1,142,,7.09,18.11,25.20,29.92,44.88,127,...,10,82,,5,4,28,39,34,14,13
2,124172,1,239,,71.12,22.41,93.53,4.31,2.16,232,...,111,9,,18,94,,*,7,58,164
3,125567,1,201,,28.13,17.71,45.83,18.75,35.42,192,...,14,50,,64,27,*,17,39,59,76
4,130401,1,*,,*,*,*,*,*,*,...,,*,*,*,,*,*,*,*,*
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10294,6056816,1,608,,13.95,37.41,51.36,24.32,24.32,588,...,15,166,5,314,61,32,97,255,156,57
10295,6056832,1,146,,25.76,25.76,51.52,27.27,21.21,132,...,,26,*,83,14,,10,64,44,26
10296,6056840,1,74,,31.08,22.97,54.05,18.92,27.03,74,...,*,27,,40,4,10,15,30,13,6
10297,6118806,1,57,,23.53,41.18,64.71,21.57,13.73,51,...,,15,*,29,8,,*,18,23,15


# Mathematics dataset test_id == 2

In [19]:
# Filter subgroup_id and test_id, rename column and merge into main df
def merge_column(subgroup_id, name):
    df = all_grades[(all_grades['Subgroup ID'] == subgroup_id) & (all_grades['Test Id'] == 2)]
    df = df.drop(columns=['Test Id', 'Subgroup ID'])
    df = df.rename({'CAASPP Reported Enrollment': name}, axis=1)
    global df_math
    df_math = df_math.merge(df, how='left', on='School Code')

In [20]:
# Map subgroup IDs to the column names used after merging
# (named subgroup_names so the built-in dict is not shadowed)
subgroup_names = {3:'Male', 4:'Female', 50:'Military', 51:'Non Military', 52:'Homeless', 53:'Non Homeless', 31:'Disadvantaged',
       111:'Not Disadvantaged', 74:'Black', 75:'Native American', 76:'Asian', 78:'Hispanic', 79:'Pacific Islander',
       80:'White', 144:'Two/More Races', 90:'< High School', 91:'High School Grad', 92:'Some College', 
       93:'College Grad', 94:'Graduate School'}

In [21]:
# Loop through subgroup_names and merge each subgroup column into df_math
for subgroup_id, column_name in subgroup_names.items():
    merge_column(subgroup_id, column_name)
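
The helper is defined twice, once per test, and updates a global DataFrame each time, so re-running a cell would merge the same columns again. One possible alternative is a single function that takes the test ID as a parameter and returns a new frame. The sketch below runs on toy data; `add_subgroup_columns` and the toy frames are hypothetical names, and `validate="one_to_one"` is an added safety check that assumes each school appears at most once per subgroup and test.

```python
import pandas as pd


def add_subgroup_columns(base, long_df, test_id, subgroup_names):
    """Return a copy of `base` with one enrollment column per subgroup."""
    out = base.copy()
    for subgroup_id, name in subgroup_names.items():
        mask = (long_df["Subgroup ID"] == subgroup_id) & (long_df["Test Id"] == test_id)
        col = (
            long_df.loc[mask, ["School Code", "CAASPP Reported Enrollment"]]
            .rename(columns={"CAASPP Reported Enrollment": name})
        )
        # validate raises MergeError if a school code is duplicated on either side.
        out = out.merge(col, how="left", on="School Code", validate="one_to_one")
    return out


# Toy long-format data: two schools, two tests, '*' marks a suppressed count.
toy_long = pd.DataFrame({
    "School Code": [1, 1, 2, 1, 2],
    "Test Id": [1, 1, 1, 2, 2],
    "Subgroup ID": [3, 4, 3, 3, 3],
    "CAASPP Reported Enrollment": ["46", "44", "*", "47", "71"],
})
toy_base = pd.DataFrame({"School Code": [1, 2]})
toy_names = {3: "Male", 4: "Female"}

toy_language = add_subgroup_columns(toy_base, toy_long, 1, toy_names)
toy_math = add_subgroup_columns(toy_base, toy_long, 2, toy_names)
print(toy_language)
print(toy_math)
```

Because the function returns a new frame and leaves its inputs untouched, calling it twice gives the same result both times.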

In [22]:
df_math

Unnamed: 0,School Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores,...,Asian,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School
0,112607,1,90,,3.57,7.14,10.71,16.67,72.62,84,...,*,46,*,4,*,23,20,19,11,*
1,123968,1,142,,6.67,11.11,17.78,28.15,54.07,135,...,10,82,,5,4,28,39,34,14,13
2,124172,1,239,,74.57,19.40,93.97,5.17,0.86,232,...,111,9,,18,94,,*,7,58,164
3,125567,1,201,,17.10,18.65,35.75,18.13,46.11,193,...,14,50,,64,27,*,17,39,59,76
4,130401,1,34,,*,*,*,*,*,4,...,,*,*,*,,*,*,*,*,*
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10293,6056816,1,608,,13.78,27.04,40.82,32.48,26.70,588,...,15,166,5,314,61,32,97,255,156,57
10294,6056832,1,146,,9.09,24.24,33.33,40.15,26.52,132,...,,26,*,83,14,,10,64,44,26
10295,6056840,1,74,,12.16,31.08,43.24,27.03,29.73,74,...,*,27,,40,4,10,15,30,13,6
10296,6118806,1,57,,15.69,35.29,50.98,29.41,19.61,51,...,,15,*,29,8,,*,18,23,15


---------

## Entities Data

- It contains information such as school and district name, zip code and relevant codes to allow merge with the assessment data.

The number of rows in this dataset is close to the current number of schools and districts in the state of CA:

- There are ~ 1,040 school districts in California. 
    - The entities_dist dataset contains 1,087 rows.
- There are ~ 10,588 schools in California. 
    - The df_entities dataset contains 10,300 rows.

In [43]:
df_entities = pd.read_csv('data/sb_ca2019entities_csv.txt')

In [44]:
# create dataset containing entities data at district level
entities_dist = df_entities[df_entities['School Code'] == 0]

# create dataset containing entities data at school level 
df_entities = df_entities.drop(df_entities[df_entities['School Code'] == 0].index) # drop district level data

# drop columns with redundant information or not of use 
df_entities = df_entities.drop(columns = ['Filler', 'Test Year', 'Type Id', 'District Code', 'District Name', 'County Name'])


In [45]:
df_entities['County Code'].nunique()

58

In [46]:
df_entities['Zip Code'].nunique()

1492

--------

## Merge df_language to df_entities

This merge adds the school name and zip code to df_language. The same merge is applied to df_math further down.

In [25]:
# merge dfs on school code
df_language_merge = df_entities.merge(df_language, how='left', on='School Code')
df_language_merge

Unnamed: 0,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,...,Asian,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School
0,114686,Ocean Air,92130,1.0,420,,71.19,19.85,91.04,5.81,...,175,26,,188,24,,*,4,85,326
1,6038111,Del Mar Heights Elementary,92014,1.0,275,,69.23,21.61,90.84,5.49,...,19,23,*,203,25,,*,9,85,178
2,6088983,Del Mar Hills Elementary,92014,1.0,169,,61.35,25.15,86.50,6.75,...,13,33,,105,17,*,4,13,47,99
3,6110696,Carmel Del Mar Elementary,92130,1.0,299,,68.03,20.75,88.78,8.50,...,91,34,,148,21,*,*,10,73,203
4,6115620,Ashley Falls Elementary,92130,1.0,331,,51.85,29.01,80.86,11.42,...,100,29,*,171,15,,*,18,63,241
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10295,136747,California Academy of Sports Science,91764,1.0,22,,*,*,*,*,...,,*,,*,,*,*,*,*,*
10296,138313,University Prep,91764,1.0,92,,17.78,28.89,46.67,24.44,...,*,32,,43,*,*,15,18,25,*
10297,6038095,Dehesa Elementary,92019,1.0,95,,6.52,25.00,31.52,32.61,...,,26,,37,18,8,8,37,14,10
10298,6119564,Dehesa Charter,92026,1.0,108,,16.13,30.11,46.24,25.81,...,*,27,,67,6,,6,30,39,25


In [26]:
df_language_merge.columns

Index(['School Code', 'School Name', 'Zip Code', 'Subgroup ID',
       'CAASPP Reported Enrollment', 'Mean Scale Score',
       'Percentage Standard Exceeded', 'Percentage Standard Met',
       'Percentage Standard Met and Above', 'Percentage Standard Nearly Met',
       'Percentage Standard Not Met', 'Students with Scores', 'Male', 'Female',
       'Military', 'Non Military', 'Homeless', 'Non Homeless', 'Disadvantaged',
       'Not Disadvantaged', 'Black', 'Native American', 'Asian', 'Hispanic',
       'Pacific Islander', 'White', 'Two/More Races', '< High School',
       'High School Grad', 'Some College', 'College Grad', 'Graduate School'],
      dtype='object')

In [27]:
# percentage of missing data per column
percent_missing = (df_language_merge.isnull().sum() * 100 / len(df_language_merge)).round(2)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})

# sort values in ascending order
missing_value_df.sort_values('percent_missing', inplace=True)
missing_value_df

Unnamed: 0,percent_missing
School Code,0.0
School Name,0.0
Zip Code,0.0
Students with Scores,0.01
Percentage Standard Not Met,0.01
Percentage Standard Nearly Met,0.01
Percentage Standard Met and Above,0.01
Non Military,0.01
Percentage Standard Exceeded,0.01
Percentage Standard Met,0.01


## Merge df_math to df_entities

In [28]:
df_math_merge = df_entities.merge(df_math, how='left', on='School Code')
df_math_merge

Unnamed: 0,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,...,Asian,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School
0,114686,Ocean Air,92130,1.0,420,,76.44,14.90,91.35,6.97,...,175,26,,188,24,,*,4,85,326
1,6038111,Del Mar Heights Elementary,92014,1.0,275,,67.40,22.34,89.74,8.42,...,19,23,*,203,25,,*,9,85,178
2,6088983,Del Mar Hills Elementary,92014,1.0,169,,54.49,28.14,82.63,12.57,...,13,33,,105,17,*,4,13,47,99
3,6110696,Carmel Del Mar Elementary,92130,1.0,299,,70.71,17.85,88.55,7.74,...,91,34,,148,21,*,*,10,73,203
4,6115620,Ashley Falls Elementary,92130,1.0,331,,55.49,25.61,81.10,11.28,...,100,29,*,171,15,,*,18,63,241
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10295,136747,California Academy of Sports Science,91764,1.0,22,,*,*,*,*,...,,*,,*,,*,*,*,*,*
10296,138313,University Prep,91764,1.0,92,,6.82,9.09,15.91,25.00,...,*,32,,43,*,*,15,18,25,*
10297,6038095,Dehesa Elementary,92019,1.0,95,,11.96,19.57,31.52,35.87,...,,26,,37,18,8,8,37,14,10
10298,6119564,Dehesa Charter,92026,1.0,108,,5.43,13.04,18.48,19.57,...,*,27,,67,6,,6,30,39,25


In [29]:
# percentage of missing data per column
percent_missing = (df_math_merge.isnull().sum() * 100 / len(df_math_merge)).round(2)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})

# sort values in ascending order
missing_value_df.sort_values('percent_missing', inplace=True)
missing_value_df

Unnamed: 0,percent_missing
School Code,0.0
School Name,0.0
Zip Code,0.0
Students with Scores,0.02
Percentage Standard Not Met,0.02
Percentage Standard Nearly Met,0.02
Percentage Standard Met and Above,0.02
Non Military,0.02
Percentage Standard Exceeded,0.02
Percentage Standard Met,0.02
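
The missing-data report is computed the same way for both merged frames, so it can be wrapped in a small function. The sketch below uses the hypothetical name `percent_missing_report` and toy data. Keep in mind that `isnull()`/`isna()` counts only true `NaN` values: cells holding the string `*` are not counted as missing.

```python
import pandas as pd


def percent_missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Percentage of NaN values per column, sorted in ascending order."""
    return (
        df.isna().mean()   # fraction of missing values per column
        .mul(100)
        .round(2)
        .sort_values()
        .to_frame("percent_missing")
    )


# Toy frame: 'a' has 1 of 4 missing, 'b' none, 'c' 2 of 4.
toy = pd.DataFrame({
    "a": [1, None, 3, 4],
    "b": [1, 2, 3, 4],
    "c": [None, None, 3, 4],
})
report = percent_missing_report(toy)
print(report)
```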


----------

### Median household income by zipcode
- Year 2014
- California

Source: http://www.usa.com/rank/california-state--median-household-income--zip-code-rank.htm?yr=9000&dis=&wist=&plow=&phigh=

In [61]:
# load csv file
income = pd.read_csv('data/median_income_zipcode.csv')

# drop columns Rank and population
income = income.drop(columns = ['Rank', 'Population'])

# rename columns to merge with zip code from the main dfs
income.columns = ['Median Household Income', 'Zip Code']

# convert zip code from int to string so it matches the key type in the main dfs
income['Zip Code'] = income['Zip Code'].astype(str)

income

Unnamed: 0,Median Household Income,Zip Code
0,236912.00,94027
1,228587.00,92145
2,200325.00,91980
3,187857.00,94957
4,182750.00,94022
...,...,...
1681,11922.00,93721
1682,11250.00,93530
1683,10625.00,90089
1684,10481.00,95915


-----------

### Merge df_language and df_math to income

In [62]:
df_language_merge = df_language_merge.merge(income, how='left', on='Zip Code')
df_language_merge

Unnamed: 0,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,...,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School,Median Household Income
0,114686,Ocean Air,92130,1.0,420,,71.19,19.85,91.04,5.81,...,26,,188,24,,*,4,85,326,131166.00
1,6038111,Del Mar Heights Elementary,92014,1.0,275,,69.23,21.61,90.84,5.49,...,23,*,203,25,,*,9,85,178,114777.00
2,6088983,Del Mar Hills Elementary,92014,1.0,169,,61.35,25.15,86.50,6.75,...,33,,105,17,*,4,13,47,99,114777.00
3,6110696,Carmel Del Mar Elementary,92130,1.0,299,,68.03,20.75,88.78,8.50,...,34,,148,21,*,*,10,73,203,131166.00
4,6115620,Ashley Falls Elementary,92130,1.0,331,,51.85,29.01,80.86,11.42,...,29,*,171,15,,*,18,63,241,131166.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10295,136747,California Academy of Sports Science,91764,1.0,22,,*,*,*,*,...,*,,*,,*,*,*,*,*,51060.00
10296,138313,University Prep,91764,1.0,92,,17.78,28.89,46.67,24.44,...,32,,43,*,*,15,18,25,*,51060.00
10297,6038095,Dehesa Elementary,92019,1.0,95,,6.52,25.00,31.52,32.61,...,26,,37,18,8,8,37,14,10,74797.00
10298,6119564,Dehesa Charter,92026,1.0,108,,16.13,30.11,46.24,25.81,...,27,,67,6,,6,30,39,25,53820.00


In [63]:
df_math_merge = df_math_merge.merge(income, how='left', on='Zip Code')
df_math_merge

Unnamed: 0,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,...,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School,Median Household Income
0,114686,Ocean Air,92130,1.0,420,,76.44,14.90,91.35,6.97,...,26,,188,24,,*,4,85,326,131166.00
1,6038111,Del Mar Heights Elementary,92014,1.0,275,,67.40,22.34,89.74,8.42,...,23,*,203,25,,*,9,85,178,114777.00
2,6088983,Del Mar Hills Elementary,92014,1.0,169,,54.49,28.14,82.63,12.57,...,33,,105,17,*,4,13,47,99,114777.00
3,6110696,Carmel Del Mar Elementary,92130,1.0,299,,70.71,17.85,88.55,7.74,...,34,,148,21,*,*,10,73,203,131166.00
4,6115620,Ashley Falls Elementary,92130,1.0,331,,55.49,25.61,81.10,11.28,...,29,*,171,15,,*,18,63,241,131166.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10295,136747,California Academy of Sports Science,91764,1.0,22,,*,*,*,*,...,*,,*,,*,*,*,*,*,51060.00
10296,138313,University Prep,91764,1.0,92,,6.82,9.09,15.91,25.00,...,32,,43,*,*,15,18,25,*,51060.00
10297,6038095,Dehesa Elementary,92019,1.0,95,,11.96,19.57,31.52,35.87,...,26,,37,18,8,8,37,14,10,74797.00
10298,6119564,Dehesa Charter,92026,1.0,108,,5.43,13.04,18.48,19.57,...,27,,67,6,,6,30,39,25,53820.00
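
A left merge on `Zip Code` silently leaves `NaN` income wherever a school's zip code has no match, and it only matches at all when both keys share the same dtype. Here is a minimal sketch of normalizing the key and counting unmatched schools with `indicator=True`. The helper `normalize_zip`, the toy frames, and the "Toy School" row with zip 99999 are hypothetical.

```python
import pandas as pd


def normalize_zip(series: pd.Series) -> pd.Series:
    """Cast zip codes to 5-character strings so both merge keys share one dtype."""
    return series.astype(str).str.strip().str[:5]


toy_schools = pd.DataFrame({
    "School Name": ["Ocean Air", "Toy School"],
    "Zip Code": ["92130", "99999"],           # strings, as in the entities file
})
toy_income = pd.DataFrame({
    "Median Household Income": [131166.0],
    "Zip Code": [92130],                      # integers, as read from the CSV
})

toy_schools["Zip Code"] = normalize_zip(toy_schools["Zip Code"])
toy_income["Zip Code"] = normalize_zip(toy_income["Zip Code"])

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join.
merged = toy_schools.merge(toy_income, how="left", on="Zip Code", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(f"{len(unmatched)} school(s) without income data")
```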


---------

### Enrollment Dataset, Full-Time Equivalent Teacher, and Pupil/Teacher Ratio
- It contains total enrollment per school for the academic year 2018-2019 in California.
- Data comes from the National Center for Education Statistics.

In [38]:
# load datafile
df_enrollment = pd.read_csv('data/ELSI_total_enrollment_.csv')
df_enrollment

Unnamed: 0,School Name,State Name [Public School] Latest available year,School ID - NCES Assigned [Public School] Latest available year,Agency ID - NCES Assigned [Public School] Latest available year,Total Students All Grades (Excludes AE) [Public School] 2018-19,Full-Time Equivalent (FTE) Teachers [Public School] 2018-19,Pupil/Teacher Ratio [Public School] 2018-19
0,21ST CENTURY LEARNING INSTITUTE,CALIFORNIA,"=""060429013779""","=""0604290""",88,3.60,24.44
1,A PLACE TO GROW,California,"=""062827013394""","=""0628270""",†,–,–
2,A. E. ARNOLD ELEMENTARY,California,"=""061044001166""","=""0610440""",739,27.00,27.37
3,A. G. COOK ELEMENTARY,California,"=""061488001834""","=""0614880""",366,16.00,22.88
4,A. G. CURRIE MIDDLE,California,"=""064015006636""","=""0640150""",611,25.30,24.15
...,...,...,...,...,...,...,...
10436,ZUPANIC HIGH,California,"=""063237010019""","=""0632370""",70,3.00,23.33
10437,Data Source: U.S. Department of Education Nati...,,,,,,
10438,† indicates that the data are not applicable.,,,,,,
10439,– indicates that the data are missing.,,,,,,
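
The export shown above needs some cleaning before it can be merged: the NCES IDs are wrapped in `="..."`, the last rows are footnotes instead of schools, and `†` / `–` stand in for not-applicable and missing values. The sketch below handles all three on a toy frame. The function name `clean_elsi` and the shortened column names are hypothetical; the real headers are the long ones printed above.

```python
import pandas as pd


def clean_elsi(df, id_columns, numeric_columns):
    """Drop footer rows, unwrap ="..." IDs, and coerce placeholder symbols to NaN."""
    out = df.copy()
    # Footer notes fill only the first column, so their ID columns are all empty.
    out = out.dropna(subset=id_columns, how="all")
    for col in id_columns:
        # Remove the leading =" and trailing " while keeping leading zeros.
        out[col] = out[col].astype(str).str.replace(r'^="?|"$', "", regex=True)
    for col in numeric_columns:
        # '†' (not applicable) and '–' (missing) cannot be parsed and become NaN.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out.reset_index(drop=True)


toy = pd.DataFrame({
    "School Name": [
        "21ST CENTURY LEARNING INSTITUTE",
        "A PLACE TO GROW",
        "† indicates that the data are not applicable.",
    ],
    "School ID": ['="060429013779"', '="062827013394"', None],
    "Total Students": ["88", "†", None],
    "Pupil/Teacher Ratio": ["24.44", "–", None],
})
clean = clean_elsi(toy, ["School ID"], ["Total Students", "Pupil/Teacher Ratio"])
print(clean)
```

Keeping the IDs as strings preserves their leading zeros, which would be lost if they were converted to integers.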


---------

### Total Revenue

- It contains total revenue per school district in California. The column headers in the output below indicate the 2016-17 fiscal year.
- Revenue comes from local, state and federal sources.

In [35]:
df_revenue = pd.read_csv('data/ELSI_csv_export_revenue.csv')

In [36]:
df_revenue

Unnamed: 0,Agency Name,State Name [District] Latest available year,Fall Membership (V33) [District Finance] 2016-17,Total Revenue - Local Sources (TLOCREV) [District Finance] 2016-17,Total General Revenue (TOTALREV) [District Finance] 2016-17,Total Revenue - State Sources (TSTREV) [District Finance] 2016-17,Total Revenue - Federal Sources (TFEDREV) [District Finance] 2016-17,Total Current Expenditures - El-Sec Education (TCURELSC) [District Finance] 2016-17,Total Expenditures (TOTALEXP) [District Finance] 2016-17,Total Revenue (TOTALREV) per Pupil (V33) [District Finance] 2016-17,Total Revenue - Local Sources (TLOCREV) per Pupil (V33) [District Finance] 2016-17,Total Revenue - State Sources (TSTREV) per Pupil (V33) [District Finance] 2016-17,Total Revenue - Federal Sources (TFEDREV) per Pupil (V33) [District Finance] 2016-17,Agency ID - NCES Assigned [District] Latest available year
0,ABC UNIFIED,California,20768,52379000,258745000,190473000,15893000,227441000,252804000,12459,2522,9171,765,"=""0601620"""
1,ACALANES UNION HIGH,California,5530,75958000,91214000,13651000,1605000,75057000,84811000,16494,13736,2469,290,"=""0601650"""
2,ACKERMAN CHARTER,California,†,1938000,5686000,3525000,223000,4945000,6018000,†,†,†,†,"=""0601680"""
3,ACTON-AGUA DULCE UNIFIED,California,10016,11199000,35216000,22767000,1250000,28079000,34504000,3516,1118,2273,125,"=""0600001"""
4,ADELANTO ELEMENTARY,California,10288,11183000,127966000,105877000,10906000,111119000,115722000,12438,1087,10291,1060,"=""0601710"""
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1160,YUCAIPA-CALIMESA JOINT UNIFIED,California,9969,23778000,107263000,76380000,7105000,95746000,99066000,10760,2385,7662,713,"=""0643560"""
1161,Data Source: U.S. Department of Education Nati...,,,,,,,,,,,,,
1162,† indicates that the data are not applicable.,,,,,,,,,,,,,
1163,– indicates that the data are missing.,,,,,,,,,,,,,
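
The per-pupil columns in this table appear to be total revenue divided by fall membership. As a sanity check, the sketch below recomputes that ratio for three districts taken from the output above. The shortened column names are hypothetical, and rounding to the nearest dollar is an assumption about how the published figures were produced.

```python
import pandas as pd

# Three rows copied from the revenue table above, with shortened column names.
toy = pd.DataFrame({
    "Agency Name": ["ABC UNIFIED", "ACKERMAN CHARTER", "ACTON-AGUA DULCE UNIFIED"],
    "Fall Membership": ["20768", "†", "10016"],
    "Total Revenue": ["258745000", "5686000", "35216000"],
})

# '†' (not applicable) becomes NaN so arithmetic can proceed.
for col in ["Fall Membership", "Total Revenue"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Districts with no membership figure end up with NaN, not an error.
toy["Revenue per Pupil"] = (toy["Total Revenue"] / toy["Fall Membership"]).round()
print(toy)
```

The recomputed values for ABC Unified and Acton-Agua Dulce Unified match the 12459 and 3516 reported in the table, which supports reading the per-pupil columns this way.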
