   # CAPSTONE 1: EDUCATION PROJECT


   <img src='data/education_image.jpg' width="900">
   
   **Credit:**  [wsimag](https://wsimag.com/culture/60264-education-in-venezuela-the-americas-and-the-world)



In [1]:
# Load relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
import warnings

sns.set(style='ticks')

warnings.filterwarnings("ignore")  # Suppress all warnings

# Introduction

## Business Context
Research shows that high-poverty areas disproportionately educate children of color. A student's chances of attending a high-poverty or high-minority school are strongly determined by race/ethnicity and social class. For instance, African American and Hispanic students, even those who are not poor, are much more likely than white or Asian students to attend high-poverty schools.

A growing body of evidence shows that increased investment in education yields better outcomes and that the positive effects are even greater among low-income students. At the same time, it costs more to give low-income students an education robust enough to overcome their initial disadvantages.


### Goals
1. Understand the current demographics of schools across the state of California, from wealthy to high-poverty.
2. Identify how much funding is available per pupil in wealthy vs high-poverty areas.
3. Learn what factors are most correlated with student performance.


#### Predictive modeling
- What is the average test score per school?
- What percentage of students pass versus do not pass?


# DATA WRANGLING

Data wrangling is the process of transforming and mapping data from a "raw" form into another format that is more appropriate and valuable for downstream purposes such as analytics.

- Extracting and cleaning relevant data. Let's start looking at the datasets!

## 1. Assessment Data

- It contains assessment data from the Smarter Balanced Summative Assessments (2018-2019) for the state of California.

- Legend types can be found here: https://caaspp-elpac.cde.ca.gov/caaspp/research_fixfileformat19
- More information about the assessment setup: https://www.cde.ca.gov/ta/tg/ca/sbsummativefaq.asp

In [2]:
# load datafile
df_all = pd.read_csv('large_data/sb_ca2019_all_csv_v4.txt')

In [3]:
# create dataset containing district level data
# (district rows have School Code == 0 and a nonzero District Code)
df_district = df_all[(df_all['School Code'] == 0) & (df_all['District Code'] != 0)]

In [4]:
# create dataset containing school level data
df_school = df_all[df_all['School Code'] != 0]
df_school.head(10)

Unnamed: 0,County Code,District Code,School Code,Filler,Test Year,Subgroup ID,Test Type,Total Tested At Entity Level,Total Tested with Scores,Grade,...,Area 1 Percentage Below Standard,Area 2 Percentage Above Standard,Area 2 Percentage Near Standard,Area 2 Percentage Below Standard,Area 3 Percentage Above Standard,Area 3 Percentage Near Standard,Area 3 Percentage Below Standard,Area 4 Percentage Above Standard,Area 4 Percentage Near Standard,Area 4 Percentage Below Standard
1888,1,10017,112607,,2019,1,B,85,84,11,...,37.35,12.20,60.98,26.83,7.23,71.08,21.69,13.25,61.45,25.30
1889,1,10017,112607,,2019,3,B,42,42,11,...,35.71,14.29,50.00,35.71,7.14,76.19,16.67,9.52,64.29,26.19
1890,1,10017,112607,,2019,4,B,43,42,11,...,39.02,10.00,72.50,17.50,7.32,65.85,26.83,17.07,58.54,24.39
1891,1,10017,112607,,2019,6,B,79,78,11,...,33.77,13.16,63.16,23.68,7.79,71.43,20.78,14.29,63.64,22.08
1892,1,10017,112607,,2019,7,B,*,*,11,...,*,*,*,*,*,*,*,*,*,*
1893,1,10017,112607,,2019,8,B,38,38,11,...,36.84,18.42,68.42,13.16,7.89,73.68,18.42,18.42,65.79,15.79
1894,1,10017,112607,,2019,31,B,68,67,11,...,40.91,12.31,63.08,24.62,9.09,68.18,22.73,13.64,62.12,24.24
1895,1,10017,112607,,2019,51,B,85,84,11,...,37.35,12.20,60.98,26.83,7.23,71.08,21.69,13.25,61.45,25.30
1896,1,10017,112607,,2019,53,B,85,84,11,...,37.35,12.20,60.98,26.83,7.23,71.08,21.69,13.25,61.45,25.30
1897,1,10017,112607,,2019,74,B,30,29,11,...,28.57,7.41,59.26,33.33,3.57,67.86,28.57,7.14,71.43,21.43


In [5]:
# check columns' names
df_school.columns

Index(['County Code', 'District Code', 'School Code', 'Filler', 'Test Year',
       'Subgroup ID', 'Test Type', 'Total Tested At Entity Level',
       'Total Tested with Scores', 'Grade', 'Test Id',
       'CAASPP Reported Enrollment', 'Students Tested', 'Mean Scale Score',
       'Percentage Standard Exceeded', 'Percentage Standard Met',
       'Percentage Standard Met and Above', 'Percentage Standard Nearly Met',
       'Percentage Standard Not Met', 'Students with Scores',
       'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
       'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
       'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
       'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
       'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
       'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard'],
      dtype='object')

In [6]:
# check data type
df_school.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3013079 entries, 1888 to 3576490
Data columns (total 32 columns):
 #   Column                             Dtype  
---  ------                             -----  
 0   County Code                        int64  
 1   District Code                      int64  
 2   School Code                        int64  
 3   Filler                             float64
 4   Test Year                          int64  
 5   Subgroup ID                        int64  
 6   Test Type                          object 
 7   Total Tested At Entity Level       object 
 8   Total Tested with Scores           object 
 9   Grade                              int64  
 10  Test Id                            int64  
 11  CAASPP Reported Enrollment         object 
 12  Students Tested                    object 
 13  Mean Scale Score                   object 
 14  Percentage Standard Exceeded       object 
 15  Percentage Standard Met            object 
 16  Percentage Stan

In [7]:
# Check for missing data
df_school.isnull().sum()

County Code                                0
District Code                              0
School Code                                0
Filler                               3013079
Test Year                                  0
Subgroup ID                                0
Test Type                                  0
Total Tested At Entity Level               0
Total Tested with Scores                   0
Grade                                      0
Test Id                                    0
CAASPP Reported Enrollment                 0
Students Tested                            0
Mean Scale Score                      797631
Percentage Standard Exceeded               0
Percentage Standard Met                    0
Percentage Standard Met and Above          0
Percentage Standard Nearly Met             0
Percentage Standard Not Met                0
Students with Scores                       0
Area 1 Percentage Above Standard           0
Area 1 Percentage Near Standard            0
Area 1 Per
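
The zero counts above can be misleading: the percentage columns have `object` dtype and the printed rows contain `*` entries, which `isnull()` treats as ordinary strings. A minimal sketch on made-up toy rows, assuming `*` marks a value that is not reported, shows one way to surface them as missing with `pd.to_numeric`:

```python
import pandas as pd

# Toy rows mimicking the assessment file; '*' is assumed to mark an unreported value.
toy = pd.DataFrame({
    'School Code': [112607, 112607, 130401],
    'Percentage Standard Met': ['27.38', '*', '*'],
    'Students with Scores': ['84', '42', '*'],
})

# isnull() finds nothing because '*' is an ordinary string.
missing_before = int(toy.isnull().sum().sum())

# Coerce the text columns to numbers; anything unparseable becomes NaN.
numeric_cols = ['Percentage Standard Met', 'Students with Scores']
cleaned = toy.copy()
cleaned[numeric_cols] = cleaned[numeric_cols].apply(pd.to_numeric, errors='coerce')

missing_after = int(cleaned[numeric_cols].isnull().sum().sum())
print(missing_before, missing_after)
```

After the conversion, the same `isnull().sum()` check counts the placeholder entries as missing.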

In [8]:
# Number of rows where subgroup ID == 1
df_school[df_school['Subgroup ID'] == 1].count()

County Code                          87324
District Code                        87324
School Code                          87324
Filler                                   0
Test Year                            87324
Subgroup ID                          87324
Test Type                            87324
Total Tested At Entity Level         87324
Total Tested with Scores             87324
Grade                                87324
Test Id                              87324
CAASPP Reported Enrollment           87324
Students Tested                      87324
Mean Scale Score                     66727
Percentage Standard Exceeded         87324
Percentage Standard Met              87324
Percentage Standard Met and Above    87324
Percentage Standard Nearly Met       87324
Percentage Standard Not Met          87324
Students with Scores                 87324
Area 1 Percentage Above Standard     87324
Area 1 Percentage Near Standard      87324
Area 1 Percentage Below Standard     87324
Area 2 Perc

In [9]:
# Check number of unique schools
df_school['School Code'].nunique()

10300

- There are 10,300 unique schools!

## Creating two datasets for modeling

- English Language Arts/Literacy: test_id == 1

    - 10,299 rows
    
    
- Mathematics: test_id == 2

    - 10,298 rows



In [10]:
# Keep Grade == 13 rows: the summary across all grades for each school
all_grades = df_school[df_school['Grade'] == 13]

# Keep Subgroup ID == 1 rows: the summary across all students
all_students = all_grades[all_grades['Subgroup ID'] == 1]

### df_language

In [11]:
# Create df_test1: English Language Arts/Literacy rows (Test Id == 1)
df_test1 = all_students[all_students['Test Id'] == 1]

# drop columns that won't be used
df_language = df_test1.drop(columns = ['Filler', 'Test Year', 'Test Type', 'County Code', 'District Code',
                             'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
                             'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
                             'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
                             'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
                             'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
                             'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard', 
                             'Total Tested At Entity Level', 'Total Tested with Scores', 'Grade',
                             'Test Id', 'Students Tested'])

df_language

Unnamed: 0,School Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores
1927,112607,1,90,,7.14,27.38,34.52,36.90,28.57,84
2234,123968,1,142,,7.09,18.11,25.20,29.92,44.88,127
2681,124172,1,239,,71.12,22.41,93.53,4.31,2.16,232
3126,125567,1,201,,28.13,17.71,45.83,18.75,35.42,192
3474,130401,1,*,,*,*,*,*,*,*
...,...,...,...,...,...,...,...,...,...,...
3575148,6056816,1,608,,13.95,37.41,51.36,24.32,24.32,588
3575555,6056832,1,146,,25.76,25.76,51.52,27.27,21.21,132
3575772,6056840,1,74,,31.08,22.97,54.05,18.92,27.03,74
3575957,6118806,1,57,,23.53,41.18,64.71,21.57,13.73,51


### df_math

In [12]:
# Create df_test2 mathematics
df_test2 = all_students[all_students['Test Id'] == 2]

# drop columns with redundant information
df_math = df_test2.drop(columns = ['Filler', 'Test Year', 'Test Type', 'County Code', 'District Code',
                             'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
                             'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
                             'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
                             'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
                             'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
                             'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard',
                             'Total Tested At Entity Level', 'Total Tested with Scores', 'Grade',
                             'Test Id', 'Students Tested'])
df_math

Unnamed: 0,School Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores
2005,112607,1,90,,3.57,7.14,10.71,16.67,72.62,84
2465,123968,1,142,,6.67,11.11,17.78,28.15,54.07,135
2891,124172,1,239,,74.57,19.40,93.97,5.17,0.86,232
3367,125567,1,201,,17.10,18.65,35.75,18.13,46.11,193
3572,130401,1,34,,*,*,*,*,*,4
...,...,...,...,...,...,...,...,...,...,...
3575405,6056816,1,608,,13.78,27.04,40.82,32.48,26.70,588
3575698,6056832,1,146,,9.09,24.24,33.33,40.15,26.52,132
3575838,6056840,1,74,,12.16,31.08,43.24,27.03,29.73,74
3576079,6118806,1,57,,15.69,35.29,50.98,29.41,19.61,51


-------------

### Reorganizing Subgroup ID 

The assessment dataset stores a lot of demographic information in the Subgroup ID column. The dataset needs to be reorganized so that it has one variable per column and one observation per row, and filtered to keep only the demographic information of interest.

#### Before merging:
- Filter variables of interest;
- Rearrange the data to have: 
    - one feature per column; 
    - one observation per row;

This dataset represents the Smarter Balanced Assessments for English Language Arts/Literacy and Mathematics (SB), Test Id 1 and 2. More info about the test can be found here: https://www.caaspp.org/administration/about/testing/index.html

## 1. a. Subgroup ID 

In the legend below, Demographic ID and Demographic ID Num correspond to Subgroup ID in the dataset.

In [13]:
legend = pd.read_csv('data/Subgroups.txt')
legend

Unnamed: 0,Demographic ID,Demographic ID Num,Demographic Name,Student Group
0,1,1,All Students,All Students
1,3,3,Male,Gender
2,4,4,Female,Gender
3,6,6,Fluent English proficient and English only,English-Language Fluency
4,7,7,Initial fluent English proficient (IFEP),English-Language Fluency
5,8,8,Reclassified fluent English proficient (RFEP),English-Language Fluency
6,28,28,Migrant education,Migrant
7,31,31,Economically disadvantaged,Economic Status
8,50,50,Military,Military Status
9,51,51,Not Military,Military Status


-----------

### Next: 
1. Transform demographic information contained in subgroup id column into one variable per column.
    - Drop redundant columns;
    - Remap each variable of interest to its own column;
    - Value is number of students fitting each category.
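
The reshaping described in this list can also be done in one step with `DataFrame.pivot`. The sketch below uses made-up toy data and a hypothetical two-entry mapping; it is an alternative shown for illustration, while the notebook itself uses a merge-based helper.

```python
import pandas as pd

# Toy long-format data: one row per (school, subgroup), values are made up.
long_df = pd.DataFrame({
    'School Code': [1001, 1001, 1002, 1002],
    'Subgroup ID': [3, 4, 3, 4],
    'CAASPP Reported Enrollment': [40, 45, 60, 58],
})

# Hypothetical two-entry mapping from subgroup ID to column name.
subgroup_names = {3: 'Male', 4: 'Female'}

wide_df = (
    long_df[long_df['Subgroup ID'].isin(list(subgroup_names))]
    .pivot(index='School Code', columns='Subgroup ID',
           values='CAASPP Reported Enrollment')   # one column per subgroup
    .rename(columns=subgroup_names)               # IDs -> readable names
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover 'Subgroup ID' axis label
print(wide_df)
```

Each school ends up on a single row with one enrollment column per subgroup. `pivot` raises an error if a (school, subgroup) pair appears more than once, so it assumes the data has already been filtered to a single grade level and test.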

In [14]:
# drop columns that won't be used at ALL_GRADES level
all_grades = all_grades.drop(columns = ['Filler', 'Test Year', 'Test Type', 'County Code', 'District Code',
                             'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
                             'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
                             'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
                             'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
                             'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
                             'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard',
                             'Mean Scale Score', 'Percentage Standard Exceeded', 'Percentage Standard Met',
                             'Percentage Standard Met and Above', 'Percentage Standard Nearly Met',
                             'Percentage Standard Not Met', 'Students with Scores', 'Total Tested At Entity Level',
                             'Total Tested with Scores', 'Students Tested', 'Grade'])

## English Language Arts/Literacy dataset test_id == 1

In [15]:
# Filter subgroup_id and test_id, rename column and merge into main df
def merge_column(subgroup_id, name):
    df = all_grades[(all_grades['Subgroup ID'] == subgroup_id) & (all_grades['Test Id'] == 1)]
    df = df.drop(columns=['Test Id', 'Subgroup ID'])
    df = df.rename({'CAASPP Reported Enrollment': name}, axis=1)
    global df_language
    df_language = df_language.merge(df, how='left', on='School Code')

In [16]:
# Map each subgroup ID of interest to its new column name
# (named subgroup_names to avoid shadowing the built-in dict)
subgroup_names = {3:'Male', 4:'Female', 50:'Military', 51:'Non Military', 52:'Homeless', 53:'Non Homeless', 31:'Disadvantaged',
       111:'Not Disadvantaged', 74:'Black', 75:'Native American', 76:'Asian', 78:'Hispanic', 79:'Pacific Islander',
       80:'White', 144:'Two/More Races', 90:'< High School', 91:'High School Grad', 92:'Some College', 
       93:'College Grad', 94:'Graduate School'}

In [17]:
# Loop through the mapping and merge each subgroup column into df_language
for subgroup_id, column_name in subgroup_names.items():
    merge_column(subgroup_id, column_name)

In [18]:
df_language

Unnamed: 0,School Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores,...,Asian,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School
0,112607,1,90,,7.14,27.38,34.52,36.90,28.57,84,...,*,46,*,4,*,23,20,19,11,*
1,123968,1,142,,7.09,18.11,25.20,29.92,44.88,127,...,10,82,,5,4,28,39,34,14,13
2,124172,1,239,,71.12,22.41,93.53,4.31,2.16,232,...,111,9,,18,94,,*,7,58,164
3,125567,1,201,,28.13,17.71,45.83,18.75,35.42,192,...,14,50,,64,27,*,17,39,59,76
4,130401,1,*,,*,*,*,*,*,*,...,,*,*,*,,*,*,*,*,*
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10294,6056816,1,608,,13.95,37.41,51.36,24.32,24.32,588,...,15,166,5,314,61,32,97,255,156,57
10295,6056832,1,146,,25.76,25.76,51.52,27.27,21.21,132,...,,26,*,83,14,,10,64,44,26
10296,6056840,1,74,,31.08,22.97,54.05,18.92,27.03,74,...,*,27,,40,4,10,15,30,13,6
10297,6118806,1,57,,23.53,41.18,64.71,21.57,13.73,51,...,,15,*,29,8,,*,18,23,15


## Mathematics dataset test_id == 2

In [19]:
# Filter subgroup_id and test_id, rename column and merge into main df
def merge_column(subgroup_id, name):
    df = all_grades[(all_grades['Subgroup ID'] == subgroup_id) & (all_grades['Test Id'] == 2)]
    df = df.drop(columns=['Test Id', 'Subgroup ID'])
    df = df.rename({'CAASPP Reported Enrollment': name}, axis=1)
    global df_math
    df_math = df_math.merge(df, how='left', on='School Code')

In [20]:
# Map each subgroup ID of interest to its new column name
# (named subgroup_names to avoid shadowing the built-in dict)
subgroup_names = {3:'Male', 4:'Female', 50:'Military', 51:'Non Military', 52:'Homeless', 53:'Non Homeless', 31:'Disadvantaged',
       111:'Not Disadvantaged', 74:'Black', 75:'Native American', 76:'Asian', 78:'Hispanic', 79:'Pacific Islander',
       80:'White', 144:'Two/More Races', 90:'< High School', 91:'High School Grad', 92:'Some College', 
       93:'College Grad', 94:'Graduate School'}

In [21]:
# Loop through the mapping and merge each subgroup column into df_math
for subgroup_id, column_name in subgroup_names.items():
    merge_column(subgroup_id, column_name)

In [22]:
df_math

Unnamed: 0,School Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores,...,Asian,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School
0,112607,1,90,,3.57,7.14,10.71,16.67,72.62,84,...,*,46,*,4,*,23,20,19,11,*
1,123968,1,142,,6.67,11.11,17.78,28.15,54.07,135,...,10,82,,5,4,28,39,34,14,13
2,124172,1,239,,74.57,19.40,93.97,5.17,0.86,232,...,111,9,,18,94,,*,7,58,164
3,125567,1,201,,17.10,18.65,35.75,18.13,46.11,193,...,14,50,,64,27,*,17,39,59,76
4,130401,1,34,,*,*,*,*,*,4,...,,*,*,*,,*,*,*,*,*
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10293,6056816,1,608,,13.78,27.04,40.82,32.48,26.70,588,...,15,166,5,314,61,32,97,255,156,57
10294,6056832,1,146,,9.09,24.24,33.33,40.15,26.52,132,...,,26,*,83,14,,10,64,44,26
10295,6056840,1,74,,12.16,31.08,43.24,27.03,29.73,74,...,*,27,,40,4,10,15,30,13,6
10296,6118806,1,57,,15.69,35.29,50.98,29.41,19.61,51,...,,15,*,29,8,,*,18,23,15


---------

## 1. b. Entities Data

- It contains information such as school and district name, zip code and relevant codes to allow merge with the assessment data.

The number of rows in this dataset is close to current figures for the number of schools and districts in the state of CA:

- There are ~ 1,040 school districts in California. 
    - The entities_dist dataset contains 1,087 rows.
- There are ~ 10,588 schools in California. 
    - The df_entities dataset contains 10,300 rows.

In [23]:
df_entities = pd.read_csv('data/sb_ca2019entities_csv.txt')

In [24]:
# create dataset containing entities data at district level
entities_dist = df_entities[df_entities['School Code'] == 0]

# create dataset containing entities data at school level 
df_entities = df_entities.drop(df_entities[df_entities['School Code'] == 0].index) # drop district level data

# drop columns with redundant information or not of use 
df_entities = df_entities.drop(columns = ['Filler', 'Test Year', 'County Code','Type Id', 'District Name', 'County Name'])


In [25]:
df_entities

Unnamed: 0,District Code,School Code,School Name,Zip Code
0,68056,114686,Ocean Air,92130
1,68056,6038111,Del Mar Heights Elementary,92014
2,68056,6088983,Del Mar Hills Elementary,92014
3,68056,6110696,Carmel Del Mar Elementary,92130
4,68056,6115620,Ashley Falls Elementary,92130
...,...,...,...,...
11383,68049,136747,California Academy of Sports Science,91764
11384,68049,138313,University Prep,91764
11385,68049,6038095,Dehesa Elementary,92019
11386,68049,6119564,Dehesa Charter,92026


--------

### Merge df_language to df_entities

These merges add the school name and zip code to df_language and df_math.
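
A left merge like this silently keeps schools that have no assessment row and would duplicate rows if a key repeated. As a hedged sketch on toy data (the third school is hypothetical), the `validate` and `indicator` parameters of `DataFrame.merge` can make both cases visible:

```python
import pandas as pd

# Toy stand-ins for df_entities and df_language; 'Hypothetical School' is made up.
entities = pd.DataFrame({
    'School Code': [114686, 6038111, 999999],
    'School Name': ['Ocean Air', 'Del Mar Heights Elementary', 'Hypothetical School'],
})
scores = pd.DataFrame({
    'School Code': [114686, 6038111],
    'Percentage Standard Met': [19.85, 21.61],
})

# validate raises MergeError if 'School Code' repeats on either side;
# indicator adds a '_merge' column recording where each row came from.
merged = entities.merge(scores, how='left', on='School Code',
                        validate='one_to_one', indicator=True)

# Schools present in the entities file but missing from the scores file.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['School Code', 'School Name']])
```

Dropping the `_merge` column afterwards keeps the merged frame identical to a plain left merge.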

In [26]:
# merge dfs on school code
df_language_merge = df_entities.merge(df_language, how='left', on='School Code')
df_language_merge

Unnamed: 0,District Code,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,...,Asian,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School
0,68056,114686,Ocean Air,92130,1.0,420,,71.19,19.85,91.04,...,175,26,,188,24,,*,4,85,326
1,68056,6038111,Del Mar Heights Elementary,92014,1.0,275,,69.23,21.61,90.84,...,19,23,*,203,25,,*,9,85,178
2,68056,6088983,Del Mar Hills Elementary,92014,1.0,169,,61.35,25.15,86.50,...,13,33,,105,17,*,4,13,47,99
3,68056,6110696,Carmel Del Mar Elementary,92130,1.0,299,,68.03,20.75,88.78,...,91,34,,148,21,*,*,10,73,203
4,68056,6115620,Ashley Falls Elementary,92130,1.0,331,,51.85,29.01,80.86,...,100,29,*,171,15,,*,18,63,241
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10295,68049,136747,California Academy of Sports Science,91764,1.0,22,,*,*,*,...,,*,,*,,*,*,*,*,*
10296,68049,138313,University Prep,91764,1.0,92,,17.78,28.89,46.67,...,*,32,,43,*,*,15,18,25,*
10297,68049,6038095,Dehesa Elementary,92019,1.0,95,,6.52,25.00,31.52,...,,26,,37,18,8,8,37,14,10
10298,68049,6119564,Dehesa Charter,92026,1.0,108,,16.13,30.11,46.24,...,*,27,,67,6,,6,30,39,25


In [27]:
df_language_merge.columns

Index(['District Code', 'School Code', 'School Name', 'Zip Code',
       'Subgroup ID', 'CAASPP Reported Enrollment', 'Mean Scale Score',
       'Percentage Standard Exceeded', 'Percentage Standard Met',
       'Percentage Standard Met and Above', 'Percentage Standard Nearly Met',
       'Percentage Standard Not Met', 'Students with Scores', 'Male', 'Female',
       'Military', 'Non Military', 'Homeless', 'Non Homeless', 'Disadvantaged',
       'Not Disadvantaged', 'Black', 'Native American', 'Asian', 'Hispanic',
       'Pacific Islander', 'White', 'Two/More Races', '< High School',
       'High School Grad', 'Some College', 'College Grad', 'Graduate School'],
      dtype='object')

In [28]:
# percentage of missing data per column
percent_missing = (df_language_merge.isnull().sum() * 100 / len(df_language_merge)).round(2)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})

# sort columns by percent missing, in ascending order
missing_value_df.sort_values('percent_missing', inplace=True)
missing_value_df

Unnamed: 0,percent_missing
District Code,0.0
School Code,0.0
School Name,0.0
Zip Code,0.0
Students with Scores,0.01
Percentage Standard Not Met,0.01
Percentage Standard Nearly Met,0.01
Percentage Standard Met and Above,0.01
Non Military,0.01
Percentage Standard Exceeded,0.01


### Merge df_math to df_entities

In [29]:
df_math_merge = df_entities.merge(df_math, how='left', on='School Code')
df_math_merge

Unnamed: 0,District Code,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,...,Asian,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School
0,68056,114686,Ocean Air,92130,1.0,420,,76.44,14.90,91.35,...,175,26,,188,24,,*,4,85,326
1,68056,6038111,Del Mar Heights Elementary,92014,1.0,275,,67.40,22.34,89.74,...,19,23,*,203,25,,*,9,85,178
2,68056,6088983,Del Mar Hills Elementary,92014,1.0,169,,54.49,28.14,82.63,...,13,33,,105,17,*,4,13,47,99
3,68056,6110696,Carmel Del Mar Elementary,92130,1.0,299,,70.71,17.85,88.55,...,91,34,,148,21,*,*,10,73,203
4,68056,6115620,Ashley Falls Elementary,92130,1.0,331,,55.49,25.61,81.10,...,100,29,*,171,15,,*,18,63,241
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10295,68049,136747,California Academy of Sports Science,91764,1.0,22,,*,*,*,...,,*,,*,,*,*,*,*,*
10296,68049,138313,University Prep,91764,1.0,92,,6.82,9.09,15.91,...,*,32,,43,*,*,15,18,25,*
10297,68049,6038095,Dehesa Elementary,92019,1.0,95,,11.96,19.57,31.52,...,,26,,37,18,8,8,37,14,10
10298,68049,6119564,Dehesa Charter,92026,1.0,108,,5.43,13.04,18.48,...,*,27,,67,6,,6,30,39,25


In [30]:
# percentage of missing data per column
percent_missing = (df_math_merge.isnull().sum() * 100 / len(df_math_merge)).round(2)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})

# sort columns by percent missing, in ascending order
missing_value_df.sort_values('percent_missing', inplace=True)
missing_value_df

Unnamed: 0,percent_missing
District Code,0.0
School Code,0.0
School Name,0.0
Zip Code,0.0
Students with Scores,0.02
Percentage Standard Not Met,0.02
Percentage Standard Nearly Met,0.02
Percentage Standard Met and Above,0.02
Non Military,0.02
Percentage Standard Exceeded,0.02


----------

## 2. Median household income by zipcode
- Year 2014
- California

Source: http://www.usa.com/rank/california-state--median-household-income--zip-code-rank.htm?yr=9000&dis=&wist=&plow=&phigh=

In [31]:
# load csv file
income = pd.read_csv('data/median_income_zipcode.csv')

# drop columns Rank and population
income = income.drop(columns = ['Rank', 'Population'])

# rename columns to merge with zip code from the main dfs
income.columns = ['Median Household Income', 'Zip Code']

# convert zip code from int to str
income['Zip Code'] = income['Zip Code'].astype(str)

income

Unnamed: 0,Median Household Income,Zip Code
0,236912.00,94027
1,228587.00,92145
2,200325.00,91980
3,187857.00,94957
4,182750.00,94022
...,...,...
1681,11922.00,93721
1682,11250.00,93530
1683,10625.00,90089
1684,10481.00,95915
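
Because the merge key is a zip code, both sides need the same dtype and the same text format before joining. The following is a minimal sketch on toy data; `normalize_zip` is a hypothetical helper, and the stray whitespace and ZIP+4 suffix are assumptions about what real data might contain, not something observed in these files.

```python
import pandas as pd

def normalize_zip(series):
    """Return zip codes as 5-character strings (hypothetical helper)."""
    return series.astype(str).str.strip().str[:5]

# Toy school rows with made-up names and deliberately messy zip codes.
schools = pd.DataFrame({
    'School Name': ['Hypothetical A', 'Hypothetical B', 'Hypothetical C'],
    'Zip Code': ['94027 ', '94022-1234', '99999'],
})
# Income rows keyed by integer zip code, as loaded from a CSV.
income_toy = pd.DataFrame({
    'Median Household Income': [236912.0, 182750.0],
    'Zip Code': [94027, 94022],
})

schools['Zip Code'] = normalize_zip(schools['Zip Code'])
income_toy['Zip Code'] = normalize_zip(income_toy['Zip Code'])

merged = schools.merge(income_toy, how='left', on='Zip Code')

# Share of schools that received an income value.
match_rate = merged['Median Household Income'].notna().mean()
print(f'{match_rate:.0%} of schools matched an income row')
```

A low match rate after a merge like this would suggest a key format problem rather than genuinely missing income data.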


-----------

### Merge df_language and df_math to income

In [32]:
# Add median household income by merging df_language to income df
df_language_merge = df_language_merge.merge(income, how='left', on='Zip Code')

# Sort values
df_language_merge = df_language_merge.sort_values('School Name', ascending=True)

df_language_merge

Unnamed: 0,District Code,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,...,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School,Median Household Income
8740,66993,129882,21st Century Learning Institute,92223,1.0,58,,8.93,35.71,44.64,...,33,,18,,6,13,22,9,4,64738.00
8063,66480,6027767,A. E. Arnold Elementary,90630,1.0,447,,37.59,28.02,65.60,...,131,*,111,7,15,44,85,138,121,84051.00
8119,66522,6028211,A. G. Cook Elementary,92844,1.0,192,,48.39,32.80,81.18,...,43,,10,6,*,13,14,33,6,48345.00
8480,73643,6085377,A. G. Currie Middle,92780,1.0,585,,6.60,23.78,30.38,...,532,*,15,*,223,168,82,39,15,64089.00
2450,69369,6046114,A. J. Dorsa Elementary,95122,1.0,184,,11.60,18.23,29.83,...,166,,*,*,82,55,24,15,*,57470.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7254,75309,136531,iLEAD Online,93510,1.0,52,,31.82,18.18,50.00,...,14,,22,*,*,*,11,18,12,89403.00
7258,75309,138297,iLead Agua Dulce,91390,1.0,64,,12.07,25.86,37.93,...,24,,30,9,,*,17,16,10,105659.00
7197,73452,120600,iQ Academy California-Los Angeles,93065,1.0,405,,9.09,29.75,38.84,...,36,8,166,33,,9,25,18,5,94173.00
780,10397,120717,one.Charter,95206,1.0,184,,0.00,6.15,6.15,...,112,*,16,13,63,47,35,10,9,42404.00


In [33]:
# Add median household income by merging df_math to income df
df_math_merge = df_math_merge.merge(income, how='left', on='Zip Code')

# Sort values
df_math_merge = df_math_merge.sort_values('School Name', ascending=True)

df_math_merge

Unnamed: 0,District Code,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,...,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School,Median Household Income
8740,66993,129882,21st Century Learning Institute,92223,1.0,58,,1.79,8.93,10.71,...,33,,18,,6,13,22,9,4,64738.00
8063,66480,6027767,A. E. Arnold Elementary,90630,1.0,447,,36.36,27.05,63.41,...,131,*,111,7,15,44,85,138,121,84051.00
8119,66522,6028211,A. G. Cook Elementary,92844,1.0,192,,46.28,25.53,71.81,...,43,,10,6,*,13,14,33,6,48345.00
8480,73643,6085377,A. G. Currie Middle,92780,1.0,585,,8.06,10.29,18.35,...,532,*,15,*,223,168,82,39,15,64089.00
2450,69369,6046114,A. J. Dorsa Elementary,95122,1.0,184,,11.05,14.36,25.41,...,166,,*,*,82,55,24,15,*,57470.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7254,75309,136531,iLEAD Online,93510,1.0,52,,9.09,9.09,18.18,...,14,,22,*,*,*,11,18,12,89403.00
7258,75309,138297,iLead Agua Dulce,91390,1.0,64,,6.90,20.69,27.59,...,24,,30,9,,*,17,16,10,105659.00
7197,73452,120600,iQ Academy California-Los Angeles,93065,1.0,405,,5.79,7.71,13.50,...,36,8,166,33,,9,25,18,5,94173.00
780,10397,120717,one.Charter,95206,1.0,184,,0.00,0.00,0.00,...,114,*,16,13,63,48,35,9,9,42404.00


---------

## 3. Enrollment Dataset, Full-Time Equivalent Teacher, and Pupil/Teacher Ratio
- It contains total enrollment per school for the academic year 2018-2019 in California.
- Data comes from the National Center for Education Statistics.

In [34]:
# load datafile
df_enrollment = pd.read_csv('data/ELSI_enrollment_fte_pupil_teacher.csv')

# Rename columns
df_enrollment.columns = ['School Name1', 'State', 'School Name', 'Agency Id', 'School Id', 'Total Students',
                        'Free Lunch Eligible', 'Reduced-price Lunch', 'Free and Reduced-price', 'Full-Time Teachers',
                        'Pupil/Teacher Ratio']

# Drop columns
df_enrollment = df_enrollment.drop(columns = ['School Name1', 'State', 'Agency Id', 'School Id'])

# Sort values
df_enrollment = df_enrollment.sort_values('School Name', ascending=True)

# Strip the ="..." wrapper from cell values, keeping only the quoted text
regex_pattern = r'="(.*)"'
regex_group = r'\1'

df_enrollment = df_enrollment.replace(to_replace=regex_pattern, value=regex_group, regex=True)
df_enrollment

Unnamed: 0,School Name,Total Students,Free Lunch Eligible,Reduced-price Lunch,Free and Reduced-price,Full-Time Teachers,Pupil/Teacher Ratio
0,21st Century Learning Institute,88,40,4,44,3.60,24.44
1,A Place to Grow,†,†,†,†,–,–
2,A. E. Arnold Elementary,739,246,37,283,27.00,27.37
3,A. G. Cook Elementary,366,187,37,224,16.00,22.88
4,A. G. Currie Middle,611,422,31,453,25.30,24.15
...,...,...,...,...,...,...,...
4220,iLEAD Lancaster Charter,728,451,47,498,23.10,31.52
4221,iLEAD Online,73,28,7,35,4.49,16.26
4218,iLead Agua Dulce,119,35,3,38,5.30,22.45
4312,iQ Academy California-Los Angeles,702,373,55,428,27.50,25.53
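
To illustrate what the regex replacement does, the sketch below applies the same pattern to made-up values and then converts the count columns to numbers. Treating `†` and `–` as "not available" placeholders is an assumption based on how they appear in the output above.

```python
import pandas as pd

# Toy rows: some values wrapped as ="...", some placeholders (all made up).
raw = pd.DataFrame({
    'School Name': ['="Hypothetical Elementary"', '="Hypothetical Middle"'],
    'Total Students': ['="739"', '†'],
    'Pupil/Teacher Ratio': ['27.37', '–'],
})

# Same pattern as the notebook: keep whatever sits between =" and ".
stripped = raw.replace(to_replace=r'="(.*)"', value=r'\1', regex=True)

# Placeholders cannot be parsed as numbers, so errors='coerce' turns them into NaN.
numeric_cols = ['Total Students', 'Pupil/Teacher Ratio']
stripped[numeric_cols] = stripped[numeric_cols].apply(pd.to_numeric, errors='coerce')

print(stripped)
```

Values without the wrapper pass through the replacement unchanged, so the step is safe to apply to every column.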


## Merge df_enrollment to df_language and df_math

In [35]:
# Merge df_language_merge to df_enrollment
#language_df = df_language_merge.merge(df_enrollment, how='left', on='School Name')

# Sort values
#language_df = language_df.sort_values('School Name', ascending=True)

#language_df
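
The merge above is left commented out. One possible concern with joining on `School Name`, offered here as an assumption and not as the author's stated reason, is that names can repeat across districts, which makes a left merge produce extra rows. A small sketch with made-up names and counts:

```python
import pandas as pd

# Toy frames; school names and student counts are made up.
left = pd.DataFrame({
    'School Code': [1, 2],
    'School Name': ['Lincoln Elementary', 'Hypothetical Charter'],
})
right = pd.DataFrame({
    'School Name': ['Lincoln Elementary', 'Lincoln Elementary', 'Hypothetical Charter'],
    'Total Students': [300, 450, 700],
})

# Names that occur more than once in the lookup table.
dup_names = right.loc[right['School Name'].duplicated(keep=False), 'School Name'].unique()

# Each repeated name matches every copy, so the result has more rows than `left`.
merged = left.merge(right, how='left', on='School Name')
print(list(dup_names), len(left), len(merged))
```

Checking for repeated keys before the merge, or joining on a unique identifier when one is shared by both files, avoids this kind of row inflation.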

---------

## 4. Total Revenue

- It contains total revenue per school district in California for the academic year 2018-2019.
- Revenue comes from local, state and federal sources.

In [36]:
# Load dataset
df_revenue = pd.read_csv('data/ELSI_revenue_details.csv')

# Rename columns
df_revenue.columns = ['Agency Name', 'State', 'District Code', 'Latitude', 'Longitude', 'Total Revenue',
                     'Total Revenue per Pupil', 'Total Expenditures per Pupil']

# drop columns
df_revenue = df_revenue.drop(columns = 'State')

# Strip the ="..." wrapper from cell values, keeping only the quoted text
regex_pattern = r'="(.*)"'
regex_group = r'\1'

df_revenue = df_revenue.replace(to_replace=regex_pattern, value=regex_group, regex=True)

df_revenue

Unnamed: 0,Agency Name,District Code,Latitude,Longitude,Total Revenue,Total Revenue per Pupil,Total Expenditures per Pupil
0,ABC UNIFIED,0601620,33.879715,-118.071463,265547000,12922,12316
1,ACALANES UNION HIGH,0601650,37.905787,-122.099170,95322000,16835,15554
2,ACKERMAN CHARTER,0601680,38.934997,-121.055904,5890000,†,†
3,ACTON-AGUA DULCE UNIFIED,0600001,34.472708,-118.196768,56618000,3811,3425
4,ADELANTO ELEMENTARY,0601710,34.572373,-117.406551,111722000,12831,12322
...,...,...,...,...,...,...,...
1151,YREKA UNION ELEMENTARY,0643380,41.727428,-122.640968,12239000,12251,11710
1152,YREKA UNION HIGH,0643410,41.741154,-122.637652,10060000,16438,18830
1153,YUBA CITY UNIFIED,0643470,39.133900,-121.633400,169251000,12787,13286
1154,YUBA COUNTY OFFICE OF EDUCATION,0691048,39.149990,-121.600051,32157000,51699,49889
