   # CAPSTONE 1: EDUCATION PROJECT


   <img src='data/education_image.jpg' width="900">
   
   **Credit:**  [wsimag](https://wsimag.com/culture/60264-education-in-venezuela-the-americas-and-the-world)



In [1]:
# Load relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
import warnings

sns.set(style='ticks')

warnings.filterwarnings("ignore")  # Suppress all warnings

# Introduction

## Business Context
Research shows that high-poverty areas disproportionately educate children of color. A student's race/ethnicity and social class strongly determine the chances of ending up in a high-poverty or high-minority school. For instance, African American and Hispanic students, even if they are not poor, are much more likely than White or Asian students to be in high-poverty schools.

There is a growing body of evidence that increased investment in education yields better outcomes, and that the positive effects are even greater among low-income students. On the other hand, it costs more to educate low-income students and provide them with a robust education capable of overcoming their initial disadvantages.


### Goals
1. Understand the current demographics of schools across the state of California, from wealthy to high-poverty.
2. Identify how much funding is available per pupil in wealthy vs high-poverty areas.
3. Learn what factors are most correlated with student performance.


#### Predictive modeling
- What is the average test score per school?
- What percentage of students pass or do not pass?


# DATA WRANGLING

Data wrangling is the process of transforming and mapping data from a "raw" form into another format, with the intent of making it more appropriate and valuable for downstream purposes such as analytics.

- Extracting relevant data. Let's start looking at the datasets!

## 1. Assessment Data

- It contains assessment data for the Smarter Balanced Summative Assessments (2018-2019) for the state of California.

- Legend types can be found here: https://caaspp-elpac.cde.ca.gov/caaspp/research_fixfileformat19
- More information about the assessment setup: https://www.cde.ca.gov/ta/tg/ca/sbsummativefaq.asp

In [2]:
# load datafile
df_all = pd.read_csv('large_data/sb_ca2019_all_csv_v4.txt')

In [3]:
# create dataset containing district level data
# (district rows carry a district code but no school code)
df_district = df_all[(df_all['District Code'] != 0) & (df_all['School Code'] == 0)]

In [4]:
# create dataset containing school level data (rows with a non-zero school code)
df_school = df_all[df_all['School Code'] != 0]
df_school.head(10)

Unnamed: 0,County Code,District Code,School Code,Filler,Test Year,Subgroup ID,Test Type,Total Tested At Entity Level,Total Tested with Scores,Grade,...,Area 1 Percentage Below Standard,Area 2 Percentage Above Standard,Area 2 Percentage Near Standard,Area 2 Percentage Below Standard,Area 3 Percentage Above Standard,Area 3 Percentage Near Standard,Area 3 Percentage Below Standard,Area 4 Percentage Above Standard,Area 4 Percentage Near Standard,Area 4 Percentage Below Standard
1888,1,10017,112607,,2019,1,B,85,84,11,...,37.35,12.20,60.98,26.83,7.23,71.08,21.69,13.25,61.45,25.30
1889,1,10017,112607,,2019,3,B,42,42,11,...,35.71,14.29,50.00,35.71,7.14,76.19,16.67,9.52,64.29,26.19
1890,1,10017,112607,,2019,4,B,43,42,11,...,39.02,10.00,72.50,17.50,7.32,65.85,26.83,17.07,58.54,24.39
1891,1,10017,112607,,2019,6,B,79,78,11,...,33.77,13.16,63.16,23.68,7.79,71.43,20.78,14.29,63.64,22.08
1892,1,10017,112607,,2019,7,B,*,*,11,...,*,*,*,*,*,*,*,*,*,*
1893,1,10017,112607,,2019,8,B,38,38,11,...,36.84,18.42,68.42,13.16,7.89,73.68,18.42,18.42,65.79,15.79
1894,1,10017,112607,,2019,31,B,68,67,11,...,40.91,12.31,63.08,24.62,9.09,68.18,22.73,13.64,62.12,24.24
1895,1,10017,112607,,2019,51,B,85,84,11,...,37.35,12.20,60.98,26.83,7.23,71.08,21.69,13.25,61.45,25.30
1896,1,10017,112607,,2019,53,B,85,84,11,...,37.35,12.20,60.98,26.83,7.23,71.08,21.69,13.25,61.45,25.30
1897,1,10017,112607,,2019,74,B,30,29,11,...,28.57,7.41,59.26,33.33,3.57,67.86,28.57,7.14,71.43,21.43


In [5]:
# check columns names
df_school.columns

Index(['County Code', 'District Code', 'School Code', 'Filler', 'Test Year',
       'Subgroup ID', 'Test Type', 'Total Tested At Entity Level',
       'Total Tested with Scores', 'Grade', 'Test Id',
       'CAASPP Reported Enrollment', 'Students Tested', 'Mean Scale Score',
       'Percentage Standard Exceeded', 'Percentage Standard Met',
       'Percentage Standard Met and Above', 'Percentage Standard Nearly Met',
       'Percentage Standard Not Met', 'Students with Scores',
       'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
       'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
       'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
       'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
       'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
       'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard'],
      dtype='object')

In [6]:
# check data type
df_school.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3013079 entries, 1888 to 3576490
Data columns (total 32 columns):
 #   Column                             Dtype  
---  ------                             -----  
 0   County Code                        int64  
 1   District Code                      int64  
 2   School Code                        int64  
 3   Filler                             float64
 4   Test Year                          int64  
 5   Subgroup ID                        int64  
 6   Test Type                          object 
 7   Total Tested At Entity Level       object 
 8   Total Tested with Scores           object 
 9   Grade                              int64  
 10  Test Id                            int64  
 11  CAASPP Reported Enrollment         object 
 12  Students Tested                    object 
 13  Mean Scale Score                   object 
 14  Percentage Standard Exceeded       object 
 15  Percentage Standard Met            object 
 16  Percentage Stan

In [7]:
# Check for missing data
df_school.isnull().sum()

County Code                                0
District Code                              0
School Code                                0
Filler                               3013079
Test Year                                  0
Subgroup ID                                0
Test Type                                  0
Total Tested At Entity Level               0
Total Tested with Scores                   0
Grade                                      0
Test Id                                    0
CAASPP Reported Enrollment                 0
Students Tested                            0
Mean Scale Score                      797631
Percentage Standard Exceeded               0
Percentage Standard Met                    0
Percentage Standard Met and Above          0
Percentage Standard Nearly Met             0
Percentage Standard Not Met                0
Students with Scores                       0
Area 1 Percentage Above Standard           0
Area 1 Percentage Near Standard            0
Area 1 Per
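
Note that `isnull()` only counts true missing values. The sample rows above contain `*` placeholders, which appear to mark suppressed results. They are stored as strings, which is also why the score columns have `object` dtype. A minimal sketch on a toy frame with made-up values, assuming `*` is the only non-numeric placeholder in these columns:

```python
import pandas as pd

# Toy stand-in for two assessment columns; values are made up for illustration.
toy = pd.DataFrame({
    'Students Tested': ['84', '*', '42'],
    'Percentage Standard Met': ['27.38', '*', '18.11'],
})

# '*' is an ordinary string, so isnull() does not report it as missing.
print(toy.isnull().sum())

# Coerce to numeric: anything that cannot be parsed (here '*') becomes NaN.
toy_numeric = toy.apply(pd.to_numeric, errors='coerce')
print(toy_numeric.isnull().sum())
print(toy_numeric.dtypes)
```

After coercion the placeholders show up in the missing-value counts and the columns can be used in arithmetic.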

- Verifying the number of rows where Subgroup ID == 1:
    - These rows summarize all students per school

In [8]:
# Number of rows where subgroup ID == 1
df_school[df_school['Subgroup ID'] == 1].count()

County Code                          87324
District Code                        87324
School Code                          87324
Filler                                   0
Test Year                            87324
Subgroup ID                          87324
Test Type                            87324
Total Tested At Entity Level         87324
Total Tested with Scores             87324
Grade                                87324
Test Id                              87324
CAASPP Reported Enrollment           87324
Students Tested                      87324
Mean Scale Score                     66727
Percentage Standard Exceeded         87324
Percentage Standard Met              87324
Percentage Standard Met and Above    87324
Percentage Standard Nearly Met       87324
Percentage Standard Not Met          87324
Students with Scores                 87324
Area 1 Percentage Above Standard     87324
Area 1 Percentage Near Standard      87324
Area 1 Percentage Below Standard     87324
Area 2 Perc

In [9]:
# Check number of unique schools
df_school['School Code'].nunique()

10300

- There are 10,300 schools!

## Creating two datasets for modeling

- Language Arts & Literature: test_id == 1

    - 10,299 rows
    
    
- Mathematics: test_id == 2

    - 10,298 rows



In [10]:
# Keep rows where Grade == 13: the summary of all grades per school
all_grades = df_school[df_school['Grade'] == 13]

# Keep rows where Subgroup ID == 1: the summary of all students
all_students = all_grades[all_grades['Subgroup ID'] == 1]

- Language Arts & Literature Dataset:

In [11]:
# Create df_test1 language arts & literature 
df_test1 = all_students[all_students['Test Id'] == 1]

# drop columns that won't be used
df_language = df_test1.drop(columns = ['Filler', 'Test Year', 'Test Type', 'County Code', 'District Code',
                             'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
                             'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
                             'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
                             'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
                             'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
                             'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard', 
                             'Total Tested At Entity Level', 'Total Tested with Scores', 'Grade',
                             'Test Id', 'Students Tested'])

df_language

Unnamed: 0,School Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores
1927,112607,1,90,,7.14,27.38,34.52,36.90,28.57,84
2234,123968,1,142,,7.09,18.11,25.20,29.92,44.88,127
2681,124172,1,239,,71.12,22.41,93.53,4.31,2.16,232
3126,125567,1,201,,28.13,17.71,45.83,18.75,35.42,192
3474,130401,1,*,,*,*,*,*,*,*
...,...,...,...,...,...,...,...,...,...,...
3575148,6056816,1,608,,13.95,37.41,51.36,24.32,24.32,588
3575555,6056832,1,146,,25.76,25.76,51.52,27.27,21.21,132
3575772,6056840,1,74,,31.08,22.97,54.05,18.92,27.03,74
3575957,6118806,1,57,,23.53,41.18,64.71,21.57,13.73,51


- Mathematics Dataset:

In [12]:
# Create df_test2 mathematics
df_test2 = all_students[all_students['Test Id'] == 2]

# drop columns with redundant information
df_math = df_test2.drop(columns = ['Filler', 'Test Year', 'Test Type', 'County Code', 'District Code',
                             'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
                             'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
                             'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
                             'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
                             'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
                             'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard',
                             'Total Tested At Entity Level', 'Total Tested with Scores', 'Grade',
                             'Test Id', 'Students Tested'])
df_math

Unnamed: 0,School Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores
2005,112607,1,90,,3.57,7.14,10.71,16.67,72.62,84
2465,123968,1,142,,6.67,11.11,17.78,28.15,54.07,135
2891,124172,1,239,,74.57,19.40,93.97,5.17,0.86,232
3367,125567,1,201,,17.10,18.65,35.75,18.13,46.11,193
3572,130401,1,34,,*,*,*,*,*,4
...,...,...,...,...,...,...,...,...,...,...
3575405,6056816,1,608,,13.78,27.04,40.82,32.48,26.70,588
3575698,6056832,1,146,,9.09,24.24,33.33,40.15,26.52,132
3575838,6056840,1,74,,12.16,31.08,43.24,27.03,29.73,74
3576079,6118806,1,57,,15.69,35.29,50.98,29.41,19.61,51
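
The two cells above differ only in the `Test Id` value. One way to avoid repeating the column list is a small helper. The sketch below runs on a toy frame with a shortened column list; `UNUSED_COLUMNS` and `subject_frame` are names introduced here for illustration and are not part of the notebook.

```python
import pandas as pd

# Shortened illustrative list; the notebook drops many more columns.
UNUSED_COLUMNS = ['Filler', 'Test Year', 'Test Type', 'Grade', 'Test Id']

def subject_frame(all_students, test_id, unused=UNUSED_COLUMNS):
    """Return the rows for one test (1 or 2) without the unused columns."""
    subset = all_students[all_students['Test Id'] == test_id]
    # errors='ignore' skips names that are absent from the frame.
    return subset.drop(columns=unused, errors='ignore')

# Toy data with made-up values.
toy = pd.DataFrame({
    'School Code': [112607, 112607, 123968],
    'Test Id': [1, 2, 1],
    'Grade': [13, 13, 13],
    'Filler': [None, None, None],
    'Students with Scores': ['84', '84', '127'],
})

toy_language = subject_frame(toy, test_id=1)
toy_math = subject_frame(toy, test_id=2)
print(toy_language)
```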


-------------

### Reorganizing Subgroup ID 

The assessment dataset stores a lot of demographic information in the Subgroup ID column. The dataset needs to be reorganized so that it has one variable per column and one observation per row, and filtered to keep only the demographic information of interest.

#### Before merging:
- Filter variables of interest;
- Rearrange the data to have: 
    - one feature per column; 
    - one observation per row;

This dataset represents the Smarter Balanced Assessments (SB) for English Language Arts/Literacy and Mathematics, identified by Test Id 1 and 2 respectively. More info about the test can be found here: https://www.caaspp.org/administration/about/testing/index.html

## 1. a. Subgroup ID 

In the legend below, Demographic ID and Demographic ID Num correspond to the Subgroup ID column of the assessment dataset.

In [13]:
legend = pd.read_csv('data/Subgroups.txt')
legend

Unnamed: 0,Demographic ID,Demographic ID Num,Demographic Name,Student Group
0,1,1,All Students,All Students
1,3,3,Male,Gender
2,4,4,Female,Gender
3,6,6,Fluent English proficient and English only,English-Language Fluency
4,7,7,Initial fluent English proficient (IFEP),English-Language Fluency
5,8,8,Reclassified fluent English proficient (RFEP),English-Language Fluency
6,28,28,Migrant education,Migrant
7,31,31,Economically disadvantaged,Economic Status
8,50,50,Military,Military Status
9,51,51,Not Military,Military Status


-----------

### Next: 
1. Transform demographic information contained in subgroup id column into one variable per column.
    - Drop redundant columns;
    - Remap each variable of interest to its own column;
    - Value is number of students fitting each category.
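
The reshaping described above can also be sketched with a pivot instead of one merge per subgroup. This is an alternative illustration on made-up numbers, not the approach used in the cells that follow, and it assumes each (school, subgroup) pair appears at most once, since `pivot` raises an error on duplicates.

```python
import pandas as pd

# Toy long-format data: one row per (school, subgroup). Values are made up.
long_df = pd.DataFrame({
    'School Code': [112607, 112607, 123968, 123968],
    'Subgroup ID': [3, 4, 3, 4],
    'CAASPP Reported Enrollment': [42, 43, 70, 72],
})

# Subset of the subgroup legend, used only for illustration.
labels = {3: 'Male', 4: 'Female'}

wide_df = (
    long_df[long_df['Subgroup ID'].isin(list(labels))]
    .pivot(index='School Code', columns='Subgroup ID',
           values='CAASPP Reported Enrollment')
    .rename(columns=labels)
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover 'Subgroup ID' axis label
print(wide_df)
```

Each subgroup becomes its own column, with the enrollment count as the cell value and one row per school.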

In [14]:
# drop columns that won't be used at ALL_GRADES level
all_grades = all_grades.drop(columns = ['Filler', 'Test Year', 'Test Type', 'County Code', 'District Code',
                             'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
                             'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
                             'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
                             'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
                             'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
                             'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard',
                             'Mean Scale Score', 'Percentage Standard Exceeded', 'Percentage Standard Met',
                             'Percentage Standard Met and Above', 'Percentage Standard Nearly Met',
                             'Percentage Standard Not Met', 'Students with Scores', 'Total Tested At Entity Level',
                             'Total Tested with Scores', 'Students Tested', 'Grade'])

**Language Arts & Literature Dataset:**

In [15]:
# Filter subgroup_id and test_id, rename column and merge into main df
def merge_column(subgroup_id, name):
    df = all_grades[(all_grades['Subgroup ID'] == subgroup_id) & (all_grades['Test Id'] == 1)]
    df = df.drop(columns=['Test Id', 'Subgroup ID'])
    df = df.rename({'CAASPP Reported Enrollment': name}, axis=1)
    global df_language
    df_language = df_language.merge(df, how='left', on='School Code')

In [16]:
# Map each Subgroup ID of interest to its new column name
# (named subgroup_columns to avoid shadowing the built-in dict)
subgroup_columns = {3:'Male', 4:'Female', 50:'Military', 51:'Non Military', 52:'Homeless', 53:'Non Homeless', 31:'Disadvantaged',
       111:'Not Disadvantaged', 74:'Black', 75:'Native American', 76:'Asian', 78:'Hispanic', 79:'Pacific Islander',
       80:'White', 144:'Two/More Races', 90:'< High School', 91:'High School Grad', 92:'Some College', 
       93:'College Grad', 94:'Graduate School'}

In [17]:
# Loop through the mapping and merge each subgroup column into df_language
for subgroup_id, column_name in subgroup_columns.items():
    merge_column(subgroup_id, column_name)

In [18]:
df_language

Unnamed: 0,School Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores,...,Asian,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School
0,112607,1,90,,7.14,27.38,34.52,36.90,28.57,84,...,*,46,*,4,*,23,20,19,11,*
1,123968,1,142,,7.09,18.11,25.20,29.92,44.88,127,...,10,82,,5,4,28,39,34,14,13
2,124172,1,239,,71.12,22.41,93.53,4.31,2.16,232,...,111,9,,18,94,,*,7,58,164
3,125567,1,201,,28.13,17.71,45.83,18.75,35.42,192,...,14,50,,64,27,*,17,39,59,76
4,130401,1,*,,*,*,*,*,*,*,...,,*,*,*,,*,*,*,*,*
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10294,6056816,1,608,,13.95,37.41,51.36,24.32,24.32,588,...,15,166,5,314,61,32,97,255,156,57
10295,6056832,1,146,,25.76,25.76,51.52,27.27,21.21,132,...,,26,*,83,14,,10,64,44,26
10296,6056840,1,74,,31.08,22.97,54.05,18.92,27.03,74,...,*,27,,40,4,10,15,30,13,6
10297,6118806,1,57,,23.53,41.18,64.71,21.57,13.73,51,...,,15,*,29,8,,*,18,23,15


**Mathematics Dataset:**

In [19]:
# Filter subgroup_id and test_id, rename column and merge into main df
def merge_column(subgroup_id, name):
    df = all_grades[(all_grades['Subgroup ID'] == subgroup_id) & (all_grades['Test Id'] == 2)]
    df = df.drop(columns=['Test Id', 'Subgroup ID'])
    df = df.rename({'CAASPP Reported Enrollment': name}, axis=1)
    global df_math
    df_math = df_math.merge(df, how='left', on='School Code')

In [20]:
# Map each Subgroup ID of interest to its new column name
# (named subgroup_columns to avoid shadowing the built-in dict)
subgroup_columns = {3:'Male', 4:'Female', 50:'Military', 51:'Non Military', 52:'Homeless', 53:'Non Homeless', 31:'Disadvantaged',
       111:'Not Disadvantaged', 74:'Black', 75:'Native American', 76:'Asian', 78:'Hispanic', 79:'Pacific Islander',
       80:'White', 144:'Two/More Races', 90:'< High School', 91:'High School Grad', 92:'Some College', 
       93:'College Grad', 94:'Graduate School'}

In [21]:
# Loop through the mapping and merge each subgroup column into df_math
for subgroup_id, column_name in subgroup_columns.items():
    merge_column(subgroup_id, column_name)

In [22]:
df_math

Unnamed: 0,School Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores,...,Asian,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School
0,112607,1,90,,3.57,7.14,10.71,16.67,72.62,84,...,*,46,*,4,*,23,20,19,11,*
1,123968,1,142,,6.67,11.11,17.78,28.15,54.07,135,...,10,82,,5,4,28,39,34,14,13
2,124172,1,239,,74.57,19.40,93.97,5.17,0.86,232,...,111,9,,18,94,,*,7,58,164
3,125567,1,201,,17.10,18.65,35.75,18.13,46.11,193,...,14,50,,64,27,*,17,39,59,76
4,130401,1,34,,*,*,*,*,*,4,...,,*,*,*,,*,*,*,*,*
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10293,6056816,1,608,,13.78,27.04,40.82,32.48,26.70,588,...,15,166,5,314,61,32,97,255,156,57
10294,6056832,1,146,,9.09,24.24,33.33,40.15,26.52,132,...,,26,*,83,14,,10,64,44,26
10295,6056840,1,74,,12.16,31.08,43.24,27.03,29.73,74,...,*,27,,40,4,10,15,30,13,6
10296,6118806,1,57,,15.69,35.29,50.98,29.41,19.61,51,...,,15,*,29,8,,*,18,23,15


---------

## 1. b. Entities Data

- It contains information such as school and district name, zip code, and the codes needed to merge with the assessment data.

The number of rows in this dataset is close to the reported number of schools and districts in the state of CA:

- There are ~ 1,040 school districts in California. 
    - The entities_dist dataset contains 1,087 rows.
- There are ~ 10,588 schools in California. 
    - The df_entities dataset contains 10,300 rows.

In [23]:
df_entities = pd.read_csv('data/sb_ca2019entities_csv.txt')

In [24]:
# create dataset containing entities data at district level
entities_dist = df_entities[df_entities['School Code'] == 0]

# create dataset containing entities data at school level
df_entities = df_entities[df_entities['School Code'] != 0]  # drop district level data

# drop columns with redundant information or not of use 
df_entities = df_entities.drop(columns = ['Filler', 'Test Year', 'County Code','Type Id', 'District Name', 'County Name'])


In [25]:
df_entities.sort_values(by=['School Name'])

Unnamed: 0,District Code,School Code,School Name,Zip Code
9734,66993,129882,21st Century Learning Institute,92223
9009,66480,6027767,A. E. Arnold Elementary,90630
9069,66522,6028211,A. G. Cook Elementary,92844
9444,73643,6085377,A. G. Currie Middle,92780
2685,69369,6046114,A. J. Dorsa Elementary,95122
...,...,...,...,...
8056,75309,136531,iLEAD Online,93510
8060,75309,138297,iLead Agua Dulce,91390
7996,73452,120600,iQ Academy California-Los Angeles,93065
828,10397,120717,one.Charter,95206


In [26]:
df_entities['School Name'].nunique()

9042

In [27]:
df_entities['School Code'].nunique()

10300
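
The gap between 9,042 unique names and 10,300 unique codes means that some names are shared by several schools, so a join keyed on the school name alone would be ambiguous for those rows. A sketch for listing such names, on a toy frame where the codes and the repeated name are made up:

```python
import pandas as pd

# Toy entities table; codes and the repeated name are invented for illustration.
toy_entities = pd.DataFrame({
    'School Code': [1001, 1002, 1003, 1004],
    'School Name': ['Lincoln Elementary', 'Lincoln Elementary',
                    'Washington Middle', 'Ocean Air'],
})

# Count how many distinct school codes share each name.
codes_per_name = toy_entities.groupby('School Name')['School Code'].nunique()
shared_names = codes_per_name[codes_per_name > 1]
print(shared_names)

# Rows whose name is not unique in the frame.
duplicates = toy_entities[toy_entities['School Name'].duplicated(keep=False)]
print(duplicates)
```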

--------


### Merge df_entities to df_language and df_math

This merge adds the school name and zip code to df_language and df_math.

- Language Arts & Literature Dataset:

In [28]:
# merge dfs on school code
df_language_merge = df_entities.merge(df_language, how='left', on='School Code')
df_language_merge

Unnamed: 0,District Code,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,...,Asian,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School
0,68056,114686,Ocean Air,92130,1.0,420,,71.19,19.85,91.04,...,175,26,,188,24,,*,4,85,326
1,68056,6038111,Del Mar Heights Elementary,92014,1.0,275,,69.23,21.61,90.84,...,19,23,*,203,25,,*,9,85,178
2,68056,6088983,Del Mar Hills Elementary,92014,1.0,169,,61.35,25.15,86.50,...,13,33,,105,17,*,4,13,47,99
3,68056,6110696,Carmel Del Mar Elementary,92130,1.0,299,,68.03,20.75,88.78,...,91,34,,148,21,*,*,10,73,203
4,68056,6115620,Ashley Falls Elementary,92130,1.0,331,,51.85,29.01,80.86,...,100,29,*,171,15,,*,18,63,241
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10295,68049,136747,California Academy of Sports Science,91764,1.0,22,,*,*,*,...,,*,,*,,*,*,*,*,*
10296,68049,138313,University Prep,91764,1.0,92,,17.78,28.89,46.67,...,*,32,,43,*,*,15,18,25,*
10297,68049,6038095,Dehesa Elementary,92019,1.0,95,,6.52,25.00,31.52,...,,26,,37,18,8,8,37,14,10
10298,68049,6119564,Dehesa Charter,92026,1.0,108,,16.13,30.11,46.24,...,*,27,,67,6,,6,30,39,25


In [29]:
df_language_merge.columns

Index(['District Code', 'School Code', 'School Name', 'Zip Code',
       'Subgroup ID', 'CAASPP Reported Enrollment', 'Mean Scale Score',
       'Percentage Standard Exceeded', 'Percentage Standard Met',
       'Percentage Standard Met and Above', 'Percentage Standard Nearly Met',
       'Percentage Standard Not Met', 'Students with Scores', 'Male', 'Female',
       'Military', 'Non Military', 'Homeless', 'Non Homeless', 'Disadvantaged',
       'Not Disadvantaged', 'Black', 'Native American', 'Asian', 'Hispanic',
       'Pacific Islander', 'White', 'Two/More Races', '< High School',
       'High School Grad', 'Some College', 'College Grad', 'Graduate School'],
      dtype='object')

In [30]:
# percentage of missing data per column
percent_missing = (df_language_merge.isnull().sum() * 100 / len(df_language_merge)).round(2)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})

# sort values in ascending order
missing_value_df.sort_values('percent_missing', inplace=True)
missing_value_df

Unnamed: 0,percent_missing
District Code,0.0
School Code,0.0
School Name,0.0
Zip Code,0.0
Students with Scores,0.01
Percentage Standard Not Met,0.01
Percentage Standard Nearly Met,0.01
Percentage Standard Met and Above,0.01
Non Military,0.01
Percentage Standard Exceeded,0.01


- Mathematics Dataset:

In [31]:
df_math_merge = df_entities.merge(df_math, how='left', on='School Code')
df_math_merge

Unnamed: 0,District Code,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,...,Asian,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School
0,68056,114686,Ocean Air,92130,1.0,420,,76.44,14.90,91.35,...,175,26,,188,24,,*,4,85,326
1,68056,6038111,Del Mar Heights Elementary,92014,1.0,275,,67.40,22.34,89.74,...,19,23,*,203,25,,*,9,85,178
2,68056,6088983,Del Mar Hills Elementary,92014,1.0,169,,54.49,28.14,82.63,...,13,33,,105,17,*,4,13,47,99
3,68056,6110696,Carmel Del Mar Elementary,92130,1.0,299,,70.71,17.85,88.55,...,91,34,,148,21,*,*,10,73,203
4,68056,6115620,Ashley Falls Elementary,92130,1.0,331,,55.49,25.61,81.10,...,100,29,*,171,15,,*,18,63,241
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10295,68049,136747,California Academy of Sports Science,91764,1.0,22,,*,*,*,...,,*,,*,,*,*,*,*,*
10296,68049,138313,University Prep,91764,1.0,92,,6.82,9.09,15.91,...,*,32,,43,*,*,15,18,25,*
10297,68049,6038095,Dehesa Elementary,92019,1.0,95,,11.96,19.57,31.52,...,,26,,37,18,8,8,37,14,10
10298,68049,6119564,Dehesa Charter,92026,1.0,108,,5.43,13.04,18.48,...,*,27,,67,6,,6,30,39,25


In [32]:
# percentage of missing data per column
percent_missing = (df_math_merge.isnull().sum() * 100 / len(df_math_merge)).round(2)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})

# sort values in ascending order
missing_value_df.sort_values('percent_missing', inplace=True)
missing_value_df

Unnamed: 0,percent_missing
District Code,0.0
School Code,0.0
School Name,0.0
Zip Code,0.0
Students with Scores,0.02
Percentage Standard Not Met,0.02
Percentage Standard Nearly Met,0.02
Percentage Standard Met and Above,0.02
Non Military,0.02
Percentage Standard Exceeded,0.02
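
The small share of missing values after the left merges suggests that a few schools in the entities file have no matching assessment row. One way to inspect this is the `validate` and `indicator` options of `DataFrame.merge`. The sketch below uses toy frames with made-up codes and values:

```python
import pandas as pd

# Toy frames; codes, names, and scores are invented for illustration.
toy_entities = pd.DataFrame({
    'School Code': [1001, 1002, 1003],
    'School Name': ['A', 'B', 'C'],
})
toy_scores = pd.DataFrame({
    'School Code': [1001, 1002],
    'Percentage Standard Met': [27.38, 18.11],
})

# validate raises MergeError if a school code repeats on either side;
# indicator adds a '_merge' column recording where each row came from.
merged = toy_entities.merge(
    toy_scores, how='left', on='School Code',
    validate='one_to_one', indicator=True,
)

# 'left_only' rows are schools without an assessment record.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['School Code', 'School Name']])
```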


----------

## 2. Median household income by zipcode
- Year 2014
- California

source: http://www.usa.com/rank/california-state--median-household-income--zip-code-rank.htm?yr=9000&dis=&wist=&plow=&phigh=

In [33]:
# load csv file
income = pd.read_csv('data/median_income_zipcode.csv')

# drop columns Rank and population
income = income.drop(columns = ['Rank', 'Population'])

# rename columns to merge with zip code from the main dfs
income.columns = ['Median Household Income', 'Zip Code']

# convert zip code from int to string (object dtype) for merging
income['Zip Code'] = income['Zip Code'].astype(str)

income

Unnamed: 0,Median Household Income,Zip Code
0,236912.00,94027
1,228587.00,92145
2,200325.00,91980
3,187857.00,94957
4,182750.00,94022
...,...,...
1681,11922.00,93721
1682,11250.00,93530
1683,10625.00,90089
1684,10481.00,95915


In [34]:
income.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1686 entries, 0 to 1685
Data columns (total 2 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Median Household Income  1686 non-null   object
 1   Zip Code                 1686 non-null   object
dtypes: object(2)
memory usage: 26.5+ KB


In [35]:
# Check for missing data
income.isnull().sum()

Median Household Income    0
Zip Code                   0
dtype: int64

In [36]:
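# Use string manipulation to remove spaces and punctuation
# (raw string avoids invalid escapes; regex=True is needed in recent pandas, where the default is literal matching)
income['Median Household Income'] = income['Median Household Income'].str.replace(r"[ ,\-/\(\)'@]", '', regex=True)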
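
Since `income.info()` reports the income column as `object`, it may be useful to turn it into numbers after cleaning. The sketch below assumes the raw strings contain thousands separators and the occasional non-numeric placeholder; the sample values are made up. Passing `regex=True` explicitly matters because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Made-up raw values in the style of a scraped income table.
raw = pd.Series(['236,912.00', '64,738.00', '-'])

# Remove spaces and the listed punctuation characters.
cleaned = raw.str.replace(r"[ ,\-/\(\)'@]", '', regex=True)

# Convert to numbers; strings left empty after cleaning become NaN.
income_numeric = pd.to_numeric(cleaned, errors='coerce')
print(income_numeric)
```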

### Merge income to df_language and df_math

- Language Arts & Literature Dataset:

In [37]:
# Add median household income by merging df_language to income df
df_language_merge = df_language_merge.merge(income, how='left', on='Zip Code')

# Sort values
df_language_merge = df_language_merge.sort_values('School Name', ascending=True)

df_language_merge

Unnamed: 0,District Code,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,...,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School,Median Household Income
8740,66993,129882,21st Century Learning Institute,92223,1.0,58,,8.93,35.71,44.64,...,33,,18,,6,13,22,9,4,64738.00
8063,66480,6027767,A. E. Arnold Elementary,90630,1.0,447,,37.59,28.02,65.60,...,131,*,111,7,15,44,85,138,121,84051.00
8119,66522,6028211,A. G. Cook Elementary,92844,1.0,192,,48.39,32.80,81.18,...,43,,10,6,*,13,14,33,6,48345.00
8480,73643,6085377,A. G. Currie Middle,92780,1.0,585,,6.60,23.78,30.38,...,532,*,15,*,223,168,82,39,15,64089.00
2450,69369,6046114,A. J. Dorsa Elementary,95122,1.0,184,,11.60,18.23,29.83,...,166,,*,*,82,55,24,15,*,57470.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7254,75309,136531,iLEAD Online,93510,1.0,52,,31.82,18.18,50.00,...,14,,22,*,*,*,11,18,12,89403.00
7258,75309,138297,iLead Agua Dulce,91390,1.0,64,,12.07,25.86,37.93,...,24,,30,9,,*,17,16,10,105659.00
7197,73452,120600,iQ Academy California-Los Angeles,93065,1.0,405,,9.09,29.75,38.84,...,36,8,166,33,,9,25,18,5,94173.00
780,10397,120717,one.Charter,95206,1.0,184,,0.00,6.15,6.15,...,112,*,16,13,63,47,35,10,9,42404.00


In [38]:
df_language_merge['School Code'].nunique()

10300

In [39]:
df_language_merge['School Name'].nunique()

9042

- Mathematics Dataset:

In [40]:
# Add median household income by merging df_math to income df
df_math_merge = df_math_merge.merge(income, how='left', on='Zip Code')

# Sort values
df_math_merge = df_math_merge.sort_values('School Name', ascending=True)

df_math_merge

Unnamed: 0,District Code,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,...,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School,Median Household Income
8740,66993,129882,21st Century Learning Institute,92223,1.0,58,,1.79,8.93,10.71,...,33,,18,,6,13,22,9,4,64738.00
8063,66480,6027767,A. E. Arnold Elementary,90630,1.0,447,,36.36,27.05,63.41,...,131,*,111,7,15,44,85,138,121,84051.00
8119,66522,6028211,A. G. Cook Elementary,92844,1.0,192,,46.28,25.53,71.81,...,43,,10,6,*,13,14,33,6,48345.00
8480,73643,6085377,A. G. Currie Middle,92780,1.0,585,,8.06,10.29,18.35,...,532,*,15,*,223,168,82,39,15,64089.00
2450,69369,6046114,A. J. Dorsa Elementary,95122,1.0,184,,11.05,14.36,25.41,...,166,,*,*,82,55,24,15,*,57470.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7254,75309,136531,iLEAD Online,93510,1.0,52,,9.09,9.09,18.18,...,14,,22,*,*,*,11,18,12,89403.00
7258,75309,138297,iLead Agua Dulce,91390,1.0,64,,6.90,20.69,27.59,...,24,,30,9,,*,17,16,10,105659.00
7197,73452,120600,iQ Academy California-Los Angeles,93065,1.0,405,,5.79,7.71,13.50,...,36,8,166,33,,9,25,18,5,94173.00
780,10397,120717,one.Charter,95206,1.0,184,,0.00,0.00,0.00,...,114,*,16,13,63,48,35,9,9,42404.00


In [41]:
df_math_merge['School Code'].nunique()

10300

In [42]:
df_math_merge['School Name'].nunique()

9042

-----------

**Preparing df_language_merge and df_math_merge datasets for merging on school name with enrollment dataset below.**
- make school name all caps
- remove all punctuation

In [43]:
# Use string manipulation to remove punctuation and make School Name all caps

df_language_merge['School_Name'] = df_language_merge['School Name'].str.replace(r"[ \.\-/\(\)'@]", '', regex=True)
# assignment aligns on the index, so sorting here would have no effect
df_language_merge['School_Name'] = df_language_merge['School_Name'].str.upper()
#df_language_merge.head()

In [44]:
# Strip spaces and punctuation from School Name, then upper-case it to build a merge key

df_math_merge['School_Name'] = (
    df_math_merge['School Name']
    .str.replace(r"[ .\-/()'@]", '', regex=True)  # regex=True is required in recent pandas
    .str.upper()
)
#df_math_merge.head()

---------

## 3. Current Expense per Average Daily Attendance

- California Department of Education (CDE), 2018-2019.
- Per district: expenditures for current expense of education, current expense ADA, and current expense per ADA.
- EDP 365 = Expenditures for Current Expense of Education

**Average Daily Attendance (ADA):**
Total ADA is defined as the total days of student attendance divided by the total days of instruction. The type of ADA used is annual district ADA (for the same year as the expenditures) from CDE's "Attendance School District" and "Attendance Charter School" reports and includes ADA from special education programs and applicable charter schools (i.e., those charter schools with data in the district's Current Expense of Education calculation).  ADA credited to districts for the attendance of pupils in county-operated programs is not included.

**Cost Per ADA:**
By district, the adjusted expenditures are divided by the total ADA to arrive at the Current Expense (or Cost) of Education per ADA.
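
As a quick check of this definition, the division can be reproduced by hand. The sketch below uses the Alameda Unified figures as displayed in the expense table of this notebook; the variable names are illustrative only, and the displayed EDP 365 value is rounded, so the result agrees with the tabulated cost per ADA only approximately.

```python
# Values copied from the Alameda Unified row of the expense table
edp_365 = 117_225_900.0   # Expenditures for Current Expense of Education (as displayed)
total_ada = 8_968.85      # annual district ADA

# Cost per ADA = adjusted expenditures / total ADA
cost_per_ada = edp_365 / total_ada
print(round(cost_per_ada, 2))
```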
	

In [45]:
expense_df = pd.read_excel('data/currentexpense1819.xlsx')

# drop columns
expense_df = expense_df.drop(columns = 'LEA Type')
expense_df.head(20)

Unnamed: 0,District Code,District,EDP 365,Current\nExpense ADA,Current\nExpense Per ADA
0,61119,Alameda Unified,117225900.0,8968.85,13070.335941
1,61127,Albany City Unified,46611060.0,3544.52,13150.175366
2,61143,Berkeley Unified,159457800.0,9356.44,17042.573724
3,61150,Castro Valley Unified,102239900.0,8940.2,11435.978763
4,61168,Emery Unified,12504020.0,681.82,18339.185137
5,61176,Fremont Unified,382169500.0,33966.73,11251.288434
6,61192,Hayward Unified,266964100.0,18755.2,14234.137025
7,61200,Livermore Valley Joint Unified,158623700.0,13142.04,12069.941505
8,61218,Mountain House Elementary,485228.8,14.93,32500.253851
9,61234,Newark Unified,66780270.0,5552.86,12026.28461


In [46]:
expense_df['District'].nunique()

925

**Merge with df_language_merge and df_math_merge**


- Language Arts & Literature Dataset:

In [47]:
# Merge df_language_merge with expense_df on District Code
language_expense_df = df_language_merge.merge(expense_df, on='District Code', how='left')

# Sort values
language_expense_df = language_expense_df.sort_values('School_Name')

language_expense_df

Unnamed: 0,District Code,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,...,High School Grad,Some College,College Grad,Graduate School,Median Household Income,School_Name,District,EDP 365,Current\nExpense ADA,Current\nExpense Per ADA
0,66993,129882,21st Century Learning Institute,92223,1.0,58,,8.93,35.71,44.64,...,13,22,9,4,64738.00,21STCENTURYLEARNINGINSTITUTE,Beaumont Unified,1.133925e+08,9960.32,11384.420846
23,69039,6044796,Abbott Middle,94403,1.0,812,,19.72,30.41,50.13,...,124,143,204,139,95189.00,ABBOTTMIDDLE,San Mateo-Foster City Elementary,1.366818e+08,11279.24,12117.995622
24,75192,6116446,Abby Reinke Elementary,92592,1.0,406,,43.36,30.33,73.68,...,24,113,123,136,89541.00,ABBYREINKEELEMENTARY,Temecula Valley Unified,2.828429e+08,26622.22,10624.315362
8,64212,1995596,ABC Secondary (Alternative),90703,1.0,37,,5.56,16.67,22.22,...,10,*,*,*,90613.00,ABCSECONDARYALTERNATIVE,ABC Unified,2.306853e+08,19754.36,11677.688891
9,64212,1964212,ABC Unified District Level Program,90703,1.0,27,,*,*,*,...,*,*,*,*,90613.00,ABCUNIFIEDDISTRICTLEVELPROGRAM,ABC Unified,2.306853e+08,19754.36,11677.688891
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10285,64592,6014039,Zela Davis,90250,1.0,560,,17.40,22.16,39.56,...,216,143,49,18,45766.00,ZELADAVIS,Hawthorne Elementary,8.997957e+07,7176.50,12538.085408
10286,63461,129130,Zephyr Lane Elementary,93307,1.0,412,,8.56,25.43,33.99,...,139,124,60,23,34358.00,ZEPHYRLANEELEMENTARY,Fairfax Elementary,3.094876e+07,2563.52,12072.760770
10287,75515,1232057,Zoe Barnum High,95501,1.0,35,,0.00,0.00,0.00,...,14,13,*,*,38175.00,ZOEBARNUMHIGH,Eureka City Unified,4.261009e+07,3387.66,12578.028996
10288,67850,3630530,Zupanic High,92376,1.0,29,,7.69,26.92,34.62,...,8,7,*,*,44965.00,ZUPANICHIGH,Rialto Unified,3.242716e+08,23921.22,13555.814425


In [48]:
language_expense_df['District'].nunique()

920

In [49]:
language_expense_df['District Code'].nunique()

1032
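
The two counts differ (920 district names against 1032 district codes), which suggests that some district codes found no match in the expense data. One way to list them is the `indicator` option of `merge`. The sketch below uses small toy frames with illustrative values rather than the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames (values are illustrative only)
schools = pd.DataFrame({
    "District Code": [61119, 61119, 10397],
    "School Name": ["School A", "School B", "School C"],
})
expenses = pd.DataFrame({
    "District Code": [61119],
    "District": ["Alameda Unified"],
})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left join
merged = schools.merge(expenses, on="District Code", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "District Code"].unique()
print(unmatched)
```

Rows flagged `left_only` keep their assessment data but carry missing values in every expense column.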

- Mathematics Dataset:

In [50]:
# Merge df_math_merge with expense_df on District Code
math_expense_df = df_math_merge.merge(expense_df, on='District Code', how='left')

# Sort values
math_expense_df = math_expense_df.sort_values('School_Name')

math_expense_df

Unnamed: 0,District Code,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,...,High School Grad,Some College,College Grad,Graduate School,Median Household Income,School_Name,District,EDP 365,Current\nExpense ADA,Current\nExpense Per ADA
0,66993,129882,21st Century Learning Institute,92223,1.0,58,,1.79,8.93,10.71,...,13,22,9,4,64738.00,21STCENTURYLEARNINGINSTITUTE,Beaumont Unified,1.133925e+08,9960.32,11384.420846
23,69039,6044796,Abbott Middle,94403,1.0,810,,17.04,15.54,32.58,...,123,142,204,139,95189.00,ABBOTTMIDDLE,San Mateo-Foster City Elementary,1.366818e+08,11279.24,12117.995622
24,75192,6116446,Abby Reinke Elementary,92592,1.0,406,,30.92,35.66,66.58,...,24,113,123,136,89541.00,ABBYREINKEELEMENTARY,Temecula Valley Unified,2.828429e+08,26622.22,10624.315362
8,64212,1995596,ABC Secondary (Alternative),90703,1.0,38,,0.00,5.26,5.26,...,10,7,*,*,90613.00,ABCSECONDARYALTERNATIVE,ABC Unified,2.306853e+08,19754.36,11677.688891
9,64212,1964212,ABC Unified District Level Program,90703,1.0,27,,*,*,*,...,*,*,*,*,90613.00,ABCUNIFIEDDISTRICTLEVELPROGRAM,ABC Unified,2.306853e+08,19754.36,11677.688891
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10285,64592,6014039,Zela Davis,90250,1.0,560,,14.86,18.84,33.70,...,216,143,49,18,45766.00,ZELADAVIS,Hawthorne Elementary,8.997957e+07,7176.50,12538.085408
10286,63461,129130,Zephyr Lane Elementary,93307,1.0,412,,6.13,20.10,26.23,...,139,124,60,23,34358.00,ZEPHYRLANEELEMENTARY,Fairfax Elementary,3.094876e+07,2563.52,12072.760770
10287,75515,1232057,Zoe Barnum High,95501,1.0,35,,0.00,0.00,0.00,...,14,13,*,*,38175.00,ZOEBARNUMHIGH,Eureka City Unified,4.261009e+07,3387.66,12578.028996
10288,67850,3630530,Zupanic High,92376,1.0,29,,0.00,4.00,4.00,...,8,7,*,*,44965.00,ZUPANICHIGH,Rialto Unified,3.242716e+08,23921.22,13555.814425


----------

## 4. Total Revenue, Total Revenue per Pupil, Total Expenditure per Pupil

- It contains total revenue per school district in California for the academic year 2018-2019.
- Revenue comes from local, state and federal sources.

In [51]:
# Load dataset
df_revenue = pd.read_csv('data/ELSI_revenue_details.csv')

# Rename columns
df_revenue.columns = ['Agency Name', 'State', 'District Code', 'Latitude', 'Longitude', 'Total Revenue',
                     'Total Revenue per Pupil', 'Total Expenditures per Pupil']

# Strip the ="..." wrapper around cell values, keeping only the quoted content
regex_pattern = r'="(.*)"'
regex_group = r'\1'

df_revenue = df_revenue.replace(to_replace=regex_pattern, value=regex_group, regex=True)
df_revenue['District Code'] = df_revenue['District Code'].astype(int)

# drop columns
df_revenue = df_revenue.drop(columns = ['State', 'District Code'])
df_revenue

Unnamed: 0,Agency Name,Latitude,Longitude,Total Revenue,Total Revenue per Pupil,Total Expenditures per Pupil
0,ABC UNIFIED,33.879715,-118.071463,265547000,12922,12316
1,ACALANES UNION HIGH,37.905787,-122.099170,95322000,16835,15554
2,ACKERMAN CHARTER,38.934997,-121.055904,5890000,†,†
3,ACTON-AGUA DULCE UNIFIED,34.472708,-118.196768,56618000,3811,3425
4,ADELANTO ELEMENTARY,34.572373,-117.406551,111722000,12831,12322
...,...,...,...,...,...,...
1151,YREKA UNION ELEMENTARY,41.727428,-122.640968,12239000,12251,11710
1152,YREKA UNION HIGH,41.741154,-122.637652,10060000,16438,18830
1153,YUBA CITY UNIFIED,39.133900,-121.633400,169251000,12787,13286
1154,YUBA COUNTY OFFICE OF EDUCATION,39.149990,-121.600051,32157000,51699,49889


In [52]:
df_revenue['Agency Name'].nunique()

1139
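
The row index above runs to 1154 while only 1139 agency names are unique, so some names may appear more than once. Because the later merge uses a name-based key, repeated names on the right-hand side would duplicate school rows. The sketch below, on toy data with illustrative values, shows one way to inspect repeated keys and to have `merge` verify uniqueness. Keeping the first occurrence is an arbitrary choice made for the example, not the notebook's method.

```python
import pandas as pd

# Toy revenue table with a repeated agency name (illustrative values only)
revenue = pd.DataFrame({
    "Agency Name": ["ABC UNIFIED", "YUBA CITY UNIFIED", "ABC UNIFIED"],
    "Total Revenue": [265547000, 169251000, 1000],
})

# keep=False marks every row whose key occurs more than once
dupes = revenue[revenue["Agency Name"].duplicated(keep=False)]
print(dupes)

# After resolving duplicates, validate="many_to_one" makes merge raise
# pandas.errors.MergeError if the right-hand key is still not unique
deduped = revenue.drop_duplicates(subset="Agency Name", keep="first")
schools = pd.DataFrame({"Agency Name": ["ABC UNIFIED", "ABC UNIFIED"]})
merged = schools.merge(deduped, on="Agency Name", how="left", validate="many_to_one")
print(len(merged))
```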

--------

**String manipulation with district name to allow merge with math_expense_df and language_expense_df**

In [53]:
# Build a merge key from Agency Name: strip spaces and punctuation, then upper-case
# (stored in a new column, District_Name)
df_revenue['District_Name'] = (
    df_revenue['Agency Name']
    .str.replace(r"[ .\-/()'@]", '', regex=True)
    .str.upper()
)
df_revenue

Unnamed: 0,Agency Name,Latitude,Longitude,Total Revenue,Total Revenue per Pupil,Total Expenditures per Pupil,District_Name
0,ABC UNIFIED,33.879715,-118.071463,265547000,12922,12316,ABCUNIFIED
1,ACALANES UNION HIGH,37.905787,-122.099170,95322000,16835,15554,ACALANESUNIONHIGH
2,ACKERMAN CHARTER,38.934997,-121.055904,5890000,†,†,ACKERMANCHARTER
3,ACTON-AGUA DULCE UNIFIED,34.472708,-118.196768,56618000,3811,3425,ACTONAGUADULCEUNIFIED
4,ADELANTO ELEMENTARY,34.572373,-117.406551,111722000,12831,12322,ADELANTOELEMENTARY
...,...,...,...,...,...,...,...
1151,YREKA UNION ELEMENTARY,41.727428,-122.640968,12239000,12251,11710,YREKAUNIONELEMENTARY
1152,YREKA UNION HIGH,41.741154,-122.637652,10060000,16438,18830,YREKAUNIONHIGH
1153,YUBA CITY UNIFIED,39.133900,-121.633400,169251000,12787,13286,YUBACITYUNIFIED
1154,YUBA COUNTY OFFICE OF EDUCATION,39.149990,-121.600051,32157000,51699,49889,YUBACOUNTYOFFICEOFEDUCATION


In [54]:
# Build the same merge key from District: strip spaces and punctuation, then upper-case
# (stored in a new column, District_Name)
language_expense_df['District_Name'] = language_expense_df['District'].str.replace(
    r"[ .\-/()'@]", '', regex=True)

def upperOrNone(x):
    """Upper-case a string; return the placeholder "None" for missing (non-string) values."""
    try:
        return x.upper()
    except AttributeError:  # NaN from unmatched districts has no .upper()
        return "None"

language_expense_df['District_Name'] = language_expense_df['District_Name'].apply(upperOrNone)

language_expense_df

Unnamed: 0,District Code,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,...,Some College,College Grad,Graduate School,Median Household Income,School_Name,District,EDP 365,Current\nExpense ADA,Current\nExpense Per ADA,District_Name
0,66993,129882,21st Century Learning Institute,92223,1.0,58,,8.93,35.71,44.64,...,22,9,4,64738.00,21STCENTURYLEARNINGINSTITUTE,Beaumont Unified,1.133925e+08,9960.32,11384.420846,BEAUMONTUNIFIED
23,69039,6044796,Abbott Middle,94403,1.0,812,,19.72,30.41,50.13,...,143,204,139,95189.00,ABBOTTMIDDLE,San Mateo-Foster City Elementary,1.366818e+08,11279.24,12117.995622,SANMATEOFOSTERCITYELEMENTARY
24,75192,6116446,Abby Reinke Elementary,92592,1.0,406,,43.36,30.33,73.68,...,113,123,136,89541.00,ABBYREINKEELEMENTARY,Temecula Valley Unified,2.828429e+08,26622.22,10624.315362,TEMECULAVALLEYUNIFIED
8,64212,1995596,ABC Secondary (Alternative),90703,1.0,37,,5.56,16.67,22.22,...,*,*,*,90613.00,ABCSECONDARYALTERNATIVE,ABC Unified,2.306853e+08,19754.36,11677.688891,ABCUNIFIED
9,64212,1964212,ABC Unified District Level Program,90703,1.0,27,,*,*,*,...,*,*,*,90613.00,ABCUNIFIEDDISTRICTLEVELPROGRAM,ABC Unified,2.306853e+08,19754.36,11677.688891,ABCUNIFIED
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10285,64592,6014039,Zela Davis,90250,1.0,560,,17.40,22.16,39.56,...,143,49,18,45766.00,ZELADAVIS,Hawthorne Elementary,8.997957e+07,7176.50,12538.085408,HAWTHORNEELEMENTARY
10286,63461,129130,Zephyr Lane Elementary,93307,1.0,412,,8.56,25.43,33.99,...,124,60,23,34358.00,ZEPHYRLANEELEMENTARY,Fairfax Elementary,3.094876e+07,2563.52,12072.760770,FAIRFAXELEMENTARY
10287,75515,1232057,Zoe Barnum High,95501,1.0,35,,0.00,0.00,0.00,...,13,*,*,38175.00,ZOEBARNUMHIGH,Eureka City Unified,4.261009e+07,3387.66,12578.028996,EUREKACITYUNIFIED
10288,67850,3630530,Zupanic High,92376,1.0,29,,7.69,26.92,34.62,...,7,*,*,44965.00,ZUPANICHIGH,Rialto Unified,3.242716e+08,23921.22,13555.814425,RIALTOUNIFIED


In [55]:
# Build the same merge key from District: strip spaces and punctuation, then upper-case
# (stored in a new column, District_Name)
math_expense_df['District_Name'] = math_expense_df['District'].str.replace(
    r"[ .\-/()'@]", '', regex=True)

def upperOrNone(x):
    """Upper-case a string; return the placeholder "None" for missing (non-string) values."""
    try:
        return x.upper()
    except AttributeError:  # NaN from unmatched districts has no .upper()
        return "None"

math_expense_df['District_Name'] = math_expense_df['District_Name'].apply(upperOrNone)

math_expense_df

Unnamed: 0,District Code,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,...,Some College,College Grad,Graduate School,Median Household Income,School_Name,District,EDP 365,Current\nExpense ADA,Current\nExpense Per ADA,District_Name
0,66993,129882,21st Century Learning Institute,92223,1.0,58,,1.79,8.93,10.71,...,22,9,4,64738.00,21STCENTURYLEARNINGINSTITUTE,Beaumont Unified,1.133925e+08,9960.32,11384.420846,BEAUMONTUNIFIED
23,69039,6044796,Abbott Middle,94403,1.0,810,,17.04,15.54,32.58,...,142,204,139,95189.00,ABBOTTMIDDLE,San Mateo-Foster City Elementary,1.366818e+08,11279.24,12117.995622,SANMATEOFOSTERCITYELEMENTARY
24,75192,6116446,Abby Reinke Elementary,92592,1.0,406,,30.92,35.66,66.58,...,113,123,136,89541.00,ABBYREINKEELEMENTARY,Temecula Valley Unified,2.828429e+08,26622.22,10624.315362,TEMECULAVALLEYUNIFIED
8,64212,1995596,ABC Secondary (Alternative),90703,1.0,38,,0.00,5.26,5.26,...,7,*,*,90613.00,ABCSECONDARYALTERNATIVE,ABC Unified,2.306853e+08,19754.36,11677.688891,ABCUNIFIED
9,64212,1964212,ABC Unified District Level Program,90703,1.0,27,,*,*,*,...,*,*,*,90613.00,ABCUNIFIEDDISTRICTLEVELPROGRAM,ABC Unified,2.306853e+08,19754.36,11677.688891,ABCUNIFIED
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10285,64592,6014039,Zela Davis,90250,1.0,560,,14.86,18.84,33.70,...,143,49,18,45766.00,ZELADAVIS,Hawthorne Elementary,8.997957e+07,7176.50,12538.085408,HAWTHORNEELEMENTARY
10286,63461,129130,Zephyr Lane Elementary,93307,1.0,412,,6.13,20.10,26.23,...,124,60,23,34358.00,ZEPHYRLANEELEMENTARY,Fairfax Elementary,3.094876e+07,2563.52,12072.760770,FAIRFAXELEMENTARY
10287,75515,1232057,Zoe Barnum High,95501,1.0,35,,0.00,0.00,0.00,...,13,*,*,38175.00,ZOEBARNUMHIGH,Eureka City Unified,4.261009e+07,3387.66,12578.028996,EUREKACITYUNIFIED
10288,67850,3630530,Zupanic High,92376,1.0,29,,0.00,4.00,4.00,...,7,*,*,44965.00,ZUPANICHIGH,Rialto Unified,3.242716e+08,23921.22,13555.814425,RIALTOUNIFIED


**Merge on District_Name**
- Language Arts & Literacy Dataset:

In [56]:
# Merge language_expense_df to df_revenue
language_revenue_df = language_expense_df.merge(df_revenue, on='District_Name', how='left')

# Sort values
language_revenue_df = language_revenue_df.sort_values('School Name')

language_revenue_df

Unnamed: 0,District Code,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,...,EDP 365,Current\nExpense ADA,Current\nExpense Per ADA,District_Name,Agency Name,Latitude,Longitude,Total Revenue,Total Revenue per Pupil,Total Expenditures per Pupil
0,66993,129882,21st Century Learning Institute,92223,1.0,58,,8.93,35.71,44.64,...,1.133925e+08,9960.32,11384.420846,BEAUMONTUNIFIED,BEAUMONT UNIFIED,33.962281,-116.984589,130514000,12626,14449
87,66480,6027767,A. E. Arnold Elementary,90630,1.0,447,,37.59,28.02,65.60,...,4.088819e+07,3800.49,10758.663825,CYPRESSELEMENTARY,CYPRESS ELEMENTARY,33.824900,-118.045700,53979000,13641,11589
91,66522,6028211,A. G. Cook Elementary,92844,1.0,192,,48.39,32.80,81.18,...,5.296527e+08,40854.24,12964.447700,GARDENGROVEUNIFIED,GARDEN GROVE UNIFIED,33.777700,-117.953000,634300000,14695,16663
92,73643,6085377,A. G. Currie Middle,92780,1.0,585,,6.60,23.78,30.38,...,2.462747e+08,22921.38,10744.324251,TUSTINUNIFIED,TUSTIN UNIFIED,33.743100,-117.824900,316172000,13166,12601
103,69369,6046114,A. J. Dorsa Elementary,95122,1.0,184,,11.60,18.23,29.83,...,1.437715e+08,9260.91,15524.553854,ALUMROCKUNIONELEMENTARY,ALUM ROCK UNION ELEMENTARY,37.369388,-121.833560,166657000,14788,15286
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4220,75309,136531,iLEAD Online,93510,1.0,52,,31.82,18.18,50.00,...,1.465401e+07,1027.63,14260.002822,ACTONAGUADULCEUNIFIED,ACTON-AGUA DULCE UNIFIED,34.472708,-118.196768,56618000,3811,3425
4217,75309,138297,iLead Agua Dulce,91390,1.0,64,,12.07,25.86,37.93,...,1.465401e+07,1027.63,14260.002822,ACTONAGUADULCEUNIFIED,ACTON-AGUA DULCE UNIFIED,34.472708,-118.196768,56618000,3811,3425
4301,73452,120600,iQ Academy California-Los Angeles,93065,1.0,405,,9.09,29.75,38.84,...,1.650227e+08,12728.74,12964.578388,ROWLANDUNIFIED,ROWLAND UNIFIED,33.985314,-117.888584,215076000,15219,15708
6877,10397,120717,one.Charter,95206,1.0,184,,0.00,6.15,6.15,...,,,,,,,,,,


- Mathematics Dataset:

In [57]:
# Merge math_expense_df to df_revenue
math_revenue_df = math_expense_df.merge(df_revenue, on='District_Name', how='left')

# Sort values
math_revenue_df = math_revenue_df.sort_values('School Name')

math_revenue_df

Unnamed: 0,District Code,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,...,EDP 365,Current\nExpense ADA,Current\nExpense Per ADA,District_Name,Agency Name,Latitude,Longitude,Total Revenue,Total Revenue per Pupil,Total Expenditures per Pupil
0,66993,129882,21st Century Learning Institute,92223,1.0,58,,1.79,8.93,10.71,...,1.133925e+08,9960.32,11384.420846,BEAUMONTUNIFIED,BEAUMONT UNIFIED,33.962281,-116.984589,130514000,12626,14449
87,66480,6027767,A. E. Arnold Elementary,90630,1.0,447,,36.36,27.05,63.41,...,4.088819e+07,3800.49,10758.663825,CYPRESSELEMENTARY,CYPRESS ELEMENTARY,33.824900,-118.045700,53979000,13641,11589
91,66522,6028211,A. G. Cook Elementary,92844,1.0,192,,46.28,25.53,71.81,...,5.296527e+08,40854.24,12964.447700,GARDENGROVEUNIFIED,GARDEN GROVE UNIFIED,33.777700,-117.953000,634300000,14695,16663
92,73643,6085377,A. G. Currie Middle,92780,1.0,585,,8.06,10.29,18.35,...,2.462747e+08,22921.38,10744.324251,TUSTINUNIFIED,TUSTIN UNIFIED,33.743100,-117.824900,316172000,13166,12601
103,69369,6046114,A. J. Dorsa Elementary,95122,1.0,184,,11.05,14.36,25.41,...,1.437715e+08,9260.91,15524.553854,ALUMROCKUNIONELEMENTARY,ALUM ROCK UNION ELEMENTARY,37.369388,-121.833560,166657000,14788,15286
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4220,75309,136531,iLEAD Online,93510,1.0,52,,9.09,9.09,18.18,...,1.465401e+07,1027.63,14260.002822,ACTONAGUADULCEUNIFIED,ACTON-AGUA DULCE UNIFIED,34.472708,-118.196768,56618000,3811,3425
4217,75309,138297,iLead Agua Dulce,91390,1.0,64,,6.90,20.69,27.59,...,1.465401e+07,1027.63,14260.002822,ACTONAGUADULCEUNIFIED,ACTON-AGUA DULCE UNIFIED,34.472708,-118.196768,56618000,3811,3425
4301,73452,120600,iQ Academy California-Los Angeles,93065,1.0,405,,5.79,7.71,13.50,...,1.650227e+08,12728.74,12964.578388,ROWLANDUNIFIED,ROWLAND UNIFIED,33.985314,-117.888584,215076000,15219,15708
6877,10397,120717,one.Charter,95206,1.0,184,,0.00,0.00,0.00,...,,,,,,,,,,


----------

## 5. Student Poverty – Free or Reduced Price Meals Data
- School level data, year 2018-2019


Source: https://www.cde.ca.gov/ds/sd/sd/fsspfrpm.asp


In [58]:
freelunch_df = pd.read_excel('data/frpm1819.xlsx')

# Drop columns
freelunch_df = freelunch_df.drop(columns = ['Academic Year', 'County Code', 'District Code',
       'County Name', 'District Name', 'School Name',
       'School Type', 'Educational \nOption Type', 'NSLP \nProvision \nStatus',
       'Charter \nSchool \n(Y/N)', 'Charter \nSchool \nNumber',
       'Charter \nFunding \nType', 'IRC', 'Low Grade', 'High Grade',
       'Percent (%) \nEligible Free \n(K-12)', 'FRPM Count \n(K-12)',
       'Percent (%) \nEligible FRPM \n(K-12)', 'Enrollment \n(Ages 5-17)',
       'Percent (%) \nEligible Free \n(Ages 5-17)', 'FRPM Count \n(Ages 5-17)',
       'Percent (%) \nEligible FRPM \n(Ages 5-17)',
       'CALPADS Fall 1 \nCertification Status'])

freelunch_df

Unnamed: 0,School Code,District Type,Enrollment \n(K-12),Free Meal \nCount \n(K-12),Free Meal \nCount \n(Ages 5-17)
0,112607,County Office of Education (COE),385,262,249
1,123968,County Office of Education (COE),241,118,113
2,124172,County Office of Education (COE),445,58,58
3,125567,County Office of Education (COE),432,113,111
4,130401,County Office of Education (COE),53,53,50
...,...,...,...,...,...
10515,6056832,Elementary School District,366,77,76
10516,6056840,Elementary School District,321,151,144
10517,6118806,Elementary School District,98,12,12
10518,1,High School District,2,0,0


#### Merge on School Code

- Language Arts & Literature Dataset:

In [59]:
# Merge language_revenue_df with freelunch_df on School Code
language_df = language_revenue_df.merge(freelunch_df, on='School Code', how='left')

# Sort values
language_df = language_df.sort_values('School Name')

# Rename columns
language_df.columns = ['District Code', 'School Code', 'School Name', 'Zip Code',
       'Subgroup ID', 'CAASPP Reported Enrollment', 'Mean Scale Score',
       'Percentage Standard Exceeded', 'Percentage Standard Met',
       'Percentage Standard Met and Above', 'Percentage Standard Nearly Met',
       'Percentage Standard Not Met', 'Students with Scores', 'Male', 'Female',
       'Military', 'Non Military', 'Homeless', 'Non Homeless', 'Disadvantaged',
       'Not Disadvantaged', 'Black', 'Native American', 'Asian', 'Hispanic',
       'Pacific Islander', 'White', 'Two/More Races', '< High School',
       'High School Grad', 'Some College', 'College Grad', 'Graduate School',
       'Median Household Income', 'School_Name', 'District', 'EDP 365',
       'Current Expense ADA', 'Current Expense Per ADA', 'District_Name',
       'Agency Name', 'Latitude', 'Longitude', 'Total Revenue',
       'Total Revenue per Pupil', 'Total Expenditures per Pupil',
       'District Type', 'Enrollment K-12', 'Free Meal Count K-12',
       'Free Meal Count Ages 5-17']

language_df

Unnamed: 0,District Code,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,...,Agency Name,Latitude,Longitude,Total Revenue,Total Revenue per Pupil,Total Expenditures per Pupil,District Type,Enrollment K-12,Free Meal Count K-12,Free Meal Count Ages 5-17
0,66993,129882,21st Century Learning Institute,92223,1.0,58,,8.93,35.71,44.64,...,BEAUMONT UNIFIED,33.962281,-116.984589,130514000,12626,14449,Unified School District,88.0,41.0,37.0
1,66480,6027767,A. E. Arnold Elementary,90630,1.0,447,,37.59,28.02,65.60,...,CYPRESS ELEMENTARY,33.824900,-118.045700,53979000,13641,11589,Elementary School District,739.0,246.0,246.0
2,66522,6028211,A. G. Cook Elementary,92844,1.0,192,,48.39,32.80,81.18,...,GARDEN GROVE UNIFIED,33.777700,-117.953000,634300000,14695,16663,Unified School District,366.0,187.0,186.0
3,73643,6085377,A. G. Currie Middle,92780,1.0,585,,6.60,23.78,30.38,...,TUSTIN UNIFIED,33.743100,-117.824900,316172000,13166,12601,Unified School District,611.0,422.0,422.0
4,69369,6046114,A. J. Dorsa Elementary,95122,1.0,184,,11.60,18.23,29.83,...,ALUM ROCK UNION ELEMENTARY,37.369388,-121.833560,166657000,14788,15286,Elementary School District,371.0,262.0,254.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10430,75309,136531,iLEAD Online,93510,1.0,52,,31.82,18.18,50.00,...,ACTON-AGUA DULCE UNIFIED,34.472708,-118.196768,56618000,3811,3425,Unified School District,73.0,28.0,27.0
10431,75309,138297,iLead Agua Dulce,91390,1.0,64,,12.07,25.86,37.93,...,ACTON-AGUA DULCE UNIFIED,34.472708,-118.196768,56618000,3811,3425,Unified School District,119.0,35.0,31.0
10432,73452,120600,iQ Academy California-Los Angeles,93065,1.0,405,,9.09,29.75,38.84,...,ROWLAND UNIFIED,33.985314,-117.888584,215076000,15219,15708,Unified School District,702.0,373.0,364.0
10433,10397,120717,one.Charter,95206,1.0,184,,0.00,6.15,6.15,...,,,,,,,County Office of Education (COE),509.0,414.0,149.0


- Mathematics Dataset:

In [60]:
# Merge math_revenue_df with freelunch_df on School Code
math_df = math_revenue_df.merge(freelunch_df, on='School Code', how='left')

# Sort values
math_df = math_df.sort_values('School Name')

# Rename columns
math_df.columns = ['District Code', 'School Code', 'School Name', 'Zip Code',
       'Subgroup ID', 'CAASPP Reported Enrollment', 'Mean Scale Score',
       'Percentage Standard Exceeded', 'Percentage Standard Met',
       'Percentage Standard Met and Above', 'Percentage Standard Nearly Met',
       'Percentage Standard Not Met', 'Students with Scores', 'Male', 'Female',
       'Military', 'Non Military', 'Homeless', 'Non Homeless', 'Disadvantaged',
       'Not Disadvantaged', 'Black', 'Native American', 'Asian', 'Hispanic',
       'Pacific Islander', 'White', 'Two/More Races', '< High School',
       'High School Grad', 'Some College', 'College Grad', 'Graduate School',
       'Median Household Income', 'School_Name', 'District', 'EDP 365',
       'Current Expense ADA', 'Current Expense Per ADA', 'District_Name',
       'Agency Name', 'Latitude', 'Longitude', 'Total Revenue',
       'Total Revenue per Pupil', 'Total Expenditures per Pupil',
       'District Type', 'Enrollment K-12', 'Free Meal Count K-12',
       'Free Meal Count Ages 5-17']

math_df

Unnamed: 0,District Code,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,...,Agency Name,Latitude,Longitude,Total Revenue,Total Revenue per Pupil,Total Expenditures per Pupil,District Type,Enrollment K-12,Free Meal Count K-12,Free Meal Count Ages 5-17
0,66993,129882,21st Century Learning Institute,92223,1.0,58,,1.79,8.93,10.71,...,BEAUMONT UNIFIED,33.962281,-116.984589,130514000,12626,14449,Unified School District,88.0,41.0,37.0
1,66480,6027767,A. E. Arnold Elementary,90630,1.0,447,,36.36,27.05,63.41,...,CYPRESS ELEMENTARY,33.824900,-118.045700,53979000,13641,11589,Elementary School District,739.0,246.0,246.0
2,66522,6028211,A. G. Cook Elementary,92844,1.0,192,,46.28,25.53,71.81,...,GARDEN GROVE UNIFIED,33.777700,-117.953000,634300000,14695,16663,Unified School District,366.0,187.0,186.0
3,73643,6085377,A. G. Currie Middle,92780,1.0,585,,8.06,10.29,18.35,...,TUSTIN UNIFIED,33.743100,-117.824900,316172000,13166,12601,Unified School District,611.0,422.0,422.0
4,69369,6046114,A. J. Dorsa Elementary,95122,1.0,184,,11.05,14.36,25.41,...,ALUM ROCK UNION ELEMENTARY,37.369388,-121.833560,166657000,14788,15286,Elementary School District,371.0,262.0,254.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10430,75309,136531,iLEAD Online,93510,1.0,52,,9.09,9.09,18.18,...,ACTON-AGUA DULCE UNIFIED,34.472708,-118.196768,56618000,3811,3425,Unified School District,73.0,28.0,27.0
10431,75309,138297,iLead Agua Dulce,91390,1.0,64,,6.90,20.69,27.59,...,ACTON-AGUA DULCE UNIFIED,34.472708,-118.196768,56618000,3811,3425,Unified School District,119.0,35.0,31.0
10432,73452,120600,iQ Academy California-Los Angeles,93065,1.0,405,,5.79,7.71,13.50,...,ROWLAND UNIFIED,33.985314,-117.888584,215076000,15219,15708,Unified School District,702.0,373.0,364.0
10433,10397,120717,one.Charter,95206,1.0,184,,0.00,0.00,0.00,...,,,,,,,County Office of Education (COE),509.0,414.0,149.0
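
Assigning a complete list to `.columns`, as in the two cells above, depends on the column order never changing. As an alternative sketch (not the approach used in this notebook), the multi-line spreadsheet headers can be cleaned with string methods so that each name is derived from its original. The toy frame below only carries a few of the headers seen earlier.

```python
import pandas as pd

# Toy frame with the multi-line headers seen in the source spreadsheets
df = pd.DataFrame(columns=["School Code", "Current\nExpense Per ADA",
                           "Enrollment \n(K-12)", "Free Meal \nCount \n(Ages 5-17)"])

# Collapse newlines and repeated spaces, then drop parentheses
clean = (df.columns.str.replace(r"\s+", " ", regex=True)
                   .str.replace(r"[()]", "", regex=True)
                   .str.strip())
df.columns = clean
print(list(df.columns))
```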


--------------

## Reorganize columns on df_language and df_math

- Language Arts & Literature Dataset:

In [61]:
# Select columns of interest
df_language = language_df[['District', 'School Name', 'Zip Code', 'Latitude', 'Longitude', 
       'Median Household Income', 'CAASPP Reported Enrollment', 'Enrollment K-12', 'Total Revenue',
       'Total Revenue per Pupil', 'Total Expenditures per Pupil', 'EDP 365', 'Free Meal Count K-12',
       'Free Meal Count Ages 5-17', 'Current Expense ADA', 'Current Expense Per ADA', 'Male', 'Female',
       'Military', 'Non Military', 'Homeless', 'Non Homeless', 'Disadvantaged',
       'Not Disadvantaged', 'Black', 'Native American', 'Asian', 'Hispanic',
       'Pacific Islander', 'White', 'Two/More Races', '< High School',
       'High School Grad', 'Some College', 'College Grad', 'Graduate School',
       'Percentage Standard Met and Above']].copy()  # independent copy, so later assignments do not touch language_df

#df_language

In [62]:
# create csv file with language_df

- Mathematics Dataset:

In [63]:
df_math = math_df[['District', 'School Name', 'Zip Code', 'Latitude', 'Longitude', 
       'Median Household Income', 'CAASPP Reported Enrollment', 'Enrollment K-12', 'Total Revenue',
       'Total Revenue per Pupil', 'Total Expenditures per Pupil', 'EDP 365', 'Free Meal Count K-12',
       'Free Meal Count Ages 5-17', 'Current Expense ADA', 'Current Expense Per ADA', 'Male', 'Female',
       'Military', 'Non Military', 'Homeless', 'Non Homeless', 'Disadvantaged',
       'Not Disadvantaged', 'Black', 'Native American', 'Asian', 'Hispanic',
       'Pacific Islander', 'White', 'Two/More Races', '< High School',
       'High School Grad', 'Some College', 'College Grad', 'Graduate School',
       'Percentage Standard Met and Above']].copy()  # independent copy, so later assignments do not touch math_df

#df_math

In [64]:
# create csv file with math_df

----------

# DATA CLEANING

## Data definition
1. Column name
2. Data type
3. Description of column
4. Count or percent per unique values or code (includes NA)
5. Range of values

## Handling missing and NA data
1. Identify how many NA values are in the dataset:
    - `df.info()`
    - `.isnull()`
    - `value_counts()`
2. Review the percentage of observations missing per column
3. Drop, impute, or replace missing values
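
A minimal sketch of steps 1 and 2 follows, using a toy frame that mimics a few columns of this dataset. It assumes that the `*` entries seen in the tables stand for suppressed values and should be counted as missing; that reading is an assumption and should be checked against the data documentation.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the notebook's columns; "*" is assumed to mark suppressed counts
df = pd.DataFrame({
    "Hispanic": ["33", "131", "*", "532"],
    "Pacific Islander": [None, "*", None, "*"],
    "Median Household Income": [64738.0, 84051.0, np.nan, 64089.0],
})

# Assumption: treat "*" as missing alongside true NaN/None values
df_na = df.replace("*", np.nan)

# Percentage of missing observations per column, highest first
pct_missing = df_na.isnull().mean().mul(100).sort_values(ascending=False)
print(pct_missing)
```

Columns with a very high share of missing values are candidates for dropping, while columns with a small share may be better served by imputation.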

## Removing duplicates
- Duplicates were addressed during merges

## Reviewing for outliers and anomalies
- Boxplots
    - Quick way to identify outliers or anomalous observations
    - Important to consider these in the context of the problem
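
A boxplot conventionally draws as individual points the observations that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically, as sketched below on a handful of per-pupil revenue values copied from the revenue table above. The variable names are illustrative, and the 1.5 multiplier is the usual convention rather than a choice made in this notebook.

```python
import pandas as pd

# A few "Total Revenue per Pupil" values copied from the revenue table
revenue_per_pupil = pd.Series([12922, 16835, 12831, 12251, 16438, 12787, 51699])

q1, q3 = revenue_per_pupil.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whisker range are the ones a boxplot draws individually
outliers = revenue_per_pupil[(revenue_per_pupil < lower) | (revenue_per_pupil > upper)]
print(outliers.tolist())
```

Whether a flagged value is an error or a legitimate extreme (for example, a county office serving few pupils) still has to be judged in context.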

## Language Arts & Literature Dataset:

In [65]:
df_language

Unnamed: 0,District,School Name,Zip Code,Latitude,Longitude,Median Household Income,CAASPP Reported Enrollment,Enrollment K-12,Total Revenue,Total Revenue per Pupil,...,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School,Percentage Standard Met and Above
0,Beaumont Unified,21st Century Learning Institute,92223,33.962281,-116.984589,64738.00,58,88.0,130514000,12626,...,33,,18,,6,13,22,9,4,44.64
1,Cypress Elementary,A. E. Arnold Elementary,90630,33.824900,-118.045700,84051.00,447,739.0,53979000,13641,...,131,*,111,7,15,44,85,138,121,65.60
2,Garden Grove Unified,A. G. Cook Elementary,92844,33.777700,-117.953000,48345.00,192,366.0,634300000,14695,...,43,,10,6,*,13,14,33,6,81.18
3,Tustin Unified,A. G. Currie Middle,92780,33.743100,-117.824900,64089.00,585,611.0,316172000,13166,...,532,*,15,*,223,168,82,39,15,30.38
4,Alum Rock Union Elementary,A. J. Dorsa Elementary,95122,37.369388,-121.833560,57470.00,184,371.0,166657000,14788,...,166,,*,*,82,55,24,15,*,29.83
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10430,Acton-Agua Dulce Unified,iLEAD Online,93510,34.472708,-118.196768,89403.00,52,73.0,56618000,3811,...,14,,22,*,*,*,11,18,12,50.00
10431,Acton-Agua Dulce Unified,iLead Agua Dulce,91390,34.472708,-118.196768,105659.00,64,119.0,56618000,3811,...,24,,30,9,,*,17,16,10,37.93
10432,Rowland Unified,iQ Academy California-Los Angeles,93065,33.985314,-117.888584,94173.00,405,702.0,215076000,15219,...,36,8,166,33,,9,25,18,5,38.84
10433,,one.Charter,95206,,,42404.00,184,509.0,,,...,112,*,16,13,63,47,35,10,9,6.15


In [66]:
# Round decimal places
# Note: round() only affects numeric columns; columns still stored as strings are left unchanged
cols = ['Median Household Income', 'CAASPP Reported Enrollment',
       'Enrollment K-12', 'Total Revenue', 'Total Revenue per Pupil',
       'Total Expenditures per Pupil', 'EDP 365', 'Free Meal Count K-12',
       'Free Meal Count Ages 5-17', 'Current Expense ADA',
       'Current Expense Per ADA', 'Male', 'Female', 'Military', 'Non Military',
       'Homeless', 'Non Homeless', 'Disadvantaged', 'Not Disadvantaged',
       'Black', 'Native American', 'Asian', 'Hispanic', 'Pacific Islander',
       'White', 'Two/More Races', '< High School', 'High School Grad',
       'Some College', 'College Grad', 'Graduate School',
       'Percentage Standard Met and Above']

df_language[cols] = df_language[cols].round(2)

In [67]:
# Convert to int!

In [68]:
# Verify column names
df_language.columns

Index(['District', 'School Name', 'Zip Code', 'Latitude', 'Longitude',
       'Median Household Income', 'CAASPP Reported Enrollment',
       'Enrollment K-12', 'Total Revenue', 'Total Revenue per Pupil',
       'Total Expenditures per Pupil', 'EDP 365', 'Free Meal Count K-12',
       'Free Meal Count Ages 5-17', 'Current Expense ADA',
       'Current Expense Per ADA', 'Male', 'Female', 'Military', 'Non Military',
       'Homeless', 'Non Homeless', 'Disadvantaged', 'Not Disadvantaged',
       'Black', 'Native American', 'Asian', 'Hispanic', 'Pacific Islander',
       'White', 'Two/More Races', '< High School', 'High School Grad',
       'Some College', 'College Grad', 'Graduate School',
       'Percentage Standard Met and Above'],
      dtype='object')

In [69]:
# Check for data type
df_language.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10435 entries, 0 to 10434
Data columns (total 37 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   District                           9930 non-null   object 
 1   School Name                        10435 non-null  object 
 2   Zip Code                           10435 non-null  object 
 3   Latitude                           9310 non-null   float64
 4   Longitude                          9310 non-null   float64
 5   Median Household Income            10367 non-null  object 
 6   CAASPP Reported Enrollment         10434 non-null  object 
 7   Enrollment K-12                    9992 non-null   float64
 8   Total Revenue                      9310 non-null   object 
 9   Total Revenue per Pupil            9310 non-null   object 
 10  Total Expenditures per Pupil       9310 non-null   object 
 11  EDP 365                            9930 non-null   flo

In [70]:
# Check for missing data
df_language.isnull().sum()

District                              505
School Name                             0
Zip Code                                0
Latitude                             1125
Longitude                            1125
Median Household Income                68
CAASPP Reported Enrollment              1
Enrollment K-12                       443
Total Revenue                        1125
Total Revenue per Pupil              1125
Total Expenditures per Pupil         1125
EDP 365                               505
Free Meal Count K-12                  443
Free Meal Count Ages 5-17             443
Current Expense ADA                   505
Current Expense Per ADA               505
Male                                   51
Female                                253
Military                             7844
Non Military                            1
Homeless                             2887
Non Homeless                            5
Disadvantaged                         149
Not Disadvantaged                 

In [71]:
# percentage of missing data per column
percent_missing = (df_language.isnull().sum() * 100 / len(df_language)).round(2)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})

# Sort values in ascending order
missing_value_df.sort_values('percent_missing', inplace=True)
missing_value_df

Unnamed: 0,percent_missing
School Name,0.0
Zip Code,0.0
Percentage Standard Met and Above,0.01
Non Military,0.01
CAASPP Reported Enrollment,0.01
Non Homeless,0.05
Male,0.49
Median Household Income,0.65
Disadvantaged,1.43
Not Disadvantaged,1.89


In [72]:
# Turn string to numeric
cols = ['Median Household Income', 'CAASPP Reported Enrollment', 'Total Revenue', 'Total Revenue per Pupil',
       'Total Expenditures per Pupil', 'Male', 'Female', 'Military', 'Non Military',
       'Homeless', 'Non Homeless', 'Disadvantaged', 'Not Disadvantaged',
       'Black', 'Native American', 'Asian', 'Hispanic', 'Pacific Islander',
       'White', 'Two/More Races', '< High School', 'High School Grad',
       'Some College', 'College Grad', 'Graduate School',
       'Percentage Standard Met and Above']

# Convert column by column (the default axis); non-numeric entries such as '*' become NaN
df_language[cols] = df_language[cols].apply(pd.to_numeric, errors='coerce')
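
The effect of `errors='coerce'` can be seen on a small example. The values mirror the kinds of entries shown in the table above (numbers stored as strings, `*` markers, and empty cells), but the Series itself is hypothetical.

```python
import pandas as pd

# Hypothetical column mixing numeric strings, a '*' marker, and a missing entry
raw = pd.Series(["58", "*", None, "44.64"])

# Anything that cannot be parsed as a number is replaced with NaN
converted = pd.to_numeric(raw, errors="coerce")

# Number of entries that are missing after conversion
n_missing = converted.isna().sum()
```

This is why the non-null count of a column can fall after conversion: each `*` entry is counted as non-null while the column is stored as strings, and as missing once it is numeric.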



In [73]:
# Summary statistics of columns
df_language.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Latitude,9310.0,35.82072,2.210299,32.56426,34.0096,34.70031,37.7794,41.9633
Longitude,9310.0,-119.5676,2.027393,-124.285548,-121.5618,-119.0313,-117.9467,-114.5959
Median Household Income,10367.0,62551.68,25323.84,11922.0,43442.0,56651.0,77047.0,236912.0
CAASPP Reported Enrollment,9968.0,330.2603,274.9335,4.0,155.0,273.0,417.0,3665.0
Enrollment K-12,9992.0,620.9588,535.391,1.0,325.0,519.0,740.0,6324.0
Total Revenue,9310.0,1335865000.0,3044181000.0,217000.0,69822000.0,217062000.0,563697000.0,10200840000.0
Total Revenue per Pupil,9266.0,14631.85,3822.233,632.0,12922.0,14095.0,16044.0,106533.0
Total Expenditures per Pupil,9266.0,14605.09,3904.649,490.0,12678.0,14212.0,15700.0,99133.0
EDP 365,9930.0,914935500.0,2088821000.0,174193.83,57159050.0,165366900.0,411789600.0,7213548000.0
Free Meal Count K-12,9992.0,324.6413,312.4046,0.0,104.0,255.0,447.0,3863.0


In [74]:
# Check for data type
df_language.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10435 entries, 0 to 10434
Data columns (total 37 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   District                           9930 non-null   object 
 1   School Name                        10435 non-null  object 
 2   Zip Code                           10435 non-null  object 
 3   Latitude                           9310 non-null   float64
 4   Longitude                          9310 non-null   float64
 5   Median Household Income            10367 non-null  float64
 6   CAASPP Reported Enrollment         9968 non-null   float64
 7   Enrollment K-12                    9992 non-null   float64
 8   Total Revenue                      9310 non-null   float64
 9   Total Revenue per Pupil            9266 non-null   float64
 10  Total Expenditures per Pupil       9266 non-null   float64
 11  EDP 365                            9930 non-null   flo

In [75]:
df_language

Unnamed: 0,District,School Name,Zip Code,Latitude,Longitude,Median Household Income,CAASPP Reported Enrollment,Enrollment K-12,Total Revenue,Total Revenue per Pupil,...,Hispanic,Pacific Islander,White,Two/More Races,< High School,High School Grad,Some College,College Grad,Graduate School,Percentage Standard Met and Above
0,Beaumont Unified,21st Century Learning Institute,92223,33.962281,-116.984589,64738.0,58.0,88.0,130514000.0,12626.0,...,33.0,,18.0,,6.0,13.0,22.0,9.0,4.0,44.64
1,Cypress Elementary,A. E. Arnold Elementary,90630,33.824900,-118.045700,84051.0,447.0,739.0,53979000.0,13641.0,...,131.0,,111.0,7.0,15.0,44.0,85.0,138.0,121.0,65.60
2,Garden Grove Unified,A. G. Cook Elementary,92844,33.777700,-117.953000,48345.0,192.0,366.0,634300000.0,14695.0,...,43.0,,10.0,6.0,,13.0,14.0,33.0,6.0,81.18
3,Tustin Unified,A. G. Currie Middle,92780,33.743100,-117.824900,64089.0,585.0,611.0,316172000.0,13166.0,...,532.0,,15.0,,223.0,168.0,82.0,39.0,15.0,30.38
4,Alum Rock Union Elementary,A. J. Dorsa Elementary,95122,37.369388,-121.833560,57470.0,184.0,371.0,166657000.0,14788.0,...,166.0,,,,82.0,55.0,24.0,15.0,,29.83
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10430,Acton-Agua Dulce Unified,iLEAD Online,93510,34.472708,-118.196768,89403.0,52.0,73.0,56618000.0,3811.0,...,14.0,,22.0,,,,11.0,18.0,12.0,50.00
10431,Acton-Agua Dulce Unified,iLead Agua Dulce,91390,34.472708,-118.196768,105659.0,64.0,119.0,56618000.0,3811.0,...,24.0,,30.0,9.0,,,17.0,16.0,10.0,37.93
10432,Rowland Unified,iQ Academy California-Los Angeles,93065,33.985314,-117.888584,94173.0,405.0,702.0,215076000.0,15219.0,...,36.0,8.0,166.0,33.0,,9.0,25.0,18.0,5.0,38.84
10433,,one.Charter,95206,,,42404.0,184.0,509.0,,,...,112.0,,16.0,13.0,63.0,47.0,35.0,10.0,9.0,6.15


-------------

# EXPLORATORY DATA ANALYSIS
## Model development dataset is ready for exploration

### Summary statistics
- Check individual variable distribution
    - Verify the spread of the data
    - May be able to infer the mean from distribution plots
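
As a rough numeric companion to a distribution plot, comparing the mean with the median shows which way a variable is skewed. The enrollment values below are hypothetical.

```python
import pandas as pd

# Hypothetical enrollment values with one very large school (right-skewed)
enrollment = pd.Series(
    [150, 200, 250, 300, 350, 400, 3000],
    name="Enrollment K-12",
)

# Count, mean, std, min, quartiles, and max in one call
summary = enrollment.describe()

# A mean well above the median suggests a long right tail
mean_minus_median = summary["mean"] - summary["50%"]

# Sample skewness: positive for a right-skewed distribution
skewness = enrollment.skew()
```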

## Visualizing relationship between variables
- Correlation matrix
    - Look at correlation for each variable in the dataframe
    - Using Pearson correlation heatmap
- Pairplots
    - Visualizing variable distributions against one another
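
A minimal sketch of the correlation step, assuming a toy frame that borrows three column names from the dataset. The numbers are invented, so the resulting coefficients say nothing about the real data.

```python
import pandas as pd

# Hypothetical values; only the column names come from the dataset
df = pd.DataFrame({
    "Median Household Income": [40000, 55000, 70000, 85000, 100000],
    "Free Meal Count K-12": [400, 320, 250, 150, 90],
    "Percentage Standard Met and Above": [25.0, 38.0, 47.0, 61.0, 72.0],
})

# Pairwise Pearson correlation between all numeric columns
corr = df.corr(method="pearson")

# Correlation of each feature with the outcome, strongest positive first
target = "Percentage Standard Met and Above"
target_corr = corr[target].drop(target).sort_values(ascending=False)
```

The `corr` matrix can then be passed to `sns.heatmap(corr, annot=True)` to draw the heatmap. Pearson correlation only captures linear association, so a low coefficient does not rule out a nonlinear relationship.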

## Pairplots

Pairplots offer another way to evaluate the variables' distributions against each other.