   # Education Project


   <img src='data/education_image.jpg' width="900">
   
   **Credit:**  [wsimag](https://wsimag.com/culture/60264-education-in-venezuela-the-americas-and-the-world)



In [1]:
# Load relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
import warnings

sns.set(style='ticks')

warnings.filterwarnings("ignore")  # Suppress all warnings

# Introduction

## Business Context
Research shows that high-poverty areas disproportionately educate children of color. The chances of ending up in a high-poverty or high-minority school are largely determined by a student’s race/ethnicity and social class. For instance, African American and Hispanic students, even if they are not poor, are much more likely than white or Asian students to be in high-poverty schools.

There is a growing body of evidence that increased investment in education yields better outcomes and that the positive effects are even greater among low-income students. On the other hand, it costs more to educate low-income students and provide them with a robust education capable of overcoming their initial disadvantages.


### Goals
1. Understand the current demographics of wealthy to high-poverty schools across the state of California.
2. Identify how much funding is available per pupil in wealthy vs high-poverty areas.
3. Learn what factors are most correlated with student performance.


#### Predictive modeling
- What is the average test score per school?
- What percentage of students pass versus do not pass?


# DATA WRANGLING

Data wrangling is the process of transforming and mapping data from a "raw" form into another format, with the intent of making it more appropriate and valuable for downstream purposes such as analytics.

- Extracting and cleaning relevant data. Let's start looking at the datasets!

### Assessment Data

- It contains assessment data for the Smarter Balanced Summative Assessments (2018-2019) for the state of California.

- Legend types can be found here: https://caaspp-elpac.cde.ca.gov/caaspp/research_fixfileformat19
- More information about the assessment setup: https://www.cde.ca.gov/ta/tg/ca/sbsummativefaq.asp

In [2]:
# load data file
df_all = pd.read_csv('large_data/sb_ca2019_all_csv_v4.txt')

In [3]:
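# create dataset containing district-level data
# (assumes district rows have School Code 0 and a non-zero District Code; District Code 0 marks county/state aggregates)
df_district = df_all[(df_all['School Code'] == 0) & (df_all['District Code'] != 0)]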

In [4]:
# create dataset containing school-level data (rows with a non-zero School Code)
df_school = df_all[df_all['School Code'] != 0]
df_school.head(10)

Unnamed: 0,County Code,District Code,School Code,Filler,Test Year,Subgroup ID,Test Type,Total Tested At Entity Level,Total Tested with Scores,Grade,...,Area 1 Percentage Below Standard,Area 2 Percentage Above Standard,Area 2 Percentage Near Standard,Area 2 Percentage Below Standard,Area 3 Percentage Above Standard,Area 3 Percentage Near Standard,Area 3 Percentage Below Standard,Area 4 Percentage Above Standard,Area 4 Percentage Near Standard,Area 4 Percentage Below Standard
1888,1,10017,112607,,2019,1,B,85,84,11,...,37.35,12.20,60.98,26.83,7.23,71.08,21.69,13.25,61.45,25.30
1889,1,10017,112607,,2019,3,B,42,42,11,...,35.71,14.29,50.00,35.71,7.14,76.19,16.67,9.52,64.29,26.19
1890,1,10017,112607,,2019,4,B,43,42,11,...,39.02,10.00,72.50,17.50,7.32,65.85,26.83,17.07,58.54,24.39
1891,1,10017,112607,,2019,6,B,79,78,11,...,33.77,13.16,63.16,23.68,7.79,71.43,20.78,14.29,63.64,22.08
1892,1,10017,112607,,2019,7,B,*,*,11,...,*,*,*,*,*,*,*,*,*,*
1893,1,10017,112607,,2019,8,B,38,38,11,...,36.84,18.42,68.42,13.16,7.89,73.68,18.42,18.42,65.79,15.79
1894,1,10017,112607,,2019,31,B,68,67,11,...,40.91,12.31,63.08,24.62,9.09,68.18,22.73,13.64,62.12,24.24
1895,1,10017,112607,,2019,51,B,85,84,11,...,37.35,12.20,60.98,26.83,7.23,71.08,21.69,13.25,61.45,25.30
1896,1,10017,112607,,2019,53,B,85,84,11,...,37.35,12.20,60.98,26.83,7.23,71.08,21.69,13.25,61.45,25.30
1897,1,10017,112607,,2019,74,B,30,29,11,...,28.57,7.41,59.26,33.33,3.57,67.86,28.57,7.14,71.43,21.43


In [5]:
# check columns' names
df_school.columns

Index(['County Code', 'District Code', 'School Code', 'Filler', 'Test Year',
       'Subgroup ID', 'Test Type', 'Total Tested At Entity Level',
       'Total Tested with Scores', 'Grade', 'Test Id',
       'CAASPP Reported Enrollment', 'Students Tested', 'Mean Scale Score',
       'Percentage Standard Exceeded', 'Percentage Standard Met',
       'Percentage Standard Met and Above', 'Percentage Standard Nearly Met',
       'Percentage Standard Not Met', 'Students with Scores',
       'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
       'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
       'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
       'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
       'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
       'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard'],
      dtype='object')

In [6]:
# check data type
df_school.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3013079 entries, 1888 to 3576490
Data columns (total 32 columns):
 #   Column                             Dtype  
---  ------                             -----  
 0   County Code                        int64  
 1   District Code                      int64  
 2   School Code                        int64  
 3   Filler                             float64
 4   Test Year                          int64  
 5   Subgroup ID                        int64  
 6   Test Type                          object 
 7   Total Tested At Entity Level       object 
 8   Total Tested with Scores           object 
 9   Grade                              int64  
 10  Test Id                            int64  
 11  CAASPP Reported Enrollment         object 
 12  Students Tested                    object 
 13  Mean Scale Score                   object 
 14  Percentage Standard Exceeded       object 
 15  Percentage Standard Met            object 
 16  Percentage Stan

In [7]:
# Check for missing data
df_school.isnull().sum()

County Code                                0
District Code                              0
School Code                                0
Filler                               3013079
Test Year                                  0
Subgroup ID                                0
Test Type                                  0
Total Tested At Entity Level               0
Total Tested with Scores                   0
Grade                                      0
Test Id                                    0
CAASPP Reported Enrollment                 0
Students Tested                            0
Mean Scale Score                      797631
Percentage Standard Exceeded               0
Percentage Standard Met                    0
Percentage Standard Met and Above          0
Percentage Standard Nearly Met             0
Percentage Standard Not Met                0
Students with Scores                       0
Area 1 Percentage Above Standard           0
Area 1 Percentage Near Standard            0
Area 1 Per
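
The zero counts above may understate how much data is missing: the `head()` output shows `*` in several score columns, and `info()` reports those columns as `object`, so `isnull()` treats each `*` as an ordinary string. Below is a minimal sketch of one way to surface those gaps. It assumes `*` is a placeholder for a suppressed or unavailable value, and the `sample` frame is a hand-typed toy example, not the real file.

```python
import pandas as pd

# Toy frame imitating the raw file: numbers stored as text, '*' as a placeholder.
sample = pd.DataFrame({
    'School Code': [112607, 130401, 124172],
    'CAASPP Reported Enrollment': ['90', '*', '239'],
    'Percentage Standard Met': ['27.38', '*', '22.41'],
})

# isnull() sees '*' as a normal string, so nothing is counted as missing.
missing_before = sample.isnull().sum()

# Convert the text columns to numbers; anything unparseable (such as '*') becomes NaN.
numeric_cols = ['CAASPP Reported Enrollment', 'Percentage Standard Met']
cleaned = sample.copy()
cleaned[numeric_cols] = cleaned[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Now the placeholders show up as missing values.
missing_after = cleaned.isnull().sum()
print(missing_after)
```

Passing `na_values='*'` to `pd.read_csv` is another option if the placeholder should be treated as missing from the start.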

In [8]:
# Number of rows where subgroup ID == 1
df_school[df_school['Subgroup ID'] == 1].count()

County Code                          87324
District Code                        87324
School Code                          87324
Filler                                   0
Test Year                            87324
Subgroup ID                          87324
Test Type                            87324
Total Tested At Entity Level         87324
Total Tested with Scores             87324
Grade                                87324
Test Id                              87324
CAASPP Reported Enrollment           87324
Students Tested                      87324
Mean Scale Score                     66727
Percentage Standard Exceeded         87324
Percentage Standard Met              87324
Percentage Standard Met and Above    87324
Percentage Standard Nearly Met       87324
Percentage Standard Not Met          87324
Students with Scores                 87324
Area 1 Percentage Above Standard     87324
Area 1 Percentage Near Standard      87324
Area 1 Percentage Below Standard     87324
Area 2 Perc

In [9]:
# Check number of unique schools
df_school['School Code'].nunique()

10300

- There are 10,300 unique schools!

### Reorganizing Subgroup ID 

The assessment dataset stores a lot of demographic information in the Subgroup ID column. We need to reorganize the dataset so that it has one variable per column and one observation per row, and keep only the demographic information of interest.

#### Before merging:
- Filter variables of interest;
- Rearrange the data to have: 
    - one feature per column; 
    - one observation per row;

This dataset represents the Smarter Balanced Assessments (SB) for English Language Arts/Literacy (Test ID 1) and Mathematics (Test ID 2). More info about the test can be found here: https://www.caaspp.org/administration/about/testing/index.html
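
The "one feature per column, one observation per row" goal can also be sketched as a single pivot. This is only an illustration on toy data, not the approach used in the cells of this notebook: `long_df` and `subgroup_names` are hypothetical names, and the numbers are typed by hand.

```python
import pandas as pd

# Toy long-format data: one row per (school, subgroup), as in the raw file.
long_df = pd.DataFrame({
    'School Code': [112607, 112607, 112607, 123968, 123968],
    'Subgroup ID': [1, 3, 4, 1, 3],
    'CAASPP Reported Enrollment': [90, 43, 47, 142, 70],
})

# Hypothetical subset of the legend: Subgroup ID -> column name.
subgroup_names = {1: 'All Students', 3: 'Male', 4: 'Female'}

wide_df = (
    long_df[long_df['Subgroup ID'].isin(list(subgroup_names))]
    # One row per school, one column per subgroup.
    .pivot(index='School Code', columns='Subgroup ID',
           values='CAASPP Reported Enrollment')
    .rename(columns=subgroup_names)
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover 'Subgroup ID' axis label
print(wide_df)
```

A school with no row for a subgroup ends up with `NaN` in that column. `pivot` raises an error if a (school, subgroup) pair appears more than once, so the data would need to be filtered to a single test and grade first.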

## Creating two datasets for modeling

- Language Arts & Literacy: test_id == 1

    - 10,299 rows
    
    
- Mathematics: test_id == 2

    - 10,298 rows



In [10]:
# Grade == 13 is the summary row covering all grades in a school
all_grades = df_school[df_school['Grade'] == 13]

# Subgroup ID == 1 is the summary row covering all students
all_students = all_grades[all_grades['Subgroup ID'] == 1]

In [11]:
# Create df_test1 language arts & literature 
df_test1 = all_students[all_students['Test Id'] == 1]

# drop columns with redundant information
df_language = df_test1.drop(columns = ['Filler', 'Test Year', 'Test Type', 'County Code', 'District Code',
                             'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
                             'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
                             'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
                             'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
                             'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
                             'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard', 
                             'Total Tested At Entity Level', 'Total Tested with Scores', 'Grade',
                             'Test Id', 'Students Tested'])
df_language

Unnamed: 0,School Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores
1927,112607,1,90,,7.14,27.38,34.52,36.90,28.57,84
2234,123968,1,142,,7.09,18.11,25.20,29.92,44.88,127
2681,124172,1,239,,71.12,22.41,93.53,4.31,2.16,232
3126,125567,1,201,,28.13,17.71,45.83,18.75,35.42,192
3474,130401,1,*,,*,*,*,*,*,*
...,...,...,...,...,...,...,...,...,...,...
3575148,6056816,1,608,,13.95,37.41,51.36,24.32,24.32,588
3575555,6056832,1,146,,25.76,25.76,51.52,27.27,21.21,132
3575772,6056840,1,74,,31.08,22.97,54.05,18.92,27.03,74
3575957,6118806,1,57,,23.53,41.18,64.71,21.57,13.73,51


In [12]:
# Create df_test2 mathematics
df_test2 = all_students[all_students['Test Id'] == 2]

# drop columns with redundant information
df_math = df_test2.drop(columns = ['Filler', 'Test Year', 'Test Type', 'County Code', 'District Code',
                             'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
                             'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
                             'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
                             'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
                             'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
                             'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard',
                             'Total Tested At Entity Level', 'Total Tested with Scores', 'Grade',
                             'Test Id', 'Students Tested'])
df_math

Unnamed: 0,School Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores
2005,112607,1,90,,3.57,7.14,10.71,16.67,72.62,84
2465,123968,1,142,,6.67,11.11,17.78,28.15,54.07,135
2891,124172,1,239,,74.57,19.40,93.97,5.17,0.86,232
3367,125567,1,201,,17.10,18.65,35.75,18.13,46.11,193
3572,130401,1,34,,*,*,*,*,*,4
...,...,...,...,...,...,...,...,...,...,...
3575405,6056816,1,608,,13.78,27.04,40.82,32.48,26.70,588
3575698,6056832,1,146,,9.09,24.24,33.33,40.15,26.52,132
3575838,6056840,1,74,,12.16,31.08,43.24,27.03,29.73,74
3576079,6118806,1,57,,15.69,35.29,50.98,29.41,19.61,51


#### Subgroup ID 

In the legend below, the Demographic ID and Demographic ID Num columns correspond to the Subgroup ID column in the assessment dataset.

In [13]:
legend = pd.read_csv('data/Subgroups.txt')
legend

Unnamed: 0,Demographic ID,Demographic ID Num,Demographic Name,Student Group
0,1,1,All Students,All Students
1,3,3,Male,Gender
2,4,4,Female,Gender
3,6,6,Fluent English proficient and English only,English-Language Fluency
4,7,7,Initial fluent English proficient (IFEP),English-Language Fluency
5,8,8,Reclassified fluent English proficient (RFEP),English-Language Fluency
6,28,28,Migrant education,Migrant
7,31,31,Economically disadvantaged,Economic Status
8,50,50,Military,Military Status
9,51,51,Not Military,Military Status
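
The legend can be turned into a lookup so that subgroup IDs are translated to names in code rather than by hand. A minimal sketch, assuming the legend columns are named as shown above; `legend_sample` re-types a few legend rows for illustration, and `scores` is a hypothetical toy frame.

```python
import pandas as pd

# A few rows of the legend, re-typed by hand for illustration.
legend_sample = pd.DataFrame({
    'Demographic ID Num': [1, 3, 4, 31],
    'Demographic Name': ['All Students', 'Male', 'Female',
                         'Economically disadvantaged'],
})

# Build a Subgroup ID -> name dictionary from the legend.
id_to_name = (
    legend_sample
    .set_index('Demographic ID Num')['Demographic Name']
    .to_dict()
)

# Toy assessment rows; map() attaches the readable name to each subgroup.
scores = pd.DataFrame({'School Code': [112607, 112607], 'Subgroup ID': [3, 31]})
scores['Subgroup Name'] = scores['Subgroup ID'].map(id_to_name)
print(scores)
```

IDs that are absent from the legend map to `NaN`, which makes it easy to spot subgroups that still need a name.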


In [14]:
# filter demographic of interest from Subgroup ID
#subgroup_id = [1, 3, 4, 50, 51, 52, 53, 90, 91, 92, 93, 94, 220, 221, 222, 223, 
#               224, 225, 226, 227, 200, 201, 202, 203, 204, 205, 206, 207]

#df_school_id = df_school[df_school['Subgroup ID'].isin(subgroup_id)]

## Next: 
1. Transform the demographic information stored in Subgroup ID (rows) so that each demographic variable has its own column.
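
The same filter, drop, and rename steps apply to every subgroup, so they can be wrapped in one helper. The function below is a hypothetical sketch (the name `subgroup_enrollment` and the toy frame are assumptions, not part of the original notebook); it selects one subgroup and one test, then renames the enrollment column after the demographic.

```python
import pandas as pd

def subgroup_enrollment(df, subgroup_id, test_id, column_name):
    """Return School Code plus the subgroup's enrollment, renamed to column_name."""
    mask = (df['Subgroup ID'] == subgroup_id) & (df['Test Id'] == test_id)
    return (
        df.loc[mask, ['School Code', 'CAASPP Reported Enrollment']]
        .rename(columns={'CAASPP Reported Enrollment': column_name})
    )

# Toy data with the same column names as the all-grades table.
toy = pd.DataFrame({
    'School Code': [112607, 112607, 112607, 123968],
    'Subgroup ID': [3, 4, 3, 3],
    'Test Id': [1, 1, 2, 1],
    'CAASPP Reported Enrollment': ['43', '47', '43', '70'],
})

male = subgroup_enrollment(toy, subgroup_id=3, test_id=1, column_name='Male')
print(male)
```

Calling the helper once per subgroup keeps the ID, test, and column name together in one place, which reduces the risk of copy-and-paste mistakes.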

# Language arts and literacy dataset test_id == 1

In [15]:
# drop columns with redundant information at ALL_GRADES level
all_grades = all_grades.drop(columns = ['Filler', 'Test Year', 'Test Type', 'County Code', 'District Code',
                             'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
                             'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
                             'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
                             'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
                             'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
                             'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard',
                             'Mean Scale Score', 'Percentage Standard Exceeded', 'Percentage Standard Met',
                             'Percentage Standard Met and Above', 'Percentage Standard Nearly Met',
                             'Percentage Standard Not Met', 'Students with Scores', 'Total Tested At Entity Level',
                             'Total Tested with Scores', 'Students Tested', 'Grade'])


In [16]:
# Filter Subgroup ID == 3 
male_students = all_grades[all_grades['Subgroup ID'] == 3]

# Filter test_id == 1 language arts/literature
male_students = male_students[male_students['Test Id'] == 1]

# drop columns with redundant information
male_students = male_students.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
male_students = male_students.rename({'CAASPP Reported Enrollment': 'Male'}, axis=1)
male_students

Unnamed: 0,School Code,Male
1928,112607,43
2235,123968,70
2682,124172,114
3127,125567,101
3475,130401,*
...,...,...
3575149,6056816,325
3575556,6056832,58
3575773,6056840,32
3575958,6118806,30


In [17]:
# Filter Subgroup ID == 4 
female_students = all_grades[all_grades['Subgroup ID'] == 4]

# Filter test_id == 1 language arts/literature
female_students = female_students[female_students['Test Id'] == 1]

# drop columns with redundant information
female_students = female_students.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
female_students = female_students.rename({'CAASPP Reported Enrollment': 'Female'}, axis=1)
female_students

Unnamed: 0,School Code,Female
1929,112607,47
2236,123968,72
2683,124172,125
3128,125567,100
3476,130401,*
...,...,...
3575150,6056816,283
3575557,6056832,88
3575774,6056840,42
3575959,6118806,27


In [18]:
# Filter Subgroup ID == 50 
military_students = all_grades[all_grades['Subgroup ID'] == 50]

# Filter test_id == 1 language arts/literature
military_students = military_students[military_students['Test Id'] == 1]

# drop columns with redundant information
military_students = military_students.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
military_students = military_students.rename({'CAASPP Reported Enrollment': 'Military'}, axis=1)
military_students

Unnamed: 0,School Code,Military
8413,106401,*
8665,111765,22
9176,119222,6
9670,122085,24
10011,126656,5
...,...,...
3574132,114652,15
3575156,6056816,122
3575562,6056832,109
3575779,6056840,*


In [19]:
# Filter Subgroup ID == 51 
non_military_students = all_grades[all_grades['Subgroup ID'] == 51]

# Filter test_id == 1 language arts/literature
non_military_students = non_military_students[non_military_students['Test Id'] == 1]

# drop columns with redundant information
non_military_students = non_military_students.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
non_military_students = non_military_students.rename({'CAASPP Reported Enrollment': 'Non Military'}, axis=1)
non_military_students

Unnamed: 0,School Code,Non Military
1934,112607,90
2240,123968,142
2688,124172,239
3133,125567,201
3480,130401,*
...,...,...
3575157,6056816,486
3575563,6056832,37
3575780,6056840,72
3575965,6118806,15


In [20]:
# Filter Subgroup ID == 52 
homeless_students = all_grades[all_grades['Subgroup ID'] == 52]

# Filter test_id == 1 language arts/literature
homeless_students = homeless_students[homeless_students['Test Id'] == 1]

# drop columns with redundant information
homeless_students = homeless_students.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
homeless_students = homeless_students.rename({'CAASPP Reported Enrollment': 'Homeless'}, axis=1)
homeless_students

Unnamed: 0,School Code,Homeless
3134,125567,*
3481,130401,*
3688,130419,*
4824,137448,6
5021,6001788,5
...,...,...
3572019,6056782,*
3572294,6056790,23
3573465,107375,*
3575158,6056816,6


In [21]:
# Filter Subgroup ID == 53 
non_homeless_students = all_grades[all_grades['Subgroup ID'] == 53]

# Filter test_id == 1 language arts/literature
non_homeless_students = non_homeless_students[non_homeless_students['Test Id'] == 1]

# drop columns with redundant information
non_homeless_students = non_homeless_students.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
non_homeless_students = non_homeless_students.rename({'CAASPP Reported Enrollment': 'Non Homeless'}, axis=1)
non_homeless_students

Unnamed: 0,School Code,Non Homeless
1935,112607,90
2241,123968,142
2689,124172,239
3135,125567,198
3482,130401,*
...,...,...
3575159,6056816,602
3575564,6056832,146
3575781,6056840,74
3575966,6118806,57


In [22]:
# Filter Subgroup ID == 31
disadvantaged_students = all_grades[all_grades['Subgroup ID'] == 31]

# Filter test_id == 1 language arts/literature
disadvantaged_students = disadvantaged_students[disadvantaged_students['Test Id'] == 1]

# drop columns with redundant information
disadvantaged_students = disadvantaged_students.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
disadvantaged_students = disadvantaged_students.rename({'CAASPP Reported Enrollment': 'Disadvantaged'}, axis=1)
disadvantaged_students

Unnamed: 0,School Code,Disadvantaged
1933,112607,69
2239,123968,59
2687,124172,26
3132,125567,51
3479,130401,*
...,...,...
3575155,6056816,302
3575561,6056832,47
3575778,6056840,41
3575963,6118806,21


In [23]:
# Filter Subgroup ID == 111
non_disadvantaged_students = all_grades[all_grades['Subgroup ID'] == 111]

# Filter test_id == 1 language arts/literature
non_disadvantaged_students = non_disadvantaged_students[non_disadvantaged_students['Test Id'] == 1]

# drop columns with redundant information
non_disadvantaged_students = non_disadvantaged_students.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
non_disadvantaged_students = non_disadvantaged_students.rename({'CAASPP Reported Enrollment': 'Non Disadvantaged'}, axis=1)
non_disadvantaged_students

Unnamed: 0,School Code,Non Disadvantaged
1948,112607,21
2254,123968,83
2699,124172,213
3147,125567,150
3494,130401,*
...,...,...
3575173,6056816,306
3575576,6056832,99
3575791,6056840,33
3575977,6118806,36


In [24]:
# Filter Subgroup ID == 6 
english_fluency = all_grades[all_grades['Subgroup ID'] == 6]

# Filter test_id == 1 language arts/literature
english_fluency = english_fluency[english_fluency['Test Id'] == 1]

# drop columns with redundant information
english_fluency = english_fluency.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
english_fluency = english_fluency.rename({'CAASPP Reported Enrollment': 'English Fluency'}, axis=1)

english_fluency

Unnamed: 0,School Code,English Fluency
1930,112607,83
2237,123968,86
2684,124172,236
3129,125567,165
3477,130401,*
...,...,...
3575151,6056816,570
3575558,6056832,144
3575775,6056840,61
3575960,6118806,54


In [25]:
# Filter Subgroup ID == 160 
esl_students = all_grades[all_grades['Subgroup ID'] == 160]

# Filter test_id == 1 language arts/literature
esl_students = esl_students[esl_students['Test Id'] == 1]

# drop columns with redundant information
esl_students = esl_students.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
esl_students = esl_students.rename({'CAASPP Reported Enrollment': 'English Learner'}, axis=1)

esl_students

Unnamed: 0,School Code,English Learner
1953,112607,7
2260,123968,52
2704,124172,*
3152,125567,35
3497,130401,*
...,...,...
3575178,6056816,38
3575581,6056832,*
3575795,6056840,13
3575981,6118806,*


## Ethnicities:

In [26]:
# Filter Subgroup ID == 74
black_students = all_grades[all_grades['Subgroup ID'] == 74]

# Filter test_id == 1 language arts/literature
black_students = black_students[black_students['Test Id'] == 1]

# drop columns with redundant information
black_students = black_students.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
black_students = black_students.rename({'CAASPP Reported Enrollment': 'Black'}, axis=1)

black_students

Unnamed: 0,School Code,Black
1936,112607,34
2242,123968,26
2690,124172,7
3136,125567,44
3483,130401,*
...,...,...
3574135,114652,*
3575160,6056816,28
3575565,6056832,13
3575967,6118806,*


In [27]:
# Filter Subgroup ID == 75
native_american = all_grades[all_grades['Subgroup ID'] == 75]

# Filter test_id == 1 language arts/literature
native_american = native_american[native_american['Test Id'] == 1]

# drop columns with redundant information
native_american = native_american.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
native_american = native_american.rename({'CAASPP Reported Enrollment': 'Native American'}, axis=1)

native_american

Unnamed: 0,School Code,Native American
1937,112607,*
2243,123968,*
3691,130419,*
4338,136101,*
5404,6002000,*
...,...,...
3573804,112623,4
3574136,114652,*
3575161,6056816,6
3575566,6056832,*


In [28]:
# Filter Subgroup ID == 76
asian = all_grades[all_grades['Subgroup ID'] == 76]

# Filter test_id == 1 language arts/literature
asian = asian[asian['Test Id'] == 1]

# drop columns with redundant information
asian = asian.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
asian = asian.rename({'CAASPP Reported Enrollment': 'Asian'}, axis=1)

asian

Unnamed: 0,School Code,Asian
1938,112607,*
2244,123968,10
2691,124172,111
3137,125567,14
3925,131581,*
...,...,...
3573805,112623,24
3574137,114652,11
3575162,6056816,15
3575782,6056840,*


In [29]:
# Filter Subgroup ID == 78
hispanic = all_grades[all_grades['Subgroup ID'] == 78]

# Filter test_id == 1 language arts/literature
hispanic = hispanic[hispanic['Test Id'] == 1]

# drop columns with redundant information
hispanic = hispanic.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
hispanic = hispanic.rename({'CAASPP Reported Enrollment': 'Hispanic'}, axis=1)

hispanic

Unnamed: 0,School Code,Hispanic
1939,112607,46
2246,123968,82
2692,124172,9
3139,125567,50
3485,130401,*
...,...,...
3575164,6056816,166
3575568,6056832,26
3575783,6056840,27
3575969,6118806,15


In [30]:
# Filter Subgroup ID == 79
pacific_isl = all_grades[all_grades['Subgroup ID'] == 79]

# Filter test_id == 1 language arts/literature
pacific_isl = pacific_isl[pacific_isl['Test Id'] == 1]

# drop columns with redundant information
pacific_isl = pacific_isl.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
pacific_isl = pacific_isl.rename({'CAASPP Reported Enrollment': 'Pacific Islander'}, axis=1)

pacific_isl

Unnamed: 0,School Code,Pacific Islander
1940,112607,*
3486,130401,*
4342,136101,*
5026,6001788,7
6467,131755,*
...,...,...
3573808,112623,*
3575165,6056816,5
3575569,6056832,*
3575970,6118806,*


In [31]:
# Filter Subgroup ID == 80
white = all_grades[all_grades['Subgroup ID'] == 80]

# Filter test_id == 1 language arts/literature
white = white[white['Test Id'] == 1]

# drop columns with redundant information
white = white.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
white = white.rename({'CAASPP Reported Enrollment': 'White'}, axis=1)

white

Unnamed: 0,School Code,White
1941,112607,4
2247,123968,5
2693,124172,18
3140,125567,64
3487,130401,*
...,...,...
3575166,6056816,314
3575570,6056832,83
3575784,6056840,40
3575971,6118806,29


In [32]:
# Filter Subgroup ID == 144
two_or_more = all_grades[all_grades['Subgroup ID'] == 144]

# Filter test_id == 1 language arts/literature
two_or_more = two_or_more[two_or_more['Test Id'] == 1]

# drop columns with redundant information
two_or_more = two_or_more.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
two_or_more = two_or_more.rename({'CAASPP Reported Enrollment': 'Two/More Races'}, axis=1)

two_or_more

Unnamed: 0,School Code,Two/More Races
1952,112607,*
2259,123968,4
2703,124172,94
3151,125567,27
3939,131581,*
...,...,...
3575177,6056816,61
3575580,6056832,14
3575794,6056840,4
3575980,6118806,8


## Parents' education:

In [33]:
# Filter Subgroup ID == 91
PE_high_school = all_grades[all_grades['Subgroup ID'] == 91]

# Filter test_id == 1 language arts/literature
PE_high_school = PE_high_school[PE_high_school['Test Id'] == 1]

# drop columns with redundant information
PE_high_school = PE_high_school.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
PE_high_school = PE_high_school.rename({'CAASPP Reported Enrollment': 'HS graduate'}, axis=1)

PE_high_school

Unnamed: 0,School Code,HS graduate
1943,112607,20
2249,123968,39
2694,124172,*
3142,125567,17
3489,130401,*
...,...,...
3575168,6056816,97
3575571,6056832,10
3575786,6056840,15
3575972,6118806,*


In [34]:
# Filter Subgroup ID == 90
non_high_school = all_grades[all_grades['Subgroup ID'] == 90]

# Filter test_id == 1 language arts/literature
non_high_school = non_high_school[non_high_school['Test Id'] == 1]

# drop columns with redundant information
non_high_school = non_high_school.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
non_high_school = non_high_school.rename({'CAASPP Reported Enrollment': 'Non HS graduate'}, axis=1)

non_high_school

Unnamed: 0,School Code,Non HS graduate
1942,112607,23
2248,123968,28
3141,125567,*
3488,130401,*
3695,130419,*
...,...,...
3573810,112623,10
3574141,114652,6
3575167,6056816,32
3575785,6056840,10


In [35]:
# Filter Subgroup ID == 92
some_college = all_grades[all_grades['Subgroup ID'] == 92]

# Filter test_id == 1 language arts/literature
some_college = some_college[some_college['Test Id'] == 1]

# drop columns with redundant information
some_college = some_college.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
some_college = some_college.rename({'CAASPP Reported Enrollment': 'Some College'}, axis=1)

some_college

Unnamed: 0,School Code,Some College
1944,112607,19
2250,123968,34
2695,124172,7
3143,125567,39
3490,130401,*
...,...,...
3575169,6056816,255
3575572,6056832,64
3575787,6056840,30
3575973,6118806,18


In [36]:
# Filter Subgroup ID == 93
college_grad = all_grades[all_grades['Subgroup ID'] == 93]

# Filter test_id == 1 language arts/literature
college_grad = college_grad[college_grad['Test Id'] == 1]

# drop columns with redundant information
college_grad = college_grad.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
college_grad = college_grad.rename({'CAASPP Reported Enrollment': 'College Graduate'}, axis=1)

college_grad

Unnamed: 0,School Code,College Graduate
1945,112607,11
2251,123968,14
2696,124172,58
3144,125567,59
3491,130401,*
...,...,...
3575170,6056816,156
3575573,6056832,44
3575788,6056840,13
3575974,6118806,23


In [37]:
# Filter Subgroup ID == 94
graduate_school = all_grades[all_grades['Subgroup ID'] == 94]

# Filter test_id == 1 language arts/literature
graduate_school = graduate_school[graduate_school['Test Id'] == 1]

# drop columns with redundant information
graduate_school = graduate_school.drop(columns = ['Test Id', 'Subgroup ID'])

# rename column to reflect demographic information
graduate_school = graduate_school.rename({'CAASPP Reported Enrollment': 'Graduate School'}, axis=1)

graduate_school

Unnamed: 0,School Code,Graduate School
1946,112607,*
2252,123968,13
2697,124172,164
3145,125567,76
3492,130401,*
...,...,...
3575171,6056816,57
3575574,6056832,26
3575789,6056840,6
3575975,6118806,15


## Merge demographics into one dataset:

In [38]:
# Merge each filtered demographic table into the main df on School Code
demographic_dfs = [
    male_students, female_students, military_students, non_military_students,
    homeless_students, non_homeless_students, disadvantaged_students,
    non_disadvantaged_students, english_fluency, esl_students, black_students,
    native_american, asian, hispanic, white, pacific_isl, two_or_more,
    PE_high_school, non_high_school, some_college, college_grad, graduate_school,
]
for demographic in demographic_dfs:
    df_language = df_language.merge(demographic, how='left', on='School Code')

In [39]:
df_language

Unnamed: 0,School Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,Percentage Standard Not Met,Students with Scores,...,Asian,Hispanic,White,Pacific Islander,Two/More Races,HS graduate,Non HS graduate,Some College,College Graduate,Graduate School
0,112607,1,90,,7.14,27.38,34.52,36.90,28.57,84,...,*,46,4,*,*,20,23,19,11,*
1,123968,1,142,,7.09,18.11,25.20,29.92,44.88,127,...,10,82,5,,4,39,28,34,14,13
2,124172,1,239,,71.12,22.41,93.53,4.31,2.16,232,...,111,9,18,,94,*,,7,58,164
3,125567,1,201,,28.13,17.71,45.83,18.75,35.42,192,...,14,50,64,,27,17,*,39,59,76
4,130401,1,*,,*,*,*,*,*,*,...,,*,*,*,,*,*,*,*,*
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10294,6056816,1,608,,13.95,37.41,51.36,24.32,24.32,588,...,15,166,314,5,61,97,32,255,156,57
10295,6056832,1,146,,25.76,25.76,51.52,27.27,21.21,132,...,,26,83,*,14,10,,64,44,26
10296,6056840,1,74,,31.08,22.97,54.05,18.92,27.03,74,...,*,27,40,,4,15,10,30,13,6
10297,6118806,1,57,,23.53,41.18,64.71,21.57,13.73,51,...,,15,29,*,8,*,,18,23,15
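
Many cells in the merged table hold `*` rather than a number, which keeps those columns as text. A minimal sketch of one way to handle this is shown below, assuming `*` marks a value withheld in the source file (an assumption, not something the notebook states). The toy frame is illustrative and only uses a few of the real column names.

```python
import pandas as pd

# Toy frame mimicking the mix of numbers, '*' and missing values seen above
toy = pd.DataFrame({
    'School Code': [112607, 123968, 130401],
    'CAASPP Reported Enrollment': ['90', '142', '*'],
    'Percentage Standard Met': ['27.38', '18.11', '*'],
    'Asian': ['*', '10', None],
})

value_cols = toy.columns.drop('School Code')

# Remember which cells were '*' before they are turned into NaN
suppressed = toy[value_cols].eq('*')

# Convert to numeric; anything that is not a number (including '*') becomes NaN
cleaned = toy.copy()
cleaned[value_cols] = cleaned[value_cols].apply(pd.to_numeric, errors='coerce')

print(cleaned.dtypes)
print(suppressed.sum())
```

Keeping the `suppressed` mask separate makes it possible to tell withheld values apart from values that were missing after the left merges.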


---------

### Entities Data

- It contains information such as school and district name, as well as zip code and relevant codes that will allow merging with the assessment data. 
- It comes from the California Assessment of Student Performance and Progress.

The row counts in the dataset roughly match published figures for California:

- There are ~ 1,040 school districts in California. 
    - The entities_dist dataset contains 1,087 rows.
- There are ~ 10,588 schools in California. 
    - The df_entities dataset contains 10,300 rows.

In [40]:
df_entities = pd.read_csv('data/sb_ca2019entities_csv.txt')

In [41]:
# create dataset containing entities data at district level
entities_dist = df_entities[df_entities['School Code'] == 0]

# create dataset containing entities data at school level (drop district-level rows)
df_entities = df_entities[df_entities['School Code'] != 0]

# drop columns with redundant information or not of use 
df_entities = df_entities.drop(columns = ['Filler', 'Test Year', 'Type Id', 'County Code', 'District Code', 'District Name', 'County Name'])
df_entities

Unnamed: 0,School Code,School Name,Zip Code
0,114686,Ocean Air,92130
1,6038111,Del Mar Heights Elementary,92014
2,6088983,Del Mar Hills Elementary,92014
3,6110696,Carmel Del Mar Elementary,92130
4,6115620,Ashley Falls Elementary,92130
...,...,...,...
11383,136747,California Academy of Sports Science,91764
11384,138313,University Prep,91764
11385,6038095,Dehesa Elementary,92019
11386,6119564,Dehesa Charter,92026


--------

## Merge df_language with df_entities

This merge adds school name and zip code to the main df (`Test Year` was dropped from df_entities above, so it is not carried over).

In [42]:
# merge dfs on school code
df_language_merge = df_entities.merge(df_language, how='left', on='School Code')
df_language_merge

Unnamed: 0,School Code,School Name,Zip Code,Subgroup ID,CAASPP Reported Enrollment,Mean Scale Score,Percentage Standard Exceeded,Percentage Standard Met,Percentage Standard Met and Above,Percentage Standard Nearly Met,...,Asian,Hispanic,White,Pacific Islander,Two/More Races,HS graduate,Non HS graduate,Some College,College Graduate,Graduate School
0,114686,Ocean Air,92130,1.0,420,,71.19,19.85,91.04,5.81,...,175,26,188,,24,*,,4,85,326
1,6038111,Del Mar Heights Elementary,92014,1.0,275,,69.23,21.61,90.84,5.49,...,19,23,203,*,25,*,,9,85,178
2,6088983,Del Mar Hills Elementary,92014,1.0,169,,61.35,25.15,86.50,6.75,...,13,33,105,,17,4,*,13,47,99
3,6110696,Carmel Del Mar Elementary,92130,1.0,299,,68.03,20.75,88.78,8.50,...,91,34,148,,21,*,*,10,73,203
4,6115620,Ashley Falls Elementary,92130,1.0,331,,51.85,29.01,80.86,11.42,...,100,29,171,*,15,*,,18,63,241
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10295,136747,California Academy of Sports Science,91764,1.0,22,,*,*,*,*,...,,*,*,,,*,*,*,*,*
10296,138313,University Prep,91764,1.0,92,,17.78,28.89,46.67,24.44,...,*,32,43,,*,15,*,18,25,*
10297,6038095,Dehesa Elementary,92019,1.0,95,,6.52,25.00,31.52,32.61,...,,26,37,,18,8,8,37,14,10
10298,6119564,Dehesa Charter,92026,1.0,108,,16.13,30.11,46.24,25.81,...,*,27,67,,6,6,,30,39,25
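
Because this is a left merge from `df_entities`, any school without a matching assessment row is kept with NaN in the assessment columns. The sketch below shows one way to surface such rows using the `indicator` and `validate` options of `DataFrame.merge`. The toy frames and the school code `999999` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for df_entities and df_language; 999999 is a made-up code
entities_toy = pd.DataFrame({
    'School Code': [114686, 6038111, 999999],
    'School Name': ['Ocean Air', 'Del Mar Heights Elementary', 'Hypothetical School'],
})
language_toy = pd.DataFrame({
    'School Code': [114686, 6038111],
    'CAASPP Reported Enrollment': ['420', '275'],
})

# validate raises if School Code is duplicated on either side;
# indicator adds a '_merge' column showing where each row came from
checked = entities_toy.merge(
    language_toy, how='left', on='School Code',
    validate='one_to_one', indicator=True,
)

# Schools present in the entities data but absent from the assessment data
unmatched = checked.loc[checked['_merge'] == 'left_only', 'School Name']
print(unmatched.tolist())
```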


In [43]:
df_language_merge.columns

Index(['School Code', 'School Name', 'Zip Code', 'Subgroup ID',
       'CAASPP Reported Enrollment', 'Mean Scale Score',
       'Percentage Standard Exceeded', 'Percentage Standard Met',
       'Percentage Standard Met and Above', 'Percentage Standard Nearly Met',
       'Percentage Standard Not Met', 'Students with Scores', 'Male', 'Female',
       'Military', 'Non Military', 'Homeless', 'Non Homeless', 'Disadvantaged',
       'Non Disadvantaged', 'English Fluency', 'English Learner', 'Black',
       'Native American', 'Asian', 'Hispanic', 'White', 'Pacific Islander',
       'Two/More Races', 'HS graduate', 'Non HS graduate', 'Some College',
       'College Graduate', 'Graduate School'],
      dtype='object')

In [44]:
df_language_merge.isnull().sum()

School Code                              0
School Name                              0
Zip Code                                 0
Subgroup ID                              1
CAASPP Reported Enrollment               1
Mean Scale Score                     10300
Percentage Standard Exceeded             1
Percentage Standard Met                  1
Percentage Standard Met and Above        1
Percentage Standard Nearly Met           1
Percentage Standard Not Met              1
Students with Scores                     1
Male                                    50
Female                                 246
Military                              7739
Non Military                             1
Homeless                              2844
Non Homeless                             5
Disadvantaged                          147
Non Disadvantaged                      193
English Fluency                         23
English Learner                        786
Black                                 1891
Native Amer

----------

### Expenses Data

- It contains the current cost of education for school districts in California.
- The dataset contains variables such as each district's current expense, average daily attendance (ADA), and current expense per ADA for the academic year 2018-2019.

In [45]:
df_expenses = pd.read_excel('data/currentexpense1819.xlsx')

In [46]:
# drop the first nine rows, which precede the header row in the spreadsheet
df_expenses = df_expenses.iloc[9:]

In [47]:
new_header = df_expenses.iloc[0] #grab the first row for the header
df_expenses = df_expenses[1:] #take the data less the header row
df_expenses.columns = new_header #set the header row as the df header

In [48]:
df_expenses

9,CO Code,District Code,District,EDP 365,Current\nExpense ADA,Current\nExpense Per ADA,LEA Type
10,01,61119,Alameda Unified,117225882.5,8968.85,13070.335941,Unified
11,01,61127,Albany City Unified,46611059.59,3544.52,13150.175366,Unified
12,01,61143,Berkeley Unified,159457818.49,9356.44,17042.573724,Unified
13,01,61150,Castro Valley Unified,102239937.34,8940.2,11435.978763,Unified
14,01,61168,Emery Unified,12504023.21,681.82,18339.185137,Unified
...,...,...,...,...,...,...,...
944,58,72728,Camptonville Elementary,776334.87,44.68,17375.444718,Elementary
945,58,72736,Marysville Joint Unified,107389549.48,9072.18,11837.23752,Unified
946,58,72744,Plumas Lake Elementary,12851169.64,1283.04,10016.187835,Elementary
947,58,72751,Wheatland Elementary,15925495.9,1236.92,12875.121997,Elementary
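
After the header row is promoted, two column names still contain a line break (`Current\nExpense ADA`) and the values may be stored as generic objects. The following is a minimal sketch of a cleanup on a toy frame built from the first two rows shown above. It also recomputes expense per ADA, assuming that column is `EDP 365` divided by `Current Expense ADA`, which the displayed rows appear to support.

```python
import pandas as pd

# Toy frame: header row followed by two data rows copied from the output above
raw = pd.DataFrame([
    ['CO Code', 'District Code', 'District', 'EDP 365',
     'Current\nExpense ADA', 'Current\nExpense Per ADA', 'LEA Type'],
    ['01', '61119', 'Alameda Unified', 117225882.5, 8968.85, 13070.335941, 'Unified'],
    ['01', '61127', 'Albany City Unified', 46611059.59, 3544.52, 13150.175366, 'Unified'],
])

# Promote the first row to the header and replace line breaks with spaces
expenses = raw.iloc[1:].copy()
expenses.columns = raw.iloc[0].str.replace('\n', ' ', regex=False)
expenses.columns.name = None
expenses = expenses.reset_index(drop=True)

# Convert the money and attendance columns from object to numeric
num_cols = ['EDP 365', 'Current Expense ADA', 'Current Expense Per ADA']
expenses[num_cols] = expenses[num_cols].apply(pd.to_numeric)

# Sanity check: expense per ADA recomputed from its two components
recomputed = expenses['EDP 365'] / expenses['Current Expense ADA']
print((recomputed - expenses['Current Expense Per ADA']).abs().max())
```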


---------

### Enrollment Dataset, Full-Time Equivalent Teacher, and Pupil/Teacher Ratio
- It contains total enrollment, full-time equivalent (FTE) teachers, and pupil/teacher ratio per school for the academic year 2018-2019 in California.
- Data comes from the National Center for Education Statistics.

In [49]:
# load datafile
df_enrollment = pd.read_csv('data/ELSI_total_enrollment_.csv')
df_enrollment

Unnamed: 0,School Name,State Name [Public School] Latest available year,School ID - NCES Assigned [Public School] Latest available year,Agency ID - NCES Assigned [Public School] Latest available year,Total Students All Grades (Excludes AE) [Public School] 2018-19,Full-Time Equivalent (FTE) Teachers [Public School] 2018-19,Pupil/Teacher Ratio [Public School] 2018-19
0,21ST CENTURY LEARNING INSTITUTE,CALIFORNIA,"=""060429013779""","=""0604290""",88,3.60,24.44
1,A PLACE TO GROW,California,"=""062827013394""","=""0628270""",†,–,–
2,A. E. ARNOLD ELEMENTARY,California,"=""061044001166""","=""0610440""",739,27.00,27.37
3,A. G. COOK ELEMENTARY,California,"=""061488001834""","=""0614880""",366,16.00,22.88
4,A. G. CURRIE MIDDLE,California,"=""064015006636""","=""0640150""",611,25.30,24.15
...,...,...,...,...,...,...,...
10436,ZUPANIC HIGH,California,"=""063237010019""","=""0632370""",70,3.00,23.33
10437,Data Source: U.S. Department of Education Nati...,,,,,,
10438,† indicates that the data are not applicable.,,,,,,
10439,– indicates that the data are missing.,,,,,,
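
The export shown above carries a few quirks: footnote rows at the bottom, IDs wrapped as `="..."`, and the symbols `†` and `–` in place of numbers. One possible cleanup is sketched below on a toy frame. The shortened column names (`School ID`, `Total Students`, `Pupil/Teacher Ratio`) are hypothetical stand-ins for the long headers in the file.

```python
import pandas as pd

# Toy frame reproducing the quirks of the export (shortened column names)
toy = pd.DataFrame({
    'School Name': ['21ST CENTURY LEARNING INSTITUTE', 'A PLACE TO GROW',
                    '† indicates that the data are not applicable.'],
    'School ID': ['="060429013779"', '="062827013394"', None],
    'Total Students': ['88', '†', None],
    'Pupil/Teacher Ratio': ['24.44', '–', None],
})

# Footnote rows have no school ID, so dropping rows without one removes them
cleaned = toy.dropna(subset=['School ID']).copy()

# Strip the ="..." wrapper but keep the ID as text to preserve leading zeros
cleaned['School ID'] = cleaned['School ID'].str.replace(r'^="|"$', '', regex=True)

# '†' (not applicable) and '–' (missing) both become NaN
for col in ['Total Students', 'Pupil/Teacher Ratio']:
    cleaned[col] = pd.to_numeric(cleaned[col], errors='coerce')

print(cleaned)
```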


---------

### Total Revenue

- It contains total revenue per school district in California. Note that the column headers label the finance figures as 2016-17, not 2018-2019.
- Revenue comes from local, state and federal sources.

In [50]:
df_revenue = pd.read_csv('data/ELSI_csv_export_revenue.csv')

In [51]:
df_revenue

Unnamed: 0,Agency Name,State Name [District] Latest available year,Fall Membership (V33) [District Finance] 2016-17,Total Revenue - Local Sources (TLOCREV) [District Finance] 2016-17,Total General Revenue (TOTALREV) [District Finance] 2016-17,Total Revenue - State Sources (TSTREV) [District Finance] 2016-17,Total Revenue - Federal Sources (TFEDREV) [District Finance] 2016-17,Total Current Expenditures - El-Sec Education (TCURELSC) [District Finance] 2016-17,Total Expenditures (TOTALEXP) [District Finance] 2016-17,Total Revenue (TOTALREV) per Pupil (V33) [District Finance] 2016-17,Total Revenue - Local Sources (TLOCREV) per Pupil (V33) [District Finance] 2016-17,Total Revenue - State Sources (TSTREV) per Pupil (V33) [District Finance] 2016-17,Total Revenue - Federal Sources (TFEDREV) per Pupil (V33) [District Finance] 2016-17,Agency ID - NCES Assigned [District] Latest available year
0,ABC UNIFIED,California,20768,52379000,258745000,190473000,15893000,227441000,252804000,12459,2522,9171,765,"=""0601620"""
1,ACALANES UNION HIGH,California,5530,75958000,91214000,13651000,1605000,75057000,84811000,16494,13736,2469,290,"=""0601650"""
2,ACKERMAN CHARTER,California,†,1938000,5686000,3525000,223000,4945000,6018000,†,†,†,†,"=""0601680"""
3,ACTON-AGUA DULCE UNIFIED,California,10016,11199000,35216000,22767000,1250000,28079000,34504000,3516,1118,2273,125,"=""0600001"""
4,ADELANTO ELEMENTARY,California,10288,11183000,127966000,105877000,10906000,111119000,115722000,12438,1087,10291,1060,"=""0601710"""
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1160,YUCAIPA-CALIMESA JOINT UNIFIED,California,9969,23778000,107263000,76380000,7105000,95746000,99066000,10760,2385,7662,713,"=""0643560"""
1161,Data Source: U.S. Department of Education Nati...,,,,,,,,,,,,,
1162,† indicates that the data are not applicable.,,,,,,,,,,,,,
1163,– indicates that the data are missing.,,,,,,,,,,,,,
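
As a quick check on how the per-pupil columns relate to the totals, the sketch below recomputes total revenue per pupil for three districts from the output above. It assumes the per-pupil figure is total revenue divided by fall membership, rounded to the nearest dollar. The short column names are hypothetical abbreviations of the long headers.

```python
import pandas as pd

# Values copied from the displayed rows; column names are shortened for readability
revenue_toy = pd.DataFrame({
    'Agency Name': ['ABC UNIFIED', 'ACALANES UNION HIGH', 'ADELANTO ELEMENTARY'],
    'Fall Membership': [20768, 5530, 10288],
    'Total Revenue': [258745000, 91214000, 127966000],
    'Reported Revenue per Pupil': [12459, 16494, 12438],
})

# Assumed relationship: revenue per pupil = total revenue / fall membership
revenue_toy['Computed Revenue per Pupil'] = (
    revenue_toy['Total Revenue'] / revenue_toy['Fall Membership']
).round()

matches = revenue_toy['Computed Revenue per Pupil'].eq(
    revenue_toy['Reported Revenue per Pupil']
)
print(revenue_toy[['Agency Name', 'Computed Revenue per Pupil']])
print(matches.all())
```

For these three rows the recomputed values agree with the reported column, so the same division could be used to fill per-pupil figures once the `†` entries are handled.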
