   # Education Project 


   <img src='data/education_image.jpg' width="900">
   
   **Credit:**  [wsimag](https://wsimag.com/culture/60264-education-in-venezuela-the-americas-and-the-world)



In [1]:
# Load relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
import warnings

sns.set(style='ticks')

warnings.filterwarnings("ignore")  # Suppress all warnings

# Introduction

## Business Context
Research shows that high-poverty areas disproportionately educate children of color. A student's chances of ending up in a high-poverty or high-minority school are strongly determined by race/ethnicity and social class. For instance, African American and Hispanic students, even if they are not poor, are much more likely than white or Asian students to be in high-poverty schools.

A growing body of evidence shows that increased investment in education yields better outcomes, and that the positive effects are even greater among low-income students. On the other hand, it costs more to provide low-income students with an education robust enough to overcome their initial disadvantages.


### Goals
1. Understand the current demographics of schools across California, from wealthy to high-poverty.
2. Identify how much funding is available per pupil in wealthy vs. high-poverty areas.
3. Discover the return on investment per student by race and socioeconomic status.
4. Determine the funding needed to provide a robust education to students of color and/or students in high-poverty areas.

#### Predictive modeling
Identify the per-child funding level that maximizes outcomes for each ethnic group.

# Data wrangling

Data wrangling is the process of transforming and mapping data from a "raw" form into a format that is more appropriate and valuable for downstream purposes such as analytics.

## Extracting and cleaning relevant data

Let's start looking at the datasets!

### Assessment Data

This dataset contains results of the Smarter Balanced Summative Assessment (2018-2019) for the state of California.

The file layout and code legends are documented here: https://caaspp-elpac.cde.ca.gov/caaspp/research_fixfileformat19

In [2]:
# loading datafile
df_all = pd.read_csv('data/sb_ca2019_all_csv_v4.txt')

# keep school-level rows only (rows with School Code 0 are not individual schools)
df_all = df_all[df_all['School Code'] != 0]

In [13]:
df_all

Unnamed: 0,County Code,District Code,School Code,Filler,Test Year,Subgroup ID,Test Type,Total Tested At Entity Level,Total Tested with Scores,Grade,...,Area 1 Percentage Below Standard,Area 2 Percentage Above Standard,Area 2 Percentage Near Standard,Area 2 Percentage Below Standard,Area 3 Percentage Above Standard,Area 3 Percentage Near Standard,Area 3 Percentage Below Standard,Area 4 Percentage Above Standard,Area 4 Percentage Near Standard,Area 4 Percentage Below Standard
1888,1,10017,112607,,2019,1,B,85,84,11,...,37.35,12.20,60.98,26.83,7.23,71.08,21.69,13.25,61.45,25.30
1889,1,10017,112607,,2019,3,B,42,42,11,...,35.71,14.29,50.00,35.71,7.14,76.19,16.67,9.52,64.29,26.19
1890,1,10017,112607,,2019,4,B,43,42,11,...,39.02,10.00,72.50,17.50,7.32,65.85,26.83,17.07,58.54,24.39
1891,1,10017,112607,,2019,6,B,79,78,11,...,33.77,13.16,63.16,23.68,7.79,71.43,20.78,14.29,63.64,22.08
1892,1,10017,112607,,2019,7,B,*,*,11,...,*,*,*,*,*,*,*,*,*,*
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3576486,58,72769,5838305,,2019,222,B,*,*,13,...,*,*,*,*,*,*,*,*,*,*
3576487,58,72769,5838305,,2019,223,B,*,*,13,...,*,*,*,*,*,*,*,*,*,*
3576488,58,72769,5838305,,2019,224,B,19,19,13,...,57.89,0.00,63.16,36.84,5.26,57.89,36.84,0.00,0.00,0.00
3576489,58,72769,5838305,,2019,226,B,54,54,13,...,55.56,16.67,44.44,38.89,14.81,57.41,27.78,0.00,0.00,0.00
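
Several rows in the output above hold `*` instead of a number, which means those columns are read as text rather than as numeric values. The sketch below shows one possible way to handle this, on a small hand-built sample that mirrors three of the displayed rows. It assumes that `*` marks a value that is not reported; the names `sample`, `score_cols`, and `cleaned` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hand-built sample mirroring three rows of the displayed output.
# Columns that contain '*' arrive as strings, not numbers.
sample = pd.DataFrame({
    'Subgroup ID': [1, 3, 7],
    'Total Tested with Scores': ['84', '42', '*'],
    'Area 1 Percentage Below Standard': ['37.35', '35.71', '*'],
})

score_cols = ['Total Tested with Scores', 'Area 1 Percentage Below Standard']

# errors='coerce' converts anything non-numeric (here '*') to NaN,
# so the columns become float and can be aggregated safely.
cleaned = sample.copy()
cleaned[score_cols] = cleaned[score_cols].apply(pd.to_numeric, errors='coerce')

print(cleaned.dtypes)
print(cleaned)
```

An alternative is to pass `na_values='*'` to `pd.read_csv` when loading the file, which treats `*` as missing at read time.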


In [4]:
df_all.columns

Index(['County Code', 'District Code', 'School Code', 'Filler', 'Test Year',
       'Subgroup ID', 'Test Type', 'Total Tested At Entity Level',
       'Total Tested with Scores', 'Grade', 'Test Id',
       'CAASPP Reported Enrollment', 'Students Tested', 'Mean Scale Score',
       'Percentage Standard Exceeded', 'Percentage Standard Met',
       'Percentage Standard Met and Above', 'Percentage Standard Nearly Met',
       'Percentage Standard Not Met', 'Students with Scores',
       'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
       'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
       'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
       'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
       'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
       'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard'],
      dtype='object')

---------

### Entities data

This dataset accompanies the assessment data: it maps each county, district, and school code to its name and zip code.

In [5]:
df_entities = pd.read_csv('data/sb_ca2019entities_csv.txt')
df_entities

Unnamed: 0,County Code,District Code,School Code,Filler,Test Year,Type Id,County Name,District Name,School Name,Zip Code
0,37,68056,114686,,2019,7,San Diego,Del Mar Union Elementary,Ocean Air,92130
1,37,68056,6038111,,2019,7,San Diego,Del Mar Union Elementary,Del Mar Heights Elementary,92014
2,37,68056,6088983,,2019,7,San Diego,Del Mar Union Elementary,Del Mar Hills Elementary,92014
3,37,68056,6110696,,2019,7,San Diego,Del Mar Union Elementary,Carmel Del Mar Elementary,92130
4,37,68056,6115620,,2019,7,San Diego,Del Mar Union Elementary,Ashley Falls Elementary,92130
...,...,...,...,...,...,...,...,...,...,...
11384,37,68049,138313,,2019,9,San Diego,University Prep,University Prep,91764
11385,37,68049,6038095,,2019,7,San Diego,Dehesa Elementary,Dehesa Elementary,92019
11386,37,68049,6119564,,2019,9,San Diego,Dehesa Charter,Dehesa Charter,92026
11387,37,68056,0,,2019,6,San Diego,Del Mar Union Elementary,,
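
Because both files share the `County Code`, `District Code`, and `School Code` columns, school names can be attached to assessment rows with a join on those three keys. The following is a minimal sketch on hand-built samples, not the notebook's actual merge step. The entity rows are copied from the output above, while the assessment row for Ocean Air is a placeholder created only so that one row matches.

```python
import pandas as pd

keys = ['County Code', 'District Code', 'School Code']

# Assessment sample: the first row is a placeholder for illustration,
# the second uses the key values of the first displayed assessment row.
assessment = pd.DataFrame({
    'County Code': [37, 1],
    'District Code': [68056, 10017],
    'School Code': [114686, 112607],
    'Subgroup ID': [1, 1],
})

# Entity sample copied from the displayed entities output.
entities = pd.DataFrame({
    'County Code': [37, 37],
    'District Code': [68056, 68056],
    'School Code': [114686, 6038111],
    'District Name': ['Del Mar Union Elementary', 'Del Mar Union Elementary'],
    'School Name': ['Ocean Air', 'Del Mar Heights Elementary'],
})

# Left join keeps every assessment row; validate raises an error if a
# key combination appears more than once in the entities table.
merged = assessment.merge(
    entities, on=keys, how='left', validate='many_to_one', indicator=True
)

print(merged[keys + ['School Name', '_merge']])
```

The `_merge` column flags assessment rows without a matching entity (`left_only`), which is a quick way to spot keys that fail to line up.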


----------

### Expenses Data

Current cost of education for school districts in California.
The dataset contains each school district's current expense per unit of average daily attendance (ADA) for the 2018-2019 academic year.

In [6]:
df_expenses = pd.read_excel('data/currentexpense1819.xlsx')

In [7]:
# drop the nine preamble rows that precede the header row
df_expenses = df_expenses.iloc[9:]

In [8]:
new_header = df_expenses.iloc[0] #grab the first row for the header
df_expenses = df_expenses[1:] #take the data less the header row
df_expenses.columns = new_header #set the header row as the df header

In [9]:
df_expenses

9,CO Code,District Code,District,EDP 365,Current\nExpense ADA,Current\nExpense Per ADA,LEA Type
10,01,61119,Alameda Unified,117225882.5,8968.85,13070.335941,Unified
11,01,61127,Albany City Unified,46611059.59,3544.52,13150.175366,Unified
12,01,61143,Berkeley Unified,159457818.49,9356.44,17042.573724,Unified
13,01,61150,Castro Valley Unified,102239937.34,8940.2,11435.978763,Unified
14,01,61168,Emery Unified,12504023.21,681.82,18339.185137,Unified
...,...,...,...,...,...,...,...
944,58,72728,Camptonville Elementary,776334.87,44.68,17375.444718,Elementary
945,58,72736,Marysville Joint Unified,107389549.48,9072.18,11837.23752,Unified
946,58,72744,Plumas Lake Elementary,12851169.64,1283.04,10016.187835,Elementary
947,58,72751,Wheatland Elementary,15925495.9,1236.92,12875.121997,Elementary
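
Two things stand out in the output above: some header names contain a line break (`Current\nExpense ADA`), and the county code is shown as `01`, whereas the assessment data shows it as `1`. The sketch below illustrates one possible cleanup on a hand-built sample of the first two displayed rows. It assumes the codes are stored as strings and the name `expenses` is illustrative. The displayed values also suggest that `Current Expense Per ADA` equals `EDP 365` divided by `Current Expense ADA`, and the last lines check that relationship on the sample.

```python
import pandas as pd

# Hand-built sample of the first two displayed rows (codes assumed to be strings).
expenses = pd.DataFrame({
    'CO Code': ['01', '01'],
    'District Code': ['61119', '61127'],
    'District': ['Alameda Unified', 'Albany City Unified'],
    'EDP 365': [117225882.5, 46611059.59],
    'Current\nExpense ADA': [8968.85, 3544.52],
    'Current\nExpense Per ADA': [13070.335941, 13150.175366],
    'LEA Type': ['Unified', 'Unified'],
})

# Replace the embedded line breaks in the header names with spaces.
expenses.columns = expenses.columns.str.replace('\n', ' ', regex=False)

# Convert the codes to integers so they compare equal to the
# integer codes shown in the assessment data.
expenses[['CO Code', 'District Code']] = (
    expenses[['CO Code', 'District Code']].astype(int)
)

# Make sure the numeric columns are numeric.
num_cols = ['EDP 365', 'Current Expense ADA', 'Current Expense Per ADA']
expenses[num_cols] = expenses[num_cols].apply(pd.to_numeric)

# Check: total expense divided by ADA should reproduce the per-ADA column.
recomputed = expenses['EDP 365'] / expenses['Current Expense ADA']
max_diff = (recomputed - expenses['Current Expense Per ADA']).abs().max()
print(max_diff)
```

A small `max_diff` on the full dataset would confirm that the per-ADA column can be recomputed from the other two.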


---------

### Enrollment Dataset
Total enrollment per district for grades 3-8 and grade 11 for the academic year 2018-2019 in California.

In [15]:
# loading datafile
#df_enrollment = pd.read_csv('data/ELSI_csv_export_enrollment.csv')
#df_enrollment

---------

### Total Revenue

Total revenue per school district in California for the academic year 2018-2019.

In [14]:
#df_revenue = pd.read_csv('data/ELSI_csv_export_revenue.csv')