   # Education Project


   <img src='data/education_image.jpg' width="900">
   
   **Credit:**  [wsimag](https://wsimag.com/culture/60264-education-in-venezuela-the-americas-and-the-world)



In [1]:
# Load relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
import warnings

sns.set(style='ticks')

warnings.filterwarnings("ignore")  # Suppress all warnings

# Introduction

## Business Context
Research shows that high-poverty areas disproportionately educate children of color. A student's chances of ending up in a high-poverty or high-minority school are strongly determined by race/ethnicity and social class. For instance, African American and Hispanic students, even if they are not poor, are much more likely than white or Asian students to be in high-poverty schools.

There is a growing body of evidence that increased investment in education yields better outcomes and that the positive effects are even greater among low-income students. On the other hand, it costs more to educate low-income students and provide them with a robust education capable of overcoming their initial disadvantages.


### Goals
1. Understand the current demographics of schools, from wealthy to high-poverty, across the state of California.
2. Identify how much funding is available per pupil in wealthy vs high-poverty areas.
3. Learn what factors are most correlated with student performance.


#### Predictive modeling
- What is the average test score per school?
- What percentage of students pass or do not pass?


# DATA WRANGLING

Data wrangling is the process of transforming and mapping data from a raw form into a format that is more appropriate and valuable for downstream purposes such as analytics.

- Extracting and cleaning relevant data. Let's start looking at the datasets!

### Assessment Data

- It contains assessment data for the Smarter Balanced Summative Assessments (2018-2019) for the state of California.

- Legend types can be found here: https://caaspp-elpac.cde.ca.gov/caaspp/research_fixfileformat19
- More information about the assessment setup: https://www.cde.ca.gov/ta/tg/ca/sbsummativefaq.asp

In [None]:
# load datafile
df_all = pd.read_csv('large_data/sb_ca2019_all_csv_v4.txt')

In [None]:
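# create dataset containing district level data:
# district rows have School Code 0 and a nonzero District Code
# (District Code 0 marks county- and state-level aggregates)
df_district = df_all[(df_all['School Code'] == 0) & (df_all['District Code'] != 0)]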

In [None]:
# create dataset containing school level data (School Code 0 marks aggregate rows)
df_school = df_all[df_all['School Code'] != 0]
df_school.head(10)

In [None]:
# check columns' names
df_school.columns

In [None]:
# check data type
df_school.info()

In [None]:
# Check for missing data
df_school.isnull().sum()
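
One caveat: `isnull()` only counts values that pandas parsed as missing. If the file marks suppressed results with a placeholder such as `*` (an assumption here; check it against the file-format legend linked above), those cells are read as text and are not counted. A minimal sketch on made-up data, with illustrative column names and values:

```python
import io
import pandas as pd

# Toy stand-in for the assessment file; '*' as a suppression marker is an assumption.
raw = io.StringIO(
    "School Code,Mean Scale Score\n"
    "101,2510.4\n"
    "102,*\n"
    "103,2488.1\n"
)

# Without na_values the '*' is kept as text, so isnull() reports nothing missing.
as_read = pd.read_csv(raw)
missing_before = int(as_read['Mean Scale Score'].isnull().sum())

# Declaring the placeholder at load time turns it into NaN and keeps the column numeric.
raw.seek(0)
cleaned = pd.read_csv(raw, na_values=['*'])
missing_after = int(cleaned['Mean Scale Score'].isnull().sum())

print(missing_before, missing_after)
```

If the real file does use such a marker, passing `na_values` to the original `pd.read_csv` call would make the missing-data check above reflect it.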

In [None]:
# Non-null count per column for rows where Subgroup ID == 1
df_school[df_school['Subgroup ID'] == 1].count()

In [None]:
# Check number of unique schools
df_school['School Code'].nunique()

- There are 10,300 unique schools!

### Reorganizing Subgroup ID 

The assessment dataset contains a lot of demographic information in the Subgroup ID column. We need to reorganize the dataset so that it has one variable per column and one observation per row. We also need to keep only the demographic information of interest.

#### Before merging:
- Filter variables of interest;
- Rearrange the data to have: 
    - one feature per column; 
    - one observation per row;

This dataset represents the Smarter Balanced Assessments (SB) for English Language Arts/Literacy and Mathematics, identified by Test Id 1 and 2 respectively. More info about the tests can be found here: https://www.caaspp.org/administration/about/testing/index.html

## Creating two datasets for modeling

- English Language Arts/Literacy: Test Id == 1
    - 10,299 rows
- Mathematics: Test Id == 2
    - 10,298 rows



In [None]:
# Keep rows with Grade == 13, the all-grades summary for each school
all_grades = df_school[df_school['Grade'] == 13]

# Keep rows with Subgroup ID == 1, the all-students summary
all_students = all_grades[all_grades['Subgroup ID'] == 1]

In [None]:
# Create df_test1: English Language Arts/Literacy
df_test1 = all_students[all_students['Test Id'] == 1]

In [None]:
# drop columns with redundant information
df_test1 = df_test1.drop(columns = ['Filler', 'Test Year', 'Test Type', 'County Code', 'District Code',
                             'Area 1 Percentage Above Standard', 'Area 1 Percentage Near Standard',
                             'Area 1 Percentage Below Standard', 'Area 2 Percentage Above Standard',
                             'Area 2 Percentage Near Standard', 'Area 2 Percentage Below Standard',
                             'Area 3 Percentage Above Standard', 'Area 3 Percentage Near Standard',
                             'Area 3 Percentage Below Standard', 'Area 4 Percentage Above Standard',
                             'Area 4 Percentage Near Standard', 'Area 4 Percentage Below Standard'])

In [None]:
df_test1

In [None]:
# Create df_test2 mathematics
df_test2 = all_students[all_students['Test Id'] == 2]
df_test2

#### Subgroup ID 

In the legend below, Demographic Id and Demographic Id Num correspond to the Subgroup ID column in the dataset.

In [None]:
legend = pd.read_csv('data/Subgroups.txt')
legend

In [None]:
# filter demographic of interest from Subgroup ID
subgroup_id = [1, 3, 4, 50, 51, 52, 53, 90, 91, 92, 93, 94, 220, 221, 222, 223, 
               224, 225, 226, 227, 200, 201, 202, 203, 204, 205, 206, 207]

df_school_id = df_school[df_school['Subgroup ID'].isin(subgroup_id)]

## Next: 
1. Transform the demographic information stored as rows under Subgroup ID into one variable per column.
2. Repeat for the remaining demographic categories in the Subgroup ID column.
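
The reshaping step can be sketched with `DataFrame.pivot`. This is only an illustration on made-up data, not the final transformation: the value column (`Students Tested`) and the generated column names are placeholders chosen for the example.

```python
import pandas as pd

# Toy long-format data: one row per (school, subgroup). Values are made up.
long_df = pd.DataFrame({
    'School Code': [101, 101, 102, 102],
    'Subgroup ID': [3, 4, 3, 4],
    'Students Tested': [120, 130, 80, 95],
})

# One row per school, one column per subgroup.
# pivot() raises if a (school, subgroup) pair appears more than once.
wide_df = long_df.pivot(
    index='School Code',
    columns='Subgroup ID',
    values='Students Tested',
)

# Give the new columns readable names, e.g. 'tested_subgroup_3'.
wide_df.columns = [f'tested_subgroup_{sid}' for sid in wide_df.columns]
wide_df = wide_df.reset_index()
print(wide_df)
```

On the real data, the frame would first need to be narrowed to a single Test Id and Grade so that each school and subgroup pair is unique.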

---------

### Entities Data

- It contains information such as school and district name, as well as zip code and relevant codes that will allow merging with the assessment data. 
- It comes from the California Assessment of Student Performance and Progress.

The number of rows in each dataset is close to current figures for the state of California:

- There are ~ 1,040 school districts in California. 
    - The entities_dist dataset contains 1,087 rows.
- There are ~ 10,588 schools in California. 
    - The df_entities dataset contains 10,300 rows.

In [None]:
df_entities = pd.read_csv('data/sb_ca2019entities_csv.txt')

In [None]:
# create dataset containing entities data at district level
entities_dist = df_entities[df_entities['School Code'] == 0]


In [None]:
# create dataset containing entities data at school level 
df_entities = df_entities[df_entities['School Code'] != 0]  # drop district level data

In [None]:
df_entities['County Name'].unique()

In [None]:
# drop columns with redundant information or not of use 
df_entities = df_entities.drop(columns = ['Filler', 'Type Id', 'County Code', 'District Code', 'District Name', 'County Name'])
df_entities

--------

## Merge df_school_id with df_entities

This merge adds school name, zip code, and test year to the main DataFrame.
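
Before running the merge on the full data, it can help to check that the key behaves as expected. The sketch below uses made-up frames standing in for df_entities and df_school_id; the `how='outer'` join is used here only to expose unmatched rows and is not necessarily the join the analysis should keep.

```python
import pandas as pd

# Toy stand-ins for df_entities and df_school_id; all values are made up.
entities = pd.DataFrame({
    'School Code': [101, 102, 103],
    'School Name': ['Alpha Elementary', 'Beta Middle', 'Gamma High'],
    'Zip Code': ['90001', '94102', '95814'],
})
scores = pd.DataFrame({
    'School Code': [101, 101, 102, 104],
    'Subgroup ID': [1, 3, 1, 1],
})

# validate='one_to_many' raises if a school appears twice in `entities`;
# indicator=True adds a '_merge' column showing where each row came from.
merged = pd.merge(
    entities, scores,
    on='School Code', how='outer',
    validate='one_to_many', indicator=True,
)

# Rows that failed to match on either side.
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['School Code', '_merge']])
```

An empty `unmatched` frame on the real data would suggest that every school code is present in both sources.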

In [None]:
# merge dfs on school code
#df_merge = pd.merge(df_entities, df_school_id, on='School Code')

#df_merge

In [None]:
#df_merge.isnull().sum()

----------

### Expenses Data

- It contains the current cost of education for school districts in California.
- The dataset contains variables such as school district expenses and average daily attendance cost for the academic year 2018-2019.

In [None]:
df_expenses = pd.read_excel('data/currentexpense1819.xlsx')

In [None]:
df_expenses = df_expenses.iloc[9:]  # drop the first nine rows, which precede the real header

In [None]:
new_header = df_expenses.iloc[0] #grab the first row for the header
df_expenses = df_expenses[1:] #take the data less the header row
df_expenses.columns = new_header #set the header row as the df header
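
The header-promotion steps above can be tried on a small frame first. This is a sketch on made-up data; the header position and column names are placeholders, not those of the expenses spreadsheet.

```python
import pandas as pd

# Toy frame mimicking a spreadsheet with two rows of notes above the real header.
raw = pd.DataFrame([
    ['Report title', None],
    ['Generated for illustration', None],
    ['District', 'Expense'],
    ['District A', 1000],
    ['District B', 2500],
])

header_pos = 2                           # position of the row holding the column names
tidy = raw.iloc[header_pos + 1:].copy()  # keep only the rows below the header
tidy.columns = raw.iloc[header_pos]      # promote that row to column names
tidy.columns.name = None                 # drop the leftover index label
tidy = tidy.reset_index(drop=True)       # renumber rows from 0

# Columns built from mixed rows are object dtype; convert numeric ones explicitly.
tidy['Expense'] = pd.to_numeric(tidy['Expense'])
print(tidy)
```

The explicit numeric conversion matters because a column that once held header text is not numeric until it is converted.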

In [None]:
df_expenses

---------

### Enrollment Dataset, Full-Time Equivalent Teacher, and Pupil/Teacher Ratio
- It contains total enrollment per school for the academic year 2018-2019 in California.
- Data comes from the National Center for Education Statistics.

In [None]:
# load datafile
df_enrollment = pd.read_csv('data/ELSI_total_enrollment_.csv')
df_enrollment

---------

### Total Revenue

- It contains total revenue per school district in California for the academic year 2018-2019.
- Revenue comes from local, state and federal sources.
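
For the per-pupil funding goal, revenue has to be combined with enrollment at the same level. A minimal sketch on made-up data follows. It assumes enrollment has already been aggregated to the district level and that missing revenue appears as a text placeholder; the column names (`Agency Name`, `Total Revenue`, `Total Students`) are hypothetical and may not match the real exports.

```python
import pandas as pd

# Toy stand-ins for district revenue and district enrollment; values are made up.
revenue = pd.DataFrame({
    'Agency Name': ['District A', 'District B', 'District C'],
    'Total Revenue': ['1200000', '950000', 'n/a'],  # 'n/a' stands for a missing value
})
enrollment = pd.DataFrame({
    'Agency Name': ['District A', 'District B', 'District C'],
    'Total Students': [100, 95, 40],
})

# validate='one_to_one' raises if a district is duplicated on either side.
funding = revenue.merge(enrollment, on='Agency Name', validate='one_to_one')

# Non-numeric placeholders become NaN instead of raising an error.
funding['Total Revenue'] = pd.to_numeric(funding['Total Revenue'], errors='coerce')
funding['Revenue per Pupil'] = funding['Total Revenue'] / funding['Total Students']
print(funding)
```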

In [None]:
df_revenue = pd.read_csv('data/ELSI_csv_export_revenue.csv')

In [None]:
df_revenue