# Education Project 

**In class lessons and modules**

**Author:** Alyssa Zukas
**Date Started:** October 16, 2025


**Objective:** This project addresses inequality of educational opportunity in U.S. high schools. We focus on the relationship between socioeconomic factors and average school performance on the ACT or SAT exams, which students take as part of the college application process.


## Domain Problem 

**Problem Statement:** We expect a range of school performance on these exams, but is school performance associated with socioeconomic factors? <br>

This is a broad question that we will make more precise as we consider how we want to answer it and what data are available. Additionally, each of us will personalize the question by adding a data set to the data provided to us by Dr. Fischer. <br>

## Analytic Approach 


**Diagnostic Approach**: We will use statistical analyses to test hypotheses and make inferences about relationships in the data.

^^ Make sure to not refer to the technical names in the final report. 
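
As a rough illustration of what "testing a hypothesis about a relationship" can look like in practice, the sketch below runs a permutation test on the correlation between two variables. The data are synthetic and made up for the example; the variable names, the effect size, and the choice of a permutation test are all assumptions, not the project's actual data or method.

```python
import numpy as np

# Synthetic, made-up data purely for illustration (not project results)
rng = np.random.default_rng(0)
n = 200
unemployment = rng.uniform(0.02, 0.25, size=n)
act = 24 - 20 * unemployment + rng.normal(0, 1.5, size=n)

# Observed Pearson correlation between the two variables
observed_r = np.corrcoef(unemployment, act)[0, 1]

# Null hypothesis: no association. Shuffling one variable breaks any
# real link, so the shuffled correlations show what chance alone produces.
n_perm = 2000
perm_r = np.empty(n_perm)
for i in range(n_perm):
    perm_r[i] = np.corrcoef(unemployment, rng.permutation(act))[0, 1]

# Two-sided p-value, with the +1 correction so it is never exactly zero
p_value = (np.sum(np.abs(perm_r) >= abs(observed_r)) + 1) / (n_perm + 1)
print(f"observed r = {observed_r:.2f}, permutation p-value = {p_value:.4f}")
```

A small p-value here would mean that a correlation this strong rarely arises when the two variables are unrelated; it says nothing about causation.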

## Data Collection

### Data Sources

This project utilizes three data sets:
- <span style="color: red;">EdGap_data.xlsx</span>
- <span style="color: red;">ccd_sch_029_1617_w_1a_11212017.csv</span>
- <span style="color: red;">EDGE_GEOCODE_PUBLICSCH_1617.xlsx</span>


The first, and primary, data set is EdGap_data.xlsx from **EdGap.org**. It pairs each School ID with the school's average ACT or SAT score and the values of the four socioeconomic variables we are accounting for.

The second data set is from the **National Center for Education Statistics** and contains basic identifying information about schools in 2016.

The third data set is also from the **National Center for Education Statistics** and gives the `locale code` for **public schools** in 2016. This is the geographic assignment for each school based on its longitude and latitude, which will be used to determine the school's geographic setting (city, suburb, town, rural).
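
NCES locale codes are two-digit values whose first digit identifies the broad setting (1 = city, 2 = suburb, 3 = town, 4 = rural). A minimal sketch of collapsing the codes into those four settings follows; the series name `locale_code` and the sample values are hypothetical, and the real column name in the EDGE file should be checked before use.

```python
import pandas as pd

# First digit of the two-digit NCES locale code -> broad geographic setting
LOCALE_GROUPS = {1: "city", 2: "suburb", 3: "town", 4: "rural"}

def locale_to_setting(code):
    """Map a locale code such as 11 or 42 to its broad setting, else None."""
    try:
        return LOCALE_GROUPS.get(int(code) // 10)
    except (TypeError, ValueError):
        # Non-numeric or missing codes cannot be classified
        return None

# Made-up sample codes, including one unusable value
codes = pd.Series([11, 23, 32, 43, "N"], name="locale_code")
settings = codes.map(locale_to_setting)
print(settings)
```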

#### EdGap Data

All socioeconomic data (household income, unemployment, adult educational attainment, and family structure) are from the **Census Bureau’s American Community Survey**.

- **EdGap.org** reports that the ACT and SAT score data come from each state’s department of education or some other public data release. The nature of the other public data release is not known, although the quality of the census data and the department of education data can be assumed to be reasonably high.

- **EdGap.org** does not indicate that it processed the data in any way. <br>
The data were assembled by the EdGap.org team, so there is always the possibility for human error. Given the public nature of the data, we would be able to consult the original data sources to check the quality of the data if we had any questions.

#### School Information Data

The school information data from the **National Center for Education Statistics (NCES)** consists of basic identifying information about schools and can be assumed to be of **reasonably high quality**. 

As with the **EdGap.org** data, the school information data are public, so we would be able to consult the original data sources to check the quality of the data if we had any questions.

### Data Dictionary

**School Info**
- NCES SCH School ID: National Center for Education Statistics school identification number 
- NCES EDGE School Urbanicity: National Center for Education Statistics school urbanicity locale codes.

**Socioeconomic Factors**
- CT Unemployment Rate: Census tract unemployment rate
- CT Pct Adults with College Degree: Census tract percentage of adults with a college degree
<br>
- CT Pct Children In Married Couple Family: Census tract percentage of children in a married couple family.
- CT Median Household Income: Census tract median household income in dollars
- School Pct Free and Reduced Lunch: Percentage of students at the school eligible for free or reduced price lunch.
<br>

**ACT and SAT Exam Scores**
- School ACT average (or equivalent if SAT score): Average ACT score for the school. If the average SAT score is reported by the school, it is converted to an equivalent ACT score. Scores range from 1 to 36, with 36 being the highest score.
<br>
<br>
Note that the attendance area for a school may contain multiple census tracts.
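
The long column names in the data dictionary are awkward to type in code, so one option is to rename them early and check the ACT averages against the stated 1 to 36 range at the same time. The sketch below assumes the file headers match the dictionary entries exactly, which may not hold; the short names, the `tidy_columns` helper, and the `act_in_range` flag are illustrative choices, and the sample frame is made up.

```python
import pandas as pd

# Long names copied from the data dictionary; short names are hypothetical
COLUMN_RENAMES = {
    "NCES SCH School ID": "id",
    "CT Unemployment Rate": "rate_unemployment",
    "CT Pct Adults with College Degree": "percent_college",
    "CT Pct Children In Married Couple Family": "percent_married",
    "CT Median Household Income": "median_income",
    "School Pct Free and Reduced Lunch": "percent_lunch",
    "School ACT average (or equivalent if SAT score)": "average_act",
}

def tidy_columns(df):
    """Rename known columns and flag ACT averages inside the 1-36 range."""
    out = df.rename(columns=COLUMN_RENAMES)
    if "average_act" in out.columns:
        # between() is inclusive; missing values also come out False
        out["act_in_range"] = out["average_act"].between(1, 36)
    return out

# Tiny made-up frame to show the behavior
example = pd.DataFrame({
    "NCES SCH School ID": [1, 2],
    "School ACT average (or equivalent if SAT score)": [21.5, -3.0],
})
tidied = tidy_columns(example)
print(tidied)
```

Columns that are not in the mapping pass through unchanged, so the same helper can be applied after additional data sets are merged in.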

## Data Processing 

### Import Libraries

```python
# Import pandas, numpy, and matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import seaborn - a data visualization library built on matplotlib
import seaborn as sns
# set the plotting style
sns.set_style("whitegrid")

# Model preprocessing
from sklearn.preprocessing import StandardScaler

# Modeling
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Model metrics and analysis
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from statsmodels.stats.anova import anova_lm
```

### Load the data

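```python
# The file path was left blank in the original cell.
# Assumption: the CSV listed under Data Sources sits in the working directory.
df = pd.read_csv("ccd_sch_029_1617_w_1a_11212017.csv")
```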
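
Because the three sources listed under Data Sources mix CSV and Excel formats, a small helper that picks the pandas reader from the file extension is one way to load them uniformly. This is a sketch only: `load_table` is a hypothetical helper, not part of pandas, and the file locations are left to the caller.

```python
from pathlib import Path

import pandas as pd

def load_table(path, **kwargs):
    """Read a CSV or Excel file into a DataFrame based on its extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path, **kwargs)
    if suffix in {".xlsx", ".xls"}:
        # read_excel needs an Excel engine such as openpyxl installed
        return pd.read_excel(path, **kwargs)
    raise ValueError(f"Unsupported file type: {suffix}")
```

For instance, `load_table("EdGap_data.xlsx")` would read the EdGap file, assuming it sits in the working directory; any extra keyword arguments are passed straight through to the underlying pandas reader.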

## Exploratory Data Analysis