# Educational Equity in New Jersey:
## Analyzing Relationships Between Resources and Outcomes Across School Districts

-----------------------------------------------------

### Christy Hernandez
#### CS 668 - Pace University - Fall 2024

---------------------------------------------------------------------------------------------------

As stated in my Project Overview Statement:

New Jersey is my home state – as well as home to a diverse population and a wide range of school districts, from affluent suburban areas to urban centers and rural communities. However, this diversity presents a challenge: significant disparities in educational resources, funding, and student outcomes persist across districts. This imbalance in resources has direct implications for student performance, educational opportunities, and overall equity within the state's educational system. Addressing these disparities, especially the most impactful ones, can lead to more targeted policy interventions and an overall improvement in educational outcomes across New Jersey.

In response to this opportunity, the goal of this project is to conduct a comprehensive analysis of educational equity across school districts in New Jersey, identifying relationships between resources and student outcomes. By examining factors such as funding, teacher-student ratios, and access to technology, the analysis will identify which variables have the most impact on student success. The project will evaluate and propose evidence-based interventions, like increased school funding or mentorship programs, aimed at closing the achievement gap. The final deliverable will be a machine learning model that predicts the potential impact of these interventions on educational outcomes, particularly in underserved communities. This model will allow us to highlight specific areas where inequities are most pronounced and propose targeted, data-driven solutions to address them. By leveraging pre-existing and publicly available datasets from the State of New Jersey, I aim to generate actionable insights that can guide state and local policymakers in promoting a more equitable distribution of resources and improving educational outcomes.

----------------------

Let's get started by importing pandas and loading the related dataset I found!

In [1]:
import pandas as pd

In [16]:
dataset = pd.read_excel('C:/Users/Christy Hernandez/OneDrive/Desktop/Pace Masters Work/CS Analytics Capstone Project/Source files/Database_DistrictStateDetail.xlsx', sheet_name = None)

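Luckily, this dataset has a lot of features that should be able to provide a good analysis overall. Because `sheet_name = None` loads every sheet of the workbook into a dictionary of DataFrames, each "feature" here is one sheet, and `len(dataset)` counts sheets rather than columns. Let's see just how many there are: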

In [18]:
print(f'There are {len(dataset)} features.')

There are 72 features.


Okay, and what are they?

In [12]:
print(dataset.keys())

dict_keys(['Important 2022-2023 Notes', 'Header and Contact', 'EnrollmentTrendsbyGrade', 'EnrollmentTrendsByStudentGroup', 'EnrollmentByRacialEthnicGroup', 'PreKAndK-FullDayHalfDay', 'EnrollmentTrendsFullSharedTime', 'EnrollmentByHomeLanguage', 'StudentGrowthTrends', 'StudentGrowth', 'StudentGrowthByGrade', 'StudentGrowthByPerformLevel', 'ELAMathPerformanceTrends', 'ELAParticipationPerformance', 'ELAPerformanceTrends', 'ELAPerformanceByGrade', 'MathParticipationPerformance', 'MathPerformanceTrends', 'MathPerformanceByGradeTest', 'ScienceAssessmentSummaryByGrade', 'ScienceAssessmentByGrade', 'AlternateAssessmentParticipatio', 'EnglishLangProgressToProficienc', 'NJGPA', 'EnglishLangParticipationPerform', 'NAEP', 'PSAT-SAT-ACTParticipation', 'PSAT-SAT-ACTPerformance', 'APIBCourseworkPartPerf', 'APIBDualEnrPartByStudentGrp', 'APIBCoursesOffered', 'CTE_SLEParticipation', 'CTEParticipationByStudentGroup', 'WorkbasedLearningByCareerClust', 'IndustryValuedCredentialsEarned', 'MathCoursePartici
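Since `dict_keys` prints everything on one line, a long list of sheet names is hard to scan. As a minimal sketch, one name per line with a running number is easier to read. The helper `format_sheet_names` and the `toy_dataset` dictionary below are hypothetical stand-ins with made-up values, not part of the real workbook.

```python
import pandas as pd

# Toy stand-in for the dict returned by pd.read_excel(..., sheet_name=None).
# The values are illustrative only; the real workbook is not loaded here.
toy_dataset = {
    "EnrollmentTrendsbyGrade": pd.DataFrame(
        {"DistrictCode": ["0010", "0110"], "GradePK": [135, 531]}
    ),
    "DropoutRateTrends": pd.DataFrame({"DistrictCode": ["0010"]}),
}


def format_sheet_names(sheets):
    """Return one numbered line per sheet name, in workbook order."""
    return [f"{i:>2}. {name}" for i, name in enumerate(sheets, start=1)]


for line in format_sheet_names(toy_dataset):
    print(line)
```

In the notebook itself, the same helper could be called as `format_sheet_names(dataset)`.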

Okay, well, 72 is a lot, and I was sure that not all of them would be very useful here. So, I went through the dataset manually and found 16 features that I think will assist with this project the most! Here is what they are and what they each contain:

1. 'EnrollmentTrendsbyGrade'
   - Total student enrollment for each district.

2. 'EnrollmentTrendsByStudentGroup'
   - Percentages of the district student population by gender, economic disadvantage, disability, English learner status, homelessness, foster care, military connection, and migrant status.

3. 'StudentGrowthTrends'
   - For each district, the median student growth percentile in Math and in English Language Arts (ELA), and whether it met the state standard ('Not Met', 'Met Standard', or 'Exceeds Standards').

4. 'PSAT-SAT-ACTParticipation'
   - Percentage of student participation in each standardized exam type for each district, and the state average participation rate for each exam.

5. 'APIBCourseworkPartPerf'
   - Percentage of students in each district enrolled in one or more AP or IB courses, percentage taking one or more AP or IB exams, and percentage in one or more dual enrollment courses, followed by the state averages for each.

6. '4YrGraduationCohortProfile'
   - After 4 years of enrollment, for each student group within each district, the percentage of students graduating, the percentage continuing, and the percentage not continuing, followed by the state averages for each.

7. 'DropoutRateTrends'
   - Percentage of students who drop out by district, followed by the state average.

8. 'PostsecondaryEnrRatesFall'
   - For the fall semester following high school graduation, for each student group within each district, the percentage range of students enrolled in post-secondary education, as well as the districtwide and statewide percentage ranges.

9. 'ChronicAbsenteeism'
   - For each student group in each district, the percentage of students who are chronically absent, as well as the districtwide percentage.

10. 'ViolenceVandalismHIBSubstanceOf'
    - For each district, incidents per 100 students enrolled.

11. 'DisciplinaryRemovals'
    - For each district, the percentages of students who were suspended, expelled, or arrested.

12. 'TeachersExperience'
    - For each district, the average number of years of total experience and the average number of years of experience in the current district, followed by the state averages for both.

13. 'AdministratorsExperience'
    - For each district, the average number of years of total experience and the average number of years of experience in the current district, followed by the state averages for both.

14. 'StudentToStaffRatios'
    - For each district, student-to-staff ratios for different staff positions (Teacher, Admin, Librarian, Nurse, Counselor, etc.), as well as a Teacher to Admin ratio.

15. 'TeachersAdminsLevelOfEducation'
    - For teachers and then administrators in each district, the percentage whose highest degree is a bachelor's, a master's, or a doctorate.

16. 'TeachersAdminsOneYearRetention'
    - For teachers and then administrators, the one-year retention percentage for each district, and then for the state.

Now that we know which features we want to use, let's create a separate dictionary that has only these features.

In [43]:
desired_data = {}

In [55]:
desired_sheets = ['EnrollmentTrendsbyGrade', 'EnrollmentTrendsByStudentGroup', 'StudentGrowthTrends', 'PSAT-SAT-ACTParticipation', 'APIBCourseworkPartPerf', '4YrGraduationCohortProfile', 'DropoutRateTrends', 'PostsecondaryEnrRatesFall', 'ChronicAbsenteeism', 'ViolenceVandalismHIBSubstanceOf', 'DisciplinaryRemovals', 'TeachersExperience', 'AdministratorsExperience', 'StudentToStaffRatios', 'TeachersAdminsLevelOfEducation', 'TeachersAdminsOneYearRetention']

In [56]:
desired_data = {sheet_name: dataset[sheet_name] for sheet_name in desired_sheets if sheet_name in dataset}
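One thing to watch: the `if sheet_name in dataset` filter silently skips any name that does not match a sheet exactly, including a difference in capitalization. As a small sketch of a sanity check, the hypothetical helper `split_requested_sheets` below reports which requested names were found and which were not. The two example lists are illustrative and are not read from the workbook.

```python
def split_requested_sheets(requested, available):
    """Split requested sheet names into those found and those missing."""
    found = [name for name in requested if name in available]
    missing = [name for name in requested if name not in available]
    return found, missing


# Hypothetical example: the second requested name differs only in capitalization.
available_names = ["TeachersExperience", "TeachersAdminsLevelOfEducation"]
requested_names = ["TeachersExperience", "TeachersAdminsLevelofEducation"]

found, missing = split_requested_sheets(requested_names, available_names)
print("Found:", found)
print("Missing:", missing)
```

In the notebook, calling `split_requested_sheets(desired_sheets, dataset)` and checking that the second list is empty would confirm that all 16 names matched.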

Great, now let's take a look at our modified dataset!

In [58]:
# Preview the first five rows of every selected sheet
for sheet_name, sheet_data in desired_data.items():
    print(f'In regards to the "{sheet_name}" sheet:\n')
    print(sheet_data.head(), "\n---------------------------------------\n")

In regards to the "EnrollmentTrendsbyGrade" sheet:

  CountyCode CountyName DistrictCode  \
0         01   Atlantic         0010   
1         01   Atlantic         0110   
2         01   Atlantic         0120   
3         01   Atlantic         0125   
4         01   Atlantic         0570   

                                       DistrictName  GradePK  GradeKG  \
0                   Absecon Public Schools District      135       90   
1                     Atlantic City School District      531      463   
2        Atlantic County Vocational School District        0        0   
3  Atlantic County Special Services School District       20       15   
4                 Brigantine Public School District       60       39   

   Grade01  Grade02  Grade03  Grade04  Grade05  Grade06  Grade07  Grade08  \
0       74       88       86       75      106      102      105       92   
1      446      416      439      425      490      418      473      467   
2        0        0        0        0
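Printing `head()` for 16 wide sheets produces a lot of wrapped output. As a possible complement, here is a sketch that condenses each sheet into one row of a summary table with its row and column counts. The helper `summarize_sheets` and the `toy_sheets` data are hypothetical and use made-up values.

```python
import pandas as pd


def summarize_sheets(sheets):
    """Build a one-row-per-sheet table of row and column counts."""
    rows = [
        {"sheet": name, "rows": df.shape[0], "columns": df.shape[1]}
        for name, df in sheets.items()
    ]
    return pd.DataFrame(rows, columns=["sheet", "rows", "columns"])


# Toy stand-in for desired_data; values are illustrative only.
toy_sheets = {
    "SheetA": pd.DataFrame({"DistrictCode": ["0010", "0110"], "Value": [1.0, 2.0]}),
    "SheetB": pd.DataFrame({"DistrictCode": ["0010"], "Value": [3.0], "Other": ["x"]}),
}

summary = summarize_sheets(toy_sheets)
print(summary.to_string(index=False))
```

Running `summarize_sheets(desired_data)` in the notebook would show at a glance which sheets have one row per district and which have several (for example, one row per student group).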

Now let's see how many NaNs exist for each feature.

In [57]:
print(f'There are {len(desired_data)} features.\n')
for sheet_name, sheet_data in desired_data.items():
    number_of_nans = sheet_data.isnull().sum().sum()
    print(f'Sheet "{sheet_name}" contains {number_of_nans} null/NaN values.\n')

There are 16 features.

Sheet "EnrollmentTrendsbyGrade" contains 0 null/NaN values.

Sheet "EnrollmentTrendsByStudentGroup" contains 0 null/NaN values.

Sheet "StudentGrowthTrends" contains 2 null/NaN values.

Sheet "PSAT-SAT-ACTParticipation" contains 3 null/NaN values.

Sheet "APIBCourseworkPartPerf" contains 4 null/NaN values.

Sheet "4YrGraduationCohortProfile" contains 51 null/NaN values.

Sheet "DropoutRateTrends" contains 1 null/NaN values.

Sheet "PostsecondaryEnrRatesFall" contains 0 null/NaN values.

Sheet "ChronicAbsenteeism" contains 9120 null/NaN values.

Sheet "ViolenceVandalismHIBSubstanceOf" contains 0 null/NaN values.

Sheet "DisciplinaryRemovals" contains 0 null/NaN values.

Sheet "TeachersExperience" contains 9 null/NaN values.

Sheet "AdministratorsExperience" contains 5 null/NaN values.

Sheet "StudentToStaffRatios" contains 0 null/NaN values.

Sheet "TeachersAdminsLevelOfEducation" contains 668 null/NaN values.

Sheet "TeachersAdminsOneYearRetention" contains 2 nu

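It is important to note that 10 of the 16 sheets contain at least one NaN value, and that two of them, "ChronicAbsenteeism" (9120) and "TeachersAdminsLevelOfEducation" (668), account for nearly all of the missing values. We will have to address these appropriately as we handle each feature individually.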
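The totals above are per sheet, so they do not show which columns the NaNs sit in. A minimal sketch of a per-column breakdown follows. The helper `nan_report` and the `toy_nan_sheets` data are hypothetical, with made-up values, and only columns with at least one NaN are listed.

```python
import numpy as np
import pandas as pd


def nan_report(sheets):
    """Return one row per (sheet, column) that has at least one NaN."""
    records = []
    for name, df in sheets.items():
        counts = df.isna().sum()
        # Keep only the columns that actually have missing values
        for column, n_missing in counts[counts > 0].items():
            records.append(
                {
                    "sheet": name,
                    "column": column,
                    "n_missing": int(n_missing),
                    "pct_missing": round(100 * n_missing / len(df), 1),
                }
            )
    return pd.DataFrame(
        records, columns=["sheet", "column", "n_missing", "pct_missing"]
    )


# Toy stand-in for desired_data; values are illustrative only.
toy_nan_sheets = {
    "SheetA": pd.DataFrame(
        {
            "DistrictCode": ["0010", "0110", "0120", "0125"],
            "Rate": [1.5, np.nan, np.nan, 4.0],
        }
    ),
    "SheetB": pd.DataFrame({"DistrictCode": ["0010"], "Rate": [2.0]}),
}

report = nan_report(toy_nan_sheets)
print(report.to_string(index=False))
```

Applied to `desired_data`, a table like this could help decide, sheet by sheet, whether a column with missing values should be dropped, imputed, or left alone.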

-------------------------------------------------------

**Note to Reader (AKA Professor Gutu):**

So far, my project is on track with the objective timeline listed in my project overview statement, but I had honestly hoped to be a bit further along at this point. I did admittedly struggle to find the appropriate dataset and the most useful features to compare, but now that I have this foundation I think I should be able to pick up speed. For the next code draft submission, according to my objective timeline, I need to conduct exploratory data analysis, incorporate feature engineering, and develop my machine learning prediction model! In addition, I will be sure to clean up any previous parts along the way.

Looking forward to presenting you with my next draft, as well as hearing your comments/critiques for this draft!