# Why are Students Absent?

### Introduction

    The New York City public school system is the largest in the United States. Large datasets on a wide range of topics are compiled every year to help provide the best possible education for all New York City public school students. The data used here is publicly available and was compiled by PASSNYC, a non-profit dedicated to broadening educational opportunities for New York City students. The dataset for this project, "2016 School Explorer", is from 2016. It contains data on 1270 New York City public schools that serve students from kindergarten to grade eight. Because of how New York City public schools are structured, this means the data does not include high schools. The data covers the schools, their locations, the backgrounds of their students, and their performance on New York State standardized tests. All values are school-level averages.
    This project uses the data to answer the question "Why are students absent?" This is a pertinent question for today's schools. Only recently, in 2016, did the US Department of Education release data on chronic absenteeism. Because it had not previously been tracked in aggregate, little research is available for understanding chronic absenteeism. Much of the current research on the topic concerns its effects on students. This paper seeks instead to understand why chronic absenteeism occurs.
    TALK ABOUT RESULTS

### Variables

    The X variables used in this project are the economic need index, average ELA proficiency, average math proficiency, trust percentage, and strong family-community ties percentage. The economic need index is a number from 0 to 1, with 0 being the lowest economic need and 1 the highest. It is calculated with a formula that uses the percentage of students in temporary housing, eligibility for HRA (a New York City social services program), and the reduced-cost lunch provided to low-income students. Additionally, I am using the average ELA and math proficiency for each school. These are measured through New York State standardized testing, where anything above a score of 3 is a pass. Trust percentage is a survey-based statistic that reflects whether relationships between students, families, and teachers are based on respect and trust. Similarly, strong family-community ties percentage is a rating that reflects how well the school forms partnerships with families and the community it serves. Percent of students chronically absent is the Y variable; it measures the percent of students in each school who miss ten percent or more of school in the given year. 

    Together, these variables give a broad picture of a student's background, which is integral to understanding chronic absenteeism. A student's economic need could affect whether or not the student comes to school, as the student might have limited access to school supplies, reliable transportation, or time away from dependents. Additionally, average ELA and math test scores help describe the academic background of the child. The trust a student and family have in their teachers could be a factor in whether or not the student comes to school. A lack of trust could lead to a poor sense of belonging, leading the student not to attend school. Lastly, whether or not the school has strong ties with the community can affect a student's ability to come to school, because social services may work with the school and because such ties help keep teaching topics relevant. 
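
To make the definition of the Y variable concrete, here is a minimal sketch of how a school's chronic absence percentage could be derived from per-student absences. The attendance records and the 180-day school year are made up for illustration; the dataset itself only provides the finished school-level percentage.

```python
# Hypothetical attendance records for one school: days absent per student.
# The 180-day school year is an assumption for illustration only.
SCHOOL_DAYS = 180
days_absent = [2, 25, 18, 0, 40, 7, 17, 30]


def is_chronically_absent(absent_days, school_days=SCHOOL_DAYS):
    """Return True if the student missed ten percent or more of school days."""
    # Integer comparison avoids floating-point edge cases at exactly 10%.
    return absent_days * 10 >= school_days


chronic_flags = [is_chronically_absent(d) for d in days_absent]

# Share of students in the school who are chronically absent (0-1 scale,
# the same scale the notebook uses after converting the percent strings).
percent_chronically_absent = sum(chronic_flags) / len(chronic_flags)
print(f"{percent_chronically_absent:.0%} of students are chronically absent")
```

A student with exactly 18 absences out of 180 days counts as chronically absent here, because the definition is "ten percent or more".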
    

### Data Cleaning/Loading

In [36]:
#import all software needed
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
from scipy import stats

In [37]:
#read in data
df = pd.read_csv('2016 School Explorer.csv')

In [38]:
relevant_columns = ['Economic Need Index', 'Average ELA Proficiency', 'Average Math Proficiency', 'Trust %', 'Strong Family-Community Ties %', 'Percent of Students Chronically Absent']

#replace "N/A" to a null variable for relevant columns that contain "N/A"
df.replace('N/A', np.nan, inplace=True)

#check data for null variables in relevant columns
print(df[relevant_columns].isnull().any())

#check data for type
print(df[relevant_columns].dtypes)

Economic Need Index                       True
Average ELA Proficiency                   True
Average Math Proficiency                  True
Trust %                                   True
Strong Family-Community Ties %            True
Percent of Students Chronically Absent    True
dtype: bool
Economic Need Index                       float64
Average ELA Proficiency                   float64
Average Math Proficiency                  float64
Trust %                                    object
Strong Family-Community Ties %             object
Percent of Students Chronically Absent     object
dtype: object
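
The check above only reports whether each column has at least one missing value. A possible complement, sketched here on a small made-up frame (`toy` is a hypothetical stand-in, not the real data), is to count the gaps so it is clear how many rows a later `dropna` would remove.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the school data; values are made up for illustration.
toy = pd.DataFrame({
    "Economic Need Index": [0.55, np.nan, 0.84, 0.73],
    "Trust %": ["92%", "87%", np.nan, "94%"],
})

# isnull().any() only says whether a column has at least one gap;
# isnull().sum() counts the gaps and isnull().mean() gives their share.
missing_counts = toy.isnull().sum()
missing_share = toy.isnull().mean()

print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```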


In [39]:
columns_to_change = ['Trust %', 'Strong Family-Community Ties %', 'Student Attendance Rate', 'Percent of Students Chronically Absent']

#eliminate rows with null values in the percentage columns
#(rows missing only ELA or math proficiency are kept)
df.dropna(subset=columns_to_change, inplace=True)

#convert percentage strings such as "92%" to float fractions such as 0.92
df[columns_to_change] = df[columns_to_change].apply(lambda x: x.str.rstrip('%').astype('float') / 100.0)

#create table with all relevant variables
relevant_columns_table = df[relevant_columns]

#measure skewness of each relevant column
skew_values = relevant_columns_table.skew()
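
The conversion above relies on every remaining value being a string that ends in a percent sign. A more defensive variant, sketched here on made-up values, uses `pd.to_numeric` with `errors="coerce"` so that any leftover placeholder becomes a null value instead of raising an error. This is an alternative under stated assumptions, not what the notebook ran.

```python
import pandas as pd

# Made-up values; "N/A" mimics the placeholder found in the raw file.
raw = pd.Series(["92%", "87%", "N/A", "100%"], name="Trust %")

# Strip the percent sign, coerce anything non-numeric to NaN instead of
# raising, and rescale to the 0-1 range.
as_fraction = pd.to_numeric(raw.str.rstrip("%"), errors="coerce") / 100.0

print(as_fraction)
```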

In [40]:
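#check for outliers: flag values more than 2.58 standard deviations from the column mean
#note: 'Average ELA Proficiency' and 'Average Math Proficiency' still contain null values here,
#so their z-scores are all NaN and these two columns always report False below

z_scores = stats.zscore(relevant_columns_table, axis=0)
threshold = 2.58
outliers = (z_scores > threshold) | (z_scores < -threshold)
outliers_in_columns = outliers.any()
outliers_in_columns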

Economic Need Index                        True
Average ELA Proficiency                   False
Average Math Proficiency                  False
Trust %                                    True
Strong Family-Community Ties %             True
Percent of Students Chronically Absent     True
dtype: bool
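
The two `False` results above should be read with care: the proficiency columns still contain missing values at this point (their count in the summary table is 1217 rather than 1247), and a z-score computed with a missing value in the column is itself missing. Below is a minimal sketch of a check that ignores missing values, using plain pandas arithmetic on made-up numbers. The frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up scores; one value is missing and each column has one extreme value.
toy = pd.DataFrame({
    "Average ELA Proficiency": [2.4, 2.5, np.nan, 2.6, 2.5, 2.4, 2.6, 2.5, 2.5, 4.2],
    "Trust %": [0.90, 0.92, 0.91, 0.93, 0.90, 0.92, 0.91, 0.93, 0.92, 0.0],
})

# pandas skips NaN when computing mean and std, so one missing value no longer
# turns the whole column's z-scores into NaN. ddof=0 gives the population
# standard deviation, which is also the default in scipy.stats.zscore.
z = (toy - toy.mean()) / toy.std(ddof=0)

threshold = 2.58
outliers = z.abs() > threshold

print(outliers.any())
```

Applied to the notebook's `relevant_columns_table`, the same arithmetic would evaluate the two proficiency columns on their non-missing values. The resulting flags are not shown because this sketch has not been run on the real data.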

### Summary Statistics Tables

In [86]:
print("\nSummary Statistics")
relevant_columns_table.describe()


Summary Statistics


| | Economic Need Index | Average ELA Proficiency | Average Math Proficiency | Trust % | Strong Family-Community Ties % | Percent of Students Chronically Absent |
|---|---|---|---|---|---|---|
| count | 1247.0 | 1217.0 | 1217.0 | 1247.0 | 1247.0 | 1247.0 |
| mean | 0.672281 | 2.534215 | 2.668956 | 0.904226 | 0.830914 | 0.21575 |
| std | 0.210959 | 0.363589 | 0.47047 | 0.061228 | 0.062786 | 0.140716 |
| min | 0.049 | 1.81 | 1.83 | 0.0 | 0.0 | 0.0 |
| 25% | 0.55 | 2.25 | 2.3 | 0.87 | 0.8 | 0.11 |
| 50% | 0.731 | 2.45 | 2.58 | 0.92 | 0.83 | 0.2 |
| 75% | 0.841 | 2.76 | 2.98 | 0.94 | 0.87 | 0.3 |
| max | 0.957 | 3.93 | 4.2 | 1.0 | 0.99 | 1.0 |


Economic Need Index: The mean (0.67) is high, meaning many students rely on some type of government assistance. The 25th percentile shows that about seventy-five percent of schools have an economic need index of 0.55 or higher, and the median (0.731) is above the mean, which suggests that a small number of low-need schools pull the mean down. This suggests that a majority of students rely on some form of government assistance and come from lower-income households. 

Average ELA Proficiency: The 75th percentile (2.76) is below the passing score of 3, meaning that in at least three fourths of schools the average student is not proficient in ELA (reading).

Average Math Proficiency: The 75th percentile (2.98) is slightly better than that of ELA, possibly because English language learners make up a sizeable portion of students. However, it still indicates that in about three fourths of schools the average student is not proficient in mathematics. 

Trust %: Trust in schools and their teachers is generally high, with a mean of around ninety percent and little deviation. This could be influenced by response bias, as it is plausible that people with high trust in the school are more inclined to take the survey.

Strong Family-Community Ties %: Family-community ties are generally high, with little deviation from the high mean (0.83). This could also be influenced by bias in the way the data is collected: surveys may have been answered mostly by people who feel more connected to the school and community. 
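
The percentile reasoning above can be checked more directly by computing the share of schools whose average falls below the passing score. The sketch below does this on made-up school averages. The cutoff of 3.0 follows the Variables section, and how a score of exactly 3.0 should be treated is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up school-level averages, not the real data.
toy = pd.DataFrame({
    "Average ELA Proficiency": [2.2, 2.5, 2.8, 3.1, np.nan],
    "Average Math Proficiency": [2.3, 2.6, 3.2, 3.4, 2.9],
})

# Proficiency cutoff as described in the Variables section; treating a score
# of exactly 3.0 as passing is an assumption made for this sketch.
PASSING = 3.0

# Drop missing scores per column before comparing, so that NaN is counted
# neither as "below proficiency" nor as part of the denominator.
share_below = toy.apply(lambda col: (col.dropna() < PASSING).mean())

print(share_below)
```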

In [79]:
#create table for chronic absence rate summary statistics for lower income schools
lower_income_schools = relevant_columns_table[relevant_columns_table['Economic Need Index'] > relevant_columns_table['Economic Need Index'].mean()]
chronic_absence_lower_income = lower_income_schools['Percent of Students Chronically Absent'].describe()
print("\nChronic Absence Rate Summary Statistics for Lower Income Schools")
print(chronic_absence_lower_income.to_string(index=True))


Chronic Absence Rate Summary Statistics for Lower Income Schools
count    751.000000
mean       0.271971
std        0.140158
min        0.000000
25%        0.180000
50%        0.270000
75%        0.350000
max        1.000000


This data suggests that schools with greater economic need have higher rates of chronic absenteeism. Among schools with an above-average economic need index, the mean chronic absence rate (0.27) is higher than the overall mean (0.22), as are the 25th, 50th, and 75th percentiles (0.18, 0.27, and 0.35 versus 0.11, 0.20, and 0.30). Because the whole distribution shifts upward, outliers are unlikely to be driving the difference. The data also shows that one fourth of these schools have a chronic absence rate of thirty-five percent or more, a surprisingly large number. In those schools, at least thirty-five percent of students are absent for ten percent or more of school days per year. 
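
Comparing the higher-need group against the overall table requires reading two separate outputs. One way to put both groups side by side is a `groupby` on the same above-mean split, sketched here with made-up values. The frame `toy` and the group labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values for six schools, for illustration only.
toy = pd.DataFrame({
    "Economic Need Index": [0.20, 0.35, 0.60, 0.75, 0.85, 0.95],
    "Percent of Students Chronically Absent": [0.08, 0.12, 0.18, 0.27, 0.33, 0.40],
})

# Label each school by whether its need index is above the overall mean,
# mirroring the split used in the cell above.
above_mean = toy["Economic Need Index"] > toy["Economic Need Index"].mean()
group = np.where(above_mean, "above mean need", "at or below mean need")

# describe() per group yields one row of summary statistics for each label.
comparison = toy.groupby(group)["Percent of Students Chronically Absent"].describe()

print(comparison)
```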


In [80]:
#create table for chronic absence rate summary statistics for schools with average ELA proficiency of 2.58 or below
#note: 2.58 is the median of 'Average Math Proficiency'; only the ELA column is filtered here
schools_under_proficiency = relevant_columns_table[relevant_columns_table['Average ELA Proficiency'] <= 2.58]
chronic_absence_under_proficiency = schools_under_proficiency['Percent of Students Chronically Absent'].describe()
print("\nChronic Absence Rates Summary Statistics for Schools with below median math and ELA Scores")
print(chronic_absence_under_proficiency.to_string(index=True))


Chronic Absence Rates Summary Statistics for Schools with below median math and ELA Scores
count    749.000000
mean       0.268131
std        0.109411
min        0.000000
25%        0.190000
50%        0.260000
75%        0.350000
max        0.740000


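The data shows chronic absenteeism is more common in schools whose students score lower on average than those of other New York City public schools. These rates are very similar to those of the higher-economic-need schools above (a mean of about 0.27 and a 75th percentile of 0.35 in both groups). 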
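
The cell above filters on ELA proficiency alone with a fixed cutoff. If the intent is to select schools at or below the median in both subjects, a sketch along the following lines could be used. It runs on made-up values, and the choice of each column's own median as the cutoff is an assumption about the intended split.

```python
import pandas as pd

# Made-up school-level values, for illustration only.
toy = pd.DataFrame({
    "Average ELA Proficiency": [2.1, 2.3, 2.5, 2.7, 3.0, 3.4],
    "Average Math Proficiency": [2.2, 2.9, 2.4, 2.8, 3.1, 3.6],
    "Percent of Students Chronically Absent": [0.35, 0.30, 0.26, 0.20, 0.12, 0.08],
})

# Use each subject's own median rather than one shared cutoff.
ela_median = toy["Average ELA Proficiency"].median()
math_median = toy["Average Math Proficiency"].median()

# Keep schools that are at or below the median in both subjects.
below_both = toy[
    (toy["Average ELA Proficiency"] <= ela_median)
    & (toy["Average Math Proficiency"] <= math_median)
]

print(below_both["Percent of Students Chronically Absent"].describe())
```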

## Plots, Histograms, Figures