            How do school racial demographics relate to student absence rate?

Introduction:
    New York City Public Schools is the largest public school system in the United States. A system of this size produces large datasets covering many variables, collected to help provide the best possible education for all students. The data is publicly available and has been compiled by PASSNYC, a non-profit dedicated to broadening educational opportunities for New York City students.
    In this project I use this compiled dataset to answer the question "How do school racial demographics correlate with student absence rate?". To answer this question, I am...

#explain why I picked these variables 

Data Cleaning:

In [72]:
#import all libraries needed
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline

In [73]:
#read in data
df = pd.read_csv('2016 School Explorer.csv')

In [74]:
relevant_columns = ['Percent Asian', 'Percent Black', 'Percent Hispanic', 'Percent Black / Hispanic', 'Percent White', 'Student Attendance Rate', 'Percent of Students Chronically Absent']

#check relevant columns for null values
print(df[relevant_columns].isnull().any())

#check data types of relevant columns
print(df[relevant_columns].dtypes)

Percent Asian                             False
Percent Black                             False
Percent Hispanic                          False
Percent Black / Hispanic                  False
Percent White                             False
Student Attendance Rate                    True
Percent of Students Chronically Absent     True
dtype: bool
Percent Asian                             object
Percent Black                             object
Percent Hispanic                          object
Percent Black / Hispanic                  object
Percent White                             object
Student Attendance Rate                   object
Percent of Students Chronically Absent    object
dtype: object
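
The check above only reports whether a column contains any missing values. Before dropping rows, it can help to see how many are affected. The following is a minimal sketch on a small made-up frame (`toy` and its values are invented for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    'Percent White': ['1%', '6%', '4%', '10%'],
    'Student Attendance Rate': ['94%', np.nan, '94%', '92%'],
    'Percent of Students Chronically Absent': ['18%', np.nan, '20%', np.nan],
})

# Number of missing entries in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Number of rows that dropna(subset=...) would remove
subset = ['Student Attendance Rate', 'Percent of Students Chronically Absent']
rows_with_nulls = toy[subset].isnull().any(axis=1).sum()
print(rows_with_nulls)
```

Comparing the second number with `len(df)` shows what share of schools is lost when the incomplete rows are dropped.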


In [77]:
#drop rows with null values in the attendance columns
df = df.dropna(subset=['Student Attendance Rate', 'Percent of Students Chronically Absent'])

#convert columns to float values
relevant_columns_table = df[relevant_columns].apply(lambda x: x.str.rstrip('%').astype('float') / 100.0)
relevant_columns_table

      Percent Asian  Percent Black  Percent Hispanic  Percent Black / Hispanic  Percent White  Student Attendance Rate  Percent of Students Chronically Absent
0              0.05           0.32              0.60                      0.92           0.01                     0.94                                    0.18
1              0.10           0.20              0.63                      0.83           0.06                     0.92                                    0.30
2              0.35           0.08              0.49                      0.57           0.04                     0.94                                    0.20
3              0.05           0.29              0.63                      0.92           0.04                     0.92                                    0.28
4              0.04           0.20              0.65                      0.84           0.10                     0.93                                    0.23
...             ...            ...               ...                       ...            ...                      ...                                     ...
1267           0.00           0.20              0.77                      0.97           0.01                     0.95                                    0.13
1268           0.00           0.68              0.31                      0.98           0.01                     0.94                                    0.24
1269           0.00           0.54              0.45                      0.99           0.01                     0.95                                    0.12
1270           0.02           0.86              0.09                      0.95           0.01                     0.95                                    0.12
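
The conversion cell relies on every cell being a string such as '94%'; `astype('float')` raises a `ValueError` if any entry cannot be parsed. A more defensive variant is sketched below, assuming unparseable entries should become NaN rather than stop the run. The helper name `percent_to_fraction` and the toy values are hypothetical, not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real columns; values are invented for illustration.
toy = pd.DataFrame({
    'Percent Asian': ['5%', '10%', '35%'],
    'Student Attendance Rate': ['94%', '92%', 'N/A'],
})

def percent_to_fraction(col):
    """Strip a trailing '%' and scale to a 0-1 fraction; unparseable entries become NaN."""
    stripped = col.astype(str).str.rstrip('%')
    return pd.to_numeric(stripped, errors='coerce') / 100.0

fractions = toy.apply(percent_to_fraction)
print(fractions)

# Sanity check: every parsed (non-NaN) value should lie between 0 and 1
in_range = fractions.isna() | ((fractions >= 0) & (fractions <= 1))
assert in_range.all().all()
```

The range check is a cheap way to catch a column that was already stored as a fraction and would otherwise be divided by 100 a second time.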


Create Summary Statistics Tables

In [85]:
#create separate tables for each summary statistic
mean_values = relevant_columns_table[relevant_columns].mean()
median_values = relevant_columns_table[relevant_columns].median()
mode_values = relevant_columns_table[relevant_columns].mode()
variance_values = relevant_columns_table[relevant_columns].var()
std_values = relevant_columns_table[relevant_columns].std()
skew_values = relevant_columns_table[relevant_columns].skew()
iqr_values = relevant_columns_table[relevant_columns].quantile([0.25, 0.75])
kurtosis_values = relevant_columns_table[relevant_columns].kurtosis()
#possibly add range

#create dataframes for each summary statistic 
mean_df = pd.DataFrame(mean_values, columns=['Mean'])
median_df = pd.DataFrame(median_values, columns=['Median'])
mode_df = pd.DataFrame(mode_values, columns=['Mode'])
variance_df = pd.DataFrame(variance_values, columns=['Variance'])
std_df = pd.DataFrame(std_values, columns=['Standard Deviation'])
skew_df = pd.DataFrame(skew_values, columns=['Skewness'])
iqr_df = pd.DataFrame(iqr_values, columns=['Q1', 'Q3']).transpose()
kurtosis_df = pd.DataFrame(kurtosis_values, columns=['Kurtosis'])

#merge tables to show one clean output

summary_stats = pd.concat([mean_df, median_df, mode_df, variance_df, std_df, skew_df, iqr_df, kurtosis_df], axis=1)
print(summary_stats)
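
In the printed output of this cell, the Mode column is entirely NaN and extra rows labelled 0, Q1 and Q3 appear. This happens because `mode()` and `quantile([0.25, 0.75])` return DataFrames rather than Series, so passing `columns=['Mode']` or `columns=['Q1', 'Q3']` selects columns that do not exist. One possible fix is sketched below on a small made-up frame (`toy` and its values are invented for illustration); it assumes that, when several values tie for the mode, keeping the first is acceptable:

```python
import pandas as pd

# Toy stand-in for relevant_columns_table; values are invented for illustration.
toy = pd.DataFrame({
    'Percent Asian': [0.05, 0.10, 0.35, 0.05, 0.04],
    'Percent White': [0.01, 0.06, 0.04, 0.04, 0.10],
})

# Passing a scalar quantile returns a Series indexed by column name
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)

# Each entry is a Series indexed by column name, so they align row by row
summary = pd.DataFrame({
    'Mean': toy.mean(),
    'Median': toy.median(),
    # mode() returns a DataFrame (one row per tied mode); keep the first row
    'Mode': toy.mode().iloc[0],
    'Variance': toy.var(),
    'Standard Deviation': toy.std(),
    'Skewness': toy.skew(),
    'Q1': q1,
    'Q3': q3,
    'IQR': q3 - q1,
    'Kurtosis': toy.kurtosis(),
})
print(summary)
```

Building the table from a dictionary of Series keeps one row per variable and one column per statistic, so no stray index rows are introduced.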

                                            Mean  Median  Mode  Variance  \
Percent Asian                           0.117642    0.04   NaN  0.031622   
Percent Black                           0.318348    0.23   NaN  0.082670   
Percent Hispanic                        0.412518    0.36   NaN  0.068470   
Percent Black / Hispanic                0.730810    0.90   NaN  0.086570   
Percent White                           0.131179    0.03   NaN  0.040033   
Student Attendance Rate                 0.927249    0.94   NaN  0.007640   
Percent of Students Chronically Absent  0.215750    0.20   NaN  0.019801   
0                                            NaN     NaN   NaN       NaN   
Q1                                           NaN     NaN   NaN       NaN   
Q3                                           NaN     NaN   NaN       NaN   

                                        Standard Deviation  Skewness  0.25  \
Percent Asian                                     0.177825  2.137882   NaN   
Percent