## Introduction

We are exploring two questions:
1. Which attributes distinguish schools that have CS programs from schools that do not? We tackle this question first.
2. Among schools with CS programs, do any attributes of the individual student relate to whether they take CS? This may be explored in a separate notebook.

In [120]:
import pandas as pd
import numpy as np

In [121]:
ospi_data = pd.read_csv('2022_school_pt5.csv',  index_col=0)

In [122]:
ospi_data.head()

Unnamed: 0,DistrictCode,SchoolCode,SchoolName,Longitude,Latitude,County,AllStudents,C_AllStudents,G_Female,GC_Female,G_Male,GC_Male,G_GenderX,GC_GenderX,R_Native,RC_Native,R_Asian,RC_Asian,R_Black,RC_Black,R_Hisp_Lat,RC_Hisp_Lat,R_HPI,RC_HPI,R_NA,RC_NA,R_TwoOrMore,RC_TwoOrMore,R_White,RC_White,L_ELL,LC_ELL,L_NoELL,LC_NoELL,I_LowIncome,IC_LowIncome,I_NoLowIncome,IC_NoLowIncome,D_Disability,DC_Disability,D_NoDisability,DC_NoDisability,A_9,AC_9,A_10,AC_10,A_11,AC_11,A_12,AC_12
0,1109,3075,Washtucna Elementary/High School,-118.311231,46.752189,Adams,28,3,9,0,19,3,0,0,0,0,0,0,0,0,2,0,0,0,0,0,2,0,24,3,0,0,28,3,17,1,11,2,2,0,26,3,7,0,4,0,6,1,11,2
1,1147,3015,Othello High School,-119.165246,46.82271,Adams,1281,169,635,60,646,109,0,0,0,0,5,1,2,0,1173,151,0,0,0,0,3,0,98,17,402,54,879,115,1069,142,212,27,195,29,1086,140,382,51,328,44,321,29,250,45
2,1158,2903,Lind-Ritzville High School,-118.292516,47.125588,Adams,60,17,30,8,29,9,1,0,0,0,0,0,1,0,23,10,0,0,0,0,4,2,32,5,7,3,53,14,48,14,12,3,3,1,57,16,18,9,15,2,14,1,13,5
3,1160,2132,Ritzville High School,-118.292516,47.125588,Adams,113,18,57,9,56,9,0,0,2,2,2,0,0,0,9,0,0,0,0,0,3,0,97,16,0,0,113,18,47,6,66,12,6,4,107,14,32,5,35,3,25,3,21,7
4,2250,1617,Educational Opportunity Center,-117.057521,46.411019,Asotin,167,0,89,0,78,0,0,0,1,0,0,0,2,0,11,0,0,0,0,0,5,0,148,0,0,0,167,0,133,0,34,0,25,0,142,0,16,0,51,0,41,0,59,0


In [123]:
ospi_data.isna().sum()

DistrictCode       0
SchoolCode         0
SchoolName         0
Longitude          0
Latitude           0
County             0
AllStudents        0
C_AllStudents      0
G_Female           0
GC_Female          0
G_Male             0
GC_Male            0
G_GenderX          0
GC_GenderX         0
R_Native           0
RC_Native          0
R_Asian            0
RC_Asian           0
R_Black            0
RC_Black           0
R_Hisp_Lat         0
RC_Hisp_Lat        0
R_HPI              0
RC_HPI             0
R_NA               0
RC_NA              0
R_TwoOrMore        0
RC_TwoOrMore       0
R_White            0
RC_White           0
L_ELL              0
LC_ELL             0
L_NoELL            0
LC_NoELL           0
I_LowIncome        0
IC_LowIncome       0
I_NoLowIncome      0
IC_NoLowIncome     0
D_Disability       0
DC_Disability      0
D_NoDisability     0
DC_NoDisability    0
A_9                0
AC_9               0
A_10               0
AC_10              0
A_11               0
AC_11        

In [124]:
len(ospi_data.index)

730

How can we expand this data so that we can use it with a training algorithm? Because we are concentrating on schools, one option is to repeat each school's row "AllStudents" times and disregard the individual demographic information. This is only a first thought.
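
A minimal sketch of the row-expansion idea, using a hypothetical two-school table whose values are copied from the `head()` output above. The names `schools` and `expanded` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of the school table (values from the first rows above).
schools = pd.DataFrame({
    "SchoolCode": [3075, 2903],
    "AllStudents": [28, 60],
    "C_AllStudents": [3, 17],
})

# Repeat each school's row once per enrolled student.
expanded = schools.loc[schools.index.repeat(schools["AllStudents"])]
expanded = expanded.reset_index(drop=True)

print(len(expanded))                           # total rows after expansion
print(expanded["SchoolCode"].value_counts())   # rows per school
```

Repeating rows multiplies memory use by the average school size. Where the chosen estimator accepts per-row weights, passing "AllStudents" as a weight may achieve a similar effect without duplicating rows.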

In [125]:
ospi_data = ospi_data.iloc[:,:-15]

In [126]:
ospi_data.head()

Unnamed: 0,DistrictCode,SchoolCode,SchoolName,Longitude,Latitude,County,AllStudents,C_AllStudents,G_Female,GC_Female,G_Male,GC_Male,G_GenderX,GC_GenderX,R_Native,RC_Native,R_Asian,RC_Asian,R_Black,RC_Black,R_Hisp_Lat,RC_Hisp_Lat,R_HPI,RC_HPI,R_NA,RC_NA,R_TwoOrMore,RC_TwoOrMore,R_White,RC_White,L_ELL,LC_ELL,L_NoELL,LC_NoELL,I_LowIncome
0,1109,3075,Washtucna Elementary/High School,-118.311231,46.752189,Adams,28,3,9,0,19,3,0,0,0,0,0,0,0,0,2,0,0,0,0,0,2,0,24,3,0,0,28,3,17
1,1147,3015,Othello High School,-119.165246,46.82271,Adams,1281,169,635,60,646,109,0,0,0,0,5,1,2,0,1173,151,0,0,0,0,3,0,98,17,402,54,879,115,1069
2,1158,2903,Lind-Ritzville High School,-118.292516,47.125588,Adams,60,17,30,8,29,9,1,0,0,0,0,0,1,0,23,10,0,0,0,0,4,2,32,5,7,3,53,14,48
3,1160,2132,Ritzville High School,-118.292516,47.125588,Adams,113,18,57,9,56,9,0,0,2,2,2,0,0,0,9,0,0,0,0,0,3,0,97,16,0,0,113,18,47
4,2250,1617,Educational Opportunity Center,-117.057521,46.411019,Asotin,167,0,89,0,78,0,0,0,1,0,0,0,2,0,11,0,0,0,0,0,5,0,148,0,0,0,167,0,133


In [127]:
def low_inc_perc(row):
    return row['I_LowIncome'] / row['AllStudents']

ospi_data['PercentLowIncome'] = ospi_data.apply(low_inc_perc, axis=1)

In [128]:
ospi_data.head()

Unnamed: 0,DistrictCode,SchoolCode,SchoolName,Longitude,Latitude,County,AllStudents,C_AllStudents,G_Female,GC_Female,G_Male,GC_Male,G_GenderX,GC_GenderX,R_Native,RC_Native,R_Asian,RC_Asian,R_Black,RC_Black,R_Hisp_Lat,RC_Hisp_Lat,R_HPI,RC_HPI,R_NA,RC_NA,R_TwoOrMore,RC_TwoOrMore,R_White,RC_White,L_ELL,LC_ELL,L_NoELL,LC_NoELL,I_LowIncome,PercentLowIncome
0,1109,3075,Washtucna Elementary/High School,-118.311231,46.752189,Adams,28,3,9,0,19,3,0,0,0,0,0,0,0,0,2,0,0,0,0,0,2,0,24,3,0,0,28,3,17,0.607143
1,1147,3015,Othello High School,-119.165246,46.82271,Adams,1281,169,635,60,646,109,0,0,0,0,5,1,2,0,1173,151,0,0,0,0,3,0,98,17,402,54,879,115,1069,0.834504
2,1158,2903,Lind-Ritzville High School,-118.292516,47.125588,Adams,60,17,30,8,29,9,1,0,0,0,0,0,1,0,23,10,0,0,0,0,4,2,32,5,7,3,53,14,48,0.8
3,1160,2132,Ritzville High School,-118.292516,47.125588,Adams,113,18,57,9,56,9,0,0,2,2,2,0,0,0,9,0,0,0,0,0,3,0,97,16,0,0,113,18,47,0.415929
4,2250,1617,Educational Opportunity Center,-117.057521,46.411019,Asotin,167,0,89,0,78,0,0,0,1,0,0,0,2,0,11,0,0,0,0,0,5,0,148,0,0,0,167,0,133,0.796407


In [129]:
pd.set_option('display.max_columns', None)
ospi_data.head()

Unnamed: 0,DistrictCode,SchoolCode,SchoolName,Longitude,Latitude,County,AllStudents,C_AllStudents,G_Female,GC_Female,G_Male,GC_Male,G_GenderX,GC_GenderX,R_Native,RC_Native,R_Asian,RC_Asian,R_Black,RC_Black,R_Hisp_Lat,RC_Hisp_Lat,R_HPI,RC_HPI,R_NA,RC_NA,R_TwoOrMore,RC_TwoOrMore,R_White,RC_White,L_ELL,LC_ELL,L_NoELL,LC_NoELL,I_LowIncome,PercentLowIncome
0,1109,3075,Washtucna Elementary/High School,-118.311231,46.752189,Adams,28,3,9,0,19,3,0,0,0,0,0,0,0,0,2,0,0,0,0,0,2,0,24,3,0,0,28,3,17,0.607143
1,1147,3015,Othello High School,-119.165246,46.82271,Adams,1281,169,635,60,646,109,0,0,0,0,5,1,2,0,1173,151,0,0,0,0,3,0,98,17,402,54,879,115,1069,0.834504
2,1158,2903,Lind-Ritzville High School,-118.292516,47.125588,Adams,60,17,30,8,29,9,1,0,0,0,0,0,1,0,23,10,0,0,0,0,4,2,32,5,7,3,53,14,48,0.8
3,1160,2132,Ritzville High School,-118.292516,47.125588,Adams,113,18,57,9,56,9,0,0,2,2,2,0,0,0,9,0,0,0,0,0,3,0,97,16,0,0,113,18,47,0.415929
4,2250,1617,Educational Opportunity Center,-117.057521,46.411019,Asotin,167,0,89,0,78,0,0,0,1,0,0,0,2,0,11,0,0,0,0,0,5,0,148,0,0,0,167,0,133,0.796407


Note that BIPOC in the following calculation includes all reported racial identities except Asian and white.

In [130]:
def number_bipoc(row):
    return (row['R_Native'] + row['R_Black'] + row['R_Hisp_Lat'] + row['R_HPI'] + row['R_TwoOrMore'])

ospi_data['R_BIPOC'] = ospi_data.apply(number_bipoc, axis=1)

In [131]:
def bipoc_perc(row):
    return (row['R_BIPOC']) / row['AllStudents']

ospi_data['PercentBIPOC'] = ospi_data.apply(bipoc_perc, axis=1)
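
The row-wise `apply` calls above can also be written as column arithmetic, which is usually faster and shorter in pandas. The sketch below uses a hypothetical two-row frame `df` with values copied from the first two schools; `bipoc_cols` is an illustrative name.

```python
import pandas as pd

# Hypothetical two-row sample (values from the first two schools above).
df = pd.DataFrame({
    "AllStudents": [28, 1281],
    "I_LowIncome": [17, 1069],
    "R_Native": [0, 0],
    "R_Black": [0, 2],
    "R_Hisp_Lat": [2, 1173],
    "R_HPI": [0, 0],
    "R_TwoOrMore": [2, 3],
})

# Same identities as number_bipoc: everything except Asian and white.
bipoc_cols = ["R_Native", "R_Black", "R_Hisp_Lat", "R_HPI", "R_TwoOrMore"]

# Whole-column operations replace the per-row functions.
df["PercentLowIncome"] = df["I_LowIncome"] / df["AllStudents"]
df["R_BIPOC"] = df[bipoc_cols].sum(axis=1)
df["PercentBIPOC"] = df["R_BIPOC"] / df["AllStudents"]

print(df[["PercentLowIncome", "R_BIPOC", "PercentBIPOC"]])
```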

In [145]:
ospi_data['DistrictSize'] = ospi_data.groupby('DistrictCode')['AllStudents'].transform('sum')
ospi_data['DistrictLowIncome'] = ospi_data.groupby('DistrictCode')['I_LowIncome'].transform('sum')
ospi_data['DistrictPctLowIncome'] = ospi_data['DistrictLowIncome']/ospi_data['DistrictSize']
ospi_data['DistrictBIPOC'] = ospi_data.groupby('DistrictCode')['R_BIPOC'].transform('sum')
ospi_data['DistrictPctBIPOC'] = ospi_data['DistrictBIPOC']/ospi_data['DistrictSize']

In [151]:
columns_to_analyze = ['DistrictCode','SchoolCode','SchoolName','Longitude'
                      ,'Latitude','County','AllStudents','C_AllStudents', 'I_LowIncome', 'PercentLowIncome'
                      , 'R_BIPOC' ,'PercentBIPOC','DistrictSize','DistrictLowIncome','DistrictPctLowIncome'
                      ,'DistrictBIPOC','DistrictPctBIPOC']

reduced_ospi = ospi_data[columns_to_analyze]

In [152]:
reduced_ospi.head()

Unnamed: 0,DistrictCode,SchoolCode,SchoolName,Longitude,Latitude,County,AllStudents,C_AllStudents,I_LowIncome,PercentLowIncome,R_BIPOC,PercentBIPOC,DistrictSize,DistrictLowIncome,DistrictPctLowIncome,DistrictBIPOC,DistrictPctBIPOC
0,1109,3075,Washtucna Elementary/High School,-118.311231,46.752189,Adams,28,3,17,0.607143,4,0.142857,28,17,0.607143,4,0.142857
1,1147,3015,Othello High School,-119.165246,46.82271,Adams,1281,169,1069,0.834504,1178,0.919594,1464,1235,0.843579,1346,0.919399
2,1158,2903,Lind-Ritzville High School,-118.292516,47.125588,Adams,60,17,48,0.8,28,0.466667,60,48,0.8,28,0.466667
3,1160,2132,Ritzville High School,-118.292516,47.125588,Adams,113,18,47,0.415929,14,0.123894,113,47,0.415929,14,0.123894
4,2250,1617,Educational Opportunity Center,-117.057521,46.411019,Asotin,167,0,133,0.796407,19,0.113772,1034,559,0.540619,192,0.185687


Quick sanity check that the district-level percentages are valid:

In [149]:
reduced_ospi[reduced_ospi['DistrictCode'] == 1147]

Unnamed: 0,DistrictCode,SchoolCode,SchoolName,Longitude,Latitude,County,AllStudents,C_AllStudents,PercentLowIncome,PercentBIPOC,DistrictSize,DistrictLowIncome,DistrictPctLowIncome,DistrictBIPOC,DistrictPctBIPOC
1,1147,3015,Othello High School,-119.165246,46.82271,Adams,1281,169,0.834504,0.919594,1464,1235,0.843579,1346,0.919399
595,1147,5367,Desert Oasis High School,-119.163036,46.818533,Adams,142,0,0.922535,0.915493,1464,1235,0.843579,1346,0.919399
696,1147,5634,Open Door Re-Engagement,-119.17399,46.815949,Adams,41,0,0.853659,0.926829,1464,1235,0.843579,1346,0.919399


In [150]:
ospi_data[ospi_data['DistrictCode'] == 1147]

Unnamed: 0,DistrictCode,SchoolCode,SchoolName,Longitude,Latitude,County,AllStudents,C_AllStudents,G_Female,GC_Female,G_Male,GC_Male,G_GenderX,GC_GenderX,R_Native,RC_Native,R_Asian,RC_Asian,R_Black,RC_Black,R_Hisp_Lat,RC_Hisp_Lat,R_HPI,RC_HPI,R_NA,RC_NA,R_TwoOrMore,RC_TwoOrMore,R_White,RC_White,L_ELL,LC_ELL,L_NoELL,LC_NoELL,I_LowIncome,PercentLowIncome,R_BIPOC,PercentBIPOC,DistrictSize,DistrictLowIncome,DistrictPctLowIncome,income_binned,size_binned,bipoc_binned,DistrictBIPOC,DistrictPctBIPOC
1,1147,3015,Othello High School,-119.165246,46.82271,Adams,1281,169,635,60,646,109,0,0,0,0,5,1,2,0,1173,151,0,0,0,0,3,0,98,17,402,54,879,115,1069,0.834504,1178,0.919594,1464,1235,0.843579,1,4,1,1346,0.919399
595,1147,5367,Desert Oasis High School,-119.163036,46.818533,Adams,142,0,63,0,79,0,0,0,0,0,0,0,0,0,130,0,0,0,0,0,0,0,12,0,57,0,85,0,131,0.922535,130,0.915493,1464,1235,0.843579,1,2,1,1346,0.919399
696,1147,5634,Open Door Re-Engagement,-119.17399,46.815949,Adams,41,0,15,0,26,0,0,0,0,0,0,0,0,0,38,0,0,0,0,0,0,0,3,0,6,0,35,0,35,0.853659,38,0.926829,1464,1235,0.843579,1,1,1,1346,0.919399
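
The same check can be done programmatically by recomputing the district totals from the three schools shown above. This is a sketch on a hand-copied sample; the names `district` and `totals` are illustrative.

```python
import pandas as pd

# The three district 1147 schools from the output above.
district = pd.DataFrame({
    "DistrictCode": [1147, 1147, 1147],
    "AllStudents": [1281, 142, 41],
    "I_LowIncome": [1069, 131, 35],
    "R_BIPOC": [1178, 130, 38],
})

# District percentages are ratios of summed counts,
# not averages of the per-school percentages.
totals = district.groupby("DistrictCode")[["AllStudents", "I_LowIncome", "R_BIPOC"]].sum()
totals["DistrictPctLowIncome"] = totals["I_LowIncome"] / totals["AllStudents"]
totals["DistrictPctBIPOC"] = totals["R_BIPOC"] / totals["AllStudents"]

print(totals)
```

The recomputed values should agree with the DistrictSize, DistrictLowIncome, DistrictBIPOC and percentage columns in the table above.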


## Preprocessing: Binning

*Random forest and gradient boosting machines may benefit from binning.*

(School income from original), Bands: <20%, 20% – 40%, 40% – 60%, 60% – 80%, 80%+

(School size modified from original): <=100, 101 – 300, 301 – 900, 901 – 1800, 1801+

(Disadvantaged race/ethnic percentage): <15%, 15% – 30%, 30% – 50%, 50% – 75%, 75%+

(Location): (King), (Pierce, Snohomish, Spokane, Clark), (All Others).
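
The location bands are not binned in the cells below. One possible sketch maps each county name to one of the three groups. The labels '1' to '3' and the names `LARGE_COUNTIES` and `location_band` are assumptions chosen to mirror the other label lists.

```python
import pandas as pd

# Second-tier counties from the band definition above.
LARGE_COUNTIES = {"Pierce", "Snohomish", "Spokane", "Clark"}

def location_band(county: str) -> str:
    """Map a county name to an assumed location label: '1', '2' or '3'."""
    if county == "King":
        return "1"
    if county in LARGE_COUNTIES:
        return "2"
    return "3"  # all other counties

# Hypothetical sample of County values.
counties = pd.Series(["Adams", "King", "Pierce", "Asotin", "Spokane"], name="County")
location_binned = counties.map(location_band)

print(location_binned.tolist())
```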

In [135]:
income_bins = [-float('inf'), 20, 40, 60, 80, float('inf')]
income_labels = ['1', '2', '3', '4', '5']

size_bins = [-float('inf'), 100, 300, 900, 1800, float('inf')]
size_labels = ['1', '2', '3', '4', '5']

bipoc_bins = [-float('inf'), 15, 30, 50, 75, float('inf')]
bipoc_labels = ['1', '2', '3', '4', '5']

ospi_data['income_binned'] = pd.cut(ospi_data['PercentLowIncome'], bins=income_bins, labels=income_labels)
ospi_data['size_binned'] = pd.cut(ospi_data['AllStudents'], bins=size_bins, labels=size_labels)
ospi_data['bipoc_binned'] = pd.cut(ospi_data['PercentBIPOC'], bins=bipoc_bins, labels=bipoc_labels)

In [136]:
ospi_data['income_binned'].value_counts()

1    730
2      0
3      0
4      0
5      0
Name: income_binned, dtype: int64

In [137]:
ospi_data['size_binned'].value_counts()

1    253
2    176
4    125
3    124
5     52
Name: size_binned, dtype: int64

In [138]:
ospi_data['bipoc_binned'].value_counts()

1    730
2      0
3      0
4      0
5      0
Name: bipoc_binned, dtype: int64

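Noting the results above, I suggest that we reevaluate the binning strategy. The income and BIPOC bins are not useful if every value falls into the same bin. A likely cause is a scale mismatch: PercentLowIncome and PercentBIPOC are stored as fractions between 0 and 1, while the bin edges (20, 40, 60, 80 and 15, 30, 50, 75) are on a 0 to 100 scale, so all 730 schools land in the first bin. Entropy-based MDL, quantile, or equal-width binning could also be worth looking into. I will put binning on the back burner until I get the go-ahead to conduct further research or receive a new strategy to pursue.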
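
A hedged sketch of two alternatives, run on five PercentLowIncome values copied from the `head()` output rather than on the full data. The first keeps the original bands but expresses the edges as fractions. `right=False` is an assumption so that exactly 80% falls in the "80%+" band. The second uses quantile bins via `pd.qcut`.

```python
import pandas as pd

# Sample values from the first five schools (fractions, not percents).
pct_low_income = pd.Series(
    [0.607143, 0.834504, 0.8, 0.415929, 0.796407], name="PercentLowIncome"
)
labels = ["1", "2", "3", "4", "5"]

# Same bands as before, with edges on the 0-1 scale of the data.
income_bins = [-float("inf"), 0.20, 0.40, 0.60, 0.80, float("inf")]
income_binned = pd.cut(pct_low_income, bins=income_bins, labels=labels, right=False)

# Quantile-based alternative: five bins with roughly equal counts.
income_quantile = pd.qcut(pct_low_income, q=5, labels=labels)

print(income_binned.tolist())
print(income_quantile.value_counts().sort_index())
```

Fixed bands keep the bins interpretable, while quantile bins guarantee that no bin is empty. Which one suits the models here is still an open question.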

In [139]:
ospi_data.head()

Unnamed: 0,DistrictCode,SchoolCode,SchoolName,Longitude,Latitude,County,AllStudents,C_AllStudents,G_Female,GC_Female,G_Male,GC_Male,G_GenderX,GC_GenderX,R_Native,RC_Native,R_Asian,RC_Asian,R_Black,RC_Black,R_Hisp_Lat,RC_Hisp_Lat,R_HPI,RC_HPI,R_NA,RC_NA,R_TwoOrMore,RC_TwoOrMore,R_White,RC_White,L_ELL,LC_ELL,L_NoELL,LC_NoELL,I_LowIncome,PercentLowIncome,R_BIPOC,PercentBIPOC,DistrictSize,DistrictLowIncome,DistrictPctLowIncome,income_binned,size_binned,bipoc_binned
0,1109,3075,Washtucna Elementary/High School,-118.311231,46.752189,Adams,28,3,9,0,19,3,0,0,0,0,0,0,0,0,2,0,0,0,0,0,2,0,24,3,0,0,28,3,17,0.607143,4,0.142857,28,17,0.607143,1,1,1
1,1147,3015,Othello High School,-119.165246,46.82271,Adams,1281,169,635,60,646,109,0,0,0,0,5,1,2,0,1173,151,0,0,0,0,3,0,98,17,402,54,879,115,1069,0.834504,1178,0.919594,1464,1235,0.843579,1,4,1
2,1158,2903,Lind-Ritzville High School,-118.292516,47.125588,Adams,60,17,30,8,29,9,1,0,0,0,0,0,1,0,23,10,0,0,0,0,4,2,32,5,7,3,53,14,48,0.8,28,0.466667,60,48,0.8,1,1,1
3,1160,2132,Ritzville High School,-118.292516,47.125588,Adams,113,18,57,9,56,9,0,0,2,2,2,0,0,0,9,0,0,0,0,0,3,0,97,16,0,0,113,18,47,0.415929,14,0.123894,113,47,0.415929,1,2,1
4,2250,1617,Educational Opportunity Center,-117.057521,46.411019,Asotin,167,0,89,0,78,0,0,0,1,0,0,0,2,0,11,0,0,0,0,0,5,0,148,0,0,0,167,0,133,0.796407,19,0.113772,1034,559,0.540619,1,2,1


## Preprocessing: Normalizing
*K-Nearest Neighbors and K-means Clustering will benefit from normalizing.*
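
As a placeholder for this section, here is a minimal min-max normalization sketch in plain pandas. It rescales each column to the range 0 to 1 so that large-valued columns such as AllStudents do not dominate distance calculations. The frame `features` is a hypothetical sample built from values shown earlier.

```python
import pandas as pd

# Hypothetical sample of two numeric features (values from the first five schools).
features = pd.DataFrame({
    "AllStudents": [28, 1281, 60, 113, 167],
    "PercentLowIncome": [0.607143, 0.834504, 0.8, 0.415929, 0.796407],
})

# Min-max scaling: (x - min) / (max - min), applied column by column.
normalized = (features - features.min()) / (features.max() - features.min())

print(normalized)
```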

## Preprocessing: Standardizing
*Linear regression, logistic regression and principal component analysis will benefit from standardizing.*
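
As a placeholder for this section, here is a minimal z-score standardization sketch in plain pandas. Each column is shifted to mean 0 and scaled to unit standard deviation. Using `ddof=0` (population standard deviation) is an assumption, and the frame `features` is a hypothetical sample built from values shown earlier.

```python
import pandas as pd

# Hypothetical sample of two numeric features (values from the first five schools).
features = pd.DataFrame({
    "AllStudents": [28, 1281, 60, 113, 167],
    "PercentLowIncome": [0.607143, 0.834504, 0.8, 0.415929, 0.796407],
})

# Z-score: (x - mean) / std, applied column by column.
standardized = (features - features.mean()) / features.std(ddof=0)

print(standardized.mean())       # approximately 0 for each column
print(standardized.std(ddof=0))  # approximately 1 for each column
```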

## Preprocessing: Principal Component Analysis (PCA)

## Analysis: Random Forest

## Analysis: Logistic Regression

## Analysis: Mutual Information