#### This data comes from the City of Chicago and covers all public schools in the city. The task is to classify each school as on probation (probation = 1) or not on probation (probation = 0).

There are three ways to approach this: easy, medium, and hard.

Easy: use only the columns that every school has in common and that have no missing data, and build the model on those.

Medium: in addition, fill in the missing values with sensible substitutes in columns that are missing only a little data (such as the set of columns that consistently has 34 missing values).

Hard: build a separate model for each subset of the data. Then either write code that applies a different model to each row based on school type, or split the data into subsets, run each model on its own subset, and join the results back together afterward (sorting if the Id numbers need to be in a particular order).
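
As a rough illustration of the hard option, the sketch below fits one model per school type on a toy frame, then joins the predictions back together in the original row order. The column names `school_type` and `attendance`, the toy values, and the choice of `LogisticRegression` are assumptions made for the example, not the notebook's actual features or final model.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the school data; names and values are illustrative only.
toy = pd.DataFrame(
    {
        "school_type": ["ES", "HS", "ES", "HS", "ES", "HS", "ES", "HS"],
        "attendance": [96.2, 84.8, 91.3, 92.9, 95.0, 81.6, 90.1, 93.5],
        "probation": [0, 1, 1, 0, 0, 1, 1, 0],
    },
    index=[10, 11, 12, 13, 14, 15, 16, 17],
)

predictions = []
for school_type, subset in toy.groupby("school_type"):
    # Each school type gets its own model, so each could use its own feature set.
    model = LogisticRegression()
    model.fit(subset[["attendance"]], subset["probation"])
    predictions.append(
        pd.Series(
            model.predict(subset[["attendance"]]),
            index=subset.index,
            name="predicted",
        )
    )

# Join the per-type predictions and restore the original row order.
combined = pd.concat(predictions).reindex(toy.index)
print(combined)
```

Keeping the original index on each per-type prediction is what makes the final `reindex` enough to put the rows back in order.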

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
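from sklearn.preprocessing import StandardScaler, FunctionTransformer, LabelBinarizer
from sklearn.impute import SimpleImputer  # replaces preprocessing.Imputer, which was removed in scikit-learn 0.22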
from sklearn.pipeline import make_pipeline, make_union
%matplotlib inline

In [2]:
pd.set_option('display.max_columns', 100)
pd.set_option("display.max_rows", 100)

In [3]:
# load the original training data
school_df = pd.read_csv('school_data_training.csv')
school_df.head(20)

Unnamed: 0,Name of School,"Elementary, Middle, or High School",Street Address,ZIP Code,Link,Healthy Schools Certified?,Safety Icon,Safety Score,Family Involvement Icon,Family Involvement Score,Environment Icon,Environment Score,Instruction Icon,Instruction Score,Leaders Icon,Leaders Score,Teachers Icon,Teachers Score,Parent Engagement Icon,Parent Engagement Score,Parent Environment Icon,Parent Environment Score,Average Student Attendance,Rate of Misconducts (per 100 students),Average Teacher Attendance,Individualized Education Program Compliance Rate,Pk-2 Literacy %,Pk-2 Math %,Gr3-5 Grade Level Math %,Gr3-5 Grade Level Read %,Gr3-5 Keep Pace Read %,Gr3-5 Keep Pace Math %,Gr6-8 Grade Level Math %,Gr6-8 Grade Level Read %,Gr6-8 Keep Pace Math%,Gr6-8 Keep Pace Read %,Gr-8 Explore Math %,Gr-8 Explore Read %,ISAT Exceeding Math %,ISAT Exceeding Reading %,ISAT Value Add Math,ISAT Value Add Read,ISAT Value Add Color Math,ISAT Value Add Color Read,Students Taking Algebra %,Students Passing Algebra %,9th Grade EXPLORE (2009),9th Grade EXPLORE (2010),10th Grade PLAN (2009),10th Grade PLAN (2010),Net Change EXPLORE and PLAN,11th Grade Average ACT (2011),Net Change PLAN and ACT,College Eligibility %,Graduation Rate %,College Enrollment Rate %,College Enrollment (number of students),General Services Route,Freshman on Track Rate %,Community Area Number,Community Area Name,Ward,Police District,probation,Id
0,John Spry Elementary Community School,ES,2400 S Marshall Blvd,60623,http://schoolreports.cps.edu/SchoolProgressRep...,No,Strong,66.0,Average,59,Strong,70.0,Strong,67.0,Average,52,Average,43,Weak,46,Average,48,96.2%,5.9,97.4%,99.0%,44.4,12.8,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,18.7,13.3,17.9,7.3,1.7,1.1,Green,Green,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,809,39,NDA,30,SOUTH LAWNDALE,12,10,0,610184
1,Thomas A Edison Regional Gifted Center Element...,ES,4929 N Sawyer Ave,60625,http://schoolreports.cps.edu/SchoolProgressRep...,No,Very Strong,91.0,NDA,NDA,Strong,64.0,Average,56.0,NDA,NDA,NDA,NDA,Strong,55,Strong,57,96.6%,1.9,96.3%,100.0%,NDA,NDA,92,97.8,52.8,57.6,93.1,98.9,55.2,60.4,80,96.7,79.0,88.4,1.8,1.6,Green,Green,31.8,78.6,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,269,31,NDA,14,ALBANY PARK,39,17,0,609794
2,Milton Brunson Math & Science Specialty Elemen...,ES,932 N Central Ave,60651,http://schoolreports.cps.edu/SchoolProgressRep...,No,Weak,30.0,NDA,NDA,Weak,30.0,Average,45.0,NDA,NDA,NDA,NDA,Average,47,Strong,55,91.3%,16.6,95.0%,100.0%,64.4,43.9,21.1,18.7,44.4,36.4,22.6,19,33.6,45.2,3.4,8.6,8.8,5.4,-0.3,-0.6,Yellow,Yellow,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,658,36,NDA,25,AUSTIN,29,15,1,609830
3,Emil G Hirsch Metropolitan High School,HS,7740 S Ingleside Ave,60619,http://schoolreports.cps.edu/SchoolProgressRep...,No,Very Weak,13.0,Average,46,Weak,28.0,Weak,28.0,Average,52,Average,50,NDA,NDA,NDA,NDA,84.8%,47.1,95.4%,88.0%,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,,,,,NDA,NDA,NDA,NDA,11.5,11.2,12.9,12.4,0.9,14.1,1.2,10.8,36.2,45.1,458,46,69.5,69,GREATER GRAND CROSSING,8,6,1,609712
4,Lawndale Elementary Community Academy,ES,3500 W Douglas Blvd,60623,http://schoolreports.cps.edu/SchoolProgressRep...,No,NDA,,NDA,NDA,NDA,,NDA,,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,91.0%,18.8,94.8%,98.0%,43.4,NDA,17.2,10,34,44,21.1,15.9,45.2,47,0,2,3.3,4.0,-2.1,-2.2,Red,Red,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,508,37,NDA,29,NORTH LAWNDALE,24,10,1,610034
5,Dr Martin Luther King Jr College Prep High ...,HS,4445 S Drexel Blvd,60653,http://schoolreports.cps.edu/SchoolProgressRep...,No,NDA,,NDA,NDA,NDA,,NDA,,NDA,NDA,NDA,NDA,Strong,56,Average,52,92.9%,4.4,96.3%,100.0%,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,,,,,NDA,NDA,NDA,NDA,17.4,17.7,17.7,18.1,0.7,20.5,2.8,31.9,75.4,85.1,915,40,91.1,39,KENWOOD,4,2,0,609751
6,William C Reavis Math & Science Specialty Elem...,ES,834 E 50th St,60615,http://schoolreports.cps.edu/SchoolProgressRep...,No,Average,48.0,Very Weak,6,Weak,37.0,Weak,26.0,Very Weak,10,Very Weak,15,Weak,45,Average,47,93.5%,37.2,94.8%,98.2%,47.2,NDA,31,29.5,43.5,51.9,19.1,23.4,38.5,48.4,3.2,16.1,5.8,10.1,-0.7,0.0,Red,Yellow,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,311,42,NDA,39,KENWOOD,4,2,1,610143
7,Alfred Nobel Elementary School,ES,4127 W Hirsch St,60651,http://schoolreports.cps.edu/SchoolProgressRep...,No,Weak,37.0,Weak,34,Weak,37.0,Weak,35.0,Average,48,Weak,36,Average,48,Average,51,94.6%,24.8,95.2%,100.0%,55.4,33.2,26.7,23.2,47.3,50,39.2,28.1,59.5,51,15.2,16.7,9.7,7.1,-0.3,0.7,Yellow,Yellow,23.4,44.4,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,801,34,NDA,23,HUMBOLDT PARK,37,25,0,610098
8,Roger C Sullivan High School,HS,6631 N Bosworth Ave,60626,http://schoolreports.cps.edu/SchoolProgressRep...,No,Weak,30.0,Average,44,Weak,34.0,Weak,34.0,Average,42,Weak,33,Weak,44,Weak,45,81.6%,14.7,95.8%,94.4%,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,,,,,NDA,NDA,NDA,NDA,12.6,13.2,13.8,13.5,0.9,15.4,1.6,14.8,39.3,52.1,826,32,68.6,1,ROGERS PARK,40,24,1,609733
9,Neal F Simeon Career Academy High School,HS,8147 S Vincennes Ave,60620,http://schoolreports.cps.edu/SchoolProgressRep...,No,Average,52.0,NDA,NDA,Weak,33.0,Average,42.0,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,78.6%,2.1,94.3%,100.0%,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,NDA,,,,,NDA,NDA,NDA,NDA,14.3,14.1,15.3,14.8,0.5,16.2,0.9,14.8,75.2,74,1535,45,68.5,44,CHATHAM,21,6,1,609692


In [4]:
school_df.dtypes

Name of School                                        object
Elementary, Middle, or High School                    object
Street Address                                        object
ZIP Code                                               int64
Link                                                  object
Healthy Schools Certified?                            object
Safety Icon                                           object
Safety Score                                         float64
Family Involvement Icon                               object
Family Involvement Score                              object
Environment Icon                                      object
Environment Score                                    float64
Instruction Icon                                      object
Instruction Score                                    float64
Leaders Icon                                          object
Leaders Score                                         object
Teachers Icon           

In [5]:
school_df.shape

(414, 65)

In [6]:
# "NDA" is not being read as null (see the counts below), so it needs to be converted to NaN

In [7]:
school_df.isnull().sum()

Name of School                                        0
Elementary, Middle, or High School                    0
Street Address                                        0
ZIP Code                                              0
Link                                                  0
Healthy Schools Certified?                            0
Safety Icon                                           0
Safety Score                                         34
Family Involvement Icon                               0
Family Involvement Score                              0
Environment Icon                                      0
Environment Score                                    34
Instruction Icon                                      0
Instruction Score                                    34
Leaders Icon                                          0
Leaders Score                                         0
Teachers Icon                                         0
Teachers Score                                  

In [8]:
school_df.replace("NDA", np.nan, inplace=True)
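
An alternative to replacing "NDA" after loading is to declare it as a missing-value marker when reading the file. The sketch below uses a small inline CSV as a stand-in for `school_data_training.csv` (the rows are made up for illustration). It also shows that a column cleaned with `replace` keeps its non-numeric dtype until it is converted with `pd.to_numeric`.

```python
import io

import numpy as np
import pandas as pd

# Small inline CSV standing in for the real file; values are illustrative.
csv_text = """Name of School,Family Involvement Score,Safety Score
School A,59,66.0
School B,NDA,91.0
School C,46,
"""

# Option 1: treat "NDA" as missing while parsing, so the column loads as numeric.
toy = pd.read_csv(io.StringIO(csv_text), na_values=["NDA"])
print(toy.dtypes)
print(toy.isnull().sum())

# Option 2: replace after loading, then convert the column explicitly,
# because replace alone leaves the column holding strings.
raw = pd.read_csv(io.StringIO(csv_text))
raw = raw.replace("NDA", np.nan)
raw["Family Involvement Score"] = pd.to_numeric(raw["Family Involvement Score"])
print(raw.dtypes)
```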

In [9]:
school_df.isnull().sum()

Name of School                                         0
Elementary, Middle, or High School                     0
Street Address                                         0
ZIP Code                                               0
Link                                                   0
Healthy Schools Certified?                             0
Safety Icon                                           34
Safety Score                                          34
Family Involvement Icon                              196
Family Involvement Score                             196
Environment Icon                                      34
Environment Score                                     34
Instruction Icon                                      34
Instruction Score                                     34
Leaders Icon                                         197
Leaders Score                                        197
Teachers Icon                                        197
Teachers Score                 

#### Feature selection: easy option

In [10]:
# Likely candidate columns for the easy option:
# ZIP Code, Healthy Schools Certified?, Average Student Attendance,
# "Rate of Misconducts (per 100 students) ", Average Teacher Attendance,
# "Individualized Education Program Compliance Rate "
# NOTE: I do not want to use Community Area Number,
# Ward, or Police District
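
Rather than reading the candidate columns off the `isnull().sum()` output by eye, they can be listed programmatically. This is a minimal sketch on a toy frame. The cutoff of one missing value for "a little missing data" is a hypothetical threshold chosen for the example; on the real data it would need to be set to match the columns of interest (for instance, the ones with 34 missing).

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values; the real data has 414 rows.
toy = pd.DataFrame(
    {
        "ZIP Code": [60623, 60625, 60651, 60619],
        "Safety Score": [66.0, 91.0, np.nan, 13.0],
        "Leaders Score": [52.0, np.nan, np.nan, 52.0],
    }
)

missing = toy.isnull().sum()

# Easy option: columns with no missing values at all.
complete_cols = missing[missing == 0].index.tolist()

# Medium option: columns missing only a few values (hypothetical cutoff).
max_missing = 1
few_missing_cols = missing[(missing > 0) & (missing <= max_missing)].index.tolist()

print(complete_cols)
print(few_missing_cols)
```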

In [None]:
# Preprocessing that needs to happen:
# ZIP Code: convert to dummy variables
# Healthy Schools Certified?: map No to 0 and Yes to 1
# Average Student Attendance and Average Teacher Attendance: remove the % sign
# "Rate of Misconducts (per 100 students) ": remember the trailing space in the name
# "Individualized Education Program Compliance Rate ": also has a trailing space; remove the % sign
# Community Area Number: dummies
# Ward: dummies
# Police District: dummies
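
The preprocessing steps listed above might look roughly like this on a toy frame. The values are made up, only a few of the columns are shown, and stripping whitespace from the column names is an optional extra step (the notes above only say to remember the trailing space).

```python
import pandas as pd

# Toy rows with illustrative values; column names follow the original data,
# including the trailing space on the misconduct column.
toy = pd.DataFrame(
    {
        "ZIP Code": [60623, 60625, 60623],
        "Healthy Schools Certified?": ["No", "Yes", "No"],
        "Average Student Attendance": ["96.2%", "96.6%", "91.3%"],
        "Rate of Misconducts (per 100 students) ": [5.9, 1.9, 16.6],
    }
)

cleaned = toy.copy()

# Yes/No becomes 1/0.
cleaned["Healthy Schools Certified?"] = cleaned["Healthy Schools Certified?"].map(
    {"No": 0, "Yes": 1}
)

# Drop the percent sign and convert to float.
cleaned["Average Student Attendance"] = (
    cleaned["Average Student Attendance"].str.rstrip("%").astype(float)
)

# One dummy column per ZIP code.
cleaned = pd.get_dummies(cleaned, columns=["ZIP Code"], prefix="zip")

# Optional: strip stray whitespace so trailing spaces in names stop mattering.
cleaned.columns = cleaned.columns.str.strip()

print(cleaned.columns.tolist())
```

The same `get_dummies` call would apply to Community Area Number, Ward, and Police District if those columns end up being used.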



In [25]:
school_df["Police District"].value_counts()

8     33
2     26
7     25
9     25
22    22
19    22
11    22
10    22
25    21
4     20
5     20
6     20
17    18
16    18
14    16
3     14
12    14
15    14
24    10
18    10
13     9
20     8
1      5
Name: Police District, dtype: int64