# Simple Predictions

This notebook converts the application data into a form ready for some simple machine learning tests.

In [4]:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn import metrics
from sklearn import svm
from sklearn.cross_validation import cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import Imputer
from sklearn.tree import DecisionTreeClassifier

By default, the notebook uses the normalized data. Change the filename to `collegedata_unnormalized.csv` to use the unnormalized form.

In [5]:
applications = pd.read_csv("collegedata_normalized.csv")
applications.head(10)

,Unnamed: 0,studentID,classrank,admissionstest,AP,averageAP,SATsubject,GPA,GPA_w,program,...,alumni,outofstate,acceptStatus,acceptProb,name,acceptrate,size,public,finAidPct,instatePct
0,0,PWY05BUB4I,,0.111013,7,0.187427,0.070643,-0.013895,0.005109,Biomedical engineering,...,-1,-1,1,,Rice,0.151,6621,-1,0,0
1,1,3UVDFVI9Z0,,0.035099,7,0.115998,-0.096024,0.036646,0.033998,Classics,...,-1,1,1,,Rice,0.151,6621,-1,0,0
2,2,BCCBHJUP0M,,0.035099,0,,0.070643,0.029426,-0.088225,Biological Science,...,-1,1,-1,,Rice,0.151,6621,-1,0,0
3,3,WZFPWHSQMS,,0.166223,7,0.151713,0.23731,0.007765,-0.032669,Physics,...,-1,1,-1,,Rice,0.151,6621,-1,0,0
4,4,5W1JNQA7G0,,0.048901,1,-0.062573,-0.096024,0.040256,,,...,-1,1,1,,Rice,0.151,6621,-1,0,0
5,5,TWUKL79B6V,,0.048901,0,,-0.096024,0.058307,,Political Science,...,-1,1,0,,Rice,0.151,6621,-1,0,0
6,6,1OJUGUL4LL,,0.09721,0,,-0.096024,0.022206,0.018442,Computer Science,...,-1,1,-1,,Rice,0.151,6621,-1,0,0
7,7,NX2TARIB0P,,-0.296164,3,-0.145906,-0.096024,-0.049996,-0.026002,,...,-1,1,-1,,Rice,0.151,6621,-1,0,0
8,8,N4Y1IOID8K,,0.007493,7,-0.062573,-0.429357,0.036646,-0.08378,Business,...,-1,-1,-1,,Rice,0.151,6621,-1,0,0
9,9,911MU875UY,,-0.006309,4,-7.3e-05,-0.096024,0.058307,0.151775,Computer Science,...,-1,1,1,,Rice,0.151,6621,-1,0,0


In [6]:
applications.columns

Index([u'Unnamed: 0', u'studentID', u'classrank', u'admissionstest', u'AP',
       u'averageAP', u'SATsubject', u'GPA', u'GPA_w', u'program',
       u'schooltype', u'intendedgradyear', u'addInfo', u'canAfford', u'female',
       u'MinorityGender', u'MinorityRace', u'international', u'firstinfamily',
       u'sports', u'artist', u'workexp', u'collegeID', u'earlyAppl',
       u'visited', u'alumni', u'outofstate', u'acceptStatus', u'acceptProb',
       u'name', u'acceptrate', u'size', u'public', u'finAidPct',
       u'instatePct'],
      dtype='object')

Keep only the columns we'll use for prediction. There are no factors (categorical variables) in the current model.

In [7]:
y = np.ravel(applications.acceptStatus)
cols_to_retain = [u'admissionstest', u'AP',
       u'averageAP', u'SATsubject', u'GPA', u'GPA_w',u'schooltype', u'canAfford', u'female',
       u'MinorityGender', u'MinorityRace', u'international',
       u'sports', u'earlyAppl',
       u'alumni', u'outofstate']

applDF = applications[cols_to_retain]
applDF.head()

,admissionstest,AP,averageAP,SATsubject,GPA,GPA_w,schooltype,canAfford,female,MinorityGender,MinorityRace,international,sports,earlyAppl,alumni,outofstate
0,0.111013,7,0.187427,0.070643,-0.013895,0.005109,-1,0,1,-1,-1,-1,-1,-1,-1,-1
1,0.035099,7,0.115998,-0.096024,0.036646,0.033998,-1,0,-1,-1,-1,-1,-1,-1,-1,1
2,0.035099,0,,0.070643,0.029426,-0.088225,1,0,1,-1,1,-1,-1,-1,-1,1
3,0.166223,7,0.151713,0.23731,0.007765,-0.032669,1,0,-1,-1,-1,-1,-1,-1,-1,1
4,0.048901,1,-0.062573,-0.096024,0.040256,,1,0,-1,-1,-1,-1,-1,1,-1,1


Impute missing values, replacing each one with the mean of its column.

In [8]:
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(applDF)
X = imp.transform(applDF)
X.shape, y.shape

((16062, 16), (16062,))
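
To make the imputation step concrete, here is a minimal plain-Python sketch of what a column-wise mean strategy computes. The function name `impute_column_means` and the toy rows are hypothetical, and the sketch assumes every column has at least one observed value.

```python
import math

def impute_column_means(rows):
    """Replace NaN entries with the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        # Only non-missing values contribute to the column mean.
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        means.append(sum(observed) / len(observed))
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in rows]

nan = float("nan")
# Hypothetical toy data, not rows from the application file.
toy = [[1.0, nan],
       [3.0, 4.0],
       [nan, 8.0]]
filled = impute_column_means(toy)
print(filled)
```

In scikit-learn 0.20 and later, `sklearn.impute.SimpleImputer(strategy="mean")` provides this behavior; the `Imputer` class used in this notebook was removed in later releases.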

Split the data into training (80%) and test (20%) sets. There is currently no validation set.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print X_train.shape, X_test.shape, y_train.shape, y_test.shape, y.shape

(12849, 16) (3213, 16) (12849,) (3213,) (16062,)
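
The split above is random and unseeded, so the accuracy figures that follow will vary from run to run. The sketch below shows one way a seeded shuffle split could work in plain Python. The helper `split_indices` is hypothetical and is not scikit-learn's implementation; with `train_test_split` itself, passing a fixed `random_state` serves the same purpose.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=0):
    """Return (train_indices, test_indices) from a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round the test share up, which reproduces the 3213 test rows seen above.
    n_test = int(math.ceil(test_size * n_samples))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(16062, test_size=0.2, seed=0)
print("%d train rows, %d test rows" % (len(train_idx), len(test_idx)))
```

Calling `split_indices` twice with the same seed returns the same partition, which makes later accuracy comparisons repeatable.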


Run an initial logistic regression without any tuning. This is boilerplate for the final training code if we
elect to use scikit-learn.

In [10]:
clf = linear_model.LogisticRegression(C=1000)
clf.fit(X_train,y_train)
predicted = clf.predict(X_test)
print metrics.accuracy_score(y_test, predicted)

0.544973544974
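
An accuracy figure is easier to judge against a trivial baseline. The sketch below computes the accuracy of always predicting the most common training label. The helper name and the toy labels are hypothetical; in the notebook one would pass `y_train` and `y_test` instead.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / float(len(test_labels))

# Hypothetical labels using the same coding as acceptStatus (-1, 0, 1).
toy_train = [-1, -1, -1, 0, 1, 1]
toy_test = [-1, 1, -1, 0]
label, baseline = majority_baseline_accuracy(toy_train, toy_test)
print("always predict %d: accuracy %.2f" % (label, baseline))
```

A model is only useful to the extent that it beats this number on the same test set.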


Only about 54% accuracy, which is not great. Show the fitted coefficients (one per class for each feature).

In [11]:
pd.DataFrame(zip(applDF.columns, np.transpose(clf.coef_)))

,0,1
0,admissionstest,"[-2.70964829055, 1.02888916041, 2.62731288861]"
1,AP,"[-0.0125526092972, -0.0107880513872, 0.0213932..."
2,averageAP,"[-0.451431569105, 0.194880150093, 0.368256779965]"
3,SATsubject,"[0.324437532234, -0.316107891002, -0.165193140..."
4,GPA,"[-2.44417724732, -0.676517016124, 3.93592882358]"
5,GPA_w,"[-0.775291377667, 0.880175549677, 0.27360895452]"
6,schooltype,"[-0.0293820265718, 0.0403076306956, 0.00846334..."
7,canAfford,"[0.0539701818375, -0.0565995741233, -0.0303885..."
8,female,"[-0.0170100071389, -0.0727362252438, 0.0710153..."
9,MinorityGender,"[-0.347965036465, 0.148784702401, 0.272710429176]"
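
Each row above holds three numbers because the fitted model has one coefficient vector per class, and transposing `clf.coef_` regroups those values by feature. The plain-Python sketch below mimics that pairing with three features from the output, rounded to two decimals. The class order `[-1, 0, 1]` is an assumption based on scikit-learn sorting class labels and on the `acceptStatus` values visible earlier.

```python
feature_names = ["admissionstest", "averageAP", "GPA"]
class_labels = [-1, 0, 1]  # assumed order of the fitted classes

# Laid out like clf.coef_: one row per class, one column per feature.
coef = [
    [-2.71, -0.45, -2.44],
    [1.03, 0.19, -0.68],
    [2.63, 0.37, 3.94],
]

# zip(*coef) transposes the matrix, giving one tuple of per-class weights per feature.
per_feature = dict(zip(feature_names, zip(*coef)))

for name in feature_names:
    pairs = ", ".join("%s: %+.2f" % (c, w)
                      for c, w in zip(class_labels, per_feature[name]))
    print("%-15s %s" % (name, pairs))
```

Under that assumed ordering, the large positive `GPA` weight in the last position would push predictions toward class 1.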


Run 10-fold cross-validation with several algorithms and report the mean accuracy of each.

In [12]:
scores = cross_val_score(linear_model.LogisticRegression(), X, y, scoring='accuracy', cv=10)
print scores
print scores.mean()

[ 0.5528607   0.56102117  0.48630137  0.5373599   0.51992528  0.56039851
  0.56039851  0.54358655  0.5373599   0.5392279 ]
0.539843977497
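
The mechanics behind these numbers can be sketched in plain Python: partition the row indices into folds, score each held-out fold, then average. This is a simplified, unstratified version with hypothetical helper names. For classifiers, scikit-learn's `cross_val_score` stratifies the folds by class, which this sketch does not attempt.

```python
def kfold_indices(n_samples, n_folds=10):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_val_mean(score_fold, n_samples, n_folds=10):
    """Score every fold with `score_fold(train, test)` and return (scores, mean)."""
    scores = [score_fold(train, test)
              for train, test in kfold_indices(n_samples, n_folds)]
    return scores, sum(scores) / len(scores)

# Hypothetical scorer: reports the held-out share instead of a model accuracy.
def toy_score(train, test):
    return len(test) / float(len(train) + len(test))

scores, mean_score = cross_val_mean(toy_score, n_samples=16062, n_folds=10)
fold_sizes = [len(test) for _, test in kfold_indices(16062, 10)]
print(fold_sizes)
```

In real use, `score_fold` would fit a model on the `train` rows and return its accuracy on the `test` rows.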


In [13]:
scores = cross_val_score(RandomForestClassifier(), X, y, scoring='accuracy', cv=10)
print scores
print scores.mean()

[ 0.66542289  0.66998755  0.61457036  0.6120797   0.63636364  0.64757161
  0.64757161  0.64134496  0.62826899  0.59589041]
0.635907170251


In [14]:
scores = cross_val_score(DecisionTreeClassifier(), X, y, scoring='accuracy', cv=10)
print scores
print scores.mean()

[ 0.64427861  0.67123288  0.61892902  0.59713574  0.62266501  0.61892902
  0.63013699  0.62328767  0.6139477   0.58655044]
0.622709305279


So an untuned Random Forest reaches about 64% accuracy, compared with about 62% for a single decision tree and 54% for logistic regression.