# Predicting DonorsChoose Project Approval

1. What is the primary metric you care about in this task?  Be sure to state clearly why this is the case.
1. Is the column `teacher_number_of_previously_posted_projects` a good predictor for the approval of the project? Use both a `KNeighborsClassifier` and a `LogisticRegression` model. Which model performs better?  Can you select the best parameters for each?  What is the best penalty for `LogisticRegression`?  Form a pipeline that includes a scaling transformation, and compare the results to a `DummyClassifier`.
2. What are the top 8 states in terms of raw number of approved projects? Which are the lowest?  Show me with a nice barplot.
3. Do these states differ from the top states ranked by proportion of applications approved?  Show me.
4. Does your model improve with the inclusion of the `teacher_prefix` column in a `LogisticRegression` model?
5. What is your best parameter for the training set with these inputs?
6. Construct a feature that is simply `STEM`, which is 1 if a scientific discipline is a part of the `project_subject_subcategories` column, or 0 if not.  Did your model improve?
7. What if you include the `project_grade_category` column?  Is your model improved?
8. What if your client only cares about what's happening in New York?  Is there a difference in the performance of a `LogisticRegression` model? Of a `KNeighborsClassifier`?


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

In [7]:
train = pd.read_csv('data/train.csv', index_col = 'id')
test = pd.read_csv('data/test.csv', index_col='id')



In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 182080 entries, p036502 to p190772
Data columns (total 15 columns):
teacher_id                                      182080 non-null object
teacher_prefix                                  182076 non-null object
school_state                                    182080 non-null object
project_submitted_datetime                      182080 non-null object
project_grade_category                          182080 non-null object
project_subject_categories                      182080 non-null object
project_subject_subcategories                   182080 non-null object
project_title                                   182080 non-null object
project_essay_1                                 182080 non-null object
project_essay_2                                 182080 non-null object
project_essay_3                                 6374 non-null object
project_essay_4                                 6374 non-null object
project_resource_summary               

In [54]:
X = train.teacher_number_of_previously_posted_projects.values.reshape(-1,1)
y = train.project_is_approved
X_train, X_test, y_train, y_test = train_test_split(X,y)

In [67]:
knn = KNeighborsClassifier(n_neighbors=5)
clf = LogisticRegression()

In [68]:
knn.fit(X_train, y_train)
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [60]:
knn_pred = knn.predict(X_test)
clf_pred = clf.predict(X_test)

In [61]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict

In [62]:
from sklearn.dummy import DummyClassifier

In [63]:
dum = DummyClassifier()
dum.fit(X_train, y_train)
dum.score(X_test, y_test)

0.7414543057996485
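
A score near 0.74 is about what label-proportional guessing would give if roughly 85% of projects are approved (in line with the approval rates by prefix reported later in this notebook), since p² + (1 − p)² ≈ 0.74 for p ≈ 0.85. The sketch below uses synthetic data, which is an assumption and not the DonorsChoose file, to contrast the `stratified` strategy with `most_frequent`, which always predicts the majority class and may be the tougher baseline to beat. The names `X_demo`, `y_demo`, and `baseline_scores` are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 85% of labels are 1 (approved).
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 50, size=(4000, 1))
y_demo = (rng.rand(4000) < 0.85).astype(int)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0
)

# Score each baseline strategy on the same held-out split.
baseline_scores = {}
for strategy in ("stratified", "most_frequent"):
    baseline = DummyClassifier(strategy=strategy, random_state=0)
    baseline.fit(Xd_train, yd_train)
    baseline_scores[strategy] = baseline.score(Xd_test, yd_test)

print(baseline_scores)
```

If the majority-class baseline lands close to the model scores, accuracy alone says little about whether the feature carries any signal.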

In [64]:
from sklearn.metrics import mean_squared_error

In [69]:
# mean_squared_error expects (y_true, y_pred); on 0/1 labels the MSE is the error rate.
knn_rmse = np.sqrt(mean_squared_error(y_test, knn_pred))
clf_rmse = np.sqrt(mean_squared_error(y_test, clf_pred))
print("score for knn is", knn.score(X_test, y_test))
print("score for Logistic Regressions is", clf.score(X_test, y_test))
print("score for Dummy is", dum.score(X_test, y_test))

score for knn is 0.8444200351493849
score for Logistic Regressions is 0.8449253075571177
score for Dummy is 0.7391695957820739
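
The task also asks for tuned parameters, a choice of penalty, and a scaling step. A minimal sketch of how that might be set up with `Pipeline` and `GridSearchCV` follows. It runs on synthetic data (an assumption), and the parameter grids are arbitrary illustrative choices, not values taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: one count-like feature, label loosely tied to it.
rng = np.random.RandomState(0)
X_demo = rng.poisson(10, size=(1500, 1)).astype(float)
y_demo = (rng.rand(1500) < 0.6 + 0.01 * X_demo[:, 0]).astype(int)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each CV training fold.
logit_search = GridSearchCV(
    Pipeline([("scale", StandardScaler()),
              ("model", LogisticRegression(solver="liblinear"))]),
    param_grid={"model__penalty": ["l1", "l2"],
                "model__C": [0.01, 0.1, 1, 10]},
    cv=5,
)
knn_search = GridSearchCV(
    Pipeline([("scale", StandardScaler()),
              ("model", KNeighborsClassifier())]),
    param_grid={"model__n_neighbors": [3, 5, 11, 21]},
    cv=5,
)

for name, search in [("logistic", logit_search), ("knn", knn_search)]:
    search.fit(Xd_train, yd_train)
    print(name, search.best_params_, round(search.score(Xd_test, yd_test), 3))
```

With the notebook's data, `X_train` and `y_train` would take the place of the synthetic arrays.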


# Without optimizing the classifiers, both models beat the dummy (about 0.84 vs. 0.74 accuracy) and are nearly indistinguishable from each other.  I'm moving on to task 2.


In [49]:
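# project_is_approved is 0/1, so summing per state counts the approved projects.
best_states = train.groupby(train.school_state)
approved_sum = best_states.project_is_approved.sum().sort_values(ascending=False)
# Keep the 8 states with the most approvals, as the task asks.
approved_sum = approved_sum.nlargest(8)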
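
One way the counts and proportions for tasks 2 and 3 might be plotted side by side is sketched below. The `toy` frame is hypothetical data with the same two column names; the plot uses pandas' matplotlib-backed `plot.bar`, which is a choice made here for brevity rather than anything the notebook prescribes.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy frame with the two columns used in the groupby.
toy = pd.DataFrame({
    "school_state": ["CA", "CA", "CA", "CA", "NY", "NY", "TX", "TX", "TX", "VT"],
    "project_is_approved": [1, 1, 1, 0, 1, 1, 1, 0, 0, 0],
})

by_state = toy.groupby("school_state")["project_is_approved"]
approved_count = by_state.sum().nlargest(8)                    # raw number approved
approved_share = by_state.mean().sort_values(ascending=False)  # proportion approved

# Two panels: ranking by count on the left, by proportion on the right.
fig, (ax_count, ax_share) = plt.subplots(1, 2, figsize=(10, 4))
approved_count.plot.bar(ax=ax_count, title="Approved projects (count)")
approved_share.plot.bar(ax=ax_share, title="Approved projects (proportion)")
fig.tight_layout()
```

In the toy data the state with the most approvals is not the state with the highest approval rate, which is the kind of difference task 3 asks about. `sns.barplot(x=approved_count.index, y=approved_count.values)` would be a seaborn alternative for a single panel.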

In [36]:
import seaborn as sns

## I couldn't get the barplot working quickly, so I'm skipping it for now.  The cells below add the teacher prefix as a feature for the model.

In [70]:
train.teacher_prefix.value_counts()

Mrs.       95405
Ms.        65066
Mr.        17667
Teacher     3912
Dr.           26
Name: teacher_prefix, dtype: int64

In [71]:
train.groupby('teacher_prefix')['project_is_approved'].mean()

teacher_prefix
Dr.        0.807692
Mr.        0.842022
Mrs.       0.854085
Ms.        0.843052
Teacher    0.794223
Name: project_is_approved, dtype: float64

In [72]:
p2 = train[['teacher_prefix', 'teacher_number_of_previously_posted_projects']]

In [75]:
X2 = pd.get_dummies(p2, drop_first=True)

In [76]:
X2.head()

id,teacher_number_of_previously_posted_projects,teacher_prefix_Dr.,teacher_prefix_Mr.,teacher_prefix_Mrs.,teacher_prefix_Ms.,teacher_prefix_Teacher
p036502,26,0,0,0,1,0
p039565,1,0,0,1,0,0
p233823,5,0,0,0,1,0
p185307,16,0,1,0,0,0
p013780,42,0,1,0,0,0
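
To make the encoding step concrete, here is a small sketch on a hypothetical five-row frame showing what `pd.get_dummies` produces with and without `drop_first`, and how a missing prefix is handled. The frame `toy` and the names `full` and `reduced` are illustrative.

```python
import pandas as pd

# Hypothetical rows; one prefix is missing, as a few are in the real data.
toy = pd.DataFrame({
    "teacher_prefix": ["Ms.", "Mrs.", "Mr.", None, "Dr."],
    "teacher_number_of_previously_posted_projects": [26, 1, 5, 16, 42],
})

full = pd.get_dummies(toy)                      # one column per prefix
reduced = pd.get_dummies(toy, drop_first=True)  # first level ("Dr.") dropped

print(list(full.columns))
print(list(reduced.columns))

# A row with a missing prefix gets 0/False in every dummy column.
print(reduced.filter(like="teacher_prefix_").iloc[3])
```

Dropping one level avoids a redundant column, because the dropped prefix is implied when all remaining dummies are zero. Note that a missing prefix then looks the same as the dropped level.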


##  Now that the prefix has been dummy-encoded, the next step is to repeat the split, fit, and score steps from above.

In [77]:
X2_train, X2_test, y_train, y_test = train_test_split(X2, y)

In [78]:
clf.fit(X2_train, y_train)
knn.fit(X2_train, y_train)
dum.fit(X2_train, y_train)

DummyClassifier(constant=None, random_state=None, strategy='stratified')

In [81]:
print(clf.score(X2_test, y_test))
print(knn.score(X2_test, y_test))
print(dum.score(X2_test, y_test))

0.8490553602811951
0.7838532513181019
0.7417398945518453
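
Because `train_test_split` shuffles differently on each call unless `random_state` is set, small differences between runs may partly reflect the split and not the added feature. A minimal sketch of a helper that fixes the split and scores several models at once is shown below; `fit_and_score` is a hypothetical name, and the data is synthetic.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def fit_and_score(X, y, models, random_state=0):
    """Split once with a fixed seed, then fit and score every model on that split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=random_state)
    return {name: model.fit(X_tr, y_tr).score(X_te, y_te)
            for name, model in models.items()}


# Synthetic stand-in data, used only so the sketch runs on its own.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(600, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=600) > 0).astype(int)

models = {
    "logistic": LogisticRegression(),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "dummy": DummyClassifier(strategy="stratified", random_state=0),
}
scores = fit_and_score(X_demo, y_demo, models)
print(scores)
```

With the notebook's data this could be called as `fit_and_score(X2, y, models)`, and again for each new feature set, so every comparison uses the same rows for testing.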


# Creating a STEM feature 

In [82]:
train.project_subject_categories.value_counts()

Literacy & Language                           39257
Math & Science                                28555
Literacy & Language, Math & Science           24499
Health & Sports                               16951
Music & The Arts                               8527
Special Needs                                  7065
Literacy & Language, Special Needs             6685
Applied Learning                               6310
Math & Science, Literacy & Language            3843
Applied Learning, Literacy & Language          3725
History & Civics                               3065
Math & Science, Special Needs                  3010
Literacy & Language, Music & The Arts          2878
Math & Science, Music & The Arts               2761
Applied Learning, Special Needs                2481
Health & Sports, Special Needs                 2368
History & Civics, Literacy & Language          2288
Warmth, Care & Hunger                          2191
Math & Science, Applied Learning               2071
Applied Lear

## The above shows that the string "Math & Science" is what distinguishes a STEM category

In [85]:
train['STEM'] = train.project_subject_categories.str.contains('Math & Science')
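
The task describes `STEM` as 1 or 0, while `str.contains` returns booleans. A small sketch on hypothetical rows shows the same substring match converted to integers; `regex=False` is added here so the pattern is treated as a literal string.

```python
import pandas as pd

# Hypothetical rows using category strings that appear in the counts above.
toy = pd.DataFrame({
    "project_subject_categories": [
        "Literacy & Language",
        "Math & Science",
        "Literacy & Language, Math & Science",
        "Health & Sports",
    ]
})

# Literal substring match; astype(int) maps True/False to 1/0.
toy["STEM"] = (
    toy["project_subject_categories"]
    .str.contains("Math & Science", regex=False)
    .astype(int)
)
print(toy["STEM"].tolist())  # [0, 1, 1, 0]
```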

In [86]:
p3 = train[['teacher_prefix', 'teacher_number_of_previously_posted_projects', 'STEM']]

In [87]:
X3 = pd.get_dummies(p3, drop_first=True)

In [88]:
X3.head()

id,teacher_number_of_previously_posted_projects,STEM,teacher_prefix_Mr.,teacher_prefix_Mrs.,teacher_prefix_Ms.,teacher_prefix_Teacher
p036502,26,False,0,0,1,0
p039565,1,False,0,1,0,0
p233823,5,True,0,0,1,0
p185307,16,False,1,0,0,0
p013780,42,False,1,0,0,0


##  Now that the STEM feature has been added, the next step is to repeat the split, fit, and score steps from above.

In [89]:
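# Split X3 (which includes the STEM column), not X2; the scores recorded below came from a run that split X2.
X3_train, X3_test, y_train, y_test = train_test_split(X3, y)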

In [90]:
clf.fit(X3_train, y_train)
knn.fit(X3_train, y_train)
dum.fit(X3_train, y_train)

DummyClassifier(constant=None, random_state=None, strategy='stratified')

In [91]:
print(clf.score(X3_test, y_test))
print(knn.score(X3_test, y_test))
print(dum.score(X3_test, y_test))

0.8473637961335677
0.8407513181019333
0.7449033391915642
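
For the remaining tasks (adding `project_grade_category` and restricting to New York), a minimal sketch of one possible approach follows. Everything in it is an assumption made for illustration: the data is synthetic, the grade labels `"A"` to `"D"` are placeholders rather than the real category values, and cross-validated accuracy is used in place of a single split.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with the notebook's column names; values are placeholders.
rng = np.random.RandomState(0)
n = 1200
toy = pd.DataFrame({
    "school_state": rng.choice(["NY", "CA", "TX"], size=n),
    "project_grade_category": rng.choice(["A", "B", "C", "D"], size=n),
    "teacher_number_of_previously_posted_projects": rng.poisson(8, size=n),
    "project_is_approved": (rng.rand(n) < 0.85).astype(int),
})

features = ["project_grade_category",
            "teacher_number_of_previously_posted_projects"]
ny = toy[toy["school_state"] == "NY"]  # rows for New York only

results = {}
for label, frame in [("all states", toy), ("NY only", ny)]:
    # dtype=int keeps every encoded column numeric for scikit-learn.
    X = pd.get_dummies(frame[features], drop_first=True, dtype=int)
    y = frame["project_is_approved"]
    for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                        ("knn", KNeighborsClassifier(n_neighbors=5))]:
        results[(label, name)] = cross_val_score(model, X, y, cv=5).mean()

for key, acc in results.items():
    print(key, round(acc, 3))
```

On the real data, `train` would replace `toy`, and the smaller New York subset would likely give noisier scores than the full set.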
