# Capstone Presentation

## Find a dataset of interest

I will be investigating a Kaggle dataset gathered from a [Speed Dating Experiment](https://www.kaggle.com/annavictoria/speed-dating-experiment). It was compiled by two Columbia Business School professors for their paper "Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment", which set out to understand what influences "love at first sight".

Data was gathered from participants in experimental speed dating events from 2002-2004. During the events, the attendees would have a 4-minute "first date" with every other participant of the opposite sex. At the end of their 4 minutes, participants were asked to rate their date on 6 attributes: 
- Attractiveness
- Sincerity
- Intelligence
- Fun
- Ambition
- Shared Interests

They were also asked if they would like to see their date again.

The dataset also includes questionnaire data gathered from participants at different points in the process (i.e. demographics, dating habits, self-perception across key attributes, beliefs on what others find valuable in a mate, and lifestyle information). 

## Explore the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import ensemble
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split, cross_val_score
from imblearn.over_sampling import SMOTE

raw_data = pd.read_csv('./data/speed_dating.csv', encoding="ISO-8859-1")
print(raw_data.shape[0], 'Rows')
print(raw_data.shape[1], 'Columns')
raw_data.head()

8378 Rows
195 Columns


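|   | iid | id | gender | idg | condtn | wave | round | position | positin1 | order | ... | attr3_3 | sinc3_3 | intel3_3 | fun3_3 | amb3_3 | attr5_3 | sinc5_3 | intel5_3 | fun5_3 | amb5_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 4 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 3 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 10 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 5 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 4 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 7 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |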


In [2]:
def get_col_descriptions(df):
    """Print the number of unique values and, where present, the NaN count and share for each column."""
    for col in df.columns:
        # unique() counts NaN as a value of its own
        print('*', col, '--', len(df[col].unique()), 'Unique values')

        num_nans = df[col].isnull().sum()
        if num_nans > 0:
            print('     # NaNs:', num_nans, '-', round(num_nans/df.shape[0] * 100, 2), '% NaN')
            

get_col_descriptions(raw_data)

* iid -- 551 Unique values
* id -- 23 Unique values
     # NaNs: 1 - 0.01 % NaN
* gender -- 2 Unique values
* idg -- 44 Unique values
* condtn -- 2 Unique values
* wave -- 21 Unique values
* round -- 15 Unique values
* position -- 22 Unique values
* positin1 -- 23 Unique values
     # NaNs: 1846 - 22.03 % NaN
* order -- 22 Unique values
* partner -- 22 Unique values
* pid -- 552 Unique values
     # NaNs: 10 - 0.12 % NaN
* match -- 2 Unique values
* int_corr -- 156 Unique values
     # NaNs: 158 - 1.89 % NaN
* samerace -- 2 Unique values
* age_o -- 25 Unique values
     # NaNs: 104 - 1.24 % NaN
* race_o -- 6 Unique values
     # NaNs: 73 - 0.87 % NaN
* pf_o_att -- 95 Unique values
     # NaNs: 89 - 1.06 % NaN
* pf_o_sin -- 79 Unique values
     # NaNs: 89 - 1.06 % NaN
* pf_o_int -- 66 Unique values
     # NaNs: 89 - 1.06 % NaN
* pf_o_fun -- 72 Unique values
     # NaNs: 98 - 1.17 % NaN
* pf_o_amb -- 83 Unique values
     # NaNs: 107 - 1.28 % NaN
* pf_o_sha -- 86 Unique values
     # Na

In [3]:
df = raw_data.copy()

# Column groups, selected by position in the raw file
my_prefs = list(df.columns[69:75])        # attr1_1 ... shar1_1
partners_prefs = list(df.columns[17:23])  # pf_o_att ... pf_o_sha
me_rated = list(df.columns[24:28])        # attr_o ... fun_o (slice could extend to 30)
partner_rated = list(df.columns[98:102])  # attr ... fun (slice could extend to 104)

# Rescale preference weights (divide by 100) and ratings (divide by 10) to a 0-1 range
df[my_prefs] = (df[my_prefs] / 100).round(2)
df[partners_prefs] = (df[partners_prefs] / 100).round(2)

df[me_rated] = (df[me_rated] / 10).round(2)
df[partner_rated] = (df[partner_rated] / 10).round(2)

cols_of_interest = ['match'] + my_prefs + partners_prefs + me_rated + partner_rated
df = df[cols_of_interest]

df.head()

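|   | match | attr1_1 | sinc1_1 | intel1_1 | fun1_1 | amb1_1 | shar1_1 | pf_o_att | pf_o_sin | pf_o_int | ... | pf_o_amb | pf_o_sha | attr_o | sinc_o | intel_o | fun_o | attr | sinc | intel | fun |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.15 | 0.2 | 0.2 | 0.15 | 0.15 | 0.15 | 0.35 | 0.2 | 0.2 | ... | 0.0 | 0.05 | 0.6 | 0.8 | 0.8 | 0.8 | 0.6 | 0.9 | 0.7 | 0.7 |
| 1 | 0 | 0.15 | 0.2 | 0.2 | 0.15 | 0.15 | 0.15 | 0.6 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.7 | 0.8 | 1.0 | 0.7 | 0.7 | 0.8 | 0.7 | 0.8 |
| 2 | 1 | 0.15 | 0.2 | 0.2 | 0.15 | 0.15 | 0.15 | 0.19 | 0.18 | 0.19 | ... | 0.14 | 0.12 | 1.0 | 1.0 | 1.0 | 1.0 | 0.5 | 0.8 | 0.9 | 0.8 |
| 3 | 1 | 0.15 | 0.2 | 0.2 | 0.15 | 0.15 | 0.15 | 0.3 | 0.05 | 0.15 | ... | 0.05 | 0.05 | 0.7 | 0.8 | 0.9 | 0.8 | 0.7 | 0.6 | 0.8 | 0.7 |
| 4 | 1 | 0.15 | 0.2 | 0.2 | 0.15 | 0.15 | 0.15 | 0.3 | 0.1 | 0.2 | ... | 0.1 | 0.2 | 0.8 | 0.7 | 0.9 | 0.6 | 0.5 | 0.6 | 0.7 | 0.7 |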


In [4]:
df = df.dropna(axis=0)
print(df.shape[0], 'rows left')
df.head()

7378 rows left


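|   | match | attr1_1 | sinc1_1 | intel1_1 | fun1_1 | amb1_1 | shar1_1 | pf_o_att | pf_o_sin | pf_o_int | ... | pf_o_amb | pf_o_sha | attr_o | sinc_o | intel_o | fun_o | attr | sinc | intel | fun |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.15 | 0.2 | 0.2 | 0.15 | 0.15 | 0.15 | 0.35 | 0.2 | 0.2 | ... | 0.0 | 0.05 | 0.6 | 0.8 | 0.8 | 0.8 | 0.6 | 0.9 | 0.7 | 0.7 |
| 1 | 0 | 0.15 | 0.2 | 0.2 | 0.15 | 0.15 | 0.15 | 0.6 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.7 | 0.8 | 1.0 | 0.7 | 0.7 | 0.8 | 0.7 | 0.8 |
| 2 | 1 | 0.15 | 0.2 | 0.2 | 0.15 | 0.15 | 0.15 | 0.19 | 0.18 | 0.19 | ... | 0.14 | 0.12 | 1.0 | 1.0 | 1.0 | 1.0 | 0.5 | 0.8 | 0.9 | 0.8 |
| 3 | 1 | 0.15 | 0.2 | 0.2 | 0.15 | 0.15 | 0.15 | 0.3 | 0.05 | 0.15 | ... | 0.05 | 0.05 | 0.7 | 0.8 | 0.9 | 0.8 | 0.7 | 0.6 | 0.8 | 0.7 |
| 4 | 1 | 0.15 | 0.2 | 0.2 | 0.15 | 0.15 | 0.15 | 0.3 | 0.1 | 0.2 | ... | 0.1 | 0.2 | 0.8 | 0.7 | 0.9 | 0.6 | 0.5 | 0.6 | 0.7 | 0.7 |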


In [5]:
matches = df[df['match'] == 1]
non_matches = df[df['match'] == 0]

num_matches = matches.shape[0]
num_non_matches = non_matches.shape[0]
percent_majority = round(num_non_matches / df.shape[0] * 100, 2)

print(('{} matches; {} non-matches\n{}% non-matches').format(num_matches, num_non_matches, percent_majority))

1288 matches; 6090 non-matches
82.54% non-matches
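
The class split above sets a floor for accuracy: a model that always predicts "no match" would already be right about 82.5% of the time on this data, while finding none of the matches. A minimal sketch of that arithmetic, using the counts printed above (the variable names are illustrative and not part of the notebook):

```python
# Counts taken from the output of cell In [5].
num_matches = 1288
num_non_matches = 6090
total = num_matches + num_non_matches

# Accuracy of a trivial classifier that always predicts the majority class (0 = no match).
majority_baseline = num_non_matches / total

# Recall on the match class for that same trivial classifier: it never predicts a match.
minority_recall = 0 / num_matches

print(f"Majority-class baseline accuracy: {majority_baseline:.2%}")
print(f"Recall on matches for that baseline: {minority_recall:.2%}")
```

Accuracy figures for any fitted model are best read against this floor, alongside precision and recall on the match class.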


## Must Over-Sample

In [6]:
X = df.loc[:, df.columns != 'match']
Y = df['match']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=1)

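# Oversample only the training split so the test set keeps the original class balance.
# imbalanced-learn releases before 0.4 spell these `ratio=` and `fit_sample`.
sm = SMOTE(random_state=1, sampling_strategy=1.0)
X_train_res, Y_train_res = sm.fit_resample(X_train, Y_train)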
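
SMOTE balances the training set by synthesizing new minority-class rows rather than duplicating existing ones: each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbours. The following is a simplified pure-Python sketch of that interpolation idea, not the imblearn implementation; the function name `smote_like_samples`, the toy points, and the choice of `k` are all assumptions made for illustration.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=1):
    """Generate n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (simplified SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment between base and neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class with two features, both already scaled to [0, 1]
minority_points = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4)]
new_points = smote_like_samples(minority_points, n_new=6)
print(new_points)
```

Because every synthetic point is an interpolation, it stays inside the region already covered by the minority class. The notebook resamples only the training split, so the test set keeps its original class balance.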



## Model outcome of interest

You should try several different approaches and really work to tune a variety of models before using the model evaluation techniques to choose what you consider to be the best performer. Make sure to think about explanatory versus predictive power and experiment with both.

In [7]:
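def fit_and_train(model):
    """Fit on the SMOTE-resampled training data, then report scores and error rates on the original splits."""
    model_fit = model.fit(X_train_res, Y_train_res)
    # For classifiers, .score() returns mean accuracy (not R²), so the two values
    # printed below are accuracies on the original, un-resampled train and test splits.
    model_score_train = model.score(X_train, Y_train)
    print('R² for train:', model_score_train)

    model_score_test = model.score(X_test, Y_test)
    print('R² for test:', model_score_test)

    # Confusion matrix on the test set, with row and column totals under 'All'
    test_crosstab = pd.crosstab(Y_test, model_fit.predict(X_test), rownames=['actual'], colnames=['predicted'], margins=True)
    print('\n', test_crosstab)

    # Type I = false positives, Type II = false negatives, each as a share of all test rows
    tI_errors = test_crosstab.loc[0,1] / test_crosstab.loc['All','All'] * 100
    tII_errors = test_crosstab.loc[1,0] / test_crosstab.loc['All','All'] * 100
    print(('\nType I errors: {}%\nType II errors: {}%\n').format(round(tI_errors, 2), round(tII_errors, 2)))

    # Precision = true positives / predicted positives; recall = true positives / actual positives
    precision = test_crosstab.loc[1,1] / test_crosstab.loc['All', 1] * 100
    recall = test_crosstab.loc[1,1] / test_crosstab.loc[1,'All'] * 100
    print(('Precision: {}%\nRecall: {}%').format(round(precision, 2), round(recall, 2)))

    # Linear models expose their fitted weights
    if hasattr(model_fit, 'coef_'):
        print('\nCoefficients:', model_fit.coef_)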

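# The L1 penalty needs a solver that supports it; lbfgs (the default since scikit-learn 0.22) does not
lasso = linear_model.LogisticRegression(penalty='l1', C=10, solver='liblinear')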
fit_and_train(lasso)

R² for train: 0.7382974878004699
R² for test: 0.7447154471544716

 predicted     0    1   All
actual                    
0          1121  387  1508
1            84  253   337
All        1205  640  1845

Type I errors: 20.98%
Type II errors: 4.55%

Precision: 39.53%
Recall: 75.07%

Coefficients: [[-1.31550779 -1.36787597  1.04426665  0.77961095 -1.2913668   0.39804416
  -2.17687636 -1.4895599   0.         -0.81759189 -1.62205306 -1.22683141
   3.32755173 -0.3695101   0.52404822  3.02621084  3.35707152  0.0231907
   0.1959008   2.83009353]]
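
The figures printed above can be re-derived by hand from the crosstab. A short sketch using its four cell counts (variable names are illustrative): it also shows that the value labelled R² is the share of correct predictions, which is what scikit-learn's `score` method returns for classifiers.

```python
# Cell counts from the test-set crosstab for the L1 logistic regression
tn, fp = 1121, 387   # actual 0: predicted 0, predicted 1
fn, tp = 84, 253     # actual 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total      # share of correct predictions
type_1 = fp / total * 100         # false positives as % of all test rows
type_2 = fn / total * 100         # false negatives as % of all test rows
precision = tp / (tp + fp) * 100  # of predicted matches, how many were real
recall = tp / (tp + fn) * 100     # of real matches, how many were found

print(round(accuracy, 4), round(type_1, 2), round(type_2, 2))
print(round(precision, 2), round(recall, 2))
```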


In [8]:
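# loss is left at its default (logistic loss); the old name 'deviance' is no longer accepted by recent scikit-learn releases
gbm = ensemble.GradientBoostingClassifier(n_estimators=500, max_depth=2)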
fit_and_train(gbm)

R² for train: 0.872582685703958
R² for test: 0.8525745257452575

 predicted     0    1   All
actual                    
0          1427   81  1508
1           191  146   337
All        1618  227  1845

Type I errors: 4.39%
Type II errors: 10.35%

Precision: 64.32%
Recall: 43.32%


In [9]:
svm = SVC(kernel='linear', probability=True)
fit_and_train(svm)

R² for train: 0.731610337972167
R² for test: 0.7338753387533875

 predicted     0    1   All
actual                    
0          1097  411  1508
1            80  257   337
All        1177  668  1845

Type I errors: 22.28%
Type II errors: 4.34%

Precision: 38.47%
Recall: 76.26%

Coefficients: [[-0.88183812 -1.08612012  1.15350421  0.75641592 -0.51531128  0.38501539
  -0.9029509  -0.41569176  1.13555936  0.37477409 -0.3715132  -0.29081559
   2.66566971 -0.25432352  0.26543576  2.68679467  2.73443596  0.10990641
   0.08485801  2.41392559]]


In [10]:
rfc = ensemble.RandomForestClassifier()
fit_and_train(rfc)

R² for train: 0.9931321163925537
R² for test: 0.8238482384823849

 predicted     0    1   All
actual                    
0          1403  105  1508
1           220  117   337
All        1623  222  1845

Type I errors: 5.69%
Type II errors: 11.92%

Precision: 52.7%
Recall: 34.72%
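
With four fitted models, the precision/recall trade-off is easier to compare when collapsed into a single number. One common choice, which the notebook itself does not compute, is the F1 score (the harmonic mean of precision and recall). A sketch using the test-set crosstab counts printed above; the dictionary layout and model labels are assumptions made for illustration:

```python
# (false positives, false negatives, true positives) from each test-set crosstab above
confusion = {
    "logistic (L1)": (387, 84, 253),
    "gradient boosting": (81, 191, 146),
    "linear SVM": (411, 80, 257),
    "random forest": (105, 220, 117),
}

f1_scores = {}
for name, (fp, fn, tp) in confusion.items():
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall
    f1_scores[name] = 2 * precision * recall / (precision + recall)

# Print models from highest to lowest F1
for name, f1 in sorted(f1_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:18s} F1 = {f1:.3f}")
```

On these counts the logistic regression, gradient boosting, and linear SVM land within about 0.01 of each other (roughly 0.51 to 0.52), while the random forest trails at about 0.42.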


In [None]:
bnb = BernoulliNB()
fit_and_train(bnb)

## Deliverables

Prepare a slide deck and 15-minute presentation that guides viewers through your model. Be sure to cover a few specific things:

- A specified research question your model addresses
- How you chose your model specification and what alternatives you compared it to
- The practical uses of your model for an audience of interest
- Any weak points or shortcomings of your model

You'll be presenting this slide deck live to a group as the culmination of your work in the last 2 supervised learning units. As a secondary matter, your slides and/or the Jupyter notebook you use or adapt them into should be worthy of inclusion as examples of your work product when applying to jobs.