# Capstone Presentation

__First: Go out and find a dataset of interest. It could be from one of our recommended resources, some other aggregation, or scraped yourself. Just make sure it has lots of variables in it, including an outcome of interest to you.__

I will be investigating a Kaggle dataset gathered from a [Speed Dating Experiment](https://www.kaggle.com/annavictoria/speed-dating-experiment). It was compiled by two Columbia Business School professors for their paper "Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment", which they wrote to understand what influences "love at first sight".

Data was gathered from participants in experimental speed dating events held from 2002 to 2004. During each event, attendees had a four-minute "first date" with every participant of the opposite sex. At the end of each date, participants rated their partner on six attributes:
- Attractiveness
- Sincerity
- Intelligence
- Fun
- Ambition
- Shared Interests

They were also asked if they would like to see their date again.

The dataset also includes questionnaire data gathered from participants at different points in the process, covering demographics, dating habits, self-perception across key attributes, beliefs about what others find valuable in a mate, and lifestyle information.

__Second: Explore the data. Get to know the data. Spend time going over its quirks. You should understand how it was gathered, what's in it, and what the variables look like.__

In [18]:
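import pandas as pd
import numpy as np

# Load the Kaggle CSV; the encoding is set explicitly to ISO-8859-1.
df = pd.read_csv('./data/speed_dating.csv', encoding='ISO-8859-1')
print(df.shape[0], 'Rows,', df.shape[1], 'Columns')

# Only the last expression in a cell is displayed, so the output below
# is the partner IDs (pid) for wave 20, not df.head().
df.head()
df[df['wave'] == 20]['pid']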

8378 Rows, 195 Columns


7326    502.0
7327    503.0
7328    504.0
7329    505.0
7330    506.0
7331    507.0
7332    508.0
7333    502.0
7334    503.0
7335    504.0
7336    505.0
7337    506.0
7338    507.0
7339    508.0
7340    502.0
7341    503.0
7342    504.0
7343    505.0
7344    506.0
7345    507.0
7346    508.0
7347    502.0
7348    503.0
7349    504.0
7350    505.0
7351    506.0
7352    507.0
7353    508.0
7354    502.0
7355    503.0
        ...  
7380    496.0
7381    497.0
7382    498.0
7383    499.0
7384    500.0
7385    501.0
7386    496.0
7387    497.0
7388    498.0
7389    499.0
7390    500.0
7391    501.0
7392    496.0
7393    497.0
7394    498.0
7395    499.0
7396    500.0
7397    501.0
7398    496.0
7399    497.0
7400    498.0
7401    499.0
7402    500.0
7403    501.0
7404    496.0
7405    497.0
7406    498.0
7407    499.0
7408    500.0
7409    501.0
Name: pid, Length: 84, dtype: float64
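
The output above is consistent with one row per participant per date. Wave 20 appears to pair participants 496-501 with participants 502-508, so 6 x 7 dates recorded from both sides would give the 84 rows reported. A minimal sketch of that layout, using only the ID ranges visible above (the helper name `build_date_rows` is hypothetical and not part of the dataset or notebook):

```python
from itertools import product

def build_date_rows(group_a, group_b):
    """Return one (iid, pid) row per participant per date.

    Every member of group_a meets every member of group_b, and each
    meeting is recorded twice: once from each participant's side.
    """
    rows = [(iid, pid) for iid, pid in product(group_a, group_b)]
    rows += [(iid, pid) for iid, pid in product(group_b, group_a)]
    return rows

# ID ranges read off the wave 20 output above. This is an assumption,
# since the middle of that output is elided.
group_a = range(496, 502)   # 6 participants
group_b = range(502, 509)   # 7 participants

rows = build_date_rows(group_a, group_b)
print(len(rows), 'rows')               # 6 * 7 * 2 = 84
print([pid for _, pid in rows[:7]])    # partners of the first participant
```

On the real data, something like `df[df['wave'] == 20].groupby('iid').size()` would be one way to check how many dates each participant actually had.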

In [2]:
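def get_col_descriptions(df):
    """Print unique-value and missing-value summaries for every column."""
    for col in df.columns:
        print('--', col, '--')
        # dropna=False counts NaN as a value, matching len(df[col].unique())
        n_unique = df[col].nunique(dropna=False)

        if col not in ('iid', 'pid'):
            # Frequency of each value, ordered by value
            print(n_unique, 'Unique values:', df[col].value_counts().sort_index())
        else:
            # ID columns: list the IDs themselves instead of their counts
            print(n_unique, 'Unique values:', df[col].unique())

        num_nans = df[col].isnull().sum()
        if num_nans > 0:
            print('# NaNs:', num_nans, '-', round(num_nans / df.shape[0] * 100, 2), '% NaN')

        print('\n')

get_col_descriptions(df)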

-- iid --
551 Unique values: [  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
 109 110 111 112 113 114 115 116 117 119 120 121 122 123 124 125 126 127
 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163
 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181
 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199
 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217
 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235
 236 237 238 239 240 2
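
Printed output for all 195 columns is hard to scan. As an alternative sketch, the same counts can be collected into a single table and sorted by missingness. The function name `summarize_columns`, the toy frame, and its `rating` and `decision` columns are illustrative assumptions, not part of the notebook or the dataset:

```python
import numpy as np
import pandas as pd

def summarize_columns(df):
    """Return one row per column with unique and missing-value counts."""
    summary = pd.DataFrame({
        'n_unique': df.nunique(dropna=False),   # NaN counted as a value
        'n_missing': df.isna().sum(),
    })
    summary['pct_missing'] = (summary['n_missing'] / len(df) * 100).round(2)
    # Columns with the most missing data come first.
    return summary.sort_values('pct_missing', ascending=False)

# Toy data standing in for the real file (values are made up).
toy = pd.DataFrame({
    'iid': [1, 1, 2, 2],
    'rating': [6.0, np.nan, 7.0, 5.0],
    'decision': [1, 0, 1, 1],
})
summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(df)` on the loaded data would give a 195-row table that can be filtered, for example to list only columns above some missingness threshold.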

__Third: Model your outcome of interest. You should try several different approaches and really work to tune a variety of models before using the model evaluation techniques to choose what you consider to be the best performer. Make sure to think about explanatory versus predictive power and experiment with both.__
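
As a minimal sketch of comparing an explanatory model with a predictive one, the example below cross-validates a logistic regression and a random forest. The feature names mirror the six rated attributes, but the data is synthetic (generated by scikit-learn's `make_classification`), so the scores say nothing about the real dataset. The choice of models, ROC AUC scoring, and 5 folds are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names; the data itself is synthetic.
features = ['attractiveness', 'sincerity', 'intelligence',
            'fun', 'ambition', 'shared_interests']
X_raw, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                               n_redundant=0, random_state=0)
X = pd.DataFrame(X_raw, columns=features)

models = {
    'logistic_regression': make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    'random_forest': RandomForestClassifier(n_estimators=200, random_state=0),
}

# Predictive power: mean 5-fold cross-validated ROC AUC per model.
results = {name: cross_val_score(model, X, y, cv=5, scoring='roc_auc').mean()
           for name, model in models.items()}
for name, score in results.items():
    print(f'{name}: mean ROC AUC = {score:.3f}')

# Explanatory power: with standardized inputs, each logistic coefficient
# is the change in log-odds per one standard deviation of that feature.
logit = models['logistic_regression'].fit(X, y)
coefs = pd.Series(logit.named_steps['logisticregression'].coef_[0],
                  index=features).sort_values()
print(coefs)
```

The cross-validated scores speak to predictive power, while the sorted coefficients give a readable account of which inputs push the outcome up or down; a model can do well on one and poorly on the other.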

## Deliverables

Prepare a slide deck and 15-minute presentation that guides viewers through your model. Be sure to cover a few specific things:

- A specified research question your model addresses
- How you chose your model specification and what alternatives you compared it to
- The practical uses of your model for an audience of interest
- Any weak points or shortcomings of your model

You'll be presenting this slide deck live to a group as the culmination of your work in the last two supervised learning units. As a secondary matter, your slides and/or the Jupyter notebook you use or adapt them into should be worthy of inclusion as examples of your work product when applying to jobs.