# Python Project - Speed Dating Experiment

*URL of our dataset* : https://www.kaggle.com/annavictoria/speed-dating-experiment/data

*Description* : What influences love at first sight? (Or, at least, love in the first four minutes?) This dataset was compiled by Columbia Business School professors Ray Fisman and Sheena Iyengar for their paper Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment.

Data was gathered from participants in experimental speed dating events from 2002-2004. During the events, the attendees would have a four-minute "first date" with every other participant of the opposite sex. At the end of their four minutes, participants were asked if they would like to see their date again. They were also asked to rate their date on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests.

The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include: demographics, dating habits, self-perception across key attributes, beliefs on what others find valuable in a mate, and lifestyle information. See the Speed Dating Data Key document below for details.

*We'll first try to uncover insights about the data, including which attributes are the least and most desirable for each gender, and how what individuals say they want differs from what they actually like.*
*Then, we'll try to develop a predictive model to match people given their set of attributes.*

*But first, let's import and preprocess the data!*

## Importing data

In [54]:
# the usual import list
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as mpl
import csv

In [46]:
import urllib.request as ur
ur.urlretrieve("http://www.kaggle.com/account/login?ReturnUrl=c/annavictoria/speed-dating-experiment/downloads/Speed%20Dating%20Data.csv", "Speed Dating Data.csv.zip")


('Speed Dating Data.csv.zip', <http.client.HTTPMessage at 0x10dec4b38>)

*This download method doesn't actually work: Python does not recognize the resulting .zip file as a .zip archive, and it is much smaller than the file we get by downloading manually. This is because when we download the dataset this way, we don't actually log into Kaggle.*

*Some research online shows that this is not a simple matter, and we never covered this kind of thing in class. Code such as the one found on this [page](https://ramhiser.com/2012/11/23/how-to-download-kaggle-data-with-python-and-requests-dot-py/) doesn't work, even after adapting it. We could just copy and paste some complicated [code](http://blog.romanofoti.com/download_from_kaggle/) that we do not completely understand and enter our credentials here, but this doesn't seem sensible.*

*Consequently, we would like to invite you to download the dataset directly using this [link](https://www.kaggle.com/annavictoria/speed-dating-experiment/downloads/Speed%20Dating%20Data.csv).*
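The diagnosis above (a file that carries a .zip name but is not a real archive) can be checked before trying to load anything. The following is a minimal sketch using only the standard library; the file names and their contents are hypothetical stand-ins, and it assumes that the failed download left an HTML page on disk, which is not confirmed here.

```python
import os
import tempfile
import zipfile


def looks_like_zip(path):
    """Return True if the file at `path` is a readable ZIP archive."""
    return zipfile.is_zipfile(path)


def first_bytes(path, n=15):
    """Return the first `n` bytes of a file, to spot an HTML page saved by mistake."""
    with open(path, "rb") as handle:
        return handle.read(n)


# Toy demonstration with two temporary files (hypothetical content, not Kaggle data).
tmp_dir = tempfile.mkdtemp()

# 1) Assumed result of an unauthenticated download: an HTML page with a .zip name.
fake_path = os.path.join(tmp_dir, "fake.zip")
with open(fake_path, "wb") as handle:
    handle.write(b"<!DOCTYPE html><html><body>Please sign in</body></html>")

# 2) A genuine archive containing a tiny CSV.
real_path = os.path.join(tmp_dir, "real.zip")
with zipfile.ZipFile(real_path, "w") as archive:
    archive.writestr("data.csv", "iid,wave\n1,1\n")

fake_is_zip = looks_like_zip(fake_path)
real_is_zip = looks_like_zip(real_path)
print(fake_is_zip, first_bytes(fake_path))
print(real_is_zip, first_bytes(real_path, 2))
```

Running `looks_like_zip("Speed Dating Data.csv.zip")` on the file produced by `urlretrieve` would show whether it is a genuine archive, and `first_bytes` gives a quick hint of what was saved instead.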

*Now that it's done, let's actually import and preprocess the data.*

In [57]:
sdd = pd.read_csv("Speed Dating Data.csv", encoding="latin_1", dtype={'field' : str, 'from' : str, 'career' : str})

In [56]:
print(sdd.dtypes)

iid           int64
id          float64
gender        int64
idg           int64
condtn        int64
wave          int64
round         int64
position      int64
positin1    float64
order         int64
partner       int64
pid         float64
match         int64
int_corr    float64
samerace      int64
age_o       float64
race_o      float64
pf_o_att    float64
pf_o_sin    float64
pf_o_int    float64
pf_o_fun    float64
pf_o_amb    float64
pf_o_sha    float64
dec_o         int64
attr_o      float64
sinc_o      float64
intel_o     float64
fun_o       float64
amb_o       float64
shar_o      float64
             ...   
amb1_3      float64
shar1_3     float64
attr7_3     float64
sinc7_3     float64
intel7_3    float64
fun7_3      float64
amb7_3      float64
shar7_3     float64
attr4_3     float64
sinc4_3     float64
intel4_3    float64
fun4_3      float64
amb4_3      float64
shar4_3     float64
attr2_3     float64
sinc2_3     float64
intel2_3    float64
fun2_3      float64
amb2_3      float64


In [58]:
sdd

,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1,1.0,0,1,1,1,10,7,,4,...,5.0,7.0,7.0,7.0,7.0,,,,,
1,1,1.0,0,1,1,1,10,7,,3,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,1,1.0,0,1,1,1,10,7,,10,...,5.0,7.0,7.0,7.0,7.0,,,,,
3,1,1.0,0,1,1,1,10,7,,5,...,5.0,7.0,7.0,7.0,7.0,,,,,
4,1,1.0,0,1,1,1,10,7,,7,...,5.0,7.0,7.0,7.0,7.0,,,,,
5,1,1.0,0,1,1,1,10,7,,6,...,5.0,7.0,7.0,7.0,7.0,,,,,
6,1,1.0,0,1,1,1,10,7,,1,...,5.0,7.0,7.0,7.0,7.0,,,,,
7,1,1.0,0,1,1,1,10,7,,2,...,5.0,7.0,7.0,7.0,7.0,,,,,
8,1,1.0,0,1,1,1,10,7,,8,...,5.0,7.0,7.0,7.0,7.0,,,,,
9,1,1.0,0,1,1,1,10,7,,9,...,5.0,7.0,7.0,7.0,7.0,,,,,


The NaN values simply mark missing data: for example, a speed dating session in which the participants' initial position was not recorded, or participants who met fewer dates than the others.
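To see how much data is missing in each column, the share of NaN values can be computed directly. This is a small sketch on a made-up frame whose column names mirror the dataset; the values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) mimicking a few columns of the dataset.
toy = pd.DataFrame({
    "iid": [1, 1, 2, 2],
    "positin1": [np.nan, np.nan, 3.0, 3.0],
    "attr_o": [6.0, np.nan, 7.0, 8.0],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression applied to `sdd` would rank the real columns by how incomplete they are.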

To finish the preprocessing, we remove the rows corresponding to waves 6 to 9. In these waves, the participants were not rated in the same way as in the others (a grade for each attribute rather than 100 points to distribute), so keeping them could bias the analysis.

In [63]:
sddOK = sdd.query('wave not in [6,7,8,9]')
sddOK

,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1,1.0,0,1,1,1,10,7,,4,...,5.0,7.0,7.0,7.0,7.0,,,,,
1,1,1.0,0,1,1,1,10,7,,3,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,1,1.0,0,1,1,1,10,7,,10,...,5.0,7.0,7.0,7.0,7.0,,,,,
3,1,1.0,0,1,1,1,10,7,,5,...,5.0,7.0,7.0,7.0,7.0,,,,,
4,1,1.0,0,1,1,1,10,7,,7,...,5.0,7.0,7.0,7.0,7.0,,,,,
5,1,1.0,0,1,1,1,10,7,,6,...,5.0,7.0,7.0,7.0,7.0,,,,,
6,1,1.0,0,1,1,1,10,7,,1,...,5.0,7.0,7.0,7.0,7.0,,,,,
7,1,1.0,0,1,1,1,10,7,,2,...,5.0,7.0,7.0,7.0,7.0,,,,,
8,1,1.0,0,1,1,1,10,7,,8,...,5.0,7.0,7.0,7.0,7.0,,,,,
9,1,1.0,0,1,1,1,10,7,,9,...,5.0,7.0,7.0,7.0,7.0,,,,,
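Since the first rows shown all belong to wave 1, the output above does not by itself confirm that the filter worked. A quick check, sketched here on a made-up frame, is to list the waves that remain; the boolean-mask form with `isin` is shown as an equivalent alternative to `query`.

```python
import pandas as pd

# Toy frame (made-up values): one participant per wave.
toy = pd.DataFrame({"iid": [1, 2, 3, 4, 5], "wave": [1, 6, 9, 10, 21]})

excluded_waves = [6, 7, 8, 9]

# Same style of filter as in the notebook cell above.
kept_query = toy.query("wave not in [6, 7, 8, 9]")

# Equivalent filter written as a boolean mask.
kept_mask = toy[~toy["wave"].isin(excluded_waves)]

# Waves still present after filtering.
remaining_waves = sorted(kept_query["wave"].unique().tolist())
print(remaining_waves)
```

On the real data, `sorted(sddOK['wave'].unique())` would serve the same purpose.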


## Unsupervised Learning

### Attributes and Self-Assessment

First, let's look at all the self-assessment data. People were asked to rate themselves on the different attributes, then to say how they thought others rated them, and finally they were actually rated by other people.
Let's see how self-perception differs from actual perception by others.

In [70]:
# First, let's look at the self-assessment of attributes
sddSA = sdd[['iid', 'gender', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1']]
sddSA = sddSA.drop_duplicates()

In [82]:
sddSA.head()  # here we have isolated everyone's self-assessment

,iid,gender,attr3_1,sinc3_1,fun3_1,intel3_1,amb3_1
0,1,0,6.0,8.0,8.0,8.0,7.0
10,2,0,7.0,5.0,10.0,8.0,3.0
20,3,0,8.0,9.0,8.0,9.0,8.0
30,4,0,7.0,8.0,9.0,7.0,8.0
40,5,0,6.0,3.0,6.0,10.0,8.0


In [83]:
# In the same way, let's isolate how people think they are perceived by others
sddTP = sdd[['iid', 'gender', 'attr5_1', 'sinc5_1', 'fun5_1', 'intel5_1', 'amb5_1']]
sddTP = sddTP.drop_duplicates()
sddTP.head()

,iid,gender,attr5_1,sinc5_1,fun5_1,intel5_1,amb5_1
0,1,0,,,,,
10,2,0,,,,,
20,3,0,,,,,
30,4,0,,,,,
40,5,0,,,,,


About half of the values are missing for this category, but we still have enough data (more than 3000 observations) for an analysis. In any case, we will mostly focus on the difference between self-perception and actual perception in this part of the project.

In [74]:
# And finally, the way people are actually perceived by others
sddAP = sdd[['iid', 'gender', 'attr_o', 'sinc_o', 'fun_o', 'intel_o', 'amb_o']]
sddAP = sddAP.groupby('iid').agg({'attr_o': 'mean','sinc_o': 'mean', 'fun_o': 'mean', 'intel_o': 'mean', 'amb_o': 'mean'})

In [79]:
sddAP.head()

iid,attr_o,sinc_o,fun_o,intel_o,amb_o
1,6.7,7.4,7.2,8.0,8.0
2,7.7,7.1,7.5,7.9,7.5
3,6.5,7.1,6.2,7.3,7.111111
4,7.0,7.1,7.5,7.7,7.7
5,5.3,7.7,7.2,7.6,7.8
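One possible way to compare self-perception with perception by others is to align the self-assessment frame and the averaged ratings on `iid` and take the difference per attribute. The sketch below uses made-up stand-ins for `sddSA` and `sddAP` with only two attributes; the `_gap` column names are hypothetical, and it assumes both sets of ratings use the same scale. It is an illustration, not necessarily the approach taken in the rest of this notebook.

```python
import pandas as pd

# Toy stand-ins (made-up values) for the self-assessment and rated-by-others frames.
self_view = pd.DataFrame({
    "iid": [1, 2, 3],
    "gender": [0, 0, 1],
    "attr3_1": [6.0, 7.0, 8.0],
    "sinc3_1": [8.0, 5.0, 9.0],
})
others_view = pd.DataFrame(
    {"attr_o": [6.5, 7.5, 6.0], "sinc_o": [7.0, 7.0, 7.5]},
    index=pd.Index([1, 2, 3], name="iid"),
)

# Align the two views on the participant id.
merged = self_view.merge(others_view, left_on="iid", right_index=True, how="inner")

# Positive gap = the participant rates themselves higher than their dates do.
for attribute in ["attr", "sinc"]:
    merged[f"{attribute}_gap"] = merged[f"{attribute}3_1"] - merged[f"{attribute}_o"]

# Average gap per gender code.
mean_gap_by_gender = merged.groupby("gender")[["attr_gap", "sinc_gap"]].mean()
print(mean_gap_by_gender)
```

An inner join keeps only participants present in both frames, so anyone without ratings from others would drop out of the comparison.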
