# Predicting Speed Dating Match

*Written by Ben Pang, presented to Prof. David Skillicorn*


## 1. Introduction

The purpose of this study is to predict whether a speed dating round results in a match, given the data collected from the two individuals who participated in it. We will select an appropriate classification model based on observations about the dataset and tune it to achieve the highest test accuracy possible.

The two main tools that we will use are [scikit-learn](http://scikit-learn.org/stable/) and [pandas](https://pandas.pydata.org/). **scikit-learn** provides tools for preprocessing the data as well as for implementing and tuning our classification model. **pandas** provides the DataFrame structure that we use to load, inspect, and clean the dataset.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pathlib import Path


## 2. The Dataset



In [22]:
data_path = Path('../data/speed_dating_data.csv')
data = pd.read_csv(data_path, encoding = "ISO-8859-1")

In [20]:
data.head()

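| | iid | id | gender | idg | condtn | wave | round | position | positin1 | order | ... | attr3_3 | sinc3_3 | intel3_3 | fun3_3 | amb3_3 | attr5_3 | sinc5_3 | intel5_3 | fun5_3 | amb5_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 4 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 3 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 10 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 5 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 4 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 7 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |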


In [23]:
data.shape

(8378, 195)

In [38]:
grp = data.columns.to_series().groupby(data.dtypes).groups
{k.name:v for k, v in grp.items()}

{'float64': Index(['id', 'positin1', 'pid', 'int_corr', 'age_o', 'race_o', 'pf_o_att',
        'pf_o_sin', 'pf_o_int', 'pf_o_fun',
        ...
        'attr3_3', 'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3',
        'sinc5_3', 'intel5_3', 'fun5_3', 'amb5_3'],
       dtype='object', length=174),
 'int64': Index(['iid', 'gender', 'idg', 'condtn', 'wave', 'round', 'position', 'order',
        'partner', 'match', 'samerace', 'dec_o', 'dec'],
       dtype='object'),
 'object': Index(['field', 'undergra', 'mn_sat', 'tuition', 'from', 'zipcode', 'income',
        'career'],
       dtype='object')}

The *match* attribute indicates whether or not the round resulted in a match between the two participants. Therefore, *match* is the target attribute that we will try to classify.

## 3. Data Cleaning

In [39]:
data2 = data.copy()

From the Speed Dating Data Key, we notice that waves 6, 7, 8, and 9 use a 1-10 preference scale instead of the usual 100-point allocation. Keeping these waves would mean that the same preference features are measured on two different scales across records, which would mislead our classification model. Therefore, records from waves 6-9 are removed.

In [57]:
data2 = data2.drop(data2[(data2.wave > 5) & (data2.wave < 10)].index)
data2.shape

(6816, 195)
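
The same filter can also be written as a boolean mask. The sketch below is only an illustration on a small hypothetical frame (`toy` and its values are made up, not taken from the dataset); it checks that the mask form and the drop-by-index form used above select the same rows.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data; only `wave` matters here.
toy = pd.DataFrame({"wave": [1, 5, 6, 7, 9, 10, 21], "iid": range(7)})

# Mask form: keep rows whose wave lies outside the closed interval [6, 9].
# Series.between is inclusive on both ends by default.
filtered = toy[~toy["wave"].between(6, 9)]

# Drop-by-index form, as used in the cell above.
dropped = toy.drop(toy[(toy.wave > 5) & (toy.wave < 10)].index)

remaining_waves = filtered["wave"].tolist()
same_result = filtered.equals(dropped)
```

Either form works; the mask form avoids building an intermediate index and reads closer to the sentence "remove waves 6-9".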

Next, we observe that there exist two features, *dec* and *dec_o*, that indicate the decision of the participant and the decision of his/her partner, respectively. If both the participant and the partner decide *yes* (1), then the round necessarily results in a match. This information should not be part of our analysis, since we are trying to predict the result of a round without prior knowledge of the outcome. Thus, the features *dec* and *dec_o* will be removed.

In [64]:
data2 = data2.drop(columns=['dec', 'dec_o'])
data2.shape

(6816, 193)
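
To make the leakage concern concrete, here is a minimal sketch on a hypothetical four-row frame (the values are made up for illustration). It assumes, as described above, that a match occurs exactly when both decisions are *yes*; under that assumption, the two decision columns alone reproduce the target perfectly.

```python
import pandas as pd

# Hypothetical toy rounds covering every combination of the two decisions.
toy = pd.DataFrame({
    "dec":   [1, 1, 0, 0],   # decision of the participant
    "dec_o": [1, 0, 1, 0],   # decision of the partner
})

# Assumption from the text: match is 1 only when both sides say yes.
toy["match"] = toy["dec"] & toy["dec_o"]

# A trivial "predictor" that multiplies the two decisions.
leaked_prediction = toy["dec"] * toy["dec_o"]

# Fraction of rows where the trivial predictor equals the target.
leak_accuracy = (leaked_prediction == toy["match"]).mean()
```

On the real data, a check such as `(data['match'] == (data['dec'] & data['dec_o'])).all()` could be used to confirm that the same relationship holds before dropping the columns; that result is not shown here.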