# Predicting Speed Dating Match

*Written by Ben Pang, presented to Prof. David Skillicorn*

<b>

## Abstract

## 1. Introduction

The purpose of this study is to predict whether a speed dating round results in a match, given the survey data collected from the individuals who participated in that round. We will select an appropriate classification model based on observations of the dataset and tune the model to achieve the highest test accuracy possible.

The two main tools that we will use are [scikit-learn](http://scikit-learn.org/stable/) and [pandas](https://pandas.pydata.org/). **scikit-learn** provides tools for preprocessing the data as well as for implementing and tuning our classification model. **pandas** provides the *DataFrame* class, which is convenient for data manipulation and cleaning.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pathlib import Path


## 2. The Dataset



In [2]:
data_path = Path('../data/speed_dating_data.csv')
data = pd.read_csv(data_path, encoding = "ISO-8859-1")

In [3]:
data.head()

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1,1.0,0,1,1,1,10,7,,4,...,5.0,7.0,7.0,7.0,7.0,,,,,
1,1,1.0,0,1,1,1,10,7,,3,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,1,1.0,0,1,1,1,10,7,,10,...,5.0,7.0,7.0,7.0,7.0,,,,,
3,1,1.0,0,1,1,1,10,7,,5,...,5.0,7.0,7.0,7.0,7.0,,,,,
4,1,1.0,0,1,1,1,10,7,,7,...,5.0,7.0,7.0,7.0,7.0,,,,,


In [4]:
data.shape

(8378, 195)

In [5]:
grp = data.columns.to_series().groupby(data.dtypes).groups
{k.name:v for k, v in grp.items()}

{'float64': Index(['id', 'positin1', 'pid', 'int_corr', 'age_o', 'race_o', 'pf_o_att',
        'pf_o_sin', 'pf_o_int', 'pf_o_fun',
        ...
        'attr3_3', 'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 'attr5_3',
        'sinc5_3', 'intel5_3', 'fun5_3', 'amb5_3'],
       dtype='object', length=174),
 'int64': Index(['iid', 'gender', 'idg', 'condtn', 'wave', 'round', 'position', 'order',
        'partner', 'match', 'samerace', 'dec_o', 'dec'],
       dtype='object'),
 'object': Index(['field', 'undergra', 'mn_sat', 'tuition', 'from', 'zipcode', 'income',
        'career'],
       dtype='object')}

The *match* attribute indicates whether or not the round resulted in a match between the two participants. Therefore, *match* is the target attribute that we will try to classify.
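Because test accuracy is the metric we want to maximize, it helps to know how often a match occurs: a classifier that always predicts the more common class sets the baseline to beat. Below is a minimal sketch of that check on a small made-up frame (the `toy` values are illustrative and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the real data; only the 'match' column matters here.
toy = pd.DataFrame({"match": [0, 0, 1, 0, 1, 0, 0, 0]})

# Share of each class in the target attribute.
class_share = toy["match"].value_counts(normalize=True)

# Accuracy of a classifier that always predicts the most common class.
baseline_accuracy = class_share.max()
print(class_share)
print(baseline_accuracy)
```

The same two lines could be run on `data['match']` to obtain the baseline for the real dataset.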

## 3. Data Cleaning

First, we make a copy for our cleaned data and replace the NaN values with -1:

In [6]:
data2 = data.copy().fillna(-1)

From the Speed Dating Data Key, we notice that waves 6, 7, 8, and 9 use a 1-10 preference scale instead of the usual 100-point allocation. Values recorded on the two scales are not comparable, and this inconsistency could mislead our classification model. Therefore, records from waves 6-9 are removed.

In [7]:
data2 = data2.drop(data2[(data2.wave > 5) & (data2.wave < 10)].index)
data2.shape

(6816, 195)
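The same filter can also be written with a boolean mask instead of `drop`. The sketch below uses a made-up frame (`toy` is hypothetical) and additionally resets the index, which is optional but keeps the row labels contiguous after rows are removed:

```python
import pandas as pd

# Hypothetical waves; the real data has one row per participant per round.
toy = pd.DataFrame({
    "wave":  [1, 5, 6, 7, 9, 10, 21],
    "match": [0, 1, 0, 0, 1, 0, 1],
})

# Keep every row whose wave lies outside 6-9 (between() is inclusive on both ends).
kept = toy[~toy["wave"].between(6, 9)].reset_index(drop=True)
print(kept["wave"].tolist())
```

Without `reset_index(drop=True)` the surviving rows keep their original labels, which matters for any later code that treats row positions and index labels as interchangeable.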

Next, we observe that there exist two features, *dec* and *dec_o*, that indicate the decision of the player and the decision of his/her partner, respectively. If both the player and the partner decide *yes* (1), then the round results in a match. This information should not be part of our analysis, since we are trying to predict the result of a round without prior knowledge of the outcome. Thus, the features *dec* and *dec_o* will be removed.

In [8]:
data2 = data2.drop(columns=['dec', 'dec_o'])
data2.shape

(6816, 193)
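To make the leakage concrete: a match requires both decisions to be yes, so the target can be reconstructed from these two columns alone. Here is a small sketch on made-up decisions (the `toy` frame is illustrative; on the real data the same comparison would have to be run before the columns are dropped):

```python
import pandas as pd

# Hypothetical rounds covering all four combinations of decisions.
toy = pd.DataFrame({
    "dec":      [1, 1, 0, 0],
    "dec_o":    [1, 0, 1, 0],
    "samerace": [0, 1, 1, 0],
    "match":    [1, 0, 0, 0],
})

# The target equals the product of the two decisions in every row,
# so a model given dec and dec_o would have nothing left to learn.
leaks_target = (toy["match"] == toy["dec"] * toy["dec_o"]).all()

# Remove the leaking columns before modelling.
features = toy.drop(columns=["dec", "dec_o"])
print(leaks_target, list(features.columns))
```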

The data is quite "messy": several columns have many missing values. For example, the *field* feature has a corresponding *field_cd* feature, which is the integer-coded indicator for the player's field of study. However, some records have a value for one but not the other. The same is true for *career* and *career_c*. We are interested in the coded version of this information.

In [9]:
set(data2.field_cd[data2.field.isnull()].isnull())

set()

In [10]:
set(data2.career_c[data2.career.isnull()].isnull())

set()
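Note that these checks run on *data2*, where NaN has already been replaced with -1, so `isnull()` cannot find any missing entries there. A minimal sketch of the difference on a made-up frame (the column values are hypothetical), locating missing entries through the -1 sentinel instead:

```python
import numpy as np
import pandas as pd

# Hypothetical records: one fully missing, one with a name but no code.
raw = pd.DataFrame({
    "field":    ["Law", np.nan, "Math", "Law"],
    "field_cd": [1.0, np.nan, 2.0, np.nan],
})
filled = raw.fillna(-1)

# After fillna(-1) no value is NaN, so isnull() selects no rows.
nan_rows_after_fill = int(filled["field"].isnull().sum())

# Missing entries have to be located through the sentinel value instead.
name_missing = filled["field"] == -1
code_missing = filled["field_cd"] == -1

# Rows with a name but no code are the ones a name -> code lookup can repair.
repairable = filled[~name_missing & code_missing]
print(nan_rows_after_fill, len(repairable))
```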

Both checks return an empty set, so no *field* or *career* entry in *data2* is NaN. Since missing values were replaced with -1 earlier, any gaps now appear as -1 instead. To fill in codes that are missing where a name is present, we build a dictionary mapping the string entries to their corresponding integer codes (*field* to *field_cd*, *career* to *career_c*) and then use this dictionary to replace missing code values.

In [14]:
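def replace_code(data, name_col, code_col):
    # Map each name to a code (for repeated names, the last row's code wins).
    code_dict = dict(zip(data[name_col], data[code_col]))
    # Assign the whole column at once: this avoids chained assignment and
    # does not rely on the index being 0..n-1 after rows were dropped.
    data[code_col] = data[name_col].map(code_dict)
    return code_dict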

In [17]:
field_dict = replace_code(data2, 'field', 'field_cd')

In [18]:
print(set(data2.field_cd))

{1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0}
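One thing to watch for with a name-to-code dictionary built by `zip` is that later rows overwrite earlier ones, so a row whose code is missing can replace a valid code for the same name. The sketch below shows a variant that builds the lookup only from rows with a known name and code, and fills only the rows whose code is missing. The helper `fill_codes`, its `missing` parameter, and the toy values are assumptions made for illustration, not part of the original analysis:

```python
import pandas as pd

def fill_codes(df, name_col, code_col, missing=-1):
    """Return a copy of df with missing codes filled from names that have a known code."""
    known = df[(df[code_col] != missing) & (df[name_col] != missing)]
    lookup = dict(zip(known[name_col], known[code_col]))
    out = df.copy()
    # Only touch rows whose code is missing; unknown names keep the sentinel.
    needs_fill = out[code_col] == missing
    out.loc[needs_fill, code_col] = (
        out.loc[needs_fill, name_col].map(lookup).fillna(missing)
    )
    return out, lookup

toy = pd.DataFrame(
    {"field": ["Law", "Math", "Law", -1], "field_cd": [1.0, 2.0, -1.0, -1.0]},
    index=[0, 1, 5, 9],  # non-contiguous labels, as left behind by dropping rows
)
filled, lookup = fill_codes(toy, "field", "field_cd")
print(filled["field_cd"].tolist())
```

Rows with neither a name nor a code stay at -1 here, which makes it easy to count how many records remain unresolved after the fill.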
