## Speed dating data-set
#### Abdel K. Bokharouss - November 2017

### 1.2 Separate model per gender

### <font color="green">imports, preparation and configuration</font>

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from IPython.core.display import HTML # markdown cell styling
HTML("""
<style>
div.text_cell_render h1 {
font-size: 1.6em;
line-height:1.2em;
}

div.text_cell_render h2 { 
margin-bottom: -0.4em;
}

div.text_cell_render { 
font-size:1.2em;
line-height:1.2em;
font-weight:500;
}

div.text_cell_render p, li {
color:Navy;
}

</style>
""")

In [3]:
dates = pd.read_csv("speed_dating_assignment.csv")
dates.head()

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1,1.0,0,1,1,1,10,7,,4,...,5.0,7.0,7.0,7.0,7.0,,,,,
1,1,1.0,0,1,1,1,10,7,,3,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,1,1.0,0,1,1,1,10,7,,10,...,5.0,7.0,7.0,7.0,7.0,,,,,
3,1,1.0,0,1,1,1,10,7,,5,...,5.0,7.0,7.0,7.0,7.0,,,,,
4,1,1.0,0,1,1,1,10,7,,7,...,5.0,7.0,7.0,7.0,7.0,,,,,


There are many attributes that can be used to build a predictive model. The model should be trained to predict the class attribute <i>dec</i>. Because a performance metric (e.g. accuracy) has to be optimized and evaluated, there should be enough records to train the model well. An attribute with many missing values (NaN) is, therefore, not a good candidate, since many models require non-NaN values for their feature attributes. Data mining methods can be used to fill in these NaN values, but this would be error-prone given the small size of the datasets.

The next step is, therefore, an assessment of the attributes which are considered to be of use in the predictive model, and an assessment of the completeness of these attributes in the two datasets. In this part of the assignment (1.2) only existing attributes are used; no new attributes are constructed. Feature engineering (e.g. the age difference from 1.1) will be exploited in the models of sub-task 1.3.

The candidate attributes are not selected by scripting, but by studying the speed dating data key document and reasoning about which attributes could plausibly influence the decision. The following attributes are considered to be possible candidates:

<i>(The text in italics is the explanation given by the official data key)</i>

* <b>order:</b> <i>The number of date that night when met partner.</i> One can imagine that subjects who are desperately looking for a partner lower their standards by the end of the night. So if the first x persons were not exactly a success, the subject might (subconsciously) lower their standards for the next (round - x) persons, and this could lead more easily to a decision (dec) to meet the person again.
* <b>field:</b> <i>field of study.</i> There could be differences in the decision-making process between subjects from different fields of study.
* <b>goal, date and go_out:</b> <i>primary goal event, date frequency, going out frequency (all categorical)</i> There could be differences in the decision-making process between subjects who live a different lifestyle and/or have a different primary goal (the focus here is on the subject's own answer, not on the differences between subject and partner, which is a possibility for 1.3).
* <b>satis_2:</b> <i>Overall how satisfied were you with the people you met (1 = not at all satisfied, 10 = extremely satisfied).</i> If a subject is satisfied with the people he/she met, one would logically expect the likelihood of a match (and thus a dec = 1) to be higher. This attribute is, therefore, definitely worth an evaluation.
[comment]: <> (imprace, imprelig, from)
Note that using the attribute <i>match</i> would probably improve the accuracy (or whichever other performance metric is assessed) of the model significantly, but this would go against the main idea of the predictive model. The model should assess whether the subject decides that he/she wants to see the date partner in question again. Using the <i>match</i> attribute would leak the target: <i>match = 1</i> immediately tells us that <i>dec = 1</i>. If <i>match = 0</i>, it can still be that <i>dec = 1</i>, but this is less likely, since it would imply that the date partner in question does not want to see the subject again, which often implies that the date did not go well.
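
The leakage argument above can be illustrated with a small sketch. The data below is made up for illustration and is not taken from the speed dating file; the sketch also assumes that <i>match</i> equals 1 only when both the subject (<i>dec</i>) and the partner (<i>dec_o</i>) say yes, which should be checked against the data key.

```python
import pandas as pd

# Hypothetical toy data, NOT rows from speed_dating_assignment.csv.
# dec = decision of the subject, dec_o = decision of the partner.
toy = pd.DataFrame({
    "dec":   [1, 1, 0, 0, 1, 0],
    "dec_o": [1, 0, 1, 0, 1, 0],
})

# Assumption: a match requires a "yes" from both sides.
toy["match"] = toy["dec"] * toy["dec_o"]

# Cross-tabulate the candidate feature (rows) against the target (columns).
leak_table = pd.crosstab(toy["match"], toy["dec"])
print(leak_table)

# Under this assumption every row with match == 1 also has dec == 1,
# so the feature would reveal the target for those rows.
all_matches_said_yes = bool((toy.loc[toy["match"] == 1, "dec"] == 1).all())
print(all_matches_said_yes)
```

The cell for <i>match = 1</i> and <i>dec = 0</i> stays empty in such a table, which is exactly why the attribute is left out of the feature set.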

In [4]:
dates_model = dates[['gender', 'order', 'field', 'goal', 'date', 'go_out', 'dec']]
print(dates.shape) # total row count, to assess the number of NaN values per column
dates_model.describe()

(8378, 175)


Unnamed: 0,gender,order,goal,date,go_out,dec
count,8378.0,8378.0,8299.0,8281.0,8299.0,8378.0
mean,0.500597,8.927668,2.122063,5.006762,2.158091,0.419909
std,0.500029,5.477009,1.407181,1.444531,1.105246,0.493573
min,0.0,1.0,1.0,1.0,1.0,0.0
25%,0.0,4.0,1.0,4.0,1.0,0.0
50%,1.0,8.0,2.0,5.0,2.0,0.0
75%,1.0,13.0,2.0,6.0,3.0,1.0
max,1.0,22.0,6.0,7.0,7.0,1.0


We see that the attributes <i>gender, order</i> and <i>dec</i> have no NaN values, as one would expect. <i>goal</i> and <i>go_out</i> each have (8378 - 8299 =) 79 NaN values and <i>date</i> has (8378 - 8281 =) 97 NaN values. The <i>field</i> attribute is not numeric and is therefore not included in this summary.
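
Instead of subtracting the <i>count</i> row from the total number of rows by hand, the NaN values per column can also be counted directly. The following is a minimal sketch on a made-up frame (the values are hypothetical); the same two calls could be applied to <i>dates_model</i>.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values.
toy = pd.DataFrame({
    "goal":   [1.0, np.nan, 2.0, 3.0],
    "date":   [np.nan, np.nan, 5.0, 6.0],
    "go_out": [1.0, np.nan, 2.0, 2.0],
})

# Number of NaN values per column.
nan_per_column = toy.isna().sum()
print(nan_per_column)

# Same result via the manual route used above: total rows minus non-null count.
manual = len(toy) - toy.count()
print(manual)
```

Unlike <i>describe()</i>, which by default summarizes only the numeric columns of a mixed frame, <i>isna().sum()</i> also covers non-numeric columns.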

In [5]:
dates_model.isnull().any(axis=1).sum() # number of rows with at least one NaN value

97

As can be seen from the result of the previous statement, the number of rows with at least one NaN value is 97, which equals max(79, 97). This means that rows with a NaN value in one of the columns often also have a NaN value in one of the other columns (in the worst case, with no overlap at all, the count would have been the sum of the per-column NaN counts).
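
The overlap can be checked explicitly with boolean masks. This is a sketch on made-up data (hypothetical values), assuming the question of interest is whether every row that misses <i>goal</i> also misses <i>date</i>.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the row that misses goal also misses date.
toy = pd.DataFrame({
    "goal": [1.0, np.nan, 2.0, 3.0],
    "date": [np.nan, np.nan, 5.0, 6.0],
})

missing_goal = toy["goal"].isna()
missing_date = toy["date"].isna()

# Rows with at least one NaN value.
rows_with_any_nan = int((missing_goal | missing_date).sum())

# Rows that miss goal but NOT date; zero means the goal gaps are fully
# contained in the date gaps.
goal_only = int((missing_goal & ~missing_date).sum())

print(rows_with_any_nan, goal_only)
```

When <i>goal_only</i> is zero, the row-level count equals the larger of the two column counts, which matches the pattern observed above.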

Considering the relatively low number of NaN values compared to the total record count of the dataset (97 out of 8378 rows), dropping those rows won't hurt the training of the model by much. In addition, exploiting data mining methods to fill the NaN values would be error-prone considering the nature of the attributes. The choice is, therefore, made to drop these rows.

In [6]:
dates_model = dates_model.dropna().reset_index(drop=True)
(dates.shape[0] - dates_model.shape[0]) == 97 # checking whether the statement actually deleted the rows

True

Since a predictive model needs to be trained for two separate genders (males and females) it makes sense to separate the data into a data frame for the male daters (subject in the instance) and a data frame for the female daters. Male and female subjects are classified by a value of 1 or 0 in the gender column, respectively.

In [7]:
male_subjects = dates_model[dates_model.gender == 1]
female_subjects = dates_model[dates_model.gender == 0]
male_subjects.shape, female_subjects.shape

((4156, 7), (4125, 7))

There are 4156 records in the data set of the male subjects and 4125 records in the data set of the female subjects

### <font color="green">Training the models</font>

In [8]:
male_subjects_shuffle = male_subjects.sample(frac=1).reset_index(drop=True) # shuffle rows
female_subjects_shuffle = female_subjects.sample(frac=1).reset_index(drop=True)

The rows are shuffled so that the split into training and test data does not depend on the original ordering of the records.
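
A shuffle without a fixed seed gives a different ordering on every run, which makes results hard to reproduce. The sketch below uses a made-up frame and an arbitrarily chosen seed (42 is an assumption, not a value from this notebook) to show how <i>random_state</i> makes the shuffle repeatable.

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({"order": [1, 2, 3, 4, 5], "dec": [0, 1, 0, 1, 1]})

# Shuffling twice with the same seed yields the same row order.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
reshuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled.equals(reshuffled))

# No rows are lost or duplicated by the shuffle.
print(sorted(shuffled["order"]) == [1, 2, 3, 4, 5])
```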

In [9]:
male_x_data = male_subjects_shuffle.drop('dec', axis = 1) # dec is target attribute
female_x_data = female_subjects_shuffle.drop('dec', axis = 1)
male_labels = male_subjects_shuffle['dec']
female_labels = female_subjects_shuffle['dec']

The choice is made to split the data into 70% training data and 30% test data. This ratio should leave enough training data for the model and enough test data to assess and evaluate its performance.

In [11]:
# train_test_split returns the splits in the order: x_train, x_test, y_train, y_test
male_x_train, male_x_test, male_y_train, male_y_test = train_test_split(male_x_data, male_labels, test_size = 0.3)
female_x_train, female_x_test, female_y_train, female_y_test = train_test_split(female_x_data, female_labels, test_size = 0.3)
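
As a side note, <i>train_test_split</i> returns its outputs grouped per input array, i.e. train and test features first, then train and test labels. The sketch below uses made-up data to show this order; the fixed seed and the <i>stratify</i> argument are optional additions for illustration, not settings used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 records with a balanced binary target.
toy_x = pd.DataFrame({"order": np.arange(1, 21), "goal": np.tile([1, 2], 10)})
toy_y = pd.Series(np.tile([0, 1], 10), name="dec")

# Return order: x_train, x_test, y_train, y_test.
x_train, x_test, y_train, y_test = train_test_split(
    toy_x, toy_y,
    test_size=0.3,      # 70% training data, 30% test data
    random_state=0,     # assumption: fixed seed for reproducibility
    stratify=toy_y,     # assumption: keep the dec ratio equal in both parts
)

print(x_train.shape, x_test.shape)

# Features and labels of each part refer to the same records.
print(x_train.index.equals(y_train.index), x_test.index.equals(y_test.index))
```

Stratifying on the target is mainly useful when the classes are imbalanced, so that the test set does not end up with a noticeably different share of <i>dec = 1</i> than the training set.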

.... train model ... 

### <font color="green">Evaluating the performance of the models</font>

### <font color="green">Comparing the differences among the models</font>

### <font color="green">Evaluating a third model for the female gender</font>