# Speed dating dataset
#### Joris Rombouts & Remco Surtel

### 1.1 Visualization

### <font color="green">Imports, preparation and configuration</font>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv("speed_dating_assignment.csv")

### <font color="green">Creating the visualization matrix</font>

In [3]:
# Add the partner's age (age_o) to the dataframe by cross-referencing pid == iid.
# Build a lookup table with one row per participant: iid is renamed to pid and
# age is renamed to age_o, so that it can be merged back onto the original rows.
df_o = df[['iid', 'age']].copy()
df_o.columns = ['pid', 'age_o']
# Keep a single (pid, age_o) row per participant
df_o = df_o.drop_duplicates()
# Left merge on pid (because pid == iid), so every original row is kept and gains age_o
df_matrix = pd.merge(df, df_o, on=['pid'], how='left')
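
The self-join above relies on every `pid` mapping to exactly one age. The following is a minimal sketch of how that assumption could be checked, using a small made-up dataframe (`toy` and its values are hypothetical, not taken from the assignment data). `validate="many_to_one"` is a standard `DataFrame.merge` option that raises an error when the lookup table contains a duplicated key.

```python
import pandas as pd

# Toy stand-in for the speed dating data (hypothetical values):
# one row per date, iid = subject, pid = partner.
toy = pd.DataFrame({
    "iid": [1, 1, 2, 2, 3, 3],
    "pid": [2, 3, 1, 3, 1, 2],
    "age": [21, 21, 25, 25, 30, 30],
    "dec": [1, 0, 0, 1, 1, 1],
})

# One row per participant: their id (renamed to pid) and age (renamed to age_o).
partner_ages = (
    toy[["iid", "age"]]
    .drop_duplicates()
    .rename(columns={"iid": "pid", "age": "age_o"})
)

# validate="many_to_one" raises a MergeError if any pid appears more than once
# in partner_ages, which would otherwise silently duplicate rows in the result.
toy_matrix = toy.merge(partner_ages, on="pid", how="left", validate="many_to_one")

print(toy_matrix[["iid", "pid", "age", "age_o"]])
```

If the merge succeeds, the result has the same number of rows as the input, which is a cheap sanity check before building the matrix.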

In [4]:
# Create the visualization matrix showing the average decision made for each pair of ages
visualization_matrix = pd.crosstab(df_matrix['age'], df_matrix['age_o'], values=df_matrix['dec'], aggfunc=[np.mean])
# Round the values in the matrix to two decimals for readability
visualization_matrix = np.round(visualization_matrix, decimals=2)
# Convert the ages shown in the column and row labels to integers.
# The columns are (aggfunc name, age_o) pairs, so keep only the age part.
visualization_matrix.columns = [int(age_o) for _, age_o in visualization_matrix.columns]
visualization_matrix.index = [int(age) for age in visualization_matrix.index]

In [5]:
cm = sns.light_palette("red", as_cmap=True)
s = visualization_matrix.style.background_gradient(cmap=cm)
s

age (rows) / age_o (columns),18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,42,55
18,,0.0,0.33,0.0,,,,,,,,,,,,,,,,,,,,
19,1.0,0.0,0.25,0.29,0.25,,,,,,,,,,,,,,,,,,,
20,0.33,0.25,0.67,0.45,0.42,,0.0,0.25,0.5,0.0,,0.0,0.0,,,,0.0,,,,,,,
21,0.4,0.43,0.73,0.62,0.44,0.55,0.57,0.76,0.76,0.47,0.44,0.39,0.39,0.5,0.33,1.0,0.67,1.0,0.0,,,,,
22,,1.0,0.67,0.49,0.5,0.51,0.46,0.52,0.37,0.32,0.42,0.47,0.33,0.33,0.53,0.25,0.62,1.0,0.0,,1.0,0.5,0.0,
23,,,,0.42,0.4,0.51,0.33,0.49,0.42,0.44,0.4,0.37,0.4,0.45,0.22,0.38,0.44,0.0,0.2,0.0,0.5,0.0,0.0,0.0
24,,,0.5,0.35,0.47,0.48,0.31,0.48,0.43,0.43,0.39,0.33,0.31,0.38,0.3,0.33,0.19,0.5,0.0,0.0,0.67,,0.0,
25,,,0.75,0.29,0.38,0.51,0.35,0.61,0.49,0.43,0.39,0.45,0.54,0.17,0.23,0.11,0.58,0.44,0.0,,,0.0,0.4,
26,,,0.5,0.35,0.24,0.33,0.33,0.52,0.4,0.41,0.45,0.39,0.41,0.14,0.22,0.24,0.4,0.5,0.25,1.0,0.5,0.0,0.0,
27,,,1.0,0.25,0.49,0.4,0.44,0.46,0.45,0.39,0.49,0.37,0.38,0.42,0.27,0.41,0.2,0.25,0.4,0.0,0.67,0.0,0.0,0.0


### <font color="green">Relations that we managed to discover</font>

First of all, note the presence of NaN values in the matrix. The speed dating data was collected during an event in Boston over the course of two years. The rows represent the ages of the subjects and the columns the ages of their date partners. The row and column labels are limited to the ages that actually occur in the dataset. A NaN value indicates that the corresponding subject-partner age pair never occurred in those two years. For example, no eighteen-year-old ever went on a speed date with another eighteen-year-old.

Moreover, there were few older participants in general, as evidenced by the absence of any participants aged between 42 and 55. Therefore, in the process of discovering relations, we focused on ages that did have sufficiently many participants, roughly ages 20 through 35. We also mostly ignored values of 1 (meaning that all decisions for the corresponding pair of ages were positive), because in most cases such a value simply meant that only one or two dates took place between people of these ages.
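
The small-sample caveat can be made explicit by counting how many dates lie behind each cell. The sketch below uses made-up data (`toy`) and an arbitrary threshold (`min_count = 3`); both are assumptions for illustration, not values used in the analysis above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: subject age, partner age, decision (1 = yes).
toy = pd.DataFrame({
    "age":   [22, 22, 22, 22, 23, 23, 23, 39],
    "age_o": [23, 23, 23, 24, 22, 22, 22, 22],
    "dec":   [1, 0, 1, 1, 0, 0, 1, 1],
})

# Mean decision per age pair (same idea as the visualization matrix).
means = pd.crosstab(toy["age"], toy["age_o"], values=toy["dec"], aggfunc="mean")

# Number of dates behind each cell; age pairs that never met get 0.
counts = pd.crosstab(toy["age"], toy["age_o"])

# Hide cells backed by fewer than min_count dates (the threshold is an assumption).
min_count = 3
reliable = means.where(counts >= min_count).round(2)
print(reliable)
```

In this toy example the cells with a mean of 1.0 are each based on a single date, so they are masked, which mirrors the reasoning for ignoring most values of 1.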

Contrary to our initial expectations, the visualization does not show a clear line along the diagonal that would indicate that people are generally attracted to partners of the same age. What this does tell us is that dating is more complicated than that. Matches cannot be determined by age alone; there are certainly other significant factors to take into account.

One relation that we can clearly see in the visualization, though, is a diverging trend: the spread of partner ages that receive positive decisions widens as the subject gets older. This suggests that, as people get older, they are attracted to a larger range of ages in their partners. This could be due to the age difference becoming less significant. For instance, when children start dating in high school, two years is a major difference. When you are in your thirties, however, two years isn't such a big deal anymore.
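
One way to put a number on this diverging trend is to measure, per subject age, how widely the accepted partner ages are spread. The following is a rough sketch on made-up data (`toy` is hypothetical); the choice of range and standard deviation as spread measures is an assumption, not something the analysis above computed.

```python
import pandas as pd

# Hypothetical toy data: subject age, partner age, decision (1 = yes).
toy = pd.DataFrame({
    "age":   [21, 21, 21, 21, 30, 30, 30, 30],
    "age_o": [20, 21, 22, 30, 22, 26, 30, 35],
    "dec":   [1, 1, 1, 0, 1, 1, 1, 1],
})

# Keep only the dates where the subject said yes.
accepted = toy[toy["dec"] == 1]

# Spread of accepted partner ages for each subject age.
spread = accepted.groupby("age")["age_o"].agg(["count", "min", "max", "std"])
spread["range"] = spread["max"] - spread["min"]
print(spread)
```

Note that such a measure also depends on which partner ages a subject actually met, so a wider range could partly reflect who attended rather than preference alone.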

Furthermore, the visualization shows another interesting phenomenon: the older the subject, the more willing they are to say yes. One possible interpretation is that older participants are more eager to find a partner. A very clear example of this can be seen in row 39, where one participant said yes to every single potential partner, although a row based on a single participant should be read with the small-sample caveat above in mind.
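
The claim about willingness to say yes can be checked directly by averaging `dec` per subject age and reporting how many dates and distinct subjects each average rests on. This is a minimal sketch with made-up data (`toy` is hypothetical); the column names `iid`, `age` and `dec` follow the dataframe used above.

```python
import pandas as pd

# Hypothetical toy data: subject id, subject age, decision (1 = yes).
toy = pd.DataFrame({
    "iid": [1, 1, 2, 2, 3, 3, 4, 4],
    "age": [22, 22, 22, 22, 30, 30, 39, 39],
    "dec": [1, 0, 0, 0, 1, 0, 1, 1],
})

# Yes-rate per subject age, with the sample sizes that support it.
by_age = toy.groupby("age").agg(
    yes_rate=("dec", "mean"),
    n_dates=("dec", "size"),
    n_subjects=("iid", "nunique"),
)
print(by_age)
```

Showing `n_subjects` next to `yes_rate` makes it visible when a high rate, such as the one for age 39 in this toy example, comes from a single participant.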