<h1>Speed Dating: Who to Date Long Term</h1>

What influences love at first sight? (Or, at least, love in the first four minutes?) This dataset was compiled by Columbia Business School professors Ray Fisman and Sheena Iyengar for their paper Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment.<br>

Data was gathered from participants in experimental speed dating events from 2002 to 2004. During the events, attendees had a four-minute "first date" with every other participant of the opposite sex. At the end of their four minutes, participants were asked if they would like to see their date again. They were also asked to rate their date on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests.<br>

The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include: demographics, dating habits, self-perception across key attributes, beliefs on what others find valuable in a mate, and lifestyle information. See the Speed Dating Data Key document below for details.<br>

For more analysis from Iyengar and Fisman, read Racial Preferences in Dating.<br>

Data Exploration Ideas<br>

What are the least desirable attributes in a male partner? Does this differ for female partners?<br>
How important do people think attractiveness is in potential mate selection vs. its real impact?<br>
Are shared interests more important than a shared racial background?<br>
Can people accurately predict their own perceived value in the dating market?<br>
In terms of getting a second date, is it better to be someone's first speed date of the night or their last?

<h2>Import Libraries</h2>

In [None]:
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import display
%matplotlib inline
pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

print('pandas version is {}.'.format(pd.__version__))
print('numpy version is {}.'.format(np.__version__))
print('scikit-learn version is {}.'.format(sklearn.__version__))
print('seaborn version is {}.'.format(sns.__version__))
print('matplotlib version is {}.'.format(matplotlib.__version__))

In [None]:
data = pd.read_csv("Speed Dating Data.csv")
print("This set has {} data points and {} features.".format(*data.shape))

<h1>Data Exploration</h1>

<h4>Samples for each Feature</h4>

In [None]:
import features_creator as fc #importing feature names made in file features_creator.py
fc.count_samples_in_features(data)#count samples for each feature
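
The helper `fc.count_samples_in_features` lives in features_creator.py, which is not shown here. As a rough sketch only, and assuming the helper reports how many non-missing samples each column has, the same information can be obtained with plain pandas. The function name `count_non_null` and the toy values below are hypothetical.

```python
import numpy as np
import pandas as pd


def count_non_null(df):
    """Return the number of non-missing samples per column, largest first."""
    return df.count().sort_values(ascending=False)


# Toy frame standing in for the speed dating data (values are made up).
toy = pd.DataFrame({
    'attr': [6.0, 7.0, np.nan, 5.0],
    'sinc': [8.0, np.nan, np.nan, 9.0],
    'dec': [1, 0, 0, 1],
})
counts = count_non_null(toy)
print(counts)
```

Sorting by count makes it easy to see which features have enough samples to be worth keeping in the feature space.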

<h4>Feature Space of Interest (with most samples available)</h4>

In [None]:
fc.count_samples_in_features(data[fc.feature_space])

<h4>Data Clean Up: Making Sure Features are within Range</h4>

In [None]:
fc.likert_scale_question_3(data)  # Likert scale runs 0-10, but some samples were rated 12; cap those ratings at 10
fc.scale_question_3(data)  # rescale from 0-10 to 0-100 and force the features to sum to 100
fc.scale_question_4(data)  # rescale from 0-10 to 0-100 and force the features to sum to 100
fc.scale_question_5(data)  # rescale from 0-10 to 0-100 and force the features to sum to 100
fc.scale_question_1(data)  # force the features to sum to 100 where they do not already
fc.scale_question_2(data)  # force the features to sum to 100 where they do not already
fc.scale_question_7(data)  # force the features to sum to 100 where they do not already
fc.scale_rating_received(data)  # rescale from 0-10 to 0-100 and force the features to sum to 100
fc.scale_rating_given(data)  # rescale from 0-10 to 0-100 and force the features to sum to 100
fc.scale_half_way(data)  # rescale feature 7_2 from 0-10 to 0-100 and force the features to sum to 100
fc.scale_half_way_2(data)  # rescale feature 7_3 from 0-10 to 0-100 and force the features to sum to 100
fc.scale_age(data)  # scale age to 0-1
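
The clean-up helpers above are defined in features_creator.py and are not shown. The sketch below illustrates the two ideas their comments describe: capping out-of-range ratings, and rescaling a group of attribute columns so each row sums to 100. The function names `clip_ratings` and `rescale_to_100`, the column names, and the toy values are assumptions for illustration, not the notebook's actual implementation.

```python
import pandas as pd


def clip_ratings(df, columns, low=0, high=10):
    """Clamp ratings to [low, high]; for example a stray 12 becomes 10."""
    out = df.copy()
    out[columns] = out[columns].clip(lower=low, upper=high)
    return out


def rescale_to_100(df, columns):
    """Rescale each row so that the given columns sum to 100."""
    out = df.copy()
    row_sum = out[columns].sum(axis=1)
    # A row that sums to zero cannot be rescaled; it is left as NaN.
    row_sum = row_sum.where(row_sum != 0)
    out[columns] = out[columns].div(row_sum, axis=0) * 100
    return out


# Toy ratings on a 0-10 scale, with one out-of-range value (12).
cols = ['attr3_1', 'sinc3_1', 'fun3_1']
toy = pd.DataFrame({
    'attr3_1': [8.0, 12.0],
    'sinc3_1': [6.0, 4.0],
    'fun3_1': [6.0, 4.0],
})
clipped = clip_ratings(toy, cols)
scaled = rescale_to_100(clipped, cols)
print(scaled)
```

Rescaling by the row sum puts questions answered on a 0-10 scale on the same footing as questions where participants distributed 100 points across attributes.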

<h4>Type Casting</h4>

In [None]:
fc.convert_income_to_float(data)  # income was imported as strings; convert to float
fc.convert_tuition_to_float(data)  # tuition was also imported as strings; convert to float
fc.zipcode_to_float(data)  # convert zipcode strings to float
fc.sat_to_float(data)  # convert SAT scores to float
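
The conversion helpers are not shown in this notebook. A minimal sketch of one way to do this kind of cast, assuming the string columns use thousands separators (for example '69,487.00'), is given below. The function name `currency_to_float` and the sample values are hypothetical.

```python
import pandas as pd


def currency_to_float(series):
    """Convert strings such as '69,487.00' to floats; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace(',', '', regex=False)
    return pd.to_numeric(cleaned, errors='coerce')


# Made-up values in the assumed format, including one missing entry.
toy_income = pd.Series(['69,487.00', '65,929.00', None, '1,200'])
income = currency_to_float(toy_income)
print(income)
```

Using `errors='coerce'` keeps missing or malformed entries as NaN so they can be handled later by `dropna`.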

<h4>Outlier Detection: Tukey's Method</h4>

In [None]:
index_to_be_removed = fc.outlier_detection(data[fc.feature_space[10:72]])  # rows flagged as outliers in at least 15 features
print(index_to_be_removed)
data.drop(data.index[index_to_be_removed], inplace = True)
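
`fc.outlier_detection` is defined in features_creator.py and is not shown. The sketch below is one possible reading of Tukey's method as used here: a value is an outlier if it lies more than 1.5 times the interquartile range below the first quartile or above the third, and a row is removed only if it is an outlier in at least a given number of features. The function name `tukey_outlier_rows`, its parameters, and the toy data are assumptions.

```python
from collections import Counter

import numpy as np
import pandas as pd


def tukey_outlier_rows(df, min_features=2, k=1.5):
    """Return positional row indices that are outliers in >= min_features columns."""
    hits = Counter()
    for column in df.columns:
        values = df[column].dropna()
        q1, q3 = np.percentile(values, [25, 75])
        step = k * (q3 - q1)
        # Comparisons with NaN are False, so missing values are never flagged.
        outside = (df[column] < q1 - step) | (df[column] > q3 + step)
        hits.update(np.flatnonzero(outside.to_numpy()))
    return sorted(int(i) for i, n in hits.items() if n >= min_features)


# Toy data: row 7 is extreme in two columns, row 4 in only one.
toy = pd.DataFrame({
    'a': [1, 2, 3, 2, 1, 2, 3, 100],
    'b': [5, 6, 5, 6, 5, 6, 5, -50],
    'c': [1, 1, 1, 1, 40, 1, 1, 1],
})
rows = tukey_outlier_rows(toy, min_features=2)
cleaned = toy.drop(toy.index[rows])
print(rows)
```

Requiring several features to agree before dropping a row avoids discarding a participant because of a single unusual answer.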

<h4>Basic Stats for Unique Females</h4>

In [None]:
#fc.dating_attributes_vs_time_describe(data = data, gender = 0)

<h4>Frequency Charts for Females</h4>

In [None]:
#fc.dating_attributes_vs_time_hist(data = data, gender = 0)

<h4>Basic Stats for Unique Males</h4>

In [None]:
#fc.dating_attributes_vs_time_describe(data = data, gender = 1)

<h4>Frequency Charts for Males</h4>

In [None]:
#fc.dating_attributes_vs_time_hist(data = data, gender = 1)

<h4>Scale Numerical features between 0 & 1</h4>

In [None]:
fc.scale_majority_of_features(data)#this function scales most features between 0 - 1
fc.scale_exphappy(data)
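
The scaling helpers are not shown. Assuming they apply ordinary min-max scaling, a minimal pandas sketch looks like the following. The function name `min_max_scale`, the column choice, and the toy values are hypothetical.

```python
import pandas as pd


def min_max_scale(df, columns):
    """Scale each listed column to the range 0-1 using its own min and max."""
    out = df.copy()
    col_min = out[columns].min()
    col_range = out[columns].max() - col_min
    # Constant columns have zero range; map them to NaN instead of dividing by zero.
    col_range = col_range.where(col_range != 0)
    out[columns] = (out[columns] - col_min) / col_range
    return out


toy = pd.DataFrame({
    'age': [21.0, 25.0, 30.0],
    'exphappy': [1.0, 5.0, 10.0],
})
scaled = min_max_scale(toy, ['age', 'exphappy'])
print(scaled)
```

scikit-learn's `MinMaxScaler` does the same job and is the better choice when the fitted minimum and maximum must be reused on new data.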

<h4>Correlation Heat Map</h4>

In [None]:
#fc.make_corr(data[fc.feature_space])

<h2>Forest Feature Selection: ExtraTreesClassifier & RandomForestClassifier</h2>

In [None]:
women_men = data[fc.all_space].copy()

In [None]:
women_men.dropna(axis = 0, how = 'any', inplace = True)

<h4>Both Genders</h4>

In [None]:
target_df = women_men['dec'].copy()
input_df = women_men[fc.feature_space].copy()

In [None]:
fc.forests(input_df, target_df)
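
`fc.forests` is not shown here. Going by the section title, it presumably fits an `ExtraTreesClassifier` and a `RandomForestClassifier` and ranks features by importance. The sketch below shows that idea on synthetic data; the helper name `rank_by_importance`, the hyperparameters, and the generated data set are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier


def rank_by_importance(model, X, y):
    """Fit a tree ensemble and return its feature importances, largest first."""
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False)


# Synthetic stand-in for input_df / target_df.
X_arr, y_arr = make_classification(n_samples=200, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=['f{}'.format(i) for i in range(6)])
y_toy = pd.Series(y_arr)

models = (ExtraTreesClassifier(n_estimators=100, random_state=0),
          RandomForestClassifier(n_estimators=100, random_state=0))
for model in models:
    ranking = rank_by_importance(model, X_toy, y_toy)
    print(type(model).__name__)
    print(ranking)
```

Comparing the two rankings is a quick sanity check: features that rank highly under both ensembles are more likely to be genuinely informative.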

<h4>Feature Selection: SelectKBest, F-Classifier</h4>

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
kBest = SelectKBest(f_classif, k = 'all')
kBest.fit_transform(input_df, target_df)
k_Best_features = [(j, i, k) for i, j, k in zip(input_df.keys(), kBest.scores_, kBest.pvalues_)]
k_Best_features.sort(reverse = True)  # highest F-score first
print('SelectKBest: f_classif')
for counter, i in enumerate(k_Best_features, 1):
    print('{} {}'.format(counter, i))

<h4>Create Array of Selected Features</h4>

In [None]:
features_selected = ['like', 'attr', 'intel', 'shar', 'sinc', 'amb', 'fun', 'prob']
print(features_selected)

<h4>Feature Selection: Chi2</h4>

In [None]:
from sklearn import preprocessing
new_input_df = input_df.copy()
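# chi2 requires non-negative inputs, so min-max scale int_corr (a correlation, which can be negative) to 0-1
new_input_df['int_corr'] = (new_input_df['int_corr'] - new_input_df['int_corr'].min()) / (new_input_df['int_corr'].max() - new_input_df['int_corr'].min())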
new_input_df.drop(labels = ['iid', 'gender', 'race', 'field_cd','career_c', 'goal', 'date', 'zipcode', 'imprelig', 'imprace', 'prob_o', 'met', 'go_out', 
                            'race_o', 'samerace','pid','order', 'met_o'], axis = 1, inplace = True)

In [None]:
new_input_df.keys()

In [None]:
from sklearn.feature_selection import chi2
kBest = SelectKBest(chi2, k = 'all')
kBest.fit_transform(new_input_df, target_df)
k_Best_features = [(j, i, k) for i, j, k in zip(new_input_df, kBest.scores_, kBest.pvalues_)]
k_Best_features.sort(reverse = True)  # highest chi2 score first
print('SelectKBest: chi2')
for counter, i in enumerate(k_Best_features, 1):
    print('{} {}'.format(counter, i))

<h4>Feature Normalization: PCA</h4>

In [None]:
new_input_df = input_df.copy()
new_input_df = preprocessing.normalize(new_input_df[features_selected].copy())
from sklearn.decomposition import PCA
for i in range(1, 9):
    pca = PCA(n_components = i)
    pca.fit(new_input_df)
    print('n = {}:'.format(i))
    print('Components: {}'.format(pca.components_))
    print('Explained Variance: {}'.format(pca.explained_variance_))
    print('Explained Ratio: {}\n\n'.format(pca.explained_variance_ratio_))
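
The loop above refits PCA for every component count. As a hedged alternative, a single fit with all components, followed by a cumulative sum of the explained variance ratios, is usually enough to decide how many components to keep. The synthetic data and the 95% threshold below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic stand-in: 8 columns driven by 2 latent factors plus small noise.
latent = rng.normal(size=(300, 2))
X = latent.dot(rng.normal(size=(2, 8))) + 0.05 * rng.normal(size=(300, 8))

pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print('Cumulative explained ratio:', cumulative)
print('Components needed for 95%:', n_needed)
```

On the real data, replacing `X` with the normalized `new_input_df` would give the same kind of summary for the eight selected features.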

<h4>PCA: Projections to 3 dimensions</h4>

In [None]:
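pca = PCA(n_components = 3)
projected = pca.fit_transform(new_input_df)  # project the normalized features onto the first 3 components
display(pd.DataFrame(projected, columns = ['pc1', 'pc2', 'pc3']))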

<h2>Create Matched People DataFrame</h2>

In [None]:
#people_matched = data[data['match'] == 1].copy()
#people_matched.drop_duplicates(subset = 'iid', keep = 'first', inplace = True)
#display(people_matched)

<h2>Exploring Matches</h2>

In [None]:
#people_matched[['iid', 'gender', 'dec'] + fc.features_of_attraction + fc.preferences_of_attraction + ['dec_o', 'pid', 'goal', 'int_corr', 'match']]

<h4>Get Index for 'iid' for non-matches</h4>

In [None]:
#number = [int(i) for i in people_matched['iid']]
#not_ever_matched = [i for i in range(1,553) if i not in number]
#print(not_ever_matched)

In [None]:
#people_not_matched = data[data['iid'].isin(not_ever_matched)].copy()

<h2>Exploring Non-Matches</h2>

In [None]:
#people_not_matched[['iid', 'gender', 'dec'] + fc.features_of_attraction + fc.preferences_of_attraction + ['dec_o', 'pid', 'goal', 'int_corr', 'match']]

<h4>Non-Matched Females: Graphs</h4>

In [None]:
#fc.dating_attributes_vs_time(data = people_not_matched, gender = 0)

<h4>Non-Matched Males: Graphs</h4>

In [None]:
#fc.dating_attributes_vs_time(data = people_not_matched, gender = 1)