<h1>Speed Dating: Who to Date Long Term</h1>

What influences love at first sight? (Or, at least, love in the first four minutes?) This dataset was compiled by Columbia Business School professors Ray Fisman and Sheena Iyengar for their paper Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment.<br>

Data was gathered from participants in experimental speed dating events held from 2002 to 2004. During the events, each attendee had a four-minute "first date" with every participant of the opposite sex. At the end of the four minutes, participants were asked whether they would like to see their date again. They were also asked to rate their date on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests.<br>

The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include: demographics, dating habits, self-perception across key attributes, beliefs on what others find valuable in a mate, and lifestyle information. See the Speed Dating Data Key document below for details.<br>

For more analysis from Iyengar and Fisman, read Racial Preferences in Dating.<br>

Data Exploration Ideas<br>

What are the least desirable attributes in a male partner? Does this differ for female partners?<br>
How important do people think attractiveness is in potential mate selection vs. its real impact?<br>
Are shared interests more important than a shared racial background?<br>
Can people accurately predict their own perceived value in the dating market?<br>
In terms of getting a second date, is it better to be someone's first speed date of the night or their last?

In [1]:
import pandas as pd
import numpy as np
import sklearn 
from IPython.display import display
%matplotlib inline
pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

print('pandas version is {}.'.format(pd.__version__))
print('numpy version is {}.'.format(np.__version__))
print('scikit-learn version is {}.'.format(sklearn.__version__))

pandas version is 0.18.0.
numpy version is 1.10.4.
scikit-learn version is 0.17.1.


In [2]:
data = pd.read_csv("Speed Dating Data.csv")
print("This set has {} data points and {} features.".format(*data.shape))

This set has 8378 data points and 195 features.


<h1>Data Exploration</h1>

<h4>Samples for each Feature</h4>

In [3]:
import features_creator as fc  # feature-name lists and helper functions defined in features_creator.py
fc.count_samples_in_features(data)  # count the samples available for each feature

	iid 8378 		id 8377 		gender 8378 		idg 8378 		condtn 8378 		wave 8378 		round 8378 		position 8378 		positin1 6532 		order 8378 		partner 8378 		pid 8368 		match 8378 		int_corr 8220 		samerace 8378 		age_o 8274 		race_o 8305 		pf_o_att 8289 		pf_o_sin 8289 		pf_o_int 8289 		pf_o_fun 8280 		pf_o_amb 8271 		pf_o_sha 8249 		dec_o 8378 		attr_o 8166 		sinc_o 8091 		intel_o 8072 		fun_o 8018 		amb_o 7656 		shar_o 7302 		like_o 8128 		prob_o 8060 		met_o 7993 		age 8283 		field 8315 		field_cd 8296 		undergra 4914 		mn_sat 3133 		tuition 3583 		race 8315 		imprace 8299 		imprelig 8299 		from 8299 		zipcode 7314 		income 4279 		goal 8299 		date 8281 		go_out 8299 		career 8289 		career_c 8240 		sports 8299 		tvsports 8299 		exercise 8299 		dining 8299 		museums 8299 		art 8299 		hiking 8299 		gaming 8299 		clubbing 8299 		reading 8299 		tv 8299 		theater 8299 		movies 8299 		concerts 8299 		music 8299 		shopping 8299 		yoga 8299 		exphappy 8277 		expnum 1800 		attr1_1 8299 		sinc1_1 8299 		
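The implementation of `fc.count_samples_in_features` is not shown in this notebook. A minimal sketch of an equivalent in plain pandas, assuming the helper simply reports the number of non-missing values per column, could look like this (the toy frame and the name `count_samples` are illustrative, not part of the original code):

```python
import numpy as np
import pandas as pd


def count_samples(frame):
    """Return the number of non-missing values in each column."""
    # DataFrame.count() skips NaN/None, so it counts usable samples only.
    return frame.count()


# Toy stand-in for the speed dating data (values are made up).
toy = pd.DataFrame({
    "iid": [1, 1, 2, 2],
    "age": [21.0, 21.0, np.nan, 24.0],
    "income": [np.nan, np.nan, np.nan, 50000.0],
})

counts = count_samples(toy)
for name, n in counts.items():
    print("{} {}".format(name, n))
```

Columns with low counts, such as `expnum` or `mn_sat` in the output above, are the ones that cost the most rows once incomplete rows are dropped.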

<h4>Scale Numerical Features Between 0 and 1</h4>

In [4]:
fc.likert_scale_question_3(data)  # Likert ratings should run 0-10, but some samples were rated 12; this call caps those ratings at 10
fc.scale_majority_of_features(data)  # scales most features to the range 0-1
fc.scale_exphappy(data)
fc.scale_question_4(data)  # different waves used different scales; each is scaled to 0-1 here
fc.convert_income_to_float(data)  # income was imported as strings; this call converts them to float
fc.convert_tuition_to_float(data)  # tuition was also imported as strings and is converted to float
fc.zipcode_to_float(data)
fc.sat_to_float(data)  # converts SAT scores to float
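The `fc` helpers called above live in `features_creator.py` and are not reproduced here. As a rough sketch of what such steps might do, the following caps out-of-range ratings, rescales them linearly to 0-1, and converts numeric strings to floats. The function names, and the assumption that values like income arrive as comma-separated strings, are hypothetical:

```python
import pandas as pd


def cap_and_scale(series, low=0.0, high=10.0):
    """Clip ratings to [low, high], then rescale linearly to [0, 1]."""
    clipped = series.clip(lower=low, upper=high)
    return (clipped - low) / (high - low)


def currency_to_float(series):
    """Convert strings such as '69,487.00' to floats (assumed input format)."""
    # Remove thousands separators; anything unparseable becomes NaN.
    cleaned = series.str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


ratings = pd.Series([0, 5, 10, 12], dtype=float)  # 12 is outside the 0-10 scale
scaled = cap_and_scale(ratings)
print(scaled.tolist())

income = pd.Series(["69,487.00", None, "1,200"])
income_float = currency_to_float(income)
print(income_float.tolist())
```

Capping before scaling keeps a stray rating of 12 from compressing every valid rating into a narrower range.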

<h2>Unique Profiles</h2>

In [5]:
unique = data.copy()
unique.drop_duplicates(subset = 'iid', keep = 'first', inplace = True)

<h3>Stats and Frequency Charts for Females</h3>

In [6]:
#fc.dating_attributes_vs_time_describe(data = unique, gender = 0)

In [7]:
#fc.dating_attributes_vs_time_hist(data = unique, gender = 0)

<h3>Stats and Frequency Charts for Males</h3>

In [8]:
#fc.dating_attributes_vs_time_describe(data = unique, gender = 1)

In [9]:
#fc.dating_attributes_vs_time_hist(data = unique, gender = 1)

<h2>Create Matched People DataFrame</h2>

In [10]:
#people_matched = data[data['match'] == 1].copy()
#people_matched.drop_duplicates(subset = 'iid', keep = 'first', inplace = True)
#display(people_matched)

<h2>Exploring Matches</h2>

In [11]:
#people_matched[['iid', 'gender', 'dec'] + fc.features_of_attraction + fc.preferences_of_attraction + ['dec_o', 'pid', 'goal', 'int_corr', 'match']]

<h2>Get Index for 'iid' for non-matches</h2>

In [12]:
#number = [int(i) for i in people_matched['iid']]
#not_ever_matched = [i for i in range(1,553) if i not in number]
#print not_ever_matched

In [13]:
#people_not_matched = data[data['iid'].isin(not_ever_matched)].copy()

<h2>Exploring Non-Matches</h2>

In [14]:
#people_not_matched[['iid', 'gender', 'dec'] + fc.features_of_attraction + fc.preferences_of_attraction + ['dec_o', 'pid', 'goal', 'int_corr', 'match']]

<h3>Non-Matched Females: Graphs</h3>

In [15]:
#fc.dating_attributes_vs_time(data = people_not_matched, gender = 0)

<h3>Non-Matched Males: Graphs</h3>

In [16]:
#fc.dating_attributes_vs_time(data = people_not_matched, gender = 1)

<h2>Dating Attributes as a function of Time: Distributing 100pts</h2>

<h3>Female Attributes</h3>

In [17]:
#fc.dating_attributes_vs_time_describe(unique[(unique['wave'] >= 6) & (unique['wave']<= 11)], 0)

<h3>Male Attributes</h3>

In [18]:
#fc.dating_attributes_vs_time_describe(unique[(unique['wave'] >= 6) & (unique['wave']<= 11)], 1)

<h2>Dating Attributes as a function of Time: Likert Scale</h2>

<h3>Female Attributes</h3>

In [19]:
#fc.dating_attributes_vs_time_describe(unique[(unique['wave'] >= 15) & (unique['wave']<= 20)], 0)

<h3>Male Attributes</h3>

In [20]:
#fc.dating_attributes_vs_time_describe(unique[(unique['wave'] >= 15) & (unique['wave']<= 20)], 1)

<h3>Female Subset</h3>

In [21]:
#women = data[data['gender'] == 0].copy()
#women_decision = women['dec'].copy()
#women.drop(['dec', 'dec_o', 'match'], axis = 1, inplace = True)

<h3>Male Subset</h3>

In [22]:
#men = data[data['gender'] == 1].copy()
#men_decision = men['dec'].copy()
#men.drop(['dec', 'dec_o', 'match'], axis = 1, inplace = True)

In [23]:
women_men = data[fc.all_space].copy()

In [24]:
women_men.dropna(axis = 0, how = 'any', inplace = True)

In [25]:
target_df = women_men['dec'].copy()
input_df = women_men[fc.feature_space].copy()
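The three cells above keep only complete rows and then split them into a target and an input matrix. A small sketch on made-up data shows the effect of `dropna(how='any')`; here `feature_cols` is a hypothetical stand-in for `fc.feature_space`:

```python
import numpy as np
import pandas as pd

# Toy frame: two of the four rows have a missing rating.
toy = pd.DataFrame({
    "dec": [1, 0, 1, 0],
    "attr": [7.0, np.nan, 6.0, 5.0],
    "fun": [8.0, 4.0, np.nan, 6.0],
})
feature_cols = ["attr", "fun"]  # hypothetical stand-in for fc.feature_space

# how="any" drops a row if even one of its columns is missing.
complete = toy.dropna(axis=0, how="any")

target = complete["dec"].copy()
inputs = complete[feature_cols].copy()
print("{} rows before, {} rows after".format(len(toy), len(complete)))
```

Because one missing value discards the whole row, sparsely populated columns included in the selection can shrink the usable sample considerably.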

<h2>ExtraTreesClassifier: Ensemble Learning</h2>

In [26]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
clf = ExtraTreesClassifier(random_state = 0)
clf = clf.fit(input_df, target_df)
model = SelectFromModel(clf, prefit=True)
input_df_new = model.transform(input_df)
print('Old Space is (number of samples, number of features) {}'.format(input_df.shape))
print('New Space is (number of samples, number of features) {}'.format(input_df_new.shape))

Old Space is (number of samples, number of features) (4771, 78)
New Space is (number of samples, number of features) (4771L, 13L)
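`model.transform` returns a plain array, so the names of the surviving columns are lost. One way to recover them is the selector's `get_support()` mask. The sketch below uses synthetic data, with made-up column names `a` to `e`, rather than the speed dating features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: only column "a" determines the label.
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(200, 5), columns=["a", "b", "c", "d", "e"])
y = (X["a"] > 0.5).astype(int)

clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)
selector = SelectFromModel(clf, prefit=True)

# get_support() is a boolean mask over the input columns.
mask = selector.get_support()
selected = list(X.columns[mask])
print(selected)
```

Applied to this notebook, the same mask indexed into `input_df.columns` would name the 13 retained features.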


<h4>Pair and Rank Feature Importances</h4>

In [27]:
# Pair each importance with its feature name and rank from most to least important
tuple_holder = sorted(zip(clf.feature_importances_, fc.feature_space), reverse=True)
for pair in tuple_holder:
    print(pair)

(0.08719934203192059, 'like')
(0.073517231710954367, 'attr')
(0.060677190393373673, 'shar')
(0.046856854339198929, 'fun')
(0.02411931703354029, 'prob')
(0.018639943286502689, 'prob_o')
(0.017201097446884181, 'sinc')
(0.014566796487924503, 'amb')
(0.014147512706396553, 'intel')
(0.013335115209298471, 'pf_o_sha')
(0.012938283403008424, 'attr_o')
(0.012902498216996561, 'order')
(0.012869007070778366, 'pf_o_att')
(0.012386790154050432, 'pid')
(0.012347604159748973, 'sinc_o')
(0.012193901200407257, 'shar_o')
(0.011861872727354263, 'int_corr')
(0.01152917159852892, 'intel_o')
(0.011287245516955893, 'go_out')
(0.011242624625603683, 'like_o')
(0.011221350273296996, 'age_o')
(0.011103299789043859, 'pf_o_amb')
(0.011091794702037349, 'pf_o_int')
(0.010966972559424699, 'pf_o_sin')
(0.010963403747974337, 'amb_o')
(0.010904888922498617, 'gender')
(0.010830639098182219, 'fun_o')
(0.010639848890732799, 'pf_o_fun')
(0.010408986176807347, 'exphappy')
(0.010383900021382888, 'intel2_1')
(0.010352712322308
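The ranking can be checked against the 13 features kept by `SelectFromModel`. With no explicit `threshold`, the selector keeps features whose importance is at least the mean importance; since tree-ensemble importances sum to 1, the mean over 78 features is 1/78, about 0.0128. The sketch below assumes that default applied here and uses the first 14 pairs copied from the output above:

```python
# (importance, feature) pairs copied from the top of the ranking.
top = [
    (0.08719934203192059, 'like'),
    (0.073517231710954367, 'attr'),
    (0.060677190393373673, 'shar'),
    (0.046856854339198929, 'fun'),
    (0.02411931703354029, 'prob'),
    (0.018639943286502689, 'prob_o'),
    (0.017201097446884181, 'sinc'),
    (0.014566796487924503, 'amb'),
    (0.014147512706396553, 'intel'),
    (0.013335115209298471, 'pf_o_sha'),
    (0.012938283403008424, 'attr_o'),
    (0.012902498216996561, 'order'),
    (0.012869007070778366, 'pf_o_att'),
    (0.012386790154050432, 'pid'),
]

n_features = 78                # width of the old feature space
threshold = 1.0 / n_features   # mean importance when importances sum to 1

kept = [name for score, name in top if score >= threshold]
print(len(kept))
print(kept)
```

The cut falls between `pf_o_att` and `pid`, which matches the 13 columns in the new space.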

<h2>Look at Which Unused Features Have More than 7000 Samples</h2>

In [28]:
for i, j in zip(data.keys(), data.count()):
    if j > 7000:
        if i not in fc.all_space:
            print '\t', i, j, '\t',

	id 8377 		idg 8378 		condtn 8378 		wave 8378 		round 8378 		position 8378 		partner 8378 		field 8315 		from 8299 		career 8289 		match_es 7205 		satis_2 7463 		length 7463 		numdat_2 7433 		attr1_2 7445 		sinc1_2 7463 		intel1_2 7463 		fun1_2 7463 		amb1_2 7463 		shar1_2 7463 		attr3_2 7463 		sinc3_2 7463 		intel3_2 7463 		fun3_2 7463 		amb3_2 7463 	
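The same filter can be written without an explicit loop by working on the result of `count()` directly. This is a sketch on toy data; the function name `well_populated_unused` and the column names are illustrative:

```python
import numpy as np
import pandas as pd


def well_populated_unused(frame, used, min_count):
    """Non-null counts for columns above min_count that are not in `used`."""
    counts = frame.count()
    unused = ~counts.index.isin(used)
    return counts[(counts > min_count) & unused]


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],          # well populated, already used
    "b": [np.nan, np.nan, np.nan, 1.0],  # too sparse
    "c": [5.0, 6.0, 7.0, 8.0],          # well populated, not yet used
})

candidates = well_populated_unused(toy, used=["a"], min_count=2)
print(candidates)
```

In the notebook, `used` would correspond to `fc.all_space` and `min_count` to 7000.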

In [29]:
#for i, j in fc.data_cleaner.iteritems():
#    print i, j, '\n'
#for i, j in fc.master_list.items():
#    print i, j, '\n'
#print 'clean_up_1', '\n', fc.clean_up_1, '\n'
#print 'clean_up_2', '\n', fc.clean_up_2, '\n'
#print 'clean_up_3', '\n', fc.clean_up_3, '\n'
#print 'clean_up_4', '\n', fc.clean_up_4, '\n'
#print 'clean_up_5', '\n', fc.clean_up_5, '\n'
#print 'features_of_attraction', '\n', fc.features_of_attraction, '\n'
#print 'actual_decisions', '\n', fc.actual_decisions, '\n'
#print 'preferences_of_attraction', '\n', fc.preferences_of_attraction, '\n'
#print 'rating_by_partner_features', '\n', fc.rating_by_partner_features, '\n'
#print 'halfway_questions', '\n', fc.halfway_questions, '\n'
#print 'interests', '\n', fc.interests, '\n'
#print 'list_of_lists', '\n', fc.list_of_lists, '\n'
#print 'all columns in dataset', '\n'
#for i in data.keys():
#    print i,
to_drop = [i for i in data.keys() if i not in fc.all_space]
#print '\n'*2, 'to_drop', '\n', to_drop, '\n'