# Block 2 - Exploratory, descriptive and inferential data analysis - Speed Dating with Tinder

## Introduction

Tinder is an online dating and geosocial networking application launched in 2012. It is the world’s most popular application for meeting new people. The application is available in over 190 countries and 56 languages. It has been downloaded more than 530 million times and led to more than 75 billion matches. In 2023, Tinder had 10.9 million subscribers and 75 million monthly active users. 

Tinder matches users based on geographic proximity. In Tinder, users "swipe right" to like or "swipe left" to dislike other users' profiles, which include their photos, a short bio, and a list of their interests. Tinder uses a "double opt-in" system where both users must like each other (match) before they can exchange messages.

### Problem statement

In 2002, Tinder experienced a decrease in the number of matches.

The company would like to understand the underlying cause of this drop.

### Scope

To better understand this decline in matches, the Tinder marketing team decided to focus on the profile of potential users in order to identify the key features that influence matching.

They ran a speed dating experiment, split into 21 experimental speed dating events (waves) from 2002 to 2004. During the events, attendees had a four-minute "first date" with every other participant of the opposite sex. At the end of each date, participants were asked whether they would like to see the other person again.

Participants had to give Tinder extensive information about themselves that could ultimately be reflected in their dating profile on the application. Information was collected in several surveys before, during, and after the event. The dataset includes information on each speed date, as well as information on demographics, dating habits, lifestyle, self-perception across key attributes, and beliefs about what others find valuable in a mate.

### Aim and objectives

Overall aim: Understand what makes people interested in each other.

Objectives:
- 1 - Evaluate the importance and impact of partner's attributes in decision making.
- 2 - Evaluate the impact of shared interests between partners.
- 3 - Evaluate the importance and impact of shared ethnicity between partners.
- 4 - Evaluate the importance and impact of self-esteem for obtaining a like and a match.
- 5 - Evaluate the importance and impact of shared dating goals between partners.

##
## Methods

### 1 - Library import

### 2 - File reading and basic exploration

The dataset is composed of records of various information on 8378 speed dates performed in 21 waves with 551 individual subjects. It contains 195 features.

The initial inspection of the dataset revealed that the data key was not always reliable, that categorical features were encoded, and that the dataset contained many missing values, some outliers, some values outside the stated scale of the features, and some numeric values stored as strings. Moreover, different scales were used between waves for the rating of some attributes.

Therefore, the dataset needed preprocessing to make it suitable for further analysis.
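One of the issues listed above, numeric values stored as strings, can be screened for automatically. The following is a minimal sketch on a toy table, not the notebook's own method: the helper `find_numeric_strings`, its `threshold` parameter, and the assumption that thousands separators are commas are all illustrative.

```python
import pandas as pd

def find_numeric_strings(df, threshold=0.9):
    """Return non-numeric columns whose non-null values mostly parse as numbers.

    Hypothetical helper: assumes that ',' is only used as a thousands separator.
    """
    flagged = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # already numeric, nothing to check
        values = df[col].dropna().astype(str).str.replace(",", "", regex=False)
        if values.empty:
            continue
        # values that cannot be parsed become NaN
        parsed = pd.to_numeric(values, errors="coerce")
        if parsed.notna().mean() >= threshold:
            flagged.append(col)
    return flagged

# toy data mimicking the kind of columns found in the dataset
toy = pd.DataFrame({
    "income": ["69,487.00", "55,080.00", None],
    "field": ["Law", "Business", "Finance"],
    "age": [21.0, 24.0, 25.0],
})
numeric_strings = find_numeric_strings(toy)
print(numeric_strings)
```

Columns flagged this way still need a manual check before conversion, since the threshold only says that most values look numeric.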

### 3 - Preprocessing

According to what was observed, and according to the structure of the dataset, a first preprocessing was performed with the following steps:
- 1 - Assessment of discrepancies within each subject-related feature, per subject.
- 2 - Corrections on discrepancies in id and positin1 features.
- 3 - Renaming of the content of categorical features (according to the data key, when possible).
- 4 - Ad hoc corrections on numeric features stored as strings.
- 5 - Assessment of the quality of the survey ratings and ad hoc corrections, rescaling.
- 6 - Correction of values that were rescaled.
- 7 - Identification and treatment of outliers.
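The rescaling step can be illustrated as follows. This is only a sketch under a stated assumption: the text does not specify how rescaling was done, so the example assumes that point allocations across a group of attributes are brought to a common total of 100. The helper `rescale_to_100` and the toy values are hypothetical.

```python
import pandas as pd

def rescale_to_100(df, columns):
    """Rescale each row so that the given columns sum to 100 (assumed target scale)."""
    out = df.copy()
    row_sum = out[columns].sum(axis=1)
    # rows with a zero or fully missing total cannot be rescaled: they become NaN
    row_sum = row_sum.where(row_sum > 0)
    out[columns] = out[columns].div(row_sum, axis=0) * 100
    return out

cols = ["attr1_1", "sinc1_1", "intel1_1"]
toy = pd.DataFrame({
    "attr1_1": [15.0, 8.0],   # first row already sums to 100
    "sinc1_1": [20.0, 6.0],   # second row uses a different scale (total of 20)
    "intel1_1": [65.0, 6.0],
})
rescaled = rescale_to_100(toy, cols)
print(rescaled)
```

After rescaling, rows that were rated on different scales become comparable because they all express a share of the same total.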

Since the dataset covered dates between 551 subjects within 21 waves, it contained a lot of redundant information (speed dates are reciprocal). Moreover, not all subjects had the same number of speed dates, so the dataset was biased towards subjects who had more dates. For these reasons, the dataset was reduced to a second dataset, with data grouped by subject, which was used to plot demographic information.
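The reduction to one row per subject can be sketched as below. The toy table and the choice of `subject_columns` are illustrative assumptions; the notebook's actual reduction may keep a different set of columns.

```python
import pandas as pd

# toy data: one row per speed date, subject-level info repeated on every row
toy = pd.DataFrame({
    "iid": [1, 1, 1, 2, 2],
    "gender": [0, 0, 0, 1, 1],
    "age": [21.0, 21.0, 21.0, 24.0, 24.0],
    "pid": [11, 12, 13, 1, 3],  # partner-specific, not kept in the reduced table
})

# keep only subject-level columns, then keep the first row of each subject
subject_columns = ["iid", "gender", "age"]
subjects = (
    toy[subject_columns]
    .drop_duplicates(subset="iid")
    .reset_index(drop=True)
)
print(subjects)
```

Each subject now contributes exactly one row, so subjects with many dates no longer weigh more in demographic plots.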

### 4 - Overview 1 - Quality of the speed dating experiment (see Figure 1)

The dataset contained recordings of information given by the subjects at several time points of the experiment. Before analysing the data, the quality of the experiment was assessed. 

The quality of surveys was assessed by examining the proportion of fully and partially filled surveys (some general info on subjects may be missing, some subjects may not have filled all the surveys). The effectiveness of the dating process was assessed by examining the proportion of likes, matches, and resulting actual dates. The numbers of likes and matches relative to the order of the speed dates were graphically represented.
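The proportions of likes and matches described above could be computed along these lines. This is a sketch on toy data that assumes the raw 0/1 coding of the `dec` (decision) and `match` columns, before any renaming of categories.

```python
import pandas as pd

# toy data: one row per subject and speed date
toy = pd.DataFrame({
    "order": [1, 1, 2, 2, 3, 3],   # position of the date within the evening
    "dec":   [1, 0, 1, 1, 0, 0],   # 1 = the subject liked the partner
    "match": [0, 0, 1, 1, 0, 0],   # 1 = both partners liked each other
})

# overall percentages of likes and matches
like_rate = toy["dec"].mean() * 100
match_rate = toy["match"].mean() * 100

# same percentages broken down by the order of the speed date
by_order = toy.groupby("order")[["dec", "match"]].mean() * 100
print(like_rate, match_rate)
print(by_order)
```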

### 5 - Overview 2 - Demographics of the cohort (see Figure 2)

The original dataset being biased towards subjects that had more dates than others, the reduced data was used to plot demographics of the cohort (i.e. not demographics of the experiment itself).

### 6 - Analysis 1 - Importance and impact of partner's attributes (see Figure 3)

Subjects were asked to rate a series of attributes that they desire to find in a partner for a successful date. The importance of attributes was compared between males and females to identify partner expectations before a speed date.

For females, intelligence seems to be the most desired attribute, while it seems to be attractiveness for males. This result was compared to the real impact of these attributes on giving a like to the partner.

### 7 - Analysis 2 - Impact of shared interests (see Figure 4)

Dating apps often ask people about their interest in given activities, not only so that people can get to know each other, but also as a means to start a conversation when a match occurs. However, the impact of shared interests on getting a match has not been proven.

Shared interests are the second least desired attribute when meeting a potential partner, for both females and males. To assess whether they actually affect the decision to give a like after a speed date, the Euclidean distance between the partners' activity ratings was calculated and the two groups (no like versus like) were compared, separately for females and for males. The impact on getting a match was also assessed.
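The distance between partners can be sketched as follows. The example uses a subset of the activity columns and toy ratings, and computes the distance with NumPy rather than `scipy.spatial`; the tables `subjects` and `dates` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

activities = ["sports", "museums", "hiking"]  # subset of the activity columns

# toy per-subject interest ratings, indexed by subject identifier
subjects = pd.DataFrame(
    {"sports": [9, 2, 8], "museums": [1, 9, 2], "hiking": [5, 5, 1]},
    index=pd.Index([1, 11, 12], name="iid"),
)

# toy speed dates: subject, partner, and the subject's decision
dates = pd.DataFrame({"iid": [1, 1], "pid": [11, 12], "dec": [0, 1]})

# look up the interest vectors of both partners for every date
own = subjects.loc[dates["iid"], activities].to_numpy(dtype=float)
partner = subjects.loc[dates["pid"], activities].to_numpy(dtype=float)

# Euclidean distance: square root of the summed squared differences
dates["interest_distance"] = np.linalg.norm(own - partner, axis=1)
print(dates)
```

A small distance means the two partners rated the activities similarly; the distances can then be compared between the "no like" and "like" groups.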

### 8 - Analysis 3 - Importance and impact of shared ethnicity (see Figure 5)

Shared ethnicity might matter for giving a like and getting a match, and its importance might differ among ethnic groups. Preferences were compared among ethnic groups. The impact of shared ethnicity on giving a like and getting a match was displayed for each ethnicity.
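A per-ethnicity comparison of this kind can be obtained with a pivot table. The sketch below uses toy data and assumes 0/1 coding for `samerace` and `dec`; the ethnicity labels are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "race": ["Asian", "Asian", "Caucasian", "Caucasian", "Caucasian", "Asian"],
    "samerace": [1, 0, 1, 1, 0, 0],  # 1 = partner has the same ethnicity
    "dec": [1, 0, 1, 0, 0, 1],       # 1 = the subject liked the partner
})

# percentage of likes given, per ethnicity, split by shared versus different ethnicity
like_rate = toy.pivot_table(
    index="race", columns="samerace", values="dec", aggfunc="mean"
) * 100
print(like_rate)
```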

### 9 - Analysis 4 - Importance and impact of self-esteem (see Figure 6)

Self-esteem is defined as the ability of subjects to grade their own attributes at least as highly as their speed date partners grade them for the same attributes. It is measured as the mean of the differences between these two grades.

The measure of self-esteem was compared between females and males. Its impact on getting a like from the partner and on getting a match was assessed.
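The self-esteem measure can be sketched as below. Two points are assumptions rather than facts from the text: the difference is taken as self grade minus partner grade (so that positive values mean the subject rates themselves higher), and the self grades are taken from the `*3_1` columns while the partner grades come from the `*_o` columns. Only three attributes are shown.

```python
import pandas as pd

self_cols = ["attr3_1", "sinc3_1", "intel3_1"]   # grades the subject gives themselves
partner_cols = ["attr_o", "sinc_o", "intel_o"]   # grades received from the partner

# toy data: one row per speed date
toy = pd.DataFrame({
    "attr3_1": [6.0, 8.0], "sinc3_1": [8.0, 7.0], "intel3_1": [8.0, 9.0],
    "attr_o": [7.0, 6.0], "sinc_o": [8.0, 7.0], "intel_o": [9.0, 8.0],
})

# attribute-wise difference, then mean across attributes for each date
# (a missing grade in any attribute yields NaN for that date)
differences = toy[self_cols].to_numpy() - toy[partner_cols].to_numpy()
toy["self_esteem"] = differences.mean(axis=1)
print(toy["self_esteem"])
```

With this sign convention, a negative value corresponds to a subject who under-estimates their attributes relative to the partner's grades.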

### 10 - Analysis 5 - Importance and impact of the dating goal (see Figure 7)

Having shared goals might be of crucial importance to find the right partner. Answers given by the subjects about their primary goal in dating were compared between females and males. The number of likes given for each category was also compared between genders, as well as the proportion of matches per category.

The impact of having shared goals was assessed by calculating the proportion of likes given to a partner of the same category over the total number of likes given by that category. The same logic was followed to assess the impact of shared goals on matches.
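The proportion described above can be sketched on toy data. The column `goal_o` (the partner's goal) is hypothetical: it is not part of the dataset and would have to be obtained by joining each date with the partner's own record. The category labels are also illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "goal":   ["Fun", "Fun", "Fun", "Fun", "Serious", "Serious"],
    "goal_o": ["Fun", "Fun", "Serious", "Fun", "Serious", "Fun"],  # hypothetical column
    "dec":    [1, 1, 1, 0, 1, 1],  # 1 = the subject liked the partner
})

# keep only the dates where a like was given
likes = toy[toy["dec"] == 1]

# for each goal category: likes given to a partner of the same category,
# as a percentage of all likes given by that category
same_goal = likes["goal"] == likes["goal_o"]
shared_share = same_goal.groupby(likes["goal"]).mean() * 100
print(shared_share)
```

The same computation applies to matches by filtering on the match column instead of the decision column.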

##
## Conclusion

The experiment carried out by Tinder was quite comprehensive, and many hypotheses which would explain the decrease in the number of matches could have been tested. To better understand why people match or not, this project gave insights about the importance and impact on matching for several key attributes of the speed dating partners, including their background, qualities, hobbies, self-esteem, and dating goals.

The results of this project show that matches are influenced by multiple factors.
- Regarding background, it appears that some ethnicities, such as Caucasians, are more prone than others to match with people of the same ethnicity.
- Regarding qualities, women give statistically more likes to men whom they rate as intelligent, while men give statistically more likes to women whom they rate as attractive.
- Regarding shared interests, the divergence of interests between speed dating partners does not explain a difference in the number of matches.
- Regarding self-esteem, people who under-estimate their qualities (in comparison to the grades given by their partners) are statistically less likely to receive a like or to obtain a match.
- Regarding dating goals, it appears that people with a light dating goal tend to match better with one another.

To improve the number of matches, it might be useful for Tinder to optimise the information displayed first on the screen, especially since people do not necessarily read the entire profile of their potential future date. For example, replacing shared hobbies with shared goals could encourage users to read further, or at least help them make quick decisions.

##
## Code

### 1 - Library import

In [5]:
### 1 - library import ### ----

import pandas as pd
import numpy as np

import scipy.stats as stats
import scipy.spatial as spatial
from scipy.optimize import curve_fit

import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots


###
### 2 - File reading and basic exploration

In [6]:
### 2 - file reading and basic exploration - import dataset ### ----

# define path
path = "/Users/celinemartineau/Desktop/Fullstack/11_projects_for_certification/bloc2_tinder/"

# read data
data = pd.read_csv(path + "data/Speed_Dating_Data.csv", encoding = "ISO-8859-1")

# WARNING! choose the following option for testing yourself
# data = pd.read_csv("cnm_bloc2_data.csv", encoding = "ISO-8859-1")


In [7]:
### 2 - file reading and basic exploration - get basic stats ### ----

# print shape of data
print("Number of rows: {}".format(data.shape[0]))
print("Number of columns: {}".format(data.shape[1]))
print()

# display dataset
pd.set_option('display.max_columns', None)
print("Dataset display: ")
display(data.head())
print()

# display basic statistics
print("Basics statistics: ")
data_desc = data.describe(include='all')
display(data_desc)
print()


Number of rows: 8378
Number of columns: 195

Dataset display: 


Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,match,int_corr,samerace,age_o,race_o,pf_o_att,pf_o_sin,pf_o_int,pf_o_fun,pf_o_amb,pf_o_sha,dec_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o,age,field,field_cd,undergra,mn_sat,tuition,race,imprace,imprelig,from,zipcode,income,goal,date,go_out,career,career_c,sports,tvsports,exercise,dining,museums,art,hiking,gaming,clubbing,reading,tv,theater,movies,concerts,music,shopping,yoga,exphappy,expnum,attr1_1,sinc1_1,intel1_1,fun1_1,amb1_1,shar1_1,attr4_1,sinc4_1,intel4_1,fun4_1,amb4_1,shar4_1,attr2_1,sinc2_1,intel2_1,fun2_1,amb2_1,shar2_1,attr3_1,sinc3_1,fun3_1,intel3_1,amb3_1,attr5_1,sinc5_1,intel5_1,fun5_1,amb5_1,dec,attr,sinc,intel,fun,amb,shar,like,prob,met,match_es,attr1_s,sinc1_s,intel1_s,fun1_s,amb1_s,shar1_s,attr3_s,sinc3_s,intel3_s,fun3_s,amb3_s,satis_2,length,numdat_2,attr7_2,sinc7_2,intel7_2,fun7_2,amb7_2,shar7_2,attr1_2,sinc1_2,intel1_2,fun1_2,amb1_2,shar1_2,attr4_2,sinc4_2,intel4_2,fun4_2,amb4_2,shar4_2,attr2_2,sinc2_2,intel2_2,fun2_2,amb2_2,shar2_2,attr3_2,sinc3_2,intel3_2,fun3_2,amb3_2,attr5_2,sinc5_2,intel5_2,fun5_2,amb5_2,you_call,them_cal,date_3,numdat_3,num_in_3,attr1_3,sinc1_3,intel1_3,fun1_3,amb1_3,shar1_3,attr7_3,sinc7_3,intel7_3,fun7_3,amb7_3,shar7_3,attr4_3,sinc4_3,intel4_3,fun4_3,amb4_3,shar4_3,attr2_3,sinc2_3,intel2_3,fun2_3,amb2_3,shar2_3,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1,1.0,0,1,1,1,10,7,,4,1,11.0,0,0.14,0,27.0,2.0,35.0,20.0,20.0,20.0,0.0,5.0,0,6.0,8.0,8.0,8.0,8.0,6.0,7.0,4.0,2.0,21.0,Law,1.0,,,,4.0,2.0,4.0,Chicago,60521,69487.0,2.0,7.0,1.0,lawyer,,9.0,2.0,8.0,9.0,1.0,1.0,5.0,1.0,5.0,6.0,9.0,1.0,10.0,10.0,9.0,8.0,1.0,3.0,2.0,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,35.0,20.0,15.0,20.0,5.0,5.0,6.0,8.0,8.0,8.0,7.0,,,,,,1,6.0,9.0,7.0,7.0,6.0,5.0,7.0,6.0,2.0,4.0,,,,,,,,,,,,6.0,2.0,1.0,,,,,,,19.44,16.67,13.89,22.22,11.11,16.67,,,,,,,,,,,,,6.0,7.0,8.0,7.0,6.0,,,,,,1.0,1.0,0.0,,,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,,,,,,,,,,,,,5.0,7.0,7.0,7.0,7.0,,,,,
1,1,1.0,0,1,1,1,10,7,,3,2,12.0,0,0.54,0,22.0,2.0,60.0,0.0,0.0,40.0,0.0,0.0,0,7.0,8.0,10.0,7.0,7.0,5.0,8.0,4.0,2.0,21.0,Law,1.0,,,,4.0,2.0,4.0,Chicago,60521,69487.0,2.0,7.0,1.0,lawyer,,9.0,2.0,8.0,9.0,1.0,1.0,5.0,1.0,5.0,6.0,9.0,1.0,10.0,10.0,9.0,8.0,1.0,3.0,2.0,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,35.0,20.0,15.0,20.0,5.0,5.0,6.0,8.0,8.0,8.0,7.0,,,,,,1,7.0,8.0,7.0,8.0,5.0,6.0,7.0,5.0,1.0,4.0,,,,,,,,,,,,6.0,2.0,1.0,,,,,,,19.44,16.67,13.89,22.22,11.11,16.67,,,,,,,,,,,,,6.0,7.0,8.0,7.0,6.0,,,,,,1.0,1.0,0.0,,,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,,,,,,,,,,,,,5.0,7.0,7.0,7.0,7.0,,,,,
2,1,1.0,0,1,1,1,10,7,,10,3,13.0,1,0.16,1,22.0,4.0,19.0,18.0,19.0,18.0,14.0,12.0,1,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,1.0,21.0,Law,1.0,,,,4.0,2.0,4.0,Chicago,60521,69487.0,2.0,7.0,1.0,lawyer,,9.0,2.0,8.0,9.0,1.0,1.0,5.0,1.0,5.0,6.0,9.0,1.0,10.0,10.0,9.0,8.0,1.0,3.0,2.0,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,35.0,20.0,15.0,20.0,5.0,5.0,6.0,8.0,8.0,8.0,7.0,,,,,,1,5.0,8.0,9.0,8.0,5.0,7.0,7.0,,1.0,4.0,,,,,,,,,,,,6.0,2.0,1.0,,,,,,,19.44,16.67,13.89,22.22,11.11,16.67,,,,,,,,,,,,,6.0,7.0,8.0,7.0,6.0,,,,,,1.0,1.0,0.0,,,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,,,,,,,,,,,,,5.0,7.0,7.0,7.0,7.0,,,,,
3,1,1.0,0,1,1,1,10,7,,5,4,14.0,1,0.61,0,23.0,2.0,30.0,5.0,15.0,40.0,5.0,5.0,1,7.0,8.0,9.0,8.0,9.0,8.0,7.0,7.0,2.0,21.0,Law,1.0,,,,4.0,2.0,4.0,Chicago,60521,69487.0,2.0,7.0,1.0,lawyer,,9.0,2.0,8.0,9.0,1.0,1.0,5.0,1.0,5.0,6.0,9.0,1.0,10.0,10.0,9.0,8.0,1.0,3.0,2.0,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,35.0,20.0,15.0,20.0,5.0,5.0,6.0,8.0,8.0,8.0,7.0,,,,,,1,7.0,6.0,8.0,7.0,6.0,8.0,7.0,6.0,2.0,4.0,,,,,,,,,,,,6.0,2.0,1.0,,,,,,,19.44,16.67,13.89,22.22,11.11,16.67,,,,,,,,,,,,,6.0,7.0,8.0,7.0,6.0,,,,,,1.0,1.0,0.0,,,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,,,,,,,,,,,,,5.0,7.0,7.0,7.0,7.0,,,,,
4,1,1.0,0,1,1,1,10,7,,7,5,15.0,1,0.21,0,24.0,3.0,30.0,10.0,20.0,10.0,10.0,20.0,1,8.0,7.0,9.0,6.0,9.0,7.0,8.0,6.0,2.0,21.0,Law,1.0,,,,4.0,2.0,4.0,Chicago,60521,69487.0,2.0,7.0,1.0,lawyer,,9.0,2.0,8.0,9.0,1.0,1.0,5.0,1.0,5.0,6.0,9.0,1.0,10.0,10.0,9.0,8.0,1.0,3.0,2.0,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,35.0,20.0,15.0,20.0,5.0,5.0,6.0,8.0,8.0,8.0,7.0,,,,,,1,5.0,6.0,7.0,7.0,6.0,6.0,6.0,6.0,2.0,4.0,,,,,,,,,,,,6.0,2.0,1.0,,,,,,,19.44,16.67,13.89,22.22,11.11,16.67,,,,,,,,,,,,,6.0,7.0,8.0,7.0,6.0,,,,,,1.0,1.0,0.0,,,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,,,,,,,,,,,,,5.0,7.0,7.0,7.0,7.0,,,,,



Basics statistics: 


Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,match,int_corr,samerace,age_o,race_o,pf_o_att,pf_o_sin,pf_o_int,pf_o_fun,pf_o_amb,pf_o_sha,dec_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o,age,field,field_cd,undergra,mn_sat,tuition,race,imprace,imprelig,from,zipcode,income,goal,date,go_out,career,career_c,sports,tvsports,exercise,dining,museums,art,hiking,gaming,clubbing,reading,tv,theater,movies,concerts,music,shopping,yoga,exphappy,expnum,attr1_1,sinc1_1,intel1_1,fun1_1,amb1_1,shar1_1,attr4_1,sinc4_1,intel4_1,fun4_1,amb4_1,shar4_1,attr2_1,sinc2_1,intel2_1,fun2_1,amb2_1,shar2_1,attr3_1,sinc3_1,fun3_1,intel3_1,amb3_1,attr5_1,sinc5_1,intel5_1,fun5_1,amb5_1,dec,attr,sinc,intel,fun,amb,shar,like,prob,met,match_es,attr1_s,sinc1_s,intel1_s,fun1_s,amb1_s,shar1_s,attr3_s,sinc3_s,intel3_s,fun3_s,amb3_s,satis_2,length,numdat_2,attr7_2,sinc7_2,intel7_2,fun7_2,amb7_2,shar7_2,attr1_2,sinc1_2,intel1_2,fun1_2,amb1_2,shar1_2,attr4_2,sinc4_2,intel4_2,fun4_2,amb4_2,shar4_2,attr2_2,sinc2_2,intel2_2,fun2_2,amb2_2,shar2_2,attr3_2,sinc3_2,intel3_2,fun3_2,amb3_2,attr5_2,sinc5_2,intel5_2,fun5_2,amb5_2,you_call,them_cal,date_3,numdat_3,num_in_3,attr1_3,sinc1_3,intel1_3,fun1_3,amb1_3,shar1_3,attr7_3,sinc7_3,intel7_3,fun7_3,amb7_3,shar7_3,attr4_3,sinc4_3,intel4_3,fun4_3,amb4_3,shar4_3,attr2_3,sinc2_3,intel2_3,fun2_3,amb2_3,shar2_3,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
count,8378.0,8377.0,8378.0,8378.0,8378.0,8378.0,8378.0,8378.0,6532.0,8378.0,8378.0,8368.0,8378.0,8220.0,8378.0,8274.0,8305.0,8289.0,8289.0,8289.0,8280.0,8271.0,8249.0,8378.0,8166.0,8091.0,8072.0,8018.0,7656.0,7302.0,8128.0,8060.0,7993.0,8283.0,8315,8296.0,4914,3133.0,3583.0,8315.0,8299.0,8299.0,8299,7314.0,4279.0,8299.0,8281.0,8299.0,8289,8240.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8277.0,1800.0,8299.0,8299.0,8299.0,8289.0,8279.0,8257.0,6489.0,6489.0,6489.0,6489.0,6489.0,6467.0,8299.0,8299.0,8299.0,8299.0,8289.0,8289.0,8273.0,8273.0,8273.0,8273.0,8273.0,4906.0,4906.0,4906.0,4906.0,4906.0,8378.0,8176.0,8101.0,8082.0,8028.0,7666.0,7311.0,8138.0,8069.0,8003.0,7205.0,4096.0,4096.0,4096.0,4096.0,4096.0,4096.0,4000.0,4000.0,4000.0,4000.0,4000.0,7463.0,7463.0,7433.0,1984.0,1955.0,1984.0,1984.0,1955.0,1974.0,7445.0,7463.0,7463.0,7463.0,7463.0,7463.0,5775.0,5775.0,5775.0,5775.0,5775.0,5775.0,5775.0,5775.0,5775.0,5775.0,5775.0,5775.0,7463.0,7463.0,7463.0,7463.0,7463.0,4377.0,4377.0,4377.0,4377.0,4377.0,3974.0,3974.0,3974.0,1496.0,668.0,3974.0,3974.0,3974.0,3974.0,3974.0,3974.0,2016.0,2016.0,2016.0,2016.0,2016.0,2016.0,2959.0,2959.0,2959.0,2959.0,2959.0,2959.0,2959.0,2959.0,2959.0,2959.0,2959.0,2016.0,3974.0,3974.0,3974.0,3974.0,3974.0,2016.0,2016.0,2016.0,2016.0,2016.0
unique,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,259,,241,68.0,115.0,,,,269,409.0,261.0,,,,367,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
top,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Business,,UC Berkeley,1400.0,26908.0,,,,New York,0.0,55080.0,,,,Finance,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
freq,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,521,,107,403.0,241.0,,,,522,355.0,124.0,,,,202,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
mean,283.675937,8.960248,0.500597,17.327166,1.828837,11.350919,16.872046,9.042731,9.295775,8.927668,8.963595,283.863767,0.164717,0.19601,0.395799,26.364999,2.756653,22.495347,17.396867,20.270759,17.459714,10.685375,11.84593,0.419551,6.190411,7.175256,7.369301,6.400599,6.778409,5.47487,6.134498,5.208251,1.960215,26.358928,,7.662488,,,,2.757186,3.784793,3.651645,,,,2.122063,5.006762,2.158091,,5.277791,6.425232,4.575491,6.245813,7.783829,6.985781,6.714544,5.737077,3.881191,5.745993,7.678515,5.304133,6.776118,7.919629,6.825401,7.851066,5.631281,4.339197,5.534131,5.570556,22.514632,17.396389,20.265613,17.457043,10.682539,11.845111,26.39436,11.071506,12.636308,15.566805,9.780089,11.014845,30.362192,13.273691,14.416891,18.42262,11.744499,11.854817,7.084733,8.294935,7.70446,8.403965,7.578388,6.941908,7.927232,8.284346,7.426213,7.617611,0.419909,6.189995,7.175164,7.368597,6.400598,6.777524,5.474559,6.134087,5.207523,0.948769,3.207814,20.791624,15.434255,17.243708,15.260869,11.144619,12.457925,7.21125,8.082,8.25775,7.6925,7.58925,5.71151,1.843495,2.338087,32.819556,13.529923,15.293851,18.868448,7.286957,12.156028,26.217194,15.865084,17.813755,17.654765,9.913436,12.760263,26.806234,11.929177,12.10303,15.16381,9.342511,11.320866,29.344369,13.89823,13.958265,17.967233,11.909735,12.887976,7.125285,7.931529,8.238912,7.602171,7.486802,6.827964,7.394106,7.838702,7.279415,7.332191,0.780825,0.981631,0.37695,1.230615,0.934132,24.384524,16.588583,19.411346,16.233415,10.898075,12.699142,31.330357,15.654266,16.679563,16.418155,7.823909,12.207837,25.610341,10.751267,11.524839,14.276783,9.207503,11.253802,24.970936,10.923285,11.952687,14.959108,9.526191,11.96627,7.240312,8.093357,8.388777,7.658782,7.391545,6.81002,7.615079,7.93254,7.155258,7.048611
std,158.583367,5.491329,0.500029,10.940735,0.376673,5.995903,4.358458,5.514939,5.650199,5.477009,5.491068,158.584899,0.370947,0.303539,0.489051,3.563648,1.230689,12.569802,7.044003,6.782895,6.085526,6.126544,6.362746,0.493515,1.950305,1.740575,1.550501,1.954078,1.79408,2.156163,1.841258,2.129354,0.245925,3.566763,,3.758935,,,,1.230905,2.845708,2.805237,,,,1.407181,1.444531,1.105246,,3.30952,2.619024,2.801874,2.418858,1.754868,2.052232,2.263407,2.570207,2.620507,2.502218,2.006565,2.529135,2.235152,1.700927,2.156283,1.791827,2.608913,2.717612,1.734059,4.762569,12.587674,7.0467,6.783003,6.085239,6.124888,6.362154,16.297045,6.659233,6.717476,7.328256,6.998428,6.06015,16.249937,6.976775,6.263304,6.577929,6.886532,6.167314,1.395783,1.40746,1.564321,1.076608,1.778315,1.498653,1.627054,1.283657,1.779129,1.773094,0.493573,1.950169,1.740315,1.550453,1.953702,1.794055,2.156363,1.841285,2.129565,0.989889,2.444813,12.968524,6.915322,6.59642,5.356969,5.514028,5.921789,1.41545,1.455741,1.179317,1.626839,1.793136,1.820764,0.975662,0.63124,17.15527,7.977482,7.292868,8.535963,6.125187,8.241906,14.388694,6.658494,6.535894,6.129746,5.67555,6.651547,16.402836,6.401556,5.990607,7.290107,5.856329,6.296155,14.551171,6.17169,5.398621,6.100307,6.313281,5.615691,1.37139,1.503236,1.18028,1.5482,1.744634,1.411096,1.588145,1.280936,1.647478,1.521854,1.611694,1.382139,0.484683,1.294557,0.753902,13.71212,7.471537,6.124502,5.163777,5.900697,6.557041,17.55154,9.336288,7.880088,7.231325,6.100502,8.615985,17.477134,5.740351,6.004222,6.927869,6.385852,6.516178,17.007669,6.226283,7.01065,7.935509,6.403117,7.012067,1.576596,1.610309,1.459094,1.74467,1.961417,1.507341,1.504551,1.340868,1.672787,1.717988
min,1.0,1.0,0.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0,0.0,-0.83,0.0,18.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,18.0,,1.0,,,,1.0,0.0,1.0,,,,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,2.0,3.0,2.0,2.0,1.0,3.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,0.0,3.0,1.0,4.0,3.0,2.0,1.0,1.0,1.0,10.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,4.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,3.0,2.0,1.0,2.0,2.0,4.0,1.0,1.0
25%,154.0,4.0,0.0,8.0,2.0,7.0,14.0,4.0,4.0,4.0,4.0,154.0,0.0,-0.02,0.0,24.0,2.0,15.0,15.0,17.39,15.0,5.0,9.52,0.0,5.0,6.0,6.0,5.0,6.0,4.0,5.0,4.0,2.0,24.0,,5.0,,,,2.0,1.0,1.0,,,,1.0,4.0,1.0,,2.0,4.0,2.0,5.0,7.0,6.0,5.0,4.0,2.0,4.0,7.0,3.0,5.0,7.0,5.0,7.0,4.0,2.0,5.0,2.0,15.0,15.0,17.39,15.0,5.0,9.52,10.0,6.0,8.0,10.0,5.0,7.0,20.0,10.0,10.0,15.0,6.0,10.0,6.0,8.0,7.0,8.0,7.0,6.0,7.0,8.0,6.0,7.0,0.0,5.0,6.0,6.0,5.0,6.0,4.0,5.0,4.0,0.0,2.0,14.81,10.0,10.0,10.0,7.0,9.0,7.0,7.0,8.0,7.0,7.0,5.0,1.0,2.0,20.0,10.0,10.0,10.0,0.0,5.0,16.67,10.0,15.0,15.0,5.0,10.0,10.0,8.0,8.0,9.0,5.0,7.0,19.15,10.0,10.0,15.0,10.0,10.0,7.0,7.0,8.0,7.0,7.0,6.0,6.0,7.0,6.0,6.0,0.0,0.0,0.0,1.0,1.0,15.22,10.0,16.67,14.81,5.0,10.0,20.0,10.0,10.0,10.0,0.0,5.0,10.0,7.0,7.0,9.0,5.0,7.0,10.0,7.0,7.0,9.0,6.0,5.0,7.0,7.0,8.0,7.0,6.0,6.0,7.0,7.0,6.0,6.0
50%,281.0,8.0,1.0,16.0,2.0,11.0,18.0,8.0,9.0,8.0,8.0,281.0,0.0,0.21,0.0,26.0,2.0,20.0,18.37,20.0,18.0,10.0,10.64,0.0,6.0,7.0,7.0,7.0,7.0,6.0,6.0,5.0,2.0,26.0,,8.0,,,,2.0,3.0,3.0,,,,2.0,5.0,2.0,,6.0,7.0,4.0,6.0,8.0,7.0,7.0,6.0,3.0,6.0,8.0,6.0,7.0,8.0,7.0,8.0,6.0,4.0,6.0,4.0,20.0,18.18,20.0,18.0,10.0,10.64,25.0,10.0,10.0,15.0,10.0,10.0,25.0,15.0,15.0,20.0,10.0,10.0,7.0,8.0,8.0,8.0,8.0,7.0,8.0,8.0,8.0,8.0,0.0,6.0,7.0,7.0,7.0,7.0,6.0,6.0,5.0,0.0,3.0,17.65,15.79,18.42,15.91,10.0,12.5,7.0,8.0,8.0,8.0,8.0,6.0,1.0,2.0,30.0,10.0,15.0,20.0,5.0,10.0,20.0,16.67,19.05,18.37,10.0,13.0,25.0,10.0,10.0,15.0,10.0,10.0,25.0,15.0,15.0,18.52,10.0,13.95,7.0,8.0,8.0,8.0,8.0,7.0,8.0,8.0,7.0,7.0,0.0,1.0,0.0,1.0,1.0,20.0,16.67,20.0,16.33,10.0,14.29,25.0,15.0,18.0,17.0,10.0,10.0,20.0,10.0,10.0,12.0,9.0,10.0,20.0,10.0,10.0,15.0,10.0,10.0,7.0,8.0,8.0,8.0,8.0,7.0,8.0,8.0,7.0,7.0
75%,407.0,13.0,1.0,26.0,2.0,15.0,20.0,13.0,14.0,13.0,13.0,408.0,0.0,0.43,1.0,28.0,4.0,25.0,20.0,23.81,20.0,15.0,16.0,1.0,8.0,8.0,8.0,8.0,8.0,7.0,7.0,7.0,2.0,28.0,,10.0,,,,4.0,6.0,6.0,,,,2.0,6.0,3.0,,7.0,9.0,7.0,8.0,9.0,9.0,8.0,8.0,6.0,8.0,9.0,7.0,9.0,9.0,8.0,9.0,8.0,7.0,7.0,8.0,25.0,20.0,23.81,20.0,15.0,16.0,35.0,15.0,16.0,20.0,15.0,15.0,40.0,18.75,20.0,20.0,15.0,15.63,8.0,9.0,9.0,9.0,9.0,8.0,9.0,9.0,9.0,9.0,1.0,8.0,8.0,8.0,8.0,8.0,7.0,7.0,7.0,2.0,4.0,25.0,20.0,20.0,20.0,15.0,16.28,8.0,9.0,9.0,9.0,9.0,7.0,3.0,3.0,40.0,20.0,20.0,24.0,10.0,20.0,30.0,20.0,20.0,20.0,15.0,16.67,40.0,15.0,15.0,20.0,10.0,15.0,38.46,19.23,17.39,20.0,15.09,16.515,8.0,9.0,9.0,9.0,9.0,8.0,8.0,9.0,8.0,8.0,1.0,1.0,1.0,1.0,1.0,30.0,20.0,20.0,20.0,15.0,16.67,40.0,20.0,20.0,20.0,10.0,20.0,37.0,15.0,15.0,20.0,10.0,15.0,35.0,15.0,15.0,20.0,10.0,15.0,8.0,9.0,9.0,9.0,9.0,8.0,9.0,9.0,8.0,8.0





In [8]:
### 2 - file reading and basic exploration - get percentage of missing values ### ----

# check whether some columns are full of NaNs
column_nan_full = data.columns[data.isnull().all()]
column_nb = len(column_nan_full)

# get percent of NaNs per column to output overview
# (.iloc is used because positional access with [] on a labelled Series is deprecated)
column_nan_percent = data.isnull().sum() * 100 / data.shape[0]
column_nan_distrib = pd.Series(np.zeros(4),index = ["less than 10%","10% to 50%","50% to 70%","70% and more"])
column_nan_distrib.iloc[0] = len(column_nan_percent[column_nan_percent < 10])
column_nan_distrib.iloc[1] = len(column_nan_percent[(column_nan_percent >= 10) & (column_nan_percent < 50)])
column_nan_distrib.iloc[2] = len(column_nan_percent[(column_nan_percent >= 50) & (column_nan_percent < 70)])
column_nan_distrib.iloc[3] = len(column_nan_percent[column_nan_percent >= 70])

# check whether some rows are full of NaNs (vectorised count of NaNs per row)
row_nan_count = data.isnull().sum(axis = 1)
row_nan_full = row_nan_count.index[row_nan_count == data.shape[1]]
row_nb = len(row_nan_full)

# get percent of NaNs per row to output overview
row_nan_percent = row_nan_count * 100 / data.shape[1]
row_nan_distrib = pd.Series(np.zeros(4),index = ["less than 10%","10% to 50%","50% to 70%","70% and more"])
row_nan_distrib.iloc[0] = len(row_nan_percent[row_nan_percent < 10])
row_nan_distrib.iloc[1] = len(row_nan_percent[(row_nan_percent >= 10) & (row_nan_percent < 50)])
row_nan_distrib.iloc[2] = len(row_nan_percent[(row_nan_percent >= 50) & (row_nan_percent < 70)])
row_nan_distrib.iloc[3] = len(row_nan_percent[row_nan_percent >= 70])

# print report
print("COLUMNS")
print("{} columns out of {} are fully filled with missing values".format(column_nb,data.shape[1]))
print("Percentage of missing values (number of columns):\n{}".format(column_nan_distrib) + "\n")
print("ROWS")
print("{} rows out of {} are fully filled with missing values".format(row_nb,data.shape[0]))
print("Percentage of missing values (number of rows):\n{}".format(row_nan_distrib))


COLUMNS
0 columns out of 195 are fully filled with missing values
Percentage of missing values (number of columns):
less than 10%    87.0
10% to 50%       49.0
50% to 70%       38.0
70% and more     21.0
dtype: float64

ROWS
0 rows out of 8378 are fully filled with missing values
Percentage of missing values (number of rows):
less than 10%    1469.0
10% to 50%       6674.0
50% to 70%        199.0
70% and more       36.0
dtype: float64
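
The same kind of distribution can also be obtained by binning the percentages. The following is a sketch on toy data, offered as an alternative and not as the notebook's method; it assumes the same four intervals as the report above.

```python
import numpy as np
import pandas as pd

# toy data with 0%, 50% and 75% of missing values per column
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, np.nan, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# percentage of missing values per column
nan_percent = toy.isnull().mean() * 100

bins = [0, 10, 50, 70, np.inf]
labels = ["less than 10%", "10% to 50%", "50% to 70%", "70% and more"]

# right=False makes every interval closed on the left, e.g. [10, 50)
distribution = pd.cut(nan_percent, bins=bins, labels=labels, right=False).value_counts(sort=False)
print(distribution)
```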


###
### 3 - Preprocessing

In [9]:
### 3 - preprocessing - assess discrepancies within each subject-related feature by subject ### ----

# some features should be unique within each subject
# since the dataset is slightly funky:
# - look for the mixed presence of values and NaNs within these features
# - look for multiple unique values within these features

# identify columns that are not supposed to contain unique data
columns_drop = ['iid', 'position', 'order', 'partner', 'pid', 'match', 'int_corr', 'samerace',
       'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun','pf_o_amb',
       'pf_o_sha', 'dec_o', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o',
       'like_o', 'prob_o', 'met_o', 'dec', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar',
       'like', 'prob', 'met', 'match_es']
columns_tocheck = data.columns.drop(columns_drop)

# look for discrepancies within each info per subject

# get subjects id and number of subjects
subjects_id = data["iid"].unique()
subjects_nb = len(subjects_id)

# initialise variables to store errors
errors_nan = pd.DataFrame(index = subjects_id, columns = columns_tocheck)
errors_unique = pd.DataFrame(index = subjects_id, columns = columns_tocheck)

# loop through subjects
for i in subjects_id:

    # loop through features
    for j in columns_tocheck:

        # get data
        data_current = data.loc[data["iid"] == i, j]

        # check for the presence of partially filled info
        if data_current.notnull().any() and data_current.isnull().any():
            errors_nan.loc[i, j] = 1

        # get unique values per info (without missing values)
        data_current = data_current[data_current.notnull()]
        values_unique = data_current.unique()

        # store error if there is more than one unique value per info
        if len(values_unique) > 1:
            errors_unique.loc[i, j] = 1

# get errors summary and print result
# (an empty index means that no feature was flagged)
errors_nan_summ = errors_nan.sum()
errors_nan_summ = errors_nan_summ.index[errors_nan_summ > 0]
if len(errors_nan_summ) == 0:
    print("No features are partially filled")
else:
    print("These features are partially filled:\n{}\n".format(errors_nan_summ))
errors_unique_summ = errors_unique.sum()
errors_unique_summ = errors_unique_summ.index[errors_unique_summ > 0]
if len(errors_unique_summ) == 0:
    print("No features show discrepancies in values")
else:
    print("These features show discrepancies in values:\n{}\n".format(errors_unique_summ))


These features are partially filled:
Index(['id'], dtype='object')

These features show discrepancies in values:
Index(['positin1'], dtype='object')
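
On a dataset of this size, the nested loop above can be slow. A vectorised alternative based on `groupby` is sketched below on toy data; it is an illustration of the same two checks, not the code that produced the output above.

```python
import pandas as pd

# toy data: subject 1 has a partially filled id, subject 2 has two positin1 values
toy = pd.DataFrame({
    "iid":      [1, 1, 2, 2, 3, 3],
    "id":       [1.0, None, 2.0, 2.0, 3.0, 3.0],
    "positin1": [7.0, 7.0, 4.0, 5.0, None, None],
})
columns_tocheck = ["id", "positin1"]
grouped = toy.groupby("iid")[columns_tocheck]

# number of distinct non-missing values per subject and feature
n_unique = grouped.nunique()
# number of non-missing values per subject and feature, and rows per subject
n_filled = grouped.count()
n_rows = toy.groupby("iid").size()

# partially filled: some values present, but fewer than the subject's number of rows
partially_filled = (n_filled.gt(0) & n_filled.lt(n_rows, axis=0)).any()
# inconsistent: more than one distinct value for the same subject
inconsistent = n_unique.gt(1).any()

print(partially_filled[partially_filled].index.tolist())
print(inconsistent[inconsistent].index.tolist())
```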



In [10]:
### 3 - preprocessing - apply corrections on discrepancies in id and positin1 ### ----

# errors were found in id and positin1 features
# according to the data key, these values should be unique per subject

# id records the subject number within the wave
# - identify subjects for whom this mistake occurs
# - fill NaNs with the id found for the subject in other speed dates

# positin1 records the station number of the first date
# since there is no way to infer the positin1 value from the data:
# - identify subjects for whom this mistake occurs
# - replace positin1 values by NaN for these subjects

# copy data for safety
data1 = data.copy()

# identify subjects
subjects_nan = errors_nan.index[errors_nan["id"] == 1]
subjects_unique = errors_unique.index[errors_unique["positin1"] == 1]

# apply corrections
for i in subjects_nan:
    data1.loc[data1["iid"] == i,"id"] = data1.loc[data1["iid"] == i,"id"].min()
for i in subjects_unique:
    data1.loc[data1["iid"] == i,"positin1"] = np.nan


In [11]:
### 3 - preprocessing - rename the content of categorical features ### ----

# copy data for safety
data2 = data1.copy()

# gender
data2["gender"] = data2["gender"].apply(lambda x: "Female" if x == 0
                                                else "Male" if x == 1
                                                else np.nan)

# condtn
data2["condtn"] = data2["condtn"].apply(lambda x: "Limited<br>choice" if x == 1
                                                else "Extensive<br>choice" if x == 2
                                                else np.nan)

# match
data2["match"] = data2["match"].apply(lambda x: "No" if x == 0
                                                else "Yes" if x == 1
                                                else np.nan)

# samerace
data2["samerace"] = data2["samerace"].apply(lambda x: "No" if x == 0
                                                else "Yes" if x == 1
                                                else np.nan)
# race_o
data2["race_o"] = data2["race_o"].apply(lambda x: "African" if x == 1
                                                else "Caucasian" if x ==2
                                                else "Latino" if x ==3
                                                else "Asian" if x == 4
                                                else "Native" if x == 5
                                                else "Other" if x == 6
                                                else np.nan)

# dec_o
data2["dec_o"] = data2["dec_o"].apply(lambda x: "No" if x == 0
                                                else "Yes" if x == 1
                                                else np.nan)

# WARNING: problem in met_o > unique values range from 0 to 8
# data2["met_o"] = data2["met_o"].apply(lambda x: "Yes" if x == 1
                                                #else "No" if x == 2
                                                #else np.nan)

# field_cd
data2["field_cd"] = data2["field_cd"].apply(lambda x: "Law" if x == 1
                                                else "Math" if x ==2
                                                else "Social Science<br>Psychologist" if x ==3
                                                else "Medical Science<br>Pharmaceuticals<br>Bio Tech" if x == 4
                                                else "Engineering" if x == 5
                                                else "English<br>Creative Writing<br>Journalism" if x == 6
                                                else "History<br>Religion<br>Philosophy" if x == 7
                                                else "Business<br>Economy<br>Finance" if x == 8
                                                else "Education<br>Academia" if x == 9
                                                else "Biological Sciences<br>Chemistry<br>Physics" if x == 10
                                                else "Social Work" if x == 11
                                                else "Undergraduate<br>Undecided" if x == 12
                                                else "Political Science<br>International Affairs" if x == 13
                                                else "Film" if x == 14
                                                else "Fine Arts<br>Arts Administration" if x == 15
                                                else "Languages" if x == 16
                                                else "Architecture" if x == 17
                                                else "Other" if x == 18
                                                else np.nan)

# race
data2["race"] = data2["race"].apply(lambda x: "African" if x == 1
                                                else "Caucasian" if x ==2
                                                else "Latino" if x ==3
                                                else "Asian" if x == 4
                                                else "Native" if x == 5
                                                else "Other" if x == 6
                                                else np.nan)

# goal
data2["goal"] = data2["goal"].apply(lambda x: "Seemed like a<br>fun night out" if x == 1
                                                else "To meet<br>new people" if x ==2
                                                else "To get<br>a date" if x ==3
                                                else "Looking for<br>a serious<br>relationship" if x == 4
                                                else "To say<br>I did it" if x == 5
                                                else "Other" if x == 6
                                                else np.nan)

# date
data2["date"] = data2["date"].apply(lambda x: "Several times<br>a week" if x == 1
                                                else "Twice<br>a week" if x ==2
                                                else "Once<br>a week" if x ==3
                                                else "Twice<br>a month" if x == 4
                                                else "Once<br>a month" if x == 5
                                                else "Several times<br>a year" if x == 6
                                                else "Almost<br>never" if x == 7
                                                else np.nan)

# go_out
data2["go_out"] = data2["go_out"].apply(lambda x: "Several times<br>a week" if x == 1
                                                else "Twice<br>a week" if x ==2
                                                else "Once<br>a week" if x ==3
                                                else "Twice<br>a month" if x == 4
                                                else "Once<br>a month" if x == 5
                                                else "Several times<br>a year" if x == 6
                                                else "Almost<br>never" if x == 7
                                                else np.nan)

# career_c
data2["career_c"] = data2["career_c"].apply(lambda x: "Lawyer" if x == 1
                                                else "Academic - Research" if x ==2
                                                else "Psychologist" if x ==3
                                                else "Doctor - Medicine" if x == 4
                                                else "Engineer" if x == 5
                                                else "Creative Arts - Entertainment" if x == 6
                                                else "Banking - Consulting - Finance - Marketing - Business - CEO - Entrepreneur - Admin" if x == 7
                                                else "Real Estate" if x == 8
                                                else "International - Humanitarian Affairs" if x == 9
                                                else "Undecided" if x == 10
                                                else "Social Work" if x == 11
                                                else "Speech Pathology" if x == 12
                                                else "Politics" if x == 13
                                                else "Pro sports - Athletics" if x == 14
                                                else "Other" if x == 15
                                                else "Journalism" if x == 16
                                                else "Architecture" if x == 17
                                                else np.nan)

# dec
data2["dec"] = data2["dec"].apply(lambda x: "No" if x == 0
                                                else "Yes" if x == 1
                                                else np.nan)

# WARNING: problem in met > unique values range from 0 to 8
# data2["met"] = data2["met"].apply(lambda x: "Yes" if x == 1
                                                #else "No" if x == 2
                                                #else np.nan)

# length
data2["length"] = data2["length"].apply(lambda x: "Too little" if x == 1
                                                else "Too much" if x == 2
                                                else "Just Right" if x == 3
                                                else np.nan)

# numdat_2
data2["numdat_2"] = data2["numdat_2"].apply(lambda x: "Too few" if x == 1
                                                else "Too many" if x == 2
                                                else "Just Right" if x == 3
                                                else np.nan)

# date_3
# WARNING: problem in date_3 > unique values range from 0 to 1
# EXPECTED: No = 0 and Yes = 1
data2["date_3"] = data2["date_3"].apply(lambda x: "No" if x == 0
                                                else "Yes" if x == 1
                                                else np.nan)

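The chained lambdas of the previous cell all follow the same pattern: a code is replaced by its label, and any code without a label becomes NaN. As a possible alternative, here is a minimal sketch using `Series.map` with a dictionary. The codes below are made up for illustration; only the gender labels come from the cell above.

```python
import numpy as np
import pandas as pd

# illustrative codes, including one that has no label (9) and one missing value
codes = pd.Series([0, 1, 1, 9, np.nan])

# labels of the gender feature, as used in the cell above
gender_labels = {0: "Female", 1: "Male"}

# codes without an entry in the dictionary become NaN,
# like the final "else np.nan" of the lambdas
labels = codes.map(gender_labels)
print(labels.tolist())
```

Keeping one dictionary per feature also makes the labels reusable, for example to order categories in a plot.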

In [12]:
### 3 - preprocessing - apply ad hoc corrections on numeric features stored as strings ### ----

# copy data for safety
data3 = data2.copy()

# identify categorical features
columns_cat = data3.select_dtypes(exclude = ["number", "bool"]).columns

# identify numeric variables stored as strings (visual check)
columns_cat_types = data3.loc[:,columns_cat].dtypes
print("Check for supposed numeric variables within this list:\n{}".format(columns_cat_types))

# mn_sat, tuition, and income were stored as strings
# remove the thousands separator "," so that the numbers can be parsed
for i in range(0,data3.shape[0]):
    if type(data3.loc[i,"mn_sat"]) == str:
        data3.loc[i,"mn_sat"] = data3.loc[i,"mn_sat"].replace(",","")
    if type(data3.loc[i,"tuition"]) == str:
        data3.loc[i,"tuition"] = data3.loc[i,"tuition"].replace(",","")
    if type(data3.loc[i,"income"]) == str:
        data3.loc[i,"income"] = data3.loc[i,"income"].replace(",","")

# update type of columns
data3["mn_sat"] = data3["mn_sat"].astype(float)
data3["tuition"] = data3["tuition"].astype(float)
data3["income"] = data3["income"].astype(float)


Check for supposed numeric variables within this list:
gender      object
condtn      object
match       object
samerace    object
race_o      object
dec_o       object
field       object
field_cd    object
undergra    object
mn_sat      object
tuition     object
race        object
from        object
zipcode     object
income      object
goal        object
date        object
go_out      object
career      object
career_c    object
dec         object
length      object
numdat_2    object
date_3      object
dtype: object

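The row-by-row cleaning of `mn_sat`, `tuition`, and `income` can be written as a column-wise operation. A minimal sketch on made-up values (only the column name `income` comes from the dataset):

```python
import numpy as np
import pandas as pd

# illustrative values stored as strings with a thousands separator
toy = pd.DataFrame({"income": ["69,487.00", "65,929.00", np.nan, "8,000.00"]})

# remove the separator on the whole column, then parse; NaN stays NaN
toy["income"] = pd.to_numeric(toy["income"].str.replace(",", "", regex=False))
print(toy["income"])
```

On a dataset with thousands of records, this avoids one `.loc` lookup per cell.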

In [13]:
### 3 - preprocessing - assess quality of the survey ratings and apply corrections ### ----

# visual inspection of the dataset revealed values outside the scales
# given in the data key document, which is therefore misleading on this point

# - reassign features to the right scale after checking
# - clip min and max values to the scale
# - for rating series, rescale all series on the 1-10 scale to the shared 0-100 scale
# copy data for safety
data4 = data3.copy()

# get numeric features rated on a 1-10 scale (checked manually)
columns_scale10 = ["imprace","imprelig","sports","tvsports","exercise","dining","museums","art","hiking",
    "gaming","clubbing","reading","tv","theater","movies","concerts","music","shopping","yoga","exphappy",
    "like","prob","like_o","prob_o","satis_2",
    "attr3_1","sinc3_1","fun3_1","intel3_1","amb3_1",
    "attr5_1","sinc5_1","intel5_1","fun5_1","amb5_1",
    "attr","sinc","intel","fun","amb","shar",
    "attr3_s","sinc3_s","intel3_s","fun3_s","amb3_s",
    "attr3_2","sinc3_2","intel3_2","fun3_2","amb3_2",
    "attr5_2","sinc5_2","intel5_2","fun5_2","amb5_2",
    "attr3_3","sinc3_3","intel3_3","fun3_3","amb3_3",
    "attr5_3","sinc5_3","intel5_3","fun5_3","amb5_3",
    "attr_o","sinc_o","intel_o","fun_o","amb_o","shar_o"]

# get numeric features rated on a mixed scale or on the shared 0-100 scale (checked manually)
serie1 = ["attr1_1","sinc1_1","intel1_1","fun1_1","amb1_1","shar1_1"]
serie2 = ["attr2_1","sinc2_1","intel2_1","fun2_1","amb2_1","shar2_1"]
serie3 = ["attr1_2","sinc1_2","intel1_2","fun1_2","amb1_2","shar1_2"]
serie4 = ["attr2_2","sinc2_2","intel2_2","fun2_2","amb2_2","shar2_2"]
serie5 = ["attr1_3","sinc1_3","intel1_3","fun1_3","amb1_3","shar1_3"]
serie6 = ["pf_o_att","pf_o_sin","pf_o_int","pf_o_fun","pf_o_amb","pf_o_sha"]
serie7 = ["attr1_s","sinc1_s","intel1_s","fun1_s","amb1_s","shar1_s"]
serie8 =["attr4_1","sinc4_1","intel4_1","fun4_1","amb4_1","shar4_1"]
serie9 = ["attr4_2","sinc4_2","intel4_2","fun4_2","amb4_2","shar4_2"]
serie10 = ["attr7_2","sinc7_2","intel7_2","fun7_2","amb7_2","shar7_2"]
serie11 = ["attr4_3","sinc4_3","intel4_3","fun4_3","amb4_3","shar4_3"]
serie12 = ["attr2_3","sinc2_3","intel2_3","fun2_3","amb2_3","shar2_3"]
serie13 = ["attr7_3","sinc7_3","intel7_3","fun7_3","amb7_3","shar7_3"]

# assemble feature lists
series_list = [serie1, serie2, serie3, serie4, serie5, serie6, serie7, serie8, serie9, serie10,serie11, 
    serie12, serie13]
series_all = pd.DataFrame(series_list, columns = range(0,6))

# check features of scale 1-10
for i in columns_scale10:
    
    # get data
    data_current = data4[i]

    # apply correction to minimum and maximum values
    if data_current.min() < 1:
        data4.loc[data4[i] < 1,i] = 1
    if data_current.max() > 10:
        data4.loc[data4[i] > 10,i] = 10

# check features with mixed scales
# loop through series
for i in range(0,series_all.shape[0]):

    # get columns
    columns_current = series_all.loc[i,series_all.loc[i,:].notnull()]

    # loop through rows
    for j in range(0, data4.shape[0]):

        # get data
        data_current = data4.loc[j,columns_current]

        # skip series full of missing values
        if data_current.isnull().all():
            continue
                
        # deal with missing values in partially filled series
        # - on the 1-10 scale, the values cannot be inferred: set the whole series to NaN and skip
        # - on the shared 0-100 scale, set missing values to 0 if the series already sums to 100
        # - otherwise the values cannot be inferred: set the whole series to NaN and skip
        if (data_current.isnull().any()) & (data_current.max() <= 10):
            data4.loc[j,columns_current] = np.nan
            continue
        if (data_current.isnull().any()) & (data_current.max() > 10) & (data_current.sum() == 100):
            data_current[data_current.isnull()] = 0
        if (data_current.isnull().any()) & (data_current.max() > 10) & (data_current.sum() != 100):
            data4.loc[j,columns_current] = np.nan
            continue
          
        # distinguish a max error on the 1-10 scale from a min error on the shared 0-100 scale
        # a series is considered to be on the shared 0-100 scale if its sum is over 75
        if (data_current.max() > 10) & (data_current.sum() <= 75):
            data_current[data_current > 10] = 10

        # if the sum of a serie is equal to 0, assign the minimum value of 1
        if data_current.sum() <= 0:
            data_current[:] = 1
            
        # apply corrections on min
        if (data_current.max() <= 10) & (data_current.min() < 1):
            data_current[data_current < 1] = 1
        if (data_current.max() > 10) & (data_current.min() < 0):
            data_current[data_current < 0] = 0
        
        # rescale scale 1-10 to shared scale 0-100
        if data_current.sum() == 100:
            data4.loc[j,columns_current] = data_current
        else:
            data4.loc[j,columns_current] = data_current * 100 / data_current.sum()

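The last step of the loop above converts a series rated on the 1-10 scale into shares of a 100-point budget. A worked example on made-up ratings (the attribute names are shortened and the values are illustrative):

```python
import pandas as pd

# one subject's ratings on the 1-10 scale (illustrative values, sum = 40)
ratings = pd.Series({"attr": 8, "sinc": 6, "intel": 8, "fun": 9, "amb": 5, "shar": 4})

# express each rating as a share of the total so that the series sums to 100
rescaled = ratings * 100 / ratings.sum()
print(rescaled)
```

Here a rating of 8 out of a total of 40 becomes 8 * 100 / 40 = 20 points. The shares are generally not integers, which is why a rounding step is needed afterwards.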

In [14]:
### 3 - preprocessing - correct values that were rescaled ### ----

# many values were rescaled, leading to non-integer values
# - round all values to integers (this introduces errors in the series totals)
# - apply the following corrections:
# if series sum = 98, add 2 to the min value
# if series sum = 99, add 1 to the min value
# if series sum = 101, subtract 1 from the max value
# if series sum = 102, subtract 2 from the max value

# copy data for safety
data5 = data4.copy()

# round values
data5 = np.round(data5)

# apply corrections on all series
# loop through rows
for i in range(0,data5.shape[0]):

    # loop through rating series
    for j in range(5,series_all.shape[0]):

        # get serie columns
        serie_columns = series_all.loc[j, series_all.loc[j,:].notnull()]

        # get data
        serie_current = data5.loc[i,serie_columns]

        # apply corrections
        if serie_current.isnull().all():
            continue
        elif serie_current.isnull().any():
            print("bug")
        elif serie_current.sum() == 100:
            continue
        elif serie_current.sum() == 98:
            index_min = serie_current.index[np.argmin(serie_current)]
            data5.loc[i,index_min] += 2
        elif serie_current.sum() == 99:
            index_min = serie_current.index[np.argmin(serie_current)]
            data5.loc[i,index_min] += 1
        elif serie_current.sum() == 101:
            index_max = serie_current.index[np.argmax(serie_current)]
            data5.loc[i,index_max] -= 1
        elif serie_current.sum() == 102:
            index_max = serie_current.index[np.argmax(serie_current)]
            data5.loc[i,index_max] -= 2
        
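The corrections above restore a total of 100 by adjusting the smallest or the largest value after rounding. A different technique with the same goal is the largest remainder method, sketched below. This is an alternative for illustration, not the method used in this notebook; the helper name `round_to_total` is hypothetical, and the sketch assumes non-negative values that already sum to the target before rounding.

```python
import numpy as np

def round_to_total(values, total=100):
    """Round non-negative values to integers that sum exactly to `total`.

    Assumes that the unrounded values already sum to `total`.
    """
    values = np.asarray(values, dtype=float)
    floors = np.floor(values).astype(int)
    # number of units still to hand out after flooring
    missing = int(total - floors.sum())
    # give one unit to each of the entries with the largest fractional parts
    order = np.argsort(-(values - floors), kind="stable")
    floors[order[:missing]] += 1
    return floors

# illustrative shares that sum to 100 but are not integers
shares = [16.67, 16.67, 16.67, 16.67, 16.66, 16.66]
rounded = round_to_total(shares)
print(rounded, rounded.sum())
```

Each value changes by less than one unit, and the adjustment is spread over several entries instead of being concentrated on the minimum or the maximum.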

In [15]:
### 3 - preprocessing - identify and treat outliers ### ----

# done within each gender, so as not to introduce a bias, since features will be compared within genders
# - for numeric features, drop the rows whose value lies more than 2 standard deviations from the mean
# - for categorical features, set their value to "Under-represented"


# copy data for safety
data6 = data5.copy()

# check outliers only in features related to demographics info
columns_check_num = ['age_o', 'age', 'mn_sat', 'tuition', 'income']
columns_check_cat = ['race_o', 'field', 'field_cd', 'undergra', 'race', 'from', 'zipcode', 'career', 'career_c']

# set masks for gender
mask_f = data6["gender"] == "Female"
mask_m = data6["gender"] == "Male"

# drop rows that contain numeric outliers
for i in columns_check_num:

    # get lower and upper bounds per gender
    lower_bound_current_f = data6.loc[mask_f,i].mean() - 2 * data6.loc[mask_f,i].std()
    upper_bound_current_f = data6.loc[mask_f,i].mean() + 2 * data6.loc[mask_f,i].std()
    lower_bound_current_m = data6.loc[mask_m,i].mean() - 2 * data6.loc[mask_m,i].std()
    upper_bound_current_m = data6.loc[mask_m,i].mean() + 2 * data6.loc[mask_m,i].std()
    
    # set masks for outliers within each gender and get id of the corresponding subjects
    mask_outliers_f = (data6["gender"] == "Female") & (
        (data6.loc[:,i] < lower_bound_current_f) | (data6.loc[:,i] > upper_bound_current_f))
    mask_outliers_m = (data6["gender"] == "Male") & (
        (data6.loc[:,i] < lower_bound_current_m) | (data6.loc[:,i] > upper_bound_current_m))
    mask_outliers = mask_outliers_f | mask_outliers_m
    id_outliers = data6.loc[mask_outliers,"iid"]

    # drop rows that contain outliers
    mask_drop = (data6.loc[:,"iid"].isin(id_outliers)) | (data6.loc[:,"pid"].isin(id_outliers))
    index_drop = data6.loc[mask_drop,:].index
    data6 = data6.drop(index_drop, axis = 0)
    
# assign new value to outliers in categorical features (categories representing less than 1% of the gender)
for i in columns_check_cat:

    # get category frequencies (without NaNs) and set mask for frequencies < 0.01
    freq_current_f = data6.loc[mask_f,i].value_counts() / data6.loc[mask_f,i].value_counts().sum()
    categories_current_f = freq_current_f.index[freq_current_f.values < 0.01]
    mask_current_f = mask_f & (data6[i].isin(categories_current_f))
    freq_current_m = data6.loc[mask_m,i].value_counts() / data6.loc[mask_m,i].value_counts().sum()
    categories_current_m = freq_current_m.index[freq_current_m.values < 0.01]
    mask_current_m = mask_m & (data6[i].isin(categories_current_m))

    # set new value to outliers
    data6.loc[mask_current_f,i] = "Under-represented"
    data6.loc[mask_current_m,i] = "Under-represented"

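The "mean ± 2 standard deviations within each gender" rule can be illustrated on a small example. The sketch below uses made-up ages and `groupby(...).transform(...)` to align the per-gender statistics with the rows; it only flags outliers and does not reproduce the dropping of partner records done above.

```python
import pandas as pd

# illustrative ages: one clearly atypical value (60) among the females
toy = pd.DataFrame({
    "gender": ["Female"] * 10 + ["Male"] * 10,
    "age": [24, 25, 26, 25, 24, 25, 26, 24, 25, 60,
            27, 28, 29, 28, 27, 28, 29, 27, 28, 28],
})

# per-gender mean and standard deviation, aligned with the rows
grouped = toy.groupby("gender")["age"]
mean = grouped.transform("mean")
std = grouped.transform("std")

# a value is an outlier when it lies more than 2 std from the mean of its own gender
toy["outlier"] = (toy["age"] - mean).abs() > 2 * std
print(toy[toy["outlier"]])
```

Because each row is compared with the bounds of its own gender only, a value that is typical for one gender is never flagged because of the other gender's distribution.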

In [16]:
### 3 - preprocessing - reformat data to create a new dataset (one row per subject) ### ----

# the dataset covers dates between 551 subjects within 21 waves
# a lot of data is redundant (multiple speed dates per subject)

# not all subjects had the same number of speed dates
# the dataset is biased towards subjects that had more dates

# this dataset will be used to plot the demographics of the cohort (and not of the survey)

# - keep only features that are interesting for plotting some demographics info
# - get data grouped by subject


# keep interesting features
columns_keep = ["iid", "gender", "age", "race", "income"]
data7 = data6[columns_keep]

# get subjects id and number of subjects
subjects_id = data7["iid"].unique()
subjects_nb = len(subjects_id)

# initialise new dataset (one row per subject)
data_fig = pd.DataFrame(np.nan, index = range(0,subjects_nb), columns = data7.columns)

# loop through subjects to collect individual data
for i in range(0,subjects_nb):

    # store data on subjects (take the first record of each subject)
    data_current = data7.loc[data7["iid"] == subjects_id[i],:]
    data_fig.loc[i,columns_keep] = data_current.loc[data_current.index[0],columns_keep]

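Taking the first record of each subject can also be done in a single call. A minimal sketch on made-up data, assuming (as in the cell above) that the demographic values are repeated identically on every record of a subject:

```python
import pandas as pd

# illustrative records: two speed dates for subject 1, three for subject 2
toy = pd.DataFrame({
    "iid": [1, 1, 2, 2, 2],
    "gender": ["Female", "Female", "Male", "Male", "Male"],
    "age": [21.0, 21.0, 24.0, 24.0, 24.0],
})

# keep the first record of each subject and renumber the rows
per_subject = toy.drop_duplicates(subset="iid", keep="first").reset_index(drop=True)
print(per_subject)
```

This keeps the original column types, whereas a frame initialised with NaN starts with float columns.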

### 4 - Overview 1 - Quality of the speed dating experiment

In [17]:
### 4 - overview 1 - quality of the speed dating experiment - get data ### ----

# the dataset contains recordings of information given at several time points of
# the experiment by the subjects

# 1 - assess quality of the surveys
# some general info on subjects may be missing
# some subjects may not have filled all the surveys

# 2 - assess effectiveness of the dating process
# number of likes
# number of matches
# number of dates after the experiment


# 1 - assess quality of the surveys
# get percentage of fully and partially filled surveys

# store info available on each subject by survey
general = ["iid","id","gender","idg","condtn","wave","round","positin1"]
signup1 = ["age","field","field_cd","undergra","mn_sat","tuition","race","imprace","imprelig","from",
       "zipcode","income","goal","date","go_out","career","career_c","sports","tvsports","exercise",
       "dining","museums","art","hiking","gaming","clubbing","reading","tv","theater","movies",
       "concerts","music","shopping","yoga","exphappy","expnum","attr1_1","sinc1_1","intel1_1","fun1_1",
       "amb1_1","shar1_1","attr4_1","sinc4_1","intel4_1","fun4_1","amb4_1","shar4_1","attr2_1","sinc2_1",
       "intel2_1","fun2_1","amb2_1","shar2_1","attr3_1","sinc3_1","fun3_1","intel3_1","amb3_1","attr5_1",
       "sinc5_1","intel5_1","fun5_1","amb5_1"]
scorecard = ["dec","attr","sinc","intel","fun","amb","shar","like","prob","met","match_es"]
signup2 = ["attr1_s","sinc1_s","intel1_s","fun1_s","amb1_s","shar1_s","attr3_s","sinc3_s","intel3_s",
       "fun3_s","amb3_s"]
       
followup1 = ['satis_2', 'length', 'numdat_2', 'attr7_2', 'sinc7_2', 'intel7_2', 
       'fun7_2', 'amb7_2', 'shar7_2', 'attr1_2', 'sinc1_2', 'intel1_2', 'fun1_2',
       'amb1_2', 'shar1_2', 'attr4_2', 'sinc4_2', 'intel4_2', 'fun4_2', 'amb4_2', 
       'shar4_2', 'attr2_2', 'sinc2_2', 'intel2_2', 'fun2_2', 'amb2_2', 'shar2_2',
       'attr3_2', 'sinc3_2', 'intel3_2', 'fun3_2', 'amb3_2', 'attr5_2', 'sinc5_2', 
       'intel5_2', 'fun5_2', 'amb5_2']
       
followup2 = ['you_call', 'them_cal', 'date_3', 'numdat_3', 'num_in_3', 'attr1_3', 
       'sinc1_3', 'intel1_3', 'fun1_3', 'amb1_3', 'shar1_3', 'attr7_3', 'sinc7_3', 
       'intel7_3', 'fun7_3', 'amb7_3', 'shar7_3', 'attr4_3', 'sinc4_3', 'intel4_3', 
       'fun4_3', 'amb4_3', 'shar4_3', 'attr2_3', 'sinc2_3', 'intel2_3', 'fun2_3', 
       'amb2_3', 'shar2_3', 'attr3_3', 'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3', 
       'attr5_3', 'sinc5_3', 'intel5_3', 'fun5_3', 'amb5_3']

# store all infos
infos_all = general + signup1 + scorecard + signup2 + followup1 + followup2
infos_all_df = pd.DataFrame(index = ["general<br>info","signup1","scorecard","signup2","followup1","followup2"],
                                columns = np.arange(0,len(signup1)))
infos_all_df.iloc[0,0:len(general)] = general
infos_all_df.iloc[1,0:len(signup1)] = signup1
infos_all_df.iloc[2,0:len(scorecard)] = scorecard
infos_all_df.iloc[3,0:len(signup2)] = signup2
infos_all_df.iloc[4,0:len(followup1)] = followup1
infos_all_df.iloc[5,0:len(followup2)] = followup2

# initialise variables to store summary results ("not filled", "partially filled", "fully filled")
data_fig1a = pd.DataFrame(np.zeros((6,4)), columns = ["info","count not","count partial","count full"])
data_fig1a["info"] = infos_all_df.index

# loop through dates
for i in range(0,data.shape[0]): 
       
       # loop through type of info
       for j in range(0,infos_all_df.shape[0]):

              # get type of info to be tested
              infos_current = infos_all_df.iloc[j,:]
              infos_current = infos_current[infos_current.notnull()]

              # get data for current subject and current info
              data_current = data.loc[i,infos_current]

              # test for NaN content
              if data_current.isnull().all():
                     data_fig1a.loc[j,"count not"] += 1
              elif data_current.notnull().all():
                     data_fig1a.loc[j,"count full"] += 1
              else:
                     data_fig1a.loc[j,"count partial"] += 1

# update info summary with percentages
data_fig1a["percent not"] = data_fig1a["count not"] * 100 / data.shape[0]
data_fig1a["percent partial"] = data_fig1a["count partial"] * 100 / data.shape[0]
data_fig1a["percent full"] = data_fig1a["count full"] * 100 / data.shape[0]


# 2 - assess effectiveness of the dating process
# get percentage of successful dating steps

# initialise variable to store data to plot
data_fig1b= pd.DataFrame(index = ["percent"], columns = ["likes","matches","dates"])

# get percentages over the total number of records
data_fig1b.loc["percent","likes"] = (data["dec"] == 1).sum() * 100 / data.shape[0]
data_fig1b.loc["percent","matches"] = (data["match"] == 1).sum() * 100 / data.shape[0]
# numdat_3 is repeated on every record of a subject: count it once per subject
# (first non-missing value per subject, subjects without any value are ignored)
dates_total = data.groupby("iid")["numdat_3"].first().sum()
data_fig1b.loc["percent","dates"] = dates_total * 100 / data.shape[0]


# 3 - assess importance of speed date order to get a like and a date

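# get data
# restrict the recoding to the dec column so that the other columns are left untouched
data_fig1c = data.loc[:,["order", "dec", "num_in_3"]].copy()
data_fig1c.loc[data_fig1c["dec"] == "Yes", "dec"] = 1
data_fig1c.loc[data_fig1c["dec"] == "No", "dec"] = 0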

# get order, likes mean and dates mean per date order
order = data_fig1c["order"].sort_values().unique()
likes_mean = data_fig1c.groupby(["order"])["dec"].mean()
dates_mean = data_fig1c.groupby(["order"])["num_in_3"].mean()

# get fits
likes_popt, pcov = curve_fit(lambda x, a, b: a * x + b, order, likes_mean)
likes_fit = likes_popt[0] * order + likes_popt[1]
dates_popt, pcov = curve_fit(lambda x, a, b, c: a * np.exp(b * x) + c, order, dates_mean, p0 = (0.85, 1.4, 0.8))
dates_fit = dates_popt[0] * np.exp(dates_popt[1] * order) + dates_popt[2]

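The classification of each record as "not filled", "partially filled", or "fully filled" for a given survey can be computed without nested loops. A minimal sketch on made-up data, where the two columns `a` and `b` stand for the features of one survey (hypothetical names):

```python
import numpy as np
import pandas as pd

# illustrative answers to a two-question survey, one row per record
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 2.0],
    "b": [3.0, 4.0, np.nan, 5.0],
})

# number of filled answers per record
filled = toy.notnull().sum(axis=1)

# classify each record according to the number of filled answers
status = pd.Series(
    np.select([filled == 0, filled == toy.shape[1]], ["not", "full"], default="partial"),
    index=toy.index,
)

# percentage of records in each class
percent = status.value_counts(normalize=True) * 100
print(percent)
```

Applied once per survey (with the corresponding list of columns), this gives the same three percentages as the counters incremented in the loop.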

In [18]:
### 4 - overview 1 - quality of the speed dating experiment - plot ### ----

# set figure to make subplots
fig1 = make_subplots(
    rows = 2,
    cols = 6,
    specs = [[{"colspan": 4}, None, None, None, {"colspan": 2}, None], 
    [{"colspan": 3}, None, None, {"colspan": 3}, None, None]],
    subplot_titles = ("A. Surveys",
                        "B. Dating steps",
                        "C. Speed date order and likes given",
                        "D. Speed date order and actual dates"),
    row_heights = [0.5, 0.35],
    vertical_spacing = 0.15,
    horizontal_spacing = 0.18)

# plot percentage of fully and partially filled surveys
fig1.add_trace(go.Bar(
    name='Fully filled',
    x = data_fig1a["info"],
    y = data_fig1a["percent full"],
    marker_color = px.colors.qualitative.Vivid[2],
    showlegend = True),
    row = 1, col = 1)

fig1.add_trace(go.Bar(
    name='Partially filled',
    x = data_fig1a["info"],
    y = data_fig1a["percent partial"],
    marker_color = px.colors.qualitative.Vivid[3],
    showlegend = True),
    row = 1, col = 1)

# plot percentage of successful dating steps
fig1.add_trace(go.Bar(
    x = data_fig1b.columns,
    y = data_fig1b.loc["percent",:],
    marker_color = px.colors.qualitative.Vivid[4:],
    text = data_fig1b.loc["percent",:],
    texttemplate= "%{text:.1f}",
    textfont = dict(color = ["rgb(232,232,232)", "rgb(232,232,232)"]),
    showlegend = False),
    row = 1, col = 5)

# plot percentage of likes
fig1.add_trace(go.Scatter(
    x = order, 
    y = likes_mean,
    marker_color = px.colors.qualitative.Vivid[7],
    mode = "markers",
    showlegend = False),
    row = 2, col = 1)
fig1.add_trace(go.Scatter(
    x = order, 
    y = likes_fit, 
    marker_color = px.colors.qualitative.Vivid[9],
    mode = "lines",
    showlegend = False),
    row = 2, col = 1)

# plot number of dates
fig1.add_trace(go.Scatter(
    x = order, 
    y = dates_mean, 
    marker_color = px.colors.qualitative.Vivid[8],
    mode = "markers",
    showlegend = False),
    row = 2, col = 4)
fig1.add_trace(go.Scatter(
    x = order, 
    y = dates_fit, 
    marker_color = px.colors.qualitative.Vivid[9],
    mode = "lines",
    showlegend = False),
    row = 2, col = 4)

# update layout
fig1.update_annotations(font_size = 15)
fig1.update_xaxes(title_font = dict(size = 13), tickfont = dict(size = 11))
fig1.update_yaxes(title_font = dict(size = 13), tickfont = dict(size = 11))
fig1.update_layout(
    margin = dict(l = 90),
    title_text = "Figure 1. Quality of the speed dating experiment",
    title_x = 0.5,
    title_y = 0.95,
    title_font_size = 18,
    barmode = "stack",
    xaxis3 = dict(title = "Speed date order", zeroline = False, showgrid = False),
    xaxis4 = dict(title = "Speed date order", zeroline = False, showgrid = False),
    yaxis = dict(title = "Percent of total records", range = [0,110], tickvals = [0, 20, 40, 60, 80, 100]),
    yaxis2 = dict(title = "Percent of total records", range = [0,55], tickvals = [0, 10, 20, 30, 40, 50]),
    yaxis3 = dict(title = "Ratio of likes (over speed date number)", range = [0.28, 0.55], 
        tickvals = [0.30, 0.35, 0.40, 0.45, 0.50]),
    yaxis4 = dict(title = "Number of actual dates (mean)", range = [0.76, 1.3], 
        tickvals = [0.8, 0.9, 1.0, 1.1, 1.2]),
    legend = dict(
        orientation = "h",
        yanchor = "top",
        y = 0.47,
        xanchor = "left",
        x = 0.12),
    plot_bgcolor = "rgba(0,0,0,0)",
    paper_bgcolor = "rgb(232,232,232)",
    width = 800,
    height = 800)

fig1.show()


### 5 - Overview 2 - Demographics of the cohort

In [19]:
### 5 - overview 2 - demographics of the cohort - plot ### ----

# format data for races
counts = data_fig["race"].value_counts()
races = counts.index

# set figure to make one subplot per demographic panel
fig2 = make_subplots(rows = 2, cols = 2,
                        specs = [[{"type": "pie"}, {"type": "box"}], [{"type": "pie"}, {"type": "box"}]],
                        subplot_titles = ("A. Gender representation",
                                                "B. Age profile per gender",
                                                "C. Ethnicity representation",
                                                "D. Income profile per ethnicity"),
                        column_widths = [0.3, 0.4],
                        horizontal_spacing = 0.3,
                        vertical_spacing = 0.15)

# plot gender representation
fig2.add_trace(go.Pie(
        labels = data_fig["gender"].value_counts(dropna = False).index,
        values = data_fig["gender"].value_counts(dropna = False).values,
        textfont = dict(size = 12),
        textinfo = "label + percent",
        textposition = "outside",
        hole = 0.3,
        showlegend = False,
        marker = dict(colors = px.colors.qualitative.Vivid)),

        row = 1, col = 1)

# plot age profile per gender
fig2.add_trace(go.Box(
        y = data_fig.loc[data_fig["gender"] == "Female","age"],
        name = "Female",
        marker_color = px.colors.qualitative.Vivid[1]),
        row = 1, col = 2)
fig2.add_trace(go.Box(
        y = data_fig.loc[data_fig["gender"] == "Male","age"],
        name = "Male",
        marker_color = px.colors.qualitative.Vivid[0]),
        row = 1, col = 2)

# plot ethnicity representation
fig2.add_trace(go.Pie(
        labels = data_fig["race"].value_counts(dropna = False).index,
        values = data_fig["race"].value_counts(dropna = False).values,
        textfont = dict(size = 12),
        textinfo = "label + percent",
        textposition = "outside",
        rotation = 0,
        hole = 0.3,
        showlegend = False,
        marker = dict(colors = px.colors.qualitative.Vivid[2:])),

        row = 2, col = 1)

# plot income profile per ethnicity
for i, race in enumerate(races):
    fig2.add_trace(go.Box(
        y = data_fig.loc[data_fig["race"] == race, "income"],
        name = race,
        marker_color = px.colors.qualitative.Vivid[i+2]),
        row = 2, col = 2)

# update layout
fig2.update_annotations(font_size = 15)
fig2.update_xaxes(title_font = dict(size = 13), tickfont = dict(size = 12))
fig2.update_yaxes(title_font = dict(size = 13), tickfont = dict(size = 12))
fig2.update_layout(
        margin = dict(l = 90),
        title_text = "Figure 2. Demographics of the cohort",
        title_x = 0.5,
        title_y = 0.95,
        title_font_size = 18,
        yaxis = dict(title = "Age", range = [13,40], tickvals = [15, 20, 25, 30, 35]),
        yaxis2 = dict(title = "Income", range = [-10000,120000], 
                tickvals = [0, 20000, 40000, 60000, 80000, 100000]), 
        showlegend = False,
        plot_bgcolor = "rgba(0,0,0,0)",
        paper_bgcolor = "rgb(232,232,232)",
        width = 800,
        height = 800)

fig2.show()


### 6 - Analysis 1 - Importance and impact of partner's attributes

In [20]:
### 6 - analysis 1 - importance and impact of partner's attributes - get data ### ----

# compare the importance given to a series of attributes before the speed dates

# assess the impact of the most desired attributes (for females and males)
# impact is assessed by comparing the means (t-test) of the grades given to the partner after 
# the speed date for the given attribute, depending on whether they obtained a like or not


# keep only columns that are relevant for the analysis
columns_useful = ['gender', 'dec', "intel", "attr", 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1']
data_fig3 = data6[columns_useful].copy()

# rename columns for nice plotting
columns_names = ["Gender", "Like", "Intel_Rating", "Attr_Rating", "Attractiveness", "Sincerity", "Intelligence",
    "Fun", "Ambition", "Shared<br>interests"]
data_fig3.columns = columns_names

# set masks for genders
mask_f = data_fig3["Gender"] == "Female" 
mask_m = data_fig3["Gender"] == "Male"

# sort attributes by ascending median for plotting
attributes_f = data_fig3.loc[mask_f,columns_names[4:]].median().sort_values()
attributes_m = data_fig3.loc[mask_m,columns_names[4:]].median().sort_values()

# set masks for genders and likes given to partner
mask_f_yes = (data_fig3["Gender"] == "Female") & (data_fig3["Like"] == "Yes")
mask_f_no = (data_fig3["Gender"] == "Female") & (data_fig3["Like"] == "No")
mask_m_yes = (data_fig3["Gender"] == "Male") & (data_fig3["Like"] == "Yes")
mask_m_no = (data_fig3["Gender"] == "Male") & (data_fig3["Like"] == "No")

# assess the impact of male intelligence for getting a like
_, pvalue_intel = stats.ttest_ind(
    data_fig3.loc[mask_f_yes,"Intel_Rating"], data_fig3.loc[mask_f_no,"Intel_Rating"],
    equal_var = False, nan_policy = "omit", alternative = "greater")

# assess the impact of female attractiveness for getting a like
_, pvalue_attr = stats.ttest_ind(
    data_fig3.loc[mask_m_yes,"Attr_Rating"], data_fig3.loc[mask_m_no,"Attr_Rating"],
    equal_var = False, nan_policy = "omit", alternative = "greater")
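
The `equal_var = False` option used in both tests requests Welch's t-test, which does not assume equal variances in the Like and No Like groups. As a minimal sketch of what that computes (standard library only, with made-up ratings that are not taken from the dataset, and a hypothetical helper name `welch_t`), the statistic and its Welch-Satterthwaite degrees of freedom could be derived as follows:

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)

    # squared standard error of each group mean (sample variance / group size)
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b

    # difference of means scaled by the combined standard error
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)

    # approximate degrees of freedom when variances are allowed to differ
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


# hypothetical ratings on a 1-10 scale, for illustration only
ratings_like = [8, 7, 9, 8, 6, 9]
ratings_no_like = [6, 5, 7, 6, 4, 7, 5]
t_stat, dof = welch_t(ratings_like, ratings_no_like)
```

With `alternative = "greater"`, the p-value is the upper-tail probability of this statistic under a Student t distribution with `dof` degrees of freedom, so only a higher mean rating in the Like group counts as evidence.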
        

In [21]:
### 6 - analysis 1 - importance and impact of partner's attributes - plot ### ----

# set figure to make subplots
fig3 = make_subplots(
    rows = 2,
    cols = 2,
    subplot_titles = (
        "A. Importance of attributes<br>as rated by females",
        "B. Impact of Intelligence on<br>female decision (pvalue = {:.4f})".format(pvalue_intel),
        "C. Importance of attributes<br>as rated by males",
        "D. Impact of Attractiveness on<br>male decision (pvalue = {:.4f})".format(pvalue_attr)),
    row_heights = [0.40, 0.40],
    column_widths = [0.6, 0.2],
    horizontal_spacing = 0.15,
    vertical_spacing = 0.15)

# plot importance for females
for attribute in attributes_f.index:
    fig3.add_trace(go.Box(
        y = data_fig3.loc[mask_f, attribute],
        name = attribute,
        marker_color = px.colors.qualitative.Vivid[1]),
        row = 1, col = 1)

# plot importance for males
for attribute in attributes_m.index:
    fig3.add_trace(go.Box(
        y = data_fig3.loc[mask_m, attribute],
        name = attribute,
        marker_color = px.colors.qualitative.Vivid[0]),
        row = 2, col = 1)

# plot impact of intelligence
fig3.add_trace(go.Box(
        y = data_fig3.loc[mask_f_no,"Intel_Rating"],
        name = "No Like",
        marker_color = px.colors.qualitative.Vivid[1]),
        row = 1, col = 2)
fig3.add_trace(go.Box(
        y = data_fig3.loc[mask_f_yes,"Intel_Rating"],
        name = "Like",
        marker_color = px.colors.qualitative.Vivid[1]),
        row = 1, col = 2)

# plot impact of attractiveness
fig3.add_trace(go.Box(
        y = data_fig3.loc[mask_m_no,"Attr_Rating"],
        name = "No Like",
        marker_color = px.colors.qualitative.Vivid[0]),
        row = 2, col = 2)
fig3.add_trace(go.Box(
        y = data_fig3.loc[mask_m_yes,"Attr_Rating"],
        name = "Like",
        marker_color = px.colors.qualitative.Vivid[0]),
        row = 2, col = 2)

# update layout
fig3.update_annotations(font_size = 15)
fig3.update_xaxes(title_font = dict(size = 13), tickfont = dict(size = 10))
fig3.update_yaxes(title_font = dict(size = 13), tickfont = dict(size = 10))
fig3.update_layout(
        margin = dict(l = 90, t = 130),
        title_text = "Figure 3. Importance and impact of partner's attributes",
        title_x = 0.5,
        title_y = 0.95,
        title_font_size = 18, 
        yaxis = dict(title = "Rating (scale 0-100)", range = [-10, 120], tickvals = [0, 20, 40, 60, 80, 100]),
        yaxis2 = dict(title = "Rating (scale 1-10)", range = [-1, 12], tickvals = [0, 2, 4, 6, 8, 10]),
        yaxis3 = dict(title = "Rating (scale 0-100)", range = [-10, 120], tickvals = [0, 20, 40, 60, 80, 100]),
        yaxis4 = dict(title = "Rating (scale 1-10)", range = [-1, 12], tickvals = [0, 2, 4, 6, 8, 10]),
        showlegend = False,
        plot_bgcolor = "rgba(0,0,0,0)",
        paper_bgcolor = "rgb(232,232,232)",
        width = 800,
        height = 800)

fig3.show()

### 7 - Analysis 2 - Impact of shared interests

In [22]:
### 7 - analysis 2 - impact of shared interests - get data ### ----

# the feature int_corr indicates whether partners share interests or not but is not quantitative
# to get a more informative feature, calculate the Euclidean distance between partners (relative to activities)

# assess the impact of having shared interests on giving a like to the partner

# assess the impact of having shared interests on getting a match between partners


# keep only columns that are relevant for the analysis
columns_useful = ['iid', 'pid', 'gender', 'dec', 'match', 'sports', 'tvsports', 'exercise', 'dining', 
    'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 
    'shopping', 'yoga']
data_fig4 = data6[columns_useful].reset_index(drop = True)

# rename columns for nice plotting
columns_names = ["iid", "pid", "Gender", "Like", "Match", "Sports", "TV-Sports", "Exercise", "Dining", 
    "Museums", "Art", "Hiking", "Gaming", "Clubbing", "Reading", "TV", "Theater", "Movies", "Concerts", "Music", 
    "Shopping", "Yoga"]
data_fig4.columns = columns_names

# initialise column to store distance
data_fig4["Distance"] = np.nan

# loop through dates to get Euclidean distances between partners
for i in range(0,data_fig4.shape[0]):

    # get subject data
    subject_activities = data_fig4.loc[i,columns_names[5:]]

    # get partner data
    partner_id = data_fig4.loc[i,"pid"]
    partner_activities_all = data_fig4.loc[data_fig4["iid"] == partner_id,columns_names[5:]].reset_index(drop = True)
    partner_activities = partner_activities_all.loc[0,:]

    # get euclidean distance
    data_fig4.loc[i,"Distance"] = spatial.distance.euclidean(subject_activities, partner_activities)

# set masks for genders and likes given to partner
mask_f_yes = (data_fig4["Gender"] == "Female") & (data_fig4["Like"] == "Yes")
mask_f_no = (data_fig4["Gender"] == "Female") & (data_fig4["Like"] == "No")
mask_m_yes = (data_fig4["Gender"] == "Male") & (data_fig4["Like"] == "Yes")
mask_m_no = (data_fig4["Gender"] == "Male") & (data_fig4["Like"] == "No")

# assess the impact of shared interests for females to give a like
_, pvalue_f = stats.ttest_ind(
    data_fig4.loc[mask_f_yes,"Distance"], data_fig4.loc[mask_f_no,"Distance"],
    equal_var = False, nan_policy = "omit", alternative = "two-sided")

# assess the impact of shared interests for males to give a like
_, pvalue_m = stats.ttest_ind(
    data_fig4.loc[mask_m_yes,"Distance"], data_fig4.loc[mask_m_no,"Distance"],
    equal_var = False, nan_policy = "omit", alternative = "two-sided")

# copy data for safety
data_fig4bis = data_fig4.loc[:,["iid", "pid", "Gender", "Match", "Distance"]]

# initialise variable to store index of data to drop
index_drop = []

# identify reciprocal speed dates
for i in range(0,data_fig4bis.shape[0]):

    # get subject and partner ids
    # do it only for females, to not drop every date
    if data_fig4bis.loc[i,"Gender"] == "Female":
        subject_current = data_fig4bis.loc[i,"iid"]
        partner_current = data_fig4bis.loc[i,"pid"]

        # search for reciprocal speed date
        mask_drop = (data_fig4bis["iid"] == partner_current) & (data_fig4bis["pid"] == subject_current)
        index_drop.append(data_fig4bis.loc[mask_drop,:].index[0])

# drop reciprocal speed dates
data_fig4bis = data_fig4bis.drop(index_drop, axis = 0)

# set masks for matches
mask_yesmatch = (data_fig4bis["Match"] == "Yes")
mask_nomatch = (data_fig4bis["Match"] == "No")

# assess the impact of shared interests for getting a match
_, pvalue_match = stats.ttest_ind(
    data_fig4bis.loc[mask_yesmatch,"Distance"], data_fig4bis.loc[mask_nomatch,"Distance"],
    equal_var = False, nan_policy = "omit", alternative = "two-sided")
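
The row-by-row loop used in this cell to compute distances can be slow on large tables. A possible vectorised alternative is sketched below on hypothetical toy data (the column names mirror the renamed activity columns, the values are invented). It assumes that each participant's activity ratings are identical on all of their rows, so one profile per `iid` is enough.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: one row per (subject, partner) speed date
dates = pd.DataFrame({
    "iid": [1, 1, 2, 3],
    "pid": [2, 3, 1, 1],
    "Sports": [9, 9, 3, 9],
    "Reading": [2, 2, 6, 5],
})
activities = ["Sports", "Reading"]

# one activity profile per participant
profiles = dates.drop_duplicates("iid").set_index("iid")[activities]

# look up each partner's profile by pid, aligned row by row with the dates
# (reindex yields NaN rows for partners that have no profile)
partner = profiles.reindex(dates["pid"]).to_numpy(dtype = float)
subject = dates[activities].to_numpy(dtype = float)

# row-wise Euclidean distance between subject and partner profiles
dates["Distance"] = np.sqrt(((subject - partner) ** 2).sum(axis = 1))
```

Because the distance is symmetric, the two rows describing the same date receive the same value, which is why reciprocal rows are dropped before testing the effect on matches.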


In [23]:
### 7 - analysis 2 - impact of shared interests - plot ### ----

# set figure to make subplots
fig4 = make_subplots(
    rows = 1,
    cols = 3,
    subplot_titles = (
        "A. Impact on female decision<br>(pvalue = {:.4f})".format(pvalue_f),
        "B. Impact on male decision<br>(pvalue = {:.4f})".format(pvalue_m),
        "C. Impact on matches<br>(pvalue = {:.4f})".format(pvalue_match)),
    column_widths = [0.25, 0.25, 0.25],
    horizontal_spacing = 0.15)

# plot impact of shared interests on female decision
fig4.add_trace(go.Box(
        y = data_fig4.loc[mask_f_no,"Distance"],
        name = "No Like",
        marker_color = px.colors.qualitative.Vivid[1]),
        row = 1, col = 1)
fig4.add_trace(go.Box(
        y = data_fig4.loc[mask_f_yes,"Distance"],
        name = "Like",
        marker_color = px.colors.qualitative.Vivid[1]),
        row = 1, col = 1)

# plot impact of shared interests on male decision
fig4.add_trace(go.Box(
        y = data_fig4.loc[mask_m_no,"Distance"],
        name = "No Like",
        marker_color = px.colors.qualitative.Vivid[0]),
        row = 1, col = 2)
fig4.add_trace(go.Box(
        y = data_fig4.loc[mask_m_yes,"Distance"],
        name = "Like",
        marker_color = px.colors.qualitative.Vivid[0]),
        row = 1, col = 2)

# plot impact of shared interests on matches
fig4.add_trace(go.Box(
        y = data_fig4bis.loc[mask_nomatch,"Distance"],
        name = "No Match",
        marker_color = px.colors.qualitative.Vivid[5]),
        row = 1, col = 3)
fig4.add_trace(go.Box(
        y = data_fig4bis.loc[mask_yesmatch,"Distance"],
        name = "Match",
        marker_color = px.colors.qualitative.Vivid[5]),
        row = 1, col = 3)

# update layout
fig4.update_annotations(font_size = 15)
fig4.update_xaxes(title_font = dict(size = 13), tickfont = dict(size = 12))
fig4.update_yaxes(title_font = dict(size = 13), tickfont = dict(size = 12))
fig4.update_layout(
        margin = dict(l = 90),
        title_text = "Figure 4. Impact of shared interests",
        title_x = 0.5,
        title_y = 0.95,
        title_font_size = 18,
        yaxis = dict(title = "Euclidean distance", range = [-1, 35], tickvals = [0, 10, 20, 30]),
        yaxis2 = dict(title = "Euclidean distance", range = [-1, 35], tickvals = [0, 10, 20, 30]),
        yaxis3 = dict(title = "Euclidean distance", range = [-1, 35], tickvals = [0, 10, 20, 30]),
        showlegend = False,
        plot_bgcolor = "rgba(0,0,0,0)",
        paper_bgcolor = "rgb(232,232,232)",
        width = 800,
        height = 400)

fig4.show()

### 8 - Analysis 3 - Importance and impact of shared ethnicity

In [24]:
### 8 - analysis 3 - importance and impact of shared ethnicity - get data ### ----

# compare the importance given to sharing the same ethnicity (imprace) across ethnicities

# assess the impact of shared ethnicity on giving a like and on getting a match, as the share
# of likes and matches that went to a partner of the same ethnicity


# keep only columns that are relevant for the analysis
columns_useful = ['iid', 'pid', 'gender', 'dec', 'match', 'race', 'samerace', 'imprace']
data_fig5 = data6[columns_useful]

# get importance of ethnicity per ethnicity
race_imp = data_fig5.groupby(["race"])["imprace"].median().sort_values()


# get impact of having the same ethnicity on decision to give a like and on getting a match

# initialise variables to store results on impact
data_fig5bc = pd.DataFrame(index = race_imp.index, columns = ["percent_like", "percent_match"])

# set masks
mask_f = data_fig5["gender"] == "Female"
mask_yeslike = data_fig5["dec"] == "Yes"
mask_yesmatch = data_fig5["match"] == "Yes"
mask_yessame = data_fig5["samerace"] == "Yes"

# fill dataframe with percentages
for i in race_imp.index:

    # for likes
    data_fig5bc.loc[i,"percent_like"] = data_fig5.loc[(data_fig5["race"] == i) & mask_yeslike & \
        mask_yessame,"iid"].count() / data_fig5.loc[(data_fig5["race"] == i) & mask_yeslike,"iid"].count() * 100
    
    # for matches (only on females to drop reciprocal dates)
    data_fig5bc.loc[i,"percent_match"] = data_fig5.loc[(data_fig5["race"] == i) & mask_f & mask_yesmatch & \
        mask_yessame,"iid"].count() / data_fig5.loc[(data_fig5["race"] == i) & \
        mask_f & mask_yesmatch,"iid"].count() * 100
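
The per-ethnicity percentages filled in by the loop in this cell can also be obtained with a single `groupby`. The sketch below uses hypothetical toy data (the labels "A" and "B" and all values are invented) and only covers the likes; it is an illustration of the idea, not a drop-in replacement.

```python
import pandas as pd

# hypothetical toy data: one row per speed date
dates = pd.DataFrame({
    "race": ["A", "A", "A", "B", "B", "B"],
    "dec": ["Yes", "Yes", "No", "Yes", "Yes", "Yes"],
    "samerace": ["Yes", "No", "Yes", "No", "No", "Yes"],
})

# keep only the dates where the subject gave a like
likes = dates[dates["dec"] == "Yes"]

# share of those likes that went to a partner of the same ethnicity, per ethnicity
percent_like = likes["samerace"].eq("Yes").groupby(likes["race"]).mean() * 100
```

Taking the mean of a boolean series gives the fraction of `True` values, so no explicit counts or divisions are needed.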


In [25]:
# get impact on decision to give a like

# initialise variable to store results
data_fig5b = pd.DataFrame(np.zeros((2,3)), columns = ["info","same","not same"])
data_fig5b["info"] = ["No Like", "Like"]

# set masks
mask_yeslike = data_fig5["dec"] == "Yes"
mask_nolike = data_fig5["dec"] == "No"
mask_yessame = data_fig5["samerace"] == "Yes"
mask_nosame = data_fig5["samerace"] == "No"

# fill dataframe with percentages
data_fig5b.loc[0,"same"] = data_fig5.loc[mask_nolike & mask_yessame,"iid"].count() / \
    data_fig5.loc[mask_nolike,"iid"].count() * 100
data_fig5b.loc[1,"same"] = data_fig5.loc[mask_yeslike & mask_yessame,"iid"].count() / \
    data_fig5.loc[mask_yeslike,"iid"].count() * 100
data_fig5b.loc[0,"not same"] = data_fig5.loc[mask_nolike & mask_nosame,"iid"].count() / \
    data_fig5.loc[mask_nolike,"iid"].count() * 100
data_fig5b.loc[1,"not same"] = data_fig5.loc[mask_yeslike & mask_nosame,"iid"].count() / \
    data_fig5.loc[mask_yeslike,"iid"].count() * 100


# get impact on matches

# copy data for safety
data_fig5c = data_fig5.loc[:,["iid", "pid", "gender", "match", "samerace"]].reset_index(drop = True)

# initialise variable to store index of data to drop
index_drop = []

# identify reciprocal speed dates
for i in range(0,data_fig5c.shape[0]):

    # get subject and partner ids
    # do it only for females, to not drop every date
    if data_fig5c.loc[i,"gender"] == "Female":
        subject_current = data_fig5c.loc[i,"iid"]
        partner_current = data_fig5c.loc[i,"pid"]

        # search for reciprocal speed date
        mask_drop = (data_fig5c["iid"] == partner_current) & (data_fig5c["pid"] == subject_current)
        index_drop.append(data_fig5c.loc[mask_drop,:].index[0])

# drop reciprocal speed dates
data_fig5c = data_fig5c.drop(index_drop, axis = 0)

# initialise variable to store results
data_fig5cplot = pd.DataFrame(np.zeros((2,3)), columns = ["info","same","not same"])
data_fig5cplot["info"] = ["No Match", "Match"]

# set masks
mask_yesmatch = data_fig5c["match"] == "Yes"
mask_nomatch = data_fig5c["match"] == "No"
mask_yessame = data_fig5c["samerace"] == "Yes"
mask_nosame = data_fig5c["samerace"] == "No"

# fill dataframe with percentages
data_fig5cplot.loc[0,"same"] = data_fig5c.loc[mask_nomatch & mask_yessame,"iid"].count() / \
    data_fig5c.loc[mask_nomatch,"iid"].count() * 100
data_fig5cplot.loc[1,"same"] = data_fig5c.loc[mask_yesmatch & mask_yessame,"iid"].count() / \
    data_fig5c.loc[mask_yesmatch,"iid"].count() * 100
data_fig5cplot.loc[0,"not same"] = data_fig5c.loc[mask_nomatch & mask_nosame,"iid"].count() / \
    data_fig5c.loc[mask_nomatch,"iid"].count() * 100
data_fig5cplot.loc[1,"not same"] = data_fig5c.loc[mask_yesmatch & mask_nosame,"iid"].count() / \
    data_fig5c.loc[mask_yesmatch,"iid"].count() * 100
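
Each speed date appears twice in the data, once from each participant's point of view, which is why reciprocal rows are dropped before counting matches. As a hedged alternative to the explicit search loop, the sketch below builds an order-independent pair key on hypothetical toy data (ids and values are invented) and keeps the first row of each pair.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: each date is listed once per participant
dates = pd.DataFrame({
    "iid": [1, 11, 2, 12],
    "pid": [11, 1, 12, 2],
    "gender": ["Female", "Male", "Female", "Male"],
    "match": ["Yes", "Yes", "No", "No"],
})

# order-independent pair key: (smaller id, larger id)
pair_low = np.minimum(dates["iid"], dates["pid"])
pair_high = np.maximum(dates["iid"], dates["pid"])

# keep the first row seen for each pair, then discard the helper columns
dates_unique = (dates.assign(pair_low = pair_low, pair_high = pair_high)
    .drop_duplicates(["pair_low", "pair_high"])
    .drop(columns = ["pair_low", "pair_high"]))
```

When every date has both of its rows present, simply filtering on `gender == "Female"` gives the same one-row-per-date table; the pair key also covers dates whose reciprocal row is missing.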

In [26]:
### 8 - analysis 3 - importance and impact of shared ethnicity - plot ### ----

# set figure to make subplots
fig5 = make_subplots(
    rows = 1,
    cols = 3,
    subplot_titles = (
        "A. Importance by ethnicity",
        "B. Impact on decision",
        "C. Impact on matches"),
    column_widths = [0.25, 0.25, 0.25],
    horizontal_spacing = 0.15)

# plot importance per ethnicity
for i, race in enumerate(race_imp.index):
    fig5.add_trace(go.Box(
        y = data_fig5.loc[data_fig5["race"] == race, "imprace"],
        name = race,
        marker_color = px.colors.qualitative.Vivid[i+2],
        showlegend = False),
        row = 1, col = 1)

# plot impact of shared ethnicity on decision to give a like
fig5.add_trace(go.Bar(
    x = data_fig5bc.index,
    y = data_fig5bc["percent_like"],
    marker_color = px.colors.qualitative.Vivid[2:],
    showlegend = False),
    row = 1, col = 2)

# plot impact of shared ethnicity on matches
fig5.add_trace(go.Bar(
    x = data_fig5bc.index,
    y = data_fig5bc["percent_match"],
    marker_color = px.colors.qualitative.Vivid[2:],
    showlegend = False),
    row = 1, col = 3)

# update layout
fig5.update_annotations(font_size = 15)
fig5.update_xaxes(title_font = dict(size = 13), tickfont = dict(size = 10), tickangle = 90)
fig5.update_yaxes(title_font = dict(size = 13), tickfont = dict(size = 10))
fig5.update_layout(
        margin = dict(l = 90),
        title_text = "Figure 5. Importance and impact of shared ethnicity",
        title_x = 0.5,
        title_y = 0.95,
        title_font_size = 18,
        yaxis = dict(title = "Rating (scale 1-10)", range = [-1, 12], tickvals = [0, 2, 4, 6, 8, 10]),
        yaxis2 = dict(title = "Percent of total likes<br>per ethnicity", range = [-6, 95], tickvals = [0, 20, 40, 60, 80]),
        yaxis3 = dict(title = "Percent of total matches<br>per ethnicity", range = [-6, 95], tickvals = [0, 20, 40, 60, 80]),
        legend = dict(
            orientation = "h",
            yanchor = "top",
            y = -0.15,
            xanchor = "left",
            x = 0.46,
            font = dict(size = 11)),
        plot_bgcolor = "rgba(0,0,0,0)",
        paper_bgcolor = "rgb(232,232,232)",
        width = 800,
        height = 400)

fig5.show()

### 9 - Analysis 4 - Importance and impact of self-esteem

In [27]:
### 9 - Analysis 4 - Importance and impact of self-esteem - get data ### ----

# self-esteem is quantified as the mean, over a series of attributes, of the difference between the mean grade
# given by the speed date partners and the grade the subject gave themselves
# (negative values mean that the subject rates themselves higher than their partners do)

# keep only columns that are relevant for the analysis
columns_useful = ["iid", "gender", "dec_o", "match", "attr3_1", "sinc3_1", "intel3_1", "fun3_1", "amb3_1", 
    "attr_o", "sinc_o", "intel_o", "fun_o", "amb_o"]
data_fig6 = data6[columns_useful]

# get unique subject ids
id_unique = data_fig6["iid"].unique()

# initialise new dataframe to store self-esteem measure by unique subject
data_fig6a = pd.DataFrame(index = range(0,len(id_unique)), columns = ["iid","gender", "self_esteem"])
data_fig6a["iid"] = id_unique

# get a measure of self-esteem
for i in id_unique:

    # get data
    data_current = data_fig6.loc[data_fig6["iid"] == i, :].reset_index(drop = True)

    # get subject grades
    subject_grades = data_current.loc[0,["attr3_1", "sinc3_1", "intel3_1", "fun3_1", "amb3_1"]]

    # get mean of grades given by partner
    partner_grades = data_current.loc[:,["attr_o", "sinc_o", "intel_o", "fun_o", "amb_o"]].mean()
    
    # get gender
    data_fig6a.loc[data_fig6a["iid"] == i,"gender"] = data_current.loc[0,"gender"]

    # get score for self-esteem if no NaNs in data
    if subject_grades.isnull().any() | partner_grades.isnull().any():
        data_fig6a.loc[data_fig6a["iid"] == i,"self_esteem"] = np.nan
    else:
        data_fig6a.loc[data_fig6a["iid"] == i,"self_esteem"] = \
            [(partner_grades.values - subject_grades.values).mean()]

# copy data for safety (an actual copy, so that data_fig6 is left untouched)
data_fig6b = data_fig6.copy()

# update data with self-esteem measure (one value per subject, repeated on each of their dates)
self_esteem_by_id = data_fig6a.set_index("iid")["self_esteem"].astype(float)
data_fig6b["self_esteem"] = data_fig6b["iid"].map(self_esteem_by_id)
    
# set masks for likes given to partner and matches
mask_yeslike = data_fig6b["dec_o"] == "Yes"
mask_nolike = data_fig6b["dec_o"] == "No"
mask_yesmatch = data_fig6b["match"] == "Yes"
mask_nomatch = data_fig6b["match"] == "No"

# assess the impact of self-esteem on receiving a like
_, pvalue_like = stats.ttest_ind(
    data_fig6b.loc[mask_yeslike,"self_esteem"], data_fig6b.loc[mask_nolike,"self_esteem"],
    equal_var = False, nan_policy = "omit", alternative = "two-sided")

# assess the impact of self-esteem on getting a match
_, pvalue_match = stats.ttest_ind(
    data_fig6b.loc[mask_yesmatch,"self_esteem"], data_fig6b.loc[mask_nomatch,"self_esteem"],
    equal_var = False, nan_policy = "omit", alternative = "two-sided")
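
The self-esteem score defined in this cell can also be computed without explicit loops. The following is a minimal sketch on hypothetical toy data (two attributes only, invented values). It assumes self-ratings are repeated on every row of a subject, and it uses `groupby(...).first()`, which takes the first non-null self-rating rather than strictly the first row.

```python
import pandas as pd

# hypothetical toy data: self-ratings (*3_1) repeat on each row of a subject,
# partner ratings (*_o) differ from date to date
dates = pd.DataFrame({
    "iid": [1, 1, 2, 2],
    "attr3_1": [8, 8, 5, 5],
    "fun3_1": [6, 6, 7, 7],
    "attr_o": [6, 8, 7, 9],
    "fun_o": [5, 5, 8, 6],
})
self_cols = ["attr3_1", "fun3_1"]
partner_cols = ["attr_o", "fun_o"]

grouped = dates.groupby("iid")
self_grades = grouped[self_cols].first()
partner_grades = grouped[partner_cols].mean()

# mean over attributes of (mean partner grade - self grade), one value per subject
gap = partner_grades.to_numpy() - self_grades.to_numpy()
self_esteem = pd.Series(gap.mean(axis = 1), index = self_grades.index, name = "self_esteem")

# broadcast the per-subject score back onto every date of that subject
dates["self_esteem"] = dates["iid"].map(self_esteem)
```

A negative score means the subject graded themselves higher than their partners did on average, and a positive score means the opposite.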


In [28]:
### 9 - Analysis 4 - Importance and impact of self-esteem - plot ### ----

# set figure to make subplots
fig6 = make_subplots(
    rows = 1,
    cols = 3,
    subplot_titles = (
        "A. Self-esteem distribution<br> ",
        "B. Impact on like from partner<br>(pvalue = {:.4f})".format(pvalue_like),
        "C. Impact on matches<br>(pvalue = {:.4f})".format(pvalue_match)),
    column_widths = [0.25, 0.25, 0.25],
    horizontal_spacing = 0.15)

# plot distribution of self-esteem per gender
fig6.add_trace(go.Histogram(
    x = data_fig6a.loc[data_fig6a["gender"] == "Female","self_esteem"],
    name = "Females",
    opacity = 0.6),
    row = 1, col = 1)
fig6.add_trace(go.Histogram(
    x = data_fig6a.loc[data_fig6a["gender"] == "Male","self_esteem"],
    name = "Males",
    opacity = 0.6),
    row = 1, col = 1)

# plot impact on receiving a like from the partner
fig6.add_trace(go.Box(
        y = data_fig6b.loc[mask_nolike,"self_esteem"],
        name = "No Like",
        marker_color = px.colors.qualitative.Vivid[4],
        showlegend = False),
        row = 1, col = 2)
fig6.add_trace(go.Box(
        y = data_fig6b.loc[mask_yeslike,"self_esteem"],
        name = "Like",
        marker_color = px.colors.qualitative.Vivid[4],
        showlegend = False),
        row = 1, col = 2)

# plot impact on matches
fig6.add_trace(go.Box(
        y = data_fig6b.loc[mask_nomatch,"self_esteem"],
        name = "No Match",
        marker_color = px.colors.qualitative.Vivid[5],
        showlegend = False),
        row = 1, col = 3)
fig6.add_trace(go.Box(
        y = data_fig6b.loc[mask_yesmatch,"self_esteem"],
        name = "Match",
        marker_color = px.colors.qualitative.Vivid[5],
        showlegend = False),
        row = 1, col = 3)

# update layout
fig6.update_annotations(font_size = 15)
fig6.update_xaxes(title_font = dict(size = 13), tickfont = dict(size = 10))
fig6.update_yaxes(title_font = dict(size = 13), tickfont = dict(size = 10))
fig6.update_layout(
        margin = dict(l = 90),
        title_text = "Figure 6. Importance and impact of self-esteem",
        title_x = 0.5,
        title_y = 0.95,
        title_font_size = 18,
        barmode = "overlay",
        xaxis = dict(title = "Self-esteem", range = [-6, 4], tickvals = [-4, -2, 0, 2]),
        yaxis = dict(range = [-1, 18], tickvals = [0, 5, 10, 15]),
        yaxis2 = dict(title = "Self-esteem", range = [-5.3, 3.4], tickvals = [-5, -4, -3, -2, -1, 0, 1, 2]),
        yaxis3 = dict(title = "Self-esteem", range = [-5.3, 3.4], tickvals = [-5, -4, -3, -2, -1, 0, 1, 2]),
        legend = dict(
            orientation = "h",
            yanchor = "top",
            y = 1.11,
            xanchor = "left",
            x = -0.025,
            font = dict(size = 11)),
        plot_bgcolor = "rgba(0,0,0,0)",
        paper_bgcolor = "rgb(232,232,232)",
        width = 800,
        height = 400)

fig6.show()

### 10 - Analysis 5 - Importance and impact of the dating goal

In [29]:
### 10 - Analysis 5 - Importance and impact of the dating goal - get data ### ----

# people have different goals or expectations when signing up on a dating app

# compare goals of females and males
# assess the impact of goals on the decision to give a like and on matches

# keep only columns that are relevant for the analysis
columns_useful = ["iid", "pid", "gender", "dec", "dec_o", "match", "goal"]
data_fig7 = data6[columns_useful].reset_index(drop = True)

# set masks for genders
mask_f = data_fig7["gender"] == "Female"
mask_m = data_fig7["gender"] == "Male"

# get unique goals
goals_unique = data_fig7["goal"].dropna().unique()

# initialise dataframe to store results
data_fig7ab_f = pd.DataFrame(index = goals_unique, columns = ["count", "percent_like_g", "percent_match"])
data_fig7ab_m = pd.DataFrame(index = goals_unique, columns = ["count", "percent_like_g", "percent_match"])

# fill dataframe with results
for i in goals_unique:

    # goal count per gender
    data_fig7ab_f.loc[i,"count"] = data_fig7.loc[mask_f & (data_fig7["goal"] == i),"goal"].count()
    data_fig7ab_m.loc[i,"count"] = data_fig7.loc[mask_m & (data_fig7["goal"] == i),"goal"].count()

    # percent like given per goal per gender
    data_fig7ab_f.loc[i,"percent_like_g"] = data_fig7.loc[mask_f & (data_fig7["goal"] == i) & \
        (data_fig7["dec"] == "Yes"),"goal"].count() / data_fig7ab_f.loc[i,"count"] * 100
    data_fig7ab_m.loc[i,"percent_like_g"] = data_fig7.loc[mask_m & (data_fig7["goal"] == i) & \
        (data_fig7["dec"] == "Yes"),"goal"].count() / data_fig7ab_m.loc[i,"count"] * 100

    # percent match per goal per gender
    data_fig7ab_f.loc[i,"percent_match"] = data_fig7.loc[mask_f & (data_fig7["goal"] == i) & \
        (data_fig7["match"] == "Yes"),"goal"].count() / data_fig7ab_f.loc[i,"count"] * 100
    data_fig7ab_m.loc[i,"percent_match"] = data_fig7.loc[mask_m & (data_fig7["goal"] == i) & \
        (data_fig7["match"] == "Yes"),"goal"].count() / data_fig7ab_m.loc[i,"count"] * 100

# sort data by count
data_fig7ab_f = data_fig7ab_f.sort_values("count", ascending = False)
data_fig7ab_m = data_fig7ab_m.sort_values("count", ascending = False)

# create a new column "same_goal" to record goal matching between partners
for i in range(0,data_fig7.shape[0]):

    # get id of partner
    id_partner = data_fig7.loc[i,"pid"]

    # get goal of subject and partner
    goal_subject = data_fig7.loc[i,"goal"]
    goal_partner = data_fig7.loc[data_fig7["iid"] == id_partner,"goal"].reset_index(drop = True)
    goal_partner = goal_partner[0]

    # record goal matching
    if goal_subject == goal_partner:
        data_fig7.loc[i,"same_goal"] = "Yes"
    else:
        data_fig7.loc[i,"same_goal"] = "No"


# get impact of having the same goal on decision to give a like and on getting a match

# initialise variables to store results on impact
data_fig7c_f = pd.DataFrame(index = data_fig7ab_f.index, columns = ["percent_like", "percent_match"])
data_fig7c_m = pd.DataFrame(index = data_fig7ab_m.index, columns = ["percent_like", "percent_match"])

# set masks
mask_yeslike = data_fig7["dec"] == "Yes"
mask_yesmatch = data_fig7["match"] == "Yes"
mask_yessame = data_fig7["same_goal"] == "Yes"

# fill dataframe with percentages
for i in goals_unique:

    # for likes
    data_fig7c_f.loc[i,"percent_like"] = data_fig7.loc[(data_fig7["goal"] == i) & mask_f & mask_yeslike & \
        mask_yessame,"iid"].count() / data_fig7.loc[(data_fig7["goal"] == i) & mask_f & mask_yeslike,"iid"].count() * 100
    data_fig7c_m.loc[i,"percent_like"] = data_fig7.loc[(data_fig7["goal"] == i) & mask_m & mask_yeslike & \
        mask_yessame,"iid"].count() / data_fig7.loc[(data_fig7["goal"] == i) & mask_m & mask_yeslike,"iid"].count() * 100
    
    # for matches (only for females to drop reciprocal dates)
    data_fig7c_f.loc[i,"percent_match"] = data_fig7.loc[(data_fig7["goal"] == i) & mask_f & mask_yesmatch & \
        mask_yessame,"iid"].count() / data_fig7.loc[(data_fig7["goal"] == i) & mask_f & mask_yesmatch,"iid"].count() * 100
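
The `same_goal` flag built by the loop in this cell amounts to looking up each partner's goal by `pid` and comparing it with the subject's own goal. A possible loop-free version is sketched below on hypothetical toy data (ids and goal labels are invented); it assumes each participant reports a single goal.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: one row per speed date
dates = pd.DataFrame({
    "iid": [1, 2, 3, 4],
    "pid": [3, 4, 1, 2],
    "goal": ["Fun", "Date", "Fun", "Meet"],
})

# one goal per participant, indexed by participant id
goal_by_id = dates.drop_duplicates("iid").set_index("iid")["goal"]

# goal of the partner on each date (NaN if the partner is unknown)
partner_goal = dates["pid"].map(goal_by_id)

# missing goals never compare equal, so they are recorded as "No"
dates["same_goal"] = np.where(dates["goal"] == partner_goal, "Yes", "No")
```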


In [30]:
### 10 - Analysis 5 - Importance and impact of the dating goal - plot ### ----

# set figure to make subplots
fig7 = make_subplots(
    rows = 5,
    cols = 4,
    specs = [[{"colspan": 2}, None, {"colspan": 2}, None], [{"colspan": 2}, None, {"colspan": 2}, None],
        [{"colspan": 2}, None, {"colspan": 2}, None], [{"colspan": 2}, None, {"colspan": 2}, None],
        [None, {"colspan": 2}, None, None]],
    subplot_titles = (
        "A. Goals for females",
        "B. Goals for males",
        "C. Proportion of likes given by females",
        "D. Proportion of likes given by males",
        "E. Proportion of matches for females",
        "F. Proportion of matches for males",
        "G. Proportion of likes given by females<br>to males having the same goal",
        "H. Proportion of likes given by males<br>to females having the same goal",
        "I. Impact of shared goals on matches"),
    column_widths = [0.2, 0.2, 0.2, 0.2],
    horizontal_spacing = 0.15)

# plot importance of goals
fig7.add_trace(go.Bar(
    x = data_fig7ab_f.index,
    y = data_fig7ab_f["count"],
    marker_color = px.colors.qualitative.Vivid[1]),
    row = 1, col = 1)
fig7.add_trace(go.Bar(
    x = data_fig7ab_m.index,
    y = data_fig7ab_m["count"],
    marker_color = px.colors.qualitative.Vivid[0]),
    row = 1, col = 3)

# plot percent of likes given per goal
fig7.add_trace(go.Bar(
    x = data_fig7ab_f.index,
    y = data_fig7ab_f["percent_like_g"],
    marker_color = px.colors.qualitative.Vivid[1]),
    row = 2, col = 1)
fig7.add_trace(go.Bar(
    x = data_fig7ab_m.index,
    y = data_fig7ab_m["percent_like_g"],
    marker_color = px.colors.qualitative.Vivid[0]),
    row = 2, col = 3)

# plot percent of matches per goal
fig7.add_trace(go.Bar(
    x = data_fig7ab_f.index,
    y = data_fig7ab_f["percent_match"],
    marker_color = px.colors.qualitative.Vivid[1]),
    row = 3, col = 1)
fig7.add_trace(go.Bar(
    x = data_fig7ab_m.index,
    y = data_fig7ab_m["percent_match"],
    marker_color = px.colors.qualitative.Vivid[0]),
    row = 3, col = 3)

# plot impact of shared goals on decision to give a like
fig7.add_trace(go.Bar(
    x = data_fig7c_f.index,
    y = data_fig7c_f["percent_like"],
    marker_color = px.colors.qualitative.Vivid[1]),
    row = 4, col = 1)
fig7.add_trace(go.Bar(
    x = data_fig7c_m.index,
    y = data_fig7c_m["percent_like"],
    marker_color = px.colors.qualitative.Vivid[0]),
    row = 4, col = 3)

# panel I: impact of shared goals on matches
# sort once so that the x labels and y values are guaranteed to stay aligned
data_fig7c_f_sorted = data_fig7c_f.sort_values(by = "percent_match", ascending = False)
fig7.add_trace(go.Bar(
    x = data_fig7c_f_sorted.index,
    y = data_fig7c_f_sorted["percent_match"],
    marker_color = px.colors.qualitative.Vivid[3]),
    row = 5, col = 2)

# update layout
fig7.update_annotations(font_size = 15)
fig7.update_xaxes(title_font = dict(size = 13), tickfont = dict(size = 10), tickangle = 90)
fig7.update_yaxes(title_font = dict(size = 13), tickfont = dict(size = 10))
fig7.update_layout(
        margin = dict(l = 90, t = 120),
        title_text = "Figure 7. Importance and impact of dating goals",
        title_x = 0.5,
        title_y = 0.98,
        title_font_size = 18,
        yaxis = dict(title = "Subject count", range = [-20, 500], tickvals = [0, 100, 200, 300, 400]),
        yaxis2 = dict(title = "Subject count", range = [-20, 500], tickvals = [0, 100, 200, 300, 400]),
        yaxis3 = dict(title = "Percent likes per goal", range = [-10, 120], tickvals = [0, 20, 40, 60, 80, 100]),
        yaxis4 = dict(title = "Percent likes per goal", range = [-10, 120], tickvals = [0, 20, 40, 60, 80, 100]),
        yaxis5 = dict(title = "Percent matches per goal", range = [-10, 120], tickvals = [0, 20, 40, 60, 80, 100]),
        yaxis6 = dict(title = "Percent matches per goal", range = [-10, 120], tickvals = [0, 20, 40, 60, 80, 100]),
        yaxis7 = dict(title = "Percent of total likes<br>per goal", range = [-6, 72], tickvals = [0, 20, 40, 60]),
        yaxis8 = dict(title = "Percent of total likes<br>per goal", range = [-6, 72], tickvals = [0, 20, 40, 60]),
        yaxis9 = dict(title = "Percent of total matches<br>per goal", range = [-6, 72], tickvals = [0, 20, 40, 60]),
        # the legend is hidden, so no legend positioning is needed
        showlegend = False,
        plot_bgcolor = "rgba(0,0,0,0)",
        paper_bgcolor = "rgb(232,232,232)",
        width = 800,
        height = 1600)

fig7.show()