# Speed Dating

## Challenge description

We will start a new data visualization and exploration project. Your goal will be to try to understand *love*! It's a very complicated subject, so we've simplified it: you will try to understand what happens during a speed dating event and, above all, what influences whether people get a **second date**.

This is a Kaggle competition, and you can find more details here:

[Speed Dating Dataset](https://www.kaggle.com/annavictoria/speed-dating-experiment#Speed%20Dating%20Data%20Key.doc)

Take some time to read the description of the challenge and try to understand each of the variables in the dataset. The document *Speed Dating - Variable Description.md* can help you with this.

### Deliverable

To be successful in this project, you will need to do a descriptive analysis of the main factors that influence getting a second date.

Over the next few days, you'll learn how to use Python libraries like seaborn, plotly and bokeh to produce data visualizations that highlight relevant facts about the dataset.

For today, you can start exploring the dataset with pandas to extract some statistics.

In [220]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [2]:
dataset = pd.read_csv('Speed Dating Data.csv',encoding_errors='ignore')
dataset.head()

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1,1.0,0,1,1,1,10,7,,4,...,5.0,7.0,7.0,7.0,7.0,,,,,
1,1,1.0,0,1,1,1,10,7,,3,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,1,1.0,0,1,1,1,10,7,,10,...,5.0,7.0,7.0,7.0,7.0,,,,,
3,1,1.0,0,1,1,1,10,7,,5,...,5.0,7.0,7.0,7.0,7.0,,,,,
4,1,1.0,0,1,1,1,10,7,,7,...,5.0,7.0,7.0,7.0,7.0,,,,,


In [3]:
nb_participants = len(dataset['iid'].unique())
nb_participants # iid 118 does not exist

551

In [4]:
nb_groupes = len(dataset['wave'].unique())
nb_groupes 

21

In [5]:
dataset.describe(include=['O'])

Unnamed: 0,field,undergra,mn_sat,tuition,from,zipcode,income,career
count,8315,4914,3133.0,3583.0,8299,7314,4279.0,8289
unique,259,241,68.0,115.0,269,409,261.0,367
top,Business,UC Berkeley,1400.0,26908.0,New York,0,55080.0,Finance
freq,521,107,403.0,241.0,522,355,124.0,202


In [6]:
dataset_temp = dataset.iloc[:,0:35]
dataset_temp.head()

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o,age,field
0,1,1.0,0,1,1,1,10,7,,4,...,8.0,8.0,8.0,8.0,6.0,7.0,4.0,2.0,21.0,Law
1,1,1.0,0,1,1,1,10,7,,3,...,8.0,10.0,7.0,7.0,5.0,8.0,4.0,2.0,21.0,Law
2,1,1.0,0,1,1,1,10,7,,10,...,10.0,10.0,10.0,10.0,10.0,10.0,10.0,1.0,21.0,Law
3,1,1.0,0,1,1,1,10,7,,5,...,8.0,9.0,8.0,9.0,8.0,7.0,7.0,2.0,21.0,Law
4,1,1.0,0,1,1,1,10,7,,7,...,7.0,9.0,6.0,9.0,7.0,8.0,6.0,2.0,21.0,Law


In [7]:
dataset['sports']

0       9.0
1       9.0
2       9.0
3       9.0
4       9.0
       ... 
8373    8.0
8374    8.0
8375    8.0
8376    8.0
8377    8.0
Name: sports, Length: 8378, dtype: float64

### Wave 1 : data manipulation

Let's start with wave 1 to see all the variables.

In [107]:
data_wave1=dataset.loc[dataset['wave']==1]
data_wave1.head()

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1,1.0,0,1,1,1,10,7,,4,...,5.0,7.0,7.0,7.0,7.0,,,,,
1,1,1.0,0,1,1,1,10,7,,3,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,1,1.0,0,1,1,1,10,7,,10,...,5.0,7.0,7.0,7.0,7.0,,,,,
3,1,1.0,0,1,1,1,10,7,,5,...,5.0,7.0,7.0,7.0,7.0,,,,,
4,1,1.0,0,1,1,1,10,7,,7,...,5.0,7.0,7.0,7.0,7.0,,,,,


In [141]:
data_wave1.columns.tolist()

['iid',
 'id',
 'gender',
 'idg',
 'condtn',
 'wave',
 'round',
 'position',
 'positin1',
 'order',
 'partner',
 'pid',
 'match',
 'int_corr',
 'samerace',
 'age_o',
 'race_o',
 'pf_o_att',
 'pf_o_sin',
 'pf_o_int',
 'pf_o_fun',
 'pf_o_amb',
 'pf_o_sha',
 'dec_o',
 'attr_o',
 'sinc_o',
 'intel_o',
 'fun_o',
 'amb_o',
 'shar_o',
 'like_o',
 'prob_o',
 'met_o',
 'age',
 'field',
 'field_cd',
 'undergra',
 'mn_sat',
 'tuition',
 'race',
 'imprace',
 'imprelig',
 'from',
 'zipcode',
 'income',
 'goal',
 'date',
 'go_out',
 'career',
 'career_c',
 'sports',
 'tvsports',
 'exercise',
 'dining',
 'museums',
 'art',
 'hiking',
 'gaming',
 'clubbing',
 'reading',
 'tv',
 'theater',
 'movies',
 'concerts',
 'music',
 'shopping',
 'yoga',
 'exphappy',
 'expnum',
 'attr1_1',
 'sinc1_1',
 'intel1_1',
 'fun1_1',
 'amb1_1',
 'shar1_1',
 'attr4_1',
 'sinc4_1',
 'intel4_1',
 'fun4_1',
 'amb4_1',
 'shar4_1',
 'attr2_1',
 'sinc2_1',
 'intel2_1',
 'fun2_1',
 'amb2_1',
 'shar2_1',
 'attr3_1',
 'sinc3_1',
 

#### D-day: before the speed dating starts

We will start by categorising all the variables, in order to make the database easier to manipulate and analyse.

##### 1- Signup:
 All the information about each person

In [205]:
iid_infos_vars=['age','race','from','zipcode','income']
iid_study_vars=['field','field_cd','undergra','mn_sat','tuition']
iid_career_vars =['career','career_c']
iid_entertain_vars =['go_out','sports','tvsports','exercise','dining','museums','art','hiking','gaming','clubbing','reading','tv','theater','movies','concerts','music','shopping','yoga']
iid_dating_vars = ['imprace','imprelig','goal','date','expnum']
iid_attributes1_1_vars = ['attr1_1','sinc1_1','intel1_1','fun1_1','amb1_1','shar1_1'] # what the participant looks for in the opposite sex
iid_attributes2_1_vars = ['attr2_1','sinc2_1','intel2_1','fun2_1','amb2_1','shar2_1'] # what the participant thinks the opposite sex looks for
iid_attributes3_1_vars = ['attr3_1','sinc3_1','intel3_1','fun3_1','amb3_1'] # how the participant rates himself/herself
iid_attributes4_1_vars = ['attr4_1','sinc4_1','intel4_1','fun4_1','amb4_1','shar4_1'] # what the participant thinks most people of the same sex look for in the opposite sex
iid_attributes5_1_vars = ['attr5_1','sinc5_1','intel5_1','fun5_1','amb5_1'] # how the participant thinks others perceive him/her

In [169]:
len(iid_entertain_vars)

18

##### 1.a- Cleaning the database 

###### Here we will look for anomalies in the database, for each category defined above

In [38]:
#number of participants in wave 1
nb_participants = len(data_wave1['iid'].unique())
nb_participants

20

In [45]:
# keep one row per participant (iid) to avoid duplicates
data_wave1_iids= data_wave1.drop_duplicates(subset=['iid'])
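
The raw data has one row per meeting, so each participant appears several times. As a minimal sketch of what `drop_duplicates` does here, the toy DataFrame below (hypothetical values, not taken from the dataset) keeps only the first row of each `iid`:

```python
import pandas as pd

# Hypothetical long-format data: one row per (participant, partner) meeting
meetings = pd.DataFrame({
    "iid": [1, 1, 1, 2, 2],
    "pid": [11, 12, 13, 11, 12],
    "age": [21, 21, 21, 24, 24],
})

# keep the first row of each participant; the original index labels are preserved
participants = meetings.drop_duplicates(subset=["iid"])
print(participants[["iid", "age"]])
```

Because the original index labels are kept, the per-participant frame is indexed by the position of each participant's first meeting rather than by 0, 1, 2, and so on.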

 -- Cleaning : " Infos Variables " --

In [46]:
# infos variables
data_wave1_iids[iid_infos_vars].describe(include='all')

Unnamed: 0,age,race,from,zipcode,income
count,20.0,20.0,20,20.0,12.0
unique,,,19,18.0,12.0
top,,,Southern California,0.0,69487.0
freq,,,2,3.0,1.0
mean,24.4,2.65,,,
std,2.414866,1.136708,,,
min,21.0,2.0,,,
25%,22.75,2.0,,,
50%,24.0,2.0,,,
75%,26.0,3.25,,,


We see that the variable 'from' is unstructured and difficult to structure, and the same goes for 'zipcode', so we're going to drop both from our analysis.

In [51]:
iid_infos_vars = ['age','race','income']

The 'income' variable is an important one, so we're going to keep it and fill all NaNs with the median.

Why the median?

The mean of incomes is strongly influenced by very large incomes (outliers), so I prefer to use the median.
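
To make the argument concrete, here is a small sketch with made-up incomes (they are not taken from the dataset). Adding a single very large value moves the mean a lot, while the median barely changes:

```python
from statistics import mean, median

# Hypothetical incomes, for illustration only (not taken from the dataset)
incomes = [40_000, 45_000, 50_000, 55_000, 60_000]
incomes_with_outlier = incomes + [1_000_000]

mean_before, median_before = mean(incomes), median(incomes)
mean_after, median_after = mean(incomes_with_outlier), median(incomes_with_outlier)

print("without outlier:", mean_before, median_before)
print("with outlier:   ", mean_after, median_after)
```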

In [71]:
# Incomes mean 
Income_mean = pd.to_numeric(data_wave1_iids['income'].str.replace(',','')).mean()
Income_mean

53020.25

In [72]:
# Incomes median
Income_median = pd.to_numeric(data_wave1_iids['income'].str.replace(',','')).median()
Income_median

53315.0

In [108]:
#Convert incomes to numeric (assign returns a new DataFrame, which avoids the SettingWithCopyWarning)
data_wave1 = data_wave1.assign(income=pd.to_numeric(data_wave1['income'].str.replace(',','')))


In [110]:
#fill NAN with the median

value = {'income':Income_median} 
data_wave1 = data_wave1.fillna(value=value)

In [111]:
# Verify that everything went well
data_wave1_iids= data_wave1.drop_duplicates(subset=['iid'])
data_wave1_iids[iid_infos_vars].describe(include='all')

Unnamed: 0,age,race,income
count,20.0,20.0,20.0
mean,24.4,2.65,53138.15
std,2.414866,1.136708,12372.777492
min,21.0,2.0,29237.0
25%,22.75,2.0,51170.5
50%,24.0,2.0,53315.0
75%,26.0,3.25,55110.0
max,30.0,6.0,86340.0


!! All went well !!

Unfortunately, we see that the standard deviation of income is quite large, so we're going to leave it out of our analysis for now.
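
One possible way to put this standard deviation in context is the coefficient of variation, i.e. the standard deviation divided by the mean. The sketch below only reuses the two numbers printed in the `describe()` output above; whether the resulting ratio counts as "large" is a judgment call.

```python
# Values copied from the describe() output above
income_mean = 53138.15
income_std = 12372.777492

# coefficient of variation: standard deviation relative to the mean
cv = income_std / income_mean
print(f"{cv:.1%}")
```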



In [206]:
iid_infos_vars = ['age','race']

 -- Cleaning : " Study variables " --

In [131]:
data_wave1_iids[iid_study_vars].describe(include='all')

Unnamed: 0,field,field_cd,undergra,mn_sat,tuition
count,20,20.0,0.0,0.0,0.0
unique,9,,0.0,0.0,0.0
top,Law,,,,
freq,8,,,,
mean,,5.2,,,
std,,4.96938,,,
min,,1.0,,,
25%,,1.0,,,
50%,,1.5,,,
75%,,8.0,,,


Here it's easy: 'undergra', 'mn_sat' and 'tuition' are all going to be dropped because they contain no values.

In [132]:
iid_study_vars=['field','field_cd']

For the other variables, we're going to keep 'field_cd' because it's better structured

In [207]:
iid_study_vars=['field_cd']

 -- Cleaning : " Career variables " --

In [136]:
data_wave1_iids[iid_career_vars].describe(include='all')

Unnamed: 0,career,career_c
count,20,17.0
unique,17,
top,lawyer,
freq,3,
mean,,3.764706
std,,3.250566
min,,1.0
25%,,1.0
50%,,2.0
75%,,7.0


As with the study variables, we're going to keep only 'career_c' because it's better structured.

In [208]:
iid_career_vars=['career_c']

 -- Cleaning : " Entertaining variables " --

In [180]:
data_wave1_iids[iid_entertain_vars].describe(include='all')

Unnamed: 0,go_out,sports,tvsports,exercise,dining,museums,art,hiking,gaming,clubbing,reading,tv,theater,movies,concerts,music,shopping,yoga
count,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0
mean,1.2,6.8,5.3,6.45,8.2,6.15,5.9,5.15,4.45,6.95,7.3,5.65,6.4,8.15,6.9,7.4,5.5,4.0
std,0.695852,2.93078,3.02794,2.999561,1.321881,2.183069,2.268781,2.230766,2.502104,2.187885,1.41793,2.99605,2.945112,1.308877,1.832456,1.500877,2.781518,3.509386
min,1.0,1.0,1.0,1.0,6.0,1.0,1.0,2.0,1.0,2.0,4.0,1.0,1.0,6.0,3.0,4.0,1.0,1.0
25%,1.0,4.75,2.0,4.75,7.0,5.0,5.0,3.75,2.0,5.0,6.0,2.75,4.0,7.0,6.0,6.75,2.75,1.0
50%,1.0,7.5,5.5,7.0,8.0,6.0,6.0,5.0,4.5,7.0,8.0,6.5,7.0,8.0,7.0,7.5,6.0,2.5
75%,1.0,9.0,8.0,9.0,9.25,7.25,7.0,6.25,6.25,8.25,8.0,8.0,9.0,9.0,7.25,8.25,8.0,7.0
max,4.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0


In [None]:
range_columns=[7,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10]

###### New function: identifying all columns with out-of-range or non-integer values

In [176]:
def find_bad_columns(dataset,columns_list,range_columns):
    # dataset : DataFrame in which to look for bad columns
    # columns_list : list with the names of the columns to check
    # range_columns : list with the max allowed value of each column
    bad_columns=[]
    for c,max_value in zip(columns_list,range_columns):
        values = dataset[c]
        # a value is valid if it is a whole number between 1 and max_value (NaN fails every test)
        is_valid = (values % 1 == 0) & (values >= 1) & (values <= max_value)
        if not is_valid.all():
            bad_columns.append(c)
    return bad_columns


In [178]:
bad_columns = find_bad_columns(data_wave1_iids,iid_entertain_vars,range_columns)
bad_columns

[]

Since all values of those columns are in range, we're going to keep all those variables
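
To show what such a check would flag, here is a plain-Python sketch of the same validity rule applied to a toy column. The helper `is_valid_score` and the values are hypothetical and only illustrate the rule: whole number, at least 1, at most the column maximum.

```python
import math

def is_valid_score(x, max_value):
    """True if x is a whole number between 1 and max_value (NaN is never valid)."""
    if isinstance(x, float) and math.isnan(x):
        return False
    return x == int(x) and 1 <= x <= max_value

# Hypothetical column: two valid scores, a non-integer, an out-of-range value and a NaN
toy_column = [3.0, 10.0, 7.5, 12.0, float("nan")]
flags = [is_valid_score(x, 10) for x in toy_column]
print(flags)
```

A column containing any `False` flag would be reported as a bad column.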

 -- Cleaning : " dating variables " --

In [179]:
data_wave1_iids[iid_dating_vars].describe(include='all')

Unnamed: 0,imprace,imprelig,goal,date,expnum
count,20.0,20.0,20.0,20.0,20.0
mean,3.25,2.5,1.75,4.2,8.3
std,2.953588,2.013115,1.292692,1.794729,6.497773
min,1.0,1.0,1.0,1.0,2.0
25%,1.0,1.0,1.0,3.0,2.75
50%,2.0,1.0,1.0,4.5,7.0
75%,4.75,4.0,2.0,5.0,11.25
max,9.0,8.0,6.0,7.0,20.0


In [184]:
range_columns=[10,10,6,7,20]

In [185]:
bad_columns = find_bad_columns(data_wave1_iids,iid_dating_vars,range_columns)
bad_columns

[]

All good !! Nothing is wrong with those columns

 -- Cleaning : " Attributes variables " --

For these variables, two different scales were used to collect the answers:

- By importance (1-10) for waves 6-9

- With a distribution of 100 points in the other waves

To unify these scales, we are going to convert all ratings to the second one, i.e. a distribution of 100 points.

To do that, we sum all the attributes of a category and then divide each one by that sum, which gives us the percentage priority of each attribute.
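
As a worked example of this rescaling, the sketch below converts a set of made-up 1-10 importance ratings into shares of 100 points. The helper name `to_100_points` and the ratings are hypothetical; note that a row whose ratings sum to zero would need special handling.

```python
def to_100_points(ratings):
    """Rescale a list of ratings so that they sum to 100."""
    total = sum(ratings)
    return [r / total * 100 for r in ratings]

# Hypothetical 1-10 importance ratings for the six attributes (they sum to 40)
importance = [8, 6, 7, 9, 5, 5]
shares = to_100_points(importance)
print(shares)
print(sum(shares))
```

Ratings that were already expressed as a 100-point distribution are left essentially unchanged by this operation, which is why it can be applied to every wave.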



In [199]:
def unify_attributes(dataset,attribute_list):
    return dataset[attribute_list].div(dataset[attribute_list].sum(axis=1),axis='index')*100


In [201]:
dataset_unified = unify_attributes(data_wave1,iid_attributes1_1_vars)
data_wave1[iid_attributes1_1_vars]=dataset_unified

All done, let's verify that all attributes are unified

In [202]:
data_wave1_iids= data_wave1.drop_duplicates(subset=['iid'])
data_wave1_iids[iid_attributes1_1_vars].sum(axis=1)

0      100.0
10     100.0
20     100.0
30     100.0
40     100.0
50     100.0
60     100.0
70     100.0
80     100.0
90     100.0
100    100.0
110    100.0
120    100.0
130    100.0
140    100.0
150    100.0
160    100.0
170    100.0
180    100.0
190    100.0
dtype: float64

It all worked out well.

Let's now unify all the other attributes

In [209]:
#attributes2_1
data_wave1[iid_attributes2_1_vars]=unify_attributes(data_wave1,iid_attributes2_1_vars)

#attributes3_1
data_wave1[iid_attributes3_1_vars]=unify_attributes(data_wave1,iid_attributes3_1_vars)

#attributes4_1
data_wave1[iid_attributes4_1_vars]=unify_attributes(data_wave1,iid_attributes4_1_vars)

#attributes5_1
data_wave1[iid_attributes5_1_vars]=unify_attributes(data_wave1,iid_attributes5_1_vars)


## What does a person like in the opposite sex?

###  Expectations 

#### 1- Individual Expectations

In [233]:
attributes=['Attractive','Sincere','Intelligent','Fun','Ambitious','Has shared interests']

In [232]:
iid_attributes1_1_vars

['attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1']

In [245]:
data_wave1_iids= data_wave1.drop_duplicates(subset=['iid'])
# women (gender == 0): build the mask from the same DataFrame that is being filtered
dataset_temp = data_wave1_iids.loc[data_wave1_iids['gender']==0, iid_attributes1_1_vars]
data1_1_W = dataset_temp.describe()

In [246]:
data_wave1_iids= data_wave1.drop_duplicates(subset=['iid'])
# men (gender == 1)
dataset_temp2 = data_wave1_iids.loc[data_wave1_iids['gender']==1, iid_attributes1_1_vars]
data1_1_M = dataset_temp2.describe()

In [248]:

# plot star plot for comparison
fig = make_subplots(
    rows=1, cols=2,
    specs=[[{"type": "polar"}, {"type": "polar"}]],
    subplot_titles=("Women expectations", "Men expectations")
)
  
fig.add_trace(go.Scatterpolar(
      r=data1_1_W.iloc[1][iid_attributes1_1_vars],
      theta=attributes,
      fill='toself',
      name='women'
),
row=1,col=1)
fig.add_trace(go.Scatterpolar(
      r=data1_1_M.iloc[1][iid_attributes1_1_vars],
      theta=attributes,
      fill='toself',
      name='men'
),
row=1,col=2)
  
fig.update_layout(
  polar=dict(
    radialaxis=dict(
      visible=True
          )),
    template = 'plotly_dark',
  
  showlegend=True
)
  
fig.show()

#### 2- Collective Expectations 
(only applicable to waves after wave 6)

In [249]:
iid_attributes4_1_vars

#For women
dataset_temp = data_wave1_iids.loc[data_wave1_iids['gender']==0, iid_attributes4_1_vars]
data4_1_W = dataset_temp.describe()

#For men
dataset_temp2 = data_wave1_iids.loc[data_wave1_iids['gender']==1, iid_attributes4_1_vars]
data4_1_M = dataset_temp2.describe()

In [258]:
import plotly

cols = plotly.colors.qualitative.Dark24

# plot star plot for comparison 
fig = make_subplots(
    rows=1, cols=2,
    specs=[[{"type": "polar"}, {"type": "polar"}]],
    subplot_titles=("Women expectations", "Men expectations")
)

  
fig.add_trace(go.Scatterpolar(
      r=data1_1_W.iloc[1][iid_attributes1_1_vars],
      theta=attributes,
      fill='toself',
      line=dict(color=cols[0]),
      name='Individual Expectations'
),
row=1,col=1)
fig.add_trace(go.Scatterpolar(
      r=data4_1_W.iloc[1][iid_attributes4_1_vars],
      theta=attributes,
      fill='toself',
      line=dict(color=cols[1]),
      name='Collective Expectations'
),
row=1,col=1)

fig.add_trace(go.Scatterpolar(
      r=data1_1_M.iloc[1][iid_attributes1_1_vars],
      theta=attributes,
      fill='toself',
      line=dict(color=cols[0]),
      name='Individual Expectations',
      showlegend=False
),
row=1,col=2)

fig.add_trace(go.Scatterpolar(
      r=data4_1_M.iloc[1][iid_attributes4_1_vars],
      theta=attributes,
      fill='toself',
      line=dict(color=cols[1]),
      name='Collective Expectations',
      showlegend=False
),
row=1,col=2)
  
fig.update_layout(
  polar=dict(
    radialaxis=dict(
      visible=True
          )),
    template = 'plotly_dark',
  
  showlegend=True
)
  
fig.show()

#### 3- Opposite sex Expectations

In [259]:
#What women think men expect
dataset_temp = data_wave1_iids.loc[data_wave1_iids['gender']==0, iid_attributes2_1_vars]
data2_1_W = dataset_temp.describe()

#What men think women expect
dataset_temp2 = data_wave1_iids.loc[data_wave1_iids['gender']==1, iid_attributes2_1_vars]
data2_1_M = dataset_temp2.describe()

In [261]:
# plot star plot for comparison 
fig = make_subplots(
    rows=1, cols=2,
    specs=[[{"type": "polar"}, {"type": "polar"}]],
    subplot_titles=("Women expectations", "Men expectations")
)

  
fig.add_trace(go.Scatterpolar(
      r=data1_1_W.iloc[1][iid_attributes1_1_vars],
      theta=attributes,
      fill='toself',
      line=dict(color=cols[0]),
      name='Individual Expectations'
),
row=1,col=1)
fig.add_trace(go.Scatterpolar(
      r=data2_1_M.iloc[1][iid_attributes2_1_vars],
      theta=attributes,
      fill='toself',
      line=dict(color=cols[1]),
      name='Other sex Expectations'
),
row=1,col=1)

fig.add_trace(go.Scatterpolar(
      r=data1_1_M.iloc[1][iid_attributes1_1_vars],
      theta=attributes,
      fill='toself',
      line=dict(color=cols[0]),
      name='Individual Expectations',
      showlegend=False
),
row=1,col=2)

fig.add_trace(go.Scatterpolar(
      r=data2_1_W.iloc[1][iid_attributes2_1_vars],
      theta=attributes,
      fill='toself',
      line=dict(color=cols[1]),
      name='Other sex Expectations',
      showlegend=False
),
row=1,col=2)
  
fig.update_layout(
  polar=dict(
    radialaxis=dict(
      visible=True
          )),
    template = 'plotly_dark',
  
  showlegend=True
)
  
fig.show()

We see in these graphs that men overestimated the importance of 'Attractiveness' and 'Ambition' in women's expectations, but underestimated 'Intelligence' and 'Fun'. One possible reading is that men project their own criteria onto women.

On the other hand, women overestimated the importance of 'Fun' in men's expectations, but were right about the importance of 'Attractiveness', which dominates men's expectations of the opposite sex.
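
The comparison read off the charts can also be expressed as a simple difference between what one gender believes and what the other gender actually reports. The sketch below uses made-up point allocations (not the values from the dataset) purely to illustrate the computation; a positive gap means the attribute is overestimated.

```python
attributes = ["Attractive", "Sincere", "Intelligent", "Fun", "Ambitious", "Has shared interests"]

# Hypothetical mean point allocations (each list sums to 100), for illustration only
women_want = [18, 20, 21, 17, 13, 11]            # what women say they look for (attr1_1, ...)
men_think_women_want = [30, 13, 14, 15, 19, 9]   # what men believe women look for (attr2_1, ...)

# positive gap: men overestimate the attribute; negative gap: men underestimate it
gaps = {a: guess - actual
        for a, guess, actual in zip(attributes, men_think_women_want, women_want)}

overestimated = [a for a, g in gaps.items() if g > 0]
underestimated = [a for a, g in gaps.items() if g < 0]
print(overestimated)
print(underestimated)
```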

### Self-esteem

##### 1- Average self-esteem

In [262]:
self_perception = []

for i,j in zip(iid_attributes3_1_vars,iid_attributes5_1_vars):
    self_perception+=[i,j]


In [268]:
data_wave1_src=dataset.loc[dataset['wave']==1]
data_src_iid=data_wave1_src.drop_duplicates(subset=['iid'])


In [270]:
#What women think of themselves
dataset_temp = data_src_iid.loc[data_src_iid['gender']==0, iid_attributes3_1_vars]
data3_1_W = dataset_temp.describe()

#What men think of themselves
dataset_temp2 = data_src_iid.loc[data_src_iid['gender']==1, iid_attributes3_1_vars]
data3_1_M = dataset_temp2.describe()

In [293]:
# plot star plot for comparison 
fig = make_subplots(
    rows=1, cols=1,
    specs=[[{"type": "polar"}]],)
  
fig.add_trace(go.Scatterpolar(
      r=data3_1_W.iloc[1][iid_attributes3_1_vars],
      theta=attributes[:len(iid_attributes3_1_vars)],
      fill='toself',
      line=dict(color=cols[1]),
      name='Women'
),
row=1,col=1)

fig.add_trace(go.Scatterpolar(
      r=data3_1_M.iloc[1][iid_attributes3_1_vars],
      theta=attributes[:len(iid_attributes3_1_vars)],
      fill='toself',
      line=dict(color=cols[0]),
      name='Men'
),
row=1,col=1)

  
fig.update_layout(
  polar=dict(
    radialaxis=dict(
      visible=True
          )),
  template = 'plotly_dark',
  title={
        'text': "Self-Esteem Evaluation",
        'y':0.9,
        'x':0.485,
        'xanchor': 'center',
        'yanchor': 'top'},

  showlegend=True
)
  
fig.show()

We can see in this graph that men are, on average, more confident than women in this wave.

##### 2- Confidence and number of confident people by gender:

We're going to define categories of self-esteem depending on the scores that participants give themselves:
- Non confident: if no score is above 8
- Confident: if at least one score is 9 or 10
- Overconfident: if 4 or 5 scores are 9 or 10
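
These rules can be sketched as a small stand-alone function. The name `confidence_category` is hypothetical; it returns 0 for non confident, 1 for confident and 2 for overconfident, given the five self-ratings of one participant.

```python
def confidence_category(self_scores):
    """Return 0 (non confident), 1 (confident) or 2 (overconfident)."""
    # number of self-ratings equal to 9 or 10
    high = sum(1 for s in self_scores if s > 8)
    if high > 3:
        return 2
    if high > 0:
        return 1
    return 0

# Example self-ratings (attractive, sincere, intelligent, fun, ambitious)
no_high_score = confidence_category([6, 8, 8, 8, 7])
three_high_scores = confidence_category([7, 7, 10, 10, 10])
four_high_scores = confidence_category([9, 9, 10, 9, 8])
print(no_high_score, three_high_scores, four_high_scores)
```

Note that, with these thresholds, a participant with three scores of 9 or 10 is still only "confident"; "overconfident" requires at least four.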

In [None]:
# count the self-ratings above 8, then map that count to a category (0, 1 or 2)
high_scores = (data_src_iid[iid_attributes3_1_vars]>8).sum(axis=1)
# assign returns a new DataFrame, which avoids the SettingWithCopyWarning
data_src_iid = data_src_iid.assign(confident=high_scores.apply(lambda x : 2 if x>3 else 1 if x>0 else 0))


In [371]:
data_src_iid['confident']
data_src_iid.loc[data_src_iid['gender']==1]['confident'].value_counts()

1    4
2    3
0    3
Name: confident, dtype: int64

Let's define a function that counts the overconfident, confident and non confident participants in a DataFrame containing a 'confident' column with 3 categories.

In [372]:
def confident_count(dataset,gender, gender_column_name='gender',confident_column_name='confident'):
    # count the participants of the given gender in each confidence category
    counts = dataset.loc[dataset[gender_column_name]==gender, confident_column_name].value_counts()
    # reindex so that a missing category counts as 0, in the order: overconfident, confident, non confident
    overconfident, confident, non_confident = counts.reindex([2,1,0], fill_value=0).tolist()
    # returns [confident (overconfident included), non confident], overconfident
    return [overconfident + confident, non_confident], overconfident


In [373]:
#What women think of themselves
confident_list_W,overconfident_W = confident_count(data_src_iid,0)

#What men think of themselves
confident_list_M,overconfident_M = confident_count(data_src_iid,1)

In [374]:
# plot bar plot for comparison 
fig = make_subplots(
    rows=1, cols=2,
    shared_yaxes=True,
    subplot_titles=("Confident vs non confident nb", "Overconfident nb")
    )
  

fig.add_trace(go.Bar(
      x=['Women','Men'],
      y=[confident_list_W[1],confident_list_M[1]],
      marker_color=cols[1],
      name='Non Confident',
      texttemplate = "Non Confident : <br> <b>%{value}</b> ",
      textposition = "inside"
),
row=1,col=1)


fig.add_trace(go.Bar(
      x=['Women','Men'],
      y=[confident_list_W[0],confident_list_M[0]],
      marker_color=cols[0],
      name='Confident',
      texttemplate = "Confident : <br> <b>%{value}</b> ",
      textposition = "inside"
),
row=1,col=1)



fig.add_trace(go.Bar(
      x=['Women','Men'],
      y=[overconfident_W,overconfident_M],
      marker_color=cols[2],
      name='Overconfident',
      texttemplate = "OverConfident : <br> <b>%{value}</b> ",
      textposition = "outside"
),
row=1,col=2)
  
fig.update_layout(
  template = 'plotly_dark',
  title={
        'text': "Self-Esteem Evaluation",
        'y':0.9,
        'x':0.485,
        'xanchor': 'center',
        'yanchor': 'top'},

  showlegend=True
)
fig.update_traces(textposition='inside')
fig.show()

When we look at the numbers, we see that confidence does not depend on gender, but there is a tendency towards overconfidence among men, which confirms our previous conclusion.

!! Let's integrate this new column 'confident' into our cleaned database !!

In [367]:
data_wave1['confident'] = (data_wave1_src[iid_attributes3_1_vars]>8).sum(axis=1)
data_wave1['confident'] = data_wave1['confident'].apply(lambda x : 2 if x>3 else 1 if x>0 else 0  )

In [375]:
# let's verify that everything is fine
data_wave1_iids= data_wave1.drop_duplicates(subset=['iid'])
data_wave1_iids['confident']==data_src_iid['confident']

0      True
10     True
20     True
30     True
40     True
50     True
60     True
70     True
80     True
90     True
100    True
110    True
120    True
130    True
140    True
150    True
160    True
170    True
180    True
190    True
Name: confident, dtype: bool

!! ALL GOOD THEN !!

### Reality

Now we look at the reality of things, that is, what people really think when they meet each other.

To do that, we need to answer 3 different questions:
1. If a person matches with you, do they give you a better rating?
2. Is our self-esteem criterion correct?
3. Does confidence play a big role in how others perceive you?

##### 1- Impact of matching

Before starting our analysis, let's begin by grouping all our last features and our updated database 

In [376]:
# Each iid will only have these pieces of information:

iid_infos_vars=['age','race']
iid_study_vars=['field_cd']
iid_career_vars =['career_c']
iid_entertain_vars =['go_out','sports','tvsports','exercise','dining','museums','art','hiking','gaming','clubbing','reading','tv','theater','movies','concerts','music','shopping','yoga']
iid_dating_vars = ['imprace','imprelig','goal','date','expnum']
iid_attributes1_1_vars = ['attr1_1','sinc1_1','intel1_1','fun1_1','amb1_1','shar1_1'] # what the participant looks for in the opposite sex
iid_attributes2_1_vars = ['attr2_1','sinc2_1','intel2_1','fun2_1','amb2_1','shar2_1'] # what the participant thinks the opposite sex looks for
iid_attributes3_1_vars = ['attr3_1','sinc3_1','intel3_1','fun3_1','amb3_1'] # how the participant rates himself/herself
iid_attributes4_1_vars = ['attr4_1','sinc4_1','intel4_1','fun4_1','amb4_1','shar4_1'] # what the participant thinks most people of the same sex look for in the opposite sex
iid_attributes5_1_vars = ['attr5_1','sinc5_1','intel5_1','fun5_1','amb5_1'] # how the participant thinks others perceive him/her

In [378]:
# We will still focus only on the first wave for now
database_used = data_wave1

In [433]:
database_used[iid_attributes3_1_vars]=data_wave1_src[iid_attributes3_1_vars]

In [434]:
database_used[iid_attributes3_1_vars]

Unnamed: 0,attr3_1,sinc3_1,intel3_1,fun3_1,amb3_1
0,6.0,8.0,8.0,8.0,7.0
1,6.0,8.0,8.0,8.0,7.0
2,6.0,8.0,8.0,8.0,7.0
3,6.0,8.0,8.0,8.0,7.0
4,6.0,8.0,8.0,8.0,7.0
...,...,...,...,...,...
195,7.0,7.0,10.0,10.0,10.0
196,7.0,7.0,10.0,10.0,10.0
197,7.0,7.0,10.0,10.0,10.0
198,7.0,7.0,10.0,10.0,10.0


Let's define matching features:

In [402]:
partner_iid = ['pid']
match=['match']
order_infos=['round','position','positin1','order']
criterias=['int_corr','samerace']

In [403]:
# rows where participant 1 is involved (as iid or as pid) and the decision is "yes"
mask = (database_used['dec']==1) & ((database_used['iid']==1) | (database_used['pid']==1))
database_used.loc[mask, ['iid','pid','match','dec_o','dec']]

Unnamed: 0,iid,pid,match,dec_o,dec
0,1,11.0,0,0,1
1,1,12.0,0,0,1
2,1,13.0,1,1,1
3,1,14.0,1,1,1
4,1,15.0,1,1,1
6,1,17.0,0,0,1
8,1,19.0,1,1,1
9,1,20.0,0,0,1
120,13,1.0,1,1,1
130,14,1.0,1,1,1


In [407]:
given_decision=['dec']
given_attributes_var=['attr','sinc','intel','fun','amb','shar']
gaved_attributes_var=['attr_o','sinc_o','intel_o','fun_o','amb_o','shar_o']
given_match_likeability = ['like','prob']
iid_given_match_prob=['match_es']

Let's calculate the average of the scores received by each participant, depending on the partner's decision.
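
As a minimal sketch of this aggregation, the toy example below (hypothetical ratings, with only two of the score columns) groups by participant and partner decision and averages the scores. Selecting the score columns before calling `.mean()` keeps the aggregation away from non-numeric columns such as 'field' or 'career'.

```python
import pandas as pd

# Hypothetical ratings received by two participants (not real data)
ratings = pd.DataFrame({
    "iid":    [1, 1, 1, 2, 2],
    "dec_o":  [0, 0, 1, 0, 1],
    "attr_o": [5.0, 6.0, 8.0, 6.0, 9.0],
    "fun_o":  [6.0, 7.0, 9.0, 5.0, 8.0],
})

# select the score columns first, then average per participant and partner decision
avg_scores = ratings.groupby(["iid", "dec_o"])[["attr_o", "fun_o"]].mean().reset_index()
print(avg_scores)
```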

In [443]:
# select the score columns before averaging, so that non-numeric columns are not aggregated
toto=database_used.groupby(['iid','dec_o'])[gaved_attributes_var].mean().reset_index()
toto.loc[toto['dec_o']==0]

Unnamed: 0,iid,dec_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o
0,1,0,5.6,7.0,7.2,6.4,7.4,6.0
2,2,0,6.5,5.25,6.75,6.25,6.75,4.75
4,3,0,5.6,5.8,7.0,4.8,6.6,4.6
6,4,0,6.25,6.75,8.5,6.5,6.75,5.5
8,5,0,4.571429,7.142857,7.285714,6.428571,7.0,5.142857
10,6,0,6.2,8.0,8.8,6.4,7.6,5.2
12,7,0,7.0,6.666667,7.0,6.0,6.333333,5.666667
14,8,0,7.0,5.5,6.5,5.5,6.0,4.5
16,9,0,6.0,7.0,6.666667,7.0,6.333333,5.333333
18,10,0,5.4,6.0,5.166667,5.166667,5.0,4.666667


In [440]:
# let's define a new feature: the age difference
database_used['age_diff']=database_used['age_o']-database_used['age']

In [442]:
given_prc_attributes_var=['attr_p','sinc_p','intel_p','fun_p','amb_p','shar_p']
database_used[given_prc_attributes_var]=unify_attributes(database_used,given_attributes_var)

In [449]:
attributes_comp = []

for i,j in zip(given_prc_attributes_var,iid_attributes1_1_vars):
    attributes_comp+=[i,j]


In [450]:
database_used.groupby(['iid','dec'])[attributes_comp].mean()

iid,dec,attr_p,attr1_1,sinc_p,sinc1_1,intel_p,intel1_1,fun_p,fun1_1,amb_p,amb1_1,shar_p,shar1_1
1,0,11.287758,15.0,25.397456,20.0,19.753577,20.0,13.990461,15.0,15.580286,15.0,13.990461,15.0
1,1,14.990291,15.0,16.846866,20.0,18.028999,20.0,17.57826,15.0,15.808144,15.0,16.74744,15.0
2,0,16.520081,45.0,19.24135,5.0,20.331733,25.0,14.318019,20.0,17.588797,0.0,12.000021,5.0
2,1,16.158573,45.0,16.129824,5.0,19.0735,25.0,16.805028,20.0,14.68265,0.0,17.150423,5.0
3,0,16.002291,35.0,16.936376,10.0,18.586737,35.0,15.153671,10.0,17.398058,10.0,15.922866,0.0
4,0,12.514034,20.0,20.307134,20.0,19.217878,20.0,15.366515,20.0,18.205973,10.0,14.388465,10.0
4,1,15.928882,20.0,17.198839,20.0,17.186744,20.0,19.109821,20.0,14.005806,10.0,16.569908,10.0
5,0,17.100672,20.0,20.476905,5.0,21.828257,25.0,14.207064,25.0,12.180037,10.0,14.207064,15.0
5,1,16.888352,20.0,14.141766,5.0,17.630689,25.0,16.882672,25.0,16.657983,10.0,17.798538,15.0
6,0,14.369079,10.0,24.353168,25.0,28.722617,20.0,12.686266,25.0,13.010156,5.0,6.858713,15.0
