# 2950 Project Phase 2

Flavia Jiang (yj472), Rachel Wang (jw879)

In [822]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from sklearn.linear_model import LinearRegression, LogisticRegression
import duckdb

## Research Question

## Data Description

1. What are the observations (rows) and the attributes (columns)?


2. Why was this dataset created?
This dataset was created to investigate human dating behavior in the context of speed dating. 
The researchers wanted to understand how individuals (males and females) make dating decisions, which attributes they 
consider important, and how different factors influence the outcomes of speed dating encounters.

3. Who funded the creation of the dataset?

The dataset was collected as part of academic research. 

4. What processes might have influenced what data was observed and recorded and what was not?

Participant Demographics: The age, gender, and other demographic characteristics of the participants could influence the data collected. 
In this study, all subjects came from the graduate and professional schools of Columbia University.

Self-Selection of Participants: Participants in the speed dating experiment were volunteers, which means they self-selected to take part. 
This self-selection process may have introduced biases, as those who chose to participate might have different preferences or characteristics 
compared to the general population. This could impact the generalizability of the findings.

Experiment Design: The design of the speed dating experiment determined what data could be collected. The researchers structured the experiment, 
including the number of participants, the number of potential partners, and the available information about each partner. The experimental 
conditions may not fully represent real-world dating situations.

Survey Responses: The data collected was based on surveys and questionnaires filled out by participants. Data collection relied on participants' 
willingness to respond honestly and accurately, which could be influenced by social desirability bias or other factors.

5. What preprocessing was done, and how did the data come to be in the form that you are using?


6. If people are involved, were they aware of the data collection, and if so, what purpose did they expect the data to be used for?

Participants in the speed dating events would have been aware of the data collection process, as informed consent is a standard practice 
in research involving human subjects. They would have been informed about the purpose of the data collection, which is typically for academic 
research. Participants would have expected the data to be used to study dating behavior and potentially contribute to our understanding of human 
interactions and preferences.

7. Where can your raw source data be found, if applicable? Provide a link to the raw data

Link to the dataset and the documentation: http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating/ 


## Data Cleaning

First, we load the data. The data set is both wide (195 columns) and long (8,378 rows).

In [823]:
dating_df = pd.read_csv("speed_dating_data.csv", encoding="ISO-8859-1")
dating_df.shape

(8378, 195)

### Select necessary columns
The data set is very wide, with 195 variables. We went through the codebook written by the creators of the data set and selected the variables we currently think are necessary for our analysis. That still leaves 41 columns, and we will certainly not use all of them in the logistic regression model.

In [824]:
select_list = ['iid', 'gender', 'wave', 'round', 'pid', 'samerace', 'age_o', 'age', 'field_cd', 
               'race','career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 
               'hiking', 'gaming', 'clubbing', 'reading', 'tv','theater', 'movies', 'concerts', 
               'music', 'shopping', 'yoga', 'attr3_1','sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 
               'dec', 'attr', 'sinc','intel', 'fun', 'amb', 'shar', 'prob']

dating_df = dating_df[select_list]
dating_df.shape

(8378, 41)

### Rename columns

Although we did not scrape the web or merge data sets to build this data frame, we put considerable effort into interpreting the meaning of each variable by carefully reading through the 15-page codebook. Some of the original variable names were vague, so we renamed them to convey their meaning more directly.

In [825]:
dating_df = dating_df.rename(columns = {"iid":"id",
                                        "age_o":"partner_age",
                                        "round": "num_dates", 
                                        "pid": "partner_id", 
                                        "samerace": "same_race", 
                                        "dec": "decision"})
dating_df.head()

Unnamed: 0,id,gender,wave,num_dates,partner_id,same_race,partner_age,age,field_cd,race,...,intel3_1,amb3_1,decision,attr,sinc,intel,fun,amb,shar,prob
0,1,0,1,10,11.0,0,27.0,21.0,1.0,4.0,...,8.0,7.0,1,6.0,9.0,7.0,7.0,6.0,5.0,6.0
1,1,0,1,10,12.0,0,22.0,21.0,1.0,4.0,...,8.0,7.0,1,7.0,8.0,7.0,8.0,5.0,6.0,5.0
2,1,0,1,10,13.0,1,22.0,21.0,1.0,4.0,...,8.0,7.0,1,5.0,8.0,9.0,8.0,5.0,7.0,
3,1,0,1,10,14.0,0,23.0,21.0,1.0,4.0,...,8.0,7.0,1,7.0,6.0,8.0,7.0,6.0,8.0,6.0
4,1,0,1,10,15.0,0,24.0,21.0,1.0,4.0,...,8.0,7.0,1,5.0,6.0,7.0,7.0,6.0,6.0,6.0


### Remove biased data points
As described in the Data Description section, the researchers ran 21 speed dating sessions, or waves, in total. However, as they explained in their paper, they removed four sessions (waves 18-21) from the analysis "because they involved an experimental intervention where participants were asked to bring their favorite book. These four sessions were run specifically to study how decision weights and selectivity would be affected by an intervention designed to shift subjects’ attention away from superficial physical attributes. The inclusion of these four sessions does not alter the results reported below; they are omitted so that the only experimental difference across sessions is group size." Accordingly, we also removed the data for these four sessions.

The researchers also said they removed another wave (#12) because they "imposed a maximum number of acceptances" on participants of this wave. We thought this restriction would affect participants' decisions, so we also removed this wave.

In [826]:
dating_df = dating_df[~dating_df['wave'].isin([12, 18, 19, 20, 21])]
dating_df.shape

(6412, 41)

### Deal with missing values
We noticed there were many missing values due to how the experiment was designed and conducted. For each variable with more than 200 missing values, we reconsidered whether it would still be a good potential predictor, given that including it would shrink the usable sample and make the model less robust. In the end, we decided to remove "shar," the dater's rating of shared interests/hobbies with the datee.

In [827]:
for col in dating_df:
    n = sum(pd.isna(dating_df[col]))
    if (n > 0):
        print([col, n])

['partner_id', 10]
['partner_age', 82]
['age', 73]
['field_cd', 82]
['race', 63]
['career_c', 138]
['sports', 79]
['tvsports', 79]
['exercise', 79]
['dining', 79]
['museums', 79]
['art', 79]
['hiking', 79]
['gaming', 79]
['clubbing', 79]
['reading', 79]
['tv', 79]
['theater', 79]
['movies', 79]
['concerts', 79]
['music', 79]
['shopping', 79]
['yoga', 79]
['attr3_1', 105]
['sinc3_1', 105]
['fun3_1', 105]
['intel3_1', 105]
['amb3_1', 105]
['attr', 130]
['sinc', 196]
['intel', 208]
['fun', 260]
['amb', 553]
['shar', 874]
['prob', 206]
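
The per-column count above can also be written with vectorized pandas calls. The following is a minimal sketch on a toy frame, not the real data: `toy_df` and its values are made up for illustration, and "fun" stands in for a column with many gaps, such as "shar".

```python
import numpy as np
import pandas as pd

# Toy stand-in for dating_df (hypothetical values, not the real data).
toy_df = pd.DataFrame({
    "age": [21.0, np.nan, 23.0, 24.0],
    "fun": [7.0, 8.0, np.nan, np.nan],
    "decision": [1, 0, 1, 0],
})

# Count missing values per column, keep only columns with at least one,
# and list the columns with the most gaps first.
missing_counts = toy_df.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)

# Rows that would survive a blanket dropna(), with and without "fun".
rows_all = len(toy_df.dropna())
rows_without_fun = len(toy_df.drop(columns=["fun"]).dropna())
print(rows_all, rows_without_fun)
```

Comparing the two row counts makes the cost of keeping a sparse column explicit before deciding whether to drop it.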


Next, we drop "shar" and then drop every row that still has a missing value, to see how many rows are left.

In [828]:
dating_df = dating_df.drop(["shar"], axis = 1)
dating_df = dating_df.dropna()
dating_df.shape

(5493, 40)

From a statistical standpoint, the remaining 5,493 data points should be enough for a robust logistic regression.

### Map coded categorical variables to their corresponding values

We also noticed that some categorical variables (e.g., field, race, career) were coded as integers, presumably for data storage and system performance reasons. For our purposes, it is better to present these variables as their actual labels rather than as integer codes, so that we can visualize and analyze them more easily. So we did the following conversion.

In [829]:
dating_df["field"] = dating_df["field_cd"].map({1:"Law", 2:"Math", 3:"Social Science, Psychologist", 
                                                4:"Medical Science, Pharmaceuticals, and Bio Tech", 
                                                5:"Engineering", 6:"English/Creative Writing/Journalism", 
                                                7:"History/Religion/Philosophy", 8:"Business/Econ/Finance", 
                                                9:"Education, Academia", 10:"Biological Sciences/Chemistry/Physics", 
                                                11:"Social Work", 12:"Undergrad/undecided", 
                                                13:"Political Science/International Affairs", 14:"Film", 
                                                15:"Fine Arts/Arts Administration", 16:"Languages", 
                                                17:"Architecture", 18:"Other"})
dating_df = dating_df.drop(["field_cd"], axis=1)

dating_df["race"] = dating_df["race"].map({1: "Black/African American", 2:"European/Caucasian-American", 
                                           3:"Latino/Hispanic American", 4:"Asian/Pacific Islander/Asian-American", 
                                           5:"Native American", 6:"Other"})

dating_df["career"] = dating_df["career_c"].map({1:"Lawyer", 2:"Academic/Research", 3:"Psychologist",4:"Doctor/Medicine",
                                                  5:"Engineer", 6:"Creative Arts/Entertainment", 
                                                  7:"Banking/Consulting/Finance/Marketing/Business/CEO/Entrepreneur/Admin", 
                                                  8:"Real Estate", 9:"International/Humanitarian Affairs", 10:"Undecided", 
                                                  11:"Social Work", 12:"Speech Pathology", 13:"Politics", 14:"Pro sports/Athletics", 
                                                  15:"Other", 16:"Journalism", 17:"Architecture"})
dating_df = dating_df.drop(["career_c"], axis=1)
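
One thing worth checking after a conversion like this is whether any code is missing from the dictionary, because `Series.map` silently returns NaN for unmapped keys. Below is a minimal sketch of such a check. `race_labels` is deliberately a shortened, hypothetical subset of the codebook and `toy_codes` is made-up data.

```python
import pandas as pd

# Hypothetical, shortened subset of the codebook's race codes (illustration only).
race_labels = {1: "Black/African American", 2: "European/Caucasian-American"}

# Made-up codes; 6 is intentionally absent from the dictionary above.
toy_codes = pd.Series([1, 2, 6], name="race")

# Series.map returns NaN for any code that is missing from the dictionary.
mapped = toy_codes.map(race_labels)

# Collect the codes that failed to map so the dictionary can be corrected.
unmapped_codes = sorted(toy_codes[mapped.isna()].unique())
print(unmapped_codes)
```

If the list is empty for the real columns, the mapping covered every code that appears in the data.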

### Convert data types

In [830]:
# R: correct datatype (e.g. partner_id ...)

### Remove inaccurate data points

In [831]:
# R: verify 100 points ('attr1_1','sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr', 'sinc','intel', 'fun', 'amb', 'shar')

### Preprocess regressors

In [832]:
# F: make new variables: age difference, interest difference, and others

# age_diff
# Note: plain assignment creates an alias, not a copy, so age_diff is also added to dating_df
similarity_df = dating_df
similarity_df["age_diff"] = similarity_df["partner_age"] - similarity_df["age"]
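
Because `similarity_df = dating_df` binds a second name to the same object, the new column also shows up in `dating_df`. The sketch below contrasts that with `.copy()` on toy frames. All names and values here are hypothetical, and whether the notebook should use a copy depends on whether `dating_df` is meant to stay unchanged.

```python
import pandas as pd

toy_df = pd.DataFrame({"age": [21.0, 24.0], "partner_age": [27.0, 22.0]})

# Plain assignment binds a second name to the same DataFrame object.
alias_df = toy_df
alias_df["age_diff"] = alias_df["partner_age"] - alias_df["age"]
print("age_diff" in toy_df.columns)

# .copy() creates an independent frame, so the original stays unchanged.
toy_df2 = pd.DataFrame({"age": [21.0, 24.0], "partner_age": [27.0, 22.0]})
copy_df = toy_df2.copy()
copy_df["age_diff"] = copy_df["partner_age"] - copy_df["age"]
print("age_diff" in toy_df2.columns)
```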

In [833]:
# interest_diff
interest_df = similarity_df[['id', 'partner_id', 'sports', 'tvsports', 'exercise', 'dining',
       'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv',
       'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga']]
interest_df_partner = interest_df.drop(["partner_id"], axis = 1).groupby("id").mean()
interest_df_partner = interest_df_partner.reset_index()
interest_merged = duckdb.sql("SELECT * FROM interest_df a LEFT JOIN interest_df_partner b ON a.partner_id = b.id").df()
print(interest_merged.head())

interest_merged["interest_diff"] = 0
for self_interest in ['sports', 'tvsports', 'exercise', 'dining',
       'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv',
       'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga']:
    partner_interest = self_interest + "_2"
    interest_merged["interest_diff"] += interest_merged[partner_interest] - interest_merged[self_interest]

print(similarity_df[similarity_df["id"] == 416])
interest_merged
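# Caution: pd.concat(axis=1) aligns rows by index label. interest_merged carries a fresh
# 0..n-1 index from DuckDB, while similarity_df keeps its filtered index, so the two may
# not line up row for row; the dropna() below then discards any unmatched rows.
similarity_df = pd.concat([similarity_df, interest_merged[["interest_diff"]]], axis = 1)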
similarity_df = similarity_df.dropna()
print(similarity_df.shape)

   id  partner_id  sports  tvsports  exercise  dining  museums  art  hiking  \
0   4        11.0     1.0       1.0       6.0     7.0      6.0  7.0     7.0   
1   4        12.0     1.0       1.0       6.0     7.0      6.0  7.0     7.0   
2   4        13.0     1.0       1.0       6.0     7.0      6.0  7.0     7.0   
3   4        17.0     1.0       1.0       6.0     7.0      6.0  7.0     7.0   
4   4        18.0     1.0       1.0       6.0     7.0      6.0  7.0     7.0   

   gaming  ...  gaming_2  clubbing_2  reading_2  tv_2  theater_2  movies_2  \
0     5.0  ...       5.0         4.0        9.0   2.0        4.0       8.0   
1     5.0  ...       3.0         5.0        6.0   6.0        4.0       7.0   
2     5.0  ...       7.0         7.0        6.0   8.0       10.0       8.0   
3     5.0  ...       2.0         6.0        4.0   2.0        7.0       9.0   
4     5.0  ...       4.0         2.0        6.0   9.0        3.0       9.0   

   concerts_2  music_2  shopping_2  yoga_2  
0         7
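
As a point of comparison, the same kind of partner lookup can be sketched with a pandas left merge, which keeps the left frame's row order and so avoids relying on index labels when attaching the result. This is only a sketch under stated assumptions: `toy_df`, the two-column `interest_cols` list, and the `_partner` suffix are hypothetical. It also sums absolute differences, which is an assumption about the intent, whereas the cell above sums signed differences.

```python
import pandas as pd

interest_cols = ["sports", "yoga"]  # shortened, hypothetical list

# Toy stand-in with a non-contiguous index, as left behind by row filtering.
toy_df = pd.DataFrame(
    {
        "id": [1, 1, 2, 3],
        "partner_id": [2, 3, 1, 1],
        "sports": [9.0, 9.0, 4.0, 7.0],
        "yoga": [1.0, 1.0, 5.0, 2.0],
    },
    index=[10, 11, 25, 40],
)

# One row of interest ratings per participant, keyed for the partner lookup.
profiles = toy_df.groupby("id")[interest_cols].mean().reset_index()
profiles = profiles.rename(columns={"id": "partner_id"})

# A left merge keeps the rows of the left frame in their original order.
merged = toy_df.merge(profiles, on="partner_id", how="left",
                      suffixes=("", "_partner"))

# Sum of absolute gaps, so opposite-signed differences cannot cancel out.
gaps = sum((merged[c + "_partner"] - merged[c]).abs() for c in interest_cols)

# Assign by position to sidestep index-label alignment.
toy_df["interest_diff"] = gaps.to_numpy()
print(toy_df[["id", "partner_id", "interest_diff"]])
```

Assigning the result by position works here only because the left merge returns exactly one output row per input row, which holds as long as each participant has a single profile row.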

In [834]:
# char_diff
interest_df = similarity_df[['id', 'partner_id', 'attr3_1','sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1']]
interest_df_partner = interest_df.drop(["partner_id"], axis = 1).groupby("id").mean()
interest_df_partner = interest_df_partner.reset_index()
interest_merged = duckdb.sql("SELECT * FROM interest_df a LEFT JOIN interest_df_partner b ON a.partner_id = b.id").df()
print(interest_merged.head())

interest_merged["char_diff"] = 0
for self_interest in ['attr3_1','sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1']:
    partner_interest = self_interest + "_2"
    interest_merged["char_diff"] += interest_merged[partner_interest] - interest_merged[self_interest]

interest_merged
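# Caution: as with interest_diff, pd.concat(axis=1) aligns by index label, and the
# DuckDB result has a fresh 0..n-1 index, so rows may not line up with similarity_df.
similarity_df = pd.concat([similarity_df, interest_merged[["char_diff"]]], axis = 1)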
similarity_df = similarity_df.dropna()
print(interest_merged)
print(similarity_df.shape)

    id  partner_id  attr3_1  sinc3_1  fun3_1  intel3_1  amb3_1  id_2  \
0  4.0        11.0      7.0      8.0     9.0       7.0     8.0  11.0   
1  4.0        12.0      7.0      8.0     9.0       7.0     8.0  12.0   
2  4.0        13.0      7.0      8.0     9.0       7.0     8.0  13.0   
3  4.0        17.0      7.0      8.0     9.0       7.0     8.0  17.0   
4  4.0        18.0      7.0      8.0     9.0       7.0     8.0  18.0   

   attr3_1_2  sinc3_1_2  fun3_1_2  intel3_1_2  amb3_1_2  
0        8.0        9.0       7.0         8.0       5.0  
1        9.0        9.0       9.0        10.0       9.0  
2        4.0        7.0       8.0         8.0       3.0  
3        7.0        7.0       6.0         8.0       4.0  
4        6.0        8.0       6.0         8.0       9.0  
         id  partner_id  attr3_1  sinc3_1  fun3_1  intel3_1  amb3_1   id_2  \
0       4.0        11.0      7.0      8.0     9.0       7.0     8.0   11.0   
1       4.0        12.0      7.0      8.0     9.0       7.0    

## Data Limitations


## Descriptive Analysis

In [835]:
# R: general: dimension
dating_df.shape

(5493, 41)

In [836]:
# R: one-variable: mean, sd, #observations
# for numerical variable: dating_df.describe()
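
A minimal sketch of what the one-variable summaries planned above might look like, using a toy frame with made-up values. The column choices are illustrative assumptions, not the final analysis.

```python
import pandas as pd

# Toy stand-in for dating_df (hypothetical values, not the real data).
toy_df = pd.DataFrame({
    "age": [21.0, 24.0, 27.0, 24.0],
    "race": ["Other", "Other", "Native American", "Other"],
})

# Count, mean, and standard deviation for the numeric columns.
numeric_summary = toy_df.describe().loc[["count", "mean", "std"]]
print(numeric_summary)

# Share of observations in each category of a categorical column.
race_shares = toy_df["race"].value_counts(normalize=True)
print(race_shares)
```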

In [837]:
# R: one-variable: plot distribution
# histogram for numeric variable (reference: kaggle example)
# pie chart for categorical variable (reference: discussion example)

In [838]:
# F: two variable
# change over Time1, during event, Time 2, Time 3
# correlation