# Date-a Science
_Aindra Thin, Iain Bromley, Shiyue (Sybil) Wang, Yash Vig_

The purpose of our research project is to explore the attributes and characteristics that shape people’s decisions when selecting romantic partners. This study fits into the broader body of psychological research that tests preconceived notions about interpersonal relationships against their real-world outcomes.

We aim to answer the following questions:
- How much of a factor is race in the selection of potential partners?
- How important are shared interests for relationship compatibility?
- Which factors affect the likelihood of going on a second date?
- How does partner selection differ between men and women?
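
As a rough illustration of how the first question could be approached, the sketch below compares match rates for same-race and different-race pairings. It relies on the `samerace` and `match` columns that the notebook selects later; the toy values are invented for illustration and are not results from the dataset.

```python
import pandas as pd

# Toy rows, invented for illustration only (not taken from data/data.csv).
toy = pd.DataFrame({
    "samerace": [1, 1, 1, 0, 0, 0, 0, 0],
    "match":    [1, 0, 1, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of dates that ended in a match.
match_rate = toy.groupby("samerace")["match"].mean()
print(match_rate)
```

A gap between the two rates would only be suggestive; a real analysis would also need to account for how many dates fall into each group.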

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math

In [None]:
all_data = pd.read_csv('data/data.csv')

## Data Prep
> The dataset we acquired contains 195 columns. Many of them do not help answer our research questions, so we keep only the relevant ones.

In [None]:
# Keep only the identifiers, ratings, decisions, and stated preferences used below.
data = all_data[['iid', 'pid', 'gender', 'order', 'age', 'age_o', 'imprace', 'samerace', 'attr', 'sinc', 'intel',
                 'fun', 'amb', 'shar', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'int_corr',
                 'dec', 'dec_o', 'match', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1',
                 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1']].copy()
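
After narrowing the frame, it may be worth checking how much of each kept column is missing before computing any averages. The sketch below does this on a toy frame; the column values are invented, and whether the real dataset has gaps in these columns is an assumption to verify.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration; "unused" stands in for a dropped column.
toy = pd.DataFrame({
    "iid": [1, 1, 2, 2],
    "attr": [6.0, np.nan, 8.0, 7.0],
    "shar": [np.nan, np.nan, 5.0, 4.0],
    "unused": [0, 0, 0, 0],
})

# Select the columns of interest, then compute the share of missing values in each.
kept = toy[["iid", "attr", "shar"]].copy()
missing_share = kept.isna().mean().sort_values(ascending=False)
print(missing_share)
```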
                

## Exploratory Data Analysis
> The first section of your report will provide a detailed overview of the dataset. Using both written and visual approaches, this section will introduce the data to the reader in the context of the research questions. Be sure to provide in-depth analyses of the distributions of key variables of interest. More than anything, this section should convey a nuanced understanding of the dataset being used.

In [None]:
# Pair each attribute rating with the yes/no decision and relabel the columns for plotting.
data_select = data[['attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'dec']].copy()
data_select.columns = ["Attractiveness", "Sincerity", "Intelligence", "Funny", "Ambition", "Shared Interests", "Decision"]
# Reshape to long form: one row per (attribute, score) pair, with the decision kept alongside.
shaped = pd.melt(data_select, id_vars=['Decision'], value_name='score')
sns.set(style="whitegrid")
ax = sns.boxplot(x="variable", y="score", hue="Decision",
                 data=shaped, palette="Set3")
ax.set_title('Decisions based on Attributes')
ax.set_ylabel('Rating')
ax.set_xlabel('Attributes')
plt.legend(loc='upper right')
plt.show()
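
The box plot above depends on reshaping the ratings from wide to long form with `pd.melt`. A minimal sketch of what that reshaping produces, using a two-row toy frame with invented values:

```python
import pandas as pd

# Toy wide-form frame: one column per attribute, plus the decision.
wide = pd.DataFrame({
    "Attractiveness": [7, 4],
    "Sincerity": [8, 6],
    "Decision": [1, 0],
})

# Each attribute column becomes rows of (variable, score); Decision is repeated.
long_form = pd.melt(wide, id_vars=["Decision"], value_name="score")
print(long_form)
```

The long form has one row per rating, which is the layout `sns.boxplot` needs in order to place attributes on the x-axis and split each box by decision.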


In [None]:
from sklearn import linear_model

def linreg(col):
    # Per rated partner (pid): mean rating received on `col` and share of "yes" decisions received.
    grouped = data.groupby('pid')[[col, 'dec']].mean().dropna()
    mean_attr, dec_attr = grouped[col], grouped['dec']
    # scikit-learn expects a 2-D feature array; a Series has no reshape, so convert first.
    X = mean_attr.to_numpy().reshape(-1, 1)
    lm = linear_model.LinearRegression().fit(X, dec_attr)
    pred = lm.predict(X)
    # sqrt(R^2) gives only |r| for a one-predictor fit, so take the sign from the slope.
    r = math.copysign(math.sqrt(max(lm.score(X, dec_attr), 0.0)), lm.coef_[0])
    return mean_attr, dec_attr, pred, r

feature_list = ['attr', 'sinc', 'intel', 'fun', 'amb', 'shar']
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(20, 10), sharex=True, sharey=True)
for index, col in enumerate(feature_list):
    mean_attr, dec_attr, pred, r = linreg(col)
    ax = axes.flat[index]
    sns.regplot(x=mean_attr, y=dec_attr, ax=ax)
    ax.annotate("r = {:.2f}".format(r), xy=(.1, 0.9), xycoords=ax.transAxes)
    ax.set_title('Decisions based on ' + col)
plt.show()
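
For a regression with a single predictor, R² equals the square of Pearson's r, so its square root recovers only the magnitude of the correlation, not its direction. The sketch below checks this on toy data with a negative trend; the numbers are invented, and it uses NumPy alone rather than scikit-learn.

```python
import numpy as np

# Toy data with a clearly negative trend (invented for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.9, 0.7, 0.6, 0.3, 0.2])

# Least-squares line and its coefficient of determination.
slope, intercept = np.polyfit(x, y, 1)
pred = slope * x + intercept
r_squared = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Pearson correlation computed directly.
r = np.corrcoef(x, y)[0, 1]

# Same magnitude, but only r carries the sign of the relationship.
print(np.sqrt(r_squared), r)
```

Attaching the sign of the fitted slope to the square root of R² reproduces r, which matters when an attribute is negatively related to the decision.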


In [None]:
by_person = data[["iid", "dec", "match"]].groupby("iid").sum()
# Success rate: share of a participant's "yes" decisions that ended in a match.
by_person["success_rate"] = by_person["match"] / by_person["dec"]
# A participant with no "yes" decisions yields 0/0 = NaN; record that as a rate of 0.
# (Assigning to `row` inside iterrows() would not write back to the frame.)
by_person.loc[by_person["dec"] == 0, "success_rate"] = 0
by_person

#plt.hist(by_person["success_rate"])
#plt.title("Success Rate")
#plt.xlabel("")