# Speed-Dating Dataset Preprocessing

### Attribute Information

 * gender: Gender of self  
 * age: Age of self  
 * age_o: Age of partner  
 * d_age: Difference in age  
 * race: Race of self  
 * race_o: Race of partner  
 * samerace: Whether the two persons have the same race or not.  
 * importance_same_race: How important is it that partner is of same race?  
 * importance_same_religion: How important is it that partner has same religion?  
 * field: Field of study  
 * pref_o_attractive: How important the partner rates attractiveness  
 * pref_o_sinsere: How important the partner rates sincerity  
 * pref_o_intelligence: How important the partner rates intelligence  
 * pref_o_funny: How important the partner rates being funny  
 * pref_o_ambitious: How important the partner rates ambition  
 * pref_o_shared_interests: How important the partner rates having shared interests  
 * attractive_o: Rating by partner (about me) at night of event on attractiveness  
 * sincere_o: Rating by partner (about me) at night of event on sincerity  
 * intelligence_o: Rating by partner (about me) at night of event on intelligence  
 * funny_o: Rating by partner (about me) at night of event on being funny  
 * ambitous_o: Rating by partner (about me) at night of event on being ambitious  
 * shared_interests_o: Rating by partner (about me) at night of event on shared interest  
 * attractive_important: What do you look for in a partner - attractiveness  
 * sincere_important: What do you look for in a partner - sincerity  
 * intellicence_important: What do you look for in a partner - intelligence  
 * funny_important: What do you look for in a partner - being funny  
 * ambtition_important: What do you look for in a partner - ambition  
 * shared_interests_important: What do you look for in a partner - shared interests  
 * attractive: Rate yourself - attractiveness  
 * sincere: Rate yourself - sincerity   
 * intelligence: Rate yourself - intelligence   
 * funny: Rate yourself - being funny   
 * ambition: Rate yourself - ambition  
 * attractive_partner: Rate your partner - attractiveness  
 * sincere_partner: Rate your partner - sincerity   
 * intelligence_partner: Rate your partner - intelligence   
 * funny_partner: Rate your partner - being funny   
 * ambition_partner: Rate your partner - ambition   
 * shared_interests_partner: Rate your partner - shared interests  
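 * sports: Your own interest in sports [1-10]; the attributes from tvsports through yoga below are further interests of your own, listed by name only  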
 * tvsports  
 * exercise  
 * dining  
 * museums  
 * art  
 * hiking  
 * gaming  
 * clubbing  
 * reading  
 * tv  
 * theater  
 * movies  
 * concerts  
 * music  
 * shopping  
 * yoga  
 * interests_correlate: Correlation between participant’s and partner’s ratings of interests.  
 * expected_happy_with_sd_people: How happy do you expect to be with the people you meet during the speed-dating event?  
 * expected_num_interested_in_me: Out of the 20 people you will meet, how many do you expect will be interested in dating you?  
 * expected_num_matches: How many matches do you expect to get?  
 * like: Did you like your partner?  
 * guess_prob_liked: How likely do you think it is that your partner likes you?   
 * met: Have you met your partner before?  
 * decision: Decision at night of event.
 * decision_o: Decision of partner at night of event.  
 * match: Match (yes/no)
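
The list above describes `interests_correlate` as the correlation between the participant's and the partner's interest ratings, without saying which correlation is meant. As a rough sketch, assuming it is a Pearson correlation over the 17 interest attributes (`sports` through `yoga`), it could be computed as follows. The helper name `interests_correlation` and the rating vectors are hypothetical and made up for illustration.

```python
import numpy as np

# The 17 interest attributes from the list above.
INTERESTS = ['sports', 'tvsports', 'exercise', 'dining', 'museums', 'art',
             'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater',
             'movies', 'concerts', 'music', 'shopping', 'yoga']

def interests_correlation(own, partner):
    """Pearson correlation between two rating vectors (assumed definition)."""
    own = np.asarray(own, dtype=float)
    partner = np.asarray(partner, dtype=float)
    return float(np.corrcoef(own, partner)[0, 1])

# Made-up ratings on the 1-10 scale, one value per interest.
own_ratings = [9, 2, 8, 9, 1, 1, 5, 1, 5, 6, 9, 1, 10, 10, 9, 8, 1]
# A partner with exactly opposite tastes on the same scale.
opposite_ratings = [11 - r for r in own_ratings]

same = interests_correlation(own_ratings, own_ratings)
opposite = interests_correlation(own_ratings, opposite_ratings)
print(same, opposite)
```

Identical rating vectors give a correlation of about 1 and mirrored ones about -1, which matches the `[-1-0]`, `[0-0.33]` and `[0.33-1]` bins that appear later for `d_interests_correlate`.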

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('data/landing/raw_data.csv')
raw_df = df.copy() # for testing purposes

### Basic Exploration

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8378 entries, 0 to 8377
Columns: 123 entries, has_null to match
dtypes: float64(59), object(64)
memory usage: 7.9+ MB


In [2]:
df.head()

|  | has_null | wave | gender | age | age_o | d_age | d_d_age | race | race_o | samerace | ... | d_expected_num_interested_in_me | d_expected_num_matches | like | guess_prob_liked | d_like | d_guess_prob_liked | met | decision | decision_o | match |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | b'0' | 1.0 | b'female' | 21.0 | 27.0 | 6.0 | b'[4-6]' | b'Asian/Pacific Islander/Asian-American' | b'European/Caucasian-American' | b'0' | ... | b'[0-3]' | b'[3-5]' | 7.0 | 6.0 | b'[6-8]' | b'[5-6]' | 0.0 | b'1' | b'0' | b'0' |
| 1 | b'0' | 1.0 | b'female' | 21.0 | 22.0 | 1.0 | b'[0-1]' | b'Asian/Pacific Islander/Asian-American' | b'European/Caucasian-American' | b'0' | ... | b'[0-3]' | b'[3-5]' | 7.0 | 5.0 | b'[6-8]' | b'[5-6]' | 1.0 | b'1' | b'0' | b'0' |
| 2 | b'1' | 1.0 | b'female' | 21.0 | 22.0 | 1.0 | b'[0-1]' | b'Asian/Pacific Islander/Asian-American' | b'Asian/Pacific Islander/Asian-American' | b'1' | ... | b'[0-3]' | b'[3-5]' | 7.0 | NaN | b'[6-8]' | b'[0-4]' | 1.0 | b'1' | b'1' | b'1' |
| 3 | b'0' | 1.0 | b'female' | 21.0 | 23.0 | 2.0 | b'[2-3]' | b'Asian/Pacific Islander/Asian-American' | b'European/Caucasian-American' | b'0' | ... | b'[0-3]' | b'[3-5]' | 7.0 | 6.0 | b'[6-8]' | b'[5-6]' | 0.0 | b'1' | b'1' | b'1' |
| 4 | b'0' | 1.0 | b'female' | 21.0 | 24.0 | 3.0 | b'[2-3]' | b'Asian/Pacific Islander/Asian-American' | b'Latino/Hispanic American' | b'0' | ... | b'[0-3]' | b'[3-5]' | 6.0 | 6.0 | b'[6-8]' | b'[5-6]' | 0.0 | b'1' | b'1' | b'1' |


### Basic Clean-up

In [5]:
# strip the b'...' wrapper from string columns in the raw DF
for col in df.columns:
    first = df[col][0]
    if isinstance(first, str) and first.startswith("b'"):
        df[col] = df[col].str[2:-1]

# assign a rank to all ordinal (binned) variables in the raw DF
for col in df.columns:
    first = df[col][0]
    # 'd_interests_correlate' is handled separately: its negative and decimal
    # bounds cannot be parsed by splitting on '-'
    if isinstance(first, str) and first.startswith('[') and col != 'd_interests_correlate':
        # unique bin labels, sorted ascending by their integer lower bound
        bins = sorted(df[col].unique(), key=lambda b: int(b[1:-1].split('-')[0]))
        # replace each label with its 0-based rank in the sorted list
        # (whether ranks should start at 1 instead is still to be investigated)
        df[col] = df[col].map({b: rank for rank, b in enumerate(bins)})

# assign rank to 'd_interests_correlate'; the result is assigned back instead of
# relying on replace(..., inplace=True) on an attribute-accessed column
df['d_interests_correlate'] = df['d_interests_correlate'].replace(
    {'[-1-0]': -1, '[0-0.33]': 0, '[0.33-1]': 1})

# export clean-ish DF to landing
df.to_csv('data/landing/clean-ish_data.csv', index=False)
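
The clean-up cell treats `d_interests_correlate` as a special case because a label such as `'[-1-0]'` does not survive a plain split on `'-'`: `'-1-0'.split('-')` yields `['', '1', '0']`, and `int('')` raises a `ValueError`. As a minimal sketch of an alternative, a regular expression can parse negative and decimal bounds as well, so one helper could rank every binned column. The names `BIN_PATTERN`, `parse_bin` and `rank_bins` are hypothetical and not part of the notebook.

```python
import re
import pandas as pd

# Hypothetical pattern: '[low-high]' where each bound may be negative or decimal.
BIN_PATTERN = re.compile(r"^\[(-?\d+(?:\.\d+)?)-(-?\d+(?:\.\d+)?)\]$")

def parse_bin(label):
    """Turn a label such as '[4-6]' or '[-1-0]' into numeric (low, high) bounds."""
    match = BIN_PATTERN.match(label)
    if match is None:
        raise ValueError(f"unrecognised bin label: {label!r}")
    return float(match.group(1)), float(match.group(2))

def rank_bins(series):
    """Map each bin label to its 0-based rank, ordered by (low, high) bounds."""
    ordered = sorted(series.dropna().unique(), key=parse_bin)
    return series.map({label: rank for rank, label in enumerate(ordered)})

# Toy column using the three labels that occur in 'd_interests_correlate'.
toy = pd.Series(['[0.33-1]', '[-1-0]', '[0-0.33]', '[-1-0]'])
ranks = rank_bins(toy)
print(ranks.tolist())
```

This sketch numbers the bins 0, 1, 2, whereas the cell above assigns -1, 0, 1 to `d_interests_correlate`. Which coding is preferable depends on how the ranks are used downstream.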


In [4]:
df.head()

|  | has_null | wave | gender | age | age_o | d_age | d_d_age | race | race_o | samerace | ... | d_expected_num_interested_in_me | d_expected_num_matches | like | guess_prob_liked | d_like | d_guess_prob_liked | met | decision | decision_o | match |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1.0 | female | 21.0 | 27.0 | 6.0 | 2 | Asian/Pacific Islander/Asian-American | European/Caucasian-American | 0 | ... | 0 | 1 | 7.0 | 6.0 | 1 | 1 | 0.0 | 1 | 0 | 0 |
| 1 | 0 | 1.0 | female | 21.0 | 22.0 | 1.0 | 0 | Asian/Pacific Islander/Asian-American | European/Caucasian-American | 0 | ... | 0 | 1 | 7.0 | 5.0 | 1 | 1 | 1.0 | 1 | 0 | 0 |
| 2 | 1 | 1.0 | female | 21.0 | 22.0 | 1.0 | 0 | Asian/Pacific Islander/Asian-American | Asian/Pacific Islander/Asian-American | 1 | ... | 0 | 1 | 7.0 | NaN | 1 | 0 | 1.0 | 1 | 1 | 1 |
| 3 | 0 | 1.0 | female | 21.0 | 23.0 | 2.0 | 1 | Asian/Pacific Islander/Asian-American | European/Caucasian-American | 0 | ... | 0 | 1 | 7.0 | 6.0 | 1 | 1 | 0.0 | 1 | 1 | 1 |
| 4 | 0 | 1.0 | female | 21.0 | 24.0 | 3.0 | 1 | Asian/Pacific Islander/Asian-American | Latino/Hispanic American | 0 | ... | 0 | 1 | 6.0 | 6.0 | 1 | 1 | 0.0 | 1 | 1 | 1 |


In [5]:
df.describe()

|  | wave | age | age_o | d_age | d_d_age | importance_same_race | importance_same_religion | d_importance_same_race | d_importance_same_religion | pref_o_attractive | ... | expected_num_interested_in_me | expected_num_matches | d_expected_happy_with_sd_people | d_expected_num_interested_in_me | d_expected_num_matches | like | guess_prob_liked | d_like | d_guess_prob_liked | met |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 8378.0 | 8283.0 | 8274.0 | 8378.0 | 8378.0 | 8299.0 | 8299.0 | 8378.0 | 8378.0 | 8289.0 | ... | 1800.0 | 7205.0 | 8378.0 | 8378.0 | 8378.0 | 8138.0 | 8069.0 | 8378.0 | 8378.0 | 8003.0 |
| mean | 11.350919 | 26.358928 | 26.364999 | 4.185605 | 1.351755 | 3.784793 | 3.651645 | 0.940797 | 0.897708 | 22.495347 | ... | 5.570556 | 3.207814 | 1.037718 | 0.168298 | 0.576271 | 6.134087 | 5.207523 | 0.721055 | 0.912986 | 0.049856 |
| std | 5.995903 | 3.566763 | 3.563648 | 4.596171 | 1.049246 | 2.845708 | 2.805237 | 0.791249 | 0.793712 | 12.569802 | ... | 4.762569 | 2.444813 | 0.718958 | 0.479831 | 0.688742 | 1.841285 | 2.129565 | 0.588285 | 0.781453 | 0.282168 |
| min | 1.0 | 18.0 | 18.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 7.0 | 24.0 | 24.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 15.0 | ... | 2.0 | 2.0 | 1.0 | 0.0 | 0.0 | 5.0 | 4.0 | 0.0 | 0.0 | 0.0 |
| 50% | 11.0 | 26.0 | 26.0 | 3.0 | 1.0 | 3.0 | 3.0 | 1.0 | 1.0 | 20.0 | ... | 4.0 | 3.0 | 1.0 | 0.0 | 0.0 | 6.0 | 5.0 | 1.0 | 1.0 | 0.0 |
| 75% | 15.0 | 28.0 | 28.0 | 5.0 | 2.0 | 6.0 | 6.0 | 2.0 | 2.0 | 25.0 | ... | 8.0 | 4.0 | 2.0 | 0.0 | 1.0 | 7.0 | 7.0 | 1.0 | 2.0 | 0.0 |
| max | 21.0 | 55.0 | 55.0 | 37.0 | 3.0 | 10.0 | 10.0 | 2.0 | 2.0 | 100.0 | ... | 20.0 | 18.0 | 2.0 | 2.0 | 2.0 | 10.0 | 10.0 | 2.0 | 2.0 | 8.0 |
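
The `count` row above shows that the columns are not equally complete: `expected_num_interested_in_me` has only 1800 non-null values out of 8378 rows. One way to rank columns by missingness is sketched below on a toy frame. The values are made up, and the variable names `toy` and `missing_share` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame; the values are made up for illustration.
toy = pd.DataFrame({
    'like': [7.0, 6.0, np.nan, 5.0],
    'expected_num_interested_in_me': [np.nan, np.nan, np.nan, 4.0],
    'wave': [1.0, 1.0, 2.0, 2.0],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to `df`, the same expression would list the sparsest columns at the top, which may help decide which attributes to impute and which to drop.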


In [6]:
df.race.value_counts()

European/Caucasian-American              4727
Asian/Pacific Islander/Asian-American    1982
Latino/Hispanic American                  664
Other                                     522
Black/African American                    420
?                                          63
Name: race, dtype: int64

In [7]:
df.race_o.value_counts()

European/Caucasian-American              4722
Asian/Pacific Islander/Asian-American    1978
Latino/Hispanic American                  664
Other                                     521
Black/African American                    420
?                                          73
Name: race_o, dtype: int64
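
Both counts above contain a `?` category (63 rows for `race`, 73 for `race_o`). Assuming `?` is a placeholder for a missing value rather than a real category, which the source does not state, it could be converted to `NaN` so that pandas treats it as missing. A minimal sketch on a toy frame with made-up rows:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame; the rows are made up for illustration.
toy = pd.DataFrame({
    'race': ['Other', '?', 'Black/African American'],
    'race_o': ['?', 'Other', 'Other'],
    'age': [21.0, 24.0, 30.0],
})

# Assumption: '?' marks a missing value, so it is replaced with NaN everywhere.
cleaned = toy.replace('?', np.nan)
print(cleaned.isna().sum())
```

After the replacement, `value_counts()` no longer reports `?` as a category, and the affected rows show up in `isna()` instead.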

In [9]:
array = [[1,2,3],
         [4,5,6],
         [7,8,9]]
expected = [1,2,3,6,9,8,7,4,5]

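def snail(snail_map):
    # one possible completion: walk the square matrix clockwise, ring by ring
    up_bound, low_bound = (len(snail_map) - 1), 0
    n_elements = len(snail_map)**2
    snail_path = []
    while len(snail_path) < n_elements:
        # top row (left to right), then right column (top to bottom)
        snail_path += [snail_map[low_bound][c] for c in range(low_bound, up_bound + 1)]
        snail_path += [snail_map[r][up_bound] for r in range(low_bound + 1, up_bound + 1)]
        # bottom row (right to left), then left column (bottom to top)
        snail_path += [snail_map[up_bound][c] for c in range(up_bound - 1, low_bound - 1, -1)]
        snail_path += [snail_map[r][low_bound] for r in range(up_bound - 1, low_bound, -1)]
        # move both bounds one ring inwards
        low_bound += 1
        up_bound -= 1

    return snail_path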

#snail(array)
len(array)

3