## Speed Dating Data Set

In [1]:
# do the necessary imports
import numpy as np
import pandas as pd
from scipy.io.arff import loadarff

In [2]:
# load the dataset from the CSV file (the ARFF file was converted to CSV beforehand)
df = pd.read_csv('data/speed-dating/speeddating.csv')
print(df.dtypes)
df.shape

id                     int64
has_null               int64
wave                   int64
gender                object
age                   object
                       ...  
d_guess_prob_liked    object
met                   object
decision               int64
decision_o             int64
match                  int64
Length: 124, dtype: object


(8378, 124)

In [3]:
df[df.isin(["?"]).any(axis=1)].shape

(7330, 124)

## About Ratings

Columns that hold a rating, for example any column whose name contains ```importance``` or ```pref_o_```, also come with their own scales. These scales are inconsistent, so we need to figure out how to normalize them.

### Missing values

Some rows in these rating columns also have missing values, which can't simply be thrown out. Instead we have to look at the context: for example, the missing values in ```importance_same_race``` can be filled in with the median or mean of the ratings that people of the same race have given.
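
A minimal sketch of such a context-aware fill, on made-up data rather than the real dataset (the frame ```toy``` and its values are purely illustrative). It assumes the ratings are stored as strings, converts them to numbers, and fills each gap with the median of the same race group:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'race': ['A', 'A', 'A', 'B', 'B', 'B'],
    'importance_same_race': ['2', '4', np.nan, '8', np.nan, '10'],
})

# Convert the string ratings to numbers; NaN stays NaN
ratings = pd.to_numeric(toy['importance_same_race'])

# Median rating within each race group, aligned to the original rows
group_median = ratings.groupby(toy['race']).transform('median')

# Fill each missing rating with the median of its own group
toy['importance_same_race'] = ratings.fillna(group_median)
print(toy)
```

Swapping ```'median'``` for ```'mean'``` gives the mean-based variant mentioned above.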

In [4]:
df.replace('?', np.nan, inplace=True)

In [5]:
df.loc[df['race'].isna(), ['race', 'importance_same_race']]
# 63 rows have neither race nor importance_same_race, so we just drop them


Unnamed: 0,race,importance_same_race
828,,
829,,
830,,
831,,
832,,
...,...,...
5127,,
5128,,
5129,,
5130,,


In [6]:
df = df[df['race'].notna()]

We temporarily split df by race so that the NaN values of ```importance_same_race``` can be replaced by the mode of each group. We do the same for the group Other.

Update: It turns out that only European/Caucasian-American participants have empty values in this column, so we can simply fill them with the mode of that group.
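
A minimal sketch of that fill on made-up data (the frame ```toy``` and the helper names are illustrative, not part of the notebook). Converting with ```pd.to_numeric``` first keeps NaN intact, so the column can be filled before it is cast to int:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'race': ['European/Caucasian-American'] * 4 + ['Other'],
    'importance_same_race': ['1', '1', '3', np.nan, '5'],
})

# Convert the string ratings to numbers; NaN stays NaN
ratings = pd.to_numeric(toy['importance_same_race'])

# Mode among the group that has the missing values (mode() skips NaN by default)
is_european = toy['race'] == 'European/Caucasian-American'
group_mode = ratings[is_european].mode()[0]

# Fill the gaps, then cast to int once no NaN is left
toy['importance_same_race'] = ratings.fillna(group_mode).astype(int)
print(toy)
```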

In [7]:
df.loc[df['importance_same_race'].isna(), 'importance_same_race']

312    NaN
313    NaN
314    NaN
315    NaN
316    NaN
317    NaN
318    NaN
319    NaN
320    NaN
321    NaN
322    NaN
323    NaN
324    NaN
325    NaN
326    NaN
327    NaN
Name: importance_same_race, dtype: object

In [8]:
# Fill NaN with the placeholder 100 so the string column can be cast to int
df['importance_same_race'] = df['importance_same_race'].fillna(100).astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['importance_same_race'] = df['importance_same_race'].fillna(100).astype(int)
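
The message above is pandas' SettingWithCopyWarning. It most likely appears because ```df``` was produced by the boolean filter in cell In [6], so pandas cannot tell whether later assignments write to a copy. A minimal sketch of one common way to avoid it, on made-up data (the names ```toy``` and ```filtered``` are illustrative), is to take an explicit copy right after filtering:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'race': ['A', np.nan, 'B'],
    'importance_same_race': ['1', np.nan, '3'],
})

# .copy() makes the filtered frame an independent DataFrame,
# so later column assignments do not target a slice of the original
filtered = toy[toy['race'].notna()].copy()
filtered['importance_same_race'] = filtered['importance_same_race'].astype(int)
print(filtered)
```

The original frame is left untouched, and the assignment on ```filtered``` is unambiguous.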


In [9]:
# NaN was replaced with the placeholder 100 for the conversion to int; now replace 100 with the mode
race_mode = df.loc[df['race'] == 'European/Caucasian-American', 'importance_same_race'].mode()[0]
df['importance_same_race'] = df['importance_same_race'].replace(100, race_mode)


In [10]:
# We do the same thing for religion
# Note: only European/Caucasian-American participants have missing values in this column
df['importance_same_religion'] = df['importance_same_religion'].fillna(100).astype(int)
religion_mode = df.loc[df['race'] == 'European/Caucasian-American', 'importance_same_religion'].mode()[0]
df['importance_same_religion'] = df['importance_same_religion'].replace(100, religion_mode)


### Dealing with NaN for ```preference_of...```

Here the number of NaN values is also small, and the dataset is relatively large, so we can simply drop the affected rows. We lose at most 192 values.
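
A minimal sketch, on made-up data, of how the loss can be checked before dropping (the frame ```toy``` and the shortened column list ```pref_cols``` are illustrative assumptions). It counts the rows that have at least one NaN among the preference columns and then drops only those:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'pref_o_attractive': ['35', np.nan, '20', '15'],
    'pref_o_sincere': ['20', '10', np.nan, '25'],
    'like': ['7', '6', np.nan, np.nan],
})
pref_cols = ['pref_o_attractive', 'pref_o_sincere']

# Rows with at least one NaN among the preference columns
rows_lost = toy[pref_cols].isna().any(axis=1).sum()
print(rows_lost)

# subset= restricts the check to these columns; NaN in 'like' is ignored
cleaned = toy.dropna(subset=pref_cols)
print(cleaned.shape)
```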

In [11]:
df.dropna(subset=['pref_o_attractive', 'pref_o_sincere', 'pref_o_intelligence', 'pref_o_funny', 'pref_o_ambitious', 'pref_o_shared_interests'], inplace=True)


In [12]:
df.shape

(8186, 124)

## Casting strings to float and rounding float values to int for ```pref_o_```

In [13]:
df[['pref_o_attractive', 'pref_o_sincere', 'pref_o_intelligence', 'pref_o_funny', 'pref_o_ambitious', 'pref_o_shared_interests']] = df[['pref_o_attractive', 'pref_o_sincere', 'pref_o_intelligence', 'pref_o_funny', 'pref_o_ambitious', 'pref_o_shared_interests']].astype(float)


In [14]:
df[['pref_o_attractive', 'pref_o_sincere', 'pref_o_intelligence', 'pref_o_funny', 'pref_o_ambitious', 'pref_o_shared_interests']] = df[['pref_o_attractive', 'pref_o_sincere', 'pref_o_intelligence', 'pref_o_funny', 'pref_o_ambitious', 'pref_o_shared_interests']].round()
df[['pref_o_attractive', 'pref_o_sincere', 'pref_o_intelligence', 'pref_o_funny', 'pref_o_ambitious', 'pref_o_shared_interests']] = df[['pref_o_attractive', 'pref_o_sincere', 'pref_o_intelligence', 'pref_o_funny', 'pref_o_ambitious', 'pref_o_shared_interests']].astype(int)


### Handling age

In [15]:
# Fill missing ages with the placeholder 1000 so the columns can be cast to int
df['age'] = df['age'].fillna(1000)
df['age_o'] = df['age_o'].fillna(1000)


In [16]:
df[['age', 'age_o']] = df[['age', 'age_o']].astype(int)


In [17]:
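# Replace the placeholder with the median of the real ages; the 1000 placeholders
# are excluded so that they do not skew the median
df['age'] = df['age'].replace(1000, df.loc[df['age'] != 1000, 'age'].median())
df['age_o'] = df['age_o'].replace(1000, df.loc[df['age_o'] != 1000, 'age_o'].median())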


### Handling Duplicate Values in ```field```

In [18]:
df['field'] = df['field'].str.upper()


In [19]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

In [20]:
df['field_encoded'] = le.fit_transform(df['field'])
le.classes_

# Many fields are still near-duplicates of each other, so we merge them with regex substitutions


array(['ACTING', 'AFRICAN-AMERICAN STUDIES/HISTORY', 'AMERICAN STUDIES',
       'AMERICAN STUDIES [MASTERS]', 'ANTHROPOLOGY',
       'ANTHROPOLOGY/EDUCATION', 'APPLIED MATHS/ECONS',
       'APPLIED PHYSIOLOGY & NUTRITION', 'ARCHITECTURE', 'ART EDUCATION',
       'ART HISTORY', 'ART HISTORY/MEDICINE', 'ARTS ADMINISTRATION',
       'BILINGUAL EDUCATION', 'BIOCHEMISTRY',
       'BIOCHEMISTRY & MOLECULAR BIOPHYSICS', 'BIOCHEMISTRY/GENETICS',
       'BIOLOGY', 'BIOLOGY PHD', 'BIOMEDICAL ENGINEERING',
       'BIOMEDICAL INFORMATICS', 'BIOMEDICINE', 'BIOTECHNOLOGY',
       'BUSINESS', 'BUSINESS & INTERNATIONAL AFFAIRS',
       'BUSINESS ADMINISTRATION',
       'BUSINESS AND INTERNATIONAL AFFAIRS [MBA/MIA DUAL DEGREE]',
       'BUSINESS CONSULTING', 'BUSINESS SCHOOL',
       'BUSINESS [FINANCE & MARKETING]', 'BUSINESS [MBA]',
       'BUSINESS- MBA', 'BUSINESS/ FINANCE/ REAL ESTATE', 'BUSINESS/LAW',
       'BUSINESS; MARKETING', 'BUSINESS; MEDIA', 'CELL BIOLOGY',
       'CHEMISTRY', 'CLASSICS',

In [21]:
le.classes_.size

219

In [22]:
df['field'] = df['field'].replace('.*BUSINESS.*|MBA.*|ECONOMICS.*|.*FINANCE.*', 'BUSINESS/ECONOMICS/FINANCE', regex=True)
df['field'] = df['field'].replace('.*INTERNATIONAL AFFAIRS.*|SIPA.*', 'INTERNATIONAL AFFAIRS', regex=True)
df['field'] = df['field'].replace('LAW.*', 'LAW', regex=True)
df['field'] = df['field'].replace('OPERATIONS RESEARCH.*', 'OPERATIONS RESEARCH', regex=True)
df['field'] = df['field'].replace('PHILOSOPHY.*', 'PHILOSOPHY', regex=True)
df['field'] = df['field'].replace('PHYSICS.*', 'PHYSICS', regex=True)
df['field'] = df['field'].replace('.*INDUSTRIAL ENGINEERING.*', 'INDUSTRIAL ENGINEERING', regex=True)
df['field'] = df['field'].replace('.*MATH.*|.*STAT.*', 'MATHEMATICS', regex=True)
df['field'] = df['field'].replace('ART.*', 'ART', regex=True)
df['field'] = df['field'].replace('.*BIO.*', 'BIOLOGY', regex=True)
df['field'] = df['field'].replace('.*AMERICAN.*', 'AMERICAN STUDIES', regex=True)
df['field'] = df['field'].replace('CLIMATE.*|ENVIRON.*|.*EARTH.*', 'ENVIRONMENTAL SCIENCE', regex=True)
df['field'] = df['field'].replace('.*WRITING.*', 'WRITING', regex=True)
df['field'] = df['field'].replace('.*SOCI.*', 'SOCIOLOGY/SOCIAL STUDIES', regex=True)
df['field'] = df['field'].replace('.*NEURO.*', 'NEUROSCIENCE', regex=True)
df['field'] = df['field'].replace('.*ENGLISH.*|.*GERMAN.*|.*POLISH.*|.*FRENCH.*|.*LANG.*|.*CHINE.*|.*JAP.*', 'LANGUAGES', regex=True)
df['field'] = df['field'].replace('.*HIST.*', 'HISTORY', regex=True)
df['field'] = df['field'].replace('.*PSYCH.*', 'PSYCHOLOGY', regex=True)
df['field'] = df['field'].replace('.*ANTH.*', 'ANTHROPOLOGY', regex=True)
df['field'] = df['field'].replace('.*EDU.*', 'EDUCATION', regex=True)
df['field'] = df['field'].replace('.*THEA.*', 'THEATER', regex=True)
df['field'] = df['field'].replace('.*RELI.*', 'RELIGION', regex=True)



In [23]:
df['field_encoded'] = le.fit_transform(df['field'])
le.classes_.size


86
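
The chain of ```replace``` calls in cell In [22] could also be written as an ordered list of rules. The following is a minimal sketch under that assumption, on made-up field names; ```FIELD_RULES``` and ```normalize_fields``` are hypothetical names, and only a small subset of the patterns is shown:

```python
import pandas as pd

# Ordered (pattern, label) rules. They are applied in sequence, so a value
# rewritten by an earlier rule only changes again if a later pattern
# matches its new label.
FIELD_RULES = [
    (r'.*BUSINESS.*|MBA.*|ECONOMICS.*|.*FINANCE.*', 'BUSINESS/ECONOMICS/FINANCE'),
    (r'LAW.*', 'LAW'),
    (r'.*MATH.*|.*STAT.*', 'MATHEMATICS'),
    (r'.*BIO.*', 'BIOLOGY'),
]

def normalize_fields(fields: pd.Series) -> pd.Series:
    """Upper-case the field names and apply every rule in order."""
    out = fields.str.upper()
    for pattern, label in FIELD_RULES:
        out = out.replace(pattern, label, regex=True)
    return out

# Toy data, invented for illustration only
toy = pd.Series(['Business [MBA]', 'law', 'Applied Maths/Econs', 'Biology PhD', 'Classics'])
normalized = normalize_fields(toy)
print(normalized.tolist())
```

Keeping the rules in one list makes their order explicit, which matters because broad patterns such as ```.*BIO.*``` can swallow more specific fields.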

## Interval Description
Given values are represented in fixed intervals. These intervals represent whether a category was rated as not, medium or very important.

In the "importance" columns we have the intervals 0-1, 2-5 and 6-10.

These will be represented as NOT IMPORTANT, IMPORTANT and VERY IMPORTANT.
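
To illustrate how such intervals map numeric scores to labels, here is a minimal sketch using ```pd.cut``` on made-up scores. This is an alternative shown for clarity only; the notebook itself relabels the interval strings that already exist in the dataset:

```python
import pandas as pd

# Toy scores, invented for illustration only
scores = pd.Series([0, 1, 2, 5, 6, 10])

labels = pd.cut(
    scores,
    bins=[-1, 1, 5, 10],  # right-inclusive edges: (-1, 1], (1, 5], (5, 10]
    labels=['not important', 'important', 'very important'],
)
print(labels.tolist())
```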

In [24]:
# The d_* columns store intervals as literal strings such as "[0-1]", so
# regex=False replaces that exact text instead of reading it as a character class
importance_labels = {"[0-1]": "not important", "[2-5]": "important", "[6-10]": "very important"}
age_diff_labels = {"[0-1]": "no age difference", "[2-3]": "small age difference",
                   "[4-6]": "medium age difference", "[7-37]": "large age difference"}
other_labels = {
    # importance of partner having those attributes
    "[0-15]": "not important", "[16-20]": "important", "[21-100]": "very important",
    # categories for rating themselves
    "[0-5]": "low", "[6-8]": "average", "[9-10]": "high",
    # expected number of people interested in participant
    "[0-3]": "few", "[4-9]": "medium", "[10-20]": "a lot",
}

for col in df.columns:
    if "d_importance" in col:
        labels = importance_labels
    elif "d_d_age" in col:
        labels = age_diff_labels
    else:
        labels = other_labels
    try:
        for interval, label in labels.items():
            df[col] = df[col].str.replace(interval, label, regex=False)
    except AttributeError:
        # non-string columns have no .str accessor and are left unchanged
        pass


### Exclude pre-calculated values
A few columns also contain "expected" values; we decided to exclude them.

In [25]:
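# Drop every column whose name contains "expected", plus the two guess_prob_liked columns
expected_cols = [col for col in df.columns if "expected" in col]
df = df.drop(columns=expected_cols + ['guess_prob_liked', 'd_guess_prob_liked'])
df.head()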

Unnamed: 0,id,has_null,wave,gender,age,age_o,d_age,d_d_age,race,race_o,...,d_yoga,interests_correlate,d_interests_correlate,like,d_like,met,decision,decision_o,match,field_encoded
0,1,0,1,female,21,27,6,medium age difference,Asian/Pacific Islander/Asian-American,European/Caucasian-American,...,low,0.14,[0-0.33],7,average,0,1,0,0,48
1,2,0,1,female,21,22,1,no age difference,Asian/Pacific Islander/Asian-American,European/Caucasian-American,...,low,0.54,[0.33-1],7,average,1,1,0,0,48
2,3,1,1,female,21,22,1,no age difference,Asian/Pacific Islander/Asian-American,Asian/Pacific Islander/Asian-American,...,low,0.16,[0-0.33],7,average,1,1,1,1,48
3,4,0,1,female,21,23,2,small age difference,Asian/Pacific Islander/Asian-American,European/Caucasian-American,...,low,0.61,[0.33-1],7,average,0,1,1,1,48
4,5,0,1,female,21,24,3,small age difference,Asian/Pacific Islander/Asian-American,Latino/Hispanic American,...,low,0.21,[0-0.33],6,average,0,1,1,1,48


In [26]:
df.to_csv("out.csv",index=False)

### Handle NaN Values in "met" column
The "met" column tells whether the two participants had met before. Since it is not very common to meet the same person at speed dating more than once, we assume that cells with NaN can be filled with 0, which stands for "have not met before".

In [27]:
print(df.met.isnull().sum())
df["met"] = df["met"].fillna(0)
print(df.met.isnull().sum())

354
0


There are a few rows where multiple values are missing; we drop them because we cannot assume the values. We found that when the column "sports" is NaN, all activity columns and all "important" columns are null as well. So we drop these rows, because there is too much data we would need to simulate.

In [37]:
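for index, row in df.iterrows():
    try:
        i = int(row["sports"])
    except (TypeError, ValueError):
        # NOTE: DataFrame.drop returns a new DataFrame; the result is not
        # assigned here, so df itself is left unchanged
        df.drop(index)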

In [39]:
df.sports.isnull().values.sum()

193

After that, 193 NaN values remain in "sports". For simplicity we drop these rows too.

In [40]:
df=df.dropna()
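
Note that ```dropna()``` without arguments removes every row that has a NaN in any column, not only in "sports". A minimal sketch on made-up data (the frame ```toy``` is illustrative) shows the difference to a drop restricted to one column:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'sports': ['3', np.nan, '7', '5'],
    'yoga': ['1', np.nan, '4', '2'],
    'like': ['7', '6', np.nan, '8'],
})

# dropna returns a new DataFrame; assigning the result is what keeps the change
only_sports = toy.dropna(subset=['sports'])

# Without subset=, a NaN in any column removes the row
no_nan_anywhere = toy.dropna()

print(len(toy), len(only_sports), len(no_nan_anywhere))
```

If only the rows with a missing "sports" value should go, the ```subset=``` form is the narrower choice.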