# OkCupid Orientation Prediction

Hi everyone! This notebook is a beginner's exploration of applying basic scikit-learn machine learning algorithms. I recently came across this OkCupid profile dataset (https://github.com/rudeboybert/JSE_OkCupid) and wondered if it was possible at all to predict a person's sexual orientation with all the other information available. Now in reality this is not a good question to focus on, since orientation is not likely to be easily predictable. But I decided on this question just for fun!

Note that I already did a lot of analysis on the structure of the dataset beforehand, so what follows is the fruit of this previous labor.

Before we do anything, we want to import Pandas. Pandas is absolutely essential for managing the data and preprocessing it. 

In [1]:
import pandas as pd

Next we want to load the OkCupid profile dataset into a Pandas dataframe so that we can actually do something with the data.

In [2]:
profile_data = pd.read_csv('http://briannadardin.com/profiles.csv')

Now the first step is to remove all the rows with missing values in the "orientation" column. We only want to train our models on rows that have this attribute, since a row without a label can't be used for training or for checking a prediction.

In [3]:
profile_data = profile_data.dropna(subset=['orientation'])

Next we want to consider which columns to include in our analysis. Right off the bat, we want to drop all the essay columns, because all the values in these columns are unique to each individual. Now certainly we could build a predictive model from the words and/or n-grams in the essays and their frequencies, but we'll save that for a future project.

With this particular dataset, I will also drop the location column, because the bulk of the users here are from the Bay Area. I might keep this column if we had a national sample of users, since some areas of the country do have a higher percentage of LGBT people. Since this is focused on one particular area, the precise differences between cities don't seem too relevant.

I will also drop the "speaks" column, primarily because mapping not only every possible language but also their proficiency in each language would greatly increase dimensionality, with little likely predictive power.

In [4]:
profile_data = profile_data.drop(['essay0','essay1','essay2','essay3','essay4',
                                  'essay5','essay6','essay7','essay8','essay9',
                                  'location','speaks'],axis=1)

Now we look at the "last_online" column. The dates by themselves don't seem to communicate a lot of information, but we can extract more useful information from these dates. For example, the day of the week could potentially be relevant, as well as the time of day (so we can capture early birds vs. night owls, for example). We could also calculate the most recent users (and therefore potentially more active) vs the users that signed on a comparatively long time ago. Let's tackle these one at a time.

First with the day of the week, we create a simple function to extract the day of the week as an integer for each row, then add the column to the dataframe.

In [5]:
import datetime

def day_of_week(lastOnline):
    dt = datetime.datetime.strptime(lastOnline, '%Y-%m-%d-%H-%M')
    return dt.weekday()

profile_data['online_day'] = profile_data.apply(lambda row: day_of_week(row['last_online']), axis=1)

profile_data['online_day'].head()

0    3
1    4
2    2
3    3
4    2
Name: online_day, dtype: int64

Next we extract the time. In this case, I don't feel the actual minute is as relevant as just the hour, so we'll just extract the hour only and add that as a new column.

In [6]:
def hour(lastOnline):
    dt = datetime.datetime.strptime(lastOnline, '%Y-%m-%d-%H-%M')
    return dt.hour

profile_data['online_hour'] = profile_data.apply(lambda row: hour(row['last_online']), axis=1)

profile_data['online_hour'].head()

0    20
1    21
2     9
3    14
4    21
Name: online_hour, dtype: int64

Next we want to see how recently a user has signed on, as this may indicate level of activity. First we have to get the most recent date from the column, and then compare every other date to it and record how many days behind each one is.

In [7]:
max_index = profile_data.sort_values(by='last_online', ascending=False).index[0]
max_day = datetime.datetime.strptime(profile_data['last_online'][max_index], '%Y-%m-%d-%H-%M')

def time_diff(lastOnline):
    dt = datetime.datetime.strptime(lastOnline, '%Y-%m-%d-%H-%M')
    time_delta = max_day - dt
    return time_delta.days

profile_data['last_online_days'] = profile_data.apply(lambda row: time_diff(row['last_online']), axis=1)

profile_data['last_online_days'].head()

0    2
1    1
2    3
3    2
4    3
Name: last_online_days, dtype: int64
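As an aside, the three date features above can also be computed without a row-wise `apply`. This is only a minimal sketch, assuming the timestamps use the same `'%Y-%m-%d-%H-%M'` format as above; the three sample timestamps and the name `toy` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample in the same format as the "last_online" column.
toy = pd.DataFrame({'last_online': ['2012-06-28-20-30',
                                    '2012-06-29-21-41',
                                    '2012-06-27-09-10']})

# Parse the whole column once instead of calling strptime for every row.
parsed = pd.to_datetime(toy['last_online'], format='%Y-%m-%d-%H-%M')

toy['online_day'] = parsed.dt.dayofweek    # Monday=0 ... Sunday=6, like weekday()
toy['online_hour'] = parsed.dt.hour        # hour of day, 0-23
# Whole days between each login and the most recent login in the column.
toy['last_online_days'] = (parsed.max() - parsed).dt.days

print(toy)
```

The `.dt` accessor works on the whole column at once, which tends to be faster than `apply` on a dataset of this size, but the resulting columns should be the same.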

Now that we've extracted the data we wanted from the last_online column, we can drop it:

In [8]:
profile_data = profile_data.drop('last_online',axis=1)

So the recommended approach for dealing with categorical variables is to "one hot encode" them. What this means is that, if a column has 5 distinct possible values, one hot encoding this column will result in creating 5 new columns, each corresponding to a particular value. The columns themselves are booleans - they will either be 1 if that value is true for the row, or 0 if not. So each row will then only have 1 column in this set of 5 with a value of 1, and the others will be 0s.
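To make that concrete, here is a tiny sketch of one hot encoding with Pandas' `get_dummies`. The `toy` dataframe and its values are made up for illustration; the `dtype=int` argument is my own choice so the output shows 1s and 0s rather than True/False.

```python
import pandas as pd

# Hypothetical column with 3 distinct values.
toy = pd.DataFrame({'drinks': ['socially', 'often', 'socially', 'never']})

# One new 0/1 column per distinct value, named "<column>_<value>".
encoded = pd.get_dummies(toy, dtype=int)

print(encoded)
# Every row has exactly one 1 across the new columns.
print(encoded.sum(axis=1).tolist())
```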

However an issue arises when a column has a LOT of distinct possible values. The more values, the more columns, thus more work for our models. Some of the columns in this dataset have a lot of possible values - in fact, I purposely removed the "speaks" column earlier precisely for this reason. But what to do with all the other columns with multiple values?

I ended up coming up with an unconventional approach - this may not be the "best practice" approach, but for our simple analysis, it will work. Basically what I noticed is that many of these columns are actually the combination of 2 drop down fields on the site itself. For example, the "diet" column combines a drop down with "strictly" and "mostly" as options, and another drop down with "anything", "halal", "kosher", "other", "vegan" and "vegetarian". The diet column then contains values such as "strictly vegan", "mostly vegetarian", etc.

I decided that instead of one hot encoding all of these combined options, I will one hot encode the underlying drop down choices. What this means is that I will create a column corresponding to each value in the first drop down ("strictly" and "mostly") and to each value in the second drop down ("anything", "halal", etc). So each row will then have either two 1s in this set of columns (one corresponding to the first drop down, one corresponding to the second), just one (if they omitted the first drop down), or none (if they left both drop downs blank).

This explanation may seem a little confusing, so let me show it in action. First, I created a dictionary that maps each column structured like this to a list of the unique values from its underlying drop downs.

In [9]:
new_columns = { 'diet': ['strictly', 'mostly', 'anything', 'halal', 'kosher', 
                         'other', 'vegan', 'vegetarian'],
                'education': ['high school', 'college/university', 'law school', 
                              'masters program', 'med school', 'ph.d program', 
                              'space camp', 'two-year college', 'dropped out', 
                              'graduated', 'working on'],
                'ethnicity': ['asian', 'black', 'hispanic / latin', 'indian', 
                             'middle eastern', 'native american', 'other', 
                             'pacific islander', 'white'],
                'offspring': ['doesn&rsquo;t have kids', 'has a kid', 'has kids', 
                              'doesn&rsquo;t want', 'might want', 'wants'],
                'religion': ['agnosticism', 'atheism', 'buddhism', 'catholicism', 
                             'christianity', 'hinduism', 'islam', 'judaism', 
                             'other', 'laughing', 'somewhat serious', 
                             'very serious', 'not too serious'],
                'sign': ['aquarius', 'aries', 'cancer', 'capricorn', 'gemini', 
                         'leo', 'libra', 'pisces', 'sagittarius', 'scorpio', 
                         'taurus', 'virgo', 'matters a lot', 
                         'doesn&rsquo;t matter', 'fun to think about']
                }

With this dictionary established, we can then iterate through it to generate our new columns. For each row, we look at each column in the dictionary and compare the row's value in that column with the dictionary's values for it. If a dictionary value appears as a substring of the row's value, a 1 gets added to a list, otherwise a 0 is added. Each row's list then gets added to the list of all the rows' lists before moving on to the next row.

In [10]:
new_column_data = []
for index, row in profile_data.iterrows():
    data_row = []
    for key in new_columns.keys():
        for val in new_columns[key]:
            if pd.notnull(row[key]) and val in row[key]:
                data_row.append(1)
            else:
                data_row.append(0)
    new_column_data.append(data_row)

Once we have that done, we want to then create an array with all the corresponding column names.

In [11]:
new_column_names = []
for key in new_columns.keys():
    for val in new_columns[key]:
        new_column_names.append(key+"_"+val)

With both of those arrays compiled, we can combine them to create a new dataframe, and then merge that dataframe with the one we started with, and then drop the now redundant columns.

In [12]:
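# Reuse profile_data's index so the rows line up when concatenating;
# a fresh 0..n-1 index would misalign if any rows had been dropped earlier.
new_column_df = pd.DataFrame(data=new_column_data, columns=new_column_names,
                             index=profile_data.index)
profile_data = pd.concat([profile_data, new_column_df], axis=1)
profile_data = profile_data.drop(list(new_columns.keys()), axis=1)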
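For comparison, the same drop down encoding can be sketched with Pandas string methods instead of `iterrows`. This is just an illustration under my own assumptions: `toy`, `toy_columns` and the four sample values are made up, and `str.contains` with `regex=False` stands in for the `val in row[key]` substring check.

```python
import pandas as pd

# Hypothetical miniature of the "diet" column, including a missing value.
toy = pd.DataFrame({'diet': ['strictly vegan', 'mostly vegetarian', 'anything', None]})
toy_columns = {'diet': ['strictly', 'mostly', 'anything', 'vegan', 'vegetarian']}

indicator_columns = {}
for col, values in toy_columns.items():
    for val in values:
        # regex=False: plain substring match; na=False: missing values become 0.
        indicator_columns[col + '_' + val] = (
            toy[col].str.contains(val, regex=False, na=False).astype(int))

# Built from toy's own column, so the indicators share toy's index.
indicators = pd.DataFrame(indicator_columns, index=toy.index)
encoded = pd.concat([toy.drop(list(toy_columns), axis=1), indicators], axis=1)
print(encoded)
```

Because the indicator columns are derived directly from the original column, they keep its index, so the concatenation lines up row for row.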

Next we'll try to clean up some missing values. For example, two columns in this dataset, "body_type" and "job" both have the "rather not say" option, yet there are still many missing values. A missing value is basically the same as "rather not say", so we'll convert them accordingly.

In [13]:
profile_data['body_type'] = profile_data['body_type'].fillna('rather not say')
profile_data['job'] = profile_data['job'].fillna('rather not say')

Before we move on, let's check for missing values.

In [14]:
print(profile_data.isnull().sum())

age                                      0
body_type                                0
drinks                                2985
drugs                                14080
height                                   3
income                                   0
job                                      0
orientation                              0
pets                                 19921
sex                                      0
smokes                                5512
status                                   0
online_day                               0
online_hour                              0
last_online_days                         0
diet_strictly                            0
diet_mostly                              0
diet_anything                            0
diet_halal                               0
diet_kosher                              0
diet_other                               0
diet_vegan                               0
diet_vegetarian                          0
sign_aquari

A few columns still have many null values, but that's because we haven't processed them yet. What stands out here however is that the "height" column only has 3 missing values. I think we can drop these rows without significantly impacting our analysis.

In [15]:
profile_data = profile_data.dropna(subset=['height'])

Now we can go ahead and properly one hot encode the columns I left out of the previous processing. This is also the stage where I decided to separate my target variable ("orientation") from the others.

In [16]:
target = profile_data['orientation']
features = profile_data.drop(['orientation'],axis=1)

target_dummies = pd.get_dummies(data=target)
feature_dummies = pd.get_dummies(data=features, dummy_na=True)

Note that I used the parameter "dummy_na=True" for feature_dummies, because I wanted a column created for null values as well. However, this parameter creates a column for null values even if the original column doesn't contain any. So I want to remove those columns.

In [17]:
nunique = feature_dummies.apply(pd.Series.nunique)
cols_to_drop = nunique[nunique == 1].index
feature_dummies = feature_dummies.drop(cols_to_drop, axis=1)
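Here is a small sketch of what that cleanup does, on a made-up dataframe (`toy` and its values are hypothetical). The "sex" column has no missing values, so its `sex_nan` dummy is all 0s and gets dropped, while `pets_nan` carries real information and stays.

```python
import pandas as pd

# Hypothetical data: "sex" is complete, "pets" has one missing value.
toy = pd.DataFrame({'sex': ['m', 'f', 'm', 'f'],
                    'pets': ['likes dogs', None, 'likes cats', 'likes dogs']})

# dummy_na=True adds a "<column>_nan" indicator for every encoded column.
dummies = pd.get_dummies(toy, dummy_na=True, dtype=int)
print(list(dummies.columns))

# Columns with a single distinct value carry no information, so drop them.
nunique = dummies.apply(pd.Series.nunique)
cleaned = dummies.drop(nunique[nunique == 1].index, axis=1)
print(list(cleaned.columns))
```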

We're almost ready to start testing various machine learning algorithms! Now we need to split our data into training and testing data.

In [18]:
from sklearn.model_selection import train_test_split

Y = target_dummies.values
X = feature_dummies.values

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3)
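One optional variation, which I did not use for the results in this notebook: `train_test_split` accepts `stratify` and `random_state` arguments. With imbalanced classes, stratifying keeps the class proportions the same in both splits, and a fixed seed makes the split repeatable. A minimal sketch on made-up labels (`X` and `y` here are hypothetical, not the profile data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 of class 0 and 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the 80/20 ratio in both the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(y_train.mean(), y_test.mean())
```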

There is just ONE more thing to do before we begin. See, these algorithms are fussy about what kind of targets they're willing to accept. For a problem like ours, where each person has exactly one label, they expect a one-dimensional target with one class label per row, not the 3-column one hot encoded target we currently have (one column each for "straight", "gay", and "bisexual", the 3 values in this dataset - out in the real world, of course, there are more). So even though we one hot encoded this column properly, we still need to collapse it back down to a single column.

To do this, we'll use NumPy's "argmax" function. Basically what this will do is record which index, aka which column, in the original had the highest value. Since only one of the columns will have a 1 in it, that column's index will be recorded. It preserves the meaning, just shrinks the target down from 3 columns to 1.

In [19]:
import numpy as np

y_train = np.argmax(y_train, axis=1)
y_test = np.argmax(y_test, axis=1)
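To check that nothing is lost in this step, here is a small sketch on a made-up target (`toy_target` is hypothetical). It also shows scikit-learn's `LabelEncoder`, which I did not use in this notebook, producing the same integer codes directly from the strings.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical target column.
toy_target = pd.Series(['straight', 'gay', 'bisexual', 'straight'])

# Route 1: one hot encode, then take the index of the column holding the 1.
one_hot = pd.get_dummies(toy_target, dtype=int)
via_argmax = np.argmax(one_hot.values, axis=1)

# Route 2: encode the strings as integers in sorted label order.
encoder = LabelEncoder()
via_encoder = encoder.fit_transform(toy_target)

print(list(one_hot.columns))
print(via_argmax)
print(via_encoder)
```

Both routes number the classes in alphabetical order, so the codes agree.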

Now we can move on to testing various different algorithms! Note that I currently do not have an in-depth understanding of any of the algorithms I'm about to use. This is basically just a survey of some common algorithms implemented in scikit-learn with their defaults left intact, to see how well they perform on this particular problem. They each have their own optimal use cases, so none of the results here mean that a particular algorithm is "good" or "bad" - they only show whether it's well-suited to this particular problem or not.

In order to evaluate each algorithm, we need to import 2 different metrics - the accuracy score and the confusion matrix. The accuracy score is just the fraction of predictions the algorithm got right (we multiply it by 100 below to show it as a percentage). This doesn't tell the whole story however. The confusion matrix shows us the breakdown of predictions - how many times the algorithm predicted each value, and how many of those predictions were correct. In order to read the confusion matrix accurately, we need to know which order the columns are in. So we'll print those values for our reference.

In [20]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

print(list(target_dummies))

['bisexual', 'gay', 'straight']


Now we know to read our confusion matrices as representing "bisexual" in the first column, "gay" in the second column, and "straight" in the third column! This order also determines the row order, as we'll see later.

Since I ran these previously, I decided on the order here based on a certain kind of logical flow. We'll look at each algorithm one at a time, and at the end decide which seems the most promising, in case we want to dive deeper later and get better results with some fine tuning.

First up, Logistic Regression!

In [21]:
from sklearn.linear_model import LogisticRegression
LogReg = LogisticRegression()
LogReg.fit(X_train, y_train)
y_pred_log = LogReg.predict(X_test)
print("Logistic Regression accuracy is {}".format(accuracy_score(y_test, y_pred_log)*100))
log_matrix = confusion_matrix(y_test, y_pred_log)
print(log_matrix)

Logistic Regression accuracy is  86.1035422343
[[    0     0   836]
 [    0     0  1663]
 [    0     0 15484]]


At first you may be tempted to be like "86% sounds pretty good!", but then you look at the confusion matrix. It literally scored that high by assuming that everyone is straight (all the predictions are in the third column): 15,484 of the 17,983 test profiles are straight, which is exactly that 86.1%. Who knew that logistic regression is homophobic? This is definitely not the approach we would want in solving this problem, so maybe we'll get better results with something else? Let's find out!
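To see where that 86% comes from, here is a sketch of a "predict the most common class" baseline. `DummyClassifier` is not something I used elsewhere in this notebook; the class counts are the row sums of the confusion matrix above, and fitting and scoring on the same made-up arrays (`X_toy`, `y_toy`) is only there to reproduce the arithmetic.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Actual class counts in the test set (row sums of the matrix above):
# bisexual, gay, straight.
counts = [836, 1663, 15484]
y_toy = np.repeat([0, 1, 2], counts)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predicts whichever class was most frequent during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy)) * 100
print(baseline_accuracy)
```

Any model that doesn't beat this number on accuracy alone hasn't learned more than the class proportions, which is why the confusion matrix matters so much here.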

Next up, Naive Bayes!

In [22]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_gnb = gnb.predict(X_test)
print("Naive Bayes accuracy is {}".format(accuracy_score(y_test, y_pred_gnb)*100))
gnb_matrix = confusion_matrix(y_test, y_pred_gnb)
print(gnb_matrix)

Naive Bayes accuracy is  85.3583940388
[[   12     0   824]
 [   11     0  1652]
 [  146     0 15338]]


The accuracy score is slightly lower, but we're getting a little better when it comes to the confusion matrix. This one acknowledges that bisexual people exist (first column), but completely ignores gay people (second column). So it's still not what we're looking for.

Let's try a different one then. Next up, K-Nearest Neighbors!

In [23]:
from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier()
knc.fit(X_train, y_train)
y_pred_knc = knc.predict(X_test)
print("K-nearest accuracy is {}".format(accuracy_score(y_test, y_pred_knc)*100))
knc_matrix = confusion_matrix(y_test, y_pred_knc)
print(knc_matrix)

K-nearest accuracy is  84.6521714953
[[   12    15   809]
 [   12    50  1601]
 [   75   248 15161]]


Now we're getting somewhere! Sure, the accuracy score is a little lower, but it actually acknowledges both bisexual and gay people! Finally making some progress! Note that the rows are in the same order as the columns, so first row = "bisexual", second row = "gay", third row = "straight". Here is a step-by-step way of reading the matrix (which will apply for all the future ones in this notebook):

first column, first row = bisexual prediction, actually bisexual  
first column, second row = bisexual prediction, actually gay  
first column, third row = bisexual prediction, actually straight  
second column, first row = gay prediction, actually bisexual  
second column, second row = gay prediction, actually gay  
second column, third row = gay prediction, actually straight  
third column, first row = straight prediction, actually bisexual  
third column, second row = straight prediction, actually gay  
third column, third row = straight prediction, actually straight  
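
If the row/column convention is hard to keep straight, a tiny made-up example can help. The labels below are hypothetical (0 = bisexual, 1 = gay, 2 = straight); the point is only that scikit-learn puts the actual class on the rows and the predicted class on the columns.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 0 = bisexual, 1 = gay, 2 = straight.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 2, 2, 2, 2, 1]

# Rows are the actual classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Correct predictions sit on the diagonal, so accuracy is diagonal / total.
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)
```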

Those aren't particularly impressive scores. Can we do better than that? Let's find out! Next up is a Decision Tree!

In [24]:
from sklearn.tree import DecisionTreeClassifier
dec_tree = DecisionTreeClassifier()
dec_tree.fit(X_train, y_train)
y_pred_tree = dec_tree.predict(X_test)
print("Decision Tree accuracy is {}".format(accuracy_score(y_test, y_pred_tree)*100))
tree_matrix = confusion_matrix(y_test, y_pred_tree)
print(tree_matrix)

Decision Tree accuracy is  76.8948451315
[[  147    98   591]
 [   79   276  1308]
 [  625  1454 13405]]


This is easily the lowest accuracy score we've gotten so far, but look at that matrix! Finally, an algorithm that cares about bisexual and gay people! 

Since it seems like we're on the right track, we'll try one more algorithm - Random Forests, which is basically Decision Trees on steroids. Let's see what we get!

In [25]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier()
forest.fit(X_train, y_train)
y_pred_forest = forest.predict(X_test)
print("Random Forest accuracy is {}".format(accuracy_score(y_test, y_pred_forest)*100))
forest_matrix = confusion_matrix(y_test, y_pred_forest)
print(forest_matrix)

Random Forest accuracy is  85.775454596
[[   61     8   767]
 [   21    52  1590]
 [   71   101 15312]]


Here we have a higher accuracy percentage, but also lower numbers of predictions for bisexual and gay people.

Out of all the algorithms we sampled here, Random Forest seems to strike the best balance between accuracy score and correct gay/bisexual predictions, so this may be the best one to pursue further. However, the Decision Tree has the highest number of correct gay/bisexual predictions, so it may also be worth pursuing. There are also many other algorithms I left out of this notebook - maybe one of those could work better with some fine tuning.
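One way to put a number on that trade-off is to average the per-class recall (how often each actual class was predicted correctly), so that the two small classes count as much as the big one. This metric is my own addition for illustration, not something computed earlier in the notebook; the matrices are copied from the outputs above.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows = actual, columns = predicted; order: bisexual, gay, straight).
matrices = {
    'Logistic Regression': [[0, 0, 836], [0, 0, 1663], [0, 0, 15484]],
    'Naive Bayes': [[12, 0, 824], [11, 0, 1652], [146, 0, 15338]],
    'K-nearest': [[12, 15, 809], [12, 50, 1601], [75, 248, 15161]],
    'Decision Tree': [[147, 98, 591], [79, 276, 1308], [625, 1454, 13405]],
    'Random Forest': [[61, 8, 767], [21, 52, 1590], [71, 101, 15312]],
}

summary = {}
for name, rows in matrices.items():
    cm = np.array(rows, dtype=float)
    accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
    recall = np.diag(cm) / cm.sum(axis=1)     # per actual class (one value per row)
    summary[name] = (accuracy, recall.mean())
    print('{:<20} accuracy={:.3f}  mean recall={:.3f}'.format(
        name, accuracy, recall.mean()))

# Model with the highest mean per-class recall.
best = max(summary, key=lambda name: summary[name][1])
print(best)
```

By this measure the ranking differs from the ranking by plain accuracy, which is the tension described above.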

Ultimately though, looking at all of these numbers, it is easy to see that predicting someone's orientation from their OkCupid data is not going to be very accurate. Thankfully we already realized this at the outset - this was simply a way to get our feet wet in preparing data for use by machine learning algorithms. :)