# OkCupid Date-A-Scientist

## Introduction

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Data exploration

### Initial exploration

To begin, we'll look at some basic information in this dataset, including:
- The first few rows of the table
- The distinct columns and their types
- How many total rows we have
- Where, if anywhere, there may be values missing

In [2]:
df = pd.read_csv("profiles.csv")
df.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...",...,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,...,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,...,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single


In [3]:
print(df.columns)
print(df.dtypes)
print("Total amount of rows: {0}".format(len(df)))
print("Unique values in 'speaks' column: {0}".format(df['speaks'].nunique()))

Index(['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'essay0',
       'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
       'essay8', 'essay9', 'ethnicity', 'height', 'income', 'job',
       'last_online', 'location', 'offspring', 'orientation', 'pets',
       'religion', 'sex', 'sign', 'smokes', 'speaks', 'status'],
      dtype='object')
age              int64
body_type       object
diet            object
drinks          object
drugs           object
education       object
essay0          object
essay1          object
essay2          object
essay3          object
essay4          object
essay5          object
essay6          object
essay7          object
essay8          object
essay9          object
ethnicity       object
height         float64
income           int64
job             object
last_online     object
location        object
offspring       object
orientation     object
pets            object
religion        object
sex             object
s

The first few rows indicate that this data needs considerable tidying. For starters, the "essays" that users write about themselves contain HTML tags that need to be removed before the text can be processed in any meaningful way. Secondly, certain columns such as `religion` or `sign` pack two potential features into one value: the sign or religion itself, and how seriously the person takes astrology or their faith.

This matters when looking for potential compatibility, as we wouldn't necessarily want to match someone who strictly adheres to their faith with someone who isn't very serious about it.
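As a rough sketch of how such a column could be split into two features, the snippet below separates the first word from the rest of the value. The toy values are taken from the rows shown above, while the column names `religion_name` and `religion_attitude` are hypothetical and not part of the dataset.

```python
import pandas as pd

# Toy values copied from the sample rows; the real column has many more variants
toy = pd.DataFrame({
    "religion": [
        "agnosticism and very serious about it",
        "agnosticism but not too serious about it",
        None,
    ]
})

# Split on the first run of whitespace only: first word vs. the remainder
parts = toy["religion"].str.split(n=1, expand=True)
toy["religion_name"] = parts[0]      # hypothetical column name
toy["religion_attitude"] = parts[1]  # hypothetical column name; stays missing when absent

print(toy[["religion_name", "religion_attitude"]])
```

This assumes the religion (or sign) is always a single leading word, which would need to be checked against the full column before relying on it.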

Another potential problem could be the 'speaks' column, as it has over 7000 unique values. This is because there's a large number of potential combinations, as well as some humorous additions to the column (such as "c++"). We may only want to consider languages that the user has stated they're fluent in.
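One way to keep only fluent languages, sketched here under the assumption that fluency is always marked with the literal suffix `(fluently)` as in the sample rows. The helper name `fluent_languages` is hypothetical, and entries with no qualifier (such as plain `english`) are deliberately excluded, which may or may not be the desired behavior.

```python
import re

def fluent_languages(speaks):
    """Return the languages explicitly marked '(fluently)' in a 'speaks' value."""
    if not isinstance(speaks, str):  # NaN or other missing values
        return []
    fluent = []
    for entry in speaks.split(","):
        match = re.fullmatch(r"\s*(.+?)\s*\(fluently\)\s*", entry)
        if match:
            fluent.append(match.group(1))
    return fluent

print(fluent_languages("english (fluently), spanish (poorly)"))
print(fluent_languages("english, french, c++"))
```

Applied with `df['speaks'].apply(fluent_languages)`, this would collapse many of the distinct combinations into a much smaller set of values.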

One last thing we can check before actually beginning to make changes to these columns is where there may be any `NaN` values.

In [4]:
df.isna().sum()

age                0
body_type       5296
diet           24395
drinks          2985
drugs          14080
education       6628
essay0          5488
essay1          7572
essay2          9638
essay3         11476
essay4         10537
essay5         10850
essay6         13771
essay7         12451
essay8         19225
essay9         12603
ethnicity       5680
height             3
income             0
job             8198
last_online        0
location           0
offspring      35561
orientation        0
pets           19921
religion       20226
sex                0
sign           11056
smokes          5512
speaks            50
status             0
dtype: int64

Dropping every row that contains a `NaN` value is not feasible for most comparisons. For instance, `offspring` has over 35,000 `NaN` values! Dropping those rows alone would remove more than half the dataset. Instead, we'll select the features each analysis needs, and only then remove the `NaN` values that remain in that subset.
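The difference between the two approaches can be sketched on a small toy frame. The values below are made up for illustration and are not taken from `profiles.csv`.

```python
import pandas as pd

# Made-up rows that mimic a sparse 'offspring' column
toy = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "offspring": [None, None, None, "doesn't want kids"],
    "sign": ["gemini", "cancer", None, "pisces"],
})

# Share of missing values per column (0.0 to 1.0)
missing_share = toy.isna().mean()

# Dropping on every column keeps only fully complete rows
all_complete = toy.dropna()

# Selecting the needed features first keeps far more rows
sign_subset = toy[["age", "sign"]].dropna()

print(missing_share)
print(len(all_complete), len(sign_subset))
```

Here a single sparse column costs three of the four rows when dropping globally, while the feature subset only loses the one row that is actually missing a `sign`.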

## Cleaning the data

One of the first things we can do is remove HTML tags, line breaks, and other special characters that break up the text and make it harder to read and process. The most immediately evident ones are line breaks (`\n`) and `<br />` tags. Others include `&amp;`, denoting an ampersand, and `&rsquo;`, denoting an apostrophe.

In [5]:
df.replace(r"<[^<]+?>", " ", regex=True, inplace=True)
df.replace(r"\n", " ", regex=True, inplace=True)
df.replace(r"&amp;", "&", regex=True, inplace=True)
df.replace(r"&rsquo;", "'", regex=True, inplace=True)
df.head(1)

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me: i would love to think that i was ...,currently working as an international agent fo...,making people laugh. ranting about a good sal...,"the way i look. i am a six foot half asian, ha...",...,"south san francisco, california","doesn't have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
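The replacements above list entities one by one. As a possible alternative, the standard library's `html.unescape` decodes all named entities in one pass. The helper name `clean_text` is hypothetical, and note one difference: `&rsquo;` decodes to the typographic apostrophe (U+2019) rather than the plain `'` used above.

```python
import html
import re

def clean_text(text):
    """Strip HTML tags, decode entities, and collapse whitespace in one string."""
    no_tags = re.sub(r"<[^<]+?>", " ", text)  # same tag pattern as the cell above
    decoded = html.unescape(no_tags)          # handles &amp;, &rsquo;, and others
    return re.sub(r"\s+", " ", decoded).strip()

print(clean_text("about me:<br />\ni would love to think"))
print(clean_text("rock &amp; roll"))
```

It could be applied per column with `df[col].apply(clean_text)` on the non-missing values.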


Just looking at one row, we can already tell the data is more readable. The next step is to combine all essays into a sort of user "biography." We'll create a `bio` column that concatenates all of that user's essays. A potential problem here is any essay the user chose not to fill out, which shows up as a `NaN` value. We'll use `fillna()` to turn each of those into a single space, and then collapse runs of whitespace so that several blank essays in a row don't leave excess whitespace behind.

In [6]:
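essays_list = ["essay" + str(i) for i in range(10)]
# Replace missing essays with a space, then concatenate the ten essays row-wise
df['bio'] = df[essays_list].fillna(' ').sum(axis=1)
# Collapse runs of whitespace; assign back rather than using inplace on a column
df['bio'] = df['bio'].replace(r"\s+", " ", regex=True)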
print(df['bio'][3])
print(df['sign'][3])
print(df['offspring'][3])

i work in a library and go to school. . .reading things written by old dead peopleplaying synthesizers and organizing books according to the library of congress classification systemsocially awkward but i do my bestbataille, celine, beckett. . . lynch, jarmusch, r.w. fassbender. . . twin peaks & fishing w/ john joy division, throbbing gristle, cabaret voltaire. . . vegetarian pho and coffee cats and german philosophy you feel so inclined.
pisces
doesn't want kids


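We can see in the example a cleaned `bio`, along with that person's cleaned `sign` and `offspring` columns. Note that the essays are joined without a separator, so words at essay boundaries run together (for example, "peopleplaying").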
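In the sample `bio` above, adjacent essays are joined with nothing between them. A variant that joins with a space instead might look like the sketch below. The toy frame uses only three essay columns with text shortened from the row shown, and this is not the code that produced the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    "essay0": ["i work in a library and go to school. . ."],
    "essay1": [None],  # an essay the user left blank
    "essay2": ["reading things written by old dead people"],
})
essay_cols = ["essay0", "essay1", "essay2"]

# Join the essays with a single space between them, treating blanks as empty
bio = toy[essay_cols].fillna("").apply(lambda row: " ".join(row), axis=1)

# Collapse repeated whitespace left by blank essays and trim the ends
bio = bio.str.replace(r"\s+", " ", regex=True).str.strip()

print(bio[0])
```

Keeping a separator matters for the word-count features used later, since fused words such as "peopleplaying" would otherwise be counted as new vocabulary.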

## Using Machine Learning to Predict Zodiac Sign

If we check back to our table containing `NaN` values, we can see that `sign` has around 11,000 missing values. Using our cleaned biography data along with a few other features, we could potentially predict someone's zodiac sign in order to fill in the missing values. We'll want to do two things before we actually begin training an ML model on our data: select the features we want to train the model on, and use a subset of the dataframe that doesn't include any row where `sign` is `NaN`.

Finally, we're only interested in the sign itself, not whether or not the user in question is a fervent believer in astrology. 

In [7]:
signs_features = ['bio', 'drinks', 'drugs', 'smokes', 'sign']
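# Select rows with a known sign; .copy() avoids modifying a view of df
signs_df = df.loc[df['sign'].notna(), signs_features].copy()
# Keep only the first word, e.g. "pisces but it doesn't matter" -> "pisces"
signs_df['sign'] = signs_df['sign'].str.split().str[0]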
print(signs_df['sign'].isna().any())
print(signs_df.head())

False
                                                 bio    drinks      drugs  \
0  about me: i would love to think that i was som...  socially      never   
1  i am a chef: this is what that means. 1. i am ...     often  sometimes   
2  i'm not ashamed of much, but writing public te...  socially        NaN   
3  i work in a library and go to school. . .readi...  socially        NaN   
4  hey how's it going? currently vague on the pro...  socially      never   

      smokes      sign  
0  sometimes    gemini  
1         no    cancer  
2         no    pisces  
3         no    pisces  
4         no  aquarius  


One final step with this dataframe is to turn the signs into integers that we can pass as labels to our model later.

In [8]:
# Map each sign name to an integer label (no backslashes needed inside braces)
sign_to_label = {"aries": 0, "taurus": 1, "gemini": 2,
                 "cancer": 3, "leo": 4, "virgo": 5, "libra": 6,
                 "scorpio": 7, "sagittarius": 8, "capricorn": 9,
                 "aquarius": 10, "pisces": 11}
signs_df['sign'] = signs_df['sign'].map(sign_to_label)

Now we can split our data into a training set and a testing set. We'll use these to train the machine learning model and verify its accuracy.

In [16]:
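from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Word counts -> TF-IDF weighting -> multinomial naive Bayes
classifier = Pipeline([
    ('counter', CountVectorizer()),
    ('transformer', TfidfTransformer()),
    ('clf', MultinomialNB())
])

# Features are the cleaned biographies; labels are the sign values
data = signs_df['bio'].values
labels = signs_df['sign'].values

# Hold out a test set (train_test_split defaults to a 75/25 split)
signs_train_data, signs_test_data, signs_train_labels, signs_test_labels = train_test_split(data, labels)
classifier.fit(signs_train_data, signs_train_labels)
predictions = classifier.predict(signs_test_data)
# Accuracy: fraction of test biographies whose sign was predicted correctly
np.mean(predictions == signs_test_labels)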

0.08811257465434018
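To put an accuracy figure like this in context, it helps to compare it against simple baselines. With twelve classes, uniform random guessing would be right about 1/12 of the time if the signs were evenly balanced, and always predicting the most common sign gives another reference point. The sketch below uses made-up labels, not the real `signs_df` labels.

```python
import numpy as np

# Made-up integer labels for illustration; label 11 is slightly over-represented
toy_labels = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 11, 11])

# Expected accuracy of uniform random guessing over 12 balanced classes
uniform_chance = 1 / 12

# Accuracy of always predicting the most frequent label
majority_accuracy = np.bincount(toy_labels).max() / len(toy_labels)

print(round(uniform_chance, 4))
print(round(majority_accuracy, 4))
```

On the real data, the same two numbers computed from `signs_test_labels` would show how far the classifier's score sits above guessing.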