# OKCupid Profiles

<img src="http://www.echoexaminer.com/wp-content/uploads/2015/12/OkCupid.jpg" width="500"/>

Included in this dataset are many different OKCupid profiles. Using this data, let's see who can get the highest cross-validated score when predicting sex (male/female) from the features below. Because there are so many features, you may have better luck using SVMs to classify the data.

**Steps:**
1. View the data. Think about which features might be relevant.
2. Start simple! Use only one or two features at first, then work your way up.
3. Score your classifier using 10-fold cross-validation (try `KFold` or `StratifiedKFold` with shuffling enabled so the rows are shuffled before splitting).
4. Report your score. Let's make a competition out of this!

**Some useful hints:**
* Try using `pandas.get_dummies(data)` to convert string-based features into dummy (indicator) features
* Remember the distinction between categorical and continuous features
* Using `CountVectorizer` here may be useful (think back to the Naive Bayes examples)
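
To make the hints concrete, here is a minimal sketch of combining a continuous column, a dummy-encoded categorical column, and word counts from an essay. The `toy` frame and its values are invented for illustration (only the column names mirror `profiles.csv`), and stacking the pieces with `scipy.sparse.hstack` is one possible approach, not the only one.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for profiles.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "height": [64, 68, 74, 62],
    "drinks": ["socially", "rarely", "socially", "often"],
    "essay0": ["i am an artist and a mother",
               "love sports and soccer",
               "still single, still picky",
               None],
})

# Categorical column -> one indicator column per distinct value.
drinks_dummies = pd.get_dummies(toy["drinks"], prefix="drinks")

# Free text -> sparse word-count matrix; missing essays become empty strings.
vectorizer = CountVectorizer()
essay_counts = vectorizer.fit_transform(toy["essay0"].fillna(""))

# Stack the continuous, dummy, and text features side by side.
dense_part = pd.concat([toy[["height"]], drinks_dummies], axis=1).to_numpy(dtype=float)
X_toy = hstack([csr_matrix(dense_part), essay_counts]).tocsr()
print(X_toy.shape)
```

Keeping the result sparse matters once the essays are included, since the vocabulary can easily run to thousands of columns.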

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm

%matplotlib inline
pd.options.display.max_columns=1000



In [4]:
#path_to_repo = ""
data = pd.read_csv("profiles.csv")
data.head()



Unnamed: 0.1,Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9,ethnicity,height,income,job,last_online,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,46766,42,fit,,socially,,graduated from college/university,"what am i to me? i am a mother, artist, friend...","currently, i work in the healthcare field, go ...",making people laugh and mixing colors!,,"if i look at my bookcase, i have a lot of trea...",my son<br />\nthe ocean<br />\nbooks<br />\nsk...,"how to create a story, how to draw a character...",~debating whether i want to drive to the city ...,,"you are curious about me, are creative, want t...",,64,-1,,2012-07-01-00-08,"san francisco, california","has a kid, but doesn&rsquo;t want more",straight,,,f,cancer and it&rsquo;s fun to think about,,english,single
1,52438,19,athletic,anything,rarely,never,working on college/university,"outgoing, love sports, and full time student.","student, aiming for a biology major.","running, soccer, talking, joking around, liste...",,"the girl, sister carrie, always running, red h...",1. chips<br />\n2. phone<br />\n3. sports<br /...,my future,"relieved from school, playing soccer till nigh...",ask me,,,68,-1,,2012-05-29-17-42,"san francisco, california",,straight,,catholicism,m,,no,"english, spanish",single
2,26151,43,average,,socially,never,,"it's hard to know exactly what to write, but h...","working (i love my job), spending time with fa...","being honest, listening (for some reason compl...","my sarcastic wit, red hair and green eyes.","books: i love to read horror, thrillers, sci-f...","books, my mp3 player, gps (i have an amazing a...",if i could have one super power i would want t...,"hanging out with friends, watching scary movie...",i'm not.,you have a sense of humor and can just as easi...,white,64,-1,other,2012-06-10-14-37,"pinole, california",has a kid,straight,,christianity,f,,trying to quit,english,single
3,7733,29,fit,anything,socially,never,graduated from college/university,"still single, still picky<br />\n<br />\n<stro...",i strive to be some sort of errant intellectua...,finding the humor in anything.<br />\napplying...,my irreverence with respect to long held socia...,books: <i>a short history of nearly everything...,1. a breathable mixture of oxygen and nitrogen...,both the distant future of humanity and the ve...,developing talent in tranquility or character ...,i snowboard goofy and can't whistle,you are still reading this and have somehow re...,white,74,-1,science / tech / engineering,2012-06-29-01-09,"san francisco, california","doesn&rsquo;t have kids, but wants them",straight,likes dogs and likes cats,agnosticism and laughing about it,m,pisces and it&rsquo;s fun to think about,no,"english (fluently), spanish (poorly), c++ (flu...",single
4,52145,26,,mostly vegetarian,socially,,graduated from masters program,i am moving to san francisco in a few days and...,i just became the program director for a non-p...,,,some of my favorite books in the last year hav...,,,,,you think you can make me laugh :),,64,-1,education / academia,2012-06-26-23-30,"san francisco, california",,straight,,,f,sagittarius,no,"english, spanish (okay)",single


In [5]:
# Your code here!
# Target: 1 for male profiles, 0 otherwise.
y = (data.sex == "m").astype(int).values
# Start simple: height is the only feature.
X = data[['height']]

In [6]:
from sklearn import svm

In [7]:
model = svm.SVC(kernel="rbf")
model.fit(X,y)
# Accuracy on the same data used for fitting, so this is an optimistic estimate.
model.score(X,y)

0.83399999999999996

In [8]:
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import KFold
from sklearn.cross_validation import StratifiedShuffleSplit

In [9]:
np.mean(cross_val_score(model,X,y,cv=KFold(len(X),n_folds=10,shuffle=True)))

0.83299999999999996
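
The `sklearn.cross_validation` module used above was removed in scikit-learn 0.20. On a newer installation, roughly the same 10-fold check can be sketched with `sklearn.model_selection`, where the splitters take `n_splits` instead of the sample count and `n_folds`. The data below is synthetic (`X_demo`, `y_demo`, and the numbers behind them are invented stand-ins for height and sex), so the scores will not match the ones in this notebook.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Synthetic stand-in for the height feature; the numbers are invented.
rng = np.random.RandomState(0)
y_demo = rng.randint(0, 2, size=200)
X_demo = (65 + 5 * y_demo + rng.normal(0, 3, size=200)).reshape(-1, 1)

model_demo = svm.SVC(kernel="rbf")

# Plain 10-fold split with shuffling.
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model_demo, X_demo, y_demo, cv=kfold)

# StratifiedKFold also keeps the class balance roughly equal in each fold.
skfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
skfold_scores = cross_val_score(model_demo, X_demo, y_demo, cv=skfold)

print(kfold_scores.mean(), skfold_scores.mean())
```

Fixing `random_state` makes the shuffle reproducible, which helps when comparing scores between people.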

In [10]:
from sklearn import grid_search

In [11]:
C_range = np.logspace(-3,10,5)
gamma_range = np.logspace(-3,10,5)
param_grid = dict(C=C_range,gamma=gamma_range)

cv = StratifiedShuffleSplit(y, n_iter=5, test_size=0.2)
gridmodel = grid_search.GridSearchCV(svm.SVC(),param_grid=param_grid,cv=cv)

In [12]:
gridmodel.fit(X,y)

GridSearchCV(cv=StratifiedShuffleSplit(labels=[0 1 ..., 0 1], n_iter=5, test_size=0.2, random_state=None),
       error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'C': array([  1.00000e-03,   1.77828e+00,   3.16228e+03,   5.62341e+06,
         1.00000e+10]), 'gamma': array([  1.00000e-03,   1.77828e+00,   3.16228e+03,   5.62341e+06,
         1.00000e+10])},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

In [15]:
print(gridmodel.best_score_)
print(gridmodel.best_params_)

0.849
{'C': 10000000000.0, 'gamma': 10000000000.0}
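
The best parameters reported above sit at the upper edge of both search ranges, which usually suggests the grid is worth revisiting rather than trusting as-is. Below is a hedged sketch of the same search using the newer `sklearn.model_selection` API (the `grid_search` module was removed in scikit-learn 0.20). The narrower ranges, the `StandardScaler` step, and the synthetic `X_demo`/`y_demo` data are all assumptions made for illustration, not part of the original exercise.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the height feature; the numbers are invented.
rng = np.random.RandomState(0)
y_demo = rng.randint(0, 2, size=200)
X_demo = (65 + 5 * y_demo + rng.normal(0, 3, size=200)).reshape(-1, 1)

# RBF SVMs are sensitive to feature scale, so standardize inside the pipeline;
# this way the scaler is refit on the training part of every split.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Assumed, more moderate ranges; make_pipeline names the SVC step "svc".
param_grid = {
    "svc__C": np.logspace(-2, 2, 5),
    "svc__gamma": np.logspace(-3, 1, 5),
}

# Splitters now take n_splits and receive y at fit time, not in the constructor.
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
search = GridSearchCV(pipeline, param_grid=param_grid, cv=cv)
search.fit(X_demo, y_demo)

print(search.best_score_)
print(search.best_params_)
```

If the best values still land on a boundary, shifting the range in that direction and rerunning is a reasonable next step.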
