# Date-A-Scientist - Analyzing and Exploring Data from OKCupid
In this project, I will explore, analyze, and visualize data about OKCupid users seeking their perfect match, and finally use that data to make predictions.

# 1 - Import libraries, load data, and get a basic understanding of the table

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid')
sns.set(font_scale=1.25)
sns.set_palette('Accent')

%matplotlib inline

In [2]:
df = pd.read_csv('profiles.csv')

In [3]:
df.columns

Index(['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'essay0',
       'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
       'essay8', 'essay9', 'ethnicity', 'height', 'income', 'job',
       'last_online', 'location', 'offspring', 'orientation', 'pets',
       'religion', 'sex', 'sign', 'smokes', 'speaks', 'status'],
      dtype='object')

In [4]:
df.shape

(59946, 31)

In [5]:
df.job.value_counts()

other                                7589
student                              4882
science / tech / engineering         4848
computer / hardware / software       4709
artistic / musical / writer          4439
sales / marketing / biz dev          4391
medicine / health                    3680
education / academia                 3513
executive / management               2373
banking / financial / real estate    2266
entertainment / media                2250
law / legal services                 1381
hospitality / travel                 1364
construction / craftsmanship         1021
clerical / administrative             805
political / government                708
rather not say                        436
transportation                        366
unemployed                            273
retired                               250
military                              204
Name: job, dtype: int64

In [6]:
df.status.value_counts()

single            55697
seeing someone     2064
available          1865
married             310
unknown              10
Name: status, dtype: int64

# 2 - Find duplicate values and outliers

In [7]:
## To find outliers, I will use df.describe() to check whether any numerical values are implausible
df.describe()

| | age | height | income |
|---|---|---|---|
| count | 59946.0 | 59943.0 | 59946.0 |
| mean | 32.34029 | 68.295281 | 20033.222534 |
| std | 9.452779 | 3.994803 | 97346.192104 |
| min | 18.0 | 1.0 | -1.0 |
| 25% | 26.0 | 66.0 | -1.0 |
| 50% | 30.0 | 68.0 | -1.0 |
| 75% | 37.0 | 71.0 | -1.0 |
| max | 110.0 | 95.0 | 1000000.0 |
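
The income quartiles in this summary are all -1, which suggests that -1 is a placeholder for an undisclosed income rather than a real value. That reading is an assumption, since the notebook does not confirm it. Under that assumption, a minimal sketch of how the placeholder could be turned into a missing value before computing statistics, using a hypothetical toy table (`toy` and `income_clean` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the income column.
toy = pd.DataFrame({"income": [-1, -1, 20000, -1, 80000]})

# Assumption: -1 marks an undisclosed income, so treat it as missing.
toy["income_clean"] = toy["income"].replace(-1, np.nan)

# The raw mean is dragged down by the placeholders;
# the cleaned mean skips the NaN entries.
raw_mean = toy["income"].mean()
clean_mean = toy["income_clean"].mean()
print(raw_mean, clean_mean)
```

Keeping the original column and writing the cleaned values to a new one leaves the raw data available for comparison.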


In [8]:
## There is no way that someone on OKCupid is 110 years old. Let's take a closer look at every profile with an age above 80.

df[df['age'] > 80]

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
2512,110,,,,,,,,,,...,"daly city, california",,straight,,,f,,,english,single
25324,109,athletic,mostly other,,never,working on masters program,,,,nothing,...,"san francisco, california",might want kids,straight,,other and somewhat serious about it,m,aquarius but it doesn&rsquo;t matter,when drinking,english (okay),available


In [9]:
## I'm going to delete both of these rows, as they are examples of corrupt/bad data

df = df.drop([2512,25324])

In [10]:
## Let's see if it worked 

df[df['age'] > 80]

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status


In [11]:
## There is also some impossible information in the 'height' column. I'm going to drop any row where the height
## is below 3 feet (36 inches).

df[df['height'] < 36]

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
12193,30,,mostly vegetarian,socially,,graduated from space camp,"well, hello and thank you for stopping by my g...","i mostly try to be good at what i do, maintain...",folding laundry.,perhaps my eyes? i'm told they are a unique co...,...,"berkeley, california",doesn&rsquo;t have kids,straight,likes dogs and likes cats,judaism but not too serious about it,f,taurus but it doesn&rsquo;t matter,no,english,single
18832,39,athletic,,socially,never,graduated from masters program,home: i was a happy surprise for my parents. y...,working at a start-up in technology. never rea...,online dating. it's the dating in real life th...,that i am trying to notice what they are tryin...,...,"san francisco, california",,straight,likes dogs,christianity but not too serious about it,m,gemini but it doesn&rsquo;t matter,no,"english (fluently), chinese (okay), french (po...",single
23767,32,thin,mostly anything,desperately,,working on ph.d program,small teller of tall tales. nostalgiaphile. un...,entropologist/enigmatologist/muse/things that ...,the things i don't quit. but only because i te...,something about the eyes-to-face ratio. and th...,...,"san francisco, california",,straight,likes dogs and has cats,agnosticism,f,,no,"english (fluently), french (okay), german (poo...",single
37111,25,fit,mostly anything,socially,never,graduated from college/university,dating me is not for everyone. i will frequent...,"in preschool, i was the boy that drew transfor...",besides the obvious stated above:<br />\ncrack...,"the hair. and yes, they're deadly poisonous sp...",...,"san mateo, california",,straight,,,m,leo and it&rsquo;s fun to think about,when drinking,"english (fluently), chinese (fluently), japane...",single
45959,36,,,very often,never,graduated from college/university,i'm a transplant from southern california with...,"living each day as if it'd be my last, and man...","writing, rambling, fixing anything thats break...","my otherwise perfect hair, my half-sleeve tatt...",...,"oakland, california",,straight,likes dogs and has cats,judaism and laughing about it,m,leo and it&rsquo;s fun to think about,when drinking,"english (fluently), c++ (fluently), hebrew (po...",single
48280,29,,,socially,never,working on college/university,"small and agile, this nocturnal omnivore spend...",i'm hoping to write for a living someday... th...,"triathalons, physical activity in general, poe...",i can take many hits from a can of pepper-spra...,...,"oakland, california",,straight,likes dogs and likes cats,,m,capricorn and it&rsquo;s fun to think about,no,"english (fluently), chinese (fluently), spanis...",single
56287,25,,,often,often,working on space camp,i love not dancing and portobello mushrooms. a...,walking around town left handed.,being online and starting collections. i am re...,my long fingers and ability to be awkward.,...,"san francisco, california",,straight,has dogs and dislikes cats,other and laughing about it,f,cancer but it doesn&rsquo;t matter,yes,"english (fluently), spanish (poorly)",single


In [12]:
df = df.drop([12193,18832,23767,37111,45959,48280,56287])

In [13]:
df[df['height'] < 36] 

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
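
The age and height cleanup above drops rows by hard-coded index labels. A minimal sketch of the same idea expressed as a boolean mask, so the labels do not need to be copied by hand. The toy table and the names `toy`, `implausible`, and `cleaned` are hypothetical; the thresholds (age above 80, height below 36) are the ones used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the profiles table.
toy = pd.DataFrame({
    "age": [25, 110, 31, 109, 40],
    "height": [70.0, 65.0, 1.0, 68.0, np.nan],
})

# Flag implausible rows. Comparisons against NaN evaluate to False,
# so a row with a missing height is not flagged by the height check.
implausible = (toy["age"] > 80) | (toy["height"] < 36)

# Keep everything that was not flagged.
cleaned = toy[~implausible]
print(cleaned)
```

Negating the "bad row" mask, rather than filtering on `height >= 36`, keeps rows whose height is missing, which matches what dropping by label does.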


In [14]:
duplicated = df[df.duplicated()]

In [15]:
duplicated ## no duplicate values to remove

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
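
As a compact alternative to inspecting the filtered frame, the duplicates can also be counted directly. A small sketch on a hypothetical toy table (`toy`, `n_dupes`, and `deduped` are illustrative names); by default `duplicated()` marks only the later repeats of a row, not its first occurrence:

```python
import pandas as pd

# Hypothetical toy table with one fully repeated row.
toy = pd.DataFrame({"age": [30, 30, 41], "sex": ["m", "m", "f"]})

# Number of rows that are identical to an earlier row.
n_dupes = int(toy.duplicated().sum())

# Drop the repeats, keeping the first occurrence of each row.
deduped = toy.drop_duplicates()
print(n_dupes, len(deduped))
```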


# 3 - Create more numeric columns

In [16]:
drink_mapping = {"not at all": 0, "rarely": 1, "socially": 2, "often": 3, "very often": 4, "desperately": 5}

df["drinks_code"] = df.drinks.map(drink_mapping)

In [17]:
df.drugs.value_counts()

never        37719
sometimes     7732
often          409
Name: drugs, dtype: int64

In [18]:
drugs_mapping = {"never": 0, "sometimes": 1, "often": 2}

df["drugs_code"] = df.drugs.map(drugs_mapping)

In [19]:
veg_mapping = {"strictly vegan": 0, "vegan": 1, "mostly vegan": 2, "strictly vegetarian": 3, "vegetarian": 4, 
               "mostly vegetarian": 5, "mostly anything": 6, "anything": 6, "strictly anything": 6, 
               "mostly other": 6, "strictly other": 6, "other": 6, "mostly kosher": 5, "mostly halal": 5, 
               "strictly halal": 5, "strictly kosher": 5, "kosher": 5, "halal": 5}

df["veg_code"] = df.diet.map(veg_mapping)

In [20]:
ed_mapping = {"graduated from ph.d program": 0, "graduated from law school": 0, "working on ph.d program": 0,
              "graduated from med school": 0, "working on law school": 0, "working on med school": 0, 
              "ph.d program": 0, "law school": 0, "med school": 0, "graduated from masters program": 1, 
              "working on masters program": 1, "masters program": 1, "graduated from college/university": 2,
              "college/university": 2, "dropped out of masters program": 2, "dropped out of ph.d program": 2,
              "dropped out of law school": 2, "dropped out of med school": 2, "working on college/university": 3,
              "graduated from two-year college": 3, "working on two-year college": 3, "two-year college": 3,
              "graduated from high school": 4, "dropped out of college/university": 4, 
              "dropped out of two-year college": 4, "high school": 4, "graduated from space camp": 5, 
              "dropped out of space camp": 5, "working on space camp": 5, "dropped out of high school": 5,
              "working on high school": 5, "space camp": 5}

df["ed_code"] = df.education.map(ed_mapping)

In [21]:
smoke_mapping = {"no": 0, "trying to quit": 1, "sometimes": 2, "when drinking": 2, "yes": 3}

df["smoke_code"] = df.smokes.map(smoke_mapping)

In [22]:
sex_mapping = {"m": 0, "f": 1}

df["sex_code"] = df.sex.map(sex_mapping)
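
`Series.map` with a dictionary returns NaN for any value that is not one of the dictionary's keys, so a category missing from a mapping silently becomes a missing code. A minimal sketch of a coverage check, using a hypothetical sample in which "sometimes" is deliberately not a key of `drink_mapping` (`toy` and `unmapped` are illustrative names):

```python
import pandas as pd

# Hypothetical sample; "sometimes" is intentionally absent from the mapping.
toy = pd.DataFrame({"drinks": ["socially", "often", None, "sometimes", "rarely"]})

drink_mapping = {"not at all": 0, "rarely": 1, "socially": 2,
                 "often": 3, "very often": 4, "desperately": 5}

toy["drinks_code"] = toy["drinks"].map(drink_mapping)

# Values that occur in the column but have no key in the mapping.
unmapped = set(toy["drinks"].dropna().unique()) - set(drink_mapping)
print(unmapped)
```

An empty `unmapped` set means every NaN in the code column comes from a value that was already missing in the source column.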

# 4 - Save clean DF to CSV

In [23]:
df.to_csv('profiles_clean.csv')
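
By default `to_csv` also writes the row index as an unlabeled first column, which comes back as `Unnamed: 0` when the file is read again. Whether the index should be kept is a design choice this notebook does not state. A sketch of both options on a hypothetical toy frame, written to in-memory buffers instead of files:

```python
import io
import pandas as pd

# Hypothetical toy frame with gaps in the index, as left behind by dropped rows.
toy = pd.DataFrame({"age": [25, 31, 40]}, index=[0, 3, 7])

# Default: the index is written as an extra, unlabeled column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
no_index = io.StringIO()
toy.to_csv(no_index, index=False)
no_index.seek(0)
cols_no_index = list(pd.read_csv(no_index).columns)

print(cols_default, cols_no_index)
```

Keeping the index preserves the original row labels; dropping it gives a file whose columns match the DataFrame exactly.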