<h1>Exploring OkCupid Dataset</h1>

<h2>Packages Installed</h2>

In [1]:
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import helper as hp  # helper functions written for this project
from IPython.display import display
%matplotlib inline
pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

print('pandas version is {}.'.format(pd.__version__))
print('numpy version is {}.'.format(np.__version__))
print('scikit-learn version is {}.'.format(sklearn.__version__))
print('seaborn version is {}.'.format(sns.__version__))
print('matplotlib version is {}.'.format(matplotlib.__version__))

pandas version is 0.20.2.
numpy version is 1.13.1.
scikit-learn version is 0.18.2.
seaborn version is 0.7.1.
matplotlib version is 2.0.2.


<h2>Loading the Dataset</h2>

In [2]:
data = pd.read_csv("profiles.csv")
print("This set has {} data points and {} features.".format(*data.shape))

This set has 59946 data points and 31 features.


In [3]:
for column in data.columns:
    print(column, end=", ")

age, body_type, diet, drinks, drugs, education, essay0, essay1, essay2, essay3, essay4, essay5, essay6, essay7, essay8, essay9, ethnicity, height, income, job, last_online, location, offspring, orientation, pets, religion, sex, sign, smokes, speaks, status, 

<h2>Dividing the Population by Gender</h2>

In [4]:
males = data[data['sex'] == 'm'].copy()
females = data[data['sex'] == 'f'].copy()

In [5]:
print("This data set has {} males and {} females".format(males['sex'].count(), females['sex'].count()))
print("male:female ratio = {:.2f}".format(males['sex'].count() / females['sex'].count()))

This data set has 35829 males and 24117 females
male:female ratio = 1.49


<h2>Basic Summary (Both Populations)</h2>

In [6]:
data.describe()

statistic,age,height,income
count,59946.0,59943.0,59946.0
mean,32.34029,68.295281,20033.222534
std,9.452779,3.994803,97346.192104
min,18.0,1.0,-1.0
25%,26.0,66.0,-1.0
50%,30.0,68.0,-1.0
75%,37.0,71.0,-1.0
max,110.0,95.0,1000000.0
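
<p>The quartiles above show that income is -1 up to at least the 75th percentile, so the mean and standard deviation are dominated by what looks like a placeholder value. The following is a minimal sketch, assuming that -1 means "income not reported" (an assumption; nothing in this notebook confirms it), that masks those values before summarizing. The function name and the small frame are made up for illustration and are not rows from profiles.csv.</p>

```python
import pandas as pd

def mask_unreported_income(df, placeholder=-1):
    """Return a copy of df whose placeholder incomes are replaced by NaN.

    Assumption: `placeholder` marks an unreported income.
    """
    out = df.copy()
    # Series.mask sets entries to NaN wherever the condition is True.
    out["income"] = out["income"].mask(out["income"] == placeholder)
    return out

# Made-up example rows, for illustration only.
toy = pd.DataFrame({"income": [-1, -1, 20000, 80000, -1]})
cleaned = mask_unreported_income(toy)
reported_share = cleaned["income"].notna().mean()

print(cleaned["income"].describe())
print("share reporting an income: {:.2f}".format(reported_share))
```

<p>Running the same function on <code>data</code> before calling <code>describe()</code> would give income statistics for reporting users only, which is probably closer to what the summary is meant to show.</p>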


<h2>Basic Summary (Males)</h2>

In [7]:
males.describe()

statistic,age,height,income
count,35829.0,35827.0,35829.0
mean,32.018588,70.443492,25991.307656
std,9.032881,3.076521,109845.249284
min,18.0,1.0,-1.0
25%,26.0,68.0,-1.0
50%,30.0,70.0,-1.0
75%,36.0,72.0,-1.0
max,109.0,95.0,1000000.0


<h2>Basic Summary (Females)</h2>

In [8]:
females.describe()

statistic,age,height,income
count,24117.0,24116.0,24117.0
mean,32.81822,65.103873,11181.697392
std,10.025385,2.926502,74149.778856
min,18.0,4.0,-1.0
25%,26.0,63.0,-1.0
50%,30.0,65.0,-1.0
75%,37.0,67.0,-1.0
max,110.0,95.0,1000000.0


In [9]:
import scipy.stats as stats
print(stats.ttest_ind(males['age'], females['age'], nan_policy = 'omit'))
print(stats.ttest_ind(males['height'], females['height'], nan_policy = 'omit'))
print(stats.ttest_ind(males['income'], females['income'], nan_policy = 'omit'))

Ttest_indResult(statistic=-10.164814501817149, pvalue=2.9786406182534982e-24)
Ttest_indResult(statistic=212.47827019501977, pvalue=0.0)
Ttest_indResult(statistic=18.316049145440861, pvalue=9.8645248736508528e-75)


<h4>Comments on Basic Summaries by Sex</h4>
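<p>All three t-tests returned p-values far below 0.005, so we reject the hypothesis that men and women on OkCupid have the same mean age, height, or reported income. With roughly 60,000 profiles, though, even small gaps reach significance: the mean ages differ by less than one year (32.0 for men vs. 32.8 for women), whereas the mean heights differ by more than five inches (70.4 vs. 65.1). The income result should be read with caution, because at least 75% of the income values are -1, which appears to be a placeholder for unreported income. The middle 50% of men in this dataset are between 5'8" and 6'0" tall and between 26 and 36 years old; the middle 50% of women are between 5'3" and 5'7" tall and between 26 and 37 years old.</p>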
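
<p>Because p-values shrink as the sample grows, an effect size can help show how large each difference actually is. The sketch below computes Cohen's d from the summary statistics printed by the <code>describe()</code> calls above. Cohen's d is a technique added here for illustration, not part of the original analysis, and the function name is hypothetical.</p>

```python
import math

def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)

# Means, standard deviations, and counts copied from the describe() outputs
# above (males first, females second).
d_age = cohens_d(32.018588, 9.032881, 35829, 32.81822, 10.025385, 24117)
d_height = cohens_d(70.443492, 3.076521, 35827, 65.103873, 2.926502, 24116)

print("age:    d = {:.2f}".format(d_age))
print("height: d = {:.2f}".format(d_height))
```

<p>By the usual rule of thumb, an absolute d below 0.2 is a small effect and a d above 0.8 is a large one, so the age gap is small even though it is statistically significant, while the height gap is large.</p>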

In [10]:
#from sklearn.feature_extraction.text import CountVectorizer
#from sklearn.feature_extraction.text import TfidfTransformer
#from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
#vectorizer = TfidfVectorizer()
#experimenting = data[data['essay1'].notnull()]  # pandas Series has no .isNaN(); use .notnull()
#type(experimenting['essay1'][13])
#tfidf = vectorizer.fit_transform(experimenting['essay1'])  # fit on the non-null essays only

<h2>Body Types</h2>

In [12]:
hp.word_counter(data['body_type'])

a little extra : 2629 is 4.39% of the dataset.
average : 14652 is 24.44% of the dataset.
thin : 4711 is 7.86% of the dataset.
athletic : 11819 is 19.72% of the dataset.
fit : 12711 is 21.20% of the dataset.
nan : 5296 is 8.83% of the dataset.
skinny : 1777 is 2.96% of the dataset.
curvy : 3924 is 6.55% of the dataset.
full figured : 1009 is 1.68% of the dataset.
jacked : 421 is 0.70% of the dataset.
rather not say : 198 is 0.33% of the dataset.
used up : 355 is 0.59% of the dataset.
overweight : 444 is 0.74% of the dataset.
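
<p>The <code>helper</code> module is not included in this notebook, so the implementation of <code>hp.word_counter</code> is unknown. Judging only from the output format above, a hypothetical stand-in might look like the sketch below; the name <code>word_counter_sketch</code> and its return value are assumptions.</p>

```python
from collections import Counter

def word_counter_sketch(values):
    """Print how often each distinct value occurs and its share of all rows.

    Hypothetical stand-in for hp.word_counter. Missing values are counted
    under the label 'nan' because str(float('nan')) == 'nan'.
    """
    counts = Counter(str(v) for v in values)
    total = sum(counts.values())
    for label, count in counts.items():
        share = 100 * count / total
        print("{} : {} is {:.2f}% of the dataset.".format(label, count, share))
    return counts

# Works on any iterable, including a pandas Series such as data['body_type'].
counts = word_counter_sketch(["fit", "average", float("nan"), "fit"])
```

<p>Note that the percentages are shares of all rows, missing values included, which matters when interpreting the figures below.</p>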


<h3>Males</h3>

In [13]:
hp.word_counter(males['body_type'])

a little extra : 1808 is 5.05% of the dataset.
average : 9032 is 25.21% of the dataset.
thin : 2242 is 6.26% of the dataset.
athletic : 9510 is 26.54% of the dataset.
nan : 2593 is 7.24% of the dataset.
fit : 8280 is 23.11% of the dataset.
skinny : 1176 is 3.28% of the dataset.
jacked : 292 is 0.81% of the dataset.
curvy : 113 is 0.32% of the dataset.
used up : 253 is 0.71% of the dataset.
overweight : 299 is 0.83% of the dataset.
rather not say : 92 is 0.26% of the dataset.
full figured : 139 is 0.39% of the dataset.


<h3>Females</h3>

In [14]:
hp.word_counter(females['body_type'])

fit : 4431 is 18.37% of the dataset.
average : 5620 is 23.30% of the dataset.
nan : 2703 is 11.21% of the dataset.
skinny : 601 is 2.49% of the dataset.
thin : 2469 is 10.24% of the dataset.
athletic : 2309 is 9.57% of the dataset.
curvy : 3811 is 15.80% of the dataset.
a little extra : 821 is 3.40% of the dataset.
full figured : 870 is 3.61% of the dataset.
rather not say : 106 is 0.44% of the dataset.
jacked : 129 is 0.53% of the dataset.
overweight : 145 is 0.60% of the dataset.
used up : 102 is 0.42% of the dataset.


<h4>Comments on Body Types</h4>
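<p>Average, athletic, and fit people make up 65.4% of all profiles in the dataset, or 71.7% of the profiles that report a body type. Among men, athletic is the most common answer (26.54%); among women, curvy (15.80%) is far more common than athletic (9.57%).</p>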
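
<p>The share quoted above depends on whether profiles with a missing body type are kept in the denominator. A quick check of both versions, using only the counts printed by <code>hp.word_counter(data['body_type'])</code>:</p>

```python
# Counts copied from the hp.word_counter(data['body_type']) output above.
total_profiles = 59946
missing = 5296
top_three = {"average": 14652, "athletic": 11819, "fit": 12711}

top_three_total = sum(top_three.values())
share_of_all = 100 * top_three_total / total_profiles
share_of_reported = 100 * top_three_total / (total_profiles - missing)

print("{:.1f}% of all profiles".format(share_of_all))
print("{:.1f}% of profiles with a body type".format(share_of_reported))
```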

<h2>Education</h2>

In [15]:
hp.word_counter(data['education'])

working on college/university : 5712 is 9.53% of the dataset.
working on space camp : 445 is 0.74% of the dataset.
graduated from masters program : 8961 is 14.95% of the dataset.
graduated from college/university : 23959 is 39.97% of the dataset.
working on two-year college : 1074 is 1.79% of the dataset.
nan : 6628 is 11.06% of the dataset.
graduated from high school : 1428 is 2.38% of the dataset.
working on masters program : 1683 is 2.81% of the dataset.
graduated from space camp : 657 is 1.10% of the dataset.
college/university : 801 is 1.34% of the dataset.
dropped out of space camp : 523 is 0.87% of the dataset.
graduated from ph.d program : 1272 is 2.12% of the dataset.
graduated from law school : 1122 is 1.87% of the dataset.
working on ph.d program : 983 is 1.64% of the dataset.
two-year college : 222 is 0.37% of the dataset.
graduated from two-year college : 1531 is 2.55% of the dataset.
working on med school : 212 is 0.35% of the dataset.
dropped out of college/university : 

<h3>Males</h3>

In [16]:
hp.word_counter(males['education'])

working on college/university : 3317 is 9.26% of the dataset.
working on space camp : 308 is 0.86% of the dataset.
graduated from masters program : 4698 is 13.11% of the dataset.
graduated from college/university : 14116 is 39.40% of the dataset.
working on two-year college : 715 is 2.00% of the dataset.
nan : 4241 is 11.84% of the dataset.
working on masters program : 819 is 2.29% of the dataset.
graduated from space camp : 455 is 1.27% of the dataset.
college/university : 520 is 1.45% of the dataset.
graduated from ph.d program : 851 is 2.38% of the dataset.
working on ph.d program : 644 is 1.80% of the dataset.
two-year college : 153 is 0.43% of the dataset.
dropped out of space camp : 394 is 1.10% of the dataset.
working on med school : 112 is 0.31% of the dataset.
dropped out of college/university : 772 is 2.15% of the dataset.
graduated from two-year college : 987 is 2.75% of the dataset.
dropped out of high school : 81 is 0.23% of the dataset.
graduated from law school : 614 is 

<h3>Females</h3>

In [17]:
hp.word_counter(females['education'])

graduated from college/university : 9843 is 40.81% of the dataset.
graduated from high school : 408 is 1.69% of the dataset.
working on college/university : 2395 is 9.93% of the dataset.
nan : 2387 is 9.90% of the dataset.
graduated from masters program : 4263 is 17.68% of the dataset.
dropped out of space camp : 129 is 0.53% of the dataset.
working on masters program : 864 is 3.58% of the dataset.
graduated from law school : 508 is 2.11% of the dataset.
college/university : 281 is 1.17% of the dataset.
graduated from two-year college : 544 is 2.26% of the dataset.
working on space camp : 137 is 0.57% of the dataset.
two-year college : 69 is 0.29% of the dataset.
space camp : 19 is 0.08% of the dataset.
graduated from med school : 207 is 0.86% of the dataset.
working on ph.d program : 339 is 1.41% of the dataset.
masters program : 59 is 0.24% of the dataset.
graduated from ph.d program : 421 is 1.75% of the dataset.
dropped out of ph.d program : 40 is 0.17% of the dataset.
dropped out 

<h2>Diets</h2>

In [18]:
hp.word_counter(data['diet'])

strictly anything : 5113 is 8.53% of the dataset.
mostly other : 1007 is 1.68% of the dataset.
anything : 6183 is 10.31% of the dataset.
vegetarian : 667 is 1.11% of the dataset.
nan : 24395 is 40.69% of the dataset.
mostly anything : 16585 is 27.67% of the dataset.
mostly vegetarian : 3444 is 5.75% of the dataset.
strictly vegan : 228 is 0.38% of the dataset.
strictly vegetarian : 875 is 1.46% of the dataset.
mostly vegan : 338 is 0.56% of the dataset.
strictly other : 452 is 0.75% of the dataset.
mostly halal : 48 is 0.08% of the dataset.
other : 331 is 0.55% of the dataset.
vegan : 136 is 0.23% of the dataset.
mostly kosher : 86 is 0.14% of the dataset.
strictly halal : 18 is 0.03% of the dataset.
halal : 11 is 0.02% of the dataset.
strictly kosher : 18 is 0.03% of the dataset.
kosher : 11 is 0.02% of the dataset.


<h3>Males</h3>

In [19]:
hp.word_counter(males['diet'])

strictly anything : 3634 is 10.14% of the dataset.
mostly other : 582 is 1.62% of the dataset.
anything : 3766 is 10.51% of the dataset.
vegetarian : 289 is 0.81% of the dataset.
nan : 14482 is 40.42% of the dataset.
mostly anything : 10227 is 28.54% of the dataset.
mostly vegetarian : 1531 is 4.27% of the dataset.
strictly vegan : 121 is 0.34% of the dataset.
mostly vegan : 157 is 0.44% of the dataset.
strictly other : 274 is 0.76% of the dataset.
mostly halal : 36 is 0.10% of the dataset.
strictly vegetarian : 406 is 1.13% of the dataset.
vegan : 65 is 0.18% of the dataset.
other : 167 is 0.47% of the dataset.
mostly kosher : 50 is 0.14% of the dataset.
strictly halal : 16 is 0.04% of the dataset.
halal : 8 is 0.02% of the dataset.
strictly kosher : 13 is 0.04% of the dataset.
kosher : 5 is 0.01% of the dataset.


<h3>Females</h3>

In [20]:
hp.word_counter(females['diet'])

strictly anything : 1479 is 6.13% of the dataset.
mostly anything : 6358 is 26.36% of the dataset.
nan : 9913 is 41.10% of the dataset.
anything : 2417 is 10.02% of the dataset.
mostly vegetarian : 1913 is 7.93% of the dataset.
strictly vegetarian : 469 is 1.94% of the dataset.
vegetarian : 378 is 1.57% of the dataset.
other : 164 is 0.68% of the dataset.
mostly other : 425 is 1.76% of the dataset.
mostly kosher : 36 is 0.15% of the dataset.
strictly other : 178 is 0.74% of the dataset.
mostly vegan : 181 is 0.75% of the dataset.
strictly vegan : 107 is 0.44% of the dataset.
vegan : 71 is 0.29% of the dataset.
mostly halal : 12 is 0.05% of the dataset.
kosher : 6 is 0.02% of the dataset.
strictly halal : 2 is 0.01% of the dataset.
strictly kosher : 5 is 0.02% of the dataset.
halal : 3 is 0.01% of the dataset.


<h2>Drinks</h2>

In [21]:
hp.word_counter(data['drinks'])

socially : 41780 is 69.70% of the dataset.
often : 5164 is 8.61% of the dataset.
not at all : 3267 is 5.45% of the dataset.
rarely : 5957 is 9.94% of the dataset.
nan : 2985 is 4.98% of the dataset.
very often : 471 is 0.79% of the dataset.
desperately : 322 is 0.54% of the dataset.


<h3>Males</h3>

In [22]:
hp.word_counter(males['drinks'])

socially : 24557 is 68.54% of the dataset.
often : 3314 is 9.25% of the dataset.
not at all : 2034 is 5.68% of the dataset.
rarely : 3549 is 9.91% of the dataset.
nan : 1873 is 5.23% of the dataset.
very often : 294 is 0.82% of the dataset.
desperately : 208 is 0.58% of the dataset.


<h3>Females</h3>

In [23]:
hp.word_counter(females['drinks'])

socially : 17223 is 71.41% of the dataset.
often : 1850 is 7.67% of the dataset.
rarely : 2408 is 9.98% of the dataset.
nan : 1112 is 4.61% of the dataset.
not at all : 1233 is 5.11% of the dataset.
desperately : 114 is 0.47% of the dataset.
very often : 177 is 0.73% of the dataset.


<h2>Drugs</h2>

In [24]:
hp.word_counter(data['drugs'])

never : 37724 is 62.93% of the dataset.
sometimes : 7732 is 12.90% of the dataset.
nan : 14080 is 23.49% of the dataset.
often : 410 is 0.68% of the dataset.


<h3>Males</h3>

In [25]:
hp.word_counter(males['drugs'])

never : 21895 is 61.11% of the dataset.
sometimes : 5037 is 14.06% of the dataset.
nan : 8615 is 24.04% of the dataset.
often : 282 is 0.79% of the dataset.


<h3>Females</h3>

In [26]:
hp.word_counter(females['drugs'])

never : 15829 is 65.63% of the dataset.
nan : 5465 is 22.66% of the dataset.
sometimes : 2695 is 11.17% of the dataset.
often : 128 is 0.53% of the dataset.


<h2>Job</h2>

In [27]:
hp.word_counter(data['job'])

transportation : 366 is 0.61% of the dataset.
hospitality / travel : 1364 is 2.28% of the dataset.
nan : 8198 is 13.68% of the dataset.
student : 4882 is 8.14% of the dataset.
artistic / musical / writer : 4439 is 7.40% of the dataset.
computer / hardware / software : 4709 is 7.86% of the dataset.
banking / financial / real estate : 2266 is 3.78% of the dataset.
entertainment / media : 2250 is 3.75% of the dataset.
sales / marketing / biz dev : 4391 is 7.32% of the dataset.
other : 7589 is 12.66% of the dataset.
medicine / health : 3680 is 6.14% of the dataset.
science / tech / engineering : 4848 is 8.09% of the dataset.
executive / management : 2373 is 3.96% of the dataset.
education / academia : 3513 is 5.86% of the dataset.
clerical / administrative : 805 is 1.34% of the dataset.
construction / craftsmanship : 1021 is 1.70% of the dataset.
rather not say : 436 is 0.73% of the dataset.
political / government : 708 is 1.18% of the dataset.
law / legal services : 1381 is 2.30% of the d

<h3>Males</h3>

In [28]:
hp.word_counter(males['job'])

transportation : 308 is 0.86% of the dataset.
hospitality / travel : 794 is 2.22% of the dataset.
nan : 4542 is 12.68% of the dataset.
student : 2731 is 7.62% of the dataset.
artistic / musical / writer : 2557 is 7.14% of the dataset.
computer / hardware / software : 4068 is 11.35% of the dataset.
banking / financial / real estate : 1476 is 4.12% of the dataset.
entertainment / media : 1569 is 4.38% of the dataset.
medicine / health : 1493 is 4.17% of the dataset.
science / tech / engineering : 3832 is 10.70% of the dataset.
executive / management : 1618 is 4.52% of the dataset.
education / academia : 1358 is 3.79% of the dataset.
clerical / administrative : 233 is 0.65% of the dataset.
other : 3945 is 11.01% of the dataset.
construction / craftsmanship : 940 is 2.62% of the dataset.
rather not say : 265 is 0.74% of the dataset.
political / government : 398 is 1.11% of the dataset.
law / legal services : 741 is 2.07% of the dataset.
sales / marketing / biz dev : 2475 is 6.91% of the da

<h3>Females</h3>

In [29]:
hp.word_counter(females['job'])

nan : 3656 is 15.16% of the dataset.
artistic / musical / writer : 1882 is 7.80% of the dataset.
sales / marketing / biz dev : 1916 is 7.94% of the dataset.
other : 3644 is 15.11% of the dataset.
medicine / health : 2187 is 9.07% of the dataset.
banking / financial / real estate : 790 is 3.28% of the dataset.
student : 2151 is 8.92% of the dataset.
computer / hardware / software : 641 is 2.66% of the dataset.
hospitality / travel : 570 is 2.36% of the dataset.
entertainment / media : 681 is 2.82% of the dataset.
science / tech / engineering : 1016 is 4.21% of the dataset.
education / academia : 2155 is 8.94% of the dataset.
law / legal services : 640 is 2.65% of the dataset.
rather not say : 171 is 0.71% of the dataset.
executive / management : 755 is 3.13% of the dataset.
transportation : 58 is 0.24% of the dataset.
political / government : 310 is 1.29% of the dataset.
retired : 106 is 0.44% of the dataset.
clerical / administrative : 572 is 2.37% of the dataset.
construction / crafts

<h2>Offspring</h2>

In [30]:
hp.word_counter(data['offspring'])

doesn&rsquo;t have kids, but might want them : 3875 is 6.46% of the dataset.
nan : 35561 is 59.32% of the dataset.
doesn&rsquo;t want kids : 2927 is 4.88% of the dataset.
doesn&rsquo;t have kids, but wants them : 3565 is 5.95% of the dataset.
doesn&rsquo;t have kids : 7560 is 12.61% of the dataset.
wants kids : 225 is 0.38% of the dataset.
has a kid : 1881 is 3.14% of the dataset.
has kids : 1883 is 3.14% of the dataset.
doesn&rsquo;t have kids, and doesn&rsquo;t want any : 1132 is 1.89% of the dataset.
has kids, but doesn&rsquo;t want more : 442 is 0.74% of the dataset.
has a kid, but doesn&rsquo;t want more : 275 is 0.46% of the dataset.
has a kid, and wants more : 71 is 0.12% of the dataset.
has kids, and might want more : 115 is 0.19% of the dataset.
might want kids : 182 is 0.30% of the dataset.
has a kid, and might want more : 231 is 0.39% of the dataset.
has kids, and wants more : 21 is 0.04% of the dataset.
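
<p>The offspring labels above still carry HTML entities from the raw file (the apostrophe in "doesn't" is stored as an rsquo entity). A minimal sketch of decoding them with the standard library; the function name is hypothetical, and the notebook itself leaves the labels untouched.</p>

```python
import html

def clean_label(label):
    """Decode HTML entities such as &rsquo; into the characters they stand for."""
    return html.unescape(label)

raw = "doesn&rsquo;t have kids, but might want them"
print(clean_label(raw))
```

<p>On the full column this could be applied to the non-missing entries, for example with <code>data['offspring'].dropna().map(clean_label)</code>, since missing values are floats and cannot be unescaped.</p>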


<h3>Males</h3>

In [31]:
hp.word_counter(males['offspring'])

doesn&rsquo;t have kids, but might want them : 2369 is 6.61% of the dataset.
nan : 22336 is 62.34% of the dataset.
doesn&rsquo;t want kids : 1750 is 4.88% of the dataset.
doesn&rsquo;t have kids : 4435 is 12.38% of the dataset.
doesn&rsquo;t have kids, but wants them : 1725 is 4.81% of the dataset.
has a kid : 884 is 2.47% of the dataset.
has kids : 940 is 2.62% of the dataset.
doesn&rsquo;t have kids, and doesn&rsquo;t want any : 624 is 1.74% of the dataset.
has a kid, but doesn&rsquo;t want more : 97 is 0.27% of the dataset.
has a kid, and wants more : 45 is 0.13% of the dataset.
has kids, but doesn&rsquo;t want more : 187 is 0.52% of the dataset.
has kids, and might want more : 69 is 0.19% of the dataset.
has a kid, and might want more : 141 is 0.39% of the dataset.
might want kids : 120 is 0.33% of the dataset.
has kids, and wants more : 13 is 0.04% of the dataset.
wants kids : 94 is 0.26% of the dataset.


<h3>Females</h3>

In [32]:
hp.word_counter(females['offspring'])

nan : 13225 is 54.84% of the dataset.
doesn&rsquo;t have kids, but wants them : 1840 is 7.63% of the dataset.
doesn&rsquo;t have kids : 3125 is 12.96% of the dataset.
doesn&rsquo;t have kids, but might want them : 1506 is 6.24% of the dataset.
doesn&rsquo;t want kids : 1177 is 4.88% of the dataset.
wants kids : 131 is 0.54% of the dataset.
has kids : 943 is 3.91% of the dataset.
has a kid : 997 is 4.13% of the dataset.
doesn&rsquo;t have kids, and doesn&rsquo;t want any : 508 is 2.11% of the dataset.
has kids, but doesn&rsquo;t want more : 255 is 1.06% of the dataset.
has kids, and might want more : 46 is 0.19% of the dataset.
might want kids : 62 is 0.26% of the dataset.
has a kid, and might want more : 90 is 0.37% of the dataset.
has a kid, but doesn&rsquo;t want more : 178 is 0.74% of the dataset.
has a kid, and wants more : 26 is 0.11% of the dataset.
has kids, and wants more : 8 is 0.03% of the dataset.


<h2>Location</h2>

In [33]:
hp.word_counter(data['location'])

south san francisco, california : 416 is 0.69% of the dataset.
oakland, california : 7214 is 12.03% of the dataset.
san francisco, california : 31064 is 51.82% of the dataset.
berkeley, california : 4212 is 7.03% of the dataset.
belvedere tiburon, california : 57 is 0.10% of the dataset.
san mateo, california : 1331 is 2.22% of the dataset.
daly city, california : 681 is 1.14% of the dataset.
san leandro, california : 651 is 1.09% of the dataset.
atherton, california : 45 is 0.08% of the dataset.
san rafael, california : 755 is 1.26% of the dataset.
walnut creek, california : 644 is 1.07% of the dataset.
menlo park, california : 479 is 0.80% of the dataset.
belmont, california : 243 is 0.41% of the dataset.
san jose, california : 2 is 0.00% of the dataset.
palo alto, california : 1064 is 1.77% of the dataset.
emeryville, california : 738 is 1.23% of the dataset.
el granada, california : 27 is 0.05% of the dataset.
castro valley, california : 345 is 0.58% of the dataset.
fairfax, califo

<h3>Males</h3>

In [34]:
hp.word_counter(males['location'])

south san francisco, california : 294 is 0.82% of the dataset.
oakland, california : 3723 is 10.39% of the dataset.
san francisco, california : 18799 is 52.47% of the dataset.
berkeley, california : 2455 is 6.85% of the dataset.
san mateo, california : 852 is 2.38% of the dataset.
daly city, california : 491 is 1.37% of the dataset.
atherton, california : 25 is 0.07% of the dataset.
san rafael, california : 415 is 1.16% of the dataset.
menlo park, california : 280 is 0.78% of the dataset.
san jose, california : 2 is 0.01% of the dataset.
emeryville, california : 469 is 1.31% of the dataset.
palo alto, california : 695 is 1.94% of the dataset.
castro valley, california : 202 is 0.56% of the dataset.
fairfax, california : 64 is 0.18% of the dataset.
burlingame, california : 207 is 0.58% of the dataset.
martinez, california : 179 is 0.50% of the dataset.
pleasant hill, california : 210 is 0.59% of the dataset.
hayward, california : 498 is 1.39% of the dataset.
alameda, california : 543 is

<h3>Females</h3>

In [35]:
hp.word_counter(females['location'])

san francisco, california : 12265 is 50.86% of the dataset.
belvedere tiburon, california : 29 is 0.12% of the dataset.
san leandro, california : 250 is 1.04% of the dataset.
san rafael, california : 340 is 1.41% of the dataset.
walnut creek, california : 257 is 1.07% of the dataset.
belmont, california : 94 is 0.39% of the dataset.
oakland, california : 3491 is 14.48% of the dataset.
palo alto, california : 369 is 1.53% of the dataset.
el granada, california : 17 is 0.07% of the dataset.
mountain view, california : 106 is 0.44% of the dataset.
menlo park, california : 199 is 0.83% of the dataset.
berkeley, california : 1757 is 7.29% of the dataset.
mill valley, california : 176 is 0.73% of the dataset.
san mateo, california : 479 is 1.99% of the dataset.
richmond, california : 166 is 0.69% of the dataset.
redwood city, california : 259 is 1.07% of the dataset.
el cerrito, california : 121 is 0.50% of the dataset.
alameda, california : 367 is 1.52% of the dataset.
daly city, california

<h2>Orientation</h2>

In [36]:
hp.word_counter(data['orientation'])

straight : 51606 is 86.09% of the dataset.
bisexual : 2767 is 4.62% of the dataset.
gay : 5573 is 9.30% of the dataset.


<h3>Males</h3>

In [37]:
hp.word_counter(males['orientation'])

straight : 31073 is 86.73% of the dataset.
bisexual : 771 is 2.15% of the dataset.
gay : 3985 is 11.12% of the dataset.


<h3>Females</h3>

In [38]:
hp.word_counter(females['orientation'])

straight : 20533 is 85.14% of the dataset.
bisexual : 1996 is 8.28% of the dataset.
gay : 1588 is 6.58% of the dataset.
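
<p>The three cells above print the same breakdown once per population. As an alternative, the per-sex percentages can be put side by side in one table with <code>pd.crosstab</code>. The sketch below uses a made-up five-row frame rather than profiles.csv, and the function name is an assumption.</p>

```python
import pandas as pd

def share_by_sex(df, column):
    """Percentage of each value of `column` within each sex (rows sum to 100)."""
    # Label missing answers explicitly so they are not silently dropped.
    answers = df[column].fillna("nan")
    table = pd.crosstab(df["sex"], answers, normalize="index")
    return (100 * table).round(2)

# Made-up example rows, for illustration only.
toy = pd.DataFrame({
    "sex": ["m", "m", "m", "f", "f"],
    "orientation": ["straight", "gay", "straight", "bisexual", "straight"],
})
table = share_by_sex(toy, "orientation")
print(table)
```

<p>Calling <code>share_by_sex(data, 'orientation')</code> should give one row per sex, which makes differences such as the bisexual share (2.15% of men vs. 8.28% of women in the outputs above) easier to spot.</p>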


<h2>Pets</h2>

In [39]:
hp.word_counter(data['pets'])

likes dogs and likes cats : 14814 is 24.71% of the dataset.
has cats : 1406 is 2.35% of the dataset.
likes cats : 1063 is 1.77% of the dataset.
nan : 19921 is 33.23% of the dataset.
has dogs and likes cats : 2333 is 3.89% of the dataset.
likes dogs and has cats : 4313 is 7.19% of the dataset.
likes dogs and dislikes cats : 2029 is 3.38% of the dataset.
has dogs : 4134 is 6.90% of the dataset.
has dogs and dislikes cats : 552 is 0.92% of the dataset.
likes dogs : 7224 is 12.05% of the dataset.
has dogs and has cats : 1474 is 2.46% of the dataset.
dislikes dogs and has cats : 81 is 0.14% of the dataset.
dislikes dogs and dislikes cats : 196 is 0.33% of the dataset.
dislikes cats : 122 is 0.20% of the dataset.
dislikes dogs and likes cats : 240 is 0.40% of the dataset.
dislikes dogs : 44 is 0.07% of the dataset.


<h3>Males</h3>

In [40]:
hp.word_counter(males['pets'])

likes dogs and likes cats : 9550 is 26.65% of the dataset.
has cats : 611 is 1.71% of the dataset.
likes cats : 716 is 2.00% of the dataset.
nan : 13023 is 36.35% of the dataset.
has dogs : 2167 is 6.05% of the dataset.
likes dogs and dislikes cats : 1111 is 3.10% of the dataset.
has dogs and dislikes cats : 238 is 0.66% of the dataset.
likes dogs : 4524 is 12.63% of the dataset.
likes dogs and has cats : 1728 is 4.82% of the dataset.
has dogs and has cats : 577 is 1.61% of the dataset.
dislikes dogs and has cats : 49 is 0.14% of the dataset.
has dogs and likes cats : 1183 is 3.30% of the dataset.
dislikes dogs and dislikes cats : 101 is 0.28% of the dataset.
dislikes cats : 64 is 0.18% of the dataset.
dislikes dogs and likes cats : 165 is 0.46% of the dataset.
dislikes dogs : 22 is 0.06% of the dataset.


<h3>Females</h3>

In [41]:
hp.word_counter(females['pets'])

likes dogs and likes cats : 5264 is 21.83% of the dataset.
has dogs and likes cats : 1150 is 4.77% of the dataset.
likes dogs and has cats : 2585 is 10.72% of the dataset.
nan : 6898 is 28.60% of the dataset.
likes dogs and dislikes cats : 918 is 3.81% of the dataset.
has cats : 795 is 3.30% of the dataset.
likes dogs : 2700 is 11.20% of the dataset.
has dogs : 1967 is 8.16% of the dataset.
likes cats : 347 is 1.44% of the dataset.
has dogs and has cats : 897 is 3.72% of the dataset.
has dogs and dislikes cats : 314 is 1.30% of the dataset.
dislikes dogs and dislikes cats : 95 is 0.39% of the dataset.
dislikes cats : 58 is 0.24% of the dataset.
dislikes dogs and has cats : 32 is 0.13% of the dataset.
dislikes dogs and likes cats : 75 is 0.31% of the dataset.
dislikes dogs : 22 is 0.09% of the dataset.


<h2>Religion</h2>

In [42]:
hp.word_counter(data['religion'])

agnosticism and very serious about it : 314 is 0.52% of the dataset.
agnosticism but not too serious about it : 2636 is 4.40% of the dataset.
nan : 20226 is 33.74% of the dataset.
atheism : 2175 is 3.63% of the dataset.
christianity : 1957 is 3.26% of the dataset.
christianity but not too serious about it : 1952 is 3.26% of the dataset.
atheism and laughing about it : 2074 is 3.46% of the dataset.
christianity and very serious about it : 578 is 0.96% of the dataset.
other : 2691 is 4.49% of the dataset.
catholicism : 1064 is 1.77% of the dataset.
catholicism but not too serious about it : 2318 is 3.87% of the dataset.
catholicism and somewhat serious about it : 548 is 0.91% of the dataset.
agnosticism and somewhat serious about it : 642 is 1.07% of the dataset.
catholicism and laughing about it : 726 is 1.21% of the dataset.
agnosticism and laughing about it : 2496 is 4.16% of the dataset.
agnosticism : 2724 is 4.54% of the dataset.
atheism and somewhat serious about it : 848 is 1.41% 

<h3>Males</h3>

In [43]:
hp.word_counter(males['religion'])

agnosticism and very serious about it : 216 is 0.60% of the dataset.
agnosticism but not too serious about it : 1668 is 4.66% of the dataset.
nan : 11980 is 33.44% of the dataset.
atheism : 1501 is 4.19% of the dataset.
atheism and laughing about it : 1541 is 4.30% of the dataset.
christianity and very serious about it : 254 is 0.71% of the dataset.
other : 1373 is 3.83% of the dataset.
christianity : 999 is 2.79% of the dataset.
catholicism but not too serious about it : 1307 is 3.65% of the dataset.
agnosticism and somewhat serious about it : 407 is 1.14% of the dataset.
catholicism and laughing about it : 426 is 1.19% of the dataset.
agnosticism and laughing about it : 1688 is 4.71% of the dataset.
atheism and somewhat serious about it : 624 is 1.74% of the dataset.
buddhism but not too serious about it : 405 is 1.13% of the dataset.
agnosticism : 1570 is 4.38% of the dataset.
other but not too serious about it : 940 is 2.62% of the dataset.
catholicism and somewhat serious about it

<h3>Females</h3>

In [44]:
hp.word_counter(females['religion'])

nan : 8246 is 34.19% of the dataset.
christianity : 958 is 3.97% of the dataset.
christianity but not too serious about it : 795 is 3.30% of the dataset.
catholicism : 458 is 1.90% of the dataset.
atheism and laughing about it : 533 is 2.21% of the dataset.
catholicism and somewhat serious about it : 273 is 1.13% of the dataset.
agnosticism : 1154 is 4.79% of the dataset.
other : 1318 is 5.47% of the dataset.
other but not too serious about it : 614 is 2.55% of the dataset.
other and laughing about it : 786 is 3.26% of the dataset.
other and somewhat serious about it : 429 is 1.78% of the dataset.
other and very serious about it : 226 is 0.94% of the dataset.
hinduism but not too serious about it : 67 is 0.28% of the dataset.
agnosticism and laughing about it : 808 is 3.35% of the dataset.
agnosticism but not too serious about it : 968 is 4.01% of the dataset.
atheism : 674 is 2.79% of the dataset.
judaism : 333 is 1.38% of the dataset.
buddhism : 198 is 0.82% of the dataset.
judaism b

<h2>Sign</h2>

In [45]:
hp.word_counter(data['sign'])

gemini : 1013 is 1.69% of the dataset.
cancer : 1092 is 1.82% of the dataset.
pisces but it doesn&rsquo;t matter : 1300 is 2.17% of the dataset.
pisces : 992 is 1.65% of the dataset.
aquarius : 954 is 1.59% of the dataset.
taurus : 1001 is 1.67% of the dataset.
virgo : 1029 is 1.72% of the dataset.
sagittarius : 937 is 1.56% of the dataset.
gemini but it doesn&rsquo;t matter : 1453 is 2.42% of the dataset.
cancer but it doesn&rsquo;t matter : 1454 is 2.43% of the dataset.
leo but it doesn&rsquo;t matter : 1457 is 2.43% of the dataset.
nan : 11056 is 18.44% of the dataset.
aquarius but it doesn&rsquo;t matter : 1408 is 2.35% of the dataset.
aries and it&rsquo;s fun to think about : 1573 is 2.62% of the dataset.
libra but it doesn&rsquo;t matter : 1408 is 2.35% of the dataset.
pisces and it&rsquo;s fun to think about : 1592 is 2.66% of the dataset.
libra : 1098 is 1.83% of the dataset.
taurus but it doesn&rsquo;t matter : 1450 is 2.42% of the dataset.
sagittarius but it doesn&rsquo;t mat

<h3>Males</h3>

In [46]:
hp.word_counter(males['sign'])

gemini : 627 is 1.75% of the dataset.
cancer : 651 is 1.82% of the dataset.
pisces but it doesn&rsquo;t matter : 890 is 2.48% of the dataset.
pisces : 573 is 1.60% of the dataset.
aquarius : 562 is 1.57% of the dataset.
taurus : 617 is 1.72% of the dataset.
cancer but it doesn&rsquo;t matter : 967 is 2.70% of the dataset.
leo but it doesn&rsquo;t matter : 985 is 2.75% of the dataset.
libra but it doesn&rsquo;t matter : 940 is 2.62% of the dataset.
pisces and it&rsquo;s fun to think about : 809 is 2.26% of the dataset.
sagittarius but it doesn&rsquo;t matter : 880 is 2.46% of the dataset.
aquarius but it doesn&rsquo;t matter : 940 is 2.62% of the dataset.
scorpio and it matters a lot : 36 is 0.10% of the dataset.
gemini and it&rsquo;s fun to think about : 922 is 2.57% of the dataset.
leo and it&rsquo;s fun to think about : 851 is 2.38% of the dataset.
nan : 7333 is 20.47% of the dataset.
cancer and it&rsquo;s fun to think about : 836 is 2.33% of the dataset.
libra and it&rsquo;s fun to 

<h3>Females</h3>

In [47]:
hp.word_counter(females['sign'])

virgo : 410 is 1.70% of the dataset.
sagittarius : 399 is 1.65% of the dataset.
gemini but it doesn&rsquo;t matter : 496 is 2.06% of the dataset.
nan : 3723 is 15.44% of the dataset.
taurus : 384 is 1.59% of the dataset.
aquarius but it doesn&rsquo;t matter : 468 is 1.94% of the dataset.
aries and it&rsquo;s fun to think about : 785 is 3.25% of the dataset.
libra : 434 is 1.80% of the dataset.
taurus but it doesn&rsquo;t matter : 483 is 2.00% of the dataset.
gemini : 386 is 1.60% of the dataset.
cancer : 441 is 1.83% of the dataset.
aquarius and it&rsquo;s fun to think about : 764 is 3.17% of the dataset.
gemini and it&rsquo;s fun to think about : 860 is 3.57% of the dataset.
pisces : 419 is 1.74% of the dataset.
libra and it&rsquo;s fun to think about : 787 is 3.26% of the dataset.
leo and it&rsquo;s fun to think about : 841 is 3.49% of the dataset.
cancer but it doesn&rsquo;t matter : 487 is 2.02% of the dataset.
pisces and it&rsquo;s fun to think about : 783 is 3.25% of the dataset.
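
<p>Each <code>sign</code> entry combines two pieces of information: the sign itself and an optional attitude toward astrology ("but it doesn't matter", "and it's fun to think about", "and it matters a lot"). That is why a single sign shows up under several labels in the counts above. A minimal sketch of separating the two parts, assuming the sign is always the first word, as it is in every label shown here; the function name is hypothetical.</p>

```python
import html

def split_sign(value):
    """Split a `sign` entry into (sign, attitude); attitude is '' if absent."""
    text = html.unescape(value)  # decode entities such as &rsquo;
    sign, _, attitude = text.partition(" ")
    return sign, attitude

print(split_sign("virgo"))
print(split_sign("scorpio and it matters a lot"))
```

<p>Missing values would have to be dropped first (for example with <code>data['sign'].dropna()</code>), because they are floats rather than strings. Counting only the first element of each tuple would then give one total per sign.</p>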

<h2>Smokes</h2>

In [48]:
hp.word_counter(data['smokes'])

sometimes : 3787 is 6.32% of the dataset.
no : 43896 is 73.23% of the dataset.
nan : 5512 is 9.19% of the dataset.
when drinking : 3040 is 5.07% of the dataset.
yes : 2231 is 3.72% of the dataset.
trying to quit : 1480 is 2.47% of the dataset.


<h3>Males</h3>

In [49]:
hp.word_counter(males['smokes'])

sometimes : 2421 is 6.76% of the dataset.
no : 25635 is 71.55% of the dataset.
yes : 1435 is 4.01% of the dataset.
nan : 3460 is 9.66% of the dataset.
trying to quit : 1000 is 2.79% of the dataset.
when drinking : 1878 is 5.24% of the dataset.


<h3>Females</h3>

In [50]:
hp.word_counter(females['smokes'])

nan : 2052 is 8.51% of the dataset.
no : 18261 is 75.72% of the dataset.
when drinking : 1162 is 4.82% of the dataset.
trying to quit : 480 is 1.99% of the dataset.
sometimes : 1366 is 5.66% of the dataset.
yes : 796 is 3.30% of the dataset.


<h2>Status</h2>

In [51]:
hp.word_counter(data['status'])

single : 55697 is 92.91% of the dataset.
available : 1865 is 3.11% of the dataset.
seeing someone : 2064 is 3.44% of the dataset.
married : 310 is 0.52% of the dataset.
unknown : 10 is 0.02% of the dataset.


<h3>Males</h3>

In [52]:
hp.word_counter(males['status'])

single : 33378 is 93.16% of the dataset.
available : 1209 is 3.37% of the dataset.
seeing someone : 1061 is 2.96% of the dataset.
married : 175 is 0.49% of the dataset.
unknown : 6 is 0.02% of the dataset.


<h3>Females</h3>

In [53]:
hp.word_counter(females['status'])

single : 22319 is 92.54% of the dataset.
available : 656 is 2.72% of the dataset.
seeing someone : 1003 is 4.16% of the dataset.
married : 135 is 0.56% of the dataset.
unknown : 4 is 0.02% of the dataset.
