# What's in an online dating profile? 

People say a lot about themselves in online dating profiles, especially on sites like OkCupid (OKC) that encourage people to answer questions. Thus, we can learn a lot about people by studying what they write. OKC has made some of its profile data from San Francisco public. We will use that data in this lab to explore different cultural questions. 

Our first question is whether and how men and women talk about themselves differently in their profiles. Popular culture is constantly telling us that men and women have different interests, hobbies, and relationship goals. Yet there are also many examples of women who like stereotypically masculine things and men who like feminine ones. This is especially interesting in online dating, because people are seeking partners with similar interests and relationship goals. Finding a partner would be hard for straight men and women if these two groups had very different interests. 

OKC shared 59,946 profiles though -- way too many to read! Computers can read them all and tell us how common different words are. So our first approach will be simple. We can ask:
1. Which words are used the most by men and women? 
2. Which words are used often by men but not women, and vice versa? 

At the end of the lab, you'll be able to ask these questions about groups defined by other traits too (like sexual orientation, race/ethnicity, age, level of education, even whether someone likes dogs or cats).

@Author: [Jeff Lockhart](http://www-personal.umich.edu/~jwlock/)

# 0. Setup
### Import the packages we'll use.
- Packages contain a bunch of useful code others have written to make our jobs easier.
- `%matplotlib inline` lets us see charts and plots right here in the notebook!

In [None]:
import re
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize 
from nltk.stem.snowball import SnowballStemmer
from bs4 import BeautifulSoup
from collections import Counter

%matplotlib inline

### Read our data.
**Before you run this code:** make sure that you have downloaded the data.
- Go to [https://github.com/rudeboybert/JSE_OkCupid](https://github.com/rudeboybert/JSE_OkCupid). 
    - If you're new to github, the easiest way is to right-click each file and "save link as." 
    - You can also clone or download the whole repository. 
- Download the `okcupid_codebook.txt` and `profiles.csv.zip` files and save them in the `data` directory (folder).
- Unzip the profiles file in the same place.

In [None]:
profiles = pd.read_csv('data/profiles.csv')

In [None]:
#Show how many rows and columns the data has
profiles.shape

In [None]:
#show the names of the columns
profiles.columns

In [None]:
#show the first few rows of data
profiles.head()

### First, a little housekeeping...
You don't need to worry about the code here right now. Just run it and continue reading below. Expand for more details on how it works.

- The OKC data has 10 different columns with profile text, one for each long-answer question in users' profiles. We want to look at all of the profile text, so this merges it all together in a new column called `text`.
- This code also simplifies the categories people pick for other things like level of education, the pets they have, etc.
- It removes people under 18 and people 60 or older.
- It saves this cleaner version of the data so we can use it later.

### The code
This cell tells python a bunch of information about our data.

In [None]:
essay_cols = ['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 
              'essay7', 'essay8', 'essay9']

ed_levels = {'<HS': ['dropped out of high school', 'working on high school'],
             'HS': ['graduated from high school', 'working on college/university', 
                    'two-year college', 'dropped out of college/university', 
                    'high school'], 
             'BA': ['graduated from college/university', 
                    'working on masters program', 'working on ph.d program', 
                    'college/university', 'working on law school', 
                    'dropped out of masters program', 
                    'dropped out of ph.d program', 'dropped out of law school', 
                    'dropped out of med school'],
             'Grad_Pro': ['graduated from masters program',
                          'graduated from ph.d program',                           
                          'graduated from law school', 
                          'graduated from med school', 'masters program', 
                          'ph.d program', 'law school', 'med school']
            }

bodies = {'average': ['average'], 
          'fit': ['fit', 'athletic', 'jacked'], 
          'thin': ['thin', 'skinny'], 
          'overweight': ['curvy', 'a little extra', 'full figured', 'overweight']
         }

smoke = {'no': ['no'], np.nan: ['nan']}

kids = {'yes': ['has a kid', 'has kids']}

has_pets = {'yes': ['has']}

ethn = {'White': ['white', 'middle eastern', 'middle eastern, white'], 
        'Asian': ['asian', 'indian', 'asian, pacific islander'], 
        'Black': ['black']
       }   

ethn2 = {'Latinx': ['latin'], 'multiple': [','], np.nan: ['nan']}   

drinks = {'no': ['rarely', 'not at all']}

drugs = {'no': ['never']}

jobs = {'education': ['student', 'education'], 
        'STEM': ['science', 'computer'], 
        'business': ['sales', 'executive', 'banking'], 
        'creative': ['artistic', 'entertainment'], 
        'med_law': ['medicine', 'law'],
        np.nan: ['nan']
       }

religion = {'none': ['agnosticism', 'atheism'],
            'catholicism': ['catholicism'],
            'christianity': ['christianity'],
            'judaism': ['judaism'],
            'buddhism': ['buddhism'],
            np.nan: ['nan']
           }

languages = {'multiple': [',']}

This cell defines some functions we'll use to clean up the data.

In [None]:
def concat(row, cols):
    tmp = []
    for c in cols:
        tmp.append(str(row[c]))
    new = '\n'.join(tmp)
    return new

def recode(text, dictionary, default=np.nan):
    '''Function for recoding categories in a column based on exact matches'''
    out = default
    text = str(text)
    
    for x in dictionary.keys():
        for y in dictionary[x]:
            if y == text: #exact match
                out = x
                return out
    return out

def recode_fuzzy(text, dictionary, default=np.nan):
    '''Function for recoding categories in a column based on partial matches'''
    out = default
    text = str(text)
    
    for x in dictionary.keys():
        for y in dictionary[x]:
            if y in text: #partial match
                out = x
                return out
    return out


def which_pets(t, criterion='has'):
    '''Function for determining which pets someone has or likes'''
    d = False
    c = False
    t = str(t)
    p = 'neither'
    if t == 'nan':
        p = np.nan
    
    if 'has dogs' in t:
        d = True
    if 'has cats' in t:
        c = True
        
    if criterion == 'likes':
        if 'likes dogs' in t:
            if 'dislikes dogs' not in t:
                d = True
        if 'likes cats' in t:
            if 'dislikes cats' not in t:
                c = True
        
    if c and d:
        p = 'both'
    elif c:
        p = 'cats'
    elif d:
        p = 'dogs'
        
    return p

def census_2010_ethnicity(t):
    '''
    Recodes ethnicity into categories modeled on the choices offered by the 
    2010 US Census. It deviates from the census by creating an exclusive Latinx 
    category. Selecting just 'latin' and nothing else was the 3rd most frequent 
    ethnicity in this data. The decision to include people who identified as 
    'latin' and another race is based on research on Latinx people's experience 
    with the US Census, but like all racial and ethnic categorization systems, 
    it is flawed. 
    '''
    text = str(t)
    
    e = recode(text, ethn, default='other')
    if 'other' == e:
        e = recode_fuzzy(text, ethn2, default='other')
    
    return e

def height(inches):
    h = 'under_6'
    if inches >= 72:
        h = 'over_6'
    return h
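
To make the difference between the two matching strategies above concrete, here is a minimal sketch in plain Python. The helper names `recode_exact` and `recode_partial` and the sample answer are made up for illustration. They mirror the logic of `recode` and `recode_fuzzy` but use `None` rather than `np.nan`, so the snippet has no dependencies.

```python
# Hypothetical stand-ins for recode / recode_fuzzy, written without numpy.
def recode_exact(text, mapping, default=None):
    for label, values in mapping.items():
        if text in values:                   # the whole answer must equal a listed value
            return label
    return default

def recode_partial(text, mapping, default=None):
    for label, values in mapping.items():
        if any(v in text for v in values):   # a listed value appears somewhere in the answer
            return label
    return default

kids = {'yes': ['has a kid', 'has kids']}
answer = 'has kids, but does not want more'   # made-up answer string

exact = recode_exact(answer, kids, default='no')
partial = recode_partial(answer, kids, default='no')
print(exact, partial)   # no yes
```

Partial matching is why the `kids` and `pets_any` columns can be built from long category strings, while columns such as `edu` list every accepted value explicitly.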

#### Do the clean up
This cell calls the functions we created in the last cell, along with the information about our data from the cell before it, to actually clean our data.

In [None]:
#remove people 60+ and 17- (.copy() avoids pandas warnings when adding columns below)
profiles = profiles[(profiles.age < 60) & (profiles.age > 17)].copy()

#recode categorical columns into simpler categories
profiles['text'] = profiles.apply(concat, axis=1, cols=essay_cols)
profiles['edu'] = profiles.education.apply(recode, dictionary=ed_levels, 
                                            default='unknown')
profiles['kids'] = profiles.offspring.apply(recode_fuzzy, dictionary=kids, 
                                            default='no')
profiles['pets_likes'] = profiles.pets.apply(which_pets, criterion='likes')
profiles['pets_has'] = profiles.pets.apply(which_pets, criterion='has')
profiles['pets_any'] = profiles.pets.apply(recode_fuzzy, dictionary=has_pets, 
                                            default='no')
profiles['age_group'] = profiles.age.apply(lambda x: str(int(x/10)*10))
profiles['height_group'] = profiles.height.apply(height)
profiles['race_ethnicity'] = profiles.ethnicity.apply(census_2010_ethnicity)
profiles['smoker'] = profiles.smokes.apply(recode, dictionary=smoke, 
                                            default='yes')
profiles['body'] = profiles.body_type.apply(recode, dictionary=bodies, 
                                            default='unknown')
profiles['alcohol_use'] = profiles.drinks.apply(recode, dictionary=drinks, 
                                            default='yes')
profiles['drug_use'] = profiles.drugs.apply(recode, dictionary=drugs, 
                                            default='yes')
profiles['industry'] = profiles.job.apply(recode_fuzzy, dictionary=jobs, 
                                            default='other')
profiles['religion'] = profiles.religion.apply(recode_fuzzy, dictionary=religion, 
                                            default='other')
profiles['languages'] = profiles.speaks.apply(recode_fuzzy, dictionary=languages, 
                                            default='English_only')

# keep just these columns
profiles = profiles[['age_group', 'age', 'body', 'alcohol_use', 'drug_use', 'edu', 
                     'race_ethnicity', 'height_group', 'industry', 'kids', 
                     'orientation', 'pets_likes', 'pets_has', 'pets_any', 
                     'religion', 'sex', 'smoker', 'languages', 'text', 'essay0', 
                     'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 
                     'essay7', 'essay8', 'essay9']]

profiles.head()
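
The `age_group` recode above rounds each age down to the start of its decade, which is why 18- and 19-year-olds end up in a group labeled `'10'`. A small sketch of the same arithmetic (the function name `age_group` and the sample ages are only for illustration):

```python
def age_group(age):
    # Same expression as the lambda in the clean-up cell:
    # divide by 10, drop the fraction, multiply back, and store as text.
    return str(int(age / 10) * 10)

groups = {age: age_group(age) for age in (18, 19, 20, 34, 59)}
print(groups)   # {18: '10', 19: '10', 20: '20', 34: '30', 59: '50'}
```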

#### Save the results
This cell saves the cleaned up data to a file so we can use it again later.

In [None]:
profiles.to_csv('data/clean_profiles.tsv', sep='\t', index=False)

### For laptop or personal computer users
Run this code so that you're working with a smaller amount of data and don't crash your computer. It takes a simple random sample of the data.

In [None]:
profiles = profiles.sample(20000)
profiles = profiles.reset_index(drop=True)

### While we're at it, let's make some helper functions for later.
Run this code, but don't worry about these now.

In [None]:
def extract_example(text, word, context=False):
    #regex for selecting the whole word from a stem
    expr = word + r'\w*'
    
    if context:
        #regex for selecting a stem and also the 2 words before and after it
        #this lets us see the context in which it is used
        expr = r'\w*\W*\w*\W*' + word + r'\w*\W*\w*\W*\w*'

    return re.search(expr, text, re.I).group()

def get_examples(data, word, n=5, context=True, limit_col=None, limit_val=None):
    if word.endswith('i'):
        #the Porter2 stemmer sometimes ends stems with 'i' (e.g. 'happi'). This trims it off.
        word = word[:-1]
    
    #restrict to just some group of interest
    if limit_col is not None:
        data = data[data[limit_col] == limit_val]
    
    #sample our data so this operation goes faster
    if data.shape[0] > 1000:
        data = data.sample(1000)
    
    #find profiles with the word in them
    tmp = data.text.apply(lambda x: word in x)
    #select n random profiles that have the word
    count = tmp.sum()
    
    #if we wanted more examples than there are
    if n > count:
        n = count
    tmp2 = data[tmp].text.sample(n).values
    
    #get an example out of each profile we selected
    tmp = []
    for t in tmp2:
        tmp.append(extract_example(t, word, context))
    
    return tmp

def unstem(word, data, n=50):
    if word.endswith('i'):
        #the Porter2 stemmer sometimes ends stems with 'i' (e.g. 'happi'). This trims it off.
        word = word[:-1]

    #use the function we made before to get examples of the stem
    tmp = get_examples(data, word=word, n=n, context=False)
    
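    #if the sample contained no example of this stem, fall back to the stem itself
    if not tmp:
        return word
    
    #count up and return the most common form of the word matching the stem
    return Counter(tmp).most_common(1)[0][0]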

def clean_index(df, text):
    #replaces stems in the index of a dataframe with whole words
    df.reset_index(inplace=True)
    df['index'] = df['index'].apply(unstem, data=text)
    df.set_index('index', inplace=True)
    return df
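
The two regular expressions in `extract_example` are easier to follow on a single sentence. This sketch repeats the same patterns with the standard `re` module. The sentence and the stem `'lov'` are made-up inputs.

```python
import re

text = "On weekends I really love hiking and camping."   # made-up profile text
stem = 'lov'

# Stem followed by any remaining word characters: recovers the whole word.
whole_word = re.search(stem + r'\w*', text, re.I).group()

# Up to two words before the stem and two words after it.
pattern = r'\w*\W*\w*\W*' + stem + r'\w*\W*\w*\W*\w*'
in_context = re.search(pattern, text, re.I).group()

print(whole_word)   # love
print(in_context)   # I really love hiking and
```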

# 1. Tokenizing text

#### Let's peek at an example of the text so we know what we're working with.
This code shows us the text for the 7th profile (Python counts from 0, so the first profile is #0, the second is #1, and so on). The 6 here could be any number. Try changing it to see.

In [None]:
profiles.text[6]

### We want to split the text into words so we can count them. Here's a simple first try.
- The `split()` function, like its name suggests, splits text into chunks. If we split on spaces (the default), it will split the text into words. Let's `apply` it to the `text` of our `profiles`.
- Notice that this is a little messy. The punctuation and some HTML things are mixed in with our words.

In [None]:
tmp = profiles['text'].apply(lambda x: x.split())
tmp.head()
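
To see why splitting on whitespace is messy, consider one short made-up snippet of profile text. The sketch below uses only built-in string methods and `collections.Counter`.

```python
from collections import Counter

sample = "I love dogs, cats & hiking!<br />\nDogs are great."   # made-up text

naive_tokens = sample.split()   # split on any run of whitespace
print(naive_tokens)

# Punctuation, HTML fragments, and capitalization all create spurious "words":
# 'dogs,' and 'Dogs' are counted separately, and '/>' is counted as a word.
counts = Counter(naive_tokens)
print(counts['dogs,'], counts['Dogs'], counts['dogs'])   # 1 1 0
```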

### A better way of getting words from text
### Steps
#### 1. Removing stop words
- "Stop words" are words that are common in a language but don't tell us much about what's happening, like "a," "of," or "and." It is common to remove them so we can focus on more meaningful words. [Learn more](https://en.wikipedia.org/wiki/Stop_words)
- Here we use the set of English stop words we imported above.
- This lab makes an exception to the normal list of stop words and keeps the pronouns because some research shows that pronoun use matters in dating. You could add more words to remove or keep, depending on what you think is important. 

In [None]:
import nltk
nltk.download('stopwords') #only the stop word lists are needed here

In [None]:
sw = set(stopwords.words('english'))

keep_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 
              'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 
              'himself', 'she', 'her', 'hers', 'herself', 'they', 'them', 'their',
              'theirs', 'themselves']

for k in keep_words:
    sw.discard(k) #could use remove if we wanted keyerrors
    
print("Here are the words we will remove:\n\n", sw)
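
The comment in the loop above mentions `remove` as an alternative to `discard`. The difference only shows up when a word is not in the set, as this small sketch with a made-up stop word set illustrates.

```python
stop_words = {'a', 'of', 'and', 'i', 'me'}   # tiny made-up stop word set

stop_words.discard('i')      # removes 'i'
stop_words.discard('we')     # 'we' is not in the set: silently does nothing

try:
    stop_words.remove('we')  # remove() raises KeyError for a missing item
    raised = False
except KeyError:
    raised = True

print(sorted(stop_words), raised)   # ['a', 'and', 'me', 'of'] True
```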

#### 2. Better tokenizing
- This code actually cleans the text. 
    - We use the BeautifulSoup library to remove all the HTML code from the text
    - We also remove some other non-word text like "www"
    - We convert all the text to lowercase, so that the computer sees "Dogs", "DOGS", and "dogs" as the same word.
    - We remove all the stop words.

In [None]:
def clean(text, sw):
    t = BeautifulSoup(text, 'lxml').get_text()
    
    bad_words = ['http', 'www', '\nnan']
    for b in bad_words:
        t = t.replace(b, '')
    
    t = t.lower()
    t = regexp_tokenize(t, r'\w+')
    
    final = []
    for w in t:
        if w not in sw:
            final.append(w)
    
    return final

profiles['tokens'] = profiles['text'].apply(clean, sw=sw)
profiles.tokens.head()
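
The cleaning steps listed above can be sketched on one made-up sentence using only the standard library. Two simplifications are assumptions of this sketch, not what `clean` does: a regular expression stands in for BeautifulSoup's HTML removal, and `re.findall` stands in for `regexp_tokenize`. The stop word set is also a tiny made-up one.

```python
import re

def clean_sketch(text, stop_words):
    text = re.sub(r'<[^>]+>', ' ', text)   # crude HTML tag removal (stand-in)
    text = text.lower()                    # 'Dogs' and 'DOGS' become 'dogs'
    tokens = re.findall(r'\w+', text)      # keep runs of letters, digits, and _
    return [w for w in tokens if w not in stop_words]

stop_words = {'and', 'the', 'a'}           # made-up; pronouns are kept
tokens = clean_sketch("I love Dogs, and DOGS love me!<br />", stop_words)
print(tokens)   # ['i', 'love', 'dogs', 'dogs', 'love', 'me']
```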

# 2. Comparing the words used by men and women
#### Step 1: We separate the profiles of women and men.
We'll limit it to straight people for now. You'll have the chance to explore other groups later in the lab.

In [None]:
men = profiles[(profiles['sex'] == 'm') & (profiles['orientation'] == 'straight')]
women = profiles[(profiles['sex'] == 'f') & (profiles['orientation'] == 'straight')]

men.tokens.head()

#### Step 2: Counting how often each gender uses each word

In [None]:
def flatten(series):
    l = []
    for x in series:
        l.extend(x) #each x is a list we want to unnest
    return l

#"tmp" is often used for temporary or intermediate data that we won't use for long.
tmp = flatten(men.tokens)

print('Ten most common words used by men:')
mens_words = Counter(tmp) #this counts how many times each word shows up
mens_words.most_common(10) #this shows us the 10 most common words

In [None]:
tmp = flatten(women.tokens) #repeat for women

print('Ten most common words used by women:')
womens_words = Counter(tmp)
womens_words.most_common(10)

You can see that the most popular words are basically the same for each gender.
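
Counting works the same way on a toy example. Here two made-up token lists play the role of two profiles, and `flatten` follows the same logic as the function above.

```python
from collections import Counter

def flatten(series):
    words = []
    for tokens in series:
        words.extend(tokens)   # unnest each profile's list of tokens
    return words

toy_profiles = [['i', 'love', 'dogs'], ['i', 'love', 'i']]   # made-up tokens
word_counts = Counter(flatten(toy_profiles))
top_two = word_counts.most_common(2)
print(top_two)   # [('i', 3), ('love', 2)]
```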

#### Step 3a: Put the word counts in a data frame so they're easier to work with

In [None]:
#combine the two sets of word counts into a single dataframe so they're easy to work with 
tmp = {'women': womens_words, 'men': mens_words}
popular_words = pd.DataFrame(tmp)

popular_words.head()

Right now, the words are sorted alphabetically. That's not super useful, though.

#### Step 3b: Sort the words by popularity

In [None]:
popular_words = popular_words.sort_values(by='women', ascending=False)
popular_words.head()

#### Step 4: Convert those word counts to frequencies (percent of total words)

In [None]:
#convert the word counts into percents (i.e. what percent of total words are x)
popular_words['men'] = (popular_words['men'] /  popular_words['men'].sum())*100
popular_words['women'] = (popular_words['women'] /  popular_words['women'].sum())*100

#create a column "max" that has the word's maximum popularity (in either men or women)
popular_words['max'] = popular_words.max(axis=1)

#show the most popular words overall
popular_words.sort_values(by='max', ascending=False, inplace=True)
popular_words.head(10).round(2)
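
The percentage step is plain arithmetic: each word's count divided by the group's total word count, times 100. A sketch with made-up counts, using dictionaries instead of a data frame:

```python
men_counts = {'love': 30, 'computer': 15, 'baking': 5}     # made-up counts
women_counts = {'love': 40, 'computer': 2, 'baking': 8}

def to_percent(counts):
    total = sum(counts.values())
    return {word: 100 * n / total for word, n in counts.items()}

men_pct = to_percent(men_counts)       # total is 50 words
women_pct = to_percent(women_counts)   # total is 50 words

# A word's "max" is its popularity in whichever group uses it more.
max_pct = {w: max(men_pct[w], women_pct[w]) for w in men_pct}
print(men_pct['computer'], women_pct['computer'], max_pct['love'])   # 30.0 4.0 80.0
```

Dividing by each group's own total matters because one group may have written more words overall, and raw counts would then favor that group for every word.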

#### Let's see some typical examples of how these words are used
- You can change the number `6` to show more or fewer examples.
- You can change the word `'love'` to any word you're interested in. 
    - "love" is interesting because it is not always used the way we might expect in a dating profile. 

In [None]:
get_examples(data=profiles, word='love', n=6)

#### Let's look just at examples of how men use the word 'love'
- You can change `limit_col` to something other than `sex` if you want to look at a different attribute.
- You can change `limit_val` to something other than `m` if you want to look at a different group within the attribute (e.g. change it to `f` if you want to see women's use).

In [None]:
get_examples(data=profiles, word='love', n=6, limit_col='sex', limit_val='m')

#### Most words are very uncommon
- The X axis in this histogram is the word popularity (percent of total words that are this word). 
- The Y axis is the number of words that have that level of popularity.

In [None]:
#show a histogram with 100 bins
popular_words['max'].plot.hist(bins=100, log=True)

#### Step 5: Look at just the 1000 most popular words
- Note that the shape of the distribution looks similar, but the Y axis is much smaller ($ 10^3 $ instead of $ 10^5 $), meaning we have removed many extremely uncommon words.

In [None]:
#select only the 1000 most popular words
popular_words = popular_words.sort_values(by='max', ascending=False).head(1000)

#show the histogram again
popular_words['max'].plot.hist(bins=100, log=True)

### Step 6: Figure out which words are more popular with one gender than the other
- Here we calculate how many times more often one group uses a word than the other. If men use a word twice as often as women do, the value is 2. If women use it twice as often as men do, the value is -2. 
- Like we saw before, both groups use the most popular words about the same amount.

In [None]:
def times_diff(row):
    #calculate how many times more men use a word than women
    #or vice versa if women use the word more
    if row.men > row.women:
        return row.men / row.women
    else:
        return -1 * (row.women / row.men)
    
popular_words['times_diff'] = popular_words.apply(times_diff, axis=1)
popular_words = popular_words.sort_values(by='max', ascending=False)

print('Most popular words:')
popular_words.head(10).round(3)
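
The sign convention of `times_diff` is easiest to see with a few made-up percentages. The function below repeats the same logic for two plain numbers instead of a data frame row.

```python
def times_diff(men, women):
    # Positive: men use the word that many times more often than women.
    # Negative: women use it that many times more often than men.
    if men > women:
        return men / women
    return -1 * (women / men)

print(times_diff(0.5, 0.25))    # 2.0  (men use it twice as often)
print(times_diff(0.125, 0.5))   # -4.0 (women use it four times as often)
print(times_diff(0.3, 0.3))     # -1.0 (equal use falls into the else branch)
```

Values never fall between -1 and 1, so a word used equally by both groups gets -1.0 rather than 0.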

#### Let's look at the words that are most different between them.

In [None]:
print('Words men use more than women:')
popular_words.sort_values(by='times_diff', ascending=False).head(15).round(3)

In [None]:
print('Words women use more than men:')
popular_words.sort_values(by='times_diff', ascending=True).head(15).round(3)

# 3. Getting cleaner results

#### Stemming lets us count similar words like "dog" and "dogs" or "run" and "running" as the same word.
- Stemming grabs just the "stem" of each word (e.g. the stem of both "runs" and "running" is "run"). When the words are converted to their stems, the computer sees them as the same. [Learn more](https://en.wikipedia.org/wiki/Stemming)
- Stemming can be a little slow.

In [None]:
#snowball English (aka porter2) is the best general stemmer
stemmer = SnowballStemmer("english") 

def stem(t):
    out = []
    for w in t:
        out.append(stemmer.stem(w))
    return out

print("Stemming words from profile text...")
profiles['stems'] = profiles['tokens'].apply(stem)
profiles.stems.head()
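
A few individual words show what the stemmer does. This sketch assumes `nltk` is installed, as it is for the rest of the lab. The word list is made up.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

words = ['run', 'runs', 'running', 'dog', 'dogs', 'happy']   # made-up examples
stems = {w: stemmer.stem(w) for w in words}
print(stems)
```

The three forms of "run" share one stem, as do "dog" and "dogs". The stem of "happy" is "happi", which is not a real word. This is why the helper functions above trim a trailing "i" before searching the profile text.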

#### These functions let us do the same things we did before without rewriting all the steps each time.
You don't have to worry about what's in them right now. Just run the cell and scroll down.

In [None]:
# functions for summarizing word use by a trait
def times_diff2(row, group, ref):
    if row[ref] > row[group]:
        return -1 * (row[ref] / row[group])
    else:
        return row[group] / row[ref]

#normally we wouldn't paste this function here because we already have it above
#but it helps to show in the lab
def flatten(series):
    l = []
    for x in series:
        l.extend(x) #each x is a list we want to unnest
    return l
    
def flatten2(series):
    l = []
    for x in series:
        tmp = set(x) #make the tokens into a set, thus dropping repeats
        tmp = list(tmp) #turn it back into a list we can attach to the other lists
        l.extend(tmp) 
    return l


def count(data, per_person):
    #count the people in each category
    l = len(data)

    #apply the right aggregation function, depending whether we want 
    #most common words, or words used by most people
    if per_person:
        data = flatten2(data)
    else:
        data = flatten(data)
            
    c = Counter(data)
    
    return c, l

def word_use(df, att, ref=None, per_person=False, undostems=False):
    #list all of the categories in this column
    types = list(df[att].value_counts().index.values)
    #variables that will store our results
    data = {}
    lens = {}
    
    print("Counting the words used by each group...")
    for t in types:
        #get the stems for each category
        tmp = df[df[att] == t].stems
        #count how often each is used
        data[t], lens[t] = count(tmp, per_person)
        
        #also compute the inverse of each category
        tmp = df[df[att] != t].stems
        data['not_'+str(t)], lens['not_'+str(t)] = count(tmp, per_person)        
        
    #convert those results to a pandas data frame for easy handling
    popular_words = pd.DataFrame(data)
    
    print('Calculating percentages...')
    # convert the counts in each column to percents
    for t in popular_words.columns:
        n = lens[t] #if we want percent of people
        
        if not per_person: #if we want percent of total words 
            n = popular_words[t].sum()
        
        popular_words[t] = (popular_words[t] / n) * 100
    
    print('Selecting the most popular words...')
    #find overall most popular words
    popular_words['max'] = popular_words.max(axis=1)
    #sort the words and select the top 1000 most popular
    popular_words = popular_words.sort_values(by='max', ascending=False)
    popular_words = popular_words.head(1000)

    print('Calculating most distinctive words...')
    #calculate the rate each type of person uses these words relative to others
    for t in types:
        r = ref
        
        if ref is None: #if we do not have a reference category, use the inverse
            r = 'not_'+str(t)
            
        if t != ref: #don't compare a trait to itself
            #apply our times_diff2 function
            popular_words['times_diff_'+str(t)] = popular_words.apply(times_diff2, 
                                                                 group=t, 
                                                                 ref=r, 
                                                                 axis=1)

    #remove the inverse columns we created
    popular_words = popular_words.drop(columns=popular_words.filter(regex='not_').columns)
    
    if undostems:
        print('Cleaning up word stems for readability...')
        popular_words = clean_index(popular_words, df)
    
    print('Done!')
    return popular_words

#### Let's try comparing men's and women's words again with stems this time
- The top words are somewhat different now that we're counting similar words as the same.
- We see word stems rather than whole words listed.

In [None]:
popular_words = word_use(profiles, att='sex')
popular_words = popular_words.sort_values(by='times_diff_m', ascending=False)
print("Men's words:")
popular_words.head(10).round(2)

#### Those word stems in our table are a little hard to read. Let's change that.
- The `undostems=True` option converts the stems back to whole words before showing us the result.

In [None]:
popular_words = word_use(profiles, att='sex', undostems=True)
popular_words = popular_words.sort_values(by='times_diff_m', ascending=False)
print("Men's distinctive words:")
popular_words.head(10).round(2)

In [None]:
popular_words = popular_words.sort_values(by='times_diff_f', ascending=False)
print("Women's distinctive words:")
popular_words.head(10).round(2)

#### But, wait! Not all profiles have the same number of words. 
- What if a single man just wrote "computer" a thousand times and that is skewing our results?
- With `per_person=True` we can see which words are used by the most different people, rather than which words are most common out of all the words used.

In [None]:
popular_words = word_use(profiles, att='sex', per_person=True, undostems=True)
print("Men's words:")
popular_words.sort_values(by='times_diff_m', ascending=False).head(10).round(2)

In [None]:
print("Women's words:")
popular_words.sort_values(by='times_diff_f', ascending=False).head(10).round(2)
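
The difference between the two ways of counting can be sketched with three made-up profiles, one of which repeats a single word. `set(tokens)` drops repeats within a profile, as `flatten2` does.

```python
from collections import Counter

toy_profiles = [['computer'] * 5, ['dog'], ['dog', 'cat']]   # made-up tokens

# Share of all words: every occurrence counts.
all_words = Counter(w for tokens in toy_profiles for w in tokens)
total_words = sum(all_words.values())                     # 8 words in total
word_share = {w: 100 * n / total_words for w, n in all_words.items()}

# Share of people: each profile counts a word at most once.
people = Counter(w for tokens in toy_profiles for w in set(tokens))
person_share = {w: 100 * n / len(toy_profiles) for w, n in people.items()}

print(word_share['computer'], round(person_share['computer'], 1))   # 62.5 33.3
print(word_share['dog'], round(person_share['dog'], 1))             # 25.0 66.7
```

One person repeating "computer" makes it the most common word overall, but counted per person "dog" is used by more people.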

# 4. Your turn to try it with another trait

### Options (traits)
We have a lot more information about people than just whether they're men or women. Try the analysis again with one of these other traits. (Expand for a list.)

- age_group (How old someone is. Youngest users are 18.)
    - categories: ['10', '20', '30', '40', '50']
- body (self-described)
    - categories: ['average', 'fit', 'thin', 'overweight', 'unknown']
- alcohol_use
    - categories: ['yes', 'no']
- drug_use
    - categories: ['yes', 'no']
- edu (highest degree completed)
    - categories: ['`<HS`', 'HS', 'BA', 'Grad_Pro', 'unknown'] 
- race_ethnicity
    - categories: ['Asian', 'Black', 'Latinx', 'White', 'multiple', 'other']
- height_group (whether someone is over or under six feet tall)
    - categories: ['under_6', 'over_6']
- industry (what field they work in)
    - categories: ['STEM', 'business', 'education', 'creative', 'med_law', 'other'] 
- kids (whether they have children)
    - categories: ['yes', 'no']
- orientation
    - categories: ['straight', 'gay', 'bisexual']
- pets_likes (what pets they like)
    - categories: ['both', 'dogs', 'cats', 'neither']
- pets_has (what pets they have)
    - categories: ['both', 'dogs', 'cats', 'neither']
- pets_any (whether they have pets or not)
    - categories: ['yes', 'no']
- religion
    - categories: ['christianity', 'catholicism', 'judaism', 'buddhism', 'none', 'other'] 
- sex
    - categories: ['m', 'f']
- smoker
    - categories: ['yes', 'no']
- languages
    - categories: ['multiple', 'English_only'] 

### How to (steps)
#### Step 1a: Decide which of the traits above you want to look at.
#### Step 1b: Load the profile data.

In [None]:
profiles = pd.read_csv('data/clean_profiles.tsv', sep='\t')

#### Step 2a: If you want, limit the data to just men or women.
- For everyone, leave this code how it is.
- For only men, remove the `#`
- For only women, remove the `#` and change the `'m'` in this line to `'f'`

In [None]:
#profiles = profiles[profiles['sex'] == 'm']

#### Step 2b: If you're running this on your personal computer
Run this code to use just a sample of the data set, because the full data is big enough to crash most personal computers. You can make the sample bigger or smaller by changing the number here.

In [None]:
profiles = profiles.sample(20000)
profiles.shape

#### Step 3: Tokenize and stem the text for these profiles.

In [None]:
print("Tokenizing...")
profiles['tokens'] = profiles['text'].apply(clean, sw=sw)
print("Stemming...")
profiles['stems'] = profiles['tokens'].apply(stem)
print("Done!")

#### Step 4: Compute the word usage statistics for your chosen attribute.
You can change the code below:
- You can change `att='age_group'` to your attribute of interest (e.g. `pets_likes` or `orientation`)
- The `per_person` and `undostems` options work the same way as before.

In [None]:
result = word_use(profiles, att='age_group', per_person=True, undostems=True)

#### Step 5a: Look at the results.
First, let's just see what columns we have.

In [None]:
result.head(2).round(2)

#### Step 5b: Looking at the most distinctive words by category
You can change two things in this code:
1. Change `'times_diff_10'` to the name of the column you want to sort by, i.e. the group whose most distinctive words you want to see. 
2. Change the number in `head(10)` to a bigger or smaller number to see more or fewer rows of output.

You can paste this line into more cells below and change it again to show different groups.

In [None]:
result.sort_values(by='times_diff_10', ascending=False).head(10).round(2)

In [None]:
result.sort_values(by='times_diff_20', ascending=False).head(10).round(2)

In [None]:
result.sort_values(by='times_diff_30', ascending=False).head(10).round(2)

In [None]:
result.sort_values(by='times_diff_40', ascending=False).head(10).round(2)

In [None]:
result.sort_values(by='times_diff_50', ascending=False).head(10).round(2)

# 5. What we learned
Expand for more.

### Sociology & Gender
1. Overall, the most common words in online dating are the same for men and women in San Francisco. What they say about themselves is not that different. 
2. There are some words that men use much more often than women, and vice versa. These fit stereotypical gender roles: for example, men in San Francisco are much more likely to talk about computers, startups, engineering, and sports. And women are much more likely to talk about food (e.g. baking and chocolate) or feelings (adore, laughter). 
3. There are many possible causes for these differences in word use. For example, it is often taboo for men to talk about their feelings, so they may mention them less here because of social expectations rather than because they are less emotional. Social factors can also increase expression: for instance, women typically do the majority of food preparation in American families, so it is not surprising that they are more likely than men to talk about it in dating profiles. 
4. Not every person conforms to these broad patterns. Only 10-20% of these men mention computers. A similar percent of the women mention baking. Some women talk about computers, and some men talk about baking. Most people aren't using these very gendered words at all. What we showed is that there are broad patterns of some topics being much more popular with men or women, and that these patterns line up with common cultural expectations of gender.

### Text analysis
1. **Tokenizing** is the process of splitting text into words (tokens). Simple approaches can separate words based on spaces, but punctuation, HTML, and other things can make this more complicated. 
2. **Stop words** are words that are common but don't give us much information. They're often removed before we do analysis.
3. **Stemming** lets us combine similar words like "runs" and "running" by looking at the stem of the words (in this case, "run"). 
4. Most words are not very common. [Oxford Dictionaries](https://en.oxforddictionaries.com/explore/how-many-words-are-there-in-the-english-language) lists over 171,000 currently used English words, but as we saw, only a few words show up in more than a few profiles. 