# What's in an online dating profile? 

People say a lot about themselves in online dating profiles, especially on sites like OKCupid that encourage people to answer questions. Thus, we can learn a lot about people by studying what they write. OKC has made some of its profile data from San Francisco public. We will use that data in this lab to explore different cultural questions. 

Our first question is whether and how men and women talk about themselves differently in their profiles. Popular culture is constantly telling us that men and women have different interests, hobbies, and relationship goals. Yet there are also many examples of women who like stereotypically masculine things and men who like feminine ones. This is especially interesting in online dating, because people are seeking partners with similar interests and relationship goals. Finding a partner would be hard for straight men and women if these two groups had very different interests. 

OKC shared 59,946 profiles, though, which is far too many to read! Computers can read them all and tell us how common different words are. So our first approach will be simple. We can ask:
1. Which words are used the most by men and women? 
2. Which words are used often by men but not women, and vice versa? 

At the end of the lab, you'll be able to ask these questions about other social groups too (like sexual orientation, race/ethnicity, and level of education).

### Import the libraries we'll use.
`%matplotlib inline` lets us see charts and plots right here in the notebook!

In [None]:
from collections import Counter

import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize 
from nltk.stem.snowball import SnowballStemmer
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

%matplotlib inline

### Read our data.

In [None]:
profiles = pd.read_csv('data/profiles.csv')

In [None]:
profiles.shape

In [None]:
profiles.columns.values

In [None]:
profiles.head()

### A little housekeeping...
Expand for more.

- The OKC data has 10 different columns with profile text, one for each long-answer question in users' profiles.
- We want to look at all of the profile text, so this cell merges it all together in a new column called `text`.
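
To make the merge concrete, here is a minimal sketch on a toy row. The dictionary `row` and the helper name `concat_row` are made up for illustration; the lab's own `concat` function works on pandas rows instead. It also shows why the literal text `nan` turns up in the merged text when someone left an answer blank.

```python
# Toy stand-in for one profile; the real data has essay0 through essay9.
row = {'essay0': 'I love hiking.',
       'essay1': float('nan'),        # an unanswered question
       'essay2': 'Ask me anything!'}
cols = ['essay0', 'essay1', 'essay2']

def concat_row(row, cols):
    # str() turns a missing value (NaN) into the literal text 'nan'
    return '\n'.join(str(row[c]) for c in cols)

merged = concat_row(row, cols)
print(repr(merged))
```

The stray `nan` pieces are one of the things the cleaning step later in the lab tries to strip out.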

### The code

In [None]:
essay_cols = ['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 
              'essay7', 'essay8', 'essay9']

ed_levels = {'<HS': ['dropped out of high school', 'working on high school'],
             'HS': ['graduated from high school', 'working on college/university', 
                    'two-year college', 'dropped out of college/university', 
                    'high school'], 
             'BA': ['graduated from college/university', 
                    'working on masters program', 'working on ph.d program', 
                    'college/university', 'working on law school', 
                    'dropped out of masters program', 
                    'dropped out of ph.d program', 'dropped out of law school', 
                    'dropped out of med school'],
             'Grad_Pro': ['graduated from masters program',
                          'graduated from ph.d program',                           
                          'graduated from law school', 
                          'graduated from med school', 'masters program', 
                          'ph.d program', 'law school', 'med school']
            }

bodies = {'average': ['average'], 
          'fit': ['fit', 'athletic', 'jacked'], 
          'thin': ['thin', 'skinny'], 
          'overweight': ['curvy', 'a little extra', 'full figured', 'overweight']
         }

smoke = {'no': ['no'], np.nan: ['nan']}

kids = {'yes': ['has a kid', 'has kids']}

has_pets = {'yes': ['has']}

ethn = {'White': ['white', 'middle eastern', 'middle eastern, white'], 
        'Asian': ['asian', 'indian', 'asian, pacific islander'], 
        'Black': ['black']
       }   

ethn2 = {'Latinx': ['latin'], 'multiple': [','], np.nan: ['nan']}   

drinks = {'no': ['rarely', 'not at all']}

drugs = {'no': ['never']}

jobs = {'education': ['student', 'education'], 
        'STEM': ['science', 'computer'], 
        'business': ['sales', 'executive', 'banking'], 
        'creative': ['artistic', 'entertainment'], 
        'med_law': ['medicine', 'law'],
        np.nan: ['nan']
       }

religion = {'none': ['agnosticism', 'atheism'],
            'catholicism': ['catholicism'],
            'christianity': ['christianity'],
            'judaism': ['judaism'],
            'buddhism': ['buddhism'],
            np.nan: ['nan']
           }

languages = {'multiple': [',']}


def concat(row, cols):
    tmp = []
    for c in cols:
        tmp.append(str(row[c]))
    new = '\n'.join(tmp)
    return new

def recode(text, dictionary, default=np.nan):
    out = default
    text = str(text)
    
    for x in dictionary.keys():
        for y in dictionary[x]:
            if y == text:
                out = x
                return out
    return out

def recode_fuzzy(text, dictionary, default=np.nan):
    out = default
    text = str(text)
    
    for x in dictionary.keys():
        for y in dictionary[x]:
            if y in text:
                out = x
                return out
    return out


def which_pets(t, criterion='has'):
    d = False
    c = False
    t = str(t)
    p = 'neither'
    if t == 'nan':
        p = np.nan
    
    if 'has dogs' in t:
        d = True
    if 'has cats' in t:
        c = True
        
    if criterion == 'likes':
        if 'likes dogs' in t:
            if 'dislikes dogs' not in t:
                d = True
        if 'likes cats' in t:
            if 'dislikes cats' not in t:
                c = True
        
    if c and d:
        p = 'both'
    elif c:
        p = 'cats'
    elif d:
        p = 'dogs'
        
    return p

def census_2010_ethnicity(t):
    '''
    Groups the answers to this question into categories based on the 2010 US Census.
    It deviates from the census by creating an exclusive Latinx category. Selecting 
    just 'latin' and nothing else was the 3rd most frequent ethnicity in this 
    data. The decision to include people who identified as 'latin' and another race
    is based on research on Latinx people's experience with the US Census, but 
    like all racial and ethnic categorization systems it is flawed. 
    '''
    text = str(t)
    
    e = recode(text, ethn, default='other')
    if 'other' == e:
        e = recode_fuzzy(text, ethn2, default='other')
    
    return e

def height(inches):
    h = 'under_6'
    if inches >= 72:
        h = 'over_6'
    return h

In [None]:
#remove people 60+ and 17-; .copy() avoids a pandas SettingWithCopyWarning below
profiles = profiles[(profiles.age < 60) & (profiles.age > 17)].copy()

profiles['text'] = profiles.apply(concat, axis=1, cols=essay_cols)
profiles['edu'] = profiles.education.apply(recode, dictionary=ed_levels, 
                                            default='unknown')
profiles['kids'] = profiles.offspring.apply(recode_fuzzy, dictionary=kids, 
                                            default='no')
profiles['pets_likes'] = profiles.pets.apply(which_pets, criterion='likes')
profiles['pets_has'] = profiles.pets.apply(which_pets, criterion='has')
profiles['pets_any'] = profiles.pets.apply(recode_fuzzy, dictionary=has_pets, 
                                            default='no')
profiles['age_group'] = profiles.age.apply(lambda x: str(int(x/10)*10))
profiles['height_group'] = profiles.height.apply(height)
profiles['race_ethnicity'] = profiles.ethnicity.apply(census_2010_ethnicity)
profiles['smoker'] = profiles.smokes.apply(recode, dictionary=smoke, 
                                            default='yes')
profiles['body'] = profiles.body_type.apply(recode, dictionary=bodies, 
                                            default='unknown')
profiles['alcohol_use'] = profiles.drinks.apply(recode, dictionary=drinks, 
                                            default='yes')
profiles['drug_use'] = profiles.drugs.apply(recode, dictionary=drugs, 
                                            default='yes')
profiles['industry'] = profiles.job.apply(recode_fuzzy, dictionary=jobs, 
                                            default='other')
profiles['religion'] = profiles.religion.apply(recode_fuzzy, dictionary=religion, 
                                            default='other')
profiles['languages'] = profiles.speaks.apply(recode_fuzzy, dictionary=languages, 
                                            default='English_only')

profiles = profiles[['age_group', 'body', 'alcohol_use', 'drug_use', 'edu', 
                     'race_ethnicity', 'height_group', 'industry', 'kids', 
                     'orientation', 'pets_likes', 'pets_has', 'pets_any', 
                     'religion', 'sex', 'smoker', 'languages', 'text']]

profiles.head()

In [None]:
profiles.to_csv('data/clean_profiles.tsv', sep='\t', index=False)

In [None]:
#random subsample so this doesn't crash on laptops
profiles = profiles.sample(20000)
profiles = profiles.reset_index(drop=True)

#### Let's peek at an example of the text so we know what we're working with.

In [None]:
profiles.text[5]

## Tokenizing text

### We want to split the text into words.
Expand for details

- We can do this by applying the `split()` function to text in every profile. 
- Notice, however, that this is a little messy.
    - `split()` is just cutting up the text based on the spaces, leaving the punctuation and some HTML things mixed in with our words.
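
A tiny example of the mess, using a made-up sentence rather than a real profile:

```python
# A made-up snippet with punctuation and an HTML line break in it.
sample = "I love hiking, cooking<br />\nand my dog!"

# split() cuts only on whitespace, so commas, '!' and tag fragments stay attached.
words = sample.split()
print(words)
# ['I', 'love', 'hiking,', 'cooking<br', '/>', 'and', 'my', 'dog!']
```

Notice that `hiking,` and `dog!` would be counted as different words from `hiking` and `dog`.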

### A first try

In [None]:
tmp = profiles['text'].apply(lambda x: x.split())
tmp.head()

### Getting words from text
Expand for details

Here we define a function to clean up the text a bit more. It does a few things:
- Removes HTML code from the text using BeautifulSoup. (Remember, we want just the words people actually typed.) 
- Converts all of the text to lowercase, so that `Hello`, `hello`, `HeLlO`, and `HELLO` all look the same to the computer.
- Uses the Natural Language Tool Kit (`nltk`) to tokenize the remaining text. 
    - "Tokenize" is jargon for splitting text into "tokens." Tokens are usually words, but they could be sentences, paragraphs, letters, or whatever we needed. 
    - The nltk tokenizers are much smarter than the simple `string.split()` function we used before. This one (which we imported in the beginning) selects the words, but ignores the whitespace and punctuation.
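
The three steps can be sketched with only Python's standard library. This is an illustrative stand-in, not the lab's `clean` function: `TextOnly` and `simple_clean` are hypothetical names, `html.parser` takes the place of BeautifulSoup, and `re.findall` with the pattern `\w+` takes the place of the nltk tokenizer, so results on real profiles may differ slightly.

```python
import re
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def simple_clean(text):
    # Step 1: drop the HTML, keeping only what the person typed
    parser = TextOnly()
    parser.feed(text)
    parser.close()
    # Step 2: lowercase everything
    plain = ' '.join(parser.parts).lower()
    # Step 3: keep runs of letters, digits and underscores as tokens
    return re.findall(r'\w+', plain)

tokens = simple_clean('Hello, <b>HELLO</b><br />I love hiking!')
print(tokens)
# ['hello', 'hello', 'i', 'love', 'hiking']
```

Both spellings of "hello" now count as the same token, and the punctuation and tags are gone.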

### A second try

#### Removing stop words

In [None]:
keep_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 
              'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 
              'himself', 'she', 'her', 'hers', 'herself', 'they', 'them', 'their',
              'theirs', 'themselves']

sw = set(stopwords.words('english'))

for k in keep_words:
    sw.discard(k) #could use remove if we wanted keyerrors
    
print(sw)

In [None]:
def clean(text):
    t = BeautifulSoup(text, 'lxml').get_text()
    
    bad_words = ['http', 'www', '\nnan']
    for b in bad_words:
        t = t.replace(b, '')
    
    t = t.lower()
    t = regexp_tokenize(t, r'\w+') #raw string so Python leaves the backslash alone
    
    final = []
    for w in t:
        if w not in sw:
            final.append(w)
    
    return final

profiles['tokens'] = profiles['text'].apply(clean)
profiles.tokens.head()

## Comparing the words used by men and women

In [None]:
men = profiles[(profiles['sex'] == 'm') & (profiles['orientation'] == 'straight')]
women = profiles[(profiles['sex'] == 'f') & (profiles['orientation'] == 'straight')]

men.tokens.head()

#### Counting how often each gender uses each word

In [None]:
def flatten(series):
    l = []
    for x in series:
        l.extend(x) #each x is a list we want to unnest
    return l

tmp = flatten(men.tokens)

mens_words = Counter(tmp)
mens_words.most_common(10)

In [None]:
tmp = flatten(women.tokens)

womens_words = Counter(tmp)
womens_words.most_common(10)

#### Convert those word counts to frequencies (percent of total words)
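Here is the arithmetic on a toy list of tokens, as a small sketch. The token list is made up; the lab does the same thing column by column in a pandas data frame.

```python
from collections import Counter

# Eight made-up tokens standing in for everything one group wrote.
tokens = ['i', 'love', 'dogs', 'i', 'love', 'hiking', 'i', 'cook']

counts = Counter(tokens)
total = sum(counts.values())  # total number of words, here 8

# percent of total words = (count / total) * 100
percents = {word: 100 * n / total for word, n in counts.items()}

print(percents['i'])     # 37.5
print(percents['love'])  # 25.0
```

Percentages let us compare groups fairly even when one group wrote many more words overall.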

In [None]:
tmp = {'women': womens_words,
       'men': mens_words
      }

popular_words = pd.DataFrame(tmp)

popular_words['men'] = (popular_words['men'] /  popular_words['men'].sum())*100
popular_words['women'] = (popular_words['women'] /  popular_words['women'].sum())*100

popular_words.sort_values(by='men', inplace=True, ascending=False)
popular_words.head().round(2)

In [None]:
popular_words['max'] = popular_words.max(axis=1)
popular_words = popular_words.sort_values(by='max', ascending=False)
popular_words.head(10).round(2)

#### See the distribution of word popularity

In [None]:
popular_words['max'].hist(bins=100)

#### Look at just the 1000 most popular words

In [None]:
popular_words = popular_words.head(1000).copy() #copy so we can add columns later without a pandas warning
print(popular_words.shape)
popular_words['max'].hist(bins=100)

### Figure out which words are popular with one gender but not the other
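One way to read the comparison used below is as a signed ratio: positive when men use a word more, negative when women do. The sketch here uses a hypothetical helper, `signed_ratio`, that mirrors the logic of the lab's `times_diff` on plain numbers instead of data frame rows.

```python
def signed_ratio(men_pct, women_pct):
    # How many times more often the heavier-using group uses the word.
    if men_pct > women_pct:
        return men_pct / women_pct        # positive: more common for men
    return -1 * (women_pct / men_pct)     # negative: more common for women

print(signed_ratio(0.5, 0.25))   # 2.0, men use it twice as often
print(signed_ratio(0.25, 0.5))   # -2.0, women use it twice as often
print(signed_ratio(0.3, 0.3))    # -1.0, no difference
```

Because the larger value is always divided by the smaller one, results never fall strictly between -1 and 1. A word that one group never uses at all has no meaningful ratio.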

In [None]:
def times_diff(row):
    if row.men > row.women:
        return row.men / row.women
    else:
        return -1 * (row.women / row.men)
    
popular_words['times_diff'] = popular_words.apply(times_diff, axis=1)
popular_words = popular_words.sort_values(by='max', ascending=False)

print('Most popular words:')
popular_words.head(10).round(3)

In [None]:
popular_words = popular_words.sort_values(by='times_diff', ascending=False)

print('Words men use more than women:')
popular_words.head(15).round(3)

In [None]:
popular_words = popular_words.sort_values(by='times_diff', ascending=True)

print('Words women use more than men:')
popular_words.head(15).round(3)

### Stemming

In [None]:
#snowball English (aka porter2) is improved over the porter stemmer
stemmer = SnowballStemmer("english") 

def stem(t):
    out = []
    for w in t:
        out.append(stemmer.stem(w))
    return out

profiles['stems'] = profiles['tokens'].apply(stem)
profiles.stems.head()

#### Let's try it again with stems this time

In [None]:
# functions for summarizing word use by a trait
def times_diff2(row, group, ref):
    if row[ref] > row[group]:
        return -1 * (row[ref] / row[group])
    else:
        return row[group] / row[ref]

#normally we wouldn't paste this function here but it helps to show in the lab
def flatten(series):
    l = []
    for x in series:
        l.extend(x) #each x is a list we want to unnest
    return l
    
def flatten2(series):
    l = []
    for x in series:
        tmp = set(x) #make the tokens into a set, thus dropping repeats
        tmp = list(tmp) #turn it back into a list we can attach to the other lists
        l.extend(tmp) 
    return l


def count(data, per_person):
    #count the people in each category
    l = len(data)

    #apply the right aggregation function, depending whether we want 
    #most common words, or words used by most people
    if per_person:
        data = flatten2(data)
    else:
        data = flatten(data)
            
    c = Counter(data)
    
    return c, l

def word_use(df, att, ref=None, per_person=False):
    #list all of the categories in this column
    types = list(df[att].value_counts().index.values)
    #variables that will store our results
    data = {}
    lens = {}
    
    for t in types:
        #get the stems for each category
        tmp = df[df[att] == t].stems
        #count how often each is used
        data[t], lens[t] = count(tmp, per_person)
        
        #also compute the inverse of each category
        tmp = df[df[att] != t].stems
        data['not_'+t], lens['not_'+t] = count(tmp, per_person)        
        
    #convert those results to a pandas data frame for easy handling
    popular_words = pd.DataFrame(data)
    
    # convert the counts in each column to percents
    for t in popular_words.columns:
        n = lens[t] #if we want percent of people
        
        if not per_person: #if we want percent of total words 
            n = popular_words[t].sum()
        
        popular_words[t] = (popular_words[t] / n) * 100
    
    #find overall most popular words
    popular_words['max'] = popular_words.max(axis=1)
    
    #sort the words and select the top 1000 most popular
    popular_words = popular_words.sort_values(by='max', ascending=False)
    popular_words = popular_words.head(1000)

    #calculate the rate each type of person uses these words relative to others
    for t in types:
        r = ref
        
        if ref is None: #if we do not have a reference category, use the inverse
            r = 'not_'+t
            
        if t != ref: #don't compare a trait to itself
            #apply our times_diff2 function
            popular_words['times_diff_'+t] = popular_words.apply(times_diff2, 
                                                                 group=t, 
                                                                 ref=r, 
                                                                 axis=1)

    #remove the inverse columns we created
    popular_words = popular_words.drop(popular_words.filter(regex='not_').columns, axis=1)
    
    return popular_words

In [None]:
popular_words = word_use(profiles, att='sex')
popular_words = popular_words.sort_values(by='times_diff_m', ascending=False)
print("Men's words:")
popular_words.head(10).round(2)

In [None]:
popular_words = popular_words.sort_values(by='times_diff_f', ascending=False)
print("Women's words:")
popular_words.head(10).round(2)

#### Let's try it again, but with percent of profiles rather than percent of words
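Not all profiles have the same number of words, so a few people who repeat a word many times can dominate the word counts. Counting each word at most once per profile avoids that.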
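
A toy comparison of the two ways of counting, using three made-up profiles. The variable names are illustrative; in the lab, `flatten` and `flatten2` play these two roles.

```python
from collections import Counter

# Three made-up profiles, already tokenized. The first one repeats 'dog' a lot.
profiles_tokens = [['dog', 'dog', 'dog', 'hike'],
                   ['dog', 'cook'],
                   ['cook']]

# Percent of words: every occurrence counts.
all_words = [w for toks in profiles_tokens for w in toks]
word_share = 100 * Counter(all_words)['dog'] / len(all_words)

# Percent of profiles: each word counts once per profile (set() drops repeats).
once_each = [w for toks in profiles_tokens for w in set(toks)]
profile_share = 100 * Counter(once_each)['dog'] / len(profiles_tokens)

print(round(word_share, 1))     # 57.1, 4 of the 7 words
print(round(profile_share, 1))  # 66.7, 2 of the 3 profiles
```

The two numbers answer different questions: how much of the text is this word, versus how many people use it at all.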

In [None]:
popular_words = word_use(profiles, att='sex', per_person=True)
popular_words = popular_words.sort_values(by='times_diff_m', ascending=False)
print("Men's words:")
popular_words.head(10).round(2)

In [None]:
popular_words = popular_words.sort_values(by='times_diff_f', ascending=False)
print("Women's words:")
popular_words.head(10).round(2)

## Try it with another trait

### Options (traits)
We have a lot more information about people than just whether they're men or women. Try the analysis again with one of these other traits. (Expand for a list.)

- age_group (How old someone is. Youngest users are 18.)
    - categories: ['10', '20', '30', '40', '50']
- body (self-described)
    - categories: ['fit', 'average', 'thin', 'overweight', 'unknown']
- alcohol_use
    - categories: ['yes', 'no']
- drug_use
    - categories: ['yes', 'no']
- edu (highest degree completed)
    - categories: ['`<HS`', 'HS', 'BA', 'Grad_Pro', 'unknown'] 
- race_ethnicity
    - categories: ['White', 'Asian', 'Latinx', 'multiple', 'other', 'Black']
- height_group (whether someone is over or under six feet tall)
    - categories: ['under_6', 'over_6']
- industry (what field they work in)
    - categories: ['STEM', 'business', 'education', 'creative', 'med_law', 'other'] 
- kids (whether they have children)
    - categories: ['yes', 'no']
- orientation
    - categories: ['straight', 'gay', 'bisexual']
- pets_likes (what pets they like)
    - categories: ['both', 'dogs', 'cats', 'neither']
- pets_has (what pets they have)
    - categories: ['both', 'dogs', 'cats', 'neither']
- pets_any (whether they have pets or not)
    - categories: ['yes', 'no']
- religion
    - categories: ['christianity', 'catholicism', 'judaism', 'buddhism', 'none', 'other'] 
- sex
    - categories: ['m', 'f']
- smoker
    - categories: ['yes', 'no']
- languages
    - categories: ['multiple', 'English_only'] 

### How to (steps)
#### Step 1a: Decide which of the traits above you want to look at.
#### Step 1b: Load the profile data.

In [None]:
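#open the data we cleaned and saved earlier. This is the same for everyone. 
#read age_group as text so its categories stay '10', '20', ... as listed above
profiles = pd.read_csv('data/clean_profiles.tsv', sep='\t', dtype={'age_group': str})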

#### Step 2: If you want, limit the data to just men or women.
- For only men, leave this code how it is.
- For only women, change the `'m'` in this line to `'f'`
- For all people, add a `#` at the start of this line to comment it out

In [None]:
profiles = profiles[profiles['sex'] == 'm']

#if you're running on a laptop, it is smart to use a smaller amount of data
profiles = profiles.sample(20000)
profiles.shape

#### Step 3: Tokenize and stem the text for these profiles

In [None]:
print("Tokenizing...")
profiles['tokens'] = profiles['text'].apply(clean)
print("Stemming...")
profiles['stems'] = profiles['tokens'].apply(stem)
print("Done!")

#### Step 4: Compute the word usage statistics for your chosen attribute.
You can change two things in the code below:
- You can change `att='sex'` to your attribute of interest (e.g. `body` or `orientation`)
- You can change `per_person=True` to `False` if you want to see which words are used most often, rather than which words are used by the most people. 

In [None]:
result = word_use(profiles, att='sex', per_person=True)

#### Step 5a: Look at the results
First, let's just see what columns we have.

In [None]:
result.head(2).round(2)

#### Step 5b: Looking at the most distinctive words by category
You can change two things in this code:
1. Change `'times_diff_f'` to the name of the column you want to sort by, i.e. the group whose most distinctive words you want to see. 
2. Change the number in `head(10)` to a bigger or smaller number to see more or fewer rows of output.

You can paste this line into more cells below to show different groups.

In [None]:
result.sort_values(by='times_diff_f', ascending=False).head(10).round(2)

#### Sexual Orientation

In [None]:
result = word_use(profiles, att='orientation', per_person=True)

In [None]:
result = result.sort_values(by='times_diff_gay', ascending=False)
result.head(10).round(2)

In [None]:
result = result.sort_values(by='times_diff_bisexual', ascending=False)
result.head(10).round(2)

In [None]:
result = result.sort_values(by='times_diff_straight', ascending=False)
result.head(10).round(2)

#### Whether someone is a parent

In [None]:
result = word_use(profiles, att='kids', per_person=True)

In [None]:
result = result.sort_values(by='times_diff_no', ascending=True)
print("Top words distinguishing parents:")
result.head(10).round(2)

In [None]:
result = result.sort_values(by='times_diff_yes', ascending=True)
print("Top words distinguishing non-parents:")
result.head(10).round(2)

#### Age

In [None]:
result = word_use(profiles, att='age_group', per_person=True)

In [None]:
result = result.sort_values(by='times_diff_10', ascending=False)
print("Top words distinguishing teens:")
result.head(10).round(2)

In [None]:
result = result.sort_values(by='times_diff_20', ascending=False)
print("Top words distinguishing 20s:")
result.head(10).round(2)

In [None]:
result = result.sort_values(by='times_diff_30', ascending=False)
print("Top words distinguishing 30s:")
result.head(10).round(2)

In [None]:
result = result.sort_values(by='times_diff_40', ascending=False)
print("Top words distinguishing 40s:")
result.head(10).round(2)

In [None]:
result = result.sort_values(by='times_diff_50', ascending=False)
print("Top words distinguishing 50s:")
result.head(10).round(2)

#### Dogs vs Cats

In [None]:
result = word_use(profiles, att='pets_likes', ref='dogs', per_person=True)

In [None]:
result = result.sort_values(by='times_diff_cats', ascending=False)
print("Top words distinguishing people who like cats:")
result.head(10).round(2)

In [None]:
result = result.sort_values(by='times_diff_cats', ascending=True)
print("Top words distinguishing people who like dogs:")
result.head(10).round(2)

#### Education

In [None]:
result = word_use(profiles, att='edu', ref='BA', per_person=True)

In [None]:
result = result.sort_values(by='times_diff_HS', ascending=False)
print("Top words distinguishing High School from BAs:")
result.head(10).round(2)

In [None]:
result = result.sort_values(by='times_diff_HS', ascending=True)
print("Top words distinguishing BAs from HS:")
result.head(10).round(2)

In [None]:
result = result.sort_values(by='times_diff_Grad_Pro', ascending=False)
print("Top words distinguishing graduate and professional degree holders from BAs:")
result.head(10).round(2)

In [None]:
result = result.sort_values(by='times_diff_Grad_Pro', ascending=True)
print("Top words distinguishing BAs from Grad/Pro:")
result.head(10).round(2)

# What we learned
Expand for more.