# Data munging

Most of the time and effort in projects using computational methods goes into data munging, the process of gathering, cleaning, and preparing data for our actual analysis. This notebook walks through that process with the OKCupid public profile data. You may either:
1. Step through the code one cell at a time, reading the descriptions to learn how it works, or
2. Run all the code here to prepare data for the other labs / notebooks. 
    - **Note:** You should check for errors after you run this. Common errors and what to do about them are documented below. 
    
@Author: [Jeff Lockhart](http://www-personal.umich.edu/~jwlock/)

## 0. Imports
- This cell tells python which libraries we'll be using.

In [None]:
import pandas as pd
import numpy as np

## 1. Go download the data
1. Go to [https://github.com/rudeboybert/JSE_OkCupid](https://github.com/rudeboybert/JSE_OkCupid). 
    - **If you use git**: clone this repository.
    - **If you're new to github**:
        - The easiest way is to right-click each file you want and `save link as`. 
        - You can also go to the green button in the top right and `clone or download`, then `Download zip`. After you download the zip file, you'll need to unzip it. 
2. You need 2 files: `okcupid_codebook.txt` and `profiles.csv.zip`.
3. Copy these files into the folder (directory) called `data` that came with this lab, i.e. the folder `CSSLabs-NLP/data`.
4. Unzip the `profiles.csv.zip` file in the same place.
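
If you would rather unzip from Python than from your file browser, the standard library can do it. This is only a sketch: `unzip_here` is a hypothetical helper name, not part of the original lab, and it assumes the zip file was saved as `data/profiles.csv.zip`.

```python
from pathlib import Path
import zipfile

def unzip_here(zip_path):
    """Extract an archive into the folder that contains it.

    Hypothetical helper for illustration. Returns the names of the
    files that were inside the archive.
    """
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(zip_path.parent)
        return archive.namelist()

# Assumed location of the download; change it if you saved the file elsewhere.
archive_path = Path('data/profiles.csv.zip')
if archive_path.exists():
    print(unzip_here(archive_path))
else:
    print(f'{archive_path} not found; download it first')
```

Extracting into the archive's own folder keeps the unzipped file in the same place as the zip, which is where the next step expects it.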

## 2. Read the data into python
- **Common error**: `FileNotFoundError: File b'data/profiles.csv' does not exist`
    - If you get this error, it means you have not saved the data in the right place. 
        - Open your file browser / finder and go to where this notebook is saved. 
        - Open the subfolder called `data`.
        - Check if `profiles.csv` is there. If it is missing, copy it there. If it has `.zip` at the end, unzip it. If it has some other name like `profiles (1).csv`, rename it to `profiles.csv`. 
    - **Advanced users**: You may change the path here (i.e. `data/profiles.csv`) so that it points to wherever you saved the profiles data. 
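
The checks in the list above can also be run from Python before reading the file. The following is a minimal sketch, not part of the original lab: `check_data_file` is a hypothetical helper, and the hints it returns are just suggestions.

```python
from pathlib import Path

def check_data_file(path='data/profiles.csv'):
    """Return 'ok' if the file is present, otherwise a short hint.

    Hypothetical helper for illustration only.
    """
    target = Path(path)
    if target.exists():
        return 'ok'
    # Is the zipped download sitting next to where the csv should be?
    zipped = target.with_name(target.name + '.zip')
    if zipped.exists():
        return f'found {zipped.name}: unzip it first'
    if not target.parent.exists():
        return f'folder {target.parent} is missing'
    return f'{target.name} is not in {target.parent}'

print(check_data_file())
```

If the hint is anything other than `ok`, fix the problem it describes before running the next cell.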

In [None]:
profiles = pd.read_csv('data/profiles.csv')

## 3. Give python information about our data
This cell tells python how we want to simplify our data. The cell itself only defines lookup tables; the cleaning happens in steps 4 through 6, where we will:
- Merge the profile text. The OKC data has 10 different columns with profile text, one for each long-answer question in users' profiles. We want to look at all of the profile text, so we merge it all together in a new column called `text`.
- Simplify the categories people pick for other things like level of education, the pets they have, etc.
- Remove people under 18 and people 60 or older.
- Save this cleaner version of the data so we can use it later.
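
Each lookup table in the next cell maps a simplified category to the list of raw answers that belong to it. One way to see how such a table gets used is to flip it around, so that every raw answer points to its category. This is only a sketch: `ed_levels_demo` is a shortened, illustrative copy of the education table, and `invert` is a hypothetical helper. The notebook's own recoding functions loop over the tables instead.

```python
# A shortened, illustrative copy of the education table (not the full one).
ed_levels_demo = {
    'HS': ['graduated from high school', 'high school'],
    'BA': ['graduated from college/university', 'college/university'],
}

def invert(table):
    """Turn {category: [raw answers]} into {raw answer: category}."""
    flat = {}
    for category, raw_answers in table.items():
        for raw in raw_answers:
            flat[raw] = category
    return flat

lookup = invert(ed_levels_demo)
print(lookup['high school'])                # HS
print(lookup.get('space camp', 'unknown'))  # unknown
```

Answers that appear in no list fall back to a default, here `'unknown'`.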

In [None]:
# which columns have text in them
essay_cols = ['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 
              'essay7', 'essay8', 'essay9']

# education
ed_levels = {'<HS': ['dropped out of high school', 'working on high school'],
             'HS': ['graduated from high school', 'working on college/university', 
                    'two-year college', 'dropped out of college/university', 
                    'high school'], 
             'BA': ['graduated from college/university', 
                    'working on masters program', 'working on ph.d program', 
                    'college/university', 'working on law school', 
                    'dropped out of masters program', 
                    'dropped out of ph.d program', 'dropped out of law school', 
                    'dropped out of med school'],
             'Grad_Pro': ['graduated from masters program',
                          'graduated from ph.d program',                           
                          'graduated from law school', 
                          'graduated from med school', 'masters program', 
                          'ph.d program', 'law school', 'med school']
            }

#body type
bodies = {'average': ['average'], 
          'fit': ['fit', 'athletic', 'jacked'], 
          'thin': ['thin', 'skinny'], 
          'overweight': ['curvy', 'a little extra', 'full figured', 'overweight']
         }

# smoking
smoke = {'no': ['no'], np.nan: ['nan']}

# Has children
kids = {'yes': ['has a kid', 'has kids']}

#has pets
has_pets = {'yes': ['has']}

# race/ethnicity for exact matching
ethn = {'White': ['white', 'middle eastern', 'middle eastern, white'], 
        'Asian': ['asian', 'indian', 'asian, pacific islander'], 
        'Black': ['black']
       }   

# race/ethnicity for fuzzy matching
ethn2 = {'Latinx': ['latin'], 'multiple': [','], np.nan: ['nan']}   

# alcohol use
drinks = {'no': ['rarely', 'not at all']}

# drug use
drugs = {'no': ['never']}

# Employment sector
jobs = {'education': ['student', 'education'], 
        'STEM': ['science', 'computer'], 
        'business': ['sales', 'executive', 'banking'], 
        'creative': ['artistic', 'entertainment'], 
        'med_law': ['medicine', 'law'],
        np.nan: ['nan']
       }

# religion 
religion = {'none': ['agnosticism', 'atheism'],
            'catholicism': ['catholicism'],
            'christianity': ['christianity'],
            'judaism': ['judaism'],
            'buddhism': ['buddhism'],
            np.nan: ['nan']
           }

# languages spoken
languages = {'multiple': [',']}

## 4. Create functions for cleaning up our data
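
The functions in this section use two kinds of string matching: exact (`==`) and partial (`in`). A small sketch of the difference, and of a pitfall that partial matching brings with it, follows. The helper names and the example answers are made up for illustration; only the phrases in `kid_phrases` come from the `kids` table above.

```python
def exact_match(text, options):
    """True if text is identical to one of the options."""
    return any(option == text for option in options)

def partial_match(text, options):
    """True if one of the options appears anywhere inside text."""
    return any(option in text for option in options)

kid_phrases = ['has a kid', 'has kids']   # from the `kids` table
answer = 'has kids, and might want more'  # illustrative answer

print(exact_match(answer, kid_phrases))    # False
print(partial_match(answer, kid_phrases))  # True

# Pitfall: 'likes dogs' is a substring of 'dislikes dogs'.
pets_answer = 'dislikes dogs and likes cats'  # illustrative answer
print('likes dogs' in pets_answer)            # True, but misleading

# Ruling out the longer phrase gives the intended result.
likes_dogs = ('likes dogs' in pets_answer
              and 'dislikes dogs' not in pets_answer)
print(likes_dogs)                             # False
```

This is why `which_pets` checks for `dislikes` before counting a `likes` match.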

In [None]:
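def concat(row, cols):
    '''Function for joining the text of several columns into one string, 
    one column per line. Missing values pass through str(), so an empty 
    essay contributes the text 'nan'.'''
    return '\n'.join(str(row[c]) for c in cols)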

def recode(text, dictionary, default=np.nan):
    '''Function for recoding categories in a column based on exact matches'''
    out = default
    text = str(text)
    
    for x in dictionary.keys():
        for y in dictionary[x]:
            if y == text: #exact match
                out = x
                return out
    return out

def recode_fuzzy(text, dictionary, default=np.nan):
    '''Function for recoding categories in a column based on partial matches'''
    out = default
    text = str(text)
    
    for x in dictionary.keys():
        for y in dictionary[x]:
            if y in text: #partial match
                out = x
                return out
    return out


def which_pets(t, criterion='has'):
    '''Function for determining which pets someone has or likes'''
    d = False
    c = False
    t = str(t)
    p = 'neither'
    if t == 'nan':
        p = np.nan
    
    if 'has dogs' in t:
        d = True
    if 'has cats' in t:
        c = True
        
    if criterion == 'likes':
        if 'likes dogs' in t:
            if 'dislikes dogs' not in t:
                d = True
        if 'likes cats' in t:
            if 'dislikes cats' not in t:
                c = True
        
    if c and d:
        p = 'both'
    elif c:
        p = 'cats'
    elif d:
        p = 'dogs'
        
    return p

def census_2010_ethnicity(t):
    '''
    Function for recoding ethnicity into categories modeled on the choices 
    offered for this question by the 2010 US Census. It deviates from the 
    census by creating an exclusive Latinx category. Selecting just 'latin' 
    and nothing else was the 3rd most frequent ethnicity in this data. The 
    decision to include people who identified as 'latin' and another race 
    is based on research on Latinx people's experience with the US Census, 
    but like all racial and ethnic categorization systems, it is flawed. 
    '''
    text = str(t)
    
    e = recode(text, ethn, default='other')
    if 'other' == e:
        e = recode_fuzzy(text, ethn2, default='other')
    
    return e

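def height(inches):
    '''Function for splitting heights at 6 feet (72 inches). A missing 
    height (NaN) fails the comparison, so it is labeled 'under_6'.'''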
    h = 'under_6'
    if inches >= 72:
        h = 'over_6'
    return h
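
One detail of the height rule is easy to miss: a missing height is stored as NaN, and any comparison with NaN is false, so missing heights end up in the shorter group. The sketch below reproduces the rule in plain Python and shows one possible alternative that keeps missing values missing. Both function names are hypothetical, and the alternative is a suggestion, not what the notebook does.

```python
import math

def height_group(inches):
    """Same rule as height() above: 72 inches or more is 'over_6'."""
    return 'over_6' if inches >= 72 else 'under_6'

print(height_group(74))        # over_6
print(height_group(math.nan))  # under_6, because NaN >= 72 is False

def height_group_keep_missing(inches):
    """Variant that passes missing heights through unchanged."""
    if math.isnan(inches):
        return math.nan
    return 'over_6' if inches >= 72 else 'under_6'

print(height_group_keep_missing(math.nan))  # nan
```

Which behavior is preferable depends on the analysis; the point is only to be aware of where missing values go.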

## 5. Clean up the data
This cell calls the functions we created in the last cell, along with the information about our data from the cell before it, to actually clean our data.
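
One of the lines in this cell sorts people into age groups by decade using `str(int(x/10)*10)`. A small sketch of that arithmetic, with a hypothetical function name, shows which label each age receives.

```python
def age_group(age):
    """Round an age down to its decade and return it as text."""
    return str(int(age / 10) * 10)

for age in (18, 27, 30, 59):
    print(age, '->', age_group(age))
# 18 -> 10
# 27 -> 20
# 30 -> 30
# 59 -> 50
```

Note that 18 and 19 year olds are the only people in the `'10'` group, because everyone younger has been removed.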

In [None]:
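#remove people 60+ and 17-; .copy() gives us an independent table, which
#avoids pandas' SettingWithCopyWarning when we add columns below
profiles = profiles[(profiles.age < 60) & (profiles.age > 17)].copy()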

#recode categorical columns into simpler categories
profiles['text'] = profiles.apply(concat, axis=1, cols=essay_cols)
profiles['edu'] = profiles.education.apply(recode, dictionary=ed_levels, 
                                            default='unknown')
profiles['kids'] = profiles.offspring.apply(recode_fuzzy, dictionary=kids, 
                                            default='no')
profiles['pets_likes'] = profiles.pets.apply(which_pets, criterion='likes')
profiles['pets_has'] = profiles.pets.apply(which_pets, criterion='has')
profiles['pets_any'] = profiles.pets.apply(recode_fuzzy, dictionary=has_pets, 
                                            default='no')
profiles['age_group'] = profiles.age.apply(lambda x: str(int(x/10)*10))
profiles['height_group'] = profiles.height.apply(height)
profiles['race_ethnicity'] = profiles.ethnicity.apply(census_2010_ethnicity)
profiles['smoker'] = profiles.smokes.apply(recode, dictionary=smoke, 
                                            default='yes')
profiles['body'] = profiles.body_type.apply(recode, dictionary=bodies, 
                                            default='unknown')
profiles['alcohol_use'] = profiles.drinks.apply(recode, dictionary=drinks, 
                                            default='yes')
profiles['drug_use'] = profiles.drugs.apply(recode, dictionary=drugs, 
                                            default='yes')
profiles['industry'] = profiles.job.apply(recode_fuzzy, dictionary=jobs, 
                                            default='other')
profiles['religion'] = profiles.religion.apply(recode_fuzzy, dictionary=religion, 
                                            default='other')
profiles['languages'] = profiles.speaks.apply(recode_fuzzy, dictionary=languages, 
                                            default='English_only')

# keep just these columns
profiles = profiles[['age_group', 'age', 'body', 'alcohol_use', 'drug_use', 'edu', 
                     'race_ethnicity', 'height_group', 'industry', 'kids', 
                     'orientation', 'pets_likes', 'pets_has', 'pets_any', 
                     'religion', 'sex', 'smoker', 'languages', 'text', 'essay0', 
                     'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 
                     'essay7', 'essay8', 'essay9']]

profiles.head()

## 6. Save the results
This cell saves the cleaned up data to a file so we can use it again later.
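
Profile text often contains line breaks and sometimes tabs, so it is reasonable to wonder whether a tab-separated file can hold it safely. The sketch below uses a tiny made-up table and an in-memory buffer in place of a real file. It suggests that pandas quotes such fields when writing and restores them when reading, but it is an illustration under those assumptions, not a test of the real data.

```python
import io
import pandas as pd

# Made-up rows: one text with a line break, one with a tab.
demo = pd.DataFrame({'sex': ['f', 'm'],
                     'text': ['line one\nline two', 'has a\ttab']})

# Write to an in-memory buffer with the same options as the cell below.
buffer = io.StringIO()
demo.to_csv(buffer, sep='\t', index=False)

# Read it back the way a later notebook would.
buffer.seek(0)
restored = pd.read_csv(buffer, sep='\t')

print(restored['text'].tolist() == demo['text'].tolist())
```

To load the saved file later, the matching call would be `pd.read_csv('data/clean_profiles.tsv', sep='\t')`.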

In [None]:
profiles.to_csv('data/clean_profiles.tsv', sep='\t', index=False)