# Data Science for Social Justice Workshop: Module 02


## Preprocessing Text Data

One type of data that we often work with is text data. This could include social media posts, novels, customer reviews, or interview transcripts. Text is a powerful source of information, but it requires specific preprocessing before it can be used in machine learning methods. In this notebook we will go over preprocessing text, which is the first step that we need to take in order to be able to visualize, model, and analyze text.

This notebook is designed to help you: 

1. Work with CSV files in a Pandas DataFrame;
2. Preprocess text from a DataFrame, including the following key skills: tokenization, stopword removal, N-gram extraction, and lemmatization using spaCy;
3. Save the processed text to the DataFrame.

In module 1, you completed an introduction to Python fundamentals, including the basics of the Pandas package, which we will use extensively in this notebook. This module is designed to accelerate your coding skills so that you can use Python tools effectively for generating research results. However, if you have less experience with Python, some of this will be overwhelming. That's totally normal, since it takes more than a couple of hours to learn how to code! If you are feeling overwhelmed, it can be helpful to focus on the broader purpose of the functions in the notebook for now and how we can use them to further our research purposes, rather than the details in the code. It can also help to remember that you don't need to memorize every function you want to use. Even programmers who have been working with Python for years will regularly refer to documentation and resources (like this notebook!) while they are developing programs. With time and experience as you progress through the notebooks you will get more comfortable with the Python code underlying this notebook. 

**Note**: In this notebook, swap out the data in the code for your own data.

**Note**: Some of the code will take a long time to execute. The `*` on the left side of the cell will indicate that the cell is still running. The nice thing about preprocessing is that once we have the pipeline complete, we will only have to run these cells one time, even if they take a while.



## Changing the Working Directory

First, we will use `!pwd` to check the location of our **working directory**. This is the folder that the program looks in when you are loading and saving files. The default is the folder that the current Jupyter Notebook is in. 

In [1]:
# "print" working directory
!pwd

/Users/emilygrabowski/Documents/GitHub/Data-Science-Social-Justice/notebooks/module02


We can also navigate around by importing the `os` module and using the `chdir()` ('change directory') method.

Our default working directory is wherever we launched the notebook - in our case the ```module02``` folder. We want to access the "Data" folder, which is two levels "up".

We can change the file path using the following code:  `os.chdir("../../Data")`

This means "go up two levels from the current directory, then go into the folder 'Data'". You can think of this as the computer navigating the file tree. 


In [1]:
import os
# We include two ../ because we want to go two levels up in the file structure
os.chdir('../../Data')

Since the Data folder will always be on the same **relative path** from all of our Jupyter notebooks, we can re-use this line of code whenever we want to change our working directory to the 'Data' folder. (Just make sure to import `os` as well!)
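If the relative path does not exist, `os.chdir()` raises a `FileNotFoundError`. The sketch below shows one way to check where a relative path points before changing into it. The folder layout is rebuilt in a temporary directory purely for illustration, and the helper name `change_to` is hypothetical, not part of the workshop code.

```python
import os
import tempfile

def change_to(relative_path):
    """Hypothetical helper: resolve a relative path, check it exists, then chdir."""
    target = os.path.abspath(relative_path)
    if not os.path.isdir(target):
        raise FileNotFoundError(f"No such folder: {target}")
    os.chdir(target)
    return target

# Build a throwaway folder tree that mimics the workshop layout (an assumption).
start_dir = os.getcwd()
root = os.path.realpath(tempfile.mkdtemp())
os.makedirs(os.path.join(root, "notebooks", "module02"))
os.makedirs(os.path.join(root, "Data"))

# Pretend the notebook was launched from the module02 folder.
os.chdir(os.path.join(root, "notebooks", "module02"))
new_dir = change_to("../../Data")   # two levels up, then into Data
print(new_dir)

os.chdir(start_dir)                 # return to where we started
```

Printing the resolved path before you rely on it is a quick way to catch a missing `../` early.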

Now let's check our working directory again.

In [2]:
%pwd

'/Users/emilygrabowski/Documents/GitHub/Data-Science-Social-Justice/data'

Great! Now we can use another function from `os` called `listdir()` ('list directory') which returns a list of all of the files in the working directory. We will use this code to make sure we are in the right place (do you see the data folders in the list?)

In [11]:
list_of_files = os.listdir()
print(list_of_files)

['01_preprocessing.ipynb', '03_tf_idf.ipynb', '02_distant_reading.ipynb', '.ipynb_checkpoints']


If your file paths get messed up in all of this navigation, you can reset the notebook by restarting your kernel using Kernel --> Restart Kernel. This is a helpful trick at any point if the code seems to start behaving unexpectedly: restart the kernel and rerun all cells in order until you find the issue.


## Importing Data with Pandas

First step complete! We see our .csv files with data in them. The next step is to **import** the data into Python so that we can work with it. The best way to do that is to import the .csv file as a Pandas DataFrame, which gives us a data table that we can work with. 

In this course we will use a dataset taken from [r/AmItheAsshole](https://www.reddit.com/r/AmItheAsshole). 

The subreddit describes itself as "A catharsis for the frustrated moral philosopher in all of us, and a place to finally find out if you were wrong in an argument that's been bothering you. Tell us about any non-violent conflict you have experienced; give us both sides of the story, and find out if you're right, or you're the asshole."

The subreddit has structures in place that the community follows to come to a decision about the situation. First, OP (original poster) writes up the situation, asking AITA (Am I The Asshole). In response, for eighteen hours, the community of the subreddit will respond to the post with one of five judgments: YTA (You’re The Asshole), NTA (Not The Asshole), ESH (Everyone Sucks Here), NAH (No Assholes Here), or INFO (Not Enough Info).

For more info on the subreddit, see [here](https://www.inverse.com/culture/am-i-the-asshole/amp). 

First, we have to retrieve the data. We'll use a subset of the full dataset consisting of the most popular posts (`aita_sub_top_sm.csv`), on the assumption that this will yield the most interesting results. The full dataset is available as well (in `aita_sub_full.csv`). We will use `pd.read_csv()` to import the .csv file as a DataFrame.


In [2]:
import pandas as pd 

# importing file in df
df = pd.read_csv('aita_sub_top_sm.csv')



Now that we've loaded in the file, let's take a look at what's been imported. `len(df)` will give the number of rows in the DataFrame.

In [None]:
len(df)

`df.dtypes` will list the type of each column in the DataFrame. Note that some columns are already imported as integers or floats. Others are of the 'object type'. If we look at the first few rows in the cell below, we see that those are the columns containing strings.

In [12]:
df.dtypes

idint                int64
idstr               object
created              int64
self               float64
nsfw               float64
author              object
title               object
url                float64
selftext            object
score              float64
subreddit           object
distinguish         object
textlen            float64
num_comments       float64
flair_text          object
flair_css_class     object
augmented_at       float64
augmented_count    float64
dtype: object

We can also show the first few lines of the dataframe using the `.head()` method. This helps confirm that the file imported properly, and see examples of each column.

In [3]:
df.head()


This particular dataset only includes the original posts in the subreddit (so not the comments on the posts). The "selftext" column contains the actual posts.

Other columns contain valuable metadata you can use in your analyses, such as: 
- `created`: the time of the post's creation
- `score`: the number of upvotes minus downvotes
- `textlen`: the number of words
- `num_comments`: the number of comments
- `distinguish`: posts that have been moderated
- `nsfw`: posts flagged for NSFW content
- `flair_text`: a 'tag' that users within a subreddit can add
- `augmented_count`: how often a user or moderator has edited the text

`NaN` labels indicate missing values. Missing values are like holes in the DataFrame, and dealing with missing values is an important part of preprocessing. 

If you want to access the columns of the DataFrame, you can use `df.columns`. This is used frequently for getting the exact spelling of a column, for example.

In [14]:
# This allows you to quickly see which columns you have
df.columns

Index(['idint', 'idstr', 'created', 'self', 'nsfw', 'author', 'title', 'url',
       'selftext', 'score', 'subreddit', 'distinguish', 'textlen',
       'num_comments', 'flair_text', 'flair_css_class', 'augmented_at',
       'augmented_count'],
      dtype='object')

## Drop Columns and Missing Values

Now that we have imported the data and confirmed the import worked, let's start preprocessing the **raw data** into **processed data**. Let's first remove some columns that we are not going to use. This is helpful in keeping the data more manageable.

[`df.drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) takes a list of items to drop (columns or row names). The axis argument indicates whether to drop rows (`axis=0`) or columns (`axis=1`).


In [4]:
df = df.drop(['self','url', 'subreddit', 'augmented_at', 'augmented_count'], axis = 1)


Let's get rid of some missing values. First, we want to filter out deleted posts. On Reddit, removed posts get flagged as "[removed]" or "[deleted]", so we keep only the rows whose `selftext` is neither of these. We can do that with the [`isin()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html) method, combined with `~` to negate the selection. Then we can check the new length of the DataFrame to see how many rows are left.


In [None]:
#select all rows that don't have 'removed' or 'deleted'
df = df[~df['selftext'].isin(['[removed]', '[deleted]' ])]
len(df)

Next, we need to drop **null values**. These are values that are totally missing, for example if the webscraper wasn't able to extract text for the post. In pandas they are represented as **NaN** ("Not a Number"). We can use [`dropna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html?highlight=dropna#pandas.DataFrame.dropna) to remove the rows with null values in the `selftext` column.

In [None]:
# dropna() returns a new DataFrame, so we assign the result back to df
df = df.dropna(subset=['selftext'])
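The three cleaning steps in this section can be tried out on a tiny DataFrame before running them on the real data. The rows below are invented for illustration only. Note that `drop()`, boolean filtering, and `dropna()` each return a new DataFrame rather than changing the original, so the result is assigned back at every step.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "title": ["post a", "post b", "post c", "post d"],
    "selftext": ["Some text here", "[removed]", np.nan, "[deleted]"],
    "url": [np.nan, np.nan, np.nan, np.nan],
})

toy = toy.drop(["url"], axis=1)                               # drop a column
toy = toy[~toy["selftext"].isin(["[removed]", "[deleted]"])]  # drop flagged posts
toy = toy.dropna(subset=["selftext"])                         # drop missing text
toy = toy.reset_index(drop=True)                              # renumber rows from 0

print(len(toy))
print(toy)
```

Only the first row survives all three steps. The optional `reset_index(drop=True)` renumbers the remaining rows, which can be convenient if you later want to look up rows by position.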

## Text Preprocessing

Preprocessing is the very first step of NLP projects. The main point of preprocessing is to make the text more interpretable for computers in later machine learning steps. For example, if we are counting instances of "weather" in text, we might want "weather", "weather.", and "Weather" to all be counted as instances of the same word, but in raw text form these would be treated as separate words. Some common preprocessing steps that help with this are:

- Removing punctuation like . , ! $( ) * % @
- Removing URLs
- Removing stopwords (non-content words like a, the, is)
- Lowercasing
- Tokenization (e.g. splitting a sentence into words)
- Stemming (e.g. places -> place)
- Lemmatization (changing words to 'dictionary form' e.g. runs, running, run -> run)
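To get a feel for what a few of these steps do, here is a minimal hand-written sketch using only Python's standard library. It covers lowercasing, URL removal, punctuation removal, whitespace tokenization, and stop word removal. The function name `simple_preprocess` and the tiny stop word list are assumptions made for this illustration; stemming and lemmatization are left out because they need linguistic knowledge that a few lines of code cannot supply.

```python
import re
import string

# A tiny, hand-picked stop word list (an assumption; real lists are much longer)
STOP_WORDS = {"a", "the", "is", "it", "at"}

def simple_preprocess(text):
    """Lowercase, strip URLs and punctuation, tokenize, and drop stop words."""
    text = text.lower()                                    # lowercasing
    text = re.sub(r"https?://\S+", " ", text)              # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = text.split()                                  # whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(simple_preprocess("The Weather is nice. Check it at https://example.com!"))
# ['weather', 'nice', 'check']
```

With this function, "weather", "weather." and "Weather" all end up as the same token, `weather`.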

Fortunately, we don't need to code every one of these steps. Instead, we will use [spaCy](https://spacy.io/) to do these things. If the text you'd like to process is general-purpose English language text (i.e., not domain-specific, like medical literature), spaCy is ready to use out-of-the-box. We will use the [`en_core_web_sm`](https://spacy.io/models/en/#en_core_web_sm) pipeline to cover the steps listed above. 


In [7]:
import spacy
#load the English preprocessing pipeline
nlp = spacy.load('en_core_web_sm')

#parse the first reddit post in the dataset
parsed_sub = nlp(df.selftext.iloc[0])
print(parsed_sub)


The base text looks the same, but we can take a closer look at `parsed_sub` to see what happened under the hood.

In [None]:
#print each sentence in the parse
for num, sentence in enumerate(parsed_sub.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print('')

In [None]:
#print each entity in the parse
for num, entity in enumerate(parsed_sub.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)

We can also print out a DataFrame with some of the information spaCy has extracted from our text. 


In [20]:
#extract the first 15 items for the following properties of the parse
#orthography 
token_text = [token.orth_ for token in parsed_sub][:15]
#part of speech 
token_pos = [token.pos_ for token in parsed_sub][:15]
#lemma (or 'dictionary form')
token_lemma = [token.lemma_ for token in parsed_sub][:15]
#named entity type (empty string if the token is not part of an entity)
token_entity_type = [token.ent_type_ for token in parsed_sub][:15]
#stop word? t/f
token_stop = [token.is_stop for token in parsed_sub][:15]
#punctuation? t/f
token_punct = [token.is_punct for token in parsed_sub][:15]

#make a dataframe with these items
pd.DataFrame(zip(token_text, token_pos, token_lemma, token_entity_type, token_stop, token_punct),
             columns=['token_text','part_of_speech','token_lemma','entity_type', 'token_stop', 'token_punct'])

| | token_text | part_of_speech | token_lemma | entity_type | token_stop | token_punct |
|---|---|---|---|---|---|---|
| 0 | My | DET | -PRON- | | True | False |
| 1 | girlfriend | NOUN | girlfriend | | False | False |
| 2 | recently | ADV | recently | | False | False |
| 3 | went | VERB | go | | False | False |
| 4 | to | ADP | to | | True | False |
| 5 | the | DET | the | | True | False |
| 6 | beach | NOUN | beach | | False | False |
| 7 | with | ADP | with | | True | False |
| 8 | a | DET | a | | True | False |
| 9 | few | ADJ | few | | True | False |


Turns out spaCy does a *lot* of work. 

- It did part of speech tagging (nouns, verbs etc.)
- It identified stop words and punctuation
- It lemmatized the text. The goal of lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form called a lemma. For instance, "been" $\Rightarrow$ "be".


## Leftover Tokens

While spaCy's tokenizer is great, it has limitations (like any tokenizer). Language is always domain-specific and our data might contain text that spaCy doesn't properly deal with. 

It is always worth going over some test data to check if your tokenizer works well, and then making adjustments where needed. As a test, we can process the sentence "That’s actually not true" and check the lemmatized tokens.

In [8]:
#preprocess the sentence
test = nlp("That’s actually not true")
tokens = []
#for each token in the processed object
for token in test:
    #check if the token is punctuation
    if not token.is_punct:
        #append the lower-case lemma to the list
        tokens.append(token.lemma_.lower())
print(tokens)

['that', '’', 'actually', 'not', 'true']


Even though we are using an if-statement to filter out punctuation, the apostrophe in `that's` is not getting marked as punctuation in the processing.

Let's further investigate with an example with many apostrophes:

In [9]:
test = nlp("I’m you're he's he’s That’s that's he’s he's father’s father's")
tokens = [token.lemma_.lower() for token in test if not token.is_punct]
for i, token in enumerate(tokens):
    print(i, token)

0 i
1 ’m
2 you
3 be
4 he
5 be
6 he
7 ’
8 that
9 ’s
10 that
11 be
12 he
13 ’
14 he
15 's
16 father
17 ’s
18 father
19 's


We can add our own step after the automatic parser in order to deal with these cases. For example, let's remove the lemmas that are left over from the possessive "'s". We can do this with a small rule of our own: after the first pass, we add another if-statement that removes the problematic lemmas. 

In [13]:
test = nlp("I’m you're he's he’s That’s that's he’s he's father’s father's")
#lemmas left over from contractions and possessives that we want to drop
leftover = ["'s",  "’s", "’"]
tokens = []
for token in test:
    #skip punctuation and digits
    if not token.is_punct and not token.is_digit:
        lemma = token.lemma_.lower()
        #keep the lemma only if it is not one of the leftover tokens
        if lemma not in leftover:
            tokens.append(lemma)

for i, t in enumerate(tokens):
    print(i, t)

0 i
1 ’m
2 you
3 be
4 he
5 be
6 he
7 that
8 that
9 be
10 he
11 he
12 father
13 father


It's still not perfect, but now a lot of those apostrophes are gone as lemmas. At this point, you can continue to fine-tune the automatic text processing pipeline until you are happy with the final result. After that, the next step is to process all of the items in the dataset with that pipeline.
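One possible refinement, which this notebook does not use: in the output above, the curly apostrophe `’` and the straight apostrophe `'` were handled differently, so replacing curly apostrophes with straight ones before handing the text to spaCy may make the tokenizer's behavior more consistent. The sketch below uses only Python string methods; the function name `normalize_apostrophes` is hypothetical.

```python
def normalize_apostrophes(text):
    """Replace curly single quotes with a straight apostrophe."""
    return text.replace("\u2019", "'").replace("\u2018", "'")

raw = "I’m sure that’s my father’s"
print(normalize_apostrophes(raw))
# I'm sure that's my father's
```

Whether this improves the lemmas for your own data is something to check on a few test sentences, in the same way as above.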

## Preprocessing all data
Now we can scale up to our data. Up until this point, we have run the preprocessing pipeline on a single post, but we want to automate our code to process all posts at once. 

To do this, let's define a few helper functions that we'll use for text normalization. In particular, the `preprocess()` generator function will use spaCy to:

- Iterate over the posts in the DataFrame
- Remove line breaks, punctuation, digits, and excess whitespace
- Lemmatize the text
- Remove leftover apostrophe tokens and stop words

These helper functions will automate the preprocessing for the posts in this dataset. Don't worry too much about deciphering each line of code. The main goal of these helper functions is to do a lot of the preprocessing for you so that you can use this text in analysis going forward. We can use these functions wholesale, especially the `preprocess()` function, to preprocess our Reddit data. 
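Two of the helper functions in the next cell are generator functions: they use `yield` to hand back one item at a time instead of building a whole list in memory, which helps when the dataset is large. Here is a minimal sketch of the idea with plain strings and no spaCy. The data and the function name `strip_line_breaks` are made up for this illustration, and unlike the notebook's `line_read()`, this sketch replaces each line break with a space.

```python
def strip_line_breaks(texts):
    """Yield each text with line breaks replaced by a space, one at a time."""
    for text in texts:
        yield text.replace("\n", " ")

posts = ["first line\nsecond line", "a post with no breaks"]

cleaned = strip_line_breaks(posts)   # nothing has been processed yet
print(next(cleaned))                 # processes only the first post
print(list(cleaned))                 # processes whatever is left
```

A generator can only be consumed once, which is why the notebook collects the results of `preprocess()` into a list before reusing them.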

In [14]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from gensim.models.phrases import Phrases, Phraser

#load the English preprocessing pipeline
nlp = spacy.load('en_core_web_sm')

def clean(token):
    """
    helper function to eliminate tokens
    that are pure punctuation, whitespace, or digits
    """
    return token.is_punct or token.is_space or token.is_digit

def line_read(df):
    """
    generator function to read in text from df
    and get rid of line breaks in the text
    """    
    for text in df.selftext:
        yield text.replace('\n', '')

def preprocess(df):
    """
    generator function that lemmatizes each post and
    yields it as one string of space-separated lemmas
    """
    #only the named entity recognizer is skipped; the lemmatizer relies on the other components
    for parsed in nlp.pipe(line_read(df), batch_size=1000, disable=["ner"]):
        #lower-case lemmas; older spaCy versions use '-PRON-' as the lemma of pronouns
        lemmas = [token.lemma_.lower() if token.lemma_ != '-PRON-' else token.lower_ for token in parsed if not clean(token)]
        #drop leftover apostrophe tokens
        lemmas_c = [l for l in lemmas if l not in ["'s",  "’s", "’"]]
        #drop stop words
        nostops = [term for term in lemmas_c if term not in STOP_WORDS]
        yield ' '.join(nostops)




Now let's run `preprocess()` over our DataFrame and look at the first output. Although it looks less like coherent text, it is cleaner and will be much easier for the computer to process.

In [25]:
# This will take a while

lemmas = [line for line in preprocess(df)]

In [50]:
lemmas[0]

'girlfriend recently beach friend tiny bikini basically thong hate wear public wear wear pose bathroom mirror hotel room profile picture stick post snapchat story worth mention friend snapchat reason similar sick fight girl night post video sit table like dude invite girl completely unknown arrive guy add snapchat save send pic beach trip screenshot particular snap leave camera roll ass intentional know claim like way stomach look pic beach black bikini look micro bikini bathroom mirror ocean sand friend tit ass bathroom mirror age typical year old behavior nowadays wrong think inappropriate behavior relationship asshole big deal it?**edit guy let clear try control point time tell wear yes hate ass completely bikini tell wear problem wear general wear summer long despite groaning issue find super annoying yes issue send picture ass snapchat plain simple draw line break wear break turn sideways bathroom mirror stick butt picture send snapchat borderline nude suit wildly inappropriate cl

## Phrase modeling with Gensim

Many kinds of NLP methods work better when using **N-grams**. An n-gram treats a small group of consecutive words as a single token rather than as separate words. This allows words that frequently appear together to be concatenated (e.g. "new york" means something different and more specific than "new" and "york" separately). The most common n-grams are the **bigram** (2-gram) and the **trigram** (3-gram).
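To make the idea concrete, here is a minimal sketch that lists every n-gram in a tokenized sentence using plain Python. The helper name `ngrams` is chosen for this illustration and is not a Gensim function.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i love to stay in new york".split()
print(ngrams(tokens, 2))       # all bigrams, ending with ('new', 'york')
print(ngrams(tokens, 3)[:2])   # the first two trigrams
```

This simply enumerates every neighboring pair or triple. Most of them, such as `('to', 'stay')`, are not meaningful units, which is why we need a way to keep only the combinations that occur together unusually often.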

**Phrase modeling** is an approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the words in our lemmatized dataset and looking for words that co-occur (i.e., appear one after another) much more frequently than you would expect them to by random chance. 

Gensim’s [`Phrases`](https://radimrehurek.com/gensim/models/phrases.html) model implements bigrams, trigrams, quadgrams, etc. `Phrases` detects phrases based on collocation counts. It builds a model of input text that you then can use on other data.


Gensim detects a bigram if a scoring function for two words exceeds a threshold. The two important arguments to `Phrases` are `min_count` and `threshold`. The higher the values of these parameters, the harder it is for words to be combined into bigrams. We can change the value of these parameters to fine-tune our model. Try changing `min_count` and `threshold` below. How does that change the output?

In [32]:
from gensim.models.phrases import Phrases, Phraser

docs = ['new york is great', 'new york is in the united states',
        'i love to stay in new york', 'people visit the united states']

tokens = [doc.split(" ") for doc in docs]
bigram = Phrases(tokens, min_count=2, threshold=3,delimiter='_')
bigram_phraser = Phraser(bigram)
[bigram_phraser[token] for token in tokens]

[['new_york', 'is', 'great'],
 ['new_york', 'is', 'in', 'the', 'united', 'states'],
 ['i', 'love', 'to', 'stay', 'in', 'new_york'],
 ['people', 'visit', 'the', 'united', 'states']]
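To see why `new york` was merged in this toy example while `united states` was not, we can compute a score by hand. The sketch below is an approximation of Gensim's default scoring, not its actual implementation: it assumes the score is `(count(a, b) - min_count) / (count(a) * count(b)) * vocab_size`, and that the vocabulary size counts unique words plus unique adjacent pairs. Both are assumptions for illustration.

```python
from collections import Counter

docs = ['new york is great', 'new york is in the united states',
        'i love to stay in new york', 'people visit the united states']
sentences = [doc.split(" ") for doc in docs]

# Count single words and adjacent word pairs
unigrams = Counter(tok for sent in sentences for tok in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))

# Assumption: vocabulary size counts unique words plus unique adjacent pairs
vocab_size = len(unigrams) + len(bigrams)

def phrase_score(a, b, min_count):
    """(count(a, b) - min_count) / (count(a) * count(b)) * vocab_size"""
    return (bigrams[(a, b)] - min_count) / (unigrams[a] * unigrams[b]) * vocab_size

print(round(phrase_score("new", "york", min_count=2), 2))       # 3.11
print(round(phrase_score("united", "states", min_count=2), 2))  # 0.0
```

Under these assumptions, `new york` scores just above the threshold of 3, while `united states` occurs exactly `min_count` times and scores 0. This also shows how raising either parameter makes it harder for a pair to qualify.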

Now let's make a bigram and trigram model for our data. Starting from the preprocessed lemmas from the `preprocess()` function above, we can use the gensim models to identify bigrams and trigrams in the dataset. Again, the `min_count` and `threshold` arguments can be modified to change the output of the model. 

In [33]:
# create bigram and trigram models
lemmas_s = [doc.split(" ") for doc in lemmas]
bigram = Phrases(lemmas_s, min_count=10, threshold=100)
trigram = Phrases(bigram[lemmas_s], min_count=10, threshold=50)  
bigram_phraser = Phraser(bigram)
trigram_phraser = Phraser(trigram)

def make_bigrams(texts):
    return [bigram_phraser[doc] for doc in texts]

def make_trigrams(texts):
    return [' '.join(trigram_phraser[bigram_phraser[doc]]) for doc in texts]

# Form trigrams
trigrams = make_trigrams(lemmas_s)


In [56]:
trigrams[1]

'parent smoke like chimney use quit wife young son vacation time week invite parent come visit watch son wife date live hour half away spend night come visit occasion tell remind time son bear month ago absolutely smoking address inside outside street car driveway near property want son want ask question big idea know okay morning leave review footage security_camera door dad step driveway smoke time hour wife mom texte dad afterward tell disrespect wish smoke property meet public come house ask stay want stay overnight hotel smoke grandson parent basically think ridiculous smoke brother fine actually true young brother asthma chronic ear infection kid sure exacerbate smoking plus pick year quit grandson danger air thank babysitting babysitte hour carte blanche disrespect wife wish big_deal wonder harsh rule stand maybe nice'

Once our phrase model has been trained on our total dataset, we can apply it to new text. When our model encounters two tokens in new text that it identifies as a phrase, it will merge the two into a single new token.


In [34]:
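# apply the bigram model first, then the trigram model, to a list of lower-case tokens
trigram_phraser[bigram_phraser[["that", "was", "not", "a", "big", "deal"]]]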


Let's take a look at the bigram phraser. We can use `.keys()` on its `phrasegrams` to list the bigrams identified in the dataset.

In [31]:
len(bigram_phraser.phrasegrams.keys())

96

Let's print the first few bigrams identified in the model as well to check if they seem like appropriate bigrams. If not, we can change the parameters of the bigram model to adjust the sensitivity of the model.

In [32]:
[bigram for bigram in bigram_phraser.phrasegrams.keys()][:10]

[(b'bathing', b'suit'),
 (b'mash', b'potato'),
 (b'social', b'medium'),
 (b'final', b'straw'),
 (b'ice', b'cream'),
 (b'red', b'flag'),
 (b'credit', b'card'),
 (b'cerebral', b'palsy'),
 (b'fast', b'forward'),
 (b'paternity', b'test')]

## Adding and Saving to .csv
Finally, let's add our new preprocessed data to our .csv in a new column.

In [34]:
# inserting next to selftext column
df.insert(loc=7, column='lemmas', value=trigrams)
# removing empty rows in lemmas
df = df[~df['lemmas'].isin([''])]

In [35]:
# save to new csv
df.to_csv('aita_sub_top_sm_lemmas.csv', index=False)
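One reason to remove rows with empty lemmas before saving is that an empty string does not survive the trip through a .csv file: with the default settings, pandas reads an empty field back in as `NaN`. The sketch below demonstrates this with a made-up two-row DataFrame and an in-memory buffer standing in for the file.

```python
import io
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"selftext": ["My parents smoke.", "..."],
                    "lemmas": ["parent smoke", ""]})

buffer = io.StringIO()            # in-memory stand-in for a .csv file
toy.to_csv(buffer, index=False)
buffer.seek(0)                    # rewind so we can read from the start

reloaded = pd.read_csv(buffer)
print(reloaded["lemmas"].isna().tolist())
# [False, True]
```

If empty rows do end up in a saved file, `dropna(subset=['lemmas'])` after loading removes them again.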

Let's look at the new column in the dataframe using `.head()`

In [36]:
df.head(3)

| | idint | idstr | created | nsfw | author | title | selftext | lemmas | score | distinguish | textlen | num_comments | flair_text | flair_css_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 427576402 | t3_72kg2a | 1506433689 | 0.0 | Ritsku | AITA for breaking up with my girlfriend becaus... | My girlfriend recently went to the beach with ... | girlfriend recently beach friend tiny bikini b... | 679.0 | | 4917.0 | 434.0 | no a--holes here | |
| 1 | 551887974 | t3_94kvhi | 1533404095 | 0.0 | hhhhhhffff678 | AITA for banning smoking in my house and telli... | My parents smoke like chimneys. I used to as w... | parent smoke like chimney use quit wife young ... | 832.0 | | 2076.0 | 357.0 | asshole | ass |
| 2 | 552654542 | t3_951az2 | 1533562299 | 0.0 | creepatthepool | AITA? Creep wears skimpy bathing suit to pool | Hi guys. Throwaway for obv reasons.\n\nI'm a f... | hi guy throwaway obv reason i'm female child b... | 23.0 | | 1741.0 | 335.0 | Shitpost | |


Congratulations! You now have the ability to preprocess text data into trigrams and save the output. This gives us clean text that we can then use in our further analysis. This notebook can be used as a primer for preprocessing data and for further reference. However, once our text is processed and the DataFrame saved to a .csv, all we have to do for our next step in the analysis is to load that processed .csv file, rather than having to re-run these models every time we work with the dataset.