# Data Science for Social Justice Workshop: Preprocessing

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Understand why textual data needs to be preprocessed.
* Engage in common preprocessing tasks, such as lemmatization and phrase modeling.
* Distinguish between different Python packages for preprocessing.
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.<br>

### Sections
1. [About This Workshop](#intro)
2. [Reading Data with Pandas](#read)
3. [Dropping Columns and Missing Values](#drop)
4. [Preprocessing Text Data with SpaCy](#preprocess)
5. [Phrase Modeling with Gensim](#gensim)
6. [Saving Data](#save)

<a id='intro'></a>

# About This Workshop

One of the most common types of data used in machine learning and data science is text data. This includes social media posts, novels, customer reviews, interview transcripts, and many others. Text is a powerful source of information, but requires specific preprocessing in order to be used in machine learning contexts. In particular, machine learning algorithms are designed to work with numerical data, so the end goal of text preprocessing is to have numeric features associated with each text in the dataset. 

In this notebook we will go over a preprocessing **pipeline** (or series of sequential steps) for text, which is the first step that we need to take in order to analyze text using our data science techniques.

This notebook is designed to help you: 

1. Read, manipulate, and write `.csv` files in a `pandas` dataframe.
2. Preprocess text with the following key skills: tokenization, stop-word removal, and lemmatization using `spaCy`, and n-gram extraction using `gensim`.

In Python Fundamentals, you were introduced to the basics of the `pandas` package, which we will use extensively in this notebook. This module is designed to accelerate your coding skills so that you can use Python effectively for generating research results. However, if you have less experience with Python, some of this will be overwhelming. That's totally normal, since it takes more than a couple of hours to learn how to code! 

If you are feeling overwhelmed, it can be helpful to focus on the broader purpose of the functions in the notebook for now and how we can use them to further our research purposes, rather than the details of the algorithms. It can also help to remember that you don't need to memorize every function you want to use. Even programmers who have been working with Python for years will regularly refer to documentation and resources (like this notebook!) while they are developing a project. With time and experience as you progress through the notebooks you will get more comfortable with the Python code underlying this notebook.

⚠️ **Warning:** Some of the code will take a long time to execute. The `*` on the left side of the cell will indicate that the cell is still running. The nice thing about preprocessing is that once we have the pipeline complete, we will save our results so we don't need to re-run these cells.

<a id='read'></a>

# Reading Data with `pandas`

The first step is to read the data into Python. A convenient way to do that is to import the `.csv` file as a `pandas` dataframe, which gives us a data table that we can work with.

In this workshop we will use a dataset taken from the subreddit [r/AmItheAsshole](https://www.reddit.com/r/AmItheAsshole).

The subreddit describes itself as follows:

<img src="../../images/aita_desc.png" alt="Am I The Asshole - description" width="300"/>

The subreddit has structures in place that the community follows to come to a decision about the situation. First, OP (original poster) writes up the situation, asking AITA (Am I The Asshole). In response, for eighteen hours, the community of the subreddit will respond to the post with one of five judgments: YTA (You’re The Asshole), NTA (Not The Asshole), ESH (Everyone Sucks Here), NAH (No Assholes Here), or INFO (Not Enough Info).

💡 **Tip**: For more info on the subreddit, see [here](https://www.inverse.com/culture/am-i-the-asshole/amp). 

First, we have to read the data. We'll use a subset of the full dataset consisting of the most popular posts (`aita_sub_top_sm.csv`), on the assumption that these will yield the most interesting results. We use `pd.read_csv()` to import the `.csv` file as a DataFrame.

In [None]:
# Import the pandas package
import pandas as pd 

# Read the csv file
df = pd.read_csv('../../data/aita_sub_top_sm.csv')

Let's have a look at the shape, top rows, and columns of the data.

In [None]:
df.shape

In [None]:
df.head()

In [None]:
# This allows you to quickly see which columns you have
list(df)

This particular dataset only includes the original posts in the subreddit (so not the comments on the posts). 

There is one row per post in the dataset. Only one column contains the actual text of the post (in this case `selftext`), and the rest of the columns contain metadata that can augment the text data itself in your analyses. These include:

- `created`: the time of the post's creation.
- `score`: number of upvotes minus downvotes.
- `textlen`: number of words.
- `num_comments`: the number of comments.
- `distinguish`: posts that have been moderated.
- `nsfw`: posts flagged for NSFW content.
- `flair_text`: a 'tag' that users within a subreddit can add.
- `augmented_count`: how often a user or moderator has edited the text.

🔔 **Question**: Which of these columns could contain interesting data for your research purposes?

<a id='drop'></a>

# Dropping Columns and Missing Values

Now that we have imported the data and confirmed the import worked, let's start preprocessing the **raw data** into **processed data**. Let's first remove some columns that we are not going to use. This is helpful in keeping the data more manageable.
`NaN` labels in a dataframe indicate missing values. Missing values are like holes in the dataframe, and dealing with missing values is an important part of preprocessing. 

The method [`df.drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) takes a list of items to drop (columns or row names). The axis argument indicates whether to drop rows (`axis=0`) or columns (`axis=1`).

## 🥊 Challenge 1: Removing Columns

We started a `drop()` statement below. Fill it in with the following 2 arguments, separated by a comma:
1. A list containing the columns 'self', 'url', 'subreddit', 'augmented_at', 'augmented_count'.
2. an `axis` argument set to 1.

In [None]:
# YOUR CODE HERE
df = df.drop(...)

## Removing Missing Values

First, we want to select the rows of deleted posts. On Reddit, removed posts get flagged as "[removed]" or "[deleted]". We should remove posts flagged as such, since they'll lack any text. We can do that with the [`isin()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html) method. Then we can check the new length of the dataframe to see how many rows are left.

The following line of code selects all rows whose text is not exactly '[removed]' or '[deleted]'. The `~` operator inverts a boolean `pandas` Series, so here it acts as 'not'.

In [None]:
# Select all rows that don't have '[removed]' or '[deleted]'
df = df.loc[~df['selftext'].isin(['[removed]', '[deleted]' ]),:]
df.shape

🔔 **Question**: How many posts are left in the dataset?
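
To see what `~` and `isin()` do on their own, here is a small sketch on a made-up DataFrame (the three example posts are invented for illustration and are not from the dataset). `isin()` returns a boolean Series that is `True` where a value exactly matches one of the listed strings, and `~` flips each `True`/`False`.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({'selftext': ['AITA for leaving early?', '[removed]', '[deleted]']})

# True where the text is exactly '[removed]' or '[deleted]'
is_flagged = toy['selftext'].isin(['[removed]', '[deleted]'])

# ~ flips every True/False, so the mask now marks the rows we want to keep
keep = ~is_flagged

toy_clean = toy.loc[keep, :]
print(toy_clean)
```

Only the first toy row survives, because it is the only one that is not flagged.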

## Removing Null Values

Next, we need to drop **null values**. These are values that are totally missing. In this case, the web scraper may have been unable to extract text for the post. They are replaced with the value **NaN**, which stands for a null value in `pandas`. We need to deal with null values in any column that we plan to use for analysis, which in this case is the `selftext` column. We can use [`dropna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html?highlight=dropna#pandas.DataFrame.dropna) to remove the rows with null values in the target column. 
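
Before dropping anything, it can help to count how many values are missing. The sketch below uses a made-up DataFrame (the column values are invented for illustration) and `isna()`, which marks null entries with `True`. Summing that mask counts the nulls in each column.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_nulls = pd.DataFrame({
    'title': ['Post A', 'Post B', 'Post C'],
    'selftext': ['Some text', float('nan'), None],
})

# isna() gives True for every missing value; sum() counts the Trues per column
null_counts = toy_nulls.isna().sum()
print(null_counts)
```

Running `df.isna().sum()` on the workshop data in the same way would show which columns have missing values.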

## 🥊 Challenge 2: Using `dropna()`

Finish the statement below by using the `dropna()` method on the dataframe. Use the argument `subset`, which you set to `'selftext'`. Check the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) if you need to!

The shape of the resulting DataFrame should be `19988, 18`.

In [None]:
# YOUR CODE HERE


<a id='preprocess'></a>

# Preprocessing Text Data with SpaCy

Text data collected in the real world is always going to be variable, which poses a challenge for analysis. But by reducing some of this variation, we can help improve our results. For example, if we are counting instances of the word `"weather"` in text, we might want the strings `"weather"`, `"weather."`, and `"Weather"` to all be counted as instances of the same word. However, in raw text form, these would be treated as separate strings. By performing text cleaning, we can standardize these cases and make our data easier to analyze. Some common preprocessing steps are:

- Removing punctuation
- Removing URLs
- Removing stopwords (non-content words like "a", "the", "is", etc.)
- Lowercasing
- Tokenization (e.g., splitting a sentence into distinct "chunks" or "tokens")
- Stemming, or removing the ends of words (e.g., places -> place)
- Lemmatization, or changing words to 'dictionary form' (e.g., runs, running, run -> run)
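
As a rough illustration of the `"weather"` example above, the sketch below lowercases each string and strips surrounding punctuation using only the Python standard library. The word list is made up, and this is a toy version of text cleaning rather than what `spaCy` does internally.

```python
import string
from collections import Counter

# Toy tokens, invented for illustration only
raw_tokens = ['weather', 'weather.', 'Weather', 'whether']

# Without cleaning, every variant is counted as a separate string
raw_counts = Counter(raw_tokens)

# Lowercase each token and strip leading/trailing punctuation
cleaned_tokens = [tok.lower().strip(string.punctuation) for tok in raw_tokens]
clean_counts = Counter(cleaned_tokens)

print(raw_counts['weather'])
print(clean_counts['weather'])
```

After cleaning, the three variants are counted as the same word, while the unrelated word `"whether"` stays separate.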

Fortunately, we don't need to code every one of these steps. Instead, we will use a package called [spaCy](https://spacy.io/) to do these things. If the text you'd like to process is general-purpose English language text (i.e., not domain-specific, like medical literature), `spaCy` is ready to use out-of-the-box. We will use the [`en_core_web_sm`](https://spacy.io/models/en/#en_core_web_sm) pipeline to cover the steps listed above. 

In [None]:
# Import spaCy
import spacy

# Load the English preprocessing pipeline
nlp = spacy.load('en_core_web_sm')

# Parse the first reddit post in the dataset
parsed_post = nlp(df['selftext'].iloc[0])
print(parsed_post)

The base text looks the same, but we can take a closer look at `parsed_post` to see what happened under the hood.

In [None]:
# Print each sentence in the parsed post
for idx, sentence in enumerate(parsed_post.sents):
    print(f'Sentence {idx + 1}')
    print(sentence)
    print('')

We can also print out a DataFrame with some of the information spaCy has extracted from our text. This information operates on each **token** in the text. The process of tokenization can vary, depending on decisions made by the package you're using. Once you've tokenized, you can get information about each token:

## 🥊 Challenge 3: Token attributes

Write a for-loop that loops over the first 5 tokens of `parsed_post`. `parsed_post` behaves like a list--that is, you can index and slice it just like a list.

In the body of the for-loop, `print()` the following **attributes** of the loop variable:
- orth_
- pos_
- lemma_
- is_stop
- is_punct


In [None]:
# YOUR CODE HERE


As you can see, `spaCy` does a *lot* of work:

- It did **part-of-speech** tagging – identifying nouns, verbs, adverbs, and so on.
- It **lemmatized** the text. The goal of lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form called a lemma. For instance, "been" -> "be".
- It identified **punctuation** in the text.
- It identified **stop words** such as "I", "to", "as". Stop words are common words that don't provide much information about the text you are analyzing.

## Preprocessing all data
Now we can scale up this process to our whole dataset. Up until this point, we have run the preprocessing pipeline on a single post, but we want to automate our code to process all posts at once. 

To do this, let's define a few helper functions that we'll use for text normalization. In particular, the `preprocess` generator function will use `spaCy` to:

- Iterate over the dataframe.
- Remove punctuation, digits, and excess whitespace.
- Lemmatize and lowercase the text.
- Remove stop words.

These helper functions will automate the preprocessing for the posts in this dataset. Don't worry too much about deciphering each line of code. The main goal of these helper functions is to do a lot of the preprocessing for you so that you can use the text in analysis going forward. 

**You can find this function in the `1_Preprocessing_Project.ipynb` notebook to use on your own data.**

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
from gensim.models.phrases import Phrases, Phraser

In [None]:
def clean(token):
    """Helper function that specifies whether a token is:
        - punctuation
        - space
        - digit
    """
    return token.is_punct or token.is_space or token.is_digit

def line_read(df, text_col='selftext'):
    """Generator function to read in text from df and replace line breaks with spaces."""
    for text in df[text_col]:
        # Use a space so that words on either side of a line break are not glued together
        yield text.replace('\n', ' ')

def preprocess(df, text_col='selftext', allowed_postags=('NOUN', 'ADJ')):
    """Preprocessing function to apply to a dataframe."""
    # Only NER is disabled: the part-of-speech tags used below come from the tagger
    for parsed in nlp.pipe(line_read(df, text_col), batch_size=1000, disable=["ner"]):
        # Keep nouns and adjectives that are not punctuation, spaces, or digits
        kept = [token for token in parsed
                if not clean(token) and token.pos_ in allowed_postags]
        # Gather lowercased lemmas ('-PRON-' is the pronoun placeholder in older spaCy versions)
        tokens = [token.lemma_.lower() if token.lemma_ != '-PRON-'
                  else token.lower_
                  for token in kept]
        # Remove possessive markers and stray apostrophes
        tokens = [lemma for lemma in tokens if lemma not in ["'s", "’s", "’"]]
        # Remove stop words
        tokens = [token for token in tokens if token not in spacy.lang.en.stop_words.STOP_WORDS]
        yield tokens

Now let's run `preprocess()` over our dataframe and look at the first output. Although it looks less like coherent text, it is cleaner and will be much easier to apply natural language processing techniques.

In [None]:
# This may take a while
lemmas = list(preprocess(df))

In [None]:
lemmas[0]

<a id='gensim'></a>

# Phrase Modeling with Gensim

Many NLP methods work better when using **n-grams**. An n-gram treats a small group of adjacent words as a single token. This allows words that frequently appear together to be concatenated (e.g., "new york" means something different and more specific than "new" and "york" separately). We most commonly use **bigrams** (2-word phrases) and **trigrams** (3-word phrases).
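
To make the idea concrete, here is a minimal sketch in plain Python that lists *every* run of adjacent words in a short, made-up sentence. The helper name `ngrams` and the example tokens are hypothetical and only serve as an illustration.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined with an underscore."""
    return ['_'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy sentence, invented for illustration only
example_sentence = ['i', 'love', 'new', 'york']

all_bigrams = ngrams(example_sentence, 2)
all_trigrams = ngrams(example_sentence, 3)

print(all_bigrams)
print(all_trigrams)
```

Most of these combinations, such as `love_new`, are not meaningful phrases, which is why we usually want to be more selective about which ones to keep.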

**Phrase modeling** is an approach to learning combinations of tokens that together represent meaningful multi-word concepts. So rather than treating every pair of words as an n-gram, we look for pairs that occur together frequently and identify those as n-grams. This constrains the token space by limiting the number of multi-word tokens, but requires information about which words frequently co-occur, which is where **phrase models** come in. We can develop phrase models by looping over the words in our lemmatized dataset and looking for words that co-occur (i.e., appear one after another) much more frequently than you would expect by random chance.

`gensim` is a popular natural language processing package. We will use it in later lessons for topic modeling and word embeddings. It also contains a [`Phrases`](https://radimrehurek.com/gensim/models/phrases.html) model that implements phrase modeling for identifying bigrams, trigrams, quadgrams, etc. `Phrases` detects phrases based on collocation counts. It builds a model of input text that you then can use on other data.

`gensim` detects a bigram if a scoring function for two words exceeds a threshold. The two important arguments to `Phrases` are `min_count` and `threshold`. The higher the values of these parameters, the harder it is for words to be combined into bigrams. We can change the value of these parameters to fine-tune our model. Try changing `min_count` and `threshold` below. How does that change the output?

In [None]:
from gensim.models.phrases import Phrases, Phraser

# "Documents"
docs = ['new york is great',
        'new york is in the united states',
        'i love to stay in new york',
        'people visit the united states']
# Rudimentary tokenization
tokens = [doc.split(" ") for doc in docs]
# Create bigrams
bigram = Phrases(tokens, min_count=2, threshold=3, delimiter='_')
# Freeze bigrams and apply to data
bigram_phraser = Phraser(bigram)
[bigram_phraser[token] for token in tokens]
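
To give a feel for the kind of scoring function involved, here is a simplified sketch in plain Python. It follows the collocation formula from Mikolov et al. (2013), which, as far as we understand, `gensim`'s default scorer is based on: `(pair_count - min_count) / (count_a * count_b) * vocab_size`. The helper name `phrase_score` and the value `min_count=1` are chosen for illustration only, and `gensim`'s internal bookkeeping of the vocabulary may differ, so these numbers should not be expected to match the `Phrases` model exactly.

```python
from collections import Counter

# Same toy "documents" as above, under new names so nothing is overwritten
example_docs = ['new york is great',
                'new york is in the united states',
                'i love to stay in new york',
                'people visit the united states']
example_tokens = [doc.split(' ') for doc in example_docs]

# Count single words and adjacent word pairs
unigram_counts = Counter(tok for doc in example_tokens for tok in doc)
bigram_counts = Counter(pair for doc in example_tokens for pair in zip(doc, doc[1:]))

def phrase_score(word_a, word_b, min_count=1):
    """Simplified collocation score: high when a pair occurs more often than its parts suggest."""
    pair_count = bigram_counts[(word_a, word_b)]
    vocab_size = len(unigram_counts)
    return (pair_count - min_count) / (unigram_counts[word_a] * unigram_counts[word_b]) * vocab_size

for pair in [('new', 'york'), ('united', 'states'), ('york', 'is')]:
    print(pair, round(phrase_score(*pair), 2))
```

In this sketch, with a threshold of 3, "new york" and "united states" would be merged while "york is" would not. Raising `min_count` or the threshold makes it harder for any pair to qualify.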

Now let's make a bigram and trigram model for our data. Starting from the preprocessed lemmas from the `preprocess()` function above, we can use the `gensim` models to identify bigrams and trigrams in the dataset. Again, the `min_count` and `threshold` arguments can be modified to change the output of the model. 

In the code below we make a bigram model using the gensim `Phrases` object, and then build a trigram model on top of that bigram model. Finally, we apply both models to the lemmas from the preprocessed text to form the trigrams. We are processing the whole dataset in this cell, so it may take a little while to run.

In [None]:
# Create bigram and trigram models
bigram = Phrases(lemmas, min_count=10, threshold=100)
trigram = Phrases(bigram[lemmas], min_count=10, threshold=50)  
bigram_phraser = Phraser(bigram)
trigram_phraser = Phraser(trigram)

# Form trigrams
trigrams = [trigram_phraser[bigram_phraser[doc]] for doc in lemmas]

In [None]:
# Join each into a string
trigrams_joined = [' '.join(trigram) for trigram in trigrams]
trigrams_joined[0]

Once our phrase model has been trained on our total dataset, we can apply it to new text. When our model encounters two tokens in new text that it identifies as a phrase, it will merge the two into a single new token.


In [None]:
trigram_phraser["That", "was", "not", "a", "big", "deal"]
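
Conceptually, applying a trained phrase model is a left-to-right scan that joins adjacent tokens whenever the pair is a known phrase. Here is a minimal sketch of that idea in plain Python. The `known_phrases` set and the `merge_phrases` helper are hypothetical stand-ins, not part of `gensim`, and the real implementation handles more cases.

```python
def merge_phrases(tokens, known_phrases, delimiter='_'):
    """Scan tokens left to right and join any adjacent pair found in known_phrases."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in known_phrases:
            # Known phrase: emit one joined token and skip both words
            merged.append(delimiter.join(pair))
            i += 2
        else:
            # Not a phrase: keep the single token and move on
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical phrase list, standing in for what a trained model has learned
known_phrases = {('big', 'deal')}

merged_example = merge_phrases(['that', 'was', 'not', 'a', 'big', 'deal'], known_phrases)
print(merged_example)
```

Pairs that are not in the phrase list pass through unchanged, so only `big` and `deal` end up joined into one token.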

Let's take a look at the bigram phraser. Its `phrasegrams` attribute stores the detected phrases, and we can use `.keys()` to list them. How many bigrams were identified by the phraser?

In [None]:
len(bigram_phraser.phrasegrams.keys())

Let's print the first few bigrams identified in the model as well to check if they seem like appropriate bigrams. If not, we can change the parameters of the bigram model to adjust the sensitivity of the model.

In [None]:
list(bigram_phraser.phrasegrams.keys())[:10]

In [None]:
# Look at trigrams
[trigram for trigram in list(trigram_phraser.phrasegrams.keys()) if trigram.count('_') == 2]

<a id='save'></a>

# Saving Data

Finally, let's save our data. In Python, **pickling** is the process of converting Python objects into a binary format that can be stored or transferred. You can then reconstruct the original object from that binary format.

Pickling is useful when you want to save the state of an object so that it can be used later, or when you need to transfer an object between different Python processes.

In [None]:
import pickle

with open('aita_trigrams.pickle', 'wb') as f:
    # Save (or "dump") the object into the file
    pickle.dump(trigrams_joined, f)

# Open the pickled file and load the object
with open('aita_trigrams.pickle', 'rb') as f:
    # Load the list of joined trigram strings back from the file
    loaded_trigrams = pickle.load(f)

<div class="alert alert-success">

## ❗ Key Points

* The Pandas `DataFrame` format can be used to save Reddit data.
* Before cleaning text data, it is a good idea to drop rows and columns you don't need.
* SpaCy can be used to preprocess textual data, including tokenization and lemmatization.
* Gensim can be used to combine tokens into N-grams.
* Python objects can be easily saved in a binary format. This is called "pickling".

</div>