# Data Science for Social Justice Workshop: Preprocessing

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Understand why textual data needs to be preprocessed.
* Engage in common preprocessing tasks, such as lemmatization and phrase modeling.
* Distinguish between different Python packages for preprocessing.
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.<br>
💭 **Reflection**: Reflecting on ethical implications, biases, and social impact in data science.<br>

### Sections
1. [About This Workshop](#intro)
2. [Reading Data with Pandas](#read)
3. [Dropping Columns and Missing Values](#drop)
4. [Preprocessing Text Data with SpaCy](#clean)
5. [Phrase Modeling with Gensim](#gensim)
6. [Saving Data](#save)

<a id='intro'></a>

# About This Workshop

One of the most common types of data used in machine learning and data science is text data. This includes social media posts, novels, customer reviews, interview transcripts, and many others. Text is a powerful source of information, but requires specific preprocessing in order to be used in machine learning contexts. In particular, machine learning algorithms are designed to work with numerical data, so the end goal of text preprocessing is to have numeric features associated with each text in the dataset. 

In this notebook we will go over a preprocessing **pipeline** (or series of sequential steps) for text, which is the first step that we need to take in order to analyze text using our data science techniques.

This notebook is designed to help you: 

1. Read, manipulate, and write `.csv` files in a `pandas` dataframe;
2. Preprocess text with the following key skills: tokenizing, stop-word removal, n-grams extraction, and lemmatization using `spaCy`;

In Python Fundamentals, you were introduced to the basics of the `pandas` package, which we will use extensively in this notebook. This module is designed to accelerate your coding skills so that you can use Python effectively for generating research results. However, if you have less experience with Python, some of this will be overwhelming. That's totally normal, since it takes more than a couple of hours to learn how to code! 

If you are feeling overwhelmed, it can be helpful to focus on the broader purpose of the functions in the notebook for now and how we can use them to further our research purposes, rather than the details of the algorithms. It can also help to remember that you don't need to memorize every function you want to use. Even programmers who have been working with Python for years will regularly refer to documentation and resources (like this notebook!) while they are developing a project. With time and experience as you progress through the notebooks you will get more comfortable with the Python code underlying this notebook.

⚠️ **Warning:** Some of the code will take a long time to execute. The `*` on the left side of the cell will indicate that the cell is still running. The nice thing about preprocessing is that once we have the pipeline complete, we will save our results so we don't need to re-run these cells.

<a id='read'></a>

# Recap: Our Data

The first step is to read the data we are going to work with into Python so that we can work with it.

We're using the dataset taken from the subreddit [r/amitheasshole](http://www.reddit.com/r/AmItheAsshole). 

The subreddit describes itself as follows:

<img src="../../images/aita_desc.png" alt="Am I The Asshole - description" width="300"/>

The subreddit has structures in place that the community follow to come to a decision about the situation. First, OP (original poster) writes up the situation, asking AITA (Am I The Asshole). In response, for eighteen hours, the community of the subreddit will respond to the post with one of five judgments: YTA (You’re The Asshole), NTA (Not The Asshole), ESH (Everyone Sucks Here), NAH (No Assholes Here), or INFO (Not Enough Info).

💡 **Tip**: For more info on the subreddit, see [here](https://www.inverse.com/culture/am-i-the-asshole/amp). 

First, we have to read the data. We'll use a subset of the full dataset consisting of the top most popular posts, the assumption being that this will yield the most interesting results (`aita_sub_top_sm.csv`). We use `pd.read_csv()` to import the .csv file as a DataFrame.

In [162]:
# Import the pandas package
import pandas as pd 

# Read the csv file
df = pd.read_csv('../../data/aita_top_submissions.csv')

In [163]:
pd.__version__

'1.5.3'

Let's have a look at look and the shape, top rows, and columns of the data.

In [164]:
df.shape

(20000, 9)

In [165]:
df.head()

Unnamed: 0,idstr,created,author,title,selftext,score,num_comments,nsfw,flair_text
0,t3_72kg2a,2017-09-26 13:48:09,Ritsku,AITA for breaking up with my girlfriend becaus...,My girlfriend recently went to the beach with ...,679.0,434.0,0.0,no a--holes here
1,t3_94kvhi,2018-08-04 17:34:55,hhhhhhffff678,AITA for banning smoking in my house and telli...,My parents smoke like chimneys. I used to as w...,832.0,357.0,0.0,asshole
2,t3_951az2,2018-08-06 13:31:39,creepatthepool,AITA? Creep wears skimpy bathing suit to pool,Hi guys. Throwaway for obv reasons.\n\nI'm a f...,23.0,335.0,0.0,Shitpost
3,t3_978ioa,2018-08-14 13:50:41,Pauly104,AITA for eating steak in front of my vegan GF?,"Yesterday night, me and my GF decided to go ou...",1011.0,380.0,0.0,not the a-hole
4,t3_99yo3c,2018-08-24 16:03:40,ThatSpencerGuy,AITA for not wanting to cook my mother-in-law ...,"My wife and I are vegetarians, much to my in-l...",349.0,360.0,0.0,not the a-hole


In [166]:
# This allows you to quickly see which columns you have
list(df)

['idstr',
 'created',
 'author',
 'title',
 'selftext',
 'score',
 'num_comments',
 'nsfw',
 'flair_text']

This particular dataset only includes the original posts in the subreddit (so not the comments on the posts). 

There is one row per post in the dataset. The columns are as follows:

-  `idstr`: ID of the post.
- `created`: the time of the post's creation.
- `author`: Reddit author of the post.
- `title`: Title of the post.
- `selftext`: Text of the post.
- `score`: Amount of upvotes minus downvotes.
- `textlen`: Amount of words.
- `num_comments`: Amount of comments.
- `nsfw`: Flag for NSFW content.
- `flair_text`: A 'tag' that users within a subreddit can add.

💭 **Reflection**: Which of these columns could contain interesting data for your research purposes? How could you make use of things like the number of comments, NSFW flags, or even text length, in relation to a discourse community?

## Removing Missing Values

Data collected from the internet typically has missing values and other inconsistencies. It's typically called "dirty data", and it needs to be cleaned up first.

First, we want to select the rows of deleted posts. On Reddit, removed posts get flagged as "[removed]" or "[deleted]". We should remove these posts, since they'll lack any text. 

We can do this using the [`isin()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html) method. We run it on the `selftext` column and include a list of phrases that indicate a removed post – that is, "[removed]" and "[deleted]".

The following line of code will select all lines that do not have 'removed' or 'deleted' in the post's text. The `~` means 'not'.

In [167]:
# Select all rows that don't have '[removed]' or '[deleted]'
df = df.loc[~df['selftext'].isin(['[removed]', '[deleted]' ]),:]

# Select all rows that have >3 characters in selftext
df = df.loc[df['selftext'].str.len() > 3]

df.shape

(16309, 9)

🔔 **Question**: How many posts are left in the dataset? How many did we lose?

## Removing Null Values

Next, we need to drop **null values**. These are values that are totally missing. In this case, the web scraper may have been unable to extract text for the post. They are replaced with the value **NaN**, which stands for a null value in `pandas`. We need to deal with null values in any column that we plan to use for analysis, which in this case is the `selftext` column. We can use [`dropna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html?highlight=dropna#pandas.DataFrame.dropna) to remove the rows with null values in the target column. 

Let's use the `dropna()` method on the dataframe. We use the argument `subset`, which you set to `'selftext'`). Check the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) if you want a reminder of how this works!

In [168]:
df = df.dropna(subset=['selftext'])

🔔 **Question**: How many posts are left in the dataset? How many did we lose?

In [169]:
df.shape

(16309, 9)

<a id='preprocess'></a>

# Preprocessing Text Data with SpaCy

Text data collected in the real world is always going to be variable, which poses a challenge for analysis. But by reducing some of this variation, we can help improve our results. For example, if we are counting instances of the word `"weather"` in text, we might want the strings `"weather"`, `"weather."`, and `"Weather"` to all be counted as instances of the same word. However, in raw text form, these would be treated as separate strings. By performing text cleaning, we can standardize these cases and make our data easier to analyze. Some common preprocessing steps are:

- Removing punctuation
- Removing URLs
- Removing stopwords (non-content words like "a", "the", "is", etc.)
- Lowercasing
- Tokenization (e.g., splitting a sentence into distinct "chunks" or "tokens")
- Lemmatization, or changing words to 'dictionary form' (e.g., runs, running, ran -> run)

Fortunately, we don't need to code every one of these steps. Instead, we will use a package called [spaCy](https://spacy.io/) to do these things. If the text you'd like to process is general-purpose English language text (i.e., not domain-specific, like medical literature), `spaCy` is ready to use out-of-the-box. 

However, we need to do one more installation: the underlying language "pipeline" that spaCy uses. This pipeline contains all the rules and code spaCy uses to perform text preprocessing. Using this pipeline, spaCy will extract information about **attributes** of our text: the tokens, lemmas, stopwords, and so on.

The most common pipeline to use is the [`en_core_web_sm`](https://spacy.io/models/en/#en_core_web_sm) pipeline, which is the English ("en") model, which contains many "core" functions, is developed partially with "web" data, and is the small version ("sm").

In [170]:
# Uncomment the following lines to install spaCy and language pipeline

#%pip install -U pydantic spacy
#%python -m spacy download en_core_web_sm

In [171]:
# Import spaCy
import spacy

# Load the English preprocessing pipeline
nlp = spacy.load('en_core_web_sm')

We can now use this `nlp` object--holding a spaCy pipeline--to parse a Reddit post in our dataset.

In [172]:
# Parse a reddit post in the dataset
parsed_post = nlp(df['selftext'][1000])
print(parsed_post)

“Are we the assholes” really because my husband made the decision with me.

My father passed away a few months ago and left a pretty sizeable estate behind. The majority went to my sister and me, with an equal amount of money to each grandchild (or so I thought.) 

My kids are grown so I never really thought to reach out about the money because the executor was handling it. My daughter never mentioned it, but she’s always been frugal so I figured she didn’t want to discuss it and had some plan for it.

Recently we had a family meal and my son brought up the inheritance, and my daughter revealed that she had received nothing and didn’t know that any of her cousins or her brother had. It is very clear to us that she was cut from the will because my father always disapproved of my son in law.

My husband and I talked it over and decided to give a portion of our inheritance to our daughter and her husband to match the amount given to the other grandkids. We are still working with a lawyer 

The base text looks the same, but we can take a closer look at `parsed_post` to see what happened under the hood. A lot of the information SpaCy has gathered operates on each **token** in the text. 

<a id='iter'></a>

# For-Loops

In the following, we will be iterating over the tokens we've created in order to see all the **attributes** spaCy extracted. This involves the use of **for-loops**, which we haven't covered yet.

A **[for loop](https://www.w3schools.com/python/python_for_loops.asp)** executes some statements once *for* each value in an iterable (like a list or a string). It says: "*for* each thing in this group, *do* these operations".

In [173]:
tokens = ['Data','science','for','social','justice']

for token in tokens:
    up = token.upper()
    print(up)

DATA
SCIENCE
FOR
SOCIAL
JUSTICE


Let's look at the syntax of this `for` loop a bit more closely:

<img src="../../images/for.svg" alt="For loop in Python" width="700"/>

Pay attention to the **loop variable** (`token`). It stands for each item in the list (`tokens`) we are iterating through. Loop variables can have any name; if we'd change it to `x`, it would still work. However, loop variables only exist inside the loop.


### Accumulator Variables

In the above example, we are simply printing each value in `tokens`. However, we often will want to save these values somehow. We can do this with a so-called **accumulator variable**. Let's see how that works:

In [174]:
tokens = ['Data','science','for','social','justice']
cap_tokens = []

for token in tokens:
    up = token.upper()
    cap_tokens.append(up)
    
print(cap_tokens)

['DATA', 'SCIENCE', 'FOR', 'SOCIAL', 'JUSTICE']


We first initialize an accumulator variable `cap_tokens`, which will hold the uppercased strings.
Then, in the body of the loop, we add each uppercased string to that list. As the loop runs, the list gets filled with the uppercased strings. 

## SpaCy Attributes

When we ran spaCy on our post, it automatically tokenized all our data. If we loop over our `parsed_post`, we will loop over all of these tokens. We can now look at some **attributes** that SpaCy has extracted for each token. 

We will save these attributes in a list of dictionary items, then put that list into a DataFrame like we did in Python Fundamentals.

In [175]:
# Create an empty list to store token attributes
token_data = []

# Iterate over tokens and extract attributes
for token in parsed_post:
    token_data.append({
        "text": token.text,
        "lemma": token.lemma_,
        "pos": token.pos_,
        "tag": token.tag_,
        "dep": token.dep_,
        "shape": token.shape_,
        "is_alpha": token.is_alpha,
        "is_stop": token.is_stop,
    })

# Create a pandas DataFrame from the token data
token_df = pd.DataFrame(token_data)

# Display the DataFrame
token_df

Unnamed: 0,text,lemma,pos,tag,dep,shape,is_alpha,is_stop
0,“,"""",PUNCT,``,punct,“,False,False
1,Are,be,AUX,VBP,ROOT,Xxx,True,True
2,we,we,PRON,PRP,nsubj,xx,True,True
3,the,the,DET,DT,det,xxx,True,True
4,assholes,asshole,NOUN,NNS,attr,xxxx,True,False
...,...,...,...,...,...,...,...,...
314,giving,give,VERB,VBG,conj,xxxx,True,False
315,our,our,PRON,PRP$,poss,xxx,True,True
316,son,son,NOUN,NN,dative,xxx,True,False
317,anything,anything,PRON,NN,dobj,xxxx,True,True


As you can see, `spaCy` does a *lot* of work. Let's have a look at the [documentation](https://spacy.io/api/attributes) to see which attributes we are looking at here.

## Preprocessing all data
Now we can scale up this process to our whole dataset. Up until this point, we have run the preprocessing pipeline on a single post, but we want to automate our code to process all posts at once. 

To do this, Let's define a helper function that we'll use for preprocessing. The `process_text` function will use `spaCy` to:

- Iterate over the dataframe.
- Segment the threads into individual sentences.
- Remove punctuation, unneccessary tokens such as stopwords, and excess whitespace.
- Lemmatize the text.

The function will automate the preprocessing for the posts in this dataset. Don't worry too much about deciphering each line of code. The main goal of this function is to do a lot of the preprocessing for you so that you can use the text in analysis going forward. 

**You can find this function in the `1_Preprocessing_Project.ipynb` notebook to use on your own data.**

In [176]:
def process_text(text):
    """Function to process a single text string."""

    text = text.replace('\n', '')
    parsed = nlp(text, disable=["tok2vec", "ner"])

    # Gather lowercased, lemmatized tokens that are not punctuation, space, or digit
    tokens = [
        token.lemma_.lower() if token.lemma_ != '-PRON-'
        else token.lower_ 
        for token in parsed 
        if not (token.is_punct or token.is_space or token.is_digit)
    ]

    # Remove specific lemmatizations, and words that are not nouns or adjectives
    tokens = [
        lemma
        for lemma in tokens
        if not lemma in ["'s",  "’s", "’"]
    ]

    # Remove stop words
    tokens = [
        token 
        for token in tokens 
        if token not in spacy.lang.en.stop_words.STOP_WORDS
    ]

    return ' '.join(tokens)

We also create the `preprocess` function, which will call the above helper function and `apply()` it over our dataframe.

Note that we are setting a parameter `text_col` which refers to the name of the text column we are expecting. We do this so it's easy to run `preprocess` when the text column has another name. This is the case when you're working with .CSV files with comments--the text column will not be called `selftext` but `body`. 

In [177]:
def preprocess(df, text_col='selftext'):
    """Preprocessing function to apply to a dataframe."""

    df['pp_text'] = df[text_col].apply(process_text)

    return df


Now let's run `preprocess()` over our dataframe.

In [178]:
# This may take a while
df = preprocess(df, text_col='selftext')

In [179]:
df.head()

Unnamed: 0,idstr,created,author,title,selftext,score,num_comments,nsfw,flair_text,pp_text
0,t3_72kg2a,2017-09-26 13:48:09,Ritsku,AITA for breaking up with my girlfriend becaus...,My girlfriend recently went to the beach with ...,679.0,434.0,0.0,no a--holes here,girlfriend recently went beach friends tiny bi...
1,t3_94kvhi,2018-08-04 17:34:55,hhhhhhffff678,AITA for banning smoking in my house and telli...,My parents smoke like chimneys. I used to as w...,832.0,357.0,0.0,asshole,parents smoke like chimneys quit wife got youn...
2,t3_951az2,2018-08-06 13:31:39,creepatthepool,AITA? Creep wears skimpy bathing suit to pool,Hi guys. Throwaway for obv reasons.\n\nI'm a f...,23.0,335.0,0.0,Shitpost,hi guys throwaway obv reasons i'm female child...
3,t3_978ioa,2018-08-14 13:50:41,Pauly104,AITA for eating steak in front of my vegan GF?,"Yesterday night, me and my GF decided to go ou...",1011.0,380.0,0.0,not the a-hole,yesterday night gf decided eat vegan day going...
4,t3_99yo3c,2018-08-24 16:03:40,ThatSpencerGuy,AITA for not wanting to cook my mother-in-law ...,"My wife and I are vegetarians, much to my in-l...",349.0,360.0,0.0,not the a-hole,wife vegetarians laws vocal annoyance year vis...


## 💭 Reflection

In one of the lines of our `preprocess` function, we are getting rid of stopwords. These are the most common words in any language (like articles, prepositions, pronouns, conjunctions), such as "the", "a", "an", and "so".

Stopwords are considered "low-level information": the removal of these words does not show any negative consequences on the model we train for our task. It also reduces the dataset size and thus reduces the training time due to the fewer number of tokens involved in the training.

Think about what **ethical implications** there could be for removing stopwords. What kind of information are we losing here? What would be a situation in which you would want to keep stopwords?

<a id='gensim'></a>

# Phrase Modeling with Gensim

Many kinds of NLP methods work better when using **N-grams**. An n-gram treats small groups of words as tokens rather than single words. This allows words that frequently appearing together to be concatenated (e.g. "new york" means something different and more specific than "new" and "york" separately). We most commonly use **bigrams** (2-word phrases) and **trigrams** (3-word phrases).

**Phrase modeling** is an approach to learning combinations of tokens that together represent meaningful multi-word concepts. So rather than treating every pair of words as a n-gram, we look for pairs that occur together frequently and identify those as n-grams. This constrains the token space by limiting the number of multi-word tokens, but requires information about what words co-occur together frequently, which is where **phrase models** come in. We can develop phrase models by looping over the the words in our lemmatized dataset and looking for words that co-occur (i.e., appear one after another) together much more frequently than you would expect them to by random chance. 

`gensim` is a popular natural language processing package. We will use it in later lessons for topic modeling and word embeddings. It also contains a [`Phrases`](https://radimrehurek.com/gensim/models/phrases.html) model that implements phrase modeling for identifying bigrams, trigrams, quadgrams, etc. `Phrases` detects phrases based on collocation counts. It builds a model of input text that you then can use on other data.

`gensim` detects a bigram if a scoring function for two words exceeds a threshold. The two important arguments to `Phrases` are `min_count` and `threshold`. The higher the values of these parameters, the harder it is for words to be combined to bigrams. We can change the value of these parameters to fine-tune our model. Try changing `min count` and `threshold` below. How does that change the output?

In [180]:
from gensim.models.phrases import Phrases, Phraser

# "Documents"
docs = ['new york is great',
        'new york is in the united states',
        'i love to stay in new york',
        'people visit the united states']
# Rudimentary tokenization
tokens = [doc.split(" ") for doc in docs]
# Create bigrams
bigram = Phrases(tokens, min_count=2, threshold=3, delimiter='_')
# Freeze bigrams and apply to data
bigram_phraser = Phraser(bigram)
[bigram_phraser[token] for token in tokens]

[['new_york', 'is', 'great'],
 ['new_york', 'is', 'in', 'the', 'united', 'states'],
 ['i', 'love', 'to', 'stay', 'in', 'new_york'],
 ['people', 'visit', 'the', 'united', 'states']]

Now let's make a bigram and trigram model for our data. Starting from the preprocessed lemmas from the `preprocess()` function above, we can use the `gensim` models to identify bigrams and trigrams in the dataset. Again, the `min_count` and `threshold` arguments can be modified to change the output of the model. 

In the code below we make a bigram model using the gensim `Phrases` object, and then build a trigram model on top of that bigram model. Finally, we use the lemmas from the preprocessed text to make a trigram model of the data. We are processing the whole dataset in this cell, so it may take a little while to run.

In [181]:
# Create bigram and trigram models
tokens = [doc.split(" ") for doc in df['pp_text']]

bigram = Phrases(tokens, min_count=10, threshold=100)
trigram = Phrases(bigram[tokens], min_count=10, threshold=50)  
bigram_phraser = Phraser(bigram)
trigram_phraser = Phraser(trigram)

# Form trigrams
df['pp_text'] = [' '.join(trigram_phraser[bigram_phraser[doc]]) for doc in tokens]

In [182]:
df.head()

Unnamed: 0,idstr,created,author,title,selftext,score,num_comments,nsfw,flair_text,pp_text
0,t3_72kg2a,2017-09-26 13:48:09,Ritsku,AITA for breaking up with my girlfriend becaus...,My girlfriend recently went to the beach with ...,679.0,434.0,0.0,no a--holes here,girlfriend recently went beach friends tiny bi...
1,t3_94kvhi,2018-08-04 17:34:55,hhhhhhffff678,AITA for banning smoking in my house and telli...,My parents smoke like chimneys. I used to as w...,832.0,357.0,0.0,asshole,parents smoke like chimneys quit wife got youn...
2,t3_951az2,2018-08-06 13:31:39,creepatthepool,AITA? Creep wears skimpy bathing suit to pool,Hi guys. Throwaway for obv reasons.\n\nI'm a f...,23.0,335.0,0.0,Shitpost,hi guys throwaway obv reasons i'm female child...
3,t3_978ioa,2018-08-14 13:50:41,Pauly104,AITA for eating steak in front of my vegan GF?,"Yesterday night, me and my GF decided to go ou...",1011.0,380.0,0.0,not the a-hole,yesterday night gf decided eat vegan day going...
4,t3_99yo3c,2018-08-24 16:03:40,ThatSpencerGuy,AITA for not wanting to cook my mother-in-law ...,"My wife and I are vegetarians, much to my in-l...",349.0,360.0,0.0,not the a-hole,wife vegetarians laws vocal annoyance year vis...


In [183]:
# Have a look at a slice of the first post
df['pp_text'][0][25:35]

'beach frie'

Once our phrase model has been trained on our total dataset, we can apply it to new text. When our model encounters two tokens in new text that identifies as a phrase, it will merge the two into a single new token.


In [184]:
trigram_phraser["That", "was", "not", "a", "big", "deal"]

['That', 'was', 'not', 'a', 'big_deal']

Let's take a look at the bigram parser. We can use `.keys()` to identify the bigrams in the dataset. How many bigrams were identified by the parser?

In [185]:
len(bigram_phraser.phrasegrams.keys())

775

Let's print the first few bigrams identified in the model as well to check if they seem like appropriate bigrams. If not, we can change the parameters of the bigram model to adjust the sensitivity of the model.

In [186]:
list(bigram_phraser.phrasegrams.keys())[:10]

['worth_mentioning',
 'big_deal',
 'security_camera',
 'mashed_potatoes',
 'walking_eggshells',
 'piece_shit',
 'forgot_mention',
 'high_school',
 'outside_perspective',
 'gain_weight']

In [187]:
# Look at trigrams
[trigram for trigram in list(trigram_phraser.phrasegrams.keys()) if trigram.count('_') == 2]

['blocked_social_media',
 'social_media_accounts',
 'long_story_short',
 'mental_health_issues',
 'upper_middle_class',
 'giving_silent_treatment',
 'bit_taken_aback',
 'r_amitheasshole_comments',
 'real_estate_agent',
 'playing_video_games',
 'plays_video_games',
 'high_school_sweetheart',
 'credit_card_debt',
 'hell_broke_loose',
 'posting_social_media',
 'blah_blah_blah',
 'rubbed_wrong_way',
 'straw_broke_camel',
 'play_video_games',
 'posted_social_media',
 'worst_case_scenario',
 'giving_cold_shoulder',
 'passive_aggressive_comments',
 '=_share&utm_medium',
 'mobile_sorry_formatting',
 'slammed_door_face',
 'post_social_media',
 'straw_broke_camels',
 'aunts_uncles_cousins',
 'monthly_open_forum',
 'place_share_meta',
 'dialog_mod_team.#keep',
 'civil_rules_apply',
 'discourage_brigading_needs',
 'context_use_modmail',
 'uncensored_screenshots_comments']

<a id='save'></a>

# Saving Data

Let's save this cleaned-up dataframe in a new CSV.

In [188]:
df.to_csv('../../data/aita_pp.csv')

After running `to_csv()`, you'll see the data file gets added in the "data" folder, which is in the main folder of this repository. 

🔔 **Question**: Can you find the data file using a file browser (either in Jupyter or on your machine)?

## Pickling

In Python, **pickling** is the process of converting Python objects into binary format that can be stored or transferred. You can then reconstruct the original object from that binary format.

Pickling is useful when you want to save the state of an object so that it can be used later, or when you need to transfer an object between different Python processes.

For convenience, let's convert the `pp_text` column into a list and save it to a file.

In [189]:
import pickle

with open('../../data/aita_submission_trigrams.pickle', 'wb') as f:
    # Save (or "dump") the object into the file
    pickle.dump(df['pp_text'].tolist(), f)

⚠️ **Warning:** `aita_trigrams.pickle` is a type of file you can't just open with another program, as it's in binary format. You need to use a special pickle method called `load()` to open the pickled file and load the object:

In [190]:
with open('../../data/aita_submission_trigrams.pickle', 'rb') as f:
    # Load the object from the file
    trigrams = pickle.load(f)

In [191]:
trigrams[0]

'girlfriend recently went beach friends tiny bikini basically thong hate wears public wore wear posed bathroom mirror hotel room profile picture ass sticking posted snapchat story worth_mentioning friends snapchat reasons similar sick getting fights says going girls night posts videos sitting table like dudes got invited girls completely unknown arrived guys adds snapchat save send saw showing pics beach trip screenshot particular snap left camera roll ass intentional know claims liked way stomach looked pic beach black bikini owns look micro bikini bathroom mirror ocean sand friends tits ass bathroom mirror age typical year_old behavior nowadays wrong thinking inappropriate behavior relationship asshole making big_deal it?**edit guys let clear try control point time told wear yes hate ass completely bikini told wear problem wore general wore summer long despite groaning issue found super annoying yes issue sent picture ass snapchat plain simple drew line break wore broke turned sidewa

<div class="alert alert-success">

## ❗ Key Points

* The Pandas `DataFrame` format can be used to save Reddit data.
* Before cleaning text data, it is a good idea to drop rows and columns you don't need.
* SpaCy can be used to preprocess textual data, including tokenization and lemmatization.
* Gensim can be used to combine tokens into N-grams.
* Python objects can be easily saved into binary format. This is called "pickling".

</div>