# Data Science for Social Justice Workshop: Preprocessing – Project Notebook

Use this notebook for carrying out the analyses from the workshop notebook on your own subreddit data.

### Icons Used in This Notebook
💭 **Reflection**: Reflecting on ethical implications, biases, and social impact in data science.<br>

## Reading the Data

Put your data in the `data` folder of this repo and replace `YOUR_FILE.csv` below with the name of your file.

In [None]:
# Import the pandas package
import pandas as pd 

# Replace this with your own file!
df = pd.read_csv('../../data/YOUR_FILE.csv')

Check out the shape, first rows, and columns.

In [None]:
df.shape

In [None]:
df.head()

In [None]:
# This allows you to quickly see which columns you have
list(df)

## 💭 Reflection

Take some time to look at the Reddit community you have chosen with your teammates. If it no longer exists, look at some of the posts in your DataFrame. Discuss the following questions:

- What kinds of norms and values does this community seem to be organized around?
- Does the community include a FAQ or Wiki page that explains rules for interaction among members? What are these rules?
- What are the most popular posts of all time? Do they have something in common?
- Are there dissenting voices when it comes to these norms? How do others respond to them?

## Removing columns and rows

In [None]:
# Drop some columns
df = df.drop(['self', 'url', 'subreddit', 'augmented_at', 'augmented_count'], axis=1)

NOTE: If you are preprocessing a **comments** file, the `selftext` column is called `body`! Make sure to replace this in the code below or you'll get a `KeyError`.
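
If you switch between submissions and comments files, one way to avoid editing the column name by hand is to look it up once. The sketch below is only an illustration: `find_text_column` is a hypothetical helper, not part of the workshop code, and the column lists are toy stand-ins for `list(df)`.

```python
# Hypothetical helper (not part of the workshop code): pick the text column
# based on which of the two expected names is present.
def find_text_column(columns, candidates=("selftext", "body")):
    """Return the first candidate column name found in `columns`."""
    for name in candidates:
        if name in columns:
            return name
    raise KeyError(f"None of {candidates} found in columns: {list(columns)}")

# Toy column lists standing in for list(df) on a submissions and a comments file
submission_cols = ["id", "title", "selftext", "score"]
comment_cols = ["id", "body", "score"]

print(find_text_column(submission_cols))  # selftext
print(find_text_column(comment_cols))     # body
```

With a real DataFrame you could write `text_col = find_text_column(df.columns)` and then use `df[text_col]` wherever the cells below use `df['selftext']`.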

In [None]:
# Select rows that don't have 'removed' or 'deleted' as the selftext
df = df.loc[~df['selftext'].isin(['[removed]', '[deleted]'])]
df.shape

In [None]:
# Drop null values in selftext
df = df.dropna(subset=['selftext'])
df.shape

Let's save this cleaned-up dataframe in a new CSV.

In [None]:
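# Use a different file name than the one you loaded, so the raw data is not overwritten.
# index=False keeps the row index from being written as an extra unnamed column.
df.to_csv('../../data/YOUR_FILE.csv', index=False)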

## Preprocessing Data with spaCy

In [None]:
import spacy
from gensim.models.phrases import Phrases, Phraser

# Load the small English pipeline
nlp = spacy.load('en_core_web_sm')

In [None]:
def preprocess(df, text_col='selftext'):
    """Yield a list of cleaned, lemmatized tokens for each text in `df[text_col]`."""
    
    for text in df[text_col]:
        # Replace newlines with spaces so words on adjacent lines are not glued together
        text = text.replace('\n', ' ')
        # Skip components we don't need. Keep "tok2vec": in spaCy v3 the tagger,
        # and therefore the lemmatizer, relies on it.
        parsed = nlp(text, disable=["parser", "ner"])

        # Gather lowercased, lemmatized tokens that are not punctuation, space, or digit
        # ('-PRON-' is the pronoun lemma used by spaCy v2; keep the original word then)
        tokens = [
            token.lemma_.lower() if token.lemma_ != '-PRON-'
            else token.lower_ 
            for token in parsed 
            if not (token.is_punct or token.is_space or token.is_digit)
        ]

        # Remove possessive markers left over from tokenization
        tokens = [
            lemma
            for lemma in tokens
            if lemma not in ["'s",  "’s", "’"]
        ]

        # Remove stop words
        tokens = [
            token 
            for token in tokens 
            if token not in spacy.lang.en.stop_words.STOP_WORDS
        ]

        yield tokens

In [None]:
# This may take a while
# For a comments file, call preprocess(df, text_col='body') instead
lemmas = list(preprocess(df))

## Phrase modeling

In [None]:
from gensim.models.phrases import Phrases, Phraser

# Create bigram and trigram models
bigram = Phrases(lemmas, min_count=10, threshold=100)
trigram = Phrases(bigram[lemmas], min_count=10, threshold=50)  
bigram_phraser = Phraser(bigram)
trigram_phraser = Phraser(trigram)

# Form trigrams
trigrams = [trigram_phraser[bigram_phraser[doc]] for doc in lemmas]

In [None]:
# Join each into a string
trigrams_joined = [' '.join(trigram) for trigram in trigrams]
trigrams_joined[0]

Check how many bigrams were identified by the phrase model.

In [None]:
len(bigram_phraser.phrasegrams.keys())

Print the first few bigrams identified by the model to check whether they seem appropriate. If not, adjust the sensitivity of the bigram model through the values of `min_count` and `threshold` above: lower values produce more phrases, higher values produce fewer.

In [None]:
list(bigram_phraser.phrasegrams.keys())[:10]
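
To build some intuition for how `min_count` and `threshold` interact, the sketch below re-implements a simplified version of the scoring formula from Mikolov et al. (2013), which gensim's default scorer is based on: `(count(ab) - min_count) * N / (count(a) * count(b))`. This is not gensim's actual code. The function name `score_bigrams` and the toy documents are made up for illustration, and `N` is taken to be the number of distinct words, which is a simplification.

```python
from collections import Counter

def score_bigrams(docs, min_count=1, threshold=1.0):
    """Score adjacent word pairs and keep those scoring above `threshold`."""
    unigrams = Counter(word for doc in docs for word in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # simplification: distinct single words only

    kept = {}
    for (a, b), n_ab in bigrams.items():
        # Pairs seen min_count times or fewer get a score of zero or below
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            kept[f"{a}_{b}"] = score
    return kept

# Toy tokenized documents (hypothetical data)
docs = [
    ["mental", "health", "is", "important"],
    ["mental", "health", "support", "is", "good"],
    ["health", "care", "is", "good"],
    ["mental", "health", "care"],
]

strict = score_bigrams(docs, min_count=1, threshold=1.0)
loose = score_bigrams(docs, min_count=1, threshold=0.5)

print(sorted(strict))  # ['is_good', 'mental_health']
print(sorted(loose))   # ['health_care', 'is_good', 'mental_health']
```

Lowering `threshold` lets in `health_care`, a pair whose words also occur often on their own. Raising `min_count` to 2 with `threshold=1.0` leaves no phrases at all in this tiny corpus.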

In [None]:
# Look at trigrams: phrases with two underscores join three words
[phrase for phrase in trigram_phraser.phrasegrams if phrase.count('_') == 2]

## Saving data

Give this file a descriptive name, for example one that indicates whether you preprocessed submissions or comments.

In [None]:
import pickle

with open('../../data/YOUR_FILE.pickle', 'wb') as f:
    # Save (or "dump") the object into the file
    pickle.dump(trigrams_joined, f)
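
Before moving on, it can be worth checking that the saved file loads back correctly. The sketch below shows the round trip on toy data: the list `docs` stands in for `trigrams_joined`, and the file is written to a temporary folder (an assumption made so the example does not touch your `data` folder).

```python
import os
import pickle
import tempfile

# Toy stand-in for `trigrams_joined`
docs = ["mental_health support group", "new post today"]

# Write to a temporary folder so this sketch does not touch your data folder
path = os.path.join(tempfile.mkdtemp(), "example_preprocessed.pickle")
with open(path, "wb") as f:
    pickle.dump(docs, f)

# Open in binary read mode ('rb') and load the object back
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == docs)  # True
```

For your own data, open the path you used above with `'rb'` and compare the result to `trigrams_joined`.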