# Python Text Analysis Fundamentals: Solutions

This notebook contains solutions for both text preprocessing and bag-of-words representation challenges.

## Part 1: Text Preprocessing Solutions

In [None]:
import pandas as pd
import os
import re
import nltk
import spacy

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation

In [None]:
# Import pandas
import pandas as pd

# Use pandas to import Tweets
csv_path = '../../../data/airline_tweets.csv'
tweets = pd.read_csv(csv_path, sep=',')

## 🥊 Challenge 1: Preprocessing with Multiple Steps

We've learned a few preprocessing operations so far, so let's put them together in a function! It will be a handy one to refer to whenever you work with messy English text data and want to preprocess it in a single call.

The example text data for challenge 1 has been read in. Write a function to:
- Lowercase the text
- Remove punctuation marks
- Remove extra whitespace characters

Feel free to recycle the code we've used above!
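
The three steps can also be chained without an explicit character loop. The following is a minimal sketch, assuming only ASCII punctuation (the characters in `string.punctuation`) needs to be removed; the name `clean_text_translate` and the sample string are made up for illustration.

```python
import re
from string import punctuation

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', punctuation)

def clean_text_translate(text):
    '''Lowercase, strip ASCII punctuation, and collapse whitespace.'''
    text = text.lower()
    text = text.translate(PUNCT_TABLE)
    # Collapse runs of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Hypothetical messy input
sample = '  Hello,   World!\nThis is   a TEST...  '
print(clean_text_translate(sample))
```

Note that non-ASCII punctuation such as curly quotes is not in `string.punctuation`, so it would survive this step.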

In [None]:
challenge1_path = '../../../data/example1.txt'

with open(challenge1_path, 'r') as file:
    challenge1 = file.read()
    
print(challenge1)

In [None]:
from string import punctuation

def remove_punct(text):
    '''Remove punctuation marks in input text'''
    
    # Select characters not in punctuation
    no_punct = []
    for char in text:
        if char not in punctuation:
            no_punct.append(char)

    # Join the characters into a string
    text_no_punct = ''.join(no_punct)   
    
    return text_no_punct

In [None]:
# Write a pattern in regex
blankspace_pattern = r'\s+'

# Write a replacement for the pattern identified
blankspace_repl = ' '

def clean_text(text):

    # Step 1: Lowercase the input text
    text = text.lower()

    # Step 2: Use remove_punct to remove punctuation marks
    text = remove_punct(text)

    # Step 3: Remove extra whitespace characters
    text = re.sub(blankspace_pattern, blankspace_repl, text)
    text = text.strip()
    
    return text
    
clean_text(challenge1)

## 🥊 Challenge 2: Remove Stop Words

We have seen how `nltk` and `spaCy` work as NLP packages. We've also demonstrated how to identify stop words with each package.

Let's write **two** functions to remove stop words from our text data. 

- Complete the function for stop words removal using `nltk`
    - The starter code requires two arguments: the raw text input and a list of predefined stop words
- Complete the function for stop words removal using `spaCy`
    - The starter code requires one argument: the raw text input
 
A friendly reminder before we dive in: both functions take raw text as input—that's a signal to perform tokenization on the raw text first!
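
To see why tokenization has to come first, and why letter case matters when matching against a lowercase stop word list, here is a small sketch. It uses a hypothetical three-word stop list and a crude regex tokenizer as stand-ins for the real `nltk` resources; both are assumptions made purely for illustration.

```python
import re

# Hypothetical miniature stop word list (real lists are much longer)
toy_stopwords = {'the', 'is', 'a'}

def toy_tokenize(text):
    '''Split text into word tokens; a crude stand-in for a real tokenizer.'''
    return re.findall(r'\w+', text)

def remove_toy_stopwords(text, stopwords):
    tokens = toy_tokenize(text)
    # Compare lowercased tokens so that "The" matches "the"
    return [tok for tok in tokens if tok.lower() not in stopwords]

print(remove_toy_stopwords('The flight is a delight', toy_stopwords))
```

Filtering works on tokens rather than on the raw string, so "is" is removed as a word without touching the "is" inside other words.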

In [None]:
stop = stopwords.words('english')

def remove_stopword_nltk(raw_text, stopword):
    
    # Step 1: Tokenization with nltk
    tokens = word_tokenize(raw_text)
    
    # Step 2: Filter out tokens in the stop word list
    # (nltk's stop words are lowercase, so compare the lowercased token)
    text = [token for token in tokens if token.lower() not in stopword]
    
    return text

In [None]:
nlp = spacy.load('en_core_web_sm')

def remove_stopword_spacy(raw_text):

    # Step 1: Apply the nlp pipeline
    doc = nlp(raw_text)
    
    # Step 2: Filter out tokens in the stop word list
    text = [token.text for token in doc if not token.is_stop]

    return text

In [None]:
text = tweets['text'][7]

In [None]:
remove_stopword_nltk(text, stop)

In [None]:
remove_stopword_spacy(text)

## 🥊 Challenge 3: Find the Word Boundary

Now we know that BERT tokenization often returns subwords. Let's try a few more examples!

Does the result make sense to you? What do you think is the correct word boundary to split the following words into subwords? 

Also feel free to read more about the limitations of the WordPiece algorithm. For instance, [this blog post](https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99) dives into why it can fail, and [this one](https://tinkerd.net/blog/machine-learning/bert-tokenization/#demo-bert-tokenizer) introduces the mechanism underlying the algorithm.
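
To build some intuition for where the splits come from, here is a simplified sketch of greedy longest-match-first subword matching, the general strategy WordPiece tokenizers use at inference time. The vocabulary below is a hypothetical toy, not BERT's real one, and the function `wordpiece_tokenize` is written for illustration only.

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):
    '''Greedy longest-match-first split of a single word into subwords.'''
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes unknown
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary (an assumption, not BERT's vocabulary)
toy_vocab = {'hug', '##g', '##ga', '##ble', '##able'}

print(wordpiece_tokenize('huggable', toy_vocab))
print(wordpiece_tokenize('xyz', toy_vocab))
```

Because the match is greedy, the longer piece `##ga` wins over `##g` in this toy vocabulary, even though `##g` followed by `##able` would be closer to the word's morphology. Subword boundaries follow the vocabulary, not the linguistic structure.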

In [None]:
# Load BERT tokenizer in
from transformers import BertTokenizer

# Initialize the tokenizer 
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
def get_tokens(string):
    '''Tokenize the input string with BERT and print the tokens'''
    tokens = tokenizer.tokenize(string)
    print(tokens)

In [None]:
# Abbreviations
get_tokens('dlab')

# OOV
get_tokens('covid')

# Suffix
get_tokens('huggable')

# Digits
get_tokens('378')

# YOUR EXAMPLE

## Part 2: Bag-of-Words Representation Solutions

In [None]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
%matplotlib inline

In [None]:
# Use pandas to import tweets
tweets_path = '../../../data/airline_tweets.csv'
tweets = pd.read_csv(tweets_path, sep=',')

## 🥊 Challenge 4: Apply a Text Cleaning Pipeline

Write a function called `preprocess` that performs the following steps on a text input:

* Step 1: Lowercase the text input.
* Step 2: Replace the following patterns with placeholders:
    * URLs &rarr; ` URL `
    * Digits &rarr; ` DIGIT `
    * Hashtags &rarr; ` HASHTAG `
    * Tweet handles &rarr; ` USER `
* Step 3: Remove extra whitespace.

Here are some hints to guide you through this challenge:

* For Step 1, recall from Part 1 that a string method called [`.lower()`](https://docs.python.org/3.11/library/stdtypes.html#str.lower) can be used to convert text to lowercase.
* We have integrated Step 2 into a function called `placeholder`. Run the cell below to import it into your notebook, and you can use it just like any other function.
* For Step 3, we have provided the regex pattern for identifying whitespace characters as well as the correct replacement for extra whitespace.

Run your `preprocess` function on `example_tweet` (three cells below) to check if it works. If it does, apply it to the entire `text` column in the tweets dataframe.
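
The implementation of `placeholder` lives in `utils` and is not shown in this notebook. Purely to illustrate what such a step could look like, here is a hypothetical stand-in called `toy_placeholder`. The regex patterns are assumptions and may differ from the ones the real helper uses.

```python
import re

def toy_placeholder(text):
    '''Hypothetical stand-in for a placeholder step; patterns are assumptions.'''
    # URLs first, so digits inside a URL are not replaced separately
    text = re.sub(r'https?://\S+', ' URL ', text)
    # Hashtags and handles before digits, so "#2000" stays one HASHTAG
    text = re.sub(r'#\w+', ' HASHTAG ', text)
    text = re.sub(r'@\w+', ' USER ', text)
    text = re.sub(r'\d+', ' DIGIT ', text)
    return text

def toy_preprocess(text):
    '''Lowercase, insert placeholders, then collapse whitespace.'''
    text = text.lower()
    text = toy_placeholder(text)
    return re.sub(r'\s+', ' ', text).strip()

print(toy_preprocess('lol @BillGates is soo 2000 #yolo https://twitter.com'))
```

The order of the substitutions matters in this sketch: replacing digits first would break apart URLs, hashtags, and handles that contain numbers.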

In [None]:
from utils import placeholder

In [None]:
blankspace_pattern = r'\s+'
blankspace_repl = ' '

def preprocess(text):
    '''Create a preprocess pipeline that cleans the tweet data.'''

    # Step 1: Lowercase
    text = text.lower()

    # Step 2: Replace patterns with placeholders
    text = placeholder(text)

    # Step 3: Remove extra whitespace characters
    text = re.sub(blankspace_pattern, blankspace_repl, text)
    text = text.strip()
    
    return text

In [None]:
example_tweet = 'lol @justinbeiber and @BillGates are like soo 2000 #yesterday #amiright saw it on https://twitter.com #yolo'

In [None]:
# Apply the function to the example tweet
print(example_tweet)
print(f"{'='*50}")
print(preprocess(example_tweet))

In [None]:
# Apply the function to the text column and assign the preprocessed tweets to a new column
tweets['text_processed'] = tweets['text'].apply(preprocess)
tweets['text_processed'].head()

## 🥊 Challenge 5: Lemmatize the Text Input

Recall from Part 1 that we introduced using `spaCy` to perform lemmatization, i.e., to "recover" the base form of a word. This process reduces vocabulary size by keeping word variations to a minimum, and a smaller vocabulary may help improve model performance in sentiment classification.
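
The effect on vocabulary size can be seen in a toy sketch. The lookup table below is a hypothetical stand-in for a real lemmatizer, and the three short documents are made up for illustration.

```python
# Hypothetical lookup table standing in for a real lemmatizer
toy_lemmas = {
    'flights': 'flight',
    'delayed': 'delay',
    'delays': 'delay',
    'was': 'be',
    'were': 'be',
}

# Made-up documents
docs = [
    'flights were delayed',
    'flight was delayed',
    'delays delays delays',
]

def toy_lemmatize(text):
    '''Map each token to its lemma; keep the token if it is not in the table.'''
    return ' '.join(toy_lemmas.get(tok, tok) for tok in text.split())

# Unique tokens before and after lemmatization
vocab_before = {tok for doc in docs for tok in doc.split()}
vocab_after = {tok for doc in docs for tok in toy_lemmatize(doc).split()}

print(len(vocab_before), sorted(vocab_before))
print(len(vocab_after), sorted(vocab_after))
```

In this toy corpus, six distinct surface forms collapse into three lemmas, so a document-term matrix built on the lemmatized text would have half as many columns.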

Now let's implement lemmatization on our tweet data and use the lemmatized text to create a third DTM. 

Complete the function `lemmatize_text`. It requires a text input and returns the lemmas of all tokens. 

Here are some hints to guide you through this challenge:

- Step 1: initialize a list to hold lemmas
- Step 2: apply the `nlp` pipeline to the input text
- Step 3: iterate over tokens in the processed text and retrieve the lemma of the token
    - HINT: lemmatization is one of the linguistic annotations that the `nlp` pipeline automatically does for us. We can use `token.lemma_` to access the annotation.

In [None]:
# Import spaCy
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
# Create a function to lemmatize text
def lemmatize_text(text):
    '''Lemmatize the text input with spaCy annotations.'''

    # Step 1: Initialize an empty list to hold lemmas
    lemma = []

    # Step 2: Apply the nlp pipeline to input text
    doc = nlp(text)

    # Step 3: Iterate over tokens in the text to get the token lemma
    for token in doc:
        lemma.append(token.lemma_)

    # Step 4: Join lemmas together into a single string
    text_lemma = ' '.join(lemma)
    
    return text_lemma

In [None]:
# Apply the function to an example tweet
print(tweets.iloc[101]["text_processed"])
print(f"{'='*50}")
print(lemmatize_text(tweets.iloc[101]['text_processed']))

In [None]:
# This may take a while!
tweets['text_lemmatized'] = tweets['text_processed'].apply(lemmatize_text)

In [None]:
# Print the preprocessed tweet
print(tweets['text_processed'].iloc[101])
print(f"{'='*50}")
# Print the lemmatized tweet
print(tweets['text_lemmatized'].iloc[101])

## 🥊 Challenge 6: Words with Highest Mean TF-IDF Scores

We have obtained tf-idf values for each term in each document. But what do these values tell us about the sentiments of tweets? Are there any words that are particularly informative for positive/negative tweets?

To explore this, let's gather the indices of all positive/negative tweets and calculate the mean tf-idf scores of the words that appear in each category.

We've provided the following starter code to guide you:
- Subset the `tweets` dataframe according to the `airline_sentiment` label and retrieve the index of each subset (`.index`). Assign the index to `positive_index` or `negative_index`.
- For each subset:
    - Retrieve the tf-idf representation
    - Take the mean tf-idf values across the subset using `.mean()`
    - Sort the mean values in descending order using `.sort_values()`
    - Get the top 10 terms using `.head()`

Next, run `pos.plot` and `neg.plot` to plot the words with the highest mean tf-idf scores for each subset. 
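
The subset-then-average logic can be checked on a tiny example first. The sketch below uses a hypothetical four-document tf-idf table with made-up scores and labels; the names `toy_tfidf` and `toy_labels` are for illustration only.

```python
import pandas as pd

# Hypothetical tf-idf scores: rows are documents, columns are terms
toy_tfidf = pd.DataFrame(
    {'thank': [0.8, 0.6, 0.0, 0.0],
     'delay': [0.0, 0.1, 0.7, 0.9],
     'flight': [0.2, 0.3, 0.3, 0.1]},
    index=[0, 1, 2, 3],
)
# Made-up sentiment labels sharing the same index as the table
toy_labels = pd.Series(['positive', 'positive', 'negative', 'negative'])

# Row labels of the documents in each class
pos_idx = toy_labels[toy_labels == 'positive'].index
neg_idx = toy_labels[toy_labels == 'negative'].index

# Column-wise mean within each class, highest first
toy_pos = toy_tfidf.loc[pos_idx].mean().sort_values(ascending=False)
toy_neg = toy_tfidf.loc[neg_idx].mean().sort_values(ascending=False)

print(toy_pos.head(2))
print(toy_neg.head(2))
```

This only works because the tf-idf table and the labels share the same index, which is why the tf-idf dataframe is usually built with `index=tweets.index`.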

In [None]:
# Create a tfidf vectorizer
vectorizer = TfidfVectorizer(lowercase=True,
                             stop_words='english',
                             min_df=2,
                             max_df=0.95,
                             max_features=None)

# Fit and transform 
tf_dtm = vectorizer.fit_transform(tweets['text_lemmatized'])

# Create a tf-idf dataframe
tfidf = pd.DataFrame(tf_dtm.todense(),
                     columns=vectorizer.get_feature_names_out(),
                     index=tweets.index)

In [None]:
# Complete the boolean masks 
positive_index = tweets[tweets['airline_sentiment'] == 'positive'].index
negative_index = tweets[tweets['airline_sentiment'] == 'negative'].index

In [None]:
# Complete the following two lines
pos = tfidf.loc[positive_index].mean().sort_values(ascending=False).head(10)
neg = tfidf.loc[negative_index].mean().sort_values(ascending=False).head(10)

In [None]:
pos.plot(kind='barh', 
         xlim=(0, 0.18),
         color='cornflowerblue',
         title='Top 10 terms with the highest mean tf-idf values for positive tweets');

In [None]:
neg.plot(kind='barh', 
         xlim=(0, 0.18),
         color='darksalmon',
         title='Top 10 terms with the highest mean tf-idf values for negative tweets');