# Python Text Analysis: Preprocessing and Bag of Words

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Learn common steps for preprocessing text data, as well as specific operations for preprocessing Twitter data.
* Know commonly used NLP packages and what they are capable of.
* Understand tokenizers, and how they have changed since the advent of Large Language Models.
* Learn how to convert text data into numbers through a Bag-of-Words approach.
* Understand the TF-IDF algorithm and how it complements the Bag-of-Words representation.
* Implement Bag-of-Words and TF-IDF using the `sklearn` package and understand its parameter settings.
* Use the numerical representations of text data to perform sentiment analysis.
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.<br>
🎬 **Demo**: Showing off something more advanced, so you know what Python can be used for!<br>

### Sections
1. [Preprocessing](#section1)
2. [Tokenization](#section2)
3. [The Bag-of-Words Representation](#section3)
4. [Term Frequency-Inverse Document Frequency](#section4)
5. [Sentiment Classification Using the TF-IDF Representation](#section5)

In this workshop, we'll learn the building blocks for performing text analysis in Python. These techniques lie in the domain of Natural Language Processing (NLP). NLP is a field that deals with identifying and extracting patterns of language, primarily in written texts. Throughout the workshop, we'll interact with various tools for text analysis, from simple string methods to dedicated NLP packages such as `nltk` and `spaCy`, as well as more recent Large Language Models such as `BERT`.

First, let's make sure these packages are installed before diving into the materials.

In [None]:
# Uncomment the following lines to install packages/model
# %pip install NLTK
# %pip install transformers
# %pip install spaCy
# %pip install scikit-learn
# !python -m spacy download en_core_web_sm

In [None]:
# Import necessary packages
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from string import punctuation
%matplotlib inline

<a id='section1'></a>

# Preprocessing

In the first part of this workshop, we'll address the first step of text analysis. Our goal is to convert the raw, messy text data into a consistent format. This process is often called **preprocessing**, **text cleaning**, or **text normalization**.

You'll notice that at the end of preprocessing, our data is still in a format that we can read and understand. Later in this workshop, we will begin our foray into converting the text data into a numerical representation, a format that can be more readily handled by computers.

🔔 **Question**: Let's pause for a minute to reflect on **your** previous experiences working on text data.
- What is the format of the text data you have interacted with (plain text, CSV, or XML)?
- Where does it come from (structured corpus, scraped from the web, survey data)?
- Is it messy (i.e., is the data formatted consistently)?

## Common Processes

Preprocessing is not something we can accomplish with a single line of code. We often start by familiarizing ourselves with the data, and along the way, we gain a clearer understanding of the granularity of preprocessing we want to apply.

Typically, we begin by applying a set of commonly used processes to clean the data. These operations don't substantially alter the form or meaning of the data; they serve as a standardized procedure to reshape the data into a consistent format.

The following processes, for example, are commonly applied to preprocess English texts of various genres. These operations can be done using built-in Python tools, such as string methods and regular expressions.
- Lowercase the text
- Remove punctuation marks
- Remove extra whitespace characters
- Remove stop words

After the initial processing, we may choose to perform task-specific processes, the specifics of which often depend on the downstream task we want to perform and the nature of the text data (i.e., its stylistic and linguistic features).  

Before we jump into these operations, let's take a look at our data!

### Import the Text Data

The text data we'll be working with is a CSV file. It contains tweets about U.S. airlines, scraped in February 2015.

Let's read the file `airline_tweets.csv` into a dataframe with `pandas`.

In [None]:
# Import pandas
import pandas as pd

# File path to data
csv_path = '../../../data/airline_tweets.csv'

# Specify the separator
tweets = pd.read_csv(csv_path, sep=',')

In [None]:
# Show the first five rows
tweets.head()

The dataframe has one row per tweet. The text of each tweet is stored in the `text` column.
- `text` (`str`): the text of the tweet.

Other metadata we are interested in include: 
- `airline_sentiment` (`str`): the sentiment of the tweet, labeled as "neutral," "positive," or "negative."
- `airline` (`str`): the airline that is tweeted about.
- `retweet_count` (`int`): how many times the tweet was retweeted.

Let's take a look at some of the tweets:

In [None]:
print(tweets['text'].iloc[0])
print(tweets['text'].iloc[1])
print(tweets['text'].iloc[2])

🔔 **Question**: What have you noticed? What are the stylistic features of tweets?

### Lowercasing

While we acknowledge that a word's casing is informative, we often don't work in contexts where we can properly utilize this information.

More often, the subsequent analysis we perform is **case-insensitive**. For instance, in frequency analysis, we want to account for various forms of the same word. Lowercasing the text data aids in this process and simplifies our analysis.

We can easily achieve lowercasing with the string method [`.lower()`](https://docs.python.org/3/library/stdtypes.html#str.lower); see [documentation](https://docs.python.org/3/library/stdtypes.html#string-methods) for more useful functions.

Let's apply it to the following example:

In [None]:
# Print the first example
first_example = tweets['text'][108]
print(first_example)

In [None]:
# Check if all characters are in lowercase
print(first_example.islower())
print(f"{'=' * 50}")

# Convert it to lowercase
print(first_example.lower())
print(f"{'=' * 50}")

# Convert it to uppercase
print(first_example.upper())
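
To see why lowercasing helps with frequency analysis, here is a minimal sketch on a made-up sentence (`toy_sentence` is illustrative and not taken from the tweets). It counts whitespace-separated tokens with `collections.Counter` before and after lowercasing.

```python
from collections import Counter

# A made-up sentence; the three casings of "flight" are deliberate
toy_sentence = "Flight delayed again. My flight was late and the FLIGHT crew left"

# Case-sensitive counts treat each casing as a different word
counts_raw = Counter(toy_sentence.split())

# Lowercasing first merges the variants into a single entry
counts_lower = Counter(toy_sentence.lower().split())

print(counts_raw['flight'], counts_lower['flight'])
```

Without lowercasing, the count for `'flight'` misses the capitalized variants; after lowercasing, all three occurrences are counted together.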

### Remove Extra Whitespace Characters

Sometimes we might come across texts with extraneous whitespace, such as spaces, tabs, and newline characters, which is particularly common when the text is scraped from web pages. Before we dive into the details, let's briefly introduce Regular Expressions (regex) and the `re` package.

Regular expressions are a powerful way of searching for specific string patterns in large corpora. They have an infamously steep learning curve, but they can be very efficient when we get a handle on them. Many NLP packages heavily rely on regex under the hood. Regex testers, such as [regex101](https://regex101.com), are useful tools in both understanding and creating regex expressions.

Our goal in this workshop is not to provide a deep (or even shallow) dive into regex; instead, we want to expose you to them so that you are better prepared to do deep dives in the future!

The following example is a poem by William Wordsworth. Like many poems, the text may contain extra line breaks (i.e., newline characters, `\n`) that we want to remove.

In [None]:
# File path to the poem
text_path = '../../../data/poem_wordsworth.txt'

# Read the poem in
# The with block closes the file automatically when it ends
with open(text_path, 'r') as file:
    text = file.read()

As you can see, the poem is formatted as a continuous string of text with line breaks placed at the end of each line, making it difficult to read. 

In [None]:
text

One handy function we can use to display the poem properly is `.splitlines()`. As the name suggests, it splits a long text sequence into a list of lines whenever there is a newline character.   

In [None]:
# Split the single string into a list of lines
text.splitlines()

Let's return to our tweet data for an example.

In [None]:
# Print the second example
second_example = tweets['text'][5]
second_example

In this case, we don't really want to split the tweet into a list of strings. We still expect a single string of text but would like to remove the line break completely from the string.

The string method `.strip()` effectively does the job of stripping away spaces at both ends of the text. However, it won't work in our example as the newline character is in the middle of the string.

In [None]:
# .strip() only removes whitespace at both ends
second_example.strip()

This is where regex could be really helpful.

In [None]:
import re

With regex, we match a pattern that we have identified in the text data and then perform an operation on the matched part: extract it, replace it with something else, or remove it completely. The workflow can be unpacked into the following steps:

- Identify and write the pattern in regex (`r'PATTERN'`)
- Write the replacement for the pattern (`'REPLACEMENT'`)
- Call the specific regex function (e.g., `re.sub()`)

In our example, the pattern we are looking for is `\s`, which is the regex shorthand for any whitespace character (`\n` and `\t` included). We also add the quantifier `+` to the end: `\s+`. It means we'd like to capture one or more occurrences of the whitespace character.

In [None]:
# Write a pattern in regex
blankspace_pattern = r'\s+'

The replacement for one or more whitespace characters is exactly one single whitespace, which is the canonical word boundary in English. Any additional whitespace will be reduced to a single whitespace. 

In [None]:
# Write a replacement for the pattern identified
blankspace_repl = ' '

Lastly, let's put everything together using the function [`re.sub()`](https://docs.python.org/3.11/library/re.html#re.sub), which means we want to substitute a pattern with a replacement. The function takes in three arguments: the pattern, the replacement, and the string to which we want to apply the function.

In [None]:
# Replace whitespace(s) with ' '
clean_text = re.sub(pattern = blankspace_pattern, 
                    repl = blankspace_repl, 
                    string = second_example)
print(clean_text)

Ta-da! The newline character is no longer there.
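
To make the role of the `+` quantifier concrete, here is a small sketch on a made-up string (`toy_text` is illustrative, not from the dataset). It compares substituting `\s` with substituting `\s+`.

```python
import re

# A made-up string with a double newline, two spaces, and a tab
toy_text = "Hello,\n\n  world!\tBye."

# Without the quantifier, every whitespace character is replaced one-for-one
one_for_one = re.sub(r'\s', ' ', toy_text)

# With `+`, each run of whitespace collapses into a single space
collapsed = re.sub(r'\s+', ' ', toy_text)

print(repr(one_for_one))
print(repr(collapsed))
```

The first result still contains a run of four spaces between the words, while the second has exactly one space at each word boundary.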

### Remove Punctuation Marks

Sometimes we are only interested in analyzing **alphanumeric characters** (i.e., the letters and numbers), in which case we might want to remove punctuation marks. 

The `string` module contains a list of predefined punctuation marks. Let's print them out.

In [None]:
# Load in a predefined list of punctuation marks
from string import punctuation
print(punctuation)

In practice, to remove these punctuation characters, we can simply iterate over the text and remove characters found in the list, such as shown below in the `remove_punct` function.

In [None]:
def remove_punct(text):
    '''Remove punctuation marks in input text'''
    
    # Select characters not in punctuation
    no_punct = []
    for char in text:
        if char not in punctuation:
            no_punct.append(char)

    # Join the characters into a string
    text_no_punct = ''.join(no_punct)   
    
    return text_no_punct

Let's apply the function to the example below. 

In [None]:
# Print the third example
third_example = tweets['text'][20]
print(third_example)
print(f"{'=' * 50}")

# Apply the function 
remove_punct(third_example)

Let's give it a try with another tweet. What have you noticed?

In [None]:
# Print another tweet
print(tweets['text'][100])
print(f"{'=' * 50}")

# Apply the function
remove_punct(tweets['text'][100])

What about the following example?

In [None]:
# A text with contractions
contraction_text = "We've got quite a bit of punctuation here, don't we?!? #Python @D-Lab."

# Apply the function
remove_punct(contraction_text)

⚠️ **Warning:** In many cases, we want to remove punctuation marks **after** tokenization, which we will discuss in a minute. This tells us that the **order** of preprocessing is a matter of importance!

## 🥊 Challenge 1: Preprocessing with Multiple Steps
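
As a side note, the character-by-character filter can also be written with the built-in `str.translate`. The sketch below is one possible alternative, not a replacement for `remove_punct`; the helper name `remove_punct_translate` and the variable `toy_contraction` are made up for illustration.

```python
from string import punctuation

# Translation table that deletes every character listed in `punctuation`
punct_table = str.maketrans('', '', punctuation)

def remove_punct_translate(text):
    '''Remove punctuation marks using str.translate (hypothetical helper).'''
    return text.translate(punct_table)

toy_contraction = "We've got quite a bit of punctuation here, don't we?!? #Python @D-Lab."
print(remove_punct_translate(toy_contraction))
```

Notice that "don't" becomes "dont" and "@D-Lab" becomes "DLab": removing punctuation before tokenization glues together pieces that a tokenizer might have kept apart.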

So far we've learned a few preprocessing operations. Let's put them together in a function! This function will be handy to refer to if you work with messy English text data and want to preprocess it in a single step.

The example text data for challenge 1 is shown below. Write a function to:
- Lowercase the text
- Remove punctuation marks
- Remove extra whitespace characters

Feel free to reuse the code we've written above!

In [None]:
challenge1_path = '../../../data/example1.txt'

with open(challenge1_path, 'r') as file:
    challenge1 = file.read()
    
print(challenge1)

In [None]:
def clean_text(text):

    # Step 1: Lowercase
    text = ...

    # Step 2: Use remove_punct to remove punctuation marks
    text = ...

    # Step 3: Remove extra whitespace characters
    text = ...

    return text

In [None]:
# Uncomment to apply the above function to challenge 1 text 
# clean_text(challenge1)

<a id='section2'></a>

# Tokenization

One of the most important steps in text analysis is tokenization. This is the process of breaking a long sequence of text into word tokens. With these tokens available, we are ready to perform word-level analysis. For instance, we can filter out tokens that don't contribute to the core meaning of the text.

In this section, we'll introduce how to perform tokenization using `nltk`, `spaCy`, and a Large Language Model (`bert`). The purpose is to expose you to different NLP packages, help you understand their functionalities, and demonstrate how to access key functions in each package.

### `nltk`

The first package we'll be using is called **Natural Language Toolkit**, or `nltk`. 

Let's download a few resources (corpora and tokenizer models) that the package relies on.

In [None]:
import nltk

In [None]:
# Uncomment the following lines to download these resources
# nltk.download('wordnet')
# nltk.download('stopwords')
# nltk.download('punkt')

`nltk` has a function called `word_tokenize`. It requires one argument, which is the text to be tokenized, and it returns a list of tokens for us.

In [None]:
# Load word_tokenize 
from nltk.tokenize import word_tokenize

# Print the example
text = tweets['text'][7]
print(text)

In [None]:
# Apply the NLTK tokenizer
nltk_tokens = word_tokenize(text)
nltk_tokens

Here we are, with a list of tokens identified by `nltk`. Let's take a minute to inspect them! 

🔔 **Question**: Do the word boundaries decided by `nltk` make sense to you? Pay attention to the Twitter handle and the URL in the example tweet.

You may feel that accessing functions in `nltk` is pretty straightforward. The function we used above was imported from the `nltk.tokenize` module, which as the name suggests, primarily does the job of tokenization. 

Under the hood, `nltk` has [a collection of modules](https://www.nltk.org/api/nltk.html) that fulfill different purposes. To name a few:

| NLTK module   | Function                  | Link                                                         |
|---------------|---------------------------|--------------------------------------------------------------|
| nltk.tokenize | Tokenization              | [Documentation](https://www.nltk.org/api/nltk.tokenize.html) |
| nltk.corpus   | Retrieve built-in corpora | [Documentation](https://www.nltk.org/nltk_data/)             |
| nltk.tag      | Part-of-speech tagging    | [Documentation](https://www.nltk.org/api/nltk.tag.html)      |
| nltk.stem     | Stemming                  | [Documentation](https://www.nltk.org/api/nltk.stem.html)     |
| ...           | ...                       | ...                                                          |

Let's import `stopwords` from the `nltk.corpus` module, which hosts a range of built-in corpora. 

In [None]:
# Load predefined stop words from nltk
from nltk.corpus import stopwords

Let's specify that we want to retrieve English stop words. The function simply returns a list of stop words, mostly function words, that `nltk` identifies.

In [None]:
# Print the first 10 stopwords
stop = stopwords.words('english')
stop[:10]

### `spaCy`
Other than `nltk`, we have another widely used package called `spaCy`.

`spaCy` has its own processing pipeline. It takes in a string of text, runs the `nlp` pipeline on it, and stores the processed text and its annotations in an object called `doc`. The `nlp` pipeline always performs tokenization, as well as [other text analysis components](https://spacy.io/usage/processing-pipelines#custom-components) requested by the user. These components are pretty similar to modules in `nltk`. 

<img src='../../../img/spacy.png' alt="spacy pipeline" width="700">

Note that we always start by initializing the `nlp` pipeline, depending on the language of the text. Here, we are loading a pretrained language model for English: `en_core_web_sm`. The name suggests that it is a lightweight model trained on some text data (e.g., blogs); see model descriptions [here](https://spacy.io/models/en#en_core_web_sm).

This is the first time we encounter the concept of **pretraining**, though you may have heard it elsewhere. In the context of NLP, pretraining means that the model has been trained on a vast amount of data. As a result, it comes with a certain "knowledge" of word structure and grammar of the language.

Therefore, when we apply the model to our own data, we can expect it to be reasonably accurate in performing various annotation tasks, e.g., tagging a word's part of speech, identifying the syntactic head of a phrase, etc.

Let's dive in! We'll first need to load the pretrained language model we installed earlier.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

The `nlp` pipeline, by default, includes a set of components, which we can access via the `.pipe_names` attribute. 

You may notice that it doesn't include the tokenizer. Don't worry! The tokenizer is a special component that the pipeline always includes.

In [None]:
# Retrieve components included in NLP pipeline
nlp.pipe_names

Let's run the `nlp` pipeline on our example tweet data, and assign it to a variable `doc`.

In [None]:
# Apply the pipeline to example tweet
doc = nlp(tweets['text'][7])

Under the hood, the `doc` object contains the tokens (created by the tokenizer) and their annotations (created by other components), which are [linguistic features](https://spacy.io/usage/linguistic-features) useful for text analysis. We retrieve a token and its annotations by accessing the corresponding attributes.

| Attribute      | Annotation                              | Link                                                                      |
|----------------|-----------------------------------------|---------------------------------------------------------------------------|
| token.text     | The token in verbatim text              | [Documentation](https://spacy.io/api/token#attributes)                    |
| token.is_stop  | Whether the token is a stop word        | [Documentation](https://spacy.io/api/attributes#_title)                   |
| token.is_punct | Whether the token is a punctuation mark | [Documentation](https://spacy.io/api/attributes#_title)                   |
| token.lemma_   | The base form of the token              | [Documentation](https://spacy.io/usage/linguistic-features#lemmatization) |
| token.pos_     | The simple POS-tag of the token         | [Documentation](https://spacy.io/usage/linguistic-features#pos-tagging)   |
| ...            | ...                                     | ...                                                                       |

Let's first get the tokens themselves! We'll iterate over the `doc` object and retrieve the text of each token. 

In [None]:
# Get the verbatim texts of tokens
spacy_tokens = [token.text for token in doc]
spacy_tokens

In [None]:
# Get the NLTK tokens
nltk_tokens

🔔 **Question**: Let's pause for a minute to compare the tokens generated by `nltk` and `spaCy`. What have you noticed?

Remember we can also access various annotations of these tokens. For instance, `spaCy` conveniently encodes whether a token is a stop word.

In [None]:
# Retrieve the is_stop annotation
spacy_stops = [token.is_stop for token in doc]

# The results are boolean values
spacy_stops

## 🥊 Challenge 2: Remove Stop Words

We now know how `nltk` and `spaCy` work as NLP packages. We've also demonstrated how to identify stop words with each package.

Let's write **two** functions to remove stop words from our text data. 

- Complete the function for stop words removal using `nltk`
    - The starter code requires two arguments: the raw text input and a list of predefined stop words
- Complete the function for stop words removal using `spaCy`
    - The starter code requires one argument: the raw text input
 
A friendly reminder before we dive in: both functions take raw text as input, which is a signal to perform tokenization on the raw text first!

In [None]:
def remove_stopword_nltk(raw_text, stopword):
    
    # Step 1: Tokenization with nltk
    # YOUR CODE HERE
    
    # Step 2: Filter out tokens in the stop word list
    # YOUR CODE HERE

In [None]:
def remove_stopword_spacy(raw_text):

    # Step 1: Apply the nlp pipeline
    # YOUR CODE HERE
    
    # Step 2: Filter out tokens that are stop words
    # YOUR CODE HERE

In [None]:
# remove_stopword_nltk(text, stop)

In [None]:
# remove_stopword_spacy(text)

<a id='section3'></a>

# The Bag-of-Words Representation

Now we move beyond preprocessing to converting text into numerical representations. We'll explore one of the most straightforward ways to generate a numeric representation from text: the **bag-of-words** (BoW). 

At the heart of the bag-of-words approach lies the assumption that the frequency of specific tokens is informative about the semantics and sentiment underlying the text.

The idea of bag-of-words (BoW), as the name suggests, is quite intuitive: we take a document and toss it in a bag. The action of "throwing" the document in a bag disregards the relative position between words, so what is "in the bag" is essentially "an unsorted set of words" [(Jurafsky & Martin, 2024)](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf). In return, we have a list of unique words and the frequency of each of them. 

For example, as shown in the following illustration, the word "coffee" appears twice. 

<img src='../../../img/bow-illustration-1.png' alt="BoW-Part2" width="600">

With a bag-of-words representation, we make heavy use of word frequency but not too much of word order. 
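
To illustrate that a bag of words keeps frequencies but discards order, here is a minimal sketch using `collections.Counter`. The two sentences are made up for illustration, and splitting on whitespace is a simplifying assumption in place of a real tokenizer.

```python
from collections import Counter

# Two made-up documents with the same words in a different order
doc_a = "the coffee roaster roasts coffee"
doc_b = "coffee roasts the roaster coffee"

# A "bag" is just a mapping from each word to its frequency
bag_a = Counter(doc_a.split())
bag_b = Counter(doc_b.split())

print(bag_a)
print(bag_a == bag_b)
```

Both documents produce the same bag, in which "coffee" has a count of 2, even though the word order differs.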

In the context of sentiment analysis, the sentiment of a tweet is conveyed more strongly by specific words. For example, if a tweet contains the word "happy," it likely conveys positive sentiment, but not always (e.g., "not happy" denotes the opposite sentiment). The more often such words appear, the more strongly the tweet likely conveys that sentiment.

## Exploratory Data Analysis

Before we ever do any preprocessing or modeling, let's perform exploratory data analysis to familiarize ourselves with the sentiment data.

To prepare us for sentiment classification, we'll partition the dataset to focus on the "positive" and "negative" tweets for now. 

In [None]:
tweets = tweets[tweets['airline_sentiment'] != 'neutral'].reset_index(drop=True)

Let's take a look at a few tweets first!

In [None]:
# Print first five tweets
for idx in range(5):
    print(tweets['text'].iloc[idx])

We can already see that some of these tweets contain negative sentiment. How can we tell this is the case?

Next, let's take a look at the distribution of sentiment labels in this dataset. 

In [None]:
# Make a bar plot showing the count of tweet sentiments
sns.countplot(data=tweets,
              x='airline_sentiment', 
              color='cornflowerblue',
              order=['positive', 'negative']);

It looks like the majority of the tweets in this dataset are expressing negative sentiment!

Let's take a look at which sentiment gets retweeted more:

In [None]:
# Get the mean retweet count for each sentiment
tweets.groupby('airline_sentiment')['retweet_count'].mean()

Negative tweets are clearly retweeted more often than positive ones!

Let's see which airline receives the most negative tweets:

In [None]:
# Get the proportion of negative tweets by airline
proportions = tweets.groupby(['airline', 'airline_sentiment']).size() / tweets.groupby('airline').size()
proportions.unstack().sort_values('negative', ascending=False)

It looks like people are most dissatisfied with US Airways, followed by American Airlines, both having over 85% negative tweets!

A lot of interesting discoveries could be made if you want to explore more about the data. Now let's return to our task of sentiment analysis. Before that, we need to preprocess the text data so that they are in a standard format.

## Text Preprocessing for Bag-of-Words

Let's apply what we learned about preprocessing! We'll create a preprocessing pipeline specifically for our tweet data.

## 🥊 Challenge 3: Apply a Text Cleaning Pipeline

Write a function called `preprocess_tweets` that performs the following steps on a text input:

* Step 1: Lowercase the text input.
* Step 2: Replace the following patterns with placeholders:
    * URLs &rarr; ` URL `
    * Digits &rarr; ` DIGIT `
    * Hashtags &rarr; ` HASHTAG `
    * Tweet handles &rarr; ` USER `
* Step 3: Remove extra whitespace.

Here are some regex patterns to help you:
- URLs: `r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'`
- Digits: `r'\d+'`
- Hashtags: `r'#\w+'` 
- Handles: `r'@\w+'`
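
To get a feel for what each pattern captures, the sketch below runs `re.findall` with the four patterns on a made-up tweet (`toy_tweet` is illustrative and not from the dataset).

```python
import re

# A made-up tweet containing a handle, a hashtag, a number, and a URL
toy_tweet = "@united thanks! #happy flight 123 was great http://t.co/abc"

url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# re.findall returns every non-overlapping match as a list of strings
found_urls = re.findall(url_pattern, toy_tweet)
found_digits = re.findall(r'\d+', toy_tweet)
found_hashtags = re.findall(r'#\w+', toy_tweet)
found_handles = re.findall(r'@\w+', toy_tweet)

print(found_urls, found_digits, found_hashtags, found_handles)
```

Checking patterns with `re.findall` before substituting is a quick way to confirm that each one matches what you expect and nothing more.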

In [None]:
def preprocess_tweets(text):
    '''Create a preprocess pipeline that cleans the tweet data.'''
    
    # Step 1: Lowercase
    text = text.lower()
    
    # Step 2: Replace patterns with placeholders
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' URL ', text)
    text = re.sub(r'\d+', ' DIGIT ', text)
    text = re.sub(r'#\w+', ' HASHTAG ', text)
    text = re.sub(r'@\w+', ' USER ', text)
    
    # Step 3: Remove extra whitespace characters
    text = re.sub(r'\s+', ' ', text)
    
    return text.strip()

In [None]:
example_tweet = 'lol @justinbeiber and @BillGates are like soo 2000 #yesterday #amiright saw it on https://twitter.com #yolo'

# Print the example tweet
print(example_tweet)
print(f"{'='*50}")

# Print the preprocessed tweet
print(preprocess_tweets(example_tweet))

In [None]:
# Apply the function to the text column and assign the preprocessed tweets to a new column
tweets['text_processed'] = tweets['text'].apply(preprocess_tweets)
tweets['text_processed'].head()

Congratulations! Preprocessing is done. Let's dive into the bag-of-words!

## Document Term Matrix

Now let's implement the idea of bag-of-words. Before we dive deeper, let's step back for a moment. In practice, text analysis often involves handling many documents; from now on, we use the term **document** to represent a piece of text on which we perform analysis. It could be a phrase, a sentence, a tweet, or any other text. As long as it can be represented by a string, the length doesn't really matter.

Imagine we have four documents (i.e., the four phrases shown above), and we toss them all in the bag. Instead of a word-frequency list, we'd expect a document-term matrix (DTM) in return. In a DTM, the word list is the **vocabulary** (V) that holds all unique words that occur across the documents. For each **document** (D), we count the number of occurrences of each word in the vocabulary, and then plug the number into the matrix. In other words, the DTM we will construct is a $D \times V$ matrix, where each row corresponds to a document, and each column corresponds to a token (or "term").

The unique tokens in this set of documents, arranged in alphabetical order, form the columns. For each document, we record the number of occurrences of each word present in the document. The numerical representation for each document is a row in the matrix. For example, the first document, "the coffee roaster," has the numerical representation $[0, 1, 0, 0, 0, 1, 1, 0]$.

Note that the left index column now displays these documents as text, but typically we would just assign an index to each of them. 

$$
\begin{array}{c|cccccccc}
 & \text{americano} & \text{coffee} & \text{iced} & \text{light} & \text{roast} & \text{roaster} & \text{the} & \text{time} \\ \hline
\text{the coffee roaster} & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 \\
\text{light roast} & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
\text{iced americano} & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
\text{coffee time} & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
\end{array}
$$
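
The matrix above can be reproduced by hand in a few lines of plain Python. This is only a from-scratch sketch of the idea: it assumes whitespace splitting as the tokenizer, and the names `toy_docs`, `toy_vocab`, and `toy_rows` are made up for illustration.

```python
from collections import Counter

toy_docs = ['the coffee roaster',
            'light roast',
            'iced americano',
            'coffee time']

# Vocabulary: all unique tokens across the documents, in alphabetical order
toy_vocab = sorted({token for doc in toy_docs for token in doc.split()})

# One row per document: the count of each vocabulary token in that document
toy_rows = []
for doc in toy_docs:
    token_counts = Counter(doc.split())
    toy_rows.append([token_counts[token] for token in toy_vocab])

print(toy_vocab)
for row in toy_rows:
    print(row)
```

Each printed row lines up with the corresponding row of the matrix, and the vocabulary gives the column order.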

To create a DTM, we will use `CountVectorizer` from the package `sklearn`.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

The following illustration depicts the three-step workflow of creating a DTM with `CountVectorizer`.

<img src='../../../img/CountVectorizer1.png' alt="CountVectorizer" width="500">

Let's walk through these steps with the toy example shown above.

### A Toy Example

In [None]:
# A toy example containing four documents
test = ['the coffee roaster',
        'light roast',
        'iced americano',
        'coffee time']

The first step is to initialize a `CountVectorizer` object. Within the parentheses, we can specify parameter settings if desired. Let's take a look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and see what options are available.

For now we can just leave it blank to use the default settings. 

In [None]:
# Create a CountVectorizer object
vectorizer = CountVectorizer()

The second step is to `fit` this `CountVectorizer` object to the data, which means creating a vocabulary of tokens from the set of documents. Thirdly, we `transform` our data according to the "fitted" `CountVectorizer` object, which means taking each document and counting the occurrences of tokens according to the vocabulary established during the "fitting" step.

It may sound a bit complex, but steps 2 and 3 can be done in one go using the `fit_transform` method.
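
To see the two steps separately, here is a small sketch that calls `fit` and `transform` one after the other. The names `toy_docs`, `toy_vectorizer`, and `new_docs` are made up for illustration, and the default settings of `CountVectorizer` are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ['the coffee roaster',
            'light roast',
            'iced americano',
            'coffee time']

# Step 2: fit builds the vocabulary (a dict mapping each token to a column index)
toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(toy_docs)
print(toy_vectorizer.vocabulary_)

# Step 3: transform counts tokens using the fitted vocabulary only
new_docs = ['iced coffee time', 'green tea']
new_counts = toy_vectorizer.transform(new_docs)
print(new_counts.toarray())
```

Tokens that were not seen during fitting, such as "green" and "tea", have no column, so the second document becomes a row of zeros. Keeping `fit` and `transform` separate matters when the same vocabulary has to be applied to new documents later on.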

In [None]:
# Fit and transform to create a DTM
test_count = vectorizer.fit_transform(test)

The return value of `fit_transform` is the DTM.

Let's take a look at it!

In [None]:
test_count

Apparently we've got a "sparse matrix", that is, a matrix that contains a lot of zeros. This makes sense. For each document, there are words that don't occur at all, and these are counted as zero in the DTM. This sparse matrix is stored in a "Compressed Sparse Row" format, a memory-saving format designed for handling sparse matrices.

Let's convert it to a dense matrix, where those zeros are explicitly represented, as in a numpy array.

In [None]:
# Convert DTM to a dense matrix 
test_count.todense()

So this is our DTM! The matrix is the same as shown above. To make it more reader-friendly, let's convert it to a dataframe. The column names should be tokens in the vocabulary, which we can access with the `get_feature_names_out` function.

In [None]:
# Retrieve the vocabulary
vectorizer.get_feature_names_out()

In [None]:
# Create a DTM dataframe
test_dtm = pd.DataFrame(data=test_count.todense(),
                        columns=vectorizer.get_feature_names_out())

Here it is! The DTM of our toy data is now a dataframe. The index of `test_dtm` corresponds to the position of each document in the `test` list. 

In [None]:
test_dtm

Hopefully this toy example provides a clear walkthrough of creating a DTM.

Now it's time for our tweets data!

### DTM for Tweets

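We'll begin by initializing a `CountVectorizer` object. In the following cell, we have included a few parameters that people often adjust. Two of them, `lowercase=True` and `max_features=None`, are left at their default values. The other three are changed from their defaults, which are `stop_words=None`, `min_df=1`, and `max_df=1.0`.

When we construct a DTM, the default is to lowercase the input text. If nothing is provided for `stop_words`, the default is to keep them; here we pass `'english'` to remove English stop words. The last three parameters control the size of the vocabulary: `min_df=2` drops tokens that appear in fewer than two documents, `max_df=0.95` drops tokens that appear in more than 95% of the documents, and `max_features=None` places no cap on the vocabulary size.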

In [None]:
# Create a CountVectorizer object
vectorizer = CountVectorizer(lowercase=True,
                             stop_words='english',
                             min_df=2,
                             max_df=0.95,
                             max_features=None)

In [None]:
# Fit and transform to create DTM
counts = vectorizer.fit_transform(tweets['text_processed'])
counts

In [None]:
# Extract tokens
tokens = vectorizer.get_feature_names_out()

# Create DTM
dtm = pd.DataFrame(data=counts.todense(),
                   index=tweets.index,
                   columns=tokens)

# Print the shape of DTM
print(dtm.shape)

In [None]:
dtm.head()

Most of the tokens have zero occurrences, at least in the first five tweets. 

Let's take a closer look at the DTM!

In [None]:
# Most frequent tokens
dtm.sum().sort_values(ascending=False).head(10)

<a id='section4'></a>

# Term Frequency-Inverse Document Frequency 

So far, we're relying on word frequency to give us information about a document. This assumes that if a word appears more often in a document, it's more informative. However, this may not always be the case. For example, we've already removed stop words because they are not informative, despite the fact that they appear many times in a document. We also know the word "flight" is among the most frequent words, but it is not that informative, because it appears in many documents. Since we're looking at airline tweets, we shouldn't be surprised to see the word "flight"!

To remedy this, we use a weighting scheme called **tf-idf (term frequency-inverse document frequency)**. The big idea behind tf-idf is to weight a word not just by its frequency within a document, but also by its frequency in one document relative to the remaining documents. So, when we construct the DTM, we will be assigning each term a **tf-idf score**. Specifically, term $t$ in document $d$ is assigned a tf-idf score as follows:

<img src='../../../img/tf-idf_finalized.png' alt="TF-IDF" width="1200">

In essence, the tf-idf score of a word in a document is the product of two components: **term frequency (tf)** and **inverse document frequency (idf)**. The idf acts as a scaling factor. If a word occurs in all documents, then idf equals 1 and no scaling happens. For words that occur in fewer documents, idf is greater than 1, which raises the tf-idf score to highlight that the word is informative. In practice, we add 1 to both the numerator and the denominator ("add-one smoothing") to prevent any issues with zero occurrences.
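
As a worked example of the idf component, the sketch below computes a smoothed idf by hand on a hypothetical four-document corpus. It assumes the natural logarithm and the form $\ln\frac{1+n}{1+df(t)} + 1$, which matches the description above (idf equals 1 for a word that occurs in every document) and is the smoothed form that `sklearn` documents; the figure above may use different notation. The names `toy_docs`, `smoothed_idf`, and `tf_idf` are made up for this illustration.

```python
import math

# Hypothetical toy corpus, already tokenized
toy_docs = [["good", "flight"],
            ["bad", "flight"],
            ["good", "crew", "flight"],
            ["late", "flight", "again"]]
n_docs = len(toy_docs)

def smoothed_idf(term):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1 (assumed form, see above)."""
    doc_freq = sum(term in doc for doc in toy_docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tf_idf(term, doc):
    """Raw term count in the document times the smoothed idf."""
    return doc.count(term) * smoothed_idf(term)

idf_flight = smoothed_idf("flight")  # occurs in all 4 documents
idf_good = smoothed_idf("good")      # occurs in 2 documents
idf_crew = smoothed_idf("crew")      # occurs in 1 document

print(f"flight: {idf_flight:.3f}, good: {idf_good:.3f}, crew: {idf_crew:.3f}")
print(f"tf-idf of 'crew' in document 3: {tf_idf('crew', toy_docs[2]):.3f}")
```

The rarer a word is across documents, the larger its idf, so "crew" ends up weighted more heavily than "flight" even though each occurs once in the third document.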

We can also create a tf-idf DTM using `sklearn`. We'll use a `TfidfVectorizer` this time:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Create a tfidf vectorizer
vectorizer = TfidfVectorizer(lowercase=True,
                             stop_words='english',
                             min_df=2,
                             max_df=0.95,
                             max_features=None)

In [None]:
# Fit and transform 
tf_dtm = vectorizer.fit_transform(tweets['text_processed'])
tf_dtm

In [None]:
# Create a tf-idf dataframe
tfidf = pd.DataFrame(tf_dtm.todense(),
                     columns=vectorizer.get_feature_names_out(),
                     index=tweets.index)
tfidf.head()
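
For readers who want to check what `TfidfVectorizer` does with its default settings, here is a hedged sketch on a hypothetical four-document corpus. It recomputes the smoothed idf by hand and compares it with the fitted `idf_` attribute. It also checks that each row of the output has unit length, since the default setting `norm='l2'` rescales every document vector.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for this illustration
toy_docs = ["good flight", "bad flight", "good crew flight", "late flight again"]

toy_tfidf_vectorizer = TfidfVectorizer()
toy_tfidf = toy_tfidf_vectorizer.fit_transform(toy_docs)

# Document frequency of each token, from a plain count matrix
toy_counts = CountVectorizer().fit_transform(toy_docs).toarray()
doc_freq = (toy_counts > 0).sum(axis=0)
n_docs = len(toy_docs)

# Smoothed idf computed by hand
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_matches = np.allclose(manual_idf, toy_tfidf_vectorizer.idf_)
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)

print(idf_matches)
print(row_norms)
```

Because of this normalization, the values in the tf-idf dataframe are not simply count times idf; each row is additionally divided by its Euclidean length.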

## 🥊 Challenge 4: Words with Highest Mean TF-IDF Scores

We have obtained tf-idf values for each term in each document. But what do these values tell us about the sentiments of tweets? Are there any words that are particularly informative for positive/negative tweets? 

To explore this, let's gather the indices of all positive/negative tweets and calculate the mean tf-idf scores of the words that appear in each category. 

We've provided the following starter code to guide you:
- Subset the `tweets` dataframe according to the `airline_sentiment` label and retrieve the index of each subset (`.index`). Assign the index to `positive_index` or `negative_index`.
- For each subset:
    - Retrieve the tf-idf representation 
    - Take the mean tf-idf values across the subset using `.mean()`
    - Sort the mean values in descending order using `.sort_values()`
    - Get the top 10 terms using `.head()`

Next, run `pos.plot` and `neg.plot` to plot the words with the highest mean tf-idf scores for each subset. 

In [None]:
# Complete the boolean masks 
positive_index = tweets[...].index
negative_index = tweets[...].index

In [None]:
# Complete the following two lines
pos = tfidf.loc[...].mean().sort_values(...).head(...)
neg = tfidf.loc[...].mean().sort_values(...).head(...)

In [None]:
pos.plot(kind='barh', 
         xlim=(0, 0.18),
         color='cornflowerblue',
         title='Top 10 terms with the highest mean tf-idf values for positive tweets');

In [None]:
neg.plot(kind='barh', 
         xlim=(0, 0.18),
         color='darksalmon',
         title='Top 10 terms with the highest mean tf-idf values for negative tweets');

🔔 **Question**: How would you interpret these results? Share your thoughts in the chat!

<a id='section5'></a>

## 🎬 **Demo**: Sentiment Classification Using the TF-IDF Representation

Now that we have a tf-idf representation of the text, we are ready to do sentiment analysis!

In this demo, we will use a logistic regression model to perform the classification task. Here we briefly step through how logistic regression works as one of the supervised Machine Learning methods, but feel free to explore our workshop on [Python Machine Learning Fundamentals](https://github.com/dlab-berkeley/Python-Machine-Learning) if you want to learn more about it.

Logistic regression is a linear model that we use to predict the label of a tweet based on a set of features ($x_1, x_2, x_3, ..., x_T$), as shown below:

$$
L = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_T x_T
$$

The list of features we'll pass to the model is the vocabulary of the DTM. We also feed the model a portion of the data, known as the training set, along with other model specifications, to learn the coefficient ($\beta_1, \beta_2, \beta_3, ..., \beta_T$) of each feature. The coefficients tell us whether a feature contributes positively or negatively to the predicted value. The predicted value $L$ is the sum of all features multiplied by their coefficients. It is then passed to a [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function) to be converted into a probability $p$, which tells us whether the predicted label is positive (when $p>0.5$) or negative (when $p<0.5$). 

The remaining portion of the data, known as the test set, is used to test whether the learned coefficients could be generalized to unseen data. 
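
To make the prediction step concrete, here is a minimal sketch that follows the equation above for a single tweet. The coefficients and feature values are hypothetical numbers chosen for illustration, not values learned from the tweets data.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients (betas) and tf-idf feature values (x) for one tweet
toy_betas = {"thank": 2.0, "delay": -1.5, "crew": 0.3}
toy_features = {"thank": 0.6, "delay": 0.0, "crew": 0.4}

# Linear score: L = beta_1 * x_1 + beta_2 * x_2 + ... + beta_T * x_T
linear_score = sum(toy_betas[t] * toy_features[t] for t in toy_betas)

# Convert the score to a probability, then to a label
probability = sigmoid(linear_score)
predicted_label = "positive" if probability > 0.5 else "negative"

print(f"L = {linear_score:.2f}, p = {probability:.3f}, label = {predicted_label}")
```

Note that `sklearn` also fits an intercept term by default, which the simplified equation above leaves out.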

Now that we already have the tf-idf dataframe, the feature set is ready. Let's dive into model specification!

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

We'll use the `train_test_split` function from `sklearn` to separate our data into two sets:

In [None]:
# Train-test split
X = tfidf
y = tweets['airline_sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
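
The split above is random, so the rows that land in each set change from run to run. As an optional sketch (on hypothetical toy arrays, not the tweets), `train_test_split` also accepts a `random_state` to make the split repeatable and a `stratify` argument to keep the label proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 16 negative and 4 positive
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array(["negative"] * 16 + ["positive"] * 4)

def toy_split():
    """Split with a fixed seed and stratification on the labels."""
    return train_test_split(toy_X, toy_y,
                            test_size=0.25,
                            random_state=42,
                            stratify=toy_y)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = toy_split()

# Calling it again with the same seed gives the same split
_, toy_X_test_again, _, _ = toy_split()
is_repeatable = np.array_equal(toy_X_test, toy_X_test_again)

print(len(toy_X_train), len(toy_X_test))
print((toy_y_test == "positive").sum(), "positive sample(s) in the test set")
print(is_repeatable)
```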

The `fit_logistic_regression` function is written below to streamline the training process.

In [None]:
def fit_logistic_regression(X, y):
    '''Fits a logistic regression model to provided data.'''
    model = LogisticRegressionCV(Cs=10,
                                 penalty='l1',
                                 cv=5,
                                 solver='liblinear',
                                 class_weight='balanced',
                                 random_state=42,
                                 refit=True).fit(X, y)
    return model

We'll fit the model and compute the training and test accuracy.

In [None]:
# Fit the logistic regression model
model = fit_logistic_regression(X_train, y_train)

In [None]:
# Get the training and test accuracy
print(f"Training accuracy: {model.score(X_train, y_train)}")
print(f"Test accuracy: {model.score(X_test, y_test)}")

The model achieved ~94% accuracy on the training set and ~89% on the test set. That's pretty good! The model generalizes reasonably well to the test data. Your exact numbers may differ slightly because the train-test split is random.

Next, let's take a look at the fitted coefficients to check whether they make sense. 

We can access them using `coef_`, and we can match each coefficient to the tokens from the vectorizer:

In [None]:
# Get coefs of all features
coefs = model.coef_.ravel()

# Get all tokens
tokens = vectorizer.get_feature_names_out()

# Create a token-coef dataframe
importance = pd.DataFrame()
importance['token'] = tokens
importance['coefs'] = coefs

In [None]:
# Get the top 10 tokens with lowest coefs
neg_coef = importance.sort_values('coefs').head(10)
neg_coef

In [None]:
# Get the top 10 tokens with highest coefs
pos_coef = importance.sort_values('coefs').tail(10)
pos_coef 

Let's plot the top 10 tokens with the highest/lowest coefficients. 

In [None]:
# Plot the top 10 tokens that have the highest coefs
pos_coef.sort_values('coefs', ascending=False) \
        .plot(kind='barh', 
              xlim=(0, 18),
              x='token',
              color='cornflowerblue',
              title='Top 10 tokens with highest coefficient values');

In [None]:
# Plot the top 10 tokens that have the lowest coefs
neg_coef.plot(kind='barh', 
              xlim=(0, -18),
              x='token',
              color='darksalmon',
              title='Top 10 tokens with lowest coefficient values');

Words like "ruin," "rude," and "hour" are strong indicators of negative sentiment, while "thank," "awesome," and "wonderful" are associated with positive sentiment. 

We will wrap up this workshop with these plots. These coefficient terms and the words with the highest TF-IDF values provide different perspectives on the sentiment of tweets. If you'd like, take some time to compare the two sets of plots and see which one provides a better account of the sentiments conveyed in tweets.

<div class="alert alert-success">

## ❗ Key Points

* Preprocessing includes multiple steps; some are common to text data in general, and some are task-specific. 
* Both `nltk` and `spaCy` can be used for tokenization and stop word removal. The latter is more powerful in providing various linguistic annotations. 
* Tokenization works differently in BERT, where a whole word is often broken down into subwords. 
* A Bag-of-Words representation is a simple method to transform our text data to numbers. It focuses on word frequency but not word order. 
* A TF-IDF representation goes a step further; it also considers whether a certain word appears distinctively in one document or occurs uniformly across all documents. 
* With a numerical representation, we can perform a range of text classification tasks, such as sentiment analysis. 

</div>