In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab04.ipynb")

<div class="alert alert-success">

#### Lab 4
    
# Projections and Spans

### EECS 245, Fall 2025 at the University of Michigan
    
</div>

### Instructions

Most labs will have Jupyter Notebooks, like this one, designed to supplement the in-person worksheet. 

To write and run code in this notebook, you have two options:

1. **Use the EECS 245 DataHub.** To do this, click the "code" link under Lab 4 on the course website. Log in with your uniqname and set a password.
1. **Set up a Jupyter Notebook environment locally, and use `git` to clone our course repository.** For instructions on how to do this, see the [**Tech Support**](https://eecs245.org/tech-support) page of the course website.

To receive credit for the lab, you'll need to submit your notebook with all 4 tasks completed to Gradescope and show your TA that all test cases have passed before the end of the lab session. Instructions on how to do this are at the bottom of the notebook.

## From Words to Numbers 📕

---

### Introduction

The big application we'll explore in the programming section of the lab is how to represent a **text** document as a **vector**. In text analysis, each piece of text we want to analyze is called a **document**, and a collection of documents is called a **corpus**. 

For example, if we're analyzing the lyrics of different songs, each document might represent the lyrics to a single song, and our corpus would be the total set of lyrics we have access to.

<center markdown="1">
<b>Goal: Use cosine similarity to measure the similarity between Presidential speeches.</b>
</center>

Each year, the sitting US President delivers a "State of the Union" address. The 2025 State of the Union (SOTU) address was on March 4th, 2025. ("Address" is another word for "speech".)

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('XkFKNkAEzQ8')

The file `'data/stateoftheunion1790-2025.txt'` contains the transcript of every SOTU address since 1790. Go open it in your favorite text editor to see how it's formatted! (Source: [The American Presidency Project](https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union).)

In [None]:
with open('data/stateoftheunion1790-2025.txt') as f:
    sotu = f.read()

In [None]:
# Our corpus, in total, is over 10 million characters long!
len(sotu) / 1_000_000

Below, we've provided code that converts `sotu`, a string with over 10 million characters, to a `pandas` DataFrame object. `pandas` is a Python library designed to work with tabular data – that is, data in tables – and "DataFrame" is its name for a table.

Run the next few cells.

In [None]:
# Speeches are separated by ***.
speeches_lst = sotu.split('\n***\n')[1:]
len(speeches_lst)
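
To see why the split above ends with `[1:]`, here is a minimal sketch on a made-up string. It assumes, as the slice suggests, that whatever comes before the first `***` separator is front matter rather than a speech; `toy_corpus` is hypothetical and is not the real file's contents.

```python
# Hypothetical miniature of the transcript file: front matter, then
# speeches separated by a line containing only "***".
toy_corpus = "Front matter\n***\nSpeech one\n***\nSpeech two"

pieces = toy_corpus.split('\n***\n')
print(pieces)                # the first piece is the front matter, not a speech

toy_speeches = pieces[1:]    # drop everything before the first separator
print(len(toy_speeches))
```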

In [None]:
print(speeches_lst[-1][:1000])

In [None]:
import pandas as pd
import re

def create_speeches_df(speeches_lst):
    def extract_struct(speech):
        L = speech.strip().split('\n', maxsplit=3)
        L[3] = re.sub(r"[^A-Za-z' ]", ' ', L[3]).lower() # Replaces anything OTHER than letters, apostrophes, and spaces with ' '.
        L[3] = re.sub(r"it's", 'it is', L[3]).replace(' s ', '')
        return dict(zip(['president', 'date', 'text'], L[1:]))

    speeches = pd.DataFrame(list(map(extract_struct, speeches_lst)))
    speeches.index = speeches['president'].str.strip() + ': ' + speeches['date']
    speeches = speeches[['text']]
    
    return speeches

speeches = create_speeches_df(speeches_lst)
speeches
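
If the regular expressions inside `create_speeches_df` look opaque, the sketch below applies the same two substitutions to a short made-up string. The string `raw` is hypothetical, not a quote from the transcripts.

```python
import re

# Hypothetical raw snippet, not taken from the real transcripts.
raw = "Fellow-Citizens: It's 1790!"

# Keep letters, apostrophes, and spaces; every other character becomes a space.
cleaned = re.sub(r"[^A-Za-z' ]", ' ', raw).lower()
# Expand the contraction, as create_speeches_df does.
cleaned = re.sub(r"it's", 'it is', cleaned)

# .split() with no arguments collapses runs of spaces.
words = cleaned.split()
print(words)
```

The hyphen, colon, digits, and exclamation mark all turn into spaces, which is why the cleaned text can contain several spaces in a row.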

Each row corresponds to a single speech; the DataFrame above has 235 rows, meaning we have the text of 235 speeches. Notice that punctuation has been removed from each speech, and all **terms** have been converted to lowercase.

Our goal is to produce 235 vectors, 

$$\vec v_{\text{George Washington: January 8, 1790}}, \vec v_{\text{George Washington: December 8, 1790}}, \ldots, \vec v_{\text{Donald J. Trump: March 4, 2025}}$$

so that we can measure how similar any two speeches are by computing the cosine similarity of their corresponding vectors. Make your predictions now: of all pairs of the 235 speeches, which will be the most similar? The least?

### The Bag of Words Model

The big question is **how** to represent a speech (or more generally, a document) as a vector. One simple idea is to count the number of occurrences of every term in every document, and store these counts in a vector.

For example, consider the following 3 documents:

1. **big big big big data class**
1. **data big data science**
1. **science big data**

There are 4 unique terms across this corpus: **big**, **data**, **class**, and **science**. So, we can represent each document as a vector in $\mathbb{R}^4$.

$$\vec v_i = \begin{bmatrix} \text{count of "big" in document } i \\ \text{count of "data" in document } i \\ \text{count of "class" in document } i \\ \text{count of "science" in document } i \end{bmatrix}$$

For example, $\vec v_1 = \begin{bmatrix} 4 \\ 1 \\ 1 \\ 0 \end{bmatrix}$. See if you can identify what $\vec v_2$ and $\vec v_3$ are.

This technique for representing documents as vectors of word counts is called the **bag of words** model. It's called this because it doesn't consider the order of the words in the document; imagine the words are cut apart at the spaces and shuffled around in a bag. Their order within a document is irrelevant; all we care about are their frequencies.

<center><img src='imgs/bag-of-words.jpeg' width=500></center>

In this toy example, we only have 3 vectors, so writing them out side-by-side is manageable. But in our speeches example, we have 235 speeches, so 235 vectors, and it'll become inconvenient to write them out this way. Instead, we can represent all of these vectors in a single table, where **each row corresponds to a document's vector representation**.

| | big | data | class |  science |
| --- | --- | --- | --- | --- |
| **big big big big data class** | 4 | 1 | 1 | 0 |
| **data big data science** | 1 | 2 | 0 | 1 |
| **science big data** | 1 | 1 | 0 | 1 |

The first row of the table above is just $\vec v_1$ from before.

Once we have such a table, we can compute the cosine similarity between any pair of rows, which will give us the similarity between the corresponding documents.
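
As a sanity check, the toy table above can be rebuilt in a few lines. This is only a sketch: it uses `collections.Counter`, which is a different approach from the `.str.count` method used later in the lab, and the name `toy_counts` is made up for illustration.

```python
from collections import Counter
import pandas as pd

docs = ["big big big big data class", "data big data science", "science big data"]
terms = ["big", "data", "class", "science"]

# Counter tallies the whitespace-separated terms in each document;
# looking up a term that never appears gives 0.
rows = [[Counter(doc.split())[t] for t in terms] for doc in docs]
toy_counts = pd.DataFrame(rows, index=docs, columns=terms)
print(toy_counts)

# Word order doesn't matter: shuffling a document's terms leaves its counts unchanged.
print(Counter("class data big big big big".split()) == Counter(docs[0].split()))
```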

### Applying the Bag of Words Model to Presidential Speeches

Let's produce this table for our speeches. We'll do some of the work for you, but you'll need to fill in the rest.

First, we'll find **all** unique terms across all speeches.

In [None]:
speeches

In [None]:
# Takes each speech's text, splits by spaces to get a list of terms per speech, then explodes the list into a single list of terms.
# The result below contains the unique terms, along with their counts across all speeches.
# So, the word "the" appears 147744 times across all speeches.
all_unique_terms = speeches['text'].str.split().explode().value_counts()
all_unique_terms
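
To see what the `split` / `explode` / `value_counts` chain does at each step, here is the same chain on the three toy documents from earlier. The names `toy_text` and `toy_term_counts` are made up for this sketch.

```python
import pandas as pd

toy_text = pd.Series([
    "big big big big data class",
    "data big data science",
    "science big data",
])

# split -> one list of terms per document; explode -> one term per row;
# value_counts -> each unique term with its total count, most frequent first.
toy_term_counts = toy_text.str.split().explode().value_counts()
print(toy_term_counts)

# Keeping the k most frequent terms is then a slice of the index.
print(list(toy_term_counts.iloc[:2].index))
```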

There are over 20,000 unique terms (24,528, in fact), so calculations that use all of them would take too long to run. For speed, let's keep only the 500 most frequent terms across all speeches.

In [None]:
unique_terms = list(all_unique_terms.iloc[:500].index)
unique_terms

So, we'll represent each speech as a vector in $\mathbb{R}^{500}$.

Now, we need to find the number of occurrences of each term in each speech. For example, how many times does "the" appear in the first speech?

In [None]:
first_speech = speeches['text'].iloc[0]
first_speech

The string `.count` method gives us the number of (non-overlapping) times a substring appears in a string. Notice that below we count occurrences of `" the "` rather than `"the"`, because we don't want to count instances of "the" that are part of other words, like "thesaurus" or "there".

In [None]:
first_speech.count(' the ')
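
The difference between the two patterns is easiest to see on a short made-up sentence. The sketch below also shows a trade-off worth knowing about: padding with spaces skips a term that sits at the very start or end of the string. The rest of the lab uses the padded version, so stick with it for the tasks.

```python
text = "the cat and the thesaurus are there"   # hypothetical sentence

print(text.count('the'))          # also matches inside "thesaurus" and "there"
print(text.count(' the '))        # only occurrences with a space on both sides
print(text.split().count('the'))  # whole-word matches, including the one at the start
```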

We _could_ write a nested loop, like:

```
for all 500 terms t:
    for all 235 speeches d:
        count the number of times t appears in d
```

But, as we've come to know in this course, there's _usually_ a better way. And indeed there is. `pandas` DataFrames, like `speeches` below, come equipped with several vectorized methods that allow us to apply an operation to every row.

In [None]:
speeches

Check out what happens below!

In [None]:
speeches['text'].str.count(' the ')

Above, we counted the number of times `" the "` appears in **each** speech. The first value above, 97, is the same as we got with `first_speech.count(' the ')`. This is considerably quicker than writing a loop over all 235 speeches.

The data structure above is a `pandas` Series, which you can think of as a 1-dimensional array, along with an index, which is a name for each element in the array.
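
Here is a small sketch of the same idea on a made-up Series, so you can see the values and the index side by side. The documents and the labels `doc A`, `doc B`, and `doc C` are hypothetical.

```python
import pandas as pd

toy = pd.Series(
    ["we the people", "of the people by the people", "no articles here"],
    index=["doc A", "doc B", "doc C"],   # hypothetical labels
)

# One count per element; the result keeps the original index.
the_counts = toy.str.count(' the ')
print(the_counts)

# Elements can be looked up by their index label.
print(the_counts.loc["doc B"])
```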

So, the above allows us to instead write just a single loop:

```
for all 500 terms t:
    count the number of times t appears in every document d, using .str.count
```

<div class="alert alert-info" markdown="1">

### Task 1

</div>

Below, assign `counts_dict` to a **dictionary** with 500 keys, one for each of the top 500 unique terms. The value corresponding to a particular key should be a `pandas` Series of length 235, where each element is the number of times the key appears in a speech.

For example, `counts_dict["the"]` should be the same as `speeches['text'].str.count(" the ")` from above.

_Hint: Our solution uses a single `for`-loop, and takes ~10 seconds to run locally._

In [None]:
counts_dict = ...

# Feel free to change the string below to test your solution. 
counts_dict["americans"]

In [None]:
grader.check("task01")

Now that you've produced `counts_dict`, we can convert it to a DataFrame, where each row corresponds to a speech, and each column corresponds to a term.

In [None]:
counts_df = pd.DataFrame(counts_dict)
counts_df

Each **row** is a vector in $\mathbb{R}^{500}$, corresponding to a different speech. The speech names on the left are the **index** of the DataFrame, and aren't included as part of the vector.

To access the vector for a particular speech, we can use the `.loc` accessor. For example, `counts_df.loc["George Washington: January 8, 1790"]` gives us the first row.

In [None]:
counts_df.loc["George Washington: January 8, 1790"]

Equivalently, to access row `i` (indexed starting at 0), we can use `counts_df.iloc[i]`.

In [None]:
# Same as above!
counts_df.iloc[0]
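
The sketch below repeats both steps on a tiny made-up dictionary: converting a dictionary of Series into a DataFrame, then pulling out one row by label and by position. The names `toy_dict` and `toy_df` and the row labels are hypothetical.

```python
import pandas as pd

index = ["doc 1", "doc 2", "doc 3"]   # hypothetical row labels
toy_dict = {
    "big": pd.Series([4, 1, 1], index=index),
    "data": pd.Series([1, 2, 1], index=index),
}

# Each dictionary key becomes a column; the shared Series index becomes the row labels.
toy_df = pd.DataFrame(toy_dict)
print(toy_df)

# .loc looks up a row by label, .iloc by integer position.
print(toy_df.loc["doc 2"].equals(toy_df.iloc[1]))
```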

Now that we know how to access the vector for a particular speech, we can compute the cosine similarity between any two speeches!

First, let's implement a general-purpose cosine similarity function that works on any two vectors $\vec u, \vec v \in \mathbb{R}^n$. Before you complete the next task, review the [end of Chapter 2.1](https://notes.eecs245.org/vectors-and-matrices/vectors/#np-linalg-norm-and-vectorization) and the example below.

In [None]:
import numpy as np

np.dot([1, 2], [3, 4])
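
The cell above computes a dot product. The other ingredient covered at the end of Chapter 2.1 is the norm, which `np.linalg.norm` computes. The sketch below shows the two building blocks separately; combining them is left for the next task.

```python
import numpy as np

u = np.array([1, 2])
v = np.array([3, 4])

dot = np.dot(u, v)            # 1*3 + 2*4
norm_v = np.linalg.norm(v)    # sqrt(3**2 + 4**2)
print(dot, norm_v)
```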

<div class="alert alert-info" markdown="1">

### Task 2

</div>

Complete the implementation of the function `cosine_similarity`, which takes in two lists, arrays, or Series `u` and `v`, both corresponding to vectors in $\mathbb{R}^n$, and returns their cosine similarity. Example behavior is given below.

```python
>>> cosine_similarity([4, 3, 2], [1, -1, 1])
0.3216337604513385

>>> cosine_similarity(counts_df.iloc[0], counts_df.iloc[1])
0.9678045752893217
```

In [None]:
import numpy as np

def cosine_similarity(u, v):
    # Converts u and v to numpy arrays, in case they're lists or pandas Series to begin with.
    u = np.array(u)
    v = np.array(v)
    ...

# Feel free to change the inputs below to test your solution.
cosine_similarity([4, 3, 2], [1, -1, 1])

In [None]:
grader.check("task02")

<div class="alert alert-info" markdown="1">

### Task 3

</div>

Now, complete the implementation of the function `similarity_of_speeches`, which takes in a DataFrame `df` and **strings** `index_1` and `index_2` and returns the cosine similarity of the rows of `df` with index values of `index_1` and `index_2`, respectively.

Example behavior is given below.

```python
>>> similarity_of_speeches(counts_df, 
                           'Benjamin Harrison: December 3, 1889', 
                           'George H.W. Bush: January 28, 1992')
0.8487541155585746
```

_Hint: Our implementation is only one line long. Make sure `counts_df` doesn't appear in your solution! We've defined this function this way so that later on, `df` could be a **different** table with alternative vector representations of each speech._

In [None]:
def similarity_of_speeches(df, index_1, index_2):
    ...
    
# Feel free to change the inputs below to test your solution.
similarity_of_speeches(counts_df, 
                       'Benjamin Harrison: December 3, 1889', 
                       'George H.W. Bush: January 28, 1992')

In [None]:
grader.check("task03")

Let's use your implementation of `similarity_of_speeches` to compute the cosine similarity between all $\frac{n(n-1)}{2}$ pairs of speeches, where $n = 235$. (There are $\binom{n}{2}$ pairs, if the binomial coefficient sounds familiar from EECS 203.)
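
If `itertools.combinations` is new to you, here is a quick sketch of what it produces on a made-up list of four items, along with a check of the pair-counting formula.

```python
from itertools import combinations
from math import comb

items = ['a', 'b', 'c', 'd']          # hypothetical items
pairs = list(combinations(items, 2))
print(pairs)                          # each unordered pair appears exactly once

# The number of pairs matches n(n-1)/2 with n = 4.
print(len(pairs) == comb(4, 2) == 4 * 3 // 2)

# The same formula with n = 235 speeches.
print(comb(235, 2))
```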

In [None]:
from itertools import combinations

sims_dict = {}
# For every pair of speeches, find the similarity and store it in
# the sims_dict dictionary.
for pair in combinations(counts_df.index, 2):
    sims_dict[pair] = similarity_of_speeches(counts_df, pair[0], pair[1])
    
# Turn the sims_dict dictionary into a DataFrame.
sims = (
    pd.Series(sims_dict)
    .reset_index()
    .rename(columns={'level_0': 'speech 1', 'level_1': 'speech 2', 0: 'cosine similarity'})
    .sort_values('cosine similarity', ascending=False)
)
sims
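
The chain that turns `sims_dict` into a DataFrame can look mysterious, especially the `level_0`, `level_1`, and `0` column names. The sketch below runs the same steps on a tiny dictionary with made-up similarity values.

```python
import pandas as pd

toy_sims = {('a', 'b'): 0.9, ('a', 'c'): 0.5, ('b', 'c'): 0.7}   # made-up similarities

# Tuple keys become a two-level index on the Series.
s = pd.Series(toy_sims)

# reset_index turns the two levels into columns named 'level_0' and 'level_1';
# the unnamed values land in a column named 0.
df = s.reset_index()
print(list(df.columns))

df = (
    df.rename(columns={'level_0': 'speech 1', 'level_1': 'speech 2', 0: 'cosine similarity'})
    .sort_values('cosine similarity', ascending=False)
)
print(df)
```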

It seems that, when we use the bag of words model to convert speeches to vectors, the most similar pair of speeches is the pair by William H. Taft from 1909 and 1911. The most dissimilar pair is John Quincy Adams's 1827 speech and George W. Bush's 2001 speech. Cool!

Throughout this lab, you may have realized there's a key flaw with the bag of words model. Words like "the" and "of" appear very frequently in every speech, and so the "the" and "of" components of our vectors in $\mathbb{R}^{500}$ will consistently be very large. This will cause the cosine similarities of most pairs of vectors to be relatively large, not because they use a similar combination of rare-ish words, but rather, because they both use lots of the same common words.

Look at the `cosine similarity` column in the table above: even the least similar pair has a relatively high cosine similarity of 0.713541.

### Term Frequency-Inverse Document Frequency (TF-IDF)

Instead of using the bag of words model to convert speeches to vectors, let's use a different metric, called the **term frequency-inverse document frequency**.

Suppose $t$ is a single term and $d$ is a document. Then:

$$\begin{align*}\text{tfidf}(t, d) &= \frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of terms in $d$}} \cdot \log \left(\frac{\text{total # of documents}}{\text{# of documents in which $t$ appears}} \right)\end{align*} $$

$\text{tfidf}(t, d)$ is large when $t$ is common in document $d$, but rare overall, so you can think of it as an "importance score" for $t$ in $d$.
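
To get a feel for the formula, here is a minimal sketch that plugs in made-up numbers. The function name `tfidf` and its parameter names are hypothetical, and the sketch assumes $\log$ is the natural logarithm, which matches the `np.log` calls used elsewhere in this lab.

```python
import math

def tfidf(count_in_doc, doc_length, n_docs, n_docs_with_term):
    """TF-IDF as defined above, using the natural logarithm."""
    tf = count_in_doc / doc_length
    idf = math.log(n_docs / n_docs_with_term)
    return tf * idf

# Hypothetical numbers: a term used 3 times in a 100-term document,
# appearing in 10 of the 1,000 documents in the corpus.
rare = tfidf(3, 100, 1000, 10)      # 0.03 * ln(100)

# Same term frequency, but the term appears in 900 of the 1,000 documents.
common = tfidf(3, 100, 1000, 900)   # 0.03 * ln(1000 / 900)

print(rare, common)
```

Both terms are equally frequent within the document, but the one that is rare across the corpus gets a much larger score.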

Before we apply this to our speeches dataset, let's work through a toy example.

<div class="alert alert-info" markdown="1">

### Task 4

</div>

Consider the following three documents.

1. **big big big big data class**
1. **data big data science**
1. **science big data**

Assign `tfidf_science_doc2` to $\text{tfidf}(\text{"science", document 2})$ and `tfidf_big_doc1` to $\text{tfidf}(\text{"big", document 1})$. A related value has already been calculated for you below.

In [None]:
tfidf_class_doc3 = (0 / 3) * np.log(3 / 1)
tfidf_science_doc2 = ...
tfidf_big_doc1 = ...

print('tfidf("class", document 3) = ', tfidf_class_doc3)
print('tfidf("science", document 2) = ', tfidf_science_doc2)
print('tfidf("big", document 1) = ', tfidf_big_doc1)

In [None]:
grader.check("task04")

You'll notice that both `tfidf_class_doc3` and `tfidf_big_doc1` were 0. But, they were 0 for different reasons.

Recall,

$$\begin{align*}\text{tfidf}(t, d) &= \frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of terms in $d$}} \cdot \log \left(\frac{\text{total # of documents}}{\text{# of documents in which $t$ appears}} \right)\end{align*} $$


$\text{tfidf}(t, d)$ is 0 when:
- $t$ doesn't appear in $d$. Here, $\frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of terms in $d$}} = 0$.
- $t$ appears in every single document. Why does this imply $\text{tfidf}(t, d) = 0$?

The hope, in introducing TF-IDF, is that if we use TF-IDF scores to turn speeches into vectors, the resulting vectors will contain more meaningful information about each speech, rather than just raw word frequencies.

Somehow, we'll need to create the equivalent of `counts_df`, but at row $d$ and column $t$, the new table should contain $\text{tfidf}(t, d)$, rather than $\text{count}(t, d)$.

In [None]:
counts_df

You've now finished the required programming component of Lab 4.

**Open-ended challenge**: In the space below, we encourage you to _try_ and figure out how to create a table `tfidf_df`, like the one we describe above.

There are a few ways to go about it:
- Start with `counts_df`, and use it to compute term frequencies (the TF in TF-IDF) and inverse document frequencies.
- Use the `TfidfVectorizer` class from `sklearn.feature_extraction.text`. Once you instantiate a `TfidfVectorizer` object, use the `fit_transform` method on it.

Feel free to come to us in office hours if you'd like guidance on how to make this work!

## Finish Line 🏁

Congratulations! You're ready to submit the programming portion of Lab 4.

To submit your work to Gradescope:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all public tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download`, then upload your notebook to Gradescope under "Lab 4".
5. Stick around for a few minutes while the Gradescope autograder grades your work. Make sure you see that all **public tests** have passed on Gradescope.