<div style="text-align: right">
    <i>
        LIN 5300<br>
        Spring 2023 <br>
        Jon Rawski <br>
    </i>
</div>


# Submission Guidelines
Please upload your completed notebook file on Canvas in the following format:

LastName_FirstName_HW3.ipynb

# Preliminaries (2 points)

1. Load in all of the packages you will need for this assignment in the cell below. 

  If you load in other packages later in the notebook, be sure to bring them up here. This is good coding practice and will look cleaner for everyone when reading your code.

  You will need the following:

  * To load a plain text file in with the colab interface (by uploading the file to the notebook)
  * To count occurrences of tokens
  * A regular expression tokenizer
  * The NLTK tokenizer for English
  * The spaCy word tokenizer for English


In [1]:
# Load in packages that you will use in this notebook
from pprint import pprint
# from google.colab import drive //doing work in vscode
from collections import Counter
import spacy 
import re
import nltk
nltk.download('punkt_tab')

# put other packages you will use below this line

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/henrypham/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

# Preliminaries 2 (1 point)

Load in the file called `faustus.txt` in the variable `faust`.

In [3]:
#Import the file here
# a context manager closes the file handle once the text has been read
with open('faustus_clean.txt', 'r', encoding='utf-8') as f:
    faustus = f.read()


# Question 1: 2 points

In this section, we will be comparing different preprocessing strategies. 

First, tokenize the corpus using a simple regEx technique that relies on whitespaces to distinguish words (You can use the function we defined in class)!

Save the resulting list into the variable `faust_regEx`.

In [4]:
# define a regex tokenizer: \w+ matches runs of word characters,
# so both whitespace and punctuation act as separators
def regex_tokenizer(text):
    return re.findall(r'\w+', text)

# define a variable for each token list
faust_regEx = regex_tokenizer(faustus)

# print the first 10 elements as a safety check
print(faust_regEx[:10])

['not', 'marching', 'in', 'the', 'fields', 'of', 'thrasymene', 'where', 'mars', 'did']
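The output above contains no punctuation, which follows from the `\w+` pattern rather than from splitting on whitespace alone. A minimal sketch contrasting the two patterns; the string `sample` is a hypothetical phrase modeled on the corpus, and `whitespace_tokenizer` is an illustrative name that is not part of the assignment:

```python
import re

def regex_tokenizer(text):
    # \w+ keeps runs of letters, digits, and underscores; punctuation is dropped
    return re.findall(r'\w+', text)

def whitespace_tokenizer(text):
    # \S+ keeps every run of non-whitespace, so punctuation stays attached to words
    return re.findall(r'\S+', text)

# Hypothetical sample, modeled on the corpus rather than read from the file
sample = "where state is overturn'd; nor in the pomp"

word_tokens = regex_tokenizer(sample)
space_tokens = whitespace_tokenizer(sample)

print(word_tokens)
print(space_tokens)
```

With `\w+` the apostrophe form is split into two tokens, while `\S+` leaves it intact but glues the semicolon onto it. Which behavior is preferable depends on what will be counted later.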


# Question 2: 5 points

Now, we are going to use the [`nltk` `word_tokenize`](https://www.kite.com/python/docs/nltk.word_tokenize) function. You should have loaded nltk above in the very first block. Use `word_tokenize` on the corpus and save the result, a list of strings, in a variable called `faust_nltk_tokenized`. Use a slice to print the fifth to the tenth elements of the list.

In [5]:
# use nltk's word_tokenize function
# save the output into a variable
faust_nltk_tokenized = nltk.word_tokenize(faustus, language="english", preserve_line=False)

# print the first 10 elements as a safety check
print(faust_nltk_tokenized[:10])

['not', 'marching', 'in', 'the', 'fields', 'of', 'thrasymene', ',', 'where', 'mars']


# Question 3: 5 points

Now, we are going to use the [`spacy`](https://spacy.io/usage/spacy-101) tokenization [function](https://spacy.io/api/tokenizer). 

The output that spacy gives you is more complicated than the output of `nltk`'s `word_tokenize` function, because the `spacy` API takes a string (e.g., "I like cheese") and returns a `doc` object. Within the `doc` object there are `token`s, and each `token` has a `text` object. 

For this question, what you need to do is:

* load the spacy tokenizer with "en" as the default value
* obtain the `doc` object by using the tokenizer over the list `faust_full` 
* instantiate an empty list `faust_spacy_tokenized` and implement a for loop through all of the  `token`s of the obtained doc, storing for each the `token.text` string object into your list.

In [9]:
# use spacy's tokenization features
# had to run python -m spacy download en_core_web_sm in terminal to get this to work
nlp = spacy.load(name='en_core_web_sm')
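# strip '\n' from the text; note that replacing it with '' joins the last word of one line
# to the first word of the next (replacing it with ' ' would keep them apart)
faustus_spacy = faustus.replace('\n', '')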
doc = nlp(faustus_spacy)

# save the output into a variable
faust_spacy_tokenized = [token.text for token in doc]

# print the first 10 elements as a safety check
# initial run: had '\n' in output, so I added a filter to remove it ^
print(faust_spacy_tokenized[:10])



['not', 'marching', 'in', 'the', 'fields', 'of', 'thrasymene', ',', 'where', 'mars']


# Question 4: Compare tokenizations (5 points)

Now that we have three tokenizations, we want to compare how similar the tokenizations are. 

In the code cell below, visualize a reasonably large subset of the corpus, tokenized under each of the three approaches above.
Then, in the cell below that, explain how the outputs of these tokenizations differ, and what you can infer from the output about the ways the tokenization techniques differ (If necessary, you can modify the number of tokens you visualize until you converge on a good sample for the comparison). 

What do you think are the strengths and weaknesses of each tokenization approach? Do you think one of the tokenizations is better than another? Can you think of a way you would test which one is better?

### Question 4A: Code (2/5)

In [None]:
# Put your code here
# take a large subset of the corpus from faustus_clean.txt
faust_subset = faustus[:100000]
# tokenize under the 3 approaches used above

faust_regEx_subset = regex_tokenizer(faust_subset) 

faust_nltk_subset = nltk.word_tokenize(faust_subset, language="english", preserve_line=False)

faust_spacy = faust_subset.replace('\n', '')
doc_subset = nlp(faust_spacy)
faust_spacy_subset = [token.text for token in doc_subset]

print("RegEx Tokenizer Output:\t", faust_regEx_subset[:50])
print("\nNLTK Tokenizer Output:\t", faust_nltk_subset[:50])
print("\nSpaCy Tokenizer Output:\t", faust_spacy_subset[:50])

RegEx Tokenizer Output:	 ['not', 'marching', 'in', 'the', 'fields', 'of', 'thrasymene', 'where', 'mars', 'did', 'mate', 'the', 'warlike', 'carthagens', 'nor', 'sporting', 'in', 'the', 'dalliance', 'of', 'love', 'in', 'courts', 'of', 'kings', 'where', 'state', 'is', 'overturn', 'd', 'nor', 'in', 'the', 'pomp', 'of', 'proud', 'audacious', 'deeds', 'intends', 'our', 'muse', 'to', 'vaunt', 'her', 'heavenly', 'verse', 'only', 'this', 'gentles', 'we']

NLTK Tokenizer Output:	 ['not', 'marching', 'in', 'the', 'fields', 'of', 'thrasymene', ',', 'where', 'mars', 'did', 'mate', 'the', 'warlike', 'carthagens', ';', 'nor', 'sporting', 'in', 'the', 'dalliance', 'of', 'love', ',', 'in', 'courts', 'of', 'kings', 'where', 'state', 'is', 'overturn', "'d", ';', 'nor', 'in', 'the', 'pomp', 'of', 'proud', 'audacious', 'deeds', ',', 'intends', 'our', 'muse', 'to', 'vaunt', 'her', 'heavenly']

SpaCy Tokenizer Output:	 ['not', 'marching', 'in', 'the', 'fields', 'of', 'thrasymene', ',', '\n', 'where', 'mars',

### Question 4B: Free response (3/5)

**ANSWER:** 

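The three tokenizers differ in complexity and in how they treat punctuation and apostrophes. The regex tokenizer is the simplest: the `\w+` pattern keeps only runs of word characters, so it drops punctuation entirely and breaks apostrophe forms apart ("overturn'd" becomes "overturn" and "d", and a contraction like "don't" would become "don" and "t"). NLTK keeps punctuation as separate tokens and splits the clitic off as its own token ("overturn", "'d"), which preserves more information at little extra cost. SpaCy also keeps punctuation as separate tokens, and its pipeline can additionally provide annotations such as named entities, but loading and running the model tends to take longer. The spaCy output above also contains '\n' tokens, so line breaks need to be handled before counting. To test which tokenization is better, one could compare total token counts, check how each approach handles contractions, and measure agreement with a hand-tokenized sample of the text. Regex is appropriate for small, quick tasks, NLTK for everyday tasks, and spaCy for tasks that need richer linguistic annotation.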
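One way to make the comparison concrete is to align two token lists and report only the places where they disagree. The sketch below uses `difflib.SequenceMatcher` from the standard library on two short lists copied from the outputs above; the helper name `token_disagreements` is an illustrative assumption, not something defined in class.

```python
from difflib import SequenceMatcher

# Short excerpts copied from the RegEx and NLTK outputs above
regex_tokens = ['kings', 'where', 'state', 'is', 'overturn', 'd', 'nor']
nltk_tokens = ['kings', 'where', 'state', 'is', 'overturn', "'d", ';', 'nor']

def token_disagreements(a, b):
    # Align the two token sequences and keep only the spans that are not equal
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != 'equal'
    ]

diffs = token_disagreements(regex_tokens, nltk_tokens)
for tag, left, right in diffs:
    print(tag, left, right)
```

Running the same helper on longer slices of the three token lists would give a list of disagreements that can be inspected by hand or counted by type.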


# Question 5: Tabulating word counts under different algorithms (10 points)

Now that you have compared and contrasted different tokenization algorithms, consider the effect that tokenization can have on our ability to characterize a corpus as a whole. 

Load in the `Counter` class and extract counts of all of the words under each of the three tokenization schemes. Look at the top 5 most frequent (using the `.most_common()` method) and the top 10 least frequent (hint: use negative indices) words. In our data, what appear to be the biggest sources of disagreement? Do these confirm or disconfirm your hypotheses in the previous question? How or how not? 

### Question 5A: Code (5/10)

In [None]:
faust_regEx_counts = Counter(faust_regEx)
faust_nltk_counts = Counter(faust_nltk_tokenized)
faust_spacy_counts = Counter(faust_spacy_tokenized)

print("Top 5 most frequent words (RegEx):", faust_regEx_counts.most_common(5))
print("Top 5 most frequent words (NLTK):", faust_nltk_counts.most_common(5))
print("Top 5 most frequent words (SpaCy):", faust_spacy_counts.most_common(5))

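# NOTE: list(counts.items())[-10:] returns the last 10 distinct tokens in order of first appearance
# (the end of the text), not the 10 rarest; counts.most_common()[-10:] would sort by frequency first
print("\nTop 10 least frequent words (RegEx):", list(faust_regEx_counts.items())[-10:])
print("Top 10 least frequent words (NLTK):", list(faust_nltk_counts.items())[-10:])
print("Top 10 least frequent words (SpaCy):", list(faust_spacy_counts.items())[-10:])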


Top 5 most frequent words (RegEx): [('and', 509), ('the', 501), ('i', 400), ('of', 338), ('to', 333)]
Top 5 most frequent words (NLTK): [(',', 1962), ('.', 562), ('and', 504), ('the', 500), ('i', 396)]
Top 5 most frequent words (SpaCy): [(',', 1949), ('the', 468), ('and', 449), ('of', 326), ('i', 323)]

Top 10 least frequent words (RegEx): [('unlawful', 1), ('deepness', 1), ('entice', 1), ('practise', 1), ('permits', 1), ('terminat', 2), ('hora', 1), ('diem', 1), ('auctor', 1), ('opus', 1)]
Top 10 least frequent words (NLTK): [('unlawful', 1), ('deepness', 1), ('entice', 1), ('practise', 1), ('permits', 1), ('terminat', 2), ('hora', 1), ('diem', 1), ('auctor', 1), ('opus', 1)]
Top 10 least frequent words (SpaCy): [('deepness', 1), ('entice', 1), ('witsto', 1), ('practise', 1), ('permits.terminat', 1), ('hora', 1), ('diem', 1), ('terminat', 1), ('auctor', 1), ('opus', 1)]
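The "least frequent" lists above are taken from the end of `.items()`, which follows the order in which tokens were first seen, so an entry such as `('terminat', 2)` can appear among them. A minimal sketch of sorting by frequency before slicing, using a hypothetical toy token list (`toy_tokens` is not from the corpus):

```python
from collections import Counter

# Hypothetical toy tokens, ordered so that insertion order and frequency order differ
toy_tokens = "opus the and the faustus the and faustus and the".split()
counts = Counter(toy_tokens)

# Last two distinct tokens in order of first appearance (not sorted by count)
last_seen = list(counts.items())[-2:]

# most_common() with no argument sorts every type from most to least frequent,
# so a negative slice takes the rarest types from the end
rarest = counts.most_common()[-2:]

print("last seen:", last_seen)
print("rarest:   ", rarest)
```

In a real corpus many types occur exactly once, so the tail of `most_common()` is only one of many equally rare selections.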


### Question 5B: Free response (5/10)

The biggest areas of disagreement between tokenizers are punctuation handling, contractions, and token merging. The RegEx tokenizer drops punctuation, so function words like "and" and "the" top its list. NLTK and spaCy each treat punctuation as separate tokens, which makes "," the most common token for both. The spaCy list also contains merged tokens such as "permits.terminat" and "witsto" that appear in neither of the other lists. These most likely come from the preprocessing step rather than from spaCy itself: the newlines were replaced with an empty string before tokenizing, which joins the last word of one line to the first word of the next. That would also explain why spaCy's counts for "the" (468) and "and" (449) are lower than NLTK's (500 and 504). These outputs confirm the initial hypothesis that RegEx is simple but lossy and that NLTK is organized and readable; the hypothesis about spaCy is harder to assess because its input was altered. The discrepancies show how tokenization and preprocessing choices affect downstream NLP tasks.
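The effect of removing newlines on token boundaries can be checked on a small string. The two-line string below is a hypothetical example modeled on the merged token "permits.terminat", not a quotation from the file:

```python
# Hypothetical two-line string, modeled on the merged token seen in the spaCy counts
two_lines = "power permits.\nterminat hora diem"

# Replacing the newline with nothing glues the words on either side together
joined_without_space = two_lines.replace('\n', '')

# Replacing it with a space keeps the word boundary
joined_with_space = two_lines.replace('\n', ' ')

tokens_without_space = joined_without_space.split()
tokens_with_space = joined_with_space.split()

print(tokens_without_space)
print(tokens_with_space)
```

Only the first version produces a merged token, which suggests checking the preprocessing step before attributing such tokens to the tokenizer.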

# Question 6: Tabulating pointwise mutual information (15 points)

Mutual information is a computation that is very similar to computing a conditional probability. Recall that computing a conditional probability, $p(B \mid A) = \frac{p(A \cap B)}{p(A)}$, requires knowing two probabilities. The first, $p(A \cap B)$, is the probability of observing $A$ and $B$ at the same time (so observing the bigram as a type). The second, $p(A)$, is the probability of observing $A$ across all contexts.

Recall that we have used MLE to approximate all of these by their relative frequencies in a corpus. For example, $p(A)$ can be approximated by:

<center> $ \large p(A) \approx \frac{count(A)}{\sum_{w \in V}count(w)} $ </center>

Pointwise mutual information is very similar, but requires dividing the co-occurrence probability by both probabilities $p(A)$ and $p(B)$, and then taking the logarithm of that ratio.


<center>$ \large PMI(A, B) = \log_2 \frac{p(A \cap B)}{p(A) \cdot p(B)} $</center>
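The two statistics can be compared on a tiny example. The sketch below uses a hypothetical eight-token corpus (not taken from `faustus_clean.txt`), estimates every probability by relative frequency, and takes the base-2 logarithm of the ratio, which is the usual pointwise form. As an assumption of this sketch, bigram probabilities are normalized by the number of bigrams rather than the number of tokens; the difference is negligible for a large corpus.

```python
from collections import Counter
from math import log2

# Hypothetical toy corpus, not taken from faustus_clean.txt
toy_tokens = "the fields of the kings of the fields".split()
toy_bigrams = list(zip(toy_tokens, toy_tokens[1:]))

toy_unigram_counts = Counter(toy_tokens)
toy_bigram_counts = Counter(toy_bigrams)
n_tokens = len(toy_tokens)
n_bigrams = len(toy_bigrams)

def conditional_probability(a, b):
    # MLE estimate of p(b | a): count of the bigram (a, b) over count of a
    return toy_bigram_counts[(a, b)] / toy_unigram_counts[a]

def pmi(a, b):
    # log2 of the observed bigram probability over the probability expected
    # if a and b occurred independently of each other
    p_ab = toy_bigram_counts[(a, b)] / n_bigrams
    p_a = toy_unigram_counts[a] / n_tokens
    p_b = toy_unigram_counts[b] / n_tokens
    return log2(p_ab / (p_a * p_b))

for pair in [("the", "fields"), ("the", "kings"), ("of", "the")]:
    print(pair, round(conditional_probability(*pair), 3), round(pmi(*pair), 3))
```

A positive value means the pair occurs more often than independence would predict, and a negative value means it occurs less often. Both functions assume the bigram was observed at least once, since `log2(0)` is undefined.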

<hr />


This question contains multiple parts to respond to.

1. Compute the bigram frequencies of all words in our `faustus` corpus. You may use whatever tokenization scheme you think performs the best.
2. For each of the bigrams in that corpus, compute the mutual information of that bigram.
3. Pick a subset of bigrams and print their mutual information values to the notebook.
4. Answer the questions in the free response section.


### Question 6A: Computing mutual information for bigrams in one sentence (10/15 points)

In [14]:
from itertools import islice
from math import log2

## your code for Question 6 goes here
# Function to generate bigrams
def generate_bigrams(tokens):
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

# Tokenize the corpus using spaCy (chosen as the best tokenizer)
tokens = faust_spacy_tokenized

# Generate bigrams
bigrams = generate_bigrams(tokens)

# Count bigram frequencies
bigram_counts = Counter(bigrams)

# Count unigram frequencies
unigram_counts = Counter(tokens)

# Total number of tokens
total_tokens = sum(unigram_counts.values())

# Compute mutual information for each bigram
bigram_mi = {}
for bigram, bigram_count in bigram_counts.items():
    p_bigram = bigram_count / total_tokens
    p_word1 = unigram_counts[bigram[0]] / total_tokens
    p_word2 = unigram_counts[bigram[1]] / total_tokens
    mi = log2(p_bigram / (p_word1 * p_word2))
    bigram_mi[bigram] = mi

# Select a subset of bigrams to display
subset_bigrams = list(islice(bigram_mi.items(), 10))

# Print mutual information values for the subset
for bigram, mi in subset_bigrams:
    print(f"Bigram: {bigram}, Mutual Information: {mi}")

Bigram: ('not', 'marching'), Mutual Information: 7.106723820851059
Bigram: ('marching', 'in'), Mutual Information: 6.619608643528047
Bigram: ('in', 'the'), Mutual Information: 1.749243923944643
Bigram: ('the', 'fields'), Mutual Information: 5.167096438830541
Bigram: ('fields', 'of'), Mutual Information: 5.688733004182867
Bigram: ('of', 'thrasymene'), Mutual Information: 5.688733004182867
Bigram: ('thrasymene', ','), Mutual Information: 3.1089427831560554
Bigram: (',', 'where'), Mutual Information: 2.1089427831560554
Bigram: ('where', 'mars'), Mutual Information: 8.578029539776647
Bigram: ('mars', 'did'), Mutual Information: 8.949998317163606


### Question 6B: Free response (5/15 points)

Characterize the different mutual information values of the subset you are looking at. What values are highest? What values are lowest? Do you think mutual information would be a better statistic to compute than a conditional probability? If so, why?

### <font color='red'>Your written response for question 6B goes here</font>

# Self Evaluation (5 points, added only if at least half of the previous exercises have been attempted)

How do you think you did? Which parts did you struggle with the most? Which parts were easier than you expected?

I think I did well in analyzing the tokenization methods and comparing their outputs. Identifying differences between regular expressions, NLTK, and spaCy was straightforward, especially when looking at how they handled punctuation and contractions. The hardest part was interpreting mutual information values and explaining why certain word pairs had high scores. Understanding the reasoning behind these relationships required more critical thinking than simply comparing token counts. On the other hand, identifying the biggest sources of disagreement between tokenizers was easier than expected because the patterns were clear in the output.