<div style="text-align: right">
    <i>
        LIN 5300<br>
        Spring 2023 <br>
        Jon Rawski <br>
    </i>
</div>


# Submission Guidelines
Please upload your completed notebook file on Canvas in the following format:

LastName_FirstName_HW3.ipynb

# Preliminaries (2 points)

1. Load in all of the packages you will need for this assignment in the cell below. 

  If you load in other packages later in the notebook, be sure to bring them up here. This is good coding practice and will look cleaner for everyone when reading your code.

  You will need the following:

  * To load a plain text file in with the colab interface (by uploading the file to the notebook)
  * To count occurrences of tokens
  * A regular expression tokenizer
  * The NLTK tokenizer for English
  * The spaCy word tokenizer for English


In [13]:
# Load in packages that you will use in this notebook
from pprint import pprint
# from google.colab import drive //doing work in vscode
from collections import Counter
import spacy 
import re
import nltk
nltk.download('punkt_tab')

# put other packages you will use below this line

[nltk_data] Downloading package punkt_tab to C:\Users\Henry
[nltk_data]     Pham\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

# Preliminaries 2 (1 point)

Load in the file called `faustus.txt` in the variable `faust`.

In [14]:
#Import the file here
# a context manager closes the file handle once the text has been read
with open('faustus_clean.txt', 'r', encoding='utf-8') as f:
    faustus = f.read()


# Question 1: 2 points

In this section, we will be comparing different preprocessing strategies. 

First, tokenize the corpus using a simple regEx technique that relies on whitespaces to distinguish words (You can use the function we defined in class)!

Save the resulting list into the variable `faust_regEx`.

In [15]:
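# define a tokenize function using regex: \w+ matches runs of letters, digits, and underscores,
# so whitespace and punctuation both act as separators and are dropped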
def regex_tokenizer(text):
    return re.findall(r'\w+', text)

# define a variable for each token list
faust_regEx = regex_tokenizer(faustus)

# print the first 10 elements as a safety check
print(faust_regEx[:10])

['not', 'marching', 'in', 'the', 'fields', 'of', 'thrasymene', 'where', 'mars', 'did']
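
The prompt describes a tokenizer that relies on whitespace, while the cell above matches runs of word characters. The two are not the same thing, and a minimal sketch can show the difference. The sample string here is a hypothetical one-line excerpt modeled on the opening of the corpus, not the full file.

```python
import re

# Hypothetical sample line, modeled on the opening of the corpus
sample = "where state is overturn'd; nor in the pomp"

# Splitting on whitespace keeps punctuation attached to the neighbouring word
whitespace_tokens = sample.split()

# Matching runs of word characters drops punctuation and splits at the apostrophe
word_char_tokens = re.findall(r"\w+", sample)

print(whitespace_tokens)
print(word_char_tokens)
```

Under the whitespace split, `overturn'd;` survives as one token with its punctuation attached; under `\w+`, it becomes `overturn` and `d`.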


# Question 2: 5 points

Now, we are going to use the [`nltk` `word_tokenize`](https://www.kite.com/python/docs/nltk.word_tokenize) function. You should have loaded nltk above in the very first block. Use `word_tokenize` on the corpus and save the result in a list of strings called `faust_nltk_tokenized`. Use a slice to print the fifth to the tenth elements of the list.

In [16]:
# use nltk's word_tokenize function
# save the output into a variable
faust_nltk_tokenized = nltk.word_tokenize(faustus, language="english", preserve_line=False)

# print the first 10 elements as a safety check
print(faust_nltk_tokenized[:10])

['not', 'marching', 'in', 'the', 'fields', 'of', 'thrasymene', ',', 'where', 'mars']


# Question 3: 5 points

Now, we are going to use the [`spacy`](https://spacy.io/usage/spacy-101) tokenization [function](https://spacy.io/api/tokenizer). 

The output that spacy gives you is more complicated than the output of `nltk`'s `word_tokenize` function, because the `spacy` API takes a string (e.g., "I like cheese") and returns a `doc` object. Within the `doc` object there are `token`s, and each `token` has a `text` object. 

For this question, what you need to do is:

* load the spacy tokenizer with "en" as the default value
* obtain the `doc` object by using the tokenizer over the list `faust_full` 
* instantiate an empty list `faust_spacy_tokenized` and implement a for loop through all of the  `token`s of the obtained doc, storing for each the `token.text` string object into your list.

In [17]:
# use spacy's tokenization features
# had to run python -m spacy download en_core_web_sm in terminal to get this to work
nlp = spacy.load(name='en_core_web_sm')
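# remove '\n' from the text; note that deleting it outright (rather than replacing it with a space)
# joins the last word of each line to the first word of the next
faustus_spacy = faustus.replace('\n', '')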
doc = nlp(faustus_spacy)

# save the output into a variable
faust_spacy_tokenized = [token.text for token in doc]

# print the first 10 elements as a safety check
# initial run: had '\n' in output, so I added a filter to remove it ^
print(faust_spacy_tokenized[:10])



['not', 'marching', 'in', 'the', 'fields', 'of', 'thrasymene', ',', 'where', 'mars']
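
The cell above deletes newline characters before tokenizing. A minimal sketch, using a hypothetical two-line sample rather than the real file, shows what that does to word boundaries and two possible alternatives that keep them. It uses only built-in string methods, so it makes no assumption about how spaCy behaves.

```python
# Hypothetical two-line sample, modeled on the opening of the corpus
sample = "where state is overturn'd;\nnor in the pomp"

# Deleting newlines outright glues the words on either side together
deleted = sample.replace("\n", "")

# Replacing each newline with a space keeps the word boundary
spaced = sample.replace("\n", " ")

# Splitting on any whitespace and re-joining also collapses repeated blanks
normalized = " ".join(sample.split())

print(repr(deleted))
print(repr(spaced))
print(repr(normalized))
```

The first variant contains the string `overturn'd;nor`, with no space for a tokenizer to split on; the other two keep `overturn'd;` and `nor` apart.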


# Question 4: Compare tokenizations (5 points)

Now that we have three tokenizations, we want to compare how similar the tokenizations are. 

In the code cell below, visualize a reasonably large subset of the corpus, tokenized under each of the three approaches above.
Then, in the cell below that, explain how the outputs of these tokenizations differ, and what you can infer from the output about the ways the tokenization techniques differ (If necessary, you can modify the number of tokens you visualize until you converge on a good sample for the comparison). 

What do you think are the strengths and weaknesses of each tokenization approach? Do you think one of the tokenizations is better than another? Can you think of a way you would test which one is better?

### Question 4A: Code (2/5)

In [25]:
# Put your code here
# take a large subset of the corpus from faustus_clean.txt
faust_subset = faustus[:100000]
# tokenize under the 3 approaches done before

faust_regEx_subset = regex_tokenizer(faust_subset)

faust_nltk_subset = nltk.word_tokenize(faust_subset, language="english", preserve_line=False)

faust_spacy = faust_subset.replace('\n', '')
doc_subset = nlp(faust_spacy)
faust_spacy_subset = [token.text for token in doc_subset]

print("RegEx Tokenizer Output:\t", faust_regEx_subset[:50])
print("\nNLTK Tokenizer Output:\t", faust_nltk_subset[:50])
print("\nSpaCy Tokenizer Output:\t", faust_spacy_subset[:50])

RegEx Tokenizer Output:	 ['not', 'marching', 'in', 'the', 'fields', 'of', 'thrasymene', 'where', 'mars', 'did', 'mate', 'the', 'warlike', 'carthagens', 'nor', 'sporting', 'in', 'the', 'dalliance', 'of', 'love', 'in', 'courts', 'of', 'kings', 'where', 'state', 'is', 'overturn', 'd', 'nor', 'in', 'the', 'pomp', 'of', 'proud', 'audacious', 'deeds', 'intends', 'our', 'muse', 'to', 'vaunt', 'her', 'heavenly', 'verse', 'only', 'this', 'gentles', 'we']

NLTK Tokenizer Output:	 ['not', 'marching', 'in', 'the', 'fields', 'of', 'thrasymene', ',', 'where', 'mars', 'did', 'mate', 'the', 'warlike', 'carthagens', ';', 'nor', 'sporting', 'in', 'the', 'dalliance', 'of', 'love', ',', 'in', 'courts', 'of', 'kings', 'where', 'state', 'is', 'overturn', "'d", ';', 'nor', 'in', 'the', 'pomp', 'of', 'proud', 'audacious', 'deeds', ',', 'intends', 'our', 'muse', 'to', 'vaunt', 'her', 'heavenly']

SpaCy Tokenizer Output:	 ['not', 'marching', 'in', 'the', 'fields', 'of', 'thrasymene', ',', 'where', 'mars', 'did'

### Question 4B: Free response (3/5)

The three methods of tokenization differ in how they handle punctuation, contractions, and line structure. The **RegEx tokenizer** drops punctuation entirely and splits contractions like `"overturn'd"` into `"overturn"` and `"d"`, so it is fast but loses information. The **NLTK tokenizer** keeps punctuation as separate tokens and splits the contraction into `"overturn"` and `"'d"`, which preserves more structure without deeper linguistic analysis. The **spaCy tokenizer** also keeps punctuation, but in this run it produced fused tokens such as `"overturn'd;nor"`; these most likely come from the preprocessing step, which deletes `\n` without inserting a space and so joins the last word of one line to the first word of the next, rather than from spaCy itself. **RegEx** is adequate for straightforward word counting, **NLTK** balances structure and simplicity, and **spaCy** offers the richest linguistic processing but is more sensitive to how the input is prepared. Which one is best depends on the NLP task: regex for basic word retrieval, NLTK for structured text, and spaCy for deeper linguistic processing.
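
One possible way to make this comparison less impressionistic is to align two token lists and print only the stretches where they disagree. The sketch below uses `difflib.SequenceMatcher` from the standard library on short token lists copied from the RegEx and NLTK outputs above; the helper name `token_differences` is made up for this illustration.

```python
from difflib import SequenceMatcher

# Token lists copied from the RegEx and NLTK outputs shown earlier
regex_tokens = ['state', 'is', 'overturn', 'd', 'nor', 'in', 'the', 'pomp']
nltk_tokens = ['state', 'is', 'overturn', "'d", ';', 'nor', 'in', 'the', 'pomp']

def token_differences(a, b):
    """Return (tokens_in_a, tokens_in_b) for every stretch where a and b disagree."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

differences = token_differences(regex_tokens, nltk_tokens)
print(differences)
```

Run over the full token lists, the same helper would list every point of disagreement, which could then be counted by type (punctuation, contractions, fused words).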

# Question 5: Tabulating word counts under different algorithms (10 points)

Now that you have compared and contrasted different tokenization algorithms, consider the effect that tokenization can have on our ability to characterize a corpus as a whole. 

Load in the `Counter` module and extract counts of all of the words under each of the three tokenization schemes. Look at the top 5 most frequent (using the `.most_common()` method) and the top 10 least frequent (hint: use negative indices) words. In our data, what appear to be the biggest sources of disagreement? Do these confirm or disconfirm your hypotheses in the previous question? How or how not? 

### Question 5A: Code (5/10)

In [27]:
## Your code for question 5A
# Count the tokens under each of the three tokenization schemes, then show
# the 5 most frequent and the 10 least frequent tokens for each scheme.

# create a Counter object for each tokenized list
faust_regEx_counter = Counter(faust_regEx_subset)
faust_nltk_counter = Counter(faust_nltk_subset)
faust_spacy_counter = Counter(faust_spacy_subset)

# print the most common words for each tokenized list
print("Most Common Words:")
print("RegEx Tokenizer Output:\t", faust_regEx_counter.most_common(5))
print("\nNLTK Tokenizer Output:\t", faust_nltk_counter.most_common(5))
print("\nSpaCy Tokenizer Output:\t", faust_spacy_counter.most_common(5))

print("\n\n")

# print the least common words for each tokenized list
print("Least Common Words:")
print("RegEx Tokenizer Output:\t", faust_regEx_counter.most_common()[:-11:-1])
print("\nNLTK Tokenizer Output:\t", faust_nltk_counter.most_common()[:-11:-1])
print("\nSpaCy Tokenizer Output:\t", faust_spacy_counter.most_common()[:-11:-1])



Most Common Words:
RegEx Tokenizer Output:	 [('and', 509), ('the', 501), ('i', 400), ('of', 338), ('to', 333)]

NLTK Tokenizer Output:	 [(',', 1962), ('.', 562), ('and', 504), ('the', 500), ('i', 396)]

SpaCy Tokenizer Output:	 [(',', 1949), ('the', 468), ('and', 449), ('of', 326), ('i', 323)]



Least Common Words:
RegEx Tokenizer Output:	 [('opus', 1), ('auctor', 1), ('diem', 1), ('hora', 1), ('permits', 1), ('practise', 1), ('entice', 1), ('deepness', 1), ('unlawful', 1), ('exhort', 1)]

NLTK Tokenizer Output:	 [('opus', 1), ('auctor', 1), ('diem', 1), ('hora', 1), ('permits', 1), ('practise', 1), ('entice', 1), ('deepness', 1), ('unlawful', 1), ('exhort', 1)]

SpaCy Tokenizer Output:	 [('opus', 1), ('auctor', 1), ('terminat', 1), ('diem', 1), ('hora', 1), ('permits.terminat', 1), ('practise', 1), ('witsto', 1), ('entice', 1), ('deepness', 1)]
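
The slice `[:-11:-1]` used for the least common words can be hard to read at first. A minimal sketch on toy data (the string below is made up for illustration) shows the same pattern with a smaller number.

```python
from collections import Counter

counts = Counter("aaaabbbccd")   # toy data: a=4, b=3, c=2, d=1
ranked = counts.most_common()     # sorted from most to least frequent

# A negative step walks the list backwards from the end;
# stopping before index -3 yields the last two items, rarest first
least_two = ranked[:-3:-1]

print(ranked)
print(least_two)
```

In general, `ranked[:-(n + 1):-1]` returns the `n` least frequent items, starting with the rarest.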


### Question 5B: Free response (5/10)

The biggest sources of disagreement among the tokenizers come from the treatment of punctuation, contractions, and fused tokens. The **RegEx tokenizer** discards punctuation, so function words like "and" and "the" top its frequency list. **NLTK** and **spaCy**, by contrast, keep punctuation marks as separate tokens, which puts the comma (`","`) at the top of both lists with nearly 2,000 occurrences. The **spaCy** counts also contain fused tokens such as `"permits.terminat"` and `"witsto"`, each occurring once; these are consistent with newlines being deleted before tokenization, which joins words across line breaks and may also explain why some counts are lower (for example, "i" is counted 323 times versus 396 under NLTK). These outcomes largely **confirm** the earlier hypothesis that **RegEx is simple but lossy and NLTK balances structure and readability**, while the spaCy results reflect the preprocessing as much as the tokenizer. The differences demonstrate how tokenization and preprocessing choices can change the statistics that downstream NLP tasks rely on.
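
Because punctuation tokens dominate two of the three frequency lists, one option is to filter them out before comparing rankings. The sketch below is an assumption about how that could be done, not part of the assignment: the token list is toy data in the style of the NLTK output, and `is_word` is a hypothetical helper.

```python
from collections import Counter

# Toy token list in the style of the NLTK output above
tokens = ['not', 'marching', ',', 'where', 'the', 'state', ';', 'nor', ',', 'the', '.']

def is_word(token):
    """Treat a token as a word if it contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

all_counts = Counter(tokens)
word_counts = Counter(t for t in tokens if is_word(t))

print(all_counts.most_common(3))
print(word_counts.most_common(3))
```

With this filter, clitics such as `'d` are still counted as words, since they contain a letter; a stricter rule would be needed to drop them.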

# Question 6: Tabulating pointwise mutual information (15 points)

Mutual information is a computation that is very similar to computing a conditional probability. Recall that computing a conditional probability, $p(B \mid A) = \frac{p(A \cap B)}{p(A)}$, requires knowing two probabilities. The first, $p(A \cap B)$, is the probability of observing $A$ and $B$ at the same time (so observing the bigram as a type). The second, $p(A)$, is the probability of observing $A$ across all contexts.

Recall that we have used MLE to approximate all of these by their frequencies in a corpus. For example, $p(A)$ can be approximated by:

$\large p(A) \approx \frac{count(A)}{\sum_{w \in V}count(w)}$

Mutual information is very similar, but requires dividing the co-occurrence statistic by two probabilities $p(A)$ and $p(B)$.


$\large MI = \frac{p(A \cap B)}{p(A) \cdot p(B)}$
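
The formula can be worked through on a toy corpus. The sketch below is a minimal illustration under stated assumptions: the sentence is made up, probabilities are plain MLE estimates, and the bigram probability is normalized by the number of bigrams (other implementations may normalize by the number of tokens instead).

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already tokenized
tokens = "the cat sat on the mat the cat ran".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

n_unigrams = sum(unigram_counts.values())   # 9 tokens
n_bigrams = sum(bigram_counts.values())     # 8 adjacent pairs

def mutual_information(a, b):
    """MI = p(A and B) / (p(A) * p(B)), with every probability estimated by MLE."""
    p_ab = bigram_counts[(a, b)] / n_bigrams
    p_a = unigram_counts[a] / n_unigrams
    p_b = unigram_counts[b] / n_unigrams
    return p_ab / (p_a * p_b)

mi_the_cat = mutual_information("the", "cat")
mi_sat_on = mutual_information("sat", "on")

# The base-2 logarithm of the ratio is what is usually called pointwise MI
print(mi_the_cat, math.log2(mi_the_cat))
print(mi_sat_on, math.log2(mi_sat_on))
```

Here "sat on" scores higher than "the cat" even though it occurs only once, because both of its words are rare; this sensitivity to low counts is worth keeping in mind when reading the top-ranked bigrams.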

<hr />


This question contains multiple parts to respond to.

1. Compute the bigram frequencies of all words in our `faustus` corpus. You may use whatever tokenization scheme you think performs the best.
2. For each of the bigrams in that corpus, compute the mutual information of that bigram.
3. Pick a subset of bigrams and print their mutual information value to the notebook.
4. Answer the questions in the free response section.


### Question 6A: Computing mutual information for bigrams in one sentence (10/15 points)

In [None]:
## your code for Question 6 goes here
# 1. Compute the bigram frequencies of all words in our `faustus` corpus. You may use whatever tokenization scheme you think performs the best.
# nltk.bigrams returns a generator, so count it right away to get the frequencies
faust_bigram_counts = Counter(nltk.bigrams(faust_nltk_subset))

# 2. For each of the bigrams in that corpus, compute the mutual information of that bigram
# NLTK's collocation finder scores every bigram; its pmi measure is the base-2 log of p(A, B) / (p(A) * p(B))
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

bigram_finder = BigramCollocationFinder.from_words(faust_nltk_subset)
# score_ngrams returns (bigram, score) pairs sorted from highest to lowest score
mi = bigram_finder.score_ngrams(BigramAssocMeasures.pmi)

# 3. Pick a subset of bigrams and print their mutual information value to the notebook.
print("Mutual Information of Bigrams:")
print(mi[:10])

# 4. Answer the questions in the free response section.



Mutual Information of Bigrams:
[(('[', 'reads'), 14.234443239689758), (('acherontis', 'propitii'), 14.234443239689758), (('afric', 'shore'), 14.234443239689758), (('almain', 'rutters'), 14.234443239689758), (('ancient', 'gentlewoman'), 14.234443239689758), (('aquam', 'quam'), 14.234443239689758), (('aquatani', 'spiritus'), 14.234443239689758), (('ardentis', 'monarcha'), 14.234443239689758), (("as't", 'passes'), 14.234443239689758), (('astronomy', 'graven'), 14.234443239689758)]


### Question 6B: Free response (5/15 points)

The highest mutual information values are for word pairs like "afric shore" and "ancient gentlewoman," which shows that these words strongly depend on each other in this sample. All ten top pairs share the same score (about 14.23), which suggests that each word in these pairs occurs only once, and only together; the measure therefore rewards rare pairs, and not every top-scoring pair is equally meaningful. Lower values come from words that are frequent but have no particular association, such as pairs involving "the" or "and." Mutual information is more informative than conditional probability here because it corrects for how common each word is on its own, rather than just measuring how often one word follows the other. Mutual information is therefore useful for finding distinctive phrases in a text and not just common word patterns.
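
The contrast between conditional probability and mutual information can be made concrete on a toy corpus. The sketch below is illustrative only: the sentence is made up, the function names are hypothetical, and the bigram probability is normalized by the number of bigrams.

```python
from collections import Counter

# Toy corpus (hypothetical), already tokenized
tokens = "the end of the world and the end of the play".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_unigrams = len(tokens)
n_bigrams = len(tokens) - 1

def conditional_probability(a, b):
    """p(B | A): how often b follows a, out of all occurrences of a."""
    return bigram_counts[(a, b)] / unigram_counts[a]

def mutual_information(a, b):
    """p(A and B) / (p(A) * p(B)): also corrects for how common b is."""
    p_ab = bigram_counts[(a, b)] / n_bigrams
    p_a = unigram_counts[a] / n_unigrams
    p_b = unigram_counts[b] / n_unigrams
    return p_ab / (p_a * p_b)

cond_end_of = conditional_probability("end", "of")
cond_of_the = conditional_probability("of", "the")
mi_end_of = mutual_information("end", "of")
mi_of_the = mutual_information("of", "the")

print(cond_end_of, cond_of_the)
print(mi_end_of, mi_of_the)
```

Both bigrams have the same conditional probability in this toy corpus, but "end of" gets the higher mutual information because "the" is frequent everywhere and so is less informative as a second word.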


# Self Evaluation (5 points, added only if at least half of the previous exercises have been attempted)

How do you think you did? Which parts did you struggle with the most? Which parts were easier than you expected?

I think I worked well at contrasting the tokenization methods and their results. It was simple to contrast differences among regular expressions, NLTK, and spaCy when looking at how they handled punctuation and contractions. The most difficult part was the interpretation of mutual information values and giving explanations for why certain pairs of words had high scores. More critical thinking was required than a simple comparison of token counts to know why these relations were present. On the other hand, identifying the biggest points of contention between tokenizers was easier than expected because the patterns stood out in the output.