# Question 1 (5 points)
Load in all of the packages you will need for this assignment in the cell below. 

If you load in other packages later in the notebook, be sure to move those imports up here. This is good coding practice and makes your code cleaner for everyone who reads it.

You will need the following:

* To load a plain text file (`abstracts.tsv`) in with the colab interface (either local to your drive or by uploading the file to the notebook)
* The NLTK tokenizer for English
* The spaCy word tokenizer for English

In [None]:
# Load in packages that you will use in this notebook
from pprint import pprint
from google.colab import drive
from google.colab import files

# put other packages you will use below this line


# Question 2 (1 point)

Load in the file called `abstracts.tsv` in the `data/` subdirectory of this folder into this notebook.

Uncomment one of the two blocks below.

Then, edit the line that you uncommented to load in abstracts.tsv.

Note that using the `files` command requires you to do a bit more work to load the file in Question 3. Be sure to check previous notebooks.

If you do this in Jupyter on your own machine, please load in the file in the same manner without these imports.

In [None]:
# Block 1: File stored on your google drive
# drive.mount('/content/drive', force_remount=True)
# if you go this route, then your code to open abstracts.tsv goes here
# 

# Block 2: File stored on your computer that you upload to the notebook directly
# uploaded = files.upload()

# Question 3: 3 points

In this section, we will be comparing different preprocessing strategies. For this question, you should first preview the data by looking at the first 5 lines. Use [a slice](https://stackoverflow.com/questions/509211/understanding-slice-notation) to print the first five elements from the array.

Then, separate all of the abstracts on all whitespace. Store this in an array of string arrays called `split_abstracts`.
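As a minimal sketch of the two operations described above (slicing and whitespace splitting), here is a toy example. The list `toy_lines` and the variable names are hypothetical stand-ins, not the contents of `abstracts.tsv`.

```python
# Toy stand-in for the abstracts (hypothetical data, not from abstracts.tsv).
toy_lines = [
    "Deep learning\tfor parsing.",
    "A  study of   tokenizers",
    "Short\nabstract here",
    "Fourth one",
    "Fifth one",
    "Sixth one",
]

# A slice [start:stop] excludes the stop index, so [:5] keeps elements 0 through 4.
preview = toy_lines[:5]
print(preview)

# str.split() with no argument splits on any run of whitespace
# (spaces, tabs, newlines) and never produces empty strings.
split_toy = [line.split() for line in toy_lines]
print(split_toy[1])
```

Note that punctuation stays attached to the neighboring word (`"parsing."`), which is one of the things worth watching for when you compare tokenizers later.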

In [None]:
# preview data (print the first five lines)

# split every sentence on whitespace and save array

# Question 4: 4 points

Now, we are going to use the `nltk` `word_tokenize` function. You should have loaded this above in the very first block. Use `word_tokenize` on all the abstracts and store this in an array of string arrays called `nltk_tokenized_abstracts`. Use a slice to print the fifth to the tenth elements of the array.

In [None]:
# use nltk's word_tokenize function over all of the abstracts

# save the output into a variable

# Question 5: 5 points

Now, we are going to use the `spacy` tokenization function. The output that spacy gives you is more complicated than the output of `nltk`'s `word_tokenize` function, because the `spacy` API takes a string (e.g., "I like cheese") and returns a `Doc` object. Within the `Doc` object there are `Token`s, and each `Token` has a `text` attribute that holds the token's string.

For this question, what you need to do is implement another loop through all of the abstracts, and store a list (array) of all of the token _strings_ from each `Token` object. If you were paying attention during the tokenization lecture this should be easy.

Store all of these tokenizations into an array of string arrays called `spacy_tokenized_abstracts`.

In [None]:
# use spacy's tokenization features

# save the output into a variable

# Question 6: Compare tokenizations (8 points)

Now that we have three tokenizations (`split_abstracts`, `nltk_tokenized_abstracts`, and `spacy_tokenized_abstracts`), we want to compare how similar the tokenizations are. Pick a slice of 5 abstracts with any start and end indices. Demonstrate that the total number of abstracts that you selected is 5 by printing the length of that subset of abstracts.

Tokenize each of the 5 abstracts according to each of the three approaches above, and print their output in the code cell below. Then, in the cell below that, explain how these tokenizations differ. What are the strengths and weaknesses of each tokenization approach? Do you think one of the tokenizations is better than another? Can you think of a way you would test which one is better? Refer to justification from the readings where appropriate.

### Question 6A: Code (3/8)

In [None]:
# select a slice of 5 abstracts from the documents

# print the length of this slice to show that it is five abstracts

# Hint: Get the tokenizations from all 3 tokenization schemes by using the same slice indices you chose above

# print the outputs of each of these 3 tokenizations for all 5 abstracts

### Question 6B: Free response (5/8)

$\color{red}{\text{Put your description of the above output here. Double click in this cell to edit it. Delete this line when you are done.}}$

# Question 7: Tabulating word counts under different algorithms (8 points)

Now that you have compared and contrasted different tokenization algorithms, consider the effect that tokenization can have on our ability to characterize a corpus as a whole. 

Load in the `Counter` class (from the `collections` module) and extract counts of all of the words under each of the three tokenization schemes. Look at the top 5 most frequent (using the `.most_common()` method) and the top 10 least frequent (hint: use negative indices) words. In our data, what appear to be the biggest sources of disagreement? Do these confirm or disconfirm your hypotheses in the previous question? How or how not?
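A minimal sketch of counting tokens and reading off both ends of the frequency ranking is shown here. The toy data in `toy_tokenized` is hypothetical; the only library calls used are `collections.Counter` and its `most_common()` method.

```python
from collections import Counter

# Hypothetical tokenized documents standing in for one tokenization scheme.
toy_tokenized = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
]

# Flatten the list of token lists and count every token.
counts = Counter(token for doc in toy_tokenized for token in doc)

# most_common(n) returns the n highest-count (token, count) pairs.
top_two = counts.most_common(2)

# most_common() with no argument returns every pair, sorted by descending
# count, so a negative slice picks out the least frequent tokens.
ranked = counts.most_common()
bottom_three = ranked[-3:]

print(top_two)
print(bottom_three)
```

Many tokens in a real corpus occur exactly once, so the "least frequent" tail is usually a set of ties rather than a strict ranking.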

### Question 7A: Code (3/8)

In [None]:
## Your code for question 7 goes here

### Question 7B: Free response (5/8)

$\color{red}{\text{Put your description of the above output here. Double click in this cell to edit it. Delete this line when you are done.}}$

# Question 8: Tabulating pointwise mutual information under different tokenization schemes: 8 points

Mutual information is a computation that is very similar to computing a conditional probability. Recall that computing a conditional probability, defined below, requires knowing two probabilities. The first, $p(A \cap B)$, is the probability of observing $A$ and $B$ at the same time. The second, $p(A)$, is the probability of observing $A$ across all contexts.

Recall that we can approximate all of these by their frequencies in a corpus. For example, $p(A)$ can be approximated by:

<center> $\large p(A) \approx \frac{count(A)}{\sum_{w \in V}count(w)}$ </center>

A conditional probability like $p(B | A)$ is a measure that allows us to estimate how many of our observations of $B$ occur having already seen $A$.

<center>$\large p(B | A) = \frac{p(A \cap B)}{p(A)}$</center>

Mutual information is very similar, but requires dividing the co-occurrence statistic by the product of two probabilities, $p(A)$ and $p(B)$.

<center>$\large MI = \frac{p(A \cap B)}{p(A) \cdot p(B)}$</center>
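The ratio above can be sketched on a toy corpus as follows. This is an illustration under stated assumptions, not the required solution: `toy_docs` and the helper `mutual_information` are hypothetical, and $p(A \cap B)$ is estimated here as the bigram count divided by the total number of bigrams, which is one reasonable choice among several. Pointwise mutual information is commonly reported as the logarithm of this ratio; the sketch follows the unlogged form written above.

```python
from collections import Counter

# Hypothetical tokenized documents.
toy_docs = [
    ["new", "york", "is", "big"],
    ["new", "york", "is", "new"],
]

# Unigram counts over the whole toy corpus.
unigram_counts = Counter(tok for doc in toy_docs for tok in doc)

# Bigram counts: zip(doc, doc[1:]) pairs each token with its successor,
# so bigrams never cross a document boundary.
bigram_counts = Counter(pair for doc in toy_docs for pair in zip(doc, doc[1:]))

total_unigrams = sum(unigram_counts.values())
total_bigrams = sum(bigram_counts.values())


def mutual_information(a, b):
    """Ratio p(a, b) / (p(a) * p(b)) using relative-frequency estimates."""
    # Assumption: the joint probability is estimated from bigram frequencies.
    p_ab = bigram_counts[(a, b)] / total_bigrams
    p_a = unigram_counts[a] / total_unigrams
    p_b = unigram_counts[b] / total_unigrams
    return p_ab / (p_a * p_b)


# Print the value for every bigram in the first toy document.
for a, b in zip(toy_docs[0], toy_docs[0][1:]):
    print((a, b), round(mutual_information(a, b), 3))
```

A value above 1 means the pair co-occurs more often than it would if the two words were independent; rare words can produce very large values from a single co-occurrence.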

<hr />

This question contains multiple parts to respond to.

1. Compute the bigram frequencies of all words in our `abstracts.tsv` corpus. You may use whatever tokenization scheme you think performs the best.
2. Pick one of your tokenized abstracts from Question 5 that you think sounds interesting.
3. For each of the bigrams in that abstract, compute the mutual information of that bigram and print the bigram and its mutual information value to the notebook.
4. Answer the questions in the free response section.


### Question 8A: Computing mutual information for bigrams in one abstract (5 points)

In [None]:
## your code for question 8 goes here

### Question 8B: Free response (3 points)

Characterize the different mutual information values of the abstract you used. What values are highest? What values are lowest? When do you think mutual information would be a better statistic to compute than a conditional probability?

### <font color='red'>Your written response for question 8B goes here</font>

# Submission guidelines (1 point)

Please upload your completed notebook file to UBLearns in the following format:

Lastname\_Firstname\_HW2.ipynb

e.g., Smith\_John\_HW2.ipynb.