# Question 1 (5 points)
Load in all of the packages you will need for this assignment in the cell below. 

If you load in other packages later in the notebook, be sure to bring them up here. This is good coding practice and will look cleaner for everyone when reading your code.

You will need the following:

* To load a plain text file (`abstracts.tsv`) in with the colab interface (either local to your drive or by uploading the file to the notebook)
* The NLTK tokenizer for English
* The spaCy word tokenizer for English

In [2]:
# Load in packages that you will use in this notebook
! pip install iteration_utilities

import nltk
nltk.download('punkt')

import spacy
import itertools

from pprint import pprint
from google.colab import drive
from google.colab import files
from nltk import word_tokenize
from collections import Counter
from iteration_utilities import Iterable
from itertools import chain
from nltk.collocations import BigramCollocationFinder


# put other packages you will use below this line
import os


Collecting iteration_utilities
  Downloading iteration_utilities-0.11.0-cp37-cp37m-manylinux2014_x86_64.whl (283 kB)
     |████████████████████████████████| 283 kB 5.2 MB/s
Installing collected packages: iteration-utilities
Successfully installed iteration-utilities-0.11.0
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# Question 2 (1 point)

Load in the file called `abstracts.tsv` in the `data/` subdirectory of this folder into this notebook.

Uncomment one of the two blocks below.

Then, edit the line that you uncommented to load in abstracts.tsv.

Note that using the `files` command requires you to do a bit more work to load the file in Question 3. Be sure to check previous notebooks.

If you do this in Jupyter on your own machine, please load in the file in the same manner without these imports.

In [3]:
uploaded = files.upload()


Saving abstracts.tsv to abstracts.tsv


In [4]:
#abstracts = uploaded['abstracts.tsv'].decode('utf-8')
with open ('abstracts.tsv', 'r') as handle:
  abstracts = handle.read().split('\n')


In [None]:
abstracts[0:5]

# Question 3: 3 points

In this section, we will be comparing different preprocessing strategies. For this question, you should first preview the data by looking at the first 5 lines. Use [a slice](https://stackoverflow.com/questions/509211/understanding-slice-notation) to print the first five elements from the array.

Then, separate all of the abstracts on all whitespace. Store this in an array of string arrays called `split_abstracts`.

In [5]:
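# preview data (print the first five lines)
# a bare expression is only displayed when it is the last line of a cell, so print it
print(abstracts[0:5])
# split every abstract on whitespace and save the resulting list of token lists
split_abstract = []
for abstract in abstracts:
  split_abstract.append(abstract.split())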


In [None]:
split_abstract[0:5]
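
As a small illustration of what splitting on whitespace does, the sketch below uses a made-up sample string (not a line from `abstracts.tsv`). With no argument, `str.split()` collapses runs of spaces, tabs, and newlines, but it leaves punctuation attached to the neighbouring word.

```python
# Hypothetical sample text; not taken from abstracts.tsv.
sample = "Offensive language detection (OLD)\thas received  attention."

# With no argument, str.split() splits on any run of whitespace
# (spaces, tabs, newlines) and never yields empty strings.
tokens = sample.split()
print(tokens)

# Splitting on a single space behaves differently: the tab stays inside
# a token and the doubled space produces an empty string.
single_space_tokens = sample.split(" ")
print(single_space_tokens)
```

Note that `(OLD)` and `attention.` keep their brackets and full stop, which is the main weakness of this approach as a tokenizer.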

# Question 4: 4 points

Now, we are going to use the `nltk` `word_tokenize` function. You should have loaded this above in the very first block. Use `word_tokenize` on all the abstracts and store this in an array of string arrays called `nltk_tokenized_abstracts`. Use a slice to print the fifth to the tenth elements of the array.

In [7]:
# use nltk's word_tokenize function over all of the abstracts
nltk_tokenized_abstracts = []

for abstract in abstracts:
  nltk_tokenized_abstracts.append(word_tokenize(abstract))

In [None]:
# the fifth to the tenth elements are at indices 4 through 9
nltk_tokenized_abstracts[4:10]

# Question 5: 5 points

Now, we are going to use the `spacy` tokenization function. The output that spacy gives you is more complicated than the output of `nltk`'s `word_tokenize` function, because the `spacy` API takes a string (e.g., "I like cheese") and returns a `Doc` object. Within the `Doc` object there are `Token`s, and each `Token` has a `text` attribute that holds the token string.

For this question, what you need to do is implement another loop through all of the abstracts, and store a list (array) of all of the token _strings_ from each `Token` object. If you were paying attention during the tokenization lecture this should be easy.

Store all of these tokenizations into an array of string arrays called `spacy_tokenized_abstracts`.

In [9]:
spacy_model = spacy.load('en_core_web_sm')
tokenizer = spacy_model.tokenizer

# save the output into a variable
spacy_tokenized_abstract = []
for iterate_abs in abstracts:
  spacy_abstract = tokenizer(iterate_abs)
  temp = []
  for words in spacy_abstract:
    temp.append(words.text)
  spacy_tokenized_abstract.append(temp)


In [10]:

spacy_tokenized_abstract[0:5]

[['Offensive',
  'language',
  'detection',
  '(',
  'OLD',
  ')',
  'has',
  'received',
  'increasing',
  'attention',
  'due',
  'to',
  'its',
  'societal',
  'impact',
  '.',
  'Recent',
  'work',
  'shows',
  'that',
  'bidirectional',
  'transformer',
  'based',
  'methods',
  'obtain',
  'impressive',
  'performance',
  'on',
  'OLD',
  '.',
  'However',
  ',',
  'such',
  'methods',
  'usually',
  'rely',
  'on',
  'large',
  '-',
  'scale',
  'well',
  '-',
  'labeled',
  'OLD',
  'datasets',
  'for',
  'model',
  'training',
  '.',
  'To',
  'address',
  'the',
  'issue',
  'of',
  'data',
  '/',
  'label',
  'scarcity',
  'in',
  'OLD',
  ',',
  'in',
  'this',
  'paper',
  ',',
  'we',
  'propose',
  'a',
  'simple',
  'yet',
  'effective',
  'domain',
  'adaptation',
  'approach',
  'to',
  'train',
  'bidirectional',
  'transformers',
  '.',
  'Our',
  'approach',
  'introduces',
  'domain',
  'adaptation',
  '(',
  'DA',
  ')',
  'training',
  'procedures',
  'to',
  'A

# Question 6: Compare tokenizations (8 points)

Now that we have three tokenizations (`split_abstracts`, `nltk_tokenized_abstracts`, and `spacy_tokenized_abstracts`), we want to compare how similar the tokenizations are. Pick a slice of 5 abstracts with any start and end indices. Demonstrate that the total number of abstracts that you selected is 5 by printing the length of that subset of abstracts.

Tokenize each of the 5 abstracts according to each of the three approaches above, and print their output in the code cell below. Then, in the cell below that, explain how these tokenizations differ. What are the strengths and weaknesses of each tokenization approach? Do you think one of the tokenizations is better than another? Can you think of a way you would test which one is better? Refer to justification from the readings where appropriate.

### Question 6A: Code (3/8)

In [None]:
# select a slice of 5 abstracts from the documents
abstracts_list_1 = abstracts[5:10]
# print the length of this slice to show that it is five abstracts
print("Length::", len(abstracts_list_1))
# Hint: Get the tokenizations from all 3 tokenization schemes by using the random indices in Hint 1
# using split()
split_slice = split_abstract[5:10]
# using nltk
nltk_slice = nltk_tokenized_abstracts[5:10]
# using spacy
spacy_slice = spacy_tokenized_abstract[5:10]
# print the outputs of each of these 3 tokenizations for all 5 abstracts
print("split_slice", split_slice)
print("nltk_slice", nltk_slice)
print("spacy_slice", spacy_slice)

### Question 6B: Free response (5/8)

Using `split()`

**Strengths:**
1. It separates text on whitespace.
2. It is the simplest technique, and any delimiter can be used instead of whitespace.

**Weakness:**
1. It keeps sentence-ending punctuation attached to the token instead of separating it.

Using NLTK:

NLTK is one of the best-known NLP libraries in Python. It can be used for tasks such as tokenizing, parsing, and lemmatization.

**Strengths:**

1. It supports many languages.

2. It handles text as plain strings.

**Weaknesses:**

1. It does not support word vectors.
2. NLTK does not apply any semantic analysis during sentence tokenization.
3. Using NLTK on a large dataset leads to slow execution.

Using spaCy:

Instead of working on strings, spaCy handles all data as objects.


**Strengths:**

1. It performs better than NLTK.
2. It supports word vectors.
3. It is faster than NLTK.

**Weaknesses:**

1. It is less flexible than NLTK.
2. It supports fewer languages than NLTK.

In my view, tokenization with spaCy is better than the other two techniques. As the results show, spaCy also splits the abstracts on brackets and dashes. Even though the outputs of NLTK and spaCy are quite similar, spaCy gives a more fine-grained view of each abstract by dividing it into smaller tokens. Using spaCy, we can analyze the data more efficiently.
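
One possible way to put a number on how much two tokenizations disagree is to compare their sets of token types. The sketch below is only an illustration: `jaccard` is a hypothetical helper, and the three token lists are hand-written to mimic the behaviour seen in the outputs above, not produced by running the tokenizers. It measures disagreement, not which tokenizer is better. Judging quality would still need a gold-standard tokenization or a downstream task.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard overlap between the token types of two tokenizations."""
    types_a, types_b = set(tokens_a), set(tokens_b)
    if not types_a and not types_b:
        return 1.0  # two empty tokenizations agree trivially
    return len(types_a & types_b) / len(types_a | types_b)

# Hand-written toy tokenizations of the same phrase (assumed, for illustration).
split_tokens = ["large-scale", "datasets", "(OLD)."]
nltk_like = ["large-scale", "datasets", "(", "OLD", ")", "."]
spacy_like = ["large", "-", "scale", "datasets", "(", "OLD", ")", "."]

print("split vs nltk-like:", round(jaccard(split_tokens, nltk_like), 3))
print("nltk-like vs spacy-like:", round(jaccard(nltk_like, spacy_like), 3))

# Token types that only the spaCy-like tokenization produces.
only_spacy = sorted(set(spacy_like) - set(nltk_like))
print("only in spacy-like:", only_spacy)
```

On this toy input the NLTK-like and spaCy-like lists overlap more with each other than either does with the whitespace split, and the only difference between them is the hyphenated word.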


# Question 7: Tabulating word counts under different algorithms (8 points)

Now that you have compared and contrasted different tokenization algorithms, consider the effect that tokenization can have on our ability to characterize a corpus as a whole. 

Load in the `Counter` class and extract counts of all of the words under each of the three tokenization schemes. Look at the top 5 most frequent (using the `.most_common()` method) and the top 10 least frequent (hint: use negative indices) words. In our data, what appear to be the biggest sources of disagreement? Do these confirm or disconfirm your hypotheses in the previous question? How or how not? 
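
A minimal sketch of the counting pattern, using a made-up three-document corpus (`toy_tokenized` is hypothetical and stands in for any of the three tokenizations):

```python
from collections import Counter
from itertools import chain

# Toy tokenized corpus (hypothetical): one list of tokens per document.
toy_tokenized = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]

# chain.from_iterable flattens the list of lists into one token stream.
counts = Counter(chain.from_iterable(toy_tokenized))

# The n most frequent tokens, as (token, count) pairs.
top_two = counts.most_common(2)
print(top_two)

# With no argument, most_common() returns every token sorted by descending
# count, so a negative slice picks out the least frequent ones.
bottom_two = counts.most_common()[-2:]
print(bottom_two)
```

Tokens with equal counts are tied, so which of them land in the last few positions depends on the order in which they were first seen.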

### Question 7A: Code (3/8)

In [12]:
# On split_abstract: flatten the per-abstract token lists and count every token
split_counter = Counter(chain.from_iterable(split_abstract))
split_abstract_most_common = split_counter.most_common(5)
print("split_abstract_most_common", split_abstract_most_common)

# most_common() with no argument is sorted by descending count,
# so the last 10 entries are the least frequent tokens
split_abstract_least_common = split_counter.most_common()[-10:]
print("split_abstract_least_common", split_abstract_least_common)

split_abstract_most_common [('the', 162905), ('of', 115534), ('and', 100664), ('a', 83250), ('to', 79238)]
split_abstract_least_common [('Sponsorship', 1), ('Equipment', 1), ('(Ronald', 1), ('Borden);', 1), ('26-27,', 1), ('Wisbey);', 1), ('{MIT}', 1), ('(Jonathan', 1), ('Allen);', 1), ('Bailey);', 1)]


In [13]:
# on nltk
# join all abstracts into one string and tokenize it in a single call
abstracts_list_count = ' '.join(abstracts)

# word_tokenize returns a list of token strings
abstract_tokenized_into_words = word_tokenize(abstracts_list_count)
nltk_counter = Counter(abstract_tokenized_into_words)

nltk_abstract_most_common = nltk_counter.most_common(5)
print("nltk_abstract_most_common", nltk_abstract_most_common)

nltk_abstract_least_common = nltk_counter.most_common()[-10:]
print("nltk_abstract_least_common", nltk_abstract_least_common)


nltk_abstract_most_common [('the', 163172), (',', 160068), ('.', 158239), ('of', 115654), ('and', 101170)]
nltk_abstract_least_common [('Organizational', 1), ('Guyford', 1), ('Stever', 1), ('Buyers', 1), ('Dake', 1), ('Gaddy', 1), ('Sponsorship', 1), ('Borden', 1), ('26-27', 1), ('Wisbey', 1)]


In [14]:
# on spacy
flat=[]
for i in spacy_tokenized_abstract:
  for j in i:
    flat.append(j)

counter_method_output = Counter(flat)

spacy_abstract_most_common = counter_method_output.most_common(5)
print("spacy_abstract_most_common ", spacy_abstract_most_common)
spacy_abstract_least_common = counter_method_output.most_common()[-10:]
print("spacy_abstract_least_common" , spacy_abstract_least_common)

spacy_abstract_most_common  [('the', 168208), (',', 159384), ('.', 158688), ('of', 121903), ('-', 104316)]
spacy_abstract_least_common [('Shooman', 1), ('Organizational', 1), ('Guyford', 1), ('Stever', 1), ('Buyers', 1), ('Dake', 1), ('Gaddy', 1), ('Sponsorship', 1), ('Borden', 1), ('Wisbey', 1)]


### Question 7B: Free response (5/8)

In the previous question, I said that spaCy is better than the other two tokenizations. The `split()` results show that the abstracts are split on whitespace only: tokens such as '(Ronald' and 'Borden);' among the least common words still carry their punctuation, which is not helpful for further analysis. The outputs generated by NLTK and spaCy are very similar, but spaCy tokenizes in finer detail. For example, NLTK keeps '26-27' as a single token among its least frequent words, whereas spaCy divides '26-27' into three parts: '26', '-', and '27'. So I stand by the hypothesis I made in the previous question, which is spaCy > NLTK > split().

# Question 8: Tabulating pointwise mutual information under different tokenization schemes: 8 points

Mutual information is a computation that is very similar to computing a conditional probability. Recall that computing a conditional probability, defined below, requires knowing two probabilities. The first, $p(A \cap B)$, is the probability of observing $A$ and $B$ at the same time. The second, $p(A)$, is the probability of observing $A$ across all contexts.

Recall that we can approximate all of these by their frequencies in a corpus. For example, $p(A)$ can be approximated by:

<center> $\large p(A) \approx \frac{count(A)}{\sum_{w \in V}count(w)}$ </center>

A conditional probability like $p(B | A)$ is a measure that allows us to estimate how many of our observations of $B$ occur having already seen $A$.

<center>$\large p(B | A) = \frac{p(A \cap B)}{p(A)}$</center>

Mutual information is very similar, but requires dividing the co-occurrence statistic by two probabilities, $p(A)$ and $p(B)$.

<center>$\large MI = \frac{p(A \cap B)}{p(A) \cdot p(B)}$</center>
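
The three formulas above can be checked with a small worked example. The counts below are invented purely to keep the arithmetic easy, and the same denominator is used for unigrams and bigrams as a simplifying assumption. Note that many references define pointwise mutual information as the base-2 logarithm of the ratio shown above; the sketch prints both so the two conventions can be compared.

```python
import math

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
total = 1000     # sum of count(w) over the vocabulary (also used for bigrams)
count_a = 20     # count(A)
count_b = 10     # count(B)
count_ab = 5     # number of times A and B were observed together

p_a = count_a / total      # 0.02
p_b = count_b / total      # 0.01
p_ab = count_ab / total    # 0.005

# Conditional probability: p(B | A) = p(A and B) / p(A)
cond_b_given_a = p_ab / p_a

# Mutual information as defined in this notebook: p(A and B) / (p(A) * p(B))
mi_ratio = p_ab / (p_a * p_b)

# Log-scaled variant used by many references (an addition, not the notebook's formula).
pmi_log2 = math.log2(mi_ratio)

print("p(B|A)   =", round(cond_b_given_a, 4))
print("MI ratio =", round(mi_ratio, 4))
print("log2 PMI =", round(pmi_log2, 4))
```

A ratio of 1 (log value 0) would mean A and B co-occur exactly as often as expected if they were independent; larger values mean they co-occur more often than chance.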

<hr />

This question contains multiple parts to respond to.

1. Compute the bigram frequencies of all words in our `abstracts.tsv` corpus. You may use whatever tokenization scheme you think performs the best.
2. Pick one of your tokenized abstracts from Question 5 that you think sounds interesting.
3. For each of the bigrams in that abstract, compute the mutual information of that bigram and print the bigram and its mutual information value to the notebook.
4. Answer the questions in the free response section.
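
The first three steps can be sketched end to end as follows. This is one possible layout under stated assumptions, not the required solution: `bigram_mi` and `toy_docs` are hypothetical names, bigrams are counted within each document only, and the chosen document is assumed to come from the corpus so that none of its counts are zero.

```python
from collections import Counter

def bigram_mi(tokenized_docs, chosen_doc):
    """Return (bigram, MI) pairs for each bigram in chosen_doc,
    using unigram and bigram counts from the whole corpus."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in tokenized_docs:
        unigram_counts.update(tokens)
        # zip pairs each token with its successor inside one document,
        # so no bigram spans a document boundary.
        bigram_counts.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigram_counts.values())
    total_bigrams = sum(bigram_counts.values())

    results = []
    for bigram in zip(chosen_doc, chosen_doc[1:]):
        word_a, word_b = bigram
        p_ab = bigram_counts[bigram] / total_bigrams
        p_a = unigram_counts[word_a] / total_unigrams
        p_b = unigram_counts[word_b] / total_unigrams
        results.append((bigram, p_ab / (p_a * p_b)))
    return results

# Toy corpus (hypothetical) standing in for the tokenized abstracts.
toy_docs = [
    ["new", "york", "is", "big"],
    ["new", "york", "is", "old"],
    ["the", "city", "is", "big"],
]

mi_values = bigram_mi(toy_docs, toy_docs[0])
for bigram, mi in mi_values:
    print(bigram, round(mi, 2))
```

In the toy corpus, "new york" scores higher than "is big" because "new" and "york" never appear apart, while "is" appears in many contexts.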


### Question 8A: Computing mutual information for bigrams in one sentence (5 points)

In [24]:
# Code for getting the bigram pairs and their counts.
# Flatten the spaCy tokenizations into one token stream. Because the stream is
# flat, the last token of one abstract is paired with the first token of the next.
flat_spacy_tokenized_abstract=[]
for i in spacy_tokenized_abstract:
  for j in i:
    flat_spacy_tokenized_abstract.append(j)

# list_of_counts[w1][w2] is the number of times w2 directly follows w1
list_of_counts = {}
length = len(flat_spacy_tokenized_abstract)

for iteration, word in enumerate(flat_spacy_tokenized_abstract):
  if iteration != length-1:
    if word not in list_of_counts:
      list_of_counts[word] = {}
    next_word = flat_spacy_tokenized_abstract[iteration+1]
    if next_word in list_of_counts[word]:
      list_of_counts[word][next_word] = list_of_counts[word][next_word] + 1
    else:
      list_of_counts[word][next_word] = 1
  
#print(list_of_counts['language'])

In [25]:
# For calculating the probabilities and MI of bigrams from abstracts[5:10].
# Total number of bigram occurrences in the whole corpus.
total_pairs = 0
for key in list_of_counts:
  total_pairs = total_pairs + sum(list_of_counts[key].values())

# Corpus-wide unigram counts, so that unigram and bigram probabilities are
# estimated from the same sample (the bigram counts cover the whole corpus).
corpus_word_counts = Counter(chain.from_iterable(spacy_tokenized_abstract))
corpus_total_words = sum(corpus_word_counts.values())

def calculations_1(word_a, word_b, choosen_abstract):
  # choosen_abstract is kept so existing calls still work; the estimates
  # below come from the whole corpus rather than from this subset.
  number_counts_a = corpus_word_counts[word_a]
  number_counts_b = corpus_word_counts[word_b]
  # .get avoids a KeyError when word_a never starts a bigram
  number_counts_a_and_b = list_of_counts.get(word_a, {}).get(word_b, 0)

  prob_a = number_counts_a/corpus_total_words
  prob_b = number_counts_b/corpus_total_words
  prob_a_inter_b = number_counts_a_and_b/total_pairs
  cond_prob = prob_a_inter_b/(prob_a)
  MI = prob_a_inter_b/(prob_a*prob_b)

  print(word_a + ' = ' + str(number_counts_a))
  print(word_b + ' = ' + str(number_counts_b))
  print('number of counts = ' + str(number_counts_a_and_b))
  #print('total_pairs = ' + str(total_pairs))
  #print('prob_a = ' + str(prob_a))
  #print('prob_b = ' + str(prob_b))
  #print('prob_a_inter_b = ' + str(prob_a_inter_b))
  #print('cond_prob = ' + str(cond_prob))
  print('MI = ' + str(MI))

In [None]:
# For calculating MI on each bigram in abstracts[5:10]:
flat_spacy_tokenized_abstract=[]
for i in spacy_tokenized_abstract[5:10]:
  for j in i:
    flat_spacy_tokenized_abstract.append(j)

# A bigram is a pair of adjacent tokens, so pair each token with the one
# that directly follows it instead of with every later token.
for word_a, word_b in zip(flat_spacy_tokenized_abstract, flat_spacy_tokenized_abstract[1:]):
  calculations_1(word_a, word_b, flat_spacy_tokenized_abstract)

The code below computes MI for all abstracts.

In [None]:
# For calculating probability & MI on the entire abstract data.
# The cell that defines calculations() (below) must be run before this one.
flat_spacy_tokenized_abstract=[]
for i in spacy_tokenized_abstract:
  for j in i:
    flat_spacy_tokenized_abstract.append(j)

# Visit every distinct bigram once: list_of_counts[word_a] holds the words
# that directly follow word_a anywhere in the corpus.
for word_a, followers in list_of_counts.items():
  for word_b in followers:
    calculations(word_a, word_b, flat_spacy_tokenized_abstract)

In [16]:
# For calculating the probabilities and MI of a bigram on the entire abstract data.
# Total number of bigram occurrences in the whole corpus.
total_pairs = 0
for key in list_of_counts:
  total_pairs = total_pairs + sum(list_of_counts[key].values())

def calculations(word_a, word_b, choosen_abstract):
  # list.count() rescans the whole token list on every call, so this is slow
  # on the full corpus
  number_counts_a = choosen_abstract.count(word_a)
  number_counts_b = choosen_abstract.count(word_b)
  # .get avoids a KeyError when word_a never starts a bigram
  number_counts_a_and_b = list_of_counts.get(word_a, {}).get(word_b, 0)
  total_words = len(choosen_abstract)

  prob_a = number_counts_a/total_words
  prob_b = number_counts_b/total_words
  prob_a_inter_b = number_counts_a_and_b/total_pairs
  #cond_prob = prob_a_inter_b/(prob_a)
  MI = prob_a_inter_b/(prob_a*prob_b)

  #print(word_a + ' = ' + str(number_counts_a))
  #print(word_b + ' = ' + str(number_counts_b))
  #print('number of counts = ' + str(number_counts_a_and_b))
  #print('total_pairs = ' + str(total_pairs))
  #print('prob_a = ' + str(prob_a))
  #print('prob_b = ' + str(prob_b))
  #print('prob_a_inter_b = ' + str(prob_a_inter_b))
  #print('cond_prob = ' + str(cond_prob))
  #print('MI = ' + str(MI))

  # return the value so callers can collect or print it
  return MI

### Question 8B: Free response (3 points)

Characterize the different mutual information values of the sentence you used. What values are highest? What values are lowest? When do you think mutual information would be a better statistic to compute than a conditional probability?


Mutual information measures the mutual dependence between two variables, whereas conditional probability is the probability of an event occurring given that another event has already occurred. As the results show, some of the bigrams have high mutual information. High mutual information means that seeing one word greatly reduces the uncertainty about the other. That is what we want, because it gives a better idea of which word pairs in the text are strongly associated.
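
To make the contrast with conditional probability concrete, here is a small sketch with invented counts (the words, counts, and the helper names `cond_prob` and `mi_ratio` are all hypothetical). The two bigrams have the same conditional probability, yet very different mutual information, because mutual information also divides by how common the second word is on its own.

```python
# Hypothetical counts from an imaginary corpus of 10,000 tokens.
N = 10_000
word_counts = {"of": 300, "the": 600, "deep": 10, "learning": 20}
pair_counts = {("of", "the"): 90, ("deep", "learning"): 3}

def cond_prob(a, b):
    """p(b | a) = p(a and b) / p(a), both estimated over N."""
    return (pair_counts[(a, b)] / N) / (word_counts[a] / N)

def mi_ratio(a, b):
    """p(a and b) / (p(a) * p(b)), the ratio used in this notebook."""
    p_ab = pair_counts[(a, b)] / N
    return p_ab / ((word_counts[a] / N) * (word_counts[b] / N))

for pair in pair_counts:
    print(pair, "p(b|a) =", round(cond_prob(*pair), 3), "MI =", round(mi_ratio(*pair), 1))
```

Under these made-up counts, "of the" and "deep learning" are indistinguishable by conditional probability, but mutual information ranks "deep learning" far higher because "the" follows almost everything. This suggests mutual information is the more useful statistic when the goal is to find word pairs that belong together rather than pairs that are frequent simply because one word is frequent.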

# Submission guidelines (1 point)

Please upload your completed notebook file to UBLearns in the following format:

Lastname\_Firstname\_HW2.ipynb

e.g., Smith\_John\_HW2.ipynb.