Name : Vinayak Renu Nair

Panther ID : 002553736

1) Create a cosine similarity matrix for all of Shakespeare’s works found in the provided
file. This will result in a 42 by 42 matrix with the cosine similarity between each of his
works. In other words, calculate the document-wise cosine similarity between all of
Shakespeare’s works. Use TF-IDF for this. Note that you can use the cosine
similarity function in scikit-learn or implement your own, but no other library/package is
allowed.


In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel
from sklearn.datasets import fetch_20newsgroups
import os, random
import numpy as np
import nltk
nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [52]:
path = '/content/drive/MyDrive/shakespeares-works_TXT_FolgerShakespeare'

In [53]:
txt = os.listdir(path)

In [54]:
data = []

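for f in txt:
  # Read each work as UTF-8 text; the with-block closes the file handle promptly
  with open(os.path.join(path, f), 'r', encoding='utf-8') as fh:
    data.append(fh.read())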

In [55]:
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(data)

In [56]:
cosine_sim = cosine_similarity(matrix,matrix)

In [57]:
print(cosine_sim)

[[1.         0.77494139 0.70637941 ... 0.81097319 0.59536997 0.66267568]
 [0.77494139 1.         0.69990711 ... 0.83172303 0.59251467 0.65190502]
 [0.70637941 0.69990711 1.         ... 0.7302251  0.56777768 0.62809744]
 ...
 [0.81097319 0.83172303 0.7302251  ... 1.         0.61221239 0.67887912]
 [0.59536997 0.59251467 0.56777768 ... 0.61221239 1.         0.52581905]
 [0.66267568 0.65190502 0.62809744 ... 0.67887912 0.52581905 1.        ]]
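The task also allows a hand-written cosine similarity in place of the scikit-learn function. A minimal NumPy sketch, assuming a dense document-term matrix (a sparse TF-IDF matrix would first need `.toarray()`); the name `cosine_similarity_matrix` and the toy matrix are hypothetical and not part of the notebook:

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of a dense 2-D array."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = X / norms         # scale every row to unit length
    return unit @ unit.T     # dot products of unit vectors are cosines

# Toy document-term matrix (3 documents, 3 terms), for illustration only
toy = np.array([[1.0, 0.0, 1.0],
                [2.0, 0.0, 2.0],
                [0.0, 3.0, 0.0]])
toy_sim = cosine_similarity_matrix(toy)
```

In the toy matrix, rows 0 and 1 point in the same direction, so their similarity is 1, while row 2 shares no terms with them and scores 0. Because `TfidfVectorizer` L2-normalizes each row by default, the cosine similarity of its output reduces to a plain dot product.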


2) Write a function that takes the previous matrix and a number n as parameters (nothing
else will be accepted) and returns the top n similar works. Use the function to output the
top 10 similar works.

In [58]:
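def similar(cosine_sim, N):
  n_docs = cosine_sim.shape[0]
  flat = cosine_sim.flatten()  # flatten() returns a copy, so the caller's matrix is not modified
  flat[::n_docs + 1] = 0       # zero the diagonal entries so self-similarity is never ranked

  # The matrix is symmetric, so every pair appears twice: take the top 2*N entries
  index_1d = flat.argsort()[-(2*N):]
  x_idx, y_idx = np.unravel_index(index_1d, cosine_sim.shape)

  # argsort is ascending; flip so the most similar pairs come first
  x_idx = np.flip(x_idx)
  y_idx = np.flip(y_idx)

  similar_works = []
  # Step by 2 to skip the mirrored (y, x) copy of each pair; this assumes the
  # two copies sit next to each other in the sorted order
  for i in range(0, len(x_idx), 2):
    x = x_idx[i]
    y = y_idx[i]
    if x != y:
      similar_works.append((x, y, cosine_sim[x][y]))

  return similar_works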

In [59]:
similar_works = similar(cosine_sim,10)

In [60]:
print('Similar works :')
for n in similar_works:
  print(txt[n[0]],' and ',txt[n[1]],' - ',n[2])

Similar works :
lucrece_TXT_FolgerShakespeare.txt  and  venus-and-adonis_TXT_FolgerShakespeare.txt  -  0.9175430138884014
henry-iv-part-2_TXT_FolgerShakespeare.txt  and  henry-iv-part-1_TXT_FolgerShakespeare.txt  -  0.9076165568984115
shakespeares-sonnets_TXT_FolgerShakespeare.txt  and  lucrece_TXT_FolgerShakespeare.txt  -  0.867768874887524
henry-vi-part-1_TXT_FolgerShakespeare.txt  and  henry-vi-part-2_TXT_FolgerShakespeare.txt  -  0.8437915980298397
richard-ii_TXT_FolgerShakespeare.txt  and  richard-iii_TXT_FolgerShakespeare.txt  -  0.8389525010107518
shakespeares-sonnets_TXT_FolgerShakespeare.txt  and  venus-and-adonis_TXT_FolgerShakespeare.txt  -  0.8381452108269782
henry-v_TXT_FolgerShakespeare.txt  and  henry-viii_TXT_FolgerShakespeare.txt  -  0.8363042342905282
henry-vi-part-3_TXT_FolgerShakespeare.txt  and  henry-vi-part-2_TXT_FolgerShakespeare.txt  -  0.8354926670186524
henry-vi-part-2_TXT_FolgerShakespeare.txt  and  richard-ii_TXT_FolgerShakespeare.txt  -  0.8343507770652859
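The function above depends on each mirrored pair landing next to its twin after sorting. A sketch of an alternative that ranks only the entries strictly above the diagonal, so each pair is considered once and the input is never modified; `top_n_pairs` and the toy matrix are hypothetical names for illustration, not the notebook's method:

```python
import numpy as np

def top_n_pairs(sim, n):
    """Return the n most similar distinct pairs as (i, j, score), with i < j."""
    sim = np.asarray(sim)
    rows, cols = np.triu_indices(sim.shape[0], k=1)  # strictly above the diagonal
    scores = sim[rows, cols]
    order = np.argsort(scores)[::-1][:n]  # positions of the n largest scores
    return [(int(rows[k]), int(cols[k]), float(scores[k])) for k in order]

# Toy symmetric similarity matrix, for illustration only
toy_sim = np.array([[1.0, 0.9, 0.2],
                    [0.9, 1.0, 0.5],
                    [0.2, 0.5, 1.0]])
top_pairs = top_n_pairs(toy_sim, 2)
```

Each returned tuple has the same shape as the ones printed above, so the same loop over `txt` could be used to display the file names. For very large matrices the index arrays grow with the square of the number of documents, so memory use is worth checking first.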

3) Using the code from the Language Models II class, train two simple language models
using all of the files (together) in shakespeares-works_TXT_FolgerShakespeare.zip. One
model should be trained using bigrams, the other using trigrams.

In [61]:
tokens = []
for doc in data:
  tokens.append(nltk.word_tokenize(doc))

Using Bigrams :

In [62]:
from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict

In [63]:
# Create a placeholder for model
model_bigram = defaultdict(lambda: defaultdict(lambda: 0))

In [64]:
# Count frequency of co-occurrence (each element of tokens holds one whole work)
for sentence in tokens:
    for w1, w2 in bigrams(sentence, pad_right=True, pad_left=True):
        model_bigram[w1][w2] += 1

In [65]:
# Let's transform the counts to probabilities
for w1 in model_bigram:
    total_count = float(sum(model_bigram[w1].values()))
    for w2 in model_bigram[w1]:
        model_bigram[w1][w2] /= total_count

In [66]:
dict(model_bigram["King"])

{'!': 0.011428571428571429,
 "''": 0.0016326530612244899,
 "'s": 0.09551020408163265,
 ',': 0.11836734693877551,
 '--': 0.0024489795918367346,
 '.': 0.08897959183673469,
 ':': 0.004081632653061225,
 ';': 0.004081632653061225,
 '?': 0.0236734693877551,
 'And': 0.006530612244897959,
 'Be': 0.0008163265306122449,
 'Bolingbroke': 0.0008163265306122449,
 'Brothers': 0.0008163265306122449,
 'Cambyses': 0.0008163265306122449,
 'Capaneus': 0.0008163265306122449,
 'Cerberus': 0.0008163265306122449,
 'Claudius': 0.0008163265306122449,
 'Clothair': 0.0008163265306122449,
 'Cophetua': 0.0024489795918367346,
 'Digest': 0.0008163265306122449,
 'Dismiss': 0.0008163265306122449,
 'Does': 0.0008163265306122449,
 'Duncan': 0.0024489795918367346,
 'Edward': 0.027755102040816326,
 'Five': 0.0008163265306122449,
 'For': 0.0008163265306122449,
 'From': 0.0008163265306122449,
 'Gorboduc': 0.0008163265306122449,
 'Had': 0.0016326530612244899,
 'Hal': 0.0008163265306122449,
 'Hamlet': 0.0032653061224489797,
 ...

Using Trigrams :

In [67]:
# Create a placeholder for model
model_trigram = defaultdict(lambda: defaultdict(lambda: 0))

In [68]:
# Count frequency of co-occurrence (each element of tokens holds one whole work)
for sentence in tokens:
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model_trigram[(w1, w2)][w3] += 1

In [69]:
# Let's transform the counts to probabilities
for w1_w2 in model_trigram:
    total_count = float(sum(model_trigram[w1_w2].values()))
    for w3 in model_trigram[w1_w2]:
        model_trigram[w1_w2][w3] /= total_count

In [70]:
dict(model_trigram["the","battle"])

{"'s": 0.06666666666666667,
 ',': 0.23333333333333334,
 '.': 0.1,
 'Which': 0.03333333333333333,
 'and': 0.06666666666666667,
 'came': 0.03333333333333333,
 'ends': 0.03333333333333333,
 'fly': 0.03333333333333333,
 "ha'": 0.03333333333333333,
 'let': 0.03333333333333333,
 'of': 0.03333333333333333,
 'range': 0.03333333333333333,
 'sought': 0.03333333333333333,
 'still': 0.03333333333333333,
 'than': 0.03333333333333333,
 'think': 0.06666666666666667,
 'thus': 0.06666666666666667,
 'to': 0.03333333333333333}
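The bigram and trigram cells follow the same two steps: count what follows each context, then divide by the context total. A pure-Python sketch that generalizes this to any n, assuming the same `None` padding on both sides; `train_ngram_model` and the toy tokens are hypothetical, and unlike the notebook's bigram model, every context here is a tuple (a one-word tuple for bigrams):

```python
from collections import Counter, defaultdict

def train_ngram_model(token_lists, n):
    """Map each (n-1)-token context to a dict of next-token probabilities."""
    counts = defaultdict(Counter)
    for toks in token_lists:
        # Pad both ends with None, mirroring pad_left=True / pad_right=True
        padded = [None] * (n - 1) + list(toks) + [None] * (n - 1)
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            counts[context][padded[i + n - 1]] += 1

    # Turn the raw counts into conditional probabilities per context
    model = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        model[context] = {word: c / total for word, c in counter.items()}
    return model

# Toy token lists, for illustration only
toy_tokens = [["the", "king", "is", "dead"], ["the", "king", "lives"]]
toy_bigram = train_ngram_model(toy_tokens, 2)
toy_trigram = train_ngram_model(toy_tokens, 3)
```

In the toy data, "king" is followed once by "is" and once by "lives", so both models assign each continuation a probability of 0.5 after the context ending in "king".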

4) Write a function that takes the following three parameters: model, list of start words,
number of sentences to generate. This function should return the sentences generated
as a list. DO NOT print anything to the screen from within the function. Use this function
to generate 10 sentences with the bigram model from the previous question, and 5
sentences with the trigram model from the previous question. 

In [71]:
def new_sentences(model, start_words, num_sentences):
  sentences = []

  # Trigram models are keyed by (w1, w2) tuples, bigram models by a single word
  context_size = 2 if isinstance(next(iter(model)), tuple) else 1

  for _ in range(num_sentences):
    # Copy the start words so each sentence starts fresh and the caller's list is not modified
    text = list(start_words)
    sentence_finished = False

    while not sentence_finished:
      context = tuple(text[-2:]) if context_size == 2 else text[-1]
      candidates = model.get(context)
      if not candidates:
        break  # unseen context: there is nothing to sample from

      # select a random probability threshold
      r = random.random()
      accumulator = .0

      for word, probability in candidates.items():
        accumulator += probability
        # pick the first word whose cumulative probability reaches the threshold
        if accumulator >= r:
          text.append(word)
          break

      # the None padding marks the end of the token sequence
      if text[-context_size:] == [None] * context_size:
        sentence_finished = True

    sentences.append(' '.join([t for t in text if t]))

  return sentences

In [72]:
tri_sentences = new_sentences(model_trigram,['the','battle'],5)

In [74]:
bi_sentences = new_sentences(model_bigram,['battle'],10)
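The threshold-and-accumulator loop inside `new_sentences` is a weighted random draw from the next-word distribution. A sketch of the same draw using the standard library's `random.choices`, with a seeded generator so runs are repeatable; `sample_next` and the toy distribution are hypothetical names, not part of the notebook:

```python
import random

def sample_next(distribution, rng=random):
    """Draw one key from a {token: probability} dict, weighted by probability."""
    words = list(distribution)
    weights = [distribution[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
toy_distribution = {"is": 0.5, "lives": 0.5}
draws = [sample_next(toy_distribution, rng) for _ in range(5)]
```

Passing a dedicated `random.Random` instance keeps the seeding local, which can make generated sentences easier to reproduce when comparing the bigram and trigram models.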

Bonus (20 points): Using the same methodology from questions 1 and 2, create a similarity matrix for the 20 newsgroups corpus and find the top 5 similar newsgroups.

In [79]:
newdata = fetch_20newsgroups()

In [80]:
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(newdata.data)

In [81]:
cosine_sim = cosine_similarity(matrix, matrix)
print(cosine_sim)

[[1.         0.04405974 0.11017033 ... 0.04433678 0.04457107 0.0329325 ]
 [0.04405974 1.         0.06242113 ... 0.07373268 0.06959306 0.02439956]
 [0.11017033 0.06242113 1.         ... 0.07569182 0.06214891 0.02357985]
 ...
 [0.04433678 0.07373268 0.07569182 ... 1.         0.02909961 0.00716986]
 [0.04457107 0.06959306 0.06214891 ... 0.02909961 1.         0.02428174]
 [0.0329325  0.02439956 0.02357985 ... 0.00716986 0.02428174 1.        ]]


In [82]:
similar_works = similar(cosine_sim,10)

In [83]:
print(similar_works)

[(10777, 2002, 1.0000000000000002), (5392, 14, 1.0000000000000002), (9989, 800, 1.0), (4495, 4772, 0.9997809868890166), (8665, 4772, 0.9996809512865921), (981, 9920, 0.999622508768835), (8665, 4495, 0.9995302511028284), (4515, 4495, 0.9992512742959921), (4515, 4772, 0.9990101094182835), (4515, 8665, 0.9989914519713721)]
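The pairs above are individual posts (the indices run into the thousands), and scores at or near 1.0 plausibly come from duplicate or near-duplicate posts. If the question is read as asking for similarity between the 20 newsgroups themselves, one possible approach is to average the document-level similarities for every pair of groups. A NumPy sketch under that assumption; `group_similarity` and the toy data are hypothetical:

```python
import numpy as np

def group_similarity(doc_sim, labels, n_groups):
    """Average document-level similarity for every pair of groups."""
    labels = np.asarray(labels)
    # One-hot membership matrix: member[d, g] is 1 if document d is in group g
    member = np.zeros((len(labels), n_groups))
    member[np.arange(len(labels)), labels] = 1.0
    sums = member.T @ np.asarray(doc_sim) @ member  # summed similarity per group pair
    counts = member.sum(axis=0)
    pair_counts = np.outer(counts, counts)          # document pairs per group pair
    return sums / np.maximum(pair_counts, 1.0)      # guard against empty groups

# Toy data: four documents in two groups, for illustration only
toy_doc_sim = np.array([[1.0, 0.8, 0.1, 0.3],
                        [0.8, 1.0, 0.2, 0.4],
                        [0.1, 0.2, 1.0, 0.6],
                        [0.3, 0.4, 0.6, 1.0]])
toy_labels = [0, 0, 1, 1]
toy_group_sim = group_similarity(toy_doc_sim, toy_labels, 2)
```

In the notebook, the labels would come from `newdata.target` and the group names from `newdata.target_names`; the resulting 20 by 20 matrix could then be passed to `similar` with n = 5. Within-group averages on the diagonal include each document's similarity to itself, which does not affect the ranking of distinct group pairs.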
