# Assignment 4: Word Embeddings for Shakespearean English

## Part 1

In [144]:
# Importing necessary libraries
import nltk
import pandas as pd
import regex as re
from string import punctuation
from nltk import tokenize
from nltk.corpus import stopwords
from autocorrect import Speller

In [145]:
# Loading the shakespeare plays into a single variable and then converting them into lowercase
text = nltk.corpus.gutenberg.raw(['shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'shakespeare-caesar.txt']).lower()

In [146]:
# Tokenizing the text into sentences and sentences into words
text_words = [ tokenize.word_tokenize(sent) for sent in tokenize.sent_tokenize(text)]

In [147]:
print(text_words[:5])

[['[', 'the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare', '1599', ']', 'actus', 'primus', '.'], ['scoena', 'prima', '.'], ['enter', 'barnardo', 'and', 'francisco', 'two', 'centinels', '.'], ['barnardo', '.'], ['who', "'s", 'there', '?']]


In [148]:
# making a list of all the stopwords and adding our own as well
punctuations = list(punctuation)
stop_words = stopwords.words('english')
all_stopwords = punctuations + stop_words + ["'s", "mr.", "miss.", "mrs."]

In [149]:
filtered_words = [
    [word for word in sent if word not in all_stopwords]
    for sent in text_words
]

In [150]:
print(filtered_words[:5])

[['tragedie', 'hamlet', 'william', 'shakespeare', '1599', 'actus', 'primus'], ['scoena', 'prima'], ['enter', 'barnardo', 'francisco', 'two', 'centinels'], ['barnardo'], []]


In [151]:
from nltk.stem import WordNetLemmatizer

In [152]:
lemmatizer = WordNetLemmatizer()
spell = Speller(fast = True)

In [153]:
# Defining a function that lemmatizes each word in a sentence and then spell-corrects it
def lemmatizee(text):
    return [spell(lemmatizer.lemmatize(word)) for word in text] 

lemmatized_text = [lemmatizee(text) for text in filtered_words if text]

In [154]:
print(lemmatized_text[:10])

[['tragedies', 'hamlet', 'william', 'shakespeare', '1599', 'acts', 'primes'], ['scoena', 'prima'], ['enter', 'barnardo', 'francisco', 'two', 'centinels'], ['barnardo'], ['fran'], ['nay', 'answer', 'stand', 'unfold', 'self', 'bar'], ['long', 'like', 'king', 'fran'], ['barnardo'], ['bar'], ['fran']]
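The stopword filtering above, and the `if text` guard that drops sentences left empty by it, can be sketched without NLTK on a toy example. The stopword set and sentences below are made up for illustration, and `filter_sentences` is a hypothetical helper; the sketch uses a `set` rather than a list because membership tests are faster on sets.

```python
# Toy stand-ins for the NLTK stopword list and the tokenized plays (made-up data)
toy_stopwords = {"who", "'s", "there", "?", ".", "and", "the"}
toy_sentences = [
    ["enter", "barnardo", "and", "francisco", "."],
    ["who", "'s", "there", "?"],
]

def filter_sentences(sentences, stopwords):
    """Remove stopwords from each sentence and drop sentences left empty."""
    filtered = ([w for w in sent if w not in stopwords] for sent in sentences)
    return [sent for sent in filtered if sent]

print(filter_sentences(toy_sentences, toy_stopwords))
```

The second toy sentence consists only of stopwords and punctuation, so it disappears entirely, just as `['who', "'s", 'there', '?']` became `[]` in the output above.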


## Part 2

In [155]:
from gensim.models import Word2Vec

In [156]:
# Initializing a CBOW model
cbow_model = Word2Vec(
    sentences=lemmatized_text,   
    vector_size=300,                 # Size of word embeddings (dimensionality)
    window=5,                        # Context window size
    min_count=3,                     # Ignores words with total frequency lower than this                          
    epochs=10,                   
    workers=6                        # Use 6 worker threads for parallelization
)


In [157]:
# Access the word vectors
vocab = cbow_model.wv

# Get the 20 most frequent words in the vocabulary along with their counts
print("20 Most Frequent Words and Their Counts:")
for word in list(vocab.key_to_index)[:20]:
    count = vocab.get_vecattr(word, "count")
    print(f" word: {word} count: {count}")

20 Most Frequent Words and Their Counts:
 word: 'd count: 585
 word: have count: 444
 word: ham count: 337
 word: thou count: 306
 word: lord count: 306
 word: shall count: 300
 word: come count: 283
 word: king count: 248
 word: cesar count: 231
 word: enter count: 230
 word: here count: 225
 word: let count: 220
 word: good count: 217
 word: mac count: 205
 word: thy count: 202
 word: like count: 200
 word: one count: 182
 word: make count: 181
 word: v count: 178
 word: thee count: 174
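Slicing the first 20 keys works because gensim, by default, orders the vocabulary by descending frequency. A minimal sketch of the same idea with `collections.Counter`, including the effect of a `min_count` threshold; the corpus is made up and `frequent_words` is a hypothetical helper, not part of gensim.

```python
from collections import Counter

# Made-up tokenized sentences, for illustration only
toy_corpus = [
    ["lord", "hamlet", "come"],
    ["come", "good", "lord"],
    ["lord", "king", "come"],
]

def frequent_words(sentences, min_count=1, topn=20):
    """Count tokens, drop those below min_count, return the topn by frequency."""
    counts = Counter(word for sent in sentences for word in sent)
    kept = [(w, c) for w, c in counts.most_common() if c >= min_count]
    return kept[:topn]

for word, count in frequent_words(toy_corpus, min_count=2):
    print(f"word: {word} count: {count}")
```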


In [158]:
# Initializing a Skip gram model
skip_gram_model = Word2Vec(
    sentences=lemmatized_text,   
    vector_size=300,
    window=5,                        
    min_count=3, 
    sg = 1,                                      
    epochs=10,                   
    workers=6                        
)
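The only change from the CBOW cell is `sg=1`. Roughly, CBOW predicts a center word from its surrounding context, while skip-gram predicts each context word from the center word. The sketch below enumerates the training examples each scheme would see for a toy sentence. It ignores details of gensim's actual training (such as subsampling and negative sampling), and the helper names are hypothetical.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of index i, excluding the center word."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """(context words -> center word) examples, one per position."""
    return [(context_of(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """(center word -> context word) examples, one per context word."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_of(tokens, i, window)]

sentence = ["enter", "barnardo", "francisco", "two", "centinels"]
print(cbow_examples(sentence, window=1)[1])
print(skipgram_examples(sentence, window=1)[:3])
```

Skip-gram produces one example per (center, context) pair, so it sees more updates per sentence than CBOW, which is one reason it tends to train more slowly.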


In [159]:
# # Importing the pre trained glove model and then converting it to vectors
# from gensim.scripts.glove2word2vec import glove2word2vec

# glove_input_file = 'glove.6B.300d.txt'
# out_file = './glove.6B.300d.w2vformat.txt'

# glove2word2vec(glove_input_file, out_file)



(400000, 300)

In [160]:
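from gensim.models import KeyedVectors

# Path to the GloVe file already converted to word2vec format (the conversion cell is commented out)
out_file = './glove.6B.300d.w2vformat.txt'

# Load GloVe 300-dimensional embeddings
glove_model = KeyedVectors.load_word2vec_format(out_file, binary=False)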

## Pretrained GloVe Model Background
### Stanford released several pretrained GloVe models, each trained on a different large corpus:

1. **Wikipedia (2014 edition)**: A comprehensive collection of Wikipedia articles.
2. **Gigaword 5**: A massive collection of English news text, including newswire from sources like the Associated Press, New York Times, and others.
3. **Common Crawl**: Text from a variety of web sources, released in 42-billion-token and 840-billion-token versions.
4. **Twitter**: A dataset of 2 billion tweets to capture social media vocabulary.

### The available GloVe models differ by vector size and dataset. The commonly used ones are:
1. **6B**: 6 billion tokens (Wikipedia and Gigaword 5, uncased). This notebook uses the 300-dimensional version, `glove.6B.300d.txt`.
2. **42B**: 42 billion tokens (Common Crawl, uncased).
3. **840B**: 840 billion tokens (Common Crawl, cased).
4. **27B**: 27 billion tokens (Twitter, uncased).

## Key Differences in GloVe Training and Word2Vec Training
1. **Scale of Data**: GloVe was trained on vast datasets encompassing billions of tokens, while the Word2Vec models trained on Shakespeare's plays are limited to a specific vocabulary and style of language.
2. **Contextual Breadth**: GloVe’s training data spans a wide range of topics, styles, and vocabulary, making it more generalizable, while the Shakespeare Word2Vec models are more tailored to the specific context of early modern English literature.
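A further difference lies in the training objective: GloVe is fit to corpus-wide word co-occurrence counts, whereas Word2Vec learns from one local context window at a time. A minimal sketch of collecting such counts on made-up sentences follows. It uses plain unweighted counts for simplicity, and `cooccurrence_counts` is a hypothetical helper rather than part of any GloVe tooling.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered (word, neighbor) pair occurs within `window`."""
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, sent[j])] += 1
    return counts

# Made-up corpus, for illustration only
toy_corpus = [["king", "queen", "denmark"], ["queen", "denmark", "king"]]
counts = cooccurrence_counts(toy_corpus, window=1)
print(counts[("queen", "denmark")])
```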

## Part 3

In [161]:
# List of models with names and access methods
models = {
    "CBOW Word2Vec": cbow_model.wv,
    "Skip-Gram Word2Vec": skip_gram_model.wv,
    "GloVe": glove_model
}

# Define the terms to compare
terms = ['hamlet', 'cauldron', 'nature', 'spirit', 'general', 'prythee']

In [162]:
# Checking for the most similar terms in each model
for i in terms:
    results = {}
    try:
        results['Cbow_model'] = [(word, round(prob, 4)) for word, prob in cbow_model.wv.most_similar(i, topn=5)]
        results['Skip_gram_model'] = [(word, round(prob, 4)) for word, prob in skip_gram_model.wv.most_similar(i, topn=5)]
        results['Glove_model'] = [(word, round(prob, 4)) for word, prob in glove_model.most_similar(i, topn=5)]
        
    except KeyError as e:
        # Report which term is missing from a model's vocabulary
        print(e)
            
    print(i.center(70))
    print('-'*70)
    print(pd.DataFrame(results))

                                hamlet                                
----------------------------------------------------------------------
         Cbow_model     Skip_gram_model         Glove_model
0   (queen, 0.9998)     (queen, 0.9895)   (village, 0.5545)
1    (lady, 0.9997)   (laertes, 0.9862)      (town, 0.4956)
2    (king, 0.9997)  (polonius, 0.9831)   (hamlets, 0.4852)
3  (banquo, 0.9997)      (king, 0.9824)   (othello, 0.4451)
4  (either, 0.9997)   (horatio, 0.9822)  (situated, 0.4446)
                               cauldron                               
----------------------------------------------------------------------
         Cbow_model       Skip_gram_model          Glove_model
0    (self, 0.9995)      (bubble, 0.9992)    (caldron, 0.6924)
1      (th, 0.9995)       (crack, 0.9991)      (flame, 0.5382)
2    (even, 0.9995)  (distracted, 0.9991)  (cauldrons, 0.4885)
4  (within, 0.9995)     (current, 0.9991)        (lit, 0.4593)
                                nature   

## Analysis and Comparison
### Hamlet

- CBOW: Returns related roles and characters ("queen," "lady," "king") but also a character from another play ("banquo") and a generic term ("either").
- Skip-gram: Captures Shakespearean characters directly associated with Hamlet (e.g., "Laertes," "Horatio").
- GloVe: Returns broader terms (e.g., "village," "town," "Othello"), which reflect the common-noun sense of "hamlet" (a small village) alongside general literary associations.

### Cauldron

- CBOW: Mostly irrelevant terms like "self" and "even," with limited relevance to cauldron imagery.
- Skip-gram: Provides relevant imagery-rich words like "bubble," "crack," and "flower," fitting the supernatural theme.
- GloVe: Offers generally associated terms like "flame" and "torch," focusing more on practical meanings than context.

### Nature

- CBOW: Contains high-frequency words like "till" and "whose," which don’t enhance the meaning.
- Skip-gram: Identifies words like "earth," "breath," and "seems," capturing natural and philosophical themes.
- GloVe: Shows general terms like "subject" and "existence," which lack specific Shakespearean associations.

### Spirit

- CBOW: Includes frequent functional terms such as "whose" and "are" without context relevance.
- Skip-gram: Finds thematically relevant terms like "angel" and "damn," aligning with supernatural elements.
- GloVe: Offers broad associations (e.g., "spirits," "faith," "courage") but lacks depth in supernatural nuances.

### General

- CBOW: Returns generic terms like "fire" and "self" without strong contextual ties.
- Skip-gram: Captures loosely related terms like "drone" and "judge," somewhat relevant to character roles.
- GloVe: Accurately reflects administrative and military meanings (e.g., "gen.," "brigadier," "chief").

### Prythee

- CBOW: Contains mostly irrelevant terms such as "ile" and "self."
- Skip-gram: Provides loosely associated words but lacks strong connection (e.g., "does," "nobles").
- GloVe: Does not recognize this term due to its rarity outside of Shakespearean context.
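The rankings discussed above come from `most_similar`, which ranks vocabulary words by cosine similarity to the query vector. A minimal pure-Python sketch on made-up 2-d vectors follows; the vectors and the `nearest` helper are hypothetical. It also shows the `KeyError` raised for an out-of-vocabulary query, as happens with "prythee" in GloVe.

```python
import math

# Made-up 2-d embeddings, for illustration only
toy_vectors = {
    "hamlet": [0.9, 0.1],
    "horatio": [0.8, 0.2],
    "cauldron": [0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest(vectors, query, topn=5):
    """Rank all other words by cosine similarity to `query`; KeyError if unknown."""
    q = vectors[query]
    scored = [(w, round(cosine(q, v), 4)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

print(nearest(toy_vectors, "hamlet"))
try:
    nearest(toy_vectors, "prythee")
except KeyError as err:
    print(f"not in vocabulary: {err}")
```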


In [163]:
# Getting the similarity score between given words among the three models
word_pairs = [
    ('brutus', 'murder'),
    ('macbeth', 'gertrude'),
    ('fortinbras', 'norway'),
    ('rome', 'norway'),
    ('ghost', 'spirit'),
    ('macbeth', 'hamlet')
]
for word1, word2 in word_pairs:
    try: 
        results_score = { 
            "similarity_cbow":cbow_model.wv.cosine_similarities(cbow_model.wv[word1], [cbow_model.wv[word2]]),
            "similarity_skip_gram":skip_gram_model.wv.cosine_similarities(skip_gram_model.wv[word1], [skip_gram_model.wv[word2]]),
            "similarity_glove": glove_model.cosine_similarities(glove_model[word1], [glove_model[word2]])
            }

    except KeyError as e:
        # Skip this pair so that stale scores from the previous pair are not printed
        print(e)
        continue
    print()
    print(f"{word1}:{word2}".center(60))
    print(f"CBOW: {results_score['similarity_cbow']} | skipgram: {results_score['similarity_skip_gram']} | GLOVE: {results_score['similarity_glove']}")


                       brutus:murder                        
CBOW: [0.99658227] | skipgram: [0.9468968] | GLOVE: [0.06155146]

                      macbeth:gertrude                      
CBOW: [0.99893713] | skipgram: [0.9367072] | GLOVE: [0.23251496]

                     fortinbras:norway                      
CBOW: [0.9996064] | skipgram: [0.9988509] | GLOVE: [0.05642105]

                        rome:norway                         
CBOW: [0.9996088] | skipgram: [0.9952338] | GLOVE: [0.13604455]

                        ghost:spirit                        
CBOW: [0.99959236] | skipgram: [0.99067837] | GLOVE: [0.3149283]

                       macbeth:hamlet                       
CBOW: [0.9995727] | skipgram: [0.9406717] | GLOVE: [0.3862637]


## General Summary
### (‘brutus’, ‘murder’)

- CBOW: 0.9966 - Shows a very high similarity, indicating that the model captures the strong thematic link between Brutus and the act of murder in Julius Caesar.
- Skip-gram: 0.9469 - Also reflects a strong relationship, although lower than CBOW. This still indicates an effective capture of the theme.
- GloVe: 0.0616 - Very low similarity, suggesting GloVe fails to recognize the thematic connection, likely due to its more general training data.

### (‘macbeth’, ‘gertrude’)

- CBOW: 0.9989 - Extremely high similarity. Note that the query token was "macbeth", which covers both Macbeth and Lady Macbeth, so the score does not isolate the link between the two female characters.
- Skip-gram: 0.9367 - Also shows a strong association, although lower than CBOW.
- GloVe: 0.2325 - Low to moderate similarity, indicating some recognition of the characters’ royalty but lacking depth in their contextual significance.

### (‘fortinbras’, ‘norway’)

- CBOW: 0.9996 - Indicates a very strong relationship, successfully connecting Fortinbras with his association to Norway in Hamlet.
- Skip-gram: 0.9989 - Also shows high similarity, further confirming the connection.
- GloVe: 0.0564 - Very low, indicating a failure to establish the relationship likely due to lack of context in the training data.

### (‘rome’, ‘norway’)

- CBOW: 0.9996 - Shows an unexpectedly high similarity, but this may stem from word frequency rather than thematic connection.
- Skip-gram: 0.9952 - Also indicates a high similarity, which might not accurately reflect their lack of direct relationship in Shakespearean context.
- GloVe: 0.1360 - Low similarity, recognizing some association but not capturing their relationship effectively.

### (‘ghost’, ‘spirit’)

- CBOW: 0.9996 - Very high similarity, effectively capturing the connection between the ghost and the broader theme of spirits in Hamlet.
- Skip-gram: 0.9907 - Also shows strong similarity, indicating a good recognition of the supernatural element.
- GloVe: 0.3149 - Moderate similarity, indicating some recognition of the terms' overlap but lacking the depth of connection captured by the other models.

### (‘macbeth’, ‘hamlet’)

- CBOW: 0.9996 - Extremely high similarity, capturing the strong literary connection between these two central characters in Shakespeare's works.
- Skip-gram: 0.9407 - Also indicates a good relationship, although lower than CBOW.
- GloVe: 0.3863 - Moderate similarity, indicating some recognition of the characters' importance but lacking the context specific to Shakespeare’s themes.
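For reference, each score in this summary is the cosine similarity between the two word vectors, written here as $u$ and $v$ with $d = 300$ dimensions (the symbols are introduced only for this formula):

```latex
\[
\operatorname{sim}(u, v)
  = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
  = \frac{\sum_{i=1}^{d} u_i v_i}
         {\sqrt{\sum_{i=1}^{d} u_i^{2}} \, \sqrt{\sum_{i=1}^{d} v_i^{2}}}
\]
```

The value lies in $[-1, 1]$. A score near 1 means the two vectors point in almost the same direction, so when nearly every pair scores near 1 the measure separates related from unrelated words poorly.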

In [164]:
word_list = ['denmark','queen', 'scotland', 'army', 'general', 'father', 'woman', 'mother', 'man']

In [165]:
# Getting the top 5 terms most similar to the combination of two or three query terms, for each of the three models
try:
    res = { 
            "similar_terms_cbow":[(word, round(prob, 4)) for word, prob in cbow_model.wv.most_similar(positive=['denmark', 'queen'], topn = 5)],
            "similar_terms_skipgram":[(word, round(prob, 4)) for word, prob in skip_gram_model.wv.most_similar(positive=['denmark', 'queen'], topn = 5)],
            "similar_terms_glove": [(word, round(prob, 4)) for word, prob in glove_model.most_similar(positive=['denmark', 'queen'], topn = 5)]
            }     
except KeyError as e:
    print(e)
print("|Denmark + Queen|".center(70))
print('-' * 70)
print(pd.DataFrame(res))

                          |Denmark + Queen|                           
----------------------------------------------------------------------
  similar_terms_cbow   similar_terms_skipgram similar_terms_glove
0       (th, 0.9999)        (laertes, 0.9972)    (sweden, 0.6307)
1     (wife, 0.9999)        (ophelia, 0.9971)    (norway, 0.5922)
2     (poor, 0.9999)  (guildensterne, 0.9969)      (king, 0.5871)
3   (nature, 0.9999)      (attendant, 0.9968)    (danish, 0.5676)
4     (many, 0.9999)    (rosincrance, 0.9968)  (princess, 0.5676)


In [166]:
try:
    res_2 = { 
            "similar_terms_cbow":[(word, round(prob, 4)) for word, prob in cbow_model.wv.most_similar(positive=['scotland', 'army', 'general'], topn = 5)],
            "similar_terms_skipgram":[(word, round(prob, 4)) for word, prob in skip_gram_model.wv.most_similar(positive=['scotland', 'army', 'general'], topn = 5)],
            "similar_terms_glove": [(word, round(prob, 4)) for word, prob in glove_model.most_similar(positive=['scotland', 'army', 'general'], topn = 5)]
            }     
except KeyError as e:
    print(e)
print("|scotland + army + general|".center(70))
print('-' * 70)
print(pd.DataFrame(res_2))

                     |scotland + army + general|                      
----------------------------------------------------------------------
  similar_terms_cbow similar_terms_skipgram  similar_terms_glove
0     (like, 0.9997)          (ran, 0.9995)   (military, 0.6035)
1     (both, 0.9997)       (having, 0.9994)       (gen., 0.5679)
2       (th, 0.9997)    (advantage, 0.9994)    (command, 0.5675)
3   (though, 0.9997)     (speaking, 0.9994)  (commander, 0.5628)
4      (way, 0.9997)       (statue, 0.9994)     (forces, 0.5607)


In [167]:
try:
    res_3 = { 
            "similar_terms_cbow":[(word, round(prob, 4)) for word, prob in cbow_model.wv.most_similar(positive=['father', 'woman'], negative=['man'], topn = 5)],
            "similar_terms_skipgram":[(word, round(prob, 4)) for word, prob in skip_gram_model.wv.most_similar(positive=['father', 'woman'], negative=['man'], topn = 5)],
            "similar_terms_glove": [(word, round(prob, 4)) for word, prob in glove_model.most_similar(positive=['father', 'woman'], negative=['man'], topn = 5)]
            }     
except KeyError as e:
    print(e)
print("|father - man + woman|".center(70))
print('-' * 70)   
print(pd.DataFrame(res_3))

                        |father - man + woman|                        
----------------------------------------------------------------------
  similar_terms_cbow similar_terms_skipgram   similar_terms_glove
0   (mother, 0.9996)       (hamlet, 0.9778)      (mother, 0.8266)
1     (many, 0.9996)       (mother, 0.9774)    (daughter, 0.7898)
2     (wife, 0.9996)      (laertes, 0.9753)     (husband, 0.7276)
3   (heaven, 0.9996)           (oh, 0.9743)        (wife, 0.7243)
4     (thus, 0.9996)         (soul, 0.9708)  (grandmother, 0.696)


In [168]:
try:
    res_4 = { 
            "similar_terms_cbow":[(word, round(prob, 4)) for word, prob in cbow_model.wv.most_similar(positive=['mother', 'man'], negative=['woman'], topn = 5)],
            "similar_terms_skipgram":[(word, round(prob, 4)) for word, prob in skip_gram_model.wv.most_similar(positive=['mother', 'man'], negative=['woman'], topn = 5)],
            "similar_terms_glove": [(word, round(prob, 4)) for word, prob in glove_model.most_similar(positive=['mother', 'man'], negative=['woman'], topn = 5)]
            }     
except KeyError as e:
    print(e)
print("|mother - woman + man|".center(70))
print('-' * 70)   
print(pd.DataFrame(res_4))

                        |mother - woman + man|                        
----------------------------------------------------------------------
  similar_terms_cbow similar_terms_skipgram    similar_terms_glove
0     (hand, 0.9996)         (kill, 0.9897)       (father, 0.7923)
1    (great, 0.9996)         (poor, 0.9895)      (brother, 0.7102)
2  (purpose, 0.9996)        (since, 0.9883)          (son, 0.6802)
3     (life, 0.9996)       (breath, 0.9883)        (uncle, 0.6393)
4    (still, 0.9996)      (denmark, 0.9882)  (grandfather, 0.6236)


### Denmark + Queen
- CBOW Model: Generates high-frequency but generic terms like "th" and "wife," lacking specific context.
- Skip-gram Model: Identifies key characters such as "Laertes" and "Ophelia," along with the role "attendant," effectively capturing the Danish court.
- GloVe Model: Produces broader terms like "Sweden" and "king," but less relevant to the specific query.

### Scotland + Army + General
- CBOW Model: Returns generic terms such as "like" and "way," showing limited context.
- Skip-gram Model: Yields terms like "ran," "advantage," and "statue," which show little military relevance.
- GloVe Model: Captures military terms like "military" and "commander," demonstrating a better understanding of the query's themes.

### Father - Man + Woman
- CBOW Model: Produces terms like "mother" and "wife," reflecting gender-related themes, but lacks specificity.
- Skip-gram Model: Successfully identifies "mother" and "soul," showing familial relationships but with some irrelevant terms.
- GloVe Model: Effectively captures family roles with terms like "mother" and "daughter," demonstrating a solid grasp of gender dynamics.

### Mother - Woman + Man
- CBOW Model: Generates generic terms like "hand" and "life," showing limited relevance to the query.
- Skip-gram Model: Produces terms like "kill" and "breath," indicating darker themes rather than family roles.
- GloVe Model: Identifies terms like "father" and "son," successfully reflecting family dynamics.
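Queries such as `most_similar(positive=['father', 'woman'], negative=['man'])` perform vector arithmetic: the positive vectors are added, the negative ones subtracted, and the vocabulary is ranked by cosine similarity to the result, with the query words themselves excluded. The sketch below shows the idea on made-up 2-d vectors. It skips the normalization of input vectors that gensim applies before combining them (as far as I understand its implementation), and the `analogy` helper is hypothetical.

```python
import math

# Made-up 2-d embeddings: axis 0 ~ "parent", axis 1 ~ gender (illustration only)
toy_vectors = {
    "father": [1.0, -1.0],
    "mother": [1.0, 1.0],
    "man": [0.0, -1.0],
    "woman": [0.0, 1.0],
    "daughter": [0.5, 1.0],
    "brother": [0.5, -1.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(vectors, positive, negative, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    exclude = set(positive) | set(negative)
    scored = [(w, round(cosine(target, v), 4))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# father - man + woman
print(analogy(toy_vectors, positive=["father", "woman"], negative=["man"]))
```

In this toy space "mother" ranks first because the vectors were constructed that way; with real embeddings the result depends on how well the training corpus encodes the relation.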

### Overall Comments on Model Performance
1. CBOW Model:

- Strengths: Fast and efficient for training, producing embeddings for words based on context. It effectively captures some relationships and frequently returns common terms.
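- Weaknesses: Tends to generate generic or less relevant terms, especially in nuanced contexts. Its similarity scores in this notebook are almost all above 0.99, even for unrelated pairs such as "rome" and "norway," so they discriminate poorly between words. It may overlook important character relationships and themes in Shakespearean English.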

2. Skip-gram Model:

- Strengths: Excels in capturing contextual nuances and relationships, particularly in more complex and thematic phrases. It identifies relevant character names and actions more effectively.
- Weaknesses: Slower to train compared to CBOW, and can sometimes generate irrelevant terms when context is less clear.

3. GloVe Model:

- Strengths: Leverages global statistical information to produce embeddings that capture relationships effectively. It provides relevant terms in many contexts and is good at capturing similarities based on co-occurrence.
- Weaknesses: May miss subtleties and context-specific meanings, particularly in dense literary language like that of Shakespeare. It can yield broader, less focused terms.


### Recommendations for Training a Better Word Embedding Model
1. Diverse Shakespearean Texts
2. Contextual Literature
3. Thematic Groupings
4. Collaborative Data
5. Additional Language Resources
6. Fine-Tuning with Modern Corpora