## Word Embedding and Tokenization

Problem Statement:
Train a word embedding on three famous works of Shakespeare to determine how well your embedding can understand the meaning of character names and other Shakespearean English words found in these plays.

1) The Tragedy of Hamlet, Prince of Denmark
2) The Tragedy of Macbeth
3) The Tragedy of Julius Caesar.

In [29]:
# import packages
import nltk
from nltk.corpus import gutenberg
nltk.download('gutenberg')
import gensim.downloader as api


# tokenization
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from autocorrect import Speller
import re
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Modelling
from gensim.models import Word2Vec             # for wordVec and CBow Model
from gensim.models import KeyedVectors

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\sheyi\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sheyi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sheyi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sheyi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [4]:
# load and merge the text dataset
text_hamlet = gutenberg.raw('shakespeare-hamlet.txt').lower()
text_macbeth = gutenberg.raw('shakespeare-macbeth.txt').lower()
text_julius = gutenberg.raw('shakespeare-caesar.txt').lower()

In [5]:
text_hamlet



In [6]:
# Combine the texts into a single variable
text = text_hamlet + " " + text_macbeth + " " + text_julius

## Tokenization

#### Split text into sentences and words

In [14]:
# Split the combined text into sentences and report how many were found
sentences = sent_tokenize(text)
print(len(sentences))

5325


In [15]:
tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
tokenized_sentences

[['[',
  'the',
  'tragedie',
  'of',
  'hamlet',
  'by',
  'william',
  'shakespeare',
  '1599',
  ']',
  'actus',
  'primus',
  '.'],
 ['scoena', 'prima', '.'],
 ['enter', 'barnardo', 'and', 'francisco', 'two', 'centinels', '.'],
 ['barnardo', '.'],
 ['who', "'s", 'there', '?'],
 ['fran', '.'],
 ['nay',
  'answer',
  'me',
  ':',
  'stand',
  '&',
  'vnfold',
  'your',
  'selfe',
  'bar',
  '.'],
 ['long', 'liue', 'the', 'king', 'fran', '.'],
 ['barnardo', '?'],
 ['bar', '.'],
 ['he', 'fran', '.'],
 ['you', 'come', 'most', 'carefully', 'vpon', 'your', 'houre', 'bar', '.'],
 ["'t",
  'is',
  'now',
  'strook',
  'twelue',
  ',',
  'get',
  'thee',
  'to',
  'bed',
  'francisco',
  'fran',
  '.'],
 ['for',
  'this',
  'releefe',
  'much',
  'thankes',
  ':',
  "'t",
  'is',
  'bitter',
  'cold',
  ',',
  'and',
  'i',
  'am',
  'sicke',
  'at',
  'heart',
  'barn',
  '.'],
 ['haue', 'you', 'had', 'quiet', 'guard', '?'],
 ['fran', '.'],
 ['not', 'a', 'mouse', 'stirring', 'barn', '.'],
 

## Text Processing

Clean the tokens with a spell checker, a stopword filter, and a stemmer. A lemmatizer is initialised as well, but the processing loop below uses only the stemmer.

In [116]:
# Initialize tools
speller = Speller()
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [117]:
len(stop_words)
stop_words


{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [120]:
# Common words to keep in the corpus, so drop them from the stopword list
words_to_remove = {'the', 'and', 'to', 'of', 'is', 'that'}

# Set difference returns a new set without those words
stop_words = stop_words - words_to_remove

# Shakespeare-specific words to treat as stopwords (added in a later cell)
words_to_add = {
    'thou', 'thee', 'thy', 'art', 'hath', 'dost', 'fie', 'o',
    'wherefore', 'alas', 'sir', 'gentle', 'queen', 'king',
    'spirit', 'blood', 'murder', 'soul', 'fate', 'fortune'
}

# Print the size of the stopword set after the removals
print("Updated stopwords set after removing some text:")
print(len(stop_words))

Updated stopwords set after removing some text:
173


In [121]:
len(stop_words)

173

In [122]:
# Add the new words
stop_words.update(words_to_add)

# Print the updated stopwords set
print("Updated stopwords set after adding some text:")
print(len(stop_words))

Updated stopwords set after adding some text:
192
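
The jump from 173 to 192 rather than 193 follows from set semantics: `'o'` is already in the NLTK list, so only 19 of the 20 additions are new. The sketch below uses a small made-up stopword set (an assumption, not the NLTK list) to show the two operations used above: `-` returns a new set, while `update` changes the set in place and returns `None`.

```python
# Toy stopword set; the notebook itself uses NLTK's English list instead.
stop_words = {"the", "and", "of", "o", "it"}

# Set difference builds a NEW set, so the result must be assigned.
stop_words = stop_words - {"the", "and"}
size_after_removal = len(stop_words)

# update() mutates in place and returns None, so its result is not assigned
# back to stop_words. 'o' is already present and is not counted twice.
result = stop_words.update({"thou", "thee", "o"})
size_after_update = len(stop_words)

print(size_after_removal, size_after_update, result)
```

Assigning the return value of `update` back to the variable would replace the set with `None`, which is why the in-place form is the one to use.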


In [123]:
# Clean each tokenized sentence: spell-correct, filter stopwords, strip punctuation, stem
processed_sentences = []
for sentence in tokenized_sentences:
    # Spell-correct every token (this also modernises spellings, e.g. 'barnardo' -> 'bernard')
    corrected_sentence = [speller(word) for word in sentence]
    # Drop stopwords, then strip non-word characters from the remaining tokens
    words = [re.sub(r'\W+', '', word) for word in corrected_sentence if word not in stop_words]
    # Keep purely alphabetic tokens and reduce them to their stems
    stemmed_words = [stemmer.stem(word) for word in words if word.isalpha()]
    # Collect the processed sentence
    processed_sentences.append(stemmed_words)


print("First 5 sentences after processing:")
for i in range(5):
    print(processed_sentences[i])

First 5 sentences after processing:
['the', 'tragedi', 'of', 'hamlet', 'william', 'shakespear', 'act', 'prime']
['scene', 'prima']
['enter', 'bernard', 'and', 'francisco', 'two', 'sentinel']
['bernard']
['s']
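
The lone `'s'` in the fifth sentence appears because the stopword check runs before punctuation is stripped: the token `"'s"` is not matched by the stopword set, but it becomes `'s'` afterwards. One possible reordering, sketched here with a hypothetical helper `clean_tokens` and a toy stopword set (spell correction and stemming are left out to keep the example dependency-free), strips punctuation first and filters second.

```python
import re

def clean_tokens(tokens, stop_words):
    """Strip non-word characters first, then drop stopwords and non-alphabetic tokens."""
    cleaned = []
    for token in tokens:
        word = re.sub(r"\W+", "", token.lower())
        if word.isalpha() and word not in stop_words:
            cleaned.append(word)
    return cleaned

# Toy stopword set, assumed for illustration only
toy_stop_words = {"who", "there", "s", "is"}

print(clean_tokens(["who", "'s", "there", "?"], toy_stop_words))
print(clean_tokens(["'t", "is", "bitter", "cold", ","], toy_stop_words))
```

With this ordering the first sentence comes out empty, so empty sentences may need to be dropped before training.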


## Modeling with CBOW and Skip-gram Word2Vec

In [124]:
cbow_model = Word2Vec(sentences=processed_sentences, vector_size=100, window=5, min_count=2, sg=0, epochs=10)
# sg=0 means CBOW model

In [125]:
#Display the most frequent words
most_frequent_words = cbow_model.wv.index_to_key[:20]
most_frequent_words

['the',
 'and',
 'to',
 'of',
 'that',
 'is',
 'd',
 'ham',
 'lord',
 'shall',
 'come',
 's',
 'enter',
 'good',
 'let',
 'mac',
 'like',
 'cesar',
 'make',
 'one']

In [126]:
# Top 20 most frequent words
word_counts = [cbow_model.wv.get_vecattr(word, "count") for word in most_frequent_words]
print("20 Most Frequent Words and Counts:", list(zip(most_frequent_words, word_counts)))

20 Most Frequent Words and Counts: [('the', 2223), ('and', 2035), ('to', 1512), ('of', 1302), ('that', 901), ('is', 852), ('d', 585), ('ham', 337), ('lord', 306), ('shall', 300), ('come', 298), ('s', 295), ('enter', 232), ('good', 223), ('let', 221), ('mac', 205), ('like', 204), ('cesar', 193), ('make', 193), ('one', 188)]
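
The counts stored on the model can be cross-checked straight from the token lists with `collections.Counter`. The sentences below are made-up stand-ins for `processed_sentences`.

```python
from collections import Counter

# Made-up token lists standing in for processed_sentences
toy_sentences = [
    ["enter", "hamlet"],
    ["hamlet", "lord", "come"],
    ["lord", "hamlet"],
]

# Count every token across all sentences
counts = Counter(word for sentence in toy_sentences for word in sentence)

# most_common(n) returns (word, count) pairs sorted by descending count
top_two = counts.most_common(2)
print(top_two)
```

Words that occur fewer than `min_count` times are left out of the Word2Vec vocabulary, so a raw count like this can contain more entries than the model does.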


In [27]:
skipgram_model = Word2Vec(sentences=processed_sentences, vector_size=100, window=5, min_count=2, sg=1, epochs=10)
# sg=1 means Skip-gram model
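
The `sg` flag switches between two ways of framing the prediction task. As a rough illustration (not gensim's actual implementation), the hypothetical helpers below list the training examples each framing would draw from one sentence: CBOW predicts a word from its surrounding context, and Skip-gram predicts each context word from the centre word.

```python
def context_window(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """One (context words, target) example per position."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """One (centre word, context word) example per neighbouring pair."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]

sentence = ["long", "live", "the", "king"]
print(cbow_examples(sentence, window=1))
print(skipgram_examples(sentence, window=1))
```

Skip-gram produces one example per word pair, which is one common explanation for why it tends to give rare words more training signal than CBOW.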

In [32]:
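# Option 1: download the 100-dimensional GloVe vectors through gensim's downloader
# glove_model = api.load("glove-wiki-gigaword-100")

# Option 2 (used here): load a local GloVe text file, which has no word2vec header line
glove_model = KeyedVectors.load_word2vec_format("glove.6B.100d.txt", binary=False, no_header=True)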
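
The `no_header=True` argument is needed because GloVe text files start directly with word vectors, whereas the word2vec text format begins with a line giving the vocabulary size and vector dimension. A minimal sketch of parsing one GloVe-style line, using a hypothetical helper and a made-up three-dimensional vector:

```python
def parse_glove_line(line):
    """Split 'word v1 v2 ...' into the word and a list of floats."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(x) for x in parts[1:]]

# Made-up line in the GloVe text layout: a token followed by its vector values
word, vector = parse_glove_line("cauldron 0.12 -0.5 0.33\n")
print(word, len(vector))
```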

## Observation
The GloVe model is trained on general English text, so it supplies a broader semantic context that contrasts with the domain-specific vocabulary of Shakespeare’s plays. Its word relationships extend beyond the lexical patterns of Shakespearean English, which makes it a useful general-purpose baseline for the comparisons that follow.

## Evaluations and Comparisons

We use cbow_model, skipgram_model, and glove_model to find the five most similar terms to each target word.

In [56]:
target_words = ['hamlet', 'cauldron', 'murder', 'spirit', 'heart', 'stand']

# Flatten processed_sentences into a set of unique words
vocabulary = set(word for sentence in processed_sentences for word in sentence)

# Keep only the target words found in the corpus. Build a new list rather than
# removing items while iterating, which can skip elements.
target_words = [word for word in target_words if word in vocabulary]

print("Filtered target words:", target_words)

Filtered target words: ['hamlet', 'cauldron', 'murder', 'spirit', 'heart', 'stand']
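
Filtering a list against a vocabulary is safest done by building a new list, because removing items from a list during iteration can skip the element that follows each removal. A minimal sketch with made-up tokens:

```python
vocab = {"a", "b"}

# Removing while iterating: after 'x' is removed, 'y' shifts into its slot
# and the loop moves past it without checking it.
words = ["a", "x", "y", "b"]
for w in words:
    if w not in vocab:
        words.remove(w)
removed_in_loop = words

# Building a new list checks every element exactly once.
filtered = [w for w in ["a", "x", "y", "b"] if w in vocab]

print(removed_in_loop)
print(filtered)
```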


In [47]:
for word in target_words:
    print(f"\nMost similar words to '{word}' (CBOW):", cbow_model.wv.most_similar(word, topn=5))
    print(f"\nMost similar words to '{word}' (Skip-gram):", skipgram_model.wv.most_similar(word, topn=5))
    print(f"\nMost similar words to '{word}' (GloVe):", glove_model.most_similar(word, topn=5))
    print("\n-----------------------------------------------------------------------------------\n")


Most similar words to 'cauldron' (CBOW): [('might', 0.9985567927360535), ('grow', 0.998526930809021), ('self', 0.9985207319259644), ('sinc', 0.9985194206237793), ('henc', 0.9985079765319824)]

Most similar words to 'cauldron' (Skip-gram): [('neck', 0.9979887008666992), ('snake', 0.9979643225669861), ('salt', 0.9979466795921326), ('bubbl', 0.9978955984115601), ('larg', 0.9978904128074646)]

Most similar words to 'cauldron' (GloVe): [('caldron', 0.7603139281272888), ('flame', 0.6907342672348022), ('lit', 0.5912410020828247), ('torch', 0.5581894516944885), ('candle', 0.547653079032898)]

-----------------------------------------------------------------------------------


Most similar words to 'spirit' (CBOW): [('natur', 0.9997684955596924), ('sword', 0.9997303485870361), ('th', 0.9997278451919556), ('till', 0.9997237920761108), ('self', 0.9997237324714661)]

Most similar words to 'spirit' (Skip-gram): [('instrument', 0.9928393959999084), ('dell', 0.9926673769950867), ('loos', 0.99253845

## Cosine Similarity Between Terms

In [57]:
pairs = [('stand', 'spirit'), ('lady macbeth', 'queen gertrude'), ('fortinbras', 'norway'), ('rome', 'norway'), ('ghost', 'spirit'), ('macbeth', 'hamlet')]

pairs = [(word1, word2) for word1, word2 in pairs if word1 in vocabulary and word2 in vocabulary]

print("Filtered pairs:", pairs)



Filtered pairs: [('stand', 'spirit'), ('rome', 'norway'), ('ghost', 'spirit'), ('macbeth', 'hamlet')]


In [58]:
for word1, word2 in pairs:
    print(f"\nCosine similarity between '{word1}' and '{word2}' (CBOW):", cbow_model.wv.similarity(word1, word2))
    print(f"\nCosine similarity between '{word1}' and '{word2}' (Skip-gram):", skipgram_model.wv.similarity(word1, word2))
    print(f"\nCosine similarity between '{word1}' and '{word2}' (GloVe):", glove_model.similarity(word1, word2))
    print("\n-----------------------------------------------------------------------------------\n")


Cosine similarity between 'stand' and 'spirit' (CBOW): 0.9995566

Cosine similarity between 'stand' and 'spirit' (Skip-gram): 0.97179776

Cosine similarity between 'stand' and 'spirit' (GloVe): 0.35456046

-----------------------------------------------------------------------------------


Cosine similarity between 'rome' and 'norway' (CBOW): 0.99934185

Cosine similarity between 'rome' and 'norway' (Skip-gram): 0.98712957

Cosine similarity between 'rome' and 'norway' (GloVe): 0.28583667

-----------------------------------------------------------------------------------


Cosine similarity between 'ghost' and 'spirit' (CBOW): 0.9988049

Cosine similarity between 'ghost' and 'spirit' (Skip-gram): 0.9756504

Cosine similarity between 'ghost' and 'spirit' (GloVe): 0.4282089

-----------------------------------------------------------------------------------


Cosine similarity between 'macbeth' and 'hamlet' (CBOW): 0.99886763

Cosine similarity between 'macbeth' and 'hamlet' (Skip-gra

## Observation on Cosine Similarity of the three models
The CBOW and Skip-gram models return cosine similarities close to 1 for every pair tested, while GloVe returns much lower scores (roughly 0.29 to 0.43 for the pairs shown). The Shakespeare-trained models therefore reflect in-corpus co-occurrence that GloVe, trained on general English, does not. The high scores need careful reading, though: unrelated pairs such as 'stand'/'spirit' and 'rome'/'norway' score about as high as 'ghost'/'spirit', which suggests that on a corpus this small the vectors separate words only weakly. Overall, CBOW and Skip-gram are tied more closely to the source literature, while GloVe is better suited to broader, general-context associations.
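
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction only and ignores magnitude. A plain-Python sketch with made-up two-dimensional vectors:

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # no shared direction
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # one is a multiple of the other
nearly_same = cosine_similarity([1.0, 1.0], [1.0, 1.1])     # almost parallel

print(orthogonal, same_direction, nearly_same)
```

When all vectors in a model point in nearly the same direction, every pair scores close to 1, whatever the words mean.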

In [72]:

combinations = {
    'denmark + queen': cbow_model.wv['denmark'] + cbow_model.wv['queen'],
    'scotland + soldier + spirit': cbow_model.wv['scotland'] + cbow_model.wv['soldier'] + cbow_model.wv['spirit'],
    'father - man + woman': cbow_model.wv['father'] - cbow_model.wv['man'] + cbow_model.wv['woman'],
    'mother - woman + man': cbow_model.wv['mother'] - cbow_model.wv['woman'] + cbow_model.wv['man']
}
combinations




{'denmark + queen': array([-0.40256208,  0.33255523, -0.07730086, -0.08666072,  0.04695541,
        -0.8317013 ,  0.33547068,  1.143829  , -0.55700994, -0.42265314,
        -0.36383003, -0.7921395 , -0.01583188,  0.20867558, -0.04220178,
        -0.65008634, -0.14311442, -0.8600786 ,  0.04926191, -1.1204671 ,
         0.38027608,  0.24447873,  0.38557968, -0.27196223, -0.1115943 ,
        -0.26045567, -0.479841  , -0.6286197 , -0.7630228 , -0.03928147,
         0.5382228 ,  0.26230687, -0.02469668, -0.3603202 , -0.06503507,
         0.4724317 , -0.27539575, -0.4743856 , -0.37026858, -1.2770967 ,
        -0.08148487, -0.5203724 , -0.09608287, -0.12129311,  0.53602797,
        -0.28100795, -0.51687944,  0.09033443,  0.4333254 ,  0.47629994,
         0.29382744, -0.7122383 , -0.14804804, -0.17283083, -0.3106609 ,
         0.39275584,  0.10875832, -0.09138481, -0.8623644 ,  0.1694522 ,
         0.100564  ,  0.25043556, -0.09880011,  0.06686516, -0.8280623 ,
         0.18599689, -0.12147638

In [73]:

for description, vector in combinations.items():
    print(f"\nMost similar terms to '{description}' (CBOW):", cbow_model.wv.similar_by_vector(vector, topn=5))
    print(f"\nMost similar terms to '{description}' (Skip-gram):", skipgram_model.wv.similar_by_vector(vector, topn=5))
    print(f"\nMost similar terms to '{description}' (GloVe):", glove_model.similar_by_vector(vector, topn=5))
    print("\n-----------------------------------------------------------------------------------\n")


Most similar terms to 'denmark + queen' (CBOW): [('denmark', 0.9998441338539124), ('queen', 0.9998418092727661), ('th', 0.9996693134307861), ('dead', 0.9996498823165894), ('sword', 0.9996404051780701)]

Most similar terms to 'denmark + queen' (Skip-gram): [('speaker', 0.9710312485694885), ('caus', 0.9701303243637085), ('rest', 0.9701042771339417), ('mou', 0.9700025916099548), ('hart', 0.9699146747589111)]

Most similar terms to 'denmark + queen' (GloVe): [('liming', 0.38851821422576904), ('chien', 0.38057413697242737), ('keelung', 0.37729042768478394), ('beheer', 0.36269044876098633), ('gampel', 0.35875895619392395)]

-----------------------------------------------------------------------------------


Most similar terms to 'scotland + soldier + spirit' (CBOW): [('spirit', 0.9999086260795593), ('soldier', 0.9998430609703064), ('natur', 0.9998262524604797), ('self', 0.9997974038124084), ('th', 0.9997969269752502)]

Most similar terms to 'scotland + soldier + spirit' (Skip-gram): [('spe
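
Vector arithmetic is only meaningful when every vector, and the nearest-neighbour search, comes from the same embedding space. The sketch below uses a made-up two-dimensional space and hypothetical helpers `combine` and `nearest` to show the pattern; with gensim, each model's own `most_similar(positive=[...], negative=[...])` performs the equivalent lookup inside that model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def combine(embedding, plus, minus=()):
    """Add and subtract vectors looked up in ONE embedding space."""
    dim = len(next(iter(embedding.values())))
    total = [0.0] * dim
    for word in plus:
        total = [t + x for t, x in zip(total, embedding[word])]
    for word in minus:
        total = [t - x for t, x in zip(total, embedding[word])]
    return total

def nearest(embedding, vector, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `vector`."""
    candidates = [w for w in embedding if w not in exclude]
    return max(candidates, key=lambda w: cosine(embedding[w], vector))

# Made-up toy space, chosen so that father - man + woman lands on mother
toy_space = {
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "father": [2.0, 0.0], "mother": [2.0, 1.0],
    "sword": [-1.0, 0.5],
}
query = combine(toy_space, plus=["father", "woman"], minus=["man"])
answer = nearest(toy_space, query, exclude={"father", "woman", "man"})
print(answer)
```

Excluding the query words from the candidates keeps the search from simply returning its own inputs.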

## Observation on how each Model Distributed the Idea Behind the Word Vectors
The CBOW and Skip-gram models, trained only on the three plays, keep their answers within Shakespeare’s vocabulary: for "denmark + queen" and "scotland + soldier + spirit", CBOW returns the query words themselves followed by other in-corpus terms. GloVe is built to represent broad, foundational relationships (e.g., gender and family structures) but has no basis for the play-specific settings or supernatural themes that the Shakespeare-trained models draw on. One caveat applies to this comparison: the combined vectors are built from the CBOW embeddings, so the Skip-gram and GloVe results come from querying a different vector space and should be read with caution. On this evidence CBOW and Skip-gram are the more precise tools for this literary text, while GloVe provides a more generalized semantic context.

## Overall Comment on Model Performance and Dataset Specification
The CBOW model captures general themes and high-frequency words in Shakespeare’s text but struggles with nuanced, infrequent terms: its nearest neighbours for 'cauldron' are generic words such as 'might' and 'grow'. Skip-gram handles rare words better: its neighbours for 'cauldron' include 'snake', 'salt', and 'bubbl', which echo the witches' scene in Macbeth. This makes it better suited to the subtleties of literary language. GloVe, trained on general English, lacks specificity for the Shakespearean context.

To improve Shakespearean word embeddings, a specialized dataset could include:

- The complete works of Shakespeare,
- Early modern English literature by contemporaries,
- Annotated texts for figurative language,
- Historical documents from the Elizabethan era,
- Parallel modern English translations.
  
This enriched dataset would enable a model to better understand Shakespearean vocabulary, themes, and cultural references, enhancing performance in literary analysis.