In [1]:
import pandas as pd
import numpy as np
from time import time  # To time our operations
from collections import defaultdict  # For word frequency

import spacy  # For preprocessing

import logging  # Setting up the loggings to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

In [2]:
df = pd.read_csv("../Data/simpsons_dataset.csv")

In [3]:
df

Unnamed: 0,raw_character_text,spoken_words
0,Miss Hoover,"No, actually, it was a little of both. Sometim..."
1,Lisa Simpson,Where's Mr. Bergstrom?
2,Miss Hoover,I don't know. Although I'd sure like to talk t...
3,Lisa Simpson,That life is worth living.
4,Edna Krabappel-Flanders,The polls will be open from now until the end ...
...,...,...
158309,Miss Hoover,I'm back.
158310,Miss Hoover,"You see, class, my Lyme disease turned out to ..."
158311,Miss Hoover,Psy-cho-so-ma-tic.
158312,Ralph Wiggum,Does that mean you were crazy?


### Checking missing values

In [4]:
df.isnull().sum()

raw_character_text    17814
spoken_words          26459
dtype: int64

In [5]:
df = df.dropna().reset_index(drop=True)
df.isnull().sum()

raw_character_text    0
spoken_words          0
dtype: int64

In [6]:
df

Unnamed: 0,raw_character_text,spoken_words
0,Miss Hoover,"No, actually, it was a little of both. Sometim..."
1,Lisa Simpson,Where's Mr. Bergstrom?
2,Miss Hoover,I don't know. Although I'd sure like to talk t...
3,Lisa Simpson,That life is worth living.
4,Edna Krabappel-Flanders,The polls will be open from now until the end ...
...,...,...
131848,Miss Hoover,I'm back.
131849,Miss Hoover,"You see, class, my Lyme disease turned out to ..."
131850,Miss Hoover,Psy-cho-so-ma-tic.
131851,Ralph Wiggum,Does that mean you were crazy?


### Cleaning:
We are lemmatizing and removing the stopwords and non-alphabetic characters for each line of dialogue.

In [7]:
import spacy

# Load the smaller English model, disabling Named Entity Recognition and parsing
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

def cleaning(doc):
    # Lemmatizes and removes stopwords
    # doc needs to be a spacy Doc object
    txt = [token.lemma_ for token in doc if not token.is_stop]
    # If the cleaned text has more than two words, return it
    if len(txt) > 2:
        return ' '.join(txt)

### Remove non alphabetic characters

In [8]:
brief_cleaning = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in df['spoken_words'])

### Taking advantage of spaCy .pipe() attribute to speed-up the cleaning process:
This line of code processes text data using an NLP (Natural Language Processing) pipeline, with the following key parts:

1. nlp.pipe(brief_cleaning, batch_size=5000, n_process=-1):
nlp.pipe(): This is a method from an NLP library like SpaCy that allows you to process a large set of documents in a memory-efficient and fast way by processing them in batches.

brief_cleaning: This is the input data, which is likely a list of text documents that need to be processed. Each document in this list is passed through the nlp pipeline, which might involve tokenization, part-of-speech tagging, named entity recognition, and other NLP tasks depending on what the pipeline is set up to do.

batch_size=5000: The documents in brief_cleaning are processed in batches of 5000 at a time, which can improve efficiency and reduce memory usage when processing large datasets.

n_process=-1: This argument controls parallel processing. Setting n_process=-1 means that the code will use all available CPU cores to process the documents in parallel, speeding up the task.

2. cleaning(doc):
For each document (doc) returned by nlp.pipe(), a custom function named cleaning() is applied. This function likely performs some form of text preprocessing or cleaning, such as removing stop words, punctuation, or transforming the text in some other way (e.g., lowercasing or lemmatization).
3. txt = [ ... ]:
This list comprehension runs the cleaning() function on each document in the pipeline. The results (cleaned documents) are stored in the list txt.


In [9]:
import time
import re

t = time.time()

txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=5000, n_process=-1)]

print('Time to clean up everything: {} mins'.format(round((time.time() - t) / 60, 2)))

Time to clean up everything: 1.19 mins


### Put the results in a DataFrame to remove missing values and duplicates:



In [10]:
df_clean = pd.DataFrame({'clean': txt})
df_clean = df_clean.dropna().drop_duplicates()
df_clean.shape

(85955, 1)

In [11]:
from gensim.models.phrases import Phrases, Phraser

In [12]:
df_clean

Unnamed: 0,clean
0,actually little disease magazine news show nat...
2,know sure like talk touch lesson plan teach
3,life worth live
4,poll open end recess case decide thought final...
7,victory party slide
...,...
131829,oh mom wonderful find favorite dish help
131835,dye shoe pink
131846,mr bergstrom request pleasure company mr bergs...
131849,class lyme disease turn


In [13]:
sent = [row.split() for row in df_clean['clean']]

In [30]:
df_clean

Unnamed: 0,clean
0,actually little disease magazine news show nat...
2,know sure like talk touch lesson plan teach
3,life worth live
4,poll open end recess case decide thought final...
7,victory party slide
...,...
131829,oh mom wonderful find favorite dish help
131835,dye shoe pink
131846,mr bergstrom request pleasure company mr bergs...
131849,class lyme disease turn


### This line of code is using the **`Phrases`** class from the **Gensim** library, which is commonly used to detect multi-word expressions (also known as "collocations" or "phrases") in a text corpus. Here's a breakdown of what this specific line does:

### 1. **`Phrases(sent)`**:
   - **`Phrases`** is a Gensim function that identifies common phrases (e.g., "New York", "machine learning") in a list of tokenized sentences. 
   - **`sent`** is the input to the `Phrases` class. It is a list of tokenized sentences, where each sentence is a list of words (in this case, it was created from the previous code snippet: `sent = [row.split() for row in df_clean['clean']]`).
   
     For example:
     ```python
     sent = [['this', 'is', 'a', 'test'], ['another', 'example', 'here']]
     ```

   Gensim will look through these sentences to identify common multi-word expressions (phrases) like "New York" or "machine learning" that often occur together.

### 2. **`min_count=30`**:
   - The `min_count` parameter sets a threshold for phrase detection. In this case, only phrases that appear **at least 30 times** in the corpus (the list of sentences `sent`) will be considered as valid phrases.
   
     For example, if "machine learning" appears together 30 or more times in the dataset, it will be identified as a phrase. If it appears fewer than 30 times, it won't be considered a phrase.

### 3. **`progress_per=10000`**:
   - This parameter controls how often the progress of the phrase detection process is reported (useful for large datasets). It means that Gensim will output progress after every 10,000 sentences it processes, providing updates on how far along it is in detecting phrases.

### 4. **`phrases = ...`**:
   - The result of applying the `Phrases` class is stored in the variable `phrases`. This object contains information about the detected phrases and can be used to transform the tokenized sentences so that multi-word expressions are combined into single tokens.

### Example of Detected Phrases:
After running this code, if the phrases "machine learning" or "New York" are detected in the dataset, Gensim might combine those into single tokens, such as `machine_learning` and `New_York`. 

For example, the input sentence:
```python
['I', 'am', 'learning', 'machine', 'learning']
```
could be transformed into:
```python
['I', 'am', 'learning', 'machine_learning']
```

### Summary:
The line of code `phrases = Phrases(sent, min_count=30, progress_per=10000)` creates a model to detect multi-word expressions that occur frequently in the tokenized text data. Only phrases that appear at least 30 times will be considered, and the progress will be reported after every 10,000 sentences processed. The result, stored in `phrases`, is an object that can later be used to transform sentences by combining detected phrases into single tokens.

In [31]:
phrases = Phrases(sent, min_count=30, progress_per=10000)

INFO - 09:57:07: collecting all words and their counts
INFO - 09:57:07: PROGRESS: at sentence #0, processed 0 words and 0 word types
INFO - 09:57:07: collected 17 token types (unigram + bigrams) from a corpus of 11 words and 4 sentences
INFO - 09:57:07: merged Phrases<17 vocab, min_count=30, threshold=10.0, max_vocab_size=40000000>
INFO - 09:57:07: Phrases lifecycle event {'msg': 'built Phrases<17 vocab, min_count=30, threshold=10.0, max_vocab_size=40000000> in 0.01s', 'datetime': '2024-10-19T09:57:07.505655', 'gensim': '4.3.0', 'python': '3.11.5 (main, Sep 11 2023, 08:31:25) [Clang 14.0.6 ]', 'platform': 'macOS-15.0.1-arm64-arm-64bit', 'event': 'created'}


### The line of code sentences = bigram[sent] applies a bigram model to a list of tokenized sentences and produces new sentences where common bigrams (frequent two-word phrases) are combined into single tokens.

Here's what it does in detail:
1. bigram:
bigram is a model that identifies common two-word phrases (bigrams) in text.
This model is usually created from a Gensim Phrases object. For example, after detecting frequent bigrams using Phrases(sent, min_count=30), you can convert it to a more efficient form with:
python
Copy code
bigram = Phraser(phrases)
2. bigram[sent]:
sent is a list of tokenized sentences, where each sentence is a list of words (tokens). For example:
python
Copy code
sent = [['this', 'is', 'a', 'test'], ['machine', 'learning', 'is', 'fun']]
When you apply bigram[sent], the bigram model is applied to every sentence in sent. It looks for common two-word phrases (bigrams) like "machine learning" and combines them into a single token like "machine_learning".
As a result, the output sentences will contain the modified sentences where the detected bigrams are replaced by single tokens.
Example:
Suppose you have a sentence like:

python
Copy code
sent = [['machine', 'learning', 'is', 'fun'], ['this', 'is', 'a', 'test']]
If the bigram model detects that "machine learning" is a common phrase, applying bigram[sent] will transform the sentence as follows:

python
Copy code
sentences = [['machine_learning', 'is', 'fun'], ['this', 'is', 'a', 'test']]
3. sentences:
The variable sentences now holds the list of sentences where two-word phrases (bigrams) have been combined into single tokens.


In [32]:
from gensim.models.phrases import Phraser

bigram = Phraser(phrases)

sentences = bigram[sent]

INFO - 09:57:09: exporting phrases from Phrases<17 vocab, min_count=30, threshold=10.0, max_vocab_size=40000000>
INFO - 09:57:09: FrozenPhrases lifecycle event {'msg': 'exported FrozenPhrases<0 phrases, min_count=30, threshold=10.0> from Phrases<17 vocab, min_count=30, threshold=10.0, max_vocab_size=40000000> in 0.00s', 'datetime': '2024-10-19T09:57:09.940939', 'gensim': '4.3.0', 'python': '3.11.5 (main, Sep 11 2023, 08:31:25) [Clang 14.0.6 ]', 'platform': 'macOS-15.0.1-arm64-arm-64bit', 'event': 'created'}


### Most Frequent Words:¶
Mainly a sanity check of the effectiveness of the lemmatization, removal of stopwords, and addition of bigrams.

This code snippet calculates the frequency of each word (or token) in the list of tokenized sentences and then determines how many unique words (tokens) exist in the dataset. Here’s a breakdown of how it works:

1. word_freq = defaultdict(int):
This initializes a defaultdict (from Python’s collections module) to store the frequency of each word.
A defaultdict works like a regular Python dictionary, but it provides a default value for keys that don’t exist. In this case, int initializes any missing key with a default value of 0.
2. for sent in sentences::
This loop iterates over each sentence (which is a list of tokens) in sentences.
sentences is expected to be a list of tokenized sentences (after applying the bigram model in the previous steps).
3. for i in sent::
This inner loop iterates over each word/token i in the sentence sent.
4. word_freq[i] += 1:
For each token i, the code increments its frequency count by 1 in the word_freq dictionary.
If the token i has not been seen before, the defaultdict initializes it with a value of 0, and then the count is incremented to 1. If the token has already been encountered, its count is simply incremented by 1.
5. len(word_freq):
This returns the total number of unique words (tokens) in the dataset by calculating the length of the word_freq dictionary, which holds each unique word as a key.
Example:
If sentences contains the following tokenized sentences:

python
Copy code
sentences = [['machine_learning', 'is', 'fun'], ['machine_learning', 'is', 'cool']]
The word_freq dictionary would look like this:

python
Copy code
word_freq = {
    'machine_learning': 2,
    'is': 2,
    'fun': 1,
    'cool': 1
}
Calling len(word_freq) would return 4 because there are 4 unique words.



In [17]:
word_freq = defaultdict(int)
for sent in sentences:
    for i in sent:
        word_freq[i] += 1
len(word_freq)

29694

In [18]:
sorted(word_freq, key=word_freq.get, reverse=True)[:10]

['oh', 'like', 'know', 'get', 'hey', 'think', 'come', 'right', 'look', 'want']

In [19]:
word_freq

defaultdict(int,
            {'actually': 422,
             'little': 2098,
             'disease': 45,
             'magazine': 122,
             'news': 249,
             'show': 214,
             'natural': 77,
             'think': 3592,
             'know': 4818,
             'sure': 1198,
             'like': 5601,
             'talk': 931,
             'touch': 191,
             'lesson': 162,
             'plan': 302,
             'teach': 324,
             'life': 1224,
             'worth': 141,
             'live': 760,
             'poll': 20,
             'open': 418,
             'end': 461,
             'recess': 10,
             'case': 215,
             'decide': 134,
             'thought': 119,
             'final': 105,
             'statement': 21,
             'martin': 121,
             'victory': 30,
             'party': 421,
             'slide': 49,
             'mr': 797,
             'bergstrom': 17,
             'hey': 3620,
             'move': 165,
     

In [20]:
import multiprocessing

from gensim.models import Word2Vec

In [21]:
cores = multiprocessing.cpu_count() # Count the number of cores in a computer

### Parameter Explanation:
min_count=20: This parameter specifies the minimum number of occurrences a word must have to be included in the model's vocabulary. Words that appear fewer than 20 times will be ignored. This helps in reducing noise and focusing on more significant words.

window=2: This parameter sets the maximum distance between the current and predicted word within a sentence. A window size of 2 means that the model will consider the two words before and two words after the target word during training.

size=300: This parameter defines the dimensionality of the word vectors. A size of 300 means each word will be represented by a vector with 300 dimensions, allowing for rich semantic representation.

sample=6e-5: This parameter is used for down-sampling frequent words. Words that occur very frequently (like "the", "and", etc.) will be randomly down-sampled based on this threshold to improve training efficiency and reduce overfitting.

alpha=0.03: This parameter sets the initial learning rate for training the model. A value of 0.03 is common for Word2Vec models.

min_alpha=0.0007: This parameter sets the minimum learning rate. The learning rate will decrease linearly from the initial alpha to min_alpha during training. This helps the model to converge effectively.

negative=20: This parameter specifies the number of negative samples to be drawn for each positive sample during training. A value of 20 indicates that for every positive word association, 20 negative samples will be considered. This helps in improving the model's performance by contrasting positive and negative examples.

workers=cores-1: This parameter specifies the number of worker threads to train the model. Using cores-1 utilizes all but one of the available CPU cores, allowing for parallel processing and speeding up the training process.

In [22]:
from gensim.models import Word2Vec

w2v_model = Word2Vec(
    min_count=20,
    window=2,
    vector_size=300,  # Changed 'size' to 'vector_size'
    sample=6e-5, 
    alpha=0.03, 
    min_alpha=0.0007, 
    negative=20,
    workers=cores-1
)

INFO - 09:50:15: Word2Vec lifecycle event {'params': 'Word2Vec<vocab=0, vector_size=300, alpha=0.03>', 'datetime': '2024-10-19T09:50:15.139142', 'gensim': '4.3.0', 'python': '3.11.5 (main, Sep 11 2023, 08:31:25) [Clang 14.0.6 ]', 'platform': 'macOS-15.0.1-arm64-arm-64bit', 'event': 'created'}


In [23]:
t = time.time()

w2v_model.build_vocab(sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time.time() - t) / 60, 2)))

INFO - 09:50:17: collecting all words and their counts
INFO - 09:50:17: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 09:50:17: PROGRESS: at sentence #10000, processed 61697 words, keeping 9516 word types
INFO - 09:50:17: PROGRESS: at sentence #20000, processed 127312 words, keeping 14382 word types
INFO - 09:50:17: PROGRESS: at sentence #30000, processed 187772 words, keeping 17441 word types
INFO - 09:50:17: PROGRESS: at sentence #40000, processed 243265 words, keeping 20120 word types
INFO - 09:50:17: PROGRESS: at sentence #50000, processed 303120 words, keeping 22550 word types
INFO - 09:50:18: PROGRESS: at sentence #60000, processed 363858 words, keeping 24819 word types
INFO - 09:50:18: PROGRESS: at sentence #70000, processed 425311 words, keeping 26987 word types
INFO - 09:50:18: PROGRESS: at sentence #80000, processed 485433 words, keeping 28822 word types
INFO - 09:50:18: collected 29694 word types from a corpus of 523538 raw words and 85955 sentence

Time to build vocab: 0.0 mins


In [24]:
t = time.time()

w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time.time() - t) / 60, 2)))

INFO - 09:50:19: Word2Vec lifecycle event {'msg': 'training model with 7 workers on 3324 vocabulary and 300 features, using sg=0 hs=0 sample=6e-05 negative=20 window=2 shrink_windows=True', 'datetime': '2024-10-19T09:50:19.480978', 'gensim': '4.3.0', 'python': '3.11.5 (main, Sep 11 2023, 08:31:25) [Clang 14.0.6 ]', 'platform': 'macOS-15.0.1-arm64-arm-64bit', 'event': 'train'}
INFO - 09:50:20: EPOCH 0: training on 523538 raw words (199114 effective words) took 0.5s, 388143 effective words/s
INFO - 09:50:20: EPOCH 1: training on 523538 raw words (199798 effective words) took 0.3s, 632893 effective words/s
INFO - 09:50:20: EPOCH 2: training on 523538 raw words (199347 effective words) took 0.4s, 521019 effective words/s
INFO - 09:50:21: EPOCH 3: training on 523538 raw words (199277 effective words) took 0.4s, 455206 effective words/s
INFO - 09:50:21: EPOCH 4: training on 523538 raw words (199404 effective words) took 0.3s, 614231 effective words/s
INFO - 09:50:21: EPOCH 5: training on 523

Time to train the model: 0.21 mins


In [25]:
w2v_model.init_sims(replace=True)

  w2v_model.init_sims(replace=True)


In [26]:
w2v_model.wv.most_similar(positive=["homer"])

[('rude', 0.6942655444145203),
 ('marge', 0.6873096227645874),
 ('sweetheart', 0.6872086524963379),
 ('creepy', 0.680105984210968),
 ('gee', 0.671907365322113),
 ('jealous', 0.6673969626426697),
 ('snuggle', 0.6642676591873169),
 ('depressed', 0.6641891002655029),
 ('terrific', 0.6641192436218262),
 ('hammock', 0.6567658185958862)]

In [27]:
w2v_model.wv.most_similar(positive=["homer_simpson"])

[('select', 0.7401379346847534),
 ('congratulation', 0.7341328859329224),
 ('governor', 0.710042417049408),
 ('easily', 0.7082727551460266),
 ('montgomery_burn', 0.7031835317611694),
 ('recent', 0.6977173686027527),
 ('defeat', 0.696308970451355),
 ('council', 0.6908103227615356),
 ('united_states', 0.6816267967224121),
 ('threat', 0.6799395084381104)]

In [28]:
w2v_model.wv.most_similar(positive=["marge"])

[('humiliate', 0.7087414860725403),
 ('snuggle', 0.7066437005996704),
 ('becky', 0.698948085308075),
 ('rude', 0.6972123384475708),
 ('homer', 0.6873096227645874),
 ('affair', 0.6860363483428955),
 ('brunch', 0.6815887689590454),
 ('grownup', 0.6801910400390625),
 ('badly', 0.6784240007400513),
 ('depressed', 0.6776086091995239)]

In [30]:
w2v_model.wv.most_similar(negative=["bart"])

[('sauce', -0.009900564327836037),
 ('silver', -0.010088246315717697),
 ('east', -0.021327510476112366),
 ('liberty', -0.025091029703617096),
 ('duff', -0.03209099918603897),
 ('kent_brockman', -0.034821219742298126),
 ('thee', -0.039052318781614304),
 ('north', -0.04283853620290756),
 ('john', -0.051570214331150055),
 ('low', -0.05474331974983215)]

In [38]:
w2v_model.wv.similarity('maggie', 'baby')

0.68530107

In [39]:
w2v_model.wv.similarity('bart', 'nelson')

0.61052835

In [40]:
w2v_model.wv.doesnt_match(['jimbo', 'milhouse', 'kearney'])



'jimbo'

In [41]:
w2v_model.wv.doesnt_match(["nelson", "bart", "milhouse"])

'nelson'

In [42]:
w2v_model.wv.doesnt_match(['homer', 'patty', 'selma'])

'homer'

In [43]:
w2v_model.wv.most_similar(positive=["woman", "homer"], negative=["marge"], topn=3)

[('see', 0.5720261335372925),
 ('admire', 0.5511806011199951),
 ('man', 0.5392982363700867)]

In [44]:
w2v_model.wv.most_similar(positive=["woman", "bart"], negative=["man"], topn=3)

[('lisa', 0.7240086793899536),
 ('mom', 0.6408773064613342),
 ('surprised', 0.6376112699508667)]

In [45]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
 
import seaborn as sns
sns.set_style("darkgrid")

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

In [32]:
from sklearn.decomposition import PCA
import numpy as np

def tsnescatterplot(model, word, list_names):
    # Create an array to hold the word vectors
    arrays = np.empty((0, model.vector_size))
    
    # Get the vector for the specified word
    if word in model.wv.key_to_index:
        wrd_vector = model.wv[word].reshape(1, -1)
        arrays = np.append(arrays, wrd_vector, axis=0)
    
    # Get vectors for the list of names
    for wrd in list_names:
        if wrd in model.wv.key_to_index:
            wrd_vector = model.wv[wrd].reshape(1, -1)
            arrays = np.append(arrays, wrd_vector, axis=0)
    
    # Ensure arrays has sufficient dimensions for PCA
    if arrays.shape[0] > 1:  # Check if we have more than one vector
        # Reduces the dimensionality using PCA
        reduc = PCA(n_components=min(2, arrays.shape[0]-1)).fit_transform(arrays)  # Use 2 or less components
    else:
        raise ValueError("Not enough data points to perform PCA.")
    
    # t-SNE implementation can follow here...
    # e.g., using the reduc array for t-SNE visualizations

# Call the function with the Word2Vec model and your words
tsnescatterplot(w2v_model, 'homer', ['dog', 'bird', 'ah', 'maude', 'bob', 'mel', 'apu', 'duff'])

In [33]:
tsnescatterplot(w2v_model, 'homer', ['dog', 'bird', 'ah', 'maude', 'bob', 'mel', 'apu', 'duff'])

In [28]:
sent

['psy', 'cho', 'ma', 'tic']