
# Workshop on Extracting Embeddings for Pearl Hacks 2024
In this tutorial, we download the IMDb movie review dataset and train word embeddings on its reviews. The training reviews are grouped by sentiment into 'pos' and 'neg' folders, which we read in turn.

In [11]:
import os
import tarfile
import urllib.request

import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# word_tokenize relies on the 'punkt' tokenizer data
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Now, we download the IMDb dataset and extract it.

In [None]:
# Download the IMDb dataset
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset_file = "aclImdb_v1.tar.gz"
if not os.path.exists(dataset_file):
    urllib.request.urlretrieve(url, dataset_file)

# Extract the dataset
with tarfile.open(dataset_file, "r:gz") as tar:
    tar.extractall()
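
Re-running the cell above extracts the whole archive again each time. The following is a minimal sketch of one way to skip that work, assuming a hypothetical helper named `extract_once` (it is not part of the original notebook). It also passes `filter="data"` when the running Python version supports extraction filters, which is an added precaution rather than something the workshop requires.

```python
import os
import tarfile


def extract_once(archive_path, target_dir, marker_dir):
    """Extract archive_path into target_dir unless target_dir/marker_dir exists.

    Returns True if the archive was extracted, False if it was skipped.
    """
    if os.path.isdir(os.path.join(target_dir, marker_dir)):
        return False  # already extracted, nothing to do
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Extraction filters exist in this Python version; use the safer one.
            tar.extractall(target_dir, filter="data")
        else:
            tar.extractall(target_dir)
    return True


# Hypothetical usage with the names from this notebook:
# extract_once("aclImdb_v1.tar.gz", ".", "aclImdb")
```

The marker directory is simply the top-level folder the archive is expected to create, so a partially extracted folder would also count as "done" under this sketch.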

Next comes one of the most important steps when building embeddings: preprocessing the data. The model is trained on the preprocessed text, so this step directly shapes the embeddings we get out.

In [12]:
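def preprocess_imdb_data(dataset_dir):
    """Read every review under dataset_dir/pos and dataset_dir/neg as a list of lowercase tokens."""
    data = []
    for label in ['pos', 'neg']:
        label_dir = os.path.join(dataset_dir, label)
        for filename in os.listdir(label_dir):
            with open(os.path.join(label_dir, filename), 'r', encoding='utf-8') as file:
                text = file.read()
            # Lowercase first so that 'Good' and 'good' share one vocabulary entry
            tokens = word_tokenize(text.lower())
            data.append(tokens)
    return data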

dataset_dir = "aclImdb/train"
preprocessed_data = preprocess_imdb_data(dataset_dir)
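
The function above reads the 'pos' and 'neg' folders but does not keep track of which folder each review came from. If the sentiment labels are needed later, a variant along the following lines could keep them. This is only a sketch: `load_labeled_reviews` and its `tokenize` parameter are hypothetical names, and `str.split` is used as a stand-in tokenizer so the example runs without NLTK data. Passing `word_tokenize` instead would match the notebook's tokenization.

```python
import os


def load_labeled_reviews(dataset_dir, tokenize=str.split, labels=("pos", "neg")):
    """Return a list of (tokens, label) pairs, one pair per review file."""
    labeled = []
    for label in labels:
        label_dir = os.path.join(dataset_dir, label)
        # sorted() makes the order of reviews reproducible across runs
        for filename in sorted(os.listdir(label_dir)):
            path = os.path.join(label_dir, filename)
            with open(path, "r", encoding="utf-8") as file:
                tokens = tokenize(file.read().lower())
            labeled.append((tokens, label))
    return labeled


# Hypothetical usage with the notebook's tokenizer:
# labeled_data = load_labeled_reviews("aclImdb/train", tokenize=word_tokenize)
# sentences = [tokens for tokens, _ in labeled_data]
```

Word2Vec itself only needs the token lists, so the labels would be carried alongside for later analysis rather than passed to the model.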

Before we train a Word2Vec model on the preprocessed data, consider the parameters we pass to it: sentences, vector_size, window, min_count, and workers.

* sentences: The input data for training the Word2Vec model. Here, preprocessed_data is a list in which each element is the list of tokens (words) from one movie review.
* vector_size: The dimensionality of the word vectors (embeddings) the model produces. With vector_size=100, each word is represented by a dense vector of 100 numbers.
* window: The maximum distance between the current word and the context words used to learn its embedding. With window=5, the model considers up to five words before and up to five words after the current word.
* min_count: The minimum number of times a word must appear to be included in the vocabulary. Words that appear fewer times are ignored during training. Setting min_count=1 keeps every word in the dataset, including rare words and typos.
* workers: The number of threads used for training. With workers=4, four worker threads train the model in parallel, which can speed up training on multi-core machines.
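
To make window and min_count more concrete, here is a small plain-Python illustration. The helper names `context_words` and `filter_by_min_count` are made up for this example and are not gensim functions. The library's actual training loop is more involved, so treat this as a sketch of the idea only.

```python
from collections import Counter


def context_words(tokens, index, window):
    """Return the tokens within `window` positions of tokens[index]."""
    start = max(0, index - window)
    left = tokens[start:index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def filter_by_min_count(sentences, min_count):
    """Drop tokens that occur fewer than `min_count` times in the whole corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [[t for t in sentence if counts[t] >= min_count] for sentence in sentences]


tokens = "this movie was surprisingly good and very funny".split()
print(context_words(tokens, tokens.index("good"), window=2))
# ['was', 'surprisingly', 'and', 'very']

print(filter_by_min_count([["good", "movie"], ["good", "plot"]], min_count=2))
# [['good'], ['good']]
```

With min_count=2, 'movie' and 'plot' are dropped because each appears only once, while 'good' appears twice and stays.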


In [13]:
# Train Word2Vec model
model = Word2Vec(sentences=preprocessed_data, vector_size=100, window=5, min_count=1, workers=4)

We take the trained word vectors from the model and store them in word_embeddings.
Here, we look up the vector for the word "good". You can try any other word from the dataset!
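
When trying other words, keep in mind that looking up a word that is not in the vocabulary raises a KeyError, and that the reviews were lowercased during preprocessing. Below is a minimal sketch of a guarded lookup. The helper name `lookup` is hypothetical, and a plain dict with made-up numbers stands in for the trained vectors so the example runs on its own. The same pattern should work with the model's word vectors, since they support `in` and `[]`.

```python
def lookup(vectors, word):
    """Return the vector for `word`, or None if it is not in the vocabulary."""
    key = word.lower()  # match the lowercasing applied during preprocessing
    if key in vectors:
        return vectors[key]
    return None


# Toy stand-in for trained embeddings (illustrative values only)
toy_vectors = {"good": [0.1, 0.2], "bad": [-0.1, -0.3]}

print(lookup(toy_vectors, "Good"))
print(lookup(toy_vectors, "notaword"))
```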

In [14]:
# Retrieve the trained word vectors from the model
word_embeddings = model.wv

# Example of how to use word embeddings
word_vector = word_embeddings['good']
print("Word embedding for 'good':", word_vector)

Word embedding for 'good': [ 2.309283    3.0937796  -0.3786237   0.23520526  1.6532553   0.10221997
 -0.13742884  0.9272892   0.99913126  1.0752318   1.7283039   1.4420184
 -1.510665   -2.4869087   1.9908297  -0.57715434  2.574168   -0.7567778
  2.9907923  -1.138794   -0.8542557   3.1950972   0.1723586  -1.0455253
 -0.60189193 -2.4403648  -3.1632488   0.4443021   3.2067566  -1.1371427
 -1.2953486  -0.7832388   3.6576543   4.6164403  -0.03829485  0.905741
  3.158804    0.33012706 -2.7263613   0.65177727 -1.6560317  -1.5961897
 -2.432053    0.04080295 -1.3640958  -1.4595801  -2.3624341   0.6443799
  1.1676904   1.9009656  -2.0662339  -1.4817203  -0.65511596  4.0773234
  1.3304667  -1.0425212   0.6225425   0.27257732  2.5247307   0.12556818
 -0.28358814 -1.6507988   1.524475    0.7140454  -0.45654103  1.97686
 -0.30895022  0.9721136   0.04969786 -2.393618   -0.45060483  1.6070594
 -1.6238753   1.9371495  -0.7520865  -1.3773664  -2.2521145  -2.135156
 -1.9033598  -2.4243119  -0.68769217  0