### Importing the Stanford IMDb dataset

In [22]:
from datasets import load_dataset
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

#### Below is the preprocessing code. Bag of Words and TF-IDF require each document as a single string, hence step 7 below, which joins the tokens back together. Encoding, embedding, and Word2Vec require each word as a separate token, so those parts are run separately later

In [23]:
# --- NLTK Setup ---
print("Downloading NLTK data...")
nltk.download('punkt', quiet=True)       # For tokenization
nltk.download('stopwords', quiet=True)   # For stop words
nltk.download('wordnet', quiet=True)     # For lemmatization
nltk.download('omw-1.4', quiet=True)     # For lemmatization
nltk.download('punkt_tab')
print("NLTK data downloaded.")

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess_text(example):
    """
    Applies a full preprocessing pipeline to a single text example.
    """
    # 'text' is the column name in the imdb dataset
    text = example['text']
    
    # 1. Remove HTML tags and URLs
    text = re.sub(r'<[^>]+>', ' ', text) # This regex removes HTML tags like <br />
    text = re.sub(r'http\S+', '', text) # This regex removes URLs (the pattern also covers https)
    
    text = text.lower() # 2. Convert to lowercase
    text = re.sub(r'[^a-z\s]', '', text) # 3. Remove punctuation, digits, and special characters (keep only letters and whitespace)
    
    tokens = word_tokenize(text) # 4. Tokenize: split the text into a list of words
    tokens = [word for word in tokens if word not in stop_words] # 5. Remove stop words
    
    # --- 6. Apply Stemming OR Lemmatization
    # You generally do one or the other, not both.
    # Lemmatization is usually preferred as it produces real words.
    
    # Option A: Stemming
    # processed_tokens = [stemmer.stem(word) for word in tokens]
    
    # Option B: Lemmatization (default)
    processed_tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    # 7. Join tokens back into a single string
    text = " ".join(processed_tokens)
    
    # Update the example object
    example['text'] = text
    return example

# Load the dataset
print("Loading dataset...")
# Load the 'train' split. The slice [:50000] exceeds the split size, so all 25,000
# training examples are loaded; use e.g. split='train[:1000]' for a quick demo
dataset = load_dataset("stanfordnlp/imdb", split='train[:50000]')
    
# 2. Show an example *before* processing
print("\n--- BEFORE PREPROCESSING ---")
print(dataset[0]['text'])
    
# 3. Apply the preprocessing function using .map()
# .map() applies the function to every example in the dataset
print("\nApplying preprocessing to dataset...")
processed_dataset = dataset.map(preprocess_text)
    
# 4. Show the same example *after* processing
print("\n--- AFTER PREPROCESSING ---")
print(processed_dataset[0]['text'])
    
print(f"\nSuccessfully processed {len(processed_dataset)} examples.")

Downloading NLTK data...
NLTK data downloaded.
Loading dataset...


[nltk_data] Downloading package punkt_tab to /Users/vihaa/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!



--- BEFORE PREPROCESSING ---
I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scene

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]


--- AFTER PREPROCESSING ---
rented curiousyellow video store controversy surrounded first released also heard first seized u custom ever tried enter country therefore fan film considered controversial really see plot centered around young swedish drama student named lena want learn everything life particular want focus attention making sort documentary average swede thought certain political issue vietnam war race issue united state asking politician ordinary denizen stockholm opinion politics sex drama teacher classmate married men kill curiousyellow year ago considered pornographic really sex nudity scene far even shot like cheaply made porno countryman mind find shocking reality sex nudity major staple swedish cinema even ingmar bergman arguably answer good old boy john ford sex scene film commend filmmaker fact sex shown film shown artistic purpose rather shock people make money shown pornographic theater america curiousyellow good film anyone wanting study meat potato pun intende

#### The output above shows the text before and after preprocessing. We converted all words to lowercase so that the model does not treat the same word in different cases as different words, and removed stop words (such as "I", "the", and "a"), which carry little information for this task. We also removed URLs, HTML tags, and special characters, and reduced each word to its lemma so that inflected forms of the same word share one vocabulary entry, which keeps the vocabulary compact
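
The effect of each cleaning step can be seen on a single sentence. The following is a minimal sketch that mirrors the pipeline above using only the standard library: `toy_preprocess` and `TOY_STOP_WORDS` are hypothetical names, whitespace splitting stands in for `word_tokenize`, and lemmatization is left out. It also shows why tokens such as `curiousyellow` appear in the output above: punctuation is deleted rather than replaced by a space, so hyphenated words are merged.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the notebook
# uses NLTK's much larger English list.
TOY_STOP_WORDS = {"i", "am", "the", "a", "it", "was", "in", "of", "and"}

def toy_preprocess(text):
    """Mimic the cleaning steps above without NLTK (no lemmatization)."""
    text = re.sub(r"<[^>]+>", " ", text)   # replace HTML tags such as <br /> with a space
    text = re.sub(r"http\S+", "", text)    # drop URLs
    text = text.lower()                    # case-fold
    text = re.sub(r"[^a-z\s]", "", text)   # keep only letters and whitespace
    tokens = text.split()                  # whitespace split (stand-in for word_tokenize)
    return [t for t in tokens if t not in TOY_STOP_WORDS]

sample = "I rented I AM CURIOUS-YELLOW.<br /><br />The U.S. release was in 1967!"
print(toy_preprocess(sample))
# ['rented', 'curiousyellow', 'us', 'release']
```

Here "U.S." becomes `us` once the periods are stripped; in the notebook output the corresponding token is `u`, presumably because the lemmatizer then treats `us` as a plural.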

#### The code below applies two text vectorization methods: Bag of Words and TF-IDF

In [24]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

documents = processed_dataset['text']
labels = processed_dataset['label']

# --- 4. Apply Bag of Words (BoW) ---
print("\n--- Applying Bag of Words (CountVectorizer) ---")
    
# Initialize the vectorizer
bow_vectorizer = CountVectorizer(max_features=5000) # max_features=5000 means it will only keep the 5000 most common words
bow_matrix = bow_vectorizer.fit_transform(documents) # Fit the vectorizer to the data and transform the data into a matrix
print(f"BoW Matrix Shape: {bow_matrix.shape}") 
# bow_matrix is a sparse matrix of shape (num_documents, max_features): each row is a document,
# each column is a word, and each cell holds the number of times that word occurs in the document
    
feature_names_bow = bow_vectorizer.get_feature_names_out() # Show some of the vocabulary (features)
print(f"First 20 BoW features: {feature_names_bow[:20]}")
    
print("\nBoW representation of the first document (word_index, count):") # Show the BoW representation of the first document
print(bow_matrix[0])
    
# --- 5. Apply TF-IDF ---
print("\n--- Applying TF-IDF (TfidfVectorizer) ---")
    
# Initialize the vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
tfidf_matrix = tfidf_vectorizer.fit_transform(documents) # Fit and transform
print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}") # tfidf_matrix is also a sparse matrix of the same shape
    
# The features (vocabulary) will be the same as BoW if max_features is the same
feature_names_tfidf = tfidf_vectorizer.get_feature_names_out()
print(f"First 20 TF-IDF features: {feature_names_tfidf[:20]}")
    
# Show the TF-IDF representation of the first document
print("\nTF-IDF representation of the first document (word_index, tf-idf_score):")
print(tfidf_matrix[0])
    
print("\nVectorization complete. You can now use 'bow_matrix' or 'tfidf_matrix' to train a model.")


--- Applying Bag of Words (CountVectorizer) ---
BoW Matrix Shape: (25000, 5000)
First 20 BoW features: ['abandoned' 'abc' 'ability' 'able' 'abraham' 'abrupt' 'absence' 'absent'
 'absolute' 'absolutely' 'absurd' 'absurdity' 'abuse' 'abused' 'abusive'
 'abysmal' 'academy' 'accent' 'accept' 'acceptable']

BoW representation of the first document (word_index, count):
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 101 stored elements and shape (1, 5000)>
  Coords	Values
  (0, 3630)	1
  (0, 4754)	1
  (0, 4233)	1
  (0, 4349)	1
  (0, 1687)	2
  (0, 3599)	1
  (0, 145)	1
  (0, 2024)	1
  (0, 1493)	1
  (0, 4600)	1
  (0, 1449)	1
  (0, 975)	1
  (0, 4464)	1
  (0, 1601)	1
  (0, 1667)	5
  (0, 906)	2
  (0, 935)	1
  (0, 3540)	3
  (0, 3870)	1
  (0, 3281)	2
  (0, 239)	1
  (0, 4991)	1
  (0, 4366)	3
  (0, 1301)	2
  (0, 4266)	1
  :	:
  (0, 1889)	2
  (0, 3063)	1
  (0, 504)	1
  (0, 2364)	1
  (0, 1736)	1
  (0, 1670)	1
  (0, 1577)	1
  (0, 3981)	3
  (0, 251)	1
  (0, 3459)	1
  (0, 3513)	1
  (0, 3959)	1

#### In the output above, each entry under Coords is the coordinate of a nonzero cell in the matrix: the first index is the row (the document number) and the second is the column (the feature, i.e. the word's index in the vocabulary). For the BoW matrix, the value is the number of times that word appears in the document. For the TF-IDF matrix, the value is the TF-IDF score: the higher it is, the more important the word is to that document and the better it discriminates that document from the others
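
To make the two kinds of cell value concrete, here is a small standard-library sketch that builds one BoW row and one TF-IDF row by hand. The toy corpus and the helper names `bow_row` and `tfidf_row` are assumptions for illustration; the weighting uses the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation, which should match the `TfidfVectorizer` defaults.

```python
import math
from collections import Counter

# Toy corpus of already-preprocessed documents (hypothetical, not from IMDb).
docs = ["good film good plot", "bad film", "good acting"]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})  # alphabetical column order
n_docs = len(docs)

# Document frequency: the number of documents that contain each word.
df = Counter(w for doc in tokenized for w in set(doc))

def bow_row(tokens):
    """Raw counts per vocabulary word: one row of the BoW matrix."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def tfidf_row(tokens):
    """Counts weighted by smoothed IDF, then L2-normalised."""
    counts = Counter(tokens)
    weights = [counts[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in vocab]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm if norm else 0.0 for x in weights]

print(vocab)
print(bow_row(tokenized[0]))
print([round(x, 3) for x in tfidf_row(tokenized[0])])
```

In the first toy document, `film` and `plot` both occur once, but `plot` gets the higher TF-IDF score because it appears in fewer documents.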

#### The code below prepares the data for label encoding, embedding, and Word2Vec. These need each document as a list of separate tokens rather than a single string, so we re-run the preprocessing steps without the final join

In [25]:
def preprocess_text(example):
    """
    Applies a full preprocessing pipeline to a single text example.
    """
    # 'text' is the column name in the imdb dataset
    text = example['text']
    
    # 1. Remove HTML tags and URLs
    text = re.sub(r'<[^>]+>', ' ', text) # This regex removes HTML tags like <br />
    text = re.sub(r'http\S+', '', text) # This regex removes URLs (the pattern also covers https)
    
    text = text.lower() # 2. Convert to lowercase
    text = re.sub(r'[^a-z\s]', '', text) # 3. Remove punctuation, digits, and special characters (keep only letters and whitespace)
    
    tokens = word_tokenize(text) # 4. Tokenize: split the text into a list of words
    tokens = [word for word in tokens if word not in stop_words] # 5. Remove stop words
    
    # --- 6. Apply Stemming OR Lemmatization
    # You generally do one or the other, not both.
    # Lemmatization is usually preferred as it produces real words.
    
    # Option A: Stemming
    # processed_tokens = [stemmer.stem(word) for word in tokens]
    
    # Option B: Lemmatization (default)
    processed_tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    # Store the token list in a new 'tokens' column (the raw 'text' column is left unchanged)
    example['tokens'] = processed_tokens
    return example

# Load the dataset
print("Loading dataset...")
# Load the 'train' split. The slice [:50000] exceeds the split size, so all 25,000
# training examples are loaded; use e.g. split='train[:1000]' for a quick demo
dataset = load_dataset("stanfordnlp/imdb", split='train[:50000]')
    
# 2. Show an example *before* processing
print("\n--- BEFORE PREPROCESSING ---")
print(dataset[0]['text'])
    
# 3. Apply the preprocessing function using .map()
# .map() applies the function to every example in the dataset
print("\nApplying preprocessing to dataset...")
processed_dataset = dataset.map(preprocess_text)
    
# 4. Show the same example *after* processing
print("\n--- AFTER PREPROCESSING ---")
print(processed_dataset[0]['tokens'])
    
print(f"\nSuccessfully processed {len(processed_dataset)} examples.")

Loading dataset...

--- BEFORE PREPROCESSING ---
I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the s

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]


--- AFTER PREPROCESSING ---
['rented', 'curiousyellow', 'video', 'store', 'controversy', 'surrounded', 'first', 'released', 'also', 'heard', 'first', 'seized', 'u', 'custom', 'ever', 'tried', 'enter', 'country', 'therefore', 'fan', 'film', 'considered', 'controversial', 'really', 'see', 'plot', 'centered', 'around', 'young', 'swedish', 'drama', 'student', 'named', 'lena', 'want', 'learn', 'everything', 'life', 'particular', 'want', 'focus', 'attention', 'making', 'sort', 'documentary', 'average', 'swede', 'thought', 'certain', 'political', 'issue', 'vietnam', 'war', 'race', 'issue', 'united', 'state', 'asking', 'politician', 'ordinary', 'denizen', 'stockholm', 'opinion', 'politics', 'sex', 'drama', 'teacher', 'classmate', 'married', 'men', 'kill', 'curiousyellow', 'year', 'ago', 'considered', 'pornographic', 'really', 'sex', 'nudity', 'scene', 'far', 'even', 'shot', 'like', 'cheaply', 'made', 'porno', 'countryman', 'mind', 'find', 'shocking', 'reality', 'sex', 'nudity', 'major', 'stap

#### The code above preprocesses the text into token lists in preparation for Word2Vec
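
Because every cleaned token consists only of letters, the token lists could also be recovered from the joined strings of the first pass instead of re-running the whole pipeline. A minimal sketch, assuming `documents` holds those joined strings (a toy list stands in for it here):

```python
# Toy stand-in for the joined strings produced by the first preprocessing pass.
documents = ["rented curiousyellow video store", "great film great cast"]

# Tokens contain no whitespace, so splitting on whitespace undoes
# the " ".join(...) performed in step 7.
corpus = [doc.split() for doc in documents]

print(corpus[0])

# Round-trip check: joining the tokens gives back the original strings.
assert [" ".join(tokens) for tokens in corpus] == documents
```

This is only a shortcut for the notebook's data flow; the resulting `corpus` has the same list-of-token-lists shape that `Word2Vec` expects for `sentences`.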

In [26]:
import gensim.models
corpus = processed_dataset['tokens']

print("\n--- Training CBOW Model (sg=0) ---")

cbow_model = gensim.models.Word2Vec( # Initialize and train the Word2Vec model
    sentences=corpus,
    vector_size=100, # vector_size=100: Creates 100-dimension word embeddings
    window=5, # window=5: Considers up to 5 words before and 5 words after the target word
    min_count=2, # min_count=2: Ignores all words with a total frequency lower than 2
    sg=0, # sg=0: This is the flag that specifies CBOW
    workers=4 # workers=4: Use 4 worker threads to speed up training
)

cbow_model.save("cbow_word2vec.model") # Save the model for later use
print("CBOW Model trained and saved as 'cbow_word2vec.model'.")
    
# Show example results
print("Example: Words most similar to 'movie' (CBOW):")
try:
    # wv.most_similar() returns the nearest neighbours ranked by cosine similarity (top 10 by default)
    print(cbow_model.wv.most_similar('movie'))
except KeyError:
    print("'movie' not in vocabulary (or filtered by min_count).")
    
# --- 5. Apply Skip-gram ---
print("\n--- Training Skip-gram Model (sg=1) ---")
    
# The parameters are the same, but we change 'sg=1'
skipgram_model = gensim.models.Word2Vec(
    sentences=corpus,
    vector_size=100,
    window=5,
    min_count=2,
    sg=1, # sg=1: This is the flag that specifies Skip-gram
    workers=4
)

skipgram_model.save("skipgram_word2vec.model") # Save the model
print("Skip-gram Model trained and saved as 'skipgram_word2vec.model'.")

# Show example results
print("Example: Words most similar to 'film' (Skip-gram):")
try:
    print(skipgram_model.wv.most_similar('film'))
except KeyError:
    print("'film' not in vocabulary (or filtered by min_count).")
        
print("\nVectorization complete.")


--- Training CBOW Model (sg=0) ---
CBOW Model trained and saved as 'cbow_word2vec.model'.
Example: Words most similar to 'movie' (CBOW):
[('film', 0.7948123812675476), ('flick', 0.6767253279685974), ('sequel', 0.6639348864555359), ('anyway', 0.5959745645523071), ('opinion', 0.5786111354827881), ('suppose', 0.5699934959411621), ('sucked', 0.5672647953033447), ('stuff', 0.563139796257019), ('honestly', 0.5620121955871582), ('sleepfest', 0.5588194727897644)]

--- Training Skip-gram Model (sg=1) ---
Skip-gram Model trained and saved as 'skipgram_word2vec.model'.
Example: Words most similar to 'film' (Skip-gram):
[('movie', 0.9008914828300476), ('bilal', 0.7973375916481018), ('biopics', 0.7901825308799744), ('horrorthriller', 0.7898569703102112), ('weir', 0.7880809307098389), ('catiii', 0.7864723801612854), ('categorized', 0.7796862125396729), ('filmgoers', 0.7780632376670837), ('malfique', 0.7770559787750244), ('romanticcomedy', 0.7767753601074219)]

Vectorization complete.
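
The scores in the lists above are cosine similarities between word vectors. As a rough illustration of what a nearest-neighbour lookup of this kind computes, here is a standard-library sketch; the 3-dimensional vectors and the standalone `most_similar` helper are made up for the example and are not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings; real Word2Vec vectors here have 100 dimensions.
toy_vectors = {
    "movie": [0.9, 0.1, 0.0],
    "film": [0.8, 0.2, 0.1],
    "flick": [0.7, 0.0, 0.3],
    "banana": [0.0, 0.1, 0.9],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("movie", toy_vectors))
```

With these toy vectors, `film` ranks first and `flick` second, while `banana` points in a very different direction and scores close to zero.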


#### For two words to be considered similar by CBOW, they must appear in similar contexts. When the window is small, the context is limited to a word's immediate neighbours, so similar words tend to be ones that could literally replace each other in the same sentence. As the window grows, the context takes in more of the surrounding text, so the embeddings capture more semantic meaning and similar words tend to be related in meaning as well
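
The role of the window can be sketched by listing the context that CBOW would see for one target word at two window sizes. `cbow_pairs` is a hypothetical helper written for this illustration; it uses a fixed symmetric window, whereas gensim (as far as I know) also shrinks the window at random during training, which this sketch ignores.

```python
def cbow_pairs(tokens, window):
    """Return (context_words, target_word) pairs for a fixed symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs

tokens = ["rented", "video", "store", "controversy", "surrounded"]

for window in (1, 2):
    context, target = cbow_pairs(tokens, window)[2]
    print(f"window={window}: context={context} -> target={target!r}")
```

With `window=1` the target `store` is predicted from `video` and `controversy` only; with `window=2` the context also includes `rented` and `surrounded`, so words further away start to shape the embedding.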