In [4]:
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/hrishikesh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [20]:
! pip install bs4 # in case you don't have it installed
! pip install gensim
! pip install torch
! pip install nltk
! pip install pandas
! pip install scikit-learn

Collecting scikit-learn
  Using cached scikit_learn-1.4.0-1-cp311-cp311-macosx_12_0_arm64.whl.metadata (11 kB)
Collecting threadpoolctl>=2.0.0 (from scikit-learn)
  Using cached threadpoolctl-3.2.0-py3-none-any.whl.metadata (10.0 kB)
Using cached scikit_learn-1.4.0-1-cp311-cp311-macosx_12_0_arm64.whl (10.6 MB)
Using cached threadpoolctl-3.2.0-py3-none-any.whl (15 kB)
Installing collected packages: threadpoolctl, scikit-learn
Successfully installed scikit-learn-1.4.0 threadpoolctl-3.2.0


## Read Data

In [6]:
url = "https://web.archive.org/web/20201127142707if_/https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Office_Products_v1_00.tsv.gz"
data = pd.read_csv(url, sep='\t',on_bad_lines='skip')



# 1. Data Generation
The code performs the following steps on the product review dataframe:

- It concatenates "review_body" and "review_headline" into a single text field, stored back in "review_body", so that each row holds the full text of the review.

- It keeps only the "review_body" and "star_rating" columns.

- It converts "star_rating" to a numeric type, coercing any non-numeric values to NaN.

- It drops rows with missing values.

The code then samples up to 50,000 reviews for each star rating from 1 to 5, using a fixed seed (`random_state=42`) for reproducibility, and concatenates the five samples into one dataframe. Each review is then assigned to one of three sentiment classes: '1' (positive) for ratings above 3, '2' (negative) for ratings below 3, and '3' (neutral) for ratings equal to 3. Because the sampling is done per star rating rather than per class, the positive and negative classes each draw on two ratings (up to 100,000 reviews each), while the neutral class draws on one (up to 50,000 reviews). The result is saved to `final_sample.csv` for reuse.
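The sampling and labeling logic can be illustrated on a toy dataset. This is a minimal standard-library sketch, not the notebook's pandas code: the helpers `rating_to_sentiment` and `sample_per_rating` and the `toy_rows` data are hypothetical names introduced only for illustration.

```python
import random
from collections import defaultdict

def rating_to_sentiment(rating):
    """Map a star rating to the class labels used here: 1=positive, 2=negative, 3=neutral."""
    if rating > 3:
        return 1
    if rating == 3:
        return 3
    return 2

def sample_per_rating(rows, cap, seed=42):
    """Keep at most `cap` rows for each star rating, chosen reproducibly."""
    by_rating = defaultdict(list)
    for row in rows:
        by_rating[row["star_rating"]].append(row)
    rng = random.Random(seed)
    sampled = []
    for rating in sorted(by_rating):
        group = by_rating[rating]
        sampled.extend(rng.sample(group, min(cap, len(group))))
    return sampled

# Toy data: three 1-star, one 2-star, one 3-star, one 4-star, four 5-star reviews
toy_rows = [
    {"star_rating": r, "review_body": f"review {i}"}
    for i, r in enumerate([1, 1, 1, 2, 3, 4, 5, 5, 5, 5])
]
toy_sample = sample_per_rating(toy_rows, cap=2)

# Count how many sampled reviews fall into each sentiment class
toy_counts = defaultdict(int)
for row in toy_sample:
    toy_counts[rating_to_sentiment(row["star_rating"])] += 1
print(dict(toy_counts))
```

With a cap of 2 per rating, the toy sample holds three negative, one neutral, and three positive reviews, which shows how capping per star rating leaves the neutral class smaller than the other two.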

In [7]:
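# Extract the required fields from the dataset
data["review_body"] = data["review_body"] + ' ' + data["review_headline"]
# .copy() gives an independent dataframe, so the later column assignment does not act on a view
data = data[["review_body", "star_rating"]].copy()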

In [8]:
data.loc[:, 'star_rating'] = pd.to_numeric(data.star_rating, errors="coerce")
data = data.dropna()

# Check and sample data
for rating in range(1, 6):
    count = min(50000, data[data["star_rating"] == rating].shape[0])
    sampled_data = data[data["star_rating"] == rating].sample(n=count, random_state=42)
    if rating == 1:
        final_sample = sampled_data
    else:
        final_sample = pd.concat([final_sample, sampled_data])

final_sample.reset_index(drop=True, inplace=True)
final_sample["sentiment"] = final_sample.star_rating.apply(lambda x: 1 if x > 3 else 3 if x == 3 else 2)

# Save dataset for reuse
final_sample.to_csv("final_sample.csv", index=False)

In [9]:
del data, sampled_data, url, count

In [10]:
final_sample.sample(n=5, random_state=100000)

| | review_body | star_rating | sentiment |
|---|---|---|---|
| 62616 | Shipping was great, packaging was solid, envel... | 2.0 | 2 |
| 128906 | Would have liked it better if it was cheaper f... | 3.0 | 3 |
| 154991 | cute larger notepad, gave as gift to bus drive... | 4.0 | 1 |
| 90705 | Item that I purchased is not as good as it was... | 2.0 | 2 |
| 200634 | It looks so nice in my grey beetle. LOVE IT!!!... | 5.0 | 1 |


# Data Cleaning and Pre-processing

The cells below apply the following steps, in order, to prepare the review text for analysis:

- All text is converted to lowercase.
- HTML tags and URLs are stripped with regular expressions.
- Non-alphabetic characters, including digits, punctuation, and symbols, are removed with a regular expression. Apostrophes are kept at this stage so that contractions can still be recognized.
- Runs of whitespace are collapsed to a single space.
- Common contractions are expanded to their full forms using a predefined dictionary, `contractions_dict`.
- A tailored list of stopwords is built by removing negation and emphasis words such as 'no', 'nor', 'not', 'only', and 'very' from NLTK's standard list of English stopwords, which preserves the negation context that sentiment analysis depends on. The remaining stopwords are then removed from the text to reduce noise.

The cleaned text is written back to the `review_body` column, ready for lemmatization.
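The cleaning steps can be traced end to end on a single string. The sketch below is illustrative only: `clean_text`, `TOY_CONTRACTIONS`, and `TOY_STOPWORDS` are hypothetical names, and the two toy collections are tiny stand-ins for the full contraction dictionary and the NLTK stopword list.

```python
import re

# Tiny illustrative subsets, assumed for this example only
TOY_CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
TOY_STOPWORDS = {"i", "it", "is", "the", "a", "do", "this"}

def clean_text(text):
    text = text.lower()
    text = re.sub(r"<.*?>", "", text)           # strip HTML tags
    text = re.sub(r"http\S+|www\S+", "", text)  # strip URLs
    text = re.sub(r"[^a-z\s']", "", text)       # keep letters, whitespace, apostrophes
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    # Expand contractions, then re-split because an expansion may contain two words
    words = " ".join(TOY_CONTRACTIONS.get(w, w) for w in text.split()).split()
    # Drop stopwords; "not" is deliberately absent from the stopword set
    return " ".join(w for w in words if w not in TOY_STOPWORDS)

cleaned = clean_text("I <b>DON'T</b> like this printer!! See http://example.com It's noisy.")
print(cleaned)  # not like printer see noisy
```

Note how "don't" survives as "not": the contraction is expanded to "do not", "do" is removed as a stopword, and the negation is kept.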

In [11]:
# convert reviews to lowercase strings
final_sample["review_body"] = final_sample["review_body"].str.lower()

# to remove HTML tags
final_sample["review_body"] = final_sample["review_body"].apply(lambda text: re.sub(r'<.*?>', '', text)  if type(text) == str else '')

# to remove URLs
final_sample["review_body"] = final_sample["review_body"].apply(lambda text: re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE))

# to remove non-alphabetic characters, except apostrophes, which are kept so that contractions can be expanded below
final_sample["review_body"] = final_sample["review_body"].apply(lambda text: re.sub(r'[^a-z\s\']', '', text))

# to remove extra spaces
final_sample["review_body"] = final_sample["review_body"].apply(lambda text: re.sub(r'\s+', ' ', text).strip())

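# Contractions dictionary used for expansion. Where a contraction has several possible expansions, the closest fit was chosen.
# Note: a few keys (for example "he's") appear twice below; Python keeps the last entry for a repeated key, so those resolve to the "has" forms.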
contractions_dict = {
        "i'm": "i am",
        "you're": "you are",
        "he's": "he is",
        "she's": "she is",
        "they're": "they are",
        "we're": "we are",
        "it's": "it is",
        "that's": "that is",
        "here's": "here is",
        "there's": "there is",
        "who's": "who is",
        "where's": "where is",
        "when's": "when is",
        "why's": "why is",
        "what's": "what is",
        "how's": "how is",
        "everybody's": "everybody is",
        "nobody's": "nobody is",
        "something's": "something is",
        "so's": "so is",
        "i'll": "i will",
        "you'll": "you will",
        "he'll": "he will",
        "she'll": "she will",
        "they'll": "they will",
        "it'll": "it will",
        "we'll": "we will",
        "that'll": "that will",
        "this'll": "this will",
        "these'll": "these will",
        "there'll": "there will",
        "where'll": "where will",
        "who'll": "who will",
        "what'll": "what will",
        "how'll": "how will",
        "i've": "i have",
        "you've": "you have",
        "he's": "he has",
        "she's": "she has",
        "we've": "we have",
        "they've": "they have",
        "should've": "should have",
        "could've": "could have",
        "would've": "would have",
        "might've": "might have",
        "must've": "must have",
        "what've": "what have",
        "what's": "what has",
        "where've": "where have",
        "where's": "where has",
        "there've": "there have",
        "there's": "there has",
        "these've": "these have",
        "who's": "who has",
        "don't": "do not",
        "can't": "cannot",
        "mustn't": "must not",
        "aren't": "are not",
        "couldn't": "could not",
        "wouldn't": "would not",
        "shouldn't": "should not",
        "isn't": "is not",
        "doesn't": "does not",
        "didn't": "did not",
        "hasn't": "has not",
        "hadn't": "had not",
        "haven't": "have not",
        "wasn't": "was not",
        "won't": "will not",
        "weren't": "were not",
        "ain't": "am not",
        "let's": "let us",
        "y'all": "you all",
        "where'd": "where did",
        "how'd": "how did",
        "why'd": "why did",
        "who'd": "who did",
        "when'd": "when did",
        "what'd": "what did",
        "g'day": "good day",
        "ma'am": "madam",
        "o'clock": "of the clock"
    }
# expanding contractions
final_sample["review_body"] = final_sample["review_body"].apply(lambda text: ' '.join([contractions_dict[word] if word in contractions_dict else word for word in text.split()]))

In [12]:
del contractions_dict

## Remove the stop words

In [13]:
from nltk.corpus import stopwords
nltk.download('stopwords')

# Keep negations and intensifiers in the text: they carry sentiment, so take them out of the stopword set
stopwords_en = set(stopwords.words('english')) - {'no', 'nor', 'not', 'only', 'very', "don't", "ain't", "aren't", "couldn't", "didn't", "doesn't", "hadn't", "hasn't", "mightn't", "mustn't", "isn't", "needn't", "shan't", "shouldn't", "wasn't", "weren't", "won't", "wouldn't"}

# removing stop words
final_sample["review_body"] = final_sample["review_body"].apply(lambda text: ' '.join(word for word in text.split() if word not in stopwords_en))

print(final_sample.head())

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hrishikesh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                         review_body star_rating  sentiment
0  keyboard not sensitive enough takes ages retyp...         1.0          2
1                         come buy sony's site price         1.0          2
2  well happy save lot money first despite full s...         1.0          2
3  not please order replaced lamp tv worked mins ...         1.0          2
4  bought new not refurbished one place ipod g do...         1.0          2


In [14]:
del stopwords, stopwords_en

## Perform lemmatization  

Lemmatization reduces words to their base (dictionary) form, which consolidates vocabulary variants and shrinks the feature space. For instance, "running" becomes "run", and "better" becomes "good". Accurate lemmatization depends on part-of-speech (POS) information, because the lemma of a word can differ depending on whether it is used as a verb, noun, adjective, or adverb. The code uses NLTK's `pos_tag` function to tag each token, then maps the first letter of each tag to the corresponding WordNet POS through a dictionary, `pos_tag_map` ('J' to adjective, 'V' to verb, 'R' to adverb, and noun for everything else). The mapped POS is passed to `WordNetLemmatizer`, and the lemmatized tokens are joined back into a string stored in the `review_lemmatize` column.
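The tag mapping can be shown in isolation, without downloading any NLTK data. This is a small sketch under stated assumptions: `toy_tag_map` is a hypothetical name, and the single-character codes are written out by hand to match the values of `wn.NOUN`, `wn.ADJ`, `wn.VERB`, and `wn.ADV` in NLTK.

```python
from collections import defaultdict

# WordNet POS codes as single characters (the values of wn.NOUN, wn.ADJ, wn.VERB, wn.ADV)
NOUN, ADJ, VERB, ADV = "n", "a", "v", "r"

# Any tag whose first letter is not J, V, or R falls back to noun
toy_tag_map = defaultdict(lambda: NOUN, {"J": ADJ, "V": VERB, "R": ADV})

# Penn Treebank tags as pos_tag would return them: noun, past-tense verb,
# comparative adjective, adverb, preposition
penn_tags = ["NN", "VBD", "JJR", "RB", "IN"]
wordnet_tags = [toy_tag_map[tag[0]] for tag in penn_tags]
print(wordnet_tags)  # ['n', 'v', 'a', 'r', 'n']
```

Only the first letter of the tag is used, so all verb tags (VB, VBD, VBG, and so on) share one mapping, and tags such as IN that have no WordNet counterpart are treated as nouns.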

In [15]:
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from collections import defaultdict

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# creating a pos tag map for proper lemmatization
pos_tag_map = defaultdict(lambda : wn.NOUN)
pos_tag_map['J'] = wn.ADJ
pos_tag_map['V'] = wn.VERB
pos_tag_map['R'] = wn.ADV

lemmatizer = WordNetLemmatizer()

# For each review: tokenize the text with word_tokenize and tag each token with pos_tag. The first letter of each
# tag is mapped through pos_tag_map to a WordNet POS, which is passed to the lemmatizer together with the token.
final_sample["review_lemmatize"] = final_sample["review_body"].apply(lambda text: ' '.join([lemmatizer.lemmatize(word_token, pos_tag_map[word_pos_tag[0]])
 for word_token, word_pos_tag in pos_tag(word_tokenize(text))]))

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/hrishikesh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/hrishikesh/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [16]:
final_sample = final_sample.drop(["review_body", "star_rating"], axis=1)

In [17]:
del lemmatizer, pos_tag_map, wn
final_sample.to_csv("final_sample_lemmatized.csv", index=False)

## Splitting the dataset

Two datasets are prepared from the lemmatized data: `final_sample`, which keeps all three classes for ternary classification, and `final_copy`, a copy with the neutral reviews (sentiment 3) removed for binary classification. Each dataset is then split with 80% of the rows for training and the remaining 20% for testing, using the same `random_state` so that the splits are reproducible.

In [18]:
# Binary dataset: an independent copy of the data without the neutral class
final_copy = final_sample[final_sample.sentiment != 3].copy()

In [21]:
from sklearn.model_selection import train_test_split

train_data_binary, test_data_binary = train_test_split(final_copy, test_size=0.2, random_state=42)
train_data_ternary, test_data_ternary = train_test_split(final_sample, test_size=0.2, random_state=42)

train_labels_binary = train_data_binary["sentiment"]
test_labels_binary = test_data_binary["sentiment"]

train_labels_ternary = train_data_ternary["sentiment"]
test_labels_ternary = test_data_ternary["sentiment"]

# 2. Word Embedding

## a. Load the pretrained “word2vec-google-news-300” Word2Vec model 

In [22]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

### (summer - hot) + cold = winter

In [23]:
try:
    semantic_sim = wv.most_similar(positive=['summer', 'cold'], negative=['hot'],topn=1)
    print(semantic_sim)
except KeyError:
    print("Keys \'summer\', \'cold\' or \'hot\' not present in word2vec-google-news-300 model")

[('winter', 0.6970129609107971)]


### (father - man) + woman = mother

In [24]:
try:
    semantic_sim = wv.most_similar(positive=['father', 'woman'], negative=['man'], topn=1)
    print(semantic_sim)
except KeyError:
    print("Keys \'father\', \'woman\' or \'man\' not present in word2vec-google-news-300 model")

[('mother', 0.8462507128715515)]
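The idea behind these analogy queries can be sketched with hand-made vectors. This is a simplified illustration, not gensim's implementation: `toy_vectors`, `cosine`, and `analogy` are hypothetical names, the 2-D vectors are invented for the example, and the sketch adds and subtracts raw vectors before ranking candidates by cosine similarity.

```python
import math

# Invented 2-D vectors: first axis is roughly "season-ness", second is "warmth"
toy_vectors = {
    "summer": [1.0, 1.0],
    "winter": [1.0, -1.0],
    "hot": [0.0, 1.0],
    "cold": [0.0, -1.0],
    "desk": [-1.0, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    candidates = (w for w in vectors if w not in positive and w not in negative)
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

toy_answer = analogy(positive=["summer", "cold"], negative=["hot"], vectors=toy_vectors)
print(toy_answer)  # winter
```

Here (summer - hot) + cold equals the vector for "winter" exactly, so the cosine similarity is 1.0. Real embeddings are noisier, which is why the scores reported above are below 1.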


## b. Train a Word2Vec model using my dataset.

The custom Word2Vec model is trained on the binary training reviews with the following settings. `vector_size=300` sets the dimensionality of the word vectors, matching the pretrained model so that the two can be compared directly. `window=11` lets the model consider up to 11 words on either side of each target word as context. `min_count=10` drops words that occur fewer than 10 times, so that the model does not learn unreliable vectors from rare tokens. `workers=4` trains with four worker threads, which shortens training time on a multi-core machine.
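To make the roles of the window and minimum count concrete, the sketch below lists the (target, context) pairs that a window-based model would see on a toy corpus. It is a conceptual illustration only, assuming rare words are dropped before windows are formed; `context_pairs` and `toy_sentences` are hypothetical names, and details of gensim's training such as subsampling and negative sampling are ignored.

```python
from collections import Counter

def context_pairs(sentences, window, min_count):
    """List (target, context) pairs after dropping words rarer than min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    pairs = []
    for sentence in sentences:
        kept = [word for word in sentence if counts[word] >= min_count]
        for i, target in enumerate(kept):
            start, stop = max(0, i - window), min(len(kept), i + window + 1)
            pairs.extend((target, kept[j]) for j in range(start, stop) if j != i)
    return pairs

toy_sentences = [["good", "printer", "fast"], ["good", "paper", "printer"]]
toy_pairs = context_pairs(toy_sentences, window=1, min_count=2)
print(toy_pairs)
```

With `min_count=2`, "fast" and "paper" are dropped because each occurs once, so in the second sentence "good" and "printer" become neighbors even though a rare word originally separated them.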

In [25]:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess 

my_model = Word2Vec(sentences=train_data_binary.review_lemmatize.apply(lambda x: simple_preprocess(x)), vector_size=300, window=11, min_count=10, workers=4)

In [26]:
try:
    semantic_sim = my_model.wv.most_similar(positive=['summer', 'cold'], negative=['hot'],topn=1)
    print(semantic_sim)
except KeyError:
    print("Keys \'summer\', \'cold\' or \'hot\' not present in my model")

[('winter', 0.7141301035881042)]


In [27]:
try:
    semantic_sim = my_model.wv.most_similar(positive=['father', 'woman'], negative=['man'], topn=1)
    print(semantic_sim)
except KeyError:
    print("Keys \'father\', \'woman\' or \'man\' not present in my model")

[('sister', 0.682122528553009)]


### What do you conclude from comparing vectors generated by yourself and the pretrained model? Which of the Word2Vec models seems to encode semantic similarities between words better?

On these two examples, the pretrained model encodes semantic relationships more reliably. Both models solve the first analogy: each returns 'winter', with cosine similarities of about 0.70 for the pretrained model and 0.71 for the custom model. On the second analogy, however, the pretrained model returns 'mother' (similarity 0.85), while the custom model returns 'sister' (0.68), which is related but not the expected answer. The likely reason is the training data: the pretrained model was trained on a very large, general-purpose news corpus, so it sees words such as 'father', 'man', and 'woman' in many varied contexts, whereas the custom model sees only office product reviews, in which such words are comparatively rare and appear in narrow contexts. Two analogies are a small sample, but they suggest that large-scale, diverse training data yields embeddings that capture general semantic relationships better.

# 3. Simple models 

The function `averaged_word_vectorizer` computes the average Word2Vec embedding for each sentence in a given corpus. It initializes an array `all_embeddings` to store the averaged embeddings for the entire corpus, with the shape determined by the number of sentences in the corpus and the specified number of features (`num_features`). For each sentence, it tokenizes the sentence into words, retrieves the Word2Vec vector for each word present in the model's vocabulary, and calculates the mean of these vectors to obtain a single averaged embedding. If a sentence does not contain any words present in the model, a zero vector is used as its embedding. This function returns a 2D numpy array where each row corresponds to the averaged embedding of a sentence from the corpus, effectively condensing the semantic information of each sentence into a fixed-size vector.

In [28]:
def averaged_word_vectorizer(corpus, model, num_features):
    # One row per sentence; a row stays all zeros when no token is in the vocabulary
    all_embeddings = np.zeros((len(corpus), num_features))
    for i, sentence in enumerate(corpus):
        tokens = nltk.word_tokenize(sentence)
        vectors = [model[word] for word in tokens if word in model]

        if vectors:
            all_embeddings[i] = np.mean(vectors, axis=0, dtype=np.float32)

    return all_embeddings

In [29]:
# Create averaged word vector features
pretrained_word_embedding = averaged_word_vectorizer(train_data_binary['review_lemmatize'], wv, 300)
my_word_embedding= averaged_word_vectorizer(train_data_binary['review_lemmatize'], my_model.wv, 300)

pretrained_test_embedding = averaged_word_vectorizer(test_data_binary['review_lemmatize'], wv, 300)
my_test_embedding = averaged_word_vectorizer(test_data_binary['review_lemmatize'], my_model.wv, 300)

The `train_test_model` function trains a given scikit-learn model and evaluates it on a test set. It fits the model on `train_data` and `train_labels`, predicts labels for `test_data`, and then computes four metrics from `test_labels` and the predictions: accuracy (the fraction of all predictions that are correct), precision (the fraction of predicted positives that are actually positive), recall (the fraction of actual positives that the model finds), and F1 score (the harmonic mean of precision and recall). Precision, recall, and F1 are computed with scikit-learn's default `pos_label=1`, which here is the positive sentiment class. The function prints the four metrics.
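The four metrics can be computed by hand on a handful of labels, which makes the definitions concrete. This is a plain-Python sketch rather than scikit-learn's implementation; `binary_metrics` and the toy labels are hypothetical, with label 1 as the positive class and label 2 as the negative class, as in the binary dataset.

```python
def binary_metrics(y_true, y_pred, pos_label=1):
    """Return (accuracy, precision, recall, f1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)  # true positives
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)  # false positives
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

toy_true = [1, 1, 1, 2, 2, 2]
toy_pred = [1, 1, 2, 1, 2, 2]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

In this toy case there are two true positives, one false positive, and one false negative, so accuracy, precision, recall, and F1 all come out to 2/3.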

In [30]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def train_test_model(model, train_data, test_data, train_labels, test_labels):
    model.fit(train_data, train_labels)
    predictions = model.predict(test_data)

    accuracy_test = accuracy_score(test_labels, predictions)
    precision_test = precision_score(test_labels, predictions)
    recall_test = recall_score(test_labels, predictions)
    f1_test = f1_score(test_labels, predictions)

    print("Testing metics:")
    print("Accuracy:", accuracy_test)
    print("Precision:", precision_test)
    print("Recall:", recall_test)
    print("F1 Score:", f1_test)

## Perceptron with Word2Vec

In [31]:
from sklearn.linear_model import Perceptron

# google model
perceptron = Perceptron(max_iter=1000, alpha=1e-4, penalty='elasticnet', tol=1e-5, random_state=42)
print("Pretrained model")
train_test_model(perceptron, pretrained_word_embedding, pretrained_test_embedding, train_labels_binary, test_labels_binary)

# my model
print("\nModel trained by me")
train_test_model(perceptron, my_word_embedding, my_test_embedding, train_labels_binary, test_labels_binary)

Pretrained model
Testing metics:
Accuracy: 0.8139
Precision: 0.766131423971529
Recall: 0.9038336582196231
F1 Score: 0.8293052052281586

Model trained by me
Testing metics:
Accuracy: 0.8218
Precision: 0.7574873045703546
Recall: 0.946868595991403
F1 Score: 0.8416562999822286


## Perceptron with TF-IDF
The code initializes a `TfidfVectorizer` that extracts up to 5,000 unigram and bigram features from the 'review_lemmatize' column of `final_copy` and transforms the text into a matrix of TF-IDF values. It then builds a DataFrame, `feature_df`, with one column per feature. TF-IDF turns text into numerical features by giving more weight to terms that are frequent in a document but rare across the corpus. Note that the vectorizer is fit on all of `final_copy` before the train/test split, so the vocabulary and IDF weights are computed from the test reviews as well as the training reviews.
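The weighting itself can be sketched in a few lines. The example below is a simplified, unigram-only illustration that assumes scikit-learn's default settings, namely the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row; `tfidf_rows` and `toy_docs` are hypothetical names, and feature limits and bigrams are left out.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

toy_docs = ["good printer", "bad printer", "good paper"]
tfidf_toy = tfidf_rows(toy_docs)
print(tfidf_toy[1])
```

In the second toy document, "bad" appears in only one document while "printer" appears in two, so "bad" receives the larger weight.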

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1,2))
tfidf_features = tfidf_vectorizer.fit_transform(final_copy['review_lemmatize'])

# to get the features that the vectorizer selected 
feature_df = pd.DataFrame(tfidf_features.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

Splitting data for TF-IDF

In [33]:
from sklearn.model_selection import train_test_split

X = feature_df
y = final_copy["sentiment"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [34]:
del X, y

In [35]:
from sklearn.linear_model import Perceptron

# Initialize the Perceptron model
perceptron = Perceptron()
print("TF-IDF")
train_test_model(perceptron, X_train, X_test, y_train, y_test)

TF-IDF
Testing metics:
Accuracy: 0.912625
Precision: 0.9057800058979653
Recall: 0.9211276053381316
F1 Score: 0.9133893390825961


## SVM with Word2Vec

In [36]:
from sklearn.svm import LinearSVC

# google model
svm = LinearSVC(dual='auto')
print("Pretrained model")
train_test_model(svm, pretrained_word_embedding, pretrained_test_embedding, train_labels_binary, test_labels_binary)

# my model
print("\nModel trained by me")
train_test_model(svm, my_word_embedding, my_test_embedding, train_labels_binary, test_labels_binary)

Pretrained model
Testing metics:
Accuracy: 0.862375
Precision: 0.8768319301527908
Recall: 0.8433048433048433
F1 Score: 0.8597416494687763

Model trained by me
Testing metics:
Accuracy: 0.898825
Precision: 0.9085184805979318
Recall: 0.8870395361623432
F1 Score: 0.897650539945879


## SVM with TF-IDF

In [37]:
print("TF-IDF")
train_test_model(svm, X_train, X_test, y_train, y_test)

TF-IDF
Testing metics:
Accuracy: 0.9295
Precision: 0.9308166641600241
Recall: 0.9280251911830859
F1 Score: 0.9294188316564048


In [38]:
del feature_df, perceptron, svm, X_train, X_test, y_train, y_test, tfidf_features, tfidf_vectorizer

### What do you conclude from comparing performances for the models trained using the three different feature types (TF-IDF, pretrained Word2Vec, your trained Word2Vec)?

Comparing the three feature types (TF-IDF, pretrained Word2Vec, and custom-trained Word2Vec) shows that TF-IDF gives the best results for both classifiers. The SVM with TF-IDF features reaches the highest accuracy (0.930), followed by the perceptron with TF-IDF (0.913). Among the Word2Vec features, the custom-trained embeddings outperform the pretrained ones for both models: 0.899 versus 0.862 for the SVM, and 0.822 versus 0.814 for the perceptron. This suggests that embeddings trained on in-domain reviews suit this task better than generic pretrained ones, even though the pretrained model did better on the analogy tests. Both Word2Vec variants still fall short of TF-IDF, likely because averaging word vectors over a review discards much of the word-level and phrase-level information that the unigram and bigram TF-IDF features retain. Finally, the SVM outperforms the perceptron on every feature type, and the perceptron with pretrained Word2Vec features has the lowest accuracy overall.

# 4. Feedforward Neural Networks

In [39]:
accuracies = {}

### Preparing data for FFNN

The code remaps the sentiment labels in `final_sample` and `final_copy` from 1, 2, 3 to 0, 1, 2 using the dictionary `label_map` (`final_copy` contains only the labels 1 and 2, which become 0 and 1). Zero-based labels are needed because PyTorch's `nn.CrossEntropyLoss` expects class indices in the range 0 to C-1, where C is the number of classes.

In [40]:
# the model will need labels/classes to be 0, 1, or 2 instead of 1, 2, or 3
label_map = {1: 0, 2: 1, 3: 2}  # named label_map to avoid shadowing the built-in map()
final_sample["sentiment"] = final_sample["sentiment"].map(label_map)
final_copy["sentiment"] = final_copy["sentiment"].map(label_map)

In [41]:
train_data_binary, test_data_binary = train_test_split(final_copy, test_size=0.2, random_state=42)
train_data_ternary, test_data_ternary = train_test_split(final_sample, test_size=0.2, random_state=42)

train_labels_binary = train_data_binary["sentiment"]
test_labels_binary = test_data_binary["sentiment"]

train_labels_ternary = train_data_ternary["sentiment"]
test_labels_ternary = test_data_ternary["sentiment"]

In [42]:
# Create averaged word vector features
pretrained_word_embedding = averaged_word_vectorizer(train_data_binary['review_lemmatize'], wv, 300)
my_word_embedding= averaged_word_vectorizer(train_data_binary['review_lemmatize'], my_model.wv, 300)

pretrained_test_embedding = averaged_word_vectorizer(test_data_binary['review_lemmatize'], wv, 300)
my_test_embedding = averaged_word_vectorizer(test_data_binary['review_lemmatize'], my_model.wv, 300)

In [43]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

The code below defines the loss function and a `SentimentClassifier` class, a simple feedforward neural network for sentiment analysis built on PyTorch's `nn.Module`:

- **Loss Function (`criterion`)**: The code uses `nn.CrossEntropyLoss`, a common choice for classification tasks, which combines log-softmax and negative log-likelihood loss in a single class. It is defined once at module level, outside the classifier, and works for both the binary and the ternary task.

- **Network Architecture**:
  - The constructor (`__init__`) receives `input_dim` and `output_dim` as parameters to set up the network dimensions. `input_dim` refers to the size of the input features, and `output_dim` corresponds to the number of classes for the classification task.
  - The network comprises three fully connected (`nn.Linear`) layers. The first two linear layers map the input dimensions to 50 and then from 50 to 10. Each of these layers is followed by a ReLU activation function (`nn.ReLU`), introducing non-linearity to the model and enabling it to learn complex patterns.
  - The final linear layer (`fc3`) transforms the intermediate representation to the output size specified by `output_dim`, which should match the number of target classes in the sentiment analysis task.

- **Forward Pass (`forward` method)**: Defines how the input `x` flows through the network:
  - The input is first passed through the `fc1` layer, then activated by ReLU.
  - The activated output is fed into the `fc2` layer, followed by another ReLU activation.
  - Finally, the output of `fc2` is passed through `fc3` to produce the final output of the model. Note that there is no activation function after `fc3` since `nn.CrossEntropyLoss` expects raw scores (logits) to compute the loss; it applies the softmax function internally.

This architecture allows the `SentimentClassifier` to learn from textual feature inputs (like TF-IDF or word embeddings) and predict sentiment labels. The design is straightforward, making it a good starting point for sentiment analysis tasks while remaining flexible enough for further enhancements or adjustments based on specific requirements.
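To see why the network returns raw logits, it helps to write out what cross-entropy does with them. The sketch below computes the loss for a single example in plain Python as the negative log-softmax of the true class; `cross_entropy_from_logits` is a hypothetical helper, not PyTorch code, and PyTorch's `nn.CrossEntropyLoss` additionally averages this value over the batch by default.

```python
import math

def cross_entropy_from_logits(logits, label):
    """Negative log-softmax of the true class, computed stably via log-sum-exp."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[label]

loss_uniform = cross_entropy_from_logits([0.0, 0.0], label=0)  # model is undecided
loss_right = cross_entropy_from_logits([3.0, -1.0], label=0)   # confident and correct
loss_wrong = cross_entropy_from_logits([3.0, -1.0], label=1)   # confident and wrong
print(loss_uniform, loss_right, loss_wrong)
```

Equal logits give a loss of ln 2 (about 0.693) for two classes, a confident correct prediction gives a loss near zero, and a confident wrong prediction is penalized heavily. Applying a softmax inside the network as well would distort these values, which is why `forward` stops at `fc3`.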

In [44]:
criterion = nn.CrossEntropyLoss()

# Define the neural network architecture
class SentimentClassifier(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(SentimentClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, 50)
        self.fc2 = nn.Linear(50, 10)
        self.fc3 = nn.Linear(10, output_dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

The `train_model` and `evaluate_model` functions train and evaluate the classifier using PyTorch. `train_model` takes the model, an optimizer, the training features and labels, the number of epochs, and the batch size. It converts the data to tensors, iterates over shuffled mini-batches, and prints the mean training loss after each epoch.

In [45]:
# Function to train the model
def train_model(model=None, optimizer=None, train_data=None, train_labels=None, epochs=10, batch_size=64):
    # Convert numpy arrays to PyTorch tensors
    train_features_tensor = torch.tensor(train_data).float()
    train_labels_tensor = torch.tensor(train_labels.to_numpy()).long()

    # Create TensorDatasets and DataLoaders
    train_dataset = TensorDataset(train_features_tensor, train_labels_tensor)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    
    model.train()
    for epoch in range(epochs):
        losses = []
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            losses.append(loss.item())  # store a plain float so the computation graph is not kept in memory
            loss.backward()
            optimizer.step()
        print(f'Loss in epoch {epoch}: {sum(losses)/len(losses)}')

After training, `evaluate_model` measures the model's accuracy on the test dataset. It prints and returns the test accuracy, which the calling cell stores in the `accuracies` dict for later comparison.

In [46]:
# Function to evaluate the model
def evaluate_model(model, test_data=None, test_labels=None, batch_size=64):
    # Convert numpy arrays to PyTorch tensors
    test_features_tensor = torch.tensor(test_data).float()
    test_labels_tensor = torch.tensor(test_labels.to_numpy()).long()
    
    # Create TensorDatasets and DataLoaders
    test_dataset = TensorDataset(test_features_tensor, test_labels_tensor)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
    
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            outputs = model(inputs)
            # print(torch.max(outputs.data, 1))
            _, predicted = torch.max(outputs.data, 1)
            # print(_, predicted.size())
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    testing_accuracy = correct / total
    
    print(f'Binary classification accuracy: {testing_accuracy}')
    return testing_accuracy
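
The accuracy computation in `evaluate_model` can be illustrated on a few toy values. The sketch below is a NumPy analogue of the `torch.max(outputs.data, 1)` step; the logits and labels are made up for illustration and are not model output.

```python
import numpy as np

# Toy logits for 4 samples and 3 classes (made-up values, not model output)
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.2, 0.3, 0.1],
    [-0.5, 0.0, 1.5],
    [1.0, 0.9, 0.8],
])
labels = np.array([0, 2, 2, 1])

# The predicted class is the index of the largest logit in each row
predicted = logits.argmax(axis=1)

# Accuracy is the fraction of rows where prediction and label agree
accuracy = (predicted == labels).mean()
print(predicted, accuracy)
```

Only the position of the largest logit matters here, so no softmax is needed before taking the argmax.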

In [47]:
train_labels_binary = train_labels_binary.reset_index(drop=True)
train_labels_ternary = train_labels_ternary.reset_index(drop=True)

## FFNN on binary data with pretrained model's features


In [48]:
# Binary classification model
binary_model = SentimentClassifier(input_dim=300, output_dim=2) # for binary classification
binary_optimizer = optim.Adam(binary_model.parameters())

train_model(binary_model, optimizer=binary_optimizer, train_data=pretrained_word_embedding, train_labels=train_labels_binary, epochs=50)

# Evaluate the binary model
accuracies["bin-pre-avg"] = evaluate_model(binary_model, test_data=pretrained_test_embedding, test_labels=test_labels_binary)

Loss in epoch 0: 0.3489203453063965
Loss in epoch 1: 0.2997763454914093
Loss in epoch 2: 0.2846796214580536
Loss in epoch 3: 0.27580681443214417
Loss in epoch 4: 0.26867684721946716
Loss in epoch 5: 0.2633201777935028
Loss in epoch 6: 0.2589226961135864
Loss in epoch 7: 0.25452709197998047
Loss in epoch 8: 0.2508869767189026
Loss in epoch 9: 0.2482881397008896
Loss in epoch 10: 0.24546802043914795
Loss in epoch 11: 0.24271850287914276
Loss in epoch 12: 0.24043893814086914
Loss in epoch 13: 0.23691962659358978
Loss in epoch 14: 0.23528271913528442
Loss in epoch 15: 0.2326139211654663
Loss in epoch 16: 0.23021991550922394
Loss in epoch 17: 0.22890351712703705
Loss in epoch 18: 0.22681310772895813
Loss in epoch 19: 0.2252882868051529
Loss in epoch 20: 0.22352442145347595
Loss in epoch 21: 0.2220490425825119
Loss in epoch 22: 0.22070713341236115
Loss in epoch 23: 0.219153493642807
Loss in epoch 24: 0.21718749403953552
Loss in epoch 25: 0.21598368883132935
Loss in epoch 26: 0.21447589993476

## FFNN on binary data with my model's features

In [49]:
# Binary classification model
binary_model = SentimentClassifier(input_dim=300, output_dim=2) # for binary classification
binary_optimizer = optim.Adam(binary_model.parameters())

train_model(binary_model, optimizer=binary_optimizer, train_data=my_word_embedding, train_labels=train_labels_binary, epochs=50)

# Evaluate the binary model
accuracies["bin-my-avg"] = evaluate_model(binary_model, test_data=my_test_embedding, test_labels=test_labels_binary)

Loss in epoch 0: 0.2577497065067291
Loss in epoch 1: 0.23220810294151306
Loss in epoch 2: 0.22442063689231873
Loss in epoch 3: 0.21897192299365997
Loss in epoch 4: 0.21462886035442352
Loss in epoch 5: 0.21122637391090393
Loss in epoch 6: 0.2083522528409958
Loss in epoch 7: 0.20586560666561127
Loss in epoch 8: 0.2031925469636917
Loss in epoch 9: 0.20129887759685516
Loss in epoch 10: 0.19922475516796112
Loss in epoch 11: 0.19748328626155853
Loss in epoch 12: 0.19526530802249908
Loss in epoch 13: 0.1941741406917572
Loss in epoch 14: 0.1925799697637558
Loss in epoch 15: 0.19123531877994537
Loss in epoch 16: 0.1895778775215149
Loss in epoch 17: 0.18833012878894806
Loss in epoch 18: 0.186847984790802
Loss in epoch 19: 0.18609893321990967
Loss in epoch 20: 0.1846332997083664
Loss in epoch 21: 0.18396411836147308
Loss in epoch 22: 0.18260790407657623
Loss in epoch 23: 0.18186454474925995
Loss in epoch 24: 0.18091736733913422
Loss in epoch 25: 0.18012218177318573
Loss in epoch 26: 0.17903977632

## FFNN on ternary data with pretrained model's features

In [50]:
# Create averaged word vector features
pretrained_word_embedding = averaged_word_vectorizer(train_data_ternary['review_lemmatize'], wv, 300)
my_word_embedding= averaged_word_vectorizer(train_data_ternary['review_lemmatize'], my_model.wv, 300)

pretrained_test_embedding = averaged_word_vectorizer(test_data_ternary['review_lemmatize'], wv, 300)
my_test_embedding = averaged_word_vectorizer(test_data_ternary['review_lemmatize'], my_model.wv, 300)

In [51]:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Re-training word2vec model on ternary data
my_model = Word2Vec(sentences=train_data_ternary.review_lemmatize.apply(lambda x: simple_preprocess(x)), vector_size=300, window=11, min_count=10, workers=4, sg=1)

In [52]:
# Ternary classification model
ternary_model = SentimentClassifier(input_dim=300, output_dim=3) # for ternary classification
ternary_optimizer = optim.Adam(ternary_model.parameters())

train_model(ternary_model, optimizer=ternary_optimizer, train_data=pretrained_word_embedding, train_labels=train_labels_ternary, epochs=50)

# Evaluate the ternary model
accuracies["ter-pre-avg"] = evaluate_model(ternary_model, test_data=pretrained_test_embedding, test_labels=test_labels_ternary)

Loss in epoch 0: 0.7192530632019043
Loss in epoch 1: 0.6450977921485901
Loss in epoch 2: 0.6249060034751892
Loss in epoch 3: 0.6133942604064941
Loss in epoch 4: 0.6053060293197632
Loss in epoch 5: 0.599433422088623
Loss in epoch 6: 0.5937463045120239
Loss in epoch 7: 0.5896198153495789
Loss in epoch 8: 0.585729718208313
Loss in epoch 9: 0.581692099571228
Loss in epoch 10: 0.5789421796798706
Loss in epoch 11: 0.5760374069213867
Loss in epoch 12: 0.5732020139694214
Loss in epoch 13: 0.5714486837387085
Loss in epoch 14: 0.5687288045883179
Loss in epoch 15: 0.5671951770782471
Loss in epoch 16: 0.5648314356803894
Loss in epoch 17: 0.5630868673324585
Loss in epoch 18: 0.5616310834884644
Loss in epoch 19: 0.5600045919418335
Loss in epoch 20: 0.5585857033729553
Loss in epoch 21: 0.556941032409668
Loss in epoch 22: 0.5562700629234314
Loss in epoch 23: 0.5546447038650513
Loss in epoch 24: 0.5535259246826172
Loss in epoch 25: 0.551810085773468
Loss in epoch 26: 0.5510674715042114
Loss in epoch 27

## FFNN on ternary data with my model's features

In [53]:
# Ternary classification model
ternary_model = SentimentClassifier(input_dim=300, output_dim=3) # for ternary classification
ternary_optimizer = optim.Adam(ternary_model.parameters())

train_model(ternary_model, optimizer=ternary_optimizer, train_data=my_word_embedding, train_labels=train_labels_ternary, epochs=50)

# Evaluate the ternary model
accuracies["ter-my-avg"] = evaluate_model(ternary_model, test_data=my_test_embedding, test_labels=test_labels_ternary)

Loss in epoch 0: 0.5888786911964417
Loss in epoch 1: 0.5536072254180908
Loss in epoch 2: 0.5439688563346863
Loss in epoch 3: 0.5374993085861206
Loss in epoch 4: 0.5318558812141418
Loss in epoch 5: 0.5282536149024963
Loss in epoch 6: 0.5247336030006409
Loss in epoch 7: 0.5219700336456299
Loss in epoch 8: 0.5197785496711731
Loss in epoch 9: 0.5176863670349121
Loss in epoch 10: 0.5158699154853821
Loss in epoch 11: 0.5139257907867432
Loss in epoch 12: 0.5123432874679565
Loss in epoch 13: 0.5108899474143982
Loss in epoch 14: 0.509605884552002
Loss in epoch 15: 0.5083060264587402
Loss in epoch 16: 0.5071828365325928
Loss in epoch 17: 0.5058382749557495
Loss in epoch 18: 0.5047503709793091
Loss in epoch 19: 0.503792405128479
Loss in epoch 20: 0.5029356479644775
Loss in epoch 21: 0.5023841857910156
Loss in epoch 22: 0.5011320114135742
Loss in epoch 23: 0.5007970333099365
Loss in epoch 24: 0.4996083676815033
Loss in epoch 25: 0.49881136417388916
Loss in epoch 26: 0.49829280376434326
Loss in epo

## 4 part b

This code defines a class `MLP`, a multi-layer perceptron built on PyTorch's `nn.Module`. It takes an input dimension `input_dim` and an output dimension `output_dim`, and uses two hidden layers of sizes 50 and 10. Here's how it's structured:

1. **Initialization (`__init__` method)**: The constructor takes two arguments: `input_dim` for the size of the input layer and `output_dim` for the size of the output layer. It defines a network architecture with two hidden layers, specified by the `hidden_sizes` list containing 50 and 10 neurons, respectively.

2. **Layer Construction**: 
   - Starts with a `nn.Flatten()` layer to ensure input tensors are flattened (useful if the input comes from previous layers that do not output flat vectors).
   - Iteratively adds pairs of `nn.Linear` and `nn.ReLU` layers to the network based on `hidden_sizes`. The first `nn.Linear` layer's input size is `input_dim`, and subsequent layers use the previous layer's size. Each `nn.Linear` layer is followed by a `nn.ReLU` activation function for non-linearity.
   - Concludes with an `nn.Linear` layer that maps from the last hidden layer to the output layer, sized according to `output_dim`.

3. **Sequential Container**: The layers are wrapped in an `nn.Sequential` container, which automates the forward pass in the order the layers were added.

4. **Forward Pass (`forward` method)**: Defines the forward propagation through the network. Given an input `x`, it passes it through the sequential container (`self.model`) and returns the output. PyTorch invokes this method whenever the model instance is called, for example `model(inputs)`.

This class allows customization of the input and output dimensions, so the same architecture serves both the binary and the ternary task. The hidden layer sizes are fixed at 50 and 10 inside the constructor.

In [55]:
class MLP(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(MLP, self).__init__()
        hidden_sizes=[50,10]
        layers = []
        layers.append(nn.Flatten())
        for i in range(len(hidden_sizes)):
            layers.append(nn.Linear(input_dim if i == 0 else hidden_sizes[i-1], hidden_sizes[i]))
            layers.append(nn.ReLU())
        layers.append(nn.Linear(hidden_sizes[-1], output_dim))
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
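
As a rough check of how large this network becomes with concatenated features, the number of trainable parameters can be counted from the layer sizes alone. This is a minimal sketch; `count_mlp_parameters` is a hypothetical helper, not part of the notebook, and it assumes every layer is a plain fully connected layer with a bias.

```python
def count_mlp_parameters(input_dim, output_dim, hidden_sizes=(50, 10)):
    """Count weights and biases of a stack of fully connected layers."""
    sizes = [input_dim, *hidden_sizes, output_dim]
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total

binary_params = count_mlp_parameters(3000, 2)
ternary_params = count_mlp_parameters(3000, 3)
print(binary_params, ternary_params)
```

Almost all parameters sit in the first layer, because the 3000-dimensional input is connected to 50 hidden units.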

The `concatenated_vectors` function processes a given review text to create a fixed-length vector representation using word vectors from a pretrained model. Here's the breakdown:

- **Word Vector Retrieval**: It splits the review into words and fetches their corresponding vectors from the `model`, provided the words exist in the model's vocabulary.

- **Zero Vector for Missing Words**: If the review doesn't contain any words found in the model, a zero vector of length `300 * max_words` is returned, assuming each word vector is 300-dimensional.

- **Limiting Word Vectors**: The function limits the number of word vectors to `max_words`, truncating longer sequences.

- **Padding Short Sequences**: If the number of word vectors is less than `max_words`, the function pads the sequence with zero vectors to reach the `max_words` length, ensuring uniform vector size.

- **Flattening**: The sequence of vectors (either truncated or padded) is flattened into a single vector and returned.

This approach allows for consistency in input vector size across varying lengths of review texts, making it suitable for machine learning models that require fixed-size inputs.

In [56]:
def concatenated_vectors(review, model, max_words=10):
    words_vector = [model[word] for word in review.split() if word in model]
    if len(words_vector) == 0:
        return np.zeros(300 * max_words)
    
    words_vector = np.array(words_vector[:max_words])
    padding_size = max_words - words_vector.shape[0]

    if padding_size > 0:
        padding = np.zeros((padding_size, words_vector.shape[1]))
        words_vector = np.concatenate([words_vector, padding])

    return words_vector.flatten()
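
The truncate, pad, and flatten behaviour described above can be seen on a small example. The sketch below uses a hypothetical helper `pad_and_flatten` and a toy dictionary of 2-dimensional vectors as a stand-in for the Word2Vec model; it mirrors the logic of `concatenated_vectors` but is not the notebook's function.

```python
import numpy as np

def pad_and_flatten(tokens, vectors, max_words, dim):
    """Hypothetical helper: look up tokens, truncate/pad to max_words, flatten."""
    found = [vectors[t] for t in tokens if t in vectors]
    out = np.zeros((max_words, dim))
    for i, vec in enumerate(found[:max_words]):
        out[i] = vec  # rows beyond the found words stay zero (padding)
    return out.flatten()

# Toy 2-dimensional "embeddings" (made-up values)
toy_vectors = {
    "good": np.array([1.0, 2.0]),
    "pen": np.array([3.0, 4.0]),
    "bad": np.array([5.0, 6.0]),
}

short = pad_and_flatten("good pen".split(), toy_vectors, max_words=3, dim=2)
long = pad_and_flatten("good pen bad good".split(), toy_vectors, max_words=3, dim=2)
unknown = pad_and_flatten("zzz".split(), toy_vectors, max_words=3, dim=2)
print(short, long, unknown)
```

All three results have length `max_words * dim`, which is why the MLP for concatenated features uses `input_dim=3000` with 10 words of 300 dimensions each.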

In [57]:
del my_word_embedding, my_test_embedding

In [58]:
del binary_model, ternary_model

In [59]:
pretrained_word_embedding = np.asarray([concatenated_vectors(review=_review, model=wv, max_words=10) for _review in train_data_binary.review_lemmatize])
pretrained_test_embedding = np.asarray([concatenated_vectors(review=review, model=wv) for review in test_data_binary.review_lemmatize])

## Pre-trained model with binary classes and concatenated features

In [60]:
# Binary classification model
binary_model = MLP(input_dim=3000, output_dim=2) # for binary classification
binary_optimizer = optim.Adam(binary_model.parameters())

train_model(binary_model, optimizer=binary_optimizer, train_data=pretrained_word_embedding, train_labels=train_labels_binary, epochs=25)

# Evaluate the binary model
accuracies["bin-pre-con"] = evaluate_model(binary_model, test_data=pretrained_test_embedding, test_labels=test_labels_binary)

Loss in epoch 0: 0.4530833065509796
Loss in epoch 1: 0.39451682567596436
Loss in epoch 2: 0.35656094551086426
Loss in epoch 3: 0.3198108673095703
Loss in epoch 4: 0.28353092074394226
Loss in epoch 5: 0.24896378815174103
Loss in epoch 6: 0.21619237959384918
Loss in epoch 7: 0.18852007389068604
Loss in epoch 8: 0.16450753808021545
Loss in epoch 9: 0.14435173571109772
Loss in epoch 10: 0.12878350913524628
Loss in epoch 11: 0.11453308165073395
Loss in epoch 12: 0.10356079041957855
Loss in epoch 13: 0.0942152813076973
Loss in epoch 14: 0.08782602101564407
Loss in epoch 15: 0.08074415475130081
Loss in epoch 16: 0.0753210186958313
Loss in epoch 17: 0.07034900784492493
Loss in epoch 18: 0.0673942118883133
Loss in epoch 19: 0.06359957158565521
Loss in epoch 20: 0.05768275260925293
Loss in epoch 21: 0.05741740018129349
Loss in epoch 22: 0.055370036512613297
Loss in epoch 23: 0.05085262656211853
Loss in epoch 24: 0.04838799685239792
Binary classification accuracy: 0.776125


## My model with binary classes and concatenated features

In [61]:
my_word_embedding = np.asarray([concatenated_vectors(review=_review, model=my_model.wv, max_words=10) for _review in train_data_binary.review_lemmatize])
my_test_embedding = np.asarray([concatenated_vectors(review=review, model=my_model.wv) for review in test_data_binary.review_lemmatize])

In [62]:
my_model = Word2Vec(sentences=train_data_binary.review_lemmatize.apply(lambda x: simple_preprocess(x)), vector_size=300, window=11, min_count=10, workers=4, sg=1)

In [63]:
# Binary classification model
binary_model = MLP(input_dim=3000, output_dim=2) # for binary classification
binary_optimizer = optim.Adam(binary_model.parameters())

train_model(binary_model, optimizer=binary_optimizer, train_data=my_word_embedding, train_labels=train_labels_binary, epochs=25)

# Evaluate the binary model
accuracies["bin-my-con"] = evaluate_model(binary_model, test_data=my_test_embedding, test_labels=test_labels_binary)

Loss in epoch 0: 0.3980901837348938
Loss in epoch 1: 0.35928189754486084
Loss in epoch 2: 0.33583956956863403
Loss in epoch 3: 0.30978691577911377
Loss in epoch 4: 0.28075456619262695
Loss in epoch 5: 0.24907425045967102
Loss in epoch 6: 0.21675018966197968
Loss in epoch 7: 0.18540360033512115
Loss in epoch 8: 0.15746156871318817
Loss in epoch 9: 0.1304992139339447
Loss in epoch 10: 0.11095687001943588
Loss in epoch 11: 0.0957595556974411
Loss in epoch 12: 0.08113052695989609
Loss in epoch 13: 0.07158970087766647
Loss in epoch 14: 0.0649472251534462
Loss in epoch 15: 0.056296128779649734
Loss in epoch 16: 0.05383606627583504
Loss in epoch 17: 0.0497620552778244
Loss in epoch 18: 0.045983415096998215
Loss in epoch 19: 0.04342653602361679
Loss in epoch 20: 0.03997180983424187
Loss in epoch 21: 0.03845454379916191
Loss in epoch 22: 0.03768574073910713
Loss in epoch 23: 0.03504708781838417
Loss in epoch 24: 0.0341111496090889
Binary classification accuracy: 0.79065


## Pre-trained model with ternary classes and concatenated features

In [64]:
pretrained_word_embedding = np.asarray([concatenated_vectors(review=_review, model=wv, max_words=10) for _review in train_data_ternary.review_lemmatize])
pretrained_test_embedding = np.asarray([concatenated_vectors(review=review, model=wv) for review in test_data_ternary.review_lemmatize])

In [65]:
# Ternary classification model
ternary_model = MLP(input_dim=3000, output_dim=3) # for ternary classification
ternary_optimizer = optim.Adam(ternary_model.parameters())

train_model(ternary_model, optimizer=ternary_optimizer, train_data=pretrained_word_embedding, train_labels=train_labels_ternary, epochs=50)

# Evaluate the ternary model
accuracies["ter-pre-con"] = evaluate_model(ternary_model, test_data=pretrained_test_embedding, test_labels=test_labels_ternary)

Loss in epoch 0: 0.8186842203140259
Loss in epoch 1: 0.7316397428512573
Loss in epoch 2: 0.6879225969314575
Loss in epoch 3: 0.6512019038200378
Loss in epoch 4: 0.6178866624832153
Loss in epoch 5: 0.5874897837638855
Loss in epoch 6: 0.5589696764945984
Loss in epoch 7: 0.5338996648788452
Loss in epoch 8: 0.5116013288497925
Loss in epoch 9: 0.49115726351737976
Loss in epoch 10: 0.4721020758152008
Loss in epoch 11: 0.45491644740104675
Loss in epoch 12: 0.4395006597042084
Loss in epoch 13: 0.42477601766586304
Loss in epoch 14: 0.41179630160331726
Loss in epoch 15: 0.3995373547077179
Loss in epoch 16: 0.38791102170944214
Loss in epoch 17: 0.3765275478363037
Loss in epoch 18: 0.3660672605037689
Loss in epoch 19: 0.35694098472595215
Loss in epoch 20: 0.34789901971817017
Loss in epoch 21: 0.3394908308982849
Loss in epoch 22: 0.33109062910079956
Loss in epoch 23: 0.3247874677181244
Loss in epoch 24: 0.316448450088501
Loss in epoch 25: 0.3099673092365265
Loss in epoch 26: 0.30317214131355286
Los

## My model with ternary classes and concatenated features

In [66]:
my_word_embedding = np.asarray([concatenated_vectors(review=_review, model=my_model.wv, max_words=10) for _review in train_data_ternary.review_lemmatize])
my_test_embedding = np.asarray([concatenated_vectors(review=review, model=my_model.wv) for review in test_data_ternary.review_lemmatize])

In [67]:
# Ternary classification model
ternary_model = MLP(input_dim=3000, output_dim=3) # for ternary classification
ternary_optimizer = optim.Adam(ternary_model.parameters())

train_model(ternary_model, optimizer=ternary_optimizer, train_data=my_word_embedding, train_labels=train_labels_ternary, epochs=50)

# Evaluate the ternary model
accuracies["ter-my-con"] = evaluate_model(ternary_model, test_data=my_test_embedding, test_labels=test_labels_ternary)

Loss in epoch 0: 0.7359956502914429
Loss in epoch 1: 0.6848767995834351
Loss in epoch 2: 0.6637907028198242
Loss in epoch 3: 0.6429949998855591
Loss in epoch 4: 0.6211372017860413
Loss in epoch 5: 0.5977684855461121
Loss in epoch 6: 0.5749894380569458
Loss in epoch 7: 0.5524516105651855
Loss in epoch 8: 0.5303213000297546
Loss in epoch 9: 0.5105023384094238
Loss in epoch 10: 0.4917812943458557
Loss in epoch 11: 0.4754505455493927
Loss in epoch 12: 0.4601830840110779
Loss in epoch 13: 0.44584763050079346
Loss in epoch 14: 0.4333261251449585
Loss in epoch 15: 0.42097917199134827
Loss in epoch 16: 0.4098316431045532
Loss in epoch 17: 0.39957061409950256
Loss in epoch 18: 0.3896203935146332
Loss in epoch 19: 0.3804463744163513
Loss in epoch 20: 0.37094733119010925
Loss in epoch 21: 0.36268481612205505
Loss in epoch 22: 0.3556191921234131
Loss in epoch 23: 0.3474865257740021
Loss in epoch 24: 0.3405013978481293
Loss in epoch 25: 0.335632860660553
Loss in epoch 26: 0.32744455337524414
Loss i

In [68]:
del pretrained_word_embedding, pretrained_test_embedding

## What do you conclude by comparing accuracy values you obtain with those obtained in the "Simple Models" section?

# 5. CNN

The `create_word_vector` function processes a single review text to generate a fixed-size word vector representation using a pretrained Word2Vec model. It performs the following steps:

- It tokenizes the review into individual words and retrieves their corresponding vectors from the Word2Vec model, ensuring only words present in the model are considered.
- If no words from the review are found in the model, a tensor of zeros with dimensions `[max_length, vector_size]` is returned to represent an empty word vector.
- If there are word vectors, they are truncated or padded with zeros to ensure the output tensor has a uniform shape of `[max_length, vector_size]`, where `max_length` is the specified maximum number of words and `vector_size` is the dimensionality of each word vector.
- The function returns this tensor, which can be directly used for model input, ensuring consistency in input size regardless of the original review length.

In [70]:
def create_word_vector(review, model, max_length, vector_size):
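    # Look up the vector of every word that exists in the model's vocabulary
    words_vector = [model[word] for word in review.split() if word in model]
    if len(words_vector) == 0:
        words_vector = torch.zeros(max_length, vector_size)
    else:
        # Stack into one ndarray first; building a tensor from a list of arrays is slow and triggers a warning
        words_vector = torch.tensor(np.array(words_vector[:max_length]))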

    # Pad or truncate the sequence
    if words_vector.shape[0] < max_length:
        padding_size = max_length - words_vector.shape[0]
        padding = torch.zeros(padding_size, vector_size)
        words_vector = torch.cat([words_vector, padding])

    return words_vector

In [78]:
del final_copy, final_sample

The `run_cnn` function orchestrates the training and evaluation of a Convolutional Neural Network (CNN) for text classification tasks using PyTorch. It accepts both training and testing datasets along with their labels, parameters defining the CNN architecture, a Word2Vec model for vectorizing text, and other training parameters. Here's a detailed overview:

### Parameters:
- `train_data`, `test_data`: Arrays or lists containing the raw text reviews for training and testing, respectively.
- `train_labels`, `test_labels`: The corresponding labels for the training and testing datasets, passed as pandas Series (the function reads their `.values`).
- `vector_size`: The dimensionality of the Word2Vec vectors.
- `op_channel_s1`, `op_channel_s2`: The number of output channels for the first and second convolutional layers, respectively.
- `num_classes`: The number of classes for classification.
- `word2vec`: A pretrained Word2Vec model used to vectorize the text data.
- `max_length`: The maximum number of words considered from each review for creating fixed-length input vectors (default is 50).
- `batch_size`: The size of each batch used during training (default is 32).
- `epochs`: The number of training iterations over the entire dataset (default is 10).
- `learning_rate`: The learning rate for the optimizer (default is 0.001).

### Process:
1. **Data Preparation**: It vectorizes the text data from `train_data` and `test_data` using the `create_word_vector` function, considering only the first `max_length` words from each review and setting the vector size as per the `word2vec` model. The vectorized data is then converted to PyTorch tensors.

2. **DataLoader Creation**: The tensors are wrapped in a `TensorDataset` and loaded using a `DataLoader`, facilitating efficient batch processing during training and testing.

3. **CNN Model Definition**: Defines a sequential CNN model with two convolutional layers followed by ReLU activations, a flattening layer, and a final linear layer for classification. The CNN expects input with dimensions `[batch_size, vector_size, max_length]`.

4. **Training**: In the training loop, the model is trained on the `train_loader` dataset using the Adam optimizer and CrossEntropyLoss. The inputs are permuted to match the CNN's expected input dimensions, and the average loss per epoch is printed.

5. **Evaluation**: The function evaluates the model's accuracy on the `test_loader` dataset, calculating the proportion of correctly predicted labels.

### Returns:
- `testing_accuracy`: The accuracy of the model on the test dataset, calculated as the number of correct predictions divided by the total number of predictions.

This function provides a comprehensive workflow for training and testing a CNN model for text classification, encapsulating data preprocessing, model training, and evaluation within a single, convenient function.
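
The `max_length - 4` term in the final linear layer follows from the output length of an unpadded 1-D convolution. The sketch below applies the output-length formula given in the PyTorch `Conv1d` documentation; `conv1d_output_length` is a hypothetical helper introduced only for this illustration.

```python
def conv1d_output_length(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution for the given hyperparameters."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

max_length = 50
op_channel_s2 = 10

# Each kernel_size=3 layer without padding shortens the sequence by 2
after_conv1 = conv1d_output_length(max_length, kernel_size=3)
after_conv2 = conv1d_output_length(after_conv1, kernel_size=3)

# Number of features that reach nn.Linear after nn.Flatten
flattened_features = op_channel_s2 * after_conv2
print(after_conv1, after_conv2, flattened_features)
```

If the kernel sizes, padding, or number of convolutional layers change, the input size of the linear layer has to be recomputed in the same way.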

In [80]:
def run_cnn(train_data, train_labels, test_data, test_labels, vector_size, op_channel_s1, op_channel_s2, num_classes, word2vec, max_length=50, batch_size=32, epochs=10, learning_rate=0.001):
    # Use the function's own max_length and vector_size instead of hard-coded values
    train_data_limit = np.asarray([create_word_vector(review=str(review), model=word2vec, max_length=max_length, vector_size=vector_size) for review in train_data])
    test_data_limit = np.asarray([create_word_vector(review=str(review), model=word2vec, max_length=max_length, vector_size=vector_size) for review in test_data])
    
    train_data_tensor = torch.from_numpy(train_data_limit).to(dtype=torch.float32)
    train_labels_tensor = torch.tensor(train_labels.values, dtype=torch.long)
    train_dataset = TensorDataset(train_data_tensor, train_labels_tensor)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    
    test_data_tensor = torch.from_numpy(test_data_limit).to(dtype=torch.float32)
    test_labels_tensor = torch.tensor(test_labels.values, dtype=torch.long)
    test_dataset = TensorDataset(test_data_tensor, test_labels_tensor)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    # Define the CNN model using Sequential
    model = nn.Sequential(
        nn.Conv1d(in_channels=vector_size, out_channels=op_channel_s1, kernel_size=3),
        nn.ReLU(),
        nn.Conv1d(in_channels=op_channel_s1, out_channels=op_channel_s2, kernel_size=3),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(op_channel_s2 * (max_length - 4), num_classes)
    )

    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    print('\nTraining Loss for each epoch :')
    # Training loop
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0.0
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            # Conv1d expects [batch, channels, length], so move the embedding axis to position 1
            outputs = model(inputs.permute(0, 2, 1))
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        # Average loss for the epoch
        print(f'Loss in epoch {epoch}: {epoch_loss / len(train_loader)}')
        
    # Testing loop
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            outputs = model(inputs.permute(0, 2, 1))
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    testing_accuracy = correct / total
    return testing_accuracy
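
The `permute(0, 2, 1)` call in both loops swaps the word axis and the embedding axis so that the embedding dimensions act as convolution channels. The sketch below shows the same axis swap with NumPy's `transpose`, used here as an analogue of the tensor operation; the batch is filled with made-up values.

```python
import numpy as np

batch_size, max_length, vector_size = 4, 50, 300

# One batch as produced by the data loader: [batch, words, embedding]
batch = np.zeros((batch_size, max_length, vector_size), dtype=np.float32)
batch[0, 7, :] = 1.0  # mark word position 7 of the first review

# Channels-first layout expected by a 1-D convolution: [batch, embedding, words]
channels_first = batch.transpose(0, 2, 1)
print(channels_first.shape)
```

After the swap, the convolution slides along the last axis, so each kernel sees three consecutive words across all 300 embedding channels.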


In [81]:
del my_word_embedding, my_test_embedding

### binary data on my model

In [82]:
my_word_embedding= averaged_word_vectorizer(train_data_binary['review_lemmatize'], my_model.wv, 300)
my_test_embedding = averaged_word_vectorizer(test_data_binary['review_lemmatize'], my_model.wv, 300)

In [83]:
vector_size = 300
op_channel_s1 = 50
op_channel_s2 = 10
output_size = 2

accuracy_binary = run_cnn(
    train_data_binary.review_lemmatize, train_labels_binary, test_data_binary.review_lemmatize, test_labels_binary, vector_size, op_channel_s1, op_channel_s2, output_size, my_model.wv,
)

print(f'Binary Test Accuracy: {accuracy_binary}')
accuracies['bin-my-cnn'] = accuracy_binary  # separate key so the FFNN result is not overwritten


Training Loss for each epoch :
Loss in epoch 0: 0.2515266239643097
Loss in epoch 1: 0.20382119715213776
Loss in epoch 2: 0.1852751523256302
Loss in epoch 3: 0.16978009045124054
Loss in epoch 4: 0.15757496654987335
Loss in epoch 5: 0.14520657062530518
Loss in epoch 6: 0.13318829238414764
Loss in epoch 7: 0.1235203742980957
Loss in epoch 8: 0.11423000693321228
Loss in epoch 9: 0.10546647757291794
Binary Test Accuracy: 0.9125


### ternary data on my model

In [84]:
my_word_embedding= averaged_word_vectorizer(train_data_ternary['review_lemmatize'], my_model.wv, 300)
my_test_embedding = averaged_word_vectorizer(test_data_ternary['review_lemmatize'], my_model.wv, 300)

In [86]:
vector_size = 300
op_channel_s1 = 50
op_channel_s2 = 10
output_size = 3

accuracy_ternary = run_cnn(
    train_data_ternary.review_lemmatize, train_labels_ternary, test_data_ternary.review_lemmatize, test_labels_ternary, vector_size, op_channel_s1, op_channel_s2, output_size, my_model.wv,
)

print(f'Ternary Test Accuracy: {accuracy_ternary}')
accuracies['ter-my-cnn'] = accuracy_ternary  # separate key so the FFNN result is not overwritten


Training Loss for each epoch :
Loss in epoch 0: 0.5690693259239197
Loss in epoch 1: 0.5156851410865784
Loss in epoch 2: 0.4962995648384094
Loss in epoch 3: 0.4820621907711029
Loss in epoch 4: 0.4688960611820221
Loss in epoch 5: 0.4572489559650421
Loss in epoch 6: 0.4471859633922577
Loss in epoch 7: 0.43758049607276917
Loss in epoch 8: 0.4283675253391266
Loss in epoch 9: 0.4206487834453583
Ternary Test Accuracy: 0.7782


### binary data on pretrained model

In [87]:
my_word_embedding = np.asarray([concatenated_vectors(review=_review, model=my_model.wv, max_words=10) for _review in train_data_binary.review_lemmatize])
my_test_embedding = np.asarray([concatenated_vectors(review=review, model=my_model.wv) for review in test_data_binary.review_lemmatize])

In [91]:
vector_size = 300
op_channel_s1 = 50
op_channel_s2 = 10
output_size = 2

accuracy_binary = run_cnn(
    train_data_binary.review_lemmatize, train_labels_binary, test_data_binary.review_lemmatize, test_labels_binary, vector_size, op_channel_s1, op_channel_s2, output_size, wv,
)

print(f'Binary Test Accuracy: {accuracy_binary}')
accuracies['bin-pre-cnn'] = accuracy_binary  # pretrained vectors; separate key so the FFNN result is not overwritten


Training Loss for each epoch :
Loss in epoch 0: 0.2650148868560791
Loss in epoch 1: 0.20370304584503174
Loss in epoch 2: 0.18364916741847992
Loss in epoch 3: 0.1674783080816269
Loss in epoch 4: 0.15297238528728485
Loss in epoch 5: 0.14112749695777893
Loss in epoch 6: 0.1287354677915573
Loss in epoch 7: 0.11844947189092636
Loss in epoch 8: 0.1087835431098938
Loss in epoch 9: 0.10171432793140411
Binary Test Accuracy: 0.904925


### ternary data on pretrained model

In [92]:
my_word_embedding = np.asarray([concatenated_vectors(review=_review, model=my_model.wv, max_words=10) for _review in train_data_ternary.review_lemmatize])
my_test_embedding = np.asarray([concatenated_vectors(review=review, model=my_model.wv) for review in test_data_ternary.review_lemmatize])

In [93]:
vector_size = 300
op_channel_s1 = 50
op_channel_s2 = 10
output_size = 3

accuracy_ternary = run_cnn(
    train_data_ternary.review_lemmatize, train_labels_ternary, test_data_ternary.review_lemmatize, test_labels_ternary, vector_size, op_channel_s1, op_channel_s2, output_size, wv,
)

print(f'Ternary Test Accuracy: {accuracy_ternary}')
accuracies['ter-pre-cnn'] = accuracy_ternary  # pretrained vectors; separate key so the FFNN result is not overwritten


Training Loss for each epoch :
Loss in epoch 0: 0.6057630181312561
Loss in epoch 1: 0.5228792428970337
Loss in epoch 2: 0.5015048384666443
Loss in epoch 3: 0.48573336005210876
Loss in epoch 4: 0.473037451505661
Loss in epoch 5: 0.4609887897968292
Loss in epoch 6: 0.45106977224349976
Loss in epoch 7: 0.4418560266494751
Loss in epoch 8: 0.433174192905426
Loss in epoch 9: 0.4255262017250061
Ternary Test Accuracy: 0.76922


| Classes | Model | Vector Type | Accuracy |
|---------|-------|-------------|----------|
|Binary|Perceptron|Word2Vec - Pretrained model|81.39%|
|Binary|Perceptron|Word2Vec - My model|82.18%|
|Binary|Perceptron|TF-IDF|91.33%|
|Binary|SVM|Word2Vec - Pretrained model|86.23%|
|Binary|SVM|Word2Vec - My model|89.88%|
|Binary|SVM|Word2Vec - My model|82.18%|
|Binary|SVM|TF-IDF|92.95%|
|Binary|FFNN|Word2Vec - Pretrained model (average vector function)|88.64%|
|Binary|FFNN|Word2Vec - My model (average vector function)|90.72%|
|Ternary|FFNN|Word2Vec - Pretrained model (average vector function)|74.02%|
|Ternary|FFNN|Word2Vec - My model (average vector function)|77.18%|
|Binary|FFNN|Word2Vec - Pretrained model (concatenated vector function)|77.61%|
|Binary|FFNN|Word2Vec - My model (concatenated vector function)|79.06%|
|Ternary|FFNN|Word2Vec - Pretrained model (concatenated vector function)|61.21%|
|Ternary|FFNN|Word2Vec - My model (concatenated vector function)|62.81%|
|Binary|CNN|Word2Vec - My model (custom vector function)|91.25%|
|Ternary|CNN|Word2Vec - My model (custom vector function)|77.82%|
|Binary|CNN|Word2Vec - Pretrained model (custom vector function)|90.49%|
|Ternary|CNN|Word2Vec - Pretrained model (custom vector function)|76.92%|
