## Implement Word2Vec and FastText Models with an RNN
### Bhuvana Kanakam, SE21UCSE035

#### Problem Statement :
Training an RNN model on the given dataset.

#### Dataset:
https://ai.stanford.edu/~amaas/data/sentiment/

#### Implementation: [40 marks]
1. Implement the Word2vec model and train the word vectors using
skip-gram model with negative sampling.
2. Implement the FastText model and train the word vectors [https://github.com/facebookresearch/fastText].
Hint: Make use of only “train” folder for training your word vectors.
3. You can use “test” folder and sentiment labels, i.e., pos and neg for
your sentiment classification task using RNN.
4. After creating word vectors using the methods provided above,
train your RNN model on the sentiment classification task by making
use of these word vectors.

#### Results and analysis [25 marks]
Present the results of your experiments including performance metrics for
each word vector technique used for sentiment classification task.
5. Make use of tables, graphs to compare results visually.
6. Discuss any findings and report all the hyperparameters for each
technique used during experimentation.

In [None]:
import torch
torch.cuda.is_available()

True

#### Import Libraries

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
import os
from gensim.models import Word2Vec, FastText
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
import os
import zipfile

## Import Data Sets

#### Mount Google Drive
My Drive -> data.zip -> data -> train, test -> pos,neg -> files

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### Extract The Data
- Unzip the `data.zip` file

In [None]:
zip_file_path = '/content/drive/My Drive/data.zip'
extracted_folder_path = '/content/data'

os.makedirs(extracted_folder_path, exist_ok=True)

with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extracted_folder_path)


- List the contents of the extracted folder

In [None]:
print("Contents of Extracted Folder:")
print(os.listdir(extracted_folder_path))


Contents of Extracted Folder:
['data', 'data.zip.download', '__MACOSX']


#### Define Paths

In [None]:
train_dir = os.path.join(extracted_folder_path, 'data', 'train')
test_dir = os.path.join(extracted_folder_path, 'data', 'test')

In [None]:
train_pos = os.path.join(train_dir, 'pos')
train_neg = os.path.join(train_dir, 'neg')
test_pos = os.path.join(test_dir, 'pos')
test_neg = os.path.join(test_dir, 'neg')

- List the first two files in each of the train and test `pos`/`neg` directories as an example.

In [None]:
print("Extracted Training Data - Positive:")
print(os.listdir(train_pos)[:2])

print("\nExtracted Training Data - Negative:")
print(os.listdir(train_neg)[:2])

print("\nExtracted Testing Data - Positive:")
print(os.listdir(test_pos)[:2])

print("\nExtracted Testing Data - Negative:")
print(os.listdir(test_neg)[:2])

Extracted Training Data - Positive:
['25_7.txt', '5363_9.txt']

Extracted Training Data - Negative:
['5701_1.txt', '3462_1.txt']

Extracted Testing Data - Positive:
['12442_10.txt', '5375_7.txt']

Extracted Testing Data - Negative:
['8160_3.txt', '6931_1.txt']
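
The file names listed above appear to follow an `<id>_<rating>.txt` pattern, where the second number is the star rating given by the reviewer. A minimal sketch of how that rating could be recovered from a file name is shown below; the helper `parse_review_filename` is hypothetical and is not used elsewhere in this notebook.

```python
import re

# Assumed naming scheme: "<review id>_<star rating>.txt", e.g. "25_7.txt".
FILENAME_PATTERN = re.compile(r"^(\d+)_(\d+)\.txt$")


def parse_review_filename(filename):
    """Return (review_id, rating) parsed from a name like '25_7.txt'."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected review filename: {filename!r}")
    return int(match.group(1)), int(match.group(2))


# File names taken from the directory listing above.
for name in ['25_7.txt', '5363_9.txt', '5701_1.txt', '3462_1.txt']:
    review_id, rating = parse_review_filename(name)
    print(name, "-> id:", review_id, "rating:", rating)
```

The binary labels used later (1 for `pos`, 0 for `neg`) come from the folder name, so this parsing step is optional and only useful for finer-grained analysis.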


## Load and preprocess data

#### Read Reviews
*Reads the text files in a given directory ('pos' or 'neg') and appends them to a list.*

In [None]:
def read_reviews(path, sentiment):
    reviews = []
    directory = os.path.join(path, sentiment)
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
            reviews.append(file.read())
    return reviews

- Read reviews from positive and negative directories in training and testing data

In [None]:
train_pos_reviews = read_reviews(train_dir, 'pos')
train_neg_reviews = read_reviews(train_dir, 'neg')
test_pos_reviews = read_reviews(test_dir, 'pos')
test_neg_reviews = read_reviews(test_dir, 'neg')

- Combine positive and negative reviews

In [None]:
train_texts = train_pos_reviews + train_neg_reviews
train_labels = [1] * len(train_pos_reviews) + [0] * len(train_neg_reviews)
test_texts = test_pos_reviews + test_neg_reviews
test_labels = [1] * len(test_pos_reviews) + [0] * len(test_neg_reviews)

- Total Number of Samples

In [None]:
print("Number of training samples:", len(train_texts))
print("Number of testing samples:", len(test_texts))

Number of training samples: 25000
Number of testing samples: 25000


- Print Samples

In [None]:
print("Sample Positive Reviews:")
for i in range(3):
    print(train_pos_reviews[i])

print("\nSample Negative Reviews:")
for i in range(3):
    print(train_neg_reviews[i])


Sample Positive Reviews:
I never saw this when I was a kid, so this was seen with fresh eyes. I had never heard of it and rented it for my 5 year old daughter. Plus, the idea of Christopher Walken singing and dancing made me curious. The special fx are cheesy and the singing and dancing is mediocre. But the story is great. My daughter was entranced. I loved watching Walken in this role thinking about what the future held for him. Very amusing to see him dance! And if the songs weren't great, at least they weren't Disney over-produced saccharine sweetness. The ogre scene in the beginning was a little scary for her, and she was a little nervous when we saw him again at the end, but it was mostly benign. Interestingly, we had recently read "Puss in Boots", and I had wondered about the implausibility of the story. But while staying true to almost every aspect, Walken's acting made it believable. Great fun. I'd watch it again with my daughter.
This is a classic action flick from the '80s fe

## Train Word Embedding Models

In [None]:
word2vec_model = Word2Vec(sentences=[text.split() for text in train_texts], vector_size=100, window=5, min_count=1, sg=1, negative=5)
fasttext_model = FastText(sentences=[text.split() for text in train_texts], vector_size=100, window=5, min_count=1, sg=1, negative=5)

word2vec_model.save("word2vec.model")
fasttext_model.save("fasttext.model")

- *Here, I used the `Word2Vec` and `FastText` classes from `gensim.models` to train the embeddings. `sentences=[text.split() for text in train_texts]` tokenizes each review into a list of whitespace-separated words. The parameters are: `vector_size`, the dimensionality of the word vectors; `window`, the maximum distance between the current and predicted word within a sentence; `min_count`, which ignores all words with a total frequency lower than this value; `sg`, the training algorithm (skip-gram if 1, otherwise CBOW); and `negative`, the number of negative samples. I then save the trained models as "word2vec.model" and "fasttext.model".*
- I also observed that both models take a long time to train. The training set contains 25,000 reviews (see the sample counts printed earlier), and Word2Vec and FastText iterate over the whole corpus multiple times to learn the embeddings, which is time-consuming.
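
To make the roles of `window`, `sg=1` and `negative` more concrete, here is a simplified sketch of how skip-gram training examples could be generated. The functions `skipgram_pairs` and `draw_negatives` are illustrative names, not gensim APIs, and the uniform negative sampling is an assumption made for brevity; the real implementation differs in several details, including how negatives are sampled.

```python
import random


def skipgram_pairs(tokens, window=5):
    """Yield (center, context) pairs for every context word within `window`."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def draw_negatives(vocab, context, k=5, rng=random):
    """Draw k words that differ from the true context word.

    Simplification: words are drawn uniformly at random.
    """
    candidates = [w for w in vocab if w != context]
    return [rng.choice(candidates) for _ in range(k)]


tokens = "the story is great".split()
pairs = list(skipgram_pairs(tokens, window=1))
print(pairs)

# For each positive pair, a few "negative" words are sampled as contrast.
vocab = sorted(set(tokens))
center, context = pairs[0]
print(center, context, draw_negatives(vocab, context, k=5))
```

Each positive pair is trained to score high and each sampled negative to score low, which is why a larger `negative` value or a larger `window` increases the work done per token.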

## Data Set Class and Data Loader

### Data Set Class

In [None]:
import numpy as np

class SentimentDataset(Dataset):
    def __init__(self, texts, labels, model):
        self.labels = labels
        self.texts = [self.text_to_tensor(text, model) for text in texts]

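    def text_to_tensor(self, text, model):
        embeddings = [model.wv[word] for word in text.split() if word in model.wv]
        if not embeddings:
            # No known words: fall back to a single zero vector so padding still works
            return torch.zeros(1, model.wv.vector_size)
        embeddings = np.array(embeddings)  # Convert list of numpy arrays to a single numpy array
        return torch.tensor(embeddings, dtype=torch.float)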

    @staticmethod
    def pad_sequence(sequences):
        max_len = max([s.size(0) for s in sequences])
        padded_sequences = torch.zeros(len(sequences), max_len, sequences[0].size(1))
        for i, sequence in enumerate(sequences):
            end = sequence.size(0)
            padded_sequences[i, :end, :] = sequence[:]
        return padded_sequences

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

    @staticmethod
    def collate_fn(batch):
        texts, labels = zip(*batch)
        texts = SentimentDataset.pad_sequence(texts)
        labels = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)  # Add unsqueeze(1) to make it [batch_size, 1]
        return texts, labels

#### Dataset and Data Loaders

In [None]:
train_dataset = SentimentDataset(train_texts, train_labels, word2vec_model)
test_dataset = SentimentDataset(test_texts, test_labels, word2vec_model)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, collate_fn=SentimentDataset.collate_fn)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False, collate_fn=SentimentDataset.collate_fn)

## Define RNN Model

In [None]:
class RNNModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(RNNModel, self).__init__()
        self.rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x, _ = self.rnn(x)
        x = x[:, -1, :]
        x = self.fc(x)
        return torch.sigmoid(x)
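
Because `collate_fn` pads shorter reviews with zero vectors at the end, `x[:, -1, :]` reads the hidden state after the padding for every review that is shorter than the longest one in its batch. A minimal sketch of one alternative is shown below: it picks the output at the last real time step of each sequence. The class name `LastStepRNN` and the `lengths` argument are assumptions for illustration; using it would require `collate_fn` to also return the unpadded lengths.

```python
import torch
import torch.nn as nn


class LastStepRNN(nn.Module):
    """Hypothetical variant of RNNModel that reads the last non-padded step."""

    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, lengths):
        # x: [batch, max_len, input_dim]; lengths: [batch] lengths before padding
        out, _ = self.rnn(x)
        last_idx = (lengths - 1).clamp(min=0).to(x.device)
        batch_idx = torch.arange(x.size(0), device=x.device)
        last_hidden = out[batch_idx, last_idx]  # [batch, hidden_dim]
        return torch.sigmoid(self.fc(last_hidden))


# Toy batch: the second sequence has 2 real steps followed by 2 padded steps.
torch.manual_seed(0)
x = torch.zeros(2, 4, 3)
x[0, :4] = torch.randn(4, 3)
x[1, :2] = torch.randn(2, 3)
lengths = torch.tensor([4, 2])

toy_model = LastStepRNN(3, 8, 1)
probs = toy_model(x, lengths)
print(probs.shape)
```

Since the RNN is unidirectional, the output at the last real step does not depend on the padding that follows it, so the prediction for a review no longer changes with the length of the other reviews in the batch.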

In [None]:
model = RNNModel(100, 256, 1)
loss_function = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [None]:
def log_performance(model_name, accuracy, loss):
    return {
        "Model": model_name,
        "Accuracy": accuracy,
        "Loss": loss
    }

## Function For Model Comparison

In [None]:
def plot_comparison(results):
    # results holds one row per epoch; keep only the final epoch of each model
    df = pd.DataFrame(results).groupby('Model', sort=False).last().reset_index()
    fig, ax1 = plt.subplots()

    color = 'tab:red'
    ax1.set_xlabel('Model')
    ax1.set_ylabel('Test Accuracy (%)', color=color)
    ax1.bar(df['Model'], df['Accuracy'], color=color)
    ax1.tick_params(axis='y', labelcolor=color)

    ax2 = ax1.twinx()
    color = 'tab:blue'
    ax2.set_ylabel('Training Loss', color=color)
    ax2.plot(df['Model'], df['Loss'], color=color, marker='o')
    ax2.tick_params(axis='y', labelcolor=color)

    plt.title('Comparison of Word2Vec and FastText Models')
    fig.savefig('model_comparison.png')  # save before show() so the file is not blank
    plt.show()

In [None]:
results = []

### Train The Model

In [None]:
def train(model, train_loader, optimizer, loss_function, model_name):
    for epoch in range(5):
        model.train()  # test() switches the model to eval mode, so reset it every epoch
        total_loss = 0
        total_correct = 0
        total_samples = 0
        for i, (texts, labels) in enumerate(train_loader):
            optimizer.zero_grad()
            outputs = model(texts)
            loss = loss_function(outputs, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

            predicted = (outputs > 0.5).float()
            correct = (predicted[:, 0] == labels.squeeze()).sum().item()
            total_correct += correct
            total_samples += labels.size(0)

        avg_loss = total_loss / len(train_loader)
        accuracy = total_correct / total_samples
        print(f'Epoch [{epoch+1}/5], Loss: {avg_loss:.4f}, Accuracy: {accuracy*100:.2f}%')

        test_accuracy = test(model, test_loader)
        print(f'Epoch [{epoch+1}/5], Test Accuracy: {test_accuracy*100:.2f}%')

        results.append(log_performance(model_name, test_accuracy * 100, avg_loss))

    return test_accuracy * 100, avg_loss



In [None]:
def test(model, test_loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for texts, labels in test_loader:
            outputs = model(texts)
            predicted = (outputs > 0.5).float()
            total += labels.size(0)
            correct += (predicted[:, 0] == labels.squeeze()).sum().item()
    return correct / total
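
The `test` function reports accuracy only. Since the assignment asks for performance metrics, a small sketch of how precision, recall and F1 could be computed from the same thresholded predictions is given below. The helper `binary_metrics` and the toy label lists are illustrative assumptions, not results from this notebook.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only (not model output).
toy_true = [1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(toy_true, toy_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In the evaluation loop, the per-batch `predicted` and `labels` tensors could be flattened into Python lists and passed to such a helper once all batches have been processed.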

In [None]:
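# Use a separate name for the classifier so the gensim Word2Vec embeddings stay accessible
word2vec_rnn = RNNModel(100, 128, 1)
optimizer = torch.optim.Adam(word2vec_rnn.parameters(), lr=0.001)
loss_function = nn.BCELoss()

results = []
# train() returns the final-epoch test accuracy (in %) and the final-epoch training loss
test_accuracy, train_loss = train(word2vec_rnn, train_loader, optimizer, loss_function, "Word2Vec")
print(f'Word2Vec - Test Accuracy: {test_accuracy:.2f}% Loss: {train_loss:.5f}')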

Epoch [1/5], Loss: 0.6938, Accuracy: 49.70%
Epoch [1/5], Test Accuracy: 49.64%
Epoch [2/5], Loss: 0.6966, Accuracy: 50.08%
Epoch [2/5], Test Accuracy: 49.21%
Epoch [3/5], Loss: 0.6953, Accuracy: 50.38%
Epoch [3/5], Test Accuracy: 49.27%
Epoch [4/5], Loss: 0.6947, Accuracy: 49.96%
Epoch [4/5], Test Accuracy: 49.25%
Epoch [5/5], Loss: 0.6953, Accuracy: 49.78%
Epoch [5/5], Test Accuracy: 50.11%
Word2Vec - Test Accuracy: 50.11% Loss: 0.69535
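
The training loss in the log above stays close to 0.693, which is ln 2: the binary cross-entropy of a classifier that outputs a probability of 0.5 for every example. Together with accuracy near 50%, this suggests the RNN has not learned anything beyond chance in these five epochs. The arithmetic can be checked with a short sketch (the helper `bce` is illustrative and not part of the notebook):

```python
import math


def bce(p, y):
    """Binary cross-entropy for one example with predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# Loss of an uninformative prediction (p = 0.5) is ln 2 for either label.
chance_loss = bce(0.5, 1)
print(f"chance-level loss: {chance_loss:.4f}")

# Per-epoch training losses copied from the log above.
observed_losses = [0.6938, 0.6966, 0.6953, 0.6947, 0.6953]
for epoch, loss in enumerate(observed_losses, start=1):
    print(f"epoch {epoch}: loss - ln 2 = {loss - chance_loss:+.4f}")
```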


In [None]:
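fasttext_train_dataset = SentimentDataset(train_texts, train_labels, fasttext_model)  # FastText embeddings
fasttext_test_dataset = SentimentDataset(test_texts, test_labels, fasttext_model)
fasttext_train_loader = DataLoader(fasttext_train_dataset, batch_size=64, shuffle=True, collate_fn=SentimentDataset.collate_fn)
fasttext_test_loader = DataLoader(fasttext_test_dataset, batch_size=64, shuffle=False, collate_fn=SentimentDataset.collate_fn)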

In [None]:
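# Use a separate name for the classifier so the gensim FastText embeddings stay accessible
fasttext_rnn = RNNModel(100, 128, 1)
optimizer = torch.optim.Adam(fasttext_rnn.parameters(), lr=0.001)

In [None]:
test_loader = fasttext_test_loader  # train() evaluates on the global test_loader
test_accuracy, train_loss = train(fasttext_rnn, fasttext_train_loader, optimizer, loss_function, "FastText")
print(f'FastText - Test Accuracy: {test_accuracy:.2f}% Loss: {train_loss:.5f}')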

In [None]:
df = pd.DataFrame(results)

In [None]:
plot_comparison(results)
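
Alongside the plot, the per-epoch entries collected in `results` can be presented as a table. The sketch below formats a list of dictionaries shaped like the output of `log_performance` into a markdown table; the function `results_table` is a hypothetical helper, and the rows shown are the Word2Vec values copied from the training log earlier in this notebook (no FastText values are included because none were printed).

```python
from collections import defaultdict


def results_table(results):
    """Format log_performance-style rows as a markdown table, one row per epoch."""
    epoch_counter = defaultdict(int)
    lines = [
        "| Model | Epoch | Train loss | Test accuracy (%) |",
        "|---|---|---|---|",
    ]
    for row in results:
        # Rows are appended once per epoch, so order gives the epoch number.
        epoch_counter[row["Model"]] += 1
        epoch = epoch_counter[row["Model"]]
        lines.append(
            f"| {row['Model']} | {epoch} | {row['Loss']:.4f} | {row['Accuracy']:.2f} |"
        )
    return "\n".join(lines)


# Values copied from the Word2Vec training log above.
word2vec_log = [
    {"Model": "Word2Vec", "Accuracy": 49.64, "Loss": 0.6938},
    {"Model": "Word2Vec", "Accuracy": 49.21, "Loss": 0.6966},
    {"Model": "Word2Vec", "Accuracy": 49.27, "Loss": 0.6953},
    {"Model": "Word2Vec", "Accuracy": 49.25, "Loss": 0.6947},
    {"Model": "Word2Vec", "Accuracy": 50.11, "Loss": 0.6953},
]
table = results_table(word2vec_log)
print(table)
```

Passing the notebook's own `results` list to the same helper would add the FastText rows once that model has been trained.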