Process:
1. Read the vocab file and build the vocab
2. Split the data set
3. Create the dataset in the required format
4. Tokenize and pad the sequences
5. Create the model, train it, and save it
6. Use the model for prediction

Batch Size: Choose based on memory capacity and training stability. Common values are 32 or 64.
Sequence Length: Reflects the typical length of the text input, typically 100-300 words for reviews.
Embedding Dimension: Affects the richness of word representations, with common values ranging from 50 to 300.
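One way to pick a sequence length from the data rather than from a rule of thumb is to take a high percentile of the token counts, so that most texts fit without truncation. The sketch below is illustrative only: `choose_seq_length` is a hypothetical helper, `str.split` stands in for the real tokenizer, and the four sample reviews are made up.

```python
import math

def choose_seq_length(texts, coverage=0.95, tokenize=str.split):
    """Return the smallest length that fits `coverage` of the texts (nearest-rank percentile)."""
    lengths = sorted(len(tokenize(text)) for text in texts)
    rank = max(1, math.ceil(coverage * len(lengths)))
    return lengths[rank - 1]

# Made-up reviews with 1, 2, 3 and 4 tokens
reviews = ["good", "not good", "really not good", "one two three four"]

print(choose_seq_length(reviews, coverage=0.5))  # fits half of the reviews
print(choose_seq_length(reviews, coverage=1.0))  # fits every review
```

Texts longer than the chosen length are truncated and shorter ones are padded, so a lower coverage trades lost context for less padding and memory.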

hidden_dim in LSTM: Controls the size of the hidden states and the cell states. It determines the model's capacity to learn and retain information from sequences.
Choosing hidden_dim: Balance between too small (which may underfit) and too large (which may overfit and require more computation).
Impact: Larger hidden_dim increases the model's ability to capture complex patterns but also increases training time and memory usage.
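To make the cost of hidden_dim concrete, the parameter count of an LSTM can be computed by hand. This is a rough sketch that assumes the weight layout used by PyTorch's nn.LSTM (four gates, each with input weights, recurrent weights, and two bias vectors, no projection); `lstm_param_count` is a hypothetical helper, not a library function.

```python
def lstm_param_count(input_dim, hidden_dim, num_layers=1, bidirectional=False):
    """Count trainable LSTM parameters, assuming the nn.LSTM weight layout."""
    num_directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first receive the (possibly concatenated) output of the layer below
        layer_input = input_dim if layer == 0 else hidden_dim * num_directions
        per_direction = 4 * hidden_dim * (layer_input + hidden_dim + 2)
        total += per_direction * num_directions
    return total

# Compare a few hidden sizes for a 2-layer bidirectional LSTM on 100-dim embeddings
for hidden_dim in (64, 128, 256, 512):
    print(hidden_dim, lstm_param_count(100, hidden_dim, num_layers=2, bidirectional=True))
```

Because of the hidden-to-hidden weights, the count grows roughly with the square of hidden_dim, which is why doubling it costs much more than double the memory.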

## Parameters to Consider:
1. Batch Size (batch_size):
    Represents the number of samples processed before the model updates its parameters.
    Affects the training time and model performance.
    Larger batch sizes usually shorten each epoch on parallel hardware but require more memory.
    Smaller batch sizes give noisier gradient estimates, which can help generalization but may make training less stable.

2. Sequence Length (seq_length):
    The number of words in each input sequence.
    Affects how much context the model captures from the input text.
    Should be chosen based on the typical length of input texts and memory constraints.

3. Embedding Dimension (embedding_dim):
    Size of the word vectors representing the input words.
    Affects how much information is captured about each word.
    Higher dimensions can capture more information but may lead to overfitting and increased computation.
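Since the batch size is the number of samples processed per parameter update, it directly sets how many updates happen in one epoch. A small sketch of that arithmetic follows; the dataset size of 25,000 is a made-up figure and `updates_per_epoch` is a hypothetical helper.

```python
import math

def updates_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of optimizer steps in one pass over the data."""
    if drop_last:
        # An incomplete final batch is skipped
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

num_samples = 25000  # assumed dataset size, for illustration only
for batch_size in (1, 32, 64):
    print(batch_size, updates_per_epoch(num_samples, batch_size))
```

Fewer, larger batches mean fewer updates per epoch, so the learning rate or the number of epochs often needs adjusting when the batch size changes.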

The hidden_dim (hidden dimension) in an LSTM (Long Short-Term Memory) model is the number of features in the LSTM's hidden state. It is a key parameter that determines the model's capacity to capture patterns and dependencies in sequential data. The notes below cover what hidden_dim represents and how it affects the model's performance.

Understanding hidden_dim in LSTM
1. Hidden State Representation
Hidden State (h_t): In an LSTM, the hidden state at each time step is a vector of length hidden_dim. It captures information about the sequence up to that point.
Cell State (c_t): Along with the hidden state, the LSTM maintains a cell state, which is also a vector of length hidden_dim. The cell state helps in capturing long-term dependencies.
2. Role in LSTM Architecture
Output Size: The hidden state size determines the output size of the LSTM at each time step. If hidden_dim is 256, the LSTM outputs a vector of 256 features per time step (512 if it is bidirectional, because the forward and backward states are concatenated).
Complexity: Larger hidden_dim means the model has more parameters and can capture more complex patterns, but it also increases the risk of overfitting and requires more computational resources.
Performance: A well-chosen hidden_dim can significantly improve the model's ability to learn and generalize from the data.
Impact of hidden_dim on Model
Memory Usage: Larger hidden_dim values lead to increased memory consumption.
Training Time: Higher values can slow down training due to the increased number of computations per time step.
Generalization: Overly large hidden_dim can lead to overfitting, while too small values may underfit the data.
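The shapes described above can be checked directly. The following is a minimal sketch that runs a random batch through nn.LSTM; the batch size of 4 and sequence length of 12 are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

embedding_dim, hidden_dim = 100, 256
batch_size, seq_length = 4, 12  # arbitrary illustration values

lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2,
               bidirectional=True, batch_first=True)

# Random stand-in for a batch of embedded token sequences
x = torch.randn(batch_size, seq_length, embedding_dim)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # (batch, seq_length, 2 * hidden_dim): one vector per time step
print(h_n.shape)     # (num_layers * 2, batch, hidden_dim): final hidden state per layer and direction
print(c_n.shape)     # same shape as h_n: final cell state
```

The per-step output doubles in width for a bidirectional LSTM, while each individual hidden and cell state keeps length hidden_dim.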

In [None]:
#pip install pandas torch

#Step 1: Prepare the Data
# Here, text contains the review and label contains the sentiment (1 for positive, 0 for negative).
text,label
"I love this movie!",1
"This is terrible.",0
...


Load and Preprocess Data
Read the CSV File: Use pandas to read the file.
Tokenize: Convert text into numerical tokens.
Pad Sequences: Make all sequences the same length.
Convert to Tensors: Prepare the data for PyTorch.

In [None]:
#pip install nltk

In [1]:

import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
from collections import Counter
import numpy as np

In [2]:
# Read CSV file
df = pd.read_csv('sentiment_data.csv')

In [None]:
# Train-test split
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)


# word_tokenize("HI am a sentence") returns ['HI', 'am', 'a', 'sentence']
words = Counter()
words.update(word_tokenize("HI am a sentence"))  # update() returns None, so call it before printing
print(words.items())  ## dict_items([('HI', 1), ('am', 1), ('a', 1), ('sentence', 1)])
print(words.values())  ## dict_values([1, 1, 1, 1])
print(list(words.elements()))  ## elements() returns an iterator; list() gives ['HI', 'am', 'a', 'sentence']
for element in words.elements():
    print(element)
print(words.most_common()) ## [('HI', 1), ('am', 1), ('a', 1), ('sentence', 1)]

In [None]:
# Build vocabulary
def build_vocab(sentences, max_vocab_size=25000):
    words = Counter()
    for sentence in sentences:
        words.update(word_tokenize(sentence))
    common_words = words.most_common(max_vocab_size)
    vocab = {word: idx+2 for idx, (word, _) in enumerate(common_words)}
    vocab['<PAD>'] = 0
    vocab['<UNK>'] = 1
    return vocab

print(build_vocab(["My sentence", "another sentence"]))
{'sentence': 2, 'My': 3, 'another': 4, '<PAD>': 0, '<UNK>': 1}

In [49]:
# Tokenize and pad sequences
def tokenize_and_pad(sentence, vocab, max_length=100):
    tokens = [vocab.get(word, vocab['<UNK>']) for word in word_tokenize(sentence)]
    if len(tokens) < max_length:
        tokens.extend([vocab['<PAD>']] * (max_length - len(tokens)))
    else:
        tokens = tokens[:max_length]
    return tokens
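A quick usage example shows the three cases tokenize_and_pad handles: unknown words, padding, and truncation. To keep the snippet self-contained it restates the function with `str.split` standing in for `word_tokenize` (an assumption made only to avoid the NLTK download), and uses a small hand-written vocab.

```python
def tokenize_and_pad(sentence, vocab, max_length=100, tokenize=str.split):
    # Map each word to its index, falling back to <UNK> for unseen words
    tokens = [vocab.get(word, vocab['<UNK>']) for word in tokenize(sentence)]
    tokens = tokens[:max_length]                                   # truncate long inputs
    tokens += [vocab['<PAD>']] * (max_length - len(tokens))        # pad short inputs
    return tokens

vocab = {'<PAD>': 0, '<UNK>': 1, 'sentence': 2, 'My': 3, 'another': 4}

short = tokenize_and_pad("My new sentence", vocab, max_length=5)
long = tokenize_and_pad("another sentence My sentence", vocab, max_length=2)

print(short)  # [3, 1, 2, 0, 0]: 'new' is unknown, then two pads
print(long)   # [4, 2]: cut to the first two tokens
```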

In [23]:
df['text']
train_data['text']

5          No
2    Liked it
4       Sorry
3    Pathetic
Name: text, dtype: object

In [38]:
# Convert dataset to PyTorch Dataset
class SentimentDataset(Dataset):
    def __init__(self, data, vocab, max_length=100):
        self.data = data
        self.vocab = vocab
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data.iloc[idx]['text']
        label = self.data.iloc[idx]['label']
        tokens = tokenize_and_pad(text, self.vocab, self.max_length)
        return torch.tensor(tokens, dtype=torch.long), torch.tensor(label, dtype=torch.float)

print(train_data.iloc[:1,:])
print(type(train_data))
train_data.head()
sample_vocab = build_vocab(["My sentence", "another sentence"])
sample_data = SentimentDataset(train_data.iloc[:1,:], sample_vocab)

In [46]:
#import nltk
#nltk.download('punkt')
# Build the vocabulary from the training data
vocab = build_vocab(train_data['text'].tolist())

# Create PyTorch datasets
train_dataset = SentimentDataset(train_data, vocab)
test_dataset = SentimentDataset(test_data, vocab)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1)

['No']
['Liked', 'it']
['Sorry']
['Pathetic']


In [47]:
from torch import nn, optim

class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout, pad_idx=0):
        super().__init__()
        self.pad_idx = pad_idx
        self.bidirectional = bidirectional
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        # nn.LSTM applies dropout only between stacked layers, so disable it for a single layer
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional,
                            dropout=dropout if n_layers > 1 else 0.0, batch_first=True)
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x has shape (batch, seq_len); len(x) is the batch size, so count non-pad tokens per row instead
        lengths = (x != self.pad_idx).sum(dim=1).clamp(min=1).cpu()
        embedded = self.dropout(self.embedding(x))
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)
        _, (hidden, _) = self.lstm(packed_embedded)
        if self.bidirectional:
            # Concatenate the final forward and backward states of the top layer
            hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
        else:
            hidden = hidden[-1]
        return self.fc(self.dropout(hidden))
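pack_padded_sequence expects the true, unpadded length of every sequence in the batch. Note that len() of a (batch, seq_len) tensor returns the batch size, so it cannot stand in for these lengths. A minimal sketch of deriving them from padded token ids, assuming the `<PAD>` index is 0 as in build_vocab (the token ids and the embedding size of 8 are made up):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

PAD_IDX = 0  # assumed to match vocab['<PAD>']

# Three padded sequences of token ids, shape (batch=3, seq_len=5)
batch = torch.tensor([[5, 7, 2, 0, 0],
                      [3, 0, 0, 0, 0],
                      [4, 9, 8, 6, 2]])

# True length of each row = number of non-pad tokens
lengths = (batch != PAD_IDX).sum(dim=1)

# Random stand-in for the embedded batch, shape (3, 5, 8)
embedded = torch.randn(3, 5, 8)
packed = pack_padded_sequence(embedded, lengths.cpu(),
                              batch_first=True, enforce_sorted=False)

print(lengths.tolist())   # [3, 1, 5]
print(packed.data.shape)  # only the 9 real time steps are kept
```

With correct lengths the LSTM skips the padded positions, so the final hidden state reflects the last real token rather than trailing padding.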


In [50]:
vocab_size = len(vocab)
embedding_dim = 100
hidden_dim = 256
output_dim = 1
n_layers = 2
bidirectional = True
dropout = 0.5
lstm_model = LSTMModel(vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout)

optimizer = optim.Adam(lstm_model.parameters(), lr=0.001)
criterion = nn.BCEWithLogitsLoss()

# Training loop
num_epochs = 3
for epoch in range(num_epochs):
    lstm_model.train()
    epoch_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        text, label = batch
        predictions = lstm_model(text).squeeze(1)
        loss = criterion(predictions, label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f'Epoch {epoch+1}, Loss: {epoch_loss/len(train_loader)}')


Epoch 1, Loss: 0.695932611823082
Epoch 2, Loss: 0.6502176076173782
Epoch 3, Loss: 0.6094500869512558
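A useful reference point for these numbers: binary cross-entropy on a logit of 0 (a 50/50 guess) equals ln 2, about 0.693, so the first-epoch loss is essentially chance level and the later epochs show only a modest improvement. The sketch below reproduces that baseline in plain Python, using the standard numerically stable form of binary cross-entropy with logits; `bce_with_logits` is a hand-written helper for illustration, not the PyTorch class.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Loss of an uninformed model that always outputs logit 0 (probability 0.5)
chance = bce_with_logits(0.0, 1.0)
print(chance, math.log(2))

# A confident correct prediction scores well below the chance baseline
confident = bce_with_logits(3.0, 1.0)
print(confident)
```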


In [None]:
data = {
    'text': ["I love this movie", "I hate this movie", "This movie is great", "This movie is terrible"],
    'label': [1, 0, 1, 0]
}
df = pd.DataFrame(data)

In [76]:
# Make predictions
# text1 = ["Not like"]
new_data = {
    'text': ["Not like"],
    'label': [0]  # Dummy label; ignored at inference time
}
predict_data = pd.DataFrame(new_data)
#text1_df = pd.DataFrame([{"text":text1,"label": None}])
predict_dataset = SentimentDataset(predict_data, vocab, max_length=10)


# Create data loader
chk_loader = DataLoader(predict_dataset, batch_size=1, shuffle=False)
for batch in chk_loader:
    print(type(batch))

lstm_model.eval()  # Set the model to evaluation mode
for batch in chk_loader:
    with torch.no_grad():  # Disable gradient calculation for inference
        output = lstm_model(batch[0])  # Forward pass through the model
        print(f"raw model output: {output}")
        output = output.squeeze(1)  # (batch, 1) -> (batch,)
        print(f"raw model output after squeezing dimension: {output}")
        prediction = torch.sigmoid(output)  # Apply sigmoid to get probabilities
        print(f"output after sigmoid: {prediction}")
        predicted_label = (prediction > 0.5).float()  # Convert probabilities to binary labels

print("Prediction:", predicted_label.item())

<class 'list'>
raw model output: tensor([[0.0459]])
raw model output after squeezing dimension: tensor([0.0459])
output after sigmoid: tensor([0.5115])
Prediction: 1.0
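The step from logit to label in the output above can be reproduced by hand. This small sketch applies the sigmoid to the printed logit (0.0459, as rounded in the output) and thresholds at 0.5:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logit = 0.0459                       # raw model output, as printed above
prob = sigmoid(logit)                # probability of the positive class
label = 1.0 if prob > 0.5 else 0.0   # same 0.5 threshold as the prediction code

print(round(prob, 4), label)  # 0.5115 1.0
```

A probability this close to 0.5 means the model is barely more confident than a coin flip, so the predicted label of 1.0 for "Not like" should not be read as a reliable positive.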
