<a href="https://colab.research.google.com/github/aaks30/Git_Practice_Aakriti/blob/main/Task_3_updated.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Task 3: Text Prediction**


In this project, I built and trained a **Word-Level LSTM** model to generate text based on a dataset from Project Gutenberg. Below is a summary of the steps involved in processing the data, training the model, and generating text:

### **Achievements:**
- **Data Preprocessing**:
  - Loaded the Project Gutenberg text data.
  - Removed unnecessary metadata and irrelevant sections like the table of contents.
  - Cleaned the text by removing special characters, numbers, and excessive spaces.
- **Text Tokenization**:
  - Tokenized the cleaned text into individual words for further processing.
- **Vocabulary Building**:
  - Built a vocabulary mapping (stoi: string-to-index and itos: index-to-string) based on word frequencies.
- **Model Development**:
  - Implemented a Word-Level LSTM (Long Short-Term Memory) model for text generation.
  - Designed the model with an embedding layer, a two-layer LSTM, and a fully connected output layer for predicting the next word.
- **Training**:
  - Trained the model using the cleaned and tokenized dataset.
  - Used Adam optimizer and CrossEntropyLoss for training.
- **Model Saving**:
  - Saved the trained model weights to ensure reproducibility and for future use.
  - Saved the vocabulary mappings (stoi, itos) for the generation of text from new seed inputs.
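
As a small illustration of the vocabulary-building step, the sketch below builds frequency-ordered `stoi`/`itos` mappings for a made-up sentence. The sentence and the helper name `toy_vocab` are assumptions for illustration only; indices start at 1 and leave 0 unused, mirroring the notebook's `build_vocab`.

```python
from collections import Counter

def toy_vocab(words):
    # Most frequent words get the smallest indices; index 0 is left unused
    counts = Counter(words)
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    stoi = {word: i + 1 for i, (word, _) in enumerate(ordered)}
    itos = {i: word for word, i in stoi.items()}
    return stoi, itos

words = "the cat saw the dog and the dog ran".split()
stoi, itos = toy_vocab(words)

# Encode the sentence as indices, then decode it back to text
indices = [stoi[w] for w in words]
decoded = " ".join(itos[i] for i in indices)
print(stoi)
print(indices)
print(decoded)
```

Encoding and then decoding returns the original sentence, which is the property the saved `stoi` and `itos` files rely on at generation time.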
  
### **Types of Approaches:**
- **LSTM (Long Short-Term Memory)**:
  - Effective for sequential data, ideal for text generation where the model learns dependencies between words over long sequences.
- **GRU (Gated Recurrent Unit)**:
  - A simpler alternative to LSTM, often faster with fewer parameters while still capturing temporal dependencies in sequences.
- **Bidirectional LSTM**:
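  - Processes the sequence in both directions, capturing context from both earlier and later words. For next-word generation only past words exist, so the backward pass re-reads the input window rather than seeing future text.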
- **Transformer-based Models**:
  - Modern architectures like GPT (Generative Pretrained Transformer) can capture longer-range dependencies and perform better in generating more coherent text.
- **Character-Level Text Generation**:
  - Instead of generating text word by word, this approach generates text one character at a time, capturing finer details of word formation.
- **RNN (Recurrent Neural Network)**:
  - The plain (vanilla) RNN is an earlier approach to sequential data. It suffers from vanishing gradients on long sequences, which makes it less effective than LSTMs, but it can still be used for basic text generation tasks.
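
To make the "fewer parameters" point about GRUs concrete, the sketch below counts the weights in a stacked LSTM and a stacked GRU of the same size. It assumes PyTorch's parameter layout (one input-to-hidden matrix, one hidden-to-hidden matrix, and two bias vectors per layer, each stacked across the gates) and uses the sizes from this project (embedding 128, hidden 256, 2 layers). The helper name `recurrent_params` is made up for illustration.

```python
def recurrent_params(input_dim, hidden_dim, num_layers, gates):
    # gates = 4 for an LSTM (input, forget, cell, output), 3 for a GRU (reset, update, new)
    total = 0
    for layer in range(num_layers):
        # The first layer reads the embeddings; later layers read the previous layer's output
        in_dim = input_dim if layer == 0 else hidden_dim
        weights = gates * hidden_dim * (in_dim + hidden_dim)
        biases = 2 * gates * hidden_dim  # separate input and hidden bias vectors
        total += weights + biases
    return total

lstm_params = recurrent_params(128, 256, 2, gates=4)
gru_params = recurrent_params(128, 256, 2, gates=3)
print(lstm_params, gru_params, gru_params / lstm_params)
```

Under these assumptions the GRU has three quarters of the LSTM's recurrent parameters, because it has three gate blocks instead of four.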


In [None]:
import re
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from collections import Counter

In [None]:
# Load text file
# Function to load text file contents into memory
def load_text(file_path):
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()
    return text


In [None]:
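# Remove metadata from Project Gutenberg books
# This function strips the Project Gutenberg header and footer, keeping only the book text
def remove_metadata(text):
    start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"

    start_idx = text.find(start_marker)
    end_idx = text.find(end_marker)

    if start_idx != -1 and end_idx != -1:
        # Skip the rest of the marker line (book title and closing "***"),
        # which would otherwise be left at the top of the text
        line_end = text.find("\n", start_idx, end_idx)
        body_start = line_end + 1 if line_end != -1 else start_idx + len(start_marker)
        text = text[body_start:end_idx].strip()
    return text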

In [None]:
# Clean text by removing numbers and special characters
# This function standardizes text by making it lowercase and removing unwanted characters
def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9,.'!?;:()\-\s]", "", text)  # Keep important punctuation
    text = re.sub(r"\d+", "", text)  # Remove numbers
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces
    return text

In [None]:
# Extract only the main story portion from the text (if applicable)
# This function ensures the text starts from the main story by searching for specific keywords
def extract_main_story(text):
    story_start_keywords = ["the adventure of the western star", "i was standing at the window"]

    for keyword in story_start_keywords:
        story_start_idx = text.find(keyword)
        if story_start_idx != -1:
            return text[story_start_idx:]
    return text

In [None]:
# Tokenize the text into individual words
# This function splits the text into words based on whitespace
def tokenize_text(text):
    return text.split()
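
Splitting on whitespace keeps punctuation attached to words, so `door` and `door.` become separate vocabulary entries (the sample output further down contains tokens such as `us,` and `door.`). One possible alternative, shown here as a sketch rather than what this notebook uses, is a regular expression that emits punctuation marks as their own tokens. The function name `tokenize_with_punctuation` and the sample sentence are hypothetical.

```python
import re

def tokenize_with_punctuation(text):
    # Match either a word (letters and apostrophes) or one punctuation mark kept by clean_text
    return re.findall(r"[a-z']+|[,.!?;:()\-]", text)

sample = "they reappear the door. it isn't here, is it?"
whitespace_tokens = sample.split()
regex_tokens = tokenize_with_punctuation(sample)
print(whitespace_tokens)
print(regex_tokens)
```

The trade-off is a smaller vocabulary at the cost of longer sequences, and generated text would need the spaces before punctuation removed when it is joined back together.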

In [None]:
# Build vocabulary from the tokenized words
# The function builds two mappings: one from word to index (stoi) and one from index to word (itos)
def build_vocab(tokenized_words):
    word_counts = Counter(tokenized_words)  # Count the frequency of each word
    sorted_vocab = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)  # Sort by frequency
    stoi = {word: i+1 for i, (word, _) in enumerate(sorted_vocab)}  # Index starts from 1 (not 0)
    itos = {i: word for word, i in stoi.items()}
    return stoi, itos

In [None]:
# Convert words into indices using the built vocabulary
# This function takes a list of tokenized words and converts each word into its corresponding index
def words_to_indices(tokenized_words, stoi):
    return [stoi[word] for word in tokenized_words if word in stoi]  # Only include words in the vocabulary

In [None]:
# Generate sequences for training based on a fixed sequence length
# The function creates input-target pairs for model training
def generate_sequences(indexed_text, seq_length=30):
    input_sequences = []
    target_words = []

    for i in range(len(indexed_text) - seq_length):
        input_sequences.append(indexed_text[i:i + seq_length])
        target_words.append(indexed_text[i + seq_length])

    # Convert sequences to PyTorch tensors
    X = torch.tensor(input_sequences, dtype=torch.long)
    Y = torch.tensor(target_words, dtype=torch.long)
    return X, Y
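
For intuition, here is the same sliding-window idea on a short list, using plain Python lists instead of tensors. The toy data, the helper name `sliding_windows`, and `seq_length=3` are made up for illustration; the notebook itself uses `seq_length=30`.

```python
def sliding_windows(indexed_text, seq_length):
    # Pair each window of seq_length items with the single item that follows it
    pairs = []
    for i in range(len(indexed_text) - seq_length):
        window = indexed_text[i:i + seq_length]
        target = indexed_text[i + seq_length]
        pairs.append((window, target))
    return pairs

toy = [1, 3, 4, 1, 2, 5]
pairs = sliding_windows(toy, seq_length=3)
for window, target in pairs:
    print(window, "->", target)
```

A text of N tokens yields N - seq_length training pairs, so consecutive inputs overlap in all but one position.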

In [None]:
# Define the Word-Level LSTM model for text generation
# The model consists of an embedding layer, an LSTM layer, and a fully connected output layer
class WordLevelLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=256, num_layers=2):
        super(WordLevelLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)  # Embed input words into a dense vector space
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True)  # LSTM for sequence learning
        self.fc = nn.Linear(hidden_dim, vocab_size)  # Output layer to predict next word in sequence

    def forward(self, x):
        x = self.embedding(x)
        lstm_out, _ = self.lstm(x)
        out = self.fc(lstm_out[:, -1, :])  # Use only the last LSTM output for prediction
        return out

In [None]:
# Training function to optimize model's weights
# The function trains the model using cross-entropy loss and backpropagation
def train_model(model, dataloader, optimizer, criterion, epochs=10, device='cpu'):
    model.to(device)
    model.train()

    for epoch in range(epochs):
        total_loss = 0
        for batch_X, batch_Y in dataloader:
            batch_X, batch_Y = batch_X.to(device), batch_Y.to(device)
            optimizer.zero_grad()
            output = model(batch_X)  # Forward pass
            loss = criterion(output, batch_Y)  # Calculate loss
            loss.backward()  # Backpropagate the gradients
            optimizer.step()  # Update the model parameters
            total_loss += loss.item()  # Track total loss for monitoring

        print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss / len(dataloader):.4f}")

In [None]:
# Function to preprocess text and generate data loader for training
# The function reads the text file, cleans and tokenizes the text, and generates training sequences
def preprocess_text(file_path, seq_length=30, batch_size=64):
    raw_text = load_text(file_path)
    cleaned_text = remove_metadata(raw_text)
    cleaned_text = clean_text(cleaned_text)
    cleaned_text = extract_main_story(cleaned_text)
    tokenized_words = tokenize_text(cleaned_text)
    stoi, itos = build_vocab(tokenized_words)
    indexed_text = words_to_indices(tokenized_words, stoi)
    X, Y = generate_sequences(indexed_text, seq_length)

    dataset = TensorDataset(X, Y)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    return dataloader, stoi, itos

In [None]:
# File path for the text data
file_path = "/content/61262-0.txt"
dataloader, stoi, itos = preprocess_text(file_path, seq_length=30, batch_size=64)
vocab_size = len(stoi) + 1  # Adding 1 for padding index

In [None]:
# Initialize the model
model = WordLevelLSTM(vocab_size)
criterion = nn.CrossEntropyLoss()  # Cross-entropy loss for classification
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adam optimizer for training

In [None]:
# Train the model
train_model(model, dataloader, optimizer, criterion, epochs=10)

# Print the model architecture to inspect the layers
print(model)

In [None]:
# Save the trained model weights for future use
torch.save(model.state_dict(), "word_level_lstm.pth")
print("Model saved!")

Model saved!


In [None]:
# Save the vocabulary (stoi and itos) for text generation during inference
import pickle
with open("stoi.pkl", "wb") as f:
    pickle.dump(stoi, f)

with open("itos.pkl", "wb") as f:
    pickle.dump(itos, f)

In [None]:
import torch
import torch.nn as nn
import pickle
import random

# Define the LSTM model class (same as during training)
class WordLevelLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=256, num_layers=2):
        super(WordLevelLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        lstm_out, _ = self.lstm(x)
        out = self.fc(lstm_out[:, -1, :])  # Take the last output for prediction
        return out

# Load the vocabulary (stoi and itos)
with open("stoi.pkl", "rb") as f:
    stoi = pickle.load(f)

with open("itos.pkl", "rb") as f:
    itos = pickle.load(f)

# Define vocab_size for the model
vocab_size = len(stoi) + 1  # Add 1 for padding

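# Load the trained model
model = WordLevelLSTM(vocab_size)
# map_location="cpu" loads the weights on machines without a GPU;
# weights_only=True restricts unpickling to tensors (needs a PyTorch version that supports it)
model.load_state_dict(torch.load("word_level_lstm.pth", map_location="cpu", weights_only=True))
model.eval()  # Set the model to evaluation mode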

# Generate new text based on a seed input
def generate_text(model, stoi, itos, seed_text, max_words=50, temperature=1.0, device='cpu', seq_length=30):
    model.to(device)
    model.eval()

    words = seed_text.lower().split()  # The vocabulary is lowercase, so lowercase the seed too
    for _ in range(max_words):
        # Use the last seq_length words as input, skipping words that are not in the vocabulary
        input_sequence = [stoi[word] for word in words[-seq_length:] if word in stoi]
        if not input_sequence:
            # An empty input would make the LSTM fail with a less helpful error
            raise ValueError("None of the seed words are in the vocabulary.")
        input_tensor = torch.tensor(input_sequence, dtype=torch.long).unsqueeze(0).to(device)

        with torch.no_grad():
            logits = model(input_tensor)  # Get predictions from the model
            logits = logits / temperature  # Adjust temperature for randomness in generation
            probabilities = torch.nn.functional.softmax(logits, dim=-1)  # Convert logits to probabilities
            predicted_index = torch.multinomial(probabilities, 1).item()  # Sample next word from the probabilities

        next_word = itos.get(predicted_index, "<UNK>")  # Index 0 has no word, so it maps to <UNK>
        words.append(next_word)

    return ' '.join(words)


In [None]:
# Generate a sample text starting from a seed phrase
generated_text = generate_text(model, stoi, itos, seed_text="wonderful person", max_words=50)
print("Generated Text:")
print(generated_text)

Generated Text:
wonderful person two pretty willard isnt us, in this pandemonium they reappear the door. mrs. opalsen died into the gun-room face, by the pace. it must have been no place, labelso then, red trying for the same old glory that poirot menacing. where she was stolen. it is being proved from these
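
The `temperature` argument of `generate_text` controls how adventurous the sampling is: dividing the logits by a value below 1 sharpens the distribution toward the most likely word, while a value above 1 flattens it. The sketch below shows the effect with a pure-Python softmax on three made-up logits (the numbers are assumptions, not model outputs, and `softmax_with_temperature` is a hypothetical helper).

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Scale the logits, subtract the max for numerical stability, then normalize
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures should make the output more repetitive but more predictable; higher temperatures give rarer words a better chance of being sampled.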


### **Performance Improvement Techniques:**
1. **Using Pretrained Models (Transfer Learning)**:
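   - Fine-tune a pretrained generative model such as GPT for better context understanding and more fluent text generation. Encoder-only models such as BERT are pretrained for understanding tasks and are not designed for left-to-right generation.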

2. **Hyperparameter Tuning**:
   - Experiment with hyperparameters like learning rate, batch size, and model dimensions to optimize performance and achieve the best results.

3. **Bidirectional LSTM**:
   - Implement a bidirectional LSTM so that the input window is read in both directions. During generation only past words are available, so this gives a richer encoding of the existing context rather than access to future words.
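
A simple way to organize the hyperparameter search is a small grid around the settings already used in this notebook. The sketch below only enumerates candidate configurations; the specific values are illustrative assumptions, and each configuration would still need to be trained and compared on held-out text.

```python
from itertools import product

# Candidate values (assumed for illustration); the notebook used
# lr=0.001, batch_size=64, hidden_dim=256
grid = {
    "lr": [0.0005, 0.001, 0.002],
    "batch_size": [32, 64],
    "hidden_dim": [128, 256],
}

# Build one dict per combination of values
configs = [dict(zip(grid.keys(), values)) for values in product(*grid.values())]
print(len(configs))
for config in configs[:3]:
    print(config)
```

Even this small grid produces 12 training runs, so on a single CPU it may be more practical to vary one hyperparameter at a time.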


### **Comparison with a Large Language Model (LLM):**
- **Prompting LLMs for Human-Level Sentences**: Large Language Models like GPT-3 or GPT-4 generate human-level sentences because they have learned patterns from a very wide range of text. Given only a prompt, they can produce coherent and semantically rich text without being trained on the target book or task.

- **Difference in Scale**: LLMs have billions of parameters and are pretrained on massive datasets, which enables them to generate more diverse and contextually accurate text. LSTMs, by contrast, are usually trained on a specific dataset and may struggle with long-term dependencies. My approach sees only the last 30 words as context and is limited by the size of both its training data and its model architecture.

- **Improving My Model to Reach LLM-Level Performance**:
   - **Pretraining on a Large Corpus**: Like LLMs, pretraining on a larger corpus of text data (such as from books, articles, etc.) and fine-tuning for specific tasks would allow my LSTM model to achieve more human-like text generation.
   - **Larger Datasets**: To approach LLM performance, training on much larger, diverse datasets would improve the model’s ability to generalize and generate better text.
   - **Advanced Architectures (Transformer)**: Switching from an LSTM to a Transformer-based generative architecture (like GPT) could also lead to much better results, as self-attention handles long-range dependencies better than traditional RNNs or LSTMs.

### **Comparison Table:**

| **Aspect**                          | **LSTM Model (This Project)**                             | **ChatGPT**                                          |
|-------------------------------------|----------------------------------------------------------|------------------------------------------------------|
| **Text Coherence and Fluency**      | Generates simple, repetitive text.                       | Generates fluent, natural, and diverse text.         |
| **Context Understanding**           | Struggles with long context and details.                  | Keeps track of long context and complex ideas.       |
| **Creativity**                      | Limited creativity, repeats patterns.                     | Highly creative, generates varied responses.         |
| **Grammar**                         | Short phrases are plausible, but full sentences are often ungrammatical. | Almost always perfect grammar and smooth structure.   |
| **Text Diversity**                  | Limited, tends to repeat.                                 | Highly diverse, with different structures and words. |
| **Training Data**                   | Trained on a small, specific dataset (Project Gutenberg). | Trained on a large, diverse dataset.                 |
| **Model Complexity**                | Simple LSTM with two layers.                             | Complex model with billions of parameters.           |
| **Real-Time Interaction**           | Generates text from a fixed input.                        | Interactive, adapts to ongoing conversation.         |
| **Scope**                           | Best for specific tasks, like literature.                | General-purpose, handles many tasks (Q&A, writing).  |



## **Conclusion:**
- The LSTM-based model has successfully demonstrated text generation capabilities, providing a foundational approach for predicting the next word in a sequence. While this model performs reasonably well on smaller datasets, it faces limitations in generating highly coherent and contextually rich text compared to state-of-the-art Large Language Models like GPT-3 or GPT-4.
- Key improvements, such as hyperparameter tuning, pretraining on larger datasets, and the potential shift to Transformer architectures, can significantly enhance the performance and bring the model closer to achieving human-level text generation.
- Overall, this project highlights the importance of model design, data quality, and training strategies in developing effective text prediction systems. Further exploration into advanced architectures and pretraining techniques can elevate the text generation capabilities of future models.
