<a href="https://colab.research.google.com/github/davidelgas/DataSciencePortfolio/blob/main/nlp/transformer/notebooks/NLP_with_Transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Workflow

**Define Project Goals and Constraints:**

Defining these up front guides the choice of model architecture and data processing strategy.


**Data Cleaning and Preprocessing:**

Load your data into a Pandas DataFrame.
Perform basic cleaning: remove duplicates, handle missing values.
Normalize text: convert to lowercase, remove punctuation, and special characters.
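
The cleaning steps above can be sketched as follows. This is a minimal example, not the notebook's actual pipeline: the column name `text`, the helper names `normalize_text` and `clean_corpus`, and the sample rows are all assumptions made for illustration.

```python
import re

import pandas as pd


def normalize_text(text: str) -> str:
    """Lowercase, replace punctuation/special characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Drop missing rows, normalize the text column, then drop duplicates."""
    df = df.dropna(subset=[text_col]).copy()          # handle missing values
    df[text_col] = df[text_col].map(normalize_text)   # normalize text
    df = df[df[text_col] != ""]                       # drop rows left empty
    df = df.drop_duplicates(subset=[text_col])        # remove duplicates
    return df.reset_index(drop=True)


# Hypothetical sample rows standing in for scraped forum posts.
raw = pd.DataFrame(
    {"text": ["Carb needs a REBUILD!", "Carb needs a REBUILD!", None, "Where's the fuse box?"]}
)
cleaned = clean_corpus(raw)
print(cleaned["text"].tolist())
```

Normalizing before de-duplicating catches posts that differ only in case or punctuation. Stripping all punctuation is a blunt choice for a forum like this one, since it also alters things such as part numbers, so the regular expression may need to be relaxed.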

**Text Preprocessing for BPE:**

Apply BPE tokenization to your corpus. This involves learning the BPE vocab from your dataset and then applying it to both questions and answers to tokenize them.

**Splitting the Dataset:**

Split your data into training, validation, and test sets. A common split ratio is 80% training, 10% validation, and 10% test.
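
A minimal sketch of the 80/10/10 split, using only the standard library. The function name `split_dataset` and the fixed seed are assumptions for illustration; any shuffling utility would work equally well.

```python
import random


def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle records and cut them into train, validation, and test lists."""
    items = list(records)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # whatever remains becomes the test set
    return train, val, test


# Integers stand in for question/answer pairs here.
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```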

**Converting Text to Sequences:**

Convert your tokenized text into sequences of integers using the BPE vocabulary. This step transforms the textual data into a format that can be fed into the LSTM model.

**Padding Sequences:**

Since LSTM models require inputs of the same length, use padding to ensure all sequences in a batch have the same length.
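
The two steps above (mapping tokens to integers, then padding) can be sketched in plain Python. The toy vocabulary, the reserved ids `PAD_ID` and `UNK_ID`, and the function names are assumptions for illustration; in practice the vocabulary would come from the trained BPE tokenizer.

```python
PAD_ID = 0  # assumed id reserved for padding
UNK_ID = 1  # assumed id for tokens missing from the vocabulary


def encode(tokens, vocab):
    """Map each token to its integer id, falling back to UNK_ID."""
    return [vocab.get(tok, UNK_ID) for tok in tokens]


def pad_sequences(sequences, max_length, pad_id=PAD_ID):
    """Truncate or right-pad every sequence to exactly max_length ids."""
    padded = []
    for seq in sequences:
        seq = seq[:max_length]
        padded.append(seq + [pad_id] * (max_length - len(seq)))
    return padded


# Toy vocabulary of subword units (hypothetical).
vocab = {"<pad>": 0, "<unk>": 1, "carb": 2, "ure": 3, "tor": 4, "leak": 5}
batch = [
    encode(["carb", "ure", "tor"], vocab),
    encode(["leak", "gasket"], vocab),  # "gasket" is not in the toy vocabulary
]
padded = pad_sequences(batch, max_length=4)
print(padded)
```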

**Designing the LSTM Model:**

Build your LSTM model architecture using TensorFlow/Keras. The model should include an Embedding layer, one or more LSTM layers, and a Dense output layer.

**Compiling the Model:**

Compile the model with an appropriate optimizer (e.g., Adam), loss function (e.g., sparse_categorical_crossentropy for classification tasks), and metrics (e.g., accuracy).

**Training the Model:**

Train the model on your training set while also validating its performance on the validation set. Use model checkpoints and early stopping to prevent overfitting.
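
To show the logic behind early stopping, here is a framework-independent sketch. Keras provides this behavior through its `EarlyStopping` and `ModelCheckpoint` callbacks; the class below is a simplified stand-in, and the validation losses are made-up numbers.

```python
class EarlyStopping:
    """Stop training once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch  # a real loop would save a checkpoint here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses for illustration only.
val_losses = [2.1, 1.7, 1.5, 1.6, 1.55, 1.58, 1.4]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best_loss)
```

Training halts after three epochs without improvement, and the checkpoint from the best epoch is the one to keep for evaluation.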

**Evaluating the Model:**

After training, evaluate the model's performance on the test set to get a sense of its generalization ability.

**Model Deployment:**

Deploy the model into a production environment. This could be a simple web application or a REST API that takes in a question and returns the predicted answer.

## Project Goals and Constraints

**Goal**

The goal is to create a "virtual mechanic" to help owners maintain older cars that have a dwindling set of experts available to turn to.


**Task Type:**

The project aims to build a generative language model that will accept written unstructured questions in English from users and provide the user with targeted written answers in English. The model will use sequence prediction and text generation. The model will not use classification, image recognition, or sentiment analysis.

**Data Characteristics:**

The training corpus will be user-generated content scraped from a domain-specific online forum. The corpus will generally be unstructured, with a very limited set of metadata.

**Performance Metrics:**

Performance of the project will be scored on accuracy and speed of responses.

**Resource Constraints:**

The project will be built in Python utilizing limited CPU compute resources from Google Colab.

**Existing Tools or Frameworks:**

The corpus will be stored in a Snowflake database.

**Scalability and Adaptability:**

There is no need to support additional user languages. However, when available, the corpus will be supplemented with additional written unstructured text.


## Corpus Creation

The corpus was assembled using Beautiful Soup to scrape a public forum specific to the BMW E9 (www.e9coupe.com). This active forum has been in existence since 2003. The data was compiled and stored in a Snowflake database for multiple NLP projects, including LDA, GRU, and LSTM. Future ideas include supplementing the forum text with an existing user's guide specific to this model.

## Language Model Architectures

### Recurrent Neural Networks (RNNs):

**Pros:**
1. *Sequential Processing:* RNNs process sequential data efficiently, making them suitable for tasks like text generation where the order of input elements matters.
2. *Memory:* RNNs have a form of memory that allows them to remember past information while processing current inputs.
3. *Interpretability:* Due to their sequential nature, RNNs are often more interpretable compared to more complex architectures like Transformers.
4. *Ease of Development:* RNNs have been around for longer and have a simpler architecture compared to Transformers, making them easier to develop and understand for beginners.
5. *CPU Needs:* RNNs can be trained and run on CPU instances, although training large models or processing large datasets may benefit from GPU acceleration.

**Cons:**
1. *Vanishing/Exploding Gradient:* RNNs can suffer from vanishing or exploding gradient problems, especially when dealing with long sequences, which can lead to difficulties in learning long-term dependencies.
2. *Limited Context:* Traditional RNNs have a limited memory span, making them less effective at capturing long-range dependencies in data.
3. *Computationally Inefficient:* Because each time step depends on the previous one, RNN training cannot be parallelized across a sequence, which makes it slow on large datasets and long sequences.

### Transformer Architectures:

**Pros:**
1. *Parallelization:* Transformers allow for highly parallelized computation, leading to faster training and inference compared to sequential models like RNNs.
2. *Long-Range Dependencies:* Transformers can capture long-range dependencies in data more effectively than traditional RNNs, making them well-suited for tasks requiring global context, such as machine translation and text generation.
3. *Attention Mechanism:* Transformers use attention mechanisms to weigh the importance of different input elements, allowing them to focus on relevant information and ignore irrelevant parts of the input sequence.
4. *Ease of Development:* While more complex than RNNs, Transformers have a modular architecture that can be easier to develop and experiment with compared to traditional recurrent architectures.

**Cons:**
1. *Complexity:* Transformers have a more complex architecture compared to RNNs, which can make them harder to understand, implement, and interpret.
2. *Data Requirements:* Transformers require large amounts of data to train effectively, especially for tasks with complex patterns and dependencies.
3. *Resource Intensive:* Training large transformer models requires significant computational resources, including powerful GPUs or TPUs, making them less accessible for smaller-scale projects or individuals with limited resources.

### Hybrid Model (Combining RNNs and Transformers):

**Pros:**
1. *Combine Strengths:* A hybrid model can potentially combine the strengths of both RNNs and Transformers, leveraging the sequential processing capabilities of RNNs with the long-range dependency handling of Transformers.
2. *Flexibility:* A hybrid approach offers flexibility in model design, allowing researchers and practitioners to tailor the architecture to specific task requirements and data characteristics.

**Cons:**
1. *Complexity:* Developing and training a hybrid model can be more complex compared to using either RNNs or Transformers alone, as it requires integration of different architectural components and potentially more sophisticated training procedures.
2. *Resource Intensive:* Depending on the specific architecture and scale, training a hybrid model may require significant computational resources, similar to Transformers.


## Tokenization Strategies

#### Word-Level Tokenization:

**Description:**
Word-level tokenization splits the text into individual words, treating each word as a token.

**Libraries:**
1. NLTK (Natural Language Toolkit): Provides tokenization tools for various NLP tasks, including word-level tokenization.
2. spaCy: Another popular NLP library that offers word-level tokenization along with other NLP functionalities.

**Pros:**
1. Preserves semantic meaning of individual words.
2. Intuitive representation of text for language modeling tasks.

**Cons:**
1. May struggle with out-of-vocabulary words, especially in domain-specific or informal language.
2. Increases vocabulary size, potentially leading to higher memory usage.

**Suitability:**
Word-level tokenization may be suitable for this project as it preserves the semantic meaning of individual words, which can be important for generating coherent responses to user questions.

#### Character-Level Tokenization:

**Description:**
Character-level tokenization treats each character in the text as a separate token.

**Libraries:**
1. TensorFlow Text: Part of the TensorFlow ecosystem, TensorFlow Text provides utilities for various text processing tasks, including character-level tokenization.
2. Keras: With its text preprocessing module, Keras offers character-level tokenization capabilities.

**Pros:**
1. Captures fine-grained details in the text, useful for handling misspellings or morphologically complex words.
2. Helps in handling out-of-vocabulary terms effectively.

**Cons:**
1. Can be computationally expensive: the vocabulary is small, but sequences become much longer than with word-level or subword tokens.
2. May not capture higher-level semantic meaning as effectively as word-level tokenization.

**Suitability:**
Character-level tokenization might not be the best choice for this project, as it may not capture the semantic meaning of words effectively. However, it could be useful for capturing fine-grained details in the text if necessary.
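
To make the trade-off between the word-level and character-level approaches described above concrete, the sketch below tokenizes one sentence both ways using only the standard library. The regular expression is a simple stand-in for what NLTK or spaCy would do, and the sample sentence is invented.

```python
import re


def word_tokenize(text):
    """Split into lowercase words, keeping punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def char_tokenize(text):
    """Treat every character, including spaces, as its own token."""
    return list(text.lower())


sample = "Replace the carburetor gasket."
words = word_tokenize(sample)
chars = char_tokenize(sample)

print(words)                          # few tokens, each from a large open vocabulary
print(len(chars), len(set(chars)))    # many tokens, drawn from a small alphabet
```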

#### Byte Pair Encoding (BPE):

**Description:**
Byte Pair Encoding (BPE) tokenization iteratively merges the most frequent pairs of tokens to build a vocabulary of subword units.

**Libraries:**
1. Hugging Face Transformers: Provides tokenization functionalities, including BPE, along with pre-trained language models for various NLP tasks.
2. Tokenizers: A Python library specifically designed for fast and customizable tokenization, including BPE tokenization.

**Pros:**
1. Handles rare or out-of-vocabulary terms effectively.
2. Offers a good balance between accuracy and efficiency.

**Cons:**
1. Requires additional pre-processing steps compared to traditional tokenization methods.
2. Increases complexity of tokenization process, potentially impacting speed.

**Suitability:**
BPE tokenization could be a good choice for this project as it effectively handles rare or out-of-vocabulary terms, which may be present in the user-generated content scraped from online forums. It also offers good balance between accuracy and speed, which aligns with the project's performance metrics and resource constraints.
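
The merge procedure behind BPE can be sketched in a few lines of plain Python. This is a teaching sketch rather than what the Hugging Face `tokenizers` library does internally; the function names, the `</w>` end-of-word marker, and the tiny corpus are assumptions for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus, num_merges):
    """Learn an ordered list of merges from whitespace-separated text."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = "gasket gasket gasket gaskets carb carb carburetor"
merges = learn_bpe(corpus, num_merges=10)
print(merges[0])
print(apply_bpe("gaskets", merges))
```

Words that never appeared in the corpus still tokenize, because the merges fall back to smaller pieces and ultimately to single characters.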


## Summarization Strategies
It is very difficult to find a single strategy that can accommodate both long and short text blocks.

**Extractive Summarization**
<br>
Pros:
<br>
Good with Raw Text: Extractive methods can work directly with raw, unstructured text, as they mainly focus on selecting key sentences or phrases without needing deep linguistic processing.
<br>
Straightforward Implementation: These methods do not require complex preprocessing like tokenization or lemmatization, simplifying their implementation.
<br>
Cons:
<br>
Limited Depth in Understanding: While they can handle raw text, they may not fully capture the nuanced meaning, especially when the text contains complex structures or unorthodox language use.
<br>
Less Effective with Poorly Structured Text: In cases where the text is poorly structured or highly informal, extractive summarization might struggle to identify the main points effectively.
<br>
<br>
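
As a rough illustration of the extractive approach, the sketch below scores each sentence by the average corpus frequency of its words and keeps the top-scoring ones in their original order. It is a deliberately simple baseline, not the method used in this project; the stopword list, function name, and sample post are assumptions.

```python
import re
from collections import Counter

# Tiny hand-picked stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "on", "for", "i", "my", "with", "that", "this"}


def content_words(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)


# Invented forum-style post for illustration.
post = ("The carburetor on my coupe floods after a hot start. "
        "I rebuilt the carburetor last year with a new gasket kit. "
        "The weather was nice. "
        "Could the carburetor float be stuck?")
summary = extractive_summary(post, num_sentences=2)
print(summary)
```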


**Abstractive Summarization** (like sshleifer/distilbart-cnn-12-6)
<br>
Pros:
<br>
Advanced Processing Capabilities: Abstractive models, especially those based on transformer architectures, are designed to handle and interpret raw text, capturing deeper linguistic and contextual nuances.
<br>
Higher Tolerance for Unstructured Text: These models can manage unstructured or informal text by understanding and then rephrasing it in a more coherent and structured summary.
<br>
Cons:
<br>
Dependence on Preprocessing for Optimal Performance: While they can process raw text, the quality of the output can be significantly improved with proper tokenization and lemmatization, especially for complex texts.
<br>
Potential Overhead: Requires more computational resources to process and understand raw text, which might be more efficiently handled with some level of preprocessing.
<br>
<br>
**Hybrid Summarization**
<br>
Pros:
<br>
Flexibility in Text Processing: Combining extractive and abstractive methods allows for handling both raw and preprocessed text, adapting to the text's structure and complexity.
<br>
Balanced Approach: Can leverage the strengths of extractive methods in handling raw text for identifying key points, while using abstractive techniques for generating a coherent summary.
<br>
Cons:
<br>
Complex Preprocessing Requirements: The need to integrate both extractive and abstractive approaches may necessitate more sophisticated preprocessing strategies to optimize performance.
<br>
Potential for Processing Inefficiencies: The combined approach might lead to redundancies or inefficiencies in processing, especially if the text is either too raw or overly preprocessed.
<br>
<br>
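After trying sshleifer/distilbart-cnn-12-6, I found that its input limit (1024 tokens) is too restrictive for my needs. T5 uses relative position embeddings and so has no hard-coded input length, although memory use still grows with sequence length, so it is what I'll be trying next.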
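
One common workaround for a fixed input limit such as the 1024 mentioned above is to split long text into overlapping windows and summarize each window separately. The sketch below assumes the text has already been converted to a list of tokens; the function name `chunk_tokens` and the overlap of 64 are illustrative choices, not values from this project.

```python
def chunk_tokens(tokens, max_tokens=1024, overlap=64):
    """Split a token list into windows of at most max_tokens, sharing `overlap` tokens."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Integers stand in for the token ids of one long forum thread.
tokens = list(range(2500))
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

The overlap gives each window some context from the previous one, at the cost of summarizing a few tokens twice.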

## Designing the Model

This is a project to learn how to build a model using transformers. In the context of artificial intelligence, a transformer is a type of deep learning model that has revolutionized the field of natural language processing (NLP) and beyond. Introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017, transformers are designed to handle sequential data, like text, in a more efficient and effective way compared to previous models like recurrent neural networks (RNNs) and long short-term memory networks (LSTMs).

Transformers are characterized by their use of self-attention mechanisms, which allow them to weigh the importance of different parts of the input data differently. This is particularly useful for understanding the context and relationships between words in a sentence or elements in a sequence, regardless of how far apart those elements are.

Key features and advantages of transformers include:

**Parallelization:**

Unlike RNNs and LSTMs, which process data sequentially, transformers can process entire sequences of data in parallel. This significantly reduces training times and allows for more efficient computation.

**Scalability:**

Transformers can be scaled up with more layers and parameters to handle larger datasets and more complex tasks, making them highly effective for a wide range of applications.

**Flexibility:**

They can be adapted for a variety of tasks beyond text processing, including image recognition, audio processing, and even tasks in other domains like genomics.

**Self-Attention:**

The self-attention mechanism allows transformers to consider the entire context of a sequence when processing each element, leading to a better understanding of the data.
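
The self-attention computation described above can be sketched in NumPy as single-head scaled dot-product attention. The dimensions and random weights below are placeholders chosen for illustration; a trained model would learn the projection matrices, and real implementations add multiple heads, masking, and positional information.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # project inputs to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])    # similarity of every position to every other
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights                # weighted mix of value vectors


rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4                # placeholder sizes
x = rng.normal(size=(seq_len, d_model))        # stand-in for token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)
```

Each row of `weights` shows how much one position draws on every other position, which is what lets the model use the whole sequence as context.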

**Transfer Learning:**

Models based on transformers, such as BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pretrained Transformer), and others, have shown remarkable performance in transfer learning, where a model trained on one task can be adapted for another related task with minimal additional training.

Transformers are now a foundational element in the development of state-of-the-art AI systems for natural language understanding, text generation, translation, and more, pushing the boundaries of what's possible in AI research and applications.

## Training the Model

#### Pre-training Checklist:

**1. Review Model Architecture**

- Confirm Layer Configurations: Make sure each layer is configured as intended for your task. For a sequence generation model like yours, using Bidirectional(LSTM()) with return_sequences=True and a TimeDistributed(Dense()) layer is appropriate.
- Output Layer Compatibility: The final TimeDistributed(Dense(vocab_size, activation='softmax')) layer should match your vocabulary size, ensuring the model can predict each token in the sequence.

**2. Verify Data Preprocessing**

- Tokenization and Encoding: Ensure your questions and answers have been correctly tokenized and encoded to integer sequences. This usually involves using a tokenizer that fits your dataset.
- Padding: Verify that both input (questions) and output (answers) sequences are padded to the correct max_length. All sequences should have the same length to ensure consistent model input and output shapes.

**3. Ensure Correct Data Split**

- Training, Validation, and Test Sets: Confirm you have split your data into appropriate sets. Typically, you'd want a training set for model training, a validation set for tuning, and a test set for final evaluation.
- Balance and Representativeness: Check that each data split is representative of the overall dataset to avoid bias.

**4. Check Compilation Settings**

- Loss Function: For a sequence generation task, sparse_categorical_crossentropy is suitable when your labels are integer-encoded (not one-hot encoded). Ensure this aligns with how your target data is prepared.
- Optimizer and Metrics: Validate that you've chosen an optimizer and metrics that align with your model's goals. adam and accuracy are common choices, but ensure they fit your specific task.
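
The architecture and compilation settings named in the checklist items above can be sketched in Keras as follows. The sizes (`vocab_size=1000`, `max_length=20`, 64 units) and the function name `build_model` are placeholders, not the values used in this project, and `mask_zero=True` assumes that id 0 is reserved for padding.

```python
from tensorflow.keras import layers, models


def build_model(vocab_size, max_length, embed_dim=64, units=64):
    """Embedding -> Bidirectional LSTM -> per-timestep softmax over the vocabulary."""
    model = models.Sequential([
        layers.Input(shape=(max_length,)),
        # mask_zero=True assumes token id 0 is the padding id.
        layers.Embedding(vocab_size, embed_dim, mask_zero=True),
        layers.Bidirectional(layers.LSTM(units, return_sequences=True)),
        layers.TimeDistributed(layers.Dense(vocab_size, activation="softmax")),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # targets are integer token ids
        metrics=["accuracy"],
    )
    return model


model = build_model(vocab_size=1000, max_length=20)
model.summary()
```

The final output shape of `(batch, max_length, vocab_size)` is what allows integer-encoded target sequences to be used directly with `sparse_categorical_crossentropy`.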

**5. Model Summary Review**

- Use model.summary() to review your model's architecture. Confirm the number of parameters and the output shape at each layer align with your expectations.

**6. Small Scale Test Run**

- Consider doing a small-scale test run of your model training with a subset of your data. This can help identify potential issues early without the need for a full training cycle.

**7. Hardware and Runtime Environment**

- GPU Availability: Ensure you have access to a suitable GPU for training if your dataset and model are large. Training on a CPU can be significantly slower.
- Memory Constraints: Monitor memory usage during the test run to ensure your environment has sufficient resources to handle the full training process.


