<a href="https://colab.research.google.com/github/davidelgas/DataSciencePortfolio/blob/main/Language_Models/LLM_with_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Objective and Limitations

## i. Project Overview
The advent of modern automobile manufacturing has led to increased technical complexity, often resulting in mechanics opting to replace parts rather than diagnose and fix issues. This approach, while convenient for contemporary vehicles, poses a significant challenge for classic cars built 30 to 40 years ago, where replacement parts are scarce or non-existent.

To address this problem, this project aims to leverage Generative AI to create a "virtual mechanic." By building a corpus gathered from a classic car forum, this tool will be capable of understanding unstructured questions and providing relevant answers. This solution aims to assist classic car enthusiasts and mechanics by offering expert guidance, thereby preserving the heritage and functionality of vintage automobiles.

## ii. Objectives
The primary objective of this project is the development of a model as part of a portfolio of AI projects that can be showcased to potential employers. This will include an outline of the necessary workflow with a comparison and selection of architectures, libraries, and methods.

## iii. Use Case
With this code, a user will be able to ask questions in plain, unstructured English and receive plain-English answers derived from previous, similar questions. I will have control over the extent to which the answers are sourced from the supplemental corpus versus the pre-trained model.

## iv. Limitations and Challenges
To address budget constraints, a combination of open source and free resources will be used. Python will be the primary programming language. Google Colab will be used for the notebook with compute resources limited to CPUs.


In [None]:
# Access to Google Drive
# This seems to propagate credentials better from its own cell

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Install libraries

import warnings

# Suppress DeprecationWarnings
warnings.filterwarnings('ignore', category=DeprecationWarning)

# Install or upgrade TensorFlow and TensorFlow Text
!pip install --upgrade tensorflow tensorflow-text

# Import TensorFlow and TensorFlow Text directly
import tensorflow as tf
import tensorflow_text as tf_text

# transformers
try:
    from transformers import (
        BertTokenizer, BertModel, pipeline, AutoTokenizer, DistilBertModel,
        T5Tokenizer, T5ForConditionalGeneration, T5EncoderModel,
        GPT2Tokenizer, GPT2LMHeadModel, AutoModelForSeq2SeqLM
    )
except ImportError:
    !pip install transformers
    from transformers import (
        BertTokenizer, BertModel, pipeline, AutoTokenizer, DistilBertModel,
        T5Tokenizer, T5ForConditionalGeneration, T5EncoderModel,
        GPT2Tokenizer, GPT2LMHeadModel, AutoModelForSeq2SeqLM
    )

# gensim
try:
    from gensim.parsing.preprocessing import STOPWORDS
except ImportError:
    !pip install gensim
    from gensim.parsing.preprocessing import STOPWORDS

# sumy
try:
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.summarizers.lex_rank import LexRankSummarizer
    from sumy.nlp.tokenizers import Tokenizer
except ImportError:
    !pip install sumy
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.summarizers.lex_rank import LexRankSummarizer
    from sumy.nlp.tokenizers import Tokenizer

# pyspellchecker
try:
    from spellchecker import SpellChecker
except ImportError:
    import sys
    !{sys.executable} -m pip install pyspellchecker
    from spellchecker import SpellChecker

# faiss
try:
    import faiss
except ImportError:
    !pip install faiss-cpu
    import faiss

# snowflake.connector
try:
    import snowflake.connector
except ImportError:
    !pip install snowflake-connector-python
    import snowflake.connector

# pandas
try:
    import pandas as pd
except ImportError:
    !pip install pandas
    import pandas as pd

# requests
try:
    import requests
except ImportError:
    !pip install requests
    import requests

# BeautifulSoup
try:
    from bs4 import BeautifulSoup
except ImportError:
    !pip install beautifulsoup4
    from bs4 import BeautifulSoup

# nltk
try:
    import nltk
    nltk.download('punkt')
    nltk.download('wordnet')
    nltk.download('stopwords')

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
except ImportError:
    !pip install nltk
    import nltk
    nltk.download('punkt')
    nltk.download('wordnet')
    nltk.download('stopwords')

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer

# re
import re

# langdetect
try:
    from langdetect import detect
except ImportError:
    !pip install langdetect
    from langdetect import detect

# torch
try:
    import torch
except ImportError:
    !pip install torch
    import torch

# numpy
try:
    import numpy as np
except ImportError:
    !pip install numpy
    import numpy as np

# pyLDAvis
try:
    import pyLDAvis
except ImportError:
    !pip install pyLDAvis
    import pyLDAvis

# pickle
import pickle

# sklearn
try:
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
except ImportError:
    !pip install scikit-learn
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

# Other necessary imports
import time
import itertools
from concurrent.futures import ThreadPoolExecutor
import os



Collecting tensorflow
  Downloading tensorflow-2.16.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (590.6 MB)
Collecting tensorflow-text
  Downloading tensorflow_text-2.16.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.2 MB)
Collecting h5py>=3.10.0 (from tensorflow)
  Downloading h5py-3.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.3 MB)
Collecting ml-dtypes~=0.3.1 (from tensorflow)
  Downloading ml_dtypes-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993227 sha256=347ddb1aeeb3be81b20b63bb1a84dc9b9b2daea29f438d87bbc64e1afcbb3eba
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711ab78fba2f655d05106
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9
Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py

In [None]:
# Setup Logging

import logging

# Configure logging
log_path = '/content/drive/MyDrive/Colab Notebooks/model_logs.txt'
logging.basicConfig(
    filename=log_path,
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    # Notebook kernels often attach handlers to the root logger already;
    # force=True (Python 3.8+) replaces them so the file handler takes effect.
    force=True
)

# TensorFlow and TensorFlow Text were installed and imported in the
# previous cell, so that setup is not repeated here.
logging.info("Initial setup completed.")






# 1 Architectures and Frameworks

This document provides an overview of various architectures, models, and tools used in natural language processing tasks. Understanding the strengths and weaknesses of different approaches is crucial for designing effective NLP systems tailored to my specific use case and requirements.




## 1.1 NLP Architectures

### 1.1.1 Traditional Models

**Solution:** Bag-of-Words (BoW)  
**Description:** Represents text data as a collection of unique words and their frequencies.  
**Examples:** CountVectorizer, TfidfVectorizer (scikit-learn)  
**Pros:**  
- Simple and efficient representation.
- Works well for tasks like sentiment analysis and document classification.

**Cons:**  
- Ignores word order and context.
- Doesn't capture semantic meanings well.  
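
As a minimal sketch of the Bag-of-Words idea, the block below counts words with the standard library only. The `bag_of_words` helper and the two sentences are hypothetical illustrations, not part of the project corpus or pipeline.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, pull out word-like tokens, and count each one."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

# Hypothetical sentences: same words, different order
a = bag_of_words("The pump feeds the carburetor")
b = bag_of_words("The carburetor feeds the pump")

print(a["the"])  # 2
print(a == b)    # True: word order is discarded
```

The two sentences describe opposite relationships yet produce identical representations, which is the "ignores word order and context" limitation listed above.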

**Solution:** N-gram Model  
**Description:** Represents text data as a sequence of N consecutive words (N-grams).  
**Examples:** Bigram, Trigram  
**Pros:**  
- Captures some local word order and context.
- Simple and easy to implement.

**Cons:**  
- Limited in capturing long-range dependencies.
- Can become computationally expensive with larger N values.
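
A short sketch of N-gram extraction, assuming simple whitespace tokenization. The `ngrams` helper and the sample sentence are hypothetical.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the fuel pump is leaking".split()  # hypothetical sentence

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[0])     # ('the', 'fuel')
print(len(bigrams))   # 4
print(len(trigrams))  # 3
```

Each increase in N adds local context, but the number of distinct N-grams in a real corpus also grows quickly, which is where the computational cost noted above comes from.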

**Solution:** Rule-Based Models  
**Description:** Uses a set of manually crafted linguistic rules to process text data.  
**Examples:** Regular Expressions, SpaCy Rule-Based Matching  
**Pros:**  
- High precision for well-defined tasks.
- Transparent and interpretable.  

**Cons:**  
- Requires extensive domain knowledge and manual effort.
- Not scalable for large or diverse datasets.
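
To illustrate the precision and brittleness of a rule-based approach, here is a sketch that uses a single regular expression to pull torque values out of text. The pattern, the `extract_torque_specs` helper, and the sample posts are assumptions made up for this example.

```python
import re

# Hypothetical rule: a number followed by a torque unit (ft-lb, ft-lbs, or Nm)
TORQUE_RULE = re.compile(r"(\d+(?:\.\d+)?)\s*(ft-lbs?|nm)", re.IGNORECASE)

def extract_torque_specs(text):
    """Return (value, unit) pairs for every match of the torque rule."""
    return [(float(value), unit.lower()) for value, unit in TORQUE_RULE.findall(text)]

post = "Tighten the head bolts to 65 ft-lbs, then the cover to 10 Nm."
specs = extract_torque_specs(post)
print(specs)  # [(65.0, 'ft-lbs'), (10.0, 'nm')]

# The same information phrased differently is missed entirely
missed = extract_torque_specs("Torque it to sixty-five foot pounds.")
print(missed)  # []
```

The rule is exact when the text matches the expected form and silent otherwise, which is why this family scores high on precision but low on scalability.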

### 1.1.2 Statistical NLP Models

**Solution:** Hidden Markov Models (HMM)  
**Description:** Sequential text models based on hidden state transitions.  
**Example:** hmmlearn  
**Pros:**  
- Captures sequential dependencies effectively.
- Suitable for tasks like part-of-speech tagging and named entity recognition.

**Cons:**  
- Requires labeled sequential data for training.
- May struggle with capturing complex semantic relationships.

**Solution:** Conditional Random Fields (CRF)  
**Description:** Sequence labeling models.  
**Example:** sklearn-crfsuite  
**Pros:**  
- Effective for sequential labeling tasks.
- Incorporates feature dependencies between adjacent labels.

**Cons:**  
- Requires labeled sequential data for training.
- Less effective for capturing long-range dependencies.

**Solution:** Support Vector Machines (SVM)  
**Description:** A supervised learning model used for classification and regression analysis.  
**Example:** scikit-learn  
**Pros:**  
- Effective in high-dimensional spaces.
- Versatile with different kernel functions for flexibility in decision boundaries.  

**Cons:**  
- Memory-intensive for large datasets.
- May require careful selection of kernel functions and tuning parameters.

### 1.1.3 Deep Learning Models

**Solution:** Word Embeddings  
**Description:** Represent words as dense vectors in a continuous vector space.  
**Examples:** Word2Vec, GloVe  
**Pros:**  
- Captures semantic meanings and relationships between words.
- Provides dense vector representations suitable for downstream tasks.

**Cons:**  
- Requires large amounts of data for training.
- Struggles with out-of-vocabulary words.

**Solution:** Recurrent Neural Networks (RNN)  
**Description:** Neural networks that process sequences by iterating through elements.  
**Examples:** Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU)  
**Pros:**  
- Effective for capturing sequential dependencies in data.
- Suitable for tasks like language modeling and machine translation.

**Cons:**  
- Vulnerable to vanishing and exploding gradient problems.
- Computationally expensive to train.

**Solution:** Convolutional Neural Networks (CNN) for Text  
**Description:** Application of convolution operations to capture local dependencies in text.  
**Examples:** TextCNN, KimCNN  
**Pros:**  
- Effective for tasks like sentence classification and text categorization.
- Captures local patterns and relationships in text.  

**Cons:**  
- May not capture long-range dependencies as effectively as other solutions.
- Requires careful tuning of convolutional filters and pooling strategies.

### 1.1.4 Transformers

**Solution:** Transformer Models  
**Description:** Neural network architecture based entirely on self-attention mechanisms.  
**Examples:** BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), T5 (Text-To-Text Transfer Transformer)  
**Pros:**  
- Captures long-range dependencies effectively.
- Parallelizable training process.  

**Cons:**  
- Requires large amounts of computational resources.
- Limited interpretability compared to traditional models.

**Solution:** Pre-trained Models  
**Description:** Models pre-trained on large corpora and fine-tuned for specific tasks.  
**Examples:** BERT, GPT, T5  
**Pros:**  
- Leverage large amounts of unlabeled data for pre-training.
- Achieve state-of-the-art performance on various NLP tasks.  

**Cons:**  
- Resource-intensive pre-training process.
- May require substantial computational resources for fine-tuning.

**Solution:** Attention Mechanisms  
**Description:** Mechanisms that enable models to focus on specific parts of the input.  
**Examples:** Self-Attention, Multi-Head Attention  
**Pros:**  
- Improves the ability to capture dependencies and relationships within the data.
- Enhances performance in tasks such as machine translation and text summarization.

**Cons:**  
- Can be computationally intensive.
- Complexity increases with the number of attention heads and layers.
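
The following is a minimal, pure-Python sketch of scaled dot-product attention for a single query, the building block behind self-attention. The vectors are toy values chosen for illustration; real models use learned projections and batched tensor operations.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Output is the weighted average of the value vectors
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Toy example: the query aligns with the first key
query = [2.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

weights, output = attention(query, keys, values)
print(weights)  # the first weight is the larger of the two
print(output)   # output leans toward the first value vector
```

Multi-head attention repeats this computation in parallel with different learned projections, which is the source of the added cost noted in the cons.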

### 1.1.5 Additional Models and Techniques

**Solution:** Retriever-Generator Models  
**Description:** Models that combine retrieval and generation components for text generation tasks.  
**Examples:** RAG  
**Pros:**  
- Incorporates both structured and unstructured information for generation.
- Produces more diverse and contextually relevant responses.  

**Cons:**  
- Requires efficient retrieval mechanisms.
- Increased complexity in model architecture.

**Solution:** Knowledge-Enhanced Retrieval-Augmented Generation (KERAG)  
**Description:** A variant of RAG that incorporates knowledge graphs.  
**Examples:** Graph-BERT  
**Pros:**  
- Integrates structured knowledge for improved understanding and generation.
- Enables more coherent and contextually relevant responses.  

**Cons:**  
- Requires high-quality and curated knowledge graphs.
- Increased computational complexity compared to standard RAG.

**Solution:** Elastic Search  
**Description:** Distributed search and analytics engine for indexing and searching big data.  
**Examples:** Elasticsearch, Apache Solr  
**Pros:**  
- Scalable and distributed architecture.
- Supports full-text search and complex query structures.  

**Cons:**  
- Requires infrastructure for deployment and maintenance.
- Indexing and search performance may degrade with large datasets.

## Architecture Options Score Card

| Model/Architecture  | Key Strength                   | CPU Compatibility | Ease of Use | Performance & Accuracy | Scalability | Integration | Total |
|---------------------|-------------------------------|-------------------|-------------|------------------------|-------------|-------------|-------|
| RAG                 | Context Understanding          | 1                 | 1           | 2                      | 2           | 2           | 8     |
| BoW                 | Simplicity                     | 2                 | 2           | 0                      | 2           | 2           | 8     |
| N-gram Model        | Local Context                  | 2                 | 2           | 1                      | 1           | 2           | 8     |
| Pre-trained Models  | Accuracy                       | 1                 | 1           | 1                      | 2           | 2           | 7     |
| Word Embeddings     | Semantic Understanding         | 1                 | 1           | 2                      | 1           | 2           | 7     |
| Elastic Search      | Scalability                    | 2                 | 1           | 1                      | 2           | 1           | 7     |
| CNN                 | Local Pattern Recognition      | 1                 | 1           | 2                      | 1           | 1           | 6     |
| Rule Based          | High Precision                 | 2                 | 2           | 0                      | 1           | 1           | 6     |
| Transformer Models  | State-of-the-Art               | 0                 | 0           | 2                      | 1           | 2           | 5     |
| KERAG               | Knowledge Integration          | 0                 | 0           | 2                      | 1           | 2           | 5     |
| HMM                 | Sequence Modeling              | 1                 | 1           | 1                      | 1           | 1           | 5     |
| CRF                 | Sequential Labeling            | 1                 | 1           | 1                      | 1           | 1           | 5     |
| SVM                 | Versatile                      | 1                 | 1           | 1                      | 1           | 1           | 5     |
| Sequence Models     | Order Preservation             | 1                 | 1           | 1                      | 1           | 1           | 5     |
| Attention Mechanisms| Focus on Specific Parts        | 0                 | 0           | 2                      | 1           | 2           | 5     |
| RNN                 | Sequential Dependencies        | 0                 | 1           | 2                      | 0           | 1           | 4     |

**0: Does not meet; 1: Partially meets; 2: Fully meets**




The score card was a valuable tool to reduce options down to those most appropriate for this project. A final audit is as follows:

### RAG (Retrieval-Augmented Generation):
- **Strengths:** Excellent at context understanding and contextually relevant responses.
- **Weaknesses:** Computationally intensive and may require more resources.
- **Suitability:** High, if you need detailed, context-aware answers and have the necessary computational resources.

### BoW (Bag-of-Words):
- **Strengths:** Simple and efficient, easy to implement, and works well for basic tasks.
- **Weaknesses:** Ignores word order and context, may not capture semantic meaning well.
- **Suitability:** Moderate, for straightforward tasks where simplicity and efficiency are prioritized over contextual understanding.

### N-gram Model:
- **Strengths:** Captures some local word order and context, relatively simple to implement.
- **Weaknesses:** Limited in capturing long-range dependencies, can become computationally expensive with larger N values.
- **Suitability:** Moderate, for tasks where local context is important, but computational efficiency is still needed.

## Conclusion
The RAG model is most appropriate for this effort because it draws heavily on domain-specific information that may be missing from a stand-alone pre-trained model. For this use case, its contextual accuracy carries more weight than Ease of Use. That said, it is a computationally heavy architecture, and it is unclear whether free cloud CPU resources will be sufficient.
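
The tie between the three top-scoring architectures can be broken by weighting the criteria. The sketch below copies the three rows from the score card and applies a hypothetical weight of 2 to Performance & Accuracy; the exact weights are an assumption for illustration, not values used elsewhere in this notebook.

```python
# Column order: CPU Compatibility, Ease of Use, Performance & Accuracy,
# Scalability, Integration (scores copied from the score card)
scores = {
    "RAG":    [1, 1, 2, 2, 2],
    "BoW":    [2, 2, 0, 2, 2],
    "N-gram": [2, 2, 1, 1, 2],
}

# Hypothetical weights: Performance & Accuracy counts double
weights = [1, 1, 2, 1, 1]

def total(row, w=None):
    """Sum the scores, optionally multiplying each by a weight."""
    w = w or [1] * len(row)
    return sum(s * wi for s, wi in zip(row, w))

unweighted = {name: total(row) for name, row in scores.items()}
weighted = {name: total(row, weights) for name, row in scores.items()}
best = max(weighted, key=weighted.get)

print(unweighted)  # {'RAG': 8, 'BoW': 8, 'N-gram': 8}
print(weighted)    # {'RAG': 10, 'BoW': 8, 'N-gram': 9}
print(best)        # RAG
```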

## Example of RAG Model Implementation
1. **Query:** "What are the benefits of using a RAG model?"
2. **Retriever:**
   - Searches a corpus for relevant documents or passages related to "benefits of using a RAG model".
   - Retrieves top-k documents or passages that discuss the advantages of RAG models.
3. **Generator:**
   - Takes the retrieved documents and generates a response: "A RAG model combines the strengths of information retrieval and generative modeling. It retrieves relevant documents to provide context and generates accurate and contextually appropriate responses. This makes it highly effective for tasks requiring detailed and specific information."

For this project, the Retriever will search the corpus scraped from the online forum, represented with Word Embeddings, and the Generator will be a pre-trained model.
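
As a rough sketch of the retrieval step, the block below ranks passages by cosine similarity to a query vector and returns the top k. The three-dimensional vectors and passage texts are made up for illustration; in practice the vectors would come from an embedding model and the search would likely be delegated to an index such as FAISS.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, corpus, k=2):
    """Return the texts of the k passages most similar to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item["vector"]), reverse=True)
    return [item["text"] for item in ranked[:k]]

# Hypothetical corpus with toy embeddings
corpus = [
    {"text": "Carburetor floods when hot", "vector": [1.0, 0.2, 0.0]},
    {"text": "Rebuilding the fuel pump", "vector": [0.7, 0.6, 0.1]},
    {"text": "Best wax for original paint", "vector": [0.0, 0.1, 1.0]},
]
query_vec = [0.9, 0.1, 0.0]  # stands in for an embedded fuel-system question

top = retrieve(query_vec, corpus, k=2)
print(top)  # ['Carburetor floods when hot', 'Rebuilding the fuel pump']
```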

### RAG Options

Example 1: LLM as the Primary Responder

1. User asks a question.
2. System retrieves relevant documents from a supplemental corpus using a retrieval system (e.g., Elasticsearch, FAISS).
3. Retrieved documents are used to create a prompt that is fed into the LLM.
4. LLM generates a response based on the retrieved documents.

In this approach, the LLM takes the retrieved documents and synthesizes a response. The LLM is responsible for understanding the context, extracting relevant information from the documents, and generating a coherent answer.  
<br>  

Example 2: LLM as the Editor

1. User asks a question.
2. System retrieves relevant documents from a supplemental corpus using a retrieval system.
3. System extracts the answer from the retrieved documents.
4. LLM ensures the answer is grammatically correct and potentially enhances the response for fluency.
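
The difference between the two options comes down to what is placed in the prompt. The sketch below contrasts them using a stand-in `fake_generate` function in place of a real model call; every function name, the prompt wording, and the sample passages are hypothetical, and the "extraction" in the editor flow is deliberately naive (it takes the top-ranked passage).

```python
def build_prompt(question, passages):
    """Assemble retrieved passages and the question into one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def llm_as_responder(question, passages, generate):
    """Example 1: the model synthesizes an answer from all passages."""
    return generate(build_prompt(question, passages))

def llm_as_editor(passages, generate):
    """Example 2: the system picks the answer; the model only polishes it."""
    draft = passages[0]  # naive extraction: top-ranked passage
    return generate(f"Rewrite this answer so it is fluent and grammatical:\n{draft}")

def fake_generate(prompt):
    """Stand-in for a real model call; wraps the prompt so it can be inspected."""
    return "MODEL(" + prompt + ")"

passages = ["Check the float level first.", "A stuck needle valve causes flooding."]
out1 = llm_as_responder("Why does my carburetor flood?", passages, fake_generate)
out2 = llm_as_editor(passages, fake_generate)

print(out1)
print(out2)
```

In the first flow the model sees every retrieved passage and decides what matters; in the second it sees only the pre-selected draft, which gives the system tighter control over how much of the answer comes from the corpus.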

## 1.2 Generator Options

**Vendor:** OpenAI  
**Package:** GPT (GPT-2)  
**Description:** Generative Pre-trained Transformer for generating text.  
**Pros:**  
- Highly capable of generating coherent and contextually relevant text.
- Free to access and use.  

**Cons:**  
- Requires significant computational resources for fine-tuning.
- GPT-2 is less powerful than newer models.

**Vendor:** Anthropic  
**Package:** Claude (Claude 3)  
**Description:** AI assistant designed for safety and ethical considerations.  
**Pros:**  
- Enhanced safety features and focus on ethical AI use.
- Designed for robust handling of varying text lengths.

**Cons:**  
- Available only through a paid, hosted API, which conflicts with this project's free-resource constraint.
- Model weights are not published, so it cannot be run or fine-tuned on local CPU resources.

**Vendor:** Meta  
**Package:** DistilBART (sshleifer/distilbart-cnn-12-6)  
**Description:** A distilled version of BART, fine-tuned for summarization on the CNN/DailyMail dataset.  
**Pros:**  
- Smaller and faster than the full BART model, making CPU inference more practical.
- Open-source and free, allowing for flexible use and customization.  

**Cons:**  
- Less powerful than the full BART model due to distillation.
- May require additional integration efforts compared to more commercial models.

**Vendor:** Google  
**Package:** T5 (t5-small)  
**Description:** Text-to-Text Transfer Transformer for various NLP tasks.  
**Pros:**  
- Highly flexible and powerful for a wide range of text-to-text tasks.
- Open-source and free to use and fine-tune.

**Cons:**  
- May require additional preprocessing steps for certain tasks.
- The small version is less powerful compared to larger T5 models.

**Vendor:** Amazon  
**Package:** AWS Comprehend  
**Description:** Managed NLP service for text analysis and insights.  
**Pros:**  
- Fully managed service, reducing the burden of infrastructure management.
- Tight integration with AWS ecosystem, offering scalability and ease of use.
  
**Cons:**  
- API-dependent, limiting control over the underlying models.
- Cloud performance may not be as optimized as specific CPU performance.

## Generator Option Score Card

| Vendor    | Package       | Key Strength               | CPU Compatibility | Ease of Use | Performance & Accuracy | Integration & Flexibility | Scalability | Total |
|-----------|---------------|----------------------------|-------------------|-------------|------------------------|---------------------------|-------------|-------|
| Meta      | DistilBART    | CPU Efficiency             | 2                 | 2           | 1                      | 2                         | 2           | 9     |
| Google    | T5-small      | Flexibility                | 2                 | 1           | 2                      | 2                         | 1           | 8     |
| Amazon    | Comprehend    | Managed Service            | 1                 | 2           | 1                      | 2                         | 1           | 7     |
| OpenAI    | GPT-2         | Coherent Text Generation   | 0                 | 2           | 2                      | 2                         | 0           | 6     |
| Anthropic | Claude 3      | Ethical AI                 | 1                 | 1           | 2                      | 1                         | 1           | 6     |

**0: Does not meet; 1: Partially meets; 2: Fully meets**





## 1.3 Frameworks and Tools

**Vendor:** Google  
**Package:** TensorFlow  
**Description:** Open-source ML framework for building and deploying models.  
**Pros:**  
- Comprehensive ecosystem with deep learning support.
- Scalable on both CPUs and GPUs.  

**Cons:**  
- Steeper learning curve than some other frameworks.
- Limited support for dynamic computation graphs.

**Vendor:** Meta  
**Package:** PyTorch  
**Description:** Open-source deep learning framework by Meta AI Research.  
**Pros:**  
- Pythonic and intuitive interface for model development.
- Dynamic computation graph for easier debugging and experimentation.  

**Cons:**  
- Less optimized for production deployment than TensorFlow.
- Limited built-in support for distributed training.

**Vendor:** AWS  
**Package:** Amazon Bedrock  
**Description:** Fully managed service that provides API access to foundation models from several providers.  
**Pros:**  
- Integrated support for various ML frameworks.
- Scalable infrastructure with extensive AWS services integration.

**Cons:**  
- Requires AWS-specific knowledge for optimal use.
- Potentially high costs for extensive usage.

**Vendor:** Hugging Face  
**Package:** Hugging Face Transformers  
**Description:** Open-source library providing pre-trained models and tools for NLP tasks.  
**Pros:**  
- Easy access to a wide range of pre-trained models.
- Supports integration with both TensorFlow and PyTorch.  

**Cons:**  
- Requires knowledge of underlying frameworks for customization.
- Performance dependent on the selected model and hardware.

## Framework Score Card

| Vendor    | Package             | Key Strength                  | CPU Compatibility | Ease of Use | Performance & Accuracy | Integration & Flexibility | Scalability | Total |
|-----------|---------------------|-------------------------------|-------------------|-------------|------------------------|---------------------------|-------------|-------|
| Meta      | PyTorch             | Pythonic Interface            | 2                 | 2           | 2                      | 2                         | 1           | 9     |
| Google    | TensorFlow          | Comprehensive Ecosystem       | 2                 | 1           | 2                      | 2                         | 2           | 9     |
| AWS       | Bedrock             | Fully Managed Service         | 2                 | 1           | 2                      | 2                         | 2           | 9     |
| Hugging Face | Transformers     | Wide Range of Pre-trained Models | 2               | 2           | 2                      | 2                         | 1           | 9     |

**0: Does not meet; 1: Partially meets; 2: Fully meets**




## 1.4 Embedding

**Solution:** Universal Sentence Encoder  
**Provider:** Google  
**Libraries:** TensorFlow Hub  
**Pros:**  
- Captures sentence-level embeddings, enhancing text understanding.
- Efficient and easy to integrate with TensorFlow models.  

**Cons:**  
- May not capture fine-grained word-level nuances.
- Performance can vary depending on the complexity of the sentences.

**Solution:** FastText  
**Provider:** Meta  
**Libraries:** Gensim, TensorFlow, PyTorch  
**Pros:**  
- Handles out-of-vocabulary words as bags of character n-grams.
- Captures subword information, enhancing the representation of rare words.  

**Cons:**  
- Increases computational complexity due to subword representations.
- Larger model size compared to Word2Vec and GloVe.
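
To show how character n-grams help with out-of-vocabulary words, the sketch below extracts n-grams of length 3 to 6 from a word wrapped in `<` and `>` boundary markers, which mirrors the scheme FastText describes. The `char_ngrams` helper is a simplified illustration, not the library's implementation, and the misspelling is a made-up example.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of the word, with boundary markers added."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# A misspelled (unseen) word still shares many n-grams with the known word
known = set(char_ngrams("carburetor"))
unseen = set(char_ngrams("carburator"))
shared = known & unseen
print("<carbu" in shared)  # True
```

Because an unseen word's vector can be assembled from n-grams it shares with known words, forum misspellings still land near the intended term. The cost is the larger number of vectors that must be stored, as noted in the cons.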

**Solution:** Amazon SageMaker Embeddings  
**Provider:** AWS  
**Libraries:** Amazon SageMaker  
**Pros:**  
- Provides pre-built models for embeddings, simplifying deployment.
- Integrates seamlessly with other AWS services for scalability.  

**Cons:**  
- Requires familiarity with the AWS ecosystem.
- Costs can increase with extensive usage.

**Solution:** GPT-3 Embeddings  
**Provider:** OpenAI  
**Libraries:** OpenAI API  
**Pros:**  
- Generates high-quality, contextually relevant text embeddings.
- Handles long-range dependencies and contextual information.  

**Cons:**  
- Requires significant computational resources.
- Access may require API usage and associated costs.

**Solution:** Claude Embeddings  
**Provider:** Anthropic  
**Libraries:** Anthropic API  
**Pros:**  
- Offers state-of-the-art embeddings with a focus on safety and ethics.
- Handles context and nuances effectively for complex tasks.  

**Cons:**  
- Primarily available for research access, limiting commercial use.
- Access may require API usage and associated costs.

## 1.5 Tokenization

Tokenization is a crucial preprocessing step in NLP, segmenting text into manageable units for further analysis or model training. The choice of tokenization strategy affects both the complexity of the model and its ability to understand the text.

**Solution:** Word-level Tokenization  
**Libraries:** NLTK, spaCy, TensorFlow/Keras Tokenizers, Hugging Face Tokenizers  
**Pros:**  
- Preserves word integrity and meaning, crucial for comprehension tasks.
- Can be paired with subword methods such as BPE to handle unknown words.

**Cons:**  
- Can result in a large vocabulary, increasing memory and processing needs.
- May overlook nuances in character-level variations.

**Solution:** Character-level Tokenization  
**Libraries:** Supported by deep learning frameworks like TensorFlow and Keras  
**Pros:**  
- Captures morphological nuances at the character level, aiding morphologically rich languages.
- Simplifies vocabulary to unique characters, reducing model complexity.  

**Cons:**  
- Leads to longer input sequences, increasing computational costs.
- Loses direct access to semantic information in words or phrases.
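
A quick comparison of word-level and character-level tokenization on the same input makes the sequence-length trade-off concrete. The sample phrase is hypothetical and the word split is a plain whitespace split.

```python
text = "carburetor rebuild kit"  # hypothetical input

word_tokens = text.split()  # word-level: split on whitespace
char_tokens = list(text)    # character-level: one token per character

print(len(word_tokens))  # 3
print(len(char_tokens))  # 22

# The character vocabulary stays small no matter how many words appear
char_vocab = set(char_tokens)
print(sorted(char_vocab))
```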

**Solution:** Subword Tokenization  
**Description:** A blend of word-level and character-level tokenization methods.  
**Libraries:** SentencePiece, Hugging Face Tokenizers  
**Pros:**  
- Balances vocabulary size and semantic information preservation.
- Handles rare or unknown words by breaking them into recognizable subwords.  

**Cons:**  
- Requires preprocessing to establish a subword vocabulary, adding complexity.
- Generated subwords may lack standalone meaning, complicating interpretation.
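
The sketch below shows one common way subword tokenizers split a word: greedy longest-match against a fixed vocabulary, with `##` marking pieces that continue a word (the WordPiece convention). The tiny vocabulary is hypothetical; real vocabularies are learned from a corpus and contain tens of thousands of entries.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match split of a word into vocabulary pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: treat the whole word as unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary
vocab = {"carb", "##ure", "##tor", "##s", "re", "##build"}

print(subword_tokenize("carburetors", vocab))  # ['carb', '##ure', '##tor', '##s']
print(subword_tokenize("rebuild", vocab))      # ['re', '##build']
print(subword_tokenize("xyz", vocab))          # ['[UNK]']
```

Pieces such as `##ure` carry little meaning on their own, which illustrates the interpretation drawback listed in the cons.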

**Solution:** Model-Specific Tokenization  
**Libraries:** Hugging Face's transformers library provides access to pre-built tokenizers  
**Pros:**  
- Ensures tokenization consistency with the model's original training data.
- Reduces the need for extra preprocessing steps and custom tokenization.  

**Cons:**  
- Limited flexibility to change tokenization beyond the model's method.
- May not be efficient for tasks outside the model's specific design.

Tokenization and embedding must be considered together because tokenization directly impacts the quality of embedding. The choice of tokenization method determines how text is segmented, which in turn affects how embeddings capture context and meaning. Inconsistent tokenization can lead to poor embeddings and reduced model performance. Properly aligned tokenization and embedding processes ensure that the text's structure and semantics are preserved, enhancing overall model effectiveness.
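To make the trade-offs above concrete, here is a minimal sketch that segments the same sentence three ways. It uses only the standard library. The `toy_vocab` set and the greedy longest-match splitter are illustrative assumptions, not the vocabulary or algorithm of any real tokenizer, which learns its subwords from data.

```python
import re

def word_tokenize(text):
    """Word-level: split on runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text.lower())

def char_tokenize(text):
    """Character-level: every non-space character is a token."""
    return [ch for ch in text.lower() if not ch.isspace()]

def subword_tokenize(word, vocab):
    """Greedy longest-match subword split (a simplified, WordPiece-like idea)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no known piece: fall back to a single character
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

sentence = "Rebuilt the carburetor"
toy_vocab = {"carb", "uret", "or", "re", "built", "the"}  # hypothetical vocabulary

words = word_tokenize(sentence)
chars = char_tokenize(sentence)
subwords = [piece for w in words for piece in subword_tokenize(w, toy_vocab)]

print(words)       # 3 tokens, but "carburetor" must be in the vocabulary
print(len(chars))  # many more tokens, tiny vocabulary
print(subwords)    # in between: rare words are split into known pieces
```

The sequence lengths illustrate the point: the word-level split is shortest, the character-level split is longest, and the subword split sits between the two while still covering a word the vocabulary has never seen whole.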

## Tokenization and Embedding Score Card

| Vendor    | Embedder                    | Tokenizer                      | CPU Compatibility | Ease of Use | Performance & Accuracy | Integration & Flexibility | Scalability | Total |
|-----------|-----------------------------|--------------------------------|-------------------|-------------|------------------------|---------------------------|-------------|-------|
| Google    | Universal Sentence Encoder  | TensorFlow Text                | 2                 | 2           | 2                      | 2                         | 2           | 10    |
| Meta      | FastText                    | Gensim Tokenizer or NLTK       | 2                 | 2           | 2                      | 2                         | 1           | 9     |
| OpenAI    | GPT-2 Embeddings            | GPT-2 Tokenizer                | 2                 | 2           | 2                      | 2                         | 1           | 9     |
| Amazon    | SageMaker Embeddings        | SageMaker’s preprocessing tools| 1                 | 2           | 2                      | 2                         | 1           | 8     |
| Anthropic | Claude Embeddings           | Built-in tokenization          | 1                 | 2           | 2                      | 2                         | 0           | 7     |

**Scoring: 0 = Does not meet, 1 = Partially meets, 2 = Fully meets**



## 1.6 Solution Leader Board

| Vendor    | CPU Compatibility | Ease of Use | Performance & Accuracy | Integration & Flexibility | Scalability | Total |
|-----------|-------------------|-------------|------------------------|---------------------------|-------------|-------|
| Google    | 6                 | 4           | 6                      | 6                         | 5           | 27    |
| Meta      | 6                 | 6           | 5                      | 6                         | 4           | 27    |
| Amazon    | 4                 | 5           | 5                      | 6                         | 4           | 24    |
| OpenAI    | 4                 | 6           | 6                      | 6                         | 2           | 24    |
| Anthropic | 4                 | 5           | 6                      | 5                         | 2           | 22    |

## Conclusion

The results give a clearer picture of how the strengths of each option compare. Google and Meta tie for the highest total (27). Google holds an edge over Meta on Performance & Accuracy and Scalability but trails on Ease of Use. Beyond the scores, using Google would support my efforts to gain Google Cloud Certification, a highly desirable skill in the job market. With that in mind, I’ll be moving forward with a Google-dominant stack.

# 2 Develop Corpus


## 2.1 Data Ethics
The data collected here consists of posts from a publicly accessible forum. However, should this project move into public distribution, additional steps will be necessary to ensure PII is obfuscated or removed. In addition, this document shall serve as full disclosure of the project's goals and data gathering process.

### Data Collection
The project leverages user-generated content from a domain-specific online forum as the training corpus. This data is largely unstructured, with minimal metadata available. The following tools were considered to gather the source text for the corpus:

#### Web Scraping
**Tools:** Beautiful Soup, online SaaS products  
**Pros:**  
- Direct Access to Targeted Data: Enables precise extraction of user-generated content from specific sections or threads within the forum.
- Efficiency in Data Collection: Automated scripts can gather large volumes of data in a short amount of time, making it suitable for assembling significant datasets for NLP.  

**Cons:**  
- Potential for Incomplete Data: May miss embedded content or dynamically loaded data, depending on the website’s structure.
- Ethical and Legal Considerations: Scraping data from forums may raise concerns about user privacy and must adhere to the terms of service of the website.
- Very Platform Dependent: Forum-specific solutions result in forum-specific data schemas that must be reverse engineered for successful text extraction.

#### Forum-specific APIs
**Tools:** Python (`requests` library for API calls and `json` library for handling responses)  
**Pros:**  
- Structured and Reliable Data Retrieval: APIs provide structured data, making it easier to process and integrate into your project.
- Efficient and Direct Access: Directly accessing the forum's data through its API is efficient, bypassing the need for HTML parsing.
- Compliance and Ethical Data Use: Utilizing APIs respects the forum's data policies and ensures access is in line with user agreements.  

**Cons:**  
- Rate Limiting: APIs often have limitations on the number of requests that can be made in a certain timeframe, which could slow down data collection.
- API Changes: Dependence on the forum's API structure means that changes or deprecation could disrupt your data collection pipeline.
- Access Restrictions: Some data or functionalities might be restricted or require authentication, posing additional challenges for comprehensive data collection.
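The rate-limiting concern above is usually handled with retries and exponential backoff. The sketch below is a minimal illustration, assuming the API signals throttling with HTTP status 429. `fetch_with_backoff`, `fake_fetch`, and the example URL are hypothetical names introduced here; the fetch function is passed in as a callable so the logic can be exercised without network access.

```python
import time

def fetch_with_backoff(fetch, url, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) and retry with exponential backoff while rate limited.

    `fetch` is any callable returning (status_code, body). Status 429 is
    treated as "too many requests"; any other status is returned as-is.
    """
    status, body = None, None
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status != 429:
            break
        if attempt < max_retries:
            sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
    return status, body

# Simulated API that rejects the first two calls, then succeeds
calls = {"n": 0}

def fake_fetch(url):
    calls["n"] += 1
    return (429, "") if calls["n"] < 3 else (200, '{"posts": []}')

delays = []  # record the waits instead of actually sleeping
status, body = fetch_with_backoff(
    fake_fetch, "https://example.com/api/threads/1", sleep=delays.append
)
print(status, delays)
```

In real use, `fetch` could wrap `requests.get` and return `(response.status_code, response.text)`, leaving `sleep` at its default.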


## 2.2 Ingest Corpus from scratch

In [None]:
# 2.2 Ingest Corpus from scratch

# Guard against accidental re-scraping: remove the line below if you want to create a new corpus
raise RuntimeError("Remove this line if you want to create a new corpus")



# Step 1 Create Corpus
# Fetch and process forum threads
# Corpus created in LDA notebook can be used.

BASE_PATH = '/content/drive/MyDrive/Colab Notebooks/NLP/LLM_RAG/'

def forum_thread_ids():
    threads = 1  # Set the number of incremental threads to process here

    file_path = os.path.join(BASE_PATH, 'e9_forum_thread_ids.csv')

    if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
        e9_forum_thread_ids = pd.read_csv(file_path)
        last_thread_id = int(e9_forum_thread_ids['thread_id'].iloc[-1])
    else:
        e9_forum_thread_ids = pd.DataFrame(columns=['thread_id'])
        last_thread_id = 0

    next_thread_id = last_thread_id + 1
    new_urls = [{'thread_id': thread_id} for thread_id in range(next_thread_id, next_thread_id + threads)]

    new_df = pd.DataFrame(new_urls)
    e9_forum_thread_ids = pd.concat([e9_forum_thread_ids, new_df], ignore_index=True)
    e9_forum_thread_ids.to_csv(file_path, index=False)

    print(f"Starting with thread_id {last_thread_id}")
    print(f"Processing additional {threads} thread(s)")
    print(f"Ending with thread_id {next_thread_id + threads - 1}")

    return new_df

def forum_thread_url(df):
    if df.empty:
        print("No new threads to process.")
        return pd.DataFrame()

    pages = 1

    for index, row in df.iterrows():
        thread_id = row['thread_id']
        thread_url = f"https://e9coupe.com/forum/threads/{thread_id}"
        for i in range(1, pages + 1):
            page_url = f"{thread_url}/?page={i}"
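            response = requests.get(page_url, timeout=30)  # do not hang on a stalled request
            soup = BeautifulSoup(response.text, 'html.parser')
            title_tag = soup.find('title')
            # Missing or deleted threads may not return a <title>; fall back to a placeholder
            thread_title = title_tag.get_text().split('|')[0].strip() if title_tag else "No title found"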
            df.at[index, 'thread_url'] = page_url
            df.at[index, 'thread_title'] = thread_title

    df.to_csv(os.path.join(BASE_PATH, 'e9_forum_thread_url.csv'), index=False)
    return df

def forum_thread_first_post(df):
    data = []

    for thread_id, thread_url, thread_title in zip(df['thread_id'], df['thread_url'], df['thread_title']):
        response = requests.get(thread_url)
        soup = BeautifulSoup(response.text, 'html.parser')
        first_post = soup.find('article', class_='message-body')
        post_content = first_post.get_text(strip=True) if first_post else "No content found"
        data.append({'thread_id': thread_id, 'thread_first_post': post_content})

    forum_first_post = pd.DataFrame(data)
    forum_first_post.to_csv(os.path.join(BASE_PATH, 'e9_forum_first_post.csv'), index=False)
    return forum_first_post

def forum_thread_all_post(df):
    post_data = []
    for index, row in df.iterrows():
        response = requests.get(row['thread_url'])
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article', class_='message--post')
        for article in articles:
            body = article.find('div', class_='bbWrapper')
            if body is None:  # skip posts that have no message body
                continue
            content = body.get_text(strip=True)
            post_data.append({'thread_id': row['thread_id'], 'post_raw': content})

    # Name the columns explicitly so an empty scrape still yields the expected schema
    e9_forum_posts = pd.DataFrame(post_data, columns=['thread_id', 'post_raw'])
    e9_forum_posts['thread_all_posts'] = e9_forum_posts['post_raw'].astype(str)
    e9_forum_thread_all_post = e9_forum_posts.groupby('thread_id')['thread_all_posts'].agg(lambda x: ' '.join(x)).reset_index()
    e9_forum_thread_all_post.to_csv(os.path.join(BASE_PATH, 'e9_forum_thread_all_post.csv'), index=False)
    return e9_forum_thread_all_post

def forum_corpus(e9_forum_thread_url, e9_forum_thread_first_post, e9_forum_thread_all_post):
    agg_df_1 = pd.merge(e9_forum_thread_url, e9_forum_thread_first_post, on='thread_id', how='left')
    agg_df_2 = pd.merge(agg_df_1, e9_forum_thread_all_post, on='thread_id', how='left')

    e9_forum_corpus = agg_df_2.dropna()
    # Normalize column names before merging with any saved corpus so that
    # lower-case and upper-case headers do not end up as separate columns
    e9_forum_corpus.columns = e9_forum_corpus.columns.str.upper()

    corpus_path = os.path.join(BASE_PATH, 'e9_forum_corpus.csv')
    if os.path.exists(corpus_path) and os.path.getsize(corpus_path) > 0:
        existing_corpus = pd.read_csv(corpus_path)
        existing_corpus.columns = existing_corpus.columns.str.upper()
        e9_forum_corpus = pd.concat([existing_corpus, e9_forum_corpus]).drop_duplicates().reset_index(drop=True)

    e9_forum_corpus.to_csv(os.path.join(BASE_PATH, 'e9_forum_corpus_dirty.csv'), index=False)
    return e9_forum_corpus

def main():
    e9_forum_thread_ids = forum_thread_ids()
    e9_forum_thread_url_df = forum_thread_url(e9_forum_thread_ids)
    e9_forum_thread_first_post_df = forum_thread_first_post(e9_forum_thread_url_df)
    e9_forum_thread_all_post_df = forum_thread_all_post(e9_forum_thread_url_df)
    e9_forum_corpus_df = forum_corpus(e9_forum_thread_url_df, e9_forum_thread_first_post_df, e9_forum_thread_all_post_df)
    print(f"Output saved to {os.path.join(BASE_PATH, 'e9_forum_corpus_dirty.csv')}")

if __name__ == "__main__":
    main()


RuntimeError: Remove this line if you want to create a new corpus

## 2.3 Ingest previously compiled corpus

In [None]:
# 2.3 Ingest previously compiled corpus

# Data here is from corpus workbook stored in Snowflake

def load_credentials(path_to_credentials):
    try:
        with open(path_to_credentials, 'r') as file:
            for line_num, line in enumerate(file, start=1):
                line = line.strip()
                if line and '=' in line:
                    key, value = line.split('=', 1)  # split on the first '=' only; values may contain '='
                    os.environ[key] = value
                else:
                    logging.warning(f"Issue with line {line_num} in {path_to_credentials}: '{line}'")
        logging.info("Credentials loaded successfully.")
    except Exception as e:
        logging.error(f"Error loading credentials: {str(e)}")

def fetch_data_from_snowflake():
    try:
        conn = snowflake.connector.connect(
            user=os.environ.get('USER'),
            password=os.environ.get('PASSWORD'),
            account=os.environ.get('ACCOUNT'),
        )

        cur = conn.cursor()

        query = """
        SELECT * FROM "E9_CORPUS"."E9_CORPUS_SCHEMA"."E9_FORUM_CORPUS"
        order by 1 asc;
        """
        cur.execute(query)
        e9_forum_corpus = cur.fetch_pandas_all()

        cur.close()
        conn.close()

        # Log the count of records retrieved
        logging.info(f"Number of records retrieved: {len(e9_forum_corpus)}")
        return e9_forum_corpus
    except Exception as e:
        logging.error(f"Error fetching data from Snowflake: {str(e)}")
        return pd.DataFrame()  # Return an empty DataFrame in case of error

# Main sequence
path_to_credentials = '/content/drive/MyDrive/Colab Notebooks/credentials/snowflake_credentials'

# Load credentials
load_credentials(path_to_credentials)

# Fetch data from Snowflake and print the count of records retrieved
e9_forum_corpus = fetch_data_from_snowflake()

if not e9_forum_corpus.empty:
    try:
        # Save the data to a CSV file
        output_path = '/content/drive/MyDrive/Colab Notebooks/NLP/LLM_RAG/e9_forum_corpus_dirty.csv'
        e9_forum_corpus.to_csv(output_path, index=False)
        logging.info(f"Data saved to {output_path}")
    except Exception as e:
        logging.error(f"Error saving data to CSV: {str(e)}")
else:
    logging.warning("No data retrieved to save.")



# 3 Preprocessing Text

The collected text is very unstructured and needs a reasonable amount of pre-processing to make it usable for NLP. This will address values that are either not localized, use slang, or do not have value from an NLP perspective.

- **Clean the Text:**
  - Remove HTML tags, extra whitespace, non-printable characters, and other irrelevant elements.

- **Standardize the Text:**
  - Convert all characters to lowercase to ensure uniformity.

- **Filter Out Common Stop Words:**
  - Remove stop words to focus on more meaningful content.

- **Remove Duplicate Entries:**
  - Ensure the uniqueness of the data by eliminating duplicates.

- **Lemmatization or Stemming:**
  - Convert words to their base or dictionary form to consolidate similar forms of a word.

- **Anonymize Personal Information:**
  - Identify and anonymize personal information or specific entity names to maintain privacy.

- **Remove Irrelevant Sections:**
  - Remove sections of the text that do not contribute to the knowledge base or are off-topic.

- **Tokenization:**
  - Break down the text into smaller units called tokens. Use the tokenizer that matches the chosen model (for example, the T5 tokenizer for a T5 model).
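A few of the steps above (cleaning, standardizing, anonymizing, de-duplicating) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: `clean_post`, `dedupe`, the `[EMAIL]` placeholder, and the sample posts are hypothetical and are not the notebook's pipeline. Stop-word removal and lemmatization are left out because they need NLTK resources.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                 # HTML tags
EMAIL_RE = re.compile(r"\S+@\S+")               # rough e-mail pattern
NON_PRINTABLE_RE = re.compile(r"[^\x20-\x7E]")  # anything outside printable ASCII

def clean_post(text):
    """Clean, standardize, and anonymize a single post."""
    text = html.unescape(text)              # decode entities such as &nbsp; and &amp;
    text = TAG_RE.sub(" ", text)            # drop HTML tags
    text = text.lower()                     # standardize case
    text = EMAIL_RE.sub("[EMAIL]", text)    # anonymize e-mail addresses
    text = NON_PRINTABLE_RE.sub(" ", text)  # drop non-printable characters
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def dedupe(posts):
    """Remove exact duplicates while preserving order."""
    seen, unique = set(), []
    for post in posts:
        if post not in seen:
            seen.add(post)
            unique.append(post)
    return unique

raw_posts = [
    "<p>My  carb is <b>leaking</b>&nbsp;fuel. Mail me: joe@example.com</p>",
    "<p>My carb is leaking fuel. Mail me: joe@example.com</p>",
]
cleaned = dedupe([clean_post(p) for p in raw_posts])
print(cleaned)
```

Cleaning before de-duplication matters here: the two sample posts differ only in markup and spacing, so they collapse to a single entry once cleaned.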


In [None]:
# 3 Preprocessing Text
# Clean and preprocess forum data

# Define the path to the local CSV file
csv_file_path = '/content/drive/MyDrive/Colab Notebooks/NLP/LLM_RAG/e9_forum_corpus_dirty.csv'

# Read the CSV file into a DataFrame
e9_forum_corpus_dirty = pd.read_csv(csv_file_path)


def remove_urls(df):
    """Removes URLs from the text."""
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    for column in ['THREAD_TITLE', 'THREAD_FIRST_POST', 'THREAD_ALL_POSTS']:
        df[column] = df[column].apply(lambda text: url_pattern.sub(r'', str(text)))
    return df

def alpha_numeric(df):
    """Removes non-alphanumeric characters and unwanted patterns from text."""
    pattern_email = re.compile(r'\S*@\S*\s?')
    pattern_non_alpha = re.compile(r'[^a-zA-Z0-9\s]')
    for column in ['THREAD_TITLE', 'THREAD_FIRST_POST', 'THREAD_ALL_POSTS']:
        df[column] = df[column].apply(lambda text: pattern_non_alpha.sub('', pattern_email.sub('', str(text))))
        df[column] = df[column].apply(lambda text: re.sub(r'\s+', ' ', text).strip())  # Remove extra spaces
    return df

def spell_check(df):
    """Corrects spelling errors in the text with caching."""
    spell = SpellChecker()
    cache = {}

    def correct_word(word):
        if word in cache:
            return cache[word]
        else:
            correction = spell.correction(word) or word
            cache[word] = correction
            return correction

    for column in ['THREAD_TITLE', 'THREAD_FIRST_POST', 'THREAD_ALL_POSTS']:
        df[column] = df[column].apply(lambda text: ' '.join([correct_word(word) for word in text.split()]))
    return df

def remove_stop_words(df):
    """Removes stop words from the text."""
    stop_words_set = set(stopwords.words('english')).union({'car', 'csi', 'cs', 'csl', 'e9', 'coupe', 'http', 'https', 'www', 'ebay', 'bmw', 'html'})
    for column in ['THREAD_TITLE', 'THREAD_FIRST_POST', 'THREAD_ALL_POSTS']:
        df[column] = df[column].apply(lambda text: ' '.join([word for word in text.split() if word.lower() not in stop_words_set and len(word) > 2]))
    return df

def tokenize_and_lemmatize(df):
    """Tokenizes and lemmatizes the text in specified columns using TensorFlow Text."""
    tokenizer = tf_text.WhitespaceTokenizer()
    lemmatizer = WordNetLemmatizer()

    def tokenize_text(text):
        tokens = tokenizer.tokenize(text).numpy().astype(str).tolist()
        lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
        return ' '.join(lemmatized_tokens)

    for column in ['THREAD_TITLE', 'THREAD_FIRST_POST', 'THREAD_ALL_POSTS']:
        df[column] = df[column].apply(tokenize_text)
    return df

def clean_nan_values(df):
    """Removes or replaces NaN values in the dataset and converts all entries to strings."""
    df.fillna('', inplace=True)
    for column in ['THREAD_TITLE', 'THREAD_FIRST_POST', 'THREAD_ALL_POSTS']:
        df[column] = df[column].astype(str)
    return df

def main():
    """Main function to run the data processing pipeline."""
    # Load credentials
    # path_to_credentials = '/content/drive/MyDrive/Colab Notebooks/credentials/snowflake_credentials'
    # load_credentials(path_to_credentials)

    # Fetch data from Snowflake
    # e9_forum_corpus = fetch_data_from_snowflake()

    # Process the data
    df = remove_urls(e9_forum_corpus_dirty)
    df = alpha_numeric(df)
    # df = spell_check(df)  # Need to find a faster spell checking process
    #df = remove_stop_words(df)
    df = tokenize_and_lemmatize(df)
    #df = clean_nan_values(df)  # Final NaN cleaning and type conversion step
    df.columns = df.columns.str.upper()  # Convert column names to uppercase

    # Save the cleaned data
    output_path = '/content/drive/MyDrive/Colab Notebooks/NLP/LLM_RAG/e9_forum_corpus_clean.csv'
    df.to_csv(output_path, index=False)
    print(f"Cleaned data saved to {output_path}")

if __name__ == "__main__":
    main()


Cleaned data saved to /content/drive/MyDrive/Colab Notebooks/NLP/LLM_RAG/e9_forum_corpus_clean.csv


# 4 Clustering and Summarization

Summarization in NLP involves condensing large texts into shorter versions, capturing the most critical information. This can be approached through multiple options. For this effort, the following solutions were scored to reduce the potential solution set:

| Provider  | Specific Package | Key Strength              | CPU Compatibility | Ease of Use | Performance & Accuracy | Integration & Flexibility | Scalability | Total |
|-----------|------------------|---------------------------|-------------------|-------------|------------------------|---------------------------|-------------|-------|
| Meta      | DistilBART       | Optimized for CPU usage   | 2                 | 2           | 2                      | 2                         | 2           | 10    |
| Google    | T5 (t5-small)    | Efficient CPU usage       | 2                 | 2           | 2                      | 2                         | 2           | 10    |
| Amazon    | AWS Comprehend   | Integrated AWS service    | 2                 | 2           | 2                      | 2                         | 2           | 10    |
| OpenAI    | GPT (GPT-2)      | Optimal for CPUs          | 2                 | 2           | 2                      | 1                         | 1           | 8     |
| Anthropic | Claude (Claude 3)| Enhanced safety features  | 2                 | 2           | 2                      | 1                         | 1           | 8     |



A test of the solutions can be found in the Appendix. T5 was chosen because it achieved better ROUGE scores than BART. This also allows me to stick with a Google-centric stack.
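For readers unfamiliar with ROUGE, the sketch below computes ROUGE-1, which measures unigram overlap between a reference summary and a candidate summary. It is a simplified illustration, not the evaluation code from the Appendix. The function name `rouge1` and the sample strings are assumptions, and tokenization is a plain whitespace split with no stemming.

```python
from collections import Counter

def rouge1(reference, candidate):
    """Return ROUGE-1 precision, recall, and F1 based on unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "replace the fuel pump relay"
candidate = "replace the relay"
scores = rouge1(reference, candidate)
print(scores)
```

Here every candidate word appears in the reference (precision 1.0), but the candidate covers only three of the five reference words (recall 0.6), so the F1 score lands between the two.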

In [None]:
# 4 Clustering and Summarization

BASE_PATH = '/content/drive/MyDrive/Colab Notebooks/NLP/LLM_RAG/'

# Load the model and tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Load e9_forum_corpus_clean DataFrame from the CSV
e9_forum_corpus_clean = pd.read_csv(os.path.join(BASE_PATH, 'e9_forum_corpus_clean.csv'))

# Load the existing summarized corpus if it exists, otherwise create it
summarized_corpus_path = os.path.join(BASE_PATH, 'e9_forum_corpus_summarized.csv')
e9_forum_corpus_summarized = pd.read_csv(summarized_corpus_path) if os.path.exists(summarized_corpus_path) and os.path.getsize(summarized_corpus_path) > 0 else pd.DataFrame(columns=e9_forum_corpus_clean.columns)

# Calculate the starting THREAD_ID of the summarized corpus
starting_thread_id = e9_forum_corpus_summarized['THREAD_ID'].max() if not e9_forum_corpus_summarized.empty else 0

# Identify new entries to be processed
new_entries = e9_forum_corpus_clean[~e9_forum_corpus_clean['THREAD_ID'].isin(e9_forum_corpus_summarized['THREAD_ID'])].copy()

# Calculate ending_thread_id and threads_processed
ending_thread_id = new_entries['THREAD_ID'].max() if not new_entries.empty else starting_thread_id
threads_processed = len(new_entries) if not new_entries.empty else 0

print(f"Starting with thread_id {starting_thread_id}")
print(f"Processing additional {threads_processed} thread(s)")
print(f"Ending with thread_id {ending_thread_id}")

def T5_summarize(text):
    """Summarization using T5."""
    try:
        if text.strip() == "":
            return text

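        # T5 checkpoints are trained with task prefixes; "summarize: " selects summarization
        unformatted_text = "summarize: " + text.replace("\n", " ")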

        inputs = tokenizer(
            unformatted_text,
            max_length=900,
            padding='max_length',
            truncation=True,
            return_tensors='pt',
            add_special_tokens=True,
            return_attention_mask=True
        )

        summary_ids = model.generate(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            max_length=50,
            min_length=10,
            length_penalty=2.0,
            num_beams=2,
            early_stopping=True
        )

        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        return summary if summary else text
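    except Exception as e:
        # Fall back to the original text so error messages are never stored as summaries
        print(f"Summarization failed: {e}")
        return text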

def main():
    # Check if 'THREAD_ALL_POSTS' column exists in new entries
    if 'THREAD_ALL_POSTS' in new_entries.columns:
        unique_texts = new_entries['THREAD_ALL_POSTS'].drop_duplicates()
        summaries = unique_texts.apply(T5_summarize)
        summary_map = dict(zip(unique_texts, summaries))
        new_entries.loc[:, 'SUMMARIZED_THREAD'] = new_entries['THREAD_ALL_POSTS'].map(summary_map)

        # Append the new summarized data to the existing summarized corpus
        updated_summarized_corpus = pd.concat([e9_forum_corpus_summarized, new_entries], ignore_index=True)

        # Save the results with the new summarized column
        updated_summarized_corpus.to_csv(summarized_corpus_path, index=False)

        print(f"Summarization completed and saved to {summarized_corpus_path}")
    else:
        print("Error: Column 'THREAD_ALL_POSTS' does not exist in the dataset.")

if __name__ == "__main__":
    main()






You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Starting with thread_id 2001
Processing additional 0 thread(s)
Ending with thread_id 2001
Summarization completed and saved to /content/drive/MyDrive/Colab Notebooks/NLP/LLM_RAG/e9_forum_corpus_summarized.csv


# 5 Format Text for Training

- Structure text into a question-answer format suitable for training a RAG model.
- Ensure the question string ends with a question mark for clarity.
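The two formatting rules above can be sketched as a pair of small helpers. The names `format_question` and `to_qa_record` are hypothetical, and the output keys mirror the `THREAD_ID`, `QUESTION`, and `ANSWER` columns used in this notebook. The cell that follows renames columns and does not apply the question-mark rule, so treat this as one possible way to enforce it.

```python
def format_question(text):
    """Collapse whitespace and make sure the question ends with a single '?'."""
    question = " ".join(str(text).split())
    question = question.rstrip("?.! ")  # drop trailing punctuation before adding '?'
    return question + "?" if question else ""

def to_qa_record(thread_id, first_post, summary):
    """Build one question-answer record from a thread's first post and its summary."""
    return {
        "THREAD_ID": thread_id,
        "QUESTION": format_question(first_post),
        "ANSWER": " ".join(str(summary).split()),
    }

record = to_qa_record(
    42,
    "  how do I adjust the valves ",
    "set the clearance with the\nengine cold",
)
print(record)
```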


In [None]:
# 5 Format Text for Training

BASE_PATH = '/content/drive/MyDrive/Colab Notebooks/NLP/LLM_RAG/'

# Load the summarized dataset produced in Section 4
df_summarized = pd.read_csv(os.path.join(BASE_PATH, 'e9_forum_corpus_summarized.csv'))

# Load the existing QA corpus if it exists
qa_corpus_path = os.path.join(BASE_PATH, 'e9_forum_corpus_qa.csv')
df_qa = pd.read_csv(qa_corpus_path) if os.path.exists(qa_corpus_path) and os.path.getsize(qa_corpus_path) > 0 else pd.DataFrame(columns=['THREAD_ID', 'QUESTION', 'ANSWER'])

# Calculate the starting THREAD_ID of the QA corpus
starting_thread_id = df_qa['THREAD_ID'].max() if not df_qa.empty else 0

# Identify new entries to be processed
new_entries = df_summarized[~df_summarized['THREAD_ID'].isin(df_qa['THREAD_ID'])]

# Calculate ending_thread_id and threads_processed
ending_thread_id = new_entries['THREAD_ID'].max() if not new_entries.empty else starting_thread_id
threads_processed = len(new_entries) if not new_entries.empty else 0

print(f"Starting with thread_id {starting_thread_id}")
print(f"Processing additional {threads_processed} thread(s)")
print(f"Ending with thread_id {ending_thread_id}")

def create_qa_schema(df):
    """Creates a QA schema by renaming the question and answer columns."""
    #df.rename(columns={'THREAD_ALL_POSTS': 'ANSWER', 'THREAD_FIRST_POST': 'QUESTION'}, inplace=True)
    #df.drop(['THREAD_TITLE', 'THREAD_ALL_POSTS'], axis=1, inplace=True)
    # Return a renamed copy instead of renaming in place; this avoids pandas'
    # SettingWithCopyWarning when df is a filtered slice of another DataFrame
    return df.rename(columns={'SUMMARIZED_THREAD': 'ANSWER', 'THREAD_FIRST_POST': 'QUESTION'})

def main():
    if not new_entries.empty:
        # Process the new entries to create QA schema
        df_qa_new = create_qa_schema(new_entries.dropna())

        # Append the new QA data to the existing QA corpus
        updated_qa_corpus = pd.concat([df_qa, df_qa_new], ignore_index=True)

        # Save the updated QA corpus
        updated_qa_corpus.to_csv(qa_corpus_path, index=False)

        print(f"Output saved to {qa_corpus_path}")
    else:
        print("No new entries to process.")

if __name__ == "__main__":
    main()

Starting with thread_id 2001
Processing additional 43 thread(s)
Ending with thread_id 1972
Output saved to /content/drive/MyDrive/Colab Notebooks/NLP/LLM_RAG/e9_forum_corpus_qa.csv


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.rename(columns={'SUMMARIZED_THREAD': 'ANSWER', 'THREAD_FIRST_POST': 'QUESTION'}, inplace=True)


# 6 Embedding and Indexing

In [None]:
# 6 Embedding and Indexing

BASE_PATH = '/content/drive/MyDrive/Colab Notebooks/NLP/LLM_RAG/'

# Load the DataFrame from your CSV file
df_tok = pd.read_csv(os.path.join(BASE_PATH, 'e9_forum_corpus_qa.csv'))

# Load the existing FAISS corpus if it exists, otherwise create an empty DataFrame
faiss_corpus_path = os.path.join(BASE_PATH, 'e9_forum_corpus_faiss.csv')
df_faiss = pd.read_csv(faiss_corpus_path) if os.path.exists(faiss_corpus_path) and os.path.getsize(faiss_corpus_path) > 0 else pd.DataFrame(columns=df_tok.columns)

df_faiss.info()

# Identify new entries to be processed
new_entries = df_tok[~df_tok['THREAD_ID'].isin(df_faiss['THREAD_ID'])].copy()

# Calculate the starting THREAD_ID of the FAISS corpus
starting_thread_id = df_faiss['THREAD_ID'].max() if not df_faiss.empty else 0

# Calculate ending_thread_id and threads_processed
ending_thread_id = new_entries['THREAD_ID'].max() if not new_entries.empty else starting_thread_id
threads_processed = len(new_entries) if not new_entries.empty else 0

print(f"Starting with thread_id {starting_thread_id}")
print(f"Processing additional {threads_processed} thread(s)")
print(f"Ending with thread_id {ending_thread_id}")

# Initialize the T5 tokenizer and encoder model
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5EncoderModel.from_pretrained("t5-small")

# Function to tokenize text using T5 tokenizer
def tokenize_text(text):
    return tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=512)["input_ids"]

# Function to embed text using T5 encoder model
def embed_text(tokens):
    inputs = torch.tensor(tokens).unsqueeze(0)  # Add batch dimension
    with torch.no_grad():
        outputs = model(inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()  # Average pooling
    return embeddings

def process_new_entries(entries):
    entries["Question_Tokens"] = entries["QUESTION"].apply(lambda x: tokenize_text(x).squeeze().tolist())
    entries["Answer_Tokens"] = entries["ANSWER"].apply(lambda x: tokenize_text(x).squeeze().tolist())
    entries["Question_Embeddings"] = entries["Question_Tokens"].apply(embed_text)
    entries["Answer_Embeddings"] = entries["Answer_Tokens"].apply(embed_text)
    return entries

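import ast  # standard library; parses list literals read back from CSV

def filter_embeddings(embeddings_list, expected_shape):
    """Filter out embeddings that do not match the expected shape."""
    # Rows reloaded from the FAISS corpus CSV hold embeddings as strings such as
    # "[0.1, 0.2, ...]"; parse them so length is measured in values, not characters
    parsed = [ast.literal_eval(e) if isinstance(e, str) else e for e in embeddings_list]
    return [embedding for embedding in parsed if len(embedding) == expected_shape]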

def build_faiss_index(embeddings, index_path):
    embeddings_np = np.array(embeddings).astype('float32')  # Convert to NumPy array of type float32
    d = embeddings_np.shape[1]  # Dimension of embeddings
    index = faiss.IndexFlatL2(d)  # Build the index
    index.add(embeddings_np)  # Add vectors to the index

    # Save the index
    faiss.write_index(index, index_path)
    return index

def main():
    if not new_entries.empty:
        processed_entries = process_new_entries(new_entries)

        # Append the new processed data to the existing FAISS corpus
        updated_faiss_corpus = pd.concat([df_faiss, processed_entries], ignore_index=True)

        # Save the updated FAISS corpus
        updated_faiss_corpus.to_csv(faiss_corpus_path, index=False)

        # Ensure all embeddings are of the same shape before saving
        question_embeddings_list = updated_faiss_corpus["Question_Embeddings"].to_list()
        answer_embeddings_list = updated_faiss_corpus["Answer_Embeddings"].to_list()

        expected_shape = 512  # Expected embedding size (512 for T5 model)

        question_embeddings_filtered = filter_embeddings(question_embeddings_list, expected_shape)
        answer_embeddings_filtered = filter_embeddings(answer_embeddings_list, expected_shape)

        question_embeddings = np.array(question_embeddings_filtered)
        answer_embeddings = np.array(answer_embeddings_filtered)

        np.save(os.path.join(BASE_PATH, 'question_embeddings_t5.npy'), question_embeddings)
        np.save(os.path.join(BASE_PATH, 'answer_embeddings_t5.npy'), answer_embeddings)

        # Build and save the FAISS index using the new answer embeddings
        faiss_index_path = os.path.join(BASE_PATH, 'faiss_index_t5.index')
        index = build_faiss_index(answer_embeddings, faiss_index_path)

        print(f"FAISS index has been rebuilt and saved to {faiss_index_path}")
        print(f"Embeddings have been generated and saved to {BASE_PATH}")
    else:
        print("No new entries to process.")

if __name__ == "__main__":
    main()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1958 entries, 0 to 1957
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   THREAD_ID            1958 non-null   int64 
 1   QUESTION             1958 non-null   object
 2   ANSWER               1958 non-null   object
 3   Question_Tokens      1958 non-null   object
 4   Answer_Tokens        1958 non-null   object
 5   Question_Embeddings  1958 non-null   object
 6   Answer_Embeddings    1958 non-null   object
dtypes: int64(1), object(6)
memory usage: 107.2+ KB
Starting with thread_id 2001
Processing additional 0 thread(s)
Ending with thread_id 2001


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


No new entries to process.


# 7 Query Processing and Search

In [None]:
# 7 Query Processing and Search


BASE_PATH = '/content/drive/MyDrive/Colab Notebooks/NLP/LLM_RAG/'

# Load the FAISS index
faiss_index_path = os.path.join(BASE_PATH, 'faiss_index_t5.index')
index = faiss.read_index(faiss_index_path)

# Load the pre-trained embeddings
question_embeddings = np.load(os.path.join(BASE_PATH, 'question_embeddings_t5.npy'))
answer_embeddings = np.load(os.path.join(BASE_PATH, 'answer_embeddings_t5.npy'))

# Load the corpus DataFrame
e9_forum_corpus = pd.read_csv(os.path.join(BASE_PATH, 'e9_forum_corpus_faiss.csv'))

# Initialize the T5 tokenizer and encoder model
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5EncoderModel.from_pretrained("t5-small")

def generate_query_embeddings(query, tokenizer, model):
    tokens = tokenizer(query, return_tensors="pt", truncation=True, padding="max_length", max_length=512)["input_ids"]
    with torch.no_grad():
        outputs = model(tokens)
        embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()  # Average pooling
    return np.array(embeddings).astype('float32')  # Convert to float32 for FAISS

def search_similar_questions(query_embeddings, index, top_k=5):
    D, I = index.search(query_embeddings, top_k)
    return I, D

def retrieve_and_rank(df, I, D):
    results = []
    for i, distances in zip(I, D):
        for idx, distance in zip(i, distances):
            if idx < 0:  # FAISS pads missing neighbours with -1; iloc[-1] would return the last row
                continue
            result = {
                'Thread ID': df.iloc[idx]['THREAD_ID'],
                'Question': df.iloc[idx]['QUESTION'],  # Raw question text
                'Answer': df.iloc[idx]['ANSWER'],  # Raw answer text
                'Distance': distance
            }
            results.append(result)
    return results

# Example usage for step 8:
query = "I want a tool box?"
query_embeddings = generate_query_embeddings(query, tokenizer, model).reshape(1, -1)

top_k = 5
I, D = search_similar_questions(query_embeddings, index, top_k)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
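
`generate_query_embeddings` pads every query to 512 tokens and then averages all 512 hidden states, so for a short query most of the averaged positions are padding. Whether that matters here depends on how the corpus embeddings were pooled, since query and corpus vectors need to be produced the same way. As a sketch of the alternative, masked mean pooling averages only the positions where the attention mask is 1. The example uses plain Python lists and made-up numbers instead of real T5 outputs, and `masked_mean_pool` is an illustrative name, not a function from this notebook:

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors where attention_mask == 1.

    hidden_states: list of token vectors (seq_len x dim)
    attention_mask: list of 0/1 flags (seq_len)
    """
    dim = len(hidden_states[0])
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        return [0.0] * dim
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]

# Two real tokens followed by two padding positions (toy 2-dimensional vectors).
hidden = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]

pooled = masked_mean_pool(hidden, mask)          # padding ignored
plain = masked_mean_pool(hidden, [1, 1, 1, 1])   # what an unmasked mean gives
print(pooled, plain)
```

With real model outputs the same idea would be applied to `last_hidden_state` using the tokenizer's `attention_mask`.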




# 8 Retrieve and Rank


In [None]:
# 8 Retrieve and Rank

import logging

def retrieve_and_rank(df, I, D):
    results = []
    for i, distances in zip(I, D):
        for idx, distance in zip(i, distances):
            if idx < 0:  # FAISS pads missing neighbours with -1; iloc[-1] would return the last row
                continue
            result = {
                'Thread ID': df.iloc[idx]['THREAD_ID'],
                'Question': df.iloc[idx]['QUESTION'],  # Raw question text
                'Answer': df.iloc[idx]['ANSWER'],  # Raw answer text
                'Distance': distance
            }
            results.append(result)

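    # Log the retrieved context. `query` is the global variable set in step 7;
    # INFO messages are shown only if logging is configured at INFO level.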
    contexts = [result['Answer'] for result in results]
    retrieved_context = " ".join(contexts)
    logging.info(f"Query: {query}\nRetrieved Context: {retrieved_context}")

    return results

# Retrieve and rank results
ranked_results = retrieve_and_rank(e9_forum_corpus, I, D)
for result in ranked_results:
    print(result)


{'Thread ID': 1, 'Question': 'New owner Goin drive home cant wait post experience read laugh cry critiqueI look forward coming lurker statusRegards', 'Answer': 'new owner Goin drive home cant wait post experience read laugh cry critiqueI look forward coming lurker statusRegards Congrats New OwnerCongrats pecsokfrom one new owner another another javascriptemoticon', 'Distance': 0.8356856}
{'Thread ID': 2001, 'Question': 'Looking at the Photo Gallery for all of the modelsand it seems that very few have tinted windowsIs there a reason that more window are not tintedNot a total black out but something noticeableWhats your opion on tinting your e9 windowsThanks', 'Answer': 'tinted window the tint the tint the glass the tint is 20 30 degree cooler and 10 12 dbquieterAnyone used this product the tint the tint of the models of the', 'Distance': 3.4028235e+38}
{'Thread ID': 2001, 'Question': 'Looking at the Photo Gallery for all of the modelsand it seems that very few have tinted windowsIs ther
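
The `3.4028235e+38` distances in this output are the largest float32 value. FAISS fills a result slot with that distance and an index of `-1` when the index holds fewer than `top_k` vectors, and `df.iloc[-1]` then silently returns the last row of the corpus, which is most likely why thread 2001 repeats. A minimal sketch of dropping those padded hits before the DataFrame lookup; the helper name `drop_padding` and the sample values are illustrative, not part of the notebook:

```python
FLOAT32_MAX = 3.4028234663852886e38  # distance reported for a missing neighbour

def drop_padding(indices, distances):
    """Keep only (index, distance) pairs that refer to a real vector."""
    return [(int(i), float(d)) for i, d in zip(indices, distances) if i >= 0]

# One query with top_k=5 where only a single vector could be returned.
I_row = [0, -1, -1, -1, -1]
D_row = [0.8356856, FLOAT32_MAX, FLOAT32_MAX, FLOAT32_MAX, FLOAT32_MAX]

hits = drop_padding(I_row, D_row)
print(hits)
```

If only one hit survives for every query, that suggests the saved index contains far fewer vectors than the corpus has rows, which would be worth checking via `index.ntotal`.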


# 9 Answer Generation



In [None]:
# Required imports
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
import numpy as np
import pandas as pd
import faiss
import os

# Define functions
def generate_query_embeddings(query, tokenizer, model):
    tokens = tokenizer(query, return_tensors="pt", truncation=True, padding="max_length", max_length=512)["input_ids"]
    with torch.no_grad():
        outputs = model.encoder(tokens)
        embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()  # Average pooling
    return np.array(embeddings).astype('float32')  # Convert to float32 for FAISS

def search_similar_questions(query_embeddings, index, top_k=5):
    query_embeddings = query_embeddings.reshape(1, -1)  # Ensure correct shape
    D, I = index.search(query_embeddings, top_k)
    return I, D

def retrieve_and_rank(df, I, D):
    results = []
    for i, distances in zip(I, D):
        for idx, distance in zip(i, distances):
            if idx < 0:  # FAISS pads missing neighbours with -1; iloc[-1] would return the last row
                continue
            result = {
                'Thread ID': df.iloc[idx]['THREAD_ID'],
                'Question': df.iloc[idx]['QUESTION'],  # Raw question text
                'Answer': df.iloc[idx]['ANSWER'],  # Raw answer text
                'Distance': distance
            }
            results.append(result)
    return results

def generate_answer(query, ranked_results, tokenizer, model):
    # Print the original user question
    print(f"Original Question: {query}")

    # Concatenate the retrieved contexts
    concatenated_context = " ".join([result['Answer'] for result in ranked_results])

    # Construct the prompt with the query and context
    input_text = f"answer: {query} context: {concatenated_context}"

    # Print the constructed prompt to see what is being sent to the LLM
    print(f"Engineered Prompt: {input_text}")

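    # Generate the answer using the T5 model (truncate the prompt to the 512-token input length used elsewhere)
    input_ids = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512).input_ids
    outputs = model.generate(input_ids)  # default generation settings are used, including the default length limit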
    generated_answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return generated_answer

# Load the FAISS index and necessary data
BASE_PATH = '/content/drive/MyDrive/Colab Notebooks/NLP/LLM_RAG/'
faiss_index_path = os.path.join(BASE_PATH, 'faiss_index_t5.index')
index = faiss.read_index(faiss_index_path)
e9_forum_corpus = pd.read_csv(os.path.join(BASE_PATH, 'e9_forum_corpus_faiss.csv'))

# Initialize the T5 tokenizer and the full encoder-decoder model (its encoder is reused for query embeddings)
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Function to process a query and retrieve answers
def process_query_and_generate_answer(query, index, df, tokenizer, model):
    query_embeddings = generate_query_embeddings(query, tokenizer, model)
    I, D = search_similar_questions(query_embeddings, index, top_k=5)
    ranked_results = retrieve_and_rank(df, I, D)
    generated_answer = generate_answer(query, ranked_results, tokenizer, model)
    return generated_answer

# Ensure this is part of your main execution code
query = "How do I fix the transmission issue in my car?"  # Replace with the actual question input mechanism
generated_answer = process_query_and_generate_answer(query, index, e9_forum_corpus, tokenizer, model)

# Print the generated answer
print(f"Generated Answer: {generated_answer}")


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Original Question: How do I fix the transmission issue in my car?
Engineered Prompt: answer: How do I fix the transmission issue in my car? context: new owner Goin drive home cant wait post experience read laugh cry critiqueI look forward coming lurker statusRegards Congrats New OwnerCongrats pecsokfrom one new owner another another javascriptemoticon tinted window the tint the tint the glass the tint is 20 30 degree cooler and 10 12 dbquieterAnyone used this product the tint the tint of the models of the tinted window the tint the tint the glass the tint is 20 30 degree cooler and 10 12 dbquieterAnyone used this product the tint the tint of the models of the tinted window the tint the tint the glass the tint is 20 30 degree cooler and 10 12 dbquieterAnyone used this product the tint the tint of the models of the tinted window the tint the tint the glass the tint is 20 30 degree cooler and 10 12 dbquieterAnyone used this product the tint the tint of the models of the




Generated Answer: the tint the tint of the models of the tinted window the tint the glass the tint
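
The engineered prompt above repeats the same retrieved answer four times, which spends the model's limited input length on duplicates. One possible mitigation is to drop repeated contexts before joining them. A minimal sketch, assuming results shaped like the dictionaries returned by `retrieve_and_rank`; the helper `unique_contexts` and the sample data are hypothetical:

```python
def unique_contexts(ranked_results):
    """Return answer texts in ranked order, skipping exact repeats."""
    seen = set()
    contexts = []
    for result in ranked_results:
        answer = result["Answer"]
        if answer not in seen:
            seen.add(answer)
            contexts.append(answer)
    return contexts

# Made-up results: the second and third entries carry the same answer text.
sample_results = [
    {"Thread ID": 1, "Answer": "congrats new owner", "Distance": 0.84},
    {"Thread ID": 2001, "Answer": "tinted window the tint", "Distance": 3.4e38},
    {"Thread ID": 2001, "Answer": "tinted window the tint", "Distance": 3.4e38},
]

query = "How do I fix the transmission issue in my car?"
context = " ".join(unique_contexts(sample_results))
prompt = f"answer: {query} context: {context}"
print(prompt)
```

Deduplication only removes exact repeats; it does not make an unrelated context relevant, so it complements rather than replaces a check on retrieval quality.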


In [None]:
# Code to compare the original text and how it changes throughout the workstream

# Define the base path
BASE_PATH = '/content/drive/MyDrive/Colab Notebooks/NLP/LLM_RAG/'

# Load the DataFrames from your CSV files
e9_forum_corpus_dirty = pd.read_csv(os.path.join(BASE_PATH, 'e9_forum_corpus_dirty.csv'))
e9_forum_corpus_clean = pd.read_csv(os.path.join(BASE_PATH, 'e9_forum_corpus_clean.csv'))
e9_forum_corpus_summarized = pd.read_csv(os.path.join(BASE_PATH, 'e9_forum_corpus_summarized.csv'))

def compare_thread_id(thread_id):
    # Check if the thread_id exists in all DataFrames
    if thread_id in e9_forum_corpus_dirty['THREAD_ID'].values and \
       thread_id in e9_forum_corpus_clean['THREAD_ID'].values and \
       thread_id in e9_forum_corpus_summarized['THREAD_ID'].values:

        # Get the rows corresponding to the thread_id from each DataFrame
        dirty_row = e9_forum_corpus_dirty[e9_forum_corpus_dirty['THREAD_ID'] == thread_id]
        clean_row = e9_forum_corpus_clean[e9_forum_corpus_clean['THREAD_ID'] == thread_id]
        summarized_row = e9_forum_corpus_summarized[e9_forum_corpus_summarized['THREAD_ID'] == thread_id]

        # Concatenate the rows into a single DataFrame for comparison
        comparison_df = pd.concat([dirty_row, clean_row, summarized_row], keys=['Dirty', 'Clean', 'Summarized'])

        return comparison_df
    else:
        print(f"THREAD_ID {thread_id} not found in all DataFrames.")
        return None

# Example usage
thread_id_to_compare = 4  # Replace with the actual THREAD_ID you want to compare
comparison_result = compare_thread_id(thread_id_to_compare)

if comparison_result is not None:
    # Set display options to show full content of each cell
    pd.set_option('display.max_colwidth', None)
    print(comparison_result)

              THREAD_ID    THREAD_TITLE  \
Dirty      3          4  5 Speed Tranny   
Clean      3          4  5 Speed Tranny   
Summarized 3          4    Speed Tranny   

                                                                                                                                                                                                                                                                                                                                    THREAD_FIRST_POST  \
Dirty      3  I have just purchased a CS and discovered that the tranny was ran dry before the purchase.  What 5-speed tranny would work, the current is a Getrag 225 and the speedo cable is not attached, in fact there is a soft plug in place of the gear and that is the source of the fluid leak.  What would the Getrag numbers be?Thanks,Doug   
Clean      3             I have just purchased a CS and discovered that the tranny wa ran dry before the purchase What 5speed tranny woul

# 10 Evaluation and Tuning

In [None]:
# 10 Evaluation and Tuning

BASE_PATH = '/content/drive/MyDrive/Colab Notebooks/NLP/LLM_RAG/'

CREDENTIALS_PATH = '/content/drive/MyDrive/Colab Notebooks/credentials/snowflake_credentials'

faiss_index_path = os.path.join(BASE_PATH, 'faiss_index_t5.index')
representative_sentences_path = os.path.join(BASE_PATH, 'representative_sentences.csv')
similarity_scores_output_path = os.path.join(BASE_PATH, 'similarity_scores_with_answers.csv')
similarity_threshold = 0.01  # Set your threshold value here

def load_credentials(path_to_credentials):
    with open(path_to_credentials, 'r') as file:
        for line_num, line in enumerate(file, start=1):
            line = line.strip()
            if line and '=' in line:
                key, value = line.split('=', 1)  # split on the first '=' only, so values may contain '='
                os.environ[key] = value
            else:
                print(f"Issue with line {line_num} in {path_to_credentials}: '{line}'")

def connect_to_snowflake():
    """Establish a connection to the Snowflake database."""
    return snowflake.connector.connect(
        user=os.environ.get('USER'),
        password=os.environ.get('PASSWORD'),
        account=os.environ.get('ACCOUNT')
    )

# Load the rebuilt FAISS index
index = faiss.read_index(faiss_index_path)

# Initialize the T5 tokenizer and encoder model
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5EncoderModel.from_pretrained("t5-small")

# Function to tokenize text using T5 tokenizer
def tokenize_text(text):
    return tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=512)["input_ids"]

# Function to generate embeddings for a new query using the T5 model
def generate_query_embedding(query):
    query_tokens = tokenize_text(query)
    with torch.no_grad():
        outputs = model(query_tokens)
        query_embedding = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()  # Average pooling
    return torch.tensor(query_embedding).unsqueeze(0)  # Add batch dimension

# Function to search the FAISS index for the nearest stored embeddings
def search_faiss_index(query_embedding, index, k=1):
    query_embedding_np = query_embedding.numpy().astype('float32')  # Convert to NumPy array of type float32
    D, I = index.search(query_embedding_np, k)  # D: squared L2 distances (lower = closer), I: row positions
    valid_indices = [idx for idx in I[0] if idx >= 0]  # FAISS pads missing results with -1
    similarity_scores = D[0][:len(valid_indices)]  # Distances for the valid hits (padding comes last)
    return valid_indices, similarity_scores  # Return only valid indices and their distances

def fetch_answers_from_snowflake(indices):
    """Fetch THREAD_ID and ANSWER rows from Snowflake for the given IDs."""
    if not indices:
        return pd.DataFrame()  # Return an empty DataFrame if no valid indices

    # NOTE: callers pass FAISS row positions here; these equal THREAD_ID only if
    # every row's THREAD_ID is the same as its position in the index.
    id_list = ','.join(str(int(i)) for i in indices)  # integers only, so safe to inline
    query = f"SELECT THREAD_ID, ANSWER FROM e9_corpus.e9_corpus_schema.e9_forum_corpus_faiss WHERE THREAD_ID IN ({id_list})"

    load_credentials(CREDENTIALS_PATH)
    conn = connect_to_snowflake()
    try:
        cur = conn.cursor()
        cur.execute(query)
        answers = cur.fetch_pandas_all()
        cur.close()
    finally:
        conn.close()  # Always release the connection, even if the query fails
    return answers

def process_representative_sentences():
    # Load representative sentences
    representative_sentences_df = pd.read_csv(representative_sentences_path)

    # Generate embeddings and calculate similarity scores for each representative sentence
    results = []

    for idx, row in representative_sentences_df.iterrows():
        topic = row['Topic']
        sentence = row['Representative Sentence']

        # Generate query embedding
        query_embedding = generate_query_embedding(sentence)

        # Ensure the dimension matches
        if query_embedding.shape[1] != index.d:
            raise ValueError(f"Embedding dimension mismatch: {query_embedding.shape[1]} vs {index.d}")

        # Search FAISS index
        similar_indices, similarity_scores = search_faiss_index(query_embedding, index, k=1)

        # Fetch answers from Snowflake
        answer = None
        score = None
        if similar_indices:
            answers = fetch_answers_from_snowflake(similar_indices)

            if not answers.empty:
                for idx, score in zip(similar_indices, similarity_scores):
                    matching_answers = answers.loc[answers['THREAD_ID'] == idx, 'ANSWER'].values
                    if len(matching_answers) > 0:
                        answer = matching_answers[0]
                        break
        results.append({
            'Representative Sentence': sentence,
            'Answer': answer,
            'Similarity Score': score
        })

    # Save results to a CSV file
    results_df = pd.DataFrame(results)
    results_df.to_csv(similarity_scores_output_path, index=False)
    print("Results saved.")

    # Output results
    for result in results:
        print(f"Representative Sentence: {result['Representative Sentence']}")
        print(f"Answer: {result['Answer']}")
        print(f"Similarity Score: {result['Similarity Score']}\n")

def main():
    process_representative_sentences()

if __name__ == "__main__":
    main()
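
The values that `index.search` returns for an `IndexFlatL2` are squared Euclidean distances, so lower means closer, and `similarity_threshold` is defined in this cell but not applied anywhere in it. If a cutoff is wanted, it can be applied as a maximum distance. A sketch with made-up two-dimensional vectors; `squared_l2`, `filter_by_distance`, and `max_distance` are illustrative names, and a suitable cutoff would have to be chosen empirically:

```python
def squared_l2(a, b):
    """Squared Euclidean distance, the quantity IndexFlatL2 reports."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def filter_by_distance(indices, distances, max_distance):
    """Keep real hits (index >= 0) whose distance is at or below the cutoff."""
    return [(i, d) for i, d in zip(indices, distances) if i >= 0 and d <= max_distance]

query_vec = [1.0, 0.0]
corpus = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

distances = [squared_l2(query_vec, vec) for vec in corpus]
hits = filter_by_distance(range(len(corpus)), distances, max_distance=0.05)
print(hits)
```

Here the identical vector and the near neighbour pass the cutoff, while the orthogonal vector (distance 2.0) is dropped.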


# 11 Deployment




1. **Hugging Face Spaces**
   - **Pros:** Provides a simple and direct way to deploy and share machine learning models, including RAG models. It supports interactive web-based applications and API endpoints, making it ideal for showcasing projects.

   - **Cons:** While convenient for prototypes and demonstrations, it might not offer the scalability and control needed for high-demand production environments.

2. **AWS SageMaker**
   - **Pros:** Offers a fully managed service that enables data scientists and developers to build, train, and deploy machine learning models at scale. SageMaker supports direct deployment of PyTorch models, including those built with the Hugging Face Transformers library, with robust monitoring and security features.  

   - **Cons:** Can be more expensive and requires familiarity with AWS services. The setup and management might be complex for smaller projects or those new to cloud services.

3. **Docker + Kubernetes**
   - **Pros:** This combination offers flexibility and scalability for deploying machine learning models. Docker containers make it easy to package your RAG model with all its dependencies, while Kubernetes provides orchestration to manage and scale your deployment across multiple instances or cloud providers.  
   
   - **Cons:** Requires significant DevOps knowledge to set up, manage, and scale. It might be overkill for simple or one-off deployments.




# 12 Appendix

## Summarization Comparison

In [None]:
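# Summarization Comparison: T5

import itertools
from datasets import load_metric  # deprecated in newer `datasets` releases in favor of the `evaluate` library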

# Implement an objective score: ROUGE

tokenizer_t5 = T5Tokenizer.from_pretrained("t5-small")
model_t5 = T5ForConditionalGeneration.from_pretrained("t5-small")

# Load the ROUGE metric
rouge = load_metric("rouge")

def t5_summarize(text, max_length, min_length, num_beams):
    # Prepend the text with the task-specific prefix for summarization
    input_text = "summarize: " + text

    # Tokenize the input text
    inputs = tokenizer_t5(input_text, return_tensors="pt", max_length=512, truncation=True)

    # Generate a summary
    summary_ids = model_t5.generate(
        inputs["input_ids"],
        max_length=max_length,
        min_length=min_length,
        num_beams=num_beams,
        length_penalty=2.0,
        early_stopping=True
    )

    # Decode the summary
    summary = tokenizer_t5.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Define the parameter grid
param_grid = {
    "max_length": [50, 100, 150],
    "min_length": [10, 30, 50],
    "num_beams": [2, 4, 6]
}

# Example input text
text = """
And on the 8th day, God looked down on his planned paradise and said,
"I need a caretaker". God said, "I need somebody willing to get up before
dawn, milk cows, work all day in the fields, milk cows again, eat supper, then
go to town and stay past midnight at a meeting of the school board." I need
somebody with arms strong enough to rustle a calf and yet gentle enough to
deliver his own grandchild; somebody to call hogs, tame cantankerous machinery,
come home hungry, have to wait lunch until his wife’s done feeding visiting
ladies, then tell the ladies to be sure and come back real soon - and mean it.
God said, "I need somebody willing to sit up all night with a newborn colt,
and watch it die, then dry his eyes and say, 'Maybe next year.' I need somebody
who can shape an ax handle from a persimmon sprout, shoe a horse with a hunk
of car tire, who can make harness out of haywire, feed sacks and shoe scraps;
who, planting time and harvest season, will finish his forty-hour week by
Tuesday noon, and then pain’n from tractor back, put in another seventy-two
hours." God had to have somebody willing to ride the ruts at double speed to get
the hay in ahead of the rain clouds, and yet stop in mid-field and race to help
when he sees the first smoke from a neighbor’s place. God said, "I need somebody
strong enough to clear trees and heave bails, yet gentle enough to tame lambs
and wean pigs and tend the pink-combed pullets, who will stop his mower for an
hour to splint the broken leg of a meadow lark." It had to be somebody who’d
plow deep and straight and not cut corners; somebody to seed, weed, feed, breed
and rake and disc and plow and plant and tie the fleece and strain the milk and
replenish the self-feeder and finish a hard week’s work with a five-mile drive
to church; somebody who would bale a family together with the soft strong bonds
of sharing, who would laugh, and then sigh, and then reply, with smiling eyes,
when his son says that he wants to spend his life "doing what dad does."
-- so God made a Farmer.
"""

# Define the reference summary (ground truth)
reference_summary = """
God needed someone to take care of the planet, so God made a Farmer.
"""

# Initialize best scores and best params
best_score = None
best_params = None
best_recall = None
best_precision = None

# Perform grid search
for max_length, min_length, num_beams in itertools.product(param_grid["max_length"], param_grid["min_length"], param_grid["num_beams"]):
    generated_summary = t5_summarize(text, max_length, min_length, num_beams)
    results = rouge.compute(predictions=[generated_summary], references=[reference_summary])

    # Extract ROUGE-LSum scores
    rougeLsum_precision = results['rougeLsum'].mid.precision
    rougeLsum_recall = results['rougeLsum'].mid.recall
    rougeLsum_fmeasure = results['rougeLsum'].mid.fmeasure

    if best_score is None or rougeLsum_fmeasure > best_score:
        best_score = rougeLsum_fmeasure
        best_params = (max_length, min_length, num_beams)
        best_recall = rougeLsum_recall
        best_precision = rougeLsum_precision

print(f"Best ROUGE-Lsum F-measure: {best_score:.4f}")
print(f"Best ROUGE-Lsum Recall: {best_recall:.4f}")
print(f"Best ROUGE-Lsum Precision: {best_precision:.4f}")
print(f"Best parameters: max_length={best_params[0]}, min_length={best_params[1]}, num_beams={best_params[2]}")

In [None]:
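# Summarization Comparison: DistilBART

import itertools
from datasets import load_metric  # deprecated in newer `datasets` releases in favor of the `evaluate` library
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM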

# Implement an objective score: ROUGE

tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6")

# Load the ROUGE metric
rouge = load_metric("rouge")

def distilbart_summarize(text, max_length, min_length, num_beams):
    # Tokenize the input text
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=512,  # Adjusted to fit within the model's constraints
        truncation=True
    )

    # Generate a summary
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=max_length,
        min_length=min_length,
        num_beams=num_beams,
        length_penalty=2.0,
        early_stopping=True
    )

    # Decode the summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Define the parameter grid
param_grid = {
    "max_length": [50, 100, 150],
    "min_length": [10, 30, 50],
    "num_beams": [2, 4, 6]
}

# Example input text and reference summary
text = """
And on the 8th day, God looked down on his planned paradise and said,
"I need a caretaker". God said, "I need somebody willing to get up before
dawn, milk cows, work all day in the fields, milk cows again, eat supper, then
go to town and stay past midnight at a meeting of the school board." I need
somebody with arms strong enough to rustle a calf and yet gentle enough to
deliver his own grandchild; somebody to call hogs, tame cantankerous machinery,
come home hungry, have to wait lunch until his wife’s done feeding visiting
ladies, then tell the ladies to be sure and come back real soon - and mean it.
God said, "I need somebody willing to sit up all night with a newborn colt,
and watch it die, then dry his eyes and say, 'Maybe next year.' I need somebody
who can shape an ax handle from a persimmon sprout, shoe a horse with a hunk
of car tire, who can make harness out of haywire, feed sacks and shoe scraps;
who, planting time and harvest season, will finish his forty-hour week by
Tuesday noon, and then pain’n from tractor back, put in another seventy-two
hours." God had to have somebody willing to ride the ruts at double speed to get
the hay in ahead of the rain clouds, and yet stop in mid-field and race to help
when he sees the first smoke from a neighbor’s place. God said, "I need somebody
strong enough to clear trees and heave bails, yet gentle enough to tame lambs
and wean pigs and tend the pink-combed pullets, who will stop his mower for an
hour to splint the broken leg of a meadow lark." It had to be somebody who’d
plow deep and straight and not cut corners; somebody to seed, weed, feed, breed
and rake and disc and plow and plant and tie the fleece and strain the milk and
replenish the self-feeder and finish a hard week’s work with a five-mile drive
to church; somebody who would bale a family together with the soft strong bonds
of sharing, who would laugh, and then sigh, and then reply, with smiling eyes,
when his son says that he wants to spend his life "doing what dad does."
-- so God made a Farmer.
"""

reference_summary = """
God needed someone to take care of the planet, so God made a Farmer.
"""

# Initialize best scores and best params
best_score = None
best_params = None
best_recall = None
best_precision = None

# Perform grid search
for max_length, min_length, num_beams in itertools.product(param_grid["max_length"], param_grid["min_length"], param_grid["num_beams"]):
    generated_summary = distilbart_summarize(text, max_length, min_length, num_beams)
    results = rouge.compute(predictions=[generated_summary], references=[reference_summary])

    # Extract ROUGE-LSum scores
    rougeLsum_precision = results['rougeLsum'].mid.precision
    rougeLsum_recall = results['rougeLsum'].mid.recall
    rougeLsum_fmeasure = results['rougeLsum'].mid.fmeasure

    if best_score is None or rougeLsum_fmeasure > best_score:
        best_score = rougeLsum_fmeasure
        best_params = (max_length, min_length, num_beams)
        best_recall = rougeLsum_recall
        best_precision = rougeLsum_precision

print(f"Best ROUGE-Lsum F-measure: {best_score:.4f}")
print(f"Best ROUGE-Lsum Recall: {best_recall:.4f}")
print(f"Best ROUGE-Lsum Precision: {best_precision:.4f}")
print(f"Best parameters: max_length={best_params[0]}, min_length={best_params[1]}, num_beams={best_params[2]}")


Compare ROUGE Summarization Scores:

T5



*   Best ROUGE-Lsum F-measure: 0.1778
*   Best ROUGE-Lsum Recall: 0.2857
*   Best ROUGE-Lsum Precision: 0.1290
*   Best parameters: max_length=50, min_length=10, num_beams=2



DistilBART


*   Best ROUGE-Lsum F-measure: 0.1270
*   Best ROUGE-Lsum Recall: 0.2857
*   Best ROUGE-Lsum Precision: 0.0816
*   Best parameters: max_length=100, min_length=10, num_beams=4



Based on these scores, I'll be using T5 for training: it has higher precision (0.1290 vs. 0.0816) and F-measure (0.1778 vs. 0.1270), with identical recall (0.2857).
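
The reported F-measures follow from precision and recall as their harmonic mean, F = 2PR / (P + R). The figures are also consistent with both models matching 4 of the 14 tokens in the reference summary (recall 4/14 ≈ 0.2857) with generated summaries of about 31 and 49 tokens (precision 4/31 ≈ 0.1290 and 4/49 ≈ 0.0816), which would make the gap mainly a matter of summary length. Those token counts are inferred from the scores, not read from the model outputs. A short check:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

recall = 4 / 14          # inferred: 4 matched tokens out of 14 reference tokens
t5_precision = 4 / 31    # inferred: T5 summary of about 31 tokens
bart_precision = 4 / 49  # inferred: DistilBART summary of about 49 tokens

t5_f = f_measure(t5_precision, recall)
bart_f = f_measure(bart_precision, recall)
print(f"T5: {t5_f:.4f}  DistilBART: {bart_f:.4f}")
```

Because the comparison rests on a single text and a single reference summary, the ranking should be read as indicative rather than conclusive.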

# Parking lot

# Score query result quality

In [None]:
# Parking Lot
# Query Processing and Search of LDA derived topics
# This step requires LDA to have run first to generate topic sentences


BASE_PATH = '/content/drive/MyDrive/Colab Notebooks/Data_sets/e9/'
CREDENTIALS_PATH = '/content/drive/MyDrive/Colab Notebooks/credentials/snowflake_credentials'

faiss_index_path = os.path.join(BASE_PATH, 'faiss_index_t5.index')
representative_sentences_path = os.path.join(BASE_PATH, 'representative_sentences.csv')
similarity_scores_output_path = os.path.join(BASE_PATH, 'similarity_scores_with_answers.csv')
similarity_threshold = 0.01  # Set your threshold value here

def load_credentials(path_to_credentials):
    with open(path_to_credentials, 'r') as file:
        for line_num, line in enumerate(file, start=1):
            line = line.strip()
            if line and '=' in line:
                key, value = line.split('=', 1)  # split on the first '=' only, so values may contain '='
                os.environ[key] = value
            else:
                print(f"Issue with line {line_num} in {path_to_credentials}: '{line}'")
                # Optionally raise an error or handle the issue as needed

def connect_to_snowflake():
    """Establish a connection to the Snowflake database."""
    return snowflake.connector.connect(
        user=os.environ.get('USER'),
        password=os.environ.get('PASSWORD'),
        account=os.environ.get('ACCOUNT')
    )

# Load the rebuilt FAISS index
index = faiss.read_index(faiss_index_path)

# Initialize the T5 tokenizer and encoder model
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5EncoderModel.from_pretrained("t5-small")

# Function to tokenize text using T5 tokenizer
def tokenize_text(text):
    return tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=512)["input_ids"]

# Function to generate embeddings for a new query using the T5 model
def generate_query_embedding(query):
    query_tokens = tokenize_text(query)
    with torch.no_grad():
        outputs = model(query_tokens)
        query_embedding = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()  # Average pooling
    return torch.tensor(query_embedding).unsqueeze(0)  # Add batch dimension

# Function to search FAISS index for the most similar question
def search_faiss_index(query_embedding, index, k=1):
    query_embedding_np = query_embedding.numpy().astype('float32')  # Convert to NumPy array of type float32
    D, I = index.search(query_embedding_np, k)  # Search
    valid_indices = [idx for idx in I[0] if idx >= 0]
    similarity_scores = D[0][:len(valid_indices)]  # Get similarity scores for valid indices
    return valid_indices, similarity_scores  # Return only valid indices and their scores

def fetch_answers_from_snowflake(indices):
    """Fetch answers and embeddings from Snowflake for given indices."""
    if not indices:
        return pd.DataFrame()  # Return an empty DataFrame if no valid indices

    load_credentials(CREDENTIALS_PATH)
    conn = connect_to_snowflake()
    cur = conn.cursor()
    query = f"SELECT THREAD_ID, ANSWER FROM e9_corpus.e9_corpus_schema.e9_forum_corpus_faiss WHERE THREAD_ID IN ({','.join(map(str, indices))})"
    if indices:
        cur.execute(query)
        answers = cur.fetch_pandas_all()
    else:
        answers = pd.DataFrame()
    cur.close()
    conn.close()
    return answers

def process_representative_sentences():
    # Load representative sentences
    representative_sentences_df = pd.read_csv(representative_sentences_path)

    # Generate embeddings and calculate similarity scores for each representative sentence
    results = []

    for idx, row in representative_sentences_df.iterrows():
        topic = row['Topic']
        sentence = row['Representative Sentence']

        # Generate query embedding
        query_embedding = generate_query_embedding(sentence)

        # Ensure the dimension matches
        if query_embedding.shape[1] != index.d:
            raise ValueError(f"Embedding dimension mismatch: {query_embedding.shape[1]} vs {index.d}")

        # Search FAISS index
        similar_indices, similarity_scores = search_faiss_index(query_embedding, index, k=1)

        # Fetch answers from Snowflake
        answer = None
        score = None
        if similar_indices:
            answers = fetch_answers_from_snowflake(similar_indices)

            if not answers.empty:
                # Use separate loop names so the outer row index is not shadowed,
                # and record a score only when its index has a matching answer
                for match_idx, match_score in zip(similar_indices, similarity_scores):
                    matching_answers = answers.loc[answers['THREAD_ID'] == match_idx, 'ANSWER'].values
                    if len(matching_answers) > 0:
                        answer = matching_answers[0]
                        score = match_score
                        break
        results.append({
            'Representative Sentence': sentence,
            'Answer': answer,
            'Similarity Score': score
        })

    # Save results to a CSV file
    results_df = pd.DataFrame(results)
    results_df.to_csv(similarity_scores_output_path, index=False)
    print("Results saved.")

    # Output results
    for result in results:
        print(f"Representative Sentence: {result['Representative Sentence']}")
        print(f"Answer: {result['Answer']}")
        print(f"Similarity Score: {result['Similarity Score']}\n")

def main():
    process_representative_sentences()

if __name__ == "__main__":
    main()

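The query embedding above is a plain average over all 512 token positions, so padding tokens contribute to the vector. A common alternative is mask-aware mean pooling, which averages only the real tokens. The sketch below illustrates the arithmetic with plain Python lists so that it runs without any dependencies. `masked_mean_pool` is a hypothetical helper, not part of this notebook, and switching to it would presumably require rebuilding the FAISS index with the same pooling so that stored and query embeddings stay comparable.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping positions where the mask is 0.

    hidden_states: list of token vectors (each a list of floats)
    attention_mask: list of 0/1 flags, one per token (1 = real token)
    """
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must have the same length")
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy example: two real tokens followed by two padding positions
states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]
pooled = masked_mean_pool(states, mask)
plain = [sum(vec[d] for vec in states) / len(states) for d in range(2)]
print(pooled)  # [2.0, 3.0]
print(plain)   # [5.5, 6.0]
```

In the toy example the padding vectors pull the plain average away from the two real tokens, while the masked average ignores them. With PyTorch tensors the same idea is usually written by multiplying `last_hidden_state` by the attention mask and dividing by the number of unmasked tokens.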

## 3.3 Data Storage and Database


Efficient data storage and management are pivotal for this project, which must accommodate extensive unstructured data from various sources. The project explores two main classes of storage solutions, cloud storage and local storage, each with its own benefits and challenges.

### 3.3.1 Cloud Storage
Cloud storage solutions offer scalability, reliability, and remote access, making them suitable for projects with dynamic data needs and global access requirements.

- **Tools:** Snowflake (for relational data), MongoDB Atlas (for NoSQL data)
    - **Pros:**
        - **Scalability:** Easily scales to meet growing data demands without the need for physical infrastructure management.
        - **Accessibility:** Provides global access to the data, facilitating collaboration and remote work.
        - **Maintenance and Security:** Cloud providers manage the security, backups, and maintenance, reducing the administrative burden.
    - **Cons:**
        - **Cost:** Costs can increase significantly as data volume and throughput grow.
        - **Internet Dependence:** Requires consistent internet access, which might be a limitation in some scenarios.
        - **Data Sovereignty:** Data stored in the cloud may be subject to the laws and regulations of the host country, raising concerns about compliance and privacy.


### 3.3.2 Local Storage
Local storage solutions rely on on-premises or personal hardware, providing full control over the data and its management but requiring more direct oversight.

- **Tools:** MySQL (for relational data), MongoDB (Local installation for NoSQL data)
    - **Pros:**
        - **Control:** Complete control over the data storage environment and configurations.
        - **Cost:** No ongoing fees tied to storage size or access rates; the main expense is the initial hardware and setup.
        - **Connectivity:** No reliance on internet connectivity for access, ensuring data availability even in offline scenarios.
    - **Cons:**
        - **Scalability:** Physical limits to scalability; expanding storage capacity requires additional hardware.
        - **Maintenance:** Requires dedicated resources for maintenance, backups, and security, increasing the administrative burden.
        - **Accessibility:** Data is not as easily accessible from remote locations, potentially hindering collaboration and remote access needs.


**Conclusion: I will be using Snowflake to store my corpus.**
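
With the corpus in Snowflake, answers are looked up by `THREAD_ID`. As a minimal sketch, the `IN (...)` clause for such a lookup can be built with bind placeholders instead of formatting values into the SQL string. `build_thread_lookup` is a hypothetical helper; the table name is taken from the retrieval code earlier in this notebook, and the `%s` placeholder style assumes the Snowflake Python connector's default `pyformat` parameter style.

```python
# Table name as used by the retrieval code earlier in this notebook
TABLE = "e9_corpus.e9_corpus_schema.e9_forum_corpus_faiss"


def build_thread_lookup(thread_ids, table=TABLE):
    """Return (sql, params) for fetching answers by thread ID (hypothetical helper)."""
    # Convert to plain ints; NumPy integer types may not bind cleanly
    params = [int(t) for t in thread_ids]
    if not params:
        raise ValueError("thread_ids must not be empty")
    # One placeholder per ID, so the values travel separately from the SQL text
    placeholders = ", ".join(["%s"] * len(params))
    sql = f"SELECT THREAD_ID, ANSWER FROM {table} WHERE THREAD_ID IN ({placeholders})"
    return sql, params


sql, params = build_thread_lookup([12, 7, 42])
print(sql)
print(params)
# With an open cursor (assumed, not shown here): cur.execute(sql, params)
```

Only the placeholders are formatted into the statement, so the query text stays the same shape regardless of which IDs are requested.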