<a href="https://colab.research.google.com/github/davidelgas/DataSciencePortfolio/blob/main/nlp/LSTM/notebooks/NLP_with_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Workflow

**Define Project Goals and Constraints:**

This will be instrumental in selecting specific architecture and data processing strategies.


**Data Cleaning and Preprocessing:**

Load your data into a Pandas DataFrame.
Perform basic cleaning: remove duplicates, handle missing values.
Normalize text: convert to lowercase, remove punctuation, and special characters.

**Text Preprocessing for BPE:**

Apply BPE tokenization to your corpus. This involves learning the BPE vocab from your dataset and then applying it to both questions and answers to tokenize them.

**Splitting the Dataset:**

Split your data into training, validation, and test sets. A common split ratio is 80% training, 10% validation, and 10% test.

**Converting Text to Sequences:**

Convert your tokenized text into sequences of integers using the BPE vocabulary. This step transforms the textual data into a format that can be fed into the LSTM model.

**Padding Sequences:**

Since LSTM models require inputs of the same length, use padding to ensure all sequences in a batch have the same length.

**Designing the LSTM Model:**

Build your LSTM model architecture using TensorFlow/Keras. The model should include an Embedding layer, one or more LSTM layers, and a Dense output layer.

**Compiling the Model:**

Compile the model with an appropriate optimizer (e.g., Adam), loss function (e.g., sparse_categorical_crossentropy for classification tasks), and metrics (e.g., accuracy).

**Training the Model:**

Train the model on your training set while also validating its performance on the validation set. Use model checkpoints and early stopping to prevent overfitting.

**Evaluating the Model:**

After training, evaluate the model's performance on the test set to get a sense of its generalization ability.

**Model Deployment:**

Deploy the model into a production environment. This could be a simple web application or a REST API that takes in a question and returns the predicted answer.

## Project Goals and Constraints

**Goal**

The goal is to create a "virtual mechanic" to help owners maintain older cars that have a dwindling set of experts available to turn to.


**Task Type:**

The project aims to build a generative language model that will accept written unstructured questions in English from users and provide the user with targeted written answers in English. The model will use sequence prediction and text generation. The model will not use classification, image recognition, or sentiment analysis.

**Data Characteristics:**

The training corpus will be user-generated content scraped from a domain-specific online forum. The corpus is largely unstructured and carries very limited metadata.

**Performance Metrics:**

Performance of the project will be scored on accuracy and speed of responses.

**Resource Constraints:**

The project will be built in Python utilizing limited CPU compute resources from Google Colab.

**Existing Tools or Frameworks:**

The corpus will be stored in a Snowflake database.

**Scalability and Adaptability:**

There is no need to support additional user languages. However, when available, the corpus will be supplemented with additional written unstructured text.



## Corpus Creation

The corpus was assembled using Beautiful Soup to scrape a public forum specific to the BMW E9 (www.e9coupe.com). This active forum has been in existence since 2003. The data was compiled and stored in a Snowflake database for multiple NLP projects, including LDA, GRU and LSTM. Future ideas include supplementing the forum text with an existing user's guide specific to this model.

## Language Model Architectures

### Recurrent Neural Networks (RNNs):

**Pros:**
1. *Sequential Processing:* RNNs process sequential data efficiently, making them suitable for tasks like text generation where the order of input elements matters.
2. *Memory:* RNNs have a form of memory that allows them to remember past information while processing current inputs.
3. *Interpretability:* Due to their sequential nature, RNNs are often more interpretable compared to more complex architectures like Transformers.
4. *Ease of Development:* RNNs have been around for longer and have a simpler architecture compared to Transformers, making them easier to develop and understand for beginners.
5. *CPU Needs:* RNNs can be trained and run on CPU instances, although training large models or processing large datasets may benefit from GPU acceleration.

**Cons:**
1. *Vanishing/Exploding Gradient:* RNNs can suffer from vanishing or exploding gradient problems, especially when dealing with long sequences, which can lead to difficulties in learning long-term dependencies.
2. *Limited Context:* Traditional RNNs have a limited memory span, making them less effective at capturing long-range dependencies in data.
3. *Computationally Inefficient:* Training RNNs can be computationally expensive, especially when dealing with large datasets and long sequences.
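
To make the vanishing/exploding gradient point concrete, here is a minimal numeric sketch. It assumes, purely for illustration, that each time step scales the gradient by one constant factor; real RNN gradients involve products of Jacobian matrices, and the factors 0.9 and 1.1 are hypothetical.

```python
def gradient_scale(factor: float, steps: int) -> float:
    """Size of a gradient after being multiplied by `factor` at each of `steps` time steps."""
    return factor ** steps

# Hypothetical per-step factors, chosen only to illustrate the effect.
for steps in (10, 50, 100):
    shrink = gradient_scale(0.9, steps)
    grow = gradient_scale(1.1, steps)
    print(f"{steps:>3} steps: factor 0.9 -> {shrink:.2e}, factor 1.1 -> {grow:.2e}")
```

Even a factor slightly below 1 drives the signal toward zero over a 100-token sequence, which is the motivation for gated cells such as LSTM and GRU.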

### Transformer Architectures:

**Pros:**
1. *Parallelization:* Transformers allow for highly parallelized computation, leading to faster training and inference compared to sequential models like RNNs.
2. *Long-Range Dependencies:* Transformers can capture long-range dependencies in data more effectively than traditional RNNs, making them well-suited for tasks requiring global context, such as machine translation and text generation.
3. *Attention Mechanism:* Transformers use attention mechanisms to weigh the importance of different input elements, allowing them to focus on relevant information and ignore irrelevant parts of the input sequence.
4. *Ease of Development:* While more complex than RNNs, Transformers have a modular architecture that can be easier to develop and experiment with compared to traditional recurrent architectures.

**Cons:**
1. *Complexity:* Transformers have a more complex architecture compared to RNNs, which can make them harder to understand, implement, and interpret.
2. *Data Requirements:* Transformers require large amounts of data to train effectively, especially for tasks with complex patterns and dependencies.
3. *Resource Intensive:* Training large transformer models requires significant computational resources, including powerful GPUs or TPUs, making them less accessible for smaller-scale projects or individuals with limited resources.

### Hybrid Model (Combining RNNs and Transformers):

**Pros:**
1. *Combine Strengths:* A hybrid model can potentially combine the strengths of both RNNs and Transformers, leveraging the sequential processing capabilities of RNNs with the long-range dependency handling of Transformers.
2. *Flexibility:* A hybrid approach offers flexibility in model design, allowing researchers and practitioners to tailor the architecture to specific task requirements and data characteristics.

**Cons:**
1. *Complexity:* Developing and training a hybrid model can be more complex compared to using either RNNs or Transformers alone, as it requires integration of different architectural components and potentially more sophisticated training procedures.
2. *Resource Intensive:* Depending on the specific architecture and scale, training a hybrid model may require significant computational resources, similar to Transformers.



## Tokenization Strategies

#### Word-Level Tokenization:

**Description:**
Word-level tokenization splits the text into individual words, treating each word as a token.

**Libraries:**
1. NLTK (Natural Language Toolkit): Provides tokenization tools for various NLP tasks, including word-level tokenization.
2. spaCy: Another popular NLP library that offers word-level tokenization along with other NLP functionalities.

**Pros:**
1. Preserves semantic meaning of individual words.
2. Intuitive representation of text for language modeling tasks.

**Cons:**
1. May struggle with out-of-vocabulary words, especially in domain-specific or informal language.
2. Increases vocabulary size, potentially leading to higher memory usage.

**Suitability:**
Word-level tokenization may be suitable for this project as it preserves the semantic meaning of individual words, which can be important for generating coherent responses to user questions.

#### Character-Level Tokenization:

**Description:**
Character-level tokenization treats each character in the text as a separate token.

**Libraries:**
1. TensorFlow Text: Part of the TensorFlow ecosystem, TensorFlow Text provides utilities for various text processing tasks, including character-level tokenization.
2. Keras: With its text preprocessing module, Keras offers character-level tokenization capabilities.

**Pros:**
1. Captures fine-grained details in the text, useful for handling misspellings or morphologically complex words.
2. Helps in handling out-of-vocabulary terms effectively.

**Cons:**
1. Can be computationally expensive because sequences become much longer, even though the token vocabulary itself is small.
2. May not capture higher-level semantic meaning as effectively as word-level tokenization.

**Suitability:**
Character-level tokenization might not be the best choice for this project, as it may not capture the semantic meaning of words effectively. However, it could be useful for capturing fine-grained details in the text if necessary.
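
The trade-off between word-level and character-level tokenization can be seen on a single sentence. This is a small sketch using plain Python string operations rather than NLTK, spaCy, or Keras; the sentence is a made-up example, not taken from the corpus.

```python
sentence = "my carburetor leaks fuel"  # hypothetical forum-style sentence

word_tokens = sentence.split()   # whitespace word-level tokenization
char_tokens = list(sentence)     # character-level tokenization (spaces included)

print(len(word_tokens), "word tokens:", word_tokens)
print(len(char_tokens), "character tokens")
print("distinct characters:", len(set(char_tokens)))
```

The character-level sequence is several times longer than the word-level one, while its set of distinct symbols stays small. That is the cost side of character-level models: more time steps per post for the LSTM to process.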

#### Byte Pair Encoding (BPE):

**Description:**
Byte Pair Encoding (BPE) tokenization iteratively merges the most frequent pairs of tokens to build a vocabulary of subword units.

**Libraries:**
1. Hugging Face Transformers: Provides tokenization functionalities, including BPE, along with pre-trained language models for various NLP tasks.
2. Tokenizers: A Python library specifically designed for fast and customizable tokenization, including BPE tokenization.

**Pros:**
1. Handles rare or out-of-vocabulary terms effectively.
2. Offers a good balance between accuracy and efficiency.

**Cons:**
1. Requires additional pre-processing steps compared to traditional tokenization methods.
2. Increases complexity of tokenization process, potentially impacting speed.

**Suitability:**
BPE tokenization could be a good choice for this project as it effectively handles rare or out-of-vocabulary terms, which may be present in the user-generated content scraped from online forums. It also offers good balance between accuracy and speed, which aligns with the project's performance metrics and resource constraints.
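
To illustrate the merge procedure described above, here is a toy BPE learner in plain Python. It is a simplified sketch, not the byte-level implementation in the `tokenizers` library: the function name, the four-word corpus, and the tie-breaking rule (first pair encountered wins) are all assumptions made for the example.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

corpus = ["carb", "carbs", "carburetor", "car"]  # hypothetical mini-corpus
merges, vocab = learn_bpe_merges(corpus, num_merges=3)
print(merges)
print(list(vocab))
```

After a few merges the frequent stem becomes a single subword, while the rarer word is still spelled out from smaller pieces. This is how BPE can represent unusual part names or misspellings without an unknown-token fallback.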


## Summarization Strategies
It is very difficult to find a single strategy here that can accommodate both long and short text blocks.

**Extractive Summarization**
<br>
Pros:
<br>
Good with Raw Text: Extractive methods can work directly with raw, unstructured text, as they mainly focus on selecting key sentences or phrases without needing deep linguistic processing.
Straightforward Implementation: These methods do not require complex preprocessing like tokenization or lemmatization, simplifying their implementation.
<br>
Cons:
<br>
Limited Depth in Understanding: While they can handle raw text, they may not fully capture the nuanced meaning, especially when the text contains complex structures or unorthodox language use.
Less Effective with Poorly Structured Text: In cases where the text is poorly structured or highly informal, extractive summarization might struggle to identify the main points effectively.
<br>
<br>


**Abstractive Summarization** (like sshleifer/distilbart-cnn-12-6)
<br>
Pros:
<br>
Advanced Processing Capabilities: Abstractive models, especially those based on transformer architectures, are designed to handle and interpret raw text, capturing deeper linguistic and contextual nuances.
Higher Tolerance for Unstructured Text: These models can manage unstructured or informal text by understanding and then rephrasing it in a more coherent and structured summary.
<br>
Cons:
<br>
Dependence on Preprocessing for Optimal Performance: While they can process raw text, the quality of the output can be significantly improved with proper tokenization and lemmatization, especially for complex texts.
Potential Overhead: Requires more computational resources to process and understand raw text, which might be more efficiently handled with some level of preprocessing.
<br>
<br>
**Hybrid Summarization**
<br>
Pros:
<br>
Flexibility in Text Processing: Combining extractive and abstractive methods allows for handling both raw and preprocessed text, adapting to the text's structure and complexity.
Balanced Approach: Can leverage the strengths of extractive methods in handling raw text for identifying key points, while using abstractive techniques for generating a coherent summary.
<br>
Cons:
<br>
Complex Preprocessing Requirements: The need to integrate both extractive and abstractive approaches may necessitate more sophisticated preprocessing strategies to optimize performance.
Potential for Processing Inefficiencies: The combined approach might lead to redundancies or inefficiencies in processing, especially if the text is either too raw or overly preprocessed.
<br>
<br>
After trying sshleifer/distilbart-cnn-12-6, I found that its input limit (1024 tokens) is too restrictive for my needs. T5 does not have the same hard positional limit, although memory use still grows with input length, so that is what I'll be trying next.
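
One common workaround for a fixed input limit is to split a long thread into chunks that each fit, summarize the chunks, and then combine the results. The sketch below shows only the chunking step. It is not what this notebook does; `chunk_tokens` is a hypothetical helper, and whitespace splitting stands in for the summarizer's real tokenizer.

```python
def chunk_tokens(tokens, max_tokens):
    """Split a token list into consecutive chunks of at most `max_tokens` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

# Whitespace splitting is an assumption; a real model counts its own subword tokens.
thread_text = "word " * 2500
tokens = thread_text.split()
chunks = chunk_tokens(tokens, max_tokens=1024)
print([len(chunk) for chunk in chunks])
```

Because subword tokenizers usually produce more tokens than whitespace splitting, a real implementation would count tokens with the model's own tokenizer and leave some headroom.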

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Collect source data

!pip install snowflake-connector-python
import snowflake.connector
import pandas as pd
import os


# Set the snowflake account and login information
path_to_credentials = '/content/drive/MyDrive/credentials/snowflake_credentials'

# Load the credentials
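with open(path_to_credentials, 'r') as file:
    for line in file:
        line = line.strip()
        if '=' not in line:
            continue  # Skip blank or malformed lines
        # Split on the first '=' only, so values containing '=' stay intact
        key, value = line.split('=', 1)
        os.environ[key] = value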


# And use them in your Snowflake connection (adjust as necessary for your specific case):
conn = snowflake.connector.connect(
    user=os.environ.get('USER'),
    password=os.environ.get('PASSWORD'),
    account=os.environ.get('ACCOUNT'),
)

# Create a cursor object
cur = conn.cursor()

# Select source data
query = """
SELECT * FROM "E9_CORPUS"."E9_CORPUS_SCHEMA"."E9_FORUM_CORPUS";
"""
cur.execute(query)

# Load your data into a Pandas DataFrame.
e9_forum_corpus = cur.fetch_pandas_all()

# Close the cursor and the connection
cur.close()
conn.close()

Collecting snowflake-connector-python
  Downloading snowflake_connector_python-3.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB)
Collecting asn1crypto<2.0.0,>0.24.0 (from snowflake-connector-python)
  Downloading asn1crypto-1.5.1-py2.py3-none-any.whl (105 kB)
Collecting platformdirs<4.0.0,>=2.6.0 (from snowflake-connector-python)
  Downloading platformdirs-3.11.0-py3-none-any.whl (17 kB)
Collecting tomlkit (from snowflake-connector-python)
  Downloading tomlkit-0.12.4-py3-none-any.whl (37 kB)
Installing collected packages: asn1crypto, tomlkit, platformdirs, snowflake-connector-python
  Attempting uninstall: platformdirs
    Found existing installation: platformdirs 4.2.0
    Uninstalling platformdirs-4.2.0:
      Succe

## Text Preprocessing


In [None]:
# This revised code drops NLTK tokenization and focuses on cleaning.
# BPE tokenization is handled later on.

import pandas as pd
import re

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

def clean_corpus(df):
    cleaned_titles = []
    cleaned_posts = []

    for title in df['THREAD_TITLE']:
        if isinstance(title, str):
            cleaned_titles.append(clean_text(title))
        else:
            cleaned_titles.append('')  # If the title is not a string, append an empty string

    df['THREAD_TITLE_CLEAN'] = cleaned_titles

    for post in df['THREAD_ALL_POSTS']:
        if isinstance(post, str):
            cleaned_posts.append(clean_text(post))
        else:
            cleaned_posts.append('')  # If the post is not a string, append an empty string

    df['THREAD_POSTS_CLEAN'] = cleaned_posts

    return df

# Example usage:
lstm = clean_corpus(e9_forum_corpus.copy())


# Drop unnecessary columns
lstm.drop(columns=['THREAD_TITLE', 'THREAD_ALL_POSTS', 'THREAD_FIRST_POST'], inplace=True)


# Rename columns to describe their roles
lstm.rename(columns={"THREAD_TITLE_CLEAN": "question", "THREAD_POSTS_CLEAN": "answer"}, inplace=True)
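
It is worth checking what the cleaning rule does to domain-specific text. The sketch below repeats the same `clean_text` logic as the cell above so that it runs on its own; the sample title is hypothetical and only meant to resemble a forum post.

```python
import re

def clean_text(text):
    # Same logic as the cleaning cell: lowercase, then drop everything
    # except letters, digits, and whitespace.
    text = text.lower()
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

sample = "Won't idle: 3.0CS w/ Zenith carbs (35/40 INAT)?"  # hypothetical forum title
print(clean_text(sample))
```

Removing punctuation also fuses model names and part numbers (here a decimal point and a slash disappear), so identifiers that matter to a mechanic may become ambiguous. Whether that loss is acceptable is a judgment call for this corpus.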

## Split the Dataset

In [None]:
# Split the data into training, validation, and test sets.
# A common split is 80% training, 10% validation, and 10% test.
from sklearn.model_selection import train_test_split

# Standard 80/10/10 split (disabled):
# train_df, temp_df = train_test_split(lstm, test_size=0.2, random_state=42)


# Reduced training set to speed up experiments: keep 10% of the rows for training;
# the remaining 90% is divided between validation and test below.
train_df, temp_df = train_test_split(lstm, test_size=0.9, random_state=42)


# Further splitting the temporary dataset into validation and test datasets (50% validation, 50% test from the temp dataset)
validation_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)


In [None]:
print(f"Training Set Size: {len(train_df)}")
print(f"Validation Set Size: {len(validation_df)}")
print(f"Test Set Size: {len(test_df)}")

Training Set Size: 993
Validation Set Size: 4472
Test Set Size: 4472
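
The printed sizes follow directly from the two `test_size` values. The arithmetic below is a sketch that assumes scikit-learn rounds the test share up to a whole row, which is consistent with the numbers above; the total row count is inferred by summing the three printed sizes, and `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (kept, held_out) row counts for a fractional test_size, rounding held_out up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

total_rows = 993 + 4472 + 4472                   # inferred from the printed sizes
n_train, n_temp = split_sizes(total_rows, 0.9)   # first split: 10% train, 90% held out
n_val, n_test = split_sizes(n_temp, 0.5)         # second split: halve the held-out rows
print(n_train, n_val, n_test)

# For comparison, the commented-out 80/10/10 configuration:
full_train, full_temp = split_sizes(total_rows, 0.2)
print(full_train, split_sizes(full_temp, 0.5))
```

With the reduced configuration, the validation and test sets are each more than four times larger than the training set, which speeds up training but leaves the model with little data to learn from.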


## Tokenization

In [None]:
# Text Preprocessing for BPE:
!pip install tokenizers

from tokenizers import ByteLevelBPETokenizer

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Concatenate questions and answers into a list of texts
texts = lstm['question'].tolist() + lstm['answer'].tolist()

# Save the texts to a file
with open("text_data.txt", "w", encoding="utf-8") as f:
    for item in texts:
        f.write("%s\n" % item)

# Train the tokenizer
tokenizer.train(files="text_data.txt", vocab_size=30_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Save the trained tokenizer
tokenizer.save_model(".", "bpe_tokenizer")



['./bpe_tokenizer-vocab.json', './bpe_tokenizer-merges.txt']

## Sequencing

In [None]:
from tokenizers import ByteLevelBPETokenizer

# Assuming your tokenizer files are saved in the current directory
tokenizer = ByteLevelBPETokenizer(
    "./bpe_tokenizer-vocab.json",
    "./bpe_tokenizer-merges.txt",
)

def tokenize_and_encode(df, tokenizer):
    questions = []
    answers = []

    for _, row in df.iterrows():
        question = tokenizer.encode(row['question']).ids
        answer = tokenizer.encode(row['answer']).ids

        questions.append(question)
        answers.append(answer)

    return questions, answers

# Apply the function to each of your DataFrames
train_questions, train_answers = tokenize_and_encode(train_df, tokenizer)
validation_questions, validation_answers = tokenize_and_encode(validation_df, tokenizer)
test_questions, test_answers = tokenize_and_encode(test_df, tokenizer)


## Padding

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Set maximum sequence length
sequence_length = 100

# Calculate maximum sequence length
#sequence_length = max(max(len(seq) for seq in train_questions + train_answers),
#                 max(len(seq) for seq in validation_questions + validation_answers),
#                 max(len(seq) for seq in test_questions + test_answers))

# Pad or truncate every sequence to sequence_length tokens (both applied at the end of the sequence)
train_answers_padded = pad_sequences(train_answers, maxlen=sequence_length, padding='post', truncating='post')
train_questions_padded = pad_sequences(train_questions, maxlen=sequence_length, padding='post', truncating='post')

validation_questions_padded = pad_sequences(validation_questions, maxlen=sequence_length, padding='post', truncating='post')
validation_answers_padded = pad_sequences(validation_answers, maxlen=sequence_length, padding='post', truncating='post')
test_questions_padded = pad_sequences(test_questions, maxlen=sequence_length, padding='post', truncating='post')
test_answers_padded = pad_sequences(test_answers, maxlen=sequence_length, padding='post', truncating='post')
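
For reference, the effect of `padding='post'` and `truncating='post'` on a single sequence can be sketched in plain Python. `pad_post` is a hypothetical helper written for illustration, not part of Keras, and the token ids are made up.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Keep the first `maxlen` items, then right-pad with `pad_value` up to `maxlen`."""
    trimmed = list(sequence[:maxlen])
    return trimmed + [pad_value] * (maxlen - len(trimmed))

short_ids = [12, 7, 431]          # hypothetical token ids, shorter than maxlen
long_ids = list(range(1, 9))      # hypothetical token ids, longer than maxlen

print(pad_post(short_ids, maxlen=5))
print(pad_post(long_ids, maxlen=5))
```

One thing to verify: `pad_sequences` pads with 0 by default, and with the special-token order used when training the tokenizer, id 0 probably belongs to `<s>` rather than `<pad>`. Checking `tokenizer.token_to_id("<pad>")` and passing that id through the `value` argument of `pad_sequences` would keep padding and sequence-start tokens distinct.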



## Designing the Model

Your LSTM model architecture looks well-constructed for a sequence processing task, with a bidirectional LSTM layer to capture patterns from both directions of your input sequences, enhancing the model's understanding of the context. The embedding layer is essential for representing your tokenized words as dense vectors of fixed size, and the dense layers, including a dropout layer for regularization, form the decision-making part of the network. However, there are a few considerations and potential adjustments depending on the specifics of your task:

**Model Architecture Considerations**

- **Output Layer**: The current model ends with a single dense layer with one unit (`Dense(1)`) and no activation function specified. This setup is typical for regression, or for binary classification when the loss is computed from logits. If your task is to generate text (like in a virtual mechanic answering questions), you need a different setup. For text generation, the output layer usually has as many units as the size of the vocabulary and uses a softmax activation function to produce a probability distribution over the vocabulary for each output token.

- **Embedding Layer Input Dimension**: Ensure the `input_dim` parameter of the Embedding layer matches the size of your vocabulary. The value 7680000 shown in the model summary is the layer's parameter count (`input_dim` × `output_dim` = 30000 × 256), not `input_dim` itself. Because the BPE tokenizer already reserves `<pad>` among its special tokens, `input_dim` can equal the tokenizer's vocabulary size; an extra index is only needed when padding uses an id outside that vocabulary.

- **Sequence Length**: The `input_length` parameter in the Embedding layer is set to `None`, allowing for variable-length sequences. This is fine as long as all sequences are padded to the same length before training. Ensure that the sequence length matches the `maxlen` used during padding.

**Potential Adjustments for a Text Generation Task**

If your goal is to generate text (answers) given questions, consider the following adjustments:

- **Output Layer**: Give the final layer one unit per vocabulary token and a softmax activation, e.g. `Dense(vocab_size, activation='softmax')`.

- **Loss Function**: For a model generating text, use `sparse_categorical_crossentropy` as your loss function, which is suitable for classification problems with multiple classes, where the targets are integers.

- **Sequence-to-Sequence Model**: Depending on your exact requirements, a sequence-to-sequence model architecture might be more appropriate. This involves using one LSTM for encoding the input sequence and another for generating the output sequence, potentially with attention mechanisms for better context capture.

**Example Correction for Text Generation**

If you're aiming for a model that generates text, here's a slight adjustment to your architecture for clarity:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

vocab_size = 30000  # Example vocabulary size, adjust based on your actual vocabulary
max_length = 100  # Adjust based on your padded sequence length

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=256, input_length=max_length),
    Bidirectional(LSTM(128)),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(vocab_size, activation='softmax')  # Adjusted for text generation
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()


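**Output Layer Adjustment**

In the example above, the last Dense layer has `vocab_size` units and a softmax activation, so it outputs a probability distribution over the vocabulary. Because `Bidirectional(LSTM(128))` returns only its final state, this version predicts one token per input sequence rather than a full answer.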


Ensure you adjust vocab_size and max_length to match your dataset's specifics. This setup is more aligned with a model that generates text, predicting the probability distribution of the next word in a sequence given the context.





## Compiling the Model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout, TimeDistributed

vocab_size = 30000  # Example vocabulary size, adjust based on your actual vocabulary
max_length = 100  # Adjust based on your padded sequence length

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=256, input_length=max_length),
    Bidirectional(LSTM(128, return_sequences=True)),  # Ensure LSTM returns sequences
    Dense(128, activation='relu'),
    Dropout(0.5),
    TimeDistributed(Dense(vocab_size, activation='softmax'))  # Use TimeDistributed here
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 256)          7680000   
                                                                 
 bidirectional (Bidirection  (None, 100, 256)          394240    
 al)                                                             
                                                                 
 dense (Dense)               (None, 100, 128)          32896     
                                                                 
 dropout (Dropout)           (None, 100, 128)          0         
                                                                 
 time_distributed (TimeDist  (None, 100, 30000)        3870000   
 ributed)                                                        
                                                                 
Total params: 11977136 (45.69 MB)
Trainable params: 1197
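
The parameter counts in the summary can be reproduced by hand, which is a useful check that each layer has the shape intended. The formulas below are the standard ones for Embedding, LSTM, and Dense layers; the variable names are chosen for this sketch.

```python
vocab_size, embed_dim, lstm_units, dense_units = 30000, 256, 128, 128

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding = vocab_size * embed_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_one_direction = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bidirectional = 2 * lstm_one_direction

# Dense layers: weights plus biases. The bidirectional output is 2 * lstm_units wide.
dense = (2 * lstm_units) * dense_units + dense_units
output = dense_units * vocab_size + vocab_size

total = embedding + bidirectional + dense + output
print(embedding, bidirectional, dense, output, total)
```

Nearly all of the parameters sit in the embedding and output layers, so reducing `vocab_size` is the most direct way to shrink this model.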

## Training the Model

#### Pre-training Checklist:

**1. Review Model Architecture**

- Confirm Layer Configurations: Make sure each layer is configured as intended for your task. For a sequence generation model like yours, using Bidirectional(LSTM()) with return_sequences=True and a TimeDistributed(Dense()) layer is appropriate.
- Output Layer Compatibility: The final TimeDistributed(Dense(vocab_size, activation='softmax')) layer should match your vocabulary size, ensuring the model can predict each token in the sequence.

**2. Verify Data Preprocessing**

- Tokenization and Encoding: Ensure your questions and answers have been correctly tokenized and encoded to integer sequences. This usually involves using a tokenizer that fits your dataset.
- Padding: Verify that both input (questions) and output (answers) sequences are padded to the correct max_length. All sequences should have the same length to ensure consistent model input and output shapes.

**3. Ensure Correct Data Split**

- Training, Validation, and Test Sets: Confirm you have split your data into appropriate sets. Typically, you'd want a training set for model training, a validation set for tuning, and a test set for final evaluation.
- Balance and Representativeness: Check that each data split is representative of the overall dataset to avoid bias.

**4. Check Compilation Settings**

- Loss Function: For a sequence generation task, sparse_categorical_crossentropy is suitable when your labels are integer-encoded (not one-hot encoded). Ensure this aligns with how your target data is prepared.
- Optimizer and Metrics: Validate that you've chosen an optimizer and metrics that align with your model's goals. adam and accuracy are common choices, but ensure they fit your specific task.

**5. Model Summary Review**

- Use model.summary() to review your model's architecture. Confirm the number of parameters and the output shape at each layer align with your expectations.

**6. Small Scale Test Run**

- Consider doing a small-scale test run of your model training with a subset of your data. This can help identify potential issues early without the need for a full training cycle.

**7. Hardware and Runtime Environment**

- GPU Availability: Ensure you have access to a suitable GPU for training if your dataset and model are large. Training on a CPU can be significantly slower.
- Memory Constraints: Monitor memory usage during the test run to ensure your environment has sufficient resources to handle the full training process.
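
To make the loss-function item in the checklist concrete, here is a worked example of softmax followed by sparse categorical cross-entropy for a single token. It is a plain-Python sketch, not the Keras implementation; the four logits are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(target_id, probs):
    """Negative log-probability assigned to the integer target."""
    return -math.log(probs[target_id])

probs = softmax([2.0, 0.5, 0.1, -1.0])           # hypothetical 4-token vocabulary
loss = sparse_categorical_crossentropy(0, probs)
uniform_loss = math.log(30000)                   # loss of a uniform guess over 30,000 tokens
print(round(loss, 3), round(uniform_loss, 3))
```

The uniform-guess value gives a baseline for reading training logs: a per-token loss well below it means the model has learned something about token frequencies, although that alone does not show it produces useful answers.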




## Step 1: Check Input Data Shape
First, confirm the shape of your padded questions (train_questions_padded) to ensure they match the expected input shape for the model. Given your model architecture, the input shape should be (None, 100) for max_length of 100.

In [None]:
print("Shape of train_questions_padded:", train_questions_padded.shape)


Shape of train_questions_padded: (993, 100)


## Step 2: Verify Target Data Shape
If your model is designed for sequence generation with answers as targets, check the shape of your target data (train_answers_padded). For sequence-to-sequence models, the target data typically should have the same sequence length as the input data.

In [None]:
print("Shape of train_answers_padded:", train_answers_padded.shape)


Shape of train_answers_padded: (993, 100)


## Step 3: Confirm Matching Dimensions for Input and Target
The sequence length of train_questions_padded and train_answers_padded should match the max_length specified in your model (100 in this case). Ensure both have the shape (num_samples, 100).

In [None]:
if train_questions_padded.shape[1] == 100 and train_answers_padded.shape[1] == 100:
    print("Input and target data are correctly shaped.")
else:
    print("Mismatch in input or target data shapes detected.")

print("Shape of train_questions_padded:", train_questions_padded.shape)
print("Shape of train_answers_padded:", train_answers_padded.shape)


Input and target data are correctly shaped.
Shape of train_questions_padded: (993, 100)
Shape of train_answers_padded: (993, 100)


## Additional Check: Model's Expected Input Shape
To ensure your model's first layer is configured to accept the shape of your input data, you can also verify the model's expected input shape:

In [None]:
model_input_shape = model.input_shape  # also available on Keras versions where layers lack .input_shape
print("Model's expected input shape:", model_input_shape)


Model's expected input shape: (None, 100)


In [None]:
# Limited data, batch size, and epochs to speed up experimentation.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout, TimeDistributed
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Model definition (repeating for clarity)
vocab_size = 30000  # Adjust based on your actual vocabulary size
max_length = 100    # Sequence length that you have chosen for padding

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=256, input_length=max_length),
    Bidirectional(LSTM(128, return_sequences=True)),
    Dense(128, activation='relu'),
    Dropout(0.5),
    TimeDistributed(Dense(vocab_size, activation='softmax'))
])

# Compile the model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
    run_eagerly=False  # Set to False for more efficient training, now that we're done debugging
)

# Train on the padded question/answer sequences
history = model.fit(
    train_questions_padded,
    train_answers_padded,
    batch_size=1,                     # Very small batch; increase if memory allows to shorten each epoch
    epochs=3,                         # Kept low for a quick run
    validation_split=0.1               # Hold out the last 10% of the training arrays for validation
)

# Optionally, save your trained model
model.save('path_to_save_your_model/my_model.h5')

Epoch 1/3
Epoch 2/3
Epoch 3/3




In [None]:
# This code cell will stop execution of subsequent cells
class StopExecution(Exception):
    def _render_traceback_(self):
        pass  # This will prevent the traceback from being shown

raise StopExecution("Execution stopped by user")

StopExecution: Execution stopped by user

## Preliminary Results

In [None]:
# Initial results: Batch: 64, Epochs: 10
# This is taking ~10 to 15 minutes per Epoch


Epoch 1/10
84/84 [==============================] - 1036s 12s/step - loss: 7.0193 - accuracy: 0.1689 - val_loss: 6.2323 - val_accuracy: 0.1764
Epoch 2/10
84/84 [==============================] - 999s 12s/step - loss: 6.2303 - accuracy: 0.1787 - val_loss: 6.1813 - val_accuracy: 0.1783
Epoch 3/10
84/84 [==============================] - 1005s 12s/step - loss: 6.1779 - accuracy: 0.1794 - val_loss: 6.1575 - val_accuracy: 0.1787
Epoch 4/10
84/84 [==============================] - 1002s 12s/step - loss: 6.1540 - accuracy: 0.1797 - val_loss: 6.1482 - val_accuracy: 0.1789
Epoch 5/10
84/84 [==============================] - 998s 12s/step - loss: 6.1337 - accuracy: 0.1801 - val_loss: 6.1177 - val_accuracy: 0.1793
Epoch 6/10
84/84 [==============================] - 958s 11s/step - loss: 6.0721 - accuracy: 0.1835 - val_loss: 6.1016 - val_accuracy: 0.1829
Epoch 7/10
84/84 [==============================] - 994s 12s/step - loss: 5.9832 - accuracy: 0.1921 - val_loss: 6.1179 - val_accuracy: 0.1815
Epoch 8/10
84/84 [==============================] - 993s 12s/step - loss: 5.9335 - accuracy: 0.1984 - val_loss: 6.1405 - val_accuracy: 0.1766
Epoch 9/10
84/84 [==============================] - 993s 12s/step - loss: 5.8925 - accuracy: 0.2027 - val_loss: 6.2001 - val_accuracy: 0.1674
Epoch 10/10
84/84 [==============================] - ETA: 0s - loss: 5.8618 - accuracy: 0.2062
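
One way to put these loss values in context is to compare them with the loss of a model that guesses uniformly over the vocabulary, which is `ln(vocab_size)`. The sketch below assumes the 30,000-token vocabulary from the model definition; the helper names are illustrative. If padded positions count toward the loss, as they would without masking, the comparison is only a rough guide.

```python
import math


def uniform_baseline_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over vocab_size classes."""
    return math.log(vocab_size)


def perplexity(loss):
    """Perplexity corresponding to a mean cross-entropy loss in nats."""
    return math.exp(loss)


vocab_size = 30000       # assumed from the model definition
best_val_loss = 6.1016   # lowest val_loss in the log above (epoch 6)

print(f"Uniform baseline loss: {uniform_baseline_loss(vocab_size):.2f}")
print(f"Perplexity at best val_loss: {perplexity(best_val_loss):.0f}")
```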

## Parking Lot

## Troubleshooting

In [None]:
# Troubleshooting

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout, TimeDistributed

tf.keras.backend.clear_session()  # Clearing the session to reset any leftover state

# Re-define your model here
vocab_size = 30000  # Vocabulary size
max_length = 100  # Adjust based on padded sequence length

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=256, input_length=max_length),
    Bidirectional(LSTM(128, return_sequences=True)),
    TimeDistributed(Dense(vocab_size, activation='softmax'))
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Attempt to train the model again with the dummy data
history = model.fit(train_questions_padded, dummy_targets,
                    batch_size=64,
                    epochs=1,
                    validation_split=0.1)

In [None]:
# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(test_questions_padded, test_answers_padded, batch_size=64)
print(f"Test Loss: {test_loss}, Test Accuracy: {test_accuracy}")


In [None]:
# Select a small subset for a quick test
subset_train_questions_padded = train_questions_padded[:10]
subset_train_answers_padded = train_answers_padded[:10]

# Perform a quick training cycle
history = model.fit(subset_train_questions_padded, subset_train_answers_padded,
                    batch_size=2,  # Small batch size for quick feedback
                    epochs=1,  # Single epoch to minimize wait time
                    verbose=2)  # Less verbose output


In [None]:
# Check the shape of your target data
print("Shape of train_answers_padded:", train_answers_padded.shape)

# Note: with a TimeDistributed softmax output and sparse_categorical_crossentropy,
# the targets should stay 2D: (num_samples, sequence_length).
# The reshape below is only an illustrative experiment; flattened 1D targets will not match that output.
if len(train_answers_padded.shape) > 1:
    # Hypothetically flattening the target data (illustrative only; adapt to your data preparation)
    train_answers_padded = train_answers_padded.reshape(-1)
    print("New shape of train_answers_padded:", train_answers_padded.shape)
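
For a model that ends in a per-time-step softmax, `sparse_categorical_crossentropy` pairs predictions of shape `(batch, sequence_length, vocab_size)` with integer targets of shape `(batch, sequence_length)`. The pure-Python sketch below computes that loss on toy values to make the pairing explicit. The numbers are made up for illustration and the function is a simplified stand-in, not the Keras implementation.

```python
import math


def sparse_categorical_crossentropy(targets, probs):
    """Mean negative log-likelihood.

    targets: nested list of ints, shape (batch, seq_len)
    probs:   nested list of floats, shape (batch, seq_len, vocab_size)
    """
    total, count = 0.0, 0
    for target_row, prob_row in zip(targets, probs):
        if len(target_row) != len(prob_row):
            raise ValueError("targets and predictions must share (batch, seq_len)")
        for label, dist in zip(target_row, prob_row):
            total += -math.log(dist[label])
            count += 1
    return total / count


# Toy example: batch of 1, sequence length 2, vocabulary of 3 tokens
targets = [[2, 0]]
probs = [[[0.1, 0.2, 0.7], [0.5, 0.25, 0.25]]]
loss = sparse_categorical_crossentropy(targets, probs)
print(round(loss, 4))
```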


In [None]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Assuming your model is compiled and ready for training
history = model.fit(train_questions_padded, train_answers_padded,
          batch_size=64, # Adjust based on your dataset size and memory constraints
          epochs=100, # Set a large number and rely on EarlyStopping to halt training
          validation_data=(validation_questions_padded, validation_answers_padded),
          callbacks=[EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
                     ModelCheckpoint(filepath='best_model.h5', monitor='val_loss', save_best_only=True)])
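
To see what `patience=3` means in practice, the sketch below replays the patience logic on the validation losses from the Preliminary Results log. It is a simplified simulation that ignores options such as `min_delta`; `simulate_early_stopping` is a hypothetical helper, not the Keras callback.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stopped_epoch, best_epoch), both 1-indexed.

    Training stops once val_loss has failed to improve for `patience`
    consecutive epochs; best_epoch is the epoch whose weights
    restore_best_weights would bring back.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# val_loss for epochs 1-9 from the Preliminary Results log
val_losses = [6.2323, 6.1813, 6.1575, 6.1482, 6.1177, 6.1016, 6.1179, 6.1405, 6.2001]
stopped_epoch, best_epoch = simulate_early_stopping(val_losses, patience=3)
print(stopped_epoch, best_epoch)
```

On those logged values, validation loss bottoms out at epoch 6, so a patience of 3 would halt the run after epoch 9.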


In [None]:
# Ensure y_train and y_test are of shape (batch_size, sequence_length)
# where each element is an integer class label

# If y_train is one-hot encoded or otherwise misshapen, convert it to integer labels here.
# If it is already in the correct shape, skip the adjustment.

# Check the current shape of y_train and y_test
print("Before reshape, y_train shape:", y_train.shape)
print("Before reshape, y_test shape:", y_test.shape)

# Reshape or adjust y_train and y_test here based on the specific issue identified.
# This is a placeholder: replace it with the adjustment your data actually needs.

# After adjustment, if needed:
print("After reshape, y_train shape:", y_train.shape)
print("After reshape, y_test shape:", y_test.shape)

# Continue with model training as before


In [None]:
print("Input data shape:", X.shape)
print("Number of batches:", int(np.ceil(len(X) / 32)))  # Round up: the final partial batch still counts

In [None]:
print("Dataframe shape:", df_q_a.shape)
print(df_q_a.head())

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

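# Initialize the tokenizer
tokenizer = Tokenizer(num_words=10000)  # Only the most common 10,000 words
# Fit on questions and answers so that words appearing only in answers also get an index
tokenizer.fit_on_texts(list(df_q_a['question']) + list(df_q_a['answer']))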

# Convert text to sequences of integers
questions_seq = tokenizer.texts_to_sequences(df_q_a['question'])
answers_seq = tokenizer.texts_to_sequences(df_q_a['answer'])

# Pad the sequences so they are all the same length
max_length = max(max(len(seq) for seq in questions_seq), max(len(seq) for seq in answers_seq))

questions_padded = pad_sequences(questions_seq, maxlen=max_length)
answers_padded = pad_sequences(answers_seq, maxlen=max_length)

print("Padded questions shape:", questions_padded.shape)
print("Padded answers shape:", answers_padded.shape)
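
By default, `pad_sequences` pads at the front with zeros and, when a sequence is longer than `maxlen`, drops tokens from the front. The pure-Python sketch below mirrors those defaults on toy sequences. `pad_sequences_sketch` is an illustrative stand-in that assumes `maxlen >= 1`, not the Keras function itself.

```python
def pad_sequences_sketch(sequences, maxlen, value=0):
    """Pre-pad with `value` and pre-truncate, mirroring the Keras defaults."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


print(pad_sequences_sketch([[5, 3], [7, 1, 4, 9]], maxlen=3))
```

Passing `padding='post'` to the real function moves the zeros to the end instead.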


In [None]:
# Replace None/NaN with empty strings (assign back rather than using inplace on a column selection)
df_q_a['question'] = df_q_a['question'].fillna('')
df_q_a['answer'] = df_q_a['answer'].fillna('')

# Assuming you have already instantiated and fit a tokenizer
# Convert text to sequences of integers
questions_seq = tokenizer.texts_to_sequences(df_q_a['question'])
answers_seq = tokenizer.texts_to_sequences(df_q_a['answer'])

# Pad the sequences so they are all the same length
max_length = max(max(len(seq) for seq in questions_seq), max(len(seq) for seq in answers_seq))

questions_padded = pad_sequences(questions_seq, maxlen=max_length)
answers_padded = pad_sequences(answers_seq, maxlen=max_length)

print("Padded questions shape:", questions_padded.shape)
print("Padded answers shape:", answers_padded.shape)


In [None]:
print("Input X shape:", X.shape)
print("Output y shape:", y.shape)

num_samples = X.shape[0]
batch_size = 32
expected_num_batches = int(np.ceil(num_samples / batch_size))  # Cast so the count prints as an integer

print("Expected number of batches:", expected_num_batches)
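
The batch count rounds up because the last, smaller batch still counts as a step. A short sketch using only the standard library, with the 993-sample count taken from the training shape check earlier in this notebook:

```python
import math


def num_batches(num_samples, batch_size):
    """Batches per epoch when the final partial batch is kept."""
    return math.ceil(num_samples / batch_size)


print(num_batches(993, 32))  # rounding up keeps the partial batch
print(993 // 32)             # floor division drops it
```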


In [None]:
# Convert text to sequences of integers using the correct method for ByteLevelBPETokenizer
questions_seq = [tokenizer.encode(question).ids for question in df_q_a['question']]
answers_seq = [tokenizer.encode(answer).ids for answer in df_q_a['answer']]

# For demonstration, let's just print the first few sequences to confirm they're correctly encoded
print("First question sequence:", questions_seq[0])
print("First answer sequence:", answers_seq[0])

In [None]:
from tokenizers import ByteLevelBPETokenizer
from keras.preprocessing.sequence import pad_sequences
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding

# Load the tokenized DataFrame
df_q_a = pd.read_csv('/content/drive/MyDrive/lstm/df_q_a_tokenized.csv')

# Drop any rows with missing values
df_q_a.dropna(inplace=True)

# Parse every row's stringified token list (ast.literal_eval is safer than eval)
# and pad to a fixed length so X and y have shape (num_samples, max_sequence_length)
import ast
max_sequence_length = 100  # Update with your max sequence length
X = pad_sequences(df_q_a['question_tokens'].apply(ast.literal_eval).tolist(), maxlen=max_sequence_length, padding='post')
y = pad_sequences(df_q_a['answer_tokens'].apply(ast.literal_eval).tolist(), maxlen=max_sequence_length, padding='post')

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Sanity check: proceed only if the split produced all four arrays
if X_train is not None and X_test is not None and y_train is not None and y_test is not None:
    # Define LSTM model
    vocab_size = int(max(X.max(), y.max())) + 1  # Largest token index across inputs and targets, plus one
    embedding_dim = 128  # Example value, adjust as needed

    lstm_model = Sequential([
        Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_sequence_length),
        LSTM(units=64, return_sequences=True),  # One output per time step
        Dense(vocab_size, activation='softmax')  # Per-token distribution over the vocabulary
    ])

    # Compile the model
    lstm_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    # The Embedding layer expects 2D integer input (batch, sequence_length), so no reshape is needed

    # Train the model
    history = lstm_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))

    # Optionally, you can plot training history to visualize model performance over epochs
    import matplotlib.pyplot as plt

    # Plot training loss
    plt.plot(history.history['loss'], label='Training Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()

    # Plot training accuracy
    plt.plot(history.history['accuracy'], label='Training Accuracy')
    plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.show()
else:
    print("One or more numpy arrays contain None values. Please check your data.")


In [None]:
# This code cell will raise an exception to stop execution of subsequent cells
class StopExecution(Exception):
    def _render_traceback_(self):
        pass  # This will prevent the traceback from being shown

raise StopExecution("Execution stopped by user")

In [None]:
# Train the model
history = lstm_model.fit(X_train_reshaped, y_train, epochs=10, batch_size=32, validation_data=(X_val_reshaped, y_val))  # Adjust epochs, batch_size, validation_data as needed

# Optionally, you can plot training history to visualize model performance over epochs
import matplotlib.pyplot as plt

# Plot training loss
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

# Plot training accuracy
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()


In [None]:
# Hybrid Model leveraging GPT

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from torch.optim import AdamW

# Initialize tokenizer and model
model_name = "distilgpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained(model_name)

lines = E9_FORUM_CORPUS['THREAD_ALL_POST'].tolist()  # This is only 100 rows

# Batch tokenization
#tokens = tokenizer(lines, max_length=1024, truncation=True, padding="max_length", return_tensors="pt")
tokens = tokenizer(lines, max_length=512, truncation=True, padding="max_length", return_tensors="pt") #reducing the compute
input_ids = tokens['input_ids']
attention_masks = tokens['attention_mask']

# Check that I have enough samples to split
if input_ids.size(0) > 1:
    # Use a smaller portion of the data for quicker experiments:
    # 10% for training and a separate, non-overlapping 10% for validation
    small_train_inputs, small_val_inputs, small_train_masks, small_val_masks = train_test_split(
        input_ids, attention_masks, train_size=0.1, test_size=0.1, random_state=42)  # reducing the compute
else:
    # Not enough data to split, so we use what we have for both training and validation
    small_train_inputs = input_ids
    small_train_masks = attention_masks
    small_val_inputs = input_ids
    small_val_masks = attention_masks

# Define the TextDataset class
class TextDataset(Dataset):
    def __init__(self, input_ids, masks):
        self.input_ids = input_ids
        self.masks = masks

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {'input_ids': self.input_ids[idx], 'attention_mask': self.masks[idx]}

# Create datasets with the smaller subset
small_train_dataset = TextDataset(small_train_inputs, small_train_masks)
small_val_dataset = TextDataset(small_val_inputs, small_val_masks)

# Update DataLoaders with the smaller dataset
train_loader = DataLoader(small_train_dataset, batch_size=1, shuffle=True)
val_loader = DataLoader(small_val_dataset, batch_size=1)

# Set up the device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

# Initialize the optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Experimenting with values to find steady state
epochs = 5
training_losses = []
validation_losses = []

# Training loop
for epoch in range(epochs):
    model.train()
    total_train_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = input_ids.clone().detach()
        labels[attention_mask == 0] = -100  # Ignore padding positions when computing the loss
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_train_loss += loss.item()
        loss.backward()
        optimizer.step()
    avg_train_loss = total_train_loss / len(train_loader)
    training_losses.append(avg_train_loss)

    model.eval()
    total_val_loss = 0
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = input_ids.clone().detach()
            labels[attention_mask == 0] = -100  # Ignore padding positions when computing the loss
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            total_val_loss += outputs.loss.item()
    avg_val_loss = total_val_loss / len(val_loader)
    validation_losses.append(avg_val_loss)
    print(f'Epoch {epoch + 1}, Training Loss: {avg_train_loss}, Validation Loss: {avg_val_loss}')

# Plotting
import matplotlib.pyplot as plt

plt.plot(range(1, epochs+1), training_losses, 'bo-', label='Training loss')
plt.plot(range(1, epochs+1), validation_losses, 'ro-', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
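
When carving small training and validation subsets out of the same data, as in the cell above, the two subsets should not share samples, or the validation loss will track the training loss. The sketch below shows one way to draw two non-overlapping index sets using only the standard library. `disjoint_subsets` and its default fractions are illustrative assumptions, not part of scikit-learn.

```python
import random


def disjoint_subsets(num_samples, train_frac=0.1, val_frac=0.1, seed=42):
    """Return (train_idx, val_idx): two non-overlapping random index lists."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    n_train = max(1, int(num_samples * train_frac))
    n_val = max(1, int(num_samples * val_frac))
    if n_train + n_val > num_samples:
        raise ValueError("not enough samples for disjoint subsets")
    return indices[:n_train], indices[n_train:n_train + n_val]


train_idx, val_idx = disjoint_subsets(100)
print(len(train_idx), len(val_idx), set(train_idx) & set(val_idx))
```

The resulting index lists could then be used to select rows from both `input_ids` and `attention_masks`.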


In [None]:
# This is a summarization of the initial thread post

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Initialize the T5 tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

def summarize_text(df):

    thread_first_post_summary = []  # Initialize the list to hold summaries
    for thread_id, text in zip(df['thread_id'], df['thread_first_post']): # ensures pairing

        # Prefixing the input text with "summarize: " as T5 expects
        inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", truncation=True, max_length=512)
        summary_ids = model.generate(inputs, max_length=150, length_penalty=2.0, num_beams=4, early_stopping=True)

        # Decode the generated ids to get the summary text
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        # Append the tuple containing thread_id and summary to the list
        thread_first_post_summary.append((thread_id, summary))

    return thread_first_post_summary

# Fetch first post content and convert to DataFrame
data = summarize_text(df_threads)
df_short_desc = pd.DataFrame(data, columns=['thread_id', 'thread_first_post_summary'])

# Merge the summarized_df with df_threads on the 'thread_id' column
df_threads = pd.merge(df_threads, df_short_desc, on='thread_id', how='left')

# Display the resulting DataFrame
df_threads.head()

In [None]:
# Gather keywords from the initial thread post
# Keyword extraction with BERT
from transformers import BertTokenizer, BertModel

import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to extract keywords using BERT
def bert_extract_keywords(text, tokenizer, model, top_n=5):
    # Tokenize and encode the text
    inputs = tokenizer.encode_plus(text, add_special_tokens=True, return_tensors="pt", truncation=True, max_length=512)
    input_ids = inputs['input_ids'][0]

    # Get the embeddings from the last hidden layer
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.squeeze(0)

    # Compute word importance by summing up the embeddings
    word_importance = torch.sum(embeddings, dim=1)

    # Get the indices of the top n important words
    top_n_indices = word_importance.argsort(descending=True)[:top_n]

    # Filter out indices that are out of range of input_ids
    top_n_indices = [idx for idx in top_n_indices if idx < len(input_ids)]

    # Decode the top n words
    keywords = [tokenizer.decode([input_ids[idx]]) for idx in top_n_indices]

    return keywords

# df_threads['keywords'] = bert_extract_keywords(df_threads['summary'],tokenizer, model)
df_threads['thread_first_post_keywords'] = df_threads['thread_first_post_summary'].apply(lambda x: bert_extract_keywords(x, tokenizer, model))

# Display the resulting DataFrame
df_threads.head()
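
The top-n selection step can be looked at separately from BERT. The sketch below assumes per-token scores have already been computed. It also skips special tokens such as `[CLS]` and `[SEP]`, which the function above does not filter out. The tokens and scores are made up for illustration, and `top_n_tokens` is a hypothetical helper.

```python
def top_n_tokens(tokens, scores, top_n=5, skip=("[CLS]", "[SEP]", "[PAD]")):
    """Return up to top_n tokens with the highest scores, ignoring special tokens."""
    candidates = [(score, tok) for tok, score in zip(tokens, scores) if tok not in skip]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [tok for _, tok in candidates[:top_n]]


tokens = ["[CLS]", "engine", "will", "not", "start", "[SEP]"]
scores = [9.0, 2.5, 0.3, 0.1, 1.7, 8.0]
print(top_n_tokens(tokens, scores, top_n=2))
```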


In [None]:
# Process threads and fetch all post data

# As written, this fetches only the posts on the first page of each thread (20 posts)
# This might need to be updated to iterate through all page values (1 through n)

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

def fetch_and_parse_thread(df):
    post_data = []
    processed_posts = set()
    for index, row in df.iterrows():
        response = requests.get(row['thread_url'])
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article', class_='message--post')  # Correct class name as example
        for article in articles:
            post_id = article.get('id', 'N/A')
            match = re.search(r'\d+', post_id)  # First run of digits in the element id
            numeric_post_id = match.group(0) if match else 'N/A'

            if numeric_post_id not in processed_posts:
                processed_posts.add(numeric_post_id)
                body = article.find('div', class_='bbWrapper')
                content = body.get_text(strip=True) if body else ''  # Guard against posts without a body div
                #timestamp = article.find('time', class_='u-dt').get_text(strip=True) if article.find('time', class_='u-dt') else 'N/A'
                #post_number_element = article.find('ul', class_='message-attribution-opposite').find('li').find_next_sibling('li')
                #post_number = post_number_element.get_text(strip=True) if post_number_element else 'N/A'
                #post_number = post_number.lstrip('#') if post_number != 'N/A' else post_number

                post_data.append({
                    'thread_id': row['thread_id'],  # Corrected to use row's data
                    'post_id': numeric_post_id,
                    #'post_number': post_number,
                    'post_raw': content
                })

    return pd.DataFrame(post_data, columns=['thread_id', 'post_id','post_raw'])

# Fetch thread URLs and titles, and store in a DataFrame
df_posts = fetch_and_parse_thread(df_threads)

# Display the resulting DataFrame
df_posts.head()
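
To go beyond the first page, the fetch loop would need one URL per page. The sketch below builds such a list under a loudly stated assumption: that later pages live at `<thread_url>/page-<n>`. That pattern is a guess and should be checked against the forum's actual pagination links; `page_urls` and the example URL are hypothetical.

```python
def page_urls(thread_url, num_pages):
    """Build one URL per page, assuming pages 2..n live at '<thread_url>/page-<n>'."""
    base = thread_url.rstrip("/")
    return [base + "/" if page == 1 else f"{base}/page-{page}"
            for page in range(1, num_pages + 1)]


print(page_urls("https://forum.example.com/threads/sample.123/", 3))
```

Each returned URL could then be passed to `requests.get` in place of `row['thread_url']`, once the number of pages per thread is known.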

In [None]:
# Concatenate all posts for a given thread into a single text field

# Count the number of unique threads to ensure I don't drop any
print(df_posts['thread_id'].nunique())

aggregated_posts = df_posts.groupby(['thread_id'])['post_raw'].apply(lambda x: ' '.join(x)).reset_index(name='post_concat')

df_threads = df_threads.merge(aggregated_posts, on=['thread_id'], how='left')

# Display the resulting DataFrame
df_threads.head()

In [None]:
# This is a summarization of all posts for a given thread

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Initialize the T5 tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

def summarize_text(df):
    sum_text = []  # Initialize the list to hold summaries
    for text in df['post_concat']:
        # Ensure the text is a string and not empty
        #if not isinstance(text, str) or not text.strip():
        #    sum_text.append("")  # Append an empty string for non-valid entries
        #   continue

        # Prefixing the input text with "summarize: " as T5 expects
        inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", truncation=True, max_length=512)
        summary_ids = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

        # Decode the generated ids to get the summary text
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        sum_text.append(summary)

    return sum_text

df_threads['post_summary'] = summarize_text(df_threads)

# Display the resulting DataFrame
df_threads.head()

In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# Initialize the T5 tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

def summarize_text(texts, batch_size=4):
    sum_texts = []  # Initialize the list to hold summaries
    for i in range(0, len(texts), batch_size):
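        # Prefix each text with "summarize: " as T5 expects
        batch_texts = ["summarize: " + text for text in texts[i:i+batch_size]]
        input_encodings = tokenizer(batch_texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

        # Generating summaries in batches; the attention mask keeps padding tokens from being attended to
        summary_ids = model.generate(input_ids=input_encodings['input_ids'], attention_mask=input_encodings['attention_mask'], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)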

        for summary_id in summary_ids:
            summary = tokenizer.decode(summary_id, skip_special_tokens=True)
            sum_texts.append(summary)

    return sum_texts

# Example usage
df_threads['post_summary'] = summarize_text(df_threads['post_concat'].tolist())
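
The batching pattern in the loop above can be checked on its own before running the model. The following is a small sketch with placeholder strings standing in for the concatenated posts; `iter_batches` is an illustrative helper name.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"post {i}" for i in range(10)]
print([len(batch) for batch in iter_batches(texts, batch_size=4)])
```

The final batch is shorter whenever the number of texts is not a multiple of `batch_size`, which is why padding is applied per batch.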

In [None]:
# Update to the above

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Initialize the T5 tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

def summarize_text(df):
    sum_text = []  # Initialize the list to hold summaries
    for text in df['post_concat']:
        # Ensure the text is a string and not empty
        if not isinstance(text, str) or not text.strip():
            sum_text.append("")  # Append an empty string for non-valid entries
            continue

        # Prefixing the input text with "summarize: " as T5 expects
        inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", truncation=True, max_length=512)
        summary_ids = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

        # Decode the generated ids to get the summary text
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        sum_text.append(summary)

    return sum_text

# Assuming df_threads is your DataFrame and it has a column named 'post_concat'
df_threads['post_summary'] = summarize_text(df_threads)

# Display the resulting DataFrame
print(df_threads.head())


In [None]:
from transformers import BertTokenizer, BertModel

import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to extract keywords using BERT
def bert_extract_keywords(text, tokenizer, model, top_n=5):
    # Tokenize and encode the text
    inputs = tokenizer.encode_plus(text, add_special_tokens=True, return_tensors="pt", truncation=True, max_length=512)
    input_ids = inputs['input_ids'][0]

    # Get the embeddings from the last hidden layer
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.squeeze(0)

    # Compute word importance by summing up the embeddings
    word_importance = torch.sum(embeddings, dim=1)

    # Get the indices of the top n important words
    top_n_indices = word_importance.argsort(descending=True)[:top_n]

    # Filter out indices that are out of range of input_ids
    top_n_indices = [idx for idx in top_n_indices if idx < len(input_ids)]

    # Decode the top n words
    keywords = [tokenizer.decode([input_ids[idx]]) for idx in top_n_indices]

    return keywords

df_threads['post_keywords'] = df_threads['post_summary'].apply(lambda x: bert_extract_keywords(x, tokenizer, model))

# Display the resulting DataFrame
df_threads.head()

In [None]:
# Export and save result
df_threads.to_csv('/content/drive/MyDrive/e9/nlp/df_threads.csv', index=False)