# **Pretrained Model**

This project demonstrates an end-to-end workflow built on a pretrained machine learning model. It covers data preprocessing, model loading, fine-tuning, prediction, and evaluation. The purpose is to illustrate how pretrained models can be integrated into workflows for practical applications.

A **pretrained model** is a machine learning (ML) model that has already been trained on a large dataset and can be fine-tuned for a specific task. Pretrained models are often used as a starting point for developing ML models because they provide:
* A set of initial weights and biases
* The ability to achieve better results with less data and computational resources

## **Install (Environment Setup)**



This section installs the libraries the notebook depends on. **Pretrained models** often require specific dependencies, and installing them up front helps avoid compatibility problems and runtime errors.

The cell below installs these libraries:
- **`transformers`**: Provides pretrained models for NLP tasks.  
- **`datasets`**: Simplifies access to datasets for machine learning.  
- **`torch`**: The deep learning framework used to run models.  
- **`evaluate`**: Helps compute performance metrics for trained models.

In [None]:
!pip install transformers datasets torch evaluate
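
Before moving on, it can help to confirm that the packages actually resolved in the current environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is introduced here for illustration and is not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The four libraries installed above
for name, version in installed_versions(
    ["transformers", "datasets", "torch", "evaluate"]
).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any entry reported as `NOT INSTALLED` points to an install step that failed or ran in a different environment.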



## **Imported Libraries**


This cell imports the tools the notebook needs. Each library and module plays the following role:

- **`pandas`**: Used for data manipulation and analysis, such as cleaning and organizing datasets.
- **`torch`**: Provides deep learning functionality and GPU acceleration, essential for running transformer models.
- **`transformers`**:
  - **`T5Tokenizer`**: Converts text into numerical tokens for the model.
  - **`T5ForConditionalGeneration`**: Loads the pretrained T5 model for text-to-text tasks like summarization.
  - **`Trainer` and `TrainingArguments`**: Simplify the training and fine-tuning process for transformer models.
- **`datasets`**:
  - **`Dataset`**: Wraps structured data (here, pandas DataFrames) in a format compatible with the Hugging Face tooling.


In [None]:
import pandas as pd
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from datasets import Dataset

## **Load Data**

In [None]:
import pandas as pd

# Load the provided dataset
file_path = "amazon_reviews.csv"
df = pd.read_csv(file_path)

# Show the first rows and the column names to understand the dataset's structure
df.head(), df.columns

(                                              review  label  \
 0  Really disappointed with this Samsung Galaxy S...      0   
 1  After a week of using this Sony WH-1000XM4, I ...      1   
 2  Had this Dell XPS 13 for a month and already h...      0   
 3  Do not waste your money on this AirPods Pro. C...      0   
 4  Just received my Levi's 501 and I'm absolutely...      1   
 
                                             reversed  new_label  
 0  After 2 weeks of using this Dyson V11, I can c...          1  
 1  Really disappointed with this Nike Air Max. Po...          0  
 2  Finally found the perfect Levi's 501! Amazing ...          1  
 3  Just received my Ray-Ban Wayfarer and I'm abso...          1  
 4  Had this Levi's 501 for a month and already ha...          0  ,
 Index(['review', 'label', 'reversed', 'new_label'], dtype='object'))

**Column Descriptions:**

- **`review`**: Likely contains the text of product reviews.
- **`label`**: Represents the sentiment or classification of the original review, possibly `0` for negative and `1` for positive.
- **`reversed`**: Contains a review with the opposite sentiment, used as the target text. In the sample rows it mentions a different product than `review`, so it reads as a paired opposite-sentiment review rather than a strict rewrite.
- **`new_label`**: Indicates the sentiment or classification of the `reversed` review; in the sample rows it is the opposite of `label`.
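
One quick way to check the relationship between `label` and `new_label` is to measure how often the two are opposites. The sketch below assumes both columns are binary (`0`/`1`), as the sample rows suggest; the helper name `label_flip_rate` and the toy rows are illustrative only.

```python
import pandas as pd


def label_flip_rate(frame: pd.DataFrame) -> float:
    """Fraction of rows where new_label is the binary opposite of label."""
    flipped = frame["new_label"] == 1 - frame["label"]
    return float(flipped.mean())


# Toy rows mirroring the label pattern of the sample output above (text omitted)
toy = pd.DataFrame({"label": [0, 1, 0, 0, 1], "new_label": [1, 0, 1, 1, 0]})
print(label_flip_rate(toy))
```

On the real data, a value of `label_flip_rate(df)` close to 1.0 would support reading every pair as sentiment-inverted.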


## **Data Preparation**


### Objective:
Prepare the dataset for fine-tuning the T5 model. T5 is trained on input-output text pairs: here the input is the original review (`review`) and the output is its opposite-sentiment counterpart (`reversed`).

### Key Steps:
1. **Selecting Relevant Columns**:
   - Only the `review` and `reversed` columns are used since these are the input and output pairs required for the model.

2. **Removing Duplicates**:
   - Ensures that only unique input-output pairs are included in the dataset.

3. **Splitting the Dataset**:
   - The dataset is divided into training (80%) and testing (20%) sets to train and evaluate the model.

4. **Conversion to Dataset Format**:
   - The training and testing data are converted to a Hugging Face `Dataset` format for compatibility with the T5 model.

In [None]:
from sklearn.model_selection import train_test_split

# Prepare the data for fine-tuning T5
# Only keep the relevant columns: 'review' (input) and 'reversed' (output)
data = df[['review', 'reversed']].drop_duplicates()

# Split the dataset into training and testing sets (80% train, 20% test)
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)



train_data, test_data



(                                                 review  \
 1730  After 2 weeks of using this Project Hail Mary,...   
 3277  After 2 weeks of using this Timberland boots, ...   
 4912  Really disappointed with this Timberland boots...   
 2419  Really disappointed with this Ninja Air Fryer....   
 1173  Had this Dyson V11 for a month and already hav...   
 ...                                                 ...   
 3782  Had this Timberland boots for a month and alre...   
 5211  Really disappointed with this Ninja Air Fryer....   
 5246  Finally found the perfect Nike Air Max! Amazin...   
 5410  After a week of using this Ray-Ban Wayfarer, I...   
 860   Really disappointed with this Nike Air Max. No...   
 
                                                reversed  
 1730  Really disappointed with this Sony WH-1000XM4....  
 3277  Do not waste your money on this Ray-Ban Wayfar...  
 4912  After 2 weeks of using this AirPods Pro, I can...  
 2419  After a month of using this Levi's 

In [None]:
# Convert the training and testing sets to the Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_data)

test_dataset = Dataset.from_pandas(test_data)


## **Modeling**

This section sets up the T5 model for conditional text generation, specifically sentiment inversion, and trains it using the Hugging Face `Trainer` class.

The goal is to load a pretrained T5 model and tokenizer, preprocess the dataset for training, and fine-tune the model to invert the sentiment of a text.

**Key Steps:**

1. **Load Pre-trained Tokenizer and Model**:
   - The T5 model and tokenizer (`t5-base`) are loaded. These components are pre-trained and require fine-tuning for the specific task.

2. **Data Preprocessing**:
   - A custom function is defined to tokenize the data, convert input-output pairs into tensors, and format them for training. Padding token IDs in the labels are replaced with `-100` to exclude them from loss calculations.

3. **Dataset Preparation**:
   - The processed datasets are wrapped in a custom PyTorch `Dataset` class to facilitate batch processing during training.

4. **Define Training Arguments**:
   - Hyperparameters for training, such as learning rate, batch size, and number of epochs, are specified using Hugging Face's `TrainingArguments`.

5. **Training the Model**:
   - The Hugging Face `Trainer` is used to manage the training process, including logging and saving model checkpoints.

6. **Model Fine-tuning**:
   - The T5 model is fine-tuned on the training dataset, and its performance is evaluated on the test dataset after each epoch.
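
The `-100` replacement in step 2 can be illustrated without any model. The sketch below uses plain Python lists in place of tensors, with made-up token IDs and made-up per-token losses. It assumes a padding ID of `0` (the pad token ID of T5's tokenizer) and shows that positions marked `-100` drop out of the averaged loss, mirroring the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value skipped by the loss
PAD_TOKEN_ID = 0     # assumed padding ID (T5's tokenizer uses 0)


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding IDs in each label row with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in row]
        for row in label_ids
    ]


def mean_loss(per_token_losses, labels):
    """Average per-token losses over positions whose label is not ignored."""
    kept = [
        loss
        for loss_row, label_row in zip(per_token_losses, labels)
        for loss, label in zip(loss_row, label_row)
        if label != IGNORE_INDEX
    ]
    return sum(kept) / len(kept)


# Made-up example: two padded label rows and made-up per-token losses
labels = mask_padding([[71, 22, 1, 0, 0], [9, 1, 0, 0, 0]])
losses = [[0.5, 0.3, 0.1, 9.0, 9.0], [0.2, 0.4, 9.0, 9.0, 9.0]]
print(labels)
print(mean_loss(losses, labels))
```

The large made-up losses at the padded positions have no effect on the result, which is the reason for the masking: without it, the model would also be scored on predicting padding.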


In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
import torch

# Load the T5 tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Tokenize the data for training
def preprocess_data(data, tokenizer, max_length=512):
    inputs = ["invert sentiment: " + text for text in data["review"]]
    targets = data["reversed"].tolist()

    model_inputs = tokenizer(inputs, max_length=max_length, truncation=True, padding="max_length", return_tensors="pt")
    labels = tokenizer(targets, max_length=max_length, truncation=True, padding="max_length", return_tensors="pt").input_ids

    # Replace padding token id's of the labels by -100 to ignore them during loss calculation
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs

# Preprocess the train and test data
train_dataset = preprocess_data(train_data, tokenizer)
test_dataset = preprocess_data(test_data, tokenizer)

# Prepare the PyTorch Dataset class
class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.encodings.items()}

train_dataset = SentimentDataset(train_dataset)
test_dataset = SentimentDataset(test_dataset)

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",           # output directory
    evaluation_strategy="epoch",     # evaluate each epoch
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=1,
    logging_dir="./logs",            # directory for storing logs
    logging_steps=50,
    save_steps=500,
    save_strategy="epoch",
    fp16=torch.cuda.is_available(),  # Enable mixed precision if using GPU
)

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
)

# Start training the model
trainer.train()




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.4591        | 0.337642        |
| 2     | 0.4408        | 0.336891        |
| 3     | 0.4408        | 0.336874        |


TrainOutput(global_step=1794, training_loss=0.5567882510199063, metrics={'train_runtime': 409.6742, 'train_samples_per_second': 35.011, 'train_steps_per_second': 4.379, 'total_flos': 8734283024302080.0, 'train_loss': 0.5567882510199063, 'epoch': 3.0})

### Observations:
- **Evaluation Metrics**:
  Training and validation loss are reported after each epoch, helping monitor overfitting or underfitting.
- **Performance Optimization**:
  Mixed precision (`fp16`) is enabled if a GPU is available to speed up training.
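
The step count in the training output can be cross-checked against the batch size. The sketch below assumes a single device and no gradient accumulation (neither is configured in the arguments above, but the hardware is not stated), and it infers the approximate training-set size from the reported throughput rather than from the data itself.

```python
import math

# Values copied from the TrainOutput above
GLOBAL_STEP = 1794
EPOCHS = 3
BATCH_SIZE = 8
TRAIN_RUNTIME = 409.6742      # seconds
SAMPLES_PER_SECOND = 35.011


def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


# Throughput times runtime gives the samples seen over all epochs
estimated_examples = round(SAMPLES_PER_SECOND * TRAIN_RUNTIME / EPOCHS)
print(estimated_examples)
print(steps_per_epoch(estimated_examples, BATCH_SIZE) * EPOCHS == GLOBAL_STEP)
```

Under these assumptions the estimate is about 4,781 training examples, which gives 598 steps per epoch and matches the reported 1,794 steps over three epochs.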

## **Perplexity**

This section evaluates the trained model by calculating the average loss and perplexity over the test dataset. It also includes functions to generate text, invert sentiment, and test the model on specific use cases.


## **Average Loss Calculation and Perplexity**

The function `calculate_avg_loss` evaluates the model on a dataset and computes the average loss.
The `calculate_perplexity` function computes the perplexity, a metric used to measure the quality of a language model's predictions.


**Process**
- **Model Evaluation Mode:** The model is set to evaluation mode using `model.eval()`, which disables training-only behavior such as dropout. Gradient computation is turned off separately with `torch.no_grad()`.
- **Batch Processing:** The dataset is divided into batches using a `DataLoader`.
- **Loss Calculation:** For each batch, the loss is computed, and the total loss is accumulated.
- **Average Loss:** The average loss is obtained by dividing the total loss by the number of batches.

**Input Parameters:**
- `model`: The trained model to evaluate.
- `dataset`: The test dataset for evaluation.
- `batch_size`: The size of each batch for the `DataLoader`.
- `max_length`: The maximum sequence length. The function accepts it but does not use it, because the dataset is already tokenized.
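
Perplexity here is the exponential of the average cross-entropy loss, so a loss of 0 corresponds to a perplexity of 1. The sketch below shows that computation on made-up per-batch losses; the helper name `perplexity_from_batch_losses` is illustrative and, like the process described above, it weights every batch equally.

```python
import math


def perplexity_from_batch_losses(batch_losses):
    """Exponential of the unweighted mean of per-batch losses."""
    avg_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(avg_loss)


# Made-up per-batch losses, for illustration only
print(perplexity_from_batch_losses([0.30, 0.35, 0.40]))

# A loss of 0 means every target token was predicted with certainty
print(perplexity_from_batch_losses([0.0]))
```

Because batches are weighted equally, a smaller final batch counts as much as a full one; weighting by the number of target tokens would be a stricter alternative.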



In [None]:
import math
from torch.utils.data import DataLoader

# Function to calculate the average loss on a dataset
def calculate_avg_loss(model, dataset, batch_size=8, max_length=512):
    model.eval()  # Set the model to evaluation mode
    dataloader = DataLoader(dataset, batch_size=batch_size)
    total_loss = 0
    total_batches = 0

    with torch.no_grad():
        for batch in dataloader:
            # Send the data to the appropriate device (GPU if available)
            batch = {key: val.to(model.device) for key, val in batch.items()}

            # Calculate the loss
            outputs = model(**batch)
            loss = outputs.loss
            total_loss += loss.item()
            total_batches += 1

    # Calculate the average loss
    avg_loss = total_loss / total_batches
    return avg_loss

# Function to calculate perplexity
def calculate_perplexity(avg_loss):
    return math.exp(avg_loss)

# Calculate the average loss on the test dataset
avg_loss = calculate_avg_loss(model, test_dataset)

# Calculate the perplexity
perplexity = calculate_perplexity(avg_loss)

print(f"Average loss on the test set: {avg_loss}")
print(f"Perplexity: {perplexity}")


Average loss on the test set: 0.3370023409525553
Perplexity: 1.400742342806064


## **Text Generation with Sentiment Inversion**

The `generate_text` function generates text by inverting sentiment using a trained T5 model.

**Process:**
- **Input Preprocessing:** Adds a prefix (`invert sentiment:`) to the input text.
- **Tokenization:** Converts the input text to token IDs suitable for the model.
- **Model Inference:** Generates text based on the input.
- **Decoding:** Converts generated token IDs back to text.

**Input Parameters:**
- `model`: The trained T5 model.
- `tokenizer`: The tokenizer compatible with the T5 model.
- `text`: Input text whose sentiment should be inverted.
- `max_length`: Maximum length of generated text.

In [None]:
def generate_text(model, tokenizer, text, max_length=512):
    model.eval()
    input_text = "invert sentiment: " + text
    input_ids = tokenizer(input_text, return_tensors="pt", max_length=max_length, truncation=True).input_ids

    # Move the inputs to the same device as the model
    input_ids = input_ids.to(model.device)

    # Generate text
    outputs = model.generate(input_ids, max_length=max_length, num_beams=5, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
example_text = "This phone is amazing! The battery lasts all day and the camera is outstanding."
generated_text = generate_text(model, tokenizer, example_text)
print("Original:", example_text)
print("Generated:", generated_text)


Original: This phone is amazing! The battery lasts all day and the camera is outstanding.
Generated: Really disappointed with this Sony WH-1000XM4. Poor quality. Customer service was unhelpful.


## **Example Generation from Test Data**

The `generate_examples_from_test_data` function generates sentiment-inverted text for a random sample of reviews from the test dataset.

**Key Steps:**
- **Random Sampling:** Randomly selects reviews from the test dataset.
- **Sentiment Inversion:** Generates inverted reviews using the model.
- **Comparison:** Stores the original review, the reference inverted review from the dataset, and the model-generated inverted review for analysis.

**Input Parameters:**
- `model`: The trained T5 model.
- `tokenizer`: Tokenizer for input and output processing.
- `dataset`: Test dataset containing reviews.
- `num_examples`: Number of examples to generate.
- `max_length`: Maximum length for the generated text.

**Output:**
- Returns a list of dictionaries containing:
  - Original review
  - Reference inverted review (from the `reversed` column)
  - Model-generated inverted review


In [None]:
from random import sample

def generate_examples_from_test_data(model, tokenizer, dataset, num_examples=5, max_length=512):
    model.eval()
    examples = []

    # Randomly select dataframe indexes
    random_indices = sample(range(len(dataset)), num_examples)

    for idx in random_indices:
        # Obtain the original review and its inverted equivalent
        original_review = dataset.iloc[idx]["review"]
        original_inverted = dataset.iloc[idx]["reversed"]

        # Generate the inverted review using the model
        input_text = "invert sentiment: " + original_review
        input_ids = tokenizer(input_text, return_tensors="pt", max_length=max_length, truncation=True).input_ids

        # Move the inputs to the same device as the model
        input_ids = input_ids.to(model.device)

        # Generate text
        outputs = model.generate(input_ids, max_length=max_length, num_beams=5, early_stopping=True)
        generated_inverted = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Storing the results
        examples.append({
            "Original Review": original_review,
            "Original Inverted": original_inverted,
            "Generated Inverted": generated_inverted
        })

    return examples

num_examples = 5
generated_examples = generate_examples_from_test_data(model, tokenizer, test_data, num_examples=num_examples)


for i, example in enumerate(generated_examples):
    print(f"Example {i+1}:")
    print(f"Original Review: {example['Original Review']}")
    print(f"Original Inverted: {example['Original Inverted']}")
    print(f"Generated Inverted: {example['Generated Inverted']}")
    print("-" * 50)


Example 1:
Original Review: Just received my iRobot Roomba and I'm absolutely loving it! Everything works perfectly. Shipping was fast too!
Original Inverted: Do not waste your money on this Dell XPS 13. Cheaply made. Definitely not as described.
Generated Inverted: Do not waste your money on this Nike Air Max. Overpriced for what you get. Return process is a nightmare.
--------------------------------------------------
Example 2:
Original Review: After 2 weeks of using this Atomic Habits, I can confidently say it's worth every penny. Performance is stellar. Exactly what I needed.
Original Inverted: Do not waste your money on this iPad Pro. Overpriced for what you get. Return process is a nightmare.
Generated Inverted: Do not waste your money on this Atomic Habits. Overpriced for what you get. Return process is a nightmare.
--------------------------------------------------
Example 3:
Original Review: After a week of using this Ray-Ban Wayfarer, I can confidently say it's worth every p
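
The printed comparison can also be summarized with a number. The sketch below computes a simple token-overlap F1 between a reference and a generated review. This metric and the helper name `token_f1` are illustrative choices, not something the notebook itself uses, and whitespace tokenization is a deliberate simplification.

```python
from collections import Counter


def token_f1(reference: str, generated: str) -> float:
    """Token-overlap F1 between two texts using lowercase whitespace tokens."""
    ref_tokens = reference.lower().split()
    gen_tokens = generated.lower().split()
    if not ref_tokens or not gen_tokens:
        return 0.0
    # Multiset intersection: a shared token counts at most as often as it
    # appears in both texts
    overlap = sum((Counter(ref_tokens) & Counter(gen_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Strings taken from Example 2 above
reference = "Do not waste your money on this iPad Pro. Overpriced for what you get. Return process is a nightmare."
generated = "Do not waste your money on this Atomic Habits. Overpriced for what you get. Return process is a nightmare."
print(round(token_f1(reference, generated), 3))
```

Since the reference reviews in this dataset often mention a different product than the input, a high overlap with the reference mostly indicates that the model reproduces the dataset's review templates, not necessarily that it inverts the input faithfully.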

## **Model Saving and Reloading**

The trained model and tokenizer are saved using `model.save_pretrained` and `tokenizer.save_pretrained`, allowing for reuse without retraining.

**Files:**
- Model configuration and weights.
- Tokenizer configuration and special tokens.

**Reloading:**
- Models and tokenizers can be reloaded for inference or further training.


In [None]:

model.save_pretrained("./api_model")
tokenizer.save_pretrained("./api_model")



('./api_model/tokenizer_config.json',
 './api_model/special_tokens_map.json',
 './api_model/spiece.model',
 './api_model/added_tokens.json')

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration


model = T5ForConditionalGeneration.from_pretrained("./api_model")
tokenizer = T5Tokenizer.from_pretrained("./api_model")


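def test_model(review):
    # Build the prompt with the same prefix used during fine-tuning
    input_text = f"invert sentiment: {review}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding="max_length", max_length=256)
    # No length argument is passed, so generate() falls back to its default
    # limit, which is why the result below is cut off
    outputs = model.generate(inputs.input_ids)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text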


original_review = "This product is amazing! The battery lasts all day."
generated_review = test_model(original_review)

print("Original Review:", original_review)
print("Generated Review:", generated_review)




Original Review: This product is amazing! The battery lasts all day.
Generated Review: Do not waste your money on this Atomic Habits. Overpriced for what you


## **Polarity Classification and Inversion**

The `determine_polarity` function classifies text as positive, negative, or neutral by counting keyword matches. The `invert_polarity` function returns the opposite label (neutral stays neutral).

**Process:**
1. Count positive and negative keywords in the text.
2. Classify polarity based on counts.
3. Invert the original polarity label to obtain the polarity expected of the generated text; the generated text itself is not classified.

**Output:**
- Dictionary containing:
  - Original text
  - Original polarity
  - Generated text
  - Generated polarity (the inverted original polarity)

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load model and tokenizer from api_model folder
model = T5ForConditionalGeneration.from_pretrained("./api_model")
tokenizer = T5Tokenizer.from_pretrained("./api_model")

# Function to determine polarity using keywords
def determine_polarity(text):
    positive_keywords = ["amazing", "outstanding", "great", "excellent", "love", "fantastic", "good", "positive", "happy", "enjoy"]
    negative_keywords = ["disappointed", "poor", "bad", "terrible", "hate", "awful", "negative", "unhappy", "sad", "angry"]

    # Counting positive and negative keywords in the text
    positive_count = sum(word in text.lower() for word in positive_keywords)
    negative_count = sum(word in text.lower() for word in negative_keywords)

    # Classify by comparing the keyword counts
    if positive_count > negative_count:
        return "positive"
    elif negative_count > positive_count:
        return "negative"
    else:
        return "neutral"

# Polarity reversal function
def invert_polarity(original_polarity):
    if original_polarity == "positive":
        return "negative"
    elif original_polarity == "negative":
        return "positive"
    else:
        return "neutral"

# Function for testing the model and classifying polarities
def generate_text_with_polarity(review):
    input_text = f"invert sentiment: {review}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding="max_length", max_length=256)
    outputs = model.generate(inputs.input_ids)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Determine polarities
    original_polarity = determine_polarity(review)
    generated_polarity = invert_polarity(original_polarity)

    return {
        "original_text": review,
        "original_polarity": original_polarity,
        "generated_text": generated_text,
        "generated_polarity": generated_polarity
    }

# Test with examples
review = "This product is terrible! The battery lasts all day."
result = generate_text_with_polarity(review)

# Show results
print("Test results:")
print(result)


Test results:
{'original_text': 'This product is terrible! The battery lasts all day.', 'original_polarity': 'negative', 'generated_text': "After a month of using this iPad Pro, I can confidently say it's worth", 'generated_polarity': 'positive'}
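
The keyword check above uses substring matching (`word in text.lower()`), so a keyword such as `bad` also matches inside `badge`. The sketch below contrasts that behavior with whole-word matching. The helper names and the example sentence are made up for illustration, and the regular expression is a simple assumption about what counts as a word.

```python
import re


def count_keywords(text, keywords):
    """Count keywords that appear in the text as whole words."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sum(keyword in words for keyword in keywords)


def count_keywords_substring(text, keywords):
    """Substring matching, as in determine_polarity above."""
    return sum(keyword in text.lower() for keyword in keywords)


sentence = "The badge is sadly scratched."  # made-up example
negative_keywords = ["bad", "sad"]
print(count_keywords_substring(sentence, negative_keywords))
print(count_keywords(sentence, negative_keywords))
```

Substring matching reports two negative keywords for this sentence while whole-word matching reports none, which can change the polarity label for borderline texts.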


## **API**

This section outlines the setup and deployment of an API using **FastAPI**, a modern Python web framework for building APIs based on standard type hints. The API serves the model's generated responses and is made accessible through a public URL using **ngrok**.



Ensure the following packages are installed:
- **fastapi**: A web framework for building APIs.
- **uvicorn**: An ASGI server implementation for FastAPI applications.
- **transformers**: A library by Hugging Face for natural language processing tasks.
- **pyngrok**: A Python wrapper for ngrok, imported later to expose the server.
- **nest_asyncio**: Lets the server's event loop run inside the notebook.


**Process:**

1. Install Required Libraries

Run the following command to ensure all necessary dependencies are installed:

In [None]:
! pip install fastapi uvicorn transformers pyngrok nest_asyncio





2. Import Libraries

Import the necessary modules:

* pyngrok: To expose the local server to the internet.
* nest_asyncio: To handle asynchronous tasks within Jupyter notebooks.
* uvicorn: To serve the FastAPI application.

3. Initialize Asynchronous Event Loop

`nest_asyncio` is applied so that the server can run inside the notebook's already-running event loop.

```python
nest_asyncio.apply()
```

4. Expose the API Using ngrok

Establish a public URL for the local server using ngrok.

5. Start the API Server

Run the server with Uvicorn, specifying the module and application instance (`main:app`), the host, and the port. The `main:app` target means a `main` module (for example `main.py`) exposing a FastAPI instance named `app` must be importable; that file is not shown in this notebook.

In [None]:
from pyngrok import ngrok
import nest_asyncio
import uvicorn


nest_asyncio.apply()


public_url = ngrok.connect(8000)
print(f"Public server available at: {public_url}")


uvicorn.run("main:app", host="0.0.0.0", port=8000)



Public server available at: NgrokTunnel: "https://38ce-35-240-141-232.ngrok-free.app" -> "http://localhost:8000"


INFO:     Started server process [1142]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)


INFO:     2806:268:4403:1eb:cb7d:8453:4fe9:a51c:0 - "GET / HTTP/1.1" 404 Not Found
INFO:     2806:268:4403:1eb:cb7d:8453:4fe9:a51c:0 - "GET /docs HTTP/1.1" 200 OK
INFO:     2806:268:4403:1eb:cb7d:8453:4fe9:a51c:0 - "GET /openapi.json HTTP/1.1" 200 OK
INFO:     2806:268:4403:1eb:c070:1ac5:ee0a:29e6:0 - "GET /docs HTTP/1.1" 200 OK
INFO:     2806:268:4403:1eb:c070:1ac5:ee0a:29e6:0 - "GET /docs HTTP/1.1" 200 OK
INFO:     2806:268:4403:1eb:c070:1ac5:ee0a:29e6:0 - "GET /openapi.json HTTP/1.1" 200 OK
INFO:     2806:2f0:8040:fd32:154e:6d11:d6d4:541b:0 - "GET /docs HTTP/1.1" 200 OK
INFO:     2806:2f0:8040:fd32:154e:6d11:d6d4:541b:0 - "GET /openapi.json HTTP/1.1" 200 OK
INFO:     2806:268:4403:1eb:c070:1ac5:ee0a:29e6:0 - "POST /generate HTTP/1.1" 200 OK
INFO:     2806:2f0:8040:fd32:154e:6d11:d6d4:541b:0 - "POST /generate HTTP/1.1" 422 Unprocessable Entity
INFO:     2806:2f0:8040:fd32:154e:6d11:d6d4:541b:0 - "POST /generate HTTP/1.1" 422 Unprocessable Entity
INFO:     2806:2f0:8040:fd32:154e:6d11

INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1142]
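
The `422 Unprocessable Entity` entries in the log are the status FastAPI returns when a request body fails validation, so the shape of the JSON payload matters. The sketch below builds a well-formed `POST /generate` request with the standard library without sending it. The `review` field name is an assumption about the API's schema, which is not shown in this notebook, and `build_generate_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_generate_request(base_url: str, review: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /generate route."""
    payload = json.dumps({"review": review}).encode("utf-8")  # assumed schema
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("http://localhost:8000", "This product is amazing!")
print(request.get_method(), request.full_url)

# To actually send it while the server is running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Sending a body with a missing or misnamed field, or without the JSON content type, is a typical way to trigger the 422 response seen above.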
