<a href="https://colab.research.google.com/github/andreevromano/HSE_LSML2/blob/main/FP_LSML2_AndreevRA_MDS23.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install kaggle



In [2]:
import os
import pandas as pd
import zipfile

# Download the dataset using Kaggle CLI
# Make sure you have placed your kaggle.json in ~/.kaggle directory
# and have run: chmod 600 ~/.kaggle/kaggle.json

# Dataset used here: "IMDB Dataset of 50K Movie Reviews"
# Link: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
# The leading "!" runs the Kaggle CLI from the notebook shell:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews -p ./data


Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
Downloading imdb-dataset-of-50k-movie-reviews.zip to ./data
 66% 17.0M/25.7M [00:00<00:00, 86.6MB/s]
100% 25.7M/25.7M [00:00<00:00, 101MB/s] 


In [3]:
# Check if dataset is already downloaded, if not, download it
dataset_path = "./data/imdb-dataset-of-50k-movie-reviews.zip"
if not os.path.exists(dataset_path):
    # Using kaggle CLI to download
    os.system("kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews -p ./data")

# Unzip the dataset
with zipfile.ZipFile(dataset_path, 'r') as zip_ref:
    zip_ref.extractall("./data")



In [4]:
# The dataset file is often named "IMDB Dataset.csv"
data_file = "./data/IMDB Dataset.csv"

# Load the dataset into a pandas DataFrame
df = pd.read_csv(data_file)

# Print the first few rows to verify that the data has been loaded correctly
df.head(10)

# The dataset typically has two columns: "review" and "sentiment"
# The "sentiment" column can be "positive" or "negative".
# Next steps: preprocessing text, splitting data, etc.

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


In [5]:
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [7]:
df['sentiment'].value_counts()

Unnamed: 0_level_0,count
sentiment,Unnamed: 1_level_1
positive,25000
negative,25000


In [8]:
!pip install textblob



In [9]:
import os
import re
import nltk
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
from tqdm.auto import tqdm
import pandas as pd

# Download stopwords if not already downloaded
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()

    # Lowercase
    text = text.lower()

    # Keep only letters, digits, and basic punctuation
    text = re.sub(r"[^a-z0-9\s,.!?]", "", text)

    # Remove stopwords
    words = text.split()
    words = [w for w in words if w not in stop_words]
    text = " ".join(words)

    return text


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
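
To see what the character filter in `preprocess_text` does on its own, here is a minimal sketch that applies only the lowercasing and regex steps to a made-up sentence. It deliberately leaves out the BeautifulSoup and NLTK steps so that it runs with the standard library alone; the helper name `filter_chars` and the sample string are illustrative and not part of the notebook.

```python
import re

def filter_chars(text: str) -> str:
    """Lowercase, then delete every character that is not a-z, 0-9, whitespace or , . ! ?"""
    text = text.lower()
    return re.sub(r"[^a-z0-9\s,.!?]", "", text)

sample = "There's my all-time favorite: \"Oz\" & more!"
filtered = filter_chars(sample)

# Collapse repeated spaces the same way the split/join in preprocess_text does
normalized = " ".join(filtered.split())
print(normalized)
```

Because the filtered characters are deleted rather than replaced by a space, "there's" becomes "theres" and "all-time" becomes "alltime", which is the pattern visible in the `cleaned_review` column.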


In [10]:
!pip install tqdm



In [11]:
# df is assumed to be loaded by the earlier cells.
# Enable tqdm's pandas integration (this adds progress_apply)
tqdm.pandas()

# Apply preprocessing with progress visualization
df['cleaned_review'] = df['review'].progress_apply(preprocess_text)



In [12]:
# Check a few samples
df[['review', 'cleaned_review', 'sentiment']].head(10)


Unnamed: 0,review,cleaned_review,sentiment
0,One of the other reviewers has mentioned that ...,one reviewers mentioned watching 1 oz episode ...,positive
1,A wonderful little production. <br /><br />The...,wonderful little production. filming technique...,positive
2,I thought this was a wonderful way to spend ti...,thought wonderful way spend time hot summer we...,positive
3,Basically there's a family where a little boy ...,basically theres family little boy jake thinks...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",petter matteis love time money visually stunni...,positive
5,"Probably my all-time favorite movie, a story o...","probably alltime favorite movie, story selfles...",positive
6,I sure would like to see a resurrection of a u...,sure would like see resurrection dated seahunt...,positive
7,"This show was an amazing, fresh & innovative i...","show amazing, fresh innovative idea 70s first ...",negative
8,Encouraged by the positive comments about this...,encouraged positive comments film looking forw...,negative
9,If you like original gut wrenching laughter yo...,like original gut wrenching laughter like movi...,positive


In [13]:
from sklearn.model_selection import train_test_split

# Split into train/test first
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['sentiment'])

# Split the train set into train/validation
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=42, stratify=train_df['sentiment'])

print("Train size:", len(train_df))
print("Validation size:", len(val_df))
print("Test size:", len(test_df))


Train size: 36000
Validation size: 4000
Test size: 10000
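
The two-stage split means the validation set is 10% of the remaining training data, not 10% of the full dataset. A small arithmetic sketch (plain Python, no scikit-learn; percentages are written as integers to avoid floating-point rounding) reproduces the sizes printed above:

```python
total = 50_000               # rows in the IMDB DataFrame
test_pct, val_pct = 20, 10   # test_size=0.2, then test_size=0.1 of the remainder

n_test = total * test_pct // 100
n_trainval = total - n_test
n_val = n_trainval * val_pct // 100
n_train = n_trainval - n_val

print(n_train, n_val, n_test)                            # 36000 4000 10000
print(n_train / total, n_val / total, n_test / total)    # 0.72 0.08 0.2
```

So the effective proportions are 72% train, 8% validation, and 20% test.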


In the next cell, we load the BERT tokenizer from HuggingFace and define a helper function `encode_data` that tokenizes the cleaned text and maps the labels (`positive`/`negative`) to numeric values (1/0). We then apply it to the train, validation, and test sets.

In [14]:
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

def encode_data(texts, labels, max_length=128):
    # Tokenize the texts
    encoding = tokenizer(
        texts.tolist(),
        truncation=True,
        padding=True,
        max_length=max_length
    )
    # Map labels to numeric format
    label_map = {'positive': 1, 'negative': 0}
    encoded_labels = [label_map[l] for l in labels]
    return encoding, encoded_labels

train_encoding, train_labels = encode_data(train_df['cleaned_review'], train_df['sentiment'])
val_encoding, val_labels = encode_data(val_df['cleaned_review'], val_df['sentiment'])
test_encoding, test_labels = encode_data(test_df['cleaned_review'], test_df['sentiment'])


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
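
As a rough illustration of what `truncation=True`, `padding=True`, and `max_length` do, the sketch below truncates and pads lists of token ids in plain Python. It does not call the HuggingFace tokenizer: `pad_and_truncate` is a hypothetical helper and the ids are made up, apart from 101, 102, and 0, which are the `[CLS]`, `[SEP]`, and `[PAD]` ids in `bert-base-uncased`. Unlike this sketch, the real tokenizer keeps the closing `[SEP]` token when it truncates.

```python
def pad_and_truncate(batch, max_length=128, pad_id=0):
    """Cut each id list to max_length, then pad to the longest remaining list."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]]
enc = pad_and_truncate(batch, max_length=6)
print(enc["input_ids"])       # [[101, 7, 8, 102, 0, 0], [101, 7, 8, 9, 10, 11]]
print(enc["attention_mask"])  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```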



**Description:**

Now that the tokenizer and encodings are ready, we will create a PyTorch `Dataset` class to hold our tokenized data and corresponding labels. This will allow us to easily iterate over batches during training and evaluation.


In [15]:
import torch
from torch.utils.data import Dataset, DataLoader

class ReviewsDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # For BERT, inputs are typically input_ids, attention_mask, (optionally token_type_ids)
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create datasets
train_dataset = ReviewsDataset(train_encoding, train_labels)
val_dataset = ReviewsDataset(val_encoding, val_labels)
test_dataset = ReviewsDataset(test_encoding, test_labels)

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)
test_loader = DataLoader(test_dataset, batch_size=16)


**Description:**

In this cell, we integrate LoRA into our BERT model using the PEFT library. We will:

- Install `peft` if it's not already installed.
- Import `LoraConfig` and `get_peft_model`.
- Wrap our `BertForSequenceClassification` model with LoRA layers.
- Fine-tune this LoRA-enabled model using the HuggingFace `Trainer`.


In [19]:
!pip install peft



In [20]:
from peft import LoraConfig, get_peft_model
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Load the base model
base_model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Configure LoRA
lora_config = LoraConfig(
    r=8,                         # Rank of the LoRA update matrices
    lora_alpha=16,               # Scaling factor
    lora_dropout=0.1,            # LoRA dropout
    bias="none",                 # How biases are handled: "none", "lora_only", or "all"
    task_type="SEQ_CLS"          # Task type: Sequence classification
)

# Wrap the base model with LoRA
lora_model = get_peft_model(base_model, lora_config)

# Verify the number of trainable parameters is significantly reduced
lora_model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_dir='./logs',
    load_best_model_at_end=True
)

trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Train the LoRA-augmented model
trainer.train()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 296,450 || all params: 109,780,228 || trainable%: 0.2700
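
The trainable-parameter count can be sanity-checked by hand. The sketch below is a back-of-the-envelope calculation, not output from PEFT. It assumes that LoRA is attached to the query and value projections of each of the 12 encoder layers (hidden size 768) and that the 2-class classification head is also trained; under those assumptions the total matches the number printed above.

```python
hidden, layers, rank, num_labels = 768, 12, 8, 2
adapted_per_layer = 2  # assumption: LoRA on the query and value projections only

# Each adapted projection gets two low-rank matrices: A (rank x hidden) and B (hidden x rank)
lora_per_module = rank * hidden + hidden * rank
lora_total = lora_per_module * adapted_per_layer * layers

# Classification head: weight (hidden x num_labels) plus bias (num_labels)
classifier = hidden * num_labels + num_labels

trainable = lora_total + classifier
print(lora_total, classifier, trainable)          # 294912 1538 296450
print(f"{100 * trainable / 109_780_228:.4f}%")    # 0.2700%
```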


Epoch,Training Loss,Validation Loss
1,0.3512,0.329693
2,0.3283,0.322278


TrainOutput(global_step=4500, training_loss=0.3869121331108941, metrics={'train_runtime': 1214.6405, 'train_samples_per_second': 59.277, 'train_steps_per_second': 3.705, 'total_flos': 4752391495680000.0, 'train_loss': 0.3869121331108941, 'epoch': 2.0})

**Description:**

Now that the model is trained, it's time to evaluate it on the test dataset. This will give us a better understanding of how the model performs on unseen data. We will use the `trainer.evaluate()` method and print out the resulting metrics. To get more insights, we can also define a `compute_metrics` function that returns common metrics like accuracy, precision, recall, and F1-score.


In [21]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Define a function to compute metrics
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Predictions are in logit form, we need to take argmax
    preds = predictions.argmax(axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Re-initialize the Trainer with the compute_metrics function
trainer = Trainer(
    model=trainer.model,  # reuse the trained model
    args=trainer.args,     # reuse the same training arguments
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

# Evaluate on test dataset
test_results = trainer.evaluate(test_dataset)
print("Test Results:", test_results)


Test Results: {'eval_loss': 0.3119533956050873, 'eval_model_preparation_time': 0.0138, 'eval_accuracy': 0.8668, 'eval_f1': 0.8665063138905592, 'eval_precision': 0.868421052631579, 'eval_recall': 0.8646, 'eval_runtime': 78.2705, 'eval_samples_per_second': 127.762, 'eval_steps_per_second': 7.985}
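
All four metrics returned by `compute_metrics` follow from the counts of true/false positives and negatives. The sketch below computes them in plain Python. The counts are not printed by the notebook: they are back-calculated from the reported test metrics under the assumption that the stratified test split holds 5,000 reviews per class.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Assumed counts, reconstructed from the test results above
m = binary_metrics(tp=4323, fp=655, fn=677, tn=4345)
print({name: round(value, 4) for name, value in m.items()})
```

With these counts the function returns accuracy 0.8668, precision 0.8684, recall 0.8646, and F1 0.8665, which agrees with the `eval_*` values in the test results.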


**Description:**

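We will try to improve the model by training for 4 epochs (up from 2 in the first run) and lowering the learning rate to 2e-5 to make training more stable. Note that `lora_model` already carries the weights from the first run, so this run continues fine-tuning rather than starting from scratch. We'll use W&B to track the experiment and compare the results.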


In [22]:
from transformers import Trainer, TrainingArguments

# Increase epochs, reduce learning rate
training_args_improved = TrainingArguments(
    output_dir='./results_improved',
    num_train_epochs=4,               # Increased from 2 to 4
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_dir='./logs_improved',
    load_best_model_at_end=True,
    report_to="wandb",
    run_name="imdb_sentiment_lora_improved_lr",
    learning_rate=2e-5               # Reduced learning rate (the TrainingArguments default is 5e-5)
)

improved_trainer = Trainer(
    model=lora_model,
    args=training_args_improved,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

improved_trainer.train()




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3074,0.320202,0.863,0.860133,0.878519,0.8425
2,0.3119,0.309084,0.8665,0.865015,0.874744,0.8555
3,0.3067,0.309813,0.86925,0.870768,0.860772,0.881
4,0.2971,0.309624,0.86725,0.866817,0.869653,0.864


TrainOutput(global_step=9000, training_loss=0.3078212636311849, metrics={'train_runtime': 2419.0057, 'train_samples_per_second': 59.529, 'train_steps_per_second': 3.721, 'total_flos': 9504782991360000.0, 'train_loss': 0.3078212636311849, 'epoch': 4.0})
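
The `global_step` values reported by the two runs (4500 and 9000) follow directly from the training set size and batch size. The sketch below redoes that arithmetic and also shows what share of each run the fixed `warmup_steps=500` covers. It assumes a single device and no gradient accumulation, which is consistent with the reported step counts.

```python
import math

train_rows, batch_size, warmup_steps = 36_000, 16, 500
steps_per_epoch = math.ceil(train_rows / batch_size)

totals = {}
for epochs in (2, 4):
    totals[epochs] = steps_per_epoch * epochs
    share = warmup_steps / totals[epochs]
    print(f"{epochs} epochs: {totals[epochs]} steps, warmup covers {share:.1%}")
# 2 epochs: 4500 steps, warmup covers 11.1%
# 4 epochs: 9000 steps, warmup covers 5.6%
```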

We now evaluate the final (best) model on the test dataset with `trainer.evaluate()`. This checks whether the improvements seen on the validation set carry over to the unseen test data.


In [23]:
# Evaluate on the test dataset. `trainer` wraps the same `lora_model` object that
# `improved_trainer` just trained, so this call evaluates the updated weights.
test_results = trainer.evaluate(test_dataset)
print("Test Results:", test_results)


Test Results: {'eval_loss': 0.29718202352523804, 'eval_model_preparation_time': 0.0138, 'eval_accuracy': 0.8739, 'eval_f1': 0.8736852649504157, 'eval_precision': 0.8751755970299017, 'eval_recall': 0.8722, 'eval_runtime': 78.1076, 'eval_samples_per_second': 128.028, 'eval_steps_per_second': 8.002}


## Summary of Model Improvement and Evaluation

We successfully improved our sentiment classification model by:

- Adjusting the training parameters:
  - Increasing the number of epochs from 2 to 4.
  - Reducing the learning rate for more stable fine-tuning.
- Utilizing LoRA parameter-efficient fine-tuning, which allowed us to only train a small fraction of the parameters while maintaining strong performance.
- Tracking experiments with W&B and comparing different runs, which helped us identify the best configuration.

As a result, we achieved:
- **Accuracy:** ~87.4%  
- **F1-score:** ~0.874  
- **Precision:** ~0.875  
- **Recall:** ~0.872

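Compared with the first run, test accuracy rose from about 86.7% to 87.4% and F1 from about 0.867 to 0.874. Because the second run started from the already fine-tuned `lora_model`, part of this gain may come from the additional training rather than from the lower learning rate alone.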

---

## Next Steps

With a well-performing model now in hand, we will proceed to the next stage:

1. **Model Serving and Deployment:**
   - Set up a Python API (FastAPI) to serve the model’s predictions.
   - Containerize the service using Docker.

2. **Frontend Integration:**
   - Create a simple web interface (e.g., using HTML/JavaScript or a lightweight frontend framework) for users to input text and receive sentiment predictions.
   - Use `docker-compose` to run both the backend API and the frontend interface together.

Let’s move on to the serving and deployment steps.


**Description:**

In this next step, we will:

1. Save the trained (best) model locally.
2. Set up a simple FastAPI service that loads the model and provides a prediction endpoint.
3. Later, we will package this into a Docker image and create a `docker-compose` setup with a frontend.

Let's start by saving our model weights and tokenizer.


In [24]:
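# Save the trained model and tokenizer.
# Because the model is wrapped by PEFT, this writes the LoRA adapter files
# rather than a full copy of the BERT weights.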
trainer.save_model("./saved_model")
tokenizer.save_pretrained("./saved_model")


('./saved_model/tokenizer_config.json',
 './saved_model/special_tokens_map.json',
 './saved_model/vocab.txt',
 './saved_model/added_tokens.json',
 './saved_model/tokenizer.json')

**Description:**
Next, let's create a simple FastAPI service that:
1. Loads the model and tokenizer from the `./saved_model` directory.
2. Defines an endpoint (`/predict`) that takes a text review as input and returns the predicted sentiment (positive or negative).


In [25]:
!pip install fastapi uvicorn[standard]


In [26]:
# In a production scenario, this code might go into a separate file, e.g. `app.py`.
# For demonstration, we'll write it here.

from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

# Define a Pydantic model for the input data
class ReviewText(BaseModel):
    text: str

app = FastAPI()

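# Load the model and tokenizer.
# Note: ./saved_model holds a PEFT adapter, so the base weights are resolved
# from bert-base-uncased (see the warning in this cell's output).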
model_path = "./saved_model"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()  # Set model to evaluation mode

@app.post("/predict")
def predict_sentiment(data: ReviewText):
    # Tokenize and encode the input text
    inputs = tokenizer(
        data.text,
        truncation=True,
        padding=True,
        max_length=128,
        return_tensors="pt"
    )

    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    preds = torch.argmax(logits, dim=-1).item()

    # Map prediction to label
    # Assuming label_map from before: 0 - negative, 1 - positive
    sentiment = "positive" if preds == 1 else "negative"

    return {"text": data.text, "sentiment": sentiment}

# To run the service locally (in a notebook, for demonstration):
# !uvicorn app:app --host 0.0.0.0 --port 8000 --reload


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
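
The label decision in `predict_sentiment` comes down to an argmax over two logits. The sketch below shows that step in plain Python and additionally converts the logits to a softmax probability, which could be returned as a confidence score. The helper `label_from_logits` and the example logits are hypothetical; the index order `[negative, positive]` follows the `label_map` used during training.

```python
import math

def label_from_logits(logits):
    """Return (label, probability) for a [negative, positive] logit pair."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # shift by the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)  # same choice as argmax
    return ("positive" if idx == 1 else "negative"), probs[idx]

label, prob = label_from_logits([-1.2, 2.3])
print(label, round(prob, 3))
```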


In [28]:
# Example: creating app.py in Colab
app_code = r"""
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

class ReviewText(BaseModel):
    text: str

app = FastAPI()

model_path = "./saved_model"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()

@app.post("/predict")
def predict_sentiment(data: ReviewText):
    inputs = tokenizer(data.text, truncation=True, padding=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    preds = torch.argmax(logits, dim=-1).item()
    sentiment = "positive" if preds == 1 else "negative"
    return {"text": data.text, "sentiment": sentiment}
"""

# Write the code above to app.py
with open("app.py", "w") as f:
    f.write(app_code)


In [29]:
!pip install fastapi uvicorn[standard] pyngrok

Collecting pyngrok
  Downloading pyngrok-7.2.2-py3-none-any.whl.metadata (8.4 kB)
Downloading pyngrok-7.2.2-py3-none-any.whl (22 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.2.2


In [31]:
!ngrok config add-authtoken <YOUR_NGROK_AUTHTOKEN>

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [32]:
# Run uvicorn in the background
import subprocess
import time
from pyngrok import ngrok

PORT = 8000
public_url = ngrok.connect(PORT)
print("Public URL:", public_url)

# Start uvicorn server
command = f"uvicorn app:app --host 0.0.0.0 --port {PORT}"
process = subprocess.Popen(command.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# Wait a few seconds for the server to start
time.sleep(5)

Public URL: NgrokTunnel: "https://b053-34-125-11-60.ngrok-free.app" -> "http://localhost:8000"


In [33]:
import requests

endpoint = "https://b053-34-125-11-60.ngrok-free.app/predict"
data = {"text": "This movie was absolutely wonderful!"}
response = requests.post(endpoint, json=data)
print(response.json())


{'text': 'This movie was absolutely wonderful!', 'sentiment': 'positive'}


In [36]:
endpoint = "https://b053-34-125-11-60.ngrok-free.app/predict"
data = {"text": "Disgusting movie!!! Actors was awful"}
response = requests.post(endpoint, json=data)
print(response.json())

{'text': 'Disgusting movie!!! Actors was awful', 'sentiment': 'negative'}


**Description:**
Next steps:

1. Create a `Dockerfile` that:
   - Uses a suitable base image (e.g., `python:3.10-slim`).
   - Installs the necessary Python dependencies.
   - Copies `app.py` and the `saved_model` directory into the container.
   - Runs the `uvicorn` server when the container starts.

2. Build the Docker image locally or in a CI environment.

3. Create a `docker-compose.yml` that:
   - Defines a service for the API.
   - (Optionally) defines a service for the frontend.

This will allow you to run everything with a simple `docker-compose up` command.
