# Sentiment Analysis with BERT on Yelp Reviews

This Jupyter Notebook walks through fine-tuning a BERT model for sentiment analysis on Yelp review data. We cover every step: loading and preprocessing the data, training the model on a Ray cluster, testing it locally, and serving it for predictions.

## What is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based technique, developed by Google, for pre-training natural language processing models. Its key innovation is applying bidirectional training of the Transformer, a popular attention model, to language modelling.

This makes BERT particularly effective at understanding a word in context, because it considers the words on both the left and the right of that word.
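
To make "left and right of the word" concrete, the sketch below contrasts the positions a token may attend to in a left-to-right model with those in a bidirectional encoder such as BERT. It is a toy illustration with hypothetical helper names (`causal_mask`, `bidirectional_mask`), not BERT's actual implementation.

```python
def causal_mask(n):
    """Left-to-right: token i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional: every token may attend to every position."""
    return [[1] * n for _ in range(n)]


tokens = ["The", "food", "was", "great"]
n = len(tokens)

# Row i lists which tokens position i can see (1 = visible, 0 = hidden).
for name, mask in [("causal", causal_mask(n)), ("bidirectional", bidirectional_mask(n))]:
    print(name)
    for token, row in zip(tokens, mask):
        visible = [t for t, keep in zip(tokens, row) if keep]
        print(f"  {token:>5} sees {visible}")
```

In the causal case "food" sees only "The" and itself, while in the bidirectional case it also sees "was" and "great".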


### Setup and Installation

In [51]:
! pip install -U "ray" "transformers" "datasets" "numpy" "evaluate" "boto3"

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [52]:
# Check whether CUDA and a GPU are available

def checking_cuda():
    import torch
    
    # Checks if a GPU is available and identifies it
    if torch.cuda.is_available():
        print(f"GPU is available: {torch.cuda.get_device_name(0)}")
    else:
        print("No GPU available.")

checking_cuda()


GPU is available: NVIDIA A10G


# Training Script

This training script is written here in the Jupyter Notebook, but we submit it to Ray from Python. It can also be submitted through the Ray operator itself, using the RayJob CRD in Kubernetes.

In [53]:
%%writefile train_script.py
import os

import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    Trainer,
    TrainingArguments,
    AutoTokenizer,
    AutoModelForSequenceClassification,
)

import ray.train.huggingface.transformers
from ray.train import ScalingConfig, RunConfig
from ray.train.torch import TorchTrainer

# Variables
s3_name_checkpoints = "<REPLACE_WITH_YOUR_BUCKET_CREATED_BY_TERRAFORM>"
storage_path=f"s3://{s3_name_checkpoints}/checkpoints/"

def train_func():
    # Datasets
    dataset = load_dataset("yelp_review_full")  # Yelp reviews labeled 1 to 5 stars; we use its train and test splits
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    small_train_dataset = (
        dataset["train"].select(range(1000)).map(tokenize_function, batched=True)
    )
    small_eval_dataset = (
        dataset["test"].select(range(1000)).map(tokenize_function, batched=True)
    )

    # Model
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=5
    )

    # Evaluation Metrics
    metric = evaluate.load("accuracy")

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    # Hugging Face Trainer
    training_args = TrainingArguments(
        output_dir="test_trainer",
        evaluation_strategy="epoch",
        save_strategy="epoch",
        report_to="none",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=small_train_dataset,
        eval_dataset=small_eval_dataset,
        compute_metrics=compute_metrics,
    )

    # [2] Report Metrics and Checkpoints to Ray Train
    # ===============================================
    callback = ray.train.huggingface.transformers.RayTrainReportCallback()
    trainer.add_callback(callback)

    # [3] Prepare Transformers Trainer
    # ================================
    trainer = ray.train.huggingface.transformers.prepare_trainer(trainer)

    # Start Training
    trainer.train()


# [4] Define a Ray TorchTrainer to launch `train_func` on all workers
# ===================================================================
ray_trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    # [4a] On a multi-node cluster, the run's persistent storage must be
    # accessible from all worker nodes; here that is the S3 path defined above.
    run_config=RunConfig(
        storage_path=storage_path,
        name="bert_experiment",
    ),
)

ray_trainer.fit()

Overwriting train_script.py
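
The `compute_metrics` function in the script reduces each row of logits to a predicted class with `argmax` and then scores the predictions against the labels. A minimal pure-Python sketch of that calculation, using made-up logits and a hypothetical `accuracy_from_logits` helper, looks like this:

```python
def argmax(row):
    """Index of the largest value in a list (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for three reviews over the five star classes (0 = 1 star).
logits = [
    [0.1, 0.2, 0.3, 1.5, 2.9],    # highest score at index 4
    [3.1, 1.0, 0.2, -0.5, -1.0],  # highest score at index 0
    [0.0, 0.4, 1.2, 1.1, 0.3],    # highest score at index 2
]
labels = [4, 0, 3]

print(accuracy_from_logits(logits, labels))  # 2 of 3 predictions match
```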


# Submitting Training to Ray Cluster

We already have a Ray cluster deployed for training, so we submit the training script to it as a Ray job.

In [54]:
import boto3

# S3 bucket definition and upload of the training script
s3_name_checkpoints = "<REPLACE_WITH_YOUR_BUCKET_CREATED_BY_TERRAFORM>"
s3_client = boto3.client("s3")
s3_client.upload_file("./train_script.py", s3_name_checkpoints, "scripts/train_script.py")

In [55]:
import ray
from ray.job_submission import JobSubmissionClient

# Submitting Training script to Ray
ray_train_address = "ray-cluster-train-kuberay-head-svc.ray-cluster-train-dev.svc.cluster.local"
ray_client = JobSubmissionClient(f"http://{ray_train_address}:8265")
train_dependencies = [
    "ray",
    "transformers",
    "datasets",
    "numpy",
    "evaluate",
    "boto3"
]

submission_id = ray_client.submit_job(
    # Entrypoint shell command to execute
    entrypoint=(
        f"rm -rf train_script.py && aws s3 cp s3://{s3_name_checkpoints}/scripts/train_script.py train_script.py || true;"
        "chmod +x train_script.py && python train_script.py"
    ),
    runtime_env={
        "pip": train_dependencies
    }
)
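
`submit_job` returns a submission ID without waiting for training to finish. One way to wait, sketched below, is to poll `get_job_status` until the job reaches a terminal state. The helper name `wait_for_job`, the polling interval, and the timeout are assumptions for illustration. The function accepts any object with a `get_job_status(submission_id)` method, so the stub client here stands in for the real `ray_client`.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(client, submission_id, poll_seconds=5.0, timeout_seconds=3600.0):
    """Poll the job status until it is terminal or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.get_job_status(submission_id)
        # Ray returns an enum; a plain string works too (as in the stub below).
        state = str(getattr(status, "value", status))
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {submission_id} still {state} after timeout")
        time.sleep(poll_seconds)


class StubClient:
    """Stand-in for JobSubmissionClient that replays a fixed status sequence."""

    def __init__(self, states):
        self._states = iter(states)

    def get_job_status(self, submission_id):
        return next(self._states)


stub = StubClient(["PENDING", "RUNNING", "SUCCEEDED"])
print(wait_for_job(stub, "example-id", poll_seconds=0.0))
```

With the real client, the equivalent call would be `wait_for_job(ray_client, submission_id)`, after which `ray_client.get_job_logs(submission_id)` can retrieve the job's logs.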


## Check Ray Dashboard

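Before loading and testing your model, open the Ray Dashboard to watch your job run. Forward the dashboard port to your machine with the command below, adjusting the namespace if yours differs (the head service address used for job submission above is in `ray-cluster-train-dev`):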

```bash
kubectl port-forward svc/ray-cluster-train-kuberay-head-svc 8265:8265 -nray-cluster-train
```

# Loading the model and doing local inference

In [56]:
import boto3
import os

def download_latest_checkpoint(bucket_name, base_folder, local_directory):
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    checkpoints = []

    # Listing all objects within the base folder
    for page in paginator.paginate(Bucket=bucket_name, Prefix=base_folder):
        for obj in page.get('Contents', []):
            key = obj['Key']
            if key.endswith('/') and 'checkpoint' in key:
                checkpoints.append(key)

    if not checkpoints:
        print("No checkpoints found.")
        return

    # Sorting to find the latest
    latest_checkpoint = sorted(checkpoints)[-1]
    print("Latest checkpoint:", latest_checkpoint)

    # Download files from the latest checkpoint
    for page in paginator.paginate(Bucket=bucket_name, Prefix=latest_checkpoint):
        for obj in page.get('Contents', []):
            key = obj['Key']
            local_file_path = os.path.join(local_directory, key[len(latest_checkpoint):])
            if not key.endswith('/'):  # Skip directories
                os.makedirs(os.path.dirname(local_file_path), exist_ok=True)
                s3.download_file(bucket_name, key, local_file_path)
                print(f'Downloaded: {key} to {local_file_path}')
    print("All files from the latest checkpoint are downloaded.")

bucket_name = s3_name_checkpoints
base_folder = "checkpoints/bert_experiment/"
local_directory = "./latest_model_checkpoint"

download_latest_checkpoint(bucket_name, base_folder, local_directory)


Latest checkpoint: checkpoints/bert_experiment/TorchTrainer_60ca2_00000_0_2024-04-24_11-13-40/checkpoint_000002/checkpoint/
Downloaded: checkpoints/bert_experiment/TorchTrainer_60ca2_00000_0_2024-04-24_11-13-40/checkpoint_000002/checkpoint/config.json to ./latest_model_checkpoint/config.json
Downloaded: checkpoints/bert_experiment/TorchTrainer_60ca2_00000_0_2024-04-24_11-13-40/checkpoint_000002/checkpoint/optimizer.pt to ./latest_model_checkpoint/optimizer.pt
Downloaded: checkpoints/bert_experiment/TorchTrainer_60ca2_00000_0_2024-04-24_11-13-40/checkpoint_000002/checkpoint/pytorch_model.bin to ./latest_model_checkpoint/pytorch_model.bin
Downloaded: checkpoints/bert_experiment/TorchTrainer_60ca2_00000_0_2024-04-24_11-13-40/checkpoint_000002/checkpoint/rng_state_0.pth to ./latest_model_checkpoint/rng_state_0.pth
Downloaded: checkpoints/bert_experiment/TorchTrainer_60ca2_00000_0_2024-04-24_11-13-40/checkpoint_000002/checkpoint/scheduler.pt to ./latest_model_checkpoint/scheduler.pt
Downloa
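
`download_latest_checkpoint` picks the last key in lexicographic order. Within a single run this works because the checkpoint directories are zero-padded (`checkpoint_000002`). When several runs share the prefix, however, each run directory name carries an ID before its timestamp, so the lexicographically last key is not necessarily the most recent one. A hedged alternative is to compare the `LastModified` timestamp that `list_objects_v2` returns for each object. The sketch below uses hand-written listing entries and a hypothetical `newest_checkpoint_prefix` helper rather than a real S3 call:

```python
import re
from datetime import datetime, timezone

CHECKPOINT_DIR = re.compile(r"^(.*/checkpoint_\d+/)")


def newest_checkpoint_prefix(objects):
    """Return the checkpoint_N/ prefix holding the most recently modified object."""
    newest_prefix, newest_time = None, None
    for obj in objects:
        match = CHECKPOINT_DIR.match(obj["Key"])
        if match is None:
            continue  # not inside a checkpoint directory
        if newest_time is None or obj["LastModified"] > newest_time:
            newest_prefix, newest_time = match.group(1), obj["LastModified"]
    return newest_prefix


# Hand-written entries shaped like the "Contents" items of list_objects_v2.
# The run names and timestamps are invented for this example.
objects = [
    {"Key": "checkpoints/exp/run_b/checkpoint_000000/checkpoint/config.json",
     "LastModified": datetime(2024, 4, 23, 9, 0, tzinfo=timezone.utc)},
    {"Key": "checkpoints/exp/run_a/checkpoint_000002/checkpoint/config.json",
     "LastModified": datetime(2024, 4, 24, 11, 30, tzinfo=timezone.utc)},
    {"Key": "checkpoints/exp/run_a/result.json",
     "LastModified": datetime(2024, 4, 24, 11, 31, tzinfo=timezone.utc)},
]

print(newest_checkpoint_prefix(objects))
print(sorted(o["Key"] for o in objects)[-1])  # lexicographic order picks run_b
```

The returned prefix stops at `checkpoint_N/`; in this notebook's output the model files sit one level deeper, under `checkpoint/`.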

# Testing the model locally

In [59]:
# Load the model from the latest checkpoint
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch.nn.functional as F
import torch

def predict(text, tokenizer, model):
    # Encode the input text
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512, padding=True)

    # Get predictions from the model
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    # Convert logits to probabilities
    probabilities = F.softmax(logits, dim=-1)
    return probabilities

local_directory = "./latest_model_checkpoint"
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(local_directory)

model.eval()
print("Model loaded successfully!")

# Sample text for prediction
sample_text = "The food at this restaurant was absolutely wonderful, from preparation to presentation, very pleasing."

# Get prediction
probabilities = predict(sample_text, tokenizer, model)
predicted_class = torch.argmax(probabilities, dim=-1)

# Map class indices to labels
class_labels = {
    0: "1 Star",
    1: "2 Stars",
    2: "3 Stars",
    3: "4 Stars",
    4: "5 Stars"
}
predicted_label = class_labels[predicted_class.item()]
probabilities_percent = [f"{prob * 100:.2f}%" for prob in probabilities[0]]

print(f"Predicted class: {predicted_label}")
print(f"Class probabilities: {probabilities_percent}")

Model loaded successfully!
Predicted class: 5 Stars
Class probabilities: ['0.73%', '0.61%', '4.72%', '24.02%', '69.92%']
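
The percentages above come from applying softmax to the model's five logits. As a rough illustration of that step without PyTorch, the sketch below implements softmax in plain Python. The logits are invented, and the `softmax` helper is a stand-in for the `torch.nn.functional` version used in the cell.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Invented logits for the five star classes (index 0 = 1 star).
logits = [-2.0, -2.2, -0.1, 1.5, 2.6]
probabilities = softmax(logits)

# The predicted class is the index with the highest probability.
best = max(range(len(probabilities)), key=probabilities.__getitem__)
print([f"{p * 100:.2f}%" for p in probabilities])
print(f"Predicted class index: {best}")
```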


# Serving Script

The serving script is created, packaged as a ZIP archive, and uploaded to S3. We then generate a pre-signed URL for the archive to use in the RayService CRD.

In [30]:
%%writefile serve_script.py
import os
import boto3
import ray
from ray import serve
from starlette.requests import Request
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def download_latest_checkpoint(bucket_name, base_folder, local_directory):
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    checkpoints = []

    for page in paginator.paginate(Bucket=bucket_name, Prefix=base_folder):
        for obj in page.get('Contents', []):
            key = obj['Key']
            if key.endswith('/') and 'checkpoint' in key:
                checkpoints.append(key)

    if not checkpoints:
        print("No checkpoints found.")
        return

    latest_checkpoint = sorted(checkpoints)[-1]
    print("Latest checkpoint:", latest_checkpoint)

    for page in paginator.paginate(Bucket=bucket_name, Prefix=latest_checkpoint):
        for obj in page.get('Contents', []):
            key = obj['Key']
            local_file_path = os.path.join(local_directory, key[len(latest_checkpoint):])
            if not key.endswith('/'):
                os.makedirs(os.path.dirname(local_file_path), exist_ok=True)
                s3.download_file(bucket_name, key, local_file_path)
                print(f'Downloaded: {key} to {local_file_path}')
    print("All files from the latest checkpoint are downloaded.")


@serve.deployment(ray_actor_options={"num_gpus": 1})
class SentimentAnalysisDeployment:
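    def __init__(self, bucket_name, base_folder):
        local_directory = "./latest_model_checkpoint"
        download_latest_checkpoint(bucket_name, base_folder, local_directory)

        # The deployment reserves a GPU, so run the model on it when available
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
        self.model = AutoModelForSequenceClassification.from_pretrained(local_directory)
        self.model.to(self.device)
        self.model.eval()

    def predict(self, text):
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512, padding=True)
        # Move the input tensors to the same device as the model
        inputs = {name: tensor.to(self.device) for name, tensor in inputs.items()}
        with torch.no_grad():
            outputs = self.model(**inputs)
            probabilities = F.softmax(outputs.logits, dim=-1)
            # Return CPU tensors so the caller can format them directly
            return probabilities.cpu()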

    async def __call__(self, request: Request):
        body = await request.json()
        text = body['text']
        probabilities = self.predict(text)
        predicted_class = torch.argmax(probabilities, dim=-1)
        class_labels = ["1 Star", "2 Stars", "3 Stars", "4 Stars", "5 Stars"]
        predicted_label = class_labels[predicted_class.item()]
        probabilities_percent = [f"{prob * 100:.2f}%" for prob in probabilities[0]]
        response = {
            "predicted_class": predicted_label,
            "class_probabilities": probabilities_percent
        }
        return response
        
bucket_name = "<REPLACE_WITH_YOUR_BUCKET_CREATED_BY_TERRAFORM>"
base_folder = "checkpoints/bert_experiment/"
sentiment_deployment = SentimentAnalysisDeployment.bind(bucket_name=bucket_name, base_folder=base_folder)

Overwriting serve_script.py


In [31]:
import boto3

s3_client = boto3.client("s3")
s3_client.upload_file("./serve_script.py", bucket_name, "scripts/serve_script.py")

In [None]:
import boto3
from zipfile import ZipFile

s3_client = boto3.client("s3")

with ZipFile('./bert_finetuned.zip', 'w') as zip_object:
    zip_object.write('./serve_script.py')

s3_client.upload_file("./bert_finetuned.zip", bucket_name, "bert_finetuned.zip")
presigned_url = s3_client.generate_presigned_url(
    'get_object',
    Params={'Bucket': bucket_name, 'Key': "bert_finetuned.zip"},
    ExpiresIn=3600
)

print("Pre-signed URL:", presigned_url)
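
Before sharing the URL, it can be worth confirming what the archive actually contains. The sketch below builds a throwaway archive in a temporary directory and lists its entries with the standard `zipfile` module. The file content and paths are placeholders, and `list_archive` is a hypothetical helper.

```python
import os
import tempfile
from zipfile import ZipFile


def list_archive(zip_path):
    """Return the entry names stored in a ZIP archive."""
    with ZipFile(zip_path) as archive:
        return archive.namelist()


with tempfile.TemporaryDirectory() as workdir:
    script_path = os.path.join(workdir, "serve_script.py")
    zip_path = os.path.join(workdir, "bert_finetuned.zip")

    # Placeholder content standing in for the real serving script.
    with open(script_path, "w") as handle:
        handle.write("# placeholder\n")

    # arcname stores the file at the archive root instead of its full path.
    with ZipFile(zip_path, "w") as archive:
        archive.write(script_path, arcname="serve_script.py")

    entries = list_archive(zip_path)
    print(entries)
```

Calling `list_archive("./bert_finetuned.zip")` on the real archive shows the entry names stored in it.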


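Replace the placeholder URL in the `01_ray_serve_bert_finetuned.yaml` file with the pre-signed URL printed above, then apply the manifest. The URL expires after one hour (`ExpiresIn=3600`), so apply the manifest before then.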

```bash
kubectl apply -f ray_serve_manifests/01_ray_serve_bert_finetuned.yaml
```

# Sending Requests to the Fine-tuned BERT Model (Yelp Reviews)

In [33]:
import requests
import json

In [37]:
url = "http://ray-svc-bert-head-svc.ray-svc-bert.svc.cluster.local:8000/bert_predict"

In [47]:
def get_sentiment(text):
    # requests serializes the payload and sets the JSON content type
    response = requests.post(url, json={"text": text}, timeout=30)
    if response.status_code == 200:
        return response.json()
    # Return errors as a dict so every result has the same type
    return {"error": response.status_code, "detail": response.text}
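
The service returns probabilities as formatted strings such as `'69.92%'`. If numeric values are needed downstream, they can be parsed back into floats. The sketch below does that and also computes a probability-weighted star rating. `parse_probabilities` and `expected_stars` are hypothetical helpers, and the sample response reuses the values printed by the local test earlier in this notebook.

```python
def parse_probabilities(percent_strings):
    """Turn ['0.73%', ...] into fractions such as [0.0073, ...]."""
    return [float(s.rstrip("%")) / 100 for s in percent_strings]


def expected_stars(probabilities):
    """Probability-weighted average rating, with index 0 meaning 1 star."""
    return sum((index + 1) * p for index, p in enumerate(probabilities))


response = {
    "predicted_class": "5 Stars",
    "class_probabilities": ["0.73%", "0.61%", "4.72%", "24.02%", "69.92%"],
}

probabilities = parse_probabilities(response["class_probabilities"])
print(f"Expected rating: {expected_stars(probabilities):.2f} stars")
```

The weighted rating is a convenience summary added here for illustration; the service itself reports only the most likely class and the per-class percentages.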


In [50]:
reviews = [
    "The food at this restaurant was absolutely wonderful, from preparation to presentation, very pleasing.",
    "I had a terrible experience. The food was bland and the service was too slow.",
    "Quite a pleasant surprise! The dishes were creative and flavorful, and the staff was attentive.",
    "Not worth the money. The food was mediocre at best and the ambiance wasn't as advertised.",
    "An excellent choice for our anniversary dinner. We had a delightful evening with great food and service.",
    "The service was below average and the waiter seemed uninterested.",
    "Loved the vibrant atmosphere and the innovative menu choices. Will definitely return!",
    "It was just okay, not what I expected from the rave reviews.",
    "The worst restaurant experience ever. Would not recommend to anyone.",
    "Absolutely fantastic! Can't wait to go back and try more dishes."
]

results = []

for review in reviews:
    result = get_sentiment(review)
    results.append(result)
    print("Review:", review)
    print("Sentiment Analysis Result:", result)
    print()  # Adding a newline for better readability between results



Review: The food at this restaurant was absolutely wonderful, from preparation to presentation, very pleasing.
Sentiment Analysis Result: {'predicted_class': '5 Stars', 'class_probabilities': ['0.73%', '0.61%', '4.72%', '24.02%', '69.92%']}

Review: I had a terrible experience. The food was bland and the service was too slow.
Sentiment Analysis Result: {'predicted_class': '1 Star', 'class_probabilities': ['79.51%', '16.45%', '2.21%', '1.11%', '0.72%']}

Review: Quite a pleasant surprise! The dishes were creative and flavorful, and the staff was attentive.
Sentiment Analysis Result: {'predicted_class': '5 Stars', 'class_probabilities': ['0.82%', '0.59%', '4.94%', '25.15%', '68.51%']}

Review: Not worth the money. The food was mediocre at best and the ambiance wasn't as advertised.
Sentiment Analysis Result: {'predicted_class': '2 Stars', 'class_probabilities': ['15.27%', '75.96%', '6.72%', '1.42%', '0.64%']}

Review: An excellent choice for our anniversary dinner. We had a delightful ev