# Sentiment Analysis with BERT on Yelp Reviews

This Jupyter notebook walks through creating all the scripts and artifacts needed to train a BERT model for sentiment analysis on Yelp review data. We go through every step needed to build a pipeline for production use.

## What is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based machine learning technique for natural language processing pre-training. Developed by Google, BERT's key innovation is applying bidirectional training of the Transformer, a popular attention-based model, to language modeling.

This makes it particularly effective at understanding the context of a word from all of its surroundings (both left and right of the word).
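
To make "left and right of the word" concrete, here is a toy sketch that compares which tokens a given position can see in a left-to-right language model and in a bidirectional setting such as BERT's. It uses plain Python lists rather than a real model, and the helper name `visible_positions` is invented for this illustration.

```python
def visible_positions(tokens, index, bidirectional):
    """Return the tokens that the position `index` may use as context."""
    if bidirectional:
        # Bidirectional setting: every token in the sequence is visible.
        return list(tokens)
    # Left-to-right setting: only the current and earlier tokens are visible.
    return tokens[: index + 1]


tokens = ["the", "bank", "of", "the", "river"]
idx = tokens.index("bank")

causal_view = visible_positions(tokens, idx, bidirectional=False)
bert_view = visible_positions(tokens, idx, bidirectional=True)

print(causal_view)  # ['the', 'bank']
print(bert_view)    # ['the', 'bank', 'of', 'the', 'river']
```

In the left-to-right view, "bank" cannot see "river", which is the word that disambiguates it; in the bidirectional view it can.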


### Setup and Installation

In [51]:
! pip install -U "ray" "transformers" "datasets" "numpy" "evaluate" "boto3"

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [52]:
# Check whether CUDA and a GPU are available:

def checking_cuda():
    import torch
    
    # Checks if a GPU is available and identifies it
    if torch.cuda.is_available():
        print(f"GPU is available: {torch.cuda.get_device_name(0)}")
    else:
        print("No GPU available.")

checking_cuda()


GPU is available: NVIDIA A10G


# Training Script

The training script is written to a file, zipped, and uploaded to S3. We then generate a pre-signed URL for the archive to use in the RayJob CRD.

In [53]:
%%writefile train_script.py
import os

import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    Trainer,
    TrainingArguments,
    AutoTokenizer,
    AutoModelForSequenceClassification,
)

import ray.train.huggingface.transformers
from ray.train import ScalingConfig, RunConfig
from ray.train.torch import TorchTrainer

# Variables
s3_name_checkpoints = "<REPLACE_WITH_YOUR_BUCKET_CREATED_BY_TERRAFORM>"
storage_path=f"s3://{s3_name_checkpoints}/checkpoints/"

def train_func():
    # Datasets
    dataset = load_dataset("yelp_review_full")  # Dataset used for training and evaluation
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    small_train_dataset = (
        dataset["train"].select(range(1000)).map(tokenize_function, batched=True)
    )
    small_eval_dataset = (
        dataset["test"].select(range(1000)).map(tokenize_function, batched=True)
    )

    # Model
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=5
    )

    # Evaluation Metrics
    metric = evaluate.load("accuracy")

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    # Hugging Face Trainer
    training_args = TrainingArguments(
        output_dir="test_trainer",
        evaluation_strategy="epoch",
        save_strategy="epoch",
        report_to="none",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=small_train_dataset,
        eval_dataset=small_eval_dataset,
        compute_metrics=compute_metrics,
    )

    # [2] Report Metrics and Checkpoints to Ray Train
    # ===============================================
    callback = ray.train.huggingface.transformers.RayTrainReportCallback()
    trainer.add_callback(callback)

    # [3] Prepare Transformers Trainer
    # ================================
    trainer = ray.train.huggingface.transformers.prepare_trainer(trainer)

    # Start Training
    trainer.train()


# [4] Define a Ray TorchTrainer to launch `train_func` on all workers
# ===================================================================
ray_trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    # Persist checkpoints to S3 so that they are accessible from every
    # worker node in a multi-node cluster.
    run_config=RunConfig(
        storage_path=storage_path,
        name="bert_experiment",
    ),
)

ray_trainer.fit()

Overwriting train_script.py
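
The `compute_metrics` function in the script above turns the model's logits into class predictions with an argmax and compares them to the reference labels. The sketch below reproduces that logic in plain Python so it can be run without NumPy or `evaluate`. The helper names and the logits are made up for illustration; the returned dictionary only mirrors the `{"accuracy": ...}` shape used by the accuracy metric.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Illustrative logits for four reviews and five classes (labels 0 to 4).
logits = [
    [0.1, 0.2, 0.3, 2.5, 0.4],  # predicted class 3
    [3.0, 0.1, 0.2, 0.1, 0.0],  # predicted class 0
    [0.0, 0.1, 0.2, 0.3, 1.9],  # predicted class 4
    [0.2, 1.5, 0.3, 0.1, 0.0],  # predicted class 1
]
labels = [3, 0, 2, 1]

result = accuracy_from_logits(logits, labels)
print(result)  # {'accuracy': 0.75}
```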


## Creating pre-signed URL for train_script.py

We need this pre-signed URL to run the training job on the ephemeral Ray cluster.

In [50]:
import boto3
from zipfile import ZipFile

# S3 bucket definition and upload of the training script
s3_name_checkpoints = "<REPLACE_WITH_YOUR_BUCKET_CREATED_BY_TERRAFORM>"
bucket_name = s3_name_checkpoints
# S3 object key; a leading slash would become part of the key
script_key = "scripts/bert_train_script.zip"

s3_client = boto3.client("s3")

# Package the training script as a ZIP archive
with ZipFile('./bert_train_script.zip', 'w') as zip_object:
    zip_object.write('./train_script.py')

# Upload the archive and create a download URL that is valid for one hour
s3_client.upload_file("./bert_train_script.zip", bucket_name, script_key)
presigned_url = s3_client.generate_presigned_url(
    'get_object',
    Params={'Bucket': bucket_name, 'Key': script_key},
    ExpiresIn=3600
)

print("Pre-signed URL:", presigned_url)


Pre-signed URL: https://data-on-eks-genai.s3.amazonaws.com/bert_finetuned.zip?AWSAccessKeyId=<redacted>&Signature=<redacted>&x-amz-security-token=<redacted>
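
Before uploading, it can be useful to confirm what the ZIP archive actually contains. The sketch below builds an archive in a temporary directory from a placeholder script, so it does not touch S3. The `archive_names` helper and the placeholder file content are illustrative, and the choice to store the script at the archive root is an assumption about what the Ray job expects, not something the manifest in this walkthrough confirms.

```python
import os
import tempfile
from zipfile import ZipFile


def archive_names(zip_path):
    """List the member names stored in a ZIP archive."""
    with ZipFile(zip_path) as archive:
        return archive.namelist()


with tempfile.TemporaryDirectory() as workdir:
    # Placeholder standing in for the real train_script.py.
    script_path = os.path.join(workdir, "train_script.py")
    with open(script_path, "w") as handle:
        handle.write("print('placeholder training script')\n")

    zip_path = os.path.join(workdir, "bert_train_script.zip")
    with ZipFile(zip_path, "w") as zip_object:
        # arcname stores the script at the archive root (assumed layout).
        zip_object.write(script_path, arcname="train_script.py")

    names = archive_names(zip_path)

print(names)  # ['train_script.py']
```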

# Running the training job

Paste the pre-signed URL into the file `03_ray_job_training_standalone_s3.yaml`, then deploy it with the following command:

```bash
kubectl apply -f 03_ray_job_training_standalone_s3.yaml
```
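
If you prefer not to edit the manifest by hand, the substitution can be scripted. The following is a minimal sketch under stated assumptions: the `<PRESIGNED_URL>` placeholder, the `working_dir` field in the sample text, and the example URL are all hypothetical and may not match the real contents of `03_ray_job_training_standalone_s3.yaml`. Adapt them to whatever the file actually contains.

```python
def inject_presigned_url(manifest_text, placeholder, presigned_url):
    """Return the manifest text with the placeholder replaced by the URL."""
    if placeholder not in manifest_text:
        # Fail loudly instead of applying an unchanged manifest.
        raise ValueError(f"placeholder {placeholder!r} not found in manifest")
    return manifest_text.replace(placeholder, presigned_url)


# Hypothetical manifest fragment and URL, for illustration only.
manifest_text = 'working_dir: "<PRESIGNED_URL>"\n'
example_url = (
    "https://example-bucket.s3.amazonaws.com/scripts/bert_train_script.zip?X=1"
)

rendered = inject_presigned_url(manifest_text, "<PRESIGNED_URL>", example_url)
print(rendered)
```

In practice you would read the manifest with `open(...).read()`, write the rendered text back, and then run the `kubectl apply` command above.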

# Creating the serving Script

The serving script is written to a file, zipped, and uploaded to S3. We then generate a pre-signed URL for the archive to use in the RayService CRD.

In [30]:
%%writefile serve_script.py
import os
import boto3
import ray
from ray import serve
from starlette.requests import Request
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

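def download_latest_checkpoint(bucket_name, base_folder, local_directory):
    """Download the lexicographically last checkpoint folder under `base_folder` to `local_directory`."""
    s3 = boto3.client('s3')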
    paginator = s3.get_paginator('list_objects_v2')
    checkpoints = []

    for page in paginator.paginate(Bucket=bucket_name, Prefix=base_folder):
        for obj in page.get('Contents', []):
            key = obj['Key']
            if key.endswith('/') and 'checkpoint' in key:
                checkpoints.append(key)

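    if not checkpoints:
        # Fail here with a clear message instead of later when loading the model
        raise FileNotFoundError(
            f"No checkpoints found under s3://{bucket_name}/{base_folder}"
        )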

    latest_checkpoint = sorted(checkpoints)[-1]
    print("Latest checkpoint:", latest_checkpoint)

    for page in paginator.paginate(Bucket=bucket_name, Prefix=latest_checkpoint):
        for obj in page.get('Contents', []):
            key = obj['Key']
            local_file_path = os.path.join(local_directory, key[len(latest_checkpoint):])
            if not key.endswith('/'):
                os.makedirs(os.path.dirname(local_file_path), exist_ok=True)
                s3.download_file(bucket_name, key, local_file_path)
                print(f'Downloaded: {key} to {local_file_path}')
    print("All files from the latest checkpoint are downloaded.")


@serve.deployment(ray_actor_options={"num_gpus": 1})
class SentimentAnalysisDeployment:
    def __init__(self, bucket_name, base_folder):
        local_directory = "./latest_model_checkpoint"
        download_latest_checkpoint(bucket_name, base_folder, local_directory)

        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
        self.model = AutoModelForSequenceClassification.from_pretrained(local_directory)
        self.model.eval()

    def predict(self, text):
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512, padding=True)
        with torch.no_grad():
            outputs = self.model(**inputs)
            probabilities = F.softmax(outputs.logits, dim=-1)
            return probabilities

    async def __call__(self, request: Request):
        body = await request.json()
        text = body['text']
        probabilities = self.predict(text)
        predicted_class = torch.argmax(probabilities, dim=-1)
        class_labels = ["1 Star", "2 Stars", "3 Stars", "4 Stars", "5 Stars"]
        predicted_label = class_labels[predicted_class.item()]
        probabilities_percent = [f"{prob * 100:.2f}%" for prob in probabilities[0]]
        response = {
            "predicted_class": predicted_label,
            "class_probabilities": probabilities_percent
        }
        return response
        
bucket_name = "<REPLACE_WITH_YOUR_BUCKET_CREATED_BY_TERRAFORM>"
base_folder = "checkpoints/bert_experiment/"
sentiment_deployment = SentimentAnalysisDeployment.bind(bucket_name=bucket_name, base_folder=base_folder)

Overwriting serve_script.py
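
The checkpoint selection in `download_latest_checkpoint` relies on a plain lexicographic sort of the S3 keys that end in `/`. The sketch below isolates that step so it can be tried without S3. The key names are hypothetical and are not taken from a real Ray run. Two things are worth noticing: the `'checkpoint' in key` test also matches the `checkpoints/` prefix itself, and the sort only picks the newest folder when the numeric suffixes are zero-padded.

```python
def pick_latest_checkpoint(keys):
    """Mirror the selection rule: folder-like keys containing 'checkpoint', last in sort order."""
    candidates = [k for k in keys if k.endswith("/") and "checkpoint" in k]
    return sorted(candidates)[-1] if candidates else None


# Hypothetical S3 keys, for illustration only.
keys = [
    "checkpoints/bert_experiment/",
    "checkpoints/bert_experiment/trial_0/checkpoint_000000/",
    "checkpoints/bert_experiment/trial_0/checkpoint_000000/model.safetensors",
    "checkpoints/bert_experiment/trial_0/checkpoint_000002/",
    "checkpoints/bert_experiment/trial_0/checkpoint_000001/",
    "checkpoints/bert_experiment/trial_0/checkpoint_000002/config.json",
]

latest = pick_latest_checkpoint(keys)
print(latest)  # checkpoints/bert_experiment/trial_0/checkpoint_000002/

# Relative path used for the local copy, as in key[len(latest_checkpoint):]
relative = keys[-1][len(latest):]
print(relative)  # config.json

# Without zero-padding, string order and numeric order disagree.
unpadded_last = sorted(["checkpoint_9/", "checkpoint_10/"])[-1]
print(unpadded_last)  # checkpoint_9/
```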


In [31]:
import boto3

s3_client = boto3.client("s3")
s3_client.upload_file("./serve_script.py", bucket_name, "scripts/serve_script.py")

In [None]:
import boto3
from zipfile import ZipFile

s3_client = boto3.client("s3")
# S3 object key; a leading slash would become part of the key
script_key = "scripts/bert_finetuned.zip"

# Package the serving script as a ZIP archive
with ZipFile('./bert_finetuned.zip', 'w') as zip_object:
    zip_object.write('./serve_script.py')

# Upload the archive and create a download URL that is valid for one hour
s3_client.upload_file("./bert_finetuned.zip", bucket_name, script_key)
presigned_url = s3_client.generate_presigned_url(
    'get_object',
    Params={'Bucket': bucket_name, 'Key': script_key},
    ExpiresIn=3600
)

print("Pre-signed URL:", presigned_url)


Put the pre-signed URL above into the `01_ray_serve_bert_finetuned.yaml` file, then apply the manifest:

```bash
kubectl apply -f ray_serve_manifests/01_ray_serve_bert_finetuned.yaml
```

# Sending Requests to the Fine-tuned BERT Model with Yelp Reviews (Optional Step)

In [1]:
import requests
import json

In [2]:
url = "http://ray-svc-bert-head-svc.ray-svc-bert.svc.cluster.local:8000/bert_predict"

In [47]:
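def get_sentiment(text):
    """POST a review to the Serve endpoint and return the parsed JSON response."""
    headers = {'Content-type': 'application/json'}
    data = json.dumps({"text": text})
    response = requests.post(url, data=data, headers=headers)
    if response.status_code == 200:
        return response.json()
    else:
        # On failure, return one readable string instead of a tuple
        return f"Error: {response.status_code} {response.text}"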
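
The endpoint returns the class probabilities as percentage strings, which are easy to read but awkward to compute with. As an optional post-processing step, the sketch below converts them back to floats and derives a probability-weighted star rating. The `expected_stars` helper is an illustrative addition and not part of the serving script; the sample response is the first result shown in the output further down.

```python
def expected_stars(class_probabilities):
    """Probability-weighted average star rating from strings like '69.92%'."""
    probs = [float(p.rstrip("%")) / 100 for p in class_probabilities]
    total = sum(probs)  # close to 1.0, up to rounding in the response
    return sum(star * p for star, p in enumerate(probs, start=1)) / total


sample = {
    "predicted_class": "5 Stars",
    "class_probabilities": ["0.73%", "0.61%", "4.72%", "24.02%", "69.92%"],
}

score = expected_stars(sample["class_probabilities"])
print(round(score, 2))  # 4.62
```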


In [50]:
reviews = [
    "The food at this restaurant was absolutely wonderful, from preparation to presentation, very pleasing.",
    "I had a terrible experience. The food was bland and the service was too slow.",
    "Quite a pleasant surprise! The dishes were creative and flavorful, and the staff was attentive.",
    "Not worth the money. The food was mediocre at best and the ambiance wasn't as advertised.",
    "An excellent choice for our anniversary dinner. We had a delightful evening with great food and service.",
    "The service was below average and the waiter seemed uninterested.",
    "Loved the vibrant atmosphere and the innovative menu choices. Will definitely return!",
    "It was just okay, not what I expected from the rave reviews.",
    "The worst restaurant experience ever. Would not recommend to anyone.",
    "Absolutely fantastic! Can't wait to go back and try more dishes."
]

results = []

for review in reviews:
    result = get_sentiment(review)
    results.append(result)
    print("Review:", review)
    print("Sentiment Analysis Result:", result)
    print()  # Adding a newline for better readability between results



Review: The food at this restaurant was absolutely wonderful, from preparation to presentation, very pleasing.
Sentiment Analysis Result: {'predicted_class': '5 Stars', 'class_probabilities': ['0.73%', '0.61%', '4.72%', '24.02%', '69.92%']}

Review: I had a terrible experience. The food was bland and the service was too slow.
Sentiment Analysis Result: {'predicted_class': '1 Star', 'class_probabilities': ['79.51%', '16.45%', '2.21%', '1.11%', '0.72%']}

Review: Quite a pleasant surprise! The dishes were creative and flavorful, and the staff was attentive.
Sentiment Analysis Result: {'predicted_class': '5 Stars', 'class_probabilities': ['0.82%', '0.59%', '4.94%', '25.15%', '68.51%']}

Review: Not worth the money. The food was mediocre at best and the ambiance wasn't as advertised.
Sentiment Analysis Result: {'predicted_class': '2 Stars', 'class_probabilities': ['15.27%', '75.96%', '6.72%', '1.42%', '0.64%']}

Review: An excellent choice for our anniversary dinner. We had a delightful ev