# Real-time Text Classification with BERT + Self-Attention using TorchServe

## Overview

This notebook deploys a fine-tuned BERT-based multi-class text classifier for real-time inference with **TorchServe**. The model assigns each input text to one of the 20 categories of the 20 Newsgroups dataset and is served over an HTTP REST API.

## Model Architecture

* **Base**: [`bert-base-uncased`](https://huggingface.co/bert-base-uncased)
* **Custom Head**:

  * Self-Attention Pooling Layer
  * Dropout: 0.5
  * Fully Connected Layer
  * Training loss: cross-entropy with label smoothing (`smoothing=0.1`)
* **Output**: 20-class logits over 20 Newsgroups labels
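The two less common pieces of the head, attention pooling and label-smoothed cross-entropy, can be illustrated numerically. The sketch below is dependency-free Python, not the notebook's `model.py` (which is not shown here). It assumes a single learned query vector for the pooling scores and the common smoothing convention of spreading `smoothing` uniformly over all classes; the real implementation may differ on both points.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def attention_pool(token_vectors, query, mask):
    """Collapse per-token vectors into one vector using attention weights.

    Each unmasked token is scored by its dot product with `query` (assumed
    to be a learned parameter). Padding tokens (mask False) receive weight 0.
    At least one token must be unmasked.
    """
    scores = [
        sum(q * h for q, h in zip(query, vec)) if keep else float("-inf")
        for vec, keep in zip(token_vectors, mask)
    ]
    weights = [math.exp(lp) for lp in log_softmax(scores)]
    dim = len(token_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, token_vectors))
        for d in range(dim)
    ]


def label_smoothing_ce(logits, target, smoothing=0.1):
    """Cross-entropy against a smoothed target distribution.

    The target class gets probability (1 - smoothing) + smoothing / K and
    every other class gets smoothing / K, where K is the number of classes.
    """
    log_probs = log_softmax(logits)
    num_classes = len(logits)
    loss = 0.0
    for index, log_prob in enumerate(log_probs):
        target_prob = smoothing / num_classes
        if index == target:
            target_prob += 1.0 - smoothing
        loss -= target_prob * log_prob
    return loss
```

With smoothing enabled, a confidently correct prediction still incurs some loss, which discourages the classifier from producing extreme logits.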

## Deployment Setup

### Frameworks & Tools

* PyTorch
* TorchServe
* HuggingFace Transformers
* Ngrok (for public URL access)
* Python `requests` (for client-side demo)

### TorchServe Artifacts

* `bert_sa_model.mar`: Model archive including:

  * Serialized model weights: `bert_final_sa_model_state_dict.pt`
  * Custom handler: `handler.py`
  * Index to label mapping: `id2label`
* `config.properties`: Port setup and model load settings
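To make the role of the `id2label` mapping concrete, here is a minimal sketch of the post-processing step a handler of this kind might perform. The real `handler.py` is not reproduced in this notebook, so the function name `postprocess` is hypothetical. The label list assumes the alphabetical 20 Newsgroups ordering used by scikit-learn, which is consistent with the label IDs returned later in this notebook.

```python
# Assumed label order: alphabetical 20 Newsgroups category names.
NEWSGROUP_LABELS = [
    "alt.atheism",
    "comp.graphics",
    "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware",
    "comp.windows.x",
    "misc.forsale",
    "rec.autos",
    "rec.motorcycles",
    "rec.sport.baseball",
    "rec.sport.hockey",
    "sci.crypt",
    "sci.electronics",
    "sci.med",
    "sci.space",
    "soc.religion.christian",
    "talk.politics.guns",
    "talk.politics.mideast",
    "talk.politics.misc",
    "talk.religion.misc",
]
id2label = dict(enumerate(NEWSGROUP_LABELS))


def postprocess(logits):
    """Turn one row of 20 logits into the JSON-style response dictionary."""
    label_id = max(range(len(logits)), key=logits.__getitem__)
    return {
        "predicted_label_id": label_id,
        "predicted_label": id2label[label_id],
    }
```

The returned dictionary has the same two keys as the responses shown in the inference section.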

## Serving Steps

## 1. Environment Setup
Mount Google Drive and install required packages.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install torch-model-archiver

Collecting torch-model-archiver
  Downloading torch_model_archiver-0.12.0-py3-none-any.whl.metadata (1.4 kB)
Collecting enum-compat (from torch-model-archiver)
  Downloading enum_compat-0.0.3-py3-none-any.whl.metadata (954 bytes)
Downloading torch_model_archiver-0.12.0-py3-none-any.whl (16 kB)
Downloading enum_compat-0.0.3-py3-none-any.whl (1.3 kB)
Installing collected packages: enum-compat, torch-model-archiver
Successfully installed enum-compat-0.0.3 torch-model-archiver-0.12.0


In [3]:
!pip install torchserve torch-model-archiver transformers nltk evaluate scikit-learn

Collecting torchserve
  Downloading torchserve-0.12.0-py3-none-any.whl.metadata (1.4 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-3.5.1-py3-none-any.whl.metadata (19 kB)
Collecting dill (from evaluate)
  Downloading dill-0.4.0-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from evaluate)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.18-py311-none-any.whl.metadata (7.5 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec>=2021.05.0 (from fsspec[http]>=2021.05.0->evaluate)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading torchserve-0.12.0-py3-none-any.w

In [4]:
! pip install -r /content/drive/MyDrive/bert_final_sa_model/requirements.txt

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0->-r /content/drive/MyDrive/bert_final_sa_model/requirements.txt (line 2))
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.0->-r /content/drive/MyDrive/bert_final_sa_model/requirements.txt (line 2))
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=2.0->-r /content/drive/MyDrive/bert_final_sa_model/requirements.txt (line 2))
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=2.0->-r /content/drive/MyDrive/bert_final_sa_model/requirements.txt (line 2))
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=2.0->-r /content/driv

## 2. Model Archiving
Use torch-model-archiver to package the trained model.

In [5]:
# === Set up path ===
MODEL_DIR = "/content/drive/MyDrive/bert_final_sa_model"
TMP_DIR = "/tmp/bert_mar_dir"
EXPORT_PATH = f"{MODEL_DIR}/model_store"
MODEL_NAME = "bert_sa_model"

#!cd {TMP_DIR} && zip -r nltk_data.zip nltk_data

#### Package model

In [59]:
!torch-model-archiver \
  --model-name {MODEL_NAME} \
  --version 1.0 \
  --serialized-file {MODEL_DIR}/bert_final_sa_model_state_dict.pt \
  --handler {MODEL_DIR}/handler.py \
  --extra-files "{MODEL_DIR}/model.py,{MODEL_DIR}/requirements.txt" \
  --requirements-file {MODEL_DIR}/requirements.txt \
  --export-path {EXPORT_PATH} \
  --force
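The same archiver call can be assembled in Python, which makes it easier to check that every input file exists before packaging a 400 MB archive. This is a sketch only: the helper names `build_archiver_command` and `missing_inputs` are hypothetical, while the flags and file names are the ones used in the cell above.

```python
from pathlib import Path

# Files the archiver call above expects to find in MODEL_DIR.
REQUIRED_FILES = (
    "bert_final_sa_model_state_dict.pt",
    "handler.py",
    "model.py",
    "requirements.txt",
)


def missing_inputs(model_dir):
    """Return the names of required files that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]


def build_archiver_command(model_dir, model_name, export_path, version="1.0"):
    """Build the torch-model-archiver argument list used in this notebook."""
    model_dir = Path(model_dir)
    extra_files = ",".join(
        str(model_dir / name) for name in ("model.py", "requirements.txt")
    )
    return [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--serialized-file", str(model_dir / "bert_final_sa_model_state_dict.pt"),
        "--handler", str(model_dir / "handler.py"),
        "--extra-files", extra_files,
        "--requirements-file", str(model_dir / "requirements.txt"),
        "--export-path", str(export_path),
        "--force",
    ]
```

A possible usage is to call `missing_inputs(MODEL_DIR)` first and, if the list is empty, pass the result of `build_archiver_command(MODEL_DIR, MODEL_NAME, EXPORT_PATH)` to `subprocess.run(..., check=True)`.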



In [60]:
!unzip -l /content/drive/MyDrive/bert_final_sa_model/model_store/bert_sa_model.mar

Archive:  /content/drive/MyDrive/bert_final_sa_model/model_store/bert_sa_model.mar
  Length      Date    Time    Name
---------  ---------- -----   ----
      444  2025-05-04 06:47   requirements.txt
     1983  2025-05-03 19:53   model.py
     4844  2025-05-04 06:47   handler.py
438476617  2025-05-04 06:47   bert_final_sa_model_state_dict.pt
      304  2025-05-04 06:47   MAR-INF/MANIFEST.json
---------                     -------
438484192                     5 files


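The listing below comes from an earlier build of the archive (cell 10), which also bundled `nltk_data.zip`:

In [10]: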
!unzip -l /content/drive/MyDrive/bert_final_sa_model/model_store/bert_sa_model.mar

Archive:  /content/drive/MyDrive/bert_final_sa_model/model_store/bert_sa_model.mar
  Length      Date    Time    Name
---------  ---------- -----   ----
      444  2025-05-04 03:46   requirements.txt
     1983  2025-05-04 03:42   model.py
     4050  2025-05-04 03:46   handler.py
 73941280  2025-05-04 03:45   nltk_data.zip
438476617  2025-05-04 03:46   bert_final_sa_model_state_dict.pt
      304  2025-05-04 03:46   MAR-INF/MANIFEST.json
---------                     -------
512424678                     6 files


In [40]:
!cat /content/drive/MyDrive/bert_final_sa_model/requirements.txt

# === Core ML dependencies ===
torch>=2.0
transformers>=4.39
scikit-learn
pandas

# === NLP utilities ===
nltk>=3.5

# === Optional: tokenizer dependencies
# sentencepiece     # If you use T5, ALBERT, etc.
# protobuf          # For exporting/loading models
# regex             # Often required for advanced tokenization

# === NLTK data files (handled separately)
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')

## 3. TorchServe Deployment
Stop any running TorchServe instance and write the server configuration. The server itself is launched at the start of the next section.

In [46]:
import os

os.listdir("/content/drive/MyDrive/bert_final_sa_model/model_store")

['bert_sa_model.mar']

In [61]:
!torchserve --stop

TorchServe has stopped.


In [62]:
!echo "" > config.properties

#### Torch Server Start

In [63]:
config_path = "/content/config.properties"

with open(config_path, "w") as f:
    f.write("""\
inference_address=http://0.0.0.0:8085
management_address=http://0.0.0.0:8086
metrics_address=http://0.0.0.0:8087
model_store=/content/drive/MyDrive/bert_final_sa_model/model_store
""")

In [64]:
!rm /usr/local/lib/python3.11/dist-packages/pathlib.py

rm: cannot remove '/usr/local/lib/python3.11/dist-packages/pathlib.py': No such file or directory
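The `config.properties` file written above is a plain `key=value` file, so it can be generated from a dictionary and checked before the server starts. The sketch below is an illustration under that assumption; the helper names are hypothetical, and it only handles the simple lines used in this notebook, not the full properties syntax.

```python
from urllib.parse import urlparse

# The settings written to /content/config.properties in this notebook.
SETTINGS = {
    "inference_address": "http://0.0.0.0:8085",
    "management_address": "http://0.0.0.0:8086",
    "metrics_address": "http://0.0.0.0:8087",
    "model_store": "/content/drive/MyDrive/bert_final_sa_model/model_store",
}


def render_config(settings):
    """Serialize settings as one key=value pair per line."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


def parse_config(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        parsed[key.strip()] = value.strip()
    return parsed


def port_conflicts(settings):
    """Return (first_key, second_key, port) for addresses sharing a port."""
    seen = {}
    conflicts = []
    for key, value in settings.items():
        if not key.endswith("_address"):
            continue
        port = urlparse(value).port
        if port in seen:
            conflicts.append((seen[port], key, port))
        else:
            seen[port] = key
    return conflicts
```

Checking `port_conflicts(SETTINGS)` before launch catches the case where two of the three listeners are accidentally given the same port.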


## 4. Inference and Testing
Launch the server, then use pyngrok and requests to send HTTP requests to the deployed model.

#### Launch TorchServe

In [65]:
!torchserve --stop
!rm -rf /tmp/.ts.sock* logs/*

!nohup torchserve \
  --start \
  --ts-config /content/config.properties \
  --models bert_sa_model=bert_sa_model.mar \
  --ncs > server.log 2>&1 &

TorchServe is not currently running.
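Because the server is started in the background with `nohup`, the next cell can run before the workers have loaded the model. One way to handle this is to poll TorchServe's `/ping` health endpoint on the inference port until it reports `Healthy`. The sketch below uses only the standard library; the helper names and retry settings are assumptions, and if your TorchServe version also requires a token for `/ping`, the request would need the inference key added.

```python
import json
import time
import urllib.request


def fetch_ping(base_url, timeout=2.0):
    """GET <base_url>/ping and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_healthy(base_url, fetch=fetch_ping, attempts=30, delay=2.0,
                       sleep=time.sleep):
    """Poll the health endpoint until it reports Healthy or attempts run out.

    Connection errors and malformed responses are treated as "not ready yet".
    `fetch` and `sleep` are injectable so the loop can be tested offline.
    """
    for _ in range(attempts):
        try:
            if fetch(base_url).get("status") == "Healthy":
                return True
        except (OSError, ValueError):
            pass  # server not listening yet, or response was not valid JSON
        sleep(delay)
    return False
```

In this notebook the call would be `wait_until_healthy("http://127.0.0.1:8085")`, matching the `inference_address` port in `config.properties`.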


In [66]:
# Show the last 50 lines of the server log
!tail -n 50 server.log

In [68]:
import json

with open("/content/key_file.json", "r") as f:
    key_data = json.load(f)
print(key_data)

# Read the inference token that TorchServe generated
access_token = key_data["inference"]["key"]

{'management': {'key': '4WzFHRnp', 'expiration time': '2025-05-04T07:48:01.769608Z'}, 'inference': {'key': '0RDAAoNF', 'expiration time': '2025-05-04T07:48:01.769565Z'}, 'API': {'key': 'phb0QRbc'}}
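The key file shows that the inference and management keys carry an expiration time, so a long-running session can end up sending an expired token. A small guard can check this before building the request headers. This is a sketch based only on the key-file layout printed above; the function names are hypothetical and the key in the usage example is a placeholder.

```python
from datetime import datetime, timezone


def parse_expiration(value):
    """Parse timestamps like '2025-05-04T07:48:01.769608Z' as aware UTC datetimes."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def bearer_headers(key_data, scope="inference", now=None):
    """Build the Authorization header for one scope of the key file.

    Raises ValueError if the key for that scope has already expired.
    Entries without an 'expiration time' field are treated as non-expiring.
    """
    entry = key_data[scope]
    now = now or datetime.now(timezone.utc)
    expires = entry.get("expiration time")
    if expires is not None and parse_expiration(expires) <= now:
        raise ValueError(f"{scope} key expired at {expires}")
    return {"Authorization": f"Bearer {entry['key']}"}
```

For example, `bearer_headers(key_data)` can replace the manual header construction in the request cells, so that an expired key fails with a clear message instead of a rejected request.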


In [31]:
!pip install -q pyngrok

In [81]:
# Forcefully close all tunnels in the current session
from pyngrok import ngrok
ngrok.kill()

In [82]:
!ngrok config add-authtoken "<YOUR_NGROK_AUTHTOKEN>"

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


#### Tunnel with ngrok

In [71]:
import json
import requests
from pyngrok import ngrok

# === Open public endpoint ===
public_url = ngrok.connect(8085).public_url
print("Your inference URL is:", public_url)

# === Read token ===
with open("/content/key_file.json") as f:
    key_data = json.load(f)

access_token = key_data["inference"]["key"]

# === Set request headers and payload ===
headers = {"Authorization": f"Bearer {access_token}"}
payload = {"text": "this is a sample sentence"}

# === Send inference request ===
response = requests.post(
    f"{public_url}/predictions/bert_sa_model",
    headers=headers,
    json=[payload]  # ← make sure to wrap payload in a list!
)

# === Print response ===
print("Response:", response.json())

Your inference URL is: https://e6f7-35-197-26-122.ngrok-free.app
Response: {'predicted_label_id': 18, 'predicted_label': 'talk.politics.misc'}
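The request above can be wrapped in a small helper that also sets a timeout and fails loudly on a non-2xx status, which is helpful when the token has expired or the model is still loading. This is a sketch: the function name `classify` is hypothetical, and the HTTP function is passed in (for example `requests.post`) so the helper does not depend on a particular client.

```python
def classify(text, base_url, token, post, model_name="bert_sa_model", timeout=30):
    """Send one text to the prediction endpoint and return the decoded JSON.

    `post` is expected to behave like requests.post: it accepts `headers`,
    `json`, and `timeout` keyword arguments and returns an object with
    `raise_for_status()` and `json()` methods.
    """
    response = post(
        f"{base_url}/predictions/{model_name}",
        headers={"Authorization": f"Bearer {token}"},
        json=[{"text": text}],  # same list-wrapped payload as in the cell above
        timeout=timeout,
    )
    response.raise_for_status()  # surface 4xx/5xx errors instead of parsing them
    return response.json()
```

With the variables defined in the cell above, a call would look like `classify("this is a sample sentence", public_url, access_token, post=requests.post)`.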


In [72]:
# Show the last 50 lines of the server log
!tail -n 50 server.log

2025-05-04T06:48:25,539 [INFO ] W-9001-bert_sa_model_1.0-stdout MODEL_LOG - Initializing model and tokenizer...
2025-05-04T06:48:25,549 [INFO ] W-9000-bert_sa_model_1.0-stdout MODEL_LOG - OpenVINO is not enabled
2025-05-04T06:48:25,550 [INFO ] W-9000-bert_sa_model_1.0-stdout MODEL_LOG - proceeding without onnxruntime
2025-05-04T06:48:25,550 [INFO ] W-9000-bert_sa_model_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2025-05-04T06:48:25,551 [INFO ] W-9000-bert_sa_model_1.0-stdout MODEL_LOG - Handler script loaded
2025-05-04T06:48:25,552 [INFO ] W-9000-bert_sa_model_1.0-stdout MODEL_LOG - Initializing model and tokenizer...
2025-05-04T06:48:28,377 [INFO ] W-9000-bert_sa_model_1.0-stdout MODEL_LOG - NumExpr defaulting to 2 threads.
2025-05-04T06:48:28,378 [INFO ] W-9001-bert_sa_model_1.0-stdout MODEL_LOG - NumExpr defaulting to 2 threads.
2025-05-04T06:48:32,238 [WARN ] W-9001-bert_sa_model_1.0-stderr MODEL_LOG - 2025-05-04 06:48:32.237763: E external/local_xla/xla/stream_executor/cuda/

#### More examples

In [86]:
import json
import requests
from pyngrok import ngrok

# === Start public endpoint ===
public_url = ngrok.connect(8085).public_url
print("Your inference URL is:", public_url)

# === Read access token ===
with open("/content/key_file.json") as f:
    key_data = json.load(f)
access_token = key_data["inference"]["key"]

# === Set request headers ===
headers = {"Authorization": f"Bearer {access_token}"}

# === Example input texts ===
examples = [
    "The government passed a new bill regulating gun ownership across several states.",
    "I just installed Ubuntu on my machine and it's much faster than Windows!",
    "NASA just released new photos from the Hubble telescope showing distant galaxies.",
    "My Honda Civic has been making a strange noise when I turn left. Any advice?",
    "Christianity teaches love and forgiveness. What do you think about modern interpretations?",
    "The Boston Red Sox had an incredible comeback game last night!",
    "Selling used MacBook Pro 2020 in good condition. DM for details.",
    "Quantum cryptography could change the way we secure our data.",
    "I'm looking for help with programming OpenGL in C++. Any good resources?",
    "Can anyone explain the difference between Sunni and Shia Islam?"
]

# === Send requests for each input text ===
for text in examples:
    payload = [{"text": text}]
    response = requests.post(
        f"{public_url}/predictions/bert_sa_model",
        headers=headers,
        json=payload
    )
    print(f"Input: {text}")
    print("Response:", response.json())
    print("=" * 60)


Your inference URL is: https://65d0-35-197-26-122.ngrok-free.app
Input: The government passed a new bill regulating gun ownership across several states.
Response: {'predicted_label_id': 16, 'predicted_label': 'talk.politics.guns'}
Input: I just installed Ubuntu on my machine and it's much faster than Windows!
Response: {'predicted_label_id': 2, 'predicted_label': 'comp.os.ms-windows.misc'}
Input: NASA just released new photos from the Hubble telescope showing distant galaxies.
Response: {'predicted_label_id': 14, 'predicted_label': 'sci.space'}
Input: My Honda Civic has been making a strange noise when I turn left. Any advice?
Response: {'predicted_label_id': 7, 'predicted_label': 'rec.autos'}
Input: Christianity teaches love and forgiveness. What do you think about modern interpretations?
Response: {'predicted_label_id': 15, 'predicted_label': 'soc.religion.christian'}
Input: The Boston Red Sox had an incredible comeback game last night!
Response: {'predicted_label_id': 9, 'predicted_

## Inference Example and Prediction Analysis

### Strengths

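* **Semantic Understanding**: In the examples above, the model separates political, technical, scientific, and religious content. For instance, the Hubble sentence maps to `sci.space` and the Honda Civic sentence to `rec.autos`.
* **Label Alignment**: The visible predictions fall into plausible 20 Newsgroups categories. The Ubuntu sentence is assigned to `comp.os.ms-windows.misc`, the only operating-system group in the label set.
* **Deployment Reliability**: Every response has the same JSON structure (`predicted_label_id`, `predicted_label`), which indicates that the handler is integrated correctly with TorchServe.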
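Claims about label alignment are easier to verify with an explicit tally. The sketch below compares predicted labels with a reference list and reports the mismatches. The predictions are the first five complete responses from the "More examples" cell; the reference labels are an editorial reading of each sentence, not part of the original notebook, and should be replaced with your own ground truth.

```python
def agreement(predicted, expected):
    """Return (fraction of matches, list of (index, predicted, expected) mismatches)."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not predicted:
        raise ValueError("at least one prediction is required")
    mismatches = [
        (index, pred, exp)
        for index, (pred, exp) in enumerate(zip(predicted, expected))
        if pred != exp
    ]
    return 1 - len(mismatches) / len(predicted), mismatches


# Predictions copied from the responses shown in this notebook.
predicted_from_notebook = [
    "talk.politics.guns",
    "comp.os.ms-windows.misc",
    "sci.space",
    "rec.autos",
    "soc.religion.christian",
]

# Hypothetical reference labels (assumption, not from the source).
reference_labels = [
    "talk.politics.guns",
    "comp.os.ms-windows.misc",
    "sci.space",
    "rec.autos",
    "soc.religion.christian",
]

rate, mismatches = agreement(predicted_from_notebook, reference_labels)
print(f"Agreement: {rate:.0%}, mismatches: {mismatches}")
```

Five sentences are far too few for a meaningful accuracy estimate; the same function can be applied to a held-out test split for that purpose.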

### Observation

* Some ambiguous inputs (e.g., comparisons between religions) may be assigned to contentious categories such as `alt.atheism`. This more likely reflects how such posts are distributed across labels in the original dataset than an error in the model.
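For ambiguous inputs like these, the single top label hides how close the runner-up categories are. Looking at the top few softmax probabilities makes that visible. The sketch below assumes access to the raw logits; the handler in this notebook returns only the top label, so exposing logits would require a handler change. The logit values in the usage note are invented purely for illustration.

```python
import math


def top_k(logits, labels, k=3):
    """Return the k most probable (label, probability) pairs under softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # shift by max for stability
    total = sum(exps)
    ranked = sorted(
        zip(labels, (e / total for e in exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]
```

For example, with hypothetical logits `[2.0, 1.9, 0.1]` over `alt.atheism`, `soc.religion.christian`, and `talk.religion.misc`, the first two probabilities are nearly equal, which signals an ambiguous input rather than a confident prediction.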

## Conclusion

This project demonstrates a full **MLOps-compatible real-time text classification pipeline** using:

* A fine-tuned BERT model with a custom attention-based classification head,
* TorchServe-based scalable deployment,
* Token-based authenticated HTTP requests,
* Structured JSON predictions that fall into plausible categories on the examples shown.
