# ISY503 Intelligent Systems — Assessment 3 Project

**Project:** Sentiment Classification with DistilBERT  
**Group Members:**  
- Ahmet  
- Ijod
- Munaf
- Yasin

This notebook contains the implementation and training of our DistilBERT-based sentiment classifier.  


# GPU Check

In [None]:
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))

# Detailed device info
!nvidia-smi || echo "⚠️ No GPU detected. In Colab, go to Runtime → Change runtime type → set Hardware accelerator to GPU."


TensorFlow version: 2.19.0
Num GPUs Available: 1
Tue Aug 19 15:05:21 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   43C    P8             11W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

# Install required libraries

In [None]:
# Step 3 — Clean install (TensorFlow-only)

# 0) Remove PyTorch and gcsfs (gcsfs pins fsspec)
!pip uninstall -y -q torch torchvision torchaudio gcsfs dask-cudf-cu12 cudf-cu12 || true

# 1) Install compatible libraries
#    - pandas 2.2.2 (compatible with google-colab)
#    - scikit-learn 1.6.1 (broad compatibility)
#    - transformers/datasets (to be used with TF)
!pip install -q --upgrade --no-cache-dir \
  pandas==2.2.2 \
  scikit-learn==1.6.1 \
  transformers==4.44.2 \
  datasets==2.20.0 \
  tqdm==4.66.4

# 2) Quick check
import pandas as pd, sklearn, datasets, transformers
print("pandas:", pd.__version__)
print("scikit-learn:", sklearn.__version__)
print("datasets:", datasets.__version__)
print("transformers:", transformers.__version__)
print("✅ Libraries ready (no gcsfs, no torch).")


pandas: 2.2.2
scikit-learn: 1.6.1
datasets: 2.20.0
transformers: 4.44.2
✅ Libraries ready (no gcsfs, no torch).
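
The quick check above prints the installed versions for manual comparison. As a minimal sketch, the same comparison could be automated with the standard library, assuming the pins from the install cell. The names `PINS`, `check_pins`, and the `get_version` parameter are hypothetical and introduced here only for illustration; they are not part of the project code.

```python
from importlib import metadata

# Pins copied from the install cell above.
PINS = {
    "pandas": "2.2.2",
    "scikit-learn": "1.6.1",
    "transformers": "4.44.2",
    "datasets": "2.20.0",
    "tqdm": "4.66.4",
}

def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that does not match.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches

problems = check_pins(PINS)
print("All pins satisfied." if not problems else f"Mismatched pins: {problems}")
```

Passing the version lookup as a parameter keeps the helper easy to exercise without installing anything.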


# Bring the project into Colab

In [None]:
# Step 4 — Clone the project repo into Colab

REPO_URL = "https://github.com/ahmetcihan/isy_nlp_project.git"
PROJECT_ROOT = "/content/isy_nlp_project"

# Remove any existing folder and clone fresh
!rm -rf $PROJECT_ROOT
!git clone --depth 1 $REPO_URL $PROJECT_ROOT

print("PROJECT_ROOT:", PROJECT_ROOT)
!ls -la $PROJECT_ROOT


Cloning into '/content/isy_nlp_project'...
remote: Enumerating objects: 36, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 36 (delta 3), reused 36 (delta 3), pack-reused 0 (from 0)
Receiving objects: 100% (36/36), 4.00 MiB | 6.70 MiB/s, done.
Resolving deltas: 100% (3/3), done.
PROJECT_ROOT: /content/isy_nlp_project
total 56
drwxr-xr-x 10 root root 4096 Aug 19 15:05 .
drwxr-xr-x  1 root root 4096 Aug 19 15:05 ..
drwxr-xr-x  2 root root 4096 Aug 19 15:05 app
drwxr-xr-x  3 root root 4096 Aug 19 15:05 data
drwxr-xr-x  2 root root 4096 Aug 19 15:05 docs
drwxr-xr-x  8 root root 4096 Aug 19 15:05 .git
-rw-r--r--  1 root root  233 Aug 19 15:05 .gitignore
-rwxr-xr-x  1 root root 2533 Aug 19 15:05 merge_with_all_branches.sh
drwxr-xr-x  2 root root 4096 Aug 19 15:05 models
drwxr-xr-x  2 root root 4096 Aug 19 15:05 notebooks
-rw-r--r--  1 root root   33 Aug 19 15:05 README.md
drwxr-xr-x  2 root root 4096 Aug 19 15:05 

# Verify dataset splits

In [None]:
# Step 5 — Count rows and preview a few examples from the global split

from pathlib import Path
import json

GLOBAL_DIR = Path("/content/isy_nlp_project/data/processed/global")
print("GLOBAL_DIR:", GLOBAL_DIR)

def count_lines(p: Path) -> int:
    try:
        with p.open("r", encoding="utf-8") as f:
            return sum(1 for _ in f)
    except FileNotFoundError:
        return 0

for split in ["train", "val", "test"]:
    p = GLOBAL_DIR / f"{split}.jsonl"
    print(f"{split:5s} -> {count_lines(p):6d} lines | {p}")

# Show first 3 lines from train.jsonl to confirm field names (e.g., "text" and "label")
train_path = GLOBAL_DIR / "train.jsonl"
print("\n--- First 3 lines of train.jsonl ---")
try:
    with train_path.open("r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= 3: break
            print(line.rstrip())
except FileNotFoundError:
    print("train.jsonl not found.")


GLOBAL_DIR: /content/isy_nlp_project/data/processed/global
train ->   6400 lines | /content/isy_nlp_project/data/processed/global/train.jsonl
val   ->    800 lines | /content/isy_nlp_project/data/processed/global/val.jsonl
test  ->    800 lines | /content/isy_nlp_project/data/processed/global/test.jsonl

--- First 3 lines of train.jsonl ---
{"id": "books-positive-200", "domain": "books", "label": 1, "text": "This is one of the best -- and scariest -- business books I've ever read. Christensen clearly illustrates why many of the 'tried and true' formulas really don't work. His research is compelling and is presented clearly enough for non-technical readers. I can't recommend this book highly enough"}
{"id": "books-positive-362", "domain": "books", "label": 1, "text": "I thought that the two books previous to this in the Duncan Kincaid/Gemma James series were slight disappointments. Kincaid seemed relegated to a side character with Gemma taking the lead. IN A DARK HOUSE is an excellent m
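
The preview confirms that each record carries `id`, `domain`, `label`, and `text` fields. Line counts alone do not show whether the classes are balanced, so a small follow-up check could count label and domain values per file. The sketch below is one way to do that; `label_counts` is a hypothetical helper, and the three demo records (including the `dvd` domain) are made up for illustration rather than taken from the dataset.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

def label_counts(path: Path, field: str = "label") -> Counter:
    """Count how often each value of `field` occurs in a JSONL file."""
    counts = Counter()
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                counts[json.loads(line)[field]] += 1
    return counts

# Demo on a tiny made-up file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.jsonl"
    demo.write_text(
        '{"id": "a", "domain": "books", "label": 1, "text": "great"}\n'
        '{"id": "b", "domain": "books", "label": 0, "text": "poor"}\n'
        '{"id": "c", "domain": "dvd", "label": 1, "text": "fun"}\n',
        encoding="utf-8",
    )
    demo_labels = label_counts(demo)
    demo_domains = label_counts(demo, field="domain")

print(dict(demo_labels), dict(demo_domains))
```

In the notebook, the same helper could be pointed at `GLOBAL_DIR / "train.jsonl"` and the other two splits.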

# Load tokenizer and tokenize the dataset

In [None]:
# Step 6 — Load tokenizer and tokenize (no training yet)

from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer
import pandas as pd
import json
from pathlib import Path

MODEL_NAME = "distilbert-base-uncased"
MAX_LEN = 256

# 1) Load JSONL into Hugging Face Datasets
GLOBAL_DIR = Path("/content/isy_nlp_project/data/processed/global")

def read_jsonl(path: Path):
    rows = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows

def normalize_fields(row: dict):
    """
    Map various possible field names into a consistent schema:
    - text: review/content sentence as string
    - label: integer 0/1
    """
    # Text field
    text = None
    for k in ["text", "review", "sentence", "content", "reviewText"]:
        if k in row and isinstance(row[k], str):
            text = row[k]
            break
    if text is None:
        raise KeyError(f"No text field found in: {list(row.keys())}")

    # Label field
    label = None
    for k in ["label", "sentiment", "target", "y"]:
        if k in row:
            label = row[k]
            break
    if isinstance(label, str):
        v = label.lower()
        if v in {"pos", "positive", "+", "1"}:
            label = 1
        elif v in {"neg", "negative", "-", "0"}:
            label = 0
        else:
            raise ValueError(f"Unrecognized label value: {label}")
    if label is None:
        raise KeyError("No label field found")

    return {"text": text, "label": int(label)}

def load_split(base: Path, split: str) -> Dataset:
    rows = [normalize_fields(r) for r in read_jsonl(base / f"{split}.jsonl")]
    return Dataset.from_pandas(pd.DataFrame(rows))

raw_ds = DatasetDict({
    "train": load_split(GLOBAL_DIR, "train"),
    "validation": load_split(GLOBAL_DIR, "val"),
    "test": load_split(GLOBAL_DIR, "test"),
})

print(raw_ds)

# 2) Tokenize with DistilBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

def tok(batch):
    # Use fixed padding to MAX_LEN for TF batches; truncation prevents overflow
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=MAX_LEN)

tokenized = raw_ds.map(tok, batched=True, remove_columns=[c for c in raw_ds["train"].column_names if c != "label"])
tokenized = tokenized.rename_column("label", "labels")  # transformers' TF models expect 'labels'

print(tokenized)


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 6400
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 800
    })
})


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 6400
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 800
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 800
    })
})
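
After tokenization every example has `input_ids` and `attention_mask` of length `MAX_LEN`, because the tokenizer was called with `truncation=True` and `padding="max_length"`. The sketch below illustrates the idea in plain Python. It is a simplification, not the tokenizer's implementation: the real tokenizer also keeps the closing special token when it truncates, and the token ids and the `pad_id=0` default used here are placeholder assumptions.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly `max_length` long."""
    kept = token_ids[:max_length]            # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)           # how many pad positions are needed
    input_ids = kept + [pad_id] * n_pad      # right-padding with the pad id
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

short = [101, 2023, 2003, 102]               # placeholder ids for a short text
padded_ids, padded_mask = pad_or_truncate(short, max_length=6)
cut_ids, cut_mask = pad_or_truncate(short, max_length=3)
print(padded_ids, padded_mask)
print(cut_ids, cut_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why it is kept alongside `input_ids` in the tokenized dataset.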


# Build TensorFlow datasets

In [None]:
# Step 7 — Build tf.data.Dataset without to_tf_dataset (NumPy 2.x friendly)

import numpy as np
import tensorflow as tf

BATCH_SIZE = 16  # You can tune this (8–32 is usually fine on T4)

def ds_to_tfdataset(hf_ds):
    """
    Convert a tokenized Hugging Face split into a tf.data.Dataset
    WITHOUT using `to_tf_dataset` (to avoid NumPy 2.x issues).
    Expects columns: input_ids, attention_mask, labels
    """
    # Extract arrays
    input_ids = np.array(hf_ds["input_ids"])
    attention_mask = np.array(hf_ds["attention_mask"])
    labels = np.array(hf_ds["labels"])

    # Build a tf.data.Dataset from tensors
    ds = tf.data.Dataset.from_tensor_slices((
        {"input_ids": input_ids, "attention_mask": attention_mask},
        labels
    ))
    return ds

tf_train = ds_to_tfdataset(tokenized["train"]).shuffle(10_000).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
tf_val   = ds_to_tfdataset(tokenized["validation"]).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
tf_test  = ds_to_tfdataset(tokenized["test"]).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

tf_train, tf_val, tf_test


(<_PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(None, 256), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(None, 256), dtype=tf.int64, name=None)}, TensorSpec(shape=(None,), dtype=tf.int64, name=None))>,
 <_PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(None, 256), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(None, 256), dtype=tf.int64, name=None)}, TensorSpec(shape=(None,), dtype=tf.int64, name=None))>,
 <_PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(None, 256), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(None, 256), dtype=tf.int64, name=None)}, TensorSpec(shape=(None,), dtype=tf.int64, name=None))>)
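
The element specs above report the batch dimension as `None` because `batch` was called without `drop_remainder=True`, so the final batch is allowed to be smaller than `BATCH_SIZE`. As a rough sketch of the arithmetic, the hypothetical helper `batch_layout` below computes how many batches each split produces and how large the last one is, using the split sizes and batch size from this notebook.

```python
import math

def batch_layout(num_examples, batch_size, drop_remainder=False):
    """Return (number_of_batches, size_of_last_batch)."""
    if drop_remainder:
        n = num_examples // batch_size       # incomplete final batch is discarded
        return n, (batch_size if n else 0)
    n = math.ceil(num_examples / batch_size) # incomplete final batch is kept
    last = num_examples - (n - 1) * batch_size if n else 0
    return n, last

for name, size in [("train", 6400), ("validation", 800), ("test", 800)]:
    print(name, batch_layout(size, 16))
```

With these sizes every batch happens to be full, so the choice of `drop_remainder` would not change the number of training steps here.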

# Load TF DistilBERT, compile and train

In [None]:
# Step 8 — Load TF DistilBERT, compile, and train

import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification, create_optimizer
import math

MODEL_NAME = "distilbert-base-uncased"
EPOCHS = 3
LR = 5e-5

# (Optional) Cast to int32 for TF ops (safer on some runtimes)
def cast_batch(features, labels):
    return (
        {
            "input_ids": tf.cast(features["input_ids"], tf.int32),
            "attention_mask": tf.cast(features["attention_mask"], tf.int32),
        },
        tf.cast(labels, tf.int32),
    )

tf_train_cast = tf_train.map(cast_batch)
tf_val_cast   = tf_val.map(cast_batch)
tf_test_cast  = tf_test.map(cast_batch)

# Load model (2-class classification)
num_labels = 2
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)

# Optimizer with linear schedule
train_size = len(tokenized["train"])   # <-- simple and robust
steps_per_epoch = math.ceil(train_size / BATCH_SIZE)
num_train_steps = steps_per_epoch * EPOCHS

optimizer, schedule = create_optimizer(
    init_lr=LR,
    num_train_steps=num_train_steps,
    num_warmup_steps=0,
)

model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# Callbacks (early stopping on val_accuracy)
callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor="val_accuracy",
        mode="max",
        patience=2,
        restore_best_weights=True,
    )
]

history = model.fit(
    tf_train_cast,
    validation_data=tf_val_cast,
    epochs=EPOCHS,
    callbacks=callbacks,
)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

Epoch 1/3
Epoch 2/3
Epoch 3/3
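
The optimizer above is built with `num_warmup_steps=0` and a total of `steps_per_epoch * EPOCHS` training steps. The sketch below shows the learning-rate curve this is expected to produce, assuming `create_optimizer` applies a linear decay from `init_lr` to zero over `num_train_steps` with its default settings. That assumption should be checked against the library documentation; `linear_decay_lr` is a hypothetical helper, not a library function.

```python
import math

def linear_decay_lr(step, init_lr, num_train_steps, end_lr=0.0):
    """Learning rate after `step` optimizer updates under linear decay."""
    step = min(step, num_train_steps)        # hold the final value after the last step
    remaining = 1.0 - step / num_train_steps # fraction of training still to go
    return (init_lr - end_lr) * remaining + end_lr

steps_per_epoch = math.ceil(6400 / 16)       # train_size / BATCH_SIZE
num_train_steps = steps_per_epoch * 3        # EPOCHS = 3

for step in (0, steps_per_epoch, 2 * steps_per_epoch, num_train_steps):
    print(step, linear_decay_lr(step, 5e-5, num_train_steps))
```

Under this assumption the rate falls by a third of its starting value per epoch and reaches zero at the end of the third epoch, so stopping early would leave training at a higher final rate than planned.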


# Evaluate on test set & save the model

In [None]:
# Step 9 — Evaluate on test set (accuracy, precision, recall, F1) and save artifacts

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix, classification_report
from transformers import AutoTokenizer
import tensorflow as tf
from pathlib import Path

# 1) Run predictions on the test set
y_true = []
for _, labels in tf_test_cast:
    y_true.append(labels.numpy())
y_true = np.concatenate(y_true)

logits = model.predict(tf_test_cast).logits
y_pred = logits.argmax(axis=-1)

# 2) Compute metrics
acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary", zero_division=0)
print({
    "test_accuracy": round(acc, 4),
    "test_precision": round(prec, 4),
    "test_recall": round(rec, 4),
    "test_f1": round(f1, 4),
})

# Optional: show confusion matrix and a short report
print("\nConfusion matrix:\n", confusion_matrix(y_true, y_pred))
print("\nClassification report:\n", classification_report(y_true, y_pred, digits=4))

# 3) Save model + tokenizer (choose one or both destinations)

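# (A) Save to Google Drive (persistent across sessions)
# Note: Drive must already be mounted with drive.mount('/content/drive').
# Otherwise mkdir creates this path on the local Colab disk and nothing persists.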
SAVE_DIR = "/content/drive/MyDrive/isy_nlp_models_tf/distilbert/global"
Path(SAVE_DIR).mkdir(parents=True, exist_ok=True)
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
print("Saved to Drive:", SAVE_DIR)

# (B) Save inside the project structure (ephemeral in Colab unless you zip/copy out)
LOCAL_SAVE = "/content/isy_nlp_project/models/distilbert/global_tf"
Path(LOCAL_SAVE).mkdir(parents=True, exist_ok=True)
model.save_pretrained(LOCAL_SAVE)
tokenizer.save_pretrained(LOCAL_SAVE)
print("Saved to project:", LOCAL_SAVE)


{'test_accuracy': 0.8975, 'test_precision': 0.9162, 'test_recall': 0.875, 'test_f1': 0.8951}

Confusion matrix:
 [[368  32]
 [ 50 350]]

Classification report:
               precision    recall  f1-score   support

           0     0.8804    0.9200    0.8998       400
           1     0.9162    0.8750    0.8951       400

    accuracy                         0.8975       800
   macro avg     0.8983    0.8975    0.8974       800
weighted avg     0.8983    0.8975    0.8974       800

Saved to Drive: /content/drive/MyDrive/isy_nlp_models_tf/distilbert/global
Saved to project: /content/isy_nlp_project/models/distilbert/global_tf
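
The reported metrics can be cross-checked by hand from the confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]` for labels 0 and 1. The sketch below recomputes them with the positive class taken as label 1, matching `average="binary"`; `binary_metrics` is a hypothetical helper written for this check.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall, and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted positives, how many are right
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual positives, how many are found
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts read from the confusion matrix printed above.
metrics = binary_metrics(tn=368, fp=32, fn=50, tp=350)
print({k: round(v, 4) for k, v in metrics.items()})
```

The rounded values agree with the test-set figures printed by the evaluation cell: the model misses 50 of the 400 positive reviews and mislabels 32 of the 400 negative ones.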


# Mount Google Drive (required before saving to Drive)

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
