## Phase 2
### Task 1 – Text Classification using AraBert

**Team Members:**  
- Muath AlShehri – 443016733  
- Abdullah Almohammed – 443016380  
- Motaz Al-Ghamdi – 444012369  

**Supervisor:** Dr. Fahman Saeed  
**Course:** CS365 – Natural Language Processing  
**Date:** June 2025


### Install Dependencies

This cell installs essential libraries for transformer-based text classification:
- `transformers`: HuggingFace library for pre-trained models
- `datasets`: For handling and processing datasets
- `arabert`: Preprocessing library optimized for Arabic
- `nltk`: Natural Language Toolkit used for tokenization and text processing


In [None]:

!pip install -q transformers datasets arabert nltk





### Ensure Latest Version

Upgrades the `transformers` library to the latest version to ensure compatibility with the latest models and features.


In [None]:

!pip install -q --upgrade transformers

### Disable Unused Tracking

Weights & Biases (W&B) is an experiment tracking tool. Since we are not using it in this project, we disable it to avoid unwanted logs.


In [None]:


import os
os.environ["WANDB_DISABLED"] = "true"


### Import Libraries

In [None]:

import pandas as pd
import numpy as np
import torch
from sklearn.metrics import classification_report
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments
)
from datasets import Dataset
from arabert.preprocess import ArabertPreprocessor
from sklearn.model_selection import train_test_split

### Upload and Load Dataset

This block:
- Uploads and extracts a zipped dataset file
- Loads the extracted CSV file into a DataFrame
- Ensures the dataset columns are labeled `text` and `label`
- Displays the first few rows of the dataset


In [None]:

from google.colab import files
import zipfile
import pandas as pd

# Upload the ZIP file
uploaded = files.upload()  # e.g., non_stemmed_dataset.zip

# Extract ZIP and locate the CSV while the archive is still open
zip_filename = list(uploaded.keys())[0]  # Get uploaded filename dynamically
with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
    zip_ref.extractall()
    # Pick the first CSV in the archive (assumes the ZIP contains exactly one CSV)
    csv_filename = next(name for name in zip_ref.namelist() if name.lower().endswith(".csv"))

# Load extracted CSV
df = pd.read_csv(csv_filename, encoding="utf-8-sig")

# Ensure the expected columns are present (fail early with a clear message)
missing = {"text", "label"} - set(df.columns)
assert not missing, f"Missing expected columns: {missing}"

print("✅ Dataset loaded successfully.")
print(df.head())


Saving arabert_preprocessed_dataset.zip to arabert_preprocessed_dataset.zip
✅ Dataset loaded successfully.
                                                text    label
0  كتب سالم ال+ رحبي : تنطلق ال+ يوم ال+ دور +ة ا...  culture
1  كتب - فيصل ال+ علوي : شارك +ت ال+ سلطن +ة صباح...  culture
2  أربع +ة عروض على مسرح ال+ شباب و+ عرض في ال+ ر...  culture
3  حاور +ه خالد عبداللطيف : حين يناقش ال+ موضوع ا...  culture
4  افتتح صباح أمس ب+ قاع +ة ال+ موسيقي في جامع +ة...  culture
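
Outside Colab, `files.upload()` is not available. The sketch below shows one way the same "find the CSV inside the ZIP and read it" step could be done with the standard library only. The helper name `read_first_csv` and the toy archive are assumptions made for illustration; they are not part of the original notebook.

```python
import csv
import io
import os
import tempfile
import zipfile


def read_first_csv(zip_path, encoding="utf-8-sig"):
    """Return the rows of the first CSV found in a ZIP archive as dicts."""
    with zipfile.ZipFile(zip_path) as zf:
        csv_names = [n for n in zf.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError(f"No CSV file found in {zip_path}")
        # Read directly from the archive, without extracting to disk
        with zf.open(csv_names[0]) as raw:
            text_stream = io.TextIOWrapper(raw, encoding=encoding, newline="")
            return list(csv.DictReader(text_stream))


# Toy archive (hypothetical data) to demonstrate the helper
toy_zip = os.path.join(tempfile.mkdtemp(), "toy_dataset.zip")
with zipfile.ZipFile(toy_zip, "w") as zf:
    zf.writestr("README.txt", "not a csv")
    zf.writestr("data.csv", "text,label\nhello,culture\nworld,sports\n")

rows = read_first_csv(toy_zip)
print(rows)
```

Filtering on the `.csv` extension avoids picking up a README or a macOS metadata entry that happens to be listed first in the archive.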


### Encode Labels and Prepare Dataset

This step:
- Maps each unique text label to an integer using `label2id`
- Stores a reverse mapping in `id2label` (used for predictions later)
- Updates the `label` column in the DataFrame to be numeric
- Splits the dataset (80% train, 20% test) using stratification to preserve label distribution
- Converts the result into HuggingFace `Dataset` objects


In [None]:

labels = sorted(df["label"].unique())
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}
df["label"] = df["label"].map(label2id)

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df["label"])
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)
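
As a quick illustration of the label mapping above, the following sketch applies the same `label2id` / `id2label` construction to a small hand-made list of labels (toy data, not the real dataset) and checks that encoding and decoding round-trip.

```python
# Toy labels (hypothetical sample, not taken from the real dataset)
raw_labels = ["sports", "culture", "sports", "economy", "culture"]

# Same construction as in the notebook: sort so that ids are deterministic
labels = sorted(set(raw_labels))
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

encoded = [label2id[label] for label in raw_labels]
decoded = [id2label[i] for i in encoded]

print(label2id)
print(encoded)
```

Sorting the unique labels first matters: it makes the integer ids reproducible across runs, which is what allows the mapping to be rebuilt later at inference time.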

### Tokenize the Dataset

This function tokenizes each example using the AraBERT tokenizer with:
- `truncation=True` and `max_length=128` to clip long texts
- `padding="max_length"` to ensure consistent input size

Then we:
- Remove columns the model does not need: `"text"` and the pandas index column `"__index_level_0__"` added by `Dataset.from_pandas`
- Format the datasets as PyTorch tensors for model training


In [None]:

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2")

def tokenize(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=128)

train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

train_dataset = train_dataset.remove_columns(["text", "__index_level_0__"])
test_dataset = test_dataset.remove_columns(["text", "__index_level_0__"])

train_dataset.set_format("torch")
test_dataset.set_format("torch")


Map:   0%|          | 0/14604 [00:00<?, ? examples/s]

Map:   0%|          | 0/3652 [00:00<?, ? examples/s]
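
To make the effect of `truncation=True`, `max_length` and `padding="max_length"` concrete, here is a simplified sketch that works on plain lists of token ids. It is only an illustration: the helper `pad_or_truncate` is hypothetical, it ignores special tokens such as `[CLS]` and `[SEP]`, and the pad id of `0` is an assumption.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Clip a token-id list to max_length, then right-pad it to exactly max_length."""
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    return {
        "input_ids": clipped + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(clipped) + [0] * n_pad,
    }


short_example = pad_or_truncate([5, 6, 7], max_length=5)
long_example = pad_or_truncate(list(range(1, 11)), max_length=5)

print(short_example)
print(long_example)
```

Every example ends up with the same length, which is what allows a batch to be stacked into a single tensor.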

### NumPy 2.0 Compatibility Fix

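This function rebuilds each dataset from a plain Python list of dicts and clears any NumPy/torch output format, so rows are returned as ordinary Python objects. The rebuilt dataset is still stored in Arrow internally; what changes is that no NumPy or torch formatter is attached, which helps prevent compatibility issues with NumPy 2.0 and other external libraries.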


In [None]:


from datasets import Dataset
import numpy as np

def fully_detach_from_arrow(dataset):
    # Convert dataset to list of dicts to bypass Arrow formatting
    raw_data = dataset.to_list()
    # Reconstruct dataset and disable special formatters (like numpy or torch)
    new_dataset = Dataset.from_list(raw_data)
    new_dataset.set_format(None)  # No numpy/torch formatting, returns dicts
    return new_dataset

# Apply this to both datasets before training
train_dataset = fully_detach_from_arrow(train_dataset)
test_dataset = fully_detach_from_arrow(test_dataset)

print("✅ Dataset formatting fully detached; compatible with NumPy 2.0")


✅ Dataset formatting fully detached; compatible with NumPy 2.0


### Load AraBERT Classification Model

Loads the AraBERT transformer model for sequence classification.
- `num_labels` tells the model how many output classes to predict
- `id2label` and `label2id` help the model interpret predicted labels and map them back to original class names


In [None]:

model = AutoModelForSequenceClassification.from_pretrained(
    "aubmindlab/bert-base-arabertv2",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id
)

model.safetensors:   0%|          | 0.00/543M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at aubmindlab/bert-base-arabertv2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Configure Training Arguments

This block defines the training configuration:
- Batch size: 8 per device
- Number of epochs: 1 (can be increased)
- Saves and evaluates the model at the end of each epoch
- Disables W&B reporting
- Resets the `Accelerate` state defensively, so that re-running the cell does not trigger runtime errors


In [None]:

from transformers import TrainingArguments

# Defensive reset for Accelerate state (in case of reruns)
from accelerate.state import AcceleratorState
if hasattr(AcceleratorState, "_reset_state"):
    AcceleratorState._reset_state()

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",   # named `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    logging_dir="./logs",
    report_to="none"  # disables wandb
)
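
For a sense of scale, the arithmetic below estimates how many optimizer steps this configuration implies. It uses the split sizes reported by the tokenization step (14,604 training and 3,652 test examples) and assumes a single device with no gradient accumulation, which is an assumption rather than something stated in the notebook.

```python
import math

train_examples = 14604   # size of the training split
eval_examples = 3652     # size of the test split
batch_size = 8           # per_device_train_batch_size / per_device_eval_batch_size
num_epochs = 1

# The last batch may be smaller than batch_size, hence the ceiling
steps_per_epoch = math.ceil(train_examples / batch_size)
total_train_steps = steps_per_epoch * num_epochs
eval_batches = math.ceil(eval_examples / batch_size)

print(steps_per_epoch, total_train_steps, eval_batches)  # 1826 1826 457
```

Increasing `num_train_epochs` scales the total step count linearly under the same assumptions.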

### Train the AraBERT Classifier

Initializes the HuggingFace `Trainer` with the model and training settings, then begins training.
- After training, the model and tokenizer are saved for future inference or deployment.


In [None]:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

trainer.train()
model.save_pretrained("./my_trained_model")
tokenizer.save_pretrained("./my_trained_model")

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.2184        | 0.223556        |


('./my_trained_model/tokenizer_config.json',
 './my_trained_model/special_tokens_map.json',
 './my_trained_model/vocab.txt',
 './my_trained_model/added_tokens.json',
 './my_trained_model/tokenizer.json')

### Evaluate the Model

Uses the trained model to predict test set labels and compares them to true labels.
- Outputs a classification report including: precision, recall, F1-score, and support per class.


In [None]:

preds = trainer.predict(test_dataset)
y_true = test_df["label"]
y_pred = np.argmax(preds.predictions, axis=1)

print("\U0001F4CA Classification Report:")
print(classification_report(y_true, y_pred, target_names=labels))


📊 Classification Report:
               precision    recall  f1-score   support

      culture       0.95      0.93      0.94       499
      economy       0.91      0.88      0.89       653
international       0.96      0.93      0.95       338
        local       0.86      0.90      0.88       648
     religion       0.98      1.00      0.99       695
       sports       0.99      0.99      0.99       819

     accuracy                           0.94      3652
    macro avg       0.94      0.94      0.94      3652
 weighted avg       0.94      0.94      0.94      3652
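
The `macro avg` and `weighted avg` rows can be sanity-checked from the per-class rows. The sketch below recomputes both F1 averages from the values printed in the report; because the per-class scores are already rounded to two decimals, the result is only expected to agree with the report after rounding.

```python
# (f1-score, support) per class, copied from the report above
per_class = {
    "culture": (0.94, 499),
    "economy": (0.89, 653),
    "international": (0.95, 338),
    "local": (0.88, 648),
    "religion": (0.99, 695),
    "sports": (0.99, 819),
}

total_support = sum(support for _, support in per_class.values())

# Macro average: every class counts equally
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))
```

The two averages coincide here because the weakest classes (`economy`, `local`) are neither unusually small nor unusually large compared with the others.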



### Setup for Model Deployment 

Installs and imports `Gradio` to build a simple web-based interface for the model.
- Also re-imports essential components to load the saved model and tokenizer.


In [None]:
!pip install -q transformers arabert gradio

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from arabert.preprocess import ArabertPreprocessor
import gradio as gr


### Redefine Labels for UI Consistency

This step ensures we have access to consistent `label2id` and `id2label` mappings when building the interactive interface or post-processing predictions.


In [None]:
labels = ['culture', 'economy', 'international', 'local', 'religion', 'sports']
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}


### Load Trained Model and Tokenizer

This code loads the previously fine-tuned AraBERT model and tokenizer from the local directory `./my_trained_model`, which was saved after training.


In [None]:
model_name = "aubmindlab/bert-base-arabertv2"  # still needed by ArabertPreprocessor below
# Load both the fine-tuned model and its tokenizer from the local directory
model = AutoModelForSequenceClassification.from_pretrained("./my_trained_model", id2label=id2label, label2id=label2id)
tokenizer = AutoTokenizer.from_pretrained("./my_trained_model")


### Define Prediction Function

This function:
- Preprocesses Arabic text using `ArabertPreprocessor`
- Tokenizes and feeds the text into the trained model
- Returns the predicted class label using the highest logit score


In [None]:
arabert_prep = ArabertPreprocessor(model_name)

def classify_text(text):
    cleaned = arabert_prep.preprocess(text)
    inputs = tokenizer(cleaned, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    predicted_class = torch.argmax(outputs.logits, dim=1).item()
    return id2label[predicted_class]
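
`classify_text` returns only the top label. If a confidence score is also wanted, the logits can be turned into probabilities with a softmax. The sketch below shows that idea on a hand-written list of logits using only the standard library; the helper names `softmax` and `top_prediction` and the example logits are hypothetical, not part of the notebook.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, id2label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


labels = ['culture', 'economy', 'international', 'local', 'religion', 'sports']
id2label = dict(enumerate(labels))

# Made-up logits for a single input
example_logits = [0.1, 0.2, 3.0, 0.0, -1.0, 0.5]
probs = softmax(example_logits)
label, confidence = top_prediction(example_logits, id2label)
print(label, round(confidence, 3))
```

Softmax does not change which class wins, so the label is the same one `torch.argmax` would pick; it only adds a probability that can be shown next to it.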








### Launch Gradio Interface

Creates and launches a simple web interface using `Gradio` where:
- Users input Arabic text
- The classifier returns the predicted category


In [None]:
gr.Interface(fn=classify_text, inputs="text", outputs="text", title="Arabic News Classifier").launch()

## Conclusion

In this task, we implemented a modern transformer-based approach for Arabic text classification using AraBERT.  
Key takeaways include:

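- **AraBERT** supplies both an Arabic-specific preprocessor (`ArabertPreprocessor`) and pretrained weights, so the pipeline needed no hand-built tokenization or feature engineering.
- After a single training epoch, the model reached 94% accuracy and a macro-averaged F1 of 0.94 on the 3,652-example test set, with per-class F1 ranging from 0.88 (`local`) to 0.99 (`religion`, `sports`).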
- The use of HuggingFace's `Trainer` API made training and evaluation efficient.
- A simple **Gradio** interface was created to make the model accessible and user-friendly.


