# NLP Project: A Comparative Analysis of Petition Classification Using ML and DL Models

In [1]:
!pip install -q --no-deps datasets transformers scikit-learn multiprocess joblib pyarrow xxhash dill


## Step 1: Install & Import Dependencies

We start by installing and importing the libraries used throughout the notebook.

In [2]:
import numpy as np
import pandas as pd

In [3]:
# Core imports used throughout the notebook
import re
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
import joblib
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

from transformers import BertTokenizer, BertModel



## Step 2: Load the EURLEX Dataset

We use the `eurlex` dataset from Hugging Face and work with its training split. Each document (petition) is associated with multiple EUROVOC concepts, making this a **multi-label classification problem**.


In [4]:

dataset = load_dataset("eurlex",trust_remote_code=True)

df = pd.DataFrame({
    'text': dataset['train']['text'],
    'labels': dataset['train']['eurovoc_concepts']
})
df = df.dropna()


Generating train split:   0%|          | 0/45000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/6000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/6000 [00:00<?, ? examples/s]


## Step 3: Preprocess the Text

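We lowercase the text, strip every character that is not a letter or whitespace (which also removes digits and punctuation), collapse repeated whitespace, and lemmatize each token with WordNet. This helps the models focus on meaning rather than formatting inconsistencies.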
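
The regex part of this cleaning can be tried in isolation. The sketch below is a minimal illustration that uses only the standard library; the helper name `normalize` and the sample sentence are made up for this example, and lemmatization is left out because it needs the NLTK data downloads.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop non-letter characters, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-zA-Z\s]", "", text)   # removes digits and punctuation too
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample sentence, not taken from the dataset
sample = "Council Regulation (EC) No 1234/2007 of 22 October 2007."
cleaned = normalize(sample)
print(cleaned)  # council regulation ec no of october
```

Note that regulation numbers and dates disappear entirely, which may or may not be desirable for legal text.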


In [5]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()

    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return " ".join(lemmatized_tokens)

df['text'] = df['text'].apply(preprocess_text)
texts = df['text'].tolist()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.



## Step 4: MultiLabel Binarization

Since each document can belong to **multiple categories**, we convert the label list into a binary matrix using `MultiLabelBinarizer`.

Each column in the resulting matrix represents a EUROVOC label. A value of 1 means the document is assigned to that category.
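
To make the binary matrix concrete, here is a pure-Python sketch of the same idea. It is an illustration only, not scikit-learn's implementation; the function name `binarize` and the label IDs are hypothetical.

```python
def binarize(label_lists):
    """Turn a list of label lists into (sorted classes, 0/1 matrix)."""
    classes = sorted({label for labels in label_lists for label in labels})
    column = {label: i for i, label in enumerate(classes)}
    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1   # mark every label assigned to this document
        matrix.append(row)
    return classes, matrix

# Hypothetical concept IDs for three documents
classes, matrix = binarize([["192", "2356"], ["2356"], ["889", "192"]])
print(classes)  # ['192', '2356', '889']
print(matrix)   # [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
```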


In [6]:
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df['labels'].tolist())
joblib.dump(mlb, "multilabel_binarizer.pkl")

['multilabel_binarizer.pkl']

In [7]:
np.unique(y[0])

array([0, 1])


## Step 5: Split Data

We hold out 20% of the data as a test set (with `random_state=42` for reproducibility) to evaluate how well our models perform on unseen data.


In [8]:

X_train, X_test, y_train, y_test = train_test_split(texts, y, test_size=0.2, random_state=42)



## Step 6: Create Text Embeddings

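We convert the raw text into numerical form using:
- **TF-IDF**: Weights each word by its frequency in a document and its rarity across the corpus, with the vocabulary capped at 10,000 terms. Used by the classical models (Models 1 and 2).
- **BERT embeddings**: Contextual token representations, computed inside the deep learning models (Models 3 and 4).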
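
As a rough illustration of how TF-IDF weights come about, the sketch below computes them by hand. It assumes whitespace tokenization and the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches scikit-learn's documented defaults; the toy documents and the function name `tfidf` are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

docs = ["ban plastic in supermarkets", "ban smoking in parks", "tax relief in cities"]
rows = tfidf(docs)
for term in ("plastic", "ban", "in"):
    print(term, round(rows[0][term], 3))
```

In the first document, `"plastic"` (one document) outweighs `"ban"` (two documents), which in turn outweighs `"in"` (all three).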


In [9]:

# TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=10000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
joblib.dump(tfidf, "tfidf_vectorizer.pkl")

['tfidf_vectorizer.pkl']

In [10]:
import warnings
warnings.filterwarnings('ignore')

## Model 1: TF-IDF + Naive Bayes

In this model, we use TF-IDF to convert the petition text into numerical features that reflect the importance of words in the dataset. We then apply a Multinomial Naive Bayes classifier wrapped in a One-vs-Rest strategy to handle the multi-label nature of the task.

#### How It Works:

Naive Bayes is based on Bayes' Theorem, which calculates the probability of a label given a set of features. It assumes that all features (words) are independent of each other given the label. This assumption is often violated in language, yet the model is still surprisingly effective.

**Example**:  
Given a petition like `"Ban plastic usage in supermarkets"`, TF-IDF might assign high weights to `"plastic"` and `"ban"`. The Naive Bayes model will then estimate:

- `P(Environment | features)`  
- `P(Policy | features)`

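Because of the One-vs-Rest wrapper, each label has its own binary classifier, and every label whose estimated probability exceeds 0.5 is assigned to the petition.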
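
The following is a minimal sketch of One-vs-Rest Naive Bayes, written from scratch so the arithmetic is visible. It is not scikit-learn's implementation: it uses raw word counts instead of TF-IDF for readability, and the training sentences, label assignments, and function names are all hypothetical.

```python
import math
from collections import Counter

def train_binary_nb(docs, targets, alpha=1.0):
    """Fit a two-class multinomial Naive Bayes model for a single label."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    word_counts = {0: Counter(), 1: Counter()}
    for doc, target in zip(docs, targets):
        word_counts[target].update(doc.split())
    model = {"log_prior": {}, "log_likelihood": {}}
    for cls in (0, 1):
        # Laplace smoothing: add alpha to every word count
        total = sum(word_counts[cls].values()) + alpha * len(vocab)
        model["log_prior"][cls] = math.log(targets.count(cls) / len(targets))
        model["log_likelihood"][cls] = {
            word: math.log((word_counts[cls][word] + alpha) / total) for word in vocab
        }
    return model

def positive_probability(model, doc):
    """P(label = 1 | doc); words outside the vocabulary are ignored."""
    scores = {}
    for cls in (0, 1):
        likelihood = model["log_likelihood"][cls]
        scores[cls] = model["log_prior"][cls] + sum(
            likelihood[word] for word in doc.split() if word in likelihood
        )
    top = max(scores.values())
    exp_scores = {cls: math.exp(score - top) for cls, score in scores.items()}
    return exp_scores[1] / (exp_scores[0] + exp_scores[1])

docs = ["ban plastic bags", "reduce plastic waste",
        "increase school funding", "lower income tax"]
label_targets = {"Environment": [1, 1, 0, 0], "Policy": [1, 0, 1, 1]}

# One independent binary model per label (One-vs-Rest)
models = {label: train_binary_nb(docs, t) for label, t in label_targets.items()}
petition = "ban plastic usage"
probs = {label: positive_probability(m, petition) for label, m in models.items()}
predicted = [label for label, p in probs.items() if p > 0.5]
print(predicted)  # ['Environment', 'Policy']
```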


In [11]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier

# Naive Bayes wrapped for multi-label
model_nb = OneVsRestClassifier(MultinomialNB())
model_nb.fit(X_train_tfidf, y_train)

# Predict & evaluate
preds_nb = model_nb.predict(X_test_tfidf)

joblib.dump(model_nb, "naive_bayes_model.pkl")

f1_nb = f1_score(y_test, preds_nb, average='micro')
precision_nb = precision_score(y_test, preds_nb, average='micro')
recall_nb = recall_score(y_test, preds_nb, average='micro')

print("Metrics :")
print("F1 Score:", f1_nb)
print("Precision:", precision_nb)
print("Recall:", recall_nb)


Metrics :
F1 Score: 0.26172502261023645
Precision: 0.8599151000606429
Recall: 0.15435189619889406



---

## Model 2: TF-IDF + Passive Aggressive Classifier

In this model, we use TF-IDF to convert petition texts into numerical feature vectors that reflect the relative importance of words across the dataset. These vectors are then passed into a Passive Aggressive Classifier, which is wrapped in a One-vs-Rest strategy to support multi-label classification.

#### How It Works:

The Passive Aggressive algorithm is an online learning model, meaning it updates its parameters incrementally as it sees new data. The core idea is:

- **Passive**: If the prediction is correct and falls within an acceptable margin, the model remains unchanged.
- **Aggressive**: If the prediction is incorrect or violates the margin, the model aggressively updates its weights to correct the error.

**Example**:  
Consider the following two petition texts:

1. `"Ban plastic usage in supermarkets"`  
2. `"Increase green energy subsidies"`

After applying TF-IDF, each document is transformed into a vector of word weights. For example:

- Rarer terms such as `"plastic"` or `"subsidies"` receive high weights because they appear in few documents.
- Common words like `"in"` are down-weighted because they appear in almost every document.

Suppose the true labels for the first petition are `["Environment", "Policy"]`. Under One-vs-Rest, each label has its own binary classifier. If, during training, only the `"Environment"` classifier fires, the `"Policy"` classifier has made a mistake on this document, so it updates its weights to better recognize similar text in the future.
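
The passive and aggressive cases can be sketched for a single binary classifier as follows. This uses the PA-I variant of the update rule with a hinge loss and labels in {-1, +1}; it is an illustration of the idea, not scikit-learn's internals, and the function name `pa_update` and the feature values are hypothetical.

```python
def pa_update(weights, x, y, C=1.0):
    """One Passive Aggressive (PA-I) step for a label y in {-1, +1}."""
    margin = y * sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - margin)            # hinge loss
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return list(weights)                 # passive: margin satisfied, no change
    tau = min(C, loss / sq_norm)             # aggressive: step size capped by C
    return [w + tau * y * xi for w, xi in zip(weights, x)]

x = [0.6, 0.8]                               # hypothetical TF-IDF features
passive = pa_update([3.0, 4.0], x, y=1)      # margin is 5.0, so nothing changes
aggressive = pa_update([0.0, 0.0], x, y=1, C=0.5)
print(passive)
print(aggressive)
```

The cap `C` limits how far a single (possibly noisy) example can move the weights.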


In [12]:
from sklearn.linear_model import PassiveAggressiveClassifier

model_pa = OneVsRestClassifier(PassiveAggressiveClassifier(max_iter=50))
model_pa.fit(X_train_tfidf, y_train)

preds_pa = model_pa.predict(X_test_tfidf)

joblib.dump(model_pa, "passive_aggressive_model.pkl")

f1_pa = f1_score(y_test, preds_pa, average='micro')
precision_pa = precision_score(y_test, preds_pa, average='micro')
recall_pa = recall_score(y_test, preds_pa, average='micro')

print("Metrics :")
print("F1 Score:", f1_pa)
print("Precision:", precision_pa)
print("Recall:", recall_pa)


Metrics :
F1 Score: 0.6887245796733256
Precision: 0.7660277770372937
Recall: 0.6255932424783385


---

## Model 3: BERT + GRU

This model combines the power of contextual word embeddings from BERT with the sequential modeling capabilities of a Gated Recurrent Unit (GRU). It is designed to capture the semantic meaning of the text as well as its order and structure.

#### How It Works:

- First, BERT (kept frozen in this notebook) generates contextual embeddings for each token in the petition text. These embeddings capture the meaning of words in context.
- These token embeddings are then passed through a bidirectional **GRU layer**, which models sequential dependencies in the data.
- Finally, the GRU outputs are mean-pooled over the sequence, and a linear layer with a sigmoid produces one probability per label.

**Example**:  
For a petition like `"Increase green energy subsidies"`:

- BERT captures that `"green"` and `"energy"` are semantically connected.
- The GRU processes these embeddings in order, helping the model capture sequential dependencies, for example that “increase subsidies” relates to policy or environment.

#### Key Characteristics:

- **BERT** provides rich contextual understanding of language.
- **GRU** is efficient for sequence modeling and less complex than LSTMs.
- The combination balances **context awareness and computational efficiency**.
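
The claim that a GRU is lighter than an LSTM can be checked with simple arithmetic. The sketch below counts trainable parameters for a single bidirectional recurrent layer, assuming PyTorch's layout (an input weight matrix, a hidden weight matrix, and two bias vectors per gate block; three blocks for a GRU, four for an LSTM). The helper name `recurrent_params` is hypothetical; the sizes 768 and 128 come from the code in this notebook.

```python
def recurrent_params(input_size, hidden_size, gate_blocks, bidirectional=True):
    """Parameter count of one recurrent layer under the assumed layout."""
    per_direction = gate_blocks * (
        hidden_size * (input_size + hidden_size)  # input and hidden weight matrices
        + 2 * hidden_size                         # two bias vectors
    )
    return per_direction * (2 if bidirectional else 1)

gru_params = recurrent_params(768, 128, gate_blocks=3)
lstm_params = recurrent_params(768, 128, gate_blocks=4)
print(gru_params, lstm_params)  # 689664 919552
```

Under these assumptions the GRU layer has three quarters of the LSTM layer's parameters; both are tiny next to the frozen BERT encoder.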


In [13]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import f1_score, precision_score, recall_score
import numpy as np

# Device config
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tokenize all texts using BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize train and test separately
train_encodings = tokenizer(X_train, padding=True, truncation=True, max_length=256, return_tensors="pt")
test_encodings = tokenizer(X_test, padding=True, truncation=True, max_length=256, return_tensors="pt")

# Custom dataset
class PetitionDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = torch.tensor(labels, dtype=torch.float32)
    def __len__(self): return len(self.labels)
    def __getitem__(self, idx):
        return {
            'input_ids': self.encodings['input_ids'][idx],
            'attention_mask': self.encodings['attention_mask'][idx],
            'labels': self.labels[idx]
        }

train_dataset = PetitionDataset(train_encodings, y_train)
test_dataset = PetitionDataset(test_encodings, y_test)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# BERT + GRU model
class BERTGRUClassifier(nn.Module):
    def __init__(self, output_dim, hidden_dim=128, num_layers=1, dropout=0.3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for param in self.bert.parameters():
            param.requires_grad = False

        self.gru = nn.GRU(input_size=768, hidden_size=hidden_dim, num_layers=num_layers,
                          batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():
            output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        x = output.last_hidden_state
        x, _ = self.gru(x)
        pooled = torch.mean(x, dim=1)
        x = self.dropout(pooled)
        return torch.sigmoid(self.fc(x))

# Initialize model
model_bert_gru = BERTGRUClassifier(output_dim=y.shape[1]).to(device)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model_bert_gru.parameters(), lr=1e-3)

# Train model
for epoch in range(5):  # increase epochs here
    model_bert_gru.train()
    total_loss = 0
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()
        outputs = model_bert_gru(input_ids, attention_mask)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    print(f"Epoch {epoch+1} Loss: {total_loss / len(train_loader):.4f}")

# Evaluation on full test set
model_bert_gru.eval()
all_preds = []
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        outputs = model_bert_gru(input_ids, attention_mask)
        preds = (outputs > 0.5).int().cpu().numpy()
        all_preds.append(preds)

bert_gru_preds = np.vstack(all_preds)

# Convert y_test if needed
true_labels = y_test
if hasattr(true_labels, "toarray"):
    true_labels = true_labels.toarray()
elif hasattr(true_labels, "values"):
    true_labels = true_labels.values


torch.save(model_bert_gru, "bert_gru_model.pt")

# Evaluation scores
f1_bert_gru = f1_score(true_labels, bert_gru_preds, average='micro')
precision_bert_gru = precision_score(true_labels, bert_gru_preds, average='micro')
recall_bert_gru = recall_score(true_labels, bert_gru_preds, average='micro')

print("Metrics :")
print("F1 Score:", f1_bert_gru)
print("Precision:", precision_bert_gru)
print("Recall:", recall_bert_gru)


Epoch 1 Loss: 0.0100
Epoch 2 Loss: 0.0061
Epoch 3 Loss: 0.0046
Epoch 4 Loss: 0.0039
Epoch 5 Loss: 0.0034
Metrics :
F1 Score: 0.534484567762438
Precision: 0.8580642056119748
Recall: 0.38812208821352373


---

## Model 4: BERT + BiLSTM

This model enhances BERT embeddings with a Bi-directional Long Short-Term Memory (BiLSTM) network, enabling the model to capture dependencies from both past and future contexts in the sequence.

#### How It Works:

- BERT (kept frozen, as in Model 3) produces contextual embeddings for each token in the text.
- These embeddings are passed through a **BiLSTM layer**, which processes the sequence in both forward and backward directions.
- The combined outputs from both directions are mean-pooled and used to make multi-label predictions.

**Example**:  

For a petition such as `"Implement stricter air quality laws"`:

- BERT understands `"air quality"` as a meaningful phrase.
- BiLSTM captures dependencies like “stricter laws” even if they are far apart in the sentence.
- This allows the model to associate the petition with all three labels `["Health", "Environment", "Policy"]`.

#### Key Characteristics:

- Combines **deep contextual embeddings** with **sequential reasoning**.
- **BiLSTM** captures long-range dependencies better than simple RNNs, though not necessarily better than a GRU.
- An LSTM has more parameters than a GRU of the same size, so it is more expressive in principle but computationally heavier.
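
Both deep models end the same way: a sigmoid turns each output score into a probability, and every label above a threshold is predicted. The sketch below illustrates that last step in plain Python; the label names, scores, and the helper `predict_labels` are hypothetical. The notebook's code uses a fixed threshold of 0.5.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_labels(scores, label_names, threshold=0.5):
    """Return every label whose sigmoid probability exceeds the threshold."""
    probs = [sigmoid(z) for z in scores]
    return [name for name, p in zip(label_names, probs) if p > threshold]

labels = ["Environment", "Health", "Policy"]
scores = [2.0, -0.3, -3.0]                    # hypothetical pre-sigmoid outputs
default_preds = predict_labels(scores, labels)
relaxed_preds = predict_labels(scores, labels, threshold=0.4)
print(default_preds)  # ['Environment']
print(relaxed_preds)  # ['Environment', 'Health']
```

Lowering the threshold admits more labels, which typically raises recall at the cost of precision.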


In [14]:
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import f1_score, precision_score, recall_score
import numpy as np

# Device config
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize train and test text
train_encodings = [tokenizer(text, padding='max_length', truncation=True, max_length=256) for text in X_train]
test_encodings = [tokenizer(text, padding='max_length', truncation=True, max_length=256) for text in X_test]

# Custom dataset
class PetitionDataset(Dataset):
    def __init__(self, encodings, labels):
        self.input_ids = [e['input_ids'] for e in encodings]
        self.attn_mask = [e['attention_mask'] for e in encodings]
        self.labels = labels

    def __len__(self): return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids': torch.tensor(self.input_ids[idx], dtype=torch.long),
            'attention_mask': torch.tensor(self.attn_mask[idx], dtype=torch.long),
            'labels': torch.tensor(self.labels[idx], dtype=torch.float32)
        }

# Dataloaders
train_dataset = PetitionDataset(train_encodings, y_train)
test_dataset = PetitionDataset(test_encodings, y_test)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# BERT + BiLSTM model
class BERTBiLSTMClassifier(nn.Module):
    def __init__(self, output_dim, hidden_dim=128, num_layers=1, dropout=0.3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")

        # Freeze BERT weights
        for param in self.bert.parameters():
            param.requires_grad = False

        self.bilstm = nn.LSTM(input_size=768,
                              hidden_size=hidden_dim,
                              num_layers=num_layers,
                              batch_first=True,
                              bidirectional=True)

        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():
            output = self.bert(input_ids=input_ids, attention_mask=attention_mask)

        x = output.last_hidden_state  # [batch, seq_len, 768]
        x, _ = self.bilstm(x)         # [batch, seq_len, hidden*2]
        x = torch.mean(x, dim=1)      # [batch, hidden*2]
        x = self.dropout(x)
        return torch.sigmoid(self.fc(x))  # Multi-label output

# Initialize model
model_bilstm = BERTBiLSTMClassifier(output_dim=y.shape[1]).to(device)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model_bilstm.parameters(), lr=1e-3)

# Training loop
for epoch in range(5):
    model_bilstm.train()
    total_loss = 0
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()
        outputs = model_bilstm(input_ids, attention_mask)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    print(f"Epoch {epoch+1} Loss: {total_loss / len(train_loader):.4f}")

# Inference on test set
model_bilstm.eval()
all_preds = []
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        outputs = model_bilstm(input_ids, attention_mask)
        preds = (outputs > 0.5).int().cpu().numpy()
        all_preds.append(preds)

bilstm_preds = np.vstack(all_preds)

# Format y_test
true_labels = y_test
if hasattr(true_labels, "toarray"):
    true_labels = true_labels.toarray()
elif hasattr(true_labels, "values"):
    true_labels = true_labels.values


torch.save(model_bilstm, "bert_bilstm_model.pt")

# Metrics
f1_bilstm = f1_score(true_labels, bilstm_preds, average='micro')
precision_bilstm = precision_score(true_labels, bilstm_preds, average='micro')
recall_bilstm = recall_score(true_labels, bilstm_preds, average='micro')

print("Metrics :")
print("F1 Score:", f1_bilstm)
print("Precision:", precision_bilstm)
print("Recall:", recall_bilstm)


Epoch 1 Loss: 0.0101
Epoch 2 Loss: 0.0066
Epoch 3 Loss: 0.0056
Epoch 4 Loss: 0.0049
Epoch 5 Loss: 0.0044
Metrics :
F1 Score: 0.3380996787897627
Precision: 0.8176578683595055
Recall: 0.21311011451212608



## Final Comparison of All Models


In [15]:
pd.DataFrame({
    "Model": [
        "TF-IDF + Naive Bayes",
        "TF-IDF + Passive Aggressive Classifier",
        "BERT + GRU",
        "BERT + BiLSTM"
    ],
    "F1 Score": [
        f1_nb, f1_pa, f1_bert_gru, f1_bilstm
    ],
    "Precision": [
        precision_nb, precision_pa, precision_bert_gru, precision_bilstm
    ],
    "Recall": [
        recall_nb, recall_pa, recall_bert_gru, recall_bilstm
    ]
})


| | Model | F1 Score | Precision | Recall |
|---|---|---|---|---|
| 0 | TF-IDF + Naive Bayes | 0.261725 | 0.859915 | 0.154352 |
| 1 | TF-IDF + Passive Aggressive Classifier | 0.688725 | 0.766028 | 0.625593 |
| 2 | BERT + GRU | 0.534485 | 0.858064 | 0.388122 |
| 3 | BERT + BiLSTM | 0.3381 | 0.817658 | 0.21311 |


### Observations

#### 1. TF-IDF + Passive Aggressive Classifier  
This model stands out with the **highest F1 score (0.6887)** and **highest recall (0.6256)** of the four models, although its precision (0.7660) is the lowest. It balances precision and recall effectively, meaning it not only predicts confidently but also captures a wide range of relevant labels.  
This makes it a strong choice for multi-label text classification where missing labels can be costly.

#### 2. TF-IDF + Naive Bayes  
This model achieves **very high precision (0.8599)**, indicating that when it makes a prediction, it's usually correct. However, the **recall is quite low (0.1544)**, meaning it misses many relevant labels.  
It's useful in scenarios where it's more important to avoid false positives than to find all possible labels — but not ideal if full label coverage is the goal.

#### 3. BERT + GRU  
This hybrid deep learning model performs well, with **strong precision (0.8581)** and a **moderate recall (0.3881)**.  
Its F1 score (0.5345) suggests a decent balance, making it more versatile than Naive Bayes or BiLSTM. The model captures contextual relationships through BERT and sequence flow via the GRU. However, with BERT frozen, inputs truncated to 256 tokens, and only five training epochs, it likely needs further tuning (for example, fine-tuning BERT or training longer) to match the performance of simpler, well-tuned classical models.

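#### 4. BERT + BiLSTM  
Despite pairing frozen BERT embeddings with a bidirectional LSTM capable of learning long-term dependencies, **BERT + BiLSTM underperforms** here.  
Its F1 score (0.3381) and low recall (0.2131), combined with high precision (0.8177), suggest the model predicts too few labels rather than too many. Its training loss after five epochs (0.0044) is also still above that of the GRU model (0.0034), which points toward under-training more than overfitting. Possible causes include too few epochs, suboptimal hyperparameters, or the fixed 0.5 decision threshold. A lack of contextual embeddings is not the cause, since BERT is already part of the pipeline.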
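
All scores in this comparison are micro-averaged, meaning true positives, false positives, and false negatives are pooled over every document and label before the ratios are taken. The sketch below shows that computation on a made-up example; the function name `micro_scores` and the toy matrices are hypothetical.

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 for 0/1 label matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A cautious classifier: never wrong when it predicts, but misses two labels
y_true = [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
precision, recall, f1 = micro_scores(y_true, y_pred)
print(precision, recall, f1)
```

The toy classifier mirrors the pattern seen in Naive Bayes and the BERT-based models: perfect precision, recall of 0.6, and an F1 of 0.75 that sits closer to the weaker of the two.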

---

### Conclusion

- **TF-IDF + Passive Aggressive** provides the best balance between capturing relevant labels and making accurate predictions.
- **Naive Bayes** is highly precise but very conservative.
- **BERT + GRU** is promising, especially for context-heavy text, but could benefit from optimization.
- **BERT + BiLSTM**, although powerful in theory, needs better tuning or longer training to perform well.