# ACTIVITY: A CLASSIFIER

The goal of this activity is to practice building and discussing a classifier. By the end of the activity, you should be able to justify your design decisions according to 

TASK:

In [32]:
import kagglehub
import os
import pandas as pd
from pathlib import Path

path = kagglehub.dataset_download("hgultekin/bbcnewsarchive")
print(os.listdir(path))
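# NOTE: .sample(5) keeps only 5 random rows, which is why the counts and reports below are so small.
df = pd.read_csv(Path(path) / "bbc-news-data.csv", sep='\t').sample(5)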

['bbc-news-data.csv']


In [33]:
df.head()

Unnamed: 0,category,filename,title,content
2223,tech,400.txt,US cyber security chief resigns,The man making sure US computer networks are ...
1958,tech,135.txt,GTA sequel is criminally good,The Grand Theft Auto series of games have set...
247,business,248.txt,Survey confirms property slowdown,Government figures have confirmed a widely re...
530,entertainment,021.txt,Obituary: Dame Alicia Markova,"Dame Alicia Markova, who has died in Bath age..."
598,entertainment,089.txt,Oscar nominee Dan O'Herlihy dies,"Irish actor Dan O'Herlihy, who was nominated ..."


In [34]:
df['category'].value_counts()

category
tech             2
entertainment    2
business         1
Name: count, dtype: int64

## Baseline classifier

We will start with a baseline classifier: a simple bag-of-words model that feeds TF-IDF features to a logistic regression.

In [35]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(df['content'], df['category'], test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

               precision    recall  f1-score   support

entertainment       0.00      0.00      0.00       0.0
         tech       0.00      0.00      0.00       1.0

     accuracy                           0.00       1.0
    macro avg       0.00      0.00      0.00       1.0
 weighted avg       0.00      0.00      0.00       1.0



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
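The zero scores and the `_warn_prf` warnings above come from a test set with a single document, so some classes have no true or predicted samples and precision or recall is undefined. The following is a minimal sketch, assuming a hypothetical toy label list (not the BBC data), of how `stratify` in `train_test_split` keeps every class in both splits and how `zero_division=0` in `classification_report` makes the handling of undefined metrics explicit:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical toy data: 10 documents per class (not the BBC corpus)
texts = [f"doc {i}" for i in range(30)]
labels = ["tech"] * 10 + ["business"] * 10 + ["entertainment"] * 10

# stratify=labels preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# A deliberately poor predictor that always answers "tech"
y_hat = ["tech"] * len(y_te)

# zero_division=0 reports 0.0 for classes that are never predicted
# instead of emitting an UndefinedMetricWarning
report = classification_report(y_te, y_hat, zero_division=0, output_dict=True)
print(sorted(set(y_te)))
print(report["accuracy"])
```

With a larger, stratified test set, every class has non-zero support, so the per-class scores become meaningful.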


## PART 1: preparation

Answer the questions below as comments in the following cell. In your answers, avoid relying on common sense alone; use appropriate technical terminology.

### Question 1: What is the underlying premise of the Bag-of-Words classifier, that is, why does BoW allow us to classify these texts?

The underlying premise of BoW is to build a document × word matrix in which each cell holds the frequency of a word in a text. This makes it possible to estimate the importance of each word in a document, and hence its weight in the classification.
Therefore, the classification model uses the weights derived from the BoW representation.

word frequencies, without context or word order

importance of each word given the topic (numerically: weights)

use a classification model based on the weights generated by BoW

easy to interpret

### Question 2: What is the underlying premise of a BERT-based classifier, that is, why should BERT embeddings be interesting to classify these texts?

BERT pre-trained for

BERT embeddings are representations of words

## PART 2: action

(a) Make a classifier that uses BERT embeddings to categorize the texts in the dataset we have discussed.

(b) Make a bar plot comparing the accuracy of the BERT-based classifier to that of the Bag-of-Words classifier.

(c) Use a PCA or a t-SNE plot to visualize the documents in the BBC News dataset in the embedding space provided by BERT. Analyze the plot taking into account the confusion matrix or the classification report of your BERT-based classifier.
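For item (b), one possible shape for the comparison is sketched below. The helper `plot_accuracies` and all labels and predictions are hypothetical, for illustration only; in the activity the predictions would come from the two fitted pipelines evaluated on the same test set.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score


def plot_accuracies(y_true, predictions):
    """Bar plot of accuracy per model; predictions maps model name -> predicted labels."""
    names = list(predictions)
    scores = [accuracy_score(y_true, predictions[name]) for name in names]
    fig, ax = plt.subplots()
    ax.bar(names, scores)
    ax.set_ylim(0, 1)
    ax.set_ylabel("Accuracy")
    return fig, scores


# Hypothetical labels and predictions, not real results
y_true = ["tech", "business", "tech", "entertainment"]
fig, scores = plot_accuracies(
    y_true,
    {
        "Bag-of-Words": ["tech", "business", "business", "entertainment"],
        "BERT": ["tech", "tech", "tech", "tech"],
    },
)
```

Fixing the y-axis to the range 0 to 1 keeps small accuracy differences from looking larger than they are.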




In [37]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from tqdm import tqdm


class BertTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, max_length=512):
        self.max_length = max_length
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.model = BertModel.from_pretrained('bert-base-uncased')
        self.model.eval()  # inference only: disables dropout
        self.embeddings = None

    def fit(self, X, y=None):
        return self

    def transform(self, texts):
        # Build a fresh list on every call so that embeddings from fit()
        # do not leak into predict() and change the number of samples.
        embeddings = []
        for text in tqdm(texts, desc="Generating embeddings"):
            inputs = self.tokenizer(text, return_tensors='pt', truncation=True, max_length=self.max_length)
            with torch.no_grad():  # no gradients needed for feature extraction
                outputs = self.model(**inputs)
            cls_embedding = outputs.last_hidden_state[0, 0, :]  # [CLS] embedding, used to represent the whole text
            embeddings.append(cls_embedding.numpy())
        # Shape (n_texts, hidden_size); holds the embeddings of the most recent call only
        self.embeddings = np.vstack(embeddings)
        return self.embeddings

pipeline = Pipeline([
    ('bert', BertTransformer()),
    ('logreg', LogisticRegression(max_iter=1000))
])


pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

Generating embeddings: 100%|██████████| 4/4 [00:01<00:00,  3.13it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  2.73it/s]

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(2)
# Embed every document with the fitted BERT step, keeping one row per document
bert = pipeline.named_steps["bert"]
embeddings = np.asarray(bert.transform(df['content']))[-len(df):]
e_pca = pca.fit_transform(embeddings)
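For item (c), the projection and plot can be sketched as follows. To keep the example self-contained, it uses randomly generated vectors as a hypothetical stand-in for BERT embeddings; the class names, cluster centers, and noise level are assumptions and say nothing about how the real embeddings are distributed.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for BERT embeddings: 3 classes, 20 vectors each, 768 dimensions
labels = np.repeat(["business", "entertainment", "tech"], 20)
centers = rng.normal(size=(3, 768))
embeddings = np.repeat(centers, 20, axis=0) + 0.1 * rng.normal(size=(60, 768))

# Project to 2 dimensions; each row of points_2d is one document
pca = PCA(n_components=2)
points_2d = pca.fit_transform(embeddings)

# One scatter series per class so the legend maps colors to categories
fig, ax = plt.subplots()
for label in np.unique(labels):
    mask = labels == label
    ax.scatter(points_2d[mask, 0], points_2d[mask, 1], label=label)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.legend()
```

When reading such a plot against a confusion matrix, classes whose points overlap in the projection are natural candidates for the pairs the classifier confuses, although two principal components may hide separation that exists in the full embedding space.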