## Preprocessing Exercise

As an exercise, we want to build a model that classifies documents from a collection of BBC news articles. The dataset is available at <https://www.kaggle.com/c/learn-ai-bbc>.

The data consists of a training set and a test set. The training set contains 1490 documents and the test set contains 735 documents. Each document is a news article labeled with one of five categories: business, entertainment, politics, sport, or technology (`tech` in the data).

- The first goal is to create a model that classifies the documents into two categories, politics and not politics, and to evaluate its predictive performance.
- Next, remove the stopwords and evaluate the model again.
- Next, use stemming and evaluate the model again.
- Next, use lemmatization and evaluate the model again.
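
Since the same fit-and-score procedure is repeated for every preprocessing variant, it may help to wrap it in a small helper. The sketch below is one possible way to do this; the function name `evaluate_vectorizer` and the toy corpus are hypothetical and serve only to make the example runnable. With the real data, `texts` and `labels` would be `df["Text"]` and `df["IsPolitics"]`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def evaluate_vectorizer(vectorizer, texts, labels, seed=1):
    """Fit a logistic regression on token counts; return (train_acc, test_acc)."""
    txt_train, txt_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    # The vocabulary is learned from the training texts only
    X_train = vectorizer.fit_transform(txt_train)
    X_test = vectorizer.transform(txt_test)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)


# Hypothetical toy corpus, used only so that the sketch runs on its own
texts = [
    "the minister spoke about the election",
    "the party won the election",
    "labour and the prime minister met",
    "the mp joined the party",
    "the film won an award",
    "the company reported a profit",
    "the team won the match",
    "new music from the band",
]
labels = [True, True, True, True, False, False, False, False]

# One vectorizer per preprocessing variant; stemming and lemmatization
# variants would be added here through the `tokenizer` argument
variants = {
    "baseline": CountVectorizer(),
    "no stopwords": CountVectorizer(stop_words="english"),
}
results = {name: evaluate_vectorizer(vec, texts, labels) for name, vec in variants.items()}
for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train={train_acc:.2f}, test={test_acc:.2f}")
```

Keeping the split seed fixed across variants means that differences in test accuracy come from the preprocessing and not from a different train/test partition.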


In [1]:
import pandas as pd

df = pd.read_csv("https://github.com/febse/data/raw/refs/heads/main/ta/BBC%20News%20Train.csv.zip").sample(frac=0.8, random_state=1)

df["IsPolitics"] = df["Category"] == "politics"
df.head()

| | ArticleId | Text | Category | IsPolitics |
|---|---|---|---|---|
| 91 | 1756 | 2d metal slug offers retro fun like some drill... | tech | False |
| 1103 | 1108 | blair stresses prosperity goals tony blair say... | politics | True |
| 909 | 1955 | weak dollar trims cadbury profits the world s ... | business | False |
| 683 | 63 | court rejects $280bn tobacco case a us governm... | business | False |
| 561 | 293 | christmas song formula unveiled a formula for... | entertainment | False |


In [2]:
df["IsPolitics"].value_counts(normalize=True)

IsPolitics
False    0.810403
True     0.189597
Name: proportion, dtype: float64
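
The class proportions above matter for interpreting accuracy: a classifier that always predicts the majority class ("not politics") would already be right about 81% of the time. A minimal sketch of that reference value follows; the helper name `majority_baseline_accuracy` and the toy labels are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / sum(counts.values())


# Hypothetical labels that roughly mimic the 81/19 split shown above
toy_labels = [False] * 81 + [True] * 19
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

The helper accepts any iterable of labels, so it could also be called on `df["IsPolitics"]`. Any model accuracy should be compared against this value rather than against 50%.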

In [3]:
# Import the libraries

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import wordnet

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package wordnet to /home/amarov/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/amarov/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/amarov/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [4]:
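class LemmaTokenizer:
    """Tokenize a document and lemmatize each token using its part-of-speech tag."""

    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def get_pos_tag(self, tag):
        # Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS constant
        if tag.startswith('J'):
            return wordnet.ADJ
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        else:
            # All remaining tags are treated as nouns
            return wordnet.NOUN

    def __call__(self, doc):
        # Tag each token first, because the lemma depends on the part of speech
        tokens_with_pos = nltk.pos_tag(word_tokenize(doc))
        return [self.wnl.lemmatize(w, self.get_pos_tag(tag)) for w, tag in tokens_with_pos]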

# Bag-of-words counts: tokenize with LemmaTokenizer, then drop scikit-learn's English stopwords

vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(), stop_words="english")

# Split the data into training and test sets


X_train_txt, X_test_txt, y_train, y_test = train_test_split(df['Text'], df['IsPolitics'], test_size=0.2, random_state=1)

X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

# Fit a logistic regression model

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate the model on the training and test sets
print("Training accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))

Training accuracy: 1.0
Test accuracy: 0.9790794979079498
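
Because only about 19% of the documents are politics articles, accuracy alone can hide how well the model finds the minority class. Precision and recall for the politics class are a common complement. The sketch below uses hypothetical toy labels (not results from the BBC data) to show how the three metrics can diverge.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = politics, 0 = not politics (2 of 10 documents are politics)
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# The toy "model" finds only one of the two politics documents
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)    # 9 of 10 predictions are correct
prec = precision_score(y_true, y_pred)  # 1 of 1 predicted politics documents is correct
rec = recall_score(y_true, y_pred)      # 1 of 2 actual politics documents is found

print("Accuracy:", acc)    # 0.9
print("Precision:", prec)  # 1.0
print("Recall:", rec)      # 0.5
```

The same functions could be applied to `y_test` and `model.predict(X_test)` to check whether the high test accuracy above also holds for the politics class.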


In [5]:
# Print the most important words for each class

feature_names = np.array(vectorizer.get_feature_names_out())
sorted_coef_index = model.coef_[0].argsort()

print("Non-Politics words:", feature_names[sorted_coef_index[:10]])
print("Politics words:", feature_names[sorted_coef_index[-10:]])

Non-Politics words: ['$' 'u' 'company' '%' 'firm' 'online' 's' 'win' 'film' 'music']
Politics words: ['brown' 'mr' 'britain' 'labour' 'minister' 'secretary' 'mp' 'election'
 'blair' 'party']
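
The list of non-politics words contains tokens such as `$`, `%`, `s`, and `u` (the latter is probably "us" after lemmatization). These are punctuation marks or fragments rather than meaningful words. One option, sketched below under the assumption that such tokens carry little information, is to keep only alphabetic tokens of a minimum length. The helper `keep_token` and its `min_length` parameter are hypothetical.

```python
def keep_token(token, min_length=2):
    """Keep purely alphabetic tokens with at least `min_length` characters."""
    return token.isalpha() and len(token) >= min_length


# Tokens taken from the output above, plus one mixed alphanumeric example
tokens = ["$", "u", "company", "%", "firm", "s", "mr", "280bn"]
filtered = [t for t in tokens if keep_token(t)]
print(filtered)  # ['company', 'firm', 'mr']
```

Such a filter could be applied inside `LemmaTokenizer.__call__` before lemmatizing. Whether it improves test accuracy is an empirical question: symbols like `$` and `%` appear to be useful signals for business articles, so removing them may also cost some performance.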


## Exercise

Create a class `StemTokenizer` that uses the `PorterStemmer` from the `nltk` library to tokenize and stem a text. The class should have a `__call__` method that receives a text and returns a list of stemmed tokens. Use the `LemmaTokenizer` class as a reference.

- Fit a logistic regression model to the training data, using the `StemTokenizer` class as the tokenizer.
- Compare the test performance of the model with the previous models.
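
As a starting point, one possible shape for such a class is sketched below. To keep the example free of NLTK data downloads it splits the text with a simple regular expression, which is an assumption made for this sketch; replacing that step with `word_tokenize`, as in `LemmaTokenizer`, would follow the reference class more closely.

```python
import re

from nltk.stem import PorterStemmer


class StemTokenizer:
    """Split a document into lowercase alphabetic tokens and stem each one."""

    def __init__(self):
        self.stemmer = PorterStemmer()
        # Assumption for this sketch: a token is a run of the letters a-z
        self.token_pattern = re.compile(r"[a-z]+")

    def __call__(self, doc):
        tokens = self.token_pattern.findall(doc.lower())
        return [self.stemmer.stem(token) for token in tokens]


tokenizer = StemTokenizer()
stems = tokenizer("running cats")
print(stems)  # ['run', 'cat']
```

An instance can be passed to the vectorizer in the same way as the lemmatizer, for example `CountVectorizer(tokenizer=StemTokenizer(), stop_words="english")`. Unlike lemmas, stems are not guaranteed to be dictionary words, so the most important features of the stemmed model may look truncated.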
