### Installing Necessary Packages

This cell installs the required libraries for Natural Language Processing (NLP), machine learning, and data visualization. If you already have the packages installed, you can skip this step.


In [None]:
!pip install transformers nltk datasets numpy seaborn pandas scikit-learn matplotlib

### Importing Dependencies

We import all the necessary libraries and modules. These include:

- `pandas`: For loading and manipulating datasets.
- `seaborn` and `matplotlib`: For visualizations.
- `transformers`: For pre-trained NLP models like BERT.
- `nltk`: For text preprocessing tasks such as removing stopwords.
- `datasets`: To handle datasets efficiently in a format compatible with the Hugging Face models.


In [None]:
import pandas as pd
import os
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

### Loading the Training and Test Datasets

We load the training and test datasets from CSV files (`data/train.csv` and `data/test.csv`), skipping any malformed rows. The `Class Index` column, which holds the class labels, is renamed to `label` for consistency with the Hugging Face tooling used later. The labels are then shifted to start at 0, because the classification model expects class IDs in the range 0 to `num_labels - 1`.


In [None]:
# Load the train dataset
train_df = pd.read_csv("data/train.csv", on_bad_lines='skip', engine='python')

# Load the test dataset
test_df = pd.read_csv("data/test.csv", on_bad_lines='skip', engine='python')

# Rename the class label column for consistency
train_df = train_df.rename(columns={'Class Index':'label'})
test_df = test_df.rename(columns={'Class Index':'label'})

# Check the shapes of the dataframes
print(train_df.shape, test_df.shape)

In [None]:
# Adjust labels in train and test datasets to be zero-indexed
train_df['label'] = train_df['label'] - 1
test_df['label'] = test_df['label'] - 1

### Data Statistics Visualization

To check whether the dataset is balanced across class labels, we draw a bar plot showing the frequency of each class in the training data. A strongly skewed plot would point to a class imbalance that needs handling before training.


In [None]:
# Visualize the class distribution in the training dataset
plt.style.use('fivethirtyeight')
plt.figure(figsize=(8,4))
sns.countplot(x=train_df['label'])
plt.show()
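
The visual check can be backed by numbers. The following is a minimal sketch that counts labels and reports the ratio between the largest and smallest class. The helper `imbalance_ratio` and the sample labels are made up for illustration; in the notebook the input would be `train_df['label'].tolist()`.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return (counts, ratio), where ratio = largest class size / smallest class size."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio

# Made-up labels for illustration only
sample_labels = [0, 1, 2, 3, 0, 1, 2, 3, 0, 0]
counts, ratio = imbalance_ratio(sample_labels)
print(dict(sorted(counts.items())))
print(f"imbalance ratio: {ratio:.2f}")
```

A ratio close to 1 means the classes are roughly balanced; what counts as "too skewed" depends on the task.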

### Checking for Null Values

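We check for missing or null values in the datasets. `DataFrame.info()` prints the non-null count for every column; a count lower than the number of rows means that column has missing values. Catching them here helps us avoid errors in the following steps.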


In [None]:
# Check for missing or null values in the training dataset
train_df.info()

# Check for missing or null values in the test dataset
test_df.info()

# Data Preprocessing
### Concatenating Title and Description

In this step, we concatenate the `Title` and `Description` columns into a single `text` column for both training and test datasets. This combined column is used as input for the model since it contains all relevant textual information.


In [None]:
# Concatenate Title and Description columns for the training dataset,
# separated by a space so the last word of the title does not fuse with the description
train_df['text'] = train_df['Title'] + ' ' + train_df['Description']
train_df.drop(columns=['Title', 'Description'], inplace=True)

# Concatenate Title and Description columns for the test dataset
test_df['text'] = test_df['Title'] + ' ' + test_df['Description']
test_df.drop(columns=['Title', 'Description'], inplace=True)
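
Two details matter when joining text fields: without a separator the last word of the title fuses with the first word of the description, and in pandas a missing value in either column turns the whole result into `NaN`. The sketch below shows one way to handle both on plain Python values. `combine_fields` is a hypothetical helper, not part of the notebook.

```python
def combine_fields(title, description, sep=" "):
    """Join two text fields with a separator, skipping missing or empty values."""
    parts = [p.strip() for p in (title, description)
             if isinstance(p, str) and p.strip()]
    return sep.join(parts)

print(combine_fields("Stocks rally", "Markets closed higher on Friday."))
print(combine_fields("Stocks rally", None))          # missing description
print(combine_fields("Stocks rally", float("nan")))  # NaN, as pandas would store it
```

Applied row by row, a helper like this keeps rows with one missing field instead of losing their text entirely.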

### Removing Punctuation

We define a function that replaces backslashes and hyphens with spaces and removes common punctuation marks and digits from the text. This cleaning step is intended to reduce noise in the input data; non-string values are replaced by an empty string.


In [None]:
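# Function to remove punctuation and unwanted characters from the text
def remove_punctuations(text):
    # Non-string values (e.g. NaN from missing fields) become empty strings
    if not isinstance(text, str):
        return ""
    text = re.sub(r'[\\-]', ' ', text)               # backslashes and hyphens -> space
    text = re.sub(r'[,.?;:\'(){}!|0-9]', '', text)   # drop punctuation and digits
    return text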

# Apply punctuation removal for both train and test datasets
train_df['text'] = train_df['text'].apply(remove_punctuations)
test_df['text'] = test_df['text'].apply(remove_punctuations)
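
To see what the two substitutions actually do, it helps to run them on one sentence. The sketch below reuses the same two patterns on a made-up example. The final whitespace collapse is an optional extra that the notebook does not perform.

```python
import re

def remove_punctuations(text):
    text = re.sub(r'[\\-]', ' ', text)               # backslashes and hyphens -> space
    text = re.sub(r'[,.?;:\'(){}!|0-9]', '', text)   # drop punctuation and digits
    return text

sample = "Oil prices hit $50 (a 3-year high); U.S. stocks don't react."
cleaned = remove_punctuations(sample)
print(repr(cleaned))

# Optional extra step (not in the notebook): collapse runs of whitespace
normalized = re.sub(r'\s+', ' ', cleaned).strip()
print(repr(normalized))
```

Note the side effects: digits disappear, characters outside the pattern (such as `$`) remain, abbreviations and contractions lose their punctuation, and removed digits can leave double spaces behind.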

### Removing Stopwords

Stopwords like "the", "is", and "and" are common but carry little information in text classification tasks. We remove stopwords to help the model focus on more meaningful words.


In [None]:
# Load the English stopword list (downloaded earlier) as a set for fast lookups
stopw = set(stopwords.words('english'))

# Function to remove stopwords (case-insensitive, since the NLTK list is lowercase)
def remove_stopwords(text):
    clean_text = []
    for word in text.split(' '):
        if word.lower() not in stopw:
            clean_text.append(word)
    return ' '.join(clean_text)

# Apply stopword removal to both train and test datasets
train_df['text'] = train_df['text'].apply(remove_stopwords)
test_df['text'] = test_df['text'].apply(remove_stopwords)
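
The filtering logic can be tried without downloading anything. This sketch uses a tiny hand-picked stopword set as a stand-in for NLTK's English list (an assumption made only to keep the example self-contained) and compares words case-insensitively.

```python
# A tiny hand-picked stopword set for illustration; the notebook uses NLTK's list instead
STOPWORDS = {"the", "is", "and", "a", "of", "in"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop stopwords, comparing case-insensitively.

    str.split() with no argument also collapses repeated whitespace.
    """
    return " ".join(w for w in text.split() if w.lower() not in stopwords)

result = remove_stopwords("The price of oil is rising in Asia and Europe")
print(result)
```

Keeping the stopwords in a set makes each membership test constant-time, which adds up over a large corpus.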

### Tokenization

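We use the pre-trained `bert-base-uncased` tokenizer to split the text into subword tokens and map them to the input IDs that the BERT model expects. The `pipeline` function converts a pandas DataFrame into a Hugging Face `Dataset`, tokenizes it in batches (truncating sequences that exceed the model's maximum length), and drops the raw `text` column. It is applied to both the training and test datasets.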


In [None]:
# Define the model name and load the tokenizer
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Function to tokenize text
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

# Convert Pandas dataframe to Hugging Face dataset and tokenize it
def pipeline(dataframe):
    dataset = Dataset.from_pandas(dataframe, preserve_index=False)
    tokenized_ds = dataset.map(preprocess_function, batched=True)
    tokenized_ds = tokenized_ds.remove_columns('text')
    return tokenized_ds

# Tokenize the train and test datasets
tokenized_train = pipeline(train_df)
tokenized_test = pipeline(test_df)
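
BERT's tokenizer splits words into subword pieces using a greedy longest-match-first strategy (WordPiece). The following toy sketch illustrates that idea on single lowercase words. The vocabulary is made up for the example; the real tokenizer has a vocabulary of roughly 30,000 entries and also lowercases text, splits on punctuation, and adds special tokens such as `[CLS]` and `[SEP]`.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary
toy_vocab = {"token", "##ization", "##ize", "play", "##ing"}
print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Because rare words decompose into known pieces, the model rarely sees a truly unknown token.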

### Load Pre-trained BERT Model and Set Training Arguments

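We load a pre-trained BERT model with a four-class sequence classification head, create a data collator that pads each batch to the length of its longest sequence, and set the training arguments. These include hyperparameters such as batch size, learning rate, number of epochs, and weight decay; gradient accumulation is left commented out.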


In [None]:
# Pad each batch dynamically to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Load the pre-trained BERT model with a 4-class classification head
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

# Set training arguments for the Trainer
training_args = TrainingArguments(
    output_dir="./results",          # Where to save the model
    save_strategy='epoch',           # Save model after each epoch
    evaluation_strategy='no',        # No evaluation during training
    logging_strategy='no',           # Disable logging of loss/metrics
    learning_rate=2e-5,              # Learning rate
    per_device_train_batch_size=16,  # Batch size
    num_train_epochs=3,              # Number of epochs
    weight_decay=0.01,               # Weight decay for regularization
    report_to="none",                # Disable wandb or any external logging
    #gradient_accumulation_steps=2,   # Accumulate gradients over 2 steps
    log_level="error"                # Suppress most logs
)
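
The batch size and epoch count together determine how many optimizer steps training will take, which is useful for estimating run time. Here is a rough sketch for a single device without gradient accumulation, assuming the last partial batch of each epoch is kept. The helper name and the training-set size of 120,000 are assumptions for illustration, not values taken from the notebook.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps on a single device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch counts
    return steps_per_epoch * num_epochs

# 120_000 is an assumed training-set size, used only for illustration
steps = total_optimizer_steps(120_000, batch_size=16, num_epochs=3)
print(steps)
```

In the notebook, `len(tokenized_train)` would supply the actual number of examples.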


### Training the Model

We initialize the `Trainer` and start training the model on the tokenized training dataset. The model's parameters are fine-tuned using the specified training arguments.


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the model
trainer.train()

In [None]:
print(train_df['label'].unique())  # Should return something like [0, 1, 2, 3]
print(test_df['label'].unique())   # Should return the same range


### Evaluate the Model

After training, we use the trained model to make predictions on the test dataset. We then calculate and print classification metrics such as precision, recall, and F1-score to evaluate the model's performance.


In [None]:
# Import necessary evaluation tools
import numpy as np
from sklearn.metrics import classification_report

# Make predictions on the test dataset
preds = trainer.predict(tokenized_test)
# Pick the highest-scoring class for each example
preds_flat = np.argmax(preds.predictions, axis=1)

# Generate a classification report
print(classification_report(test_df['label'], preds_flat))
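
To make the numbers in the report easier to interpret, the sketch below computes precision, recall, and F1 for one class by hand. The logits and labels are made up, and the helpers `argmax` and `precision_recall_f1` are hypothetical stand-ins for what `np.argmax` and `classification_report` do.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def precision_recall_f1(y_true, y_pred, cls):
    """Per-class precision, recall and F1 from true and predicted label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up model outputs: one row of four class scores per example
logits = [
    [2.0, 0.1, -1.0, 0.3],
    [0.2, 1.5, 0.0, -0.4],
    [1.1, 0.9, 0.2, 0.0],
    [-0.5, 0.0, 0.1, 2.2],
]
y_true = [0, 1, 1, 3]
y_pred = [argmax(row) for row in logits]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, cls=0)
print(y_pred)
print(f"class 0: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here class 0 is predicted twice but is correct only once, so precision drops while recall stays at 1.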

### Comparing Predictions on Sample Test Data

We compare the model's predictions with the actual class labels for a randomly chosen block of up to ten consecutive rows from the test dataset.

This allows for quick visual inspection of the model's performance on individual cases.

In [None]:
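import random

class_labels = ['World', 'Sports', 'Business', 'Sci/Tech']

# Pick a random window of up to 10 consecutive test rows
n_samples = 10
start = random.randint(0, max(0, len(test_df) - n_samples))
sample_df = test_df.iloc[start:start + n_samples]

# Tokenize the window into its own dataset so the full tokenized_test stays intact
sample_ds = pipeline(sample_df).remove_columns('label')
sample_preds = trainer.predict(sample_ds)
sample_pred_ids = np.argmax(sample_preds.predictions, axis=1)

print('Prediction\tActual\n----------------------')
for pred_id, true_id in zip(sample_pred_ids, sample_df['label']):
    print(class_labels[pred_id], ' ---> ', class_labels[true_id])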

### Save the Model

We save the trained model to disk. This allows us to reuse the model later without retraining it.


In [None]:
trainer.save_model('models')

### Loading the Saved Model

Once saved, the model can be reloaded at any time for further use. The loaded model can be used to make new predictions or for additional fine-tuning if needed.

In [None]:
# Reload the saved model from disk
model = AutoModelForSequenceClassification.from_pretrained('models')

# Re-initialize the Trainer with the loaded model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
)