# BBC News Text Classification (ML Prototype)

## Goal of this Notebook
In this notebook, we will build a **text classification machine learning model**
using the BBC News dataset.

We are using this dataset to:
- Understand how text classification works
- Learn the end-to-end ML pipeline
- Later integrate this model into a backend API


### What Are We Building?

In the BBC News Classification Project, we are building a predictive model that reads news records and assigns each one to a category. The inputs under consideration are the headlines and their respective categories. After cleaning and preprocessing the dataset using NLP techniques, we will use Machine Learning algorithms like Random Forest and SVM to classify each headline into its category.

### Pre-requisites

To get the most out of this notebook, the following prerequisites would be a plus:

- Basic knowledge of Python.
- Familiarity with libraries like Pandas, NumPy, Seaborn, Matplotlib, and scikit-learn.
- Understanding of Machine Learning algorithms like Random Forest, Linear Regression, and Logistic Regression.
- Intermediate understanding of text cleaning and preprocessing techniques like stemming and lemmatization.

### How Are We Going to Build This?

Here's how we are going to work on this project:

- Libraries - Importing the necessary NLP and ML libraries.
- Data Analysis - This step will enable us to figure out the various values and features of the dataset.
- Data Visualization - With basic Data Visualization, we will be able to figure out the various underlying patterns of our dataset.
- Preprocessing - Since we are working with textual data in this project, preprocessing is very important for us to create a classifier. Using techniques like Tokenization and Lemmatization, we'll make our data model-ready.
- Model Training - In this step, we will use various Machine Learning algorithms like Random Forest, Logistic Regression, SVC, etc., to create a classifier that assigns each news item to its category.
- Model Evaluation - To make sure our classifier is working the way we want it to, we'll evaluate it with metrics like accuracy and ROC score.
- Model Testing - Finally, we will test our classifier with real-life data to see if it can actually predict the category of the headline.

### Final Output

Our final output is a classifier that predicts the category of a given headline.

```
Headline: ['Tim Scott optimistic about Congress progress on police reform']
POLITICS
```

#### Requirements

- Environment - GitHub Codespace
- Libraries - Pandas, NumPy, Seaborn, Matplotlib, scikit-learn, PorterStemmer, Lemmatizer, stopwords, etc.

## Step 1: Downloading the Dataset

We are using the **BBC News Dataset** from an official academic source:
http://mlg.ucd.ie/datasets/bbc.html

The dataset contains:
- 2225 news articles
- 5 categories: business, entertainment, politics, sport, tech

Each article is a plain text file.
Each folder name represents the label.

In [None]:
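# bbc-fulltext.zip holds the raw text files; bbc.zip is the pre-processed version
!wget http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip

In [None]:
!unzip bbc-fulltext.zip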

In [None]:
!ls bbc

## Step 2: Dataset Structure

The dataset is organized like this:

```shell
bbc/
 ├── business/
 ├── entertainment/
 ├── politics/
 ├── sport/
 └── tech/
```

Each `.txt` file inside a folder is one document.
The folder name is the category (label).

In [None]:
# Read Files into Python

import os

data = []
labels = []

base_path = "bbc"

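# Sort for a reproducible order and skip non-directory entries
# (for example, a README file that may sit next to the category folders).
for category in sorted(os.listdir(base_path)):
    category_path = os.path.join(base_path, category)
    if not os.path.isdir(category_path):
        continue
    for filename in sorted(os.listdir(category_path)):
        file_path = os.path.join(category_path, filename)
        # latin-1 maps every byte to a character, so decoding never fails
        with open(file_path, "r", encoding="latin-1") as file:
            data.append(file.read())
            labels.append(category)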

## Step 3: Creating a Structured Dataset

ML does not require a particular file format such as CSV, but collecting
the texts and labels in one pandas DataFrame makes experimentation easier.

In [None]:
import pandas as pd

df = pd.DataFrame({
    "text": data,
    "category": labels
})

df.head()
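
Before splitting, it can help to check how many documents each category has and roughly how long the documents are, since a skewed label distribution or empty files would affect everything downstream. A minimal sketch, using a small hypothetical `toy_df` in place of the real `df` so that it runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame built above.
toy_df = pd.DataFrame({
    "text": ["shares rise", "film wins award", "markets fall", "team wins cup"],
    "category": ["business", "entertainment", "business", "sport"],
})

# Number of documents per category, most frequent first.
category_counts = toy_df["category"].value_counts()
print(category_counts)

# Rough document length in words, a quick way to spot empty or truncated files.
toy_df["n_words"] = toy_df["text"].str.split().str.len()
print(toy_df["n_words"].describe())
```

The same two calls should work unchanged on `df`; only the toy data is an assumption here.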

## Step 4: Train-Test Split

We split the dataset into:
- Training data → used to learn
- Test data → used to evaluate

In [None]:
from sklearn.model_selection import train_test_split

X = df["text"]
y = df["category"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
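
The split above is purely random. One optional variation, not used in this notebook, is to pass `stratify=` so that each category keeps roughly the same share in the training and test sets. A sketch on hypothetical toy data (the `toy_*` names are illustrative only):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 documents for each of the 5 categories.
categories = ["business", "entertainment", "politics", "sport", "tech"]
toy_texts = [f"{c} article {i}" for c in categories for i in range(10)]
toy_labels = [c for c in categories for _ in range(10)]

# stratify=toy_labels asks for the same class proportions in both subsets.
toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(Counter(toy_y_train))
print(Counter(toy_y_test))
```

Whether stratification matters depends on how balanced the categories are; with balanced classes a plain random split is usually close enough.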

## Step 5: Text Vectorization (TF-IDF)

Models cannot work with raw text directly.
We convert each document into a numeric vector using **TF-IDF**
(term frequency-inverse document frequency).

TF-IDF weighs each word by:
- How often it appears in a document
- How rare it is across all documents

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

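vectorizer = TfidfVectorizer(stop_words="english")
# Learn the vocabulary and IDF weights from the training set only,
# then reuse them on the test set so no test information leaks in.
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)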
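
To make the weighting concrete, here is a hand-rolled sketch of the calculation on a tiny hypothetical corpus. It follows the smoothed formula that scikit-learn documents for the vectorizer's default settings (raw term count times `ln((1 + n) / (1 + df)) + 1`, then L2 normalization). Treat it as an illustration of the idea, not a replacement for `TfidfVectorizer`, which also handles tokenization and stop words:

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already lowercased and tokenized.
docs = [
    ["market", "shares", "rise"],
    ["market", "shares", "fall"],
    ["team", "wins", "match"],
]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = tfidf(docs[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {w:.3f}")
```

In the first document, "rise" occurs in only one document of the corpus while "market" and "shares" occur in two, so "rise" receives the largest weight.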

## Step 6: Training the Model

We use **Multinomial Naive Bayes**:
- Simple
- Fast
- Works well for text classification

In [None]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_tfidf, y_train)
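
Besides hard predictions, a fitted `MultinomialNB` can also report a probability per category through `predict_proba`, which is handy when deciding how much to trust a prediction. A minimal sketch on a hypothetical two-category toy set (the `toy_*` names and texts are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set with two categories only.
toy_texts = [
    "shares rise as market rallies",
    "bank profits and market growth",
    "team wins the cup final",
    "striker scores in the match",
]
toy_labels = ["business", "business", "sport", "sport"]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_model = MultinomialNB()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# predict_proba returns one column per class, in the order of classes_.
new_doc = ["market shares fall"]
probs = toy_model.predict_proba(toy_vectorizer.transform(new_doc))[0]
for label, p in zip(toy_model.classes_, probs):
    print(f"{label}: {p:.2f}")
```

Words that were never seen during training (here "fall") are simply ignored by the vectorizer, so the prediction rests on "market" and "shares" alone.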

## Step 7: Model Evaluation

We now evaluate the model on the held-out test data, which it did not see during training.

In [None]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
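
Accuracy and the classification report summarize performance per category, but they do not show which categories get confused with each other. A confusion matrix fills that gap. The sketch below uses hypothetical toy labels so it runs on its own; on the real data the same call would take `y_test` and `y_pred`:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six documents.
labels_order = ["business", "politics", "sport"]
toy_true = ["business", "business", "politics", "politics", "sport", "sport"]
toy_pred = ["business", "politics", "politics", "politics", "sport", "sport"]

# Rows are true categories, columns are predicted categories,
# both in the order given by labels_order.
cm = confusion_matrix(toy_true, toy_pred, labels=labels_order)
print(cm)
```

Here the off-diagonal entry in the first row indicates one business document that was predicted as politics.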

## Step 8: Saving the Model

We save:
- The trained model
- The vectorizer

These will be used later in the backend API.

In [None]:
import joblib

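# Save both objects together so the API applies the same vocabulary
# and IDF weights at prediction time that were used during training.
joblib.dump((vectorizer, model), "model.joblib")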
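
On the loading side, the tuple has to be unpacked in the same order it was saved. The sketch below shows one way this could look, using a hypothetical toy model, a temporary file, and an illustrative helper named `predict_category` (none of these exist in the notebook itself):

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy model so the sketch runs without the real training data.
toy_texts = ["shares rise on the market", "team wins the match"]
toy_labels = ["business", "sport"]
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)


def predict_category(text, vectorizer, model):
    """Vectorize one raw string and return the predicted category."""
    return model.predict(vectorizer.transform([text]))[0]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")
    joblib.dump((toy_vectorizer, toy_model), path)
    # Unpack in the same order the tuple was saved: vectorizer first, then model.
    loaded_vectorizer, loaded_model = joblib.load(path)

prediction = predict_category("market shares fall", loaded_vectorizer, loaded_model)
print(prediction)
```

Two practical caveats: joblib files are pickle-based, so only load files you created or trust, and loading generally works best with the same scikit-learn version that was used for saving.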

## What Comes Next

Next steps:
1. Load this model in the backend
2. Create a `/predict` API
3. Connect frontend input to this ML model
4. Reframe this as a legal case classifier prototype