<a href="https://colab.research.google.com/github/devmarkpro/kaggle-bbc-news-classification/blob/main/bbc-news-classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BBC News Classification

In [27]:
!pip install gensim



In [28]:
import numpy as np
import pandas as pd
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
from gensim.downloader import load
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression


## GloVe
GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining vector representations for words. It is based on the idea that the meaning of a word can be inferred from the context in which it appears. GloVe constructs a word vector space such that the dot product of two word vectors equals the logarithm of the probability of their co-occurrence.

For example, if two words such as *economic* and *finance* never appear together in a document, a bag-of-words method like tf-idf treats them as unrelated and gives them a low similarity score. GloVe assigns them a high similarity score based on their co-occurrence across the corpus it was trained on.

For more information, you can refer to the [GloVe Website](https://nlp.stanford.edu/projects/glove/).
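
To make the contrast concrete, here is a minimal sketch using made-up 3-dimensional vectors (illustrative values only, not real GloVe vectors). In a one-hot, bag-of-words style representation every word has its own axis, so two different words always have cosine similarity 0. In a dense embedding space, related words can point in similar directions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot (bag-of-words style) vectors: each word has its own axis.
one_hot = {
    "economic": [1.0, 0.0, 0.0],
    "finance":  [0.0, 1.0, 0.0],
    "football": [0.0, 0.0, 1.0],
}

# Made-up dense vectors standing in for embeddings (assumed values).
dense = {
    "economic": [0.9, 0.8, 0.1],
    "finance":  [0.8, 0.9, 0.0],
    "football": [0.1, 0.0, 0.9],
}

one_hot_sim = cosine_similarity(one_hot["economic"], one_hot["finance"])
dense_sim = cosine_similarity(dense["economic"], dense["finance"])
dense_unrelated = cosine_similarity(dense["economic"], dense["football"])

print(f"one-hot  economic~finance:  {one_hot_sim:.2f}")
print(f"dense    economic~finance:  {dense_sim:.2f}")
print(f"dense    economic~football: {dense_unrelated:.2f}")
```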

In [29]:
glove_model = load("glove-wiki-gigaword-100")

## Natural Language Toolkit (NLTK)

The Natural Language Toolkit (NLTK) is a set of libraries and programs for symbolic and statistical natural language processing (NLP) in Python. It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

In this assignment, we will use NLTK to preprocess the news articles. Here that means tokenization and stop-word removal only; NLTK also provides other functionality useful for NLP tasks, such as stemming, lemmatization, and part-of-speech tagging.

### Tokenization
Tokenization is the process of splitting a text into individual words or tokens. NLTK provides a simple way to tokenize text using the `word_tokenize` function. This function splits the text into words and punctuation, returning a list of tokens.

In [30]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
CONFIG = {
    "train_path": "./learn-ai-bbc/BBC_News_Train.csv",
    "test_path": "./learn-ai-bbc/BBC_News_Test.csv",
    "sample_solution_path": "./learn-ai-bbc/BBC_News_Sample_Solution.csv",
    "RANDOM_STATE": 42,
    "DEFAULT_TEST_SIZE": 0.2,
    "DEFAULT_TRAIN_SIZE": 0.8
}
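# change the training and test paths if the notebook is running on Kaggle
# (KAGGLE_URL_BASE is an environment variable, so check os.environ, not globals())
import os
if "KAGGLE_URL_BASE" in os.environ: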
    CONFIG["train_path"] = "/kaggle/input/learn-ai-bbc/BBC News Train.csv"
    CONFIG["test_path"] = "/kaggle/input/learn-ai-bbc/BBC News Test.csv"
    CONFIG["sample_solution_path"] = "/kaggle/input/learn-ai-bbc/BBC News Sample Solution.csv"



In [32]:
def load_data(train_path: str, test_path: str, sample_solution_path: str) -> tuple:
    """
    Load the train, test, and sample solution datasets.

    Args:
        train_path (str): Path to the training dataset.
        test_path (str): Path to the test dataset.
        sample_solution_path (str): Path to the sample solution dataset.

    Returns:
        tuple: A tuple containing the train, test, and sample solution DataFrames.
    """
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)
    sample_solution_df = pd.read_csv(sample_solution_path)

    return train_df, test_df, sample_solution_df

In [33]:
train_df, test_df, sample_solution_df = load_data(
    train_path=CONFIG["train_path"],
    test_path=CONFIG["test_path"],
    sample_solution_path=CONFIG["sample_solution_path"]
)

In [34]:
print(train_df.shape)
print(test_df.shape)
print(sample_solution_df.shape)

(1490, 3)
(735, 2)
(735, 2)


`train_df` and `test_df` are the dataframes containing the training and test data, respectively. `sample_solution_df` is the dataframe containing the sample solution, which pairs the article IDs in the test set with news category labels.

In [35]:
train_df.head()

Unnamed: 0,ArticleId,Text,Category
0,1833,worldcom ex-boss launches defence lawyers defe...,business
1,154,german business confidence slides german busin...,business
2,1101,bbc poll indicates economic gloom citizens in ...,business
3,1976,lifestyle governs mobile choice faster bett...,tech
4,917,enron bosses in $168m payout eighteen former e...,business


In [36]:
test_df.head()

Unnamed: 0,ArticleId,Text
0,1018,qpr keeper day heads for preston queens park r...
1,1319,software watching while you work software that...
2,1138,d arcy injury adds to ireland woe gordon d arc...
3,459,india s reliance family feud heats up the ongo...
4,1020,boro suffer morrison injury blow middlesbrough...


In [37]:
sample_solution_df.head()

Unnamed: 0,ArticleId,Category
0,1018,sport
1,1319,tech
2,1138,business
3,459,entertainment
4,1020,politics


In [38]:
train_df['Category'].unique()

array(['business', 'tech', 'politics', 'sport', 'entertainment'],
      dtype=object)

In [39]:
train_df.describe()

Unnamed: 0,ArticleId
count,1490.0
mean,1119.696644
std,641.826283
min,2.0
25%,565.25
50%,1112.5
75%,1680.75
max,2224.0


Let's take a look at the distribution of the news categories in the training set. This will help us understand the balance of the dataset and whether we need to apply any techniques to handle class imbalance.

In [40]:
category_counts = train_df['Category'].value_counts(normalize=True)

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Count of Categories", "Proportion of Categories")
)

fig.add_trace(
    go.Histogram(
        x=train_df['Category'],
        name="Count"
    ),
    row=1, col=1
)

fig.add_trace(
    go.Bar(
        x=category_counts.index,
        y=category_counts.values,
        name="Proportion"
    ),
    row=1, col=2
)

fig.show()

It appears that the dataset is relatively balanced, with a slight imbalance towards the 'sport' category.

In [41]:
# mean, min, max, and std of the length of the text in the training set
text_length = train_df['Text'].apply(len)
print(text_length.describe())

# Visualize the distribution of text lengths in the training set
fig = px.histogram(text_length, title='Distribution of Text Lengths in Training Set')
fig.update_xaxes(title='Text Length')
fig.update_yaxes(title='Count')
fig.show()

count     1490.000000
mean      2233.461745
std       1205.153358
min        501.000000
25%       1453.000000
50%       1961.000000
75%       2751.250000
max      18387.000000
Name: Text, dtype: float64


## Duplicated Rows

Now let's check for duplicated rows in the training set. Duplicated rows can lead to biased results, so it's important to identify and handle them appropriately. Here we only care about the `Text` column, as the `Category` column is the target variable.



In [42]:
train_df["Text"].duplicated().sum()

50

It appears that 50 rows in the training set repeat the text of an earlier row. We can remove these duplicates to ensure that our model is trained on unique data points.

In [43]:
train_df = train_df[~train_df["Text"].duplicated()]
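
With its default `keep='first'`, `Series.duplicated()` marks every repeat after the first occurrence, so the filter above keeps one copy of each article. The same logic can be sketched in pure Python on toy strings (`duplicated_mask` is a hypothetical helper written for illustration, not part of pandas):

```python
def duplicated_mask(values):
    """Return True for every value that already appeared earlier in the list."""
    seen = set()
    mask = []
    for value in values:
        mask.append(value in seen)
        seen.add(value)
    return mask

# Toy "articles"; the third and fifth repeat earlier entries.
texts = ["a b c", "d e f", "a b c", "g h i", "d e f"]
mask = duplicated_mask(texts)
unique_texts = [t for t, is_dup in zip(texts, mask) if not is_dup]

print(mask)
print(unique_texts)
```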

## Data Preprocessing

I'll do some simple preprocessing steps to clean the text data. This includes:
- Lowercasing the text
- Removing punctuation
- Tokenizing the text using NLTK
- Removing stop words (using NLTK's English stop words list)
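
As a rough illustration of what these steps do to a single sentence, here is a minimal sketch. It makes two simplifying assumptions: whitespace splitting stands in for NLTK's `word_tokenize`, and a tiny hand-picked stop-word set stands in for NLTK's full English list. The punctuation regex is the same one used in `clean_text`.

```python
import re

# Tiny hand-picked stop-word set for illustration (assumption);
# the notebook itself uses NLTK's full English stop-word list.
TOY_STOP_WORDS = {"the", "in", "a", "of", "and", "to", "is"}

def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)  # keep only letters, digits, whitespace
    tokens = text.split()                    # simplified stand-in for word_tokenize
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]

sample = "Enron bosses in $168m payout: the ex-boss is to pay."
tokens = preprocess(sample)
print(tokens)
```

Note that removing punctuation joins hyphenated words (`ex-boss` becomes `exboss`), which is why tokens like `exboss` show up in the processed dataframe.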

In [44]:

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text

def tokenize(text):
    return word_tokenize(text)

df = train_df.copy(deep=True)
df['text'] = df['Text'].astype(str)

df['clean_text'] = df['text'].apply(clean_text)


# handling stop words

stop_words = set(stopwords.words('english'))

def remove_stop_words(tokens):
    return [word for word in tokens if word not in stop_words]

df['tokens'] = df['clean_text'].apply(tokenize).apply(remove_stop_words)



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [45]:
df

Unnamed: 0,ArticleId,Text,Category,text,clean_text,tokens
0,1833,worldcom ex-boss launches defence lawyers defe...,business,worldcom ex-boss launches defence lawyers defe...,worldcom exboss launches defence lawyers defen...,"[worldcom, exboss, launches, defence, lawyers,..."
1,154,german business confidence slides german busin...,business,german business confidence slides german busin...,german business confidence slides german busin...,"[german, business, confidence, slides, german,..."
2,1101,bbc poll indicates economic gloom citizens in ...,business,bbc poll indicates economic gloom citizens in ...,bbc poll indicates economic gloom citizens in ...,"[bbc, poll, indicates, economic, gloom, citize..."
3,1976,lifestyle governs mobile choice faster bett...,tech,lifestyle governs mobile choice faster bett...,lifestyle governs mobile choice faster bett...,"[lifestyle, governs, mobile, choice, faster, b..."
4,917,enron bosses in $168m payout eighteen former e...,business,enron bosses in $168m payout eighteen former e...,enron bosses in 168m payout eighteen former en...,"[enron, bosses, 168m, payout, eighteen, former..."
...,...,...,...,...,...,...
1485,857,double eviction from big brother model caprice...,entertainment,double eviction from big brother model caprice...,double eviction from big brother model caprice...,"[double, eviction, big, brother, model, capric..."
1486,325,dj double act revamp chart show dj duo jk and ...,entertainment,dj double act revamp chart show dj duo jk and ...,dj double act revamp chart show dj duo jk and ...,"[dj, double, act, revamp, chart, show, dj, duo..."
1487,1590,weak dollar hits reuters revenues at media gro...,business,weak dollar hits reuters revenues at media gro...,weak dollar hits reuters revenues at media gro...,"[weak, dollar, hits, reuters, revenues, media,..."
1488,1587,apple ipod family expands market apple has exp...,tech,apple ipod family expands market apple has exp...,apple ipod family expands market apple has exp...,"[apple, ipod, family, expands, market, apple, ..."


## Unique Tokens

Let's take a look at the unique tokens in the training set after preprocessing. This will give us an idea of the vocabulary size and the diversity of the text data.

In [46]:
# number of unique tokens in the training set
unique_tokens = set()
for tokens in df['tokens']:
    unique_tokens.update(tokens)

print(f"Number of unique tokens in the training set: {len(unique_tokens)}")

Number of unique tokens in the training set: 27132


## Document Vectors

Now, I use GloVe to convert the acquired tokens into document vectors. GloVe provides pre-trained word vectors that represent words in a continuous vector space. However, for our unsupervised learning task, we need a vector for each document, not for each word. To achieve this, I average the word vectors of all tokens in a document to create a single vector representation for that document.

In [47]:
def get_document_vector(tokens, embedding_index, dim):
    """
    Get the document vector for a list of tokens by averaging their GloVe vectors.

    Tokens missing from the embedding index are ignored. If no token is found,
    a zero vector of the specified dimension is returned.

    Args:
        tokens (list): List of tokens (words) from the document.
        embedding_index: Mapping from token to vector that supports `in` and `[]`
            (e.g. a dict or the gensim model loaded above).
        dim (int): Dimension of the GloVe vectors (100 for the model used here).

    Returns:
        np.ndarray: The mean of the GloVe vectors of the known tokens, or a zero
        vector of length `dim` if none of the tokens are in the embedding index.
    """
    vecs = []
    for token in tokens:
        if token in embedding_index:
            vecs.append(embedding_index[token])
    if len(vecs) > 0:
        return np.mean(vecs, axis=0)
    else:
        return np.zeros(dim)

df['doc_vector'] = df['tokens'].apply(lambda x: get_document_vector(x, glove_model, dim=100))
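
The averaging step can be checked by hand on a toy example. The sketch below uses a made-up 4-dimensional embedding table (assumed values, not real GloVe vectors) and a small helper, `average_vector`, that follows the same logic as `get_document_vector`: unknown tokens are skipped, and a document with no known tokens falls back to a zero vector.

```python
import numpy as np

# Made-up 4-dimensional "embeddings" (illustrative values only).
toy_embeddings = {
    "market": np.array([1.0, 0.0, 2.0, 0.0]),
    "shares": np.array([3.0, 0.0, 0.0, 2.0]),
}

def average_vector(tokens, embeddings, dim):
    """Mean of the vectors of known tokens; zero vector if none are known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# "exboss" is out of vocabulary, so only two vectors are averaged.
known = average_vector(["market", "shares", "exboss"], toy_embeddings, dim=4)
# No known tokens at all: falls back to zeros.
unknown = average_vector(["exboss"], toy_embeddings, dim=4)

print(known)
print(unknown)
```

One consequence of this design is that word order is discarded: two documents with the same words in a different order get the same vector.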

In [48]:
df

Unnamed: 0,ArticleId,Text,Category,text,clean_text,tokens,doc_vector
0,1833,worldcom ex-boss launches defence lawyers defe...,business,worldcom ex-boss launches defence lawyers defe...,worldcom exboss launches defence lawyers defen...,"[worldcom, exboss, launches, defence, lawyers,...","[0.16515267, -0.15452467, 0.1337842, -0.131193..."
1,154,german business confidence slides german busin...,business,german business confidence slides german busin...,german business confidence slides german busin...,"[german, business, confidence, slides, german,...","[0.061440088, 0.19759773, 0.22805111, 0.011879..."
2,1101,bbc poll indicates economic gloom citizens in ...,business,bbc poll indicates economic gloom citizens in ...,bbc poll indicates economic gloom citizens in ...,"[bbc, poll, indicates, economic, gloom, citize...","[-0.07026437, 0.26868266, 0.28680098, -0.03784..."
3,1976,lifestyle governs mobile choice faster bett...,tech,lifestyle governs mobile choice faster bett...,lifestyle governs mobile choice faster bett...,"[lifestyle, governs, mobile, choice, faster, b...","[-0.09686718, 0.13256855, 0.18844551, -0.06445..."
4,917,enron bosses in $168m payout eighteen former e...,business,enron bosses in $168m payout eighteen former e...,enron bosses in 168m payout eighteen former en...,"[enron, bosses, 168m, payout, eighteen, former...","[0.2517319, 0.0066601867, 0.11805693, -0.07894..."
...,...,...,...,...,...,...,...
1485,857,double eviction from big brother model caprice...,entertainment,double eviction from big brother model caprice...,double eviction from big brother model caprice...,"[double, eviction, big, brother, model, capric...","[-0.026056554, -0.03890627, 0.2214727, -0.3050..."
1486,325,dj double act revamp chart show dj duo jk and ...,entertainment,dj double act revamp chart show dj duo jk and ...,dj double act revamp chart show dj duo jk and ...,"[dj, double, act, revamp, chart, show, dj, duo...","[-0.14468576, 0.04549638, 0.24434878, -0.20337..."
1487,1590,weak dollar hits reuters revenues at media gro...,business,weak dollar hits reuters revenues at media gro...,weak dollar hits reuters revenues at media gro...,"[weak, dollar, hits, reuters, revenues, media,...","[0.045438945, 0.06493753, 0.14638457, -0.13248..."
1488,1587,apple ipod family expands market apple has exp...,tech,apple ipod family expands market apple has exp...,apple ipod family expands market apple has exp...,"[apple, ipod, family, expands, market, apple, ...","[-0.014821554, 0.15541618, 0.19732162, -0.1417..."


## Principal Component Analysis (PCA)

I will use PCA to extract the most important features from the document vectors. The following plot shows the first two principal components of the document vectors. This will help us visualize the data and understand the distribution of the documents in the vector space.

In [49]:
X = np.stack(df['doc_vector'].values)
print(X.shape)
pca = PCA(n_components=5, random_state=CONFIG['RANDOM_STATE'])
X_reduced = pca.fit_transform(X)

fig = px.scatter(
    x=X_reduced[:, 0],
    y=X_reduced[:, 1],
    color=df['Category'],
    labels={'x': 'PCA Component 1', 'y': 'PCA Component 2', 'color': 'Category'},
    title='PCA of GloVe-based Document Vectors',
    width=800,
    height=600
)
fig.show()

(1440, 100)


The plot shows the news categories in different colors, which helps us see how well the PCA has separated the categories. It appears that the PCA has done a relatively good job of separating the categories, especially for sport and entertainment. However, there is some overlap between the categories, especially for business and politics. This is expected, as the categories are not mutually exclusive and there can be some overlap in the topics covered by the news articles. This relatively good separation will help us in the next step, where we apply categories to the clusters.

In practice, a news article can belong to more than one category, so perfect separation is not expected; in this dataset, however, each article is assigned exactly one category. This overlap can also affect the clustering results, since the clusters will not be well separated if the categories are not well separated in the vector space.

## Truncated SVD

For this assignment, I will use Truncated SVD (Singular Value Decomposition) to reduce the dimensionality of the document vectors. Truncated SVD is a linear dimensionality reduction technique that, unlike PCA, does not center the data before factorizing it. It is most often used on sparse matrices such as tf-idf features, but it also works on dense inputs like the averaged GloVe document vectors used here.
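
Truncated SVD and PCA are closely related: PCA amounts to an SVD of the mean-centered data, while truncated SVD factorizes the data as-is. The following minimal NumPy sketch shows the difference on random toy data (a stand-in for the document vectors). `svd_project` is a hypothetical helper, and the result matches scikit-learn's estimators only up to sign conventions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy dense data with a non-zero mean (stand-in for document vectors).
X_toy = rng.normal(loc=2.0, scale=1.0, size=(50, 6))

def svd_project(X, n_components, center):
    """Project X onto its top right-singular vectors, optionally centering first."""
    X_used = X - X.mean(axis=0) if center else X
    _, _, vt = np.linalg.svd(X_used, full_matrices=False)
    return X_used @ vt[:n_components].T

pca_like = svd_project(X_toy, n_components=2, center=True)    # PCA-style
tsvd_like = svd_project(X_toy, n_components=2, center=False)  # truncated-SVD-style

# With centering, the projected data has zero mean. Without it, the first
# component largely captures the offset of the data from the origin.
print(np.abs(pca_like.mean(axis=0)).max())
print(np.abs(tsvd_like.mean(axis=0)).max())
```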

### Why not NMF?

NMF (Non-negative Matrix Factorization) requires all elements of the input matrix to be ≥ 0.
It is designed to find additive, parts-based decompositions (for example, how an image is built from positive pixel intensities or how a document is built from positive word counts). The GloVe document vectors, however, contain negative values.

Word embeddings like GloVe or Word2Vec are not constrained to be non-negative: their components are spread around zero, so they naturally include negative numbers.

In [50]:
svd = TruncatedSVD(n_components=5, random_state=CONFIG['RANDOM_STATE'])
document_topic_matrix = svd.fit_transform(X)
topic_word_matrix = svd.components_

print("Document-topic matrix shape:", document_topic_matrix.shape)
print("Topic-word matrix shape:", topic_word_matrix.shape)

Document-topic matrix shape: (1440, 5)
Topic-word matrix shape: (5, 100)


## SVD Components of the Document Vectors

The following plot shows the 5 SVD components of the document vectors and their relationship with the news categories. This will help us understand how the document vectors are distributed in the SVD space and how they relate to the different news categories.

In [51]:
fig = px.scatter_matrix(
    document_topic_matrix,
    dimensions=[0, 1, 2, 3, 4],
    color=df['Category'],
    labels=dict.fromkeys(range(5), 'Component'),
    title='SVD Components of Document Vectors'
)
fig.update_traces(diagonal_visible=False)
fig.show()

## Training Unsupervised Learning Models

Now, I will train unsupervised learning models on the document vectors. This includes clustering algorithms like K-Means to identify patterns and group similar documents together.


In [52]:

X_train, X_test, y_train, y_test = train_test_split(
    np.vstack(df['doc_vector'].values), df['Category'], test_size=CONFIG['DEFAULT_TEST_SIZE'], random_state=CONFIG['RANDOM_STATE']
)



Here I employed TruncatedSVD to reduce the dimensionality of our feature space to five components. The SVD model was first fitted on the training data to learn the optimal lower-dimensional representation and then used to consistently transform both the training and test datasets. This approach helps capture the most important patterns in the data while reducing noise and computational cost. By applying dimensionality reduction, we aim to improve model performance and mitigate potential overfitting.

In [53]:
svd = TruncatedSVD(n_components=5, random_state=CONFIG['RANDOM_STATE'])
X_train_svd = svd.fit_transform(X_train)
X_test_svd = svd.transform(X_test)


Next, I applied KMeans clustering on the reduced data to explore inherent groupings. I initialized the KMeans model with the number of clusters equal to the number of unique categories in our dataset, ensuring alignment with known class distributions. The model was fitted on the transformed training data (X_train_svd) to learn cluster centroids, and then used to predict cluster assignments for both the training and test sets. This clustering step allows us to investigate how well the unsupervised groupings correspond to the original categories and potentially enhances feature engineering.
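
The fit-on-train, predict-on-test pattern can be sketched with a plain NumPy version of Lloyd's algorithm. This is a simplified stand-in for scikit-learn's `KMeans` (random initialization, a fixed number of iterations, no restarts), and `kmeans_fit` and `kmeans_predict` are hypothetical helper names. Prediction for new points is simply assignment to the nearest learned centroid.

```python
import numpy as np

def kmeans_predict(X, centroids):
    """Assign each row of X to its nearest centroid (Euclidean distance)."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_fit(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: returns the learned centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points (fancy indexing returns a copy of the rows).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = kmeans_predict(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

# Two well-separated toy blobs, split into "training" and held-out points.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(30, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(30, 2))
X_fit = np.vstack([blob_a[:20], blob_b[:20]])
X_new = np.vstack([blob_a[20:], blob_b[20:]])

centroids = kmeans_fit(X_fit, k=2)
new_labels = kmeans_predict(X_new, centroids)
print(new_labels)
```

The cluster ids themselves are arbitrary: which blob ends up as cluster 0 depends on the initialization, which is why a separate cluster-to-label mapping step is needed.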

In [54]:
kmeans = KMeans(n_clusters=df['Category'].nunique(), random_state=CONFIG['RANDOM_STATE'])
train_clusters = kmeans.fit_predict(X_train_svd)
test_clusters = kmeans.predict(X_test_svd)


The unsupervised model has grouped the documents into clusters, but we do not yet know what these clusters represent in terms of the original categories. To interpret the clusters, we need to map them back to the original labels.


To map the discovered clusters back to the actual categories, we assigned each cluster the most frequent true label (mode) among the training samples within that cluster. This was done by iterating over all clusters and building a cluster_to_label dictionary linking clusters to their majority labels. Using this mapping, we predicted the final labels for both the training and test data based on their cluster assignments. This simple post-clustering labeling strategy enables us to evaluate how well the unsupervised clustering aligns with the original supervised targets.

### Mapping Clusters to Labels

To map the clusters to labels, I take the most common label in each cluster from the training data. The function below iterates over all clusters and builds a dictionary that maps each cluster to its majority label, determined by counting the occurrences of each label in the cluster and selecting the one with the highest count.

This is a simple way to assign labels to clusters based on the training data. It assumes that the clusters are well separated and that the majority label is representative of the cluster. In practice, you might want to use a more sophisticated method. For example, you could use a supervised learning algorithm to predict the labels of the clusters, or a clustering algorithm that takes the labels into account.
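
The majority-vote mapping can be traced by hand on a toy example. The sketch below uses only the standard library; `majority_label_per_cluster` is a hypothetical helper that mirrors the idea, not the notebook's own function, and the cluster ids and labels are made up.

```python
from collections import Counter, defaultdict

def majority_label_per_cluster(cluster_ids, true_labels):
    """Map each cluster id to the most frequent true label among its members."""
    labels_by_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        labels_by_cluster[cluster].append(label)
    return {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in labels_by_cluster.items()
    }

# Made-up cluster assignments and true labels for eight documents.
toy_clusters = [0, 0, 0, 1, 1, 2, 2, 2]
toy_labels = ["sport", "sport", "tech", "business", "business",
              "politics", "business", "politics"]

mapping = majority_label_per_cluster(toy_clusters, toy_labels)
predicted = [mapping[c] for c in toy_clusters]
accuracy = sum(p == t for p, t in zip(predicted, toy_labels)) / len(toy_labels)

print(mapping)
print(accuracy)
```

Nothing in this scheme prevents two clusters from mapping to the same label, in which case some category is never predicted. That is one reason accuracy can vary noticeably between clustering runs.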

In [55]:
def map_clusters_to_labels(tc, yt):
    """
    Map clusters to labels based on the majority label in each cluster.

    Args:
        tc (np.ndarray): Array of cluster assignments for training data.
        yt (pd.Series): Series of true labels for training data.

    Returns:
        dict: Dictionary mapping clusters to their majority labels.
    """

    cluster_to_label = {}
    for cluster in np.unique(tc):
        mask = tc == cluster # Create a mask for the current cluster
        majority_label = pd.Series(yt[mask]).mode()[0] # Get the most common label in the cluster

        cluster_to_label[cluster] = majority_label
    return cluster_to_label


In [56]:
cluster_to_label = map_clusters_to_labels(train_clusters, y_train)
# Predict final labels
y_train_pred = np.array([cluster_to_label[c] for c in train_clusters])
y_test_pred = np.array([cluster_to_label[c] for c in test_clusters])

print("Train accuracy:", accuracy_score(y_train, y_train_pred))
print("Test accuracy:", accuracy_score(y_test, y_test_pred))

Train accuracy: 0.7517361111111112
Test accuracy: 0.7673611111111112


## Confusion Matrix

Finally, I will evaluate the performance of the unsupervised learning models using a confusion matrix. The confusion matrix will show the true labels vs the predicted labels for the test set. This will help us understand how well the unsupervised learning models performed and how they relate to the original categories.
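
For readers new to confusion matrices, here is a minimal pure-Python sketch of how the counts are built, using the same convention as scikit-learn's `confusion_matrix` (rows are true labels, columns are predicted labels). The labels and predictions are made up for illustration, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, in the order of `labels`."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

labels = ["business", "politics", "sport"]
y_true_toy = ["business", "business", "politics", "politics", "sport", "sport"]
y_pred_toy = ["business", "politics", "politics", "politics", "sport", "sport"]

cm_toy = confusion_counts(y_true_toy, y_pred_toy, labels)
for label, row in zip(labels, cm_toy):
    print(f"{label:>9}: {row}")

# Accuracy is the sum of the diagonal divided by the number of samples.
accuracy_toy = sum(cm_toy[i][i] for i in range(len(labels))) / len(y_true_toy)
print(accuracy_toy)
```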

In [57]:
cm = confusion_matrix(y_test, y_test_pred, labels=np.unique(y_test))


# plot the confusion matrix using plotly
fig = px.imshow(
    cm,
    text_auto=True,
    labels=dict(x="Predicted Category", y="True Category", color="Count"),
    x=np.unique(y_test),
    y=np.unique(y_test),
    color_continuous_scale='YlGn',
    # aspect="auto",
    title="Confusion Matrix",
    width=600,
    height=600
)
fig.show()

Finally, we evaluated the clustering-based classification using a confusion matrix and overall accuracy. The confusion matrix shows that the majority of samples are grouped into their true categories, with misclassifications primarily between related classes such as business and politics. The approach achieved a training accuracy of approximately 75.2% and a test accuracy of approximately 76.7%; the two values are close, indicating that the cluster-to-label mapping generalizes to unseen data. These results show that even an unsupervised technique like KMeans, when combined with majority label assignment, can approximate the original categories in this dataset reasonably well.

## Hyperparameter Tuning

For hyperparameter tuning, I will run a grid search over the parameters of the KMeans clustering algorithm, using scikit-learn's `ParameterGrid` to enumerate the combinations and the silhouette score to rank them. This may improve the performance of the unsupervised model and lets us check which number of clusters fits the data best.

## Silhouette Score
The silhouette score is a measure of how well-separated the clusters are. It ranges from -1 to 1, where a score close to 1 indicates that the clusters are well-separated, a score close to 0 indicates that the clusters are overlapping, and a score close to -1 indicates that samples have likely been assigned to the wrong cluster. The silhouette score is calculated for each sample in the dataset and then averaged to get the overall score. A higher silhouette score indicates better clustering performance.

We use the silhouette score because `GridSearchCV` is built around cross-validated scoring, usually against labels, which does not fit label-free clustering well. Instead, we loop over the parameter grid manually and compute the silhouette score for each set of parameters. The silhouette score is a good measure of clustering performance, as it takes into account both the distance between samples within a cluster and the distance to the nearest other cluster.
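
For a single sample, the silhouette value is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below implements this directly with NumPy on four toy points. `silhouette_samples_simple` is a hypothetical helper written for illustration; in the notebook, scikit-learn's `silhouette_score` does this work.

```python
import numpy as np

def silhouette_samples_simple(X, labels):
    """Per-sample silhouette s = (b - a) / max(a, b) using Euclidean distance."""
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                  # exclude the point itself
        if not same.any():               # singleton cluster: score stays 0
            continue
        a = dists[i, same].mean()        # mean distance within own cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Two tight pairs of points that are far apart from each other.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

good = silhouette_samples_simple(X_toy, [0, 0, 1, 1]).mean()  # pairs kept together
bad = silhouette_samples_simple(X_toy, [0, 1, 0, 1]).mean()   # pairs split apart
print(round(good, 3), round(bad, 3))
```

The sensible grouping scores close to 1, while the grouping that splits each pair scores below 0, which is the behavior the grid search relies on when ranking parameter settings.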

In [58]:
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

param_grid = {
    'n_clusters': list(range(2, 11)),
    'init': ['k-means++', 'random'],
    'max_iter': [100, 200, 300],
    'n_init': [10, 20, 30]
}

best_score = -1
best_params = None

for params in ParameterGrid(param_grid):
    kmeans = KMeans(**params, random_state=CONFIG['RANDOM_STATE'])
    labels = kmeans.fit_predict(X_train_svd)
    score = silhouette_score(X_train_svd, labels)
    if score > best_score:
        best_score = score
        best_params = params

print("Best parameters:", best_params)
print("Best silhouette score:", best_score)

Best parameters: {'init': 'k-means++', 'max_iter': 100, 'n_clusters': 5, 'n_init': 10}
Best silhouette score: 0.41403654


In [59]:

# Use the best parameters to fit the KMeans model
best_kmeans = KMeans(**best_params, random_state=CONFIG['RANDOM_STATE'])
best_kmeans.fit(X_train_svd)
train_clusters = best_kmeans.predict(X_train_svd)
test_clusters = best_kmeans.predict(X_test_svd)
cluster_to_label = map_clusters_to_labels(train_clusters, y_train)
# Predict final labels
y_train_pred = np.array([cluster_to_label[c] for c in train_clusters])
y_test_pred = np.array([cluster_to_label[c] for c in test_clusters])
print("Train accuracy after hyperparameter tuning:", accuracy_score(y_train, y_train_pred))
print("Test accuracy after hyperparameter tuning:", accuracy_score(y_test, y_test_pred))

# confusion matrix after hyperparameter tuning
cm = confusion_matrix(y_test, y_test_pred, labels=np.unique(y_test))
# plot the confusion matrix using plotly
fig = px.imshow(
    cm,
    text_auto=True,
    labels=dict(x="Predicted Category", y="True Category", color="Count"),
    x=np.unique(y_test),
    y=np.unique(y_test),
    color_continuous_scale='YlGn',
    # aspect="auto",
    title="Confusion Matrix after Hyperparameter Tuning",
    width=600,
    height=600
)
fig.show()

Train accuracy after hyperparameter tuning: 0.8958333333333334
Test accuracy after hyperparameter tuning: 0.9131944444444444


## Hyperparameter Tuning Impact

| Metric                               | Value       |
|--------------------------------------|-------------|
| Train accuracy                       | 0.7517      |
| Test accuracy                        | 0.7674      |
| Train accuracy after hyperparameter tuning | 0.8958      |
| Test accuracy after hyperparameter tuning  | 0.9132      |


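The model improved significantly after hyperparameter tuning, with the train accuracy increasing from 75.2% to 89.6% and the test accuracy increasing from 76.7% to 91.3%. Note that the selected number of clusters is still 5, the same as in the baseline, so the gain does not come from a different cluster count. It most likely comes from the other settings, such as running several initializations (`n_init=10`) instead of relying on the library default, which depends on the scikit-learn version.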



## Supervised Learning Models

Let's train a supervised learning model on the original document vectors and evaluate its performance. Many supervised learning algorithms can be used for text classification; for the sake of simplicity, I will use a logistic regression model. You can experiment with other algorithms like SVM, Random Forest, or XGBoost to see whether they achieve better performance.

In [60]:

clf = LogisticRegression(max_iter=1000, random_state=CONFIG['RANDOM_STATE'])
clf.fit(X_train, y_train)

# Predictions
y_train_pred_supervised = clf.predict(X_train)
y_test_pred_supervised= clf.predict(X_test)

In [61]:


print("Train accuracy:", accuracy_score(y_train, y_train_pred_supervised))
print("Test accuracy:", accuracy_score(y_test, y_test_pred_supervised))
print(classification_report(y_test, y_test_pred_supervised))

Train accuracy: 0.9800347222222222
Test accuracy: 0.9583333333333334
               precision    recall  f1-score   support

     business       0.96      0.94      0.95        82
entertainment       0.98      1.00      0.99        43
     politics       0.92      0.94      0.93        50
        sport       0.98      1.00      0.99        65
         tech       0.94      0.92      0.93        48

     accuracy                           0.96       288
    macro avg       0.96      0.96      0.96       288
 weighted avg       0.96      0.96      0.96       288



I trained a logistic regression classifier on the original document vectors to perform text classification. The model achieved strong results, with a training accuracy of approximately 98.0% and a test accuracy of approximately 95.8%, indicating good generalization. The detailed classification report shows consistently high precision, recall, and F1-scores across all categories, with perfect recall for the sport and entertainment classes and strong results for business, politics, and tech. These findings demonstrate that even a relatively simple supervised model can effectively capture the distinctions among categories in this dataset.

The following plot shows the confusion matrix for the supervised learning model. The confusion matrix reveals that most samples are correctly classified into their true categories, with only a few misclassifications primarily between related classes such as business and politics. The model's high accuracy and strong performance across all categories indicate its effectiveness in capturing the underlying patterns in the text data.

In [62]:
cm = confusion_matrix(y_test, y_test_pred_supervised, labels=np.unique(y_test))


# plot the confusion matrix using plotly
fig = px.imshow(
    cm,
    text_auto=True,
    labels=dict(x="Predicted Category", y="True Category", color="Count"),
    x=np.unique(y_test),
    y=np.unique(y_test),
    color_continuous_scale='YlGn',
    # aspect="auto",
    title="Confusion Matrix",
    width=600,
    height=600
)
fig.show()

## Summary

| Metric                                              | Value  |
|-----------------------------------------------------|--------|
| Train accuracy (KMeans)                             | 0.7517 |
| Test accuracy (KMeans)                              | 0.7674 |
| Train accuracy after hyperparameter tuning (KMeans) | 0.8958 |
| Test accuracy after hyperparameter tuning (KMeans)  | 0.9132 |
| Train accuracy (Logistic Regression)                | 0.9800 |
| Test accuracy (Logistic Regression)                 | 0.9583 |


In this assignment, I explored the BBC News dataset using unsupervised learning, specifically KMeans clustering, to group news articles into categories. I represented the text with GloVe embeddings and applied dimensionality reduction techniques such as PCA and Truncated SVD. Mapping the clusters back to the original categories gave a test accuracy of approximately 76.7%, and hyperparameter tuning raised it to approximately 91.3%. Finally, I trained a supervised logistic regression model on the original document vectors, which achieved a test accuracy of approximately 95.8%. The results show that both approaches work for this text classification task, with the supervised model ahead of the tuned clustering by about four and a half points.
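
Scoring KMeans as a classifier depends on mapping each cluster id to a category. The sketch below shows one common way to do that, a majority vote within each cluster. It is an illustration under assumptions, not necessarily the mapping used earlier in this notebook. The helper names `map_clusters_to_labels` and `mapped_accuracy` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(cluster_ids, true_labels):
    """Assign each cluster the most frequent true label among its members."""
    members = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, true_labels):
        members[cluster_id].append(label)
    return {
        cluster_id: Counter(labels).most_common(1)[0][0]
        for cluster_id, labels in members.items()
    }


def mapped_accuracy(cluster_ids, true_labels, mapping):
    """Fraction of samples whose cluster's mapped label equals the true label."""
    hits = sum(mapping[c] == y for c, y in zip(cluster_ids, true_labels))
    return hits / len(true_labels)


# Toy data for illustration only (not taken from the BBC dataset)
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
true_labels = ["sport", "sport", "tech", "business", "business",
               "tech", "tech", "politics"]

mapping = map_clusters_to_labels(cluster_ids, true_labels)
accuracy = mapped_accuracy(cluster_ids, true_labels, mapping)
print(mapping)
print(f"accuracy={accuracy:.2f}")
```

With this approach the mapping should be learned on the training split and then applied unchanged to test-set cluster assignments. Note that a majority vote can map two clusters to the same category, in which case some category is never predicted.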

## Learning Curve Analysis

To investigate how the amount of training data influences model performance, I trained the logistic regression classifier on progressively larger fractions of the training set, ranging from 1% to nearly 100%. Test accuracy rose rapidly at first: about 34% with just 1% of the data, 81% at 5%, above 91% at 8%, and about 94% at 10%. Beyond that point the gains were small. Accuracy peaked at about 96.2% with 40% of the data and settled at about 95.8% from 80% onward, so the curve had effectively plateaued. This analysis highlights the strong impact of training set size in the early stages and suggests that the dataset is large enough for this model to generalize well.



In [63]:
train_sizes = []
accuracies = []
train_fracs = [0.01, 0.03, 0.05, 0.08, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.90, 0.95, 0.99, 0.999, 0.9999, 0.99999]

for frac in train_fracs:
    X_partial, _, y_partial, _ = train_test_split(X_train, y_train, train_size=frac, random_state=CONFIG['RANDOM_STATE'])
    clf.fit(X_partial, y_partial)
    y_test_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_test_pred)
    train_sizes.append(frac * 100)  # percent
    accuracies.append(acc)
    print(f"Train size: {frac*100:.3f}% -> Test accuracy: {acc:.6f}")


Train size: 1.000% -> Test accuracy: 0.336806
Train size: 3.000% -> Test accuracy: 0.572917
Train size: 5.000% -> Test accuracy: 0.812500
Train size: 8.000% -> Test accuracy: 0.916667
Train size: 10.000% -> Test accuracy: 0.944444
Train size: 20.000% -> Test accuracy: 0.944444
Train size: 30.000% -> Test accuracy: 0.947917
Train size: 40.000% -> Test accuracy: 0.961806
Train size: 50.000% -> Test accuracy: 0.954861
Train size: 60.000% -> Test accuracy: 0.958333
Train size: 70.000% -> Test accuracy: 0.954861
Train size: 80.000% -> Test accuracy: 0.958333
Train size: 90.000% -> Test accuracy: 0.958333
Train size: 95.000% -> Test accuracy: 0.958333
Train size: 99.000% -> Test accuracy: 0.958333
Train size: 99.900% -> Test accuracy: 0.958333
Train size: 99.990% -> Test accuracy: 0.958333
Train size: 99.999% -> Test accuracy: 0.958333
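
The low accuracy at 1% and the identical results at the largest fractions are easier to interpret in terms of absolute sample counts. The sketch below assumes a training set of 1,152 rows, a hypothetical figure chosen for illustration because the actual training-set size is not shown in this section. It also assumes, to my understanding of `train_test_split`, that a fractional `train_size` is rounded down to a whole number of samples.

```python
import math

# Hypothetical training-set size, for illustration only
n_train = 1152

train_fracs = [0.01, 0.05, 0.1, 0.5, 0.99, 0.999, 0.9999, 0.99999]

# Assumption: a float train_size selects floor(frac * n_samples) rows
subset_sizes = {frac: math.floor(frac * n_train) for frac in train_fracs}

for frac, size in subset_sizes.items():
    print(f"train_size={frac:<8} -> {size} samples")
```

Under these assumptions, 1% corresponds to only about a dozen articles spread over five categories, and the last few fractions differ by a single sample or none at all, so they contribute almost no new information to the curve.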


In [64]:
df_plot = pd.DataFrame({
    'Train Size (%)': train_sizes,
    'Test Accuracy': accuracies
})

fig = px.line(
    df_plot,
    x='Train Size (%)',
    y='Test Accuracy',
    markers=True,
    title='Test Accuracy vs Training Data Size (Supervised Learning)',
    labels={'Train Size (%)': 'Training Data Size (%)', 'Test Accuracy': 'Test Accuracy'},
    width=800,
    height=600
)
fig.update_layout(yaxis=dict(range=[0,1]))  # fix scale to [0,1]
fig.show()

## References
- Pennington et al. "GloVe: Global Vectors for Word Representation." EMNLP 2014.
- [sklearn documentation](https://scikit-learn.org/stable/documentation.html)
- [NLTK documentation](https://www.nltk.org/howto/tokenize.html)
- [DTSA 5747: Fundamentals of Natural Language Processing](https://www.colorado.edu/program/data-science/coursera/curriculum/dtsa5747)