
<center><font size=10>Introduction to LLMs and GenAI</font></center>
<center><font size=6>Mini Project 3 - AI-Based News Classification Using NLP and Unsupervised Learning</font></center>

###Business Context




In the dynamic landscape of the media and news industry, the ability to swiftly categorize and curate content has become a strategic imperative. The vast volume of information demands efficient systems to organize and present content to the audience.

The media industry, being the pulse of information dissemination, grapples with the continuous influx of news articles spanning diverse topics. Ensuring that the right articles reach the right audience promptly is not just a logistical necessity but a critical component in retaining and engaging audiences in an age of information overload.

Common Industry Challenges: Amidst the ceaseless flow of news, organizations encounter challenges such as:

- Information Overload: The sheer volume of news articles makes manual categorization impractical.
- Timeliness: Delays in categorizing news articles can result in outdated or misplaced content.

###Problem Definition

E-news Express, a news aggregation startup, faces the challenge of categorizing the news articles it collects. With articles covering sports, entertainment, politics, and more, the need for an automated categorization system has become increasingly evident. Manually categorizing such a diverse range of articles takes substantial effort, introduces delays, and is prone to human error, and miscategorized articles can damage the startup's reputation. To streamline this process, the organization plans to adopt machine learning to automate and improve the categorization of content.

As a data scientist on the E-news Express data team, your task is to analyze the text in news articles and build an unsupervised learning model for categorizing them. The categorization done by the model can then be validated against human-defined labels to check the overall accuracy of the AI system. The goal is to optimize the categorization process, ensuring timely and personalized delivery.

In [None]:
# installing the sentence-transformers library
!pip install -U sentence-transformers -q


In [None]:
# to read and manipulate the data
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', None)    # show the full text of each column without truncation

# to visualise data
import matplotlib.pyplot as plt
import seaborn as sns

# to compute distances
from scipy.spatial.distance import cdist, pdist
from sklearn.metrics import silhouette_score

# importing the PyTorch Deep Learning library
import torch

# to import the model
from sentence_transformers import SentenceTransformer

# to cluster the data
from sklearn.cluster import KMeans

# to compute metrics
from sklearn.metrics import classification_report

# to avoid displaying unnecessary warnings
import warnings
warnings.filterwarnings("ignore")


In [None]:
df=pd.read_csv('/content/drive/MyDrive/Intro to LLM and Gen AI/Mini Project 3/news_articles.csv')

In [None]:
df.head(5)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.duplicated().sum()

In [None]:
df = df.drop_duplicates()

In [None]:
# resetting the dataframe index
df.reset_index(drop=True, inplace=True)

df.duplicated().sum()


In [None]:
df.loc[1, 'Text']


In [None]:
df.shape

In [None]:
!pip install hf_xet

In [None]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [None]:
# setting the device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# encoding the dataset (passing a plain list of strings to the encoder)
embedding_matrix = model.encode(df['Text'].tolist(), show_progress_bar=True, device=device)

embedding_matrix.shape


In [None]:
# defining a function to compute the cosine similarity between the embeddings of two texts
def cosine_score(text1, text2):
    # encoding the text
    embeddings1 = model.encode(text1)
    embeddings2 = model.encode(text2)

    # calculating the L2 norm of the embedding vector
    norm1 = np.linalg.norm(embeddings1)
    norm2 = np.linalg.norm(embeddings2)

    # computing the cosine similarity
    cosine_similarity_score = ((np.dot(embeddings1,embeddings2))/(norm1*norm2))

    return cosine_similarity_score
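
The formula used in `cosine_score` can be checked without loading a model. The following is a minimal sketch on hand-picked vectors; the helper name `cosine` and the toy vectors are illustrative assumptions, not part of the project data.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same_direction = cosine([1, 2, 3], [2, 4, 6])   # parallel vectors of different length
orthogonal = cosine([1, 0], [0, 1])             # perpendicular vectors
opposite = cosine([1, 1], [-1, -1])             # vectors pointing in opposite directions

print(same_direction, orthogonal, opposite)
```

The three values come out close to 1, 0, and -1 respectively, which shows that the score depends on direction only and not on vector length.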


In [None]:
a= "i love apple"
b= "apple is a fruit"
c= "i like this table"
print(cosine_score(a,b))
print(cosine_score(b,c))
print(cosine_score(a,c))


In [None]:
# We can also use the prebuilt util.cos_sim method to calculate the similarity score

a= "i love apple"
b= "apple is a fruit"
c= "i like this table"

from sentence_transformers import util

embeddings1 = model.encode(a)
embeddings2 = model.encode(b)
embeddings3 = model.encode(c)

print(util.cos_sim(embeddings1, embeddings2))
print(util.cos_sim(embeddings2, embeddings3))
print(util.cos_sim(embeddings1, embeddings3))


In [None]:
# defining a function to find the top k most similar articles for a given query
def top_k_similar_sentences(embedding_matrix, query_text, k):
    # encoding the query text
    query_embedding = model.encode(query_text)

    # calculating the cosine similarity between the query vector and every encoded article:
    # dot product divided by the product of the L2 norms
    norms = np.linalg.norm(embedding_matrix, axis=1) * np.linalg.norm(query_embedding)
    score_vector = np.dot(embedding_matrix, query_embedding) / norms

    # sorting the scores in descending order and choosing the first k
    top_k_indices = np.argsort(score_vector)[::-1][:k]

    # returning the corresponding articles
    return df.loc[list(top_k_indices), 'Text']
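
The ranking step can be tried on toy vectors. The sketch below is a simplified stand-in for the search function, not the notebook's code: `top_k_by_cosine` and the toy matrix are hypothetical. It also illustrates why dividing by the norms matters when embeddings are not guaranteed to be unit length.

```python
import numpy as np

def top_k_by_cosine(matrix, query, k):
    """Return the indices and cosine scores of the k rows most similar to the query."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    # scale every row and the query to unit length so the dot product is a cosine
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_rows @ unit_query
    # argsort is ascending, so reverse it and keep the first k indices
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]

toy_matrix = [
    [1.0, 0.0],    # row 0: same direction as the query
    [10.0, 10.0],  # row 1: long vector, 45 degrees away from the query
    [0.0, 1.0],    # row 2: perpendicular to the query
]
indices, scores = top_k_by_cosine(toy_matrix, [2.0, 0.0], k=2)
print(indices, scores)
```

With a raw dot product, row 1 would rank first purely because of its length; after normalization, row 0 ranks first.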


In [None]:
# defining the query text
query_text = "Budget for elections"

# displaying the top 5 similar sentences
top_k_reviews = top_k_similar_sentences(embedding_matrix, query_text, 5)

for i in top_k_reviews:
    print(i, end="\n")
    print("*******************************************************************")
    print("\n")


In [None]:
# defining the query text
query_text = "High imports and exports"

# displaying the top 5 similar sentences
top_k_reviews = top_k_similar_sentences(embedding_matrix, query_text, 5)

for i in top_k_reviews:
    print(i, end="\n")
    print("*******************************************************************")
    print("\n")


In [None]:
meanDistortions = []
clusters = range(2, 11)

for k in clusters:
    clusterer = KMeans(n_clusters=k, random_state=1)
    clusterer.fit(embedding_matrix)

    # average squared Euclidean distance from each article to its nearest cluster center
    distortion = (
        sum(np.min(cdist(embedding_matrix, clusterer.cluster_centers_, "euclidean"), axis=1) ** 2)
        / embedding_matrix.shape[0]
    )
    meanDistortions.append(distortion)

    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
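
To see what the distortion measures, here is a small worked sketch on made-up 2-D points with fixed centers (all values are illustrative assumptions). It computes both the total and the average squared distance to the nearest center; scikit-learn exposes the total for a fitted model as `KMeans.inertia_`.

```python
import numpy as np

# four toy points and two fixed centers; every point lies exactly 1 unit from its nearest center
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])

# pairwise Euclidean distances, shape (n_points, n_centers)
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# distance from each point to its nearest center
nearest = distances.min(axis=1)

total_distortion = float(np.sum(nearest ** 2))      # sum of squared distances
mean_distortion = total_distortion / len(points)    # average squared distance per point

print(total_distortion, mean_distortion)
```

The total grows with the number of points, while the average does not; either one gives the same elbow shape when plotted against k.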


In [None]:
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()


In [None]:
sil_score = []
cluster_list = range(2, 11)

for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)

    preds = clusterer.fit_predict((embedding_matrix))

    score = silhouette_score(embedding_matrix, preds)
    sil_score.append(score)

    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))
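
For intuition about what `silhouette_score` reports, the sketch below computes a mean silhouette by hand on four 1-D toy points. The function `mean_silhouette` is a hypothetical helper written for illustration, and it assumes every cluster contains at least two points.

```python
import numpy as np

def mean_silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points, where a is the mean distance
    to the other points in the same cluster and b is the smallest mean distance
    to the points of any other cluster."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # pairwise Euclidean distances between all points
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster average
        a = dist[i, same].mean()
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

toy_points = [[0.0], [1.0], [10.0], [11.0]]
toy_labels = [0, 0, 1, 1]
score = mean_silhouette(toy_points, toy_labels)
print(score)
```

Tight, well-separated clusters like these give a score near 1; overlapping clusters push it toward 0.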


In [None]:
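plt.plot(cluster_list, sil_score, "bx-")
plt.xlabel("k")
plt.ylabel("Silhouette Score")
plt.title("Selecting k with the Silhouette Score", fontsize=20)
plt.show()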


In [None]:
# defining the number of clusters/categories
n_categories = 5

# fitting the model
kmeans=KMeans(n_clusters=n_categories,random_state=1).fit(embedding_matrix)

In [None]:
# creating a copy of the data
clustered_data = df.copy()

# assigning the cluster/category labels
clustered_data['Category'] = kmeans.labels_

clustered_data.head()

In [None]:
# for each cluster, printing 5 randomly sampled news articles
for i in range(n_categories):
    print("CLUSTER",i)
    print(clustered_data.loc[clustered_data.Category == i, 'Text'].sample(5, random_state=1).values)
    print("*****************************************************************")
    print("\n")


In [None]:
# dictionary of cluster label to category
category_dict = {
    0: 'Sports',
    1: 'Politics',
    2: 'Entertainment',
    3: 'Business',
    4: 'Technology'
}
# mapping cluster labels to categories
clustered_data['Category'] = clustered_data['Category'].map(category_dict)

clustered_data.head()


In [None]:
# loading the actual labels
labels = pd.read_csv("/content/drive/MyDrive/Intro to LLM and Gen AI/Mini Project 3/news_article_labels.csv")
# checking the unique labels
labels['Label'].unique()


In [None]:
labels.head()

In [None]:
labels.shape

In [None]:
labels.value_counts('Label')

In [None]:
# adding the actual categories to our dataframe
clustered_data['Actual Category'] = labels['Label'].values
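
The names in `category_dict` were assigned by reading sample articles from each cluster. Once reference labels are available, one possible cross-check is a majority vote per cluster. This is a sketch of that alternative on made-up data, not the method used in this notebook; `majority_label_per_cluster` and the toy lists are hypothetical.

```python
from collections import Counter, defaultdict

def majority_label_per_cluster(cluster_ids, true_labels):
    """Map each cluster id to the reference label that occurs most often in it."""
    votes = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, true_labels):
        votes[cluster_id][label] += 1
    return {cluster_id: counter.most_common(1)[0][0]
            for cluster_id, counter in votes.items()}

toy_clusters = [0, 0, 0, 1, 1, 2, 2, 2]
toy_labels = ["Sports", "Sports", "Politics", "Business", "Business",
              "Technology", "Technology", "Sports"]

suggested = majority_label_per_cluster(toy_clusters, toy_labels)
print(suggested)
```

A majority vote can assign the same label to two clusters, so it is better suited to checking a manual mapping than to replacing it.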


In [None]:
print(classification_report(clustered_data['Actual Category'], clustered_data['Category']))
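
`classification_report` prints precision, recall, and F1-score for each category. The sketch below shows how those three numbers arise for a single category, using made-up labels; `per_class_scores` is a hypothetical helper for illustration only.

```python
def per_class_scores(y_true, y_pred, positive):
    """Precision, recall, and F1 for one category treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correctly assigned
    fp = sum(t != positive and p == positive for t, p in pairs)  # wrongly assigned to the class
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed members of the class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

actual = ["Sports", "Sports", "Sports", "Politics", "Politics", "Business"]
predicted = ["Sports", "Sports", "Politics", "Politics", "Sports", "Business"]

precision, recall, f1 = per_class_scores(actual, predicted, "Sports")
print(precision, recall, f1)
```

Here two of the three articles predicted as Sports are correct, and two of the three true Sports articles are found, so precision, recall, and F1 are all two thirds.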

In [None]:
# creating a dataframe of incorrect categorizations
incorrect_category_data = clustered_data[clustered_data['Actual Category'] != clustered_data['Category']].copy()
incorrect_category_data.shape



In [None]:
incorrect_category_data.head()

In [None]:
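# article 24: comparing the center of cluster 2 (Entertainment) with the center of cluster 3 (Business), per category_dict
idx = 24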

print('Distance from Actual Category')
print(cdist(embedding_matrix[idx].reshape(1,-1), kmeans.cluster_centers_[[2]], "euclidean")[0,0])

print('Distance from Predicted Category')
print(cdist(embedding_matrix[idx].reshape(1,-1), kmeans.cluster_centers_[[3]], "euclidean")[0,0])


In [None]:
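# article 45: comparing the center of cluster 2 (Entertainment) with the center of cluster 4 (Technology), per category_dict
idx = 45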

print('Distance from Actual Category')
print(cdist(embedding_matrix[idx].reshape(1,-1), kmeans.cluster_centers_[[2]], "euclidean")[0,0])

print('Distance from Predicted Category')
print(cdist(embedding_matrix[idx].reshape(1,-1), kmeans.cluster_centers_[[4]], "euclidean")[0,0])
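
The two cells above compare one article at a time against two hard-coded centers. A more general sketch is shown below, under the assumption that a small gap between the nearest and second-nearest center signals an article that sits between categories. `centroid_margins` and the toy arrays are hypothetical.

```python
import numpy as np

def centroid_margins(embeddings, centers):
    """For each row, return (distance to second-nearest center) - (distance to nearest center)."""
    embeddings = np.asarray(embeddings, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Euclidean distance from every row to every center, shape (n_rows, n_centers)
    dist = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=2)
    sorted_dist = np.sort(dist, axis=1)
    return sorted_dist[:, 1] - sorted_dist[:, 0]

toy_centers = np.array([[0.0, 0.0], [4.0, 0.0]])
toy_points = np.array([
    [0.5, 0.0],  # clearly closest to the first center
    [2.1, 0.0],  # almost halfway between the two centers
])

margins = centroid_margins(toy_points, toy_centers)
print(margins)
```

In the notebook, the same function could be called with `embedding_matrix` and `kmeans.cluster_centers_`, and the rows with the smallest margins inspected first.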


#Conclusion

##Project Goal

1. Built an unsupervised machine learning system to automatically categorize news articles for E-News Express.

2. The system helps solve problems like information overload, manual classification errors, and delays in organizing articles.

##Technologies & Libraries Used

1. Programming was done in Python.

2. Used essential data science libraries:

- Pandas and NumPy for data loading, cleaning, and preprocessing

- Matplotlib and Seaborn for visualization

- scikit-learn and SciPy for K-Means clustering, evaluation metrics, and distance computations

3. Used Sentence Transformers (all-MiniLM-L6-v2) to convert text into dense numerical embeddings.

##Data Preprocessing

1. Loaded the dataset and checked it for missing values and duplicates.

2. Removed duplicate rows and reset the index; the article text was passed to the embedding model without further cleaning.

##Embedding & Similarity

1. Generated dense sentence embeddings for each article using Sentence Transformers.

2. Performed cosine similarity searches to find the most relevant articles for any query.

##Unsupervised Clustering

1. Applied K-Means Clustering to group articles based on semantic similarity.

2. Determined the optimal number of clusters using:

- Elbow Method (distortion scores)

- Silhouette Score

3. Selected 5 clusters as the most appropriate.

##Mapping Clusters to Real Categories

1. Analyzed each cluster and labeled them according to common themes like:

- Sports

- Politics

- Entertainment

- Business

- Technology

##Model Evaluation

1. Compared the cluster labels with human-labeled categories from an external file.

2. Generated a classification report to measure:

- Accuracy

- Precision

- Recall

- F1-score

##Error & Distance Analysis

1. Identified articles that were categorized incorrectly.

2. Analyzed the distance between article embeddings and cluster centers to understand misclassifications.

3. Found cases where articles were semantically close to multiple clusters.

##Final Outcome

- The project demonstrates how sentence embeddings combined with unsupervised clustering can automate news categorization.

- It can reduce manual work and support timely, personalized delivery and large-scale news management for media organizations.