<a href="https://colab.research.google.com/github/assermahmoud99/internship-tasks/blob/main/News_topic_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.3.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting numpy<2.0,>=1.18.5 (from gensim)
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Downloading gensim-4.3.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.6 MB)
Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.0 MB)

# 0. Downloading the Dataset from Kaggle


*   Here I use KaggleHub to automatically download the BBC News dataset.
*   The dataset consists of a single CSV file that contains all of the data.
*   I store the returned file path in the variable path so the file can be loaded later with pandas.




In [None]:
import kagglehub

# Download only bbc_news.csv from the latest version of the dataset
path = kagglehub.dataset_download("gpreda/bbc-news", path="bbc_news.csv")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'bbc-news' dataset.
Path to dataset files: /kaggle/input/bbc-news/bbc_news.csv


# 1. Importing Libraries & Loading the Dataset

Here I import the libraries used in the rest of the notebook and load the dataset with pandas, reading only the first 15,000 rows.







In [16]:
import pandas as pd
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel


df = pd.read_csv(path, nrows=15000)  # load only the first 15,000 rows

# 2. Preprocessing the News Articles

Here we prepare the dataset for topic modeling. We combine each article’s title and description into a single field called content. Then, using spaCy, we clean the text by:


*   Tokenizing the sentences into words.
*   Lowercasing everything.
*   Removing stopwords and non-alphabetic tokens.
*   Lemmatizing words (reducing them to their base form).


The result is stored in a new column token, which contains a clean list of tokens for each article.
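To make the cleaning steps concrete, here is a minimal standard-library sketch of the same idea. It is a simplified stand-in, not the spaCy pipeline used in this notebook: the stop-word list `TOY_STOP_WORDS`, the function `simple_clean`, and the example title and description are all made up for illustration, and lemmatization is left out because it needs a language model.

```python
import re

# Hypothetical, deliberately tiny stop-word list; spaCy's own list is far larger.
TOY_STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "on", "for"}


def simple_clean(text):
    """Lowercase, keep alphabetic runs only, and drop stop words (no lemmatization)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in TOY_STOP_WORDS]


# Made-up title and description, merged the same way as the content field.
title = "Oil prices soar to a 14-year high"
description = "The cost of fuel is rising in the UK."
content = title + " " + description

print(simple_clean(content))
```

Because there is no lemmatizer, inflected forms such as "prices" and "rising" survive unchanged here, whereas the spaCy version would reduce them to their base forms.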




In [17]:

# Merge title and description into one text field; missing values become empty strings.
df["content"] = df["title"].fillna("") + " " + df["description"].fillna("")
nlp = spacy.load('en_core_web_sm')

def clean_text(news):
    """Return the lowercased lemmas of the alphabetic, non-stop-word tokens in news."""
    doc = nlp(news)
    return [token.lemma_.lower() for token in doc if token.is_alpha and not token.is_stop]

df['token'] = df['content'].astype(str).apply(clean_text)
df['token'].head(10)

Unnamed: 0,token
0,"[ukraine, angry, zelensky, vow, punish, russia..."
1,"[war, ukraine, take, cover, town, attack, jere..."
2,"[ukraine, war, catastrophic, global, food, wor..."
3,"[manchester, arena, bombing, saffie, roussos, ..."
4,"[ukraine, conflict, oil, price, soar, high, le..."
5,"[ukraine, war, pm, hold, talk, world, leader, ..."
6,"[ukraine, war, uk, grant, ukrainian, refugee, ..."
7,"[tiktok, limit, service, netflix, pull, russia..."
8,"[covid, fourth, jab, scotland, vulnerable, tes..."
9,"[protest, russia, thousand, detain, people, ho..."


# 3. Building the Dictionary, Corpus, and Training LDA

Next, we convert the cleaned tokens into a format that the LDA model can understand.


*   We build a Dictionary (word → unique ID mapping).
*   We filter out very rare words (those appearing in fewer than 5 documents) and very common words (those appearing in more than 50% of documents).
*   We create a Bag-of-Words corpus in which each document is represented as a list of (word ID, count) pairs.
*   Finally, we train a Latent Dirichlet Allocation (LDA) model with 10 topics. The model tries to uncover hidden themes across the articles.
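The dictionary filtering and bag-of-words steps above can be sketched in plain Python on a toy corpus. This is only an illustration of the idea, not gensim's implementation: `build_vocab`, `to_bow`, and the four toy documents are hypothetical, the thresholds are scaled down to suit four documents, and gensim assigns its own word IDs.

```python
from collections import Counter

# Hypothetical toy corpus of already-cleaned token lists.
docs = [
    ["say", "ukraine", "war", "russia"],
    ["say", "ukraine", "war", "oil", "price"],
    ["say", "league", "cup", "final"],
    ["say", "ukraine", "cup", "war", "war"],
]


def build_vocab(docs, no_below, no_above):
    """Map each kept token to an ID.

    A token is kept if it occurs in at least `no_below` documents and in
    at most a fraction `no_above` of all documents.
    """
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(t for t, freq in doc_freq.items() if no_below <= freq <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens outside the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


token2id = build_vocab(docs, no_below=2, no_above=0.75)
bow_corpus = [to_bow(doc, token2id) for doc in docs]

print(token2id)
print(bow_corpus)
```

In this toy run "say" is dropped for being too common (it occurs in every document), and words that occur in only one document are dropped for being too rare, which mirrors the role of no_below and no_above in the filtering step.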











In [18]:
dictionary = Dictionary(df['token'])
dictionary.filter_extremes(no_below=5, no_above=0.5)
corpus = [dictionary.doc2bow(tokens) for tokens in df['token']]
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,
    random_state=42,
    passes=10,
    per_word_topics=True
)

# 4. Displaying the Most Significant Words per Topic

After training, we extract the top 5 words for each topic from the LDA model. These are the words to which the topic assigns the highest probability, so they are the most representative terms for each latent topic discovered. We store them in a pandas DataFrame for readability. This output lets us interpret each topic (e.g., politics, sports, health, economy) from its most probable words.

In [21]:
topics_data = []
for idx, topic in lda_model.show_topics(formatted=False, num_words=5):
    words = [word for word, prob in topic]
    topics_data.append({"Topic": idx, "Top Words": ", ".join(words)})

topics_df = pd.DataFrame(topics_data)
print(topics_df)

   Topic                                    Top Words
0      0  league, champions, arrest, liverpool, score
1      1              say, king, ireland, truss, open
2      2           ukraine, war, russia, russian, say
3      3        queen, covid, earthquake, people, day
4      4    manchester, city, united, league, premier
5      5              uk, sunak, government, new, say
6      6              strike, cost, pay, rise, living
7      7               uk, say, papers, lead, johnson
8      8                year, police, die, bbc, woman
9      9              world, cup, england, win, final
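To show what the extraction loop works with, here is a small sketch on hand-written data in the shape that show_topics(formatted=False) returns: a list of (topic ID, list of (word, probability) pairs). The probabilities and the helper `top_words` are made up for illustration; the explicit sort is only there to make the ranking visible, since it is an assumption of this sketch that the input might be unordered.

```python
# Hypothetical data in the (topic_id, [(word, probability), ...]) shape.
# The probabilities are invented and do not come from the trained model.
toy_topics = [
    (0, [("war", 0.05), ("ukraine", 0.08), ("russia", 0.04)]),
    (1, [("cup", 0.06), ("world", 0.07), ("final", 0.03)]),
]


def top_words(topics, num_words):
    """Return one row per topic with its `num_words` most probable words."""
    rows = []
    for topic_id, word_probs in topics:
        # Rank the words of this topic from most to least probable.
        ranked = sorted(word_probs, key=lambda pair: pair[1], reverse=True)
        words = [word for word, _ in ranked[:num_words]]
        rows.append({"Topic": topic_id, "Top Words": ", ".join(words)})
    return rows


rows = top_words(toy_topics, num_words=2)
for row in rows:
    print(row)
```

Each row has the same two keys as the entries appended to topics_data, so a list like this could be passed to pd.DataFrame in the same way.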
