In [38]:
import pandas as pd
import numpy as np
from gensim import corpora, models
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
import nltk 
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /home/darth/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/darth/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/darth/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/darth/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

# Objective of the notebook

- In this project, I want to find the top 10 topics discussed in news articles for each of four categories ('Politics', 'Business', 'Arts', 'Sports') in dw_articles.csv.
- The data was scraped from the DW News portal (code in task2_data_acquistion.ipynb).
- Each record contains the text of a news article and its category as assigned by DW News.
- We will group the data into the 4 categories ('Politics', 'Business', 'Arts', 'Sports') and explore the top 10 topics in each category using Latent Dirichlet Allocation (LDA), one of the most popular topic modeling algorithms.

## Load and preprocess the dataset

In [2]:
# load the scraped dw articles file
df = pd.read_csv('dw_articles.csv')
df.head()

Unnamed: 0,url,category,title,text,target_category
0,https://www.dw.com/en/germany-undeterred-by-gl...,POLITICS,Germany undeterred by global turmoil — Scholz ...,"In his New Year's message, German Chancellor O...",Politics
1,https://www.dw.com/en/taiwan-presidential-cand...,POLITICS,Taiwan: Presidential candidates debate in shad...,Taiwan's presidential candidates argued over w...,Politics
2,https://www.dw.com/en/emboldened-iran-silences...,HUMAN RIGHTS,Emboldened Iran silences critics as world look...,As least 690 prisoners were executed in Iran i...,Politics
3,https://www.dw.com/en/berlin-prepares-for-anot...,SOCIETY,Berlin prepares for another rowdy New Year's E...,"In Germany, Christmas is all about contemplati...",Politics
4,https://www.dw.com/en/albania-former-pm-put-un...,CRIME,Albania: Former PM put under house arrest in c...,Albania's right-wing opposition leader Sali Be...,Politics


In [3]:
df['category'].value_counts()

category
POLITICS                   202
SPORTS                      95
CULTURE                     67
BUSINESS                    59
SOCCER                      58
SOCIETY                     44
CONFLICTS                   36
ARTS                        27
FILM                        24
CRIME                       19
NATURE AND ENVIRONMENT      18
MUSIC                       17
HUMAN RIGHTS                16
LAW AND JUSTICE             15
SCIENCE                     10
TECHNOLOGY                   9
MIGRATION                    7
CLIMATE                      6
RELIGION                     6
CATASTROPHE                  5
HISTORY                      5
EDUCATION                    5
CARS AND TRANSPORTATION      5
TRAVEL                       5
LIFESTYLE                    5
LITERATURE                   4
MEDIA                        3
TRADE                        3
TERRORISM                    3
HEALTH                       3
RULE OF LAW                  3
OFFBEAT                      2

### Grouping similar categories and dropping categories other than 'Business', 'Politics', 'Sports', 'Arts'

In [4]:
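# Keep only the categories of interest; .copy() avoids pandas' SettingWithCopyWarning on the assignment below
df = df[df['category'].isin(['ARTS', 'CULTURE', 'FILM', 'MUSIC', 'POLITICS', 'SOCCER', 'SPORTS', 'BUSINESS'])].copy()
# Merge related categories into the four target groups
df['category'] = df['category'].replace({'SOCCER':'SPORTS', 'FILM':'ARTS', 'MUSIC':'ARTS', 'CULTURE':'ARTS'})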
df.category.value_counts()

category
POLITICS    202
SPORTS      153
ARTS        135
BUSINESS     59
Name: count, dtype: int64

### Remove non-alphanumeric characters and extra whitespaces

In [6]:
def clean_text(text):
    # Remove non-alphanumeric characters (whitespace is kept)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Collapse runs of whitespace into single spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

In [7]:
df['text'] = df['text'].apply(clean_text)
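
To see what the character filter does, here is a small check on a made-up sentence (the sample string is an assumption, not taken from the dataset). Note that punctuation is deleted rather than replaced by a space, so possessives and decimals are fused into a single token.

```python
import re

# Hypothetical input sentence, not from dw_articles.csv
sample = "Germany's GDP grew 0.3% in Q4, the EU said."

# Same character filter as in clean_text: keep letters, digits and whitespace
cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', sample)
print(cleaned)  # Germanys GDP grew 03 in Q4 the EU said
```

Since "0.3%" becomes "03", numbers lose their meaning after this step. That is acceptable here because only nouns are kept later on.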

### Tokenizing and lemmatizing English words in news articles

- Using the POS tags, we keep only nouns, because nouns carry most of the information about what a topic is.

In [34]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
df['processed_text'] = df['text'].apply(lambda x: [
    lemmatizer.lemmatize(word) for word, pos in pos_tag(word_tokenize(x.lower())) if word not in stop_words and pos.startswith('N')
])
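
The filtering logic of the comprehension can be illustrated without NLTK by hard-coding the tagger output. The (word, tag) pairs and the tiny stop-word set below are hand-written assumptions in Penn Treebank style, standing in for what `pos_tag` and `stopwords.words('english')` would return; lemmatization is left out of this sketch.

```python
# Hand-written (word, tag) pairs standing in for pos_tag(...) output (assumption)
tagged = [
    ("the", "DT"), ("chancellor", "NN"), ("visited", "VBD"),
    ("two", "CD"), ("factories", "NNS"), ("in", "IN"),
    ("berlin", "NN"), ("yesterday", "NN"),
]
stop_words = {"the", "in"}  # tiny stand-in for the NLTK stop-word list

# Keep a token only if it is not a stop word and its tag starts with 'N'
# (NN, NNS, NNP and NNPS are the noun tags in the Penn Treebank tag set)
nouns = [word for word, pos in tagged
         if word not in stop_words and pos.startswith("N")]
print(nouns)  # ['chancellor', 'factories', 'berlin', 'yesterday']
```

In the real pipeline, `lemmatizer.lemmatize` would additionally map a plural such as "factories" to its singular form.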

## Latent Dirichlet Allocation (LDA)

- LDA is a generative probabilistic model that assumes each document in a corpus is a mix of topics and that each word in the document is attributable to one of the document's topics.
- It produces interpretable topics and is widely used and well established.
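
The generative story behind LDA can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the two topics, their word probabilities and the `alpha` values are invented, and the topic-word distributions are fixed by hand, whereas LDA proper treats them as unknowns to be estimated from the corpus.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Two hypothetical topics, each a probability distribution over a toy vocabulary
topics = {
    "politics": {"election": 0.5, "party": 0.3, "vote": 0.2},
    "sports": {"football": 0.6, "club": 0.3, "coach": 0.1},
}

def sample_dirichlet(alpha):
    # A Dirichlet draw is a vector of independent Gamma draws, normalised to sum to 1
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=(0.5, 0.5)):
    names = list(topics)
    theta = sample_dirichlet(alpha)  # per-document topic mixture
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position, then a word from that topic
        topic = random.choices(names, weights=theta)[0]
        vocab = topics[topic]
        word = random.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return theta, words

theta, words = generate_document(8)
print([round(t, 2) for t in theta], words)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions that most plausibly produced them.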

### Create a dictionary, corpus and lda_model on each news category

In [48]:
# Get top topics for each category
top_topics_per_category = {}

for category in df['category'].unique():
    # Filter DataFrame by category
    category_df = df[df['category'] == category]

    # documents = [d.split() for d in category_df['processed_text'].values]
    # Create a dictionary and a corpus
    dictionary = corpora.Dictionary(category_df['processed_text'].values)
    corpus = [dictionary.doc2bow(text) for text in category_df['processed_text'].values]

    # Build the LDA model
    num_topics = 10
    lda_model = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)

    # Extract the top 10 words of each topic and store them with the model artifacts
    top_words = [
        [word for word, _ in lda_model.show_topic(topic_id)]
        for topic_id in range(num_topics)
    ]
    top_topics_per_category[category] = ((lda_model, corpus, dictionary), top_words)
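
For readers unfamiliar with the bag-of-words format, the sketch below mimics in plain Python what the dictionary and corpus contain: an integer id per distinct token, and each document as a list of (token_id, count) pairs. It is not gensim's implementation; the toy documents and the helper name `to_bow` are assumptions, and gensim may assign ids in a different order.

```python
from collections import Counter

# Hypothetical token lists, shaped like rows of processed_text
docs = [
    ["election", "party", "vote", "party"],
    ["party", "minister", "election"],
]

# Assign an integer id to each distinct token (first-seen order in this sketch)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    # Count tokens and return sorted (token_id, count) pairs
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

corpus = [to_bow(doc) for doc in docs]
print(token2id)  # {'election': 0, 'party': 1, 'vote': 2, 'minister': 3}
print(corpus)    # [[(0, 1), (1, 2), (2, 1)], [(0, 1), (1, 1), (3, 1)]]
```

Word order is discarded in this representation, which is why LDA only sees how often each word occurs in a document.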


### Printing top 10 topics, each topic has 10 words 

In [49]:
# Display top topics for each category
for category, topics in top_topics_per_category.items():
    print(f"Top Topics for Category {category}:")
    for idx, topic in enumerate(topics[1]):
        print(f"  Topic {idx + 1}: {topic}")
    print("\n")

Top Topics for Category POLITICS:
  Topic 1: ['election', 'government', 'party', 'president', 'people', 'year', 'country', 'dw', 'minister', 'vote']
  Topic 2: ['election', 'country', 'government', 'trump', 'year', 'president', 'state', 'party', 'congo', 'eu']
  Topic 3: ['country', 'election', 'year', 'rwanda', 'people', 'president', 'vote', 'africa', 'poland', 'voter']
  Topic 4: ['germany', 'state', 'group', 'berlin', 'country', 'attack', 'israel', 'government', 'war', 'member']
  Topic 5: ['football', 'government', 'sport', 'kong', 'world', 'navalny', 'hong', 'security', 'saudi', 'law']
  Topic 6: ['china', 'state', 'president', 'brics', 'beijing', 'nation', 'country', 'world', 'taiwan', 'policy']
  Topic 7: ['community', 'year', 'business', 'germany', 'government', 'people', 'thailand', 'country', 'kashmir', 'lgbtq']
  Topic 8: ['medium', 'party', 'country', 'minister', 'company', 'bosnia', 'election', 'government', 'herzegovina', 'year']
  Topic 9: ['government', 'germany', 'stat

### Topics visualization for Politics category

In [53]:
topic_values = top_topics_per_category['POLITICS'][0]
vis_data = gensimvis.prepare(topic_values[0], topic_values[1], topic_values[2])

# Display the interactive visualization directly in the Jupyter Notebook
pyLDAvis.display(vis_data)
    

How to read the visualization?

- Each bubble represents a topic. The larger the bubble, the larger the share of the words in the news corpus that is assigned to that topic.
- Blue bars represent the overall frequency of each word in the corpus. If no topic is selected, the blue bars of the most salient terms across the whole corpus are displayed.
- Red bars give the estimated number of times a given term was generated by a given topic.
- The farther apart two bubbles are, the more different the topics are. A good topic model therefore has fairly large, non-overlapping bubbles scattered throughout the chart.

Note: Explaining all 10 topics in every category would be too exhaustive, so we will explore only the large and distinctive topics in each category. By default pyLDAvis numbers the topics by size, so the topic numbers in the plots do not necessarily match the numbering in the printed output above.
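
To make "more different" concrete: as far as I know, pyLDAvis by default measures the distance between two topics with the Jensen-Shannon divergence of their word distributions and then projects these distances to two dimensions. The sketch below shows only the divergence itself, on invented distributions over a four-word vocabulary; the function names are my own.

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # Jensen-Shannon divergence: average KL of p and q to their midpoint m
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic-word distributions over the vocabulary
# ['election', 'party', 'football', 'club']
topic_a = [0.5, 0.4, 0.05, 0.05]
topic_b = [0.45, 0.45, 0.05, 0.05]
topic_c = [0.05, 0.05, 0.5, 0.4]

print(js_divergence(topic_a, topic_b))  # small: the two election topics are similar
print(js_divergence(topic_a, topic_c))  # larger: election vs. football topic
```

Topics that share most of their probability mass, like the two election topics here, end up close together on the chart, while a topic with a different vocabulary is placed far away.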

Inferences:
- From the above plot, state, germany, minister, group, china and israel are the top keywords overall in the news articles of the politics category.
- Topic 1 is mostly about election, country, president, trump, government, year, etc. So this topic likely covers articles about the upcoming US election.
- Topics 5 and 4 are very similar to topic 1, as their top words such as election, government, president and party show. Topic 5 also has rwanda as a top word, so it is likely about the election in Rwanda.
- Let's explore topic 8, which is the farthest from topics 1, 4 and 5. It has keywords such as football, sport and government, so it covers articles where sports and government appear together.

### Topics visualization for Business category

In [50]:
topic_values = top_topics_per_category['BUSINESS'][0]
vis_data = gensimvis.prepare(topic_values[0], topic_values[1], topic_values[2])

# Display the interactive visualization directly in the Jupyter Notebook
pyLDAvis.display(vis_data)
    

Inferences:
- From the above plot, company, germany, government, country, industry, price, ai, eu and amazon are the most significant keywords overall in the news articles of the business category.
- Topic 1 is mostly about the electric car market, as we can see from words such as ev, price, byd, tesla, auto, carmaker, etc.
- Topic 8 is very similar to topic 1 and talks about cars, vehicles, energy and market.
- Topic 7, being the farthest from the others, represents distinctive keywords and articles: train, rail, network, bahn, infrastructure. So this topic is about the rail network and infrastructure in Germany.

### Topics visualization for Sports category

In [51]:
topic_values = top_topics_per_category['SPORTS'][0]
vis_data = gensimvis.prepare(topic_values[0], topic_values[1], topic_values[2])

# Display the interactive visualization directly in the Jupyter Notebook
pyLDAvis.display(vis_data)
    

Inferences:
- From the above plot, football, game, player, bundesliga, club, woman, coach, fan and rubiales are the most important words across all topics in the sports category.
- Topic 1 is the biggest topic and talks about football, teams, rubiales (Luis Rubiales, former president of the Spanish football federation), spain and cup. So this is likely about football in Spain.
- Topic 7, the most distinctive topic here, talks about the Olympic Games, sports events and winter sports.
- Topic 10 is mostly about sports in Saudi Arabia.

### Topics visualization for Arts category

In [52]:
topic_values = top_topics_per_category['ARTS'][0]
vis_data = gensimvis.prepare(topic_values[0], topic_values[1], topic_values[2])

# Display the interactive visualization directly in the Jupyter Notebook
pyLDAvis.display(vis_data)
    

Inferences:
- From the above plot, art, museum, artists, berlin, exhibition, festival, film, disney and painting are the most frequent words in this category.
- Topic 1 is the biggest topic and talks about museums, art, films and exhibitions.
- Topic 5 is about films, art and painting, and topic 7 talks about films and arts, possibly in Berlin.
- Topic 6 is the most distinctive topic of all, since it contains words such as yoga, person and japan, which suggests Eastern philosophy alongside art and painting.

### Conclusion 

- From the above experiment, LDA proves to be an effective topic modeling algorithm that can quickly provide an overview of the hot topics in different domains.
- A reader can quickly understand what's going on in the world and in Germany right now.
- The interactive visualization is also a handy tool that is easy to use and understand.