# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics of a dataset of news headlines.

Begin by importing the needed libraries:

In [1]:
# TODO: import needed libraries
import nltk
import numpy as np
import pandas as pd

Load the data in the file `random_headlines.csv`

In [3]:
# TODO: load the dataset
df = pd.read_csv("random_headlines.csv")
print(df.shape)
df.head()

(20000, 2)


|   | publish_date | headline_text |
|---|---|---|
| 0 | 20120305 | ute driver hurt in intersection crash |
| 1 | 20081128 | 6yo dies in cycling accident |
| 2 | 20090325 | bumper olive harvest expected |
| 3 | 20100201 | replica replaces northernmost sign |
| 4 | 20080225 | woods targets perfect season |


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset before modelling it.

In [4]:
# TODO: Perform a short EDA
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
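
The `info()` output shows that `publish_date` is stored as an integer such as `20120305`. One possible next EDA step, sketched here on the five rows shown earlier rather than on the full file, is to parse that column into real dates and look at headline lengths. The column names `date` and `n_words` are illustrative choices and not part of the original notebook.

```python
import pandas as pd

# Small sample copied from the df.head() output (assumption: the full
# dataset has the same two columns and the same YYYYMMDD date format).
sample = pd.DataFrame({
    "publish_date": [20120305, 20081128, 20090325, 20100201, 20080225],
    "headline_text": [
        "ute driver hurt in intersection crash",
        "6yo dies in cycling accident",
        "bumper olive harvest expected",
        "replica replaces northernmost sign",
        "woods targets perfect season",
    ],
})

# Parse the integer dates into datetimes ("date" is a hypothetical column name).
sample["date"] = pd.to_datetime(sample["publish_date"].astype(str), format="%Y%m%d")

# Count whitespace-separated words per headline ("n_words" is also hypothetical).
sample["n_words"] = sample["headline_text"].str.split().str.len()

print(sample["date"].min(), sample["date"].max())
print(sample["n_words"].describe())
```

On the real data, the same two lines can be applied to `df` directly to check the time span covered and whether any headlines are unusually short or long.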


Now perform all the needed preprocessing on those headlines: case lowering, tokenization, punctuation removal, stopwords removal, stemming/lemmatization.

In [8]:
# TODO: Preprocess the input data

# lowercase, then tokenize (requires the NLTK tokenizer data to be downloaded)
df['tokens'] = df['headline_text'].str.lower().apply(nltk.word_tokenize)

# keep purely alphabetic tokens: drops punctuation and tokens containing digits such as "6yo"
df['alphanumeric'] = df['tokens'].apply(lambda row: [word for word in row if word.isalpha()])

# remove stop words (requires nltk.download('stopwords')); a set makes lookups fast
stop = set(nltk.corpus.stopwords.words('english'))
df['stop'] = df['alphanumeric'].apply(lambda row: [word for word in row if word not in stop])

# stemming
stemmer = nltk.PorterStemmer()
df['stemmed'] = df['stop'].apply(lambda row: [stemmer.stem(word) for word in row])

df['stemmed'].head()

0    [ute, driver, hurt, intersect, crash]
1                       [die, cycl, accid]
2          [bumper, oliv, harvest, expect]
3    [replica, replac, northernmost, sign]
4          [wood, target, perfect, season]
Name: stemmed, dtype: object

Now use Gensim to compute a BOW

In [9]:
# TODO: Compute the BOW using Gensim
from gensim.corpora import Dictionary

dictionary = Dictionary(df['stemmed'])
corpus = [dictionary.doc2bow(line) for line in df['stemmed']]
print(len(corpus))  # number of documents; np.shape fails on ragged lists in recent NumPy
corpus[0:2]

20000

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1)]]
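
Each headline is now a list of `(token_id, count)` pairs. The pure-Python sketch below mimics that encoding on the first two stemmed headlines. It only illustrates the bag-of-words idea and is not Gensim's implementation; the helper names `build_vocab` and `to_bow` are made up for this example.

```python
from collections import Counter

# First two stemmed headlines, copied from the preprocessing output.
docs = [
    ["ute", "driver", "hurt", "intersect", "crash"],
    ["die", "cycl", "accid"],
]

def build_vocab(documents):
    """Assign an integer id to every distinct token (hypothetical helper)."""
    token2id = {}
    for doc in documents:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one document (hypothetical helper)."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

token2id = build_vocab(docs)
bow = [to_bow(doc, token2id) for doc in docs]
print(bow)
```

Word order inside a headline is discarded at this point: only which tokens occur, and how often, is kept.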

Compute the TF-IDF using Gensim

In [10]:
# TODO: Compute TF-IDF
from gensim.models import TfidfModel

tfidf_model = TfidfModel(corpus)
tf_idf = tfidf_model[corpus]
print(len(tf_idf))  # number of documents in the transformed corpus

20000
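
As a rough picture of what the TF-IDF step does, the sketch below multiplies each term count by log2(N / df), where N is the number of documents and df the number of documents containing the term, and then scales each document vector to unit length. This is meant to mirror the default behaviour described in Gensim's `TfidfModel` documentation, but treat that correspondence as an assumption to verify. The toy corpus and the function name `tfidf` are invented for illustration.

```python
import math
from collections import Counter

# Toy BOW corpus: each document is a list of (token_id, count) pairs.
toy_corpus = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 2)],
    [(0, 1), (1, 1), (3, 1)],
]

def tfidf(corpus):
    n_docs = len(corpus)
    # Document frequency: in how many documents each token id appears.
    df = Counter(token_id for doc in corpus for token_id, _ in doc)
    weighted = []
    for doc in corpus:
        # Raw weight: term count times log2(N / df).
        vec = [(tid, tf * math.log2(n_docs / df[tid])) for tid, tf in doc]
        # Drop zero weights: a token present in every document carries no information.
        vec = [(tid, w) for tid, w in vec if w > 0]
        # Scale to unit Euclidean length.
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted

result = tfidf(toy_corpus)
for doc in result:
    print(doc)
```

Token `0` appears in all three toy documents, so it disappears from every weighted vector, while the rarer tokens dominate.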


Finally, compute **LSA** (also called LSI) using Gensim, with a number of topics of your choice.

In [11]:
# TODO: Compute LSA
from gensim.models import LsiModel

# Note: this fits LSA on the raw BOW counts; pass corpus=tf_idf to use the TF-IDF weights instead
lsi = LsiModel(corpus=corpus, num_topics=4, id2word=dictionary)

For each topic, show the most significant words.

In [12]:
# TODO: Print the 3 or 4 most significant words of each topic
lsi.print_topics(num_words=3)

[(0, '-0.752*"polic" + -0.404*"man" + -0.209*"charg"'),
 (1, '-0.669*"man" + 0.575*"polic" + -0.330*"charg"'),
 (2, '-0.654*"new" + -0.298*"plan" + 0.241*"man"'),
 (3, '0.703*"new" + -0.341*"say" + -0.331*"plan"')]

What do you think about those results?

The first two topics are built from the same top words (`polic`, `man`, `charg`) and differ mainly in the signs of their weights; the last two likewise both revolve around `new` and `plan`. With top words such as police, man, charge, new and plan, the topics most likely relate to police work and politics, which are both very common in news headlines.
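
The mixed signs in the LSA output are easier to interpret with the underlying linear algebra in mind: LSA is a truncated singular value decomposition of the term-document matrix, and a singular vector multiplied by -1 is an equally valid solution. The sketch below shows this with NumPy on a tiny invented count matrix. It illustrates the idea only and is not how Gensim's `LsiModel` is implemented internally.

```python
import numpy as np

# Invented term-document count matrix: rows are terms, columns are documents.
terms = ["polic", "man", "charg", "new", "plan"]
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Full SVD, then keep the k strongest components ("topics").
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each column of U_k holds the term weights of one topic.
for topic in range(k):
    weights = ", ".join(f"{w:+.3f}*{t}" for w, t in zip(U_k[:, topic], terms))
    print(f"topic {topic}: {weights}")

# Flipping the sign of a topic (in both U and Vt) leaves the reconstruction
# unchanged, so only the relative signs within a topic carry meaning.
U_flipped, Vt_flipped = U_k.copy(), Vt_k.copy()
U_flipped[:, 0] *= -1
Vt_flipped[0, :] *= -1
same = np.allclose(U_k @ np.diag(s_k) @ Vt_k, U_flipped @ np.diag(s_k) @ Vt_flipped)
print("reconstruction unchanged after sign flip:", same)
```

This is why two LSA topics can list the same words with opposite signs: the components are orthogonal directions in term space, not lists of words that "belong" to a topic.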

Now let's try to use LDA instead of LSA using Gensim

In [15]:
# TODO: Compute LDA
from gensim.models import LdaModel

lda = LdaModel(corpus=corpus, num_topics=4, id2word=dictionary, random_state=0, chunksize=512, passes=5)

In [16]:
# TODO: print the most frequent words of each topic
lda.print_topics(num_words=3)

[(0, '0.016*"report" + 0.009*"back" + 0.009*"may"'),
 (1, '0.012*"mine" + 0.011*"polic" + 0.009*"elect"'),
 (2, '0.013*"question" + 0.010*"council" + 0.010*"fund"'),
 (3, '0.012*"sydney" + 0.012*"charg" + 0.011*"australian"')]

Now, how does it work with LDA?

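LDA models each document as a mixture of topics and each topic as a probability distribution over words. The weights are therefore non-negative, which usually gives a more interpretable picture of the common topics than LSA.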
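
The generative story behind LDA can be sketched in a few lines of NumPy: every topic is a probability distribution over the vocabulary, every document draws its own mixture of topics, and each word is produced by first picking a topic and then a word from it. Everything below (vocabulary, Dirichlet parameters, document length, the function `generate_document`) is an invented toy setup meant to illustrate the model, not Gensim's inference code.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["polic", "charg", "council", "fund", "mine", "elect"]
n_topics = 2

# Each topic is a probability distribution over the vocabulary (rows sum to 1).
topic_word = rng.dirichlet(alpha=np.full(len(vocab), 0.5), size=n_topics)

def generate_document(n_words=5):
    # Each document has its own mixture of topics.
    doc_topic = rng.dirichlet(alpha=np.full(n_topics, 0.5))
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=doc_topic)        # pick a topic for this word
        w = rng.choice(len(vocab), p=topic_word[z])  # pick a word from that topic
        words.append(vocab[w])
    return doc_topic, words

doc_topic, words = generate_document()
print("topic mixture:", np.round(doc_topic, 3))
print("generated headline:", words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word and document-topic distributions.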

Let's make some visualization of the LDA results using pyLDAvis.

In [20]:
# TODO: show visualization results of the LDA
import pyLDAvis
import pyLDAvis.gensim

pyLDAvis.enable_notebook()

vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
vis

BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

This error is raised by the worker processes that pyLDAvis starts to parallelize its computations. A workaround that often helps is to disable the parallelism by passing `n_jobs=1`, i.e. `pyLDAvis.gensim.prepare(lda, corpus, dictionary, n_jobs=1)`. Note also that in recent pyLDAvis releases the module is named `pyLDAvis.gensim_models` instead of `pyLDAvis.gensim`.

Depending on your results, you can try to fine-tune the algorithm (number of topics, hyperparameters, ...) and compare your results with others.