# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics of a dataset of news headlines.

Begin by importing the needed libraries:

In [1]:
# TODO: import needed libraries
import nltk
import numpy as np
import pandas as pd
import gensim

Load the data from the file `random_headlines.csv`:

In [2]:
# TODO: load the dataset
df = pd.read_csv('random_headlines.csv')
print(df.shape)
df.head()

(20000, 2)


|   | publish_date | headline_text |
|---|---|---|
| 0 | 20120305 | ute driver hurt in intersection crash |
| 1 | 20081128 | 6yo dies in cycling accident |
| 2 | 20090325 | bumper olive harvest expected |
| 3 | 20100201 | replica replaces northernmost sign |
| 4 | 20080225 | woods targets perfect season |


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset before modelling it.

In [3]:
# TODO: Perform a short EDA
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
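The `df.info()` output shows that `publish_date` is stored as an integer, so one natural EDA step is to parse it into real dates and to look at headline lengths. The following is a minimal standard-library sketch on the five rows shown above. The variable names are illustrative, and reading the integers as `YYYYMMDD` is an assumption based on the sample values.

```python
from collections import Counter
from datetime import datetime

# The five sample rows printed by df.head() above
rows = [
    (20120305, "ute driver hurt in intersection crash"),
    (20081128, "6yo dies in cycling accident"),
    (20090325, "bumper olive harvest expected"),
    (20100201, "replica replaces northernmost sign"),
    (20080225, "woods targets perfect season"),
]

# Assumption: the integers encode dates as YYYYMMDD
dates = [datetime.strptime(str(d), "%Y%m%d").date() for d, _ in rows]

# Number of headlines per year
per_year = Counter(d.year for d in dates)

# Number of whitespace-separated words per headline
word_counts = [len(text.split()) for _, text in rows]

print(sorted(per_year.items()))
print(word_counts, sum(word_counts) / len(word_counts))
```

On the full DataFrame, the same idea could be written as `pd.to_datetime(df['publish_date'].astype(str), format='%Y%m%d')` followed by `.dt.year.value_counts()`.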


Now perform all the needed preprocessing on those headlines: case lowering, tokenization, punctuation removal, stopwords removal, stemming/lemmatization.

In [4]:
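# Lowercase (a no-op if the headlines are already lowercase), then tokenize
df['tokens'] = df['headline_text'].str.lower().apply(nltk.word_tokenize)
# Keep purely alphabetic tokens: drops punctuation and tokens with digits such as "6yo"
df['alphanumeric'] = df['tokens'].apply(lambda row: [
    word for word in row if word.isalpha()
])
# Remove English stopwords (a set makes the membership test fast)
stop = set(nltk.corpus.stopwords.words('english'))
df['stop'] = df['alphanumeric'].apply(lambda row: [
    word for word in row if word not in stop
])
# Reduce each remaining word to its stem
stemmer = nltk.PorterStemmer()
df['stemmed'] = df['stop'].apply(lambda row: [
    stemmer.stem(word) for word in row
])
df['stemmed'].head()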

0    [ute, driver, hurt, intersect, crash]
1                       [die, cycl, accid]
2          [bumper, oliv, harvest, expect]
3    [replica, replac, northernmost, sign]
4          [wood, target, perfect, season]
Name: stemmed, dtype: object

Now use Gensim to compute a bag-of-words (BOW) representation:

In [10]:
from gensim.corpora import Dictionary
dictionary = Dictionary(df['stemmed'])
corpus = [dictionary.doc2bow(line) for line in df['stemmed']]
# Number of documents; len() is safe on a list of variable-length documents
print(len(corpus))
corpus[0:2]

20000


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1)]]
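Each document in the output above is a list of `(token_id, count)` pairs. The following standard-library sketch mirrors the idea behind `doc2bow` on the first two stemmed headlines. It is an illustration only, not Gensim's actual implementation, and the names `toy_docs`, `toy_token2id` and `to_bow` are hypothetical.

```python
from collections import Counter

# First two stemmed headlines from the preprocessing output above
toy_docs = [
    ["ute", "driver", "hurt", "intersect", "crash"],
    ["die", "cycl", "accid"],
]

toy_token2id = {}  # vocabulary: token -> integer id

def to_bow(tokens):
    """Return sorted (token_id, count) pairs, adding unseen tokens to the vocabulary."""
    counts = Counter(tokens)
    # Unseen tokens get the next free id; sorting makes the assignment deterministic
    for token in sorted(counts):
        if token not in toy_token2id:
            toy_token2id[token] = len(toy_token2id)
    return sorted((toy_token2id[token], n) for token, n in counts.items())

toy_bow = [to_bow(doc) for doc in toy_docs]
print(toy_bow)
```

A word that appeared twice in a headline would show up as a single pair with a count of 2, which is why the representation discards word order but keeps frequencies.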

Compute the TF-IDF using Gensim

In [14]:
# TODO: Compute TF-IDF
from gensim.models import TfidfModel
tfidf_model = TfidfModel(corpus)
tf_idf = tfidf_model[corpus]
# Number of documents in the transformed corpus
print(len(tf_idf))

20000
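To make the weighting concrete, here is a minimal standard-library sketch of one common TF-IDF variant: the raw count multiplied by `log2(N / df)`, followed by scaling each document vector to unit length. This is believed to be close to Gensim's default behaviour, but treat that as an assumption and check the `TfidfModel` documentation. The toy corpus and the names `toy_corpus` and `toy_tfidf` are hypothetical.

```python
import math

# Toy bag-of-words corpus: each document is a list of (token_id, count) pairs
toy_corpus = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 2)],
    [(0, 1), (1, 1), (3, 1)],
]

n_docs = len(toy_corpus)

# Document frequency: in how many documents each token id appears
doc_freq = {}
for doc in toy_corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def toy_tfidf(doc):
    """Weight each count by log2(N / df), then scale the vector to unit length."""
    weights = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in doc]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return weights
    return [(tid, w / norm) for tid, w in weights]

toy_vectors = [toy_tfidf(doc) for doc in toy_corpus]
for vec in toy_vectors:
    print([(tid, round(w, 3)) for tid, w in vec])
```

Token 0 appears in every toy document, so its weight is zero everywhere: a word that occurs in all headlines carries no information for telling them apart.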


Finally, compute the **LSA** (also called LSI) using Gensim, for a number of topics of your choice:

In [18]:
# TODO: Compute LSA
from gensim.models import LsiModel

# Fit LSA with 4 topics on the raw bag-of-words counts
lsi = LsiModel(corpus=corpus, num_topics=4, id2word=dictionary)


For each topic, show the most significant words.

In [20]:
# TODO: Print the 3 or 4 most significant words of each topic
lsi.print_topics(num_words=3)

[(0, '-0.751*"polic" + -0.404*"man" + -0.208*"charg"'),
 (1, '0.670*"man" + -0.574*"polic" + 0.326*"charg"'),
 (2, '0.655*"new" + 0.296*"plan" + 0.241*"say"'),
 (3, '-0.702*"new" + 0.345*"say" + 0.335*"plan"')]

What do you think about those results?

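The topics come in pairs built from the same words: topics 0 and 1 both load on "polic", "man" and "charg", while topics 2 and 3 both load on "new", "plan" and "say". Within each pair the words carry different weights and signs, so the four topics are hard to interpret as four distinct themes.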
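The overlap between LSA topics can be checked programmatically. The sketch below uses only the standard library to parse the topic strings printed above and list the words each pair of topics shares. The helper name `parse_topic` is hypothetical.

```python
import re

# Topic strings copied from the lsi.print_topics(num_words=3) output above
lsi_topics = [
    (0, '-0.751*"polic" + -0.404*"man" + -0.208*"charg"'),
    (1, '0.670*"man" + -0.574*"polic" + 0.326*"charg"'),
    (2, '0.655*"new" + 0.296*"plan" + 0.241*"say"'),
    (3, '-0.702*"new" + 0.345*"say" + 0.335*"plan"'),
]

def parse_topic(text):
    """Map each word in a printed topic string to its signed weight."""
    return {word: float(weight)
            for weight, word in re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', text)}

parsed = {topic_id: parse_topic(text) for topic_id, text in lsi_topics}

# Compare every pair of topics by the words they share
for a in parsed:
    for b in parsed:
        if a < b:
            shared = sorted(set(parsed[a]) & set(parsed[b]))
            print(f"topics {a} and {b} share {shared}")
```

Only the pairs (0, 1) and (2, 3) share words, and they share all three, which suggests that with `num_words=3` the four topics describe just two groups of words.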



Now let's try LDA instead of LSA, again using Gensim:

In [25]:
# TODO: Compute LDA
from gensim.models import LdaModel

lda = LdaModel(corpus=corpus, num_topics=4, id2word=dictionary, random_state=0, chunksize=512, passes=5)

In [26]:
# TODO: print the most frequent words of each topic
lda.print_topics(num_words=3)

[(0, '0.016*"report" + 0.009*"back" + 0.009*"may"'),
 (1, '0.012*"mine" + 0.011*"polic" + 0.009*"elect"'),
 (2, '0.013*"question" + 0.010*"council" + 0.010*"fund"'),
 (3, '0.012*"sydney" + 0.012*"charg" + 0.011*"australian"')]

Now, how does it work with LDA?

Let's make some visualization of the LDA results using pyLDAvis.

In [27]:
!pip install pyldavis --user


In [34]:
# TODO: show visualization results of the LDA
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda, corpus, dictionary)
vis

Depending on your results, you can try to fine-tune the algorithm (number of topics, hyperparameters, and so on) and compare your results with those of others.
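When comparing runs, a simple number can help alongside visual inspection. The sketch below computes topic diversity, the fraction of distinct words among the top words of all topics. This metric is not part of the original exercise and is offered only as one possible yardstick; the function name `topic_diversity` is hypothetical, and the word lists are copied from the two `print_topics` outputs above.

```python
def topic_diversity(topics):
    """Fraction of distinct words among the top words of all topics (1.0 = no overlap)."""
    all_words = [word for words in topics for word in words]
    return len(set(all_words)) / len(all_words)

# Top-3 words per topic from the LSA output
lsa_top_words = [
    ["polic", "man", "charg"],
    ["man", "polic", "charg"],
    ["new", "plan", "say"],
    ["new", "say", "plan"],
]

# Top-3 words per topic from the LDA output
lda_top_words = [
    ["report", "back", "may"],
    ["mine", "polic", "elect"],
    ["question", "council", "fund"],
    ["sydney", "charg", "australian"],
]

print("LSA diversity:", topic_diversity(lsa_top_words))
print("LDA diversity:", topic_diversity(lda_top_words))
```

A higher value means less redundancy between topics, but it says nothing about whether each topic is meaningful, so it should be read together with the word lists and the pyLDAvis plot.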