The news articles dataset was obtained from Kaggle:
https://www.kaggle.com/pariza/bbc-news-summary

In [37]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import spacy
from spacy import displacy
from spacy.lang.en.stop_words import STOP_WORDS
import os
import gensim

from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel
from gensim.corpora import Dictionary
import pyLDAvis.gensim

In [3]:
nlp = spacy.load('en_core_web_lg')

At the time of this project (25/12/2018), spaCy's English language model (en_core_web_lg) has a problem recognizing stop words, as the following example shows.

In [4]:
doc = nlp('This is a sentence. And the cat jumped over the dog. The cat returned as the prisoner of Azkaban.')
for token in doc:
    print(token.text, token.is_stop)

This False
is False
a False
sentence False
. False
And False
the False
cat False
jumped False
over False
the False
dog False
. False
The False
cat False
returned False
as False
the False
prisoner False
of False
Azkaban False
. False


As the output above shows, the model does not recognize This, is, a, the, etc. as stop words. We address this by explicitly setting the is_stop flag for every stop word in its lower-case, capitalized and upper-case forms.

In [7]:
for stop_word in STOP_WORDS:
    for word in (stop_word, stop_word.capitalize(), stop_word.upper()):
        lex = nlp.vocab[word]
        lex.is_stop = True
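
The loop above flags three spellings of each stop word because a vocabulary lookup is by exact string, so "the", "The" and "THE" are separate entries. The set of strings it touches can be sketched without spaCy. `SAMPLE_STOP_WORDS` and `case_variants` are illustrative names assumed for this sketch; they are not part of spaCy.

```python
# Illustrative only: a tiny stand-in for spacy.lang.en.stop_words.STOP_WORDS.
SAMPLE_STOP_WORDS = {"the", "is", "a"}


def case_variants(word):
    """Return the word as given, plus its capitalized and upper-case forms."""
    return {word, word.capitalize(), word.upper()}


all_variants = set()
for stop_word in SAMPLE_STOP_WORDS:
    all_variants |= case_variants(stop_word)

print(sorted(all_variants))
```

Mixed-case spellings such as "tHe" are not covered by these three variants, which is usually acceptable for news text.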

Let us rerun the same example to check whether the model now recognizes stop words.

In [8]:
doc = nlp('This is a sentence. And the cat jumped over the dog. The cat returned as the prisoner of Azkaban.')
for token in doc:
    print(token.text, token.is_stop)

This True
is True
a True
sentence False
. False
And True
the True
cat False
jumped False
over True
the True
dog False
. False
The True
cat False
returned False
as True
the True
prisoner False
of True
Azkaban False
. False


We are going to analyze the news articles. Let us look at a sample news article.

In [9]:
with open('BBC News Summary/News Articles/politics/001.txt', 'r') as news_file:
    newsArticle = news_file.read()
    print(newsArticle)

Labour plans maternity pay rise

Maternity pay for new mothers is to rise by Â£1,400 as part of new proposals announced by the Trade and Industry Secretary Patricia Hewitt.

It would mean paid leave would be increased to nine months by 2007, Ms Hewitt told GMTV's Sunday programme. Other plans include letting maternity pay be given to fathers and extending rights to parents of older children. The Tories dismissed the maternity pay plan as "desperate", while the Liberal Democrats said it was misdirected.

Ms Hewitt said: "We have already doubled the length of maternity pay, it was 13 weeks when we were elected, we have already taken it up to 26 weeks. "We are going to extend the pay to nine months by 2007 and the aim is to get it right up to the full 12 months by the end of the next Parliament." She said new mothers were already entitled to 12 months leave, but that many women could not take it as only six of those months were paid. "We have made a firm commitment. We will definitely ext

We start with named entity recognition (NER) on the document. NER helps identify geographical locations, names of persons, organizations, etc. in a text. spaCy's list of named entity types can be found here: https://spacy.io/api/annotation#named-entities

In [11]:
newsDoc = nlp(newsArticle)

In [13]:
for ent in newsDoc.ents:
    print(ent.text, ent.label_)



Maternity pay PERCENT
Â£1,400 ORG
the Trade and Industry ORG
Patricia Hewitt PERSON


 PERSON
nine months by 2007 DATE
Ms Hewitt PERSON
GMTV ORG
Sunday DATE
Tories NORP
Liberal NORP
Democrats NORP


 CARDINAL
Ms Hewitt PERSON
13 weeks DATE
up to 26 weeks DATE
nine months by 2007 DATE
the full 12 months DATE
the end of the next Parliament DATE
12 months DATE
months DATE
the six months DATE
is to nine months DATE
Â£1,400 PERSON
State ORG
Family PRODUCT
Theresa ORG
May DATE
Gordon Brown PERSON
December DATE
Tony Blair PERSON


 PERSON
Conservatives ORG
Democrat NORP
Sandra Gidley PERSON
the Liberal Democrats ORG
the first six months DATE


 CARDINAL
Ms Hewitt PERSON
David Frost PERSON
the British Chambers of Commerce ORG
Monday DATE
90% PERCENT
the first six weeks DATE
Â£102.80 a week DATE
six months old DATE


We can visualize the entities in the document using displaCy.

In [14]:
displacy.render(newsDoc, style='ent', jupyter=True)

spaCy recognizes most of the entities correctly, but as the output above shows, it gets a few wrong: for example, "Maternity pay" is tagged PERCENT, "Theresa" is tagged ORG and "May" is tagged DATE.
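
One quick way to review the NER output is to group the entities by label, which makes mislabelled items easier to spot. The sketch below uses a few (text, label) pairs copied from the printed output above; with the real document they could be built as `[(ent.text, ent.label_) for ent in newsDoc.ents]`. The name `sample_entities` is assumed for illustration.

```python
from collections import defaultdict

# A few (text, label) pairs copied from the printed NER output above.
sample_entities = [
    ("Maternity pay", "PERCENT"),
    ("Patricia Hewitt", "PERSON"),
    ("GMTV", "ORG"),
    ("Theresa", "ORG"),
    ("May", "DATE"),
    ("Gordon Brown", "PERSON"),
    ("90%", "PERCENT"),
]

# Collect entity texts under their label, preserving document order.
by_label = defaultdict(list)
for text, label in sample_entities:
    by_label[label].append(text)

for label in sorted(by_label):
    print(label, by_label[label])
```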

The news articles are grouped into 5 categories: sport, tech, politics, entertainment and business. Let's print a sample article from each category.

In [16]:
news_categories = ['sport', 'tech', 'politics', 'entertainment', 'business']
base_directory = 'BBC News Summary/News Articles/'

In [21]:
for category in news_categories:
    with open(base_directory + category + '/001.txt', 'r') as news_file:
        newsArticle = news_file.read()
        # Print only the first 400 characters of the article
        print('Category', category.capitalize())
        print('\n')
        if len(newsArticle) > 400:
            print(newsArticle[0:400] + '...')
        else:
            print(newsArticle)
        print('\n')

Category Sport


Claxton hunting first major medal

British hurdler Sarah Claxton is confident she can win her first major medal at next month's European Indoor Championships in Madrid.

The 25-year-old has already smashed the British record over 60m hurdles twice this season, setting a new mark of 7.96 seconds to win the AAAs title. "I am quite confident," said Claxton. "But I take each race as it comes. "As long...


Category Tech


Ink helps drive democracy in Asia

The Kyrgyz Republic, a small, mountainous state of the former Soviet republic, is using invisible ink and ultraviolet readers in the country's elections as part of a drive to prevent multiple voting.

This new technology is causing both worries and guarded optimism among different sectors of the population. In an effort to live up to its reputation in the 1990s a...


Category Politics


Labour plans maternity pay rise

Maternity pay for new mothers is to rise by Â£1,400 as part of new proposals announced by the Trade an

Let us count the number of news articles in each category.

In [24]:
for category in news_categories:
    num_articles = len([file for file in os.listdir(base_directory + category) if file.endswith(".txt")])
    print('Category:', category.capitalize(), ',' ,'Number of Articles:', num_articles)

Category: Sport , Number of Articles: 511
Category: Tech , Number of Articles: 401
Category: Politics , Number of Articles: 417
Category: Entertainment , Number of Articles: 386
Category: Business , Number of Articles: 510


We will now perform topic modelling on the news articles using the gensim package. We will look at the following algorithms:
1. Hierarchical Dirichlet Process (HDP)
2. Latent Dirichlet Allocation (LDA)
3. Latent Semantic Indexing (LSI)

In news articles, some words (such as "say", "said" and "mr") occur very frequently without carrying topical meaning, so we add them to spaCy's stop words.

In [25]:
custom_stop_words = ['say', 'says', 'said', 'saying', '\'s', 'mr', 'ms', 'people']
for stop_word in custom_stop_words:
    for word in (stop_word, stop_word.capitalize()):
        lex = nlp.vocab[word]
        lex.is_stop = True

We will use 250 articles from each news category for topic modelling.

In [27]:
news_corpus = []

for category in news_categories:
    
    en_directory = os.fsencode(base_directory + category)
    count_articles = 0

    for file in os.listdir(en_directory):

        fileName = os.fsdecode(file)

        if fileName.endswith(".txt"):
            
            count_articles += 1
            
            if count_articles > 250:
                break
                
            cur_article = []

            with open(base_directory + category + '/' + fileName, 'r') as news_file:

                newsArticle = news_file.read()
                newsArticle = newsArticle.replace('\n', ' ')
                doc = nlp(newsArticle)

                for token in doc:

                    # Skip pronouns, whitespace, stop words, punctuation,
                    # numbers, e-mail addresses and URLs
                    if (token.lemma_ != '-PRON-'
                            and not token.is_space
                            and not token.is_stop
                            and not token.is_punct
                            and not token.like_num
                            and not token.like_email
                            and not token.like_url):

                        # For topic modelling, the lemmatized form of the word gives better results
                        cur_article.append(token.lemma_)

            news_corpus.append(cur_article)
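
Note that `os.listdir` returns names in arbitrary order, so which 250 articles are picked per category can differ between systems. If a reproducible selection is wanted, one option is to sort the file names first. The following is a minimal sketch under that assumption; `first_n_articles` is a hypothetical helper, and the demonstration uses a temporary directory rather than the real dataset.

```python
import os
import tempfile


def first_n_articles(directory, n=250):
    """Return the first n .txt file names in sorted order (hypothetical helper)."""
    names = sorted(f for f in os.listdir(directory) if f.endswith(".txt"))
    return names[:n]


# Small demonstration on a temporary directory instead of the real dataset.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("003.txt", "001.txt", "002.txt", "notes.md"):
        open(os.path.join(tmp_dir, name), "w").close()
    selected = first_n_articles(tmp_dir, n=2)

print(selected)
```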
       

In [28]:
print(len(news_corpus))

1250


In [30]:
for i in (2, 252, 502, 752, 1002):
    print(news_corpus[i])
    print('\n')

['greene', 'set', 'sight', 'world', 'title', 'maurice', 'greene', 'aim', 'wipe', 'pain', 'lose', 'olympic', 'm', 'title', 'athens', 'win', 'fourth', 'world', 'championship', 'crown', 'summer', 'settle', 'bronze', 'greece', 'fellow', 'american', 'justin', 'gatlin', 'francis', 'obikwelu', 'portugal', 'hurt', 'look', 'medal', 'mistake', 'lose', 'thing', 'greene', 'race', 'birmingham', 'friday', 'go', 'happen', 'goal', 'be', 'go', 'win', 'world', 'greene', 'cross', 'line', 'second', 'gatlin', 'win', 'second', 'close', 'fast', 'sprint', 'time', 'greene', 'believe', 'lose', 'race', 'title', 'semi', 'final', 'semi', 'final', 'race', 'win', 'race', 'conserve', 'energy', 'francis', 'obikwelu', 'come', 'take', 'not', 'know', 'believe', 'lane', 'final', 'lane', 'not', 'feel', 'race', 'feel', 'like', 'run', 'believe', 'middle', 'race', 'able', 'react', 'come', 'ahead', 'greene', 'deny', 'olympic', 'gold', '4x100', 'm', 'man', 'relay', 'catch', 'britain', 'mark', 'lewis', 'francis', 'final', 'leg',

In [31]:
bigram = gensim.models.Phrases(news_corpus)

In [32]:
news_corpus = [bigram[cur_article] for cur_article in news_corpus]

In [33]:
print(news_corpus[252])



In [34]:
dictionary = Dictionary(news_corpus)
print(dictionary)

Dictionary(19248 unique tokens: ['25-year_old', 'aaas_title', 'athlete', 'attention', 'bear']...)
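
Tokens such as 'aaas_title' in the dictionary above come from the `Phrases` step, which joins word pairs that frequently occur together. gensim scores candidate pairs with a statistic controlled by its `min_count` and `threshold` parameters; the sketch below replaces that with a plain count cut-off, so it is a simplified illustration of the idea and not gensim's algorithm. `merge_frequent_bigrams` and the toy documents are made up for this example.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least min_count times across docs."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            pair = tuple(doc[i:i + 2])
            if len(pair) == 2 and pair_counts[pair] >= min_count:
                merged.append("_".join(pair))
                i += 2  # skip both tokens of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


toy_docs = [
    ["win", "aaas", "title", "race"],
    ["aaas", "title", "record"],
]
merged_toy_docs = merge_frequent_bigrams(toy_docs)
print(merged_toy_docs)
```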


In [35]:
corpus_for_modelling = [dictionary.doc2bow(cur_article) for cur_article in news_corpus]

In [36]:
print(corpus_for_modelling[2])

[(5, 1), (12, 5), (14, 1), (22, 1), (27, 1), (42, 1), (53, 6), (59, 1), (60, 3), (61, 2), (69, 1), (70, 2), (71, 4), (73, 1), (77, 4), (78, 5), (90, 1), (101, 1), (124, 1), (135, 1), (136, 1), (137, 1), (138, 1), (139, 1), (140, 1), (141, 1), (142, 1), (143, 1), (144, 1), (145, 1), (146, 4), (147, 1), (148, 1), (149, 1), (150, 1), (151, 1), (152, 1), (153, 1), (154, 1), (155, 1), (156, 2), (157, 1), (158, 1), (159, 1), (160, 1), (161, 1), (162, 1), (163, 1), (164, 1), (165, 2), (166, 1), (167, 1), (168, 3), (169, 2), (170, 3), (171, 1), (172, 1), (173, 1), (174, 8), (175, 1), (176, 1), (177, 1), (178, 1), (179, 1), (180, 1), (181, 1), (182, 1), (183, 1), (184, 1), (185, 2), (186, 1), (187, 1), (188, 1), (189, 3), (190, 1), (191, 1), (192, 1), (193, 1), (194, 1), (195, 1), (196, 1), (197, 1), (198, 1), (199, 1), (200, 2), (201, 1), (202, 1), (203, 1), (204, 1), (205, 1), (206, 1), (207, 1), (208, 1), (209, 1), (210, 1), (211, 1), (212, 1), (213, 2), (214, 1), (215, 1), (216, 1), (217, 1
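
Each pair in the output above is (token id, count): the dictionary maps every unique token to an integer id, and `doc2bow` counts how often each id occurs in one article. A minimal pure-Python sketch of that representation follows. The ids here are assigned by first appearance purely for illustration, so they will not match the ids gensim assigns; `build_token_ids` and `to_bag_of_words` are hypothetical names.

```python
from collections import Counter


def build_token_ids(docs):
    """Assign an integer id to each unique token, in order of first appearance."""
    token_ids = {}
    for doc in docs:
        for token in doc:
            token_ids.setdefault(token, len(token_ids))
    return token_ids


def to_bag_of_words(doc, token_ids):
    """Return sorted (token_id, count) pairs for tokens known to token_ids."""
    counts = Counter(token_ids[t] for t in doc if t in token_ids)
    return sorted(counts.items())


toy_docs = [["greene", "win", "race", "race"], ["film", "win"]]
token_ids = build_token_ids(toy_docs)
print(to_bag_of_words(toy_docs[0], token_ids))
```

Word order is discarded in this representation; only the counts per token are kept.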

We start with the Hierarchical Dirichlet Process (HDP) topic modelling algorithm. With HDP, we do not need to specify the number of topics; the algorithm infers a suitable number of topics for a given text corpus.

In [38]:
hdpModel = HdpModel(corpus=corpus_for_modelling, id2word=dictionary)

In [39]:
hdpModel.show_topics()

[(0,
  '0.006*year + 0.003*good + 0.003*new + 0.003*government + 0.002*time + 0.002*company + 0.002*plan + 0.002*work + 0.002*include + 0.002*firm + 0.002*sale + 0.002*uk + 0.002*film + 0.002*country + 0.002*rise + 0.002*report + 0.002*go + 0.002*m + 0.002*market + 0.002*need'),
 (1,
  '0.007*year + 0.005*new + 0.003*rise + 0.003*sale + 0.003*company + 0.003*film + 0.003*time + 0.002*uk + 0.002*good + 0.002*government + 0.002*number + 0.002*$ + 0.002*add + 0.002*m + 0.002*work + 0.002*include + 0.002*price + 0.002*game + 0.002*firm + 0.002*go'),
 (2,
  '0.004*year + 0.003*new + 0.002*film + 0.002*government + 0.002*m + 0.002*time + 0.002*report + 0.002*good + 0.002*come + 0.002*go + 0.002*deal + 0.002*company + 0.002*uk + 0.002*number + 0.002*world + 0.002*not + 0.002*need + 0.002*country + 0.001*think + 0.001*set'),
 (3,
  '0.002*year + 0.002*government + 0.002*not + 0.002*good + 0.002*new + 0.002*uk + 0.002*rise + 0.002*film + 0.001*m + 0.001*see + 0.001*site + 0.001*company + 0.001*

In [40]:
hdp_coherence_model = CoherenceModel(model=hdpModel, texts=news_corpus, dictionary=dictionary, coherence='c_v')
hdp_coherence = hdp_coherence_model.get_coherence()
print(hdp_coherence)

0.5550850312761848


Next, we proceed with the Latent Dirichlet Allocation (LDA) model. Here we need to pass the number of topics, so we try several values and use the coherence score to choose the best one.

In [41]:
list_num_of_topics = [10, 15, 20, 25, 30]
for num_topics in list_num_of_topics:
    ldaModel = LdaModel(corpus=corpus_for_modelling, num_topics=num_topics, id2word=dictionary)
    coherenceModel = CoherenceModel(model=ldaModel, texts=news_corpus, dictionary=dictionary, coherence='c_v')
    print('Number of Topics', num_topics)
    print('Coherence Value', coherenceModel.get_coherence())

Number of Topics 10
Coherence Value 0.23320329177130109
Number of Topics 15
Coherence Value 0.23726905296224687
Number of Topics 20
Coherence Value 0.24440007734201422
Number of Topics 25
Coherence Value 0.24959979968409282
Number of Topics 30
Coherence Value 0.25236095902868494
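
Picking the best value from such a sweep can be done programmatically. The sketch below assumes the scores were collected in a dictionary (here named `lda_coherence`, with values copied from the output above) instead of only being printed.

```python
# Coherence values copied from the LDA output above, keyed by number of topics.
lda_coherence = {
    10: 0.23320329177130109,
    15: 0.23726905296224687,
    20: 0.24440007734201422,
    25: 0.24959979968409282,
    30: 0.25236095902868494,
}

# The key with the highest coherence value is the preferred number of topics.
best_num_topics = max(lda_coherence, key=lda_coherence.get)
print(best_num_topics, lda_coherence[best_num_topics])
```

Because the score is still rising at 30, the largest value tried, a wider range might yield a higher coherence. LDA training is also randomized, so the exact values can change between runs.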


In [42]:
ldaModel_best = LdaModel(corpus=corpus_for_modelling, num_topics=30, id2word=dictionary)
ldaModel_best.show_topics(30)

[(0,
  '0.006*"year" + 0.005*"game" + 0.005*"new" + 0.003*"play" + 0.003*"time" + 0.003*"come" + 0.003*"go" + 0.003*"uk" + 0.003*"not" + 0.003*"add"'),
 (1,
  '0.004*"time" + 0.004*"not" + 0.004*"come" + 0.003*"year" + 0.003*"add" + 0.003*"work" + 0.003*"like" + 0.003*"company" + 0.003*"new" + 0.003*"think"'),
 (2,
  '0.006*"new" + 0.005*"year" + 0.004*"game" + 0.003*"good" + 0.003*"add" + 0.003*"star" + 0.003*"time" + 0.003*"film" + 0.003*"work" + 0.003*"court"'),
 (3,
  '0.006*"film" + 0.006*"sale" + 0.004*"new" + 0.004*"m" + 0.004*"number" + 0.004*"year" + 0.004*"uk" + 0.003*"come" + 0.003*"sell" + 0.003*"single"'),
 (4,
  '0.005*"year" + 0.005*"plan" + 0.004*"new" + 0.004*"time" + 0.004*"phone" + 0.004*"go" + 0.004*"good" + 0.003*"government" + 0.003*"police" + 0.003*"film"'),
 (5,
  '0.005*"win" + 0.004*"year" + 0.004*"good" + 0.003*"time" + 0.003*"play" + 0.003*"game" + 0.003*"want" + 0.003*"call" + 0.003*"come" + 0.003*"new"'),
 (6,
  '0.007*"year" + 0.006*"new" + 0.004*"governm

In [43]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(ldaModel_best, corpus_for_modelling, dictionary)

Lastly, we proceed with the Latent Semantic Indexing (LSI) model. Here too we need to pass the number of topics, so we try several values and use the coherence score to choose the best one.

In [44]:
for num_topics in list_num_of_topics:
    lsiModel = LsiModel(corpus=corpus_for_modelling, num_topics=num_topics, id2word=dictionary)
    coherenceModel = CoherenceModel(model=lsiModel, texts=news_corpus, dictionary=dictionary, coherence='c_v')
    print('Number of Topics', num_topics)
    print('Coherence Value', coherenceModel.get_coherence())

Number of Topics 10
Coherence Value 0.3682659944470568
Number of Topics 15
Coherence Value 0.39737167113930777
Number of Topics 20
Coherence Value 0.330601486812326
Number of Topics 25
Coherence Value 0.3089096223625313
Number of Topics 30
Coherence Value 0.338123033627
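
With all three models scored using the same c_v measure, their best results can be put side by side. The values below are copied from the outputs in this notebook; the dictionary name `best_coherence` is assumed for illustration.

```python
# Best c_v coherence per model, copied from the outputs in this notebook.
best_coherence = {
    "HDP": 0.5550850312761848,
    "LDA (30 topics)": 0.25236095902868494,
    "LSI (15 topics)": 0.39737167113930777,
}

# Rank the models from highest to lowest coherence.
ranked = sorted(best_coherence.items(), key=lambda item: item[1], reverse=True)
for model_name, score in ranked:
    print(f"{model_name}: {score:.3f}")
```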


In [45]:
lsiModel = LsiModel(corpus=corpus_for_modelling, num_topics=15, id2word=dictionary)
lsiModel.show_topics(15)

[(0,
  '0.253*"year" + 0.195*"new" + 0.160*"good" + 0.159*"game" + 0.155*"time" + 0.118*"government" + 0.116*"uk" + 0.113*"work" + 0.110*"come" + 0.106*"not"'),
 (1,
  '0.562*"game" + 0.204*"play" + 0.194*"win" + -0.186*"government" + 0.147*"player" + 0.142*"good" + 0.123*"time" + 0.110*"title" + -0.106*"plan" + -0.099*"company"'),
 (2,
  '-0.244*"game" + -0.223*"technology" + 0.202*"m" + 0.162*"government" + -0.162*"user" + -0.150*"use" + 0.149*"win" + -0.148*"service" + -0.147*"mobile" + -0.137*"phone"'),
 (3,
  '0.374*"year" + 0.254*"m" + 0.249*"film" + 0.180*"sale" + -0.169*"government" + 0.136*"rise" + -0.129*"not" + 0.121*"award" + -0.116*"liverpool" + -0.103*"think"'),
 (4,
  '0.477*"game" + -0.262*"film" + -0.147*"m" + 0.145*"sale" + 0.144*"rise" + 0.136*"government" + -0.122*"win" + 0.121*"market" + -0.120*"good" + -0.116*"site"'),
 (5,
  '0.299*"film" + -0.234*"liverpool" + 0.230*"game" + 0.203*"government" + -0.200*"club" + -0.184*"parry" + -0.149*"deal" + -0.148*"gerrard" +