# Topic Modelling for News

This exercise models the main topics in a dataset of news headlines.

Begin by importing the needed libraries:



In [2]:
import nltk
import numpy as np
import pandas as pd

In [3]:
df = pd.read_csv('random_headlines.csv')
print(df.shape)
df.head()

(20000, 2)


   publish_date                          headline_text
0      20120305  ute driver hurt in intersection crash
1      20081128           6yo dies in cycling accident
2      20090325          bumper olive harvest expected
3      20100201     replica replaces northernmost sign
4      20080225           woods targets perfect season


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB


In [5]:
# TODO: Preprocess the input data

# Tokenize each headline
df['tokens'] = df['headline_text'].apply(nltk.word_tokenize)

# Keep purely alphabetic tokens (drops punctuation and tokens with digits such as "6yo")
df['alphanumeric'] = df['tokens'].apply(lambda row: [
    word for word in row if word.isalpha()
])

# Remove stopwords (a set makes the membership test fast)
stop = set(nltk.corpus.stopwords.words('english'))
df['stop'] = df['alphanumeric'].apply(lambda row: [
    word for word in row if word not in stop
])

# Stem with the Porter stemmer
stemmer = nltk.PorterStemmer()
df['stemmed'] = df['stop'].apply(lambda row: [
    stemmer.stem(word) for word in row
])
df['stemmed'].head()

0    [ute, driver, hurt, intersect, crash]
1                       [die, cycl, accid]
2          [bumper, oliv, harvest, expect]
3    [replica, replac, northernmost, sign]
4          [wood, target, perfect, season]
Name: stemmed, dtype: object

Now use Gensim to compute a bag-of-words (BOW) representation.


In [6]:
from gensim.corpora import Dictionary

dictionary = Dictionary(df['stemmed'])
corpus = [dictionary.doc2bow(line) for line in df['stemmed']]
# Each document is a variable-length list of (token_id, count) pairs, so the
# corpus is ragged and np.shape(corpus) raises a ValueError; use len() instead.
print(len(corpus))
corpus[0:2]
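
Each bag-of-words document is a list of (token_id, count) pairs whose length varies from headline to headline, which is why the corpus has no rectangular NumPy shape. The sketch below rebuilds that structure with the standard library only, using the first two stemmed headlines. The `token2id` mapping and `to_bow` helper are hypothetical stand-ins for illustration; Gensim's `Dictionary` may assign different ids.

```python
from collections import Counter

# First two stemmed headlines from the output above.
docs = [
    ["ute", "driver", "hurt", "intersect", "crash"],
    ["die", "cycl", "accid"],
]

# Hypothetical id assignment: one integer per distinct token, in order of
# first appearance (Gensim's Dictionary may number tokens differently).
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


bow_corpus = [to_bow(doc) for doc in docs]

print(len(bow_corpus))               # number of documents
print([len(b) for b in bow_corpus])  # pairs per document differ, so the list is ragged
```

Because the inner lists have different lengths, `len(bow_corpus)` is the natural way to report the corpus size.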

Compute the TF-IDF using Gensim

In [None]:
from gensim.models import TfidfModel

tfidf_model = TfidfModel(corpus)
tf_idf = tfidf_model[corpus]
# tf_idf is a lazily transformed corpus, not an array; report its length instead of its shape
print(len(tf_idf))
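
To make the weighting concrete, here is a minimal standard-library sketch of TF-IDF on three toy documents. It assumes raw term frequency multiplied by log2(N / document frequency), followed by L2 normalisation, which is my understanding of Gensim's defaults rather than something shown in this notebook. The toy documents and the `tfidf` helper are hypothetical.

```python
import math
from collections import Counter

# Toy stemmed documents (hypothetical, not taken from the dataset).
docs = [
    ["polic", "man", "charg"],
    ["polic", "plan"],
    ["new", "plan"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Return {term: weight} with tf * log2(N / df), L2-normalised."""
    counts = Counter(doc)
    weights = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    # A document made only of terms present everywhere has zero norm.
    return {t: w / norm for t, w in weights.items()} if norm else weights


vectors = [tfidf(doc) for doc in docs]
for vec in vectors:
    print({t: round(w, 3) for t, w in vec.items()})
```

Terms that occur in fewer documents ("man", "new") end up with larger weights than terms shared across documents ("polic", "plan").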

Finally, compute the LSA (also called LSI) using Gensim, for a number of topics of your choice.

In [7]:
# TODO: Compute LSA
from gensim.models import LsiModel

lsi = LsiModel(corpus=corpus, num_topics=4, id2word=dictionary)

For each topic, show the most significant words.

In [8]:
# TODO: Print the 3 or 4 most significant words of each topic
lsi.print_topics(num_words=3)

[(0, '0.752*"polic" + 0.404*"man" + 0.208*"charg"'),
 (1, '-0.668*"man" + 0.575*"polic" + -0.327*"charg"'),
 (2, '0.654*"new" + 0.295*"plan" + -0.242*"man"'),
 (3, '0.703*"new" + -0.343*"say" + -0.339*"plan"')]

What do you think about those results?

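The first two topics are built from the same stems ("polic", "man", "charg") with different weights and signs, so they appear to describe the same theme: crime and police reports.

The last two topics are dominated by "new" and "plan", which suggests announcements of new plans that may be implemented, possibly politically related.

It is understandable that these two themes come out on top for news headlines.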

Now let's try to use LDA instead of LSA using Gensim

In [9]:
# TODO: Compute LDA
from gensim.models import LdaModel

lda = LdaModel(corpus=corpus, num_topics=4, id2word=dictionary, random_state=0, chunksize=512, passes=5)

In [10]:
# TODO: print the most frequent words of each topic
lda.print_topics(num_words=3)

[(0, '0.016*"report" + 0.009*"back" + 0.009*"may"'),
 (1, '0.012*"mine" + 0.011*"polic" + 0.009*"elect"'),
 (2, '0.013*"question" + 0.010*"council" + 0.010*"fund"'),
 (3, '0.012*"sydney" + 0.012*"charg" + 0.011*"australian"')]

Now, how does it work with LDA?

Let's make some visualization of the LDA results using pyLDAvis.

In [11]:
# TODO: show visualization results of the LDA
import pyLDAvis
import pyLDAvis.gensim  # in newer pyLDAvis releases this module is named pyLDAvis.gensim_models

pyLDAvis.enable_notebook()

vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
vis

Depending on your results, you can try to fine-tune the algorithm (number of topics, hyperparameters, and so on) and compare your results with others.

# 2 - Challenge - GDPR Compliant

The ada_lovelace.txt file, located in the same folder, contains some information about Ada Lovelace. The problem is that this file is full of identifying information about people and, as such, is really not GDPR-compliant 😱 (info: the General Data Protection Regulation is a regulation in EU law on data protection and privacy).

Guidelines

The objective of this exercise is to write a function that cleans up a file by replacing all mentions of people's names with "[REDACTED]", in order to comply with European law.

In [12]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_md

Collecting pip
  Downloading pip-23.1.2-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.1
    Uninstalling pip-23.1.1:
      Successfully uninstalled pip-23.1.1
Successfully installed pip-23.1.2
Collecting en-core-web-md==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0-py3-none-any.whl (42.8 MB)

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [13]:
import spacy 
nlp = spacy.load('en_core_web_md')

In [14]:
# TODO : load file and have a look at it
with open('ada lovelace.txt', "rt", encoding="utf-8") as f:  # the text contains non-ASCII characters
    text = f.read()
    
print(text)

Augusta Ada King, Countess of Lovelace (née Byron; 10 December 1815 – 27 November 1852) was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognise that the machine had applications beyond pure calculation, and published the first algorithm intended to be carried out by such a machine. As a result, she is sometimes regarded as the first to recognise the full potential of a "computing machine" and one of the first computer programmers. 

Lovelace became close friends with her tutor Mary Somerville, who introduced her to Charles Babbage in 1833. She had a strong respect and affection for Somerville, and they corresponded for many years. Other acquaintances included the scientists Andrew Crosse, Sir David Brewster, Charles Wheatstone, Michael Faraday and the author Charles Dickens.


Q1. Using the spaCy NER tools, identify the entities in this document and their corresponding labels.

In [15]:
# TODO : Named Entities Recognition

doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)

Augusta Ada King PERSON
Lovelace PERSON
née Byron PERSON
10 December 1815 DATE
27 CARDINAL
November 1852 DATE
English NORP
Charles Babbage's PERSON
the Analytical Engine ORG
first ORDINAL
first ORDINAL
first ORDINAL
one CARDINAL
first ORDINAL
Lovelace PERSON
Mary Somerville PERSON
Charles Babbage PERSON
1833 DATE
Somerville GPE
many years DATE
Andrew Crosse PERSON
David Brewster PERSON
Charles Wheatstone PERSON
Michael Faraday PERSON
Charles Dickens PERSON


Q2. Display the identified entities in a more visual manner.

In [16]:
# TODO : NER visualization
from spacy import displacy

displacy.render(doc, style="ent")

Q3. Write a function replace_name_by_redacted that returns "[REDACTED]" for a token that belongs to a person's name and the token's original text otherwise.



In [17]:
# TODO : `replace_name_by_redacted`
def replace_name_by_redacted(token):
    if token.ent_type_ == "PERSON":
        return "[REDACTED]"
    else:
        return token.text

Q4. Write a function make_doc_GDPR_compliant that modifies the document in order to replace all occurrences of names by "[REDACTED]", and apply it to the file.

In [18]:
def make_doc_GDPR_compliant(doc):
    return " ".join([replace_name_by_redacted(token) for token in doc])

make_doc_GDPR_compliant(doc)

'[REDACTED] [REDACTED] [REDACTED] , Countess of [REDACTED] ( [REDACTED] [REDACTED] ; 10 December 1815 – 27 November 1852 ) was an English mathematician and writer , chiefly known for her work on [REDACTED] [REDACTED] [REDACTED] proposed mechanical general - purpose computer , the Analytical Engine . She was the first to recognise that the machine had applications beyond pure calculation , and published the first algorithm intended to be carried out by such a machine . As a result , she is sometimes regarded as the first to recognise the full potential of a " computing machine " and one of the first computer programmers . \n\n [REDACTED] became close friends with her tutor [REDACTED] [REDACTED] , who introduced her to [REDACTED] [REDACTED] in 1833 . She had a strong respect and affection for Somerville , and they corresponded for many years . Other acquaintances included the scientists [REDACTED] [REDACTED] , Sir [REDACTED] [REDACTED] , [REDACTED] [REDACTED] , [REDACTED] [REDACTED] and 
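
Joining tokens with a single space, as above, puts spaces around punctuation and emits one placeholder per token of a name. A possible refinement is sketched below: it collapses each run of consecutive PERSON tokens into a single "[REDACTED]" and keeps the original spacing. To stay runnable without a spaCy model, it uses hypothetical (text, label, trailing whitespace) tuples as stand-ins for spaCy tokens; with real tokens the corresponding attributes would be `token.text`, `token.ent_type_` and `token.whitespace_`.

```python
from itertools import groupby

# Hypothetical stand-ins for spaCy tokens: (text, entity label, trailing whitespace).
tokens = [
    ("Sir", "", " "), ("David", "PERSON", " "), ("Brewster", "PERSON", ""),
    (",", "", " "), ("Charles", "PERSON", " "), ("Wheatstone", "PERSON", " "),
    ("and", "", " "), ("Michael", "PERSON", " "), ("Faraday", "PERSON", ""),
    (".", "", ""),
]


def redact(tokens):
    """Replace each run of PERSON tokens by one placeholder, preserving spacing."""
    parts = []
    for is_person, group in groupby(tokens, key=lambda t: t[1] == "PERSON"):
        group = list(group)
        if is_person:
            # One placeholder per run; keep the whitespace after the last token.
            parts.append("[REDACTED]" + group[-1][2])
        else:
            parts.extend(text + ws for text, _, ws in group)
    return "".join(parts)


print(redact(tokens))
```

One caveat of this sketch: two different names that directly follow each other without any token in between would be merged into a single placeholder.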