## Further text preprocessing

### Objectives of this test

- Implement Latent Dirichlet Allocation (**LDA**) from **Gensim** package along with the Mallet’s implementation (via Gensim). 

- Implement **Mallet** that optimizes LDA. Mallet is known to run faster and gives better topic segregation.

- Also extract the volume and percentage contribution of each topic to get **an idea of how important a topic is**.

### Pre-requisites: Downloading NLTK Dutch stopwords, data handling tools, model preprocessing & plotting tools, and SpaCy model

In [2]:
# NLTK
import nltk
nltk.download('stopwords', 'dutch')

[nltk_data] Downloading package stopwords to dutch...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
# Python data analysis tools and python module for printing
import re
import numpy as np
import pandas as pd
from pprint import pprint

In [4]:
# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

In [5]:
# SpaCy for lemmatization
import spacy

In [8]:
# Plotting tools
!pip install pyLDAvis

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

import matplotlib.pyplot as plt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [9]:
# Ignoring warnings
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)

### Preparing stopwords

In [10]:
# NLTK Stop words
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = nltk.corpus.stopwords.words('dutch')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Tokenizing words and cleaning-up text

In [11]:
# Importing file handling library
import os

# Opening one sample Dutch legal text document
file = open('drive/MyDrive/numac=2019041722.txt','rt')
dutch_text = file.read()
file.close()

In [12]:
# Replace newline with a single space
# Putting all words in lowercase
text = dutch_text.replace('\n', '')
raw_text = text.lower()
print(raw_text)

verslag aan de koning sire, het ontwerp van koninklijk besluit dat wij de eer hebben aan uwe majesteit voor te leggen, beoogt de uitvoering van de artikelen 93 ter tot 93 quinquies van het wetboek van de belasting over de toegevoegde waarde (hierna "wbtw"), de artikelen 412 bis, 433 tot 435 van het wetboek van de inkomstenbelastingen 1992 (hierna "wib 92"), de artikelen 35 tot 37 en 43 tot 45 en 47 van het wetboek van de minnelijke en gedwongen invordering van fiscale en niet-fiscale schuldvorderingen (hierna "invorderingswetboek") en de artikelen 157 tot 159 en 161 van de programmawet (i) van 29 maart 2012 (hierna "programmawet", zoals gewijzigd door de wet van 11 februari 2019 houdende fiscale, fraude bestrijdende, financiële alsook diverse bepalingen en de wet van 23 april 2020 houdende wijzigingen van het wetboek van de belasting over de toegevoegde waarde, het wetboek van de inkomstenbelastingen 1992, het wetboek van de minnelijke en gedwongen invordering van fiscale en niet-fisca

In [13]:
# Splitting text document on basis of terms or words separated by spaces
# creating separate strings
content = raw_text.split(' ')
print(content)

['verslag', 'aan', 'de', 'koning', 'sire,', 'het', 'ontwerp', 'van', 'koninklijk', 'besluit', 'dat', 'wij', 'de', 'eer', 'hebben', 'aan', 'uwe', 'majesteit', 'voor', 'te', 'leggen,', 'beoogt', 'de', 'uitvoering', 'van', 'de', 'artikelen', '93', 'ter', 'tot', '93', 'quinquies', 'van', 'het', 'wetboek', 'van', 'de', 'belasting', 'over', 'de', 'toegevoegde', 'waarde', '(hierna', '"wbtw"),', 'de', 'artikelen', '412', 'bis,', '433', 'tot', '435', 'van', 'het', 'wetboek', 'van', 'de', 'inkomstenbelastingen', '1992', '(hierna', '"wib', '92"),', 'de', 'artikelen', '35', 'tot', '37', 'en', '43', 'tot', '45', 'en', '47', 'van', 'het', 'wetboek', 'van', 'de', 'minnelijke', 'en', 'gedwongen', 'invordering', 'van', 'fiscale', 'en', 'niet-fiscale', 'schuldvorderingen', '(hierna', '"invorderingswetboek")', 'en', 'de', 'artikelen', '157', 'tot', '159', 'en', '161', 'van', 'de', 'programmawet', '(i)', 'van', '29', 'maart', '2012', '(hierna', '"programmawet",', 'zoals', 'gewijzigd', 'door', 'de', 'wet',

In [14]:
# Using list comprehension + split()
# Tokenizing strings in list of strings
data_words = [sub.split() for sub in content]
print(data_words)

[['verslag'], ['aan'], ['de'], ['koning'], ['sire,'], ['het'], ['ontwerp'], ['van'], ['koninklijk'], ['besluit'], ['dat'], ['wij'], ['de'], ['eer'], ['hebben'], ['aan'], ['uwe'], ['majesteit'], ['voor'], ['te'], ['leggen,'], ['beoogt'], ['de'], ['uitvoering'], ['van'], ['de'], ['artikelen'], ['93'], ['ter'], ['tot'], ['93'], ['quinquies'], ['van'], ['het'], ['wetboek'], ['van'], ['de'], ['belasting'], ['over'], ['de'], ['toegevoegde'], ['waarde'], ['(hierna'], ['"wbtw"),'], ['de'], ['artikelen'], ['412'], ['bis,'], ['433'], ['tot'], ['435'], ['van'], ['het'], ['wetboek'], ['van'], ['de'], ['inkomstenbelastingen'], ['1992'], ['(hierna'], ['"wib'], ['92"),'], ['de'], ['artikelen'], ['35'], ['tot'], ['37'], ['en'], ['43'], ['tot'], ['45'], ['en'], ['47'], ['van'], ['het'], ['wetboek'], ['van'], ['de'], ['minnelijke'], ['en'], ['gedwongen'], ['invordering'], ['van'], ['fiscale'], ['en'], ['niet-fiscale'], ['schuldvorderingen'], ['(hierna'], ['"invorderingswetboek")'], ['en'], ['de'], ['art

### Creating bigrams and trigrams

In [15]:
# Building the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
###print(trigram_mod[bigram_mod[data_words]])



###  Removing stopwords, making bigrams & trigtrams, and lemmatizing

In [16]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

### Initializing SpaCy's Dutch NLP model (large size)

In [1]:
 !pip install -U spacy
 !python -m spacy download nl_core_news_lg

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy
  Downloading spacy-3.3.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.2 MB)
[K     |████████████████████████████████| 6.2 MB 4.2 MB/s 
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
[K     |████████████████████████████████| 10.1 MB 35.2 MB/s 
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
[K     |████████████████████████████████| 181 kB 52.4 MB/s 
Collecting typer<0.5.0,>=0.3.0
  Downloading typer-0.4.1-py3-none-any.whl (27 kB)
Collecting thinc<8.1.0,>=8.0.14
  Downloading thinc-8.0.17-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (660 kB)
[K     |████████████████████████████████| 660 kB 49.1 MB/s 
[?25hCollecting spacy-legacy<3.1.0,>=3.0.9
  Downloading spacy_legacy-3.0.9-py2.py3-none-any.whl (20 kB)
Collecting srsly<3.

### Calling the functions in order

In [17]:
 # Removing Stop Words
data_words_nostops = remove_stopwords(data_words)

# Forming Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Loading the SpaCy 'nl' model, keeping only tagger component (for efficiency)
nlp = spacy.load('nl_core_news_lg', disable=['parser', 'ner'])

# Doing lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized)

[['verslag'], [], [], ['koning'], ['sire'], [], ['ontwerp'], [], ['koninklijk'], ['besluit'], [], [], [], ['eer'], [], [], ['uwe'], ['majesteit'], [], [], ['leggen'], ['beogen'], [], ['uitvoering'], [], [], ['artikel'], [], [], [], [], ['quinquie'], [], [], ['wetboek'], [], [], ['belasting'], [], [], ['toegevoegde'], ['waarde'], ['hierna'], ['wbtw'], [], ['artikel'], [], ['bis'], [], [], [], [], [], ['wetboek'], [], [], [], [], ['hierna'], [], [], [], ['artikel'], [], [], [], [], [], [], [], [], [], [], [], ['wetboek'], [], [], ['minnelijk'], [], ['dwingen'], ['invordering'], [], ['fiscaal'], [], ['fiscaal'], [], ['hierna'], [], [], [], ['artikel'], [], [], [], [], [], [], [], ['programma_wet'], [], [], [], [], [], ['hierna'], ['programma_wet'], [], ['wijzigen'], [], [], ['wet'], [], [], [], [], ['houdenen'], ['fiscaal'], ['_fraude'], ['bestrijden'], ['financieel'], [], ['divers'], ['bepaling'], [], [], ['wet'], [], [], [], [], ['houdenen'], ['wijziging'], [], [], ['wetboek'], [], [], 

**NOTE**: The empty lists (within this extensive python list of individual Dutch words) are stripped off of words that are negligible (e.g. stopwords) and words that have to be retained (e.g. names of persons; complete article code numbers; etc.). **FINER PREPROCESSING VIA NER can be implemented in this part**. 

### Creating the Dictionary and Corpus needed for Topic Modeling

In [18]:
# Creating Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Creating Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# Viewing corpus
print(corpus)

[[(0, 1)], [], [], [(1, 1)], [(2, 1)], [], [(3, 1)], [], [(4, 1)], [(5, 1)], [], [], [], [(6, 1)], [], [], [(7, 1)], [(8, 1)], [], [], [(9, 1)], [(10, 1)], [], [(11, 1)], [], [], [(12, 1)], [], [], [], [], [(13, 1)], [], [], [(14, 1)], [], [], [(15, 1)], [], [], [(16, 1)], [(17, 1)], [(18, 1)], [(19, 1)], [], [(12, 1)], [], [(20, 1)], [], [], [], [], [], [(14, 1)], [], [], [], [], [(18, 1)], [], [], [], [(12, 1)], [], [], [], [], [], [], [], [], [], [], [], [(14, 1)], [], [], [(21, 1)], [], [(22, 1)], [(23, 1)], [], [(24, 1)], [], [(24, 1)], [], [(18, 1)], [], [], [], [(12, 1)], [], [], [], [], [], [], [], [(25, 1)], [], [], [], [], [], [(18, 1)], [(25, 1)], [], [(26, 1)], [], [], [(27, 1)], [], [], [], [], [(28, 1)], [(24, 1)], [(29, 1)], [(30, 1)], [(31, 1)], [], [(32, 1)], [(33, 1)], [], [], [(27, 1)], [], [], [], [], [(28, 1)], [(34, 1)], [], [], [(14, 1)], [], [], [(15, 1)], [], [], [(16, 1)], [(17, 1)], [], [(14, 1)], [], [], [], [], [], [(14, 1)], [], [], [(21, 1)], [], [(22, 1)

Gensim creates a unique id for each word in the document. The produced corpus shown above is a mapping of (word_id, word_frequency).

For example, (0, 1) above implies, **word id 0 occurs once** in the document. Likewise, **word id 1 occurs once too**, and so on.

This is used as the input by the LDA model.

If you want to see what word a given id corresponds to, pass the id as a key to the dictionary. (From https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)

In [19]:
# Passing the id as a key to the dictionary to see what word a given ID corresponds to
id2word[0]

'verslag'

In [20]:
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus]

[[('verslag', 1)],
 [],
 [],
 [('koning', 1)],
 [('sire', 1)],
 [],
 [('ontwerp', 1)],
 [],
 [('koninklijk', 1)],
 [('besluit', 1)],
 [],
 [],
 [],
 [('eer', 1)],
 [],
 [],
 [('uwe', 1)],
 [('majesteit', 1)],
 [],
 [],
 [('leggen', 1)],
 [('beogen', 1)],
 [],
 [('uitvoering', 1)],
 [],
 [],
 [('artikel', 1)],
 [],
 [],
 [],
 [],
 [('quinquie', 1)],
 [],
 [],
 [('wetboek', 1)],
 [],
 [],
 [('belasting', 1)],
 [],
 [],
 [('toegevoegde', 1)],
 [('waarde', 1)],
 [('hierna', 1)],
 [('wbtw', 1)],
 [],
 [('artikel', 1)],
 [],
 [('bis', 1)],
 [],
 [],
 [],
 [],
 [],
 [('wetboek', 1)],
 [],
 [],
 [],
 [],
 [('hierna', 1)],
 [],
 [],
 [],
 [('artikel', 1)],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [('wetboek', 1)],
 [],
 [],
 [('minnelijk', 1)],
 [],
 [('dwingen', 1)],
 [('invordering', 1)],
 [],
 [('fiscaal', 1)],
 [],
 [('fiscaal', 1)],
 [],
 [('hierna', 1)],
 [],
 [],
 [],
 [('artikel', 1)],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [('programma_wet', 1)],
 [],
 [],
 [],
 [],
 [],
 [

### Building the topic model

In [21]:
# Building LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

###  Viewing the topics in LDA model

You can see the keywords for each topic and the weightag (importance) of each keyword using lda_model.print_topics() as shown next.

In [22]:
# Printing keywords for index 0 to 10
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.273*"wijzigen" + 0.161*"wet" + 0.112*"overheid_dienst" + 0.112*"federaal" '
  '+ 0.038*"notariaat" + 0.038*"hetzelfde" + 0.020*"streepje" + 0.010*"divers" '
  '+ 0.003*"financieel" + 0.003*"_fraude"'),
 (1,
  '0.292*"zien" + 0.159*"raadpleging" + 0.159*"voegen" + 0.159*"tabel" + '
  '0.051*"wbtw" + 0.027*"Beeldgezien" + 0.025*"kader" + 0.013*"associeren" + '
  '0.012*"vallen" + 0.004*"begunstigen"'),
 (2,
  '0.382*"bedoelen" + 0.081*"erfop_volging" + 0.065*"btw" + 0.065*"eenheid" + '
  '0.064*"verplichting" + 0.064*"bijzonder" + 0.002*"sire" + 0.002*"volledig" '
  '+ 0.002*"opnieuw" + 0.002*"lang"'),
 (3,
  '0.165*"quinquie" + 0.125*"belasten" + 0.090*"eigenaar" + 0.079*"houden" + '
  '0.069*"Hypotheek" + 0.041*"volgen" + 0.041*"opmaak" + 0.031*"machtigden" + '
  '0.031*"beheers_ysteem" + 0.012*"vermellen"'),
 (4,
  '0.729*"artikel" + 0.058*"zending" + 0.046*"bepalen" + 0.001*"maken" + '
  '0.001*"Beeldgezien" + 0.001*"hetzelfde" + 0.001*"woord" + '
  '0.001*"associeren" + 0.

**NOTE**: For each index (0 to 10), there's a top 10 keywords that contribute to this topic.

The weights reflect how important a keyword is to that topic.

Looking at these keywords, you can guess what this topic could be.

### Compute Model Perplexity and Coherence Score

In [23]:
# Computing Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # a measure of how good the model is. lower the better.

# Computing Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -7.148686727401539

Coherence Score:  0.8031396675007777


### Visualizing topic-keywords distribution

In [None]:
# Downgrading to pyLDavis 3.2.1 to circumvent pyLDAvis - Gensim conflicts in Colab
### !pip install pyLDAvis==3.2.1

In [25]:
# Feeding the model into the pyLDAvis instance
vis = gensimvis.prepare(lda_model, corpus, id2word)
vis

  by='saliency', ascending=False).head(R).drop('saliency', 1)


**How to make inferences from pyLDAvis' output:**

Each bubble on the left-hand side plot represents a topic. The larger the bubble, the more prevalent is that topic.

A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant.

A model with too many topics, will typically have many overlaps, small sized bubbles clustered in one region of the chart.

Alright, if you move the cursor over one of the bubbles, the words and bars on the right-hand side will update. These words are the salient keywords that form the selected topic.

## Mallet's version of LDA

Upnext, we will improve upon this model by using Mallet’s version of LDA algorithm and then we will focus on how to arrive at the optimal number of topics given any large corpus of text.

Gensim provides a wrapper to implement Mallet’s LDA from within Gensim itself. 

In [None]:
# Upgrading Gensim
#!pip install --upgrade gensim==3.8

In [None]:
# Installing Mallet
!wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
!unzip mallet-2.0.8.zip

In [None]:
mallet_path = 'path/to/mallet-2.0.8/bin/mallet' # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)