## Topic Modelling overview

In this notebook:

- Description of the data
- Looking at the data
- Text pre-processing
- Topic Modelling with Gensim
- Visualisation of Topic Models with pyLDAvis
- Topic coherence

### Some configuration first

The following cell downloads the NLTK components used in this notebook:

In [None]:
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

Also, watch out for deprecation warnings. The filter below shows each warning only once:

In [None]:
import warnings
warnings.filterwarnings(action='once')

### Description of the data

- We're going to use a subset of the popular `20newsgroups` dataset
- Each document is a newsgroup message
- Each document is labelled with the related newsgroup (one newsgroup per document)
- The full dataset has (surprise!) 20 newsgroups; we load six of them below
- The newsgroup name tells us about the overall topic

### Looking at the data

In [None]:
from sklearn.datasets import fetch_20newsgroups

categories = ['comp.sys.mac.hardware',
              'rec.autos',
              'sci.space',
              'misc.forsale',
              'talk.politics.guns',
              'talk.religion.misc']

newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'),
                                categories=categories)

In [None]:
len(newsgroups.data)

In [None]:
newsgroups.target_names

Let's look at the content of one document and at its label:

In [None]:
doc = newsgroups.data[2]
doc

In [None]:
doc_class = newsgroups.target[2]
doc_class

In [None]:
newsgroups.target_names[doc_class]

### Text pre-processing

Gensim expects the input corpus to be a sequence of tokenised documents, e.g. a list of documents, where each document is a list of strings (words/tokens).
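As a minimal illustration of that shape, the sketch below builds a tiny corpus by hand. The two sentences are made up, and `str.split()` is only a stand-in for a real tokeniser.

```python
# Two toy documents (made up for illustration)
toy_raw_docs = [
    "The rocket reached orbit",
    "The car needs new tyres",
]

# A corpus in the shape described above: one list of string tokens per document.
# str.split() is an assumption here, standing in for a proper tokeniser.
toy_corpus = [text.split() for text in toy_raw_docs]

print(toy_corpus)
```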

In our first iteration, we're simply tokenising the input data (from doc to words)

In [None]:
from nltk.tokenize import word_tokenize

def preprocess(text):
    return word_tokenize(text)

corpus = [preprocess(doc) for doc in newsgroups.data]

#### Building the term-document matrix

In [None]:
from gensim.corpora import Dictionary

id2word = Dictionary(corpus)

# Bag-of-words vector for each document: a list of (token_id, count) pairs
term_document_matrix = [id2word.doc2bow(text) for text in corpus]

# View one document in the term-document matrix
print(term_document_matrix[0])

In [None]:
# Number of documents
len(term_document_matrix)

In [None]:
# Number of unique words (vocabulary size)
len(id2word)

In [None]:
# View one word
id2word[0]

In [None]:
# View word frequency distribution in one document
doc = term_document_matrix[0]
[(id2word[word_id], freq) for word_id, freq in doc]
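To make the `(token_id, count)` pairs more concrete, here is a plain-Python sketch of the bag-of-words idea. The helper `toy_doc2bow` is hypothetical and is not how Gensim implements `doc2bow`: it assigns ids in order of first appearance and adds unseen tokens on the fly, so the ids will likely differ from the ones in `id2word`.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, adding unseen tokens to token2id."""
    for token in tokens:
        if token not in token2id:
            token2id[token] = len(token2id)  # next free integer id
    counts = Counter(token2id[token] for token in tokens)
    return sorted(counts.items())

toy_token2id = {}
toy_tokenised = [["space", "shuttle", "space"], ["car", "space"]]
toy_bows = [toy_doc2bow(tokens, toy_token2id) for tokens in toy_tokenised]

print(toy_token2id)
print(toy_bows)
```

Each document becomes a sparse vector: only the tokens that occur in it are listed, together with their counts.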

### Train topic model with LDA

In [None]:
%%time

from gensim.models.ldamodel import LdaModel
model = LdaModel(corpus=term_document_matrix,
                 id2word=id2word,
                 num_topics=10, 
                 passes=10)

In [None]:
model.print_topics()

### Can we do better?

Have a look at the topics extracted in the example above:

- does the output make sense?
- is the output useful at all?

### Better pre-processing

Some options to improve pre-processing:

- normalisation (e.g. lowercasing)
- stop-word removal
- punctuation removal

Data cleaning is not glamorous, but it can have a big impact on our models.
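The effect of these steps can be seen on a single toy sentence. This is only a sketch: the stop list below is a tiny hand-picked set (an assumption for the example), and the input is already tokenised.

```python
from string import punctuation

# A deliberately tiny, hand-picked stop list (hypothetical, for illustration only)
toy_stop_list = {"the", "is", "a", "of"} | set(punctuation)

def toy_preprocess(tokens):
    """Lowercase tokens, then drop stop words and single punctuation marks."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in toy_stop_list]

toy_tokens = ["The", "Shuttle", "is", "a", "marvel", "of", "engineering", "!"]
print(toy_preprocess(toy_tokens))
```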

In [None]:
from nltk.corpus import stopwords
from string import punctuation

STOP_LIST = set(stopwords.words('english') + list(punctuation))
STOP_LIST.update(["'m", "n't", '``', "'s", "'ll", "'re", '--', "''", '""', '...'])
STOP_LIST.update(['go', 'get', 'like', 'gon', 'na', 'oh', 'yeah'])

def preprocess(text):
    # Lowercase each token once, then drop stop words and punctuation
    tokens = (word.lower() for word in word_tokenize(text))
    return [token for token in tokens if token not in STOP_LIST]

corpus = [preprocess(doc) for doc in newsgroups.data]
id2word = Dictionary(corpus)
term_document_matrix = [id2word.doc2bow(text) for text in corpus]

In [None]:
len(id2word)

In [None]:
%%time

model = LdaModel(corpus=term_document_matrix,
                id2word=id2word,
                num_topics=10,
                passes=10)

model.print_topics()

### Removing the extremes of the distribution

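By Zipf's Law (https://en.wikipedia.org/wiki/Zipf%27s_law), a few words are extremely frequent while most words are very rare. Neither extreme helps to tell topics apart, so we drop tokens that appear in fewer than 10 documents (`no_below=10`) or in more than 50% of the documents (`no_above=0.5`).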

In [None]:
id2word = Dictionary(corpus)
id2word.filter_extremes(no_below=10, no_above=0.5)
term_document_matrix = [id2word.doc2bow(text) for text in corpus]
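The filtering idea can be sketched in plain Python on a toy corpus. The helper `toy_filter_extremes` is hypothetical and simplified: it works on document frequencies only and ignores details of Gensim's own method, such as the `keep_n` cap on vocabulary size.

```python
from collections import Counter

def toy_filter_extremes(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and in at most
    a fraction `no_above` of all docs."""
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc_tokens in docs for token in set(doc_tokens))
    max_docs = no_above * len(docs)
    return {token for token, df in doc_freq.items() if no_below <= df <= max_docs}

toy_filter_docs = [
    ["space", "launch", "the"],
    ["space", "orbit", "the"],
    ["car", "engine", "the"],
    ["car", "space", "the"],
]
kept = toy_filter_extremes(toy_filter_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "the" is dropped because it occurs in every document, and "launch", "orbit" and "engine" are dropped because each occurs in only one.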

In [None]:
len(id2word)

In [None]:
%%time

model = LdaModel(corpus=term_document_matrix,
                id2word=id2word,
                num_topics=10,
                passes=10)

model.print_topics()

### Visualisation with pyLDAvis

In [None]:
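import pyLDAvis

# Recent pyLDAvis releases expose the Gensim helper as `gensim_models`,
# older ones as `gensim`; try the newer name first
try:
    import pyLDAvis.gensim_models as gensimvis
except ImportError:
    import pyLDAvis.gensim as gensimvis

pyLDAvis.enable_notebook()

In [None]:
%%time

gensimvis.prepare(model, term_document_matrix, id2word, mds='mmds')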

### Topic coherence

To show the effect of the number of passes, we compare a model trained with 50 passes against one trained with a single pass:

In [None]:
%%time

from gensim.models.coherencemodel import CoherenceModel

good_model = LdaModel(corpus=term_document_matrix, id2word=id2word, passes=50, num_topics=10)
bad_model = LdaModel(corpus=term_document_matrix, id2word=id2word, passes=1, num_topics=10)

In [None]:
good_score = CoherenceModel(model=good_model, texts=corpus, dictionary=id2word)
bad_score = CoherenceModel(model=bad_model, texts=corpus, dictionary=id2word)

In [None]:
good_score.get_coherence(), bad_score.get_coherence()

#### What's the best number of topics?

In [None]:
models = []
scores = []

for n in range(5, 10):
    print("Training model with n={} topics".format(n))
    model = LdaModel(corpus=term_document_matrix, id2word=id2word, passes=10, num_topics=n)
    score = CoherenceModel(model=model, texts=corpus, dictionary=id2word)
    models.append(model)
    scores.append(score)

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt

n_topics = range(5, 10)
coherence_scores = [s.get_coherence() for s in scores]

plt.plot(n_topics, coherence_scores)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
# Pass a list: a bare string would be split into one legend entry per character
plt.legend(["coherence_scores"], loc='best')
plt.show()
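Besides reading the plot, the number of topics with the highest coherence can be picked programmatically. The sketch below uses made-up scores as placeholders (they are not results from this dataset); in the notebook, `n_topics` and `coherence_scores` would take their place.

```python
# Hypothetical coherence scores, made up for illustration only
toy_n_topics = [5, 6, 7, 8, 9]
toy_scores = [0.41, 0.47, 0.52, 0.49, 0.44]

# Pair each candidate with its score and keep the pair with the highest score
best_n, best_score = max(zip(toy_n_topics, toy_scores), key=lambda pair: pair[1])

print(best_n, best_score)
```

Since LDA training is randomised, the scores can change from run to run, so the highest value is best treated as a hint rather than a definitive answer.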

### Exercise - Train model with only nouns and adjectives

Part-of-speech (PoS) tagging is the process of assigning words to their grammatical categories.

We can achieve this using `nltk.pos_tag()`, for example:

In [None]:
from nltk import pos_tag

sentence = "The quick brown fox jumped over the lazy dog".split()

pos_tag(sentence)

Note: the function `nltk.pos_tag()` uses the set of tags from the [Penn Treebank project](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

#### How to change the pre-processing steps to include only nouns and adjectives?

#### Does this produce better topic models?

In [None]:
# Write your solution here
