# Text Classification Using Topic Modelling

In [1]:
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_20newsgroups
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
# from nltk.stem import WordNetLemmatizer, SnowballStemmer
# from nltk.stem.porter import *

import spacy

np.random.seed(42)

## Load the dataset

We'll use the 20 Newsgroups dataset, a collection of news posts grouped into 20 categories, but keep only 7 of them for this example.
I've tried to pick groups that should be reasonably well separated.

In [2]:
categories = [
    'comp.windows.x',
    'rec.autos',
    'rec.sport.baseball',
    'rec.sport.hockey',
    'sci.space',
    'soc.religion.christian',
    'talk.politics.guns'
]

In [3]:
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

Let's look at an example.

In [4]:
print(newsgroups_train.data[6])

From: SHOE@PHYSICS.watstar.uwaterloo.ca (Mark Shoesmith)
Subject: Re: Let's talk sticks...
Lines: 35
Organization: University of Waterloo

In article <C50pt4.6CM@odin.corp.sgi.com> dptom@endor.corp.sgi.com (Tom Arnold) writes:

>Okay you hockey playing fans/finatics out there. I'm looking over the wide 
>range of aluminum sticks for the first time. I've been playing with pieces
>of lumbar that seem to weigh alot and break after a few uses, so I'm 
>thinking of changing to an aluminum shaft so when I break the blade all I 
>have to do is change it. The problem is that there is such a wide reange of
>models and selections out there that I'm not certain which to consider. Can
>any of you post some of your suggestions and experiences with the aluminum 
>sticks? What is the difference between models? What do you like/dislike about
>them? And, which brands are best?
>
>

I've had, and still have a few aluminum sticks.  I got my first when I was 15
(a Christian), and broke the shaft halfway t

In [5]:
target_newsgroup = newsgroups_train.target_names[newsgroups_train.target[6]]
print('Group: {}'.format(target_newsgroup))

Group: rec.sport.hockey


In [6]:
print(newsgroups_train.filenames.shape, newsgroups_train.target.shape)

(4122,) (4122,)


This would normally be enough rows, but they are split over 7 categories, so each category may not have enough.  
Let's see how many documents each category has.

In [7]:
import collections
collections.Counter(newsgroups_train.target)

Counter({4: 593, 2: 597, 1: 594, 0: 593, 3: 600, 6: 546, 5: 599})

I could map the keys to the category names, but you can see by eye that the dataset is well balanced.
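Mapping the counter keys to category names takes only a line or two. A minimal sketch, using small hypothetical stand-ins for `newsgroups_train.target_names` and `newsgroups_train.target` so that it runs without downloading the dataset:

```python
import collections

# Hypothetical stand-ins for newsgroups_train.target_names and newsgroups_train.target
target_names = ['comp.windows.x', 'rec.autos', 'sci.space']
target = [0, 2, 1, 0, 2, 2]

# Count the integer labels, then swap each label for its category name
counts = collections.Counter(target)
named_counts = {target_names[idx]: n for idx, n in sorted(counts.items())}
print(named_counts)
```

With the real data, `target_names` and `target` would simply be replaced by the attributes of `newsgroups_train`.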

## Data Preprocessing

We transform the data so that the ML algorithm receives the strongest possible signal.

* Tokenization: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
* Words that have 3 or fewer characters are removed.
* Remove stopwords: such as the, is, at, which, and on.
* Lemmatize: Words in third person are changed to first person and verbs in past and future tenses are changed into present.
* Stemming: Words are reduced to their root form.
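
The first three steps can be sketched in plain Python. This is only an illustration under simplifying assumptions: `TOY_STOPWORDS` is a tiny hypothetical stop-word set and the regular expression is a crude tokenizer, whereas the notebook relies on spaCy for both. The length check mirrors the `len(token) > 3` test used in the notebook's `preprocess` function.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only
TOY_STOPWORDS = {'the', 'is', 'at', 'which', 'and', 'on', 'this', 'many', 'would'}

def simple_tokens(text):
    # Lowercase the text and keep alphabetic runs, which also drops punctuation
    words = re.findall(r'[a-z]+', text.lower())
    # Drop short words and stop words
    return [w for w in words if len(w) > 3 and w not in TOY_STOPWORDS]

tokens = simple_tokens('This disk has failed many times. I would like to get it replaced.')
print(tokens)
```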

Lemmatizing maps a word to its base form, e.g. went -> go.  
Stemming is more of a function applied to the word, such as removing the 'ing' from the end.  
We lemmatize first and then stem, because the lemma may be spelt completely differently from the original word (going -> go is similar, but went -> go is a totally different spelling).  
Stemming can often produce an 'invalid' word such as argue -> argu, which the lemmatizer wouldn't accept.
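
To make the difference concrete, here is a toy sketch. Both functions are hypothetical illustrations: `toy_lemmatize` is a three-entry lookup table and `toy_stem` strips a few suffixes, which is far cruder than a real lemmatizer or a Porter/Snowball stemmer.

```python
# Toy lemmatizer: a lookup from inflected form to base form (hypothetical table)
LEMMA_TABLE = {'went': 'go', 'going': 'go', 'argued': 'argue'}

def toy_lemmatize(word):
    # Unknown words are returned unchanged
    return LEMMA_TABLE.get(word, word)

def toy_stem(word):
    # Toy stemmer: strip the first matching suffix, keeping at least 2 characters
    for suffix in ('ing', 'ed', 'e'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

for word in ('went', 'going', 'argue'):
    print(word, '-> lemma:', toy_lemmatize(word), '| stem:', toy_stem(word))
```

The lookup can relate `went` to `go`, which no suffix rule could do, while the suffix rule turns `argue` into the non-word `argu`.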

In [8]:
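# Don't need the PoS tagger, the NER labeling or the text categorizer 
# https://spacy.io/usage/processing-pipelines
# Note: the 'en' shortcut only exists in spaCy v2; spaCy v3 needs the full package name, e.g. 'en_core_web_sm'
nlp = spacy.load('en', disable=['tagger', 'ner', 'textcat'])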

TODO: spaCy doesn't have a stemmer and we may not need one. We can check the effect on accuracy later.  
https://github.com/explosion/spaCy/issues/327

In [9]:
def preprocess(text):
    doc = nlp(text)
    result = [token.lemma_.lower() for token in doc if not token.is_stop and len(token) > 3 and token.text in nlp.vocab]
    return result

In [10]:
doc_sample = 'This disk has failed many times. I would like to get it replaced.'
proc = preprocess(doc_sample)
print(proc)

['this', 'disk', 'fail', 'time', 'like', 'replace']


In [11]:
#https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
#DELETEME: NLTK output 
# ['disk', 'fail', 'time', 'like', 'replac']

In [13]:
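# Redefine preprocess to take an already-parsed spaCy Doc, so documents can be batched through nlp.pipe
def preprocess(doc):
    result = [token.lemma_.lower() for token in doc if not token.is_stop and len(token) > 3 and token.text in nlp.vocab]
    return result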

In [14]:
def preprocess_docs(docs):
    # nlp.pipe processes the given texts as a stream, in batches
    processed_docs = []

    for doc in nlp.pipe(docs):
        processed_docs.append(preprocess(doc))

    return processed_docs

In [15]:
# docs = nlp.pipe(newsgroups_train.data)
processed_docs = preprocess_docs(newsgroups_train.data)

Preprocess all the messages we have (in parallel)

In [19]:
# import multiprocessing
# pool = multiprocessing.Pool()
# processed_docs = list(pool.map(preprocess, newsgroups_train.data))
# https://github.com/explosion/spaCy/issues/1839
# processed_docs = []

# for doc in newsgroups_train.data:
#     processed_docs.append(preprocess(doc))

In [20]:
print(processed_docs[:2])

[['from', 'subject', 'version', 'organization', 'university', 'lines', 'please', 'real', 'life'], ['from', 'richard', 'subject', 'cards', 'list', 'distribution', 'organization', 'hewlett', 'packard', 'lines', 'count', 'interest', 'cardinal', 'mail', 'list', 'find', 'start', 'know', 'thanks', 'dick']]


## Create Bag of words

A gensim `Dictionary` is a mapping between words and their integer ids.
It also records how many documents each word appears in across the training set.

In [21]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [22]:
for k, v in dictionary.iteritems():
    print(k, v)
    if k > 5:
        break

0 from
1 life
2 lines
3 organization
4 please
5 real
6 subject


Filter out tokens that appear in
* fewer than 15 documents or
* more than 10% of documents

After these two filters, keep only the 100k most frequent tokens.
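
The document-frequency idea behind these filters can be sketched in plain Python. This is not gensim's implementation: the toy corpus and the thresholds are hypothetical, and the `keep_n` step is left out.

```python
from collections import Counter

# Hypothetical toy corpus of already-tokenized documents
docs = [
    ['space', 'launch'],
    ['space', 'orbit'],
    ['space', 'launch', 'car'],
    ['space', 'car'],
]
no_below = 2    # keep tokens found in at least this many documents
no_above = 0.5  # keep tokens found in at most this fraction of documents

# Document frequency: in how many documents does each token occur?
doc_freq = Counter(tok for doc in docs for tok in set(doc))

kept = sorted(
    tok for tok, df in doc_freq.items()
    if df >= no_below and df / len(docs) <= no_above
)
print(kept)
```

Here `space` is dropped for being in every document and `orbit` for being in only one, which is the same reasoning that removes near-universal header words and rare typos from the newsgroup vocabulary.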

In [23]:
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)

Convert each document (a list of words) into the bag-of-words format:  
a list of (token_id, token_count) tuples.
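
As a rough picture of what the bag-of-words conversion produces, here is a plain-Python sketch. The names `token2id` and `to_bow` are hypothetical helpers rather than gensim's code, and the id assignment order is an arbitrary choice made for this example.

```python
from collections import Counter

# Hypothetical toy corpus of already-tokenized documents
docs = [['disk', 'fail', 'disk'], ['drive', 'fail']]

# Assign each new token the next free integer id
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(doc):
    # Count known tokens by id; tokens missing from the mapping are ignored
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

print(token2id)
print(to_bow(['disk', 'fail', 'disk']))
print(to_bow(['drive', 'unknown', 'fail']))
```

Word order is lost in this representation; only which words occur, and how often, is kept.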

In [24]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [25]:
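bow_doc_x = bow_corpus[10]
bow_word_x = 3

# Note: the tuple printed is entry 5 of the document, while the word looked up
# is entry bow_word_x (3), so the two values refer to different tokens
print('{} - {}'.format(
    bow_doc_x[5],
    dictionary[bow_doc_x[bow_word_x][0]]
))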

(243, 1) - driver


## Build the LDA Model
(Latent Dirichlet Allocation)  
If observations are words collected into documents, LDA posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics.

* **alpha** and **eta** are hyperparameters that affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions. We will leave these at their default values for now (the default is `1/num_topics`).
    - Alpha is the prior on the per-document topic distribution.
        * High alpha: Every document is a mixture of all topics (documents appear similar to each other).
        * Low alpha: Every document is a mixture of very few topics.

    - Eta is the prior on the per-topic word distribution.
        * High eta: Each topic is a mixture of most words (topics appear similar to each other).
        * Low eta: Each topic is a mixture of few words.
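
The effect of alpha can be seen by sampling document-topic mixtures directly from a symmetric Dirichlet distribution with NumPy. This is only a sketch of the prior, not of LDA training, and the values 0.1 and 50.0 are arbitrary choices for illustration. The same reasoning applies to eta, with words in place of topics.

```python
import numpy as np

rng = np.random.default_rng(42)
num_topics = 7
num_docs = 1000

# Each row is one sampled document-topic mixture (non-negative, sums to 1)
low_alpha = rng.dirichlet([0.1] * num_topics, size=num_docs)
high_alpha = rng.dirichlet([50.0] * num_topics, size=num_docs)

# Average share taken by the single largest topic in each document
low_max = low_alpha.max(axis=1).mean()
high_max = high_alpha.max(axis=1).mean()

print('low alpha, mean largest topic share: {:.2f}'.format(low_max))
print('high alpha, mean largest topic share: {:.2f}'.format(high_max))
```

With a low alpha one topic tends to dominate each document, while with a high alpha the shares sit close to `1/num_topics`, so documents look alike.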


In [29]:
lda_model = gensim.models.LdaMulticore(
    bow_corpus,
    num_topics=7,
    id2word=dictionary,                                    
    passes=10,
    workers=4)

## Evaluate the model

In [30]:
lda_model.show_topics()

[(0,
  '0.009*"weapon" + 0.008*"gun" + 0.007*"firearm" + 0.007*"control" + 0.007*"government" + 0.006*"file" + 0.006*"crime" + 0.005*"kill" + 0.005*"bill" + 0.004*"criminal"'),
 (1,
  '0.014*"christian" + 0.012*"jesus" + 0.009*"church" + 0.007*"christ" + 0.007*"bible" + 0.007*"faith" + 0.006*"truth" + 0.006*"word" + 0.006*"reason" + 0.005*"life"'),
 (2,
  '0.017*"space" + 0.007*"software" + 0.007*"nasa" + 0.005*"station" + 0.005*"system" + 0.005*"black" + 0.005*"shuttle" + 0.005*"jewish" + 0.004*"book" + 0.004*"european"'),
 (3,
  '0.009*"drive" + 0.009*"car" + 0.006*"engine" + 0.005*"speed" + 0.005*"auto" + 0.005*"road" + 0.004*"light" + 0.004*"leave" + 0.004*"fire" + 0.004*"driver"'),
 (4,
  '0.017*"space" + 0.009*"program" + 0.009*"file" + 0.008*"entry" + 0.008*"launch" + 0.006*"satellite" + 0.006*"orbit" + 0.006*"nasa" + 0.006*"moon" + 0.005*"system"'),
 (5,
  '0.012*"player" + 0.009*"season" + 0.009*"hockey" + 0.007*"score" + 0.006*"league" + 0.005*"division" + 0.005*"goal" + 0.00

In [31]:
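# Hand-built mapping from LDA topic id to newsgroup, based on each topic's top words.
# Topic ids are not stable between training runs, so this must be redone after retraining.
categories_map = {
    2: 'comp.windows.x',  # assumption: topic 2 ("software", "system") is the closest match
    3: 'rec.autos',
    6: 'rec.sport.baseball',
    5: 'rec.sport.hockey',
    4: 'sci.space',
    1: 'soc.religion.christian',
    0: 'talk.politics.guns'
}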
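
The mapping above is assigned by hand from the topic keywords. One possible alternative, which is not what this notebook does, is to give each topic the label that is most common among the training documents it dominates. A minimal sketch with hypothetical data, where `dominant_topic` stands in for the top LDA topic of each training document and `true_label` for its newsgroup:

```python
from collections import Counter, defaultdict

# Hypothetical stand-ins: top topic and true newsgroup for 7 training documents
dominant_topic = [0, 0, 1, 1, 1, 2, 0]
true_label = ['guns', 'guns', 'christian', 'christian', 'space', 'space', 'autos']

# Tally the true labels seen under each topic
votes = defaultdict(Counter)
for topic, label in zip(dominant_topic, true_label):
    votes[topic][label] += 1

# Each topic takes its most frequent label
topic_to_label = {topic: counts.most_common(1)[0][0] for topic, counts in votes.items()}
print(topic_to_label)
```

A majority vote like this can map two topics to the same category and leave another category with no topic at all, so the result still deserves a manual check.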

Testing the model on an unseen document

In [32]:
num = 2
unseen_document = newsgroups_test.data[num]
print(unseen_document)
print(newsgroups_test.target[num])
print(newsgroups_test.target_names[newsgroups_test.target[num]])

From: eggertj@moses.ll.mit.edu (Jim Eggert x6127 g41)
Subject: Re: Robin Lane Fox's _The Unauthorized Version_?
Reply-To: eggertj@ll.mit.edu
Organization: MIT Lincoln Lab - Group 41
Lines: 19

In article <May.7.01.09.39.1993.14550@athos.rutgers.edu> iscleekk@nuscc.nus.sg (LEE KOK KIONG JAMES) writes:
|   mpaul@unl.edu (marxhausen paul) writes:
|   > My mom passed along a lengthy review she clipped regarding Robin Lane
|   > Fox's book _The Unauthorized Version: Truth and Fiction in the Bible_,
|...
|   I've read the book. Some parts were quite typical regarding its
|   criticism of the bible as an inaccurate historical document,
|   alt.altheism, etc carries typical responses, but not as vociferous as
|   a.a. It does give an insight into how these historian (is he one... I 
|   don't have any biodata on him) work. I've not been able to understand/
|   appreciate some of the arguments, something like, it mentions certain 
|   events, so it has to be after that event, and so on. 

Robin

In [None]:
# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(nlp(unseen_document)))
# Sort the (topic_id, probability) pairs by descending probability
pred = sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1])
print(pred)

In [53]:
print('predicts {} with a probability of {:.2f}%'.format(categories_map[pred[0][0]], pred[0][1]*100))

predicts soc.religion.christian with a probability of 50.96%


The model correctly assigns the unseen document to the soc.religion.christian category, with a probability of about 51%.

### Check Accuracy

In [None]:
# preprocess expects a parsed spaCy Doc, so stream the raw test texts through nlp.pipe first
test_processed_docs = [preprocess(doc) for doc in nlp.pipe(newsgroups_test.data)]

In [24]:
test_bow_corpus = [dictionary.doc2bow(doc) for doc in test_processed_docs]

In [25]:
y_true = newsgroups_test.target

In [26]:
newsgroups_test.target_names

['comp.windows.x',
 'rec.autos',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns']

In [27]:
y_pred = []
for doc in test_bow_corpus:
    # Take the highest-probability topic and map it to a category index
    pred_all = sorted(lda_model[doc], key=lambda tup: -1*tup[1])
    pred_cat = categories_map[pred_all[0][0]]
    y_pred.append(newsgroups_test.target_names.index(pred_cat))

Accuracy is the proportion of the model's predictions that are correct.

In [28]:
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)

0.6985052861830113

In [29]:
# creating a confusion matrix 
from sklearn.metrics import confusion_matrix 
cm = confusion_matrix(y_true, y_pred)

Rows are the true labels (`y_true`) and columns are the predicted labels (`y_pred`).

In [30]:
cm

array([[362,  11,   0,   8,  12,   2,   0],
       [  5, 317,   0,  33,  11,  27,   3],
       [  6,   8,   0, 356,   5,  18,   4],
       [  0,   7,   0, 382,   2,   6,   2],
       [  5,  13,   0,   3, 325,  25,  23],
       [  3,   3,   0,   3,   5, 381,   3],
       [  2, 174,   0,   5,   2,  32, 149]])
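
Per-category recall makes the matrix easier to read. The sketch below copies the matrix from the output above (rows are true labels, columns are predicted labels) and derives accuracy and recall from it with NumPy.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([
    [362,  11,   0,   8,  12,   2,   0],
    [  5, 317,   0,  33,  11,  27,   3],
    [  6,   8,   0, 356,   5,  18,   4],
    [  0,   7,   0, 382,   2,   6,   2],
    [  5,  13,   0,   3, 325,  25,  23],
    [  3,   3,   0,   3,   5, 381,   3],
    [  2, 174,   0,   5,   2,  32, 149],
])
target_names = [
    'comp.windows.x', 'rec.autos', 'rec.sport.baseball', 'rec.sport.hockey',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
]

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = cm.trace() / cm.sum()

# Recall per category: correct predictions divided by the row total
recall = cm.diagonal() / cm.sum(axis=1)

for name, value in zip(target_names, recall):
    print('{:<24} {:.2f}'.format(name, value))
print('accuracy: {:.4f}'.format(accuracy))
```

Read this way, rec.sport.baseball is never predicted (its column is all zeros) and most baseball posts are assigned to rec.sport.hockey, while nearly half of the talk.politics.guns posts are assigned to rec.autos.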

In [31]:
# import pyLDAvis.gensim
# pyLDAvis.enable_notebook()
# prepared = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary)
# pyLDAvis.show(prepared)