# 20. Extracting Topics
For unsupervised learning it is helpful to be able to extract topics from text. Although we are not going to use an unsupervised approach, we may still want to extract topics, both to detect new ones and to map them to the Interpol list. I found a tutorial online that might help, and I'll try it in this notebook.

## Data

In [4]:
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train', shuffle = True)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True)

print(list(newsgroups_train.target_names))

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [5]:
# Let's look at some sample news
newsgroups_train.data[:2]

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 

In [6]:
print(newsgroups_train.filenames.shape, newsgroups_train.target.shape)

(11314,) (11314,)


## Preprocessing

In [7]:
# pip install gensim
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np
np.random.seed(400)

In [8]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/16090187/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [9]:
print(WordNetLemmatizer().lemmatize('went', pos = 'v'))

go


In [10]:
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()  # create once instead of once per token

def lemmatize_stemming(text):
    # Lemmatize as a verb first, then stem the result
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

# Tokenize, drop stopwords and short tokens, then lemmatize and stem
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [11]:
processed_docs = []

for doc in newsgroups_train.data:
    processed_docs.append(preprocess(doc))
    
print(processed_docs[:2])

[['lerxst', 'thing', 'subject', 'nntp', 'post', 'host', 'organ', 'univers', 'maryland', 'colleg', 'park', 'line', 'wonder', 'enlighten', 'door', 'sport', 'look', 'late', 'earli', 'call', 'bricklin', 'door', 'small', 'addit', 'bumper', 'separ', 'rest', 'bodi', 'know', 'tellm', 'model', 'engin', 'spec', 'year', 'product', 'histori', 'info', 'funki', 'look', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst'], ['guykuo', 'carson', 'washington', 'subject', 'clock', 'poll', 'final', 'summari', 'final', 'clock', 'report', 'keyword', 'acceler', 'clock', 'upgrad', 'articl', 'shelley', 'qvfo', 'innc', 'organ', 'univers', 'washington', 'line', 'nntp', 'post', 'host', 'carson', 'washington', 'fair', 'number', 'brave', 'soul', 'upgrad', 'clock', 'oscil', 'share', 'experi', 'poll', 'send', 'brief', 'messag', 'detail', 'experi', 'procedur', 'speed', 'attain', 'rat', 'speed', 'card', 'adapt', 'heat', 'sink', 'hour', 'usag', 'floppi', 'disk', 'function', 'floppi', 'especi', 'request', 'summar', 'day',

## Bag of Words

In [27]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [13]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 addit
1 bodi
2 bricklin
3 bring
4 bumper
5 call
6 colleg
7 door
8 earli
9 engin
10 enlighten


In [14]:
'''
OPTIONAL STEP
Remove very rare and very common words:

- words appearing in fewer than 15 documents
- words appearing in more than 10% of all documents
'''
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n= 100000)
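
To make the effect of `no_below` and `no_above` concrete, here is a minimal pure-Python sketch of the same idea on a toy corpus. It is not gensim's implementation: the function name `filter_by_document_frequency` and the toy documents are made up for illustration, and the `keep_n` cap is left out.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents (hypothetical helper)."""
    # Count each token once per document: document frequency, not term frequency.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Toy corpus, invented for this example.
toy_docs = [
    ["watch", "steel", "case"],
    ["watch", "steel", "strap"],
    ["watch", "quartz", "case"],
    ["watch", "ceram", "case"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['case', 'steel']
```

Here "watch" is dropped because it appears in every document, while "strap", "quartz" and "ceram" are dropped because each appears in only one.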

In [15]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
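
Each entry of `bow_corpus` is a list of `(token_id, count)` pairs. The sketch below mimics that conversion in plain Python, assuming a small hand-written vocabulary. The helper `to_bag_of_words` is hypothetical, and the ids are borrowed from the pre-filter dictionary listing above purely for illustration, since filtering can reassign ids.

```python
from collections import Counter

def to_bag_of_words(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary for illustration only.
token2id = {"bricklin": 2, "door": 7, "engin": 9}
doc = ["door", "bricklin", "door", "funki"]  # "funki" is not in the vocabulary
print(to_bag_of_words(doc, token2id))  # [(2, 1), (7, 2)]
```

Tokens missing from the vocabulary are silently ignored, which is why words removed by `filter_extremes` no longer contribute to a document's vector.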

## LDA

In [16]:
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=8,
                                       id2word=dictionary,
                                       passes=10,
                                       workers=2)

In [17]:
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.007*"presid" + 0.005*"clinton" + 0.005*"bike" + 0.004*"homosexu" + 0.004*"netcom" + 0.004*"virginia" + 0.004*"run" + 0.003*"pitch" + 0.003*"talk" + 0.003*"consid"


Topic: 1 
Words: 0.009*"govern" + 0.007*"armenian" + 0.006*"israel" + 0.005*"kill" + 0.005*"isra" + 0.004*"american" + 0.004*"turkish" + 0.004*"weapon" + 0.004*"jew" + 0.004*"countri"


Topic: 2 
Words: 0.017*"game" + 0.015*"team" + 0.011*"play" + 0.009*"player" + 0.008*"hockey" + 0.006*"season" + 0.005*"leagu" + 0.005*"canada" + 0.005*"score" + 0.004*"andrew"


Topic: 3 
Words: 0.012*"window" + 0.011*"card" + 0.007*"driver" + 0.007*"drive" + 0.006*"sale" + 0.005*"control" + 0.005*"price" + 0.005*"speed" + 0.005*"disk" + 0.005*"scsi"


Topic: 4 
Words: 0.013*"file" + 0.009*"program" + 0.007*"window" + 0.006*"encrypt" + 0.006*"chip" + 0.006*"imag" + 0.006*"data" + 0.006*"avail" + 0.005*"code" + 0.004*"version"


Topic: 5 
Words: 0.012*"space" + 0.009*"nasa" + 0.006*"scienc" + 0.005*"orbit" + 0.004*"researc
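
If the topics later need to be compared with an external list, it may help to have them as data rather than as formatted strings. The sketch below parses the printed format into `(word, weight)` pairs. The helper `parse_topic` is hypothetical, and the sample string is shortened from Topic 3 above.

```python
import re

def parse_topic(topic_string):
    """Turn '0.012*"window" + 0.011*"card"' into [('window', 0.012), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_3 = '0.012*"window" + 0.011*"card" + 0.007*"driver"'
print(parse_topic(topic_3))
# [('window', 0.012), ('card', 0.011), ('driver', 0.007)]
```

When the model object is still in memory, `lda_model.show_topic(topic_id, topn)` returns such pairs directly. The parser is only useful when nothing but the printed output is left.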

## Testing

In [18]:
num = 100
unseen_document = newsgroups_test.data[num]
print(unseen_document)

Subject: help
From: C..Doelle@p26.f3333.n106.z1.fidonet.org (C. Doelle)
Lines: 13

Hello All!

    It is my understanding that all True-Type fonts in Windows are loaded in
prior to starting Windows - this makes getting into Windows quite slow if you
have hundreds of them as I do.  First off, am I correct in this thinking -
secondly, if that is the case - can you get Windows to ignore them on boot and
maybe make something like a PIF file to load them only when you enter the
applications that need fonts?  Any ideas?


Chris

 * Origin: chris.doelle.@f3333.n106.z1.fidonet.org (1:106/3333.26)



In [19]:
# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.7365520000457764	 Topic: 0.012*"window" + 0.011*"card" + 0.007*"driver" + 0.007*"drive" + 0.006*"sale"
Score: 0.23920853435993195	 Topic: 0.013*"file" + 0.009*"program" + 0.007*"window" + 0.006*"encrypt" + 0.006*"chip"


In [20]:
print(newsgroups_test.target[num])

2
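
The printed label `2` is an index into `target_names`. A minimal sketch of the lookup, with the category list copied from the output at the top of this notebook rather than loaded from scikit-learn:

```python
# Category names as printed by `newsgroups_train.target_names` above.
target_names = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]

true_label = 2  # value printed for newsgroups_test.target[num]
print(target_names[true_label])  # comp.os.ms-windows.misc
```

In the notebook itself, `newsgroups_test.target_names[newsgroups_test.target[num]]` should give the same name. The post really is from the MS Windows group, which fits the highest-scoring topic (window, card, driver, drive).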


## Agora
After following the tutorial, let's try topic modelling on our own dataset.

## Preprocessing

In [48]:
import pandas as pd

df = pd.read_csv('Structured_DataFrame_Sample_500.csv', index_col=0)
df.shape

(15000, 3)

In [49]:
processed_descriptions = []

for description in df['Item Description']:
    processed_descriptions.append(preprocess(description))
    
print(processed_descriptions[:2])

[['emporio', 'armani', 'shell', 'case', 'ceram', 'bracelet', 'replica', 'watch', 'inform', 'brand', 'armani', 'dial', 'window', 'materi', 'type', 'miner', 'band', 'materi', 'ceram', 'case', 'materi', 'ceram', 'case', 'diamet', 'millimet', 'case', 'thick', 'millimet', 'item'], ['cartier', 'tank', 'ladi', 'brand', 'cartier', 'seri', 'tank', 'gender', 'ladi', 'diamet', 'thick', 'movement', 'swiss', 'quartz', 'movement', 'function', 'hour', 'minut', 'second', 'case', 'materi', 'stainless', 'steel', 'strap', 'materi', 'real']]


## Bag of Words

In [35]:
dictionary = gensim.corpora.Dictionary(processed_descriptions)

In [36]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 armani
1 band
2 bracelet
3 brand
4 case
5 ceram
6 dial
7 diamet
8 emporio
9 inform
10 item


In [37]:
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n= 100000)

In [39]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_descriptions]

## LDA

In [44]:
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=30,
                                       id2word=dictionary,
                                       passes=10,
                                       workers=32)

In [45]:
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.105*"account" + 0.039*"lifetim" + 0.024*"work" + 0.021*"freebi" + 0.021*"card" + 0.020*"premium" + 0.019*"anonym" + 0.015*"porn" + 0.015*"cash" + 0.014*"test"


Topic: 1 
Words: 0.034*"haze" + 0.027*"super" + 0.018*"amnesia" + 0.016*"hydrocodon" + 0.016*"sulfat" + 0.014*"stealth" + 0.014*"escrow" + 0.013*"avail" + 0.013*"offer" + 0.013*"want"


Topic: 2 
Words: 0.029*"ketamin" + 0.018*"time" + 0.014*"stamp" + 0.013*"prioriti" + 0.013*"hash" + 0.013*"bag" + 0.012*"sourc" + 0.012*"come" + 0.011*"crystal" + 0.011*"uncut"


Topic: 3 
Words: 0.093*"cocain" + 0.061*"track" + 0.034*"shipment" + 0.033*"best" + 0.028*"offer" + 0.026*"number" + 0.020*"reship" + 0.018*"receiv" + 0.018*"stick" + 0.017*"sign"


Topic: 4 
Words: 0.083*"heroin" + 0.073*"mdma" + 0.023*"white" + 0.020*"crystal" + 0.019*"uncut" + 0.017*"black" + 0.017*"brown" + 0.014*"tramadol" + 0.014*"strong" + 0.012*"puriti"


Topic: 5 
Words: 0.067*"speed" + 0.064*"blotter" + 0.048*"past" + 0.048*"nbome" + 0.021*"

## Testing

In [47]:
num = 100
unseen_description = "Hello, I sell some weed for you to smoke and get high!"
print(unseen_description)

Hello, I sell some weed for you to smoke and get high!


In [50]:
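# Data preprocessing step for the unseen description
# (this cell originally passed `unseen_document`, the newsgroup post, so the
# scores recorded below belong to that post and need a re-run)
bow_vector = dictionary.doc2bow(preprocess(unseen_description))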

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.26623544096946716	 Topic: 0.041*"hash" + 0.026*"open" + 0.021*"templat" + 0.019*"driver" + 0.019*"licenc"
Score: 0.2602015435695648	 Topic: 0.090*"case" + 0.086*"materi" + 0.051*"watch" + 0.041*"steel" + 0.041*"stainless"
Score: 0.14973457157611847	 Topic: 0.142*"ketamin" + 0.019*"isom" + 0.019*"final" + 0.019*"earli" + 0.016*"escrow"
Score: 0.11855105310678482	 Topic: 0.027*"chocol" + 0.023*"clean" + 0.017*"custom" + 0.014*"dutch" + 0.012*"kief"
Score: 0.10444159060716629	 Topic: 0.057*"price" + 0.027*"paypal" + 0.024*"target" + 0.023*"email" + 0.022*"direct"
Score: 0.07416040450334549	 Topic: 0.062*"ciali" + 0.033*"cooki" + 0.032*"send" + 0.025*"address" + 0.022*"scan"
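
One of the goals stated at the start is detecting new topics. A possible heuristic, which is not part of the tutorial, is to flag a text when no single topic clearly dominates its score distribution. The sketch below assumes that idea. The helper names and the `min_score` threshold of 0.5 are arbitrary, untuned choices, and the topic ids in `agora_scores` are placeholders because the output above does not show them.

```python
def dominant_topic(topic_scores):
    """Return the (topic_id, score) pair with the highest score."""
    return max(topic_scores, key=lambda pair: pair[1])

def looks_novel(topic_scores, min_score=0.5):
    """Heuristic: no topic reaches `min_score`, so the text may not fit
    any learned topic well. `min_score` is an assumed value."""
    return dominant_topic(topic_scores)[1] < min_score

# Scores rounded from the two test outputs in this notebook.
newsgroup_scores = [(3, 0.737), (4, 0.239)]
# Placeholder topic ids; only the scores come from the output above.
agora_scores = [(0, 0.266), (1, 0.260), (2, 0.150),
                (3, 0.119), (4, 0.104), (5, 0.074)]

print(dominant_topic(newsgroup_scores), looks_novel(newsgroup_scores))
print(dominant_topic(agora_scores), looks_novel(agora_scores))
```

With these numbers the newsgroup post has a clear winner, while the second distribution is flat enough to be flagged. Whether a flat distribution really signals a new topic, rather than a poor model or a mismatched input, would have to be checked on labelled examples.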
