# Topic Modeling on IoT News

This exercise is inspired by the tutorial by Sibanjan Das at DZone and the brilliant article at Machine Learning Plus. The exact articles are listed in the References section at the end.


## Using Topic Modeling to classify various IoT news

We demonstrate the use of topic modeling to categorize news articles in the IoT space. The articles give us an idea of how popular specific topics are in the IoT world. They are taken from "iotbusinessnews.com". This exercise is meant for educational purposes only.

Problem Statement: Identify the trending topics from the latest news in the IoT space.

Assumptions:

1. The news pieces in the data folder are treated as the entire data set for this exercise. Any insights generated therefore apply only to this set and should be read with that limitation in mind.
2. The inherent bias in which articles www.iotbusinessnews.com chooses to publish also needs to be considered.

Approach:

1. Retrieve text articles from the news website into separate text files named `text_<category>_<number>_<date of publication>.txt`
2. Prepare data for the model, along with the necessary documentation, in a Jupyter notebook
3. Run the LDA model on the text files
4. Generate insights
5. Suggest further work
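
The naming convention in step 1 keeps each file's metadata recoverable from its name. Below is a minimal sketch of how it could be parsed. It assumes the number is all digits and the date contains no underscore, since the exact date format is not specified here. The helper `parse_article_filename` and the sample filename are hypothetical.

```python
import re

# Pattern for "text_<category>_<number>_<date of publication>.txt".
# Assumptions: <number> is all digits and <date> contains no underscore.
FILENAME_PATTERN = re.compile(
    r"^text_(?P<category>.+)_(?P<number>\d+)_(?P<date>[^_]+)\.txt$"
)


def parse_article_filename(filename):
    """Return a dict with category, number and date, or None if no match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    parts["number"] = int(parts["number"])
    return parts


# Hypothetical filename, made up for illustration only.
print(parse_article_filename("text_industrial-iot_3_2019-04-30.txt"))
```

Files that do not follow the convention return `None`, so they can be skipped or reported before modeling.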

### Retrieve text articles from the news website

This step has already been performed. For educational purposes, we use only eight text files, which were created manually. The news pieces were accessed on 1st May 2019 from the website "iotbusinessnews.com" under the category "Industrial IoT". Techniques such as web scraping, and even neural networks, can be used to create these files at scale.

The files are stored in the data folder under the parent folder where this Notebook resides.

![image.png](attachment:image.png)

### Prepare the data for the model

We first import the necessary libraries.

In [2]:
import os

import numpy as np
import pandas as pd

import re
import string

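# NLTK: stop words, tokenizer and lemmatizer
# (the 'stopwords' and 'wordnet' corpora must be available, e.g. via nltk.download)

import nltk
from nltk.corpus import stopwords
from nltk import TweetTokenizer
from nltk.stem import WordNetLemmatizer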

# Gensim

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# Plots

import pyLDAvis
import pyLDAvis.gensim  # newer pyLDAvis releases (3.x) name this module pyLDAvis.gensim_models
import matplotlib.pyplot as plt

%matplotlib inline

#### Side Note on Gensim and pyLDAvis

Gensim is a free Python library used for topic modeling. You can explore more at https://radimrehurek.com/gensim/index.html

pyLDAvis is a library for interactive topic model visualization. You can explore more at https://pyldavis.readthedocs.io/en/latest/readme.html#usage



The next step is to read in the files and do some basic data cleansing.

In [71]:
cwd = os.getcwd()  # current directory, used as the parent of the data folder

path = os.path.join(cwd, 'data')  # portable across operating systems
# print(path) # for debug purpose

filesdata = []

fileList = os.listdir(path)

for i in fileList:
    # the context manager closes each file after reading
    with open(os.path.join(path, i), 'r', encoding='utf8') as file:
        data = file.readlines()

    data = [re.sub(r'\s+', ' ', sent) for sent in data]  # collapse whitespace and newlines
    data = [re.sub(r'\'', '', sent) for sent in data]  # remove apostrophes
    data = [x for x in data if x != ' ']  # drop blank lines
    # lowercase, strip accents and split the text into tokens
    data = gensim.utils.simple_preprocess(' '.join(data), deacc=True)
    filesdata.append(data)
    
#print(filesdata[7]) # check if the last file was read

The next step is to remove stopwords from the tokens.

In [100]:
stopwords_punct = set(stopwords.words('english')).union(string.punctuation)  # string.punctuation already includes '-'

data_tokens_no_stopwords = []
for data in filesdata:
    data_stopwords_rm = [word for word in data if word.lower() not in stopwords_punct]
    data_tokens_no_stopwords.append(data_stopwords_rm)
    
#data_tokens_no_stopwords # for debug purpose

The next step is to reduce the tokens to a base form. There are two popular ways to do this: stemming and lemmatization. We use lemmatization.
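
To make the distinction concrete, here is a toy sketch. Stemming strips suffixes by rule and may produce non-words, while lemmatization maps each word to a dictionary form. The suffix rules and lookup table below are hypothetical stand-ins written for illustration; they are not NLTK's stemmers or its `WordNetLemmatizer`.

```python
# Toy illustration only: these rules and this lookup table are hypothetical
# stand-ins, not NLTK's PorterStemmer or WordNetLemmatizer.

def toy_stem(word):
    """Strip a common suffix by rule; the result may not be a real word."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


TOY_LEMMAS = {"providing": "provide", "improves": "improve", "workers": "worker"}


def toy_lemmatize(word):
    """Look the word up in a vocabulary and return its dictionary form."""
    return TOY_LEMMAS.get(word, word)


for word in ("providing", "improves", "workers"):
    print(word, toy_stem(word), toy_lemmatize(word))
```

Lemmas stay readable as words, which is helpful when topic terms are later inspected by eye.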

In [95]:
wordnet_lemmatizer = WordNetLemmatizer()

data_lemmatized = []

for w in data_tokens_no_stopwords:
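    # lemmatize() defaults to pos='n', so only noun forms are reduced;
    # verb forms such as 'improves' or 'providing' are left unchanged
    data_lemmatized.append([wordnet_lemmatizer.lemmatize(word) for word in w])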

data_lemmatized  # for debug purpose

[['ptc',
  'improves',
  'workforce',
  'efficiency',
  'launch',
  'vuforia',
  'expert',
  'capture',
  'ar',
  'solution',
  'providing',
  'faster',
  'efficient',
  'way',
  'empower',
  'front',
  'line',
  'worker',
  'ptc',
  'today',
  'announced',
  'hannover',
  'messe',
  'release',
  'vuforia',
  'expert',
  'capture',
  'augmented',
  'reality',
  'ar',
  'solution',
  'designed',
  'improve',
  'workforce',
  'productivity',
  'quality',
  'safety',
  'compliance',
  'vuforia',
  'expert',
  'capture',
  'provides',
  'industrial',
  'enterprise',
  'faster',
  'efficient',
  'way',
  'empower',
  'front',
  'line',
  'worker',
  'relevant',
  'information',
  'need',
  'get',
  'job',
  'done',
  'quickly',
  'accurately',
  'first',
  'time',
  'major',
  'skill',
  'gap',
  'threatening',
  'manufacturing',
  'industry',
  'effective',
  'knowledge',
  'transfer',
  'existing',
  'subject',
  'matter',
  'expert',
  'smes',
  'critical',
  'next',
  'decade',
  'milli

In [96]:
# Create Corpus

dictionary = corpora.Dictionary(data_lemmatized) # traverses each document and assigns a unique id to each unique token along with their counts.

corpus = [dictionary.doc2bow(text) for text in data_lemmatized] # convert to bag of words

# print(corpus) # for debug purpose
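
To make the dictionary and bag-of-words steps concrete, the following sketch mimics them in plain Python. It assigns an integer id to each unique token, then turns each document into (token id, count) pairs. This illustrates the idea and is not gensim's implementation. The names `build_token_ids` and `to_bag_of_words` and the sample documents are made up, and the ids gensim assigns may differ.

```python
from collections import Counter


def build_token_ids(documents):
    """Assign an integer id to each unique token, in order of first appearance."""
    token_ids = {}
    for doc in documents:
        for token in doc:
            if token not in token_ids:
                token_ids[token] = len(token_ids)
    return token_ids


def to_bag_of_words(doc, token_ids):
    """Convert a token list into sorted (token id, count) pairs."""
    counts = Counter(token_ids[t] for t in doc if t in token_ids)
    return sorted(counts.items())


# Made-up sample documents for illustration.
docs = [["ptc", "vuforia", "expert", "capture", "vuforia"], ["expert", "worker"]]
ids = build_token_ids(docs)
bows = [to_bag_of_words(d, ids) for d in docs]
print(ids)
print(bows)
```

Word order is discarded in this representation; only the count of each token per document is kept.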

### Run the LDA model

In [108]:
# Build LDA model

ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=20, alpha='auto', per_word_topics=True)

In [109]:
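# Evaluate the model fit using gensim's log_perplexity.
# It returns the per-word likelihood bound (a negative number), not the perplexity itself.

print('Perplexity: ', ldamodel.log_perplexity(corpus)) # Perplexity is 2**(-bound), so a bound closer to zero indicates a better fit.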

Perplexity:  -6.613550819456577
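
`log_perplexity` reports a per-word likelihood bound rather than the perplexity itself. The bound can be converted with perplexity = 2^(-bound), which is the relation gensim uses when it logs a perplexity estimate. Here is a small sketch using the value printed above; `bound_to_perplexity` is a hypothetical helper name.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound (log base 2) into a perplexity."""
    return 2 ** (-per_word_bound)


# Value reported by ldamodel.log_perplexity(corpus) in this notebook.
bound = -6.613550819456577
print(bound_to_perplexity(bound))
```

Because the model is evaluated on the same eight documents it was trained on, this number describes fit to the training data, not performance on unseen articles.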


### Generate insights

Now, we will visualize the model using pyLDAvis. 

It is an interactive plot, where each bubble represents a topic.

A good topic model will have fairly large, non-overlapping bubbles in the chart. The bar plot on the right-hand side shows the frequency of each term within the selected topic, out of its total frequency in the documents.

In [110]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
vis

### Further work

We can look at the following improvements:

1. Improve the lemmatization step
2. Move the model into a separate Python file and call it from the notebook
3. Identify which topics are prominent in which document
4. Add visualizations based on the Machine Learning Plus article
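
For item 3, one possible approach is to take, for each document, the topic with the highest probability. The sketch below works on lists of (topic id, probability) pairs, which is the shape returned by gensim's `LdaModel.get_document_topics` for a single bag-of-words document. The helper name `dominant_topic` and the sample numbers are made up for illustration.

```python
def dominant_topic(topic_probs):
    """Return the (topic id, probability) pair with the highest probability."""
    if not topic_probs:
        return None
    return max(topic_probs, key=lambda pair: pair[1])


# Made-up distributions; in the notebook these could come from
# [ldamodel.get_document_topics(bow) for bow in corpus].
sample = [
    [(0, 0.05), (2, 0.90), (4, 0.05)],
    [(1, 0.60), (3, 0.40)],
]
for doc_index, topic_probs in enumerate(sample):
    print(doc_index, dominant_topic(topic_probs))
```

A document whose top two topics have similar probabilities, like the second sample, is better described as a mixture than assigned to a single topic.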

#### References:

1. "Interactive Topic Modeling Using Python" at https://dzone.com/articles/interactive-topic-modeling-using-python
2. "Topic modeling visualization – How to present the results of LDA models?" at https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/