# Topic Detection using Machine Learning

Importing Python Libraries

Numpy http://www.numpy.org/

NLTK https://www.nltk.org/

Pandas https://pandas.pydata.org/

TfLearn http://tflearn.org/

RegexpTokenizer from NLTK http://www.nltk.org/api/nltk.tokenize.html

SnowballStemmer from NLTK http://www.nltk.org/howto/stem.html

Stopwords from NLTK https://pythonspot.com/nltk-stop-words/

Training and Testing Data split from Sklearn http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [1]:
import numpy as np
import nltk
import os
import json
import datetime
import time
import random
import tensorflow as tf
import pandas as pd
import tflearn
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
stemmer = SnowballStemmer("english")

hdf5 is not supported on this machine (please install/reinstall h5py for optimal experience)
curses is not supported on this machine (please install/reinstall curses for an optimal experience)


In [2]:
random.seed(101)

# Data Pre-Processing

Import the training_data.csv file with pandas.

In [4]:
print(stopwords)

<WordListCorpusReader in 'C:\\Users\\kpajm\\AppData\\Roaming\\nltk_data\\corpora\\stopwords'>


In [3]:
#Reading the training data using Pandas 
dataset = pd.read_csv('training_data.csv')

#Printing out First Few Data 
dataset.head()

Unnamed: 0,Title,Country,Class,Topic
0,"Harper Lee's will, unsealed after a lawsuit, r...",USA,Entertainment,Celebrity
1,'Strikingly Opaque' Harper Lee Will Unsealed,USA,Entertainment,Celebrity
2,"The Life, Death and Career of Harper Lee",USA,Entertainment,Celebrity
3,Harper Lee's will reveals lawyer holds control...,USA,Entertainment,Celebrity
4,"Marvel's Stan Lee, 95, is dealing with 'a litt...",USA,Entertainment,Hollywood


In [4]:
#Extracting Features X (Title) and Y (Topic) 

title = dataset.iloc[:,0:1].values
topic = dataset.iloc[:,3:4].values


# Splitting The Dataset to X_train, X_test, y_train, y_test

X_train contains 80% of the titles from the dataset, and y_train contains the matching 80% of the topic labels.

The remaining 20% (X_test and y_test) is held out as the testing set.

In [5]:
#Splitting The Dataset to X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(title,topic,test_size=0.2,random_state=101)

Now the data has been split for training and testing.

Each title in X_train is paired with its label from y_train and appended to a list of dictionaries (a JSON-like structure).

In [6]:
training_data = []
# Pair each title row with its topic row; both arrays have one column
for x_row, y_row in zip(X_train, y_train):
    training_data.append({"title": x_row[0], "topic": y_row[0]})

# Cleaning our Training Data

In [7]:
words = []
topics = []
documents = []
ignore_words = ['?']
bin_words = [1]
print("{} sentences in training data".format(len(training_data)))

515 sentences in training data


In [8]:
training_data

[{'title': "Jennifer Lawrence: I was treated 'in a way that now we would call abusive'",
  'topic': 'Jennifer Lawrence'},
 {'title': "Review: 'Red Sparrow' is a spy thriller with an identity crisis",
  'topic': 'Jennifer Lawrence'},
 {'title': "Kevin Smith explains events of heart attack: 'I was never really in pain'",
  'topic': 'Celebrity'},
 {'title': 'India Pays Final Respects To Bollywood Superstar Sridevi',
  'topic': 'Sridevi'},
 {'title': 'PlayStation Plus dropping PS3 and Vita games in 2019 (update)',
  'topic': 'Video Game Console'},
 {'title': 'PlayStation Plus is getting rid of free PS3 and Vita games in March 2019',
  'topic': 'Video Game Console'},
 {'title': 'South Africa keeps Australia in check on 1st day of series',
  'topic': 'Cricket'},
 {'title': 'Landmark Cosmic Observation Provides Tantalizing Hints of Dark Matter',
  'topic': 'Astronomy'},
 {'title': 'Watch a Huawei Mate 10 Pro Drive a Porsche Panamera',
  'topic': 'Smart Phone'},
 {'title': "What's going on wit

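A regex tokenizer (RegexpTokenizer) splits each title into words, keeping only runs of 3 to 20 word characters and dropping punctuation and shorter tokens.

Stop words are removed as well, and the remaining words are lowercased and stemmed.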

In [9]:
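# Keep only runs of 3 to 20 word characters (raw string avoids an invalid-escape warning)
tokenizer = RegexpTokenizer(r'\w{3,20}')
stop_words = set(stopwords.words('english'))
for pattern in training_data:
    w = tokenizer.tokenize(pattern['title'])
    print(w)
    # Note: the stop-word check runs before lowercasing, so capitalized
    # stop words such as 'What' are kept
    w = [stemmer.stem(wr.lower()) for wr in w if wr not in stop_words]
    words.extend(w)
    documents.append((w, pattern['topic']))
    if pattern['topic'] not in topics:
        topics.append(pattern['topic'])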

['Jennifer', 'Lawrence', 'was', 'treated', 'way', 'that', 'now', 'would', 'call', 'abusive']
['Review', 'Red', 'Sparrow', 'spy', 'thriller', 'with', 'identity', 'crisis']
['Kevin', 'Smith', 'explains', 'events', 'heart', 'attack', 'was', 'never', 'really', 'pain']
['India', 'Pays', 'Final', 'Respects', 'Bollywood', 'Superstar', 'Sridevi']
['PlayStation', 'Plus', 'dropping', 'PS3', 'and', 'Vita', 'games', '2019', 'update']
['PlayStation', 'Plus', 'getting', 'rid', 'free', 'PS3', 'and', 'Vita', 'games', 'March', '2019']
['South', 'Africa', 'keeps', 'Australia', 'check', '1st', 'day', 'series']
['Landmark', 'Cosmic', 'Observation', 'Provides', 'Tantalizing', 'Hints', 'Dark', 'Matter']
['Watch', 'Huawei', 'Mate', 'Pro', 'Drive', 'Porsche', 'Panamera']
['What', 'going', 'with', 'Rey', 'Mysterio', 'and', 'WWE']
['Three', 'Space', 'Station', 'Astronauts', 'Return', 'Earth']
['need', 'national', 'conversation', 'MeToo', 'but', 'not', 'directed', 'Hollywood', 'Ted', 'Diadiun']
['Evan', 'Rachel'
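
The tokenizing and stop-word filtering step can be sketched without NLTK. This is a minimal sketch, assuming that RegexpTokenizer in its default mode behaves like re.findall with the same pattern. SAMPLE_STOP_WORDS is a tiny hypothetical stand-in for NLTK's English list, and stemming is left out.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (all lowercase).
SAMPLE_STOP_WORDS = {"what", "with", "and", "was", "that", "now"}

def tokenize(title):
    # Same pattern as the notebook: runs of 3 to 20 word characters.
    return re.findall(r"\w{3,20}", title)

def filter_case_sensitive(tokens):
    # Mirrors the notebook: membership is tested before lowercasing.
    return [t.lower() for t in tokens if t not in SAMPLE_STOP_WORDS]

def filter_case_insensitive(tokens):
    # Alternative: lowercase first, so capitalized stop words are dropped too.
    return [t.lower() for t in tokens if t.lower() not in SAMPLE_STOP_WORDS]

tokens = tokenize("PlayStation Plus dropping PS3 and Vita games in 2019 (update)")
print(tokens)

sample = ["What", "going", "with", "Rey", "Mysterio", "and", "WWE"]
print(filter_case_sensitive(sample))    # 'What' survives as 'what'
print(filter_case_insensitive(sample))  # 'What' is removed
```

The case-sensitive variant explains why a stem such as 'what' still appears among the processed documents even though it is a stop word.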

In [10]:
documents

[(['jennif', 'lawrenc', 'treat', 'way', 'would', 'call', 'abus'],
  'Jennifer Lawrence'),
 (['review', 'red', 'sparrow', 'spi', 'thriller', 'ident', 'crisi'],
  'Jennifer Lawrence'),
 (['kevin',
   'smith',
   'explain',
   'event',
   'heart',
   'attack',
   'never',
   'realli',
   'pain'],
  'Celebrity'),
 (['india', 'pay', 'final', 'respect', 'bollywood', 'superstar', 'sridevi'],
  'Sridevi'),
 (['playstat', 'plus', 'drop', 'ps3', 'vita', 'game', '2019', 'updat'],
  'Video Game Console'),
 (['playstat',
   'plus',
   'get',
   'rid',
   'free',
   'ps3',
   'vita',
   'game',
   'march',
   '2019'],
  'Video Game Console'),
 (['south', 'africa', 'keep', 'australia', 'check', '1st', 'day', 'seri'],
  'Cricket'),
 (['landmark',
   'cosmic',
   'observ',
   'provid',
   'tantal',
   'hint',
   'dark',
   'matter'],
  'Astronomy'),
 (['watch', 'huawei', 'mate', 'pro', 'drive', 'porsch', 'panamera'],
  'Smart Phone'),
 (['what', 'go', 'rey', 'mysterio', 'wwe'], 'WWE'),
 (['three', 'spa

#### List of all unique words from the titles (the "Bag of Words" vocabulary)

In [11]:
words = list(set(words))

In [12]:
words

['blast',
 'peta',
 'nomin',
 'roof',
 'board',
 'chanc',
 'took',
 'await',
 'ralph',
 'thriller',
 'giant',
 'against',
 'creas',
 'next',
 'world',
 'griev',
 'prison',
 'weight',
 'reveal',
 'streisand',
 'their',
 'detect',
 'ahead',
 'icc',
 'fantasi',
 'standout',
 'sudden',
 'swarm',
 'hollywood',
 'introduc',
 'how',
 'her',
 'slack',
 'etsi',
 'turan',
 'possibl',
 'odd',
 'datamin',
 'strong',
 'explan',
 'privat',
 'nukem',
 'stori',
 'knock',
 'superstar',
 'switch',
 'rachel',
 'atom',
 'snowboard',
 'rip',
 'pleas',
 'speci',
 'technic',
 'patch',
 'investig',
 'process',
 'everyth',
 'media',
 'time',
 'woman',
 'pokémon',
 'richard',
 'theme',
 'leagu',
 'goal',
 'here',
 '2022',
 'trump',
 'from',
 'musk',
 'luyendyk',
 'now',
 'valu',
 'hack',
 'appeal',
 'skit',
 '500k',
 'shake',
 'hawkin',
 'kushner',
 'veteran',
 'notch',
 'invad',
 'mani',
 'thurman',
 'diminish',
 'panamera',
 'justin',
 'launch',
 'winter',
 'pitch',
 'celebr',
 'wreck',
 'pick',
 'disneyland'

In [13]:
with open("jgi/words.json", 'w') as outfile:
    json.dump(words, outfile)
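
Writing to jgi/words.json fails if the jgi directory does not exist yet. The following is a minimal sketch of a save-and-reload round trip. The helpers save_json and load_json are hypothetical names, and a temporary directory stands in for the notebook's working directory.

```python
import json
import os
import tempfile

def save_json(obj, path):
    # Create the parent directory first; open(..., 'w') does not create it.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as outfile:
        json.dump(obj, outfile)

def load_json(path):
    with open(path) as infile:
        return json.load(infile)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "jgi", "words.json")
    sample_words = ["blast", "peta", "nomin"]
    save_json(sample_words, path)
    restored = load_json(path)
    print(restored == sample_words)
```

Reloading the list preserves word order, which is what ties each position of a bag-of-words vector to a specific word.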

In [14]:
print("There are {} words in corpus".format(len(words)))
# for wrd in words:
#     print(wrd,end='\t')

There are 1421 words in corpus
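
list(set(words)) gives no guaranteed order, and for strings the order can differ between Python sessions, so word indices stay stable only if the saved word list is reused. One alternative is to sort the vocabulary, sketched below with a few stems from the titles as sample data.

```python
sample_tokens = ["playstat", "plus", "drop", "ps3", "vita", "game",
                 "playstat", "plus", "get", "rid", "free", "ps3"]

# Deduplicate, then sort so the index of each word is reproducible.
vocabulary = sorted(set(sample_tokens))
word_index = {word: i for i, word in enumerate(vocabulary)}

print(vocabulary)
print(word_index["game"])
```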


### List of all labels saved in the topics variable

In [15]:
topics = list(set(topics))
for tpc in topics:
    print(tpc,end='\t')

Sridevi	Moon	Mobile World Congress	Premier League	Astronomy	Smart Watch	Google Search	Donald Trump	Evolution	Airplane	Cricket	Pakistan	Milky Way	Politics	Mars	Indian Super League	Will Smith	Jennifer Lawrence	Video Game	Stan Lee	NFL	International Space Station	Google	Black Holes	WWE	SpaceX	Quantum Computing	Mobile Game	Android	Hollywood	Voice Assistant	Space	Video Game Console	Music Streaming	Egypt	Dark Matter	Formula One	Smart Camera	Oscar	Car	Twitch	Cloud Computing	iPhone	Computer Components	Celebrity	Saturn	Elon Musk	Social Media	Electric Car	Smart Phone	Hockey	

In [16]:
topics[33]

'Music Streaming'

In [17]:
with open("jgi/topics.json", 'w') as outfile:
    json.dump(topics, outfile)

# Creating our Training Data

### Initializing variables

In [18]:
# create our training data
training = []
output = []
# create an empty array for our output
output_empty = [0] * len(topics)

#### Each title is tokenized, cleaned and represented as 0s and 1s according to the "Bag of Words" created earlier

Tokenization is the process of breaking a text into units called tokens. Tokens may be words, numbers or punctuation marks. Tokenization works by locating word boundaries, that is, the points where one word ends and the next begins. It is also known as word segmentation.

Each training example is appended to the "training" variable.

In [19]:
# training set, bag of words for each sentence
for doc in documents:
    # list of stemmed words for the title
    pattern_words = doc[0]
    # the words were already stemmed and filtered earlier; this second pass
    # also drops stems that are themselves stop words (e.g. 'now', 'from')
    pattern_words = [stemmer.stem(word.lower()) for word in pattern_words]
    pattern_words = [w for w in pattern_words if w not in stop_words]
    # create our bag of words array: 1 if the vocabulary word occurs in the title, else 0
    bag = [1 if w in pattern_words else 0 for w in words]

    # output is a '0' for each topic and '1' for the current topic
    output_row = list(output_empty)
    output_row[topics.index(doc[1])] = 1
    training.append([bag, output_row])
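
The encoding used in this loop can be shown on a small scale. This is a minimal sketch with a hypothetical five-word vocabulary and three labels. bag_of_words and one_hot are illustrative helper names, not part of the notebook.

```python
def bag_of_words(tokens, vocabulary):
    # 1 where the vocabulary word occurs in the title, else 0.
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def one_hot(label, labels):
    # Vector of zeros with a single 1 at the label's index.
    row = [0] * len(labels)
    row[labels.index(label)] = 1
    return row

vocabulary = ["2019", "game", "moon", "playstat", "saturn"]
labels = ["Moon", "Saturn", "Video Game Console"]

print(bag_of_words(["playstat", "plus", "game", "2019"], vocabulary))
print(one_hot("Video Game Console", labels))
```

Words that are not in the vocabulary ("plus" in this toy case) simply leave no trace in the vector.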

In [20]:
training[-1]

[[0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,


In [21]:
topics[28]

'Android'

Shuffling the training data can improve results; in this run the random.shuffle call is commented out, so the original order is kept.

In [22]:
# with open("jgi/training.json", 'w') as outfile:
#     json.dump(training, outfile)

In [23]:
# shuffling is left disabled here, so the original order is kept
# random.shuffle(training)
# each row pairs two lists of different lengths, so dtype=object is needed
training = np.array(training, dtype=object)

# train_xi contains the bag of words and train_yi contains the label / category
train_xi = list(training[:,0])
train_yi = list(training[:,1])
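
Each entry of training pairs a long bag-of-words vector with a shorter label vector, so the rows are ragged, and NumPy 1.24 and later refuse to build an array from such data unless dtype=object is given. The sketch below, on hypothetical toy data, splits the pairs with plain zip and avoids the intermediate array altogether.

```python
# Toy stand-in for the notebook's `training` list: [bag, output_row] pairs.
training = [
    [[1, 0, 0, 1], [1, 0]],
    [[0, 1, 0, 0], [0, 1]],
    [[0, 0, 1, 1], [1, 0]],
]

# zip(*pairs) transposes the list of pairs into two parallel tuples.
bags, rows = zip(*training)
train_x = list(bags)
train_y = list(rows)

print(train_x)
print(train_y)
```

Both results are regular lists of equal-length rows, so each can be converted with np.array on its own if an array is needed.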


In [24]:
train_xi[-1]

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,


In [25]:
train_yi[-1]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

# Training

Training uses TfLearn, a high-level library built on top of TensorFlow.

### Building The Neural network with SoftMax activation function

In [26]:
# reset underlying graph data
tf.reset_default_graph()
# Build neural network
net = tflearn.input_data(shape=[None, len(train_xi[0])],dtype=tf.float32)
net = tflearn.fully_connected(net, 16)
net = tflearn.fully_connected(net, 16)
net = tflearn.fully_connected(net, 16)
# net = tflearn.fully_connected(net, 16)
net = tflearn.fully_connected(net, len(train_yi[0]), activation='softmax')
net = tflearn.regression(net,dtype=tf.float32)
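
The softmax activation on the last layer turns raw scores into probabilities that sum to 1, one per topic. This is a minimal pure-Python sketch of the formula with made-up scores; it does not reproduce TfLearn's internal implementation.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability; the result is unchanged.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest score always receives the largest probability, which is why taking the argmax of the output picks the predicted topic.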

### Define the Deep Neural Network (DNN) model and train it

In [27]:
# Define model and setup tensorboard
model = tflearn.DNN(net, tensorboard_dir='tflearn_logs')

session = tf.get_default_session()
tf.train.write_graph(tf.get_default_graph(), 'jgi', 'model_graph.pbtxt')

# Start training (apply gradient descent algorithm)
model.fit(train_xi, train_yi, n_epoch=1000, batch_size=16, show_metric=True)

model.save('jgi/trained_model.ckpt')
# model.load("jgi/trained_model.ckpt")

Training Step: 32999  | total loss: 0.03544 | time: 0.148s
| Adam | epoch: 1000 | loss: 0.03544 - acc: 0.9680 -- iter: 512/515
Training Step: 33000  | total loss: 0.03947 | time: 0.152s
| Adam | epoch: 1000 | loss: 0.03947 - acc: 0.9712 -- iter: 515/515
--
INFO:tensorflow:C:\Users\kpajm\tfdeeplearning\jgi\trained_model.ckpt is not in all_model_checkpoint_paths. Manually adding it.
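
The final step count in the log can be checked by hand. This small sketch assumes one training step per mini-batch, with the last batch of each epoch holding the leftover samples.

```python
import math

n_samples = 515    # sentences in the training data
batch_size = 16
n_epoch = 1000

# 32 full batches cover 512 samples; one more batch holds the remaining 3.
steps_per_epoch = math.ceil(n_samples / batch_size)
total_steps = steps_per_epoch * n_epoch

print(steps_per_epoch, total_steps)
```

This agrees with the log, which ends at training step 33000 and reports "iter: 512/515" just before the last batch.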


# Testing

In [35]:
# a function that takes a sentence and returns its bag-of-words vector
# in a form that can be fed to TensorFlow
def get_tf_record(sentence):
    # tokenize the sentence (raw string avoids an invalid-escape warning)
    tokenizer = RegexpTokenizer(r'\w{3,20}')
    sentence_words = tokenizer.tokenize(sentence)
    # drop stop words, then lowercase and stem each remaining word
    sentence_words = [stemmer.stem(wr.lower()) for wr in sentence_words if wr not in stop_words]
    print(sentence_words)
    # bag of words: 1 at the index of every vocabulary word found in the sentence
    bow = [0]*len(words)
    for i, w in enumerate(words):
        if w in sentence_words:
            bow[i] = 1
    return np.array(bow)

### Argmax returns the label with the highest probability

Training accuracy reaches about 0.97, but the training set is small (515 titles spread over 51 topics), so accuracy on unseen titles is noticeably lower.

Some of the predictions are wrong.
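
Accuracy on held-out titles can be measured rather than judged by eye. This is a minimal sketch using hypothetical lists of predicted and true topics. In the notebook they would come from model.predict on X_test and from y_test.

```python
def accuracy(predicted, expected):
    # Fraction of positions where the predicted topic equals the true topic.
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(1 for p, e in zip(predicted, expected) if p == e)
    return hits / len(expected)

# Hypothetical predictions and true labels, for illustration only.
predicted = ["Moon", "Social Media", "Oscar", "Music Streaming"]
expected = ["Moon", "Video Game", "Celebrity", "Music Streaming"]
print(accuracy(predicted, expected))
```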

In [36]:
print(topics[np.argmax(model.predict([get_tf_record("Extraterrestrial Life On Saturn's Icy Moon May Flourish")]))])

['extraterrestri', 'life', 'saturn', 'ici', 'moon', 'may', 'flourish']
Saturn


In [37]:
model.predict([get_tf_record("Extraterrestrial Life On Saturn's Icy Moon May Flourish")])[0]

['extraterrestri', 'life', 'saturn', 'ici', 'moon', 'may', 'flourish']


array([  0.00000000e+00,   1.62302634e-08,   0.00000000e+00,
         0.00000000e+00,   4.32176848e-19,   6.45992850e-06,
         0.00000000e+00,   0.00000000e+00,   2.52939391e-16,
         0.00000000e+00,   1.85868856e-07,   9.06043631e-28,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         5.08533674e-27,   9.43381394e-19,   2.87689171e-31,
         0.00000000e+00,   1.48626699e-32,   8.86239667e-16,
         3.42395268e-28,   1.57506747e-07,   3.20079006e-13,
         0.00000000e+00,   2.91819930e-07,   2.17889894e-19,
         4.91007432e-19,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   1.10975283e-11,   2.75199449e-17,
         0.00000000e+00,   1.58979582e-15,   2.40899786e-27,
         2.36916424e-11,   5.42426279e-14,   5.70676251e-14,
         3.67316135e-17,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   2.70149714e-31,   0.00000000e+00,
         9.99990225e-01,   6.69386153e-20,   0.00000000e+00,
         9.98767174e-22,

In [38]:
model.predict([get_tf_record("Extraterrestrial Life On Saturn's Icy Moon May Flourish")])[0][27]

['extraterrestri', 'life', 'saturn', 'ici', 'moon', 'may', 'flourish']


4.9100743e-19

In [39]:
np.argmax(model.predict([get_tf_record("Extraterrestrial Life On Saturn's Icy Moon May Flourish")]))

['extraterrestri', 'life', 'saturn', 'ici', 'moon', 'may', 'flourish']


45
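
The index returned by argmax is looked up in topics to get the label. The sketch below uses a short hypothetical label list and probability vector. It also shows how the top few candidates could be listed instead of only the best one; top_k is an illustrative helper, not part of the notebook.

```python
def top_k(probabilities, labels, k=3):
    # Sort label indices by probability, highest first, and keep k of them.
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    return [(labels[i], probabilities[i]) for i in order[:k]]

# Hypothetical labels and probabilities, for illustration only.
labels = ["Moon", "Saturn", "SpaceX", "Astronomy"]
probabilities = [0.05, 0.80, 0.10, 0.05]

# Pure-Python argmax: index of the largest probability.
best = max(range(len(probabilities)), key=lambda i: probabilities[i])
print(labels[best])
print(top_k(probabilities, labels, k=2))
```

Listing the runners-up makes it easier to see when a wrong prediction was a near miss.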

In [40]:
topics[25]

'SpaceX'

In [41]:
# predict the topic for each title in the test set

for x_split in X_test:
    print(x_split[0])
    print(topics[np.argmax(model.predict([get_tf_record(x_split[0])]))])
#     print(model.predict([get_tf_record(x_split[0])]))
    print("="*85)

4G Coverage Bound for the Moon in 2019
['coverag', 'bound', 'moon', '2019']
Moon
Good Morning, Here's A Duke Nukem 3D-Style Shooter
['good', 'morn', 'here', 'duke', 'nukem', 'style', 'shooter']
Social Media
Colleague of Ryan Seacrest's former stylist backs up claim the star repeatedly sexually harassed her
['colleagu', 'ryan', 'seacrest', 'former', 'stylist', 'back', 'claim', 'star', 'repeat', 'sexual', 'harass']
Oscar
Free Fortnite: Battle Royale Items Available Now From Twitch Prime
['free', 'fortnit', 'battl', 'royal', 'item', 'avail', 'now', 'from', 'twitch', 'prime']
Video Game
Spotify has filed to go public
['spotifi', 'file', 'public']
Music Streaming
Google Clips review: a smart camera that doesn't make the grade
['googl', 'clip', 'review', 'smart', 'camera', 'make', 'grade']
Astronomy
LG G7 Shows Up at MWC With the Most Beautiful iPhone X Notch Ever
['show', 'mwc', 'with', 'most', 'beauti', 'iphon', 'notch', 'ever']
Mobile World Congress
Spotify has spent $10 billion on music 

['mission', 'moon', 'youtub']
SpaceX
3D Realms Returns With New “Old-School” Shooter, Ion Maiden, Built On 90's Tech
['realm', 'return', 'with', 'new', 'old', 'school', 'shooter', 'ion', 'maiden', 'built', 'tech']
Video Game
Disney, built on franchises, says not everything needs to be a franchise
['disney', 'built', 'franchis', 'say', 'everyth', 'need', 'franchis']
Hollywood
Rethinking 10 past Oscar best pictures — and what should have won
['rethink', 'past', 'oscar', 'best', 'pictur']
Oscar
Three Space Station Crew Members Return Home To Earth Closing 168-Day Mission
['three', 'space', 'station', 'crew', 'member', 'return', 'home', 'earth', 'close', '168', 'day', 'mission']
International Space Station
PlayStation Plus won't include free PS3 and Vita games next year
['playstat', 'plus', 'includ', 'free', 'ps3', 'vita', 'game', 'next', 'year']
Video Game Console
West Indies and rivals scramble for 2019 lifeline
['west', 'indi', 'rival', 'scrambl', '2019', 'lifelin']
Space
Even with doub