# Topic Detection using Machine Learning

Importing Python Libraries

Numpy http://www.numpy.org/

NLTK https://www.nltk.org/

Pandas https://pandas.pydata.org/

TfLearn http://tflearn.org/

RegexpTokenizer from NLTK http://www.nltk.org/api/nltk.tokenize.html

SnowballStemmer from NLTK http://www.nltk.org/howto/stem.html

Stopwords from NLTK https://pythonspot.com/nltk-stop-words/

Training and Testing Data split from Sklearn http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
import numpy as np
import nltk
import os
import json
import datetime
import time
import random
import tensorflow as tf
import pandas as pd
import tflearn
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
stemmer = SnowballStemmer("english")

# Data Pre-Processing

Import the training_data.csv file with pandas.

In [29]:
#Reading the training data using Pandas 
dataset = pd.read_csv('training_data.csv')

#Printing out First Few Data 
dataset.head()

Unnamed: 0,Title,Country,Class,Topic
0,"Harper Lee's will, unsealed after a lawsuit, r...",USA,Entertainment,Celebrity
1,'Strikingly Opaque' Harper Lee Will Unsealed,USA,Entertainment,Celebrity
2,"The Life, Death and Career of Harper Lee",USA,Entertainment,Celebrity
3,Harper Lee's will reveals lawyer holds control...,USA,Entertainment,Celebrity
4,"Marvel's Stan Lee, 95, is dealing with 'a litt...",USA,Entertainment,Hollywood


In [10]:
#Extracting Features X (Title) and Y (Topic) 

title = dataset.iloc[:,0:1].values
topic = dataset.iloc[:,3:4].values


# Splitting The Dataset to X_train, X_test, y_train, y_test

X_train contains 80% of the Title from DataSet

y_train contains 80% of the Label

Separate it into a training (80%) and testing set(20%).

In [None]:
#Splitting The Dataset to X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(title,topic,test_size=0.2,random_state=101)

Now The data has been split for traing and testing

X_train and y_train is appened to a Json

In [30]:
training_data = []
topic_count = 0
for split_x in X_train:
    training_data.append({"title":split_x[0], "topic":y_train[topic_count][0]})
    topic_count+=1

# Cleaning our Training Data

In [12]:
words = []
topics = []
documents = []
ignore_words = ['?']
bin_words = [1]
print("{} sentences in training data".format(len(training_data)))

515 sentences in training data


Using Regex Tokenizor to remove all the unwanted characters from "Title"

aswell as Stopwords are removed

Each sentence is split into words 

In [13]:
tokenizer = RegexpTokenizer('\w{3,20}')
stop_words = set(stopwords.words('english'))
for pattern in training_data:
    w = tokenizer.tokenize(pattern['title'])
    w = [stemmer.stem(wr.lower()) for wr in w if wr not in stop_words]
    words.extend(w)
    documents.append((w, pattern['topic']))
    if pattern['topic'] not in topics:
        topics.append(pattern['topic'])

#### List of all words from the titles "Bag of words"

In [14]:
words = list(set(words))

In [15]:
print("There are {} words in corpus".format(len(words)))
# for wrd in words:
#     print(wrd,end='\t')

There are 1421 words in corpus


### Lists all Labels saved in topics variable

In [16]:
topics = list(set(topics))
for tpc in topics:
    print(tpc,end='\t')

Electric Car	Twitch	SpaceX	Quantum Computing	Video Game Console	Cricket	Cloud Computing	Oscar	Smart Camera	Egypt	Mars	Donald Trump	Milky Way	Stan Lee	Formula One	Computer Components	Car	Elon Musk	Astronomy	Sridevi	Jennifer Lawrence	Google	Smart Watch	International Space Station	Google Search	Mobile Game	Moon	Evolution	Politics	Celebrity	Voice Assistant	iPhone	Airplane	Space	Smart Phone	Hockey	Video Game	Dark Matter	NFL	Mobile World Congress	Pakistan	Android	Black Holes	Premier League	Saturn	Social Media	Indian Super League	Will Smith	WWE	Hollywood	Music Streaming	

# Creating our Training Data

### Initializing variables

In [17]:
# create our training data
training = []
output = []
# create an empty array for our output
output_empty = [0] * len(topics)

#### Each title is tokenized, cleaned and represented as 0 or 1 according to "Bag of Words" created ealier

Tokenization is the process of breaking up the given text into units called tokens. The tokens may be words or number or punctuation mark. Tokenization does this task by locating word boundaries. Ending point of a word and beginning of the next word is called word boundaries. Tokenization is also known as word segmentation

Each training set appened to "training" variable

In [18]:
# training set, bag of words for each sentence
for doc in documents:
    # initialize our bag of words
    bag = []
    # list of tokenized words for the pattern
    pattern_words = doc[0]
    # stem each word
    
    pattern_words = [stemmer.stem(word.lower()) for word in pattern_words]
    pattern_words = [w for w in pattern_words if not w in stop_words]
    # create our bag of words array
    for w in words:
        bag.append(1) if w in pattern_words else bag.append(0)
        
    # output is a '0' for each tag and '1' for current tag 
    output_row = list(output_empty)
    output_row[topics.index(doc[1])] = 1
    training.append([bag, output_row])

Trained data is shuffled to produce better results

In [19]:
# shuffle our features and turn into np.array
random.shuffle(training)
training = np.array(training)

# trainX contains the Bag of words and train_y contains the label/ category
train_xi = list(training[:,0])
train_yi = list(training[:,1])

# x_data = tf.Variable(train_xi)
# y_label = tf.Variable(train_yi)
# holder_x = tf.placeholder(tf.int32,(None,len(train_xi)))

# Training

Trainig using TfLearn , TfLearn is a High level Tensorflow library

### Building The Neural network with SoftMax activation function

In [20]:
# reset underlying graph data
tf.reset_default_graph()
# Build neural network
net = tflearn.input_data(shape=[None, len(train_xi[0])])
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, len(train_yi[0]), activation='softmax')
net = tflearn.regression(net)

### Define the Densely Connected Network

In [23]:
# Define model and setup tensorboard
model = tflearn.DNN(net, tensorboard_dir='tflearn_logs')
# Start training (apply gradient descent algorithm)

# model.fit(train_xi, train_yi, n_epoch=1000, batch_size=8, show_metric=True)
# model.save('model.tflearn')

model.load("model.tflearn")

INFO:tensorflow:Restoring parameters from C:\Users\kpajm\tfdeeplearning\model.tflearn


# Testing

In [24]:
# a method that takes in a sentence and list of all words
# and returns the data in a form the can be fed to tensorflow
def get_tf_record(sentence):
    global words
    # tokenize the pattern
    sentence_words = nltk.word_tokenize(sentence)
    # stem each word
    sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
    # bag of words
    bow = [0]*len(words)
    for s in sentence_words:
        for i,w in enumerate(words):
            if w == s:
                bow[i] = 1
    return(np.array(bow))

### Argmax returns the Labels with Higher Probability

Here the Accuracy of the model is pretty low due to the training dataset size

Some of the predictions are wrong

In [25]:
# we can start to predict the results for each of the 4 sentences
for x_split in X_test:
    print(x_split[0])
    print(topic[np.argmax(model.predict([get_tf_record(x_split[0])]))])
    print("="*85)

4G Coverage Bound for the Moon in 2019
['Social Media']
Good Morning, Here's A Duke Nukem 3D-Style Shooter
['Celebrity']
Colleague of Ryan Seacrest's former stylist backs up claim the star repeatedly sexually harassed her
['Social Media']
Free Fortnite: Battle Royale Items Available Now From Twitch Prime
['Celebrity']
Spotify has filed to go public
['Celebrity']
Google Clips review: a smart camera that doesn't make the grade
['Celebrity']
LG G7 Shows Up at MWC With the Most Beautiful iPhone X Notch Ever
['Celebrity']
Spotify has spent $10 billion on music royalties since its creation and it's a big part of why its bleeding money
['Celebrity']
Pikachu Talk App Lets You Chat With The Iconic Pokémon Through Amazon Alexa, Google Assistant
['WWE']
Rey Mysterio Vs John Cena? Fastlane Rumors on Road to Wrestlemania
['Smart Watch']
Vero: What it is, why it's suddenly big, and how to delete it
['Donald Trump']
Delete Vero? A movement has already started to ax the new social app
['Social Media']