# Topic Detection using Machine Learning

Importing Python Libraries

Numpy http://www.numpy.org/

NLTK https://www.nltk.org/

Pandas https://pandas.pydata.org/

TfLearn http://tflearn.org/

RegexpTokenizer from NLTK http://www.nltk.org/api/nltk.tokenize.html

SnowballStemmer from NLTK http://www.nltk.org/howto/stem.html

Stopwords from NLTK https://pythonspot.com/nltk-stop-words/

Training and Testing Data split from Sklearn http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [1]:
import numpy as np
import nltk
import os
import json
import datetime
import time
import random
import tensorflow as tf
import pandas as pd
import tflearn
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
stemmer = SnowballStemmer("english")

hdf5 is not supported on this machine (please install/reinstall h5py for optimal experience)
curses is not supported on this machine (please install/reinstall curses for an optimal experience)


In [2]:
random.seed(101)

# Data Pre-Processing

Import the train.csv file with pandas.

In [3]:
# Read the training data with pandas
dataset = pd.read_csv('train.csv')

# Show the first few rows
dataset.head()

Unnamed: 0,Title,Category,Topic
0,"Ireland Votes To Overturn Abortion Ban, 'Culmi...",World,Ireland
1,Ireland overturns abortion ban in landslide vote,World,Ireland
2,Ireland votes to repeal abortion ban in landsl...,World,Ireland
3,Can Ireland Be Catholic Without the Church?,World,Ireland
4,The Latest: Irish PM Plans to Move Quickly on ...,World,Ireland


In [4]:
# Extract the feature X (Title) and the label y (Topic)

# Select by column name rather than by position; the double brackets
# keep each result two-dimensional, so every row is a one-element array
title = dataset[['Title']].values
topic = dataset[['Topic']].values


# Splitting The Dataset to X_train, X_test, y_train, y_test

X_train contains 75% of the titles from the dataset

y_train contains the matching 75% of the labels

The remaining 25% forms the test set (test_size=0.25).

In [5]:
#Splitting The Dataset to X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(title,topic,test_size=0.25,random_state=101)
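
As a quick check on the proportions, the sketch below computes how many rows land in each set for a fractional test_size. It assumes the test count is rounded up and the rest goes to training, which is how scikit-learn is understood to behave; split_sizes is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the remaining rows
    go to the training set.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# test_size=0.25 keeps 75% of the rows for training
print(split_sizes(100, 0.25))   # (75, 25)
print(split_sizes(1001, 0.25))  # (750, 251)
```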

Now the data has been split for training and testing.

X_train and y_train are combined into a list of JSON-like records (one dictionary per title).

In [6]:
# Pair every training title with its topic
training_data = [
    {"title": x[0], "topic": y[0]}
    for x, y in zip(X_train, y_train)
]

# Cleaning our Training Data

In [7]:
words = []
topics = []
documents = []
ignore_words = ['?']
bin_words = [1]
print("{} sentences in training data".format(len(training_data)))

1615 sentences in training data


In [8]:
# training_data

A regular-expression tokenizer splits each title into words and drops unwanted characters: the pattern "\w{3,20}" keeps only runs of 3 to 20 word characters.

Stop words are removed as well.

Each remaining word is lowercased and stemmed.

In [9]:
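# raw string so that \w reaches the regex engine unchanged
tokenizer = RegexpTokenizer(r'\w{3,20}')
stop_words = set(stopwords.words('english'))
for pattern in training_data:
    # split the title into tokens of 3 to 20 word characters
    w = tokenizer.tokenize(pattern['title'])
    # drop stop words, then lowercase and stem; the stop-word check runs
    # before lowercasing, so capitalised stop words such as 'The' are kept
    w = [stemmer.stem(wr.lower()) for wr in w if wr not in stop_words]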
    words.extend(w)
    documents.append((w, pattern['topic']))
    if pattern['topic'] not in topics:
        topics.append(pattern['topic'])
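
To make the effect of the pattern concrete, here is a minimal sketch that uses only the standard library. It assumes re.findall applies the pattern the same way the tokenizer does, uses a tiny hypothetical stop-word set instead of NLTK's English list, and leaves out stemming.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's full English list.
STOP_WORDS = {"is", "on", "a", "with", "the"}

def tokenize(title):
    # Same pattern as the tokenizer above: runs of 3 to 20 word characters.
    return re.findall(r"\w{3,20}", title)

def clean(title):
    # The stop-word check happens before lowercasing, as in the cell above.
    return [t.lower() for t in tokenize(title) if t not in STOP_WORDS]

print(tokenize("Is Italy's government on a collision course with the EU?"))
# ['Italy', 'government', 'collision', 'course', 'with', 'the']
print(clean("The Guardian view on the Brexit bill"))
# ['the', 'guardian', 'view', 'brexit', 'bill']
```

Two-letter words such as "Is" and "EU" never match the pattern, and the capitalised "The" survives the stop-word filter because the comparison is case-sensitive.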

#### List of all words from the titles "Bag of words"

In [10]:
words = list(set(words))

In [11]:
# with open("jgi/words.json", 'w') as outfile:
#     json.dump(words, outfile)
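
If the vocabulary is written to disk as in the commented-out cell, its order matters: every bag-of-words vector is indexed by position in words, and list(set(...)) does not guarantee the same order from one Python process to the next. A minimal sketch, assuming a sorted vocabulary is acceptable; build_vocabulary is a hypothetical helper.

```python
def build_vocabulary(tokens):
    # Sorting the de-duplicated tokens gives every word a stable index.
    return sorted(set(tokens))

tokens = ["iran", "deal", "nuclear", "iran", "deal"]
vocab = build_vocabulary(tokens)
index = {word: i for i, word in enumerate(vocab)}

print(vocab)          # ['deal', 'iran', 'nuclear']
print(index["iran"])  # 1
```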

In [12]:
print("There are {} words in corpus".format(len(words)))
# for wrd in words:
#     print(wrd,end='\t')

There are 2374 words in corpus


### List all labels saved in the topics variable

In [13]:
topics = list(set(topics))
for tpc in topics:
    print(tpc,end='\t')

Nuclear Tests	Train	Cyclone	Spanish	Priyanka Chopra	Isreal	Spain	Xi Jinping	Sushma Swaraj	Celebrity	Education	Australia	Greece	Terrorism	Oman	Donald Trump	Monsoon Rain	Malaysia	Russia	Hillary Clinton	Taiwan	Colombia	Theresa May	Kim Jong Un	Pakistan	Turtle Day	FIFA World Cup	Israel	North Korea	Vladmir Putin	Vladimir Putin	Bangladesh	China	War	Britain	Ireland	Mark Zuckerberg	LGBT	World Turtle Day	Refugees	Egypt	Volcano	Venezuela	NASA	Eid Al Fitr	Rape	Nepal	Air Force	Italy	India	Myanmar	Crime	Michelle Obama	Brexit	Iran	Yemen	Germany	Harvey Weinstein	Politics	Religion	UAE	US Visa	Saudi Arabia	Nigeria	Iraq	Artificial Intelligence	Qatar	Palestine	Sri Lanka	Cuba	Ivanka Trump	Narendra Modi	Monsoon	USA	

In [14]:
len(topics)

74

In [15]:
# with open("jgi/topics.json", 'w') as outfile:
#     json.dump(topics, outfile)

# Creating our Training Data

### Initializing variables

In [16]:
# create our training data
training = []
output = []
# create an empty array for our output
output_empty = [0] * len(topics)

#### Each title is tokenized, cleaned and represented as 0s and 1s according to the "Bag of Words" created earlier

Tokenization is the process of breaking text into units called tokens, which may be words, numbers or punctuation marks. It works by locating word boundaries, that is, the point where one word ends and the next begins. Tokenization is also known as word segmentation.

Each training example is appended to the "training" variable

In [17]:
# training set, bag of words for each sentence
for doc in documents:
    # tokens in documents are already lowercased, stemmed and stop-word
    # filtered, so they are used as-is and match the entries in words
    pattern_words = set(doc[0])
    # bag of words: 1 if the vocabulary word occurs in the title, else 0
    bag = [1 if w in pattern_words else 0 for w in words]
        
    # output is a '0' for each tag and '1' for current tag 
    output_row = list(output_empty)
    output_row[topics.index(doc[1])] = 1
    training.append([bag, output_row])
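
The encoding used in the loop can be seen on a toy example. This is a minimal sketch with a four-word vocabulary and three labels, all made up for illustration; bag_of_words and one_hot are hypothetical helpers.

```python
def bag_of_words(tokens, vocabulary):
    # 1 where the vocabulary word occurs in the title, 0 elsewhere.
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def one_hot(label, labels):
    # 1 at the position of the label, 0 elsewhere.
    row = [0] * len(labels)
    row[labels.index(label)] = 1
    return row

vocabulary = ["deal", "iran", "nuclear", "vote"]   # toy vocabulary
labels = ["Iran", "Ireland", "Italy"]              # toy label list

print(bag_of_words(["iran", "list", "demand", "nuclear", "deal"], vocabulary))
# [1, 1, 1, 0]
print(one_hot("Ireland", labels))
# [0, 1, 0]
```

Words that are not in the vocabulary ("list" and "demand" here) are simply ignored, which is also what happens to unseen words at prediction time.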

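The training pairs are split into a feature list and a label list; shuffling is available but left commented out here

In [18]:
# optionally shuffle the training pairs
# random.shuffle(training)

# train_xi contains the bag of words and train_yi contains the one-hot label
# (built with list comprehensions because bags and labels differ in length,
# which recent NumPy versions refuse to pack into a single array)
train_xi = [row[0] for row in training]
train_yi = [row[1] for row in training]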


# Training

Training uses TfLearn, a high-level library built on top of TensorFlow

### Building The Neural network with SoftMax activation function

In [19]:
# reset underlying graph data
tf.reset_default_graph()
# Build neural network
net = tflearn.input_data(shape=[None, len(train_xi[0])],dtype=tf.float32)
net = tflearn.fully_connected(net, 8,activation='relu')
net = tflearn.fully_connected(net, 8,activation='relu')
net = tflearn.fully_connected(net, len(train_yi[0]), activation='softmax')
net = tflearn.regression(net,dtype=tf.float32)
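
The softmax activation on the last layer turns the raw scores for the topics into probabilities that sum to 1. A minimal sketch in plain Python, with arbitrary example scores; it illustrates the function itself, not TfLearn's internal implementation.

```python
import math

def softmax(scores):
    # Subtract the maximum first so exp() cannot overflow.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])     # arbitrary example scores
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(round(sum(probs), 6))          # 1.0
```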

### Define the Deep Neural Network (DNN) model and train it

In [20]:
# Define model and setup tensorboard
model = tflearn.DNN(net, tensorboard_dir='tflearn_logs')

session = tf.get_default_session()
# tf.train.write_graph(tf.get_default_graph(), 'jgi', 'model.pbtxt')

# Start training (gradient-based optimization with Adam, tflearn's default optimizer)
model.fit(train_xi, train_yi, n_epoch=50, batch_size=8, show_metric=True)

# model.save('jgi/model.ckpt')
# model.load("jgi/demo_model.ckpt")

Training Step: 10099  | total loss: 0.61092 | time: 0.753s
| Adam | epoch: 050 | loss: 0.61092 - acc: 0.8142 -- iter: 1608/1615
Training Step: 10100  | total loss: 0.56896 | time: 0.756s
| Adam | epoch: 050 | loss: 0.56896 - acc: 0.8202 -- iter: 1615/1615
--
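
The step counter in the log can be checked by hand. The sketch below assumes one training step per batch, with a final partial batch at the end of each epoch; the three numbers come from the cells above.

```python
import math

n_samples = 1615   # training titles reported earlier
batch_size = 8
n_epoch = 50

steps_per_epoch = math.ceil(n_samples / batch_size)
print(steps_per_epoch)            # 202
print(steps_per_epoch * n_epoch)  # 10100
```

The total agrees with the last "Training Step" shown in the log.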


# Testing

In [21]:
# takes a sentence and returns its bag-of-words vector (built against the
# global words list) in a form that can be fed to the model
def get_tf_record(sentence):
    # tokenize the sentence with the same pattern used for training
    tokenizer = RegexpTokenizer(r'\w{3,20}')
    sentence_words = tokenizer.tokenize(sentence)
    # drop stop words, then lowercase and stem each word
    sentence_words = [stemmer.stem(wr.lower()) for wr in sentence_words if wr not in stop_words]
    print(sentence_words)
    # bag of words: 1 at the position of every vocabulary word in the sentence
    present = set(sentence_words)
    bow = [1 if w in present else 0 for w in words]
    return np.array(bow)

### Argmax returns the label with the highest probability

The accuracy of the model is limited by the small training set (1615 titles spread over 74 topics)

Some of the predictions are wrong
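
To put a number on how many predictions are wrong, the predicted topics can be compared with the true ones. A minimal sketch, assuming both are available as equal-length lists; accuracy is a hypothetical helper and the labels below are made up for illustration, not taken from the dataset.

```python
def accuracy(predicted, actual):
    # Fraction of positions where the predicted label equals the true label.
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels, for illustration only.
predicted = ["Iran", "India", "Italy", "Politics"]
actual = ["Iran", "Hillary Clinton", "Italy", "Spain"]
print(accuracy(predicted, actual))  # 0.5
```

With the notebook's objects, predicted could be built from the argmax topic for each title in X_test and actual taken from y_test; that wiring is not shown here.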

In [22]:
print(topics[np.argmax(model.predict([get_tf_record("US warships sail near disputed islands in South China Sea in a move likely to anger Beijing")]))])

['warship', 'sail', 'near', 'disput', 'island', 'south', 'china', 'sea', 'move', 'like', 'anger', 'beij']
USA
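
The lookup in the cell above picks the position of the largest probability and uses it as an index into topics. A minimal sketch of the same idea in plain Python, extended to show the runner-up labels as well; the labels and probabilities are toy values, and argmax and top_k are hypothetical helpers.

```python
def argmax(values):
    # Index of the largest value; the first one wins on ties.
    return max(range(len(values)), key=values.__getitem__)

def top_k(probs, labels, k=3):
    # The k most probable labels with their probabilities, highest first.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(labels[i], probs[i]) for i in order[:k]]

labels = ["China", "USA", "Taiwan"]   # toy label list
probs = [0.30, 0.55, 0.15]            # toy probabilities

print(labels[argmax(probs)])    # USA
print(top_k(probs, labels, 2))  # [('USA', 0.55), ('China', 0.3)]
```

Looking at the top few labels rather than only the first can help when two topics are close, as a headline about both China and the USA might be.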


In [23]:
model.predict([get_tf_record("Extraterrestrial Life On Saturn's Icy Moon May Flourish")])[0]

['extraterrestri', 'life', 'saturn', 'ici', 'moon', 'may', 'flourish']


array([  5.94636193e-03,   8.56299046e-03,   2.66240204e-06,
         5.15639503e-03,   6.38007239e-14,   4.16797957e-05,
         8.74681922e-04,   4.32086475e-02,   2.44178344e-03,
         4.79879227e-11,   4.14645541e-07,   1.13350356e-12,
         2.94380989e-05,   7.97950197e-04,   4.29709324e-10,
         6.29946817e-10,   1.81450574e-07,   4.20901272e-03,
         1.05753288e-01,   5.36084099e-06,   4.22972931e-07,
         3.33605943e-09,   6.08945149e-04,   6.68953266e-19,
         1.45975710e-03,   2.23739653e-06,   9.04916879e-03,
         2.18329351e-05,   4.42393566e-10,   1.04957621e-03,
         6.19847402e-02,   2.34375917e-03,   1.17670781e-04,
         8.59066546e-02,   6.32161973e-04,   7.65076262e-13,
         1.38316967e-08,   3.46679641e-09,   1.11032227e-06,
         2.31049256e-07,   8.24319502e-10,   7.98883702e-05,
         1.58021506e-02,   2.02302326e-04,   6.50374773e-07,
         7.36335992e-11,   2.04810585e-05,   2.75488049e-10,
         7.87378196e-03,

In [24]:
model.predict([get_tf_record("US warships sail near disputed islands in South China Sea in a move likely to anger Beijing")])[0][27]

['warship', 'sail', 'near', 'disput', 'island', 'south', 'china', 'sea', 'move', 'like', 'anger', 'beij']


0.0

In [25]:
np.argmax(model.predict([get_tf_record("Extraterrestrial Life On Saturn's Icy Moon May Flourish")]))

['extraterrestri', 'life', 'saturn', 'ici', 'moon', 'may', 'flourish']


58

In [26]:
# predict the topic for every title in the test set

for x_split in X_test:
    print(x_split[0])
    print(topics[np.argmax(model.predict([get_tf_record(x_split[0])]))])
#     print(model.predict([get_tf_record(x_split[0])]))
    print("="*85)

Meghan Markle gets Duchess lessons from Queen Elizabeth's aide
['meghan', 'markl', 'get', 'duchess', 'lesson', 'queen', 'elizabeth', 'aid']
Celebrity
Pakistan supports SCO's anti-terrorism efforts: Tehmina Janjua
['pakistan', 'support', 'sco', 'anti', 'terror', 'effort', 'tehmina', 'janjua']
Pakistan
Russia forces among dozens dead in IS group east Syria attacks
['russia', 'forc', 'among', 'dozen', 'dead', 'group', 'east', 'syria', 'attack']
Russia
Iran lists demands for staying in nuclear deal
['iran', 'list', 'demand', 'stay', 'nuclear', 'deal']
Iran
At Harvard, Clinton warns of threats to American democracy
['harvard', 'clinton', 'warn', 'threat', 'american', 'democraci']
India
Is Italy's government on a collision course with the EU?
['itali', 'govern', 'collis', 'cours']
Italy
Spanish opposition demands Rajoy confidence vote
['spanish', 'opposit', 'demand', 'rajoy', 'confid', 'vote']
Politics
Savita Halappanavar's father hails 'justice for daughter', thanks Irish voters for histori

Kim Jong Un
NRA Met with Close Putin Allies During 2016 Election
['nra', 'met', 'close', 'putin', 'alli', 'dure', '2016', 'elect']
Xi Jinping
UK judge orders Operation Blue Star related files to be made public
['judg', 'order', 'oper', 'blue', 'star', 'relat', 'file', 'made', 'public']
Britain
Back-Slapping Donald Trump Summit Legitimises Kim Jong Un, Say Critics
['back', 'slap', 'donald', 'trump', 'summit', 'legitimis', 'kim', 'jong', 'say', 'critic']
Donald Trump
Israel to build 2 500 more settler homes in occupied West Bank
['israel', 'build', '500', 'settler', 'home', 'occupi', 'west', 'bank']
Israel
United Kingdom destroyed files on Tamil Tigers, India-Sri Lanka relations: The Guardian
['unit', 'kingdom', 'destroy', 'file', 'tamil', 'tiger', 'india', 'sri', 'lanka', 'relat', 'the', 'guardian']
Politics
Priyanka Chopra urges world to step up support for Rohingya women, children
['priyanka', 'chopra', 'urg', 'world', 'step', 'support', 'rohingya', 'women', 'children']
Bangladesh
She

['donald', 'trump', 'can', 'block', 'critic', 'twitter', 'judg', 'rule']
Donald Trump
Europe looks for deeds, not words, from Italy's populists
['europ', 'look', 'deed', 'word', 'itali', 'populist']
Politics
PM Abbasi bats for CPEC & Nawaz Sharif has an unexpected guest
['abbasi', 'bat', 'cpec', 'nawaz', 'sharif', 'unexpect', 'guest']
Pakistan
Last few years a golden chapter in Indo-Bangla ties: Modi
['last', 'year', 'golden', 'chapter', 'indo', 'bangla', 'tie', 'modi']
Narendra Modi
The Guardian view on the Brexit bill debates: crash bang wallop
['the', 'guardian', 'view', 'brexit', 'bill', 'debat', 'crash', 'bang', 'wallop']
Britain
Sheikh Hasina expresses gratitude to India for 'standing beside Bangladesh in times of crisis'
['sheikh', 'hasina', 'express', 'gratitud', 'india', 'stand', 'besid', 'bangladesh', 'time', 'crisi']
Narendra Modi
With no progressive force to give it shape, Italians' anger has hit a wall
['with', 'progress', 'forc', 'give', 'shape', 'italian', 'anger', 'hit'