# Text Classification

The data I originally wrote this script for is sensitive, so this notebook uses a public dataset from Kaggle instead:
https://www.kaggle.com/guiyihan/text-classification-20


## Import Packages and Set Up Data Prep

In [1]:
import sys
#sys.stdout = open('output.txt', 'w')
import nltk
nltk.download('punkt')
from nltk.stem.lancaster import LancasterStemmer
import numpy as np
import pandas as pd
import tflearn
import tensorflow as tf
from tensorflow.python.framework import ops
import random
import json
import unicodedata

# translation table that maps every Unicode punctuation code point to None
tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                    if unicodedata.category(chr(i)).startswith('P'))

# remove punctuation from a sentence (str.translate deletes characters mapped to None)
def remove_punctuation(text):
    return text.translate(tbl)

# initialize the stemmer
stemmer = LancasterStemmer()

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/chasebrown/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Instructions for updating:
non-resource variables are not supported in the long term
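
As a quick check of the helper above, the following minimal sketch rebuilds the same translation table and applies it to a made-up sentence. The sample string is an assumption for illustration and is not taken from the dataset.

```python
import sys
import unicodedata

# Map every Unicode code point whose category starts with 'P' (punctuation) to None,
# which tells str.translate to delete that character.
tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                    if unicodedata.category(chr(i)).startswith('P'))

def remove_punctuation(text):
    return text.translate(tbl)

# Hypothetical sample sentence, for illustration only.
sample = "Hello, world! Isn't this (fairly) simple?"
cleaned = remove_punctuation(sample)
print(cleaned)
```

Characters such as `$` or `+` belong to Unicode symbol categories rather than punctuation, so they pass through unchanged.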


## Import Dataset

In [2]:
#Read File
df = pd.read_csv("data/textClassification.csv")
print(df.head(5))

   label                                               text
0      0  Archive name atheism resources Alt atheism arc...
1      0  Archive name atheism introduction Alt atheism ...
2      0  In article 65974 mimsy.umd.edu mangoe cs.umd.e...
3      0  dmn kepler.unh.edu ...until kings become philo...
4      0  In article N4HY.93Apr5120934 harder.ccr p.ida....


In [3]:
# Extract categories
categories = sorted(df['label'].unique())
# keep only the first three categories for this notebook
categories = categories[:3]
print(categories)

[0, 1, 2]


## Preprocessing

### Clean Data

In [4]:
words = []
# a list of tuples with words in the sentence and category name
docs = []

for category in categories:
    sentences = df[df['label'] == category]['text'].unique()
    for sentence in sentences:
        # remove any punctuation from the sentence
        sentence = remove_punctuation(sentence)
        # keep only the first fifth of each document (the whole dataset is not used)
        sentence = sentence[:len(sentence) // 5]
        # extract words from each sentence and append to the word list
        w = nltk.word_tokenize(sentence)
        words.extend(w)
        docs.append((w, category))
print(len(words)," ",len(docs))
print(words[:10])

196626   2983
['Archive', 'name', 'atheism', 'resources', 'Alt', 'atheism', 'archive', 'name', 'resources', 'Last']


In [5]:
# stem and lowercase each word, then remove duplicates
words = [stemmer.stem(w.lower()) for w in words]
words = sorted(set(words))
print(words[:10])

['\x19', '0', '00', '0000', '000005102000', '0010580b0b6r49', '0010580bvma7o9', '0010580bvmcbrt', '0028', '00451']


### Create Training and Test Data

In [6]:
data = []
output = []

# create an empty array for our output
output_empty = [0] * len(categories)
for doc in docs:
    # list of tokenized words for the pattern
    token_words = doc[0]
    # stem each word; a set keeps the membership test below fast
    token_words = {stemmer.stem(word.lower()) for word in token_words}
    # bag of words (bow): 1 if the vocabulary word occurs in the document, else 0
    bow = [1 if w in token_words else 0 for w in words]

    output_row = list(output_empty)
    output_row[categories.index(doc[1])] = 1

    # each example pairs the bag of words with the output row that tells
    # which category that bow belongs to.
    data.append([bow, output_row])

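# shuffle the examples, hold out the first fifth for testing, and convert to NumPy arrays
random.shuffle(data)
training = data[int(len(data)/5):]
testing = data[:int(len(data)/5)]
# each row pairs a long bag of words with a short label vector, so store rows as objects;
# recent NumPy versions reject such ragged input without dtype=object
training = np.array(training, dtype=object)
testing = np.array(testing, dtype=object)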

print(len(training), ' ', len(testing))

2387   596
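
To make the encoding concrete, here is a small sketch of the same bag-of-words and one-hot steps on a toy vocabulary. The vocabulary, the sample document, and the `encode` helper are hypothetical, and stemming is skipped for brevity.

```python
def encode(tokens, vocabulary, label, categories):
    """Return (bag_of_words, one_hot_label) for one tokenized document."""
    present = set(tokens)
    # 1 if the vocabulary word occurs in the document, else 0
    bow = [1 if w in present else 0 for w in vocabulary]
    # one-hot vector with a 1 at the position of the document's category
    one_hot = [0] * len(categories)
    one_hot[categories.index(label)] = 1
    return bow, one_hot

# Hypothetical toy vocabulary and document, for illustration only.
vocabulary = sorted({"atheism", "archive", "name", "space", "orbit"})
categories = [0, 1, 2]
bow, one_hot = encode(["archive", "name", "atheism", "archive"], vocabulary, 0, categories)
print(vocabulary)
print(bow, one_hot)
```

The vector records only presence, so the repeated word `archive` still contributes a single 1.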




In [7]:
# train_x contains the bags of words and train_y contains the one-hot category labels
train_x = list(training[:, 0])
train_y = list(training[:, 1])
print(len(train_x))
print(len(train_y))

2387
2387


## Set Up Neural Network

In [8]:
ops.reset_default_graph()
net = tflearn.input_data(shape=[None, len(train_x[0])])
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, len(train_y[0]), activation='softmax')
net = tflearn.regression(net)

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


## Define Model

In [9]:
model = tflearn.DNN(net, tensorboard_dir='tflearn_logs', tensorboard_verbose=3)

## Train Model

In [10]:
model.fit(train_x, train_y, n_epoch=20, batch_size=8, show_metric=True)
#model.save('model.tflearn')

Training Step: 5979  | total loss: 0.27093 | time: 2.849s
| Adam | epoch: 020 | loss: 0.27093 - acc: 0.9754 -- iter: 2384/2387
Training Step: 5980  | total loss: 0.24396 | time: 2.858s
| Adam | epoch: 020 | loss: 0.24396 - acc: 0.9778 -- iter: 2387/2387
--


## Test Model

In [11]:
# test_x contains the bags of words and test_y contains the one-hot category labels
test_x = list(testing[:, 0])
test_y = list(testing[:, 1])
print(len(test_x))
print(len(test_y))

596
596


In [12]:
results = model.predict(test_x)

In [13]:
# confusion[predicted, actual]: rows are predicted classes, columns are true classes
num_classes = len(categories)
confusion = np.zeros((num_classes, num_classes), dtype=int)
for index in range(len(test_y)):
    # the true class is the position of the 1 in the one-hot label
    answer = int(np.argmax(test_y[index]))
    confusion[np.argmax(results[index]), answer] += 1

print(confusion)
# correct predictions lie on the diagonal
numCorrect = np.trace(confusion)
print(numCorrect, " out of ", len(test_y), "correct.")
print(int(numCorrect/len(test_y)*100), "% accurate.")

[[164   2   1]
 [ 29 168  41]
 [  0  28 163]]
495  out of  596 correct.
83 % accurate.
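
Overall accuracy hides how the errors are distributed across classes. The sketch below derives per-class precision and recall from the confusion matrix printed above, assuming the same orientation as the notebook code (rows are predicted classes, columns are true classes). It uses plain lists so it runs without the trained model.

```python
# Confusion matrix copied from the output above: rows = predicted class, columns = true class.
confusion = [
    [164, 2, 1],
    [29, 168, 41],
    [0, 28, 163],
]

n = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n))
accuracy = correct / total

# Precision: of everything predicted as class i (row i), the share that was right.
precision = [confusion[i][i] / sum(confusion[i]) for i in range(n)]
# Recall: of everything truly in class i (column i), the share that was found.
recall = [confusion[i][i] / sum(confusion[r][i] for r in range(n)) for i in range(n)]

print(f"accuracy: {accuracy:.3f}")
for i in range(n):
    print(f"class {i}: precision={precision[i]:.3f} recall={recall[i]:.3f}")
```

Read this way, class 1 is predicted too often: 29 class-0 documents and 41 class-2 documents were labeled as class 1, which pulls its precision well below that of the other two classes.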


## Comments

When I wrote this code, the project it was for had very obvious categories, so it reached a higher accuracy there than it does on this more complex dataset. Note also that I didn't use the whole dataset: only the first three categories and the first fifth of each document's text.

If I were to go back and improve this, I would add a validation dataset and checkpoints. For this notebook, I simply watched training until the accuracy looked high and stopped there.
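
One way to set this up, sketched under assumptions: split the shuffled examples three ways and keep the epoch with the lowest validation loss instead of stopping by eye. The helper names `three_way_split` and `best_epoch`, the split fractions, and the patience value are all hypothetical, and the loss values are made-up numbers for illustration, not results from this model.

```python
import random

def three_way_split(data, val_frac=0.1, test_frac=0.2, seed=0):
    """Shuffle a copy of data and return (train, validation, test) lists."""
    rows = list(data)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

def best_epoch(val_losses, patience=2):
    """Return the index of the best epoch, stopping once the validation loss
    has failed to improve for `patience` consecutive epochs."""
    best, best_idx, waited = float("inf"), -1, 0
    for idx, loss in enumerate(val_losses):
        if loss < best:
            best, best_idx, waited = loss, idx, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))

# Made-up validation losses, for illustration only.
print(best_epoch([0.9, 0.6, 0.5, 0.55, 0.58, 0.4]))
```

The checkpoint saved at the returned epoch would be the one to keep. In TFLearn itself, `model.fit` accepts a `validation_set` argument and `tflearn.DNN` accepts `checkpoint_path` and `best_checkpoint_path`, which may cover the same ground without custom code; treat that pairing as a starting point to verify against the installed version.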