<h1>Text Classification with Keras Deep Learning</h1>

<h3>Loading data from files to Python variables</h3>

In [1]:
from sklearn.datasets import load_files

In [2]:
twenty_train = load_files('/home/hadoop/scikit_learn_data/20news_home/20news-bydate-train',encoding='latin1')

In [3]:
twenty_train.data[1]

'From: gnelson@pion.rutgers.edu (Gregory Nelson)\nSubject: Thanks Apple: Free Ethernet on my C610!\nArticle-I.D.: pion.Apr.6.12.05.34.1993.11732\nOrganization: Rutgers Univ., New Brunswick, N.J.\nLines: 26\n\n\n\tWell, I just got my Centris 610 yesterday.  It took just over two \nweeks from placing the order.  The dealer (Rutgers computer store) \nappologized because Apple made a substitution on my order.  I ordered\nthe one without ethernet, but they substituted one _with_ ethernet.\nHe wanted to know if that would be "alright with me"!!!  They must\nbe backlogged on Centri w/out ethernet so they\'re just shipping them\nwith!  \n\n\tAnyway, I\'m very happy with the 610 with a few exceptions.  \nBeing nosy, I decided to open it up _before_ powering it on for the first\ntime.  The SCSI cable to the hard drive was only partially connected\n(must have come loose in shipping).  No big deal, but I would have been\npissed if I tried to boot it and it wouldn\'t come up!\n\tThe hard drive also

In [4]:
twenty_train.target[1]

4

In [5]:
twenty_train.target_names[twenty_train.target[1]]

'comp.sys.mac.hardware'
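
load_files expects one subfolder per category, and the integer targets index into the sorted folder names. The sketch below imitates that layout with the standard library only. The helper name scan_labeled_folders, the toy categories, and the file name "0001" are assumptions for illustration; unlike load_files, the sketch does not shuffle the samples.

```python
import os
import tempfile

def scan_labeled_folders(container_path):
    """Read a one-subfolder-per-class corpus into (texts, targets, target_names)."""
    # Class names are the subfolder names in sorted order
    target_names = sorted(
        name for name in os.listdir(container_path)
        if os.path.isdir(os.path.join(container_path, name))
    )
    texts, targets = [], []
    for label, name in enumerate(target_names):
        folder = os.path.join(container_path, name)
        for filename in sorted(os.listdir(folder)):
            with open(os.path.join(folder, filename), encoding="latin1") as fh:
                texts.append(fh.read())
            targets.append(label)  # integer label = position in target_names
    return texts, targets, target_names

# Build a tiny throwaway corpus (hypothetical contents) and scan it
with tempfile.TemporaryDirectory() as root:
    for category, body in [("sci.space", "orbit"), ("comp.graphics", "pixels")]:
        os.makedirs(os.path.join(root, category))
        with open(os.path.join(root, category, "0001"), "w", encoding="latin1") as fh:
            fh.write(body)
    demo_texts, demo_targets, demo_names = scan_labeled_folders(root)

print(demo_names)
print(demo_targets)
```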

<h3>Data Preprocessing</h3>

In [6]:
texts = twenty_train.data # Extract text

In [7]:
target = twenty_train.target # Extract target

In [8]:
# Load tools we need for preprocessing
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


In [9]:
vocab_size = 20000  # vocabulary size limit: 20,000

In [10]:
tokenizer = Tokenizer(num_words=vocab_size) # Setup tokenizer
tokenizer.fit_on_texts(texts)

In [11]:
x_trains = tokenizer.texts_to_matrix(texts, mode='tfidf')

In [12]:
x_trains[0]

array([0.        , 1.90025907, 1.26800369, ..., 0.        , 0.        ,
       0.        ])
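
Each row of x_trains has vocab_size columns, one per word index, and column 0 is never filled. The sketch below reproduces the weighting on a toy corpus, assuming tf = 1 + ln(count) and idf = ln(1 + N / (1 + df)). That formula is my understanding of the Keras Tokenizer's 'tfidf' mode and should be checked against the installed version. The function name tfidf_rows and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs, num_words):
    """docs: list of documents, each a list of 1-based integer word indices."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word index appears
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    rows = []
    for doc in docs:
        row = [0.0] * num_words
        for j, count in Counter(doc).items():
            if j >= num_words:
                continue  # words beyond the vocabulary limit are dropped
            tf = 1 + math.log(count)                        # assumed tf weighting
            idf = math.log(1 + n_docs / (1 + doc_freq[j]))  # assumed idf weighting
            row[j] = tf * idf
        rows.append(row)
    return rows

demo_rows = tfidf_rows([[1, 1, 2], [1, 3]], num_words=4)
print(demo_rows[0])
```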

In [13]:
tokenizer.index_word  # index -> word mapping; index 1 is the most frequent word

{1: 'the',
 2: 'to',
 3: 'of',
 4: 'a',
 5: "'ax",
 6: 'and',
 7: 'in',
 8: 'i',
 9: 'is',
 10: 'that',
 11: 'it',
 12: 'for',
 13: 'you',
 14: 'from',
 15: 'edu',
 16: 'on',
 17: 'this',
 18: 'be',
 19: 'are',
 20: 'not',
 21: 'have',
 22: 'with',
 23: 'as',
 24: '1',
 25: 'or',
 26: 'was',
 27: 'if',
 28: 'but',
 29: 'subject',
 30: 'they',
 31: 'com',
 32: 'lines',
 33: 'at',
 34: 'organization',
 35: 'by',
 36: '2',
 37: 'an',
 38: 'my',
 39: 'can',
 40: 'x',
 41: '3',
 42: 'what',
 43: '0',
 44: 'all',
 45: 'will',
 46: 'm',
 47: 'there',
 48: 'would',
 49: 'one',
 50: 'do',
 51: "'",
 52: 'about',
 53: 're',
 54: 'we',
 55: 'writes',
 56: 'so',
 57: 'he',
 58: 'your',
 59: 'no',
 60: 'has',
 61: 'article',
 62: 'any',
 63: 'me',
 64: 'some',
 65: 'who',
 66: 'out',
 67: 'which',
 68: '4',
 69: 'q',
 70: 'more',
 71: 'like',
 72: 'people',
 73: "don't",
 74: 'when',
 75: '5',
 76: 'just',
 77: 'university',
 78: 'posting',
 79: 'their',
 80: 'were',
 81: 'up',
 82: 'r',
 83: 'p',


In [15]:
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 134142 unique tokens.
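
word_index holds all 134142 tokens even though num_words is 20000: the limit is applied only when texts are converted, not when the index is built. A minimal sketch of that behaviour follows. It is simplified (no punctuation filtering) and assumes that only indices strictly below num_words are kept, which is how I understand the Keras Tokenizer. The names build_word_index and texts_to_ids are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Assign index 1 to the most frequent word, 2 to the next, and so on."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index, num_words):
    """Convert texts to index lists, silently dropping words outside the limit."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

demo_index = build_word_index(["the cat sat", "the cat", "the"])
demo_ids = texts_to_ids(["the cat sat"], demo_index, num_words=3)
print(demo_index)
print(demo_ids)
```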


In [16]:
#categorical data -> target column
t = target
print(t)

[ 9  4 11 ... 16 18  4]


In [17]:
from sklearn.preprocessing import LabelBinarizer

In [18]:
encoder = LabelBinarizer()
encoder.fit(t)
y_train = encoder.transform(t)

In [19]:
print(target[1])
print(y_train[1])

4
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
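
The one-hot row above has a single 1 at the position of the class index. A minimal pure-Python sketch of the same encoding, with a hypothetical helper name one_hot, looks like this. Note that LabelBinarizer returns a single column instead when there are only two classes, which this sketch does not imitate.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1 at the label position."""
    rows = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1
        rows.append(row)
    return rows

# First three targets printed earlier in the notebook: 9, 4, 11
demo_one_hot = one_hot([9, 4, 11], num_classes=20)
print(demo_one_hot[1])
```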


<h3>Keras Model</h3>

In [20]:
# Import the model and layer classes
from keras.models import Sequential
from keras.layers import Dense, Dropout


In [23]:
#First Trial

model1 = Sequential()
model1.add(Dense(512, activation='relu', input_shape=(vocab_size,)))
model1.add(Dense(512, activation='relu'))
model1.add(Dense(20, activation='softmax'))
model1.summary()

model1.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

#fitting model
history1 = model1.fit(x_trains, y_train,
                    batch_size=128,
                    epochs=10,
                    verbose=1,
                    validation_split=0.1)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 512)               10240512  
_________________________________________________________________
dense_5 (Dense)              (None, 512)               262656    
_________________________________________________________________
dense_6 (Dense)              (None, 20)                10260     
Total params: 10,513,428
Trainable params: 10,513,428
Non-trainable params: 0
_________________________________________________________________
Train on 10182 samples, validate on 1132 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
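
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The same arithmetic explains the 10182 / 1132 split, assuming Keras holds out the last 10% of the samples and truncates the training count to an integer. The helper name dense_params is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

vocab_size = 20000
layer_params = [
    dense_params(vocab_size, 512),  # first hidden layer
    dense_params(512, 512),         # second hidden layer
    dense_params(512, 20),          # softmax output over 20 classes
]
total_params = sum(layer_params)

n_samples = 10182 + 1132                 # totals reported in the training log
n_train = int(n_samples * (1 - 0.1))     # validation_split=0.1
n_val = n_samples - n_train

print(layer_params, total_params)
print(n_train, n_val)
```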


In [21]:
#Second Trial

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(vocab_size,)))
model.add(Dropout(0.3))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(20, activation='softmax'))
model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 512)               10240512  
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 20)                10260     
Total params: 10,513,428
Trainable params: 10,513,428
Non-trainable params: 0
_________________________________________________________________
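
The Dropout layers add no parameters because they only mask activations during training. A minimal sketch of inverted dropout, the variant Keras uses, is shown below: each activation is zeroed with probability rate, and the survivors are scaled by 1 / (1 - rate) so the expected value is unchanged. The function name dropout and the all-ones input are assumptions for illustration.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the run is repeatable
dropped = dropout([1.0] * 10000, rate=0.3, rng=rng)

# The mean stays close to the original mean of 1.0
print(sum(dropped) / len(dropped))
```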


In [22]:
#fitting model
history = model.fit(x_trains, y_train,
                    batch_size=128,
                    epochs=10,
                    verbose=1,
                    validation_split=0.1)

Train on 10182 samples, validate on 1132 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<h3>Evaluation</h3>

In [35]:
#test data preprocessing
twenty_test = load_files('/home/hadoop/scikit_learn_data/20news_home/20news-bydate-test',encoding='latin1')

In [37]:
texts_test = twenty_test.data # Extract text and target
target_test = twenty_test.target

In [45]:
# Reuse the tokenizer and label encoder fitted on the training data so the
# test features share the same word indices and IDF weights.
# (A tokenizer fitted on the test set would assign different indices.)
x_test = tokenizer.texts_to_matrix(texts_test, mode='tfidf')
y_test = encoder.transform(target_test)

In [47]:
#Do testing
score = model.evaluate(x_test, y_test, batch_size=128, verbose=1)
 
print('Test accuracy:', score[1])

Test accuracy: 0.8325809879753733
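
The accuracy metric is the fraction of samples whose highest-probability class equals the true class. A small sketch of that computation follows; the probabilities and labels are made-up toy values, not model output, and the helper name accuracy is hypothetical.

```python
def accuracy(probabilities, true_labels):
    """Fraction of rows whose argmax matches the true label."""
    hits = 0
    for probs, truth in zip(probabilities, true_labels):
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if predicted == truth:
            hits += 1
    return hits / len(true_labels)

# Toy values for illustration only
demo_probs = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
demo_truth = [1, 0, 2, 1]
demo_accuracy = accuracy(demo_probs, demo_truth)
print(demo_accuracy)
```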


In [58]:
# The model overfits: training accuracy is 0.9967, but test accuracy is only 0.8326
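
One common response to this kind of gap is early stopping: stop training once the validation loss has not improved for a few epochs. The sketch below shows only the stopping logic in plain Python. The validation losses are invented for illustration (the notebook's per-epoch metrics are not shown), and the function name early_stopping_epoch is hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch where training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Invented validation losses: improvement stalls after epoch 3
demo_losses = [0.90, 0.60, 0.55, 0.58, 0.61, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(demo_losses, patience=2)
print(stop_epoch, best_epoch)
```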

<h3>Predict the news class</h3>

In [90]:
import numpy as np  # needed here: np is used before the later import cell

labels = np.array(['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space',
 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast',
 'talk.politics.misc', 'talk.religion.misc'])

In [91]:
prediction = model.predict(x_test)

In [111]:
#check result
import numpy as np
predicted_label = np.argmax(prediction[0])
actual_label = target_test[0]
predicted_label_name = labels[predicted_label]

In [112]:
print("text content:", texts_test[0])
print("predicted_label:", predicted_label)
print("actual_label:", actual_label)
print("label_name:", predicted_label_name)

text content: From: stimpy@dev-null.phys.psu.edu (Gregory Nagy)
Subject: Re: ESPN UP YOURS .........
Organization: Penn State Laboratory for Elementary Steam Physics
Lines: 52
NNTP-Posting-Host: dev-null.phys.psu.edu

In article <C5u542.3CD@news.udel.edu> tmavor@earthview.cms.udel.edu writes:
>>
>>[Various justifiable rantings on ESPN coverage by several deleted]
>>
>
>The only way to change ESPN's thinking, if it is even possible, is to complain
>to them directly.  Anyone know there telephone # in Bristol, Ct?  

Heh... Try the rec.autos.sport FAQ. They are always calling ESPN to complain.
I'm sure you could find the number for ABC there too, as many west-coast 
viewers were compaining about how something as boring as hockey cut into
the Long Beach GP. =)

>
>I do find it hard to believe that ESPN doesn't think viewers will simply
>change the channel from a boring game....I know I did.  And then, when
>they didn't show the NYI-Wash overtime(s), I was livid!  If I wanted
>to watch base
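
Looking only at the argmax hides how confident the model is. A small sketch for listing the top few classes with their probabilities is given below. The helper name top_k, the three-class subset, and the probability values are assumptions for illustration; in the notebook, prediction[0] and labels would take their place.

```python
def top_k(probs, names, k=3):
    """Return the k most probable (class name, probability) pairs."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(names[i], probs[i]) for i in order]

# Hypothetical three-class example, not actual model output
demo_names = ['rec.autos', 'rec.sport.baseball', 'rec.sport.hockey']
demo_probs = [0.15, 0.05, 0.80]
demo_top = top_k(demo_probs, demo_names, k=2)
print(demo_top)
```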

<h3>Prediction Metric</h3>

In [133]:
from sklearn.metrics import confusion_matrix


# Index of the highest-probability class for every test sample
predict_test = np.argmax(prediction, axis=1)

In [137]:
print(confusion_matrix(target_test, predict_test))

[[257   1   0   1   0   1   0   0   1   0   0   1   0   4   4  19   0   0
    2  28]
 [  0 302   9  16   7  19   5   1   0   2   0   4  10   0   4   4   1   1
    0   4]
 [  0  24 276  39  11  13   6   0   0   1   1   2   1   1   5   2   0   2
    7   3]
 [  0  12  19 301  22   2  12   2   0   0   0   1  15   0   2   1   1   1
    1   0]
 [  1   4   5  20 316   3  16   1   0   0   0   0   7   4   2   1   3   0
    1   1]
 [  1  34  17   2   2 329   3   0   1   0   0   0   0   1   3   0   0   1
    1   0]
 [  0   1   1  11   8   1 348   5   1   0   1   0   9   2   0   0   1   0
    0   1]
 [  0   2   1   5   3   0   9 348  12   1   2   0   3   3   2   0   0   0
    3   2]
 [  0   1   0   0   1   2   4   4 381   0   0   0   0   2   0   0   0   0
    2   1]
 [  0   0   1   0   0   0   0   1   1 369  18   0   0   0   0   1   0   2
    4   0]
 [  0   1   0   0   1   1   0   0   0   8 384   0   0   0   0   0   1   2
    1   0]
 [  1   3   1   3   2   1   4   1   0   1   0 369   2   1   0   0
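
In the matrix above, rows are true classes and columns are predicted classes, so per-class recall is the diagonal entry divided by its row sum, and precision is the diagonal entry divided by its column sum. The sketch below applies this to a small made-up 3x3 matrix, then computes recall for alt.atheism from the first row printed above. The helper name per_class_recall_precision is hypothetical.

```python
def per_class_recall_precision(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    recall, precision = [], []
    for k in range(n):
        row_total = sum(cm[k])                       # all samples truly in class k
        col_total = sum(cm[r][k] for r in range(n))  # all samples predicted as k
        recall.append(cm[k][k] / row_total if row_total else 0.0)
        precision.append(cm[k][k] / col_total if col_total else 0.0)
    return recall, precision

# Made-up matrix for illustration
demo_cm = [[8, 2, 0],
           [1, 7, 2],
           [0, 1, 9]]
demo_recall, demo_precision = per_class_recall_precision(demo_cm)

# First row of the notebook's matrix (true class 0, alt.atheism)
first_row = [257, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 4, 4, 19, 0, 0, 2, 28]
atheism_recall = first_row[0] / sum(first_row)
print(demo_recall, demo_precision)
print(round(atheism_recall, 3))
```

In that first row, the largest off-diagonal counts are 28 (column 19, talk.religion.misc) and 19 (column 15, soc.religion.christian), so most alt.atheism errors land in the other religion-related groups.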

In [140]:
# So far, so good.