# [Simple Text Classification using Keras Deep Learning Python Library](https://www.opencodez.com/python/text-classification-using-keras.htm)

### Importing Required Packages

In [1]:
import pandas as pd
import numpy as np
import pickle
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout
from sklearn.preprocessing import LabelBinarizer
import sklearn.datasets as skds
from pathlib import Path

Using TensorFlow backend.


### Loading data from files to Python variables

In [2]:
# For reproducibility
np.random.seed(1237)

# Source file directory
path_train = "20news-bydate\\20news-bydate-train"

files_train = skds.load_files(path_train,load_content=False)

label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames

data_tags = ["filename","category","news"]
data_list = []

# Read each file and collect (filename, category, text) tuples
for f, idx in zip(labelled_files, label_index):
    data_list.append((f, label_names[idx], Path(f).read_text()))

# The training data is now a list of (filename, category, news text) records
data = pd.DataFrame.from_records(data_list, columns=data_tags)

In our case the data is not available as a CSV file. Instead, each document is a separate text file, and the name of the directory that contains it is its label (category). So we first walk the directory structure and build a data set that can be used to train our model.

We use the scikit-learn load_files function. It can return the raw file contents along with the labels and label indices, but here we pass load_content=False so that the contents are not loaded in one go. Instead, we iterate over the file names, read each file ourselves, and build a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

At the end of the code above, we have a data frame with three columns: filename, category, and the news text.

**Note**: Loading everything into memory works here because the data set is not large. To train on a very large data set, consider a BatchGenerator approach, in which the data is fed to the model in small batches.
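
The batch-fed approach mentioned in the note can be sketched with a plain Python generator. This is a minimal illustration rather than code from the notebook: `batch_generator` and its `vectorize` argument are hypothetical names, and in a real pipeline `vectorize` would be something like the tokenizer's `texts_to_matrix`.

```python
def batch_generator(texts, labels, batch_size, vectorize):
    """Yield (features, labels) batches so only one batch is vectorized at a time.

    `vectorize` is a hypothetical callable that turns a list of texts
    into a list of numeric feature vectors.
    """
    for start in range(0, len(texts), batch_size):
        end = start + batch_size
        yield vectorize(texts[start:end]), labels[start:end]


# Toy usage: "vectorize" each text as its word count (an assumption for the demo).
toy_texts = ["a b c", "d e", "f", "g h i j", "k"]
toy_labels = [0, 1, 0, 1, 0]
word_counts = lambda batch: [[len(t.split())] for t in batch]

batches = list(batch_generator(toy_texts, toy_labels, 2, word_counts))
```

This sketch makes a single pass over the data. Training loops that consume generators typically expect them to cycle over the data repeatedly, so a production version would likely wrap the loop in `while True`.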

### Split Data for Train and Test

In [3]:
# Let's take 80% of the data for training and the remaining 20% for testing.
train_size = int(len(data) * .8)

train_posts = data['news'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]

test_posts = data['news'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]

We keep 80% of our data for training and the remaining 20% for testing. The validation data is taken from the training portion later, through the validation_split argument of model.fit.
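
A positional split like this is only representative if the rows are in random order. scikit-learn's `load_files` shuffles by default, so that holds here; for data that arrives sorted by category, an explicit shuffle would be needed first. The helper below is a hypothetical sketch of that idea (the name `shuffled_split` is not from the notebook):

```python
import random


def shuffled_split(records, train_fraction=0.8, seed=1237):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = shuffled_split(range(10))
```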

### Tokenize and Prepare Vocabulary

In [4]:
# 20 news groups
num_labels = 20
vocab_size = 15000
batch_size = 100

# define Tokenizer with Vocab Size
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(train_posts)

x_train = tokenizer.texts_to_matrix(train_posts, mode='tfidf')
x_test = tokenizer.texts_to_matrix(test_posts, mode='tfidf')

When we classify texts, we first pre-process them using the [Bag Of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) method. Keras comes with a built-in [Tokenizer](https://keras.io/preprocessing/text/) that converts text into numeric vectors; the texts_to_matrix method above does exactly that. The tokenizer is fitted on the training posts only, and num_words limits the vocabulary to the most frequent words.
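
To make the bag-of-words idea concrete, here is a small pure-Python sketch that builds a TF-IDF matrix by hand. It uses the textbook weighting (term count times `log(N / document frequency)`) as an assumption for illustration; the Keras `Tokenizer` applies its own smoothed variant and also caps the vocabulary, so the numbers will not match its output. The function name `tfidf_matrix` is hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, matrix) with one TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for words in tokenized for word in words})
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    n_docs = len(docs)
    matrix = []
    for words in tokenized:
        counts = Counter(words)
        row = [
            counts[word] * math.log(n_docs / doc_freq[word]) if counts[word] else 0.0
            for word in vocab
        ]
        matrix.append(row)
    return vocab, matrix


vocab, matrix = tfidf_matrix(["the cat sat", "the dog sat down", "the cat ran"])
```

Each row has one column per vocabulary word, regardless of word order in the document. A word that appears in every document ("the" in this toy corpus) gets weight zero, while a word unique to one document gets the highest weight.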

### Pre-processing Output Labels / Classes

Having converted our text to numeric vectors, we also need to represent our labels in a numeric format that the neural network model accepts. Prediction amounts to assigning a probability to each label, so we convert each label to a [one hot vector](https://en.wikipedia.org/wiki/One-hot).

scikit-learn has a [LabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html) class which makes it easy to build these one-hot vectors.

In [5]:
encoder = LabelBinarizer()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
y_test = encoder.transform(test_tags)
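
For intuition, the following pure-Python sketch mimics what a one-hot encoding looks like for three or more classes. It is an illustration under that assumption, not scikit-learn's implementation, and `one_hot_encode` is a hypothetical name. As with `LabelBinarizer.classes_`, the classes are kept in sorted order, and the column index of the 1 identifies the class.

```python
def one_hot_encode(tags):
    """Return (classes, rows): sorted unique classes and one 0/1 row per tag."""
    classes = sorted(set(tags))
    index = {label: i for i, label in enumerate(classes)}
    rows = [
        [1 if index[tag] == i else 0 for i in range(len(classes))]
        for tag in tags
    ]
    return classes, rows


classes, rows = one_hot_encode(["sci.med", "alt.atheism", "sci.space", "sci.med"])
```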

### Build Keras Model and Fit

In [6]:
model = Sequential()
model.add(Dense(512, input_shape=(vocab_size,)))
model.add(Activation('relu'))
model.add(Dropout(0.3))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.3))
model.add(Dense(num_labels))
model.add(Activation('softmax'))
model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=30,
                    verbose=1,
                    validation_split=0.1)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 512)               7680512   
_________________________________________________________________
activation_1 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
activation_2 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 20)                10260     
__________
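
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The short sketch below reproduces the three Dense rows from the values used in this notebook; the helper name `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


vocab_size, hidden, num_labels = 15000, 512, 20
counts = [
    dense_params(vocab_size, hidden),  # dense_1
    dense_params(hidden, hidden),      # dense_2
    dense_params(hidden, num_labels),  # dense_3
]
total = sum(counts)
```

Almost all of the parameters sit in the first layer, because its input width equals the vocabulary size. Activation and Dropout layers add no trainable parameters, which is why they show 0 in the summary.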

### Evaluate model

In [7]:
score = model.evaluate(x_test, y_test,
                       batch_size=batch_size, verbose=1)

print('Test accuracy:', score[1])

text_labels = encoder.classes_

for i in range(10):
    prediction = model.predict(np.array([x_test[i]]))
    predicted_label = text_labels[np.argmax(prediction[0])]
    print(test_files_names.iloc[i])
    print('Actual label:' + test_tags.iloc[i])
    print("Predicted label: " + predicted_label)

Test accuracy: 0.8806893534487557
20news-bydate\20news-bydate-train\alt.atheism\53114
Actual label:alt.atheism
Predicted label: alt.atheism
20news-bydate\20news-bydate-train\comp.graphics\38666
Actual label:comp.graphics
Predicted label: comp.graphics
20news-bydate\20news-bydate-train\sci.med\58932
Actual label:sci.med
Predicted label: sci.med
20news-bydate\20news-bydate-train\sci.crypt\15212
Actual label:sci.crypt
Predicted label: sci.crypt
20news-bydate\20news-bydate-train\comp.os.ms-windows.misc\9695
Actual label:comp.os.ms-windows.misc
Predicted label: comp.os.ms-windows.misc
20news-bydate\20news-bydate-train\rec.sport.baseball\104482
Actual label:rec.sport.baseball
Predicted label: rec.sport.baseball
20news-bydate\20news-bydate-train\soc.religion.christian\20731
Actual label:soc.religion.christian
Predicted label: comp.graphics
20news-bydate\20news-bydate-train\comp.graphics\38583
Actual label:comp.graphics
Predicted label: comp.graphics
20news-bydate\20news-bydate-train\rec.sport
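
For a softmax classifier like this one, the accuracy metric is, in essence, the share of samples whose highest-probability class matches the true class. The sketch below shows that calculation on made-up probabilities; the function names and the toy data are assumptions for illustration, not output from the model above.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(probabilities, one_hot_labels):
    """Fraction of rows whose top-scoring class matches the one-hot label."""
    hits = sum(
        argmax(p) == argmax(y) for p, y in zip(probabilities, one_hot_labels)
    )
    return hits / len(probabilities)


# Toy predictions for four samples over three classes.
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
truth = [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]
acc = accuracy(probs, truth)
```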

### Prediction

In [8]:
# These are the labels we stored from our training
# The order is very important here.

labels = np.array(['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space',
 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast',
 'talk.politics.misc', 'talk.religion.misc'])

test_files = ["20news-bydate\\20news-bydate-test\\comp.graphics\\38758",
              "20news-bydate\\20news-bydate-test\\misc.forsale\\76115",
              "20news-bydate\\20news-bydate-test\\soc.religion.christian\\21329"
              ]
x_data = []
for t_f in test_files:
    t_f_data = Path(t_f).read_text()
    x_data.append(t_f_data)

x_data_series = pd.Series(x_data)
x_tokenized = tokenizer.texts_to_matrix(x_data_series, mode='tfidf')

for t_f, x_t in zip(test_files, x_tokenized):
    prediction = model.predict(np.array([x_t]))
    predicted_label = labels[np.argmax(prediction[0])]
    print("File ->", t_f, "Predicted label: " + predicted_label)

File -> 20news-bydate\20news-bydate-test\comp.graphics\38758 Predicted label: comp.graphics
File -> 20news-bydate\20news-bydate-test\misc.forsale\76115 Predicted label: comp.sys.ibm.pc.hardware
File -> 20news-bydate\20news-bydate-test\soc.religion.christian\21329 Predicted label: soc.religion.christian
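
Because the predicted index is only meaningful relative to the label order used in training, one option is to persist that order instead of retyping it. The sketch below round-trips a label list through `pickle` and decodes a made-up softmax output; the three-label list and the probabilities are hypothetical stand-ins for `encoder.classes_` and a real prediction.

```python
import pickle

# Hypothetical stand-in for encoder.classes_ captured at training time.
training_classes = ["alt.atheism", "comp.graphics", "sci.med"]

# Serialize the label order alongside the model (bytes here; a file in practice).
blob = pickle.dumps(training_classes)

# Later, at prediction time: restore the labels and decode a softmax output.
restored = pickle.loads(blob)
prediction = [0.05, 0.15, 0.80]
best = max(range(len(prediction)), key=prediction.__getitem__)
predicted_label = restored[best]
```

The same approach would apply to the fitted tokenizer, since new text must be vectorized with the vocabulary learned during training.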
