# Document Classification
This notebook presents an end-to-end process for document classification using a deep feedforward neural network. The labels to predict include "agenda", "medical record", "paper" and "resume". 

## Import Libraries
Import all the libraries necessary for this project.

In [None]:
from keras.layers import Dropout, Dense
from keras.models import Sequential
from keras.utils import np_utils
from keras.models import load_model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd
import re
from sklearn import metrics
import pickle
import tika
import glob
from tika import parser as tika_parser

## Dataset
The raw dataset consists of four types of documents: agendas, medical records, academic papers and resumes. Note that the data is imbalanced: the dataset contains only 100 agenda examples, whereas each of the other document types has 500 examples.

Retrieve all the file paths for each class.

In [56]:
root_path = "C:\\Users\\Alex.Chokwijitkul\\Desktop\\Document Classification\\data\\"

agenda_paths = glob.glob(root_path + "Agendas/*")
medicalrecord_paths = glob.glob(root_path + "MedicalRecords/*")
paper_paths = glob.glob(root_path + "Papers/*")
resume_paths = glob.glob(root_path + "Resumes/*")

## Data Preprocessing
Add functions for reading raw data from files, cleaning it and transforming it into a Pandas dataframe.

In [59]:
def preprocess_text(text):
    processed = re.sub('[^a-zA-Z]', ' ', text)
    processed = re.sub(r"\s+[a-zA-Z]\s+", ' ', processed)
    processed = re.sub(r'\s+', ' ', processed)

    return processed

def process_raw_data(paths, label):
    data = {
        'Content': [],
        'Type': [label] * len(paths)
    }
    
    for path in paths:
        print('Processing {}'.format(path))
        parsed = tika_parser.from_file(path)
        # Tika returns None as the content when no text can be extracted
        text = preprocess_text(parsed["content"] or "")
        data['Content'].append(text)
    
    return pd.DataFrame(data, columns = ['Content', 'Type'])
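
To make the effect of `preprocess_text` concrete, the following minimal sketch repeats the same three substitutions on a made-up sentence (the sample string is an assumption, not taken from the dataset). Digits and punctuation become spaces, isolated single letters are dropped, and runs of whitespace collapse. Case is preserved, and a trailing space can remain.

```python
import re

def preprocess_text(text):
    # Replace every non-letter character (digits, punctuation) with a space
    processed = re.sub('[^a-zA-Z]', ' ', text)
    # Drop single letters that are surrounded by whitespace
    processed = re.sub(r"\s+[a-zA-Z]\s+", ' ', processed)
    # Collapse runs of whitespace into one space
    processed = re.sub(r'\s+', ' ', processed)
    return processed

# Hypothetical input, for illustration only
sample = "Page 3 of 10: Patient X was seen."
cleaned = preprocess_text(sample)
print(repr(cleaned))  # 'Page of Patient was seen '
```

If the trailing space matters for a downstream step, calling `.strip()` on the result would remove it; the tf-idf vectoriser used later tokenises on word boundaries, so it should not be affected.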

Process all the files and concatenate the results into a single dataframe.

In [None]:
agenda_df = process_raw_data(agenda_paths, 'agenda')

In [None]:
medicalrecord_df = process_raw_data(medicalrecord_paths, 'medicalrecord')

In [None]:
paper_df = process_raw_data(paper_paths, 'paper')

In [None]:
resume_df = process_raw_data(resume_paths, 'resume')

In [9]:
df = pd.concat([agenda_df, medicalrecord_df, paper_df, resume_df], axis=0)

## Feature Engineering
Term frequency-inverse document frequency (tf-idf) weights each term in the feature vector by how often it occurs in a document, discounted by how many documents contain it.

Use the `TfidfVectorizer` to fit and transform the training data, then apply the fitted vectoriser to the test data. Also save the vectoriser object as a pickle file, as it will be needed later.

In [17]:
def tfidf(X_train, X_test, num_words=5000):

    vectorizer_x = TfidfVectorizer(max_features=num_words)
    X_train = vectorizer_x.fit_transform(X_train).toarray()
    X_test = vectorizer_x.transform(X_test).toarray()

    pickle.dump(vectorizer_x, open('vectoriser.pkl','wb'))
    print("tf-idf with", str(np.array(X_train).shape[1]), "features")

    return (X_train,X_test)
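
As a rough illustration of what the vectoriser computes, the sketch below implements tf-idf by hand using the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalisation that `TfidfVectorizer` applies by default. The function name `tfidf_vectors`, the whitespace tokenisation and the two toy documents are assumptions made for this example; the real vectoriser uses its own token pattern and the `max_features` cut-off.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Simplified tokenisation: lowercase and split on whitespace (an assumption)
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({term for doc in tokenised for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in tokenised for term in set(doc))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    vectors = []
    for doc in tokenised:
        tf = Counter(doc)
        raw = [tf[term] * idf[term] for term in vocab]
        # L2-normalise each document vector; guard against empty documents
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors

# Toy documents, for illustration only
vocab, vectors = tfidf_vectors(["agenda meeting agenda", "patient record"])
print(vocab)
print(vectors[0])
```

In the first toy document, "agenda" occurs twice and "meeting" once, so "agenda" receives twice the weight before normalisation.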

## Model
Add a function for creating and compiling a deep neural network. The network stacks dense layers of 512 units with rectified linear unit (ReLU) activations: one layer that receives the input features, followed by the four layers added in the loop, which gives five hidden layers in total. The output layer uses the softmax activation function. Dropout regularisation is added after each hidden layer. The model uses the Adam optimiser and the categorical cross-entropy loss function when training.

In [18]:
def build_DNN_model(shape, num_classes, dropout=0.2):

    model = Sequential()
    node = 512 # number of units in each hidden layer
    num_layers = 4 # number of hidden layers added after the first one
    model.add(Dense(node,input_dim=shape,activation='relu'))
    model.add(Dropout(dropout))

    for i in range(0, num_layers):
        model.add(Dense(node,input_dim=node,activation='relu'))
        model.add(Dropout(dropout))

    model.add(Dense(num_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model

## Training

Retrieve a series of raw text examples and their labels.

In [19]:
X = df['Content'].values
y = df['Type'].values

Use the `LabelEncoder` to transform each text label into an integer, then convert the integers into one-hot vectors with `to_categorical`. Save the encoder as a pickle file for later use.

In [20]:
encoder = LabelEncoder()
encoder.fit(y)
encoded_y = encoder.transform(y)
dummy_y = np_utils.to_categorical(encoded_y)

with open('encoder.pkl', 'wb') as f:
    pickle.dump(encoder, f)
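
The two-step encoding can be sketched in plain Python. `LabelEncoder` assigns integer codes to the classes in sorted order, and the one-hot step turns each code into a vector with a single 1. The helper name `encode_labels` is hypothetical and only mirrors the behaviour for illustration.

```python
def encode_labels(labels):
    # Classes are ordered alphabetically, as LabelEncoder does
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    encoded = [index[label] for label in labels]
    # One-hot: a 1.0 at the position of the class code, 0.0 elsewhere
    one_hot = [[1.0 if i == code else 0.0 for i in range(len(classes))]
               for code in encoded]
    return classes, encoded, one_hot

classes, encoded, one_hot = encode_labels(
    ["resume", "agenda", "paper", "medicalrecord", "paper"])
print(classes)     # ['agenda', 'medicalrecord', 'paper', 'resume']
print(encoded)     # [3, 0, 2, 1, 2]
print(one_hot[2])  # [0.0, 0.0, 1.0, 0.0]
```

With the four labels used in this notebook, "paper" therefore maps to index 2, which is the position to look for in the model's output vector.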

Split the data into training and test sets.

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, dummy_y, test_size=0.20, random_state=7)

Transform each example to a tf-idf vector. 

In [22]:
X_train_tfidf, X_test_tfidf = tfidf(X_train, X_test)

tf-idf with 5000 features


Create a DNN model that takes 5,000 features as its input and outputs the probability vector of the four classes.

In [23]:
model = build_DNN_model(X_train_tfidf.shape[1], 4)

Print the model summary.

In [24]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 512)               2560512   
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 512)               262656    
_________________________________________________________________
dropout_3 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 512)              
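
The parameter counts reported in the summary can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The sketch below applies this to the layer sizes built by `build_DNN_model`, assuming 5,000 input features and four classes; the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    # Weight matrix plus one bias per unit
    return n_in * n_out + n_out

def dnn_param_counts(n_features=5000, node=512, num_layers=4, num_classes=4):
    # One input-facing hidden layer, num_layers further hidden layers, then the output layer
    sizes = [n_features] + [node] * (1 + num_layers) + [num_classes]
    return [dense_params(a, b) for a, b in zip(sizes, sizes[1:])]

counts = dnn_param_counts()
print(counts)       # [2560512, 262656, 262656, 262656, 262656, 2052]
print(sum(counts))  # 3613188
```

The first two values match the `dense_1` and `dense_2` rows of the summary; dropout layers contribute no parameters.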

Train the model for 5 epochs using 20 percent of the training data as the validation set and a batch size of 128.

In [25]:
history = model.fit(X_train_tfidf, y_train, validation_split=0.2, epochs=5, batch_size=128, verbose=1)

Train on 1024 samples, validate on 256 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
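
The sample counts in the training log follow from the dataset size and the two split ratios. The sketch below reproduces the arithmetic, assuming 100 + 3 × 500 = 1,600 documents in total and rounding that approximates what `train_test_split` and `validation_split` do; `split_sizes` is a hypothetical helper, not part of either library.

```python
import math

def split_sizes(n_samples, test_size=0.20, validation_split=0.2):
    # Hold out the test set first (rounded up)
    n_test = math.ceil(n_samples * test_size)
    n_fit = n_samples - n_test
    # Keras takes the validation samples from the end of the training data
    n_train = int(n_fit * (1.0 - validation_split))
    n_val = n_fit - n_train
    return n_train, n_val, n_test

print(split_sizes(100 + 3 * 500))  # (1024, 256, 320)
```

So the model is fitted on 1,024 documents, validated on 256 and finally tested on 320.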


## Evaluation

Evaluate the trained model against the test dataset. After only 5 epochs, the model reaches a test accuracy of 99.69 percent, which suggests that it generalises well on this split.

In [26]:
score = model.evaluate(X_test_tfidf, y_test, verbose=1)

print("Test Score:", score[0])
print("Test Accuracy:", score[1])

Test Score: 0.006668820339473314
Test Accuracy: 0.996874988079071
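
Because the agenda class has far fewer examples than the others, a single accuracy figure can mask errors on that class. As a sketch only, per-class recall can be computed from the true and predicted labels. The label lists below are invented for illustration and are not results from this notebook, and `per_class_recall` is a hypothetical helper.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    # Number of examples per true class
    totals = Counter(y_true)
    # Number of correctly predicted examples per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Invented labels, for illustration only
y_true = ["agenda", "agenda", "paper", "paper", "paper", "resume"]
y_pred = ["agenda", "paper", "paper", "paper", "paper", "resume"]
recall = per_class_recall(y_true, y_pred)
print(recall)  # {'agenda': 0.5, 'paper': 1.0, 'resume': 1.0}
```

In this toy case the overall accuracy is 5 out of 6, yet only half of the agendas are recognised.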


Save the trained model as an HDF5 file. The in-memory model is then deleted so that loading from disk can be tested later.

In [27]:
model.save('model.h5')  # creates an HDF5 file 'model.h5'
del model  # deletes the existing model

For a better understanding of model performance over the whole dataset instead of just a single train/test split, cross validation should be performed to train and test the model over multiple folds of the dataset.

In [28]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

def tfidf_all(X, num_words=5000):

    vectorizer_x = TfidfVectorizer(max_features=num_words)
    X = vectorizer_x.fit_transform(X).toarray()

    print("tf-idf with", str(np.array(X).shape[1]), "features")

    return X

def DNN_model():

    model = Sequential()
    node = 512 # number of units in each hidden layer
    nLayers = 4 # number of hidden layers added after the first one
    model.add(Dense(node,input_dim=5000,activation='relu'))
    model.add(Dropout(0.2))

    for i in range(0,nLayers):
        model.add(Dense(node,input_dim=node,activation='relu'))
        model.add(Dropout(0.2))

    model.add(Dense(4, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model

In [30]:
estimator = KerasClassifier(build_fn=DNN_model, epochs=5, batch_size=128, verbose=1)
kfold = KFold(n_splits=5, shuffle=True)
results = cross_val_score(estimator, tfidf_all(X), dummy_y, cv=kfold)

print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

tf-idf with 5000 features
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Accuracy: 99.62% (0.23%)
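
For intuition, the way a shuffled 5-fold split partitions the data can be sketched without any library: every document lands in exactly one test fold and in the training set of the other four. The function `kfold_indices` and its `seed` parameter are assumptions for this example and do not reproduce the exact shuffling of `KFold`.

```python
import random

def kfold_indices(n_samples, n_splits=5, seed=0):
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % n_splits) folds get one extra sample
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(kfold_indices(1600))
print([len(test) for _, test in folds])    # [320, 320, 320, 320, 320]
print([len(train) for train, _ in folds])  # [1280, 1280, 1280, 1280, 1280]
```

Note that `tfidf_all` fits the vectoriser on all documents before the folds are created, so the idf weights are computed with the held-out folds included. Fitting the vectoriser inside each fold would give a stricter estimate.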


## Load Saved Model

Load the saved trained model along with the input vectoriser and label encoder objects.

In [31]:
model = load_model('model.h5')
with open('vectoriser.pkl', 'rb') as f:
    vectoriser = pickle.load(f)
with open('encoder.pkl', 'rb') as f:
    encoder = pickle.load(f)

The preprocessing must remain the same as the one used during training.

In [32]:
def preprocess_text(text):
    processed = re.sub('[^a-zA-Z]', ' ', text)
    processed = re.sub(r"\s+[a-zA-Z]\s+", ' ', processed)
    processed = re.sub(r'\s+', ' ', processed)

    return processed

def process_input_data(paths):
    data = []
    
    for path in paths:
        print('Processing {}'.format(path))
        parsed = tika_parser.from_file(path)
        # Tika returns None as the content when no text can be extracted
        text = preprocess_text(parsed["content"] or "")
        data.append(text)
    
    return data

Select a document from one of the files and then preprocess the data using the already defined function.

In [45]:
papers = process_input_data([paper_paths[10]])

Processing C:\Users\Alex.Chokwijitkul\Desktop\Document Classification\data\Papers\1812.02993.pdf


Transform the data to a feature vector using the loaded vectoriser.

In [46]:
vector = vectoriser.transform([papers[0]]).toarray()
vector

array([[0., 0., 0., ..., 0., 0., 0.]])

Use the loaded model to predict the input vector.

In [47]:
prediction = model.predict(vector)
prediction

array([[8.7061787e-13, 1.8738636e-22, 1.0000000e+00, 1.0185268e-28]],
      dtype=float32)

Round each probability to the nearest integer to obtain a one-hot vector.

In [48]:
prediction = np.round(prediction[0])
prediction

array([0., 0., 1., 0.], dtype=float32)

Translate the result back to a string label. The output is "paper", which is the correct label.

In [49]:
prediction = encoder.inverse_transform(np.where(prediction == 1)[0])
prediction[0]

'paper'
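
Rounding works here because one probability is close to 1. When no class exceeds 0.5, the rounded vector contains no 1 and the lookup returns nothing. A sketch of an alternative is to take the index of the largest probability instead. The helper `decode_prediction` is hypothetical, and the class list assumes the alphabetical order produced by the label encoder.

```python
def decode_prediction(probabilities, classes):
    # Index of the highest probability (argmax)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best], probabilities[best]

classes = ["agenda", "medicalrecord", "paper", "resume"]

# Probabilities taken from the prediction shown above
label, confidence = decode_prediction(
    [8.7061787e-13, 1.8738636e-22, 1.0, 1.0185268e-28], classes)
print(label, confidence)  # paper 1.0

# An invented, less confident prediction still yields a label
print(decode_prediction([0.4, 0.3, 0.2, 0.1], classes)[0])  # agenda
```

Returning the winning probability alongside the label makes it possible to flag low-confidence predictions for review.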