# One-Dimensional Convolutional Neural Network

This notebook demonstrates a 1D CNN built with Keras. The utility class accepts a pandas dataframe as input for training and testing.

## Load Packages

In [1]:
import os
import pandas as pd
from one_dim_cnn import OneDimCnn
import numpy as np

Using TensorFlow backend.


## Load Data
We use the BBC news dataset. The raw files can be downloaded from http://mlg.ucd.ie/datasets/bbc.html.

When we unzip the archive, the bbc folder has five subdirectories: business, entertainment, politics, sport and tech. We can use these folder names as our target categories, or labels. Each folder contains the files belonging to that category.
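
The folder-as-label layout described above can be illustrated with a small, self-contained sketch. It builds a temporary stand-in for the bbc folder (the file contents and the helper name `read_labeled_folders` are hypothetical, not part of the dataset or of `one_dim_cnn`) and reads it back as (text, label) pairs.

```python
import tempfile
from pathlib import Path


def read_labeled_folders(root):
    """Return (text, label) pairs, using each subfolder name as the label."""
    rows = []
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue  # skip stray top-level files
        for file in sorted(folder.glob('*.txt')):
            rows.append((file.read_text().strip(), folder.name))
    return rows


# Build a tiny stand-in for the unzipped folder (hypothetical contents).
with tempfile.TemporaryDirectory() as root:
    for label, text in [('business', 'Profits rise'), ('sport', 'Team wins')]:
        (Path(root) / label).mkdir()
        (Path(root) / label / '001.txt').write_text(text + '\n')
    (Path(root) / 'README.TXT').write_text('not an article')
    rows = read_labeled_folders(root)

print(rows)
# [('Profits rise', 'business'), ('Team wins', 'sport')]
```

Skipping anything that is not a directory is one way to keep top-level files out of the label set without deleting them.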


Note: The README text file was removed from the folder for convenient parsing. It contains the following:
Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005.
Natural Classes: 5 (business, entertainment, politics, sport, tech)

If you make use of the dataset, please consider citing the publication: 
- D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.

All rights, including copyright, in the content of the original articles are owned by the BBC.

Contact Derek Greene <derek.greene@ucd.ie> for further information.
http://mlg.ucd.ie/datasets/bbc.html


Note: The dataset must be provided as a CSV file or a dataframe with the column names 'input' and 'label'.
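
As a rough illustration of that layout, the sketch below writes two rows to an in-memory CSV with the required header and reads them back. The second row's text is made up for the example, and only the standard library is used, so nothing here reflects how `OneDimCnn` itself loads data.

```python
import csv
import io

# Two example rows in the expected layout: one text column, one label column.
# The second text is hypothetical.
rows = [
    {'input': 'Dollar gains on Greenspan speech', 'label': 'business'},
    {'input': 'Striker signs new contract', 'label': 'sport'},
]

# Write the rows with the required header.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['input', 'label'])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back and confirm that the required columns are present.
buffer.seek(0)
reader = csv.DictReader(buffer)
loaded = list(reader)
print(reader.fieldnames)
# ['input', 'label']
```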

Let us first check that the expected category folders are in our data path, 'data'.

In [2]:
dir_list = os.listdir(os.path.abspath('data'))
dir_list

['business', 'entertainment', 'politics', 'sport', 'tech']

We see the five category folders that we will later use as labels.
Now, let us define a function that scans the contents of each category and collects them in a pandas dataframe.

In [3]:
def loader(path):
    """Read each file under path/<category>/ into a row labelled with its category."""
    data = pd.DataFrame([], columns=['input', 'label'])
    for folder in os.listdir(os.path.abspath(path)):
        folder_path = os.path.join(path, folder)
        for f in os.listdir(folder_path):
            with open(os.path.join(folder_path, f), 'r', newline='') as file:
                # Append a new row; the index starts at 1.
                row = data.shape[0] + 1
                data.loc[row, 'input'] = file.read().strip()
                data.loc[row, 'label'] = str(folder)
    return data

In [4]:
df = loader('data')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2225 entries, 1 to 2225
Data columns (total 2 columns):
input    2225 non-null object
label    2225 non-null object
dtypes: object(2)
memory usage: 132.1+ KB


In [5]:
df.head(5)

|   | input | label |
|---|-------|-------|
| 1 | Ad sales boost Time Warner profit\n\nQuarterly... | business |
| 2 | Dollar gains on Greenspan speech\n\nThe dollar... | business |
| 3 | Yukos unit buyer faces loan claim\n\nThe owner... | business |
| 4 | High fuel prices hit BA's profits\n\nBritish A... | business |
| 5 | Pernod takeover talk lifts Domecq\n\nShares in... | business |


The output above shows 2225 entries, one per document, followed by the first 5 rows of the dataframe.

## Balanced Data
Check the number of documents in each category.

In [6]:
for lbl in df['label'].unique():
    print('category: ', lbl, ', shape: ', df[df.loc[:, 'label'] == lbl].shape)

category:  business , shape:  (510, 2)
category:  entertainment , shape:  (386, 2)
category:  politics , shape:  (417, 2)
category:  sport , shape:  (511, 2)
category:  tech , shape:  (401, 2)


To give every category the same number of documents, we keep only as many documents per category as the smallest category has, 386.

In [7]:
# Keep the first 386 rows of each category.
sub_df = pd.concat(
    [df[df.loc[:, 'label'] == lbl].head(386) for lbl in df['label'].unique()]
)

In [8]:
for lbl in sub_df['label'].unique():
    print('category: ', lbl, ', shape: ', sub_df[sub_df.loc[:, 'label'] == lbl].shape)

category:  business , shape:  (386, 2)
category:  entertainment , shape:  (386, 2)
category:  politics , shape:  (386, 2)
category:  sport , shape:  (386, 2)
category:  tech , shape:  (386, 2)


Now that we have a balanced data set, we can start training.
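
The balancing step above hard-codes 386. A minimal sketch of the same idea, with the class size computed from the data, might look like this; the toy dataframe is hypothetical and stands in for `df`.

```python
import pandas as pd

# Hypothetical toy data with unequal class sizes.
toy = pd.DataFrame({
    'input': ['a1', 'a2', 'a3', 'b1', 'b2', 'c1', 'c2', 'c3', 'c4'],
    'label': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c'],
})

# Size of the smallest class (386 for the BBC data above).
n_min = toy['label'].value_counts().min()

# Keep the first n_min rows of every class, preserving the original row order.
balanced = toy.groupby('label').head(n_min)

counts = balanced['label'].value_counts().to_dict()
print(counts)
```

Like the notebook's approach, `head` keeps the first rows of each class. If a random subset is preferred, `groupby('label').sample(n=n_min, random_state=0)` is an alternative in pandas 1.1 or later.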

## One-Dimensional CNN
Initialize the model. The constructor prints its default hyperparameters.

In [9]:
cnn = OneDimCnn()

{'epochs': 10, 'hidden_dims': 250, 'kernel_size': 5, 'filters': 128, 'embedding_dims': 100, 'batch_size': 128, 'maxlen': 1000, 'max_features': 100000.0, 'self': <one_dim_cnn.OneDimCnn object at 0x0000026879E88390>}


In [10]:
tokenizer, model = cnn.train(sub_df, save=False) # Set save=True if you want to save the model, Default: False

Loading data...
Train shape:  (1545, 1000)
Test shape:  (386, 1000)
Building model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 1000)              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 1000, 100)         10000000  
_________________________________________________________________
dropout_1 (Dropout)          (None, 1000, 100)         0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 996, 128)          64128     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               32250     
________________________________________________________

In this run, the validation accuracy is about 92%.
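
The shapes and parameter counts in the model summary can be checked by hand from the hyperparameters printed by `OneDimCnn()`. The sketch below assumes a stride of 1 and no padding in the convolution, which is consistent with the output length of 996 in the summary.

```python
# Hyperparameters as printed by OneDimCnn() above.
max_features = 100000
embedding_dims = 100
maxlen = 1000
kernel_size = 5
filters = 128
hidden_dims = 250

# Embedding: one vector of size embedding_dims per vocabulary entry.
embedding_params = max_features * embedding_dims

# Conv1D, assuming stride 1 and no padding: the window fits in
# maxlen - kernel_size + 1 positions.
conv_output_len = maxlen - kernel_size + 1
# Each filter has kernel_size * embedding_dims weights plus one bias.
conv_params = kernel_size * embedding_dims * filters + filters

# Dense layer fed by the pooled output, one value per filter.
dense_params = filters * hidden_dims + hidden_dims

print(embedding_params, conv_output_len, conv_params, dense_params)
# 10000000 996 64128 32250
```

The embedding layer holds almost all of the parameters, so lowering `max_features` is the most direct way to shrink the model.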

## Testing
A sample text from BBC News is used for testing.

In [13]:
from keras.preprocessing.sequence import pad_sequences
text = 'Elon Musk unveils first tourist for SpaceX Moon loop. Japanese billionaire and online fashion tycoon Yusaku Maezawa, 42, announced: "I choose to go to the Moon".'
# texts_to_sequences expects a list of texts; a bare string would be split into characters.
vect_text = tokenizer.texts_to_sequences([text])
padded = pad_sequences(vect_text, maxlen=1000, padding='post')
pred = cnn.predict_class(model, padded)
pred

'business'
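
To make the padding step concrete, here is a plain-Python sketch of what post-padding to a fixed length does. The helper `pad_post` is hypothetical and only mimics the behaviour used above (pad at the end, truncate from the front when a sequence is too long, which is the Keras default); it is not the Keras implementation.

```python
def pad_post(sequences, maxlen, value=0):
    """Keep the last maxlen ids of each sequence, then fill the end with `value`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front if too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


# Hypothetical word ids for two short documents.
batch = pad_post([[4, 9, 2], [7, 1, 5, 3, 8, 6]], maxlen=5)
print(batch)
# [[4, 9, 2, 0, 0], [1, 5, 3, 8, 6]]
```

Every row ends up with the same length, which is what lets the model take a fixed input shape of `(None, 1000)`.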