# N-Gram Multichannel Convolutional Neural Network
 A class implementing an N-gram multichannel convolutional neural network that takes a pandas DataFrame as its input data.
 It is an adaptation of Jason Brownlee's model, which can be found at:

 https://machinelearningmastery.com/develop-n-gram-multichannel-convolutional-neural-network-sentiment-analysis/

 This approach was first described by Yoon Kim in his 2014 paper titled
 “Convolutional Neural Networks for Sentence Classification.”

 "A multi-channel convolutional neural network for document classification
 involves using multiple versions of the standard model with different sized kernels.
 This allows the document to be processed at different resolutions or different
 n-grams (groups of words) at a time, whilst the model learns how to best integrate
 these interpretations." - Jason Brownlee, Ph.D
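
 To make the idea of reading a document at several n-gram sizes concrete, here is a minimal pure-Python sketch. It only illustrates the sliding windows that convolution kernels of different widths would see over a token sequence. It is not the model code, and the helper name `ngrams` and the window sizes 2 and 3 are illustrative assumptions.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in document order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the film was surprisingly good".split()

# One "view" of the same document per window size (hypothetical sizes).
views = {n: ngrams(tokens, n) for n in (2, 3)}

for n, windows in views.items():
    print(n, windows)
```

 Each channel of the network plays a similar role: the same text, scanned with a different window width, before the results are merged.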

## Load Packages

In [1]:
import pandas as pd
from multichan_cnn import MultiChanCnn

Using TensorFlow backend.


## Load Data
Note: The dataset must be a CSV file or a DataFrame with the column names 'input' and 'label'.
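
A quick check of this requirement before training could look like the sketch below. The helper `check_columns` is hypothetical and not part of `multichan_cnn`; it only verifies that both column names are present.

```python
import pandas as pd

REQUIRED_COLUMNS = ("input", "label")  # column names required by the note above


def check_columns(frame):
    """Raise ValueError if a required column is missing; otherwise return the frame."""
    missing = [name for name in REQUIRED_COLUMNS if name not in frame.columns]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")
    return frame


# Tiny made-up example frame, for illustration only.
example = pd.DataFrame({"input": ["some article text"], "label": ["celebrity"]})
check_columns(example)
```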

In [2]:
path = 'data/training_set.csv'
df = pd.read_csv(path, encoding='latin-1')

Inspect the data.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 427 entries, 0 to 426
Data columns (total 2 columns):
input    427 non-null object
label    427 non-null object
dtypes: object(2)
memory usage: 6.8+ KB


## Balanced Data
Check how many rows each class has.

In [4]:
df[df.loc[:, 'label'] == 'celebrity'].shape, df[df.loc[:, 'label'] == 'non-celeb'].shape

((283, 2), (144, 2))

Create a balanced sub-dataframe by keeping the first 144 'celebrity' rows and all 144 'non-celeb' rows.

In [5]:
sub_df = df[df.loc[:, 'label'] == 'celebrity'].head(144)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
sub_df = pd.concat([sub_df, df[df.loc[:, 'label'] == 'non-celeb']])
sub_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 288 entries, 3 to 426
Data columns (total 2 columns):
input    288 non-null object
label    288 non-null object
dtypes: object(2)
memory usage: 6.8+ KB


In [6]:
sub_df[sub_df.loc[:, 'label'] == 'celebrity'].shape, sub_df[sub_df.loc[:, 'label'] == 'non-celeb'].shape

((144, 2), (144, 2))
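
The same downsampling can be written without hard-coding the class size. The sketch below is one possible generalization, not the notebook's own code: the helper `balance_by_label` is hypothetical, and the toy frame is made up for illustration. It keeps the first rows of every class, up to the size of the smallest class.

```python
import pandas as pd


def balance_by_label(frame, label_col="label"):
    """Keep the first k rows of each class, where k is the smallest class size."""
    smallest = frame[label_col].value_counts().min()
    # GroupBy.head keeps the original row order and index.
    return frame.groupby(label_col).head(smallest)


# Made-up data: 5 'celebrity' rows and 2 'non-celeb' rows.
toy = pd.DataFrame({
    "input": list("abcdefg"),
    "label": ["celebrity"] * 5 + ["non-celeb"] * 2,
})
balanced = balance_by_label(toy)
print(balanced["label"].value_counts())
```

Unlike the cells above, this keeps rows in their original order rather than placing all 'celebrity' rows first.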

## Multi-Channel CNN
Initialize the model, then train it.

In [7]:
cnn = MultiChanCnn()

In [9]:
tokenizer, model = cnn.train(sub_df, save=True) # Set save=True if you want to save the model, Default: False

Max document length: 2384
Vocabulary size: 12479
Train shape:  (217, 2384)
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            (None, 2384)         0                                            
__________________________________________________________________________________________________
input_5 (InputLayer)            (None, 2384)         0                                            
__________________________________________________________________________________________________
input_6 (InputLayer)            (None, 2384)         0                                            
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 2384, 100)    1247900     input_4[0][0]                    
__________________________________

## Testing
We can check the model's prediction on actual article content.

In [10]:
unseen_text = df[df.loc[:, 'label'] == 'celebrity'].loc[146, 'input']

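Encode the text with the tokenizer returned by cnn.train, then pad it to the maximum document length used in training.

In [16]:
from keras.preprocessing.sequence import pad_sequences
# texts_to_sequences expects a list of texts; a bare string would be encoded one character at a time
encoded = tokenizer.texts_to_sequences([unseen_text])

In [17]:
padded = pad_sequences(encoded, maxlen=2384, padding='post')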
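
For readers unfamiliar with the padding step, the following pure-Python sketch mimics what `pad_sequences` does to a single sequence with `padding='post'` and Keras's default `truncating='pre'`. The function `pad_post` is a hypothetical stand-in written for illustration, not a Keras API.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad one list of token ids to exactly maxlen entries.

    Too-long sequences lose ids from the front (truncating='pre');
    too-short sequences get `value` appended at the end (padding='post').
    """
    kept = sequence[max(len(sequence) - maxlen, 0):]
    return kept + [value] * (maxlen - len(kept))


short = pad_post([5, 9, 2], maxlen=5)
long = pad_post([1, 2, 3, 4, 5, 6], maxlen=4)
print(short, long)
```

Every document therefore reaches the network with the same length, here 2384 token ids.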

In [19]:
pred = model.predict([padded, padded, padded])
pred

array([[0.18592758]], dtype=float32)

In [20]:
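# The model has a single output, so argmax over it is always 0; threshold at 0.5 instead
pred_class = (pred[:, 0] > 0.5).astype('int64')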
pred_class

array([0], dtype=int64)

Keras does not keep a mapping from class names to integers, so our MultiChanCnn class stores one in its classes_ attribute.

In [21]:
cnn.classes_

{'celebrity': 0, 'non-celeb': 1}
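
To turn the raw prediction back into a label name, the mapping can be inverted. The sketch below assumes the single model output is the probability of class 1 ('non-celeb') and uses a 0.5 threshold. Both are assumptions, and `label_for` is a hypothetical helper rather than part of the class.

```python
classes = {"celebrity": 0, "non-celeb": 1}  # the mapping shown above

# Invert the mapping: integer index -> class name.
index_to_label = {index: name for name, index in classes.items()}


def label_for(probability, threshold=0.5):
    """Map a single probability to a class name (assumes it is P(class 1))."""
    return index_to_label[int(probability > threshold)]


predicted = label_for(0.18592758)  # the prediction value from the cell above
print(predicted)
```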