# 18. Convolutional Neural Network (CNN)
One type of neural network that can be used for NLP is the Convolutional Neural Network (CNN). CNNs are loosely inspired by the animal visual cortex and are traditionally used for computer vision, where they scan an image for local shapes and patterns; when a filter finds its pattern, the corresponding node in the network fires. More recently, CNNs have also been applied to natural language processing, with good results. The approach carries over because a CNN can detect local patterns in text as well: n-grams such as 'I like' or 'very much' show that certain combinations of words are significant. Source: http://www.davidsbatista.net/blog/2018/03/31/SentenceClassificationConvNets/
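
To make the n-gram idea concrete, here is a minimal sketch of a single width-2 convolution filter sliding over a sentence, followed by max-over-time pooling. The vocabulary, the random unit-length embeddings and the helper names are all made up for illustration; a real CNN learns its filter weights instead of copying them from a bigram.

```python
import math
import random

random.seed(0)

def unit_vector(dim):
    """Random embedding scaled to length 1 (hypothetical values)."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

vocab = ["i", "like", "this", "watch", "very", "much"]
embeddings = {word: unit_vector(4) for word in vocab}

def conv1d_scores(sentence, kernel):
    """Slide a filter of len(kernel) words over the sentence: one score per n-gram."""
    width = len(kernel)
    scores = []
    for start in range(len(sentence) - width + 1):
        window = [embeddings[w] for w in sentence[start:start + width]]
        scores.append(sum(x * k
                          for vec, kvec in zip(window, kernel)
                          for x, k in zip(vec, kvec)))
    return scores

# A width-2 filter whose weights equal the embeddings of the bigram "very much".
kernel = [embeddings["very"], embeddings["much"]]

sentence = ["i", "like", "this", "watch", "very", "much"]
scores = conv1d_scores(sentence, kernel)
# Max-over-time pooling keeps only the strongest response in the sentence.
best = max(range(len(scores)), key=scores.__getitem__)
print(best, sentence[best:best + 2])
```

Because every embedding has length 1, the filter responds most strongly at the position where the window matches the bigram it encodes, which is the sense in which a filter "fires" on an n-gram.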

## Preprocessing

In [1]:
import os
import random
import numpy as np

from collections import namedtuple

from sklearn.preprocessing import LabelEncoder

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

import pandas as pd
from preprocessing import PreProcessor

pp = PreProcessor()

df = pd.read_csv('Structured_DataFrame_Sample_500.csv', index_col=0)
df['Item Description'] = df['Item Description'].apply(lambda d: pp.preprocess(str(d)))
df

Using TensorFlow backend.


Unnamed: 0,Category,Item Description,category_id
40127,Counterfeits/Watches,emporio armani ar shell case ceram bracelet re...,0
40126,Counterfeits/Watches,cartiertank ladi brand cartier seri tank gende...,0
40125,Counterfeits/Watches,patek philipp watch box patek philipp watch bo...,0
40130,Counterfeits/Watches,breitl navitim cosmonaut replica watch inform ...,0
40129,Counterfeits/Watches,emporio armani men ar dial color gari watch re...,0
...,...,...,...
15401,Services/Money,canada cc get card number cvv expiri date name...,29
15402,Services/Money,uk debit card take chanc buy uk visa debit car...,29
15403,Services/Money,itali card detail high valid fresh itali card ...,29
15404,Services/Money,centurionblack cc get us centurion cc card num...,29


## Vectorizing

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2))
features = tfidf.fit_transform(df['Item Description'])
labels = df.Category

features
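
As a rough illustration of the weighting requested with `sublinear_tf=True` and `norm='l2'`, the sketch below computes TF-IDF by hand on two toy documents. It uses the smoothed IDF formula that scikit-learn documents as its default, and it leaves out `min_df` and bigrams; the function name `tfidf` and the toy data are assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Sublinear TF times smoothed IDF, L2-normalised per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: (1 + math.log(tf)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

toy_docs = ["replica watch", "replica card"]
vectors = tfidf(toy_docs)
print(vectors[0])
```

In the first toy document, 'watch' receives a higher weight than 'replica' because 'replica' occurs in every document and is therefore less informative.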

## Splitting and encoding

In [3]:
from sklearn.model_selection import train_test_split

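# Split the raw descriptions: the Tokenizer below needs text, not the TF-IDF matrix
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    df['Item Description'], df['Category'], df.index, test_size=0.33, random_state=0)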

X_train

94722    g peruvian flake best cocain high grade peruvi...
38485    european paypal account x insid account test s...
15551    two co hash oil pen profession manufactur co h...
83348    furanylfentanyl power cousin fentanyl heroin g...
2346     growityourself growschrank homebox q im komple...
                               ...                        
97143    uncut potent pink dutch speed g product nice d...
82703    g premium moroccan hash uk uk one gram love mo...
90008    zstrain magic mushroom cap welcom fellow breth...
14381    g ethylphenid ep pleas read care intern bulk o...
12030    mg cooki add item one mg cooki add item cooki ...
Name: Item Description, Length: 10050, dtype: object

In [4]:
# convert list of tokens/words to indexes
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
sequences_train = tokenizer.texts_to_sequences(X_train)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 7006 unique tokens.
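
For intuition, the two Tokenizer steps used above can be approximated in plain Python. This is a simplified sketch, not the Keras implementation: it only lowercases and splits on whitespace, and the function names and toy texts are invented for illustration.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is left unused so that it can serve as the padding value later.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace words by their index; words not seen during fitting are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

toy_train = ["replica watch box", "watch box watch"]
toy_index = fit_word_index(toy_train)
toy_sequences = to_sequences(["replica gold watch"], toy_index)
print(toy_index)
print(toy_sequences)
```

The word 'gold' does not occur in the toy training texts, so it disappears from the resulting sequence. The same happens to unseen words in the test set when the Tokenizer is fitted on the training data only and no out-of-vocabulary token is configured.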


In [5]:
# get the max sentence length, needed for padding
max_input_lenght = max(len(x) for x in sequences_train)
print("Max. sequence length: ", max_input_lenght)

Max. sequence length:  265


In [6]:
# pad all the sequences of indexes to the 'max_input_lenght'
data_train = pad_sequences(sequences_train, maxlen=max_input_lenght, padding='post', truncating='post')
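
The effect of the `padding` and `truncating` arguments is easiest to see on a toy example. The two helpers below are a plain-Python sketch of the 'post' variant used here and of the 'pre' variant, which is the `pad_sequences` default; the helper names and the toy sequences are made up for illustration.

```python
def pad_post(sequences, maxlen, value=0):
    """Drop items beyond `maxlen` at the end, then append `value` up to `maxlen`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                             # truncating='post'
        padded.append(seq + [value] * (maxlen - len(seq)))   # padding='post'
    return padded

def pad_pre(sequences, maxlen, value=0):
    """Keep the last `maxlen` items and put the padding in front."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # truncating='pre'
        padded.append([value] * (maxlen - len(seq)) + seq)   # padding='pre'
    return padded

toy = [[5, 2], [7, 1, 4, 9, 3]]
post = pad_post(toy, maxlen=4)
pre = pad_pre(toy, maxlen=4)
print(post)
print(pre)
```

The two variants place the zeros on opposite sides and cut long sequences at opposite ends, so training and test data should be padded with the same settings.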

In [7]:
# Encode the labels, each must be a vector with dim = num. of possible labels
le = LabelEncoder()
le.fit(y_train)
labels_encoded_train = le.transform(y_train)
categorical_labels_train = to_categorical(labels_encoded_train, num_classes=None)
print('Shape of train data tensor:', data_train.shape)
print('Shape of train label tensor:', categorical_labels_train.shape)

Shape of train data tensor: (3015, 265)
Shape of train label tensor: (3015, 9)
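
What the label encoding produces can be sketched in plain Python: class names are sorted and numbered, and each number becomes a vector with a single 1. The helper names and the three toy labels are assumptions for illustration.

```python
def encode_labels(labels):
    """Number the sorted class names, similar to what LabelEncoder does."""
    classes = sorted(set(labels))
    class_to_id = {name: i for i, name in enumerate(classes)}
    return classes, [class_to_id[name] for name in labels]

def one_hot(ids, num_classes):
    """One row per label, with a 1.0 in the column of its class id."""
    return [[1.0 if i == j else 0.0 for j in range(num_classes)] for i in ids]

toy_labels = ["Services/Money", "Counterfeits/Watches", "Services/Money"]
classes, ids = encode_labels(toy_labels)
one_hot_rows = one_hot(ids, num_classes=len(classes))
print(classes)
print(ids)
print(one_hot_rows)
```

The width of each one-hot row equals the number of distinct classes, which is why the second dimension of the label tensor above is 9.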


In [8]:
# pre-process test data
sequences_test = tokenizer.texts_to_sequences(X_test)
# pad and truncate on the same side as the training data
x_test = pad_sequences(sequences_test, maxlen=max_input_lenght, padding='post', truncating='post')

labels_encoded_test = le.transform(y_test)
categorical_labels_test = to_categorical(labels_encoded_test, num_classes=None)
print('Shape of test data tensor:', x_test.shape)
print('Shape of test labels tensor:', categorical_labels_test.shape)

Shape of test data tensor: (1485, 265)
Shape of test labels tensor: (1485, 9)


## CNN with random word embeddings

In [9]:
from convnets_utils import *

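# arguments: embedding dimension, vocabulary size, input length, number of classes
# (order inferred from the summary below); the class count is taken from the
# encoded labels so that the output layer matches the label tensor
model_1 = get_cnn_rand(3000, len(word_index)+1, max_input_lenght, categorical_labels_train.shape[1])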
print(model_1.summary())

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
main_input (InputLayer)         (None, 265)          0                                            
__________________________________________________________________________________________________
embedding_layer_dynamic (Embedd (None, 265, 3000)    21021000    main_input[0][0]                 
__________________________________________________________________________________________________
Conv_dynamic_3 (Conv1D)         (None, 263, 100)     900100      embedding_layer_dynamic[0][0]    
__________________________________________________________________________________________________
Conv_dynamic_4 (Conv1D)         (None, 262, 100)     1200100     embedding_layer_dynamic[0][0]    
__________________________________________________________________________________________________
Conv_dynam
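
The parameter counts and output shapes in the summary can be checked by hand. The sketch below assumes what the summary suggests: an embedding dimension of 3000, 100 filters per convolution, stride 1 and no padding of the convolution input. The helper names are invented for illustration.

```python
vocab_size = 7006 + 1      # unique tokens plus index 0, which is reserved for padding
embedding_dim = 3000
sequence_length = 265
filters = 100

def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embedding_dim

def conv1d_params(kernel_size, in_channels, filters):
    """Filter weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def conv1d_output_length(sequence_length, kernel_size):
    """Number of window positions with stride 1 and no padding."""
    return sequence_length - kernel_size + 1

print(embedding_params(vocab_size, embedding_dim))    # 21021000
print(conv1d_params(3, embedding_dim, filters))       # 900100
print(conv1d_params(4, embedding_dim, filters))       # 1200100
print(conv1d_output_length(sequence_length, 3))       # 263
print(conv1d_output_length(sequence_length, 4))       # 262
```

Almost all parameters sit in the embedding layer, and its size grows linearly with the embedding dimension.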

In [None]:
history = model_1.fit(x=data_train, y=categorical_labels_train, batch_size=50, epochs=10)

Epoch 1/10


## Conclusion
The kernel keeps dying when I try to train the network. I have tried tweaking the parameters such as the batch size, but nothing seems to work. I don't know how to continue with this.
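
One possible cause, which is an assumption and not something this notebook confirms, is that the process runs out of memory because of the very large embedding dimension (3000). The rough estimate below counts only float32 values for the embedding weights and for the embedding output of one batch; gradients, optimizer state and the other layers come on top. The function name and the smaller dimensions 300 and 100 are hypothetical values chosen for comparison.

```python
BYTES_PER_FLOAT32 = 4

def embedding_memory_mb(vocab_size, embedding_dim, sequence_length, batch_size):
    """Rough float32 footprint in MB: (embedding weights, embedding output of one batch)."""
    weights = vocab_size * embedding_dim
    activations = batch_size * sequence_length * embedding_dim
    return (round(weights * BYTES_PER_FLOAT32 / 1e6, 1),
            round(activations * BYTES_PER_FLOAT32 / 1e6, 1))

# 7007 = 7006 tokens + 1, 265 = padded length, 50 = batch size used above.
estimates = {dim: embedding_memory_mb(7007, dim, 265, 50) for dim in (3000, 300, 100)}
for dim, (weights_mb, batch_mb) in estimates.items():
    print(dim, weights_mb, batch_mb)
```

If memory is indeed the problem, lowering the first argument of `get_cnn_rand` would shrink both numbers proportionally, whereas changing the batch size only affects the second one.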