**Neural Network Classifier with Keras**

Using the multi-label classifier dataset from earlier exercises (categorized-comments.jsonl in the reddit folder), fit a neural network classifier using Keras. Use the code found in chapter 12 of the Applied Text Analysis with Python book as a guideline. Report the accuracy, precision, recall, F1-score, and confusion matrix.

In [26]:
import pandas as pd, numpy as np, json, re, pickle, keras

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, auc, precision_recall_fscore_support
from sklearn.metrics import classification_report
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

In [62]:
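# set the number of features, classes and hidden units

N_FEATURES = 5000   # vocabulary size kept by CountVectorizer
N_CLASSES = 5       # labels run from 1 to 4, so to_categorical yields 5 columns (column 0 is unused)
N_UNITS = 2500      # units in each hidden layer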

In [3]:
def read_data(file):
    """
    Take a json file location and
    read the file into a pandas data frame
    Args: full path to file
    Returns: pandas dataframe with data from file
    """
    
    data = []

    with open(file) as f:
        for line in f:
            data.append(json.loads(line))
        
    # convert to data frame
    
    return pd.DataFrame(data)
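
pandas can also parse line-delimited JSON directly. The sketch below is an alternative to the loop in `read_data`, not the notebook's own method; it assumes one JSON object per line and writes a tiny made-up sample to a temporary file so that it runs on its own.

```python
import json
import os
import tempfile

import pandas as pd

# Made-up records in the same two-column layout (cat, txt)
records = [
    {"cat": "news", "txt": "first comment"},
    {"cat": "sports", "txt": "second comment"},
]

# Write one JSON object per line (the JSON Lines layout)
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
    path = f.name

# lines=True tells pandas to treat each line as a separate JSON object
df = pd.read_json(path, lines=True)
os.remove(path)

print(df.shape)
```

For a file with millions of rows, `read_json` also accepts a `chunksize` argument together with `lines=True`, which may help if memory is tight.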

In [29]:
# read category data

cat_df = read_data('data/reddit/categorized-comments.jsonl')

# check size, structure and categories

cat_df.info()   # info() prints its summary and returns None, so call it on its own
print('Size: ', len(cat_df), '\n',
      'Shape: ', cat_df.shape, '\n',
      'Categories: ', cat_df.cat.unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2347476 entries, 0 to 2347475
Data columns (total 2 columns):
cat    object
txt    object
dtypes: object(2)
memory usage: 35.8+ MB
Size:  2347476 
 Shape:  (2347476, 2) 
 Categories:  ['sports' 'science_and_technology' 'video_games' 'news']


In [30]:
def clean_text(text):
    """
    Lowercase the text and replace digits, punctuation and special characters with spaces
    Args: text 
    Output: text
    """
    
    text=text.lower()
    text=re.sub('&lt;/?.*?&gt;',' &lt;&gt', text)
    text=re.sub('\\d|\\W+|_',' ',text)
    text=re.sub('[^a-zA-Z]'," ", text)
    
    return text

# Create stop words list

stop_words = stopwords.words('english')
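
To see what the digit and punctuation substitutions in `clean_text` do to a comment, here is a small check on a made-up sentence. It reproduces only the lowercasing and the last two `re.sub` calls, so treat it as an illustration rather than the full function.

```python
import re

sample = "I wouldn't listen to 99% of people_on Earth!"  # made-up comment

text = sample.lower()
text = re.sub(r'\d|\W+|_', ' ', text)   # digits, runs of non-word characters, underscores
text = re.sub(r'[^a-zA-Z]', ' ', text)  # anything that is still not a letter

tokens = text.split()
print(tokens)
```

Note that the apostrophe is replaced as well, so contractions split into fragments such as `wouldn` and `t`.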

In [31]:
# since the data set is huge, I will take a sample from each of the 4 categories.
# by trial, a sample of 50000 from each category can be easily handled by my machine

size = 50000    # sample size
replace = True  # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]

category = cat_df.groupby('cat', as_index=False).apply(fn)

# free up memory

del cat_df

category['txt'] = category['txt'].apply(lambda x:clean_text(x))
category.reset_index(drop=True, inplace=True)

category.head()

   cat   txt
0  news  i wouldn t listen to of people on earth the...
1  news  yes pc and pc are both laws in califor...
2  news  ok cool now again what does that have to do wi...
3  news  this is why you lost trump spoke of infrastruc...
4  news  it s amazing how little that point is even bei...
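
The same per-category sampling can probably be written without a custom lambda. This is a sketch on a made-up frame, assuming pandas 1.1 or later, where `groupby(...).sample` is available; `random_state` is added here only to make the draw repeatable.

```python
import pandas as pd

# Made-up frame with unbalanced categories
df = pd.DataFrame({
    "cat": ["news"] * 5 + ["sports"] * 2,
    "txt": [f"comment {i}" for i in range(7)],
})

# Draw the same number of rows from every category; replace=True lets
# small categories be sampled more times than they have rows
balanced = (
    df.groupby("cat")
      .sample(n=3, replace=True, random_state=0)
      .reset_index(drop=True)
)

print(balanced["cat"].value_counts())
```

Sampling with replacement can duplicate rows, so identical comments may end up in both the training and the test split later on.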


In [46]:
# check the unique categories

category["cat"].unique()

array(['news', 'science_and_technology', 'sports', 'video_games'],
      dtype=object)

In [47]:
# create dictionary to map the categories to int for downstream processing

cat_to_int = {'news' : 1,
              'science_and_technology' : 2,
              'sports' : 3,
              'video_games' : 4}

category['cat'] = category['cat'].map(cat_to_int)
category.head()

   cat  txt
0  1    i wouldn t listen to of people on earth the...
1  1    yes pc and pc are both laws in califor...
2  1    ok cool now again what does that have to do wi...
3  1    this is why you lost trump spoke of infrastruc...
4  1    it s amazing how little that point is even bei...
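
A common alternative is to start the integer codes at 0, so that a one-hot encoder that sizes its output from the largest label produces exactly one column per category. The name `cat_to_idx` and the toy series below are made up for illustration.

```python
import pandas as pd

cats = pd.Series(["news", "sports", "video_games", "science_and_technology", "news"])

# Build the mapping from the sorted unique labels, starting at 0
cat_to_idx = {name: i for i, name in enumerate(sorted(cats.unique()))}
labels = cats.map(cat_to_idx)

print(cat_to_idx)
print(labels.tolist())
```

With zero-based labels, the number of output units in the network would equal the number of categories (4) instead of 5.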


In [74]:
# create the feature matrix

cv = CountVectorizer(stop_words=stop_words, max_features = N_FEATURES)

# create target and sample

X = cv.fit_transform(category['txt'])
y = category['cat']

# create train test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
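
As a small-scale illustration of the vectorize-then-split step, the sketch below uses a made-up corpus. It swaps in scikit-learn's built-in English stop-word list and adds `stratify=y`, which is not used in the notebook, to keep the class proportions equal in both splits.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up corpus: four "sports" comments (0) and four "news" comments (1)
docs = [
    "great game last night", "the team won the game",
    "that game was close", "our team lost again",
    "the election news today", "new law passed today",
    "election results are in", "the law was repealed",
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# max_features keeps only the most frequent terms after stop-word removal
cv = CountVectorizer(stop_words="english", max_features=4)
X = cv.fit_transform(docs)

# stratify=y keeps the class balance the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

print(X.shape, X_train.shape, X_test.shape)
```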

In [76]:
# One-hot encode the target vector to create a target matrix

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

In [77]:
y_test

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       ...,
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0.]], dtype=float32)

A simple feed-forward ANN takes one fixed-length feature vector per sample as input, so check that the train and test sets have the expected shapes.

In [78]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(150000, 5000)
(50000, 5000)
(150000, 5)
(50000, 5)
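
When `num_classes` is not given, `to_categorical` sizes its output as the largest label plus one. The NumPy sketch below mimics that behaviour on a few made-up labels to show why labels 1 to 4 give a five-column target matrix; it is an illustration, not the Keras implementation.

```python
import numpy as np

labels = np.array([1, 4, 2, 3])   # labels that start at 1, as in cat_to_int

n_columns = labels.max() + 1      # index 0 needs a column too
one_hot = np.eye(n_columns, dtype="float32")[labels]

print(one_hot.shape)
print(one_hot[:, 0].sum())        # column 0 is never set
```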

#### Create classifier for ANN

In [63]:
# initialize

classifier_seq = Sequential()

# add two hidden layers (the first also declares the input shape) and a softmax output layer

classifier_seq.add(Dense(activation="relu", input_shape=(N_FEATURES,), units=N_UNITS))
classifier_seq.add(Dense(activation="relu", units=N_UNITS))
classifier_seq.add(Dense(activation="softmax", units=N_CLASSES))

# compile the Artificial Neural Network (ANN)

classifier_seq.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
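
To get a feel for the size of this network, the trainable parameters of each dense layer can be counted by hand as weights plus biases. The helper `dense_params` is a made-up name; the layer sizes are the constants defined at the top of the notebook.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

N_FEATURES, N_UNITS, N_CLASSES = 5000, 2500, 5
layer_sizes = [N_FEATURES, N_UNITS, N_UNITS, N_CLASSES]

# Pair each layer's input size with its output size
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)
```

The totals should match what `classifier_seq.summary()` reports; most of the roughly 18.8 million parameters sit in the first hidden layer.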

#### Apply model

In [64]:
# fit ANN to the training set

classifier_seq.fit(X_train, y_train, batch_size=200, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1dae80f3cf8>

#### Evaluate

In [69]:
# predict() returns class probabilities per sample; evaluate() returns only [loss, accuracy]
y_prob = classifier_seq.predict(X_test)

# convert one-hot targets and predicted probabilities to class labels
y_true = y_test.argmax(axis=1)
y_pred = y_prob.argmax(axis=1)

In [70]:
print('Confusion Matrix:\n', confusion_matrix(y_true, y_pred))
print('Classification Report:\n', classification_report(y_true, y_pred))
print('Accuracy: ', accuracy_score(y_true, y_pred))
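
The scikit-learn metric functions expect one class label per sample for both the true and the predicted values, not one-hot rows or probability vectors. The toy sketch below uses made-up targets and probabilities for three classes to show the `argmax` conversion and the requested metrics.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Made-up one-hot targets and predicted probabilities for 3 classes
y_true_onehot = np.eye(3)[[0, 1, 2, 1, 2, 0]]
y_prob = np.array([
    [0.80, 0.10, 0.10],
    [0.20, 0.70, 0.10],
    [0.10, 0.60, 0.30],   # true class 2, predicted class 1
    [0.10, 0.80, 0.10],
    [0.20, 0.20, 0.60],
    [0.90, 0.05, 0.05],
])

# Collapse each row to the index of its largest entry
y_true = y_true_onehot.argmax(axis=1)
y_pred = y_prob.argmax(axis=1)

cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)

print(cm)
print(acc, precision, recall, f1)
```

Macro averaging weights every class equally, which seems reasonable here because the sample holds the same number of comments per category.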