**Neural Network Classifier with Keras**

Using the multi-class classifier dataset from earlier exercises (`categorized-comments.jsonl` in the `reddit` folder, one category per comment), fit a neural network classifier using Keras. Use the code found in Chapter 12 of the book Applied Text Analysis with Python as a guideline. Report the accuracy, precision, recall, F1-score, and confusion matrix.

In [1]:
import pandas as pd, numpy as np, json, re, pickle, keras

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support
from sklearn.metrics import classification_report
from keras.models import Sequential
from keras.layers import Dense
from sklearn.preprocessing import LabelEncoder

Using TensorFlow backend.


In [2]:
def read_data(file):
    """
    Take the location of a JSON Lines file (one JSON object per line)
    and read the file into a pandas data frame
    Args: full path to file
    Returns: pandas dataframe with data from file
    """
    
    data = []

    with open(file) as f:
        for line in f:
            data.append(json.loads(line))
        
    # convert to data frame
    
    return pd.DataFrame(data)
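
The loader above collects every record in a Python list before building the data frame. When only part of a large file is needed, an iterator can stop early instead. The following is a minimal sketch under stated assumptions: `iter_jsonl` and its `limit` parameter are hypothetical names that do not appear in the original notebook, and the in-memory sample stands in for the real file.

```python
import io
import json
from itertools import islice

def iter_jsonl(handle, limit=None):
    """Return an iterator over parsed records, one per non-empty line.

    Stops after `limit` records; limit=None reads everything.
    (Hypothetical helper, not part of the original notebook.)
    """
    records = (json.loads(line) for line in handle if line.strip())
    return islice(records, limit)

# stand-in for open('data/reddit/categorized-comments.jsonl')
sample_file = io.StringIO(
    '{"cat": "news", "txt": "first comment"}\n'
    '{"cat": "sports", "txt": "second comment"}\n'
    '\n'
    '{"cat": "news", "txt": "third comment"}\n'
)

first_two = list(iter_jsonl(sample_file, limit=2))
print(first_two)
```

The resulting list can be passed to `pd.DataFrame` in the same way as the `data` list in `read_data`.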

In [3]:
# read category data

cat_df = read_data('data/reddit/categorized-comments.jsonl')

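# check size, structure and categories
# note: info() prints its summary itself and returns None,
# which is why the output reads "Shape:  None"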

print('Size: ', len(cat_df), '\n',
      'Shape: ', cat_df.info(), '\n',
      'Categories: ', cat_df.cat.unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2347476 entries, 0 to 2347475
Data columns (total 2 columns):
cat    object
txt    object
dtypes: object(2)
memory usage: 35.8+ MB
Size:  2347476 
 Shape:  None 
 Categories:  ['sports' 'science_and_technology' 'video_games' 'news']


In [4]:
def clean_text(text):
    """
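    Lower-case the text and replace tags, digits, punctuation,
    and other non-letter characters with spaces
    Args: text 
    Output: text
    """
    
    text = text.lower()
    # strip HTML-style tags such as <b> or </p>
    text = re.sub(r'</?.*?>', ' ', text)
    # replace digits, runs of non-word characters, and underscores with spaces
    text = re.sub(r'\d|\W+|_', ' ', text)
    # replace anything that is still not an ASCII letter
    text = re.sub(r'[^a-zA-Z]', ' ', text)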
    
    return text

# Create stop words list

stop_words = stopwords.words('english')
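
One of the sampled comments shown in this notebook begins with a stray `gt`. A likely cause is that the raw text contains HTML entities such as `&gt;`, whose name survives once the punctuation around it is removed. The sketch below illustrates this with a made-up sentence. `strip_to_letters` is a hypothetical stand-in that mirrors only the last two substitutions of `clean_text`, and decoding with `html.unescape` first is a suggestion rather than what the notebook does.

```python
import html
import re

def strip_to_letters(text):
    """Lower-case the text and replace every non-letter with a space."""
    text = text.lower()
    text = re.sub(r'\d|\W+|_', ' ', text)
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    return text

raw = "&gt; Would you call it a 2nd_try?"   # made-up example comment

# without decoding, the entity name "gt" survives as a token
tokens_plain = strip_to_letters(raw).split()

# decoding first turns "&gt;" into ">", which is then removed as punctuation
tokens_decoded = strip_to_letters(html.unescape(raw)).split()

print(tokens_plain)
print(tokens_decoded)
```

Whether the extra decoding step changes the classifier's accuracy was not measured here.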

In [5]:
# since the full dataset is huge, I will take a sample from each of the 4 categories.
# by trial, a sample of 50000 from each category can be easily handled by my machine

size = 50000    # sample size
replace = True  # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]

category = cat_df.groupby('cat', as_index=False).apply(fn)

# free up memory

del cat_df

category['txt'] = category['txt'].apply(clean_text)
category.reset_index(drop=True, inplace=True)

category.head()

|   | cat | txt |
|---|-----|-----|
| 0 | news | about the a the reason behind it was that saud... |
| 1 | news | you can register as an independent |
| 2 | news | gt would you call someone sexually attracted ... |
| 3 | news | it can also be argued that it s not enough foo... |
| 4 | news | not an attack like msm wants us to think trump... |


In [6]:
# check the unique categories

#category["cat"].unique()
category.groupby(["cat"]).size()

cat
news                      50000
science_and_technology    50000
sports                    50000
video_games               50000
dtype: int64
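
The sampling cell draws with replacement and without a fixed seed, so the same comment can appear more than once (and could then land in both the training and the test split), and each run produces a different sample. The following sketch shows one alternative: seeded sampling without replacement. It uses only the standard library on made-up rows; `sample_per_category` is a hypothetical helper, not the notebook's method.

```python
import random
from collections import defaultdict

def sample_per_category(rows, size, seed=0):
    """Draw `size` distinct rows from each category (no replacement)."""
    rng = random.Random(seed)          # fixed seed makes the draw repeatable
    by_cat = defaultdict(list)
    for row in rows:
        by_cat[row["cat"]].append(row)
    sampled = []
    for cat in sorted(by_cat):         # fixed category order for repeatability
        sampled.extend(rng.sample(by_cat[cat], size))
    return sampled

# made-up data: 10 comments in each of two categories
rows = [{"cat": c, "txt": f"{c} comment {i}"}
        for c in ("news", "sports") for i in range(10)]

subset = sample_per_category(rows, size=3, seed=1)
print(len(subset))
```

On a data frame, recent pandas versions offer a similar operation through `groupby('cat').sample(n=size, random_state=...)`.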

In [7]:
encoder = LabelEncoder()

cat = category["cat"]
category["cat"]=encoder.fit_transform(cat)
category.groupby("cat").count()

| cat | txt |
|-----|-----|
| 0 | 50000 |
| 1 | 50000 |
| 2 | 50000 |
| 3 | 50000 |
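
The integer codes matter later when reading the confusion matrix. `LabelEncoder` numbers the classes in sorted order of their names, which the plain-Python sketch below reproduces for the four categories in this dataset (the short `labels` list is made up for illustration).

```python
labels = ["sports", "science_and_technology", "video_games", "news", "sports"]

# sort the distinct class names, then number them from 0
classes = sorted(set(labels))
to_code = {name: code for code, name in enumerate(classes)}
encoded = [to_code[name] for name in labels]

print(to_code)
print(encoded)
```

In the notebook itself, `encoder.classes_` holds the same ordering and `encoder.inverse_transform` maps codes back to names.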


In [8]:
# set the number of features and classes

N_FEATURES = 5000
N_CLASSES = 4      # news, science_and_technology, sports, video_games
N_UNITS = 2500

In [9]:
# create the feature matrix

cv = CountVectorizer(analyzer='word',
                     stop_words=stop_words, 
                     max_features = N_FEATURES,
                     max_df = 0.5,
                     min_df = 3)

# create target and sample

X = cv.fit_transform(category['txt'])
y = category['cat']

# create train test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
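
To make the vectorizer settings concrete, here is a simplified plain-Python sketch of what `max_df`, `min_df`, and `max_features` select: terms are filtered by the number of documents they appear in, and the most frequent survivors form the vocabulary. This is not scikit-learn's implementation. `build_vocabulary` and `count_vector` are hypothetical names, the toy documents are made up, and breaking ties alphabetically is an assumption of this sketch.

```python
import re
from collections import Counter

def build_vocabulary(docs, stop_words, max_features, max_df, min_df):
    """Select a vocabulary from document-frequency and term-frequency limits."""
    tokenized = [[t for t in re.findall(r'\b\w\w+\b', d.lower())
                  if t not in stop_words] for d in docs]
    term_freq = Counter(t for doc in tokenized for t in doc)
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    n_docs = len(docs)
    # min_df is an absolute document count, max_df a fraction of all documents
    kept = [t for t in term_freq
            if min_df <= doc_freq[t] <= max_df * n_docs]
    # keep the most frequent terms; ties broken alphabetically (sketch choice)
    kept.sort(key=lambda t: (-term_freq[t], t))
    return sorted(kept[:max_features]), tokenized

def count_vector(tokens, vocab):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

docs = ["the game was a great game",
        "great goal in the game",
        "new phone with a great camera",
        "the phone camera is new"]
stop = {"the", "was", "a", "in", "with", "is"}

vocab, tokenized = build_vocabulary(docs, stop, max_features=3,
                                    max_df=0.5, min_df=2)
vectors = [count_vector(doc, vocab) for doc in tokenized]
print(vocab)
print(vectors)
```

In this toy corpus "great" is dropped because it appears in more than half of the documents, and "goal" because it appears in fewer than two.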

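A dense feed-forward network takes each sample as a flat, fixed-length feature vector, so the train and test matrices should each have N_FEATURES (5000) columns and as many rows as their label arrays. The next cell checks this.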

In [10]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(150000, 5000)
(50000, 5000)
(150000,)
(50000,)


#### Create classifier for ANN

In [12]:
# initialize

classifier_seq = Sequential()

classifier_seq.add(Dense(units=500,activation="relu",input_shape=(N_FEATURES,)))
classifier_seq.add(Dense(units=50, activation="relu"))
classifier_seq.add(Dense(units=4, activation="softmax"))

# compile the Artificial Neural Network (ANN)

classifier_seq.compile(optimizer="rmsprop", 
                       loss="sparse_categorical_crossentropy", 
                       metrics=["accuracy"])
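
The output layer and the loss work together: softmax turns the four output scores into class probabilities, and sparse categorical cross-entropy is the negative log of the probability given to the integer-encoded true class. The sketch below works this through for a single sample in plain Python. The scores are made up, and the function names only echo the Keras loss name for readability.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(true_class, probs):
    """Negative log-probability assigned to the integer-encoded true class."""
    return -math.log(probs[true_class])

logits = [2.0, 1.0, 0.0, 0.0]                # made-up scores for the four categories
probs = softmax(logits)

loss_if_class_0 = sparse_categorical_crossentropy(0, probs)
loss_if_class_3 = sparse_categorical_crossentropy(3, probs)
uniform_loss = sparse_categorical_crossentropy(0, [0.25] * 4)

print(probs)
print(loss_if_class_0, loss_if_class_3, uniform_loss)
```

Keras averages this quantity over each batch. A model that guesses uniformly among four classes scores ln 4, roughly 1.386, which gives a reference point for the reported loss.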

#### Apply model

In [14]:
# fit ANN to the training set

classifier_seq.fit(X_train, y_train, batch_size=200, epochs=5, verbose = 1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x25fe47e6f28>
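
For a sense of how much training this is, the batch size and epoch count translate into a number of weight updates. A quick arithmetic check, using the training-set size printed earlier:

```python
import math

n_train = 150000     # rows in X_train
batch_size = 200
epochs = 5

steps_per_epoch = math.ceil(n_train / batch_size)   # batches per pass over the data
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)
```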

#### Evaluate model and calculate metrics

In [15]:
classifier_seq.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 500)               2500500   
_________________________________________________________________
dense_5 (Dense)              (None, 50)                25050     
_________________________________________________________________
dense_6 (Dense)              (None, 4)                 204       
Total params: 2,525,754
Trainable params: 2,525,754
Non-trainable params: 0
_________________________________________________________________
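
The parameter counts in the summary can be checked by hand: each Dense layer has one weight per input-output pair plus one bias per unit. A short sketch of that arithmetic for the 5000-500-50-4 architecture:

```python
layer_sizes = [5000, 500, 50, 4]   # input features, then units per Dense layer

params_per_layer = [
    n_in * n_out + n_out           # weight matrix plus one bias per unit
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer)
print(total_params)
```

Nearly all of the parameters sit in the first layer, which connects the 5000 vocabulary features to 500 hidden units.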


In [16]:
# loss and accuracy on the held-out test set

loss, accuracy = classifier_seq.evaluate(X_test, y_test, verbose=1)
print("Test Accuracy: {:.4f}".format(accuracy))

Test Accuracy: 0.6570


In [17]:
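# create predictions: take the most probable class for each test row
# (Sequential.predict_classes was removed in newer TensorFlow/Keras releases)

y_pred = np.argmax(classifier_seq.predict(X_test), axis=1)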

In [18]:
# calculate model metrics

print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))
print("Classification Report: ", classification_report(y_test,y_pred))
print("Accuracy: ", accuracy_score(y_test,y_pred))

Confusion Matrix:  [[8413 1815 1586  792]
 [2203 8172 1161 1043]
 [1591  501 8887 1393]
 [1298 1017 2751 7377]]
Classification Report:                precision    recall  f1-score   support

           0       0.62      0.67      0.64     12606
           1       0.71      0.65      0.68     12579
           2       0.62      0.72      0.66     12372
           3       0.70      0.59      0.64     12443

   micro avg       0.66      0.66      0.66     50000
   macro avg       0.66      0.66      0.66     50000
weighted avg       0.66      0.66      0.66     50000

Accuracy:  0.65698
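
The figures in the classification report follow directly from the confusion matrix, where rows are true classes and columns are predicted classes. As a cross-check, the sketch below recomputes accuracy, precision, recall, and F1 from the matrix printed above in plain Python.

```python
cm = [
    [8413, 1815, 1586,  792],
    [2203, 8172, 1161, 1043],
    [1591,  501, 8887, 1393],
    [1298, 1017, 2751, 7377],
]
n = len(cm)
row_totals = [sum(row) for row in cm]                             # true count per class (support)
col_totals = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # predicted count per class
correct = sum(cm[k][k] for k in range(n))                         # diagonal = correct predictions

accuracy = correct / sum(row_totals)
precision = [cm[k][k] / col_totals[k] for k in range(n)]
recall = [cm[k][k] / row_totals[k] for k in range(n)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

for k in range(n):
    print(k, round(precision[k], 2), round(recall[k], 2), round(f1[k], 2))
print("accuracy", round(accuracy, 5))
```

Since `LabelEncoder` assigns codes in sorted order of the category names (0 = news, 1 = science_and_technology, 2 = sports, 3 = video_games), the largest off-diagonal cell, 2751, corresponds to video_games comments predicted as sports.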
