**Neural Network Classifier with Keras**

Using the multi-label classifier dataset from earlier exercises (categorized-comments.jsonl in the reddit folder), fit a neural network classifier using Keras. Use the code found in chapter 12 of the Applied Text Analysis with Python book as a guideline. Report the accuracy, precision, recall, F1-score, and confusion matrix.

In [2]:
import pandas as pd, numpy as np, json, re, pickle, keras

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support
from sklearn.metrics import classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.preprocessing import LabelEncoder

In [3]:
def read_data(file):
    """
    Take a json file location and
    read the file into a pandas data frame
    Args: full path to file
    Returns: pandas dataframe with data from file
    """
    
    data = []

    with open(file) as f:
        for line in f:
            data.append(json.loads(line))
        
    # convert to data frame
    
    return pd.DataFrame(data)
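
For a well-formed JSON Lines file, pandas can do the same parsing in a single call. This is a minimal sketch, assuming the file fits in memory; the two records are made up for illustration and only mimic the `cat`/`txt` layout of the dataset.

```python
import io
import pandas as pd

# Two made-up records in the same {"cat": ..., "txt": ...} layout as the dataset.
sample = io.StringIO(
    '{"cat": "sports", "txt": "great game last night"}\n'
    '{"cat": "video_games", "txt": "the new patch is out"}\n'
)

# lines=True tells pandas to parse one JSON object per line.
demo_df = pd.read_json(sample, lines=True)
print(demo_df.shape)
```

Passing a file path instead of the `StringIO` object works the same way.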

In [5]:
# read category data
# The original file had a record missing its closing " and }, which caused a parse error.
# This copy of the file has that record completed by hand.
cat_df = read_data('categorized-comments-copy.jsonl')

# check size, structure and categories
# info() prints its summary and returns None, so it is called on its own

cat_df.info()
print('Size: ', len(cat_df), '\n',
      'Categories: ', cat_df['cat'].unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 606476 entries, 0 to 606475
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   cat     606476 non-null  object
 1   txt     606476 non-null  object
dtypes: object(2)
memory usage: 9.3+ MB
Size:  606476 
 Categories:  ['sports' 'science_and_technology' 'video_games']
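
Instead of repairing a damaged record by hand, the reader could skip lines that fail to parse and report where they were. The sketch below is one possible approach, not the method used in this notebook; `read_jsonl_lenient` is a hypothetical helper name and the sample records are made up.

```python
import io
import json

def read_jsonl_lenient(handle):
    """Parse JSON Lines, collecting the line numbers that fail to parse."""
    records, bad_lines = [], []
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines.append(line_no)
    return records, bad_lines

# Made-up sample: the second record is cut off before its closing quote and brace.
sample = io.StringIO(
    '{"cat": "sports", "txt": "ok"}\n'
    '{"cat": "sports", "txt": "cut off\n'
)
records, bad_lines = read_jsonl_lenient(sample)
print(len(records), bad_lines)
```

Reporting the bad line numbers keeps the data loss visible rather than silent.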


In [6]:
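def clean_text(text):
    """
    Lowercase the text and replace tags, digits, punctuation and
    other non-letter characters with spaces
    Args: text (str)
    Returns: cleaned text (str)
    """
    
    text = text.lower()
    # drop escaped HTML tags such as &lt;b&gt;; replacing them with a space
    # avoids leaving 'lt' / 'gt' tokens behind after the steps below
    text = re.sub(r'&lt;/?.*?&gt;', ' ', text)
    # digits, underscores and runs of non-word characters become spaces
    text = re.sub(r'\d|\W+|_', ' ', text)
    # anything that is still not an ASCII letter becomes a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    
    return text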

# Create stop words list

stop_words = stopwords.words('english')
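
To make the effect of the digit, punctuation and non-letter substitutions concrete, the sketch below applies them to a made-up comment. The final whitespace-collapsing step is an optional extra that is not part of the notebook's `clean_text`.

```python
import re

sample = "Check out r/nba_2k19 -- it's 100% worth it!"

lowered = sample.lower()
# Digits, underscores, and runs of non-word characters each become one space.
step1 = re.sub(r'\d|\W+|_', ' ', lowered)
# Anything that is still not an ASCII letter becomes a space.
step2 = re.sub(r'[^a-zA-Z]', ' ', step1)
# Optional extra step (not in the notebook): collapse repeated spaces.
collapsed = ' '.join(step2.split())
print(collapsed)   # check out r nba k it s worth it
```

Contractions are split into fragments such as `it s` and `don t`, which is why those fragments show up in the cleaned comments.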

In [7]:
# The full file is too large for my machine, so I sample from each of the 3 categories.
# Using 50000 comments per category; larger samples did not run on my machine,
# which also struggled with the earlier text-based exercises.

size = 50000    # sample size
replace = True  # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]

category = cat_df.groupby('cat', as_index=False).apply(fn)

# free up memory

del cat_df

category['txt'] = category['txt'].apply(clean_text)
category.reset_index(drop=True, inplace=True)

category.head()

   cat                     txt
0  science_and_technology  i think the whole point of having a beta is be...
1  science_and_technology  gpmdp is just the first letter of every word i...
2  science_and_technology  you don t have to be on desktop to change flai...
3  science_and_technology  unfortunately it s something i see repeated ag...
4  science_and_technology  inside of a browser that i don t use
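
Because the sampling above draws with replacement, the same comment can be picked more than once and can then land in both the training and the test split. A possible alternative, sketched here on a made-up toy frame, is `DataFrameGroupBy.sample` (available in pandas 1.1 and later) without replacement and with a fixed seed.

```python
import pandas as pd

# Toy frame standing in for cat_df (made-up rows).
toy = pd.DataFrame({
    "cat": ["sports"] * 5 + ["video_games"] * 5,
    "txt": [f"comment {i}" for i in range(10)],
})

# Draw 3 rows per category without replacement; random_state makes it repeatable.
balanced = toy.groupby("cat").sample(n=3, replace=False, random_state=1)
balanced = balanced.reset_index(drop=True)
print(balanced["cat"].value_counts())
```

Each category must contain at least `n` rows for sampling without replacement to succeed.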


In [8]:
# check the unique categories

#category["cat"].unique()
category.groupby(["cat"]).size()

cat
science_and_technology    50000
sports                    50000
video_games               50000
dtype: int64

In [9]:
encoder = LabelEncoder()

cat = category["cat"]
category["cat"]=encoder.fit_transform(cat)
category.groupby("cat").count()

cat  txt
0    50000
1    50000
2    50000
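
The integer codes are easier to interpret alongside the names they stand for. `LabelEncoder` sorts the class names, so the mapping can be read from `classes_`. The sketch below uses a small made-up label list and a separate encoder, `enc`, rather than the notebook's fitted `encoder`.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["sports", "science_and_technology", "video_games", "sports"]

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# classes_ is sorted, so the position of each name is its integer code.
mapping = {name: code for code, name in enumerate(enc.classes_)}
print(mapping)
print(enc.inverse_transform([0, 1, 2]))
```

With the three categories in this dataset, 0 is `science_and_technology`, 1 is `sports` and 2 is `video_games`. This is the order of the rows in the confusion matrix and classification report.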


In [10]:
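# set the number of features and classes

N_FEATURES = 5000   # vocabulary size kept by the vectorizer
N_CLASSES = 3       # science_and_technology, sports, video_games
N_UNITS = 2500      # not used by the model defined below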

In [11]:
# create the feature matrix

cv = CountVectorizer(analyzer='word',
                     stop_words=stop_words, 
                     max_features = N_FEATURES,
                     max_df = 0.5,
                     min_df = 3)

# create target and sample

X = cv.fit_transform(category['txt'])
y = category['cat']

# create train test split
# Splitting 75 25
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
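
A plain random split does not keep the classes exactly balanced in the test set. Passing `stratify=y` to `train_test_split` preserves the class proportions in both parts. The sketch below shows this on made-up toy data; it is an optional variation, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 12 samples, 3 perfectly balanced classes (made-up values).
X_toy = np.arange(24).reshape(12, 2)
y_toy = np.array([0, 1, 2] * 4)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)

# Each class contributes the same number of samples to each part.
print(np.bincount(y_tr), np.bincount(y_te))
```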

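A dense feed-forward network takes each sample as a fixed-length, one-dimensional feature vector. The next cell therefore confirms that the train and test matrices have the shape (samples, N_FEATURES) and that the label arrays match them in length.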

In [12]:
print(str(X_train.shape))
print(str(X_test.shape))
print(str(y_train.shape))
print(str(y_test.shape))

(112500, 5000)
(37500, 5000)
(112500,)
(37500,)
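
`CountVectorizer` returns a SciPy sparse matrix. Some Keras versions accept sparse input directly and others need a dense array. The sketch below shows the conversion on three made-up documents; `toy_cv`, `X_sparse` and `X_dense` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the game was great", "great patch for the game", "new phone released"]

toy_cv = CountVectorizer()
X_sparse = toy_cv.fit_transform(docs)   # SciPy sparse matrix, one row per document

# Dense float32 copy for Keras versions that reject sparse input.
X_dense = X_sparse.toarray().astype("float32")
print(X_sparse.shape, X_dense.dtype)
print(sorted(toy_cv.vocabulary_))
```

At the size used here, a dense float32 copy of the 112500 x 5000 training matrix would take about 2.25 GB, so converting in batches may be the safer choice on a memory-limited machine.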


#### Create classifier for ANN

In [13]:
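# initialize a feed-forward network: two hidden ReLU layers and a softmax output layer

classifier_seq = Sequential()

classifier_seq.add(Dense(units=500,activation="relu",input_shape=(N_FEATURES,)))
classifier_seq.add(Dense(units=50, activation="relu"))
# Note: the output layer has 4 units although the labels only take 3 values (0, 1, 2).
# Training still works with sparse_categorical_crossentropy, but units=3 would match the data.
classifier_seq.add(Dense(units=4, activation="softmax"))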

# compile the Artificial Neural Network (ANN)

classifier_seq.compile(optimizer="rmsprop", 
                       loss="sparse_categorical_crossentropy", 
                       metrics=["accuracy"])
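
To make the output layer and loss more concrete, the sketch below reproduces in NumPy what a softmax layer and sparse categorical cross-entropy compute. The scores and labels are made up, and this is an illustration of the arithmetic rather than Keras's own implementation.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(y_true, probs):
    """Mean negative log-probability assigned to the true integer label."""
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.log(picked).mean())

logits = np.array([[2.0, 0.5, 0.1],    # made-up scores for 2 samples, 3 classes
                   [0.2, 0.2, 3.0]])
y_true = np.array([0, 2])              # integer labels, as produced by LabelEncoder

probs = softmax(logits)
print(probs.sum(axis=1))               # each row sums to 1
print(sparse_categorical_crossentropy(y_true, probs))
```

The "sparse" variant takes integer labels directly, which is why the encoded `cat` column can be used without one-hot encoding.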

#### Apply model

In [14]:
# fit ANN to the training set

classifier_seq.fit(X_train, y_train, batch_size=200, epochs=5, verbose = 1)

Train on 112500 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1a1eb022438>

#### Evaluate model and calculate metrics

In [15]:
classifier_seq.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 500)               2500500   
_________________________________________________________________
dense_1 (Dense)              (None, 50)                25050     
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 204       
Total params: 2,525,754
Trainable params: 2,525,754
Non-trainable params: 0
_________________________________________________________________
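
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The sketch below redoes that arithmetic; `dense_params` is a hypothetical helper written for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [5000, 500, 50, 4]   # input width followed by each Dense layer's units
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)        # [2500500, 25050, 204]
print(sum(per_layer))   # 2525754
```

Almost all of the parameters sit in the first layer, so the vocabulary size is the main driver of model size.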


In [16]:
# Check loss and accuracy on the held-out test set
loss, accuracy = classifier_seq.evaluate(X_test, y_test, verbose=1)
print("Test Accuracy: {:.4f}".format(accuracy))

Test Accuracy: 0.7879


In [17]:
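# create predictions
# predict_classes() was removed in TensorFlow 2.6; taking the argmax of the
# softmax output gives the same class ids
y_pred = np.argmax(classifier_seq.predict(X_test), axis=-1)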
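
For a softmax model, the predicted class is the index of the largest probability in each row. The sketch below shows this on made-up probabilities and maps the ids back to category names, assuming the alphabetical ordering that `LabelEncoder` produces.

```python
import numpy as np

# Made-up softmax outputs: 3 samples, 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])
class_names = np.array(["science_and_technology", "sports", "video_games"])

pred_ids = probs.argmax(axis=-1)
print(pred_ids)               # [0 2 1]
print(class_names[pred_ids])
```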



In [18]:
# calculate model metrics: accuracy, precision, recall, F1-score, and confusion matrix
print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))
print("Classification Report: ", classification_report(y_test,y_pred))
print("Accuracy: ", accuracy_score(y_test,y_pred))

Confusion Matrix:  [[11166   624   745]
 [  995  9513  2070]
 [ 1403  2117  8867]]
Classification Report:                precision    recall  f1-score   support

           0       0.82      0.89      0.86     12535
           1       0.78      0.76      0.77     12578
           2       0.76      0.72      0.74     12387

    accuracy                           0.79     37500
   macro avg       0.79      0.79      0.79     37500
weighted avg       0.79      0.79      0.79     37500

Accuracy:  0.7878933333333333
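
The per-class figures in the classification report follow directly from the confusion matrix. In scikit-learn's layout, rows are true classes and columns are predicted classes. The sketch below recomputes them with NumPy as a cross-check.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[11166,   624,   745],
               [  995,  9513,  2070],
               [ 1403,  2117,  8867]])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)   # correct / everything predicted as the class
recall = true_pos / cm.sum(axis=1)      # correct / everything that truly is the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = float(true_pos.sum() / cm.sum())

print(np.round(precision, 2))   # [0.82 0.78 0.76]
print(np.round(recall, 2))      # [0.89 0.76 0.72]
print(np.round(f1, 2))          # [0.86 0.77 0.74]
print(round(accuracy, 4))       # 0.7879
```

Most of the errors are between classes 1 and 2 (sports and video_games under the encoder's alphabetical ordering). Class 0 is separated best.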
