**1. Neural Network Classifier with scikit-learn**

Using the multi-label classifier dataset from earlier exercises (categorized-comments.jsonl in the reddit folder), fit a neural network classifier using scikit-learn. Use the code found in chapter 12 of the Applied Text Analysis with Python book as a guideline. Report the accuracy, precision, recall, F1-score, and confusion matrix.

#### Import libraries

In [1]:
import pandas as pd, numpy as np, json, re, pickle

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, auc, precision_recall_fscore_support
from sklearn.metrics import classification_report
from sklearn.neural_network import MLPClassifier

#### Read data

In [3]:
def read_data(file):
    """
    Read a JSON Lines file (one JSON object per line)
    into a pandas DataFrame
    Args: full path to file
    Returns: pandas dataframe with data from file
    """
    
    data = []

    with open(file) as f:
        for line in f:
            data.append(json.loads(line))
        
    # convert to data frame
    
    return pd.DataFrame(data)
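
For JSON Lines files like this one, pandas can also parse the file directly. This is a minimal sketch, assuming each line holds one JSON object; the temporary file and its two records are made up purely for illustration and are not part of the dataset.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical two-record file standing in for the real dataset.
records = [
    {"con": 0, "txt": "first comment"},
    {"con": 1, "txt": "second comment"},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    for rec in records:
        tmp.write(json.dumps(rec) + "\n")
    path = tmp.name

# lines=True tells pandas to treat every line as a separate JSON document.
df = pd.read_json(path, lines=True)
os.remove(path)

print(df.shape)
```

Either approach should give the same two-column frame; the explicit loop in `read_data` is easier to extend if individual lines need filtering or error handling.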

In [4]:
# read controversy data

con_df = read_data('data/reddit/controversial-comments.jsonl')

# check size, structure and categories

con_df.info()  # prints its summary directly and returns None

print('Size: ', len(con_df), '\n',
      'Shape: ', con_df.shape, '\n',
      'Categories: ', con_df.con.unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950000 entries, 0 to 949999
Data columns (total 2 columns):
con    950000 non-null int64
txt    950000 non-null object
dtypes: int64(1), object(1)
memory usage: 14.5+ MB
Size:  950000 
 Shape:  (950000, 2) 
 Categories:  [0 1]


#### Preprocessing

In [2]:
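def clean_text(text):
    """
    Lowercase the text and replace tags, digits, punctuation
    and other non-letter characters with spaces
    Args: text 
    Output: text
    """
    
    text = text.lower()
    # drop HTML-escaped tags; a plain space avoids leaving 'lt'/'gt' tokens behind
    text = re.sub(r'&lt;/?.*?&gt;', ' ', text)
    # replace digits, non-word characters and underscores
    text = re.sub(r'\d|\W+|_', ' ', text)
    # replace anything that is still not an ASCII letter
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    
    return text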

# Create stop words list

stop_words = stopwords.words('english')
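
To see what the cleaning step does to a single comment, here is a small self-contained check. `strip_non_letters` is a hypothetical, simplified restatement that keeps only the lowercasing and the two non-letter substitutions, so the snippet runs on its own.

```python
import re


def strip_non_letters(text):
    """Hypothetical helper: lowercase, then replace non-letters with spaces."""
    text = text.lower()
    text = re.sub(r'\d|\W+|_', ' ', text)    # digits, non-word chars, underscores
    text = re.sub(r'[^a-zA-Z]', ' ', text)   # anything that is still not a letter
    return text


cleaned = strip_non_letters("I'll accept it!")
tokens = cleaned.split()

print(repr(cleaned))  # 'i ll accept it '
print(tokens)         # ['i', 'll', 'accept', 'it']
```

Note that the apostrophe is replaced by a space, so contractions split into fragments such as `ll`.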

In [5]:
# the full dataset is very large, so I take a sample from each of the 2 categories.
# by trial, a sample of 50000 from each category can be handled easily by my machine

size = 50000    # sample size
replace = True  # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]

controversy = con_df.groupby('con', as_index=False).apply(fn)

# free up memory

del con_df

controversy['txt'] = controversy['txt'].apply(clean_text)
controversy.reset_index(drop=True, inplace=True)

controversy.head()

|   | con | txt |
|---|-----|-----|
| 0 | 0 | and then the guinea worm will finally be extinct |
| 1 | 0 | yeah that was before he tried to influence the... |
| 2 | 0 | i ll accept it |
| 3 | 0 | democrats were told a week or so before the pu... |
| 4 | 0 | the doublespeak is strong with this one |
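
The sampling above is unseeded, so each run draws different rows, and because it samples with replacement, the same comment can appear more than once. One way to make the draw repeatable is `groupby(...).sample(...)` with a fixed `random_state`. This is a sketch on a made-up toy frame, assuming pandas 1.1 or later; `toy` and `balanced` are illustrative names only.

```python
import pandas as pd

# Toy frame standing in for con_df; the values are made up.
toy = pd.DataFrame({
    "con": [0, 0, 0, 1, 1],
    "txt": ["a", "b", "c", "d", "e"],
})

# Draw the same number of rows from each class, with replacement,
# using a fixed seed so the draw can be repeated.
balanced = (
    toy.groupby("con")
       .sample(n=4, replace=True, random_state=1)
       .reset_index(drop=True)
)

print(balanced["con"].value_counts().to_dict())
```

If duplicated comments are a concern (a duplicate can end up in both the training and the test split), `replace=False` is an option whenever each class has at least `n` rows.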


#### Create feature matrix

In [6]:
# create the vectorizer (bag-of-words counts, English stop words removed)

cv = CountVectorizer(stop_words=stop_words)

# create the feature matrix and target

X = cv.fit_transform(controversy['txt'])
Y = controversy['con']

# create train test split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=1)
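
To make the shape of the feature matrix concrete, here is a tiny bag-of-words example. The three documents and the short stop word list are made up for illustration; the real run uses the NLTK English list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus and stop word list, for illustration only.
docs = ["the worm will be extinct", "the doublespeak is strong", "strong worm"]
toy_cv = CountVectorizer(stop_words=["the", "will", "be", "is"])
toy_X = toy_cv.fit_transform(docs)

# vocabulary_ maps each kept term to its column index (terms sorted alphabetically).
print(sorted(toy_cv.vocabulary_))  # ['doublespeak', 'extinct', 'strong', 'worm']

# One row per document, one column per term, each cell a term count.
print(toy_X.toarray())
```

The result of `fit_transform` is a sparse matrix, which is what keeps a 100,000-row matrix with a large vocabulary manageable in memory.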

#### Apply MLPClassifier

In [7]:
mlp = MLPClassifier(hidden_layer_sizes=(30,30,30), max_iter=75)
mlp.fit(X_train,y_train)



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(30, 30, 30), learning_rate='constant',
       learning_rate_init=0.001, max_iter=75, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)
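
In the cells above, the vectorizer is fit on all sampled comments before the train/test split, so the vocabulary is learned from the test rows as well. One possible alternative is to split the raw text first and wrap both steps in a `Pipeline`, so the vectorizer only sees training data. This is a sketch on a made-up toy corpus, not the notebook's actual run; the `random_state` values are arbitrary choices for repeatability.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Made-up toy corpus; the labels carry no real meaning.
texts = ["good point", "bad take", "good idea", "bad faith",
         "good call", "bad move", "good work", "bad look"]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Split the raw text first so the vectorizer never sees test rows.
txt_train, txt_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels)

pipe = Pipeline([
    ("counts", CountVectorizer()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(30, 30, 30),
                          max_iter=75, random_state=1)),
])
pipe.fit(txt_train, lab_train)   # fits the vocabulary and the network on train only
toy_pred = pipe.predict(txt_test)

print(len(toy_pred))
```

With `max_iter=75` scikit-learn may emit a `ConvergenceWarning` if the optimizer has not converged; raising `max_iter` or setting `early_stopping=True` are the usual knobs to try.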

#### Calculate accuracy and other model metrics

In [13]:
predictions = mlp.predict(X_test)
print('Confusion Matrix: ',confusion_matrix(y_test,predictions))  
print('Classification Report:',classification_report(y_test,predictions)) 
print('Accuracy: ',accuracy_score(y_test,predictions))

Confusion Matrix:  [[8743 3835]
 [3407 9015]]
Classification Report:               precision    recall  f1-score   support

           0       0.72      0.70      0.71     12578
           1       0.70      0.73      0.71     12422

   micro avg       0.71      0.71      0.71     25000
   macro avg       0.71      0.71      0.71     25000
weighted avg       0.71      0.71      0.71     25000

Accuracy:  0.71032
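
As a cross-check, the reported accuracy and the class-1 precision, recall and F1 can be recomputed by hand from the confusion matrix above. In scikit-learn's layout, rows are true labels and columns are predicted labels. The variable names are illustrative.

```python
# Confusion matrix reported above: rows are true labels, columns are predictions.
cm = [[8743, 3835],
      [3407, 9015]]

tn, fp = cm[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm[1]   # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # of all predicted 1, how many are truly 1
recall_1 = tp / (tp + fn)      # of all true 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 5), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
# 0.71032 0.7 0.73 0.71
```

These values agree with the class-1 row of the classification report and with the accuracy line.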
