In an earlier lesson, you created text classification models using logistic regression and Naive Bayes. In this exercise, you will expand on those models, learn how to evaluate their performance, and select the best one.

### 1. Evaluating Binary Classifiers

#### a. Model Metrics

Using the same model parameters you used in the text classification exercise, calculate the model’s accuracy, ROC AUC, precision, recall, and F1-score.

<img src="files/Images/ex9-1.jpg">
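
As a minimal sketch of how these five metrics can be computed with scikit-learn, the example below uses a small hand-made set of labels instead of the Reddit data. The arrays `y_true`, `y_pred` and `y_score` are made up for illustration. Note that ROC AUC is computed from scores or probabilities rather than from hard class predictions.

```python
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             precision_score, recall_score, f1_score)

# Hypothetical true labels, hard predictions and positive-class scores
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.6, 0.8, 0.7, 0.4, 0.2]

metrics = {
    'accuracy': accuracy_score(y_true, y_pred),
    'roc_auc': roc_auc_score(y_true, y_score),   # uses scores, not labels
    'precision': precision_score(y_true, y_pred),
    'recall': recall_score(y_true, y_pred),
    'f1': f1_score(y_true, y_pred),
}

for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```

With a fitted classifier, `y_score` would typically come from `predict_proba(X_test)[:, 1]`.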

#### b. Cross Validation

Using the logistic regression model with the L2 penalty, calculate the cross-validation score of the model’s ROC AUC using three-fold cross validation.
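
The call pattern for this step might look like the sketch below. It runs on synthetic data from `make_classification` rather than the vectorized comments, and the names `X_demo`, `y_demo` and `model` are placeholders chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary data standing in for the real feature matrix
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     random_state=0)

# L2 is the default penalty of LogisticRegression
model = LogisticRegression(max_iter=1000)

# With an integer cv and a classifier, the folds are stratified by class
scores = cross_val_score(model, X_demo, y_demo, cv=3, scoring='roc_auc')

print('ROC AUC per fold:', scores)
print('Mean ROC AUC: %.3f (std %.3f)' % (scores.mean(), scores.std()))
```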

### 2. Evaluating Multi-class Classifiers

Using the same data you used for the multi-class text classifier in the text classification lesson, calculate the following model metrics.

#### a. Model Metrics

<img src="files/Images/ex9-2.jpg">
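
For more than two classes, precision, recall and F1 have to be averaged across classes. The sketch below uses made-up three-class labels and assumes macro averaging; `average='weighted'` or `average='micro'` are alternatives, and the exercise does not say which one is expected.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Hypothetical three-class labels and predictions
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# average='macro' gives every class the same weight
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='macro')

print('Accuracy: ', accuracy)
print('Confusion matrix:\n', cm)
print('Precision: ', precision)
print('Recall: ', recall)
print('F1-Score: ', f1)
```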

#### b. Cross Validation

Using the logistic regression model with the L2 penalty, calculate the cross-validation score of the model’s F1 score using three-fold cross validation.
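
In scikit-learn, the plain `'f1'` scorer applies to binary targets only, so a multi-class problem needs an averaged variant. The sketch below assumes `'f1_macro'` and synthetic three-class data; `X_demo`, `y_demo` and `model` are placeholder names for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic three-class data standing in for the categorized comments
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

# L2 is the default penalty of LogisticRegression
model = LogisticRegression(max_iter=1000)

# 'f1_macro' averages the per-class F1 scores with equal weight
scores = cross_val_score(model, X_demo, y_demo, cv=3, scoring='f1_macro')

print('F1 per fold:', scores)
print('Mean F1: %.3f (std %.3f)' % (scores.mean(), scores.std()))
```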

### 3. Hyperparameter Selection

Use the GridSearchCV function to select the best binary classification model. For this problem, only consider the logistic regression model. For the search parameters, try both the L1 and L2 penalties with C values of np.logspace(0, 4, 10). Report the cross-validation score and parameters of the best model.

<img src="files/Images/ex9-4.jpg">
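
A grid search over these settings could be set up roughly as follows. This is a sketch on synthetic data, not the exercise solution: the `liblinear` solver, the three folds and the `roc_auc` scoring are assumptions made for this example, and `X_demo` and `y_demo` are placeholder names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary data standing in for the real feature matrix
X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     random_state=0)

# 2 penalties x 10 values of C = 20 candidate models
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': np.logspace(0, 4, 10),
}

# liblinear supports both the L1 and the L2 penalty
grid = GridSearchCV(LogisticRegression(solver='liblinear'),
                    param_grid, cv=3, scoring='roc_auc')
grid.fit(X_demo, y_demo)

print('Best CV score: ', grid.best_score_)
print('Best parameters: ', grid.best_params_)
```

After fitting, `grid.best_estimator_` holds the model refit on all the data with the best parameters.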

#### Centralized import of all libraries

In [20]:
import pandas as pd, numpy as np, json, re, pickle

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc, roc_auc_score, precision_recall_fscore_support
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.preprocessing import StandardScaler

#### Text cleaning

In [3]:
def clean_text(text):
    """
    Remove punctuation, digits and special characters, and convert to lower case
    Args: text 
    Output: text
    """
    
    text=text.lower()
    text=re.sub('&lt;/?.*?&gt;',' &lt;&gt', text)
    text=re.sub('\\d|\\W+|_',' ',text)
    
    return text

# Create stop words list
stop_words = stopwords.words('english')

#### Create Classifiers

In [56]:
# Create Bernoulli NB classifier
bnb = BernoulliNB(class_prior=[0.5, 0.5])  # priors should sum to 1; both classes are sampled in equal numbers

# Create Multinomial NB classifier
mnb = MultinomialNB()

# Create Logistic Regression classifier (for Penalty l1 and l2)
lr1 = LogisticRegression(penalty = 'l1', solver = 'saga', max_iter = 500, multi_class = 'auto')
lr2 = LogisticRegression(penalty = 'l2', solver = 'saga', max_iter = 500, multi_class = 'auto')

#### Read data

In [None]:
def read_data(file):
    """
    Take a json file location and
    read the file into a pandas data frame
    Args: full path to file
    Returns: pandas dataframe with data from file
    """
    
    data = []

    with open(file) as f:
        for line in f:
            data.append(json.loads(line))
        
    # convert to data frame
    
    return pd.DataFrame(data)

## 1. Evaluating Binary Classifiers

In [26]:
# read controversial comments data

#cat_df = read_data('data/reddit/categorized-comments.jsonl')
con_df = read_data('data/reddit/controversial-comments.jsonl')

# check size, structure and categories
# (info() prints its own report and returns None, so call it separately)

con_df.info()
print('Size: ', len(con_df), '\n',
      'Shape: ', con_df.shape, '\n',
      'Categories: ', con_df.con.unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950000 entries, 0 to 949999
Data columns (total 2 columns):
con    950000 non-null int64
txt    950000 non-null object
dtypes: int64(1), object(1)
memory usage: 10.9+ MB
Size:  950000 
 Shape:  (950000, 2) 
 Categories:  [0 1]


In [29]:
# since the data set is huge, I will take a sample from each of the 2 categories.
# by trial, a sample of 50000 from each category can be easily handled by my machine

size = 50000    # sample size
replace = True  # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]

controversy = con_df.groupby('con', as_index=False).apply(fn)

# free up memory

del con_df

controversy['txt'] = controversy['txt'].apply(lambda x:clean_text(x))
controversy.reset_index(drop=True, inplace=True)

controversy.head()

|   | con | txt |
|---|-----|-----|
| 0 | 0 | statistically no they just use up a ton of wel... |
| 1 | 0 | to be fair corey was also uniquely experienced... |
| 2 | 0 | if trump s comment about pound hackers did... |
| 3 | 0 | gt clinton on the other hand was overprepared... |
| 4 | 0 | those type of people have always been out ther... |


### a. Model Metrics

In [57]:
# create the feature matrix

cv = CountVectorizer(stop_words=stop_words)

# create target and sample

X = cv.fit_transform(controversy['txt'])
Y = controversy['con']

# map classifier codes to the estimators created above
classifiers = {'BNB': bnb, 'LR1': lr1, 'LR2': lr2, 'MNB': mnb}

def evaluate_model(X, Y, test_size, classifier):
    """
    Evaluate a binary classifier on a held-out test set
    Args: features, target, test set fraction (e.g. 0.25),
          classifier code (BNB, LR1, LR2, MNB)
    Output: model scores (accuracy, ROC AUC, precision, recall, fscore);
            precision, recall and fscore are weighted averages
    """

    if classifier not in classifiers:
        raise ValueError('Wrong classifier: ' + str(classifier))

    # create train test split

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=1)

    # fit the selected model and predict on the test set

    model = classifiers[classifier].fit(X_train, Y_train)
    Y_pred = model.predict(X_test)

    accuracy = accuracy_score(Y_test, Y_pred)

    # ROC AUC is computed from the predicted probability of the positive class,
    # not from the hard class predictions

    roc_auc = roc_auc_score(Y_test, model.predict_proba(X_test)[:, 1])

    precision, recall, fscore, _ = precision_recall_fscore_support(Y_test, Y_pred, average='weighted')

    return accuracy, roc_auc, precision, recall, fscore


### Additional explanation
I am getting a non-convergence warning from the logistic regression models. I increased max_iter up to 500. Raising it further might remove the warning, but my CPU cannot handle more than that, so I left max_iter at 500.

In [None]:
# evaluate each model on a 25% test split

models = [('Logistic Regression (L1)', 'LR1'),
          ('Logistic Regression (L2)', 'LR2'),
          ('Naive Bayes', 'BNB')]

for name, code in models:
    # roc_auc (rather than auc) avoids shadowing the imported sklearn auc function
    accuracy, roc_auc, precision, recall, fscore = evaluate_model(X, Y, 0.25, code)
    print(name)
    print('Accuracy: ', accuracy)
    print('AUC: ', roc_auc)
    print('Precision: ', precision)
    print('Recall: ', recall)
    print('F1-Score: ', fscore)

### b. Cross Validation

In [None]:
seed = 7

# shuffle the rows: the sample is ordered by category, so unshuffled folds
# would contain a single class and ROC AUC could not be computed on them
kfold = KFold(n_splits=3, shuffle=True, random_state=seed)

# the exercise asks for ROC AUC, not accuracy
scoring = 'roc_auc'
results = cross_val_score(lr2, X, Y, cv=kfold, scoring=scoring)
print("ROC AUC: %.3f (%.3f)" % (results.mean(), results.std()))

# https://www.ritchieng.com/machine-learning-cross-validation/