In an earlier lesson, you created text classification models using logistic regression and Naive Bayes. In this exercise, you will expand on those models, learn how to evaluate their performance properly, and select the best one.

### 1. Evaluating Binary Classifiers

#### a. Model Metrics

Using the same model parameters you used in the text classification exercise, calculate the model’s accuracy, ROC AUC, precision, recall, and F1-score.

<img src="files/Images/ex9-1.jpg">
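As a small illustration of the five metrics named above, the sketch below scores a handful of made-up labels and probabilities. The values of `y_true` and `y_prob` and the 0.5 threshold are assumptions chosen so the results can be checked by hand; they are not taken from the exercise data.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical labels and predicted probabilities for six comments
y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.4, 0.6, 0.3, 0.7, 0.9]    # assumed P(class 1)
y_pred = [int(p >= 0.5) for p in y_prob]   # hard labels at an assumed 0.5 threshold

metrics = {
    'accuracy': accuracy_score(y_true, y_pred),
    'roc_auc': roc_auc_score(y_true, y_prob),   # ranks the scores, not the labels
    'precision': precision_score(y_true, y_pred),
    'recall': recall_score(y_true, y_pred),
    'f1': f1_score(y_true, y_pred),
}

for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```

Here two positives and two negatives are classified correctly, so accuracy, precision, recall, and F1 all come to 2/3, while 7 of the 9 positive/negative pairs are ranked correctly, giving a ROC AUC of 7/9.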

#### b. Cross Validation

Using the logistic regression model with the L2 penalty, calculate the cross-validation score of the model’s ROC AUC using three-fold cross validation.
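A minimal sketch of three-fold cross-validated ROC AUC is shown here on synthetic data, not the Reddit comments. The data set from `make_classification` and the `max_iter` value are assumptions for illustration; `LogisticRegression` applies an L2 penalty by default.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary data standing in for the vectorized comments
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Default penalty is L2; max_iter is raised only to avoid convergence warnings
model = LogisticRegression(max_iter=1000)

# One ROC AUC value per fold
scores = cross_val_score(model, X, y, cv=3, scoring='roc_auc')
print('Fold scores:', scores)
print('Mean score: ', scores.mean())
```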

### 2. Evaluating Multi-class Classifiers

Using the same data you used for the multi-class text classifier in the text classification lesson, calculate the following model metrics.

#### a. Model Metrics

<img src="files/Images/ex9-2.jpg">

#### b. Cross Validation

Using the logistic regression model with the L2 penalty, calculate the cross-validation score of the model’s F1 score using three-fold cross validation.

### 3. Hyperparameter Selection

Use the GridSearchCV function to select the best binary classification model. For this problem, only consider the logistic regression model. For the search parameters, try both the L1 and L2 penalties with C values of np.logspace(0, 4, 10). Report the cross-validation score and parameters of the best model.

<img src="files/Images/ex9-4.jpg">

#### Centralized import of all libraries

In [1]:
import pandas as pd, numpy as np, json, re, pickle

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, auc, precision_recall_fscore_support
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.preprocessing import StandardScaler

#### Text cleaning

In [2]:
def clean_text(text):
    """
    Lowercase the text and replace digits, punctuation, other non-word
    characters, and underscores with spaces
    Args: text 
    Output: text
    """
    
    text=text.lower()
    text=re.sub('&lt;/?.*?&gt;',' &lt;&gt', text)
    text=re.sub('\\d|\\W+|_',' ',text)
    
    return text

# Create stop words list
stop_words = stopwords.words('english')
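To make the effect of the second substitution in clean_text concrete, the sketch below applies the same pattern to a made-up sentence. The helper name `strip_non_letters` and the sample text are hypothetical; only the regular expression is taken from the function above.

```python
import re

def strip_non_letters(text):
    # Same pattern as in clean_text: each digit, each run of non-word
    # characters, and each underscore is replaced by a single space
    return re.sub(r'\d|\W+|_', ' ', text.lower())

sample = "I don't see 2 good_options!"
tokens = strip_non_letters(sample).split()
print(tokens)   # ['i', 'don', 't', 'see', 'good', 'options']
```

The apostrophe is treated like any other punctuation mark, which is why contractions show up as "don t" in the cleaned comments.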

#### Create Classifiers

In [5]:
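# Create Bernoulli NB classifier
# Note: class_prior expects one prior per class; these two values do not
# sum to 1 and weight class 1 twice as heavily as class 0
bnb = BernoulliNB(class_prior=[0.25, 0.5])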

# Create Multinomial NB classifier
mnb = MultinomialNB()

# Create Logistic Regression classifier (for Penalty l1 and l2)
lr1 = LogisticRegression(penalty = 'l1', solver = 'saga', max_iter = 100, multi_class = 'auto')
lr2 = LogisticRegression(penalty = 'l2', solver = 'saga', max_iter = 100, multi_class = 'auto')

#### Read data

In [6]:
def read_data(file):
    """
    Take a json file location and
    read the file into a pandas data frame
    Args: full path to file
    Returns: pandas dataframe with data from file
    """
    
    data = []

    with open(file) as f:
        for line in f:
            data.append(json.loads(line))
        
    # convert to data frame
    
    return pd.DataFrame(data)
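For comparison, pandas can parse JSON Lines input directly with `pd.read_json(..., lines=True)`. The sketch below uses two made-up records held in memory instead of the exercise files, so the record contents are assumptions.

```python
import io
import pandas as pd

# Two hypothetical records in JSON Lines format (one JSON object per line)
jsonl = io.StringIO(
    '{"con": 0, "txt": "first comment"}\n'
    '{"con": 1, "txt": "second comment"}\n'
)

# lines=True tells pandas to treat every line as a separate record
df = pd.read_json(jsonl, lines=True)
print(df.shape)
print(df)
```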

## 1. Evaluating Binary Classifiers

In [7]:
# read controversy data

#cat_df = read_data('data/reddit/categorized-comments.jsonl')
con_df = read_data('data/reddit/controversial-comments.jsonl')

# check size, structure and categories

con_df.info()
print('Size: ', len(con_df), '\n',
      'Shape: ', con_df.shape, '\n',
      'Categories: ', con_df.con.unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950000 entries, 0 to 949999
Data columns (total 2 columns):
con    950000 non-null int64
txt    950000 non-null object
dtypes: int64(1), object(1)
memory usage: 10.9+ MB
Size:  950000 
 Shape:  (950000, 2) 
 Categories:  [0 1]


In [8]:
# since the data set is very large, I take a sample from each of the 2 categories.
# by trial, a sample of 50000 from each category can be handled easily by my machine

size = 50000    # sample size
replace = True  # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]

controversy = con_df.groupby('con', as_index=False).apply(fn)

# free up memory

del con_df

controversy['txt'] = controversy['txt'].apply(lambda x:clean_text(x))
controversy.reset_index(drop=True, inplace=True)

controversy.head()

|   | con | txt |
|---|-----|-----|
| 0 | 0 | deleted |
| 1 | 0 | johnson is on the rise |
| 2 | 0 | well her being a woman might be number but i... |
| 3 | 0 | i don t see how the dems are currently doing i... |
| 4 | 0 | i don t want to say more extreme but when i th... |
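The sampling cell above draws with replacement, so the same comment can be picked more than once and, after a random split, can end up in both the training and the test set. One possible alternative, sketched here on a made-up miniature frame, is `DataFrameGroupBy.sample` (available in pandas 1.1 and later) without replacement. It assumes every class has at least `n` rows, which has not been verified for the exercise data.

```python
import pandas as pd

# Hypothetical miniature of the controversy data: 6 rows per class
df = pd.DataFrame({
    'con': [0] * 6 + [1] * 6,
    'txt': [f'comment {i}' for i in range(12)],
})

# Draw 3 distinct rows per class; random_state makes the draw repeatable
sample = (df.groupby('con')
            .sample(n=3, replace=False, random_state=1)
            .reset_index(drop=True))

print(sample['con'].value_counts().to_dict())
print('All rows distinct:', sample['txt'].is_unique)
```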


### a. Model Metrics

In [14]:
def evaluate_model(X_train, X_test, Y_train, Y_test, classifier, binary):
    """
    Fit a classifier and return its scores on the test set.
    AUC is returned for binary models only
    Args: feature - train, feature - test, target - train, target - test,
          classifier (BNB, LR1, LR2, MNB), binary (Y, N)
    Output: (accuracy, auc, precision, recall, fscore) if binary == 'Y',
            otherwise (accuracy, precision, recall, fscore)
    """

    # Map the classifier codes to the estimators created above
    models = {'BNB': bnb, 'LR1': lr1, 'LR2': lr2, 'MNB': mnb}

    if classifier not in models:
        raise ValueError('Wrong classifier: ' + str(classifier))

    model = models[classifier].fit(X_train, Y_train)

    Y_pred = model.predict(X_test)

    accuracy = accuracy_score(Y_test, Y_pred)

    # Class-weighted averages; support is None when an average is requested
    precision, recall, fscore, _ = precision_recall_fscore_support(Y_test, Y_pred, average='weighted')

    if binary == 'Y':
        # Note: AUC is computed from hard class predictions here, not from
        # predicted probabilities, so it is only a coarse ROC AUC estimate
        auc = roc_auc_score(Y_test, Y_pred)
        return accuracy, auc, precision, recall, fscore

    return accuracy, precision, recall, fscore


# function for k-fold cross validation

def evaluate_crossval(X_train, X_test, Y_train, Y_test, classifier, fold, scoring='roc_auc'):
    """
    Perform k-fold cross validation of the LR model on the training data
    Args: feature - train, feature - test, target - train, target - test,
          classifier (LR1, LR2), k-fold, scoring (scikit-learn scorer name)
    Output: array of cross validation scores, one per fold
    """

    if classifier == 'LR1':
        lr = lr1
    elif classifier == 'LR2':
        lr = lr2
    else:
        raise ValueError('Wrong classifier: ' + str(classifier))

    # cross_val_score clones and fits the estimator on each fold,
    # so no separate fit is needed beforehand
    cvs = cross_val_score(estimator = lr, X = X_train, y = Y_train, n_jobs=-1, cv = fold, scoring = scoring)

    return cvs

In [10]:
# create the feature matrix

cv = CountVectorizer(stop_words=stop_words)

# create target and sample

X_bc = cv.fit_transform(controversy['txt'])
Y_bc = controversy['con']

# create train test split

X_train, X_test, Y_train, Y_test = train_test_split(X_bc, Y_bc, test_size=0.25, random_state=1)
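In the cell above the vocabulary is learned from all rows before the split. A possible variation, sketched here on a made-up toy corpus, wraps the vectorizer and the classifier in a pipeline so that each cross-validation fold learns its vocabulary from its own training portion only. The texts, labels, and `max_iter` value are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 1 = controversial, 0 = not controversial
texts = ['tax policy is terrible', 'great game last night',
         'this policy will ruin us', 'nice goal by the striker',
         'terrible policy decision', 'the striker played a great game'] * 5
labels = [1, 0, 1, 0, 1, 0] * 5

# The vectorizer is refit inside every fold together with the classifier
pipe = make_pipeline(CountVectorizer(stop_words='english'),
                     LogisticRegression(max_iter=1000))

scores = cross_val_score(pipe, texts, labels, cv=3, scoring='roc_auc')
print(scores)
```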

### Additional explanation
The logistic regression models raise a non-convergence warning. I tried increasing max_iter up to 500; a higher value might remove the warning, but my CPU freezes during fitting. I therefore left max_iter at its default of 100.
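The scikit-learn documentation notes that the saga solver is only guaranteed to converge quickly when features are on roughly the same scale. One thing that might help, shown as a sketch on synthetic data, is to rescale the sparse matrix with `MaxAbsScaler` before fitting. The data, the `tol` value, and the `max_iter` value are assumptions, and whether this removes the warning on the Reddit data has not been tested.

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic non-negative matrix in sparse format, standing in for word counts
X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X_counts = sparse.csr_matrix(np.abs(X) * 10)

model = make_pipeline(
    MaxAbsScaler(),   # divides each column by its largest absolute value, keeps sparsity
    LogisticRegression(solver='saga', max_iter=500, tol=1e-3),
)
model.fit(X_counts, y)

# Number of iterations the solver actually used
print('Iterations used:', model[-1].n_iter_)
```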

In [15]:
# Logistic Regression (L1)

accuracy, auc, precision, recall, fscore = evaluate_model(X_train, X_test, Y_train, Y_test, 'LR1', 'Y')
print('Logistic Regression (L1)')
print('Accuracy: ', accuracy)
print('AUC: ', auc)
print('Precision: ', precision)
print('Recall: ', recall)
print('F1-Score: ', fscore)

Logistic Regression (L1)
Accuracy:  0.61724
AUC:  0.6172557976593469
Precision:  0.6172748543614726
Recall:  0.61724
F1-Score:  0.6172424858008753


In [56]:
# Logistic Regression (L2)

accuracy, auc, precision, recall, fscore = evaluate_model(X_train, X_test, Y_train, Y_test, 'LR2', 'Y')
print('Logistic Regression (L2)')
print('Accuracy: ', accuracy)
print('AUC: ', auc)
print('Precision: ', precision)
print('Recall: ', recall)
print('F1-Score: ', fscore)

Logistic Regression (L2)
Accuracy:  0.61776
AUC:  0.617862432480251
Precision:  0.6180164192307499
Recall:  0.61776
F1-Score:  0.6176697281925775




In [57]:
# Naive Bayes

accuracy, auc, precision, recall, fscore = evaluate_model(X_train, X_test, Y_train, Y_test, 'BNB', 'Y')
print('Naive Bayes')
print('Accuracy: ', accuracy)
print('AUC: ', auc)
print('Precision: ', precision)
print('Recall: ', recall)
print('F1-Score: ', fscore)

Naive Bayes
Accuracy:  0.56816
AUC:  0.570401154051976
Precision:  0.6457365735617245
Recall:  0.56816
F1-Score:  0.5045516540705048
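The AUC values above are computed from hard class predictions. With hard labels, the binary ROC AUC reduces to the mean of the two per-class recalls (the balanced accuracy), whereas the 'roc_auc' scorer used by cross_val_score ranks continuous scores. The sketch below contrasts the two on synthetic data; the data set and model settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the vectorized comments
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

auc_from_labels = roc_auc_score(y_te, y_pred)                         # hard 0/1 predictions
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # P(class 1)
balanced_acc = balanced_accuracy_score(y_te, y_pred)

print('AUC from hard labels:  ', auc_from_labels)
print('Balanced accuracy:     ', balanced_acc)
print('AUC from probabilities:', auc_from_scores)
```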


### b. Cross Validation

In [15]:
# calculating 3-fold cross validation for logistic regression with L2 penalty

evaluate_crossval(X_train, X_test, Y_train, Y_test, 'LR2', 3)



array([0.65338806, 0.65619237, 0.66228931])

## 2. Evaluating Multi-class Classifiers

In [37]:
# read category data

cat_df = read_data('data/reddit/categorized-comments.jsonl')

# check size, structure and categories

cat_df.info()
print('Size: ', len(cat_df), '\n',
      'Shape: ', cat_df.shape, '\n',
      'Categories: ', cat_df.cat.unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2347476 entries, 0 to 2347475
Data columns (total 2 columns):
cat    object
txt    object
dtypes: object(2)
memory usage: 17.9+ MB
Size:  2347476 
 Shape:  (2347476, 2) 
 Categories:  ['sports' 'science_and_technology' 'video_games' 'news']


In [38]:
# since the data set is very large, I take a sample from each category.
# by trial, a sample of 10000 from each category can be handled easily by my machine

size = 10000    # sample size
replace = True  # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]

category = cat_df.groupby('cat', as_index=False).apply(fn)

# free up memory

del cat_df

category['txt'] = category['txt'].apply(lambda x:clean_text(x))
category.reset_index(drop=True, inplace=True)

category.head()

|   | cat | txt |
|---|-----|-----|
| 0 | news | cocaine abuse in addition to epo abuse |
| 1 | news | saying that people can enter from x country ... |
| 2 | news | deleted |
| 3 | news | yet people like the rothschilds literally have... |
| 4 | news | dj khaled sharia raw |


### a. Model metrics

In [58]:
# create the feature matrix

cv = CountVectorizer(stop_words=stop_words)

# create target and sample for multi-class model 

X_mc = cv.fit_transform(category['txt'])
Y_mc = category['cat']

# Logistic Regression (L1)
# Note: X_train, X_test, Y_train, Y_test still hold the binary split created
# earlier; X_mc and Y_mc need their own train_test_split for multi-class scores

accuracy, precision, recall, fscore = evaluate_model(X_train, X_test, Y_train, Y_test, 'LR1', 'N')
print('Logistic Regression (L1)')
print('Accuracy: ', accuracy)
print('Precision: ', precision)
print('Recall: ', recall)
print('F1-Score: ', fscore)

Logistic Regression (L1)
Accuracy:  0.61472
Precision:  0.614903745678144
Recall:  0.61472
F1-Score:  0.6146652671139686
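A multi-class evaluation needs a train/test split of the multi-class features and labels themselves. The sketch below shows the pattern on synthetic four-class data; the data set, the stratified split, and the variable names are assumptions standing in for X_mc and Y_mc, and the numbers it prints say nothing about the Reddit categories.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic four-class data standing in for X_mc, Y_mc
X, y = make_classification(n_samples=800, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# Split the multi-class data itself; stratify keeps class proportions equal
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

accuracy = accuracy_score(y_te, y_pred)
precision, recall, fscore, _ = precision_recall_fscore_support(
    y_te, y_pred, average='weighted')

print('Accuracy: ', accuracy)
print('Precision: ', precision)
print('Recall: ', recall)
print('F1-Score: ', fscore)
```

With `average='weighted'`, recall always equals accuracy, because each class recall is weighted by the share of test rows in that class.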




### b. Cross Validation

In [46]:
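# calculating 3-fold cross validation for logistic regression with L2 penalty
# Note: this call scores ROC AUC on the binary split; the exercise asks for
# the F1 score on the multi-class data (X_mc, Y_mc)

evaluate_crossval(X_train, X_test, Y_train, Y_test, 'LR2', 3)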

array([0.64960309, 0.65125267, 0.65508894])
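For a multi-class target, an F1-based cross-validation score can be requested through the scorer name 'f1_weighted'. The sketch below shows this on synthetic four-class data; the data set and the choice of the weighted average (instead of, say, 'f1_macro') are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic four-class data standing in for the vectorized categories
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# Default penalty is L2; max_iter is raised only to avoid convergence warnings
model = LogisticRegression(max_iter=1000)

# Three-fold cross validation scored by class-weighted F1
scores = cross_val_score(model, X, y, cv=3, scoring='f1_weighted')
print('Fold scores:', scores)
print('Mean score: ', scores.mean())
```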

## 3. Hyperparameter selection

In [None]:
# Create logistic regression

lr = LogisticRegression(solver = 'saga', max_iter = 100)

# Create regularization penalty space

penalty = ['l1', 'l2']

# Create regularization hyperparameter space

C = np.logspace(0, 4, 10)

# Create hyperparameter options

hyperparameters = dict(C=C, penalty=penalty)

# Create grid search using 5-fold cross validation

grid = GridSearchCV(lr, hyperparameters, cv=5, verbose=0, n_jobs=3)

# Fit grid search

best_model = grid.fit(X_bc, Y_bc)

# View best hyperparameters and the best mean cross-validation score

print('Best Penalty:', best_model.best_params_['penalty'])
print('Best C:', best_model.best_params_['C'])
print('Best CV score:', best_model.best_score_)

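As a compact, runnable illustration of the same search, the sketch below runs GridSearchCV over both penalties and the C grid on a small synthetic data set and reports the best parameters together with the best mean cross-validation score. The data, the three folds, the `max_iter` value, and the 'roc_auc' scoring are assumptions; without a `scoring` argument GridSearchCV uses the estimator's default score, which is accuracy for classifiers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary data standing in for X_bc, Y_bc
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Same search space as in the exercise: two penalties, ten C values
param_grid = {'penalty': ['l1', 'l2'], 'C': np.logspace(0, 4, 10)}

grid = GridSearchCV(
    LogisticRegression(solver='saga', max_iter=2000),   # saga supports L1 and L2
    param_grid,
    cv=3,
    scoring='roc_auc',
)
grid.fit(X, y)

print('Best parameters:', grid.best_params_)
print('Best mean CV score:', grid.best_score_)
```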