# Predicting the selection of comments on NYT articles as editor's picks

Most New York Times articles that are open to comments feature a selection of comments called NYT's picks. The [dataset here](https://www.kaggle.com/aashita/nyt-comments) contains the comments' text along with many other features, including `editorsSelection`, which indicates whether a comment was picked by NYT as an editor's selection. Two classifiers are trained to predict the probabilities of comments being selected as NYT's picks.

The first classifier uses Logistic Regression coupled with Latent Semantic Analysis (LSA). The second classifier uses an NB-Logistic Regression model inspired by the paper [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf) by Sida Wang and Chris Manning and previously used in the [Toxic Comments Classification kernel by Jeremy Howard](https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline).

First we import the relevant Python modules and load the data:

In [None]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GroupKFold, GridSearchCV
from sklearn.metrics import (roc_auc_score, classification_report, log_loss, make_scorer, 
                             recall_score, precision_recall_curve, roc_curve)

import gc
from time import time

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline


In [None]:
c1 = pd.read_csv('../input/CommentsJan2017.csv')
c2 = pd.read_csv('../input/CommentsFeb2017.csv')
c3 = pd.read_csv('../input/CommentsMarch2017.csv')
c4 = pd.read_csv('../input/CommentsApril2017.csv')
c5 = pd.read_csv('../input/CommentsMay2017.csv')
c6 = pd.read_csv('../input/CommentsJan2018.csv')
c7 = pd.read_csv('../input/CommentsFeb2018.csv')
c8 = pd.read_csv('../input/CommentsMarch2018.csv')
c9 = pd.read_csv('../input/CommentsApril2018.csv')
comments = pd.concat([c1, c2, c3, c4, c5, c6, c7, c8, c9])
comments.drop_duplicates(subset='commentID', inplace=True)
comments.reset_index(drop=True, inplace=True)

In [None]:
comments.shape

The comments dataset contains many features, but for the starter model we will use only the text of the comments given by the column `commentBody`.

In [None]:
comments.columns

## Steps:
* Balance the classes to some extent by undersampling the majority class.
* Obtain the tf-idf vectors for words and character n-grams using `TfidfVectorizer` and `FeatureUnion`.
* Train the first classifier that uses Logistic Regression coupled with Latent Semantic Analysis (LSA).
* Train the second classifier that uses an NB-Logistic Regression model inspired by the paper [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf) by Sida Wang and Chris Manning and previously used in the [Toxic Comments Classification kernel by Jeremy Howard](https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline).
* Compare the two classifiers by plotting their ROC and Precision-Recall curves.

In [None]:
plt.axis('equal')
plt.pie(comments.editorsSelection.value_counts(), labels=("", "NYT's pick"));
plt.title("Before balancing the classes");

The two classes are highly imbalanced, with an approximate ratio of 20:1. We bring this down to at most 3:1 by undersampling the majority class. First we discard all comments from articles that have no comments picked as editor's selection. From each remaining article, we randomly sample comments from the majority class so that there are three unpicked comments for every picked one, or keep all of them when an article has fewer than that.

In [None]:
ratio = 3
def balance_classes(grp):
    picked = grp.loc[grp.editorsSelection == True]
    n = round(picked.shape[0]*ratio)
    if n:
        not_picked = grp.loc[grp.editorsSelection == False]
        # Sample n comments, or keep all of them if no more than n are available
        if not_picked.shape[0] > n:
            not_picked = not_picked.sample(n)
        balanced_grp = pd.concat([picked, not_picked])
        return balanced_grp
    else: # If no editor's pick for an article, discard all comments from that article
        return None

comments = comments.groupby('articleID').apply(balance_classes).reset_index(drop=True)
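
To make the effect of the undersampling concrete, here is a minimal sketch on toy data. The frame `toy`, the helper `undersample` and the fixed `random_state` are assumptions made for illustration only and are not part of the kernel.

```python
import pandas as pd

# Toy data: article "a" has 1 pick and 5 non-picks, article "b" has no picks at all.
toy = pd.DataFrame({
    "articleID": ["a"] * 6 + ["b"] * 3,
    "editorsSelection": [True] + [False] * 5 + [False] * 3,
})

toy_ratio = 3  # hypothetical name; mirrors `ratio` in the cell above


def undersample(grp, ratio=toy_ratio):
    """Keep every pick and at most `ratio` non-picks per pick."""
    picked = grp[grp.editorsSelection]
    not_picked = grp[~grp.editorsSelection]
    n = min(len(picked) * ratio, len(not_picked))
    # random_state is fixed only to make this sketch reproducible
    return pd.concat([picked, not_picked.sample(n, random_state=0)])


# Articles without any pick are skipped entirely.
balanced_toy = pd.concat(
    [undersample(grp) for _, grp in toy.groupby("articleID")
     if grp.editorsSelection.any()],
    ignore_index=True,
)

n_picked = int(balanced_toy.editorsSelection.sum())
n_not_picked = int((~balanced_toy.editorsSelection).sum())
print(n_picked, n_not_picked, sorted(balanced_toy.articleID.unique()))
```

Article "b" disappears because it has no picks, and article "a" keeps its single pick together with three sampled non-picks.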

In [None]:
plt.axis('equal')
plt.pie(comments.editorsSelection.value_counts(), labels=("", "NYT's pick"));
plt.title("After balancing the classes");

In [None]:
comments.shape

Our goal is to predict the probability that a given comment is picked by NYT as an editor's selection, so the target variable is the column `editorsSelection`. We train the classifiers on `commentBody` and keep `articleID` to partition the comments into train, validation and test sets below such that no two sets share comments from the same article. Thus, we need only three columns: `articleID`, `commentBody` and `editorsSelection`.

In [None]:
commentBody = comments.commentBody
nytpicks = comments.editorsSelection
articleID = comments.articleID

Now we delete the comments dataframe to free up space.

In [None]:
# Delete comments dataframe since it is no longer needed
del comments

# Collect residual garbage
gc.collect();

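We split the data into train and test sets such that the two sets contain comments from disjoint sets of articles. This is achieved using `GroupKFold`. The loop overwrites the split on every iteration, so the last of the five folds is the one that is kept, with roughly 80% of the comments used for training and 20% for testing.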

In [None]:
for train_index, test_index in GroupKFold(n_splits=5).split(commentBody, nytpicks, groups=articleID):
    train_text, test_text = commentBody[train_index], commentBody[test_index] 
    train_target, test_target = nytpicks[train_index], nytpicks[test_index]
    train_groups, test_groups = articleID[train_index], articleID[test_index]
    
print("Number of comments for training:", train_text.shape[0])
print("Number of comments for testing:", test_text.shape[0])
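
The group-wise behaviour of `GroupKFold` can be checked on a small example. The sketch below uses made-up arrays (`toy_X`, `toy_y`, `toy_groups`) in place of the real comments and verifies that no article ever appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Ten toy samples drawn from five hypothetical articles (two comments each).
toy_X = np.arange(10).reshape(-1, 1)
toy_y = np.array([0, 1] * 5)
toy_groups = np.repeat(["a1", "a2", "a3", "a4", "a5"], 2)

folds = list(GroupKFold(n_splits=5).split(toy_X, toy_y, groups=toy_groups))

shared_counts = []
for tr_idx, te_idx in folds:
    # Articles that occur in both the train and the test part of this fold
    shared = set(toy_groups[tr_idx]) & set(toy_groups[te_idx])
    shared_counts.append(len(shared))
    print("test articles:", sorted(set(toy_groups[te_idx])), "shared:", len(shared))
```

Every fold reports zero shared articles, which is the property the train/test split above relies on.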

Next we compute tf-idf features for words and character n-grams and combine them using `FeatureUnion`:

In [None]:
vectorizer = FeatureUnion([
    ('word_tfidf', TfidfVectorizer(
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 2),
    max_features=600,
    )),
    
    ('char_tfidf', TfidfVectorizer(
    analyzer='char',
    ngram_range=(2, 4),
    max_features=600,
    ))
])
start_vect = time()
vectorizer.fit(commentBody)
train_text = vectorizer.transform(train_text)
test_text = vectorizer.transform(test_text)

print("Vectorization Runtime: %0.2f Minutes"%((time() - start_vect)/60))
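
As a small illustration of what `FeatureUnion` does with the two vectorizers, the sketch below fits a scaled-down union on three made-up sentences. The documents and the tiny `max_features` values are assumptions chosen only to keep the output readable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

toy_docs = [
    "The editorial makes a fair point.",
    "I disagree with the editorial.",
    "A fair and thoughtful comment.",
]

toy_union = FeatureUnion([
    ("word_tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=5)),
    ("char_tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 4), max_features=7)),
])
toy_features = toy_union.fit_transform(toy_docs)

# The union stacks the two blocks column-wise: 5 word features + 7 character features.
print(toy_features.shape)
```

In the same way, the vectorizer above yields at most 600 + 600 = 1200 columns per comment.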

Here we use Latent Semantic Analysis (LSA) to perform dimensionality reduction on the tf-idf vectors and then train a logistic regression model on the reduced features to make predictions. LSA is implemented as `TruncatedSVD` in sklearn.

In [None]:
clf_logistic = Pipeline([
    ('lsa', TruncatedSVD(n_components=1000, random_state=0)), 
    ('logistic', LogisticRegression(C=150))  
])
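
The sketch below shows the same tf-idf, LSA and logistic regression chain end to end on a handful of made-up comments. The texts, labels and `n_components=2` are illustrative assumptions, far smaller than the settings used in the kernel.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

toy_docs = [
    "A thoughtful and well argued comment on the article.",
    "This comment adds useful context to the article.",
    "A well reasoned reply with useful sources.",
    "lol no",
    "This is nonsense lol",
    "no way, total nonsense",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_lsa_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("lsa", TruncatedSVD(n_components=2, random_state=0)),
    ("logistic", LogisticRegression()),
])
toy_lsa_clf.fit(toy_docs, toy_labels)

# Inspect the intermediate representations: sparse tf-idf, then the dense LSA projection.
toy_tfidf = toy_lsa_clf.named_steps["tfidf"].transform(toy_docs)
toy_reduced = toy_lsa_clf.named_steps["lsa"].transform(toy_tfidf)
toy_proba = toy_lsa_clf.predict_proba(toy_docs)

print(toy_tfidf.shape, "->", toy_reduced.shape)
print(np.round(toy_proba, 2))
```

The logistic regression only ever sees the two LSA components, not the original tf-idf columns.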

Using `GridSearchCV`, we find the optimal parameters for the model. Since the classes are imbalanced, we use `recall_score` as the metric. We again use `GroupKFold` to split the data into train and validation sets so that comments from the same article never end up in both. Because of the time it takes to run the kernel, the `GridSearchCV` cells are commented out here and the tuned parameters obtained from them are used directly in the model.

In [None]:
# def grid_search_cv(param_grid, clf):
#     gkf = GroupKFold(n_splits=3).split(train_text, train_target, groups=train_groups)
#     scorer = make_scorer(recall_score)
#     
#     grid_search = GridSearchCV(clf, param_grid=param_grid, cv=gkf, scoring=scorer)
#     grid_search.fit(train_text, train_target)
#     
#     print("Best parameters found:")
#     print(grid_search.best_params_)
#     print()
#     print("Best score:")
#     print(grid_search.best_score_)
#     print()
#     
#     test_prediction = grid_search.predict(test_text)
#     print("Classification report:")
#     print(classification_report(test_target, test_prediction))
#     
#     test_prediction_proba = grid_search.predict_proba(test_text)[:, 1]
#     score = roc_auc_score(test_target, test_prediction_proba)
#     print("ROC AUC Score: ", round(score, 4)) 
#     
#     score = log_loss(test_target, test_prediction_proba)
#     print("logloss: ", round(score, 4))
#     return test_prediction_proba

In [None]:
# param_grid = [
#     {'logistic__C': [150, 200]},
#     {'logistic__class_weight': ['balanced', None]},
# ]
# 
# start_vect = time()
# 
# test_prediction_proba_logistic = grid_search_cv(param_grid, clf_logistic)
# 
# print()
# print("Runtime for running GridSearchCV on logistic regression model and predicting probabilities for the test set is %0.2f Minutes"%((time() - start_vect)/60))
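
Since the grid search itself is commented out, here is a small runnable sketch of the same idea on synthetic data: a recall-scored `GridSearchCV` whose folds are built with `GroupKFold`. The random features, the group labels and the parameter grid are assumptions for illustration and do not reproduce the tuning done for the kernel.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.RandomState(0)
toy_X = rng.normal(size=(60, 4))
toy_y = (toy_X[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)
toy_groups = np.repeat(np.arange(12), 5)  # 12 hypothetical articles, 5 comments each

toy_search = GridSearchCV(
    LogisticRegression(),
    param_grid={"C": [0.1, 1, 10], "class_weight": [None, "balanced"]},
    cv=GroupKFold(n_splits=3),           # folds never split an article
    scoring=make_scorer(recall_score),   # recall on the positive class
)
# `groups` is forwarded to GroupKFold when the folds are generated.
toy_search.fit(toy_X, toy_y, groups=toy_groups)

print(toy_search.best_params_)
print(round(toy_search.best_score_, 3))
```

Passing the `GroupKFold` object as `cv` and the groups to `fit` is one way to wire this up; handing `GridSearchCV` the output of `.split(...)`, as in the commented cell, is another.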

Now we train the classifier:

In [None]:
def train_model(clf):
    clf.fit(train_text, train_target)
    
    test_prediction = clf.predict(test_text)
    print("Classification report:")
    print(classification_report(test_target, test_prediction))
    
    test_prediction_proba = clf.predict_proba(test_text)[:, 1]
    score = roc_auc_score(test_target, test_prediction_proba)
    print("ROC AUC Score: ", round(score, 4)) 
    
    score = log_loss(test_target, test_prediction_proba)
    print("logloss: ", round(score, 4))
    return test_prediction_proba

In [None]:
start_vect = time()

test_prediction_proba_logistic = train_model(clf_logistic)

print()
print("Runtime for training logistic regression model and predicting probabilities for the test set is %0.2f Minutes"%((time() - start_vect)/60))

Next we train another classifier that combines Naive Bayes with Logistic Regression. It is inspired by the paper [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf) by Sida Wang and Chris Manning and was previously used in the [Toxic Comments Classification kernel by Jeremy Howard](https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline).

We start by defining a class for the NB-logistic model:

In [None]:
class NB_logistic(LogisticRegression, BaseEstimator):
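    def __init__(self, r=None, C=1, solver='sag', class_weight_balanced=False):
        # r holds the log-count ratio of the features and is set in fit()
        self.r = r
        # Stored as an attribute so that get_params() and clone() can read it back
        self.class_weight_balanced = class_weight_balanced
        if class_weight_balanced:
            super().__init__(C=C, solver=solver, class_weight='balanced')
        else:
            super().__init__(C=C, solver=solver)
        
    def pr(self, X, y, y_i):
        # Smoothed per-feature sum over the comments of class y_i,
        # divided by the smoothed number of comments in that class
        p = X[np.where(y==y_i)[0]].sum(0)+1
        return (p+1)/((y==y_i).sum()+1)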

    def fit(self, X, y):
        self.r = np.log(self.pr(X, y, 1) / self.pr(X, y, 0))
        X_nb = X.multiply(self.r)
        super().fit(X_nb, y)
        return self
    
    def predict(self, X):
        X_nb = X.multiply(self.r)
        return super().predict(X_nb)
    
    def predict_proba(self, X):
        X_nb = X.multiply(self.r)
        return super().predict_proba(X_nb)
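
The core of the NB-logistic model is the log-count ratio used to rescale the features before the logistic regression is fitted. The sketch below works this out on a tiny dense matrix. The helper `smoothed_mean` and its smoothing constant `alpha=1` are assumptions of this sketch and need not match the smoothing in the class above exactly.

```python
import numpy as np

# Toy non-negative feature matrix (rows: comments, columns: features) and labels.
toy_X = np.array([
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
])
toy_y = np.array([1, 1, 0, 0])


def smoothed_mean(X, y, label, alpha=1.0):
    """Smoothed per-feature sum for one class, divided by the smoothed class count."""
    return (X[y == label].sum(axis=0) + alpha) / ((y == label).sum() + alpha)


# Feature 0 only occurs in class 1 and feature 1 only in class 0,
# so their log-count ratios are log(3) and -log(3) respectively.
log_ratio = np.log(smoothed_mean(toy_X, toy_y, 1) / smoothed_mean(toy_X, toy_y, 0))

# Each column is rescaled by its ratio before being passed to logistic regression.
toy_X_nb = toy_X * log_ratio

print(log_ratio)
print(toy_X_nb)
```

Features that are more common among picked comments get a positive weight and the others a negative one, which gives the logistic regression a Naive Bayes style prior on each feature.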

In [None]:
clf_nb_logistic = NB_logistic()

start_vect = time()

test_prediction_proba_nb_logistic = train_model(clf_nb_logistic)

print("Runtime for training NB-logistic regression model and predicting probabilities for the test set is %0.2f Minutes"%((time() - start_vect)/60))

Lastly, we compare the two classifiers by plotting their respective ROC and Precision-Recall curves:

In [None]:
def curves(test_prediction_proba):
    p, r, _ = precision_recall_curve(test_target, test_prediction_proba)
    # roc_curve returns the false positive rate first, then the true positive rate
    fpr, tpr, _ = roc_curve(test_target, test_prediction_proba)
    return p, r, fpr, tpr

fig = plt.figure(figsize=(12,24))

ax1 = fig.add_subplot(2,1,1)
ax1.set_xlim([-0.05,1.05])
ax1.set_ylim([-0.05,1.05])
ax1.set_xlabel('Recall')
ax1.set_ylabel('Precision')
ax1.set_title('PR Curve')

ax2 = fig.add_subplot(2,1,2)
ax2.set_xlim([-0.05,1.05])
ax2.set_ylim([-0.05,1.05])
ax2.set_xlabel('False Positive Rate')
ax2.set_ylabel('True Positive Rate')
ax2.set_title('ROC Curve')


p, r, fpr, tpr = curves(test_prediction_proba_logistic) 
ax1.plot(r, p, c='g', label="Logistic with LSA")
ax2.plot(fpr, tpr, c='g', label="Logistic with LSA")

p, r, fpr, tpr = curves(test_prediction_proba_nb_logistic) 
ax1.plot(r, p, c='r', label="NB-Logistic")
ax2.plot(fpr, tpr, c='r', label="NB-Logistic") 

ax1.legend(loc='lower left')    
ax2.legend(loc='lower right')

plt.show()
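
For reference, the sketch below shows on four made-up scores what the two curve functions return and how the area under the ROC curve follows from them. The arrays `toy_true` and `toy_scores` are illustrative and unrelated to the kernel's predictions.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_curve

toy_true = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve returns (fpr, tpr, thresholds): x-axis values first, y-axis values second.
toy_fpr, toy_tpr, _ = roc_curve(toy_true, toy_scores)

# precision_recall_curve returns (precision, recall, thresholds).
toy_precision, toy_recall, _ = precision_recall_curve(toy_true, toy_scores)

# Area under the ROC curve via the trapezoidal rule.
toy_auc = auc(toy_fpr, toy_tpr)
print(toy_auc)
```

The ROC curve always runs from (0, 0) to (1, 1), while the precision-recall curve ends at recall 0 with precision 1 by convention.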

The two classifiers perform very similarly, and both form a strong baseline for predicting the selection of comments on NYT articles as editor's picks, with a reasonably good ROC AUC score of 0.72. In the next kernel, I will use pre-trained word embeddings with an RNN to try to improve the performance.

#### References:
* https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline
* https://www.kaggle.com/metadist/work-like-a-pro-with-pipelines-and-feature-unions
* https://www.kaggle.com/lct14558/imbalanced-data-why-you-should-not-use-roc-curve/notebook