# Predicting NYT Picks Comments
The New York Times has a feature on its website called "NYT Picks" which is an effort by the 14 moderators on staff to select and highlight the most interesting and insightful comments that are submitted by readers. The selected comments are put in a separate user interface tab so that they're easier to find, and they get a little yellow badge which serves to highlight them further. 

The Times' staff uses various editorial criteria to decide which comments to publish and which to select as NYT Picks. Some of these criteria are described in this [article](https://www.google.com/#safe=off&q=http:%2F%2Fwww.nytimes.com%2Ftimes-insider%2F2014%2F04%2F17%2Fa-comments-path-to-publication%2F) (top link). Criteria that count against publishing a comment include: incoherency, political insults, profanity / obscenity, and insults or stereotyped condemnation. For NYT Picks they look for: high quality, broad representation, back-and-forth conversation, the unexpected, and personal stories. Other criteria for editorial selection of comments that have been [developed from the research literature](http://www.nickdiakopoulos.com/wp-content/uploads/2011/07/ISOJ_Journal_V5_N1_2015_Spring_Diakopoulos_Picking-NYT-Picks.pdf) include argument quality, criticality, emotionality, entertainment value, readability, personal experience, internal coherence, thoughtfulness, length, relevance, fairness, and novelty. 

In this exercise we're going to try to predict whether a comment should be an NYT Pick based on scores computed for that comment. We'll first develop a prediction framework, and then for the majority of class you'll work in pairs to code a new score, based on the text of the comment, that can be plugged into that framework to improve the prediction. Toward the end, we'll combine all of our scores and see whether the combination improves prediction accuracy. 

**The Data**  
The data includes 25,084 comments (half of which, 12,542, are NYT Picks) collected in 2014 from the NYT Community API. You can download it [here](https://www.dropbox.com/s/dqkgewvtxtfocy4/comments-sampled.csv?dl=0).

The dataset includes several variables, among them:
- `commentID`: the unique identifier for the comment
- `commentBody`: the text of the comment
- `approveDate`: the date and time when the comment was approved and published
- `recommendationCount`: the number of times the comment was up-voted (i.e. recommended) by the community
- `display_name`: the screen name of the user who made the comment
- `articleURL`: the link to the article to which this comment was posted
- `NYTPicks`: 0 or 1 to indicate whether the comment was selected as a Times Pick

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

# Makes it so that you can scroll horizontally to see all columns of an output DataFrame
pd.set_option('display.max_columns', None)
# Make it so URLs and long comments won't get truncated when we print them out
# (None means no width limit; pandas versions before 1.0 used -1 instead)
pd.set_option('display.max_colwidth', None)

# This magic function allows you to see the charts directly within the notebook. 
%matplotlib inline

# This command will make the plots more attractive by adopting the common style of ggplot
matplotlib.style.use("ggplot")

In [None]:
cdf = pd.read_csv("Data/comments-sampled.csv", parse_dates=["approveDate"])
# cdf_nonpicks = cdf[cdf.NYTPicks==0].sample(12542)
# cdf_picks = cdf[cdf.NYTPicks==1]
# cdf_new = cdf_nonpicks.append(cdf_picks, ignore_index=True)
# cdf_new.to_csv("Data/comments-sampled.csv", index=False)

#cdf.columns
#cdf.rename(columns={"editorsSelection": "NYTPicks"}, inplace=True)
#cdf = cdf.iloc[np.random.permutation(len(cdf))]
#cdf.to_csv("Data/comments-1.csv", index=False)

In [None]:
# Total number of comments, then the number of non-picks and picks
print(cdf.shape[0])
print(cdf[cdf.NYTPicks==0].shape[0])
print(cdf[cdf.NYTPicks==1].shape[0])

In [None]:
cdf

**Supervised Learning**  
The prediction problem in this case is one of classification. We want to predict, for any input comment, whether it should be classified as an NYT Pick (noted as "1" in the `NYTPicks` column). 

For the dataset we have at hand we actually know the answer already, since for every comment we can see whether it was picked by editors or not. This allows us to develop a predictive model that *learns* the relationship between the input comment and the "target" or "label", which in this case is the NYT Pick status. This is called **supervised learning** -- the learning process is supervised in the sense that we've already given it the answer. The power of this, of course, is that once the learning process is complete we can apply the learned model to *new comments* for which we don't yet know whether they should be NYT Picks. 

**Features**  
For classification to work we need some attributes, often called **features**, which are used as predictors for the classification. Many of the editorial criteria described above may make good candidates, but let's start simple. To find features that may have some predictive power (not all features will), we can start with a bit of exploratory analysis. Let's see how the mean and median recommendation counts vary between comments that were picked and those that weren't.


In [None]:
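# Select the numeric column first; recent pandas versions raise an error when averaging text columns
cdf.groupby("NYTPicks")[["recommendationCount"]].mean()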

In [None]:
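# Same comparison using the median, which is less sensitive to a few very popular comments
cdf.groupby("NYTPicks")[["recommendationCount"]].median()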

This would appear to suggest that `recommendationCount` might be a good predictor, since NYT Picks have much higher means and medians than non-picks. 

Ok, so let's set up our classification model to use recommendationCount as a feature used to predict NYTPicks. 

**Logistic Regression**  
There are many different types of machine learning algorithms that can be used to learn from input data in a supervised fashion. You can read up more on [machine learning with scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html), a popular Python library. We'll just use a very basic algorithm in this case called **logistic regression**. You may have encountered regression before in stats class. Logistic regression is a form of regression that is used for classification problems in which the variable you're predicting isn't continuous but is binary (i.e. NYTPicks is either 0 or 1).
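
To make the idea concrete, here is a minimal sketch on made-up numbers (the toy `counts` and `picked` arrays and the `sigmoid` helper are assumptions for illustration, not values or code from the dataset). Logistic regression learns a weight and an intercept, passes the weighted input through the logistic (sigmoid) function to get a probability between 0 and 1, and predicts class 1 when that probability is above 0.5. The sketch recomputes those probabilities by hand and checks them against scikit-learn's own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): recommendation counts and whether each comment was picked
counts = np.array([0, 1, 2, 3, 5, 20, 30, 40, 60, 80]).reshape(-1, 1)
picked = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

toy_model = LogisticRegression()
toy_model.fit(counts, picked)

def sigmoid(z):
    # The logistic function squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Recompute the predicted probability of class 1 from the learned weight and intercept
z = counts[:, 0] * toy_model.coef_[0, 0] + toy_model.intercept_[0]
manual_prob = sigmoid(z)

# scikit-learn's own probabilities for class 1 (second column of predict_proba)
sklearn_prob = toy_model.predict_proba(counts)[:, 1]

# The predicted label is 1 wherever the probability exceeds 0.5
manual_label = (manual_prob > 0.5).astype(int)

print(np.allclose(manual_prob, sklearn_prob))
print(manual_label)
```
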


In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# Create an array in the proper format (one row per comment, one column) from all of the recommendationCount values
X = cdf.recommendationCount.values.reshape(-1,1)

# Create an array in the proper format for the NYTPicks outcomes that we want to learn
y = cdf.NYTPicks.values.ravel()

# Train the model (or "fit" it to the data)
model = model.fit(X, y)

# Now score the model
model.score(X,y)

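73.9% accuracy. That's not too bad, given that half of the comments in the dataset are Picks, so random guessing would score about 50%. 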

But, wait a minute. That's not really fair, since we just tested it on the exact same data that it was trained on ... that's basically cheating. To get an accurate evaluation it's important to have a **training dataset** and a **testing dataset** when doing supervised learning. In order to really know whether your learned model is successful you need to test it on data that it's never seen before. That's because if you just test it on the examples it was trained on you won't know if it generalizes to new examples it's unfamiliar with. Let's create training and testing sets so we can properly evaluate the model. 
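
The gap between training and testing accuracy is easiest to see in an extreme case. The sketch below is an illustration under stated assumptions, not part of the exercise: it uses random features with random labels (so there is nothing real to learn) and swaps in a decision tree instead of logistic regression, because an unconstrained tree can memorize its training examples. Scored on the data it was trained on, the model looks perfect; scored on held-out data, it does no better than a coin flip.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Hypothetical data: 400 examples with 5 random features and random 0/1 labels,
# so there is no real pattern for any model to learn
X_noise = rng.rand(400, 5)
y_noise = rng.randint(0, 2, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.5, random_state=0)

# An unconstrained decision tree can memorize its training examples
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_tr, y_tr)

train_accuracy = tree.score(X_tr, y_tr)  # scored on data the model has already seen
test_accuracy = tree.score(X_te, y_te)   # scored on held-out data

print("Training accuracy: %.2f" % train_accuracy)
print("Testing accuracy: %.2f" % test_accuracy)
```
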

In [None]:
from sklearn import metrics
from sklearn.model_selection import train_test_split

model2 = LogisticRegression()

# Create a train-test split of 50-50 (rows are shuffled randomly, so results vary from run to run)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

# Train the new model just on the training data
model2.fit(X_train, y_train)

# And test the model on the test data
#print(model2.score(X_train, y_train))
#print(model2.score(X_test, y_test))

# Use the model to predict NYTPicks values for the test dataset
predicted = model2.predict(X_test)
print("Accuracy")
print(metrics.accuracy_score(y_test, predicted))
print("\nConfusion Matrix")
print(metrics.confusion_matrix(y_test, predicted))
#print(metrics.classification_report(y_test, predicted))

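The way to read the confusion matrix is that each row is an actual class and each column is a predicted class, with class 0 (not a Pick) listed first:

Actual class | Predicted 0 (not a Pick) | Predicted 1 (Pick)
----|----|----
**Actual 0 (not a Pick)** | True Negatives (TN) | False Positives (FP)
**Actual 1 (Pick)** | False Negatives (FN) | True Positives (TP)

The sum of the True Negative (TN) and True Positive (TP) cells divided by the total number of test cases (i.e. all four cells) yields the accuracy. 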

A False Positive (FP) is a comment that is actually not an NYT Pick, but was predicted by the classifier to be an NYT Pick. 

A False Negative (FN) is a comment that is actually an NYT Pick, but was predicted by the classifier to *not* be an NYT Pick.

You can see from the result above that the errors are imbalanced. There are many more FNs (2576) than FPs (645). Your exact counts will differ somewhat from run to run because the train-test split is random. 
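
As a small worked example of reading these cells, the sketch below uses eight hand-made labels (hypothetical values, not output from the classifier) and pulls the four counts out of scikit-learn's confusion matrix to compute the accuracy along with the FP and FN rates.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for eight comments: 1 = NYT Pick, 0 = not a Pick
actual = np.array([0, 0, 0, 0, 1, 1, 1, 1])
predicted_labels = np.array([0, 0, 0, 1, 1, 0, 0, 0])

# Rows are actual classes, columns are predicted classes
cm = metrics.confusion_matrix(actual, predicted_labels)

# For 0/1 labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

accuracy = float(tn + tp) / (tn + fp + fn + tp)
fp_rate = float(fp) / (fp + tn)  # share of actual non-Picks wrongly predicted as Picks
fn_rate = float(fn) / (fn + tp)  # share of actual Picks that the classifier missed

print(cm)
print("accuracy = %.2f, FP rate = %.2f, FN rate = %.2f" % (accuracy, fp_rate, fn_rate))
```

Here three of the four actual Picks are missed, so the FN rate (0.75) is much higher than the FP rate (0.25) even though the overall accuracy is 0.50.
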

----
**New Features**  
Now let's try to improve our classifier, increasing the accuracy as well as decreasing the FP and FN rates. To do that we need to find more features that are predictive of NYT Picks status. 

Let's break into teams of two. Each team should spend about 40 minutes writing code to compute a feature or score based on comment text analysis. The previous [class tutorial on text analysis](https://github.com/comp-journalism/UMD-J479V-J779V-Spring2016/blob/master/Weekly/Week_3/text-analysis.ipynb) will come in handy. What can you count / measure from the text itself that might be predictive? The score should ideally help predict NYT Picks status. A template is provided below so that you can test your new score's predictive power. 

Then you'll send me your .ipynb files and we'll combine everyone's features to see if we can make the predictive power of the classifier even greater.  
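
As one illustration of the kind of text-based score a team might try, the sketch below measures comment length in words and average word length, loosely inspired by the "length" and "readability" criteria mentioned earlier. The `length_features` helper and the sample sentence are hypothetical, and whether either number actually helps predict NYT Picks is something you would need to test.

```python
import string

def length_features(text):
    """Return (word_count, average_word_length) for a comment.

    Hypothetical helper for illustration; not part of the class template.
    """
    # Split on whitespace, then strip punctuation from the ends of each token
    words = [w.strip(string.punctuation) for w in text.split()]
    # Drop tokens that consisted of punctuation only
    words = [w for w in words if w]
    if not words:
        # Empty comment: avoid dividing by zero
        return 0, 0.0
    average_length = float(sum(len(w) for w in words)) / len(words)
    return len(words), average_length

sample = "I grew up in Ohio. This article rings true!"
word_count, average_length = length_features(sample)
print(word_count)
print(round(average_length, 2))
```

Either of the two returned numbers could serve as the single numerical score that a scoring function is expected to return.
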

In [None]:
import nltk
import string
from nltk.tokenize import WhitespaceTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# If the stopword list isn't installed yet, run nltk.download('stopwords') once
stopword_list = stopwords.words('english')

tokenizer = WhitespaceTokenizer()

# Stemmer used by the optional (commented out) stemming step in calculate_score
porter = PorterStemmer()

def remove_punctuation(text):
    # Grab the list of standard punctuation symbols that are provided in the string library
    punctuations = string.punctuation # includes following characters: !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~

    # But don't strip out apostrophes, as we want to preserve possessives and contractions, an alternative would be to expand contractions
    excluded_punctuations = ["'"]
    for p in punctuations:
        if p not in excluded_punctuations:
            # replace each punctuation symbol by a space
            text = text.replace(p, " ") 

    return text

# Takes a tokenized list and returns a list minus any of the words in the stopword list
def remove_stopwords(tokens):
    return [w for w in tokens if w not in stopword_list]

# There are a bunch of commented out pieces that may or may not be useful in the next function. Use as you see fit.
def calculate_score(text):
    #text = text.lower()
    #text = remove_punctuation(text)
    #text = " ".join(text.split())
    text_tokens = tokenizer.tokenize(text)
    #text_tokens = remove_stopwords(text_tokens)
    #text_tokens = [porter.stem(w) for w in text_tokens if w not in stopword_list]
    # Guard against empty comments so we never divide by zero
    if len(text_tokens) == 0:
        return 0.0
    # Very simple (silly?) new score: the number of periods in the comment per token (word)
    period_count = text.count(".")
    # This function should return a numerical score
    return float(period_count) / len(text_tokens)

# This will create a new_score column by applying the above function to the text column
cdf["new_score"] = cdf["commentBody"].apply(calculate_score)
cdf

In [None]:
# After calculating your score you might do a quick eyeball to see if the aggregate score is different between the two classes
cdf.groupby("NYTPicks")[["recommendationCount", "new_score"]].mean()

In [None]:
# Repeat predictive train test, this time with two features
from sklearn.model_selection import train_test_split

# Each row is a comment; the two columns are the new score and the recommendation count
X = cdf[["new_score", "recommendationCount"]].values

# Create an array in the proper format for the NYTPicks outcomes that we want to learn
y = cdf.NYTPicks.values.ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

# Train a fresh model just on the training data
model3 = LogisticRegression()
model3.fit(X_train, y_train)

# Use the newly trained model to predict NYTPicks values for the test dataset
predicted = model3.predict(X_test)
print("Accuracy")
print(metrics.accuracy_score(y_test, predicted))
print("\nConfusion Matrix")
print(metrics.confusion_matrix(y_test, predicted))
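
Because `train_test_split` shuffles the rows randomly, accuracy changes a little on every run, so a small difference between two feature sets may just be noise. One way to make the comparison fairer, sketched below on synthetic data, is to pass the same `random_state` so both models are trained and tested on exactly the same rows. The `toy` table, its values, and the `evaluate` helper are assumptions made up for this illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
n = 1000

# Synthetic stand-in for the comments table (all values are made up)
toy = pd.DataFrame({
    "recommendationCount": rng.poisson(10, size=n),
    "new_score": rng.rand(n),
})
# Make the label depend on both columns so the second feature carries some signal
signal = 0.1 * toy["recommendationCount"] + 3.0 * toy["new_score"] + rng.normal(0, 1, size=n)
toy["NYTPicks"] = (signal > signal.median()).astype(int)

def evaluate(feature_names, seed=0):
    """Fit logistic regression on the named columns and return test accuracy."""
    X = toy[feature_names].values
    y = toy["NYTPicks"].values
    # The same seed puts the same rows in the train and test sets every time
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=seed)
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    return metrics.accuracy_score(y_test, clf.predict(X_test))

baseline = evaluate(["recommendationCount"])
combined = evaluate(["recommendationCount", "new_score"])
print("Baseline accuracy: %.3f" % baseline)
print("Combined accuracy: %.3f" % combined)
```

The same pattern extends to any number of score columns: add each team's column name to the list passed to `evaluate` and compare against the baseline on the identical split.
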
