# Predicting NYT Picks Comments
The New York Times has a feature on its website called "NYT Picks" which is an effort by the 14 moderators on staff to select and highlight the most interesting and insightful comments that are submitted by readers. The selected comments are put in a separate user interface tab so that they're easier to find, and they get a little yellow badge which serves to highlight them further. 

There are various editorial criteria that the Times' staff uses to decide which comments to publish and which to select as NYT Picks. Some of the criteria used are described in this [article](https://www.google.com/#safe=off&q=http:%2F%2Fwww.nytimes.com%2Ftimes-insider%2F2014%2F04%2F17%2Fa-comments-path-to-publication%2F) (top link). Criteria considered when deciding whether to publish include: incoherency, political insults, profanity / obscenity, and insults or stereotyped condemnation. For NYT Picks they look for: high quality, broad representation, back and forth conversation, the unexpected, personal stories. Other criteria for editorial selection of comments that have been [developed from the research literature](http://www.nickdiakopoulos.com/wp-content/uploads/2011/07/ISOJ_Journal_V5_N1_2015_Spring_Diakopoulos_Picking-NYT-Picks.pdf) include argument quality, criticality, emotionality, entertainment value, readability, personal experience, internal coherence, thoughtfulness, length, relevance, fairness, and novelty. 

In this exercise we're going to try to predict whether a comment should be a NYT Pick using scores computed for each comment. We'll first develop a prediction framework, and then for the majority of class you'll work in pairs to code a new score, based on the text of the comment, that can be plugged into that framework to improve the prediction. Toward the end, we'll combine all of our scores and see whether the combination improves prediction accuracy.  

**The Data**  
The data includes 25,084 comments (half of which, 12,542, are NYT Picks) collected in 2014 from the NYT Community API. You can download the dataset [here](https://www.dropbox.com/s/dqkgewvtxtfocy4/comments-sampled.csv?dl=0).

The dataset includes several variables including:
- `commentID`: the unique identifier for the comment
- `commentBody`: the text of the comment
- `approveDate`: the date and time when the comment was approved and published
- `recommendationCount`: the number of times the comment was up-voted (i.e. recommended) by the community
- `display_name`: the screen name of the user who made the comment
- `articleURL`: the link to the article to which this comment was posted
- `NYTPicks`: 0 or 1 to indicate whether the comment was selected as a Times Pick

In [27]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

# Makes it so that you can scroll horizontally to see all columns of an output DataFrame
pd.set_option('display.max_columns', None)
# Make it so URLs and comment text won't get truncated when we print them out
pd.set_option('display.max_colwidth', None)

# This magic function allows you to see the charts directly within the notebook. 
%matplotlib inline

# This command will make the plots more attractive by adopting the common style of ggplot
matplotlib.style.use("ggplot")

In [28]:
cdf = pd.read_csv("Data/comments-sampled.csv", parse_dates=["approveDate"])
# cdf_nonpicks = cdf[cdf.NYTPicks==0].sample(12542)
# cdf_picks = cdf[cdf.NYTPicks==1]
# cdf_new = cdf_nonpicks.append(cdf_picks, ignore_index=True)
# cdf_new.to_csv("Data/comments-sampled.csv", index=False)

#cdf.columns
#cdf.rename(columns={"editorsSelection": "NYTPicks"}, inplace=True)
#cdf = cdf.iloc[np.random.permutation(len(cdf))]
#cdf.to_csv("Data/comments-1.csv", index=False)

In [29]:
print(cdf.shape[0])
print(cdf[cdf.NYTPicks==0].shape[0])
print(cdf[cdf.NYTPicks==1].shape[0])

25084
12542
12542


In [30]:
cdf

Unnamed: 0,commentID,commentBody,approveDate,recommendationCount,display_name,articleURL,NYTPicks
0,352886,That 2% uses water to produce power and food for the other 98% We have met the enemy and he is us.,2014-03-17 15:22:36,3,michjas,http://www.nytimes.com/2014/03/17/us/wests-drought-and-growth-intensify-conflict-over-water-rights.html,0
1,21073,"@Cobbler:<br/><br/>Firstly, family coverage doesn't mean that there isn't a high price attached. Secondly you are exaggerating your claim to draw the conclusions you desire.",2014-01-19 08:10:30,0,allan,http://economix.blogs.nytimes.com/2014/01/17/the-real-health-care-war-on-the-young/,0
2,208883,"I don't care what color lipstick is put on this 'pig' of a piece of legislation, ""free to live and work according to their faith"" is simply a fancy way of saying Arizonans should be 'free to discriminate' whomever they want to, whether due to sexual orientation, ethnicity, or religion. Pure and simple. One would think cooler heads in the legislature would prevail and ask the sponsors if they've really, REALLY, considered the implications of such nonsense. It reminds me of the infamous SYG laws that give homeowners the legal cover to shoot first and ask questions later. Just as SYG has proven to be far more problematic in practice for states and municipalities than the measure was sold on, this too, if it were ever passed, would become a nightmare--not just from a PR standpoint, but from an administrative one--that its proponents have probably refused to acknowledge because, lets face it, they might have to admit to its problems in the light of day, so best not to engage in any serious conversation about it and just press ahead with their partisan advantage. <br/><br/>As with all the other lessons of the past few years, let this too be an object lesson on what happens when GOtPers are elected to positions of authority. They don't say 'yes' to anything reasonable, and can't say 'no' to the most preposterous proposals. The strange thing is, they swim against a far greater tide of public and legal opinion on issues of discrimination. As such, it becomes a self-inflicted wound.",2014-02-21 22:57:56,100,Citixen,http://www.nytimes.com/2014/02/22/us/religious-right-in-arizona-cheers-bill-allowing-businesses-to-refuse-to-serve-gays.html,0
3,392025,"""These “e-liquids,” the key ingredients in e-cigarettes, are powerful neurotoxins.""<br/><br/>No doubt. But let's not allow e-cigarettes to co-opt the English language. The word ""e-liquid"" should be reserved for something much more general than liquid nicotine. Call it liquid nicotine or e-nicotine if you like, or maybe liquotine. But please, not e-liquid.",2014-03-24 13:30:15,1,polymath,http://www.nytimes.com/2014/03/24/business/selling-a-poison-by-the-barrel-liquid-nicotine-for-e-cigarettes.html,0
4,227001,"""1 in 5 children, of the USA, is starving to death due to lack of food""<br/><br/>Really? Let's see: Current population of the US is about 317 million.<br/>Percent of population under 15 is about 20%.<br/>So, in your opinion about 63 million people in the US are starving to death.<br/>This kind of makes Stalin look like a benevolent uncle.",2014-02-25 18:00:26,9,Michael Henry,http://www.nytimes.com/2014/02/26/health/obesity-rate-for-young-children-plummets-43-in-a-decade.html,0
5,123542,"I'm not sure where Ross encounters ""elites""; but he needs to branch out more. If you go into a wealthy neighborhood anywhere in the US--either coast, or in the middle--you will find many conservatives living the elite life too.<br/><br/>The big dividing line is education and the work that comes with it, and the ""lifestyle"" and culture that ensue. In affluent neighborhoods of my town, organic produce is purchased by liberals and conservatives; nobody is eating at McDonalds every night. These people (lefties and righties) believe infants should be read to and that academic achievement matters, no matter how much they disparage professors. And they believe this will make kids more employable and resilient, and mostly this has worked for them. Since they're pursuing education, marriage is delayed (yes, even in Bible Belt elite communities). <br/><br/>But go a few counties over, and it's a very different story and a different culture. If your kids are too bookish, socially that's bad. Since the parents don't read so well themselves, nobody pushes that. Since nothing is going to happen after high school, why wait? What is there to wait for? The loss of jobs has devastated the prospects of these young people. Nobody has steady work, and nobody looks like marriage material. <br/><br/>And if you're local ""elite""--small town lawyer or doctor--you try to keep your kids from being infected by the low horizons of their classmates. Given the anti-intellectual climate you live in, you too plan to leave town ASAP.",2014-02-05 16:49:08,0,mc,http://douthat.blogs.nytimes.com/2014/01/29/social-liberalism-as-class-warfare/,0
6,284027,I second what Nina just wrote. The NY exchange plans are lying about which doctors are in their networks. Today I found out that NONE of my doctors are accepting any of the exchange plans - even though they ARE on the plans' official lists of providers. You don't find this out until you go to your appointment and get asked to pay hundreds of dollars up front because they don't take your insurance. The insurance companies are just lying to get people's money. They know we'd never pick their plans if they were honest about how lousy their networks are. They're just adding drs names without their permission so that they can boost enrollment,2014-03-06 13:19:05,1,anae,http://www.nytimes.com/2014/03/06/opinion/in-health-care-choice-is-overrated.html,0
7,137686,"My Mantra: The Republican party has to return the White House to a white man. In order to do that, the Republicans must make Obama so toxic that ""reasonable"" people will vote Republican. (But, the smart Republicans know, there will be a white Democrat on the ballot in Nov 2016, which will negate insipid racism) Nothing worse than partisan politics and racism. All people have gotta do is read about Reconstruction after the Civil War.",2014-02-08 11:39:54,7,First Last,http://www.nytimes.com/2014/02/08/opinion/collins-boehner-on-fantasy-island.html,0
8,113329,"The West Bank is ""quiet?"" So, yes, let's just go ahead and let these people live in subjugation forever. Do you remember the day when Israel told these people that it could not negotiate until things got quiet? If the West Bank were not quiet, I have not doubt that you would have a different reason to argue for a continued occupation.",2014-02-04 12:19:14,15,Al N.,http://www.nytimes.com/2014/02/04/opinion/cohen-the-talks-round-two.html,0
9,407305,"If it had to come down to Mother Nature and global warming, Putin wins! Unless we stop fooling ourselves that we can start or stop changes in our climate. What happened to the "" we'll run out of oil soon argument"" formerly espoused by those screaming "" global warming"". One thing is for sure, you better have plenty of oil, gas and coal if you want to survive in the cold.",2014-03-26 14:03:24,1,Jack,http://www.nytimes.com/2014/03/26/opinion/friedman-putin-and-the-laws-of-gravity.html,0


**Supervised Learning**  
The prediction problem in this case is one of classification. We want to predict for any input comment whether it should be classified as an NYT Pick (noted as "1" in the `NYTPicks` column). 

For the dataset we have at hand we actually know the answer already, since for every comment we can see whether it was actually picked by editors or not. This allows us to develop a predictive model that *learns* the relationship between the input comment and the "target" or "label", which in this case is the NYT Pick status. This is called **supervised learning** -- the learning process is supervised in the sense that we've already given it the answer. The power of this of course is that once the learning process is complete we can apply the learned model to *new comments* for which we don't yet know whether they should be NYT Picks or not. 

**Features**  
For classification to work we need some attributes, often called **features**, which are used as predictors for the classification. Many of the editorial criteria described above may make good candidates, but let's start simple. To find possible features that may have some predictive power (not all features will), we might start by doing a bit of exploratory analysis. Let's see how the mean and median recommendation counts vary between comments that were picked and those that weren't.


In [31]:
cdf.groupby("NYTPicks").mean()

NYTPicks,commentID,recommendationCount
0,265203.963323,11.196221
1,269775.046564,72.177882


In [32]:
cdf.groupby("NYTPicks").median()

NYTPicks,commentID,recommendationCount
0,259597,4
1,269415,30


This would appear to suggest that `recommendationCount` might be a good predictor, since NYT Picks have a much higher mean and median than non-picks. 

Ok, so let's set up our classification model to use recommendationCount as a feature used to predict NYTPicks. 

**Logistic Regression**  
There are many different types of machine learning algorithms that can be used to learn from input data in a supervised fashion. You can read up more on [machine learning with scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html), a popular Python library. We'll just use a very basic algorithm in this case called **logistic regression**. You may have encountered regression before in stats class. Logistic regression is a form of regression that is used for classification problems in which the variable you're predicting isn't continuous but is binary (i.e. NYTPicks is either 0 or 1).
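To give a feel for what logistic regression computes, here is a minimal sketch using only the standard library. It passes a weighted recommendation count through the logistic function to get a probability, then thresholds at 0.5. The intercept and coefficient are hypothetical values picked for illustration, not parameters learned from this dataset, and `predict_pick_probability` is an illustrative helper name.

```python
import math

def logistic(z):
    """Map any real number to a probability between 0 and 1."""
    # Two branches keep math.exp from overflowing for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_pick_probability(recommendation_count, intercept, coefficient):
    """Probability of being a pick under a one-feature logistic model."""
    z = intercept + coefficient * recommendation_count
    return logistic(z)

# Hypothetical parameters for illustration only; a fitted model learns its own
INTERCEPT = -1.0
COEFFICIENT = 0.05

for count in [0, 20, 100]:
    p = predict_pick_probability(count, INTERCEPT, COEFFICIENT)
    label = 1 if p >= 0.5 else 0  # classify as a pick when p is at least 0.5
    print(count, round(p, 3), label)
```

With a positive coefficient, a higher recommendation count pushes the probability toward 1. Fitting the model amounts to finding the intercept and coefficient that best match the known labels.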


In [33]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# Create an array in the proper format from all of the recommendation Count values
X = cdf.recommendationCount.values.reshape(-1,1)

# Create an array in the proper format for the NYTPicks outcomes that we want to learn
y = cdf.NYTPicks.values.ravel()

# Train the model (or "fit" it to the data)
model = model.fit(X, y)

# Now score the model
model.score(X,y)

0.73939563068091219

73.9% accuracy. That's not too bad. 

But, wait a minute. That's not really fair, since we just tested it on the exact same data that it was trained on ... that's basically cheating. To get an accurate evaluation it's important to have a **training dataset** and a **testing dataset** when doing supervised learning. In order to really know whether your learned model is successful you need to test it on data that it's never seen before. That's because if you just test it on the examples it was trained on you won't know if it generalizes to new examples it's unfamiliar with. Let's create training and testing sets so we can properly evaluate the model. 
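As a rough illustration of what a train-test split involves, the sketch below shuffles a small list of rows and cuts it into two non-overlapping parts. `simple_train_test_split` is a hypothetical helper written for this example and the toy rows are invented; with real data, scikit-learn's own splitting utility is the better choice.

```python
import random

def simple_train_test_split(rows, test_size=0.5, seed=0):
    """Shuffle rows and split them into (train, test) lists."""
    rng = random.Random(seed)   # seeded so the split is repeatable
    shuffled = list(rows)       # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    test = shuffled[:n_test]
    train = shuffled[n_test:]
    return train, test

# Toy (recommendationCount, NYTPicks) pairs invented for illustration
toy_rows = [(3, 0), (0, 0), (100, 1), (1, 0), (9, 0), (45, 1), (30, 1), (7, 0)]
train_rows, test_rows = simple_train_test_split(toy_rows, test_size=0.5)
print(len(train_rows), len(test_rows))
```

The key property is that no row appears in both parts, so the test score reflects performance on examples the model has not seen.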

In [36]:
from sklearn import metrics
# train_test_split lives in sklearn.model_selection (formerly sklearn.cross_validation)
from sklearn.model_selection import train_test_split

model2 = LogisticRegression()

# Create a train-test split of 50-50
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
# Train the new model just on the training data
model2 = model2.fit(X_train, y_train)

# And test the model on the test data
#print(model2.score(X_train, y_train))
#print(model2.score(X_test, y_test))

# Use the model to predict NYTPicks values for the test dataset
predicted = model2.predict(X_test)
print("Accuracy")
print(metrics.accuracy_score(y_test, predicted))
print("\nConfusion Matrix")
print(metrics.confusion_matrix(y_test, predicted))
#print(metrics.classification_report(y_test, predicted))

Accuracy
0.740551746133

Confusion Matrix
[[5673  604]
 [2650 3615]]


The way to read the confusion matrix (rows are the actual classes, columns are the predicted classes) is:

True Negatives (TN) | False Positives (FP)
----|----
**False Negatives (FN)** | **True Positives (TP)**

The sum of the True Negative (TN) and True Positive (TP) cells divided by the total number of test cases (i.e. all four cells) yields the accuracy. 

A False Positive (FP) is a comment that is actually not a NYT Pick, but was predicted by the classifier to be a NYT Pick. 

A False Negative (FN) is a comment that is actually a NYT Pick, but was predicted by the classifier to *not* be a NYT Pick.

You can see from the result above that the errors are imbalanced. There are many more FNs (2650) than FPs (604). 
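The arithmetic behind these numbers can be checked by hand. The sketch below copies the four cells from the confusion matrix output above and recomputes accuracy, along with precision and recall, two related measures that the notebook itself does not compute and which are added here for illustration.

```python
# Confusion matrix copied from the output above:
# rows are the actual classes, columns are the predicted classes
confusion = [[5673, 604],
             [2650, 3615]]

tn, fp = confusion[0]  # actual non-picks: predicted non-pick, predicted pick
fn, tp = confusion[1]  # actual picks: predicted non-pick, predicted pick
total = tn + fp + fn + tp

# Share of all test comments that were classified correctly
accuracy = (tn + tp) / float(total)
# Of the comments predicted to be picks, the share that really were picks
precision = tp / float(tp + fp)
# Of the comments that really were picks, the share the classifier found
recall = tp / float(tp + fn)

print("Accuracy:  %.4f" % accuracy)
print("Precision: %.4f" % precision)
print("Recall:    %.4f" % recall)
```

The accuracy matches the value reported above. Precision comes out at roughly 0.86 and recall at roughly 0.58, which restates the imbalance: when this classifier says "pick" it is usually right, but it misses many of the real picks.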

----
**New Features**  
Now let's try to improve our classifier, increasing the accuracy as well as decreasing the FP and FN rates. To do that we need to find more features that are predictive of NYT Picks status. 

Let's break into teams of two. Each team should spend about 40 minutes writing code to compute a feature or score based on comment text analysis. The previous [class tutorial on text analysis](https://github.com/comp-journalism/UMD-J479V-J779V-Spring2016/blob/master/Weekly/Week_3/text-analysis.ipynb) will come in handy. What can you count / measure from the text itself that might be predictive? The score should ideally help predict NYT Picks status. A template is provided below so that you can test your new score's predictive power. 

Then you'll send me your .ipynb files and we'll combine everyone's features to see if we can make the predictive power of the classifier even greater.  
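As one possible starting point, here is a minimal sketch of a length-based score, since length is among the editorial criteria listed earlier. The function names are hypothetical, and the score is just one candidate; whether it actually helps prediction would need to be tested. It assumes comment text may contain `<br/>` tags, as the sample rows do.

```python
import re

def strip_html_breaks(text):
    """Replace <br/> style line-break tags with a space."""
    return re.sub(r"<br\s*/?>", " ", text)

def word_count_score(text):
    """Return the number of whitespace-separated words as a float."""
    cleaned = strip_html_breaks(text)
    return float(len(cleaned.split()))

# Example on a short snippet in the style of the comments in the dataset
sample = "No doubt.<br/><br/>But please, not e-liquid."
print(word_count_score(sample))
```

A function like this can be applied to the `commentBody` column in the same way as any other scoring function that takes a string and returns a number.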

In [37]:
import nltk
import string
from nltk.tokenize import WhitespaceTokenizer
from nltk.corpus import stopwords

stopword_list = stopwords.words('english')

tokenizer = WhitespaceTokenizer()

def remove_punctuation(text):
    # Grab the list of standard punctuation symbols that are provided in the string library
    punctuations = string.punctuation # includes following characters: !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~

    # But don't strip out apostrophes, as we want to preserve possessives and contractions, an alternative would be to expand contractions
    excluded_punctuations = ["'"]
    for p in punctuations:
        if p not in excluded_punctuations:
            # replace each punctuation symbol by a space
            text = text.replace(p, " ") 

    return text

# Takes a tokenized list and returns a list minus any of the words in the stopword list
def remove_stopwords(tokens):
    return [w for w in tokens if w not in stopword_list]

# There are a bunch of commented out pieces that may or may not be useful in the next function. Use as you see fit.
def calculate_score(text):
    #text = text.lower()
    #text = remove_punctuation(text)
    #text = " ".join(text.split())
    text_tokens = tokenizer.tokenize(text)
    #text_tokens = remove_stopwords(text_tokens)
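    #text_tokens = [porter.stem(w) for w in text_tokens if w not in stopword_list]  # requires: porter = nltk.stem.PorterStemmer()
    # Very simple (silly?) new score: the number of periods in the comment per whitespace-separated token
    period_count = text.count(".")
    # Guard against comments with no tokens to avoid dividing by zero
    if len(text_tokens) == 0:
        return 0.0
    # This function should return a numerical score
    return float(period_count) / len(text_tokens)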

# This will create a new_score column by applying the above function to the text column
cdf["new_score"] = cdf["commentBody"].apply(calculate_score)
cdf

Unnamed: 0,commentID,commentBody,approveDate,recommendationCount,display_name,articleURL,NYTPicks,new_score
0,352886,That 2% uses water to produce power and food for the other 98% We have met the enemy and he is us.,2014-03-17 15:22:36,3,michjas,http://www.nytimes.com/2014/03/17/us/wests-drought-and-growth-intensify-conflict-over-water-rights.html,0,0.045455
1,21073,"@Cobbler:<br/><br/>Firstly, family coverage doesn't mean that there isn't a high price attached. Secondly you are exaggerating your claim to draw the conclusions you desire.",2014-01-19 08:10:30,0,allan,http://economix.blogs.nytimes.com/2014/01/17/the-real-health-care-war-on-the-young/,0,0.083333
2,208883,"I don't care what color lipstick is put on this 'pig' of a piece of legislation, ""free to live and work according to their faith"" is simply a fancy way of saying Arizonans should be 'free to discriminate' whomever they want to, whether due to sexual orientation, ethnicity, or religion. Pure and simple. One would think cooler heads in the legislature would prevail and ask the sponsors if they've really, REALLY, considered the implications of such nonsense. It reminds me of the infamous SYG laws that give homeowners the legal cover to shoot first and ask questions later. Just as SYG has proven to be far more problematic in practice for states and municipalities than the measure was sold on, this too, if it were ever passed, would become a nightmare--not just from a PR standpoint, but from an administrative one--that its proponents have probably refused to acknowledge because, lets face it, they might have to admit to its problems in the light of day, so best not to engage in any serious conversation about it and just press ahead with their partisan advantage. <br/><br/>As with all the other lessons of the past few years, let this too be an object lesson on what happens when GOtPers are elected to positions of authority. They don't say 'yes' to anything reasonable, and can't say 'no' to the most preposterous proposals. The strange thing is, they swim against a far greater tide of public and legal opinion on issues of discrimination. As such, it becomes a self-inflicted wound.",2014-02-21 22:57:56,100,Citixen,http://www.nytimes.com/2014/02/22/us/religious-right-in-arizona-cheers-bill-allowing-businesses-to-refuse-to-serve-gays.html,0,0.035156
3,392025,"""These “e-liquids,” the key ingredients in e-cigarettes, are powerful neurotoxins.""<br/><br/>No doubt. But let's not allow e-cigarettes to co-opt the English language. The word ""e-liquid"" should be reserved for something much more general than liquid nicotine. Call it liquid nicotine or e-nicotine if you like, or maybe liquotine. But please, not e-liquid.",2014-03-24 13:30:15,1,polymath,http://www.nytimes.com/2014/03/24/business/selling-a-poison-by-the-barrel-liquid-nicotine-for-e-cigarettes.html,0,0.117647
4,227001,"""1 in 5 children, of the USA, is starving to death due to lack of food""<br/><br/>Really? Let's see: Current population of the US is about 317 million.<br/>Percent of population under 15 is about 20%.<br/>So, in your opinion about 63 million people in the US are starving to death.<br/>This kind of makes Stalin look like a benevolent uncle.",2014-02-25 18:00:26,9,Michael Henry,http://www.nytimes.com/2014/02/26/health/obesity-rate-for-young-children-plummets-43-in-a-decade.html,0,0.070175
5,123542,"I'm not sure where Ross encounters ""elites""; but he needs to branch out more. If you go into a wealthy neighborhood anywhere in the US--either coast, or in the middle--you will find many conservatives living the elite life too.<br/><br/>The big dividing line is education and the work that comes with it, and the ""lifestyle"" and culture that ensue. In affluent neighborhoods of my town, organic produce is purchased by liberals and conservatives; nobody is eating at McDonalds every night. These people (lefties and righties) believe infants should be read to and that academic achievement matters, no matter how much they disparage professors. And they believe this will make kids more employable and resilient, and mostly this has worked for them. Since they're pursuing education, marriage is delayed (yes, even in Bible Belt elite communities). <br/><br/>But go a few counties over, and it's a very different story and a different culture. If your kids are too bookish, socially that's bad. Since the parents don't read so well themselves, nobody pushes that. Since nothing is going to happen after high school, why wait? What is there to wait for? The loss of jobs has devastated the prospects of these young people. Nobody has steady work, and nobody looks like marriage material. <br/><br/>And if you're local ""elite""--small town lawyer or doctor--you try to keep your kids from being infected by the low horizons of their classmates. Given the anti-intellectual climate you live in, you too plan to leave town ASAP.",2014-02-05 16:49:08,0,mc,http://douthat.blogs.nytimes.com/2014/01/29/social-liberalism-as-class-warfare/,0,0.056680
6,284027,I second what Nina just wrote. The NY exchange plans are lying about which doctors are in their networks. Today I found out that NONE of my doctors are accepting any of the exchange plans - even though they ARE on the plans' official lists of providers. You don't find this out until you go to your appointment and get asked to pay hundreds of dollars up front because they don't take your insurance. The insurance companies are just lying to get people's money. They know we'd never pick their plans if they were honest about how lousy their networks are. They're just adding drs names without their permission so that they can boost enrollment,2014-03-06 13:19:05,1,anae,http://www.nytimes.com/2014/03/06/opinion/in-health-care-choice-is-overrated.html,0,0.052174
7,137686,"My Mantra: The Republican party has to return the White House to a white man. In order to do that, the Republicans must make Obama so toxic that ""reasonable"" people will vote Republican. (But, the smart Republicans know, there will be a white Democrat on the ballot in Nov 2016, which will negate insipid racism) Nothing worse than partisan politics and racism. All people have gotta do is read about Reconstruction after the Civil War.",2014-02-08 11:39:54,7,First Last,http://www.nytimes.com/2014/02/08/opinion/collins-boehner-on-fantasy-island.html,0,0.053333
8,113329,"The West Bank is ""quiet?"" So, yes, let's just go ahead and let these people live in subjugation forever. Do you remember the day when Israel told these people that it could not negotiate until things got quiet? If the West Bank were not quiet, I have not doubt that you would have a different reason to argue for a continued occupation.",2014-02-04 12:19:14,15,Al N.,http://www.nytimes.com/2014/02/04/opinion/cohen-the-talks-round-two.html,0,0.032258
9,407305,"If it had to come down to Mother Nature and global warming, Putin wins! Unless we stop fooling ourselves that we can start or stop changes in our climate. What happened to the "" we'll run out of oil soon argument"" formerly espoused by those screaming "" global warming"". One thing is for sure, you better have plenty of oil, gas and coal if you want to survive in the cold.",2014-03-26 14:03:24,1,Jack,http://www.nytimes.com/2014/03/26/opinion/friedman-putin-and-the-laws-of-gravity.html,0,0.042254


In [38]:
# After calculating your score you might do a quick eyeball to see if the aggregate score is different between the two classes
cdf.groupby("NYTPicks").mean()

NYTPicks,commentID,recommendationCount,new_score
0,265203.963323,11.196221,0.079307
1,269775.046564,72.177882,0.060711


In [39]:
# Repeat predictive train test
from sklearn.model_selection import train_test_split

# Build the feature matrix from the new score plus recommendationCount
X = cdf[["new_score", "recommendationCount"]].values.reshape(-1,2)

# Create an array in the proper format for the NYTPicks outcomes that we want to learn
y = cdf.NYTPicks.values.ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

# Train a fresh model just on the training data
model3 = LogisticRegression()
model3 = model3.fit(X_train, y_train)

# Use the newly trained model to predict NYTPicks values for the test dataset
predicted = model3.predict(X_test)
print("Accuracy")
print(metrics.accuracy_score(y_test, predicted))
print("\nConfusion Matrix")
print(metrics.confusion_matrix(y_test, predicted))


Accuracy
0.745255940041

Confusion Matrix
[[5676  598]
 [2597 3671]]
