# Business Question

The New York Times is reevaluating the comment moderation on its web content, in hopes of stimulating reader engagement, deepening the quality of user feedback, and identifying potentially problematic comments more quickly. They would like to use the existing recommendation tool to develop a machine learning model that will predict which comments will be most popular and which comments will be most likely to generate further engagement. The end goal will be to sort comments according to this prediction (as a third option available to users in addition to presenting comments ranked in response to recommendations or chronologically). Additionally, they would like to identify comments that are most likely to be flagged as abuse to bring these comments to the attention of moderators more quickly. 

### Data Acquisition
The data used for this project represents articles and comments on The New York Times website in April 2017. I downloaded the dataset from Kaggle [at this address](https://www.kaggle.com/aashita/nyt-comments). The data was originally collected using the New York Times API. The process for this collection is well-documented on the Kaggle page. 

In [32]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
#from skmultilearn.problem_transform import ClassifierChain

from sklearn.multioutput import ClassifierChain
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist

#toggle variable to print previews, 'sanity checks'
print_detail = True

## Preprocessing

In [33]:
# DELETE ME if you end up wanting to expand and add more data to the dataset, you can combine the datasets before this cell 
# and name them art/comm and then pick up at this point.

# read in dataset of articles
art = pd.read_csv('data/ArticlesApril2017.csv')
# read in dataset of comments
comm = pd.read_csv('data/CommentsApril2017.csv')

if print_detail:
    # preview
    art.head()

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


In [34]:
if print_detail:
    #preview
    comm.head()

In [35]:
print('Article columns: ')
display(art.columns)
print('Comments columns: ')
display(comm.columns)

Article columns: 


Index(['abstract', 'articleID', 'articleWordCount', 'byline', 'documentType',
       'headline', 'keywords', 'multimedia', 'newDesk', 'printPage', 'pubDate',
       'sectionName', 'snippet', 'source', 'typeOfMaterial', 'webURL'],
      dtype='object')

Comments columns: 


Index(['approveDate', 'commentBody', 'commentID', 'commentSequence',
       'commentTitle', 'commentType', 'createDate', 'depth',
       'editorsSelection', 'parentID', 'parentUserDisplayName', 'permID',
       'picURL', 'recommendations', 'recommendedFlag', 'replyCount',
       'reportAbuseFlag', 'sharing', 'status', 'timespeople', 'trusted',
       'updateDate', 'userDisplayName', 'userID', 'userLocation', 'userTitle',
       'userURL', 'inReplyTo', 'articleID', 'sectionName', 'newDesk',
       'articleWordCount', 'printPage', 'typeOfMaterial'],
      dtype='object')

In [36]:
#combine the article and comment dataframes using articleID as an index
df = pd.merge(comm, art, on='articleID')

if print_detail:
    df

In [37]:
df.columns

Index(['approveDate', 'commentBody', 'commentID', 'commentSequence',
       'commentTitle', 'commentType', 'createDate', 'depth',
       'editorsSelection', 'parentID', 'parentUserDisplayName', 'permID',
       'picURL', 'recommendations', 'recommendedFlag', 'replyCount',
       'reportAbuseFlag', 'sharing', 'status', 'timespeople', 'trusted',
       'updateDate', 'userDisplayName', 'userID', 'userLocation', 'userTitle',
       'userURL', 'inReplyTo', 'articleID', 'sectionName_x', 'newDesk_x',
       'articleWordCount_x', 'printPage_x', 'typeOfMaterial_x', 'abstract',
       'articleWordCount_y', 'byline', 'documentType', 'headline', 'keywords',
       'multimedia', 'newDesk_y', 'printPage_y', 'pubDate', 'sectionName_y',
       'snippet', 'source', 'typeOfMaterial_y', 'webURL'],
      dtype='object')

Before I start the Exploratory Data Analysis (EDA), I want to eliminate some of the columns that are duplicates, or contain information that I know won't be helpful (such as URLs). Some of these columns I'm still unsure about. 

In [40]:
keep = ['commentBody', 'commentID', 'commentType', 'createDate', 'depth',
        'editorsSelection','recommendations', 'recommendedFlag', 'replyCount',
        'reportAbuseFlag', 'sharing', 'status', 'timespeople', 'trusted','sectionName_x', 
        'newDesk_x', 'articleWordCount_x', 'printPage_x', 'typeOfMaterial_x' , 'pubDate',
        'byline', 'documentType', 'headline', 'keywords']

to_drop = [item for item in list(df.columns) if item not in keep]

if print_detail:
    print(to_drop)

df.drop(to_drop, inplace=True)

KeyError: "['approveDate' 'commentSequence' 'commentTitle' 'parentID'\n 'parentUserDisplayName' 'permID' 'picURL' 'updateDate' 'userDisplayName'\n 'userID' 'userLocation' 'userTitle' 'userURL' 'inReplyTo' 'articleID'\n 'abstract' 'articleWordCount_y' 'multimedia' 'newDesk_y' 'printPage_y'\n 'sectionName_y' 'snippet' 'source' 'typeOfMaterial_y' 'webURL'] not found in axis"

## EDA

EDA Outline
* how many articles
* how many comments/article
* comment length

* # of recommendations
* relationship between recommendations and recommendation flag
    * highly recommended comments?
* relationship between recommendations and editorSelection

* avg number of replies (replyCount)
* relationship between replies and recommendations

* number of abusive comments

* sections
    * how many articles/section
    * how many comments/article/section
    * comment length by section
    * number of recommendations(or highly recommended)/section
    * number of replies/section
    * number of abusive comments/section

One of the more interesting and useful ways that the data segments itself is by section. There are several variables (sectionName, newDesk, typeOfMaterial) that might end up being best for sorting the data, so I want to take a look at each of them to see which one is most helpful.

In [10]:
# the saddest little helper function
def print_categories(column):
    '''the function will print the name of a given column as well as all of the categories in that column'''
    print(column)
    display(df[column].unique())

#create a list of potential section variables
potential_ys = ['sectionName_x', 'newDesk_x', 'typeOfMaterial_x']

# print categories for each of the potential section variables
for each in potential_ys:
    print_categories(each)

sectionName_x


array(['Unknown', 'College Basketball', 'Media', 'Politics', 'Baseball',
       'Sunday Review', 'Pro Basketball', 'Television', 'Asia Pacific',
       'Family', 'Live', 'Education Life', 'Hockey', 'Lesson Plans',
       'Middle East', 'Move', 'Music', 'Mind', 'Soccer', 'DealBook',
       'Golf', 'Eat', 'Student Loans', 'Economy', 'Art & Design',
       'Book Review', 'Europe', 'Tennis', 'Auto Racing', 'Pro Football',
       'Canada'], dtype=object)

newDesk_x


array(['Insider', 'OpEd', 'Editorial', 'Sports', 'Games', 'Culture',
       'Travel', 'Business', 'RealEstate', 'National', 'Metro',
       'Learning', 'Unknown', 'Foreign', 'Well', 'Upshot', 'Science',
       'EdLife', 'Dining', 'Magazine', 'Letters', 'Arts&Leisure',
       'Styles', 'Metropolitan', 'Weekend', 'SundayBusiness',
       'BookReview', 'Summary'], dtype=object)

typeOfMaterial_x


array(['News', 'Op-Ed', 'Editorial', 'Review', 'Blog', 'briefing',
       'Brief', 'Letter', 'News Analysis', 'Obituary (Obit)', 'Question'],
      dtype=object)

Right now, the only columns that I'm keeping are the comment body (x) and the type of material (y)

In [17]:
df.groupby(by='newDesk_x').count()
df.groupby(by='newDesk_y').count().sort_values('commentBody', ascending=False)

Unnamed: 0_level_0,commentBody,newDesk_x,typeOfMaterial_y
newDesk_y,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
OpEd,82519,82519,82519
National,37406,37406,37406
Foreign,30666,30666,30666
Editorial,24462,24462,24462
Business,20600,20600,20600
Learning,6080,6080,6080
Magazine,6049,6049,6049
Upshot,6025,6025,6025
Culture,4895,4895,4895
Metro,4613,4613,4613


In [15]:
df.groupby(by='newDesk_y').count()

Unnamed: 0_level_0,commentBody,newDesk_x,typeOfMaterial_y
newDesk_y,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Arts&Leisure,331,331,331
BookReview,221,221,221
Business,20600,20600,20600
Culture,4895,4895,4895
Dining,2560,2560,2560
EdLife,1185,1185,1185
Editorial,24462,24462,24462
Foreign,30666,30666,30666
Games,2169,2169,2169
Insider,664,664,664


## Text Preprocessing

In [None]:
#test run for processing a single string - want to functionalize
#integers = []

def lower_and_sw_filter(comment_str):
    ''' this function returns a string with all the characters converted to lowercase
    and all stopwords and punctuation removed'''
    
    #strip html tags
    comment_str = strip_tags(comment_str)
    
    #lowercase
    comment_str = comment_str.lower()

    #tokenize
    tokenizer = RegexpTokenizer(r'[a-zA-Z0-9]+')

    comment_str = tokenizer.tokenize(comment_str)

    #stopwords

    filtered = list(filter(lambda x: x.lower() not in stopwords_set, comment_str))

    #lemmatize

    lemmatizer = WordNetLemmatizer()
    lemma = []
    for word in filtered:
#         #print(word)
#         if word.isdigit():
#             int_list.append(word)
        
        lemmatized_word = lemmatizer.lemmatize(word)
        lemma.append(lemmatized_word)
    
    lemma = ' '.join(lemma)
    
    return lemma

## Modeling 

## Evaluation

## Conclusions