# Pipeline for creating VADER sentiment scores

This notebook accepts a body of user comments and uses the VADER sentiment scoring system to find a distribution of positive-negative intensity scores grouped by original article or post.

A measure of distribution variance is then used as a target for our machine learning feature set.

## About VADER scores
Source VADER: https://github.com/cjhutto/vaderSentiment

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based tool tuned for social media text. It scores strings of text from -1 (extremely negative) to +1 (extremely positive).

It computes a total score (compound score) for a sentence by looking up the sentiment score assigned to each individual word and then applying rules that modify these scores based on context, such as negation and intensifiers.
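As a rough sketch of the last step, the rule-adjusted sum of word scores has to be squashed into the [-1, 1] range. The function below uses a normalization of the form x / sqrt(x² + α). The function name is hypothetical, and the value α = 15 is an assumption based on the linked vaderSentiment source, so check it against the version you have installed.

```python
import math


def normalize_valence(total, alpha=15.0):
    """Squash an unbounded summed valence into [-1, 1].

    Sketch only: alpha=15 is assumed from the vaderSentiment source.
    """
    score = total / math.sqrt(total * total + alpha)
    # Clamp to guard against floating point drift at the extremes.
    return max(-1.0, min(1.0, score))


# Larger sums move toward the extremes but never pass -1 or +1.
for total in (-6.0, 0.0, 1.5, 6.0):
    print(total, round(normalize_valence(total), 4))
```

With this shape, a single strongly worded term already moves the score well away from zero, while long strings of sentiment-laden words approach the limits without exceeding them.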

A call to polarity_scores(string) returns the proportions of the string rated positive, negative and neutral, plus the compound score computed after the rules are applied.

This is in the form: {'compound': 0.4199, 'neg': 0.0, 'neu': 0.417, 'pos': 0.583}
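To make the returned dictionary concrete, the sketch below works with the sample values shown above rather than calling the library. It assumes that `pos`, `neu` and `neg` are proportions that sum to roughly 1 (subject to rounding) and that only `compound` is needed downstream, as in the rest of this notebook.

```python
# Sample output copied from the text above (not a fresh library call).
sample = {'compound': 0.4199, 'neg': 0.0, 'neu': 0.417, 'pos': 0.583}

# The three proportions should account for (roughly) the whole string.
proportion_total = sample['pos'] + sample['neu'] + sample['neg']
print('proportions sum to about 1:', abs(proportion_total - 1.0) < 0.01)

# The pipeline keeps only the compound score.
compound = sample['compound']
print('compound:', compound)
```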



In [3]:
# Install VADER if needed. Uncomment & run.
#!pip install vaderSentiment

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer as sia

### Assigning Parameters

In [98]:
# Dataset Specific Parameters
FILE_PATH = './'

MIN_COMMENT_COUNT = 5

OUTPUT_COLS = ['id',
               'topic',
               'source',
               'text',
               'replyCount',
               'vaderCatLabel',
               'vaderCat']


## for the NYT news archive from Kaggle uncomment:
#MAIN_ID = 'articleID'
#MAIN_TEXT = 'snippet'
#REPLY_ID = 'commentID'
#REPLY_TEXT = 'commentBody'
#TOPIC = 'newDesk'  # needed
#SOURCE = 'kaggle'
#REPLY_FILE = 'article_comments.csv'
#MAIN_FILE = 'articles.csv'
#EXPORT_NAME = 'articles_w_scores.csv'


## for the tweet dataset uncomment:
MAIN_ID = 'ConversationId'
MAIN_TEXT = 'Text'
REPLY_ID = 'TweetId'
REPLY_TEXT = 'Text'
TOPIC = 'MentionedUsers' #this should change after topics updated
SOURCE = 'Username'
    
REPLY_FILE = 'comments.csv'
MAIN_FILE = 'tweets.csv'
EXPORT_NAME = 'tweets_all.csv'

### Setting up VADER scoring

In [86]:
# vader initialized
vader = sia()

In [87]:
def getScore(string):
    """Return the VADER compound score (-1 to +1) for a string."""
    scoreDict = vader.polarity_scores(string)
    return scoreDict["compound"]

### Importing Replies

In [88]:
# reply data import
comments = pd.read_csv(FILE_PATH + "data/" + REPLY_FILE)
comments = comments[comments[REPLY_TEXT].notnull()]
print('total comments: ',comments.shape[0])

total comments:  3012431


In [89]:
# constraining NYT Kaggle comments to roughly the length of a Tweet (first 250 characters)
if MAIN_ID == 'articleID':
    comments[REPLY_TEXT] = comments[REPLY_TEXT].str[0:250]

### Applying Scores

In [90]:
# applying scores
comments["vaderScore"] = comments[REPLY_TEXT].map(getScore)

## Analyzing for Category Cut-offs

### Expectations for Polarization Distributions

After working through more complex definitions of polarization, we determined that, at its essence, a polarized conversation is one that has more sentiment on the fringes than in the middle.

Therefore we categorize a thread with a larger number of both positive and negative comments than neutral ones as polarized.
If a thread is skewed negative, with very few positive comments, it is not polarized (and vice versa).
We require both the positive and negative counts to outweigh the neutral comments in order to call a thread polarized.
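The rule can be sketched as a small predicate over per-thread counts. The function name and the example counts are illustrative, not part of the pipeline.

```python
def is_polarized(neu_count, pos_count, neg_count):
    """Return True when both extremes outnumber the neutral comments."""
    return neu_count < pos_count and neu_count < neg_count


# Illustrative threads as (neutral, positive, negative) counts.
examples = {
    'both extremes dominate': (1, 2, 2),
    'skewed negative': (3, 3, 7),
    'mostly neutral': (4, 2, 2),
}
for name, counts in examples.items():
    print(name, is_polarized(*counts))
```

Note that a tie does not count: a thread with three neutral and three positive comments fails the test even if negative comments are plentiful.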

VADER's documentation suggests a compound score of 0.05 or above as the cut-off for positive sentiment and -0.05 or below as the cut-off for negative, so we do the same.
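A minimal sketch of that three-way split for a single compound score follows. The constant and function names are hypothetical, and the cut-offs are treated as inclusive, which is an assumption based on the convention in the VADER README.

```python
POS_CUTOFF = 0.05   # compound scores at or above this count as positive
NEG_CUTOFF = -0.05  # compound scores at or below this count as negative


def label_score(compound):
    """Map a compound score to 'pos', 'neg' or 'neutral'."""
    if compound >= POS_CUTOFF:
        return 'pos'
    if compound <= NEG_CUTOFF:
        return 'neg'
    return 'neutral'


for score in (-0.6, -0.05, 0.0, 0.0499, 0.05, 0.42):
    print(score, label_score(score))
```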


In [91]:
# using vader classifications to count positive, negative and neutral tweets

comments['neutral'] = 1
comments['pos'] = 0
comments['neg'] = 0

comments.loc[(comments['vaderScore'] <= -0.05) |
             (comments['vaderScore'] >= 0.05), 'neutral'] = 0
# cut-offs are inclusive so every non-neutral comment is counted as pos or neg
comments.loc[(comments['vaderScore'] <= -0.05), 'neg'] = 1
comments.loc[(comments['vaderScore'] >= 0.05), 'pos'] = 1


In [92]:
# creating aggregations by original post

articles = comments.groupby(MAIN_ID).agg({'neutral':['count', 'sum'],
                                          'pos': ['sum'],
                                          'neg': ['sum']})
articles = articles.reset_index()

if MAIN_ID == 'ConversationId':
    articles[MAIN_ID] = articles[MAIN_ID].astype('int')

articles.columns = [MAIN_ID, 'commentCount', 
                    'neuCount', 'posCount', 'negCount']

articles['vaderCat'] = 0.0
articles.loc[((articles.neuCount < articles.posCount) &
             (articles.neuCount < articles.negCount)), 'vaderCat'] = 1.0

articles = articles[articles['commentCount'] >= MIN_COMMENT_COUNT]
print(MIN_COMMENT_COUNT)
articles.head()

5


Unnamed: 0,ConversationId,commentCount,neuCount,posCount,negCount,vaderCat
0,1344795510348066817,13,3,3,7,0.0
2,1344798049747542016,5,1,2,2,1.0
3,1344798817523294213,53,15,24,14,0.0
4,1344799259007311872,8,4,2,2,0.0
6,1344800129413488642,49,20,14,15,0.0
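For sanity-checking the aggregation on a handful of rows, the same per-thread counts can be reproduced without pandas. This is a sketch on hypothetical data: the thread IDs, scores and the `summarize` helper are made up for illustration, and the cut-offs are assumed to be inclusive.

```python
from collections import defaultdict

# Hypothetical (thread_id, compound_score) pairs standing in for the comments table.
rows = [
    ('t1', 0.60), ('t1', -0.70), ('t1', 0.00), ('t1', 0.30), ('t1', -0.20),
    ('t2', 0.00), ('t2', 0.01), ('t2', 0.40),
]


def summarize(rows, pos_cutoff=0.05, neg_cutoff=-0.05):
    """Count comments per thread by sentiment class and flag polarized threads."""
    counts = defaultdict(lambda: {'commentCount': 0, 'neuCount': 0,
                                  'posCount': 0, 'negCount': 0})
    for thread_id, score in rows:
        thread = counts[thread_id]
        thread['commentCount'] += 1
        if score >= pos_cutoff:
            thread['posCount'] += 1
        elif score <= neg_cutoff:
            thread['negCount'] += 1
        else:
            thread['neuCount'] += 1
    for thread in counts.values():
        # Polarized only when both extremes outnumber the neutral comments.
        polarized = (thread['neuCount'] < thread['posCount']
                     and thread['neuCount'] < thread['negCount'])
        thread['vaderCat'] = 1.0 if polarized else 0.0
    return dict(counts)


summary = summarize(rows)
for thread_id, stats in summary.items():
    print(thread_id, stats)
```

The sketch does not apply the minimum comment count; in the notebook that filter runs after the category is assigned.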


In [93]:
# display class counts
articles.groupby('vaderCat')['commentCount'].count()

vaderCat
0.0    22240
1.0    17825
Name: commentCount, dtype: int64

In [94]:
print('Threads: ', articles.shape[0])
print('Replies: ', articles.commentCount.sum())

Threads:  40065
Replies:  2966516


### Adding Category Labels

In [95]:
articles['vaderCatLabel'] = 'low'
articles.loc[articles['vaderCat'] == 1, 'vaderCatLabel'] = 'high'

### Merging with Original Posts and Exporting

In [100]:
# import main thread data 
main_df = pd.read_csv(FILE_PATH + "data/" + MAIN_FILE)

# the NYT data has no source column, so add a constant one named after SOURCE
if SOURCE == 'kaggle':
    main_df['kaggle'] = 'nyt_kaggle'
    
main_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67343 entries, 0 to 67342
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Datetime        67343 non-null  object 
 1   TweetId         67343 non-null  int64  
 2   Text            67311 non-null  object 
 3   Username        67343 non-null  object 
 4   MentionedUsers  0 non-null      float64
 5   ConversationId  67343 non-null  int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 3.1+ MB


In [101]:
# add comment counts and categories
main_df = main_df.merge(articles, on=MAIN_ID, how='inner')

# filter and rename columns
main_df = main_df[[MAIN_ID,
                   TOPIC,
                   SOURCE,
                   MAIN_TEXT,
                   'commentCount',
                   'vaderCatLabel',
                   'vaderCat']]

main_df.columns = OUTPUT_COLS

# export dataframe
main_df.to_csv(FILE_PATH + "data/" + EXPORT_NAME, index=False)

print("matched threads: ", main_df.shape[0])
print("matched replies: ", main_df.replyCount.sum())

matched threads:  42516
matched replies:  3854911
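The matched thread count printed here is higher than the thread count reported earlier. One possible explanation, which is an assumption and not verified against the data, is that some IDs appear more than once in the main file: an inner merge emits one row per matching pair, so repeated IDs also repeat their reply counts. A plain-Python sketch for spotting such duplicates before merging, using hypothetical IDs:

```python
from collections import Counter


def duplicated_ids(ids):
    """Return {id: occurrences} for every ID that appears more than once."""
    counts = Counter(ids)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical main-file IDs; 102 and 103 are repeated.
main_ids = [101, 102, 102, 103, 103, 103]

dupes = duplicated_ids(main_ids)
# Each repeat beyond the first adds one extra row to an inner merge.
extra_rows = sum(n - 1 for n in dupes.values())
print('duplicated IDs:', dupes)
print('extra merged rows:', extra_rows)
```

In pandas, the `validate` argument of `merge` can enforce the same expectation at merge time.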


In [102]:
main_df = pd.read_csv(FILE_PATH + 'data/' + EXPORT_NAME)
main_df.head()

Unnamed: 0,id,topic,source,text,replyCount,vaderCatLabel,vaderCat
0,1377385383168765952,,FoxNews,Activists protest renaming Chicago school afte...,306,high,1.0
1,1377384607969013765,,FoxNews,Border Patrol video shows smugglers abandoning...,108,high,1.0
2,1377384339105669122,,FoxNews,Cause of Tiger Woods car crash determined but ...,169,low,0.0
3,1377367836046192641,,FoxNews,GOP rep urges HHS to halt reported plan to rel...,80,high,1.0
4,1377358399759785987,,FoxNews,Some Democrats trying to stop Iowa New Hampshi...,96,high,1.0
