# OffensiveEval VADER Experiments
In this notebook I run some experiments with the VADER sentiment analysis tool ([GitHub link](https://github.com/cjhutto/vaderSentiment#demo-including-example-of-non-english-text-translations)).

The plan is to add a column to the dataframe that holds the compound score from VADER. Hopefully this extra bit of sentiment analysis will help the classifier with things like sarcasm, negations, degree modifiers such as 'very', and heavy punctuation, e.g. 'Great!!!!'.

Uses of VADER, quoted from the GitHub README:


> examples of typical use cases for sentiment analysis, including proper handling of sentences with:
>
> - typical negations (e.g., "*not* good")
> - use of contractions as negations (e.g., "*wasn't* very good")
> - conventional use of **punctuation** to signal increased sentiment intensity (e.g., "Good!!!")
> - conventional use of **word-shape** to signal emphasis (e.g., using ALL CAPS for words/phrases)
> - using **degree modifiers** to alter sentiment intensity (e.g., intensity *boosters* such as "very" and intensity *dampeners* such as "kind of")
> - understanding many **sentiment-laden slang** words (e.g., 'sux')
> - understanding many sentiment-laden **slang words as modifiers** such as 'uber' or 'friggin' or 'kinda'
> - understanding many sentiment-laden **emoticons** such as :) and :D
> - translating **utf-8 encoded emojis** such as 💘 and 💋 and 😁
> - understanding sentiment-laden **initialisms and acronyms** (for example: 'lol')


In [1]:
!pip install vaderSentiment



In [2]:
import pandas as pd
import numpy as np
import logging
from pprint import pprint
from time import time
from sklearn import metrics
from sklearn.metrics import f1_score, precision_score, recall_score

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [3]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [4]:
analyzer = SentimentIntensityAnalyzer()

In [5]:
path = 'data/olid-training-v2.0.tsv'
testset = pd.read_table(path, header=None, names=['id','tweet','sub_a','sub_b','sub_c'])
# Only the training data is used for all of this testing, at least to start with,
# so that I can do my best to avoid overfitting of any kind.
# (The dataframe is called `testset`, but it holds the training file loaded above.)
testset['label_a_num'] = testset.sub_a.map({'NOT':0, 'OFF':1})
x = testset.tweet
y = testset.label_a_num

In [6]:
del testset['sub_b']
del testset['sub_c']

In [7]:
vader_testset = testset.head(10)

In [8]:
vader_testset.head(1)

| | id | tweet | sub_a | label_a_num |
|---|---|---|---|---|
| 0 | 86426 | @USER She should ask a few native Americans wh... | OFF | 1 |


In [9]:
# Take the tweet text from the second row (positional index 1)
example_tweet = testset.iloc[1].tweet

In [10]:
analyzer.polarity_scores(example_tweet)

{'neg': 0.201, 'neu': 0.799, 'pos': 0.0, 'compound': -0.5067}
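
The `compound` value is a single normalized score between -1 (most negative) and +1 (most positive). The VADER README suggests treating scores of 0.05 and above as positive, -0.05 and below as negative, and anything in between as neutral. A minimal sketch of that rule follows; `label_from_compound` is a hypothetical helper name, not part of vaderSentiment:

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label.

    The 0.05 threshold is the cut-off suggested in the VADER README;
    the function name and signature are illustrative assumptions.
    """
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# The example tweet above scored -0.5067
print(label_from_compound(-0.5067))  # negative
```

For the classifier I keep the raw compound score rather than this label, since the continuous value carries more information.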

In [11]:
def get_vader_compound_score(tweet):
    """Return VADER's compound score (between -1 and 1) for one tweet."""
    scores = analyzer.polarity_scores(tweet)
    return scores['compound']

In [19]:
testset['vader_compound_score'] = testset['tweet'].map(get_vader_compound_score)
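
If the other three VADER scores (`neg`, `neu`, `pos`) turn out to be useful as well, all four can be added in one go. The sketch below is only one way to do it: `add_vader_columns` and `fake_polarity_scores` are hypothetical names, and the fake scorer is a stand-in so the snippet runs without VADER installed. In this notebook you would pass `analyzer.polarity_scores` instead.

```python
import pandas as pd

def add_vader_columns(df, score_fn, text_col='tweet', prefix='vader_'):
    """Return a copy of df with one new column per key returned by score_fn.

    score_fn is any callable that maps a string to a dict of scores,
    e.g. analyzer.polarity_scores in this notebook.
    """
    scores = pd.DataFrame([score_fn(text) for text in df[text_col]],
                          index=df.index)
    return df.join(scores.add_prefix(prefix))

def fake_polarity_scores(text):
    # Hypothetical stand-in with the same keys VADER returns.
    return {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

toy = pd.DataFrame({'tweet': ['first tweet', 'second tweet']})
toy_scored = add_vader_columns(toy, fake_polarity_scores)
print(toy_scored.columns.tolist())
```

Note that this names the compound column `vader_compound`, not `vader_compound_score` as in the cell above.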

In [20]:
testset.head(-10)

| | id | tweet | sub_a | label_a_num | vader_compound_score |
|---|---|---|---|---|---|
| 0 | 86426 | @USER She should ask a few native Americans wh... | OFF | 1 | 0.0000 |
| 1 | 90194 | @USER @USER Go home you’re drunk!!! @USER #MAG... | OFF | 1 | -0.5067 |
| 2 | 16820 | Amazon is investigating Chinese employees who ... | NOT | 0 | 0.4767 |
| 3 | 62688 | @USER Someone should'veTaken" this piece of sh... | OFF | 1 | -0.1779 |
| 4 | 43605 | @USER @USER Obama wanted liberals &amp; illega... | NOT | 0 | 0.0000 |
| ... | ... | ... | ... | ... | ... |
| 13225 | 22965 | @USER Can we all agree that Tomlins seat is he... | NOT | 0 | 0.6486 |
| 13226 | 11132 | @USER when you coming to ohio? | NOT | 0 | 0.0000 |
| 13227 | 87416 | @USER @USER @USER @USER Liars like the Antifa ... | OFF | 1 | -0.1027 |
| 13228 | 56034 | @USER @USER He is involved because he was ther... | NOT | 0 | -0.4003 |


In [21]:
vect = CountVectorizer(stop_words='english', min_df=2, max_df=0.5)
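
To see what `min_df=2` and `max_df=0.5` do, here is a small sketch on a made-up four-document corpus (the corpus and the `toy_` names are illustrative, not taken from OLID). With these settings a term has to appear in at least two documents and in no more than half of all documents to stay in the vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: 'good' is in 3 of 4 documents, 'movie' in 2 of 4,
# every other word in just 1.
toy_corpus = ['good movie', 'good plot', 'bad movie', 'good acting']

toy_vect = CountVectorizer(stop_words='english', min_df=2, max_df=0.5)
toy_vect.fit(toy_corpus)

# Terms that survived the document-frequency filters
kept_terms = sorted(toy_vect.vocabulary_)
print(kept_terms)
```

Here 'good' is too frequent for `max_df=0.5`, and the words that occur only once fall below `min_df=2`, so 'movie' is the only term left.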

In [22]:
x_train, x_test, y_train, y_test = train_test_split(x,y,random_state=1)
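
Since `x` only contains the tweet text, the VADER scores have to be split the same way before they can be used as a feature. `train_test_split` keeps the pandas index on its outputs, so one option is to look the scores up by index afterwards. A sketch on a toy dataframe (the `toy` data and the `_tr`/`_te` names are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    'tweet': ['a b', 'c d', 'e f', 'g h'],
    'label_a_num': [1, 0, 1, 0],
    'vader_compound_score': [-0.5, 0.4, 0.0, 0.6],
})

x_tr, x_te, y_tr, y_te = train_test_split(
    toy.tweet, toy.label_a_num, random_state=1)

# Pick the scores for exactly the rows that ended up in each split
vader_tr = toy.loc[x_tr.index, 'vader_compound_score']
vader_te = toy.loc[x_te.index, 'vader_compound_score']
print(len(vader_tr), len(vader_te))
```

In the notebook the same lookup would be `testset.loc[x_train.index, 'vader_compound_score']`.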

In [23]:
x_train_dtm = vect.fit_transform(x_train)
x_test_dtm = vect.transform(x_test)

In [24]:
x_train_dtm

<9930x7115 sparse matrix of type '<class 'numpy.int64'>'
	with 82989 stored elements in Compressed Sparse Row format>
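
The document-term matrix is sparse, so the VADER score cannot simply be added as a dataframe column at this point. One possible way to combine the two is `scipy.sparse.hstack`, sketched below on toy data. `append_dense_feature` is a hypothetical helper, and shifting the score by 1 is my own assumption: it moves the compound score from [-1, 1] to [0, 2], which matters because classifiers such as `MultinomialNB` expect non-negative feature values.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

def append_dense_feature(dtm, values, shift=1.0):
    """Append one dense column to a sparse document-term matrix.

    shift is added to every value; shift=1.0 maps a VADER compound
    score from [-1, 1] to [0, 2] so the column stays non-negative.
    """
    column = np.asarray(values, dtype=float).reshape(-1, 1) + shift
    return hstack([dtm, csr_matrix(column)], format='csr')

# Toy stand-ins for x_train_dtm and the matching compound scores
toy_dtm = csr_matrix(np.array([[1, 0, 2], [0, 1, 0]]))
toy_scores = [-0.5067, 0.4767]

combined = append_dense_feature(toy_dtm, toy_scores)
print(combined.shape)  # (2, 4)
```

The same call has to be applied to both `x_train_dtm` and `x_test_dtm`, each with the scores for its own rows, so that train and test end up with the same number of columns.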

In [18]:
# pd.concat([ df[['text']], df['vader'].apply(pd.Series) ], axis='columns')