# Sentiment Analysis 

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Custom Code imports
import sys
sys.path.insert(0, '/Users/briankalinowski/PycharmProjects/FakeNewsChallenge/Code')
import SentimentAnalysisUtils as sentUtils

## Fake & Real News Data 

In [2]:
news_df = pd.read_csv('/Users/briankalinowski/Desktop/Data/news_content_lemma.csv')
news_df.head()

Unnamed: 0,title,text,tokenized_headline,tokenized_content,type,valid_score
0,Muslims BUSTED They Stole Millions In Govt Ben...,Print They should pay all the back all the mon...,muslims bust steal millions in govt benefit,print should pay all the back all the money pl...,bias,0
1,Re Why Did Attorney General Loretta Lynch Plea...,Why Did Attorney General Loretta Lynch Plead T...,re why do attorney general loretta lynch plead...,why do attorney general loretta lynch plead th...,bias,0
2,BREAKING Weiner Cooperating With FBI On Hillar...,Red State Fox News Sunday reported this mornin...,break weiner cooperate with fbi on hillary ema...,red state fox news sunday report this morning ...,bias,0
3,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,Email Kayla Mueller was a prisoner and torture...,pin drop speech by father of daughter kidnappe...,email kayla mueller be a prisoner and torture ...,bias,0
4,FANTASTIC! TRUMPS 7 POINT PLAN To Reform Healt...,Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...,fantastic trump 7 point plan to reform healthc...,email healthcare reform to make america great ...,bias,0


## Perform Sentiment Scoring

- NLTK's VADER `SentimentIntensityAnalyzer` returns a dict of sentiment scores for each article (via its `polarity_scores()` method).


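- The sentiment scoring features are `neg`, `neu`, `pos`, and `compound`. The first three are the proportions of the text rated negative, neutral, and positive, and they sum to roughly 1.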


- The `sentiment_score` feature is derived from the `compound` score, which sums the lexicon ratings of the words in the text and normalizes the result to lie between -1 (most extreme negative) and +1 (most extreme positive).


    - Positive sentiment: (sentiment_score = 1), (compound score >= 0.05)
    - Neutral sentiment: (sentiment_score = 0), (compound score > -0.05) and (compound score < 0.05)
    - Negative sentiment: (sentiment_score = -1), (compound score <= -0.05)
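The thresholds above can be sketched as a small helper. This is only an illustration of the rule as stated: `label_from_compound` is a hypothetical name, and the actual logic lives in `SentimentAnalysisUtils`, which is not shown in this notebook.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to -1, 0, or 1 (hypothetical helper)."""
    if compound >= threshold:
        return 1    # positive sentiment
    if compound <= -threshold:
        return -1   # negative sentiment
    return 0        # neutral sentiment

# Sample compound values, including one inside the neutral band
for compound in (-0.2263, 0.095, 0.9799, 0.0):
    print(compound, label_from_compound(compound))
```

Keeping the threshold as a parameter makes it easy to check how sensitive the labels are to the width of the neutral band.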

In [3]:
news_sentiment_df = sentUtils.get_sentiment_vader_scores(news_df, 'tokenized_content')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/briankalinowski/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [4]:
# only using the sentiment scoring features 
news_sentiment_df = news_sentiment_df[['sentiment_score', 'neg', 'neu', 'pos', 'compound', 'valid_score']]
news_sentiment_df.head()

Unnamed: 0,sentiment_score,neg,neu,pos,compound,valid_score
0,-1,0.123,0.764,0.113,-0.2263,0
1,-1,0.071,0.874,0.055,-0.7533,0
2,1,0.017,0.9,0.083,0.9041,0
3,1,0.253,0.472,0.275,0.095,0
4,1,0.08,0.765,0.154,0.9799,0


## Random Forest 1: Just Sentiment Scoring Features

In [5]:
# Train/Test Split
x_train, x_test, y_train, y_test = train_test_split(news_sentiment_df.drop(columns=['valid_score']), 
                                                    news_sentiment_df.valid_score, 
                                                    test_size=0.2, 
                                                    random_state=21)

In [6]:
rf_params = {'n_estimators': [100, 200, 300],
             'min_samples_split': [2, 4, 8, 10, 12, 15]
            }

rf_sent_predict = sentUtils.run_random_forest_grid_search(x_train, y_train, rf_params, x_test)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    6.0s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   14.1s
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:   18.2s
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   26.6s
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   39.2s
[Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed:   51.2s
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  69 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done  85 out of  90 | elapsed:  1.5min remaining:    5.2s
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  1.6min finished


Best Training Parameters: {'min_samples_split': 15, 'n_estimators': 300}
Best Training Score: 0.8019636814260521
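
The search above evaluates 3 × 6 = 18 parameter combinations with 5-fold cross-validation, which accounts for the 90 fits in the log. The body of `run_random_forest_grid_search` is not shown here, so the following is a minimal sketch of what such a helper might look like using scikit-learn's `GridSearchCV`. The function name `grid_search_random_forest`, the fixed `random_state`, and the synthetic data are assumptions made for illustration, and the sketch treats the "Best Training Score" as the mean cross-validated score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

def grid_search_random_forest(x_train, y_train, param_grid, x_test, cv=5, n_jobs=1):
    """Hypothetical stand-in for sentUtils.run_random_forest_grid_search."""
    search = GridSearchCV(
        RandomForestClassifier(random_state=21),
        param_grid=param_grid,
        cv=cv,          # number of cross-validation folds
        n_jobs=n_jobs,  # the logged run appears to have used n_jobs=-1
    )
    search.fit(x_train, y_train)
    print("Best Training Parameters:", search.best_params_)
    print("Best Training Score:", search.best_score_)  # mean CV score of the best candidate
    # GridSearchCV refits the best candidate on all training data by default
    return search.predict(x_test)

# Small synthetic problem and a reduced grid so the sketch runs quickly
X, y = make_classification(n_samples=200, n_features=5, random_state=21)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=21)
small_grid = {"n_estimators": [10, 20], "min_samples_split": [2, 4]}
predictions = grid_search_random_forest(x_tr, y_tr, small_grid, x_te)
```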


In [7]:
sentUtils.format_classification_report(y_test, rf_sent_predict)

Unnamed: 0,precision,recall,f1-score,support
fake,0.844836,0.707894,0.770326,2369.0
real,0.787795,0.892944,0.83708,2877.0
accuracy,0.809379,0.809379,0.809379,0.809379
macro avg,0.816316,0.800419,0.803703,5246.0
weighted avg,0.813554,0.809379,0.806935,5246.0


In [8]:
sentUtils.format_confusion_matrix(y_test, rf_sent_predict)

Unnamed: 0,Predict_Fake,Predict_Real,True_Totals
True_Fake,1677,692,2369
True_Real,308,2569,2877
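
As a consistency check, the classification report values for the `fake` class can be recomputed from the confusion matrix counts. The variable names below are illustrative, and the counts are copied from the table above.

```python
# Counts copied from the confusion matrix, treating "fake" as the positive class
tp, fn = 1677, 692    # true fake: predicted fake / predicted real
fp, tn = 308, 2569    # true real: predicted fake / predicted real

precision_fake = tp / (tp + fp)   # share of predicted-fake articles that are truly fake
recall_fake = tp / (tp + fn)      # share of truly fake articles that were caught
f1_fake = 2 * precision_fake * recall_fake / (precision_fake + recall_fake)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(round(precision_fake, 6), round(recall_fake, 6),
      round(f1_fake, 6), round(accuracy, 6))
```

The lower recall for `fake` reflects the 692 fake articles predicted as real, which is the larger of the two error types for this model.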


## Random Forest 2: Sentiment and LDA 

- First, we create a word count matrix and apply the LDA transformation to it. This gives, for each document, the probability of belonging to each LDA topic.



- Next, we combine the LDA topic probabilities with the sentiment scoring features from the previous model.
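
The two steps can be sketched on a toy corpus with scikit-learn's `CountVectorizer` and `LatentDirichletAllocation`. This is an assumption about what the `sentUtils` helpers wrap, since their source is not shown. The four documents and the choice of two topics are made up for illustration. The run in this notebook produces ten topics.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the lemmatized article text
docs = [
    "senate vote on healthcare bill",
    "president sign healthcare reform bill",
    "fbi investigate email server",
    "email server investigation continue",
]

# Step 1: word count matrix with shape (documents, vocabulary)
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Step 2: LDA maps each document to a probability distribution over topics
lda = LatentDirichletAllocation(n_components=2, max_iter=10, random_state=21)
doc_topics = lda.fit_transform(counts)

topic_columns = [f"topic_{i}" for i in range(doc_topics.shape[1])]
topics_df = pd.DataFrame(doc_topics, columns=topic_columns)
print(topics_df.round(3))
```

Each row of `topics_df` sums to 1. When the topic columns are joined to the sentiment features with `pd.concat(..., axis=1)`, rows are aligned on the index, so both frames need to share the same index.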

In [9]:
# Run CountVectorizer transformation 
vectorized_tokens = sentUtils.get_count_vectorizer_matrix(news_df, 'tokenized_content')

# Get LDA transformed Topics df
news_lda_topics = sentUtils.get_lda_transformed_topics(vectorized_tokens)

# Combine with Sentiment scoring df
news_sentiment_lda_df = pd.concat([news_lda_topics, news_sentiment_df], axis=1)
news_sentiment_lda_df.head()

Count Vectorizer Shape: (26227, 27906) 

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10


Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,sentiment_score,neg,neu,pos,compound,valid_score
0,0.002704,0.002704,0.002704,0.002703,0.002703,0.002703,0.57111,0.002703,0.002704,0.407262,-1,0.123,0.764,0.113,-0.2263,0
1,0.000633,0.000633,0.000633,0.653693,0.174755,0.000633,0.071526,0.096227,0.000633,0.000633,-1,0.071,0.874,0.055,-0.7533,0
2,0.067823,0.000991,0.00099,0.806241,0.030974,0.00099,0.089019,0.00099,0.00099,0.00099,1,0.017,0.9,0.083,0.9041,0
3,0.662286,0.054681,0.004002,0.065395,0.004,0.004,0.143301,0.054334,0.004,0.004001,1,0.253,0.472,0.275,0.095,0
4,0.000578,0.000578,0.000578,0.000579,0.000578,0.000578,0.177604,0.183452,0.000578,0.634896,1,0.08,0.765,0.154,0.9799,0


In [10]:
# Train/Test Split 
x_train_2, x_test_2, y_train_2, y_test_2 = train_test_split(news_sentiment_lda_df.drop(columns=['valid_score']), 
                                                            news_sentiment_lda_df.valid_score, 
                                                            test_size=0.2, 
                                                            random_state=21)

In [11]:
rf_lda_predict = sentUtils.run_random_forest_grid_search(x_train_2, y_train_2, rf_params, x_test_2)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    8.4s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   22.2s
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:   29.7s
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   45.3s
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done  69 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done  85 out of  90 | elapsed:  2.7min remaining:    9.5s
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  2.9min finished


Best Training Parameters: {'min_samples_split': 2, 'n_estimators': 300}
Best Training Score: 0.9029121586197035


In [12]:
sentUtils.format_confusion_matrix(y_test_2, rf_lda_predict)

Unnamed: 0,Predict_Fake,Predict_Real,True_Totals
True_Fake,2012,357,2369
True_Real,123,2754,2877


In [13]:
sentUtils.format_classification_report(y_test_2, rf_lda_predict)

Unnamed: 0,precision,recall,f1-score,support
fake,0.942389,0.849304,0.893428,2369.0
real,0.885246,0.957247,0.91984,2877.0
accuracy,0.908502,0.908502,0.908502,0.908502
macro avg,0.913817,0.903275,0.906634,5246.0
weighted avg,0.911051,0.908502,0.907913,5246.0
