# Sentiment Analysis and Part of Speech Tagging

In this notebook, we create features for our classification model by extracting information about the sentiment and syntactic makeup of each article's title and text.

### I. [Reading in libraries and data](#Reading-in-libraries-and-data)
### II. [Sentiment analysis](#Sentiment-analysis)
### III. [Part of speech tagging](#Part-of-speech-tagging)

## Reading in libraries and data

In [1]:
import pandas as pd
import numpy as np
import spacy

from nltk.tag import pos_tag
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

Git would not allow us to push a single CSV containing the entire dataset, so we split it into X and y and into train and test sets. Below, we read in all four files and combine them into one dataframe, flagging each row's origin in a `train_dataset` column so that the split can be restored later.

In [5]:
X_train = pd.read_csv('../datasets/X_train.csv')
X_test = pd.read_csv('../datasets/X_test.csv')
y_train = pd.read_csv('../datasets/y_train.csv')
y_test = pd.read_csv('../datasets/y_test.csv')

In [6]:
y_train['train_dataset'] = 1
y_test['train_dataset'] = 0

X = pd.concat([X_train, X_test])
y = pd.concat([y_train, y_test])

X.reset_index(drop=True, inplace=True)
y.reset_index(drop=True, inplace=True)

df = pd.concat([X, y], axis = 1)
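
The index reset matters because the later loops write to rows with `df.loc[i, ...]`, which assumes unique labels running from 0 to n-1. A minimal sketch on toy data (the frames and values below are made up for illustration) shows the repeated labels that a row-wise concat produces, and uses `ignore_index=True` as a one-step alternative to `reset_index(drop=True)`:

```python
import pandas as pd

# Toy stand-ins for the train/test splits (hypothetical values).
X_train = pd.DataFrame({"title": ["a", "b"]})
X_test = pd.DataFrame({"title": ["c"]})
y_train = pd.DataFrame({"is_true": [1, 0]})
y_test = pd.DataFrame({"is_true": [1]})

# A plain row-wise concat keeps each frame's own index, so labels repeat.
index_before = list(pd.concat([X_train, X_test]).index)

# ignore_index=True renumbers the rows 0..n-1 during the concat itself.
X = pd.concat([X_train, X_test], ignore_index=True)
y = pd.concat([y_train, y_test], ignore_index=True)

# Column-wise concat now lines up features and labels row by row.
df = pd.concat([X, y], axis=1)
```

Here `index_before` holds the repeated labels from the two source frames, while `df` ends up with one unique label per row.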

In [7]:
df.head()

Unnamed: 0,title,text,date,domestic,title_word_count,text_word_count,title_uppercase_count,title_lowercase_count,title_all_letter_count,title_special_count,...,text_all_letter_count,text_special_count,text_!,text_?,text_#,text_%,text_$,text_parentheses,is_true,train_dataset
0,Trump Just Accidentally Gave America The Grea...,Donald Trump wanted to give his supporters a s...,2016-04-24,1,14,485,13,51,64,0,...,2021,5,0,1,0,0,0,4,0,1
1,U.N. torture envoy concerned at water-boarding...,GENEVA (Reuters) - The U.N. torture investigat...,2016-03-09,1,10,377,4,50,54,0,...,1895,2,0,0,0,0,0,2,1,1
2,Robert Kennedy Jr. says tapped by Trump to hea...,CHICAGO (Reuters) - Vaccination skeptic Robert...,2017-01-10,1,12,690,4,53,57,0,...,3501,7,1,0,0,0,0,6,1,1
3,More arrests in apparent Saudi campaign agains...,(Reuters) - Saudi Arabia has detained more cle...,2017-09-12,0,9,539,2,55,57,0,...,2702,4,0,0,0,0,0,4,1,1
4,Illinois ends spring session without a FY 2017...,"SPRINGFIELD, Ill. (Reuters) - The Democrat-con...",2016-05-31,1,9,592,3,38,41,0,...,3020,10,0,0,0,0,6,4,1,1


In [9]:
df.describe()

Unnamed: 0,domestic,title_word_count,text_word_count,title_uppercase_count,title_lowercase_count,title_all_letter_count,title_special_count,title_!,title_?,title_#,...,text_all_letter_count,text_special_count,text_!,text_?,text_#,text_%,text_$,text_parentheses,is_true,train_dataset
count,39858.0,39858.0,39858.0,39858.0,39858.0,39858.0,39858.0,39858.0,39858.0,39858.0,...,39858.0,39858.0,39858.0,39858.0,39858.0,39858.0,39858.0,39858.0,39858.0,39858.0
mean,0.729816,12.164685,416.125847,13.750489,49.146796,62.897285,0.333785,0.059963,0.033318,0.015355,...,1983.638717,5.191204,0.366451,0.622209,0.173265,0.058483,0.428346,3.542451,0.532164,0.749987
std,0.44406,3.76451,358.401114,14.581771,13.173449,17.801389,0.704135,0.262801,0.189532,0.128351,...,1729.653051,12.849245,1.40991,1.753961,1.041448,0.810225,1.769956,10.553744,0.498971,0.433025
min,0.0,3.0,1.0,1.0,0.0,12.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,10.0,213.0,3.0,43.0,52.0,0.0,0.0,0.0,0.0,...,1009.0,2.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.25
50%,1.0,11.0,374.0,6.0,50.0,59.0,0.0,0.0,0.0,0.0,...,1771.0,3.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0
75%,1.0,14.0,524.0,20.0,56.0,71.0,0.0,0.0,0.0,0.0,...,2482.0,6.0,0.0,1.0,0.0,0.0,0.0,4.0,1.0,1.0
max,1.0,45.0,8436.0,137.0,167.0,233.0,6.0,4.0,3.0,4.0,...,42044.0,1756.0,133.0,94.0,53.0,122.0,129.0,1526.0,1.0,1.0


## Sentiment analysis

Now that the dataframe is assembled, we create columns holding VADER sentiment scores (negative, positive, neutral, and compound) for both the title and the text of each article.

In [10]:
analyzer = SentimentIntensityAnalyzer()

# initialize as floats so the columns match the dtype of the scores
df['title_sa_neg'] = 0.
df['title_sa_pos'] = 0.
df['title_sa_neu'] = 0.
df['title_sa_compound'] = 0.

print('Title Sentiment Analysis')
for i, t in enumerate(df.title.values):
    vs = analyzer.polarity_scores(t)
    df.loc[i, 'title_sa_neg'] = vs['neg']
    df.loc[i, 'title_sa_pos'] = vs['pos']
    df.loc[i, 'title_sa_neu'] = vs['neu']
    df.loc[i, 'title_sa_compound'] = vs['compound']
    if (i % 5000) == 0:
        print(i)

Title Sentiment Analysis
0
3000
6000
9000
12000
15000
18000
21000
24000
27000
30000
33000
36000
39000


In [11]:
# initialize as floats so the columns match the dtype of the scores
df['text_sa_neg'] = 0.
df['text_sa_pos'] = 0.
df['text_sa_neu'] = 0.
df['text_sa_compound'] = 0.

print('Text Sentiment Analysis')        
for i, t in enumerate(df.text.values):
    vs = analyzer.polarity_scores(t)
    df.loc[i, 'text_sa_neg'] = vs['neg']
    df.loc[i, 'text_sa_pos'] = vs['pos']
    df.loc[i, 'text_sa_neu'] = vs['neu']
    df.loc[i, 'text_sa_compound'] = vs['compound']
    if (i % 5000) == 0:
        print(i)

Text Sentiment Analysis
0
3000
6000
9000
12000
15000
18000
21000
24000
27000
30000
33000
36000
39000
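
The two scoring loops above write one cell at a time. As a rough alternative sketch, the scores can be collected once and attached as columns in a single step. The helper name `add_sentiment_columns` and the `toy_scorer` stand-in are hypothetical; in the notebook the scorer would be `analyzer.polarity_scores`, which returns a dict with the keys `neg`, `neu`, `pos`, and `compound`.

```python
import pandas as pd

def add_sentiment_columns(df, column, prefix, scorer):
    """Return a copy of df with one new column per key in the scorer's output.

    `scorer` is any callable that maps a string to a dict of floats,
    e.g. SentimentIntensityAnalyzer().polarity_scores (assumed, not run here).
    """
    # One dict per row becomes one row of the scores frame.
    scores = pd.DataFrame([scorer(t) for t in df[column]], index=df.index)
    # Prefix the keys so title and text scores do not collide.
    scores = scores.add_prefix(prefix)
    return pd.concat([df, scores], axis=1)

def toy_scorer(text):
    # Hypothetical stand-in so the sketch runs without vaderSentiment.
    exclaims = text.count("!")
    return {"neg": 0.0, "neu": 1.0, "pos": 0.0,
            "compound": min(1.0, 0.5 * exclaims)}

demo = pd.DataFrame({"title": ["Calm headline", "WOW!!"]})
demo = add_sentiment_columns(demo, "title", "title_sa_", toy_scorer)
```

Because every column name is derived from the scorer's keys, this layout cannot assign a score to the wrong column by a mistyped name.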


## Part of speech tagging

To add more features and potentially improve the accuracy of our model, we extract part-of-speech tags from each article. This also lets us analyze the grammatical structure of the sentences and compare the syntax of real and fake news articles. Each article is assigned one proportion per tag (verbs, nouns, adverbs, and so on), computed as the number of tokens with that tag divided by the total number of tokens in the article:

In [15]:
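# parsing parts of speech with spacy
# note: the 'en' shortcut, n_threads, and Doc.is_parsed are spaCy 2.x APIs;
# spaCy 3 expects a full model name (e.g. 'en_core_web_sm') and no n_threads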
en_nlp = spacy.load('en')

parsed_quotes = []
for i, parsed in enumerate(en_nlp.pipe(df.text.values, batch_size=50, n_threads=4)):
    assert parsed.is_parsed
    if (i % 5000) == 0:
        print(i)
    parsed_quotes.append(parsed)

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
33000
34000
35000
36000
37000
38000
39000


In [None]:
# creating proportion column for each part of speech

unique_pos = []

for parsed in parsed_quotes:
    unique_pos.extend([t.pos_ for t in parsed])
    
unique_pos = np.unique(unique_pos)

for pos in unique_pos:
    df[pos+'_prop'] = 0.

In [None]:
# looping through rows and adding proportions to each POS column

for i, parsed in enumerate(parsed_quotes):
    if (i % 5000) == 0:
        print(i)
    parsed_len = len(parsed)
    if parsed_len == 0:
        # empty document: leave its proportions at the initial 0.0
        continue
    for pos in unique_pos:
        # number of tokens in this document carrying the current tag
        count = sum(1 for x in parsed if x.pos_ == pos)
        df.loc[i, pos+'_prop'] = count / parsed_len
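
The proportion calculation above can also be sketched in isolation with `collections.Counter`, which counts every tag in one pass instead of rescanning the document once per tag. The function name `pos_proportions` and the example tag list are hypothetical; in the notebook the input would be `[t.pos_ for t in parsed]`.

```python
from collections import Counter

def pos_proportions(tags):
    """Map each tag to its share of the document; empty input gives {}."""
    total = len(tags)
    if total == 0:
        # Avoid dividing by zero for documents with no tokens.
        return {}
    counts = Counter(tags)
    return {tag + "_prop": n / total for tag, n in counts.items()}

# Hypothetical tag sequence standing in for one parsed article.
example_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
props = pos_proportions(example_tags)
```

Collecting one such dict per article and passing the list to `pd.DataFrame(...).fillna(0.0)` would yield one proportion column per tag, with 0.0 wherever a tag does not occur in an article.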

With the new feature columns added, we split the dataframe back into train and test sets and export each one to a CSV file:

In [16]:
X_train = df[df.train_dataset == 1].drop(columns = ['is_true', 'train_dataset'])
X_test = df[df.train_dataset == 0].drop(columns = ['is_true', 'train_dataset'])

X_train.to_csv('X_train_w_SA.csv', index = False)
X_test.to_csv('X_test_w_SA.csv', index = False)

Next, we'll move on to vectorizing the text and narrowing the vectorized columns down to the most important features for our final classification model.