# Sentiment Analysis

Compute lexicon-based sentiment scores for newspaper articles covering events in Syria from 2010 to 2017.

In [1]:
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np
from numpy import nan
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import os

sns.set_context('notebook')
sns.set_style('whitegrid')

## Data Loading

In [3]:
df = pd.read_csv('CleanLexisNexis.csv', parse_dates=['date'])

In [4]:
df.dtypes

publication                object
date               datetime64[ns]
title                      object
length                      int64
publicationtype            object
text                       object
year                        int64
month                       int64
day                         int64
dtype: object

In [5]:
df.head(4)

| | publication | date | title | length | publicationtype | text | year | month | day |
|---|---|---|---|---|---|---|---|---|---|
| 0 | The Atlanta Journal-Constitution | 2010-01-03 | Five pressing questions to answer in 2010 | 747 | Newspapers | Will President Barack Obama regain his momentu... | 2010 | 1 | 3 |
| 1 | BBC | 2010-01-04 | Saudi foreign minister says Israel "spoiled ch... | 2196 | Transcript | Text of report by Saudi-owned leading pan-Arab... | 2010 | 1 | 4 |
| 2 | BBC | 2010-01-08 | Highlights of Iran parliamentary session. | 1123 | Transcript | Excerpt from report on parliamentary proceedin... | 2010 | 1 | 8 |
| 3 | Right Vision News | 2010-01-09 | Jordan:Way out for Obama | 852 | Newspaper | Pakistan, Jan. 09 -- These are the worst of ti... | 2010 | 1 | 9 |


## 1. Sentiment Analysis

Use NLTK to build sentiment scores. 

Use the positive/negative word lists provided by Andy Kim, author of *Can Big Data Forecast North Korean Military Aggression?*

#### Append the positive and negative lists together

In [5]:
os.chdir('/Users/laurieottehenning/Documents/Georgetown Data Science /Capstone/Harvard Pos:Neg')

pos = pd.read_csv('Harvard_Positive.csv', names=['Word', 'Positive'])
neg = pd.read_csv('Harvard_Negative.csv', names=['Word', 'Negative'])

pos['Word'] = pos['Word'].str.lower()
neg['Word'] = neg['Word'].str.lower()

wordlist = pd.concat([pos, neg])
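
Because the two lists are combined and then used side by side, it can help to check whether any word appears in both: such a word would add one to each count and cancel out in the final score. The sketch below is a plain-Python illustration under stated assumptions. The sample entries are hypothetical and not taken from the actual lexicon files, and `normalize` is a helper name introduced here.

```python
def normalize(words):
    """Lowercase and strip each entry, skipping blanks and non-string values."""
    return {w.strip().lower() for w in words if isinstance(w, str) and w.strip()}


# Hypothetical sample entries, NOT taken from the Harvard word lists.
sample_pos = ["Accord", "Peace ", "support", "Progress"]
sample_neg = ["ATTACK", "crisis", "Support", ""]

pos_words = normalize(sample_pos)
neg_words = normalize(sample_neg)

# Words present in both sets would be counted as positive and negative at once.
ambiguous = pos_words & neg_words
print(sorted(ambiguous))
```

In the notebook the same helper could be applied to `pos['Word']` and `neg['Word']`, assuming those columns hold one word per row.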


#### Count the number of positive and negative words in each text

In [6]:
# Sets make each membership test a constant-time lookup
pos_words = set(pos['Word'])
neg_words = set(neg['Word'])


def count_matches(text, lexicon):
    """Count whitespace-separated tokens in `text` that appear in `lexicon`."""
    return sum(word.lower() in lexicon for word in text.split())


df = df.assign(
    PositiveCount=df['text'].apply(lambda text: count_matches(text, pos_words)),
    NegativeCount=df['text'].apply(lambda text: count_matches(text, neg_words)),
)
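
The counting cell above splits each text on whitespace, so a token with attached punctuation (for example `attack;`) does not match its lexicon entry. The sketch below shows one possible alternative that extracts alphabetic tokens with a regular expression. It is an illustration only, not the method used in this notebook: `TOKEN_RE`, `count_lexicon_hits`, `count_whitespace_hits`, and the sample words and sentence are all hypothetical.

```python
import re

# Assumption: a token is a run of letters, optionally with one apostrophe part.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_lexicon_hits(text, lexicon):
    """Count regex-extracted tokens in `text` that appear in `lexicon`."""
    if not isinstance(text, str):  # e.g. a missing value in the text column
        return 0
    return sum(token in lexicon for token in TOKEN_RE.findall(text.lower()))


def count_whitespace_hits(text, lexicon):
    """Whitespace-split counting, shown here only for comparison."""
    return sum(word.lower() in lexicon for word in text.split())


# Hypothetical lexicon entries and sentence, for illustration only.
sample_neg = {"attack", "crisis"}
sentence = "Peace talks stalled after the attack; the crisis deepened."

print(count_whitespace_hits(sentence, sample_neg))  # misses "attack;"
print(count_lexicon_hits(sentence, sample_neg))
```

Switching tokenizers would change the counts, and the `length` column may not follow the same token definition, so scores computed this way would not be directly comparable to the ones in this notebook.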



In [7]:
df.head(4)

| | publication | date | title | length | publicationtype | text | year | month | day | PositiveCount | NegativeCount |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Atlanta Journal-Constitution | 2010-01-03 | Five pressing questions to answer in 2010 | 747 | Newspapers | Will President Barack Obama regain his momentu... | 2010 | 1 | 3 | 39 | 51 |
| 1 | BBC | 2010-01-04 | Saudi foreign minister says Israel "spoiled ch... | 2196 | Transcript | Text of report by Saudi-owned leading pan-Arab... | 2010 | 1 | 4 | 137 | 89 |
| 2 | BBC | 2010-01-08 | Highlights of Iran parliamentary session. | 1123 | Transcript | Excerpt from report on parliamentary proceedin... | 2010 | 1 | 8 | 50 | 38 |
| 3 | Right Vision News | 2010-01-09 | Jordan:Way out for Obama | 852 | Newspaper | Pakistan, Jan. 09 -- These are the worst of ti... | 2010 | 1 | 9 | 51 | 38 |


#### Create article polarity

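Polarity, stored in the `tone` column, is the number of positive words minus the number of negative words, divided by the article's total word count: tone = (PositiveCount - NegativeCount) / length. A value above zero means the article contains more positive than negative words; a value below zero means the reverse.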
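
As a quick check of the formula, the sketch below recomputes polarity in plain Python from the `PositiveCount`, `NegativeCount`, and `length` values shown for the first four articles in the `df.head(4)` output above. `article_tone` is a hypothetical helper name, and the zero-length guard is an added assumption that the notebook itself does not include.

```python
def article_tone(positive_count, negative_count, length):
    """Return (positive - negative) / length, or None when length is zero."""
    if length == 0:
        return None  # assumption: an empty article has no defined tone
    return (positive_count - negative_count) / length


# (PositiveCount, NegativeCount, length) for the first four articles
rows = [(39, 51, 747), (137, 89, 2196), (50, 38, 1123), (51, 38, 852)]

tones = [round(article_tone(p, n, length), 6) for p, n, length in rows]
print(tones)  # [-0.016064, 0.021858, 0.010686, 0.015258]
```

The first article scores below zero because its 51 negative words outnumber its 39 positive ones. Note that the pandas expression in the next cell does not raise on a zero `length`; it yields `inf` or `NaN` for such rows instead.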

In [9]:
df['tone'] = (df['PositiveCount'] - df['NegativeCount'])/df['length']
df.head(4)

| | publication | date | title | length | publicationtype | text | year | month | day | PositiveCount | NegativeCount | tone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Atlanta Journal-Constitution | 2010-01-03 | Five pressing questions to answer in 2010 | 747 | Newspapers | Will President Barack Obama regain his momentu... | 2010 | 1 | 3 | 39 | 51 | -0.016064 |
| 1 | BBC | 2010-01-04 | Saudi foreign minister says Israel "spoiled ch... | 2196 | Transcript | Text of report by Saudi-owned leading pan-Arab... | 2010 | 1 | 4 | 137 | 89 | 0.021858 |
| 2 | BBC | 2010-01-08 | Highlights of Iran parliamentary session. | 1123 | Transcript | Excerpt from report on parliamentary proceedin... | 2010 | 1 | 8 | 50 | 38 | 0.010686 |
| 3 | Right Vision News | 2010-01-09 | Jordan:Way out for Obama | 852 | Newspaper | Pakistan, Jan. 09 -- These are the worst of ti... | 2010 | 1 | 9 | 51 | 38 | 0.015258 |


In [10]:
df.to_csv("Sentiment Data.csv")