### Aim:

Perform sentiment analysis on the news headlines.

In [1]:
from IPython import display
import math
from pprint import pprint
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid', context='talk', palette='Dark2')

In [2]:
#read csv file
data = pd.read_csv('data/Eluvio_DS_Challenge.csv')
data[:10]

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,category
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar,worldnews
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,False,polar,worldnews
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,False,polar,worldnews
3,1201233290,2008-01-25,1,0,Jump-start economy: Give health care to all,False,fadi420,worldnews
4,1201274720,2008-01-25,4,0,Council of Europe bashes EU&UN terror blacklist,False,mhermans,worldnews
5,1201287889,2008-01-25,15,0,Hay presto! Farmer unveils the illegal mock-...,False,Armagedonovich,worldnews
6,1201289438,2008-01-25,5,0,"Strikes, Protests and Gridlock at the Poland-U...",False,Clythos,worldnews
7,1201536662,2008-01-28,0,0,The U.N. Mismanagement Program,False,Moldavite,worldnews
8,1201558396,2008-01-28,4,0,Nicolas Sarkozy threatens to sue Ryanair,False,Moldavite,worldnews
9,1201635869,2008-01-29,3,0,US plans for missile shields in Polish town me...,False,JoeyRamone63,worldnews


In [3]:
# pd.read_csv already returns a DataFrame, so this only binds it to a new name
dataDF = pd.DataFrame(data)
dataDF

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,category
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar,worldnews
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,False,polar,worldnews
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,False,polar,worldnews
3,1201233290,2008-01-25,1,0,Jump-start economy: Give health care to all,False,fadi420,worldnews
4,1201274720,2008-01-25,4,0,Council of Europe bashes EU&UN terror blacklist,False,mhermans,worldnews
5,1201287889,2008-01-25,15,0,Hay presto! Farmer unveils the illegal mock-...,False,Armagedonovich,worldnews
6,1201289438,2008-01-25,5,0,"Strikes, Protests and Gridlock at the Poland-U...",False,Clythos,worldnews
7,1201536662,2008-01-28,0,0,The U.N. Mismanagement Program,False,Moldavite,worldnews
8,1201558396,2008-01-28,4,0,Nicolas Sarkozy threatens to sue Ryanair,False,Moldavite,worldnews
9,1201635869,2008-01-29,3,0,US plans for missile shields in Polish town me...,False,JoeyRamone63,worldnews


### EDA

Let's check the categories of data provided to us.

In [4]:
categories = dataDF.groupby('category').size()
categories

category
worldnews    509236
dtype: int64

Every one of the 509,236 rows belongs to the single category `worldnews`, so the entire dataset consists of world news.

The dataset has no sentiment labels, so before starting the sentiment analysis we need to label each title as positive, negative, or neutral.

In [5]:
#create a new data frame with just headlines

headlinesDF = dataDF['title']
headlinesDF = pd.DataFrame(headlinesDF)
headlinesDF = headlinesDF.rename(columns={'title': 'headlines'})
headlinesDF

Unnamed: 0,headlines
0,Scores killed in Pakistan clashes
1,Japan resumes refuelling mission
2,US presses Egypt on Gaza border
3,Jump-start economy: Give health care to all
4,Council of Europe bashes EU&UN terror blacklist
5,Hay presto! Farmer unveils the illegal mock-...
6,"Strikes, Protests and Gridlock at the Poland-U..."
7,The U.N. Mismanagement Program
8,Nicolas Sarkozy threatens to sue Ryanair
9,US plans for missile shields in Polish town me...


In [6]:
# Importing TextBlob
from textblob import TextBlob

In [7]:
sentence = 'Scores killed in Pakistan clashes'
# Create a TextBlob object and read the polarity score from its sentiment property
analysis = TextBlob(sentence).sentiment[0]  # index 0 is polarity, index 1 is subjectivity
print(analysis)

-0.2
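
The index `[0]` used here and the attribute `.polarity` used in the next cell refer to the same value, because TextBlob returns the sentiment as a named tuple of polarity (a float in [-1.0, 1.0]) and subjectivity (a float in [0.0, 1.0]). The sketch below mimics that shape with a stand-in `Sentiment` tuple built from the standard library, so it runs without TextBlob installed. The subjectivity value is invented for illustration; only the polarity of -0.2 comes from the output above.

```python
from collections import namedtuple

# Stand-in for the tuple TextBlob returns. The field names match TextBlob's,
# but the subjectivity value below is an assumption made up for this example.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])

example = Sentiment(polarity=-0.2, subjectivity=0.0)

# Positional access and attribute access reach the same field.
by_index = example[0]
by_name = example.polarity
print(by_index == by_name)

# A named tuple can also be unpacked like any other tuple.
polarity, subjectivity = example
```

Attribute access is usually the more readable choice, since it does not rely on remembering the field order.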


In [8]:
# Negative values indicate negative sentiment, positive values indicate positive sentiment
def sentimentAnalysis(row):
    # row is a single headline string; return its TextBlob polarity score
    return TextBlob(row).sentiment.polarity

In [None]:
headlinesDF['polarity'] = headlinesDF["headlines"].apply(sentimentAnalysis)

In [None]:
headlinesDF

In [None]:
# Map the continuous polarity score to a discrete label:
#  0 : neutral
# -1 : negative
#  1 : positive

def polarChange(row):
    # row is a single polarity score
    if row < 0:
        return -1
    if row > 0:
        return 1
    return 0
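
To see what this mapping produces, here is a small sketch that applies the same sign rule to a handful of polarity scores. The helper name `to_label` and the list `sample_scores` are hypothetical and not part of the notebook; apart from -0.2, which is the score computed earlier, the values are made up.

```python
def to_label(score):
    """Collapse a polarity score in [-1, 1] to -1, 0, or 1 by its sign."""
    if score < 0:
        return -1
    if score > 0:
        return 1
    return 0

# Illustrative scores only; -0.2 is the polarity printed for the first headline
sample_scores = [-0.2, 0.0, 0.35, -0.75, 0.0]
labels = [to_label(s) for s in sample_scores]
print(labels)  # [-1, 0, 1, -1, 0]
```

On the full column, `np.sign` offers a vectorized alternative, though it returns floats and propagates NaN rather than mapping it to 0.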


In [None]:
headlinesDF['polarVal'] = headlinesDF["polarity"].apply(polarChange)
headlinesDF

In [None]:
# Now that we have the simplified polarity, let's check the polarity distribution
plot = headlinesDF.groupby('polarVal').count()['polarity'].sort_values().plot(kind= 'bar', title = 'Polarity distribution', figsize =(7,6))
plt.show()

The dataset contains more neutral headlines than positive or negative ones. A model trained on such imbalanced classes may become biased towards the majority class. Therefore, let's resample the data to equalize the class sizes.
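
The notebook goes on to import SMOTE for this step. As a simpler stand-in that shows what equalizing the class sizes means, the sketch below randomly undersamples every class down to the size of the smallest one, using only the standard library. The function name `undersample` and the toy data are hypothetical. This is not SMOTE, which instead synthesizes new minority-class points and needs numeric features rather than raw text.

```python
import random
from collections import Counter, defaultdict

def undersample(items, labels, seed=0):
    """Randomly drop items so every class keeps as many as the smallest class."""
    rng = random.Random(seed)

    # Group the items by their class label
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)

    # Every class is cut down to the size of the smallest class
    target = min(len(group) for group in by_class.values())

    balanced = []
    for label, group in by_class.items():
        for item in rng.sample(group, target):
            balanced.append((item, label))
    return balanced

# Toy data in which the neutral class (0) dominates
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, -1, -1]
toy_items = [f"headline {i}" for i in range(len(toy_labels))]

balanced = undersample(toy_items, toy_labels)
counts = Counter(label for _, label in balanced)
print(counts)
```

Undersampling discards data from the larger classes, which is the trade-off that oversampling methods such as SMOTE try to avoid.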

In [None]:
import imblearn
from imblearn.over_sampling import SMOTE

In [None]:
#assign polarVal as y

y = headlinesDF['polarVal']
yDF = pd.DataFrame(y)
yDF = yDF.rename(columns={'polarVal': 'y'})
yDF

In [None]:
#assign X as 