# Preprocessing

I am going to clean up the headlines pulled from the API: removing punctuation, non-English words, and stop words, and then vectorizing the headlines as input for the classification models.
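As a rough illustration of the steps listed above, here is a minimal pure-Python sketch of cleaning, stop-word removal, and count vectorization. The `STOP_WORDS` set and the helper names `tokenize` and `count_vectors` are hypothetical stand-ins chosen for this example; the notebook itself imports NLTK's stop words and scikit-learn's `CountVectorizer`/`TfidfVectorizer` for this purpose.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; NLTK and scikit-learn ship much larger ones.
STOP_WORDS = {"a", "an", "the", "of", "for", "to", "in", "is", "it"}

def tokenize(headline):
    """Lowercase, drop punctuation, split on whitespace, remove stop words."""
    words = re.sub(r"[^a-z0-9\s]", "", headline.lower()).split()
    return [w for w in words if w not in STOP_WORDS]

def count_vectors(headlines):
    """Return (vocabulary, matrix); each matrix row holds per-headline term counts."""
    tokenized = [tokenize(h) for h in headlines]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[w] for w in vocabulary])
    return vocabulary, matrix

vocabulary, matrix = count_vectors([
    "None of them were Redditors.",
    "payment gateway for online gaming",
])
print(vocabulary)
print(matrix)
```

Each row of `matrix` is one headline expressed as term counts over the shared vocabulary, which is the general shape of input a classifier expects.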

In [39]:
import pandas as pd
import nltk
import numpy as np
import pickle
import json
import re
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from pprint import pprint
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # the stop_words module is gone in newer scikit-learn releases
from nltk.tokenize import RegexpTokenizer

# Load in the data and create dataframe

In [2]:
with open('../API-data/ps_news_posts') as f:
    news_posts = json.load(f)

In [3]:
news_headlines = [li['title'] for li in news_posts]

df_news = pd.DataFrame(news_headlines, columns=['headlines'])

df_news.drop_duplicates(inplace=True)

In [4]:
df_news['news'] = 0

In [5]:
len(news_headlines)

80099

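I'll be using NLTK's VADER sentiment analyzer to assign positive, neutral, or negative polarity to each headline. For each text it returns `neg`, `neu`, and `pos` scores, which are the proportions of the text that fall into each polarity, plus a `compound` score that summarizes the overall sentiment on a scale from -1 (most negative) to 1 (most positive).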

In [6]:
sia = SIA()
results = []

for line in news_headlines:
    pol_score = sia.polarity_scores(line)
    pol_score['headline'] = line
    results.append(pol_score)

In [7]:
results[:5]

[{'neg': 0.0,
  'neu': 0.802,
  'pos': 0.198,
  'compound': 0.4939,
  'headline': 'Steve Bannon disinvited from New Yorker festival after Jimmy Fallon, Jim Carrey pull out'},
 {'neg': 0.231,
  'neu': 0.769,
  'pos': 0.0,
  'compound': -0.34,
  'headline': "Brazil's National Museum Fire: What It Means for Science"},
 {'neg': 0.0,
  'neu': 1.0,
  'pos': 0.0,
  'compound': 0.0,
  'headline': 'Democrats, Eyeing a Majority, Prepare an Investigative Onslaught'},
 {'neg': 0.0,
  'neu': 1.0,
  'pos': 0.0,
  'compound': 0.0,
  'headline': 'None of them were Redditors.'},
 {'neg': 0.0,
  'neu': 0.816,
  'pos': 0.184,
  'compound': 0.4019,
  'headline': 'Seahawks Owner Gives $100k To Help Republicans Keep Control Of House – Eagle Rising'}]
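
As a quick sanity check on how to read these scores, the sketch below copies three of the records printed above (headline text omitted) and confirms that `neg`, `neu`, and `pos` sum to 1 for each record, while `compound` is a separate value in [-1, 1]. The `scores` list and the rounding tolerance are assumptions made for this example, not part of the original notebook.

```python
import math

# Scores copied from the sample output above (headline text omitted).
scores = [
    {'neg': 0.0, 'neu': 0.802, 'pos': 0.198, 'compound': 0.4939},
    {'neg': 0.231, 'neu': 0.769, 'pos': 0.0, 'compound': -0.34},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
]

for s in scores:
    total = s['neg'] + s['neu'] + s['pos']
    # The three proportions should sum to 1, up to rounding in the printed values.
    assert math.isclose(total, 1.0, abs_tol=0.002)
    # The compound score is reported separately and stays within [-1, 1].
    assert -1.0 <= s['compound'] <= 1.0

proportion_totals = [round(s['neg'] + s['neu'] + s['pos'], 3) for s in scores]
print(proportion_totals)
```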

In [8]:
df = pd.DataFrame.from_records(results)
df.shape

(80099, 5)

In [9]:
df_news.shape

(73806, 2)

I will be assigning the classification labels in the `news` column: r/news will be 0 and r/upliftingnews will be 1.

In [10]:
df['news'] = 0

In [11]:
df.head()

| | compound | headline | neg | neu | pos | news |
|---|---|---|---|---|---|---|
| 0 | 0.4939 | Steve Bannon disinvited from New Yorker festiv... | 0.0 | 0.802 | 0.198 | 0 |
| 1 | -0.34 | Brazil's National Museum Fire: What It Means f... | 0.231 | 0.769 | 0.0 | 0 |
| 2 | 0.0 | Democrats, Eyeing a Majority, Prepare an Inves... | 0.0 | 1.0 | 0.0 | 0 |
| 3 | 0.0 | None of them were Redditors. | 0.0 | 1.0 | 0.0 | 0 |
| 4 | 0.4019 | Seahawks Owner Gives $100k To Help Republicans... | 0.0 | 0.816 | 0.184 | 0 |


I use regular expressions to remove URLs, subreddit references (included originally for comments, which ended up not being used), and non-alphanumeric characters.

In [12]:
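# Strip URLs and subreddit references before punctuation: once '/' is removed,
# the '/r/News' and '/r/Upliftingnews' patterns can no longer match.
df.headline = df.headline.map(lambda x: re.sub(r'http\S*', ' ', x))
df.headline = df.headline.map(lambda x: re.sub(r'/r/News', ' ', x))
df.headline = df.headline.map(lambda x: re.sub(r'/r/Upliftingnews', ' ', x))
df.headline = df.headline.map(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))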
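
The order of these substitutions matters, because the subreddit patterns depend on the `/` character that the non-alphanumeric pass deletes. The sketch below compares the two orderings on a made-up headline. The helper names `clean_headline` and `clean_headline_punctuation_first`, the sample string, and the extra whitespace-collapsing step are assumptions for illustration only.

```python
import re

URL_RE = re.compile(r'http\S*')
SUBREDDIT_RE = re.compile(r'/r/(?:News|Upliftingnews)')
NON_ALNUM_RE = re.compile(r'[^a-zA-Z0-9\s]')
WHITESPACE_RE = re.compile(r'\s+')

def clean_headline(text):
    """Remove URLs and subreddit references first, then punctuation."""
    text = URL_RE.sub(' ', text)
    text = SUBREDDIT_RE.sub(' ', text)
    text = NON_ALNUM_RE.sub('', text)
    return WHITESPACE_RE.sub(' ', text).strip()

def clean_headline_punctuation_first(text):
    """Same substitutions, but punctuation is stripped before the other patterns."""
    text = NON_ALNUM_RE.sub('', text)
    text = SUBREDDIT_RE.sub(' ', text)  # never matches: the slashes are already gone
    text = URL_RE.sub(' ', text)
    return WHITESPACE_RE.sub(' ', text).strip()

# Made-up headline for illustration.
sample = "Brazil's museum fire, via /r/News: https://example.com/story?id=1"

cleaned = clean_headline(sample)
naive = clean_headline_punctuation_first(sample)
print(cleaned)  # Brazils museum fire via
print(naive)    # Brazils museum fire via rNews
```

With punctuation stripped first, the subreddit reference survives as the stray token `rNews`, which would then end up in the vectorized vocabulary.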

In [13]:
df['label'] = 0
df.loc[df['compound'] > 0.2, 'label'] = 1
df.loc[df['compound'] < -0.2, 'label'] = -1
df.head(10)

| | compound | headline | neg | neu | pos | news | label |
|---|---|---|---|---|---|---|---|
| 0 | 0.4939 | Steve Bannon disinvited from New Yorker festiv... | 0.0 | 0.802 | 0.198 | 0 | 1 |
| 1 | -0.34 | Brazils National Museum Fire What It Means for... | 0.231 | 0.769 | 0.0 | 0 | -1 |
| 2 | 0.0 | Democrats Eyeing a Majority Prepare an Investi... | 0.0 | 1.0 | 0.0 | 0 | 0 |
| 3 | 0.0 | None of them were Redditors | 0.0 | 1.0 | 0.0 | 0 | 0 |
| 4 | 0.4019 | Seahawks Owner Gives 100k To Help Republicans ... | 0.0 | 0.816 | 0.184 | 0 | 1 |
| 5 | 0.0 | payment gateway for online gaming | 0.0 | 1.0 | 0.0 | 0 | 0 |
| 6 | 0.0 | rokambola Google celebrar el Orgullo Gay con ... | 0.0 | 1.0 | 0.0 | 0 | 0 |
| 7 | -0.6486 | A 20yearold Instagram star is dead after being... | 0.264 | 0.736 | 0.0 | 0 | -1 |
| 8 | -0.6124 | Dozens Arrested in Marriott Worker Protests in... | 0.417 | 0.583 | 0.0 | 0 | -1 |
| 9 | -0.7579 | Scallop Wars Brexiteer fishermen attacks Frenc... | 0.419 | 0.581 | 0.0 | 0 | -1 |


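To reduce noise and widen the neutral band, I bucket the compound scores into three categories: scores above 0.2 are labeled 1 (positive), scores below -0.2 are labeled -1 (negative), and everything in between is labeled 0 (neutral). The label column can now also serve as my target for testing.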
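
The thresholding used in the cell above (strictly greater than 0.2, strictly less than -0.2) can be sketched as a small standalone function. The name `label_from_compound` and its `threshold` parameter are hypothetical; the notebook applies the same rule with `df.loc`.

```python
def label_from_compound(compound, threshold=0.2):
    """Map a VADER compound score to 1 (positive), -1 (negative), or 0 (neutral)."""
    if compound > threshold:
        return 1
    if compound < -threshold:
        return -1
    return 0

# Compound scores taken from rows of the dataframe shown above.
sample_scores = [0.4939, -0.34, 0.0, 0.4019, -0.6486]
labels = [label_from_compound(c) for c in sample_scores]
print(labels)  # [1, -1, 0, 1, -1]
```

Because both comparisons are strict, a score of exactly 0.2 or -0.2 stays neutral. With pandas, the same function could be applied as `df['compound'].map(label_from_compound)`.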

The breakdown of the sentiment labels shows that there are only about 2,300 more negative posts than positive ones, and that most posts (roughly 55%) were scored as neutral. This makes sense on paper, as news posts should typically strive for neutrality. However, it does not support the hypothesis that I am working on, and looking at the example dataframe above, it is evident that certain posts are not being assigned the right sentiment.

In [14]:
counts = df.label.value_counts()
print(counts)

 0    44066
-1    19150
 1    16883
Name: label, dtype: int64
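
To put these counts in proportion, the short sketch below recomputes the class shares from the numbers printed above. The `counts` dictionary is copied by hand from that output; the variable names are illustrative.

```python
# Label counts copied from the value_counts() output above.
counts = {0: 44066, -1: 19150, 1: 16883}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
negative_surplus = counts[-1] - counts[1]
non_neutral = counts[-1] + counts[1]

print(total)             # 80099
print(shares)            # {0: 55.0, -1: 23.9, 1: 21.1}
print(negative_surplus)  # 2267
print(non_neutral)       # 36033
```

The non-neutral total of 36033 matches the row count reported once the neutral rows are filtered out.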


In [23]:
df = df[df.label != 0]
df.shape

(36033, 9)

### Save Dataframe

Save the new dataframe to a CSV file and move on to do the same with the next subreddit.

In [24]:
df.to_csv('news_posts_SA')