# World News NLP Project: Notebook 1
## Data Imports, Cleaning & EDA
#### Adam Zucker

---

## Problem

Using the industry-standard NLP libraries [spaCy](https://spacy.io/), [NLTK](https://www.nltk.org/), and [scikit-learn](https://scikit-learn.org/stable/), this study examines which key words in a post title are most strongly associated with positive user engagement. The exploratory data analysis and visualizations in this notebook also factor in other features of the supplied data, including author, post time, and date. For the purposes of this study, positive user engagement is measured in upvotes.

---

## Notebook Contents

1. Library and Data Imports
2. Basic Exploratory Data Analysis
3. Feature Engineering
4. Data Visualizations
5. NLP

---

## Data

- __*world_news_posts.csv*:__ Supplied dataset with roughly 500,000 post titles from a "world news" message board, including the date, time, and author of each post, along with user interaction (upvotes and downvotes).

---

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import spacy
from spacy import displacy

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.tokenize import RegexpTokenizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

from geotext import GeoText

import time
from datetime import datetime

In [2]:
# Reading in data
df = pd.read_csv('../data/world_news_posts.csv')

In [3]:
df.head()

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,category
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar,worldnews
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,False,polar,worldnews
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,False,polar,worldnews
3,1201233290,2008-01-25,1,0,Jump-start economy: Give health care to all,False,fadi420,worldnews
4,1201274720,2008-01-25,4,0,Council of Europe bashes EU&UN terror blacklist,False,mhermans,worldnews


---

## EDA

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 509236 entries, 0 to 509235
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   time_created  509236 non-null  int64 
 1   date_created  509236 non-null  object
 2   up_votes      509236 non-null  int64 
 3   down_votes    509236 non-null  int64 
 4   title         509236 non-null  object
 5   over_18       509236 non-null  bool  
 6   author        509236 non-null  object
 7   category      509236 non-null  object
dtypes: bool(1), int64(3), object(4)
memory usage: 27.7+ MB


In [5]:
# Checking for nulls in the dataframe - none detected
df.isnull().sum()

time_created    0
date_created    0
up_votes        0
down_votes      0
title           0
over_18         0
author          0
category        0
dtype: int64

In [6]:
# Posts appear on 3223 distinct days between 1/25/08 and 11/22/16
print(f"Number of days represented in dataframe: {len(df['date_created'].unique())}")
print(f"Data date range is from {min(df['date_created'])} to {max(df['date_created'])}")

Number of days represented in dataframe: 3223
Data date range is from 2008-01-25 to 2016-11-22
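The count of distinct days and the first and last dates answer slightly different questions, since some calendar days inside the range may have no posts at all. A minimal sketch of how such gaps could be listed, using a hypothetical helper `missing_days` and a toy list of dates in the same `YYYY-MM-DD` format as `date_created`:

```python
import pandas as pd


def missing_days(dates):
    """Return the calendar days between the first and last date that have no entries."""
    days = pd.to_datetime(pd.Series(dates)).dt.normalize()
    full_range = pd.date_range(days.min(), days.max(), freq="D")
    return full_range.difference(pd.DatetimeIndex(days.unique()))


# Toy dates for illustration only
sample_dates = ["2008-01-25", "2008-01-25", "2008-01-26", "2008-01-29"]
gaps = missing_days(sample_dates)
print(f"{len(gaps)} calendar days without posts: {list(gaps.strftime('%Y-%m-%d'))}")
```

Run against the raw `date_created` column (before it is dropped), this would show whether any days in the range are absent.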


In [7]:
# Defining a function to concisely process this dataframe and others in the same format
def process_data(df):
    
    # Redefining the 'time_created' column to hold datetime, converted from unix timestamp format
    # Note: datetime.fromtimestamp() returns local time for the machine running the notebook,
    # so the converted values can fall on a different day than the original 'date_created' column
    df['time_created'] = [datetime.fromtimestamp(ts) for ts in df['time_created']]
    # Dropping 'date_created' because of redundancy
    df.drop(columns='date_created', inplace=True)
    
    # Creating features to hold the post length in characters and in whitespace-separated tokens
    df['post_length_chars'] = df['title'].apply(len)
    df['post_length_tokens'] = df['title'].str.split().apply(len)
    
#     # Generating features to hold total author posts and total author upvotes alongside each post
#     df['author_posts'] = df.groupby('author')['author'].transform('count')
#     df['author_upvotes'] = df.groupby('author')['up_votes'].transform('sum')
    
    # Generating a feature to hold day of the week and dummifying
    # drop_first=True drops the alphabetically first day (Friday), which becomes the all-zeros baseline
    df['weekday'] = df['time_created'].dt.day_name()
    day_dummies = pd.get_dummies(df['weekday'], drop_first=True)
    df = pd.concat([df, day_dummies], axis=1)
    df.drop(columns='weekday', inplace=True)
    
    # Dropping 'category' feature if only one category is present
    if df['category'].nunique() == 1:
        df.drop(columns='category', inplace=True)
    # Similarly dropping down votes if there are none reported
    if df['down_votes'].sum() == 0:
        df.drop(columns='down_votes', inplace=True)
    
    # Binarizing 'over_18' feature
    df['over_18'] = df['over_18'].map({False: 0, True: 1})
    
    return df
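The first rows of the raw data carry a `date_created` of 2008-01-25, which is the UTC date of the unix timestamp. A timezone-independent conversion is sketched below as an alternative to `datetime.fromtimestamp(ts)`; it is not what `process_data` does, and the variable names are illustrative. The timestamp is the `time_created` value of the first row shown earlier.

```python
from datetime import datetime, timezone

import pandas as pd

ts = 1201232046  # 'time_created' of the first row in the dataset

# Depends on the timezone of the machine running the code
local_naive = datetime.fromtimestamp(ts)

# Explicitly UTC, so the result is the same on every machine
utc_aware = datetime.fromtimestamp(ts, tz=timezone.utc)

# Vectorized equivalent for a whole column of unix timestamps
utc_column = pd.to_datetime(pd.Series([ts]), unit="s", utc=True)

print("Local:", local_naive)
print("UTC:  ", utc_aware)
print("UTC weekday:", utc_column.dt.day_name()[0])
```

Because the weekday dummies are derived from `time_created`, the choice between local time and UTC can shift a post from one weekday column to another.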

In [8]:
# df['up_votes'].groupby(df['author']).sum()
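The per-author totals explored in the commented-out lines could be attached to every post with `groupby(...).transform(...)`, which returns one value per row rather than one per group. This is a minimal sketch on toy data; the helper name `add_author_totals` is hypothetical, and the column names `author_posts` and `author_upvotes` follow the commented-out code.

```python
import pandas as pd


def add_author_totals(df):
    """Return a copy of df with each author's post count and total upvotes on every row."""
    out = df.copy()
    votes_by_author = out.groupby("author")["up_votes"]
    out["author_posts"] = votes_by_author.transform("count")
    out["author_upvotes"] = votes_by_author.transform("sum")
    return out


# Toy rows for illustration only
toy = pd.DataFrame({
    "author": ["polar", "polar", "fadi420"],
    "up_votes": [3, 2, 1],
})
print(add_author_totals(toy))
```

Note that a post's own upvotes are included in its author's total, which matters if these columns are later used to predict `up_votes`.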

In [9]:
# Reordering columns for clarity
# df = df[['author', 'title', 'up_votes', 'over_18', 'post_length_chars', 'time_created', 
#          'weekday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday']]

In [10]:
df = process_data(df)

In [11]:
df.head(3)

Unnamed: 0,time_created,up_votes,title,over_18,author,post_length_chars,post_length_tokens,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
0,2008-01-24 22:34:06,3,Scores killed in Pakistan clashes,0,polar,33,5,0,0,0,1,0,0
1,2008-01-24 22:34:35,2,Japan resumes refuelling mission,0,polar,32,4,0,0,0,1,0,0
2,2008-01-24 22:42:03,3,US presses Egypt on Gaza border,0,polar,31,6,0,0,0,1,0,0


In [12]:
# # Converting 'date_created' to datetime
# df['date_created'] = pd.to_datetime(df['date_created'])

In [13]:
df.dtypes

time_created          datetime64[ns]
up_votes                       int64
title                         object
over_18                        int64
author                        object
post_length_chars              int64
post_length_tokens             int64
Monday                         uint8
Saturday                       uint8
Sunday                         uint8
Thursday                       uint8
Tuesday                        uint8
Wednesday                      uint8
dtype: object

In [14]:
# # All posts are classified as 'worldnews' - with just a single class represented, this feature becomes unnecessary
# df['category'].value_counts()

In [15]:
# # Dropping 'category' feature
# df.drop(columns='category', inplace=True)

---

In [16]:
# Summary stats for upvotes
df['up_votes'].describe()

count    509236.000000
mean        112.236283
std         541.694675
min           0.000000
25%           1.000000
50%           5.000000
75%          16.000000
max       21253.000000
Name: up_votes, dtype: float64
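The summary above shows a mean (about 112) far above the median (5), which points to a strongly right-skewed distribution. One common way to handle such a target, offered here as an assumption rather than a step this notebook takes, is a `log1p` transform. The values below are toy numbers chosen to echo the quartiles and maximum above.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only
up_votes = pd.Series([0, 1, 5, 16, 21253])

# log1p computes log(1 + x), so posts with zero upvotes map to 0 instead of -inf
log_up_votes = np.log1p(up_votes)

# expm1 is the exact inverse, useful for converting values back to raw upvotes
restored = np.expm1(log_up_votes)

print("Skew before:", up_votes.skew())
print("Skew after: ", log_up_votes.skew())
```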

In [17]:
# Looking at titles of most upvoted posts
df['up_votes'].groupby(df['title']).sum().sort_values(ascending=False)[0:50].to_frame()

title,up_votes
"A biotech startup has managed to 3-D print fake rhino horns that carry the same genetic fingerprint as the actual horn. The company plans to flood Chinese rhino horn market at one-eighth of the price of the original, undercutting the price poachers can get and forcing them out eventually.",21253
"Twitter has forced 30 websites that archive politician s deleted tweets to shut down, removing an effective tool to keep politicians honest",13435
"2.6 terabyte leak of Panamanian shell company data reveals how a global industry led by major banks, legal firms, and asset management companies secretly manages the estates of politicians, Fifa officials, fraudsters and drug smugglers, celebrities and professional athletes.",13244
"The police officer who leaked the footage of the surfers paradise police brutality, where the victims blood was washed away by officers, has been criminally charged for bringing it to the publics view. Officers who did the bashing get nothing.",12333
Paris shooting survivor suing French media for giving away his location while he hid from shooters,11288
Hundreds of thousands of leaked emails reveal massively widespread corruption in global oil industry,11108
Brazil s Supreme Court has banned corporate contributions to political campaigns and parties,10922
"ISIS beheads 81-year-old pioneer archaeologist and foremost scholar on ancient Syria. Held captive for 1 month, he refused to tell ISIS the location of the treasures of Palmyra unto death.",10515
"Feeding cows seaweed could slash global greenhouse gas emissions, researchers say: They discovered adding a small amount of dried seaweed to a cow s diet can reduce the amount of methane a cow produces by up to 99 per cent.",10394
Brazilian radio host famous for exposing corruption in his city murdered while broadcasting live on the air by two gunmen.,10377


In [18]:
df.sort_values('up_votes', ascending=False)[0:50]

Unnamed: 0,time_created,up_votes,title,over_18,author,post_length_chars,post_length_tokens,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
377200,2015-06-20 12:41:11,21253,A biotech startup has managed to 3-D print fak...,0,KRISHNA53,289,49,0,1,0,0,0,0
391415,2015-08-24 08:57:59,13435,Twitter has forced 30 websites that archive po...,0,joeyoungblood,139,22,1,0,0,0,0,0
450818,2016-04-03 14:01:46,13244,2.6 terabyte leak of Panamanian shell company ...,0,mister_geaux,277,39,0,0,1,0,0,0
391318,2015-08-23 18:09:28,12333,The police officer who leaked the footage of t...,0,navysealassulter,243,40,0,0,1,0,0,0
390252,2015-08-18 19:06:08,11288,Paris shooting survivor suing French media for...,0,seapiglet,98,16,0,0,0,0,1,0
449809,2016-03-30 07:19:33,11108,Hundreds of thousands of leaked emails reveal ...,0,Xiroth,100,14,0,0,0,0,0,1
397215,2015-09-17 20:14:48,10922,Brazil s Supreme Court has banned corporate co...,0,DoremusJessup,92,13,0,0,0,1,0,0
390494,2015-08-19 20:30:33,10515,ISIS beheads 81-year-old pioneer archaeologist...,0,DawgsOnTopUGA,188,30,0,0,0,0,0,1
500786,2016-10-19 08:47:15,10394,Feeding cows seaweed could slash global greenh...,0,mvea,225,40,0,0,0,0,0,1
388230,2015-08-07 11:58:55,10377,Brazilian radio host famous for exposing corru...,0,fiffers,122,20,0,0,0,0,0,0


In [19]:
df.head()

Unnamed: 0,time_created,up_votes,title,over_18,author,post_length_chars,post_length_tokens,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
0,2008-01-24 22:34:06,3,Scores killed in Pakistan clashes,0,polar,33,5,0,0,0,1,0,0
1,2008-01-24 22:34:35,2,Japan resumes refuelling mission,0,polar,32,4,0,0,0,1,0,0
2,2008-01-24 22:42:03,3,US presses Egypt on Gaza border,0,polar,31,6,0,0,0,1,0,0
3,2008-01-24 22:54:50,1,Jump-start economy: Give health care to all,0,fadi420,44,7,0,0,0,1,0,0
4,2008-01-25 10:25:20,4,Council of Europe bashes EU&UN terror blacklist,0,mhermans,47,7,0,0,0,0,0,0


---

In [20]:
print(f"Number of unique authors: {len(df['author'].unique())}")
print('-----')
print(f"Top 20 contributors by post count: \n{df['author'].value_counts()[0:20]}")
print('-----')
print(f"Top 20 contributors by upvotes: \n{df['up_votes'].groupby(df['author']).sum().sort_values(ascending=False)[0:20]}")

Number of unique authors: 85838
-----
Top 20 contributors by post count: 
davidreiss666         8897
anutensil             5730
DoremusJessup         5037
maxwellhill           4023
igeldard              4013
readerseven           3170
twolf1                2923
madam1                2658
nimobo                2564
madazzahatter         2503
ionised               2493
NinjaDiscoJesus       2448
bridgesfreezefirst    2405
SolInvictus           2181
Libertatea            2108
vigorous              2077
galt1776              1897
DougBolivar           1770
bob21doh              1698
trot-trot             1649
Name: author, dtype: int64
-----
Top 20 contributors by upvotes: 
author
maxwellhill         1985416
anutensil           1531544
Libertatea           832102
DoremusJessup        584380
Wagamaga             580121
NinjaDiscoJesus      492582
madazzahatter        428966
madam1               390541
davidreiss666        338306
kulkke               333311
pnewell              297270
nimob

---

In [21]:
# Looking at distribution of 'over_18' posts by number and percentage
print(df['over_18'].value_counts())
print(df['over_18'].value_counts(normalize=True))

0    508916
1       320
Name: over_18, dtype: int64
0    0.999372
1    0.000628
Name: over_18, dtype: float64


In [22]:
# Checking title content of some of the posts classified as "over_18"
df[df['over_18'] == 1]

Unnamed: 0,time_created,up_votes,title,over_18,author,post_length_chars,post_length_tokens,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
1885,2008-03-24 13:57:18,189,Pics from the Tibetan protests - more graphic ...,1,pressed,76,12,1,0,0,0,0,0
6721,2008-05-18 15:25:18,5,"MI5 linked to Max Mosley’s Nazi-style, sadomas...",1,alllie,188,31,0,0,1,0,0,0
8414,2008-06-05 15:42:05,0,Tabloid Horrifies Germany: Poland s Yellow Pre...,1,stesch,135,25,0,0,0,1,0,0
12163,2008-07-21 16:26:56,0,Love Parade Dortmund: Techno Festival Breaks R...,1,stesch,101,16,1,0,0,0,0,0
12699,2008-07-29 21:29:40,5,IDF kills young Palestinian boy. Potentially N...,1,cup,50,7,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
503776,2016-10-31 00:59:26,4,Latest Italian Earthquake Devastates Medieval ...,1,pixelinthe,51,6,1,0,0,0,0,0
508067,2016-11-17 11:30:29,12,ISIS Release Video Showing Melbourne As A Poss...,1,halacska,57,9,0,0,0,1,0,0
508176,2016-11-17 21:04:41,0,Animal welfare activists have released footage...,1,NinjaDiscoJesus,137,24,0,0,0,1,0,0
508376,2016-11-18 13:14:35,6,Jungle Justice : Public lynching of a street ...,1,avivi_,60,11,0,0,0,0,0,0


In [23]:
nsfw = df[df['over_18'] == 1]
nsfw.sort_values(by='up_votes', ascending=False)[0:25]

Unnamed: 0,time_created,up_votes,title,over_18,author,post_length_chars,post_length_tokens,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
500590,2016-10-18 12:08:56,7941,"Judge presiding over El Chapo s case shot, k...",1,IsleCook,78,13,0,0,0,0,1,0
494536,2016-09-25 08:05:14,6322,[NSFL] Australian child molester Peter Scully ...,1,ExWhySaid,238,39,0,0,1,0,0,0
428689,2016-01-07 06:48:09,5878,Armed suspect shot dead after trying to storm ...,1,rawmas02,66,11,0,0,0,1,0,0
462067,2016-05-17 06:17:06,5617,Syria Army killed over 200 ISIS militants in 3...,1,orangeflower2015,79,14,0,0,0,0,1,0
303900,2014-09-05 14:45:33,5507,Man escapes ISIS execution,1,brothamo,26,4,0,0,0,0,0,0
461255,2016-05-13 10:34:54,4839,ISIS massacre 14 Real Madrid fans at supporter...,1,PeterG92,63,11,0,0,0,0,0,0
376435,2015-06-16 20:31:08,4209,The fight is on to stop an annual Chinese even...,1,ShakoWasAngry,157,30,0,0,0,0,1,0
269963,2014-04-21 18:47:40,3831,China: “Violent Government Thugs” Beaten To De...,1,helpmesleep666,117,18,1,0,0,0,0,0
431221,2016-01-18 00:39:45,3823,ISIS commits largest massacre since Syrian con...,1,AllenDono,108,18,1,0,0,0,0,0
246618,2014-01-23 10:02:27,3738,Video of riot police stripping detained protes...,1,_skylark,172,28,0,0,0,1,0,0


---

## Feature Engineering

In [24]:
# df['post_length'] = ''
# c=0
# for t in df['title']:
#     df['post_length'][c] = len(t)
#     c+=1

---

## Data Visualizations

## NLP

In [25]:
nlp = spacy.load('en_core_web_sm')

In [26]:
test_title = df['title'][0]

In [27]:
df['title'][0]

'Scores killed in Pakistan clashes'

In [28]:
doc = nlp(test_title)

print([chunk.text for chunk in doc.noun_chunks])

['Scores', 'Pakistan clashes']


In [29]:
len(df.index)

509236

In [30]:
# Creating empty columns to hold NLP output
df['noun_chunks'] = ''
df['verbs'] = ''
df['entities'] = ''
df['entity_labels'] = ''

In [31]:
df

Unnamed: 0,time_created,up_votes,title,over_18,author,post_length_chars,post_length_tokens,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday,noun_chunks,verbs,entities,entity_labels
0,2008-01-24 22:34:06,3,Scores killed in Pakistan clashes,0,polar,33,5,0,0,0,1,0,0,,,,
1,2008-01-24 22:34:35,2,Japan resumes refuelling mission,0,polar,32,4,0,0,0,1,0,0,,,,
2,2008-01-24 22:42:03,3,US presses Egypt on Gaza border,0,polar,31,6,0,0,0,1,0,0,,,,
3,2008-01-24 22:54:50,1,Jump-start economy: Give health care to all,0,fadi420,44,7,0,0,0,1,0,0,,,,
4,2008-01-25 10:25:20,4,Council of Europe bashes EU&UN terror blacklist,0,mhermans,47,7,0,0,0,0,0,0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
509231,2016-11-22 07:12:44,5,Heil Trump : Donald Trump s alt-right white...,0,nonamenoglory,88,13,0,0,0,0,1,0,,,,
509232,2016-11-22 07:12:52,1,There are people speculating that this could b...,0,SummerRay,67,10,0,0,0,0,1,0,,,,
509233,2016-11-22 07:17:36,1,Professor receives Arab Researchers Award,0,AUSharjah,41,5,0,0,0,0,1,0,,,,
509234,2016-11-22 07:19:17,1,Nigel Farage attacks response to Trump ambassa...,0,smilyflower,55,8,0,0,0,0,1,0,,,,


In [32]:
# Instantiating spacy NLP
nlp = spacy.load('en_core_web_sm')

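# Defining a new function to segment post titles into component pieces and insert into original dataframe
# Note: the chained assignments below (df[col][i] = ...) trigger pandas' SettingWithCopyWarning,
# and calling nlp() once per title is slow on a dataframe of this size
def title_deconstruct(df):
    for i in range(len(df)):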
        title = df['title'][i]
        doc = nlp(title)
        df['noun_chunks'][i] = [noun_chunk.text for noun_chunk in doc.noun_chunks]
        df['verbs'][i] = [verb.lemma_ for verb in doc if verb.pos_ == "VERB"]
        df['entities'][i] = [entity.text for entity in doc.ents]
        df['entity_labels'][i] = [entity.label_ for entity in doc.ents]
    return df

In [33]:
title_deconstruct(df)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['noun_chunks'][i] = [noun_chunk.text for noun_chunk in doc.noun_chunks]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['verbs'][i] = [verb.lemma_ for verb in doc if verb.pos_ == "VERB"]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['entities'][i] = [entity.text for entity in doc.ents]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

KeyboardInterrupt: 
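The run above was interrupted before the row-by-row loop finished. A possible alternative is sketched here, assuming the same four outputs are wanted: stream the titles through spaCy's `nlp.pipe`, collect the results in plain Python, and build the columns in one step so no chained indexing is involved. The function name `deconstruct_titles` and the `batch_size` default are hypothetical choices, not tuned values.

```python
import pandas as pd

NLP_COLUMNS = ["noun_chunks", "verbs", "entities", "entity_labels"]


def deconstruct_titles(titles, nlp, batch_size=1000):
    """Return a DataFrame of NLP features, indexed like the input Series of titles.

    `nlp` is expected to be a loaded spaCy pipeline with a tagger, parser and NER,
    for example the object returned by spacy.load('en_core_web_sm').
    """
    rows = []
    # nlp.pipe processes texts in batches instead of one call per title
    for doc in nlp.pipe(titles, batch_size=batch_size):
        rows.append({
            "noun_chunks": [chunk.text for chunk in doc.noun_chunks],
            "verbs": [token.lemma_ for token in doc if token.pos_ == "VERB"],
            "entities": [ent.text for ent in doc.ents],
            "entity_labels": [ent.label_ for ent in doc.ents],
        })
    # Building the frame once avoids assigning into the dataframe cell by cell
    return pd.DataFrame(rows, index=titles.index, columns=NLP_COLUMNS)
```

With the objects defined in this notebook, usage might look like `features = deconstruct_titles(df['title'], nlp)` followed by `df[features.columns] = features`. Trying it first on a small slice such as `df['title'].head(1000)` gives a rough sense of the runtime before committing to all rows.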