# World News NLP Project: Notebook 1
## Data Imports, Cleaning & EDA
#### Adam Zucker

---

## Problem

Using the industry-standard NLP libraries [spaCy](https://spacy.io/), [NLTK](https://www.nltk.org/), and [scikit-learn](https://scikit-learn.org/stable/), this study examines which keywords in a post title most positively affect user engagement. The exploratory data analysis and visualizations in this notebook also factor in other features of the supplied data, including author, post time, and date. For the purposes of this study, positive user engagement is measured in upvotes.

---

## Notebook Contents

1. Library and Data Imports
2. Basic Exploratory Data Analysis
3. NLP & Feature Engineering
4. Data Visualizations

---

## Data

- __*world_news_posts.csv*:__ Supplied dataset of 509,236 post titles from a "world news" message board, including the date, time, and author of each post, along with user interaction (upvotes, downvotes, and an over-18 flag).

---

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import spacy
from spacy import displacy

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.tokenize import RegexpTokenizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import time
from datetime import datetime

In [2]:
# Reading in data
df = pd.read_csv('../data/world_news_posts.csv')

In [3]:
df.head()

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,category
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar,worldnews
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,False,polar,worldnews
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,False,polar,worldnews
3,1201233290,2008-01-25,1,0,Jump-start economy: Give health care to all,False,fadi420,worldnews
4,1201274720,2008-01-25,4,0,Council of Europe bashes EU&UN terror blacklist,False,mhermans,worldnews


---

## EDA

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 509236 entries, 0 to 509235
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   time_created  509236 non-null  int64 
 1   date_created  509236 non-null  object
 2   up_votes      509236 non-null  int64 
 3   down_votes    509236 non-null  int64 
 4   title         509236 non-null  object
 5   over_18       509236 non-null  bool  
 6   author        509236 non-null  object
 7   category      509236 non-null  object
dtypes: bool(1), int64(3), object(4)
memory usage: 27.7+ MB


In [5]:
# Checking for nulls in the dataframe - none detected
df.isnull().sum()

time_created    0
date_created    0
up_votes        0
down_votes      0
title           0
over_18         0
author          0
category        0
dtype: int64

In [6]:
# The data covers 3223 distinct days between 1/25/08 and 11/22/16
print(f"Number of days represented in dataframe: {len(df['date_created'].unique())}")
print(f"Data date range is from {min(df['date_created'])} to {max(df['date_created'])}")

Number of days represented in dataframe: 3223
Data date range is from 2008-01-25 to 2016-11-22
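
The two printed figures can be cross-checked against each other. The sketch below uses only the standard library and the values copied from the output above to compare the number of distinct days observed with the number of calendar days in the range.

```python
from datetime import date

# Values copied from the cell output above
first_day = date(2008, 1, 25)
last_day = date(2016, 11, 22)
days_with_posts = 3223

# Calendar days in the range, counting both endpoints
calendar_days = (last_day - first_day).days + 1
missing_days = calendar_days - days_with_posts

print(f"Calendar days in range: {calendar_days}")
print(f"Days with no posts: {missing_days}")
```

With these inputs the range covers 3,225 calendar days, so two days in the period have no posts at all.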


In [7]:
# Defining a function to concisely process this dataframe and others in the same format
def process_data(df):
    
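    # Redefining the 'time_created' column to hold datetime, converted from unix timestamp format
    # NOTE: datetime.fromtimestamp uses the machine's local timezone, so values can differ from the raw 'date_created'
    df['time_created'] = [datetime.fromtimestamp(ts) for ts in df['time_created']]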
    # Dropping 'date_created' because of redundancy
    df.drop(columns='date_created', inplace=True)
    
    # Dropping 'category' feature if only one category is present
    if df['category'].nunique() == 1:
        df.drop(columns='category', inplace=True)
    # Similarly dropping down votes if there are none reported
    if df['down_votes'].sum() == 0:
        df.drop(columns='down_votes', inplace=True)
    
    # -----------
    # Creating columns of empty lists to hold NLP output
    df['noun_phrases'] = df.apply(lambda value: [], axis=1)
    df['verbs'] = df.apply(lambda value: [], axis=1)
    df['entities'] = df.apply(lambda value: [], axis=1)
    df['entity_labels'] = df.apply(lambda value: [], axis=1)
    
    # Instantiating spacy NLP
    nlp = spacy.load('en_core_web_sm')

    # Incorporating the loop from 'title_deconstruct' function to segment post titles into component pieces and insert into original dataframe
    for i in range(len(df)):
        title = df['title'][i]
        doc = nlp(title)
        df.at[i, 'noun_phrases'] = [noun_chunk.text for noun_chunk in doc.noun_chunks]
        df.at[i, 'verbs'] = [verb.lemma_ for verb in doc if verb.pos_ == "VERB"]
        df.at[i, 'entities'] = [entity.text for entity in doc.ents]
        df.at[i, 'entity_labels'] = [entity.label_ for entity in doc.ents]
    # -----------    
    
    # Binarizing 'over_18' feature
    df['over_18'] = df['over_18'].map({False:0, True:1})
    
    # Creating features to hold the post length in characters and in whitespace-separated tokens
    df['post_length_chars'] = df['title'].apply(len)
    df['post_length_tokens'] = df['title'].str.split().apply(len)
    
    # Generating a feature to hold total author posts alongside each post
    df['author_posts'] = df['author'].groupby(df['author']).transform('count')
    # Total author upvotes alongside each post (disabled; not part of the final column selection below)
#     df['author_upvotes'] = df.groupby('author')['up_votes'].transform('sum')
    
    # Generating a feature to hold day of the week and dummifying
    df['weekday'] = df['time_created'].dt.day_name()
    day_dummies = pd.get_dummies(df['weekday'], drop_first=True)
    df = pd.concat([df, day_dummies], axis=1)
    df.drop(columns='weekday', inplace=True)
    
    # Ordering columns for clarity  ***NOTE*** move outside function because of column dependencies
    df = df[['time_created', 'author', 'author_posts', 'over_18', 'up_votes', 'title', 
             'noun_phrases', 'verbs', 'entities', 'entity_labels', 'post_length_chars', 
             'post_length_tokens', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday']]
    
    return df
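
The final column selection in `process_data` lists six weekday columns and no `Friday`. That follows from `pd.get_dummies(..., drop_first=True)`, which drops the first category, and for plain strings the categories are sorted alphabetically. The sketch below illustrates this on a hypothetical one-week sample and shows one way to choose the baseline day explicitly. The sample data and the choice of Monday as baseline are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample: one timestamp for each day of a single week (starting on a Monday)
sample = pd.DataFrame({"time_created": pd.date_range("2008-01-21", periods=7, freq="D")})
sample["weekday"] = sample["time_created"].dt.day_name()

# With plain strings, categories sort alphabetically, so 'Friday' is the dropped baseline
default_dummies = pd.get_dummies(sample["weekday"], drop_first=True, dtype=int)

# An explicit category order makes the baseline a deliberate choice (here: Monday)
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
ordered_weekday = sample["weekday"].astype(pd.CategoricalDtype(categories=day_order))
ordered_dummies = pd.get_dummies(ordered_weekday, drop_first=True, dtype=int)

print(list(default_dummies.columns))
print(list(ordered_dummies.columns))
```

Either way, a row with zeros in every weekday column belongs to the dropped baseline day.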

In [8]:
df = process_data(df)
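
`process_data` derives `author_posts` with a grouped `transform`, and its commented-out `author_upvotes` line aims at the matching upvote total. A minimal sketch of that pattern follows, on a hypothetical toy frame. The column name `author_upvotes` is taken from the commented-out line, and the values are made up.

```python
import pandas as pd

# Hypothetical toy data in the same shape as the relevant columns
posts = pd.DataFrame({
    "author": ["polar", "polar", "fadi420", "polar"],
    "up_votes": [3, 2, 1, 3],
})

# transform() broadcasts each group's aggregate back onto that group's rows
grouped = posts.groupby("author")["up_votes"]
posts["author_posts"] = grouped.transform("count")
posts["author_upvotes"] = grouped.transform("sum")

print(posts)
```

Because `transform` returns a result aligned with the original index, no merge step is needed to attach the per-author totals to each post.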

In [9]:
df.head(3)

Unnamed: 0,time_created,author,author_posts,over_18,up_votes,title,noun_phrases,verbs,entities,entity_labels,post_length_chars,post_length_tokens,Saturday,Sunday,Monday,Tuesday,Wednesday,Thursday
0,2008-01-24 22:34:06,polar,50,0,3,Scores killed in Pakistan clashes,"[Scores, Pakistan clashes]",[kill],[Pakistan],[GPE],33,5,0,0,0,0,0,1
1,2008-01-24 22:34:35,polar,50,0,2,Japan resumes refuelling mission,"[Japan, refuelling mission]",[resume],[Japan],[GPE],32,4,0,0,0,0,0,1
2,2008-01-24 22:42:03,polar,50,0,3,US presses Egypt on Gaza border,"[US, Egypt, Gaza border]",[press],"[US, Egypt, Gaza]","[GPE, GPE, GPE]",31,6,0,0,0,0,0,1
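
The first row's `time_created` now reads 2008-01-24 22:34:06, while the raw `date_created` for the same post was 2008-01-25. That is consistent with `datetime.fromtimestamp` converting to the local timezone of the machine that ran the notebook. The sketch below, using the first raw timestamp from the data preview, shows a timezone-explicit conversion. Treating UTC as the intended reference is an assumption.

```python
from datetime import datetime, timezone

# First raw timestamp from the data preview
raw_ts = 1201232046

# Explicit UTC conversion gives the same result on any machine
utc_dt = datetime.fromtimestamp(raw_ts, tz=timezone.utc)

# Without tz, the result depends on the local timezone setting
local_dt = datetime.fromtimestamp(raw_ts)

print(utc_dt.isoformat())
print(utc_dt.strftime("%A"))
print(local_dt.isoformat())
```

In UTC this post falls on Friday, 2008-01-25, so the choice of timezone also changes the weekday features for posts near midnight. In pandas, `pd.to_datetime(df['time_created'], unit='s', utc=True)` is a vectorized way to do the same conversion.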


---

In [10]:
df.shape

(509236, 18)

In [11]:
# df.to_csv('../data/world_news_processed.csv', index=False)

---

In [12]:
# df = pd.read_csv('../data/world_news_processed.csv')
# df.head(3)

Unnamed: 0,time_created,author,author_posts,over_18,up_votes,title,noun_phrases,verbs,entities,entity_labels,post_length_chars,post_length_tokens,Saturday,Sunday,Monday,Tuesday,Wednesday,Thursday
0,2008-01-24 22:34:06,polar,50,0,3,Scores killed in Pakistan clashes,"['Scores', 'Pakistan clashes']",['kill'],['Pakistan'],['GPE'],33,5,0,0,0,0,0,1
1,2008-01-24 22:34:35,polar,50,0,2,Japan resumes refuelling mission,"['Japan', 'refuelling mission']",['resume'],['Japan'],['GPE'],32,4,0,0,0,0,0,1
2,2008-01-24 22:42:03,polar,50,0,3,US presses Egypt on Gaza border,"['US', 'Egypt', 'Gaza border']",['press'],"['US', 'Egypt', 'Gaza']","['GPE', 'GPE', 'GPE']",31,6,0,0,0,0,0,1
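
The quoted values in the preview above (for example `['Scores', 'Pakistan clashes']` with its inner quotes) indicate that the list columns come back from CSV as strings rather than lists. One possible way to restore them is `ast.literal_eval`, sketched here on a hypothetical one-row frame with an in-memory buffer standing in for the file.

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column, mirroring 'noun_phrases'
original = pd.DataFrame({
    "title": ["Scores killed in Pakistan clashes"],
    "noun_phrases": [["Scores", "Pakistan clashes"]],
})

# Round-trip through CSV using an in-memory buffer instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, each cell holds the string form of the list
print(type(reloaded.loc[0, "noun_phrases"]))

# literal_eval parses Python literals back into list objects without running arbitrary code
reloaded["noun_phrases"] = reloaded["noun_phrases"].apply(ast.literal_eval)
print(reloaded.loc[0, "noun_phrases"])
```

The same parsing can be done at load time by passing `converters={'noun_phrases': ast.literal_eval}` to `pd.read_csv`, repeated for each list column.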


In [13]:
df.shape

(509236, 18)

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 509236 entries, 0 to 509235
Data columns (total 18 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   time_created        509236 non-null  object
 1   author              509236 non-null  object
 2   author_posts        509236 non-null  int64 
 3   over_18             509236 non-null  int64 
 4   up_votes            509236 non-null  int64 
 5   title               509236 non-null  object
 6   noun_phrases        509236 non-null  object
 7   verbs               509236 non-null  object
 8   entities            509236 non-null  object
 9   entity_labels       509236 non-null  object
 10  post_length_chars   509236 non-null  int64 
 11  post_length_tokens  509236 non-null  int64 
 12  Saturday            509236 non-null  int64 
 13  Sunday              509236 non-null  int64 
 14  Monday              509236 non-null  int64 
 15  Tuesday             509236 non-null  int64 
 16  We

---

In [15]:
# Summary stats for upvotes
df['up_votes'].describe()

count    509236.000000
mean        112.236283
std         541.694675
min           0.000000
25%           1.000000
50%           5.000000
75%          16.000000
max       21253.000000
Name: up_votes, dtype: float64
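
The summary shows a mean of about 112 against a median of 5, which points to a heavily right-skewed distribution. A log transform is one common way to compress such a tail before plotting or modeling. The sketch below applies `np.log1p` to hypothetical values that loosely echo the quartiles above; it is not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Hypothetical upvote counts with a long right tail
up_votes = pd.Series([0, 1, 1, 5, 5, 16, 40, 21253], name="up_votes")

# log1p computes log(1 + x), so zero upvotes map to 0 instead of -inf
log_up_votes = np.log1p(up_votes)

print(f"Raw mean / median: {up_votes.mean():.1f} / {up_votes.median():.1f}")
print(f"Log mean / median: {log_up_votes.mean():.2f} / {log_up_votes.median():.2f}")
```

On the transformed scale the mean and median sit much closer together, and `np.expm1` recovers the original counts.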

In [16]:
# # Looking at titles of most upvoted posts
# df['up_votes'].groupby(df['title']).sum().sort_values(ascending=False)[0:10].to_frame()

In [17]:
df.sort_values('up_votes', ascending=False)[0:10]

Unnamed: 0,time_created,author,author_posts,over_18,up_votes,title,noun_phrases,verbs,entities,entity_labels,post_length_chars,post_length_tokens,Saturday,Sunday,Monday,Tuesday,Wednesday,Thursday
377200,2015-06-20 12:41:11,KRISHNA53,109,0,21253,A biotech startup has managed to 3-D print fak...,"['A biotech startup', '3-D print fake rhino ho...","['manage', 'carry', 'plan', 'flood', 'undercut...","['Chinese', 'one-eighth']","['NORP', 'CARDINAL']",289,49,1,0,0,0,0,0
391415,2015-08-24 08:57:59,joeyoungblood,3,0,13435,Twitter has forced 30 websites that archive po...,"['Twitter', '30 websites', 'archive politician...","['force', 'delete', 'shut', 'remove', 'keep']",['30'],['CARDINAL'],139,22,0,0,1,0,0,0
450818,2016-04-03 14:01:46,mister_geaux,12,0,13244,2.6 terabyte leak of Panamanian shell company ...,"['2.6 terabyte leak', 'Panamanian shell compan...","['reveal', 'lead', 'manage']","['2.6', 'Panamanian', 'Fifa']","['CARDINAL', 'NORP', 'ORG']",277,39,0,1,0,0,0,0
391318,2015-08-23 18:09:28,navysealassulter,1,0,12333,The police officer who leaked the footage of t...,"['The police officer', 'who', 'the footage', '...","['leak', 'paradise', 'wash', 'charge', 'bring'...",[],[],243,40,0,1,0,0,0,0
390252,2015-08-18 19:06:08,seapiglet,1,0,11288,Paris shooting survivor suing French media for...,"['Paris', 'survivor', 'French media', 'his loc...","['shoot', 'sue', 'give', 'hide']","['Paris', 'French']","['GPE', 'NORP']",98,16,0,0,0,1,0,0
449809,2016-03-30 07:19:33,Xiroth,2,0,11108,Hundreds of thousands of leaked emails reveal ...,"['Hundreds of thousands', 'leaked emails', 'ma...","['leak', 'reveal']",['Hundreds of thousands'],['CARDINAL'],100,14,0,0,0,0,1,0
397215,2015-09-17 20:14:48,DoremusJessup,5037,0,10922,Brazil s Supreme Court has banned corporate co...,"['Brazil s Supreme Court', 'corporate contribu...",['ban'],"['Brazil', 'Supreme Court']","['GPE', 'ORG']",92,13,0,0,0,0,0,1
390494,2015-08-19 20:30:33,DawgsOnTopUGA,1,0,10515,ISIS beheads 81-year-old pioneer archaeologist...,"['ISIS', '81-year-old pioneer archaeologist', ...","['behead', 'hold', 'refuse', 'tell']","['ISIS', '81-year-old', 'Syria', '1 month', 'I...","['ORG', 'DATE', 'GPE', 'DATE', 'ORG', 'PRODUCT']",188,30,0,0,0,0,1,0
500786,2016-10-19 08:47:15,mvea,18,0,10394,Feeding cows seaweed could slash global greenh...,"['Feeding cows', 'global greenhouse gas emissi...","['slash', 'say', 'discover', 'add', 'dry', 're...",['up to'],['CARDINAL'],225,40,0,0,0,0,1,0
388230,2015-08-07 11:58:55,fiffers,5,0,10377,Brazilian radio host famous for exposing corru...,"['Brazilian radio host', 'corruption', 'his ci...","['expose', 'murder', 'broadcast']","['Brazilian', 'two']","['NORP', 'CARDINAL']",122,20,0,0,0,0,0,0


In [18]:
df.head(1)

Unnamed: 0,time_created,author,author_posts,over_18,up_votes,title,noun_phrases,verbs,entities,entity_labels,post_length_chars,post_length_tokens,Saturday,Sunday,Monday,Tuesday,Wednesday,Thursday
0,2008-01-24 22:34:06,polar,50,0,3,Scores killed in Pakistan clashes,"['Scores', 'Pakistan clashes']",['kill'],['Pakistan'],['GPE'],33,5,0,0,0,0,0,1


---

In [19]:
print(f"Number of unique authors: {len(df['author'].unique())}")
print('-----')
print(f"Top 20 contributors by post count: \n{df['author'].value_counts()[0:20]}")
print('-----')
print(f"Top 20 contributors by upvotes: \n{df['up_votes'].groupby(df['author']).sum().sort_values(ascending=False)[0:20]}")

Number of unique authors: 85838
-----
Top 20 contributors by post count: 
davidreiss666         8897
anutensil             5730
DoremusJessup         5037
maxwellhill           4023
igeldard              4013
readerseven           3170
twolf1                2923
madam1                2658
nimobo                2564
madazzahatter         2503
ionised               2493
NinjaDiscoJesus       2448
bridgesfreezefirst    2405
SolInvictus           2181
Libertatea            2108
vigorous              2077
galt1776              1897
DougBolivar           1770
bob21doh              1698
trot-trot             1649
Name: author, dtype: int64
-----
Top 20 contributors by upvotes: 
author
maxwellhill         1985416
anutensil           1531544
Libertatea           832102
DoremusJessup        584380
Wagamaga             580121
NinjaDiscoJesus      492582
madazzahatter        428966
madam1               390541
davidreiss666        338306
kulkke               333311
pnewell              297270
nimob
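
The two rankings differ because post volume and total upvotes measure different things. Dividing one by the other gives average upvotes per post. The sketch below does this for the five authors whose figures appear in both lists above, with the numbers copied from that output.

```python
import pandas as pd

# Figures copied from the two rankings above, for authors appearing in both
post_counts = pd.Series({
    "davidreiss666": 8897,
    "anutensil": 5730,
    "DoremusJessup": 5037,
    "maxwellhill": 4023,
    "Libertatea": 2108,
})
total_upvotes = pd.Series({
    "maxwellhill": 1985416,
    "anutensil": 1531544,
    "Libertatea": 832102,
    "DoremusJessup": 584380,
    "davidreiss666": 338306,
})

# Division aligns on the author index, so the input ordering does not matter
upvotes_per_post = (total_upvotes / post_counts).sort_values(ascending=False)
print(upvotes_per_post.round(1))
```

By this measure the most prolific poster, davidreiss666, has the lowest average of the five. On the full dataframe, `df.groupby('author')['up_votes'].agg(['count', 'sum', 'mean'])` would give the same three figures for every author.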

---

In [20]:
# Looking at distribution of 'over_18' posts by number and percentage
print(df['over_18'].value_counts())
print(df['over_18'].value_counts(normalize=True))

0    508916
1       320
Name: over_18, dtype: int64
0    0.999372
1    0.000628
Name: over_18, dtype: float64
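
With only 320 flagged posts (about 0.06% of the data), a per-group summary helps judge whether the flag relates to engagement. The sketch below shows one way to compute it with `groupby(...).agg`. The toy values are hypothetical, and reporting the median alongside the mean is a suggestion given the skew in upvotes.

```python
import pandas as pd

# Hypothetical toy data; 'over_18' is already binarized to 0/1 as in process_data
posts = pd.DataFrame({
    "over_18": [0, 0, 0, 0, 1, 1],
    "up_votes": [1, 5, 16, 300, 5, 189],
})

# Per-group size alongside median and mean upvotes
summary = posts.groupby("over_18")["up_votes"].agg(["count", "median", "mean"])
print(summary)
```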


In [21]:
# Looking at data for a sample of the posts classified as "over_18" - on first look, lots of violent content
df[df['over_18'] == True].head()

Unnamed: 0,time_created,author,author_posts,over_18,up_votes,title,noun_phrases,verbs,entities,entity_labels,post_length_chars,post_length_tokens,Saturday,Sunday,Monday,Tuesday,Wednesday,Thursday
1885,2008-03-24 13:57:18,pressed,1,1,189,Pics from the Tibetan protests - more graphic ...,"['Pics', 'the Tibetan protests', 'Wikileaks', ...",[],['Tibetan'],['NORP'],76,12,0,0,1,0,0,0
6721,2008-05-18 15:25:18,alllie,1092,1,5,"MI5 linked to Max Mosley’s Nazi-style, sadomas...","['MI5', 'Max Mosley’s Nazi-style, sadomasochis...","['link', 'lead']","['Max Mosley’s', 'Nazi', 'the British Union of...","['PERSON', 'NORP', 'ORG', 'DATE']",188,31,0,1,0,0,0,0
8414,2008-06-05 15:42:05,stesch,211,1,0,Tabloid Horrifies Germany: Poland s Yellow Pre...,"['Poland s Yellow Press', 'Blood Red', 'you', ...","['turn', 'follow']","['Tabloid Horrifies Germany', 'Poland', 'Yello...","['PERSON', 'GPE', 'ORG', 'ORDINAL', 'ORG', 'GPE']",135,25,0,0,0,0,0,1
12163,2008-07-21 16:26:56,stesch,211,1,0,Love Parade Dortmund: Techno Festival Breaks R...,"['Love Parade Dortmund', 'Techno Festival Brea...",[],"['1.6 Million', '90x90', 'NSFW']","['CARDINAL', 'CARDINAL', 'ORG']",101,16,0,0,1,0,0,0
12699,2008-07-29 21:29:40,cup,35,1,5,IDF kills young Palestinian boy. Potentially N...,"['IDF', 'young Palestinian boy', 'Potentially ...",['kill'],"['Palestinian', 'NSFW']","['NORP', 'ORG']",50,7,0,0,0,1,0,0


In [22]:
# Creating a new dataframe of the over_18 posts and examining most upvoted among those
nsfw = df[df['over_18'] == True]
nsfw.sort_values(by='up_votes', ascending=False)[0:10]

Unnamed: 0,time_created,author,author_posts,over_18,up_votes,title,noun_phrases,verbs,entities,entity_labels,post_length_chars,post_length_tokens,Saturday,Sunday,Monday,Tuesday,Wednesday,Thursday
500590,2016-10-18 12:08:56,IsleCook,645,1,7941,"Judge presiding over El Chapo s case shot, k...","['Judge', 'El Chapo s case']","['preside', 'shoot', 'kill', 'jog']",['El Chapo'],['GPE'],78,13,0,0,0,1,0,0
494536,2016-09-25 08:05:14,ExWhySaid,1,1,6322,[NSFL] Australian child molester Peter Scully ...,"['Australian child molester Peter Scully', 'de...","['face', 'film', 'make', 'dig', 'laugh', 'joke...","['Australian', 'Peter Scully', 'Philippines']","['NORP', 'PERSON', 'GPE']",238,39,0,1,0,0,0,0
428689,2016-01-07 06:48:09,rawmas02,27,1,5878,Armed suspect shot dead after trying to storm ...,"['Armed suspect', 'Paris police station']","['shoot', 'try', 'storm']",['Paris'],['GPE'],66,11,0,0,0,0,0,1
462067,2016-05-17 06:17:06,orangeflower2015,51,1,5617,Syria Army killed over 200 ISIS militants in 3...,"['Syria Army', 'over 200 ISIS militants', '3-d...",['kill'],"['Syria Army', 'over 200', 'ISIS', '3-day']","['ORG', 'CARDINAL', 'ORG', 'DATE']",79,14,0,0,0,1,0,0
303900,2014-09-05 14:45:33,brothamo,8,1,5507,Man escapes ISIS execution,"['Man', 'ISIS execution']",['escape'],['ISIS'],['ORG'],26,4,0,0,0,0,0,0
461255,2016-05-13 10:34:54,PeterG92,1,1,4839,ISIS massacre 14 Real Madrid fans at supporter...,"['ISIS', '14 Real Madrid fans', 'supporters cl...",[],"['ISIS', '14', 'Real Madrid', 'Baghdad']","['ORG', 'CARDINAL', 'ORG', 'GPE']",63,11,0,0,0,0,0,0
376435,2015-06-16 20:31:08,ShakoWasAngry,72,1,4209,The fight is on to stop an annual Chinese even...,"['The fight', 'an annual Chinese event', 'the ...","['stop', 'expect', 'involve', 'burn', 'boil']","['annual', 'Chinese', 'more than 10,000']","['DATE', 'NORP', 'CARDINAL']",157,30,0,0,0,1,0,0
269963,2014-04-21 18:47:40,helpmesleep666,1,1,3831,China: “Violent Government Thugs” Beaten To De...,"['China', '“Violent Government Thugs', 'Death'...","['beat', 'kill', 'document']","['China', 'Violent Government Thugs', 'Beaten ...","['GPE', 'WORK_OF_ART', 'WORK_OF_ART']",117,18,0,0,1,0,0,0
431221,2016-01-18 00:39:45,AllenDono,211,1,3823,ISIS commits largest massacre since Syrian con...,"['ISIS', 'largest massacre', 'Syrian conflict'...","['commit', 'dump', 'take']","['ISIS', 'Syrian', '280', '400']","['ORG', 'NORP', 'CARDINAL', 'CARDINAL']",108,18,0,0,1,0,0,0
246618,2014-01-23 10:02:27,_skylark,1,1,3738,Video of riot police stripping detained protes...,"['Video', 'riot police', 'detained protester',...","['strip', 'detain', 'take', 'continue']","['0', 'Kyiv', 'Ukraine']","['CARDINAL', 'GPE', 'GPE']",172,28,0,0,0,0,0,1


---

## NLP & Feature Engineering

---

## Data Visualizations