# World News NLP Project
## Scratchpad
#### Adam Zucker

---

## Data

- __*world_news_posts.csv*:__ Supplied dataframe with roughly 500,000 titles of posts on a "world news" message board, including data for the date, time, and author of the post, along with user interaction.

---

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import spacy
from spacy import displacy

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.tokenize import RegexpTokenizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import time
from datetime import datetime

In [54]:
# Reading in data
df = pd.read_csv('../data/world_news_posts.csv')

In [55]:
df.head()

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,category
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar,worldnews
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,False,polar,worldnews
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,False,polar,worldnews
3,1201233290,2008-01-25,1,0,Jump-start economy: Give health care to all,False,fadi420,worldnews
4,1201274720,2008-01-25,4,0,Council of Europe bashes EU&UN terror blacklist,False,mhermans,worldnews


---

## EDA

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 509236 entries, 0 to 509235
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   time_created  509236 non-null  int64 
 1   date_created  509236 non-null  object
 2   up_votes      509236 non-null  int64 
 3   down_votes    509236 non-null  int64 
 4   title         509236 non-null  object
 5   over_18       509236 non-null  bool  
 6   author        509236 non-null  object
 7   category      509236 non-null  object
dtypes: bool(1), int64(3), object(4)
memory usage: 27.7+ MB


In [5]:
# Checking for nulls in the dataframe - none detected
df.isnull().sum()

time_created    0
date_created    0
up_votes        0
down_votes      0
title           0
over_18         0
author          0
category        0
dtype: int64

In [6]:
# The data spans 3223 days, from 1/25/08 to 11/22/16
print(f"Number of days represented in dataframe: {len(df['date_created'].unique())}")
print(f"Data date range is from {min(df['date_created'])} to {max(df['date_created'])}")

Number of days represented in dataframe: 3223
Data date range is from 2008-01-25 to 2016-11-22


In [7]:
# df['up_votes'].groupby(df['author']).sum()

In [8]:
# Converting 'date_created' to datetime
df['date_created'] = pd.to_datetime(df['date_created'])

In [9]:
df.dtypes

time_created             int64
date_created    datetime64[ns]
up_votes                 int64
down_votes               int64
title                   object
over_18                   bool
author                  object
category                object
dtype: object

In [10]:
# All posts are classified as 'worldnews' - with just a single class represented, this feature becomes unnecessary
df['category'].value_counts()

worldnews    509236
Name: category, dtype: int64

In [11]:
# Dropping 'category' feature
df.drop(columns='category', inplace=True)

---

In [12]:
# Summary stats for upvotes
df['up_votes'].describe()

count    509236.000000
mean        112.236283
std         541.694675
min           0.000000
25%           1.000000
50%           5.000000
75%          16.000000
max       21253.000000
Name: up_votes, dtype: float64

In [13]:
# Looking at titles of most upvoted posts
df['up_votes'].groupby(df['title']).sum().sort_values(ascending=False)[0:10].to_frame()

Unnamed: 0_level_0,up_votes
title,Unnamed: 1_level_1
"A biotech startup has managed to 3-D print fake rhino horns that carry the same genetic fingerprint as the actual horn. The company plans to flood Chinese rhino horn market at one-eighth of the price of the original, undercutting the price poachers can get and forcing them out eventually.",21253
"Twitter has forced 30 websites that archive politician s deleted tweets to shut down, removing an effective tool to keep politicians honest",13435
"2.6 terabyte leak of Panamanian shell company data reveals how a global industry led by major banks, legal firms, and asset management companies secretly manages the estates of politicians, Fifa officials, fraudsters and drug smugglers, celebrities and professional athletes.",13244
"The police officer who leaked the footage of the surfers paradise police brutality, where the victims blood was washed away by officers, has been criminally charged for bringing it to the publics view. Officers who did the bashing get nothing.",12333
Paris shooting survivor suing French media for giving away his location while he hid from shooters,11288
Hundreds of thousands of leaked emails reveal massively widespread corruption in global oil industry,11108
Brazil s Supreme Court has banned corporate contributions to political campaigns and parties,10922
"ISIS beheads 81-year-old pioneer archaeologist and foremost scholar on ancient Syria. Held captive for 1 month, he refused to tell ISIS the location of the treasures of Palmyra unto death.",10515
"Feeding cows seaweed could slash global greenhouse gas emissions, researchers say: They discovered adding a small amount of dried seaweed to a cow s diet can reduce the amount of methane a cow produces by up to 99 per cent.",10394
Brazilian radio host famous for exposing corruption in his city murdered while broadcasting live on the air by two gunmen.,10377


In [14]:
df.sort_values('up_votes', ascending=False)[0:10]

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author
377200,1434818471,2015-06-20,21253,0,A biotech startup has managed to 3-D print fak...,False,KRISHNA53
391415,1440421079,2015-08-24,13435,0,Twitter has forced 30 websites that archive po...,False,joeyoungblood
450818,1459706506,2016-04-03,13244,0,2.6 terabyte leak of Panamanian shell company ...,False,mister_geaux
391318,1440367768,2015-08-23,12333,0,The police officer who leaked the footage of t...,False,navysealassulter
390252,1439939168,2015-08-18,11288,0,Paris shooting survivor suing French media for...,False,seapiglet
449809,1459336773,2016-03-30,11108,0,Hundreds of thousands of leaked emails reveal ...,False,Xiroth
397215,1442535288,2015-09-18,10922,0,Brazil s Supreme Court has banned corporate co...,False,DoremusJessup
390494,1440030633,2015-08-20,10515,0,ISIS beheads 81-year-old pioneer archaeologist...,False,DawgsOnTopUGA
500786,1476881235,2016-10-19,10394,0,Feeding cows seaweed could slash global greenh...,False,mvea
388230,1438963135,2015-08-07,10377,0,Brazilian radio host famous for exposing corru...,False,fiffers


In [15]:
df.head()

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,False,polar
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,False,polar
3,1201233290,2008-01-25,1,0,Jump-start economy: Give health care to all,False,fadi420
4,1201274720,2008-01-25,4,0,Council of Europe bashes EU&UN terror blacklist,False,mhermans


---

In [16]:
print(f"Number of unique authors: {len(df['author'].unique())}")
print('-----')
print(f"Top 20 contributors by post count: \n{df['author'].value_counts()[0:20]}")
print('-----')
print(f"Top 20 contributors by upvotes: \n{df['up_votes'].groupby(df['author']).sum().sort_values(ascending=False)[0:20]}")

Number of unique authors: 85838
-----
Top 20 contributors by post count: 
davidreiss666         8897
anutensil             5730
DoremusJessup         5037
maxwellhill           4023
igeldard              4013
readerseven           3170
twolf1                2923
madam1                2658
nimobo                2564
madazzahatter         2503
ionised               2493
NinjaDiscoJesus       2448
bridgesfreezefirst    2405
SolInvictus           2181
Libertatea            2108
vigorous              2077
galt1776              1897
DougBolivar           1770
bob21doh              1698
trot-trot             1649
Name: author, dtype: int64
-----
Top 20 contributors by upvotes: 
author
maxwellhill         1985416
anutensil           1531544
Libertatea           832102
DoremusJessup        584380
Wagamaga             580121
NinjaDiscoJesus      492582
madazzahatter        428966
madam1               390541
davidreiss666        338306
kulkke               333311
pnewell              297270
nimob

---

In [17]:
# Looking at distribution of 'over_18' posts by number and percentage
print(df['over_18'].value_counts())
print(df['over_18'].value_counts(normalize=True))

False    508916
True        320
Name: over_18, dtype: int64
False    0.999372
True     0.000628
Name: over_18, dtype: float64


In [18]:
# Checking title content of some of the posts classified as "over_18"
df[df['over_18'] == True]

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author
1885,1206381438,2008-03-24,189,0,Pics from the Tibetan protests - more graphic ...,True,pressed
6721,1211138718,2008-05-18,5,0,"MI5 linked to Max Mosley’s Nazi-style, sadomas...",True,alllie
8414,1212694925,2008-06-05,0,0,Tabloid Horrifies Germany: Poland s Yellow Pre...,True,stesch
12163,1216672016,2008-07-21,0,0,Love Parade Dortmund: Techno Festival Breaks R...,True,stesch
12699,1217381380,2008-07-30,5,0,IDF kills young Palestinian boy. Potentially N...,True,cup
...,...,...,...,...,...,...,...
503776,1477889966,2016-10-31,4,0,Latest Italian Earthquake Devastates Medieval ...,True,pixelinthe
508067,1479400229,2016-11-17,12,0,ISIS Release Video Showing Melbourne As A Poss...,True,halacska
508176,1479434681,2016-11-18,0,0,Animal welfare activists have released footage...,True,NinjaDiscoJesus
508376,1479492875,2016-11-18,6,0,Jungle Justice : Public lynching of a street ...,True,avivi_


In [19]:
nsfw = df[df['over_18'] == True]
nsfw.sort_values(by='up_votes', ascending=False)[0:10]

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author
500590,1476806936,2016-10-18,7941,0,"Judge presiding over El Chapo s case shot, k...",True,IsleCook
494536,1474805114,2016-09-25,6322,0,[NSFL] Australian child molester Peter Scully ...,True,ExWhySaid
428689,1452167289,2016-01-07,5878,0,Armed suspect shot dead after trying to storm ...,True,rawmas02
462067,1463480226,2016-05-17,5617,0,Syria Army killed over 200 ISIS militants in 3...,True,orangeflower2015
303900,1409942733,2014-09-05,5507,0,Man escapes ISIS execution,True,brothamo
461255,1463150094,2016-05-13,4839,0,ISIS massacre 14 Real Madrid fans at supporter...,True,PeterG92
376435,1434501068,2015-06-17,4209,0,The fight is on to stop an annual Chinese even...,True,ShakoWasAngry
269963,1398120460,2014-04-21,3831,0,China: “Violent Government Thugs” Beaten To De...,True,helpmesleep666
431221,1453095585,2016-01-18,3823,0,ISIS commits largest massacre since Syrian con...,True,AllenDono
246618,1390489347,2014-01-23,3738,0,Video of riot police stripping detained protes...,True,_skylark


---

## Feature Engineering

In [51]:
# Generating features to hold total author posts and total author upvotes alongside each post
df['author_posts'] = df['author'].groupby(df['author']).transform('count')

In [58]:
df['author_upvotes'] = df['up_votes'].groupby(df['author']).transform('sum')

0          1151
1          1151
2          1151
3             1
4             4
          ...  
509231      105
509232        1
509233       16
509234      429
509235    30314
Name: up_votes, Length: 509236, dtype: int64

In [57]:
df.head(3)

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,category,author_upvotes
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar,worldnews,1151
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,False,polar,worldnews,1151
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,False,polar,worldnews,1151
3,1201233290,2008-01-25,1,0,Jump-start economy: Give health care to all,False,fadi420,worldnews,1
4,1201274720,2008-01-25,4,0,Council of Europe bashes EU&UN terror blacklist,False,mhermans,worldnews,4
...,...,...,...,...,...,...,...,...,...
509231,1479816764,2016-11-22,5,0,Heil Trump : Donald Trump s alt-right white...,False,nonamenoglory,worldnews,105
509232,1479816772,2016-11-22,1,0,There are people speculating that this could b...,False,SummerRay,worldnews,1
509233,1479817056,2016-11-22,1,0,Professor receives Arab Researchers Award,False,AUSharjah,worldnews,16
509234,1479817157,2016-11-22,1,0,Nigel Farage attacks response to Trump ambassa...,False,smilyflower,worldnews,429


---
---
## NLP

In [20]:
nlp = spacy.load('en_core_web_sm')

In [21]:
test_title = df['title'][111111]

In [22]:
df['title'][111111]

'A man  who put his fiancée in a cardboard computer box and tried to bury her alive because he was bored with her was sentenced today to 20 years in prison.'

In [23]:
# From https://spacy.io/ demo code
doc = nlp(test_title)

print([noun_phrases.text for noun_phrases in doc.noun_chunks])
print('-----')
print([token.lemma_ for token in doc if token.pos_ == "VERB"])
print('-----')
for entity in doc.ents:
    print(entity.text, entity.label_)

['A man', 'who', 'his fiancée', 'a cardboard computer box', 'her', 'he', 'her', '20 years', 'prison']
-----
['put', 'try', 'bury', 'be', 'sentence']
-----
today DATE
20 years DATE


In [24]:
df.head(1)

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar


In [25]:
df.index

RangeIndex(start=0, stop=509236, step=1)

In [26]:
range(len(df.index))

range(0, 509236)

In [27]:
print((range(len(df.index)))[-1])

509235


In [28]:
df.tail(1)

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author
509235,1479817346,2016-11-22,1,0,Palestinian wielding knife shot dead in West B...,False,superislam


---
---

In [29]:
# Creating columns of empty lists to hold NLP output

df['noun_phrases'] = df.apply(lambda value: [], axis=1)
df['verbs'] = df.apply(lambda value: [], axis=1)
df['entities'] = df.apply(lambda value: [], axis=1)
df['entity_labels'] = df.apply(lambda value: [], axis=1)

**NOTE:** The `lambda` function above is necessary since I can't directly assign an empty list as a value to fill the new columns.

---
---

In [30]:
df.dtypes

time_created              int64
date_created     datetime64[ns]
up_votes                  int64
down_votes                int64
title                    object
over_18                    bool
author                   object
noun_phrases             object
verbs                    object
entities                 object
entity_labels            object
dtype: object

In [31]:
# # Instantiating spacy NLP
# nlp = spacy.load('en_core_web_sm')

# # Defining a new function to segment post titles into component pieces and insert into original dataframe
# def title_deconstruct(df):
#     for i in range(len(df.index)):
#         title = df['title'][i]
#         doc = nlp(title)
#         df['noun_chunks'][i] = [noun_chunk.text for noun_chunk in doc.noun_chunks]
#         df['verbs'][i] = [verb.lemma_ for verb in doc if verb.pos_ == "VERB"]
#         df['entities'][i] = [entity.text for entity in doc.ents]
#         df['entity_labels'][i] = [entity.label_ for entity in doc.ents]
#     return df

In [32]:
# title_deconstruct(df)

---

In [33]:
# # Initializing a new, empty dataframe to hold nlp data
# nlp_df = pd.DataFrame(data=None, index=range(len(df.index)), columns=['noun_chunks', 'verbs', 'entities', 'entity_labels'])

In [34]:
# nlp_df.head(3)

In [35]:
# # Instantiating spacy NLP
# nlp = spacy.load('en_core_web_sm')

# # Defining a new function to segment post titles into component pieces and insert into original dataframe
# def title_deconstruct(df):
#     for i in range(len(df)):
#         title = df['title'][i]
#         doc = nlp(title)
#         nlp_df['noun_chunks'][i] = [noun_chunk.text for noun_chunk in doc.noun_chunks]
#         nlp_df['verbs'][i] = [verb.lemma_ for verb in doc if verb.pos_ == "VERB"]
#         nlp_df['entities'][i] = [entity.text for entity in doc.ents]
#         nlp_df['entity_labels'][i] = [entity.label_ for entity in doc.ents]
#     return nlp_df

In [36]:
# nlp_df = title_deconstruct(df)

In [37]:
# nlp_df

In [38]:
# pd.concat([df, nlp_df], axis=1)

---

In [39]:
# # Instantiating spacy NLP
# nlp = spacy.load('en_core_web_sm')

# # Defining a new function to segment post titles into component pieces and insert into original dataframe
# def title_deconstruct(df):
#     for i in range(10):
#         title = df['title'][i]
#         doc = nlp(title)
#         nouns = [noun_chunk.text for noun_chunk in doc.noun_chunks]
#         verbs = [verb.lemma_ for verb in doc if verb.pos_ == "VERB"]
#         entities = [entity.text for entity in doc.ents]
#         ent_labels = [entity.label_ for entity in doc.ents]
#         df['noun_chunks'][i].append(nouns) 
#         df['verbs'][i].append(verbs) 
#         df['entities'][i].append(entities)
#         df['entity_labels'][i].append(ent_labels)
#     return df

In [40]:
# title_deconstruct(df)

In [41]:
# df.isnull().sum()

---
---
### This one works!

**BELOW:** This seems to be the best iteration of the function, but is still computationally inefficient.

In [42]:
# Instantiating spacy NLP
nlp = spacy.load('en_core_web_sm')

# Defining a new function to segment post titles into component pieces and insert into original dataframe
def title_deconstruct(df):
    for i in range(len(df)):
        title = df['title'][i]
        doc = nlp(title)
        df.at[i, 'noun_phrases'] = [noun_chunk.text for noun_chunk in doc.noun_chunks]
        df.at[i, 'verbs'] = [verb.lemma_ for verb in doc if verb.pos_ == "VERB"]
        df.at[i, 'entities'] = [entity.text for entity in doc.ents]
        df.at[i, 'entity_labels'] = [entity.label_ for entity in doc.ents]
    return df

In [43]:
# df = title_deconstruct(df)

In [44]:
# df.head(3)

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,noun_phrases,verbs,entities,entity_labels
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar,"[Scores, Pakistan clashes]",[kill],[Pakistan],[GPE]
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,False,polar,"[Japan, refuelling mission]",[resume],[Japan],[GPE]
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,False,polar,"[US, Egypt, Gaza border]",[press],"[US, Egypt, Gaza]","[GPE, GPE, GPE]"


---
#### Sentiment Analysis

**Compound Polarity Score:** The Compound Polarity Score generated by the `SentimentIntensityAnalyzer` is a rough judge of positivity and negativity in a text token, ranging from $-1$ (very negative) to $+1$ (very positive). The mean polarity score I see above of 0.488 implies that the Audio Engineering Subreddit's contributors tend to skew positive with the content they share!

In [45]:
# titles = df['title'].tolist()

In [46]:
# New column to hold the sentiment analysis results - can assign Null for now since data will be floats
df['compound_sentiment'] = np.NaN

# Verb sentiments will be stored a list of floats
df['verb_sentiment'] = df.apply(lambda value: [], axis=1)

In [47]:
# Defining a function to analyze sentiment of each title based on the VADER lexicon
def sent_analysis(df):
    # Instantiating a Sentiment Intensity Analyzer
    sent = SentimentIntensityAnalyzer()
    
    for i in range(len(df)):
        title = df['title'][i]
        title_sentiment = sent.polarity_scores(title)
        df.at[i, 'compound_sentiment'] = round(title_sentiment['compound'], 2)
        
#         verb_sents = []
#         for v in df['verbs'][i]:
#             verb_sentiment = sent.polarity_scores(v)
#             verb_sents.append(round(verb_sentiment['compound'], 2))
            
#         df.at[i, 'verb_sentiment'] = verb_sents
    
    return df


In [48]:
# df = sent_analysis(df)

In [50]:
# df

In [59]:
verb_test = 'make, run, smile, pass'
vt_sent = sent.polarity_scores(verb_test)

In [60]:
vt_sent

{'neg': 0.0, 'neu': 0.545, 'pos': 0.455, 'compound': 0.3612}

In [77]:
verb_test2 = ['make', 'run', 'smile', 'pass', 'happy']
vt_sent2 = sent.polarity_scores(verb_test2[1])

In [78]:
vt_sent2

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

In [75]:
verb_test2[2]

'smile'

---

### TEST

Combining it all into a single function

In [45]:
# Reading in data
df = pd.read_csv('../data/world_news_posts.csv')

In [46]:
df.head(3)

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,category
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar,worldnews
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,False,polar,worldnews
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,False,polar,worldnews


In [47]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 509236 entries, 0 to 509235
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   time_created  509236 non-null  int64 
 1   date_created  509236 non-null  object
 2   up_votes      509236 non-null  int64 
 3   down_votes    509236 non-null  int64 
 4   title         509236 non-null  object
 5   over_18       509236 non-null  bool  
 6   author        509236 non-null  object
 7   category      509236 non-null  object
dtypes: bool(1), int64(3), object(4)
memory usage: 27.7+ MB


**BELOW:** The `process_data` function defined here will generate and populate the existing dataframe with a number of new features, as well as drop unnecessary features.

**TO DO:**
* Fix the upvotes by author feature
* Thoroughly verify data integrity
* Make sure the code is clear and flexible, eg, add loops to account for nulls in source data
* Is there a way to make it more efficient?

In [49]:
# Defining a function to concisely process this dataframe and others in the same format
def process_data(df):
    
    # Redefining the 'time_created' column to hold datetime, converted from unix timestamp format
    df['time_created'] = [datetime.fromtimestamp(ts) for ts in df['time_created']]
    # Dropping 'date_created' because of redundancy
    df.drop(columns='date_created', inplace=True)
    
    # Dropping 'category' feature if only one category is present
    if len(df['category'].unique()) == 1:
        df.drop(columns='category', inplace=True)
    # Similarly dropping down votes if there are none reported
    if sum(df['down_votes']) == 0:
        df.drop(columns='down_votes', inplace=True)
  

    # -----------
    # Creating columns of empty lists to hold NLP output
    df['noun_phrases'] = df.apply(lambda value: [], axis=1)
    df['verbs'] = df.apply(lambda value: [], axis=1)
    df['entities'] = df.apply(lambda value: [], axis=1)
    df['entity_labels'] = df.apply(lambda value: [], axis=1)
    # New column to hold the sentiment analysis results - can assign Null for now since data will be floats
    df['compound_sentiment'] = np.NaN
    
    # Instantiating spacy NLP
    nlp = spacy.load('en_core_web_sm')
    
    # Instantiating Sentiment Intensity Analyzer
    sent = SentimentIntensityAnalyzer()

    # Incorporating the loop from 'title_deconstruct' function to segment post titles into component pieces and insert into original dataframe
    for i in range(len(df)):
        title = df['title'][i]
        doc = nlp(title)
        df.at[i, 'noun_phrases'] = [noun_chunk.text for noun_chunk in doc.noun_chunks]
        df.at[i, 'verbs'] = [verb.lemma_ for verb in doc if verb.pos_ == "VERB"]
        df.at[i, 'entities'] = [entity.text for entity in doc.ents]
        df.at[i, 'entity_labels'] = [entity.label_ for entity in doc.ents]
        
        # Sentiment analysis
        title_sentiment = sent.polarity_scores(title)
        df.at[i, 'compound_sentiment'] = round(title_sentiment['compound'], 2)
    # -----------    
    
    
    # Binarizing 'over_18' feature
    df['over_18'] = df['over_18'].map({False:0, True:1})
    
    # Creating a feature to hold the post length in characters and words
    df['post_length_chars'] = df['title'].apply(len)
    df['post_length_tokens'] = df['title'].str.split().apply(len)
    
    # Generating features to hold total author posts and total author upvotes alongside each post
    df['author_posts'] = df['author'].groupby(df['author']).transform('count')
    df['author_upvotes'] = df['up_votes'].groupby(df['author']).transform('sum')
    
    # Generating a feature to hold day of the week and dummifying
    df['weekday'] = df['time_created'].dt.day_name()
    day_dummies = pd.get_dummies(df['weekday'], drop_first=True)
    df = pd.concat([df, day_dummies], axis=1)
    df.drop(columns='weekday', inplace=True)
    
    return df

In [31]:
df = process_data(df)

In [32]:
df

Unnamed: 0,time_created,up_votes,title,over_18,author,noun_phrases,verbs,entities,entity_labels,post_length_chars,post_length_tokens,author_posts,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
0,2008-01-24 22:34:06,3,Scores killed in Pakistan clashes,0,polar,"[Scores, Pakistan clashes]",[kill],[Pakistan],[GPE],33,5,50,0,0,0,1,0,0
1,2008-01-24 22:34:35,2,Japan resumes refuelling mission,0,polar,"[Japan, refuelling mission]",[resume],[Japan],[GPE],32,4,50,0,0,0,1,0,0
2,2008-01-24 22:42:03,3,US presses Egypt on Gaza border,0,polar,"[US, Egypt, Gaza border]",[press],"[US, Egypt, Gaza]","[GPE, GPE, GPE]",31,6,50,0,0,0,1,0,0
3,2008-01-24 22:54:50,1,Jump-start economy: Give health care to all,0,fadi420,"[Jump-start economy, health care]",[give],[],[],44,7,2,0,0,0,1,0,0
4,2008-01-25 10:25:20,4,Council of Europe bashes EU&UN terror blacklist,0,mhermans,"[Council, Europe, EU&UN]",[bash],"[Council of Europe, EU&UN]","[ORG, ORG]",47,7,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
509231,2016-11-22 07:12:44,5,Heil Trump : Donald Trump s alt-right white...,0,nonamenoglory,"[ Heil Trump, Donald Trump, alt-right white n...","[s, invoke]","[Heil Trump, Donald Trump, Nazi]","[PERSON, PERSON, NORP]",88,13,5,0,0,0,0,1,0
509232,2016-11-22 07:12:52,1,There are people speculating that this could b...,0,SummerRay,"[people, Madeleine McCann]","[speculate, be]",[Madeleine McCann],[PERSON],67,10,1,0,0,0,0,1,0
509233,2016-11-22 07:17:36,1,Professor receives Arab Researchers Award,0,AUSharjah,"[Professor, Arab Researchers Award]",[receive],[Arab],[NORP],41,5,3,0,0,0,0,1,0
509234,2016-11-22 07:19:17,1,Nigel Farage attacks response to Trump ambassa...,0,smilyflower,"[Nigel Farage, response, Trump ambassador tweet]",[attack],"[Nigel Farage, Trump]","[PERSON, ORG]",55,8,52,0,0,0,0,1,0


In [33]:
df.isnull().sum()

time_created          0
up_votes              0
title                 0
over_18               0
author                0
noun_phrases          0
verbs                 0
entities              0
entity_labels         0
post_length_chars     0
post_length_tokens    0
author_posts          0
Monday                0
Saturday              0
Sunday                0
Thursday              0
Tuesday               0
Wednesday             0
dtype: int64

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 509236 entries, 0 to 509235
Data columns (total 18 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   time_created        509236 non-null  datetime64[ns]
 1   up_votes            509236 non-null  int64         
 2   title               509236 non-null  object        
 3   over_18             509236 non-null  int64         
 4   author              509236 non-null  object        
 5   noun_phrases        509236 non-null  object        
 6   verbs               509236 non-null  object        
 7   entities            509236 non-null  object        
 8   entity_labels       509236 non-null  object        
 9   post_length_chars   509236 non-null  int64         
 10  post_length_tokens  509236 non-null  int64         
 11  author_posts        509236 non-null  int64         
 12  Monday              509236 non-null  uint8         
 13  Saturday            509236 no

In [35]:
df.columns

Index(['time_created', 'up_votes', 'title', 'over_18', 'author',
       'noun_phrases', 'verbs', 'entities', 'entity_labels',
       'post_length_chars', 'post_length_tokens', 'author_posts', 'Monday',
       'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday'],
      dtype='object')