# BlogMe Sentiment and Keyword Analysis

BlogMe, a famous blogging business, has a dataset of news articles that needs further analysis.
First, they would like keywords extracted from the article headlines. Second, they need the sentiment of the news articles determined. The data is in an Excel sheet, and they would like a dashboard outlining sentiment, top articles, and so on.

Data File: (articles.xlsx) – 4.7 MB file
https://finch-groundhog-9245.squarespace.com/s/articles.xlsx

Sources File: (BlogMe_sources.xlsx)
https://finch-groundhog-9245.squarespace.com/s/BlogMe_sources.xlsx

Logo: (BlogMe Logo.png)
https://finch-groundhog-9245.squarespace.com/s/BlogMe-Logo.png

Tableau Joins:
https://finch-groundhog-9245.squarespace.com/s/TableauJoins-xc7w.pdf

In [18]:
import pandas as pd
import numpy as np

# VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and 
# rule-based sentiment analysis tool that is specifically attuned to sentiments 
# expressed in social media, and works well on texts from other domains.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [19]:
data = pd.read_excel("data/articles.xlsx")

In [20]:
data.describe()

| | article_id | top_article | engagement_reaction_count | engagement_comment_count | engagement_share_count | engagement_comment_plugin_count |
|---|---|---|---|---|---|---|
| count | 10437.0 | 10435.0 | 10319.0 | 10319.0 | 10319.0 | 10319.0 |
| mean | 5218.0 | 0.122089 | 381.39529 | 124.032949 | 196.236263 | 0.011629 |
| std | 3013.046714 | 0.327404 | 4433.344792 | 965.351188 | 1020.680229 | 0.268276 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2609.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 50% | 5218.0 | 0.0 | 1.0 | 0.0 | 8.0 | 0.0 |
| 75% | 7827.0 | 0.0 | 43.0 | 12.0 | 47.5 | 0.0 |
| max | 10436.0 | 1.0 | 354132.0 | 48490.0 | 39422.0 | 15.0 |


In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10437 entries, 0 to 10436
Data columns (total 15 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   article_id                       10437 non-null  int64  
 1   source_id                        10437 non-null  object 
 2   source_name                      10437 non-null  object 
 3   author                           9417 non-null   object 
 4   title                            10435 non-null  object 
 5   description                      10413 non-null  object 
 6   url                              10436 non-null  object 
 7   url_to_image                     9781 non-null   object 
 8   published_at                     10436 non-null  object 
 9   content                          9145 non-null   object 
 10  top_article                      10435 non-null  float64
 11  engagement_reaction_count        10319 non-null  float64
 12  engagement_comment

Counting the number of articles per source and the total reactions per source

In [22]:
data.groupby('source_id')['article_id'].count()

source_id
1                             1
abc-news                   1139
al-jazeera-english          499
bbc-news                   1242
business-insider           1048
cbs-news                    952
cnn                        1132
espn                         82
newsweek                    539
reuters                    1252
the-irish-times            1232
the-new-york-times          986
the-wall-street-journal     333
Name: article_id, dtype: int64

In [23]:
data.groupby('source_id')['engagement_reaction_count'].sum()

source_id
1                                0.0
abc-news                    343779.0
al-jazeera-english          140410.0
bbc-news                    545396.0
business-insider            216545.0
cbs-news                    459741.0
cnn                        1218206.0
espn                             0.0
newsweek                     93167.0
reuters                      16963.0
the-irish-times              26838.0
the-new-york-times          790449.0
the-wall-street-journal      84124.0
Name: engagement_reaction_count, dtype: float64
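
The two groupby calls above can also be combined into one summary table. The following is a minimal sketch on a small made-up frame that borrows the `source_id`, `article_id` and `engagement_reaction_count` column names from the dataset. The `toy` frame, its values, and the output column names `article_count` and `total_reactions` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for `data` (values are made up for illustration).
toy = pd.DataFrame({
    "source_id": ["cnn", "cnn", "espn", "reuters"],
    "article_id": [0, 1, 2, 3],
    "engagement_reaction_count": [10.0, 5.0, None, 2.0],
})

# Named aggregation: one row per source with both metrics side by side.
summary = (
    toy.groupby("source_id")
       .agg(article_count=("article_id", "count"),
            total_reactions=("engagement_reaction_count", "sum"))
       .sort_values("total_reactions", ascending=False)
)
print(summary)
```

Sorting by total reactions puts the most engaging sources first, which may be a convenient shape for a dashboard data source.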

Dropping the engagement_comment_plugin_count column, which is zero for at least 75% of articles

In [24]:
data = data.drop('engagement_comment_plugin_count', axis=1)

Create a keyword flag

In [25]:
# na=False makes missing titles count as "no match" instead of being flagged as 1
data['keyword_flag_murder'] = np.where(data['title'].str.contains('murder', case=False, na=False), 1, 0)

In [26]:
data['keyword_flag_murder'].unique()

array([0, 1])
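
The same flagging idea can be wrapped in a small reusable function so that other keywords can be tried without repeating the expression. This is a sketch only: the function name `keyword_flag` and the `toy_titles` series are hypothetical and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd

def keyword_flag(titles: pd.Series, keyword: str) -> np.ndarray:
    """Return 1 where the title contains `keyword` (case-insensitive), else 0."""
    # regex=False treats the keyword as literal text;
    # na=False maps missing titles to "no match".
    mask = titles.str.contains(keyword, case=False, regex=False, na=False)
    return np.where(mask, 1, 0)

# Made-up titles, including a missing one.
toy_titles = pd.Series(["Murder trial begins", "Markets rally", None])
flags = keyword_flag(toy_titles, "murder")
print(flags)
```

In the notebook this could be called as `data['keyword_flag_murder'] = keyword_flag(data['title'], 'murder')`.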

## Extract Sentiment from Article Titles

In [27]:
data['title'] = data['title'].astype(str)

In [28]:
sent_int = SentimentIntensityAnalyzer()
pos_sents = []
neg_sents = []
neu_sents = []

# Score each title; fall back to zeros if a title cannot be scored.
for title in data['title']:
    try:
        sent = sent_int.polarity_scores(title)
        pos = sent['pos']
        neg = sent['neg']
        neu = sent['neu']
    except Exception:
        pos = 0
        neg = 0
        neu = 0

    pos_sents.append(pos)
    neg_sents.append(neg)
    neu_sents.append(neu)

data['title_positive_sent_score'] = pos_sents
data['title_negative_sent_score'] = neg_sents
data['title_neurtral_sent_score'] = neu_sents



In [29]:
data[['title', 'title_positive_sent_score', 'title_negative_sent_score', 'title_neurtral_sent_score']].head()

| | title | title_positive_sent_score | title_negative_sent_score | title_neurtral_sent_score |
|---|---|---|---|---|
| 0 | NTSB says Autopilot engaged in 2018 California... | 0.218 | 0.218 | 0.565 |
| 1 | Unemployment falls to post-crash low of 5.2% | 0.0 | 0.5 | 0.5 |
| 2 | Louise Kennedy AW2019: Long coats, sparkling t... | 0.18 | 0.0 | 0.82 |
| 3 | North Korean footballer Han joins Italian gian... | 0.0 | 0.0 | 1.0 |
| 4 | UK government lawyer says proroguing parliamen... | 0.0 | 0.146 | 0.854 |
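
Besides the three proportions stored above, VADER's `polarity_scores` also returns a `compound` value, a normalized overall score between -1 and 1, which may be easier to chart on a dashboard. The sketch below turns a compound score into a label using the 0.05 and -0.05 cut-offs suggested in the VADER documentation. The function name `sentiment_label` and the example values are hypothetical and not part of the original notebook.

```python
def sentiment_label(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to 'positive', 'negative' or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hand-picked compound values (not taken from the dataset).
for score in (0.6, -0.4, 0.0):
    print(score, sentiment_label(score))
```

With the analyzer from the notebook, this could be applied as `data['title'].apply(lambda t: sentiment_label(sent_int.polarity_scores(t)['compound']))` and stored in a new column, for example one named `title_sentiment_label` (an assumed name).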


Export cleaned data

In [31]:
data.to_excel("data/blogme_cleaned.xlsx", sheet_name='blogmedata', index=False)