## Upworthy Archive Sentiment Code (1/2)
This notebook covers the code used to process the Upworthy Archive exploratory data and fit simple OLS regressions of click-through rate (CTR) on the output of several sentiment classifiers, all in Python. We did not find a Python package suited to time-invariant, non-demeaned data (PanelOLS doesn't work with this), so we turn to R to estimate the fixed-effects models in the other notebook.

In [1]:
#load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [2]:
#load upworthy archive and instantiate CTR and headline variables
upworthy_data_dir = "upworthy-archive-exploratory-packages-03.12.2020.csv"
upworthy = pd.read_csv(upworthy_data_dir, index_col = 1)
CTR = np.array(upworthy['clicks'])/np.array(upworthy['impressions'])
upworthy['CTR'] = CTR
headline = upworthy.headline

In [3]:
#open the nltk downloader to fetch the stopwords and vader_lexicon data if you don't have them already (run once only)
#nltk.download()

In [4]:
upworthy

created_at,Unnamed: 0,updated_at,clickability_test_id,excerpt,headline,lede,slug,eyecatcher_id,impressions,clicks,significance,first_place,winner,share_text,square,test_week,CTR
2014-11-20 06:43:16.005,0,2016-04-02 16:33:38.062,546d88fb84ad38b2ce000024,Things that matter. Pass 'em on.,They're Being Called 'Walmart's Worst Nightmar...,"<p>When I saw *why* people are calling them ""W...",theyre-being-called-walmarts-worst-nightmare-a...,546d6fa19ad54eec8d00002d,3052,150,100.0,True,True,Anyone who's ever felt guilty about shopping a...,,201446,0.049148
2014-11-20 06:43:44.646,1,2016-04-02 16:25:54.021,546d88fb84ad38b2ce000024,Things that matter. Pass 'em on.,They're Being Called 'Walmart's Worst Nightmar...,"<p>When I saw *why* people are calling them ""W...",theyre-being-called-walmarts-worst-nightmare-a...,546d6fa19ad54eec8d00002d,3033,122,14.0,False,False,Walmart is getting schooled by another retaile...,,201446,0.040224
2014-11-20 06:44:59.804,2,2016-04-02 16:25:54.024,546d88fb84ad38b2ce000024,Things that matter. Pass 'em on.,They're Being Called 'Walmart's Worst Nightmar...,"<p>When I saw *why* people are calling them ""W...",theyre-being-called-walmarts-worst-nightmare-a...,546d6fa19ad54eec8d00002d,3092,110,1.8,False,False,Walmart may not be crapping their pants over t...,,201446,0.035576
2014-11-20 06:54:36.335,3,2016-04-02 16:25:54.027,546d902c26714c6c44000039,Things that matter. Pass 'em on.,This Is What Sexism Against Men Sounds Like,<p>DISCLOSURE: I'm a dude. I have cried on mul...,this-is-what-sexism-against-men-sounds-like-am...,546bc55335992b86c8000043,3526,90,4.1,False,False,"If you ever wondered, ""but what about the men?...",,201446,0.025525
2014-11-20 06:54:57.878,4,2016-04-02 16:31:45.671,546d902c26714c6c44000039,Things that matter. Pass 'em on.,This Is What Sexism Against Men Sounds Like,<p>DISCLOSURE: I'm a dude. I have cried on mul...,this-is-what-sexism-against-men-sounds-like-am...,546d900426714cd2dd00002e,3506,120,100.0,True,False,"If you ever wondered, ""but what about the men?...",,201446,0.034227
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2014-11-20 01:21:49.197,150749,2016-04-02 16:25:53.916,546d373426714cde76000018,Things that matter. Pass 'em on.,5 Reasons You May Need To Plan A Vacation - Ri...,<p>Travel isn't just a luxury or indulgence an...,5-reasons-you-may-need-to-plan-a-vacation-righ...,546d398a9ad54eec8d00000f,3724,18,0.0,True,False,,,201446,0.004834
2014-11-20 01:23:31.203,150755,2016-04-02 16:31:40.651,546d373426714cde76000018,Things that matter. Pass 'em on.,The Next Time You Encounter A Small Minded Big...,<p>Travel isn't just a luxury or indulgence an...,the-next-time-you-encounter-a-small-minded-big...,546d398a9ad54eec8d00000f,3728,23,0.0,False,False,,,201446,0.006170
2014-11-20 01:24:23.415,150756,2016-04-02 16:31:50.079,546d373426714cde76000018,Things that matter. Pass 'em on.,I've Never Wanted To Buy A Plane Ticket More T...,<p>Travel isn't just a luxury or indulgence an...,ive-never-wanted-to-buy-a-plane-ticket-more-th...,546d398a9ad54eec8d00000f,3581,21,0.0,False,False,,,201446,0.005864
2015-01-14 17:11:40.585,150813,2016-04-02 16:24:27.29,54b6a21662646300182c0000,It all makes sense now.,3 Ladies Having Too Much Fun At The Epicenter ...,<p>The Frackettes want to remind you of one im...,3-ladies-having-too-much-fun-at-the-epicenter-...,54b6a2df3931650012620000,3425,37,100.0,True,False,,,201502,0.010803


### Posemo/Negemo Dictionary Model
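This was our first attempt at measuring sentiment: each headline is scored by counting matches against a positive-emotion (posemo) and a negative-emotion (negemo) word list. We will not be doing further analysis of this model at the fixed-effects stage.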
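The counting idea can be sketched on toy data. The stems and headlines below are made up for illustration and are not taken from `posemo_dict.csv` or `negemo_dict.csv`. The sketch also lowercases each word and uses `str.startswith` for the prefix test, which is a simplification assumed here rather than the notebook's `re.search` approach.

```python
# Hypothetical stand-ins for the cleaned dictionaries; the real word lists
# come from posemo_dict.csv and negemo_dict.csv.
toy_posemo = ["happ", "love", "win"]
toy_negemo = ["hate", "sad", "worst"]

def count_prefix_matches(words, stems):
    """Count (word, stem) pairs where the word starts with the stem."""
    return sum(1 for word in words for stem in stems if word.startswith(stem))

def classify(headline):
    """Label a headline by comparing positive and negative match counts."""
    # Lowercasing is an assumption of this sketch.
    words = [w.strip(".,!?'\"").lower() for w in headline.split()]
    pos = count_prefix_matches(words, toy_posemo)
    neg = count_prefix_matches(words, toy_negemo)
    if pos > neg:
        return "Positive"
    if neg > pos:
        return "Negative"
    return "Baseline_Neutral"

labels = [classify(h) for h in [
    "Happiness Wins Again",
    "The Worst Day Ever",
    "A Chart About Rainfall",
]]
print(labels)
```

A headline with equal positive and negative counts, including zero of each, falls into the neutral baseline.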

In [5]:
#load posemo dictionary

posemo_data_dir = "posemo_dict.csv"
posemo = pd.read_csv(posemo_data_dir, index_col = 0).word
#posemo.head()
posemo_punct = [word.strip(string.punctuation) for word in posemo][2:]
posemo_clean = [word.replace(r")","").replace(r"(","") for word in posemo_punct]

In [6]:
#load negemo dictionary
negemo_data_dir = "negemo_dict.csv"
negemo = pd.read_csv(negemo_data_dir, index_col = 0).word
#negemo.head()
negemo_punct = [word.strip(string.punctuation) for word in negemo][2:]
negemo_clean = [word.replace(r")","").replace(r"(","") for word in negemo_punct]

In [7]:
#load stopwords dictionary
stop = stopwords.words('english')
stop_list = set(stop)
exception_stop = set(['s',"t","no","d","ll","m","o","re","ve","y","won","ma","not"])
add_stop = set(["what's"])
stop = list((stop_list.union(add_stop))-exception_stop)

In [8]:
#split headlines into words and strip surrounding punctuation from each word
#(iterating over the Series avoids positional lookups on the created_at index)
headline_wordlist = [[word.strip(string.punctuation) for word in text.split()]
                     for text in headline]

In [9]:
#remove extraneous grammar words (case-insensitive) and any empty strings
stop_set = set(stop)
headline_wordlist = [[word for word in words if word and word.lower() not in stop_set]
                     for words in headline_wordlist]

In [10]:
#testing positive matching algorithm on headlines
posemo_counts = []
for headline_blurb in range(len(headline_wordlist)):
    for headline_word in range(len(headline_wordlist[headline_blurb])):
        for posemo_word in posemo_clean:
            if re.search(posemo_word, headline_wordlist[headline_blurb][headline_word][0:len(posemo_word)]) != None:
#                print(str(headline_blurb) + ", " + str(posemo_word) + ", " + str(headline_wordlist[headline_blurb][headline_word]))
                posemo_counts.append(headline_blurb)

In [11]:
#add posemo counter to dataframe
headline_posemo_counter = [0]*len(headline)

for headline_num in posemo_counts:
    headline_posemo_counter[headline_num] += 1
    
upworthy["posemo_counts"] = headline_posemo_counter

In [12]:
#testing negative matching algorithm on headlines
negemo_counts = []
for headline_blurb in range(len(headline_wordlist)):
    for headline_word in range(len(headline_wordlist[headline_blurb])):
        for negemo_word in negemo_clean:
            if re.search(negemo_word, headline_wordlist[headline_blurb][headline_word][0:len(negemo_word)]) != None:
#                print(str(headline_blurb) + ", " + str(negemo_word) + ", " + str(headline_wordlist[headline_blurb][headline_word]))
                negemo_counts.append(headline_blurb)

In [13]:
#add negemo counter to dataframe
headline_negemo_counter = [0]*len(headline)

for headline_num in negemo_counts:
    headline_negemo_counter[headline_num] += 1
    
upworthy["negemo_counts"] = headline_negemo_counter

In [14]:
#classify each headline as positive, negative, or neutral and add the label to the dataframe
polarity = np.array(headline_posemo_counter) - np.array(headline_negemo_counter)
headline_polarity = [0]*len(headline)

for i in range(len(polarity)):
    if polarity[i] > 0:
        headline_polarity[i] = "Positive"
    elif polarity[i] < 0:
        headline_polarity[i] = "Negative"
    else:
        headline_polarity[i] = "Baseline_Neutral"

upworthy["polarity"] = headline_polarity

In [15]:
# Run OLS regression with polarity as indicator variable. Source: https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLS.html

df_dict = upworthy[['CTR','polarity']]

polarity_dict = smf.ols(formula = 'CTR ~ C(polarity)', data = df_dict).fit()
print(polarity_dict.summary())

                            OLS Regression Results                            
Dep. Variable:                    CTR   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     96.49
Date:                Wed, 20 Jan 2021   Prob (F-statistic):           1.88e-42
Time:                        15:14:54   Log-Likelihood:                 67644.
No. Observations:               22666   AIC:                        -1.353e+05
Df Residuals:                   22663   BIC:                        -1.353e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                 
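
A note on reading a summary like this one: with a single categorical regressor and the default treatment (dummy) coding, the intercept is the mean CTR of the reference category and each remaining coefficient is that category's mean minus the reference mean. The reference is the first level in sorted order, which is why the neutral label is spelled `Baseline_Neutral`. A minimal sketch with made-up CTR values (not archive data):

```python
from statistics import mean

# Made-up CTR values grouped by polarity label (not from the archive).
toy = {
    "Baseline_Neutral": [0.010, 0.020, 0.030],
    "Negative": [0.020, 0.040],
    "Positive": [0.005, 0.015],
}

# With treatment coding, the first level in sorted order is the reference
# category, so "Baseline_Neutral" plays the role of the intercept.
baseline = sorted(toy)[0]
intercept = mean(toy[baseline])

# Each remaining coefficient is a difference in group means.
effects = {label: mean(values) - intercept
           for label, values in toy.items() if label != baseline}

print(baseline, round(intercept, 4))
print({label: round(effect, 4) for label, effect in effects.items()})
```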

### VADER Models
This section uses VADER, a rule-based sentiment analyzer, to test whether headline sentiment is significantly associated with CTR. Two classifiers were built: one from the raw positive, negative, and neutral scores, and one from the analyzer's compound score.
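
The two labelling rules used in this section can disagree on the same headline. The sketch below applies both rules, with the same thresholds as the cells that follow, to hand-written score dicts. The dicts mimic the shape of `polarity_scores` output (`neg`, `neu`, `pos`, `compound`), but their values are hypothetical and the function names are introduced only for this illustration.

```python
def label_from_proportions(score, neu_cut=0.85, cut=0.15):
    """Label from the pos/neg/neu proportions."""
    if score["neu"] >= neu_cut:
        return "Baseline_Neutral"
    if score["pos"] >= cut and score["pos"] > score["neg"]:
        return "Positive"
    if score["neg"] > cut and score["neg"] > score["pos"]:
        return "Negative"
    return "Baseline_Neutral"

def label_from_compound(score, cut=0.5):
    """Label from the compound score alone."""
    if score["compound"] > cut:
        return "Positive"
    if score["compound"] < -cut:
        return "Negative"
    return "Baseline_Neutral"

# Hypothetical score dicts, not actual VADER output.
examples = [
    {"neg": 0.0, "neu": 0.60, "pos": 0.40, "compound": 0.70},
    {"neg": 0.0, "neu": 0.75, "pos": 0.25, "compound": 0.30},
    {"neg": 0.30, "neu": 0.70, "pos": 0.0, "compound": -0.60},
]
pairs = [(label_from_proportions(s), label_from_compound(s)) for s in examples]
for pair in pairs:
    print(pair)
```

In the second example the proportion rule says positive while the compound rule stays neutral, so the two regressions are not fitted on identical groupings.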

In [16]:
#load fresh DataFrame
upworthy = pd.read_csv(upworthy_data_dir, index_col = 1)
CTR = np.array(upworthy['clicks'])/np.array(upworthy['impressions'])
upworthy['CTR'] = CTR
headline = upworthy.headline

In [17]:
#load vader sentiment analyzer
vader = SentimentIntensityAnalyzer()

In [18]:
#get pos, neg, neu, and compound scores from vader (one call per headline)
scores = [vader.polarity_scores(text) for text in headline]

upworthy['Neutral'] = [score['neu'] for score in scores]
upworthy['Positive'] = [score['pos'] for score in scores]
upworthy['Negative'] = [score['neg'] for score in scores]
upworthy['compound'] = [score['compound'] for score in scores]

In [19]:
#classifier with defined thresholds, using pos neg neu scores
vader = SentimentIntensityAnalyzer()
def vader_polarity_pnn(text):
    score = vader.polarity_scores(text)
    if score['neu'] >=0.85:
        return 'Baseline_Neutral'
    elif score['pos'] >= 0.15 and score['pos'] > score['neg']:
        return 'Positive'
    elif score['neg'] > 0.15 and score['neg'] > score['pos']:
        return 'Negative'
    else:
        return 'Baseline_Neutral'

#There are neutral scores = 1.0, so need to keep that in mind.
#Since median 0.791, choose threshold above median. Examples: 0.8, 0.85, 0.9, 0.95, 0.99
polarity = [vader_polarity_pnn(text) for text in headline]

#add polarity to upworthy dataframe
upworthy['polarity_vader'] = polarity

In [20]:
#OLS regression pos-neg-neu
df_polarity = upworthy[['CTR','polarity_vader']]
polarity_model = smf.ols(formula = 'CTR ~ C(polarity_vader)', data = df_polarity).fit()
print(polarity_model.summary())

                            OLS Regression Results                            
Dep. Variable:                    CTR   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     22.28
Date:                Wed, 20 Jan 2021   Prob (F-statistic):           2.16e-10
Time:                        15:15:19   Log-Likelihood:                 67570.
No. Observations:               22666   AIC:                        -1.351e+05
Df Residuals:                   22663   BIC:                        -1.351e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept     

In [21]:
upworthy.to_csv("upworthy_vader_pnn_classifier.csv")

In [22]:
#classifier with defined thresholds, using compound
def vader_polarity_compound(text):
    score = vader.polarity_scores(text)
    if score['compound'] <= 0.5 and score['compound'] >= -0.5:
        return 'Baseline_Neutral'
    elif score['compound'] > 0.5:
        return 'Positive'
    elif score['compound'] < -0.5:
        return 'Negative'

#compound ranges from -1 to 1; scores within [-0.5, 0.5] are treated as neutral
polarity = [vader_polarity_compound(text) for text in headline]

#add polarity to upworthy dataframe
upworthy['polarity_vader'] = polarity

In [23]:
#OLS regression compound
df_polarity = upworthy[['CTR','polarity_vader']]
polarity_model = smf.ols(formula = 'CTR ~ C(polarity_vader)', data = df_polarity).fit()
print(polarity_model.summary())

                            OLS Regression Results                            
Dep. Variable:                    CTR   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     26.39
Date:                Wed, 20 Jan 2021   Prob (F-statistic):           3.55e-12
Time:                        15:15:27   Log-Likelihood:                 67575.
No. Observations:               22666   AIC:                        -1.351e+05
Df Residuals:                   22663   BIC:                        -1.351e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept     

In [24]:
upworthy.to_csv("upworthy_vader_compound_classifier.csv")

### TextBlob Model
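TextBlob is another sentiment analyzer. In addition to a polarity score (from -1 to 1), it returns a subjectivity score (from 0 to 1), and both enter the regression as continuous variables.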
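
Unlike the categorical labels used earlier, the regression fit later in this section has two continuous regressors. As a rough illustration of the least-squares estimate reported for such a model, the sketch below solves the normal equations in plain Python on made-up, noise-free data. The helper names `solve` and `ols` and all numbers are assumptions for this example. It is not how statsmodels computes its fit, and it produces none of the standard errors or test statistics.

```python
def solve(a, b):
    """Solve a small linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least-squares coefficients for y ~ 1 + the columns of rows."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Made-up (polarity, subjectivity) pairs and a noise-free CTR built from
# known coefficients, so the fit should recover them up to rounding error.
scores = [(-0.5, 0.9), (0.0, 0.0), (0.3, 0.4), (0.8, 1.0), (0.1, 0.6)]
ctr = [0.016 + 0.002 * p - 0.001 * s for p, s in scores]

intercept, b_polarity, b_subjectivity = ols(scores, ctr)
print(round(intercept, 6), round(b_polarity, 6), round(b_subjectivity, 6))
```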

In [25]:
from textblob import TextBlob

#grab a fresh copy of upworthy
upworthy = pd.read_csv(upworthy_data_dir, index_col = 1)
CTR = np.array(upworthy['clicks'])/np.array(upworthy['impressions'])
upworthy['CTR'] = CTR
headline = upworthy.headline

#wrap each headline in a TextBlob object
headline_blob = [TextBlob(text) for text in headline]

In [26]:
#extract polarity and subjectivity from each TextBlob sentiment
blob_polarity = [blob.sentiment.polarity for blob in headline_blob]
blob_subjectivity = [blob.sentiment.subjectivity for blob in headline_blob]
    
upworthy['polarity'] = blob_polarity
upworthy['subjectivity'] = blob_subjectivity

In [27]:
#OLS regression
df_sentiment = upworthy[['CTR','polarity','subjectivity']]
sentiment_model = smf.ols(formula = 'CTR ~ polarity + subjectivity', data = df_sentiment).fit()
print(sentiment_model.summary())

                            OLS Regression Results                            
Dep. Variable:                    CTR   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     8.076
Date:                Wed, 20 Jan 2021   Prob (F-statistic):           0.000312
Time:                        15:15:40   Log-Likelihood:                 67556.
No. Observations:               22666   AIC:                        -1.351e+05
Df Residuals:                   22663   BIC:                        -1.351e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        0.0159      0.000    117.543   

In [28]:
upworthy.to_csv("upworthy_textblob_classifier.csv")