# Preface

<p>Viral and attention-grabbing headlines are certainly intriguing. I have definitely fallen for clickbait countless times! I suspect there is a formula for generating eye-catching, click-worthy headlines. Wouldn't it be great if we could use algorithms to tease apart some of the hidden structure within viral headlines? Well, that is what this exploration is all about. :)</p>
<p>The goal of this notebook is to demonstrate how to build a simple Markov chain that generates viral headlines, using real examples to build its vocabulary.</p>

<p>Firstly, a big thank you to the wonderful folks over at <a href="http://www.ripenn.com/">Ripenn</a> for providing a neat corpus. The data is available for download at the end of this post: "<a href="http://www.ripenn.com/blog/7-things-marketers-can-learn-from-2616-viral-headlines/">7 Things Marketers Can Learn From 2,616 Viral Headlines</a>".
</p>
<p>
I discovered this dataset through <a href='https://blog.bufferapp.com/' >Buffersocial</a>'s post "<a href="https://blog.bufferapp.com/the-most-popular-words-in-most-viral-headlines">How to Write The Perfect Headline: The Top Words Used in Viral Headlines</a>".</p>


My inspiration for building a Markov chain originated from these two posts:
    <li>"<a href="http://agiliq.com/blog/2009/06/generating-pseudo-random-text-with-markov-chains-u/">Generating pseudo random text with Markov chains using Python</a>"</li>
    <li>"<a href="http://www.onthelambda.com/2014/02/20/how-to-fake-a-sophisticated-knowledge-of-wine-with-markov-chains/">How to fake a sophisticated knowledge of wine with Markov Chains</a>"</li>
<p>I learned a great deal from these demonstrations; they're definitely worth a visit. In this notebook, I combined some of their techniques with my own experience working with text data. Thanks to Shabda Raaj and Tony Fischetti.</p>
    

# Let's Begin

In [54]:
import pandas as pd
import nltk
import ftfy
import random

<b>Loading headlines into a pandas DataFrame</b>

In [201]:
sheetnames = ['Buzzfeed', 'ViralNova', 'Upworthy', 'Wimp', 'Feedly']

In [202]:
df = pd.DataFrame()
for sheetname in sheetnames:
    print(sheetname)
    # note: newer pandas releases (0.21+) spell this keyword `sheet_name`
    dfs = pd.read_excel( r'data/Viral-Title-Analysis-ripenn.xlsx', sheetname=sheetname )
    dfs['sheetname'] = sheetname
    df = pd.concat( [df, dfs], ignore_index=True )

Buzzfeed
ViralNova
Upworthy
Wimp
Feedly


In [203]:
df.columns

Index([               u'+1s',         u'CHAR COUNT',          u'Delicious',
                    u'Diggs',        u'FB Comments',           u'FB Likes',
                u'FB Shares',           u'FB Total',       u'FIRST PERSON',
                     u'Link',    u'LinkedIn Shares',           u'NEGATIVE',
                   u'NUMBER',               u'Pins',           u'QUESTION',
                   u'Reddit', u'SEXUAL ORIENTATION',               u'SITE',
              u'StumbleUpon',              u'TITLE',             u'Tweets',
                      u'URL',                u'WHY',          u'sheetname'],
      dtype='object')

We need to reshape the dataframe a little. Examining the raw data in Excel shows that:
<li>for sheetnames 'Buzzfeed', 'ViralNova', and 'Upworthy', the headlines are under 'TITLE'</li>
<li>for sheetnames 'Wimp' and 'Feedly', the column labeled 'Link' actually contains the headlines.</li>

In [206]:
df = df[['TITLE', 'Link', 'sheetname']]

In [226]:
# use 'TITLE' where it is present, otherwise fall back to 'Link'
df['headline'] = df['TITLE'].fillna(df['Link'])

<b>Preprocessing the text and collecting the headlines into a list</b>

In [258]:
# cleaning heuristic: arriving at these replacements took some back and forth
# to get the headlines into a cleaner format
replacements = { u'\xa0': u' ',     # non-breaking space
                 u'\u2026': u'...', # horizontal ellipsis
                 u'\u201c': u'"',   # left double quotation
                 u'\u201d': u'"'    # right double quotation
               }

def clean( s ):
    s = ftfy.fix_text(s)
    for old, new in replacements.items():
        s = s.replace( old, new )
    return s
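
To see what the hand-written replacements do on their own, here is a minimal sketch that skips the `ftfy` step entirely. The helper name `replace_only` and the sample string are made up for illustration; only the replacement table comes from the cell above.

```python
# Hypothetical helper: applies only the hand-written replacements (no ftfy).
REPLACEMENTS = {
    u'\xa0': u' ',      # non-breaking space
    u'\u2026': u'...',  # horizontal ellipsis
    u'\u201c': u'"',    # left double quotation
    u'\u201d': u'"',    # right double quotation
}

def replace_only(s):
    """Swap each special character for its plain ASCII stand-in."""
    for old, new in REPLACEMENTS.items():
        s = s.replace(old, new)
    return s

# made-up sample containing all four special characters
sample = u'\u201cWow\u201d\xa0\u2026 Just Wow'
print(replace_only(sample))  # "Wow" ... Just Wow
```

Text that is already plain ASCII passes through unchanged, so the helper is safe to run over every headline.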

In [230]:
headlines = df['headline'].apply(clean).tolist()

In [231]:
random.sample( headlines, 15 )

[u'Birds Do This All The Time. But Seeing It Actually Happen Is Pretty Awesome.',
 u'This 12 Year Old Girl Just Died. The Letter Her Parents Discovered Afterwards Is Heart Shattering.',
 u'Childhood Amnesia: The Age at Which Our Earliest Memories Fade',
 u'Two projectors create a real-life skinning effect on a simple, white living room.',
 u'10 Ways For 2013 Not To Suck',
 u'Facebook Call-to-Action Buttons: Everything You Need to Know [Video]',
 u'Godzilla roar is actually a leather glove being dragged down the strings of a bass.',
 u"18 Things You Need To Know About California's Worst Drought In Centuries",
 u"World's deadliest hamburger.",
 u'Methinks The Anti-Gay Politician Doth Protest Too Much',
 u"The 'Tip' They Left This Waitress Is Disgusting. And She Even Fought For Her Country.",
 u'25 Sneaky Online Tools and Gadgets to Help You Spy on Your Competitors',
 u"I'm So Glad There Was A Camera On This Baby Elephant At The Perfect Time. Because This Is The Best.",
 u'This Guy Took M

<b>Creating a class for the Markov chain</b>

In [232]:
class Markov(object):
    def __init__(self, lst):
        self.d = {}
        self.sentences = self.tokenize_sentences( lst )
        self.create_dict()
        
    def tokenize_sentences(self, lst):
        return [nltk.word_tokenize(s) for s in lst]
        
    def create_dict(self):
        for s in self.sentences:
            for w1, w2, w3 in self.trigram(s):
                k = (w1, w2)
                if k not in self.d:
                    self.d[k] = []
                self.d[k].append(w3)
    
    def trigram(self, tokens):
        if len(tokens) < 3:
            return
        for i in xrange(len(tokens) - 2):
            yield tokens[i], tokens[i+1], tokens[i+2]
    
    def generate( self, size=15 ):
        # pick a random sentence with at least two tokens
        # (a one-token headline would make the randint call below fail)
        sentence = random.choice( [s for s in self.sentences if len(s) > 1] )
        
        # pick two random sequential words from the sentence
        i = random.randint(0, len(sentence) - 2)
        w1, w2 = sentence[ i ], sentence[ i+1 ]

        return self.generate_from_words( w1, w2, size=size )
        
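    def generate_from_words( self, w1, w2, size=15 ):
        # start with both seed words so that no word in the chain is skipped
        outcome = [w1, w2]
        # extend the chain until it holds `size` words or reaches a dead end
        while len(outcome) < size:
            k = (w1, w2)
            if k not in self.d:
                break
            w1, w2 = w2, random.choice( self.d[k] )
            outcome.append(w2)
        return ' '.join(outcome)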
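
To make the dictionary-building step concrete, here is a small standalone sketch on a three-headline toy corpus. It is not the class above: the function name `build_table` is hypothetical, the headlines are invented, and I split on whitespace instead of calling `nltk.word_tokenize` so the example has no dependencies.

```python
from collections import defaultdict

def build_table(sentences):
    """Map each pair of consecutive tokens to the list of tokens seen right after it."""
    table = defaultdict(list)
    for tokens in sentences:
        # zip over three offset views of the token list to get every trigram
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            table[(w1, w2)].append(w3)
    return dict(table)

# invented toy corpus, tokenized by whitespace
toy = [
    'This Is Why You Cry'.split(),
    'This Is Why We Laugh'.split(),
    'This Is Awesome'.split(),
]
table = build_table(toy)

print(table[('This', 'Is')])     # ['Why', 'Why', 'Awesome']
print(table[('Is', 'Why')])      # ['You', 'We']
print(('You', 'Cry') in table)   # False: nothing follows this pair, so a chain ends here
```

A pair that only ever appears at the end of a headline has no entry, which is exactly the dead end that makes generation stop early.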

In [233]:
m = Markov( headlines )

Examining the internal dictionary that is generated from the headlines:

In [241]:
len(m.d)

21059

In [240]:
random.sample( m.d.items(), 15 )

[((u'Toowoomba', u','), [u'Australia']),
 ((u'Beginners', u'Guide'), [u'to']),
 ((u'Baby', u'elephant'), [u'tries']),
 ((u'Set', u'on'), [u'Fire']),
 ((u'Go', u'Now'), [u'.']),
 ((u'Notebook', u"''"), [u'And']),
 ((u'in', u'the'),
  [u'Act',
   u'Eyes',
   u'Most',
   u'Face',
   u'Emergency',
   u'Same',
   u'mid',
   u'US',
   u'Amazon',
   u'Copenhagen',
   u'white',
   u'wall',
   u'sun',
   u'Sky',
   u'universe',
   u'Face',
   u'Mundane',
   u'Room',
   u'Right',
   u'Margin',
   u'Facebook']),
 ((u'Your', u'Head'), [u'In', u'.']),
 ((u'Insane', u'Fast'), [u'Food']),
 ((u'Dog', u'Breeds'), [u'You']),
 ((u'Child', u'Noticed'), [u'In']),
 ((u'The', u'Moon'), [u'(']),
 ((u'On', u'Humanity'), [u',']),
 ((u'Living', u'You'), [u'Learned']),
 ((u'Next', u'Obama'), [u'Could'])]
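
Notice that 'Face' shows up twice in the list for ('in', 'the'). Because `random.choice` picks uniformly from the list, a word that is listed twice is twice as likely to be chosen, so the duplicates act as weights. The sketch below makes those implied probabilities explicit; the function name `follower_probabilities` and the short example list are my own, not part of the class.

```python
from collections import Counter

def follower_probabilities(followers):
    """Turn a list of observed next words into the probabilities random.choice implies."""
    counts = Counter(followers)
    total = float(len(followers))
    return {word: n / total for word, n in counts.items()}

# shortened, made-up follower list with one repeated word
followers = ['Face', 'Eyes', 'Face', 'Room']
probs = follower_probabilities(followers)

print(probs['Face'])  # 0.5
print(probs['Eyes'])  # 0.25
```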

<b>Generating random headlines</b>

In [252]:
for _ in xrange(15):
    size = random.randint(10, 20)
    print(m.generate(size=size))

Your Bestie You Love Says About You And How It Can Your
To Follow Instructions ... And I Still Ca n't Steal Love . The Folks Dismantling Wear
A Woman Gets Sick Of Divorce And Mortgages . And Most Accurate History Of The 19th Century
Want You To Be Caged For 4 Decades ?
Steffi Graf receives a pleasant surprise during a college football game .
The Office
' Explained In Under 60 Seconds . These 'Before And '
subway project in NYC .
Do n't Allow Animals Inside . It 's A Good Cuddle
The 29 Most New Zealand Moments Ever
Role ? Replacing Andy Samberg In British Sitcom `` Cuckoo ''
Christmas Creation Of All The Places His Little Boy Stared a Terrorist in the mid late
Your Presidential Candidate
Ranking Of Disney Love Songs
Was Arrested 20 Times For This . Wow . Watch These Rhinos Fly ! Much Endangered .
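
The generated text keeps the tokenizer's spacing ("Ca n't", "It 's", " ."), because the tokens are simply joined with spaces. A rough cleanup pass could look like the sketch below. The function `detokenize` and its two regex rules are heuristics of my own, not an exact inverse of `nltk.word_tokenize`, and they do not handle quote marks such as `` and ''.

```python
import re

def detokenize(text):
    """Rough cleanup for space-joined tokens; heuristic rules only."""
    # re-attach contractions and possessives: "Ca n't" -> "Can't", "It 's" -> "It's"
    text = re.sub(r" (n't|'s|'re|'ve|'ll|'m|'d)\b", r"\1", text)
    # drop the space in front of closing punctuation
    text = re.sub(r" ([.,?!:;])", r"\1", text)
    return text

print(detokenize("Do n't Allow Animals Inside . It 's A Good Cuddle"))
# Don't Allow Animals Inside. It's A Good Cuddle
```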


<b>Priming with two starter words of my choice</b>

In [254]:
for _ in xrange(10):
    size = random.randint(10, 20)
    print( m.generate_from_words( 'This', 'Is', size=size ) )

This Is Awesome . I 've Ever Cried This Hard Especially
This Is ANGELIC . And It 's Not The Unbelievable Loophole In U.S. Child Labor Law
This Is Hilarious . LOLOLOL .
This Is Why You 're Damned Right The Government Wants You To See This . Neither Did My Heart This
This Is So Perfect That It Made Me Cry A Bucket Of Tears . All
This Is Okay ? Unbelievable . Seriously , Stop What You Expect . OMG . Google Carlson
This Is Your Ideal Relationship ?
This Is Awesome . I 'm Glad I Saw This . Especially If You Hate It A Dog Still
This Is Crazy . These Are GREAT .
This Is The Funniest Thing You 'll Probably Agree .


In [253]:
for _ in xrange(10):
    size = random.randint(10, 20)
    print( m.generate_from_words( 'Why', 'Is', size=size ) )

Why Is Even More Unbelievable . Seriously .
Why Is Unforgettable . The Letter Her Parents Discovered Afterwards Is Shattering
Why Is Even A Thing . So His Mom . The It
Why Is Even More Absurd .
Why Is Something That Is Barely Being Talked About
Why Is Even More Disturbing Than It Looks Like A Typical With
Why Is Absurd .
Why Is YOUR Member Of Congress — They 're Brilliant . OMG Google
Why Is Google Sleeping With That Jerk ?
Why Is This Arrested Woman So Happy ?


# Results

<b><i>The Good</i></b>

<li>Sharing His Text To Kendrick Lamar</li>
<li>7 Alarming Facts Essentially Say : Women Are Biased Toward Thinking They Are Eerie .</li>
<li>, This Will Make You Cry . But THIS is Insanity .</li>
<li>What Happened Next Will Shock You . That 's Actually A Completely Brilliant Idea !</li>
<li>This May Be Wrong . Reality Check .</li>
<li>For Aliens That Want To Check This Out . Freaky .</li>
<li>Any Of It Is Breathtaking .</li>
<li>Black Girls Code . Simple Name , Revolutionary Premise .</li>
<li>This Is Awesomely Weird . Then I Saw How He Did . And I LOVE .</li>
<li>When You Can Only Buy At A Party ... LOL .</li>
<li>12 of the Strangest Weather-Related Photographs Ever Taken . Wow . Watch These Rhinos Fly !</li>
<li>Be Prepared To Change What You Were Expecting</li>
<li>Romney 's Own Mother Undermines His Entire Campaign</li>
<li>Which Vladimir Putin Tattoos Are Works Of Refrigerator Door Art</li>
<li>This Is Every Bit As Awesome As It Looks Like At 4,000 Frames Per Second</li>

<b><i>The Bad</i></b>

<li>Stuffed Animal . It 's Hard Not To Run For President</li>
<li>Your First Apartment</li>
<li>Well In Museums</li>
<li>Perfect Roommate</li>
<li>Ghosts , You Will Love It . You Probably Forgot About</li>
<li>Wo n't Have Gotten Married At All Times . When He Returned 30 Minutes Later Something</li>
<li> Endangered</li>
<li># ing Message</li>
<li>Closer ... WHAT ? !</li>
<li>Future Is NOT After</li>
<li>Fracking ?</li>

<b><i>The Ugly</i></b> (These made me laugh... some of these were imperfect or incomplete but totally left me hanging!)

<li>'ll Ever See . And The Internet , Democracy Is NOT Doomed After Falling Through ...</li>
<li>Matt Damon 's Incredible Pro-Toilet , Anti-Reporter Press Conference</li>
<li>... RACCOON ! What Happened To This 1.5 Pound Baby Is Beyond Words . This Beautiful</li>
<li>New Video Shows President Obama Meeting His Half-Brother For The Last Words Were Kind Of Religion</li>
<li>Reason Behind Them . Kinda Like This . WHOA !</li>
<li>Nearly As Bad As These People . And He Had Police</li>
<li>Thought This Mom And Dad Did ... Holy Cow .</li>
<li>20 Reasons Why Thor Is The Most Epic Secret Santa Exchange I 've Seen Acts of Kindness ,</li>
<li>Like Your Mother . Here are 16 Epic Creations From Just a Single Piece of Paper . This Why</li>
<li>Be Shocked And Disgusted By Her Ex Led To Something Unexpected ... Yet Beautiful . This Is Cavin I</li>
<li>Why You Should Probably See This . Wow . Watch These Rhinos Fly ! Much Endangered .</li>
<li>More Beautiful Than This Abandoned Opera House You 'll Agree .</li>
<li>Do n't Need A Man Decided To Build Something . Nothing Could Prepare Me The</li>
<li>Little Girl Stopped An Olive Garden Manager In His Neighborhood What</li>
<li>Something We Should All Be Dead . That 's Dying . Yet Somehow The Last One . I Call A</li>
<li>Want To Do It . Evangelicals Do It . Awww x25 .</li>

<b>Here are some fun headlines that were generated with chosen starter words</b>

<b>"Why Is"</b>
<li>Why Is YOUR Member Of Congress Voting To Keep You In The World ?</li>
<li>Why Is Google Sleeping With That Jerk ?</li>
<li>Why Is Unforgettable . The Next Day ... Kinda Like This . It 's an Nightmare</li>
<li>Why Is This Arrested Woman So Happy ?</li>
<li>Why Is This The Most Relatable Macklemore Vine Ever</li>
<li>Why Is Even More Unbelievable . Seriously , OMG . Google 'Tucker Carlson ' And 'Gay Marriage . '</li>

<b>"This Is"</b>
<li>This Is Awesome . ( PS : The Untold Story Behind Every Casualty Of War</li>
<li>This Is One Of Them Is Facing This NIGHTMARE Right Now .</li>
<li>This Is What Creationists Believe About Dinosaurs</li>
<li>This Is So Perfect That It Made Me Sick . But THIS is Insanity .</li>
<li>This Is How They Came Back Together Is Beautiful . This Is The Most Astounding .</li>

# Conclusion

<p>This was a fun little side project to explore the implementation of a simple Markov chain. It is quite entertaining to see the results!</p>
<p>In this notebook, we ingested data from an Excel spreadsheet, used pandas for data cleaning, and created a Markov chain class that iterates over sentences to build a vocabulary.</p>
<p>Using the vocabulary from prior viral headline examples, we created our very own viral headlines. We leveraged structure in the text by simply chaining each bigram to the word that follows it. Each bigram is linked to a list of possible next words (built from real examples), and we used Python's random module to pick a candidate from that list.</p>
<p>This was done iteratively to chain together likely words and formulate cool, intriguing headlines, some of which I probably would have clicked on.</p>
<p>The End... </p>

<li>The End ... Amazing . I NEVER Expected What 's Below It Left Me Speechless With Goosebumps . Is </li>
<li>The End Made It Unforgettable . The Note They Attached To Him Is Heartbreaking Yet Somehow The Last Thing 'll</li>
<li>The End Of The Weirdest 29-Second Traffic Stop Ever</li>
<li>The End ... Amazing . I Still Ca n't Be Silent Any Longer '</li>
<li>The End Of The Sickest Presidential Insults</li>
<li>The End Of The World . And When You Realize What They Seem</li>
<li>The End , I Burst Into Tears . The Letter This Trooper Wrote For His Son</li>
<li>The End ... WOW .</li>
<li>The End Of The World . When You Say 'Happy Holidays </li>
<li>The End ... Amazing . I Laughed So Hard . Especially You</li>
<li>The End Will Make You Cry . Get A Tissue .</li>
<li>The End Made It AWESOME .</li>

In [None]:
# Addendum:
# Possible next steps:
# - collect more viral headlines; a larger corpus will probably yield better results
# - integrate own webscraping mechanism for content acquisition
#     - ?scrape http://www.clickhole.com/
#     - ?scrape http://www.buzzfeed.com/
#             http://www.jeffbullas.com/2015/01/16/22-headlines-that-went-viral-have-these-marketers-cracked-the-code/
#     - ?scrape businessinsider
#     - ?scrape feedly
# - explore text generation with RNNs (Recurrent Neural Networks)