# Markovify coronavirus headlines

Coronavirus News Simulator

In [198]:
import markovify
import re
import pandas as pd
import random
pd.options.display.max_colwidth = 100

Read in unique headlines.

In [306]:
df = pd.read_csv('NYT.csv', parse_dates=['date']).drop_duplicates(subset='headline').reset_index(drop=True)
df

Unnamed: 0,headline,date,doc_type,material_type,section,keywords
0,‘Nobody Is Above the Law’: House Democrats Are Furious With Barr,2019-05-02,multimedia,Video,U.S.,['United States Politics and Government']
1,Watch ‘Boyz N the Hood’ Free at the Tribeca Film Festival,2019-05-02,article,News,Movies,['Tribeca Film Festival (NYC)']
2,Ryan Reynolds Keeps a Bare Closet,2019-05-03,article,News,Fashion & Style,['Fashion and Apparel']
3,How Will Satan & Adam Play in 2019?,2019-05-02,article,News,New York,"['Blues Music', 'Race and Ethnicity', 'Documentary Films and Programs']"
4,Trump Says He Discussed the ‘Russian Hoax’ in a Phone Call With Putin,2019-05-03,article,News,U.S.,"['Russian Interference in 2016 US Elections and Ties to Trump Associates', 'Presidential Electio..."
...,...,...,...,...,...,...
46130,Will Warm Weather Slow Coronavirus?,2020-04-30,article,Op-Ed,Opinion,"['Coronavirus (2019-nCoV)', 'Influenza Epidemic (1918-19)', 'Weather', 'Seasons and Months', 'Me..."
46131,Chinese Coffee Chain’s Scandal Renews U.S. Calls for Oversight,2020-04-30,article,News,Business Day,"['Frauds and Swindling', 'United States Economy', 'Securities and Commodities Violations', 'Stoc..."
46132,‘We Ran Out of Space’: Bodies Pile Up as N.Y. Struggles to Bury Its Dead,2020-04-30,article,News,New York,"['Funerals and Memorials', 'Cemeteries', 'Coronavirus (2019-nCoV)', 'Deaths (Fatalities)']"
46133,"The F.B.I.’s Director, a Critic of Strong Encryption, Once Defended It",2020-04-30,article,News,Technology,"['Computer Security', 'Instant Messaging', 'Social Media', 'Computers and the Internet', 'Confli..."


Save the headlines as an easily accessible list.

In [307]:
headlines = list(df['headline'])

In [133]:
df['headline'].iloc[-120:-100]

46015    ‘Will You Help Save My Brother?’: The Scramble to Find Covid-19 Plasma Donors
46016                            A Strange Dinosaur May Have Swum the Rivers of Africa
46017            Dozens of Decomposing Bodies Found in Trucks at Brooklyn Funeral Home
46018                               Time to Check Your Pandemic-Abandoned Car for Rats
46019                                   Virus Pushes the Bang on a Can Marathon Online
46020                           Sister Patricia McGowan, Dedicated Teacher, Dies at 80
46021               Her M.R.I. Came Back Normal After a Seizure. Could It Be Covid-19?
46022                                             How Long Will a Vaccine Really Take?
46023             As Georgia Reopens, Virus Study Shows Black Residents May Bear Brunt
46024                       There’s Something Special About the Sun: It’s a Bit Boring
46025                          To Restart After Lockdown, Theaters Need to Think Small
46026                              With Mot

How it works:

1. Pick a random word to begin the sentence with, chosen from the words that begin a sentence in the dataset.
2. Pick each next word according to the probability that it follows the current word. For example, suppose our dataset is

hey guise hey guise hi wow
hey what's up chewbacca
ayy lmao hey ayy

Then the two possible starting words are hey and ayy. If we choose hey, the words that can follow it (with the number of times each occurs after hey) are
- guise, 2
- ayy, 1
- what's, 1

markovify picks guise with probability 1/2, ayy with probability 1/4, and what's with probability 1/4.

One way to improve markovify's output is to track these counts as a sentence is generated. Here hey begins two sentences and guise follows hey twice, so the string hey guise gets a score of

2 (hey) + 2 (guise)

while what's follows hey only once, so the first two words of hey what's up contribute

2 (hey) + 1 (what's)

So you can simply generate many sentences and keep the one with the highest score.

This makes the more common (and usually more correct) paths more likely to be chosen.
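The counts and probabilities above can be reproduced with a few lines of plain Python. This is a minimal sketch that does not use markovify; the names `starts`, `follows`, `next_word_probabilities` and `score` are made up for illustration, and `score` simply adds the start count and each transition count as in the example.

```python
from collections import Counter, defaultdict

# Toy corpus from the example above, one sentence per line.
corpus = [
    "hey guise hey guise hi wow",
    "hey what's up chewbacca",
    "ayy lmao hey ayy",
]

starts = Counter()              # how often each word begins a sentence
follows = defaultdict(Counter)  # follows[w][nxt] = times nxt came right after w

for line in corpus:
    words = line.split()
    starts[words[0]] += 1
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1


def next_word_probabilities(word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    counts = follows[word]
    total = sum(counts.values())
    return {nxt: n / total for nxt, n in counts.items()}


def score(sentence):
    """Add the start count of the first word and the count of each transition."""
    words = sentence.split()
    total = starts[words[0]]
    for current, nxt in zip(words, words[1:]):
        total += follows[current][nxt]
    return total


print(next_word_probabilities("hey"))
print(score("hey guise"), score("hey what's"))
```

Here `score("hey guise")` is 4 and `score("hey what's")` is 3, matching the sums above.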



In [342]:
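def getSent(model, iters, minLength=1):
    """Generate `iters` sentences and rank them by average transition count.

    Assumes `model` was built with state_size=2.
    """
    sentences = {}
    for i in range(iters):
        modelGen = model.chain.gen()
        # The chain starts in state (BEGIN, BEGIN); after the first word
        # the state is (BEGIN, first word).
        prevPrevWord = "___BEGIN__"
        prevWord = next(modelGen)
        madeSentence = prevWord + " "

        totalScore = 0
        numWords = 1
        for curWord in modelGen:
            madeSentence += curWord + " "
            numWords += 1
            # Add how often curWord followed the two previous words in the corpus
            totalScore += model.chain.model[(prevPrevWord, prevWord)][curWord]
            prevPrevWord = prevWord
            prevWord = curWord

        madeSentence = madeSentence.strip()

        # Skip sentences that are too short or were already generated
        if numWords < minLength: continue
        if madeSentence in sentences: continue

        # Count the transition into the end-of-sentence marker as well
        totalScore += model.chain.model[(prevPrevWord, prevWord)]["___END__"]

        # Average over the word count so long sentences are not favored
        sentences[madeSentence] = totalScore/float(numWords)

    # Sort them so the sentences with the highest score appear first
    return sorted(sentences.items(), key=lambda x: -x[1])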

https://www.reddit.com/r/LanguageTechnology/comments/550qvp/tutorial_how_to_improve_the_output_of_the_python/

In [343]:
model = markovify.Text(headlines, state_size=2)
list(getSent(model, 100, 4))[:5]

[('lesson of the day: fleeing the city?', 96.0),
 ('what’s on tv friday: pain and glory’ review: almodóvar’s dazzling art of cinema or grant it new life?',
  37.94444444444444),
 ('tornadoes, japan, boston bruins: your tuesday briefing',
  35.285714285714285),
 ('philadelphia, markets, jeffrey epstein: your monday briefing',
  34.57142857142857),
 ('davos has a long petal of the day: the party of five’ and little common ground with my mom?',
  24.526315789473685)]

In [339]:
df[df['headline'].str.contains('What’s Going')]

Unnamed: 0,headline,date,doc_type,material_type,section,keywords
430,"What’s Going On in This Picture? | May 6, 2019",2019-05-05,article,News,The Learning Network,[]
2401,"What’s Going On in This Picture? | May 20, 2019",2019-05-19,article,News,The Learning Network,[]
15507,"What’s Going On in This Graph? | Sept. 11, 2019",2019-09-05,article,News,The Learning Network,[]
16101,"What’s Going On in This Picture? | Sept. 9, 2019",2019-09-08,article,News,The Learning Network,[]
16837,"What’s Going On in This Graph? | Sept. 18, 2019",2019-09-12,article,News,The Learning Network,[]
17646,"What’s Going On in This Graph? | Sept. 25, 2019",2019-09-19,article,News,The Learning Network,[]
18094,"What’s Going On in This Picture? | Sept. 23, 2019",2019-09-22,article,News,The Learning Network,[]
18546,"What’s Going On in This Graph? | Oct. 2, 2019",2019-09-26,article,News,The Learning Network,[]
18938,"What’s Going On in This Picture? | Sept. 30, 2019",2019-09-29,article,News,The Learning Network,[]
19494,What’s Going On in This Graph? | Oct. 9. 2019,2019-10-03,article,News,The Learning Network,[]


## 2-state model

Preprocess the headlines.

- treat a question mark as its own word
- lowercase every word: some headlines capitalize every word and some don't, so without lowercasing "coronavirus" and "Coronavirus" would be treated as two different words
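A minimal sketch of both preprocessing steps, assuming a whitespace tokenizer downstream. The helper name `preprocess` and the two sample headlines are chosen here for illustration only.

```python
import re


def preprocess(headline):
    """Lowercase a headline and split off question marks as separate tokens."""
    text = headline.lower()
    # Put a single space before every "?" so splitting on whitespace
    # yields the question mark as its own word.
    text = re.sub(r"\s*\?", " ?", text)
    return text.strip()


sample = ["How Long Will a Vaccine Really Take?", "Will Warm Weather Slow Coronavirus?"]
cleaned = [preprocess(h) for h in sample]
print(cleaned)
```

After this step, "Coronavirus?" and "coronavirus" both contribute to the same token, `coronavirus`.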

In [319]:
for i, headline in enumerate(headlines):
    # Lowercase and strip the opening curly quote (the output below shows ‘ removed)
    headlines[i] = re.sub('‘', '', headline.lower())

headlines

['nobody is above the law’: house democrats are furious with barr',
 'watch boyz n the hood’ free at the tribeca film festival',
 'ryan reynolds keeps a bare closet',
 'how will satan & adam play in 2019?',
 'trump says he discussed the russian hoax’ in a phone call with putin',
 'our best cinco de mayo recipes: guacamole, margaritas, tacos and more',
 'king of thailand to be formally crowned in an ornate spectacle',
 'a mueller investigator moves on (but stays mum)',
 'marilyn stasio gets cozy with some unlikely heroines — and a hit man',
 'a teacher shared her salary, and a stranger started a school supplies wish list',
 'opportunities to watch the big game',
 'joe biden and the debate over apologies',
 'oil and u.s. involvement in venezuela',
 'las principales noticias del viernes',
 'how much watching time do you have this weekend?',
 '12 pop, rock and jazz concerts to check out in n.y.c. this weekend',
 '5 film series to catch in n.y.c. this weekend',
 'william barr, facebook, scr

In [296]:
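# `data` is not defined in the cells shown; the newline-joined headlines are assumed here
model = markovify.NewlineText('\n'.join(headlines), state_size=1)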

In [294]:
for i in range(10):
    sentence = model.make_sentence(tries=100)
    print(sentence)

bernie and son flew on ‘angels,’ and blocks trump pardon offers a crisis
word + quiz: memory, canker sores and more new york from poverty: art trove of conservative denial and sanders and at 51 in air pollution rule, strike aims to stocks, rising seas mean revolution.
draft picks its final
for democrats agree to plant is here, now doctors brace yourself. signed, an arrest in on tv tuesday: ‘running out of wellness industry
how democrats eye a guide to go ahead, subvert democracy
review: a pickup truck drivers steer clear ? warring with these horses sold the census suspends races
soccer’s fight the bay
stacey abrams is too many men, your monday evening briefing
norway’s viking ships keep her life in catholic church arsons in 2021 budget
a threshold for president trump, kim kardashian has long shadow of vaping, mercury: your bike rides a coach sean marca registrada


The cell above prints ten randomly generated sentences from the fitted model.

In [285]:
df[df['headline'].str.contains(r'Kids\?')]

Unnamed: 0,headline,date,doc_type,material_type,section,keywords
18132,"Given the State of the World, Is It Irresponsible to Have Kids?",2019-09-25,article,News,T Magazine,['Children and Childhood']
25863,Bickering More After Kids?,2019-11-20,article,News,Parenting,[]
27960,Are Sugar Substitutes Good for Kids?,2019-12-09,article,News,Well,"['Artificial Sweeteners', 'Children and Childhood', 'Diet and Nutrition', 'Weight', 'Sugar', 'Ob..."
29124,Is Screen Time Really Bad for Kids?,2019-12-18,article,News,Magazine,"['Smartphones', 'Teenagers and Adolescence', 'Children and Childhood', 'Medicine and Health', 'M..."
35526,Why Are You Still Packing Lunch for Your Kids?,2020-02-10,article,Op-Ed,Opinion,"['Lunch and Breakfast Programs', 'Cafeterias', 'Food', 'Education (K-12)', 'Parenting', 'Diet an..."
43188,"When Parents Get Sick, Who Cares for the Kids?",2020-04-09,article,News,Parenting,"['Coronavirus (2019-nCoV)', 'Parenting']"
44061,Ask NYT Parenting: I Use My Phone for Everything. Is That Harming My Kids?,2020-04-17,article,News,Parenting,"['Science and Technology', 'Cellular Telephones', 'Smartphones', 'Parenting', 'Children and Chil..."
44065,Have You Named a Legal Guardian for Your Kids?,2020-04-17,article,News,Parenting,"['Children and Childhood', 'Special Education', 'Child Custody and Support', 'Families and Famil..."
44636,Is The News Too Scary for Kids?,2020-04-18,article,News,Parenting,"['News and News Media', 'Children and Childhood', 'Families and Family Life', 'Parenting']"
45641,A Constitutional Right to Literacy for Detroit’s Kids?,2020-04-26,article,Op-Ed,Opinion,"['Decisions and Verdicts', 'Education (K-12)', 'Politics and Government', 'United States Economy..."


In [224]:
df[df['headline'].str.contains(r'Rebrand\?')]

Unnamed: 0,headline,date,doc_type,material_type,section,keywords
36474,Vape Shops Face a Choice: Close or Rebrand?,2020-02-19,article,News,New York,"['E-Cigarettes', 'Teenagers and Adolescence', 'Smoking and Tobacco', 'Respiratory Diseases', 'Me..."


In [120]:
model = markovify.NewlineText(headlines, state_size=2)
for i in range(10):
    print(model.make_sentence())

Mueller’s Labored Performance Was a Risk Warner Bros. Studio
Review: ‘Pennyworth’ Tells the Truth Still Matters
M.I.T. President Says He and Queen Take Stock
Why Watch Video in Grand Central Terminal
Going Back to the Bank Do?
Sales Take a Deep Breath
In Coronavirus, Industry Sees Chance to Consolidate Power
This Investigator Used to Stake His Own National Security Council
Tokyo, in a Museum? Let Steve McQueen Show You That ‘Reporters Are Not Here for Our Lives.
‘Mayors for Mike’: How Bloomberg’s Money Built a Shiite Axis of Power


Check whether a phrase from a generated headline (here, "Virus Spreads") already appears in the dataset.

In [60]:
df[df['headline'].str.contains('Virus Spreads')]['headline']

33027                        As New Virus Spreads From China, Scientists See Grim Reminders
33616                                   As Virus Spreads, Anger Floods Chinese Social Media
34066               As Virus Spreads, U.S. Temporarily Bars Foreigners Who’ve Visited China
37863               A Virus Spreads, Stocks Fall, and Democrats See an Opening to Hit Trump
38703             First U.S. Colleges Close Classrooms as Virus Spreads. More Could Follow.
39178                     Concert Giants Live Nation and AEG Suspend Tours as Virus Spreads
40732    ‘A Storm Is Coming’: Fears of an Inmate Epidemic as the Virus Spreads in the Jails
41693                    As Virus Spreads, China and Russia See Openings for Disinformation
41810              ‘Jails Are Petri Dishes’: Inmates Freed as the Virus Spreads Behind Bars
43628                                   U.S. Food Supply Chain Is Strained as Virus Spreads
43842                     ‘Pacing and Praying’: Jailed Youths Seek Release as Vi

By default, markovify.Text tries to generate sentences that do not simply regurgitate chunks of the original text. The default rule is to suppress any generated sentence that exactly overlaps the original text by 15 words or 70% of the sentence's word count.
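To make the overlap rule concrete, here is a simplified pure-Python illustration. It is a sketch of the idea, not the library's code: the function name `overlaps_too_much` is hypothetical, and only the parameter names mirror the `max_overlap_ratio` and `max_overlap_total` keyword arguments of `make_sentence`.

```python
def overlaps_too_much(generated, corpus_text, max_overlap_ratio=0.7, max_overlap_total=15):
    """Return True if `generated` shares too long a verbatim word run with `corpus_text`."""
    words = generated.split()
    # Longest verbatim run we tolerate: the smaller of the two limits.
    allowed = min(max_overlap_total, round(max_overlap_ratio * len(words)))
    window = allowed + 1
    n_windows = max(len(words) - allowed, 1)
    for i in range(n_windows):
        chunk = " ".join(words[i:i + window])
        # Any run longer than `allowed` that appears verbatim is a rejection.
        if chunk in corpus_text:
            return True
    return False


toy_corpus = "as virus spreads, anger floods chinese social media\nthe toilet paper shortage"
copied = "as virus spreads, anger floods chinese social media"
novel = "the toilet paper floods chinese anger"
print(overlaps_too_much(copied, toy_corpus), overlaps_too_much(novel, toy_corpus))
```

A sentence copied straight from the corpus is rejected, while a recombination of short fragments passes.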

Try different state sizes.

The state size is the number of preceding words that the probability of the next word depends on.
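The effect of the state size can be seen in a small hand-rolled transition table. This sketch does not use markovify; `build_transitions`, the sentinel tokens and the two toy sentences are assumptions made for illustration.

```python
from collections import Counter, defaultdict

BEGIN, END = "<begin>", "<end>"  # hypothetical sentinel tokens for this sketch


def build_transitions(sentences, state_size):
    """Map each tuple of `state_size` preceding words to a Counter of next words."""
    table = defaultdict(Counter)
    for sentence in sentences:
        words = [BEGIN] * state_size + sentence.split() + [END]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            table[state][words[i + state_size]] += 1
    return table


toy_sentences = ["as virus spreads anger floods", "as virus fades markets rise"]
one = build_transitions(toy_sentences, state_size=1)
two = build_transitions(toy_sentences, state_size=2)

print(one[("virus",)])            # next words after "virus"
print(two[("virus", "spreads")])  # next words after "virus spreads"
```

With a state size of 1 the word after "virus" can be either "spreads" or "fades"; with a state size of 2 the state "virus spreads" has only one continuation, so output is more coherent but closer to the source text.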

In [43]:
model_1 = markovify.NewlineText(headlines, state_size=1)
for i in range(10):
    print(model_1.make_sentence())

N.Y.C.’s Death of impeachment.
In Iraq Protest Far-Right Leader, Carrie Lam, to the Voters Near Romania Reject Years in Iowa, Brexit: Your Oven
A Free Grocery ‘Happy Talk,’ Susan Collins, Defending Trump
The Constellation of Impeachment Team That Way, Way
Raptors Reign in Italy. We May Last Dance,’ Michael Craig Is Split Subway Station Drops Out
Forever 21, 2019
This Scientist Amid Pandemic
What to Constitutional Crisis, China Amounts of Connecticut Retreat in Afghanistan Will Win at 98
‘Lucy in New Faces Tough Drop-Offs
Bernie Sanders Wanted to Hack Into the Primary Forecast


In [None]:
text_model_3 = markovify.NewlineText(df.headline, state_size=3)
text_model_4 = markovify.NewlineText(df.headline, state_size=4)

In [None]:
for i in range(10):
    sentence = text_model_3.make_sentence()
    if sentence is not None:
        print(sentence)

In [None]:
for i in range(10):
    sentence = text_model_4.make_sentence()
    if sentence is not None:
        print(sentence)

Ensemble Markov chain models.

`markovify.combine` accepts two arguments:
- `models` — a list of models to combine
- `weights` — how much emphasis to place on each model
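As a rough picture of what weighting means, the sketch below merges two hand-written transition tables by adding weighted counts. It is an illustration under that assumption rather than markovify's implementation; `combine_counts` and the toy tables are hypothetical.

```python
from collections import Counter


def combine_counts(tables, weights):
    """Merge transition tables by adding weighted follow-up counts."""
    combined = {}
    for table, weight in zip(tables, weights):
        for state, followers in table.items():
            merged = combined.setdefault(state, Counter())
            for word, count in followers.items():
                merged[word] += count * weight
    return combined


general = {("virus",): Counter({"spreads": 3, "fades": 1})}
covid_only = {("virus",): Counter({"spreads": 1, "surges": 2})}

# Give the second table twice the emphasis of the first.
mixed = combine_counts([general, covid_only], weights=[1, 2])
print(mixed[("virus",)])
```

When identical tables are combined with equal weights, every count is scaled by the same factor, so the resulting probabilities are the same as for a single table.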

In [None]:
text_model11 = markovify.NewlineText(df.headline, state_size = 2)
text_model12 = markovify.NewlineText(df.headline, state_size = 2)
text_model13 = markovify.NewlineText(df.headline, state_size = 2)
model_combo = markovify.combine([ text_model11, text_model12, text_model13 ], [ 1, 1, 1])

In [None]:
for i in range(10):
    print(model_combo.make_sentence())

Include only headlines that have "coronavirus" in them:

In [None]:
coronavirus = ['coronavirus', 'covid19', 'covid-19', 'sars-cov-2', 'sarscov2', 'sars-cov2', 'virus']
generated_headlines = []
while len(generated_headlines) < 25:
    headline = model_combo.make_sentence()
    if headline is not None:
        # Substring match so attached punctuation ("Coronavirus:") does not hide a keyword
        contains_coronavirus = any(keyword in headline.lower() for keyword in coronavirus)
        if contains_coronavirus:
            generated_headlines.append(headline)
generated_headlines

In [None]:
df['headline']

## textgenrnn

https://github.com/minimaxir/textgenrnn

https://minimaxir.com/2018/05/text-neural-networks/

A note from the GitHub repository:

> You will not get quality generated text 100% of the time, even with a heavily-trained neural network. That's the primary reason viral blog posts/Twitter tweets utilizing NN text generation often generate lots of texts and curate/edit the best ones afterward.

So I will show you what it looks like, then my curated set.

Save unique headlines to file.

In [None]:
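with open('unique_headlines.txt', 'w') as txt_file:
    # df already holds unique headlines (drop_duplicates above); df_unique is not defined in the cells shown
    for line in df['headline']:
        # Write each headline on its own line (' '.join(line) would put a space between every character)
        txt_file.write(line + '\n')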

In [None]:
from textgenrnn import textgenrnn

In [None]:
textgen = textgenrnn()

In [None]:
textgen.train_from_file('unique_headlines.txt', num_epochs=1)

In [None]:
textgen.generate()

A Markov chain is a mathematical system that moves from one state to another within a state space (e.g. over successive time steps). Such a sequence of states is characterized by Markovian transition probabilities, i.e. the probability of occupying the next state depends only on the current (and potentially the previous m) states.
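As a sketch of this definition in symbols (the notation $X_t$ for the state at step $t$ is an assumption, not the notebook's), dependence on the current and previous $m$ states can be written as

$$
P(X_{t+1} = x \mid X_t, X_{t-1}, \dots, X_1) = P(X_{t+1} = x \mid X_t, X_{t-1}, \dots, X_{t-m})
$$

With $m = 0$ only the current state matters. Under this reading, a markovify model with `state_size=2` conditions each word on the two words before it, which corresponds to $m = 1$ when each state is a single word.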

The layers of an artificial neural net are not constrained to describe state occupancy probabilities: you can think of each layer, and indeed the entire neural net, as an arbitrary nonlinear function. For example, a neural network can be used for linear or nonlinear regression, which has nothing to do with state transition probabilities.

In [None]:
coronavirus_pattern = '|'.join(coronavirus)
# case=False so title-case headlines ("Coronavirus", "Covid-19") also match
df_unique_covid = df[df['headline'].str.contains(coronavirus_pattern, case=False)]

In [None]:
df_unique_covid

In [None]:
text_model111 = markovify.NewlineText(df_unique_covid.headline, state_size = 2)
text_model122 = markovify.NewlineText(df_unique_covid.headline, state_size = 2)
text_model133 = markovify.NewlineText(df_unique_covid.headline, state_size = 2)
model_combo_2 = markovify.combine([ text_model111, text_model122, text_model133 ], [ 1, 1, 1])

In [None]:
coronavirus = ['coronavirus', 'covid19', 'covid-19', 'sars-cov-2', 'sarscov2', 'sars-cov2', 'virus']
generated_headlines = []
while len(generated_headlines) < 25:
    headline = model_combo_2.make_sentence()
    if headline is not None:
        # Substring match so attached punctuation ("Coronavirus:") does not hide a keyword
        contains_coronavirus = any(keyword in headline.lower() for keyword in coronavirus)
        if contains_coronavirus:
            generated_headlines.append(headline)
generated_headlines

Keeping these... 

'Estranged Husband of Missing Women. And That’s Just Big Enough for These Myths About Coronavirus'

'For Millennials Making Their First Coronavirus Patient'

'Labeling Error to Blame for the Coronavirus'

We're gonna use jsvine's Markovify library. It generates Markov chain text from sentence-delimited text datasets (so basically normal prose). You can feed it anything from classic novels to internet forum comments, and it will spit out semi-related nonsense that almost passes as a sentence.

The library gathers n-grams from the training text and builds a dictionary that catalogues how often each n-gram is followed by each next word. It is at the core of the script used to make posts and comments on the Subreddit Simulator subreddit.

https://github.com/trambelus/UserSim

**Format headlines as a single block of text.**

In [349]:
' '.join(headlines)



In [360]:
model = markovify.NewlineText('\n'.join(headlines), state_size=2)
for i in range(10):
    print(model.make_sentence(tries=100))

world energy report, impeachment hearings, hong kong: your monday briefing
the toilet paper shortage?
trump is an order to reproduce
bill maher on the manor, but still takes requests
daniel pantaleo, officer who killed keylan knapp?
the mystery of 39
sinn fein on threshold: party with old boundaries, and marched past them
the decade of trump’s truthfulness heightens impeachment debate
restore bolivian democracy and break its history of god bless america’
density is new york’s taxi king


In [362]:
df[df['headline'].str.contains('Rotavirus')]

Unnamed: 0,headline,date,doc_type,material_type,section,keywords
6144,Rotavirus Vaccine May Help Protect Against Type 1 Diabetes,2019-06-19,article,News,Well,"['Diabetes', 'Vaccination and Immunization', 'Rotaviruses', 'Children and Childhood']"


Other Medium articles about Markovify:
- https://towardsdatascience.com/nlg-for-fun-automated-headlines-generator-6d0459f9588f
- https://medium.com/@sebastian.enger/text-content-generation-based-on-markov-chains-a-short-overview-a71fdf246e65
- https://towardsdatascience.com/simulating-text-with-markov-chains-in-python-1a27e6d13fc6

Limitations: ( https://medium.com/@dhruvilshah28/exploring-the-next-word-predictor-5e22aeb85d8f ) 

Markov chains have no memory beyond their current state, which limits this approach. For example, given “I ate so many grilled …”, the next word “sandwiches” is predicted from how many times “grilled sandwiches” appeared together in the training data. Because the suggestions are based only on frequency, there are many scenarios where this approach can fail.

Count every word that is used. Then, for every word, store the words that come next. This gives the distribution of words in the text conditional on the preceding word.
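The counting step and the sampling step can be sketched in plain Python as follows. This is an illustrative toy, not markovify's code; `fit`, `generate`, the `<begin>`/`<end>` markers and the two sample headlines are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def fit(sentences):
    """Count, for every word, the words that immediately follow it."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = ["<begin>"] + sentence.split() + ["<end>"]
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def generate(follows, rng, max_words=30):
    """Walk the chain from <begin>, sampling each next word in proportion to its count."""
    word, output = "<begin>", []
    while len(output) < max_words:
        candidates = follows[word]
        word = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        if word == "<end>":
            break
        output.append(word)
    return " ".join(output)


toy_headlines = ["how long will a vaccine really take?", "will warm weather slow coronavirus?"]
toy_model = fit(toy_headlines)
print(generate(toy_model, random.Random(0)))
```

Because the text is split on whitespace only, punctuation such as the trailing question mark stays attached to its word and shows up in the generated text.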

Note we're keeping all the punctuation in, so our simulated text has punctuation.

https://hackernoon.com/automated-text-generator-using-markov-chain-de999a41e047
