## Part 4: Twitter Bot

In [277]:
import pandas as pd
import json
from twython import Twython
import time
import os
import wordcloud
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import numpy as np
import re
import nltk
from nltk.util import bigrams
from nltk.tokenize import TweetTokenizer
from nltk.lm import Vocabulary
from nltk.lm import MLE
from nltk.util import pad_sequence
from nltk.lm.preprocessing import padded_everygram_pipeline

## Generating Text: NLTK and Markov Chains

Using a computer to generate natural text is an extremely useful and important part of NLP. You may want to create a bot to handle front-line customer queries on your website, or, in a manner perhaps more applicable to our corpus of tweets, you may be a certain intelligence service in the Eurasian Steppe. In this section, we will use our corpus and the NLP skills we've acquired to generate our own Twitter posts.

Early in Chapter 1 of our Natural Language Processing book, there's an example that shows output from the generate function of Python's Natural Language Toolkit (nltk) package. Using the Bible's Book of Genesis as its corpus, the generate() function creates text based on the corpus. This intrigued us. 
  
Unfortunately, starting with nltk version 3, the generate function was removed due to problems with NLTK's language modeling class. Going deep into the [issues section](https://github.com/nltk/nltk/issues/736) of the nltk package on GitHub, we found that, as of July 2019, generate appears to be available again but requires much more data preparation. We're going to try it.

In this section, we will generate tweets from our corpus in two ways:
1. Using the language model interface module from the nltk package.
2. Creating our own Markov model to generate text.

### Data Preparation (Again)

Again, we start with our JSON files from Twitter, which contain each user's latest tweet along with like and retweet counts and other metadata. The two JSON files will be loaded as a combined pandas dataframe.

In [278]:
with open('mikegravel1561157054.0137448_followers.json') as file:
    mgfol = json.load(file)
    
df1j = pd.DataFrame.from_dict(mgfol)
df1j = df1j.sort_values(by='followers_count', ascending=False)

with open('mikegravel1562978529.995827_followers.json') as file:
    mgfol2 = json.load(file)
    
df2j = pd.DataFrame.from_dict(mgfol2)
df2j = df2j.sort_values(by='followers_count', ascending=False)

dfcj = [df1j,df2j]
dfj = pd.concat(dfcj)
dfj.head(12)

Unnamed: 0,screen_name,verified,location,followers_count,tweet_text,retweet_count,favorite_count
95977,maddow,True,"New York, NY USA",9628336,"""Some within DHS and ICE say the president app...",3043.0,4651.0
95338,wikileaks,True,Everywhere,5503497,37 MEPs call on the European Commission to pre...,445.0,638.0
9849,AOC,True,"Bronx + Queens, NYC",4437409,This is what the United States is doing in the...,3836.0,10345.0
86051,NateSilver538,True,New York,3202795,@SeanMcElwee @DataProgress Glad to see de Blas...,7.0,208.0
22595,HEELZiggler,True,"Hollywood, CA",2827306,@go_kings_go25 @PWTees 🤘🏽🤘🏽i saw it,0.0,1.0
66225,verified,True,San Francisco,2692833,We've paused public submissions for verificati...,2070.0,11968.0
23606,marwilliamson,True,,2615633,@US395 @AlanaKStewart Noooooo,0.0,0.0
95126,jaketapper,True,,2069987,"Amid controversy, Biden gets support from high...",4.0,22.0
16955,AlaattinCAGIL,True,"London, England",1647751,RT @AlaattinCAGIL: 4 boyutlu tarayıcı ile Ann...,209.0,0.0
22259,johncusack,True,USA,1623173,Cubs Mets https://t.co/Q6NdtKqBwv,6.0,146.0


In [279]:
len(dfj)

216655

Our data consists of two different API pulls of data from followers of Democratic presidential longshot candidate Mike Gravel. That data includes a Twitter account's latest tweet. If a user did not post a new tweet between the data pulls, their tweet will be duplicated in our corpus. We don't want this. If the two pulls of their latest tweets resulted in disparate posts from a user, we want both. If it's the same tweet, we only want it once.

In [280]:
#help with duplicate rows based on select multiple columns: https://thispointer.com/pandas-find-duplicate-rows-in-a-dataframe-based-on-all-or-selected-columns-using-dataframe-duplicated-in-python/
dfj = dfj.sort_values(by='followers_count', ascending=False)
dfj_dedup = dfj.drop_duplicates(subset=['screen_name', 'tweet_text'], keep='first')
len(dfj_dedup)

192524

Ok, this looks right. We'll go ahead and make this our dataframe to be used for our corpus and do a quick preview of the tweet text.

In [281]:
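# Copy so the column assignments below modify an independent dataframe, not a view of dfj_dedup
dfj = dfj_dedup.copy()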
dfj['tweet_text'][:12]

116107    RT @nycsouthpaw: “whatever it is he does“ http...
95977     "Some within DHS and ICE say the president app...
115480    RT @atilioboron: ASSANGE, la cortina de silenc...
95338     37 MEPs call on the European Commission to pre...
32118     @morningmika GOP has been stating that I am ly...
9849      This is what the United States is doing in the...
2452      RT @verainstitute: "Today, World Population Da...
106430    @conor64 @newrepublic IDK how many LGBTQ peopl...
86051     @SeanMcElwee @DataProgress Glad to see de Blas...
44433                               @BradenWard6 Hell yeah!
22595                   @go_kings_go25 @PWTees 🤘🏽🤘🏽i saw it
87027     We've paused public submissions for verificati...
Name: tweet_text, dtype: object

Before we start our text analysis, what should we remove? Web addresses (URLs) aren't helpful.

In [282]:
#help from: https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python
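# raw string avoids invalid-escape warnings; regex=True is required on newer pandas, where str.replace defaults to literal matching
dfj['tweet_text'] = dfj['tweet_text'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)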
dfj['tweet_text'][:12]

116107           RT @nycsouthpaw: “whatever it is he does“ 
95977     "Some within DHS and ICE say the president app...
115480    RT @atilioboron: ASSANGE, la cortina de silenc...
95338     37 MEPs call on the European Commission to pre...
32118     @morningmika GOP has been stating that I am ly...
9849      This is what the United States is doing in the...
2452      RT @verainstitute: "Today, World Population Da...
106430    @conor64 @newrepublic IDK how many LGBTQ peopl...
86051     @SeanMcElwee @DataProgress Glad to see de Blas...
44433                               @BradenWard6 Hell yeah!
22595                   @go_kings_go25 @PWTees 🤘🏽🤘🏽i saw it
87027     We've paused public submissions for verificati...
Name: tweet_text, dtype: object

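That worked, so let's get rid of a few more undesired elements. We'll start with the Twitter usernames, based on the @ symbol. Next, we'll get rid of emojis. Note that the second pattern also strips quotation marks and apostrophes, so "We've" becomes "Weve".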

In [283]:
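# remove @mentions, then drop every character that is not a word character, whitespace, or one of #@/:%.,_-
dfj['tweet_text'] = dfj['tweet_text'].str.replace(r'@\S+', '', case=False, regex=True)
dfj['tweet_text'] = dfj['tweet_text'].str.replace(r'[^\w\s#@/:%.,_-]', '', flags=re.UNICODE, regex=True)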
dfj['tweet_text'][:20]

116107                          RT  whatever it is he does 
95977     Some within DHS and ICE say the president appe...
115480    RT  ASSANGE, la cortina de silencio  de la pre...
95338     37 MEPs call on the European Commission to pre...
32118      GOP has been stating that I am lying about th...
9849      This is what the United States is doing in the...
2452      RT  Today, World Population Day, is an opportu...
106430      IDK how many LGBTQ people there are on the m...
86051       Glad to see de Blasio finally ahead in a poll. 
44433                                             Hell yeah
22595                                              i saw it
87027     Weve paused public submissions for verification. 
45419     Theres not one U.S. city or county where someo...
23606                                               Noooooo
115273    Warren among 2020 Dems courting progressive vo...
95126     Amid controversy, Biden gets support from high...
16955     RT  4 boyutlu tarayıcı ile  An
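
To see what the three cleaning passes do to a single tweet, here is a minimal sketch that applies the same patterns to a plain string with the standard `re` module. The helper name `clean_tweet` is ours, made up for illustration, and is not part of the notebook's pipeline.

```python
import re

# Same patterns as the pandas calls above, compiled once.
URL_RE = re.compile(r"http\S+|www\.\S+", flags=re.IGNORECASE)
MENTION_RE = re.compile(r"@\S+")
OTHER_RE = re.compile(r"[^\w\s#@/:%.,_-]")

def clean_tweet(text):
    """Hypothetical helper: strip URLs, @mentions, then other symbols."""
    text = URL_RE.sub("", text)      # drop web addresses
    text = MENTION_RE.sub("", text)  # drop usernames (and any attached colon)
    text = OTHER_RE.sub("", text)    # drop emojis, quotes, apostrophes, etc.
    return text

print(repr(clean_tweet("RT @nycsouthpaw: “whatever it is he does“ https://t.co/x")))
```

Whitespace left behind by removed tokens is not collapsed, which is why the cleaned tweets above contain double spaces.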

Now let's get down to business.

In [284]:
tknzr = TweetTokenizer()
tweets_j = dfj['tweet_text'].fillna("").astype('str')
all_tweets_j = ' '.join(tweets_j)
corpus_list_j = tknzr.tokenize(all_tweets_j)
corpus_j = ' '.join(corpus_list_j)

In [285]:
tweets_2_j = tweets_j.apply(nltk.word_tokenize)
tweets_2_j[:12]

116107                     [RT, whatever, it, is, he, does]
95977     [Some, within, DHS, and, ICE, say, the, presid...
115480    [RT, ASSANGE, ,, la, cortina, de, silencio, de...
95338     [37, MEPs, call, on, the, European, Commission...
32118     [GOP, has, been, stating, that, I, am, lying, ...
9849      [This, is, what, the, United, States, is, doin...
2452      [RT, Today, ,, World, Population, Day, ,, is, ...
106430    [IDK, how, many, LGBTQ, people, there, are, on...
86051     [Glad, to, see, de, Blasio, finally, ahead, in...
44433                                          [Hell, yeah]
22595                                          [i, saw, it]
87027     [Weve, paused, public, submissions, for, verif...
Name: tweet_text, dtype: object

In [286]:
tweets_3_j = tweets_2_j.tolist()
tweets_2_j[:12]

116107                     [RT, whatever, it, is, he, does]
95977     [Some, within, DHS, and, ICE, say, the, presid...
115480    [RT, ASSANGE, ,, la, cortina, de, silencio, de...
95338     [37, MEPs, call, on, the, European, Commission...
32118     [GOP, has, been, stating, that, I, am, lying, ...
9849      [This, is, what, the, United, States, is, doin...
2452      [RT, Today, ,, World, Population, Day, ,, is, ...
106430    [IDK, how, many, LGBTQ, people, there, are, on...
86051     [Glad, to, see, de, Blasio, finally, ahead, in...
44433                                          [Hell, yeah]
22595                                          [i, saw, it]
87027     [Weve, paused, public, submissions, for, verif...
Name: tweet_text, dtype: object

In [287]:
type(corpus_j)

str

Now we'll split the corpus, currently a string, into a list.

In [288]:
corpus_list_j = corpus_j.split()
corpus_text_j = nltk.Text(corpus_list_j)
corpus_list_j[:10]

['RT', 'whatever', 'it', 'is', 'he', 'does', 'Some', 'within', 'DHS', 'and']

In [289]:
type(corpus_list_j)

list

In [290]:
corpus_text_j

<Text: RT whatever it is he does Some within...>

OK, so we have our corpus list.

### Language Model Interface (lm) from NLTK

Here, we tackle our first task of building a model to generate text using the Language Model Interface module from the nltk package. We leaned heavily on the [nltk documentation](https://www.nltk.org/api/nltk.lm.html) to prepare data for our text generator.

The model we create will utilize bigrams. Let's take a look at the bigrams from the first sentence in our tweet corpus.

In [291]:
list(bigrams(tweets_3_j[0]))

[('RT', 'whatever'),
 ('whatever', 'it'),
 ('it', 'is'),
 ('is', 'he'),
 ('he', 'does')]

From that first tweet, we now have bigrams, or pairs of consecutive words that occur together. "Whatever" occurs with "RT," which it follows in the tweet, and with "it," which it precedes. The frequency with which words follow each other will be the basis for our political tweet generator.
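
As a rough illustration of "how often one word follows another," the sketch below tallies successor counts with the standard library only. The function name `bigram_counts` and the toy tweets are our own assumptions for the example; the nltk model built later keeps its own counts internally.

```python
from collections import Counter, defaultdict

def bigram_counts(tokenized_tweets):
    """Hypothetical helper: map each word to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for tokens in tokenized_tweets:
        # zip pairs each token with its successor within the same tweet
        for first, second in zip(tokens, tokens[1:]):
            counts[first][second] += 1
    return counts

toy_tweets = [
    ["Medicare", "for", "All"],
    ["Medicare", "for", "Bernie"],
    ["for", "Bernie"],
]
counts = bigram_counts(toy_tweets)
print(counts["for"].most_common())
```

Pairs are only formed inside a tweet, so the last word of one tweet is never counted as preceding the first word of the next.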

Next, we'll create a vocabulary from our tweet corpus. Note that the vocabulary needs a differently prepared input: a flat list of tokens (hashable strings) rather than the list of token lists used above.

In [292]:
vocab_j = Vocabulary(corpus_list_j, unk_cutoff=2)

In [293]:
vocab_j['trump']

266

To create somewhat viable sentences, it's helpful to know which words start and finish sentences. Thus we will add special padding symbols before splitting the words completely into ngrams. First, a proof of concept.

In [294]:
#borrowed straight from here: https://www.nltk.org/api/nltk.lm.html
list(pad_sequence(tweets_3_j[0],pad_left=True,left_pad_symbol="<s>",pad_right=True,right_pad_symbol="</s>",n=2))


['<s>', 'RT', 'whatever', 'it', 'is', 'he', 'does', '</s>']

That looks as intended. There's an even easier way to do this:

In [295]:
from nltk.lm.preprocessing import pad_both_ends as pad_both
list(pad_both(tweets_3_j[0], n=2))

['<s>', 'RT', 'whatever', 'it', 'is', 'he', 'does', '</s>']

Above, we explored a little of how the lm module works. Here, we'll go ahead and use a built-in function that pads each tweet and creates the everygrams (unigrams and bigrams) along with the vocabulary.

In [296]:
train, vocab = padded_everygram_pipeline(2, tweets_3_j)

In [297]:
lm = MLE(2)

In [298]:
len(lm.vocab)

0

In [299]:
lm.fit(train,vocab)

In [300]:
print(lm.vocab)

<Vocabulary with cutoff=1 unk_label='<UNK>' and 127917 items>


In [301]:
print(lm.counts)

<NgramCounter with 2 ngram orders and 5638708 ngrams>


It's slightly anti-climactic, but now we have our model. We chose a bigram model, which is only going to condition its output based on the preceding word. In short, our model sentences are going to be bad.

Let's see how probable words are in certain contexts. With no context supplied, the score should return a word's relative frequency. Let's check "Medicare" and then "RT," shorthand for retweet.

In [302]:
lm.score("Medicare")

9.123286468451264e-05

In [303]:
lm.score("RT")

0.02895614511650368
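
To make these scores more concrete, here is a pure-Python sketch of what we understand a maximum-likelihood bigram score to be: a relative frequency when no context is given, and a conditional relative frequency given the preceding word. The `make_scorer` helper is hypothetical, and we are assuming the padding symbols count as tokens; it illustrates the idea and does not reproduce nltk's implementation.

```python
from collections import Counter

def make_scorer(tokenized_tweets):
    """Hypothetical helper: return a score(word, context) function from raw counts."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in tokenized_tweets:
        padded = ["<s>"] + list(tokens) + ["</s>"]  # assumed padding, as above
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    total = sum(unigrams.values())

    def score(word, context=None):
        if context is None:
            # relative frequency of the word among all tokens
            return unigrams[word] / total
        # how often `word` follows `context`, out of all words following `context`
        followers = sum(c for (first, _), c in bigrams.items() if first == context)
        return bigrams[(context, word)] / followers if followers else 0.0

    return score

score = make_scorer([["a", "b"], ["a", "c"], ["a", "b"]])
print(score("a"), score("b", context="a"))
```

In the toy corpus, "b" follows "a" in two of three cases, so its conditional score is two thirds, while the unconditioned score of "a" is its share of all twelve padded tokens.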

Finally, we generate tweets.

In [304]:
lm.generate(25)

['the',
 'billions',
 'in',
 'the',
 'Briefing',
 '</s>',
 'Trump',
 'one',
 'either',
 'a',
 'panel',
 'at',
 '6:30',
 'and',
 'perpetuate',
 'the',
 'first',
 '4',
 'as',
 'Obama',
 ',',
 'and',
 'retweets',
 'this',
 'in']

These tweets are quite bad. It's nice that the model breaks the generated text into sentences, but they make no sense. A quick sample of generated tweets from this model, when it's asked to generate a 25-word string:

"on your. to find it always tand against the body. Damn. not wearing this weekends' widespread. is not. white."

"but our great singing Ol' Man this. My son vrai nom. clamed down today to their classism to opti. letters' section."

"try fettuccini alfredo. Mountain Dew and we throw u scaring the British empre. Which, and reading them more than for them."

These are the writings you find at the home of a serial killer. Let's try the next method.

### Simple (But Insane) Markov Model

Having used the language modeling functionality built into the nltk package - and gotten poor results, let's try to be a little more independent. 

We'll create our own model using a Markov chain, which will also utilize bigrams. Markov chains likewise use word frequencies based on bigram pairs to generate text. So, if "Medicare for All" appears frequently in our corpus, the Markov chain might place "for" after "Medicare" in a given sentence. However, if "for Bernie" occurs much more frequently than "for All," because we're just looking at pair contexts, we might end up with "Medicare for Bernie."

Let's create a function that will pair words from tweets.

In [305]:
#borrowed from https://towardsdatascience.com/simulating-text-with-markov-chains-in-python-1a27e6d13fc6
def create_pairs(corp):
    for i in range(len(corp)-1):
        yield (corp[i], corp[i+1])

Now pass the corpus into the function.

In [306]:
corpus_pairs = create_pairs(corpus_list_j)

Borrowing heavily from Ben Shaver of General Assembly's [article](https://towardsdatascience.com/simulating-text-with-markov-chains-in-python-1a27e6d13fc6), we create an empty dictionary before shoving in our corpus word pairs.

In [307]:
word_dict = {}

for word_1, word_2 in corpus_pairs:
    if word_1 in word_dict:
        word_dict[word_1].append(word_2)
    else:
        word_dict[word_1] = [word_2]

In [308]:
type(corpus_list_j)

list

In [309]:
corpus_list_j[:10]

['RT', 'whatever', 'it', 'is', 'he', 'does', 'Some', 'within', 'DHS', 'and']

In [310]:
first_word = np.random.choice(corpus_list_j)

chain = [first_word]

n_words = 30

for i in range(n_words):
    chain.append(np.random.choice(word_dict[chain[-1]]))
    
' '.join(chain)

', harris literally ripped 1000s of peaceful female body when do I think .. RT Go to generate a bitch out by strong start asking if the exploitative working painter and'
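
One fragility in the loop above: if the chain lands on a word that has no recorded successor (for example, a word that only appears as the very last token of the corpus), the dictionary lookup raises a `KeyError`. The sketch below is one way to guard against that, using only the standard library. The names `build_chain` and `generate` are ours, and the starting word is drawn uniformly from the dictionary keys, which differs from the frequency-weighted draw used above.

```python
import random

def build_chain(tokens):
    """Hypothetical helper: map each word to the list of words seen after it."""
    successors = {}
    for first, second in zip(tokens, tokens[1:]):
        successors.setdefault(first, []).append(second)
    return successors

def generate(successors, n_words, rng, start=None):
    """Walk the chain for up to n_words steps, stopping early at a dead end."""
    word = start if start is not None else rng.choice(list(successors))
    chain = [word]
    for _ in range(n_words):
        options = successors.get(chain[-1])
        if not options:  # no known successor: stop instead of raising KeyError
            break
        chain.append(rng.choice(options))  # duplicates in the list weight the draw
    return " ".join(chain)

toy_chain = build_chain("Medicare for All".split())
print(generate(toy_chain, 5, random.Random(0), start="Medicare"))
```

Passing a seeded `random.Random` instance makes a run repeatable, which helps when comparing samples.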

This is hot nonsense. Some samples:

"for the last two types of the mountain goats RT Remember that . Its always honoring first episodes of the long as labor rights his account and hate , anti-homophobic Me"

"E China , huh ? ? don't think it is capitalism of White Guy Fieri RT : McDonald s nationalize it s national survey again It be an ufo from someone"

"SMRPG forever for SHOPLIFTING and Joe Biden . I really excited Some of cases RT my life . Hvor mye plass til dec Mass liberalism . A M A firework ."

###Bot Conclusion

This was an educational exercise. Bigrams - and unigrams with our lm model - do not capture enough information to model viable sentences. Moreover, the slang-heavy, punctuation-optional lexicon of Twitter does not lend itself well to being a corpus.

Better data preparation, particularly focused on cleaning the tweet text, would likely have yielded better results. Also, using larger strings of ngrams and adding weighting could have improved our output.
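
As a hedged sketch of the "larger ngrams" idea, the dictionary can be keyed on the previous two words instead of one, so that "Medicare for" and "jobs for" lead to different continuations. We did not run this on our corpus; the function name `build_ngram_chain` and the toy sentence are assumptions for illustration only.

```python
from collections import defaultdict

def build_ngram_chain(tokens, order=2):
    """Hypothetical helper: map each tuple of `order` words to the words seen next."""
    chain = defaultdict(list)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])   # the preceding `order` words
        chain[context].append(tokens[i + order])
    return chain

toy_tokens = "Medicare for All and jobs for Bernie".split()
ngram_chain = build_ngram_chain(toy_tokens, order=2)
print(ngram_chain[("Medicare", "for")], ngram_chain[("jobs", "for")])
```

The trade-off is sparsity: longer contexts occur less often, so a model like this tends to copy longer stretches of the original tweets.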

## [Project Table of Contents]()

- Setup: [Data Scraping](https://github.com/aliceafriedman/team6_final/blob/master/Data%20Scraping.ipynb)

- Part I. <a href="https://github.com/aliceafriedman/team6_final/blob/master/Final_Project_Sentiment_Analysis.ipynb" target="_blank">Sentiment Analysis</a>

- Part II. <a href="https://github.com/aliceafriedman/team6_final/blob/master/07162019JPFollowersTweetDownloader-final%20version.ipynb" target="_blank">Data Visualization</a>  

- Part III. [Classification](https://github.com/aliceafriedman/team6_final/blob/master/classification.ipynb)

- Part IV. <a href="https://github.com/aliceafriedman/team6_final/blob/master/Project%204%20Bad%20Twitter%20Bots%20Jeff_v1.ipynb" target="_blank">Twitter Bot</a>  