Trump Bot

This is a bot built on Donald Trump's vocabulary, i.e. text from his tweets, speeches, essays and debates. It utters Trump-like sentences, given a particular word or phrase. It enjoys tweeting: mention it in your tweet and it will respond. For about nine months it also had a lonely existence in some corner of the web, which looked like this:

[Image: Donald plays golf]

But it got so lonely that it stopped existing, so it leads only a social existence now.

Data Pre-processing

Tweets

Using the Twitter API, one can get at most the 3000 most recent tweets by a given user. greptweet.com has been storing users' tweets for a long time, so with greptweet I was able to get Trump's tweets all the way back to 2012, which gave me a total of ~18000 tweets.

I extracted the tweet text from realdonaldtrump.txt into tweetsONLY.txt using awk.

    awk -F '|' '{print $3}' < realdonaldtrump.txt > tweetsONLY.txt

Tweet Cleaning:

  1. I removed retweets by discarding ones starting with 'RT'.

  2. Usernames: I collected all the users tagged by Trump - these would be words that begin with one of '@' or '.@' or '"@' or '".@'.

    I scraped Twitter for the users' real names, number of followers, and the number of people they follow. I then made two lists: a "white list" of 'well-known' users, and a 'throw-out-list' of 'not-so-well-known' users. The white list was determined by the following criterion:

    if followers / following > 10:
        white_list.append(user)
    else:
        throw_out_list.append(user)
    

    Some users' accounts had been deleted or suspended, so I got either a 404 error or an "IndexError" for them. I added these users to an errorList and manually inspected its contents, keeping the users whose names I could recognize (such as 'vpbiden' or 'georgeclooney').

    I discarded tweets directed at users not in the white list, as they were of a personal nature and uninteresting to anyone else.
    I replaced the white-listed usernames with the users' real names.

  3. Tweet completion: Twitter has a limit of 140 characters for tweets. Longer tweets end in "(cont) some.link.to.twitlonger". I scraped the full text of these tweets from twitlonger.com and then put them through step 2.

  4. From tweets that contained links to websites, I removed the link, and kept the text around it (this is one case where Trump does a good job of describing what's in the link).

  5. Finally, I made substitutions for words that contain a period, such as acronyms or titles ("U.S.", "Jr."), and added spaces between words and punctuation. This step was required for my model, which treats punctuation as another 'word' and uses the period as a cue to stop moving forward along the Markov chain.
    All scripts for these steps are in data_processing/Tweets
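The punctuation-spacing substitution in step 5 can be sketched with regular expressions. This is a minimal illustration, not the repo's actual script; the acronym map and function name are my own:

```python
import re

# Map period-containing tokens to period-free forms first, so the
# later punctuation split does not break them apart.
ACRONYMS = {"U.S.": "US", "Jr.": "Jr", "Mr.": "Mr"}

def tokenize_tweet(text):
    for dotted, plain in ACRONYMS.items():
        text = text.replace(dotted, plain)
    # Put spaces around punctuation so each mark becomes its own 'word'.
    text = re.sub(r"([.,!?;:])", r" \1 ", text)
    return text.split()

print(tokenize_tweet("Mr. Trump loves the U.S., believe me!"))
# → ['Mr', 'Trump', 'loves', 'the', 'US', ',', 'believe', 'me', '!']
```

With punctuation split out like this, a period is just another token, which is what lets the sentence generator treat it as a stopping cue.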

Debates

I scraped Times.com and nytimes.com for the transcripts of the four Republican debates. These were reasonably clean and required only step 5 listed above. The script for this is in data_processing/

Speeches and essays

I downloaded speeches from several different websites, and essays from Trump's official website. Both of these were processed using step 5 of the tweet-cleaning process.

The Model

Using this Trump corpus of ~250,000 words (14,000 unique), I built a bigram Markov model, where the probability of a word depends on the bigram preceding it.

Here's an example of a sentence generated by the bot, with the seed "hair":

[Image: "hair" response Markov-chain example]

I first look for a tuple which has "hair" at position 0. Then I randomly select a word from the list stored as that tuple's value. The list contains words as they occur in the corpus, i.e. "not" may appear in the list multiple times. So if the list contained 5 "not" and 10 "be", then P(not | (hair, may)) = 5/15 = 1/3. Therefore, uniform random sampling from this list is actually sampling according to frequency in the corpus.

For the initial part of the sentence, I constructed another dictionary, with reversed sentences. Traversing this chain in the forward direction gives us the initial sentence in reverse. Reversing this reversed sentence and combining with the forward piece gives the complete sentence.
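The chain construction and the forward generation step can be sketched as follows. The names and the toy corpus are illustrative, not the repo's actual code:

```python
import random
from collections import defaultdict

def build_chain(words):
    # Map each bigram (w1, w2) to the list of words that follow it.
    # Repeats are kept, so random.choice samples by corpus frequency.
    chain = defaultdict(list)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        chain[(w1, w2)].append(w3)
    return chain

def generate_forward(chain, w1, w2, max_len=30):
    # Walk the chain until a period (the stopping cue) or max_len.
    out = [w1, w2]
    while out[-1] != "." and len(out) < max_len:
        followers = chain.get((out[-2], out[-1]))
        if not followers:
            break
        out.append(random.choice(followers))
    return out

corpus = "my hair may not be perfect but it is mine .".split()
chain = build_chain(corpus)
print(" ".join(generate_forward(chain, "hair", "may")))
# → "hair may not be perfect but it is mine ."
```

The backward half of the sentence works the same way over a second chain built from the reversed word sequence, as described above.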

What about words not in the Vocabulary?

The model requires a seed to generate a sentence. If the input does not contain a word present in the vocabulary, then this model will not be able to generate a sentence. We can try to find words similar to the words in the input and see if they are present in the vocabulary. If they are, then we generate a sentence and we are done. We can get similar words easily with the help of Word Vectors!

I downloaded 400,000 50-dimensional pre-trained word vectors from Stanford's NLP project. I wrote a C++ program to find the 10 most similar words for each of the 400,000 words. Here's an example of similar words from my program:

[Image: similar words]

If none of the similar words are in the vocabulary, then the model utters a sentence from a set of canned responses, such as "Don't waste my time. You're fired!".
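The original nearest-neighbour search is a C++ program over the full GloVe vectors; the same idea in a Python sketch with toy 3-dimensional vectors (the real data is 50-dimensional):

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def most_similar(word, vectors, k=10):
    # Rank every other word by cosine similarity to `word`.
    ranked = sorted(
        (w for w in vectors if w != word),
        key=lambda w: cosine(vectors[word], vectors[w]),
        reverse=True,
    )
    return ranked[:k]

# Toy vectors for illustration only.
vectors = {
    "hair": [0.9, 0.1, 0.0],
    "wig":  [0.8, 0.2, 0.1],
    "golf": [0.0, 0.9, 0.4],
}
print(most_similar("hair", vectors, k=2))
# → ['wig', 'golf']
```

Since the list of neighbours is precomputed for all 400,000 words, the bot only needs a dictionary lookup at tweet time.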

Examples

[Images: Donald plays golf / Obama / Apprentice]

Trump's Favorite Words

Excluding stopwords, the figure below shows Trump's most frequently used words in his speeches and essays. I excluded tweets so that the corpus is comparable to the presidential speech transcripts used in the next figure.

[Image: Trump's favorite words]

Below are the most frequent words for an 'average US president'. These are the words in speech transcripts of all US presidents from Washington to Obama (from here).

The two sets of words are definitely very different, but how similar or dissimilar is Trump to the other presidents? To answer this question, I created 5000-dimensional tf-idf vectors from the presidents' speech collection, which contains more than 800,000 words in total. Using these tf-idf vectors, the cosine similarity between any two presidents, as well as between a president and Trump, can be calculated easily. The figure below is a plot of the cosine similarities between Trump and US presidents arranged by year.
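The tf-idf comparison can be sketched in pure Python. The corpora below are toy stand-ins; the real vectors are 5000-dimensional over the full speech collection:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: {name: list of words}. Returns {name: {word: tf-idf weight}}.
    n = len(docs)
    df = Counter()  # document frequency of each word
    for words in docs.values():
        df.update(set(words))
    vecs = {}
    for name, words in docs.items():
        tf = Counter(words)
        vecs[name] = {w: tf[w] * math.log(n / df[w]) for w in tf}
    return vecs

def cosine(u, v):
    # Cosine similarity between two sparse (dict) vectors.
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy word lists, purely for illustration.
docs = {
    "trump":   "great great wall jobs".split(),
    "obama":   "hope change jobs".split(),
    "lincoln": "union liberty nation".split(),
}
v = tfidf_vectors(docs)
print(cosine(v["trump"], v["obama"]) > cosine(v["trump"], v["lincoln"]))
# → True
```

The idf weighting downplays words that appear in every president's speeches, so the similarity is driven by each speaker's distinctive vocabulary.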

[Image: Trump vs. presidents]

Surprisingly, Trump is more similar to Clinton and Obama than to Bush. To compare these similarities with those between US presidents, consider the heatmap below. Darker pixels are more similar than lighter ones. Trump is about as similar to Clinton and Obama as Clinton and Obama are to the very early presidents.

[Image: president similarity heatmap]
