A meaningful scrape of realDonaldTrump tweets as candidate, PEOTUS and POTUS

alexandrama/scraping-twitter-for-mentions

@realDonaldTrump's top mentions

I wanted to run a meaningful analysis of @realDonaldTrump's frequently used words when he was tweeting as presidential candidate, president-elect and president.

Here's how I did it. (And here's what I found out.)

Scraping and cleaning tweets

  • Got Twitter OAuth access via apps.twitter.com
  • Ran a scraping script in R, using stringr, twitteR and other packages, to collect the tweets for further analysis. Here's the script I used.
  • After running the script with two separate packages and Twitter API keys, I found that the package/Twitter API failed to return the complete collection of tweets I wanted: not only did it return fewer than 3,200 tweets, it also left out numerous tweets from certain time periods (for example, tweets between 29 September and 5 October 2016 were completely missing).
  • I then ran a Python script forked from bpb27 with a separate Twitter API key, to see whether it would produce a more complete scrape. All the scripts I used can be found in my fork of bpb27's repo, and I've pasted the scripts from my scrape in part 1 of this repo.
  • Oddly enough, tweets were also missing from this scrape (those between 11 March and 15 March 2017, for instance, were completely missing). At this point I started to think it was my Twitter API keys, rather than the scripts, acting up. Sad!
  • To resolve this issue, I merged the data from the R and Python scrapes and removed the duplicates to get the final collection. Because of the many different special characters in the tweets, I had to go through every tweet again manually to find duplicates.
  • I divided the final dataset into three time frames: tweets as candidate, from 6 October 2016 to 8 November 2016; as president-elect, from 9 November 2016 to 20 January 2017; and as president, from 20 January 2017 to 15 March 2017. This gave me 160 days' worth of tweets to examine. The csv files are in part 2 of this repo, here, here and here.
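
For anyone who wants to repeat the merge and split steps above without sorting by hand, here is a rough Python sketch. It is not one of the scripts in this repo. It assumes each scrape has already been loaded as a list of (date, text) rows, and the names `normalise`, `merge_scrapes` and `period_of` are made up for this example. The idea is to compare tweets on a normalised copy of the text, so that differences in special characters (HTML entities, Unicode variants, extra whitespace) don't hide duplicates.

```python
import html
import re
import unicodedata
from datetime import date


def normalise(text):
    """Build a comparison key so near-identical copies of a tweet match."""
    text = html.unescape(text)                  # "&amp;" becomes "&"
    text = unicodedata.normalize("NFKC", text)  # unify Unicode variants
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip().lower()


def merge_scrapes(*scrapes):
    """Merge lists of (date, text) rows, keeping the first copy of each tweet."""
    seen = set()
    merged = []
    for scrape in scrapes:
        for created, text in scrape:
            key = (created, normalise(text))
            if key not in seen:
                seen.add(key)
                merged.append((created, text))
    return sorted(merged, key=lambda row: row[0])


# Date ranges as listed above. 20 January 2017 appears in two ranges;
# in this sketch the first match wins, so it lands in "president-elect".
# That tie-break is an assumption, not something taken from the repo.
PERIODS = [
    ("candidate", date(2016, 10, 6), date(2016, 11, 8)),
    ("president-elect", date(2016, 11, 9), date(2017, 1, 20)),
    ("president", date(2017, 1, 20), date(2017, 3, 15)),
]


def period_of(created):
    """Return the period name for a date, or None if it is out of range."""
    for name, start, end in PERIODS:
        if start <= created <= end:
            return name
    return None
```

Calling `merge_scrapes(r_rows, python_rows)` and then grouping the result by `period_of(created)` would give the three collections. Keying on the date plus the normalised text is a simplification; if both scrapes kept tweet IDs, matching on those would be more reliable.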

Examining data

  • Ran text mining and wordcloud programmes on the scraped tweets, now sorted into separate csv files according to the time frame. I found eight2late's tutorial the clearest and the most useful guide.
  • The scripts are in part 3 of this repo, or here, here and here.
  • Downloaded data as csv files for further cleaning and analysis. The cleaned files are in part 4 of this repo, or here: top words as Candidate, as President-elect and as President.
  • Created a separate Excel file highlighting mentions of the media (e.g. "news", "fake", "nytimes") during these three time periods. Here it is.
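
The counting behind the steps above can be approximated in a few lines of Python. This is only a sketch and a stand-in for the R text mining workflow, not a copy of it: the stopword list is a tiny made-up sample, the tokeniser is a simple regular expression, and the function names `top_words` and `media_mentions` are hypothetical. The media terms are the three examples mentioned above.

```python
import re
from collections import Counter

# A deliberately small sample stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "on",
             "for", "at", "be", "will", "i", "it", "with"}

# Lowercase words, digits and apostrophes; "@" and "#" are dropped.
TOKEN = re.compile(r"[a-z0-9']+")

MEDIA_TERMS = ("news", "fake", "nytimes")


def top_words(tweets, n=10):
    """Return the n most frequent non-stopword tokens across the tweets."""
    counts = Counter()
    for tweet in tweets:
        text = re.sub(r"https?://\S+", " ", tweet.lower())  # strip links
        counts.update(w for w in TOKEN.findall(text) if w not in STOPWORDS)
    return counts.most_common(n)


def media_mentions(tweets, terms=MEDIA_TERMS):
    """Count how often each media-related term appears as a whole word."""
    counts = Counter()
    for tweet in tweets:
        words = TOKEN.findall(tweet.lower())
        for term in terms:
            counts[term] += words.count(term)
    return dict(counts)
```

Running both functions once per period (candidate, president-elect, president) would give tables comparable in shape to the top-words csv files and the media-mentions spreadsheet, though the exact numbers would depend on the cleaning rules used.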

Presenting data

  • Ran ggplot scripts to visualise top mentions per period. The scripts are in part 5 of this repo: here, here and here. The final jpegs of the graphs are here, here and here.
  • Created interactive presentation of the graphs: script here.
  • Created interactive line graph with JavaScript, following the Highcharts documentation, to visualise mentions of the media during each period: script here.

The finishing touches
