I wanted to run a meaningful analysis of @realDonaldTrump's frequently used words when he was tweeting as presidential candidate, president-elect and president.
Here's how I did it. (And here's what I found out.)
Scraping and cleaning tweets
- Got Twitter OAuth access via apps.twitter.com
- Ran a scraping script in R, using twitteR, stringr and other packages, to collect the tweets for further analysis. Here's the script I used.
- After running the script with two separate packages and Twitter API keys, I found that the scrape did not return the complete collection of tweets I wanted: not only did it return fewer than 3,200 tweets, it also left out numerous tweets from certain time periods (for example, tweets between 29 September and 5 October 2016 were completely missing).
- I then ran a Python script forked from bpb27 with a separate Twitter API key, to see if my scrape would be better. All the scripts I used can be found in my fork of bpb27's repo, and I've pasted the scripts from my scrape in part 1 of this repo.
- Oddly enough, this scrape was missing tweets too (those between 11 March and 15 March 2017, for instance, were completely absent). At this point I started to think it was my Twitter API keys, rather than the scripts, acting up. Sad!
- To resolve this issue, I merged the data from both the R and Python scrapes, and removed the duplicates to get the final collection. Because of the many different special characters in the tweets, I had to manually go through every tweet again to find duplicates.
- I divided the final dataset into three time frames: tweets as candidate, from 6 October 2016 to 8 November 2016; as president-elect, from 9 November 2016 to 20 January 2017; and as president, from 20 January 2017 to 15 March 2017. This gave me 160 days' worth of tweets to examine. The csv files are in part 2 of this repo, here, here and here.
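The merge, de-duplication and time-frame steps above could be sketched roughly as follows in Python. This is not the script used for this project: the field names `created` and `text`, the helper names, and the idea of matching duplicates on normalised text are all assumptions made for illustration. Normalising first (unescaping HTML entities, folding Unicode variants, collapsing whitespace) is one way to catch duplicates that differ only in special characters, though it would not replace a manual check.

```python
import html
import re
import unicodedata
from datetime import date

# Date ranges as given in the text. 20 January 2017 appears in two ranges;
# this sketch assigns it to the first match, which is an arbitrary choice.
PERIODS = [
    ("candidate", date(2016, 10, 6), date(2016, 11, 8)),
    ("president-elect", date(2016, 11, 9), date(2017, 1, 20)),
    ("president", date(2017, 1, 20), date(2017, 3, 15)),
]


def normalise(text):
    """Build a comparison key so near-identical tweets compare equal."""
    text = html.unescape(text)                  # "&amp;" becomes "&"
    text = unicodedata.normalize("NFKC", text)  # fold Unicode variants
    return re.sub(r"\s+", " ", text).strip().lower()


def merge_scrapes(*scrapes):
    """Combine several scrapes, keeping the first copy of each tweet.

    Each scrape is assumed to be a list of dicts with a `created`
    datetime.date and a `text` string (hypothetical field names).
    """
    seen = set()
    merged = []
    for scrape in scrapes:
        for tweet in scrape:
            key = normalise(tweet["text"])
            if key not in seen:
                seen.add(key)
                merged.append(tweet)
    return sorted(merged, key=lambda t: t["created"])


def split_by_period(tweets):
    """Sort tweets into the three time frames; out-of-range tweets are dropped."""
    buckets = {name: [] for name, _, _ in PERIODS}
    for tweet in tweets:
        for name, start, end in PERIODS:
            if start <= tweet["created"] <= end:
                buckets[name].append(tweet)
                break
    return buckets
```

Calling `split_by_period(merge_scrapes(r_tweets, python_tweets))` would return one list per time frame, ready to be written out as separate csv files. Matching on text alone would also collapse any tweets posted twice with identical wording, so a tweet ID would be a safer key if both scrapes kept it.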
Examining data
- Ran text mining and wordcloud programmes on the scraped tweets, now sorted into separate csv files according to the time frame. I found eight2late's tutorial the clearest and the most useful guide.
- The scripts are in part 3 of this repo, or here, here and here.
- Downloaded data as csv files for further cleaning and analysis. The cleaned files are in part 4 of this repo, or here: top words as Candidate, as President-elect and as President.
- Created a separate Excel file highlighting mentions of the media (e.g. "news", "fake", "nytimes") during these three time periods. Here it is.
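The text mining for this project was done with the R scripts linked above. As a rough illustration of the same idea in Python, the sketch below counts the most frequent words in a list of tweet texts and tallies media-related terms. The stop-word list is a small placeholder, the function names are hypothetical, and only the three media terms quoted above are included, since the full list behind the Excel file is not given here.

```python
import re
from collections import Counter

# Placeholder stop-word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {
    "the", "a", "an", "and", "to", "of", "in", "is", "for", "on",
    "at", "it", "be", "will", "with", "amp", "rt",
}

# The three examples quoted in the text; the complete list is not known.
MEDIA_TERMS = ("news", "fake", "nytimes")


def tokenise(text):
    """Lowercase a tweet, drop URLs, and split it into word tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.findall(r"[a-z0-9_]+(?:'[a-z]+)?", text)


def top_words(texts, n=20):
    """Return the n most frequent non-stop-words as (word, count) pairs."""
    counts = Counter(
        token
        for text in texts
        for token in tokenise(text)
        if token not in STOP_WORDS
    )
    return counts.most_common(n)


def media_mentions(texts, terms=MEDIA_TERMS):
    """Count how often each media-related term appears across the tweets."""
    counts = Counter(token for text in texts for token in tokenise(text))
    return {term: counts[term] for term in terms}
```

Running `top_words` and `media_mentions` once per time frame would give the kind of per-period tables described above. Because `@` is not part of a token, a handle such as `@nytimes` is counted under `nytimes`.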
Presenting data
- Ran ggplot scripts to visualise top mentions per period. The scripts are in part 5 of this repo: here, here and here. The final jpegs of the graphs are here, here and here.
- Created interactive presentation of the graphs: script here.
- Created interactive line graph with JavaScript and Highcharts, following the Highcharts documentation, to visualise mentions of the media during each period: script here.
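The charts themselves were built with the R and JavaScript scripts linked above. As a hedged sketch of how per-period media counts might be shaped for a Highcharts line graph, the Python helper below builds an options object and leaves the rendering to the page. The function name and argument layout are assumptions; the keys it emits (`chart.type`, `title.text`, `xAxis.categories`, `yAxis.title.text`, `series[].name`, `series[].data`) are standard Highcharts options.

```python
import json


def line_chart_options(title, periods, mentions):
    """Build a Highcharts-style options dict for a line chart.

    periods:  list of x-axis labels, one per time frame.
    mentions: dict mapping a term to a list of counts, one per period.
    """
    for term, counts in mentions.items():
        if len(counts) != len(periods):
            raise ValueError(f"{term!r} needs one count per period")
    return {
        "chart": {"type": "line"},
        "title": {"text": title},
        "xAxis": {"categories": list(periods)},
        "yAxis": {"title": {"text": "Mentions"}},
        "series": [
            {"name": term, "data": list(counts)}
            for term, counts in mentions.items()
        ],
    }


def to_json(options):
    """Serialise the options so they can be embedded in a web page."""
    return json.dumps(options, indent=2)
```

The JSON from `to_json` could then be passed to `Highcharts.chart` in the page's JavaScript, with one series per media term and one point per time frame.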
The finishing touches