Twitter Data Basics

Connect to Twitter’s API using Mike Kearney’s ‘rtweet’ package. Collect tweets by keyword, from a specific geographic location, or from a specific user’s timeline. Then do some basic plotting to explore the data, such as a geo-map and a word cloud. Important: this script assumes you have a Twitter developer account, which you can set up by following this tutorial.

Setup

# load libraries ----
library(rtweet)
library(tidyverse)
library(maps)
library(tidytext)
library(httpuv)
library(wordcloud)
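
If any of these packages aren’t installed yet, a one-time install.packages() call (not part of the original script) will pull them from CRAN before you load them:

# one-time setup: install any packages that are missing (assumes CRAN access)
install.packages(c("rtweet", "tidyverse", "maps", "tidytext", "httpuv", "wordcloud"))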

1 - Authenticate for Twitter API access

If you don’t know how to get these values, see this tutorial.

# first establish authentication
# store API keys (these are fake example values; replace with your own)
api_key <- "afYS4vbIlPAj096E60c4W1fiK"
api_secret_key <- "bI91kqnqFoNCrZFbsjAWHD4gJ91LQAhdCJXCj3yscfuULtNkuu"

# authenticate via web browser
token <- create_token(
  app = "YourApp",
  consumer_key = api_key,
  consumer_secret = api_secret_key)
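
As a quick sanity check (not part of the original script), you can confirm the token was registered and that the API responds before collecting anything:

# confirm rtweet is using the token you just created
get_token()

# a lightweight API call; if this returns rate-limit info, authentication worked
rate_limit(token, query = "search/tweets")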

2 - Search tweets and save them

(Or jump to step 3 if you’ve downloaded the “trump_tweets.csv” data directly)

# search for 5000 tweets sent from the US mentioning Trump
tweets <- search_tweets("#trump", geocode = lookup_coords("usa"), n = 5000)

# you could also pull tweets from a specific user's timeline
trump <- get_timelines("realdonaldtrump", n = 500)

# flatten list-columns and save as csv
save_as_csv(tweets, "trump_tweets.csv")
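
Before saving, it can help to glance at what came back (a minimal check, not in the original script):

# how many tweets were returned, and what do the first few look like?
nrow(tweets)
head(tweets$text)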

3 - In what states are people tweeting about Trump?

# if you didn't collect your own tweets, load the file of Trump tweets
tweets <- read_csv("trump_tweets.csv")

# plot on a map
# create lat/lng variables using all available tweet and profile geo-location data
# note: geo-tagged data is often sparse, so you may not get many lat/lng values;
# it depends on which users appear in your set
tweets <- lat_lng(tweets)

# plot state boundaries
par(mar = c(0, 0, 0, 0))
maps::map("state", lwd = .25)

# plot lat and lng points onto the state map
with(tweets, points(lng, lat, pch = 20, cex = .75, col = rgb(0, .3, .7, .75)))
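
If you want an actual tally per state rather than just a scatter of points, one option (a sketch, not part of the original script) is to look up which state polygon each point falls in with maps::map.where() and count:

# assign each geo-tagged tweet to a state polygon and count tweets per state
# (illustrative; keeps only rows that got lat/lng from lat_lng() above)
tweets_geo <- tweets %>% filter(!is.na(lat), !is.na(lng))
tweets_geo$state <- maps::map.where("state", tweets_geo$lng, tweets_geo$lat)
count(tweets_geo, state, sort = TRUE)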

4 - Explore the most-used words in the dataset and construct a word cloud

# remove URLs
tweets$text <- gsub("https\\S*", "", tweets$text)

# remove "@username" tags
tweets$text <- gsub("@\\w+", "", tweets$text)

# put data into tidy text format - note we use token = "tweets" for Twitter-specific text preprocessing
tweets_tokens <- tweets %>% 
  unnest_tokens(word, text, token = "tweets") %>% 
  # remove numbers
  filter(!str_detect(word, "^[0-9]*$")) %>%
  # remove stop words
  anti_join(stop_words) %>%
  # stem the words
  mutate(word = SnowballC::wordStem(word))

# plot
wordcloud(tweets_tokens$word, min.freq = 200)

You can take out commonly occurring expressions that aren’t of interest (here ‘amp’, a leftover from HTML-encoded ‘&’) to make a cleaner word cloud, using something like this:

tweets_tokens_trim <- tweets_tokens %>% filter(word != 'amp')

# replot
wordcloud(tweets_tokens_trim$word, min.freq = 200)
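
If you want more control over the cloud, you can precompute word frequencies with count() and pass them to wordcloud() explicitly (a variation on the step above, not a required one):

# count each stemmed word once, then plot with explicit frequencies
word_counts <- tweets_tokens_trim %>% count(word, sort = TRUE)
wordcloud(word_counts$word, word_counts$n, min.freq = 200, random.order = FALSE)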
