---
title: "Template Code for Scraping Tweets with rtweet"
author: "Emily J. Rollinson"
date: "July 31 2022"
output: html_document
---
# rtweet 0.7.0 method
## Download tweets
```{r, echo=FALSE}
# The comments below show how to use rtweet to download Twitter data. You will need to create an app at https://dev.twitter.com/ to get the Twitter API OAuth values that go in the spaces indicated below. This code must be run to generate the csv used in creating this document, but should be commented out before knitting the final product.
# Please note that this code works with rtweet 0.7.0 as of 7/28/2022, but the same code fails with rtweet >= 1.0.0. The data format produced by search_tweets() changed dramatically as of 1.0.0 (https://ropensci.org/blog/2022/07/21/rtweet-1-0-0/), so the entirety of the code below needs to be rewritten to accommodate that. Notably, the downloaded tweets can no longer be saved directly as a csv.
#load libraries
library(rtweet)
library(tidyverse)
library(lubridate)
library(tidytext)
#pass your keys to API
# appname <- "YourAppName"
# key <- "YourKey"
# secret <- "YourSecretKey"
# access_token <- "YourAccessToken"
# access_secret <- "YourSecretAccessToken"
twitter_token <- create_token(
  app = appname,
  consumer_key = key,
  consumer_secret = secret,
  access_token = access_token,
  access_secret = access_secret
)
# hashtag <- "#YourHashtag"
# filename <- "yourfilename.csv"
tweets <- search_tweets(q = hashtag, n = 10000, retryonratelimit = TRUE)
save_as_csv(tweets, filename)
```
## Use rbind() to combine repeated scrapes into one master list
```{r}
# Add more files as they are saved - uncomment and edit as needed.
# First check the length and date range of each file to see whether merging is necessary. Depending on the date range of interest and the number of tweets on the hashtag, a single scrape may contain all the tweets of interest.
# It's best to avoid merging multiple scrapes if possible; otherwise you need to make sure that, among duplicate records, the one with the most up-to-date fave/RT counts is the one kept by distinct() below.
# However, it is always a good idea to scrape and save regularly because of the limits of API access, and to decide later whether a merge is necessary.
# l1<-read.csv("file1.csv")
# l2<-read.csv("file2.csv")
# l3<-read.csv("file3.csv")
# l4<-read.csv("file4.csv")
# l5<-read.csv("file5.csv")
# l6<-read.csv("file6.csv")
# etc.
#combine, add more as they are saved
alltweets <- rbind(l1, l2, l3, l4, l5, l6)
# Sort by status_id, then by descending retweet count, so that if the same tweet appears in multiple concatenated files from repeated scrapes, distinct() keeps the copy with the most RTs
cleaned <- alltweets %>%
  arrange(status_id, desc(retweet_count)) %>%
  distinct(status_id, .keep_all = TRUE)
cleancopy <- cleaned
#may be better to revise the above to explicitly label each scraped file with the date of the scrape and then keep the most recent version, but this works well enough
cleancopy$created_at <- as.POSIXct(cleancopy$created_at, format="%Y-%m-%d %H:%M:%S", tz="GMT")
cleancopy$created_at <- with_tz(cleancopy$created_at, "America/New_York")
#to get a specific time span for plotting
# startdate <- "YYYY-MM-DD HH:MM:SS"
# enddate <- "YYYY-MM-DD HH:MM:SS"
confweek <- cleancopy %>%
  filter(
    created_at > as.POSIXct(startdate, tz = "America/New_York"),
    created_at < as.POSIXct(enddate, tz = "America/New_York")
  )
# combinedfile <- "overallfilename.csv"
write.csv(confweek, combinedfile)
```
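The comment above suggests labeling each scraped file with its scrape date and keeping the most recent copy of each tweet. Here is a minimal sketch of that idea on toy data. The `scrape_date` column and the toy values are hypothetical and are not part of the rtweet output. This has not been tested against real scrapes.

```r
library(dplyr)

# Toy stand-ins for two scrapes of the same hashtag (hypothetical values).
scrape1 <- data.frame(status_id = c("1", "2"), retweet_count = c(5, 3),
                      stringsAsFactors = FALSE)
scrape2 <- data.frame(status_id = c("2", "3"), retweet_count = c(1, 7),
                      stringsAsFactors = FALSE)

# scrape_date is a hypothetical column recording when each file was downloaded.
scrape1$scrape_date <- as.Date("2022-07-20")
scrape2$scrape_date <- as.Date("2022-07-27")

latest <- bind_rows(scrape1, scrape2) %>%
  arrange(status_id, desc(scrape_date)) %>%  # newest copy of each tweet first
  distinct(status_id, .keep_all = TRUE)      # keep only that newest copy

print(latest)
```

In this toy example, tweet "2" lost retweets between scrapes, so sorting by scrape date keeps a different record than sorting by retweet count would.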
# rtweet >= 1.0.0 method (in progress)
rtweet 1.0.0 and later made [many breaking changes](https://ropensci.org/blog/2022/07/21/rtweet-1-0-0/) that make the approach above nonfunctional.
This section is a draft approach to getting tweets using this updated library, but is not yet complete. It's also possible all of this will change again with the release of Twitter API v2 - more breaking changes seem to be expected ~November 2022. It also seems that future versions of rtweet are intended to provide some functions to help extract the nested data being returned by this version.
## Setting up authentication
In this version of rtweet, create_token() is deprecated. Instead, authentication should occur via browser - the first time an rtweet function runs, it should open a browser window to allow authentication via your personal Twitter account. However, the browser popup may not happen if you already have an authentication token stored (e.g., from running create_token()).
On Windows, the token is likely stored in `C:\Users\YourUsername\AppData\Roaming\R\config\R\rtweet` if it needs to be located in order to delete it and start over. This location can (I think?) be found with `tools::R_user_dir("rtweet", "config")`. For issues with previous tokens, use `auth_sitrep()`, which will find prior tokens and move them to this location. `auth_list()` shows stored credentials.
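As a rough sketch, the storage location can be checked from R without loading rtweet. This assumes R >= 4.0.0 (for `tools::R_user_dir()`) and that rtweet really does keep its credentials in that directory. The names `auth_dir` and `stored` are just for illustration.

```r
# Directory where rtweet >= 1.0.0 is assumed to keep saved credentials.
auth_dir <- tools::R_user_dir("rtweet", "config")
print(auth_dir)

# List any stored credential files. This returns character(0) if there are
# none, or if the directory has not been created yet.
stored <- list.files(auth_dir, full.names = TRUE)
print(stored)

# To start over, a stale file could then be removed by hand, for example:
# file.remove(stored[1])
```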
```{r}
#Options for authentication - choose whichever ONE makes sense. More info at https://docs.ropensci.org/rtweet/articles/auth.html
#auth_setup_default() #this uses authentication that comes with rtweet; for one-off uses of the package, this should be fine
#rtweet_user() # will open browser window to authenticate access to twitter API via your personal Twitter account; you will need to be logged into that account in the browser already. This approach avoids limits on number of requests and rate limits.
#rtweet_app() #For very regular/heavy use of the package, this is the best option. This requires registering your account as a developer with Twitter, and then entering your authentication tokens in the window that pops up when you run this line.
```
## Getting data
```{r}
#Replace the string below with your search term of choice (demonstrated here with #rstats) and the number of tweets desired. Remember, this can only access the past ~6-10 days of data, and a limited number of tweets.
tweets <- search_tweets("#rstats", n = 100, retryonratelimit = TRUE)
#This produces an object with 43 columns (instead of the 91 returned by rtweet 0.7.0). Many of these columns contain nested lists (which was also not true for rtweet 0.7.0). This type of object cannot be saved to a csv. The desired columns would need to be extracted to a new data frame before saving as a csv is possible. The downloaded object can instead be saved as an RDS for later processing:
library(readr)
write_rds(tweets, "savedtweets.rds")
imported <- read_rds("savedtweets.rds")
```
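One possible way to get something csv-friendly is to drop the list columns before saving. The sketch below uses a toy tibble rather than a real download. The column names and values are hypothetical stand-ins, and the real object may need more care than this.

```r
library(dplyr)

# Toy stand-in for a search_tweets() result (hypothetical values):
# two atomic columns plus one nested list column.
toy_tweets <- tibble(
  created_at = as.POSIXct(c("2022-07-28 12:00:00", "2022-07-28 13:30:00"),
                          tz = "UTC"),
  retweet_count = c(2L, 0L),
  entities = list(list(hashtags = "rstats"), list(hashtags = character(0)))
)

# Keep only the columns that are not lists; these can be written to a csv.
flat <- toy_tweets %>% select(where(~ !is.list(.x)))

out_file <- tempfile(fileext = ".csv")
write.csv(flat, out_file, row.names = FALSE)
```

This throws away everything stored in the nested columns, so it is only useful when the needed fields are already top-level columns.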
## Extracting desired columns from downloaded tweets
```{r}
#necessary information for the current conference tweets summary page: created_at, is_retweet, screen_name, retweet_screen_name, followers_count, retweet_followers_count, mentions_screen_name, favorite_count, retweet_count, status_url
users <- users_data(tweets) #contains screen_name and followers_count
#a lot of necessary information is hidden within entities; unnest_longer() or hoist() may be the way to go here
```
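A sketch of how `hoist()` and `unnest_longer()` might pull mentioned screen names out of a nested column. The toy `entities` structure below is an assumption made for illustration and may not match what rtweet actually returns, so the path passed to `hoist()` would need to be checked against a real download.

```r
library(tidyr)
library(tibble)

# Toy data: one row per tweet, with mentions nested inside entities.
# This structure is hypothetical.
toy <- tibble(
  id_str = c("1", "2"),
  entities = list(
    list(user_mentions = list(screen_name = c("a_user", "b_user"))),
    list(user_mentions = list(screen_name = "c_user"))
  )
)

# Pull entities -> user_mentions -> screen_name up into its own list column.
# .remove = FALSE leaves the original entities column untouched.
hoisted <- hoist(
  toy, entities,
  mentions_screen_name = list("user_mentions", "screen_name"),
  .remove = FALSE
)

# One row per mentioned account.
mentions <- unnest_longer(hoisted, mentions_screen_name)
print(mentions[, c("id_str", "mentions_screen_name")])
```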