A tool for exporting the Twitter stream into a CSV file.
(sudo) gem install twitter_to_csv
Perspectives on Acne: What Twitter Can Teach Health Care Providers Shive M, Bhatt M, Cantino A, Kvedar J, Jethwani K., JAMA Dermatol. 2013;149(5):621-622. doi:10.1001/jamadermatol.2013.248.
Twitter: an opportunity for public health campaigns Mackenzie R Wehner, Mary-Margaret Chren, Melissa L Shive, Jack S Resneck, Sherry Pagoto, Andrew B Seidenberg, Eleni Linos, The Lancet 12 July 2014 (Volume 384 Issue 9938 Pages 131-132 DOI: 10.1016/S0140-6736(14)61161-2)
You might start by first running the script for a while to dump the Twitter stream into a JSON file:
twitter_to_csv --api-key <your twitter api key> --api-secret <your twitter api secret> \
--access-token <your twitter access token> --access-token-secret <your twitter access token secret> \
--json out.json --filter your,keywords,of,interest
Then, later, you could export to CSV:
twitter_to_csv --replay-from-file out.json --csv out.csv \
--fields text,created_at,user.name,retweeted_status.id,retweeted_status.favorited,...
Alternatively, you can always stream directly to CSV:
twitter_to_csv --api-key <your twitter api key> --api-secret <your twitter api secret> \
--access-token <your twitter access token> --access-token-secret <your twitter access token secret> \
--filter your,keywords,of,interest --csv out.csv \
--fields text,created_at,user.name,retweeted_status.id,retweeted_status.favorited,...
Twitter requires all API access over oAuth. Follow these instructions to get register and authorize a free Twitter API Application:
- Visit https://apps.twitter.com/app/new
- Enter something like "my twitter_to_csv" as the name, "Using twitter_to_csv Ruby gem" as the description, and "https://github.com/cantino/twitter_to_csv" as the website.
- Click "manage API keys"
- Copy down the "API key" and "API secret"
- Click "Create my access token" at the bottom of the page, refreshing until your new access token shows up at the bottom of the page. Copy down your "Access token" and "Access token secret".
- You're all set!
You may want to limit to Tweets that appear to be written in English.
twitter_to_csv --api-key <your twitter api key> --api-secret <your twitter api secret> \
--access-token <your twitter access token> --access-token-secret <your twitter access token secret> \
--require-english --fields ...
This filter isn't perfect and will have both false positives and false negatives, but it works fairly well.
You can extract URLs, Hashtags, and User Mentions from the tweet into their own columns by using --url-columns
, --hashtag-columns
, and --user-mention-columns
. For example, you could use --url-columns 3
to get up to 3 extracted URLs in their own columns.
Twitter To CSV can compute an average sentiment score for each tweet. Provide --compute-sentiment
to use this feature. The AFINN-111 valence database is used to look up the valence of each recognized word, then the average is computed, only considering words that have some known valence associated. That is, "I love cheese" only has one word with valence, "love" with a score of 3, so the average is 3. "I love cheese and like bread", on the other hand, has two words with valence, "love" (3) and "like" (2), and so has an average valence of (3 + 2) / 2, or 2.5. The library will break hyphenated words up and score them as separate words unless the whole thing has a single known valence.
Once you have a recorded Twitter stream, you can rollup retweets in various ways. Here is an example that collapses retweets into the retweet_count
field of the original tweet, only outputs tweets with at least 1 retweet, ignores retweets that happened more than 7 days after the original tweet, and outputs retweet count columns at half an hour, 2 hours, and 2 days after the original tweet:
twitter_to_csv --replay-from-file out.json -c out.csv \
--retweet-mode rollup \
--retweet-threshold 1 \
--retweet-window 7 \
--retweet-counts-at 0.5,2,48 \
--fields retweet_count,text
Note that all of the retweet features require you to --replay-from-file
because they parse the stream backwards. They WILL NOT function correctly when you're reading directly from the stream .
To select a specific window of time in a pre-recorded stream by created_at
, pass in --start
and --end
, for example:
twitter_to_csv --replay-from-file out.json \
--start "Mon Mar 07 07:42:22 +0000 2011" \
--end "Mon Mar 08 07:42:22 +0000 2011" \
...
Sometimes the Twitter API goes down. You can analyze a json output file to see where data gaps (of over 10 minutes, in this case) have occurred:
twitter_to_csv --replay-from-file out.json --analyze-gaps 10
If you're doing research on Twitter, it may make sense to construct composite binary variables predicated on the existence of certain tokens in a tweet. For example, you may want a variable called sf that is true if the tweet contains san francisco OR sf OR bay area (but not area by the bay). You could do that as follows:
twitter_to_csv --replay-from-file out.json \
--csv out.csv \
-w "sf: san francisco OR sf OR bay area" \
If you did want to allow area by the bay, you might need something like:
-w "sf: san francisco OR sf OR (bay AND area)"
Use twitter_to_csv --help
to see all available options:
Usage: twitter_to_csv [options]
These four fields are required. Please see the README to learn how to get them for your Twitter account.
--api-key KEY Twitter API key
--api-secret SECRET Twitter API secret
--access-token TOKEN Twitter access token
--access-token-secret SECRET Twitter access token secret
General settings:
-c, --csv FILE The CSV file to append to, or - for STDOUT
-j, --json FILE The JSON file to append to, or - for STDOUT
-f, --filter KEYWORDS Keywords to ask Twitter to filter on
-x, --fields FIELDS Fields to include in the CSV
--date-fields FIELD_NAMES Break these fields into separate numerical columns for weekday, day, month, your, hour, minute, and second.
-e, --require-english [STRATEGY] Attempt to filter out non-English tweets. This will have both false positives and false negatives.
The strategy can be either 'uld' to use the UnsupervisedLanguageDetection Ruby gem,
'lang' to use Twitter's guessed 'lang' attribute, or 'both' to only remove tweets that
both Twitter and ULD think are non-English. This is most conservative and is the default.
-v, --[no-]verbose Run verbosely
-r, --replay-from-file FILENAME Replay tweets from a JSON dump file
--analyze-gaps MINUTES Look at the stream and display gap information for gaps longer than MINUTES
--sample-fields NUMBER_OF_SAMPLES
Record NUMBER_OF_SAMPLES tweets and then print out all
of the field names seen. Use to find out what can be passed to.
--url-columns NUM_COLUMNS Extract up to NUM_COLUMNS urls from the status and include them in the CSV
--hash-columns NUM_COLUMNS Extract up to NUM_COLUMNS hashtags (#foo) from the status and include them in the CSV
--user-columns NUM_COLUMNS Extract up to NUM_COLUMNS user mentions (@foo) from the status and include them in the CSV
-s, --compute-sentiment Compute an average sentiment score for each status using the AFINN-111 sentiment dictionary
--compute-word-count Include a word count for each status in the output CSV
--normalize-source Return just the domain name from the Tweet source (i.e., tweetdeck, facebook)
--remove-quotes This option strips all double quotes from the output to help some CSV parsers.
--prefix-ids Prefix any field ending in _id or _id_str with 'id' to force parsing as a string in some programs.
-w "NAME:WORD AND WORD AND WORD",
--bool-word-field Create a named CSV column that is true when the word expression matches, false otherwise.
Word expressions are boolean expressions where neighboring words must occur sequentially
and you can use parentheses, AND, and OR to test for occurrence relationships. Examples:
keyword_any:tanning booth OR tanning booths OR tanningbooth
keyword_both:tanning AND booth
keyword_complex:tanning AND (booth OR bed)
This option can be used multiple times.
--start TIME Ignore tweets with a created_at earlier than TIME
--end TIME Ignore tweets with a created_at later than TIME
If you would like to do special retweet handling, use the following options.
For these to function, you must be using --replay-from-file. The replay will be performed in reverse.
--retweet-mode MODE Determine how to handle retweets
Options are just 'ROLLUP'
--retweet-threshold COUNT Only consider statuses with at least COUNT retweets
--retweet-window WINDOW Ignore retweets that occur beyond WINDOW days
Additionally, statuses where WINDOW days have not yet passed will be ignored.
--retweet-counts-at HOURS Output the number of retweets seen at specific times after the original tweet
-h, --help Show this message
--version Show version
Use --sample-fields 1000
to output the occurrence count of different Twitter fields, like so:
twitter_to_csv --api-key <your twitter api key> --api-secret <your twitter api secret> \
--access-token <your twitter access token> --access-token-secret <your twitter access token secret> \
--sample-fields 1000
Here's a list of fields and their occurrences in a 50,000 tweet dataset:
id 50000
id_str 50000
created_at 50000
text 50000
source 50000
truncated 50000
in_reply_to_status_id 50000
in_reply_to_status_id_str 50000
in_reply_to_user_id 50000
in_reply_to_user_id_str 50000
in_reply_to_screen_name 50000
user.id 50000
user.id_str 50000
user.name 50000
user.screen_name 50000
user.location 50000
user.url 50000
user.description 50000
user.protected 50000
user.followers_count 50000
user.friends_count 50000
user.listed_count 50000
user.created_at 50000
user.favourites_count 50000
user.utc_offset 50000
user.time_zone 50000
user.geo_enabled 50000
user.verified 50000
user.statuses_count 50000
user.lang 50000
user.contributors_enabled 50000
user.is_translator 50000
user.profile_background_color 50000
user.profile_background_image_url 50000
user.profile_background_image_url_https 50000
user.profile_background_tile 50000
user.profile_image_url 50000
user.profile_image_url_https 50000
user.profile_banner_url 41614
user.profile_link_color 50000
user.profile_sidebar_border_color 50000
user.profile_sidebar_fill_color 50000
user.profile_text_color 50000
user.profile_use_background_image 50000
user.default_profile 50000
user.default_profile_image 50000
user.following 50000
user.follow_request_sent 50000
user.notifications 50000
geo 48656
coordinates 48656
place 48664
contributors 50000
retweet_count 50000
favorite_count 50000
favorited 50000
retweeted 50000
filter_level 50000
lang 36041
entities.hashtags[].text 11024
entities.hashtags[].indices[] 11024
entities.user_mentions[].screen_name 22368
entities.user_mentions[].name 22368
entities.user_mentions[].id 22368
entities.user_mentions[].id_str 22368
entities.user_mentions[].indices[] 22368
retweeted_status.created_at 13959
retweeted_status.id 13959
retweeted_status.id_str 13959
retweeted_status.text 13959
retweeted_status.source 13959
retweeted_status.truncated 13959
retweeted_status.in_reply_to_status_id 13959
retweeted_status.in_reply_to_status_id_str 13959
retweeted_status.in_reply_to_user_id 13959
retweeted_status.in_reply_to_user_id_str 13959
retweeted_status.in_reply_to_screen_name 13959
retweeted_status.user.id 13959
retweeted_status.user.id_str 13959
retweeted_status.user.name 13959
retweeted_status.user.screen_name 13959
retweeted_status.user.location 13959
retweeted_status.user.url 13959
retweeted_status.user.description 13959
retweeted_status.user.protected 13959
retweeted_status.user.followers_count 13959
retweeted_status.user.friends_count 13959
retweeted_status.user.listed_count 13959
retweeted_status.user.created_at 13959
retweeted_status.user.favourites_count 13959
retweeted_status.user.utc_offset 13959
retweeted_status.user.time_zone 13959
retweeted_status.user.geo_enabled 13959
retweeted_status.user.verified 13959
retweeted_status.user.statuses_count 13959
retweeted_status.user.lang 13959
retweeted_status.user.contributors_enabled 13959
retweeted_status.user.is_translator 13959
retweeted_status.user.profile_background_color 13959
retweeted_status.user.profile_background_image_url 13959
retweeted_status.user.profile_background_image_url_https 13959
retweeted_status.user.profile_background_tile 13959
retweeted_status.user.profile_image_url 13959
retweeted_status.user.profile_image_url_https 13959
retweeted_status.user.profile_banner_url 11028
retweeted_status.user.profile_link_color 13959
retweeted_status.user.profile_sidebar_border_color 13959
retweeted_status.user.profile_sidebar_fill_color 13959
retweeted_status.user.profile_text_color 13959
retweeted_status.user.profile_use_background_image 13959
retweeted_status.user.default_profile 13959
retweeted_status.user.default_profile_image 13959
retweeted_status.user.following 13959
retweeted_status.user.follow_request_sent 13959
retweeted_status.user.notifications 13959
retweeted_status.geo 13728
retweeted_status.coordinates 13728
retweeted_status.place 13724
retweeted_status.contributors 13959
retweeted_status.retweet_count 13959
retweeted_status.favorite_count 13959
retweeted_status.entities.hashtags[].text 2438
retweeted_status.entities.hashtags[].indices[] 2438
retweeted_status.entities.urls[].url 361
retweeted_status.entities.urls[].expanded_url 361
retweeted_status.entities.urls[].display_url 361
retweeted_status.entities.urls[].indices[] 361
retweeted_status.favorited 13959
retweeted_status.retweeted 13959
retweeted_status.possibly_sensitive 916
retweeted_status.lang 13959
entities.urls[].url 3662
entities.urls[].expanded_url 3662
entities.urls[].display_url 3662
entities.urls[].indices[] 3662
possibly_sensitive 5339
entities.media[].id 1736
entities.media[].id_str 1736
entities.media[].indices[] 1736
entities.media[].media_url 1736
entities.media[].media_url_https 1736
entities.media[].url 1736
entities.media[].display_url 1736
entities.media[].expanded_url 1736
entities.media[].type 1736
entities.media[].sizes.thumb.w 1736
entities.media[].sizes.thumb.h 1736
entities.media[].sizes.thumb.resize 1736
entities.media[].sizes.large.w 1736
entities.media[].sizes.large.h 1736
entities.media[].sizes.large.resize 1736
entities.media[].sizes.small.w 1736
entities.media[].sizes.small.h 1736
entities.media[].sizes.small.resize 1736
entities.media[].sizes.medium.w 1736
entities.media[].sizes.medium.h 1736
entities.media[].sizes.medium.resize 1736
geo.type 1344
geo.coordinates[] 1344
coordinates.type 1344
coordinates.coordinates[] 1344
place.id 1336
place.url 1336
place.place_type 1336
place.name 1336
place.full_name 1336
place.country_code 1336
place.country 1336
place.bounding_box.type 1336
place.bounding_box.coordinates[][][] 1336
entities.media[].source_status_id 621
entities.media[].source_status_id_str 621
retweeted_status.entities.user_mentions[].screen_name 1379
retweeted_status.entities.user_mentions[].name 1379
retweeted_status.entities.user_mentions[].id 1379
retweeted_status.entities.user_mentions[].id_str 1379
retweeted_status.entities.user_mentions[].indices[] 1379
retweeted_status.entities.media[].id 609
retweeted_status.entities.media[].id_str 609
retweeted_status.entities.media[].indices[] 609
retweeted_status.entities.media[].media_url 609
retweeted_status.entities.media[].media_url_https 609
retweeted_status.entities.media[].url 609
retweeted_status.entities.media[].display_url 609
retweeted_status.entities.media[].expanded_url 609
retweeted_status.entities.media[].type 609
retweeted_status.entities.media[].sizes.thumb.w 609
retweeted_status.entities.media[].sizes.thumb.h 609
retweeted_status.entities.media[].sizes.thumb.resize 609
retweeted_status.entities.media[].sizes.medium.w 609
retweeted_status.entities.media[].sizes.medium.h 609
retweeted_status.entities.media[].sizes.medium.resize 609
retweeted_status.entities.media[].sizes.large.w 609
retweeted_status.entities.media[].sizes.large.h 609
retweeted_status.entities.media[].sizes.large.resize 609
retweeted_status.entities.media[].sizes.small.w 609
retweeted_status.entities.media[].sizes.small.h 609
retweeted_status.entities.media[].sizes.small.resize 609
retweeted_status.place.id 235
retweeted_status.place.url 235
retweeted_status.place.place_type 235
retweeted_status.place.name 235
retweeted_status.place.full_name 235
retweeted_status.place.country_code 235
retweeted_status.place.country 235
retweeted_status.place.bounding_box.type 235
retweeted_status.place.bounding_box.coordinates[][][] 235
retweeted_status.geo.type 231
retweeted_status.geo.coordinates[] 231
retweeted_status.coordinates.type 231
retweeted_status.coordinates.coordinates[] 231
retweeted_status.entities.media[].source_status_id 42
retweeted_status.entities.media[].source_status_id_str 42
place.attributes.street_address 2