
This is a fork of twitter-scraping by bpb27, adding functionality from marsam/tweetshot.js.

Super quick TL;DR for people who know what they're doing

  1. Copy sample_api_keys.json to api_keys.json and enter your API keys. Scripts require selenium, tweepy, pandas, phantomjs (node), and a webdriver in your $PATH.
  2. Edit and choose a user to scrape near the top of the file.
  3. Create a time window in which to store data by editing the start and end variables. Shorter windows work best; a year or less generally works well for most active users.
  4. Run python3 when finished, choose another date window if desired and run again. (t: ~2-10 seconds per day of activity)
  5. Edit the user variable in, then run python3 (t: ~6 seconds for every 100 tweets)
  6. Edit the user variable in, then run python3 (t: ~2 seconds per tweet)
  7. CSV data will be in the data/ folder and screenshots will be in img/

Twitter Scraper

Twitter's API returns only a user's most recent 3,200 or so tweets, which makes it hard to get a complete history for anyone who has posted more than that. This is a way to get around that limit using Python, Selenium, and Tweepy.

Essentially, we will use Selenium to open up a browser and automatically visit Twitter's search page, searching for a single user's tweets on a single day. If we want all tweets from 2015, we will check all 365 days / pages. This would be a nightmare to do manually, so the script does it all for you - all you have to do is input a date range and a twitter user handle, and wait for it to finish.
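As a rough illustration of the "one search per day" idea, the sketch below builds one query string per day of a date range. It uses Twitter's from:, since:, and until: search operators; the function name daily_queries and the handle example_user are hypothetical, and the exact query the script sends may differ.

```python
from datetime import date, timedelta


def daily_queries(user, start, end):
    """Yield one search query per day in the half-open range [start, end)."""
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        # Assumed query shape: one user's tweets within a single day
        yield f"from:{user} since:{day.isoformat()} until:{next_day.isoformat()}"
        day = next_day


# All of 2015 means one query (one search page) per day of the year
queries = list(daily_queries("example_user", date(2015, 1, 1), date(2016, 1, 1)))
print(len(queries))
print(queries[0])
```

Using a half-open range means consecutive runs with adjacent windows (for example, 2015 followed by 2016) do not search the boundary day twice.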

The script collects tweet ids. If you know a tweet's id number, you can get all the information available about that tweet using Tweepy - text, timestamp, number of retweets / replies / favorites, geolocation, etc. Tweepy uses Twitter's API, so you will need to get API keys. Once you have them, you can run the script.
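Looking up metadata by id is usually done in batches rather than one tweet at a time. The sketch below shows only the batching logic, with a stand-in lookup function so it runs without API keys; the names chunked, fetch_all, and fake_lookup are hypothetical, and the batch size of 100 is an assumption based on the per-request limit of Twitter's status lookup endpoint.

```python
def chunked(items, size=100):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_all(ids, lookup, batch_size=100):
    """Call `lookup` once per batch of ids and collect the results in order."""
    results = []
    for batch in chunked(ids, batch_size):
        results.extend(lookup(batch))
    return results


# Stand-in for a real API call (assumption): returns one dict per id
def fake_lookup(batch):
    return [{"id": tweet_id} for tweet_id in batch]


tweets = fetch_all(list(range(250)), fake_lookup)
print(len(tweets))
```

With Tweepy 3.x, the lookup argument could plausibly be an authenticated client's statuses_lookup method (renamed lookup_statuses in Tweepy 4); whether the script calls it this way is not confirmed here.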


Requirements

  • basic knowledge of how to use a terminal
  • Safari 10+ with 'Allow Remote Automation' option enabled in Safari's Develop menu to control Safari via WebDriver.
  • python3
    • to check, in your terminal, enter python3
    • if you don't have it, check YouTube for installation instructions
  • pip or pip3
    • to check, in your terminal, enter pip or pip3
    • if you don't have it, again, check YouTube for installation instructions
  • selenium (3.0.1)
    • pip3 install selenium
  • tweepy (3.5.0)
    • pip3 install tweepy
  • phantomjs (for taking screenshots)

Running the scraper

  • open up and edit the user, start, and end variables (and save the file)
  • run python3
  • you'll see a browser pop up and output in the terminal
  • do some other fun task until it finishes
  • once it's done, it outputs all the tweet ids it found into all_ids.json
  • every time you run the scraper with different dates, it will add the new ids to the same file
    • it automatically removes duplicates so don't worry about small date overlaps
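The append-and-deduplicate behavior described above can be sketched as follows. This is a minimal illustration, assuming all_ids.json holds a flat JSON list of id strings; the function name merge_ids is hypothetical and the script's actual implementation may differ.

```python
import json
import os


def merge_ids(path, new_ids):
    """Add new_ids to the JSON list stored at path, dropping duplicates."""
    existing = []
    if os.path.exists(path):
        with open(path) as f:
            existing = json.load(f)
    # A set union removes ids that appear in both runs (overlapping dates)
    merged = sorted(set(existing) | set(new_ids))
    with open(path, "w") as f:
        json.dump(merged, f)
    return merged
```

Calling merge_ids("all_ids.json", ids) after each run keeps a single growing file, so two date windows that overlap by a few days contribute each id only once.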

Troubleshooting the scraper

  • do you get a no such file error? you need to cd to the directory of
  • do you get a driver error when you try to run the script?
    • open and change the driver to use Safari() or Firefox()
      • if neither works, search the web for the error message (you probably need to install a new driver)
  • does it seem like it's not collecting tweets for days that have tweets?
    • open and change the delay variable to 2 or 3

Getting the metadata

  • first you'll need to get twitter API keys
  • put your keys into the sample_api_keys.json file
  • change the name of sample_api_keys.json to api_keys.json
  • open up and edit the user variable (and save the file)
  • run python3
  • this will get metadata for every tweet id in all_ids.json
  • it will create 4 files
    • username.json (master file with all metadata)
    • (a zipped file of the master file with all metadata)
    • username_short.json (smaller master file with relevant metadata fields)
    • username.csv (csv version of the smaller master file)
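To make the four outputs concrete, here is a hedged sketch of how a list of tweet dictionaries could be written as a master JSON file, a zipped copy, a trimmed JSON file, and a CSV. The function name write_outputs, the archive name username.zip, and the choice of "relevant" fields (id_str, created_at, text) are all assumptions for illustration, not the script's confirmed behavior.

```python
import csv
import json
import os
import zipfile


def write_outputs(user, tweets, out_dir=".", fields=("id_str", "created_at", "text")):
    """Write master, zipped, short, and CSV versions of the tweet metadata."""
    # Master file with all metadata
    master = os.path.join(out_dir, f"{user}.json")
    with open(master, "w") as f:
        json.dump(tweets, f)

    # Zipped copy of the master file (archive name is hypothetical)
    zipped = os.path.join(out_dir, f"{user}.zip")
    with zipfile.ZipFile(zipped, "w", zipfile.ZIP_DEFLATED) as z:
        z.write(master, arcname=os.path.basename(master))

    # Smaller file that keeps only the selected fields
    short = [{k: t.get(k) for k in fields} for t in tweets]
    short_path = os.path.join(out_dir, f"{user}_short.json")
    with open(short_path, "w") as f:
        json.dump(short, f)

    # CSV version of the smaller file
    csv_path = os.path.join(out_dir, f"{user}.csv")
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(fields))
        writer.writeheader()
        writer.writerows(short)

    return [master, zipped, short_path, csv_path]
```

Deriving the CSV from the trimmed records rather than the master file keeps the column set fixed even when individual tweets carry different optional fields.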

Getting the screenshots

  • for this part, you will need pandas, Node.js, and phantomjs:
    • pip3 install pandas
    • [package-manager on your distro] install nodejs
    • npm install -g phantomjs
  • Ensure that phantomjs is executable from your path. On a Windows machine you may need to copy the executable to the current working directory or create an environment variable that points to phantomjs. On UNIX-based systems you can create a symbolic link to phantomjs in the current working directory if needed.
  • Edit the user variable in, then run python3
    • the script will collect a screenshot of every tweet that has been indexed in the data file, which has path information for each image
    • expect a little weirdness with embedded videos, especially Vines; attached images may not be full-sized
    • there will be one image (~60 KB) per tweet. Context will be captured for replies. An option to skip this context is under development
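Before starting a long screenshot run, it can help to confirm that phantomjs is actually reachable. The sketch below checks the PATH first and then the current working directory, mirroring the advice above; the function name find_phantomjs is hypothetical and is not part of the scripts.

```python
import os
import shutil


def find_phantomjs(name="phantomjs"):
    """Return the path to the phantomjs executable, or None if it is not found."""
    # First choice: an executable that is reachable through PATH
    found = shutil.which(name)
    if found:
        return found
    # Fallback: a copy or symbolic link in the current working directory
    for candidate in (name, name + ".exe"):
        local = os.path.join(os.getcwd(), candidate)
        if os.path.isfile(local) and os.access(local, os.X_OK):
            return local
    return None


path = find_phantomjs()
print(path if path else "phantomjs not found on PATH or in the working directory")
```

If this prints the "not found" message, fix the PATH or add the copy or symlink described above before running the screenshot step.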


Script for archiving a user's Twitter data and backing up tweet screenshots in .png format






