dhudsmith/pull_twitter


Setup

Clone the repository and navigate into the repository directory.

Create the python virtual environment

# Note: can alternatively use python3.6 or python3.7
virtualenv venv -p python3.8

After this completes, activate the environment

source venv/bin/activate

Your terminal line should now start with (venv). You can deactivate the virtual environment at any time by running deactivate.

Finally, install the necessary dependencies

python -m pip install -r requirements.txt

Usage

Activate the virtual environment if it is not already active:

source venv/bin/activate

Fill in the contents of configs/template_config.yaml by following the comments. This file contains all global arguments that each command depends on.
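Since the config is a plain yaml file, it can be loaded and inspected from python before running anything. A minimal sketch, assuming PyYAML is available; output_dir is the only key documented in this README, so rely on the comments in template_config.yaml for everything else:

```python
# Sketch: load and inspect the global config used by every subcommand.
# Only output_dir is documented in this README; the remaining keys are
# described by the comments inside configs/template_config.yaml.
import yaml  # PyYAML, assumed to be installed in the environment

with open("configs/template_config.yaml") as f:
    config = yaml.safe_load(f)

print(config["output_dir"])  # base directory where all results will be written
```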

From within the repository folder, run the command (substituting the contents within <>)

python pull_twitter.py --config-file <path to config yaml file> <subcommand> <subcommand arguments>

A python interface is also available and detailed below

All data will be saved to the directory indicated by output_dir in the designated config file. Each subcommand is provided an independent subdirectory to save outputs, and all results are stored in timestamped directories within.
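To work with the newest run programmatically, you can sort the timestamped directories. A minimal sketch, assuming output_dir is set to "output" and the timeline subcommand writes under a "timeline" subdirectory (both names are placeholders):

```python
# Sketch: locate the most recent timestamped run for a subcommand.
# "output" stands in for your configured output_dir and "timeline" for the
# subcommand's subdirectory; adjust both to match your setup.
from pathlib import Path

runs = sorted(p for p in Path("output/timeline").iterdir() if p.is_dir())
print(runs[-1])  # newest run, assuming the timestamped names sort chronologically
```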

If expansions are designated in the config file, additional output CSVs are created to hold the additional data:

  • author_id / referenced_tweets.id.author_id / entities.mentions.username
    • for tweet-based outputs, these expansions create a data file "data_users.csv" holding user data
  • referenced_tweets.id
    • for tweet-based outputs, this expansion creates a data file "data_ref_tweets.csv" holding the tweet data for retweets, replies, and quotes
    • an additional data file "data_ref_links.csv" is also created holding the relationships between tweets and referenced tweets, linking the two outputs (see the pandas sketch after this list)
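As a rough illustration of how these files fit together, the link file can be used to join tweets to their referenced tweets with pandas. This is a sketch only: the main tweet file name ("data.csv") and the column names used for the merge are assumptions, so check the actual headers in your output directory:

```python
# Sketch: joining expansion outputs. data_ref_tweets.csv and data_ref_links.csv
# are the file names described above; "data.csv" and the column names used for
# the merge are assumptions -- verify them against the files the tool writes.
import pandas as pd

tweets = pd.read_csv("data.csv")                 # assumed name of the main tweet output
ref_tweets = pd.read_csv("data_ref_tweets.csv")  # retweets, replies, and quotes
links = pd.read_csv("data_ref_links.csv")        # tweet <-> referenced-tweet relationships

# Hypothetical column names ("id", "tweet_id", "referenced_tweet_id"):
merged = (
    tweets.merge(links, left_on="id", right_on="tweet_id", how="left")
          .merge(ref_tweets, left_on="referenced_tweet_id", right_on="id",
                 how="left", suffixes=("", "_ref"))
)
```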

Available subcommands and their arguments are detailed below

Fetch User Tweets

The timeline subcommand collects the tweets in each non-skipped user's timeline, as listed in the CSV supplied via --user-csv. A separate subdirectory is created for each non-skipped handle. For help information, run the command: python pull_twitter.py timeline --help

Note: including the author_id expansion will also pull user metadata alongside the tweets

Arguments

| Full name | Shortened name | Description | Required? | Default |
| --- | --- | --- | --- | --- |
| --user-csv | -u | CSV containing handles of users to pull timelines for (see data/celeb_handle_test.csv for an example) | Yes | N/A |
| --output-user | -ou | Indicates whether to include handles in timeline outputs | No | False |
| --handle-column | -hc | Name of the handle column in the user CSV. Incompatible with --author-id-column. | No (exactly one of -hc or -aic must be supplied) | "handle" |
| --author-id-column | -aic | Name of the author id column in the user CSV. Incompatible with --handle-column. | No (exactly one of -hc or -aic must be supplied) | "author_id" |
| --skip-column | -sc | Name of the column containing skip indicators in the user CSV (skip indicated with a 1) | No | "skip" |
| --use-skip | -usc | Indicates whether to use the skip column to ignore specific handles | No | True |

Example

python pull_twitter.py --config-file ./configs/config.yaml timeline -u "./data/celeb_handle_test.csv" -hc "handle" -ou True
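The user CSV only needs a handle column and, optionally, a skip column; the default column names are shown in the table above. A minimal sketch for building one with pandas (the handles and output path are placeholders):

```python
# Sketch: build a user CSV compatible with the defaults above.
# The handles and output path are placeholders.
import pandas as pd

handles = pd.DataFrame({
    "handle": ["some_handle", "another_handle"],
    "skip":   [0, 1],  # a 1 marks a row to ignore when the skip column is used
})
handles.to_csv("./data/my_handles.csv", index=False)
```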

Fetch User Data

The users subcommand collects profile information for each non-skipped user listed in the CSV supplied via --user-csv. For help information, run the command: python pull_twitter.py users --help

Arguments

| Full name | Shortened name | Description | Required? | Default |
| --- | --- | --- | --- | --- |
| --user-csv | -u | CSV containing handles of users to pull profile information for (see data/celeb_handle_test.csv for an example) | Yes | N/A |
| --handle-column | -hc | Name of the handle column in the user CSV. Incompatible with --author-id-column. | No (exactly one of -hc or -aic must be supplied) | "handle" |
| --author-id-column | -aic | Name of the author id column in the user CSV. Incompatible with --handle-column. | No (exactly one of -hc or -aic must be supplied) | "author_id" |
| --skip-column | -sc | Name of the column containing skip indicators in the user CSV (skip indicated with a 1) | No | "skip" |
| --use-skip | -usc | Indicates whether to use the skip column to ignore specific handles | No | True |

Example

python pull_twitter.py --config-file ./configs/config.yaml users -u "./data/celeb_handle_test.csv" -hc handle

Search Tweets

The search subcommand collects tweets that match a provided query string. For help information, run the command: python pull_twitter.py search --help

Note: including the author_id expansion will also pull user metadata alongside the tweets

Arguments

| Full name | Shortened name | Description | Required? | Default |
| --- | --- | --- | --- | --- |
| --query | -q | Query term(s) for searching tweets | Yes | N/A |
| --max-response | -mr | Maximum number of tweets to return using the query | No | 100 |
| --start-time | -st | Starting date to search tweets (in format YYYY-MM-DD or isoformat) | No | None |
| --end-time | -et | Ending date to search tweets (in format YYYY-MM-DD or isoformat) | No | None (current time) |
| --tweets-per-query | -tpq | Number of tweets present in each response from the Twitter API | No | 500 |

Example

python pull_twitter.py --config-file ./configs/config.yaml search -q COVID19 -mr 50 -st 2021-08-19 -et 2021-08-21
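The -st and -et values above are bare dates; per the table, ISO-format timestamps are also accepted. A small sketch for producing one in python (whether a timezone offset is required is not documented here):

```python
# Sketch: produce an ISO-format timestamp for --start-time / --end-time.
from datetime import datetime, timezone

start = datetime(2021, 8, 19, 12, 0, tzinfo=timezone.utc).isoformat()
print(start)  # 2021-08-19T12:00:00+00:00
```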

Lookup Tweets

The lookup subcommand collects the tweets whose ids are listed in a provided CSV. For help information, run the command: python pull_twitter.py lookup --help

Note: including the author_id expansion will also pull user metadata alongside the tweets

Arguments

| Full name | Shortened name | Description | Required? | Default |
| --- | --- | --- | --- | --- |
| --id-csv | -i | Path to a csv with a list of Tweet ids | Yes | N/A |
| --id-col | -ic | Name of the column containing Tweet ids | No | "id" |
| --skip-column | -sc | Name of the column containing skip indicators in the id CSV (skip indicated with a 1) | No | "skip" |
| --use-skip | -usc | Indicates whether to use the skip column to ignore specific Tweet ids | No | True |
| --tweets-per-query | -tpq | Number of tweets present in each response from the Twitter API | No | 500 |

Example

python pull_twitter.py --config-file ./configs/config.yaml lookup -i data/tweet_ids.csv
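The id CSV just needs a column of Tweet ids whose name matches --id-col (default "id"). A minimal sketch with pandas; the ids and output path are placeholders:

```python
# Sketch: build an id CSV compatible with the default --id-col ("id").
# The ids and output path are placeholders.
import pandas as pd

pd.DataFrame({"id": [1428000000000000000, 1428000000000000001]}).to_csv(
    "data/tweet_ids.csv", index=False
)
```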

Python API

As an alternative to a command line interface, there is also a python script API with the same functionality.

Guide

To use the tool in a python script or notebook, begin by importing PullTwitterConfig and PullTwitterAPI:

from pull_twitter_api import PullTwitterConfig, PullTwitterAPI

To initialize the API object, you may either create a PullTwitterConfig object or pass a filepath to a configuration yaml file:

config = PullTwitterConfig.from_file(<config_filepath>)
api = PullTwitterAPI(config = config)

or

api = PullTwitterAPI(config_path = <config_filepath>)

Finally, the four subcommands can be called using the api object:

api.timelines(...)
api.users(...)
api.search(...)
api.lookup(...)

Depending on your use case, the auto_save parameter accepted by all api commands controls how the response data is saved, with two options:

  1. Setting auto_save=True will automatically save responses from the Twitter API to a local file. The file is updated with each batch returned, and only the most recent batch is held in memory. This is best for large jobs or jobs run without supervision.
  2. Setting auto_save=False (the default for the api) requires you to save the response data manually by calling .save() on the returned PullTwitterResponse object. All response data is held in the response object throughout the api call (see the sketch after this list).
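Putting the pieces together, here is a minimal sketch of the manual-save path. It assumes each api call returns the PullTwitterResponse object described above; the paths are placeholders and the argument names follow the tables below:

```python
# Sketch: initialize the API from a config file, pull timelines, and save manually.
# Paths are placeholders; argument names follow the tables below, and the call is
# assumed to return a PullTwitterResponse.
from pull_twitter_api import PullTwitterConfig, PullTwitterAPI

config = PullTwitterConfig.from_file("./configs/config.yaml")
api = PullTwitterAPI(config=config)

# auto_save=False keeps the response data in memory until .save() is called
response = api.timelines(
    user_csv="./data/celeb_handle_test.csv",
    handle_column="handle",
    auto_save=False,
)
response.save()  # write the collected data under the configured output_dir
```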

An example notebook is included to show basic usage of the tool in python.

API

Arguments through the python API mimic those of the command line interface

PullTwitterAPI.timelines()

| Arg name | Description | Required? | Default |
| --- | --- | --- | --- |
| user_csv | CSV containing handles of users to pull timelines for (see data/celeb_handle_test.csv for an example) | Yes | N/A |
| save_format | Option ('csv' or 'json') to save results as a csv file or json | No | 'csv' |
| output_user | Indicates whether to include handles in timeline outputs | No | False |
| handle_column | Name of the handle column in the user CSV. Incompatible with author_id_column. | No (exactly one of handle_column or author_id_column) | "handle" |
| author_id_column | Name of the author id column in the user CSV. Incompatible with handle_column. | No (exactly one of handle_column or author_id_column) | "author_id" |
| skip_column | Name of the column containing skip indicators in the user CSV (skip indicated with a 1) | No | "skip" |
| use_skip | Indicates whether to use the skip column to ignore specific handles | No | True |

PullTwitterAPI.users()

| Arg name | Description | Required? | Default |
| --- | --- | --- | --- |
| user_csv | CSV containing handles of users to pull profile information for (see data/celeb_handle_test.csv for an example) | Yes | N/A |
| save_format | Option ('csv' or 'json') to save results as a csv file or json | No | 'csv' |
| handle_column | Name of the handle column in the user CSV. Incompatible with author_id_column. | No (exactly one of handle_column or author_id_column) | "handle" |
| author_id_column | Name of the author id column in the user CSV. Incompatible with handle_column. | No (exactly one of handle_column or author_id_column) | "author_id" |
| skip_column | Name of the column containing skip indicators in the user CSV (skip indicated with a 1) | No | "skip" |
| use_skip | Indicates whether to use the skip column to ignore specific handles | No | True |

PullTwitterAPI.search()

| Arg name | Description | Required? | Default |
| --- | --- | --- | --- |
| query | Query term(s) for searching tweets | Yes | N/A |
| max_response | Maximum number of tweets to return using the query | No | 100 |
| start_time | Starting date to search tweets (in format YYYY-MM-DD or isoformat) | No | None |
| end_time | Ending date to search tweets (in format YYYY-MM-DD or isoformat) | No | None (current time) |
| tweets_per_query | Number of tweets present in each response from the Twitter API | No | 500 |
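For comparison with the CLI search example earlier, a sketch of the same query through the python API. The config path is a placeholder, and it is assumed that start_time and end_time accept the same string formats as the CLI flags:

```python
# Sketch: the earlier CLI search example expressed through the python API.
# The config path is a placeholder; start/end times are assumed to accept the
# same string formats as the CLI flags.
from pull_twitter_api import PullTwitterAPI

api = PullTwitterAPI(config_path="./configs/config.yaml")
response = api.search(
    query="COVID19",
    max_response=50,
    start_time="2021-08-19",
    end_time="2021-08-21",
    auto_save=False,
)
response.save()
```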

PullTwitterAPI.lookup()

| Arg name | Description | Required? | Default |
| --- | --- | --- | --- |
| id_csv | Path to a csv file with a list of Tweet ids | Yes | N/A |
| id_col | Name of the column in id_csv containing Tweet ids | No | "id" |
| skip_column | Name of the column containing skip indicators in the id CSV (skip indicated with a 1) | No | "skip" |
| use_skip | Indicates whether to use the skip column to ignore specific Tweet ids | No | True |
| tweets_per_query | Number of tweets present in each response from the Twitter API | No | 500 |

Issues or suggested features

Please post any suggestions as a new issue on GitHub or reach out to me directly.
