
Grabs tweets, adds a fingerprint, and dumps to csv for user accounts experiencing harassment. Encryption to be added in v2.0.


WellstoneAction/twitter_hate_speech_collector


Twitter hate speech collector

This script suite grabs mentions of a Twitter user after a specific tweet and dumps them to a csv (comma-separated values) file.

It was created in response to an uptick in Twitter-based harassment of organizers by white supremacist and neo-Nazi organizations, and specifically in response to a Nov-Dec 2016 influx of hate tweets against two organizers in Minneapolis (see originating target tweet in the last comment of grab_last_week_of_tweets.py).

The data produced is intended to:

  • Preserve copies of tweets that may later be deleted, since Twitter's policies limit its own ability to reproduce deleted tweets.
  • Back up or replace screenshots with more forensically sound data (i.e. "real evidence" vs. "direct evidence"), because IDs and timestamps are taken directly from Twitter's own database. Ideally this helps to protect targeted organizers from accusations of fabricated evidence.
  • Protect targeted organizers from having to repeatedly view triggering photos or disrupt the work day to take screenshots.
  • Produce a spreadsheet that makes it easier to search the tweet text for key phrases or words and to run statistical analysis on the data.

Two scripts exist for grabbing tweets and dumping them to csv. They draw on the same data set but present it in two different ways:

  1. grab_last_week_of_tweets.py Uses the Twitter search API and the tweepy module to get the last week or so of tweets (a Twitter API limitation) matching a search query. It prompts the user to enter the Twitter handle of the person experiencing harassment and the tweet ID of a tweet that occurred on the first day of the harassment (this may or may not be the tweet whose content started a response thread).

  2. parse_advanced_search_results.py Uses lxml to scrape results from the HTML generated by an advanced search on Twitter's site. Needs debugging to capture cleaner text fields, especially for image-based tweets.
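As a rough illustration of the two prompts used by grab_last_week_of_tweets.py, the sketch below turns a handle and a starting tweet ID into search parameters. The helper name `build_search_params` is hypothetical and not taken from the repository, and the exact query the script sends is an assumption. The `q` and `since_id` names mirror the parameters of Twitter's search API, where `since_id` limits results to tweets newer than the given ID.

```python
def build_search_params(handle, first_tweet_id):
    """Build search keyword arguments from the two prompted values.

    Hypothetical helper: the real script may construct its query differently.
    """
    # Accept the handle with or without a leading "@" and stray whitespace.
    cleaned = handle.strip().lstrip("@")
    if not cleaned:
        raise ValueError("a Twitter handle is required")

    # Tweet IDs are large integers; int() rejects anything non-numeric.
    tweet_id = int(str(first_tweet_id).strip())
    if tweet_id <= 0:
        raise ValueError("tweet ID must be a positive integer")

    # Assumption: a query of "@handle" matches mentions of the account, and
    # since_id restricts results to tweets newer than the starting tweet.
    return {"q": "@" + cleaned, "since_id": tweet_id}


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_search_params("@example_user", "1234567890"))
```

The resulting dictionary could then be passed to a search call such as tweepy's, whose method name varies by tweepy version (`API.search` in 3.x, `API.search_tweets` in 4.x).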

Both scripts output:

  • Tweet_id: generated from Twitter's data
  • Direct response to original thread?: Contains 'Yes' or is left blank. Used to help filter out unrelated tweets. This is not as accurate as it could be and needs debugging.
  • Sent when?: Time stamp generated from Twitter's created_at field
  • Sender: Twitter handle of author
  • Sender's name: Name (as entered by user onto profile) of author
  • Tweet text
  • Data source: Confirms which Twitter data source was used. The Twitter search API is used for the last week of Tweets, and the Advanced Search web server is used for older tweets.
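The column layout above can be sketched with Python's standard csv module. This is a minimal sketch, not the repository's code. It assumes tweets arrive as dictionaries shaped like Twitter API v1.1 tweet objects (`id_str`, `created_at`, `text`, `user.screen_name`, `user.name`, `in_reply_to_status_id_str`). It also assumes that "direct response" means a reply to the target tweet itself, which may differ from the logic the scripts actually use. The names `tweet_to_row` and `write_tweets_csv` are hypothetical.

```python
import csv
import io

# Column headers as listed in this README.
COLUMNS = [
    "Tweet_id",
    "Direct response to original thread?",
    "Sent when?",
    "Sender",
    "Sender's name",
    "Tweet text",
    "Data source",
]


def tweet_to_row(tweet, target_tweet_id, source):
    """Map one tweet dictionary to a CSV row (hypothetical helper)."""
    # Assumption: a direct response is a reply to the target tweet itself.
    is_reply = tweet.get("in_reply_to_status_id_str") == str(target_tweet_id)
    return {
        "Tweet_id": tweet["id_str"],
        "Direct response to original thread?": "Yes" if is_reply else "",
        "Sent when?": tweet["created_at"],
        "Sender": tweet["user"]["screen_name"],
        "Sender's name": tweet["user"]["name"],
        "Tweet text": tweet["text"],
        "Data source": source,
    }


def write_tweets_csv(tweets, target_tweet_id, source, out):
    """Write a header row plus one row per tweet to a text file object."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(tweet_to_row(tweet, target_tweet_id, source))


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [{
        "id_str": "1002",
        "created_at": "Thu Dec 01 12:00:00 +0000 2016",
        "text": "example reply, with a comma",
        "user": {"screen_name": "sender_handle", "name": "Sender Name"},
        "in_reply_to_status_id_str": "1001",
    }]
    buffer = io.StringIO()
    write_tweets_csv(sample, 1001, "Twitter search API", buffer)
    print(buffer.getvalue())
```

Using `csv.DictWriter` keeps the column order fixed and lets the module handle quoting of commas, quotes, and line breaks inside tweet text.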
