GabLeaks-Analysis
This codebase relates to the GabLeaks release from Distributed Denial of Secrets. While the Gab data itself is limited distribution, this helper code is public.
The original release format is a Postgresql database dump that includes data in four tables: accounts, statuses, gabgroups, and verifications. However, the 'statuses', 'accounts', and 'gabgroups' tables contain a 'data' column where most of the relevant details are stored in JSON.
The mixed JSON / SQL format is not the most convenient for analysis. This script expands the data, creating new statuses_expanded, accounts_expanded, and gabgroups_expanded tables with individual typed fields instead of one big JSON blob. The schemas are as follows:
statuses_expanded
| Field | Type |
|---|---|
| id | bigint |
| bookmark_collection_id | bigint |
| card | jsonb |
| content | text |
| created_at | timestamp with time zone |
| emojis | jsonb |
| expires_at | timestamp with time zone |
| favourited | boolean |
| favourites_count | bigint |
| group_ | jsonb |
| has_quote | boolean |
| in_reply_to_account_id | bigint |
| in_reply_to_id | bigint |
| language | text |
| media_attachments | jsonb |
| mentions | jsonb |
| pinnable | boolean |
| pinnable_by_group | boolean |
| plain_markdown | text |
| poll | jsonb |
| quote | jsonb |
| quote_of_id | bigint |
| reblog | jsonb |
| reblogged | boolean |
| reblogs_count | bigint |
| replies_count | bigint |
| revised_at | text |
| rich_content | text |
| sensitive | boolean |
| spoiler_text | text |
| tags | jsonb |
| url | text |
| visibility | text |
accounts_expanded
| Field | Type |
|---|---|
| id | bigint |
| text | |
| password | text |
| name | text |
| bot | boolean |
| url | text |
| note | text |
| avatar | text |
| emojis | jsonb |
| fields | jsonb |
| header | text |
| is_pro | boolean |
| locked | boolean |
| is_donor | boolean |
| created | timestamp with time zone |
| is_investor | boolean |
| is_verified | boolean |
| display_name | text |
| avatar_static | text |
| header_static | text |
| statuses_count | bigint |
| followers_count | bigint |
| following_count | bigint |
| is_flagged_as_spam | boolean |
gabgroups_expanded
| Field | Type |
|---|---|
| id | int |
| password | text |
| url | text |
| slug | jsonb |
| tags | jsonb |
| title | text |
| created_at | timestamp with time zone |
| is_private | boolean |
| is_visible | boolean |
| description | text |
| is_archived | boolean |
| has_password | boolean |
| member_count | int |
| group_category | jsonb |
| cover_image_url | text |
| description_html | text |
Usage
First, ensure you already have the dataset loaded into Postgresql. Installing and configuring postgres, creating a database and user, and importing the database contents from a SQL dump, are out of scope for this README.
Second, install dependencies for this project:
pip3 install -r requirements.txt
Or install manually with:
pip3 install psycopg2 tqdm
Finally, launch the script like:
./expand.py <sql_host> <sql_user> <sql_db_name>
For example:
./expand.py localhost postgres gableaks
You'll be prompted for your postgres user's password, and then the script should take care of everything from there. There's a progress bar included - since there are about 39 million statuses, expanding the table took a little over two hours during development tests.