We want to:
- Load the snapshots of X and Bluesky data
- Format them into threads (give replies/quotes their necessary context)
- Filter by the politician-focused keyword lists
- Export for narrative extraction

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import io
import json
import os
import pickle
import re
import uuid

from itertools import product

from dotenv import load_dotenv
from tqdm import tqdm

# Load data
From Google Drive snapshots.  
Only bluesky-2025-02 and bluesky-2025-03 are unzipped (not -01): X data is available only for the last four days of 2025-02, so earlier Bluesky data has nothing to compare against.

The Bluesky data exceeds 100 GB, so we load only 14 days, starting from the first day with X data.
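
The 14-day window described above could be generated rather than typed by hand. The following is a minimal sketch, not the notebook's own code: `build_months_days` is a hypothetical helper, and the start date of 2025-02-25 is an assumption derived from "the last four days of 2025-02". It returns a dict in the same shape as the `months_days` config used further down.

```python
from datetime import date, timedelta


def build_months_days(start, n_days):
    """Map zero-padded month -> list of zero-padded days for a date window.

    Hypothetical helper; assumes the window stays within a single year,
    because the loader keeps years in a separate list.
    """
    months_days = {}
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        if day.year != start.year:
            raise ValueError("window crosses a year boundary")
        months_days.setdefault(f"{day.month:02d}", []).append(f"{day.day:02d}")
    return months_days


# Assumed start date: the first of the last four days of February 2025
window = build_months_days(date(2025, 2, 25), 14)
```

With these inputs the window covers the last four days of February and the first ten days of March.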

In [2]:
def loads_jsonl(data: str):
    # Skip blank lines (e.g. after a trailing newline), which json.loads rejects
    return [json.loads(line) for line in data.split('\n') if line.strip()]

def escape_newlines_in_json(json_str):
    return json_str.replace('\n', '\\n')

def load_jsonl_str(json_str):
    # Split on newlines that start a new JSON object, then parse each chunk as a dict
    json_chunks = re.split(r'\n(?=\{)', json_str.strip())

    data_objects = []
    for chunk in json_chunks:
        escaped_chunk = escape_newlines_in_json(chunk)
        try:
            obj = json.loads(escaped_chunk)
            data_objects.append(obj)
        except json.JSONDecodeError as e:
            print("Error decoding a chunk:", e)
    
    return data_objects
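
The helpers above parse a whole file that has already been read into one string. Given the data volume, a streaming reader may be worth considering. This is a sketch under assumptions, not part of the original pipeline: `iter_jsonl` is a hypothetical name, and it assumes one JSON object per line.

```python
import json


def iter_jsonl(path):
    """Yield one parsed object per non-blank line of a JSONL file.

    Hypothetical helper: reads line by line so the whole file is never
    held in memory as a single string.
    """
    with open(path, "r", encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines and a trailing newline
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # Report and skip malformed lines instead of aborting the load
                print(f"{path}:{line_no}: skipping malformed line ({err})")
```

In a loading loop, `for idx, record in enumerate(iter_jsonl(file_path))` would replace the separate read and parse steps.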

In [3]:
# Pull data from specified buckets and dates
data_path = './data/snapshots'
buckets = [
    'bluesky',
    'x',
]
years = ['2025']
months_days = {
    '02': ['26', '27'],
    '03': [],
    # f'{day:02d}' for day in range(1, 2)
}
hours = [-1]

In [4]:
# Load data and directly construct records to save memory
dataframes = {}
for bucket in buckets:
    records = []
    for year, month in product(years, months_days.keys()):
        base_path = os.path.join(data_path, bucket, f"{bucket}-{year}-{month}")
        if not os.path.exists(base_path):
            continue

        for day in months_days[month]:
            day_path = os.path.join(base_path, day)
            if not os.path.isdir(day_path):
                continue

            available_hours = os.listdir(day_path) if -1 in hours else hours
            for hour in available_hours:
                hour_path = os.path.join(day_path, hour)
                if not os.path.isdir(hour_path):
                    continue

                files = os.listdir(hour_path)
                for filename in tqdm(files, desc=f"Loading {bucket}/{year}-{month}-{day}/{hour}"):
                    file_path = os.path.join(hour_path, filename)
                    with open(file_path, 'r', encoding='utf-8') as file:
                        data = file.read()
                    data_list = loads_jsonl(data)
                    # data_list = load_jsonl_str(data)
                    # Use a distinct loop name so the raw string `data` is not shadowed
                    for idx, record in enumerate(data_list):
                        records.append({"bucket": bucket, "file": file_path, "data_idx": idx, **record})
                    del data
                    del data_list
    
    dataframes[bucket] = pd.json_normalize(records)
    print(f'{bucket.capitalize()} Dataframe shape:', dataframes[bucket].shape)
    del records

Loading bluesky/2025-02-26/03: 100%|██████████| 12/12 [00:00<00:00, 48.48it/s]
Loading bluesky/2025-02-26/04: 100%|██████████| 12/12 [00:00<00:00, 69.57it/s]
Loading bluesky/2025-02-26/05: 100%|██████████| 12/12 [00:00<00:00, 102.04it/s]
Loading bluesky/2025-02-26/02: 100%|██████████| 12/12 [00:00<00:00, 51.36it/s]
Loading bluesky/2025-02-26/20: 100%|██████████| 26/26 [00:08<00:00,  3.15it/s]
Loading bluesky/2025-02-26/18: 100%|██████████| 11/11 [00:02<00:00,  5.39it/s]
Loading bluesky/2025-02-26/11: 100%|██████████| 12/12 [00:00<00:00, 121.43it/s]
Loading bluesky/2025-02-26/16: 100%|██████████| 12/12 [00:00<00:00, 107.63it/s]
Loading bluesky/2025-02-26/17: 100%|██████████| 6/6 [00:00<00:00, 46.59it/s]
Loading bluesky/2025-02-26/10: 100%|██████████| 12/12 [00:00<00:00, 69.15it/s]
Loading bluesky/2025-02-26/19: 100%|██████████| 10/10 [00:02<00:00,  3.84it/s]
Loading bluesky/2025-02-26/21: 100%|██████████| 19/19 [00:06<00:00,  3.10it/s]
Loading bluesky/2025-02-26/07: 100%|██████████| 12/

Bluesky Dataframe shape: (8105175, 294)


Loading x/2025-02-26/03: 100%|██████████| 11/11 [00:00<00:00, 17.84it/s]
Loading x/2025-02-26/04: 100%|██████████| 12/12 [00:00<00:00, 28.66it/s]
Loading x/2025-02-26/05: 100%|██████████| 12/12 [00:00<00:00, 34.74it/s]
Loading x/2025-02-26/02: 100%|██████████| 12/12 [00:00<00:00, 15.30it/s]
Loading x/2025-02-26/20: 100%|██████████| 11/11 [00:00<00:00, 17.63it/s]
Loading x/2025-02-26/18: 100%|██████████| 12/12 [00:00<00:00, 18.44it/s]
Loading x/2025-02-26/11: 100%|██████████| 11/11 [00:00<00:00, 28.62it/s]
Loading x/2025-02-26/16: 100%|██████████| 12/12 [00:00<00:00, 20.56it/s]
Loading x/2025-02-26/17: 100%|██████████| 12/12 [00:00<00:00, 19.34it/s]
Loading x/2025-02-26/10: 100%|██████████| 11/11 [00:00<00:00, 48.95it/s]
Loading x/2025-02-26/19: 100%|██████████| 12/12 [00:02<00:00,  5.49it/s]
Loading x/2025-02-26/21: 100%|██████████| 12/12 [00:00<00:00, 17.18it/s]
Loading x/2025-02-26/07: 100%|██████████| 12/12 [00:00<00:00, 43.43it/s]
Loading x/2025-02-26/00: 100%|██████████| 12/12 [00

X Dataframe shape: (29401, 49)


# Filter

In [5]:
pd.set_option('display.max_columns', 300)

In [26]:
# Identify thread roots (original posts)
def identify_thread_roots_bluesky(df):
    return df[df['commit.record.reply.parent.uri'].isna() & df['commit.record.reply.root.uri'].isna()]

def identify_thread_roots_x(df):
    return df[df['data.referenced_tweets'].isna()]

# Filter original posts by keyword (case-insensitive regex match on the text column)
def filter_original_posts(df, text_col, keywords):
    pattern = '|'.join(keywords)
    return df[
        df[text_col].str.contains(pattern, case=False, na=False, regex=True)
    ]
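
Joining raw keywords with `|` treats each one as a regular expression and matches substrings, so a short name can match inside a longer word. If the keyword files hold literal names and handles (an assumption, since the files are not shown here), a stricter pattern is one alternative. The helper names below are hypothetical.

```python
import re


def build_keyword_pattern(keywords):
    """Compile a case-insensitive pattern matching any keyword as a whole token.

    Assumes keywords are literal strings, so each is passed through re.escape.
    """
    cleaned = [k.strip() for k in keywords if k.strip()]
    if not cleaned:
        raise ValueError("no non-empty keywords given")
    # Try longer keywords first so they win over their own prefixes
    cleaned.sort(key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in cleaned)
    # Lookarounds instead of \b so handles starting with '@' still match
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)


def matches_keywords(text, pattern):
    """Return True if text is a string containing at least one keyword."""
    return isinstance(text, str) and pattern.search(text) is not None
```

It could be applied to a dataframe with `df[df[text_col].map(lambda t: matches_keywords(t, pattern))]`. This is slower than a vectorised `str.contains`, but it handles missing values explicitly.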

In [31]:
# Load keywords (one per line; skip blank lines, which would match every post)
k_path = './data/keywords/bluesky_keywords_politicians.txt'
with open(k_path, 'r') as f:
    keywords_bluesky = [k.strip() for k in f if k.strip()]

k_path = './data/keywords/x_keywords_politicians.txt'
with open(k_path, 'r') as f:
    keywords_x = [k.strip() for k in f if k.strip()]

- Bluesky:
    - Text:
        - commit.record.text
- X:
    - Text:
        - data.text

In [8]:
# Get root posts
bluesky_roots = identify_thread_roots_bluesky(dataframes['bluesky'])
x_roots = identify_thread_roots_x(dataframes['x'])
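
The roots above cover only original posts. The thread-formatting goal (giving replies their context) is not implemented in this notebook. One possible sketch for Bluesky follows. It assumes the dataframe also has `did`, `commit.collection` and `commit.rkey` columns, as in Jetstream-style events, and that a post's URI has the form `at://<did>/<collection>/<rkey>`. `attach_parent_text` and the `uri` and `parent_text` columns are hypothetical names.

```python
import pandas as pd


def attach_parent_text(df):
    """Add a `parent_text` column holding the text of each reply's parent.

    Sketch only: parents missing from the loaded window yield NaN.
    """
    out = df.copy()
    # Rebuild each post's own AT URI from its components (assumed columns)
    out["uri"] = (
        "at://" + out["did"] + "/" + out["commit.collection"] + "/" + out["commit.rkey"]
    )
    parents = (
        out[["uri", "commit.record.text"]]
        .dropna(subset=["uri"])       # avoid NaN keys matching NaN keys
        .drop_duplicates("uri")       # keep the left join one-to-one
        .rename(columns={
            "uri": "commit.record.reply.parent.uri",
            "commit.record.text": "parent_text",
        })
    )
    return out.merge(parents, on="commit.record.reply.parent.uri", how="left")
```

Only the immediate parent is attached here. Full threads would need repeated lookups or grouping by `commit.record.reply.root.uri`.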

In [32]:
# Filter roots by keyword
filtered_bluesky_roots = filter_original_posts(
    bluesky_roots, 'commit.record.text', keywords_bluesky
)
filtered_x_roots = filter_original_posts(
    x_roots, 'data.text', keywords_x
)
print('Filtered BlueSky roots shape:', filtered_bluesky_roots.shape)
print('Filtered X roots shape:', filtered_x_roots.shape)

Filtered BlueSky roots shape: (2391, 294)
Filtered X roots shape: (303, 49)


In [39]:
print(filtered_x_roots['data.text'].sample(1).iloc[0])

And those goals serve the Anointed, not us plebs. Imagine, all the poverty for 1.6% of emissions, even if you believe it

The contrast between us and the US will eventually resemble North vs. South Korea
FUCK YOU @MarkJCarney 
ANOINTED @liberal_party 
@CBC
https://t.co/PkKtPiqImF


In [34]:
filtered_bluesky_roots.shape[0] / bluesky_roots.shape[0]

0.0002958348897349995

In [35]:
filtered_x_roots.shape[0] / x_roots.shape[0]

0.17647058823529413

## Export samples

In [24]:
cols_keep = [
    'bucket', 'file', 'data_idx', 'matching_rules',
    'data.author_id', 'data.conversation_id',
    'data.text', 'data.referenced_tweets', 'includes.media',
]
fname = './data/to_annotate/df_x_sample_filtered_20250226_20250227.xlsx'
filtered_x_roots[cols_keep].to_excel(fname, index=False)
print('Wrote to:', fname)

Wrote to: ./data/to_annotate/df_x_sample_filtered_20250226_20250227.xlsx


In [26]:
cols_keep = [
    'bucket', 'file', 'data_idx',
    'commit.record.reply.parent.uri', 'commit.record.reply.root.uri',
    'commit.record.text', 'commit.record.title', 'commit.record.embed.external.uri',
]
fname = './data/to_annotate/df_bluesky_sample_filtered_20250226_20250227.xlsx'
filtered_bluesky_roots[cols_keep].to_excel(fname, index=False)
print('Wrote to:', fname)

Wrote to: ./data/to_annotate/df_bluesky_sample_filtered_20250226_20250227.xlsx


- TODO: Extract narratives with OpenAI (zero-shot and few-shot prompting)
- TODO: Evaluate the extractions using similarity scores