<div style="border-left: 6px solid #00356B; padding-left: 15px; margin-bottom: 20px;">
  <h1 style="margin-bottom: 5px; color: #00356B"><strong>Assignment 2:</strong> Part 2 (Reply Homophily Analysis)</h1>
  <span style="font-size: 1.2em; color: #444; font-weight: bold">S&DS 5350 | Social Algorithms</span>
  <br><br>
  <strong>Primary:</strong> Brandon Tran (bat53)
  <br>
  <strong>Partner:</strong> Cailey Bobadilla (Yale NetID)
</div>

---

*Mood for this part:*

<iframe data-testid="embed-iframe" style="border-radius:12px" src="https://open.spotify.com/embed/track/3QaPy1KgI7nu9FJEQUgn6h?utm_source=generator" width="40%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>

#### II.1 | Data Collection
We are asked to collect replies to senator posts: at least 5 female senators and 5 male senators. For each, we:

1. Fetch posts from last 7 days via `getAuthorFeed`
2. For posts with replies, fetch the reply thread via `getPostThread`
3. Extract replier handle, display name, timestap, and post reply count
4. Save the data in JSON format

For completeness, we will collect replies for all senators (not just 5 for each gender). Given the duration our code will run, we provide period print statements to verify successful progression of data collection.

In [27]:
import time
import random
import bluesky_helpers as bsky
from datetime import datetime, timedelta, timezone

In [28]:
def collect_senator_replies():
    """
    Uses helper functions to collect reply thread data for all 42 senators
    (14 women and 28 men).
    """

    # Load senators from provided data file.
    senators = bsky.load_senators('data/senators_bluesky.csv')
    print(f"Identified {len(senators)} senators.")

    # Define time window for step (1).
    cutoff_date = datetime.now(timezone.utc) - timedelta(days=7)
    print(f"Setting post cutotoff window as: {cutoff_date.isoformat()}\n")

    # Collect data; iterate over all senators.
    for i, senator in enumerate(senators, 1):
        
        handle = senator['handle']

        ## Define dictionary structure.
        senator_data = {
            'meta': senator,
            'posts': []
        }

        ## Fetch recent posts with a relatively high limit to cover full week.
        feed_response = bsky.get_author_feed(handle, limit = 100)
        time.sleep(0.2)

        posts_collected = 0

        ### Add robustness for if no feed data is found (e.g., 404 error).
        if not feed_response or 'feed' not in feed_response:
            continue

        ### Add robustness for missing or incomplete posts (e.g., deleted).
        for item in feed_response['feed']:
            post = item.get('post')
            if not post: continue

            record = post.get('record', {})

            #### Checks for failure or missing key, then date and reply count.
            try:
                created_at = record.get('createdAt')
                post_date = bsky.parse_datetime(created_at)
            except (ValueError, TypeError):
                continue
            
            if post_date < cutoff_date:
                continue

            reply_count = post.get('replyCount', 0)
            if reply_count == 0:
                continue

            ### Fetch thread.
            uri = post['uri']
            time.sleep(0.2)   #### Ensures we remain below ~300 requests/min.

            thread_data = bsky.get_post_thread(uri)
            captured_replies = []

            if thread_data and 'thread' in thread_data:
                thread_root = thread_data['thread']

                if 'replies' in thread_root:
                    for reply in thread_root['replies']:
                        if 'post' in reply:
                            reply_post = reply['post']
                            author = reply_post.get('author', {})

                            captured_replies.append({
                                'handle': author.get('handle'),
                                'displayName': author.get('displayName'),
                                'did': author.get('did'),
                                'createdAt': reply_post.get('record', {}).get('createdAt'),
                                'uri': reply_post.get('uri'),
                                'text': reply_post.get('record', {}).get('text', '')
                            })

            senator_data['posts'].append({
                'uri': uri,
                'text': record.get('text'),
                'createdAt': created_at,
                'totalReplyCount': reply_count,
                'capturedReplyCount': len(captured_replies), # We may miss some.
                'replies': captured_replies
            })
            
            posts_collected += 1

            clean_handle = handle.replace('.', '_') # Prevent filename errrors.
            filename = f"part2_data/replies_{clean_handle}.json"
            bsky.save_json(senator_data, filename)
        
        print(f"Saved {filename}.")

In [29]:
# Execute collect_senator_replies().
try:
    collect_senator_replies()
    print("Complete!")

except KeyboardInterrupt:
    print("Aborted. Partial data saved, if applicable.")

Identified 42 senators.
Setting post cutotoff window as: 2026-01-31T20:45:00.001419+00:00

Saved part2_data/replies_baldwin_senate_gov.json.
Saved part2_data/replies_murray_senate_gov.json.
Saved part2_data/replies_cantwell_senate_gov.json.
Saved part2_data/replies_markwarner_bsky_social.json.
Saved part2_data/replies_kaine_senate_gov.json.
Saved part2_data/replies_sanders_senate_gov.json.
Saved part2_data/replies_whitehouse_senate_gov.json.
Saved part2_data/replies_reed_senate_gov.json.
Saved part2_data/replies_reed_senate_gov.json.
Saved part2_data/replies_wyden_senate_gov.json.
Saved part2_data/replies_jeff-merkley_bsky_social.json.
Saved part2_data/replies_schumer_senate_gov.json.
Saved part2_data/replies_kirstengillibrand_bsky_social.json.
Saved part2_data/replies_lujan_senate_gov.json.
Saved part2_data/replies_heinrich_senate_gov.json.
Saved part2_data/replies_kim_senate_gov.json.
Saved part2_data/replies_booker_senate_gov.json.
Saved part2_data/replies_shaheen_senate_gov.json.
S

### II.2 | Gender Influence
With our full data collected in II.2, we will infer the gender of repliers from
their display names using SSA baby name data. Specifically, we will:

1. Extract a likely first name from each display name
2. Look up the name's historical gender distribution
3. Classify as female if >60% of registrations are female and male if >60% are male; otherwise unknown

We will report which fraction of repliers are classifiable