<div style="border-left: 6px solid #00356B; padding-left: 15px; margin-bottom: 20px;">
  <h1 style="margin-bottom: 5px; color: #00356B"><strong>Assignment 2:</strong> Part 2 (Reply Homophily Analysis)</h1>
  <span style="font-size: 1.2em; color: #444; font-weight: bold">S&DS 5350 | Social Algorithms</span>
  <br><br>
  <strong>Primary:</strong> Brandon Tran (bat53)
  <br>
  <strong>Partner:</strong> Cailey Bobadilla (Yale NetID)
</div>

---

*Mood for this part:*

<iframe data-testid="embed-iframe" style="border-radius:12px" src="https://open.spotify.com/embed/track/3QaPy1KgI7nu9FJEQUgn6h?utm_source=generator" width="40%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>

### II.1 | Data Collection
We are asked to collect replies to senator posts: at least 5 female senators and 5 male senators. For each, we:

1. Fetch posts from the last 7 days via `getAuthorFeed`
2. For posts with replies, fetch the reply thread via `getPostThread`
3. Extract replier handle, display name, timestamp, and post reply count
4. Save the data in JSON format

For completeness, we will collect replies for all senators (not just 5 of each gender). Because the collection takes several minutes to run, we print periodic status messages to verify that it is progressing.
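
The 7-day window in step 1 can be checked in isolation. The sketch below is illustrative only: `parse_iso_utc` and `within_window` are hypothetical names (not part of `bluesky_helpers`), and it assumes timestamps arrive as ISO 8601 strings with a trailing `Z`.

```python
from datetime import datetime, timedelta, timezone

def parse_iso_utc(ts):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def within_window(created_at, now, days=7):
    """Return True if created_at is no older than `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return parse_iso_utc(created_at) >= cutoff

# A fixed reference time keeps the example reproducible.
now = datetime(2026, 2, 7, 20, 45, tzinfo=timezone.utc)
print(within_window("2026-02-06T19:32:02.414Z", now))  # True
print(within_window("2026-01-20T00:00:00.000Z", now))  # False
```

Both sides of the comparison are timezone-aware, which avoids the `TypeError` Python raises when comparing naive and aware datetimes.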

In [27]:
import time
import random
import bluesky_helpers as bsky
from datetime import datetime, timedelta, timezone

In [None]:
def collect_senator_replies():
    """
    Uses helper functions to collect reply thread data for all 42 senators
    (14 women and 28 men). Expected to take 5 minutes with built-in sleep.
    """

    # Load senators from provided data file.
    senators = bsky.load_senators('data/senators_bluesky.csv')
    print(f"Identified {len(senators)} senators.")

    # Define time window for step (1).
    cutoff_date = datetime.now(timezone.utc) - timedelta(days=7)
    print(f"Setting post cutoff window as: {cutoff_date.isoformat()}\n")

    # Collect data; iterate over all senators.
    for i, senator in enumerate(senators, 1):
        
        handle = senator['handle']

        ## Define dictionary structure.
        senator_data = {
            'meta': senator,
            'posts': []
        }

        ## Fetch recent posts with a relatively high limit to cover full week.
        feed_response = bsky.get_author_feed(handle, limit = 100)
        time.sleep(0.2)

        posts_collected = 0

        ### Add robustness for if no feed data is found (e.g., 404 error).
        if not feed_response or 'feed' not in feed_response:
            continue

        ### Add robustness for missing or incomplete posts (e.g., deleted).
        for item in feed_response['feed']:
            post = item.get('post')
            if not post: continue

            record = post.get('record', {})

            #### Checks for failure or missing key, then date and reply count.
            try:
                created_at = record.get('createdAt')
                post_date = bsky.parse_datetime(created_at)
            except (ValueError, TypeError):
                continue
            
            if post_date < cutoff_date:
                continue

            reply_count = post.get('replyCount', 0)
            if reply_count == 0:
                continue

            ### Fetch thread.
            uri = post['uri']
            time.sleep(0.2)   #### Ensures we remain below ~300 requests/min.

            thread_data = bsky.get_post_thread(uri)
            captured_replies = []

            if thread_data and 'thread' in thread_data:
                thread_root = thread_data['thread']

                if 'replies' in thread_root:
                    for reply in thread_root['replies']:
                        if 'post' in reply:
                            reply_post = reply['post']
                            author = reply_post.get('author', {})

                            captured_replies.append({
                                'handle': author.get('handle'),
                                'displayName': author.get('displayName'),
                                'did': author.get('did'),
                                'createdAt': reply_post.get('record', {}).get('createdAt'),
                                'uri': reply_post.get('uri'),
                                'text': reply_post.get('record', {}).get('text', '')
                            })

            senator_data['posts'].append({
                'uri': uri,
                'text': record.get('text'),
                'createdAt': created_at,
                'totalReplyCount': reply_count,
                'capturedReplyCount': len(captured_replies), # We may miss some.
                'replies': captured_replies
            })
            
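            posts_collected += 1

        ## Save once per senator, after all posts are processed, so that
        ## `filename` is always defined and current when we print it.
        clean_handle = handle.replace('.', '_') # Prevent filename errors.
        filename = f"part2_data/replies_{clean_handle}.json"
        bsky.save_json(senator_data, filename)

        print(f"Saved {filename}.")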
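
The collection loop reads only the top-level `replies` of each thread, which is one reason `capturedReplyCount` can fall below `totalReplyCount`. As a rough sketch, nested replies could be gathered with a recursive walk. This assumes the thread is a nested dict in which each node may carry a `post` and a `replies` list, as the code above already relies on; `flatten_replies` and `toy_thread` are hypothetical, and how deep the returned tree actually goes depends on the request.

```python
def flatten_replies(node):
    """Yield every reply post nested under a thread node, depth-first."""
    for reply in node.get("replies") or []:
        post = reply.get("post")
        if post:  # Skip placeholder nodes that lack a 'post' key.
            yield post
        yield from flatten_replies(reply)

# Hand-built toy thread (not real API output).
toy_thread = {
    "post": {"uri": "root"},
    "replies": [
        {"post": {"uri": "a"}, "replies": [{"post": {"uri": "a1"}, "replies": []}]},
        {"notFound": True},
        {"post": {"uri": "b"}},
    ],
}

uris = [p["uri"] for p in flatten_replies(toy_thread)]
print(uris)  # ['a', 'a1', 'b']
```

Whether nested replies belong in the analysis is a design choice: replies to other repliers are arguably not replies to the senator.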

In [29]:
# Execute collect_senator_replies().
try:
    collect_senator_replies()
    print("Complete!")

except KeyboardInterrupt:
    print("Aborted. Partial data saved, if applicable.")

Identified 42 senators.
Setting post cutoff window as: 2026-01-31T20:45:00.001419+00:00

Saved part2_data/replies_baldwin_senate_gov.json.
Saved part2_data/replies_murray_senate_gov.json.
Saved part2_data/replies_cantwell_senate_gov.json.
Saved part2_data/replies_markwarner_bsky_social.json.
Saved part2_data/replies_kaine_senate_gov.json.
Saved part2_data/replies_sanders_senate_gov.json.
Saved part2_data/replies_whitehouse_senate_gov.json.
Saved part2_data/replies_reed_senate_gov.json.
Saved part2_data/replies_reed_senate_gov.json.
Saved part2_data/replies_wyden_senate_gov.json.
Saved part2_data/replies_jeff-merkley_bsky_social.json.
Saved part2_data/replies_schumer_senate_gov.json.
Saved part2_data/replies_kirstengillibrand_bsky_social.json.
Saved part2_data/replies_lujan_senate_gov.json.
Saved part2_data/replies_heinrich_senate_gov.json.
Saved part2_data/replies_kim_senate_gov.json.
Saved part2_data/replies_booker_senate_gov.json.
Saved part2_data/replies_shaheen_senate_gov.json.
...

### II.2 | Gender Inference
With our full data collected in II.1, we will infer the gender of repliers from
their display names using SSA baby name data. Specifically, we will:

1. Extract a likely first name from each display name
2. Look up the name's historical gender distribution
3. Classify as female if >60% of registrations are female and male if >60% are male; otherwise unknown

We will report what fraction of repliers are classifiable.
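
The 60% rule in step 3 can be sketched on its own. The counts below are made up for illustration and are not SSA figures; `classify` is a hypothetical helper name.

```python
def classify(f_count, m_count, threshold=0.60):
    """Label a name from its female/male registration counts."""
    total = f_count + m_count
    if total == 0:
        return "Unknown"
    pct_female = f_count / total
    if pct_female > threshold:
        return "Female"
    if pct_female < 1 - threshold:  # Equivalent to more than 60% male.
        return "Male"
    return "Unknown"

# Hypothetical (female, male) counts, for illustration only.
toy_counts = {"mary": (9900, 100), "james": (50, 9950), "jordan": (450, 550)}
labels = {name: classify(f, m) for name, (f, m) in toy_counts.items()}
print(labels)  # {'mary': 'Female', 'james': 'Male', 'jordan': 'Unknown'}
```

Names whose split falls between 40% and 60% stay unknown, which trades coverage for fewer misclassifications.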

In [None]:
import pandas as pd
import re
import os

In [37]:
def load_ssa_data():
    """
    Takes `female_names.tsv` and `male_names.tsv` and creates a dictionary
    `gender_map` as output.
    """
    
    # Load data.
    f_df = pd.read_csv(
        'data/female_names.tsv',
        sep = '\t',
        header = 0,
        names = ['name', 'count', 'year'])

    m_df = pd.read_csv(
        'data/male_names.tsv',
        sep = '\t',
        header = 0,
        names = ['name', 'count', 'year'])

    print("Files loaded successfully.")

    # Group by name and sum counts (e.g., there may be Mary 1950 and Mary 1951).
    f_sums = f_df.groupby('name')['count'].sum()
    m_sums = m_df.groupby('name')['count'].sum()

    # Create dictionary.
    name_dict = set(f_sums.index).union(set(m_sums.index))
    gender_map = {}

    for name in name_dict:
        f_count = f_sums.get(name, 0)
        m_count = m_sums.get(name, 0)
        total = f_count + m_count

        gender_map[name.lower()] = {
            'pct_female': f_count / total,
            'total_count': total
        }

    return gender_map

In [38]:
gender_db = load_ssa_data()

Files loaded successfully.


In [40]:
def infer_gender(display_name):
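    """
    Takes a Bluesky display name and predicts binary gender based on SSA data.
    Returns a (label, likely_name) tuple, where label is "Female", "Male", or
    "Unknown".
    """
    
    # Clean the display name (e.g., numbers, punctuation, emojis), keeping
    # letters, whitespace, and hyphens.
    clean = re.sub(r'[^a-zA-Z\s-]', '', display_name).strip()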

    tokens = clean.split()

    if not tokens:
        return "Unknown", None

    first_token = tokens[0]

    # Handle conjoined names (e.g., MarySmith -> Mary).
    conjoined = re.match(r'([A-Z][a-z]+)(?=[A-Z])', first_token)

    if conjoined:
        likely_name = conjoined.group(1)
    else:
        likely_name = first_token

    # Compare against SSA name database.
    stats = gender_db.get(likely_name.lower())

    if stats:

        if stats['pct_female'] > 0.60:
            return "Female", likely_name

        elif stats['pct_female'] < 0.40:
            return "Male", likely_name

    return "Unknown", likely_name
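
The name-extraction step can be exercised without the SSA lookup. This is a minimal sketch, assuming the intent is to keep ASCII letters, whitespace, and hyphens before taking the first token; `extract_first_name` is a hypothetical helper, not part of the notebook.

```python
import re

def extract_first_name(display_name):
    """Return the likely first name from a display name, or None if empty."""
    # Keep ASCII letters, whitespace, and hyphens; drop digits, punctuation, emoji.
    clean = re.sub(r"[^a-zA-Z\s-]", "", display_name).strip()
    tokens = clean.split()
    if not tokens:
        return None
    first = tokens[0]
    # Split a conjoined token such as "MarySmith" into its leading name.
    match = re.match(r"([A-Z][a-z]+)(?=[A-Z])", first)
    return match.group(1) if match else first

for raw in ["Mary Smith 🌊", "MarySmith42", "mary smith", "🇺🇸🇺🇸"]:
    print(repr(raw), "->", extract_first_name(raw))
```

Because the pattern is ASCII-only, accented letters are dropped (so "José" would become "Jos"), which will push some names into the unknown bucket.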

In [None]:
import json
import glob
import pandas as pd
from datetime import datetime

In [48]:
def process_gender_data():
    """
    Iterates through our collected JSON files and infers gender for repliers.
    Outputs a dataframe with one row per captured reply (including those
    classified as Unknown) in anticipation of II.3 and II.4.
    """

    all_replies = []
    files = glob.glob('part2_data/replies_*.json')

    for filepath in files:
        with open(filepath, 'r') as f:
            data = json.load(f)

        ## Pull senator metadata.
        senator = data['meta']
        s_handle = senator['handle']
        s_gender = senator.get('gender')

        for post in data['posts']:
            post_uri = post['uri']
            total_reply_count = post.get('totalReplyCount', 0)
            
            ## Begin iterating through replies.
            for reply in post['replies']:
                display_name = reply.get('displayName', '') or ""

                ## Infer gender.
                try:
                    inferred_gender, _ = infer_gender(display_name)
                except Exception:
                    inferred_gender = 'Unknown'

                ## Parse timestamp for II.4 (None if the field is missing).
                created_str = reply.get('createdAt')
                ts = (datetime.fromisoformat(created_str.replace('Z', '+00:00'))
                      if created_str else None)

                all_replies.append({
                    'senator_handle': s_handle,
                    'senator_gender': s_gender,
                    'post_uri': post_uri,
                    'post_total_replies': total_reply_count,
                    'reply_text_len': len(reply.get('text', '')),
                    'reply_timestamp': ts,
                    'replier_gender': inferred_gender
                })
    
    df = pd.DataFrame(all_replies)
    return df

In [None]:
# Execution Code
df_replies = process_gender_data()

In [52]:
# Validation Code
print(df_replies.head())

          senator_handle senator_gender  \
0  alsobrooks.senate.gov              F   
1  alsobrooks.senate.gov              F   
2  alsobrooks.senate.gov              F   
3  alsobrooks.senate.gov              F   
4  alsobrooks.senate.gov              F   

                                            post_uri  post_total_replies  \
0  at://did:plc:6srmlf7guiy534fkxhfmlubf/app.bsky...                   5   
1  at://did:plc:6srmlf7guiy534fkxhfmlubf/app.bsky...                   5   
2  at://did:plc:6srmlf7guiy534fkxhfmlubf/app.bsky...                   5   
3  at://did:plc:6srmlf7guiy534fkxhfmlubf/app.bsky...                   5   
4  at://did:plc:6srmlf7guiy534fkxhfmlubf/app.bsky...                   5   

   reply_text_len                  reply_timestamp replier_gender  
0              46 2026-02-06 19:32:02.414000+00:00         Female  
1              55 2026-02-07 00:35:25.192000+00:00        Unknown  
2             155 2026-02-06 19:39:32.201000+00:00        Unknown  
...

In [54]:
# Report Classifiable Fraction

counts = df_replies['replier_gender'].value_counts()
total = len(df_replies)
classifiable = counts.get('Male', 0) + counts.get('Female', 0)

print(f"Total Replies: {total}")
print(f"Classifiable:  {classifiable}")
print(f"Fraction:      {classifiable/total:.4f}")

Total Replies: 15151
Classifiable:  5209
Fraction:      0.3438
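
The same fraction can also be broken down by senator gender. The sketch below uses plain dictionaries with made-up rows that mirror the `senator_gender` and `replier_gender` columns of `df_replies`; the values are illustrative, not results from our data.

```python
from collections import defaultdict

# Hypothetical rows, for illustration only.
rows = [
    {"senator_gender": "F", "replier_gender": "Female"},
    {"senator_gender": "F", "replier_gender": "Unknown"},
    {"senator_gender": "F", "replier_gender": "Male"},
    {"senator_gender": "M", "replier_gender": "Unknown"},
    {"senator_gender": "M", "replier_gender": "Male"},
]

totals = defaultdict(int)
classified = defaultdict(int)
for row in rows:
    group = row["senator_gender"]
    totals[group] += 1
    if row["replier_gender"] in ("Male", "Female"):
        classified[group] += 1

# Fraction of replies with a Male/Female label, per senator gender.
fractions = {group: classified[group] / totals[group] for group in totals}
for group, frac in fractions.items():
    print(f"{group}: {frac:.4f}")
```

If the classifiable share differs noticeably between groups, that gap is worth keeping in mind when interpreting later comparisons.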
