# Quick Introduction:
Hello, this is a Jupyter notebook created by Adel for the Medium article {article title}. The goal of the code is to analyze the culture of a geographical region based on how people associated with that region tweet. I recommend reading the article for more information.

#### **The official summary is at the bottom of this notebook.**

# Step 0: Sanity Checks
Here we make sure everything loads and works as intended.

In [85]:
# package load
import json
import os
import random
import re
import time
from collections import defaultdict
from pathlib import Path
from typing import Dict, List

import numpy as np
import pandas as pd
import requests
import umap
from dotenv import load_dotenv
from sklearn.decomposition import PCA
from tqdm import tqdm

In [71]:
# This loads variables from .env into environment variables
load_dotenv()

BEARER_TOKEN = os.getenv("X_BEARER_TOKEN")
if BEARER_TOKEN is not None:
    print("Token loaded successfully from .env")
else:
    print("Token not found in .env, refer to .env.example")

HEADERS = {"Authorization": f"Bearer {BEARER_TOKEN}"}
BASE_URL = "https://api.x.com/2"

    
# TODO: confirm the token works by looking up a well-known public account

Token loaded successfully from .env
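The check above only confirms that the variable exists, not that X accepts it. Here is a minimal sketch of a live check, assuming a single-user lookup against `GET /2/users/by/username/:username`. The function name `check_token` and the placeholder account `XDevelopers` are my own illustrative choices, and the HTTP getter is passed in (e.g. `requests.get`) so the logic can be tried without touching the network. Note that a real call counts against your API usage.

```python
from typing import Any, Callable, Dict


def check_token(
    get: Callable[..., Any],
    base_url: str,
    headers: Dict[str, str],
    username: str = "XDevelopers",  # placeholder account, swap for any public handle
) -> bool:
    """Return True if a single user lookup succeeds with the given headers."""
    response = get(
        f"{base_url}/users/by/username/{username}",
        headers=headers,
        timeout=10,
    )
    if response.status_code == 200:
        return True
    # Anything else (401, 402, 403, 429, ...) means the token is not usable right now
    print(f"Token check failed with HTTP {response.status_code}")
    return False


# Usage in this notebook (makes one real API request):
# check_token(requests.get, BASE_URL, HEADERS)
```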


# Step 1: Finding X users to Scrape and scraping their sweet sweet data

## Step 1.1: Load the config for our target area

In [90]:
# Select the target region; the available defaults are "uk", "nyc", and "singapore"
target_area = "uk"

# Note: the standard uk.json is small and meant for testing; use uk_extended if you want more seeds
CONFIG_PATH = Path("..") / "configs" / f"{target_area}.json"

with open(CONFIG_PATH) as f:
    CFG = json.load(f)

print(f"Loaded config for {target_area}:")
cfg_copy = CFG.copy()
if len(CFG['bio_keywords']) > 5:
    cfg_copy['bio_keywords'] = CFG['bio_keywords'][:5] + ['...']
print(json.dumps(cfg_copy, indent=2))
del cfg_copy # only used for example print

Loaded config for uk:
{
  "region_name": "uk",
  "local_seeds": "../data/local_seeds/uk.json",
  "users_output": "../data/users/uk_users.jsonl",
  "users_adjacent_output": "../data/users_adjacent/uk_users.jsonl",
  "tweets_output": "../data/tweets/uk_tweets.jsonl",
  "bio_keywords": [
    "uk",
    "united kingdom",
    "england",
    "scotland",
    "wales",
    "..."
  ],
  "max_mentions_per_seed": 1,
  "max_tweets_per_user": 5
}


In [73]:
with open(CFG["local_seeds"]) as f:
    seeds = json.load(f)

seed_types = list(seeds.keys())
print(f"{target_area} seed types are:", ", ".join(seed_types))

# Peek at the first few in one category, e.g. sports
print("\nFirst few sports seeds:")
seeds["sports"][:5]

uk seed types are: sports, music, tech_lifestyle, comedy

First few sports seeds:


['@premierleague', '@Arsenal']

## Step 1.2: Clean + Validate seed handles

### Step 1.2.1: Filter for valid seeds

In [74]:
def is_valid_username(u: str) -> bool:
    return bool(re.fullmatch(r"[A-Za-z0-9_]{1,15}", u))

seed_types = list(seeds.keys())

valid_seeds = []
invalid_seeds = []

# Use a separate loop variable so the seed_types list above is not overwritten
for seed_type in seed_types:
    for seed in seeds[seed_type]:
        clean = seed.lstrip("@")  # no-op when there is no leading "@"
        if is_valid_username(clean):
            valid_seeds.append(clean)
        else:
            invalid_seeds.append(seed)

print(f"Valid {target_area} seeds: {len(valid_seeds)} | Invalid: {len(invalid_seeds)}")

Valid uk seeds: 8 | Invalid: 0
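To make the filter's behavior concrete, here is a small sketch that runs the same regular expression over a few made-up handles. The `clean_handle` helper and the example handles are illustrative assumptions (the helper also strips surrounding whitespace, which the cell above does not do).

```python
import re

# Same pattern as is_valid_username above: 1-15 letters, digits, or underscores
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{1,15}")


def clean_handle(handle: str) -> str:
    """Drop surrounding whitespace and any leading '@' characters."""
    return handle.strip().lstrip("@")


def is_valid_username(u: str) -> bool:
    return bool(USERNAME_RE.fullmatch(u))


# Made-up examples: two good handles, then three that should be rejected
examples = ["@premierleague", "@Arsenal", "bad handle", "@way_too_long_for_a_handle", ""]
for h in examples:
    print(f"{h!r:32} -> {is_valid_username(clean_handle(h))}")
```

Handles containing spaces, more than 15 characters, or nothing at all are rejected before any API call is spent on them.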


### Step 1.2.2: Resolve seeds via `/2/users/by` (X API v2)
API-level validation: check that these usernames actually exist and fetch their metadata.

In [75]:
def lookup_usernames(usernames: List[str]) -> List[Dict]:
    """
    X API v2: Lookup users by username.

    HTTP:
        GET /2/users/by

    Docs:
        https://developer.x.com/en/docs/twitter-api/users/lookup/api-reference/get-users-by

    Args:
        usernames:
            List of X usernames (without the leading '@') to look up.

    Returns:
        A list of user objects returned in the 'data' field of the response,
        each including fields like id, username, name, location, description,
        and public_metrics (depending on user.fields requested).
    """
    results = []
    for i in range(0, len(usernames), 100):
        batch = usernames[i:i+100]
        params = {
            "usernames": ",".join(batch),
            "user.fields": "id,username,name,location,description,public_metrics"
        }
        try:
            r = requests.get(f"{BASE_URL}/users/by", headers=HEADERS, params=params)
            r.raise_for_status()
            data = r.json().get("data", [])
            results.extend(data)
            print(f"Batch {i//100}: {len(data)}/{len(batch)} users")
        except requests.HTTPError as e:
            print(f"Batch {i//100} error: {e}")
        time.sleep(0.5)  # brief pause between batches to stay clear of rate limits
    return results

In [76]:
seed_users = lookup_usernames(valid_seeds)
print(f"Resolved {len(seed_users)} seed user objects out of {len(valid_seeds)} valid usernames")
# TODO: Should we save seeds? Maybe in seeds_unfiltered and seeds_filtered directories, respectively

Batch 0: 8/8 users
Resolved 8 seed user objects out of 8 valid usernames
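With only 8 seeds the batching in `lookup_usernames` never kicks in, so here is a small sketch of what it does for a longer list. The `chunked` helper and the fake usernames are illustrative assumptions; the batch size of 100 mirrors the value used in the function above.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 250 fake usernames -> three requests instead of 250
usernames = [f"user_{n}" for n in range(250)]
batches = list(chunked(usernames))
print([len(b) for b in batches])  # [100, 100, 50]

# Each batch becomes one comma-separated "usernames" query parameter
first_param = ",".join(batches[0])
print(first_param[:30] + "...")
```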


## Step 1.3: Find Adjacent Accounts to Seeds + Filter for Region

### Step 1.3.1: Find adjacent accounts to seeds

In [78]:
# from the PDF – this stays as "find adjacent users by mentions"
def search_region_mentions_batched(
    seeds: List[Dict],
    batch_size: int = 20,
    max_adj: int = 3000,
) -> List[Dict]:
    """
    X API v2: Find 'adjacent' users who mention the seed accounts.

    HTTP:
        GET /2/tweets/search/recent

    Docs:
        https://developer.x.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent

    Query pattern (for UK example):
        "UK (@premierleague OR @Arsenal OR @ChelseaFC ...)"

    Args:
        seeds:
            List of user objects (each with at least a 'username' key) used as local seeds.
        batch_size:
            How many seed usernames to include in each search query batch.
        max_adj:
            Maximum number of unique adjacent users to collect before stopping.

    Returns:
        A list of unique user objects from the 'includes.users' field of search results.
    """
    # usernames of your seed accounts
    usernames = [u["username"] for u in seeds]

    # Simple base region term. You could later make this more complex
    region_term = CFG.get("tweet_region_term", CFG["region_name"])  # e.g. "UK"

    all_adjacent: Dict[str, Dict] = {}
    total_fetched = 0

    for i in range(0, len(usernames), batch_size):
        batch = usernames[i:i + batch_size]
        batch_handles = [f"@{u}" for u in batch]
        query = f'{region_term} ({" OR ".join(batch_handles)})'

        print(f"Batch {i//batch_size + 1}: {len(batch)} seeds → {query[:120]}...")

        url = f"{BASE_URL}/tweets/search/recent"
        params = {
            "query": query,
            "max_results": 100,  # Always request a full page: results are not guaranteed unique, so shrinking this to max_adj - total_fetched could cost more HTTP calls
            "tweet.fields": "author_id",  # the account ID of whoever posted the tweet
            "expansions": "author_id",
            "user.fields": "id,username,name,location,description,public_metrics",  # location is free text entered by the user, unlike the 'geo' coordinates in the old Twitter API
        }

        fetched_in_batch = 0

        while fetched_in_batch < 500 and total_fetched < max_adj:
            try:
                r = requests.get(url, headers=HEADERS, params=params)
                r.raise_for_status()
                resp = r.json()

                # users come back in the 'includes.users' block
                includes = resp.get("includes", {}).get("users", [])
                for user in includes:
                    uid = user["id"]
                    if uid not in all_adjacent:
                        all_adjacent[uid] = user
                        total_fetched += 1
                        fetched_in_batch += 1

                token = resp.get("meta", {}).get("next_token")
                if not token:
                    break

                params["next_token"] = token
                time.sleep(0.5)  # avoid rate limits

            except Exception as e:
                print(f"  Error: {e}")
                break

        print(f"  → {fetched_in_batch} new users (total: {total_fetched})")
        time.sleep(2)

        if total_fetched >= max_adj:
            print("Reached max_adj limit, stopping.")
            break

    return list(all_adjacent.values())


In [79]:
adjacent_users = search_region_mentions_batched(
    seed_users,
    batch_size=20,
    max_adj=CFG.get("max_adjacent_users", 200),
)
# Idea for robustness: also filter accounts by follower range, tweet count, account age, etc.

adj_dir = Path("../data/users_adjacent")
adj_dir.mkdir(parents=True, exist_ok=True)

adj_path = adj_dir / f"{target_area}_adjacent_users.jsonl"
with adj_path.open("w", encoding="utf-8") as f:
    for u in adjacent_users:
        f.write(json.dumps(u) + "\n")

print(f"Saved {len(adjacent_users)} adjacent users → {adj_path}")

Batch 1: 8 seeds → uk (@premierleague OR @Arsenal OR @TheO2 OR @RoyalAlbertHall OR @techUK OR @LDNTechWeek OR @Tate OR @britishmuseum)...
  → 212 new users (total: 212)
Reached max_adj limit, stopping.
Saved 212 adjacent users → ../data/users_adjacent/uk_adjacent_users.jsonl
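The search query grows with every seed in a batch, and the API caps query length. Below is a minimal sketch that builds the same `region (@a OR @b ...)` query and splits seeds by query length rather than by a fixed count. The function names are hypothetical, and the 512-character cap is an assumption on my part: the real limit depends on your access tier, so check it before relying on this.

```python
from typing import List

ASSUMED_MAX_QUERY_LEN = 512  # assumption: the actual cap depends on the access tier


def build_mention_query(region_term: str, usernames: List[str]) -> str:
    """Build a query like 'uk (@premierleague OR @Arsenal)'."""
    handles = " OR ".join(f"@{u}" for u in usernames)
    return f"{region_term} ({handles})"


def batch_by_query_length(
    region_term: str,
    usernames: List[str],
    max_len: int = ASSUMED_MAX_QUERY_LEN,
) -> List[List[str]]:
    """Group usernames so that each batch's query stays within max_len characters."""
    batches: List[List[str]] = []
    current: List[str] = []
    for name in usernames:
        candidate = current + [name]
        if current and len(build_mention_query(region_term, candidate)) > max_len:
            # Adding this name would overflow the query, so start a new batch
            batches.append(current)
            current = [name]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


print(build_mention_query("uk", ["premierleague", "Arsenal"]))
# uk (@premierleague OR @Arsenal)
```

A fixed `batch_size=20` works for the seed lists here because X usernames are at most 15 characters, but splitting by length is a safer fallback if the region term or operators get longer.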


### Step 1.3.2: Filter adjacent users by bio (confirmed locals)
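The next cell calls `bio_matches_region`, whose definition does not appear in this notebook. Here is a minimal sketch of what such a helper might look like, assuming a case-insensitive whole-word match against the config's `bio_keywords`. The matching rule and the `EXAMPLE_KEYWORDS` stand-in list are my assumptions, not necessarily what the original run used.

```python
import re
from typing import Iterable

# Stand-in for CFG["bio_keywords"] (only the first few keywords from the config printout)
EXAMPLE_KEYWORDS = ["uk", "united kingdom", "england", "scotland", "wales"]


def bio_matches_region(bio: str, keywords: Iterable[str] = EXAMPLE_KEYWORDS) -> bool:
    """Return True if any keyword appears in the bio as a whole word or phrase."""
    text = bio.lower()
    return any(
        re.search(rf"\b{re.escape(k.lower())}\b", text) is not None
        for k in keywords
    )


print(bio_matches_region("Football fan. London, UK"))      # True
print(bio_matches_region("Always duking it out online"))   # False: 'uk' is inside a word
```

Word boundaries matter for short keywords like "uk", which would otherwise match inside unrelated words. In the notebook you would pass `CFG["bio_keywords"]` instead of the stand-in list.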

In [80]:
"""
# Optional: Load from file
adj_path = Path("../data/users_adjacent") / f"{target_area}_adjacent_users.jsonl"
with adj_path.open() as f:
     adj_rows = [json.loads(line) for line in f]

adj_df = pd.DataFrame(adj_rows)
"""

adj_df = pd.DataFrame(adjacent_users)

# Combine description + location into a single "bio-like" field
adj_df["bio"] = (
    adj_df["description"].fillna("") + " " +
    adj_df["location"].fillna("")
)

adj_df["is_in_region"] = adj_df["bio"].apply(bio_matches_region)

region_users_df = adj_df[adj_df["is_in_region"]].copy()
print("Adjacent users:", len(adj_df), "| confirmed locals:", len(region_users_df))

# Save to the final users_output path from config
final_path = Path(CFG["users_output"])
final_path.parent.mkdir(parents=True, exist_ok=True)
region_users_df.to_json(final_path, orient="records", lines=True)

print(f"Saved {len(region_users_df)} confirmed local users → {final_path}")

Adjacent users: 212 | confirmed locals: 121
Saved 121 confirmed local users → ../data/users/uk_users.jsonl


## Step 1.4: Get tweets of confirmed local users
We will pick some of the locals and get their tweets.

In [82]:
def get_user_tweets(user_id: str, max_tweets: int = 50) -> List[Dict]:
    """
    X API v2: Get recent Tweets by user ID.

    HTTP:
        GET /2/users/:id/tweets

    Docs:
        https://developer.x.com/en/docs/twitter-api/tweets/timelines/api-reference/get-users-id-tweets

    Description:
        Fetches up to `max_tweets` recent Tweets for a single user.
        Requests are paginated with `max_results` of at most 100 per API
        call and use `next_token` from the response `meta` object to
        continue until either `max_tweets` is reached or there are no more
        Tweets available.

        Retweets and replies are excluded so that the resulting dataset reflects
        more of the user's own original content.

    Args:
        user_id:
            The numeric user ID (as a string) of the account whose Tweets
            should be fetched.
        max_tweets:
            Maximum number of Tweets to return for this user. The function may
            return fewer if the user has fewer original Tweets or if the API
            stops returning pages.

    Returns:
        A list of Tweet objects (dicts) as returned in the response `data`
        field. Each Tweet includes at least:
            - id
            - author_id
            - text
            - created_at
            - lang
            - public_metrics
        The list length is at most `max_tweets`.
    """
    url = f"{BASE_URL}/users/{user_id}/tweets"
    params = {
        "max_results": max(5, min(max_tweets, 100)),  # API page size; this endpoint accepts 5-100, and we trim to max_tweets at the end
        "tweet.fields": "id,author_id,text,created_at,lang,public_metrics",
        "exclude": "retweets,replies",
    }

    tweets: List[Dict] = []

    while len(tweets) < max_tweets:
        try:
            r = requests.get(url, headers=HEADERS, params=params)
            r.raise_for_status()
            resp = r.json()
            data = resp.get("data", [])
            if not data:
                break

            tweets.extend(data)

            token = resp.get("meta", {}).get("next_token")
            if not token:
                break

            params["next_token"] = token
            time.sleep(1)  # be nice to the API
        except Exception as e:
            print(f"Error fetching tweets for user {user_id}: {e}")
            break

    # guarantee we don't exceed max_tweets
    return tweets[:max_tweets]


In [86]:
# Optional: Load from file 
# locals_path = Path(CFG["users_output"])
# with locals_path.open() as f:
#     locals_list = [json.loads(line) for line in f]
# locals_df = pd.DataFrame(locals_list)


# Here we pick up to num_random_locals locals (100 by default)
num_random_locals = CFG.get("num_random_locals", 100)
locals_df = region_users_df.copy()

random.seed(42)
n_locals = min(num_random_locals, len(locals_df))
sampled_locals_df = locals_df.sample(n=n_locals, random_state=42)

# user IDs to query
local_ids = sampled_locals_df["id"].astype(str).tolist()
len(local_ids), local_ids[:5]

(100,
 ['1264821227610419201',
  '1062024201660493824',
  '908358114675707906',
  '1040550277131186176',
  '1064229456'])

In [95]:
max_per_user = CFG.get("max_tweets_per_user", 1) # had to set it really low due to running out of credits

all_tweets: List[Dict] = []

for uid in tqdm(local_ids, desc=f"Pulling tweets for {target_area} locals"):
    user_tweets = get_user_tweets(uid, max_tweets=max_per_user)

    # Just in case, enforce author_id here too (should already be set by the API)
    for t in user_tweets:
        t.setdefault("author_id", uid)

    all_tweets.extend(user_tweets)
    time.sleep(0.5)  # avoid hammering the API

Pulling tweets for uk locals:  44%|███████         | 44/100 [01:13<02:00,  2.15s/it]

Error fetching tweets for user 519767002: 402 Client Error: Payment Required for url: https://api.x.com/2/users/519767002/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  45%|███████▏        | 45/100 [01:14<01:40,  1.84s/it]

Error fetching tweets for user 310201041: 402 Client Error: Payment Required for url: https://api.x.com/2/users/310201041/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  46%|███████▎        | 46/100 [01:15<01:27,  1.62s/it]

Error fetching tweets for user 248768462: 402 Client Error: Payment Required for url: https://api.x.com/2/users/248768462/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  47%|███████▌        | 47/100 [01:16<01:17,  1.46s/it]

Error fetching tweets for user 3339230698: 402 Client Error: Payment Required for url: https://api.x.com/2/users/3339230698/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  48%|███████▋        | 48/100 [01:17<01:10,  1.36s/it]

Error fetching tweets for user 2217786008: 402 Client Error: Payment Required for url: https://api.x.com/2/users/2217786008/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  49%|███████▊        | 49/100 [01:18<01:05,  1.28s/it]

Error fetching tweets for user 72523656: 402 Client Error: Payment Required for url: https://api.x.com/2/users/72523656/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  50%|████████        | 50/100 [01:19<01:01,  1.23s/it]

Error fetching tweets for user 261628910: 402 Client Error: Payment Required for url: https://api.x.com/2/users/261628910/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  51%|████████▏       | 51/100 [01:20<00:58,  1.20s/it]

Error fetching tweets for user 3169753600: 402 Client Error: Payment Required for url: https://api.x.com/2/users/3169753600/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  52%|████████▎       | 52/100 [01:21<00:56,  1.17s/it]

Error fetching tweets for user 4526739273: 402 Client Error: Payment Required for url: https://api.x.com/2/users/4526739273/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  53%|████████▍       | 53/100 [01:23<00:54,  1.16s/it]

Error fetching tweets for user 1909905669438005248: 402 Client Error: Payment Required for url: https://api.x.com/2/users/1909905669438005248/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  54%|████████▋       | 54/100 [01:24<00:52,  1.14s/it]

Error fetching tweets for user 1620452334617624578: 402 Client Error: Payment Required for url: https://api.x.com/2/users/1620452334617624578/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  55%|████████▊       | 55/100 [01:25<00:50,  1.13s/it]

Error fetching tweets for user 1571695559328645123: 402 Client Error: Payment Required for url: https://api.x.com/2/users/1571695559328645123/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  56%|████████▉       | 56/100 [01:26<00:49,  1.13s/it]

Error fetching tweets for user 1479155367363944450: 402 Client Error: Payment Required for url: https://api.x.com/2/users/1479155367363944450/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  57%|█████████       | 57/100 [01:27<00:48,  1.12s/it]

Error fetching tweets for user 3662824157: 402 Client Error: Payment Required for url: https://api.x.com/2/users/3662824157/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  58%|█████████▎      | 58/100 [01:28<00:46,  1.11s/it]

Error fetching tweets for user 182785352: 402 Client Error: Payment Required for url: https://api.x.com/2/users/182785352/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  59%|█████████▍      | 59/100 [01:29<00:45,  1.11s/it]

Error fetching tweets for user 2669193350: 402 Client Error: Payment Required for url: https://api.x.com/2/users/2669193350/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  60%|█████████▌      | 60/100 [01:30<00:44,  1.11s/it]

Error fetching tweets for user 18448883: 402 Client Error: Payment Required for url: https://api.x.com/2/users/18448883/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  61%|█████████▊      | 61/100 [01:31<00:43,  1.11s/it]

Error fetching tweets for user 15653762: 402 Client Error: Payment Required for url: https://api.x.com/2/users/15653762/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  62%|█████████▉      | 62/100 [01:33<00:42,  1.11s/it]

Error fetching tweets for user 111570867: 402 Client Error: Payment Required for url: https://api.x.com/2/users/111570867/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  63%|██████████      | 63/100 [01:34<00:40,  1.10s/it]

Error fetching tweets for user 85434818: 402 Client Error: Payment Required for url: https://api.x.com/2/users/85434818/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  64%|██████████▏     | 64/100 [01:35<00:39,  1.10s/it]

Error fetching tweets for user 1399665283: 402 Client Error: Payment Required for url: https://api.x.com/2/users/1399665283/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  84%|█████████████▍  | 84/100 [02:08<00:27,  1.74s/it]

Error fetching tweets for user 2507994280: 402 Client Error: Payment Required for url: https://api.x.com/2/users/2507994280/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  85%|█████████████▌  | 85/100 [02:09<00:23,  1.55s/it]

Error fetching tweets for user 637063232: 402 Client Error: Payment Required for url: https://api.x.com/2/users/637063232/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  86%|█████████████▊  | 86/100 [02:10<00:19,  1.42s/it]

Error fetching tweets for user 568149499: 402 Client Error: Payment Required for url: https://api.x.com/2/users/568149499/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  87%|█████████████▉  | 87/100 [02:11<00:17,  1.32s/it]

Error fetching tweets for user 300909522: 402 Client Error: Payment Required for url: https://api.x.com/2/users/300909522/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  88%|██████████████  | 88/100 [02:12<00:15,  1.25s/it]

Error fetching tweets for user 36489575: 402 Client Error: Payment Required for url: https://api.x.com/2/users/36489575/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  89%|██████████████▏ | 89/100 [02:13<00:13,  1.21s/it]

Error fetching tweets for user 6120962: 402 Client Error: Payment Required for url: https://api.x.com/2/users/6120962/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  90%|██████████████▍ | 90/100 [02:14<00:11,  1.18s/it]

Error fetching tweets for user 727660210244993024: 402 Client Error: Payment Required for url: https://api.x.com/2/users/727660210244993024/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  91%|██████████████▌ | 91/100 [02:16<00:10,  1.16s/it]

Error fetching tweets for user 574059441: 402 Client Error: Payment Required for url: https://api.x.com/2/users/574059441/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals:  92%|██████████████▋ | 92/100 [02:17<00:09,  1.15s/it]

Error fetching tweets for user 299219144: 402 Client Error: Payment Required for url: https://api.x.com/2/users/299219144/tweets?max_results=100&tweet.fields=id%2Cauthor_id%2Ctext%2Ccreated_at%2Clang%2Cpublic_metrics&exclude=retweets%2Creplies


Pulling tweets for uk locals: 100%|███████████████| 100/100 [02:30<00:00,  1.51s/it]
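The log above shows the loop still sending requests while the API keeps answering 402 Payment Required. Here is a minimal sketch of one way to bail out after several failures in a row and keep track of which users were never attempted. Everything here is hypothetical: the `PaymentRequired` exception, the `pull_with_breaker` function, and the threshold of 3. It assumes a fetch wrapper that raises `PaymentRequired` when the response status is 402.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class PaymentRequired(Exception):
    """Hypothetical error a fetch wrapper could raise on HTTP 402."""


def pull_with_breaker(
    user_ids: Iterable[str],
    fetch: Callable[[str], List[Dict]],
    max_consecutive_failures: int = 3,
) -> Tuple[List[Dict], List[str], List[str]]:
    """Fetch tweets per user; stop once too many 402s happen back to back.

    Returns (tweets, failed_ids, unattempted_ids).
    """
    ids = list(user_ids)
    tweets: List[Dict] = []
    failed: List[str] = []
    consecutive = 0
    for idx, uid in enumerate(ids):
        try:
            tweets.extend(fetch(uid))
            consecutive = 0  # any success resets the streak
        except PaymentRequired:
            failed.append(uid)
            consecutive += 1
            if consecutive >= max_consecutive_failures:
                # Give up and report everything we never tried
                return tweets, failed, ids[idx + 1:]
    return tweets, failed, []
```

Since the failures in this run came in streaks with successes in between, a streak threshold is a softer choice than stopping at the first 402. The failed and unattempted IDs can be saved and retried once credits are topped up.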


In [96]:
tweets_path = Path(CFG["tweets_output"])
tweets_path.parent.mkdir(parents=True, exist_ok=True)

with tweets_path.open("w", encoding="utf-8") as f:
    for t in all_tweets:
        f.write(json.dumps(t) + "\n")

print(f"Collected {len(all_tweets)} tweets → {tweets_path}")
# ran out of credits during this test so I didn't get all of the tweets :/

Collected 64 tweets → ../data/tweets/uk_tweets.jsonl
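Each stage in this notebook saves its output as JSON Lines (one JSON object per line), which is what makes the commented-out "load from file" cells possible. Here is a small round-trip sketch using a temporary directory and two made-up tweet records. The helper names `write_jsonl` and `read_jsonl` are illustrative and not part of the notebook.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_jsonl(rows: List[Dict], path: Path) -> None:
    """Write one JSON object per line, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> List[Dict]:
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up records, written to a throwaway directory
sample = [
    {"id": "1", "author_id": "42", "text": "Matchday!"},
    {"id": "2", "author_id": "43", "text": "Off to the museum"},
]
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "tweets" / "example_tweets.jsonl"
    write_jsonl(sample, out_path)
    loaded = read_jsonl(out_path)

print(loaded == sample)  # True
```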


In [97]:
print("We finished getting our data, the data analysis section will be in x_api_culture_analysis.ipynb")

We finished getting our data, the data analysis section will be in x_api_culture_analysis.ipynb


## Overall Summary of Step 1

In this notebook, I used the X API v2 to build a small, end-to-end data pipeline for one region (the UK). Step 1 breaks down into four sub-steps:

**Step 1.1 – Load region config**

- Loaded the `uk` config from `../configs/uk.json`.
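- This defined the local seeds file, all output paths, the UK bio keywords, and control knobs like `max_mentions_per_seed` and `max_tweets_per_user`. The optional `max_adjacent_users` and `num_random_locals` keys were not set in this config, so the code fell back to its defaults (200 and 100).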

**Step 1.2 – Validate and resolve seed accounts**

- Loaded the seed handles from `../data/local_seeds/uk.json`.
- Cleaned them (stripped `@`, checked they were valid usernames).
- Resolved them via `GET /2/users/by` to get full user objects for each seed (name, description, location, public metrics).

**Step 1.3 – Find “adjacent” users via mentions and filter for region**

- Queried `GET /2/tweets/search/recent` for tweets that mention the seed accounts plus a UK term (e.g. `"UK (@premierleague OR @Arsenal ...)"`), in batches.
- Collected the authors returned in `includes.users` as “adjacent users” and wrote them to  
  `../data/users_adjacent/uk_adjacent_users.jsonl`.
- Combined each adjacent user’s `description` and `location` into a simple “bio” string.
- Kept only users whose bios contained at least one UK keyword (e.g. “uk”, “england”, “scotland”, “glasgow”, etc.), and wrote these confirmed locals to  
  `../data/users/uk_users.jsonl`.

**Step 1.4 – Pull tweets from confirmed locals**

- Randomly sampled up to `num_random_locals` confirmed locals.
- For each sampled user, called `GET /2/users/:id/tweets` (excluding retweets and replies) and saved the results to  
  `../data/tweets/uk_tweets.jsonl`.


Because this was a live test run on my own X developer account, I hit **402 Payment Required** errors partway through fetching tweets and ended up with a **partial UK tweets dataset** (64 tweets total). Across this and a couple of earlier test calls, I spent roughly **\$20** in API usage.

> **Important:**  
> The data from *this specific run* (with partial UK tweets) will **not** be used in the main data-analysis section of the project. This notebook is meant to document **how** I collected the data and all the little gotchas of using X API v2 (rate limits, access tiers, cost, etc.).

For the actual cultural analysis, I’ll be working in a separate notebook, **`x_api_culture_analysis.ipynb`**, where I load pre-collected datasets for multiple regions and:

- compare **UK vs NYC** users on both the **dictionary-based features** and **embedding-based features**,  
- run PCA/UMAP to visualize the “cultural space” of each region, and  
- use those region-level profiles to design different chatbot personas.

So you can think of this notebook as **“how I got the data and what it cost”**, and `x_api_culture_analysis` as **“what I actually did with that data once I had it.”**
