# Section 1 — Report Header & Hypothesis
**Report Title:** API Data Report Pictures vs Text<br>
**Your Name:** Ella Benore<br>
**Date:** 2025-10-07

## Hypothesis
My hypothesis is that posts with hashtags will have signifigantly more likes and user interactions than posts without hashtags.

## Theoretical Rationale
I believe that the posts with hashtags will have more likes and user interactions because hashtags place posts in a catagory with other posts using the same hashtag. Therfore, users who click a specific hashtag will cause more interactions with other posts in that same catagory. 

## Statistical Application
Explain how your hypothesis could be tested statistically (e.g., group comparison, correlation).
What variables (columns) will you be using.

Tip: You do not need to fully execute the analysis now, but you should articulate how you would test it.
This hypothesis could be tested by examining the number of likes on posts from one or more creators that contain certain hashtags, compared to posts by those same creators which do not contain hashtags. If the number of likes on average is higher for the posts with hashtags than those without, my hypothesis will be correct. The variales that will be the most helpful in this will be "likescount", "author_did", and searching for a specific term within the post.

# Section 2 — Endpoint Plan (Design Your Data Collection)
Identify the Bluesky API endpoints you will use and why they are suitable for testing your hypothesis.
Link: https://docs.bsky.app/docs/category/http-reference

Planned endpoints (examples; replace with your own):

app.bsky.feed.searchPosts — to collect posts matching a hashtag or keyword.
app.bsky.actor.getProfiles — to enrich authors with profile metadata (e.g., displayName, followersCount).
app.bsky.feed.getAuthorFeed — to get posts authored by a specific actor (for longitudinal behavior).
For each endpoint, specify:

The key request parameters you will use. e.g. search query q for app.bsky.feed.searchPosts. User profile did for app.bsky.actor.getProfiles
The response objects/fields you will extract. e.g. posts response in case of app.bsky.feed.searchPosts
Why these fields map to the variables in your hypothesis.


## Reliability and Bias
Hashtags of often used for accounts to get more exposure, and can be used on any post; therefore, sometimes the post may have nothing to do with the hashtage at all and be completely unrelated to the content. There also could be spam accounts or spam posts using the same hashtags over and over again to obtain more user interations. 


## Limitations
List any caveats in the response objects (e.g., fields not guaranteed, delayed counts, missing information) that could affect your analysis.
- How to find hashtags/information on specific posts
- Like counts on specific posts rather than a profile
- How to find posts that contain no hashtags

# Section 3 Data Collection
Collect posts that match a query. Adjust QUERY, MAX_POSTS, and any filters your hypothesis requires.

In [84]:
import requests
import time
import json as js
import pandas as pd
import re

BASE_URL = "https://api.bsky.app/xrpc"

## Data Collection (Endpoint 1):
e.g. app.bsky.feed.searchPosts Flatten key fields from Bluesky PostView objects.

In [85]:
endpoint = f"{BASE_URL}/app.bsky.feed.searchPosts"
headers = {"User-Agent": "EMAT-Teaching/1.0 (+contact@example.com)"}
SEARCH_TERM = "#Volcano"  # this is the hashtag/word search
LIMIT = 50
params = {
    "q": SEARCH_TERM,
    "limit": LIMIT
}

resp = requests.get(endpoint, params=params, headers=headers, timeout=30)

print("Status:", resp.status_code)

# data = resp.json()

print("Top-level keys:", list(data.keys()))
search_data = resp.json()

#posts = search_data.get("posts", [])
#print(f"Fetched {len(posts)} posts")

Status: 200
Top-level keys: ['did', 'handle', 'displayName', 'avatar', 'associated', 'labels', 'createdAt', 'description', 'indexedAt', 'banner', 'followersCount', 'followsCount', 'postsCount', 'pinnedPost']


In [86]:
posts = search_data.get("posts", [])


In [87]:
## Flatten the posts

rows = []
# posts_data = []
for p in posts:
    author = post.get("author", {})
    text = post.get("record", {}).get("text", "")
    
    # Extract hashtags using regex
    hashtags = re.findall(r"#\w+", text)

    posts_data.append({
        "uri": post.get("uri"),
        "cid": post.get("cid"),
        "text": text,
        "createdAt": post.get("indexedAt"),
        "creator_did": author.get("did"),
        "creator_handle": author.get("handle"),
        "creator_displayName": author.get("displayName"),
        "hashtags": hashtags
    })

posts_df = pd.DataFrame(posts_data)
print(posts_df.head(5))

                                                 uri  \
0  at://did:plc:4xh6645el4eovoijd7krwfi3/app.bsky...   
1  at://did:plc:ugzypqtdmopxnvkkqcb5qscp/app.bsky...   
2  at://did:plc:k2ojas5zzclc2gewlrcgwv7y/app.bsky...   
3  at://did:plc:k2ojas5zzclc2gewlrcgwv7y/app.bsky...   
4  at://did:plc:i5yx63kwc22vsonhdfoohvyb/app.bsky...   

                                                 cid  \
0  bafyreig2agdmakztaxha4jatsfubjkr34v27ksg7qqsky...   
1  bafyreicqfchr4poavn54aiczkqefpf6as4eufbrrmvl6x...   
2  bafyreic4jqvhfv4637zsujpupzcann4kbsxjlhexnqaw4...   
3  bafyreiebytwfvu2dnaa6wea5ltngds2oxkcvw42zoqr5t...   
4  bafyreigmarmjk6izhiepjphdtqtwghmv33qdbogxv5ho2...   

                                                text  \
0  In the wee hours of this morning, we returned ...   
1  No, it's not The Pyramids. Pavlof volcano in A...   
2  Sunrise \n#Rinjani #Sunrise #Lombok #gili #mar...   
3  Before sunrise\n#Rinjani #Sunrise #Lombok #gil...   
4  Volcanic Ash Advisory (#VAAC) for #Lewotobi

## Data Collection (Endpoint 2):
e.g. app.bsky.actor.getProfiles

Enrich the post data with profile attributes (followers count, display name, etc.).
We gather unique author identifiers (did) from the posts and request them in batches.
NOTE: Will this be a for loop?

In [79]:
## Let us get profile data for all the authors from the previous feed
# get unique author ids which is dids
unique_dids = posts_df["creator_did"].dropna().unique().tolist()
#print(unique_dids)

# Get author profiles for these dids
all_profiles = []
for d in unique_dids:
    params = []
    params.append(("actor", d))
    r = requests.get(f"{BASE_URL}/app.bsky.actor.getProfile", params=params, timeout=30)
    data = r.json()

    # Append this profile in our list 
    # flatten tha data for profile
    all_profiles.append({
        "did": data.get("did"),
        "handle": data.get("handle"),
        "displayName": data.get("displayName"),
        "followersCount": data.get("followersCount"),
        "likeCount": p.get("likeCount"),
        "postsCount": data.get("postsCount"),
        "description": data.get("description"),
    })

all_profiles_df = pd.DataFrame(all_profiles)
# This will take a while to load!
all_profiles_df.head(5)

Unnamed: 0,did,handle,displayName,followersCount,likeCount,postsCount,description
0,did:plc:4xh6645el4eovoijd7krwfi3,marthaeastman.bsky.social,Martha Eastman,145,8,443,"Unitarian Universalist, Yoga Teacher, Jewelry ..."
1,did:plc:ugzypqtdmopxnvkkqcb5qscp,mrxexon.bsky.social,mrxexon,2115,8,11802,Retiree from the Oregon coast. Vegetarian. Sco...
2,did:plc:k2ojas5zzclc2gewlrcgwv7y,terumbudivers.com,Terumbu,1003,8,1904,Own photos about diving in Pantai Sire - Gili ...
3,did:plc:i5yx63kwc22vsonhdfoohvyb,volcano.link,VolcanoDiscovery,486,8,5679,"Info, News and Photos about Volcanoes and Eart..."
4,did:plc:vfkow4vkkx2imuhorrvdb2g2,volcanoviews.stefanbohacek.online.ap.brid.gy,Views of volcanoes,24,8,1103,🌉 bridged from https://stefanbohacek.online/@v...


In [82]:
# Get author profiles for these dids
unique_dids = posts_df["creator_did"].dropna().unique().tolist()

profile_url = f"{BASE_URL}/app.bsky.actor.getProfile"
all_profiles = []
# Step 4: Fetch profile data for each DID

for did in unique_dids:
    r = requests.get(profile_url, params={"actor": did}, headers=headers, timeout=30)
    profile = r.json()
    all_profiles.append({
        "did": profile.get("did"),
        "handle": profile.get("handle"),
        "displayName": profile.get("displayName"),
        "followersCount": profile.get("followersCount"),
        "postsCount": profile.get("postsCount"),
        "description": profile.get("description"),
    })
all_profiles_df = pd.DataFrame(all_profiles)
print(profiles_df.head(5))

                                did  \
0  did:plc:4xh6645el4eovoijd7krwfi3   
1  did:plc:ugzypqtdmopxnvkkqcb5qscp   
2  did:plc:k2ojas5zzclc2gewlrcgwv7y   
3  did:plc:i5yx63kwc22vsonhdfoohvyb   
4  did:plc:vfkow4vkkx2imuhorrvdb2g2   

                                         handle         displayName  \
0                     marthaeastman.bsky.social      Martha Eastman   
1                           mrxexon.bsky.social             mrxexon   
2                             terumbudivers.com             Terumbu   
3                                  volcano.link    VolcanoDiscovery   
4  volcanoviews.stefanbohacek.online.ap.brid.gy  Views of volcanoes   

   followersCount  postsCount                 createdAt  \
0             145         443  2025-01-22T14:45:58.940Z   
1            2114       11792  2025-01-30T01:37:03.743Z   
2            1003        1904  2023-08-07T05:28:12.556Z   
3             486        5677  2024-11-14T08:45:12.576Z   
4              24        1102  2025-01-09T0

# Section 4 — Build DataFrames
Use a pandas method to combine your DataFrames. Use your own endpoints and dataframes. Adjust based on your plan:

- merge on a key (author_did), or<br>
- concat to stack rows from multiple endpoints, or<br>
- join to add columns using an index.<br>
- Wrangling (select, clean, sort)

In [81]:
# Classic pandas stitch:
# merge joins rows from the two dataframes based on matching key values.
posts_enriched = posts_df.dropna().merge(
    # Adds "author_" to every column name in all_profiles_df
    # Why? To avoid name collisions (e.g., both dataframes could have handle, displayName) 
    # and to make the origin obvious: anything about the author now clearly starts with author_.
    all_profiles_df.add_prefix("author_"),
    # left_on="author_did": use posts_df["author_did"] as the join key on the left.
    left_on="creator_did",
    # right_on="author_did": use the prefixed key from the right dataframe (formerly did).
    right_on="author_did",
    # how="left": a left join. Keep every row from posts_df (every post), 
    # even if there is no matching profile. If a profile is missing, 
    # the author columns become NaN. 
    # This is what you want for enrichment—don’t drop posts just because the profile lookup failed.
    how="left"
)

posts_enriched.head(5)

Unnamed: 0,uri,cid,text,createdAt,creator_did,creator_handle,creator_displayName,hashtags,author_did,author_handle,author_displayName,author_followersCount,author_postsCount,author_description
0,at://did:plc:4xh6645el4eovoijd7krwfi3/app.bsky...,bafyreig2agdmakztaxha4jatsfubjkr34v27ksg7qqsky...,"In the wee hours of this morning, we returned ...",2025-10-12T23:36:45.991Z,did:plc:4xh6645el4eovoijd7krwfi3,marthaeastman.bsky.social,Martha Eastman,"[#vacation, #Iceland, #Grindavik, #hike, #Fagr...",did:plc:4xh6645el4eovoijd7krwfi3,marthaeastman.bsky.social,Martha Eastman,145,443,"Unitarian Universalist, Yoga Teacher, Jewelry ..."
1,at://did:plc:ugzypqtdmopxnvkkqcb5qscp/app.bsky...,bafyreicqfchr4poavn54aiczkqefpf6as4eufbrrmvl6x...,"No, it's not The Pyramids. Pavlof volcano in A...",2025-10-12T23:32:10.782Z,did:plc:ugzypqtdmopxnvkkqcb5qscp,mrxexon.bsky.social,mrxexon,"[#photo, #alaska, #volcano]",did:plc:ugzypqtdmopxnvkkqcb5qscp,mrxexon.bsky.social,mrxexon,2115,11802,Retiree from the Oregon coast. Vegetarian. Sco...
2,at://did:plc:k2ojas5zzclc2gewlrcgwv7y/app.bsky...,bafyreic4jqvhfv4637zsujpupzcann4kbsxjlhexnqaw4...,Sunrise \n#Rinjani #Sunrise #Lombok #gili #mar...,2025-10-12T22:19:05.982Z,did:plc:k2ojas5zzclc2gewlrcgwv7y,terumbudivers.com,Terumbu,"[#Rinjani, #Sunrise, #Lombok, #gili, #marineli...",did:plc:k2ojas5zzclc2gewlrcgwv7y,terumbudivers.com,Terumbu,1003,1904,Own photos about diving in Pantai Sire - Gili ...
3,at://did:plc:k2ojas5zzclc2gewlrcgwv7y/app.bsky...,bafyreiebytwfvu2dnaa6wea5ltngds2oxkcvw42zoqr5t...,Before sunrise\n#Rinjani #Sunrise #Lombok #gil...,2025-10-12T22:06:05.221Z,did:plc:k2ojas5zzclc2gewlrcgwv7y,terumbudivers.com,Terumbu,"[#Rinjani, #Sunrise, #Lombok, #gili, #marineli...",did:plc:k2ojas5zzclc2gewlrcgwv7y,terumbudivers.com,Terumbu,1003,1904,Own photos about diving in Pantai Sire - Gili ...
4,at://did:plc:i5yx63kwc22vsonhdfoohvyb/app.bsky...,bafyreigmarmjk6izhiepjphdtqtwghmv33qdbogxv5ho2...,Volcanic Ash Advisory (#VAAC) for #Lewotobi #v...,2025-10-12T21:45:11.285Z,did:plc:i5yx63kwc22vsonhdfoohvyb,volcano.link,VolcanoDiscovery,"[#VAAC, #Lewotobi, #volcano, #VAAC, #Darwin]",did:plc:i5yx63kwc22vsonhdfoohvyb,volcano.link,VolcanoDiscovery,486,5679,"Info, News and Photos about Volcanoes and Eart..."


# Section 5 — Conclusion
Describe any patterns you observe in the collected data and how they relate to your hypothesis.
Describe challenges you faced.
Unfortunatly my results were not very clear. I could not find a way to locate the likescount for specific posts rather than the entire acounts. There was some correlation between a higher number of hashtags being used and more likes linked to that account; however this finding was not consistant enough to draw a reasonable conclusion. 