# Section 1 ‚Äî Report Header & Hypothesis

### Report Title: Hashtags on BlueSky
### Your Name: Cecilia Pisano
### Date: 2025-10-12

## Hypothesis

Posts that include at least one hashtag receive more likes on average than posts without hashtags. 

-- Variables that help form this hypothesis are Independent of the presence of hashtags being yes or no, Dependent being the number of likes per post accounted for. Bluesky's API can provide such aspects of the post content that the hashtag would be under and the likes count per post.

## Theoretical Rationale

Hashtags are used on social media paltforms to categorize content that people post to the hashtag to form a community. It increases a post's visbility by connecting to their community on the hashtag the user has put it under. Affordance theory by Treem and Leonardi in 2012, confirmed that, "features such as hashtags provide users with new ways to interact and gain exposure" (Treem & Leonardi). Other platforms such as Twitter (X) and Instagram show posts with hashtags often get higher engagement with even just being on the home screen. Having this knowledge in mind with Bluesky, it is likely to be the same scenario that posts with hashtags are seen by more users, increasing the chances of getting more likes on the post.

## Statistical Application

To test my hypothesis I'd suggest a group comparrison approach to recieve the most accurate between 2 groups. The test would compare the average number of likes on posts that atleast include one hashtag to posts without without hashtag. Group 1 would be Posts with atleast one hashtag and Group 2 would be posts without any hashtags. The test would result in comparing the number of likes per post between the two groups.

Variables that would be used are like_count-- shows how many likes a post recieves numerically and has_hashtag-- shows whether a post contains a hashtag or not.

If the difference between the two groups is statiscally significattly different from one another, my hypothesis then would be correct that hashtags are ultimately linked to higher engagement on Bluesky.


# Section 2 ‚Äî Endpoint Plan (Design Your Data Collection)

Identify the Bluesky API endpoints you will use and why they are suitable for testing your hypothesis.
Link: https://docs.bsky.app/docs/category/http-reference

app.bsky.feed.searchPosts ‚Äî to collect posts matching a topic, hashtag, or keyword set.
app.bsky.actor.getProfiles ‚Äî to enrich authors with profile metadata (e.g., displayName, followersCount).
app.bsky.feed.getAuthorFeed ‚Äî to get posts authored by a specific actor (for longitudinal behavior).
For each endpoint, specify:

To test my hypothesis, using the app.bsky.feed.searchPosts endpoint allowed me to collect a sample of public posts and also allows me to search using the q (query) parameter to gather posts that include hashtags. By using the q parameter, extracting text (hashtags) and likeCount (engagement) to get my varaibles in a columns. Another endpoint that I will use is the app.bsky.actor.getProfiles to retrieve profile metadata for the post authors using the did parameter. Allowing me to extract fields of followersCount that will help me understand the effect of the audience size. And the last endpoint I will use i the app.bsky.feed.getAuthorFeed to pull the effects of the posts from individual users using the actor parametter to look into how the hashtag being used, affects the likes that the post recieves within the same account. Endpoints such as these, provide the data to classify the posts into groups, compare the like counts in between posts and control user-specific factors that are affected by the hashtag being used.


## Reliability and Bias

The data could be seen as unreliable because of spam accounts/bots on bluesky that could intrerupt how accurate the data truly is. Bots/Spam accounts can be seen as 'not real accounts' used by humans which could make the hypothesis inaccurate to what I am looking for for if hashtags prove more engagement with likes on posts. People can bot accounts and make them like their posts for real-world currency to make their account be fluffed up. Bluesky possibly has the ability 'silence' a post in a way that makes the engagement on ones account be lower than expected, this is something that Tiktok and Instagram does by shadow banning an account to recieve less engagement then what it normally should be. Sensitive data could be another thing that could get in the way when collecting this data; it also doesn't provide data for possible privated accounts (if Bluesky has that privacy system in place-- I have not used it).

## Limitations

Limitations that I found in data were that the endpoint I orginally wanted didn't show the hashtags that were on posts, so I had to switch because the endpoint didn't have the information I was using for my hypothesis. And using then the next endpoint to see if my data matched up; the endpoint didn't have the liked posts aspect that I was wanting. I settled for the posts having hashtags rather than liked posts because the information could show me if the hashtags increase engagement on their profiles rather than just on the posts.

# Section 3 Data Collection

Collect posts that match a query. Adjust QUERY, MAX_POSTS, and any filters your hypothesis requires.

In [144]:
#imports
import requests
import time
import json as js
import pandas as pd
import re

BASE_URL = "https://api.bsky.app/xrpc"

## Data Collection (Endpoint 1):

e.g. app.bsky.feed.searchPosts Flatten key fields from Bluesky PostView objects.



In [145]:
endpoint = f"{BASE_URL}/app.bsky.feed.searchPosts"
headers = {"User-Agent": "EMAT-Teaching/1.0 (+contact@example.com)"}
SEARCH_TERM = "#ocean" # this is the keyword being searched
LIMIT = 50
params = {
    "q": SEARCH_TERM,
    "limit": LIMIT
}

resp = requests.get(endpoint, params=params, headers=headers, timeout=30)

print("Status:", resp.status_code)

# data = resp.json()

print("Top-level keys:", list(data.keys()))
search_data = resp.json()

posts = search_data.get("posts", [])
print(f"Fetched {len(posts)} posts")

Status: 200
Top-level keys: ['posts', 'cursor']
Fetched 49 posts


In [146]:
posts_data = []
for post in posts:
    author = post.get("author", {})
    text = post.get("record", {}).get("text", "")
    
    # Extract hashtags using regex
    hashtags = re.findall(r"#\w+", text)

    posts_data.append({
        "uri": post.get("uri"),
        "cid": post.get("cid"),
        "text": text,
        "createdAt": post.get("indexedAt"),
        "creator_did": author.get("did"),
        "creator_handle": author.get("handle"),
        "creator_displayName": author.get("displayName"),
        "hashtags": hashtags
    })


print("\nTop 5 Posts:")
print(posts_df.head(5))
posts_df = pd.DataFrame(posts_data)


Top 5 Posts:
                                                 uri  \
0  at://did:plc:kd5yqu6lp55ktbw2e422gm26/app.bsky...   
1  at://did:plc:ky7ox3ap67ecqp7bmzk43hrs/app.bsky...   
2  at://did:plc:lc6hdhnjkgccd6ysi7e23pll/app.bsky...   
3  at://did:plc:a674w4yr7fc3mfj4pc2rzhfz/app.bsky...   
4  at://did:plc:ecwqtr7kgqoxrtvkjrk6blsy/app.bsky...   

                                                 cid  \
0  bafyreicn2fg72lhpbjw2mbgrb6fhf7rvdzy5d2rgmytob...   
1  bafyreif74p7metkrripdmilm6swqzebwpdsiglgk4znxv...   
2  bafyreia75atbilozph2jawnipnddki5qu3a4r63zy3vka...   
3  bafyreicl6jbqxjisl3az3evlsl3mbbt5ovykiuc2kafk2...   
4  bafyreihxif764lzgl3bbwykprzxnu7j36gatildhecigj...   

                                                text  \
0  A Huge school of rays circling around the boat...   
1  WEST SUPERIOR - 30NM NE of Outer Island, WI\n\...   
2  Coral reefs at risk of mass dieback as global ...   
3  Meet the Giant River Otter, South America‚Äôs to...   
4  #DavidHasselhoff\n\nBaywatc

## Data Collection (Endpoint 2):

e.g. app.bsky.actor.getProfile

Enrich the post data with profile attributes (followers count, display name, etc.).
We gather unique author identifiers (did) from the posts and request them in batches.
NOTE: Will this be a for loop?

In [147]:
unique_dids = posts_df["creator_did"].dropna().unique().tolist()

profile_url = f"{BASE_URL}/app.bsky.actor.getProfile"
all_profiles = []

for did in unique_dids:
    r = requests.get(profile_url, params={"actor": did}, headers=headers, timeout=30)
    
    if r.status_code == 200:
        profile = r.json()
        all_profiles.append({
            "did": profile.get("did"),
            "handle": profile.get("handle"),
            "displayName": profile.get("displayName"),
            "followersCount": profile.get("followersCount"),
            "postsCount": profile.get("postsCount"),
            "createdAt": profile.get("createdAt"),
            "description": profile.get("description"),
        })

all_profiles_df = pd.DataFrame(all_profiles)
print("\nTop 5 Profiles:")
print(profiles_df.head(5))


Top 5 Profiles:
                                did  \
0  did:plc:4xh6645el4eovoijd7krwfi3   
1  did:plc:ugzypqtdmopxnvkkqcb5qscp   
2  did:plc:k2ojas5zzclc2gewlrcgwv7y   
3  did:plc:i5yx63kwc22vsonhdfoohvyb   
4  did:plc:vfkow4vkkx2imuhorrvdb2g2   

                                         handle         displayName  \
0                     marthaeastman.bsky.social      Martha Eastman   
1                           mrxexon.bsky.social             mrxexon   
2                             terumbudivers.com             Terumbu   
3                                  volcano.link    VolcanoDiscovery   
4  volcanoviews.stefanbohacek.online.ap.brid.gy  Views of volcanoes   

   followersCount  postsCount                 createdAt  \
0             145         443  2025-01-22T14:45:58.940Z   
1            2114       11792  2025-01-30T01:37:03.743Z   
2            1003        1904  2023-08-07T05:28:12.556Z   
3             486        5677  2024-11-14T08:45:12.576Z   
4              24        1

# Section 4 ‚Äî Build DataFrames

Use a pandas method to combine your DataFrames. Use your own endpoints and dataframes. Adjust based on your plan:

merge on a key (author_did), or
concat to stack rows from multiple endpoints, or
join to add columns using an index.
Wrangling (select, clean, sort)

In [148]:
# Classic pandas stitch:
# merge joins rows from the two dataframes based on matching key values.
posts_enriched = posts_df.merge(
    # Adds "author_" to every column name in all_profiles_df
    # Why? To avoid name collisions (e.g., both dataframes could have handle, displayName) 
    # and to make the origin obvious: anything about the author now clearly starts with author_.
    all_profiles_df.add_prefix("author_"),
    # left_on="author_did": use posts_df["author_did"] as the join key on the left.
    left_on="creator_did",
    # right_on="author_did": use the prefixed key from the right dataframe (formerly did).
    right_on="author_did",
    # how="left": a left join. Keep every row from posts_df (every post), 
    # even if there is no matching profile. If a profile is missing, 
    # the author columns become NaN. 
    # This is what you want for enrichment‚Äîdon‚Äôt drop posts just because the profile lookup failed.
    how="left"
)

posts_enriched.head(5)

Unnamed: 0,uri,cid,text,createdAt,creator_did,creator_handle,creator_displayName,hashtags,author_did,author_handle,author_displayName,author_followersCount,author_postsCount,author_createdAt,author_description
0,at://did:plc:kd5yqu6lp55ktbw2e422gm26/app.bsky...,bafyreicn2fg72lhpbjw2mbgrb6fhf7rvdzy5d2rgmytob...,A Huge school of rays circling around the boat...,2025-10-13T00:06:50.481Z,did:plc:kd5yqu6lp55ktbw2e422gm26,seethroughcanoe.bsky.social,See Through Canoe,"[#nature, #amazing, #animals, #wildlife, #awes...",did:plc:kd5yqu6lp55ktbw2e422gm26,seethroughcanoe.bsky.social,See Through Canoe,1943,63,2025-06-12T15:08:43.440Z,100% original nature videos. My company makes ...
1,at://did:plc:ky7ox3ap67ecqp7bmzk43hrs/app.bsky...,bafyreif74p7metkrripdmilm6swqzebwpdsiglgk4znxv...,"WEST SUPERIOR - 30NM NE of Outer Island, WI\n\...",2025-10-13T00:04:28.486Z,did:plc:ky7ox3ap67ecqp7bmzk43hrs,sea.stefanbohacek.online.ap.brid.gy,"At sea, at a lake","[#sea, #ocean, #water, #webcam]",did:plc:ky7ox3ap67ecqp7bmzk43hrs,sea.stefanbohacek.online.ap.brid.gy,"At sea, at a lake",11,1503,2025-01-09T02:02:26.744Z,üåä\n\nüåâ bridged from https://stefanbohacek.onli...
2,at://did:plc:lc6hdhnjkgccd6ysi7e23pll/app.bsky...,bafyreia75atbilozph2jawnipnddki5qu3a4r63zy3vka...,Coral reefs at risk of mass dieback as global ...,2025-10-12T23:44:19.480Z,did:plc:lc6hdhnjkgccd6ysi7e23pll,bigearthdata.bsky.social,"Climate, Ecology, War & More by Dr. Glen Barry...","[#CoralReef, #Ocean, #TippingPoint, #ClimateCh...",did:plc:lc6hdhnjkgccd6ysi7e23pll,bigearthdata.bsky.social,"Climate, Ecology, War & More by Dr. Glen Barry...",6318,301416,2024-02-11T13:26:52.835Z,Reality Matters. #AI curated #Environment #Cli...
3,at://did:plc:a674w4yr7fc3mfj4pc2rzhfz/app.bsky...,bafyreicl6jbqxjisl3az3evlsl3mbbt5ovykiuc2kafk2...,"Meet the Giant River Otter, South America‚Äôs to...",2025-10-12T23:40:11.680Z,did:plc:a674w4yr7fc3mfj4pc2rzhfz,wildlifenomads.bsky.social,Wildlife Nomads,"[#ocean, #nature, #animals, #wildlife, #brazil...",did:plc:a674w4yr7fc3mfj4pc2rzhfz,wildlifenomads.bsky.social,Wildlife Nomads,218,377,2024-11-28T04:05:23.746Z,üåç Discover the world's wildlife and natural wo...
4,at://did:plc:ecwqtr7kgqoxrtvkjrk6blsy/app.bsky...,bafyreihxif764lzgl3bbwykprzxnu7j36gatildhecigj...,#DavidHasselhoff\n\nBaywatch Nights 2x01 - Ter...,2025-10-12T23:38:12.786Z,did:plc:ecwqtr7kgqoxrtvkjrk6blsy,randombaywatch.bsky.social,RandomBaywatch,"[#DavidHasselhoff, #RandomBaywatch, #lvdlpx, #...",did:plc:ecwqtr7kgqoxrtvkjrk6blsy,randombaywatch.bsky.social,RandomBaywatch,116,9884,2024-07-04T06:36:20.778Z,Random Baywatch pics hourly. \n\nI am the worl...


# Section 5 ‚Äî Conclusion

Based on the patterns I observed in the data collected, having that the posts were categorized completely by a searchword (being ocean for this example), I believe that the followers were more prominent and larger in the more hashtags were used. Though my hypothesis said that either NO hashtags or more than 1 hashtag, the data was collected with only hashtags in mind; however, the amount of hashtags could be seen in the data to be a clear indicator that the more hashtags that were used in the posts, the higher the followers count rather than the likes on the post, would be higher. For one of the bluesky endpoints, likes on posts for my original hypothesis was not able to be seen and merged together with the other endpoint to make the data table but also only 1 of the endpoints had the hashtags that were needed for my hypothesis. Though my hypothesis wasn't fully able to be done with the API data, there was a trend that still could be seen in the data as the more hashtags the post has, the possible more followers that the account as; I would change this to be my final hypothesis in the next round I would try to work with API's.