# Bluesky Data Science
## Part 00 - Introduction

November 2024  
Copyrighted and distributed under an [MIT License](https://opensource.org/licenses/MIT)

## Background

> **tl;dr**: This is a series of notebooks introducing common data science tasks using the AT Protocol (ATProto) that powers Bluesky, a post-Twitter microblogging platform.

### RIP Twitter

After Elon Musk acquired Twitter in October 2022, the platform rapidly deteriorated under active mismanagement, including the dismissal of the teams dedicated to trust and safety, policy, and research. Bots, spam, and extremist content have proliferated, and Musk's active campaigning on behalf of Donald Trump's 2024 presidential campaign has further disillusioned users.

Access to Twitter data via its API was also enclosed behind an exorbitantly priced paywall, effectively shutting out researchers, journalists, and students. Given the increasing precarity of Twitter's API access, I no longer recommend that researchers build research projects around Twitter.

### Bluesky
Alternative microblogging services like [Bluesky](https://bsky.app) and [Mastodon](https://joinmastodon.org/) have emerged in this aftermath. Bluesky, in particular, appears to have attracted a critical mass of regular and influential users as well as persuasive features for safety, on-boarding new users, and data access. You will need to have access to a Bluesky/ATProto account and password to proceed.

Bluesky runs on a decentralized social networking protocol called the "Authenticated Transfer Protocol" or "ATProto". More details about ATProto:

* [Documentation](https://docs.bsky.app/docs/advanced-guides/atproto)
* [GitHub repo](https://github.com/bluesky-social/atproto)
* [Wikipedia article](https://en.wikipedia.org/wiki/AT_Protocol)

As data scientists we want to responsibly retrieve data from the social network to understand the behavior of its users. Instead of using web scraping techniques to retrieve this data through a web browser, we will leverage the application programming interfaces (APIs) provided by ATProto to get (and post) data about posts, users, timelines, *etc*. We are going to leverage the ATProto Python software development kit (SDK) developed by Ilya Siamionau:

* [Documentation](https://atproto.blue/en/latest/)
* [GitHub repo](https://github.com/MarshalX/atproto)

We are going to build on many of the [examples](https://github.com/MarshalX/atproto/tree/main/examples) published in the ATProto Python SDK.

### Code
This is a Jupyter Notebook, a format that lets developers and researchers combine code and documentation in a single file that can also execute the code and embed tables and figures. This notebook was developed in an Anaconda environment, which manages Python and other scientific libraries. If you are not already familiar with Jupyter Notebooks, please consult any of the many tutorials about using them:

* [Documentation](https://docs.jupyter.org/en/latest/)
* [Dataquest tutorial](https://www.dataquest.io/blog/jupyter-notebook-tutorial/)
* [Datacamp tutorial](https://www.datacamp.com/tutorial/tutorial-jupyter-notebook)
* [Codecademy](https://www.codecademy.com/article/how-to-use-jupyter-notebooks)

The rest of the notebooks will assume familiarity with basic Python data structures (lists, dictionaries), flow control and exception handling (`if`, `for`, `while`, `try`), and objects (methods, attributes).

### Setup
To access data from Bluesky's ATProto APIs, we need to install the `atproto` library.

**You should only need to run this cell once.** After it is installed the first time, you can skip it in the future.

In [None]:
! conda install atproto -c conda-forge --yes

We'll need a few common libraries for all these examples.

In [3]:
# Lets us talk to other servers on the web
import requests

# APIs spit out data in JSON
import json

# Handling dates and times
from datetime import datetime

# DataFrames
import pandas as pd
import numpy as np

# Data visualization
%matplotlib inline
import matplotlib.pyplot as plt

## Client

Once installed, import the library and client.

In [2]:
# Import the atproto Client object
from atproto import Client

You will need to enter your Bluesky/ATProto username and password to retrieve data through this `Client` object.

**NOTE**: Storing your username and password in plain text in a Jupyter Notebook like this one is an *enormous* security risk. Do not share your notebook with others or they will be able to access and post to your account! Replace these values with your account's handle and password.

In [1]:
atproto_creds = {}
atproto_creds['handle'] = 'YourAccount.bsky.social'
atproto_creds['password'] = 'AVerySecurePassword'

The better alternative is to store your account credentials in a file called "atproto.json" outside the notebook. You can copy this code below into a text editor. Make sure to save the file in the same location where you are running this notebook.

```
{
    "handle": "YourAccount.bsky.social",
    "password": "AVerySecurePassword"
}
```

This next code block will not work for you unless you have created a JSON file called "atproto.json" like the one above that lives in the same directory as where you are running this notebook.

In [5]:
with open('atproto.json','r') as f:
    atproto_creds = json.load(f)
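
As a hedged sketch of the same idea, the loader below wraps the file read in a function and checks that both keys are present before they are used. The function name `load_atproto_creds` and the environment variable names `ATPROTO_HANDLE` and `ATPROTO_PASSWORD` are hypothetical; they are not part of the `atproto` library and are only one possible way to keep credentials out of the notebook.

```python
import json
import os
from pathlib import Path


def load_atproto_creds(path='atproto.json'):
    """Return a dict with 'handle' and 'password' keys.

    Reads the JSON file at `path` if it exists. Otherwise falls back to the
    hypothetical environment variables ATPROTO_HANDLE and ATPROTO_PASSWORD.
    """
    creds_path = Path(path)
    if creds_path.exists():
        with creds_path.open('r') as f:
            creds = json.load(f)
    else:
        creds = {
            'handle': os.environ.get('ATPROTO_HANDLE'),
            'password': os.environ.get('ATPROTO_PASSWORD'),
        }

    # Fail early with a clear message instead of failing later at login
    missing = [key for key in ('handle', 'password') if not creds.get(key)]
    if missing:
        raise ValueError(f'Missing credential fields: {missing}')
    return creds


# Example usage (requires the file or the environment variables to exist):
# atproto_creds = load_atproto_creds('atproto.json')
```

Raising a `ValueError` here is a design choice: a missing key surfaces immediately rather than as a less obvious login failure.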

Set up your connection to the API.

In [8]:
# Create a client instance
client = Client()

# Use the client to log in
profile = client.login(
    login = atproto_creds['handle'],
    password = atproto_creds['password']
)

# Once logged in, print the profile handle
print(profile.handle)

brianckeegan.com


Now you are getting data from the Bluesky/ATProto API!

## Profile

The `profile` object returned from the client contains a variety of metadata about our account.

The `.dict()` method converts the `ProfileView` object into a more accessible dictionary object.

In [22]:
profile.dict()

{'did': 'did:plc:lse7quaysss2d3xxm76mouhd',
 'handle': 'brianckeegan.com',
 'associated': {'chat': {'allow_incoming': 'following',
   'py_type': 'app.bsky.actor.defs#profileAssociatedChat'},
  'feedgens': 0,
  'labeler': False,
  'lists': 0,
  'starter_packs': 0,
  'py_type': 'app.bsky.actor.defs#profileAssociated'},
 'avatar': 'https://cdn.bsky.app/img/avatar/plain/did:plc:lse7quaysss2d3xxm76mouhd/bafkreicfipjusogzrmin7y7tazfkgvhlytnqo5x3cld4ye7hdqamgsznrm@jpeg',
 'banner': 'https://cdn.bsky.app/img/banner/plain/did:plc:lse7quaysss2d3xxm76mouhd/bafkreibmrmhm2xashjwsaqv2mkmiot6gnv7jqmr4us7w3dpne3d6fuilmq@jpeg',
 'created_at': '2023-05-25T20:28:43.424Z',
 'description': '{Social, Data, Network, Information} Scientist.\n\nHigh-tempo collaboration, information commons, public interest data science.\n\nBorn at 345ppm. \n\nhttps://www.brianckeegan.com/',
 'display_name': 'Brian C. Keegan',
 'followers_count': 1823,
 'follows_count': 1014,
 'indexed_at': '2024-11-13T20:46:24.612Z',
 'joined_

What else is available inside the `profile` object? We can retrieve other metadata via attributes.

In [11]:
# Profile description
print(profile.description)

{Social, Data, Network, Information} Scientist.

High-tempo collaboration, information commons, public interest data science.

Born at 345ppm. 

https://www.brianckeegan.com/


In [12]:
# Display name
print(profile.display_name)

Brian C. Keegan


In [115]:
# Number of followers
profile.followers_count

22734

In [116]:
# Number of follows
profile.follows_count

407

In [117]:
# Number of posts
profile.posts_count

111

In [118]:
# Date account was created
profile.created_at

'2023-04-29T16:59:53.679Z'
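
The `created_at` value is returned as a string. A minimal sketch of turning it into a timezone-aware `datetime`, assuming the timestamp is ISO 8601 with a trailing `Z` as in the output above; the helper name `parse_atproto_timestamp` and the reference date are illustrative choices, not part of the SDK.

```python
from datetime import datetime, timezone


def parse_atproto_timestamp(timestamp):
    """Convert a string such as '2023-04-29T16:59:53.679Z' to an aware datetime."""
    # Older Python versions do not accept the trailing 'Z',
    # so swap it for an explicit UTC offset before parsing
    return datetime.fromisoformat(timestamp.replace('Z', '+00:00'))


created = parse_atproto_timestamp('2023-04-29T16:59:53.679Z')

# Account age in whole days relative to an arbitrary reference date
reference = datetime(2024, 11, 14, tzinfo=timezone.utc)
account_age_days = (reference - created).days
print(created.isoformat(), account_age_days)
```

Timestamps with an unusual number of fractional digits may need a more forgiving parser such as `pd.to_datetime`.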

In [112]:
# Link to where avatar is stored
print(profile.avatar)

https://cdn.bsky.app/img/avatar/plain/did:plc:lse7quaysss2d3xxm76mouhd/bafkreicfipjusogzrmin7y7tazfkgvhlytnqo5x3cld4ye7hdqamgsznrm@jpeg


All the attributes are listed below. (Typing `dir(profile)` on its own also returns many private helper attributes and methods; I have filtered these out by omitting names with a leading underscore.)

In [17]:
[i for i in dir(profile) if i[0] != '_']

['associated',
 'avatar',
 'banner',
 'construct',
 'copy',
 'created_at',
 'description',
 'dict',
 'did',
 'display_name',
 'followers_count',
 'follows_count',
 'from_orm',
 'handle',
 'indexed_at',
 'joined_via_starter_pack',
 'json',
 'labels',
 'model_computed_fields',
 'model_config',
 'model_construct',
 'model_copy',
 'model_dump',
 'model_dump_json',
 'model_extra',
 'model_fields',
 'model_fields_set',
 'model_json_schema',
 'model_parametrized_name',
 'model_post_init',
 'model_rebuild',
 'model_validate',
 'model_validate_json',
 'model_validate_strings',
 'parse_file',
 'parse_obj',
 'parse_raw',
 'pinned_post',
 'posts_count',
 'py_type',
 'schema',
 'schema_json',
 'update_forward_refs',
 'validate',
 'viewer']
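
The listing above still mixes data fields such as `handle` with helper methods such as `dict` and `json`. A rough sketch of a stricter filter follows, assuming that callables and names beginning with `model_` are helpers rather than data; the function `data_attributes` is a hypothetical helper, not an SDK feature.

```python
def data_attributes(obj, skip_prefixes=('_', 'model_')):
    """List attribute names that look like data fields rather than helpers."""
    names = []
    for name in dir(obj):
        # Skip private names and the model_* helper names seen in the listing
        if name.startswith(skip_prefixes):
            continue
        # Skip methods such as dict(), json(), copy()
        if callable(getattr(obj, name, None)):
            continue
        names.append(name)
    return names


# Example usage with the profile object from above:
# data_attributes(profile)
```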

### Retrieving other users' profiles

Use the client's `.get_profile()` method to retrieve the profile information for another user by their handle.

In [21]:
# Retrieve the data and store as bsky_app_profile
bsky_app_profile = client.get_profile('bsky.app')

# Inspect the JSON
bsky_app_profile.dict()

{'did': 'did:plc:z72i7hdynmk6r22z27h6tvur',
 'handle': 'bsky.app',
 'associated': {'chat': {'allow_incoming': 'none',
   'py_type': 'app.bsky.actor.defs#profileAssociatedChat'},
  'feedgens': 6,
  'labeler': False,
  'lists': 2,
  'starter_packs': 0,
  'py_type': 'app.bsky.actor.defs#profileAssociated'},
 'avatar': 'https://cdn.bsky.app/img/avatar/plain/did:plc:z72i7hdynmk6r22z27h6tvur/bafkreihagr2cmvl2jt4mgx3sppwe2it3fwolkrbtjrhcnwjk4jdijhsoze@jpeg',
 'banner': 'https://cdn.bsky.app/img/banner/plain/did:plc:z72i7hdynmk6r22z27h6tvur/bafkreichzyovokfzmymz36p5jibbjrhsur6n7hjnzxrpbt5jaydp2szvna@jpeg',
 'created_at': '2023-04-12T04:53:57.057Z',
 'description': 'official Bluesky account (check domain👆)\n\nPress: press@blueskyweb.xyz\nSupport: support@bsky.app',
 'display_name': 'Bluesky',
 'followers_count': 9287387,
 'follows_count': 2,
 'indexed_at': '2024-10-17T07:17:00.646Z',
 'joined_via_starter_pack': None,
 'labels': [],
 'pinned_post': {'cid': 'bafyreicnt42y6vo6pfpvyro234ac4o6ijug6a

## Posts
Retrieve the posts written by an account with the `get_author_feed()` method, which returns up to 100 posts at a time.

In [19]:
profile_feed = client.get_author_feed(
    actor = 'brianckeegan.com',
    limit = 100
)

len(profile_feed.feed)

97

Access the posts via the `.feed` attribute.

In [20]:
# Serialize the first post as a JSON string
profile_feed.feed[0].post.json()

'{"author":{"did":"did:plc:2pkng5eiv24p4rdpkes2envv","handle":"boulderhousing.bsky.social","associated":null,"avatar":"https://cdn.bsky.app/img/avatar/plain/did:plc:2pkng5eiv24p4rdpkes2envv/bafkreigtadxlwxfxm77ykbs374lmatpyyevh662e6szdzvifeeifnd23hm@jpeg","created_at":"2024-11-14T01:43:47.108Z","display_name":"BoulderHousing.net","labels":[],"viewer":{"blocked_by":false,"blocking":null,"blocking_by_list":null,"followed_by":"at://did:plc:2pkng5eiv24p4rdpkes2envv/app.bsky.graph.follow/3lautdbvn7q2q","following":"at://did:plc:lse7quaysss2d3xxm76mouhd/app.bsky.graph.follow/3lauqwo5olg2s","known_followers":null,"muted":false,"muted_by_list":null,"py_type":"app.bsky.actor.defs#viewerState"},"py_type":"app.bsky.actor.defs#profileViewBasic"},"cid":"bafyreids5ufcm63id6hay3gyznz4cowi5w3wxqnq4rnvijp62nqswmupfa","indexed_at":"2024-11-14T01:51:03.240Z","record":{"created_at":"2024-11-14T01:51:03.240Z","text":"Hello BSkyers! We advocate for more housing in Boulder. Read our latest newsletter here. u
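
Each post is deeply nested, so it can help to pull out a few fields into a flat record. The sketch below assumes the dictionary layout visible in the outputs in this notebook (`author.handle`, `author.display_name`, `record.created_at`, `record.text`, `cid`); the helper name `flatten_post` is hypothetical.

```python
def flatten_post(post):
    """Reduce a nested post dictionary to a flat dictionary of selected fields."""
    # Fall back to empty dicts so a missing section yields None values
    author = post.get('author') or {}
    record = post.get('record') or {}
    return {
        'handle': author.get('handle'),
        'display_name': author.get('display_name'),
        'created_at': record.get('created_at'),
        'text': record.get('text'),
        'cid': post.get('cid'),
    }


# Example usage with the feed retrieved above:
# rows = [flatten_post(item.post.dict()) for item in profile_feed.feed]
# posts_df = pd.DataFrame(rows)
```

A list of flat dictionaries like this converts directly into a pandas DataFrame with one row per post.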

## Timeline
Retrieve the logged-in account's current timeline with the `get_timeline()` method.

In [123]:
timeline = client.get_timeline(algorithm='reverse-chronological')

len(timeline.feed)

48

In [125]:
timeline.feed[1].post.dict()

{'author': {'did': 'did:plc:gthl7kdyryv5lylos55da747',
  'handle': 'bcdreyer.bsky.social',
  'associated': {'chat': {'allow_incoming': 'following',
    'py_type': 'app.bsky.actor.defs#profileAssociatedChat'},
   'feedgens': None,
   'labeler': None,
   'lists': None,
   'starter_packs': None,
   'py_type': 'app.bsky.actor.defs#profileAssociated'},
  'avatar': 'https://cdn.bsky.app/img/avatar/plain/did:plc:gthl7kdyryv5lylos55da747/bafkreieuifdijuivbvvdmmyph2nvcsk37l6eub4rru4qoxe6vphserfn6i@jpeg',
  'created_at': '2023-06-24T12:46:55.564Z',
  'display_name': 'Benjamin Dreyer',
  'labels': [],
  'viewer': {'blocked_by': False,
   'blocking': None,
   'blocking_by_list': None,
   'followed_by': None,
   'following': None,
   'known_followers': None,
   'muted': False,
   'muted_by_list': None,
   'py_type': 'app.bsky.actor.defs#viewerState'},
  'py_type': 'app.bsky.actor.defs#profileViewBasic'},
 'cid': 'bafyreihk4375rbayimrbxa5iwfqax7n6hfuz3arjjjwxjtjhiysgqazsma',
 'indexed_at': '2024-11-

## Relationships

### Follows
Follows are the accounts that an account follows. Their posts appear in the account's timeline. We can only retrieve up to 100 at a time.

In [126]:
follows = client.get_follows(actor='brianckeegan.com',limit=100)

len(follows.follows)

100

In [127]:
follows.follows[0].dict()

{'did': 'did:plc:tfkzy22x2og2w2264fycbb7b',
 'handle': 'dwensign.bsky.social',
 'associated': None,
 'avatar': 'https://cdn.bsky.app/img/avatar/plain/did:plc:tfkzy22x2og2w2264fycbb7b/bafkreiheq3a63cx2aibhlwi5qrk7rtwxi6eplh7xzvmszjeondoszhsql4@jpeg',
 'created_at': '2024-09-04T00:32:07.231Z',
 'description': None,
 'display_name': 'David Ensign',
 'indexed_at': '2024-11-12T05:05:10.639Z',
 'labels': [],
 'viewer': {'blocked_by': False,
  'blocking': None,
  'blocking_by_list': None,
  'followed_by': 'at://did:plc:tfkzy22x2og2w2264fycbb7b/app.bsky.graph.follow/3laq33af3fj2b',
  'following': 'at://did:plc:lse7quaysss2d3xxm76mouhd/app.bsky.graph.follow/3laq3lzprkv23',
  'known_followers': None,
  'muted': False,
  'muted_by_list': None,
  'py_type': 'app.bsky.actor.defs#viewerState'},
 'py_type': 'app.bsky.actor.defs#profileView'}

Get the next set of follows (101 to 200) by passing the `cursor` value from the previous response.

In [None]:
next_follows = client.get_follows(
    actor = 'brianckeegan.com',
    limit = 100,
    cursor = follows.cursor
)

Check that different accounts were returned. The comparison below should evaluate to `False`.

In [None]:
follows.follows[0].handle == next_follows.follows[0].handle
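
Repeating the cursor step by hand gets tedious for accounts with many follows. A minimal sketch of a generic pagination loop, assuming each response exposes a list attribute (such as `follows`) and a `cursor` that is empty on the last page; `fetch_all` and its parameters are hypothetical names, and `max_pages` is a safety cap chosen here to avoid hammering the API.

```python
def fetch_all(fetch_page, items_attr, max_pages=10):
    """Collect items across pages.

    fetch_page: callable that takes a cursor (None for the first page)
                and returns a response object
    items_attr: name of the list attribute on the response, e.g. 'follows'
    max_pages:  upper bound on the number of requests made
    """
    items = []
    cursor = None
    for _ in range(max_pages):
        response = fetch_page(cursor)
        items.extend(getattr(response, items_attr))
        cursor = response.cursor
        # Stop once the API signals there are no more pages
        if not cursor:
            break
    return items


# Example usage with the client from above:
# all_follows = fetch_all(
#     lambda cursor: client.get_follows(
#         actor='brianckeegan.com', limit=100, cursor=cursor
#     ),
#     'follows'
# )
```

The same loop should work for followers by swapping in `client.get_followers` and `'followers'`.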

### Followers
Followers are the other accounts that follow a user. The user's posts appear in their timelines. We can only retrieve up to 100 at a time.

In [24]:
followers = client.get_followers(actor='brianckeegan.com',limit=100)

len(followers.followers)

100

Inspect one of these followers.

In [25]:
followers.followers[0].dict()

{'did': 'did:plc:3rvwizwndqx6cnwdbfywliut',
 'handle': 'lanebecker.bsky.social',
 'associated': None,
 'avatar': 'https://cdn.bsky.app/img/avatar/plain/did:plc:3rvwizwndqx6cnwdbfywliut/bafkreigsozlbubyrqrzqplhilua7l7uawnzjmv2vzu5eybjdokch2al2ji@jpeg',
 'created_at': '2024-10-31T13:42:54.762Z',
 'description': 'Prestidigitation, double shuffling, honey-fugling, hornswaggling, & skullduggery.',
 'display_name': 'Lane Becker',
 'indexed_at': '2024-11-09T20:50:05.314Z',
 'labels': [],
 'viewer': {'blocked_by': False,
  'blocking': None,
  'blocking_by_list': None,
  'followed_by': 'at://did:plc:3rvwizwndqx6cnwdbfywliut/app.bsky.graph.follow/3lauzfpzcpv2s',
  'following': None,
  'known_followers': None,
  'muted': False,
  'muted_by_list': None,
  'py_type': 'app.bsky.actor.defs#viewerState'},
 'py_type': 'app.bsky.actor.defs#profileView'}
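
The `viewer` block in these outputs appears to describe the relationship between the logged-in account and the listed account: in the examples above, `following` is set when we follow them and `followed_by` is set when they follow us. Under that assumption, a hypothetical helper `relationship` can label each account.

```python
def relationship(profile):
    """Classify a profile dictionary by the state of its 'viewer' block."""
    viewer = profile.get('viewer') or {}
    # A non-empty value is assumed to be the URI of the follow record
    we_follow_them = viewer.get('following') is not None
    they_follow_us = viewer.get('followed_by') is not None

    if we_follow_them and they_follow_us:
        return 'mutual'
    if we_follow_them:
        return 'following'
    if they_follow_us:
        return 'follower'
    return 'none'


# Example usage with the followers retrieved above:
# labels = [relationship(f.dict()) for f in followers.followers]
```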

## Posting content

### Writing a post

In [128]:
client.send_post("I am posting this from a Jupyter Notebook in my class!")

CreateRecordResponse(uri='at://did:plc:lse7quaysss2d3xxm76mouhd/app.bsky.feed.post/3lareilifis2h', cid='bafyreifacygwvqtrqc7dts2fk67iyv2sa2u3irqdlry7trusq424zmqu7a')
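
The `uri` in the response identifies the new record. A small sketch of splitting it into its parts follows, assuming the three-part `at://authority/collection/record-key` layout seen in the output above; `parse_at_uri` is a hypothetical helper written for illustration, not an SDK function.

```python
def parse_at_uri(uri):
    """Split an AT URI into its authority, collection, and record key."""
    prefix = 'at://'
    if not uri.startswith(prefix):
        raise ValueError(f'Not an AT URI: {uri}')

    parts = uri[len(prefix):].split('/')
    if len(parts) != 3:
        raise ValueError(f'Expected authority/collection/rkey in: {uri}')

    authority, collection, rkey = parts
    return {'authority': authority, 'collection': collection, 'rkey': rkey}


parsed = parse_at_uri(
    'at://did:plc:lse7quaysss2d3xxm76mouhd/app.bsky.feed.post/3lareilifis2h'
)
print(parsed['collection'])
```

Here the authority is the account's DID and the collection shows the record type, which is a post in this case.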

### Liking a post

In [129]:
client.like(
    uri='at://did:plc:lse7quaysss2d3xxm76mouhd/app.bsky.feed.post/3lareilifis2h',
    cid='bafyreifacygwvqtrqc7dts2fk67iyv2sa2u3irqdlry7trusq424zmqu7a'
)

CreateRecordResponse(uri='at://did:plc:lse7quaysss2d3xxm76mouhd/app.bsky.feed.like/3lareki7q6p23', cid='bafyreiclt4hminrjuj5c5zvsobrdmvx5jetw7xjpopkffj4ma35eflcsiu')

### Writing a post with an image

In [132]:
with open('/Users/briankeegan/Desktop/small_class.jpg','rb') as img:
    img_data = img.read()

client.send_image(
    text = 'Now we are sending images from a notebook!',
    image = img_data,
    image_alt = 'Selfie with class'
)

CreateRecordResponse(uri='at://did:plc:lse7quaysss2d3xxm76mouhd/app.bsky.feed.post/3larexne6vk23', cid='bafyreihu76mm2xakrlzasrpmlvyyatvi2idlyf3j2rqeqo3tsgh6nioure')