## Lab 1.5: Scraping Mastodon

In this notebook we will be scraping messages from [Mastodon](https://docs.joinmastodon.org/), an open-source alternative to Twitter (X). The messages sent over Mastodon are called *toots*. **Check the code to crawl toots below and try to understand the structure of the toots in JSON format.**

## 1. Crawling Toots 

In [2]:
import json
import requests
import pandas as pd
import pprint

# Set up your keyword(s) for the hashtag
hashtag = 'veganism'  
URL = f'https://mastodon.social/api/v1/timelines/tag/{hashtag}'

# You can crawl at most 40 toots at once
params = {'limit': 40}

response = requests.get(URL, params=params)
response.raise_for_status()  # stop early if the request failed
toots = json.loads(response.text)  # parse the JSON response into a list of dictionaries
   
# We use prettyprint to add indents to the output
pp = pprint.PrettyPrinter(indent=1)
for toot in toots[0:2]:
    pp.pprint(toot)
    print("\n---------------\n")



{'account': {'acct': 'SofiaK@social.politicaconciencia.org',
             'avatar': 'https://files.mastodon.social/cache/accounts/avatars/106/604/592/089/366/171/original/f3f8499cf48abaa5.gif',
             'avatar_static': 'https://files.mastodon.social/cache/accounts/avatars/106/604/592/089/366/171/static/f3f8499cf48abaa5.png',
             'bot': False,
             'created_at': '2021-07-18T00:00:00.000Z',
             'discoverable': False,
             'display_name': '¡Emancipemos a los animales!',
             'emojis': [],
             'fields': [],
             'followers_count': 351,
             'following_count': 526,
             'group': False,
             'header': 'https://files.mastodon.social/cache/accounts/headers/106/604/592/089/366/171/original/aacb7722bcf3d8ca.gif',
             'header_static': 'https://files.mastodon.social/cache/accounts/headers/106/604/592/089/366/171/static/aacb7722bcf3d8ca.png',
             'id': '106604592089366171',
             'last_s

## 2. Inspecting metadata 

The toots come with a complex metadata structure. JSON represents this as nested dictionaries, and you can access information using the respective keys. **Inspect the metadata structure and reflect on how the information is relevant for your analysis.**

If you use keys such as *language*, *sensitive*, or *bot* for filtering the posts, you should **check in the documentation how this information is determined.** It might be the case that some information is missing, i.e., some keys are not necessarily available for all toots. 
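Because some keys may be missing or empty, a filter is more robust when it reads values with `dict.get` instead of indexing directly. The following is a minimal sketch, not a recommended filter: the function name `keep_toot`, the sample toots, and the decision to treat a missing `sensitive` or `bot` value as `False` are all assumptions made for illustration.

```python
def keep_toot(toot, language="en"):
    """Return True for non-sensitive toots in `language` from non-bot accounts.

    Assumption: a missing 'sensitive' or 'bot' key is treated as False,
    and a toot without a language never matches.
    """
    account = toot.get("account") or {}
    return (
        toot.get("language") == language
        and not toot.get("sensitive", False)
        and not account.get("bot", False)
    )


# Made-up toots that only contain the keys used by the filter
sample_toots = [
    {"id": "1", "language": "en", "sensitive": False, "account": {"bot": False}},
    {"id": "2", "language": "nl", "sensitive": False, "account": {"bot": False}},
    {"id": "3", "language": "en", "sensitive": True, "account": {"bot": False}},
    {"id": "4", "language": "en", "account": {"bot": True}},
    {"id": "5", "language": None, "sensitive": False},
]

kept = [toot["id"] for toot in sample_toots if keep_toot(toot)]
print(kept)
```

Whether a toot with missing metadata should be kept or dropped depends on your research question, so the defaults above should be chosen deliberately.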

In [4]:
import html2text
posts = toots
print("Metadata for toot:")
print(posts[0].keys())
print()

# The json format is a nested dictionary
print("Metadata for user account which posted toot:")
print(posts[0]["account"].keys())
print()
      
print(html2text.html2text(posts[2]["content"]))

Metadata for toot:
dict_keys(['id', 'created_at', 'in_reply_to_id', 'in_reply_to_account_id', 'sensitive', 'spoiler_text', 'visibility', 'language', 'uri', 'url', 'replies_count', 'reblogs_count', 'favourites_count', 'edited_at', 'content', 'reblog', 'account', 'media_attachments', 'mentions', 'tags', 'emojis', 'card', 'poll'])

Metadata for user account which posted toot:
dict_keys(['id', 'username', 'acct', 'display_name', 'locked', 'bot', 'discoverable', 'group', 'created_at', 'note', 'url', 'uri', 'avatar', 'avatar_static', 'header', 'header_static', 'followers_count', 'following_count', 'statuses_count', 'last_status_at', 'emojis', 'fields'])

What Is Veganism? And How Is It Different To A Plant-Based Diet?
[#plantbased](https://veganism.social/tags/plantbased)
[#diet](https://veganism.social/tags/diet)
[#veganism](https://veganism.social/tags/veganism)
[#GoVegan](https://veganism.social/tags/GoVegan)
[https://plantbasednews.org/culture/ethics/what-is-
veganism/](https://plantbase

## 3. Getting around the limit

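A single request returns at most 40 toots. To get around this, we use the unique identifier (id) of each toot: toots that were created later have a larger id, and the API returns the most recent toots first. If we pass the id of the oldest toot we have received as `max_id`, the next request only returns toots that are older than that one.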

In [5]:
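num_toots = 200
results = []
# Start from the newest toots again, in case an earlier run left a max_id in params
params = {'limit': 40}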

# Be careful, if this expression never evaluates to False, 
# you will be stuck in an endless loop

while len(results) < num_toots:
    
    # Send query
    r = requests.get(URL, params=params)
    toots = json.loads(r.text)  # parse the JSON response into a list of dictionaries
    
    # If we do not get any results, we leave the loop (although we might have collected fewer than num_toots)
    if len(toots) == 0: 
        break
    
    for toot in toots: 
        results.append(toot)
        if len(results) == num_toots:
            break
            
        
    max_id = toots[-1]['id'] 
    # We add the parameter max_id to our request
    params['max_id'] = max_id
    
    print("Next request looking at toots with ids smaller than: ")
    print(max_id)
    
    
print(results[199:200])

Next request looking at toots with ids smaller than: 
111287369327270860
Next request looking at toots with ids smaller than: 
111173953504503449
Next request looking at toots with ids smaller than: 
111079964383007656
Next request looking at toots with ids smaller than: 
111034361527215691
Next request looking at toots with ids smaller than: 
110961547458716493
[{'id': '110961547458716493', 'created_at': '2023-08-27T12:20:46.725Z', 'in_reply_to_id': None, 'in_reply_to_account_id': None, 'sensitive': False, 'spoiler_text': '', 'visibility': 'public', 'language': 'nl', 'uri': 'https://mastodon.social/users/itsveganjim/statuses/110961547458716493', 'url': 'https://mastodon.social/@itsveganjim/110961547458716493', 'replies_count': 0, 'reblogs_count': 9, 'favourites_count': 13, 'edited_at': None, 'content': '<p><a href="https://mastodon.social/tags/melaniejoy" class="mention hashtag" rel="tag">#<span>melaniejoy</span></a> <a href="https://mastodon.social/tags/veganism" class="mention hasht

You can additionally use the "created_at" timestamp to limit your toots to a certain time period. 
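A minimal sketch of such a time filter is shown here. It assumes that every timestamp has the form seen in the output above (for example `2023-08-27T12:20:46.725Z`, in UTC); the function names and the sample toots are made up for illustration.

```python
from datetime import datetime, timezone


def parse_created_at(timestamp):
    """Parse a timestamp such as '2023-08-27T12:20:46.725Z' into an aware datetime."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def filter_by_period(toots, start, end):
    """Keep toots with start <= created_at < end."""
    return [
        toot for toot in toots
        if start <= parse_created_at(toot["created_at"]) < end
    ]


# Made-up toots that only contain the keys used by the filter
sample_toots = [
    {"id": "1", "created_at": "2023-08-27T12:20:46.725Z"},
    {"id": "2", "created_at": "2023-10-01T08:00:00.000Z"},
]

august = filter_by_period(
    sample_toots,
    start=datetime(2023, 8, 1, tzinfo=timezone.utc),
    end=datetime(2023, 9, 1, tzinfo=timezone.utc),
)
print([toot["id"] for toot in august])
```

Because the results are ordered from newest to oldest, you could also stop crawling as soon as a toot is older than the start of your period.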

## Reproducibility

If someone else crawls toots using your code at a later time, they will retrieve a different set because Mastodon always returns the most recent toots first. 

To avoid this, you should provide a list of ids of your crawled toots. 

In [6]:
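ids = [toot["id"] for toot in results]
# You can specify the id of the toot you want to crawl.
# A single toot is served by the statuses endpoint; the tag timeline ignores an 'id' parameter.
status_url = f'https://mastodon.social/api/v1/statuses/{ids[0]}'

r = requests.get(status_url)
toot = json.loads(r.text)  # parse the JSON response (a single dictionary, not a list)

print(toot["content"] == results[0]["content"])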

True
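
Providing the list of ids alongside your code can be as simple as writing one id per line to a text file. The sketch below is one possible way to do this; the function names and the file name `toot_ids.txt` are hypothetical. The ids are kept as strings, as they appear in the JSON output.

```python
import os
import tempfile


def save_ids(ids, path):
    """Write one toot id per line."""
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write("\n".join(str(toot_id) for toot_id in ids) + "\n")


def load_ids(path):
    """Read the ids back as a list of strings, skipping empty lines."""
    with open(path, encoding="utf-8") as infile:
        return [line.strip() for line in infile if line.strip()]


# Demonstration with two ids taken from the output above
demo_ids = ["111287369327270860", "110961547458716493"]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "toot_ids.txt")  # hypothetical file name
    save_ids(demo_ids, demo_path)
    reloaded_ids = load_ids(demo_path)

print(reloaded_ids == demo_ids)
```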


**How do you need to adjust the above code to write a function that yields the correct toots when given a list of ids?** Keep in mind that toots might have been deleted by a user in the meantime. In that case, it is impossible to retrieve the same set of toots. You need to make sure that the code deals with this exception and outputs an appropriate warning. 

## 4. Saving results

We have two options for saving the results. 
1. We can select specific attributes and save them as a tsv-file. 
2. If we do not want to decide yet which attributes we need, we can simply dump the whole JSON result to a file and process it later. 

**Make sure that you understand the code below. Open the result files in an editor and compare the differences.** 


In [7]:
# Collect the results
toots_as_json = []
toots_as_text = []

# Iterate over everything collected in Section 3 (toots only holds the last response)
for toot in results:
    # TODO: You might want to apply some filtering based on metadata here
    
    # Option 1: only keep selected attributes
    # Collapse whitespace so that each toot stays on a single line of the file
    text = " ".join(html2text.html2text(toot["content"]).split())
    keep = "\t".join([str(toot["created_at"]), str(toot["id"]), str(toot["account"]["id"]), text])
    toots_as_text.append(keep)
    
    # Option 2: keep everything and process later
    toots_as_json.append(toot)
    
# Write them to a file (the target directory must already exist)
tsv_file = "../results/mastodon_search_results/results_veganism.tsv"
json_file = "../results/mastodon_search_results/results_veganism.json"

with open(tsv_file, 'w', encoding="utf-8") as outfile:
    # One column name per attribute written above
    tsv_header = "Created at\tToot id\tUser id\tText\n"
    outfile.write(tsv_header)
    outfile.write("\n".join(toots_as_text))

with open(json_file, 'w', encoding="utf-8") as outfile:
    json.dump(toots_as_json, outfile)
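
Option 2 defers the choice of attributes, so a later processing step has to read the JSON dump back in and select what it needs. The following sketch shows one way this could look; the function names, the default keys, and the demonstration toot are assumptions for illustration, and the demonstration writes to a temporary file rather than to the result files above.

```python
import json
import os
import tempfile


def load_toots(path):
    """Read a JSON dump of toots back into a list of dictionaries."""
    with open(path, encoding="utf-8") as infile:
        return json.load(infile)


def select_attributes(toots, keys=("id", "created_at", "language")):
    """Keep only the given top-level keys; missing keys become None."""
    return [{key: toot.get(key) for key in keys} for toot in toots]


# Demonstration with a made-up toot written to a temporary file
demo_toots = [
    {
        "id": "1",
        "created_at": "2023-08-27T12:20:46.725Z",
        "language": "nl",
        "content": "<p>demo</p>",
    }
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.json")
    with open(demo_path, "w", encoding="utf-8") as outfile:
        json.dump(demo_toots, outfile)
    reloaded = load_toots(demo_path)

selected = select_attributes(reloaded)
print(selected)
```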
    