## Lab 1.5: Scraping Mastodon

In this notebook we will scrape messages from [Mastodon](https://docs.joinmastodon.org/), an open-source alternative to Twitter (X). Messages sent over Mastodon are called *toots*. 

In [1]:
import json
import requests
import pandas as pd

## 1. Setting up 

A few settings and variables need to be defined before we can start scraping, starting with the **URL** from which we will access the data. We then set the parameters, in this case **'limit'**, which sets the maximum number of posts that can be pulled per request. 

Note that above the URL there is a commented-out line that lets you search for posts containing a given **hashtag**, such as the example 'veganism'. This requires a slightly different **URL** format, which is also commented out, directly below **hashtag**. Uncomment both lines (and comment out the default **URL**) to apply this specification to your query. 

Additional settings include a time-frame limit on the posts you collect, built with the **pandas Timestamp** and **DateOffset** classes. This caps how much data is collected. We then add the flag **is_end**, which is set to **True** once we have passed our set time-frame, at which point we stop scraping. Currently, the time-frame is limited to all entries posted within the most recent hour.

Finally, we will create an empty list **results** to store the data.

**Can we play around with the query, geolocation, language, or timestamp?**

In [4]:

#hashtag = 'veganism'  
#URL = f'https://mastodon.social/api/v1/timelines/tag/{hashtag}'

URL = 'https://mastodon.social/api/v1/timelines/public' #comment out if using hashtag alternative 
params = {    # set parameters
    'limit': 40   # max value of posts 
}


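# limit collection, currently set to only collecting toots within the most recent hour
# note: plural 'hours=1' subtracts one hour; singular 'hour=1' would set the hour field to 1
since = pd.Timestamp('now', tz='utc') - pd.DateOffset(hours=1)
is_end = False  # will be set to True once we pass the time-frame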

# create list to store results
results = []
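
The difference between plural and singular keywords in **DateOffset** is easy to miss. As a minimal sketch (the fixed timestamp is an arbitrary example value, not taken from the lab data), plural keywords such as `hours` shift a timestamp, while singular ones such as `hour` replace that field:

```python
import pandas as pd

# Arbitrary example timestamp (an assumption, for illustration only)
now = pd.Timestamp('2024-01-15 14:30:00', tz='utc')

# Plural keyword: shifts the timestamp back by one hour
one_hour_ago = now - pd.DateOffset(hours=1)

# Singular keyword: replaces the hour field with 1 instead of shifting
hour_replaced = now + pd.DateOffset(hour=1)

print(one_hour_ago)
print(hour_replaced)
```

To collect a longer window, changing the argument (for example `hours=6` or `days=1`) should be enough.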



## 2. Scraping

Once we've defined the necessary settings, we can start scraping. 

We create a loop that requests posts page by page and **break**s out of it once no more posts are returned. Within each page, we check every post against our set time-frame; posts inside it are added to our **results** list. To request the next (older) page, we pass the **id** of the last post on the current page as the **max_id** parameter.


In [6]:
while True:
    r = requests.get(URL, params=params)
    posts = json.loads(r.text)  # parse the JSON response into a list of posts

    if len(posts) == 0:  # exit the loop if we reach the end of toots
        break

    for p in posts:
        # 'created_at' is an ISO 8601 string; parse it as a UTC timestamp
        timestamp = pd.to_datetime(p['created_at'], utc=True)
        if timestamp <= since:  # check if we've reached the end of the time-frame
            is_end = True  # if so, is_end is True
            break

        results.append(p)  # add each toot to results list

    if is_end:  # if end of time-frame, exit loop
        break

    max_id = posts[-1]['id']  # id of the oldest toot on this page
    params['max_id'] = max_id  # the next request returns only older toots
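
The stopping logic of such a loop can be checked without any network access. The sketch below is an illustration under stated assumptions: `collect_posts`, `fake_fetch`, and `fake_posts` are hypothetical names, and the fake posts carry only the two keys the loop relies on.

```python
import pandas as pd

def collect_posts(fetch_page, since):
    """Collect posts newer than `since`, paging backwards with max_id."""
    results = []
    params = {'limit': 40}
    while True:
        posts = fetch_page(params)
        if len(posts) == 0:  # no more posts available
            break
        for p in posts:
            timestamp = pd.to_datetime(p['created_at'], utc=True)
            if timestamp <= since:  # left the time-frame: stop completely
                return results
            results.append(p)
        # ask for posts older than the last one seen
        params['max_id'] = posts[-1]['id']
    return results

# Hypothetical stand-in data: three fake posts, newest first
fake_posts = [
    {'id': '3', 'created_at': '2024-01-15T14:20:00.000Z'},
    {'id': '2', 'created_at': '2024-01-15T14:00:00.000Z'},
    {'id': '1', 'created_at': '2024-01-15T12:00:00.000Z'},
]

def fake_fetch(params):
    """Return up to two fake posts with an id below max_id (if given)."""
    max_id = params.get('max_id')
    older = [p for p in fake_posts
             if max_id is None or int(p['id']) < int(max_id)]
    return older[:2]

since = pd.Timestamp('2024-01-15 13:30:00', tz='utc')
collected = collect_posts(fake_fetch, since)
print([p['id'] for p in collected])
```

Passing the fetch step in as a function keeps the paging logic separate from the HTTP request, so the same loop could later be pointed at the real API.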

In [24]:
#df = pd.DataFrame(results)  # store results in pandas data frame
#print(df)
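
If you do want a data frame, note that each post is a nested dictionary, so a plain `pd.DataFrame(results)` keeps nested parts such as the account as single dictionary-valued columns. One possible alternative is `pd.json_normalize`, sketched here on hypothetical, heavily trimmed posts (`sample_results` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical, heavily trimmed posts standing in for the scraped `results`
sample_results = [
    {'id': '1', 'language': 'en', 'reblogs_count': 2,
     'account': {'acct': 'user_a@example.org', 'followers_count': 615}},
    {'id': '2', 'language': 'de', 'reblogs_count': 0,
     'account': {'acct': 'user_b@example.org', 'followers_count': 12}},
]

# json_normalize flattens nested dictionaries into dotted column names
df = pd.json_normalize(sample_results)
print(df[['id', 'language', 'account.followers_count']])
```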

## 3. Inspecting the results

With the data stored in **results** as a list of posts, each post taking the form of a nested dictionary, we can now inspect the data. After importing the **pprint** module, which allows us to "pretty-print" Python data structures, we can examine some instances of the data by extracting a given **post** from the list and printing it.

Take time to inspect the dictionary. Which keys might be of interest (apart from **'content'**), and what do they refer to? Notice the **'url'** key, which holds the link to the post in question. Note that some information might not be present in every post: certain keys may be missing or empty for some entries. 


Some keys which might be of interest: 
- 'language'
- 'favourites_count'
- 'followers_count' (nested under 'account')
- 'reblogs_count'
- 'replies_count'
- 'media_attachments' 

**How might we use this information to filter through posts of interest?**

In [46]:
#print(df)
import pprint
pp = pprint.PrettyPrinter(indent=1)

post = results[20]
pp.pprint(post)

{'account': {'acct': 'RolloTreadway@beige.party',
             'avatar': 'https://files.mastodon.social/cache/accounts/avatars/111/172/164/743/232/125/original/5c5c4d77d584f238.png',
             'avatar_static': 'https://files.mastodon.social/cache/accounts/avatars/111/172/164/743/232/125/original/5c5c4d77d584f238.png',
             'bot': False,
             'created_at': '2023-10-03T00:00:00.000Z',
             'discoverable': False,
             'display_name': 'RolloTreadway',
             'emojis': [],
             'fields': [{'name': 'Location',
                         'value': 'Tynedale, sunny Northumberland',
                         'verified_at': None},
                        {'name': 'having',
                         'value': 'a nice cup of tea and a sit down',
                         'verified_at': None}],
             'followers_count': 615,
             'following_count': 409,
             'group': False,
             'header': 'https://files.mastodon.social/cache/ac
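
Because some keys may be missing or empty, nested lookups are safer with `dict.get` and a default value. A minimal sketch, using a hypothetical trimmed post (`sample_post` is made up; real posts contain many more keys):

```python
# Hypothetical, trimmed post; real posts contain many more keys
sample_post = {
    'language': None,
    'account': {'acct': 'user_a@example.org', 'followers_count': 615},
}

# Chain .get() calls with defaults so a missing key does not raise KeyError
followers = sample_post.get('account', {}).get('followers_count', 0)
favourites = sample_post.get('favourites_count', 0)   # key absent here, so 0
language = sample_post.get('language') or 'unknown'   # key present but None

print(followers, favourites, language)
```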

## 4. Specify Queries

Using what you learned in the Python course on extracting information from nested dictionaries, write your own code to filter your data and extract more specific types of posts.

Remember, you can also adjust the time-frame of the collected posts and re-scrape the data. Additionally, using the **'tags'** key may be a better way to filter for a specific set of hashtags.
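
As one possible starting point, the sketch below filters posts by hashtag. It is an illustration only: `sample_results`, `wanted`, and `has_wanted_tag` are hypothetical names, and the layout of **'tags'** (a list of dictionaries with a 'name' key) is an assumption you should check against your own scraped data.

```python
# Hypothetical, trimmed posts; the 'tags' layout (a list of dicts with a
# 'name' key) is assumed here and should be checked against your own data
sample_results = [
    {'id': '1', 'language': 'en',
     'tags': [{'name': 'coffee'}, {'name': 'monday'}]},
    {'id': '2', 'language': 'de', 'tags': []},
    {'id': '3', 'language': 'en', 'tags': [{'name': 'tea'}]},
]

wanted = {'coffee', 'tea'}  # hashtags of interest, lowercase, without '#'

def has_wanted_tag(post, wanted):
    """Return True if any of the post's hashtags is in `wanted`."""
    names = {tag['name'].lower() for tag in post.get('tags', [])}
    return len(names & wanted) > 0

matches = [p for p in sample_results if has_wanted_tag(p, wanted)]
print([p['id'] for p in matches])
```

The same pattern extends to other keys, for example keeping only posts whose 'language' equals a given code.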