# Pinboard

Pinboard is a social bookmarking site where people share links to content and *tag* them by assigning a word that describes the content. These tags are free-form, and each user decides which ones to use.

Pinboard has a nice [API](https://pinboard.in/api/) for interacting with your own bookmarks, but it doesn't offer a way to get all the public bookmarks for a tag. Pinboard also makes all tag pages available as RSS, e.g. https://feeds.pinboard.in/rss/t:covid-19, but unfortunately the feed doesn't allow paging back in time.
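
To make the feed's limitation concrete, here is a minimal sketch of pulling link and title pairs out of a feed document with the standard library. The helper names `local_name` and `feed_items` and the sample XML are made up for illustration, and matching on local tag names is an assumption that sidesteps whichever RSS namespace the real feed declares.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit('}', 1)[-1]


def feed_items(xml_text):
    """Return a list of {'url', 'title'} dicts, one per <item> in the feed."""
    root = ET.fromstring(xml_text)
    items = []
    for el in root.iter():
        if local_name(el.tag) != 'item':
            continue
        # map each child's local tag name to its text content
        fields = {local_name(child.tag): (child.text or '').strip() for child in el}
        items.append({'url': fields.get('link', ''), 'title': fields.get('title', '')})
    return items


# hypothetical sample document, not real Pinboard output
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>example feed</title>
    <item>
      <title>First</title>
      <link>https://example.com/1</link>
    </item>
    <item>
      <title>Second</title>
      <link>https://example.com/2</link>
    </item>
  </channel>
</rss>"""

print(feed_items(SAMPLE_FEED))
```

A parser like this only ever sees the items the feed chooses to return, so it cannot reach older bookmarks.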

So we're going to have to scrape the pages. Fortunately this won't be too difficult with the [requests_html](https://requests-html.kennethreitz.org/) module, because Pinboard has done such a nice job of using [semantic HTML](https://en.wikipedia.org/wiki/Semantic_HTML).

In [31]:
import time
import requests_html
import dateutil.parser

def pinboard(hashtag):
    """Yield the public bookmarks for a tag, one page at a time."""
    http = requests_html.HTMLSession()
    pinboard_url = 'https://pinboard.in/t:{}'.format(hashtag)
    while True:
        print(pinboard_url)
        resp = http.get(pinboard_url)

        # each bookmark on the page is an element with the class "bookmark"
        bookmarks = resp.html.find('.bookmark')
        for b in bookmarks:
            a = b.find('.bookmark_title', first=True)
            yield {
                'url': a.attrs['href'],
                'title': a.text,
                'created': dateutil.parser.parse(b.find('.when', first=True).attrs['title'])
            }

        # follow the "earlier" link to the next page, if there is one
        a = resp.html.find('#top_earlier', first=True)
        if not a:
            break

        # stop if the link points back at the page we just fetched
        next_url = 'https://pinboard.in' + a.attrs['href']
        if pinboard_url == next_url:
            break

        # pause briefly between page requests
        time.sleep(1)
        pinboard_url = next_url

In [32]:
next(pinboard('covid-19'))

https://pinboard.in/t:covid-19


{'url': 'https://mutualaidmamas.com/',
 'title': 'Welcome to Mutual Aid Medford and Somerville (MAMAS) network | Mutual Aid Medford and Somerville Network (MAMAS)',
 'created': datetime.datetime(2020, 3, 23, 21, 43, 20)}
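
Because `pinboard()` is a generator, pages are only fetched as results are consumed, which is why `next()` above printed a single URL. The sketch below shows one way to cap how many results are pulled, using `itertools.islice`. The names `take` and `fake_pages` are hypothetical, and `fake_pages` is a stand-in that makes no network requests.

```python
import itertools


def take(n, iterable):
    """Return the first n items of iterable as a list, consuming no more than n."""
    return list(itertools.islice(iterable, n))


def fake_pages():
    """Stand-in for pinboard(): yields 50 items per 'page' and reports each fetch."""
    page = 0
    while True:
        print('fetching page', page)
        for i in range(50):
            yield {'url': 'https://example.com/{}'.format(page * 50 + i)}
        page += 1


first_three = take(3, fake_pages())
print(first_three)
```

With the real generator, something like `take(5, pinboard('covid-19'))` should request only the first page, since later pages are fetched only when the loop reaches them.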

Now we can write all the results to a CSV file. But let's look for a few variants: covid-19, covid_19, covid19. To avoid repeating the same URLs we can keep track of them in a set and only write each one once.

In [33]:
import csv

urls = set()
with open('data/pinboard.csv', 'w', newline='') as fh:
    out = csv.DictWriter(fh, fieldnames=['url', 'created', 'title'])
    out.writeheader()
    for hashtag in ['covid-19', 'covid_19', 'covid19']:
        for bookmark in pinboard(hashtag):
            if bookmark['url'] not in urls:
                out.writerow(bookmark)
                urls.add(bookmark['url'])

https://pinboard.in/t:covid-19
https://pinboard.in/t:covid-19/?start=50
https://pinboard.in/t:covid-19/?start=100
https://pinboard.in/t:covid-19/?start=150
https://pinboard.in/t:covid-19/?start=200
https://pinboard.in/t:covid-19/?start=250
https://pinboard.in/t:covid-19/?start=300
https://pinboard.in/t:covid-19/?start=350
https://pinboard.in/t:covid-19/?start=400
https://pinboard.in/t:covid-19/?start=450
https://pinboard.in/t:covid-19/?start=500
https://pinboard.in/t:covid-19/?start=550
https://pinboard.in/t:covid-19/?start=600
https://pinboard.in/t:covid-19/?start=650
https://pinboard.in/t:covid-19/?start=700
https://pinboard.in/t:covid-19/?start=750
https://pinboard.in/t:covid-19/?start=800
https://pinboard.in/t:covid-19/?start=850
https://pinboard.in/t:covid-19/?start=900
https://pinboard.in/t:covid-19/?start=950
https://pinboard.in/t:covid_19
https://pinboard.in/t:covid_19/?start=50
https://pinboard.in/t:covid_19/?start=100
https://pinboard.in/t:covid_19/?start=150
https://pinboard
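
The same de-duplication can be factored into a small generator so that the CSV-writing loop stays simple. This is only a sketch: `unique_by_url` is a hypothetical helper, and the records below are made up to stand in for the results of two tag searches.

```python
import itertools


def unique_by_url(bookmarks):
    """Yield each bookmark whose 'url' has not been seen before, in order."""
    seen = set()
    for bookmark in bookmarks:
        if bookmark['url'] not in seen:
            seen.add(bookmark['url'])
            yield bookmark


# made-up records standing in for the results of two tag searches
tag_a = [{'url': 'https://example.com/1'}, {'url': 'https://example.com/2'}]
tag_b = [{'url': 'https://example.com/2'}, {'url': 'https://example.com/3'}]

merged = list(unique_by_url(itertools.chain(tag_a, tag_b)))
print([b['url'] for b in merged])
```

The comparison is an exact string match, so two links to the same page that differ only in a query string (for example a sharing parameter) would still be written twice.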

In [35]:
import pandas

# prevent dataframe columns from being truncated
pandas.set_option('display.max_columns', None)
pandas.set_option('display.width', None)
pandas.set_option('display.max_colwidth', None)

df = pandas.read_csv('data/pinboard.csv')
df

,url,created,title
0,https://mutualaidmamas.com/,2020-03-23 21:43:20,Welcome to Mutual Aid Medford and Somerville (MAMAS) network | Mutual Aid Medford and Somerville Network (MAMAS)
1,https://eand.co/the-upsides-of-a-global-pandemic-4dbb00be4a03,2020-03-23 21:22:24,302 Found
2,https://kottke.org/20/03/some-people,2020-03-23 21:21:06,Some People
3,https://www.thestar.com/business/2020/03/23/construction-industry-braces-for-devastating-shutdown-as-ford-calls-for-closure-of-non-essential-workplaces.html,2020-03-23 21:19:37,Construction industry braces for ‘devastating’ shutdown as Ford calls for closure of non-essential workplaces | The Star
4,https://www.thestar.com/news/canada/2020/03/23/health-experts-warn-of-face-mask-shortage-1472-cases-of-confirmed-and-presumptive-covid-19-in-canada.html,2020-03-23 21:18:09,Toronto declares state of emergency; Ford shuts down all non-essential services starting Wednesday as Ontario reports 78 new cases; Three more deaths in B.C. | The Star
...,...,...,...
934,https://www.theguardian.com/commentisfree/2020/mar/20/boris-johnson-covid-19-prime-minister-brexit?CMP=Share_iOSApp_Other,2020-03-21 06:10:18,"When Johnson says we'll turn the tide in 12 weeks, it's just another line for the side of a bus | Marina Hyde | Opinion | The Guardian"
935,https://www.ecdc.europa.eu/en/2019-ncov-background-disease,2020-03-21 06:07:48,Disease background of COVID-19
936,https://twitter.com/i/web/status/1241194881273540609,2020-03-21 06:03:33,Twitter
937,https://twitter.com/i/web/status/1241136296816304129,2020-03-21 05:55:27,Twitter
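
As a quick check on what was collected, the `created` column can be bucketed by day. A minimal sketch using only the standard library follows. `bookmarks_per_day` is a hypothetical helper, and the inline CSV reuses three rows from the output above (with one title shortened) rather than reading `data/pinboard.csv`.

```python
import collections
import csv
import io


def bookmarks_per_day(fh):
    """Count rows per calendar day using the date part of the 'created' column."""
    counts = collections.Counter()
    for row in csv.DictReader(fh):
        day = row['created'].split(' ')[0]  # 'YYYY-MM-DD HH:MM:SS' -> 'YYYY-MM-DD'
        counts[day] += 1
    return counts


# three rows borrowed from the output above; the first title is shortened
sample = io.StringIO(
    'url,created,title\n'
    'https://mutualaidmamas.com/,2020-03-23 21:43:20,Welcome\n'
    'https://kottke.org/20/03/some-people,2020-03-23 21:21:06,Some People\n'
    'https://www.ecdc.europa.eu/en/2019-ncov-background-disease,2020-03-21 06:07:48,Disease background of COVID-19\n'
)
counts = bookmarks_per_day(sample)
for day, n in sorted(counts.items()):
    print(day, n)
```

To run it against the real file, pass `open('data/pinboard.csv', newline='')` in place of the `io.StringIO` sample.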
