# Scrape the archive of Radiolab podcasts from WNYC
### RL_archive_bs4.ipynb

I've been a fan of Radiolab ever since I was a college freshman, when I was assigned to listen to ["Musical Language."](https://www.wnycstudios.org/podcasts/radiolab/episodes/91512-musical-language) It was my first introduction to the medium, and the thoughtfulness of this episode's composition really left a mark on me.

Radiolab has evolved a great deal over time, shifting from science and technology toward culture and politics. Moreover, older episodes keep getting trimmed from podcast apps, which makes it hard to listen to my favorite old episodes.

I decided to build a quick scraper for the WNYC Studios website to download the complete archive of Radiolab episodes, which I can then listen to via my Plex Pass app.

In [6]:
import pandas as pd
from bs4 import BeautifulSoup
from wget import download
from urllib.request import urlopen
import re
from time import sleep

## Let's get a working example using Beautiful Soup

In [40]:
library_path = '/Volumes/Elements/NVIDIA_SHIELD/PlexLibrary/Podcasts/radiolab/'

stump_url = 'https://www.wnycstudios.org'

base_url = 'https://www.wnycstudios.org/shows/radiolab/podcasts'
pages = range(1,100) # the archive is paginated; pages 1 through 99 should be enough to cover the history

In [11]:
html = urlopen(base_url).read()

In [13]:
bs = BeautifulSoup(html, 'html.parser')  # name the parser explicitly; bs4 warns when it has to guess

In [16]:
articles = bs.find_all('article')

In [21]:
url = articles[0].find('a')['href']

In [30]:
podcast_url = stump_url + url
podcast_html = BeautifulSoup(urlopen(podcast_url).read(), 'html.parser')

In [37]:
podcast_download_link = podcast_html.find('a', attrs={'class': 'download-link'})['href']

In [41]:
download(podcast_download_link, library_path)

'/Volumes/Elements/NVIDIA_SHIELD/PlexLibrary/Podcasts/radiolab//radiolab_podcast20graham.mp3'
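Since the download returns the path of the saved file, one optional refinement is to check for that file before downloading again, so an interrupted run can be restarted without refetching everything. The sketch below is not part of the original notebook. The helper names `expected_path` and `already_downloaded` and the example URL are hypothetical, and it assumes the saved file name equals the last path segment of the download URL, which may not hold if the downloader picks the name some other way.

```python
import os
import tempfile
from urllib.parse import unquote, urlparse

def expected_path(download_url, library_dir):
    """Guess where a download would land: library_dir plus the URL's file name."""
    # Assumption: the file name is the last path segment of the URL.
    file_name = os.path.basename(unquote(urlparse(download_url).path))
    return os.path.join(library_dir, file_name)

def already_downloaded(download_url, library_dir):
    """True if a file with the guessed name already exists in library_dir."""
    return os.path.isfile(expected_path(download_url, library_dir))

# Demo in a throwaway directory, using a hypothetical URL.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_url = 'https://audio.example.com/radiolab/episode.mp3'
    print(already_downloaded(demo_url, demo_dir))  # nothing saved yet
    open(expected_path(demo_url, demo_dir), 'w').close()
    print(already_downloaded(demo_url, demo_dir))  # placeholder file now exists
```

A check like `if not already_downloaded(link, library_path):` in front of the `download` call would then skip episodes that are already in the library.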

## Seems to work. Time to generalize and scale to the entire archive

In [118]:
def download_podcast(podcast_url):
    """Download the episode audio linked from an episode page, if there is one."""
    podcast_html = BeautifulSoup(urlopen(podcast_url).read(), 'html.parser')
    podcast_download = podcast_html.find('a', attrs={'class': 'download-link'})
    if podcast_download is None:  # not every article is a downloadable episode
        return
    href = podcast_download.get('href')
    if href is None:  # some pages erroneously omit the download link
        return
    # The href can be a redirect URL with the original file URL embedded at the end.
    # In rare cases the redirect does not work but the original does, so keep only
    # the last 'http...' segment.
    podcast_download_link = ['http' + x for x in re.split('http', href)][-1]
    download(podcast_download_link, library_path)
    
def get_podcast_urls(page_no):
    """Return the episode page URLs listed on one page of the archive."""
    url = base_url + '/' + str(page_no)
    bs = BeautifulSoup(urlopen(url).read(), 'html.parser')
    articles = bs.find_all('article')
    urls = [stump_url + article.find('a')['href'] for article in articles]
    return urls
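
The `re.split('http', href)` line in `download_podcast` is the least obvious step, so here is a small standalone illustration of what it does. The function name `last_embedded_url` and both example URLs are made up for the demo; the real links on the site may look different.

```python
import re

def last_embedded_url(href):
    """Return the last 'http...' segment of href, as the list comprehension does."""
    return ['http' + part for part in re.split('http', href)][-1]

# Hypothetical redirect-style link: a tracking prefix wrapping the real file URL.
wrapped = 'https://redirect.example.com/track/https://audio.example.com/radiolab/episode.mp3'
# Hypothetical plain link with no redirect prefix.
plain = 'https://audio.example.com/radiolab/episode.mp3'

print(last_embedded_url(wrapped))
print(last_embedded_url(plain))
```

Both calls print `https://audio.example.com/radiolab/episode.mp3`: a wrapped link is reduced to the URL at its end, and a plain link passes through unchanged. One caveat is that an href containing no `http` at all (a relative path, say) would come back with `http` glued to the front, so this trick assumes absolute URLs.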


In [132]:
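pages = range(1,100)
for page_no in pages:
    podcast_urls = get_podcast_urls(page_no)
    if len(podcast_urls) > 0:
        for p in podcast_urls:
            sleep(15)  # pause between episodes to go easy on the server
            print(p)
            download_podcast(p)
        sleep(60*3)  # pause three minutes between archive pages
    else:
        # page_no is the first page that lists no episodes
        print('No more pages after ' + str(page_no))
        break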

https://www.wnycstudios.org/podcasts/radiolab/episodes/91508-morality
https://www.wnycstudios.org/podcasts/radiolab/episodes/91504-beyond-time
https://www.wnycstudios.org/podcasts/radiolab/episodes/91562-mortality
https://www.wnycstudios.org/podcasts/radiolab/episodes/91569-memory-and-forgetting
https://www.wnycstudios.org/podcasts/radiolab/episodes/91552-zoos
https://www.wnycstudios.org/podcasts/radiolab/episodes/91584-time
https://www.wnycstudios.org/podcasts/radiolab/episodes/91528-sleep
https://www.wnycstudios.org/podcasts/radiolab/episodes/91539-placebo
https://www.wnycstudios.org/podcasts/radiolab/episodes/91496-who-am-i
https://www.wnycstudios.org/podcasts/radiolab/episodes/91580-stress
https://www.wnycstudios.org/podcasts/radiolab/episodes/91524-where-am-i
No more pages after 44
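
The stop-at-the-first-empty-page logic can be tried out without touching the network by passing the page fetcher in as an argument. This is only a sketch of the same loop shape, not the code that produced the output above. `crawl_archive`, its parameters, and the fake page data are all hypothetical names introduced for the demo.

```python
from time import sleep

def crawl_archive(get_urls, handle, max_pages=99, item_delay=0, page_delay=0):
    """Visit pages 1..max_pages and stop at the first page with no URLs.

    Returns the number of the last page that had content (0 if none did).
    """
    last_page = 0
    for page_no in range(1, max_pages + 1):
        urls = get_urls(page_no)
        if not urls:
            break  # an empty page means the archive has run out
        for url in urls:
            sleep(item_delay)  # delay before each item
            handle(url)
        sleep(page_delay)  # delay after each page
        last_page = page_no
    return last_page

# Fake two-page archive standing in for get_podcast_urls.
fake_pages = {1: ['episode-a', 'episode-b'], 2: ['episode-c']}
visited = []
last = crawl_archive(lambda n: fake_pages.get(n, []), visited.append)
print(last, visited)
```

With the real functions this would be called as `crawl_archive(get_podcast_urls, download_podcast, item_delay=15, page_delay=180)`, matching the delays used in the loop above.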
