# `bs4` Further Topics

In [1]:
import requests
import bs4

## Objective

1. Visit the page at this URL in your browser - https://www.discogs.com/lists/277616
2. Using Python, download the HTML document that corresponds to this page.
3. Write some Python code using `bs4` to
    - Return the titles of all the albums listed on this page
    - Return a list of links to the album cover images for each of the albums listed on this page (to do this, you need to get the `src` attribute of the relevant `<img>` elements)
    - Return the link to the next page that contains the next 25 results
4. Use Python to download the next page of the list, and repeat, building up a list of records (stored as `dicts`, if you like) for each album in the list


We now move to using Python to request remote resources (e.g. HTML documents hosted by Web servers) so that we can parse them using `bs4` for the purposes of Web scraping. To do this, we will use the third-party `requests` library, because it provides a relatively straightforward interface. We pass the URI of the remote resource as an argument to `requests.get`, and it returns an HTTP response object.

In [2]:
resp = requests.get("https://www.discogs.com/lists/277616")

When I run this code on my local machine, I receive an HTTP response with status code 403. HTTP specifies a list of numeric status codes; you are probably familiar with 404 (Not Found), which indicates that the server could not find the requested resource. Any code of the form 4xx signals a client error. 403 (Forbidden) indicates that the server understood the request but is not willing to fulfil it as it was made by the client.
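
To make the status-code families concrete, here is a small sketch that uses the standard library's `http.HTTPStatus` enum to name a code and classify it by its first digit. The helper name `describe_status` is hypothetical; it is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a string such as '403 Forbidden (client error)'."""
    # HTTPStatus(code) raises ValueError for codes it does not know about
    phrase = HTTPStatus(code).phrase
    # The first digit of the code identifies its family
    families = {2: 'success', 3: 'redirection', 4: 'client error', 5: 'server error'}
    family = families.get(code // 100, 'informational')
    return '{} {} ({})'.format(code, phrase, family)

for code in (200, 403, 404):
    print(describe_status(code))
```

In the notebook, `describe_status(resp.status_code)` would give a readable summary of the response received above.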

Because HTTP/1.1 (the version that `requests` speaks) is a plain-text protocol, we can inspect the request that was sent when we called `requests.get`. `requests` keeps it on the response object as `resp.request`:

In [3]:
def reveal_HTTP(request):
    # https://stackoverflow.com/a/23816211
    print('{}\r\n{}\r\n\r\n{}'.format(
        request.method + ' ' + request.url,
        '\r\n'.join('{}: {}'.format(k, v) for k, v in request.headers.items()),
        request.body,
    ))

reveal_HTTP(resp.request)

GET https://www.discogs.com/lists/277616
User-Agent: python-requests/2.28.1
Accept-Encoding: gzip, deflate, br
Accept: */*
Connection: keep-alive

None


When scraping in any context, it is important to request resources from remote servers "politely": that is, to follow a set of principles that makes it straightforward for both you and the remote server's administrator to identify unintended behaviour on the part of your scraping code. The aspects of politeness we will implement today include:

1. Identify your scraper (and, optionally, yourself) in the User-Agent string, which is part of the "header" of the HTTP request
2. Apply rate limiting to limit the number (and/or speed) of HTTP requests you make of a remote server
3. Respect the HTTP status code returned by the remote server, and adapt the behaviour of your code accordingly
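
As a rough illustration of the second principle, the sketch below enforces a minimum gap between successive requests. The class name `RateLimiter` and its `clock` and `sleep` parameters are assumptions made for this example (they allow the logic to be checked without really waiting); they are not part of `requests`.

```python
import time

class RateLimiter:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # returns the current time in seconds
        self._sleep = sleep   # blocks for the given number of seconds
        self._last = None     # time at which the previous call was released

    def wait(self):
        now = self._clock()
        if self._last is not None:
            # Sleep only for whatever part of the interval has not yet elapsed
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraper would create one limiter, e.g. `limiter = RateLimiter(1.0)`, and call `limiter.wait()` immediately before each `requests.get`. Unlike a fixed `time.sleep(1)`, this only pauses for the time still owed, so slow responses do not add to the delay.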

Looking at the User-Agent string provided with the request that failed, we might try to provide more detailed information about our activity and try again.

In [4]:
headers = {
    'User-Agent': 'Research Scraper v0.1 - contact <youremail@example.com>',
}

resp = requests.get("https://www.discogs.com/lists/277616", headers=headers)

Interestingly enough, providing this information caused the server to behave differently, since we now see:

1. That the HTTP response code has changed from 403 to 200 (meaning a document has been successfully retrieved from the remote server)
2. That the response from the server contains HTML, which we are now ready to parse with the help of `bs4`. (We look at just the first 1000 bytes.)

In [5]:
resp

<Response [200]>

In [6]:
resp.content[:1000]

b'\n\n\n<!DOCTYPE html>\n<html\n    class="is_not_mobile needs_reduced_ui "\n    lang="en"\n    xmlns:og="http://opengraphprotocol.org/schema/"\n    xmlns:fb="http://www.facebook.com/2008/fbml"\n>\n    <head>\n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n        <meta http-equiv="content-language" content="en">\n        <meta http-equiv="pragma" content="no-cache" />\n        <meta http-equiv="expires" content="-1" />\n\n        <!-- OT will rewrite convert these to javascript and update our consent module accordingly -->\n        <script type="text/plain" class="optanon-category-C0002">\n            window.consent.resolveGroup(window.consent.PERFORMANCE_GROUP)\n        </script>\n        <script type="text/plain" class="optanon-category-C0003">\n            window.consent.resolveGroup(window.consent.FUNCTIONALITY_GROUP)\n        </script>\n        <script type="text/plain" class="optanon-category-C0004">\n            window.consent.res

So, we can assign the response body (a `bytes` object, which `bs4` accepts directly) to a useful name (`html_doc`) and start manipulating it with the help of `bs4`, as in the last set of examples.

In [7]:
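html_doc = resp.content
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = bs4.BeautifulSoup(html_doc, 'html.parser')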

Let's break down the task into its constituent parts.

In order to complete this task, you are going to need to inspect or otherwise get acquainted with the structure of the web page that you want to retrieve information from. The solution will be hidden in the web presentation of the notebook materials; I recommend that you attempt this task on your own before consulting the solutions. As I have said before, there are multiple ways of achieving the same result in `bs4`, so your solution need not be identical.

### Return the titles of all the albums listed on this page

**Hint**: Consult the documentation for the HTML element called `<ol>` (= **o**rdered **l**ist): https://developer.mozilla.org/en-US/docs/Web/HTML/Element/ol.

In [8]:
ol_albums = soup.find('ol', id='listitems')
li_albums = ol_albums.find_all('li')
album_titles = [li.find('a').get_text() for li in li_albums]
album_titles

['Maroon 5 - Songs About Jane',
 'Hawkwind - Space Ritual',
 'Cave Gaze Wagon - Wonderful Wagon World',
 'The Pax Cecilia - Nouveau (A Theatre Of The Air)',
 'Various - Mythical Tapes Vol. I',
 'Yes - Yessongs',
 "Marc Hamilton - Disque D'Or",
 'Marc Hamilton - Viens',
 'Татьяна Гринденко*, Юрий Смирнов* - Ragtimes = Рэг-Таймы',
 'バニラ・ファッジ* = The Vanilla Fudge* - キープ・ミー・ハンギング・オン = You Keep Me Hanging On',
 'The Lovethugs - Playground Instructors',
 "Various - Blanck Mass Presents The Strange Colour Of Your Body's Tears Re-Score",
 'Jing Chi, Vinnie Colaiuta, Robben Ford, Jimmy Haslip - Jing Chi',
 'Rock Shop - Mr. Lee\'s "Swing\'n Affair" Presents',
 'Blue Mink - Melting Pot',
 'Ars Nova (3) - Pavan For My Lady',
 'Blues Pills - Blues Pills',
 'The Beatles - Lady Madonna',
 "Sleep - Reunion At All Tomorrow's Parties",
 'Reel People Feat. Darien - Sure',
 'Grim Fandango - In The Rough',
 'Paul Mauriat And His Orchestra - Let The Sunshine In / Midnight Cowboy / And Other Goodies',
 'The 


### Return a list of links to the album cover images for each of the albums listed on this page 

NB: To do this, you need to get the `src` attribute of the relevant `<img>` elements.


In [9]:
ol_albums = soup.find('ol', id='listitems')
li_albums = ol_albums.find_all('li')
cover_image_links = [li.find('img')['src'] for li in li_albums]
cover_image_links

['https://i.discogs.com/9S5gwiP7Jr5Emfdrl8ERBHxUAc4p-mPHU8GMbGbGees/rs:fill/g:sm/q:40/h:300/w:300/czM6Ly9kaXNjb2dz/LWRhdGFiYXNlLWlt/YWdlcy9SLTQzNTEz/OC0xMTEzMDM2MzE1/LmpwZw.jpeg',
 'https://i.discogs.com/wMzYu3lEPnNPB8Ord6SyhoA6eZSvgu8ZQMSxRxO34Dw/rs:fill/g:sm/q:40/h:300/w:300/czM6Ly9kaXNjb2dz/LWRhdGFiYXNlLWlt/YWdlcy9SLTY0MDQw/Ni0xNjA4MTM0NTcx/LTE0NzkuanBlZw.jpeg',
 'https://i.discogs.com/wyMbzqKbQSMjcdkWowSrSRX7Pj_m5phhSxw5ZFe3oE4/rs:fill/g:sm/q:40/h:300/w:300/czM6Ly9kaXNjb2dz/LWRhdGFiYXNlLWlt/YWdlcy9SLTgxNTk3/OTctMTQ1NjI1MjI2/Ni00MDg1LmpwZWc.jpeg',
 'https://i.discogs.com/tfzq-AQIFomVcMDxegx4a8pVCgjn9_2CdDROpDALo-M/rs:fill/g:sm/q:40/h:300/w:300/czM6Ly9kaXNjb2dz/LWRhdGFiYXNlLWlt/YWdlcy9SLTIxMzIy/OTctMTM5MTI5ODMw/Ni03ODg4LmpwZWc.jpeg',
 'https://i.discogs.com/WU-cIgUI_-3-1vFxM06SgUO4oykhn_i8XgV39Ss0A8Y/rs:fill/g:sm/q:40/h:300/w:300/czM6Ly9kaXNjb2dz/LWRhdGFiYXNlLWlt/YWdlcy9SLTQ4ODA1/MTUtMTM3ODc0NDI0/Mi0zNzg1LmpwZWc.jpeg',
 'https://i.discogs.com/Ifdi30FMqrHGtfnE2CkzYqONmmN0YdRPgcNND2KP6

### Return the link to the next page that contains the next 25 results

In [10]:
nav = soup.find('nav', attrs={'aria-label':'Pagination'})
link = nav.find('a', class_='pagination_next')
link_destination = link.attrs['href']

# The link returned was a relative link, so to make it usable, we need to 
# construct an absolute link using our knowledge of the page that we are currently on
absolute_link = 'https://www.discogs.com' + link_destination
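
String concatenation works here because the link is relative to the site root. A slightly more general option, sketched below, is `urljoin` from the standard library's `urllib.parse`, which resolves a link against the URL of the current page and leaves links that are already absolute untouched. The value of `relative_link` is a made-up example, not the actual `href` returned by Discogs.

```python
from urllib.parse import urljoin

current_page = 'https://www.discogs.com/lists/277616'

# Hypothetical href values, for illustration only
relative_link = '/lists/277616?page=2'
already_absolute = 'https://www.discogs.com/lists/277616?page=3'

# A root-relative link is resolved against the scheme and host of the current page
print(urljoin(current_page, relative_link))

# A link that is already absolute is returned unchanged
print(urljoin(current_page, already_absolute))
```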

## Expanding the scope of the scraping activity

First, let's combine everything we've got so far into a helpful function:

In [11]:
def scrape_album_info(html_doc):
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = bs4.BeautifulSoup(html_doc, 'html.parser')

    # Each album is an <li> inside the <ol id="listitems"> element
    ol_albums = soup.find('ol', id='listitems')
    li_albums = ol_albums.find_all('li')

    album_titles = [li.find('a').get_text() for li in li_albums]
    cover_image_links = [li.find('img')['src'] for li in li_albums]

    # If there is no next-page link (e.g. on the last page), report None instead of raising
    nav = soup.find('nav', attrs={'aria-label': 'Pagination'})
    link = nav.find('a', class_='pagination_next') if nav is not None else None
    absolute_next_link = ('https://www.discogs.com' + link['href']) if link is not None else None

    return {
        'album_titles': album_titles,
        'cover_image_links': cover_image_links,
        'absolute_next_link': absolute_next_link,
    }

Double-check it works as expected:

In [12]:
album_info = scrape_album_info(html_doc)
album_info

{'album_titles': ['Maroon 5 - Songs About Jane',
  'Hawkwind - Space Ritual',
  'Cave Gaze Wagon - Wonderful Wagon World',
  'The Pax Cecilia - Nouveau (A Theatre Of The Air)',
  'Various - Mythical Tapes Vol. I',
  'Yes - Yessongs',
  "Marc Hamilton - Disque D'Or",
  'Marc Hamilton - Viens',
  'Татьяна Гринденко*, Юрий Смирнов* - Ragtimes = Рэг-Таймы',
  'バニラ・ファッジ* = The Vanilla Fudge* - キープ・ミー・ハンギング・オン = You Keep Me Hanging On',
  'The Lovethugs - Playground Instructors',
  "Various - Blanck Mass Presents The Strange Colour Of Your Body's Tears Re-Score",
  'Jing Chi, Vinnie Colaiuta, Robben Ford, Jimmy Haslip - Jing Chi',
  'Rock Shop - Mr. Lee\'s "Swing\'n Affair" Presents',
  'Blue Mink - Melting Pot',
  'Ars Nova (3) - Pavan For My Lady',
  'Blues Pills - Blues Pills',
  'The Beatles - Lady Madonna',
  "Sleep - Reunion At All Tomorrow's Parties",
  'Reel People Feat. Darien - Sure',
  'Grim Fandango - In The Rough',
  'Paul Mauriat And His Orchestra - Let The Sunshine In / Midnig

Identifying the link to the next page of results pays off when we pass that URI back to `requests` and parse the next 25 results. Don't forget to politely add our User-Agent string.

In [13]:
next_resp = requests.get(album_info['absolute_next_link'], headers=headers)

Sure enough, we have successfully retrieved the next 25 items in the list!

In [14]:
next_album_info = scrape_album_info(next_resp.content)
next_album_info

{'album_titles': ['Silver Rocket - Tesla',
  'The Moody Blues - Live At The Isle Of Wight Festival',
  'Lili Boniche - Trésors De La Chanson Judéo-Arabe',
  'Daddy Longlegs - Shifting Sands',
  'Fruupp - Seven Secrets',
  'Tony Mottola - A Latin Love-In',
  'Diana Ross & The Supremes - Let The Sunshine In',
  'Various - Sol Y Sombra. La Primera Alta Comedia Musical De Siesta',
  'Γιάννης Γλέζος - Η Ελένη Του Μάη Και Άλλα Τραγούδια Του Γιάννη Γλέζου',
  "Various - Pop Power 60's et 70's Volume 2",
  'Gypsy (15) - Gypsy',
  'Kula Shaker - Strangefolk',
  'Claudio Villa - Antologia Della Canzone Italiana Vol. 3 1921-1928',
  'Mandala (22) - Mandala',
  "Black Trip - Goin' Under",
  'Cream (2) - Reunion 3rd Night',
  'Rocco Careri & Arturo Macchiavelli Feat. Eric King - Hot Butterfly (Remixed by Cool Million & Rob Hardt)',
  'The Goastt* - Midnight Sun',
  "Alessandro Galluzzi, G. B. Martelli* - Tu Che M'hai Preso Il Cuore / Fox Della Luna",
  'Sergio Pérez (2) - Desfile De Exitos',
  'Hon

Of course, the purpose of working with Python here is to automate the work of paginating through all these HTML documents, so the final step is to follow the chain of next-page links automatically. In principle we continue until either (a) the total number of expected items has been processed or (b) there is no valid response from the server. To keep the demonstration short, the code below also caps the run at five follow-up requests, and it waits one second between requests:

In [15]:
seed_uri = 'https://www.discogs.com/lists/277616'

In [16]:
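import time

def run_scrape(seed_uri):
    data = []     # one dict of results per page
    to_get = []   # queue of URIs still to fetch
    count = 0     # number of follow-up requests made so far

    seed_resp = requests.get(seed_uri, headers=headers)
    seed_album_info = scrape_album_info(seed_resp.content)
    data.append(seed_album_info)
    if seed_album_info['absolute_next_link']:
        to_get.append(seed_album_info['absolute_next_link'])

    # Follow at most 5 next-page links, pausing for a second before each request
    while len(to_get) > 0 and count < 5:
        time.sleep(1)
        resp = requests.get(to_get.pop(0), headers=headers)
        count += 1

        # Respect the status code: stop if the server does not answer 200 OK
        if resp.status_code != 200:
            break

        album_info = scrape_album_info(resp.content)
        data.append(album_info)
        if album_info['absolute_next_link']:
            to_get.append(album_info['absolute_next_link'])

    return data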
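
Because this loop talks to a live server, it is awkward to check its stopping behaviour. One possible approach, sketched below, is to pass the fetching and parsing steps in as functions so the loop can be exercised against fake pages. The name `follow_pages` and its parameters are hypothetical, and the sketch assumes that the parse function reports a missing next link as `None` and that the fetch function returns `None` on failure.

```python
import time

def follow_pages(seed_uri, fetch, parse, max_pages=5, delay=1.0, sleep=time.sleep):
    """Follow next-page links from seed_uri and return one parsed dict per page.

    fetch(uri) returns the page body, or None if the request failed.
    parse(body) returns a dict with an 'absolute_next_link' key (None when absent).
    """
    data = []
    uri = seed_uri
    while uri is not None and len(data) < max_pages:
        if data:
            sleep(delay)   # pause before every request except the first
        body = fetch(uri)
        if body is None:
            break          # no valid response from the server, so stop
        info = parse(body)
        data.append(info)
        uri = info.get('absolute_next_link')
    return data

# Fake three-page "site": each URI maps directly to its parsed result
fake_site = {
    'page1': {'n': 1, 'absolute_next_link': 'page2'},
    'page2': {'n': 2, 'absolute_next_link': 'page3'},
    'page3': {'n': 3, 'absolute_next_link': None},
}
pauses = []
pages = follow_pages('page1', fetch=lambda uri: uri, parse=fake_site.get, sleep=pauses.append)
print(len(pages), pauses)
```

With real pages, `fetch` could wrap `requests.get` (returning `resp.content` only for a 200 response) and `parse` could be `scrape_album_info`.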


In [17]:
scraped_data = run_scrape(seed_uri=seed_uri)
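
The objective asked for a list of records, one `dict` per album, whereas `run_scrape` returns one `dict` per page holding parallel lists. A minimal sketch of the conversion follows. The function name `to_records` and the record keys are assumptions, and the sample data is hand-made in the same shape as the scraper's output, with placeholder image URLs.

```python
def to_records(pages):
    """Flatten per-page results into one dict per album."""
    records = []
    for page_number, page in enumerate(pages, start=1):
        # zip pairs each title with its cover image; it stops at the shorter list
        pairs = zip(page['album_titles'], page['cover_image_links'])
        for title, cover in pairs:
            records.append({
                'title': title,
                'cover_image_link': cover,
                'page': page_number,
            })
    return records

# Hand-made sample; the image URLs are placeholders, not real Discogs links
sample_pages = [
    {'album_titles': ['Hawkwind - Space Ritual', 'Yes - Yessongs'],
     'cover_image_links': ['https://example.com/1.jpeg', 'https://example.com/2.jpeg'],
     'absolute_next_link': 'https://example.com/next'},
    {'album_titles': ['Fruupp - Seven Secrets'],
     'cover_image_links': ['https://example.com/3.jpeg'],
     'absolute_next_link': None},
]
records = to_records(sample_pages)
print(len(records))
```

Calling `to_records(scraped_data)` on the real results would give one record per album across all the pages that were fetched.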