## Scraping non-tabular, multipage sites
Scrape the top 500 <a href="https://bestsellingalbums.org/decade/2010">best-selling albums of the 2010s</a>. Your data must include the following data points:

- Name of album
- Name of artist
- Number of albums sold 
- The link to the page that breaks down sales by country (found by clicking the album title)



In [1]:
## create cells as needed

import pandas as pd
import requests
from bs4 import BeautifulSoup
from random import randrange
import time

### Scraping pages

In [2]:
# top 500 best-selling albums
# there are 50 items per page

base_url = "https://bestsellingalbums.org/decade/2010-" # includes a hyphen at the end 
last_page = 10 # total number of pages we have to scrape

errors_list = []
albums_list = []
artists_list = []
sales_list = []
links_list = []

def page_scraper(url):
    """Fetch one results page and append its rows to the four module-level lists."""
    response = requests.get(url)
    if response.status_code != 200:
        # raise so the calling loop can record which page failed
        raise RuntimeError(f"Failed to scrape {url} (HTTP {response.status_code})")
    soup = BeautifulSoup(response.text, "html.parser")

    # parse every field into a local list first, so a parsing error on this page
    # cannot leave the module-level lists with different lengths
    albums = [album.get_text() for album in soup.find_all("div", class_="album")]
    artists = [artist.get_text() for artist in soup.find_all("div", class_="artist")]
    # "Sales: 30,000,000" -> 30000000
    sales = [int(sale.get_text().replace("Sales: ", "").replace(",", "")) for sale in soup.find_all("div", class_="sales")]
    # link to the per-country sales breakdown
    links = [a["href"] for a in soup.select("div.album > a[href]")]

    # list.extend() returns None, so extend the lists and return the local copies
    albums_list.extend(albums)
    artists_list.extend(artists)
    sales_list.extend(sales)
    links_list.extend(links)
    return albums, artists, sales, links

for page_num in range(1, last_page + 1): # +1 so the last page is included
    print(f"Scraping page {page_num} of {last_page}...")
    # "-0" and "-1" lead to an error, so page 1 uses the base URL as-is; later pages append their number
    url = base_url if page_num == 1 else f"{base_url}{page_num}"
    try:
        page_scraper(url)
    except Exception as e:
        errors_list.append((page_num, e)) # append() takes a single argument, so store a tuple
        print(f"Something went wrong in page {page_num} due to {e}.")
    finally:
        snoozer = randrange(5, 12)
        print(f"Snoozing for {snoozer} seconds before moving on to the next page.")
        time.sleep(snoozer)

print("Scraping is done!")

Scraping page 1 of 10...
Snoozing for 10 seconds before moving on to the next page.
Scraping page 2 of 10...
Snoozing for 8 seconds before moving on to the next page.
Scraping page 3 of 10...
Snoozing for 6 seconds before moving on to the next page.
Scraping page 4 of 10...
Snoozing for 8 seconds before moving on to the next page.
Scraping page 5 of 10...
Snoozing for 11 seconds before moving on to the next page.
Scraping page 6 of 10...
Snoozing for 7 seconds before moving on to the next page.
Scraping page 7 of 10...
Snoozing for 9 seconds before moving on to the next page.
Scraping page 8 of 10...
Snoozing for 6 seconds before moving on to the next page.
Scraping page 9 of 10...
Snoozing for 7 seconds before moving on to the next page.
Scraping page 10 of 10...
Snoozing for 5 seconds before moving on to the next page.
Scraping is done!
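
The two fiddly parts of the loop above, the page URL pattern and the clean-up of the sales string, can be checked without touching the network. The following is a minimal sketch, not part of the original notebook: `build_page_url` and `parse_sales` are hypothetical helper names, and the sample string `"Sales: 30,000,000"` is an assumption inferred from the `replace` calls in the scraper and the first row of the final table.

```python
def build_page_url(base_url: str, page_num: int) -> str:
    """Page 1 is the base URL as-is; later pages append their number."""
    if page_num < 1:
        raise ValueError("page numbers start at 1")
    return base_url if page_num == 1 else f"{base_url}{page_num}"


def parse_sales(text: str) -> int:
    """Strip the 'Sales: ' label and the thousands separators, then convert."""
    return int(text.replace("Sales: ", "").replace(",", ""))


base_url = "https://bestsellingalbums.org/decade/2010-"  # same value as in the notebook
urls = [build_page_url(base_url, n) for n in range(1, 11)]

print(urls[0])   # the base URL, unchanged
print(urls[1])   # the base URL followed by 2
print(parse_sales("Sales: 30,000,000"))  # assumed sample string
```

Keeping these as small pure functions means a change in the site's URL scheme or label text only has to be fixed in one place.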


### Saving to df

In [3]:
df_2010_albums = list(zip(albums_list, artists_list, sales_list, links_list))
main_df = pd.DataFrame(df_2010_albums, columns=["Album", "Artist", "Total Sales", "Link to Country Sales Breakdown"])
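
One thing worth checking before the `zip` call above: `zip` stops at the shortest input, so if one selector missed an element on some page, rows would be dropped or misaligned without any error. A minimal sketch of a length check follows; `check_equal_lengths` is a hypothetical helper, and the short lists are stand-ins for `albums_list`, `artists_list`, `sales_list` and `links_list`.

```python
def check_equal_lengths(**columns):
    """Return the shared length, or raise ValueError if the lists differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Stand-in data; in the notebook, pass the four scraped lists instead.
albums = ["21", "25"]
artists = ["ADELE", "ADELE"]
sales = [30000000, 23000000]

n_rows = check_equal_lengths(albums=albums, artists=artists, sales=sales)
print(f"All columns have {n_rows} rows.")
```

If the check raises, the printed lengths show which column came up short, which is usually a hint about which selector to inspect.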

In [4]:
main_df

| | Album | Artist | Total Sales | Link to Country Sales Breakdown |
|---|---|---|---|---|
| 0 | 21 | ADELE | 30000000 | https://bestsellingalbums.org/album/1034 |
| 1 | 25 | ADELE | 23000000 | https://bestsellingalbums.org/album/1035 |
| 2 | CHRISTMAS | MICHAEL BUBLÉ | 15000000 | https://bestsellingalbums.org/album/30524 |
| 3 | 1989 | TAYLOR SWIFT | 14748116 | https://bestsellingalbums.org/album/45488 |
| 4 | PURPOSE | JUSTIN BIEBER | 14000000 | https://bestsellingalbums.org/album/23318 |
| ... | ... | ... | ... | ... |
| 495 | UNDER PRESSURE | LOGIC | 1060000 | https://bestsellingalbums.org/album/27268 |
| 496 | THE STRANGE CASE OF | HALESTORM | 1060000 | https://bestsellingalbums.org/album/17960 |
| 497 | UNCAGED | ZAC BROWN BAND | 1055000 | https://bestsellingalbums.org/album/56701 |
| 498 | FUTURE | FUTURE | 1050371 | https://bestsellingalbums.org/album/16036 |
