## Scraping non-tabular, multipage sites
Scrape the top 500 <a href="https://bestsellingalbums.org/decade/2010">best-selling albums of the 2010's</a>. Your data must include the following datapoints:

- Name of album
- Name of artist
- Number of albums sold 
- The link to the page that breaks down sales by country (found by clicking album title)



### My Approach

- verify data on page
- find targets tags and classes using inspect elements
- scrape for a single page to verify
- iterate through all pages

In [1]:
## import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from random import randrange

In [2]:
## scrape one page of data points (then we build one that iterates through all pages)
## url to scrape
url = "https://bestsellingalbums.org/decade/2010"

In [3]:
## request site
response = requests.get(url)
response

<Response [200]>

In [4]:
## turn response into soup (navigable html from string)
soup = BeautifulSoup(response.text, "html.parser")

In [5]:
## grab ALL albums and store in variable
all_albums = soup.find_all("div", class_="album_card")
all_albums[0]

<div class="album_card"><div class="rank">1</div><div class="cover"><img class="pic" onerror="this.src='../includes/default.png';this.onerror='';" src="../covers/1034.jpg"/></div><div class="data_col"><div class="album"><a href="https://bestsellingalbums.org/album/1034">21</a></div><div class="artist"><a href="https://bestsellingalbums.org/artist/218" title="ADELE album sales">ADELE</a></div><div class="sales">Sales: 30,000,000</div><div class="rank_mobile">Rankings:</div><div class="ranks_row"><div class="ranks"><a href="https://bestsellingalbums.org/year/2011" title="Best-selling albums of 2011"><span class="ranks_desc_art">Rank in </span>2011</a> : 1</div><div class="ranks"><a href="https://bestsellingalbums.org/decade/2010" title="Best-selling albums of 2010's"><span class="ranks_desc_art">Rank in </span>2010's</a>: 1</div><div class="ranks"><a href="https://bestsellingalbums.org/overall" title="Best-selling albums of all time">Overall<span class="ranks_desc_art"> rank</span></a> :

## Shortened version:

In [6]:
## artist name
artists_list = [ artist.find("div", class_="artist").get_text() for artist in all_albums ]

## album title
albums_list = [ album.find("div", class_="album").get_text() for album in all_albums ]

## album links
albums_url_list = [ url.find("a").get("href") for url in all_albums ]

##sales
sales_list = [ int(sales.find("div", class_="sales").get_text().replace("Sales: ","").replace(",","")) for sales in all_albums ]

In [7]:
## get column names

col_names = [ columns["class"][0] for columns in all_albums[0].find_all("div")[3:6] ]
col_names

['album', 'artist', 'sales']

In [8]:
## convert to df
main_list = list(zip(artists_list, albums_list, sales_list, albums_url_list))
df = pd.DataFrame(zip(artists_list, albums_list, sales_list, albums_url_list),
                  columns = col_names + ["more_info"])
df

Unnamed: 0,album,artist,sales,more_info
0,ADELE,21,30000000,https://bestsellingalbums.org/album/1034
1,ADELE,25,23000000,https://bestsellingalbums.org/album/1035
2,MICHAEL BUBLÉ,CHRISTMAS,15000000,https://bestsellingalbums.org/album/30524
3,TAYLOR SWIFT,1989,14748116,https://bestsellingalbums.org/album/45488
4,JUSTIN BIEBER,PURPOSE,14000000,https://bestsellingalbums.org/album/23318
5,ED SHEERAN,DIVIDE,13787460,https://bestsellingalbums.org/album/12876
6,SOUNDTRACK,FROZEN,12632083,https://bestsellingalbums.org/album/42961
7,KATY PERRY,TEENAGE DREAM,12134000,https://bestsellingalbums.org/album/23977
8,ED SHEERAN,X,11879785,https://bestsellingalbums.org/album/12880
9,BRUNO MARS,DOO-WOPS & HOOLIGANS,11270000,https://bestsellingalbums.org/album/6777


In [9]:
# iterate to capture the first 500

all_dfs = [] ## hold all dfs
url = "https://bestsellingalbums.org/decade/2010" ## base url

count = 1 
while count <= 10:
    print(f"Scraping {url}")
    ## get response
    response = requests.get(url)
    print(response)
    ## turn response into soup (navigable html from string)
    soup = BeautifulSoup(response.text, "html.parser")
    print("converted to soup")
    ## grab ALL albums data and store in variable
    all_albums = soup.find_all("div", class_="album_card")
    print("got all album data")
    ## lists to hold data
    artists_list = [ artist.find("div", class_="artist").get_text() for artist in all_albums ]
    albums_list = [ album.find("div", class_="album").get_text() for album in all_albums ]
    albums_url_list = [ url.find("a").get("href") for url in all_albums ]
    sales_list = [ int(sales.find("div", class_="sales").get_text().replace("Sales: ","").replace(",","")) for sales in all_albums ]

    ## get column names
    col_names = [ columns["class"][0] for columns in all_albums[0].find_all("div")[3:6] ]
        
    ## convert to df
    main_list = list(zip(artists_list, albums_list, sales_list, albums_url_list))
    dfs = pd.DataFrame(zip(artists_list, albums_list, sales_list, albums_url_list),
                  columns = col_names + ["more_info"])
    all_dfs.append(df)
    print("dataframe in list of dataframes")

    ## increment url and set timer
    count += 1
    url = "https://bestsellingalbums.org/decade/2010"
    url = f"{url}-{count}"
    snoozer = randrange(5,12)
    print(f"snoozing for {snoozer} seconds before next scrape")
    time.sleep(snoozer)
    
print("done scraping all links")        

Scraping https://bestsellingalbums.org/decade/2010
<Response [200]>
converted to soup
got all album data
dataframe in list of dataframes
snoozing for 10 seconds before next scrape
Scraping https://bestsellingalbums.org/decade/2010-2
<Response [200]>
converted to soup
got all album data
dataframe in list of dataframes
snoozing for 7 seconds before next scrape
Scraping https://bestsellingalbums.org/decade/2010-3
<Response [200]>
converted to soup
got all album data
dataframe in list of dataframes
snoozing for 6 seconds before next scrape
Scraping https://bestsellingalbums.org/decade/2010-4
<Response [200]>
converted to soup
got all album data
dataframe in list of dataframes
snoozing for 10 seconds before next scrape
Scraping https://bestsellingalbums.org/decade/2010-5
<Response [200]>
converted to soup
got all album data
dataframe in list of dataframes
snoozing for 9 seconds before next scrape
Scraping https://bestsellingalbums.org/decade/2010-6
<Response [200]>
converted to soup
got all

In [10]:
## check if correct number of dfs
len(all_dfs)

10

In [11]:
## call len a single df to verify correct amount of data points
len(all_dfs[1])

50

In [12]:
## convert to a single df rather than a list of df
df = pd.concat(all_dfs, ignore_index = True)
df

Unnamed: 0,album,artist,sales,more_info
0,ADELE,21,30000000,https://bestsellingalbums.org/album/1034
1,ADELE,25,23000000,https://bestsellingalbums.org/album/1035
2,MICHAEL BUBLÉ,CHRISTMAS,15000000,https://bestsellingalbums.org/album/30524
3,TAYLOR SWIFT,1989,14748116,https://bestsellingalbums.org/album/45488
4,JUSTIN BIEBER,PURPOSE,14000000,https://bestsellingalbums.org/album/23318
...,...,...,...,...
495,MACKLEMORE & RYAN LEWIS,THE HEIST,5858500,https://bestsellingalbums.org/album/28330
496,EMINEM,THE MARSHALL MATHERS LP 2,5790318,https://bestsellingalbums.org/album/13762
497,TAYLOR SWIFT,LOVER,5686733,https://bestsellingalbums.org/album/45493
498,JAY-Z & KANYE WEST,WATCH THE THRONE,5550000,https://bestsellingalbums.org/album/21088


In [13]:
## call df with 500 albums
df

Unnamed: 0,album,artist,sales,more_info
0,ADELE,21,30000000,https://bestsellingalbums.org/album/1034
1,ADELE,25,23000000,https://bestsellingalbums.org/album/1035
2,MICHAEL BUBLÉ,CHRISTMAS,15000000,https://bestsellingalbums.org/album/30524
3,TAYLOR SWIFT,1989,14748116,https://bestsellingalbums.org/album/45488
4,JUSTIN BIEBER,PURPOSE,14000000,https://bestsellingalbums.org/album/23318
...,...,...,...,...
495,MACKLEMORE & RYAN LEWIS,THE HEIST,5858500,https://bestsellingalbums.org/album/28330
496,EMINEM,THE MARSHALL MATHERS LP 2,5790318,https://bestsellingalbums.org/album/13762
497,TAYLOR SWIFT,LOVER,5686733,https://bestsellingalbums.org/album/45493
498,JAY-Z & KANYE WEST,WATCH THE THRONE,5550000,https://bestsellingalbums.org/album/21088
