# Purpose

The purpose of this Jupyter notebook is to scrape the Billboard Hot 100 chart (the weekly top 100 songs) for the seven years from April 2012 to April 2019, collecting each song's title, artist, and rank from the website. To do this, I will be using BeautifulSoup.

First, I will import the libraries I will be using to scrape. Then I will build a list of all the chart dates I want to pull, format the URLs with those dates, and load each page from the website to collect the data for the dataframe.

In [1]:
import requests
from bs4 import BeautifulSoup
from datetime import date, datetime, timedelta

import numpy as np
import pandas as pd

Using timedelta, I will create a generator function that yields one date per week. Each date is then formatted as a `YYYY-MM-DD` string to insert into my URL.

In [2]:
def week_delta(start, end, delta):
    curr = start
    while curr < end:
        yield curr
        curr += delta

week_as_datetime = []
weeks = []
for result in week_delta(date(2012, 4, 14), date(2019, 4, 20), timedelta(days=7)):
    week_as_datetime.append(result)
    weeks.append(result.strftime('%Y-%m-%d'))

weeks[-1:]

['2019-04-13']
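
As a quick sanity check on the generator above, the sketch below repeats it in a self-contained form and inspects the result. The names `chart_dates`, `first_week`, `last_week`, and `all_saturdays` are introduced here for illustration only. The point it shows is that the end date is exclusive, so `date(2019, 4, 20)` itself is never yielded.

```python
from datetime import date, timedelta

def week_delta(start, end, delta):
    """Yield dates from start (inclusive) up to end (exclusive) in steps of delta."""
    curr = start
    while curr < end:
        yield curr
        curr += delta

# Same arguments as in the cell above.
chart_dates = list(week_delta(date(2012, 4, 14), date(2019, 4, 20), timedelta(days=7)))

first_week = chart_dates[0].strftime('%Y-%m-%d')
last_week = chart_dates[-1].strftime('%Y-%m-%d')

# weekday() counts Monday as 0, so Saturday is 5.
all_saturdays = all(d.weekday() == 5 for d in chart_dates)

print(len(chart_dates), first_week, last_week, all_saturdays)
```

Checking the weekday is a cheap way to confirm that every generated date stays on the same day of the week as the starting date.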

Next, I will create a function that inserts each individual week into the URL to request the top 100 songs for that week from the Billboard website. The function then parses the returned HTML, finds the div classes that contain the information we want, and fills in a list of dictionaries with one entry per song.

In [14]:
url = 'https://www.billboard.com/charts/hot-100/{}'
top_100 = []

def billboard_web_scrape(url):
    for week in weeks:
        file_url = url.format(week)
        response = requests.get(file_url)
        print(response.status_code)
        
        # Get the raw HTML of the page
        page = response.text
        
        # Parse the HTML into a BeautifulSoup object
        soup = BeautifulSoup(page, 'html.parser')
        
        # Grab information from top 100 charts 
        for minisoup in soup.find_all(class_='chart-list-item'):
            song = {}
            song['week_of'] = datetime.strptime(week, '%Y-%m-%d')
            song['rank'] = minisoup['data-rank']
            song['title'] = minisoup['data-title']
            song['artist'] = minisoup['data-artist']
            # find() returns None when a div is missing, so .get_text raises
            # AttributeError; store NaN for that field in that case.
            try: 
                song['last_week'] = minisoup.find('div', {'class':'chart-list-item__last-week'}).get_text(strip=True)
            except AttributeError:
                song['last_week'] = np.nan
            try:     
                song['peak_position'] = minisoup.find('div', {'class':'chart-list-item__weeks-at-one'}).get_text(strip=True)
            except AttributeError:
                song['peak_position'] = np.nan

            try: 
                song['week_on_chart'] = minisoup.find('div', {'class':'chart-list-item__weeks-on-chart'}).get_text(strip=True)
            except AttributeError:
                song['week_on_chart'] = np.nan
            top_100.append(song)
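
To try the parsing step without hitting the website, the sketch below runs the same lookups against a small hand-written HTML string. `SAMPLE_HTML` is hypothetical markup that only mimics the attributes and class names used above; it is not copied from billboard.com. The helpers `text_or_none` and `parse_items` are names made up for this example. One difference from the function above: a missing div is recorded as `None` here rather than `np.nan`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two chart entries, the second without any stats divs.
SAMPLE_HTML = """
<div class="chart-list-item" data-rank="1" data-title="Song A" data-artist="Artist A">
  <div class="chart-list-item__last-week"> 7 </div>
  <div class="chart-list-item__weeks-at-one">1</div>
  <div class="chart-list-item__weeks-on-chart">4</div>
</div>
<div class="chart-list-item" data-rank="2" data-title="Song B" data-artist="Artist B">
</div>
"""

def text_or_none(item, css_class):
    """Return the stripped text of the first matching div, or None if it is absent."""
    node = item.find('div', {'class': css_class})
    return node.get_text(strip=True) if node is not None else None

def parse_items(html):
    """Turn every chart-list-item element in the HTML into a dictionary."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for item in soup.find_all(class_='chart-list-item'):
        rows.append({
            'rank': item['data-rank'],
            'title': item['data-title'],
            'artist': item['data-artist'],
            'last_week': text_or_none(item, 'chart-list-item__last-week'),
            'peak_position': text_or_none(item, 'chart-list-item__weeks-at-one'),
            'week_on_chart': text_or_none(item, 'chart-list-item__weeks-on-chart'),
        })
    return rows

sample_rows = parse_items(SAMPLE_HTML)
print(sample_rows)
```

Checking for `None` explicitly is an alternative to wrapping each lookup in `try`/`except`; either approach leaves a missing value in place of the field.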
        

In [15]:
billboard_web_scrape(url)

200
200


In [16]:
top_100[:-1]

[{'week_of': datetime.datetime(2019, 4, 13, 0, 0),
  'rank': '1',
  'title': 'Old Town Road',
  'artist': 'Lil Nas X Featuring Billy Ray Cyrus',
  'last_week': '7',
  'peak_position': '1',
  'week_on_chart': '4'},
 {'week_of': datetime.datetime(2019, 4, 13, 0, 0),
  'rank': '2',
  'title': 'Sunflower (Spider-Man: Into The Spider-Verse)',
  'artist': 'Post Malone & Swae Lee',
  'last_week': '2',
  'peak_position': '1',
  'week_on_chart': '24'},
 {'week_of': datetime.datetime(2019, 4, 13, 0, 0),
  'rank': '3',
  'title': 'Wow.',
  'artist': 'Post Malone',
  'last_week': '1',
  'peak_position': '1',
  'week_on_chart': '15'},
 {'week_of': datetime.datetime(2019, 4, 13, 0, 0),
  'rank': '4',
  'title': 'Please Me',
  'artist': 'Cardi B & Bruno Mars',
  'last_week': '3',
  'peak_position': '1',
  'week_on_chart': '7'},
 {'week_of': datetime.datetime(2019, 4, 13, 0, 0),
  'rank': '5',
  'title': 'Middle Child',
  'artist': 'J. Cole',
  'last_week': '4',
  'peak_position': '2',
  'week_on_char

## Create file

To store the information scraped from the Billboard website, I will convert the list of dictionaries into a dataframe and write it to a CSV file.

In [None]:
# Name columns
cols = ['week_of','rank', 'title', 'artist', 'last_week', 'peak_position', 'week_on_chart']

# Turn list of dictionaries into dataframe
df = pd.DataFrame(top_100, columns=cols)
# Print the shape first; the last expression in the cell is what gets displayed
print(df.shape)
df.tail()

In [None]:
df.to_csv('billboard_top_100.csv', index=False)
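
The scraper stores `rank`, `last_week`, and the other chart positions as strings, so it can help to check how the file reads back. The sketch below is a minimal round trip under stated assumptions: the two rows are made-up values shaped like the scraper's output, and an in-memory `io.StringIO` buffer stands in for `billboard_top_100.csv`. The names `sample_rows`, `sample_df`, `buffer`, and `restored` exist only in this example.

```python
import io
from datetime import datetime

import numpy as np
import pandas as pd

cols = ['week_of', 'rank', 'title', 'artist', 'last_week', 'peak_position', 'week_on_chart']

# Made-up rows for illustration; the second one has a missing last_week value.
sample_rows = [
    {'week_of': datetime(2019, 4, 13), 'rank': '1', 'title': 'Song A', 'artist': 'Artist A',
     'last_week': '7', 'peak_position': '1', 'week_on_chart': '4'},
    {'week_of': datetime(2019, 4, 13), 'rank': '2', 'title': 'Song B', 'artist': 'Artist B',
     'last_week': np.nan, 'peak_position': '2', 'week_on_chart': '1'},
]
sample_df = pd.DataFrame(sample_rows, columns=cols)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
buffer.seek(0)

# On reading, pandas infers numeric types; parse_dates restores the dates.
restored = pd.read_csv(buffer, parse_dates=['week_of'])
print(restored.dtypes)
```

In this sketch the string ranks come back as integers, while a column containing a missing value comes back as floats, since NaN is a float.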