# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to again practice those skills on a full-fledged site!

In this lab, you'll practice your scraping skills on an online music magazine and events website called Resident Advisor.

## Objectives

You will be able to:
* Create a full scraping pipeline that involves traversing over many pages of a website, dealing with errors and storing data

## View the Website

For this lab, you'll be scraping the https://ra.co website. For reproducibility we will use the [Internet Archive](https://archive.org/) Wayback Machine to retrieve a version of this page from March 2019.

Start by navigating to the events page [here](https://web.archive.org/web/20210325230938/https://ra.co/events/us/newyork?week=2019-03-30) in your browser. It should look something like this:

<img src="images/ra_top.png">

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

## Write a Function to Scrape all of the Events on the Given Page

The function should return a Pandas DataFrame with columns for the `Event_Name`, `Venue`, and `Number_of_Attendees`.

Start by importing the relevant libraries, making a request to the relevant URL, and exploring the contents of the response with `BeautifulSoup`. Then fill in the `scrape_events` function with the relevant code.

In [1]:
# Relevant imports
import requests 
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import time

In [4]:
EVENTS_PAGE_URL = "https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-03-30"

# Exploration: making the request and parsing the response
page = requests.get(EVENTS_PAGE_URL)
soup = BeautifulSoup(page.content, 'html.parser')


In [15]:
# Find the container with event listings in it
all_events = soup.find('div', attrs={"data-tracking-id": "events-all"})
listings = all_events.find("ul").find("li")
print(listings.text[:100])
# event_container.prettify

̸Sat, 30 MarUnterMania IIMary Yuzovskaya, Manni Dee, Umfang, Juana, The Lady MachineTBA - New YorkRA


In [26]:
# Find a list of events by date within that container
event_dates = listings.findChildren(recursive=False)

In [46]:
# Extract the date (e.g. Sat, 30 Mar) from one of those containers
first_date_info = event_dates[0].find("div", class_="sticky-header").text
first_date_ele = event_dates[0]
first_date = first_date_info.strip("'̸")
# first_date

'Sat, 30 Mar'

In [47]:
# Extract the name, venue, and number of attendees from one of the
# events within that container
first_event_eles = first_date_ele.findChildren('ul')
first_event_ele = first_event_eles[0]

first_event_name = first_event_ele.find("h3").text

event_venue_attendees = first_event_ele.findAll("div", attrs={"height": 30})
event_venue = event_venue_attendees[0].text

attendence = int(venue_attendees[-1].text)

print(first_event_name)
print('--------------')
print(event_venue)
print('--------------')
print(first_date)
print('--------------')
print(attendence)

UnterMania II
--------------
TBA - New York
--------------
Sat, 30 Mar
--------------
457


In [52]:
all_events = soup.find('div', attrs={"data-tracking-id": "events-all"})
listings = all_events.find("ul").find("li")
dates = listings.findChildren(recursive=False)

In [54]:
# Loop over all of the event entries, extract this information
# from each, and assemble a dataframe
rows = []

for container in dates:
        
        if not container.text:
            continue

        date = container.find("div", class_="sticky-header").text
        date = date.strip("'̸")

        events = container.findChildren("ul")
        for event in events:
            
            name = event.find("h3").text
            venue_attendees = event.findAll("div", attrs={"height": 30})
            venue = venue_attendees[0].text
            try:
                num_attendees = int(venue_attendees[-1].text)
            except ValueError:
                num_attendees = np.nan

            rows.append([name, venue, date, num_attendees])
            
df = pd.DataFrame(rows)
df


Unnamed: 0,0,1,2,3
0,UnterMania II,TBA - New York,"Sat, 30 Mar",457.0
1,"Cocoon New York: Sven Väth, Ilario Alicante, B...",99 Scott Ave,"Sat, 30 Mar",407.0
2,Horse Meat Disco - New York Residency,Elsewhere,"Sat, 30 Mar",375.0
3,Rave: Underground Resistance All Night,Nowadays,"Sat, 30 Mar",232.0
4,"Believe You Me // Beta Librae, Stephan Kimbel,...",TBA - New York,"Sat, 30 Mar",89.0
...,...,...,...,...
114,A Night at the Baths,C'mon Everybody,"Fri, 5 Apr",1.0
115,Blaqk Audio,Music Hall of Williamsburg,"Fri, 5 Apr",1.0
116,Erik the Lover,Erv's,"Fri, 5 Apr",1.0
117,Wax On Vissions,Starliner,"Fri, 5 Apr",1.0


In [66]:
# Bring it all together in a function that makes the request, gets the
# list of entries from the response, loops over that list to extract the
# name, venue, date, and number of attendees for each event, and returns
# that list of events as a dataframe

def scrape_events(events_page_url):
    #Your code here
    rows = []
    response = requests.get(EVENTS_PAGE_URL)
    soup = BeautifulSoup(response.content, "html.parser")
    
    all_events = soup.find('div', attrs={"data-tracking-id": "events-all"})
    listings = all_events.find("ul").find("li")
    dates = listings.findChildren(recursive=False)

    for container in dates:
        
        if not container.text:
            continue

        date = container.find("div", class_="sticky-header").text
        date = date.strip("'̸")

        events = container.findChildren("ul")
        for event in events:
            
            name = event.find("h3").text
            venue_attendees = event.findAll("div", attrs={"height": 30})
            venue = venue_attendees[0].text
            try:
                num_attendees = int(venue_attendees[-1].text)
            except ValueError:
                num_attendees = np.nan

            rows.append([name, venue, date, num_attendees])

# Make the list of lists into a dataframe and display
    event_table = pd.DataFrame(rows)
    event_table.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return event_table

In [56]:
# Test out your function
scrape_events(EVENTS_PAGE_URL)

Unnamed: 0,0,1,2,3
0,UnterMania II,TBA - New York,"Sat, 30 Mar",457.0
1,"Cocoon New York: Sven Väth, Ilario Alicante, B...",99 Scott Ave,"Sat, 30 Mar",407.0
2,Horse Meat Disco - New York Residency,Elsewhere,"Sat, 30 Mar",375.0
3,Rave: Underground Resistance All Night,Nowadays,"Sat, 30 Mar",232.0
4,"Believe You Me // Beta Librae, Stephan Kimbel,...",TBA - New York,"Sat, 30 Mar",89.0
...,...,...,...,...
114,A Night at the Baths,C'mon Everybody,"Fri, 5 Apr",1.0
115,Blaqk Audio,Music Hall of Williamsburg,"Fri, 5 Apr",1.0
116,Erik the Lover,Erv's,"Fri, 5 Apr",1.0
117,Wax On Vissions,Starliner,"Fri, 5 Apr",1.0


## Write a Function to Retrieve the URL for the Next Page

As you scroll down, there should be a button labeled "Next Week" that will take you to the next page of events. Write code to find that button and extract the URL from it.

This is a relative path, so make sure you add `https://web.archive.org` to the front to get the URL.

![next page](images/ra_next.png)

In [58]:
# find the next page button 
svg = soup.find("svg", attrs={"aria-label": "Right arrow"})

#finding the button and walking back to get the link associated with it
svg_parent = svg.parent
link = svg.parent.previousSibling

rel_path = link.get("href") #grab the link from the href tag
next_url = "https://web.archive.org" + rel_path
next_url

'https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-04-06'

In [61]:
# Fill in this function, to take in the current page's URL and return the
# next page's URL
def next_page(url):
    #Your code here
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    
    svg = soup.find("svg", attrs={"aria-label": "Right arrow"})
    svg_parent = svg.parent
    link = svg.parent.previousSibling
    rel_path = link.get("href")
    next_url = "https://web.archive.org" + rel_path
    
    return next_url

In [62]:
# Test out your function
next_page(EVENTS_PAGE_URL)

'https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-04-06'

## Scrape the Next 500 Events

In other words, repeatedly call `scrape_events` and `next_page` until you have assembled a dataframe with at least 500 rows.

Display the data sorted by the number of attendees, greatest to least.

We recommend adding a brief `time.sleep` call between `requests.get` calls to avoid rate limiting.

In [67]:
# Your code here
big_df = pd.DataFrame()
# df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]

current_URL = EVENTS_PAGE_URL

while big_df.shape[0] <= 500:
    df = scrape_events(current_URL)
    time.sleep(.2)
    big_df = pd.concat([big_df, df])
    
    current_url = next_page(current_URL)
    time.sleep(.2)
    
big_df

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,UnterMania II,TBA - New York,"Sat, 30 Mar",457.0
1,"Cocoon New York: Sven Väth, Ilario Alicante, B...",99 Scott Ave,"Sat, 30 Mar",407.0
2,Horse Meat Disco - New York Residency,Elsewhere,"Sat, 30 Mar",375.0
3,Rave: Underground Resistance All Night,Nowadays,"Sat, 30 Mar",232.0
4,"Believe You Me // Beta Librae, Stephan Kimbel,...",TBA - New York,"Sat, 30 Mar",89.0
...,...,...,...,...
114,A Night at the Baths,C'mon Everybody,"Fri, 5 Apr",1.0
115,Blaqk Audio,Music Hall of Williamsburg,"Fri, 5 Apr",1.0
116,Erik the Lover,Erv's,"Fri, 5 Apr",1.0
117,Wax On Vissions,Starliner,"Fri, 5 Apr",1.0


In [69]:
big_df.sort_values("Number_of_Attendees",ascending=False)

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
91,Teksupport: Honey Dijon (All Night Long) Sold Out,99 Scott Ave,"Fri, 5 Apr",463.0
91,Teksupport: Honey Dijon (All Night Long) Sold Out,99 Scott Ave,"Fri, 5 Apr",463.0
91,Teksupport: Honey Dijon (All Night Long) Sold Out,99 Scott Ave,"Fri, 5 Apr",463.0
91,Teksupport: Honey Dijon (All Night Long) Sold Out,99 Scott Ave,"Fri, 5 Apr",463.0
91,Teksupport: Honey Dijon (All Night Long) Sold Out,99 Scott Ave,"Fri, 5 Apr",463.0
...,...,...,...,...
39,"Petra, Matthusen & Lang, White & Pitsiokos, an...",H0L0,"Sat, 30 Mar",
39,"Petra, Matthusen & Lang, White & Pitsiokos, an...",H0L0,"Sat, 30 Mar",
39,"Petra, Matthusen & Lang, White & Pitsiokos, an...",H0L0,"Sat, 30 Mar",
39,"Petra, Matthusen & Lang, White & Pitsiokos, an...",H0L0,"Sat, 30 Mar",


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!