# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to again practice those skills on a full-fledged site!

In this lab, you'll practice your scraping skills on an online music magazine and events website called Resident Advisor.

## Objectives

You will be able to:

* Create a full scraping pipeline that involves traversing over many pages of a website, dealing with errors and storing data

## View the Website

For this lab, you'll be scraping the https://ra.co website. For reproducibility we will use the [Internet Archive](https://archive.org/) Wayback Machine to retrieve a version of this page from March 2019.

Start by navigating to the events page [here](https://web.archive.org/web/20210325230938/https://ra.co/events/us/newyork?week=2019-03-30) in your browser. It should look something like this:

<img src="images/ra_top.png">

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

## Write a Function to Scrape all of the Events on the Given Page

The function should return a Pandas DataFrame with columns for the `Event_Name`, `Venue`, and `Number_of_Attendees`.

Start by importing the relevant libraries, making a request to the relevant URL, and exploring the contents of the response with `BeautifulSoup`. Then fill in the `scrape_events` function with the relevant code.

In [111]:
# Relevant imports
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
import time

In [2]:
EVENTS_PAGE_URL = "https://web.archive.org/web/20210325230938/https://ra.co/events/us/newyork?week=2019-03-30"

# Exploration: making the request and parsing the response
page = requests.get(EVENTS_PAGE_URL)
soup = BeautifulSoup(page.content, 'html.parser')

In [18]:
# Find the container with event listings in it
events_container = soup.find('li', class_='Column-sc-18hsrnn-0 gnwWng')
print(events_container.prettify())

<li class="Column-sc-18hsrnn-0 gnwWng">
 <div class="Box-omzyfs-0 fYkcJU">
  <div class="Box-omzyfs-0 SectionStyledBox-tvjxx0-0 eEYxFS sticky-header">
   <h3 class="Box-omzyfs-0 Heading__StyledBox-sc-120pa9w-0 fwuoVk">
    <span class="Text-sc-1t0gn2o-0 dvCBwl" color="accent" font-weight="normal">
     <span class="Text-sc-1t0gn2o-0 gSvLLX" color="accent" font-weight="normal">
      ̸
     </span>
     Sat, 30 Mar
    </span>
   </h3>
  </div>
  <hr class="Divider__HorizontalDivider-sc-1qsmuc-0 klshtO"/>
  <ul class="Grid__GridStyled-sc-1l00ugd-0 fuNsvk grid" data-test-id="ticketed-event">
   <li class="Column-sc-18hsrnn-0 jHShKh">
    <div class="Box-omzyfs-0 sc-AxjAm dqkjhR" data-test-id="ticketed-event">
     <h3 class="Box-omzyfs-0 Heading__StyledBox-sc-120pa9w-0 fhMVGI">
      <a class="Link__AnchorWrapper-k7o46r-1 bmWkiB" data-test-id="event-listing-heading" data-tracking-id="/events/1234892" href="/web/20210325230938/https://ra.co/events/1234892">
       <span class="Text-sc-1t0

In [24]:
type(events_container.contents[0])

bs4.element.Tag

In [31]:
# Find a list of events by date within that container
regex = re.compile("Box-omzyfs-0 fYkcJU|Box-omzyfs-0 bEpoyR")
# events = events_container.contents[0].findAll('li', class_='Column-sc-18hsrnn-0 jHShKh')
events_by_date = events_container.findAll('div', {'class': regex})
events_by_date[0]

<div class="Box-omzyfs-0 fYkcJU"><div class="Box-omzyfs-0 SectionStyledBox-tvjxx0-0 eEYxFS sticky-header"><h3 class="Box-omzyfs-0 Heading__StyledBox-sc-120pa9w-0 fwuoVk"><span class="Text-sc-1t0gn2o-0 dvCBwl" color="accent" font-weight="normal"><span class="Text-sc-1t0gn2o-0 gSvLLX" color="accent" font-weight="normal"≯</span>Sat, 30 Mar</span></h3></div><hr class="Divider__HorizontalDivider-sc-1qsmuc-0 klshtO"/><ul class="Grid__GridStyled-sc-1l00ugd-0 fuNsvk grid" data-test-id="ticketed-event"><li class="Column-sc-18hsrnn-0 jHShKh"><div class="Box-omzyfs-0 sc-AxjAm dqkjhR" data-test-id="ticketed-event"><h3 class="Box-omzyfs-0 Heading__StyledBox-sc-120pa9w-0 fhMVGI"><a class="Link__AnchorWrapper-k7o46r-1 bmWkiB" data-test-id="event-listing-heading" data-tracking-id="/events/1234892" href="/web/20210325230938/https://ra.co/events/1234892"><span class="Text-sc-1t0gn2o-0 Link__StyledLink-k7o46r-0 fAmOyf" color="primary" data-test-id="event-listing-heading" data-tracking-id="/events/123489

In [37]:
# Extract the date (e.g. Sat, 30 Mar) from one of those containers
events_by_date[0].find('span', class_='Text-sc-1t0gn2o-0 dvCBwl').contents[-1]

'Sat, 30 Mar'

In [40]:
# Extract the name, venue, and number of attendees from one of the
# events within that container
events_one_day = events_by_date[0].findAll('li', class_='Column-sc-18hsrnn-0 jHShKh')
name = events_one_day[0].find('span', class_='Text-sc-1t0gn2o-0 Link__StyledLink-k7o46r-0 fAmOyf').text
name

'UnterMania II'

In [65]:
regex2 = re.compile("Text-sc-1t0gn2o-0 hhfigA|Text-sc-1t0gn2o-0 Link__StyledLink-k7o46r-0 knBJCg")
venue = events_one_day[0].find('span', {'class': regex2}).text
venue

'TBA - New York'

In [83]:
attendees = events_one_day[0].find('div', class_='Box-omzyfs-0 sc-AxjAm ebaaK').next_sibling.find('span').text
attendees

'457'

In [88]:
len(events_one_day)

40

In [92]:
# Loop over all of the event entries, extract this information
# from each, and assemble a dataframe
names = []
venues = []
dates = []
attendees = []

for event_days in events_by_date:
    temp_date = event_days.find('span', class_='Text-sc-1t0gn2o-0 dvCBwl').contents[-1]
    events_on_day = events_by_date[0].findAll('li', class_='Column-sc-18hsrnn-0 jHShKh')
    for event in events_on_day:
        names.append(event.find('span', class_='Text-sc-1t0gn2o-0 Link__StyledLink-k7o46r-0 fAmOyf').text)
        venues.append(event.find('span', {'class': regex2}).text)
        dates.append(temp_date)
        try:
            attendees.append(event.find('div', class_='Box-omzyfs-0 sc-AxjAm ebaaK').next_sibling.find('span').text)
        except:
            attendees.append(0)
    data = {'Event_Name': names, 'Venue': venues,
            'Event_Date': dates, 'Number_of_Attendees': attendees}
    df = pd.DataFrame(data)

print(df.shape)
df.head()

(280, 4)


Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,UnterMania II,TBA - New York,"Sat, 30 Mar",457
1,"Cocoon New York: Sven Väth, Ilario Alicante, B...",99 Scott Ave,"Sat, 30 Mar",407
2,Horse Meat Disco - New York Residency,Elsewhere,"Sat, 30 Mar",375
3,Rave: Underground Resistance All Night,Nowadays,"Sat, 30 Mar",232
4,"Believe You Me // Beta Librae, Stephan Kimbel,...",TBA - New York,"Sat, 30 Mar",89


In [93]:
# Bring it all together in a function that makes the request, gets the
# list of entries from the response, loops over that list to extract the
# name, venue, date, and number of attendees for each event, and returns
# that list of events as a dataframe

def scrape_events(EVENTS_PAGE_URL):
    #Your code here
    page = requests.get(EVENTS_PAGE_URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    events_container = soup.find('li', class_='Column-sc-18hsrnn-0 gnwWng')
    
    regex1 = re.compile("Box-omzyfs-0 fYkcJU|Box-omzyfs-0 bEpoyR")
    regex2 = re.compile("Text-sc-1t0gn2o-0 hhfigA|Text-sc-1t0gn2o-0 Link__StyledLink-k7o46r-0 knBJCg")
    
    events_by_date = events_container.findAll('div', {'class': regex1})
    
    names = []
    venues = []
    dates = []
    attendees = []

    for event_days in events_by_date:
        temp_date = event_days.find('span', class_='Text-sc-1t0gn2o-0 dvCBwl').contents[-1]
        events_on_day = events_by_date[0].findAll('li', class_='Column-sc-18hsrnn-0 jHShKh')
        for event in events_on_day:
            names.append(event.find('span', class_='Text-sc-1t0gn2o-0 Link__StyledLink-k7o46r-0 fAmOyf').text)
            venues.append(event.find('span', {'class': regex2}).text)
            dates.append(temp_date)
            try:
                attendees.append(event.find('div', class_='Box-omzyfs-0 sc-AxjAm ebaaK').next_sibling.find('span').text)
            except:
                attendees.append(0)
    
    data = {'Event_Name': names, 'Venue': venues,
                'Event_Date': dates, 'Number_of_Attendees': attendees}
    df = pd.DataFrame(data)

    return df

In [94]:
# Test out your function
scrape_events(EVENTS_PAGE_URL)

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,UnterMania II,TBA - New York,"Sat, 30 Mar",457
1,"Cocoon New York: Sven Väth, Ilario Alicante, B...",99 Scott Ave,"Sat, 30 Mar",407
2,Horse Meat Disco - New York Residency,Elsewhere,"Sat, 30 Mar",375
3,Rave: Underground Resistance All Night,Nowadays,"Sat, 30 Mar",232
4,"Believe You Me // Beta Librae, Stephan Kimbel,...",TBA - New York,"Sat, 30 Mar",89
...,...,...,...,...
275,"Dub Disco w/ Monolithic, Special Guest",Baby's All Right,"Fri, 5 Apr",2
276,Fatoumata Diawara,Le Poisson Rouge,"Fri, 5 Apr",1
277,Astral Turf: A Springy Spacey Dance Party,Uniondocs,"Fri, 5 Apr",1
278,Dead Crew Orbit Live PA March Showcase,The Brown Note,"Fri, 5 Apr",1


## Write a Function to Retrieve the URL for the Next Page

As you scroll down, there should be a button labeled "Next Week" that will take you to the next page of events. Write code to find that button and extract the URL from it.

This is a relative path, so make sure you add `https://web.archive.org` to the front to get the URL.

![next page](images/ra_next.png)

In [104]:
# Find the button, find the relative path, create the URL for the current `soup`
url_ext = soup.find('div', class_='Box-omzyfs-0 sc-AxjAm Panel__StyledAlignment-sc-1udo2qh-0 ArchiveNavigator___StyledPanel2-x733n4-2 kKGHmX').find('a').attrs['href']
new_url = 'https://web.archive.org/' + url_ext
new_url

'https://web.archive.org//web/20210325230938/https://ra.co/events/us/newyork?week=2019-04-06'

In [105]:
# Fill in this function, to take in the current page's URL and return the
# next page's URL
def next_page(url):
    #Your code here
    page = requests.get(EVENTS_PAGE_URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    url_ext = soup.find('div', class_='Box-omzyfs-0 sc-AxjAm Panel__StyledAlignment-sc-1udo2qh-0 ArchiveNavigator___StyledPanel2-x733n4-2 kKGHmX').find('a').attrs['href']
    return 'https://web.archive.org/' + url_ext

In [106]:
# Test out your function
next_page(EVENTS_PAGE_URL)

'https://web.archive.org//web/20210325230938/https://ra.co/events/us/newyork?week=2019-04-06'

## Scrape the Next 500 Events

In other words, repeatedly call `scrape_events` and `next_page` until you have assembled a dataframe with at least 500 rows.

Display the data sorted by the number of attendees, greatest to least.

We recommend adding a brief `time.sleep` call between `requests.get` calls to avoid rate limiting.

In [109]:
url = EVENTS_PAGE_URL

In [123]:
# Your code here
df = pd.DataFrame()
while len(df) < 1000:
    df = df.append(scrape_events(url), ignore_index=True)
    url = next_page(url)
    time.sleep(1)

In [124]:
df.tail()

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
1003,Disorient presents: Glamtech,Chelsea Music Hall,"Fri, 12 Apr",1
1004,Cyber Loft #2,The Mercury Lounge,"Fri, 12 Apr",1
1005,Sweet Tooth: A Candy Rave,Secret Loft,"Fri, 12 Apr",1
1006,"SVR presents: Pajama Party 2: April 6th, 2019",The Deep end,"Fri, 12 Apr",1
1007,Bonsai: West1ne,Sunnyvale,"Fri, 12 Apr",0


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!