# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to again practice those skills on a full-fledged site!
In this lab, you'll practice your scraping skills on a music website: https://www.residentadvisor.net.
## Objectives

You will be able to:
* Create a full scraping pipeline that involves traversing over many pages of a website, dealing with errors and storing data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [2]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [3]:
# Open the inspect element feature in your browser

## Write a Function to Scrape all of the Events on the Given Page Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [4]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

In [5]:
html_page = requests.get('https://www.residentadvisor.net/events/us/newyork')
soup = BeautifulSoup(html_page.content, 'html.parser')

In [6]:
event_listings = soup.find('div', id="event-listing")
entries = event_listings.findAll('li')
titles = []
venues = []
dates = []
attendees = []
for entry in entries:
    event=entry.find('h1',class_='event-title')
    if event != None:
        event_info = event.text.split('at')
        title = event_info[0].strip()
        titles.append(title)
        venue = event_info[1].strip()
        venues.append(venue)
    date = entry.find('time')
    if date != None:
        date = date.text.split('T')[0]
        dates.append(date)
    attend = entry.find('p',class_='attending')
    if attend != None:
        attend = int(attend.text.split()[0])
        attendees.append(attend)
df = pd.DataFrame([titles,venues,dates,attendees]).transpose()
df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
df

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,ReSolute,TBA - New York,2020-05-28,6
1,Virtual Thursday: Planetarium with Mr. Curtain...,Nowadays,2020-05-28,2
2,Joe Wong + Nite Cre,ures,2020-05-28,1
3,Virtual Thursday: Wh,is Mutual Aid,2020-05-28,1
4,Virtual Friday: DJ Voices and Zenker Brothers,Nowadays,2020-05-29,1
5,[CANCELED] Cristoph - Made Event & Gray Area,Quantum,2020-05-29,20
6,[POSTPONED] Bicep Live,Knockdown Center,2020-05-30,101
7,Zero presents... Carte Blanche Rooftop Party O...,The W,2020-05-30,8
8,Teksupport: Mind Against (All Night Long),99 Scott Ave,2020-05-30,22
9,Never Fake It: Seth Magoon & Mike Guimond,Le Bain,2020-05-30,4


In [7]:
def scrape_events(events_page_url):
    html_page = requests.get(events_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    event_listings = soup.find('div', id="event-listing")
    entries = event_listings.findAll('li')
    titles = []
    venues = []
    dates = []
    attendees = []
    for entry in entries:
        event=entry.find('h1',class_='event-title')
        if event != None:
            event_info = event.text.split('at')
            title = event_info[0].strip()
            titles.append(title)
            venue = event_info[1].strip()
            venues.append(venue)
        date = entry.find('time')
        if date != None:
            date = date.text.split('T')[0]
            dates.append(date)
        attend = entry.find('p',class_='attending')
        if attend != None:
            attend = int(attend.text.split()[0])
            attendees.append(attend)
    df = pd.DataFrame([titles,venues,dates,attendees]).transpose()
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return(df)

In [8]:
scrape_events('https://www.residentadvisor.net/events/us/florida')

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,(Cancelled) Steve Bug by Link Miami Rebels,"Floyd, Miami",2020-05-29,2
1,Un_mute,"TBA - Miami, Miami",2020-05-29,1
2,Zendid,"TBA - Miami, Miami",2020-05-30,2


## Write a Function to Retrieve the URL for the Next Page

In [9]:
html_page = requests.get('https://www.residentadvisor.net/events/us/newyork')
soup = BeautifulSoup(html_page.content, 'html.parser')
next_page = soup.find('li', id="liNext").find('a')
next_page_ext ='/week'+(next_page.get('href','missing').split('/week')[-1])
next_page_url = url + next_page_ext

In [10]:
def next_page(url):
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    split_url = url.split('/week')[0]
    next_page = soup.find('li', id="liNext").find('a')
    next_page_ext ='/week'+(next_page.get('href','missing').split('/week')[-1])
    url = split_url + next_page_ext
    return url

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [14]:
list_dfs = []
total_rows = 0
url = "https://www.residentadvisor.net/events/us/newyork"
while total_rows <= 100:
    print(url)
    df = scrape_events(url)
    list_dfs.append(df)
    total_rows += len(df)
    url = next_page(url)
    time.sleep(.2)
df = pd.concat(list_dfs)
df = df.iloc[:100]
print(len(df))
df.sort_values(by=['Number_of_Attendees','Event_Date'],ascending=False)

https://www.residentadvisor.net/events/us/newyork
https://www.residentadvisor.net/events/us/newyork/week/2020-06-01
https://www.residentadvisor.net/events/us/newyork/week/2020-06-08
https://www.residentadvisor.net/events/us/newyork/week/2020-06-15
https://www.residentadvisor.net/events/us/newyork/week/2020-06-22
https://www.residentadvisor.net/events/us/newyork/week/2020-06-29
https://www.residentadvisor.net/events/us/newyork/week/2020-07-06
https://www.residentadvisor.net/events/us/newyork/week/2020-07-13
https://www.residentadvisor.net/events/us/newyork/week/2020-07-20
https://www.residentadvisor.net/events/us/newyork/week/2020-07-27
https://www.residentadvisor.net/events/us/newyork/week/2020-08-03
https://www.residentadvisor.net/events/us/newyork/week/2020-08-10
https://www.residentadvisor.net/events/us/newyork/week/2020-08-17
https://www.residentadvisor.net/events/us/newyork/week/2020-08-24
https://www.residentadvisor.net/events/us/newyork/week/2020-08-31
https://www.residentadviso

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
6,[POSTPONED] All Day I Dream Summer Season Opening,Brooklyn Mirage,2020-06-07,437
1,October Moonlight Midnight Yacht Cruise,Harbor Lights Yacht,2020-10-10,415
0,Can't Stop The Feeling Midnight Yacht Cruise,Harbor Lights Yacht,2020-07-24,219
4,Lane 8 - Brightest Lights Tour (Sunday) - resc...,Brooklyn Mirage,2020-08-16,137
0,Monolink - rescheduled,Kings Hall - Avant Gardner,2020-09-30,122
...,...,...,...,...
1,NYC July 4th Weekend Hip Hop vs Reggae® Yacht ...,Skyport Marina,2020-07-03,
2,NYC Independence Day Weekend Yacht Party Cruis...,Skyport Marina,2020-07-03,
4,718 Sessions Bo,Party 2020,2020-06-28,
3,[CANCELLED] Wrecked & Carry N,ion Pride 2020,2020-06-27,


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!