# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site!
In this lab, you'll scrape event listings from a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [1]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [2]:
# Open the inspect element feature in your browser

## Write a Function to Scrape All of the Events on a Given Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [3]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.residentadvisor.net/events'
html_page = requests.get(url)
soup = BeautifulSoup(html_page.content, 'html.parser')


In [4]:
# soup.prettify

In [5]:
events = soup.find('div', {'id': 'event-listing'})
events.prettify

<bound method Tag.prettify of <div class="fl col4" id="event-listing">
<ul class="list" id="items">
<li><p class="eventDate date"><a href="/events.aspx?ai=82&amp;v=day&amp;mn=5&amp;yr=2020&amp;dy=14"><span>Thu, 14 May 2020 /</span></a></p></li><li class=""><article class="event-item clearfix tickets-bkg-logo" itemscope="" itemtype="http://data-vocabulary.org/Event"><a href="/events/1350021#tickets"><img class="nohide" src="https://residentadvisor.net/images/ra-tix.png" style="height: 23px; width: 40px; right: 0px; position: absolute; top: 1px;"/></a><span style="display:none;"><time datetime="2020-05-14T00:00" itemprop="startDate">2020-05-14T00:00</time></span><a href="/events/1350021"><img height="76" src="/images/events/flyer/2020/5/us-0514-1350021-list.jpg" width="152"/></a><div class="bbox"><h1 class="event-title" itemprop="summary"><a href="/events/1350021" itemprop="url" title="Event details of Sold Out, Below Radar: Nicola Cruz (DJ set)">Sold Out, Below Radar: Nicola Cruz (DJ se

In [6]:
events.find('h1', class_='event-title').find('a').text

'Sold Out, Below Radar: Nicola Cruz (DJ set)'

In [7]:
#get event_names on the page
event_names = [h1.find('a').text for h1 in events.findAll('h1', class_='event-title')]
event_names

['Sold Out, Below Radar: Nicola Cruz (DJ set)', 'DJ Shadow']

In [8]:
#figure out venue logic
events.find('h1', class_='event-title').find('span').text[3:]

'The Black Box'

In [9]:
#get venues
venues = [h1.find('span').text[3:] for h1 in events.findAll('h1', class_='event-title')]
venues

['The Black Box', 'Ogden Theatre']

In [10]:
#logic to get attending numbers
events.find('p', class_='attending').find('span').text

'25'

In [11]:
attending = [p.find('span').text for p in events.findAll('p', class_='attending')]
attending

['25', '3']

In [12]:
#logic to get show dates
events.find('p', class_='eventDate').text[:-2]

'Thu, 14 May 2020'

In [13]:
dates = [p.text[:-2] for p in events.findAll('p', class_='eventDate')]
dates

['Thu, 14 May 2020', 'Sun, 17 May 2020']

In [14]:
import pandas as pd

def scrape_events(events_page_url):
    """Scrape one events page into a DataFrame of event details."""
    html_page = requests.get(events_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    event_container = soup.find('div', {'id': 'event-listing'})
    titles = event_container.find_all('h1', class_='event-title')
    event_names = [h1.find('a').text for h1 in titles]
    # Drop the 3-character prefix that precedes the venue name in the span
    venues = [h1.find('span').text[3:] for h1 in titles]
    # Date headings end in " /", so drop the last two characters
    dates = [p.text[:-2] for p in event_container.find_all('p', class_='eventDate')]
    attending = [p.find('span').text for p in event_container.find_all('p', class_='attending')]

    # Note: the four lists are built independently. If their lengths differ,
    # the shorter columns are padded with missing values after the transpose.
    df = pd.DataFrame([event_names, venues, dates, attending]).transpose()
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]

    return df

In [15]:
test_url = 'https://www.residentadvisor.net/events/us/colorado/week/2020-05-26'
test_df = scrape_events(test_url)
test_df

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Chromeo and Madeon | Red Rocks Amphitheatre | Fri, 29 May 2020 | 1 |
| 1 | All Day I Dream of the Mile High City | Sculpture Park | Sat, 30 May 2020 | 217 |
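
Because `scrape_events` builds each column from a separate list, a page where several events share one date heading, or where an event has no attendance figure, can produce columns of different lengths. The following is a minimal sketch of one way to keep rows aligned: walk the list items in order and carry the most recent date forward. It assumes the listing keeps the structure seen in the inspected HTML (a date `<li>` followed by event `<li>` items). The `parse_event_items` helper is hypothetical, and `SAMPLE_HTML` is a hand-written simplification, not a capture of the live page.

```python
from bs4 import BeautifulSoup

# Hand-written, simplified stand-in for the listing markup (an assumption).
SAMPLE_HTML = """
<div id="event-listing"><ul id="items">
<li><p class="eventDate date"><a href="#"><span>Thu, 14 May 2020 /</span></a></p></li>
<li><article class="event-item"><h1 class="event-title"><a href="/events/1">First Show</a> <span>at <a href="#">Venue A</a></span></h1><p class="attending"><span>25</span> Attending</p></article></li>
<li><article class="event-item"><h1 class="event-title"><a href="/events/2">Second Show</a> <span>at <a href="#">Venue B</a></span></h1></article></li>
</ul></div>
"""

def parse_event_items(html):
    """Return one dict per event, reusing the latest date heading seen."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find('div', {'id': 'event-listing'})
    rows = []
    if container is None:
        return rows
    current_date = None
    for li in container.find_all('li'):
        date_tag = li.find('p', class_='eventDate')
        if date_tag is not None:
            # Date headings end in " /", so drop the last two characters
            current_date = date_tag.text[:-2]
            continue
        title = li.find('h1', class_='event-title')
        if title is None:
            continue  # not an event item
        venue_tag = title.find('span')
        attending_tag = li.find('p', class_='attending')
        count_tag = attending_tag.find('span') if attending_tag is not None else None
        rows.append({
            'Event_Name': title.find('a').text,
            'Venue': venue_tag.text[3:] if venue_tag is not None else None,
            'Event_Date': current_date,
            'Number_of_Attendees': count_tag.text if count_tag is not None else None,
        })
    return rows

rows = parse_event_items(SAMPLE_HTML)
```

Passing the resulting list of dicts to `pd.DataFrame` yields the same four columns, with a missing value only where the page itself had none.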


## Write a Function to Retrieve the URL for the Next Page

In [16]:
# soup.prettify

In [17]:
soup.find('a', {'ga-event-action': 'Next '}).attrs['href']

'/events/us/colorado/week/2020-05-19'

In [18]:
url = 'https://www.residentadvisor.net/events/events/us/colorado/week/2020-05-19'
url.split('/')[:-1]

['https:',
 '',
 'www.residentadvisor.net',
 'events',
 'events',
 'us',
 'colorado',
 'week']

In [31]:
def next_page(url):
    """Return the URL of the next events page, or the same URL if there is none."""
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    next_link = soup.find('a', {'ga-event-action': 'Next '})
    if next_link is not None:
        # The href is site-relative (e.g. '/events/us/...'), so prepend the site root
        base_url = 'https://www.residentadvisor.net'
        return base_url + next_link.attrs['href']
    return url

print(next_page('https://www.residentadvisor.net/events/us/colorado/week/2020-05-19'))

https://www.residentadvisor.net/events/us/colorado/week/2020-05-26
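
One alternative to hard-coding the site root is `urljoin` from the standard library, which resolves a relative `href` against the URL of the page it came from. This is only a sketch of that option; the `absolute_url` wrapper is a hypothetical name used for illustration.

```python
from urllib.parse import urljoin

def absolute_url(page_url, href):
    """Resolve a (possibly relative) href against the page it was found on."""
    return urljoin(page_url, href)

current = 'https://www.residentadvisor.net/events/us/colorado/week/2020-05-19'
resolved = absolute_url(current, '/events/us/colorado/week/2020-05-26')
print(resolved)
# https://www.residentadvisor.net/events/us/colorado/week/2020-05-26
```

Because `urljoin` takes the scheme and host from the current page, the same call also works when the `href` is already an absolute URL.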


## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [51]:
# Accumulate pages with pd.concat (DataFrame.append was removed in pandas 2.0)
events_df = pd.DataFrame()
url = 'https://www.residentadvisor.net/events/us/newyork'

# Keep requesting the next page until 1000 events have been collected
while len(events_df) < 1000:
    events_df = pd.concat([events_df, scrape_events(url)])
    url = next_page(url)
    print(len(events_df))


17
28
35
42
52
59
65
71
80
89
92
95
96
101
103
104
105
107
108
112
116
121
127
127
128
132
135
135
137
139
139
139
139
139
139
139
139
139
141
141
141
141
141
141
141
141
141
141
141
141
141
141
141
141
141
141
141
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142
142


From cffi callback <function _verify_callback at 0x11e4bc2f0>:
Traceback (most recent call last):
  File "/opt/anaconda3/envs/learn-env/lib/python3.6/site-packages/OpenSSL/SSL.py", line 316, in wrapper
    connection = Connection._reverse_mapping[ssl]
KeyboardInterrupt


SSLError: HTTPSConnectionPool(host='www.residentadvisor.net', port=443): Max retries exceeded with url: /events/us/newyork/week/2021-06-22 (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')],)",),))
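
The run above stalled at 142 events and then ended in an exception, which discards nothing only because the notebook kept `events_df` in memory. A hedged sketch of a more defensive loop follows: it caps the number of pages, pauses between requests, and stops cleanly on an error while keeping what was already collected. The `collect_pages` helper and its parameters are hypothetical, and catching `Exception` broadly is a simplification for illustration; the scraping and next-page functions are passed in as arguments so the sketch runs without network access.

```python
import time

def collect_pages(start_url, scrape, get_next, target=1000, max_pages=200, delay=0.0):
    """Gather per-page results until `target` rows, `max_pages`, or an error."""
    pages = []
    total = 0
    url = start_url
    for _ in range(max_pages):
        try:
            rows = scrape(url)
            next_url = get_next(url)
        except Exception as error:  # broad on purpose in this sketch
            print(f'Stopping at {url}: {error}')
            break
        pages.append(rows)
        total += len(rows)
        # Stop when enough rows are collected or there is no further page
        if total >= target or next_url == url:
            break
        url = next_url
        time.sleep(delay)  # pause between requests
    return pages

# Stand-in functions so the example runs offline
def fake_scrape(url):
    if url == 'page3':
        raise ValueError('simulated network failure')
    return [url + '-event-a', url + '-event-b']

fake_links = {'page1': 'page2', 'page2': 'page3', 'page3': 'page4'}
collected = collect_pages('page1', fake_scrape, fake_links.get, target=100, max_pages=10)
```

With the lab's own functions, the call would look like `collect_pages(url, scrape_events, next_page, delay=1.0)`, and the returned list of DataFrames can be combined with `pd.concat` as long as at least one page succeeded.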

In [53]:
events_df.columns

Index(['Event_Name', 'Venue', 'Event_Date', 'Number_of_Attendees'], dtype='object')

In [57]:
len(events_df)
events_df.sort_values(by=['Number_of_Attendees', 'Event_Date'])

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Mezerg | Le Poisson Rouge | Fri, 04 Dec 2020 | 1 |
| 1 | NYC Hip Hop vs. Reggae Midnight Yacht Party Je... | Skyport Marina | Fri, 05 Jun 2020 | 1 |
| 0 | Haza with Boston Chery, Iyabo, Carmen Sandiego... | The Sultan Room | Fri, 19 Jun 2020 | 1 |
| 0 | NYC 90s vs 2K Summer Midnight Yacht Party Jewel | Skyport Marina | Fri, 26 Jun 2020 | 1 |
| 1 | Joe Wong + Nite Creatures | Le Poisson Rouge | Fri, 29 May 2020 | 1 |
| ... | ... | ... | ... | ... |
| 8 | Made in Colombia 2020 - Boat Party | Circle Line Cruises | | |
| 2 | Elrow NYC - Rowsattacks - Postponed | Avant Gardner | | |
| 3 | Live It Up Midnight Yacht Cruise | Harbor Lights Yacht | | |
| 4 | Salmo | Le Poisson Rouge | | |
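
In the output above, both sort columns hold strings, so the rows are ordered alphabetically: `'Fri, 04 Dec 2020'` comes before `'Fri, 05 Jun 2020'`, and a count such as `'25'` would sort before `'3'`. A minimal sketch of converting the columns before sorting is shown below. The `sort_by_attendance` helper and the sample rows are hypothetical, and sorting the largest attendance first is an assumption, since the task does not state a direction.

```python
import pandas as pd

# Made-up rows in the same shape as the scraped data
sample = pd.DataFrame({
    'Event_Name': ['A', 'B', 'C', 'D'],
    'Event_Date': ['Fri, 04 Dec 2020', 'Fri, 05 Jun 2020', 'Sat, 30 May 2020', None],
    'Number_of_Attendees': ['3', '25', '25', None],
})

def sort_by_attendance(df):
    """Sort by attendance (largest first), breaking ties by earliest date."""
    out = df.copy()
    # Turn strings into numbers and dates; unparseable values become NaN/NaT
    out['Number_of_Attendees'] = pd.to_numeric(out['Number_of_Attendees'], errors='coerce')
    out['Event_Date'] = pd.to_datetime(out['Event_Date'], format='%a, %d %b %Y', errors='coerce')
    return out.sort_values(
        by=['Number_of_Attendees', 'Event_Date'],
        ascending=[False, True],
        na_position='last',
    ).reset_index(drop=True)

sorted_sample = sort_by_attendance(sample)
print(list(sorted_sample['Event_Name']))
# ['C', 'B', 'A', 'D']
```

Resetting the index also removes the repeated per-page index values (0, 1, 0, 0, 1, ...) that appear when the page DataFrames are combined.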


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!