# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to again practice those skills on a full-fledged site!
In this lab, you'll practice your scraping skills on a music website: https://www.residentadvisor.net.
## Objectives

You will be able to:
* Create a full scraping pipeline that involves traversing over many pages of a website, dealing with errors and storing data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [None]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [None]:
# Open the inspect element feature in your browser

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [3]:
html_page = requests.get('https://www.residentadvisor.net/events/us/washingtonstate')

In [4]:
webpage = html_page.content
soup = BeautifulSoup(webpage, "html.parser")

In [5]:
soup.prettify

<bound method Tag.prettify of <!DOCTYPE html>

<html lang="en,ja,es">
<head id="_x1"><title>
	RA: Events in Washington State, United States of America
</title><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="en,ja,es" http-equiv="content-language"/><meta content="RA: Resident Advisor" name="Description"/><meta content="RA, residentadvisor, resident, advisor, music, ra, events, in, washington, state, united, states, america" name="Keywords"/><meta content="Resident Advisor" name="Author"/><meta content="Resident Advisor" property="og:site_name"/><meta content="712773712080127" property="fb:app_id"/><link href="/bundles/default-css?v=FkfRVAlFvpndxqgZliJaJOXD-OhkiRFP8nrBK9Pg2R01" rel="stylesheet"/>
<meta content="app-id=981952703, app-argument=ra-guide://search" name="apple-itunes-app"/><link href="/bundles/cat-listings-css?v=qgpSmyPbylOKeJFqy2yvCrTgAsw9yQYcJtLKS_vPO6s1" rel="stylesheet"/>
<link href="/favicon.ico" rel="icon" type="image/vnd.microsoft.ico

In [6]:
listings = soup.find('div', id="event-listing")
print(listings)

<div class="fl col4" id="event-listing">
<ul class="list" id="items">
<li><p class="eventDate date"><a href="/events.aspx?ai=411&amp;v=day&amp;mn=2&amp;yr=2020&amp;dy=6"><span>Thu, 06 Feb 2020 /</span></a></p></li><li class=""><article class="event-item clearfix" itemscope="" itemtype="http://data-vocabulary.org/Event"><span style="display:none;"><time datetime="2020-02-06T00:00" itemprop="startDate">2020-02-06T00:00</time></span><a href="/events/1375259"><img height="76" src="/images/events/flyer/2020/2/us-0206-1375259-list.jpg" width="152"/></a><div class="bbox"><h1 class="event-title" itemprop="summary"><a href="/events/1375259" itemprop="url" title="Event details of Field Trip 106: Dustycloud">Field Trip 106: Dustycloud</a> <span>at <a href="/club.aspx?id=68203">Q Nightclub</a>, <a href="/events.aspx?ai=46">Seattle</a></span></h1><div class="grey event-lineup">Dustycloud</div><p class="attending"><span>2</span> Attending</p></div></article></li><li><p class="eventDate date"><a href

In [7]:
list_items = listings.find_all('li')
list_items

[<li><p class="eventDate date"><a href="/events.aspx?ai=411&amp;v=day&amp;mn=2&amp;yr=2020&amp;dy=6"><span>Thu, 06 Feb 2020 /</span></a></p></li>,
 <li class=""><article class="event-item clearfix" itemscope="" itemtype="http://data-vocabulary.org/Event"><span style="display:none;"><time datetime="2020-02-06T00:00" itemprop="startDate">2020-02-06T00:00</time></span><a href="/events/1375259"><img height="76" src="/images/events/flyer/2020/2/us-0206-1375259-list.jpg" width="152"/></a><div class="bbox"><h1 class="event-title" itemprop="summary"><a href="/events/1375259" itemprop="url" title="Event details of Field Trip 106: Dustycloud">Field Trip 106: Dustycloud</a> <span>at <a href="/club.aspx?id=68203">Q Nightclub</a>, <a href="/events.aspx?ai=46">Seattle</a></span></h1><div class="grey event-lineup">Dustycloud</div><p class="attending"><span>2</span> Attending</p></div></article></li>,
 <li><p class="eventDate date"><a href="/events.aspx?ai=411&amp;v=day&amp;mn=2&amp;yr=2020&amp;dy=7">

In [8]:
# if this evaluates to True then I know that this is an "event" item rather than a "date" item
list_items[1].find('p', class_="eventDate date") == None

True

In [9]:
# this above is missing the "submit event" item so I am going to change this measure the length of each string to find the real events

In [24]:
# all list items which are events will have string lengths greater than 30 (more like greater than 70)
details = list_items[7].text
len(details)

211

In [25]:
date = details[0:16]
date

'2020-02-07T00:00'

In [26]:
other_details = details[16:]
other_details

'Innerflight » Drop: DJ T. › Dubspeeka › Ben Annand › Rob Noble › Rhines › Night Train at The Monkey Loft, Seattle¤ DJ T, ¤ DUBSPEEKA, ¤ BEN ANNAND, ¤ ROB NOBLE, ¤ RHINES, ¤ NIGHT TRAIN1 Attending'

In [13]:
# prototype to find venue
venue = other_details.split(' at ')[1].split(',')[0]
venue

'Black Lodge'

In [14]:
# prototype to find performer
performer = other_details.split(' at ')[0]
performer

'Physical Wash, Lower Tar, Chrome Corpse, Leash, Vox Sinistra'

In [27]:
# how to find number attending???

## Write a Function to Scrape all of the Events on the Given Page Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [30]:
rows = []
def scrape_events(events_page_url):
    html_page = requests.get(events_page_url)
    webpage = html_page.content
    soup = BeautifulSoup(webpage, "html.parser")
    listings = soup.find('div', id="event-listing")
    list_items = listings.find_all('li')
    for item in range(len(list_items)):
        details = list_items[item].text
        if len(details) > 30:
            date = details[0:16]
            other_details = details[16:]
            venue = other_details.split(' at ')[1].split(',')[0]
            performer = other_details.split(' at ')[0]
            rows.append([performer, date, venue])
        else:
            continue
    df = pd.DataFrame(rows)
    df.columns = ["Performer", "Date", "Venue"]
    return df

In [31]:
page_1 = scrape_events('https://www.residentadvisor.net/events/us/washingtonstate')

In [32]:
page_1

Unnamed: 0,Performer,Date,Venue
0,Field Trip 106: Dustycloud,2020-02-06T00:00,Q Nightclub
1,Secondnature feat. Simo Cell,2020-02-07T00:00,TBA - Seattle
2,Hunt & Gather Winter Diversion: Doc Martin,2020-02-07T00:00,Kremwerk
3,Future Funktion V,2020-02-07T00:00,Kremwerk
4,Citrus Room 2-Year Anniversary: Sasha Marie,2020-02-07T00:00,Q Nightclub
5,Innerflight » Drop: DJ T. › Dubspeeka › Ben An...,2020-02-07T00:00,The Monkey Loft
6,"Physical Wash, Lower Tar, Chrome Corpse, Leash...",2020-02-08T00:00,Black Lodge
7,Sundown: Enamour [Anjunadeep],2020-02-08T00:00,Kremwerk


## Write a Function to Retrieve the URL for the Next Page

In [33]:
# create list of dates to be added to url at 1 week intervals
import datetime
numdays = 1000
base = datetime.date(2020,2,10)
date_list = [base + 7*datetime.timedelta(days=x) for x in range(numdays)]
date_list[0:5]

[datetime.date(2020, 2, 10),
 datetime.date(2020, 2, 17),
 datetime.date(2020, 2, 24),
 datetime.date(2020, 3, 2),
 datetime.date(2020, 3, 9)]

In [34]:
# prototype converting datetime object into string format for insertion into URL
date_list[0]
month = '{:02d}'.format(date_list[0].month) #convert day to two digit format
day = '{:02d}'.format(date_list[0].day) # convert month to two digit format
year = date_list[0].year
string_date = '{}-{}-{}'.format(year, month, day)
string_date

'2020-02-10'

In [35]:
# turn datetime list into list of string dates
string_dates = []
for i in range(len(date_list)):
    month = '{:02d}'.format(date_list[i].month)
    day = '{:02d}'.format(date_list[i].day)
    year = date_list[i].year
    string_date = '{}-{}-{}'.format(year, month, day)
    string_dates.append(string_date)

In [36]:
string_dates[0:5]

['2020-02-10', '2020-02-17', '2020-02-24', '2020-03-02', '2020-03-09']

In [37]:
def next_page(url):
    #Your code here
    if url == 'https://www.residentadvisor.net/events/us/washingtonstate':
        next_page_url = 'https://www.residentadvisor.net/events/us/washingtonstate/week/2020-02-10'
    else: 
        url_string_date = url[-10:]
        date_index = string_dates.index(url_string_date) + 1
        next_page_url = 'https://www.residentadvisor.net/events/us/washingtonstate/week/' + string_dates[date_index]
    return next_page_url

In [39]:
list_of_urls = []
# there are only 9 pages with shows
for i in range(9):
    url = 'https://www.residentadvisor.net/events/us/washingtonstate/week/' + string_dates[i]
    list_of_urls.append(url)

list_of_urls

['https://www.residentadvisor.net/events/us/washingtonstate/week/2020-02-10',
 'https://www.residentadvisor.net/events/us/washingtonstate/week/2020-02-17',
 'https://www.residentadvisor.net/events/us/washingtonstate/week/2020-02-24',
 'https://www.residentadvisor.net/events/us/washingtonstate/week/2020-03-02',
 'https://www.residentadvisor.net/events/us/washingtonstate/week/2020-03-09',
 'https://www.residentadvisor.net/events/us/washingtonstate/week/2020-03-16',
 'https://www.residentadvisor.net/events/us/washingtonstate/week/2020-03-23',
 'https://www.residentadvisor.net/events/us/washingtonstate/week/2020-03-30',
 'https://www.residentadvisor.net/events/us/washingtonstate/week/2020-04-06']

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [40]:
#Your code here
master_df = pd.DataFrame()

cum_len_dataframes = 0

while cum_len_dataframes <= 1000:
    for url in list_of_urls:
        df = scrape_events(url)
        master_df.append(df)
        cum_len_dataframes += len(df)

In [42]:
master_df
# why is this not working?????

In [43]:
# testing other URL's.. This is pulling event names that aren't even in the source code for the URL
# for example try to find "DJ Seinfeld" in the source-code

df_x = scrape_events('https://www.residentadvisor.net/events/us/washingtonstate/week/2020-02-17')
df_x

Unnamed: 0,Performer,Date,Venue
0,Field Trip 106: Dustycloud,2020-02-06T00:00,Q Nightclub
1,Secondnature feat. Simo Cell,2020-02-07T00:00,TBA - Seattle
2,Hunt & Gather Winter Diversion: Doc Martin,2020-02-07T00:00,Kremwerk
3,Future Funktion V,2020-02-07T00:00,Kremwerk
4,Citrus Room 2-Year Anniversary: Sasha Marie,2020-02-07T00:00,Q Nightclub
...,...,...,...
110,Haüsed: DJ Seinfeld,2020-02-20T00:00,Kremwerk
111,Field Trip 108: Dombresky,2020-02-20T00:00,Q Nightclub
112,Kremwerk 6 Year Anniversary Friday: Omar S [Re...,2020-02-21T00:00,Kremwerk
113,Pop Secret: Destructo,2020-02-21T00:00,Q Nightclub


In [45]:
# this is also pulling many duplicates
df_x.Performer.value_counts()

Pop Secret: Destructo                                                                              4
Kremwerk 6 Year Anniversary Friday: Omar S [Research x Queer Dance Affair x Boyhood present]       4
Haüsed: DJ Seinfeld                                                                                4
Kremwerk 6 Year Aniversary Saturday: Shigeto, Jon Casey, Taso, Lefto [Shook & Soulfocus present    4
Field Trip 108: Dombresky                                                                          4
Telefon Tel Aviv                                                                                   3
Bounce Brunch *Spring??* ft Tyler Morrison & Toastercookie                                         3
Haüsed: Get Physical Takeover with Pezzner & Jon Lee                                               3
Shameless 17 Year Anniversary Party with DeWalta & Residents                                       3
Haüsed x Depth: La Fleur                                                                   

## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!