
### Webscraping! 

Web scraping is a powerful way to collect your own data, but it's important to follow the policy each site publishes at its `/robots.txt` path.

For example, the scraping policy for [Open Table](https://www.opentable.com) is listed at [https://www.opentable.com/robots.txt](https://www.opentable.com/robots.txt). Your path-reading skills will help you here!
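
Checking a policy can also be done in code. The sketch below uses the standard library's `urllib.robotparser`; the rules in `SAMPLE_RULES` are made up for illustration and are not Open Table's actual policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not copied from any real site.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_RULES.splitlines())  # parse rules from text, no network request

# Ask whether a generic crawler ("*") may fetch each URL under these rules
allowed = parser.can_fetch("*", "https://example.com/events/")
blocked = parser.can_fetch("*", "https://example.com/private/data")
delay = parser.crawl_delay("*")  # seconds to wait between requests, if stated

print(allowed, blocked, delay)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would download the real file instead of parsing a string.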


In [29]:
## imports 
import pandas as pd 
from bs4 import BeautifulSoup
import requests 
from time import sleep # this is a courtesy to the server so you don't overload it
import matplotlib.pyplot as plt 


%matplotlib inline 

In [30]:
## let's scrape events from the Austin Chronicle
url = 'https://www.austinchronicle.com/events/music/'


### 1) First, request the page and get its content with:
``` python 
res = requests.get(url)

```

In [31]:
## code: be a good web citizen and don't send a server too many requests
res = requests.get(url)

In [32]:
res.status_code

200
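
A status code of 200 means the request succeeded. Other codes (for example 404 or 503) mean there is no page content worth parsing. As a minimal sketch, the hypothetical helper `get_with_retry` below retries a few times and waits between attempts. It is not part of `requests`; it accepts any callable that returns an object with a `status_code` attribute.

```python
import time

def get_with_retry(get, url, retries=3, delay=1.0):
    """Call get(url) until it returns status 200, waiting between attempts.

    `get` is any callable returning an object with a `status_code`
    attribute, e.g. `requests.get`. Hypothetical helper for illustration.
    """
    for attempt in range(1, retries + 1):
        res = get(url)
        if res.status_code == 200:
            return res
        if attempt < retries:
            time.sleep(delay * attempt)  # wait a little longer after each failure
    raise RuntimeError(
        f"{url} still returned {res.status_code} after {retries} attempts"
    )
```

In this notebook it could be called as `res = get_with_retry(requests.get, url)`.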

In [33]:
res.content

b'\n\n\n\n\n\n\n\n<!DOCTYPE html>\n<html xmlns="https://www.w3.org/1999/xhtml" xmlns:fb="https://www.facebook.com/2008/fbml" lang="en-US">\n<head>\n  \n  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n  <meta charset="UTF-8">\n  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">\n\n  <link rel="stylesheet" type="text/css" href="https://cloud.typography.com/7790492/668686/css/fonts.css" />\n\n  <script src="https://use.fontawesome.com/1a933a8ada.js"></script>\n\n  <link rel="stylesheet" type="text/css" href="/Styles/responsive/ac-style.css?v=2.9" />\n\n  <link rel="stylesheet" type="text/css" href="/Styles/responsive/ac-menu.css" />\n  <link rel="stylesheet" type="text/css" href="/Styles/responsive/magnific.css" />\n  \n  <link rel="stylesheet" type="text/css" href="/Styles/responsive/fluid.css?v=1.3" />\n  \n\n  <link rel="apple-touch-icon" sizes="57x57" href="/apple-icon-57x57.png?v=2">\n  <link rel="apple-

### 2) "Soupify" the object

``` python 
soup = BeautifulSoup(res.content, 'html.parser')  # name the parser explicitly
```

In [34]:
## code 
soup = BeautifulSoup(res.content, 'html.parser')  # naming the parser avoids a warning

In [35]:
soup

<!DOCTYPE html>
<html lang="en-US" xmlns="https://www.w3.org/1999/xhtml" xmlns:fb="https://www.facebook.com/2008/fbml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport"/>
<link href="https://cloud.typography.com/7790492/668686/css/fonts.css" rel="stylesheet" type="text/css"/>
<script src="https://use.fontawesome.com/1a933a8ada.js"></script>
<link href="/Styles/responsive/ac-style.css?v=2.9" rel="stylesheet" type="text/css"/>
<link href="/Styles/responsive/ac-menu.css" rel="stylesheet" type="text/css"/>
<link href="/Styles/responsive/magnific.css" rel="stylesheet" type="text/css"/>
<link href="/Styles/responsive/fluid.css?v=1.3" rel="stylesheet" type="text/css"/>
<link href="/apple-icon-57x57.png?v=2" rel="apple-touch-icon" sizes="57x57"/>
<link href="/apple-icon-60x60.png?v=2" rel="apple-touch-icon" sizes="60x60"/>
<link href="/ap

### 3) Try to find the element that contains the venue

In [36]:
## code
for i in soup.find_all(name='div', attrs={'class': 'venue'}):
    print(i.text)

Stay Gold
Threadgill's Old No. 1
McCullough Theatre
Barracuda
Come & Take It Live
The Electric Church
One World Theatre
The Austin Beer Garden Brewing Co.
ACL Live at the Moody Theater
Angel's Icehouse
Antone's Nightclub
Azul Tequila
Banger's Sausage House & Beer Garden
The Barn
B.D. Riley's Irish Pub
Black Sparrow Music Parlor
Blindside Tattoo & Piercing
Brentwood Social House
Broken Spoke
Buck's Backyard
Buzz Mill Riverside
Buzz Mill Shady
C-Boy's Heart & Soul
The Capital Grille
Carousel Lounge
Cedar Street Courtyard
Cenote Windsor Park
Central Market North


### 4) What about the date and time? 

In [37]:
## code 
for i in soup.find_all(name='div', attrs={'class':'date-time'})[:5]:
    print(i.text)

Sun., Jan 19., 9pm-12mid   get tickets
Fri., Jan. 17, 8-10pm   get tickets
Fri., Jan. 17, 7:30pm   get tickets
Fri., Jan. 17, 10pm  get tickets
Fri., Jan. 17, 8pm  get tickets
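
Each string above also carries the "get tickets" link label and some extra spaces. One way to tidy this, sketched with a hypothetical helper `clean_date_time` that assumes the label always appears at the end of the text:

```python
import re

def clean_date_time(text):
    """Drop a trailing 'get tickets' label and collapse repeated whitespace."""
    text = re.sub(r"\s*get tickets\s*$", "", text)
    return " ".join(text.split())

cleaned = clean_date_time("Fri., Jan. 17, 8-10pm   get tickets")
print(cleaned)
```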


### 5) A link and the artist?

_Hint: Try setting a base url_

In [38]:
### code  
for i in soup.find_all('h2')[:5]:
    print(i.text)
    print(i.find('a').attrs['href'])

All Skate: Sundays at Stay Gold
http://www.facebook.com/events/585614641982520/
Songwriters Series: Nakia
http://www.threadgills.com/events-old-no-1
Andrew Cyrille Quartet
/events/music/andrew-cyrille-quartet-2626679/
Hikes (album release), the Kraken Quartet, Christelle Bofale
/events/music/hikes-album-release-2632606/
Lucifer, Savage Master, Overdose
/events/music/lucifer-savage-master-overdose-2626825/
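
Some of these links are absolute and others are relative to the site root. Following the hint, a base URL can be combined with each link using the standard library's `urljoin`, which leaves absolute URLs unchanged. The name `BASE_URL` is an assumption made for this sketch.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.austinchronicle.com"  # assumed base for relative links

links = [
    "http://www.facebook.com/events/585614641982520/",  # already absolute
    "/events/music/andrew-cyrille-quartet-2626679/",    # relative to the site root
]

full_links = [urljoin(BASE_URL, link) for link in links]
for link in full_links:
    print(link)
```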


When building a scraper and its DataFrame, decide up front what each row and each column will represent on the web page.
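
As a rough illustration of that idea, each event can be one row and each field one column. The `Event` class below is a hypothetical container; the sample values are taken from the output printed in this notebook.

```python
from typing import NamedTuple

class Event(NamedTuple):
    """One row of the table: a single event listing."""
    artist: str
    venue: str
    time: str
    link: str

events = [
    Event("Dar Williams", "One World Theatre", "Fri., Jan. 17",
          "/events/music/dar-williams-2627543/"),
    Event("Snizz Boogie", "The Austin Beer Garden Brewing Co.", "Fri., Jan. 17",
          "/events/music/snizz-boogie-2632574/"),
]

# A list of dicts, one per row; the keys become the column names
records = [event._asdict() for event in events]
print(records[0])
```

A list of dicts like `records` can be passed directly to `pd.DataFrame(records)`.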

### 6) Refactoring - Find the element that contains all of the information we potentially need - then grab the individual items from that element

In [39]:
### code 
for i in soup.find_all(name='div', attrs={'class':'event-text'}):
    # find() returns None when nothing matches, so .text raises AttributeError
    try:
        print(i.find(name='div', attrs={'class':'venue'}).text)
    except AttributeError:
        print('no venue')
    try:
        print(i.find(name='span',attrs={'class':'event-date'}).text)
    except AttributeError:
        print('no time')
    try:
        print(i.find('h2').text)
    except AttributeError:
        print('no artist')
    try:
        print(i.find('h2').find('a').attrs['href'])
    except (AttributeError, KeyError):  # no <h2>/<a>, or an <a> without href
        print('no link')

Stay Gold
no time
All Skate: Sundays at Stay Gold
http://www.facebook.com/events/585614641982520/
Threadgill's Old No. 1
no time
Songwriters Series: Nakia
http://www.threadgills.com/events-old-no-1
McCullough Theatre
Fri., Jan. 17
Andrew Cyrille Quartet
/events/music/andrew-cyrille-quartet-2626679/
Barracuda
Fri., Jan. 17
Hikes (album release), the Kraken Quartet, Christelle Bofale
/events/music/hikes-album-release-2632606/
Come & Take It Live
Fri., Jan. 17
Lucifer, Savage Master, Overdose
/events/music/lucifer-savage-master-overdose-2626825/
The Electric Church
Fri., Jan. 17
Evergreen, ThunderStars (album release), Dottie
/events/music/thunderstars-album-release-2654787/
One World Theatre
Fri., Jan. 17
Dar Williams
/events/music/dar-williams-2627543/
The Austin Beer Garden Brewing Co.
Fri., Jan. 17
Snizz Boogie
/events/music/snizz-boogie-2632574/
ACL Live at the Moody Theater
Fri., Jan. 17
The Pink Floyd Laser Spectacular
/events/music/the-pink-floyd-laser-spectacular-2624539/
Angel's
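
The repeated `try`/`except` blocks could be folded into one small function. This is only a sketch: `text_or_default` is a hypothetical helper, and unlike the cell above it also strips surrounding whitespace. It relies on `find()` returning `None` when no element matches. A `SimpleNamespace` stands in for a matched tag so the example runs without a page.

```python
from types import SimpleNamespace

def text_or_default(tag, default):
    """Return tag.text without surrounding whitespace, or `default` if tag is None."""
    return tag.text.strip() if tag is not None else default

found = SimpleNamespace(text=" Stay Gold ")  # stand-in for a matched tag
missing = None                               # what find() returns on no match

print(text_or_default(found, 'no venue'))
print(text_or_default(missing, 'no time'))

# In the loop above this might look like:
# venue = text_or_default(i.find('div', attrs={'class': 'venue'}), 'no venue')
```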

### 7)   Put it in a data frame! 

In [41]:
### code 
df = pd.DataFrame(columns=['artist', 'venue', 'time', 'link'])
for i in soup.find_all(name='div', attrs={'class':'event-text'}):
    # find() returns None when nothing matches, so .text raises AttributeError
    try:
        venue = i.find(name='div', attrs={'class':'venue'}).text
    except AttributeError:
        venue = 'no venue'
    try:
        time = i.find(name='span',attrs={'class':'event-date'}).text
    except AttributeError:
        time = 'no time'
    try:
        artist = i.find('h2').text
    except AttributeError:
        artist = 'no artist'
    try:
        link = i.find('h2').find('a').attrs['href']
    except (AttributeError, KeyError):  # no <h2>/<a>, or an <a> without href
        link = 'no link'
    df.loc[len(df)] = [artist, venue, time, link]  # append one row per event

In [42]:
df

|   | artist | venue | time | link |
|---|---|---|---|---|
| 0 | All Skate: Sundays at Stay Gold | Stay Gold | no time | http://www.facebook.com/events/585614641982520/ |
| 1 | Songwriters Series: Nakia | Threadgill's Old No. 1 | no time | http://www.threadgills.com/events-old-no-1 |
| 2 | Andrew Cyrille Quartet | McCullough Theatre | Fri., Jan. 17 | /events/music/andrew-cyrille-quartet-2626679/ |
| 3 | Hikes (album release), the Kraken Quartet, Chr... | Barracuda | Fri., Jan. 17 | /events/music/hikes-album-release-2632606/ |
| 4 | Lucifer, Savage Master, Overdose | Come & Take It Live | Fri., Jan. 17 | /events/music/lucifer-savage-master-overdose-2... |
| 5 | Evergreen, ThunderStars (album release), Dottie | The Electric Church | Fri., Jan. 17 | /events/music/thunderstars-album-release-2654787/ |
| 6 | Dar Williams | One World Theatre | Fri., Jan. 17 | /events/music/dar-williams-2627543/ |
| 7 | Snizz Boogie | The Austin Beer Garden Brewing Co. | Fri., Jan. 17 | /events/music/snizz-boogie-2632574/ |
| 8 | The Pink Floyd Laser Spectacular | ACL Live at the Moody Theater | Fri., Jan. 17 | /events/music/the-pink-floyd-laser-spectacular... |
| 9 | Neel Cole | Angel's Icehouse | Fri., Jan. 17 | /events/music/neel-cole-2629406/ |


In [43]:
### code 
## scrape the first five pages of listings for 2020-01-17, pausing between requests
df = pd.DataFrame(columns=['artist', 'venue', 'time', 'link'])
for page in range(1,6):
    url = f'https://www.austinchronicle.com/events/music/2020-01-17/page-{page}/'
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'html.parser')
    for i in soup.find_all(name='div', attrs={'class':'event-text'}):
        # find() returns None when nothing matches, so .text raises AttributeError
        try:
            venue = i.find(name='div', attrs={'class':'venue'}).text
        except AttributeError:
            venue = 'no venue'
        try:
            time = i.find(name='span',attrs={'class':'event-date'}).text
        except AttributeError:
            time = 'no time'
        try:
            artist = i.find('h2').text
        except AttributeError:
            artist = 'no artist'
        try:
            link = i.find('h2').find('a').attrs['href']
        except (AttributeError, KeyError):  # no <h2>/<a>, or an <a> without href
            link = 'no link'
        df.loc[len(df)] = [artist, venue, time, link]
    sleep(1)  # pause between pages as a courtesy to the server

In [44]:
df.shape

(133, 4)
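
The page loop hardcodes pages 1 through 5. A possible alternative, sketched below, keeps requesting pages until one comes back with no events. It assumes that a page past the end of the listings contains no `event-text` elements, which has not been verified for this site. `scrape_pages` and `fetch_rows` are hypothetical names; `fetch_rows(page)` would wrap the request-and-parse code from the page loop and return a list of rows.

```python
from time import sleep

def scrape_pages(fetch_rows, max_pages=20, delay=1.0):
    """Collect rows page by page until a page returns no rows.

    fetch_rows(page) is a caller-supplied function returning a list of
    rows for that page number. max_pages is a safety cap on requests.
    """
    rows = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_rows(page)
        if not page_rows:   # empty page: assume the listings have ended
            break
        rows.extend(page_rows)
        sleep(delay)        # courtesy pause between requests
    return rows

# Stand-in data so the sketch runs without touching the network
fake_site = {
    1: [["Dar Williams", "One World Theatre"], ["Snizz Boogie", "The Austin Beer Garden Brewing Co."]],
    2: [["Neel Cole", "Angel's Icehouse"]],
}
all_rows = scrape_pages(lambda page: fake_site.get(page, []), delay=0)
print(len(all_rows))
```

Collecting rows in a plain list and building the DataFrame once at the end also avoids growing the frame one row at a time.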