
### Webscraping! 

Web scraping is a powerful way to collect your own data, but it's important to follow the policies a site publishes in its `robots.txt` file, which lives at the `/robots.txt` path of the site's root.

For example, the scraping policy for [Open Table](https://www.opentable.com) is listed at [https://www.opentable.com/robots.txt](https://www.opentable.com/robots.txt). Your path reading skills will help you here!
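
Reading a `robots.txt` by eye works, but Python's standard library can also check a URL against the rules for you. The sketch below uses `urllib.robotparser` on a made-up `robots.txt` (the rules and the `example.com` URLs are hypothetical, not taken from any real site), so it runs without touching the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /events/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch(useragent, url) returns True when the rules allow the URL.
allowed_events = parser.can_fetch("*", "https://example.com/events/music/")
allowed_private = parser.can_fetch("*", "https://example.com/private/page")

print(allowed_events)
print(allowed_private)
```

Against a live site, calling `parser.set_url(...)` with the site's `robots.txt` address and then `parser.read()` should load the real file instead of the sample text.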


In [1]:
## imports 
import pandas as pd 
from bs4 import BeautifulSoup
import requests 
from time import sleep 
import matplotlib.pyplot as plt 


%matplotlib inline 

In [2]:
## let's scrape music events from the Austin Chronicle
url = 'https://www.austinchronicle.com/events/music/'


### 1) First, request the page and get its content with:
``` python 
res = requests.get(url)

```

In [3]:
## code 
res = requests.get(url)

### 2) "Soupify" the object

``` python 
soup = BeautifulSoup(res.content, 'html.parser')
```

In [4]:
## code 
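## naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(res.content, 'html.parser')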

### 3) Try to find the element that contains the venue

In [5]:
## code: the tag name is 'div', and we only want tags that have
## a class="venue" attribute

for i in soup.find_all(name='div',attrs={'class':'venue'})[:5]:
    print(i.text)

Threadgill's Old No. 1
Frank Erwin Center
Antone's Nightclub
Bijou Lounge
Broken Spoke


### 4) What about the date and time? 

In [20]:
## code 
for i in soup.find_all(name='span',attrs={'class':'event-date'})[:5]:
    print(i.text)

Fri., Jan. 17
Fri., Jan. 17
Fri., Jan. 17
Fri., Jan. 17
Fri., Jan. 17


### 5) A link and the artist?

_Hint: Some of the links are relative paths, so try setting a base URL_

In [28]:
for i in soup.find_all('h2')[:5]:
    print(i.a.attrs['href'])

http://www.facebook.com/events/585614641982520/
http://www.threadgills.com/events-old-no-1
/events/music/andrew-cyrille-quartet-2626679/
/events/music/hikes-album-release-2632606/
/events/music/lucifer-savage-master-overdose-2626825/


In [25]:
### code  
for i in soup.find_all('h2')[:5]:
    print(i.text)
    print(i.find('a').attrs['href'])

All Skate: Sundays at Stay Gold
http://www.facebook.com/events/585614641982520/
Songwriters Series: Nakia
http://www.threadgills.com/events-old-no-1
Andrew Cyrille Quartet
/events/music/andrew-cyrille-quartet-2626679/
Hikes (album release), the Kraken Quartet, Christelle Bofale
/events/music/hikes-album-release-2632606/
Lucifer, Savage Master, Overdose
/events/music/lucifer-savage-master-overdose-2626825/
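
Some of the links above are absolute and some are relative to the site root. One way to act on the base URL hint is `urllib.parse.urljoin` from the standard library. This is a small sketch, assuming `https://www.austinchronicle.com` is the right base for the relative paths; the two `hrefs` are copied from the output above.

```python
from urllib.parse import urljoin

# Assumed base URL for the relative links on the events page.
BASE_URL = "https://www.austinchronicle.com"

# Two hrefs copied from the output above: one absolute, one relative.
hrefs = [
    "http://www.facebook.com/events/585614641982520/",
    "/events/music/andrew-cyrille-quartet-2626679/",
]

# urljoin leaves absolute URLs untouched and resolves relative ones
# against the base.
full_links = [urljoin(BASE_URL, href) for href in hrefs]

for link in full_links:
    print(link)
```

The same call can be dropped into the loop in place of the raw `href`, so every stored link is usable on its own.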


### 6) Refactoring - Find the element that contains all of the information we potentially need - then grab the individual items from that element

In [34]:
### code 
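for i in soup.find_all(name='div', attrs={'class':'event-text'}):
    # .find() returns None when nothing matches, so .text raises AttributeError
    try:
        print(i.find(name='div', attrs={'class':'venue'}).text)
    except AttributeError:
        print('no venue')

    try:
        print(i.find(name='span',attrs={'class':'event-date'}).text)
    except AttributeError:
        print('no time')

    try:
        print(i.find('h2').text)
    except AttributeError:
        print('no artist')

    # a missing <h2>/<a> raises AttributeError, a missing href raises KeyError
    try:
        print(i.find('h2').find('a').attrs['href'])
    except (AttributeError, KeyError):
        print('no link')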

Stay Gold
no time
All Skate: Sundays at Stay Gold
http://www.facebook.com/events/585614641982520/
Threadgill's Old No. 1
no time
Songwriters Series: Nakia
http://www.threadgills.com/events-old-no-1
McCullough Theatre
Fri., Jan. 17
Andrew Cyrille Quartet
/events/music/andrew-cyrille-quartet-2626679/
Barracuda
Fri., Jan. 17
Hikes (album release), the Kraken Quartet, Christelle Bofale
/events/music/hikes-album-release-2632606/
Come & Take It Live
Fri., Jan. 17
Lucifer, Savage Master, Overdose
/events/music/lucifer-savage-master-overdose-2626825/
The Electric Church
Fri., Jan. 17
Evergreen, ThunderStars (album release), Dottie
/events/music/thunderstars-album-release-2654787/
One World Theatre
Fri., Jan. 17
Dar Williams
/events/music/dar-williams-2627543/
The Austin Beer Garden Brewing Co.
Fri., Jan. 17
Snizz Boogie
/events/music/snizz-boogie-2632574/
ACL Live at the Moody Theater
Fri., Jan. 17
The Pink Floyd Laser Spectacular
/events/music/the-pink-floyd-laser-spectacular-2624539/
Angel's
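
The four `try`/`except` blocks all follow the same pattern: look for a tag, take its text, fall back to a placeholder. A possible tidy-up is a small helper that does this once. The sketch below is only an illustration: `text_or_default` is a hypothetical helper name, and `SAMPLE_HTML` is made-up markup shaped like the listing above (not the real page), so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML shaped like the listing above; the second event has no date.
SAMPLE_HTML = """
<div class="event-text">
  <h2><a href="/events/music/dar-williams-2627543/">Dar Williams</a></h2>
  <div class="venue">One World Theatre</div>
  <span class="event-date">Fri., Jan. 17</span>
</div>
<div class="event-text">
  <h2><a href="http://www.threadgills.com/events-old-no-1">Songwriters Series: Nakia</a></h2>
  <div class="venue">Threadgill's Old No. 1</div>
</div>
"""


def text_or_default(parent, name, class_name=None, default=""):
    """Return the stripped text of the first matching child tag, or `default`."""
    attrs = {"class": class_name} if class_name else {}
    tag = parent.find(name, attrs=attrs)
    return tag.get_text(strip=True) if tag is not None else default


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
events = sample_soup.find_all("div", attrs={"class": "event-text"})

venues = [text_or_default(e, "div", "venue", default="no venue") for e in events]
dates = [text_or_default(e, "span", "event-date", default="no time") for e in events]
artists = [text_or_default(e, "h2", default="no artist") for e in events]

print(venues)
print(dates)
print(artists)
```

Checking for `None` directly, rather than catching an exception, keeps unrelated errors from being silently swallowed.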

### 7) Put it in a data frame!

In [37]:
### code 

df = pd.DataFrame(columns=['artist', 'venue', 'time', 'link'])

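for i in soup.find_all(name='div', attrs={'class':'event-text'}):
    # .find() returns None when nothing matches, so .text raises AttributeError
    try:
        venue = i.find(name='div', attrs={'class':'venue'}).text
    except AttributeError:
        venue = 'no venue'

    try:
        time = i.find(name='span',attrs={'class':'event-date'}).text
    except AttributeError:
        time = 'no time'

    try:
        artist = i.find('h2').text
    except AttributeError:
        artist = 'no artist'

    # a missing <h2>/<a> raises AttributeError, a missing href raises KeyError
    try:
        link = i.find('h2').find('a').attrs['href']
    except (AttributeError, KeyError):
        link = 'no link'

    # append the event as a new row at the end of the data frame
    df.loc[len(df)] = [artist, venue, time, link]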
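
Assigning to `df.loc[len(df)]` works fine at this size, but it grows the frame one row at a time. A common alternative is to collect each event as a dict in a plain list and build the frame once at the end. The sketch below shows only that last step; the two `rows` are copied from the printed output above and stand in for scraped values, and `events_df` is just an illustrative name.

```python
import pandas as pd

# Two rows copied from the output above, standing in for scraped values.
rows = [
    {
        "artist": "Dar Williams",
        "venue": "One World Theatre",
        "time": "Fri., Jan. 17",
        "link": "/events/music/dar-williams-2627543/",
    },
    {
        "artist": "Snizz Boogie",
        "venue": "The Austin Beer Garden Brewing Co.",
        "time": "Fri., Jan. 17",
        "link": "/events/music/snizz-boogie-2632574/",
    },
]

# Build the frame in one step; `columns` fixes the column order.
events_df = pd.DataFrame(rows, columns=["artist", "venue", "time", "link"])

print(events_df.shape)
```

Inside the scraping loop, this would mean calling `rows.append({...})` for each event instead of writing to `df.loc`.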

In [41]:
### code 

df = pd.DataFrame(columns=['artist', 'venue', 'time', 'link'])

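for page in range(1,6):
    # each page of the listing has its own numbered URL
    url = f'https://www.austinchronicle.com/events/music/2020-01-17/page-{page}/'
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'html.parser')

    for i in soup.find_all(name='div', attrs={'class':'event-text'}):
        # .find() returns None when nothing matches, so .text raises AttributeError
        try:
            venue = i.find(name='div', attrs={'class':'venue'}).text
        except AttributeError:
            venue = 'no venue'

        try:
            time = i.find(name='span',attrs={'class':'event-date'}).text
        except AttributeError:
            time = 'no time'

        try:
            artist = i.find('h2').text
        except AttributeError:
            artist = 'no artist'

        # a missing <h2>/<a> raises AttributeError, a missing href raises KeyError
        try:
            link = i.find('h2').find('a').attrs['href']
        except (AttributeError, KeyError):
            link = 'no link'

        df.loc[len(df)] = [artist, venue, time, link]

    # pause between requests to be polite to the server
    sleep(1)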

In [43]:
df.shape

(133, 4)

In [40]:
for i in range(1,6):
    print(f'second {i}')
    #sleep(1)

second 1
second 2
second 3
second 4
second 5
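
The paginated loop builds each URL from a date and a page number. If you want to scrape other days, it may help to generate the URLs separately from the fetching. This is a rough sketch, assuming the `/events/music/<date>/page-<n>/` pattern from the loop holds for other dates as well; `page_urls` is a hypothetical helper name.

```python
from datetime import date


def page_urls(day, n_pages):
    """Build listing URLs following the pattern used in the scraping loop."""
    base = "https://www.austinchronicle.com/events/music"
    return [
        f"{base}/{day.isoformat()}/page-{page}/"
        for page in range(1, n_pages + 1)
    ]


urls = page_urls(date(2020, 1, 17), 5)

for url in urls:
    # In a real scrape, requests.get(url) followed by sleep(1) would go here.
    print(url)
```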


In [None]:
df = pd.DataFrame(columns=['performer', 'venue', 'time', 'link'])


for page in range(1,6):
    
    url = f'https://www.austinchronicle.com/events/music/2019-10-05/page-{page}/'

    res = requests.get(url)
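    soup = BeautifulSoup(res.content, 'html.parser')
    
    for event in soup.find_all('div', {'class':'event-text'}):
        # .find() returns None when there is no venue tag, so .text raises AttributeError
        try:
            venue = event.find('div', {'class':'venue'}).text
        except AttributeError:
            venue = 'no venue listed'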

        time = event.find('div', {'class':'date-time'}).text

        performer = event.find('h2').find('a').text
        link = event.find('h2').find('a').attrs['href']

        df.loc[len(df)] = [performer, venue, time, link]
    
    sleep(1)

In [44]:
df

| | artist | venue | time | link |
|---|---|---|---|---|
| 0 | Songwriters Series: Nakia | Threadgill's Old No. 1 | no time | http://www.threadgills.com/events-old-no-1 |
| 1 | All Skate: Sundays at Stay Gold | Stay Gold | no time | http://www.facebook.com/events/585614641982520/ |
| 2 | Andrew Cyrille Quartet | McCullough Theatre | Fri., Jan. 17 | /events/music/andrew-cyrille-quartet-2626679/ |
| 3 | Hikes (album release), the Kraken Quartet, Chr... | Barracuda | Fri., Jan. 17 | /events/music/hikes-album-release-2632606/ |
| 4 | Lucifer, Savage Master, Overdose | Come & Take It Live | Fri., Jan. 17 | /events/music/lucifer-savage-master-overdose-2... |
| ... | ... | ... | ... | ... |
| 128 | Grimefest Austin w/ Svdden Death, Effin, Ruvlo... | The Venue ATX | Fri., Jan. 17 | /events/music/grimefest-austin-2642339/ |
| 129 | Driftwood Nights w/ Christy Hays | Vista Brewing | Fri., Jan. 17 | /events/music/driftwood-nights-w-christy-hays-... |
| 130 | Crashing In w/ King Louie | Volstead Lounge | no time | /events/music/crashing-in-w-king-louie-2369317/ |
| 131 | Bleached Roses, Slideshow | Whip In | Fri., Jan. 17 | /events/music/bleached-roses-slideshow-2655254/ |