# Project 2
For Project 2, we built an ETL pipeline that creates a database of Phish live shows from 1993 to 2023.

First, we scraped phish.net for setlists, and the Wikipedia page for Phish concert tours and festivals for attendance and box office data. We then transformed the extracted data: we reformatted columns, normalized value formats, and dropped rows that lacked useful data. We stored the results for each year in a variable and collected those variables in a list. To load our data, we converted each variable to a DataFrame and wrote the DataFrames to CSV files. Finally, we loaded the data from the CSV files directly into SQLite and PostgreSQL.

The resulting database supports many interesting questions. For example, we could look at how City and Year affect Attendance and the Attendance/Capacity ratio. We could also count how often each City recurs and weight those counts by Attendance and Gross to build a predictive model of the likeliest cities for future show dates.

In [1]:
#from splinter import Browser
#from bs4 import BeautifulSoup as soup
from datetime import datetime
import pandas as pd
import requests
from bs4 import BeautifulSoup
import os
import shutil
from pathlib import Path
import sqlite3
from sqlalchemy import create_engine
from sqlalchemy.types import Integer, Text, String, DateTime, Float
import psycopg2

## Part 1: Extract
#### Scraping for Data

In [2]:
#browser = Browser('chrome')

In [3]:
# Pull in setlist data

url = 'https://phish.net/setlists/phish/'
#browser.visit(url)
#html = browser.html
#phish_soup = soup(html, 'html.parser')
response = requests.get(url)

phish_soup = BeautifulSoup(response.text, 'html.parser')
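
The request above parses whatever comes back, including an error page. A minimal sketch of a more defensive fetch follows, assuming `requests` is used as in the notebook. The helper name `fetch_html` and its `get` parameter are hypothetical, introduced here only so the function can be exercised without network access.

```python
def fetch_html(url, get=None, timeout=30):
    """Return the body of `url` as text, raising on HTTP errors."""
    if get is None:
        # Default to requests.get, the call the notebook already uses.
        import requests
        get = requests.get
    # A timeout keeps a stalled connection from hanging the notebook.
    response = get(url, timeout=timeout)
    # Raise instead of silently parsing a 404 or 500 error page.
    response.raise_for_status()
    return response.text
```

With this helper, the parse step could be written as `BeautifulSoup(fetch_html(url), 'html.parser')`.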

Now we begin scraping the setlist data!

In [4]:
dates = phish_soup.find_all('span', class_='setlist-date')

date_strings = [date.text[-11:] for date in dates]

cleaned_date_strings = [date.strip() for date in date_strings]

cleaned_date_strings[0:5]

['10/15/2023', '10/14/2023', '10/13/2023', '10/11/2023', '10/10/2023']
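
Slicing the last 11 characters works only while every date element ends with a date of fixed width. As a sketch of a less position-dependent approach, the date can be located with a regular expression and converted to a real `date` object, which also validates it. The function name `parse_show_date` is hypothetical, and the label text in the usage line is made up for illustration.

```python
import re
from datetime import datetime

# Matches dates written as MM/DD/YYYY, the format shown in the output above.
DATE_PATTERN = re.compile(r'\d{2}/\d{2}/\d{4}')

def parse_show_date(text):
    """Find an MM/DD/YYYY date anywhere in `text` and return it as a date."""
    match = DATE_PATTERN.search(text)
    if match is None:
        raise ValueError(f'no MM/DD/YYYY date found in {text!r}')
    # strptime rejects impossible dates such as 13/45/2023.
    return datetime.strptime(match.group(), '%m/%d/%Y').date()

show_date = parse_show_date('PHISH, SUNDAY 10/15/2023')
```

Storing `date` objects rather than strings also lets the shows be sorted chronologically without further conversion.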

In [20]:
venues = phish_soup.find_all('div', class_='setlist-venue')

# venue_strings = [venue.text.strip().title() for venue in venues]
venue_names = [venue.find('span').text.strip() for venue in venues]

venue_names

['UNITED CENTER',
 'UNITED CENTER',
 'UNITED CENTER',
 'ERVIN J. NUTTER CENTER, WRIGHT STATE UNIVERSITY',
 'ERVIN J. NUTTER CENTER, WRIGHT STATE UNIVERSITY',
 'BRIDGESTONE ARENA',
 'BRIDGESTONE ARENA',
 'BRIDGESTONE ARENA',
 "DICK'S SPORTING GOODS PARK",
 "DICK'S SPORTING GOODS PARK",
 "DICK'S SPORTING GOODS PARK",
 "DICK'S SPORTING GOODS PARK",
 'BROADVIEW STAGE AT SPAC',
 'BROADVIEW STAGE AT SPAC',
 'MADISON SQUARE GARDEN',
 'MADISON SQUARE GARDEN',
 'MADISON SQUARE GARDEN',
 'MADISON SQUARE GARDEN',
 'MADISON SQUARE GARDEN',
 'MADISON SQUARE GARDEN',
 'MADISON SQUARE GARDEN',
 'TD PAVILION AT THE MANN',
 'TD PAVILION AT THE MANN',
 "ST. JOSEPH'S HEALTH AMPHITHEATER AT LAKEVIEW",
 'THE PAVILION AT STAR LAKE',
 'THE PAVILION AT STAR LAKE',
 'LIVE OAK BANK PAVILION AT RIVERFRONT PARK',
 'LIVE OAK BANK PAVILION AT RIVERFRONT PARK',
 'AMERIS BANK AMPHITHEATRE',
 'AMERIS BANK AMPHITHEATRE',
 'AMERIS BANK AMPHITHEATRE',
 'ORION AMPHITHEATER',
 'ORION AMPHITHEATER',
 'HOLLYWOOD BOWL',
 'HOL
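
The venue names come back in upper case. `str.title()` is the obvious fix, but it treats an apostrophe as a word boundary, so `"DICK'S"` becomes `"Dick'S"`. One possible alternative, sketched below, capitalizes each whitespace-separated word instead. The function name `normalize_venue` and the `ACRONYMS` set are assumptions for illustration, not part of the original pipeline.

```python
# Acronyms to leave in upper case; this set is an assumption for illustration.
ACRONYMS = {'SPAC', 'TD'}

def normalize_venue(name):
    """Title-case a venue name word by word, keeping known acronyms."""
    words = name.strip().split()
    # str.capitalize() upper-cases only the first character of each word,
    # so "DICK'S" becomes "Dick's" rather than "Dick'S".
    fixed = [word if word in ACRONYMS else word.capitalize() for word in words]
    return ' '.join(fixed)
```

This sketch still lower-cases the second half of hyphenated names and does not recognize an acronym followed by punctuation, so real data may need further rules.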

In [13]:
venues

[<div class="setlist-venue">
 <a href="/venue/1587/United_Center"><span class="hideunder768">UNITED CENTER</span><span class="hideover768">UNITED CENTER</span></a>
 </div>,
 <div class="setlist-venue">
 <a href="/venue/1587/United_Center"><span class="hideunder768">UNITED CENTER</span><span class="hideover768">UNITED CENTER</span></a>
 </div>,
 <div class="setlist-venue">
 <a href="/venue/1587/United_Center"><span class="hideunder768">UNITED CENTER</span><span class="hideover768">UNITED CENTER</span></a>
 </div>,
 <div class="setlist-venue">
 <a href="/venue/526/Ervin_J._Nutter_Center%2C_Wright_State_University"><span class="hideunder768">ERVIN J. NUTTER CENTER, WRIGHT STATE UNIVERSITY</span><span class="hideover768">ERVIN J. NUTTER CENTER, WRIGHT STATE UNIVERSITY</span></a>
 </div>,
 <div class="setlist-venue">
 <a href="/venue/526/Ervin_J._Nutter_Center%2C_Wright_State_University"><span class="hideunder768">ERVIN J. NUTTER CENTER, WRIGHT STATE UNIVERSITY</span><span class="hideover76

In [6]:
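locations = phish_soup.find_all('div', class_='setlist-location')

# Keep only the visible text of each location element
locations = [location.text.strip() for location in locations]

# City is the text before the first comma
cities = [location.split(',')[0].title() for location in locations]
# State is taken as the last two characters, which assumes every location
# ends in a two-letter state code (not true for shows outside the US)
states = [location[-2:].upper() for location in locations]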

print(cities[0:5])
print(states[0:5])
    

['Chicago', 'Chicago', 'Chicago', 'Dayton', 'Dayton']
['IL', 'IL', 'IL', 'OH', 'OH']
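
Taking the last two characters assumes every location ends in a two-letter state code. For a show outside the US, the slice returns the tail of whatever field comes last, which would explain a value such as `CO` for Cancun if the raw string ends in "Mexico". A hedged sketch of a stricter split follows. The function name `split_location` is hypothetical, and the comma-separated input format is an assumption based on the output above.

```python
def split_location(location):
    """Split a comma-separated location into (city, state_code_or_None)."""
    parts = [part.strip() for part in location.split(',')]
    city = parts[0].title()
    last = parts[-1]
    # Accept the final field as a state code only if it is two letters long.
    is_state_code = len(parts) > 1 and len(last) == 2 and last.isalpha()
    state = last.upper() if is_state_code else None
    return city, state
```

Returning `None` for non-US shows keeps bad state codes out of the State column and makes those rows easy to find later.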


In [7]:
# set_list_notes = phish_soup.find_all('div', class_='setlist-notes')

# set_list_notes = [note.text.strip() for note in set_list_notes]
# set_list_notes

In [8]:
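# venue_names (from the venue cell above) is upper-case, so title-case it here;
# the replace() undoes str.title() capitalizing the "S" after an apostrophe
phish_p1_df = pd.DataFrame({
    'Date': cleaned_date_strings,
    'Venue': [name.title().replace("'S", "'s") for name in venue_names],
    'City': cities,
    'State': states
})
phish_p1_df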

| | Date | Venue | City | State |
|---|---|---|---|---|
| 0 | 10/15/2023 | United Center | Chicago | IL |
| 1 | 10/14/2023 | United Center | Chicago | IL |
| 2 | 10/13/2023 | United Center | Chicago | IL |
| 3 | 10/11/2023 | Ervin J. Nutter Center, Wright State University | Dayton | OH |
| 4 | 10/10/2023 | Ervin J. Nutter Center, Wright State University | Dayton | OH |
| ... | ... | ... | ... | ... |
| 88 | 02/26/2022 | Moon Palace | Cancun | CO |
| 89 | 02/25/2022 | Moon Palace | Cancun | CO |
| 90 | 02/24/2022 | Moon Palace | Cancun | CO |
| 91 | 02/23/2022 | Moon Palace | Cancun | CO |


In [24]:
url2 = 'https://phish.net/setlists/?year='
all_years_dates = []
all_venues = []

# Get the current year
current_year = datetime.now().year

# Loop from 1982 to the current year
for year in range(1982, current_year + 1):
    
    year_url = url2 + str(year)
    
    response = requests.get(year_url)

    phish_soup = BeautifulSoup(response.text, 'html.parser')
    
    # Dates
    dates = phish_soup.find_all('span', class_='setlist-date')

    date_strings = [date.text[-11:] for date in dates]

    cleaned_date_strings = [date.strip() for date in date_strings]
    
    all_years_dates.extend(cleaned_date_strings)
    
    # Venues
    venues = phish_soup.find_all('div', class_='setlist-venue')

    venue_strings = [venue.find('span').text.strip().title().replace("'S", "'s") for venue in venues]
    
    all_venues.extend(venue_strings)
    
all_venues[50:]

['Goddard College',
 'Memorial Auditorium Basement',
 'University Of Vermont',
 'University Of Vermont',
 'University Of Vermont',
 "Hunt's",
 "Hunt's",
 'Slade Hall, University Of Vermont',
 'Memorial Auditorium',
 "Hunt's",
 "Nectar's",
 "Nectar's",
 "Nectar's",
 "Hunt's",
 "Hunt's",
 "Nectar's",
 "Nectar's",
 "Nectar's",
 'Sculpture Room, Goddard College',
 "Cork's",
 "Cork's",
 "Hunt's",
 "Nectar's",
 "Nectar's",
 "Hunt's",
 'The Ranch',
 'St. Lawrence University',
 'Odd Fellows Hall',
 "Nectar's",
 "Nectar's",
 'Goddard College',
 'Goddard College',
 "Hunt's",
 "Hunt's",
 'The Ranch',
 'Vergennes Day',
 "Ian Mclean's Farm",
 "Nectar's",
 "Nectar's",
 'The Ranch',
 'Haybarn Theater, Goddard College',
 "Nectar's",
 "Nectar's",
 "Nectar's",
 "Nectar's",
 "Nectar's",
 'Billings Lounge, University Of Vermont',
 "Nectar's",
 "Nectar's",
 'Goddard College',
 'Slade Hall, University Of Vermont',
 'Slade Hall, University Of Vermont',
 'Johnson State College',
 "Mad Maggie's Farm",
 "Nectar

In [10]:
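# A notebook cell displays only its last expression, so the output below
# is len(all_venues); the value of the first line is not shown.
len(all_years_dates)
len(all_venues)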

2112
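
Because the dates and venues are collected into separate lists, they should have the same length before they are combined. `pd.DataFrame` raises an error when a dict of lists has mismatched lengths, but without saying which column is off. The sketch below makes that check explicit. The helper name `check_aligned` is hypothetical.

```python
def check_aligned(**columns):
    """Return the shared length of the named lists, or raise ValueError."""
    lengths = {name: len(values) for name, values in columns.items()}
    # More than one distinct length means at least one list is misaligned.
    if len(set(lengths.values())) > 1:
        raise ValueError(f'column lengths differ: {lengths}')
    return next(iter(lengths.values()), 0)
```

A call such as `check_aligned(dates=all_years_dates, venues=all_venues)` would either return the common row count or report each list's length in the error message.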

## Part 2: Transform
#### Cleaning and Formatting Data