# Concert Festival - Data Collection

<b> Author: </b> Derek A Maier | maierd@canisius.edu

<hr>

## Scope:

The scope of this workbook includes the following:

- Data Collection from the Concert Festival data site
- Data Manipulation / Cleansing for Analysis
- Storage of the data collection above

<hr> 

## Assumptions: 

The development and execution of this workbook / analysis is based on the following assumptions (<i>check marks indicate the assumption has been met</i>):

- <b> ✓ </b> Ethical ability to scrape the Concert Festival Site
- <b> ✓ </b> Data availability of historical concert information for certain artists 

<hr>

## References:

The data was provided by the following base site: https://www.concertarchives.org/

The artist used for the dashboard analysis was Martin Garrix, whose base URL is the following: https://www.concertarchives.org/bands/martin-garrix?page=1#concert-table

<hr>

## 1.a. Import Libraries & Set Global Variables

In [2]:
# Imports
import os
import pickle
import pandas as pd
from datetime import datetime

from dam.scrape import Engine # NOTE: This is a custom package I created for ethical hacking, simply replace with a GET request
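
Since `Engine` comes from a private package, its internals are not shown in this workbook. As a rough stand-in, the sketch below issues a plain GET request using only the standard library. The function name `fetch_html`, the User-Agent string, and the 5-second default timeout (borrowed from the timeout message in the collection log of this workbook) are my assumptions, not the behavior of `Engine.soup_request`. The returned HTML string would still need to be passed to an HTML parser such as BeautifulSoup to get a `soup` object.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=5):
    """Return the decoded body of a GET request to `url` (hypothetical helper)."""
    # Some sites reject requests that carry no User-Agent; this value is a placeholder
    request = Request(url, headers={"User-Agent": "concert-workbook-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

A network failure raises an exception here rather than returning `None`, so any caller would need its own error handling.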

In [3]:
# Global variables
data_dir = 'data/'
concert_collection_file = os.path.join(data_dir, 'martin_garrix_concert_collection.xlsx') # This will be our flat concert dataset
concert_set_collection_file = os.path.join(data_dir, 'martin_garrix_concert_set_collection.xlsx') # This will be our xref table for looking up concert sets to the concert ID in the dataset above
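
The paths above assume that the `data/` folder already exists. As far as I know, `to_excel` does not create missing folders, so a minimal safeguard (my addition, not part of the original run) is to create the folder up front:

```python
import os

data_dir = 'data/'

# exist_ok=True makes this safe to re-run when the folder is already there
os.makedirs(data_dir, exist_ok=True)
```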


## 1.b. Utility Functions to Parse the Target Site(s)

In [63]:
# Create all utility functions for parsing the desired content from our target site

# Step 1. Gather all concert specific data 
def parse_concert_details(url):

    """
    Description:
    ------------
    Parses the individual concert details from one page of an artist's concert table
    
    Params:
    ------------
    url : (str)
        The paginated artist page URL to scrape
        
    Returns
    ------------
    data : (list)
        A list of dictionaries (one per table row) with key:value pairs of the parsed data elements;
        if the request or parsing fails, only the rows parsed so far are returned (possibly none)
    
    """
     
    base = "https://www.concertarchives.org" # Site root used to build absolute links
    data = []    
    engine = Engine()
    
    try:
        soup = engine.soup_request(url)
        main_div = soup.find("div", attrs={"class": "table-responsive"}) # Main container
        table_div = main_div.find("table") # Table container
        trs = table_div.find_all("tr")[1:] # Table rows, skipping the first row
        
        for tr in trs:
            tds = tr.find_all("td") # Cells: date, concert, venue, location
            venue_a = tds[2].find('a') # Venue and location links are optional
            location_a = tds[3].find('a')
            data.append({
                'concert_date': tds[0].text.replace('\n', '').strip(),
                'concert_name': tds[1].find('strong').text.replace('\n', '').strip(),
                'concert_link': base + tds[1].find('a')['href'],
                'concert_venue_name': tds[2].text.replace('\n', '').strip(),
                'concert_venue_link': base + venue_a['href'] if venue_a is not None else '',
                'concert_location_name': tds[3].text.replace('\n', '').strip(),
                'concert_location_link': base + location_a['href'] if location_a is not None else ''
            })

    except Exception:
        # Best-effort: on a failed request or an unexpected page layout, keep the rows parsed so far
        pass
    
    return data
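
The parsing function above builds absolute links by concatenating the site root with each `href`. A slightly more forgiving alternative is `urllib.parse.urljoin`, which also copes with an `href` that is already absolute. This is only a sketch: `absolute_link` and `BASE_URL` are names I made up for illustration, and it assumes the scraped `href` values are site-relative paths such as `/venues/cluj-arena`.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.concertarchives.org"  # Site root from the References section


def absolute_link(href):
    """Return an absolute URL for a scraped href, or '' when the link is missing (hypothetical helper)."""
    if not href:
        return ''
    return urljoin(BASE_URL, href)


venue_link = absolute_link('/venues/cluj-arena')
missing_link = absolute_link(None)
```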

## 1.c. Gather & Dump the Concert Collection Data to Disk

<b> Note: </b> Because the base site data is paginated, we will loop through all of the base URL pages and run the parsing function on each one

In [9]:
# Artist of Interest
artist = "martin-garrix" # This represents the format of the artist in the URL string

# Total Pages for Martin Garrix
total_pages = 11

# Base URLs to Scrape
base_urls = [f'https://www.concertarchives.org/bands/{artist}?page={i}#concert-table' for i in range(1, total_pages+1)]

In [14]:
print('The following links will be passed through our parsing function and appended to a master list of results\n' + '='*105)
base_urls

The following links will be passed through our parsing function and appended to a master list of results


['https://www.concertarchives.org/bands/martin-garrix?page=1#concert-table',
 'https://www.concertarchives.org/bands/martin-garrix?page=2#concert-table',
 'https://www.concertarchives.org/bands/martin-garrix?page=3#concert-table',
 'https://www.concertarchives.org/bands/martin-garrix?page=4#concert-table',
 'https://www.concertarchives.org/bands/martin-garrix?page=5#concert-table',
 'https://www.concertarchives.org/bands/martin-garrix?page=6#concert-table',
 'https://www.concertarchives.org/bands/martin-garrix?page=7#concert-table',
 'https://www.concertarchives.org/bands/martin-garrix?page=8#concert-table',
 'https://www.concertarchives.org/bands/martin-garrix?page=9#concert-table',
 'https://www.concertarchives.org/bands/martin-garrix?page=10#concert-table',
 'https://www.concertarchives.org/bands/martin-garrix?page=11#concert-table']

In [66]:
# Master Results List 
concert_collection = []

# Loop through each URL and append results to master
for idx,url in enumerate(base_urls):
    # This is where our parsing function will go
    print(f'Collection Details for URL No. {idx} of {total_pages}')
    concert_collection += parse_concert_details(url)

Collection Details for URL No. 0 of 11
Collection Details for URL No. 1 of 11
Collection Details for URL No. 2 of 11
Collection Details for URL No. 3 of 11
Collection Details for URL No. 4 of 11
Collection Details for URL No. 5 of 11
Collection Details for URL No. 6 of 11
Collection Details for URL No. 7 of 11
Collection Details for URL No. 8 of 11
Collection Details for URL No. 9 of 11
Collection Details for URL No. 10 of 11
Timeout Error: HTTPSConnectionPool(host='www.concertarchives.org', port=443): Read timed out. (read timeout=5)
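
The log above ends with a read timeout. Because `parse_concert_details` returns an empty list when a request fails, one option is to retry a page when nothing comes back. The sketch below is an assumption on my part and was not used for the run shown here: `call_with_retries`, its `attempts` and `delay` parameters, and the `flaky_parse` stand-in are all hypothetical.

```python
import time


def call_with_retries(func, *args, attempts=3, delay=1.0):
    """Call func(*args) until it returns a non-empty result or the attempts run out."""
    result = []
    for attempt in range(1, attempts + 1):
        result = func(*args)
        if result:
            return result
        if attempt < attempts:
            time.sleep(delay * attempt)  # Wait a little longer after each failed try
    return result


# Stand-in for parse_concert_details: returns nothing twice, then one row
calls = {'n': 0}

def flaky_parse(url):
    calls['n'] += 1
    return [] if calls['n'] < 3 else [{'concert_link': url}]

rows = call_with_retries(flaky_parse, 'page-11', delay=0)
```

In the collection loop this would read `concert_collection += call_with_retries(parse_concert_details, url)`. Note that a page which is genuinely empty would also be retried, so the number of attempts should stay small.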


In [68]:
print(f'The total number of records of this dataset: {len(concert_collection)}')
concert_collection

The total number of records of this dataset: 500


[{'concert_date': 'Jul 30, 2020–Aug 02, 2020',
  'concert_name': 'Untold Festival 2020',
  'concert_link': 'https://www.concertarchives.org/concerts/untold-festival-2020',
  'concert_venue_name': 'Cluj Arena',
  'concert_venue_link': 'https://www.concertarchives.org/venues/cluj-arena',
  'concert_location_name': 'Cluj-Napoca, Romania',
  'concert_location_link': 'https://www.concertarchives.org/locations/cluj-napoca-romania'},
 {'concert_date': 'Feb 23, 2020–Feb 23, 2020',
  'concert_name': 'Carnaval do B.E.M 2020',
  'concert_link': 'https://www.concertarchives.org/concerts/carnaval-do-b-e-m-2020',
  'concert_venue_name': 'Estádio do Mineirão',
  'concert_venue_link': 'https://www.concertarchives.org/venues/estadio-do-mineirao-e9b6b8e4-9c91-48e0-882a-6e8b9e425ad5',
  'concert_location_name': 'Belo Horizonte, Brazil',
  'concert_location_link': 'https://www.concertarchives.org/locations/belo-horizonte-brazil'},
 {'concert_date': 'Feb 22, 2020–Feb 23, 2020',
  'concert_name': 'Carnaval 

## 1.e. Convert to DataFrame and Dump to Disk

In [70]:
# Convert to DataFrame
df = pd.DataFrame(concert_collection)

df.head()

| | concert_date | concert_name | concert_link | concert_venue_name | concert_venue_link | concert_location_name | concert_location_link |
|---|---|---|---|---|---|---|---|
| 0 | Jul 30, 2020–Aug 02, 2020 | Untold Festival 2020 | https://www.concertarchives.org/concerts/untol... | Cluj Arena | https://www.concertarchives.org/venues/cluj-arena | Cluj-Napoca, Romania | https://www.concertarchives.org/locations/cluj... |
| 1 | Feb 23, 2020–Feb 23, 2020 | Carnaval do B.E.M 2020 | https://www.concertarchives.org/concerts/carna... | Estádio do Mineirão | https://www.concertarchives.org/venues/estadio... | Belo Horizonte, Brazil | https://www.concertarchives.org/locations/belo... |
| 2 | Feb 22, 2020–Feb 23, 2020 | Carnaval Maori 10 anos • Martin Garrix 2020 | https://www.concertarchives.org/concerts/carna... | Maori Beach Club | https://www.concertarchives.org/venues/maori-b... | Porto Alegre, Brazil | https://www.concertarchives.org/locations/port... |
| 3 | Feb 21, 2020 | Carnaval Maori 10 Anos | https://www.concertarchives.org/concerts/carna... | Maori Beach Club | https://www.concertarchives.org/venues/maori-b... | Xangri-lá, Brazil | https://www.concertarchives.org/locations/xang... |
| 4 | Feb 01, 2020 | Martin Garrix | https://www.concertarchives.org/concerts/marti... | DAER Dayclub South Florida | https://www.concertarchives.org/venues/daer-da... | Fort Lauderdale, FL | https://www.concertarchives.org/locations/fort... |
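
The `concert_date` values above come in two shapes: a single date (`Feb 21, 2020`) or a start and end date joined by an en dash (`Jul 30, 2020–Aug 02, 2020`). For the cleansing step in the scope, a minimal sketch of splitting them into start and end dates might look like the following. `split_concert_date` and `DATE_FORMAT` are hypothetical names, and the format string assumes every value follows the pattern seen in these five rows and that month abbreviations are parsed under an English locale.

```python
from datetime import datetime

DATE_FORMAT = '%b %d, %Y'  # Matches values such as 'Feb 21, 2020'


def split_concert_date(value):
    """Return (start, end) datetimes for a single date or an en-dash separated range."""
    parts = [p.strip() for p in value.split('–')]  # '–' is the en dash used in the scraped text
    start = datetime.strptime(parts[0], DATE_FORMAT)
    end = datetime.strptime(parts[-1], DATE_FORMAT)  # Same as start for single dates
    return start, end


festival = split_concert_date('Jul 30, 2020–Aug 02, 2020')
single = split_concert_date('Feb 21, 2020')
```

Applied to the DataFrame, this could populate two new columns, for example via `df['concert_date'].apply(split_concert_date)`.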


In [71]:
# Dump to disk for storage and next stage in the project
df.to_excel(concert_collection_file, index=False)