In [1]:
import pandas as pd
import json
import requests
import math

# SeatGeek

The SeatGeek API is going to provide two key pieces of data: __artist genre__ and __venue score__.

SeatGeek assigns each artist to one of 21 __genres__, some of which contain very few artists. Some genres can be combined into meaningful groupings for modeling. Others have too few artists for a useful model and do not group naturally with other genres, so we'll have to ignore them.
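
As a rough illustration of what such a grouping could look like in pandas, here is a minimal sketch. The mapping, the group names, and the sample rows (including the "Polka" genre) are made up for the example and are not the grouping used for modeling. Genres missing from the mapping come out as NaN and can then be dropped.

```python
import pandas as pd

# Hypothetical mapping from a SeatGeek genre to a coarser modeling group.
GENRE_GROUPS = {
    "Rock": "Rock",
    "Classic Rock": "Rock",
    "Alternative": "Alternative/Indie",
    "Indie": "Alternative/Indie",
    "Pop": "Pop",
}

def add_genre_group(df, mapping=GENRE_GROUPS):
    """Return a copy of df with a 'genre_group' column; unmapped genres become NaN."""
    out = df.copy()
    out["genre_group"] = out["genre"].map(mapping)
    return out

# Made-up sample rows with the same 'name' and 'genre' columns used in this notebook.
sample = pd.DataFrame({
    "name": ["Band A", "Band B", "Band C"],
    "genre": ["Classic Rock", "Indie", "Polka"],
})
grouped = add_genre_group(sample)

# Keep only the artists whose genre falls into one of the groups.
usable = grouped.dropna(subset=["genre_group"])
```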

SeatGeek assigns each venue a __score__ -- SeatGeek's API documentation describes the scores as follows:

_Most document types include a score field. The score field is used to indicate the document's relative popularity within its type. score is a floating point value in 0 <= score <= 1._

_Currently score values for events, venues and performers are based on estimated sales volume on the secondary ticket market (normalized such that the most popular document has a score of 1)._

SeatGeek's API also lets the user pull the full list of artists in its database with a single paginated query. Because the Bandsintown API does not allow this (more on this later), I'm going to use the SeatGeek artist list as our master list and use it to query the Bandsintown API. Therefore, I'll start with the full list of SeatGeek artists, including genre.

__Note that my SeatGeek client_id is redacted from this file.__

## SeatGeek Artist List, Including Genre

As described above, I'm going to pull down a list of every artist in the SeatGeek database. Because SeatGeek includes non-musical events (such as sports), I'll limit this artist list to only those with an associated genre -- non-musical artists do not have an assigned genre in the SeatGeek database.
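
To make the genre filter concrete, here is a small offline sketch. The `performers` list is made-up data that only mimics the fields this notebook reads (`name`, `genres`, and a `primary` flag on each genre); it is not real API output.

```python
def primary_genre(performer):
    """Return the name of the genre flagged as primary, or None if there is none."""
    for genre in performer.get("genres") or []:
        if genre.get("primary"):
            return genre.get("name")
    return None

# Made-up records shaped like the fields this notebook reads.
performers = [
    {"name": "Example Band",
     "genres": [{"name": "Rock", "primary": False},
                {"name": "Indie", "primary": True}]},
    {"name": "Example Sports Team"},             # no 'genres' key at all
    {"name": "Example Comedian", "genres": []},  # empty genre list
]

# Keep only performers that have at least one genre attached.
musical = [(p["name"], primary_genre(p)) for p in performers if p.get("genres")]
print(musical)  # [('Example Band', 'Indie')]
```

The `or []` guard means a performer without a `genres` key returns `None` rather than raising an error.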

### Artist genre helper functions

In [2]:
def get_sg_bands_pagesnum():
    """
    Queries the SeatGeek API to request the full list of performers in the database,
    then determines the number of pages in the JSON file returned by the API call,
    with 1000 performers per page. This is necessary for iterating through the returned
    JSON file page-by-page to extract band information. Note that the value of 1000
    performers per page is hard coded.
    Output: the number of pages in the returned JSON file (1000 items per page)
    """
    items_per_page = 1000
    url = 'https://api.seatgeek.com/2/performers?format=json'
    payload = {'per_page' : items_per_page,
               'type' : 'band',
               'client_id' : 'XXXX',
              }
    r = requests.get(url, params=payload,verify=True)
    json_obj = json.loads(r.text)
    return math.ceil(json_obj['meta']['total']/json_obj['meta']['per_page'])

def json_primary_genre(band):
    """
    The JSON file returned by the SeatGeek API can contain multiple genres for one performer,
    however, only one genre is marked as the primary genre for that performer.
    Output: the primary genre in the SeatGeek database for a given band.
    """
    for item in band.get("genres"):
        if item.get('primary'):
            return item.get('name')
        
def get_sg_bands_page(page=1):
    """
    Requests a single page of results from the SeatGeek API query for the list of performers
    and iterates through the JSON file to extract information about each band.
    Because the SeatGeek database includes non-musical acts, this function only pulls the
    information for performers with an associated genre. Note that the value of 1000 performers 
    per page is hard coded.
    Output: a Pandas dataframe with each band's name, SeatGeek ID, genre, and SeatGeek score
    for a single page of results within the SeatGeek database of bands.
    """
    items_per_page = 1000
    url = 'https://api.seatgeek.com/2/performers?format=json'
    payload = {'per_page' : items_per_page,
               'page' : page,
               'type' : 'band',
               'client_id' : 'XXXX',
              }
    r = requests.get(url, params=payload,verify=True)
    json_obj = json.loads(r.text)
    info_list = []
    for item in json_obj['performers']:
        # This next if statement only pulls in entries with associated genres, which excludes other events
        # that for some reason come up as 'bands' in this search.
        if item.get('genres'):
            info_list.append(
             {'name' : str(item.get('name',{})),
              'id' : str(item.get('id',{})),
              'genre' : str(json_primary_genre(item)),
              'score' : str(item.get('score',{}))
                               })
    bands_df = pd.DataFrame(info_list)
    return bands_df

def get_all_sg_bands():
    """
    Checks for the number of pages returned by a SeatGeek API query for the full list of
    performers in the SeatGeek database, then iterates through the results, page by page.
    Output: a Pandas dataframe with each band's name, SeatGeek ID, genre, and SeatGeek score
    for the full SeatGeek database of bands.
    """
    band_list = []
    total_pages = get_sg_bands_pagesnum()
    # Loop through the pages of data and combine them into a single dataframe
    for pageNum in range(1,total_pages+1):
        band_list.append(get_sg_bands_page(page=pageNum))
    bands_df = pd.concat(band_list,axis=0)
    return bands_df

### Pull and save SeatGeek artist list

In [3]:
sg_bands_df = get_all_sg_bands()
print(sg_bands_df.shape)

(44264, 4)


In [4]:
# Drop duplicates
sg_bands_df.drop_duplicates(subset='name',inplace=True)
print(sg_bands_df.shape)
sg_bands_df.head()

(39257, 4)


| | genre | id | name | score |
|---|---|---|---|---|
| 0 | Pop | 35 | Taylor Swift | 0.89 |
| 1 | Classic Rock | 2597 | The Rolling Stones | 0.89 |
| 2 | Pop | 13546 | Ed Sheeran | 0.88 |
| 3 | Classic Rock | 696 | Eric Clapton | 0.88 |
| 4 | Pop | 16709 | Charli XCX | 0.87 |


In [None]:
sg_bands_df.to_csv('sg_bands.csv')

## SeatGeek Genre List

For future reference, I'll pull the official list of genres from the SeatGeek API rather than rely on the genres within the artist list I've pulled in case there are any discrepancies.
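
One way to check for discrepancies is a set comparison between the genres seen in the artist list and the official list. The sketch below uses made-up lists in place of `sg_bands_df['genre']` and the dataframe returned by `get_seatgeek_genrelist()`. It skips the literal string `'None'`, which is what `str()` produces in `get_sg_bands_page` when a performer has no genre marked as primary.

```python
def genre_discrepancies(artist_genres, official_genres):
    """Return (genres only in the artist list, genres only in the official list)."""
    seen = {g for g in artist_genres if g and g != "None"}
    official = set(official_genres)
    return sorted(seen - official), sorted(official - seen)

# Stand-in values; in the notebook these would come from the two dataframes.
artist_genres = ["Pop", "Rock", "Rock", "None", "Classic Rock"]
official_genres = ["Pop", "Rock", "Jazz"]

only_in_artists, only_in_official = genre_discrepancies(artist_genres, official_genres)
print(only_in_artists)   # ['Classic Rock']
print(only_in_official)  # ['Jazz']
```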

In [5]:
def get_seatgeek_genrelist():
    """Queries the SeatGeek API to return a dataframe of possible artist genres."""
    r = requests.get('https://api.seatgeek.com/2/genres?client_id=XXXX')
    json_obj = json.loads(r.text)
    info_list = []
    for item in json_obj['genres']:
        info_list.append(
         {'genre' : str(item.get('name',{}))     
                               })
    genre_df = pd.DataFrame(info_list)
    return genre_df

In [6]:
get_seatgeek_genrelist()

| | genre |
|---|---|
| 0 | Country |
| 1 | Pop |
| 2 | Rock |
| 3 | Alternative |
| 4 | Indie |
| 5 | Punk |
| 6 | Blues |
| 7 | Soul |
| 8 | Folk |
| 9 | Jazz |

## SeatGeek Venue Scores

The other piece of information I need from SeatGeek is the __venue score__, which, as described above, is a measure of relative popularity. As with the artist list, the SeatGeek API lets me pull down every venue in its database, along with its score, with a single paginated query.
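
The paging logic shared by the performer and venue pulls can be sketched without touching the network. In the sketch below, `fetch_page` is a hypothetical callable standing in for the `requests.get` call, and `fake_fetch` returns made-up data. The only assumption about the response is the `meta.total` and `meta.per_page` fields that the helper functions in this notebook already read.

```python
import math

def fetch_all(fetch_page, key):
    """Collect the items stored under `key` from every page of `fetch_page(page)`."""
    first = fetch_page(1)
    meta = first["meta"]
    total_pages = math.ceil(meta["total"] / meta["per_page"])
    items = list(first[key])
    for page in range(2, total_pages + 1):
        items.extend(fetch_page(page)[key])
    return items

# Fake paged response: 5 venues at 2 per page, so 3 pages in total.
def fake_fetch(page, per_page=2):
    data = [{"id": i} for i in range(5)]
    start = (page - 1) * per_page
    return {"meta": {"total": len(data), "per_page": per_page},
            "venues": data[start:start + per_page]}

venues = fetch_all(fake_fetch, "venues")
print(len(venues))  # 5
```

This variant reuses the first page to read the page count instead of issuing a separate request just to count pages, which is a design choice rather than anything the API requires.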

### Venue score helper functions

In [7]:
def get_sg_venues_pagesnum():
    """
    Queries the SeatGeek API to request the full list of venues in the database,
    then determines the number of pages in the JSON file returned by the API call,
    with 1000 venues per page. This is necessary for iterating through the returned
    JSON file page-by-page to extract venue information. Note that the value of 1000
    venues per page is hard coded.
    Output: the number of pages in the returned JSON file (1000 items per page)
    """
    items_per_page = 1000
    url = 'https://api.seatgeek.com/2/venues?format=json'
    payload = {'per_page' : items_per_page,
               'client_id': 'XXXX',
              }
    r = requests.get(url, params=payload,verify=True)
    json_obj = json.loads(r.text)
    return math.ceil(json_obj['meta']['total']/json_obj['meta']['per_page'])

def get_sg_venues_page(page=1):
    """
    Requests a single page of results from the SeatGeek API query for the list of venues
    and iterates through the JSON file to extract information about each venue. Note that
    the value of 1000 venues per page is hard coded.
    Output: a Pandas dataframe with each venue's SeatGeek ID, name, city, state, country,
    and score, for a single page of results within the SeatGeek database of venues.
    """
    items_per_page = 1000
    url = 'https://api.seatgeek.com/2/venues?format=json'
    payload = {'per_page' : items_per_page,
               'page' : page,
               'client_id' : 'XXXX',
              }
    r = requests.get(url, params=payload,verify=True)
    json_obj = json.loads(r.text)
    info_list = []
    for item in json_obj['venues']:
        info_list.append(
         {'sg_venue_id' : str(item.get('id',{})),
         'venue_name' : str(item.get('name',{})),
         'venue_city' : str(item.get('city',{})),
         'venue_state' : str(item.get('state',{})),
         'venue_score' : str(item.get('score',{})),
         'venue_country' : str(item.get('country',{})), 
                               })    
    venues_df = pd.DataFrame(info_list)
    return venues_df

def get_all_sg_venues():
    """
    Checks for the number of pages returned by a SeatGeek API query for the full list of
    venues in the SeatGeek database, then iterates through the results, page by page.
    Output: a Pandas dataframe with each venue's SeatGeek ID, name, city, state, country,
    and score, for the full SeatGeek database of venues.
    """
    venue_list = []
    total_pages = get_sg_venues_pagesnum()
    # Loop through the pages of data and combine them into a single dataframe
    for pageNum in range(1,total_pages+1):
        venue_list.append(get_sg_venues_page(page=pageNum))
    venues_df = pd.concat(venue_list,axis=0)
    return venues_df

### Pull and save SeatGeek venue list

In [8]:
sg_venues_df = get_all_sg_venues()
print(sg_venues_df.shape)
sg_venues_df.head()

(391237, 6)


In [None]:
sg_venues_df.to_csv('sg_venues.csv')

# Bandsintown

The Bandsintown API provides information about gigs played by any given artist. I'll run two sets of queries: first, a __list of gigs (by artist)__ for all of the artists I got from SeatGeek; second, __Bandsintown artist information__ for each SeatGeek artist, so that I can merge the artist and gig information on each artist's Bandsintown ID.
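
The merge itself happens in the next notebook, but a minimal sketch of joining on the Bandsintown ID might look like the following. The rows are made up; only the column names match those produced by the helper functions in this notebook, and the choice of a left join is my assumption.

```python
import pandas as pd

# Made-up rows using the column names produced by the helper functions.
gigs = pd.DataFrame({
    "BIT_event_id": ["e1", "e2", "e3"],
    "BIT_artist_id": ["10", "10", "20"],
    "BIT_venue_name": ["Venue A", "Venue B", "Venue C"],
})
artists = pd.DataFrame({
    "BIT_artist_id": ["10", "20"],
    "BIT_artist_name": ["Example Band", "Another Band"],
    "BIT_tracker_count": ["1500", "300"],
})

# A left join keeps every gig and attaches artist info where the ID matches.
merged = gigs.merge(artists, on="BIT_artist_id", how="left")
print(merged.shape)  # (3, 5)
```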

Unlike SeatGeek, Bandsintown does not provide a full list of artists or gigs with a single API call; for that reason I need to call it individually for each artist in the SeatGeek artist list, by name. This takes quite a bit of time (12-24 hours for each set of API calls on my MacBook).
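
Because each artist name is placed directly into the URL path, names containing characters such as `/` or `+` can change the meaning of the request. The sketch below shows standard percent-encoding with the standard library. `artist_events_url` is a hypothetical helper, and whether Bandsintown expects exactly this escaping for special characters is something I have not verified, so it should be checked against their API documentation.

```python
from urllib.parse import quote

def artist_events_url(name, app_id="XXXX"):
    """Build an events URL with the artist name percent-encoded as one path segment."""
    return "https://rest.bandsintown.com/artists/{}/events?app_id={}&date=all".format(
        quote(name, safe=""), app_id)

# safe="" makes quote() encode '/' too, so the name stays a single path segment.
print(artist_events_url("AC/DC"))
# https://rest.bandsintown.com/artists/AC%2FDC/events?app_id=XXXX&date=all
```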

__Note that my Bandsintown app_id is redacted from this file.__

## Bandsintown Artist Events (Gigs)

I'll take the full list of SeatGeek artists as derived above and query the Bandsintown API to get the full set of events in their database for that artist. Unfortunately this needs to be done individually for each artist.

### Artist gig list helper function

In [None]:
def get_all_artist_events(artist_list):
    """
    Takes a list of artists and queries the Bandsintown API to get the list of events for each
    artist in the Bandsintown database.
    Output: a Pandas dataframe with all gigs associated with artists in the input list, including
    the gig's Bandsintown event ID, associated artist ID, event date, and venue name, city, state,
    and country.
    """
    gigs_list = []
    for band in artist_list:    
        r = requests.get('https://rest.bandsintown.com/artists/{}/events?app_id=XXXX&date=all'.format(band))
        # Processing the returned JSON file results in an error if the band is not found.
        try:
            json_obj = json.loads(r.text)
            show_list = []
            for show in json_obj:
                show_list.append(
                 {'BIT_event_id' : str(show.get('id',{})),
                 'BIT_artist_id' : str(show.get('artist_id',{})),
                 'BIT_event_date' : str(show.get('datetime',{})),
                 'BIT_venue_country' : str(show.get('venue').get('country',{})),
                 'BIT_venue_city' : str(show.get('venue').get('city',{})),
                 'BIT_venue_name' : str(show.get('venue').get('name',{})),         
                 'BIT_venue_state' : str(show.get('venue').get('region',{}))        
                                   })
            show_df = pd.DataFrame(show_list)
            gigs_list.append(show_df)
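        # Skip the band if it is not found in the Bandsintown database.
        # Catching Exception (not a bare except) keeps Ctrl-C working during long runs.
        except Exception:
            continue
    # Combine the per-artist dataframes only after every artist has been processed.
    all_gigs = pd.concat(gigs_list,axis=0)
    return all_gigs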

### Pull and save Bandsintown list of gigs by artist

In [None]:
SG_artist_list = sg_bands_df['name'].tolist()

In [None]:
all_gigs_df = get_all_artist_events(SG_artist_list)
print(all_gigs_df.shape)
all_gigs_df.head()

In [None]:
all_gigs_df.to_csv('all_gigs.csv')

## Bandsintown artist information

In addition to the gig list, we can pull additional information for each artist from Bandsintown, including their tracker count (the number of Bandsintown users following that band, which should be a decent measure of popularity).
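
Since the helper function in this section stores every field as a string, the tracker count needs converting before it can be used as a numeric measure of popularity. A minimal sketch with made-up rows: a missing value ends up as the literal string `'{}'` because of the `{}` default passed to `.get()`, and `errors='coerce'` turns it into NaN.

```python
import pandas as pd

# Made-up rows; a missing tracker count shows up as the string '{}'.
info = pd.DataFrame({
    "BIT_artist_name": ["Example Band", "Another Band"],
    "BIT_tracker_count": ["1500", "{}"],
})

# errors='coerce' turns anything non-numeric into NaN instead of raising.
info["BIT_tracker_count"] = pd.to_numeric(info["BIT_tracker_count"], errors="coerce")
```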

### Bandsintown artist information helper function

In [None]:
def get_all_artist_info(artist_list):
    """
    Takes a list of artists and queries the Bandsintown API to get information about each artist
    from the Bandsintown database.
    Output: a Pandas dataframe with the artist's name, Bandsintown artist ID, and tracker count
    (the number of Bandsintown users following the artist).
    """
    artist_info_list = []
    for band in artist_list:
        r = requests.get('https://rest.bandsintown.com/artists/{}?app_id=XXXX&date=all'.format(band))
        # Processing the returned JSON file results in an error if the band is not found.
        try:
            json_obj = json.loads(r.text)
            info_list = []
            info_list.append(
                {'BIT_artist_name' : str(json_obj.get('name',{})),
                 'BIT_tracker_count' : str(json_obj.get('tracker_count',{})),
                 'BIT_artist_id' : str(json_obj.get('id',{}))      
                                       })
            artist_df = pd.DataFrame(info_list)
            artist_info_list.append(artist_df)
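        # Skip the band if it is not found in the Bandsintown database.
        # Catching Exception (not a bare except) keeps Ctrl-C working during long runs.
        except Exception:
            continue
    # Combine the per-artist dataframes only after every artist has been processed.
    all_artist_info = pd.concat(artist_info_list,axis=0)
    return all_artist_info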

### Pull and save Bandsintown artist information

In [None]:
artist_info_df = get_all_artist_info(SG_artist_list)
print(artist_info_df.shape)
artist_info_df.head()

In [None]:
artist_info_df.to_csv('BIT_artist_info.csv')

# Next step: merging files

Please see the next notebook in the series for code that merges the information from these four files into a single file for modeling.