# Enrich Raw Input Data

The data pulled from the BigQuery ListenBrainz dataset is sparse: it contains only the artist's name, the user's name, and the number of listens that user has for that artist. To train a prediction algorithm, we must enrich the raw input data with descriptors of each artist's music, obtained from third-party data providers, since those descriptors may inform user preference.

We use the [Last FM API](http://last.fm/api/) to get the data for each artist. In our code below we assume that the first matching artist is the correct one.

Our <font color="darkblue"><strong>working hypothesis</strong></font> is that if the user in question does not like any musicians that are tagged as "rock" artists, then they probably won't like The Beatles.

In [None]:
import requests
import pandas as pd

In [None]:
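# Read tab-separated key/value pairs; skip blank lines so a trailing newline does not break dict()
with open('lastfm.conf') as conf_file:
    lastfm_api_keys = dict(line.split('\t', 1) for line in conf_file.read().splitlines() if line.strip())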
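
The config-loading line above implies that `lastfm.conf` holds one tab-separated key/value pair per line, including the `API Root` and `API key` entries used later. As a minimal sketch under that assumption, the parsing could be pulled into a small helper. `parse_conf` is a hypothetical name, and the sample values are placeholders rather than real credentials.

```python
def parse_conf(text: str) -> dict:
    """Parse tab-separated 'key<TAB>value' lines into a dict, skipping blank lines."""
    settings = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline at end of file
        key, value = line.split("\t", 1)  # split on the first tab only
        settings[key.strip()] = value.strip()
    return settings


# Placeholder content; the real file would hold your own API root and key.
sample = "API Root\thttps://example.invalid/api/\nAPI key\tPLACEHOLDER\n"
config = parse_conf(sample)
```

A line without a tab still raises `ValueError` here, which is probably the desired behavior for a malformed config file.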

In [None]:
artists_df = pd.read_feather("input_data/artist_df.feather")

For each artist in our `artists_df` dataframe, we perform the following steps:

1. Search Last FM for the artist name
2. Grab the id for that artist (its MusicBrainz ID, `mbid`)
3. Call `artist.getInfo` to get more data on the artist
4. Get the tags from that response
5. Add each tag as a column and set it to `True` for that artist, meaning we saw that tag for the artist



In [None]:
def get_artist_tags(artist_name: str) -> list:
    """For a given musical artist, search Last FM for its 'tag' data and return the tags as 'tag_*' column names."""
    
    search_response = requests.get(
        url=lastfm_api_keys['API Root'],
        params=dict(
            method="artist.search",
            artist=artist_name,
            api_key=lastfm_api_keys['API key'],
            format="json"
        )
    ).json()
    
    mbid = search_response['results']['artistmatches']['artist'][0]['mbid']  # unique identifier for a particular artist
    
    tag_response = requests.get(
        url=lastfm_api_keys['API Root'],
        params=dict(
            method="artist.getInfo",
            artist=artist_name,
            mbid=mbid,
            api_key=lastfm_api_keys['API key'],
            format="json"
        )
    ).json()
    
    tags = ["tag_{}".format(x['name'].lower().replace("-", " ")) for x in tag_response['artist']['tags']['tag']]
    
    return tags
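
The tag normalization at the end of the function can be tried without any network call. The sketch below applies the same transformation to a hand-written dictionary that mirrors only the keys the function accesses; it is not real API output. `tags_from_info` is a hypothetical helper name.

```python
def tags_from_info(info_response: dict) -> list:
    """Turn an artist.getInfo-style response into normalized 'tag_*' column names."""
    tag_entries = info_response.get("artist", {}).get("tags", {}).get("tag", [])
    return ["tag_{}".format(entry["name"].lower().replace("-", " ")) for entry in tag_entries]


# Hand-written sample that mirrors only the keys accessed above (not real API output).
sample_response = {"artist": {"tags": {"tag": [{"name": "Garage Rock"}, {"name": "Hip-Hop"}]}}}
print(tags_from_info(sample_response))  # ['tag_garage rock', 'tag_hip hop']
```

Lowercasing and replacing hyphens with spaces means variants such as "Hip-Hop" and "hip hop" end up in the same column.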

In [None]:
get_artist_tags("The White Stripes")
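
The function takes the first search result as the correct artist. One possible refinement, sketched here as an assumption rather than something the notebook does, is to prefer a result whose name matches exactly (ignoring case) and fall back to the first result otherwise. `pick_artist_match` is a hypothetical helper; the `name` key and the sample ids are made up for illustration.

```python
def pick_artist_match(matches: list, artist_name: str):
    """Prefer a case-insensitive exact name match; otherwise fall back to the first result."""
    if not matches:
        return None  # nothing to choose from
    wanted = artist_name.strip().lower()
    for match in matches:
        if match.get("name", "").strip().lower() == wanted:
            return match
    return matches[0]


# Hand-written sample; the 'name' and 'mbid' keys are assumed and the ids are made up.
sample_matches = [
    {"name": "The White Stripes Tribute", "mbid": ""},
    {"name": "the white stripes", "mbid": "hypothetical-id"},
]
best = pick_artist_match(sample_matches, "The White Stripes")
```

Returning `None` for an empty result list also gives the caller an explicit way to skip artists that the search did not find.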

Now that we've defined our function to get the data we need, let's iterate over our dataset and retrieve tag information for all of our artists.

The enrichment loop further below takes a while to run. Before running it, let's take a quick look at the artist names we will be querying:

In [None]:
artists_df['artist_name'].head()

Running the loop below raises a few issues:
1. The LastFM API appears to apply some form of rate limiting, so some artists are skipped
2. Iterating over a DataFrame row by row with `iterrows` is slow
3. pandas also emits a `PerformanceWarning` that the DataFrame is highly fragmented, a result of calling `frame.insert` many times (once for each new tag column)

In [None]:
for index, row in artists_df.iterrows():
    artist_name = row['artist_name']
    
    try:
        tags = get_artist_tags(artist_name)
        
        for tag in tags:
            artists_df.at[index, tag] = True
            
    except Exception as e:
        # Report which artist failed so skipped rows can be retried later
        print(f"Skipping {artist_name!r}: {e!r}")
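
If the skipped artists really are caused by rate limiting (this is a guess, not something we have confirmed), retrying with a growing pause may recover some of them. The sketch below wraps any fetch function in a simple exponential backoff. `call_with_retries` and `flaky_fetch` are hypothetical names, and the stand-in fetcher simulates failures so the example runs without network access.

```python
import time


def call_with_retries(fetch, artist_name, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(artist_name), retrying with exponential backoff if it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(artist_name)
        except Exception as error:
            if attempt == max_attempts:
                print(f"Giving up on {artist_name!r}: {error!r}")
                return None
            sleep(base_delay * 2 ** (attempt - 1))  # wait 1x, 2x, 4x, ... the base delay


# Stand-in for get_artist_tags that fails twice before succeeding (no network needed).
attempts = {"count": 0}


def flaky_fetch(name):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return ["tag_rock"]


waits = []  # record the requested pauses instead of actually sleeping
result = call_with_retries(flaky_fetch, "The White Stripes", sleep=waits.append)
```

In the loop, this would be used as `tags = call_with_retries(get_artist_tags, artist_name)`. Note that it also retries permanent failures, such as an artist with no search results, so a small `max_attempts` is advisable.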

After retrieving the tag data from the LastFM API, we fill any missing values with a default of `False`.

In [None]:
artists_df.fillna(False, inplace=True)
artists_df.head(5)
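
An alternative worth considering is to build all tag columns in one step instead of inserting them one at a time and filling the gaps afterwards. The sketch below assumes the tags have already been collected into a dictionary keyed by artist name; `artists`, `tags_by_artist`, and the values in them are made-up stand-ins for `artists_df` and the fetched tags.

```python
import pandas as pd

# Tiny stand-in for artists_df plus tags already fetched per artist (made-up values).
artists = pd.DataFrame({"artist_name": ["A", "B", "C"]})
tags_by_artist = {"A": ["tag_rock", "tag_indie"], "B": ["tag_rock"], "C": []}

# One dict of True flags per artist; tags an artist lacks come out as NaN at this point.
records = [{tag: True for tag in tags_by_artist.get(name, [])} for name in artists["artist_name"]]
tag_frame = pd.DataFrame(records, index=artists.index)

# Convert to real booleans (NaN becomes False), then attach all columns in one concat.
tag_frame = tag_frame.eq(True)
enriched = pd.concat([artists, tag_frame], axis=1)
```

Because the columns are attached with a single `pd.concat`, the frame is not repeatedly extended column by column, and only the tag columns are converted to booleans, whereas `fillna(False)` on the whole frame touches every column.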

Finally, we write the enriched data to a portable file format.

In [None]:
artists_df.to_feather("enriched_data/band_tags.feather")