# Engineer Training Data

Now that we have enriched the artist data with descriptive information about each artist, our next step is to join and refine the input data so that it can be used to train an AutoML classification model. 


In [None]:
import pandas as pd

In [None]:
play_counts_df = pd.read_feather("input_data/play_counts_df.feather")

In [None]:
play_counts_df.head()

Pivot the table, using user_name as the index. This reshapes the data into one row per user and one column per artist, where each cell holds that user's listen count for that artist (NaN if the user never played the artist).

In [None]:
play_counts_pivot_df = pd.DataFrame(play_counts_df.pivot(index='user_name', columns='artist_name', values='cnt').to_records())
play_counts_pivot_df.head()
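
To make the reshaping concrete, here is a minimal sketch on a toy table. The users and counts are made up for illustration, and it assumes each (user_name, artist_name) pair appears only once, since `pivot` raises a `ValueError` on duplicate pairs.

```python
import pandas as pd

# Toy long-format play counts (hypothetical users and counts).
toy_counts = pd.DataFrame({
    "user_name": ["alice", "alice", "bob"],
    "artist_name": ["Muse", "Radiohead", "Muse"],
    "cnt": [20, 5, 3],
})

# One row per user, one column per artist; pairs that never occur become NaN.
toy_pivot = toy_counts.pivot(
    index="user_name", columns="artist_name", values="cnt"
).reset_index()
toy_pivot.columns.name = None  # drop the leftover "artist_name" axis label

print(toy_pivot)
```

Here `reset_index()` stands in for the `to_records()` round trip used above; both turn user_name back into a regular column.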

In [None]:
# Double check that this value is 20
play_counts_pivot_df[play_counts_pivot_df.user_name == "-nils-"].iloc[0]['Muse']

In addition to the artist data for each user, we would also like to use information about the user's preferences for certain genres or styles of artist. To obtain features that represent this information, we're going to use the enriched artist data that was generated using the [1-Enrich_Raw_Input_Data.ipynb](./1-Enrich_Raw_Input_Data.ipynb) file.

Our goal here is to create a count for each user of the total number of times a particular LastFM tag is represented in that user's listening history. To do this, we'll join the user data with the tag data using 'artist_name' as the join key, then sum the tag columns to get a count for the total number of times a particular tag shows up in a user's listening history.

In [None]:
band_tags = pd.read_feather("enriched_data/band_tags.feather")
band_tags.head()

In [None]:
list(band_tags.columns)[:15]

In [None]:
play_counts_with_tags_df = pd.merge(play_counts_df, band_tags, on='artist_name', how='left')
# prevent data leakage by removing Beatles data from the tag df
play_counts_with_tags_df_no_beatles = play_counts_with_tags_df[play_counts_with_tags_df['artist_name']!='The Beatles'].copy()
user_tag_counts_df = play_counts_with_tags_df_no_beatles.drop(columns=['artist_name','cnt']).groupby('user_name').sum().reset_index()
user_artist_tag_counts_df = pd.merge(play_counts_pivot_df, user_tag_counts_df, on='user_name', how='left')

In [None]:
user_artist_tag_counts_df.head()
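
The join-then-sum step can be sketched on toy data as follows. This assumes the tag table holds one numeric 0/1 indicator column per tag, which is a guess about `band_tags` rather than something verified here; the tag names and users are hypothetical.

```python
import pandas as pd

toy_counts = pd.DataFrame({
    "user_name": ["alice", "alice", "bob"],
    "artist_name": ["Muse", "Radiohead", "Muse"],
    "cnt": [20, 5, 3],
})

# Hypothetical tag table: 1 if the artist carries the tag, 0 otherwise.
toy_tags = pd.DataFrame({
    "artist_name": ["Muse", "Radiohead"],
    "rock": [1, 1],
    "electronic": [0, 1],
})

# Attach tag indicators to every (user, artist) row, then sum per user.
joined = pd.merge(toy_counts, toy_tags, on="artist_name", how="left")
tag_counts = (
    joined.drop(columns=["artist_name", "cnt"])
    .groupby("user_name")
    .sum()
    .reset_index()
)

print(tag_counts)
```

Because cnt is dropped before summing, each tag count is the number of a user's artists that carry the tag, not the number of plays.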

Set a target column to turn this into a classification problem: if a user has played The Beatles even once, the label is True. Then drop the 'The Beatles' column in order to prevent data leakage.

In [None]:
import numpy as np
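# True if the user has any Beatles plays (NaN in the pivot means no plays)
user_artist_tag_counts_df['Like The Beatles'] = user_artist_tag_counts_df['The Beatles'].notna()
# Drop the raw play count so the label cannot leak into the features
user_artist_tag_counts_df.drop('The Beatles', axis=1, inplace=True)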
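
As a quick illustration of the labelling rule, here is a toy sketch with made-up users. It uses `notna()`, which for a float column gives the same result as checking `not np.isnan(x)` row by row.

```python
import numpy as np
import pandas as pd

# Toy pivoted table: NaN means the user never played that artist.
toy = pd.DataFrame({
    "user_name": ["alice", "bob"],
    "The Beatles": [12.0, np.nan],
    "Muse": [20.0, 3.0],
})

# Label is True when there is any play count for The Beatles.
toy["Like The Beatles"] = toy["The Beatles"].notna()

# Remove the source column so the model cannot read the answer from it.
toy = toy.drop(columns="The Beatles")

print(toy)
```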

We need to clean up the column names because Cloud AutoML is very picky and will fail on things like Unicode column names. I wish Google would fix this. I'll complain ;)

In [None]:
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("post-roc", "post roc1", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("trip-hop", "trip hop1", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace(" ", "_", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("/", "", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("é", "", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("ö", "o", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("+", "_and_", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("&", "and", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("!", "", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("-", "_", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace(".", "", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("ó", "o", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("'", "", regex=False)

# Japanese Artists
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("久石譲", "Joe_Hisaishi", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("川井憲次", "Kenji_Kawai", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("梶浦由記", "Yuki_Kajiura", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("植松伸夫", "Nobuo_Uematsu", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("菅野よう子", "Yoko_Kanno", regex=False)
user_artist_tag_counts_df.columns = user_artist_tag_counts_df.columns.str.replace("近藤浩治", "Koji_Kondo", regex=False)
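
A more generic alternative to the replacement chain above might look like the sketch below. `sanitize_column` is a hypothetical helper, not part of this notebook's pipeline, and the target character set (ASCII letters, digits, underscores) is an assumption rather than a documented AutoML requirement. The artist names are illustrative only.

```python
import re
import unicodedata


def sanitize_column(name: str) -> str:
    """Hypothetical helper: reduce a name to ASCII letters, digits and underscores."""
    # Decompose accented characters (e.g. "ö" becomes "o" plus a combining mark),
    # then drop everything that is not ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    ascii_name = ascii_name.replace("&", "and").replace("+", "_and_")
    # Collapse each remaining run of disallowed characters into one underscore.
    return re.sub(r"[^0-9A-Za-z_]+", "_", ascii_name).strip("_")


names = ["Sigur Rós", "Mötley Crüe", "AC/DC", "trip-hop"]
cleaned = [sanitize_column(n) for n in names]
print(cleaned)

# Different inputs can collapse to the same output, so check for collisions.
assert len(set(cleaned)) == len(cleaned)
```

This does not reproduce the exact output of the cell above (for example, "/" becomes an underscore here instead of being removed). Names written entirely in non-Latin scripts, such as the Japanese artists, reduce to an empty string, so they would still need an explicit mapping.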

Let's set aside a random sample of data as a holdout set in order to test the trained and deployed model.

In [None]:
user_artist_tag_counts_df.shape

In [None]:
user_artist_tag_counts_df.Like_The_Beatles.value_counts().plot(kind='pie')

In [None]:
# pull 10 data points to test online model serving
inference_sample = user_artist_tag_counts_df.sample(n=10, random_state=42)
inference_sample.reset_index(drop=True).to_feather("test_data/inference_sample.feather")

In [None]:
remainder = user_artist_tag_counts_df.drop(inference_sample.index)
train_sample = remainder.sample(n=1985, random_state=42)
remainder = remainder.drop(train_sample.index)
validation_sample = remainder.sample(n=250, random_state=42)
test_sample = remainder.drop(validation_sample.index)
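
The same sample-then-drop pattern can be sketched on a toy frame, with fractions instead of hard-coded row counts. The 80/10/10 proportions and the 100-row frame are assumptions chosen for illustration, not the sizes used in this notebook.

```python
import pandas as pd

# Toy frame with 100 hypothetical users.
toy = pd.DataFrame({"user_name": [f"user_{i}" for i in range(100)]})

# Hold out 10 rows first, then split what is left.
holdout = toy.sample(n=10, random_state=42)
remainder = toy.drop(holdout.index)

train = remainder.sample(frac=0.8, random_state=42)
remainder = remainder.drop(train.index)

validation = remainder.sample(frac=0.5, random_state=42)
test = remainder.drop(validation.index)

# Every row should land in exactly one split.
parts = [holdout, train, validation, test]
all_indices = pd.concat(parts).index
assert all_indices.is_unique
assert len(all_indices) == len(toy)

print([len(p) for p in parts])
```

Dropping by index after each `sample` call is what keeps the splits disjoint, so this only works while the frame's index is unique.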

In [None]:
train_sample.loc[:, "data_split"] = "TRAIN"
validation_sample.loc[:, "data_split"] = "VALIDATE"
test_sample.loc[:, "data_split"] = "TEST"

In [None]:
# Confirm that these numbers add up the way they should
train_sample.shape[0] + validation_sample.shape[0] + test_sample.shape[0] + inference_sample.shape[0] == user_artist_tag_counts_df.shape[0]

In [None]:
input_data_df = pd.concat([train_sample, validation_sample, test_sample])

In [None]:
input_data_df.shape

In [None]:
input_data_df.to_csv("training_data/file_out_2485_tags.csv", index=False, encoding="utf-8")

Before uploading, we check the size and proportion of each data split, then run head on the output CSV file as a quick gut check that everything looks good.

In [None]:
input_data_df.data_split.value_counts()

In [None]:
input_data_df.data_split.value_counts() / len(input_data_df)

In [None]:
!head training_data/file_out_2485_tags.csv
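
If a programmatic check is preferred over eyeballing `head`, one option is to read the CSV back and compare it with the frame that was written. The sketch below does this on a toy frame through an in-memory buffer; the column values are made up, and in the notebook the file path would replace the buffer.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "user_name": ["alice", "bob"],
    "Muse": [20.0, None],  # None is stored as NaN and written as an empty field
    "Like_The_Beatles": [True, False],
    "data_split": ["TRAIN", "TEST"],
})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# Column names, shape, and missing values should survive the round trip.
assert list(round_trip.columns) == list(toy.columns)
assert round_trip.shape == toy.shape
print(round_trip.isna().sum())
```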

The line below copies the CSV file we just wrote to a Google Cloud Storage (GCS) bucket.

In [None]:
!gsutil cp training_data/file_out_2485_tags.csv gs://csalling-docai-datasets-regional/beatles/file_out_2485_tags.csv

Note that normally I wouldn't use a bucket obviously intended for a different application, but right now for some reason we can only create buckets that are in Finland, so I'm using an older bucket I had lying around for experimentation purposes.