# Wilson's Morning Wake Up Playlist Generator, Modeling and Learning

The following steps will be executed:

* Upload your data to S3.
* Define a benchmark and candidate models and training scripts
* Train models and deploy.
* Evaluate deployed estimator.

## Load Data to S3

In [21]:
import pandas as pd
import boto3
import sagemaker

In [24]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

In [25]:
!ls -la data

total 312
drwxrwxr-x 2 ec2-user ec2-user   4096 Mar  4 23:01 .
drwxrwxr-x 8 ec2-user ec2-user   4096 Mar 17 04:35 ..
-rw-rw-r-- 1 ec2-user ec2-user  28467 Mar  4 23:01 test.csv
-rw-rw-r-- 1 ec2-user ec2-user 113732 Mar  5 01:19 train.csv
-rw-rw-r-- 1 ec2-user ec2-user 166951 Feb 19 02:35 wmw_tracks.csv


## Upload your training data to S3

In [26]:
# should be the name of directory you created to save your features data
data_dir = 'data'

# set prefix, a descriptive name for a directory  
prefix = 'sagemaker/wmw_estimator'

# upload all data to S3
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

---

# Modeling

It's time to define and train the models!

---

## Complete a training script 

To implement a custom estimator, I need to complete a `train.py` script. 

A typical training script:
* Loads training data from a specified directory
* Parses any training & model hyperparameters (ex. nodes in a neural network, training epochs, etc.)
* Instantiates a model of your design, with any specified hyperparams
* Trains that model 
* Finally, saves the model so that it can be hosted/deployed, later

### Defining and training a model

To complete a `train.py` file, you will:
1. Import any extra libraries you need
2. Define any additional model training hyperparameters using `parser.add_argument`
2. Define a model in the `if __name__ == '__main__':` section
3. Train the model in that same section


In [5]:
# Directory of train.py
!pygmentize model/train.py

Error: cannot read infile: [Errno 2] No such file or directory: 'model/train.py'


---
# Create an Estimator

When a custom model is constructed in SageMaker, an entry point must be specified. This is the Python file which will be executed when the model is trained; the `train.py` function you specified above. To run a custom training script in SageMaker, construct an estimator, and fill in the appropriate constructor arguments:

* **entry_point**: The path to the Python script SageMaker runs for training and prediction.
* **source_dir**: The path to the training script directory `source_sklearn` OR `source_pytorch`.
* **entry_point**: The path to the Python script SageMaker runs for training and prediction.
* **source_dir**: The path to the training script directory `train_sklearn` OR `train_pytorch`.
* **entry_point**: The path to the Python script SageMaker runs for training.
* **source_dir**: The path to the training script directory `train_sklearn` OR `train_pytorch`.
* **role**: Role ARN, which was specified, above.
* **train_instance_count**: The number of training instances (should be left at 1).
* **train_instance_type**: The type of SageMaker instance for training. Note: Because Scikit-learn does not natively support GPU training, Sagemaker Scikit-learn does not currently support training on GPU instance types.
* **sagemaker_session**: The session used to train on Sagemaker.
* **hyperparameters** (optional): A dictionary `{'name':value, ..}` passed to the train function as hyperparameters.

Note: For a PyTorch model, there is another optional argument **framework_version**, which you can set to the latest version of PyTorch, `1.0`.

## Define PyTorch estimators

In [28]:
# Build sequences and targets
def create_playlist_sequences(input_data):
    input_playlists = []
    
    for i in input_data['volume'].unique():
        temp_vol = input_data[input_data['volume'] == i]
        playlist_X = temp_vol.iloc[:, 2:].values
        labels_y = temp_vol.iloc[:, 2:-3].values
        input_playlists.append((playlist_X, labels_y))
        
    return input_playlists

In [29]:
from unittest.mock import MagicMock, patch

def _print_success_message():
    print('Tests Passed!')

def test_playlist_sequences(input_playlists):
    
    track_features = [-2.39099487, -2.63509459, -0.27732204,  0.92969533, -0.48983686,-1.15691947,  1.08569029, -1.20454903,  2.09618458, -5.37044178, 0.23380331]
    
    track_features_len = 11
    target_features_len = 8
    
    # check shape and equality of first track
    assert len(input_playlists[0][0][0]) == len(track_features), \
        'Number of features in input_playlist features does not match expected number of ' + str(len(track_features))    
    
    # check shape of input and output arrays
    assert input_playlists[0][0].shape[1]==track_features_len, \
        'input_features should have as many columns as selected features, got: {}'.format(train_x.shape[1])
    assert input_playlists[0][1].shape[1]==target_features_len, \
        'target_features should have as many columns as selected features, got: {}'.format(train_x.shape[1])
    
    #TODO: Add more tests
    
    _print_success_message()

### Test run of benchmark and candidate models and train components
Here I will see if the configurations I have set work accordingly with no errors. Once it runs smoothly, I will instantiate an estimator using the Sagemaker API.

In [30]:
import os
import torch
import torch.utils.data

train_data = pd.read_csv(os.path.join(data_dir, "train.csv"))

# Gather sequences and targets
processed_data = create_playlist_sequences(train_data)

In [9]:
# Training function for LSTM
def train_lstm(model, train_loader, epochs, criterion, optimizer, device):
    """
    This is the training method that is called by the PyTorch training script of the LSTM model. The parameters
    passed are as follows:
    model        - The PyTorch model that we wish to train.
    train_loader - The PyTorch DataLoader that should be used during training.
    epochs       - The total number of epochs to train for.
    criterion    - The loss function used for training. 
    optimizer    - The optimizer to use during training.
    device       - Where the model and data should be loaded (gpu or cpu).
    """
    
    # training loop is provided
    for epoch in range(1, epochs + 1):
        model.train() # Make sure that the model is in training mode.

        total_loss = 0

        for batch in train_loader:
            
            # get data
            batch_x, batch_y = batch
            
            # 
            batch_x = torch.from_numpy(batch_x).float().squeeze()
            batch_y = torch.from_numpy(batch_y).float()

            batch_x = batch_x.to(device)
            batch_y = batch_y.to(device)

            optimizer.zero_grad()
            
            model.hidden_cell = (torch.zeros(1, 1, model.hidden_layer_dim),
                torch.zeros(1, 1, model.hidden_layer_dim))

            # get predictions from model
            y_pred = model(batch_x)
            
            # perform backprop
            loss = criterion(y_pred, batch_y)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.data.item()
            
        if epoch%25 == 1:
            print("Epoch: {}, Loss: {}".format(epoch, total_loss / len(train_loader)))

In [11]:
import torch.optim as optim
from model.LSTM_Estimator import LSTMEstimator

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMEstimator(11, 30, 8)
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = torch.nn.L1Loss()

train_lstm(model, processed_data, 100, loss_fn, optimizer, device)

Epoch: 1, Loss: 0.7757460233327504
Epoch: 26, Loss: 0.08717616818643906
Epoch: 51, Loss: 0.05579267497602347
Epoch: 76, Loss: 0.04229031135705677


In [2]:
%env SPOTIFY_EMAIL=gillaw06@gmail.com

env: SPOTIFY_EMAIL=gillaw06@gmail.com


In [3]:
%env SPOTIFY_ID=ce1d1ca394724265951a48a0deea6d01

env: SPOTIFY_ID=ce1d1ca394724265951a48a0deea6d01


In [4]:
%env SPOTIFY_SECRET=3ce5bb4c8c18423f9e8b3f12db963e31

env: SPOTIFY_SECRET=3ce5bb4c8c18423f9e8b3f12db963e31


In [5]:
# Spotify API
import spotipy
import spotipy.util as util

# Defaults
import os
import sys

# Spotify for developers client auth variables
username = os.environ['SPOTIFY_EMAIL']
spotify_id = os.environ['SPOTIFY_ID']
spotify_secret = os.environ['SPOTIFY_SECRET']

# Set API scope
scope='playlist-read-private'

# Get auth token
token = util.prompt_for_user_token(username, 
                                   scope,
                                   client_id=spotify_id,
                                   client_secret=spotify_secret,
                                   redirect_uri='http://localhost/')

In [7]:
from spotipy.oauth2 import SpotifyClientCredentials

In [8]:

#Authenticate
sp = spotipy.Spotify(
    client_credentials_manager = SpotifyClientCredentials(
        client_id=spotify_id,
        client_secret=spotify_secret
    )
)

In [51]:
# Read in WMW tracks to date for recommendations
track_data = pd.read_csv(os.path.join(data_dir, "wmw_tracks.csv"))

track_data.head()

Unnamed: 0,volume,position,track_name,artist_name,danceability,energy,key,loudness,mode,speechiness,...,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,38,1,Finding It There,Goldmund,0.187,0.00257,1,-37.134,1,0.0427,...,0.0915,0.0374,123.707,audio_features,6CnPCuUcM3A5PMP4gUy0vw,spotify:track:6CnPCuUcM3A5PMP4gUy0vw,https://api.spotify.com/v1/tracks/6CnPCuUcM3A5...,https://api.spotify.com/v1/audio-analysis/6CnP...,220120,5
1,38,2,Light Forms,Rohne,0.671,0.545,10,-12.848,0,0.0393,...,0.118,0.284,133.036,audio_features,6MkUPsz5hYeneo0a9H0VT8,spotify:track:6MkUPsz5hYeneo0a9H0VT8,https://api.spotify.com/v1/tracks/6MkUPsz5hYen...,https://api.spotify.com/v1/audio-analysis/6MkU...,265870,4
2,38,3,C-Side,Khruangbin,0.688,0.779,11,-10.129,0,0.0579,...,0.349,0.938,94.073,audio_features,6GvAM8oyVApQHGMgpBt8yl,spotify:track:6GvAM8oyVApQHGMgpBt8yl,https://api.spotify.com/v1/tracks/6GvAM8oyVApQ...,https://api.spotify.com/v1/audio-analysis/6GvA...,283407,4
3,38,4,Didn't I (Dave Allison Rework),Darondo,0.539,0.705,0,-6.729,1,0.0527,...,0.133,0.685,186.033,audio_features,1owjOeZt1BdYWW6T8fIAEe,spotify:track:1owjOeZt1BdYWW6T8fIAEe,https://api.spotify.com/v1/tracks/1owjOeZt1BdY...,https://api.spotify.com/v1/audio-analysis/1owj...,328000,4
4,38,5,Woman Of The Ghetto - Akshin Alizadeh Remix,Marlena Shaw,0.707,0.573,7,-8.403,0,0.0276,...,0.0858,0.189,100.006,audio_features,2h8cQH7zhUWrynZi2MKhhC,spotify:track:2h8cQH7zhUWrynZi2MKhhC,https://api.spotify.com/v1/tracks/2h8cQH7zhUWr...,https://api.spotify.com/v1/audio-analysis/2h8c...,302467,4


In [None]:
import tqdm

class Playlist():
    def __init__(self):
        self.name = "Wilson's Morning Wake Up Vol. Test"
        self.intro_songs = []
        self.search_results = []
        self.recommended_track_ids = pd.DataFrame() #list of track ids straight from spotify
        self.trax = [] #all tracks as dict
        self.df = None #this is where the data goes
        self.playlist = None
        
        # Input your songs
#         self.intro_songs.append(input('Song 1: '))
#         self.intro_songs.append(input('Song 2: '))

        self.intro_songs.append('mess hall luke howard')
        self.intro_songs.append('soulless nato medrado')
        
        # DO EVERYTHING
        self.get_recommendations()
        
    def get_recommendations(self):
        print('Getting Recommendations...')
        
        # Grab all WMW songs to date
        for _, row in tqdm(track_data.iterrows()):
            song_search = row['track_name'].partition('-')[0] + ' ' + row['artist_name']
            print(song_search)
            try:
                song_res = sp.search(song_search, limit=1)['tracks']['items'][0]
                
                self.search_results.append({
                    'id': song_res['id'],
                    'artists': [i['name'] for i in song_res['artists']],
                    'name': song_res['name']
                })
            except:
                "Song not searchable"

        
        # Get recommendations based on WMW songs already selected
        for res in tqdm(self.search_results):
            results = sp.recommendations(seed_tracks = [res['id']], limit=10)
            for r in results['tracks']:
                track={}
                track['id'] = r['id']
                track['artists'] = [i['name'] for i in r['artists']],
                track['name'] = r['name']
                track['artist_name'] = track['track']['artists'][0]['name']
                track_features = sp.audio_features(r['track']['id'])[0]
                track.update(track_features)
                final_track = pd.DataFrame(track, index=[0])
                recommended_track_ids = recommended_track_ids.append(final_track, ignore_index=True)
  

In [64]:
Playlist()

Getting Recommendations...
Finding It There Goldmund
Light Forms Rohne
C Khruangbin
Didn't I (Dave Allison Rework) Darondo
Woman Of The Ghetto  Marlena Shaw
Flyga Hosini
Warm Winter Koresma
Fewer Looks Affelaye
Eff Five Parra for Cuva
How Often Lane 8
Because You Move Me Tinlicker
Earth Lapalux
Second Sun  Nils Hoffmann
Some of Them (feat. MELI)  Amtrac
Radical  Amtrac
Bower Luke Howard
Human Memory Susumu Yokota
Lovely Dots Jascha Hagen
Dissociation In The Car Park At Sain Laurence Guy
Aaj Shanibar Rupa
It Might Be Time Tame Impala
Flight 99 Masego
Close Jesper Ryom
Midnight Mischief  Jordan Rakei
Curiosity Pablo Nouvelle
Cupa Cupa Parra for Cuva
Wild Tourist
Breathing Ben Böhmer
Lost In Mind Ben Böhmer
Ave  Modd
To The Ground Matt Fax
Birdsong  Ludovico Einaudi
Nami Meitei
Finally Moving Pretty Lights
Bad Bad News Leon Bridges
Silver Linings Catching Flies
Mango Pulp (feat. Ian Ewing) Edamame
Jakarta Luigi Sambuy
Kin Tourist
Benji Soul Flower
Maia Durante
Sugarbites Martin Roth
Brigh

Lighter SG Lewis
Dissensions  Ben Böhmer
Higher Ground (feat. Naomi Wild) ODESZA
It Never Rains Tom Demac
Trees Felon
Jalapeño Tinlicker
Sometimes Goldmund
Asos Model Crush dné
Vision Joe Garston
Beautiful People Mark Pritchard
Lise M.A BEAT!
Love Is Everywhere  Arms and Sleepers
Where Are You  Robert Babicz
Give It Up Calibre
Drone Bomb Me ANOHNI
Use Me Bill Withers
Twilight Fejká
Dream Machine Dominik Eulberg
Moonlight Fejká
Pacer Jesper Ryom
Waking Dream Luttrell
Pilgrim Balmorhea
California River Tiber
Anjou  Youandewan
Hey Now (When I Give You All My Lovin') Romare
Time Is the Enemy Quantic
Made To Stray Mount Kimbie
Paper Trails DARKSIDE
Ain't No Sunshine Bill Withers
Cirrus Bonobo
Eple Röyksopp
Hold Me Down Mansionair
Humanize Klahr
Back to Basics Alex Gopher
Never Lost Amtrac
Forever Dolphin Love (Erol Alkan's Extended Rework Version 2) Connan Mockasin
Ljus Mojna
Piano Months Teebs
Happiness Jónsi
Öldurót  Ólafur Arnalds
Suns Embee
The Journey Tom Misch
Grow Up Weval
7th Sevens

TypeError: 'module' object is not callable

In [None]:
def predict_playlist(model, initial_songs=[], predict_len=15):
    hidden = model.init_hidden()
    # Cast songs to tensor
    prime_input = text_to_tensor(prime_str)
    predicted = prime_str
    
    # Using initial songs to build up hidden state
    for song in range(initial_songs):
        _, hidden = model()
    

In [11]:
# # Training function
# def train_rnn(model, train_loader, epochs, criterion, optimizer, device):
#     """
#     This is the training method that is called by the PyTorch training script. The parameters
#     passed are as follows:
#     model        - The PyTorch model that we wish to train.
#     train_loader - The PyTorch DataLoader that should be used during training.
#     epochs       - The total number of epochs to train for.
#     criterion    - The loss function used for training. 
#     optimizer    - The optimizer to use during training.
#     device       - Where the model and data should be loaded (gpu or cpu).
#     """
    
#     # training loop is provided
#     for epoch in range(1, epochs + 1):
#         model.train() # Make sure that the model is in training mode.

#         total_loss = 0
        
#         hidden = model.initHidden()

#         for batch in train_loader:
            
#             # get data
#             batch_x, batch_y = batch
            
#             # 
#             batch_x = torch.from_numpy(batch_x).float().squeeze()
#             batch_y = torch.from_numpy(batch_y).float()

#             batch_x = batch_x.to(device)
#             batch_y = batch_y.to(device)

#             optimizer.zero_grad()

#             y_pred = []
            
#             # get predictions
#             for x in batch_x:
#                 y, hidden = model(x, hidden)
#                 y_pred.append(y)
            
#             # perform backprop
#             loss = criterion(y_pred, batch_y)
#             loss.backward()
#             optimizer.step()
            
#             total_loss += loss.data.item()
            
#         if epoch%25 == 1:
#             print("Epoch: {}, Loss: {}".format(epoch, total_loss / len(train_loader)))

#TODO: Create working RNN Benchmark model

In [18]:
# import torch.optim as optim
# from model.RnnEstimator import RNNEstimator

# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model = RNNEstimator(11, 30, 8)
# optimizer = optim.Adam(model.parameters(), lr=0.001)
# loss_fn = torch.nn.L1Loss()

# train_rnn(model, processed_data, 100, loss_fn, optimizer, device)

### Build and Train the PyTorch Model with Hyperparameter Tuning

In [15]:
# Estimator code
from sagemaker.pytorch import PyTorch
output_path = 's3://{}/{}'.format(bucket, prefix)

estimator = PyTorch(entry_point="LSTM_Train.py",
                    source_dir="model",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    output_path = output_path,
                    train_instance_type='ml.m4.xlarge',
                    hyperparameters={
                        'input_features': 11,
                        'hidden_dim': 12,
                        'output_dim': 8,
                        'epochs': 100
                    })

In [16]:
# Fit estimator
estimator.fit({'train': input_data})

2020-03-05 03:44:03 Starting - Starting the training job...
2020-03-05 03:44:04 Starting - Launching requested ML instances.........
2020-03-05 03:45:34 Starting - Preparing the instances for training.........
2020-03-05 03:47:17 Downloading - Downloading input data
2020-03-05 03:47:17 Training - Downloading the training image..[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[34mbash: no job control in this shell[0m
[34m2020-03-05 03:47:37,157 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training[0m
[34m2020-03-05 03:47:37,160 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2020-03-05 03:47:37,172 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.[0m
[34m2020-03-05 03:47:37,176 sagemaker_pytorch_container.training INFO     Invoking user training script.[0m
[34m2020-03-05 03:47:37,390 sagemaker-containers INFO     Module LSTM_Train doe

In [17]:
%%time

# deploy your model to create a predictor
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

-----------!CPU times: user 254 ms, sys: 15.1 ms, total: 270 ms
Wall time: 5min 32s


In [28]:
torch.Tensor(processed_data[0][0]).float()

tensor([[-2.3910e+00, -2.6351e+00, -2.7732e-01,  9.2970e-01, -4.8984e-01,
         -1.1569e+00,  1.0857e+00, -1.2045e+00,  2.0962e+00, -5.3704e+00,
          2.3380e-01],
        [ 2.9102e-01, -1.3109e-01, -3.5296e-01,  6.4590e-01, -2.8399e-01,
         -1.2541e-01, -9.2107e-01,  1.4323e+00,  6.8586e-01, -3.7562e-01,
          6.8282e-01],
        [ 3.8522e-01,  9.4912e-01,  6.0842e-02, -1.4610e+00,  1.5104e+00,
          2.6102e+00, -9.2107e-01,  1.7253e+00, -7.7038e-01,  1.8358e-01,
         -1.1925e+00],
        [-4.4044e-01,  6.0751e-01, -5.4846e-02, -1.4123e+00, -1.6747e-01,
          1.5519e+00,  1.0857e+00, -1.4975e+00, -8.6166e-01,  8.8285e-01,
          3.2336e+00],
        [ 4.9051e-01, -1.8348e-03, -6.1326e-01, -9.6511e-01, -5.3411e-01,
         -5.2279e-01, -9.2107e-01,  5.5335e-01, -8.1149e-01,  5.3856e-01,
         -9.0695e-01],
        [ 6.9554e-01,  1.2076e+00, -5.0396e-02,  8.1217e-01, -7.6404e-01,
         -7.5214e-02,  1.0857e+00, -3.2617e-02,  3.1377e-01,  1.0708e-0

In [31]:
preds = predictor.predict(torch.Tensor(processed_data[0][0]).float())

In [32]:
len(preds)

15

In [49]:
torch.Tensor(new_tracks[-1]).float()

tensor([-2.3910, -2.6351, -0.2773,  0.9297, -0.4898, -1.1569,  1.0857, -1.2045,
         2.0962, -5.3704,  0.2338])

In [57]:
fut_pred = processed_data[0][0][0]

[-2.3909948690196825,
 -2.635094590468726,
 -0.2773220412902482,
 0.9296953263811546,
 -0.4898368594156362,
 -1.1569194705342014,
 1.0856902892884872,
 -1.2045490262286025,
 2.0961845785776654,
 -5.370441776536932,
 0.23380331292868914]

In [62]:
# model.eval()

fut_pred = processed_data[0][0][0]

playlist_len = 15

new_tracks = [torch.Tensor(fut_pred).float()]

print(new_tracks[-1])


predictor.predict(new_tracks[-1])

# for i in range(playlist_len - len(fut_pred)):
#         print(i)
#         print(predictor.predict(new_tracks[-1].values))
#         break

tensor([-2.3910, -2.6351, -0.2773,  0.9297, -0.4898, -1.1569,  1.0857, -1.2045,
         2.0962, -5.3704,  0.2338])


ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request.  Either the server is overloaded or there is an error in the application.</p>
". See https://ap-southeast-2.console.aws.amazon.com/cloudwatch/home?region=ap-southeast-2#logEventViewer:group=/aws/sagemaker/Endpoints/sagemaker-pytorch-2020-03-05-03-44-02-776 in account 999752527953 for more information.