
## WOrk with sagemaker
https://www.youtube.com/watch?v=R0vC31OXt-g

https://medium.com/akeneo-labs/machine-learning-workflow-with-sagemaker-b83b293337ff

In [1]:
import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()
base = 'DEMO-blazingtext-review'
prefix = 'sagemaker/' + base

role = sagemaker.get_execution_role()

In [2]:
role

'arn:aws:iam::855476558244:role/service-role/AmazonSageMaker-ExecutionRole-20190926T142620'

In [3]:
import os
import pandas as pd
import numpy as np
import boto3
import json
from sagemaker.predictor import json_deserializer

## **DATASET**

### Data from **Amazon Customer Reviews dataset** 


In [4]:
!mkdir /tmp/recsys/
!aws s3 cp s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz /tmp/recsys/

mkdir: cannot create directory ‘/tmp/recsys/’: File exists
download: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz to ../../../tmp/recsys/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz


In [5]:
df = pd.read_csv('/tmp/recsys/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz', delimiter = '\t', error_bad_lines = False)

b'Skipping line 92523: expected 15 fields, saw 22\n'
b'Skipping line 343254: expected 15 fields, saw 22\n'
b'Skipping line 524626: expected 15 fields, saw 22\n'
b'Skipping line 623024: expected 15 fields, saw 22\n'
b'Skipping line 977412: expected 15 fields, saw 22\n'
b'Skipping line 1496867: expected 15 fields, saw 22\n'
b'Skipping line 1711638: expected 15 fields, saw 22\n'
b'Skipping line 1787213: expected 15 fields, saw 22\n'
b'Skipping line 2395306: expected 15 fields, saw 22\n'
b'Skipping line 2527690: expected 15 fields, saw 22\n'


In [6]:
df.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,12190288,R3FU16928EP5TC,B00AYB1482,668895143,Enlightened: Season 1,Digital_Video_Download,5,0,0,N,Y,I loved it and I wish there was a season 3,I loved it and I wish there was a season 3... ...,2015-08-31
1,US,30549954,R1IZHHS1MH3AQ4,B00KQD28OM,246219280,Vicious,Digital_Video_Download,5,0,0,N,Y,As always it seems that the best shows come fr...,As always it seems that the best shows come fr...,2015-08-31
2,US,52895410,R52R85WC6TIAH,B01489L5LQ,534732318,After Words,Digital_Video_Download,4,17,18,N,Y,Charming movie,"This movie isn't perfect, but it gets a lot of...",2015-08-31
3,US,27072354,R7HOOYTVIB0DS,B008LOVIIK,239012694,Masterpiece: Inspector Lewis Season 5,Digital_Video_Download,5,0,0,N,Y,Five Stars,excellant this is what tv should be,2015-08-31
4,US,26939022,R1XQ2N5CDOZGNX,B0094LZMT0,535858974,On The Waterfront,Digital_Video_Download,5,0,0,N,Y,Brilliant film from beginning to end,Brilliant film from beginning to end. All of t...,2015-08-31


In [7]:
# Drop some fileds that won't be used.
df = df[['customer_id', 'product_id', 'product_title', 'star_rating', 'review_date']]

### Most users do not rate most movies - CHeck our long tail

In [8]:
customers = df['customer_id'].value_counts()
products = df['product_id'].value_counts()

quantiles = [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99, 1]
print('customers\n', customers.quantile(quantiles))
print('products\n', products.quantile(quantiles))

customers
 0.00       1.0
0.01       1.0
0.02       1.0
0.03       1.0
0.04       1.0
0.05       1.0
0.10       1.0
0.25       1.0
0.50       1.0
0.75       2.0
0.90       4.0
0.95       5.0
0.96       6.0
0.97       7.0
0.98       9.0
0.99      13.0
1.00    2704.0
Name: customer_id, dtype: float64
products
 0.00        1.00
0.01        1.00
0.02        1.00
0.03        1.00
0.04        1.00
0.05        1.00
0.10        1.00
0.25        1.00
0.50        3.00
0.75        9.00
0.90       31.00
0.95       73.00
0.96       95.00
0.97      130.00
0.98      199.00
0.99      386.67
1.00    32790.00
Name: product_id, dtype: float64


Filter out customers that have not rated many movies.

In [9]:
customers = customers[customers >= 5]
customers = customers[customers >= 10]
reduce_df = df.merge(pd.DataFrame({'customer_id': customers.index})).merge(pd.DataFrame({'product_id':products.index}))

Concatenate product titles to treat each one as a single word

In [10]:
reduce_df['product_title'] =reduce_df['product_title'].apply(lambda x: x.lower().replace('', '-'))

In [11]:
reduce_df.head()

Unnamed: 0,customer_id,product_id,product_title,star_rating,review_date
0,27072354,B008LOVIIK,-m-a-s-t-e-r-p-i-e-c-e-:- -i-n-s-p-e-c-t-o-r- ...,5,2015-08-31
1,16030865,B008LOVIIK,-m-a-s-t-e-r-p-i-e-c-e-:- -i-n-s-p-e-c-t-o-r- ...,5,2014-06-20
2,18602179,B008LOVIIK,-m-a-s-t-e-r-p-i-e-c-e-:- -i-n-s-p-e-c-t-o-r- ...,5,2014-12-23
3,51264580,B008LOVIIK,-m-a-s-t-e-r-p-i-e-c-e-:- -i-n-s-p-e-c-t-o-r- ...,4,2012-12-07
4,11260824,B008LOVIIK,-m-a-s-t-e-r-p-i-e-c-e-:- -i-n-s-p-e-c-t-o-r- ...,5,2015-05-30


In [12]:
reduce_df.sort_values(['customer_id', 'review_date']).groupby('customer_id')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7faccab88710>

Write customer purchase histories

In [13]:
first = True
with open('customer_purchases.txt', 'w') as f:
    for customer, data in reduce_df.sort_values(['customer_id', 'review_date']).groupby('customer_id'):
        if first:
            first = False
        else:
            f.write('\n')
        f.write(''.join(data['product_title'].tolist()))

In [14]:
!head -c 10 customer_purchases.txt

OSError: [Errno 12] Cannot allocate memory

Write to S3 to SageMaker training can use it

In [15]:
inputs = sess.upload_data('customer_purchases.txt', bucket, '{}/word2vec/train'.format(prefix))

In [16]:
inputs

's3://sagemaker-us-east-2-855476558244/sagemaker/DEMO-blazingtext-review/word2vec/train/customer_purchases.txt'

## **Train**

Create a SageMaker estimator and specify:
    
* gorithm container location
* IAMrole for permissions
* Training Hardware
* S3 output location

In [17]:
bt = sagemaker.estimator.Estimator(
     sagemaker.amazon.amazon_estimator.get_image_uri(boto3.Session().region_name, 'blazingtext', 'latest'),
     role,
     train_instance_count =1,
     train_instance_type = 'ml.t2.medium',#''ml.t2.medium', #'ml.p3.2xlarge', 'ml.p2.xlarge', #
     output_path = 's3://{}/{}/output'.format(bucket, prefix),
     sagemaker_session = sess)

In [18]:
bt

<sagemaker.estimator.Estimator at 0x7fac75531e48>

## See algorithm hyperparameters:

* min_count: Remove titles that occur less than 5 times
* vector_dim: Embed in a 100-dimensional subspace
* subwords: use subwords to capture similarity in individual words of the vide titles
* min_char & max_char: subword character limits

In [19]:
bt.set_hyperparameters(mode="skipgram", epochs = 10, min_count = 5, sampling_threshold = 0.0001,
                      learning_rate = 0.05, window_size = 5, vector_dim =100, negative_samples = 5,
                      min_char =5, max_char = 10, evaluation = False, subwords = True)

Start the SageMaker training job with .fit():
* Provisions instances
* Loads algorithm container
* Load data from S3
* Trains models
* Outputs model artifacts to S3
* Tears down training cluster

In [20]:
bt.fit({'train':sagemaker.s3_input(inputs, distribution = 'FullyReplicated', content_type = 'text/plain')})

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'ml.t2.medium' at 'resourceConfig.instanceType' failed to satisfy constraint: Member must satisfy enum value set: [ml.p2.xlarge, ml.m5.4xlarge, ml.m4.16xlarge, ml.p3.16xlarge, ml.m5.large, ml.p2.16xlarge, ml.c4.2xlarge, ml.c5.2xlarge, ml.c4.4xlarge, ml.c5.4xlarge, ml.c4.8xlarge, ml.c5.9xlarge, ml.c5.xlarge, ml.c4.xlarge, ml.c5.18xlarge, ml.p3dn.24xlarge, ml.p3.2xlarge, ml.m5.xlarge, ml.m4.10xlarge, ml.m5.12xlarge, ml.m4.xlarge, ml.m5.24xlarge, ml.m4.2xlarge, ml.p2.8xlarge, ml.m5.2xlarge, ml.p3.8xlarge, ml.m4.4xlarge]