# *SageMaker Example for ML Kayfabe Training (REVISITED ON 04/22/2021)* 


 Etienne P Jacquot - ASC IT SYSADMIN - epj@asc.upenn.edu

## Based on helpful SageMaker example w/ banking purchase data here: 
- https://aws.amazon.com/getting-started/tutorials/build-train-deploy-machine-learning-model-sagemaker/
- This notebook is very similar, and I used the bank_clean.csv as reference for cleaning up my WWE instagram csv file!

### For this notebook we are looking at the most famous WWE star Ronda Rousey

- Initial ML Testing was for `Roman Reigns` Instagram Posts
    - Check him out here: https://www.instagram.com/romanreigns/?hl=en
    - I went through and added 1s and 0s for YES/NO on whether each of his photos is in Kayfabe, but that was only about 80 pictures



- This notebook looks at `Ronda Rousey` Instagram Posts ...

    - Check her out here: https://www.instagram.com/rondarousey/
    - This was around 3,600 images, but a lot were from her pre-WWE career, so I stopped Kayfabe training at around 1,500.

_______

#### *Thinking ahead...*

In anticipation of meeting w/ Jake & Matt

*How can we use this notebook as a proof of concept to then scale up a pipeline for congressional records?*

- Stanford congressional record https://data.stanford.edu/congress_text

In [2]:
# import libraries
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np                                
import pandas as pd                               
import matplotlib.pyplot as plt                   
from IPython.display import Image                 
from IPython.display import display               
from time import gmtime, strftime                 
from sagemaker.predictor import csv_serializer   

## Define IAM role


In [3]:
role = get_execution_role()
prefix = 'sagemaker/DEMO-xgboost-dm'
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'} # each region has its XGBoost container
my_region = boto3.session.Session().region_name # set the region of the instance
print("Success - the MySageMakerInstance is in the " + my_region + " region. You will use the " + containers[my_region] + " container for your SageMaker endpoint.")

Success - the MySageMakerInstance is in the us-east-1 region. You will use the 811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest container for your SageMaker endpoint.


## Create your respective S3 Bucket for model results

In [4]:
bucket_name = 'mldatawwe' # <--- CHANGE THIS VARIABLE TO A UNIQUE NAME FOR YOUR BUCKET

In [5]:
s3 = boto3.resource('s3')
try:
    if  my_region == 'us-east-1':
      s3.create_bucket(Bucket=bucket_name)
    else: 
      s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={ 'LocationConstraint': my_region })
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ',e)

S3 bucket created successfully


## Read your Machine Learning data

- In this case I had a funny script which displayed one image at a time and prompted me with a system dialog box (Y/N) for Kayfabe. I did this for Ronda Rousey.

![](img/roman1.png)

## *TRAINED DATA FOR RONDA ROUSEY* --> 

- I am not an authority on kayfabe...
- There is ambiguity around the UFC v WWE career transition

In [7]:
model_data_json = pd.read_json('wwe_instagram_data/rondarousey_ml_df.json')
model_data_json.to_csv('wwe_instagram_data/rondarousey_ml_df.csv')

In [9]:
# IGNORE THIS CELL - Permissions need to be in place for pulling from s3 bucket link
# We just do this manually in the next step
'''try:
  urllib.request.urlretrieve ("https://mldatawwe.s3.amazonaws.com/romanreigns_ml_binary_v4.csv", "romanreigns_ml_binary_testing.csv")
  print('Success: downloaded bank_clean.csv.')
except Exception as e:
  print('Data load error: ',e)
'''


'try:\n  urllib.request.urlretrieve ("https://mldatawwe.s3.amazonaws.com/romanreigns_ml_binary_v4.csv", "romanreigns_ml_binary_testing.csv")\n  print(\'Success: downloaded bank_clean.csv.\')\nexcept Exception as e:\n  print(\'Data load error: \',e)\n'

In [8]:
# Apparently I used this cell various times in testing for Roman & Ronda classified data
# In this case, I had to manually upload the .csv file to SageMaker Jhub ...

try:
    #model_data = pd.read_csv('wwe_instagram_data/romanreigns_clean.csv',index_col=0)
    #model_data = pd.read_csv('wwe_instagram_data/romanreigns_sagemaker_testing.csv',index_col=0)
    
    model_data = pd.read_csv('wwe_instagram_data/rondarousey_ml_df.csv',index_col=0)
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ',e)

Success: Data loaded into dataframe.


## This Instagram data contains AWS Rekognition results as matrix w/ many columns for each object detected

- Of course many of the values are zero. I honestly forget which confidence threshold I used; I think mbod suggested 95% just to keep the number of columns manageable at ~430

In [15]:
model_data.shape

(1470, 436)

In [16]:
model_data.head()

Unnamed: 0,postUrl,description,commentCount,likeCount,location,locationId,pubDate,isSidecar,profileUrl,username,...,rekog_Suggestive,rekog_Revealing_Clothes,rekog_Illustrated_Nudity_Or_Sexual_Activity,rekog_Explicit_Nudity,rekog_Weapons,rekog_Nudity,rekog_Female_Swimwear_Or_Underwear,rekog_Physical_Violence,rekog_Sexual_Activity,rekog_Partial_Nudity
0,https://www.instagram.com/p/B6Dva5DHdVA,I never thought I’d have so much fun playing a...,179,46171,,,2019-12-14 16:05:52,True,https://www.instagram.com/rondarousey,rondarousey,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,https://www.instagram.com/p/B6DxU13n_jG,I never thought I’d have so much fun playing a...,179,46171,,,2019-12-14 16:05:52,True,https://www.instagram.com/rondarousey,rondarousey,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,https://www.instagram.com/p/B5oMy_unLT5,TONIGHT is part 1 of our two part @totaldivas ...,669,236320,,,2019-12-03 23:07:10,False,https://www.instagram.com/rondarousey,rondarousey,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,https://www.instagram.com/p/BzeDytOHREX,People ask me all the time if I miss @wwe - we...,5438,338573,,,2019-07-03 23:05:41,False,https://www.instagram.com/rondarousey,rondarousey,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,https://www.instagram.com/p/Bg0JBU0nQHS,Apparently warm welcomes come few and far betw...,1233,193068,,,2018-03-27 05:22:08,True,https://www.instagram.com/rondarousey,rondarousey,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## *UPDATE -->* This step builds our ML features df

_______

### Identify columns of relevance for ml training & cleaning data...
- Mostly either binary or continuous values 
- AWS Rekognition results were added to this df as columns
    - Only Rekognition `labels`, `moderation`, or `celebrity` results that have > 90% confidence
    - These come back as a JSON-like nested object, **which needs to be incorporated into the df for ML**
    - Took all those Rekognition results as columns and appended them to the df as arrays of 0s
    - Looped back through each row (each Instagram image) and changed 0s to 1s to represent the Rekognition results
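
The one-hot step described above can be sketched roughly as follows. This is a minimal illustration and not the script that built this dataset: the `responses` sample, the `labels_to_features` helper, and the `90.0` default are assumptions. Only the `Labels` / `Name` / `Confidence` keys follow the shape of a Rekognition label-detection response.

```python
# Hypothetical sketch: turn Rekognition-style label results into 0/1 feature columns.
# The sample data and the helper name are made up for illustration.

def labels_to_features(responses, threshold=90.0):
    """Return (column_names, rows) with one 0/1 column per label above the threshold."""
    # First pass: collect every label that clears the confidence threshold.
    names = sorted({
        label["Name"]
        for resp in responses.values()
        for label in resp["Labels"]
        if label["Confidence"] > threshold
    })
    columns = ["rekog_" + name.replace(" ", "_") for name in names]

    # Second pass: start each row at all zeros, then flip detected labels to 1.
    rows = {}
    for post_id, resp in responses.items():
        row = dict.fromkeys(columns, 0)
        for label in resp["Labels"]:
            if label["Confidence"] > threshold:
                row["rekog_" + label["Name"].replace(" ", "_")] = 1
        rows[post_id] = row
    return columns, rows


responses = {
    "post_a": {"Labels": [{"Name": "Person", "Confidence": 99.2},
                          {"Name": "Boxing", "Confidence": 93.5}]},
    "post_b": {"Labels": [{"Name": "Person", "Confidence": 97.0},
                          {"Name": "Dog", "Confidence": 71.4}]},
}
columns, rows = labels_to_features(responses)
print(columns)
print(rows["post_b"])
```

Labels below the threshold (here `Dog`) never become columns, which is what keeps the feature matrix from growing with every low-confidence detection.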

In [18]:
# to remove any string objects for our ml_df... 
dtypes = model_data.dtypes
cols = model_data.columns
model_dtypes = model_data['postUrl'].dtypes

In [19]:
ml_cols = []

for col in cols:
    if not model_data[col].dtypes == model_dtypes:
        #print(col,'    ',model_data[col].dtypes)
        ml_cols.append(col)

#print(ml_cols)
ml_df = model_data[ml_cols].copy()  # .copy() so later assignments do not trigger SettingWithCopyWarning
ml_df['isSidecar'] = ml_df['isSidecar'].astype(int)

# Not sure about these Unnamed: 0 columns, I need to fix this in the input csv
#ml_df = ml_df.drop(columns=["Unnamed: 0.1",'Unnamed: 0.1.1','caption'])
ml_df = ml_df.drop(columns=['caption'])
ml_df.dtypes

commentCount                            int64
likeCount                               int64
location                              float64
locationId                            float64
isSidecar                               int64
                                       ...   
rekog_Nudity                          float64
rekog_Female_Swimwear_Or_Underwear    float64
rekog_Physical_Violence               float64
rekog_Sexual_Activity                 float64
rekog_Partial_Nudity                  float64
Length: 414, dtype: object

## Our numeric columns (mostly AWS Rekognition results) yield `414` ML feature columns

- Most of those columns are float, though they could probably be bool

- This also includes other numeric data like `commentCount` & `likeCount` (values as of the time of data collection, which used the PhantomBuster online paid / trial service for social media web scraping)
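
For reference, pandas can do this kind of column filtering in one call. The sketch below runs on a tiny made-up frame (column names borrowed from the data above) and assumes the goal is simply to keep numeric and boolean columns; it is not the cell used in this notebook.

```python
import pandas as pd

# Made-up stand-in for model_data: one string column plus numeric/bool columns.
demo = pd.DataFrame({
    "postUrl": ["https://example.com/p/1", "https://example.com/p/2"],
    "likeCount": [46171, 236320],
    "isSidecar": [True, False],
    "rekog_Person": [1.0, 0.0],
})

# Keep numeric and boolean columns; .copy() gives an independent frame to modify.
features = demo.select_dtypes(include=["number", "bool"]).copy()
features["isSidecar"] = features["isSidecar"].astype(int)
print(list(features.columns))
```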

In [20]:
# Notice the 71 columns which are 'rekog_*'; this cell previews the y_yes label column (1 = Kayfabe)
ml_df.y_yes.head()

0    0
1    0
2    1
3    1
4    1
Name: y_yes, dtype: int64

In [21]:
# Make sure there is the y_no column (the logical complement of y_yes)!
ml_df['y_yes'] = ml_df['y_yes'].astype(bool)
ml_df['y_no'] = ~ml_df['y_yes']  # ~ is the logical NOT for boolean Series
ml_df[['y_yes','y_no']] = ml_df[['y_yes','y_no']].astype(int)

In [22]:
ml_df[['y_yes','y_no']].head()

Unnamed: 0,y_yes,y_no
0,0,1
1,0,1
2,1,0
3,1,0
4,1,0


In [25]:
ml_df

Unnamed: 0,commentCount,likeCount,location,locationId,isSidecar,postId,viewCount,pubYear,pubMonth,pubDay,...,rekog_Revealing_Clothes,rekog_Illustrated_Nudity_Or_Sexual_Activity,rekog_Explicit_Nudity,rekog_Weapons,rekog_Nudity,rekog_Female_Swimwear_Or_Underwear,rekog_Physical_Violence,rekog_Sexual_Activity,rekog_Partial_Nudity,y_no
0,179,46171,,,1,2198809599234921728,209260.0,2019,12,14,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,179,46171,,,1,2198817979597060352,209260.0,2019,12,14,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,669,236320,,,0,2191057499675473152,,2019,12,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3,5438,338573,,,0,2080116756560285952,2892913.0,2019,7,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,1233,193068,,,1,1744058629194842624,,2018,3,27,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3488,2771,334771,,,0,1746129224303295232,1549081.0,2018,3,30,...,,,,,,,,,,1
3489,898,110991,,,0,1745855406591150080,930044.0,2018,3,29,...,,,,,,,,,,1
3490,4044,508364,,,0,1744404107067956992,,2018,3,27,...,,,,,,,,,,1
3491,806,219025,,,0,1744222310111710976,,2018,3,27,...,,,,,,,,,,1


In [26]:
# If you need clean export for reference
ml_df.to_csv('wwe_instagram_data/rondarousey_ml_df_clean042221.csv',header=False)

## *Split your data for testing your model*

<img alt='Ronda Rousey Meme Example' src='https://github.com/atnjqt/random_stuff/blob/master/ronda1.png?raw=true' border="100px" style="float: right; height: 300px">


**_TESTING_ --> ROMAN REIGNS** 
-  Originally tried `.7/.3` split for RomanReigns but this failed to give results ... 
- Changed to `.85/.15` since this is a small dataset for Roman!

**RONDA ROUSEY**

- we will do a 70/30 split which results in our subset for testing our model against the known values: 

```
print(train_data.shape, test_data.shape)
(1029, 415) (441, 415)
```




In [27]:
# Testing now on RondaRousey:
train_data, test_data = np.split(ml_df.sample(frac=1, random_state=1729), [int(0.70 * len(ml_df))])
print(train_data.shape, test_data.shape)

(1029, 415) (441, 415)
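
An equivalent split can be written without `np.split`, which can emit deprecation warnings on DataFrames in newer pandas releases. This is a sketch on a stand-in frame with the same row count (1,470); the `split_frame` helper is hypothetical and not part of the original notebook.

```python
import pandas as pd

def split_frame(df, train_frac=0.70, seed=1729):
    """Shuffle rows reproducibly, then cut at the train fraction (hypothetical helper)."""
    shuffled = df.sample(frac=1, random_state=seed)
    cut = int(train_frac * len(shuffled))
    return shuffled.iloc[:cut], shuffled.iloc[cut:]

# Stand-in data: same number of rows as ml_df, one dummy feature column.
demo = pd.DataFrame({"feature": range(1470)})
train_demo, test_demo = split_frame(demo)
print(train_demo.shape, test_demo.shape)
```

Slicing with `.iloc` keeps both halves as ordinary DataFrames, so the rest of the notebook would work unchanged.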


In [28]:
# Cleaned training data
train_data.head()

Unnamed: 0,commentCount,likeCount,location,locationId,isSidecar,postId,viewCount,pubYear,pubMonth,pubDay,...,rekog_Revealing_Clothes,rekog_Illustrated_Nudity_Or_Sexual_Activity,rekog_Explicit_Nudity,rekog_Weapons,rekog_Nudity,rekog_Female_Swimwear_Or_Underwear,rekog_Physical_Violence,rekog_Sexual_Activity,rekog_Partial_Nudity,y_no
2179,1124,144873,,,0,1966164915120956160,,2019,1,27,...,,,,,,,,,,0
2837,578,119544,,,1,1926958425292828160,,2018,12,4,...,,,,,,,,,,0
2978,591,109505,,,0,1892442285539815424,,2018,10,17,...,,,,,,,,,,0
2816,392,71790,,,1,2162876667814374144,343335.0,2019,10,26,...,,,,,,,,,,1
3062,1192,154360,,,1,1866253171647883520,,2018,9,11,...,,,,,,,,,,0


In [29]:
# Cleaned test data
test_data.head()

Unnamed: 0,commentCount,likeCount,location,locationId,isSidecar,postId,viewCount,pubYear,pubMonth,pubDay,...,rekog_Revealing_Clothes,rekog_Illustrated_Nudity_Or_Sexual_Activity,rekog_Explicit_Nudity,rekog_Weapons,rekog_Nudity,rekog_Female_Swimwear_Or_Underwear,rekog_Physical_Violence,rekog_Sexual_Activity,rekog_Partial_Nudity,y_no
2836,578,119544,,,1,1926958418355473408,,2018,12,4,...,,,,,,,,,,0
3468,711,166935,,,0,1752646971963523840,,2018,4,8,...,,,,,,,,,,0
174,1110,91815,,,0,1414740728248898816,,2016,12,27,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
493,429,96372,,,0,1023463777057916544,,2015,7,6,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3306,2435,199912,,,1,1804489316887917568,,2018,6,18,...,,,,,,,,,,0


## Prepare your training data for XGBoost AWS module

- I likely adapted these cells from the AWS example

- Version 2 of the SageMaker Python SDK renamed `sagemaker.s3_input` to `sagemaker.inputs.TrainingInput` (used below as `sagemaker.TrainingInput`), as described [here](https://stackoverflow.com/questions/64256639/syntaxerror-amazon-sagemaker-object-has-no-attribute)

#### More info on XGBoost algorithm here
- https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#InputOutput-XGBoost
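
As a rough illustration of the CSV layout the built-in XGBoost algorithm reads (label in the first column, no header row), here is a small stdlib-only sketch. The `to_xgboost_csv` helper and the sample rows are hypothetical; the notebook itself does this with `pd.concat(...).to_csv(..., header=False)`.

```python
import csv
import io

def to_xgboost_csv(rows, label_key):
    """Serialize dict rows as CSV with the label first and no header (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        # Every non-label value becomes a feature, in the row's existing key order.
        features = [value for key, value in row.items() if key != label_key]
        writer.writerow([row[label_key]] + features)
    return buf.getvalue()

# Made-up rows; y_yes is the label, as in this notebook.
sample = [
    {"likeCount": 46171, "isSidecar": 1, "y_yes": 0},
    {"likeCount": 236320, "isSidecar": 0, "y_yes": 1},
]
csv_text = to_xgboost_csv(sample, "y_yes")
print(csv_text)
```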

In [32]:
# Create train.csv based on the Kayfabe trained & cleaned csv
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train04222021.csv', index=False, header=False)

# upload the file to the correct bucket & directory for XGBoost training
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train04222021.csv')).upload_file('train04222021.csv')

# Input this training file from S3 bucket

s3_input_train = sagemaker.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')
#s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')


In [33]:
# Create SageMaker session
sess = sagemaker.Session()

# Configure XGBoost algorithm estimator compute resources
# (sagemaker>=2 renamed train_instance_count / train_instance_type to instance_count / instance_type)
xgb = sagemaker.estimator.Estimator(containers[my_region],role, instance_count=1, instance_type='ml.m4.xlarge',output_path='s3://{}/{}/output'.format(bucket_name, prefix),sagemaker_session=sess)

# Configure XGBoost algorithm estimator hyperparameters
# Values carried over from the AWS example without tuning; objective='binary:logistic' matches our binary YES/NO label
xgb.set_hyperparameters(max_depth=5,eta=0.2,gamma=4,min_child_weight=6,subsample=0.8,silent=0,objective='binary:logistic',num_round=100)


_______

# *UPDATE -->* Running your XGBoost Training Fit to create your Model


<img alt='Ronda Rousey Championship GIF' src='https://media.giphy.com/media/2gTPhMOpb5keJZpmcI/giphy.gif' border="100px" style="float: right;">

- *This is likely an iterative process for fine tuning hyperparameters! This simple notebook does not cover those steps*


- This will use the `s3_input_train` training input (pointing at our S3 bucket) created earlier in this notebook; the trained model is written to the estimator's `output_path`



- This is probably the most expensive of AWS resources we are running in this workflow. This outputs billable time:

```
Training seconds: 53
Billable seconds: 53
```

_____


        **--> RUN YOUR XGBOOST FIT 🚀**


In [35]:
xgb.fit({'train': s3_input_train})

2021-04-23 03:57:50 Starting - Starting the training job...
2021-04-23 03:57:52 Starting - Launching requested ML instancesProfilerReport-1619150270: InProgress
.........
2021-04-23 03:59:44 Starting - Preparing the instances for training......
2021-04-23 04:00:50 Downloading - Downloading input data
2021-04-23 04:00:50 Training - Downloading the training image...
2021-04-23 04:01:22 Uploading - Uploading generated training model.
2021-04-23 04:01:44 Completed - Training job completed
Arguments: train
[2021-04-23:04:01:17:INFO] Running standalone xgboost training.
[2021-04-23:04:01:17:INFO] Path /opt/ml/input/data/validation does not exist!
[2021-04-23:04:01:17:INFO] File size need to be processed in the node: 0.95mb. Available memory size in the node: 8410.63mb
[2021-04-23:04:01:17:INFO] Determined delimiter of CSV input is ','
[04:01:17] S3DistributionType set as FullyReplicated
[04:01:17] 1029x413 matrix with 424386 entries 

Training seconds: 53
Billable seconds: 53


## *UPDATED -->* Deploying your training model


- This will take a long time to run so be patient!
    - I had to stop / interrupt the kernel & refresh the notebook and rerun but the kernel did *not* restart ...



### *Tips & thoughts on scaling up* : Use a lower tier / cost resource for your test & dev...

Here we are using `ml.m4.xlarge`

- when your model is tuned you can scale up to a more expensive instance for a more computationally-powerful deployed model

- *Be sure to power off when finished!*
    - During initial testing for Roman Reigns I left multiple instances open and it cost us nearly $50 for the day ...
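
A back-of-the-envelope way to see how idle endpoints add up. The hourly rate below is a placeholder and not a quoted AWS price; check the SageMaker pricing page for the real rate of your instance type and region. The helper name is also just for illustration.

```python
def endpoint_cost(hours, instances, hourly_rate_usd):
    """Estimated charge for always-on endpoints: hours x instances x rate."""
    return hours * instances * hourly_rate_usd

PLACEHOLDER_RATE = 0.25  # hypothetical USD per hour, NOT a real AWS price

# Example: three endpoints left running for a full day at the placeholder rate.
estimate = endpoint_cost(hours=24, instances=3, hourly_rate_usd=PLACEHOLDER_RATE)
print(f"${estimate:.2f}")
```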


In [39]:
xgb_predictor = xgb.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')

---------------!

### *Run your testing data against your deployed `xgb_predictor` model*:

- We use this to check accuracy of our model on the test subset

In [41]:
from sagemaker.serializers import CSVSerializer # sagemaker>=2 replacement for csv_serializer

test_data_array = test_data.drop(['y_no', 'y_yes'], axis=1).values # load the data into an array

#xgb_predictor.content_type = 'text/csv' # set the data type for an inference -- no longer necessary
xgb_predictor.serializer = CSVSerializer() # set the serializer type

predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!

predictions_array = np.fromstring(predictions[1:], sep=',') # and turn the prediction into an array

print(predictions_array.shape)


(441,)
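
To make the post-processing above concrete: the endpoint returns one comma-separated string of probabilities, which is parsed into numbers and later rounded to 0/1 for the confusion matrix. A minimal stdlib sketch, with a made-up response string and hypothetical helper names, assuming a 0.5 cutoff:

```python
def parse_predictions(payload):
    """Parse a comma-separated string of scores into a list of floats."""
    return [float(item) for item in payload.strip().split(",") if item]

def to_labels(scores, cutoff=0.5):
    """Map each probability to 1 (Kayfabe) or 0 (No Kayfabe)."""
    return [1 if score > cutoff else 0 for score in scores]

payload = "0.91,0.12,0.55,0.03"  # made-up example, not real endpoint output
scores = parse_predictions(payload)
labels = to_labels(scores)
print(labels)
```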


### Finally, display testing result output in confusion matrix

- With `.7/.3` split for Roman Reigns, this was giving errors as there was only 1 column... Resolved by changing to `.85/.15` for small dataset.


- For Ronda Rousey, with around 1,500 training images, I was able to run the `.7/.3` split, which is demonstrated below:

In [42]:
confusion_matrix = pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions_array), rownames=['Observed'], colnames=['Predicted'])
tn = confusion_matrix.iloc[0,0]
fn = confusion_matrix.iloc[1,0]
tp = confusion_matrix.iloc[1,1]
fp = confusion_matrix.iloc[0,1]
p = (tp+tn)/(tp+tn+fp+fn)*100
print("\n{0:<20}{1:<4.1f}%\n".format("Overall Classification Rate: ", p))
print("-"*40)
print("{0:<15}{1:<15}{2:>8}".format("Predicted---->", "No Kayfabe", "Kayfabe"))
print("\nObserved")
print("{0:<15}{1:<2.0f}% ({2:<}){3:>6.0f}% ({4:<})".format("No Kayfabe", tn/(tn+fn)*100,tn, fp/(tp+fp)*100, fp))
print("{0:<16}{1:<1.0f}% ({2:<}){3:>7.0f}% ({4:<}) \n".format("Kayfabe", fn/(tn+fn)*100,fn, tp/(tp+fp)*100, tp))


Overall Classification Rate: 83.9%

----------------------------------------
Predicted----> No Kayfabe      Kayfabe

Observed
No Kayfabe     85% (175)    17% (41)
Kayfabe         15% (30)     83% (195) 
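
The counts in the table above also give per-class metrics. The sketch below recomputes the overall rate and adds precision, recall, and F1 for the Kayfabe class. The variable names mirror the cell above and the counts are copied from the printed output; the extra metrics are an addition for illustration and were not part of the original run.

```python
# Counts copied from the confusion matrix printed above.
tn, fp, fn, tp = 175, 41, 30, 195

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the posts predicted Kayfabe, the share that really were
recall = tp / (tp + fn)     # of the posts observed as Kayfabe, the share the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```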



## The XGBoost model results are not bad for this example

The model predicted Y/N for Kayfabe with an accuracy of 83.9% on this test run.

#### *Remember... I am not an expert on Kayfabe!*

<img alt='Ronda Rousey Mural Art' src='https://github.com/atnjqt/random_stuff/blob/master/ronda3.png?raw=true' border="100px" style="float: left;height: 450px">


## *MAKE SURE TO DELETE ENDPOINT WHEN YOU ARE DONE!!!*

You need to go in the SageMaker console to **Inference > Endpoints** https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/endpoints

- get the *Name* of your endpoint if you need to delete it manually, though you should be able to pass `xgb_predictor.endpoint_name` (called `xgb_predictor.endpoint` before sagemaker v2)


In [54]:
endpoint = 'xgboost-2021-04-23-04-31-02-286' # <--- manually removing this inference endpoint after testing
sagemaker.Session().delete_endpoint(endpoint)

In [48]:
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint_name) # <--- ran this twice so it threw an error second time
#bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
#bucket_to_delete.objects.all().delete()


ClientError: An error occurred (ValidationException) when calling the DeleteEndpoint operation: Could not find endpoint "arn:aws:sagemaker:us-east-1:064258348567:endpoint/xgboost-2021-04-23-04-45-33-247".



__________

## CONCLUSION

- This notebook demonstrates a simple example of a supervised ML predictor built on objects detected in social media images