# SageMaker Processing Script: MXNet and GluonTS

This notebook shows a basic example of using SageMaker Processing to create train, test, and validation datasets and write them back to S3.

In a nutshell, we will create an `MXNetProcessor` object, passing the MXNet version we want to use along with our managed infrastructure requirements.

For our use case, we will download a well-known, publicly available dataset called the [Numenta Anomaly Benchmark (NAB) dataset](https://github.com/numenta/NAB). It is composed of over 50 labeled real-world and artificial time series data files, plus a novel scoring mechanism designed for real-time applications.

In our example, we will use the volume of tweets for Amazon, and we will process this dataset to make it compatible with the MXNet GluonTS library.
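
GluonTS datasets are iterables of dictionary entries that carry at least a `start` timestamp and a `target` list of values. The sketch below uses only the standard library to show how rows of a two-column CSV could be folded into one such entry and serialized as a JSON Lines record. The column names `timestamp` and `value`, the `to_gluonts_record` helper, and the sample rows are assumptions for illustration; the processing script used later in this notebook may do this differently. The sketch also assumes the rows are already sorted and evenly spaced in time.

```python
import csv
import io
import json


def to_gluonts_record(csv_text, time_col="timestamp", value_col="value"):
    """Fold a two-column CSV into one GluonTS-style entry (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV contains no data rows")
    return {
        "start": rows[0][time_col],  # timestamp of the first observation
        "target": [float(row[value_col]) for row in rows],  # values in file order
    }


# Made-up sample rows in the assumed layout (not real NAB data).
sample = (
    "timestamp,value\n"
    "2015-02-26 21:42:53,57\n"
    "2015-02-26 21:47:53,43\n"
    "2015-02-26 21:52:53,55\n"
)
record = to_gluonts_record(sample)
json_line = json.dumps(record)  # one line of a JSON Lines dataset file
print(json_line)
```

Keeping the whole series in a single entry reflects the fact that this dataset holds one time series; a dataset with several series would produce one entry per series.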

In [None]:
!mkdir -p .data .output

In [None]:
import pandas as pd

# Get the Twitter volume dataset for AMZN
url = "https://raw.githubusercontent.com/numenta/NAB/master/data/realTweets/Twitter_volume_AMZN.csv"
df = pd.read_csv(url, header=0, index_col=0)
df.to_csv('.data/dataset.csv')

In [None]:
import sagemaker 

session = sagemaker.Session()
bucket = session.default_bucket()
key_prefix = 'frameworkprocessors/mxnet-example'

source_path = session.upload_data('.data', bucket=bucket, key_prefix=f'{key_prefix}/data')
source_path

## Create the script containing your Processing logic

This script is executed by Amazon SageMaker.

In its `main` function, the script performs the core operations: it reads and parses the arguments passed as parameters, unpacks the model file, loads the model, and then preprocesses the data, runs predictions, and postprocesses the results. Remember to write the data locally in the final step so that SageMaker can copy it to S3.

In [None]:
!pygmentize mxnet-gluonts-processing.py
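
As a rough illustration of the overall shape such a script can take, here is a minimal standard-library sketch that parses arguments, reads `dataset.csv` from the input directory, splits it chronologically, and writes the parts to the output directory. It is not the contents of `mxnet-gluonts-processing.py`: the argument names, the default split fractions, and the output file names are all assumptions. The default directories mirror the container paths used in the `run()` call of this notebook.

```python
import argparse
import csv
from pathlib import Path


def parse_args(argv=None):
    """Parse command-line arguments (names and defaults are hypothetical)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-dir", default="/opt/ml/processing/input/data")
    parser.add_argument("--output-dir", default="/opt/ml/processing/output")
    parser.add_argument("--train-frac", type=float, default=0.7)
    parser.add_argument("--validation-frac", type=float, default=0.15)
    return parser.parse_args(argv)


def split_rows(rows, train_frac, validation_frac):
    """Split rows in time order; whatever is left over becomes the test set."""
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * validation_frac)
    return {
        "train": rows[:n_train],
        "validation": rows[n_train:n_train + n_val],
        "test": rows[n_train + n_val:],
    }


def main(argv=None):
    args = parse_args(argv)

    # Read the CSV that SageMaker copied from S3 into the container.
    with open(Path(args.input_dir) / "dataset.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    # Write each split locally; SageMaker uploads the output directory to S3.
    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = split_rows(rows, args.train_frac, args.validation_frac)
    for name, part in parts.items():
        with open(out_dir / f"{name}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(part)


# In a real script, finish with the usual entry point:
#     if __name__ == "__main__":
#         main()
```

The split is chronological rather than random because shuffling a time series would leak future observations into the training set.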

## Create the SageMaker Processor

Once the data has been uploaded to S3, we can create the `MXNetProcessor` object. We specify the framework version, the Python version, a role with the permissions needed to read the dataset from S3, and the instances we plan to use for the processing job.

In [None]:
from sagemaker.mxnet import MXNetProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

mxp = MXNetProcessor(
    framework_version='1.8.0',
    py_version='py37',
    role=get_execution_role(), 
    instance_count=1,
    instance_type='ml.c5.xlarge',
    base_job_name='frameworkprocessor-mxnet'
)

All that's left is to `run()` the Processing job. We specify the Python script containing the transformation logic in the `code` argument, the folder containing its dependencies in the `source_dir` argument, and the `inputs` and `outputs` of the job.

Note: the folder given in the `source_dir` argument can contain a `requirements.txt` file listing the dependencies of our script. SageMaker Processing then installs those packages automatically by running `pip install -r requirements.txt` before launching the job itself.

In [None]:
mxp.run(
    code='mxnet-gluonts-processing.py',
    source_dir='.',
    inputs=[
        ProcessingInput(
            input_name='data', source=source_path,
            destination='/opt/ml/processing/input/data/'
        )
    ],
    outputs=[
        ProcessingOutput(output_name='processed_data', source='/opt/ml/processing/output/')
    ],
    logs=False
)

We can now check the results of our processing job, and list the outputs from S3.

In [None]:
output_path = mxp.latest_job.outputs[0].destination

!aws s3 ls --recursive $output_path
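
The output destination is an S3 URI. If you would rather inspect the results from Python than through the AWS CLI, a common first step is to split that URI into a bucket name and a key prefix, since S3 client calls typically take the two separately. The sketch below does this with the standard library only; `split_s3_uri` is a hypothetical helper and the URI is a made-up example, not the actual destination of this job.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into (bucket, prefix). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Made-up URI for illustration; in the notebook you would pass output_path.
bucket_name, prefix = split_s3_uri(
    "s3://example-bucket/jobs/demo/output/processed_data"
)
print(bucket_name, prefix)
```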