This will only run in SageMaker Studio.

### Downloading the CSV file

In [None]:
%%sh
# %%sh is a cell magic that runs the whole cell in a shell; otherwise each command below would need a leading !
wget -N https://github.com/h2oai/h2o-2/raw/master/smalldata/bank-additional-full.csv
# The download is already a plain CSV file, so no unzip step is needed.

### Reading the CSV file

In [None]:
import os
import pandas as pd

# The file is read with ';' as the field separator instead of the default ','.
df = pd.read_csv('bank-additional-full.csv', sep=';')
df.head(2)

### Uploading the dataset to an Amazon S3 bucket
We'll use the default bucket that SageMaker automatically creates in the region we're running in, and add a prefix to keep things tidy.

In [1]:
## Uploading the file to the bucket by hand is optional.
## If you run into an error, uncomment and run this cell as well.

#import sagemaker
#prefix = 'sagemaker/DEMO-smprocessing/input'
#input_data = sagemaker.Session().upload_data(path='./bank-additional-full.csv',
#                                             key_prefix=prefix)
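For orientation, the default bucket follows the naming pattern sagemaker-{region}-{account-id}, and upload_data places a single file at key_prefix/filename inside that bucket. The helper below is a hypothetical, offline sketch of how the resulting URI is put together. It makes no AWS calls, and the region and account ID are placeholders.

```python
import posixpath


def expected_upload_uri(region, account_id, key_prefix, local_path):
    """Build the S3 URI a single-file upload to the default bucket would get.

    Hypothetical helper for illustration only: it mirrors the naming
    convention and does not contact AWS.
    """
    bucket = f"sagemaker-{region}-{account_id}"
    filename = posixpath.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{filename}"


# Placeholder region and account ID, not real values.
uri = expected_upload_uri(
    region="us-east-1",
    account_id="123456789012",
    key_prefix="sagemaker/DEMO-smprocessing/input",
    local_path="./bank-additional-full.csv",
)
print(uri)
```

With credentials available, sagemaker.Session().default_bucket() returns the actual bucket name, so there is no need to assemble it by hand in the notebook.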

### Running a processing script:
We use the SKLearnProcessor object from the SageMaker SDK to configure the processing job:

In [7]:
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
sklearn_processor = SKLearnProcessor(
    framework_version='0.20.0',  # version of scikit-learn we want to use
    role=sagemaker.get_execution_role(),
    instance_type='ml.t3.medium',  # select the instance type of your choice
    instance_count=1)  # number of instances the job runs on
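The job launched in the next step runs a script named preprocessing.py, which this notebook does not show. As a rough idea of what such a script could contain, here is a minimal sketch. It assumes that the script only needs to split the rows into train and test sets, that the output file names train.csv and test.csv are acceptable (they are made up here), and that the standard library is enough. The author's actual script is likely to use pandas and scikit-learn and to do more feature processing.

```python
import argparse
import csv
import os
import random


def split_rows(rows, test_ratio, seed=0):
    """Shuffle rows deterministically and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def write_csv(path, header, rows):
    """Write header and rows as a comma-separated file, creating the folder."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def main(argv=None, base_dir="/opt/ml/processing"):
    # base_dir is a parameter only so the sketch can be tried outside a container.
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, default=0.2)
    args = parser.parse_args(argv)

    # The input file is copied under <base_dir>/input by the processing job.
    input_path = os.path.join(base_dir, "input", "bank-additional-full.csv")
    with open(input_path, newline="") as f:
        reader = csv.reader(f, delimiter=";")
        header = next(reader)
        rows = list(reader)

    train, test = split_rows(rows, args.train_test_split_ratio)

    # Files written under these folders are uploaded to S3 when the job ends.
    # The file names train.csv and test.csv are hypothetical.
    write_csv(os.path.join(base_dir, "train", "train.csv"), header, train)
    write_csv(os.path.join(base_dir, "test", "test.csv"), header, test)
    return len(train), len(test)
```

In the script file itself, main() would be called at the bottom under an `if __name__ == "__main__":` guard, so that the argument passed by the job is picked up from the command line.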

Then we launch the job, passing the name of the script (preprocessing.py), the dataset input path (a local file here, which the SDK uploads to S3 for us), the user-defined dataset paths inside the SageMaker Processing environment, and the command-line arguments:

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(
    code='preprocessing.py',
    inputs=[
        ProcessingInput(
            source='bank-additional-full.csv',
            # Where our data is placed inside the container
            destination='/opt/ml/processing/input'),
    ],
    outputs=[
        ProcessingOutput(
            source='/opt/ml/processing/train',
            output_name='train_data'),
        ProcessingOutput(
            source='/opt/ml/processing/test',
            output_name='test_data'),
    ],
    arguments=['--train-test-split-ratio', '0.2'])


Job Name:  sagemaker-scikit-learn-2021-03-06-11-00-50-019
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-603012210694/sagemaker-scikit-learn-2021-03-06-11-00-50-019/input/input-1/bank-additional-full.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-603012210694/sagemaker-scikit-learn-2021-03-06-11-00-50-019/input/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'train_data', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-603012210694/sagemaker-scikit-learn-2021-03-06-11-00-50-019/output/train_data', 'LocalPath': '/opt/ml/processin
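The Outputs list in the log pairs each output_name with the S3 location where its files end up. A small helper can pull those locations out for later steps. This is only a sketch: output_uris is a hypothetical name, and the sample data uses made-up bucket and job names in the same shape as the log.

```python
def output_uris(outputs):
    """Map each OutputName to its S3Uri from a job's Outputs list."""
    return {o["OutputName"]: o["S3Output"]["S3Uri"] for o in outputs}


# Placeholder data shaped like the Outputs line in the log;
# the bucket and job names are made up for illustration.
sample_outputs = [
    {"OutputName": "train_data", "AppManaged": False,
     "S3Output": {"S3Uri": "s3://example-bucket/example-job/output/train_data",
                  "LocalPath": "/opt/ml/processing/train"}},
    {"OutputName": "test_data", "AppManaged": False,
     "S3Output": {"S3Uri": "s3://example-bucket/example-job/output/test_data",
                  "LocalPath": "/opt/ml/processing/test"}},
]

uris = output_uris(sample_outputs)
print(uris["train_data"])
```

In the notebook, the real list should be available once the job finishes from sklearn_processor.jobs[-1].describe()['ProcessingOutputConfig']['Outputs'].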