# Train Using SageMaker Manifest File with Scikit-Learn Decision Tree Regressor

This is a sample Python program that trains a simple scikit-learn model on the California housing dataset, using a SageMaker manifest file to specify the input data.

For more details, refer to the documentation: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html


**This sample is provided for demonstration purposes. Make sure to conduct appropriate testing if you derive code from it for your own use cases!**

In [19]:
!pip install sagemaker -U -q


In [20]:
import datetime
import time
import tarfile

import boto3
import pandas as pd
import numpy as np
from sagemaker import get_execution_role
import sagemaker
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

sm_boto3 = boto3.client("sagemaker")
s3_boto3 = boto3.client('s3')
sess = sagemaker.Session()
role = get_execution_role()  # same function as sagemaker.get_execution_role, imported above
region = sess.boto_session.region_name
bucket = sess.default_bucket()  # this could also be a hard-coded bucket name

print(f"Using sagemaker version {sagemaker.__version__}")

Using sagemaker version 2.180.0


## SageMaker Training

### Launching a training job with the Python SDK using Manifest File

In [21]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point="scikit_learn_california.py",
    source_dir="code",
    framework_version="1.2-1",
    instance_type="ml.c5.xlarge",
    # keep_alive_period_in_seconds=3600,
    role=role,
    hyperparameters={"max_leaf_nodes": 30},
)

### Manifest File

The S3 location of a manifest might look like this: s3://bucketname/example.manifest

A manifest is an S3 object containing a JSON array. The first element specifies a prefix and is followed by one or more suffixes. SageMaker appends each suffix to the prefix to obtain the full set of S3Uri values. The prefix must be a valid, non-empty S3Uri, so a single manifest cannot reference objects stored in different S3 buckets.

The following code example shows a valid manifest format:
```
[
  {"prefix": "s3://customer_bucket/some/prefix/"},
  "relative/path/to/custdata-1",
  "relative/path/custdata-2",
  ...
  "relative/path/custdata-N"
]
```

This JSON is equivalent to the following S3Uri list:

- s3://customer_bucket/some/prefix/relative/path/to/custdata-1

- s3://customer_bucket/some/prefix/relative/path/custdata-2

- ...

- s3://customer_bucket/some/prefix/relative/path/custdata-N

The complete set of S3Uri in this manifest is the input data for the channel for this data source. The object that each S3Uri points to must be readable by the IAM role that SageMaker uses to perform tasks on your behalf.
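The prefix-plus-suffix rule described above can be sketched in a few lines of Python. This is only an illustration of the expansion, not SageMaker's own implementation; the helper name `expand_manifest` is hypothetical, and the sample manifest reuses the first two entries of the example above.

```python
def expand_manifest(manifest):
    """Return the full S3 URIs described by a parsed manifest (a Python list).

    Hypothetical helper for illustration: the first element must be an
    object with a "prefix" key, and every following element is a suffix.
    """
    if not manifest or not isinstance(manifest[0], dict) or "prefix" not in manifest[0]:
        raise ValueError("the first manifest element must be an object with a 'prefix' key")
    prefix = manifest[0]["prefix"]
    if not prefix.startswith("s3://"):
        raise ValueError("the prefix must be a non-empty S3 URI")
    # Each suffix is appended to the prefix as-is, with no separator added.
    return [prefix + suffix for suffix in manifest[1:]]


example_manifest = [
    {"prefix": "s3://customer_bucket/some/prefix/"},
    "relative/path/to/custdata-1",
    "relative/path/custdata-2",
]
example_uris = expand_manifest(example_manifest)
for uri in example_uris:
    print(uri)
```

Because the suffixes are concatenated directly, a prefix without a trailing slash would produce different object keys, so it is worth checking the expanded list before launching a job.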

Let's inspect our manifest file:

In [22]:
manifest_file_path = "s3://aws-ml-blog/artifacts/scikit-learn-training-w-manifest/california-housing-manifest.json"

In [23]:
import json

# Download the manifest object and parse it as JSON
s3_response = s3_boto3.get_object(
    Bucket="aws-ml-blog",
    Key="artifacts/scikit-learn-training-w-manifest/california-housing-manifest.json",
)
s3_object_body = s3_response.get("Body")
content_str = s3_object_body.read().decode()
json.loads(content_str)

[{'prefix': 's3://aws-ml-blog/artifacts/scikit-learn-training-w-manifest/'},
 'train/california_train.csv',
 'test/california_test.csv',
 'validation/california_validation.csv']

The manifest references three files: one for train, one for test, and one for validation.
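The `get_object` call needs the bucket and key separately, while the manifest location is given as a single S3 URI. As a small sketch, the two parts could be derived from the URI with the standard library instead of being typed twice. The helper name `split_s3_uri` is hypothetical, and the approach assumes the key contains no `?` or `#` characters, which `urlparse` would treat as query or fragment delimiters.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


manifest_bucket, manifest_key = split_s3_uri(
    "s3://aws-ml-blog/artifacts/scikit-learn-training-w-manifest/california-housing-manifest.json"
)
print(manifest_bucket)
print(manifest_key)
```

The resulting values can then be passed as `Bucket` and `Key` to `get_object`.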

Now let's inspect the files in the bucket:

In [24]:
!aws s3 ls s3://aws-ml-blog/artifacts/scikit-learn-training-w-manifest/ --recursive

2023-08-27 14:49:43          0 artifacts/scikit-learn-training-w-manifest/
2023-08-27 14:51:41        174 artifacts/scikit-learn-training-w-manifest/california-housing-manifest.json
2023-08-27 14:50:46     699696 artifacts/scikit-learn-training-w-manifest/test/california_test.csv
2023-08-27 14:50:49    3265248 artifacts/scikit-learn-training-w-manifest/train/california_train.csv
2023-08-27 15:30:50    3265248 artifacts/scikit-learn-training-w-manifest/train/california_train_1.csv
2023-08-27 15:30:51    3265248 artifacts/scikit-learn-training-w-manifest/train/california_train_2.csv
2023-08-27 15:30:52    3265248 artifacts/scikit-learn-training-w-manifest/train/california_train_3.csv
2023-08-27 14:50:44     699696 artifacts/scikit-learn-training-w-manifest/validation/california_validation.csv
2023-08-27 15:31:31     699696 artifacts/scikit-learn-training-w-manifest/validation/california_validation_1.csv
2023-08-27 15:31:34     699696 artifacts/scikit-learn-training-w-manifest/validation/

The listing shows that the bucket holds more files than the manifest references, including several files under train and validation.

During the SageMaker training job run, we expect to load only one file for train, one file for test, and one file for validation, as specified in the manifest file we reviewed previously.
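To make that expectation concrete, the following sketch compares the manifest with the bucket listing and separates the objects the manifest selects from the ones it leaves out. The function name `partition_by_manifest` is hypothetical. The manifest and the keys are copied from the outputs shown earlier, and the last listing entry is omitted because it appears truncated.

```python
def partition_by_manifest(manifest, bucket, listed_keys):
    """Split listed objects into (selected, skipped) according to the manifest.

    Hypothetical helper for illustration; it only compares strings and
    does not call any AWS API.
    """
    prefix = manifest[0]["prefix"]
    selected_uris = {prefix + suffix for suffix in manifest[1:]}
    # Keys ending in '/' are folder placeholders, not data files.
    listed_uris = [f"s3://{bucket}/{key}" for key in listed_keys if not key.endswith("/")]
    selected = [uri for uri in listed_uris if uri in selected_uris]
    skipped = [uri for uri in listed_uris if uri not in selected_uris]
    return selected, skipped


base = "artifacts/scikit-learn-training-w-manifest/"
manifest = [
    {"prefix": "s3://aws-ml-blog/" + base},
    "train/california_train.csv",
    "test/california_test.csv",
    "validation/california_validation.csv",
]
listed_keys = [
    base,
    base + "california-housing-manifest.json",
    base + "test/california_test.csv",
    base + "train/california_train.csv",
    base + "train/california_train_1.csv",
    base + "train/california_train_2.csv",
    base + "train/california_train_3.csv",
    base + "validation/california_validation.csv",
    base + "validation/california_validation_1.csv",
]
selected, skipped = partition_by_manifest(manifest, "aws-ml-blog", listed_keys)
print("selected:", selected)
print("skipped:", skipped)
```

With these inputs, only the three files named in the manifest end up in `selected`; the extra train and validation files, along with the manifest object itself, fall into `skipped`.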

In [25]:
inputs = sagemaker.inputs.TrainingInput(
    s3_data=manifest_file_path, s3_data_type="ManifestFile"
)

In [26]:
sklearn_estimator.fit(inputs)

INFO:sagemaker:Creating training-job with name: sagemaker-scikit-learn-2023-08-27-15-50-07-529


Using provided s3_resource
2023-08-27 15:50:08 Starting - Starting the training job...
2023-08-27 15:50:23 Starting - Preparing the instances for training......
2023-08-27 15:51:18 Downloading - Downloading input data...
2023-08-27 15:52:04 Training - Training image download completed. Training in progress..
2023-08-27 15:52:08,097 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2023-08-27 15:52:08,100 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-08-27 15:52:08,107 sagemaker_sklearn_container.training INFO     Invoking user training script.
2023-08-27 15:52:08,304 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-08-27 15:52:08,315 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-08-27 15:52:08,327 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)