## Loading the data

In [155]:
import pandas as pd
df = pd.read_excel("crops.xlsx")

In [156]:
from sklearn.model_selection import train_test_split

In [157]:
train, test_and_validate = train_test_split(df, 
                                            test_size=0.2, 
                                            random_state=42)

In [158]:
test, validate = train_test_split(test_and_validate, 
                                  test_size=0.5, 
                                  random_state=42)
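
The two calls above give roughly an 80/10/10 split between training, test, and validation data. As a rough check, the sketch below reproduces the size arithmetic that `train_test_split` applies when `test_size` is a float (the held-out part is rounded up). The helper name `split_sizes` and the row count of 1000 are illustrative assumptions, not values from this notebook.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Mirrors the rounding used by scikit-learn: the held-out part is
    ceil(n_rows * test_size) and the rest stays in the first split.
    """
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

# Hypothetical row count, used only to illustrate the proportions.
n_train, n_rest = split_sizes(1000, 0.2)   # first split: 80% / 20%
n_test, n_validate = split_sizes(n_rest, 0.5)  # second split halves the 20%
print(n_train, n_test, n_validate)
```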

## Uploading to Amazon S3

The training job reads its datasets from Amazon S3, so we upload each split there first.



In [159]:
import boto3
import io
import os

In [173]:
bucket='c93435a2086654l5083652t1w4297761148-sandboxbucket-1mais214ncuy7'
prefix='crops'
train_file='train.csv'
test_file='test.csv'
validate_file='validate.csv'
whole_file='crops.csv'
s3_resource = boto3.Session().resource('s3')

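def upload_s3_csv(filename, folder, dataframe):
    # Write the DataFrame to an in-memory CSV (no header row, no index column).
    csv_buffer = io.StringIO()
    dataframe.to_csv(csv_buffer, header=False, index=False)
    # S3 keys always use forward slashes, so build the key explicitly.
    key = f"{prefix}/{folder}/{filename}"
    s3_resource.Bucket(bucket).Object(key).put(Body=csv_buffer.getvalue())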

upload_s3_csv(train_file, 'train', train)
upload_s3_csv(test_file, 'test', test)
upload_s3_csv(validate_file, 'validate', validate)
upload_s3_csv(whole_file,"crops", df)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
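
One thing worth checking before uploading: for CSV input, the built-in SageMaker XGBoost algorithm expects no header row and treats the first column as the target. The DataFrame preview later in this notebook appears to start with two leftover index columns (`Unnamed: 0.1`, `Unnamed: 0`), so uploading it unchanged would likely make an index column the label. The sketch below computes a column order with the target first. The helper name `xgboost_column_order` is hypothetical, and choosing `yield` as the target is an assumption.

```python
def xgboost_column_order(columns, target):
    """Return column names with the target first and index leftovers dropped."""
    # Columns named "Unnamed: ..." are usually saved DataFrame indexes.
    kept = [c for c in columns if not str(c).startswith("Unnamed")]
    if target not in kept:
        raise ValueError(f"target column {target!r} not found")
    return [target] + [c for c in kept if c != target]

# Column names copied from the DataFrame preview; "yield" as target is an assumption.
example_columns = ["Unnamed: 0.1", "Unnamed: 0", "year", "hectares", "production", "yield"]
ordered = xgboost_column_order(example_columns, "yield")
print(ordered)
```

Under these assumptions, each split could be reordered before upload with something like `train = train[xgboost_column_order(train.columns, "yield")]`.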


## Creating the estimator

Now that the data is in Amazon S3, we can train the model.

The first step is to get the XGBoost container URI.

In [161]:
from sagemaker.image_uris import retrieve
import sagemaker
role=sagemaker.get_execution_role()
s3_output_location="s3://{}/{}/output/".format(bucket,prefix)
container = retrieve('xgboost',boto3.Session().region_name,'1.0-1')

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


INFO:sagemaker.image_uris:Defaulting to only available Python version: py3
INFO:sagemaker.image_uris:Defaulting to only supported image scope: cpu.


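The hyperparameters are minimal: *num_round* sets the number of boosting rounds to *40*, and *objective* is set to *reg:linear* because this is a regression problem. Newer XGBoost versions call this objective *reg:squarederror*.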

In [162]:
hyperparams = {
    "num_round": "40",
    "objective": "reg:linear"
}

Use the `Estimator` class to set up the model. Here are a few parameters of interest:

- **instance_count** - Defines how many instances will be used for training. We will use one instance.
- **instance_type** - Defines the instance type for training. In this case, it's ml.m5.xlarge.

In [163]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator


In [164]:
# Set up SageMaker
sagemaker_session = sagemaker.Session()
role = get_execution_role()
bucket = 'c93435a2086654l5083652t1w4297761148-sandboxbucket-1mais214ncuy7'
prefix = 'crops'

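# Specify your container (image) for XGBoost.
# Note: get_image_uri is deprecated in SageMaker SDK v2 (see the message in the output below),
# and this assignment overrides the `container` retrieved earlier with image_uris.retrieve.
container = get_image_uri(sagemaker_session.boto_region_name, 'xgboost')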


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


In [165]:
# Initialize and configure the XGBoost estimator
xgb_model = Estimator(container,
                      role,
                      instance_count=1,
                      instance_type='ml.m5.xlarge',
                      output_path=f's3://{bucket}/{prefix}/output',
                      sagemaker_session=sagemaker_session,
                      hyperparameters=hyperparams)


## Creating the input channels

The estimator needs *channels* to feed data into the model. For training, the *train_channel* and the *validate_channel* will be used.

In [166]:
# Each channel points at an S3 prefix (folder), not at a single file.
train_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/".format(bucket, prefix),
    content_type='text/csv')

validate_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/".format(bucket, prefix),
    content_type='text/csv')

data_channels = {'train': train_channel, 'validation': validate_channel}
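
With the default settings, a `TrainingInput` treats its S3 location as a prefix and reads every object under it, which is why the channels point at the `train/` and `validate/` folders. A minimal sketch of building such URIs consistently is shown below. The helper name `s3_uri` and the bucket name `my-bucket` are hypothetical.

```python
import posixpath

def s3_uri(bucket, *parts, trailing_slash=True):
    """Build an s3:// URI from a bucket name and key segments."""
    # posixpath always joins with "/", which is what S3 keys expect.
    key = posixpath.join(*parts)
    uri = f"s3://{bucket}/{key}"
    if trailing_slash and not uri.endswith("/"):
        uri += "/"
    return uri

# "my-bucket" is a placeholder, not the bucket used in this notebook.
print(s3_uri("my-bucket", "crops", "train"))
print(s3_uri("my-bucket", "crops", "train", "train.csv", trailing_slash=False))
```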

## Training the model

Running `fit` will train the model.

**Note:** This process can take up to 5 minutes.

In [167]:
df

Unnamed: 0.1,Unnamed: 0,year,hectares,production,yield,country_Argentina,country_Australia,country_Austria,country_Belgium,country_Brazil,...,country_region_valencia,country_region_xinjiang,country_region_yunnan,country_region_zhejiang,country_region_Île-de-France,crop_cereals,crop_maize,crop_spring wheat,crop_wheat,crop_winter wheat
0,0,1902,285197.421053,695060.488176,1.310000,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
1,1,1903,285197.421053,695060.488176,1.470000,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
2,2,1904,285197.421053,695060.488176,1.270000,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
3,3,1905,285197.421053,695060.488176,1.330000,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
4,4,1906,285197.421053,695060.488176,1.280000,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35824,36702,2013,75520.000000,278300.000000,3.685117,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
35825,36703,2014,82120.000000,309500.000000,3.768875,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
35826,36704,2015,89800.000000,351300.000000,3.912027,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
35827,36705,2016,76590.000000,253900.000000,3.315054,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0


In [168]:
xgb_model.fit(inputs=data_channels, logs=False)

INFO:sagemaker:Creating training-job with name: xgboost-2023-11-05-19-09-25-411



2023-11-05 19:09:25 Starting - Starting the training job....
2023-11-05 19:09:54 Starting - Preparing the instances for training..........
2023-11-05 19:10:46 Downloading - Downloading input data.....
2023-11-05 19:11:17 Training - Downloading the training image..
2023-11-05 19:11:32 Training - Training image download completed. Training in progress...
2023-11-05 19:11:47 Uploading - Uploading generated training model.
2023-11-05 19:11:59 Completed - Training job completed


## Viewing the metrics from the training job

After the job is complete, you can view the metrics from the training job.

In [169]:
# Use the public latest_training_job attribute rather than the private _current_job_name.
s = sagemaker.analytics.TrainingJobAnalytics(xgb_model.latest_training_job.name,
                                            metric_names=['train:rmse', 'validation:rmse']
)

s_df = s.dataframe()
s_df = s_df.iloc[:, 1:3]
s_df

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


Unnamed: 0,metric_name,value
0,train:rmse,669.6315
1,validation:rmse,669.476
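
Both metrics are root mean squared errors, expressed in the same units as the target column. The sketch below shows how that number is computed. The function name `rmse` and the sample values are illustrative only and do not come from the dataset.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only: two exact predictions and one that is off by 2.
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(example_rmse)
```

If the intended target is `yield`, whose values in the preview range from about 1 to 4, an RMSE near 669 is far too large to be plausible. That would suggest the model was trained against a different first column, such as a leftover index.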


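## Deploying the model

Calling `deploy` creates a real-time endpoint that serves the trained model.

In [207]: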
xgb_predictor = xgb_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large'
)

INFO:sagemaker:Creating model with name: xgboost-2023-11-05-22-29-11-715
INFO:sagemaker:Creating endpoint-config with name xgboost-2023-11-05-22-29-11-715
INFO:sagemaker:Creating endpoint with name xgboost-2023-11-05-22-29-11-715


----!

After this point we couldn't figure out how to make predictions.
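
One possible way forward, sketched under assumptions: the endpoint accepts `text/csv` requests containing feature rows only (no label, no header, same column order as the training data after the first column) and returns the scores as text. The helper names `rows_to_csv`, `parse_predictions`, and `predict_rows` are hypothetical. `CSVSerializer` comes from the SageMaker Python SDK v2, and the separator used in the response (commas or newlines) is an assumption, so the parser accepts both.

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize feature rows (no label, no header) into a CSV payload."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue().rstrip("\n")

def parse_predictions(body):
    """Parse an endpoint response into a list of floats.

    Assumes one score per input row, separated by commas or newlines.
    """
    text = body.decode("utf-8") if isinstance(body, bytes) else body
    return [float(v) for v in text.replace("\n", ",").split(",") if v.strip()]

def predict_rows(predictor, rows):
    """Send feature rows to a deployed endpoint and return parsed scores."""
    # Imported here so the helpers above work without the SageMaker SDK.
    from sagemaker.serializers import CSVSerializer
    predictor.serializer = CSVSerializer()  # sends the payload as text/csv
    return parse_predictions(predictor.predict(rows_to_csv(rows)))
```

In this notebook, a call might look like `predict_rows(xgb_predictor, test.iloc[:5, 1:].values.tolist())`, which drops the first (label) column from five test rows. When the endpoint is no longer needed, `xgb_predictor.delete_endpoint()` removes it so it stops incurring charges.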