## Multi-class classification with Linear Learner: training and validation

Steps: (1) load the dataset from S3 into the notebook; (2) clean, transform, analyze, and prepare the dataset; (3) create and train a model with the Linear Learner algorithm. The dataset used for training and validation is the same as in the XGBoost notebook in this repository.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import io
from datetime import datetime

import boto3, sagemaker
from sagemaker import get_execution_role
import sagemaker.amazon.common as smac


### Step 1: Load the dataset from S3 onto the notebook

In [2]:
role = get_execution_role()
bucket = 'ml-projects-bl'
sub_folder = 'ufo_dataset'
data_key = 'ufo_fullset.csv'
data_location = 's3://{}/{}/{}'.format(bucket, sub_folder, data_key)

df = pd.read_csv(data_location, low_memory=False)
df.head()

Unnamed: 0,reportedTimestamp,eventDate,eventTime,shape,duration,witnesses,weather,firstName,lastName,latitude,longitude,sighting,physicalEvidence,contact,researchOutcome
0,1977-04-04T04:02:23.340Z,1977-03-31,23:46,circle,4,1,rain,Ila,Bashirian,47.329444,-122.578889,Y,N,N,explained
1,1982-11-22T02:06:32.019Z,1982-11-15,22:04,disk,4,1,partly cloudy,Eriberto,Runolfsson,52.664913,-1.034894,Y,Y,N,explained
2,1992-12-07T19:06:52.482Z,1992-12-07,19:01,circle,49,1,clear,Miller,Watsica,38.951667,-92.333889,Y,N,N,explained
3,2011-02-24T21:06:34.898Z,2011-02-21,20:56,disk,13,1,partly cloudy,Clifton,Bechtelar,41.496944,-71.367778,Y,N,N,explained
4,1991-03-09T16:18:45.501Z,1991-03-09,11:42,circle,17,1,mostly cloudy,Jayda,Ebert,47.606389,-122.330833,Y,N,N,explained


### Step 2: Clean, Transform, Analyze, and Prepare the dataset

First, check whether there are any missing values.

In [3]:
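# True if any cell in the DataFrame is missing
missing_values = df.isnull().values.any()
if missing_values:
    # Show only the rows that contain at least one missing value
    display(df[df.isnull().any(axis=1)])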

Unnamed: 0,reportedTimestamp,eventDate,eventTime,shape,duration,witnesses,weather,firstName,lastName,latitude,longitude,sighting,physicalEvidence,contact,researchOutcome
1024,2011-03-23T18:32:20.473Z,2011-03-22,21:12,,3,1,rain,Deon,Feil,37.681944,-121.766944,Y,N,N,explained
2048,1998-04-23T18:47:16.029Z,1998-04-23,10:07,,40,2,partly cloudy,Vincenzo,Rohan,38.254167,-85.759444,Y,Y,N,explained
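
A per-column count can be easier to read than a single boolean. The following is a minimal sketch on a small hypothetical frame (`toy` is made up for illustration and is not the UFO dataset); it shows how many values are missing in each column and which rows are affected.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the UFO dataset.
toy = pd.DataFrame({
    'shape': ['circle', np.nan, 'disk', np.nan],
    'duration': [4, 3, 49, 40],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Rows that contain at least one missing value.
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```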


There are 2 records with a missing 'shape' value. Next, check which shapes occur and how often.

In [4]:
df['shape'].value_counts()

circle      6047
disk        5920
light       1699
square      1662
triangle    1062
sphere      1020
box          200
oval         199
pyramid      189
Name: shape, dtype: int64

Replace the missing shape values with the most common shape, which is circle.

In [5]:
df['shape'] = df['shape'].fillna(df['shape'].value_counts().index[0])
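
The fill above works because `value_counts()` sorts by frequency in descending order and ignores missing values by default, so its first index entry is the most common shape. A small sketch on a hypothetical series (`shapes` is made up for illustration) shows the idea, together with `mode()` as an alternative that gives the same answer when there is a single most common value.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing entries.
shapes = pd.Series(['circle', 'disk', 'circle', np.nan, 'light', np.nan])

# value_counts() sorts by frequency (descending) and skips NaN by default,
# so the first index entry is the most common non-missing value.
most_common = shapes.value_counts().index[0]

# mode() also skips NaN by default; [0] picks the first mode.
most_common_via_mode = shapes.mode()[0]

filled = shapes.fillna(most_common)
print(filled.tolist())
```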

Now start the data transformation: (1) convert reportedTimestamp and eventDate to datetime; (2) convert shape and weather to the category data type; (3) map physicalEvidence and contact from 'Y'/'N' to 1/0; (4) convert researchOutcome, the target attribute, to the category data type.

In [6]:
df['reportedTimestamp'] = pd.to_datetime(df['reportedTimestamp'])
df['eventDate'] = pd.to_datetime(df['eventDate'])

In [7]:
df['shape'] = df['shape'].astype('category')
df['weather'] = df['weather'].astype('category')

In [8]:
df['physicalEvidence'] = df['physicalEvidence'].replace({'Y':1, 'N':0})
df['contact'] = df['contact'].replace({'Y':1, 'N':0})
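
One thing to watch with this kind of mapping is values other than 'Y' and 'N'. As a hedged alternative sketch (not what this notebook runs), `Series.map` turns any unmapped value into NaN, which makes unexpected inputs easy to detect. The `flags` series is hypothetical.

```python
import pandas as pd

# Hypothetical Y/N column.
flags = pd.Series(['Y', 'N', 'N', 'Y'])

# Series.map returns NaN for any value that is not a key of the dict,
# so unexpected inputs show up as missing values.
encoded = flags.map({'Y': 1, 'N': 0})

if encoded.isnull().any():
    raise ValueError('Found values other than Y or N')

encoded = encoded.astype('int64')
print(encoded.tolist())
```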

In [9]:
df['researchOutcome'] = df['researchOutcome'].astype('category')

In [14]:
df.dtypes

reportedTimestamp    datetime64[ns, UTC]
eventDate                 datetime64[ns]
eventTime                         object
shape                           category
duration                           int64
witnesses                          int64
weather                         category
firstName                         object
lastName                          object
latitude                         float64
longitude                        float64
sighting                          object
physicalEvidence                   int64
contact                            int64
researchOutcome                 category
dtype: object

Now drop the columns that are not useful: (1) drop sighting because it is always 'Y'; (2) drop firstName and lastName because they have no bearing on researchOutcome; (3) drop reportedTimestamp because the time at which a sighting was reported does not help determine its legitimacy; (4) drop eventDate and eventTime. If they were unevenly distributed, grouping them into buckets (e.g., seasons) might help, but since they are fairly evenly distributed, they can be dropped too.

In [10]:
df.drop(columns=['firstName', 'lastName', 'sighting', 'reportedTimestamp', 'eventDate', 'eventTime'], inplace=True)
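
The claim that the event dates are evenly distributed can be checked before dropping the column. A minimal sketch of one such check follows, using only the five eventDate values from the first rows shown earlier (so the shares printed here say nothing about the full dataset); on the real data the same expression would be applied to `df['eventDate']` before the drop.

```python
import pandas as pd

# The five eventDate values from the df.head() output above.
event_dates = pd.Series(pd.to_datetime([
    '1977-03-31', '1982-11-15', '1992-12-07', '2011-02-21', '1991-03-09',
]))

# Share of sightings that fall in each calendar month (1 = January).
month_share = event_dates.dt.month.value_counts(normalize=True).sort_index()
print(month_share)
```

Roughly equal shares across months would support dropping the column; a strong skew would suggest bucketing instead.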

In [11]:
df.head()

Unnamed: 0,shape,duration,witnesses,weather,latitude,longitude,physicalEvidence,contact,researchOutcome
0,circle,4,1,rain,47.329444,-122.578889,0,0,explained
1,disk,4,1,partly cloudy,52.664913,-1.034894,1,0,explained
2,circle,49,1,clear,38.951667,-92.333889,0,0,explained
3,disk,13,1,partly cloudy,41.496944,-71.367778,0,0,explained
4,circle,17,1,mostly cloudy,47.606389,-122.330833,0,0,explained


Now encode the categorical values: (1) apply one-hot encoding to both the weather and shape attributes; (2) map researchOutcome (the target) to numerical values.

In [12]:
df = pd.get_dummies(df, columns=['weather', 'shape'])

In [13]:
df['researchOutcome'] = df['researchOutcome'].replace({'unexplained':0, 'explained':1, 'probable':2})

In [14]:
display(df.head())


Unnamed: 0,duration,witnesses,latitude,longitude,physicalEvidence,contact,researchOutcome,weather_clear,weather_fog,weather_mostly cloudy,...,weather_stormy,shape_box,shape_circle,shape_disk,shape_light,shape_oval,shape_pyramid,shape_sphere,shape_square,shape_triangle
0,4,1,47.329444,-122.578889,0,0,1,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,4,1,52.664913,-1.034894,1,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,49,1,38.951667,-92.333889,0,0,1,1,0,0,...,0,0,1,0,0,0,0,0,0,0
3,13,1,41.496944,-71.367778,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,17,1,47.606389,-122.330833,0,0,1,0,0,1,...,0,0,1,0,0,0,0,0,0,0
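
To make the effect of `get_dummies` concrete, here is a small sketch on a hypothetical three-row frame. Each distinct value becomes its own indicator column named `<column>_<value>`. The `dtype=int` argument is an addition for this example: without it, recent pandas versions return boolean columns instead of 0/1 integers.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column.
toy = pd.DataFrame({
    'duration': [4, 49, 17],
    'weather': ['rain', 'clear', 'mostly cloudy'],
})

# One indicator column per distinct weather value, as 0/1 integers.
encoded = pd.get_dummies(toy, columns=['weather'], dtype=int)
print(encoded.columns.tolist())
```

Exactly one indicator per original column is set in every row, which is what "one-hot" refers to.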


In [15]:
display(df.shape)

(18000, 23)

Now prepare the data for training and validating the Linear Learner algorithm. Randomize and split the data for training, validation, and testing: (1) randomize the data; (2) use 80% for training, 10% for validation during training, and 10% for testing the model after it is deployed.

In [16]:
# Seed NumPy's random number generator so the random split below is reproducible.
# Optionally, shuffle the rows first:
# df = df.sample(frac=1).reset_index(drop=True)
np.random.seed(0)

In [17]:
# Split the data for training, validation and testing.
# One uniform random number is drawn per row; its value decides which subset the row joins.
rand_split = np.random.rand(len(df))
train_list = rand_split < 0.8
val_list = (rand_split >= 0.8) & (rand_split < 0.9)
test_list = rand_split >= 0.9

In [18]:
data_train = df[train_list]
data_val = df[val_list]
data_test = df[test_list]
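
Because each row is assigned by an independent random draw, the subsets come out close to 80/10/10 but not exactly. The sketch below repeats the same masking logic on its own, assuming 18000 rows as reported by `df.shape`, and checks that every row lands in exactly one subset.

```python
import numpy as np

np.random.seed(0)
n_rows = 18000  # same number of rows as the prepared DataFrame above

rand_split = np.random.rand(n_rows)
train_mask = rand_split < 0.8
val_mask = (rand_split >= 0.8) & (rand_split < 0.9)
test_mask = rand_split >= 0.9

# Every row lands in exactly one subset.
membership = train_mask.astype(int) + val_mask.astype(int) + test_mask.astype(int)
assert (membership == 1).all()

# The fractions are close to, but not exactly, 0.8 / 0.1 / 0.1.
print(train_mask.mean(), val_mask.mean(), test_mask.mean())
```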

Now move the target attribute researchOutcome to the first column, as the AWS documentation requires for CSV input. The features and the target are then split into separate NumPy arrays, which are written in recordIO-protobuf format and uploaded to S3 in the following steps.

In [19]:
# Move researchOutcome to the first column of each subset
cols = list(data_train)
cols.insert(0, cols.pop(cols.index('researchOutcome')))
data_train = data_train[cols]

cols = list(data_val)
cols.insert(0, cols.pop(cols.index('researchOutcome')))
data_val = data_val[cols]

cols = list(data_test)
cols.insert(0, cols.pop(cols.index('researchOutcome')))
data_test = data_test[cols]

# Split each subset into a feature array (X) and a target array (y), both numpy.ndarray
train_X = data_train.drop(columns='researchOutcome').values
train_y = data_train['researchOutcome'].values

val_X = data_val.drop(columns='researchOutcome').values
val_y = data_val['researchOutcome'].values

test_X = data_test.drop(columns='researchOutcome').values
test_y = data_test['researchOutcome'].values
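
The same reordering is repeated for all three subsets, so it could be wrapped in a small helper. The following is only a sketch: `target_first` is a hypothetical function introduced here for illustration, and `toy` is a made-up frame.

```python
import pandas as pd

# Hypothetical toy frame with the target in the middle.
toy = pd.DataFrame({
    'duration': [4, 49],
    'researchOutcome': [1, 0],
    'witnesses': [1, 2],
})

def target_first(frame, target='researchOutcome'):
    """Return frame with the target column moved to position 0."""
    cols = [target] + [c for c in frame.columns if c != target]
    return frame[cols]

reordered = target_first(toy)

# Feature matrix and target vector as NumPy arrays.
X = reordered.drop(columns='researchOutcome').values
y = reordered['researchOutcome'].values
print(reordered.columns.tolist(), X.shape, y.shape)
```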

Now create a recordIO-protobuf file for the training data and upload it to S3.

In [20]:
train_file = 'sightings_train_recordIO_protobuf.data'

f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, train_X.astype('float32'), train_y.astype('float32'))
f.seek(0)

boto3.Session().resource('s3').Bucket(bucket).Object('algorithms_lab_922/linearlearner_train/{}'.format(train_file)).upload_fileobj(f)
training_recordIO_protobuf_location = 's3://{}/algorithms_lab_922/linearlearner_train/{}'.format(bucket, train_file)
print('The Pipe mode recordIO protobuf training data: {}'.format(training_recordIO_protobuf_location))

The Pipe mode recordIO protobuf training data: s3://ml-projects-bl/algorithms_lab_922/linearlearner_train/sightings_train_recordIO_protobuf.data


Now create a recordIO-protobuf file for the validation data and upload it to S3.

In [21]:
validation_file = 'sightings_validatioin_recordIO_protobuf.data'

f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, val_X.astype('float32'), val_y.astype('float32'))
f.seek(0)

boto3.Session().resource('s3').Bucket(bucket).Object('algorithms_lab_922/linearlearner_validation/{}'.format(validation_file)).upload_fileobj(f)
validate_recordIO_protobuf_location = 's3://{}/algorithms_lab_922/linearlearner_validation/{}'.format(bucket, validation_file)
print('The Pipe mode recordIO protobuf validation data: {}'.format(validate_recordIO_protobuf_location))


The Pipe mode recordIO protobuf validation data: s3://ml-projects-bl/algorithms_lab_922/linearlearner_validation/sightings_validatioin_recordIO_protobuf.data
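
In both upload cells the object key and the S3 URI are formatted separately from the same pieces, which makes it easy for the two to drift apart. One way to keep them in sync is sketched below; `s3_key` and `s3_uri` are hypothetical helpers introduced for this example and are not part of boto3 or the SageMaker SDK.

```python
def s3_key(prefix, channel, filename):
    """Build the object key used for the upload."""
    return '{}/{}/{}'.format(prefix, channel, filename)

def s3_uri(bucket, key):
    """Build the s3:// URI that is later passed to the training job."""
    return 's3://{}/{}'.format(bucket, key)

# Values copied from the cells above.
bucket = 'ml-projects-bl'
prefix = 'algorithms_lab_922'

train_key = s3_key(prefix, 'linearlearner_train', 'sightings_train_recordIO_protobuf.data')
train_uri = s3_uri(bucket, train_key)
print(train_uri)
```

The key would be passed to `Bucket(bucket).Object(...)` and the URI to the data channel, so both are always derived from the same inputs.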


### Step 3: Create and train Linear Learner model

Now get the ECR-hosted container image for the Linear Learner algorithm.

In [22]:
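# Note: get_image_uri belongs to version 1 of the SageMaker Python SDK; in version 2
# the equivalent call is sagemaker.image_uris.retrieve('linear-learner', region).
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(boto3.Session().region_name, 'linear-learner', "1")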

In [23]:
# Create a training job name
job_name = 'linear-learner-job-{}'.format(datetime.now().strftime("%Y%m%d%H%M%S"))
print('Here is the job name {}'.format(job_name))

# Here is where the model-artifact will be stored
output_location = 's3://{}/algorithms_lab_922/linearlearner_output'.format(bucket)

Here is the job name linear-learner-job-20190923213051
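
The timestamp suffix keeps job names unique across runs. Training job names are also restricted to letters, digits, and hyphens with a length limit, so a prefix with an underscore, for example, would be rejected. The sketch below validates a name before use; `make_job_name` is a hypothetical helper, and the pattern is my understanding of the `TrainingJobName` constraint in the CreateTrainingJob API, so treat it as an assumption and check the current API reference.

```python
import re
from datetime import datetime

# Assumed constraint for TrainingJobName (letters, digits, hyphens; at most 63 characters).
JOB_NAME_PATTERN = re.compile(r'^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$')

def make_job_name(prefix, now=None):
    """Build '<prefix>-<YYYYmmddHHMMSS>' and fail early if the name is invalid."""
    now = now or datetime.now()
    name = '{}-{}'.format(prefix, now.strftime('%Y%m%d%H%M%S'))
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError('Invalid training job name: {}'.format(name))
    return name

# Fixed timestamp so the example is reproducible.
example = make_job_name('linear-learner-job', datetime(2019, 9, 23, 21, 30, 51))
print(example)  # linear-learner-job-20190923213051
```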


Now build the model with the SageMaker Python SDK, passing in everything that is required to create a Linear Learner model.

First, create a specific job name (done in the cell above).

Then specify the training parameters:

- The Linear Learner container
- The IAM role to use
- Training instance type and count
- S3 location for output data/model artifact
- The input type (Pipe)
- Linear Learner hyperparameters

Finally, once everything is in place, call the .fit() function, passing the S3 locations of the training and validation data.

In [24]:
print('The feature_dim hyperparameter needs to be set to {}.'.format(data_train.shape[1] - 1))

The feature_dim hyperparameter needs to be set to 22.
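
The value 22 is the 23 columns of the prepared DataFrame minus the researchOutcome target. Rather than hard-coding it, the value can be read from the feature array itself. A minimal sketch follows, using zero-filled stand-ins for `train_X` and `train_y` with the same number of feature columns.

```python
import numpy as np

# Hypothetical stand-ins for train_X and train_y: 5 rows, 22 feature columns.
train_X = np.zeros((5, 22), dtype='float32')
train_y = np.zeros(5, dtype='float32')

# Features and labels must describe the same number of rows.
assert train_X.shape[0] == train_y.shape[0]

# The number of feature columns is what feature_dim expects.
feature_dim = train_X.shape[1]
print(feature_dim)  # 22
```

Passing `feature_dim=train_X.shape[1]` to `set_hyperparameters` would then stay correct if the set of one-hot columns changes.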


In [None]:
sess = sagemaker.Session()

# Set up the Linear Learner algorithm from the ECR container
linear = sagemaker.estimator.Estimator(container,
                                       role,
                                       train_instance_count=1,
                                       train_instance_type='ml.c4.xlarge',
                                       output_path=output_location,
                                       sagemaker_session=sess,
                                       input_mode='Pipe')
# Set up the hyperparameters
linear.set_hyperparameters(feature_dim=22, # number of attributes (excluding the researchOutcome target)
                           predictor_type='multiclass_classifier', # type of classification problem
                           num_classes=3)  # number of classes in researchOutcome (explained, unexplained, probable)


# Launch a training job. This method calls the CreateTrainingJob API.
data_channels = {
    'train': training_recordIO_protobuf_location,
    'validation': validate_recordIO_protobuf_location
}
linear.fit(data_channels, job_name=job_name)

In [26]:
print('Here is the location of the trained Linear Learner model: {}/{}/output/model.tar.gz'.format(output_location, job_name))

Here is the location of the trained Linear Learner model: s3://ml-projects-bl/algorithms_lab_922/linearlearner_output/linear-learner-job-20190923213051/output/model.tar.gz
