# House Price Model Using AWS SageMaker

This notebook uses a Random Forest regression to predict house prices. AWS SageMaker runs the training with the scikit-learn library.
The trained model is stored in an S3 bucket.

## Setup
This section creates the SageMaker session object used to train the model and create a prediction endpoint. It also loads the libraries used in the notebook and sets up the S3 bucket.

1. Create the AWS SageMaker Session

In [3]:
# S3 prefix
prefix = 'Scikit-house'

import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()





2. Load the libraries

In [4]:
import numpy as np
import sklearn.cluster
import pickle
import gzip
import urllib.request
import json
#import mxnet as mx
import boto3
import time
import io
import os
## New for this Project
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split





## Data 

3. Download the data from the S3 bucket and copy it into the '/home/ec2-user/SageMaker/data' folder. This folder is on the notebook's EC2 instance.

In [5]:
s3 = boto3.Session().resource('s3')

print (sagemaker_session.default_bucket())

sagemaker-us-east-2-029880428228


In [6]:

s3.meta.client.download_file('mlbuckethose', 'ml_data/DataForModeling.csv', '/home/ec2-user/SageMaker/data/DataForModeling.csv')
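The download above writes into `/home/ec2-user/SageMaker/data`, and `download_file` is expected to fail if that directory does not exist yet. A minimal sketch of a guard follows; `ensure_parent_dir` is a hypothetical helper name, not part of boto3 or of the original notebook.

```python
import os


def ensure_parent_dir(file_path):
    """Create the directory that will hold file_path, if it is missing.

    Hypothetical helper for illustration; returns the directory path.
    """
    parent = os.path.dirname(os.path.abspath(file_path))
    os.makedirs(parent, exist_ok=True)  # no error if it already exists
    return parent
```

Calling `ensure_parent_dir('/home/ec2-user/SageMaker/data/DataForModeling.csv')` before the download should make the cell safe to re-run on a fresh notebook instance.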

In [7]:
WORK_DIRECTORY = 'data'

train_input = sagemaker_session.upload_data(WORK_DIRECTORY, key_prefix="{}/{}".format(prefix, WORK_DIRECTORY) )
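`upload_data` copies the contents of the local `data` folder to the session's default bucket under the given key prefix and returns the S3 location, which is what `train_input` holds. The sketch below only illustrates how that location is composed. `expected_s3_uri` is a hypothetical helper, and the bucket name is the one printed earlier in this notebook.

```python
def expected_s3_uri(bucket, prefix, work_directory):
    """Compose the S3 URI a directory upload is expected to land under.

    Hypothetical helper: mirrors the key_prefix built in the cell above.
    """
    key_prefix = "{}/{}".format(prefix, work_directory)
    return "s3://{}/{}".format(bucket, key_prefix)


uri = expected_s3_uri("sagemaker-us-east-2-029880428228", "Scikit-house", "data")
print(uri)  # s3://sagemaker-us-east-2-029880428228/Scikit-house/data
```

Comparing this string with the value of `train_input` is a quick sanity check that the training channel points where you expect.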

## Create the SKLearn Object
The appendix contains a Python script that fits the model with scikit-learn, using the data that was downloaded from the S3 bucket. 
The `SKLearn` object calls this script and runs it on a training instance that exists only while the model is being trained. Once the model is fit, the compute instance is released.

1. Set the SKLearn object and set the path for the Model Creation Script

In [8]:
from sagemaker.sklearn.estimator import SKLearn

script_path = 'sklearn-randomforest.py'

sklearn = SKLearn(
    entry_point=script_path,
    train_instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=sagemaker_session
    )

2. Fit the model  
This is where the model is trained and saved for future deployment.
After the fit is complete, a model will be stored under the 'Model' section in AWS SageMaker.

In [9]:
sklearn.fit({'train': train_input})

INFO:sagemaker:Creating training-job with name: sagemaker-scikit-learn-2019-04-08-03-12-08-484


2019-04-08 03:12:08 Starting - Starting the training job...
2019-04-08 03:12:10 Starting - Launching requested ML instances......
2019-04-08 03:13:37 Starting - Preparing the instances for training......
2019-04-08 03:14:38 Downloading - Downloading input data
2019-04-08 03:14:38 Training - Training image download completed. Training in progress.
2019-04-08 03:14:38 Uploading - Uploading generated training model.
2019-04-08 03:14:33,873 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2019-04-08 03:14:33,875 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-04-08 03:14:33,887 sagemaker_sklearn_container.training INFO     Invoking user training script.
2019-04-08 03:14:34,100 sagemaker-containers INFO     Module sklearn-randomforest does not provide a setup.py. 
Generating setup.py
2019-04-08 03:14:34,100 sagemaker-containers INFO     Generating setup.cfg
2019-

## Using the trained model to make inference requests <a class="anchor" id="inference"></a>

### Deploy the model <a class="anchor" id="deploy"></a>

Deploying the model to SageMaker hosting just requires a `deploy` call on the fitted model. This call takes an instance count and instance type. 

The returned predictor object is tied to the endpoint that this call creates in AWS SageMaker.

In [10]:
predictor = sklearn.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

INFO:sagemaker:Creating model with name: sagemaker-scikit-learn-2019-04-08-03-12-08-484
INFO:sagemaker:Creating endpoint with name sagemaker-scikit-learn-2019-04-08-03-12-08-484


---------------------------------------------------------------!

### Choose some data and use it for a prediction <a class="anchor" id="prediction_request"></a>

In order to do some predictions, we'll extract some of the data we used for training and do predictions against it. This is, of course, bad statistical practice, but a good way to see how the mechanism works.

1. Load data for testing the process.

In [11]:
df = pd.read_csv('/home/ec2-user/SageMaker/data/DataForModeling.csv')
df = df[:50]
#df.info()

In [12]:
df.head(1).values

array([[    60,   8450,      7,      5,   2003,   2003,    706,      0,
           150,    856,    856,    854,      0,   1710,      1,      0,
             2,      1,      3,      1,      8,      0,      2,    548,
             0,     61,      0,      0,      0,      0,      0,      2,
          2008, 208500]])

2. Split the data into features and the dependent variable.

In [13]:
if "SalePrice" in df.columns:
    yData = df.SalePrice
    del df["SalePrice"]
xData = df
X_train, X_test, y_train, y_test = train_test_split(xData,yData)

3. Verify some of the loaded data and ensure it does not contain the home price.

In [14]:
#df = pd.read_csv('/home/ec2-user/SageMaker/data/DataForModeling.csv')
X_test.values[1]

array([  20, 7560,    5,    6, 1958, 1965,  504,    0,  525, 1029, 1339,
          0,    0, 1339,    0,    0,    1,    0,    3,    1,    6,    0,
          1,  294,    0,    0,    0,    0,    0,    0,    0,    5, 2009])

4. Pass the values to the predictor object to obtain home prices, and compare them against the real values.

In [15]:
print(predictor.predict(X_test))
print(y_test.values)

[111400.  141000.  210350.  149097.5  77540.  140886.   88380.  285390.
 181950.  153450.  129590.  154700.  146050. ]
[118000 139000 208500 149350  68500 144000  82000 277500 129900 140000
 129500 153000 145000]
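To put a number on how close the predictions are, the two arrays printed above can be compared with simple error metrics. This is a minimal sketch using only the standard library. The values are copied from the output above, and the function names are illustrative.

```python
predicted = [111400.0, 141000.0, 210350.0, 149097.5, 77540.0, 140886.0, 88380.0,
             285390.0, 181950.0, 153450.0, 129590.0, 154700.0, 146050.0]
actual = [118000, 139000, 208500, 149350, 68500, 144000, 82000,
          277500, 129900, 140000, 129500, 153000, 145000]


def mean_absolute_error(y_true, y_pred):
    """Average of |prediction - actual| over all rows."""
    return sum(abs(p - a) for a, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    """Average of |prediction - actual| / actual, as a percentage."""
    return 100.0 * sum(abs(p - a) / a for a, p in zip(y_true, y_pred)) / len(y_true)


mae = mean_absolute_error(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)
print("MAE:  {:.2f}".format(mae))
print("MAPE: {:.2f}%".format(mape))
```

Because these rows were also part of the training data, the figures are likely optimistic compared with what held-out data would show.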


## Invoking the End Point
This code can be used in a Lambda function to make predictions against the endpoint.

In [16]:
runtime= boto3.client('runtime.sagemaker')
linear_endpoint='sagemaker-scikit-learn-2019-04-08-03-12-08-484'
payload_file = io.StringIO()
X_test[:2].to_csv(payload_file, header = None, index = None)

# The next line is used for testing
#testLoad= "20,7560,5,6,1958,1965,504,0,525,1029,1339,0,0,1339,0,0,1,0,3,1,6,0,1,294,0,0,0,0,0,0,0,5,2009" + "\n" + "20,7560,5,6,1958,1965,504,0,525,1029,1339,0,0,1339,0,0,1,0,3,1,6,0,1,294,0,0,0,0,0,0,0,5,2009"
            
#payload = X_test.values[1]
response = runtime.invoke_endpoint(EndpointName=linear_endpoint,
                                   ContentType = 'text/csv',
                                   Body= payload_file.getvalue())
import json
result = json.loads(response['Body'].read().decode())
print(result)
print(payload_file.getvalue())


[111400.0, 141000.0]
190,7420,5,6,1939,1950,851,0,140,991,1077,0,0,1077,1,0,1,0,2,2,5,2,1,205,0,4,0,0,0,0,0,1,2008
20,7560,5,6,1958,1965,504,0,525,1029,1339,0,0,1339,0,0,1,0,3,1,6,0,1,294,0,0,0,0,0,0,0,5,2009
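The cell above could be wrapped in a Lambda handler along the following lines. This is a sketch under stated assumptions, not a tested deployment: the event shape (`event["rows"]` as a list of feature lists), the `ENDPOINT_NAME` environment variable, and the helper names are all hypothetical. The runtime client is passed into `predict_prices` so that the request logic can be exercised without AWS access.

```python
import csv
import io
import json
import os


def rows_to_csv(rows):
    """Serialize feature rows to headerless CSV, one house per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def predict_prices(runtime, endpoint_name, rows):
    """Send rows to the endpoint and return the list of predicted prices."""
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=rows_to_csv(rows),
    )
    return json.loads(response["Body"].read().decode())


def lambda_handler(event, context):
    """Hypothetical Lambda entry point.

    Assumes event["rows"] is a list of feature lists and that the endpoint
    name is supplied in an ENDPOINT_NAME environment variable.
    """
    import boto3  # imported here so the helpers above can be tested offline

    runtime = boto3.client("sagemaker-runtime")
    prices = predict_prices(runtime, os.environ["ENDPOINT_NAME"], event["rows"])
    return {"statusCode": 200, "body": json.dumps({"predictions": prices})}
```

The Lambda execution role would also need permission to invoke the endpoint, and each row must list the features in the same column order as the training data.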



### Endpoint cleanup <a class="anchor" id="endpoint_cleanup"></a>

When you're done with the endpoint, you'll want to clean it up.

In [17]:
sklearn.delete_endpoint()

INFO:sagemaker:Deleting endpoint with name: sagemaker-scikit-learn-2019-04-08-03-12-08-484


## Appendix: Script to be used by SageMaker

## Create a Scikit-learn script to train with <a class="anchor" id="create_sklearn_script"></a>
SageMaker can now run a scikit-learn script using the `SKLearn` estimator.

```python
from __future__ import print_function

import argparse
import os
import pandas as pd
from sklearn import svm
from sklearn.externals import joblib
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
#from sklearn.cross_validation import train_test_split


if __name__ == '__main__':
    parser = argparse.ArgumentParser()


    # Sagemaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])

    args = parser.parse_args()

    # Take the set of files and read them all into a single pandas dataframe
    input_files = [ os.path.join(args.train, file) for file in os.listdir(args.train) ]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))
    raw_data = [ pd.read_csv(file,  engine="python") for file in input_files ]
    train_data = pd.concat(raw_data)
    
    # Separate the target (SalePrice) from the feature columns.
    if "SalePrice" not in train_data.columns:
        raise ValueError("Training data must contain a 'SalePrice' column.")
    yData = train_data.pop("SalePrice")
    xData = train_data

    # No hyperparameters are passed in; RandomForestRegressor uses its defaults.
    # Note that you can add as many as your training may require in the ArgumentParser above.

    # Now use scikit-learn's random forest regressor to train the model.
    rf = RandomForestRegressor()
    rf.fit(xData, yData)

    # Save the fitted model to the directory SageMaker collects model artifacts from
    joblib.dump(rf, os.path.join(args.model_dir, "model.joblib"))


def model_fn(model_dir):
    """Deserialize and return the fitted model
    
    Note that this should have the same name as the serialized model in the main method
    """
    rf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return rf
```
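The script above trains `RandomForestRegressor()` with its default settings. SageMaker passes estimator hyperparameters to the entry point as command-line arguments, so they can be read with the same `ArgumentParser`. The sketch below assumes two hypothetical hyperparameters, `--n-estimators` and `--max-depth`, which mirror `RandomForestRegressor` parameters but are not part of the original script. The fallback directory names `'model'` and `'data'` are also assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse SageMaker directories plus two illustrative hyperparameters."""
    parser = argparse.ArgumentParser()

    # Hypothetical hyperparameters, not part of the original script.
    parser.add_argument('--n-estimators', type=int, default=100)
    parser.add_argument('--max-depth', type=int, default=None)

    # SageMaker-specific arguments; os.environ.get lets the script also run
    # outside a training container, where these variables are not set.
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', 'model'))
    parser.add_argument('--train', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', 'data'))

    return parser.parse_args(argv)


args = parse_args(['--n-estimators', '200'])
print(args.n_estimators, args.max_depth)  # 200 None
```

The parsed values could then be passed on as `RandomForestRegressor(n_estimators=args.n_estimators, max_depth=args.max_depth)` and supplied from the notebook through the estimator's `hyperparameters` argument.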