This Python code trains a simple random forest machine learning model, makes a prediction on
a new row of data, and then saves the model to S3.
Your task is to import a previously trained model that is saved in that same S3 bucket and
use it to make a prediction on the same row of data. Submit the prediction confidence as your
answer to this Data Analytics Jam Challenge.

CLF requirements: <br/>
-T2 Micro running with the AWS Deep Learning AMI<br/>
-An S3 bucket with a random bucket name<br/>
-Role assigned to the instance with both read and write permissions for the bucket <br/>
-A previously trained model saved in the S3 bucket

In [41]:
#import the required packages
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import seaborn.apionly as sns
import sklearn
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint as sp_randint
%matplotlib inline
import pickle

#load a default seaborn dataset, display simple stats about data size, and then print the data's head
df = pd.DataFrame(sns.load_dataset('iris'))
print 'shape of the data frame'+str(df.shape)
print '\nWe have an even spread of iris flower types'
print df.groupby(['species']).size()
print'\nDisplay ten random rows from the iris dataset'
df.sample(n=10)

shape of the data frame(150, 5)

We have an even spread of iris flower types
species
setosa        50
versicolor    50
virginica     50
dtype: int64

Display ten random rows from the iris dataset


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
62,6.0,2.2,4.0,1.0,versicolor
64,5.6,2.9,3.6,1.3,versicolor
137,6.4,3.1,5.5,1.8,virginica
76,6.8,2.8,4.8,1.4,versicolor
106,4.9,2.5,4.5,1.7,virginica
77,6.7,3.0,5.0,1.7,versicolor
126,6.2,2.8,4.8,1.8,virginica
107,7.3,2.9,6.3,1.8,virginica
147,6.5,3.0,5.2,2.0,virginica
10,5.4,3.7,1.5,0.2,setosa


In [42]:
#let's group setosa and virginica together for the sake of this machine learning exercise
df['y']= np.where(df['species']=='versicolor',1,0)
print df.groupby(['y']).size()
print 'we now have 50 versicolors and 100 non-versicolors'

X=df.drop('species',1).drop('y',1)
y=df['y']
df.sample(n=10)

y
0    100
1     50
dtype: int64
we now have 50 versicolors and 100 non-versicolors


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,y
23,5.1,3.3,1.7,0.5,setosa,0
75,6.6,3.0,4.4,1.4,versicolor,1
138,6.0,3.0,4.8,1.8,virginica,0
4,5.0,3.6,1.4,0.2,setosa,0
30,4.8,3.1,1.6,0.2,setosa,0
115,6.4,3.2,5.3,2.3,virginica,0
0,5.1,3.5,1.4,0.2,setosa,0
41,4.5,2.3,1.3,0.3,setosa,0
69,5.6,2.5,3.9,1.1,versicolor,1
38,4.4,3.0,1.3,0.2,setosa,0


In [45]:
#Initialize the random forest machine learning algorithm object
RANDOM_STATE=0
forest = RandomForestClassifier(n_estimators = 500, random_state=RANDOM_STATE, oob_score=True)

#Train the random forest model on the data
forest_model = forest.fit(X,y)

#use the forest model to make a prediction on a new row of data
#define a new array with the order of 'sepal_length', 'sepal_width', 'petal_length', and 'petal_width'
new_flower=[[5.6,3.2,4.1,1.4]]
print new_flower
prediction=forest_model.predict(new_flower)
print "\na prediction of '1' is for versicolor. '0' is for prediction of non-versicolor"
print prediction

#print 'This is the prediction confidence for the forest_model on that row of data being a versicolor iris.\
# It is an example of what you should submit for the challenge answer: '+str(prediction_proba[0,1])+'\n'
prediction_proba=forest_model.predict_proba(new_flower)
print '\nthe confidence of the prediction'
print prediction_proba[0,1]

#save (pickle) your model to disk and then to S3
local_path = "/home/ubuntu" # temp local path to export your model
bucket_name = "pickledemo" # S3 bucket name string to save your model
filename = 'finalized_model.sav'
#open the file in binary write mode so it is closed once the model is dumped
with open(filename, 'wb') as model_file:
    pickle.dump(forest, model_file)

#you should now see your finalized_model.sav object in the file path
#the ls command prints the contents of this notebook's folder
print "\nlist of the objects in this jupyter notebook's folder"
!ls
 
# Upload the saved model to S3 (pickle files are binary, so open with 'rb')
import boto3
s3 = boto3.resource('s3')
s3.Bucket(bucket_name).put_object(Key=filename, Body=open(filename, 'rb'))

[[5.6, 3.2, 4.1, 1.4]]

a prediction of '1' is for versicolor. '0' is for prediction of non-versicolor
[1]

the confidence of the prediction
0.984

list of the objects in this jupyter notebook's folder
finalized_model.sav   Model Pickle Saved in S3.ipynb  seaborn-data	  src
finalized_model.sav?  OpenJupyterNotebook.ipynb       solution_model.sav


s3.Object(bucket_name='pickledemo', key='finalized_model.sav')
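
The save step above depends on pickle data being treated as binary from start to finish. Below is a minimal sketch of that round trip. It assumes a plain dict as a hypothetical stand-in for the trained forest and an in-memory buffer in place of the local file and S3 object, so it is not the notebook's actual upload path.

```python
import io
import pickle

# Hypothetical stand-in for a trained model; any picklable object
# goes through the same serialize/deserialize steps.
model = {"n_estimators": 500, "random_state": 0}

# Serialize to bytes. Pickle output is binary, which is why files
# holding it should be opened with 'wb' for writing and 'rb' for reading.
payload = pickle.dumps(model)

# A binary file-like object, standing in for the file handle
# that would be written to disk or handed to an upload call.
buffer = io.BytesIO(payload)

# Deserialize from the binary stream to get an equal object back.
restored = pickle.load(buffer)
print(restored == model)  # True
```

Only unpickle data from a source you trust, since loading a pickle can execute arbitrary code.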

In [46]:
'''
Your task is to import the other random forest model that has already been saved in the same S3 bucket
and use it to make a prediction on the following new_flower2 row of data. This saved model is called
solution_model.sav.

Use the predict_proba method on that imported model. Submit the first three significant figures as your answer.
For example, if the new model's predict_proba output was 0.12345, you should submit 0.123.
If applicable, truncate the last decimal place rather than rounding it.

Keep in mind that these pickled machine learning models can also be deployed to AWS Lambda functions!
'''
new_flower2=[[4.8,3.4,1.6,0.2]]
print new_flower2

[[4.8, 3.4, 1.6, 0.2]]
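
The answer format described in this cell (three significant figures, truncated rather than rounded) can be sketched as follows. The helper name `truncate_sig_figs` is hypothetical and not part of the notebook, and the sketch assumes that "significant figures" means digits counted from the first non-zero digit.

```python
from decimal import Decimal, ROUND_DOWN

def truncate_sig_figs(value, figures=3):
    """Truncate (do not round) value to the given number of significant figures."""
    # Go through str() so the Decimal matches the printed float.
    d = Decimal(str(value))
    if d == 0:
        return d
    # adjusted() is the exponent of the leading digit, e.g. -1 for 0.12345.
    exponent = d.adjusted() - (figures - 1)
    quantum = Decimal(1).scaleb(exponent)
    # ROUND_DOWN drops the extra digits instead of rounding them.
    return d.quantize(quantum, rounding=ROUND_DOWN)

print(truncate_sig_figs(0.12345))  # 0.123
print(truncate_sig_figs(0.9876))   # 0.987
```

Plain rounding would turn 0.9876 into 0.988, which is why the sketch uses `ROUND_DOWN`.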
