### Reading the csv data file and creating a data-frame called iris

In [35]:
import boto3
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn import svm

# Defining the s3 bucket
s3 = boto3.resource('s3')
bucket_name = 'gabriel-predictive-analytics'
bucket = s3.Bucket(bucket_name)

# Defining the file to be read from s3 bucket
file_key = "Iris.csv"

bucket_object = bucket.Object(file_key)
file_object = bucket_object.get()
file_content_stream = file_object.get('Body')

# Reading the csv file
iris = pd.read_csv(file_content_stream)
iris.head(1)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa


### Looking at the frequency table of the Species variable.

In [41]:
# Frequency table of the variable Species
iris['Species'].value_counts()

Iris-virginica     50
Iris-setosa        50
Iris-versicolor    50
Name: Species, dtype: int64
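
The counts above are absolute frequencies. As a minimal sketch, they can be turned into relative frequencies with the `normalize` argument of `value_counts`. The `species` series below is a hypothetical stand-in for `iris['Species']`, built with 50 rows per class to match the counts shown.

```python
import pandas as pd

# Hypothetical stand-in for iris['Species']: 50 rows per class, as in the counts above
species = pd.Series(
    ['Iris-setosa'] * 50 + ['Iris-versicolor'] * 50 + ['Iris-virginica'] * 50,
    name='Species',
)

# normalize=True returns each class's share of the rows instead of its count
rel_freq = species.value_counts(normalize=True)
print(rel_freq)
```

Each class should come out at one third of the rows, which confirms that the data set is balanced.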

### Creating Species_numb variable to store numeric classes

In [37]:
iris['Species_numb'] = np.where(iris['Species'] == 'Iris-virginica', 1,
                               np.where(iris['Species'] == 'Iris-versicolor', 2, 3))
iris.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,Species_numb
0,1,5.1,3.5,1.4,0.2,Iris-setosa,3
1,2,4.9,3.0,1.4,0.2,Iris-setosa,3
2,3,4.7,3.2,1.3,0.2,Iris-setosa,3
3,4,4.6,3.1,1.5,0.2,Iris-setosa,3
4,5,5.0,3.6,1.4,0.2,Iris-setosa,3
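
The nested `np.where` call can also be written as an explicit lookup, which may be easier to read when there are several classes. This is only a sketch of an alternative: `demo` and `species_codes` are hypothetical names, and the dictionary reproduces the coding used above (virginica = 1, versicolor = 2, setosa = 3).

```python
import pandas as pd

# Small hypothetical frame standing in for iris
demo = pd.DataFrame(
    {'Species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa']}
)

# Same coding as the np.where call above
species_codes = {'Iris-virginica': 1, 'Iris-versicolor': 2, 'Iris-setosa': 3}
demo['Species_numb'] = demo['Species'].map(species_codes)
print(demo)
```

One difference to keep in mind: with `map`, a label that is missing from the dictionary becomes `NaN`, whereas the nested `np.where` silently assigns it to class 3.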


### Let's use SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm as the predictor variables and Species_numb as the target variable.

### Let's then split the data into two data-frames (taking into account the proportion of each species class): train (80%) and test (20%).

In [47]:
# Defining the input and target variables
X = iris[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
Y = iris['Species_numb']

# Splitting the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify = Y)

# Scaling the variables to the [0, 1] range to help the support vector machine model
scaler = MinMaxScaler()

# Fit the scaler on the train data only, then apply the same scaling to the test data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
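
A scaler is usually fitted on the train data only and then reused on the test data, so that both sets are expressed on the same scale and no information from the test set leaks into preprocessing. A minimal sketch on made-up one-column arrays (the values are illustrative, not taken from the iris data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values, not taken from the iris data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=1 and max=3 from train
test_scaled = scaler.transform(test)        # reuses the train min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the test data is scaled with the train minimum and maximum, a test value can fall outside [0, 1] (here 4.0 maps to 1.5). That is expected and is the price of keeping both sets on one scale.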

### Using the train data-frame, let's build a multi-class classification model with the one-vs-all strategy and the random forest model (with 500 trees and the maximum depth of each tree equal to 3).

### Then, let's use this model to make predictions on the test data and compare these predictions with the actual values using the confusion matrix method.

In [48]:
# Building Random Forest Classifier model
one_vs_all_RF = OneVsRestClassifier(estimator = RandomForestClassifier(n_estimators = 500, max_depth = 3)).fit(X_train, Y_train)

# Predicting on test dataset
one_vs_all_RF_pred = one_vs_all_RF.predict_proba(X_test)
# Columns follow the sorted class labels (1, 2, 3), so column index + 1 gives the label
one_vs_all_RF_pred = np.argmax(one_vs_all_RF_pred, axis = 1) + 1

# Confusion Matrix
confusion_matrix(Y_test, one_vs_all_RF_pred)

array([[10,  0,  0],
       [ 1,  9,  0],
       [ 0,  0, 10]])
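
To make the matrix above easier to read, here is a small sketch that derives the overall accuracy and the per-class recall from it. In scikit-learn's `confusion_matrix`, rows are the actual classes and columns are the predicted classes, both in sorted label order (1, 2, 3). The names `cm`, `accuracy`, and `recall_per_class` are illustrative.

```python
import numpy as np

# Confusion matrix reported above (rows: actual class, columns: predicted class)
cm = np.array([[10, 0, 0],
               [1, 9, 0],
               [0, 0, 10]])

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()

# Share of each actual class that was predicted correctly
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(accuracy)
print(recall_per_class)
```

With the coding used earlier, the second row is Iris-versicolor, so the single error is one versicolor flower predicted as virginica: 29 of the 30 test observations are classified correctly.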

### Repeating the same process, but now using a Support Vector Machine model (with the kernel set to rbf)

In [49]:
# Building Support Vector Machine Classifier model
one_vs_all_svc = OneVsRestClassifier(estimator = svm.SVC(kernel = 'rbf', probability = True)).fit(X_train, Y_train)

# Predicting on test dataset
one_vs_all_svm_pred = one_vs_all_svc.predict_proba(X_test)
one_vs_all_svm_pred = np.argmax(one_vs_all_svm_pred, axis = 1) + 1

# Confusion Matrix
confusion_matrix(Y_test, one_vs_all_svm_pred)

array([[10,  0,  0],
       [ 0, 10,  0],
       [ 0,  0, 10]])
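
Adding 1 to the `argmax` index works here because the class labels happen to be 1, 2, 3. A slightly more general sketch indexes the label array directly. A fitted `OneVsRestClassifier` exposes its labels as `classes_`; the `classes` and `proba` arrays below are hypothetical stand-ins for that attribute and for the output of `predict_proba`.

```python
import numpy as np

# Hypothetical stand-ins for one_vs_all_svc.classes_ and
# one_vs_all_svc.predict_proba(X_test)
classes = np.array([1, 2, 3])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])

# Indexing the label array works for any label coding, not only 1..k
pred = classes[np.argmax(proba, axis=1)]
print(pred)
```

This keeps the prediction step correct even if the classes are later recoded, for example to 0, 1, 2 or to the species names.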

### Using the results from parts 6 and 7, I would use the Support Vector Machine classifier to predict iris species because it produced no misclassifications on the test data, whereas the random forest model misclassified one observation.