In [134]:
# Imports section
import sklearn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

## Part 1: Loading the dataset

In [136]:
# Load the dataset (load remotely, not locally)
iris = load_iris() # Load in iris
data = pd.DataFrame(iris.data, columns=iris.feature_names) # create an iris dataframe
# Output the first 15 rows of the data
print(data.head(15))
# Display a summary of the table information (number of datapoints, etc.)
print(data.describe())

    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                 5.1               3.5                1.4               0.2
1                 4.9               3.0                1.4               0.2
2                 4.7               3.2                1.3               0.2
3                 4.6               3.1                1.5               0.2
4                 5.0               3.6                1.4               0.2
5                 5.4               3.9                1.7               0.4
6                 4.6               3.4                1.4               0.3
7                 5.0               3.4                1.5               0.2
8                 4.4               2.9                1.4               0.2
9                 4.9               3.1                1.5               0.1
10                5.4               3.7                1.5               0.2
11                4.8               3.4                1.6               0.2

## About the dataset
#### Explain what the data is in your own words. What are your features and labels? What is the mapping of your labels to the actual classes?

The dataset contains four measurements for each flower: sepal length, sepal width, petal length, and petal width, all in centimeters. These four measurements are the features. The label is the species of each flower, which the models learn to predict from the features: Iris Setosa, Iris Versicolor, or Iris Virginica. The labels are stored in iris['target'] as the integers 0, 1, and 2, which map to setosa, versicolor, and virginica respectively (the names are listed in iris['target_names']).
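
As a quick check of the label mapping described above, the following minimal sketch builds a dictionary from the integer labels to the class names. The variable name `label_to_class` is only illustrative and does not appear elsewhere in the notebook.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Map each integer label in iris.target to its class name
label_to_class = {i: str(name) for i, name in enumerate(iris.target_names)}
print(label_to_class)  # {0: 'setosa', 1: 'versicolor', 2: 'virginica'}

# 150 samples, 4 features per sample
print(iris.data.shape)  # (150, 4)
```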

## Part 2: Split the dataset into train and test

In [137]:
# Take the dataset and split it into our features (X) and label (y)
X = iris['data'] #iris features (X)
y = iris['target'] #iris label (y)

# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=9)
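
With only 15 test samples, it can help to look at how many flowers of each class ended up in the test set. The sketch below repeats the split above and counts the classes. The second split, which passes `stratify=y`, is an assumed variant that the notebook does not use; it is shown only to illustrate how the class balance could be enforced.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split as above (90% train, 10% test, random_state=9)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=9
)
print(X_train.shape, X_test.shape)  # (135, 4) (15, 4)
print("Test samples per class:", np.bincount(y_test, minlength=3))

# Assumed variant: stratify=y keeps the three classes balanced in both subsets
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.10, random_state=9, stratify=y
)
print("Stratified test samples per class:", np.bincount(ys_test, minlength=3))
```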

## Part 3: Logistic Regression

In [162]:
# i. Use sklearn to train a LogisticRegression model on the training set
logisticRegr = LogisticRegression(max_iter=100000) #Create a LogisticRegression model
logisticRegr.fit(X_train, y_train) #Train the model using the training sets

# ii. For a sample datapoint, predict the probabilities for each possible class
prediction = logisticRegr.predict_proba(X_test[1].reshape(1,-1)) #probabilities for X_test[1]
print("Prediction for X_test[1]:\n", prediction) 

# iii. Report on the score for Logistic regression model, what does the score measure?
print("Score X & y: \n", logisticRegr.score(X, y)) 
print("Score: X_test & y_test: \n", logisticRegr.score(X_test, y_test))
print("The cross validation scores for X & y: \n", cross_val_score(logisticRegr, X, y, cv=10))

# iv. Extract the coefficients and intercepts for the boundary line(s)
print("Coefficients: \n", logisticRegr.coef_)
print("Intercept: \n", logisticRegr.intercept_)

Prediction for X_test[1]:
 [[0.01151064 0.88108184 0.10740752]]
Score X & y: 
 0.98
Score: X_test & y_test: 
 1.0
The cross validation scores for X & y: 
 [1.         0.93333333 1.         1.         0.93333333 0.93333333
 0.93333333 1.         1.         1.        ]
Coefficients: 
 [[-0.4403194   0.88516663 -2.45760445 -0.99311586]
 [ 0.56942753 -0.2546846  -0.22313114 -0.90139707]
 [-0.12910813 -0.63048203  2.68073558  1.89451293]]
Intercept: 
 [  9.93510942   1.7412485  -11.67635792]
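
The coefficients and intercepts above are what the model uses to compute class probabilities. The sketch below assumes the default solver fits a multinomial (softmax) model, which is what recent scikit-learn versions do for multiclass problems; under that assumption, applying a softmax to the linear scores should reproduce `predict_proba`. The names `model`, `z`, and `manual_proba` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=9
)
model = LogisticRegression(max_iter=100000).fit(X_train, y_train)

# One linear score per class: z = x @ coef_.T + intercept_
z = X_test @ model.coef_.T + model.intercept_

# Softmax turns the scores into probabilities (row max subtracted for stability)
exp_z = np.exp(z - z.max(axis=1, keepdims=True))
manual_proba = exp_z / exp_z.sum(axis=1, keepdims=True)

# Compare the manual result with the library's probabilities
print(np.allclose(manual_proba, model.predict_proba(X_test)))
```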


The score is the model's mean accuracy: the fraction of samples whose predicted class matches the true label. On the full dataset (X and y, which includes the training samples) the score is 0.98, so the model classifies 98% of the 150 flowers correctly. On the held-out test data the score is 1.0, meaning all 15 test samples are classified correctly. With 10-fold cross-validation, the accuracy of the individual folds ranges from 0.93 to 1.0.
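
To make the meaning of the score concrete, the sketch below recomputes the accuracy by hand and compares it with `score`, then condenses the ten cross-validation folds into a mean and standard deviation. The names `manual_accuracy` and `cv_scores` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=9
)
model = LogisticRegression(max_iter=100000).fit(X_train, y_train)

# score() is mean accuracy: the share of predictions that match the true label
manual_accuracy = np.mean(model.predict(X_test) == y_test)
print(manual_accuracy, model.score(X_test, y_test))

# Summarize the 10 folds with a mean and a standard deviation
cv_scores = cross_val_score(model, X, y, cv=10)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```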

## Part 4: Support Vector Machine

In [164]:
# i. Use sklearn to train a Support Vector Classifier on the training set
clf = svm.SVC(kernel='linear', probability=True) #Create a svm Classifier
clf.fit(X_train, y_train) #Train the model using the training sets

# ii. For a sample datapoint, predict the probabilities for each possible class
prediction2 = clf.predict_proba(X_test[1].reshape(1,-1))  #probabilities for X_test[1]
print("Prediction for sample datapoint X_test[1] \n", prediction2)

# iii. Report on the score for the SVM, what does the score measure?
print("Score X & y: \n", clf.score(X, y)) 
print("Score: X_test & y_test: \n", clf.score(X_test, y_test))
print("The cross validation scores for X & y: \n", cross_val_score(clf, X, y, cv=10))

#Predict the response for test dataset
y_pred = clf.predict(X_test)
print("Prediction for test dataset: \n", y_pred)

Prediction for sample datapoint X_test[1] 
 [[0.00981581 0.92999486 0.06018933]]
Score X & y: 
 0.9866666666666667
Score: X_test & y_test: 
 1.0
The cross validation scores for X & y: 
 [1.         0.93333333 1.         1.         0.86666667 1.
 0.93333333 1.         1.         1.        ]
Prediction for test dataset: 
 [2 1 2 2 1 0 0 0 1 0 0 1 1 1 0]


The score is the model's mean accuracy. On the full dataset (X and y, which includes the training samples) the score is 0.9867, so the model classifies 98.7% of the 150 flowers correctly. On the held-out test data the score is 1.0, meaning all 15 test samples are classified correctly. With 10-fold cross-validation, the accuracy of the individual folds ranges from 0.87 to 1.0.
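
An SVM does not produce probabilities natively. According to the scikit-learn documentation, `probability=True` adds a calibration step (Platt scaling fitted with an internal cross-validation), so the probabilities are estimates layered on top of the classifier. The sketch below looks at the raw decision scores instead, which is what the SVM actually optimizes. It is an illustration under the same split as above, not part of the original analysis.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=9
)

# Linear SVM without the probability calibration step
clf = SVC(kernel="linear").fit(X_train, y_train)

sample = X_test[1].reshape(1, -1)
scores = clf.decision_function(sample)  # one score per class (shape (1, 3))
print("Decision scores:", scores)
print("Predicted class:", clf.predict(sample))

# Number of support vectors the model keeps for each class
print("Support vectors per class:", clf.n_support_)
```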

## Part 5: Neural Network

In [173]:
# i. Use sklearn to train a Neural Network (MLP Classifier) on the training set
mlp = MLPClassifier(random_state=3, max_iter=1000) #Create a Neural Network (MLP Classifier)
mlp.fit(X_train, y_train) #Train the model using the training sets

# ii. For a sample datapoint, predict the probabilities for each possible class
prediction3 = mlp.predict_proba(X_test[1].reshape(1,-1))  #probabilities for X_test[1]
print(prediction3)

# iii. Report on the score for the Neural Network, what does the score measure?
print("Score X & y: \n", mlp.score(X, y)) 
print("Score: X_test & y_test: \n", mlp.score(X_test, y_test))
print("The cross validation scores for X & y: \n", cross_val_score(mlp, X, y, cv=10))

# iv: Experiment with different options for the neural network, report on your best configuration (the highest score I was able to achieve was 0.8666)

[[0.00244197 0.90953964 0.08801839]]
Score X & y: 
 0.98
Score: X_test & y_test: 
 1.0
The cross validation scores for X & y: 
 [1.         1.         1.         1.         0.86666667 1.
 0.86666667 1.         1.         1.        ]


The score is the model's mean accuracy. On the full dataset (X and y, which includes the training samples) the score is 0.98, so the model classifies 98% of the 150 flowers correctly. On the held-out test data the score is 1.0, meaning all 15 test samples are classified correctly. With 10-fold cross-validation, the accuracy of the individual folds ranges from 0.87 to 1.0.


When I experimented with different options for the neural network, only the predicted probabilities changed; none of the scores changed.
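
One way to make this experimentation systematic is a small grid search. The sketch below is a hypothetical setup, not the configuration used above: the parameter grid, the `StandardScaler` step, and the 5-fold cross-validation are all assumptions chosen to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale the features first; MLPs tend to train more reliably on scaled inputs
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(random_state=3, max_iter=1000),
)

# Assumed, deliberately small grid over layer size and L2 penalty
param_grid = {
    "mlpclassifier__hidden_layer_sizes": [(10,), (50,)],
    "mlpclassifier__alpha": [1e-4, 1e-2],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 3))
```

Comparing `search.best_score_` across configurations is more informative than the 15-sample test score, which is already at 1.0 and cannot improve.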

## Part 6: K-Nearest Neighbors

In [168]:
# i. Use sklearn to 'train' a k-Neighbors Classifier
knn = KNeighborsClassifier(n_neighbors=3) #Create a k-Nearest Neighbors classifier (k=3)
knn.fit(X_train, y_train) #Store the training data in the model

# ii. For a sample datapoint, predict the probabilities for each possible class
prediction4 = knn.predict_proba(X_test[1].reshape(1,-1)) #probabilities for X_test[1]; with k=3 each value is a multiple of 1/3
print(prediction4)

# iii. Report on the score for kNN, what does the score measure?
print("Score X & y: \n", knn.score(X, y)) 
print("Score: X_test & y_test: \n", knn.score(X_test, y_test))
print("The cross validation scores for X & y: \n", cross_val_score(knn, X, y, cv=10))


Score X & y: 
 0.96
Score: X_test & y_test: 
 1.0
The cross validation scores for X & y: 
 [1.         0.93333333 1.         0.93333333 0.86666667 1.
 0.93333333 1.         1.         1.        ]


The score is the model's mean accuracy. On the full dataset (X and y, which includes the training samples) the score is 0.96, so the model classifies 96% of the 150 flowers correctly. On the held-out test data the score is 1.0, meaning all 15 test samples are classified correctly. With 10-fold cross-validation, the accuracy of the individual folds ranges from 0.87 to 1.0.
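
The main setting of a k-nearest neighbors classifier is the number of neighbors. The sketch below compares a few values of k with 10-fold cross-validation and also shows that, with the default uniform weighting, the predicted probabilities are simply vote shares (multiples of 1/k). The list of k values and the name `mean_by_k` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 10-fold accuracy for several neighborhood sizes
mean_by_k = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    mean_by_k[k] = cross_val_score(knn, X, y, cv=10).mean()
    print(f"k={k}: mean CV accuracy = {mean_by_k[k]:.3f}")

# With k=3 and uniform weights, each probability is a vote share: 0, 1/3, 2/3 or 1
proba = KNeighborsClassifier(n_neighbors=3).fit(X, y).predict_proba(X[:5])
print(proba)
```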

## Part 7: Conclusions and takeaways
#### In your own words describe the results of the notebook. Which model(s) performed the best on the dataset? Why do you think that is? Did anything surprise you about the exercise?

The results of this notebook surprised me and made me doubt my work several times, because every model scored 1.0 on the test data, meaning it classified every test sample correctly. This is partly explained by the small size of the test set (15 samples), but it is still striking. It suggests that all of the models we used can predict the species of a flower accurately from its petal and sepal dimensions.
The model that performed best was the SVM. It had the highest score on the full dataset (X and y), 0.9867, although the gap to logistic regression and the neural network (both 0.98) is a single flower out of 150. I think the SVM does well because its margin leaves some room for borderline measurements between the three species, which translates into the highest accuracy across them.
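
Since the test scores are all 1.0, the cross-validation results are a better basis for comparing the models. The sketch below only summarizes the fold scores already printed in Parts 3 to 6 (copied by hand into the `reported` dictionary, which is an illustrative name); it does not retrain anything.

```python
import numpy as np

# Cross-validation scores copied from the outputs above (10 folds each)
reported = {
    "LogisticRegression": [1, 0.93333333, 1, 1, 0.93333333,
                           0.93333333, 0.93333333, 1, 1, 1],
    "SVC (linear)": [1, 0.93333333, 1, 1, 0.86666667,
                     1, 0.93333333, 1, 1, 1],
    "MLPClassifier": [1, 1, 1, 1, 0.86666667,
                      1, 0.86666667, 1, 1, 1],
    "KNeighborsClassifier": [1, 0.93333333, 1, 0.93333333, 0.86666667,
                             1, 0.93333333, 1, 1, 1],
}

# Mean and worst-fold accuracy for each model
summary = {}
for name, scores in reported.items():
    summary[name] = (np.mean(scores), np.min(scores))
    print(f"{name}: mean={summary[name][0]:.3f}, min={summary[name][1]:.3f}")
```

On these numbers the first three models have the same mean cross-validation accuracy (about 0.973) and kNN is slightly lower (about 0.967), so the differences between the models are small.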