## Chapter 1 - Identifying the inputs/outputs

In this example, we are using the Iris data to classify different types of flowers. By convention, X denotes the features (inputs) and y denotes the target (output). To remember this convention, consider the slope-intercept formula for a line from algebra:

y = mx + b



In this formula, __m__ represents the coefficient (slope). In machine learning, the values of __m__ are often referred to as weights.

__b__ represents the intercept. In this oversimplified example for machine learning, __b__ represents the bias.

The model consists of the weights and biases, which together form a mathematical formula that predicts the value of __y__ from __x__.
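
To make the formula concrete, here is a minimal sketch that applies one weight and one bias to a few feature values. The numbers for `m`, `b`, and `x` are made up for illustration and have nothing to do with the Iris data.

```python
import numpy as np

# Hypothetical weight and bias, chosen only for illustration
m = 2.0  # weight (slope)
b = 1.0  # bias (intercept)

# Hypothetical feature values
x = np.array([0.0, 1.0, 2.0, 3.0])

# Apply the formula y = mx + b to every value of x at once
y_line = m * x + b
print(y_line)
```

Training a model amounts to searching for values of the weights and biases that make predictions like `y_line` match the known targets as closely as possible.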

In [7]:
from sklearn import datasets

# the iris data contains measurements used to classify a flower's species
iris = datasets.load_iris()

# the keys list the fields available in the dataset object
keys = iris.keys()
print(keys)

# define features and targets
X = iris.data
y = iris.target
print('features shape:', X.shape)
print('target shape:', y.shape)
features = iris.feature_names
targets = iris.target_names

# show the features and targets
print('feature set:')
print(features)
print('targets:')
print(targets)

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
features shape: (150, 4)
target shape: (150,)
feature set:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
targets:
['setosa' 'versicolor' 'virginica']
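
Before splitting the data, it can help to check how the target is distributed across the full dataset. The sketch below is one way to do that; it counts the records for each integer label and pairs each count with its class name.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Count how many records carry each integer label (0, 1, 2)
class_counts = np.bincount(iris.target)

# Pair each count with its human-readable class name
for name, count in zip(iris.target_names, class_counts):
    print(name, count)
```

The Iris data is balanced, with 50 records per class. Keep that in mind when comparing the sampling strategies in the next section.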


## Measuring Success

To measure success for a supervised learning model, it is critical to keep the training data and the validation data separate, with no data leakage between them. This is accomplished by setting aside a dataset that is used only to evaluate the model's performance. We can either use a dedicated gold standard dataset or partition the available data into training and validation sets. Because we are often not lucky enough to have a gold standard dataset, sklearn provides the __train_test_split__ function to partition the data.

### Simple Random Sampling

sklearn can partition the data using a simple random sample. A simple random sample uses a pseudorandom number generator to assign records to the training and validation sets. The __random_state__ argument is the seed for that generator, so passing the same value reproduces the same split.

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
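
The role of `random_state` can be checked directly. The following sketch, which reloads the data so that it runs on its own, splits twice with the same seed and once with a different seed. The value 7 is arbitrary.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X_demo, y_demo = datasets.load_iris(return_X_y=True)

# Same seed twice: the two splits should be identical
_, X_test_a, _, y_test_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
same_seed_matches = np.array_equal(X_test_a, X_test_b)
print("Same seed, same split:", same_seed_matches)

# A different seed (7 is arbitrary) draws a different sample
_, X_test_c, _, y_test_c = train_test_split(X_demo, y_demo, test_size=0.25, random_state=7)
print("Different seed, same split:", np.array_equal(X_test_a, X_test_c))
```

Fixing the seed makes results reproducible, which matters when comparing models on the same partition.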

#### Discussion Question 1

Using a simple random sample of training and validation data, what do you notice about the distribution of the training and validation data? What would happen if the target variable were not evenly distributed across the full dataset?

In [9]:
import numpy as np
print("Samples per class (training): {}".format(np.bincount(y_train)))
print("Samples per class (test): {}".format(np.bincount(y_test)))

Samples per class (training): [35 39 38]
Samples per class (test): [15 11 12]


#### Discussion Question 2

Using a stratified sample of training and validation data, what do you notice about the distribution of the training and validation data? What would happen if the target variable were not evenly distributed across the full dataset?

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

print("Samples per class (training): {}".format(np.bincount(y_train)))
print("Samples per class (test): {}".format(np.bincount(y_test)))

Samples per class (training): [38 37 37]
Samples per class (test): [12 13 13]
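
One way to explore the imbalance questions is to repeat both splits on a deliberately imbalanced target. The sketch below uses synthetic labels (90 records of class 0 and 10 of class 1) and a placeholder feature column. Both are assumptions made for illustration and are not part of the Iris data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels: 90 of class 0, 10 of class 1
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(100).reshape(-1, 1)  # placeholder feature column

# Simple random sample: class proportions in the test set are left to chance
_, _, _, y_te_random = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42
)

# Stratified sample: class proportions are preserved in both partitions
_, _, _, y_te_strat = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

print("random test counts:    ", np.bincount(y_te_random, minlength=2))
print("stratified test counts:", np.bincount(y_te_strat, minlength=2))
```

Changing `random_state` in the simple random split shows how much the count of the rare class can vary from one draw to the next, while the stratified counts stay fixed.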


#### Discussion Question 3

Using the readings from Chapters 1 and 2, why is it important to have training and validation data that are representative of the population? What would happen to the accuracy score if the model __correctly__ predicts a class containing only 1 record? What would happen to the accuracy score if the model __incorrectly__ predicts a class containing only 1 record?

## Chapter 2 - Training a model using the iris data

The purpose of this section is to demonstrate uncertainty in machine learning. According to the famous statistician George Box, "all models are wrong, but some are useful." This means no model will ever be 100% certain in every situation. If a firm rule is correct in 100% of cases, there is no need to create a model, and machine learning cannot add anything to that problem. However, when there is a degree of uncertainty, we can assess the likelihood that a model is correct. This chapter introduces key terms for measuring uncertainty in machine learning.

### Fit model to the training data

The first step is to fit the model to the training data.

In [11]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

### Evaluate model against test partition

The chapter 2 readings discuss the importance of accuracy. Accuracy is the percentage of correct predictions out of the total predictions made. As we discussed in chapter 1, accuracy may not be the most appropriate metric in the presence of class imbalance, for example when a class contains only 1 record.

In [12]:
accuracy = knn.score(X_test, y_test)
print("Test set accuracy: {:.2f}".format(accuracy))

Test set accuracy: 0.97
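
The definition of accuracy can be verified by hand. The sketch below rebuilds the stratified split and the k-NN model so that it runs on its own, counts the correct predictions, and compares the result with `knn.score`.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Accuracy = number of correct predictions / total number of predictions
y_pred = knn.predict(X_test)
n_correct = int(np.sum(y_pred == y_test))
manual_accuracy = n_correct / len(y_test)

print("Correct predictions: {} of {}".format(n_correct, len(y_test)))
print("Manual accuracy: {:.2f}".format(manual_accuracy))
print("knn.score:       {:.2f}".format(knn.score(X_test, y_test)))
```

Seeing the raw count next to the ratio makes it easier to reason about how much a single record can move the score.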


### Calculate predicted probabilities

Chapter 2 identifies that every model has some degree of uncertainty. The predicted probabilities show how likely the model considers each class to be for a given record. These predicted probabilities may differ from the true probabilities.

Notice how each row in the array contains one value per class on a scale of 0 to 1. With __n_neighbors=5__ and the default uniform weighting, each value is the fraction of the five nearest neighbors that belong to that class, which is why only multiples of 0.2 appear. The class with the highest value in a row becomes the model's prediction; a different decision rule could instead require the probability to exceed a chosen threshold.

In [22]:
y_pred_proba = knn.predict_proba(X_test)
print(y_pred_proba)

[[1.  0.  0. ]
 [0.  1.  0. ]
 [0.  1.  0. ]
 [0.  1.  0. ]
 [1.  0.  0. ]
 [0.  1.  0. ]
 [0.  0.2 0.8]
 [0.  0.  1. ]
 [0.  0.  1. ]
 [0.  0.  1. ]
 [0.  0.8 0.2]
 [0.  0.  1. ]
 [0.  1.  0. ]
 [0.  1.  0. ]
 [1.  0.  0. ]
 [1.  0.  0. ]
 [1.  0.  0. ]
 [0.  0.8 0.2]
 [1.  0.  0. ]
 [0.  1.  0. ]
 [0.  0.  1. ]
 [0.  1.  0. ]
 [0.  0.4 0.6]
 [0.  1.  0. ]
 [0.  0.  1. ]
 [0.  1.  0. ]
 [1.  0.  0. ]
 [0.  0.4 0.6]
 [1.  0.  0. ]
 [0.  0.6 0.4]
 [0.  0.  1. ]
 [0.  0.  1. ]
 [1.  0.  0. ]
 [1.  0.  0. ]
 [1.  0.  0. ]
 [1.  0.  0. ]
 [0.  0.  1. ]
 [0.  1.  0. ]]
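
For k-NN with uniform weights, these probabilities can be reproduced by counting neighbors. The sketch below is an illustration under that assumption, not a description of how every model computes probabilities. It rebuilds the split and model so that it runs on its own.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Indices (into the training set) of the 5 nearest neighbors of each test record
_, neighbor_idx = knn.kneighbors(X_test)
neighbor_labels = y_train[neighbor_idx]  # shape: (n_test, 5)

# Fraction of the 5 neighbors that belong to each class
fractions = np.stack(
    [(neighbor_labels == c).mean(axis=1) for c in knn.classes_], axis=1
)

print(np.allclose(fractions, knn.predict_proba(X_test)))
```

A row such as `[0. 0.2 0.8]` therefore means that one of the five neighbors is class 1 and four are class 2.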


Because there are three classes, each prediction is one of the integers 0, 1, or 2. Each integer represents a class of the categorical target variable, chosen as the class with the highest predicted probability.

In [14]:
predictions = np.argmax(y_pred_proba, axis=1)
predictions

array([0, 1, 1, 1, 0, 1, 2, 2, 2, 2, 1, 2, 1, 1, 0, 0, 0, 1, 0, 1, 2, 1,
       2, 1, 2, 1, 0, 2, 0, 1, 2, 2, 0, 0, 0, 0, 2, 1])

#### Discussion Question 4

The __predict__ method in sklearn performs the same operation as the previous cell. What assumptions does the __predict__ method make about thresholds when it converts predicted probabilities into a predicted class?

In [15]:
y_pred = knn.predict(X_test)
print(y_pred)

[0 1 1 1 0 1 2 2 2 2 1 2 1 1 0 0 0 1 0 1 2 1 2 1 2 1 0 2 0 1 2 2 0 0 0 0 2
 1]
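
The match between the two outputs can be checked programmatically instead of by eye. The sketch below rebuilds the split and model so that it runs on its own, then compares `predict` with the argmax of `predict_proba`. The `classes_` attribute maps column positions back to labels, which for Iris are simply 0, 1, and 2.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Labels obtained by taking the most probable class in each row
from_proba = knn.classes_[np.argmax(knn.predict_proba(X_test), axis=1)]

# Labels returned directly by the model
from_predict = knn.predict(X_test)

print(np.array_equal(from_proba, from_predict))
```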


#### Discussion Question 5

Based on the readings in Chapter 2 and the presentation, what does it mean if a model is calibrated? 

#### Discussion Question 6

+ Why are there always errors for a traditional statistical model? 

+ What is the meaning of George Box's quote, "All models are wrong, but some are useful"?

+ If a model has 100% accuracy, what would be something to consider about the model?

This example uses the k-Nearest Neighbors model illustrated in Chapter 2. Other models may also be useful. Factors to consider when selecting a model include the type of data used as features, such as unstructured text, ratio data, integer data, or categorical data. The type of model also matters. For example, tree-based models such as decision trees, and tree ensembles such as Random Forest, generally do a good job on binary classification problems with a True/False or Yes/No target variable. Logistic regression and neural networks generally do a good job with unstructured text in multiclass classification problems. k-NN does a good job on classification problems where the features are numeric, such as integer or ratio data.
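
As a rough sketch of how such a comparison might be set up, the code below scores three model types on the Iris data with 5-fold cross-validation. The choice of models and the settings (`random_state=0`, `max_iter=1000`, `cv=5`) are assumptions made for illustration, not recommendations from the readings.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

# Candidate models; hyperparameters are illustrative defaults
models = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Mean accuracy across 5 cross-validation folds for each model
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print("{}: {:.2f}".format(name, mean_scores[name]))
```

On a small, clean dataset like Iris the scores tend to be close together, so differences between model types are easier to see on data that matches the use cases described above.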