# Sklearn - Machine Learning and Statistical Inference

### Documentation - https://scikit-learn.org/stable/documentation.html

### A Common Learning Algorithm - K Nearest Neighbors 

In [2]:
# Sklearn follows a few conventions for representing models:

# Estimators are objects that hold attributes (the model skeleton) and methods (the algorithms that
# map data (independent variables) and target values (dependent variables) onto that skeleton).

# For example: the LinearRegression estimator has the attribute 'coef_', which holds the estimated
# parameter values after running ordinary least squares on the given data/targets.

# You can spin up a new estimator by calling the appropriate class from an sklearn submodule; sklearn
# organizes its submodules mostly by family of algorithm (sklearn.neighbors holds nearest-neighbor
# methods, sklearn.linear_model holds linear models, and so on), and most families include both
# classifiers for discrete targets and regressors for continuous targets.  There's a good graphic of
# the kinds of problems different algorithms attempt to solve here:

# https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

# All supervised estimators in sklearn have 'fit' and 'predict' methods: 'fit' learns model parameters
# from training data, while 'predict' accepts a new feature set (with the same kinds of information
# present in X during 'fit') and returns a guess at the label (dependent variable value) that should
# belong to those features.
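
# A minimal sketch of the fit/predict convention described above, using LinearRegression.
# The toy data (X_toy, y_toy) is made up for illustration and is not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: y = 2x + 1 exactly
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 2.0 * X_toy.ravel() + 1.0

reg = LinearRegression()  # an unfitted estimator
reg.fit(X_toy, y_toy)     # estimate parameters with ordinary least squares

slope = reg.coef_[0]        # fitted attributes end with an underscore
intercept = reg.intercept_
new_prediction = reg.predict(np.array([[4.0]]))[0]  # predict for an unseen feature value

print(slope, intercept, new_prediction)
```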

In [4]:
# K Nearest Neighbors, or KNN, looks at a set of features (independent variable values) and selects the 'K'
# training observations that those features are most similar to (shortest geometric distance).  Then, KNN assigns a
# label to the feature set based on the labels of those 'K' neighbors, typically by majority vote.  In the simplest
# case (K=1), KNN labels incoming data with the label of the single training observation that has the shortest
# Euclidean distance to it.  Luckily, sklearn has done all of the tedious work of actually writing the
# KNN algorithm, so we can simply plagiarize their work!

# Let's apply KNN to classify iris flowers using the familiar dataset.
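
# To make the idea concrete, here is a rough from-scratch sketch of KNN with Euclidean distance and a
# majority vote.  The helper 'knn_predict' and the toy points are hypothetical, written only for
# illustration; sklearn's own implementation is more sophisticated and may break ties differently.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=1):
    """Label x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to every training row
    nearest = np.argsort(distances)[:k]                  # indices of the k closest rows
    votes = Counter(y_train[nearest].tolist())           # count the labels among those neighbors
    return votes.most_common(1)[0][0]


# Two well-separated clusters of made-up points
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

label_near_origin = knn_predict(X_demo, y_demo, np.array([0.2, 0.1]), k=3)
label_far_away = knn_predict(X_demo, y_demo, np.array([5.5, 5.5]), k=3)
print(label_near_origin, label_far_away)
```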

In [15]:
from sklearn.datasets import load_iris # alternative source for the iris dataset (with the data attribute as a numpy array)

iris = load_iris() # load the data and store under 'iris'

In [16]:
iris.data[0:5] # display the first five rows of our independent variables

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

In [17]:
iris.target[0:5] # display the first five values of our dependent variable

array([0, 0, 0, 0, 0])

In [19]:
# Map the numeric labels (model usable) to the iris names (human readable):

names = {label: name for label, name in enumerate(iris.target_names)}

print(names)

{0: 'setosa', 1: 'versicolor', 2: 'virginica'}


In [20]:
from sklearn.neighbors import KNeighborsClassifier # Bring in the sklearn classifier for KNN 

KNN = KNeighborsClassifier() # initialize an unfitted KNN classifier (a place to hold data and an algorithm as a function)

In [21]:
# By convention, we refer to independent and dependent variables as 'X' and 'y' respectively.  Let's
# organize our data to reflect this convention:

X, y = iris.data, iris.target

In [22]:
# As model builders, we want to avoid the 'overfitting' problem, where a model describes the data it was
# trained on very well but performs poorly when predicting outcomes outside of the training set.  In theory,
# these issues can be caused by idiosyncrasies in the data (not randomly sampled, low observation counts,
# etc.), or by structural changes in the data generating process between the training and prediction
# periods (financial models built on pre-2008 crash data might predict very poorly in 2009). Since we
# operate in a world with ever-changing economic conditions, we should be especially sensitive to this problem.

# To help guard against overfitting, we reserve a subset of our data (a test set) that will not
# be used in the training process.  If the model performs 'well' on predicted labels in the test set (for
# which we know the right answers) then we can be more confident that our model will predict well on new
# information when it really counts (production).

# Setting random_state to a fixed integer simply ensures that the training/test split
# will be the same each time we run the split procedure (so we can replicate our results).

from sklearn.model_selection import train_test_split # Bring in the sklearn function for splitting data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=420) # Shuffle our data and hold out 25% (the default) as a test set
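
# A quick sanity check on the split, as a sketch: with the default test_size of 0.25, the 150 iris rows
# should come out as 112 training rows and 38 test rows, and repeating the call with the same
# random_state should reproduce the same split.  The names below are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# Run the same split twice with the same seed
first_split = train_test_split(X_demo, y_demo, random_state=420)
second_split = train_test_split(X_demo, y_demo, random_state=420)

X_tr, X_te, y_tr, y_te = first_split
print(X_tr.shape, X_te.shape)

# True when every piece of the two splits matches exactly
same_split = all(np.array_equal(a, b) for a, b in zip(first_split, second_split))
print(same_split)
```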

In [23]:
KNN.fit(X_train, y_train) # Assign the training data to our classifier and run the nearest neighbor algorithm

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [24]:
predictions = KNN.predict(X_test) # Use the model built from training data to predict labels for the reserved test set
english_predictions = [names[p] for p in predictions] # Translate numeric labels back into English

print(english_predictions) # Display the predictions

['versicolor', 'virginica', 'virginica', 'setosa', 'setosa', 'versicolor', 'setosa', 'versicolor', 'virginica', 'versicolor', 'virginica', 'versicolor', 'setosa', 'setosa', 'virginica', 'versicolor', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'virginica', 'versicolor', 'setosa', 'versicolor', 'versicolor', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica', 'setosa', 'setosa', 'versicolor', 'virginica', 'versicolor']


In [25]:
# How did our model do?  Let's take a look at the predicted labels of our test set, versus the actual labels:

print(['Incorrect' for pred, actual in zip(predictions, y_test) if pred != actual]) # Print 'Incorrect' for every wrong guess

[]


In [26]:
# So our model had 100% accuracy guessing labels in our test set.  Sklearn also has its own implementation for
# gauging accuracy:

from sklearn.metrics import accuracy_score

accuracy_score(y_test, predictions) # Returns correct guesses / total guesses as a floating point number.

1.0
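
# As a small illustration of what accuracy_score computes, the fraction of matching labels can be
# checked by hand.  The 'actual' and 'guessed' arrays below are made up for this sketch.

```python
import numpy as np
from sklearn.metrics import accuracy_score

actual = np.array([0, 1, 2, 2, 1])   # hypothetical true labels
guessed = np.array([0, 1, 2, 1, 1])  # hypothetical predictions (one mistake)

manual_accuracy = np.mean(actual == guessed)         # correct guesses / total guesses
sklearn_accuracy = accuracy_score(actual, guessed)   # same quantity via sklearn

print(manual_accuracy, sklearn_accuracy)
```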

In [27]:
# Now we have a working classifier that boasts a perfect accuracy score, but who said that K Nearest Neighbors
# would yield the 'best' solution?  How do I pick between alternative classifiers?

# The standard ML answer to this question is to use cross validation.  Cross validation splits our data (often
# after shuffling) into 'K' sections (you might see this referred to as 'K fold' validation, too).  If K = 5, our
# data is partitioned into 5 parts and 5 separate models are trained, each using 4 sections of the data to train
# and 1 rotating section to test.  Then, a score is computed for each model (typically accuracy for classification
# or mean squared error for regression, but you can pick any scoring function you want) and compared across the
# models.  The use of multiple folds tells us whether or not the model we're using is sensitive to how the data is
# sampled; we prefer higher accuracy scores and lower variance between those scores, all else the same.

# A note on cross validation: Choosing a model on cross validation scores alone is analogous to making modeling
# decisions based on nothing but R^2 in the context of traditional stat modeling, which can be dangerous.  There's
# a big spiel about this issue that would be better handled in another context.
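
# The K fold procedure described above can be sketched by hand.  This is only an illustration, assuming
# StratifiedKFold with shuffling and an arbitrary seed; it is not necessarily the exact splitting that
# cross_val_score performs internally, so the scores need not match sklearn's.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X_cv, y_cv = load_iris(return_X_y=True)

# 5 folds, each keeping roughly the same class balance as the full data
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in folds.split(X_cv, y_cv):
    model = KNeighborsClassifier()               # a fresh model for every fold
    model.fit(X_cv[train_idx], y_cv[train_idx])  # train on 4 of the 5 sections
    fold_predictions = model.predict(X_cv[test_idx])
    fold_scores.append(accuracy_score(y_cv[test_idx], fold_predictions))  # score on the held-out section

fold_scores = np.array(fold_scores)
print(fold_scores, fold_scores.mean(), fold_scores.std())
```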

In [28]:
# Let's compute the cross validation score for a K Nearest Neighbor model

import sklearn.model_selection as model_selection

# Note: If we are using cross validation, the train/test split procedure is handled automatically, so
# we can pass the original data instead of the splits

# The following makes 3 KNN models with 3 unique splits and scores the accuracy of each:
model_selection.cross_val_score(KNN, X, y, cv=3, scoring='accuracy')

array([0.98039216, 0.98039216, 1.        ])

In [29]:
# Now we can run the same procedure on an alternative classifier and make a decision based on the scores

from sklearn.tree import DecisionTreeClassifier # A competing classifier

tree = DecisionTreeClassifier()

model_selection.cross_val_score(tree, X, y, cv=3, scoring='accuracy') # Perform the same procedure using a tree classifier

array([0.98039216, 0.92156863, 0.97916667])

In [30]:
# Your results may differ (the tree, for example, makes some random choices unless random_state is set), but my
# results show a higher average accuracy score and reduced variance for the KNN model, which suggests KNN is
# superior to the tree model here.
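
# The comparison can be made explicit by summarizing each set of fold scores.  This sketch simply
# re-enters the score arrays printed above; the variable names are illustrative.

```python
import numpy as np

# Fold scores copied from the outputs shown earlier in this notebook
knn_scores = np.array([0.98039216, 0.98039216, 1.0])
tree_scores = np.array([0.98039216, 0.92156863, 0.97916667])

# Higher mean and lower spread are both desirable
for model_name, scores in [("KNN", knn_scores), ("tree", tree_scores)]:
    print(f"{model_name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```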

# Now that we are somewhat happy with the average accuracy of the KNN process on this problem and we did not
# observe a high variance in accuracy score with respect to sampling, we can retrain the KNN classifier
# on all of the available data:

KNN = KNeighborsClassifier() # initialize a fresh, unfitted KNN classifier, overwriting our old KNN model
Final_Model = KNN.fit(X, y) # Fit on all of the available data

In [None]:
# And we're done.  Our 'Final_Model' is ready to label new incoming feature sets with iris names.  
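
# As a sketch of what using such a model might look like, the block below fits a KNN classifier on all of
# the iris data and labels one new measurement.  The names 'final_knn' and 'new_flower' are hypothetical,
# and the measurement is chosen to resemble a typical setosa.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris_demo = load_iris()
final_knn = KNeighborsClassifier().fit(iris_demo.data, iris_demo.target)  # fit on every observation

# One new flower: sepal length, sepal width, petal length, petal width (cm)
new_flower = [[5.1, 3.5, 1.4, 0.2]]

predicted_label = final_knn.predict(new_flower)[0]        # numeric label
predicted_name = iris_demo.target_names[predicted_label]  # human readable name
print(predicted_name)
```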

In [102]:
# There are a multitude of algorithms available to us, however.  How do we abstract this model selection problem
# so that we aren't constantly writing and rewriting accuracy tests?  Take a look in the 'Generalized Machine Learning'
# notebook for an attempt at solving this problem.