Models are only as useful as the quality of their predictions, and thus fundamentally our goal is not to create models (which is easy) but to create high-quality models (which is hard). Therefore, before we explore the myriad learning algorithms, we first set up how we can evaluate the models they produce.

In [152]:
%matplotlib
from sklearn import datasets
from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyRegressor, DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

Using matplotlib backend: MacOSX


Create pipeline

In [60]:
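# Create a pipeline that preprocesses the data and trains the model; it is evaluated with cross-validation in the next cell

# Note: despite the "iris_" prefix, these variables hold the handwritten digits dataset
iris_digits = datasets.load_digits()
iris_features = iris_digits.data
iris_target = iris_digits.target

# Note: load_boston was removed in scikit-learn 1.2, so this line requires an older release
boston = datasets.load_boston()
boston_features, boston_targets = boston.data, boston.target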

standardizer = StandardScaler()
logit = LogisticRegression()

pipeline = make_pipeline(standardizer, logit)


In [61]:
#K-fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=1)
cv_results = cross_val_score(pipeline, iris_features, iris_target, cv=kf, scoring="accuracy", n_jobs=1)
cv_results.mean()

0.964931719428926
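
To make explicit what cross_val_score does with the KFold splitter, the following is a minimal sketch that runs the same ten folds by hand. The variable names (X_digits, fold_scores, and so on) are illustrative, and max_iter=1000 is an assumption added only to avoid convergence warnings; the cell above uses the default.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_digits, y_digits = load_digits(return_X_y=True)
kf_manual = KFold(n_splits=10, shuffle=True, random_state=1)


def build_model():
    # max_iter=1000 is an assumption, not taken from the cell above
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))


fold_scores = []
for train_idx, test_idx in kf_manual.split(X_digits):
    # A fresh pipeline per fold: the scaler is fit on the training fold only,
    # so no information from the held-out fold leaks into preprocessing.
    model = build_model()
    model.fit(X_digits[train_idx], y_digits[train_idx])
    fold_scores.append(model.score(X_digits[test_idx], y_digits[test_idx]))

manual_mean = np.mean(fold_scores)

# The helper should reproduce the same per-fold accuracies
helper_scores = cross_val_score(build_model(), X_digits, y_digits,
                                cv=kf_manual, scoring="accuracy")
print(manual_mean, helper_scores.mean())
```

Looking at the individual values in fold_scores, not only their mean, gives a sense of how much the estimate varies from fold to fold.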

Create two examples of dummy baseline models, one for regression and one for classification,
and compare their scores to LinearRegression() and RandomForestClassifier().

In [63]:
# Create a baseline REGRESSION model
# You want a simple baseline regression model to compare against your model.
# DummyRegressor is useful as a simple baseline to compare with other (real) regressors. Don't use it for real problems.
features_train, features_test, target_train, target_test = train_test_split(
    boston_features, boston_targets, random_state=0)

dummy = DummyRegressor(strategy='mean')

dummy.fit(features_train, target_train)

dummy.score(features_test, target_test)


# # Create dummy regressor that predicts 20's for everything
# clf = DummyRegressor(strategy='constant', constant=20)
# clf.fit(features_train, target_train)

# # Evaluate score
# clf.score(features_test, target_test)

#Non-dummy model
ols = LinearRegression()
ols.fit(features_train, target_train)

# Get R-squared score
ols.score(features_test, target_test)

0.6353620786674619
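
The score method of both regressors returns R-squared, so it helps to see why a mean-predicting baseline sits at or below zero. The sketch below is not the cell above: it uses synthetic data from make_regression as a stand-in (an assumption), and r_squared is a hypothetical helper written out from the definition 1 - SS_res / SS_tot.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): a linear signal plus Gaussian noise
X_reg, y_reg = make_regression(n_samples=500, n_features=5, noise=10.0,
                               random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_reg, y_reg, random_state=0)

baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
ols_demo = LinearRegression().fit(X_tr, y_tr)


def r_squared(y_true, y_pred):
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# On the training data the mean predictor scores exactly 0 (up to rounding);
# on the test data it can only be 0 or lower, because the training mean is
# generally not the test mean.
train_baseline_r2 = r_squared(y_tr, baseline.predict(X_tr))
baseline_r2 = r_squared(y_te, baseline.predict(X_te))
ols_r2 = r_squared(y_te, ols_demo.predict(X_te))
print(train_baseline_r2, baseline_r2, ols_r2)
```

A real model is only interesting to the extent that its R-squared clears this baseline.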

In [71]:
#You want a simple baseline CLASSIFIER to compare against your model
# Load flower data
iris = datasets.load_iris()

features, target = iris.data, iris.target

# Split into training and test set
features_train, features_test, target_train, target_test = train_test_split(
    features, target, random_state=0)

dummy = DummyClassifier(strategy='uniform', random_state=1)
dummy.fit(features_train, target_train)
dummy.score(features_test, target_test)


0.42105263157894735

In [72]:
# Create Actual Classifier
classifier = RandomForestClassifier()

# Train model
classifier.fit(features_train, target_train)

# Get accuracy score
classifier.score(features_test, target_test)

0.9736842105263158
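
A single train/test split can flatter or punish either model, so one option is to compare the baseline and the real classifier with cross-validation instead. This is a sketch under assumptions: it uses the "most_frequent" strategy rather than the "uniform" strategy from the cell above, five folds, and random_state=0 for the forest, none of which come from the original cells.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)

# Baseline: always predict the most frequent class seen in training
dummy_scores = cross_val_score(DummyClassifier(strategy="most_frequent"),
                               X_iris, y_iris, cv=5, scoring="accuracy")

# Real model, scored on the same five stratified folds
forest_scores = cross_val_score(RandomForestClassifier(random_state=0),
                                X_iris, y_iris, cv=5, scoring="accuracy")

print("baseline:", dummy_scores.mean())
print("forest:  ", forest_scores.mean())
```

Because iris has three equally sized classes, the most-frequent baseline lands at one third accuracy, which is the floor any useful classifier has to beat on this dataset.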

Evaluate a classification model.
Accuracy is computed from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):
Accuracy = (TP+TN)/(TP+TN+FP+FN)
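
As a small worked example of the accuracy formula, the sketch below counts TP, TN, FP, and FN on ten hand-made labels (hypothetical data, chosen only so the arithmetic is easy to check) and compares the result with accuracy_score.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])

# For binary labels 0/1, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn)   # 4 4 1 1
print(accuracy)         # 0.8
print(accuracy_score(y_true, y_pred))
```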

In [97]:
X, y = make_classification(n_samples = 10000,
                           n_features = 3,
                           n_informative = 3,
                           n_redundant = 0,
                           n_classes = 2,
                           random_state = 1)

logit = LogisticRegression()
cross_val_score(logit,X,y, scoring="accuracy")

# Precision is the proportion of observations predicted to be positive that are actually positive.
# We can think of it as a measure of noise in our predictions: when we predict something is positive, how likely are we to be right?
# Models with high precision are pessimistic in that they predict the positive class only when they are very certain about it.
# P = TP/(TP+FP)
cross_val_score(logit,X,y, scoring="precision")

# Recall is the proportion of truly positive observations that the model predicts to be positive. Recall measures the model's ability to identify an observation of the positive class.
# Models with high recall are optimistic in that they have a low bar for predicting that an observation is in the positive class.
# R = TP/(TP+FN)
cross_val_score(logit,X,y, scoring="recall")

#F1 = 2 * (precision * recall) / (precision + recall)
cross_val_score(logit,X,y, scoring="f1")

array([0.95166617, 0.95765275, 0.95558223])
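
To connect the three scorers to their definitions, here is a minimal sketch that computes precision, recall, and F1 by hand from confusion-matrix counts and checks them against sklearn.metrics. The labels are hypothetical and deliberately unbalanced so that precision and recall differ.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score)

# Hypothetical labels for illustration only: 4 positives, 6 negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * (precision * recall) / (precision + recall)

print(precision, precision_score(y_true, y_pred))   # 2/3
print(recall, recall_score(y_true, y_pred))         # 0.5
print(f1, f1_score(y_true, y_pred))                 # 4/7
```

Here the model is right two times out of three when it says "positive" but finds only half of the actual positives, and F1 falls between the two values.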

In [165]:
# You want to evaluate a binary classifier at various probability thresholds
roc_feature, roc_target = make_classification(n_samples=5000,
                                             n_features=5,
                                             n_classes=2,
                                             n_informative=3,
                                             random_state=3)
roc_features_train, roc_features_test, roc_target_train, roc_target_test = train_test_split(roc_feature, roc_target,
                                                                                           test_size=0.1, random_state=1)

logit = LogisticRegression()

logit.fit(roc_features_train, roc_target_train)

roc_target_probability = logit.predict_proba(roc_features_test)[:,1]

false_positive_rate, true_positive_rate, threshold = roc_curve(roc_target_test,
                                                               roc_target_probability)


plt.title("Receiver Operating Characteristic")
plt.plot(false_positive_rate, true_positive_rate)
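plt.plot([0, 1], ls="--")  # diagonal: a classifier that guesses at random
plt.plot([0, 0], [1, 0], c=".7")  # left edge of the ideal (perfect) curve
plt.plot([1, 1], c=".7")  # top edge of the ideal (perfect) curve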
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.show()
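
The curve can also be summarized numerically. The sketch below rebuilds the same data and model as the cell above, computes the area under the curve with roc_auc_score, and shows how moving the probability threshold trades true positives against false positives. The helper rates_at and the thresholds 0.3 and 0.7 are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X_roc, y_roc = make_classification(n_samples=5000, n_features=5, n_classes=2,
                                   n_informative=3, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X_roc, y_roc, test_size=0.1,
                                          random_state=1)

clf = LogisticRegression().fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]   # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, proba)
auc_value = roc_auc_score(y_te, proba)  # 1.0 is perfect, 0.5 is random guessing
auc_from_curve = auc(fpr, tpr)          # same area, from the curve's points


def rates_at(threshold):
    # Hypothetical helper: TPR and FPR when predicting positive at >= threshold
    pred = proba >= threshold
    true_pos_rate = (pred & (y_te == 1)).sum() / (y_te == 1).sum()
    false_pos_rate = (pred & (y_te == 0)).sum() / (y_te == 0).sum()
    return true_pos_rate, false_pos_rate


tpr_lo, fpr_lo = rates_at(0.3)  # lenient threshold: more positives predicted
tpr_hi, fpr_hi = rates_at(0.7)  # strict threshold: fewer positives predicted
print(auc_value)
print(tpr_lo, fpr_lo)
print(tpr_hi, fpr_hi)
```

Lowering the threshold can only raise (or keep) both rates, which is why the ROC curve never slopes downward.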


In [171]:
# View the features of the first test observation (uncomment the line below to view its predicted probabilities)
roc_features_test[0:1]
# logit.predict_proba(roc_features_test)[0:1]

array([[ 1.9686408 , -2.09841386, -1.96596986, -1.40162674,  1.22208429]])

In [172]:
roc_features_train[0:1]
# logit.predict_proba(roc_features_train)[0:1]

array([[ 1.64523025, -0.46164841, -1.38822831,  0.17903566, -0.94450214]])