> Reference:
+ [machinelearningmastery: classification mla spot checking](http://machinelearningmastery.com/spot-check-classification-machine-learning-algorithms-python-scikit-learn/)

+ Linear Machine Learning Algorithms:
    - Logistic Regression
    - Linear Discriminant Analysis
+ Nonlinear Machine Learning Algorithms:
    - K-Nearest Neighbors
    - Naive Bayes
    - Classification and Regression Trees
    - Support Vector Machines

In [12]:
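import pandas
# NOTE: sklearn.cross_validation was removed in scikit-learn 0.20,
# so this cell needs an older release to run as written.
from sklearn import cross_validation
# Pima Indians diabetes data: 8 numeric inputs and a binary class label
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]  # input features
Y = array[:,8]    # class label
num_folds = 10
num_instances = len(X)
seed = 7
# 10-fold cross validation; random_state has no effect here because shuffle defaults to False
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)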
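
The `sklearn.cross_validation` module used above was deprecated in scikit-learn 0.18 and removed in 0.20. A minimal sketch of the same setup with `sklearn.model_selection` might look like the following. The synthetic data from `make_classification` is an assumption that stands in for the Pima CSV so the snippet runs without a download, and the `_demo` names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumption: synthetic stand-in with 8 numeric features and a binary label
X_demo, Y_demo = make_classification(n_samples=768, n_features=8, random_state=7)

# n_splits replaces the old n / n_folds arguments. Without shuffle=True the
# folds are contiguous blocks, and random_state must not be passed.
kfold_demo = KFold(n_splits=10)

# max_iter is raised only to avoid convergence warnings on unscaled data
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, Y_demo,
                         cv=kfold_demo, scoring='accuracy')
print("%f (%f)" % (scores.mean(), scores.std()))
```

Estimator defaults have also changed between releases, so scores from a current install will not necessarily match the numbers recorded in this notebook.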

In [21]:
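# Spot Check Algorithms
# Imports for the estimators compared below
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# Evaluate each model in turn on the same 10-fold split
results = []
model_names = []  # kept separate from the column names list `names`
scoring = 'accuracy'
for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    cv_results = cross_validation.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    model_names.append(name)
    # Report mean accuracy and its standard deviation across folds
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)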

LR: 0.769515 (0.048411)
LDA: 0.773462 (0.051592)
KNN: 0.726555 (0.061821)
CART: 0.693865 (0.063468)
NB: 0.755178 (0.042766)
SVM: 0.651025 (0.072141)
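
To make the comparison easier to read, the printed results can be sorted by mean accuracy. The sketch below copies the numbers from the output above by hand; the `spot_check` and `ranked` names are illustrative only.

```python
# (mean accuracy, standard deviation) copied from the output above
spot_check = {
    'LR': (0.769515, 0.048411),
    'LDA': (0.773462, 0.051592),
    'KNN': (0.726555, 0.061821),
    'CART': (0.693865, 0.063468),
    'NB': (0.755178, 0.042766),
    'SVM': (0.651025, 0.072141),
}

# Sort from highest to lowest mean accuracy
ranked = sorted(spot_check.items(), key=lambda item: item[1][0], reverse=True)
for name, (mean_acc, std_acc) in ranked:
    print("%-5s %.3f +/- %.3f" % (name, mean_acc, std_acc))
```

The gap between LDA and LR is far smaller than either standard deviation, so the top of the ranking should be read as a rough ordering rather than a clear winner.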


# Linear MLA: Logistic Regression #
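+ Models the log-odds of the positive class as a linear function of the numeric input variables. The referenced tutorial describes it as assuming Gaussian inputs, but the model itself makes no distributional assumption about them.
+ Models binary classification problems; scikit-learn's `LogisticRegression` also handles multi-class targets.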
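
As a rough illustration of how a fitted logistic regression turns inputs into a class probability, the sketch below applies the logistic (sigmoid) function to a weighted sum. The weights, bias, and input values are made-up numbers, not coefficients fitted to the Pima data, and `predict_proba_demo` is a hypothetical helper name.

```python
import math

def predict_proba_demo(weights, bias, x):
    """Probability of the positive class for one input row."""
    # Linear score: bias plus the weighted sum of the inputs
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    # The logistic function squashes the score into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

p = predict_proba_demo(weights=[0.5, -0.25], bias=0.1, x=[2.0, 4.0])
label = 1 if p >= 0.5 else 0  # threshold at 0.5 for a binary decision
print(p, label)
```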

In [13]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
results = cross_validation.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.76951469583


# Linear MLA: Linear Discriminant Analysis (LDA)  #
+ Assumes a Gaussian distribution for the numerical input variables.
+ Can model binary and multi-class classification.
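
The Gaussian assumption can be made concrete with a one-feature sketch: each class gets its own mean, all classes share one pooled variance, and a new value is assigned to the class with the highest linear discriminant score. This is a simplified illustration on toy data, not scikit-learn's implementation, and the function names are hypothetical.

```python
import math
from statistics import mean

def fit_lda_1d(xs, ys):
    """Estimate class means, priors, and a shared (pooled) variance."""
    classes = sorted(set(ys))
    n = len(xs)
    means = {c: mean(x for x, y in zip(xs, ys) if y == c) for c in classes}
    priors = {c: sum(1 for y in ys if y == c) / n for c in classes}
    pooled_var = sum((x - means[y]) ** 2 for x, y in zip(xs, ys)) / (n - len(classes))
    return classes, means, priors, pooled_var

def predict_lda_1d(model, x):
    """Pick the class with the largest discriminant score for x."""
    classes, means, priors, var = model
    def score(c):
        return x * means[c] / var - means[c] ** 2 / (2 * var) + math.log(priors[c])
    return max(classes, key=score)

# Toy data: class 0 centered near 2, class 1 centered near 8
lda_model = fit_lda_1d([1.0, 2.0, 3.0, 7.0, 8.0, 9.0], [0, 0, 0, 1, 1, 1])
print(predict_lda_1d(lda_model, 4.9), predict_lda_1d(lda_model, 5.1))
```

With equal priors and a shared variance, the decision boundary falls midway between the two class means.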

In [14]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model = LinearDiscriminantAnalysis()
results = cross_validation.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.773462064252


# Nonlinear MLA: K-Nearest Neighbors #
+ Uses a distance metric to find the K most similar instances in the training data for a new instance. For classification, the prediction is the most common class among those neighbors (for regression, it is their mean outcome).
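
The neighbor lookup and vote can be sketched in a few lines of plain Python. This assumes Euclidean distance and a simple majority vote on toy data; `knn_predict_demo` is a hypothetical helper, not part of scikit-learn.

```python
import math
from collections import Counter

def knn_predict_demo(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Pair each training row's distance to the query with its label
    distances = sorted((math.dist(row, query), label)
                       for row, label in zip(train_X, train_y))
    # Count the labels of the k closest rows and return the most common one
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict_demo(train_X, train_y, (2, 2)))
print(knn_predict_demo(train_X, train_y, (7, 7)))
```

Because the prediction depends directly on distances, features measured on very different scales can dominate the result unless the inputs are rescaled first.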

In [15]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
results = cross_validation.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.726555023923


# Nonlinear MLA: Naive Bayes #
+ Calculates the prior probability of each class and the conditional probability of each input value given each class. For new data, these probabilities are multiplied together under the assumption that the inputs are independent of one another given the class (a simple, or naive, assumption).
+ Assumes a Gaussian distribution for real-valued inputs.
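
A minimal sketch of the calculation, assuming a Gaussian likelihood per feature and per class: the model stores a prior plus a mean and variance for every feature, then scores a new row by summing log-probabilities (equivalent to multiplying the probabilities). The toy data and function names are hypothetical, and the small constant added to the variance is an assumption to avoid division by zero.

```python
import math
from statistics import mean, pvariance

def fit_gaussian_nb_demo(X, y):
    """Per class: prior probability and (mean, variance) for each feature."""
    model = {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        stats = [(mean(col), pvariance(col) + 1e-9) for col in zip(*rows)]
        model[c] = (prior, stats)
    return model

def log_gaussian(x, mu, var):
    """Log of the Gaussian density at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict_gaussian_nb_demo(model, row):
    """Class with the highest log prior plus summed log likelihoods."""
    def log_posterior(c):
        prior, stats = model[c]
        return math.log(prior) + sum(
            log_gaussian(x, mu, var) for x, (mu, var) in zip(row, stats))
    return max(model, key=log_posterior)

nb_X = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (8.0, 9.0), (9.0, 8.0), (8.5, 8.5)]
nb_y = [0, 0, 0, 1, 1, 1]
nb_model = fit_gaussian_nb_demo(nb_X, nb_y)
print(predict_gaussian_nb_demo(nb_model, (1.2, 1.8)))
print(predict_gaussian_nb_demo(nb_model, (8.8, 8.1)))
```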

In [16]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
results = cross_validation.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.75517771702


# Nonlinear MLA: Classification and Regression Trees #
+ CART, or simply decision trees, builds a binary tree from the training data. Split points are chosen greedily by evaluating each attribute and each value of each attribute in the training data to minimize a cost function (such as Gini impurity).
+ Gini impurity measures how often a randomly chosen element of a subset would be mislabeled if it were labeled at random according to the distribution of labels in that subset.
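
The greedy split search can be illustrated for a single attribute. The sketch below computes Gini impurity for a list of labels and then tries every observed value as a threshold, keeping the one with the lowest weighted impurity. It is a simplified illustration on toy data, and `gini` and `best_split` are hypothetical helper names.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, cost) minimizing the weighted Gini of both sides."""
    best = (None, float('inf'))
    for threshold in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        if not left or not right:
            continue  # a split must leave rows on both sides
        cost = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if cost < best[1]:
            best = (threshold, cost)
    return best

print(gini([0, 0, 1, 1]))
print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
```

A full tree repeats this search over every attribute and then recurses into each side of the chosen split.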

In [17]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
results = cross_validation.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.692652084757


# Nonlinear MLA: Support Vector Machines #
+ Seeks the line (a hyperplane in higher dimensions) that separates two classes with the largest margin. The training instances closest to that boundary are called support vectors, and they determine where it is placed.
+ SVM has been extended to support multiple classes.
+ The choice of kernel function, set via the `kernel` parameter, is particularly important. `SVC` uses a Radial Basis Function (RBF) kernel by default.
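
For reference, the RBF kernel scores the similarity of two points as `exp(-gamma * squared_distance)`. The sketch below computes it directly; the `gamma` value and the sample points are arbitrary choices for illustration, and `rbf_kernel_demo` is a hypothetical helper name.

```python
import math

def rbf_kernel_demo(a, b, gamma=0.5):
    """RBF similarity: 1.0 for identical points, approaching 0 as they move apart."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)

same = rbf_kernel_demo((0.0, 0.0), (0.0, 0.0))
near = rbf_kernel_demo((0.0, 0.0), (1.0, 1.0))
far = rbf_kernel_demo((0.0, 0.0), (10.0, 10.0))
print(same, near, far)
```

Since the kernel depends on raw distances, it is sensitive to the scale of the input features, which is worth keeping in mind when reading the SVM score below.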

In [18]:
from sklearn.svm import SVC
model = SVC()
results = cross_validation.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.651025290499
