## **Linear Algorithms**

Linear algorithms are often drawn from the field of statistics and make strong assumptions about the functional form of the problem. We can refer to them as linear because the output is a linear combination of the inputs, or weighted inputs, although this definition is stretched. They can also be described as probabilistic algorithms, as they are often fit under a probabilistic framework. They are usually fast to train and often perform very well. Examples of linear algorithms you should consider trying include:

* Logistic Regression
* Linear Discriminant Analysis
* Naive Bayes

## **Nonlinear Algorithms**

Nonlinear algorithms are drawn from the field of machine learning and make few assumptions about the functional form of the problem. We can refer to them as nonlinear because the output is often a nonlinear mapping of inputs to outputs. They often require more data than linear algorithms and are slower to train. Examples of nonlinear algorithms you should consider trying include:

* Decision Tree
* k-Nearest Neighbors
* Artificial Neural Networks
* Support Vector Machine

## **Ensemble Algorithms**

Ensemble algorithms are also drawn from the field of machine learning and combine the predictions from two or more models. There are many ensemble algorithms to choose from, but when spot-checking algorithms, it is a good idea to focus on ensembles of decision tree algorithms, given that they are known to perform so well in practice on a wide range of problems. Examples of ensembles of decision tree algorithms you should consider trying include:

* Bagged Decision Trees
* Random Forest
* Extra Trees
* Stochastic Gradient Boosting

## **Summary**

* Naive Algorithms
    * Majority Class
    * Minority Class
    * Class Priors
* Linear Algorithms
    * Logistic Regression
    * Linear Discriminant Analysis
    * Naive Bayes
* Nonlinear Algorithms
    * Decision Tree
    * k-Nearest Neighbors
    * Artificial Neural Networks
    * Support Vector Machine
* Ensemble Algorithms
    * Bagged Decision Trees
    * Random Forest
    * Extra Trees
    * Stochastic Gradient Boosting
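
The families summarized above can be spot-checked with one loop. The following is a minimal sketch, not the source's own code: the synthetic dataset, the subset of models, the `roc_auc` metric, and the fold settings are all assumptions chosen for illustration.

```python
# spot-check a few algorithms from each family on a synthetic imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# assumed dataset: 1,000 rows with a 95:5 class distribution
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.95, 0.05],
                           flip_y=0, random_state=1)

# one or two representatives per family (an illustrative subset)
models = {
    'Majority Class': DummyClassifier(strategy='most_frequent'),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Linear Discriminant Analysis': LinearDiscriminantAnalysis(),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(random_state=1),
    'k-Nearest Neighbors': KNeighborsClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=1),
    'Extra Trees': ExtraTreesClassifier(n_estimators=50, random_state=1),
}

# stratified folds keep the class ratio the same in every train and test split
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)
    results[name] = scores.mean()
    print('>%s: %.3f (%.3f)' % (name, scores.mean(), scores.std()))
```

The majority-class model gives a baseline: any algorithm that does not beat it has no skill on the dataset.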

# Spot Check Imbalanced Algorithms

There are perhaps four types of imbalanced classification techniques to spot check:

* Data Sampling Algorithms
* Cost-Sensitive Algorithms
* One-Class Algorithms
* Probability Tuning Algorithms

## Data Sampling Algorithms

Examples of popular combinations of oversampling and undersampling include:

* SMOTE and Random Undersampling
* SMOTE and Tomek Links
* SMOTE and Edited Nearest Neighbors
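
To make the first combination concrete, here is a rough sketch that oversamples the minority class by SMOTE-style interpolation and then randomly undersamples the majority class. It is a simplified stand-in written with NumPy and scikit-learn, not a reference SMOTE implementation. The helper names `smote_like_oversample` and `random_undersample` and the target ratios are hypothetical; in practice a dedicated library such as imbalanced-learn would normally be used.

```python
# simplified SMOTE-style oversampling followed by random undersampling
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=1):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbors (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # ask for k + 1 neighbors because each row is normally its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)
    neighbor = idx[base, rng.integers(1, k + 1, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[neighbor] - X_min[base])

def random_undersample(X_maj, n_keep, seed=1):
    """Keep a random subset of n_keep majority rows (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(X_maj), size=n_keep, replace=False)
    return X_maj[keep]

# assumed dataset with a 99:1 class distribution
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01],
                           flip_y=0, random_state=1)
X_min, X_maj = X[y == 1], X[y == 0]

# oversample the minority class to 10% of the majority class (assumed ratio)
n_min_target = len(X_maj) // 10
synthetic = smote_like_oversample(X_min, n_min_target - len(X_min))
X_min_new = np.vstack([X_min, synthetic])

# undersample the majority class to twice the new minority size (assumed ratio)
X_maj_new = random_undersample(X_maj, 2 * len(X_min_new))

# reassemble the resampled training set
X_res = np.vstack([X_maj_new, X_min_new])
y_res = np.hstack([np.zeros(len(X_maj_new), dtype=int),
                   np.ones(len(X_min_new), dtype=int)])
print('Before: 0=%d, 1=%d' % (len(X_maj), len(X_min)))
print('After: 0=%d, 1=%d' % (len(X_maj_new), len(X_min_new)))
```

Resampling like this should be applied to the training portion only, inside each cross-validation fold, so that the test folds keep the original class distribution.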

## Cost-Sensitive Algorithms
Cost-sensitive algorithms are modified versions of machine learning algorithms designed to take the differing costs of misclassification into account when fitting the model on the training dataset.

These algorithms can be effective when used on imbalanced classification, where the cost of misclassification is configured to be inversely proportional to the distribution of examples in the training dataset.

There are many cost-sensitive algorithms to choose from, although it might be practical to test a range of cost-sensitive versions of linear, nonlinear, and ensemble algorithms.

Some examples of machine learning algorithms that can be configured using cost-sensitive training include:

* Logistic Regression
* Decision Trees
* Support Vector Machines
* Artificial Neural Networks
* Bagged Decision Trees
* Random Forest
* Stochastic Gradient Boosting
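
As a small illustration of weighting inversely to the class distribution, the sketch below computes "balanced" class weights and compares an unweighted and a cost-sensitive logistic regression. The dataset, the metric, and the fold settings are assumptions made for the example, and no particular outcome is implied.

```python
# cost-sensitive logistic regression using class weights
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.utils.class_weight import compute_class_weight

# assumed dataset with a 99:1 class distribution
X, y = make_classification(n_samples=2000, n_classes=2, weights=[0.99, 0.01],
                           flip_y=0, random_state=1)

# 'balanced' weights are n_samples / (n_classes * class_count),
# so the rare class receives the larger misclassification cost
classes = np.unique(y)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print('Class weights:', class_weight)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
results = {}
for name, cw in [('unweighted', None), ('cost-sensitive', class_weight)]:
    model = LogisticRegression(max_iter=1000, class_weight=cw)
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)
    results[name] = scores.mean()
    print('>%s: %.3f (%.3f)' % (name, scores.mean(), scores.std()))
```

Passing `class_weight='balanced'` directly to the estimator is equivalent to the explicit dictionary used here; the dictionary form is shown so the weights are visible.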

## One-Class Algorithms

Algorithms used for outlier detection and anomaly detection can be used for classification tasks.

Although unusual, when used in this way, they are often referred to as one-class classification algorithms.

In some cases, one-class classification algorithms can be very effective, such as when there is a severe class imbalance with very few examples of the positive class.

Examples of one-class classification algorithms to try include:

* One-Class Support Vector Machines
* Isolation Forests
* Minimum Covariance Determinant
* Local Outlier Factor
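
One way to use an outlier detector as a classifier is to fit it on majority-class rows only and treat predicted outliers as the positive class. The sketch below does this with an isolation forest. The dataset, the `contamination` value, and the F1 metric are assumptions for illustration.

```python
# one-class classification with an isolation forest
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

# assumed dataset with a 99:1 class distribution
X, y = make_classification(n_samples=2000, n_classes=2, weights=[0.99, 0.01],
                           flip_y=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=2, stratify=y)

# fit on the majority (normal) class only
model = IsolationForest(contamination=0.01, random_state=1)
model.fit(X_train[y_train == 0])

# predict() returns +1 for inliers and -1 for outliers;
# map outliers to the positive class (1) and inliers to the negative class (0)
raw = model.predict(X_test)
yhat = np.where(raw == -1, 1, 0)

score = f1_score(y_test, yhat)
print('F1 for the positive class: %.3f' % score)
```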

# Hyperparameter Tuning

After spot-checking machine learning algorithms and imbalanced algorithms, you will have some idea of what works and what does not on your specific dataset.

The simplest approach to hyperparameter tuning is to select the top five or ten algorithms or algorithm combinations that performed well and tune the hyperparameters for each.

There are three popular hyperparameter tuning algorithms that you may choose from:

* Random Search
* Grid Search
* Bayesian Optimization
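
The first two approaches are available in scikit-learn. The sketch below tunes a logistic regression with both; the dataset, the search space, and the `roc_auc` metric are assumptions for illustration. Bayesian optimization is not shown because it typically relies on a separate library.

```python
# grid search and random search over logistic regression hyperparameters
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold

# assumed dataset with a 95:5 class distribution
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.95, 0.05],
                           flip_y=0, random_state=1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)

# grid search: evaluate every combination in a small, fixed grid
grid = GridSearchCV(
    model,
    param_grid={'C': [0.01, 0.1, 1.0, 10.0], 'class_weight': [None, 'balanced']},
    scoring='roc_auc', cv=cv)
grid.fit(X, y)
print('Grid search best: %.3f with %s' % (grid.best_score_, grid.best_params_))

# random search: sample a fixed number of combinations from a larger space
rand = RandomizedSearchCV(
    model,
    param_distributions={'C': np.logspace(-3, 2, 50).tolist(),
                         'class_weight': [None, 'balanced']},
    n_iter=10, scoring='roc_auc', cv=cv, random_state=1)
rand.fit(X, y)
print('Random search best: %.3f with %s' % (rand.best_score_, rand.best_params_))
```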

# How to Fix k-Fold Cross-Validation for Imbalanced Classification

[Source: How to Fix k-Fold Cross-Validation for Imbalanced Classification, *Machine Learning Mastery*](https://machinelearningmastery.com/cross-validation-for-imbalanced-classification/)

## Repeated Stratified K-Fold Cross-Validation

```python
# example of repeated stratified k-fold cross-validation with an imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], flip_y=0, random_state=1)
# 3 folds repeated 10 times (10 is also the default for n_repeats)
kfold = RepeatedStratifiedKFold(n_splits=3, n_repeats=10, random_state=1)
# print out the number of splits created
c = 0
for train_ix, test_ix in kfold.split(X, y):
    c += 1
print(c)
# enumerate the splits and summarize the distributions
for train_ix, test_ix in kfold.split(X, y):
    # select rows
    train_X, test_X = X[train_ix], X[test_ix]
    train_y, test_y = y[train_ix], y[test_ix]
    # summarize train and test composition
    train_0, train_1 = len(train_y[train_y==0]), len(train_y[train_y==1])
    test_0, test_1 = len(test_y[test_y==0]), len(test_y[test_y==1])
    print('>Train: 0=%d, 1=%d, Test: 0=%d, 1=%d' % (train_0, train_1, test_0, test_1))
```


```
30
>Train: 0=660, 1=6, Test: 0=330, 1=4
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=6, Test: 0=330, 1=4
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=6, Test: 0=330, 1=4
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=6, Test: 0=330, 1=4
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=6, Test: 0=330, 1=4
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=6, Test: 0=330, 1=4
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=6, Test: 0=330, 1=4
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=6, Test: 0=330, 1=4
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=6, Test: 0=330, 1=4
>Train: 0=660, 1=7, Test: 0=330, 1=3
>Train: 0=660, 1=7, Test: 0=330, 1=
```

## Stratified K-Fold Cross-Validation

```python
# example of stratified k-fold cross-validation with an imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], flip_y=0, random_state=1)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
# enumerate the splits and summarize the distributions
for train_ix, test_ix in kfold.split(X, y):
    # select rows
    train_X, test_X = X[train_ix], X[test_ix]
    train_y, test_y = y[train_ix], y[test_ix]
    # summarize train and test composition
    train_0, train_1 = len(train_y[train_y==0]), len(train_y[train_y==1])
    test_0, test_1 = len(test_y[test_y==0]), len(test_y[test_y==1])
    print('>Train: 0=%d, 1=%d, Test: 0=%d, 1=%d' % (train_0, train_1, test_0, test_1))
```

```
>Train: 0=792, 1=8, Test: 0=198, 1=2
>Train: 0=792, 1=8, Test: 0=198, 1=2
>Train: 0=792, 1=8, Test: 0=198, 1=2
>Train: 0=792, 1=8, Test: 0=198, 1=2
>Train: 0=792, 1=8, Test: 0=198, 1=2
```


## Stratified Train-Test Split

```python
# example of stratified train/test split with an imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], flip_y=0, random_state=1)
# split into train/test sets with same class ratio
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# summarize
train_0, train_1 = len(trainy[trainy==0]), len(trainy[trainy==1])
test_0, test_1 = len(testy[testy==0]), len(testy[testy==1])
print('>Train: 0=%d, 1=%d, Test: 0=%d, 1=%d' % (train_0, train_1, test_0, test_1))
```

```
>Train: 0=495, 1=5, Test: 0=495, 1=5
```


# Data Leakage Example 2

[Source: Stratified K Fold Cross Validation, *Geeks for Geeks*](https://www.geeksforgeeks.org/stratified-k-fold-cross-validation/)

## Stratified K-Fold Cross-Validation - WITH DATA LEAKAGE

```python
# STRATIFIED K-FOLD CROSS VALIDATION { 10-fold }

# Import required modules.
from statistics import mean, stdev
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold
from sklearn import linear_model
from sklearn import datasets

# Fetch features and target variable in array format.
cancer = datasets.load_breast_cancer()

# Input features.
x = cancer.data

# Target variable.
y = cancer.target

# Feature scaling for input features.
# The scaler is fit on ALL rows before splitting, which is the source of the leakage.
scaler = preprocessing.MinMaxScaler()
x_scaled = scaler.fit_transform(x)

# Create classifier object.
lr = linear_model.LogisticRegression()

# Create StratifiedKFold object.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
lst_accu_stratified = []

for train_index, test_index in skf.split(x, y):
    x_train_fold, x_test_fold = x_scaled[train_index], x_scaled[test_index]
    y_train_fold, y_test_fold = y[train_index], y[test_index]
    lr.fit(x_train_fold, y_train_fold)
    lst_accu_stratified.append(round(lr.score(x_test_fold, y_test_fold), 2))

# Print the output.
print('List of possible accuracy:', lst_accu_stratified)
print('\nMaximum Accuracy That can be obtained from this model is:',
      round(max(lst_accu_stratified)*100, 2), '%')
print('\nMinimum Accuracy:',
      round(min(lst_accu_stratified)*100, 2), '%')
print('\nOverall Accuracy:',
      round(mean(lst_accu_stratified)*100, 2), '%')
print('\nStandard Deviation is:', round(stdev(lst_accu_stratified), 2))
```


```
List of possible accuracy: [0.93, 0.96, 0.98, 1.0, 0.96, 0.96, 0.98, 0.95, 0.95, 0.98]

Maximum Accuracy That can be obtained from this model is: 100.0 %

Minimum Accuracy: 93.0 %

Overall Accuracy: 96.5 %

Standard Deviation is: 0.02
```


## Stratified K-Fold Cross-Validation - WITHOUT DATA LEAKAGE (my own fix)

```python
# STRATIFIED K-FOLD CROSS VALIDATION { 10-fold }

# Import required modules.
from statistics import mean, stdev
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold
from sklearn import linear_model
from sklearn import datasets

# Fetch features and target variable in array format.
cancer = datasets.load_breast_cancer()

# Input features.
x = cancer.data

# Target variable.
y = cancer.target

# Create classifier object.
lr = linear_model.LogisticRegression()

# Create StratifiedKFold object.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
lst_accu_stratified = []

for train_index, test_index in skf.split(x, y):
    # Split the raw (unscaled) features so the scaler only ever sees training rows.
    x_train_fold, x_test_fold = x[train_index], x[test_index]
    y_train_fold, y_test_fold = y[train_index], y[test_index]

    # Feature scaling: fit on the training fold only, then apply to both folds.
    scaler = preprocessing.MinMaxScaler()
    scaler.fit(x_train_fold)
    x_train_scaled = scaler.transform(x_train_fold)
    x_test_scaled = scaler.transform(x_test_fold)

    lr.fit(x_train_scaled, y_train_fold)
    lst_accu_stratified.append(round(lr.score(x_test_scaled, y_test_fold), 2))

# Print the output.
print('List of possible accuracy:', lst_accu_stratified)
print('\nMaximum Accuracy That can be obtained from this model is:',
      round(max(lst_accu_stratified)*100, 2), '%')
print('\nMinimum Accuracy:',
      round(min(lst_accu_stratified)*100, 2), '%')
print('\nOverall Accuracy:',
      round(mean(lst_accu_stratified)*100, 2), '%')
print('\nStandard Deviation is:', round(stdev(lst_accu_stratified), 2))
```

```
List of possible accuracy: [0.93, 0.96, 0.98, 0.98, 0.96, 0.96, 0.98, 0.95, 0.95, 0.98]

Maximum Accuracy That can be obtained from this model is: 98.0 %

Minimum Accuracy: 93.0 %

Overall Accuracy: 96.3 %

Standard Deviation is: 0.02
```
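
The per-fold scaling can also be delegated to a scikit-learn `Pipeline`, which refits the scaler on the training portion of each fold automatically. This is a minimal sketch of that alternative on the same dataset and fold settings; it is offered as one convenient option, not as the source article's method.

```python
# leakage-free cross-validation using a pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# the scaler is a pipeline step, so it is fit only on the training rows of each fold
pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression()),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)
print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
```

Because the whole pipeline is passed to `cross_val_score`, there is no opportunity to fit the scaler on test rows by mistake.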


# Data Leakage

[Source: How to Avoid Data Leakage When Performing Data Preparation, *Machine Learning Mastery*](https://machinelearningmastery.com/data-preparation-without-data-leakage/)

## With Data Leakage

```python
# naive approach to normalizing the data before splitting the data and evaluating the model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# normalize the whole dataset (this is where the leakage happens)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# make predictions on the test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))
```

```
Accuracy: 84.848
```


## Without Data Leakage

```python
# correct approach: normalize the data after it is split and before the model is evaluated
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# make predictions on the test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))
```

```
Accuracy: 85.455
```
