<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Classification Techniques

_Your one-stop-shop for keeping classification techniques and code handy_

---


We will split up the work for this notebook:
- Person 1: Logistic Regression, Regularized Logistic Regression
- Person 2: kNN
- Person 3: Decision Tree and Random Forest Classifier
- Person 4: Naive Bayes Classifier
- Person 5: Support Vector Machines
---

Classification models covered in this notebook:
- [Logistic Regression](#Logistic-Regression)
    - [Regularized Logistic Regression](#Regularized-Logistic-Regression)
- [K-Nearest Neighbor](#kNN)
- [Decision Tree](#Decision-Tree)
- [Random Forest Classifier](#Random-Forest-Classifier)
- [Naive Bayes Classifier](#Naive-Bayes-Classifier)
- [Support Vector Machines](#Support-Vector-Machines)
---

In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
%matplotlib inline



In [2]:
# Read in the data.
admissions = pd.read_csv('./data/admissions.csv')
admissions.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.0,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0


In [3]:
admissions.isnull().sum()

admit       0
gre         2
gpa         2
prestige    1
dtype: int64

In [4]:
admissions.dropna(axis=0, inplace=True)

In [5]:
# split into testing and training sets
features = ['gre', 'gpa', 'prestige']
X = admissions[features] # feature matrix
y = admissions['admit'] # target vector

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

---
# Logistic Regression
_Noelle_

When should you use the algorithm?
> **ANSWER HERE**: When you are trying to predict a binary outcome or are trying to find the probability of being in each class. Should also be used when you need a very interpretable classification model.

What are some benefits and drawbacks of the model?
> **Benefits:**  
- Simple
- Can perform well even for complex problems
- Interpretable
- Gives the predicted probability of being in the target class in addition to the predicted label

> **Drawbacks:**  
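- Assumes a linear relationship between the features and the log-odds of the target
- Can overfit, especially with many features and no regularization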

> [more](https://theprofessionalspoint.blogspot.com/2019/03/advantages-and-disadvantages-of.html)

Give us an example of how to use it in Python using the provided dataset
> **ANSWER BELOW**

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

# instantiate & turn off regularization (scikit-learn >= 1.2 spells this penalty=None)
log_reg = LogisticRegression(penalty='none', solver='lbfgs')
log_reg.fit(X_train, y_train) # fit

# simple evaluation function for scoring
def classification_evaluate(y_train_true, y_train_pred, y_test_true, y_test_pred):
    print('Evaluation Metrics')
    print('---------------------------')
    print('Training accuracy:', accuracy_score(y_train_true, y_train_pred))
    print('Testing accuracy:', accuracy_score(y_test_true, y_test_pred))
    print(' ')
    print('Training Recall:', recall_score(y_train_true, y_train_pred))
    print('Testing Recall:', recall_score(y_test_true, y_test_pred))
    print(' ')
    print('Training Precision:', precision_score(y_train_true, y_train_pred))
    print('Testing Precision:', precision_score(y_test_true, y_test_pred))
    print(' ')
    print('Training F1:', f1_score(y_train_true, y_train_pred))
    print('Testing F1:', f1_score(y_test_true, y_test_pred))

# score
classification_evaluate(y_train, log_reg.predict(X_train), y_test, log_reg.predict(X_test))

Evaluation Metrics
---------------------------
Training accuracy: 0.734006734006734
Testing accuracy: 0.61
 
Training Recall: 0.24719101123595505
Testing Recall: 0.16216216216216217
 
Training Precision: 0.6470588235294118
Testing Precision: 0.42857142857142855
 
Training F1: 0.35772357723577236
Testing F1: 0.23529411764705885
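
To make the interpretability point concrete, here is a minimal sketch of how a logistic regression turns feature values into a probability and a label. The intercept and coefficients are made-up illustrative numbers, not the values fitted above; in a fitted scikit-learn model they would come from `log_reg.intercept_` and `log_reg.coef_`.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept and coefficients (NOT the fitted values from this notebook)
intercept = -3.5
coefs = {'gre': 0.002, 'gpa': 0.8, 'prestige': -0.6}

def predict_proba(row):
    """P(admit = 1) for one applicant given as a dict of feature values."""
    log_odds = intercept + sum(coefs[name] * row[name] for name in coefs)
    return sigmoid(log_odds)

applicant = {'gre': 660, 'gpa': 3.67, 'prestige': 3}
p = predict_proba(applicant)
label = int(p >= 0.5)  # default 0.5 decision threshold

# exp(coefficient) is the multiplicative change in the odds per one-unit increase
odds_ratio_gpa = math.exp(coefs['gpa'])
print(f'P(admit) = {p:.3f}, predicted class = {label}, gpa odds ratio = {odds_ratio_gpa:.2f}')
```

Reading a coefficient as an odds ratio is what makes the model easy to explain: each feature contributes additively to the log-odds.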


---
# Regularized Logistic Regression
_Noelle_

When should you use the algorithm?
> **ANSWER HERE**: When your logistic regression is overfit and needs to be regularized. Note that logistic regression in sklearn is regularized by default (an L2 penalty with `C=1.0`), where `C` is the inverse of the regularization strength.

What are some benefits and drawbacks of the model?
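> **ANSWER HERE**: Same as above, with the added benefit that it helps prevent overfitting. With an L1 penalty, some coefficients can be shrunk to exactly zero, which acts as a form of feature selection. The extra cost is that the features should be scaled first and the regularization strength has to be tuned.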
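
A rough sketch of why an L1 penalty can zero out coefficients while an L2 penalty only shrinks them. This uses a simplified one-coefficient squared-error problem, not the logistic loss that sklearn actually optimizes, and the coefficient values and penalty strength are hypothetical.

```python
def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return z / (1.0 + lam)

unpenalized = [2.0, 0.3, -0.1, -1.5]  # hypothetical unpenalized coefficients
lam = 0.5                              # hypothetical penalty strength

l1_coefs = [l1_shrink(z, lam) for z in unpenalized]
l2_coefs = [l2_shrink(z, lam) for z in unpenalized]
print('L1:', l1_coefs)
print('L2:', l2_coefs)
```

Under L1, any coefficient smaller in magnitude than the penalty is set to exactly zero; under L2, every coefficient gets smaller but none disappears.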

Give us an example of how to use it in Python using the provided dataset
> **ANSWER BELOW**

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# scale
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

# get params to gridsearch over with various regularization types/strengths
log_params = {
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],
    'C': np.logspace(0, 1, 10)
}

# instantiate
log_gridsearch = GridSearchCV(LogisticRegression(), 
                              log_params,
                              cv=5,
                              verbose=1)

# fit
log_gridsearch.fit(X_train_sc, y_train)

# save best model from gridsearch
best_model = log_gridsearch.best_estimator_

# evaluate model
classification_evaluate(y_train, best_model.predict(X_train_sc), y_test, best_model.predict(X_test_sc))

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Evaluation Metrics
---------------------------
Training accuracy: 0.734006734006734
Testing accuracy: 0.61
 
Training Recall: 0.24719101123595505
Testing Recall: 0.16216216216216217
 
Training Precision: 0.6470588235294118
Testing Precision: 0.42857142857142855
 
Training F1: 0.35772357723577236
Testing F1: 0.23529411764705885


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.5s finished


---
# kNN
_Colin_

When should you use the algorithm?
> **ANSWER HERE**: 
- Useful for nonlinear data
- "Should I give you a loan? Do people with your characteristics tend to default on loans?"
- Easy to interpret visually, can be plotted
- ML map: https://scikit-learn.org/stable/_static/ml_map.png
- Source used: https://blog.usejournal.com/a-quick-introduction-to-k-nearest-neighbors-algorithm-62214cea29c7

What are some benefits and drawbacks of the model?
> **ANSWER HERE**: 
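- The training data is stored inside the model (there is no real training step), so memory use grows with the dataset
- Prediction needs a lot of computation, because distances to the stored training points have to be calculated
- Less useful features distort the distance calculation, so more user intuition or trials may be required to remove them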
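
As a minimal sketch of what the algorithm does under the hood (not scikit-learn's implementation), kNN can be written in a few lines: measure the distance from the query to every stored point, keep the `k` closest, and take a majority vote. The loan-style data below is made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Made-up, already-scaled (income, debt) pairs; 1 = defaulted, 0 = repaid
train_points = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7), (0.8, 0.2), (0.9, 0.1), (0.7, 0.3)]
train_labels = [1, 1, 1, 0, 0, 0]

prediction = knn_predict(train_points, train_labels, query=(0.25, 0.75), k=3)
print(prediction)
```

Because every prediction scans the stored training points, the cost of predicting grows with the size of the training set.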

Give us an example of how to use it in Python using the provided dataset
> **ANSWER BELOW**

In [33]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score



In [34]:
# Scale
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
# instantiate
knn = KNeighborsClassifier()
# fit
knn.fit(X_train, y_train)
# preliminary scores
print(f'Train score: {knn.score(X_train, y_train)}')
print(f'Test score: {knn.score(X_test, y_test)}')
print(f'Cross Val score: {cross_val_score(knn, X_train, y_train, cv=10).mean()}')

Train score: 0.7643097643097643
Test score: 0.6
Cross Val score: 0.6970607553366175


In [35]:
# Evaluate
from sklearn.metrics import confusion_matrix
y_preds = knn.predict(X_test)
confusion_matrix(y_test, y_preds)

array([[47, 16],
       [24, 13]])
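
The confusion matrix above can be turned into the same metrics that `classification_evaluate` prints. This sketch assumes scikit-learn's layout for a binary problem: rows are actual classes, columns are predicted classes, so the cells are `[[TN, FP], [FN, TP]]`.

```python
# Confusion matrix from the kNN cell above: rows = actual, columns = predicted
tn, fp = 47, 16
fn, tp = 24, 13

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted admits, how many were real admits
recall = tp / (tp + fn)     # of the real admits, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')
```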

---
# Decision Tree
_Matt_

**What is a Decision Tree?**
---
- Decision Trees are a type of Supervised Machine Learning (that is, the training data contains both the inputs and the corresponding outputs) where the data is repeatedly split according to the value of a feature.
- The goal is to create a model that predicts the value of a target variable based on several input variables.
- The tree can be explained by two entities, namely decision nodes and leaves.
    - The leaves are the decisions or the final outcomes. And the decision nodes are where the data is split.
- Two main types:
    - Classification trees (Yes/No types)
    - Regression trees (Continuous data types)
- Main metrics:
    - Gini impurity
        - measures the amount of uncertainty at a node
        - 0 means no uncertainty (the node is pure)
        - equals 1 minus the chance of being right when guessing a label at random from the node's class proportions
    - Entropy
        - an alternative impurity measure (not the same as Gini impurity), also 0 for a pure node
    - Information gain
        - Effectively the change in entropy (or other impurity) from a parent node to its children
        - lets us find the question that reduces uncertainty the most
            - this is how we determine our root node, or node from which we ask our first question
        - once no split gives IG > 0 we get a leaf
    - **Variance reduction**
        - Introduced in CART, often employed with regression trees
            - CART (Classification and Regression Tree) is one of many decision tree-specific algorithms
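
The impurity metrics above are short enough to compute by hand. The following is a minimal sketch on toy labels; the helper names are illustrative and not part of any library.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

parent = [1, 1, 1, 0, 0, 0]             # toy labels, evenly mixed
perfect_split = ([1, 1, 1], [0, 0, 0])  # each child is pure
useless_split = ([1, 0], [1, 1, 0, 0])  # children are as mixed as the parent

print(gini(parent), entropy(parent))
print(information_gain(parent, *perfect_split))
print(information_gain(parent, *useless_split))
```

A tree-building algorithm evaluates candidate splits this way and keeps the one with the highest gain.
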
**...if you want to know more**
- Decision Tree models are created using 2 steps: **Induction and Pruning**.
    - **Induction** is where we actually build the tree, i.e., set all of the hierarchical decision boundaries based on our data. Because of the nature of training decision trees they can be prone to major overfitting.
    - **Pruning** is the process of removing the unnecessary structure from a decision tree, effectively reducing the complexity to combat overfitting with the added bonus of making it even easier to interpret.
- There are several parameters that you can set for your decision tree model in scikit-learn too. Here are a few of the more interesting ones to play around with to try and get some better results:
    - `max_depth`: The max depth of the tree where we will stop splitting the nodes. This is similar to controlling the maximum number of layers in a deep neural network. Lower will make your model faster but not as accurate; higher can give you accuracy but risks overfitting and may be slow.
    - `min_samples_split`: The minimum number of samples required to split a node. Setting it to a higher value helps mitigate overfitting.
    - `max_features`: The number of features to consider when looking for the best split. Higher means potentially better results with the tradeoff of training taking longer.
    - `min_impurity_split`: Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold. This can be used to trade off combating overfitting (high value, small tree) vs high accuracy (low value, big tree). Note that this parameter was deprecated and has been removed from recent scikit-learn releases; `min_impurity_decrease` is its replacement.
    - `presort`: Whether to presort the data to speed up the finding of best splits in fitting. If we sort our data on each feature beforehand, our training algorithm will have a much easier time finding good values to split on. This parameter has also been removed from recent scikit-learn releases.
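
A minimal sketch of how `max_depth` and `min_samples_split` change a tree. It uses synthetic stand-in data from `make_classification` (an assumption made so the snippet runs on its own; the notebook's own `X` and `y` could be swapped in), and the `*_demo` and `Xd_*` names are chosen only to avoid overwriting the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; swap in the notebook's own X and y)
X_demo, y_demo = make_classification(n_samples=400, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=42)

# Unconstrained tree: grows until its leaves are pure, so it tends to memorize the training set
deep_tree = DecisionTreeClassifier(random_state=42).fit(Xd_train, yd_train)

# Constrained tree: limited depth and a minimum node size before splitting
small_tree = DecisionTreeClassifier(max_depth=3, min_samples_split=20,
                                    random_state=42).fit(Xd_train, yd_train)

for name, tree in [('deep', deep_tree), ('small', small_tree)]:
    print(name, 'depth:', tree.get_depth(),
          'train:', tree.score(Xd_train, yd_train),
          'test:', tree.score(Xd_test, yd_test))
```

Comparing the train and test scores of the two trees is a quick way to see whether the unconstrained tree is overfitting.
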
**When should you use the algorithm?**
---
- Classification with many categories
- We have a large dataset
- Data is both numerical and categorical
- Yes/No problems (boolean logic)
**What are some benefits and drawbacks of the model?**
---
**Benefits:**
- Simple to understand and to interpret. Trees can be visualized.
- Requires little data preparation beyond treating null values
- Able to handle both numerical and categorical data.
- Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic.
- Possible to validate a model using statistical tests.
- Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
**Drawbacks:**
- High potential for overfitting. Decision-tree learners can create over-complex trees that do not generalize the data well.
- Decision trees can be unstable because small variations in the data might result in a completely different tree being generated.
- There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
- Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.
***sources:***
- https://www.kdnuggets.com/2018/12/guide-decision-trees-machine-learning-data-science.html
- https://en.wikipedia.org/wiki/Decision_tree_learning
- https://medium.com/datadriveninvestor/decision-trees-lesson-101-f00dad6cba21
- https://github.com/random-forests/tutorials/blob/master/decision_tree.ipynb
- https://www.youtube.com/watch?v=LDRbO9a6XPU

code is taken directly from this article: https://towardsdatascience.com/decision-tree-in-python-b433ae57fb93

---
# Random Forest Classifier
_Matt_

**What is a Random Forest?**
---
- Like Decision Trees, Random Forests are a method for classification and regression
- RFs construct a multitude of decision trees during training, then output the class that is the mode (i.e. occurs most often) of the individual trees for classification, or their mean prediction for regression.
- They help to correct decision trees' tendency to overfit the training set
- “*A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.*”
    - The reason for this wonderful effect is that the trees protect each other from their individual errors
- So in our random forest, we end up with trees that are not only trained on different sets of data (thanks to bagging) but also use different features to make decisions.
- **IN SHORT: The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.**
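
The "committee" effect can be illustrated with a small calculation. This sketch assumes every tree is right with the same probability and that the trees' errors are fully independent, which real forests only approximate; the 0.6 accuracy is a made-up number.

```python
from math import comb

def majority_vote_accuracy(n_trees, p_correct):
    """P(a majority of n independent voters is right), each right with probability p_correct."""
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct ** k * (1 - p_correct) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )

for n in (1, 11, 101):  # odd sizes avoid tied votes
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

The more correlated the trees are, the smaller this gain becomes, which is why bagging and feature randomness are used to decorrelate them.
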
**When should you use the algorithm?**
---
- If Decision Tree is consistently overfitting
- Two prerequisites for a random forest to perform well are:
    - There needs to be some actual signal in our features so that models built using those features do better than random guessing.
    - The predictions (and therefore the errors) made by the individual trees need to have low correlations with each other.
- When you have time to train the model
**What are some of the benefits and drawbacks?**
---
**Benefits**
1. Random Forest is based on the bagging algorithm and uses the Ensemble Learning technique. It creates many trees on subsets of the data and combines the output of all the trees. In this way it **reduces the overfitting problem in decision trees and also reduces the variance, and therefore improves the accuracy.**
2. Random Forest can be used to **solve both classification as well as regression problems**.
3. Random Forest **works well with both categorical and continuous variables**.
4. Some Random Forest implementations can **automatically handle missing values**; with scikit-learn this depends on the version, so dropping or imputing nulls first (as done at the top of this notebook) is the safe choice.
5. **No feature scaling required**: No feature scaling (standardization and normalization) required in case of Random Forest as it uses rule based approach instead of distance calculation.
6. **Handles non-linear parameters efficiently**: Non linear parameters don’t affect the performance of a Random Forest unlike curve based algorithms. So, if there is high non-linearity between the independent variables, Random Forest may outperform as compared to other curve based algorithms.
7. Random Forest is **usually robust to outliers** and can handle them automatically.
    - Random Forest algorithm is **very stable**. Even if a new data point is introduced in the dataset, the overall algorithm is not affected much since the new data may impact one tree, but it is very hard for it to impact all the trees.
    - Random Forest is comparatively less impacted by noise.
**Drawbacks**
1. **Complexity**: Random Forest creates a lot of trees (unlike only one tree in case of decision tree) and combines their outputs. By default, it creates 100 trees in Python sklearn library. To do so, this algorithm requires much more computational power and resources. On the other hand decision tree is simple and does not require so much computational resources.
2. **Longer Training Period**: Random Forest requires much more time to train than a decision tree, as it generates a lot of trees (instead of one tree in the case of a decision tree) and makes its decision on the majority of votes.
***sources:***
- https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
- https://en.wikipedia.org/wiki/Random_forest
- https://towardsdatascience.com/understanding-random-forest-58381e0602d2
- http://theprofessionalspoint.blogspot.com/2019/02/advantages-and-disadvantages-of-random.html
- https://github.com/WillKoehrsen/Machine-Learning-Projects/blob/master/Random%20Forest%20Tutorial.ipynb

---
# Naive Bayes Classifier
_Justin_

- https://www.machinelearningplus.com/predictive-modeling/how-naive-bayes-algorithm-works-with-example-and-full-code/
- https://becominghuman.ai/naive-bayes-theorem-d8854a41ea08
- https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/ (good explanation of how it works)
- https://www.geeksforgeeks.org/naive-bayes-classifiers/
- https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c
- https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/

When should you use the algorithm?
> **ANSWER HERE**: 
> The name "naive" is used because the model assumes the features that go into it are independent of each other (within each class). That is, changing the value of one feature does not directly influence or change the value of any of the other features used in the algorithm.
> Bayes' rule is a way of going from P(X|Y), known from the training dataset, to P(Y|X).   
$P(A|B)= \frac{P(B|A)P(A)}{P(B)}$
> $P(Yes|conds)=\frac{P(X_1|Yes)*P(X_2|Yes)*...*P(X_n|Yes)*P(Yes)}{P(conds)}$
> $P(No|conds)=\frac{P(X_1|No)*P(X_2|No)*...*P(X_n|No)*P(No)}{P(conds)}$
> Since both have the same denominator, it can be ignored. Only the numerators need to be compared, and the class with the greater one is selected.
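
A small worked sketch of that comparison. All probabilities and feature names below are made up for illustration; the numerator for each class is the class prior times the product of the per-feature likelihoods, as Bayes' rule requires.

```python
# Hypothetical probabilities estimated from a training set (illustrative numbers only)
prior = {'Yes': 0.3, 'No': 0.7}
likelihood = {
    'Yes': {'high_gpa': 0.8, 'top_school': 0.5},
    'No':  {'high_gpa': 0.4, 'top_school': 0.2},
}

def numerator(label, observed_features):
    """P(label) times the product of P(feature | label) for the observed features."""
    score = prior[label]
    for feature in observed_features:
        score *= likelihood[label][feature]
    return score

observed = ['high_gpa', 'top_school']
scores = {label: numerator(label, observed) for label in prior}
prediction = max(scores, key=scores.get)

# Dividing by the shared denominator P(conds) recovers proper probabilities
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
print(scores, prediction, posteriors)
```

Even though "No" has the larger prior here, the likelihoods favor "Yes" strongly enough to win the comparison.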

What are some benefits and drawbacks of the model?
> **Benefits**:
- Quick and efficient
- Can outperform other models when the independence assumption holds.

>**Drawbacks**:
- Features ***NEED*** to be independent.  They aren't in the real world most of the time
- Categorical data needs to be smoothed so that a feature value that appears in the test set but never in the training set does not force a probability of zero
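
A minimal sketch of the smoothing idea for a categorical feature, using add-alpha (Laplace) smoothing. The helper name and the toy values are hypothetical; note that the `GaussianNB` model used in this notebook works on continuous features and does not use this kind of count smoothing.

```python
from collections import Counter

def smoothed_likelihoods(values, vocabulary, alpha=1.0):
    """P(value | class) with add-alpha (Laplace) smoothing over a known vocabulary."""
    counts = Counter(values)
    denom = len(values) + alpha * len(vocabulary)
    return {v: (counts[v] + alpha) / denom for v in vocabulary}

# Toy training values of one categorical feature within a single class
prestige_given_admit = [1, 1, 2, 2, 2, 3]
vocabulary = [1, 2, 3, 4]  # prestige 4 never appears for this class in training

unsmoothed = smoothed_likelihoods(prestige_given_admit, vocabulary, alpha=0.0)
smoothed = smoothed_likelihoods(prestige_given_admit, vocabulary, alpha=1.0)
print(unsmoothed[4], smoothed[4])
```

Without smoothing, a single unseen value would multiply the whole class numerator by zero, no matter what the other features say.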

Give us an example of how to use it in Python using the provided dataset
> **ANSWER BELOW**

In [36]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(f'Train score: {gnb.score(X_train, y_train):.5}')
print(f'Test score: {gnb.score(X_test, y_test):.5}')

Train score: 0.72391
Test score: 0.63


---
# Support Vector Machines
_Nate_

> **What is SVM**:
- The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N = the number of features) that distinctly classifies the data points.
- We are looking to maximize the margin between the hyperplane and the nearest data points of each class.
- Hyperplanes are decision boundaries that help classify the data points.
- Support vectors are the data points closest to the hyperplane; they influence the position and orientation of the hyperplane.
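
The geometry above can be sketched in a few lines. The hyperplane `w`, `b` and the labeled points are hypothetical; a real SVM would search for the `w` and `b` that maximize the margin rather than being handed them.

```python
import math

# Hypothetical separating hyperplane w.x + b = 0 in two dimensions
w = (1.0, -1.0)
b = 0.0

def decision_value(x):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(x):
    """Geometric distance from x to the hyperplane."""
    return abs(decision_value(x)) / math.sqrt(sum(wi * wi for wi in w))

# Toy labeled points: class +1 or -1
points = {(2.0, 0.0): 1, (3.0, 1.0): 1, (0.0, 2.0): -1, (1.0, 4.0): -1}

predictions = {x: (1 if decision_value(x) >= 0 else -1) for x in points}
margin = min(distance_to_hyperplane(x) for x in points)
closest = [x for x in points if math.isclose(distance_to_hyperplane(x), margin)]
print(predictions, margin, closest)
```

Only the closest points determine the margin; moving the far-away point `(1.0, 4.0)` slightly would not change it, which is the sense in which support vectors alone fix the boundary.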

> **When to use SVM**:
- Can be used for both regression and classification, but is widely used for classification.
![image.png](attachment:image.png)
[TDS Article](https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47)

What are some benefits and drawbacks of the model?
> **Benefits**:
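- Can produce good accuracy with comparatively little computational power on small and medium-sized datasets.

> **Drawbacks**:
- Natively a binary classifier; multiclass problems are handled by combining several binary SVMs (scikit-learn's `SVC` does this automatically with a one-vs-one scheme).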

Give us an example of how to use it in Python using the provided dataset
> **ANSWER BELOW**

In [37]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
svm = SVC()
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.61
