# Classification Cheat Sheet

## General Notes

* 4 Steps: Preprocess, Train, Validate, Predict
* `.ravel()` returns a contiguous, flattened 1D array
* Day 1, Activity 6 has many details we haven't explored much
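
As a quick illustration of the `.ravel()` note, here is a minimal sketch using a made-up label array (the values are hypothetical). It shows a column vector being flattened into a 1D array, which is the shape most scikit-learn estimators expect for `y`.

```python
import numpy as np

# A column vector of labels with shape (3, 1), as produced by .reshape(-1, 1)
column = np.array([[0], [1], [1]])

# .ravel() flattens it to shape (3,)
flat = column.ravel()

print(column.shape)
print(flat.shape)
```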

## Logistic Regression

* Logistic Regression is a statistical method for predicting binary outcomes from data.
* Examples of this are "yes" vs "no" or "high credit risk" vs "low credit risk".
* These categories translate to the probability of an observation being a 0 or a 1.
* We can build a logistic regression model by applying an activation function (the sigmoid) as the final step of a linear model.
* This converts the linear output to a probability between 0 and 1.
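
The activation step can be sketched with plain NumPy. This is only an illustration, assuming the standard sigmoid function and some made-up linear outputs; `sigmoid`, `linear_output`, and the 0.5 threshold are choices made for this example, not something defined elsewhere in the notebook.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued linear output z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear-model outputs (w . x + b) for three samples
linear_output = np.array([-2.0, 0.0, 2.0])
probabilities = sigmoid(linear_output)

# Threshold at 0.5 to turn probabilities into class labels 0 or 1
labels = (probabilities >= 0.5).astype(int)

print(probabilities)
print(labels)
```

A linear output of 0 maps to a probability of exactly 0.5, which is why 0.5 is the usual decision boundary.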

Steps:
1. Generate/Import Data
2. Split Data (X_train, X_test, y_train, y_test)
3. Create Model
4. Fit/Train Model
5. Validate Model
6. Make Predictions
7. Confusion Matrix
8. Classification Report

In [None]:
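# Step 1 - imports and data
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_blobs #to generate random, clustered data

# Generate two clusters of sample data (these make_blobs settings are illustrative)
X, y = make_blobs(centers=2, random_state=42)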

In [None]:
# Step 2 - split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    random_state=1, 
                                                    stratify=y)

In [None]:
# Step 3 - create model
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver='lbfgs', random_state=1)
classifier

In [None]:
# Step 4 - train the model on the training data
classifier.fit(X_train, y_train)

In [None]:
# Step 5 - score/validate the model
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

In [None]:
# Step 6 - make predictions
predictions = classifier.predict(X_test)
pd.DataFrame({"Prediction": predictions, "Actual": y_test}) #.reset_index(drop=True) #Might need to reset index
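
Because logistic regression produces probabilities before it produces labels, it can help to look at both. The sketch below is self-contained and uses synthetic `make_blobs` data; the `_demo` variable names and the sample size are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data (illustrative settings)
X_demo, y_demo = make_blobs(n_samples=200, centers=2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

clf = LogisticRegression(solver="lbfgs", random_state=1).fit(X_tr, y_tr)

# One row per sample, one column per class; each row sums to 1
proba = clf.predict_proba(X_te)

# Hard 0/1 labels, which correspond to the more probable class
hard = clf.predict(X_te)

print(proba[:5])
print(hard[:5])
```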

In [None]:
# Step 7 - confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)
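
To read the matrix, it may help to unpack it into its four cells. This is a small sketch on hand-made labels (the `y_true` and `y_pred` lists are hypothetical). In scikit-learn, rows are actual classes and columns are predicted classes, so for a binary problem `.ravel()` yields true negatives, false positives, false negatives, and true positives, in that order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)

# Flatten the 2x2 matrix into its four cells
tn, fp, fn, tp = cm.ravel()

# Precision and recall for class 1, computed by hand
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(cm)
print(precision, recall)
```

These are the same precision and recall values that the classification report prints for class 1.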

In [None]:
# Step 8 - classification report
from sklearn.metrics import classification_report
target_names = ["Group 1", "Group 2"]
print(classification_report(y_test, predictions, target_names=target_names))

## Support Vector Machines

* Tries to find the “best” margin (the distance between the separating line and the support vectors) between the classes, which reduces the risk of error on the data
* In other words, SVM tries to optimally slice the data
* SVM is generally considered less prone to overfitting than logistic regression, because maximizing the margin acts as a form of regularization

Steps:
* Same as LR with a different model in Step 3

In [None]:
# Code for a linear SVM model
from sklearn.svm import SVC
classifier = SVC(kernel='linear')
classifier
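
For a runnable picture of the whole flow, here is a minimal sketch that trains the linear SVM on synthetic data and inspects the support vectors that define the margin. The data and the `_demo` and `svm_clf` names are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data (illustrative settings)
X_demo, y_demo = make_blobs(n_samples=200, centers=2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

svm_clf = SVC(kernel="linear").fit(X_tr, y_tr)

# The training points that lie on or inside the margin
support_vectors = svm_clf.support_vectors_

# Accuracy on the held-out test data
test_score = svm_clf.score(X_te, y_te)

print(support_vectors.shape)
print(test_score)
```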

## Decision Trees

* A graphical representation of possible solutions to a decision based on certain conditions; each tree encodes a series of true/false questions
* Tree-based algorithms can map non-linear relationships in data
* In effect, a hierarchy of if/else statements
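
The "if/else" view can be made concrete with `tree.export_text`, which prints a fitted tree as nested rules. This sketch uses synthetic data, a shallow tree, and placeholder feature names, all chosen for illustration.

```python
from sklearn import tree
from sklearn.datasets import make_blobs

# Synthetic two-class data with two features (illustrative settings)
X_demo, y_demo = make_blobs(n_samples=100, centers=2, random_state=42)

# A shallow tree keeps the printed rules short
demo_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=1)
demo_tree = demo_tree.fit(X_demo, y_demo)

# Each indented line is one true/false question or a leaf with its class
rules = tree.export_text(demo_tree, feature_names=["feature_0", "feature_1"])
print(rules)
```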

Steps:
1. Generate/Import Data
2. Preprocess Data (drop, reshape, scale, transform etc.)
3. Fit the Model
4. Make Predictions
5. Model Evaluation
6. Visualize

In [None]:
# Initial imports
import pandas as pd
from pathlib import Path
from sklearn import tree
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Needed for decision tree visualization
import pydotplus
from IPython.display import Image

In [None]:
# Step 2 - various preprocess code

# Define features set
X = df.copy()
X.drop("Column", axis=1, inplace=True)

# Define target vector
y = df["Column"].values  # 1D array, the shape sklearn expects for y

# Splitting into Train and Test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

# Create the StandardScaler instance
scaler = StandardScaler()

# Fit the Standard Scaler with the training data
X_scaler = scaler.fit(X_train)

# Scale the training and testing data using the scaler fitted on the training data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
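
The reason for fitting the scaler on the training data only is that the test set should be transformed with statistics it did not contribute to. The sketch below uses random numbers in place of a real `df` (the arrays and `demo_` names are hypothetical) to show that the scaled training data ends up with mean 0 and standard deviation 1, while the test data is only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices standing in for X_train and X_test
rng = np.random.default_rng(0)
demo_train = rng.normal(loc=50, scale=10, size=(80, 2))
demo_test = rng.normal(loc=50, scale=10, size=(20, 2))

# Learn the mean and standard deviation from the training data only
demo_scaler = StandardScaler().fit(demo_train)

# Apply the same transformation to both sets
demo_train_scaled = demo_scaler.transform(demo_train)
demo_test_scaled = demo_scaler.transform(demo_test)

print(demo_train_scaled.mean(axis=0), demo_train_scaled.std(axis=0))
print(demo_test_scaled.mean(axis=0), demo_test_scaled.std(axis=0))
```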

In [None]:
# Step 3 - fit
model = tree.DecisionTreeClassifier()

# Fit the model
model = model.fit(X_train_scaled, y_train)

In [None]:
# Step 4 - predict
predictions = model.predict(X_test_scaled)

In [None]:
# Step 5 - evaluate

# Calculating the confusion matrix
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(
    cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"]
)

# Calculating the accuracy score
acc_score = accuracy_score(y_test, predictions)

# Displaying results
print("Confusion Matrix")
display(cm_df)
print(f"Accuracy Score : {acc_score}")
print("Classification Report")
print(classification_report(y_test, predictions))

In [None]:
# Step 6 - visualize

# Create DOT data
dot_data = tree.export_graphviz(
    model, out_file=None, feature_names=X.columns, class_names=["0", "1"], filled=True
)

# Draw graph
graph = pydotplus.graph_from_dot_data(dot_data)

# Show graph
Image(graph.create_png())

## Random Forest

* Each tree is simpler because it is built from a random subset of the data
* When many weak classifiers are combined, the result can be a model that is robust against overfitting
* Random forests run efficiently on large datasets
* Feature importance describes which features contributed most to the predictions

Steps:
* Same as Decision Tree with different model and no visualization
* Feature Importance instead of visualization

In [None]:
# Initial imports
import pandas as pd
from pathlib import Path
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Needed for decision tree visualization
import pydotplus
from IPython.display import Image

In [None]:
# Code for random forest model

# Create a random forest classifier
rf_model = RandomForestClassifier(n_estimators=500, random_state=78)

# Fitting the model
rf_model = rf_model.fit(X_train_scaled, y_train)

In [None]:
# Random Forests in sklearn will automatically calculate feature importance
importances = rf_model.feature_importances_

# We can sort the features by their importance
sorted(zip(importances, X.columns), reverse=True)
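
As a self-contained variant of the feature importance step, the sketch below fits a small forest on synthetic data and ranks the features with a pandas Series. The dataset, the feature names `f0` to `f3`, and the smaller `n_estimators` value are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 4 features, of which 2 are informative (illustrative settings)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=78
)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical column names

forest = RandomForestClassifier(n_estimators=100, random_state=78)
forest = forest.fit(X_demo, y_demo)

# One importance value per feature; the values sum to 1
importance_series = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importance_series)
```

A labeled Series is often easier to read or plot than a list of tuples, although the sorted `zip` approach gives the same ranking.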