# Sprint Challenge 22 Review

1. *Begin with baselines for classification.* What is your baseline accuracy, if you guessed the majority class for every prediction?

2. *Hold out your test set.* (Time-based)

3. *Engineer new feature.* Engineer at least 1 new feature, from a provided list, or your own idea.

4. *Decide how to validate your model.* Choose one of the following options. Any of these options are good. You are not graded on which you choose.
   - *Train/validate/test split: time-based*
   - *Train/validate/test split: random 80/20%* train/validate split.
   - *Cross-validation* with independent test set. You may use any scikit-learn cross-validation method.

5. Use a scikit-learn *pipeline* to *encode categoricals* and fit a *Decision Tree* or *Random Forest* model.

6. Get your model’s *validation accuracy.* (Multiple times if you try multiple iterations.)

7. Get your model’s *test accuracy.* (One time, at the end.)

8. Given a *confusion matrix* for a hypothetical binary classification model, *calculate accuracy, precision, and recall.*
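
For step 1, the majority-class baseline can be sketched in a few lines. The label list below is hypothetical, made up only to illustrate the calculation.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the majority class and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)

# Hypothetical training labels: 6 of the 10 are "no".
y_train = ["no", "no", "yes", "no", "yes", "no", "yes", "no", "no", "yes"]
majority_class, baseline = majority_baseline_accuracy(y_train)
print(majority_class, baseline)
```

Any model worth keeping should beat this number on the validation set.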

***

# 221 - Decision Trees

### Key Points:
* Handling Outliers
* Pipelines
* Basic Decision Tree Classifier
* A leaf node that does not contain 100% of a single group (class) is considered 'impure'

### Types of Impurities:
Gini impurity: 
* 1 - (probability of 'yes')^2 - (probability of 'no')^2
* total Gini impurity is the weighted average of the leaf node impurities
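
A minimal sketch of the formula above for a binary yes/no split. The leaf counts are hypothetical, and the function names are chosen here for illustration.

```python
def gini_impurity(n_yes, n_no):
    """Gini impurity of one leaf: 1 - p(yes)^2 - p(no)^2."""
    total = n_yes + n_no
    p_yes = n_yes / total
    p_no = n_no / total
    return 1 - p_yes ** 2 - p_no ** 2

def total_gini(leaves):
    """Weighted average of leaf impurities; `leaves` is a list of (n_yes, n_no)."""
    n_samples = sum(n_yes + n_no for n_yes, n_no in leaves)
    return sum(
        (n_yes + n_no) / n_samples * gini_impurity(n_yes, n_no)
        for n_yes, n_no in leaves
    )

# Hypothetical split: two leaves with 10 samples each.
leaves = [(8, 2), (1, 9)]
print(gini_impurity(8, 2))  # impurity of the first leaf
print(total_gini(leaves))   # weighted impurity of the whole split
```

A pure leaf scores 0, and a 50/50 leaf scores 0.5, the maximum for two classes.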


### Handling Outliers

Common issues:
* zeroes listed in-place of missing values (NaNs)
* values that should be zero instead listed as very close to zero (2.8e-10)
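
Both issues can be handled with small cleaning helpers. This is a plain-Python sketch with made-up values; the helper names and the `tol` cutoff are assumptions, and whether a zero really means "missing" depends on the column.

```python
import math

def zeros_to_nan(values):
    """Replace exact zeros that stand in for missing data with NaN."""
    return [math.nan if v == 0 else v for v in values]

def snap_to_zero(values, tol=1e-8):
    """Replace values within `tol` of zero with exactly 0.0 (tol is an assumed cutoff)."""
    return [0.0 if abs(v) < tol else v for v in values]

# Hypothetical column where 0 was recorded in place of a missing value.
latitudes = [0, -3.5, 0.0, -8.9]
print(zeros_to_nan(latitudes))

# Hypothetical column where tiny values should have been zero.
amounts = [2.8e-10, 1.5, -3e-12, 20.0]
print(snap_to_zero(amounts))
```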

Pandas Profiling:
An open-source Python module for doing a quick exploratory data analysis in just a few lines of code. It also generates interactive reports in web format that can be presented to anyone, even people who don't know programming.

What does Pandas Profiling show?
* % of values that are unique
* % of values that are NaN 
* whether the data is skewed
* whether the data contains high-cardinality values

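```python
from pandas_profiling import ProfileReport

# `train` is the training DataFrame loaded earlier in the notebook.
# to_notebook_iframe() displays the report and returns None,
# so keep a reference to the ProfileReport itself.
profile = ProfileReport(train, minimal=True)
profile.to_notebook_iframe()
```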

### Pipelines

What is a Pipeline?
* A pipeline is a tool that chains multiple processing steps and a final model into a single estimator.
* You only have to call fit and predict once to fit and apply the whole sequence of estimators

Processes that pipeline replaces:
* Encoder (OneHotEncoder())
* Imputer (SimpleImputer())
* Scaler (StandardScaler())
* Model (fit)
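
As a minimal sketch, assuming scikit-learn is installed, the steps above collapse into one object. The toy data is hypothetical and has only categorical features, so this example chains just an encoder and a model; an imputer or scaler would be added as extra steps in the same way.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two categorical features per row.
X_train = [["red", "small"], ["red", "large"], ["blue", "small"], ["blue", "large"]]
y_train = [1, 1, 0, 0]

# Each step's output feeds the next; the last step is the model.
pipeline = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(max_depth=3, random_state=42),
)

# One fit call encodes the data and trains the tree.
pipeline.fit(X_train, y_train)

# One predict call encodes new rows with the fitted encoder, then predicts.
predictions = pipeline.predict([["red", "large"], ["blue", "small"]])
train_accuracy = pipeline.score(X_train, y_train)
print(predictions, train_accuracy)
```

Because the encoder is fitted only inside `fit`, the same pipeline can be reused on validation and test data without leaking information from them.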

### Basic Decision Tree Classifier

Decision Tree components
* Nodes
  - Test for the value of a certain attribute
* Edges/Branches
  - Correspond to the outcome of a test and connect to the next node or leaf
* Leaf Nodes
  - Terminal nodes that predict the outcome (represent class labels or class distribution)

Types of Decision Trees
* Classification Trees (categorical target variable)
* Regression Trees (continuous target variable)

Decision Tree Classifier
* Using the decision algorithm, we start at the tree root and split the data on the feature that results in the largest ***information gain (IG)*** (reduction in uncertainty towards the final decision).
* In an iterative process, we can then repeat this splitting procedure at each child node ***until the leaves are pure***. This means that the samples at each leaf node all belong to the same class.
* In practice, we may set a ***limit on the depth of the tree to prevent overfitting***. We compromise on purity here somewhat as the final leaves may still have some impurity.
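
The split-selection step can be sketched as follows. This sketch assumes Gini impurity as the criterion (entropy is the other common choice) and uses a tiny hypothetical dataset with categorical features; the function names are illustrative, not from any library.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted_children = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted_children

def best_feature(rows, labels):
    """Return (feature index, gain) for the feature with the largest gain."""
    best = (None, 0.0)
    for j in range(len(rows[0])):
        # Group the labels by the value each row has for feature j.
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[j], []).append(label)
        gain = information_gain(labels, list(groups.values()))
        if gain > best[1]:
            best = (j, gain)
    return best

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"), ("rainy", "mild")]
labels = ["yes", "yes", "no", "no"]
print(best_feature(rows, labels))
```

A full tree would repeat `best_feature` on each child node until the leaves are pure or a depth limit is reached.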

Advantages of Classification with Decision Trees
* Inexpensive to construct
* Extremely fast at classifying unknown records
* Easy to interpret for small-sized trees 
* Accuracy comparable to other classification techniques for many simple data sets
* Excludes unimportant features

Disadvantages of Classification with Decision Trees
* Easy to overfit
* Decision boundary restricted to being parallel to attribute axes
* Decision tree models are often biased toward splits on features having a large number of levels
* Small changes in the training data can result in large changes to decision logic
* Large trees can be difficult to interpret and the decisions they make may seem counterintuitive

Applications of Decision Trees in real life
* Biomedical Engineering (decision trees for identifying features to be used in implantable devices)
* Financial analysis (Customer Satisfaction with a product or service)
* Astronomy (classify galaxies)
* System Control
* Manufacturing and Production (Quality control, Semiconductor Manufacturing, etc)
* Medicine (diagnosis, cardiology, psychiatry)
* Physics (particle detection)

Complexity vs. Consistency
A one-split decision tree (a decision stump) is not very complex, so its predictions are consistent, but it is often consistently wrong because it underfits.

***

# 222 - Random Forests

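* Ordinal encoding works with random forests
* Ordinal encoding is more flexible than one-hot encoding for tree-based models, since it handles high-cardinality features without adding columns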

### How does it work? Explain to a recruiter

### Quiz notes
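* Logistic regression is a classification technique (despite the name)
* Increasing the maximum tree depth increases the chance of overfitting. Adding estimators also increases model complexity, though in a random forest more trees usually do not cause overfitting by themselves.
* Information gain: the difference between the criterion before the split (parent) and the criterion after the split (children)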
* Which of the following are evaluation metrics we might consider for a logistic regression? ROC/AUC (area under the curve), Precision, True Positive Rate, Sensitivity, Specificity
* Key Points for describing logistic regression and sigmoid
   * The sigmoid output is between 0 and 1
   * How far the output is from 0.5 (how much it has to be rounded) indicates its confidence in yes or no
   * [Paper on different types of prediction](https://arxiv.org/pdf/1708.05070.pdf)
   * Look for additional article link in Slack from 4/15
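
The sigmoid points above can be illustrated with a short sketch. The 0.5 threshold and the helper name `predict_label` are assumptions for illustration; `z` stands for the linear score that logistic regression passes through the sigmoid.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Turn a linear score into a yes/no label plus a confidence."""
    p_yes = sigmoid(z)
    label = "yes" if p_yes >= threshold else "no"
    # Confidence is the probability assigned to the predicted label.
    confidence = p_yes if label == "yes" else 1 - p_yes
    return label, confidence

for z in (-2.0, 0.1, 2.0):
    print(z, predict_label(z))
```

Scores near 0 give probabilities near 0.5 (low confidence), while large positive or negative scores give probabilities near 1 or 0.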

***

# 223 - Cross Validation

### What is it?

### When do you use it?

***

# 224 - Classification Metrics
* get and interpret the confusion matrix for classification models
* use classification metrics: precision, recall
* understand the relationships between precision, recall, thresholds, and predicted probabilities, to help make decisions and allocate budgets
* Get ROC AUC (Receiver Operating Characteristic, Area Under the Curve) (Keri's favorite - used commonly in the medical field)
* false negative rate - type 2 error
* false positive rate - type 1 error

### Classification Report: Precision and Recall

[Scikit-Learn User Guide — Classification Report](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report)

You will see something called F1 score alongside precision and recall. It is the harmonic mean of the two, but Keri said it's not useful. The report also lists overall accuracy next to precision, recall and F1 score.

### Precision:

* true_positives / (true_positives + false_positives)
* How many selected items are relevant?

### Recall (also called Sensitivity, the more statistics-focused term):

* true_positives / (true_positives + false_negatives)
* How many relevant items are selected?
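
The formulas above, plus accuracy, can be computed directly from the four cells of a binary confusion matrix. The counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of the predicted positives, how many are right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    return accuracy, precision, recall

# Hypothetical confusion matrix with 100 predictions in total.
accuracy, precision, recall = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(accuracy, precision, recall)
```

Here accuracy is 70/100, precision is 40/50, and recall is 40/60, which shows that a model can have decent precision while still missing a third of the actual positives.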

### ROC AUC (Receiver Operating Characteristic, Area Under the Curve)

* higher area under the curve is better, want to aim for a 'perfect square' to maximize area
* a random guess gives an AUC of 0.5 (the 'curve' is the line y=x)

### ROC AUC Score

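* The area under the ROC curve, on a scale from 0 to 1: 1.0 is the 'perfect square', 0.5 is random guessing
* Equivalently, the probability that a randomly chosen positive example gets a higher predicted score than a randomly chosen negative one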
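
One way to see what the score measures is to compute it by comparing every positive example with every negative one. This is a plain-Python sketch of that ranking view, not scikit-learn's implementation; the labels and scores are hypothetical.

```python
def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. Assumes labels are 0/1 and both classes appear.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 3 of the 4 pairs are ranked correctly
```

A model that gives every example the same score ties on every pair and lands at 0.5, matching the random-guess line.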