## Chapter 3 Classification

## Leaky features

Leakage is also termed data leakage.

Leaky features are variables that contain information about the future or target. There’s nothing bad in having data about the target, and we often have that data during model creation time. However, if those variables are not available when we perform a prediction on a new sample, we should remove them from the model as they are leaking data from the future.

<b>Leakage (machine learning)</b>

In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment.


<b>Data Leakage Examples</b>

1. Giveaway features: Giveaway features are the features that expose information about the target variable and would not be available after the model is deployed.

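2. Leakage during preprocessing: This happens when preprocessing steps (such as scaling or imputation) are fit on the train and test sets together, so that statistics from the test set influence what the model sees during training.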
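
The preprocessing case can be sketched with NumPy. The data here is synthetic and the 80/20 split is an assumption made for illustration. The point is only the order of operations: split first, then compute the scaling statistics from the training rows alone and reuse them on the test rows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical single-feature data set: 100 samples, 1 column.
X = rng.normal(loc=50.0, scale=10.0, size=(100, 1))

# Split first, before any statistic is computed.
X_train, X_test = X[:80], X[80:]

# Leak-free: mean and standard deviation come from the training rows only.
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std

# Leaky variant (avoid): the mean is computed on train and test together,
# so the test rows influence how the training rows are transformed.
leaky_mean = X.mean(axis=0)
```

The scaled test set will generally not have a mean of exactly zero, and that is expected: it was transformed with statistics it did not contribute to.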

<b>How to Detect and Avoid Data Leakage</b>

As a general rule, if the model seems too good to be true, we should get suspicious. The model might be memorizing feature-target relations instead of learning and generalizing.
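
One rough way to act on that suspicion is to look for features that track the target almost perfectly. The sketch below uses pandas on a made-up frame. The column names (`refund_issued`, `churned`) and the 0.95 threshold are assumptions for illustration, not a standard rule, and a high correlation is only a prompt to check when the feature becomes available.

```python
import pandas as pd

# Hypothetical data: "refund_issued" is only recorded after the outcome is known.
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 29, 63, 47, 33],
    "refund_issued": [0, 1, 1, 0, 0, 1, 1, 0],
    "churned": [0, 1, 1, 0, 0, 1, 1, 0],
})

target = "churned"

# Absolute correlation of every feature with the target.
correlations = df.corr()[target].drop(target).abs()

# Flag features that are suspiciously close to the target (arbitrary threshold).
suspects = correlations[correlations > 0.95].index.tolist()
print(suspects)
```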


[Data Science Project Suggestion](https://drivendata.github.io/cookiecutter-data-science/)

### Classification Problem Flow

1. Gather Data
2. Clean Data
3. Create Features
4. Sample Data (split train and test data)
5. Impute Data
6. Normalize Data / (Refactor code)
7. Baseline Model
8. Build Classifier
9. Stack
10. Evaluate Model
11. Optimize Model (Hyper-parameter tuning)
12. Deploy Model
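
Steps 4 through 8 and step 10 of the flow above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the book's code. The choice of median imputation, standard scaling, and logistic regression is an assumption made to keep the example short.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for gathered, cleaned data with two features.
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
# Knock out roughly 10% of the first feature so the imputer has work to do.
X[rng.random(400) < 0.1, 0] = np.nan

# Step 4: split before imputing or normalizing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Step 7: baseline that always predicts the most frequent class.
baseline = make_pipeline(
    SimpleImputer(strategy="median"),
    DummyClassifier(strategy="most_frequent"),
)
baseline.fit(X_train, y_train)

# Steps 5, 6, 8: impute, normalize, and classify inside one pipeline, so the
# imputer and scaler are fit on the training split only.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)
model.fit(X_train, y_train)

# Step 10: evaluate both on the held-out split.
baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)
print(f"baseline={baseline_acc:.2f} model={model_acc:.2f}")
```

Keeping the preprocessing inside the pipeline is one way to avoid the preprocessing leakage described earlier, because `fit` only ever sees the training split.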

## Chapter 4 Missing Data

To visualize patterns in the missing data, use the missingno library. This library is useful for viewing contiguous areas of missing data, which would indicate that the missing data is not random (see Figure 4-1). The matrix function includes a sparkline along the right side. Patterns here would also indicate non-random missing data. You may need to limit the number of samples to be able to see the patterns.

A <b>dendrogram</b> can also show missing data; it is built by clustering columns according to where data is missing. Leaves that are at the same level predict one another’s presence (empty or filled). The vertical arms indicate how different the clusters are. Short arms mean that the branches are similar.
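
As a rough numeric companion to these plots, plain pandas can report how much is missing per column and how strongly the missingness of two columns moves together. This is a sketch on a made-up frame and is not how missingno draws its figures. The column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "cabin" and "deck" are always missing together.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "cabin": ["C85", None, None, "C123", None, "E46"],
    "deck": ["C", None, None, "C", None, "E"],
})

# Fraction of missing values per column.
missing_fraction = df.isnull().mean()

# Correlation between the missingness indicators of each pair of columns.
# Values near 1 mean the two columns tend to be missing in the same rows.
nullity_corr = df.isnull().astype(int).corr()

print(missing_fraction)
print(nullity_corr)
```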

## Chapter 5 Cleaning Data

## Important note on pandas behaviour

Up to pandas 0.23, if the type is int64, we are guaranteed that there are no missing values. If the type is float64, the values might be all floats, but also could be integer-like numbers with missing values. The pandas library converts integer values that have missing numbers to floats, as this type supports missing values. The object type typically means string types (or a mix of string and numeric values).


As of pandas 0.24, there is a new Int64 type (notice the capitalization). This is not the default integer type, but you can coerce to this type and have support for missing numbers.
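
A small sketch of the difference, assuming pandas 0.24 or later (the series values are made up):

```python
import pandas as pd

# Integers with a missing value fall back to float64 by default.
default = pd.Series([1, 2, None])
print(default.dtype)

# Coercing to the nullable "Int64" dtype keeps integer values
# alongside a missing-value marker.
nullable = default.astype("Int64")
print(nullable.dtype)
```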

## Chapter 6 Exploring

A pandas DataFrame has an iloc attribute that we can do index operations on. It will let us pick out rows and columns by index location. We pass in the row positions as a scalar, list, or slice, and then we can add a comma and pass in the column positions as a scalar, list, or slice.
Here we pull out the second and fifth row, and the last three columns:
<pre><code>X.iloc[[1, 4], -3:]
     sex_male  embarked_Q  embarked_S
677       1.0           0           1
864       0.0           0           1
</code></pre>
There is also a .loc attribute, and we can pull out rows and columns based on name (rather than position). Here is the same portion of the DataFrame:
<pre><code>X.loc[[677, 864], "sex_male":]
     sex_male  embarked_Q  embarked_S
677       1.0           0           1
864       0.0           0           1
</code></pre>
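
Since `X` is not defined in these notes, here is a self-contained sketch with a small hypothetical frame that stands in for it. Only the two rows labeled 677 and 864 mirror the output above; the other rows, the `age` column, and the remaining index labels are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for X; the index labels are deliberately not 0..n-1.
X = pd.DataFrame(
    {
        "age": [29.0, 26.0, 40.0, 31.0, 28.0],
        "sex_male": [0.0, 1.0, 1.0, 0.0, 0.0],
        "embarked_Q": [0, 0, 1, 0, 0],
        "embarked_S": [1, 1, 0, 1, 1],
    },
    index=[1214, 677, 534, 1174, 864],
)

# Second and fifth rows by position, last three columns by position.
by_position = X.iloc[[1, 4], -3:]

# The same rows and columns selected by label.
by_label = X.loc[[677, 864], "sex_male":]

print(by_position.equals(by_label))
```

Note that label slices with `.loc` include the end label, while position slices with `.iloc` exclude the end position.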


###  Joint Plot
Yellowbrick has a fancier scatter plot, called a joint plot, that includes histograms on the edges as well as a regression line.

### Pair Grid
Using seaborn

Kernel Density Estimations:
- https://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/AV0405/MISHRA/kde.html
- https://www.youtube.com/watch?v=x5zLaWT5KPs

### Pandas chaining sample


### Correlation/Covariance correlation graph (yellowbrick and seaborn)

### Heatmap

### RadViz (yellowbrick; pandas also implements a RadViz plot)

A RadViz plot shows each sample inside a circle, with the features placed on the circumference. The values are normalized, and you can imagine that each feature has a spring that pulls samples toward it based on the value.
This is one technique to visualize separability between the targets.
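
The spring analogy can be made concrete with a few lines of NumPy. This is a sketch of the usual RadViz projection under stated assumptions (min-max normalization, anchors evenly spaced on the unit circle). The function name `radviz_coordinates` is hypothetical and this is not taken from yellowbrick's or pandas' source.

```python
import numpy as np

def radviz_coordinates(X):
    """Project samples into the unit circle, one anchor per feature."""
    X = np.asarray(X, dtype=float)

    # Min-max normalize each feature to [0, 1]; these act as spring strengths.
    mins = X.min(axis=0)
    span = X.max(axis=0) - mins
    span[span == 0] = 1.0  # constant columns: avoid division by zero
    weights = (X - mins) / span

    # Place one anchor per feature, evenly spaced on the circumference.
    n_features = X.shape[1]
    angles = 2 * np.pi * np.arange(n_features) / n_features
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])

    # Each sample rests at the weighted average of the anchors.
    totals = weights.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # all-zero rows stay at the center
    return weights @ anchors / totals

samples = np.array([
    [1.0, 0.0, 0.0],  # pulled only by the first feature
    [0.0, 1.0, 0.0],  # pulled only by the second feature
    [1.0, 1.0, 1.0],  # pulled equally by all three
    [0.0, 0.0, 0.0],  # pulled by none
])
points = radviz_coordinates(samples)
print(points.round(3))
```

A sample dominated by one feature lands near that feature's anchor, while a sample with balanced values sits near the center, which is why well-separated classes tend to form distinct clouds.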



### Parallel Coordinates (yellowbrick; pandas also implements a parallel coordinates plot)

For multivariate data, you can use a parallel coordinates plot to see clustering visually.

## Chapter 7 Preprocess Data