<a href="https://colab.research.google.com/github/mahynski/chemometric-carpentry/blob/main/notebooks/2_Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
❓ ***Objective***: This notebook will introduce the basic techniques used when creating and optimizing predictive models.  Some steps will be reviewed in much more detail later.

🔁 ***Remember***: You can always revisit this notebook for reference again in the future.  Ideas and best practices will be reinforced in future notebooks, so don't worry about remembering everything the first time you see something new.

🧑 Author: Nathan A. Mahynski

📆 Date: May 8, 2024

---

# Exploratory Data Analysis (EDA)

👆 [EDA](https://en.wikipedia.org/wiki/Exploratory_data_analysis) is the first step in the modeling process.  

The basic purpose of EDA is to start to "play" with your data.  It is usually a very visual process which is meant to generate hypotheses and alert the modeler to (unexpected) trends. The term ["statistical graphics"](https://en.wikipedia.org/wiki/Statistical_graphics) is often used interchangeably with EDA, [though the two are different](https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm). The term EDA is attributed to John Tukey, who wrote the book entitled ["Exploratory Data Analysis"](https://en.wikipedia.org/wiki/Exploratory_data_analysis#cite_note-Tukey1977-6) in 1977.  

<img src="https://upload.wikimedia.org/wikipedia/commons/b/ba/Data_visualization_process_v1.png" align="right" height=300 />

👉 EDA is a philosophy 🤔 not a set of techniques! Nonetheless, EDA is usually done by making lots of different plots of the data to help the human observing the data to recognize patterns.

[NIST has some great resources on EDA.](https://www.itl.nist.gov/div898/handbook/eda/eda.htm)  Here is a direct quote:

> "Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis  that employs a variety of techniques (mostly graphical) to:
> * maximize insight into a data set
> * uncover underlying structure
> * extract important variables
> * detect outliers and anomalies
> * test underlying assumptions
> * develop parsimonious models and
> * determine optimal factor settings."

EDA should not be confused with ["initial data analysis" (IDA)](https://en.wikipedia.org/wiki/Data_analysis#Initial_data_analysis), a narrower term that covers data cleaning (handling missing values, data transformations, etc.); in fact, IDA is part of EDA.


🤔 Remember that EDA is basically just the process of plotting your data in lots of different ways to gain insight.  There is not necessarily a "right" or "wrong" way to do it, but some approaches are more helpful than others.  [ITL@NIST](https://www.itl.nist.gov/div898/handbook/eda/eda.htm) has some good examples of useful plots to make, and the [seaborn](https://seaborn.pydata.org/) Python package is very helpful with this.
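
As a rough illustration of "plotting your data in lots of different ways", here is a minimal sketch using pandas and the iris dataset bundled with scikit-learn.  The dataset and the particular summaries are assumptions chosen for illustration; they are not taken from the PyChemAuth examples.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load a small example dataset as a DataFrame (four features plus a "target" column).
iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# Numerical summaries: ranges, per-class means, and feature correlations.
summary = df.describe()
class_means = df.groupby("species").mean(numeric_only=True)
correlations = df[iris.feature_names].corr()
print(class_means)

# A grid of pairwise scatter plots (histograms on the diagonal), colored by class.
axes = pd.plotting.scatter_matrix(
    df[iris.feature_names], c=df["target"], figsize=(8, 8)
)
```

seaborn's `pairplot` (for example, `sns.pairplot(df, hue="species")`) produces a similar grid with less manual setup.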

---
> ❗ Check out [seaborn's gallery](https://seaborn.pydata.org/examples/index.html) - you can easily see different types of plots possible and make them very simply with your own data!
---

For examples of EDA in Python 🐍 with PyChemAuth, please refer to [this notebook](https://pychemauth.readthedocs.io/en/latest/jupyter/api/eda.html).  


# Pipelines

<img src="https://pychemauth.readthedocs.io/en/latest/_images/pipeline.png" height=400 align="right"/>

In scikit-learn, [pipelines](https://scikit-learn.org/stable/modules/compose.html) 🔩 are composite estimators.  They are composed of a series of steps, illustrated at the right.  Each step in the pipeline is a ["transformer"](https://scikit-learn.org/stable/glossary.html#term-transformer) which is a class that implements `fit` and `transform` methods.  

* 🚆 During training, both are called in sequence, where the former "learns" any parameters needed (such as the mean or standard deviation of the data it sees).  
* 🥼 During testing, these parameters are fixed and simply used to transform the data.  
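
The training/testing distinction above can be sketched with a tiny custom transformer.  `MeanCenterer` is a hypothetical class written for this illustration (scikit-learn's `StandardScaler` offers this and more); the point is only that `fit` learns the means while `transform` re-uses them.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that subtracts the column means learned in fit."""

    def fit(self, X, y=None):
        # "Learn" the parameters from the training data only.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Apply the stored parameters; nothing is re-learned here.
        return np.asarray(X, dtype=float) - self.mean_


X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_test = np.array([[2.0, 25.0]])

centerer = MeanCenterer()
X_train_centered = centerer.fit(X_train).transform(X_train)  # training: fit, then transform
X_test_centered = centerer.transform(X_test)                 # testing: transform only

print(centerer.mean_)    # column means of the training data: 2 and 20
print(X_test_centered)   # test row minus the training means: 0 and 5
```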

Each transformer is responsible for accepting the data from the last step, transforming it, then handing it to the next step.  This continues in sequence until we reach the last step.  The last step can be a transformer too, but for our purposes we put a predictive model there (either regression or classification model).  The model should implement `fit` and `predict` methods.  The figure at the right shows how data flows through the pipeline at different stages.

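👉 When a pipeline is `fit` it sends data through the blue path on the right, going through the `fit` and `transform` members of each pre-processing step in turn and ending with the `fit` member of the final step.  When `predict` is called on the fitted pipeline, the data passes through the `transform` member of each pre-processing step and then through the final step's `predict`.  The pipeline is scored using the `score` member of the model.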
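
A minimal sketch of this flow with scikit-learn's `Pipeline` is shown below.  The dataset (iris), the scaler, and the k-nearest-neighbors model are assumptions chosen for illustration; any transformer and model with the same interface could take their place.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Pre-processing step(s) first, predictive model last.
pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier(n_neighbors=3)),
])

pipeline.fit(X_train, y_train)             # scaler is fit and applied, then the model is fit
predictions = pipeline.predict(X_test)     # scaler transforms only, then the model predicts
accuracy = pipeline.score(X_test, y_test)  # delegates to the model's score (accuracy here)
print(accuracy)
```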

We will refer to all the steps leading up to the model (the last step) as the "pre-processing" steps.  Pre-processing will be covered in detail later; for now, it is enough to know where it sits in the pipeline.  There are also many different models, which will also be covered in detail later on.  However, before we get there, it is important to get a 🐦 "bird's eye view" of the main structure we will be working with.

❓ Q: Why is this the main structure we use for modeling?

🙋 A: Because this creates an end-to-end process that can accept the data, do all the cleaning and processing, model the data, and make a prediction in one structure.  That makes it simple to optimize and reproduce!

Because sklearn's estimator API is so widespread, when new models or preprocessing steps are developed by the ML or related communities, they are usually published in a compatible format.  This means that when new developments occur, we can simply "drop them in" the appropriate slot in the pipeline above and try them out!  This enables:
* side-by-side comparison of different approaches
* development of best practices
* "future-proofing" your work
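
The "drop in" idea can be sketched by swapping the final step of one pipeline and scoring each variant on the same held-out data.  The three candidate models and the iris dataset are assumptions chosen for illustration, and a single train/test split is used only to keep the sketch short.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Candidate models that all follow the same fit/predict/score interface.
candidates = {
    "nearest neighbors": KNeighborsClassifier(n_neighbors=3),
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

pipeline = Pipeline([("scaler", StandardScaler()), ("model", candidates["nearest neighbors"])])

scores = {}
for name, model in candidates.items():
    pipeline.set_params(model=model)  # replace only the final step
    scores[name] = pipeline.fit(X_train, y_train).score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```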







For examples of using pipelines in Python 🐍 with PyChemAuth, please refer to [this notebook](https://pychemauth.readthedocs.io/en/latest/jupyter/api/pipelines.html).

# Evaluation metrics

By [default](https://scikit-learn.org/stable/modules/model_evaluation.html) a pipeline uses the `score` function of the last step in the pipeline.  Recall our simple example:

```python
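import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.utils.multiclass import unique_labels
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y


class MyClassifier(BaseEstimator, ClassifierMixin):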
    """
    A simple nearest neighbor classifier.
    """
    def __init__(self, demo_param='demo'):
        self.demo_param = demo_param

    def fit(self, X, y):
        # Check that X and y have correct shape
        X, y = check_X_y(X, y)

        # Store the classes seen during fit
        self.classes_ = unique_labels(y)
        self.X_ = X
        self.y_ = y

        # Return the classifier
        return self

    def predict(self, X):
        # Check if fit has been called
        check_is_fitted(self)

        # Input validation
        X = check_array(X)

        closest = np.argmin(euclidean_distances(X, self.X_), axis=1)

        return self.y_[closest]

    def score(self, X, y):
        check_is_fitted(self)
        X, y = check_X_y(X, y)

        predictions = self.predict(X)
        accuracy = np.sum(predictions == y) / len(y)

        return accuracy
```

sklearn allows you to [define custom scoring functions](https://scikit-learn.org/stable/modules/model_evaluation.html), or make use of many built-in alternatives. You will not need to specify the `score` function for any examples in this course, or for models implemented in PyChemAuth; however, if you wish to see how a model (or pipeline which terminates with a given model) is being scored you can investigate the `score` member of the model used.

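In a Jupyter or IPython session, the `?` prefix displays the documentation (in plain Python, `help(model.score)` does the same):

```
?model.score
```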
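
For completeness, here is a minimal sketch of a custom scorer built with scikit-learn's `make_scorer`.  The choice of balanced accuracy, the k-nearest-neighbors model, and the iris dataset are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import balanced_accuracy_score, make_scorer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Default: the model's own score method (accuracy for scikit-learn classifiers).
default_score = model.score(X_test, y_test)

# Custom: wrap any metric(y_true, y_pred) function into a scorer(estimator, X, y).
scorer = make_scorer(balanced_accuracy_score)
custom_score = scorer(model, X_test, y_test)

print(default_score, custom_score)
```

A scorer like this can be passed as the `scoring` argument of scikit-learn's model-selection tools, such as `cross_val_score`.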

# Cross-Validation

<img src="https://scikit-learn.org/stable/_images/grid_search_workflow.png" align="right" height=300 />

[Cross-validation (CV)](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) is used to estimate the generalization error of a model, that is, how well it performs on unseen data. To be accurate, we must avoid testing on data used during model training since the model will be biased to fit that data. Any estimate of model performance using training data will be overly optimistic; however, it is not always clear by how much, and the scarcity of data often drives one to make assumptions that allow this data to be re-used. Various methods of CV exist to balance the amount of effort expended and assumptions made to obtain this estimate.
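
To see why scores computed on training data are optimistic, consider this minimal sketch.  It assumes the iris dataset and a 1-nearest-neighbor classifier, for which every training point is its own nearest neighbor.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

# Scoring on the same data used for training: the model has "memorized" it.
training_accuracy = model.fit(X, y).score(X, y)

# 5-fold CV: each score is computed on a fold withheld from training.
cv_scores = cross_val_score(model, X, y, cv=5)

print(f"training accuracy:        {training_accuracy:.3f}")
print(f"cross-validated accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

The gap between the two numbers is a rough indication of how misleading the training score can be.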

sklearn has extensive [documentation](https://scikit-learn.org/stable/modules/cross_validation.html) of the different types of cross-validation tools available in the package; however, we will mostly use "stratified, k-fold" CV.  In this type of cross-validation:
* "k-fold" means the data is split into $k$ segments of (approximately) equal size, usually after shuffling; training is repeated $k$ times, each time holding out a different segment as the test set and training on the remainder.
* "stratified" means that the proportion of different classes is kept as constant as possible across the different splits.  For example, if the data is 80% A and 20% B, each split will also be about this ratio.  This is important so you do not end up with strongly biased folds.
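
The two properties above can be checked with a minimal sketch using scikit-learn's `StratifiedKFold`.  The synthetic 80/20 labels and placeholder features are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels (an assumption for illustration): 80% class "A", 20% class "B".
y = np.array(["A"] * 80 + ["B"] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features; only their count matters here

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_fractions = []
test_indices = []
for train_idx, test_idx in cv.split(X, y):
    fraction_b = np.mean(y[test_idx] == "B")  # share of class "B" in the held-out fold
    fold_fractions.append(fraction_b)
    test_indices.extend(test_idx)
    print(f"train size={len(train_idx)}, test size={len(test_idx)}, fraction of B={fraction_b:.2f}")
```

Each sample lands in exactly one test fold, and each fold keeps roughly the 80/20 ratio of the full dataset.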

For examples of cross-validating pipelines in Python 🐍 with PyChemAuth, please refer to [this notebook](https://pychemauth.readthedocs.io/en/latest/jupyter/learn/cv_optimization.html).