TODO: brief crash course in predictive modeling. Train-test splits. "Dimensionality."  Hyperparameters.

# scikit-learn

`scikit-learn` is one of the best libraries you'll ever find for doing (non-neural network) machine learning.  It has a lot of models with very good implementations, an incredibly easy API, and some of the best documentation for any piece of software I have ever seen.

Install scikit-learn with conda:

```bash
conda install scikit-learn
```

I'd also recommend installing the Intel Extensions for scikit-learn.  This is a library created by Intel that adds _enormous_ speedups to some of the models scikit-learn has.  _Note:_ the Intel Extensions are not compatible with Python 3.10 or later.  You need Python 3.9, as of writing this, or you'll get errors when trying to install or import it.

Install the Intel Extensions for scikit-learn with conda:

```bash
conda install -c conda-forge scikit-learn-intelex
```

And then either run your program with the command:

```bash
python -m sklearnex your_program.py
```

Or add these lines to your program before you import anything from scikit-learn:

```python
from sklearnex import patch_sklearn
patch_sklearn()
```

The rest of your program does not need to change: the Intel Extensions give you a free speedup with no code changes needed (for supported models; unsupported models are unaffected)!

# A rare digression to talk about theory stuff

Scikit-learn is all about classical _predictive models_ and _machine learning._  This means it has implementations of a lot of well-tested, well-validated predictive models (like different kinds of regressions, decision trees, random forests, support vector machines, etc.), but it _is not_ a library for building neural networks.  (Use Keras, TensorFlow, or PyTorch for that; we'll probably look at Keras later in the year, and maybe PyTorch too.)

Scikit-learn is _not_ a library for traditional explanatory analyses.  The models in scikit-learn are designed to optimize *predictive accuracy,* which comes at the cost of *human interpretability.*  E.g.: if you train a simple linear regression, you can easily "open it up" and understand the patterns it learned.  But if you train a gradient boosted ensemble of decision trees, which will usually be more accurate, you can't "open it up" in any meaningful way.  The patterns it's learned will just be too complex, and presented in a way that wouldn't be human-interpretable even if it had learned something simple.

All this means that if you're doing basic, foundational science, e.g. you want to understand something like "how does this particular instruction method impact long-term learning," then you probably want to use a different set of tools.  But if you're building a system that wants to predict, with high accuracy, whether someone will have meaningful long-term learning gains, and your goal is the predictions rather than a deeper theoretical understanding, then these prediction-oriented tools are what you want.
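To make the "open it up" contrast concrete, here is a minimal sketch on made-up data (the synthetic dataset, the true coefficients of 3 and -2, and the variable names are all my own assumptions, not part of any real project).  It only shows what you can inspect after fitting; it says nothing about which model is more accurate.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y depends linearly
# on two features, plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# A linear regression can be "opened up": the learned pattern is just
# one coefficient per feature plus an intercept.
linear = LinearRegression().fit(X, y)
print("Coefficients:", linear.coef_)
print("Intercept:", linear.intercept_)

# A gradient boosted ensemble stores its pattern as a long sequence of
# small decision trees.  You can count them, but reading them one by one
# tells you very little about the overall relationship.
boosted = GradientBoostingRegressor(random_state=0).fit(X, y)
print("Number of trees in the ensemble:", len(boosted.estimators_))
```

The linear model's coefficients land very close to the 3 and -2 used to generate the data, which is exactly the kind of direct read-out the boosted model doesn't offer.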

For a deeper dive into some of these ideas, I highly recommend Leo Breiman's paper ["Statistical Modeling: The Two Cultures,"](http://www2.math.uu.se/~thulin/mm/breiman.pdf) which discusses some of the philosophical and practical differences between these two approaches to data modeling.

# The organization of scikit-learn

The scikit-learn library has an absolutely absurd amount of stuff in it.  Most of it, though, can be broken down into four big groups:

1. Transformers.  These are things that take some data in, do some transformation to it, and give you back a transformed version of the data.  (e.g.: they transform features, do dimensionality reduction, add new features like interaction terms or spline terms, they remove features, etc.)
2. Estimators.  These are things that take some data in, and spit out a vector of predictions.
3. "Meta-estimators."  These are things like Pipelines, where you can combine multiple transformers/estimators into a single one, and Grid Searches, where you create a new estimator that does something to another estimator.  These will make a lot more sense when you see them in practice.  These behave a lot like regular estimators, but they're just distinct enough to merit their own entry.
4. Some assorted utility functions, most notably, functions to generate various accuracy scores.  E.g., functions to calculate r-squared, F1, and AUC-ROC scores.
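
As a rough sketch of how the four groups look in code (the choice of `LogisticRegression`, the step names `"scale"` and `"model"`, and scoring on the training data are my own shortcuts for illustration, not a recommended workflow):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# 1. Transformer: data in, transformed data out.
scaler = StandardScaler()
X_scaled = scaler.fit(X).transform(X)

# 2. Estimator: data in, vector of predictions out.
model = LogisticRegression(max_iter=1000)
model.fit(X_scaled, y)
preds = model.predict(X_scaled)

# 3. Meta-estimator: a Pipeline bundles the two steps above into one
#    object that behaves like a single estimator.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)
pipeline_preds = pipeline.predict(X)

# 4. Utility function: a scoring function comparing truth to predictions.
#    (Scored on the training data here only to keep the sketch short.)
print(f"Estimator accuracy: {accuracy_score(y, preds):.2%}")
print(f"Pipeline accuracy:  {accuracy_score(y, pipeline_preds):.2%}")
```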

It's important to note that these groupings are based on the _code you write,_ not on _what the things do, conceptually._  If we think conceptually about what's in scikit-learn, we get a much longer list:

1. Feature selection: remove some features from the data based on various criteria.  (implemented as transformers)
2. Feature creation: add new features, like interaction terms or spline terms.  (implemented as transformers)
3. Missing value imputation. (implemented as transformers)
4. Classification models.  (estimators)
5. Regression models.  (estimators)
6. Clustering models.  (estimators)
7. Custom ensemble estimators.  (meta-estimators)
8. Model selection and parameter tuning. (meta-estimators)

And so on.

This notebook will be structured like the Pandas one: I'll do a very abridged quick-start, showing off some of the major features and design choices, and then we'll work through one or two real-world example problems and see how scikit-learn helps us get them done.

# Scikit-learn quickstart

Scikit-learn is organized a bit differently from some of the other libraries we've seen so far.  It has a lot more modules, and you'll usually see people's imports done a bit differently from, say, Numpy.  The most important difference is that `import sklearn as sk` (or some other alias) won't work the way you might think, and may not give you access to everything.  It will, at the very least, result in some longer path names to access things.  Conventionally, stuff from scikit-learn is imported using the `from sklearn.[whatever] import [something]` format.
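
A small sketch of what that convention looks like next to the longer-path alternative (the particular classes imported here are arbitrary picks for illustration):

```python
# Conventional style: import exactly the classes or functions you need.
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Importing a whole submodule is also common, e.g. for the scoring functions.
from sklearn import metrics

scaler = StandardScaler()
model = LogisticRegression()

# The longer-path alternative: import the submodule explicitly, then spell
# out the full dotted name every time you use something from it.
import sklearn.preprocessing

same_scaler_class = sklearn.preprocessing.StandardScaler
print(same_scaler_class is StandardScaler)  # True: same class, longer name
```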

Fortunately, scikit-learn has the most consistent and well-designed interface of probably any software tool--code or otherwise--that I have ever used.  Once you understand how to fit one kind of model, like a linear regression, you understand how to fit basically all kinds of models.  The documentation is also excellent: it is divided into the [User Guide](https://scikit-learn.org/stable/user_guide.html), which describes what the different models and transformations do and how they behave; the [API documentation](https://scikit-learn.org/stable/modules/classes.html), which shows how to write code with scikit-learn; and many, many [Examples](https://scikit-learn.org/stable/auto_examples/index.html) of what these things look like in practice.  The User Guide is absolutely worth reading like a textbook: it is a technical, but surprisingly accessible, crash course in machine learning and predictive modeling.

The example below shows a very quick end-to-end project in scikit-learn.  It loads up a dataset--the Iris dataset--and builds a simple classifier to predict the species of a flower from its measurements.

In [1]:
# Function that will download (if needed) + load the Iris dataset
from sklearn.datasets import load_iris

# Module containing scoring functions
from sklearn import metrics

# Function to split a series of arrays into train-test subsets
from sklearn.model_selection import train_test_split

# Module with lots of preprocessing functions, like feature scaling.
# StandardScaler() is a transformer, which will scale each *feature*
# to have a mean of zero and standard deviation of 1.
from sklearn.preprocessing import StandardScaler

# A support vector machine classifier
from sklearn.svm import SVC

# Load the data.
iris = load_iris()

# `iris` is something called a `Bunch`; it's basically a dictionary.
print(iris.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])


A `Bunch` is a data structure that scikit-learn uses to load its standard datasets.  For all intents and purposes, these are just dictionaries, and you'll almost certainly never use them outside of loading sample datasets to experiment with.  Only a few of the fields are of interest to us:

- `"data"`: this is a Numpy array containing the features.  One row per observation, one column per feature.
- `"target"`: a Numpy array with the target/y/dependent values.  Usually this is a one-dimensional array.  This will always be numeric, even if the dataset is a classification dataset.  Entries are in the same order as the rows in `"data"`: the first entry in `"target"` gives the value for the first observation, and so on.
- `"frame"`: a Pandas DataFrame containing the same basic array from `"data"`, but stored as a DataFrame with column names.  This doesn't always have a value, though; in the case of the Iris dataset, it's `None`.
- `"target_names"`: an array that contains the names of the target/y/dependent value's classes, if this is a classification dataset.  This is a short array, with one entry for each target/class.  It's set up such that, if an observation's value in `"target"` is 2, then `bunch["target_names"][2]` gives you the human-readable name of that class.  (This is stored separately from the classes' numeric encodings for efficiency reasons: it takes less space to store a number than text, in most cases.)
- `"DESCR"`: a string storing a short description of the dataset.
- `"feature_names"`: an array of strings giving the feature names.  The first column of `"data"` corresponds to the first entry here, and so on.
- `"filename"`: the name of the file where the data is stored.
- `"data_module"`: what module of scikit-learn this data was loaded from.

Usually, you only ever need the `"data"` or `"frame"` keys, and the `"target"` (and maybe `"target_names"`) keys.
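
If all you want is the features and the target, a short sketch of two conveniences: the fields can also be read as attributes, and `load_iris()` accepts `return_X_y=True` to skip the dictionary-like object entirely (the variable names `X` and `y` are just my choice here):

```python
from sklearn.datasets import load_iris

# Load the dataset as before; fields can be read as keys or as attributes.
iris = load_iris()
print(iris["data"].shape)   # (150, 4)
print(iris.data.shape)      # same array, attribute-style access

# Or skip the container and get just the (features, target) pair.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)     # (150, 4) (150,)
```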

In [2]:
x = iris["data"]
y = iris["target"]
class_names = iris["target_names"][y]

print(x[:10])
print(y[:10])
print(class_names[:10])
print(iris["target_names"])

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
[0 0 0 0 0 0 0 0 0 0]
['setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa']
['setosa' 'versicolor' 'virginica']


In [3]:
# 80% of the data for training, 20% for evaluating our model's performance.
train_x, test_x, train_y, test_y = train_test_split(
    x, y,
    train_size=0.8,
    # stratify the sampling on the y values, so that the distribution of
    # y values in each split is about the same.
    stratify=y
)

# Create our scaler and our model, then fit them.
scaler = StandardScaler()
classifier = SVC()

# Every model has a .fit() method that will fit/train the model
# on the data you passed it.  This will update the model in-place,
# and it also returns the model itself (the same object, not a copy),
# so you could do `scaler = scaler.fit(...)` and it would work just
# fine.  Usually you'll just see `scaler.fit(...)` though.
scaler.fit(train_x, train_y)

# Transformers have a .transform() method, which only works after
# they've been .fit() to some data.  This will apply whatever the
# transformation is that this particular object does.  In this case,
# it'll scale all the *columns* to have zero mean and unit variance.
# Classification and regression models also have .fit()!
classifier.fit(scaler.transform(train_x), train_y)

# Now, we'll generate the model's predictions, and check a few performance
# metrics.  All estimators have a .predict() method to get their predictions
# on new observations.
preds = classifier.predict(scaler.transform(test_x))

# Accuracy functions all share the base signature of: foo(y_true, y_pred, *args, **kwargs)
print("Test set performance")
print(f"Accuracy: {metrics.accuracy_score(test_y, preds):.2%}")
print(f"Macro-averaged F1 score: {metrics.f1_score(test_y, preds, average='macro'):.2%}")
print(f"Confusion matrix:\n{metrics.confusion_matrix(test_y, preds)}")

Test set performance
Accuracy: 96.67%
Macro-averaged F1 score: 96.66%
Confusion matrix:
[[ 9  1  0]
 [ 0 10  0]
 [ 0  0 10]]


And that's the basics.  Pretty much every project where you're doing predictive modeling has the same few steps--load data, manipulate data, fit model, evaluate model, make some predictions--and scikit-learn does an amazing job of making the data manipulation and model parts very easy.

A few very, very important things to note, before we dive deeper:


- The developers of scikit-learn have made _huge_ efforts to ensure that the API for the library is as consistent and predictable as possible.
    - All models--transformers or estimators--have a `.fit()` method, and `.fit()` is the only method name that fits/trains models.
    - All estimators have a `.predict()` to get predictions for new data.  Where it mathematically makes sense, they also have a `.predict_proba()` and a `.decision_function()` method, which give more details about the prediction.
    - All transformers have a `.transform()` method to apply their transformations to observations.
    - If you know how one transformer/estimator/etc. works, you basically know how they all work.  It's just a matter of learning the imports, the model-specific parameters, and a bit of math/intuition for what the models do.
- All models take Numpy arrays in and return Numpy arrays (pretty much all models can also accept Pandas `DataFrame`s, and a lot can take Scipy sparse matrices as their inputs, too; by default the outputs are Numpy arrays, apart from a few transformers that return sparse matrices).  This means that all of the transformations are *composable.*
    - Since the input and output types are the same, you can run a single dataset through as many transformations as you want.  We'll see a very convenient tool for doing this in the next notebook.
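
Here is a hedged sketch of both points, again on the Iris data.  The specific models, the `PolynomialFeatures` step, the `random_state` values, and the helper `prepare()` are my own picks for illustration; the point is only that the same `.fit()` / `.transform()` / `.predict()` calls work unchanged across all of them.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)

# Composability: the array one transformer returns is a valid input
# for the next one.  Both are fit on the training data only.
scaler = StandardScaler().fit(train_x)
poly = PolynomialFeatures(degree=2, include_bias=False).fit(scaler.transform(train_x))


def prepare(data):
    """Hypothetical helper: apply the two fitted transformers in order."""
    return poly.transform(scaler.transform(data))


# Consistency: three very different models, identical method calls.
models = [
    LogisticRegression(max_iter=1000),
    SVC(),
    RandomForestClassifier(random_state=0),
]
for model in models:
    model.fit(prepare(train_x), train_y)
    accuracy = model.score(prepare(test_x), test_y)
    print(f"{type(model).__name__}: {accuracy:.2%}")

# Extra prediction detail, where the model supports it.
probabilities = models[0].predict_proba(prepare(test_x))   # one column per class
margins = models[1].decision_function(prepare(test_x))     # SVC decision scores
print(probabilities.shape)
```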

# Two projects

The rest of the lecture will be organized around two sample projects where the goal is to build some predictive models.

Project 1 will focus on a large-scale, high-dimensional regression problem.  This will be a simplified version of a project, but it'll have a lot of major scikit-learn pieces that you'll need to know about.

Project 2 will focus on an unsupervised learning/clustering problem.

# What we will NOT cover

- More than a very few things in scikit-learn.  It's an enormous library, and I am constantly discovering new things in it.  But, as mentioned, we're focusing on *how to write code with scikit-learn,* and the simple, consistent API means we can separate "how to write code" from "what are the models/tools/etc in this library."
- Details of how to do machine learning and predictive modeling projects.  That is a big enough topic that it would take several full courses to cover in detail.  The goal today is just to cover the code tools.
- Mathematical details of the different models and transformations.  We will briefly discuss why some models might be faster/more accurate/more robust than others, but the details of things like *regularization* and *gradient descent* are well beyond the scope of this session.
- Neural networks.  This is a very advanced topic that requires grappling with a lot of math and a lot of very unique considerations.  I expect that we'll briefly cover neural networks a bit later in the year, but we'll only see the very basics.
- How to deploy, monitor, and maintain models in a production environment.  There's a lot of software and systems development, software engineering, database work, and more that's involved in this.  It's also hard to make a lot of super general statements here.
- Ethics, fairness, bias, etc.  These topics are extremely important, but we don't have time to cover them, since our focus is on surveying the tools.  As time permits, we may come back to this later in the year.