# Machine Learning with scikit-learn

## What Is Machine Learning?

> **"If you torture the data enough, nature will always confess."** –Ronald Coase

As a one-line version (not entirely original), I like to think of machine learning as "statistics on steroids." That characterization may be more cute than necessary, but it is a good start. Others have used phrases like "extracting knowledge from raw data by computational means."

The lede on the Wikipedia article provides a bit more.

![Wikipedia entry](https://github.com/bibekebib/ML-Webinar/blob/main/img/ML-Wikipedia.png?raw=1)

Cite: [Wikipedia, 09:29, 2018 October 4](https://en.wikipedia.org/w/index.php?title=Machine_learning&oldid=862453222)

# Supervised and Unsupervised Models


## Supervised Model
Supervised learning, as the name indicates, works as if a supervisor or teacher were present. We train the machine on well-labelled data, meaning that each training example is already tagged with the correct answer. The trained algorithm is then given a new set of examples and predicts an outcome for each one, based on what it learned from the labelled training set.
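
As a minimal sketch of that workflow, the cell below trains a classifier on labelled examples and then asks it about examples it has not seen. The choice of the iris dataset and of `KNeighborsClassifier` is an assumption made for illustration; any labelled dataset and any scikit-learn classifier would follow the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Labelled data: X holds the features, y holds the known correct answers
X, y = load_iris(return_X_y=True)

# Hold back a quarter of the examples to play the role of "new" data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train on the labelled examples (illustrative choice of classifier)
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Predict labels for the unseen examples and compare with the true labels
y_pred = clf.predict(X_test)
test_accuracy = clf.score(X_test, y_test)
print("Predicted labels:", y_pred[:10])
print("Accuracy on unseen data:", test_accuracy)
```

The held-out labels are used only to check the predictions, never to train the model.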

## Unsupervised Model
Unsupervised learning is a type of machine learning that learns from unlabeled data. This means that the data does not have any pre-existing labels or categories. The goal of unsupervised learning is to discover patterns and relationships in the data without any explicit guidance.

![](https://www.scribbr.com/wp-content/uploads/2023/08/supervised-vs-unsupervised-learning-1.webp)
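
For contrast, here is a small unsupervised sketch. It assumes k-means clustering with three clusters on the iris measurements; both choices are illustrative and do not come from the text above. The labels are deliberately thrown away, so the algorithm sees only the features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Keep only the features; the labels are ignored on purpose
X, _ = load_iris(return_X_y=True)

# Ask for three groups (an assumption; in practice the number is often unknown)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print("Cluster assigned to the first ten observations:", cluster_ids[:10])
print("Cluster centers:\n", kmeans.cluster_centers_)
```

The cluster numbers are arbitrary identifiers, not class names; interpreting what each group means is left to the analyst.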

## Machine Learning Libraries



## What Is scikit-learn?

Scikit-learn provides a large range of machine learning algorithms unified under a common and intuitive API. The dozens of classes provided for various kinds of models share most of the same calling interface. Very often, as we will see in examples below, you can substitute one algorithm for another with almost no change in your underlying code. This allows you to explore the problem space quickly, and often arrive at an optimal, or at least satisficing$^1$, approach to your problem domain or datasets.

* Simple and efficient tools for data mining and data analysis
* Accessible to everybody, and reusable in various contexts
* Built on NumPy, SciPy, and matplotlib
* Open source, commercially usable - BSD license
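
The shared calling interface can be sketched as follows. The three classifiers and the wine dataset are assumptions picked for illustration; the point is only that the loop body does not change when the model does.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Three unrelated algorithms behind the same estimator interface
models = {
    "k-nearest neighbors": KNeighborsClassifier(),
    "naive Bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # Identical call regardless of which algorithm is plugged in
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean cross-validated accuracy {scores[name]:.3f}")
```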

<hr/>

<small>$^1$<i>Satisficing is a decision-making strategy of searching through the alternatives until an acceptability threshold is met. It is a portmanteau of satisfy and suffice, and was introduced by Herbert A. Simon in 1956. He maintained that many natural problems are characterized by computational intractability or a lack of information, both of which preclude the use of mathematical optimization procedures.</i></small>

## Overview of Techniques Used in Machine Learning

The diagram below is from the scikit-learn documentation, but the general schematic of techniques and algorithms that it outlines applies equally to other libraries. Most of the classes represented in bubbles have equivalent versions in other libraries.

![Scikit-learn topic areas](https://github.com/bibekebib/ML-Webinar/blob/main/img/sklearn-topics.png?raw=1)

## Classification versus Regression

### Classification

Classification is a type of supervised learning in which the targets for a prediction are a set of categorical values.

### Regression

Regression is a type of supervised learning in which the targets for a prediction are quantitative or continuous values.
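
A short sketch of the difference, assuming decision trees and two datasets bundled with scikit-learn (iris for categorical targets, diabetes for continuous ones). These choices are illustrative only.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: the targets are a small set of category codes
X_cls, y_cls = load_iris(return_X_y=True)
classifier = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_cls, y_cls)
class_preds = classifier.predict(X_cls[:5])

# Regression: the targets are continuous quantities
X_reg, y_reg = load_diabetes(return_X_y=True)
regressor = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_reg, y_reg)
value_preds = regressor.predict(X_reg[:5])

print("Predicted categories:", class_preds)
print("Predicted quantities:", value_preds)
```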



## Overfitting and Underfitting

In machine learning models, we have to worry about twin concerns. On the one hand, we might **overfit** our model to the dataset we have available. If we train a model extremely accurately against the data itself, the metrics we use for the quality of the model will probably show high values. However, in this scenario, the model is unlikely to extend well to novel data, and handling novel data is usually the entire point of developing a model and making predictions. By training in a finely tuned way against one dataset, we might have done nothing more than memorize that collection of values, or at least memorize a spurious pattern that exists in that particular sample.

To some extent (but not completely), overfitting is mitigated by larger dataset sizes.

In contrast, if we choose a model that simply does not have the degree of detail necessary to represent the underlying real-world phenomenon, we get an **underfit** model.  In this scenario, we *smooth too much* in our simplification of the data into a model.

Some illustrations are useful.

![](https://docs.aws.amazon.com/images/machine-learning/latest/dg/images/mlconcepts_image5.png)


![](https://miro.medium.com/v2/resize:fit:1396/1*lARssDbZVTvk4S-Dk1g-eA.png)
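
The same idea can be sketched numerically. The setup below is an assumption made for illustration: noisy samples of a sine curve, fitted with polynomials of degree 1, 4, and 15 using NumPy. Comparing the error on the training points with the error on fresh points shows how the two can diverge.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy observations of sin(2*pi*x) on [0, 1] (synthetic, illustrative)."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = make_data(20)    # small training sample
x_test, y_test = make_data(200)     # fresh data from the same process

errors = {}
for degree in (1, 4, 15):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    test_mse = np.mean((model(x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

A straight line is too simple for a sine curve, so it is a candidate for underfitting. The degree 15 polynomial has nearly as many coefficients as there are training points, so it can chase the noise in them.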

# Bias and Variance

## Bias
Bias is the error that arises when a model is unable to capture the true relationship in the data, so that a systematic difference remains between the model's predicted values and the actual values. This difference between actual (or expected) values and predicted values is known as bias error, or error due to bias.

- Low Bias: A low bias value means fewer assumptions are made in building the target function. In this case, the model will closely match the training dataset.
- High Bias: A high bias value means more assumptions are made in building the target function. In this case, the model will not match the training dataset closely.

![](https://miro.medium.com/v2/resize:fit:978/1*CgIdnlB6JK8orFKPXpc7Rg.png)

## Variance
In statistics, variance is the measure of spread in data from its mean position. In machine learning, variance is the amount by which the performance of a predictive model changes when it is trained on different subsets of the training data. More specifically, it measures how sensitive the model is to the particular subset it was trained on, that is, how much the fitted model changes when the training subset changes.

- Low variance: Low variance means that the model is less sensitive to changes in the training data and produces consistent estimates of the target function with different subsets of data from the same distribution. Low variance is desirable in itself, but when it comes together with high bias the model underfits and performs poorly on both training and test data.
- High variance: High variance means that the model is very sensitive to changes in the training data and can result in significant changes in the estimate of the target function when trained on different subsets of data from the same distribution. This is the case of overfitting, where the model performs well on the training data but poorly on new, unseen test data.
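
One rough way to see both quantities is to refit a model on many resampled versions of the training data and look at how its predictions behave. The sketch below is only an illustration under stated assumptions: synthetic sine data, bootstrap resampling, a straight-line model standing in for "high bias", and an unpruned decision tree standing in for "high variance". The helper name `bias_and_variance` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data: a sine curve plus noise (assumption for illustration)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0.0, 0.3, 200)

# Fixed points at which the models are compared with the true curve
X_grid = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
true_values = np.sin(2 * np.pi * X_grid[:, 0])

def bias_and_variance(make_model, n_rounds=50):
    """Refit on bootstrap resamples; return (squared bias, variance) averaged over the grid."""
    preds = np.empty((n_rounds, len(X_grid)))
    for r in range(n_rounds):
        idx = rng.integers(0, len(X), size=len(X))   # resample with replacement
        preds[r] = make_model().fit(X[idx], y[idx]).predict(X_grid)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_values) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

linear_bias_sq, linear_variance = bias_and_variance(LinearRegression)
tree_bias_sq, tree_variance = bias_and_variance(
    lambda: DecisionTreeRegressor(random_state=0)
)

print(f"straight line: bias^2 {linear_bias_sq:.3f}, variance {linear_variance:.3f}")
print(f"deep tree:     bias^2 {tree_bias_sq:.3f}, variance {tree_variance:.3f}")
```

The straight line gives nearly the same answer on every resample but is systematically wrong about the curve, while the deep tree tracks the curve on average but changes noticeably from one resample to the next.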

## Dimensionality Reduction

Dimensionality reduction is most often used to assist other techniques. When a large number of features is reduced to relatively few, other techniques are very often more successful when applied to the transformed synthetic features. Sometimes the dimensionality reduction itself is sufficient to identify the "main gist" of your data.
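
As one possible illustration, the sketch below uses principal component analysis (PCA) to reduce the 64 pixel features of scikit-learn's bundled digits dataset to 10 synthetic features. PCA, the dataset, and the number 10 are all assumptions chosen for the example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 1797 images, each described by 64 pixel intensities
X, y = load_digits(return_X_y=True)

# Project onto 10 synthetic features (the principal components)
pca = PCA(n_components=10, random_state=0)
X_reduced = pca.fit_transform(X)

explained = pca.explained_variance_ratio_.sum()
print("Original shape:", X.shape)
print("Reduced shape: ", X_reduced.shape)
print(f"Share of variance retained: {explained:.2f}")
```

`X_reduced` can then be passed to a classifier or clustering algorithm in place of `X`.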

## Feature Engineering

Very often, the "features" we are given in our original data are not those that will prove most useful in our final analysis. It is often necessary to identify "the data inside the data." Sometimes feature engineering can be as simple as normalizing the distribution of values. Other times it can involve creating synthetic features out of two or more raw features.
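
Both kinds of feature engineering mentioned above can be sketched in a few lines. The height and weight values below are made up for illustration, and the derived body mass index column is just one example of a synthetic feature built from two raw ones.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw measurements (invented values)
raw = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.82, 1.68],
    "weight_kg": [55.0, 80.0, 77.0, 90.0],
})

# Synthetic feature combining two raw features
raw["bmi"] = raw["weight_kg"] / raw["height_m"] ** 2

# Normalize every column to zero mean and unit variance
scaled = StandardScaler().fit_transform(raw)

print(raw)
print("Column means after scaling:", scaled.mean(axis=0).round(6))
print("Column std devs after scaling:", scaled.std(axis=0).round(6))
```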

## Feature Selection

Often, the features you have in your raw data contain some features with little to no predictive or analytic value. Identifying and excluding irrelevant features often improves the quality of a model.
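
A small sketch of the idea, under assumptions chosen for illustration: four columns of pure random noise are appended to the iris features, and `SelectKBest` with the ANOVA F-test (`f_classif`) is asked to keep the four most informative columns.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Append four irrelevant features: random noise unrelated to the labels
rng = np.random.default_rng(0)
noise = rng.normal(size=(X.shape[0], 4))
X_noisy = np.hstack([X, noise])

# Score each column against the labels and keep the best four
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X_noisy, y)
kept = selector.get_support(indices=True)

print("F-scores per column:", selector.scores_.round(1))
print("Indices of columns kept:", kept)
```

Columns 0 through 3 are the original measurements and columns 4 through 7 are the noise, so the indices printed show which kind of feature survived.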

## One-hot Encoding

For many machine learning algorithms, including neural networks, it is more useful to have a categorical feature with N possible values encoded as N features, each taking a binary value. Several tools, including a couple of functions in scikit-learn, will transform raw datasets into this format. Encoding this way obviously increases dimensionality.

Let us illustrate using a toy test dataset.  The following whimsical data is suggested in a blog post by [Håkon Hapnes Strand](https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science).  Imagine we collected some data on individual organisms—namely taxonomic class, height, and lifespan.  Depending on our purpose, we might use this data for either supervised or unsupervised learning techniques (if we had a lot more observations, and a number more features).

In [None]:
import numpy as np
import pandas as pd

In [None]:
data = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 22, 35, 28],
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']
}

# Creating a DataFrame
df = pd.DataFrame(data)

# Displaying the original DataFrame
print("DataFrame:")
print(df)

DataFrame:
   ID     Name  Age  Color
0   1    Alice   25    Red
1   2      Bob   30   Blue
2   3  Charlie   22  Green
3   4    David   35    Red
4   5      Eva   28  Green


In [None]:
# One-hot encode 'Color'; the prefix yields Color_Blue, Color_Green, Color_Red.
# dtype=int keeps the 0/1 output shown below (recent pandas defaults to booleans).
df_encoded = pd.concat([df, pd.get_dummies(df['Color'], prefix='Color', dtype=int)], axis=1)

# Dropping the original 'Color' column
df_encoded = df_encoded.drop(['Color'], axis=1)

# Displaying the DataFrame after manual one-hot encoding
print("\nDataFrame after Manual One-Hot Encoding:")
print(df_encoded)


DataFrame after Manual One-Hot Encoding:
   ID     Name  Age  Color_Blue  Color_Green  Color_Red
0   1    Alice   25           0            0          1
1   2      Bob   30           1            0          0
2   3  Charlie   22           0            1          0
3   4    David   35           0            0          1
4   5      Eva   28           0            1          0
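
The same encoding can be sketched with scikit-learn's `OneHotEncoder`. The column names are built by hand from the encoder's `categories_` attribute, and `.toarray()` converts the sparse result to a dense array; both are choices made for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The same 'Color' column as in the DataFrame above
colors = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red", "Green"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()   # dense 0/1 array

# One output column per category, in the (sorted) order the encoder found them
columns = [f"Color_{category}" for category in encoder.categories_[0]]
encoded_df = pd.DataFrame(encoded.astype(int), columns=columns)
print(encoded_df)
```

Unlike `pd.get_dummies`, a fitted encoder remembers its categories, so the same transformation can be applied later to new data with `encoder.transform`.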


## Metrics

After you have trained a model, the big question is "how good" is the model.  There is a lot of nuance to answering that question, and correspondingly a large number of measures and techniques.

One common technique to look at a combination of successes and failures in a machine learning model is a *confusion matrix*.  Let us look at an example, picking up the whimsical data used above.  Suppose we wanted to guess the taxonomic class of an observed organism and our model had these results:

| Predict/Actual | Human    | Octopus  | Penguin  |
|----------------|----------|----------|----------|
| Human          |  **5**   |    0     |    2     |
| Octopus        |    3     |  **3**   |    3     |
| Penguin        |    0     |    1     |  **11**  |

It is not immediately obvious how to give a single number that describes *how good* the model is.  The model is very good at predicting penguins, but rather bad at predicting octopi.  In fact, if the model predicts something is an octopus, it probably isn't (only one third of such predictions are accurate).

### Accuracy versus Precision versus Recall

Naïvely, we might simply ask about the "accuracy" of a model (at least for classification tasks).  This is simply the number of *right* answers divided by the number of data points.  In our example, we have 28 observations of organisms, and 19 were classified accurately, so that's a **68%** accuracy.  Again though, the accuracy varies quite a lot if we restrict it to just one class of the predictions.  For our multi-class labels, this may not be a bad measure.  
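
The arithmetic for the three-class table can be checked with a few lines of NumPy. The array below is copied from the table, with predictions in rows and actual classes in columns.

```python
import numpy as np

classes = ["Human", "Octopus", "Penguin"]

# Rows: predicted class, columns: actual class (same layout as the table)
table = np.array([
    [5, 0, 2],
    [3, 3, 3],
    [0, 1, 11],
])

correct = np.trace(table)      # sum of the diagonal
total = table.sum()
accuracy = correct / total
print(f"{correct} of {total} correct, accuracy {accuracy:.1%}")

# Share of each class's predictions that were right (row-wise)
correct_share = np.diag(table) / table.sum(axis=1)
for name, share in zip(classes, correct_share):
    print(f"Predicted {name}: {share:.1%} correct")
```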

Consider a binary problem though:

| Predict/Actual | Positive | Negative |
|----------------|----------|----------|
| Positive       |    1     |    0     |
| Negative       |    2     |   997    |


![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSdsPFoxS_ROjqfH6LmpP--isjdT4iJKPljE2Q2JLNw_g&s)

Calculating *accuracy*, we find that this model is **99.8%** accurate! That seems pretty good until you think of this test as a medical screening for a fatal disease.  *Two thirds of the people who actually have the disease will be judged free of it by this model* (and hence perhaps not be treated for the condition); that isn't such a happy real-world result.

<hr/>

In contrast with accuracy, the "precision" of a model is defined as:

$$\text{Precision} = \frac{true\: positive}{true\: positive + false\: positive}$$

Generalizing that to the multi-class case, the formula is as follows, where $M_{ij}$ counts the observations of actual class $i$ that were predicted as class $j$ (the layout scikit-learn uses, which is the transpose of the tables above):

$$\text{Precision}_{i} = \cfrac{M_{ii}}{\sum_j M_{ji}}$$

Applying that to our hypothetical medical screening, we get a precision of **1.0**.  We cannot do better than that.  The problem is with "recall" which is defined as:

$$\text{Recall} = \frac{true\: positive}{true\: positive + false\: negative}$$

Generalizing that to the multi-class case:

$$\text{Recall}_{i} = \cfrac{M_{ii}}{\sum_j M_{ij}}$$

Here we do much worse by having a recall of **33.3%** in our medical diagnosis case! This is obviously a terrible result if we care about recall.
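
These numbers can be reproduced directly from the two tables. In the sketch below, each table is typed in as displayed (predictions in rows), and the three-class one is transposed so that rows are actual classes, which is the layout scikit-learn's `confusion_matrix` returns. The variable names are illustrative.

```python
import numpy as np

# Binary screening table as displayed: rows predicted (pos, neg), columns actual
screening = np.array([
    [1, 0],
    [2, 997],
])
tp, fp = screening[0, 0], screening[0, 1]
fn, tn = screening[1, 0], screening[1, 1]

accuracy = (tp + tn) / screening.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"accuracy {accuracy:.1%}, precision {precision:.1%}, recall {recall:.1%}")

# Three-class table as displayed, then transposed so rows are actual classes
table = np.array([
    [5, 0, 2],
    [3, 3, 3],
    [0, 1, 11],
])
M = table.T

precision_per_class = np.diag(M) / M.sum(axis=0)   # divide by column sums
recall_per_class = np.diag(M) / M.sum(axis=1)      # divide by row sums
for name, p, r in zip(["Human", "Octopus", "Penguin"],
                      precision_per_class, recall_per_class):
    print(f"{name}: precision {p:.3f}, recall {r:.3f}")
```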

### F1 Score

There are several different algorithms that attempt to *blend* precision and recall to produce a single "score."  Scikit-learn provides a number of other scalar scores that are useful for differing purposes (and other libraries are similar), but F1 score is one that is used very frequently.  It is simply:

$$\text{F1} = 2 \times \cfrac{precision \times recall}{precision + recall}$$

Applying that to our medical diagnostic model, we get an F1 score of 50%.  Still not good, but we account for the high precision to some extent.  For intermediate cases, the F1 score provides good balance.

F1 score can be generalized to multi-class models by averaging the F1 score across each class, counting only correct/incorrect per class.
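
A short sketch of the formula in NumPy, applied first to the medical screening numbers and then, averaged per class, to the three-class table from earlier. The function name `f1` is just a local helper for this example.

```python
import numpy as np

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Medical screening example: precision 1.0, recall 1/3
screening_f1 = f1(1.0, 1 / 3)
print(f"Screening F1: {screening_f1:.3f}")

# Three-class table as displayed earlier: rows predicted, columns actual
table = np.array([
    [5, 0, 2],
    [3, 3, 3],
    [0, 1, 11],
])
precision = np.diag(table) / table.sum(axis=1)   # correct / all predictions of the class
recall = np.diag(table) / table.sum(axis=0)      # correct / all actual members of the class

per_class_f1 = f1(precision, recall)
macro_f1 = per_class_f1.mean()
print("Per-class F1 (Human, Octopus, Penguin):", per_class_f1.round(3))
print(f"Unweighted average across classes: {macro_f1:.3f}")
```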

### Code Examples

In [None]:
from sklearn.metrics import confusion_matrix
import numpy as np

y_true = ["human",   "octopus", "human", "human", "octopus", "penguin", "penguin"]
y_pred = ["octopus", "octopus", "human", "human", "octopus", "human",   "penguin"]
labels = ['octopus', 'penguin', 'human']

In [None]:
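cm = confusion_matrix(y_true, y_pred, labels=labels)
# scikit-learn puts actual classes in rows and predicted classes in columns
print("Confusion Matrix (actual/predict):\n",
      pd.DataFrame(cm, index=labels, columns=labels), sep="")

# Recall: correct predictions divided by row sums (all actual members of a class)
recall = np.diag(cm) / np.sum(cm, axis=1)
print("\nRecall:\n", pd.Series(recall, index=labels), sep="")

# Precision: correct predictions divided by column sums (all predictions of a class)
precision = np.diag(cm) / np.sum(cm, axis=0)
print("\nPrecision:\n", pd.Series(precision, index=labels), sep="")

print("\nAccuracy:\n", np.sum(np.diag(cm)) / np.sum(cm))

Confusion Matrix (actual/predict):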
         octopus  penguin  human
octopus        2        0      0
penguin        0        1      1
human          1        0      2

Recall:
octopus    1.000000
penguin    0.500000
human      0.666667
dtype: float64

Precision:
octopus    0.666667
penguin    1.000000
human      0.666667
dtype: float64

Accuracy:
 0.7142857142857143


In this particular case, F1 score is very close to accuracy.  In fact, using the "micro" averaging method reduces the result to accuracy.  Using the "macro" averaging makes it equivalent to a NumPy reduction from the formula given.

In [None]:
from sklearn.metrics import f1_score
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
print("\nF1 score:\n", weighted_f1, sep="")


F1 score:
0.7047619047619048


In [None]:
?f1_score

In [None]:
print("Naive averaging F1 score:", np.mean(2*(recall*precision)/(recall+precision)))
print(" sklearn macro averaging:", f1_score(y_true, y_pred, average="macro"))

Naive averaging F1 score: 0.7111111111111111
 sklearn macro averaging: 0.7111111111111111
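
The claim that "micro" averaging reduces to accuracy can be checked with the same toy labels. This sketch repeats the label lists so that it runs on its own; `accuracy_score` and the `average=None` option of `f1_score` are standard parts of `sklearn.metrics`.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["human",   "octopus", "human", "human", "octopus", "penguin", "penguin"]
y_pred = ["octopus", "octopus", "human", "human", "octopus", "human",   "penguin"]
labels = ["octopus", "penguin", "human"]

accuracy = accuracy_score(y_true, y_pred)
micro_f1 = f1_score(y_true, y_pred, average="micro")
print("Accuracy:          ", accuracy)
print("Micro-averaged F1: ", micro_f1)

# average=None returns one F1 score per class instead of a single number
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)
for name, score in zip(labels, per_class_f1):
    print(f"F1 for {name}: {score:.3f}")
```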
