<a href="https://colab.research.google.com/github/gheniabla/intro2ml/blob/main/chapters/chapter4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 4 - Scikit-learn


Scikit-learn is a machine learning library that provides a wide range of machine learning algorithms, including implementations of classification, regression, clustering, decision trees, and many more. As of April 2024, scikit-learn has about 31k commits on GitHub (https://github.com/scikit-learn/scikit-learn) and an active community of 724k users and 2840 contributors. With it, creating a machine learning model is largely a matter of calling the right scikit-learn functions on a suitable dataset. The library provides both training and inference functionality.

**Applications:**

* clustering
* classification
* regression
* model selection
* dimensionality reduction

**Documentation:**

The official documentation is located at: https://scikit-learn.org/.


## 4.1 Getting Datasets with Scikit-Learn

Scikit-Learn provides several small datasets for use in building machine learning models. They are simple and clean enough to train a model on directly. The best part is that these datasets are included with the Scikit-Learn package, so there is no need to download anything, and you'll be working with data in just a few lines of code. These bundled datasets are also known as toy datasets. They can be found in sklearn.datasets.

In [24]:
from sklearn import datasets

There is a corresponding function for loading each dataset. All of these functions follow the same naming pattern, "load_DATASET()", where DATASET stands for the dataset's name. We use load_breast_cancer() for the breast cancer dataset, load_wine() for the wine dataset, and load_iris() for the iris dataset.

Let's load the breast cancer dataset and assign it to the "data" variable:

In [25]:
data = datasets.load_breast_cancer()

Printing "data.keys()" lists the keys of the returned object. The two most important are:

1) data - all the feature data (the attributes of the scan that help us identify whether the tumor is malignant or benign, such as radius, area, etc.) in a NumPy array.

2) target - the target data (the variable you want to predict, in this case whether the tumor is malignant or benign) in a NumPy array.
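
The remaining keys can be inspected in the same way. As a minimal sketch, the snippet below prints the shape of the feature array and a few entries of feature_names and target_names. The variable names n_samples, n_features and class_names are chosen here for illustration only.

```python
from sklearn import datasets

# Load the bundled breast cancer dataset (no download required).
data = datasets.load_breast_cancer()

# Feature matrix: one row per sample, one column per feature.
n_samples, n_features = data.data.shape
print(n_samples, n_features)

# Names of the first three feature columns.
print(list(data.feature_names[:3]))

# Class labels: entry i of target_names is the meaning of target value i.
class_names = [str(name) for name in data.target_names]
print(class_names)
```

The shape should agree with the instance and attribute counts reported in the dataset description.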

Printing "data.DESCR", short for DESCRIPTION, gives us a description of the dataset:

In [26]:
print(data.keys())
print(data.DESCR)


dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these 

## 4.2 Training Set and Test Set

Machine learning involves training algorithms to recognize patterns or properties within a dataset and then testing these learned patterns on a different dataset to evaluate the algorithm's performance. Typically, a single dataset is divided into two parts: the training set and the testing set. The training set is used to teach the algorithm to identify patterns or properties, while the testing set is used to assess how well the algorithm has learned these patterns and how effectively it can apply them to new, unseen data. This process allows for the evaluation of the algorithm's ability to generalize from the training data to other data.

In [27]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset.data, iris_dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=31)

Executing the code above sets aside 25% of the original dataset for the test set, while the rest goes to the training set. The data is shuffled before it is split, and specifying the random_state argument fixes that shuffle so the same split is produced on every run.
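
To make the effect of test_size and random_state concrete, the sketch below repeats the split and checks the resulting sizes and reproducibility. It uses the return_X_y=True shortcut of load_iris() instead of unpacking the dataset object; the names X_train2, X_test2 and same_split are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same arguments as in the text: 25% of the samples go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=31
)
print(X_train.shape, X_test.shape)

# Repeating the call with the same random_state reproduces the same split.
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=31
)
same_split = np.array_equal(X_test, X_test2)
print(same_split)
```

With 150 iris samples, this gives 112 training samples and 38 test samples, since the test set size is rounded up.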

## 4.3 Data Cleansing with Scikit-Learn

**Missing values:**

Managing missing values is a critical preprocessing step in machine learning that, if neglected, can significantly undermine the performance of your model. The sklearn library offers a tool called SimpleImputer for addressing missing values using several common strategies: mean, most_frequent, median, and constant. When opting for the constant strategy, it's important to specify the fill_value parameter to determine what value will be used to fill in the missing data.
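
As a small illustration of two of these strategies, the sketch below applies SimpleImputer to a hand-made array. The values in X_missing are hypothetical and chosen only so that the column means are easy to verify by eye.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical 3x2 array with one missing entry per column.
X_missing = np.array([[1.0, np.nan],
                      [3.0, 4.0],
                      [np.nan, 8.0]])

# Replace each NaN with the mean of the observed values in its column.
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_mean = mean_imputer.fit_transform(X_missing)
print(X_mean)

# With strategy='constant', fill_value supplies the replacement.
const_imputer = SimpleImputer(strategy='constant', fill_value=0.0)
X_const = const_imputer.fit_transform(X_missing)
print(X_const)
```

In the first result the missing entries become 2.0 and 6.0, the means of the observed values in each column; in the second they become 0.0.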

Below, we demonstrate how to use SimpleImputer to fill in missing values in the feature array, X, by replacing them with the mean of each feature. Because the iris data contains no missing values, the output here is identical to the input.

In [28]:
from sklearn.impute import SimpleImputer
import numpy as np
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit_transform(X)

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

**Standardization:**

Standardization is a transformation that centers the data by removing the mean value of each feature and then scales it by dividing (non-constant) features by their standard deviation. After standardization, each feature has a mean of zero and a standard deviation of one. Depending on your needs and data, sklearn provides several scalers: StandardScaler, MinMaxScaler, MaxAbsScaler and RobustScaler.

*Standard Scaler:*

Sklearn's main scaler, the StandardScaler, uses a strict definition of standardization. It centers and scales each feature using the following formula, where u is the mean and s is the standard deviation of that feature.

            x_scaled = (x - u) / s
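
The formula can be checked against NumPy directly. The sketch below scales all four iris features with StandardScaler and compares the result with a manual computation. It assumes the population standard deviation (NumPy's default, ddof=0); the names features, scaled and manual are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

features = load_iris().data

# Scale all four columns with StandardScaler.
scaled = StandardScaler().fit_transform(features)

# Apply the formula by hand, column by column, using NumPy's
# default std (ddof=0, the population standard deviation).
u = features.mean(axis=0)
s = features.std(axis=0)
manual = (features - u) / s

# Both approaches should agree up to floating-point error.
print(np.allclose(scaled, manual))
```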

        

In [29]:
from sklearn.preprocessing import StandardScaler
import pandas as pd
X = pd.DataFrame(iris_dataset.data)
X.columns = ['f1', 'f2', 'f3', 'f4']
scaler = StandardScaler()
scaler.fit_transform(X.f3.values.reshape(-1, 1))

array([[-1.34022653],
       [-1.34022653],
       [-1.39706395],
       [-1.2833891 ],
       [-1.34022653],
       [-1.16971425],
       [-1.34022653],
       [-1.2833891 ],
       [-1.34022653],
       [-1.2833891 ],
       [-1.2833891 ],
       [-1.22655167],
       [-1.34022653],
       [-1.51073881],
       [-1.45390138],
       [-1.2833891 ],
       [-1.39706395],
       [-1.34022653],
       [-1.16971425],
       [-1.2833891 ],
       [-1.16971425],
       [-1.2833891 ],
       [-1.56757623],
       [-1.16971425],
       [-1.05603939],
       [-1.22655167],
       [-1.22655167],
       [-1.2833891 ],
       [-1.34022653],
       [-1.22655167],
       [-1.22655167],
       [-1.2833891 ],
       [-1.2833891 ],
       [-1.34022653],
       [-1.2833891 ],
       [-1.45390138],
       [-1.39706395],
       [-1.34022653],
       [-1.39706395],
       [-1.2833891 ],
       [-1.39706395],
       [-1.39706395],
       [-1.39706395],
       [-1.22655167],
       [-1.05603939],
       [-1

*MinMax Scaler:*

The MinMaxScaler transforms features by scaling each feature to a given range. This range can be set by specifying the feature_range parameter (the default is (0, 1)). This scaler works better for cases where the distribution is not Gaussian or the standard deviation is very small. However, it is sensitive to outliers, so if there are outliers in the data, you might want to consider another scaler. For the default range, the transformation is:


        x_scaled = (x - min(x)) / (max(x) - min(x))
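
For a range other than (0, 1), the unit-range result is stretched and shifted to fit the requested interval. The sketch below checks this on the third iris feature for the range (-3, 3); the names col, lo, hi, unit and manual are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

# Third iris feature as a single column of shape (150, 1).
col = load_iris().data[:, [2]]

lo, hi = -3, 3  # requested output range
scaled = MinMaxScaler(feature_range=(lo, hi)).fit_transform(col)

# Manual version: scale to [0, 1] first, then stretch and shift.
unit = (col - col.min()) / (col.max() - col.min())
manual = unit * (hi - lo) + lo

print(np.allclose(scaled, manual))
print(scaled.min(), scaled.max())
```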

In [30]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(-3,3))
scaler.fit_transform(X.f3.values.reshape(-1, 1))

array([[-2.59322034],
       [-2.59322034],
       [-2.69491525],
       [-2.49152542],
       [-2.59322034],
       [-2.28813559],
       [-2.59322034],
       [-2.49152542],
       [-2.59322034],
       [-2.49152542],
       [-2.49152542],
       [-2.38983051],
       [-2.59322034],
       [-2.89830508],
       [-2.79661017],
       [-2.49152542],
       [-2.69491525],
       [-2.59322034],
       [-2.28813559],
       [-2.49152542],
       [-2.28813559],
       [-2.49152542],
       [-3.        ],
       [-2.28813559],
       [-2.08474576],
       [-2.38983051],
       [-2.38983051],
       [-2.49152542],
       [-2.59322034],
       [-2.38983051],
       [-2.38983051],
       [-2.49152542],
       [-2.49152542],
       [-2.59322034],
       [-2.49152542],
       [-2.79661017],
       [-2.69491525],
       [-2.59322034],
       [-2.69491525],
       [-2.49152542],
       [-2.69491525],
       [-2.69491525],
       [-2.69491525],
       [-2.38983051],
       [-2.08474576],
       [-2

*MaxAbs Scaler:*

The MaxAbsScaler works very similarly to the MinMaxScaler but automatically scales the data to a [-1,1] range based on the absolute maximum. This scaler is meant for data that is already centered at zero or sparse data. It does not shift/center the data, and thus does not destroy any sparsity.


    x_scaled = x / max(abs(x))
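
The sketch below applies the formula to a small hand-made column and compares it with MaxAbsScaler. The values are hypothetical; they include zeros and a negative number to show that zeros stay zero and signs are preserved.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical column containing zeros and a negative value.
col = np.array([[0.0], [-4.0], [2.0], [0.0]])

scaled = MaxAbsScaler().fit_transform(col)

# Manual version of the formula above.
manual = col / np.abs(col).max()

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Because the data is only divided and never shifted, entries that were zero remain zero, which is why this scaler suits sparse data.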

In [31]:
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
scaler.fit_transform(X.f3.values.reshape(-1, 1))

array([[0.20289855],
       [0.20289855],
       [0.1884058 ],
       [0.2173913 ],
       [0.20289855],
       [0.24637681],
       [0.20289855],
       [0.2173913 ],
       [0.20289855],
       [0.2173913 ],
       [0.2173913 ],
       [0.23188406],
       [0.20289855],
       [0.15942029],
       [0.17391304],
       [0.2173913 ],
       [0.1884058 ],
       [0.20289855],
       [0.24637681],
       [0.2173913 ],
       [0.24637681],
       [0.2173913 ],
       [0.14492754],
       [0.24637681],
       [0.27536232],
       [0.23188406],
       [0.23188406],
       [0.2173913 ],
       [0.20289855],
       [0.23188406],
       [0.23188406],
       [0.2173913 ],
       [0.2173913 ],
       [0.20289855],
       [0.2173913 ],
       [0.17391304],
       [0.1884058 ],
       [0.20289855],
       [0.1884058 ],
       [0.2173913 ],
       [0.1884058 ],
       [0.1884058 ],
       [0.1884058 ],
       [0.23188406],
       [0.27536232],
       [0.20289855],
       [0.23188406],
       [0.202

**Normalization:**

Normalization is the process of scaling individual samples to have unit norm. In basic terms, you need to normalize data when the algorithm predicts based on the weighted relationships formed between data points. Scaling inputs to unit norms is a common operation for text classification or clustering.

NOTE - One of the key differences between scaling (e.g., standardizing) and normalizing is that normalizing is a row-wise operation, while scaling is a column-wise operation.
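
The row-wise versus column-wise distinction can be seen on a tiny example. The sketch below uses a hypothetical 3x2 array and sklearn's normalize() function for the row-wise case; the names X_demo, X_scaled and X_normed are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Hypothetical data: 3 samples (rows), 2 features (columns).
X_demo = np.array([[3.0, 4.0],
                   [6.0, 8.0],
                   [1.0, 0.0]])

# Column-wise: each feature ends up with mean 0.
X_scaled = StandardScaler().fit_transform(X_demo)
print(X_scaled.mean(axis=0))

# Row-wise: each sample ends up with l2 norm 1.
X_normed = normalize(X_demo, norm='l2')
print(np.linalg.norm(X_normed, axis=1))
```

The first two rows point in the same direction, so after normalization they become identical, even though their original magnitudes differ.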

    In the formulas below, x is a single sample (one row of X).

    max
        The max norm uses the absolute maximum and does for samples what the MaxAbsScaler does for features.

        x_normalized = x / max(abs(x))

        norm_max = [max(abs(v) for v in X.iloc[r]) for r in range(len(X))]

    l1
        The l1 norm uses the sum of the absolute values of the sample. It weights all components equally and is the norm commonly associated with sparsity.

        x_normalized = x / sum(abs(x))

        norm_l1 = [sum(abs(v) for v in X.iloc[r]) for r in range(len(X))]

    l2
        The l2 norm uses the square root of the sum of all the squared values. This creates smoothness and rotational invariance. Some models, like PCA, assume rotational invariance, and so l2 will perform better.

        x_normalized = x / sqrt(sum(v**2 for v in x))

        import math
        norm_l2 = [math.sqrt(sum(v**2 for v in X.iloc[r])) for r in range(len(X))]
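
In practice, the per-row norms do not need to be computed by hand: sklearn.preprocessing.normalize() accepts norm='max', 'l1' or 'l2'. The sketch below computes the three norms with NumPy and checks that dividing each row by its norm matches normalize(). The names X_arr and checks are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import normalize

X_arr = load_iris().data

# Per-row norms computed by hand with NumPy.
norm_max = np.abs(X_arr).max(axis=1)
norm_l1 = np.abs(X_arr).sum(axis=1)
norm_l2 = np.sqrt((X_arr ** 2).sum(axis=1))

# Dividing each row by its norm should match sklearn's normalize().
checks = {
    'max': np.allclose(normalize(X_arr, norm='max'), X_arr / norm_max[:, None]),
    'l1': np.allclose(normalize(X_arr, norm='l1'), X_arr / norm_l1[:, None]),
    'l2': np.allclose(normalize(X_arr, norm='l2'), X_arr / norm_l2[:, None]),
}
print(checks)
```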

## 4.4 Summary

To summarize all the steps in one program, let's load the iris dataset, perform the steps described above, and then train and evaluate a classifier using sklearn's SVM model:

In [32]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import svm
from sklearn.metrics import accuracy_score

# Load the data and hold out 25% of it as a test set.
iris_dataset = load_iris()
X, y = iris_dataset.data, iris_dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=31)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Train the SVM classifier and predict on both sets.
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)

# Compare the predictions with the true labels.
acc_train = accuracy_score(y_train, y_pred_train)
acc_test = accuracy_score(y_test, y_pred_test)
print(acc_train, acc_test)
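
The same steps can also be chained with sklearn's make_pipeline(), which keeps the scaler and the classifier together in one object. This is offered as an alternative sketch rather than the approach used in this chapter. The split and the SVC parameters are copied from the program above, and the name model is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=31
)

# The pipeline fits the scaler on the training data only, then
# applies the same scaling before every call to predict or score.
model = make_pipeline(StandardScaler(), SVC(gamma=0.001, C=100.0))
model.fit(X_train, y_train)

# For classifiers, score() returns the mean accuracy.
acc_train = model.score(X_train, y_train)
acc_test = model.score(X_test, y_test)
print(acc_train, acc_test)
```

Bundling the steps this way makes it harder to accidentally fit the scaler on the test data.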