# Introduction

scikit-learn is undoubtedly one of the most popular libraries in the machine learning community, and it is where many beginners take their next step toward artificial intelligence. Go [here](https://scikit-learn.org/stable/) to explore the library's documentation.

## Installation

Let's start by installing scikit-learn. Run the following command.

In [1]:
!pip install scikit-learn -q

## Exploring the Library

### sklearn.datasets

In this notebook, we will explore sklearn.datasets, sklearn.preprocessing, sklearn.model_selection, and sklearn.neighbors, beginning with sklearn.datasets.

The datasets module contains a variety of functions for loading popular datasets into the notebook.

[sklearn.datasets](https://scikit-learn.org/stable/datasets.html)

In [2]:
from sklearn import datasets

For instance, remember when we worked with the iris dataset? We can simply call the load_iris function to load the dataset into our notebook.

[iris dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-plants-dataset)

In [3]:
iris_data = datasets.load_iris()

In [4]:
iris_data

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

It looks like a dictionary object (strictly speaking, it is a Bunch, scikit-learn's dictionary-like container). Let's see the keys.

In [5]:
iris_data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

Let's explore the keys. The data key holds the feature values: sepal length, sepal width, petal length, and petal width.

In [6]:
feature_data = iris_data['data']
feature_data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [7]:
feature_data.shape

(150, 4)

The target key, as the name suggests, holds the target data. Remember how we converted the unique values in the target column from categories to numerical codes? We don't need to do that here, as the target is already encoded as follows.

In [8]:
target = iris_data['target']
target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [9]:
target.shape

(150,)

To access the column names and the class names behind the target codes, use the feature_names and target_names keys.

In [10]:
column_names = iris_data['feature_names']
column_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [11]:
target_names = iris_data['target_names']
target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
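The numeric codes in target can be turned back into readable class names by indexing target_names with them. The sketch below uses the three names shown above and a few hypothetical label codes (`some_targets` is made up for illustration); it relies only on NumPy integer-array indexing.

```python
import numpy as np

# Class names copied from the notebook output above
target_names = np.array(['setosa', 'versicolor', 'virginica'])

# Hypothetical label codes, chosen only for illustration
some_targets = np.array([0, 2, 1, 0])

# Integer-array indexing looks up the name for every code at once
decoded = target_names[some_targets]
print(decoded.tolist())  # ['setosa', 'virginica', 'versicolor', 'setosa']
```

The same indexing should work on the full target array from the notebook, giving one class name per row.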

Let's convert this data to a Pandas DataFrame for fun!

In [12]:
import pandas as pd

In [13]:
df = pd.DataFrame(feature_data, columns=column_names)

In [14]:
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


To add the target column, use the following.

In [15]:
df['target'] = target

In [16]:
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


That's it! Now you can continue with your exploratory data analysis, but we will skip that part in this notebook.
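As a possible shortcut to the manual conversion above, load_iris accepts an as_frame argument that returns Pandas objects directly. The sketch below assumes a scikit-learn version recent enough to support as_frame (0.23 or later) and that Pandas is installed; the name `df_direct` is made up for illustration.

```python
from sklearn import datasets

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays
iris = datasets.load_iris(as_frame=True)

# .frame holds the four feature columns plus a 'target' column
df_direct = iris.frame

print(df_direct.shape)
print(df_direct.columns.tolist())
```

Building the DataFrame by hand, as done above, is still useful for seeing how the individual pieces fit together.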

### sklearn.preprocessing

Machine learning is not just about developing 'magical' AI models. It is a part of data science, in which engineers and scientists apply sound scientific practices to prepare the data before it is fed to the model. Scikit-learn groups some of these techniques in one module so that you can easily explore them.

[sklearn.preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html)

In [17]:
from sklearn import preprocessing

For this tutorial, we will consider MinMaxScaler, which is based on the formula below.

![Screenshot%20from%202024-03-26%2016-39-13.png](attachment:Screenshot%20from%202024-03-26%2016-39-13.png)
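For readers who cannot see the screenshot, min-max scaling is conventionally written as below. This is the standard textbook form, assuming the scaler's default output range of 0 to 1; here x is a value in a column, and x_min and x_max are that column's minimum and maximum.

```latex
x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
```

With this form, the smallest value in a column maps to 0 and the largest maps to 1.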

The idea behind scaling comes from the fact that each column can have a very different range and distribution of values. Let's use the describe method in Pandas to see what I mean.

In [18]:
df.describe()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |


Look closely at one specific statistic for each column, such as the mean. The mean is not the same for all columns. The same holds for the standard deviation, min, max, and quartiles (25%, 50%, and 75%). This is a problem because each column has a different **data distribution**. When columns differ widely in scale, they contribute unevenly to the model's error (for distance-based models, columns with larger values dominate). In practice, it is better for the columns to contribute fairly so that **convergence in training** is easier to achieve.

PS: Some keywords may be unfamiliar to you, so I have put them in bold to make them easy to spot and look up.

Now let's apply our scaler method.

In [19]:
scaler = preprocessing.MinMaxScaler()
scaler

The scaler object is ready to take its input. We need to **fit** the scaler to our data so that it can apply the formula shown above.

Before we do that, let's separate the feature data from the target, since we only want to scale the features.

In [20]:
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [21]:
X = df.iloc[:, :4]
X.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


I usually prefer passing a NumPy array instead of a DataFrame, but you are free to continue with the Pandas object.

In [22]:
feature_data = X.values

In [23]:
## Fitting
scaler.fit(feature_data)

When fitting, the scaler returns itself. There is no need to assign the result to another variable, because the fitted state is stored in the scaler variable.

So what did fitting achieve for the scaler object? It did not apply the formula shown above. Instead, it only saved the minimum and maximum values of each column. Now that the min and max are available, we are going to apply the formula.

The transform method is used for that purpose.
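The split between fitting and transforming can be checked on a tiny array. The sketch below uses a hypothetical 3x2 dataset (`toy_data` and `toy_scaler` are names made up for illustration) and reads the fitted attributes data_min_ and data_max_ to reproduce the min-max formula by hand, assuming the default output range of 0 to 1.

```python
import numpy as np
from sklearn import preprocessing

# Tiny hypothetical dataset: 3 rows, 2 columns
toy_data = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [4.0, 40.0]])

toy_scaler = preprocessing.MinMaxScaler()
toy_scaler.fit(toy_data)  # only learns the per-column min and max

print(toy_scaler.data_min_)  # per-column minimum seen during fit
print(toy_scaler.data_max_)  # per-column maximum seen during fit

# Apply the min-max formula by hand using the learned statistics
manual = (toy_data - toy_scaler.data_min_) / (toy_scaler.data_max_ - toy_scaler.data_min_)

# transform applies the same formula
scaled = toy_scaler.transform(toy_data)
print(np.allclose(manual, scaled))  # True
```

Keeping the two steps separate means the statistics learned from one dataset can later be applied to another with the same transform call.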

In [24]:
scaled_data = scaler.transform(feature_data)

In [25]:
scaled_data.shape

(150, 4)

In [26]:
scaled_data

array([[0.22222222, 0.625     , 0.06779661, 0.04166667],
       [0.16666667, 0.41666667, 0.06779661, 0.04166667],
       [0.11111111, 0.5       , 0.05084746, 0.04166667],
       [0.08333333, 0.45833333, 0.08474576, 0.04166667],
       [0.19444444, 0.66666667, 0.06779661, 0.04166667],
       [0.30555556, 0.79166667, 0.11864407, 0.125     ],
       [0.08333333, 0.58333333, 0.06779661, 0.08333333],
       [0.19444444, 0.58333333, 0.08474576, 0.04166667],
       [0.02777778, 0.375     , 0.06779661, 0.04166667],
       [0.16666667, 0.45833333, 0.08474576, 0.        ],
       [0.30555556, 0.70833333, 0.08474576, 0.04166667],
       [0.13888889, 0.58333333, 0.10169492, 0.04166667],
       [0.13888889, 0.41666667, 0.06779661, 0.        ],
       [0.        , 0.41666667, 0.01694915, 0.        ],
       [0.41666667, 0.83333333, 0.03389831, 0.04166667],
       [0.38888889, 1.        , 0.08474576, 0.125     ],
       [0.30555556, 0.79166667, 0.05084746, 0.125     ],
       [0.22222222, 0.625     ,

In [27]:
feature_data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

What changes do you see? 

Let's analyze the means and standard deviations of the columns.

In [28]:
scaled_data.mean(axis=0)

array([0.4287037 , 0.44055556, 0.46745763, 0.45805556])

In [29]:
scaled_data.std(axis=0)

array([0.22925036, 0.18100457, 0.29820408, 0.31653859])

Of course, we are not aiming to give every column exactly the same distribution; that would effectively mean having the same data in every column. Our aim is only to confine the data to a specific range so that differences in distribution do not cause headaches for the machine learning model.
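One way to confirm that every column now lives in the same range is to look at the per-column minimum and maximum after scaling. The sketch below reloads the iris features and uses fit_transform, which combines the fit and transform steps in a single call; the variable names are made up for illustration, and the default output range of 0 to 1 is assumed.

```python
import numpy as np
from sklearn import datasets, preprocessing

iris_features = datasets.load_iris().data

# fit_transform learns the min and max, then applies the scaling in one step
iris_scaled = preprocessing.MinMaxScaler().fit_transform(iris_features)

# Each column should now span (approximately, up to floating point) 0 to 1
print(iris_scaled.min(axis=0))
print(iris_scaled.max(axis=0))
```

The means and standard deviations still differ between columns, as seen above, but no column can dominate simply because it is measured on a larger scale.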

### sklearn.model_selection

There is a technique called train-test split which, as the name suggests, splits the dataset into two parts: train and test. This introduces more ML terms. Look up **generalization** in machine learning, and also study **overfitting** and **underfitting**, as they are closely related to generalization.

Let's break it down:
* overfitting is when the model learns the training data so closely, noise included, that it does not perform well in practice.
* underfitting is when the model does not learn the patterns in the data well enough, so it performs poorly even on the training data.
* generalization is the ability of the model to perform well both on the data it was trained on and on data it has never seen during training (the test dataset, or let's call it 'in practice').
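These three terms can be made a little more concrete by comparing a model's accuracy on the data it was trained on with its accuracy on held-out data. The sketch below is only an illustration: it uses train_test_split and KNeighborsClassifier, both of which are covered later in this notebook, and the values of k and random_state are arbitrary choices made here, not recommendations.

```python
from sklearn import datasets, model_selection, neighbors

X_all, y_all = datasets.load_iris(return_X_y=True)

# Hold out 25% of the rows; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_all, y_all, train_size=0.75, random_state=0)

scores = {}
for k in (1, 15, 100):  # arbitrary neighbor counts, chosen for illustration
    clf = neighbors.KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # score() returns the mean accuracy on the given data
    scores[k] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(k, scores[k])
```

As a rough guide, a training score much higher than the test score hints at overfitting, low scores on both hint at underfitting, and similar, high scores on both suggest the model generalizes.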

In [31]:
from sklearn import model_selection

In [32]:
X_train, X_test = model_selection.train_test_split(scaled_data, train_size = 0.75)

In [33]:
X_train.shape, X_test.shape

((112, 4), (38, 4))

However, in practice, both X and y must be split. See the example below.

In [34]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(scaled_data, target, train_size = 0.75)

In [35]:
X_train.shape, X_test.shape

((112, 4), (38, 4))

In [36]:
y_train.shape, y_test.shape

((112,), (38,))

As you can see, the function split the data into train and test sets. 75% of the original dataset is assigned to the train set, while the rest belongs to the test set. The model will be trained on the 75% portion, and its performance will be checked on the remaining 25%.
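Because the split is random, each run of the cells above produces a different partition. The sketch below shows two optional arguments of train_test_split that may help: random_state, which makes the split repeatable, and stratify, which keeps the class proportions similar in both parts. The value 42 and the `_s` variable names are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn import datasets, model_selection

X_all, y_all = datasets.load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify=y_all balances the classes across both parts
X_train_s, X_test_s, y_train_s, y_test_s = model_selection.train_test_split(
    X_all, y_all, train_size=0.75, random_state=42, stratify=y_all)

print(X_train_s.shape, X_test_s.shape)

# Count how many samples of each class landed in each part
print(np.bincount(y_train_s), np.bincount(y_test_s))
```

Running the cell again with the same random_state should give the same partition, which makes results easier to compare between experiments.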

### sklearn.neighbors

This module contains nearest-neighbor models. We will use KNeighborsClassifier as the model to be trained.

[sklearn.neighbors](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors)

[KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

In [37]:
from sklearn import neighbors

In [38]:
neighbors.KNeighborsClassifier

sklearn.neighbors._classification.KNeighborsClassifier

As you can see from the documentation, the model accepts many parameters at initialization, but we only set n_neighbors: the number of nearest neighbors that are consulted when classifying a data point.
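To see what n_neighbors controls, here is a simplified sketch of the idea behind k-nearest-neighbors classification in plain Python: find the k training points closest to a query and let them vote. This is not scikit-learn's implementation; the function `knn_predict` and the toy points are hypothetical, and Euclidean distance with equal-weight votes is assumed.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Each of the k neighbors casts one vote for its own label
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Two small, well-separated hypothetical clusters
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (0.15, 0.15), k=3))  # 0
print(knn_predict(points, labels, (0.95, 1.0), k=3))   # 1
```

A small k makes the prediction depend on very few points, while a large k smooths the decision over a wider neighborhood.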

In [39]:
model = neighbors.KNeighborsClassifier(n_neighbors=5)
model

In [40]:
model.fit

<bound method KNeighborsClassifier.fit of KNeighborsClassifier()>

Time to fit the data. The fit() method expects two inputs, X and y: X is the features and y is the target.

In [41]:
X_train.shape

(112, 4)

In [42]:
y_train.shape

(112,)

Now, we can input the data.

In [43]:
model.fit(X_train, y_train)

The predict method lets us make predictions on the test set.

In [44]:
y_pred = model.predict(X_test)
y_pred

array([1, 0, 1, 0, 2, 2, 2, 1, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 1, 2, 1, 2,
       0, 2, 1, 1, 0, 1, 1, 1, 0, 1, 2, 2, 1, 2, 2, 0])

There you go! You just trained your first model and used it to make predictions. Let's see how accurate it was by comparing the predictions with the true labels.

In [45]:
y_test

array([2, 0, 1, 0, 2, 2, 1, 1, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 1, 2, 1, 2,
       0, 2, 2, 1, 0, 1, 1, 1, 0, 1, 2, 2, 1, 2, 2, 0])

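So precise! 35 of the 38 predictions match the true labels (the mismatches are at positions 0, 6, and 24), which is an accuracy of about 92%.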
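Rather than comparing the two arrays by eye, the agreement can be counted. The sketch below copies the y_pred and y_test values printed above and computes the fraction of matching entries with NumPy. Because the split is random, a fresh run of the notebook will likely give different arrays and a different score.

```python
import numpy as np

# Values copied from the notebook outputs above
y_pred = np.array([1, 0, 1, 0, 2, 2, 2, 1, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 1, 2, 1, 2,
                   0, 2, 1, 1, 0, 1, 1, 1, 0, 1, 2, 2, 1, 2, 2, 0])
y_test = np.array([2, 0, 1, 0, 2, 2, 1, 1, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 1, 2, 1, 2,
                   0, 2, 2, 1, 0, 1, 1, 1, 0, 1, 2, 2, 1, 2, 2, 0])

# Element-wise comparison gives True where the prediction matches the label
correct = int((y_pred == y_test).sum())
accuracy = correct / len(y_test)

print(correct, len(y_test))  # 35 38
print(round(accuracy, 3))    # 0.921
```

scikit-learn offers the same measurement through sklearn.metrics.accuracy_score(y_test, y_pred), or through model.score(X_test, y_test) on the fitted model.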