# Scikit-Learn with Python

### In this tutorial, we will learn about the different models in the scikit-learn library, a Python library for developing machine learning and statistical models.

After carefully reading and practicing this tutorial, you will be able to implement models for regression, classification, clustering and more, according to your problem statement.


#### This library is written in Python and is built upon NumPy, SciPy and Matplotlib.


## Prerequisites for Scikit-Learn

#### Before jumping into the discussion of the models in this library, you should have a basic knowledge of the Python language, machine learning, NumPy, SciPy and Matplotlib. A sound knowledge of these concepts is an advantage. If you are new to them, it is highly recommended to learn the basics of these libraries and concepts first.

### We assume that you have installed Python on your system, along with scikit-learn and the related libraries.

### Now we will see how to import scikit-learn and the other libraries, and how to use them.

Let's import the Scikit-Learn Library

In [None]:
import sklearn

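By executing the above command, we import the library. Importing a library makes its functions available in our script. Note that scikit-learn is organized into submodules (such as datasets, metrics and linear_model), which we will import explicitly as we need them.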

## Datasets
One of the best things about scikit-learn is that it ships with a few datasets which can be used for practice and for a better understanding of machine learning concepts.

So, let's import the datasets module with the following command.

In [None]:
from sklearn import datasets

This module provides different datasets, such as iris, diabetes, digits and linnerud, for regression or classification purposes. The Boston house-price dataset is also included, but only in releases before scikit-learn 1.2.

As a first step, we will work on the concept of classification.
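
To get a feel for what these datasets contain, the following minimal sketch loads four of the built-in datasets and prints the shape of their data and target arrays. The `loaders` and `shapes` dictionaries are names chosen for this illustration only.

```python
from sklearn import datasets

# Map a short name to each built-in loader that ships with scikit-learn.
loaders = {
    "iris": datasets.load_iris,          # classification
    "digits": datasets.load_digits,      # classification
    "diabetes": datasets.load_diabetes,  # regression
    "linnerud": datasets.load_linnerud,  # multi-output regression
}

shapes = {}
for name, loader in loaders.items():
    bunch = loader()
    shapes[name] = (bunch.data.shape, bunch.target.shape)
    print(f"{name}: data {bunch.data.shape}, target {bunch.target.shape}")
```

Each loader returns an object with `data` and `target` attributes, which is the same structure used for the iris dataset in the rest of this tutorial.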

# Classification

In machine learning, classification refers to a predictive modeling problem where a class label is predicted for a given example of input data.
In simple words, when we have an input example and need to predict whether it belongs to Class A or Class B, we use classification methods.

Let's learn more about classification with examples.

First, let's load the iris dataset.

In [None]:
myIrisData = datasets.load_iris()

### About Iris Dataset
It is necessary to know about the dataset before using it in our code.

The iris dataset describes iris plants. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. It has 4 numeric, predictive attributes and the class.
The attributes are as follows:

- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm

and the classes to which these instances belong are:

- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica

In [None]:
myIrisData.feature_names

In [None]:
myIrisData.target_names

In [None]:
myIrisData.target

In [None]:
myIrisData.data[:10]

In the above cells, you can see the feature names, the target names, the target class of each instance and the feature values of the first 10 instances, respectively.

The above data is not easy to understand, because we can't see which column represents which feature. Let's import the pandas library and create a DataFrame from the iris dataset to better understand the data.

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(myIrisData.data, columns=myIrisData.feature_names)

In [None]:
df
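
The DataFrame above holds only the features. As a small sketch, the integer targets can be translated into species names and stored as an extra column; `labelled_df` and the `species` column name are chosen for this illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
labelled_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Translate the integer targets (0, 1, 2) into readable species names.
labelled_df["species"] = iris.target_names[iris.target]

print(labelled_df.head())
print(labelled_df["species"].value_counts())
```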

First we will create two arrays: one for the data (features) and the other for the target (labels).

In [None]:
X = myIrisData.data
Y = myIrisData.target

### Splitting and Shuffling Dataset

Now it's time to split the data randomly for training and testing purposes. For this, sklearn provides a built-in function named train_test_split(). It shuffles the data randomly and splits it into 2 parts, i.e. a train part and a test part.

Shuffling the data is necessary to avoid bias, in other words to ensure that the order in which the samples are stored does not influence the model or its evaluation. The iris samples, for example, are stored sorted by class, so an unshuffled 70:30 split would put only one class in the test set. Here the dataset is split in a 70:30 ratio (70% train set and 30% test set). You can change the ratio according to your needs.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 1)

In the above cell, we first import the train_test_split function from sklearn.model_selection. This function splits the data randomly into train and test chunks, here in a 70:30 ratio. The test_size parameter determines the proportion of the data that goes into the test chunk: to split the data in an 80:20 ratio, pass test_size = 0.2. Setting random_state = 1 fixes the seed of the random shuffle, so the same split is produced every time the cell is run.

Now, let's check the shape of the resulting arrays.
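
The effect of random_state can be checked directly. The sketch below splits the iris data twice with the same seed and compares the results; it also uses the `stratify` parameter of train_test_split, which the cell above does not use, to keep the class proportions equal in both parts.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Same random_state -> identical shuffles, so the two splits match exactly.
split_a = train_test_split(X, y, test_size=0.3, random_state=1)
split_b = train_test_split(X, y, test_size=0.3, random_state=1)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Reproducible split:", same_split)

# stratify=y keeps the class proportions equal in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print("Train/test sizes:", X_tr.shape[0], X_te.shape[0])
print("Test samples per class:", np.bincount(y_te))
```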

In [None]:
print(X_train.shape)
print(X_test.shape)

print(y_train.shape)
print(y_test.shape)

## Logistic Regression

Logistic Regression is used for classification. For example, given a data instance, predict its flower class.

With the help of sklearn we can easily implement Logistic Regression.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logreg = LogisticRegression(max_iter=200)  # extra iterations help the default solver converge on unscaled data
logreg.fit(X_train, y_train)
y_predicted = logreg.predict(X_test)
y_predicted

### Model Evaluation

After classification, the model needs to be evaluated to see how well it predicts.

Scikit-learn provides the <b>metrics</b> module for calculating different metrics for a model, including the accuracy and the confusion matrix.

We will use the results from the above Logistic Regression example to calculate the accuracy of the model and other metrics.

To compute the accuracy, scikit-learn provides the <b>accuracy_score</b> function.

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predicted)
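
Accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this on two small, hypothetical label vectors (`y_true` and `y_hat` are made up for illustration and are not the iris results).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Small hypothetical label vectors, used only for illustration.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_hat = np.array([0, 1, 1, 2, 1, 0])

manual_accuracy = np.mean(y_true == y_hat)        # 5 of the 6 labels match
library_accuracy = accuracy_score(y_true, y_hat)  # same value from scikit-learn
print(manual_accuracy, library_accuracy)
```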

<b>Confusion Matrix</b> is used for evaluation of a model with respect to class labels.

In [None]:
from sklearn.metrics import confusion_matrix
# Use a separate variable name so the confusion_matrix function is not overwritten.
cm = confusion_matrix(y_test, y_predicted)
print(cm)

The above matrix is called a confusion matrix: each row corresponds to an actual class and each column to a predicted class. In this raw form it is not easy to interpret.

We can display the confusion matrix as a plot using ConfusionMatrixDisplay.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay(cm, display_labels=myIrisData.target_names).plot()

Sometimes accuracy alone is not a good metric for model evaluation, for example when the classes are imbalanced. In such cases we need other measures like precision, recall and f1-score as well.

Scikit-learn provides a very useful function for computing all these metrics in a single line.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predicted))
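
The numbers in the report can be derived from the confusion matrix. As a minimal sketch on hypothetical labels (`y_true` and `y_hat` are made up for illustration), the per-class precision, recall and f1-score are computed by hand below and can be compared with scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels for a 3-class problem, used only for illustration.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_hat = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])

cm = confusion_matrix(y_true, y_hat)  # rows: actual class, columns: predicted class
true_positives = np.diag(cm)

precision = true_positives / cm.sum(axis=0)  # correct / all samples predicted as the class
recall = true_positives / cm.sum(axis=1)     # correct / all samples actually in the class
f1 = 2 * precision * recall / (precision + recall)

print("precision:", precision)
print("recall:   ", recall)
print("f1-score: ", f1)
```

Passing `average=None` to `precision_score`, `recall_score` and `f1_score` returns the same per-class values.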

## KNN (k nearest neighbors)

Let's explore another classification model, the <b>KNN classifier</b>.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

classifier_knn = KNeighborsClassifier(n_neighbors = 3)
classifier_knn.fit(X_train, y_train)
y_pred = classifier_knn.predict(X_test)

This is how you can train a KNeighborsClassifier and predict the labels of the held-out test data.

We find the accuracy by comparing the actual response values (y_test) with the predicted response values (y_pred).

In [None]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

You can see that the model predicted the test set with an accuracy of about 97.7%.

Next, we provide sample data and let the model make predictions for it.


In [None]:
sample = [[4, 4, 3, 5], [2, 4, 3, 5]]
preds = classifier_knn.predict(sample)

for p in preds:
    pred_species = myIrisData.target_names[p]
    print("Predictions:", pred_species)


Let's change the sample data and test again.

In [None]:
sample = [[5, 4, 3, 5], [3, 4, 2, 4]]
preds = classifier_knn.predict(sample)

for p in preds:
    pred_species = myIrisData.target_names[p]
    print("Predictions:", pred_species)

Obviously, in practice no one trains the model again for every prediction. The trained model needs to be stored so that it can be reused.
There is a way to save the model and load it again later.

Let's see how we can do this.


In [None]:
import joblib
joblib.dump(classifier_knn, 'iris_classifier_knn.joblib')

The model will be saved as a file with the extension .joblib and can be loaded again.

In [None]:
joblib.load('iris_classifier_knn.joblib')
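
To confirm that a restored model behaves like the original, the following sketch saves a classifier to a temporary directory, loads it back and compares the predictions. The file name `iris_knn_demo.joblib` and the variable names are chosen for this illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
model = KNeighborsClassifier(n_neighbors=3).fit(iris.data, iris.target)

# Save to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "iris_knn_demo.joblib")  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should give exactly the same predictions as the original.
same_predictions = np.array_equal(model.predict(iris.data), restored.predict(iris.data))
print("Restored model agrees with original:", same_predictions)

sample = [[5.1, 3.5, 1.4, 0.2]]  # measurements of the first iris sample
print("Prediction for sample:", iris.target_names[restored.predict(sample)])
```

The object returned by `joblib.load` should be assigned to a variable, as above, so that it can be used for prediction.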

Now the next step is to learn about linear regression.

# Linear Regression

This is a supervised ML model. It is one of the most widely used statistical models for studying the relationship between a dependent variable (Y) and a given set of independent variables (X). In other words, it is used when the output variable is continuous and follows a linear relation with the independent variables. For example, it can be used to forecast sales in the coming months by analyzing the sales data for previous months.

Sklearn makes it easy to implement a linear regression model.

### Dataset

For linear regression, let's load the Boston Housing dataset. There are 506 samples and 13 feature variables in this dataset. The objective is to predict the house prices using the given features.

Scikit-learn provides this dataset in its datasets module through load_boston. Note that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so the cells below require an older release.

In [None]:
boston = datasets.load_boston()
print(boston.DESCR)

In [None]:
print(boston.data.shape)
print(boston.feature_names)

Let's better understand our data in pandas.

In [None]:
boston_df = pd.DataFrame(boston.data, columns = boston.feature_names)
boston_df['PRICE'] = boston.target
boston_df.head()

Now we will create the independent (X) and dependent (y) variables from our data.

In [None]:
X = boston.data
y = boston.target

Split our dataset into train and test parts.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, random_state=5)

In [None]:
from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()

regression_model.fit(X_train, Y_train)

y_predicted = regression_model.predict(X_test)

## Model Evaluation

Model evaluation for regression is different from model evaluation for classification.

In the case of regression, we compute an error score that measures how far the predicted values are from the actual ones. There are different error scores which we can compute to evaluate our model. Let's explore a few of them.

### Mean Squared Error:

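The MSE is a measure of the quality of an estimator: it is the average of the squared differences between the predicted and the actual values. Because it is built from squared differences, it is never negative, and it decreases toward zero as the predictions approach the actual values. Its square root, the RMSE, is expressed in the same units as the target.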

In [None]:
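import numpy as np
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y_test, y_predicted)
rmse = np.sqrt(mse)  # root of the MSE, so the variable name matches the value it holds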

In [None]:
rmse
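
As a worked example of the definition, the sketch below computes the MSE and RMSE by hand for four hypothetical values (`y_true` and `y_hat` are made up for illustration) and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values, used only for illustration.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

# Squared errors: 0.25, 0.0, 2.25, 1.0 -> their mean is 0.875.
manual_mse = np.mean((y_true - y_hat) ** 2)
manual_rmse = np.sqrt(manual_mse)

print("MSE: ", manual_mse, mean_squared_error(y_true, y_hat))
print("RMSE:", manual_rmse)
```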

### r2_score:

The best possible score is 1.0, and the score can be negative, because a model can be arbitrarily worse than a constant prediction. A constant model that always predicts the expected value of y, disregarding the input features, would get a score of 0.0.

It represents the proportion of the variance of y (the dependent variable) that is explained by the independent variables in the model. It provides an indication of goodness of fit and therefore a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance.

#### Syntax:

sklearn.metrics.r2_score(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')
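
The score can be written as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on hypothetical values (`y_true` and `y_hat` are made up for illustration) and confirms that a constant prediction of the mean scores 0.0.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values, used only for illustration.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
manual_r2 = 1 - ss_res / ss_tot

# A model that always predicts the mean of y explains none of the variance.
constant_prediction = np.full_like(y_true, y_true.mean())
constant_r2 = r2_score(y_true, constant_prediction)

print("R2 by hand:     ", manual_r2)
print("R2 from sklearn:", r2_score(y_true, y_hat))
print("R2 of constant: ", constant_r2)
```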

In [None]:
from sklearn.metrics import r2_score
r2 = r2_score(Y_test, y_predicted)

In [None]:
r2

In [None]:
print("Model performance: ")
print(f'RMSE is {rmse}')
print(f'R2 score is {r2}')
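
On scikit-learn 1.2 or later, where load_boston is no longer available, the same workflow can be sketched with the built-in diabetes dataset. This is a substitute dataset chosen for illustration, not the one used above, so the scores are not comparable; the variable names with the `_d` suffix are also specific to this sketch.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Substitute regression dataset that ships with current scikit-learn releases.
diabetes = datasets.load_diabetes()
X_d, y_d = diabetes.data, diabetes.target

X_d_train, X_d_test, y_d_train, y_d_test = train_test_split(
    X_d, y_d, test_size=0.2, random_state=5
)

model_d = LinearRegression().fit(X_d_train, y_d_train)
y_d_pred = model_d.predict(X_d_test)

mse_d = mean_squared_error(y_d_test, y_d_pred)
rmse_d = np.sqrt(mse_d)  # same units as the target
r2_d = r2_score(y_d_test, y_d_pred)

print(f"MSE is {mse_d}")
print(f"RMSE is {rmse_d}")
print(f"R2 score is {r2_d}")
```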

# Clustering

Cluster analysis is an <b>unsupervised</b> technique used in machine learning that attempts to find clusters of observations within a dataset.

The goal of cluster analysis is to find clusters such that the observations within each cluster are quite similar to each other, while observations in different clusters are quite different from each other.

### Dataset

For clustering we will generate dummy data. Scikit-learn provides the <b>make_blobs</b> function in its <b>datasets</b> module to generate dummy data for clustering and classification.

In [None]:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=150, centers=3, n_features=3,
                  random_state=0)
print(X.shape)
print(y)

Here we have generated 150 data points with 3 features. This dataset contains 3 clusters (centers), and we know this for a fact because we generated the dataset ourselves. But in real-world data, we usually do not know how many clusters there are.

We will use the K-means clustering algorithm to cluster our data. Scikit-learn supports several clustering algorithms. We will import the K-means algorithm from sklearn.cluster.

In [None]:
from sklearn import cluster

kmeans = cluster.KMeans(n_clusters=2) #asking the algorithm to make 2 clusters in our data
kmeans.fit(X)

In [None]:
kmeans.labels_

In the above code, we asked the algorithm to create 2 clusters in our data. The algorithm doesn't know how many clusters there really are in the data; it creates as many clusters as we ask for.

Here 0 and 1 are not class labels: they are arbitrary identifiers of the cluster each sample was assigned to.
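
Because the cluster identifiers are arbitrary, comparing them element by element with known groups can be misleading. The sketch below uses two hypothetical label vectors that describe the same grouping with swapped identifiers, and compares them with `adjusted_rand_score`, a scikit-learn metric that ignores how the clusters are numbered.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical example: the same grouping, but with the identifiers swapped.
true_groups = np.array([0, 0, 0, 1, 1, 1])
cluster_ids = np.array([1, 1, 1, 0, 0, 0])

raw_agreement = np.mean(true_groups == cluster_ids)  # element-wise match: 0.0
ari = adjusted_rand_score(true_groups, cluster_ids)  # grouping-wise match: 1.0

print("Element-wise agreement:", raw_agreement)
print("Adjusted Rand index:   ", ari)
```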

Now let's try with 3 clusters and more.

In [None]:
kmeans = cluster.KMeans(n_clusters=3) #asking the algorithm to make 3 clusters in our data
kmeans.fit(X)
kmeans.labels_

In [None]:
kmeans = cluster.KMeans(n_clusters=5) #asking the algorithm to make 5 clusters in our data
kmeans.fit(X)
kmeans.labels_

This is how we can perform cluster analysis on our data. Clustering algorithms are hard to evaluate, because real data usually has no ground-truth labels to compare the clusters against.
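
Without labels, one common option is to compare different numbers of clusters using internal measures. The following is a sketch, not a definitive recipe: it records the K-means inertia (the within-cluster sum of squared distances) and the silhouette score for several values of k. Here `n_init=10` and `random_state=0` are assumptions added to make the runs repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_blobs, _ = make_blobs(n_samples=150, centers=3, n_features=3, random_state=0)

inertias = {}
silhouettes = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_blobs)
    inertias[k] = km.inertia_                                # lower means tighter clusters
    silhouettes[k] = silhouette_score(X_blobs, km.labels_)   # ranges from -1 to 1, higher is better
    print(f"k={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")
```

Inertia generally keeps falling as k grows, so it is usually read by looking for the point where the improvement levels off, while the silhouette score can be compared directly across values of k.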