# What is scikit-learn

Scikit-learn provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. The library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.

Scikit-learn includes the following features:


1. Supervised Learning algorithms − Almost all the popular supervised learning algorithms, such as Linear Regression, Support Vector Machine (SVM) and Decision Tree, are part of scikit-learn.

2. Unsupervised Learning algorithms − It also has the popular unsupervised learning algorithms, from clustering, factor analysis and PCA (Principal Component Analysis) to unsupervised neural networks.

3. Clustering − This model is used for grouping unlabeled data.

4. Cross Validation − It is used to check the accuracy of supervised models on unseen data.

5. Dimensionality Reduction − It is used for reducing the number of attributes in data which can be further used for summarisation, visualisation and feature selection.

6. Ensemble methods − As the name suggests, these are used for combining the predictions of multiple supervised models.

7. Feature extraction − It is used to extract features from raw data, such as images and text, in order to define the attributes.

8. Feature selection − It is used to identify useful attributes to create supervised models.

## Data set loading
Scikit-learn ships with a few example datasets, such as <strong>iris</strong> and <strong>digits</strong> for classification. Older releases also included the <strong>Boston house prices</strong> dataset for regression, which was removed in version 1.2.

The following code shows an example of loading iris dataset:

In [3]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nFirst 10 rows of X:\n", X[:10])
# As you can see, the data in the iris dataset is recorded using numpy arrays.

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

First 10 rows of X:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]


## Creating test and train sets
We can split the dataset into training and test sets. The following example splits the data in a 70:30 ratio:

In [17]:
from sklearn.model_selection import train_test_split
train_data, test_data, train_label, test_label = train_test_split(X, y, train_size=0.7, test_size=0.3)

In [18]:
print(len(X))
print(len(train_data))
print(len(test_data))

150
105
45


## Creating the model
Next, we can use our dataset to train a prediction model. Scikit-learn offers a wide range of machine learning (ML) algorithms. As an example, we use the k-nearest neighbors (KNN) classifier:

In [22]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(train_data, train_label)
result = knn_classifier.predict(test_data)
accuracy = metrics.accuracy_score(test_label , result)
print("Accuracy =" , accuracy)

Accuracy = 0.9111111111111111
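
Because the split is random, the accuracy above changes from run to run. As a minimal sketch (the `random_state=0` seed, the `stratify=y` option and the 5-fold setting are illustrative assumptions, not part of the original cells), the split can be made reproducible and the estimate steadied with cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Assumption: a fixed seed and stratified sampling, chosen only for illustration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("Hold-out accuracy:", knn.score(X_test, y_test))

# 5-fold cross-validation: each fold is used once as the test set.
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print("Accuracy per fold:", cv_scores)
print("Mean accuracy:", cv_scores.mean())
```

The mean over the folds is usually a more stable estimate than the score from a single split.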


## Model Persistence
Once a model is trained, it is desirable to persist it for future use so that we do not need to retrain it again and again. This can be done with the dump and load functions of the joblib package.

In [24]:
import joblib
joblib.dump(knn_classifier, "knn_classifier.joblib")

['knn_classifier.joblib']

In [25]:
# We can load the saved model using the following method:
loaded_classifier = joblib.load("knn_classifier.joblib")
loaded_classifier
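
To check that a saved model really behaves like the original, one option is to compare predictions after a round trip. The sketch below is self-contained. It trains a fresh classifier and writes to a temporary directory, both of which are assumptions made for illustration:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Assumption: a temporary directory is used so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "knn_classifier.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions match:", same_predictions)
```

Since joblib relies on pickle, it is safest to load only files that come from a trusted source.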

## Preprocessing data
Scikit-learn has a package named preprocessing for preparing data before modeling. It offers the following techniques:
1. Binarisation
2. Mean removal
3. Scaling
4. Normalisation

In [35]:
# Binarisation: This preprocessing technique is used when we need to convert our numerical values into Boolean values.
from sklearn import preprocessing
data = [[2.1, -1.9, 5.5],
   [-1.5, 2.4, 3.5],
   [0.5, -7.9, 5.6],
   [5.9, 2.3, -5.8]]
preprocessing.binarize(data,threshold=0.5)

array([[1., 0., 1.],
       [0., 1., 1.],
       [0., 0., 1.],
       [1., 1., 0.]])

In [38]:
# Mean removal: This technique removes the mean from each feature so that every feature is centered on zero.
# preprocessing.scale also scales each feature to unit variance.
import numpy as np
print("Mean = ", np.mean(data,axis=0))
print("Std = " , np.std(data,axis=0))

data_zero_centered = preprocessing.scale(data)

print(data_zero_centered)
print("Mean after scale = ", np.mean(data_zero_centered,axis=0))
print("Std after scale = ", np.std(data_zero_centered,axis=0))


Mean =  [ 1.75  -1.275  2.2  ]
Std =  [2.71431391 4.20022321 4.69414529]
[[ 0.12894603 -0.14880162  0.70300338]
 [-1.19735598  0.8749535   0.27694073]
 [-0.46052153 -1.57729713  0.72430651]
 [ 1.52893149  0.85114524 -1.70425062]]
Mean after scale =  [1.11022302e-16 0.00000000e+00 0.00000000e+00]
Std after scale =  [1. 1. 1.]
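
When the same centering has to be applied later to unseen data, the `StandardScaler` class may be more convenient than `preprocessing.scale`, because it remembers the mean and standard deviation it learned. A minimal sketch, where `new_row` is a made-up sample used only for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[2.1, -1.9, 5.5],
                  [-1.5, 2.4, 3.5],
                  [0.5, -7.9, 5.6],
                  [5.9, 2.3, -5.8]])
# Assumption: a made-up row standing in for unseen data.
new_row = np.array([[1.0, 0.0, 2.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
new_scaled = scaler.transform(new_row)      # reuses the learned statistics

print("Learned mean:", scaler.mean_)
print("Learned std: ", scaler.scale_)
print("Scaled new row:", new_scaled)
```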


In [40]:
# Scaling: We use this preprocessing technique to rescale each feature to a fixed range, here [-1, 1].
# Scaling of feature vectors is important because no feature should be artificially large or small compared with the others.
min_max = preprocessing.MinMaxScaler(feature_range=(-1,1))
data_scaled = min_max.fit_transform(data_zero_centered)
print(data_scaled)

[[-0.02702703  0.16504854  0.98245614]
 [-1.          1.          0.63157895]
 [-0.45945946 -1.          1.        ]
 [ 1.          0.98058252 -1.        ]]
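
Min-max scaling maps each column with `(x - min) / (max - min) * (hi - lo) + lo`, where `(lo, hi)` is the `feature_range`. The sketch below reproduces this by hand with NumPy and compares it with `MinMaxScaler`. It works on the raw `data` values rather than the zero-centered ones, which is an illustrative simplification:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[2.1, -1.9, 5.5],
                 [-1.5, 2.4, 3.5],
                 [0.5, -7.9, 5.6],
                 [5.9, 2.3, -5.8]])
lo, hi = -1, 1

# Manual min-max scaling, column by column.
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual = (data - col_min) / (col_max - col_min) * (hi - lo) + lo

from_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(data)
matches = np.allclose(manual, from_sklearn)
print("Manual result matches MinMaxScaler:", matches)
```

Standardizing a column first only shifts and rescales it by a positive factor, so it should not change the min-max result.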


In [49]:
# Normalisation: We use this preprocessing technique for modifying the feature vectors.
# Normalisation of feature vectors is necessary so that they can be measured on a common scale.

# L1 normalisation: the absolute values in each row sum to 1.
l1_data = preprocessing.normalize(data_scaled, norm="l1")
print(l1_data)
print()

# L2 normalisation: the squares of the values in each row sum to 1.
l2_data = preprocessing.normalize(data_scaled, norm="l2")
print(l2_data)
print(l2_data)

[[-0.0230109   0.14052285  0.83646625]
 [-0.38        0.38        0.24      ]
 [-0.18681319 -0.40659341  0.40659341]
 [ 0.33550489  0.32899023 -0.33550489]]

[[-0.02711951  0.16561329  0.98581782]
 [-0.64564628  0.64564628  0.4077766 ]
 [-0.30898878 -0.67250499  0.67250499]
 [ 0.58108685  0.56980361 -0.58108685]]
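
The two row-wise properties can be checked directly. This is a small sketch on toy data (the `rows` array is an assumption chosen so the arithmetic is easy to follow):

```python
import numpy as np
from sklearn.preprocessing import normalize

# Assumption: toy rows chosen for easy arithmetic.
rows = np.array([[3.0, -4.0],
                 [1.0, 1.0]])

l1 = normalize(rows, norm="l1")
l2 = normalize(rows, norm="l2")

l1_row_sums = np.abs(l1).sum(axis=1)  # sum of absolute values per row
l2_row_sums = (l2 ** 2).sum(axis=1)   # sum of squares per row

print("L1 row sums:", l1_row_sums)
print("L2 row sums:", l2_row_sums)
```

For the first row, the L2 norm is 5, so the normalised row is `[0.6, -0.8]`.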


## Estimator API

All machine learning algorithms in scikit-learn are implemented via the Estimator API. An estimator is an object that learns from data (by fitting it). It can be a classification, regression or clustering algorithm, or even a transformer that extracts useful features from raw data.

Steps in using Estimator API:

Step 1: Choose a class of model
In this first step, we need to choose a class of model. It can be done by importing the appropriate Estimator class from Scikit-learn.

Step 2: Choose model hyperparameters
In this step, we need to choose the model's hyperparameters. It can be done by instantiating the class with the desired values.

Step 3: Arranging the data
Next, we need to arrange the data into a features matrix (X) and a target vector (y).

Step 4: Model Fitting
Now, we need to fit the model to our data. It can be done by calling the fit() method of the model instance.

Step 5: Applying the model
After fitting the model, we can apply it to new data. For supervised learning, use the predict() method to predict the labels for unknown data. For unsupervised learning, use predict() or transform() to infer properties of the data.
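
The five steps above can be sketched as follows. The choice of `DecisionTreeClassifier`, `PCA`, and their hyperparameter values is an assumption made for illustration. Any other estimator follows the same pattern:

```python
# Step 1: choose a class of model by importing it.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

# Step 2: choose hyperparameters by instantiating the class.
# Assumption: max_depth=3 and random_state=0 are illustrative values.
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# Step 3: arrange the data into a features matrix X and a target vector y.
X, y = load_iris(return_X_y=True)

# Step 4: fit the model to the data.
model.fit(X, y)

# Step 5 (supervised): predict labels for samples.
predicted = model.predict(X[:5])
print("Predicted labels:", predicted)

# Step 5 (unsupervised): a transformer is fitted the same way, then transform() is applied.
pca = PCA(n_components=2)
pca.fit(X)
X_2d = pca.transform(X)
print("Shape after PCA:", X_2d.shape)
```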

# Linear Modeling with scikit-learn

A linear model is an equation that describes a relationship between two quantities that show a constant rate of change. We represent linear relationships graphically with straight lines. A linear model is usually described by two parameters: the slope, often called the growth factor or rate of change, and the y-intercept, often called the initial value. Given the slope $m$ and the $y$-intercept $b$, the linear model can be written as a linear function $y = mx + b$. For example, we can represent the position of a car moving at a constant velocity with a linear model.
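
As a small illustration of the car example, the sketch below generates positions from made-up values of the slope and intercept (a velocity of 20 m/s and a starting position of 5 m, both assumptions) and then recovers them with a degree-1 fit:

```python
import numpy as np

# Assumption: made-up values, a car starting at 5 m and moving at 20 m/s.
m, b = 20.0, 5.0
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # time in seconds
position = m * t + b                      # y = mx + b

# Fitting a degree-1 polynomial recovers the slope and the intercept.
slope, intercept = np.polyfit(t, position, deg=1)
print("slope:", slope, "intercept:", intercept)
```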

The following list shows the various linear models provided by Scikit-Learn:

1. Linear Regression
2. Logistic Regression
3. Ridge Regression
4. Bayesian Ridge Regression
5. LASSO
6. Multi-task LASSO
7. Elastic-Net
8. Multi-task Elastic-Net


## Linear regression

Linear regression is a statistical model that studies the relationship between a dependent variable (Y) and a given set of independent variables (X).
<code>sklearn.linear_model.LinearRegression</code> is the class used to implement linear regression.

In [54]:
from sklearn.linear_model import LinearRegression
X = np.array([[1,1],[1,2],[2,2],[2,3]])
print("X = " , X)
y = np.dot(X,[1,2]) + 3
print("y = " ,y)
lr = LinearRegression(fit_intercept=True,copy_X=True,n_jobs=2 )
lr.fit(X,y)

X =  [[1 1]
 [1 2]
 [2 2]
 [2 3]]
y =  [ 6  8  9 11]


In [55]:
# Although it is not very useful, we can check the quality of the model (its R^2 score) on the training data.
lr.score(X,y)

1.0

In [56]:
# We can predict the output for any given value by using predict method:
lr.predict([[3,4]])

array([14.])
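
The fitted parameters are available as the `coef_` and `intercept_` attributes. Since the targets were generated as `y = 1*x1 + 2*x2 + 3`, the model should recover roughly those values. The sketch below repeats the fit so that it runs on its own and reproduces the prediction by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, [1, 2]) + 3  # y = 1*x1 + 2*x2 + 3

lr = LinearRegression().fit(X, y)
print("Coefficients:", lr.coef_)
print("Intercept:", lr.intercept_)

# The prediction for [3, 4] is the dot product with the coefficients plus the intercept.
manual_prediction = np.dot([3, 4], lr.coef_) + lr.intercept_
print("Manual prediction:", manual_prediction)
```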

## Logistic Regression

Logistic regression builds on linear regression, but it is used for predicting discrete values. In other words, logistic regression is a classification algorithm rather than a regression algorithm. Based on a given set of independent variables, it estimates a discrete value (0 or 1, yes/no, true/false).

It measures the relationship between the categorical dependent variable and one or more independent variables by estimating the probability of an event with the logistic function.

<code>sklearn.linear_model.LogisticRegression</code> is the class used to implement logistic regression.



In [63]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print("feature names = ", feature_names)
print("target names = " ,target_names)

feature names =  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
target names =  ['setosa' 'versicolor' 'virginica']


In [105]:
# creating training and test set
from sklearn.model_selection import train_test_split
train_data  , test_data , train_label , test_label = train_test_split(X,y, train_size=0.6 , test_size=0.4)
print("Train data :")
print(train_data[:10])

print("Train labels :")
print(train_label[:10])

print("Test data :")
print(test_data[:10])

print("Test labels :")
print(test_label[:10])


print("Train size = " , len(train_data))

print("Test size = " , len(test_data))

Train data :
[[5.6 2.9 3.6 1.3]
 [5.7 2.8 4.1 1.3]
 [5.  3.5 1.3 0.3]
 [6.8 2.8 4.8 1.4]
 [5.1 3.3 1.7 0.5]
 [7.7 2.6 6.9 2.3]
 [5.6 2.7 4.2 1.3]
 [5.8 2.7 5.1 1.9]
 [5.7 2.8 4.5 1.3]
 [6.4 2.8 5.6 2.2]]
Train labels :
[1 1 0 1 0 2 1 2 1 2]
Test data :
[[5.  3.5 1.6 0.6]
 [6.9 3.2 5.7 2.3]
 [5.3 3.7 1.5 0.2]
 [7.7 2.8 6.7 2. ]
 [6.5 3.  5.2 2. ]
 [6.1 3.  4.9 1.8]
 [4.3 3.  1.1 0.1]
 [7.6 3.  6.6 2.1]
 [5.2 3.4 1.4 0.2]
 [4.8 3.  1.4 0.3]]
Test labels :
[0 2 0 2 2 2 0 2 0 0]
Train size =  90
Test size =  60


In [117]:
from sklearn.linear_model import LogisticRegression
logr = LogisticRegression(n_jobs=2 , max_iter=1000)
logr.fit(train_data,train_label)
logr.score(train_data, train_label)

1.0

In [118]:
# An accuracy of 95% on the test set shows that logistic regression performs well on the iris dataset.
logr.score(test_data, test_label)

0.95
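
Because logistic regression estimates probabilities, the per-class probabilities behind each prediction can be inspected with `predict_proba`. A minimal sketch, assuming a fixed `random_state=0` so that the split is reproducible (the original cells use a random split):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Assumption: random_state=0 is an illustrative seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One row per sample, one column per class. Each row sums to 1.
proba = clf.predict_proba(X_test[:3])
labels = clf.predict(X_test[:3])
print("Class probabilities:\n", proba)
print("Predicted labels:", labels)
```

The predicted label for each sample is the class with the highest probability.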

## Ridge regression
In ridge regression we modify the loss function by adding a penalty, which makes the fit to the training data slightly worse. This way, we can avoid overfitting to the training data and achieve better predictions on test data.

In other words, linear regression minimizes the sum of the squared residuals, but in ridge regression we minimize the sum of the squared residuals + lambda * slope^2. With several features, the penalty is lambda times the sum of the squared coefficients.
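
The effect of the penalty can be seen by comparing coefficients. In scikit-learn's `Ridge` class the `alpha` parameter plays the role of lambda. The sketch below uses synthetic data and `alpha=10.0`, both assumptions chosen only to make the shrinkage visible:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Assumption: synthetic data generated for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=20)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha corresponds to lambda

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("Coefficient norms (OLS, ridge):", ols_norm, ridge_norm)
```

The ridge coefficients are pulled toward zero, and a larger `alpha` shrinks them further.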