# Scikit Learn (Sklearn)

Scikit-learn (sklearn) is a widely used, robust library for machine learning in Python. It provides efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, through a consistent interface. The library is largely written in Python and is built upon NumPy, SciPy and Matplotlib.

In [None]:
# installing sklearn

! pip install scikit-learn
# ! pip install -U scikit-learn

In [None]:
# importing sklearn
import sklearn

# importing a dataset from sklearn (iris dataset)
from sklearn.datasets import load_iris # flower dataset

# Load Dataset

In [None]:
# load data
iris = load_iris()

# feature values (X) and target labels (Y)
X = iris.data
Y = iris.target

# feature name and label name
feature_names = iris.feature_names
target_names = iris.target_names

print("Feature names:", feature_names)
print("Target names:", target_names)

print("\nFirst 10 rows of X:\n", X[:10])

# Train-test split

In [None]:
# importing train test split
from sklearn.model_selection import train_test_split

# creating a train-test split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1
)

# printing size of train and test data

print('X_train : ', X_train.shape)
print('X_test : ', X_test.shape)

print('Y_train : ', Y_train.shape)
print('Y_test : ', Y_test.shape)

# Preprocessing

### Binarization

This preprocessing technique converts numerical values into Boolean (0/1) values: every value greater than the threshold becomes 1, and every other value becomes 0.

In [None]:
import numpy as np
from sklearn import preprocessing

data = np.array(
  [[2.1, -1.9, 5.5],
   [-1.5, 2.4, 3.5],
   [0.5, -7.9, 5.6],
   [5.9, 2.3, -5.8]]
)

data_binarized = preprocessing.Binarizer(threshold=0.5).transform(data)
print("\nBinarized data:\n", data_binarized)
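To make the thresholding rule concrete, the sketch below compares `Binarizer` with a hand-written comparison on the same sample array. The names `threshold`, `binarized` and `manual` are illustrative only and are not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# same sample array as in the cell above
data = np.array(
    [[2.1, -1.9, 5.5],
     [-1.5, 2.4, 3.5],
     [0.5, -7.9, 5.6],
     [5.9, 2.3, -5.8]]
)

threshold = 0.5  # illustrative name
binarized = Binarizer(threshold=threshold).fit_transform(data)

# manual equivalent: only values strictly greater than the threshold map to 1
manual = (data > threshold).astype(float)

print("Binarizer output:\n", binarized)
print("Matches manual thresholding:", np.array_equal(binarized, manual))
```

Note that the value 0.5 in the third row equals the threshold, so it is mapped to 0 rather than 1.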

### Mean removal

This technique removes the mean from each feature so that every feature is centered on zero. By default, `preprocessing.scale` also scales each feature to unit variance.

In [None]:
# center each column on zero (and scale it to unit variance)
data_scaled = preprocessing.scale(data)

print("Mean removed data:\n", data_scaled)
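As a quick sanity check, the following sketch verifies that the scaled columns have a mean of roughly zero and a standard deviation of roughly one, and shows how to remove the mean without changing the variance by passing `with_std=False`. The variable names (`standardized`, `centered_only`, `manual`) are assumptions made for illustration.

```python
import numpy as np
from sklearn import preprocessing

# same sample array as in the binarization cell
data = np.array(
    [[2.1, -1.9, 5.5],
     [-1.5, 2.4, 3.5],
     [0.5, -7.9, 5.6],
     [5.9, 2.3, -5.8]]
)

# default behaviour: remove the mean and scale to unit variance, column by column
standardized = preprocessing.scale(data)
print("Column means:", standardized.mean(axis=0))
print("Column std devs:", standardized.std(axis=0))

# mean removal only: keep the original spread of each column
centered_only = preprocessing.scale(data, with_std=False)

# manual equivalent of the default behaviour
manual = (data - data.mean(axis=0)) / data.std(axis=0)
print("Matches manual formula:", np.allclose(standardized, manual))
```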

### Scaling

We use this preprocessing technique to rescale each feature to a fixed range. Scaling is important because no feature should be artificially large or small relative to the others.

In [None]:
# defining min max scaler parameters
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0,1))

# using it to scale data
data_scaled_minmax = data_scaler_minmax.fit_transform(data)

# printing the scaled data
print("\nMin max scaled data:\n", data_scaled_minmax)
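For the `(0, 1)` range used above, min-max scaling amounts to computing `(x - min) / (max - min)` for each column. The sketch below reproduces that by hand and compares it with `MinMaxScaler`; the names `scaled`, `col_min`, `col_max` and `manual` are illustrative and not taken from the notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# same sample array as in the binarization cell
data = np.array(
    [[2.1, -1.9, 5.5],
     [-1.5, 2.4, 3.5],
     [0.5, -7.9, 5.6],
     [5.9, 2.3, -5.8]]
)

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(data)

# manual equivalent, computed per column
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual = (data - col_min) / (col_max - col_min)

print("Scaled data:\n", scaled)
print("Matches manual formula:", np.allclose(scaled, manual))
```

After scaling, the smallest value in every column is 0 and the largest is 1.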

### Normalising
* L1 Normalisation:
It is sometimes referred to as Least Absolute Deviations. It rescales the values so that the sum of the absolute values in each row is exactly 1. The following example applies L1 normalisation to the input data.



In [None]:
data_normalized_l1 = preprocessing.normalize(data, norm='l1')
print("\nL1 normalized data:\n", data_normalized_l1)
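The L1 property can be checked directly: after normalisation, the absolute values in each row should sum to 1. This is a minimal sketch on the same sample array; `l1_normalized`, `row_abs_sums` and `manual` are illustrative names.

```python
import numpy as np
from sklearn import preprocessing

# same sample array as in the binarization cell
data = np.array(
    [[2.1, -1.9, 5.5],
     [-1.5, 2.4, 3.5],
     [0.5, -7.9, 5.6],
     [5.9, 2.3, -5.8]]
)

l1_normalized = preprocessing.normalize(data, norm='l1')

# each row's absolute values should add up to 1
row_abs_sums = np.abs(l1_normalized).sum(axis=1)
print("Row sums of absolute values:", row_abs_sums)

# manual equivalent: divide each row by the sum of its absolute values
manual = data / np.abs(data).sum(axis=1, keepdims=True)
print("Matches manual formula:", np.allclose(l1_normalized, manual))
```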

* L2 Normalisation:
It is sometimes referred to as Least Squares. It rescales the values so that the sum of the squares in each row is exactly 1. The following example applies L2 normalisation to the input data.

In [None]:
data_normalized_l2 = preprocessing.normalize(data, norm='l2')
print("\nL2 normalized data:\n", data_normalized_l2)
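The L2 property can be verified in the same way: after normalisation, the squares in each row should sum to 1, which means each row has unit Euclidean length. A minimal sketch, with `l2_normalized`, `row_sq_sums` and `manual` as illustrative names:

```python
import numpy as np
from sklearn import preprocessing

# same sample array as in the binarization cell
data = np.array(
    [[2.1, -1.9, 5.5],
     [-1.5, 2.4, 3.5],
     [0.5, -7.9, 5.6],
     [5.9, 2.3, -5.8]]
)

l2_normalized = preprocessing.normalize(data, norm='l2')

# each row's squared values should add up to 1
row_sq_sums = (l2_normalized ** 2).sum(axis=1)
print("Row sums of squares:", row_sq_sums)

# manual equivalent: divide each row by its Euclidean norm
manual = data / np.linalg.norm(data, axis=1, keepdims=True)
print("Matches manual formula:", np.allclose(l2_normalized, manual))
```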