# Features of scikit-learn
Rather than focusing on loading, manipulating and summarising data, the scikit-learn library focuses on modelling data. Some of the most popular groups of models and tools provided by Sklearn are as follows −

`Supervised Learning algorithms` − Almost all the popular supervised learning algorithms, such as Linear Regression, Support Vector Machine (SVM) and Decision Tree, are part of scikit-learn.

`Unsupervised Learning algorithms` − It also includes the popular unsupervised learning algorithms, ranging from clustering, factor analysis and PCA (Principal Component Analysis) to unsupervised neural networks.

`Clustering` − These models are used for grouping unlabeled data.

`Cross Validation` − It is used to check the accuracy of supervised models on unseen data.

`Dimensionality Reduction` − It is used for reducing the number of attributes in data, which can then be used for summarisation, visualisation and feature selection.

`Ensemble methods` − As the name suggests, these are used for combining the predictions of multiple supervised models.

`Feature extraction` − It is used to extract features from raw data, such as image and text data, and define them as attributes.

`Feature selection` − It is used to identify useful attributes to create supervised models.

`Open Source` − It is an open source library and is also commercially usable under the BSD license.
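
As a rough illustration of the cross validation item above, the sketch below scores a KNN classifier on the Iris data with 5-fold cross validation. The choice of model, `n_neighbors=3` and `cv=5` are assumptions made for this example, not recommendations from these notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross validation: the model is trained on 4 folds and scored on the 5th,
# rotating until every fold has served as the held-out set once.
model = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(model, X, y, cv=5)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

Each entry in `scores` is the accuracy on one held-out fold, so the mean gives an estimate of how the model behaves on unseen data.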

## Key words
* Dataset - a collection of data
* Features - the variables of the data (also called predictors, inputs or attributes)
    - Feature names - the list of all the feature names.
    - Feature matrix - the collection of features, when there is more than one.


* Response - the output variable, which depends on the feature variables (also called target, label or output)
    - Target names - the possible values taken by a response vector.
    - Response vector - represents the response column. Generally, there is just one response column.
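
To make these terms concrete, here is a small sketch that maps each key word onto an attribute of the Iris dataset object. It only inspects shapes and names; nothing here is specific to any model.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix: one row per sample, one column per feature
print("Feature matrix shape:", iris.data.shape)
# Response vector: one target value per sample
print("Response vector shape:", iris.target.shape)
# Feature names label the columns; target names label the response values
print("Feature names:", iris.feature_names)
print("Target names:", list(iris.target_names))
```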



## Loading IRIS dataset


In [1]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print(f"Feature names: {feature_names}")
print(f"Target names: {target_names}")
print("\n First 10 rows of X:\n",X[:10])

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

 First 10 rows of X:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]


In [4]:
from sklearn.datasets import load_breast_cancer
dbc = load_breast_cancer()
X = dbc.data
y = dbc.target
print(dbc.feature_names)
print(dbc.target_names)
print(f"First 10 rows of X\n{X[:10]}")

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
['malignant' 'benign']
First 10 rows of X
[[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
  1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
  6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
  1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
  4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
  7.017e-02 1.812e-0

## Splitting the dataset
* used to check the accuracy of our model on data it has not seen during training
* the dataset is split into two sets
    - Training set
    - Testing set
    

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


(105, 4)
(45, 4)
(105,)
(45,)



<pre>Here:
- X = feature matrix (collection of features)
- y = response vector (represents the response column; generally there is just one response column)</pre>
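
One optional refinement, sketched here as an assumption rather than something the cells in these notes use, is the `stratify` parameter of `train_test_split`. It keeps the class proportions of `y` the same in the training and testing sets, which can matter for small or imbalanced datasets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y asks train_test_split to keep the class proportions of y
# in both the training set and the testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Class counts in y_train:", np.bincount(y_train))
print("Class counts in y_test: ", np.bincount(y_test))
```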

In [None]:
from sklearn.datasets import load_breast_cancer
dbc = load_breast_cancer()

X = dbc.data
y = dbc.target

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(455, 30)
(114, 30)
(455,)
(114,)


## Train the Model
* using KNN classifier

In [32]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()
X = iris.data 
y= iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
classifier_knn = KNeighborsClassifier(n_neighbors=3)
classifier_knn.fit(X_train, y_train)
y_pred = classifier_knn.predict(X_test)
# Compute accuracy by comparing the actual response values (y_test) with the predicted ones (y_pred)
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))

# Provide sample data and let the model make predictions for it
sample = [[5, 5, 3, 2], [2, 4, 3, 5]]
preds = classifier_knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("predictions:", pred_species)

Accuracy:  0.9833333333333333
predictions: [np.str_('setosa'), np.str_('virginica')]
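
The value `n_neighbors=3` above is one choice among many. As a minimal sketch, the loop below repeats the same train and test steps for a few values of `n_neighbors` and prints the accuracy of each; the candidate values are arbitrary picks for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Fit one classifier per candidate k and record its accuracy on the test set
accuracy_by_k = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracy_by_k[k] = metrics.accuracy_score(y_test, knn.predict(X_test))

for k, acc in accuracy_by_k.items():
    print(f"n_neighbors={k}: accuracy={acc:.3f}")
```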


## Model Persistence
* A model should be persisted for future use so that we do not need to retrain it again and again. This can be done with the `dump` and `load` functions of the `joblib` package.

<pre>
import joblib  # sklearn.externals.joblib was removed from scikit-learn; use the standalone joblib package
joblib.dump(classifier_knn, 'iris_classifier_knn.joblib')
</pre>

* `iris_classifier_knn.joblib` is the name of the file the model is saved to
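
A full round trip could look like the sketch below: fit a model, save it with `joblib.dump`, load it back with `joblib.load`, and check that the loaded copy predicts the same labels. The temporary directory and the variable name `loaded_knn` are assumptions made for this example.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
classifier_knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Save the fitted model to disk (a temporary directory is used here for illustration)
model_path = os.path.join(tempfile.mkdtemp(), "iris_classifier_knn.joblib")
joblib.dump(classifier_knn, model_path)

# Later, load it back and predict without retraining
loaded_knn = joblib.load(model_path)
same_predictions = (loaded_knn.predict(X) == classifier_knn.predict(X)).all()
print("Loaded model gives the same predictions:", same_predictions)
```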

## Preprocessing the data
* converting raw data into a form that a model can use
* sklearn provides the `preprocessing` package for this

> Binarisation
* used to convert numerical values into boolean values.

In [46]:
import numpy as np
from sklearn import preprocessing
input_data = np.array(
   [[2.1, -1.9, 5.5],
   [-1.5, 2.4, 3.5],
   [0.5, -7.9, 5.6],
   [5.9, 2.3, -5.8]]
)
data_binarized = preprocessing.Binarizer(threshold=0.5).transform(input_data)
print("\nBinarized data:\n", data_binarized)


Binarized data:
 [[1. 0. 1.]
 [0. 1. 1.]
 [0. 0. 1.]
 [1. 1. 0.]]


In the above example, we used a `threshold` value of 0.5, so every value greater than 0.5 is converted to 1 and every value less than or equal to 0.5 is converted to 0. This is why the 0.5 in the third row becomes 0.
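
The thresholding rule can be reproduced with a plain NumPy comparison, which makes the strict "greater than" explicit. This is a sketch for checking the rule, not a replacement for `Binarizer`.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array(
   [[2.1, -1.9, 5.5],
   [-1.5, 2.4, 3.5],
   [0.5, -7.9, 5.6],
   [5.9, 2.3, -5.8]]
)

# Strictly greater than the threshold -> 1.0, otherwise -> 0.0
manual = (input_data > 0.5).astype(float)
from_sklearn = preprocessing.Binarizer(threshold=0.5).fit_transform(input_data)

print("Same result as Binarizer:", np.array_equal(manual, from_sklearn))
```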

> Mean Removal

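* This technique removes the mean from each feature so that every feature is centered on zero. `preprocessing.scale` also divides each feature by its standard deviation, so every feature ends up with unit variance.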

In [48]:
import numpy as np
from sklearn import preprocessing
input_data = np.array(
   [[2.1, -1.9, 5.5],
   [-1.5, 2.4, 3.5],
   [0.5, -7.9, 5.6],
   [5.9, 2.3, -5.8]]
)

# Display the mean and the standard deviation of the input data
print("Mean =", input_data.mean(axis=0))
print("Stddeviation = ", input_data.std(axis=0))
# Remove the mean and scale each feature to unit standard deviation

data_scaled = preprocessing.scale(input_data)
print("Mean_removed =", data_scaled.mean(axis=0))
print("Stddeviation_removed =", data_scaled.std(axis=0))

Mean = [ 1.75  -1.275  2.2  ]
Stddeviation =  [2.71431391 4.20022321 4.69414529]
Mean_removed = [1.11022302e-16 0.00000000e+00 0.00000000e+00]
Stddeviation_removed = [1. 1. 1.]
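
What `preprocessing.scale` computes can be written out by hand: subtract the column mean, then divide by the column standard deviation. The sketch below compares the two; `data_manual` is a name introduced only for this example.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array(
   [[2.1, -1.9, 5.5],
   [-1.5, 2.4, 3.5],
   [0.5, -7.9, 5.6],
   [5.9, 2.3, -5.8]]
)

# Standardise each column: (x - column mean) / column standard deviation
data_manual = (input_data - input_data.mean(axis=0)) / input_data.std(axis=0)
data_scaled = preprocessing.scale(input_data)

print("Matches preprocessing.scale:", np.allclose(data_manual, data_scaled))
```

The tiny non-zero mean in the output above (about `1e-16`) is floating-point rounding, not a leftover offset.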


> Scaling
* This preprocessing technique rescales the feature vectors. Scaling matters because no feature should be artificially large or small compared with the others.

In [49]:
import numpy as np
from sklearn import preprocessing
input_data = np.array(
   [
      [2.1, -1.9, 5.5],
      [-1.5, 2.4, 3.5],
      [0.5, -7.9, 5.6],
      [5.9, 2.3, -5.8]
   ]
)
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0,1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print("\nMin max scaled data:\n", data_scaled_minmax)


Min max scaled data:
 [[0.48648649 0.58252427 0.99122807]
 [0.         1.         0.81578947]
 [0.27027027 0.         1.        ]
 [1.         0.99029126 0.        ]]
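
For `feature_range=(0, 1)`, min-max scaling maps each column's minimum to 0 and its maximum to 1. A hand-written version of that formula, given here as a sketch for comparison, is:

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array(
   [[2.1, -1.9, 5.5],
   [-1.5, 2.4, 3.5],
   [0.5, -7.9, 5.6],
   [5.9, 2.3, -5.8]]
)

# (x - column min) / (column max - column min)
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)
data_manual_minmax = (input_data - col_min) / (col_max - col_min)

data_scaled_minmax = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit_transform(input_data)
print("Matches MinMaxScaler:", np.allclose(data_manual_minmax, data_scaled_minmax))
```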


> Normalisation
* used to modify the feature vectors
* necessary so that the feature vectors can be measured on a common scale.
* it is of two types
    - L1 Normalisation
    - L2 Normalisation

>> L1 Normalisation
* based on the L1 norm, the norm also used by Least Absolute Deviations.
* It rescales the values so that the absolute values in each row always sum to 1.

In [50]:
import numpy as np
from sklearn import preprocessing
input_data = np.array(
   [
      [2.1, -1.9, 5.5],
      [-1.5, 2.4, 3.5],
      [0.5, -7.9, 5.6],
      [5.9, 2.3, -5.8]
   ]
)
data_normalized_l1 = preprocessing.normalize(input_data, norm='l1')
print("\nL1 normalized data:\n", data_normalized_l1)


L1 normalized data:
 [[ 0.22105263 -0.2         0.57894737]
 [-0.2027027   0.32432432  0.47297297]
 [ 0.03571429 -0.56428571  0.4       ]
 [ 0.42142857  0.16428571 -0.41428571]]
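
The row-wise rule can be checked by hand: divide every row by the sum of its absolute values. The sketch below does that and compares it with `preprocessing.normalize`; `manual_l1` is a name used only in this example.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array(
   [[2.1, -1.9, 5.5],
   [-1.5, 2.4, 3.5],
   [0.5, -7.9, 5.6],
   [5.9, 2.3, -5.8]]
)

# Divide each row by the sum of the absolute values in that row
row_l1_norms = np.abs(input_data).sum(axis=1, keepdims=True)
manual_l1 = input_data / row_l1_norms

data_normalized_l1 = preprocessing.normalize(input_data, norm='l1')
print("Matches normalize(norm='l1'):", np.allclose(manual_l1, data_normalized_l1))
print("Sum of absolute values per row:", np.abs(manual_l1).sum(axis=1))
```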


>> L2 Normalisation
* based on the L2 norm, the norm also used by Least Squares.
* It rescales the values so that the squares of the values in each row always sum to 1.

In [51]:
import numpy as np
from sklearn import preprocessing
input_data = np.array(
   [
      [2.1, -1.9, 5.5],
      [-1.5, 2.4, 3.5],
      [0.5, -7.9, 5.6],
      [5.9, 2.3, -5.8]
   ]
)
data_normalized_l2 = preprocessing.normalize(input_data, norm='l2')
print("\nL2 normalized data:\n", data_normalized_l2)


L2 normalized data:
 [[ 0.33946114 -0.30713151  0.88906489]
 [-0.33325106  0.53320169  0.7775858 ]
 [ 0.05156558 -0.81473612  0.57753446]
 [ 0.68706914  0.26784051 -0.6754239 ]]
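
The same kind of check works for the L2 case: divide every row by its Euclidean length, after which the squares in each row sum to 1. This is a sketch with `manual_l2` as an illustrative name.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array(
   [[2.1, -1.9, 5.5],
   [-1.5, 2.4, 3.5],
   [0.5, -7.9, 5.6],
   [5.9, 2.3, -5.8]]
)

# Divide each row by its Euclidean (L2) norm
row_l2_norms = np.linalg.norm(input_data, axis=1, keepdims=True)
manual_l2 = input_data / row_l2_norms

data_normalized_l2 = preprocessing.normalize(input_data, norm='l2')
print("Matches normalize(norm='l2'):", np.allclose(manual_l2, data_normalized_l2))
print("Sum of squares per row:", (manual_l2 ** 2).sum(axis=1))
```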
