# Scikit-learn
**Scikit-learn** is an open-source Python library that implements a range of **machine learning, preprocessing, cross-validation and visualization** algorithms using a unified interface.

**Contents**   
[1. Basic Example](#1)  
&nbsp;&nbsp;&nbsp;[1.1 Import modules](#1.1)  
&nbsp;&nbsp;&nbsp;[1.2 Import data](#1.2)  
&nbsp;&nbsp;&nbsp;[1.3 Set features and target](#1.3)  
&nbsp;&nbsp;&nbsp;[1.4 Split train and test](#1.4)  
&nbsp;&nbsp;&nbsp;[1.5 Create model](#1.5)  
&nbsp;&nbsp;&nbsp;[1.6 Train model](#1.6)  
&nbsp;&nbsp;&nbsp;[1.7 Predict test](#1.7)  
&nbsp;&nbsp;&nbsp;[1.8 Calculate accuracy](#1.8)  
[2. Load data](#2)  
[3. Train and Test Data](#3)  
[4. Preprocessing Data](#4)  
&nbsp;&nbsp;&nbsp;[4.1 Standardization](#4.1)  
&nbsp;&nbsp;&nbsp;[4.2 Normalization](#4.2)  
&nbsp;&nbsp;&nbsp;[4.3 Binarization](#4.3)  
&nbsp;&nbsp;&nbsp;[4.4 Encoding Categorical Features](#4.4)  
&nbsp;&nbsp;&nbsp;[4.5 Imputing Missing Values](#4.5)  
&nbsp;&nbsp;&nbsp;[4.6 Generating Polynomial Features](#4.6)  
[5. Create Model](#5)  
&nbsp;&nbsp;&nbsp;[5.1 Supervised Learning Estimators](#5.1)  
&nbsp;&nbsp;&nbsp;[5.2 Unsupervised Learning Estimators](#5.2)  
[6. Fit Model](#6)  
[7. Prediction](#7)  
[8. Evaluate Model Performance](#8)  
&nbsp;&nbsp;&nbsp;[8.1 Classification Metrics](#8.1)  
&nbsp;&nbsp;&nbsp;[8.2 Regression Metrics](#8.2)  
&nbsp;&nbsp;&nbsp;[8.3 Clustering Metrics](#8.3)  
&nbsp;&nbsp;&nbsp;[8.4 Cross-Validation](#8.4)  
[9. Tune Model](#9)  
&nbsp;&nbsp;&nbsp;[9.1 Grid Search](#9.1)  
&nbsp;&nbsp;&nbsp;[9.2 Randomized Parameter Optimization](#9.2)  


## <a id="1">1. Basic Example </a>
### <a id="1.1">1.1 Import modules </a>

In [48]:
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

### <a id="1.2">1.2 Import data </a>

In [49]:
iris = datasets.load_iris()

### <a id="1.3">1.3 Set features and target </a>

In [50]:
X, y = iris.data[:,:2], iris.target

### <a id="1.4">1.4 Split train and test </a>

In [51]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=33)

### <a id="1.4">1.4 Preprocess data </a>

In [52]:
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

### <a id="1.5">1.5 Create model </a>

In [53]:
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

### <a id="1.6">1.6 Train model </a>

In [54]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

### <a id="1.7">1.7 Predict test </a>

In [55]:
y_pred = knn.predict(X_test)

### <a id="1.8">1.8 Calculate accuracy </a>

In [56]:
accuracy_score(y_test,y_pred)

0.631578947368421

## <a id="2">2. Load data </a>
Data need to be numeric and stored as NumPy arrays or SciPy sparse matrices. Other types that can be converted to numeric arrays, such as a pandas DataFrame, are also acceptable.

In [57]:
import numpy as np
X = np.random.random((10,5))  # 10 samples, 5 features
y = np.array(['M','M','F','F','F','M','F','M','M','F'])  # one label per sample
X[X<0.7]=0  # zero out small values so most entries are 0

## <a id="3">3. Train and Test Data </a>

In [58]:
X, y = iris.data[:,:2], iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=0)

## <a id="4">4. Preprocessing Data</a>
The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.
### <a id="4.1">4.1 Standardization</a>
Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

In [59]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)  # learn mean and std from the training set only
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)  # reuse the training statistics

print("Means:",standardized_X.mean(axis=0).round())
print("Std dev:",standardized_X.std(axis=0))

Means: [ 0. -0.]
Std dev: [1. 1.]
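
As a rough check of the centering and scaling described above, the sketch below reproduces what `StandardScaler` computes with plain NumPy. The array `X_demo` is made-up toy data, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 samples, 2 features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [4.0, 700.0]])

# fit() learns the per-feature mean and standard deviation
scaler = StandardScaler().fit(X_demo)
X_scaled = scaler.transform(X_demo)

# Same result by hand: subtract the mean, divide by the population std (ddof=0)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(scaler.mean_)                     # [  2.5 400. ]
print(np.allclose(X_scaled, X_manual))  # True
```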


### <a id="4.2">4.2 Normalization </a>
Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.

In [60]:
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
normalized_X=scaler.transform(X_train)
normalized_X_test=scaler.transform(X_test)

print("Min:", normalized_X.min().round(2))
print("Max:", normalized_X.max().round(2))

Min: 0.32
Max: 0.95
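
To make "unit norm" concrete, here is a small sketch on made-up data (`X_demo` is an assumption for illustration). Each row is divided by its own L2 norm, so the transformer learns nothing from the training set.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy data: each row is one sample
X_demo = np.array([[3.0, 4.0],
                   [1.0, 0.0],
                   [6.0, 8.0]])

# Normalizer is stateless: every row is rescaled independently
X_norm = Normalizer(norm="l2").fit_transform(X_demo)

print(X_norm)                          # rows: [0.6 0.8], [1. 0.], [0.6 0.8]
print(np.linalg.norm(X_norm, axis=1))  # every row now has length 1
```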


### <a id="4.3">4.3 Binarization </a>
Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multi-variate Bernoulli distribution.

In [61]:
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=0.0).fit(X)
binary_X = binarizer.transform(X)
print (binary_X.max())

1.0
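
The thresholding rule can be seen on a tiny made-up array (`X_demo` and the threshold of 0.5 are illustrative choices): values strictly greater than the threshold map to 1, everything else to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical toy data
X_demo = np.array([[0.2, 1.5, -0.3],
                   [2.0, 0.0, 0.9]])

# Values > 0.5 become 1.0, all others become 0.0
binary = Binarizer(threshold=0.5).fit_transform(X_demo)

print(binary)
```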


### <a id="4.4">4.4 Encoding Categorical Features </a>
Encode labels with value between 0 and n_classes-1.

In [70]:
from sklearn.preprocessing import LabelEncoder
y = np.array(['M','M','F','F','F','M','F','M','M','F','F','F','Q'])
print(np.unique(y))

['F' 'M' 'Q']


In [71]:
enc = LabelEncoder()
y = enc.fit_transform(y)
print(np.unique(y))

[0 1 2]
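
A short sketch of how the integer codes relate to the original labels, using a made-up label array. The codes follow the sorted order of the classes, and `inverse_transform` maps them back.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels
labels = np.array(['M', 'F', 'F', 'Q', 'M'])

enc = LabelEncoder()
codes = enc.fit_transform(labels)
decoded = enc.inverse_transform(codes)

print(enc.classes_)  # ['F' 'M' 'Q'], sorted; the position is the code
print(codes)         # [1 0 0 2 1]
print(decoded)       # original labels recovered
```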


### <a id="4.5">4.5 Imputing Missing Values </a>
Imputation transformer for completing missing values.

In [76]:
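from sklearn.impute import SimpleImputer  # replaces the removed preprocessing.Imputer
# Treat zeros as missing and replace them with the column mean
imp = SimpleImputer(missing_values=0, strategy='mean')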
np.unique(np.isnan(imp.fit_transform(X_train)))

array([False])
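
The following sketch shows mean imputation on a made-up array containing `NaN` entries. It assumes a recent scikit-learn release, where the imputer lives in `sklearn.impute` as `SimpleImputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with two missing entries
X_demo = np.array([[1.0, 2.0],
                   [np.nan, 4.0],
                   [5.0, np.nan]])

# Each NaN is replaced by the mean of the observed values in its column
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imp.fit_transform(X_demo)

print(X_filled)  # column means are 3.0 and 3.0
```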

### <a id="4.6">4.6 Generating Polynomial Features </a>
Generate polynomial and interaction features.

Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.

In [78]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(5)
polyX = poly.fit_transform(X)
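
To see which columns are generated, consider a single made-up sample `[a, b] = [2, 3]` and degree 2. The output columns are `1, a, b, a^2, a*b, b^2`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical single sample with two features a=2, b=3
X_demo = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_demo)

print(X_poly)                    # [[1. 2. 3. 4. 6. 9.]]
print(poly.n_output_features_)   # 6
```

The number of columns grows quickly with the degree and the number of input features, which is worth keeping in mind for a degree as high as 5.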

## <a id="5">5. Create Model </a>
### <a id="5.1">5.1 Supervised Learning Estimators </a>
#### <a id="5.1.1">5.1.1 Linear Regression </a>
From the implementation point of view, this is just plain Ordinary Least Squares (scipy.linalg.lstsq) wrapped as a predictor object.

In [80]:
from sklearn.linear_model import LinearRegression
# The `normalize` option was removed in recent releases; scale inputs with StandardScaler beforehand if needed
lr = LinearRegression()

#### <a id="5.1.2">5.1.2 Support Vector Machines (SVM) </a>

In [82]:
from sklearn.svm import SVC
svc = SVC(kernel='linear')

#### <a id="5.1.3">5.1.3 Naive Bayes </a>

In [83]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

#### <a id="5.1.4">5.1.4 KNN </a>

In [84]:
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

### <a id="5.2"> 5.2 Unsupervised Learning Estimators </a>
#### <a id="5.2.1"> 5.2.1 Principal Component Analysis (PCA) </a>
Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.

In [87]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
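
When `n_components` is a float between 0 and 1, PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch on the full iris feature matrix (the variable names are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris = load_iris().data  # 150 samples, 4 features

# Keep enough components to explain at least 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_iris)

print(X_reduced.shape)                       # (150, number of kept components)
print(pca.n_components_)                     # how many components were kept
print(pca.explained_variance_ratio_.sum())   # at least 0.95
```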

#### <a id="5.2.2"> 5.2.2 K Means </a>
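K-Means clustering partitions the samples into `n_clusters` groups by minimizing the within-cluster sum of squared distances to each cluster center.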

In [89]:
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
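
As a rough illustration, the sketch below fits K-Means on six made-up points that form three well-separated pairs. The data and the explicit `n_init=10` are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three tight pairs of points
X_demo = np.array([[0.0, 0.0], [0.1, 0.2],
                   [5.0, 5.0], [5.1, 4.9],
                   [10.0, 0.0], [9.9, 0.2]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)
cluster_labels = km.labels_

print(cluster_labels)       # points in the same pair share a label
print(km.cluster_centers_)  # one center per pair
```

The numeric label assigned to each cluster is arbitrary; only the grouping is meaningful.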

## <a id="6">6. Fit Model </a>
### <a id="6.1">6.1 Supervised learning</a>
http://scikit-learn.org/stable/supervised_learning.html

In [91]:
lr.fit(X_train, y_train)
knn.fit(X_train, y_train)
svc.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

### <a id="6.2">6.2 Unsupervised Learning</a>
http://scikit-learn.org/stable/unsupervised_learning.html

In [None]:
k_means.fit(X_train)
pca_model = pca.fit_transform(X_train)

## <a id="7">7. Prediction </a>
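A minimal sketch of the prediction step, assuming a fitted classifier and a fitted clustering model similar to those created in section 5. The block is self-contained, so names such as `X_tr` and `X_te` are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Supervised estimator: predict class labels and class probabilities
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
y_pred = knn.predict(X_te)
y_proba = knn.predict_proba(X_te)  # one column per class

# Unsupervised estimator: assign each test sample to the nearest cluster center
k_means = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_tr)
cluster_pred = k_means.predict(X_te)

print(y_pred[:5])
print(y_proba[:2])
print(cluster_pred[:5])
```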

## <a id="8">8. Evaluate Model Performance </a>
### <a id="8.1">8.1 Classification Metrics </a>
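A small sketch of common classification metrics from `sklearn.metrics`, computed on made-up label vectors (`y_true` and `y_hat` are illustrative, not results from the models above).

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels for six samples
y_true = [0, 1, 2, 2, 1, 0]
y_hat = [0, 2, 2, 2, 1, 0]

acc = accuracy_score(y_true, y_hat)   # fraction of correct predictions (5 of 6)
cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class

print(acc)
print(cm)
print(classification_report(y_true, y_hat))  # precision, recall, F1 per class
```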

### <a id="8.2">8.2 Regression Metrics </a>
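A small sketch of common regression metrics from `sklearn.metrics`, using made-up target values (`y_true` and `y_hat` are illustrative).

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted targets
y_true = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_hat)  # mean of |error|
mse = mean_squared_error(y_true, y_hat)   # mean of error squared
r2 = r2_score(y_true, y_hat)              # 1 - SS_res / SS_tot

print(mae)  # 0.5
print(mse)  # 0.375
print(r2)   # roughly 0.949
```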

### <a id="8.3">8.3 Clustering Metrics </a>
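A small sketch of clustering metrics that compare a clustering against known labels. The label vectors are made up; the predicted labels are a pure renaming of the true ones, which these metrics treat as a perfect match.

```python
from sklearn.metrics import adjusted_rand_score, homogeneity_score, v_measure_score

# Hypothetical ground truth and a clustering that differs only in label names
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 2, 2]

ari = adjusted_rand_score(labels_true, labels_pred)
hom = homogeneity_score(labels_true, labels_pred)
vms = v_measure_score(labels_true, labels_pred)

print(ari, hom, vms)  # all 1.0: the scores ignore how clusters are numbered
```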

### <a id="8.4">8.4 Cross-Validation </a>
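A minimal sketch of k-fold cross-validation with `cross_val_score`, assuming a KNN classifier on the full iris data. The choice of 5 folds is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# The estimator is cloned and refitted on each of the 5 train/validation splits
scores = cross_val_score(knn, X, y, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across folds
```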

## <a id="9">9. Tune Model </a>
### <a id="9.1">9.1 Grid Search </a>
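A minimal sketch of an exhaustive grid search with `GridSearchCV`, assuming a KNN classifier on iris. The parameter grid is an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Every combination in the grid is evaluated with 5-fold cross-validation
param_grid = {"n_neighbors": [1, 3, 5, 7],
              "weights": ["uniform", "distance"]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)  # best combination found
print(grid.best_score_)   # its mean cross-validated accuracy
```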

### <a id="9.2">9.2 Randomized Parameter Optimization </a>
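A minimal sketch of randomized search with `RandomizedSearchCV`, which samples a fixed number of parameter settings instead of trying every combination. The distributions and `n_iter=5` are illustrative choices.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# n_neighbors is drawn from the integers 1..14, weights from the given list
param_dist = {"n_neighbors": randint(1, 15),
              "weights": ["uniform", "distance"]}
search = RandomizedSearchCV(KNeighborsClassifier(), param_dist,
                            n_iter=5, cv=5, random_state=0)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```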