In [1]:
from sklearn import datasets

#### Generator For Regression  
`make_regression` - produces regression targets as an optionally-sparse random linear combination of random features, with noise.  


In [2]:
datasets.make_regression

<function sklearn.datasets.samples_generator.make_regression(n_samples=100, n_features=100, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)>
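A minimal sketch of calling the generator, assuming illustrative values for the sample and feature counts (they are not from the notes above). With `coef=True` the ground-truth coefficients are returned as well, so the sparsity of the underlying linear model can be inspected.

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True additionally returns the coefficients of the underlying linear model.
X, y, coef = make_regression(
    n_samples=200, n_features=20, n_informative=5,
    noise=0.1, coef=True, random_state=0,
)
print(X.shape, y.shape)        # (200, 20) (200,)
print(np.count_nonzero(coef))  # only the informative features have non-zero weights
```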

`make_sparse_uncorrelated` - produces a target as a linear combination of four features with fixed coefficients  

In [3]:
datasets.make_sparse_uncorrelated

<function sklearn.datasets.samples_generator.make_sparse_uncorrelated(n_samples=100, n_features=10, random_state=None)>

`make_friedman1` - the target is related to the features by polynomial and sine transforms

In [4]:
datasets.make_friedman1

<function sklearn.datasets.samples_generator.make_friedman1(n_samples=100, n_features=10, noise=0.0, random_state=None)>
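The "polynomial and sine transforms" can be checked directly. The sketch below assumes the Friedman #1 formula as given in the scikit-learn docstring (it is not stated in these notes): only the first five features enter the target, and the remaining ones are pure distractors.

```python
import numpy as np
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=50, n_features=10, noise=0.0, random_state=0)

# Friedman #1 target (noise-free): only the first five columns of X are used.
y_manual = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 20 * (X[:, 2] - 0.5) ** 2
            + 10 * X[:, 3]
            + 5 * X[:, 4])
print(np.allclose(y, y_manual))
```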

`make_friedman2` - the target includes feature multiplication and reciprocation

In [5]:
datasets.make_friedman2

<function sklearn.datasets.samples_generator.make_friedman2(n_samples=100, noise=0.0, random_state=None)>

`make_friedman3` - similar to `make_friedman2`, with an arctan transformation on the target

In [6]:
datasets.make_friedman3

<function sklearn.datasets.samples_generator.make_friedman3(n_samples=100, noise=0.0, random_state=None)>

#### Generator For Manifold Learning  
`make_s_curve` - Generate an S curve dataset

In [7]:
datasets.make_s_curve

<function sklearn.datasets.samples_generator.make_s_curve(n_samples=100, noise=0.0, random_state=None)>

`make_swiss_roll` - Generate a swiss roll dataset 

In [8]:
datasets.make_swiss_roll

<function sklearn.datasets.samples_generator.make_swiss_roll(n_samples=100, noise=0.0, random_state=None)>
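As a quick illustration (sample count and noise level are arbitrary choices), both manifold generators return 3-D points together with the one-dimensional position of each point along the main dimension of the manifold, which is handy for coloring plots.

```python
from sklearn.datasets import make_s_curve, make_swiss_roll

# Each generator returns (points, position along the manifold).
X_roll, t_roll = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)
X_curve, t_curve = make_s_curve(n_samples=500, noise=0.05, random_state=0)

print(X_roll.shape, t_roll.shape)    # (500, 3) (500,)
print(X_curve.shape, t_curve.shape)  # (500, 3) (500,)
```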

#### Generator For Decomposition  
`make_low_rank_matrix` - Generate a mostly low rank matrix with bell-shaped singular values  

In [9]:
datasets.make_low_rank_matrix

<function sklearn.datasets.samples_generator.make_low_rank_matrix(n_samples=100, n_features=100, effective_rank=10, tail_strength=0.5, random_state=None)>

`make_sparse_coded_signal` - Generate a signal as a sparse combination of dictionary elements

In [10]:
datasets.make_sparse_coded_signal

<function sklearn.datasets.samples_generator.make_sparse_coded_signal(n_samples, n_components, n_features, n_nonzero_coefs, random_state=None)>

`make_spd_matrix` - Generate a random symmetric, positive-definite matrix 

In [11]:
datasets.make_spd_matrix

<function sklearn.datasets.samples_generator.make_spd_matrix(n_dim, random_state=None)>
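A short sanity check, with an arbitrary matrix size chosen for illustration: a symmetric positive-definite matrix should equal its transpose and have strictly positive eigenvalues.

```python
import numpy as np
from sklearn.datasets import make_spd_matrix

A = make_spd_matrix(n_dim=4, random_state=0)

is_symmetric = np.allclose(A, A.T)
min_eigenvalue = np.linalg.eigvalsh(A).min()  # eigvalsh is meant for symmetric matrices
print(is_symmetric, min_eigenvalue > 0)
```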

`make_sparse_spd_matrix` - Generate a sparse symmetric definite positive matrix

In [12]:
datasets.make_sparse_spd_matrix

<function sklearn.datasets.samples_generator.make_sparse_spd_matrix(dim=1, alpha=0.95, norm_diag=False, smallest_coef=0.1, largest_coef=0.9, random_state=None)>

### Datasets in svmlight / libsvm format 
svmlight/libsvm format: one sample per line, written as \<label\> \<feature-id\>:\<feature-value\> \<feature-id\>:\<feature-value\> ...  

In [13]:
datasets.load_svmlight_file

<function sklearn.datasets.svmlight_format.load_svmlight_file(f, n_features=None, dtype=<class 'numpy.float64'>, multilabel=False, zero_based='auto', query_id=False, offset=0, length=-1)>
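To make the format concrete, here is a small round-trip sketch. It writes a toy matrix (made up for this example) to an in-memory buffer with `dump_svmlight_file` and reads it back; a real workflow would typically pass a file path instead of a `BytesIO` object.

```python
from io import BytesIO

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X = np.array([[0.0, 1.5, 0.0],
              [2.0, 0.0, 3.0]])
y = np.array([1, -1])

buf = BytesIO()
dump_svmlight_file(X, y, buf, zero_based=True)  # only non-zero entries are written
print(buf.getvalue().decode())

buf.seek(0)
# n_features is given explicitly because trailing all-zero columns cannot be inferred.
X_loaded, y_loaded = load_svmlight_file(buf, n_features=3, zero_based=True)
print(X_loaded.toarray())  # the loader returns a scipy.sparse matrix
```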

### Loading From External Datasets  
1. pandas.io
2. scipy.io  
3. numpy/routines.io  
4. skimage.io / Imageio  
5. scipy.misc.imread  
6. scipy.io.wavfile.read  

### Downloading datasets from the mldata.org repository  
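`fetch_mldata` (deprecated in scikit-learn 0.20 and removed in 0.22, because mldata.org is no longer available; `fetch_openml` is the replacement)  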

In [14]:
datasets.fetch_mldata

<function sklearn.datasets.mldata.fetch_mldata(dataname, target_name='label', data_name='data', transpose_data=True, data_home=None)>

## Dataset Transformations  
Transformers that clean, reduce, expand, or generate feature representations  

In [15]:
from sklearn.pipeline import Pipeline

### Pipeline And FeatureUnion: Combining Estimators  
A Pipeline can be used to chain multiple estimators into one, which is useful when there is a fixed sequence of steps in processing the data

#### How to use  
`Pipeline(estimators)` - `estimators` is a list of (key, value) tuples, where key is the name of the step and value is the estimator object

In [16]:
from sklearn.svm import SVC
from sklearn.decomposition import PCA  
estimators = [('reduce_im', PCA()), ('clf', SVC())]  
pipe = Pipeline(estimators)
pipe

Pipeline(memory=None,
     steps=[('reduce_im', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

`make_pipeline(estimator1, estimator2, ...)` - shorthand for creating a pipeline; the step names are filled in automatically from the estimator types  

In [17]:
from sklearn.pipeline import make_pipeline  
pipe = make_pipeline(PCA(), PCA(), SVC())
pipe

Pipeline(memory=None,
     steps=[('pca-1', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('pca-2', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('svc', SVC(C=1.0, cache_size...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

The **steps/named_steps** attributes store the estimators

In [18]:
pipe.steps

[('pca-1',
  PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)),
 ('pca-2',
  PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)),
 ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False))]

In [19]:
pipe.named_steps

{'pca-1': PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
   svd_solver='auto', tol=0.0, whiten=False),
 'pca-2': PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
   svd_solver='auto', tol=0.0, whiten=False),
 'svc': SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False)}

Use the **estimator__parameter** syntax (step name, two underscores, parameter name) to access an estimator's parameters

In [20]:
pipe.set_params(svc__C=0)

Pipeline(memory=None,
     steps=[('pca-1', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('pca-2', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('svc', SVC(C=0, cache_size=2...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

The same syntax is used to define parameter grids for GridSearchCV  

In [21]:
from sklearn.model_selection import GridSearchCV
param_grid = {'pca-1__n_components': [2, 5, 10], 'svc__C': [0.1, 10, 100]}
grid_search = GridSearchCV(pipe, param_grid=param_grid)
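The cell above only constructs the search object. A self-contained sketch of actually running it is shown here, assuming the iris dataset and a small illustrative grid (neither is from the notes above); the step names `reduce_dim` and `clf` are chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])
# Keys follow the <step name>__<parameter name> convention.
param_grid = {
    'reduce_dim__n_components': [1, 2, 3],
    'clf__C': [0.1, 1, 10],
}

grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid_search.fit(X, y)  # fits the whole pipeline for every parameter combination

print(grid_search.best_params_)
print(round(grid_search.best_score_, 3))
```

After fitting, `grid_search.best_estimator_` is a pipeline refit on all of the data with the best parameters, so it can be used directly for prediction.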

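#### Performance  
Fitting transformers can be expensive. When the `memory` parameter is set, the pipeline caches each transformer after calling `fit`, so a transformer is not refit when its parameters and input data are identical to a previous call  
`Pipeline(..., memory=dirname_or_joblib_memory_object)`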

In [22]:
from tempfile import mkdtemp
from shutil import rmtree
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
cachedir = mkdtemp()
pipe = Pipeline(estimators, memory=cachedir)
rmtree(cachedir)

### FeatureUnion: Composite Feature Spaces  
`FeatureUnion` combines several transformer objects into a new transformer that combines their outputs  
During fitting, each transformer is fit to the data independently; at transform time their outputs are concatenated side by side into larger vectors  

In [23]:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import KernelPCA
estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
combined = FeatureUnion(estimators)
combined 

FeatureUnion(n_jobs=1,
       transformer_list=[('linear_pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca', KernelPCA(alpha=1.0, coef0=1, copy_X=True, degree=3, eigen_solver='auto',
     fit_inverse_transform=False, gamma=None, kernel='linear',
     kernel_params=None, max_iter=None, n_components=None, n_jobs=1,
     random_state=None, remove_zero_eig=False, tol=0))],
       transformer_weights=None)
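The concatenation is easiest to see from the output shape. In this sketch (the iris data and the component counts are illustrative assumptions), a 2-component PCA and a 3-component kernel PCA are combined, so each sample ends up with 2 + 3 = 5 features.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

X, _ = load_iris(return_X_y=True)

combined = FeatureUnion([
    ('linear_pca', PCA(n_components=2)),
    ('kernel_pca', KernelPCA(n_components=3, kernel='rbf')),
])
X_features = combined.fit_transform(X)
print(X_features.shape)  # (150, 5): PCA columns first, then KernelPCA columns
```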

### Feature Extraction  
The `sklearn.feature_extraction` module can be used to extract features, in a format supported by machine learning algorithms, from datasets consisting of formats such as text and images.  
**Difference from Feature Selection**: feature extraction transforms arbitrary data, such as text or images, into numerical features usable for machine learning. Feature selection is a machine learning technique applied to these features  

#### Loading Features From Dicts  
`DictVectorizer` - converts feature arrays represented as lists of standard Python dict objects to the NumPy/SciPy representation used by scikit-learn estimators; it implements `one-hot` coding for categorical features

In [24]:
>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Francisco', 'temperature': 18.},
... ]

>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()

>>> print(vec.fit_transform(measurements).toarray())

>>> print(vec.get_feature_names())

[[ 1.  0.  0. 33.]
 [ 0.  1.  0. 12.]
 [ 0.  0.  1. 18.]]
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']


It is also a useful representation transformation for training sequence classifiers in NLP models

In [25]:
>>> pos_window = [
...     {
...         'word-2': 'the',
...         'pos-2': 'DT',
...         'word-1': 'cat',
...         'pos-1': 'NN',
...         'word+1': 'on',
...         'pos+1': 'PP',
...     },
...     # in a real application one would extract many such dictionaries
... ]
>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> print(pos_vectorized)
>>> print(pos_vectorized.toarray())
>>> print(vec.get_feature_names())

  (0, 0)	1.0
  (0, 1)	1.0
  (0, 2)	1.0
  (0, 3)	1.0
  (0, 4)	1.0
  (0, 5)	1.0
[[1. 1. 1. 1. 1. 1.]]
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']


#### Feature Hashing  
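`FeatureHasher` - an implementation of feature hashing (the hashing trick): a hash function is applied to the features to determine their column index in the sample matrix directly  
Collisions: a signed hash function is used, so that colliding features are likely to cancel out rather than accumulate error  
Input: mappings (dicts), (feature, value) pairs, or strings  
Output: scipy.sparse matrices  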

In [26]:
from sklearn.feature_extraction import FeatureHasher
def token_features(token, part_of_speech):
    if token.isdigit():
        yield "numeric"
    else:
        yield "token={}".format(token.lower())
        yield "token,pos={},{}".format(token, part_of_speech)
    if token[0].isupper():
        yield "uppercase_initial"
    if token.isupper():
        yield "all_uppercase"
    yield "pos={}".format(part_of_speech)
raw_X = (token_features(tok, pos) for tok, pos in [('A', 5)])
hasher = FeatureHasher(input_type='string')
X = hasher.transform(raw_X)
print(X)

  (0, 729803)	-1.0
  (0, 740061)	1.0
  (0, 892359)	-1.0
  (0, 950346)	-1.0
  (0, 1002789)	-1.0
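For comparison with the string input above, here is a sketch of the mapping (dict) input type, which is the default. The toy dictionaries and the deliberately tiny `n_features=16` are assumptions for illustration; with so few columns, hash collisions become likely.

```python
from sklearn.feature_extraction import FeatureHasher

# The default input_type is 'dict': keys are feature names, values are numeric.
hasher = FeatureHasher(n_features=16)
docs = [
    {'dog': 1, 'cat': 2, 'elephant': 4},
    {'dog': 2, 'run': 5},
]
X = hasher.transform(docs)  # stateless: no fit is needed

print(X.shape)      # (2, 16)
print(X.toarray())  # signed values; the column index comes from the hash of each key
```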


### Text Feature Extraction  
Extract numerical features from text content - **Vectorization**   
Bag of words/Bag of n-grams Representation  
1. tokenizing  
2. counting  
3. normalizing  

#### Sparsity  
The words that appear in a single document are a very small subset of the whole vocabulary, so the resulting matrices are stored as sparse matrices to save memory and speed up computation

#### CountVectorizer  
Implement both tokenization and occurrence counting  

In [27]:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer()
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> print(X.shape)
>>> analyze = vectorizer.build_analyzer()
>>> analyze("This is a text document to analyze.") == (
...     ['this', 'is', 'text', 'document', 'to', 'analyze'])
>>> vectorizer.get_feature_names() == (
...     ['and', 'document', 'first', 'is', 'one',
...      'second', 'the', 'third', 'this'])
>>> print(vectorizer.vocabulary_.get('document'))
>>> vectorizer.transform(['Something completely new.']).toarray()

(4, 9)
1


array([[0, 0, 0, 0, 0, 0, 0, 0, 0]])

We can also extract 2-grams of words in order to preserve some of the local ordering information  

In [28]:
>>> bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
...                                     token_pattern=r'\b\w+\b', min_df=1)
>>> analyze = bigram_vectorizer.build_analyzer()
>>> analyze('Bi-grams are cool!') == (
...     ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])
>>> X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
>>> X_2

array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]],
      dtype=int64)

#### TF-IDF Term Weighting  
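Why? Some words are very frequent (e.g. “the”, “a”, “is” in English) and therefore carry very little meaningful information about the actual contents of the document  
What? 
> tf-idf(t, d) = tf(t, d) x idf(t)  
> tf means term-frequency; tf-idf means term-frequency times inverse document-frequency  
> With the default `smooth_idf=True`: idf(t) = log((1 + n_d) / (1 + df(d, t))) + 1  
> With `smooth_idf=False` (used in the cell below): idf(t) = log(n_d / df(d, t)) + 1  
> where n_d is the total number of documents and df(d, t) is the number of documents that contain term t  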

In [29]:
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> tfidf = transformer.fit_transform(X_2)
>>> print(transformer.idf_)
>>> print(tfidf.toarray())

[2.38629436 2.38629436 1.28768207 1.69314718 1.69314718 1.28768207
 1.69314718 2.38629436 2.38629436 2.38629436 2.38629436 2.38629436
 1.         1.69314718 2.38629436 2.38629436 2.38629436 2.38629436
 1.28768207 1.69314718 2.38629436]
[[0.         0.         0.28574186 0.37571621 0.37571621 0.28574186
  0.37571621 0.         0.         0.         0.         0.
  0.22190405 0.37571621 0.         0.         0.         0.
  0.28574186 0.37571621 0.        ]
 [0.         0.         0.1793146  0.         0.         0.1793146
  0.23577716 0.         0.         0.66460105 0.33230052 0.33230052
  0.13925379 0.         0.33230052 0.         0.         0.
  0.1793146  0.23577716 0.        ]
 [0.40240191 0.40240191 0.         0.         0.         0.
  0.         0.         0.40240191 0.         0.         0.
  0.16863046 0.         0.         0.40240191 0.40240191 0.40240191
  0.         0.         0.        ]
 [0.         0.         0.25271307 0.33228732 0.33228732 0.25271307
  0.         0.46
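The weighting can be reproduced by hand. The sketch below uses a small made-up count matrix, assumes the `smooth_idf=False` variant of the formula, idf(t) = log(n_d / df(d, t)) + 1, and applies the L2 row normalization that `TfidfTransformer` performs by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Rows are documents, columns are terms (toy counts for illustration).
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

n_docs = counts.shape[0]
df = np.count_nonzero(counts, axis=0)  # number of documents containing each term
idf_manual = np.log(n_docs / df) + 1   # unsmoothed idf

tfidf_manual = counts * idf_manual
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

transformer = TfidfTransformer(smooth_idf=False)
tfidf = transformer.fit_transform(counts).toarray()

print(np.allclose(transformer.idf_, idf_manual))
print(np.allclose(tfidf, tfidf_manual))
```

A term that occurs in every document (the first column here) gets idf = 1, the smallest possible weight, which is how very common words are down-weighted.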

#### Decoding Text Files  
`CountVectorizer` takes an `encoding` parameter (default is utf-8) that tells it how to decode byte input  

In [30]:
>>> import chardet    
>>> text1 = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut"
>>> text2 = b"holdselig sind deine Ger\xfcche"
>>> text3 = b"\xff\xfeA\x00u\x00f\x00 \x00F\x00l\x00\xfc\x00g\x00e\x00l\x00n\x00 \x00d\x00e\x00s\x00 \x00G\x00e\x00s\x00a\x00n\x00g\x00e\x00s\x00,\x00 \x00H\x00e\x00r\x00z\x00l\x00i\x00e\x00b\x00c\x00h\x00e\x00n\x00,\x00 \x00t\x00r\x00a\x00g\x00 \x00i\x00c\x00h\x00 \x00d\x00i\x00c\x00h\x00 \x00f\x00o\x00r\x00t\x00"
>>> decoded = [x.decode(chardet.detect(x)['encoding'])
...            for x in (text1, text2, text3)]        
>>> v = CountVectorizer().fit(decoded).vocabulary_    
>>> print(v)

{'sei': 15, 'mir': 13, 'gegrüßt': 6, 'mein': 12, 'sauerkraut': 14, 'holdselig': 10, 'sind': 16, 'deine': 1, 'gerüche': 7, 'auf': 0, 'flügeln': 4, 'des': 2, 'gesanges': 8, 'herzliebchen': 9, 'trag': 17, 'ich': 11, 'dich': 3, 'fort': 5}

#### More  
1. [Applications And Examples](http://scikit-learn.org/stable/modules/feature_extraction.html#applications-and-examples)  
2. [Limitations of the Bag of Words Representation](http://scikit-learn.org/stable/modules/feature_extraction.html#limitations-of-the-bag-of-words-representation)  
3. [Vectorizing a large text corpus with hashing trick](http://scikit-learn.org/stable/modules/feature_extraction.html#vectorizing-a-large-text-corpus-with-the-hashing-trick)  
4. [Performing out-of-core scaling with HashingVectorizer](http://scikit-learn.org/stable/modules/feature_extraction.html#performing-out-of-core-scaling-with-hashingvectorizer)  
5. [Customizing the vectorizer classes](http://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes)  

### Image Feature Extraction  
#### Patch Extraction  
`extract_patches_2d` - extract patches from an image stored as a 2D array, or as a 3D array for color images  
`reconstruct_from_patches_2d` - reconstruct the image from all of its patches  

In [31]:
>>> import numpy as np
>>> from sklearn.feature_extraction import image
>>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
>>> print(one_image[:, :, 0])  # R channel of a fake RGB picture
>>> patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2, random_state=0)
>>> print(patches.shape)
>>> print(patches[:, :, :, 0])
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> print(patches.shape)
>>> print(patches[4, :, :, 0])
>>> reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
>>> np.testing.assert_array_equal(one_image, reconstructed)

[[ 0  3  6  9]
 [12 15 18 21]
 [24 27 30 33]
 [36 39 42 45]]
(2, 2, 2, 3)
[[[ 0  3]
  [12 15]]

 [[15 18]
  [27 30]]]
(9, 2, 2, 3)
[[15 18]
 [27 30]]


`PatchExtractor` - works like `extract_patches_2d`, but supports multiple images as input; it is an estimator, so it can be used in pipelines

In [32]:
>>> five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
>>> patches = image.PatchExtractor((2, 2)).transform(five_images)
>>> patches.shape

(45, 2, 2, 3)

#### Connectivity Graph of an Image  
[Detail](http://scikit-learn.org/stable/modules/feature_extraction.html#connectivity-graph-of-an-image)  

### Preprocessing Data  
standardize data
#### Standardization or mean removal and variance scaling  
In practice we often ignore the shape of the distribution and just **transform the data to center it by removing the mean value of each feature**, then **scale it by dividing non-constant features by their standard deviation**.  

`scale` - a quick and easy way to perform standardization; the scaled data has zero mean and unit variance  
`StandardScaler` - a transformer (estimator) that implements the same scaling and can reapply the transformation learned on the training set to new data  

In [33]:
from sklearn.preprocessing import scale as sk_scale, normalize as sk_normalize, StandardScaler,MinMaxScaler,MaxAbsScaler,QuantileTransformer,Normalizer,Binarizer, OneHotEncoder
import numpy as np
X_train = np.array([[ 1., -1.,  2.], [ 2.,  0.,  0.], [ 0.,  1., -1.]])
X_scaled = sk_scale(X_train)
print(X_scaled)  
scaler = StandardScaler().fit(X_train)
print(scaler.mean_)
X_test = [[-1., 1., 0.]]
print(scaler.transform(X_test))

[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
[1.         0.         0.33333333]
[[-2.44948974  1.22474487 -0.26726124]]
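The transformation is simply (x - mean) / std, computed per feature on the training set. A minimal check of that claim, reusing the toy training matrix from the cell above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

scaler = StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)

# Manual standardization: per-column mean and (population) standard deviation.
X_manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # approximately 0 and exactly-ish 1 per column
```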


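`MinMaxScaler` - scale each feature to lie between a given minimum and maximum value (by default 0 and 1); `MaxAbsScaler` - scale each feature by its maximum absolute value, so that the training data lies in [-1, 1]  
For `MinMaxScaler`  
> X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  
> X_scaled = X_std * (max - min) + min  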

In [34]:
>>> X_train = np.array([[ 1., -1.,  2.],[ 2.,  0.,  0.],[ 0.,  1., -1.]])
>>> min_max_scaler = MinMaxScaler(feature_range=(0, 1))
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])
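The two-step formula for `MinMaxScaler` can be verified against the library. This sketch assumes a target range of (-1, 1), chosen only to show that `feature_range` need not be (0, 1).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
lo, hi = -1.0, 1.0  # target feature range (illustrative)

# Step 1: map each column to [0, 1]; step 2: stretch and shift to [lo, hi].
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_manual = X_std * (hi - lo) + lo

X_scaled = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X)
print(np.allclose(X_scaled, X_manual))
```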

In [35]:
>>> X_train = np.array([[ 1., -1.,  2.],[ 2.,  0.,  0.],[ 0.,  1., -1.]])
>>> max_abs_scaler = MaxAbsScaler()
>>> X_train_maxabs = max_abs_scaler.fit_transform(X_train)
>>> print(X_train_maxabs)
>>> X_test = np.array([[ -3., -1.,  4.]])
>>> X_test_maxabs = max_abs_scaler.transform(X_test)
>>> print(X_test_maxabs)
>>> max_abs_scaler.scale_         

[[ 0.5 -1.   1. ]
 [ 1.   0.   0. ]
 [ 0.   1.  -0.5]]
[[-1.5 -1.   2. ]]


array([2., 1., 2.])

`robust_scale/RobustScaler` - when the data contains many outliers, scaling with the mean and variance is unlikely to work well. These use more robust estimates for the center and range of the data (by default the median and the interquartile range).
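A small sketch of the effect, using a made-up single-feature dataset with one extreme value. With its default settings `RobustScaler` centers on the median and scales by the interquartile range, so the outlier barely influences how the ordinary points are scaled.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1.], [2.], [3.], [4.], [100.]])  # 100 is an outlier

robust = RobustScaler().fit(X)
print(robust.center_, robust.scale_)  # median and interquartile range of the column

X_robust = robust.transform(X)
X_standard = StandardScaler().fit_transform(X)

print(X_robust.ravel())    # the first four points keep a spread of 0.5 between neighbours
print(X_standard.ravel())  # the first four points are squeezed close together
```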

#### Non-Linear Transformation  
`QuantileTransformer/quantile_transform` provide a non-parametric transformation based on the quantile function to map the data to a uniform distribution with values between 0 and 1  
It is also possible to map the transformed data to a normal distribution by setting output_distribution='normal'

In [36]:
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> quantile_transformer = QuantileTransformer(random_state=0)
>>> X_train_trans = quantile_transformer.fit_transform(X_train)
>>> X_test_trans = quantile_transformer.transform(X_test)
>>> print(np.percentile(X_train[:, 0], [0, 25, 50, 75, 100]))
>>> quantile_transformer = QuantileTransformer(
...     output_distribution='normal', random_state=0)
>>> X_trans = quantile_transformer.fit_transform(X)
>>> print(quantile_transformer.quantiles_)

[4.3 5.1 5.8 6.5 7.9]
[[4.3        2.         1.         0.1       ]
 [4.31491491 2.02982983 1.01491491 0.1       ]
 [4.32982983 2.05965966 1.02982983 0.1       ]
 ...
 [7.84034034 4.34034034 6.84034034 2.5       ]
 [7.87017017 4.37017017 6.87017017 2.5       ]
 [7.9        4.4        6.9        2.5       ]]


#### Normalization  
Normalization is the process of scaling individual samples to have unit norm  
This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples  
`normalize/Normalizer` - provide a quick and easy way to perform this operation on a single array-like dataset, using either the l1 or l2 norm; both accept dense arrays and scipy.sparse matrices as input

In [37]:
>>> X = [[ 1., -1.,  2.],[ 2.,  0.,  0.],[ 0.,  1., -1.]]
>>> X_normalized = sk_normalize(X, norm='l2')
>>> print(X_normalized)
>>> normalizer = Normalizer().fit(X)  # fit does nothing
>>> print(normalizer.transform(X))
>>> print(normalizer.transform([[-1.,  1., 0.]]))

[[ 0.40824829 -0.40824829  0.81649658]
 [ 1.          0.          0.        ]
 [ 0.          0.70710678 -0.70710678]]
[[ 0.40824829 -0.40824829  0.81649658]
 [ 1.          0.          0.        ]
 [ 0.          0.70710678 -0.70710678]]
[[-0.70710678  0.70710678  0.        ]]


#### Binarization  
Feature binarization is the process of thresholding numerical features to get boolean values  
This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multi-variate Bernoulli distribution  
It is also common among the text processing community to use binary feature values (probably to simplify the probabilistic reasoning) even if normalized counts (a.k.a. term frequencies) or TF-IDF valued features often perform slightly better in practice.  
`Binarizer(threshold)` 

In [38]:
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]

>>> binarizer = Binarizer().fit(X)  # fit does nothing
>>> print(binarizer.transform(X))
>>> binarizer = Binarizer(threshold=1.1)
>>> print(binarizer.transform(X))

[[1. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]
[[0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 0.]]


#### Encoding Categorical Features  
Convert categorical features to features that can be used with scikit-learn estimators  
`OneHotEncoder` - implements the `one-of-K/one-hot` encoding: it transforms each categorical feature with m possible values into m binary features, with only one active (if the training data might be missing some categorical values, set n_values explicitly)

In [39]:
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
>>> print(enc.transform([[0, 1, 3]]).toarray())
>>> enc = OneHotEncoder(n_values=[2, 3, 4])
>>> # Note that there are missing categorical values for the 2nd and 3rd
>>> # features
>>> print(enc.fit([[1, 2, 3], [0, 2, 0]]))

[[1. 0. 0. 1. 0. 0. 0. 0. 1.]]
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='error', n_values=[2, 3, 4], sparse=True)
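The `n_values` argument belongs to the scikit-learn version used in this notebook; it was deprecated in 0.20 and removed in 0.22. The sketch below assumes a newer release and expresses the same idea with the `categories` parameter, listing every allowed value per feature so that values absent from the training data still get a column.

```python
from sklearn.preprocessing import OneHotEncoder

# One sorted list of allowed values per feature (assumption: scikit-learn >= 0.20).
enc = OneHotEncoder(categories=[[0, 1], [0, 1, 2], [0, 1, 2, 3]])

# The training data never contains 0 or 1 for the 2nd feature, nor 1 or 2 for the 3rd.
enc.fit([[1, 2, 3], [0, 2, 0]])

encoded = enc.transform([[0, 1, 3]]).toarray()
print(encoded)  # 2 + 3 + 4 = 9 columns
```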


#### Imputation (Filling In Missing Values)  
How to handle missing value in dataset?  

1. Discard entire rows and/or columns containing missing values  
2. Impute the missing values(i.e., to infer them from the known part of the data)  

`Imputer` - provides basic strategies for imputing missing values, using the mean, the median, or the most frequent value of the row or column in which the missing values are located. This class also allows for different missing-value encodings and supports sparse matrices  

In [40]:
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))  
>>> import scipy.sparse as sp
>>> X = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
>>> imp = Imputer(missing_values=0, strategy='mean', axis=0)
>>> imp.fit(X)
>>> X_test = sp.csc_matrix([[0, 2], [6, 0], [7, 6]])
>>> print(imp.transform(X_test))

[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]
[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]
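`Imputer` is specific to older scikit-learn releases (it was deprecated in 0.20 and removed in 0.22). Assuming a newer version, the dense example above can be sketched with `sklearn.impute.SimpleImputer`, which always imputes column-wise and therefore has no `axis` argument.

```python
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])  # column means: (1 + 7) / 2 and (2 + 3 + 6) / 3

X = [[np.nan, 2], [6, np.nan], [7, 6]]
X_imputed = imp.transform(X)
print(X_imputed)
```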


#### Generating Polynomial Features  
Often it’s useful to add complexity to the model by considering nonlinear features of the input data. A simple and common method is to use polynomial features, which add high-order and interaction terms of the existing features  
`PolynomialFeatures`  


In [41]:
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> print(X)
>>> poly = PolynomialFeatures(2)
>>> print(poly.fit_transform(X))
>>> X = np.arange(9).reshape(3, 3)
>>> print(X)
>>> poly = PolynomialFeatures(degree=3, interaction_only=True)
>>> print(poly.fit_transform(X))

[[0 1]
 [2 3]
 [4 5]]
[[ 1.  0.  1.  0.  0.  1.]
 [ 1.  2.  3.  4.  6.  9.]
 [ 1.  4.  5. 16. 20. 25.]]
[[0 1 2]
 [3 4 5]
 [6 7 8]]
[[  1.   0.   1.   2.   0.   0.   2.   0.]
 [  1.   3.   4.   5.  12.  15.  20.  60.]
 [  1.   6.   7.   8.  42.  48.  56. 336.]]


#### Custom Transformer  
`FunctionTransformer` - Convert an existing Python function into a transformer to assist in data cleaning or processing

In [42]:
>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer
>>> transformer = FunctionTransformer(np.log1p)
>>> X = np.array([[0, 1], [2, 3]])
>>> print(transformer.transform(X))

[[0.         0.69314718]
 [1.09861229 1.38629436]]


### UnSupervised Dimensionality Reduction  
[Detail](http://scikit-learn.org/stable/modules/unsupervised_reduction.html)

### Random Projection  
Reduce the dimensionality of the data by trading a controlled amount of accuracy, for faster processing times and smaller model sizes.

#### The Johnson-Lindenstrauss Lemma  
The main theoretical result behind the efficiency of random projection is the Johnson-Lindenstrauss lemma (quoting Wikipedia):
> In mathematics, the Johnson-Lindenstrauss lemma is a result concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. The map used for the embedding is at least Lipschitz, and can even be taken to be an orthogonal projection.   

`johnson_lindenstrauss_min_dim` - knowing only the number of samples, conservatively estimates the minimal size of the random subspace needed to guarantee a bounded distortion introduced by the random projection  

In [43]:
>>> from sklearn.random_projection import johnson_lindenstrauss_min_dim
>>> print(johnson_lindenstrauss_min_dim(n_samples=1e6, eps=0.5))
>>> print(johnson_lindenstrauss_min_dim(n_samples=1e6, eps=[0.5, 0.1, 0.01]))
>>> print(johnson_lindenstrauss_min_dim(n_samples=[1e4, 1e5, 1e6], eps=0.1))

663
[    663   11841 1112658]
[ 7894  9868 11841]
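The numbers above follow from a closed-form bound. The sketch below assumes the bound stated in the scikit-learn documentation, n_components >= 4 log(n_samples) / (eps^2 / 2 - eps^3 / 3); the helper name `jl_min_dim` is made up for this example.

```python
import numpy as np
from sklearn.random_projection import johnson_lindenstrauss_min_dim


def jl_min_dim(n_samples, eps):
    """Hypothetical helper: evaluate the Johnson-Lindenstrauss bound by hand."""
    denominator = eps ** 2 / 2 - eps ** 3 / 3
    return int(4 * np.log(n_samples) / denominator)


manual = jl_min_dim(1_000_000, 0.5)
library = johnson_lindenstrauss_min_dim(n_samples=1_000_000, eps=0.5)
print(manual, library)
```

Note that the bound depends only on the number of samples and the tolerated distortion `eps`, not on the original dimensionality of the data.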


#### Gaussian Random Projection  
`GaussianRandomProjection` - reduces the dimensionality by projecting the original input space on a randomly generated matrix where components are drawn from the following distribution N(0, \frac{1}{n_{components}})  

In [44]:
>>> import numpy as np
>>> from sklearn import random_projection
>>> X = np.random.rand(100, 10000)
>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> print(X_new.shape)

(100, 3947)


#### Sparse Random Projection  
`SparseRandomProjection` - reduces the dimensionality by projecting the original input space using a sparse random matrix  
[Detail](http://scikit-learn.org/stable/modules/random_projection.html#sparse-random-projection)  

In [45]:
>>> import numpy as np
>>> from sklearn import random_projection
>>> X = np.random.rand(100,10000)
>>> transformer = random_projection.SparseRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> print(X_new.shape)

(100, 3947)


### Kernel Approximation  
[Detail](http://scikit-learn.org/stable/modules/kernel_approximation.html)  

### Pairwise Metrics, Affinities, Kernels  
Utilities to evaluate pairwise distances or affinities of sets of samples (distance metrics and kernels)  
All of the following functions are in `sklearn.metrics.pairwise`  

#### Cosine Similarity  
`cosine_similarity(sparse)`  
cosine_similarity computes the L2-normalized dot product of vectors. That is, if x and y are row vectors, their cosine similarity k is defined as:

k(x, y) = \frac{x y^\top}{\|x\| \|y\|}

This is called cosine similarity, because Euclidean (L2) normalization projects the vectors onto the unit sphere, and their dot product is then the cosine of the angle between the points denoted by the vectors.  
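The "L2-normalized dot product" description translates directly into code. In this sketch, with toy vectors chosen for illustration, the rows are normalized to unit length first and then multiplied, which should match `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

X = np.array([[1., 0., 1.],
              [0., 2., 0.],
              [3., 0., 3.]])

X_unit = normalize(X, norm='l2')  # project every row onto the unit sphere
K_manual = X_unit @ X_unit.T      # dot products of unit vectors = cosines

K = cosine_similarity(X)
print(np.allclose(K, K_manual))
print(K[0, 2], K[0, 1])  # parallel rows give 1, orthogonal rows give 0
```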

#### Linear Kernel  
`linear_kernel` - computes the linear kernel, that is, a special case of polynomial_kernel with degree=1 and coef0=0 (homogeneous). If x and y are column vectors, their linear kernel is:

k(x, y) = x^\top y  

#### Polynomial kernel  
`polynomial_kernel` - computes the degree-d polynomial kernel between two vectors. The polynomial kernel represents the similarity between two vectors. Conceptually, the polynomial kernel considers not only the similarity between vectors under the same dimension, but also across dimensions. When used in machine learning algorithms, this allows feature interactions to be taken into account.

The polynomial kernel is defined as:

k(x, y) = (\gamma x^\top y +c_0)^d

where:

x, y are the input vectors
d is the kernel degree
If c_0 = 0 the kernel is said to be homogeneous.  
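As a check of the definition, the kernel matrix can be computed by hand for a pair of small matrices. The inputs and the values of `gamma`, `coef0` (c_0) and `degree` (d) below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

X = np.array([[1., 2.], [3., 4.]])
Y = np.array([[0., 1.], [1., 1.], [2., 0.]])
gamma, coef0, degree = 0.5, 1.0, 2

# k(x, y) = (gamma * x^T y + c_0)^d, evaluated for every pair of rows.
K_manual = (gamma * X @ Y.T + coef0) ** degree

K = polynomial_kernel(X, Y, degree=degree, gamma=gamma, coef0=coef0)
print(np.allclose(K, K_manual))
```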

#### Sigmoid Kernel  
`sigmoid_kernel` computes the sigmoid kernel between two vectors. The sigmoid kernel is also known as hyperbolic tangent, or Multilayer Perceptron (because, in the neural network field, it is often used as neuron activation function). It is defined as:

k(x, y) = \tanh( \gamma x^\top y + c_0)

where:

- x, y are the input vectors
- \gamma is known as slope
- c_0 is known as intercept
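The sigmoid kernel formula can be verified the same way. This is only a sketch with arbitrary inputs and arbitrary `gamma` and `coef0` values.

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

# Illustrative inputs; the values are arbitrary.
X = np.array([[1.0, 2.0], [0.0, -1.0]])
Y = np.array([[0.5, 0.5], [2.0, 1.0]])
gamma, coef0 = 0.1, 1.0

# Library result versus tanh(gamma * x^T y + c_0).
K_sig = sigmoid_kernel(X, Y, gamma=gamma, coef0=coef0)
K_manual = np.tanh(gamma * (X @ Y.T) + coef0)

print(np.allclose(K_sig, K_manual))
```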

#### RBF Kernel  
`rbf_kernel` computes the radial basis function (RBF) kernel between two vectors. This kernel is defined as:

k(x, y) = \exp( -\gamma \| x-y \|^2)

where x and y are the input vectors. If \gamma = \sigma^{-2} the kernel is known as the Gaussian kernel of variance \sigma^2.  


#### Laplacian Kernel  
`laplacian_kernel` is a variant on the radial basis function kernel defined as:

k(x, y) = \exp( -\gamma \| x-y \|_1)

where x and y are the input vectors and \|x-y\|_1 is the Manhattan distance between the input vectors.

It has proven useful in ML applied to noiseless data.
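The RBF and Laplacian kernels differ only in the distance they exponentiate (squared Euclidean versus Manhattan). The sketch below, using arbitrary inputs and an arbitrary `gamma`, computes both distances by hand and compares them with `rbf_kernel` and `laplacian_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import laplacian_kernel, rbf_kernel

# Illustrative inputs; the values are arbitrary.
X = np.array([[0.0, 0.0], [1.0, 2.0]])
Y = np.array([[1.0, 0.0], [3.0, 1.0]])
gamma = 0.5

# Pairwise differences, shape (n_X, n_Y, n_features).
diff = X[:, None, :] - Y[None, :, :]
sq_l2 = (diff ** 2).sum(axis=-1)   # squared Euclidean distance
l1 = np.abs(diff).sum(axis=-1)     # Manhattan distance

K_rbf = rbf_kernel(X, Y, gamma=gamma)
K_lap = laplacian_kernel(X, Y, gamma=gamma)

print(np.allclose(K_rbf, np.exp(-gamma * sq_l2)))
print(np.allclose(K_lap, np.exp(-gamma * l1)))
```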

#### Chi-squared kernel  
The chi-squared kernel is a very popular choice for training non-linear SVMs in computer vision applications. It can be computed using `chi2_kernel` and then passed to an `sklearn.svm.SVC` with `kernel="precomputed"`, or passed directly as a callable kernel:

In [46]:
>>> from sklearn.svm import SVC
>>> from sklearn.metrics.pairwise import chi2_kernel
>>> X = [[0, 1], [1, 0], [.2, .8], [.7, .3]]
>>> y = [0, 1, 0, 1]
>>> K = chi2_kernel(X, gamma=.5)
>>> print(K)
>>> svm = SVC(kernel='precomputed').fit(K, y)
>>> print(svm.predict(K))
>>> print('or')
>>> svm = SVC(kernel=chi2_kernel).fit(X, y)
>>> print(svm.predict(X))

[[1.         0.36787944 0.89483932 0.58364548]
 [0.36787944 1.         0.51341712 0.83822343]
 [0.89483932 0.51341712 1.         0.7768366 ]
 [0.58364548 0.83822343 0.7768366  1.        ]]
[0 1 0 1]
or
[0 1 0 1]


### Transforming The Prediction Target(y)  
#### LabelBinarizer  
create a label indicator matrix from a list of multi-class labels  

In [47]:
>>> from sklearn import preprocessing
>>> lb = preprocessing.LabelBinarizer()
>>> print(lb.fit([1, 2, 6, 4, 2]))
>>> print(lb.classes_)
>>> print(lb.transform([1, 6]))

LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
[1 2 4 6]
[[1 0 0 0]
 [0 0 0 1]]


#### LabelEncoder  
normalize labels such that they contain only values between 0 and n_classes-1  

In [48]:
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
>>> print(le.classes_)
>>> print(le.transform([1, 1, 2, 6]))
>>> print(le.inverse_transform([0, 0, 1, 2]))

[1 2 6]
[0 0 1 2]
[1 1 2 6]




It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels

In [49]:
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
>>> print(list(le.classes_))
>>> print(le.transform(["tokyo", "tokyo", "paris"]))
>>> print(list(le.inverse_transform([2, 2, 1])))

['amsterdam', 'paris', 'tokyo']
[2 2 1]
['tokyo', 'tokyo', 'paris']




## Supervised Learning  
### Generalized Linear Models  
The following are a set of methods intended for regression in which the target value is expected to be a linear combination of the input variables. In mathematical notation, if \hat{y} is the predicted value:

\hat{y}(w, x) = w_0 + w_1 x_1 + ... + w_p x_p

Across the module, we designate the vector w = (w_1, ..., w_p) as `coef_` and w_0 as `intercept_`.  


#### LinearRegression  
Mathematically it solves a problem of the form:

\underset{w}{min\,} {|| X w - y||_2}^2


In [50]:
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV,Lasso

In [51]:
>>> reg = LinearRegression()
>>> reg.fit ([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
>>> reg.coef_

array([0.5, 0.5])

#### Ridge Regression  
The ridge coefficients minimize a penalized residual sum of squares,

\underset{w}{min\,} {{|| X w - y||_2}^2 + \alpha {||w||_2}^2}  

In [52]:
>>> reg = Ridge (alpha = .5)
>>> reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) 
>>> print(reg.coef_)
>>> print(reg.intercept_)

[0.34545455 0.34545455]
0.13636363636363638


#### Ridge Cross Validation  
RidgeCV implements ridge regression with built-in cross-validation of the alpha parameter. The object works in the same way as GridSearchCV except that it defaults to Generalized Cross-Validation (GCV), an efficient form of leave-one-out cross-validation  

In [53]:
>>> reg = RidgeCV(alphas=[0.1, 1.0, 10.0])
>>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])       
>>> reg.alpha_                                      


0.1

#### Lasso  
The Lasso is a linear model that estimates sparse coefficients  
It is useful in some contexts due to its tendency to prefer solutions with fewer parameter values, effectively reducing the number of variables upon which the given solution is dependent. For this reason, the Lasso and its variants are fundamental to the field of compressed sensing. Under certain conditions, it can recover the exact set of non-zero weights  
Mathematically, it consists of a linear model trained with \ell_1 prior as regularizer. The objective function to minimize is:

\underset{w}{min\,} { \frac{1}{2n_{samples}} ||X w - y||_2 ^ 2 + \alpha ||w||_1}

The lasso estimate thus solves the minimization of the least-squares penalty with \alpha ||w||_1 added, where \alpha is a constant and ||w||_1 is the \ell_1-norm of the parameter vector.  

In [54]:
>>> reg = Lasso(alpha = 0.1)
>>> reg.fit([[0, 0], [1, 1]], [0, 1])
>>> reg.predict([[1, 1]])

array([0.8])

[More](http://scikit-learn.org/stable/modules/linear_model.html#setting-regularization-parameter)  

#### Multi-Task Lasso  
The MultiTaskLasso is a linear model that estimates sparse coefficients for multiple regression problems jointly: y is a 2D array, of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all the regression problems, also called tasks  
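The shared-feature constraint can be seen on synthetic data. This is a minimal sketch: the data, the hypothetical ground-truth matrix `W_true`, and `alpha=0.1` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.RandomState(0)
n_samples, n_features, n_tasks = 50, 8, 3

X = rng.randn(n_samples, n_features)
# Hypothetical ground truth: only the first two features matter, for every task.
W_true = np.zeros((n_features, n_tasks))
W_true[:2, :] = rng.randn(2, n_tasks)
Y = X @ W_true + 0.01 * rng.randn(n_samples, n_tasks)

model = MultiTaskLasso(alpha=0.1).fit(X, Y)

# coef_ has shape (n_tasks, n_features); a feature is kept or dropped
# for all tasks at once, so each column is entirely zero or entirely non-zero.
selected = np.any(model.coef_ != 0, axis=0)
print(model.coef_.shape)
print(selected)
```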


#### Elastic Net  
ElasticNet is a linear regression model trained with L1 and L2 prior as regularizer. This combination allows for learning a sparse model where few of the weights are non-zero like Lasso, while still maintaining the regularization properties of Ridge. We control the convex combination of L1 and L2 using the l1_ratio parameter.  
Elastic-net is useful when there are multiple features which are correlated with one another. Lasso is likely to pick one of these at random, while elastic-net is likely to pick both.

A practical advantage of trading-off between Lasso and Ridge is it allows Elastic-Net to inherit some of Ridge’s stability under rotation.

The objective function to minimize is in this case

\underset{w}{min\,} { \frac{1}{2n_{samples}} ||X w - y||_2 ^ 2 + \alpha \rho ||w||_1 +
\frac{\alpha(1-\rho)}{2} ||w||_2 ^ 2}  
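In the objective above, \alpha corresponds to the `alpha` parameter and \rho to `l1_ratio`. The sketch below fits `ElasticNet` on synthetic data with two strongly correlated features; the data and the parameter values are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(0)
X = rng.randn(60, 5)
# Make feature 1 a near-copy of feature 0 so the two are strongly correlated.
X[:, 1] = X[:, 0] + 0.01 * rng.randn(60)
y = 3.0 * X[:, 0] + 0.1 * rng.randn(60)

# alpha is the overall penalty strength; l1_ratio is rho in the objective.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print(enet.coef_)
print(enet.intercept_)
```

Setting `l1_ratio=1` gives the Lasso penalty, while values close to 0 move the model toward a Ridge-like penalty.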


#### Multi-task Elastic Net  
The MultiTaskElasticNet is an elastic-net model that estimates sparse coefficients for multiple regression problems jointly: Y is a 2D array, of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all the regression problems, also called tasks.

Mathematically, it consists of a linear model trained with a mixed \ell_1 \ell_2 prior and \ell_2 prior as regularizer. The objective function to minimize is:

\underset{W}{min\,} { \frac{1}{2n_{samples}} ||X W - Y||_{Fro}^2 + \alpha \rho ||W||_{2 1} +
\frac{\alpha(1-\rho)}{2} ||W||_{Fro}^2}  


#### Least Angle Regression  
Least-angle regression (LARS) is a regression algorithm for high-dimensional data  
LARS is similar to forward stepwise regression. At each step, it finds the predictor most correlated with the response. When there are multiple predictors having equal correlation, instead of continuing along the same predictor, it proceeds in a direction equiangular between the predictors.  

The advantages of LARS are:
1. It is numerically efficient in contexts where p >> n (i.e., when the number of dimensions is significantly greater than the number of points)
2. It is computationally just as fast as forward selection and has the same order of complexity as an ordinary least squares.
3. It produces a full piecewise linear solution path, which is useful in cross-validation or similar attempts to tune the model.
4. If two variables are almost equally correlated with the response, then their coefficients should increase at approximately the same rate. The algorithm thus behaves as intuition would expect, and also is more stable.
5. It is easily modified to produce solutions for other estimators, like the Lasso.  

The disadvantages of the LARS method include:

1. Because LARS is based upon an iterative refitting of the residuals, it would appear to be especially sensitive to the effects of noise. This problem is discussed in detail by Weisberg in the discussion section of the Efron et al. (2004) Annals of Statistics article.  
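A small sketch of the `Lars` estimator on synthetic data, stopping after two features have entered the model. The data and the choice `n_nonzero_coefs=2` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lars

rng = np.random.RandomState(0)
X = rng.randn(40, 6)
# Synthetic target that depends on features 0 and 3 only.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.05 * rng.randn(40)

# Stop the path once two predictors are active.
lars = Lars(n_nonzero_coefs=2).fit(X, y)

print(lars.active_)            # indices of the features that entered the model
print(lars.coef_)              # coefficients at the end of the path
print(lars.coef_path_.shape)   # (n_features, number of path points)
```

The `coef_path_` attribute holds the piecewise linear solution path mentioned in the list of advantages.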

#### LARS Lasso  
LassoLars is a lasso model implemented using the LARS algorithm, and unlike the implementation based on coordinate_descent, this yields the exact solution, which is piecewise linear as a function of the norm of its coefficients.  

####  Orthogonal Matching Pursuit  
`OrthogonalMatchingPursuit` and `orthogonal_mp` implement the OMP algorithm for approximating the fit of a linear model with constraints imposed on the number of non-zero coefficients (i.e. the \ell_0 pseudo-norm).

Being a forward feature selection method like Least Angle Regression, orthogonal matching pursuit can approximate the optimum solution vector with a fixed number of non-zero elements:

\text{arg\,min\,} ||y - X\gamma||_2^2 \text{ subject to } \
||\gamma||_0 \leq n_{nonzero\_coefs}

Alternatively, orthogonal matching pursuit can target a specific error instead of a specific number of non-zero coefficients. This can be expressed as:

\text{arg\,min\,} ||\gamma||_0 \text{ subject to } ||y-X\gamma||_2^2 \
\leq \text{tol}

OMP is based on a greedy algorithm that includes at each step the atom most highly correlated with the current residual. It is similar to the simpler matching pursuit (MP) method, but better in that at each iteration, the residual is recomputed using an orthogonal projection on the space of the previously chosen dictionary elements.  
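Both formulations (a fixed number of non-zero coefficients, or a target error) map to constructor parameters of `OrthogonalMatchingPursuit`. The sketch below uses a synthetic, noiseless sparse signal; the data and the values of `n_nonzero_coefs` and `tol` are assumptions.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.RandomState(0)
X = rng.randn(50, 20)
# Hypothetical sparse ground truth with three non-zero weights.
w_true = np.zeros(20)
w_true[[2, 7, 11]] = [1.5, -2.0, 1.0]
y = X @ w_true

# Variant 1: constrain the number of non-zero coefficients.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(X, y)
support = np.flatnonzero(omp.coef_)
print(support)

# Variant 2: keep adding atoms until the squared residual norm is below tol.
omp_tol = OrthogonalMatchingPursuit(tol=1e-6).fit(X, y)
residual = np.sum((y - omp_tol.predict(X)) ** 2)
print(np.count_nonzero(omp_tol.coef_), residual)
```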


#### Bayesian Regression  
Bayesian regression techniques can be used to include regularization parameters in the estimation procedure: the regularization parameter is not set in a hard sense but tuned to the data at hand.

This can be done by introducing uninformative priors over the hyperparameters of the model. The \ell_{2} regularization used in Ridge Regression is equivalent to finding a maximum a posteriori estimate under a Gaussian prior over the parameters w with precision \lambda (that is, variance \lambda^{-1}). Instead of setting \lambda manually, it is possible to treat it as a random variable to be estimated from the data.

To obtain a fully probabilistic model, the output y is assumed to be Gaussian distributed around X w:

p(y|X,w,\alpha) = \mathcal{N}(y|X w,\alpha^{-1})

where the noise precision \alpha is again treated as a random variable that is to be estimated from the data.

The advantages of Bayesian Regression are:

1. It adapts to the data at hand.
2. It can be used to include regularization parameters in the estimation procedure.  

The disadvantages of Bayesian regression include:

1. Inference of the model can be time consuming.

####  Bayesian Ridge Regression  
BayesianRidge estimates a probabilistic model of the regression problem as described above. The prior for the parameter w is given by a spherical Gaussian:

p(w|\lambda) = \mathcal{N}(w|0,\lambda^{-1}\mathbf{I}_{p})

The priors over \alpha and \lambda are chosen to be gamma distributions, the conjugate prior for the precision of the Gaussian.

The resulting model is called Bayesian Ridge Regression, and is similar to the classical Ridge. The parameters w, \alpha and \lambda are estimated jointly during the fit of the model. The remaining hyperparameters are the parameters of the gamma priors over \alpha and \lambda. These are usually chosen to be non-informative. The parameters are estimated by maximizing the marginal log likelihood.

By default \alpha_1 = \alpha_2 =  \lambda_1 = \lambda_2 = 10^{-6}.  
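A minimal sketch of `BayesianRidge` on synthetic data, showing the estimated precisions and a predictive standard deviation. The data and the hypothetical true weights `w_true` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
# Hypothetical true weights and a small amount of Gaussian noise.
w_true = np.array([1.0, 0.0, -2.0])
y = X @ w_true + 0.1 * rng.randn(100)

reg = BayesianRidge().fit(X, y)

print(reg.coef_)    # estimated weights w
print(reg.alpha_)   # estimated precision of the noise
print(reg.lambda_)  # estimated precision of the weights

# The probabilistic model also gives a standard deviation for each prediction.
y_mean, y_std = reg.predict(X[:2], return_std=True)
print(y_mean, y_std)
```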


#### Automatic Relevance Determination   
ARDRegression is very similar to Bayesian Ridge Regression, but can lead to sparser weights w [1] [2]. ARDRegression poses a different prior over w, by dropping the assumption of the Gaussian being spherical.

Instead, the distribution over w is assumed to be an axis-parallel, elliptical Gaussian distribution.

This means each weight w_{i} is drawn from a Gaussian distribution, centered on zero and with a precision \lambda_{i}:

p(w|\lambda) = \mathcal{N}(w|0,A^{-1})

with \text{diag}(A) = \lambda = \{\lambda_{1},...,\lambda_{p}\}.

In contrast to Bayesian Ridge Regression, each coordinate w_{i} has its own precision \lambda_i. The prior over all \lambda_i is chosen to be the same gamma distribution given by hyperparameters \lambda_1 and \lambda_2.  
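The per-weight precisions are exposed as the `lambda_` attribute of `ARDRegression`. The following is a sketch on synthetic data in which only two of five features are relevant; the data and `w_true` are assumptions.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.RandomState(0)
n_features = 5
X = rng.randn(100, n_features)
# Hypothetical sparse ground truth: only features 0 and 2 are relevant.
w_true = np.array([2.0, 0.0, -1.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.randn(100)

ard = ARDRegression().fit(X, y)

print(ard.coef_)     # irrelevant weights are expected to be shrunk toward zero
print(ard.lambda_)   # one estimated precision per weight
```

A large entry in `lambda_` means the corresponding weight is tightly concentrated around zero.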


#### Logistic regression  
Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

The implementation of logistic regression in scikit-learn can be accessed from class LogisticRegression. This implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

As an optimization problem, binary class L2 penalized logistic regression minimizes the following cost function:

\underset{w, c}{min\,} \frac{1}{2}w^T w + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1) .

Similarly, L1 regularized logistic regression solves the following optimization problem

\underset{w, c}{min\,} \|w\|_1 + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1) .

The solvers implemented in the class LogisticRegression are “liblinear”, “newton-cg”, “lbfgs”, “sag” and “saga”:

The solver “liblinear” uses a coordinate descent (CD) algorithm, and relies on the excellent C++ LIBLINEAR library, which is shipped with scikit-learn. However, the CD algorithm implemented in liblinear cannot learn a true multinomial (multiclass) model; instead, the optimization problem is decomposed in a “one-vs-rest” fashion so separate binary classifiers are trained for all classes. This happens under the hood, so LogisticRegression instances using this solver behave as multiclass classifiers. For L1 penalization, `sklearn.svm.l1_min_c` calculates the lower bound for C needed to get a non-“null” model (one in which not all feature weights are zero).

The “lbfgs”, “sag” and “newton-cg” solvers only support L2 penalization and are found to converge faster for some high dimensional data. Setting multi_class to “multinomial” with these solvers learns a true multinomial logistic regression model [5], which means that its probability estimates should be better calibrated than the default “one-vs-rest” setting.

The “sag” solver uses a Stochastic Average Gradient descent [6]. It is faster than other solvers for large datasets, when both the number of samples and the number of features are large.

The “saga” solver [7] is a variant of “sag” that also supports the non-smooth penalty=”l1” option. This is therefore the solver of choice for sparse multinomial logistic regression.

In a nutshell, one may choose the solver with the following rules:

| Case | Solver |
| --- | --- |
| L1 penalty | “liblinear” or “saga” |
| Multinomial loss | “lbfgs”, “sag”, “saga” or “newton-cg” |
| Very large dataset (n_samples) | “sag” or “saga” |

The “saga” solver is often the best choice. The “liblinear” solver is used by default for historical reasons.

For large datasets, you may also consider using SGDClassifier with ‘log’ loss.  
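To make the solver choice concrete, here is a small sketch on a synthetic binary problem: an L2-penalized model with “lbfgs” and an L1-penalized model with “liblinear”. The dataset and the values of `C` and `max_iter` are assumptions, and parameter names may differ in other scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# L2 penalty (the default) with the lbfgs solver.
l2_clf = LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000).fit(X, y)

# L1 penalty needs a solver that supports it, such as liblinear.
l1_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

proba = l2_clf.predict_proba(X[:3])
print(proba)  # each row sums to 1
print(int((l1_clf.coef_ == 0).sum()), "of", l1_clf.coef_.size, "L1 weights are zero")
```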

#### Stochastic Gradient Descent  
Stochastic gradient descent is a simple yet very efficient approach to fit linear models. It is particularly useful when the number of samples (and the number of features) is very large. The `partial_fit` method allows online/out-of-core learning.

The classes SGDClassifier and SGDRegressor provide functionality to fit linear models for classification and regression using different (convex) loss functions and different penalties. E.g., with loss="log", SGDClassifier fits a logistic regression model, while with loss="hinge" it fits a linear support vector machine (SVM).  
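The out-of-core idea can be sketched by feeding `SGDClassifier` one mini-batch at a time with `partial_fit`. The synthetic dataset and the batch size of 100 are assumptions; in a real out-of-core setting each batch would be read from disk instead of sliced from an in-memory array.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for a dataset too large to fit in memory.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
classes = np.unique(y)  # partial_fit must be told all classes up front

# loss="hinge" corresponds to a linear SVM.
clf = SGDClassifier(loss="hinge", random_state=0)
for start in range(0, X.shape[0], 100):
    batch = slice(start, start + 100)
    clf.partial_fit(X[batch], y[batch], classes=classes)

score = clf.score(X, y)
print(score)
```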

#### Perceptron  
The Perceptron is another simple algorithm suitable for large scale learning. By default:

1. It does not require a learning rate.
2. It is not regularized (penalized).
3. It updates its model only on mistakes.  

The last characteristic implies that the Perceptron is slightly faster to train than SGD with the hinge loss and that the resulting models are sparser.

#### Passive Aggressive Algorithms  
The passive-aggressive algorithms are a family of algorithms for large-scale learning. They are similar to the Perceptron in that they do not require a learning rate. However, contrary to the Perceptron, they include a regularization parameter C.

For classification, PassiveAggressiveClassifier can be used with loss='hinge' (PA-I) or loss='squared_hinge' (PA-II). For regression, PassiveAggressiveRegressor can be used with loss='epsilon_insensitive' (PA-I) or loss='squared_epsilon_insensitive' (PA-II).  


#### More  
http://scikit-learn.org/stable/modules/linear_model.html#robustness-regression-outliers-and-modeling-errors  
http://scikit-learn.org/stable/modules/linear_model.html#polynomial-regression-extending-linear-models-with-basis-functions  

### Linear and Quadratic Discriminant Analysis
[Linear and Quadratic Discriminant Analysis](http://scikit-learn.org/stable/modules/lda_qda.html#linear-and-quadratic-discriminant-analysis)  

### Kernel ridge regression  
http://scikit-learn.org/stable/modules/kernel_ridge.html#kernel-ridge-regression  

### Support Vector Machines  
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.

The advantages of support vector machines are:

1. Effective in high dimensional spaces.  
2. Still effective in cases where number of dimensions is greater than the number of samples.
3. Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
4. Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.  

The disadvantages of support vector machines include:

1. If the number of features is much greater than the number of samples, choosing the kernel function and the regularization term carefully is crucial to avoid over-fitting.
2. SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see "Scores and probabilities" in the scikit-learn SVM documentation).

The support vector machines in scikit-learn support both dense (numpy.ndarray and convertible to that by numpy.asarray) and sparse (any scipy.sparse) sample vectors as input. However, to use an SVM to make predictions for sparse data, it must have been fit on such data. For optimal performance, use C-ordered numpy.ndarray (dense) or scipy.sparse.csr_matrix (sparse) with dtype=float64.  


#### Classification  
[More Detail](http://scikit-learn.org/stable/modules/svm.html#classification)  
`svm.SVC/NuSVC/LinearSVC`

#### NOTE: I Think I Should Not Read Model Now !!!!  
**I am working [here](http://scikit-learn.org/stable/modules/svm.html)**

## Un-Supervised Learning  
[Detail](http://scikit-learn.org/stable/unsupervised_learning.html)  

## Model Selection and Evaluation  
### Cross-Validation: Evaluating Estimator Performance  
Overfitting: Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data  

#### Train Test Split  
To avoid overfitting, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test  

In [55]:
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> from sklearn import datasets
>>> from sklearn import svm
>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
>>> X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
>>> print(X_train.shape, y_train.shape)
>>> print(X_test.shape, y_test.shape)
>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)

(90, 4) (90,)
(60, 4) (60,)


0.9666666666666667

#### Cross Validation  
By partitioning the available data into three sets (training, validation and test), we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.  
A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:

- A model is trained using k-1 of the folds as training data;
- the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
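The k-fold procedure described above can be written out by hand. The sketch below assumes the iris dataset and a linear SVC, loops over the folds manually, and checks that the averaged score matches what `cross_val_score` reports for the same splitter.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

# A fixed random_state makes the splitter produce the same folds every time.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds, evaluate on the held-out fold.
    clf = svm.SVC(kernel="linear", C=1)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))

manual_mean = np.mean(fold_scores)
helper_mean = cross_val_score(svm.SVC(kernel="linear", C=1), X, y, cv=kf).mean()
print(manual_mean, helper_mean)
```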

#### cross_val_score  
The simplest way to use cross-validation  

In [56]:
>>> from sklearn.model_selection import cross_val_score
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_val_score(clf, iris.data, iris.target, cv=5)
>>> print(scores)
>>> from sklearn import metrics
>>> scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_macro')
>>> print(scores)
>>> from sklearn.model_selection import ShuffleSplit
>>> n_samples = iris.data.shape[0]
>>> cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
>>> scores = cross_val_score(clf, iris.data, iris.target, cv=cv)
>>> print(scores)

[0.96666667 1.         0.96666667 0.96666667 1.        ]
[0.96658312 1.         0.96658312 0.96658312 1.        ]
[0.97777778 0.97777778 1.        ]


#### cross_validate  
The cross_validate function differs from cross_val_score in two ways:

- It allows specifying multiple metrics for evaluation.
- It returns a dict containing training scores, fit-times and score-times in addition to the test score.  

In [57]:
>>> from sklearn.model_selection import cross_validate
>>> from sklearn.metrics import recall_score
>>> from sklearn.metrics import make_scorer
>>> scores = cross_validate(clf, iris.data, iris.target, scoring='precision_macro')
>>> print(scores)
>>> scoring = ['precision_macro', 'recall_macro']
>>> clf = svm.SVC(kernel='linear', C=1, random_state=0)
>>> scores = cross_validate(clf, iris.data, iris.target, scoring=scoring, cv=5, return_train_score=False)
>>> print(scores)                       
>>> scoring = {'prec_macro': 'precision_macro', 'rec_micro': make_scorer(recall_score, average='macro')}
>>> scores = cross_validate(clf, iris.data, iris.target, scoring=scoring, cv=5, return_train_score=True)
>>> print(scores)

{'fit_time': array([0.00059986, 0.00034595, 0.00042748]), 'score_time': array([0.00043631, 0.00042534, 0.0004766 ]), 'test_score': array([1.        , 0.96491228, 0.98039216]), 'train_score': array([0.98095238, 1.        , 0.99047619])}
{'fit_time': array([0.0004909 , 0.00034809, 0.00028443, 0.00039887, 0.00032568]), 'score_time': array([0.00081444, 0.00071907, 0.00082541, 0.00102878, 0.00071669]), 'test_precision_macro': array([0.96969697, 1.        , 0.96969697, 0.96969697, 1.        ]), 'test_recall_macro': array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])}
{'fit_time': array([0.00064826, 0.00041986, 0.00031424, 0.00035644, 0.00031781]), 'score_time': array([0.00111794, 0.00079131, 0.00077844, 0.00088596, 0.00078726]), 'test_prec_macro': array([0.96969697, 1.        , 0.96969697, 0.96969697, 1.        ]), 'train_prec_macro': array([0.97674419, 0.97674419, 0.99186992, 0.98412698, 0.98333333]), 'test_rec_micro': array([0.96666667, 1.        , 0.96666667, 0.96666667, 1

#### cross_val_predict  
The function cross_val_predict has a similar interface to cross_val_score, but returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. Only cross-validation strategies that assign all elements to a test set exactly once can be used  

In [58]:
>>> from sklearn.model_selection import cross_val_predict
>>> predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
>>> metrics.accuracy_score(iris.target, predicted) 

0.9733333333333334

#### Cross Validation Iterators: KFold 
`KFold` divides all the samples into k groups of samples, called folds (if k = n, this is equivalent to the Leave One Out strategy), of equal sizes (if possible). The prediction function is learned using k - 1 folds, and the fold left out is used for testing.  


In [59]:
>>> import numpy as np
>>> from sklearn.model_selection import KFold
>>> X = ["a", "b", "c", "d"]
>>> kf = KFold(n_splits=2)
>>> for train, test in kf.split(X): print("%s %s" % (train, test))

[2 3] [0 1]
[0 1] [2 3]


#### Cross Validation Iterators: RepeatedKFold & RepeatedStratifiedKFold  
`RepeatedKFold` repeats K-Fold n times. It can be used when one needs to run KFold n times, producing different splits in each repetition. `RepeatedStratifiedKFold` does the same for stratified k-fold.

In [60]:
>>> import numpy as np
>>> from sklearn.model_selection import RepeatedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> random_state = 12883823
>>> rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)
>>> for train, test in rkf.split(X): print("%s %s" % (train, test))

[2 3] [0 1]
[0 1] [2 3]
[0 2] [1 3]
[1 3] [0 2]


#### Cross Validation Iterators: Leave P Out  
LeavePOut is very similar to LeaveOneOut as it creates all the possible training/test sets by removing p samples from the complete set. For n samples, this produces {n \choose p} train-test pairs. Unlike LeaveOneOut and KFold, the test sets will overlap for p > 1.  

In [61]:
>>> from sklearn.model_selection import LeavePOut
>>> X = np.ones(4)
>>> lpo = LeavePOut(p=2)
>>> for train, test in lpo.split(X): print("%s %s" % (train, test))

[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]


#### Cross Validation Iterators: ShuffleSplit  
The ShuffleSplit iterator will generate a user defined number of independent train / test dataset splits. Samples are first shuffled and then split into a pair of train and test sets.  

In [62]:
>>> from sklearn.model_selection import ShuffleSplit
>>> X = np.arange(5)
>>> ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
>>> for train_index, test_index in ss.split(X): print("%s %s" % (train_index, test_index))

[1 3 4] [2 0]
[1 4 3] [0 2]
[4 0 2] [1 3]


#### Cross Validation Iterators: StratifiedKFold/RepeatedStratifiedKFold/StratifiedShuffleSplit  
StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.  

In [63]:
>>> from sklearn.model_selection import StratifiedKFold
>>> X = np.ones(10)
>>> y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
>>> skf = StratifiedKFold(n_splits=3)
>>> for train, test in skf.split(X, y): print("%s %s" % (train, test))

[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]


#### Cross Validation Iterators: GroupKFold  
GroupKFold is a variation of k-fold which ensures that the same group is not represented in both testing and training set  

In [64]:
>>> from sklearn.model_selection import GroupKFold
>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
>>> groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
>>> gkf = GroupKFold(n_splits=3)
>>> for train, test in gkf.split(X, y, groups=groups): print("%s %s" % (train, test))

[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]


#### Cross Validation Iterators: LeaveOneGroupOut/LeavePGroupsOut  
LeaveOneGroupOut is a cross-validation scheme which holds out the samples according to a third-party provided array of integer groups. This group information can be used to encode arbitrary domain specific pre-defined cross-validation folds.

Each training set is thus constituted by all the samples except the ones related to a specific group.  

LeavePGroupsOut is similar to LeaveOneGroupOut, but removes samples related to P groups for each training/test set.

In [65]:
>>> from sklearn.model_selection import LeaveOneGroupOut
>>> X = [1, 5, 10, 50, 60, 70, 80]
>>> y = [0, 1, 1, 2, 2, 2, 2]
>>> groups = [1, 1, 2, 2, 3, 3, 3]
>>> logo = LeaveOneGroupOut()
>>> for train, test in logo.split(X, y, groups=groups): print("%s %s" % (train, test))
>>> from sklearn.model_selection import LeavePGroupsOut
>>> X = np.arange(6)
>>> y = [1, 1, 1, 2, 2, 2]
>>> groups = [1, 1, 2, 2, 3, 3]
>>> lpgo = LeavePGroupsOut(n_groups=2)
>>> for train, test in lpgo.split(X, y, groups=groups): print("%s %s" % (train, test))

[2 3 4 5 6] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]
[4 5] [0 1 2 3]
[2 3] [0 1 4 5]
[0 1] [2 3 4 5]


#### Cross Validation Iterators: GroupShuffleSplit  
The GroupShuffleSplit iterator behaves as a combination of ShuffleSplit and LeavePGroupsOut, and generates a sequence of randomized partitions in which a subset of groups are held out for each split.  

In [66]:
>>> from sklearn.model_selection import GroupShuffleSplit
>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "a"]
>>> groups = [1, 1, 2, 2, 3, 3, 4, 4]
>>> gss = GroupShuffleSplit(n_splits=4, test_size=0.5, random_state=0)
>>> for train, test in gss.split(X, y, groups=groups): print("%s %s" % (train, test))

[0 1 2 3] [4 5 6 7]
[2 3 6 7] [0 1 4 5]
[2 3 4 5] [0 1 6 7]
[4 5 6 7] [0 1 2 3]


#### TimeSeriesSplit  
TimeSeriesSplit is a variation of k-fold which returns the first k folds as the train set and the (k+1)th fold as the test set. Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them. Also, it adds all surplus data to the first training partition, which is always used to train the model.

This class can be used to cross-validate time series data samples that are observed at fixed time intervals.  


In [67]:
>>> from sklearn.model_selection import TimeSeriesSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> tscv = TimeSeriesSplit(n_splits=3)
>>> for train, test in tscv.split(X): print("%s %s" % (train, test))

[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]


### Tuning The HyperParameters Of An Estimator  
Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. Typical examples include C, kernel and gamma for Support Vector Classifier, alpha for Lasso, etc.  

It is possible and recommended to search the hyper-parameter space for the best cross validation score.

Any parameter provided when constructing an estimator may be optimized in this manner. Specifically, to find the names and current values for all parameters for a given estimator, use: `estimator.get_params()`  
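For example, a short sketch with `SVC` (the parameter values are arbitrary): `get_params()` returns a dict of constructor parameters, and `set_params()` changes them, which is the mechanism search tools rely on.

```python
from sklearn.svm import SVC

clf = SVC(C=10, kernel="rbf")

# All constructor parameters, including the ones left at their defaults.
params = clf.get_params()
print(params["C"], params["kernel"], params["gamma"])

# Parameters can be changed after construction with set_params.
clf.set_params(C=100)
print(clf.get_params()["C"])
```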

A search consists of:

1. an estimator (regressor or classifier such as sklearn.svm.SVC());
2. a parameter space;
3. a method for searching or sampling candidates;
4. a cross-validation scheme; and
5. a score function.  

Note that it is common that a small subset of those parameters can have a large impact on the predictive or computation performance of the model while others can be left to their default values. It is recommended to read the docstring of the estimator class to get a finer understanding of their expected behavior, possibly by reading the enclosed reference to the literature.  


#### Exhaustive Grid Search  
The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter. For instance, the following param_grid:

> ```
> param_grid = [
>   {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
>   {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
> ]  
> ```

The GridSearchCV instance implements the usual estimator API: when “fitting” it on a dataset all the possible combinations of parameter values are evaluated and the best combination is retained.
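
A possible end-to-end run of that grid is sketched below. The iris dataset and `cv=5` are assumptions made for illustration; they are not part of the original grid.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative dataset (an assumption, any classification data would do).
X, y = load_iris(return_X_y=True)

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Fitting evaluates every candidate with 5-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

# 4 linear candidates + 4 * 2 rbf candidates
print(len(search.cv_results_['params']))
print(search.best_params_)
print(search.best_score_)
```

After fitting, `search` behaves like the best estimator refitted on the whole dataset, so `search.predict(X)` can be called directly.
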

#### Randomized Parameter Optimization  
While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favourable properties. RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:

* A budget can be chosen independent of the number of parameters and possible values.
* Adding parameters that do not influence the performance does not decrease efficiency.

Specifying how parameters should be sampled is done using a dictionary, very similar to specifying parameters for GridSearchCV. Additionally, a computation budget, being the number of sampled candidates or sampling iterations, is specified using the n_iter parameter. For each parameter, either a distribution over possible values or a list of discrete choices (which will be sampled uniformly) can be specified:  

> ```
> {'C': scipy.stats.expon(scale=100), 'gamma': scipy.stats.expon(scale=.1),
>   'kernel': ['rbf'], 'class_weight':['balanced', None]}
> ```
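
The dictionary above could be passed to `RandomizedSearchCV` roughly as follows. The dataset, `n_iter=10`, `cv=5` and `random_state=0` are illustrative assumptions.

```python
import scipy.stats
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Illustrative dataset (an assumption).
X, y = load_iris(return_X_y=True)

param_distributions = {
    'C': scipy.stats.expon(scale=100),        # continuous distribution
    'gamma': scipy.stats.expon(scale=.1),     # continuous distribution
    'kernel': ['rbf'],                        # discrete choice, sampled uniformly
    'class_weight': ['balanced', None],       # discrete choice, sampled uniformly
}

# n_iter is the budget: exactly 10 candidates are sampled and evaluated,
# regardless of how many parameters are listed.
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=10,
                            cv=5, random_state=0)
search.fit(X, y)

print(len(search.cv_results_['params']))
print(search.best_params_)
```
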
  


#### Tips For Parameters Search  
[Detail](http://scikit-learn.org/stable/modules/grid_search.html#tips-for-parameter-search)  
1. Specifying an objective metric  
2. Specifying multiple metrics for evaluation  
3. Composite estimators and parameter spaces  
4. Model selection: development and evaluation  
5. Parallelism  
6. Robustness to failure  

#### Other  
[Detail](http://scikit-learn.org/stable/modules/grid_search.html#alternatives-to-brute-force-parameter-search)  


### Model Evaluation: Quantifying the Quality of Predictions  
There are three different APIs for evaluating the quality of a model’s predictions:

1. Estimator score method: Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve
2. Scoring parameter: Model-evaluation tools using cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV) rely on an internal scoring strategy
3. Metric functions: The metrics module implements functions assessing prediction error for specific purposes. These metrics are detailed in sections on Classification metrics, Multilabel ranking metrics, Regression metrics and Clustering metrics.  
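
The three APIs can be compared side by side. The sketch below assumes a `LogisticRegression` on the iris data purely for illustration; `max_iter=1000` is set only to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# 1. Estimator score method: mean accuracy is the default for classifiers.
score_method = clf.score(X, y)

# 2. Scoring parameter: a string naming the metric used by a CV tool.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                            scoring='accuracy', cv=5)

# 3. Metric function: applied directly to true and predicted labels.
metric_value = accuracy_score(y, clf.predict(X))

print(score_method, metric_value)
print(cv_scores)
```

Here 1 and 3 are evaluated on the training data and therefore agree with each other, while 2 reports one held-out score per fold.
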

Get a baseline value of those metrics for random predictions:  

*  Dummy estimators   
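
A small sketch of such a baseline, assuming the iris data and a linear `SVC` as the model under test (both are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: always predicts the most frequent class seen during training.
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)

# Model under test.
model = SVC(kernel='linear').fit(X_train, y_train)

baseline_score = baseline.score(X_test, y_test)
model_score = model.score(X_test, y_test)
print(baseline_score, model_score)
```

A model score that is not clearly above the baseline suggests that the features carry little signal or that the metric is dominated by class imbalance.
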

#### `scoring` of CV  
1. Use a [pre-defined](http://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values) scoring value  
2. Customize the scoring strategy from metric functions with `make_scorer`  

In [68]:
>>> from sklearn.metrics import fbeta_score, make_scorer
>>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import LinearSVC
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)

In [69]:
>>> import numpy as np
>>> def my_custom_loss_func(ground_truth, predictions):
...     diff = np.abs(ground_truth - predictions).max()
...     return np.log(1 + diff)
>>> # The `loss` scorer negates the return value of my_custom_loss_func,
>>> #  which is np.log(2), 0.693, given the values for ground_truth
>>> #  and predictions defined below; the `score` scorer returns it unchanged.
>>> loss  = make_scorer(my_custom_loss_func, greater_is_better=False)
>>> score = make_scorer(my_custom_loss_func, greater_is_better=True)
>>> ground_truth = [[1], [1]]
>>> predictions  = [0, 1]
>>> from sklearn.dummy import DummyClassifier
>>> clf = DummyClassifier(strategy='most_frequent', random_state=0)
>>> clf = clf.fit(ground_truth, predictions)
>>> loss(clf,ground_truth, predictions) 
>>> score(clf,ground_truth, predictions) 

0.6931471805599453

#### Multiple Metric Evaluation
Multiple scoring metrics can be specified for the `scoring` parameter:  

1. As an iterable of string metrics  
2. As a dict mapping the scorer name to the scoring function  
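
Both forms can be sketched as follows. The dataset, `cv=3` and the dict keys `prec_macro` and `rec_macro` are hypothetical names chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate
from sklearn.svm import LinearSVC

X, y = make_classification(n_classes=2, random_state=0)
clf = LinearSVC(random_state=0)

# 1. Iterable of predefined metric names: result keys are 'test_<name>'.
results_list = cross_validate(clf, X, y, cv=3,
                              scoring=['precision_macro', 'recall_macro'])

# 2. Dict mapping a scorer name to a predefined name or a scorer object:
#    result keys are 'test_<dict key>'.
scoring_dict = {
    'prec_macro': 'precision_macro',
    'rec_macro': make_scorer(recall_score, average='macro'),
}
results_dict = cross_validate(clf, X, y, cv=3, scoring=scoring_dict)

print(results_list['test_recall_macro'])
print(results_dict['test_rec_macro'])
```
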


In [70]:
>>> from sklearn.model_selection import cross_validate
>>> from sklearn.metrics import confusion_matrix
>>> # A sample toy binary classification dataset
>>> X, y = datasets.make_classification(n_classes=2, random_state=0)
>>> svm = LinearSVC(random_state=0)
>>> # For binary labels, confusion_matrix is laid out as [[tn, fp], [fn, tp]]
>>> def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
>>> def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
>>> def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
>>> def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]
>>> scoring = {'tp' : make_scorer(tp), 'tn' : make_scorer(tn), 'fp' : make_scorer(fp), 'fn' : make_scorer(fn)}
>>> cv_results = cross_validate(svm, X, y, scoring=scoring)
>>> # Getting the test set true negative scores
>>> print(cv_results['test_tn'])
>>> # Getting the test set false positive scores
>>> print(cv_results['test_fp'])

[12 13 15]
[5 4 1]


#### Classification Score:  Binary Classification   

`precision_recall_curve(y_true, probas_pred)` - Compute precision-recall pairs for different probability thresholds  
`roc_curve(y_true, y_score[, pos_label, …])` - Compute Receiver operating characteristic (ROC)  
`roc_auc_score` - Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.  
`average_precision_score` - Compute average precision (AP) from prediction scores  
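
The four functions above can be tried on a tiny hand-made example. The labels and scores below are illustrative values, not data from this notebook.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

# Binary ground truth and continuous scores (e.g. predicted probabilities).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Curves: one point per decision threshold.
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)

# Scalar summaries of the two curves.
auc_value = roc_auc_score(y_true, y_score)
ap_value = average_precision_score(y_true, y_score)

print(auc_value)  # 0.75: 3 of the 4 (positive, negative) pairs are ranked correctly
print(ap_value)
```
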

#### Classification Score: MultiClass  

`cohen_kappa_score(y1, y2[, labels, weights, …])` - Cohen’s kappa: a statistic that measures inter-annotator agreement.  
`confusion_matrix(y_true, y_pred[, labels, …])` - Compute confusion matrix to evaluate the accuracy of a classification  
`hinge_loss(y_true, pred_decision[, labels, …])` - Average hinge loss (non-regularized)  
`matthews_corrcoef(y_true, y_pred[, …])` - Compute the Matthews correlation coefficient (MCC)  
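
As a quick illustration of the last entry, the Matthews correlation coefficient can be checked by hand against its definition from the confusion-matrix counts. The labels below are made-up values.

```python
from math import sqrt

from sklearn.metrics import matthews_corrcoef

y_true = [+1, +1, +1, -1]
y_pred = [+1, -1, +1, +1]

mcc = matthews_corrcoef(y_true, y_pred)

# Same value from the counts: tp=2, fn=1, fp=1, tn=0.
tp, fn, fp, tn = 2, 1, 1, 0
mcc_by_hand = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(mcc, mcc_by_hand)
```
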

#### Classification Score: MultiLabel  
`accuracy_score(y_true, y_pred[, normalize, …])` - Accuracy classification score.  
`classification_report(y_true, y_pred[, …])` - Build a text report showing the main classification metrics  
`f1_score(y_true, y_pred[, labels, …])` - Compute the F1 score, also known as balanced F-score or F-measure  
`fbeta_score(y_true, y_pred, beta[, labels, …])` - Compute the F-beta score  
`hamming_loss(y_true, y_pred[, labels, …])` - Compute the average Hamming loss.  
`jaccard_similarity_score(y_true, y_pred[, …])` - Jaccard similarity coefficient score  
`log_loss(y_true, y_pred[, eps, normalize, …])` - Log loss, aka logistic loss or cross-entropy loss.  
`precision_recall_fscore_support(y_true, y_pred)` - Compute precision, recall, F-measure and support for each class  
`precision_score(y_true, y_pred[, labels, …])` - Compute the precision  
`recall_score(y_true, y_pred[, labels, …])` - Compute the recall  
`zero_one_loss(y_true, y_pred[, normalize, …])` - Zero-one classification loss.  
`average_precision_score(y_true, y_score[, …])` - Compute average precision (AP) from prediction scores  
`roc_auc_score(y_true, y_score[, average, …])` - Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.  

#### Accuracy Score  
The accuracy_score function computes the accuracy, either the fraction (default) or the count (normalize=False) of correct predictions.

In multilabel classification, the function returns the subset accuracy. If the entire set of predicted labels for a sample strictly match with the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0.

If $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value, then the fraction of correct predictions over $n_\text{samples}$ is defined as

$$\texttt{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)$$

where $1(x)$ is the indicator function.  


In [71]:
>>> import numpy as np
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> print(accuracy_score(y_true, y_pred))
>>> print(accuracy_score(y_true, y_pred, normalize=False))
>>> print(accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2))))

0.5
2
0.5


#### Cohen’s kappa  
This measure is intended to compare labelings by different human annotators, not a classifier versus a ground truth.

The kappa score (see docstring) is a number between -1 and 1. Scores above .8 are generally considered good agreement; zero or lower means no agreement (practically random labels).  


In [72]:
>>> from sklearn.metrics import cohen_kappa_score
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> print(cohen_kappa_score(y_true, y_pred))  

0.4285714285714286


#### Confusion matrix  
`confusion_matrix` evaluates classification accuracy by computing the confusion matrix.

By definition, entry i, j in a confusion matrix is the number of observations actually in group i, but predicted to be in group j.  


In [73]:
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> print(confusion_matrix(y_true, y_pred))  

[[2 0 0]
 [0 0 1]
 [1 0 2]]
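
For binary problems the four cells are often unpacked directly. The sketch below uses made-up labels and relies on the row-is-true, column-is-predicted layout described above, so the flattened order is tn, fp, fn, tp.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1]
y_pred = [1, 1, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[tn, fp],
#  [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 0 2 1 1
```
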


#### Classification report  
`classification_report` builds a text report showing the main classification metrics.  

In [74]:
>>> from sklearn.metrics import classification_report
>>> y_true = [0, 1, 2, 2, 0]
>>> y_pred = [0, 0, 2, 1, 0]
>>> target_names = ['class 0', 'class 1', 'class 2']
>>> print(classification_report(y_true, y_pred, target_names=target_names))  

             precision    recall  f1-score   support

    class 0       0.67      1.00      0.80         2
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.50      0.67         2

avg / total       0.67      0.60      0.59         5



#### Hamming loss  
`hamming_loss` computes the average Hamming loss or Hamming distance between two sets of samples.

If $\hat{y}_j$ is the predicted value for the $j$-th label of a given sample, $y_j$ is the corresponding true value, and $n_\text{labels}$ is the number of classes or labels, then the Hamming loss $L_{Hamming}$ between two samples is defined as:

$$L_{Hamming}(y, \hat{y}) = \frac{1}{n_\text{labels}} \sum_{j=0}^{n_\text{labels} - 1} 1(\hat{y}_j \not= y_j)$$  

In [75]:
>>> from sklearn.metrics import hamming_loss
>>> y_pred = [1, 2, 3, 4]
>>> y_true = [2, 2, 3, 4]
>>> print(hamming_loss(y_true, y_pred))
>>> print(hamming_loss(np.array([[0, 1], [1, 1]]), np.zeros((2, 2))))

0.25
0.75


#### Jaccard similarity coefficient score  
`jaccard_similarity_score` computes the average (default) or sum of Jaccard similarity coefficients, also called the Jaccard index, between pairs of label sets.

The Jaccard similarity coefficient of the $i$-th sample, with a ground truth label set $y_i$ and predicted label set $\hat{y}_i$, is defined as

$$J(y_i, \hat{y}_i) = \frac{|y_i \cap \hat{y}_i|}{|y_i \cup \hat{y}_i|}.$$  

In binary and multiclass classification, the Jaccard similarity coefficient score is equal to the classification accuracy.  
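
The definition can be reproduced with plain Python sets. This is a sketch of the formula only, not of the library implementation. The helper name `jaccard`, the sample label sets, and the convention of returning 1.0 for two empty sets are assumptions made here.

```python
def jaccard(true_labels, predicted_labels):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two label sets."""
    true_set, pred_set = set(true_labels), set(predicted_labels)
    union = true_set | pred_set
    if not union:
        # Assumed convention: two empty label sets count as a perfect match.
        return 1.0
    return len(true_set & pred_set) / len(union)


# (true label set, predicted label set) for three multilabel samples
samples = [({1, 2}, {2, 3}), ({1}, {1}), ({0, 2}, {1})]

per_sample = [jaccard(t, p) for t, p in samples]
mean_jaccard = sum(per_sample) / len(per_sample)
print(per_sample, mean_jaccard)
```
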

In [76]:
>>> import numpy as np
>>> from sklearn.metrics import jaccard_similarity_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> print(jaccard_similarity_score(y_true, y_pred))
>>> print(jaccard_similarity_score(y_true, y_pred, normalize=False))  

0.5
2


#### Precision, recall and F-measures  
`precision` - the ability of the classifier not to label as positive a sample that is negative  
`recall` - the ability of the classifier to find all the positive samples  
`F-measures` - interpreted as a weighted harmonic mean of the precision and recall  

`average_precision_score(y_true, y_score[, …])` - Compute average precision (AP) from prediction scores  
`f1_score(y_true, y_pred[, labels, …])` - Compute the F1 score, also known as balanced F-score or F-measure  
`fbeta_score(y_true, y_pred, beta[, labels, …])` - Compute the F-beta score  
`precision_recall_curve(y_true, probas_pred)` - Compute precision-recall pairs for different probability thresholds  
`precision_recall_fscore_support(y_true, y_pred)` - Compute precision, recall, F-measure and support for each class  
`precision_score(y_true, y_pred[, labels, …])` - Compute the precision  
`recall_score(y_true, y_pred[, labels, …])` - Compute the recall  
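
A short binary example of these functions, using made-up labels: there is one true positive, no false positive and one false negative, so precision is 1.0 and recall is 0.5.

```python
from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score

y_true = [0, 1, 0, 1]
y_pred = [0, 1, 0, 0]

precision = precision_score(y_true, y_pred)   # tp / (tp + fp) = 1 / 1
recall = recall_score(y_true, y_pred)         # tp / (tp + fn) = 1 / 2
f1 = f1_score(y_true, y_pred)                 # harmonic mean of the two
f2 = fbeta_score(y_true, y_pred, beta=2)      # beta > 1 weights recall more

print(precision, recall, f1, f2)
```
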

[More Detail](http://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-and-f-measures)  

#### More and More ...  
[Detail](http://scikit-learn.org/stable/modules/model_evaluation.html#hinge-loss)  

### Model Persistence  
After training a scikit-learn model, it is desirable to have a way to persist the model for future use without having to retrain. The following example persists a model with pickle and with joblib. Two caveats apply to pickle serialization: never unpickle data from an untrusted source, since loading can execute arbitrary code, and a model saved with one scikit-learn version is not guaranteed to load correctly in another.  


In [77]:
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> print(clf2.predict(X[0:1]))
>>> print(y[0])
>>> import joblib  # standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23
>>> joblib.dump(clf, 'filename.pkl') 
>>> clf = joblib.load('filename.pkl') 

[0]
0


### Validation Curves: Plotting Scores to Evaluate Models  
[Detail](http://scikit-learn.org/stable/modules/learning_curve.html)  

#### Validation Curve  
It is sometimes helpful to plot the influence of a single hyperparameter on the training score and the validation score to find out whether the estimator is overfitting or underfitting for some hyperparameter values.

In [78]:
>>> import numpy as np
>>> from sklearn.model_selection import validation_curve
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import Ridge
>>> np.random.seed(0)
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> indices = np.arange(y.shape[0])
>>> np.random.shuffle(indices)
>>> X, y = X[indices], y[indices]
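>>> # Keyword arguments work across versions; newer releases no longer accept them positionally
>>> train_scores, valid_scores = validation_curve(Ridge(), X, y, param_name="alpha", param_range=np.logspace(-7, 3, 3))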
>>> print(train_scores, '\n\n', valid_scores)

[[0.94141575 0.92944161 0.92267644]
 [0.94141563 0.92944153 0.92267633]
 [0.47253778 0.45601093 0.42887489]] 

 [[0.90335825 0.92525985 0.94159336]
 [0.90338529 0.92523396 0.94159078]
 [0.44639995 0.39639757 0.4567671 ]]


#### Learning Curve  
A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error. If both the validation score and the training score converge to a value that is too low with increasing size of the training set, we will not benefit much from more training data. In that case we will probably have to use an estimator, or a parametrization of the current estimator, that can learn more complex concepts (i.e. has a lower bias).  

If the training score is much greater than the validation score for the maximum number of training samples, adding more training samples will most likely increase generalization.  


In [79]:
>>> from sklearn.model_selection import learning_curve
>>> from sklearn.svm import SVC
>>> train_sizes, train_scores, valid_scores = learning_curve(SVC(kernel='linear'), X, y, train_sizes=[50, 80, 110], cv=5)
>>> print(train_sizes, '\n\n\n', train_scores, '\n\n\n', valid_scores)

[ 50  80 110] 


 [[0.98       0.98       0.98       0.98       0.98      ]
 [0.9875     1.         0.9875     0.9875     0.9875    ]
 [0.98181818 1.         0.98181818 0.98181818 0.99090909]] 


 [[1.         0.93333333 1.         1.         0.96666667]
 [1.         0.96666667 1.         1.         0.96666667]
 [1.         0.96666667 1.         1.         0.96666667]]
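
One way to read such arrays is to average over the folds and look at the gap between training and validation scores at each training size. The sketch below is self-contained and assumes the iris data; `shuffle=True` and `random_state=0` are illustrative choices so that each training subset contains all classes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

train_sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel='linear'), X, y, train_sizes=[50, 80, 110], cv=5,
    shuffle=True, random_state=0)

# Each row corresponds to one training size, each column to one CV fold.
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)

# A large positive gap at the largest size hints at variance (overfitting);
# two low, converged means hint at bias (underfitting).
gap = train_mean - valid_mean
for size, tr, va, g in zip(train_sizes, train_mean, valid_mean, gap):
    print(size, round(tr, 3), round(va, 3), round(g, 3))
```
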


## Strategies to scale computationally: bigger data  
[Detail](http://scikit-learn.org/stable/modules/scaling_strategies.html)    


## Computational Performance  
[Detail](http://scikit-learn.org/stable/modules/computational_performance.html)  