# SCIKIT-LEARN Progress
# Fitting and predicting: estimator basics
scikit-learn provides many built-in machine learning algorithms and models.
The library calls all of them **estimators**, and each one is fitted to data in the same way.

Here is a simple example where we fit a RandomForestClassifier to some very basic data:

In [1]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1]  # classes of each sample
clf.fit(X, y)

RandomForestClassifier(random_state=0)

Here, the **estimator** RandomForestClassifier has the **method fit**.

All estimators have a fit method.

The fit method generally accepts 2 inputs:

- The **samples matrix X** (or design matrix), which has shape (**n_samples**, **n_features**): samples are rows and features are columns.
- The **target values y**:
  - For supervised tasks, y is typically a 1d array where the i-th entry corresponds to the target of the i-th sample (row) of X.
  - For unsupervised learning tasks, y does not need to be specified.

Both X and y are usually expected to be numpy arrays or equivalent array-like data types.
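To make the shapes concrete, here is a minimal sketch of the two inputs to fit. The data values are made up for illustration, and KMeans is our own choice of unsupervised estimator; it is not part of the example above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Samples matrix: 4 samples (rows), 3 features (columns); values are illustrative
X = np.array([[1, 2, 3],
              [2, 3, 4],
              [11, 12, 13],
              [12, 13, 14]])
# Target values: one entry per row of X
y = np.array([0, 0, 1, 1])

print(X.shape)  # (4, 3) -> (n_samples, n_features)
print(y.shape)  # (4,)   -> (n_samples,)

# Supervised estimator: fit takes both X and y
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Unsupervised estimator: fit takes X only
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_.shape)  # one cluster label per sample
```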

In [3]:
clf.predict(X)  # predict classes of the training data

array([0, 1])

In [4]:
clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data

array([0, 1])

# Transformers and pre-processors
Machine learning workflows are often composed of different parts. A typical pipeline consists of a pre-processing step that transforms or imputes the data, and a final predictor that predicts target values.

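In other words, data usually needs to be pre-processed before it can be used for prediction, for example by standardizing each feature.

Transformers and pre-processors follow the same API as estimators, but instead of predict they have the method **transform**, which outputs a transformed samples matrix X.

To apply different transformations to different features, use a ColumnTransformer.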

In [5]:
from sklearn.preprocessing import StandardScaler
X = [[0, 15],
     [1, -10]]
# scale data according to computed scaling values
StandardScaler().fit(X).transform(X)

array([[-1.,  1.],
       [ 1., -1.]])
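The ColumnTransformer mentioned above can be sketched as follows. The data, the step names ("standardize", "rescale") and the choice of MinMaxScaler for the second column are assumptions made for illustration only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative data: each column gets a different treatment
X = np.array([[0.0, 15.0],
              [1.0, -10.0],
              [2.0, 40.0]])

ct = ColumnTransformer([
    ("standardize", StandardScaler(), [0]),  # column 0: zero mean, unit variance
    ("rescale", MinMaxScaler(), [1]),        # column 1: mapped to the range [0, 1]
])

X_t = ct.fit_transform(X)
print(X_t.shape)  # (3, 2): same shape as X
print(X_t)
```

Each entry in the list is a (name, transformer, columns) triple, so adding another kind of pre-processing for other columns is a matter of adding another triple.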

# Pipelines: chaining pre-processors and estimators
Transformers and estimators (predictors) can be combined together into a single unifying object: a **Pipeline**.
As we will see later, using a pipeline also helps protect you from data leakage, i.e. disclosing some testing data in your training data.

In the following example, we load the Iris dataset, split it into train and test sets, and compute the accuracy score of a pipeline on the test data:

The Iris dataset is a toy dataset of iris flowers: 150 instances with 4 features each, split evenly across 3 classes (50 instances per class).

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# create a pipeline object
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)

# load the iris dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit the whole pipeline
pipe.fit(X_train, y_train)

In [7]:
# we can now use it like any other estimator
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test, pipe.predict(X_test))

0.9736842105263158
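One way to see why a pipeline guards against data leakage is to check which rows the scaler actually learned from. The sketch below rebuilds the same pipeline and compares the scaler's stored means with the training-set means; the check itself is our own addition.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
scaler = pipe.named_steps["standardscaler"]

# The scaling statistics were computed from the training rows only,
# so nothing about X_test leaked into the fitted pipeline
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))  # True
```

When pipe.predict(X_test) is called, the test rows are scaled with these training statistics and are never used to refit the scaler.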

# Model evaluation
**Fitting a model to some data does not entail that it will predict well on unseen data**. This needs to be directly evaluated. We have just seen the train_test_split helper that splits a dataset into train and test sets, but scikit-learn provides many other tools for model evaluation, in particular for cross-validation.

Here we briefly show how to perform a 5-fold cross-validation procedure using the cross_validate helper.

In [8]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()

result = cross_validate(lr, X, y)  # defaults to 5-fold CV
result['test_score']  # r_squared score is high because dataset is easy

array([1., 1., 1., 1., 1.])
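cross_validate can also take an explicit number of folds and several metrics at once. The following is a small sketch; the choice of 3 folds and of these two scoring names is ours, purely for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, random_state=0)

result = cross_validate(
    LinearRegression(), X, y,
    cv=3,                                       # 3 folds instead of the default 5
    scoring=["r2", "neg_mean_absolute_error"],  # evaluate two metrics at once
)

# One array per metric, prefixed with "test_", plus timing information
print(sorted(result.keys()))
print(result["test_r2"].shape)  # (3,) -> one score per fold
```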

# Automatic parameter searches
**All estimators have parameters** (often called hyper-parameters in the literature) **that can be tuned**.

The generalization power of an estimator often critically depends on a few parameters.

For example, a RandomForestRegressor has an n_estimators parameter that determines the number of trees in the forest, and a max_depth parameter that determines the maximum depth of each tree. Quite often, it is not clear what the exact values of these parameters should be since they depend on the data at hand.
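Before searching automatically, it can help to see how these two parameters are set and read by hand. This is a minimal sketch; the values 10, 5 and 3 are arbitrary.

```python
from sklearn.ensemble import RandomForestRegressor

# Parameters are passed to the constructor
reg = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=0)

# get_params returns the current parameters as a dict
params = reg.get_params()
print(params["n_estimators"], params["max_depth"])  # 10 5

# set_params changes them in place and returns the estimator
reg.set_params(max_depth=3)
print(reg.get_params()["max_depth"])  # 3
```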

Scikit-learn provides tools to automatically find the best parameter combinations (via cross-validation). In the following example, we randomly search over the parameter space of a random forest with a RandomizedSearchCV object. When the search is over, the RandomizedSearchCV behaves as a RandomForestRegressor that has been fitted with the best set of parameters.

In [9]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from scipy.stats import randint

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# define the parameter space that will be searched over
# (randint(low, high) samples integers from low to high - 1)
param_distributions = {'n_estimators': randint(1, 5),
                       'max_depth': randint(5, 10)}

# now create a searchCV object and fit it to the data
search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
                            n_iter=5,
                            param_distributions=param_distributions,
                            random_state=0)
search.fit(X_train, y_train)

In [10]:
search.best_params_

{'max_depth': 9, 'n_estimators': 4}

In [11]:
# the search object now acts like a normal random forest estimator
# with max_depth=9 and n_estimators=4
search.score(X_test, y_test)

0.735363411343253
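Beyond best_params_, a fitted search object exposes a few other attributes that are worth a look. The sketch below uses a small synthetic dataset from make_regression as a stand-in (an assumption, so that nothing has to be downloaded), so its scores will not match the California housing results above.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption, chosen to avoid a dataset download)
X, y = make_regression(n_samples=200, n_features=5, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": randint(1, 5),
                         "max_depth": randint(5, 10)},
    n_iter=5,
    random_state=0,
)
search.fit(X, y)

# The model refitted on all of X, y with the best parameters found
print(search.best_estimator_)
# Mean cross-validated score of that best candidate
print(search.best_score_)
# One entry per sampled candidate (n_iter=5)
print(len(search.cv_results_["params"]))  # 5
```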

# Examples

Next, work through the general examples: https://scikit-learn.org/stable/auto_examples/index.html#general-examples