# Machine learning with Scikit-learn

## Classification

Classification is the task of assigning data points to categories. Machine learning algorithms do this automatically by learning from existing
training data points and then predicting the categories of unseen data points. As the name suggests, classification finds classes (categories), which are also
known as targets, outputs, or labels of the data points.
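
Because the goal is to predict categories for unseen data points, accuracy is most informative when measured on samples the model did not train on. The following is a minimal sketch of a hold-out evaluation; the 30% test size, the `stratify` option, and `max_iter=1000` are illustrative assumptions, not settings taken from the cells in this section.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples; stratify keeps the class proportions
# the same in the training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# max_iter is raised above the default (an assumption) to give the solver
# room to converge.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Train accuracy:", round(train_accuracy, 2))
print("Test accuracy:", round(test_accuracy, 2))
```

The test accuracy is the better estimate of how the classifier behaves on new data; the training accuracy is usually somewhat optimistic.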

### Linear model

In [4]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np

In [10]:
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0).fit(X, y)
clf.predict(X[:2, :])
clf.predict_proba(X[:2, :])
score = clf.score(X, y)
print("Accuracy: {}".format(str(np.round(score, 2))))

Accuracy: 0.97


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
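
The solver message above suggests two remedies: scaling the data and raising `max_iter`. A minimal sketch that applies both, assuming a `StandardScaler` step in a pipeline and an illustrative `max_iter=1000` (neither is part of the original cell):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize the features first, then fit the classifier on the scaled data.
scaled_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=0),
)
scaled_clf.fit(X, y)

# n_iter_ reports how many iterations the solver actually used.
iterations_used = scaled_clf.named_steps["logisticregression"].n_iter_
training_accuracy = scaled_clf.score(X, y)
print("Iterations used:", iterations_used)
print("Training accuracy:", round(training_accuracy, 2))
```

If the reported iteration count stays below `max_iter`, the solver stopped because it converged rather than because it ran out of iterations.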


### Support vector machine (classifier)

In [12]:
from sklearn import svm
clf = svm.SVC()
clf.fit(X, y)
clf.predict(X[:2, :])
score = clf.score(X, y)
print("Accuracy: {}".format(str(np.round(score, 2))))

Accuracy: 0.97


### Decision trees

In [16]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
score = clf.score(X, y)
print("Accuracy: {}".format(str(np.round(score, 2))))

Accuracy: 1.0
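
A training accuracy of 1.0 mostly shows that an unconstrained tree can memorize the data it was fitted on. A hedged sketch of a fairer estimate using 5-fold cross-validation; the fold count and `random_state` are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree_clf = DecisionTreeClassifier(random_state=0)

# Each fold is scored on samples that were left out while fitting.
cv_scores = cross_val_score(tree_clf, X, y, cv=5)
mean_cv_accuracy = cv_scores.mean()
print("Fold accuracies:", cv_scores.round(2))
print("Mean accuracy:", round(mean_cv_accuracy, 2))
```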


### Random forest classifier

In [17]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf = clf.fit(X, y)
score = clf.score(X, y)
print("Accuracy: {}".format(str(np.round(score, 2))))

Accuracy: 0.97


## Regression

Regression is also a type of supervised learning, so it works with datasets that have targets. The difference is that the targets are real numbers rather than categories. Example regression tasks include house price prediction, stock price prediction, and temperature prediction; in all of them the predicted quantities are real numbers. Scikit-learn provides many regression algorithms, both linear and non-linear.

### Linear model

In [22]:
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
# y = 1 * x_0 + 2 * x_1 + 3
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression().fit(X, y)
reg.score(X, y)

1.0

In [23]:
reg.coef_

array([1., 2.])

In [24]:
reg.intercept_

3.0000000000000018

In [25]:
reg.predict(np.array([[3, 5]]))

array([16.])
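
For regressors, `score` returns the coefficient of determination R², which is why the noise-free data above scores exactly 1.0. The sketch below recomputes R² by hand on a noisy version of the same linear relation; the sample size, the feature range, and the noise level are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, size=(50, 2))
# y = 1 * x_0 + 2 * x_1 + 3, plus Gaussian noise (assumed scale 0.5)
y = X @ np.array([1.0, 2.0]) + 3.0 + rng.normal(scale=0.5, size=50)

reg = LinearRegression().fit(X, y)
y_pred = reg.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("Manual R^2:", round(r2_manual, 4))
print("reg.score: ", round(reg.score(X, y), 4))
```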

### Support vector machine (regressor)

In [28]:
from sklearn import svm
X = [[0, 0], [2, 2]]
y = [0.5, 2.5] # targets are real numbers
regr = svm.SVR()
regr.fit(X, y)

SVR()

In [27]:
regr.predict([[1, 1]])

array([1.5])

### Decision trees

In [29]:
from sklearn import tree
X = [[0, 0], [2, 2]]
y = [0.5, 2.5]
clf = tree.DecisionTreeRegressor()
clf = clf.fit(X, y)

In [30]:
clf.predict([[1, 1]])

array([0.5])

### Random forest regressor

In [31]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features=4, n_informative=2,
                       random_state=0, shuffle=False)
regr = RandomForestRegressor(max_depth=2, random_state=0)
regr.fit(X, y)

print(regr.predict([[0, 0, 0, 0]]))

[-8.32987858]


## Clustering

Clustering is an unsupervised machine learning technique that works on datasets in which targets are not defined. Clustering algorithms look for hidden patterns or groups in the data. Examples include K-means clustering, hierarchical clustering, and DBSCAN.

### K-Means clustering

In [34]:
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_

array([1, 1, 1, 0, 0, 0], dtype=int32)

In [35]:
kmeans.predict([[0, 0], [12, 3]])
kmeans.cluster_centers_

array([[10.,  2.],
       [ 1.,  2.]])
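
`predict` assigns each new point to the cluster whose center is nearest. A small sketch that checks this by computing the distances directly; `n_init=10` is set explicitly as an assumption, and the cluster numbering itself is arbitrary and may differ between runs or versions.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

new_points = np.array([[0, 0], [12, 3]])
predicted = kmeans.predict(new_points)

# Euclidean distance from every new point to every cluster center
distances = np.linalg.norm(
    new_points[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
nearest_center = distances.argmin(axis=1)

print("predict:       ", predicted)
print("nearest center:", nearest_center)
```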

### Agglomerative clustering

In [37]:
from sklearn.cluster import AgglomerativeClustering
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
clustering = AgglomerativeClustering().fit(X)
clustering

AgglomerativeClustering()

In [38]:
clustering.labels_

array([1, 1, 1, 0, 0, 0])

### DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

In [39]:
from sklearn.cluster import DBSCAN
import numpy as np
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
clustering.labels_

clustering

DBSCAN(eps=3, min_samples=2)
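
The cell above does not display `labels_`. DBSCAN marks points that belong to no cluster with the label `-1`, so the labels can be used to count both clusters and noise points. A minimal sketch on the same data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])
labels = DBSCAN(eps=3, min_samples=2).fit(X).labels_

# -1 is the label DBSCAN gives to noise points
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print("Labels:", labels)
print("Clusters:", n_clusters, "Noise points:", n_noise)
```

Here the isolated point `[25, 80]` is farther than `eps` from every other sample, so it ends up as noise.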

## Hyperparameter optimisation (Grid search)

In [54]:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
import pandas as pd

iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)


sorted(clf.cv_results_.keys())

['mean_fit_time',
 'mean_score_time',
 'mean_test_score',
 'param_C',
 'param_kernel',
 'params',
 'rank_test_score',
 'split0_test_score',
 'split1_test_score',
 'split2_test_score',
 'split3_test_score',
 'split4_test_score',
 'std_fit_time',
 'std_score_time',
 'std_test_score']

In [55]:
cv_results = pd.DataFrame(clf.cv_results_)

In [56]:
cv_results

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0012 | 0.000244 | 0.000558 | 0.000104 | 1 | linear | {'C': 1, 'kernel': 'linear'} | 0.966667 | 1.0 | 0.966667 | 0.966667 | 1.0 | 0.98 | 0.01633 | 1 |
| 1 | 0.001052 | 0.00012 | 0.000589 | 0.000206 | 1 | rbf | {'C': 1, 'kernel': 'rbf'} | 0.966667 | 0.966667 | 0.966667 | 0.933333 | 1.0 | 0.966667 | 0.021082 | 4 |
| 2 | 0.001146 | 0.000142 | 0.000595 | 0.000174 | 10 | linear | {'C': 10, 'kernel': 'linear'} | 1.0 | 1.0 | 0.9 | 0.966667 | 1.0 | 0.973333 | 0.038873 | 3 |
| 3 | 0.001211 | 0.000313 | 0.000486 | 4.8e-05 | 10 | rbf | {'C': 10, 'kernel': 'rbf'} | 0.966667 | 1.0 | 0.966667 | 0.966667 | 1.0 | 0.98 | 0.01633 | 1 |


In [57]:
# Rows ranked first by mean_test_score, i.e. the best hyperparameter combinations
cv_results[cv_results["rank_test_score"] == 1]

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0012 | 0.000244 | 0.000558 | 0.000104 | 1 | linear | {'C': 1, 'kernel': 'linear'} | 0.966667 | 1.0 | 0.966667 | 0.966667 | 1.0 | 0.98 | 0.01633 | 1 |
| 3 | 0.001211 | 0.000313 | 0.000486 | 4.8e-05 | 10 | rbf | {'C': 10, 'kernel': 'rbf'} | 0.966667 | 1.0 | 0.966667 | 0.966667 | 1.0 | 0.98 | 0.01633 | 1 |
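
Besides filtering `cv_results_`, a fitted `GridSearchCV` object exposes the selected configuration directly. A short sketch using the same parameter grid; the variable name `search` is chosen here only for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}
search = GridSearchCV(svm.SVC(), parameters).fit(iris.data, iris.target)

# Hyperparameters of the top-ranked candidate and its mean cross-validated score
print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 2))

# With the default refit=True, best_estimator_ is refitted on the full dataset
best_model = search.best_estimator_
predictions = best_model.predict(iris.data[:2])
print("Predictions:", predictions)
```

When several candidates tie for rank 1, as in the table above, `best_params_` reports only one of them.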


## Data preprocessing

Raw datasets are rarely clean or ready to use with machine learning algorithms. They can contain many impurities, such as outliers, incorrect labels, features with large values, missing values, and a large number of features. These impurities may lead to suboptimal learning and prediction. Several preprocessing techniques help avoid this, some of which are shown below.

### Standard scaler

In [63]:
from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
scaler = preprocessing.StandardScaler().fit(X_train)
scaler
scaler.mean_

array([1.        , 0.        , 0.33333333])

In [64]:
scaler.scale_

array([0.81649658, 0.81649658, 1.24721913])

In [65]:
X_scaled = scaler.transform(X_train)
X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
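
`StandardScaler` subtracts the per-feature mean and divides by the per-feature standard deviation. The sketch below reproduces the transform with NumPy and then applies the fitted scaler to a new sample; `X_new` is a made-up point used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
scaler = StandardScaler().fit(X_train)

# z = (x - mean) / std, computed column by column (population std, ddof=0)
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
matches = np.allclose(manual, scaler.transform(X_train))
print("Manual result matches scaler:", matches)

# New data is scaled with the statistics learned from the training data
X_new = np.array([[-1., 1., 0.]])
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```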

### Min-max scaler

In [66]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])
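
`MinMaxScaler` maps each feature to `(x - min) / (max - min)` using the minimum and maximum seen during fitting. A sketch with a hypothetical test sample `X_test` shows that values outside the training range fall outside [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
min_max_scaler = MinMaxScaler().fit(X_train)

# Hypothetical new sample; its first and third values lie outside the training range
X_test = np.array([[-3., -1., 4.]])
X_test_minmax = min_max_scaler.transform(X_test)

# Same computation by hand, using the training minimum and maximum
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual = (X_test - col_min) / (col_max - col_min)

print(X_test_minmax)
```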

### Transform to a distribution

#### Gaussian distribution

In [68]:
pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))
X_lognormal
pt.fit_transform(X_lognormal)

array([[ 0.49024349,  0.17881995, -0.1563781 ],
       [-0.05102892,  0.58863195, -0.57612415],
       [ 0.69420009, -0.84857822,  0.10051454]])
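
The Box-Cox method only accepts strictly positive values. For data that contains zeros or negative values, `PowerTransformer` also offers the Yeo-Johnson method. A minimal sketch, where `X_mixed` is a made-up array used only to show the difference:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Made-up data containing a negative value and a zero
X_mixed = np.array([[-1.0, 2.0],
                    [ 0.0, 3.0],
                    [ 4.0, 10.0],
                    [ 2.5, 1.0]])

# Box-Cox rejects non-positive input with a ValueError
try:
    PowerTransformer(method="box-cox").fit(X_mixed)
    box_cox_failed = False
except ValueError:
    box_cox_failed = True
print("Box-Cox rejected the data:", box_cox_failed)

# Yeo-Johnson accepts any real values; standardize=True rescales the result
# to zero mean and unit variance
yeo = PowerTransformer(method="yeo-johnson", standardize=True)
X_transformed = yeo.fit_transform(X_mixed)
print(X_transformed)
```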

### Normalisation

In [69]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')

X_normalized

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

### Encoding

#### Categorical variables

In [76]:
enc = preprocessing.OrdinalEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)

enc.transform([['female', 'from US', 'uses Safari']])

array([[0., 1., 1.]])

#### One hot encoding

In [77]:
enc = preprocessing.OneHotEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)

enc.transform([['female', 'from US', 'uses Safari'],
               ['male', 'from Europe', 'uses Safari']]).toarray()

array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])
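
The column order of the one-hot output follows the `categories_` learned during `fit`. By default the encoder raises an error for categories it has not seen; the sketch below assumes `handle_unknown="ignore"` instead, which encodes an unseen value as all zeros for that feature. The value `'from Asia'` is a hypothetical unseen category.

```python
from sklearn.preprocessing import OneHotEncoder

X = [["male", "from US", "uses Safari"],
     ["female", "from Europe", "uses Firefox"]]
enc = OneHotEncoder(handle_unknown="ignore").fit(X)

# One sorted array of categories per input feature; output columns follow this order
learned_categories = [list(c) for c in enc.categories_]
print(learned_categories)

# 'from Asia' was not seen during fit, so both location columns are 0
unseen = enc.transform([["female", "from Asia", "uses Safari"]]).toarray()
print(unseen)
```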

### Imputation

#### Simple imputer

In [80]:
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))


[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]
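
The fill values come from the data passed to `fit`, not from the data being transformed, and are stored in `statistics_`. A short sketch that inspects them and compares the mean strategy with `strategy="median"`:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_fit = [[1, 2], [np.nan, 3], [7, 6]]

mean_imp = SimpleImputer(missing_values=np.nan, strategy="mean").fit(X_fit)
median_imp = SimpleImputer(missing_values=np.nan, strategy="median").fit(X_fit)

# One fill value per column, computed while ignoring the missing entries
print("Column means:  ", mean_imp.statistics_)
print("Column medians:", median_imp.statistics_)
```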


#### Multivariate imputer

In [81]:
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imp = IterativeImputer(max_iter=10, random_state=0)
imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])

X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
# the model learns that the second feature is double the first
print(np.round(imp.transform(X_test)))

[[ 1.  2.]
 [ 6. 12.]
 [ 3.  6.]]
