<h2>An introduction to machine learning with scikit-learn</h2>

<h3>Loading an example dataset</h3>

In [3]:
from sklearn import datasets

iris = datasets.load_iris()
digits = datasets.load_digits()

print(digits.data.shape) # feature matrix of shape (n_samples, n_features)
print(digits.target) # target values (the digit each sample represents)

print(digits.images[0]) # first sample in its original form, an image of shape (8, 8)

(1797, 64)
[0 1 2 ... 8 9 8]
[[ 0.  0.  5. 13.  9.  1.  0.  0.]
 [ 0.  0. 13. 15. 10. 15.  5.  0.]
 [ 0.  3. 15.  2.  0. 11.  8.  0.]
 [ 0.  4. 12.  0.  0.  8.  8.  0.]
 [ 0.  5.  8.  0.  0.  9.  8.  0.]
 [ 0.  4. 11.  0.  1. 12.  7.  0.]
 [ 0.  2. 14.  5. 10. 12.  0.  0.]
 [ 0.  0.  6. 13. 10.  0.  0.  0.]]
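
The shapes printed above suggest that each row of digits.data is one 8x8 image flattened into 64 features. A minimal sketch to check this, assuming the bundled digits dataset (the variable name flattened is only illustrative):

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten every (8, 8) image into a vector of 64 pixel values.
n_samples = len(digits.images)
flattened = digits.images.reshape((n_samples, -1))

print(flattened.shape)
# Compare the flattened images with the ready-made feature matrix.
print(np.array_equal(flattened, digits.data))
```

This is the usual way to feed image-like data to an estimator, which expects a 2D array of shape (n_samples, n_features).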


<h3>Learning and predicting</h3>

In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T).
An example of an estimator is the class sklearn.svm.SVC, which implements support vector classification. The estimator’s constructor takes as arguments the model’s parameters.

In [4]:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)

In [6]:
clf.fit(digits.data[:-1], digits.target[:-1])
clf.predict(digits.data[-1:])

array([8])
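
The same fit/predict pattern works for more than one held-out sample. The sketch below keeps the last 10 images aside (the value 10 and the names n_held_out, predicted and expected are illustrative choices, not part of the original example) and compares the predictions with the true labels:

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.)

n_held_out = 10  # illustrative choice: number of samples kept out of training

# Train on everything except the last n_held_out samples.
clf.fit(digits.data[:-n_held_out], digits.target[:-n_held_out])

# Predict the held-out samples and compare with their true labels.
predicted = clf.predict(digits.data[-n_held_out:])
expected = digits.target[-n_held_out:]

print(predicted)
print(expected)
print((predicted == expected).mean())  # fraction of correct predictions
```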

<h3>Conventions</h3>

scikit-learn estimators follow certain rules to make their behavior more predictable. The first of these concerns type casting:

In [4]:
import numpy as np
from sklearn import kernel_approximation

rng = np.random.RandomState(0)
X = rng.rand(10, 2000)
print(X.dtype)
X = np.array(X, dtype='float32')
print(X.dtype)

float64
float32


Difference between float32 and float64: a float32 is a 32-bit floating-point number, while a float64 uses 64 bits. A float64 array therefore takes twice as much memory, and operations on it can be slower on some hardware. In exchange, float64 is far more precise (roughly 15 to 16 significant decimal digits, against about 7 for float32) and can represent numbers of much larger magnitude.
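
The trade-off can be checked directly with NumPy. This is a small sketch; the array size and the value 1e-8 are arbitrary choices made for illustration:

```python
import numpy as np

a64 = np.ones(1000, dtype=np.float64)
a32 = a64.astype(np.float32)

# Memory footprint: float64 needs twice as many bytes as float32.
print(a64.nbytes, a32.nbytes)

# Machine epsilon: the gap between 1.0 and the next representable number.
print(np.finfo(np.float32).eps, np.finfo(np.float64).eps)

# Largest representable finite value for each dtype.
print(np.finfo(np.float32).max, np.finfo(np.float64).max)

# A small increment is lost in float32 but kept in float64.
lost_in_32 = np.float32(1.0) + np.float32(1e-8) == np.float32(1.0)
kept_in_64 = np.float64(1.0) + np.float64(1e-8) != np.float64(1.0)
print(lost_in_32, kept_in_64)
```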

In [5]:
transformer = kernel_approximation.RBFSampler()
X_new = transformer.fit_transform(X)
print(X_new.dtype)  # dtype of the transformed data, not of the input

float32


In this example, X is float32, and fit_transform(X) preserves that dtype in its output.
Using float32-typed training (or testing) data is often more efficient than using the usual float64 dtype: it reduces memory usage and sometimes also processing time, because the CPU's vector instructions can handle more values at once. However, it can sometimes lead to numerical stability problems, making the algorithm more sensitive to the scale of the values and requiring adequate preprocessing.

Keep in mind however that not all scikit-learn estimators attempt to work in float32 mode. For instance, some transformers will always cast their input to float64 and return float64 transformed values as a result.
Regression targets are cast to float64 and classification targets are maintained:

In [10]:
from sklearn import datasets
from sklearn.svm import SVC
iris = datasets.load_iris()
clf = SVC()
clf.fit(iris.data, iris.target)

print(list(clf.predict(iris.data[:3])))

print(iris.target_names)
print(iris.target)
clf.fit(iris.data, iris.target_names[iris.target])

print(list(clf.predict(iris.data[:3])))

[0, 0, 0]
['setosa' 'versicolor' 'virginica']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
['setosa', 'setosa', 'setosa']


Here, the first predict() returns an integer array, since iris.target (an integer array) was used in fit. The second predict() returns a string array, since the targets were built from iris.target_names for the second fit.
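
The regression half of the rule can be observed in a similar way. The sketch below uses LinearRegression and a tiny made-up dataset, both chosen only for illustration; it shows that predictions come back as float64 even though the targets were integers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (made up for this sketch): y = 2 * x, with integer targets.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 2, 4, 6], dtype=np.int64)

reg = LinearRegression().fit(X, y)
pred = reg.predict(X)

print(y.dtype)     # dtype of the targets passed to fit
print(pred.dtype)  # dtype of the predictions
```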

<h3>Refitting and updating parameters</h3>

Hyper-parameters of an estimator can be updated after it has been constructed via the set_params() method. Calling fit() more than once will overwrite what was learned by any previous fit():

In [24]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)

clf = SVC()
clf.set_params(kernel='linear').fit(X, y)
print(clf.predict(X[:5]))
#print(X[:5]) # selects the first 5 samples
#print(X[1:]) # selects everything except the first sample
#print(X[:-5]) # selects everything except the last 5 samples
#print(X[-1:]) # selects the last sample
clf.set_params(kernel='rbf').fit(X, y)
print(clf.predict(X[:5]))

[0 0 0 0 0]
[0 0 0 0 0]


Here, the default kernel rbf is first changed to linear via SVC.set_params() after the estimator has been constructed, and changed back to rbf to refit the estimator and to make a second prediction.
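
Both claims can be inspected on the estimator itself. In this sketch, get_params() shows the hyper-parameter before and after set_params(), and the classes_ attribute shows that a second fit() replaces what the first one learned. Fitting on the first 100 samples only is an illustrative choice; in iris they cover classes 0 and 1.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

clf = SVC()
default_kernel = clf.get_params()["kernel"]
clf.set_params(kernel="linear")
updated_kernel = clf.get_params()["kernel"]
print(default_kernel, updated_kernel)

# First fit: only the first 100 samples, which contain classes 0 and 1.
clf.fit(X[:100], y[:100])
classes_first = clf.classes_.copy()

# Second fit: the full dataset. Nothing from the first fit is kept.
clf.fit(X, y)
classes_second = clf.classes_.copy()

print(classes_first, classes_second)
```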

<h3>Multiclass vs. multilabel fitting</h3>

When using multiclass classifiers, the learning and prediction task that is performed is dependent on the format of the target data fit upon:

In [33]:
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer

X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]

classif = OneVsRestClassifier(estimator=SVC(random_state=0))
classif.fit(X, y).predict(X)

array([0, 0, 1, 1, 2])

In the above case, the classifier is fit on a 1d array of multiclass labels and the predict() method therefore provides corresponding multiclass predictions. It is also possible to fit upon a 2d array of binary label indicators:

In [34]:
y = LabelBinarizer().fit_transform(y)
print(y)
classif.fit(X, y).predict(X)

[[1 0 0]
 [1 0 0]
 [0 1 0]
 [0 1 0]
 [0 0 1]]


array([[1, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 0]])

Here, the classifier is fit() on a 2d binary label representation of y, using the LabelBinarizer. In this case predict() returns a 2d array representing the corresponding multilabel predictions.
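
One way to see how scikit-learn interprets the two target formats is the helper type_of_target from sklearn.utils.multiclass. This is a sketch for inspection only; the tutorial code above does not call it, and the names kind_1d and kind_2d are illustrative:

```python
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils.multiclass import type_of_target

y = [0, 0, 1, 1, 2]
y_indicator = LabelBinarizer().fit_transform(y)

# A 1d array of class labels is treated as a multiclass target.
kind_1d = type_of_target(y)

# A 2d array of binary indicators is treated as a multilabel target.
kind_2d = type_of_target(y_indicator)

print(kind_1d, kind_2d)
```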

Note that the fourth and fifth instances returned all zeroes, indicating that they matched none of the three labels fit upon. With multilabel outputs, it is similarly possible for an instance to be assigned multiple labels:

In [35]:
from sklearn.preprocessing import MultiLabelBinarizer
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y = MultiLabelBinarizer().fit_transform(y)
print(y)
classif.fit(X, y).predict(X)

[[1 1 0 0 0]
 [1 0 1 0 0]
 [0 1 0 1 0]
 [1 0 1 1 0]
 [0 0 1 0 1]]


array([[1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0],
       [0, 1, 0, 1, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0]])

In this case, the classifier is fit upon instances each assigned multiple labels. The MultiLabelBinarizer is used to binarize the 2d array of multilabels to fit upon. As a result, predict() returns a 2d array with multiple predicted labels for each instance.

An example of a situation where MultiLabelBinarizer is needed:

In [1]:
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data
genres = ["Romance, Comedy", "Action, Adventure", "Drama", "Horror, Thriller"]

# Create an instance of MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Fit and transform the genres
binary_genre_matrix = mlb.fit_transform([genre.split(", ") for genre in genres])

print(binary_genre_matrix)

[[0 0 1 0 0 1 0]
 [1 1 0 0 0 0 0]
 [0 0 0 1 0 0 0]
 [0 0 0 0 1 0 1]]


Each row corresponds to a movie, and each column corresponds to a genre. The presence of a genre is indicated by a 1, and its absence is indicated by a 0. In this way, you can prepare your data for training a multi-label classification model to recommend movies based on user preferences for different genres.
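
To see which genre each column stands for, the fitted binarizer exposes a classes_ attribute, and inverse_transform() maps the matrix back to labels. A short sketch reusing the same sample data (the names columns and decoded are illustrative):

```python
from sklearn.preprocessing import MultiLabelBinarizer

genres = ["Romance, Comedy", "Action, Adventure", "Drama", "Horror, Thriller"]

mlb = MultiLabelBinarizer()
matrix = mlb.fit_transform([genre.split(", ") for genre in genres])

# Column order of the binary matrix: the labels in sorted order.
columns = list(mlb.classes_)
print(columns)

# Map each row of the matrix back to a tuple of genre names.
decoded = mlb.inverse_transform(matrix)
print(decoded)
```

Note that the decoded labels follow the column order, not the order in which the genres appeared in the original strings.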