# How to use Scikit-learn

scikit-learn (imported as `sklearn`) is a Python package for building many different machine learning and statistical models.

Its models share the same interface (`fit`, `predict`), so once you can use one of them, you can build others relatively easily.

# 1. Install sklearn

- See the installation guide at https://scikit-learn.org/stable/install.html. In most cases, `pip install scikit-learn` is enough.

- All sub-packages and modules are listed on this website on the left hand side under "Section Navigation": https://scikit-learn.org/stable/api/index.html
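
A quick way to confirm that the installation worked is to import the package and print its version. This is a minimal sketch; the variable name `installed_version` is only illustrative.

```python
# Minimal check that scikit-learn is installed and importable.
import sklearn

installed_version = sklearn.__version__  # version string of the installed package
print(installed_version)
```

If the import fails with `ModuleNotFoundError`, the package is most likely not installed in the environment your notebook is using.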

# 2. Load in relevant sklearn modules

You can either do this by:

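- a) importing the whole package with `import sklearn`. Be careful, as sklearn is a large package. In some scikit-learn versions, `import sklearn` alone does not load sub-packages such as `sklearn.datasets`; if you get an `AttributeError`, import the sub-package explicitly, for example `import sklearn.datasets`

- b) or importing just the sub-packages you need, such as `from sklearn import metrics`

- c) or importing individual functions and classes, such as `from sklearn.metrics import confusion_matrix`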

More information on pros and cons of each can be found here https://discuss.python.org/t/what-is-the-purpose-of-importing-a-package-alone/18433/3
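
The import options differ only in the name you type afterwards, not in what gets called. The sketch below (with the illustrative variable `same_function`) checks that all three spellings point at the same function object.

```python
import sklearn.metrics                        # explicit sub-package import, used as sklearn.metrics.<name>
from sklearn import metrics                   # option b: used as metrics.<name>
from sklearn.metrics import confusion_matrix  # option c: used as confusion_matrix

# All three names refer to the very same function object.
same_function = (
    sklearn.metrics.confusion_matrix is metrics.confusion_matrix is confusion_matrix
)
print(same_function)  # True
```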

For exploratory purposes, we will load the whole package (option a).

In [11]:
import sklearn

# 3. Load data and clean to be in expected format for algorithm

For this example, we will use the breast cancer dataset. It ships with scikit-learn in the `sklearn.datasets` sub-package, which offers several datasets to choose from. The loader function for ours is `load_breast_cancer`.

For binary classification, the data usually needs to be separated into a feature matrix, `X` (or `x`), and a target vector, `y`. For example, `LinearSVC.fit` expects the data in this format:

>X: {array-like, sparse matrix} of shape (n_samples, n_features)
>    Training vector, where n_samples is the number of samples and n_features is the number of features.
>
>
>y: array-like of shape (n_samples,)
>    Target vector relative to X.

Source: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC.fit

In practice, `X` can be a pandas DataFrame or NumPy array, and `y` a pandas Series or NumPy array.
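
To make the expected shapes concrete, the sketch below loads the breast cancer data and prints the type and shape of each part. The names `features` and `target` are illustrative and stand in for `X` and `y`.

```python
from sklearn.datasets import load_breast_cancer

# return_X_y=True gives a (features, target) tuple; as_frame=True gives pandas objects.
features, target = load_breast_cancer(return_X_y=True, as_frame=True)

print(type(features).__name__, features.shape)  # DataFrame (569, 30)
print(type(target).__name__, target.shape)      # Series (569,)
```

Here `n_samples` is 569 and `n_features` is 30, which matches the `(n_samples, n_features)` and `(n_samples,)` shapes quoted above.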

In [12]:
data = sklearn.datasets.load_breast_cancer(return_X_y=True, as_frame=True)

# the first item (index 0) holds the features: measurements computed from cell images
x = data[0]

# the second item (index 1) holds the cancer classification (the target)
y = data[1]

# Normally in binary classification the positive case (1) means that something exists, e.g. has cancer,
# and the negative case (0) means that it does not exist, e.g. does not have cancer.
# In this dataset it is reversed: benign (non-cancerous) cases are 1 and malignant (cancerous) cases are 0.
# That is unintuitive and confusing to work with, so we swap the labels: malignant becomes 1 and benign becomes 0.
y = y.replace({0: 1, 1: 0})
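
It is worth checking the label swap rather than trusting it. The sketch below reloads the data under illustrative names (`bunch`, `original_target`, `flipped_target`) and compares the class counts before and after the swap.

```python
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer(as_frame=True)
class_names = list(bunch.target_names)  # position 0 names label 0, position 1 names label 1
original_target = bunch.target

# Swap the two labels, as in the cell above.
flipped_target = original_target.replace({0: 1, 1: 0})

original_counts = original_target.value_counts().to_dict()
flipped_counts = flipped_target.value_counts().to_dict()

print(class_names)      # ['malignant', 'benign']
print(original_counts)  # 357 samples with label 1 (benign), 212 with label 0 (malignant)
print(flipped_counts)   # 212 samples with label 1 (malignant), 357 with label 0 (benign)
```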

# 4. Train/test split

This has been covered in 02.02 - AI Development Lifecycle and Steps.

In [13]:
chosen_random_state = 4627

# data split
# train_test_split returns 4 parts: the features (x) split into train and test,
# then the targets (y) split into train and test.
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
    x,  # the features data
    y,  # the target data
    test_size=0.2,  # proportion of the data held out for the test set (train 80%, test 20%)
    shuffle=True,  # shuffle the data before splitting it
    stratify=None,  # None means no stratification; pass y here to preserve the percentage of samples for each class
    random_state=chosen_random_state,  # the seed, so that the split is reproducible
)
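
The split above uses `stratify=None`, so class proportions in train and test can drift apart by chance. As a hedged sketch of the alternative, the code below passes the target to `stratify` and compares the share of label 1 in each part. The seed `0` and all variable names are illustrative, and the labels here are the dataset's original (unswapped) ones.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

features, target = load_breast_cancer(return_X_y=True, as_frame=True)

# stratify=target keeps the class proportions (almost) identical in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, stratify=target, random_state=0
)

overall_rate = target.mean()  # share of label 1 in the full data
train_rate = y_tr.mean()      # share of label 1 in the train part
test_rate = y_te.mean()       # share of label 1 in the test part

print(len(x_tr), len(x_te))   # 455 114
print(round(overall_rate, 3), round(train_rate, 3), round(test_rate, 3))
```

Stratifying is usually a sensible choice for classification when one class is much rarer than the other.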

# 5. Choose your algorithm

sklearn has many different algorithms. Below are some examples of algorithms we can use for binary classification.

We will cover the algorithms in more detail in further sessions.

The code below "instantiates the model", which means that a model object with specific parameters is created. This does not train the model.

Here only `random_state` (and `max_iter` for logistic regression) has been given a value; every other parameter keeps its sklearn default. Relying on the default values throughout is NOT recommended: the parameters should be chosen to suit your problem.

In [25]:
chosen_random_state = 2541

linear_svc_model = sklearn.svm.LinearSVC(random_state=chosen_random_state)

log_reg_model = sklearn.linear_model.LogisticRegression(random_state=chosen_random_state, max_iter=10000)

tree_model = sklearn.tree.DecisionTreeClassifier(random_state=chosen_random_state)
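
To see which parameters a model was created with, every sklearn estimator offers `get_params()`. The sketch below uses an illustrative value `C=0.5` (not a recommendation) and also shows that a freshly instantiated model has not learned anything yet.

```python
from sklearn.linear_model import LogisticRegression

# C=0.5 is an arbitrary illustrative value, not a tuned choice.
example_model = LogisticRegression(C=0.5, max_iter=10000, random_state=0)

params = example_model.get_params()  # dict of every parameter, including defaults
print(params["C"], params["max_iter"])  # 0.5 10000

# Learned attributes such as coef_ only exist after fit() has been called.
is_fitted = hasattr(example_model, "coef_")
print(is_fitted)  # False
```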

# 6. Train your model using the train data

Use the `fit` method to train your model. It updates the model object in place (and returns that same object), so you do not need to assign the result to a new variable with `=`.

In [26]:
linear_svc_model.fit(x_train, y_train)

log_reg_model.fit(x_train, y_train)

tree_model.fit(x_train, y_train)
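
The in-place behaviour of `fit` can be checked directly. In this sketch (illustrative names, fitted on the full dataset purely for demonstration), the object returned by `fit` is the model itself, and learned attributes appear on it afterwards.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

features, target = load_breast_cancer(return_X_y=True, as_frame=True)

example_tree = DecisionTreeClassifier(random_state=0)
returned = example_tree.fit(features, target)

returns_self = returned is example_tree        # fit returns the same object it was called on
learned_classes = list(example_tree.classes_)  # class labels seen during training
tree_depth = example_tree.get_depth()          # depth of the fitted tree

print(returns_self)     # True
print(learned_classes)  # [0, 1]
```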


# 7. Evaluate model using test data

sklearn has multiple different evaluation metrics.

First we will predict the targets for the test set with each model using the `predict` method.

In [16]:
lin_svc_y_pred = linear_svc_model.predict(x_test)

log_reg_y_pred = log_reg_model.predict(x_test)

tree_y_pred = tree_model.predict(x_test)
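
`predict` returns one class label per test row. Many classifiers, including decision trees, also offer `predict_proba`, which returns one probability per class. The sketch below uses illustrative names and its own split with seed `0`, so its values will not match the cells above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

features, target = load_breast_cancer(return_X_y=True, as_frame=True)
x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, random_state=0
)

example_tree = DecisionTreeClassifier(random_state=0).fit(x_tr, y_tr)

labels = example_tree.predict(x_te)               # one label (0 or 1) per row
probabilities = example_tree.predict_proba(x_te)  # one column per class, rows sum to 1

print(labels.shape, probabilities.shape)  # (114,) (114, 2)
rows_sum_to_one = bool(np.allclose(probabilities.sum(axis=1), 1.0))
```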

Then we will compare the predictions with the actual values using different metrics.

In [21]:
#confusion matrix

lin_svc_confusion_matrix = sklearn.metrics.confusion_matrix(y_test, lin_svc_y_pred)

log_reg_confusion_matrix = sklearn.metrics.confusion_matrix(y_test, log_reg_y_pred)

tree_confusion_matrix = sklearn.metrics.confusion_matrix(y_test, tree_y_pred)
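
The layout of the confusion matrix is easy to misread: in sklearn, rows are the actual classes and columns are the predicted classes, both in sorted label order. The toy labels below are made up for illustration and are not taken from the dataset.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels, chosen only to show the layout.
toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 1, 0, 0]

toy_matrix = confusion_matrix(toy_true, toy_pred)
print(toy_matrix)
# [[2 1]
#  [1 2]]

# For binary labels 0/1, ravel() unpacks the cells in this order.
tn, fp, fn, tp = toy_matrix.ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```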

In [18]:
#accuracy

lin_svc_accuracy = sklearn.metrics.accuracy_score(y_test, lin_svc_y_pred)

log_reg_accuracy = sklearn.metrics.accuracy_score(y_test, log_reg_y_pred)

tree_accuracy = sklearn.metrics.accuracy_score(y_test, tree_y_pred)

print(lin_svc_accuracy, log_reg_accuracy, tree_accuracy)

0.9736842105263158 0.956140350877193 0.9649122807017544
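
Accuracy is the fraction of predictions that match the actual labels. As a small sketch on made-up toy labels, the manual calculation and `accuracy_score` agree.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical toy labels: 4 of the 6 predictions are correct.
toy_true = np.array([0, 0, 1, 1, 1, 0])
toy_pred = np.array([0, 1, 1, 1, 0, 0])

manual_accuracy = float(np.mean(toy_true == toy_pred))  # correct predictions / all predictions
library_accuracy = accuracy_score(toy_true, toy_pred)

print(manual_accuracy, library_accuracy)
```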


In [19]:
#recall

lin_svc_recall = sklearn.metrics.recall_score(y_test, lin_svc_y_pred)

log_reg_recall = sklearn.metrics.recall_score(y_test, log_reg_y_pred)

tree_recall = sklearn.metrics.recall_score(y_test, tree_y_pred)

print(lin_svc_recall, log_reg_recall, tree_recall)

0.9512195121951219 0.926829268292683 0.926829268292683
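
Recall is the share of actual positives that the model found: TP / (TP + FN). By default `recall_score` treats label 1 as the positive class, which is why swapping the labels earlier matters. The toy labels below are made up for illustration.

```python
from sklearn.metrics import recall_score

# Hypothetical toy labels: 4 actual positives (1), of which 3 are predicted as 1.
toy_true = [1, 1, 1, 1, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 1]

recall_for_label_1 = recall_score(toy_true, toy_pred)               # 3 / (3 + 1)
recall_for_label_0 = recall_score(toy_true, toy_pred, pos_label=0)  # 1 / (1 + 1)

print(recall_for_label_1, recall_for_label_0)  # 0.75 0.5
```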


In [20]:
#f1 score

lin_svc_f1 = sklearn.metrics.f1_score(y_test, lin_svc_y_pred)

log_reg_f1 = sklearn.metrics.f1_score(y_test, log_reg_y_pred)

tree_f1 = sklearn.metrics.f1_score(y_test, tree_y_pred)

print(lin_svc_f1, log_reg_f1, tree_f1)

0.9629629629629629 0.9382716049382716 0.95
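
The F1 score is the harmonic mean of precision and recall. As a worked check, the counts below are inferred from the LinearSVC scores printed above, assuming a 114-row test set: 39 true positives, 2 false negatives, 1 false positive and 72 true negatives. They are a reconstruction, not values reported by the notebook.

```python
from sklearn.metrics import accuracy_score, f1_score

# Counts inferred from the printed scores (an assumption, see the text above).
tp, fn, fp, tn = 39, 2, 1, 72

precision = tp / (tp + fp)  # 39 / 40
recall = tp / (tp + fn)     # 39 / 41
manual_f1 = 2 * precision * recall / (precision + recall)

# Rebuild label lists with the same counts to compare against sklearn.
rebuilt_true = [1] * (tp + fn) + [0] * (fp + tn)
rebuilt_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn

library_f1 = f1_score(rebuilt_true, rebuilt_pred)
library_accuracy = accuracy_score(rebuilt_true, rebuilt_pred)

print(manual_f1, library_f1, library_accuracy)
```

Under these assumed counts, accuracy, recall and F1 all agree with the LinearSVC values printed above.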
