# DATA SCIENCE SESSIONS VOL. 3
### A Foundational Python Data Science Course
## Session 23: Final Project II. Classification Problem: Predict Machine Maintenance 

[&larr; Back to course webpage](https://datakolektiv.com/)

Feedback should be sent to [goran.milovanovic@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com). 

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

![](../img/IntroRDataScience_NonTech-1.jpg)

### Lecturers

[Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner](https://www.linkedin.com/in/gmilovanovic/)

[Aleksandar Cvetković, PhD, DataKolektiv, Consultant](https://www.linkedin.com/in/alegzndr/)

[Ilija Lazarević, MA, DataKolektiv, Consultant](https://www.linkedin.com/in/ilijalazarevic/)

![](../img/DK_Logo_100.png)

***

### 0. Setup

In [1]:
### --- Setup - importing the libraries

# - suppress those annoying 'FutureWarning' messages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# - data
import numpy as np
import pandas as pd

# - os
import os

# - ml
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import f1_score, make_scorer

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression

# - visualization
import matplotlib.pyplot as plt
import seaborn as sns

# - parameters
%matplotlib inline

pd.options.mode.chained_assignment = None  # default='warn'

# - rng
rng = np.random.default_rng(1234)

# - plots: set the seaborn theme first, because sns.set_theme() resets font sizes
sns.set_theme(style='white')
plt.rc("figure", figsize=(8, 6))
plt.rc("font", size=14)

# - directory tree
data_dir = os.path.join(os.getcwd(), '_data')

### 1. The dataset

In this exercise you will use `sklearn.tree.DecisionTreeClassifier` (the Decision Tree model for classification) and `sklearn.linear_model.LogisticRegressionCV` to train models that predict the failure of a specific type of industrial machinery.

The data set for this exercise is provided in your `_data` directory as `dss2023_finalProject_02.csv`.

The data set is based on the **Machine Predictive Maintenance Classification** data from Kaggle [source](https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification). We did some data preparation for you so that you will be able to proceed to EDA and ML immediately.

#### 1.1 Load the dataset

Load `dss2023_finalProject_02.csv`. Do not forget to use `index_col=[0]` with `pd.read_csv()`.
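The effect of `index_col=[0]` can be illustrated with a minimal sketch. An in-memory CSV stands in for the real file here, so `demo_csv`, `demo_set`, and the three rows of values are hypothetical and serve only to show the call.

```python
import io
import pandas as pd

# Hypothetical stand-in for the course file: a CSV whose first (unnamed)
# column holds the row index that was written out by DataFrame.to_csv().
demo_csv = io.StringIO(
    ",type,torque_nm,target\n"
    "0,L,42.8,0\n"
    "1,M,46.3,0\n"
    "2,H,61.1,1\n"
)

# index_col=[0] tells pandas to use the first column as the index
# instead of loading it as an extra 'Unnamed: 0' data column.
demo_set = pd.read_csv(demo_csv, index_col=[0])

print(demo_set.shape)
print(demo_set.columns.tolist())
```

For the real data, the first argument would be the file path, for example `os.path.join(data_dir, 'dss2023_finalProject_02.csv')` with `data_dir` as defined in the Setup section.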

In [23]:
### << YOUR CODE HERE >>

The target variable is `target`, where `1` stands for `failure` and `0` for `no failure`. 

#### 1.2 Numerical Predictors

Produce and visualize a correlation matrix of all numerical predictors from `data_set`.
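As a hedged illustration of the idea (not the solution for `data_set`), the sketch below builds a small synthetic DataFrame, keeps only its numerical columns, and computes their Pearson correlation matrix. All names (`demo`, `corr_matrix`, the columns `a`, `b`, `c`, `label`) are made up for the example.

```python
import numpy as np
import pandas as pd

rng_demo = np.random.default_rng(0)

# Synthetic stand-in data: 'b' is built from 'a', 'c' is independent noise.
a = rng_demo.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng_demo.normal(scale=0.1, size=500),
    "c": rng_demo.normal(size=500),
    "label": rng_demo.choice(["L", "M", "H"], size=500),
})

# Keep numerical columns only, then compute pairwise Pearson correlations.
corr_matrix = demo.select_dtypes(include="number").corr()
print(corr_matrix.round(2))

# With seaborn imported as sns (see Setup), the matrix could be drawn with
# sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, cmap="vlag")
```

In this toy example `a` and `b` are almost perfectly correlated, which is the kind of pattern to look for among the real predictors.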

In [24]:
### << YOUR CODE HERE >>

In [25]:
### << YOUR CODE HERE >>

What can be concluded from this correlation matrix? Are there any possible problems for Linear Models in a future predictive task? Why? 

<< YOUR EXPLANATION HERE >>

#### 1.3 EDA

##### 1.3.1 Visualize numerical variables against the `target` outcome variable.

Visualize the distributions of the numerical predictors at each level of `target`. 

Produce as many plots as there are levels in `target`. Each plot should contain one boxplot panel per numerical predictor, showing that predictor's distribution at the respective level of `target`.

In [26]:
### << YOUR CODE HERE >>

Please comment on the outliers.

<< YOUR COMMENT HERE >>

#### 1.4 Class Imbalance

Please provide an overview of the frequencies of values in the outcome variable. Comment on whether the distribution could cause problems in predictive modeling.
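A minimal sketch of such a frequency overview, using a hypothetical outcome vector (`demo_target`, 95 zeros and 5 ones) rather than the course data:

```python
import pandas as pd

# Hypothetical outcome vector: 95 non-failures and 5 failures.
demo_target = pd.Series([0] * 95 + [1] * 5, name="target")

# Absolute and relative class frequencies.
counts = demo_target.value_counts()
shares = demo_target.value_counts(normalize=True)
print(pd.DataFrame({"count": counts, "share": shares}))

# A model that always predicts the majority class is already 95% accurate
# here, which is why accuracy alone is a poor metric under class imbalance.
majority_accuracy = shares.max()
print(majority_accuracy)
```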

In [27]:
### << YOUR CODE HERE >>

<< YOUR COMMENT HERE >>

#### 1.4 Split into 20% validation and 80% training data

Notice the following from the Setup section: 

`from sklearn.model_selection import train_test_split`

Now, it is extremely easy to make an 80/20 data split with `sklearn`: Google and figure out how to do it. You need to produce two new DataFrames, `train_set` (80% of data) and `validation_set` (20% of data). Do it:
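For orientation, here is a sketch of `train_test_split` on synthetic data. The names `demo`, `demo_train`, and `demo_validation` are hypothetical, and the use of `stratify` and `random_state` is an assumption of this example, not something the exercise prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng_demo = np.random.default_rng(0)

# Synthetic stand-in for the course data: 1000 rows, roughly 10% positives.
demo = pd.DataFrame({
    "x": rng_demo.normal(size=1000),
    "target": (rng_demo.random(1000) < 0.1).astype(int),
})

# test_size=0.2 gives the 80/20 split; stratify keeps the class shares
# (almost) identical in both parts; random_state makes it reproducible.
demo_train, demo_validation = train_test_split(
    demo, test_size=0.2, stratify=demo["target"], random_state=1234
)

print(demo_train.shape, demo_validation.shape)
print(demo_train["target"].mean(), demo_validation["target"].mean())
```

Stratifying on the outcome is one reasonable choice when the classes are imbalanced, since it prevents the minority class from being under-represented in the smaller part by chance.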

In [28]:
### << YOUR CODE HERE >>

#### 1.5 Perform a 5-Fold CV for Hyperparameter Tuning for Decision Tree Classifier

In this task, you will perform hyperparameter tuning for a Decision Tree Classifier using `scikit-learn`. The goal is to find the combination of hyperparameters that maximizes the `weighted F1` score on `train_set`.

The hyperparameters to be tuned are: 

- `max_depth`, use [5, 10]
- `min_samples_leaf`, use [100, 250], and
- `max_features`, use [4, 5, 6]

To remind yourself of these hyperparameters, study the [sklearn.tree.DecisionTreeClassifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

To solve this task, you can follow these steps:

1. Define the categorical and numerical feature columns:

```
categorical_cols = ['type']
numerical_cols = ['airTemperature_K', 
                  'processTemperature_K', 
                  'rotationalSpeed_rpm', 
                  'torque_nm', 
                  'toolWear_min']
```

2. Create a pipeline using the Pipeline class, where you utilize the ColumnTransformer to handle categorical and numerical features separately; use OneHotEncoder for categorical features and StandardScaler for numerical features:

```
pipeline = Pipeline([
    ('preprocessing', ColumnTransformer([
        ('categorical', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('numerical', StandardScaler(), numerical_cols)
    ])),
    ('classifier', DecisionTreeClassifier(criterion='gini'))
])
```

3. Define a custom scoring function using make_scorer to calculate the weighted F1 score:

```
scoring = make_scorer(f1_score, average='weighted')
```

4. Specify the hyperparameters and their corresponding ranges in a param_grid dictionary.

5. Perform cross-validation using GridSearchCV with 5-fold **stratified sampling**. 

6. Pass the pipeline, param_grid, custom scoring function, and other necessary parameters and fit the grid search object to the training data.


7. Print the best hyperparameters and the corresponding weighted F1 score.
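Steps 4 to 7 can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the exercise solution: the data, the column names (`kind`, `num_1`, `num_2`), the grid values, and all `demo_*` names are made up. The one detail worth noticing is that grid keys follow the `<step name>__<parameter>` convention, so parameters of the `'classifier'` step are prefixed with `classifier__`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng_demo = np.random.default_rng(0)
n = 600

# Synthetic stand-in data with one categorical and two numerical columns.
demo_X = pd.DataFrame({
    "kind": rng_demo.choice(["L", "M", "H"], size=n),
    "num_1": rng_demo.normal(size=n),
    "num_2": rng_demo.normal(size=n),
})
demo_y = (demo_X["num_1"] + rng_demo.normal(scale=0.5, size=n) > 1.0).astype(int)

demo_pipeline = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["kind"]),
        ("numerical", StandardScaler(), ["num_1", "num_2"]),
    ])),
    ("classifier", DecisionTreeClassifier(criterion="gini", random_state=0)),
])

# Keys follow the '<step name>__<parameter>' convention of Pipeline.
demo_param_grid = {
    "classifier__max_depth": [2, 4],
    "classifier__min_samples_leaf": [10, 50],
}

# 5-fold stratified CV: each fold keeps the class shares of demo_y.
demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
demo_search = GridSearchCV(
    demo_pipeline,
    demo_param_grid,
    scoring=make_scorer(f1_score, average="weighted"),
    cv=demo_cv,
)
demo_search.fit(demo_X, demo_y)

print(demo_search.best_params_)
print(demo_search.best_score_)
```

With the default `refit=True`, `GridSearchCV` also refits the best configuration on all the data passed to `fit()` and exposes it as `best_estimator_`.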

In [29]:
### << YOUR CODE HERE >>

#### 1.6 Now Refit the best model on the whole `train_set` without cross-validation!

Define your `DecisionTreeClassifier()` with the best hyperparameters obtained from CV and re-train it on the whole training set without cross-validation. Enter the best hyperparameter values into the `Pipeline()`! Print the model's weighted F1 score.

In [30]:
### << YOUR CODE HERE >>

Now, tell us: how does your model perform on the `validation_set`?

In [31]:
### << YOUR CODE HERE >>

See, there is a way to make `DecisionTreeClassifier()` predict the probabilities for each class.

The `predict_proba()` method is a function provided by scikit-learn's `DecisionTreeClassifier` class, which is used to estimate the probability of each class label for a given input sample or set of samples.

Here's an explanation of the `predict_proba()` method:

Parameters:

- `X`: The input samples for which you want to estimate the class probabilities. It should be a 2D array-like or pandas DataFrame.

Returns:

- `proba`: The class probabilities for each input sample. It is an array of shape (`n_samples`, `n_classes`), where `n_samples` is the number of input samples and `n_classes` is the number of classes.

The `predict_proba()` method calculates the probability estimates based on the learned decision tree model. For each input sample, it traverses the decision tree and computes the fraction of training samples that belong to each class within the corresponding leaf node. These fractions represent the probability estimates for each class.
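The leaf-fraction behavior described above can be checked on a tiny hand-made example. The sketch assumes no sample weights or class weights are used; `X_demo`, `y_demo`, and `stump` are hypothetical names.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny hand-made data set: one feature, two classes.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# A depth-1 tree (a "stump") has exactly two leaves.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_demo, y_demo)

proba = stump.predict_proba(X_demo)   # shape (n_samples, n_classes)
leaf_ids = stump.apply(X_demo)        # index of the leaf each sample lands in

# For every leaf, compare the share of class-1 training samples in that
# leaf with the predicted probability of class 1 for samples in it.
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    fraction_of_ones = y_demo[in_leaf].mean()
    print(leaf, fraction_of_ones, proba[in_leaf, 1][0])
```

The columns of `proba` are ordered as in `stump.classes_`, so column 0 corresponds to class `0` and column 1 to class `1`.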

Let's try it out on our `validation_set`:

In [32]:
### << YOUR CODE HERE >>

The first column in `y_val_pred` represents the probability that `target==0`!

Now, we have the observed class labels in `y`, and we can derive the predicted label from the probability in the second column of `y_val_pred`: predict `1` if that probability is higher than `.5`, and `0` otherwise.

But... we want to perform a ROC analysis now. 

Do the following:

- set the `decision_treshold` to be a set of numbers from `.001` to `.999` spaced by `.001`
- iterate over the decision thresholds and each time
- use `predict_proba()`, check the probability in the second column, predict `1` if it is larger than the current `decision_treshold`, and store the result, 
- compute the True Positive Rate (TPR) and the False Positive Rate (FPR),
- so as to obtain a pandas DataFrame with the following columns: `DecTreshold`, `TPR`, `FPR`.
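The threshold sweep listed above might look like the sketch below. It runs on synthetic labels and probabilities (`y_true`, `p_one`), which are assumptions standing in for the validation labels and the second column returned by `predict_proba()`. The column name `DecTreshold` follows the spelling used in this notebook.

```python
import numpy as np
import pandas as pd

rng_demo = np.random.default_rng(0)

# Synthetic stand-ins: true labels and predicted P(target == 1).
y_true = rng_demo.integers(0, 2, size=500)
p_one = 0.3 * y_true + rng_demo.uniform(0.0, 0.7, size=500)

# .001, .002, ..., .999 (built from integers to avoid float drift).
decision_treshold = np.arange(1, 1000) / 1000

rows = []
for threshold in decision_treshold:
    # Predict 1 when the probability exceeds the current threshold.
    y_pred = (p_one > threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    rows.append({
        "DecTreshold": threshold,
        "TPR": tp / (tp + fn),   # share of true 1s that were predicted as 1
        "FPR": fp / (fp + tn),   # share of true 0s that were predicted as 1
    })

roc_demo = pd.DataFrame(rows)
print(roc_demo.head())
```

Raising the threshold can only turn predicted `1`s into `0`s, so both TPR and FPR never increase as `DecTreshold` grows.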

Plot the `observed` vs. `predicted` values from the best obtained model

In [33]:
### << YOUR CODE HERE >>

Now plot the ROC curve for this `DecisionTreeClassifier`!

In [34]:
### << YOUR CODE HERE >>

### 2. How does the Binomial Logistic Regression compare?

#### 2.1 5-fold CV of the L2-Regularized Binomial Logistic Regression Model

- Extract the feature matrix `X` and the outcome `y` from `train_set`
- Define categorical and numerical features again as `categorical_cols` and `numerical_cols`
- Define `enc` as an instance of `OneHotEncoder` and apply it to `X[categorical_cols]` to obtain `X_train_encoded`
- Define `scl` as an instance of `StandardScaler` and apply it to `X[numerical_cols]` to obtain `X_train_scaled`
- Do this: 

```
X_train_processed = np.hstack((X_train_encoded.toarray(), 
                               X_train_scaled))
```

to obtain `X_train_processed`; you will use `X_train_processed` as your feature matrix;
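The encode, scale, and stack sequence can be sketched on a toy DataFrame. The six rows and the column names (`kind`, `num_1`, `num_2`) are hypothetical and only illustrate the shapes involved.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy feature matrix: one categorical and two numerical columns.
demo_X = pd.DataFrame({
    "kind": ["L", "M", "H", "L", "M", "L"],
    "num_1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "num_2": [10.0, 0.0, 10.0, 0.0, 10.0, 0.0],
})

enc = OneHotEncoder(handle_unknown="ignore")
scl = StandardScaler()

# fit_transform learns the categories / means / standard deviations
# and applies them in one step; the encoder returns a sparse matrix.
demo_encoded = enc.fit_transform(demo_X[["kind"]])
demo_scaled = scl.fit_transform(demo_X[["num_1", "num_2"]])

# Densify the one-hot block and place it next to the scaled block.
demo_processed = np.hstack((demo_encoded.toarray(), demo_scaled))
print(demo_processed.shape)
```

When the same preprocessing is later applied to validation data, `enc.transform()` and `scl.transform()` (without refitting) reuse the categories, means, and standard deviations learned from the training data.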

- Define the scorer: `scorer = make_scorer(f1_score, average='weighted')`
- Use `LogisticRegressionCV()` with the following parameters:
   - solver='liblinear'
   - cv=5
   - penalty='l2'
   - Cs=100
   - class_weight='balanced'
   - scoring=scorer
   - max_iter=int(1e6)
   - n_jobs=-1

to perform a 5-fold CV of the Binomial Logistic Regression model;
- print out the best model's hyperparameters.
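A reduced sketch of the `LogisticRegressionCV` call on synthetic data follows. To keep it fast it deviates from the settings above in ways that are assumptions of this example: `Cs=10` instead of `100`, `max_iter=1000`, and no `n_jobs`. The `penalty` argument is left at its default, which is L2. `demo_X`, `demo_y`, and `demo_model` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import f1_score, make_scorer

rng_demo = np.random.default_rng(0)

# Synthetic stand-in for X_train_processed and y: 400 rows, 3 features.
demo_X = rng_demo.normal(size=(400, 3))
noise = rng_demo.normal(scale=0.5, size=400)
demo_y = (demo_X[:, 0] - demo_X[:, 1] + noise > 1.0).astype(int)

scorer = make_scorer(f1_score, average="weighted")

demo_model = LogisticRegressionCV(
    solver="liblinear",
    cv=5,
    Cs=10,                    # smaller grid than in the exercise, to run fast
    class_weight="balanced",
    scoring=scorer,
    max_iter=1000,            # an integer number of iterations
)
demo_model.fit(demo_X, demo_y)

# C_ holds the inverse regularization strength selected by cross-validation.
print(demo_model.C_)

demo_f1 = f1_score(demo_y, demo_model.predict(demo_X), average="weighted")
print(demo_f1)
```

An integer `Cs` asks for that many candidate values on a logarithmic grid between `1e-4` and `1e4`; smaller `C` means stronger regularization.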

In [35]:
### << YOUR CODE HERE >>

Now re-train on the whole `train_set` without CV; provide the weighted F1 score for this model.

In [36]:
### << YOUR CODE HERE >>

And now for the `validation_set`:

In [37]:
### << YOUR CODE HERE >>

#### 2.2 Compare the ROC curves of the Decision Tree and the L2-Regularized Binomial Logistic Regression Model

You have all the elements:

- vary the decision threshold for the Binomial Logistic Regression to produce a DataFrame with the following columns: `DecTreshold`, `TPR`, `FPR`;
- combine that DataFrame with the similar DataFrame obtained from the ROC analysis of the Decision Tree Model;
- plot the ROC curves of the two models on the same chart: which model performed better?
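One possible way to combine the two ROC tables and compare them numerically is sketched below. The two small tables are hand-made placeholders, not results from either model, and `roc_auc_trapezoid` is a hypothetical helper that approximates the area under the curve with the trapezoidal rule.

```python
import numpy as np
import pandas as pd

# Two tiny hand-made ROC tables standing in for the real ones.
roc_tree = pd.DataFrame({
    "DecTreshold": [0.25, 0.50, 0.75],
    "TPR": [0.9, 0.6, 0.3],
    "FPR": [0.5, 0.2, 0.1],
})
roc_logreg = pd.DataFrame({
    "DecTreshold": [0.25, 0.50, 0.75],
    "TPR": [0.95, 0.8, 0.5],
    "FPR": [0.4, 0.15, 0.05],
})

# Label each table with its model, then stack them into one long frame.
roc_all = pd.concat(
    [roc_tree.assign(model="Decision Tree"),
     roc_logreg.assign(model="Logistic Regression")],
    ignore_index=True,
)


def roc_auc_trapezoid(roc):
    """Area under an ROC curve given as FPR/TPR points (trapezoidal rule)."""
    # Add the (0, 0) and (1, 1) end points and sort by FPR, then TPR.
    fpr = np.concatenate(([0.0], roc["FPR"].to_numpy(), [1.0]))
    tpr = np.concatenate(([0.0], roc["TPR"].to_numpy(), [1.0]))
    order = np.lexsort((tpr, fpr))
    fpr, tpr = fpr[order], tpr[order]
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


auc_by_model = {
    name: roc_auc_trapezoid(part) for name, part in roc_all.groupby("model")
}
print(auc_by_model)
```

The long format with a `model` column suits seaborn, for example `sns.lineplot(data=roc_all, x="FPR", y="TPR", hue="model", estimator=None)`; a larger area under the curve indicates better separation of the two classes across thresholds.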

In [38]:
### << YOUR CODE HERE >>

In [39]:
### << YOUR CODE HERE >>

***

DataKolektiv, 2022/23.

[hello@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com)

![](../img/DK_Logo_100.png)

<font size=1>License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.</font>