# Predictive Maintenance

This assignment covers predictive maintenance. Predictive maintenance problems address predicting when a machine will need maintenance before it breaks down. The problem arises anywhere machines require regular maintenance, for example in manufacturing, fleet operations, and train maintenance.

This assignment uses the [Predictive Maintenance Dataset](https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset). The dataset consists of 10,000 data points stored as rows with 14 features in columns; the `ai4i2020.csv` copy loaded below contains 9 of those columns. The 'Machine failure' label indicates whether the machine failed at that data point.

# Learning Objectives
- Perform model tuning based on hyperparameters.
- Select the best model after attempting multiple models.
- Perform recursive feature elimination, producing a statistically significant improvement over a model without feature selection.

In [159]:
import pandas as pd
import numpy as np
from sklearn import preprocessing, metrics
from sklearn.model_selection import train_test_split


df = pd.read_csv('ai4i2020.csv')
print(df.info())
df.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UDI                      10000 non-null  int64  
 1   Product ID               10000 non-null  object 
 2   Type                     10000 non-null  object 
 3   Air temperature [K]      10000 non-null  object 
 4   Process temperature [K]  10000 non-null  object 
 5   Rotational speed [rpm]   10000 non-null  int64  
 6   Torque [Nm]              10000 non-null  float64
 7   Tool wear [min]          10000 non-null  int64  
 8   Machine failure          10000 non-null  int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 703.2+ KB
None


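|   | UDI | Product ID | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M14860 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 |
| 1 | 2 | L47181 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 |
| 2 | 3 | L47182 | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 |
| 3 | 4 | L47183 | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 |
| 4 | 5 | L47184 | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 |
| 5 | 6 | M14865 | M | 298.1 | 308.6 | 1425 | 41.9 | 11 | 0 |
| 6 | 7 | L47186 | L | 298.1 | 308.6 | 1558 | 42.4 | 14 | 0 |
| 7 | 8 | L47187 | L | 298.1 | 308.6 | 1527 | 40.2 | 16 | 0 |
| 8 | 9 | M14868 | M | 298.3 | 308.7 | 1667 | 28.6 | 18 | 0 |
| 9 | 10 | M14869 | M | 298.5 | 309.0 | 1741 | 28.0 | 21 | 0 |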


Question 1.1: Write a command that will calculate the number of unique values for each feature in the training data.

In [160]:
# Command(s)
df.nunique()

UDI                        10000
Product ID                 10000
Type                           3
Air temperature [K]           93
Process temperature [K]       82
Rotational speed [rpm]       941
Torque [Nm]                  577
Tool wear [min]              246
Machine failure                2
dtype: int64

Question 1.2: Determine whether the data contains any missing values, which are encoded as '?', and replace them with np.nan.

In [161]:
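# Count the '?' placeholders per column
df[df == '?'].count() # 140 in air temperature and 183 in process temperature

# The raw string avoids an invalid-escape warning for '\?';
# np.nan replaces the np.NaN alias, which was removed in NumPy 2.0
df = df.replace({r'\?': np.nan}, regex=True)

df[df == '?'].count() # 0 in all columns now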



UDI                        0
Product ID                 0
Type                       0
Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
Machine failure            0
dtype: int64
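
The same step can be sketched on a toy frame, which makes it easier to see what happens to each cell. The column names and values below are made up for illustration, and the sketch assumes every missing cell is exactly the string '?', so a plain exact-match replacement is enough and no regular expression is needed.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); '?' marks a missing reading
toy = pd.DataFrame({
    "air_temp": ["298.1", "?", "298.3"],
    "torque": [42.8, 46.3, 49.4],
})

# Count the placeholder per column before replacing it
placeholder_counts = (toy == "?").sum()

# Exact-match replacement: only cells equal to '?' become NaN
cleaned = toy.replace("?", np.nan)
missing_counts = cleaned.isna().sum()

print(placeholder_counts.to_dict())
print(missing_counts.to_dict())
```

Counting with `(toy == "?").sum()` before the replacement and `isna().sum()` after it gives a quick check that every placeholder was converted and nothing else was touched.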

Question 1.3: Replace all missing values with the mean. Change column types to numeric.

In [162]:
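# Change column types to numeric
# df.dtypes
# Note: 'Product ID' holds strings such as 'M14860', so errors='coerce'
# turns that whole column into NaN; it is dropped in Question 1.4 anyway.
cols = df.columns.drop('Type')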
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df.dtypes

# Check out the means and then replace the NaN values with the means
df.describe().loc[['mean']]
means = df[['Air temperature [K]','Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]']].mean()
df = df.fillna(means)
df.isnull().sum()
df.head()

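|   | UDI | Product ID | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NaN | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 |
| 1 | 2 | NaN | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 |
| 2 | 3 | NaN | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 |
| 3 | 4 | NaN | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 |
| 4 | 5 | NaN | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 |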
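
The notebook computes the means on the full dataset before it is split. One possible variation, shown here only as a sketch on made-up values, is to compute the means from the training rows alone and reuse them for the test rows, so that no test information leaks into the imputation. The split point and column name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy column with non-numeric debris and a gap (hypothetical values)
raw = pd.DataFrame({"air_temp": ["298.0", "300.0", "?", "302.0", None, "304.0"]})

# errors='coerce' turns anything unparseable into NaN
numeric = raw.apply(pd.to_numeric, errors="coerce")

# Pretend the first four rows are the training split
train, test = numeric.iloc[:4].copy(), numeric.iloc[4:].copy()

# Means come from the training rows only, then fill both splits
train_means = train.mean()
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(train_means["air_temp"])
print(test_filled)
```

With only 140 and 183 missing values out of 10,000 rows, the difference between the two approaches is likely small here, but the train-only version mirrors how the imputation would have to work on unseen data.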


Question 1.4: Drop UDI and 'Product ID' from the data

In [163]:
df = df.drop(columns=['UDI', 'Product ID'])
print(df.columns)

Index(['Type', 'Air temperature [K]', 'Process temperature [K]',
       'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]',
       'Machine failure'],
      dtype='object')


Question 2.1: Split the data into training and testing taking into consideration 'Machine failure' as the target (y)

In [164]:

df.isnull().sum()

Type                       0
Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
Machine failure            0
dtype: int64

In [165]:
from sklearn.model_selection import train_test_split

X = df.iloc[:,:-1]
y = df['Machine failure']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
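
Because failures are rare, a purely random split can leave the two splits with slightly different failure rates. The sketch below uses synthetic data with a few percent positives (an assumption, not the real dataset) to show how the `stratify` argument of `train_test_split` keeps the class ratio nearly identical in both splits. It is a possible variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 30 positives out of 1000 rows (hypothetical)
rng = np.random.default_rng(0)
y_demo = np.array([1] * 30 + [0] * 970)
X_demo = rng.normal(size=(1000, 2))

# stratify=y_demo asks for the same class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```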

Question 2.2: Apply [One-Hot Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) to the data. Make sure to fit the encoder on the training data only, then transform both the training and test data.

In [166]:
from sklearn.preprocessing import OneHotEncoder
print("Base\n", X_train)

# 'Type' is the only categorical column; double brackets keep it 2-D for the encoder
catcol_train = X_train[['Type']]
catcol_test = X_test[['Type']]

encoder = OneHotEncoder(sparse_output=False,
                          categories='auto',
                          handle_unknown='error')

# Fit on the training data only, then reuse the fitted encoder for both splits
encoder.fit(catcol_train)
new_columns = encoder.get_feature_names_out(['Type'])  # one 'Type_<value>' name per category

# TRAIN
# Passing index=X_train.index keeps pd.concat from misaligning rows and introducing NaN:
# https://stackoverflow.com/questions/74552336/nan-value-appears-after-concat-two-dataframes
# https://stackoverflow.com/questions/67857595/how-to-avoid-getting-nan-when-applying-one-hot-encoding-in-pandas
onehotlabels = encoder.transform(catcol_train)
X_train_ohe = pd.concat([X_train, pd.DataFrame(onehotlabels, columns=new_columns, index=X_train.index)], axis='columns').drop('Type', axis=1)
print("Train\n", X_train_ohe)

# TEST
onehotlabels = encoder.transform(catcol_test)
X_test_ohe = pd.concat([X_test, pd.DataFrame(onehotlabels, columns=new_columns, index=X_test.index)], axis='columns').drop('Type', axis=1)
print("Test\n", X_test_ohe)

Base
      Type  Air temperature [K]  Process temperature [K]  \
8371    L                298.7                    309.7   
5027    L                304.0                    313.3   
9234    L                298.3                    309.1   
3944    L                302.2                    311.2   
6862    M                301.1                    311.2   
...   ...                  ...                      ...   
5734    L                302.3                    311.8   
5191    L                304.0                    313.2   
5390    H                302.8                    312.3   
860     H                296.1                    306.9   
7270    L                300.2                    310.4   

      Rotational speed [rpm]  Torque [Nm]  Tool wear [min]  
8371                    1881         21.7               29  
5027                    1294         54.6              102  
9234                    1904         20.3              140  
3944                    1381         49.4
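
A pandas-only alternative to the encoder is `pd.get_dummies`, which keeps the original index and so sidesteps the concat alignment problem. The sketch below uses made-up rows, and the helper variable names are illustrative. The training columns define the schema, and the test frame is aligned to them with `reindex`.

```python
import pandas as pd

# Toy splits (hypothetical values); the test split lacks category 'H'
train = pd.DataFrame({"Type": ["L", "M", "H", "L"],
                      "Torque": [40.0, 42.8, 39.5, 46.3]},
                     index=[10, 3, 7, 1])
test = pd.DataFrame({"Type": ["M", "L"], "Torque": [41.9, 49.4]},
                    index=[5, 2])

# Encode the training split; its columns define the schema
train_enc = pd.get_dummies(train, columns=["Type"], dtype=float)

# Encode the test split, then align it to the training columns;
# categories absent from the test split become all-zero columns
test_enc = pd.get_dummies(test, columns=["Type"], dtype=float)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0.0)

print(train_enc)
print(test_enc)
```

One trade-off to keep in mind: `reindex` silently drops a category that appears only in the test split, whereas `OneHotEncoder(handle_unknown='error')` raises, so the scikit-learn encoder is the stricter choice.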

Question 2.3: Apply [SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) to the training data since there is class imbalance.

In [167]:
from imblearn.over_sampling import SMOTE
from collections import Counter

counter = Counter(y_train) # Counter({0: 6462, 1: 238})
print(counter)
oversample = SMOTE()
X_train_ohe, y_train = oversample.fit_resample(X_train_ohe, y_train)
counter = Counter(y_train)
print(counter) # Counter({0: 6462, 1: 6462})


Counter({0: 6462, 1: 238})
Counter({0: 6462, 1: 6462})
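
To make the oversampling step less of a black box, here is a simplified NumPy sketch of the interpolation idea behind SMOTE. It is not the imbalanced-learn implementation: the function name `smote_like_sample`, its parameters, and the minority rows are all hypothetical. Each synthetic row is placed at a random point on the segment between a minority row and one of its nearest minority neighbours.

```python
import numpy as np

def smote_like_sample(minority, k=2, n_new=5, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    row and one of its k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Euclidean distances from row i to every minority row
        dists = np.linalg.norm(minority - minority[i], axis=1)
        # Position 0 of the sort is row i itself (distance 0), so skip it
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.random()  # uniform in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

# Hypothetical minority-class rows: (torque, tool wear)
minority = np.array([[60.0, 200.0], [62.0, 210.0],
                     [65.0, 215.0], [58.0, 190.0]])
new_rows = smote_like_sample(minority)
print(new_rows.shape)
```

Because the new rows are interpolated, one-hot columns such as the encoded `Type` can end up with fractional values when two neighbours differ in type. imbalanced-learn provides `SMOTENC` for mixed numerical and categorical data, which may be worth trying here.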


Question 3.1: Train five machine learning models, [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and [XGBClassifier](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier), on the training data, and evaluate their performance on the test dataset. Use default hyperparameter values.

In [168]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Classifier stats - Taken from Case Study
# -------------------------------------------------

from IPython.display import display, HTML

def confusion_table(confusion_mtx):
    """Renders a nice confusion table with labels"""
    confusion_df = pd.DataFrame({'y_pred=0': np.append(confusion_mtx[:, 0], confusion_mtx.sum(axis=0)[0]),
                                 'y_pred=1': np.append(confusion_mtx[:, 1], confusion_mtx.sum(axis=0)[1]),
                                 'Total': np.append(confusion_mtx.sum(axis=1), ''),
                                 '': ['y=0', 'y=1', 'Total']}).set_index('')
    return confusion_df


def positive_observations(y):
    # What percentage of observations are positive?
    proportion_1 = ((y == 1).sum() / len(y))
    pct_1        = np.around(proportion_1*100, decimals=3)
    display(HTML('<p><h4>{}%</h4>of observations are positive</p>'.format(pct_1)))

def prior_error_rate(confusion_mtx):
    """Error rate of always predicting positive: 1 minus the proportion of actual positives"""
    return 1 - (np.sum(confusion_mtx[1, :]) / np.sum(confusion_mtx))

def total_error_rate(confusion_mtx):
    """Derive total error rate from confusion matrix"""
    return 1 - np.trace(confusion_mtx) / np.sum(confusion_mtx)

def true_positive_rate(confusion_mtx):
    """or sensitivity: the proportion of actual POSITIVES that are correctly identified as such"""
    return confusion_mtx[1, 1] / np.sum(confusion_mtx[1, :])

def false_negative_rate(confusion_mtx):
    """the proportion of actual POSITIVES that are incorrectly identified as negative"""
    return confusion_mtx[1, 0] / np.sum(confusion_mtx[1, :])

def false_positive_rate(confusion_mtx):
    """the proportion of actual NEGATIVES that are incorrectly identified as positives"""
    return confusion_mtx[0, 1] / np.sum(confusion_mtx[0, :])

def true_negative_rate(confusion_mtx):
    """or specificity: the proportion of actual NEGATIVES that are correctly identified as such"""
    return confusion_mtx[0, 0] / np.sum(confusion_mtx[0, :])

def positive_predictive_value(confusion_mtx):
    """or precision: the proportion of predicted positives that are correctly predicted"""
    return confusion_mtx[1, 1] / np.sum(confusion_mtx[:, 1])

def negative_predictive_value(confusion_mtx):
    """the proportion of predicted negatives that are correctly predicted"""
    return confusion_mtx[0, 0] / np.sum(confusion_mtx[:, 0])

def classifier_stats(confusion_mtx):
    return pd.Series({'prior_error_rate': prior_error_rate(confusion_mtx),
                      'total_error_rate': total_error_rate(confusion_mtx),
                      'true_positive_rate (sensitivity)': true_positive_rate(confusion_mtx),
                      'false_negative_rate': false_negative_rate(confusion_mtx),
                      'false_positive_rate': false_positive_rate(confusion_mtx),
                      'true_negative_rate (specificity)': true_negative_rate(confusion_mtx),
                      'positive_predictive_value (precision)': positive_predictive_value(confusion_mtx),
                      'negative_predictive_value': negative_predictive_value(confusion_mtx)})


In [169]:
#Build models (You can either do it combined or separate)

# XGBClassifier does not like the '[' or ']' in the name of columns so need to remove it
X_train_ohe = X_train_ohe.rename(columns={'Air temperature [K]': 'Air Temperature K', 'Process temperature [K]': 'Process Temperature K', 'Rotational speed [rpm]': 'Rotational Speed RPM', 'Torque [Nm]': 'Torque Nm', 'Tool wear [min]': 'Tool Wear min'})
X_test_ohe = X_test_ohe.rename(columns={'Air temperature [K]': 'Air Temperature K', 'Process temperature [K]': 'Process Temperature K', 'Rotational speed [rpm]': 'Rotational Speed RPM', 'Torque [Nm]': 'Torque Nm', 'Tool wear [min]': 'Tool Wear min'})

models = {
    'Logistic Regression': LogisticRegression().fit(X_train_ohe, y_train),
    'Support Vector Machine': SVC().fit(X_train_ohe, y_train),
    'K-NN': KNeighborsClassifier().fit(X_train_ohe, y_train),
    'Decision Tree':DecisionTreeClassifier().fit(X_train_ohe, y_train),
    'XGBoost': XGBClassifier().fit(X_train_ohe, y_train)
}

# PREDICT
for k in models:
    # Predict
    y_pred = models[k].predict(X_test_ohe)
    # Confusion table
    display(HTML('<h3>{}</h3>'.format(k)))
    confusion_mtx = confusion_matrix(y_test, y_pred)
    display(confusion_table(confusion_mtx))
    # Classifier stats
    display(classifier_stats(confusion_mtx))


### Logistic Regression

|   | y_pred=0 | y_pred=1 | Total |
|---|---|---|---|
| y=0 | 2684.0 | 515.0 | 3199.0 |
| y=1 | 19.0 | 82.0 | 101.0 |
| Total | 2703.0 | 597.0 | |


prior_error_rate                         0.969394
total_error_rate                         0.161818
true_positive_rate (sensitivity)         0.811881
false_negative_rate                      0.188119
false_positive_rate                      0.160988
true_negative_rate (specificity)         0.839012
positive_predictive_value (precision)    0.137353
negative_predictive_value                0.992971
dtype: float64

### Support Vector Machine

|   | y_pred=0 | y_pred=1 | Total |
|---|---|---|---|
| y=0 | 2581.0 | 618.0 | 3199.0 |
| y=1 | 20.0 | 81.0 | 101.0 |
| Total | 2601.0 | 699.0 | |


prior_error_rate                         0.969394
total_error_rate                         0.193333
true_positive_rate (sensitivity)         0.801980
false_negative_rate                      0.198020
false_positive_rate                      0.193185
true_negative_rate (specificity)         0.806815
positive_predictive_value (precision)    0.115880
negative_predictive_value                0.992311
dtype: float64

### K-NN

|   | y_pred=0 | y_pred=1 | Total |
|---|---|---|---|
| y=0 | 2867.0 | 332.0 | 3199.0 |
| y=1 | 43.0 | 58.0 | 101.0 |
| Total | 2910.0 | 390.0 | |


prior_error_rate                         0.969394
total_error_rate                         0.113636
true_positive_rate (sensitivity)         0.574257
false_negative_rate                      0.425743
false_positive_rate                      0.103782
true_negative_rate (specificity)         0.896218
positive_predictive_value (precision)    0.148718
negative_predictive_value                0.985223
dtype: float64

### Decision Tree

|   | y_pred=0 | y_pred=1 | Total |
|---|---|---|---|
| y=0 | 3094.0 | 105.0 | 3199.0 |
| y=1 | 37.0 | 64.0 | 101.0 |
| Total | 3131.0 | 169.0 | |


prior_error_rate                         0.969394
total_error_rate                         0.043030
true_positive_rate (sensitivity)         0.633663
false_negative_rate                      0.366337
false_positive_rate                      0.032823
true_negative_rate (specificity)         0.967177
positive_predictive_value (precision)    0.378698
negative_predictive_value                0.988183
dtype: float64

### XGBoost

|   | y_pred=0 | y_pred=1 | Total |
|---|---|---|---|
| y=0 | 3139.0 | 60.0 | 3199.0 |
| y=1 | 23.0 | 78.0 | 101.0 |
| Total | 3162.0 | 138.0 | |


prior_error_rate                         0.969394
total_error_rate                         0.025152
true_positive_rate (sensitivity)         0.772277
false_negative_rate                      0.227723
false_positive_rate                      0.018756
true_negative_rate (specificity)         0.981244
positive_predictive_value (precision)    0.565217
negative_predictive_value                0.992726
dtype: float64
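
As a cross-check, the headline figures for the first confusion table above can be recomputed by hand with plain NumPy. The F1 score at the end is an extra summary that the notebook does not report; it is included only as a convenient single number when precision and sensitivity pull in different directions.

```python
import numpy as np

# First confusion table above: rows are actual classes, columns are predictions
cm = np.array([[2684, 515],
               [19, 82]])

tn, fp = cm[0]
fn, tp = cm[1]

sensitivity = tp / (tp + fn)          # true positive rate
precision = tp / (tp + fp)            # positive predictive value
total_error = (fp + fn) / cm.sum()    # share of misclassified rows
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(round(sensitivity, 6), round(precision, 6), round(total_error, 6))
print(round(f1, 4))
```

The first three values agree with the `true_positive_rate`, `positive_predictive_value`, and `total_error_rate` entries printed for that model, which is a useful sanity check on the helper functions.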

Question 3.2: Perform recursive feature elimination (keeping 3 features) on the dataset using a logistic regression classifier with max_iter=1000 and random_state=5. Is there any difference in the results? Explain.

In [170]:
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline

rfe = RFE(estimator=LogisticRegression(max_iter=1000, random_state=5), n_features_to_select=3)
model = LogisticRegression(max_iter=1000, random_state=5)
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
pipeline.fit(X_train_ohe, y_train)
y_pred = pipeline.predict(X_test_ohe)

display(HTML('<h3>{}</h3>'.format("LogReg RFE Pipeline")))
confusion_mtx = confusion_matrix(y_test, y_pred)
display(confusion_table(confusion_mtx))
# Classifier stats
display(classifier_stats(confusion_mtx))


### LogReg RFE Pipeline

|   | y_pred=0 | y_pred=1 | Total |
|---|---|---|---|
| y=0 | 1816.0 | 1383.0 | 3199.0 |
| y=1 | 41.0 | 60.0 | 101.0 |
| Total | 1857.0 | 1443.0 | |


prior_error_rate                         0.969394
total_error_rate                         0.431515
true_positive_rate (sensitivity)         0.594059
false_negative_rate                      0.405941
false_positive_rate                      0.432323
true_negative_rate (specificity)         0.567677
positive_predictive_value (precision)    0.041580
negative_predictive_value                0.977921
dtype: float64
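
To see which columns RFE actually keeps, the fitted selector exposes `support_` and `ranking_`. The sketch below runs on synthetic data from `make_classification` rather than the real features. It also adds a `StandardScaler` step, which the notebook does not use and is an assumption here: RFE ranks features by coefficient magnitude, so columns on very different scales (rpm in the thousands versus 0/1 dummies) may be ranked misleadingly without scaling.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (8 columns, 3 informative)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=5
)

pipe = Pipeline(steps=[
    ("scale", StandardScaler()),  # puts coefficients on a comparable scale
    ("rfe", RFE(LogisticRegression(max_iter=1000, random_state=5),
                n_features_to_select=3)),
    ("model", LogisticRegression(max_iter=1000, random_state=5)),
])
pipe.fit(X_demo, y_demo)

selector = pipe.named_steps["rfe"]
kept = selector.support_      # boolean mask of retained columns
ranks = selector.ranking_     # 1 = kept; larger = eliminated earlier
print(kept.sum(), sorted(ranks))
```

On the real data, `X_train_ohe.columns[selector.support_]` would list the three surviving feature names, which helps explain why the reduced model behaves so differently.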

# Analysis of RFE LogReg

Assuming I did this correctly, recursively eliminating features down to three and then refitting gives much worse results than the default logistic regression in Question 3.1: the total error rate rises from 0.162 to 0.432, the true positive rate falls from 0.812 to 0.594, the true negative rate falls from 0.839 to 0.568, and precision falls from 0.137 to 0.042. This leads me to believe that the eliminated features carried information the model needed: with only three features left, many rows probably look the same to the model even when their outcomes differ. The logistic regression above did not converge, whereas this one does, although that may be due to the higher max_iter=1000 as much as to the smaller feature set.

Q.4. Create a new text cell in your Notebook: Complete a 50-100 word summary (or short description of your thinking in applying this week's learning to the solution) of your experience in this assignment. Include:
What was your incoming experience with this model, if any? What steps did you take, and what obstacles did you encounter? How do you link this exercise to real-world machine learning problem-solving? (What steps were missing? What else do you need to learn?) This summary allows your instructor to know how you are doing and allot points for your effort in thinking and planning, and making connections to real-world work.

# Q.4

1. I do not have any experience with these models apart from the ones we've used in previous classes. RFE is completely new to me, as are SVC, DecisionTree, and XGBoost.
2. Steps - A lot of searching and looking up examples, reading on some concepts and documentation for each of the models. Obstacles - debugging One Hot Encoding issues and debating if I should've used the get_dummies() from Pandas versus the Sklearn functions. There were more minor data cleanups needed to fit certain functions and APIs as well, but I wouldn't call those obstacles. More like additional steps (and expected part of the process).
3. Organization is key! If this were to be applied to a real pipeline it would be nice to reorganize, compartmentalize, extract methodology and functions into discrete, reusable methods and objects.
4. I think the comparison of the models after we cleaned the dataset is interesting and gives a bigger-picture view of the options available when approaching real-world problems. One Hot Encoding is also useful here, since datasets often mix numerical and categorical data (think of something as simple as describing people: hair color and eye color vs. height and weight). We don't want to label our data in a way that isn't discernible, and because you can always invert the encoding and map it back to meaningful labels, it is a useful tool for ML scientists.
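
The point about inverting the encoding can be illustrated with scikit-learn's `OneHotEncoder.inverse_transform`. This is a minimal sketch on a toy `Type` column, not the notebook's fitted encoder.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy 'Type' column using the three product types from the dataset
types = np.array([["L"], ["M"], ["H"], ["L"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(types).toarray()  # dense 0/1 matrix

# inverse_transform maps each one-hot row back to its original label
decoded = encoder.inverse_transform(encoded)
print(encoder.categories_[0].tolist())
print(decoded.ravel().tolist())
```

The `categories_` attribute records the order of the encoded columns, so the mapping between a one-hot column and its label stays recoverable after training.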