# Machine Learning Prediction on AI4I 2020 Dataset

### Tasks:

> 1. Load the Dataset
> 2. Profile the Dataset
> 3. Analyse the data profile
> 4. Impute any NaN datapoints
> 5. Handle the dataset if it is not normal
> 6. Check for Multicollinearity
> 7. Build the model
> 8. Check the accuracy
> 9. Save the model
> 10. Predict on 10 new datapoints

#### Importing necessary modules

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from pandas_profiling import ProfileReport
from sklearn.linear_model import LinearRegression


#### Loading the Dataset

In [2]:
df = pd.read_csv('ai4i2020.csv')
df.head()

Unnamed: 0,UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TWF,HDF,PWF,OSF,RNF
0,1,M14860,M,298.1,308.6,1551,42.8,0,0,0,0,0,0,0
1,2,L47181,L,298.2,308.7,1408,46.3,3,0,0,0,0,0,0
2,3,L47182,L,298.1,308.5,1498,49.4,5,0,0,0,0,0,0
3,4,L47183,L,298.2,308.6,1433,39.5,7,0,0,0,0,0,0
4,5,L47184,L,298.2,308.7,1408,40.0,9,0,0,0,0,0,0


## Dataset Description

The dataset consists of 10,000 data points stored as rows with 14 features in columns:

 - **`UDI`**: unique identifier ranging from 1 to 10000
 - **`product ID`**: consisting of a letter L, M, or H for low (50% of all products), medium (30%) and high (20%) as product quality variants and a variant-specific serial number
 - **`air temperature` [$K$]**: generated using a random walk process later normalized to a standard deviation of 2$K$ around 300$K$
 - **`process temperature` [$K$]**: generated using a random walk process normalized to a standard deviation of 1$K$, added to the air temperature plus 10$K$.
 - **`rotational speed` [$rpm$]**: calculated from a power of 2860$W$, overlaid with a normally distributed noise
 - **`torque` [$Nm$]**: torque values are normally distributed around 40$Nm$ with σ = 10$Nm$ and no negative values.
 - **`tool wear` [$min$]**: the quality variants H/M/L add 5/3/2 minutes of tool wear to the tool used in the process.
 - **`machine failure`**: label that indicates whether the machine has failed at this particular datapoint because any of the following failure modes is true.

The machine failure consists of five independent failure modes:

 - **`tool wear failure` ($TWF$)**: the tool will be replaced or fail at a randomly selected tool wear time between 200 and 240 mins (120 times in our dataset). At this point in time, the tool is replaced 69 times, and fails 51 times (randomly assigned).
 - **`heat dissipation failure` ($HDF$)**: heat dissipation causes a process failure, if the difference between air and process temperature is below 8.6$K$ and the tool's rotational speed is below 1380 $rpm$. This is the case for 115 data points.
 - **`power failure` ($PWF$)**: the product of torque and rotational speed (in rad/s) equals the power required for the process. If this power is below 3500 $W$ or above 9000 $W$, the process fails, which is the case 95 times in our dataset.
 - **`overstrain failure` ($OSF$)**: if the product of tool wear and torque exceeds 11,000 minNm for the L product variant (12,000 for M, 13,000 for H), the process fails due to overstrain. This is true for 98 datapoints.
 - **`random failures` ($RNF$)**: each process has a chance of 0.1% to fail regardless of its process parameters. This is the case for only 5 datapoints, fewer than could be expected for 10,000 datapoints in our dataset.

If at least one of the above failure modes is true, the process fails and the 'machine failure' label is set to 1. It is therefore not transparent to the machine learning method which of the failure modes has caused the process to fail.
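The three deterministic rules above (HDF, PWF, OSF) can be written down directly, which helps when checking the flags against the description. The following is a minimal sketch, not the dataset authors' code: the function name `rule_based_failures` and the dictionary `OSF_LIMIT` are hypothetical, and the treatment of values exactly on a threshold (strict inequalities) is an assumption. TWF and RNF are random by design and are therefore left out.

```python
import math

# Overstrain thresholds in min*Nm per product variant, as quoted in the description.
OSF_LIMIT = {"L": 11_000, "M": 12_000, "H": 13_000}


def rule_based_failures(product_type, air_temp_k, process_temp_k,
                        rpm, torque_nm, tool_wear_min):
    """Return the deterministic failure flags implied by the dataset description."""
    # Convert rpm to rad/s so that torque * angular speed gives power in watts.
    power_w = torque_nm * rpm * 2 * math.pi / 60
    return {
        "HDF": (process_temp_k - air_temp_k) < 8.6 and rpm < 1380,
        "PWF": power_w < 3500 or power_w > 9000,
        "OSF": tool_wear_min * torque_nm > OSF_LIMIT[product_type],
    }


# First row of the dataset shown above.
flags = rule_based_failures("M", 298.1, 308.6, 1551, 42.8, 0)
print(flags)

# A hypothetical slow, high-torque, worn-tool case for an L-type product.
stressed = rule_based_failures("L", 300.0, 308.0, 1300, 60.0, 200)
print(stressed)
```

Applying such a function row by row and comparing the result with the `HDF`, `PWF` and `OSF` columns would be one way to see how closely the data follow the stated rules.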

In [3]:
# A quick check to see if the datapoints adhere to the description

print(f"The number of data points where machine fails due to `tool wear failure` {len(df[df['TWF']==1])}")
print(f"The number of data points where machine fails due to `heat dissipation failure` {len(df[df['HDF']==1])}")
print(f"The number of data points where machine fails due to `power failure` {len(df[df['PWF']==1])}")
print(f"The number of data points where machine fails due to `overstrain failure` {len(df[df['OSF']==1])}")
print(f"The number of data points where machine fails due to `random failure` {len(df[df['RNF']==1])}")

The number of data points where machine fails due to `tool wear failure` 46
The number of data points where machine fails due to `heat dissipation failure` 115
The number of data points where machine fails due to `power failure` 95
The number of data points where machine fails due to `overstrain failure` 98
The number of data points where machine fails due to `random failure` 19


 - `tool wear failure`: $46$ datapoints are flagged, fewer than the $120$ stated in the dataset description.
 - `heat dissipation failure`: $115$ datapoints are flagged, matching the description ($115$).
 - `power failure`: $95$ datapoints are flagged, matching the description ($95$).
 - `overstrain failure`: $98$ datapoints are flagged, matching the description ($98$).
 - `random failure`: $19$ datapoints are flagged, more than the $5$ stated in the dataset description.

In [4]:
df[(df['TWF'] == 1) | (df['HDF'] == 1) | (df['PWF'] == 1) | (df['OSF'] == 1) | (df['RNF'] == 1)]['Machine failure'].value_counts()

1    330
0     18
Name: Machine failure, dtype: int64

In [5]:
len(df[df['Machine failure'] == 1])

339

As we can see here, the failure labels are not fully consistent with the five individual failure modes.

There are 18 datapoints where at least one of the failure modes is 1 but the feature `Machine failure` shows 0. We can drop these rows.

There are also 9 datapoints (339 - 330) where none of the individual failure modes is 1 but `Machine failure` shows 1. We drop these rows as well.

In [8]:
data = df.copy()
drop_index = df[((df['TWF'] == 0) & (df['HDF'] == 0) & 
                 (df['PWF'] == 0) & (df['OSF'] == 0) & 
                 (df['RNF'] == 0)) & (df['Machine failure'] == 1)].index
df.drop(drop_index, inplace = True)
drop_0_index = df[((df['TWF'] == 1) | (df['HDF'] == 1) | 
                 (df['PWF'] == 1) | (df['OSF'] == 1) |
                 (df['RNF'] == 1)) & (df['Machine failure'] == 0)].index
df.drop(drop_0_index, inplace = True)
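The same consistency check can be expressed with a single boolean mask and a cross-tabulation. This is a sketch on a hypothetical four-row frame (`toy`) that only mimics the flag columns of the dataset; it is an alternative formulation, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical miniature frame with the same flag columns as the dataset.
toy = pd.DataFrame({
    "TWF": [0, 1, 0, 0],
    "HDF": [0, 0, 0, 0],
    "PWF": [0, 0, 0, 0],
    "OSF": [0, 0, 0, 0],
    "RNF": [0, 0, 1, 0],
    "Machine failure": [0, 1, 0, 1],
})

modes = ["TWF", "HDF", "PWF", "OSF", "RNF"]
any_mode = toy[modes].any(axis=1)      # True if at least one mode is flagged
failed = toy["Machine failure"].eq(1)  # True if the overall label is 1

# Rows: is any mode flagged? Columns: overall failure label.
print(pd.crosstab(any_mode, failed, rownames=["any mode"], colnames=["failed"]))

# Keep only rows where the overall label agrees with the individual flags.
consistent = toy[any_mode == failed]
print(consistent)
```

The off-diagonal cells of the cross-tabulation correspond to the two kinds of inconsistent rows discussed above.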

Now, let us check whether any inconsistencies remain in the dataset.

In [9]:
df[(df['TWF'] == 1) | (df['HDF'] == 1) | (df['PWF'] == 1) | (df['OSF'] == 1) | (df['RNF'] == 1)]['Machine failure'].value_counts()

1    330
Name: Machine failure, dtype: int64

In [10]:
df['Machine failure'].value_counts()

0    9643
1     330
Name: Machine failure, dtype: int64

In [11]:
# Checking for NaN values
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9973 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UDI                      9973 non-null   int64  
 1   Product ID               9973 non-null   object 
 2   Type                     9973 non-null   object 
 3   Air temperature [K]      9973 non-null   float64
 4   Process temperature [K]  9973 non-null   float64
 5   Rotational speed [rpm]   9973 non-null   int64  
 6   Torque [Nm]              9973 non-null   float64
 7   Tool wear [min]          9973 non-null   int64  
 8   Machine failure          9973 non-null   int64  
 9   TWF                      9973 non-null   int64  
 10  HDF                      9973 non-null   int64  
 11  PWF                      9973 non-null   int64  
 12  OSF                      9973 non-null   int64  
 13  RNF                      9973 non-null   int64  
dtypes: float64(3), int64(9), object(2)

###### Observation: There are no NaN values in the dataset
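`df.info()` shows non-null counts, but a direct count of missing values is easier to read. The sketch below uses a hypothetical frame (`toy`) with one deliberately missing value and applies median imputation, which is only one of several possible strategies and is chosen here as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing torque value.
toy = pd.DataFrame({
    "Torque [Nm]": [42.8, np.nan, 49.4, 39.5],
    "Tool wear [min]": [0, 3, 5, 7],
})

missing = toy.isna().sum()    # NaN count per column
print(missing[missing > 0])   # only the columns that need attention

# Median imputation for numeric columns; a no-op when nothing is missing.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())
print(toy)
```

On the dataset used here the first print would list no columns, so the imputation step would leave the data unchanged.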

Now we can drop the columns `TWF`, `HDF`, `PWF`, `OSF` and `RNF`, as their information is already contained in the feature `Machine failure`.

In [12]:
df.drop(['TWF', 'HDF', 'PWF', 'OSF', 'RNF'], axis = 1, inplace = True)
df.head(2)

Unnamed: 0,UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure
0,1,M14860,M,298.1,308.6,1551,42.8,0,0
1,2,L47181,L,298.2,308.7,1408,46.3,3,0


Next we can investigate the features `UDI` and `Product ID`

In [13]:
print(f"The number of unique values of feature UDI: {len(df['UDI'].value_counts())}")
print(f"The number of unique values of feature Product ID: {len(df['Product ID'].value_counts())}")

The number of unique values of feature UDI: 9973
The number of unique values of feature Product ID: 9973


Since every value of `UDI` and `Product ID` is unique, these features act only as row identifiers and carry no predictive information, so we can drop them.

In [14]:
df.drop(['UDI', 'Product ID'], axis = 1, inplace = True)
df.head(2)

Unnamed: 0,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure
0,M,298.1,308.6,1551,42.8,0,0
1,L,298.2,308.7,1408,46.3,3,0


#### Profiling the Dataset

In [13]:
pf = ProfileReport(df)
pf.to_widgets()


The report has been created, but we should also save it so that it can be shared with stakeholders and other counterparts as an overview of the data.

###### Observations: 

> 1. `Air Temperature` is highly correlated (86.4%) with `Process Temperature`
> 2. `Rotational Speed` is highly negatively correlated (91.6%) with `Torque`
> 3. The feature `Machine failure` is imbalanced, which is expected because failures should be rare.

In [14]:
# Saving the report to AI4I_2020.html

pf.to_file('AI4I_2020.html')


We see that `Rotational Speed` has a high negative correlation with `Torque`.
We will make a copy of the dataframe with one of the two features dropped, e.g. `Torque`, and build a second model on it.
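Pairwise correlation can be complemented by the variance inflation factor (VIF), which measures how well each feature is explained by all the others. The sketch below uses synthetic data (the columns `speed`, `torque` and `wear` are hypothetical stand-ins, not the real dataset) and relies on the identity that the VIFs are the diagonal of the inverse correlation matrix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: torque falls as speed rises, tool wear is unrelated.
speed = rng.normal(1500, 180, size=500)
torque = 100 - 0.04 * speed + rng.normal(0, 2, size=500)
wear = rng.uniform(0, 250, size=500)
toy = pd.DataFrame({"speed": speed, "torque": torque, "wear": wear})

corr = toy.corr()
# The diagonal of the inverse correlation matrix gives each feature's VIF.
vif = pd.Series(np.diag(np.linalg.inv(corr.to_numpy())), index=corr.columns)

print(corr.round(2))
print(vif.round(2))
```

A commonly quoted rule of thumb treats VIF values well above 5 to 10 as a sign of problematic multicollinearity, although the exact cut-off is a matter of convention.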

In [15]:
df.columns

Index(['Type', 'Air temperature [K]', 'Process temperature [K]',
       'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]',
       'Machine failure'],
      dtype='object')

In [16]:
df_lessTorque = df.drop('Torque [Nm]', axis = 1)

#### Create Models

First we need to encode the categorical feature `Type`, because a Linear Regression model can only work with numeric inputs.

In [17]:
df = pd.get_dummies(df, columns = ['Type'])

In [18]:
df.columns

Index(['Air temperature [K]', 'Process temperature [K]',
       'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]',
       'Machine failure', 'Type_H', 'Type_L', 'Type_M'],
      dtype='object')
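Keeping all three `Type_*` indicators means they always sum to 1, which makes them perfectly collinear with the model's intercept. One common way around this is to drop one category as a baseline. The sketch below uses a hypothetical four-row frame (`toy`) to show the effect of the `drop_first` option of `pd.get_dummies`; it illustrates the option and is not what the notebook itself does.

```python
import pandas as pd

toy = pd.DataFrame({"Type": ["M", "L", "L", "H"], "Tool wear [min]": [0, 3, 5, 7]})

# All three indicator columns: in every row they sum to 1.
full = pd.get_dummies(toy, columns=["Type"])
row_sums = full.filter(like="Type_").sum(axis=1).tolist()
print(row_sums)

# drop_first=True removes the first category, which becomes the baseline.
reduced = pd.get_dummies(toy, columns=["Type"], drop_first=True)
print(reduced.columns.tolist())
```

With the baseline dropped, each remaining coefficient is read as the difference from the omitted category.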

In [19]:
X = df[['Process temperature [K]', 'Rotational speed [rpm]', 
        'Torque [Nm]', 'Tool wear [min]', 'Machine failure', 
        'Type_H', 'Type_L', 'Type_M']]
y = df['Air temperature [K]']

In [20]:
X.head(2)

Unnamed: 0,Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,Type_H,Type_L,Type_M
0,308.6,1551,42.8,0,0,0,0,1
1,308.7,1408,46.3,3,0,0,1,0


Now, we need to repeat the same process for `df_lessTorque`

In [21]:
df_lessTorque = pd.get_dummies(df_lessTorque, columns = ['Type'])
df_lessTorque.head(2)

Unnamed: 0,Air temperature [K],Process temperature [K],Rotational speed [rpm],Tool wear [min],Machine failure,Type_H,Type_L,Type_M
0,298.1,308.6,1551,0,0,0,0,1
1,298.2,308.7,1408,3,0,0,1,0


In [22]:
# Select the features from df_lessTorque (same values as df, without Torque)
X_ = df_lessTorque[['Process temperature [K]', 'Rotational speed [rpm]', 
        'Tool wear [min]', 'Machine failure', 'Type_H', 'Type_L', 'Type_M']]
y_ = df_lessTorque['Air temperature [K]']

We are ready to create the models now.

In [23]:
lr = LinearRegression()
lr_ = LinearRegression()
lr.fit(X, y) # Fitting the df dataframe
lr_.fit(X_, y_) # Fitting the df_lessTorque dataframe

LinearRegression()

In [24]:
# Let's print the intercepts and coefficients of the models

print(f"The intercept of 'lr' is {lr.intercept_} and coefficients are {lr.coef_}")
print("------------------")
print(f"The intercept of 'lr_' is {lr_.intercept_} and coefficients are {lr_.coef_}")

The intercept of 'lr' is -64.95673802010947 and coefficients are [ 1.17831957e+00 -1.12233035e-04 -4.17381855e-03 -1.18969509e-04
  6.20944882e-01 -2.79314055e-02  8.23233202e-03  1.96990735e-02]
------------------
The intercept of 'lr_' is -65.46107390902318 and coefficients are [ 1.17840869e+00  8.90432486e-05 -1.06338175e-04  5.84349712e-01
 -2.77128190e-02  8.20579578e-03  1.95070232e-02]


#### Accuracy of the model created

In [25]:
# LinearRegression.score returns the coefficient of determination (R^2), here on the training data
print(f"The R^2 of model 'lr' is {lr.score(X, y)}")
print(f"The R^2 of model 'lr_' is {lr_.score(X_, y_)}")

The R^2 of model 'lr' is 0.7703613181448946
The R^2 of model 'lr_' is 0.7702705883009462
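The scores above are computed on the same rows the models were fitted on. A score on held-out rows gives a less optimistic picture. The sketch below shows the pattern on synthetic data (`X_toy` and `y_toy` are hypothetical, and the 80/20 split and the seed are arbitrary choices); it is not a result for this dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for (X, y): a noisy linear relationship.
X_toy = rng.normal(size=(400, 3))
y_toy = X_toy @ np.array([1.2, -0.5, 0.0]) + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)  # R^2 on rows the model has seen
test_r2 = model.score(X_test, y_test)     # R^2 on held-out rows
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

Because the temperatures in this dataset were generated by a random walk, a random split may still leak information between neighbouring rows, so a split by row order could be a more cautious choice.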


#### Saving the Models

Here we see that both models have nearly identical $R^2$ scores (0.7704 and 0.7703). The model with all the features scores marginally higher, but the difference is negligible. Since `Rotational speed` is highly correlated with `Torque` and already carries most of that information, we keep the simpler model without Torque, i.e. `lr_`.

In [26]:
file1 = 'linear.sav'
# Use a context manager so the file handle is closed after writing
with open(file1, 'wb') as f:
    pickle.dump(lr_, f)
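A saved model is only useful if it can be restored and gives the same predictions. The sketch below runs a full save and load round trip on a tiny hypothetical model written to a temporary directory, so it does not touch the `linear.sav` file created above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny hypothetical model, fitted only so that there is something to save.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X_toy, y_toy)

path = os.path.join(tempfile.mkdtemp(), "linear.sav")

# Write the model to disk; the context manager closes the file afterwards.
with open(path, "wb") as fh:
    pickle.dump(model, fh)

# Read it back and compare predictions with the in-memory model.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

same = np.allclose(model.predict(X_toy), restored.predict(X_toy))
print(same)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is generally expected to be loaded with the same library version that created it.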

#### Predict on dataset

In [27]:
predict_df = pd.read_csv('ai4i2020_test.csv')
predict_df.drop('Torque [Nm]', axis = 1, inplace = True)

In [28]:
lr_.predict(predict_df)

array([298.23553136, 299.04911711, 299.48205873, 299.24656679,
       298.92075911, 299.25044753, 298.46037696, 299.76517299,
       298.57230381])
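For `predict` to work, the new data must have the same columns in the same order as the training features; scikit-learn uses the columns by position, and recent versions also validate the feature names. The sketch below assumes a hypothetical batch of raw rows (`new_rows`) that still contains the `Type` column, and aligns it to the training layout with `reindex`.

```python
import pandas as pd

# Feature order used for training the model without Torque.
train_columns = ["Process temperature [K]", "Rotational speed [rpm]",
                 "Tool wear [min]", "Machine failure",
                 "Type_H", "Type_L", "Type_M"]

# Hypothetical new batch: shuffled columns and no H-type product.
new_rows = pd.DataFrame({
    "Type": ["L", "M"],
    "Tool wear [min]": [12, 40],
    "Rotational speed [rpm]": [1420, 1610],
    "Process temperature [K]": [308.9, 309.4],
    "Machine failure": [0, 0],
})

encoded = pd.get_dummies(new_rows, columns=["Type"])
# Reorder to the training layout; dummy columns absent from the batch become 0.
aligned = encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

The aligned frame could then be passed to `lr_.predict`, assuming the model was fitted on columns with exactly these names.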

#### Feature Selection

Let's examine the model using the Ordinary Least Squares (`OLS`) method from the `statsmodels` module.

In [29]:
import statsmodels
import statsmodels.formula.api as smf

df_lessTorque.columns = df_lessTorque.columns.str.replace(" ","_")

col = {'Air_temperature_[K]': 'Air_Temp',
           'Process_temperature_[K]': 'Process_Temp',
           'Rotational_speed_[rpm]': 'Rotational_speed',
           'Tool_wear_[min]': 'Tool_wear',
           'Machine_failure': 'Failure',
           'Type_H': 'Type_H',
           'Type_L': 'Type_L',
           'Type_M': 'Type_M'}

df_lessTorque = df_lessTorque.rename(columns=col)

In [32]:
lm = smf.ols(formula = 'Air_Temp~Process_Temp+Rotational_speed+Tool_wear+Failure+Type_H+Type_L+Type_M', data = df_lessTorque).fit()
lm.summary()

Dep. Variable:,Air_Temp,R-squared:,0.77
Model:,OLS,Adj. R-squared:,0.77
Method:,Least Squares,F-statistic:,5569.0
Date:,"Sun, 12 Feb 2023",Prob (F-statistic):,0.0
Time:,20:24:29,Log-Likelihood:,-13732.0
No. Observations:,9973,AIC:,27480.0
Df Residuals:,9966,BIC:,27530.0
Df Model:,6,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-49.0958,1.507,-32.588,0.000,-52.049,-46.143
Process_Temp,1.1784,0.006,181.831,0.000,1.166,1.191
Rotational_speed,8.904e-05,5.36e-05,1.661,0.097,-1.6e-05,0.000
Tool_wear,-0.0001,0.000,-0.701,0.484,-0.000,0.000
Failure,0.5843,0.054,10.795,0.000,0.478,0.690
Type_H,-16.3930,0.502,-32.633,0.000,-17.378,-15.408
Type_L,-16.3571,0.502,-32.552,0.000,-17.342,-15.372
Type_M,-16.3458,0.503,-32.520,0.000,-17.331,-15.360

Omnibus:,683.837,Durbin-Watson:,0.03
Prob(Omnibus):,0.0,Jarque-Bera (JB):,252.586
Skew:,-0.107,Prob(JB):,1.42e-55
Kurtosis:,2.25,Cond. No.,1.55e+19


#### Conclusion

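The model produces predictions of the air temperature for the loaded datapoints, with an $R^2$ of about 0.77. However, the OLS summary reports a condition number of 1.55e+19, which points to severe multicollinearity, and a Durbin-Watson statistic of 0.03, which points to strongly autocorrelated residuals. The dataset is therefore not well suited to a linear regression model.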