# Problem Statement

Availability, reliability and economic sustainability of naval propulsion plants are key concerns, because maintenance costs represent a large share of total operational expenses.

Depending on the adopted strategy, the impact of maintenance on overall expenses can vary remarkably; for example, letting an asset run until breakdown can lead to unaffordable costs. It is therefore desirable to advance the maintenance technology of ship propulsion systems from breakdown or preventive maintenance to more effective **condition-based maintenance** approaches. The success of condition-based maintenance clearly hinges on **the capability of developing effective predictive models.**

In particular, we consider **an application of condition-based maintenance to gas turbines used for vessel propulsion**, in which we test the performance and advantages of statistical/machine learning methods in **modeling the degradation of the propulsion plant** over time. Experiments, conducted on data generated by a sophisticated simulator of a gas turbine mounted on a frigate with a *COmbined Diesel eLectric And Gas* propulsion plant, will allow us to show the effectiveness of the proposed statistical/machine learning approaches and to benchmark them in a realistic maritime application.

According to the British Standard, maintenance includes all actions necessary to retain a system or an item in, or restore it to, a state in which it can perform its required functions. In practice, this concept has most commonly been implemented through a **"fix it when it breaks"** approach. However, this has become an unaffordable and inefficient methodology, since gathering data from the field is ever cheaper and the costs related to a breakdown may exceed the asset value.

Indeed, in recent decades, smart technologies and cross-industry maintenance needs, ranging for example from manufacturing to transportation, have driven a pivotal change from a reactive to a proactive perspective, moving beyond the original focus on repair and replacement actions toward more sophisticated preventive and prescriptive activities. In particular, maintenance actions can be framed into a taxonomy of three categories: **corrective, preventive and condition-based**.

***

### Corrective maintenance (CM)

In corrective maintenance (CM), the equipment or asset is run until it **breaks down**, and maintenance activities are **carried out afterward** with the purpose of restoring the system as early as possible.

In this case, maintenance activities are triggered by **an unscheduled event** (e.g. a failure), and **no a priori strategy** can be deployed. Consequently, **the costs of this approach are usually high**; they comprise direct costs, for example, due to cascading failures of other parts of the asset, and indirect costs, for example, related to asset unavailability and to potential losses in safety and integrity (of the environment, workers, etc.).

### Preventive maintenance (PM)

Preventive maintenance (PM), instead, is **carried out before breakdowns** in order to avoid them and minimize the possibility of potential issues related to failures. 

Several variations exist, such as adjustments, replacements, renewals and inspections, which take place according to **a predetermined plan and schedule**; this makes it possible to establish time slots in which an asset (or part of it) is unavailable, as opposed to the unpredictability that characterizes random failure patterns in CM. In systematic PM in particular, parts are replaced independently of their actual status, because a safe lifetime is established for them; in this way, the probability of failure of the system decreases, as commonly illustrated by the so-called bathtub curve.

### Condition-based maintenance (CBM)

Condition-based maintenance (CBM), instead, triggers maintenance activities when they are required by **the condition of the target system**. This approach determines the condition of in-service assets in order to predict potential degradation and to plan, consequently, when maintenance activities will be needed and should be performed to minimize disruptions.

In other words, CBM shifts the maintenance view **from pure diagnosis to high-value prognosis of faults**.

### Which one is the best?

While CM is usually avoided because of its high costs, CBM is gaining popularity because of its competitive advantages over PM, whose methods are extremely conservative in several domains (leading, once again, to remarkable and often unjustified costs). In fact, assets typically feature complex interrelations among their parts; such correlations are not straightforward to grasp and identify, which makes a simple approach like the above-cited bathtub model infeasible. CBM, instead, has proved effective in maximizing the availability, efficiency, sustainability and safety of assets, upgrading the role of maintenance from a necessary evil to a way of creating added value.

***

### CBM helps the Navy to achieve balance

The CBM approach to maintenance is very generic, offering a horizontal view of a problem that cuts across heterogeneous domains. Taking a more vertical view focused on the maritime domain, **repair maintenance expenses for conventional ships can amount to about 20% of total operating costs**, including manning expenses, insurance and administration costs.

For naval applications, maintenance optimization is a key task, aimed at reducing operating costs while achieving the optimal availability of the ship for its intended service; such optimization is the result of **a trade-off between excessive maintenance and machinery downtime**, where CBM opens the door to **the best balance between costs and availability**. In other words, CBM enables just-in-time deployment of ship maintenance, by making it possible to plan and execute maintenance activities only when needed.

***

# Description of dataset

The experiments have been carried out by means of a numerical simulator of a naval vessel (frigate) characterized by a Gas Turbine (GT) propulsion plant. The different blocks forming the complete simulator (Propeller, Hull, GT, Gear Box and Controller) have been developed and fine-tuned over the years on several similar real propulsion plants. In this release of the simulator it is also possible to take into account the performance decay over time of GT components such as the GT compressor and turbines.

The behaviour of the propulsion system is described by these parameters:

#### 1. Speed
This parameter is controlled via the control lever. The lever can only assume a finite number of positions $lp(i)$, each of which corresponds to a given configuration of fuel flow and blade position. Each set point is designed to reach a desired speed $v(i)$:

$$lp(i) = i \rightarrow v(i) = 3 \ast lp(i), \forall i \in \{1,2,...,9\}$$

**Note that $lp(i)$ and $v(i)$ are related by a linear law**.

#### 2. Compressor decay
$$ kMc \in [0.950,1]$$

The decay rate has been sampled through a uniform grid with resolution $10^{-3}$, so as to reach a good granularity when representing the GT compressor decay state

$$ kMc(i) = 1 - i \ast 0.001 $$ 
$$ i \in \{0,1,2,...,50\} $$

#### 3. GT decay
$$ kMt \in [ 0.975, 1] $$

Analogous to the case of compressor, the decay rate is sampled through a uniform grid with resolution $10^{-3}$. The GT decay state can be then represented as

$$ kMt(i) = 1 - i \ast 0.001 $$
$$  i \in \{0,1,2,...,25\} $$

**Once these quantities (speed/lever position, compressor decay and GT decay) are fixed**, the simulator yields the 14 heterogeneous measures listed below, which describe the state of the propulsion plant at that operating point.

- Gas turbine shaft torque $GTT$
- Gas turbine rate of revolutions $GTn$
- Gas generator rate of revolutions $GGn$
- Starboard propeller torque $Ts$
- Port propeller torque $Tp$
- HP turbine exit temperature $T48$
- GT compressor inlet air temperature $T1$
- GT compressor outlet air temperature $T2$
- HP turbine exit pressure $P48$
- GT compressor inlet air pressure $P1$
- GT compressor outlet air pressure $P2$
- GT exhaust gas pressure $Pexh$
- Turbine injection control $TIC$
- Fuel flow $mf$

Furthermore, with 9 fixed values of $v$, 51 values of $kMc$ and 26 values of $kMt$, we have exactly 11,934 observations in the dataset.
$$ 9 \ast 51 \ast 26 = 11{,}934 $$
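
The size of this full-factorial design can be checked with a short sketch. It assumes only the grids defined above, with nine speed set points from 3 to 27; the names `speeds`, `kMc_grid`, `kMt_grid` and `design` are illustrative, and the enumeration order is not meant to match the row order of the dataset.

```python
from itertools import product

speeds = [3 * i for i in range(1, 10)]                   # 9 speed set points: 3, 6, ..., 27
kMc_grid = [round(1 - i * 0.001, 3) for i in range(51)]  # 51 compressor decay states
kMt_grid = [round(1 - i * 0.001, 3) for i in range(26)]  # 26 turbine decay states

# One simulated observation per (v, kMc, kMt) combination.
design = list(product(speeds, kMc_grid, kMt_grid))
print(len(design))
```

Each element of `design` is one operating point for which the simulator produces the 14 measures.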

***

# Description of the statistical method(s) used

### The goals of the work

The purpose of the analysis is to train a model on the 14 measures listed above $ \{  GTT, GTn, GGn, Ts, Tp, T48, T1, T2, P48, P1, P2, Pexh, TIC, mf \}$, together with $v$, in order to estimate $kMc$ and $kMt$. Because both targets only take values on a finite grid, this can be seen as either a regression or a classification problem.
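
The two views of the target can be illustrated with a minimal sketch. It assumes the grid $kMc(i) = 1 - i \ast 0.001$ given earlier; the helper `decay_to_class` is hypothetical and is not used elsewhere in this analysis.

```python
def decay_to_class(k, step=0.001):
    """Map a decay coefficient on the grid k = 1 - i * step to its integer index i."""
    return round((1 - k) / step)

# Regression view: the targets are the coefficients themselves.
kMc_samples = [1.0, 0.999, 0.975, 0.95]

# Classification view: the targets are the grid indices 0..50.
kMc_classes = [decay_to_class(k) for k in kMc_samples]
print(kMc_classes)
```

In the regression view, predicting 0.951 instead of 0.950 is a small error; in the classification view it is simply a wrong class, which is one reason to prefer regression here.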

### The statistical method(s) used

In this exercise, we treat the task as **a regression problem** and mainly focus on **the linear model**, which is simple and interpretable. We also fit several more flexible machine learning models in order to compare their predictive accuracy.

***

# Presentation of the results

### 1. Import packages

In [1]:
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
%matplotlib inline


# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

import statsmodels.api as sm



### 2. Import and clean the data

In [2]:
columns=['lp','v','GTT','GTn','GGn','Ts','Tp','T48','T1','T2','P48','P1','P2','Pexh','TIC','mf','kMc','kMt']
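# a multi-character separator is parsed by the Python engine, so request it explicitly
data=pd.read_csv('https://raw.githubusercontent.com/trungha-ngx/Statistics-Project-Condition-Based-Maintenance/master/data.txt',sep="  ",names=columns,engine='python')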

  


#### 2.1. Overview
- The data is clean, with no missing values.
- There are 18 columns: 16 features and 2 responses.
- There are 11,934 instances in this dataset.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11934 entries, 0 to 11933
Data columns (total 18 columns):
lp      11934 non-null float64
v       11934 non-null float64
GTT     11934 non-null float64
GTn     11934 non-null float64
GGn     11934 non-null float64
Ts      11934 non-null float64
Tp      11934 non-null float64
T48     11934 non-null float64
T1      11934 non-null float64
T2      11934 non-null float64
P48     11934 non-null float64
P1      11934 non-null float64
P2      11934 non-null float64
Pexh    11934 non-null float64
TIC     11934 non-null float64
mf      11934 non-null float64
kMc     11934 non-null float64
kMt     11934 non-null float64
dtypes: float64(18)
memory usage: 1.6 MB


#### 2.2. The nature of data
- These are the first 234 instances, in which $kMc$ is fixed at 0.950 while $v$ runs over its 9 values and $kMt$ runs over its 26 values ($9 \ast 26 = 234$). Only the first ten rows of the output are shown below.

In [4]:
data.head(234)

| | lp | v | GTT | GTn | GGn | Ts | Tp | T48 | T1 | T2 | P48 | P1 | P2 | Pexh | TIC | mf | kMc | kMt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.138 | 3.0 | 289.964 | 1349.489 | 6677.380 | 7.584 | 7.584 | 464.006 | 288.0 | 550.563 | 1.096 | 0.998 | 5.947 | 1.019 | 7.137 | 0.082 | 0.95 | 0.975 |
| 1 | 2.088 | 6.0 | 6960.180 | 1376.166 | 6828.469 | 28.204 | 28.204 | 635.401 | 288.0 | 581.658 | 1.331 | 0.998 | 7.282 | 1.019 | 10.655 | 0.287 | 0.95 | 0.975 |
| 2 | 3.144 | 9.0 | 8379.229 | 1386.757 | 7111.811 | 60.358 | 60.358 | 606.002 | 288.0 | 587.587 | 1.389 | 0.998 | 7.574 | 1.020 | 13.086 | 0.259 | 0.95 | 0.975 |
| 3 | 4.161 | 12.0 | 14724.395 | 1547.465 | 7792.630 | 113.774 | 113.774 | 661.471 | 288.0 | 613.851 | 1.658 | 0.998 | 9.007 | 1.022 | 18.109 | 0.358 | 0.95 | 0.975 |
| 4 | 5.140 | 15.0 | 21636.432 | 1924.313 | 8494.777 | 175.306 | 175.306 | 731.494 | 288.0 | 645.642 | 2.078 | 0.998 | 11.197 | 1.026 | 26.373 | 0.522 | 0.95 | 0.975 |
| 5 | 6.175 | 18.0 | 29792.731 | 2307.404 | 8828.360 | 246.278 | 246.278 | 800.434 | 288.0 | 676.397 | 2.501 | 0.998 | 13.356 | 1.030 | 35.760 | 0.708 | 0.95 | 0.975 |
| 6 | 7.148 | 21.0 | 38982.180 | 2678.086 | 9132.429 | 332.077 | 332.077 | 854.747 | 288.0 | 699.954 | 2.963 | 0.998 | 15.679 | 1.035 | 45.881 | 0.908 | 0.95 | 0.975 |
| 7 | 8.206 | 24.0 | 50996.808 | 3087.561 | 9318.562 | 437.989 | 437.989 | 952.122 | 288.0 | 741.770 | 3.576 | 0.998 | 18.632 | 1.040 | 62.440 | 1.236 | 0.95 | 0.975 |
| 8 | 9.300 | 27.0 | 72763.329 | 3560.395 | 9778.528 | 644.905 | 644.905 | 1115.797 | 288.0 | 789.094 | 4.498 | 0.998 | 22.811 | 1.049 | 92.556 | 1.832 | 0.95 | 0.975 |
| 9 | 1.138 | 3.0 | 379.880 | 1355.375 | 6683.916 | 7.915 | 7.915 | 464.017 | 288.0 | 550.985 | 1.100 | 0.998 | 5.963 | 1.019 | 3.879 | 0.079 | 0.95 | 0.976 |


#### 2.3. Check the linearity between $lp$ and $v$
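- The *R-squared* rounds to 1.000, and the fitted *coefficient* of $lp$ is 2.949 (close to the nominal value of 3) with a *standard error* that rounds to zero; the intercept is -0.237.
- Since $lp$ carries the same information as $v$, $lp$ will be removed from the data.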

In [5]:
LiR_v = sm.OLS(data['v'], sm.add_constant(data['lp'])).fit()
LiR_v.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | v | R-squared: | 1.0 |
| Model: | OLS | Adj. R-squared: | 1.0 |
| Method: | Least Squares | F-statistic: | 70300000.0 |
| Date: | Thu, 11 Jan 2018 | Prob (F-statistic): | 0.0 |
| Time: | 14:07:08 | Log-Likelihood: | 10438.0 |
| No. Observations: | 11934 | AIC: | -20870.0 |
| Df Residuals: | 11932 | BIC: | -20860.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.2373 | 0.002 | -116.415 | 0.000 | -0.241 | -0.233 |
| lp | 2.9492 | 0.000 | 8384.758 | 0.000 | 2.948 | 2.950 |

| | | | |
|---|---|---|---|
| Omnibus: | 808.501 | Durbin-Watson: | 1.698 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 533.166 |
| Skew: | -0.4 | Prob(JB): | 1.68e-116 |
| Kurtosis: | 2.342 | Cond. No. | 13.1 |
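
The fitted slope and intercept can be reproduced by hand from the nine distinct $(lp, v)$ pairs shown in section 2.2. The sketch below is a plain-Python version of single-predictor least squares, not the `statsmodels` routine used above; it assumes that every pair occurs equally often in the full dataset, so that fitting the nine pairs gives the same line.

```python
# The nine (lp, v) pairs visible in the data preview of section 2.2.
lp = [1.138, 2.088, 3.144, 4.161, 5.140, 6.175, 7.148, 8.206, 9.300]
v = [3.0, 6.0, 9.0, 12.0, 15.0, 18.0, 21.0, 24.0, 27.0]

n = len(lp)
lp_mean = sum(lp) / n
v_mean = sum(v) / n

# Ordinary least squares with one predictor: slope = Sxy / Sxx.
s_xy = sum((x - lp_mean) * (y - v_mean) for x, y in zip(lp, v))
s_xx = sum((x - lp_mean) ** 2 for x in lp)
slope = s_xy / s_xx
intercept = v_mean - slope * lp_mean

# R-squared = 1 - SS_res / SS_tot.
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(lp, v))
ss_tot = sum((y - v_mean) ** 2 for y in v)
r_squared = 1 - ss_res / ss_tot

print(round(slope, 4), round(intercept, 4), round(r_squared, 4))
```

The relation is very close to linear but not exactly $v = 3 \ast lp$, because the recorded lever positions (1.138, 2.088, ...) are not exact integers.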


#### 2.4. Check redundant features
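- GT compressor inlet air temperature $T1$ and GT compressor inlet air pressure $P1$ have zero variance (their standard deviation is 0.0 below), which means they are constant across the whole dataset.
- Since constant columns carry no information about the decay state, $T1$ and $P1$ will be removed from the data.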

In [6]:
data.describe().round(3)

| | lp | v | GTT | GTn | GGn | Ts | Tp | T48 | T1 | T2 | P48 | P1 | P2 | Pexh | TIC | mf | kMc | kMt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 11934.0 | 11934.0 | 11934.0 | 11934.0 | 11934.0 | 11934.0 | 11934.0 | 11934.0 | 11934.0 | 11934.0 | 11934.0 | 11934.0 | 11934.0 | 11934.0 | 11934.0 | 11934.0 | 11934.0 | 11934.0 |
| mean | 5.167 | 15.0 | 27247.499 | 2136.289 | 8200.947 | 227.336 | 227.336 | 735.495 | 288.0 | 646.215 | 2.353 | 0.998 | 12.297 | 1.029 | 33.641 | 0.662 | 0.975 | 0.987 |
| std | 2.626 | 7.746 | 22148.613 | 774.084 | 1091.316 | 200.496 | 200.496 | 173.681 | 0.0 | 72.676 | 1.085 | 0.0 | 5.337 | 0.01 | 25.841 | 0.507 | 0.015 | 0.008 |
| min | 1.138 | 3.0 | 253.547 | 1307.675 | 6589.002 | 5.304 | 5.304 | 442.364 | 288.0 | 540.442 | 1.093 | 0.998 | 5.828 | 1.019 | 0.0 | 0.068 | 0.95 | 0.975 |
| 25% | 3.144 | 9.0 | 8375.884 | 1386.758 | 7058.324 | 60.317 | 60.317 | 589.873 | 288.0 | 578.092 | 1.389 | 0.998 | 7.447 | 1.02 | 13.677 | 0.246 | 0.962 | 0.981 |
| 50% | 5.14 | 15.0 | 21630.659 | 1924.326 | 8482.082 | 175.268 | 175.268 | 706.038 | 288.0 | 637.142 | 2.083 | 0.998 | 11.092 | 1.026 | 25.276 | 0.496 | 0.975 | 0.988 |
| 75% | 7.148 | 21.0 | 39001.427 | 2678.079 | 9132.606 | 332.365 | 332.365 | 834.066 | 288.0 | 693.925 | 2.981 | 0.998 | 15.658 | 1.036 | 44.552 | 0.882 | 0.988 | 0.994 |
| max | 9.3 | 27.0 | 72784.872 | 3560.741 | 9797.103 | 645.249 | 645.249 | 1115.797 | 288.0 | 789.094 | 4.56 | 0.998 | 23.14 | 1.052 | 92.556 | 1.832 | 1.0 | 1.0 |
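
The zero-variance check can also be automated instead of being read off the summary table. The sketch below is a minimal standard-library illustration on four rows copied from the preview in section 2.2; the dictionary `sample` and the list `constant_columns` are illustrative names.

```python
from statistics import pvariance

# Four rows copied from the data preview, restricted to four columns.
sample = {
    'T1': [288.0, 288.0, 288.0, 288.0],
    'P1': [0.998, 0.998, 0.998, 0.998],
    'T2': [550.563, 581.658, 587.587, 613.851],
    'P2': [5.947, 7.282, 7.574, 9.007],
}

# A column whose variance is zero is constant and cannot help the model.
constant_columns = [name for name, values in sample.items() if pvariance(values) == 0]
print(constant_columns)
```

On the full DataFrame, a comparable check would be to look for columns with a single distinct value, for example with `data.nunique() == 1`.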


#### 2.5. Remove redundant features
- We will remove three features from the data: $lp$, $T1$, $P1$

In [7]:
data.drop(['lp','T1','P1'],axis=1,inplace=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11934 entries, 0 to 11933
Data columns (total 15 columns):
v       11934 non-null float64
GTT     11934 non-null float64
GTn     11934 non-null float64
GGn     11934 non-null float64
Ts      11934 non-null float64
Tp      11934 non-null float64
T48     11934 non-null float64
T2      11934 non-null float64
P48     11934 non-null float64
P2      11934 non-null float64
Pexh    11934 non-null float64
TIC     11934 non-null float64
mf      11934 non-null float64
kMc     11934 non-null float64
kMt     11934 non-null float64
dtypes: float64(15)
memory usage: 1.4 MB


In [8]:
data.head(9)

| | v | GTT | GTn | GGn | Ts | Tp | T48 | T2 | P48 | P2 | Pexh | TIC | mf | kMc | kMt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 289.964 | 1349.489 | 6677.38 | 7.584 | 7.584 | 464.006 | 550.563 | 1.096 | 5.947 | 1.019 | 7.137 | 0.082 | 0.95 | 0.975 |
| 1 | 6.0 | 6960.18 | 1376.166 | 6828.469 | 28.204 | 28.204 | 635.401 | 581.658 | 1.331 | 7.282 | 1.019 | 10.655 | 0.287 | 0.95 | 0.975 |
| 2 | 9.0 | 8379.229 | 1386.757 | 7111.811 | 60.358 | 60.358 | 606.002 | 587.587 | 1.389 | 7.574 | 1.02 | 13.086 | 0.259 | 0.95 | 0.975 |
| 3 | 12.0 | 14724.395 | 1547.465 | 7792.63 | 113.774 | 113.774 | 661.471 | 613.851 | 1.658 | 9.007 | 1.022 | 18.109 | 0.358 | 0.95 | 0.975 |
| 4 | 15.0 | 21636.432 | 1924.313 | 8494.777 | 175.306 | 175.306 | 731.494 | 645.642 | 2.078 | 11.197 | 1.026 | 26.373 | 0.522 | 0.95 | 0.975 |
| 5 | 18.0 | 29792.731 | 2307.404 | 8828.36 | 246.278 | 246.278 | 800.434 | 676.397 | 2.501 | 13.356 | 1.03 | 35.76 | 0.708 | 0.95 | 0.975 |
| 6 | 21.0 | 38982.18 | 2678.086 | 9132.429 | 332.077 | 332.077 | 854.747 | 699.954 | 2.963 | 15.679 | 1.035 | 45.881 | 0.908 | 0.95 | 0.975 |
| 7 | 24.0 | 50996.808 | 3087.561 | 9318.562 | 437.989 | 437.989 | 952.122 | 741.77 | 3.576 | 18.632 | 1.04 | 62.44 | 1.236 | 0.95 | 0.975 |
| 8 | 27.0 | 72763.329 | 3560.395 | 9778.528 | 644.905 | 644.905 | 1115.797 | 789.094 | 4.498 | 22.811 | 1.049 | 92.556 | 1.832 | 0.95 | 0.975 |


***

### 3. Train Models

#### 3.1. Split Data to Training and Test sets
- 9,547 instances (80%) are used to fit the models.
- The remaining 2,387 instances (20%) are used to compute the accuracy scores of the models.

In [10]:
X=data.iloc[:,0:13]
kMc=data.iloc[:,13]
kMt=data.iloc[:,14]
X_train, X_test, kMc_train, kMc_test, kMt_train, kMt_test = train_test_split(X,kMc,kMt,test_size=0.2,random_state=13)

In [11]:
print('X shape =',X.shape)
print('X_train shape =',X_train.shape)
print('X_test shape =',X_test.shape)
print('kMc_train shape =',kMc_train.shape)
print('kMc_test shape =',kMc_test.shape)
print('kMt_train shape =',kMt_train.shape)
print('kMt_test shape =',kMt_test.shape)

X shape = (11934, 13)
X_train shape = (9547, 13)
X_test shape = (2387, 13)
kMc_train shape = (9547,)
kMc_test shape = (2387,)
kMt_train shape = (9547,)
kMt_test shape = (2387,)
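
The set sizes printed above follow directly from the 80/20 split. The sketch below reproduces the arithmetic under the assumption that the test share is rounded up and the remainder goes to training, which is consistent with the shapes reported here; `n_samples`, `n_test` and `n_train` are illustrative names.

```python
import math

n_samples = 11934
test_size = 0.2

# 0.2 * 11934 = 2386.8 is not an integer, so the test share is rounded up
# and the training set receives the remaining rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```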


***

#### 3.2. Train the models to predict $kMc$

**Linear Regression Model**
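- R-squared = 0.792 on the training set
- All displayed coefficients are significantly different from zero according to the t-test
- $Ts$ and $Tp$ appear to be identical columns (same summary statistics, same coefficient and standard error), which is consistent with the 12 model degrees of freedom for 13 features and with the very large condition number (8.56e+19); individual coefficients should therefore be interpreted with caution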

In [12]:
LiR_kMc = sm.OLS(kMc_train, sm.add_constant(X_train)).fit()
LiR_kMc_pred = LiR_kMc.predict(sm.add_constant(X_test))
LiR_kMc.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | kMc | R-squared: | 0.792 |
| Model: | OLS | Adj. R-squared: | 0.792 |
| Method: | Least Squares | F-statistic: | 3022.0 |
| Date: | Thu, 11 Jan 2018 | Prob (F-statistic): | 0.0 |
| Time: | 14:07:52 | Log-Likelihood: | 34220.0 |
| No. Observations: | 9547 | AIC: | -68410.0 |
| Df Residuals: | 9534 | BIC: | -68320.0 |
| Df Model: | 12 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.3130 | 0.234 | 18.413 | 0.000 | 3.854 | 4.772 |
| v | 0.0043 | 0.000 | 22.593 | 0.000 | 0.004 | 0.005 |
| GTT | 5.665e-06 | 4.1e-07 | 13.808 | 0.000 | 4.86e-06 | 6.47e-06 |
| GTn | -0.0001 | 3.09e-06 | -37.514 | 0.000 | -0.000 | -0.000 |
| GGn | 1.697e-05 | 1e-06 | 16.961 | 0.000 | 1.5e-05 | 1.89e-05 |
| Ts | -0.0006 | 1.88e-05 | -30.755 | 0.000 | -0.001 | -0.001 |
| Tp | -0.0006 | 1.88e-05 | -30.755 | 0.000 | -0.001 | -0.001 |
| T48 | -0.0003 | 2.3e-05 | -11.319 | 0.000 | -0.000 | -0.000 |
| T2 | -0.0023 | 2.99e-05 | -76.645 | 0.000 | -0.002 | -0.002 |

| | | | |
|---|---|---|---|
| Omnibus: | 225.041 | Durbin-Watson: | 1.986 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 124.991 |
| Skew: | 0.096 | Prob(JB): | 7.22e-28 |
| Kurtosis: | 2.474 | Cond. No. | 8.56e+19 |


**Train the other models: KNN, Decision Tree, Bagging, Random Forest and Extra Trees**

In [13]:
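names = ['LiR_kMc']
# scikit-learn metrics take (y_true, y_pred); r2_score is not symmetric in its arguments.
# The R-squared values tabulated below were obtained with the two arguments swapped,
# so a rerun of this cell can give different R-squared values (MSE is unaffected).
MSE = [mean_squared_error(kMc_test, LiR_kMc_pred)]
R2 = [r2_score(kMc_test, LiR_kMc_pred)]

# candidate models, all with default hyperparameters
models = []
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('Bag_Re', BaggingRegressor()))
models.append(('RandomForest', RandomForestRegressor()))
models.append(('ExtraTreesRegressor', ExtraTreesRegressor()))

for name, model in models:

    # fit on the training set, then score on the held-out test set
    model.fit(X_train, kMc_train)
    predictions = model.predict(X_test)
    
    mse = mean_squared_error(kMc_test, predictions)
    r2 = r2_score(kMc_test, predictions)

    names.append(name)
    MSE.append(mse)
    R2.append(r2)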

**Compare the predictive accuracy of 6 models on the test set**
- The test MSE of the Linear Regression model is small in absolute terms (about 4.5e-05), but it is roughly 3.5 times that of KNN and more than 20 times those of the tree-based models.
- The R-squared reported for the Linear Regression model is about 0.734, far below that of the other models, which are all above 0.936.

In [14]:
df1=pd.DataFrame({'1. Model':names,'3. R-squared':R2,'2. Mean Squared Error':MSE})
df1

| | 1. Model | 2. Mean Squared Error | 3. R-squared |
|---|---|---|---|
| 0 | LiR_kMc | 4.538543e-05 | 0.73371 |
| 1 | KNN | 1.280722e-05 | 0.936131 |
| 2 | CART | 2.236999e-06 | 0.989649 |
| 3 | Bag_Re | 1.065835e-06 | 0.994995 |
| 4 | RandomForest | 1.023996e-06 | 0.995184 |
| 5 | ExtraTreesRegressor | 5.352786e-07 | 0.9975 |
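
The two metrics in this comparison can be written out explicitly. The sketch below is a plain-Python illustration, not the scikit-learn implementation; the functions `mse` and `r2` and the toy values are made up for the example. It shows that MSE does not depend on the order of its two arguments, whereas R-squared does, because the total sum of squares is taken from the first argument (scikit-learn's `r2_score` expects `y_true` first).

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot computed on y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values for illustration only (not taken from the dataset).
y_true = [0.950, 0.960, 0.970, 0.980, 0.990, 1.000]
y_pred = [0.958, 0.964, 0.971, 0.978, 0.985, 0.992]

print(mse(y_true, y_pred) == mse(y_pred, y_true))  # MSE ignores the argument order
print(r2(y_true, y_pred), r2(y_pred, y_true))      # R-squared does not
```

When the predictions vary less than the true values, as is typical for a linear fit, swapping the arguments lowers the reported R-squared.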


***

#### 3.3. Train the models to predict $kMt$

**Train the Linear Regression model**
- R-squared = 0.911 on the training set
- All displayed coefficients are significantly different from zero according to the t-test

In [15]:
LiR_kMt = sm.OLS(kMt_train, sm.add_constant(X_train)).fit()
LiR_kMt_pred = LiR_kMt.predict(sm.add_constant(X_test))
LiR_kMt.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | kMt | R-squared: | 0.911 |
| Model: | OLS | Adj. R-squared: | 0.911 |
| Method: | Least Squares | F-statistic: | 8100.0 |
| Date: | Thu, 11 Jan 2018 | Prob (F-statistic): | 0.0 |
| Time: | 14:07:55 | Log-Likelihood: | 44688.0 |
| No. Observations: | 9547 | AIC: | -89350.0 |
| Df Residuals: | 9534 | BIC: | -89260.0 |
| Df Model: | 12 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.2212 | 0.078 | 2.827 | 0.005 | 0.068 | 0.375 |
| v | 0.0062 | 6.33e-05 | 98.502 | 0.000 | 0.006 | 0.006 |
| GTT | 1.384e-05 | 1.37e-07 | 101.013 | 0.000 | 1.36e-05 | 1.41e-05 |
| GTn | 6.524e-05 | 1.03e-06 | 63.265 | 0.000 | 6.32e-05 | 6.73e-05 |
| GGn | 3.141e-05 | 3.34e-07 | 93.943 | 0.000 | 3.08e-05 | 3.21e-05 |
| Ts | -0.0008 | 6.3e-06 | -126.076 | 0.000 | -0.001 | -0.001 |
| Tp | -0.0008 | 6.3e-06 | -126.076 | 0.000 | -0.001 | -0.001 |
| T48 | -0.0007 | 7.69e-06 | -88.782 | 0.000 | -0.001 | -0.001 |
| T2 | 0.0002 | 1e-05 | 17.974 | 0.000 | 0.000 | 0.000 |

| | | | |
|---|---|---|---|
| Omnibus: | 332.782 | Durbin-Watson: | 2.019 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 841.794 |
| Skew: | 0.151 | Prob(JB): | 1.61e-183 |
| Kurtosis: | 4.423 | Cond. No. | 8.56e+19 |


**Train the other models: KNN, Decision Tree, Bagging, Random Forest and Extra Trees**

In [16]:
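names = ['LiR_kMt']
# scikit-learn metrics take (y_true, y_pred); r2_score is not symmetric in its arguments.
# The R-squared values tabulated below were obtained with the two arguments swapped,
# so a rerun of this cell can give different R-squared values (MSE is unaffected).
MSE = [mean_squared_error(kMt_test, LiR_kMt_pred)]
R2 = [r2_score(kMt_test, LiR_kMt_pred)]

# candidate models, all with default hyperparameters
models = []
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('Bag_Re', BaggingRegressor()))
models.append(('RandomForest', RandomForestRegressor()))
models.append(('ExtraTreesRegressor', ExtraTreesRegressor()))

for name, model in models:

    # fit on the training set, then score on the held-out test set
    model.fit(X_train, kMt_train)
    predictions = model.predict(X_test)
    
    mse = mean_squared_error(kMt_test, predictions)
    r2 = r2_score(kMt_test, predictions)

    names.append(name)
    MSE.append(mse)
    R2.append(r2)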

**Compare the predictive accuracy of 6 models on the test set**
- The test MSE of the Linear Regression model is small in absolute terms (about 4.9e-06) and lower than KNN's, but still higher than those of the other models.
- The R-squared reported for the Linear Regression model is above 0.9, higher than KNN's (about 0.875) but well below those of the other models, which are all above 0.97.

In [17]:
df2=pd.DataFrame({'1. Model':names,'3. R-squared':R2,'2. Mean Squared Error':MSE})
df2

| | 1. Model | 2. Mean Squared Error | 3. R-squared |
|---|---|---|---|
| 0 | LiR_kMt | 4.912406e-06 | 0.905521 |
| 1 | KNN | 6.140209e-06 | 0.874685 |
| 2 | CART | 1.202413e-06 | 0.97827 |
| 3 | Bag_Re | 5.723691e-07 | 0.989477 |
| 4 | RandomForest | 5.871371e-07 | 0.989147 |
| 5 | ExtraTreesRegressor | 3.961621e-07 | 0.992734 |
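
Because all the MSE values are close to zero, they are easier to interpret after taking the square root, which puts the error in the same units as $kMt$ and allows a comparison with the grid resolution of $10^{-3}$. The sketch below does this for the test MSEs copied from the table above; `mse_kMt` and `grid_step` are illustrative names.

```python
import math

# Test MSEs for kMt, copied from the comparison table.
mse_kMt = {
    'LiR_kMt': 4.912406e-06,
    'KNN': 6.140209e-06,
    'CART': 1.202413e-06,
    'Bag_Re': 5.723691e-07,
    'RandomForest': 5.871371e-07,
    'ExtraTreesRegressor': 3.961621e-07,
}
grid_step = 0.001  # resolution of the kMt grid

# RMSE is in the same units as kMt, so it can be expressed in grid steps.
for name, value in sorted(mse_kMt.items(), key=lambda item: item[1]):
    rmse = math.sqrt(value)
    print(f'{name:20s} RMSE = {rmse:.5f}  ({rmse / grid_step:.2f} grid steps)')
```

With these reported values, Bagging, Random Forest and Extra Trees have an RMSE below one grid step, while the linear model and KNN have an RMSE of more than two grid steps.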
