## Summary of different models:

    S. No |  Model                |    Training R²      |    Test R²
    
    1.    |  SVM Regressor        |    0.9648315        |    0.9562091
    2.    |  Linear Regression    |    0.9726291        |    0.9660127
    3.    |  Decision Tree        |    0.9922471        |    0.9665272
    4.    |  ADA Boost Regressor  |    0.9742539        |    0.9745007
    5.    |  Random Forest        |    0.9966556        |    0.9774058
    6.    |  Neural Network       |    0.9884488        |    0.9812295
    7.    |  GB Regressor         |    0.9998022        |    0.9792305
    8.    |  LightGBM Regressor   |    0.9975067        |    0.9847313
    9.    |  XGBoost Regressor    |    0.9986248        |    0.9830212
    
    10.   |  Ensemble model       |    0.9871900        |    0.9730434

## Feature importance derived from various models:


Features gathered from the observation of a phenomenon are not all equally informative: some of them may be noisy, correlated or irrelevant. Feature selection aims at selecting a feature set that is relevant for a given task.

    S. No |  Model               |  Rank 1  |  Rank 2  |  Rank 3  |  Rank 4  |  Rank 5 
    
    1.    |  Decision Tree       |  Dt      |  Cr      |  C       |  TT      |  NT       
    2.    |  ADA Boost Regressor |  TT      |  Cr      |  Tt      |  Dt      |  Ct
    3.    |  Random Forest       |  QmT     |  Ct      |  Cr      |  NT      |  Tt
    4.    |  LightGBM Regressor  |  TT      |  C       |  Cr      |  P       |  Mn
    5.    |  XGBoost Regressor   |  TT      |  C       |  Cr      |  Mn      |  P

## Importing the necessary libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
import tensorflow as tf
import xgboost as xgb
from sklearn.svm import SVR
from tensorflow import keras
from tensorflow.keras import layers
import keras
from keras.models import Sequential
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
import pickle
import json
import requests

In [None]:
cd "/Users/chiragbhattad/Downloads/DDP/Fatigue Strength dataset"

### Importing the dataset using the pandas library and dropping the 'Sl. No.' column

In [None]:
data = pd.read_excel("fatigue.xlsx")
data.drop(['Sl. No.'],axis=1, inplace=True)

### Data correlation visualized using heat map

In [None]:
import matplotlib
font = {'family' : 'sans-serif',  # 'normal' is not a valid font family name
        'weight' : 'bold',
        'size'   : 22}

matplotlib.rc('font', **font)

In [None]:
corr = data.corr()
fig = plt.figure(figsize=(25,25))
ax = fig.add_subplot(111)
cax = ax.matshow(corr,cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,len(data.columns),1)
ax.set_xticks(ticks)
plt.xticks(rotation=90)
ax.set_yticks(ticks)
ax.set_xticklabels(data.columns)
ax.set_yticklabels(data.columns)
plt.show()
fig.savefig('corr_heat_map.png')

### Splitting of the dataset into feature variable and target variable

In [None]:
feature = data.columns[:-1]
target = data.columns[-1]
X = data[feature]
y = data[target]

## Data Visualization

In [None]:
plt.scatter(data['TT'], data['Fatigue'])
fig = plt.gcf()
fig.set_size_inches(15, 10)
plt.title('Tempering temperature vs Fatigue strength')
plt.xlabel('Tempering temperature')
plt.ylabel('Fatigue Strength (MPa)')
plt.savefig('TTvsFatigueStrength.png')

In [None]:
label = data['CT']
color= ['blue' if l == 30 else 'orange' for l in label]
fig = plt.gcf()
fig.set_size_inches(15, 10)
plt.scatter(data['C'], data['Fatigue'], color=color)
plt.title('Scatter plot categorized by carburizing temperature')
plt.xlabel('% Carbon')
plt.ylabel('Fatigue Strength (MPa)')
plt.savefig('CarbonvsFatigueStrength.png')

In [None]:
label = data['CT']
color= ['blue' if l == 30 else 'orange' for l in label]
fig = plt.gcf()
fig.set_size_inches(15, 10)
plt.scatter(data['P'], data['Fatigue'], color = color)
plt.title('Scatter plot categorized by carburizing temperature')
plt.xlabel('% Phosphorus')
plt.ylabel('Fatigue Strength (MPa)')
plt.savefig('PhosporousvsFatigueStrength.png')

In [None]:
label = data['CT']
color= ['blue' if l == 30 else 'orange' for l in label]
fig = plt.gcf()
fig.set_size_inches(15, 10)
plt.scatter(data['Mn'], data['Fatigue'], color = color)
plt.title('Scatter plot categorized by carburizing temperature')
plt.xlabel('% Manganese')
plt.ylabel('Fatigue Strength (MPa)')
plt.savefig('ManganesevsFatigueStrength.png')

In [None]:
label = data['CT']
color= ['blue' if l == 30 else 'orange' for l in label]
fig = plt.gcf()
fig.set_size_inches(15, 10)
plt.scatter(data['Cr'], data['Fatigue'], color = color)
plt.title('Scatter plot categorized by carburizing temperature')
plt.xlabel('% Chromium')
plt.ylabel('Fatigue Strength (MPa)')
plt.savefig('ChromiumvsFatigueStrength.png')

## Feature Ranking

### 1. Univariate Selection

    Univariate analysis examines the relationship between each feature and the target variable individually. The strength of that relationship can be measured with the Pearson coefficient, mutual information (on which the Maximal Information Coefficient is based) or distance correlation.

    1. Pearson Coefficient: It measures the linear correlation between two variables. It takes values between -1 and +1, with -1 being perfect negative correlation and +1 being perfect positive correlation. A Pearson correlation of 0 does not mean the variables are independent; it only rules out a linear relationship.
                        ρ(X,Y) = cov(X,Y) / (σ_X σ_Y)

    2. Mutual Information (the basis of the Maximal Information Coefficient, MIC): It measures the mutual dependence between two variables. The score is expressed in bits when the logarithm is taken to base 2 and is not normalized, whereas MIC rescales it to the range [0, 1]. It is inconvenient to compute for continuous variables: in general the variables need to be discretized by binning, and the mutual information score can be quite sensitive to bin selection.
                        I(X,Y) = ∑_x ∑_y p(x,y) log[ p(x,y) / (p(x) p(y)) ]
    
    3. Distance correlation: It is a measure of dependence between two paired random vectors of arbitrary, not necessarily equal, dimension. It captures both linear and nonlinear association between two random variables or random vectors.
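
The difference between the first two measures can be seen on a small synthetic example. This is a minimal sketch, not part of the original analysis: the data and the names `x_lin`, `x_sq`, `x_noise` and `X_demo` are made up for illustration, and mutual information is estimated with scikit-learn's `mutual_info_regression`, which uses a nearest-neighbour estimator rather than explicit binning.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x_lin = rng.normal(size=500)             # linearly related to the target
x_sq = rng.uniform(-1, 1, size=500)      # related only through its square
x_noise = rng.normal(size=500)           # unrelated to the target
target = 1.0 * x_lin + 3.0 * x_sq**2 + 0.1 * rng.normal(size=500)

X_demo = np.column_stack([x_lin, x_sq, x_noise])

# Pearson coefficient: cov(X, Y) / (sigma_X * sigma_Y), one value per column
pearson = np.array([np.corrcoef(X_demo[:, j], target)[0, 1] for j in range(3)])

# Mutual information, estimated without manual binning
mi = mutual_info_regression(X_demo, target, random_state=0)

print("Pearson:           ", pearson.round(3))
print("Mutual information:", mi.round(3))
```

The quadratic feature should show a Pearson coefficient close to zero but a clearly non-zero mutual information, which is the situation the note "a correlation of 0 does not mean independence" warns about.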

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [None]:
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, y)
np.set_printoptions(precision=3)
print(fit.scores_/sum(fit.scores_))

In [None]:
# Selected Features:
features = fit.transform(X)
print(features[0:5,:])

### 2. Recursive Feature Elimination

    Given an external estimator that assigns weights to features, the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

In [None]:
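# Fatigue strength is a continuous target, so a regression estimator is used here
model = LinearRegression()
# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(model, n_features_to_select=5)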

fit = rfe.fit(X, y)

print("Num Features: %d"% fit.n_features_)
print("Selected Features: %s"% fit.support_)
print("Feature Ranking: %s"% fit.ranking_)

## Principal Component Analysis

    Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
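
The properties described above (orthogonal components, uncorrelated projections, decreasing variance) can be checked numerically. The sketch below uses synthetic data; `raw`, `scaled`, `pca_demo` and `scores` are illustrative names, not variables from this notebook. The columns are standardized first because PCA is sensitive to relative scaling.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
base = rng.normal(size=(200, 1))
# Three columns: two strongly correlated on very different scales, one independent
raw = np.hstack([base,
                 100 * base + rng.normal(size=(200, 1)),
                 rng.normal(size=(200, 1))])

scaled = StandardScaler().fit_transform(raw)
pca_demo = PCA(n_components=2, random_state=1)
scores = pca_demo.fit_transform(scaled)

# Rows of components_ are the principal directions; they should be orthonormal
gram = pca_demo.components_ @ pca_demo.components_.T
# The projected coordinates should be linearly uncorrelated
score_corr = np.corrcoef(scores, rowvar=False)[0, 1]

print("Explained variance ratio:", pca_demo.explained_variance_ratio_)
print("Components orthonormal:", np.allclose(gram, np.eye(2)))
print("Correlation between PC1 and PC2 scores:", score_corr)
```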

In [None]:
sc = StandardScaler()
X = sc.fit_transform(X)

In [None]:
pca = PCA(n_components=2, random_state=1)
principal_components = pca.fit_transform(X)

In [None]:
plt.scatter(principal_components[:,0],principal_components[:,1], c = y)
plt.title("Distribution of training dataset after PCA")
fig = plt.gcf()
fig.set_size_inches(15, 10)
plt.savefig('pca.png')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

## SVM regressor

    Support Vector Machines can also be used for regression while keeping the main feature that characterizes the algorithm (the maximal margin). Support Vector Regression (SVR) uses the same principles as the SVM for classification, with a few differences. Because the output is a real number, it can take infinitely many values, so the model cannot be required to predict every target exactly. Instead, a margin of tolerance (epsilon) is set around the prediction, and errors smaller than epsilon are ignored. The main idea is the same: find the hyperplane that minimizes the error while maximizing the margin, keeping in mind that part of the error is tolerated.

    The hyperplane is the function used to predict the continuous target value. Two boundary lines, one on each side of the hyperplane at a distance of epsilon, create the margin. Support vectors are the training points that lie on the boundary lines or outside them.
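
The tolerance margin corresponds to the epsilon-insensitive loss: residuals inside the margin cost nothing, and residuals outside it are penalized by how far they exceed it. A minimal sketch follows; the helper `epsilon_insensitive_loss` is written here for illustration and is not a scikit-learn function, and the value 0.1 is used because it is the default `epsilon` of `SVR`.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Return max(0, |y_true - y_pred| - epsilon) for each sample."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)

# One prediction inside the margin, two outside it
losses = epsilon_insensitive_loss([10.0, 10.0, 10.0], [10.05, 10.5, 9.0], epsilon=0.1)
print(losses)
```

Only the second and third samples contribute to the loss, so only points like these would become support vectors.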

In [None]:
svreg = SVR(kernel='linear', verbose=1, C=15)
svreg.fit(X_train, y_train)
SVRtrain = r2_score(y_train, svreg.predict(X_train))
print("\nSVM Regressor Train data:", SVRtrain)

In [None]:
Y_pred_3 = svreg.predict(X_test)
SVRerror = r2_score(y_test, Y_pred_3)
print("SVM Regressor Test data:", SVRerror)

Visualizing the predictions made by the SVM Regressor algorithm and comparing it with the actual data points present in the training and testing dataset.

In [None]:
train_pred_3 = svreg.predict(X_train)
plt.plot(range(len(y_train)), train_pred_3, label='predicted value')
plt.plot(range(len(y_train)), y_train, label='actual value')
plt.ylim(0, 1500)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of SVM Regressor on Training data')
plt.savefig('SVMRegressorTrain.png')

In [None]:
plt.plot(range(len(y_test)), Y_pred_3, label='predicted value')
plt.plot(range(len(y_test)), y_test, label='actual value')
plt.ylim(0, 1300)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of SVM Regressor on Test data')
plt.savefig('SVMRegressorTest.png')
plt.show()

In [None]:
x = np.linspace(0, 1200, 1000)
plt.plot(x, x+0, '-.k')
plt.scatter(y, svreg.predict(X))
plt.title("SVM Regressor")
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.xlim(0,1200)
plt.ylim(0,1200)
plt.savefig('SVMRegressorScatter.png')

In [None]:
error = svreg.predict(X) - y
plt.hist(error, edgecolor='black', linewidth = 1.0, rwidth = 0.9, bins=20)
plt.title("Error plot for SVM regressor")
plt.xlabel("Predicted - Actual (MPa)")
plt.ylabel("Frequency")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.savefig('SVMRegressorError.png')

## Linear Regression

    Simple linear regression is useful for finding the relationship between two continuous variables. One is the predictor or independent variable and the other is the response or dependent variable. It looks for a statistical relationship rather than a deterministic one. The relationship between two variables is said to be deterministic if one variable can be expressed exactly in terms of the other; a statistical relationship only describes the trend, and individual points scatter around it.

    The core idea is to obtain the line that best fits the data. The best-fit line is the one for which the total prediction error over all data points is as small as possible (ordinary least squares minimizes the sum of squared errors). The error of a point is the vertical distance between that point and the regression line.
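
The best-fit line can be computed directly for a handful of points. The sketch below uses five made-up points (`x_demo`, `y_demo` are illustrative, not taken from the fatigue dataset) and solves the least-squares problem with `np.linalg.lstsq`, then confirms that nudging the slope away from the solution increases the total squared error.

```python
import numpy as np

x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = np.array([1.0, 3.1, 4.9, 7.2, 8.8])

# Design matrix with a column of ones for the intercept
A = np.column_stack([x_demo, np.ones_like(x_demo)])
(slope, intercept), *_ = np.linalg.lstsq(A, y_demo, rcond=None)

def sum_squared_error(m, c):
    """Total squared vertical distance between the points and the line y = m*x + c."""
    return float(np.sum((y_demo - (m * x_demo + c)) ** 2))

best_sse = sum_squared_error(slope, intercept)
perturbed_sse = sum_squared_error(slope + 0.1, intercept)

print("slope = %.2f, intercept = %.2f" % (slope, intercept))
print("SSE at best fit: %.4f, SSE with perturbed slope: %.4f" % (best_sse, perturbed_sse))
```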

In [None]:
linreg = LinearRegression().fit(X_train,y_train)
linregression = linreg.score(X,y)
LTerror = r2_score(y_train, linreg.predict(X_train))
print("Linear Regression Training data:", LTerror)

In [None]:
Y_pred_1 = linreg.predict(X_test)
linerror = r2_score(y_test, Y_pred_1)
print("Linear Regression Test data:", linerror)

Visualizing the predictions made by the Linear regression algorithm and comparing it with the actual data points present in the training and testing dataset.

In [None]:
train_pred_1 = linreg.predict(X_train)
plt.plot(range(len(y_train)), train_pred_1, label='predicted value')
plt.plot(range(len(y_train)), y_train, label='actual value')
plt.ylim(0, 1500)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of Linear Regression on Training data')
plt.savefig('LinRegTrain.png')

In [None]:
plt.plot(range(len(y_test)), Y_pred_1, label='predicted value')
plt.plot(range(len(y_test)), y_test, label='actual value')
plt.ylim(0, 1300)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of Linear Regression on Test data')
plt.savefig('LinRegTest.png')
plt.show()

In [None]:
x = np.linspace(0, 1200, 1000)
plt.plot(x, x+0, '-.k')
plt.scatter(y, linreg.predict(X))
plt.title("Linear Regression")
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.xlim(0,1200)
plt.ylim(0,1200)
plt.savefig('LinRegScatter.png')

In [None]:
error_1 = linreg.predict(X) - y
plt.hist(error_1, edgecolor='black', linewidth = 1.0, rwidth = 0.9, bins=20)
plt.title("Error plot for Linear regression")
plt.xlabel("Predicted - Actual (MPa)")
plt.ylabel("Frequency")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.savefig('LinRegError.png')

## Polynomial regression

    Polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. It fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y|x), while remaining linear in the coefficients.
    
    Polynomial regression is typically used to describe nonlinear phenomena such as the growth rate of tissues, the progression of disease epidemics and the distribution of carbon isotopes in lake sediments.
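
To make the expansion concrete, the sketch below shows what `PolynomialFeatures` produces for a single row with two features, and counts how many terms a degree-3 expansion generates for 25 input features (the input size used for the neural network later in this notebook). The names `poly_demo`, `expanded` and `n_terms` are illustrative.

```python
from math import comb
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion of one row [a, b] = [2, 3]
poly_demo = PolynomialFeatures(degree=2)
expanded = poly_demo.fit_transform([[2.0, 3.0]])
# Columns are ordered as: 1, a, b, a^2, a*b, b^2
print(expanded)

# Number of terms (including the bias) for n features and degree d is C(n + d, d)
n_terms = comb(25 + 3, 3)
print("Degree-3 terms for 25 features:", n_terms)
```

With a few hundred training rows, an expansion of this size has far more columns than samples, so a near-perfect training score on its own says little about how well the polynomial model generalizes.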

In [None]:
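poly = PolynomialFeatures(degree=3)
# Expand the full feature matrix; it is used later for the scatter and error plots.
# PolynomialFeatures is a transformer, so it is not fitted against the target y.
X_poly = poly.fit_transform(X)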

In [None]:
X_train_poly = poly.fit_transform(X_train)
polreg = LinearRegression().fit(X_train_poly,y_train)
# polregression = polreg.score(X_poly,y)
PTerror = r2_score(y_train, polreg.predict(X_train_poly))
# Report the polynomial model's score, not the linear model's LTerror
print("Polynomial Regression Training data:", PTerror)

In [None]:
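# Reuse the expansion fitted on the training data instead of refitting on the test set
X_test_poly = poly.transform(X_test)
Y_pred_16 = polreg.predict(X_test_poly)
polerror = r2_score(y_test, Y_pred_16)
# Report the polynomial model's score, not the linear model's linerror
print("Polynomial Regression Test data:", polerror)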

In [None]:
train_pred_11 = polreg.predict(X_train_poly)
plt.plot(range(len(y_train)), train_pred_11, label='predicted value')
plt.plot(range(len(y_train)), y_train, label='actual value')
plt.ylim(0, 1500)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of Polynomial Regression (degree = 3) on Training data')
plt.savefig('PolRegTrain.png')

In [None]:
plt.plot(range(len(y_test)), Y_pred_16, label='predicted value')
plt.plot(range(len(y_test)), y_test, label='actual value')
plt.ylim(0, 1300)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of Polynomial Regression (degree = 3) on Test data')
plt.savefig('PolRegTest.png')
plt.show()

In [None]:
x = np.linspace(0, 1200, 1000)
plt.plot(x, x+0, '-.k')
plt.scatter(y, polreg.predict(X_poly))
plt.title("Polynomial Regression (degree = 3)")
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.xlim(0,1200)
plt.ylim(0,1200)
plt.savefig('PolRegScatter.png')

In [None]:
error_1 = polreg.predict(X_poly) - y
plt.hist(error_1, edgecolor='black', linewidth = 1.0, rwidth = 0.9, bins=20)
plt.title("Error plot for Polynomial regression(degree = 3)")
plt.xlabel("Predicted - Actual (MPa)")
plt.ylabel("Frequency")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.savefig('PolRegError.png')

## Decision Tree Regressor

    Decision Trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Decision Trees are simple to understand and to interpret, and they require very little data preparation. A decision tree can therefore be classified as a white box model, i.e., if a given situation is observable in a model, the explanation for the condition is easily expressed in boolean logic.

    A Decision Tree regressor breaks down the dataset into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy), each representing values for the attribute tested. A leaf node (e.g., Hours Played) represents a decision on the numerical target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

In [None]:
dtregr = DecisionTreeRegressor(max_depth=7, random_state=1)
dtregr.fit(X_train, y_train)
DTRtrain = r2_score(y_train, dtregr.predict(X_train))
print("Decision Tree Train data:", DTRtrain)

In [None]:
Y_pred_4 = dtregr.predict(X_test)
DTRerror = r2_score(y_test, Y_pred_4)
print("Decision Tree Test data:", DTRerror)

Visualizing the predictions made by the Decision Tree Regressor and comparing it with the actual data points present in the training and testing dataset.

In [None]:
train_pred_4 = dtregr.predict(X_train)
plt.plot(range(len(y_train)), train_pred_4, label='predicted value')
plt.plot(range(len(y_train)), y_train, label='actual value')
plt.ylim(0, 1500)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of Decision Tree on Training data')
plt.savefig('DTTrain.png')

In [None]:
plt.plot(range(len(y_test)), Y_pred_4, label='predicted value')
plt.plot(range(len(y_test)), y_test, label='actual value')
plt.ylim(0, 1300)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of Decision Tree on Test data')
plt.savefig('DTTest.png')
plt.show()

In [None]:
independent_features = data[feature]
importance_dt = dict(zip(independent_features.columns, dtregr.feature_importances_))
importance_dt

In [None]:
x = np.linspace(0, 1200, 1000)
plt.plot(x, x+0, '-.k')
plt.scatter(y, dtregr.predict(X))
plt.title("Decision Tree Regressor")
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.xlim(0,1200)
plt.ylim(0,1200)
plt.savefig('DTScatter.png')

In [None]:
error_2 = dtregr.predict(X) - y
plt.hist(error_2, edgecolor='black', linewidth = 1.0, rwidth = 0.9, bins=20)
plt.title("Error plot for Decision Tree")
plt.xlabel("Predicted - Actual (MPa)")
plt.ylabel("Frequency")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.savefig('DTError.png')

## Neural Network

    An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements working in unison to solve specific problems. ANNs, like people, learn by example. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning in biological systems involves adjustments to the synaptic connections that exist between the neurones. This is true of ANNs as well.

    Description of the ANN:

    Layer (type)      |          Output Shape |            Param #

    dense_1 (Dense)   |          (None, 512)  |            13312     
    dense_2 (Dense)   |          (None, 256)  |            131328    
    dense_3 (Dense)   |          (None, 128)  |            32896     
    dense_4 (Dense)   |          (None, 64)   |            8256      
    dense_5 (Dense)   |          (None, 1)    |            65        


    Total params: 185,857

    Trainable params: 185,857

    Non-trainable params: 0
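
The parameter counts in the table follow from the layer sizes: a dense layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases. The short check below reproduces the table, assuming 25 input features as in the `input_dim = 25` argument of the model definition; `dense_params` is a helper written for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

# Input size followed by the width of each Dense layer
layer_sizes = [25, 512, 256, 128, 64, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)
print("Total params:", total)
```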

In [None]:
NN_model = Sequential()

# The first hidden layer, fed by the 25 input features :
NN_model.add(Dense(512, kernel_initializer='normal',input_dim = 25, activation='relu'))

# The Hidden Layers :
NN_model.add(Dense(256, kernel_initializer='normal',activation='relu'))
NN_model.add(Dense(128, kernel_initializer='normal',activation='relu'))
NN_model.add(Dense(64, kernel_initializer='normal',activation='relu'))

# The Output Layer :
NN_model.add(Dense(1, kernel_initializer='normal',activation='linear'))

# Compile the network :
NN_model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_absolute_error'])
NN_model.summary()

In [None]:
checkpoint_name = 'Weights-{epoch:03d}--{val_loss:.5f}.hdf5' 
checkpoint = ModelCheckpoint(checkpoint_name, monitor='val_loss', verbose = 1, save_best_only = True, mode ='auto')
callbacks_list = [checkpoint]

In [None]:
NN_model.fit(X_train, y_train, epochs=1000, validation_split = 0.2, callbacks=callbacks_list)

In [None]:
NNerror = r2_score(y_train, NN_model.predict(X_train))
print("Neural Network Training data:", NNerror)

In [None]:
Y_pred_2 = NN_model.predict(X_test)
nnerror = r2_score(y_test, Y_pred_2)
print("Neural Network Test data:", nnerror)

Visualizing the predictions made by the Artificial Neural Network and comparing it with the actual data points present in the training and testing dataset.

In [None]:
train_pred_2 = NN_model.predict(X_train)
plt.plot(range(len(y_train)), train_pred_2, label='predicted value')
plt.plot(range(len(y_train)), y_train, label='actual value')
plt.ylim(0, 1500)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of Neural Network on Training data')
plt.savefig('NNTrain.png')

In [None]:
Y_pred_2 = NN_model.predict(X_test)
plt.plot(range(len(y_test)), Y_pred_2, label='predicted value')
plt.plot(range(len(y_test)), y_test, label='actual value')
plt.ylim(0, 1300)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of Neural Network on Test data')
plt.savefig('NNTest.png')
plt.show()

In [None]:
x = np.linspace(0, 1200, 1000)
plt.plot(x, x+0, '-.k')
plt.scatter(y, np.ravel(NN_model.predict(X)))
plt.title("Neural Network")
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.xlim(0,1200)
plt.ylim(0,1200)
plt.savefig('NNScatter.png')

In [None]:
error_3 = np.ravel(NN_model.predict(X)) - y
plt.hist(error_3, edgecolor='black', linewidth = 1.0, rwidth = 0.9, bins=20)
plt.title("Error plot for Neural Network")
plt.xlabel("Predicted - Actual (MPa)")
plt.ylabel("Frequency")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.savefig('NNError.png')

## RandomForestRegressor

    The random forest model is a type of additive model that makes predictions by combining decisions from a sequence of base models. More formally we can write this class of models as:

                                        g(x)=f0(x)+f1(x)+f2(x)+...

    where the final model g is the sum of simple base models fi. Here, each base model is a simple decision tree; for regression, the sum is scaled by the number of trees, so the forest predicts the average of the trees' outputs. This broad technique of using multiple models to obtain better predictive performance is called model ensembling. In random forests, all the base models are constructed independently using a different subsample of the data. The subsample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True.

    The random forest model is very good at handling tabular data with numerical features, or categorical features with fewer than hundreds of categories. Unlike linear models, random forests are able to capture non-linear interaction between the features and the target. When dealing with sparse input data (e.g. categorical features with large dimension), we can either pre-process the sparse features to generate numerical statistics, or switch to a linear model, which is better suited for such scenarios.
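
The additive form above can be verified on a small forest: for regression, the forest's prediction is the mean of the predictions of its individual trees. The sketch below uses synthetic data from `make_regression`; `X_demo`, `y_demo` and `forest` are illustrative names and are unrelated to the fatigue dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=1)
forest = RandomForestRegressor(n_estimators=20, random_state=1).fit(X_demo, y_demo)

# Predictions of every base tree, shape (n_trees, n_samples)
tree_preds = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_mean = tree_preds.mean(axis=0)

matches = np.allclose(manual_mean, forest.predict(X_demo))
print("Forest prediction equals the mean of its trees:", matches)
```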

In [None]:
rfregr = RandomForestRegressor(n_estimators=100, n_jobs=-1, max_depth=12, verbose=True, random_state=1, oob_score=True)
rfregr.fit(X_train, y_train)
rfrtrain = r2_score(y_train, rfregr.predict(X_train))
print("Random Forest Train data:", rfrtrain)

In [None]:
Y_pred_5 = rfregr.predict(X_test)
RFRerror = r2_score(y_test, Y_pred_5)
print("Random Forest Test data:", RFRerror)

Visualizing the predictions made by the Random Forest Regressor and comparing it with the actual data points present in the training and testing dataset.

In [None]:
train_pred_5 = rfregr.predict(X_train)
plt.plot(range(len(y_train)), train_pred_5, label='predicted value')
plt.plot(range(len(y_train)), y_train, label='actual value')
plt.ylim(0, 1500)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of Random Forest on Training data')
plt.savefig('RFTrain.png')

In [None]:
plt.figure(figsize=(15,5))
plt.plot(range(len(y_test)), Y_pred_5, label='predicted value')
plt.plot(range(len(y_test)), y_test, label='actual value')
plt.ylim(0, 1300)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of Random Forest on Test data')
plt.savefig('RFTest.png')
plt.show()

Using the attribute 'feature_importances_' to rank the importance of each feature with respect to the target variable.

In [None]:
importance = rfregr.feature_importances_
std = np.std([tree.feature_importances_ for tree in rfregr.estimators_],
             axis=0)
indices = np.argsort(importance)[::-1]

print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importance[indices[f]]))

In [None]:
x = np.linspace(0, 1200, 1000)
plt.plot(x, x+0, '-.k')
plt.scatter(y, rfregr.predict(X))
plt.title("Random Forest")
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.xlim(0,1200)
plt.ylim(0,1200)
plt.savefig('RFScatter.png')

In [None]:
error_4 = rfregr.predict(X) - y
plt.hist(error_4, edgecolor='black', linewidth = 1.0, rwidth = 0.9, bins=20)
plt.title("Error plot for Random Forest")
plt.xlabel("Predicted - Actual (MPa)")
plt.ylabel("Frequency")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.savefig('RFError.png')

## Gradient Boost Regressor

    Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. The objective of any supervised learning algorithm is to define a loss function and minimize it. Gradient Boosting allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.
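
For the squared-error loss, the negative gradient is simply the residual between the target and the current prediction, so each stage fits a small tree to the residuals. The loop below is a minimal hand-written sketch of that idea on synthetic data. It is not how `GradientBoostingRegressor` is implemented internally, and the learning rate, tree depth and number of stages are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())    # stage 0: constant model
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    residual = y_demo - prediction                  # negative gradient of 0.5 * (y - F)^2
    tree = DecisionTreeRegressor(max_depth=2, random_state=1).fit(X_demo, residual)
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print("Training MSE, first stage: %.4f, last stage: %.4f" % (mse_history[0], mse_history[-1]))
```

Each stage can only lower the training error here, which is also why the training score of a boosted model needs to be read alongside the test score.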

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
gbregr = GradientBoostingRegressor(n_estimators=100, max_depth=6, verbose=True, random_state=1)
gbregr.fit(X_train, y_train)
gbrtrain = r2_score(y_train, gbregr.predict(X_train))
print("Gradient Boosting Train data:", gbrtrain)

In [None]:
Y_pred_6 = gbregr.predict(X_test)
GBRerror = r2_score(y_test, Y_pred_6)
print("Gradient Boosting Test data:", GBRerror)

Visualizing the predictions made by the Gradient Boosting Regressor and comparing it with the actual data points present in the training and testing dataset.

In [None]:
train_pred_6 = gbregr.predict(X_train)
plt.plot(range(len(y_train)), train_pred_6, label='predicted value')
plt.plot(range(len(y_train)), y_train, label='actual value')
plt.ylim(0, 1500)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of Gradient Boosting on Training data')
plt.savefig('GBTrain.png')

In [None]:
plt.plot(range(len(y_test)), Y_pred_6, label='predicted value')
plt.plot(range(len(y_test)), y_test, label='actual value')
plt.ylim(0, 1300)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of Gradient Boosting on Test data')
plt.savefig('GBTest.png')
plt.show()

In [None]:
x = np.linspace(0, 1200, 1000)
plt.plot(x, x+0, '-.k')
plt.scatter(y, gbregr.predict(X))
plt.title("Gradient Boosting")
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.xlim(0,1200)
plt.ylim(0,1200)
plt.savefig('GBScatter.png')

In [None]:
error_5 = gbregr.predict(X) - y
plt.hist(error_5, edgecolor='black', linewidth = 1.0, rwidth = 0.9, bins=20)
plt.title("Error plot for Gradient Boosting")
plt.xlabel("Predicted - Actual (MPa)")
plt.ylabel("Frequency")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.savefig('GBError.png')

## ADA Boost regressor

    An AdaBoost regressor is a meta-estimator that begins by fitting a regressor on the original dataset and then fits additional copies of the regressor on the same dataset but where the weights of instances are adjusted according to the error of the current prediction. As such, subsequent regressors focus more on difficult cases.

    AdaBoost can be used to boost the performance of any machine learning algorithm, but it is best used with weak learners. These are models that achieve accuracy just above random chance on a classification problem. The most suited and therefore most common base learners used with AdaBoost are decision trees with one level. Because these trees are so short and only contain one decision for classification, they are often called decision stumps.

In [None]:
from sklearn.ensemble import AdaBoostRegressor
adaregr = AdaBoostRegressor(n_estimators=100, learning_rate=0.9, random_state=1)
adaregr.fit(X_train, y_train)
adartrain = r2_score(y_train, adaregr.predict(X_train))
print("ADA Boost Train data:", adartrain)

In [None]:
Y_pred_7 = adaregr.predict(X_test)
adaRerror = r2_score(y_test, Y_pred_7)
print("ADA Boost Test data:", adaRerror)

Visualizing the predictions made by the ADA Boost Regressor and comparing it with the actual data points present in the training and testing dataset.

In [None]:
train_pred_7 = adaregr.predict(X_train)
# Label each line directly so the legend does not need to be patched afterwards
plt.plot(range(len(y_train)), train_pred_7, label='predicted value')
plt.plot(range(len(y_train)), y_train, label='actual value')
plt.ylim(0, 1500)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of ADA Boosting on Training data')
plt.savefig('ADATrain.png')

In [None]:
# Label each line directly so the legend does not need to be patched afterwards
plt.plot(range(len(y_test)), Y_pred_7, label='predicted value')
plt.plot(range(len(y_test)), y_test, label='actual value')
plt.ylim(0, 1300)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of ADA Boosting on Test data')
plt.savefig('ADATest.png')
plt.show()

In [None]:
importance_1 = adaregr.feature_importances_
std_1 = np.std([tree.feature_importances_ for tree in adaregr.estimators_],
             axis=0)
indices_1 = np.argsort(importance_1)[::-1]

print("Feature ranking:")

for f in range(independent_features.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices_1[f], importance_1[indices_1[f]]))
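The loop above prints column indices rather than feature names. A minimal sketch of mapping importances to names follows; the helper `rank_features` and the importance values are hypothetical and only illustrate the sorting, they are not results from this notebook.

```python
import numpy as np

def rank_features(importances, names, top_k=5):
    """Return the top_k (name, importance) pairs, most important first."""
    importances = np.asarray(importances, dtype=float)
    order = np.argsort(importances)[::-1][:top_k]   # indices sorted by descending importance
    return [(names[i], float(importances[i])) for i in order]

# Made-up importances for illustration only
demo_names = ["C", "Mn", "TT", "Cr", "P"]
demo_importances = [0.20, 0.05, 0.50, 0.15, 0.10]

ranking = rank_features(demo_importances, demo_names, top_k=3)
for rank, (name, value) in enumerate(ranking, start=1):
    print(f"{rank}. {name} ({value:.2f})")
```

With the objects in this notebook, a call such as `rank_features(adaregr.feature_importances_, list(independent_features.columns))` should give a named ranking, assuming `independent_features` is the DataFrame the model was trained on.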

In [None]:
x = np.linspace(0, 1200, 1000)
plt.plot(x, x+0, '-.k')
plt.scatter(y, adaregr.predict(X))
plt.title("ADA Boost")
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.xlim(0,1200)
plt.ylim(0,1200)
plt.savefig('ADAScatter.png')

In [None]:
error_6 = adaregr.predict(X) - y
plt.hist(error_6, edgecolor='black', linewidth = 1.0, rwidth = 0.9, bins=20)
plt.title("Error plot for ADA Boosting")
plt.xlabel("Predicted - Actual (MPa)")
plt.ylabel("Frequency")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.savefig('ADAError.png')

## LightGBM

    LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with faster training speed, lower memory usage and better accuracy. It is used for ranking, classification, regression and many other machine learning tasks.
    
    LightGBM grows trees leaf-wise (vertically), whereas most other tree-boosting algorithms grow trees level-wise (horizontally). At each step it splits the leaf with the maximum delta loss. For the same number of leaves, a leaf-wise algorithm can reduce more loss than a level-wise one.
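The difference between the two growth strategies can be illustrated with a toy example. The sketch below does not use LightGBM at all; the `GAIN` table is a made-up set of loss reductions for each candidate split, and the two functions only simulate the order in which nodes would be split under a fixed budget.

```python
import heapq

# Hypothetical loss reduction (gain) from splitting each node.
# "" is the root; "L"/"R" are its children, "LL" is the left child of "L", and so on.
GAIN = {"": 10.0, "L": 8.0, "R": 1.0, "LL": 6.0, "LR": 0.5, "RL": 0.2, "RR": 0.1}

def leaf_wise(n_splits):
    """Always split the current leaf with the largest gain."""
    heap = [(-GAIN[""], "")]            # max-heap via negated gains
    total = 0.0
    for _ in range(n_splits):
        if not heap:
            break
        neg_gain, node = heapq.heappop(heap)
        total += -neg_gain
        for child in (node + "L", node + "R"):
            if child in GAIN:
                heapq.heappush(heap, (-GAIN[child], child))
    return total

def level_wise(n_splits):
    """Split nodes in breadth-first order, one level at a time."""
    queue = [""]
    total = 0.0
    for _ in range(n_splits):
        if not queue:
            break
        node = queue.pop(0)
        total += GAIN[node]
        queue.extend(c for c in (node + "L", node + "R") if c in GAIN)
    return total

print("leaf-wise total gain: ", leaf_wise(3))
print("level-wise total gain:", level_wise(3))
```

With three splits the leaf-wise strategy follows the high-gain branch (root, `L`, `LL`), while the level-wise strategy spends one split on the low-gain node `R`. On small datasets, leaf-wise growth can overfit more easily, which is why LightGBM exposes parameters such as `num_leaves` and `max_depth` to limit it.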

In [None]:
import lightgbm as lgb
lgbmregr = lgb.LGBMRegressor(n_jobs=-1, subsample=1.0, learning_rate=0.5, min_split_gain=.01)
lgbmregr.fit(X_train, y_train)
lgbmrtrain = r2_score(y_train, lgbmregr.predict(X_train))
print("Light GBM Train data:", lgbmrtrain)

In [None]:
Y_pred_8 = lgbmregr.predict(X_test)
lgbmRerror = r2_score(y_test, Y_pred_8)
print("Light GBM Test data:", lgbmRerror)

Visualizing the predictions made by the LightGBM Regressor and comparing them with the actual values in the training and test sets.

In [None]:
train_pred_8 = lgbmregr.predict(X_train)
plt.figure(figsize=(25,5))
# Label each line directly so the legend does not need to be patched afterwards
plt.plot(range(len(y_train)), train_pred_8, label='predicted value')
plt.plot(range(len(y_train)), y_train, label='actual value')
plt.ylim(0, 1500)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of Light GBM on Training data')
plt.savefig('LGBMTrain.png')

In [None]:
plt.figure(figsize=(15,5))
# Label each line directly so the legend does not need to be patched afterwards
plt.plot(range(len(y_test)), Y_pred_8, label='predicted value')
plt.plot(range(len(y_test)), y_test, label='actual value')
plt.ylim(0, 1300)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of Light GBM on Test data')
plt.savefig('LGBMTest.png')
plt.show()

Using the attribute 'feature_importances_' to rank the importance of each feature with respect to the target variable.

In [None]:
import seaborn as sns

# sorted(zip(clf.feature_importances_, X.columns), reverse=True)
feature_imp = pd.DataFrame(sorted(zip(lgbmregr.feature_importances_,independent_features.columns)), columns=['Value','Feature'])

plt.figure(figsize=(20, 10))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('LightGBM feature importances')
plt.tight_layout()
# Save before showing: after plt.show() the current figure may already be cleared
plt.savefig('lgbm_importances-01.png')
plt.show()

In [None]:
x = np.linspace(0, 1200, 1000)
plt.plot(x, x+0, '-.k')
plt.scatter(y, lgbmregr.predict(X))
plt.title("Light GBM")
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.xlim(0,1200)
plt.ylim(0,1200)
plt.savefig('LGBMScatter.png')

In [None]:
error_7 = lgbmregr.predict(X) - y
plt.hist(error_7, edgecolor='black', linewidth = 1.0, rwidth = 0.9, bins=20)
plt.title("Error plot for Light GBM")
plt.xlabel("Predicted - Actual (MPa)")
plt.ylabel("Frequency")
fig = plt.gcf()
fig.set_size_inches(8, 7)
plt.savefig('LGBMError.png')

## XGBoost Regressor

    XGBoost stands for “Extreme Gradient Boosting”, where the term “Gradient Boosting” originates from the paper “Greedy Function Approximation: A Gradient Boosting Machine” by Friedman. XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the gradient boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.
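The core idea of gradient boosting can be sketched with NumPy alone. The example below is plain gradient boosting for squared error with one-split stumps on synthetic data; it is not XGBoost's algorithm, which adds regularisation and second-order gradient information. The helper names `fit_stump` and `gradient_boost` are hypothetical.

```python
import numpy as np

def fit_stump(x, r):
    """Best single-threshold split of 1-D x, minimising squared error on r."""
    best = None
    for t in np.unique(x)[:-1]:                 # skip the last value so both sides are non-empty
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda z: np.where(z <= t, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, learning_rate=0.2):
    """Fit stumps to the residuals one after another and sum their scaled outputs."""
    pred = np.full_like(y, y.mean(), dtype=float)   # start from the mean prediction
    history = []
    for _ in range(n_rounds):
        residual = y - pred                     # negative gradient of 0.5 * (y - pred)^2
        stump = fit_stump(x, residual)
        pred = pred + learning_rate * stump(x)
        history.append(np.mean((y - pred) ** 2))
    return pred, history

# Synthetic data for illustration only
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 80)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

pred, mse_history = gradient_boost(x, y)
print("MSE after first round:", mse_history[0])
print("MSE after last round: ", mse_history[-1])
```

Each round fits a weak learner to what the current model still gets wrong, so the training error shrinks as rounds accumulate; the `learning_rate` and `n_estimators` arguments of the regressors in this notebook control the same trade-off.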

In [None]:
import xgboost as xgb  # needed here unless already imported earlier in the notebook
xgbregr = xgb.XGBRegressor(n_estimators=100, learning_rate=0.2, max_depth=4)
xgbregr.fit(X_train, y_train)
XGBtrain = r2_score(y_train, xgbregr.predict(X_train))
print("XGBoost Regressor Train data:", XGBtrain)

In [None]:
Y_pred_9 = xgbregr.predict(X_test)
XGBerror = r2_score(y_test, Y_pred_9)
print("XGBoost Regressor Test data:", XGBerror)

Visualizing the predictions made by the XGBoost Regressor and comparing them with the actual values in the training and test sets.

In [None]:
train_pred_9 = xgbregr.predict(X_train)
# Label each line directly so the legend does not need to be patched afterwards
plt.plot(range(len(y_train)), train_pred_9, label='predicted value')
plt.plot(range(len(y_train)), y_train, label='actual value')
plt.ylim(0, 1500)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of XGBoost on Training data')
plt.savefig('XGBTrain.png')

In [None]:
# Label each line directly so the legend does not need to be patched afterwards
plt.plot(range(len(y_test)), Y_pred_9, label='predicted value')
plt.plot(range(len(y_test)), y_test, label='actual value')
plt.ylim(0, 1300)
plt.ylabel('Fatigue Strength (MPa)')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of XGBoost on Test data')
plt.savefig('XGBTest.png')
plt.show()

In [None]:
from xgboost import plot_importance
plot_importance(xgbregr)
plt.show()

In [None]:
x = np.linspace(0, 1200, 1000)
plt.plot(x, x+0, '-.k')
plt.scatter(y, xgbregr.predict(X))
plt.title("XGBoost Regressor")
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.xlim(0,1200)
plt.ylim(0,1200)
plt.savefig('XGBScatter.png')

In [None]:
error_8 = xgbregr.predict(X) - y
plt.hist(error_8, edgecolor='black', linewidth = 1.0, rwidth = 0.8, bins=20)
plt.title("Error plot for XGBoost")
plt.xlabel("Predicted - Actual (MPa)")
plt.ylabel("Frequency")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.savefig('XGBError.png')

## Creating an ensemble model to be used in fatigue prediction

    The ensemble is trained on, and makes its predictions from, the top five features of the 25-feature dataset. The top five features are:
    1. Tempering Temperature
    2. % Carbon
    3. % Chromium
    4. % Manganese
    5. % Phosphorus
    
    These five features were selected by the machine learning models and are consistent with materials science theory. Refer to the thesis for more information.
    The ensemble consists of the following models:
    1. Neural Network
    2. Decision Tree regressor
    3. XGBoost regressor
    4. LightGBM regressor
    5. ADA Boost regressor
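Combining the five models amounts to averaging their predictions element by element. The sketch below shows that step in isolation; `OffsetModel` is a hypothetical stand-in for any fitted regressor with a `predict` method, and the input values are made up.

```python
import numpy as np

class OffsetModel:
    """Hypothetical stand-in for a fitted regressor: row sum plus a fixed offset."""
    def __init__(self, offset, as_column=False):
        self.offset = offset
        self.as_column = as_column

    def predict(self, X):
        out = np.asarray(X, dtype=float).sum(axis=1) + self.offset
        # Some libraries (e.g. Keras) return shape (n, 1) instead of (n,)
        return out.reshape(-1, 1) if self.as_column else out

def average_ensemble(models, X):
    """Average the predictions of several models, flattening each to 1-D first."""
    preds = [np.ravel(m.predict(X)) for m in models]
    return np.mean(preds, axis=0)

X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
models = [OffsetModel(-1.0, as_column=True), OffsetModel(0.0), OffsetModel(1.0)]
ensemble_pred = average_ensemble(models, X_demo)
print(ensemble_pred)
```

Flattening with `np.ravel` matters because a column-shaped output added to a 1-D output would broadcast into an n-by-n matrix rather than an element-wise average.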

In [None]:
features = ['TT', 'C', 'Cr', 'Mn', 'P']
X_final = data[features]
y_final = data[target]
sc = StandardScaler()
# Standardize the five selected features before splitting
X_final = sc.fit_transform(X_final)
Xtrain, Xtest, ytrain, ytest = train_test_split(X_final, y_final, test_size=0.2, random_state=1)

## Neural Network

In [None]:
NN_model1 = Sequential()

# The Input Layer :
NN_model1.add(Dense(512, kernel_initializer='normal',input_dim = 5, activation='relu'))

# The Hidden Layers :
NN_model1.add(Dense(256, kernel_initializer='normal',activation='relu'))
NN_model1.add(Dense(128, kernel_initializer='normal',activation='relu'))
NN_model1.add(Dense(64, kernel_initializer='normal',activation='relu'))

# The Output Layer :
NN_model1.add(Dense(1, kernel_initializer='normal',activation='linear'))

# Compile the network :
NN_model1.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_absolute_error'])
NN_model1.summary()

In [None]:
import statistics
NN_model1.fit(Xtrain, ytrain, epochs=400, validation_split = 0.2, callbacks=callbacks_list)
dtregr.fit(Xtrain, ytrain)
xgbregr.fit(Xtrain, ytrain)
lgbmregr.fit(Xtrain, ytrain)
adaregr.fit(Xtrain, ytrain)

pred1 = np.ravel(NN_model1.predict(Xtest))
pred2 = np.array(dtregr.predict(Xtest))
pred3 = np.array(xgbregr.predict(Xtest))
pred4 = np.array(lgbmregr.predict(Xtest))
pred5 = np.array(adaregr.predict(Xtest))

final_pred = np.mean(np.array([pred1, pred2, pred3, pred4, pred5]), axis=0)
# r2_score expects (y_true, y_pred); score against the 5-feature test split
print(r2_score(ytest, final_pred))
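R² is not symmetric in its arguments, so the order passed to `r2_score` matters (scikit-learn's signature is `r2_score(y_true, y_pred)`). A minimal NumPy sketch of the definition, with made-up values, shows the difference; the helper `r2` is written here for illustration only.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variance of the *true* values
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration only
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]

r2_correct = r2(actual, predicted)
r2_swapped = r2(predicted, actual)
print(r2_correct, r2_swapped)
```

The denominator uses the spread of whichever array is passed first, so swapping the arguments normalises by the spread of the predictions and reports a different score.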

## Saving the Models as pickle files to use for the webpage

In [None]:
# pickle.dump(NN_model1, open('model1.pkl','wb'))

# pickle.dump(dtregr, open('model2.pkl','wb'))

# pickle.dump(xgbregr, open('model3.pkl','wb'))

# pickle.dump(lgbmregr, open('model4.pkl','wb'))

# pickle.dump(adaregr, open('model5.pkl','wb'))

## Individual model accuracy for 5 features

In [None]:
svreg.fit(Xtrain, ytrain)
svreg1 = r2_score(ytrain, svreg.predict(Xtrain))
print("SVM Regression Train data:", svreg1)
Y_pred_17 = svreg.predict(Xtest)
svregerror1 = r2_score(ytest, Y_pred_17)
print("SVM Regression Test data:", svregerror1)

In [None]:
linreg.fit(Xtrain, ytrain)
linreg1 = r2_score(ytrain, linreg.predict(Xtrain))
print("Linear Regression Train data:", linreg1)
Y_pred_18 = linreg.predict(Xtest)
linregerror1 = r2_score(ytest, Y_pred_18)
print("Linear Regression Test data:", linregerror1)

In [None]:
XTrainpoly = poly.fit_transform(Xtrain) 
polreg.fit(XTrainpoly, ytrain)
polreg1 = r2_score(ytrain, polreg.predict(XTrainpoly))
print("Polynomial Regression Train data:", polreg1)
# Reuse the transformer fitted on the training split instead of refitting on test data
XTestpoly = poly.transform(Xtest)
Y_pred_19 = polreg.predict(XTestpoly)
polregerror1 = r2_score(ytest, Y_pred_19)
print("Polynomial Regression Test data:", polregerror1)

In [None]:
DTRtrain1 = r2_score(ytrain, dtregr.predict(Xtrain))
print("Decision Tree Train data:", DTRtrain1)
Y_pred_14 = dtregr.predict(Xtest)
DTRerror1 = r2_score(ytest, Y_pred_14)
print("Decision Tree Test data:", DTRerror1)

In [None]:
NNerror_1 = r2_score(ytrain, NN_model1.predict(Xtrain))
print("Neural Network Training data:", NNerror_1)
Y_pred_11 = NN_model1.predict(Xtest)
nnerror_1 = r2_score(ytest, Y_pred_11)
print("Neural Network Test data:", nnerror_1)

In [None]:
rfregr.fit(Xtrain, ytrain)
rfregr1 = r2_score(ytrain, rfregr.predict(Xtrain))
print("Random Forest  Train data:", rfregr1)
Y_pred_20 = rfregr.predict(Xtest)
rfregrerror1 = r2_score(ytest, Y_pred_20)
print("Random Forest  Test data:", rfregrerror1)

In [None]:
gbregr.fit(Xtrain, ytrain)
gbregr1 = r2_score(ytrain, gbregr.predict(Xtrain))
print("Gradient Boosting Regression Train data:", gbregr1)
Y_pred_21 = gbregr.predict(Xtest)
gbregrerror1 = r2_score(ytest, Y_pred_21)
print("Gradient Boosting Regression Test data:", gbregrerror1)

In [None]:
adartrain1 = r2_score(ytrain, adaregr.predict(Xtrain))
print("ADA Boost Train data:", adartrain1)
Y_pred_13 = adaregr.predict(Xtest)
adaRerror1 = r2_score(ytest, Y_pred_13)
print("ADA Boost Test data:", adaRerror1)

In [None]:
lgbmrtrain1 = r2_score(ytrain, lgbmregr.predict(Xtrain))
print("LightGBM Train data:", lgbmrtrain1)
Y_pred_15 = lgbmregr.predict(Xtest)
lgbmRerror1 = r2_score(ytest, Y_pred_15)
print("LightGBM Test data:", lgbmRerror1)

In [None]:
XGBtrain1 = r2_score(ytrain, xgbregr.predict(Xtrain))
print("XGBoost Regressor Train data:", XGBtrain1)
Y_pred_12 = xgbregr.predict(Xtest)
XGBerror1 = r2_score(ytest, Y_pred_12)
print("XGBoost Regressor Test data:", XGBerror1)

## Final prediction using the Ensemble model

In [None]:
final_pred = (np.ravel(NN_model1.predict(X_final)) + np.array(dtregr.predict(X_final)) + np.array(xgbregr.predict(X_final)) + np.array(lgbmregr.predict(X_final)) + np.array(adaregr.predict(X_final)))/5

## Accuracy of the Ensemble Deep Learning model

In [None]:
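# Average the five model predictions on the 5-feature training split
train_pred_10 = (np.ravel(NN_model1.predict(Xtrain)) + dtregr.predict(Xtrain) + xgbregr.predict(Xtrain) + lgbmregr.predict(Xtrain) + adaregr.predict(Xtrain))/5
Ensembletrain = r2_score(ytrain, train_pred_10)
print("Ensemble Deep Learning model Train data:", Ensembletrain)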

In [None]:
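# Average the five model predictions on the 5-feature test split
Y_pred_10 = (np.ravel(NN_model1.predict(Xtest)) + dtregr.predict(Xtest) + xgbregr.predict(Xtest) + lgbmregr.predict(Xtest) + adaregr.predict(Xtest))/5
Ensembleerror = r2_score(ytest, Y_pred_10)
print("Ensemble Deep Learning model Test data:", Ensembleerror)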

## Printing the Accuracy plots and error plots for the Ensemble Deep Learning model.

In [None]:
train_pred_10 = (np.ravel(NN_model1.predict(Xtrain)) + np.array(dtregr.predict(Xtrain)) + np.array(xgbregr.predict(Xtrain)) + np.array(lgbmregr.predict(Xtrain)) + np.array(adaregr.predict(Xtrain)))/5
# Label each line directly so the legend does not need to be patched afterwards
plt.plot(range(len(ytrain)), train_pred_10, label='predicted value')
plt.plot(range(len(ytrain)), ytrain, label='actual value')
plt.ylim(0, 1500)
plt.ylabel('Fatigue Strength (MPa)')
plt.xlabel('Data-points')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of Ensemble Deep Learning model on Training data')
plt.savefig('EnsembleTrain.png')

In [None]:
Y_pred_10 = (np.ravel(NN_model1.predict(Xtest)) + np.array(dtregr.predict(Xtest)) + np.array(xgbregr.predict(Xtest)) + np.array(lgbmregr.predict(Xtest)) + np.array(adaregr.predict(Xtest)))/5
# Plot against ytest, the target of the 5-feature split used by the ensemble
plt.plot(range(len(ytest)), Y_pred_10, label='predicted value')
plt.plot(range(len(ytest)), ytest, label='actual value')
plt.ylim(0, 1300)
plt.ylabel('Fatigue Strength (MPa)')
plt.xlabel('Data-points')
plt.legend(loc='upper right', prop={'size': 12})
fig = plt.gcf()
fig.set_size_inches(17, 5)
plt.title('Accuracy of Ensemble Deep Learning model on Test data')
plt.savefig('EnsembleTest.png')
plt.show()

In [None]:
x = np.linspace(0, 1200, 1000)
plt.plot(x, x+0, '-.k')
plt.scatter(y, final_pred)
plt.title("Ensemble model")
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.xlim(0,1200)
plt.ylim(0,1200)
plt.savefig('EnsembleScatter.png')

In [None]:
plt.plot(range(437), final_pred, color='Blue', label='Prediction')
plt.plot(range(437), y, color='orange', label='Actual')
plt.title('Gauging the accuracy of the ensemble deep learning model')
plt.legend(loc='upper left', prop={'size': 12})
plt.xlabel('Data points')
plt.ylabel('Fatigue Strength (MPa)')
fig = plt.gcf()
fig.set_size_inches(17, 10)
plt.savefig('EnsemblePred.png')

In [None]:
error_9 = final_pred - y
plt.hist(error_9, edgecolor='black', linewidth = 1.0, rwidth = 0.8, bins=20)
plt.title("Error plot for Ensemble deep learning model")
plt.xlabel("Predicted - Actual (MPa)")
plt.ylabel("Frequency")
fig = plt.gcf()
fig.set_size_inches(9, 7)
plt.savefig('EnsembleError.png')