<a href="https://colab.research.google.com/github/dimazjogja/electronic-failure/blob/main/Electronic_Failure_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **ELECTRONIC FAILURE PREDICTION**
* **Background** - Predict the suspected faulty parts of a component from the measurement data of its 20 subtests. <br>
* **Objective** - Build a prototype that turns a machine learning model into a working web-based application using HTML / JS. <br>
* **Dataset** - CSV file (VOR dummy data) from 99 orders, containing 20 features (Test1 to Test20, each encoded as Pass = 1 / Fail = 0), 1 target (the defect part), and 1 order number (not used in prediction; it is dropped). <br>
The data was collected manually from PDSheet (Technical Record management / Dokmee), which takes considerable effort. Data is available only from 2017 to 2021, so the dataset is small, but we proceed with what is available. Note that only REPAIR data is used for prediction.
<br>
* **Feature Engineering** - The real measurements are recorded in volts, amperes, ohms, etc., but we simplify each one to Pass (1) or Fail (0) for faster processing. Make sure the target labels are typed consistently: a label that differs only in letter case is treated as a different label. <br>
* **Model / Classifier** - Random Forest, Gradient Boosting, Logistic Regression (K-Nearest Neighbours and SVC are also trained for comparison) <br>
* **Prediction Type** - Supervised multi-class classification (each order is assigned exactly one label out of more than two possible values, because there are various part numbers (PN)) <br>
* **Optimization Method** - Hyperparameter tuning using RandomizedSearchCV and GridSearchCV <br>
* **Metrics used** - Accuracy (best test-set result: 72.7%) <br>
* **Output file format** - model.pkl (Pickle). The file is then copied into a server folder and loaded from an API (for example Flask, Django, or Laravel).
* **Deployment** - Into a web page through an API. There are multiple ways to load the pkl model, using Flask, Django, or Laravel.

**References**:

> 1.   Flask https://towardsdatascience.com/deploy-a-machine-learning-model-using-flask-da580f84e60c
> 2.   Flask https://www.linkedin.com/pulse/creating-machine-learning-web-api-flask-jonathan-wood/
> 3.   Laravel https://towardsdatascience.com/how-to-deploy-machine-learning-model-in-laravel-application-5e021494d316
> 4.   Django https://medium.com/analytics-vidhya/integrating-a-machine-learning-model-with-django-79dd47eabef1

* **Web development** - The web front end may use HTML / JavaScript with a Flask REST API or Django RESTful API, or PHP Laravel. The prototype in this scope only aims to provide a simple, usable user interface that contains:
A search box / combo box for entering the part number group. After pressing Enter, the list of subtests for the selected component is shown, each defaulting to Pass. The technician then clicks a toggle button to switch a subtest between Pass and Fail according to the test results of the unit under repair. For example, if the technician found failures on subtests 5, 6, and 10, they toggle those three to Fail. The technician then clicks the Predict button, and the application returns the suspected faulty parts.

* **Future Works** - Enhance the web application with the component test database, including charts that show analytics of parts replacement. This may give the technician an overview of the historical data on components and parts replacement.
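
The UI flow described under Web development (all subtests default to Pass, the technician toggles the failed ones, then clicks Predict) can be sketched as a small framework-agnostic helper that a Flask or Django route could call. This is a minimal sketch, not the project's actual API: the function names, the JSON field `failed_subtests`, and the `StubModel` stand-in are all hypothetical. In a real service, `model` would be the unpickled classifier and `part_names` the `Values` index returned by `pd.factorize`.

```python
import json

N_SUBTESTS = 20  # Test1 .. Test20, as in the dataset


def build_feature_vector(failed_subtests, n_subtests=N_SUBTESTS):
    """Return a Pass(1)/Fail(0) list with the given 1-based subtests set to Fail."""
    features = [1] * n_subtests  # every subtest defaults to Pass
    for number in failed_subtests:
        if not 1 <= number <= n_subtests:
            raise ValueError(f"subtest {number} is outside 1..{n_subtests}")
        features[number - 1] = 0
    return features


def handle_predict(request_body, model, part_names):
    """Hypothetical handler: JSON in, JSON out. `model` only needs .predict()."""
    payload = json.loads(request_body)
    features = build_feature_vector(payload["failed_subtests"])
    label = int(model.predict([features])[0])  # encoded class index
    return json.dumps({"suspected_part": part_names[label]})


class StubModel:
    """Stand-in for the unpickled classifier (assumption, for illustration only)."""

    def predict(self, rows):
        # Made-up rule: a failure on subtest 5 maps to class 1, otherwise class 0.
        return [1 if row[4] == 0 else 0 for row in rows]


response = handle_predict(
    '{"failed_subtests": [5, 6, 10]}',
    StubModel(),
    ["IC 850-1020-035", "IC 850-1020-030"],
)
print(response)
```

Keeping the vector-building step separate from the web framework means the same function can be reused whether the model is served from Flask, Django, or a script called by Laravel.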



In [26]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### ***a. Libraries***

In [22]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multiclass import OneVsOneClassifier
plt.style.use('default')

### ***b. Preprocessing***

In [24]:
dataset = pd.read_csv("/content/drive/MyDrive/Electronic Failure Prediction/CoDot2.csv")

# Variables
Features = dataset.drop(['ORDER', 'DEFECT PARTS'], axis=1)      # Feature Matrix / Independent Variable
Labels, Values = pd.factorize(dataset["DEFECT PARTS"])          # Target Variable / Dependent Variable
print(dataset)
print(Features)
print(Labels, Values)

        ORDER  TEST1  TEST2  TEST3  ...  TEST18  TEST19  TEST20     DEFECT PARTS
0   512335678      1      1      0  ...       1       1       1  IC 850-1020-035
1   512335679      1      1      0  ...       1       1       1  IC 850-1020-035
2   512335680      1      1      0  ...       1       1       1  IC 850-1020-031
3   512335681      1      1      0  ...       1       1       1  IC 850-1020-035
4   512335682      1      1      1  ...       1       1       1  IC 850-1020-033
..        ...    ...    ...    ...  ...     ...     ...     ...              ...
94  512335772      1      1      0  ...       1       1       0  IC 850-1020-031
95  512335773      1      1      0  ...       0       0       0  IC 850-1020-030
96  512335774      1      0      1  ...       1       1       0  IC 850-1020-033
97  512335775      1      1      0  ...       1       0       1  IC 850-1020-030
98  512335776      1      0      0  ...       1       0       0  IC 850-1020-030

[99 rows x 22 columns]
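
The Feature Engineering note in the introduction (raw measurements reduced to Pass/Fail, and target labels that must be spelled consistently) is not shown in the notebook, because the CSV already contains 0/1 values. A minimal sketch of what that step could look like follows. The subtest limits and raw readings are invented for illustration and are not taken from any real test specification; the helper names are hypothetical.

```python
# Hypothetical acceptance limits per subtest: (low, high), in the unit of that test.
LIMITS = {
    "TEST1": (4.75, 5.25),   # volts (assumed)
    "TEST2": (0.10, 0.50),   # amperes (assumed)
    "TEST3": (95.0, 105.0),  # ohms (assumed)
}


def to_pass_fail(measurements, limits=LIMITS):
    """Map each raw reading to 1 (Pass) if it lies inside its limits, else 0 (Fail)."""
    return {
        name: int(limits[name][0] <= value <= limits[name][1])
        for name, value in measurements.items()
    }


def normalize_label(label):
    """Collapse case and spacing differences so one part maps to one label."""
    return " ".join(label.split()).upper()


raw = {"TEST1": 5.02, "TEST2": 0.72, "TEST3": 101.3}
encoded = to_pass_fail(raw)
print(encoded)

labels = ["IC 850-1020-035", "ic 850-1020-035", " IC  850-1020-035 "]
unique_labels = {normalize_label(label) for label in labels}
print(unique_labels)
```

Normalizing labels before `pd.factorize` avoids the situation described in the introduction, where the same part typed in a different case becomes a separate class.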
    

### ***c. Model Development***

In [9]:
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(Features, Labels, test_size=0.33, random_state=42)

# Random Forest
rf0 = RandomForestClassifier()  # base model
rf0.fit(X_train, y_train)
# Best RS: 0.621978 using {'n_estimators': 300, 'max_features': 'sqrt', 'max_depth': 100, 'bootstrap': False}
# Best GS: 0.623077 using {'bootstrap': True, 'max_depth': 100, 'max_features': 'sqrt', 'n_estimators': 300}
rf = RandomForestClassifier(n_estimators=300,  max_depth=100, max_features='sqrt', bootstrap=True)  # optimized
rf.fit(X_train, y_train)
with open('rf.pkl', 'wb') as f:
    pickle.dump(rf, f)

# Gradient Boosting
gb0 = OneVsRestClassifier(GradientBoostingClassifier())
gb0.fit(X_train, y_train)
#Best RS: 0.561538 using {'n_estimators': 50, 'max_features': 'sqrt', 'max_depth': 5, 'learning_rate': 0.1}
#Best GS: 0.607692 using {'learning_rate': 0.1, 'max_depth': 5, 'max_features': 'sqrt', 'n_estimators': 50}
gb = OneVsRestClassifier(GradientBoostingClassifier(learning_rate=0.1, max_depth=5, max_features='sqrt', n_estimators=50))
gb.fit(X_train, y_train)
with open('gb.pkl', 'wb') as f:
    pickle.dump(gb, f)

# Logistic Regression
lr0 = OneVsRestClassifier(LogisticRegression())
lr0.fit(X_train, y_train)
# Best RS: 0.530769 using {'solver': 'newton-cg', 'penalty': 'l2', 'max_iter': 100, 'C': 5000}
# Best GS: 0.530769 using {'C': 7500, 'max_iter': 50, 'penalty': 'l2', 'solver': 'newton-cg'}
lr = OneVsRestClassifier(LogisticRegression(solver='newton-cg', penalty='l2', C=7500, max_iter=50))
lr.fit(X_train, y_train)
with open('lr.pkl', 'wb') as f:
    pickle.dump(lr, f)

# K Nearest Neighbour
# Best: 0.575556 using {'metric': 'euclidean', 'weights': 'uniform'}
knn = OneVsRestClassifier(KNeighborsClassifier(n_neighbors=1, metric='euclidean', weights='uniform'))
knn.fit(X_train, y_train)
with open('knn.pkl', 'wb') as f:
    pickle.dump(knn, f)

# SVC
# Best: 0.648148 using {'C': 1000, 'gamma': 'scale', 'kernel': 'rbf', 'probability': 'True'}
svc = OneVsRestClassifier(SVC(C=1000, gamma='scale', kernel='rbf', probability=True))
svc.fit(X_train, y_train)
with open('svc.pkl', 'wb') as f:
    pickle.dump(svc, f)

### ***d. Model Evaluation***

In [10]:
# RF0 (base model)
y_pred_rf0 = rf0.predict(X_test)
cm_rf0 = confusion_matrix(y_test, y_pred_rf0)
s_acc_rf0 = round(accuracy_score(y_test, y_pred_rf0) * 100, 1)
precision_rf0 = round(precision_score(y_test, y_pred_rf0, average='micro') * 100, 1)
recall_rf0 = round(recall_score(y_test, y_pred_rf0, average='micro') * 100, 1)
f1_rf0 = round(f1_score(y_test, y_pred_rf0, average='micro') * 100, 1)

# RF (optimized)
y_pred_rf = rf.predict(X_test)
cm_rf = confusion_matrix(y_test, y_pred_rf)
s_acc_rf = round(accuracy_score(y_test, y_pred_rf) * 100, 1)
precision_rf = round(precision_score(y_test, y_pred_rf, average='micro') * 100, 1)
recall_rf = round(recall_score(y_test, y_pred_rf, average='micro') * 100, 1)
f1_rf = round(f1_score(y_test, y_pred_rf, average='micro') * 100, 1)

# GB0
y_pred_gb0 = gb0.predict(X_test)
cm_gb0 = confusion_matrix(y_test, y_pred_gb0)
s_acc_gb0 = round(accuracy_score(y_test, y_pred_gb0) * 100, 1)
precision_gb0 = round(precision_score(y_test, y_pred_gb0, average='micro') * 100, 1)
recall_gb0 = round(recall_score(y_test, y_pred_gb0, average='micro') * 100, 1)
f1_gb0 = round(f1_score(y_test, y_pred_gb0, average='micro') * 100, 1)

# GB
y_pred_gb = gb.predict(X_test)
cm_gb = confusion_matrix(y_test, y_pred_gb)
s_acc_gb = round(accuracy_score(y_test, y_pred_gb) * 100, 1)
precision_gb = round(precision_score(y_test, y_pred_gb, average='micro') * 100, 1)
recall_gb = round(recall_score(y_test, y_pred_gb, average='micro') * 100, 1)
f1_gb = round(f1_score(y_test, y_pred_gb, average='micro') * 100, 1)

# LR0 (base model)
y_pred_lr0 = lr0.predict(X_test)
cm_lr0 = confusion_matrix(y_test, y_pred_lr0)
s_acc_lr0 = round(accuracy_score(y_test, y_pred_lr0) * 100, 1)
precision_lr0 = round(precision_score(y_test, y_pred_lr0, average='micro') * 100, 1)
recall_lr0 = round(recall_score(y_test, y_pred_lr0, average='micro') * 100, 1)
f1_lr0 = round(f1_score(y_test, y_pred_lr0, average='micro') * 100, 1)

# LR (optimized)
y_pred_lr = lr.predict(X_test)
cm_lr = confusion_matrix(y_test, y_pred_lr)
s_acc_lr = round(accuracy_score(y_test, y_pred_lr) * 100, 1)
precision_lr = round(precision_score(y_test, y_pred_lr, average='micro') * 100, 1)
recall_lr = round(recall_score(y_test, y_pred_lr, average='micro') * 100, 1)
f1_lr = round(f1_score(y_test, y_pred_lr, average='micro') * 100, 1)

# KNN
y_pred_knn = knn.predict(X_test)
m_acc_knn = round(knn.score(X_train, y_train) * 100, 1)
cm_knn = confusion_matrix(y_test, y_pred_knn)
s_acc_knn = round(accuracy_score(y_test, y_pred_knn) * 100, 1)
precision_knn = round(precision_score(y_test, y_pred_knn, average='micro') * 100, 1)
recall_knn = round(recall_score(y_test, y_pred_knn, average='micro') * 100, 1)
f1_knn = round(f1_score(y_test, y_pred_knn, average='micro') * 100, 1)

# SVC
y_pred_svc = svc.predict(X_test)
cm_svc = confusion_matrix(y_test, y_pred_svc)
s_acc_svc = round(accuracy_score(y_test, y_pred_svc) * 100, 1)
precision_svc = round(precision_score(y_test, y_pred_svc, average='micro') * 100, 1)
recall_svc = round(recall_score(y_test, y_pred_svc, average='micro') * 100, 1)
f1_svc = round(f1_score(y_test, y_pred_svc, average='micro') * 100, 1)


**Confusion matrices**

In [11]:
print('Confusion Matrix of RF0')
print(cm_rf0)
print(s_acc_rf0)
print(precision_rf0)
print(recall_rf0)
print('Confusion Matrix of RF')
print(cm_rf)
print(s_acc_rf)
print(precision_rf)
print(recall_rf)
print('--------------------------------')
print('Confusion Matrix of GB0')
print(cm_gb0)
print(s_acc_gb0)
print(precision_gb0)
print(recall_gb0)
print('Confusion Matrix of GB')
print(cm_gb)
print(s_acc_gb)
print(precision_gb)
print(recall_gb)
print('--------------------------------')
print('Confusion Matrix of LR0')
print(cm_lr0)
print(s_acc_lr0)
print(precision_lr0)
print(recall_lr0)
print('Confusion Matrix of LR')
print(cm_lr)
print(s_acc_lr)
print(precision_lr)
print(recall_lr)
print('--------------------------------')
print('Confusion Matrix of KNN')
print(cm_knn)
print('--------------------------------')
print('Confusion Matrix of SVC')
print(cm_svc)
print('--------------------------------')


Confusion Matrix of RF0
[[3 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 2 0 0 0 0 0 0 0 0 0]
 [0 0 0 4 0 0 0 0 0 0 0 0]
 [0 0 0 0 6 0 0 0 3 0 0 0]
 [0 0 0 0 1 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 2 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 0 1 2 0 0 0 2 0 0 0]
 [0 0 0 1 1 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1]]
69.7
69.7
69.7
Confusion Matrix of RF
[[3 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 2 0 0 0 0 0 0 0 0 0]
 [0 0 0 4 0 0 0 0 0 0 0 0]
 [1 0 0 0 7 0 0 0 1 0 0 0]
 [0 0 0 0 1 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 2 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 1 1 0 0 0 3 0 0 0]
 [0 0 0 1 1 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1]]
72.7
72.7
72.7
--------------------------------
Confusion Matrix of GB0
[[3 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 2 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 4 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 6 0 0 0 0 0 0 2 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 2 0 0 0 0 0 0]
 [0 0 0 0 0 0
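
The three identical numbers printed under each matrix are expected: for single-label multi-class predictions, micro-averaged precision and recall both reduce to correct predictions divided by all predictions, which is the accuracy. As an illustration, the sketch below recomputes this from the RF matrix printed above using plain Python (no scikit-learn), and also computes macro-averaged recall, which weights every class equally and so tells a different story. The variable names are illustrative.

```python
# Confusion matrix of the tuned Random Forest, copied from the output above
# (rows = true class, columns = predicted class).
cm_rf = [
    [3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 7, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 3, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]

n = len(cm_rf)
total = sum(sum(row) for row in cm_rf)
true_pos = [cm_rf[i][i] for i in range(n)]                         # diagonal
row_sums = [sum(cm_rf[i]) for i in range(n)]                       # TP + FN per class
col_sums = [sum(cm_rf[r][c] for r in range(n)) for c in range(n)]  # TP + FP per class

accuracy = sum(true_pos) / total
micro_precision = sum(true_pos) / sum(col_sums)
micro_recall = sum(true_pos) / sum(row_sums)
# Every class has at least one true sample in this matrix, so no division by zero.
macro_recall = sum(tp / rs for tp, rs in zip(true_pos, row_sums)) / n

for name, value in [("accuracy", accuracy), ("micro precision", micro_precision),
                    ("micro recall", micro_recall), ("macro recall", macro_recall)]:
    print(f"{name}: {round(value * 100, 1)}")
```

With classes this imbalanced (several have a single test sample), reporting a macro average alongside accuracy shows how the rarely seen parts are handled.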

In [25]:
import numpy as np
Input = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0]

# invert back the label numbers into label values name (parts)
parts_svc = Values[svc.predict([Input])[0]]
parts_lr = Values[lr.predict([Input])[0]]
parts_rf = Values[rf.predict([Input])[0]]
parts_knn = Values[knn.predict([Input])[0]]
parts_gb = Values[gb.predict([Input])[0]]

# predict probabilities for ROC / AUC score
prob_svc = svc.predict_proba(X_test)
prob_lr = lr.predict_proba(X_test)
prob_rf = rf.predict_proba(X_test)
prob_knn = knn.predict_proba(X_test)
prob_gb = gb.predict_proba(X_test)

# roc curve for classes
fpr1, fpr2, fpr4, fpr5, fpr6 = {}, {}, {}, {}, {}
tpr1, tpr2, tpr4, tpr5, tpr6 = {}, {}, {}, {}, {}
thr1, thr2, thr4, thr5, thr6 = {}, {}, {}, {}, {}
n_class = 3  # ROC curves are computed only for the first 3 encoded classes

for i in range(n_class):
    fpr1[i], tpr1[i], thr1[i] = roc_curve(y_test, prob_knn[:, i], pos_label=i)
    fpr2[i], tpr2[i], thr2[i] = roc_curve(y_test, prob_rf[:, i], pos_label=i)
    fpr4[i], tpr4[i], thr4[i] = roc_curve(y_test, prob_gb[:, i], pos_label=i)
    fpr5[i], tpr5[i], thr5[i] = roc_curve(y_test, prob_lr[:, i], pos_label=i)
    fpr6[i], tpr6[i], thr6[i] = roc_curve(y_test, prob_svc[:, i], pos_label=i)
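
The loop above builds one-vs-rest ROC curves for the first `n_class` encoded classes only, and `roc_auc_score` is imported but never called. As a hedged illustration of what a per-class one-vs-rest AUC measures, the sketch below computes it in plain Python as the probability that a positive sample is scored above a negative one, with ties counting half. The function name and toy data are invented; in the notebook the inputs would be `y_test` and one column of a `predict_proba` result. The function returns `None` when a class has no positive or no negative samples, which can happen for rare parts in a 33-row test set.

```python
def one_vs_rest_auc(y_true, scores, positive):
    """AUC for `positive` vs. all other classes, via pairwise score comparison."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    if not pos or not neg:
        return None  # AUC is undefined without both positives and negatives
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (invented): true encoded labels and the predicted probability of class 1.
y_true = [0, 0, 1, 1]
scores_class_1 = [0.1, 0.4, 0.35, 0.8]

auc_class_1 = one_vs_rest_auc(y_true, scores_class_1, positive=1)
auc_missing = one_vs_rest_auc(y_true, scores_class_1, positive=7)  # class not present
print(auc_class_1, auc_missing)
```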

### ***e. Hyperparameter Tuning***

In [12]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
params  = { 'n_estimators': [100, 200, 300], 
            'max_features': ['auto', 'sqrt', 'log2'],
            'max_depth': [10, 100, 200],
            'bootstrap': [True, False]}
rs = RandomizedSearchCV(estimator=RandomForestClassifier(), param_distributions=params, n_jobs=-1, scoring='accuracy', error_score=0)
gs = GridSearchCV(estimator=RandomForestClassifier(), param_grid=params, n_jobs=-1, scoring='accuracy', error_score=0)
rs_result = rs.fit(X_train, y_train)
gs_result = gs.fit(X_train, y_train)
print("Best RS: %f using %s" % (rs_result.best_score_, rs_result.best_params_))
print("Best GS: %f using %s" % (gs_result.best_score_, gs_result.best_params_))



Best RS: 0.621978 using {'n_estimators': 300, 'max_features': 'sqrt', 'max_depth': 200, 'bootstrap': True}
Best GS: 0.637363 using {'bootstrap': False, 'max_depth': 200, 'max_features': 'auto', 'n_estimators': 200}
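
The two searches differ in how much of the grid they visit. As an illustration, the sketch below enumerates the Random Forest grid from the cell above with `itertools.product`. `GridSearchCV` tries every combination, while `RandomizedSearchCV` samples `n_iter` of them (10 unless set otherwise), which is one reason the two can report different best parameters. Neither the searches nor the forests are seeded in the cell above, so results can also vary between runs. The fit counts assume scikit-learn's default 5-fold cross-validation.

```python
from itertools import product

# Same grid as in the Random Forest tuning cell.
params = {
    "n_estimators": [100, 200, 300],
    "max_features": ["auto", "sqrt", "log2"],
    "max_depth": [10, 100, 200],
    "bootstrap": [True, False],
}

# Every combination GridSearchCV would evaluate.
combinations = [dict(zip(params, values)) for values in product(*params.values())]

n_grid = len(combinations)
n_random = min(10, n_grid)  # RandomizedSearchCV default n_iter=10
cv_folds = 5                # assumed default cv

print("grid candidates:", n_grid, "fits:", n_grid * cv_folds)
print("random candidates:", n_random, "fits:", n_random * cv_folds)
```

Passing `random_state` to both the search and the estimator would make repeated runs comparable.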


In [13]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
params  = { 'n_estimators': [10, 25, 50, 100], 
            'max_features': ['auto', 'sqrt'],
            'max_depth': [3, 5, 10],
            'learning_rate': [0.01, 0.1]}
rs = RandomizedSearchCV(estimator=GradientBoostingClassifier(), param_distributions=params, n_jobs=-1, scoring='accuracy', error_score=0)
gs = GridSearchCV(estimator=GradientBoostingClassifier(), param_grid=params, n_jobs=-1, scoring='accuracy', error_score=0)
rs_result = rs.fit(X_train, y_train)
gs_result = gs.fit(X_train, y_train)
print("Best RS: %f using %s" % (rs_result.best_score_, rs_result.best_params_))
print("Best GS: %f using %s" % (gs_result.best_score_, gs_result.best_params_))



Best RS: 0.546154 using {'n_estimators': 100, 'max_features': 'sqrt', 'max_depth': 3, 'learning_rate': 0.1}
Best GS: 0.637363 using {'learning_rate': 0.1, 'max_depth': 5, 'max_features': 'sqrt', 'n_estimators': 50}


In [14]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.linear_model import LogisticRegression
params   = { 'solver': ['newton-cg', 'liblinear'],
            'penalty': ['l1', 'l2'],     
            'C': [5000, 3000, 1000],
            'max_iter': [50, 100, 200]}
rs = RandomizedSearchCV(estimator=LogisticRegression(), param_distributions=params, n_jobs=-1, scoring='accuracy', error_score=0)
gs = GridSearchCV(estimator=LogisticRegression(), param_grid=params, n_jobs=-1, scoring='accuracy', error_score=0)
rs_result = rs.fit(X_train, y_train)
gs_result = gs.fit(X_train, y_train)
print("Best RS: %f using %s" % (rs_result.best_score_, rs_result.best_params_))
print("Best GS: %f using %s" % (gs_result.best_score_, gs_result.best_params_))



Best RS: 0.530769 using {'solver': 'newton-cg', 'penalty': 'l2', 'max_iter': 200, 'C': 3000}
Best GS: 0.530769 using {'C': 5000, 'max_iter': 50, 'penalty': 'l2', 'solver': 'newton-cg'}
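
Part of this grid cannot be fitted: in scikit-learn the `newton-cg` solver supports the `l2` penalty but not `l1`, while `liblinear` accepts both. Because the searches pass `error_score=0`, the unsupported combinations are scored as 0 instead of stopping the search, so they can never be selected. The sketch below counts them for the grid above. The `SUPPORTED` table is written out by hand for this illustration and covers only the two solvers and two penalties used here.

```python
from itertools import product

# Same grid as in the Logistic Regression tuning cell.
params = {
    "solver": ["newton-cg", "liblinear"],
    "penalty": ["l1", "l2"],
    "C": [5000, 3000, 1000],
    "max_iter": [50, 100, 200],
}

# Penalties each solver accepts, restricted to the values in this grid
# (hand-written assumption of this sketch, based on the scikit-learn docs).
SUPPORTED = {"newton-cg": {"l2"}, "liblinear": {"l1", "l2"}}

grid = [dict(zip(params, values)) for values in product(*params.values())]
fittable = [p for p in grid if p["penalty"] in SUPPORTED[p["solver"]]]

print("candidates:", len(grid))
print("fittable:", len(fittable))
print("scored as 0 by error_score=0:", len(grid) - len(fittable))
```

Splitting the grid into one dictionary per solver (both searches accept a list of dictionaries) would avoid spending fits on combinations that cannot succeed.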


### ***f. Summary Output***

In [21]:
results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'Gradient Boosting'],
    'Accuracy': [s_acc_lr, s_acc_rf, s_acc_gb],
    'Faulty Parts': [parts_lr, parts_rf, parts_gb]
    })

output = results.sort_values(by='Accuracy', ascending=False)
output = output.reset_index(drop=True)
print(output)

                 Model  Accuracy     Faulty Parts
0        Random Forest      72.7  IC 850-1020-030
1    Gradient Boosting      72.7  IC 850-1020-030
2  Logistic Regression      69.7  IC 850-1020-030
