<a href="https://colab.research.google.com/github/fridaruh/haleytek_workshop/blob/master/XGBoost_Predictive_Maintenance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using an XGBoost Algorithm for Predictive Maintenance

One approach is to use XGBoost, a supervised learning algorithm, to build a model from the collected data. The steps are as follows:

1.  **Data Collection:** Collect data from the machine's sensors in various situations, both normal operation and failure. Make sure you have enough data in both categories to train your model.
2.  **Data Preprocessing:** Clean and prepare your data for training. This could include removing outliers, normalizing the data, and splitting the data into a training set and a test set.
3.  **Model Training:** Use an XGBoost algorithm to train your model on the training set.
4.  **Model Testing:** Evaluate your model on the test set to see how well it predicts failures from sensor data.
5.  **Analysis:** Once your model is trained and tested, you can use it to analyze real-time data from the machine's sensors and predict possible failures.
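Step 2 can be sketched as follows. This is a minimal illustration on synthetic readings, not the preprocessing used later in this notebook (the simulated data below is used as-is). The array layout, the 1.5 × IQR outlier rule, and the 80/20 split are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical readings: 1,000 samples from 3 sensors, with a few injected spikes
readings = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
readings[:5] *= 10  # artificial outliers

# Keep only rows whose values all lie within 1.5 * IQR of the quartiles
q1, q3 = np.percentile(readings, [25, 75], axis=0)
iqr = q3 - q1
in_range = ((readings >= q1 - 1.5 * iqr) & (readings <= q3 + 1.5 * iqr)).all(axis=1)
clean = readings[in_range]

# Split first, then standardize with statistics from the training part only,
# so that no information from the test part leaks into training
split = int(0.8 * len(clean))
train, test = clean[:split], clean[split:]
mean, std = train.mean(axis=0), train.std(axis=0)
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(f"Kept {len(clean)} of {len(readings)} rows")
```

Tree-based models such as XGBoost are insensitive to monotonic rescaling of the features, so the standardization step matters mainly if you also try linear boosters or other model families.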



**Data Simulation**

1.  We fix the random seed and define the number of data points to simulate (`num_data_points`).

2.  We generate readings for three sensors. Temperature and vibration are drawn from a uniform distribution over 0 to 100, and pressure from a uniform distribution over 0 to 1000.

3.  We combine the readings into a DataFrame `data`, with one column per sensor.

4.  We create a 'failure' column initialized to 0, then set it to 1 wherever any sensor exceeds its threshold (85 for temperature and vibration, 900 for pressure).

5.  We add label noise by flipping each label with probability 0.1, so the labels are not a perfect function of the sensor readings.

Finally, we save the generated data to a CSV file.

## Simulating Data

In [24]:
import pandas as pd
import numpy as np

In [25]:
np.random.seed(0)

In [26]:
# Define how many data points you want
num_data_points = 10000

In [27]:
# Generate random sensor data for temperature (range 0-100), vibration (range 0-100), and pressure (range 0-1000)
temperature = np.random.uniform(low=0.0, high=100.0, size=num_data_points)
vibration = np.random.uniform(low=0.0, high=100.0, size=num_data_points)
pressure = np.random.uniform(low=0.0, high=1000.0, size=num_data_points)



In [28]:
# Combine sensor data into a DataFrame
data = pd.DataFrame({
    'temperature': temperature,
    'vibration': vibration,
    'pressure': pressure,
})

In [29]:
# Generate "failure" labels based on whether sensor readings exceed a threshold
thresholds = {'temperature': 85, 'vibration': 85, 'pressure': 900}


In [30]:
# Start with no failures
data['failure'] = 0

# Mark as failure if any sensor exceeds its threshold
for sensor, threshold in thresholds.items():
    data['failure'] |= (data[sensor] > threshold).astype(int)



In [31]:
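# Add noise to the labels: flip each label with probability prob_noise.
# (Drawing the noise with np.random.randint(0, 2, num_data_points) instead
# would flip about half of the labels and destroy the signal.)
prob_noise = 0.1
noise = (np.random.uniform(size=num_data_points) < prob_noise).astype(int)

# XOR with the 0/1 mask flips exactly the labels where noise == 1
data['failure'] = data['failure'] ^ noise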
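To see why XOR with a random 0/1 mask acts as label noise, here is a small standalone sketch. It uses its own generator and a made-up label array (both assumptions for illustration), so it does not touch `data`.

```python
import numpy as np

rng = np.random.default_rng(0)
prob_noise = 0.1

# Tiny made-up label vector to inspect by eye
labels = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])
flip = (rng.uniform(size=labels.size) < prob_noise).astype(int)
noisy = labels ^ flip  # 0^1 -> 1 and 1^1 -> 0 where flip is 1; unchanged where flip is 0

print("labels:", labels)
print("flip:  ", flip)
print("noisy: ", noisy)

# On a large sample the fraction of flipped labels approaches prob_noise
big_flip = rng.uniform(size=100_000) < prob_noise
flip_rate = big_flip.mean()
print(f"Observed flip rate: {flip_rate:.3f}")
```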

In [32]:
data.sample(10)

| | temperature | vibration | pressure | failure |
|---|---|---|---|---|
| 4199 | 32.575444 | 84.268226 | 482.663307 | 0 |
| 2263 | 51.655594 | 16.275174 | 729.221214 | 0 |
| 6739 | 18.87196 | 44.926402 | 801.185549 | 0 |
| 3861 | 43.110829 | 87.769967 | 89.47816 | 1 |
| 3068 | 74.757636 | 90.251951 | 573.113657 | 1 |
| 2670 | 50.827138 | 68.433189 | 306.576277 | 1 |
| 3118 | 82.528219 | 59.916378 | 589.620237 | 0 |
| 2389 | 47.242885 | 4.196278 | 664.213491 | 0 |
| 4798 | 23.064852 | 46.80787 | 832.453321 | 0 |
| 879 | 73.174419 | 55.531471 | 466.429239 | 0 |


In [33]:
data['failure'].value_counts(normalize=True)

0    0.6214
1    0.3786
Name: failure, dtype: float64
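The class balance above can be checked by hand from the simulation settings. The short calculation below assumes only what the cells above define: independent uniform sensors, the three thresholds, and a flip probability of 0.1.

```python
# Probability that each uniform sensor exceeds its threshold
p_temp = 1 - 85 / 100     # temperature > 85 on [0, 100]
p_vib = 1 - 85 / 100      # vibration > 85 on [0, 100]
p_pres = 1 - 900 / 1000   # pressure > 900 on [0, 1000]

# Probability that at least one sensor exceeds its threshold (before noise)
p_clean = 1 - (1 - p_temp) * (1 - p_vib) * (1 - p_pres)

# After flipping each label with probability prob_noise:
# failures that stay failures, plus non-failures that get flipped
prob_noise = 0.1
p_noisy = p_clean * (1 - prob_noise) + (1 - p_clean) * prob_noise

print(f"Failure rate before noise: {p_clean:.5f}")
print(f"Failure rate after noise:  {p_noisy:.4f}")
```

The expected failure rate after noise is about 0.38, which agrees with the observed 0.3786 up to sampling variation.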

In [34]:
# Save the DataFrame to a CSV file
data.to_csv('data.csv', index=False)

## Prediction

In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [36]:
data = pd.read_csv('data.csv')

Train/test split

In [37]:
X = data.drop('failure', axis=1)
y = data['failure']

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
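The split above is unseeded, so the metrics reported later change from run to run. As a hedged sketch, passing `random_state` makes the split reproducible and `stratify` keeps the failure rate equal in both parts. The demo arrays and the seed value 42 are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Made-up stand-ins for X and y: 1,000 rows, 3 features, roughly 38% positives
X_demo = rng.uniform(0.0, 100.0, size=(1000, 3))
y_demo = (rng.uniform(size=1000) < 0.38).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,   # fixed seed: same split on every run
    stratify=y_demo,   # preserve the class proportions in both parts
)

print(f"Train failure rate: {y_tr.mean():.3f}")
print(f"Test failure rate:  {y_te.mean():.3f}")
```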

## Training the model

In [39]:
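# A random forest is fitted here but never evaluated;
# `model` is overwritten by the XGBoost booster further down.
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)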

In [40]:
!pip install xgboost



In [41]:
import xgboost as xgb

In [42]:
D_train = xgb.DMatrix(X_train, label=y_train)
D_test = xgb.DMatrix(X_test, label=y_test)

### Defining XGBoost model

In [43]:
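param = {
    'eta': 0.3,                     # learning rate
    'max_depth': 3,                 # maximum depth of each tree
    'objective': 'multi:softprob',  # output one probability per class
    'num_class': 2,                 # 'failure' is binary: 0 or 1
}

steps = 20  # The number of training iterations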

In [44]:
model = xgb.train(param, D_train, steps)

In [45]:
import numpy as np
from sklearn.metrics import precision_score, recall_score, accuracy_score

In [46]:
preds = model.predict(D_test)          # one row of class probabilities per sample
best_preds = np.argmax(preds, axis=1)  # pick the most probable class

print("Precision = {}".format(precision_score(y_test, best_preds, average='macro')))
print("Recall = {}".format(recall_score(y_test, best_preds, average='macro')))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

Precision = 0.903725387921172
Recall = 0.8946537680366691
Accuracy = 0.9055
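An accuracy near 0.90 is close to the best that can be expected on this data, because about 10% of the labels were flipped at random during simulation. The sketch below re-simulates the data with its own generator (so exact values differ from the notebook's) and scores the true threshold rule against the noisy labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

temperature = rng.uniform(0.0, 100.0, size=n)
vibration = rng.uniform(0.0, 100.0, size=n)
pressure = rng.uniform(0.0, 1000.0, size=n)

# Noise-free labels: the threshold rule used to simulate failures
clean = ((temperature > 85) | (vibration > 85) | (pressure > 900)).astype(int)

# Flip each label with probability 0.1, as in the simulation
prob_noise = 0.1
noise = (rng.uniform(size=n) < prob_noise).astype(int)
noisy = clean ^ noise

# Even a model that recovers the rule exactly disagrees with every flipped label
oracle_accuracy = (clean == noisy).mean()
print(f"Accuracy of the true rule on noisy labels: {oracle_accuracy:.3f}")
```

A score slightly above 0.90 on one test split is compatible with this ceiling, since the flipped fraction in any finite sample varies around 10%.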


### Further Exploration with XGBoost

That just about sums up the basics of XGBoost, but a few more features will help you get the most out of your models.

- The `gamma` parameter helps control overfitting. It specifies the minimum reduction in the loss required to make a further partition on a leaf node of the tree. That is, if a split doesn't reduce the loss by at least that amount, it is not made.

- The `booster` parameter sets the type of model used when building the ensemble. The default is `gbtree`, which builds an ensemble of decision trees. If your data isn't too complicated, you can use the faster and simpler `gblinear` option, which builds an ensemble of linear models.

- Setting the optimal hyperparameters of any ML model can be a challenge, so why not let scikit-learn do it for you? We can combine scikit-learn's grid search with an XGBoost classifier quite easily:

In [None]:
from sklearn.model_selection import GridSearchCV

clf = xgb.XGBClassifier()
parameters = {
     "eta"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
     "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
     "min_child_weight" : [ 1, 3, 5, 7 ],
     "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
     "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
     }

grid = GridSearchCV(clf,
                    parameters, n_jobs=4,
                    scoring="neg_log_loss",
                    cv=3)

grid.fit(X_train, y_train)
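Before running the search, it is worth checking how large the grid is. The sketch below counts the combinations with scikit-learn's `ParameterGrid` and draws a random subset with `ParameterSampler`. The subset size of 25 and the seed are arbitrary choices for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same grid as in the cell above
parameters = {
    "eta": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}

n_combos = len(ParameterGrid(parameters))  # 6 * 8 * 4 * 5 * 4
n_fits = n_combos * 3                      # each combination is fitted once per fold (cv=3)
print(f"{n_combos} combinations, {n_fits} model fits")

# A random subset of the grid is usually much cheaper to evaluate
sampled = list(ParameterSampler(parameters, n_iter=25, random_state=0))
print(sampled[0])
```

If the full grid is too slow, `RandomizedSearchCV` accepts the same estimator and parameter dictionary plus an `n_iter` argument, and evaluates only that many sampled combinations.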