First, import the necessary libraries and read the input data. We're going to explore the data a bit to understand it better.


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#remove the date columns
# we can't really use them for anything as pertains to predicting the outcome
# at least that's what I think
x = pd.read_csv('./traininginputs.csv')
x.drop(columns=['PROC_TRACEINFO'],inplace=True)
y = pd.read_csv('./trainingoutput.csv')
y.drop(columns=['PROC_TRACEINFO'],inplace=True)

# we need to convert the Binar OP130_Resultat_Global_v to a boolean
y['Binar OP130_Resultat_Global_v'] = y['Binar OP130_Resultat_Global_v'].astype(
    'bool')

# we're just creating this to explore the data; x and y are what will actually be used
training = pd.concat([x, y['Binar OP130_Resultat_Global_v']], axis=1)

# training.describe(include='all')

Using the df.corr() function, we can see how strongly each column is linearly correlated with the target.

In [None]:

training.corr(numeric_only=True)['Binar OP130_Resultat_Global_v'].sort_values(ascending=False)


No column has a significantly high correlation with the target, so a linear model is unlikely to do well on this one.
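
To see why a low value from df.corr() doesn't mean a column is useless, here is a minimal sketch on made-up data. The `reading` and `label` columns are hypothetical and are not from the training set. The label depends only on how far the reading is from zero, so the linear correlation comes out at roughly zero even though the relationship is perfectly predictable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT from the training set): the label depends only
# on the magnitude of the reading, not on its sign.
reading = np.linspace(-1, 1, 201)
label = (reading ** 2 > 0.25).astype(int)
toy = pd.DataFrame({'reading': reading, 'label': label})

# Pearson correlation only measures linear association.
pearson = toy['reading'].corr(toy['label'])
# After a non-linear transform (absolute value) the relationship shows up.
pearson_abs = toy['reading'].abs().corr(toy['label'])

print(f'corr(reading, label)   = {pearson:.3f}')
print(f'corr(|reading|, label) = {pearson_abs:.3f}')
```

A tree-based model can pick up this kind of threshold pattern without the transform, which is the motivation for moving away from linear models here.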

Since there is little linear correlation, we will use a non-linear model to predict the target: a Random Forest Classifier. It is less prone to overfitting than a single decision tree, and in recent scikit-learn versions (1.4 and later) it can handle the missing values we have.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
from numpy import ravel
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve, accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42, shuffle=True)
# tweak these variables to get a better model. Good luck!
# you can also look up GridSearchCV to automate this process, ask ChatGPT
rf = RandomForestClassifier(class_weight='balanced', random_state=42, n_estimators=100,
                            min_samples_leaf=0.051, max_features=0.76)
rf.fit(X_train, ravel(y_train))
y_pred = rf.predict(X_test)

# Evaluate Model
# Get probability scores for the positive class
y_probs = rf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_probs)
rf_auc = auc(fpr, tpr)  # Compute AUC Score

# should be 0.685, try for 0.7
print(f'Random Forest Classifier AUC: {rf_auc}')
# not so important here, but it's 0.742
print(f'Random Forest Classifier Accuracy: {accuracy_score(y_test, y_pred)}')

You should also use precision and recall to evaluate the model, since the classes are imbalanced (there are hardly any NOK parts). That's why we set class_weight to 'balanced': it makes the model pay more attention to the NOK parts.
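
A precision/recall check could look roughly like the sketch below. It runs on synthetic data from make_classification as a stand-in for the real CSVs (an assumption, so the numbers mean nothing for this problem), and it assumes the rare class is the positive one. All the names here (`X_demo`, `clf`, `probs`, ...) are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, classification_report,
                             precision_recall_curve)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): about 5% positives.
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

clf = RandomForestClassifier(class_weight='balanced', n_estimators=50,
                             random_state=42)
clf.fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]  # scores for the positive (rare) class

# Precision and recall at every score threshold, plus a one-number summary.
precision, recall, pr_thresholds = precision_recall_curve(y_te, probs)
avg_precision = average_precision_score(y_te, probs)
print(f'Average precision: {avg_precision:.3f}')

# Per-class precision, recall and F1 at the default 0.5 threshold.
print(classification_report(y_te, clf.predict(X_te), zero_division=0))
```

With the real data you would swap in X_test, y_test and y_probs from the cell above. Average precision is usually more informative than accuracy when one class is rare, because predicting "OK" for everything already gives a high accuracy.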

## Discussion about the Data
The company makes automotive parts and after fabrication, some readings are taken from each part. Those are the columns in the input data, aside from PROC_ID which is the unique ID for each part. At the end of this process, the part is classified as either "OK" or "NOK".

We want to build a model that can predict this classification based on the readings taken from the part without needing that final inspection.

Since we used a Random Forest Classifier, we can also see which features are the most important for the model. This can give us some insights into what makes a part NOK. We can use this information to improve the fabrication process and reduce the number of NOK parts. (I have no idea how this would be done, don't ask me).
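
Reading the importances off a fitted forest could look like this. It's a sketch on synthetic data with hypothetical column names (`feature_0`, `feature_1`, ...), not the real readings, so only the pattern carries over.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real readings (assumption).
X_arr, y_arr = make_classification(n_samples=1000, n_features=6,
                                   n_informative=3, random_state=42)
feature_names = [f'feature_{i}' for i in range(X_arr.shape[1])]  # hypothetical names
X_df = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_df, y_arr)

# Impurity-based importances: one value per column, summing to 1.
importances = pd.Series(forest.feature_importances_,
                        index=X_df.columns).sort_values(ascending=False)
print(importances)
```

For the model above, the equivalent would be pairing rf.feature_importances_ with X_train.columns. Impurity-based importances can be misleading for correlated or high-cardinality columns, so sklearn.inspection.permutation_importance on the test set is worth a look as a cross-check.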

## Ways to improve accuracy and efficiency
- We could try other models like XGBoost or LightGBM
- We could try to balance the classes in the data
- We could try to use some feature engineering to create new features (I was thinking about extracting dates from the PROC_ID, but I don't know if that would be useful)
- We could try to use some feature selection to remove unimportant features (I don't think this is necessary since the model is already fast)
- We could try to use some hyperparameter tuning to improve the model (we *should* do this, but I'm lazy)
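
For the hyperparameter tuning point, a GridSearchCV run might look like the sketch below. The data is synthetic and the grid values are guesses picked around the settings used earlier, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data (assumption): about 10% positives.
X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     weights=[0.9, 0.1], random_state=42)

# Candidate values are illustrative guesses, not recommendations.
param_grid = {
    'min_samples_leaf': [0.01, 0.05, 0.1],
    'max_features': [0.5, 0.76, 1.0],
}

search = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', n_estimators=30,
                           random_state=42),
    param_grid,
    scoring='roc_auc',  # same metric as the AUC reported above
    cv=3,               # 3-fold cross-validation
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f'Best cross-validated AUC: {search.best_score_:.3f}')
```

On the real data you would fit the search on X_train and y_train only and keep the test split for the final check, otherwise the reported score ends up too optimistic.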
