# Model Training Notebook

This notebook sets up the data, cleans it, and trains the repair item prediction model.

Example rows from `dataset.csv`:
```CSV
"ItemID","UUID","Description","ItemDescription"
"9","8f14e45fceea167a5a36dedd4bea2543","CHANGE OIL & FILTER,TOP OFF FLUIDS","CHANGE OIL & FILTER,TOP OFF FLUIDS"
"20","8f14e45fceea167a5a36dedd4bea2543","CHANGE OIL & FILTER,TOP OFF FLUIDS","R & R FRONT STRUT ASSY (ONE)"
"31","8f14e45fceea167a5a36dedd4bea2543","CHANGE OIL & FILTER,TOP OFF FLUIDS","4 WHEEL ALIGNMENT- SUBLET"
```

In [2]:
# %pip install scikit-learn
# %pip install scikit-multilearn
# %pip install pandas
# %pip install numpy
# %pip install scipy
# %pip install xgboost

## Data Setup

In [3]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_csv('dataset.csv', header = 0)

In [5]:
df.head()

Unnamed: 0,ItemID,UUID,Description,ItemDescription
0,9,8f14e45fceea167a5a36dedd4bea2543,"CHANGE OIL & FILTER,TOP OFF FLUIDS","CHANGE OIL & FILTER,TOP OFF FLUIDS"
1,20,8f14e45fceea167a5a36dedd4bea2543,"CHANGE OIL & FILTER,TOP OFF FLUIDS",R & R FRONT STRUT ASSY (ONE)
2,31,8f14e45fceea167a5a36dedd4bea2543,"CHANGE OIL & FILTER,TOP OFF FLUIDS",4 WHEEL ALIGNMENT- SUBLET
3,29,6512bd43d9caa6e02c990b0a82652dca,REPAIR (PLUG) TIRE ON CAR,REPAIR (PLUG) TIRE ON CAR
4,53,6f4922f45568161a8cdf4ad2299f6d23,"CHANGE OIL & FILTER,TOP OFF FLUIDS","CHANGE OIL & FILTER,TOP OFF FLUIDS"


In [6]:
df.shape

(251517, 4)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251517 entries, 0 to 251516
Data columns (total 4 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   ItemID           251517 non-null  int64 
 1   UUID             251517 non-null  object
 2   Description      251515 non-null  object
 3   ItemDescription  251465 non-null  object
dtypes: int64(1), object(3)
memory usage: 7.7+ MB


In [8]:
df.isnull().any()

ItemID             False
UUID               False
Description         True
ItemDescription     True
dtype: bool

---

## Data Cleanup

In [9]:
# Drop rows that are missing a value in a critical column
df = df.dropna(subset=['Description', 'ItemDescription'])

df.isnull().any()

ItemID             False
UUID               False
Description        False
ItemDescription    False
dtype: bool

In [10]:
# Drop unused reference column
df = df.drop(columns=['ItemID'])

In [11]:
# Enforce consistent casing
df['Description'] = df['Description'].str.lower()
df['ItemDescription'] = df['ItemDescription'].str.lower()

df.head()

Unnamed: 0,UUID,Description,ItemDescription
0,8f14e45fceea167a5a36dedd4bea2543,"change oil & filter,top off fluids","change oil & filter,top off fluids"
1,8f14e45fceea167a5a36dedd4bea2543,"change oil & filter,top off fluids",r & r front strut assy (one)
2,8f14e45fceea167a5a36dedd4bea2543,"change oil & filter,top off fluids",4 wheel alignment- sublet
3,6512bd43d9caa6e02c990b0a82652dca,repair (plug) tire on car,repair (plug) tire on car
4,6f4922f45568161a8cdf4ad2299f6d23,"change oil & filter,top off fluids","change oil & filter,top off fluids"


In [12]:
# Remove anything in parentheses, brackets, or braces
df['Description'] = df['Description'].str.replace(r'\(.*\)', '', regex=True)
df['Description'] = df['Description'].str.replace(r'\[.*\]', '', regex=True)
df['Description'] = df['Description'].str.replace(r'\{.*\}', '', regex=True)

# Remove specific values that have no relation to the prediction
df["Description"] = df["Description"].str.replace("notes", "", regex=False)

# Remove trailing whitespace or commas
df["Description"] = df["Description"].str.rstrip(", ")

df.head()

Unnamed: 0,UUID,Description,ItemDescription
0,8f14e45fceea167a5a36dedd4bea2543,"change oil & filter,top off fluids","change oil & filter,top off fluids"
1,8f14e45fceea167a5a36dedd4bea2543,"change oil & filter,top off fluids",r & r front strut assy (one)
2,8f14e45fceea167a5a36dedd4bea2543,"change oil & filter,top off fluids",4 wheel alignment- sublet
3,6512bd43d9caa6e02c990b0a82652dca,repair tire on car,repair (plug) tire on car
4,6f4922f45568161a8cdf4ad2299f6d23,"change oil & filter,top off fluids","change oil & filter,top off fluids"
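
One caveat on the patterns in this cell: `.*` is greedy, so when a description contains two parenthesised groups, `\(.*\)` also removes the text between them. The sketch below uses the standard `re` module on a made-up description (not taken from the dataset) to compare it with a narrower pattern, `\([^)]*\)`. The narrower pattern is an alternative for illustration, not what the notebook uses.

```python
import re

# Hypothetical description with two parenthesised groups (not from dataset.csv).
text = "repair (plug) tire on car (customer waiting)"

greedy = re.sub(r"\(.*\)", "", text)     # same pattern as the cleanup cell
narrow = re.sub(r"\([^)]*\)", "", text)  # stops at the first closing parenthesis

# Collapse the extra spaces left behind by the removal.
greedy = re.sub(r"\s+", " ", greedy).strip()
narrow = re.sub(r"\s+", " ", narrow).strip()

print(greedy)  # everything from the first "(" to the last ")" is gone
print(narrow)  # only the parenthesised groups are gone
```

If descriptions only ever contain a single group, both patterns behave the same.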


In [13]:
# Group by UUID and aggregate ItemDescription
df = df.groupby(['UUID']).agg({'Description': 'first', 'ItemDescription': lambda x: list(x)})

# Rename ItemDescription to Items
df = df.rename(columns={'ItemDescription': 'Items'})

df.head()

UUID,Description,Items
000053b1e684c9e7ea73727b2238ce18,"oc, sb, caf, af","[change oil & filter, check tires & adjust tir..."
0000b2815cc3c2b56867cbbf4d36efa5,"oc, sb","[change oil & filter, check tires & adjust tir..."
000133296ef6b63b0210f224e1347365,"oc, su, caf, wipers","[change oil & filter, check tires & adjust tir..."
00017961865c4f766fdbb3cd8fe0bfb0,buff vehicle,"[buff vehicle, work light scratches, spray wax]"
00019d812c1173c8a69c656a40fa8767,warr. ops,[warranty replacement of oil pressure switch]
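
To make the aggregation concrete, here is a minimal sketch on a made-up three-row frame (the UUIDs `a` and `b` and the row values are hypothetical). `'first'` keeps one `Description` per UUID, which assumes every row of a work order carries the same description, and the lambda collects that order's items into a list.

```python
import pandas as pd

# Hypothetical rows: work order "a" has two labor items, "b" has one.
rows = pd.DataFrame({
    "UUID": ["a", "a", "b"],
    "Description": ["oc, sb", "oc, sb", "buff vehicle"],
    "ItemDescription": ["change oil & filter", "check tires", "buff vehicle"],
})

# One row per UUID: first Description seen, all ItemDescription values as a list.
grouped = rows.groupby(["UUID"]).agg(
    {"Description": "first", "ItemDescription": lambda x: list(x)}
)
grouped = grouped.rename(columns={"ItemDescription": "Items"})

print(grouped)
```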


In [14]:
print(f"Shape before: {df.shape}")

# Drop rows where Items has exactly one element AND that element equals the Description column
keep = ~((df['Items'].str.len() == 1) & (df['Items'].str[0] == df['Description']))
df = df[keep]

print(f"Shape after: {df.shape}")

Shape before: (52245, 2)
Shape after: (49653, 2)
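
The mask in this cell is `True` for the rows that are kept. A small sketch on a made-up frame (hypothetical values) shows which rows it drops: only those whose single item simply repeats the description, since they give the model nothing to learn beyond an identity mapping.

```python
import pandas as pd

# Hypothetical grouped rows, in the same shape as df after the groupby step.
toy = pd.DataFrame({
    "Description": ["buff vehicle", "oc, sb", "warr. ops"],
    "Items": [
        ["buff vehicle"],                        # single item equal to Description
        ["change oil & filter", "check tires"],  # two items
        ["warranty replacement"],                # single item, differs from Description
    ],
})

# .str.len() and .str[0] also work element-wise on lists stored in a column.
mask = ~((toy["Items"].str.len() == 1) & (toy["Items"].str[0] == toy["Description"]))
filtered = toy[mask]

print(filtered)
```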


In [15]:
df.head()

UUID,Description,Items
000053b1e684c9e7ea73727b2238ce18,"oc, sb, caf, af","[change oil & filter, check tires & adjust tir..."
0000b2815cc3c2b56867cbbf4d36efa5,"oc, sb","[change oil & filter, check tires & adjust tir..."
000133296ef6b63b0210f224e1347365,"oc, su, caf, wipers","[change oil & filter, check tires & adjust tir..."
00017961865c4f766fdbb3cd8fe0bfb0,buff vehicle,"[buff vehicle, work light scratches, spray wax]"
00019d812c1173c8a69c656a40fa8767,warr. ops,[warranty replacement of oil pressure switch]


---

## Regressor Analysis

### Setup

To compare candidate models and configurations quickly, we use a small subset of the dataset.

General notes:

- X = Predictor (description of work)
- y = Target (multi-hot array of repair labor items)
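
As a minimal sketch of what the target looks like, the example below runs `MultiLabelBinarizer` on two made-up work orders (the item lists are hypothetical, built from item names in the sample rows). Each row of the result has one column per distinct item, set to 1 when the work order contains that item.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical work orders, each a list of labor items.
items = [
    ["change oil & filter", "4 wheel alignment- sublet"],
    ["repair (plug) tire on car"],
]

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(items)

print(binarizer.classes_)  # distinct items, sorted; one column each
print(y)                   # one multi-hot row per work order

# inverse_transform maps multi-hot rows back to tuples of item names.
labels = binarizer.inverse_transform(y)
print(labels)
```

The same fitted binarizer is needed at prediction time to turn model output back into item names.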

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multioutput import MultiOutputRegressor

In [17]:
model_df = df[1:250] # Take a small subset (rows 1-249) for quick model comparison

binarizer = MultiLabelBinarizer()

X = model_df["Description"]
y = binarizer.fit_transform(model_df["Items"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4)

In [18]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
])

### Model 1 - Gradient Boosting Regressor

In [19]:
from sklearn.ensemble import GradientBoostingRegressor

model1 = Pipeline([
    ('preprocessing', pipeline),
    ('regressor', MultiOutputRegressor(GradientBoostingRegressor()))
])

# Train the model
model1.fit(X_train, y_train)

y_pred = model1.predict(X_test)

In [20]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error, r2_score, coverage_error

MSE = mean_squared_error(y_test, y_pred)
MAE = mean_absolute_error(y_test, y_pred)
MAPE = mean_absolute_percentage_error(y_test, y_pred)
R2 = r2_score(y_test, y_pred)
CE = coverage_error(y_test, y_pred)

display(pd.DataFrame({
    "Metric": ["MSE", "MAE", "MAPE", "R2", "CE"],
    "Value": [MSE, MAE, MAPE, R2, f"{CE:.2f}"]
}))

Unnamed: 0,Metric,Value
0,MSE,0.011071
1,MAE,0.015876
2,MAPE,35692807318816.4
3,R2,0.009714
4,CE,197.39
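
The MAPE values in this table and the ones that follow are on the order of 10^13 because the targets are multi-hot vectors that are mostly zeros. scikit-learn's `mean_absolute_percentage_error` divides each absolute error by `max(|y_true|, eps)`, with `eps` the float64 machine epsilon, so any nonzero prediction for an absent item contributes an enormous term. The sketch below re-implements that formula in plain Python for illustration; the `mape` helper and the sample values are hypothetical, not the library's code.

```python
import sys

def mape(y_true, y_pred, eps=sys.float_info.epsilon):
    """Mean of |y_pred - y_true| / max(|y_true|, eps) (illustrative helper)."""
    terms = [abs(p - t) / max(abs(t), eps) for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)

# Mostly-zero target, as in a multi-hot label row (hypothetical values).
y_true = [0, 0, 0, 1]
y_pred = [0.01, 0.0, 0.0, 0.9]

sparse_value = mape(y_true, y_pred)         # dominated by the 0.01 / eps term
dense_value = mape([1, 1], [0.9, 1.1])      # no zero targets: an ordinary ratio

print(sparse_value)
print(dense_value)
```

For this kind of target, MAPE is therefore hard to interpret; the other metrics in the table do not have this division.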


### Model 2 - Decision Tree Regressor

In [21]:
from sklearn.tree import DecisionTreeRegressor

model2 = Pipeline([
    ('preprocessing', pipeline),
    ('regressor', MultiOutputRegressor(DecisionTreeRegressor()))
])

model2.fit(X_train, y_train)

y_pred = model2.predict(X_test)

In [22]:
MSE = mean_squared_error(y_test, y_pred)
MAE = mean_absolute_error(y_test, y_pred)
MAPE = mean_absolute_percentage_error(y_test, y_pred)
R2 = r2_score(y_test, y_pred)
CE = coverage_error(y_test, y_pred)

display(pd.DataFrame({
    "Metric": ["MSE", "MAE", "MAPE", "R2", "CE"],
    "Value": [MSE, MAE, MAPE, R2, f"{CE:.2f}"]
}))

Unnamed: 0,Metric,Value
0,MSE,0.014805
1,MAE,0.017741
2,MAPE,45765447325559.484
3,R2,0.286342
4,CE,226.85


### Model 3 - Random Forest

In [23]:
from sklearn.ensemble import RandomForestRegressor

model3 = Pipeline([
    ('preprocessing', pipeline),
    ('regressor', MultiOutputRegressor(RandomForestRegressor()))
])

model3.fit(X_train, y_train)

y_pred = model3.predict(X_test)

In [24]:
MSE = mean_squared_error(y_test, y_pred)
MAE = mean_absolute_error(y_test, y_pred)
MAPE = mean_absolute_percentage_error(y_test, y_pred)
R2 = r2_score(y_test, y_pred)
CE = coverage_error(y_test, y_pred)

display(pd.DataFrame({
    "Metric": ["MSE", "MAE", "MAPE", "R2", "CE"],
    "Value": [MSE, MAE, MAPE, R2, f"{CE:.2f}"]
}))

Unnamed: 0,Metric,Value
0,MSE,0.010779
1,MAE,0.017994
2,MAPE,47121103282558.38
3,R2,0.144535
4,CE,201.74


### Model 4 - XGBRegressor

In [25]:
from xgboost import XGBRegressor

model4 = Pipeline([
    ('preprocessing', pipeline),
    ('regressor', MultiOutputRegressor(XGBRegressor()))
])

model4.fit(X_train, y_train)

y_pred = model4.predict(X_test)

In [26]:
MSE = mean_squared_error(y_test, y_pred)
MAE = mean_absolute_error(y_test, y_pred)
MAPE = mean_absolute_percentage_error(y_test, y_pred)
R2 = r2_score(y_test, y_pred)
CE = coverage_error(y_test, y_pred)

display(pd.DataFrame({
    "Metric": ["MSE", "MAE", "MAPE", "R2", "CE"],
    "Value": [MSE, MAE, MAPE, R2, f"{CE:.2f}"]
}))

Unnamed: 0,Metric,Value
0,MSE,0.011419
1,MAE,0.015173
2,MAPE,32941939200567.086
3,R2,0.008591
4,CE,200.71


### Model 5 - K Nearest Neighbors

In [27]:
from sklearn.neighbors import KNeighborsRegressor

model5 = Pipeline([
    ('preprocessing', pipeline),
    ('regressor', MultiOutputRegressor(KNeighborsRegressor(n_neighbors=5)))
])

model5.fit(X_train, y_train)

y_pred = model5.predict(X_test)

In [28]:
MSE = mean_squared_error(y_test, y_pred)
MAE = mean_absolute_error(y_test, y_pred)
MAPE = mean_absolute_percentage_error(y_test, y_pred)
R2 = r2_score(y_test, y_pred)
CE = coverage_error(y_test, y_pred)

display(pd.DataFrame({
    "Metric": ["MSE", "MAE", "MAPE", "R2", "CE"],
    "Value": [MSE, MAE, MAPE, R2, f"{CE:.2f}"]
}))

Unnamed: 0,Metric,Value
0,MSE,0.011992
1,MAE,0.019474
2,MAPE,46843669480538.79
3,R2,0.30925
4,CE,213.25


---

## Building the Model

Based on the above metrics, we select XGBoost's `XGBRegressor`: it has the lowest MAE and MAPE of the five models. Let's build the model on a larger subset of the dataset (up to 10,000 grouped work orders).
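
The comparison behind this choice can be made explicit with a small sketch that tabulates the values reported in the five metric tables and picks the best model per metric. The `reported` dictionary only copies those numbers. Which metric should decide is a judgment call the notebook does not spell out, and because `train_test_split` is called without a fixed seed, the numbers will differ between runs.

```python
# Values copied from the metric tables of Models 1-5 (small test subset).
reported = {
    "GradientBoostingRegressor": {"MSE": 0.011071, "MAE": 0.015876, "MAPE": 35692807318816.4, "R2": 0.009714, "CE": 197.39},
    "DecisionTreeRegressor": {"MSE": 0.014805, "MAE": 0.017741, "MAPE": 45765447325559.484, "R2": 0.286342, "CE": 226.85},
    "RandomForestRegressor": {"MSE": 0.010779, "MAE": 0.017994, "MAPE": 47121103282558.38, "R2": 0.144535, "CE": 201.74},
    "XGBRegressor": {"MSE": 0.011419, "MAE": 0.015173, "MAPE": 32941939200567.086, "R2": 0.008591, "CE": 200.71},
    "KNeighborsRegressor": {"MSE": 0.011992, "MAE": 0.019474, "MAPE": 46843669480538.79, "R2": 0.30925, "CE": 213.25},
}

# Lower is better for the error metrics; higher is better for R2.
lower_is_better = ["MSE", "MAE", "MAPE", "CE"]
best = {m: min(reported, key=lambda name: reported[name][m]) for m in lower_is_better}
best["R2"] = max(reported, key=lambda name: reported[name]["R2"])

for metric, name in best.items():
    print(f"{metric}: {name}")
```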

In [33]:
from xgboost import XGBRegressor

binarizer = MultiLabelBinarizer()

model_df = df[0:10000]

X = model_df["Description"]
y = binarizer.fit_transform(model_df["Items"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4)

pipeline = Pipeline([
  ('tfidf', TfidfVectorizer()),
])

model = Pipeline([
  ('preprocessing', pipeline),
  ('regressor', MultiOutputRegressor(XGBRegressor()))
])

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

MSE = mean_squared_error(y_test, y_pred)
MAE = mean_absolute_error(y_test, y_pred)
MAPE = mean_absolute_percentage_error(y_test, y_pred)
R2 = r2_score(y_test, y_pred)
CE = coverage_error(y_test, y_pred)

display(f"Statistics for XGBRegressor model with {len(model_df)} samples")
display(pd.DataFrame({
  "Metric": ["MSE", "MAE", "MAPE", "R2", "CE"],
  "Value": [MSE, MAE, MAPE, R2, f"{CE:.2f}"]
}))

'Statistics for XGBRegressor model with 9999 samples'

Unnamed: 0,Metric,Value
0,MSE,0.000381
1,MAE,0.000554
2,MAPE,1227990074090.2388
3,R2,-0.032225
4,CE,3439.02


---

## Model Usage

### Serializing the Model

Export the selected model pipeline (which includes the TF-IDF vectorizer) and the label binarizer to separate files for use in the API.

In [34]:
import pickle

pkl_filename = "model.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(model, file)

pkl_bin_filename = "binarizer.pkl"
with open(pkl_bin_filename, 'wb') as file:
    pickle.dump(binarizer, file)

### Implementing the Model in the API

Demo of how to use the model to predict the labor lines given a description of the work performed.

In [43]:
# The with-statement closes each file automatically
with open("model.pkl", 'rb') as file:
    pickle_model = pickle.load(file)

with open('binarizer.pkl', 'rb') as file:
    pickle_bin = pickle.load(file)

In [42]:
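X_predict = ["oc, su, tires"]
y_predict = pickle_model.predict(X_predict)
# Threshold the regression output; unlike rounding, this always yields 0/1 values,
# which is what the binarizer's inverse_transform expects
y_predict_bin = (y_predict > 0.5).astype(int)
y_predict_labels = pickle_bin.inverse_transform(y_predict_bin)

# inverse_transform returns one tuple of labels per input row
if len(y_predict_labels[0]) == 0:
  print("No predictions")
else:
  for prediction in y_predict_labels[0]:
    print(f"- {prediction}")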

- change oil & filter
- check & top off brake fluid as necessary
- check & top off engine coolant as necessary
- check & top off power steering fluid as necessary
- check & top off transmission fluid as necessary
- check & top off washer fluid
- check tires & adjust tire pressure as necessary
- courtesy inspection / general check-over
- mount & comp. balance two new a/s tires
