In [3]:
# run `pip install category_encoders`

import pandas as pd
import category_encoders as ce
from pandas.api.types import CategoricalDtype
from sklearn.preprocessing import StandardScaler

## Data Import and Initial Preprocessing

In this step, we import the necessary CSV files, each representing a unique dataset. The list of files includes 'q1', 'q2', 'q3', 'q4', 'q5', and 'q6'. We load each file into a separate pandas DataFrame.

One aspect of our preprocessing at this stage is handling the 'timestamp' column. In our datasets, 'timestamp' is only relevant in the 'q4' file where it denotes the time each diagnostic activity was executed. Therefore, we conditionally remove the 'timestamp' column from all DataFrames except for 'q4'.

Each resulting DataFrame is stored in a dictionary, `dataframes`, using its filename as the key for efficient access and manipulation in the subsequent stages.

In [4]:
file_names = ['q1', 'q2', 'q3', 'q4', 'q5', 'q6']
dataframes = {}

for file in file_names:
    file_path = f'data/{file}.csv'
    df = pd.read_csv(file_path)

    # If the file is not 'q4', drop the 'timestamp' column
    if file != 'q4':
        df = df.drop('timestamp', axis=1)

    df = df.truncate(after=100000)  # Temporary: keep rows up to index label 100000 to speed up development; to be removed
    dataframes[file] = df

## Feature Engineering: Extract Temporal Diagnostic Activity Features

In this section, we perform feature engineering on the 'timestamp' field to extract valuable temporal information about each diagnostic activity. The temporal features we derive are:

1. **Year**: The year the diagnostic activity was performed. This can help detect yearly trends in the data.
2. **Month**: The month the diagnostic activity was performed. This can help identify any monthly patterns.
3. **Day of Week**: The day of the week the diagnostic activity was performed. This can reveal weekly trends, such as certain activities being more common on certain days of the week.
4. **Week of Year**: The ISO week number of the year the diagnostic activity was performed. This can provide a more granular view of yearly trends.
5. **Time Since Last Activity**: The time in seconds since the previous diagnostic activity within the same consultation (0 for the first activity of a consultation). This can help gauge the frequency of activities.
6. **Elapsed Time**: The time in seconds since the first diagnostic activity in each consultation. This can provide insight into the duration of consultations.
7. **Season of the Year**: The season (Winter, Spring, Summer, Autumn) when the diagnostic activity was performed, derived from the month (December to February is Winter, March to May is Spring, and so on). This can help identify seasonal trends, such as certain activities being more common in certain seasons.

The resulting dataframe now contains several new features that provide additional temporal context about each diagnostic activity.

In [5]:
q4 = dataframes['q4']
q4['timestamp'] = pd.to_datetime(q4['timestamp'])
q4.sort_values(['consultationId', 'timestamp'], inplace=True)

q4['year'] = q4['timestamp'].dt.year
q4['month'] = q4['timestamp'].dt.month
q4['dayOfWeek'] = q4['timestamp'].dt.dayofweek
q4['weekOfYear'] = q4['timestamp'].dt.isocalendar().week
q4['timeSinceLastActivitySec'] = q4.groupby('consultationId')['timestamp'].diff().dt.total_seconds().fillna(0)
q4['elapsedTimeSec'] = q4.groupby('consultationId')['timestamp'].transform(lambda x: (x - x.min())).dt.total_seconds()
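
To make the two per-consultation durations concrete, here is a minimal sketch on a made-up activity log. The consultation IDs and timestamps are invented for illustration; only the `groupby` pattern mirrors the cell above.

```python
import pandas as pd

# Toy activity log; consultation IDs and timestamps are invented for illustration.
toy = pd.DataFrame({
    'consultationId': ['a', 'a', 'a', 'b'],
    'timestamp': pd.to_datetime([
        '2021-07-01 02:00:00', '2021-07-01 02:00:10',
        '2021-07-01 02:00:40', '2021-07-01 09:00:00',
    ]),
}).sort_values(['consultationId', 'timestamp'])

# Gap to the previous activity in the same consultation.
# The first activity of a consultation has no predecessor, so its NaN becomes 0.
toy['timeSinceLastActivitySec'] = (
    toy.groupby('consultationId')['timestamp'].diff().dt.total_seconds().fillna(0)
)

# Offset from the first activity of the same consultation.
first_activity = toy.groupby('consultationId')['timestamp'].transform('min')
toy['elapsedTimeSec'] = (toy['timestamp'] - first_activity).dt.total_seconds()

print(toy)
```

Within consultation `a`, the gaps are 0, 10 and 30 seconds, while the elapsed times accumulate to 0, 10 and 40 seconds. Consultation `b` starts again from 0.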

In [6]:
# Derive 'Season of the Year'
# Define a function that maps month to season
def month_to_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Autumn'

# Apply the function to the 'month' column to create the 'season' column
q4['season'] = q4['month'].apply(month_to_season)

q4

Unnamed: 0,consultationId,activityName,timestamp,year,month,dayOfWeek,weekOfYear,timeSinceLastActivitySec,elapsedTimeSec,season
1526,uid-162502747749226177,VCI.SELECT_MODULE,2021-07-01 02:22:53.656000+00:00,2021,7,3,26,0.000,0.000,Summer
1539,uid-162502747749226177,OTX.GET_PERSISTENCE,2021-07-01 02:23:18.323000+00:00,2021,7,3,26,24.667,24.667,Summer
1540,uid-162502747749226177,ODO,2021-07-01 02:23:18.331000+00:00,2021,7,3,26,0.008,24.675,Summer
1541,uid-162502747749226177,OTX.readDtcs,2021-07-01 02:23:18.339000+00:00,2021,7,3,26,0.008,24.683,Summer
1600,uid-162502747749226177,REST.VEHICLE.FLASHWARE,2021-07-01 02:25:53.168000+00:00,2021,7,3,26,154.829,179.512,Summer
...,...,...,...,...,...,...,...,...,...,...
99996,uid-163107999250525772,OTX.GET_PERSISTENCE,2021-09-08 05:48:17.187000+00:00,2021,9,2,36,1.211,104.773,Autumn
99997,uid-163107999250525772,OTX.vin,2021-09-08 05:48:17.197000+00:00,2021,9,2,36,0.010,104.783,Autumn
99998,uid-163107999250525772,VEHICLE_ID,2021-09-08 05:48:17.233000+00:00,2021,9,2,36,0.036,104.819,Autumn
99999,uid-163107999250525772,OTX.GET_PERSISTENCE,2021-09-08 05:48:32.033000+00:00,2021,9,2,36,14.800,119.619,Autumn
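
As an alternative to `apply` with a Python function, the same month-to-season mapping could be expressed as a dictionary lookup with `Series.map`. This is only a sketch: `SEASON_BY_MONTH` is a name introduced here, and it assumes the same three-month blocks as `month_to_season`.

```python
import pandas as pd

# Hypothetical lookup table mirroring month_to_season.
SEASON_BY_MONTH = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Autumn', 10: 'Autumn', 11: 'Autumn',
}

months = pd.Series([1, 4, 7, 9, 12])

# map() looks each month up in the dictionary in one vectorised call.
seasons = months.map(SEASON_BY_MONTH)
print(seasons.tolist())
```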


## Feature Engineering: Removing Outlier Diagnostic Activities

In our dataset, certain diagnostic activities performed by the technicians are extremely common and are recorded in virtually every consultation. While these activities are a routine part of the consultation process, they carry little diagnostic information for our model and may therefore not be useful in predicting recommendations. For instance, the 'CONSULTATION_START' activity is logged in every consultation but doesn't contribute meaningful information towards diagnosing a specific vehicle issue.

To identify and remove these non-informative activities, we follow a statistical outlier detection approach:

1. **Calculate Commonality**: First, we calculate the commonality score for each activity, which is the number of records of that activity divided by the total number of activity records.

2. **Calculate Mean and Standard Deviation**: We then calculate the mean and standard deviation of these commonality scores.

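3. **Identify Outliers**: Any activity whose commonality score lies more than two standard deviations from the mean is considered an outlier. This threshold is based on the empirical rule, which states that for a normal distribution, about 95% of the data lies within two standard deviations of the mean. Commonality scores are heavily skewed rather than normal, so the threshold serves as a heuristic here. With the scores observed in this dataset (mean ≈ 0.0017, standard deviation ≈ 0.0090), the lower bound is negative, so in practice only unusually common activities are flagged.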

4. **Remove Outliers**: Finally, we remove these outlier activities from our dataset, leaving us with a set of activities that are varied enough to provide meaningful information for our model.
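
The four steps above can be sketched on a small invented log. The activity names `PING` and `STEP_0` to `STEP_9` are made up for illustration, and the numbers have no relation to the real dataset.

```python
import pandas as pd

# Invented log: one routine activity dominates, ten others are rare.
activities = pd.Series(
    ['PING'] * 50 + [f'STEP_{i}' for i in range(10)] * 5,
    name='activityName',
)

# Step 1: share of all records taken up by each activity.
commonality = activities.value_counts(normalize=True)

# Step 2: mean and (sample) standard deviation of the scores.
mean, std = commonality.mean(), commonality.std()

# Step 3: flag scores more than two standard deviations from the mean.
lower, upper = mean - 2 * std, mean + 2 * std
outliers = commonality[(commonality < lower) | (commonality > upper)]

# Step 4: drop every record belonging to a flagged activity.
filtered = activities[~activities.isin(outliers.index)]

print(outliers)
print(f'lower={lower:.3f} upper={upper:.3f} remaining={len(filtered)}')
```

In this toy example the lower bound is negative, so only the dominant activity can be flagged. The same holds for the mean and standard deviation printed by the notebook.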

In [7]:
activity_commonality = q4.value_counts('activityName')/q4['activityName'].count()
activity_commonality = activity_commonality.reset_index()
activity_commonality.columns = ['activityName', 'commonalityScore']

mean = activity_commonality.commonalityScore.mean()
std = activity_commonality.commonalityScore.std()
print(f'MEAN: {mean}  STD: {std}')

lower = mean - (2 * std)
upper = mean + (2 * std)

# Flag activities whose commonality score falls outside the [lower, upper] range.
outliers_condition = (activity_commonality.commonalityScore < lower) | (upper < activity_commonality.commonalityScore)
most_common_activities = activity_commonality[outliers_condition]

most_common_activities

MEAN: 0.0017152658662092624  STD: 0.008966686459682351


Unnamed: 0,activityName,commonalityScore
0,OTX.GET_PERSISTENCE,0.138379
1,REST.VEHICLE.FLASHWARE,0.061569
2,REST.TRANSFORMATION,0.061569
3,ODO,0.061239
4,VCI.SELECT_MODULE,0.051859
5,OTX.odrSequenceV2,0.03585
6,HEALTHCHECK.REFRESH.NETWORK_VIEW,0.03452
7,HEALTHCHECK.REFRESH.DTC_LIST,0.03452
8,OTX.readDtcs,0.03294
9,VCI.SELECT_SUPPLIER,0.03293


In [8]:
# Remove identified outlier (the most common) activities
q4 = q4[~q4.activityName.isin(most_common_activities.activityName)]
q4

Unnamed: 0,consultationId,activityName,timestamp,year,month,dayOfWeek,weekOfYear,timeSinceLastActivitySec,elapsedTimeSec,season
1799,uid-162502747749226177,HEALTHCHECK.CLEAR_DTCS,2021-07-01 02:32:13.472000+00:00,2021,7,3,26,26.933,559.816,Summer
1800,uid-162502747749226177,OTX.clearDtcs,2021-07-01 02:32:13.476000+00:00,2021,7,3,26,0.004,559.820,Summer
1921,uid-162502747749226177,HEALTHCHECK.CLEAR_DTCS,2021-07-01 02:34:19.675000+00:00,2021,7,3,26,13.328,686.019,Summer
1922,uid-162502747749226177,OTX.clearDtcs,2021-07-01 02:34:19.679000+00:00,2021,7,3,26,0.004,686.023,Summer
2044,uid-162502747749226177,VCI.RELEASE_MODULE,2021-07-01 02:39:24.800000+00:00,2021,7,3,26,288.468,991.144,Summer
...,...,...,...,...,...,...,...,...,...,...
99986,uid-163107957988936004,OTX.PCMFlash,2021-09-08 05:47:49.361000+00:00,2021,9,2,36,0.003,489.505,Autumn
99987,uid-163107957988936004,OTX.VehiclePostFlash,2021-09-08 05:47:58.375000+00:00,2021,9,2,36,9.014,498.519,Autumn
99989,uid-163107957988936004,OTX.PCMPostFlash,2021-09-08 05:47:58.558000+00:00,2021,9,2,36,0.004,498.702,Autumn
99912,uid-16310796452633650,userAction,2021-09-08 05:41:03.192000+00:00,2021,9,2,36,10.665,702.501,Autumn


## Data Merging: Combining Imported Data

In this step, we merge the imported data into a single dataframe, using `consultationId` as the common identifier.

We chain pandas' `merge()` calls with `on='consultationId'` and `how='inner'`. An inner join keeps only the consultations that appear in both tables being merged, so after the full chain the resulting dataframe contains only consultations present in all six sources. Because some sources hold more than one row per consultation (for example, one row per activity in `q4`), each consultation contributes one output row for every combination of its matching rows. This is why the same activity repeats across several DTC codes in the output below.
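
A small sketch of how an inner join behaves, using two invented tables (the IDs, projects and codes are made up). It shows both effects at once: unmatched consultations are dropped, and a consultation with several matching rows is repeated.

```python
import pandas as pd

# One row per consultation on the left, several DTC rows per consultation on the right.
consultations = pd.DataFrame({
    'consultationId': ['c1', 'c2', 'c3'],
    'project': ['P', 'P', 'Q'],
})
dtcs = pd.DataFrame({
    'consultationId': ['c1', 'c1', 'c2', 'c9'],
    'FullCode': ['B1', 'B2', 'U1', 'U2'],
})

merged = consultations.merge(dtcs, on='consultationId', how='inner')
print(merged)
```

Here `c3` (no DTC rows) and `c9` (no consultation row) disappear, while `c1` appears twice, once per DTC.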

In [9]:
q1 = dataframes['q1']
q2 = dataframes['q2']
q3 = dataframes['q3']
q5 = dataframes['q5']
q6 = dataframes['q6']

data = q1 \
    .merge(q2, on='consultationId', how='inner') \
    .merge(q3, on='consultationId', how='inner') \
    .merge(q4, on='consultationId', how='inner') \
    .merge(q5, on='consultationId', how='inner') \
    .merge(q6, on='consultationId', how='inner')
data.set_index('consultationId', inplace=True)

data

consultationId,vin,project,ODO,activityName,timestamp,year,month,dayOfWeek,weekOfYear,timeSinceLastActivitySec,elapsedTimeSec,season,odxShortName,FullCode,StatusCode,dtcODO,C_ODO_VALUE,dealerCode,distributorCode
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,HEALTHCHECK.CLEAR_DTCS,2021-07-01 00:01:15.377000+00:00,2021,7,3,26,15.026,54.310,Summer,VCM,B1B5049,8,27280,27287.0,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,HEALTHCHECK.CLEAR_DTCS,2021-07-01 00:01:15.377000+00:00,2021,7,3,26,15.026,54.310,Summer,PSM,B1B4649,8,27280,27287.0,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,HEALTHCHECK.CLEAR_DTCS,2021-07-01 00:01:15.377000+00:00,2021,7,3,26,15.026,54.310,Summer,PSM,B1B5049,8,27280,27287.0,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,HEALTHCHECK.CLEAR_DTCS,2021-07-01 00:01:15.377000+00:00,2021,7,3,26,15.026,54.310,Summer,PSM,B1B5C49,8,27280,27287.0,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,HEALTHCHECK.CLEAR_DTCS,2021-07-01 00:01:15.377000+00:00,2021,7,3,26,15.026,54.310,Summer,VCM,B1B4649,8,27280,27287.0,2527A,MJO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
uid-162513522884055541,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,6222 km,OTX.clearDtcs,2021-07-01 10:28:06.506000+00:00,2021,7,3,26,0.004,57.714,Summer,VCM,B1B4449,8,6221,6222.0,2528,MJO
uid-162513522884055541,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,6222 km,OTX.clearDtcs,2021-07-01 10:28:06.506000+00:00,2021,7,3,26,0.004,57.714,Summer,VCM,U213900,8,6221,6222.0,2528,MJO
uid-162513522884055541,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,6222 km,OTX.clearDtcs,2021-07-01 10:28:06.506000+00:00,2021,7,3,26,0.004,57.714,Summer,PSM,B1B4849,8,6221,6222.0,2528,MJO
uid-162513522884055541,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,6222 km,OTX.clearDtcs,2021-07-01 10:28:06.506000+00:00,2021,7,3,26,0.004,57.714,Summer,VCM,B1B4849,8,6221,6222.0,2528,MJO


## Data Cleaning: Removing Duplicate Records

In this step, we remove duplicate rows from the dataset.

We use pandas' `drop_duplicates()` for this purpose. Passing `inplace=True` modifies `data` directly instead of returning a new dataframe.
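
One detail worth keeping in mind is that `drop_duplicates()` compares column values only and ignores the index. Since `consultationId` is the index at this point, the sketch below (with invented values) shows the difference between deduplicating with and without it.

```python
import pandas as pd

toy = pd.DataFrame(
    {'activityName': ['A', 'A', 'A'], 'FullCode': ['B1', 'B1', 'B2']},
    index=pd.Index(['c1', 'c2', 'c2'], name='consultationId'),
)

# The first two rows share the same column values but belong to different consultations.
by_columns = toy.drop_duplicates()

# Moving the index into a column makes it part of the comparison.
with_index = toy.reset_index().drop_duplicates().set_index('consultationId')

print(len(by_columns), len(with_index))
```

In the real data, columns such as 'vin' and 'timestamp' probably make cross-consultation collisions unlikely, but that is an assumption rather than something verified here.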

In [10]:
data.drop_duplicates(inplace=True)
data

consultationId,vin,project,ODO,activityName,timestamp,year,month,dayOfWeek,weekOfYear,timeSinceLastActivitySec,elapsedTimeSec,season,odxShortName,FullCode,StatusCode,dtcODO,C_ODO_VALUE,dealerCode,distributorCode
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,HEALTHCHECK.CLEAR_DTCS,2021-07-01 00:01:15.377000+00:00,2021,7,3,26,15.026,54.310,Summer,VCM,B1B5049,8,27280,27287.0,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,HEALTHCHECK.CLEAR_DTCS,2021-07-01 00:01:15.377000+00:00,2021,7,3,26,15.026,54.310,Summer,PSM,B1B4649,8,27280,27287.0,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,HEALTHCHECK.CLEAR_DTCS,2021-07-01 00:01:15.377000+00:00,2021,7,3,26,15.026,54.310,Summer,PSM,B1B5049,8,27280,27287.0,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,HEALTHCHECK.CLEAR_DTCS,2021-07-01 00:01:15.377000+00:00,2021,7,3,26,15.026,54.310,Summer,PSM,B1B5C49,8,27280,27287.0,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,HEALTHCHECK.CLEAR_DTCS,2021-07-01 00:01:15.377000+00:00,2021,7,3,26,15.026,54.310,Summer,VCM,B1B4649,8,27280,27287.0,2527A,MJO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
uid-162513522884055541,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,6222 km,OTX.clearDtcs,2021-07-01 10:28:06.506000+00:00,2021,7,3,26,0.004,57.714,Summer,VCM,B1B4449,8,6221,6222.0,2528,MJO
uid-162513522884055541,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,6222 km,OTX.clearDtcs,2021-07-01 10:28:06.506000+00:00,2021,7,3,26,0.004,57.714,Summer,VCM,U213900,8,6221,6222.0,2528,MJO
uid-162513522884055541,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,6222 km,OTX.clearDtcs,2021-07-01 10:28:06.506000+00:00,2021,7,3,26,0.004,57.714,Summer,PSM,B1B4849,8,6221,6222.0,2528,MJO
uid-162513522884055541,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,6222 km,OTX.clearDtcs,2021-07-01 10:28:06.506000+00:00,2021,7,3,26,0.004,57.714,Summer,VCM,B1B4849,8,6221,6222.0,2528,MJO


## Data Cleaning: Handle Missing Values

In our dataset, we've identified that the 'ODO' column has missing values.

Rather than using common statistical methods such as replacing missing values with the mean or median, we opt for a more context-aware approach. We replace each missing 'ODO' value with 'C_ODO_VALUE', the current odometer value collected during the same consultation. This provides a reliable substitute that is closely tied to the actual vehicle data.

Additionally, we transform the 'ODO' field to be numerical by removing the 'km' unit formatting.
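
A minimal sketch of this clean-up on invented rows is shown below. It uses `pd.to_numeric` to do the conversion in one step, which is a slight variation on the cell that follows rather than the notebook's exact code.

```python
import pandas as pd

# Invented rows; the second one has no ODO reading.
toy = pd.DataFrame({
    'ODO': ['27282 km', None, '6222 km'],
    'C_ODO_VALUE': [27287.0, 15000.0, 6222.0],
})

# Strip the unit and convert the remaining text to numbers (missing stays NaN).
odo_numeric = pd.to_numeric(toy['ODO'].str.replace(' km', '', regex=False))

# Fall back to the odometer value from the same row where ODO is missing.
toy['ODO'] = odo_numeric.fillna(toy['C_ODO_VALUE'])

print(toy)
```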

#### FUTURE CONSIDERATION
A more robust approach could be model-based imputation, such as regression imputation, k-nearest neighbours (KNN), or a decision tree, which predicts the missing 'ODO' values from other variables. This could be particularly useful if the 'ODO' values are missing according to some underlying pattern that is related to other variables in the data. (NOTE: this can only be considered once CLAIMS data is available.)

In [11]:
# Identify which data fields contain missing values
data.isna().any()

vin                         False
project                     False
ODO                          True
activityName                False
timestamp                   False
year                        False
month                       False
dayOfWeek                   False
weekOfYear                  False
timeSinceLastActivitySec    False
elapsedTimeSec              False
season                      False
odxShortName                False
FullCode                    False
StatusCode                  False
dtcODO                      False
C_ODO_VALUE                 False
dealerCode                  False
distributorCode             False
dtype: bool

In [12]:
# Replace 'km' formatting in the 'ODO' field and fill missing values
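data['ODO'] = data['ODO'].str.replace(' km', '', regex=False)
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may not update `data` under pandas' copy-on-write behaviour
data['ODO'] = data['ODO'].fillna(data['C_ODO_VALUE'])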

## Data Cleaning: Setting Appropriate Data Types

In this step, we make sure each column uses an appropriate data type, which supports efficient data manipulation and accurate model training.

In this code block, we set the data types as follows:

* The 'ODO', 'dtcODO', 'timeSinceLastActivitySec', and 'elapsedTimeSec' columns are set to float64, as they contain continuous numerical data.
* The 'year', 'month', 'dayOfWeek', and 'weekOfYear' columns are set to CategoricalDtype(ordered=True), since they contain temporal categories with a natural order.
* The 'project', 'activityName', 'odxShortName', 'FullCode', 'StatusCode', 'dealerCode', and 'distributorCode' columns are set to CategoricalDtype(ordered=False), since they contain categories without a natural order.
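
The practical difference between the ordered and unordered variants can be seen in a short sketch with invented month values. When no categories are passed, pandas infers them from the sorted unique values.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

months = pd.Series([7, 9, 7, 8])

ordered = months.astype(CategoricalDtype(ordered=True))
unordered = months.astype(CategoricalDtype(ordered=False))

# Categories are inferred from the data and sorted.
print(ordered.cat.categories.tolist())

# Order-dependent reductions such as max() are defined for the ordered variant.
print(ordered.max())
print(ordered.cat.ordered, unordered.cat.ordered)
```

Order-dependent operations such as `min()` and `max()` are only available on the ordered variant.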

In [13]:
float_cols = ['ODO', 'dtcODO', 'timeSinceLastActivitySec', 'elapsedTimeSec']
ordered_cols = ['year', 'month', 'dayOfWeek', 'weekOfYear']
unordered_cols = ['project', 'activityName', 'odxShortName', 'FullCode', 'StatusCode', 'dealerCode', 'distributorCode']

# Continuous measurements
for col in float_cols:
    data[col] = data[col].astype('float64')

# Temporal categories with a natural order
for col in ordered_cols:
    data[col] = data[col].astype(CategoricalDtype(ordered=True))

# Nominal categories without a natural order
for col in unordered_cols:
    data[col] = data[col].astype(CategoricalDtype(ordered=False))

data.dtypes

vin                                      object
project                                category
ODO                                     float64
activityName                           category
timestamp                   datetime64[ns, UTC]
year                                   category
month                                  category
dayOfWeek                              category
weekOfYear                             category
timeSinceLastActivitySec                float64
elapsedTimeSec                          float64
season                                   object
odxShortName                           category
FullCode                               category
StatusCode                             category
dtcODO                                  float64
C_ODO_VALUE                             float64
dealerCode                             category
distributorCode                        category
dtype: object

## Data Normalisation: Standardise Numerical Data

In this step, we standardise the values of the 'elapsedTimeSec', 'timeSinceLastActivitySec', 'ODO', 'dtcODO', and 'C_ODO_VALUE' columns. These columns hold continuous numerical data (durations and odometer readings), which we expect to follow a roughly normal distribution.

We use sklearn's StandardScaler for this task. It standardises each feature by subtracting its mean and dividing by its standard deviation, so every scaled feature has a mean of 0 and a standard deviation of 1. Standardisation changes only the location and scale of a feature; it does not change the shape of its distribution or make it normal.

Putting these features on a common scale prevents any one of them from dominating a scale-sensitive model simply because of its units.
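
The transformation itself is the z-score, $z = (x - \mu) / \sigma$. The sketch below reproduces it by hand with pandas on invented odometer values. It uses the population standard deviation (`ddof=0`), which is what StandardScaler uses.

```python
import pandas as pd

# Invented odometer readings in km.
odo = pd.Series([1000.0, 2000.0, 3000.0, 4000.0], name='ODO')

# z = (x - mean) / std, using the population standard deviation (ddof=0).
z = (odo - odo.mean()) / odo.std(ddof=0)

print(z.round(4).tolist())
print(z.mean(), z.std(ddof=0))
```

After scaling, the mean is 0 and the standard deviation is 1 (up to floating-point error), while the relative spacing of the values is unchanged.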

In [14]:
numeric_cols = ['elapsedTimeSec', 'timeSinceLastActivitySec', 'ODO', 'dtcODO', 'C_ODO_VALUE']

data_scaler = StandardScaler()
data[numeric_cols] = data_scaler.fit_transform(data[numeric_cols])
data

consultationId,vin,project,ODO,activityName,timestamp,year,month,dayOfWeek,weekOfYear,timeSinceLastActivitySec,elapsedTimeSec,season,odxShortName,FullCode,StatusCode,dtcODO,C_ODO_VALUE,dealerCode,distributorCode
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,1.760231,HEALTHCHECK.CLEAR_DTCS,2021-07-01 00:01:15.377000+00:00,2021,7,3,26,-0.260100,-0.619171,Summer,VCM,B1B5049,8,1.762017,1.760069,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,1.760231,HEALTHCHECK.CLEAR_DTCS,2021-07-01 00:01:15.377000+00:00,2021,7,3,26,-0.260100,-0.619171,Summer,PSM,B1B4649,8,1.762017,1.760069,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,1.760231,HEALTHCHECK.CLEAR_DTCS,2021-07-01 00:01:15.377000+00:00,2021,7,3,26,-0.260100,-0.619171,Summer,PSM,B1B5049,8,1.762017,1.760069,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,1.760231,HEALTHCHECK.CLEAR_DTCS,2021-07-01 00:01:15.377000+00:00,2021,7,3,26,-0.260100,-0.619171,Summer,PSM,B1B5C49,8,1.762017,1.760069,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,1.760231,HEALTHCHECK.CLEAR_DTCS,2021-07-01 00:01:15.377000+00:00,2021,7,3,26,-0.260100,-0.619171,Summer,VCM,B1B4649,8,1.762017,1.760069,2527A,MJO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
uid-162513522884055541,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,-0.636043,OTX.clearDtcs,2021-07-01 10:28:06.506000+00:00,2021,7,3,26,-0.284041,-0.617521,Summer,VCM,B1B4449,8,-0.629484,-0.636354,2528,MJO
uid-162513522884055541,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,-0.636043,OTX.clearDtcs,2021-07-01 10:28:06.506000+00:00,2021,7,3,26,-0.284041,-0.617521,Summer,VCM,U213900,8,-0.629484,-0.636354,2528,MJO
uid-162513522884055541,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,-0.636043,OTX.clearDtcs,2021-07-01 10:28:06.506000+00:00,2021,7,3,26,-0.284041,-0.617521,Summer,PSM,B1B4849,8,-0.629484,-0.636354,2528,MJO
uid-162513522884055541,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,-0.636043,OTX.clearDtcs,2021-07-01 10:28:06.506000+00:00,2021,7,3,26,-0.284041,-0.617521,Summer,VCM,B1B4849,8,-0.629484,-0.636354,2528,MJO


## Data Encoding: Encode Categorical Features

The model requires numerical input, so we convert our categorical variables into a numerical format using binary encoding.

The variables being encoded are 'project', 'activityName', 'odxShortName', 'FullCode', 'StatusCode', 'dealerCode', 'distributorCode', 'dayOfWeek', 'weekOfYear', 'month', and 'season'.

Binary encoding is a compromise between ordinal and one-hot encoding. First the categories are encoded as ordinal integers, then those integers are converted into binary code, and finally the digits of that binary string are split into separate columns. A variable with n categories therefore needs only about log2(n) columns instead of n, which makes binary encoding more space-efficient than one-hot encoding, especially for high-cardinality variables.

We use the BinaryEncoder from category_encoders for this task. The 'return_df' parameter is set to True, which means the method returns a pandas DataFrame.
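
The three steps of binary encoding can be sketched by hand with pandas alone. This is an illustration of the idea, not the library's implementation: the category values are invented, and the numbering of categories and columns is an assumption that may differ from what BinaryEncoder produces.

```python
import pandas as pd

# Invented category values.
values = pd.Series(['VCM', 'PSM', 'VCM', 'ABS', 'ECM', 'TCM'], name='odxShortName')

# Step 1: ordinal encoding, numbering categories from 1 in order of first appearance.
ordinal = pd.factorize(values)[0] + 1

# Step 2: number of binary digits needed for the largest ordinal value.
n_bits = int(ordinal.max()).bit_length()

# Step 3: one column per binary digit, most significant digit first.
binary = pd.DataFrame(
    {f'{values.name}_{i}': (ordinal >> (n_bits - 1 - i)) & 1 for i in range(n_bits)},
    index=values.index,
)

print(binary)
```

Five distinct categories fit into three columns here, where one-hot encoding would need five.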


In [15]:
encoder = ce.BinaryEncoder(cols = ['project', 'activityName', 'odxShortName', 'FullCode', 'StatusCode', 'dealerCode', 'distributorCode', 'dayOfWeek', 'weekOfYear', 'month', 'season'], return_df = True)
encoded_data = encoder.fit_transform(data)
encoded_data

consultationId,vin,project_0,project_1,project_2,ODO,activityName_0,activityName_1,activityName_2,activityName_3,activityName_4,...,C_ODO_VALUE,dealerCode_0,dealerCode_1,dealerCode_2,dealerCode_3,dealerCode_4,dealerCode_5,distributorCode_0,distributorCode_1,distributorCode_2
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,1.760231,0,0,0,0,0,...,1.760069,0,0,0,0,0,1,0,0,1
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,1.760231,0,0,0,0,0,...,1.760069,0,0,0,0,0,1,0,0,1
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,1.760231,0,0,0,0,0,...,1.760069,0,0,0,0,0,1,0,0,1
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,1.760231,0,0,0,0,0,...,1.760069,0,0,0,0,0,1,0,0,1
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,1.760231,0,0,0,0,0,...,1.760069,0,0,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
uid-162513522884055541,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,-0.636043,0,0,0,0,0,...,-0.636354,0,0,1,0,0,0,0,0,1
uid-162513522884055541,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,-0.636043,0,0,0,0,0,...,-0.636354,0,0,1,0,0,0,0,0,1
uid-162513522884055541,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,-0.636043,0,0,0,0,0,...,-0.636354,0,0,1,0,0,0,0,0,1
uid-162513522884055541,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,-0.636043,0,0,0,0,0,...,-0.636354,0,0,1,0,0,0,0,0,1


## Save Preprocessed Data
Here we save both the preprocessed data and its binary-encoded version as CSV files in the 'data' directory. The index=True parameter ensures that the index (consultationId) is written to each file.

In [16]:
data.to_csv('data/preprocessed_data.csv', index=True)
encoded_data.to_csv('data/encoded_preprocessed_data.csv', index=True)
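
When these files are read back later, note that CSV stores plain text, so categorical dtypes are not preserved and would need to be re-applied. The sketch below demonstrates this with an in-memory buffer and invented values instead of the real files.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {'month': pd.Categorical([7, 9], ordered=True), 'ODO': [1.76, -0.64]},
    index=pd.Index(['c1', 'c2'], name='consultationId'),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=True)
buffer.seek(0)

# index_col restores consultationId as the index.
restored = pd.read_csv(buffer, index_col='consultationId')

print(toy.dtypes)
print(restored.dtypes)
```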