In [245]:
import pandas as pd
from pandas.api.types import CategoricalDtype
import category_encoders as ce # run `pip install category_encoders`
from sklearn.preprocessing import StandardScaler

## Data Import and Initial Preprocessing

In this step, we import the necessary CSV files, each representing a unique dataset. The list of files includes 'q1', 'q2', 'q3', 'q4', 'q5', and 'q6'. We load each file into a separate pandas DataFrame.

One important aspect of our preprocessing at this stage is handling the 'timestamp' column. In our datasets, 'timestamp' is only relevant in the 'q4' file where it denotes the time each diagnostic activity was executed. Therefore, we conditionally remove the 'timestamp' column from all DataFrames except for 'q4'.

Each resulting DataFrame is stored in a dictionary, `dataframes`, using its filename as the key for efficient access and manipulation in the subsequent stages.

In [246]:
file_names = ['q1', 'q2', 'q3', 'q4', 'q5', 'q6']
dataframes = {}

for file in file_names:
    file_path = f'data/{file}.csv'
    df = pd.read_csv(file_path)

    # If the file is not 'q4', drop the 'timestamp' column
    if file != 'q4':
        df = df.drop('timestamp', axis=1)

    df = df.truncate(after=1000)  # Temporary: keep rows with index label <= 1000 (to be removed)
    dataframes[file] = df

## Feature Engineering: Extract Temporal Diagnostic Activity Features

In this section, we perform feature engineering on the 'timestamp' field to extract valuable temporal information about each diagnostic activity. The temporal features we derive are:

1. **Year**: The year the diagnostic activity was performed. This can help detect yearly trends in the data.
2. **Month**: The month the diagnostic activity was performed. This can help identify any monthly patterns.
3. **Day of Week**: The day of the week the diagnostic activity was performed. This can reveal weekly trends, such as certain activities being more common on certain days of the week.
4. **Week of Year**: The ISO week number of the year the diagnostic activity was performed. This can provide a more granular view of yearly trends.
5. **Time Since Last Activity**: The time in seconds since the last diagnostic activity for each consultation. This can help gauge the frequency of activities.
6. **Elapsed Time**: The time in seconds since the first diagnostic activity in each consultation. This can provide insight into the duration of consultations.
7. **Season of the Year**: The season (Winter, Spring, Summer, Autumn) when the diagnostic activity was performed. This can help identify seasonal trends, such as certain activities being more common in certain seasons.

After extracting these features, we drop the original 'timestamp' column as it has been fully utilised. The resulting dataframe now contains several new features that provide additional temporal context about each diagnostic activity.
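To make features 5 and 6 concrete, here is a minimal sketch on a toy log. The consultation IDs and timestamps are invented for illustration and are not taken from the q4 data.

```python
import pandas as pd

# Toy log: two consultations with invented timestamps (illustrative only)
toy = pd.DataFrame({
    'consultationId': ['a', 'a', 'a', 'b'],
    'timestamp': pd.to_datetime([
        '2021-07-01 10:00:00',
        '2021-07-01 10:00:02',
        '2021-07-01 10:00:05',
        '2021-07-01 11:00:00',
    ]),
}).sort_values(['consultationId', 'timestamp'])

grouped = toy.groupby('consultationId')['timestamp']

# Seconds since the previous activity in the same consultation (0 for the first one)
toy['timeSinceLastActivitySec'] = grouped.diff().dt.total_seconds().fillna(0)

# Seconds since the first activity in the same consultation
toy['elapsedTimeSec'] = (toy['timestamp'] - grouped.transform('min')).dt.total_seconds()

print(toy[['consultationId', 'timeSinceLastActivitySec', 'elapsedTimeSec']])
```

For consultation 'a' the gaps are 0, 2 and 3 seconds, while the elapsed times are 0, 2 and 5 seconds; consultation 'b' restarts at 0 because both features are computed per consultation.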

In [247]:
q4 = dataframes['q4']
q4['timestamp'] = pd.to_datetime(q4['timestamp'])
q4.sort_values(['consultationId', 'timestamp'], inplace=True)

q4['year'] = q4['timestamp'].dt.year
q4['month'] = q4['timestamp'].dt.month
q4['dayOfWeek'] = q4['timestamp'].dt.dayofweek
q4['weekOfYear'] = q4['timestamp'].dt.isocalendar().week
q4['timeSinceLastActivitySec'] = q4.groupby('consultationId')['timestamp'].diff().dt.total_seconds().fillna(0)
q4['elapsedTimeSec'] = q4.groupby('consultationId')['timestamp'].transform(lambda x: (x - x.min())).dt.total_seconds()

In [248]:
# Derive 'Season of the Year'
# Define a function that maps month to season
def month_to_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Autumn'

# Apply the function to the 'month' column to create the 'season' column
q4['season'] = q4['month'].apply(month_to_season)

q4 = q4.drop(columns=['timestamp'])
q4

,consultationId,activityName,year,month,dayOfWeek,weekOfYear,timeSinceLastActivitySec,elapsedTimeSec,season
36,uid-162504129815798601,VCI.SELECT_MODULE,2021,7,3,26,0.000,0.000,Summer
180,uid-162509672601672329,VCI.SELECT_SUPPLIER,2021,7,3,26,0.000,0.000,Summer
181,uid-162509672601672329,VCI.SELECT_MODULE,2021,7,3,26,0.192,0.192,Summer
182,uid-162509672601672329,VCI.CHECK_VCI_SERIAL_NUMBER,2021,7,3,26,0.311,0.503,Summer
183,uid-162509672601672329,OTX.GET_PERSISTENCE,2021,7,3,26,1.383,1.886,Summer
...,...,...,...,...,...,...,...,...,...
87,uid-162510583719067147,CONSULTATION_START,2021,7,3,26,0.000,0.000,Summer
78,uid-162510599405951750,CONSULTATION_START,2021,7,3,26,0.000,0.000,Summer
86,uid-162510872224217421,CONSULTATION_START,2021,7,3,26,0.000,0.000,Summer
84,uid-162512405989626584,CONSULTATION_START,2021,7,3,26,0.000,0.000,Summer


## Feature Engineering: Removing Outlier Diagnostic Activities

In our dataset, certain diagnostic activities are extremely common and appear in virtually every consultation. They are a routine part of the consultation process, so they carry little diagnostic information and are unlikely to help the model predict recommendations. For instance, the 'CONSULTATION_START' activity is logged in every consultation but says nothing about a specific vehicle issue.

To identify and remove these non-informative activities, we follow a statistical outlier detection approach:

1. **Calculate Commonality**: First, we calculate the commonality score for each activity, which is the frequency of the activity divided by the total number of activities.

2. **Calculate Mean and Standard Deviation**: We then calculate the mean and standard deviation of these commonality scores.

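3. **Identify Outliers**: Any activity whose commonality score lies more than two standard deviations from the mean is considered an outlier. The threshold is borrowed from the empirical rule, which states that for a normal distribution about 95% of the data lies within two standard deviations of the mean. Commonality scores are proportions and are not guaranteed to be normally distributed, so the rule serves here as a heuristic cut-off. Because the scores cannot be negative, the lower bound only matters when it is positive; for the run shown below (mean ≈ 0.0143, std ≈ 0.0239) it is negative, so only the most frequent activities are flagged.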

4. **Remove Outliers**: Finally, we remove these outlier activities from our dataset, leaving us with a set of activities that are varied enough to provide meaningful information for our model.
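The four steps above can be sketched on a small invented activity log. The activity names ('PING', 'A' to 'H') and their counts are made up for illustration only.

```python
import pandas as pd

# Invented activity log: one very frequent activity and several rare ones
activities = pd.Series(
    ['PING'] * 12 + ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    name='activityName',
)

# Step 1: commonality score = occurrences of the activity / total number of rows
scores = activities.value_counts(normalize=True)

# Step 2: mean and (sample) standard deviation of the scores
mean, std = scores.mean(), scores.std()
lower, upper = mean - 2 * std, mean + 2 * std

# Step 3: activities outside the two-standard-deviation band
outliers = scores[(scores < lower) | (scores > upper)]

# Step 4: drop the flagged activities from the log
filtered = activities[~activities.isin(outliers.index)]

print(f'lower={lower:.3f} upper={upper:.3f}')
print(outliers)
```

In this toy log the lower bound is negative, so the `scores < lower` half of the condition cannot match anything and only 'PING' is flagged by the upper bound.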

In [249]:
activity_commonality = q4.value_counts('activityName')/q4['activityName'].count()
activity_commonality = activity_commonality.reset_index()
activity_commonality.columns = ['activityName', 'commonalityScore']

mean = activity_commonality.commonalityScore.mean()
std = activity_commonality.commonalityScore.std()
print(f'MEAN: {mean}  STD: {std}')

lower = mean - (2 * std)
upper = mean + (2 * std)

# Flag activities whose commonality score falls outside [lower, upper].
# Scores are never negative, so when `lower` is negative only the upper bound can match.
outliers_condition = (activity_commonality.commonalityScore < lower) | (upper < activity_commonality.commonalityScore)
most_common_activities = activity_commonality[outliers_condition]

most_common_activities

MEAN: 0.014285714285714282  STD: 0.02392732147353882


,activityName,commonalityScore
0,OTX.GET_PERSISTENCE,0.134865
1,VCI.SELECT_MODULE,0.090909
2,REST.VEHICLE.FLASHWARE,0.062937
3,REST.TRANSFORMATION,0.062937
4,ODO,0.062937


In [250]:
# Remove identified outlier (the most common) activities
q4 = q4[~q4.activityName.isin(most_common_activities.activityName)]
q4

,consultationId,activityName,year,month,dayOfWeek,weekOfYear,timeSinceLastActivitySec,elapsedTimeSec,season
180,uid-162509672601672329,VCI.SELECT_SUPPLIER,2021,7,3,26,0.000,0.000,Summer
182,uid-162509672601672329,VCI.CHECK_VCI_SERIAL_NUMBER,2021,7,3,26,0.311,0.503,Summer
184,uid-162509672601672329,OTX.vehicleid,2021,7,3,26,0.015,1.901,Summer
185,uid-162509672601672329,REST.DECODE_VEHICLE_IDENTIFIER,2021,7,3,26,0.074,1.975,Summer
187,uid-162509672601672329,OTX.vin,2021,7,3,26,0.006,3.131,Summer
...,...,...,...,...,...,...,...,...,...
87,uid-162510583719067147,CONSULTATION_START,2021,7,3,26,0.000,0.000,Summer
78,uid-162510599405951750,CONSULTATION_START,2021,7,3,26,0.000,0.000,Summer
86,uid-162510872224217421,CONSULTATION_START,2021,7,3,26,0.000,0.000,Summer
84,uid-162512405989626584,CONSULTATION_START,2021,7,3,26,0.000,0.000,Summer


## Data Merging: Combining Imported Data

In this step, we merge the imported data into a single dataframe, using `consultationId` as the common identifier.

We use the pandas `merge()` function with `on='consultationId'` and `how='inner'`. An inner join returns only the rows whose key appears in both tables being merged, so after chaining the merges the result contains only consultations that are present in every file. Note that `consultationId` is not unique in every table (for example, q4 holds one row per activity), so each consultation yields one output row for every combination of matching rows.
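The following sketch shows how an inner join behaves when the key is repeated on both sides. The two tables and their values are invented; only the `consultationId` key mirrors the real data.

```python
import pandas as pd

# Invented tables; the values are illustrative only
left = pd.DataFrame({
    'consultationId': ['c1', 'c1', 'c2', 'c3'],
    'activityName': ['X', 'Y', 'X', 'Z'],
})
right = pd.DataFrame({
    'consultationId': ['c1', 'c1', 'c2', 'c4'],
    'FullCode': ['P0001', 'P0002', 'P0003', 'P0004'],
})

# 'c3' and 'c4' appear on one side only and are dropped;
# 'c1' has 2 rows on each side and therefore produces 2 x 2 = 4 rows
merged = left.merge(right, on='consultationId', how='inner')
print(merged)
```

If a one-to-one or one-to-many relationship is expected, the `validate` argument of `merge()` can be used to raise an error when the keys do not match that expectation.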

In [251]:
q1 = dataframes['q1']
q2 = dataframes['q2']
q3 = dataframes['q3']
q5 = dataframes['q5']
q6 = dataframes['q6']

data = q1 \
    .merge(q2, on='consultationId', how='inner') \
    .merge(q3, on='consultationId', how='inner') \
    .merge(q4, on='consultationId', how='inner') \
    .merge(q5, on='consultationId', how='inner') \
    .merge(q6, on='consultationId', how='inner')
data.set_index('consultationId', inplace=True)

data

consultationId,vin,project,ODO,activityName,year,month,dayOfWeek,weekOfYear,timeSinceLastActivitySec,elapsedTimeSec,season,odxShortName,FullCode,StatusCode,dtcODO,C_ODO_VALUE,dealerCode,distributorCode
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,CONSULTATION_START,2021,7,3,26,0.0,0.0,Summer,VCM,B1B5049,8,27280,27287.0,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,CONSULTATION_START,2021,7,3,26,0.0,0.0,Summer,PSM,B1B4649,8,27280,27287.0,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,CONSULTATION_START,2021,7,3,26,0.0,0.0,Summer,PSM,B1B5049,8,27280,27287.0,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,CONSULTATION_START,2021,7,3,26,0.0,0.0,Summer,PSM,B1B5C49,8,27280,27287.0,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,CONSULTATION_START,2021,7,3,26,0.0,0.0,Summer,VCM,B1B4649,8,27280,27287.0,2527A,MJO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
uid-162512405989626584,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J20E,7 km,CONSULTATION_START,2021,7,3,26,0.0,0.0,Summer,0x760,C05D397,104,-1,7.0,x22BB03CF,MazdaDlr
uid-162512405989626584,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J20E,7 km,CONSULTATION_START,2021,7,3,26,0.0,0.0,Summer,0x760,C05D397,104,-1,7.0,x22BB03CF,MazdaDlr
uid-162512405989626584,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J20E,7 km,CONSULTATION_START,2021,7,3,26,0.0,0.0,Summer,0x760,C05D397,104,-1,7.0,x22BB03CF,MazdaDlr
uid-162512405989626584,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J20E,7 km,CONSULTATION_START,2021,7,3,26,0.0,0.0,Summer,0x760,C05D397,104,-1,7.0,x22BB03CF,MazdaDlr


## Data Cleaning: Removing Duplicate Records

In this step of the data preprocessing, we aim to remove any duplicate entries in the dataset.

We utilize the `drop_duplicates()` function from pandas library for this purpose. The `inplace=True` parameter ensures that the operation is performed on the dataset directly, without the need to assign the result to a new variable.
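One detail worth checking here is that `drop_duplicates()` compares column values only and ignores the index, which at this point holds `consultationId`. The sketch below uses an invented frame to show the difference; whether it matters for the real data depends on whether two consultations can share identical values in every column.

```python
import pandas as pd

# Invented frame indexed by consultationId; all rows share the same column values
toy = pd.DataFrame(
    {'activityName': ['X', 'X', 'X'], 'StatusCode': [8, 8, 8]},
    index=pd.Index(['c1', 'c1', 'c2'], name='consultationId'),
)

# Index labels are not compared, so only one of the three rows survives
by_columns_only = toy.drop_duplicates()

# Moving the index into a column makes it part of the comparison
with_id = toy.reset_index().drop_duplicates().set_index('consultationId')

print(len(by_columns_only), len(with_id))
```

With the index included in the comparison, the row for 'c2' is kept alongside a single row for 'c1'.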

In [252]:
data.drop_duplicates(inplace=True)
data

consultationId,vin,project,ODO,activityName,year,month,dayOfWeek,weekOfYear,timeSinceLastActivitySec,elapsedTimeSec,season,odxShortName,FullCode,StatusCode,dtcODO,C_ODO_VALUE,dealerCode,distributorCode
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,CONSULTATION_START,2021,7,3,26,0.000,0.000,Summer,VCM,B1B5049,8,27280,27287.0,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,CONSULTATION_START,2021,7,3,26,0.000,0.000,Summer,PSM,B1B4649,8,27280,27287.0,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,CONSULTATION_START,2021,7,3,26,0.000,0.000,Summer,PSM,B1B5049,8,27280,27287.0,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,CONSULTATION_START,2021,7,3,26,0.000,0.000,Summer,PSM,B1B5C49,8,27280,27287.0,2527A,MJO
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59C,27282 km,CONSULTATION_START,2021,7,3,26,0.000,0.000,Summer,VCM,B1B4649,8,27280,27287.0,2527A,MJO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
uid-162510422438275647,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59K,754 km,VEHICLE_ID,2021,7,3,26,0.013,10.500,Summer,DSC,C05D397,42,-1,754.0,2528,MJO
uid-162510422438275647,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59K,754 km,OTX.vehicleSpecification,2021,7,3,26,0.019,26.219,Summer,DSC,C05D397,42,-1,754.0,2528,MJO
uid-162510422438275647,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59K,754 km,HEALTHCHECK.NETWORK_VIEW,2021,7,3,26,0.007,26.407,Summer,DSC,C05D397,42,-1,754.0,2528,MJO
uid-162510422438275647,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,J59K,754 km,HEALTHCHECK.DTC_LIST,2021,7,3,26,0.005,26.412,Summer,DSC,C05D397,42,-1,754.0,2528,MJO


## Data Cleaning: Handle Missing Values

In our dataset, we've identified that the 'ODO' column has missing values.

Rather than using common statistical methods like replacing missing values with the mean or median, we opt for a more context-aware approach. We replace the missing 'ODO' values with the 'C_ODO_VALUE' which is the current odometer value collected during the same consultation. This provides a reliable substitute that's closely tied to the actual vehicle data.

Additionally, we transform the 'ODO' field to be numerical by removing the 'km' unit formatting.
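As a minimal sketch of this cleaning step, the example below uses three invented rows in the 'NNN km' format; the missing value in the second row is added purely for illustration.

```python
import pandas as pd

# Invented readings; the missing 'ODO' in the second row is illustrative
toy = pd.DataFrame({
    'ODO': ['27282 km', None, '7 km'],
    'C_ODO_VALUE': [27287.0, 754.0, 7.0],
})

# Strip the unit suffix, then convert the remaining text to numbers
odo_numeric = pd.to_numeric(toy['ODO'].str.replace(' km', '', regex=False))

# Fall back to the current odometer value where 'ODO' is missing
toy['ODO'] = odo_numeric.fillna(toy['C_ODO_VALUE'])
print(toy)
```

Converting to a numeric type before filling keeps the column from mixing strings and floats.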

#### FUTURE CONSIDERATION
A more robust option is model-based imputation, such as regression imputation, k-nearest neighbours (KNN), or a decision tree, which predicts the missing 'ODO' values from other fields. This could be particularly useful if the 'ODO' values are missing according to a pattern that is related to other variables in the data.

NOTE: this can only be considered once the CLAIMS data is available.

In [253]:
# Identify which data fields contain missing values
data.isna().any()

vin                         False
project                     False
ODO                         False
activityName                False
year                        False
month                       False
dayOfWeek                   False
weekOfYear                  False
timeSinceLastActivitySec    False
elapsedTimeSec              False
season                      False
odxShortName                False
FullCode                    False
StatusCode                  False
dtcODO                      False
C_ODO_VALUE                 False
dealerCode                  False
distributorCode             False
dtype: bool

In [254]:
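# Strip the ' km' suffix from 'ODO', then fill missing values from 'C_ODO_VALUE'
data['ODO'] = data['ODO'].str.replace(' km', '', regex=False)
# Assign the result back; fillna(inplace=True) on a selected column may not update `data`
data['ODO'] = data['ODO'].fillna(data['C_ODO_VALUE'])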

## Data Cleaning/Encoding: Setting Appropriate Data Types

In this step, we make sure each column uses an appropriate data type, which supports efficient data manipulation and accurate model training.

The code block below sets the data types as follows:

* The 'ODO', 'dtcODO', 'timeSinceLastActivitySec', and 'elapsedTimeSec' columns are set to float64, as they contain continuous numerical data.
* The 'year', 'month', 'dayOfWeek', and 'weekOfYear' columns are set to CategoricalDtype(ordered=True), since they contain categorical temporal data with a natural order.
* The 'project', 'activityName', 'odxShortName', 'FullCode', 'StatusCode', 'dealerCode', and 'distributorCode' columns are set to CategoricalDtype(ordered=False), since they contain categorical data without a natural order.
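The difference between the ordered and unordered variants can be seen in a short sketch; the sample values below are invented.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

months = pd.Series([7, 1, 12, 7])

# Without explicit categories, pandas infers them from the observed values in sorted order
ordered = months.astype(CategoricalDtype(ordered=True))
print(list(ordered.cat.categories))
print(ordered.min(), ordered.max())  # defined because the dtype is ordered

# An unordered categorical keeps the labels but defines no ordering between them
unordered = pd.Series(['J59C', 'J20E', 'J59C']).astype(CategoricalDtype(ordered=False))
print(unordered.cat.ordered)
```

Only values that occur in the data become categories. If the full calendar range is needed (for example all twelve months), the categories can be passed explicitly, as in `CategoricalDtype(categories=range(1, 13), ordered=True)`.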

In [255]:
data['ODO'] = data['ODO'].astype('float64')
data['dtcODO'] = data['dtcODO'].astype('float64')
data['timeSinceLastActivitySec'] = data['timeSinceLastActivitySec'].astype('float64')
data['elapsedTimeSec'] = data['elapsedTimeSec'].astype('float64')

data['year'] = data['year'].astype(CategoricalDtype(ordered=True))
data['month'] = data['month'].astype(CategoricalDtype(ordered=True))
data['dayOfWeek'] = data['dayOfWeek'].astype(CategoricalDtype(ordered=True))
data['weekOfYear'] = data['weekOfYear'].astype(CategoricalDtype(ordered=True))

data['project'] = data['project'].astype(CategoricalDtype(ordered=False))
data['activityName'] = data['activityName'].astype(CategoricalDtype(ordered=False))
data['odxShortName'] = data['odxShortName'].astype(CategoricalDtype(ordered=False))
data['FullCode'] = data['FullCode'].astype(CategoricalDtype(ordered=False))
data['StatusCode'] = data['StatusCode'].astype(CategoricalDtype(ordered=False))
data['dealerCode'] = data['dealerCode'].astype(CategoricalDtype(ordered=False))
data['distributorCode'] = data['distributorCode'].astype(CategoricalDtype(ordered=False))

data.dtypes

vin                           object
project                     category
ODO                          float64
activityName                category
year                           int64
month                       category
dayOfWeek                   category
weekOfYear                  category
timeSinceLastActivitySec     float64
elapsedTimeSec               float64
season                        object
odxShortName                category
FullCode                    category
StatusCode                  category
dtcODO                       float64
C_ODO_VALUE                  float64
dealerCode                  category
distributorCode             category
dtype: object

## Data Encoding: Encode Categorical Features

The model requires numerical input, so we convert the categorical variables to a numerical format using binary encoding. The encoded variables are 'project', 'activityName', 'odxShortName', 'FullCode', 'StatusCode', 'dealerCode', 'distributorCode', 'dayOfWeek', 'weekOfYear', 'month', and 'season'.

Binary encoding sits between ordinal and one-hot encoding. Each category is first mapped to an integer, the integer is written in binary, and each binary digit becomes its own column. A variable with n categories therefore needs roughly log2(n) columns rather than n, which makes binary encoding more space efficient than one-hot encoding, especially for high-cardinality variables.

We use `BinaryEncoder` from the category_encoders package for this task. The `return_df` parameter is set to True, so the encoder returns a pandas DataFrame.
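To illustrate the ordinal-to-binary-digits idea, here is a hand-rolled sketch in plain pandas. The helper `binary_encode` is hypothetical and is not the category_encoders implementation; the library's exact code assignment and column count may differ.

```python
import pandas as pd

def binary_encode(values: pd.Series) -> pd.DataFrame:
    """Illustrative binary encoding: ordinal code -> binary digits -> columns."""
    # Step 1: ordinal codes starting at 1, in order of first appearance
    codes, uniques = pd.factorize(values)
    codes = codes + 1

    # Step 2: number of binary digits needed for the largest code
    n_bits = len(uniques).bit_length()

    # Step 3: one column per binary digit, most significant digit first
    bits = {
        f'{values.name}_{i}': (codes >> (n_bits - 1 - i)) & 1
        for i in range(n_bits)
    }
    return pd.DataFrame(bits, index=values.index)

season = pd.Series(['Summer', 'Winter', 'Spring', 'Summer', 'Autumn'], name='season')
encoded = binary_encode(season)
print(encoded)
```

Here 'Summer' receives code 1 (binary 001) and 'Autumn' code 4 (binary 100), so four categories fit into three columns. The saving over one-hot encoding grows with the number of categories.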


In [256]:
encoder = ce.BinaryEncoder(
    cols=[
        'project', 'activityName', 'odxShortName', 'FullCode', 'StatusCode',
        'dealerCode', 'distributorCode', 'dayOfWeek', 'weekOfYear', 'month', 'season',
    ],
    return_df=True,
)
data = encoder.fit_transform(data)
data

consultationId,vin,project_0,project_1,project_2,ODO,activityName_0,activityName_1,activityName_2,activityName_3,activityName_4,...,StatusCode_2,StatusCode_3,dtcODO,C_ODO_VALUE,dealerCode_0,dealerCode_1,dealerCode_2,dealerCode_3,distributorCode_0,distributorCode_1
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,27282.0,0,0,0,0,0,...,0,1,27280.0,27287.0,0,0,0,1,0,1
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,27282.0,0,0,0,0,0,...,0,1,27280.0,27287.0,0,0,0,1,0,1
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,27282.0,0,0,0,0,0,...,0,1,27280.0,27287.0,0,0,0,1,0,1
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,27282.0,0,0,0,0,0,...,0,1,27280.0,27287.0,0,0,0,1,0,1
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,27282.0,0,0,0,0,0,...,0,1,27280.0,27287.0,0,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
uid-162510422438275647,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,1,1,754.0,0,0,0,1,1,...,0,1,-1.0,754.0,1,0,0,0,0,1
uid-162510422438275647,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,1,1,754.0,0,0,1,0,0,...,0,1,-1.0,754.0,1,0,0,0,0,1
uid-162510422438275647,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,1,1,754.0,0,0,1,0,0,...,0,1,-1.0,754.0,1,0,0,0,0,1
uid-162510422438275647,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,1,1,754.0,0,0,1,0,1,...,0,1,-1.0,754.0,1,0,0,0,0,1


## Data Normalisation: Standardise Numerical Data

In this step, we standardise the 'elapsedTimeSec', 'timeSinceLastActivitySec', 'ODO', 'dtcODO', and 'C_ODO_VALUE' columns. These columns hold continuous numerical data (durations and odometer readings) measured on very different scales.

We use sklearn's StandardScaler for this task. It standardises each feature by subtracting its mean and dividing by its standard deviation, so every scaled column has mean 0 and standard deviation 1. The transformation only shifts and rescales the values; it does not change the shape of their distribution or make them normally distributed.

This puts the features on a comparable scale, so that none of them dominates the model merely because of its units or magnitude.
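The arithmetic behind standardisation can be reproduced by hand. The sketch below uses NumPy on four invented odometer readings and the population standard deviation (`ddof=0`), which is the variant StandardScaler uses.

```python
import numpy as np

# Invented odometer readings in km (illustrative only)
odo = np.array([7.0, 754.0, 27282.0, 27287.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0)
z = (odo - odo.mean()) / odo.std(ddof=0)

print(z.round(3))
print(round(float(z.mean()), 6), round(float(z.std(ddof=0)), 6))
```

The scaled values have mean 0 and standard deviation 1, but their relative spacing is unchanged: the two low readings and the two high readings still form two separate clusters.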

In [257]:
data_scaler = StandardScaler()
numeric_cols = ['elapsedTimeSec', 'timeSinceLastActivitySec', 'ODO', 'dtcODO', 'C_ODO_VALUE']
data[numeric_cols] = data_scaler.fit_transform(data[numeric_cols])
data

consultationId,vin,project_0,project_1,project_2,ODO,activityName_0,activityName_1,activityName_2,activityName_3,activityName_4,...,StatusCode_2,StatusCode_3,dtcODO,C_ODO_VALUE,dealerCode_0,dealerCode_1,dealerCode_2,dealerCode_3,distributorCode_0,distributorCode_1
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,1.482517,0,0,0,0,0,...,0,1,1.483277,1.482133,0,0,0,1,0,1
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,1.482517,0,0,0,0,0,...,0,1,1.483277,1.482133,0,0,0,1,0,1
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,1.482517,0,0,0,0,0,...,0,1,1.483277,1.482133,0,0,0,1,0,1
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,1.482517,0,0,0,0,0,...,0,1,1.483277,1.482133,0,0,0,1,0,1
uid-162509762110151072,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,0,1,1.482517,0,0,0,0,0,...,0,1,1.483277,1.482133,0,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
uid-162510422438275647,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,1,1,-2.750520,0,0,0,1,1,...,0,1,-2.865516,-2.751690,1,0,0,0,0,1
uid-162510422438275647,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,1,1,-2.750520,0,0,1,0,0,...,0,1,-2.865516,-2.751690,1,0,0,0,0,1
uid-162510422438275647,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,1,1,-2.750520,0,0,1,0,0,...,0,1,-2.865516,-2.751690,1,0,0,0,0,1
uid-162510422438275647,1A914EAC99CE2399BFB1C60E70BFB0B81475AF25694CF8...,0,1,1,-2.750520,0,0,1,0,1,...,0,1,-2.865516,-2.751690,1,0,0,0,0,1
