## Standardise the feature values in the dataset

Scale the features so that their values are on a comparable scale.

The following steps were carried out:

- 1 Add the target dataset back into the feature dataset
- 2 Preparation of datasets for scaling
- 3 Carry out standardisation - via StandardScaler
- 4 Carry out standardisation - via MinMaxScaler
- 5 Decide which standardised dataset to use

### 1 Add the target dataset back into the feature dataset

In [82]:
#check dataframe structure after adding target column back

df_no_nan.shape

(15062, 15)
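
The cell that re-attached the target column is not shown in this section. A minimal sketch of one way it could be done is given below; the names `features`, `target` and `df_with_target` and the toy values are hypothetical, and only the column name `drivingStyle` comes from the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the feature dataframe and the target series
features = pd.DataFrame(
    {"EngineLoad": [15.7, 29.4, 43.9], "Car": [0, 1, 1]},
    index=[10, 11, 12],
)
target = pd.Series([1, 0, 1], index=[10, 11, 12], name="drivingStyle")

# join aligns on the index, so each row keeps its own target value
df_with_target = features.join(target)

print(df_with_target.shape)
```

Joining on the index rather than on position avoids introducing NaNs when rows have been dropped from one of the two objects beforehand.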

In [83]:
#statistical check of the updated dataframe

df_no_nan.describe()

| | AltitudeVariation | VehicleSpeedInstantaneous | VehicleSpeedAverage | VehicleSpeedVariance | VehicleSpeedVariation | LongitudinalAcceleration | EngineLoad | EngineCoolantTemperature | ManifoldAbsolutePressure | IntakeAirTemperature | VerticalAcceleration | FuelConsumptionAverage | Car | Journey | drivingStyle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 15062.0 | 15062.0 | 15062.0 | 15062.0 | 15062.0 | 15062.0 | 15062.0 | 15062.0 | 15062.0 | 15062.0 | 15062.0 | 15062.0 | 15062.0 | 15062.0 | 15062.0 |
| mean | -0.320529 | 29.279122 | 29.21803 | 145.174728 | 0.009229 | 0.371484 | 32.548708 | 71.816558 | 109.14925 | 22.969858 | -0.196887 | 15.486597 | 0.551188 | 2.463551 | 0.889855 |
| std | 1.495935 | 25.447507 | 21.251911 | 117.795082 | 1.434002 | 0.868683 | 24.586986 | 13.027753 | 10.69832 | 9.339776 | 0.549763 | 4.118056 | 0.497389 | 1.155625 | 0.31308 |
| min | -4.5 | 0.0 | 0.0 | 0.0 | -3.600002 | -2.1018 | 0.0 | 36.0 | 88.0 | 8.0 | -1.7402 | 7.307876 | 0.0 | 1.0 | 0.0 |
| 25% | -1.200012 | 5.4 | 14.227623 | 50.828606 | -0.9 | -0.273 | 15.686275 | 64.0 | 102.0 | 16.0 | -0.52885 | 12.47135 | 0.0 | 1.0 | 1.0 |
| 50% | -0.099976 | 27.0 | 24.422261 | 113.302748 | 0.0 | 0.288 | 29.411766 | 79.0 | 105.0 | 20.0 | -0.085 | 14.875591 | 1.0 | 3.0 | 1.0 |
| 75% | 0.5 | 45.0 | 40.828782 | 212.241537 | 0.899998 | 0.9921 | 43.92157 | 80.0 | 111.0 | 30.0 | 0.1914 | 18.232169 | 1.0 | 3.0 | 1.0 |
| max | 3.800049 | 119.570579 | 101.354997 | 506.421168 | 3.600002 | 3.1712 | 100.0 | 87.0 | 147.0 | 53.0 | 1.3462 | 28.05439 | 1.0 | 4.0 | 1.0 |


In [84]:
#check the feature mean values

df_no_nan.mean(axis=0) #check mean of each feature

AltitudeVariation             -0.320529
VehicleSpeedInstantaneous     29.279122
VehicleSpeedAverage           29.218030
VehicleSpeedVariance         145.174728
VehicleSpeedVariation          0.009229
LongitudinalAcceleration       0.371484
EngineLoad                    32.548708
EngineCoolantTemperature      71.816558
ManifoldAbsolutePressure     109.149250
IntakeAirTemperature          22.969858
VerticalAcceleration          -0.196887
FuelConsumptionAverage        15.486597
Car                            0.551188
Journey                        2.463551
drivingStyle                   0.889855
dtype: float64

In [85]:
#check the feature std values

df_no_nan.std(axis=0)#check standard deviation of each feature

AltitudeVariation              1.495935
VehicleSpeedInstantaneous     25.447507
VehicleSpeedAverage           21.251911
VehicleSpeedVariance         117.795082
VehicleSpeedVariation          1.434002
LongitudinalAcceleration       0.868683
EngineLoad                    24.586986
EngineCoolantTemperature      13.027753
ManifoldAbsolutePressure      10.698320
IntakeAirTemperature           9.339776
VerticalAcceleration           0.549763
FuelConsumptionAverage         4.118056
Car                            0.497389
Journey                        1.155625
drivingStyle                   0.313080
dtype: float64

### 2 Preparation of datasets for scaling

In [86]:
#split into training and test sets before scaling so that the scaler learns its parameters from the training data only

from sklearn.model_selection import train_test_split #used to create the training and test datasets


In [87]:
from sklearn.preprocessing import StandardScaler  #apply sklearn standardisation 

In [88]:
# split the dataframe

X = df_no_nan.drop('drivingStyle', axis=1)

y = df_no_nan['drivingStyle']

In [89]:
#separate the dataset into training and test datasets

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
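
The mean of `drivingStyle` reported earlier (about 0.89) suggests the two classes are imbalanced. The sketch below shows how the class proportions in each split could be checked, and how `stratify` keeps them equal. It uses synthetic data with hypothetical names (`X_demo`, `y_demo`); `stratify` is an optional extra that the split above does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: one feature and a target that is roughly 89% ones
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})
y_demo = pd.Series(
    np.r_[np.ones(890), np.zeros(110)].astype(int), name="drivingStyle"
)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Share of the positive class in each split
print(y_tr.mean(), y_te.mean())
```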

In [90]:
#check the shape of the X train & test dataframes

X_train.shape, X_test.shape

((10543, 14), (4519, 14))

In [91]:
X_train.isna().sum()

AltitudeVariation            0
VehicleSpeedInstantaneous    0
VehicleSpeedAverage          0
VehicleSpeedVariance         0
VehicleSpeedVariation        0
LongitudinalAcceleration     0
EngineLoad                   0
EngineCoolantTemperature     0
ManifoldAbsolutePressure     0
IntakeAirTemperature         0
VerticalAcceleration         0
FuelConsumptionAverage       0
Car                          0
Journey                      0
dtype: int64

In [92]:
X_train.head()

| | AltitudeVariation | VehicleSpeedInstantaneous | VehicleSpeedAverage | VehicleSpeedVariance | VehicleSpeedVariation | LongitudinalAcceleration | EngineLoad | EngineCoolantTemperature | ManifoldAbsolutePressure | IntakeAirTemperature | VerticalAcceleration | FuelConsumptionAverage | Car | Journey |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18750 | 1.400002 | 96.299995 | 96.179997 | 15.800953 | 0.900002 | 0.1727 | 47.84314 | 80.0 | 113.0 | 46.0 | -0.2608 | 9.154309 | 1 | 3 |
| 22396 | 0.100002 | 25.199999 | 21.824999 | 121.270032 | -0.9 | -0.1879 | 40.784313 | 79.0 | 103.0 | 21.0 | 0.409 | 14.886612 | 1 | 4 |
| 9359 | -2.800003 | 0.0 | 18.404712 | 254.973762 | 0.0 | 0.27 | 24.705883 | 81.0 | 100.0 | 24.0 | -0.152 | 20.711361 | 0 | 2 |
| 12208 | 0.700001 | 35.099998 | 35.279999 | 64.986708 | 2.700001 | 0.6048 | 76.470589 | 52.0 | 113.0 | 22.0 | -1.3545 | 12.310544 | 1 | 3 |
| 9055 | 0.600037 | 37.154701 | 52.11791 | 56.820701 | -1.57626 | 0.5469 | 0.0 | 80.0 | 103.0 | 12.0 | 0.133 | 11.456865 | 0 | 2 |


### 3 Carry out standardisation - via StandardScaler

In [93]:
#standardisation using StandardScaler from scikit-learn

scaler = StandardScaler()

#fit the scaler to train set, it will learn the parameters

scaler.fit(X_train)

#transform the train and test datasets

X_trained_scaled = scaler.transform(X_train)

X_test_scaled = scaler.transform(X_test)

In [94]:
#check the mean for each feature learned from the train set

scaler.mean_

array([-3.18400905e-01,  2.93729229e+01,  2.93731544e+01,  1.45053721e+02,
        6.75272741e-03,  3.72436043e-01,  3.27764361e+01,  7.17840273e+01,
        1.09143223e+02,  2.29265864e+01, -1.97515707e-01,  1.54789478e+01,
        5.50981694e-01,  2.46732429e+00])

In [95]:
#the scaler stores the standard deviation of the features learned from the train set

scaler.scale_

array([  1.50012504,  25.56477338,  21.34861851, 117.68489966,
         1.42598694,   0.86997405,  24.7057078 ,  13.08099093,
        10.66732743,   9.36573971,   0.54937312,   4.12249323,
         0.49739408,   1.1569638 ])

In [96]:
#transform the numpy arrays into dataframes to check the statistics

X_trained_scaled = pd.DataFrame(X_trained_scaled, columns=X_train.columns)

X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

In [97]:
#summary statistics of the original (unscaled) X_train, rounded to 1 d.p.

np.round(X_train.describe(),1)

| | AltitudeVariation | VehicleSpeedInstantaneous | VehicleSpeedAverage | VehicleSpeedVariance | VehicleSpeedVariation | LongitudinalAcceleration | EngineLoad | EngineCoolantTemperature | ManifoldAbsolutePressure | IntakeAirTemperature | VerticalAcceleration | FuelConsumptionAverage | Car | Journey |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 |
| mean | -0.3 | 29.4 | 29.4 | 145.1 | 0.0 | 0.4 | 32.8 | 71.8 | 109.1 | 22.9 | -0.2 | 15.5 | 0.6 | 2.5 |
| std | 1.5 | 25.6 | 21.3 | 117.7 | 1.4 | 0.9 | 24.7 | 13.1 | 10.7 | 9.4 | 0.5 | 4.1 | 0.5 | 1.2 |
| min | -4.5 | 0.0 | 0.0 | 0.0 | -3.6 | -2.1 | 0.0 | 36.0 | 88.0 | 8.0 | -1.7 | 7.3 | 0.0 | 1.0 |
| 25% | -1.2 | 5.4 | 14.3 | 50.9 | -0.9 | -0.3 | 15.7 | 64.0 | 102.0 | 16.0 | -0.5 | 12.5 | 0.0 | 1.0 |
| 50% | -0.1 | 27.0 | 24.5 | 112.9 | 0.0 | 0.3 | 29.4 | 79.0 | 105.0 | 20.0 | -0.1 | 14.9 | 1.0 | 3.0 |
| 75% | 0.5 | 45.0 | 40.9 | 211.7 | 0.9 | 1.0 | 44.7 | 80.0 | 111.0 | 30.0 | 0.2 | 18.3 | 1.0 | 3.0 |
| max | 3.8 | 112.1 | 101.4 | 506.1 | 3.6 | 3.2 | 100.0 | 87.0 | 147.0 | 53.0 | 1.3 | 28.1 | 1.0 | 4.0 |


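The table above summarises the original `X_train`, not the scaled `X_trained_scaled`, so it still shows the unscaled means and standard deviations and cannot confirm whether StandardScaler brought the means to 0 and the standard deviations to 1. I will also try scaling the dataset with a different scaler.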
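
To check the effect of StandardScaler, the statistics need to be taken from the scaled dataframe, which in this notebook would be `np.round(X_trained_scaled.describe(), 1)`. The sketch below illustrates the check on synthetic data; `X_demo` and `demo_scaler` are hypothetical names, not objects from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on different scales
X_demo = pd.DataFrame(
    {
        "speed": rng.uniform(0, 120, size=500),
        "load": rng.uniform(0, 100, size=500),
    }
)

demo_scaler = StandardScaler().fit(X_demo)
X_demo_scaled = pd.DataFrame(
    demo_scaler.transform(X_demo), columns=X_demo.columns
)

# Inspect the SCALED frame, not the original one
scaled_mean = X_demo_scaled.mean()
scaled_std_population = X_demo_scaled.std(ddof=0)  # what StandardScaler normalises
scaled_std_sample = X_demo_scaled.std()            # what describe() reports (ddof=1)

print(np.round(X_demo_scaled.describe(), 1))
```

StandardScaler divides by the population standard deviation (`ddof=0`), while pandas reports the sample standard deviation (`ddof=1`), so `describe()` on scaled data shows a value very slightly above 1 before rounding.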

### 4 Carry out standardisation - via MinMaxScaler

In [98]:
#add required library

from sklearn.preprocessing import MinMaxScaler

In [99]:
#setup the scaler

scaler = MinMaxScaler()

#fit the scaler to the train set, it will learn the parameters

scaler.fit(X_train)

#transform train and test sets

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)



In [100]:
#the scaler stores the maximum values of the features learned from the train set

scaler.data_max_

array([  3.80004882, 112.1396408 , 101.3549971 , 506.0896559 ,
         3.60000229,   3.1712    , 100.        ,  87.        ,
       147.        ,  53.        ,   1.3128    ,  28.05438995,
         1.        ,   4.        ])

In [101]:
#min_ is the per-feature offset added during scaling (-data_min_ / data_range_ for the default 0 to 1 range), not the raw minimum; the raw minimum values are stored in data_min_

scaler.min_

array([ 0.54216549,  0.        ,  0.        ,  0.        ,  0.5       ,
        0.39458964,  0.        , -0.70588235, -1.49152542, -0.17777778,
        0.56999672, -0.35228707,  0.        , -0.33333333])
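
The values above are offsets, not raw minima. For example, EngineCoolantTemperature has a training minimum of 36 and a range of 51, and -36 / 51 ≈ -0.706 matches its entry in the array. The sketch below shows how `min_`, `scale_` and `data_min_` relate for the default feature range; the toy array and the name `demo_scaler` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: two features with different ranges
X_demo = np.array([[-4.5, 36.0], [0.0, 64.0], [3.8, 87.0]])

demo_scaler = MinMaxScaler().fit(X_demo)

# For feature_range=(0, 1): scale_ = 1 / data_range_ and min_ = -data_min_ * scale_
offset = -demo_scaler.data_min_ * demo_scaler.scale_

# The transform is X * scale_ + min_
manual = X_demo * demo_scaler.scale_ + demo_scaler.min_

print("raw minima:", demo_scaler.data_min_)
print("offsets   :", demo_scaler.min_)
```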

In [102]:
#the scaler also stores the value range (max - min)

scaler.data_range_

array([  8.30004882, 112.1396408 , 101.3549971 , 506.0896559 ,
         7.20000458,   5.2381    , 100.        ,  51.        ,
        59.        ,  45.        ,   3.053     ,  20.74588346,
         1.        ,   3.        ])

In [103]:
#transform the returned NumPy array to dataframes for further analysis

X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

In [104]:
#check the original training dataset min & max values using np.round to reduce the number of decimal places to 1

np.round(X_train.describe(), 1)

| | AltitudeVariation | VehicleSpeedInstantaneous | VehicleSpeedAverage | VehicleSpeedVariance | VehicleSpeedVariation | LongitudinalAcceleration | EngineLoad | EngineCoolantTemperature | ManifoldAbsolutePressure | IntakeAirTemperature | VerticalAcceleration | FuelConsumptionAverage | Car | Journey |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 |
| mean | -0.3 | 29.4 | 29.4 | 145.1 | 0.0 | 0.4 | 32.8 | 71.8 | 109.1 | 22.9 | -0.2 | 15.5 | 0.6 | 2.5 |
| std | 1.5 | 25.6 | 21.3 | 117.7 | 1.4 | 0.9 | 24.7 | 13.1 | 10.7 | 9.4 | 0.5 | 4.1 | 0.5 | 1.2 |
| min | -4.5 | 0.0 | 0.0 | 0.0 | -3.6 | -2.1 | 0.0 | 36.0 | 88.0 | 8.0 | -1.7 | 7.3 | 0.0 | 1.0 |
| 25% | -1.2 | 5.4 | 14.3 | 50.9 | -0.9 | -0.3 | 15.7 | 64.0 | 102.0 | 16.0 | -0.5 | 12.5 | 0.0 | 1.0 |
| 50% | -0.1 | 27.0 | 24.5 | 112.9 | 0.0 | 0.3 | 29.4 | 79.0 | 105.0 | 20.0 | -0.1 | 14.9 | 1.0 | 3.0 |
| 75% | 0.5 | 45.0 | 40.9 | 211.7 | 0.9 | 1.0 | 44.7 | 80.0 | 111.0 | 30.0 | 0.2 | 18.3 | 1.0 | 3.0 |
| max | 3.8 | 112.1 | 101.4 | 506.1 | 3.6 | 3.2 | 100.0 | 87.0 | 147.0 | 53.0 | 1.3 | 28.1 | 1.0 | 4.0 |


In [105]:
#check the scaled training dataset min & max values using np.round and 1 d.p

np.round(X_train_scaled.describe(), 1)

| | AltitudeVariation | VehicleSpeedInstantaneous | VehicleSpeedAverage | VehicleSpeedVariance | VehicleSpeedVariation | LongitudinalAcceleration | EngineLoad | EngineCoolantTemperature | ManifoldAbsolutePressure | IntakeAirTemperature | VerticalAcceleration | FuelConsumptionAverage | Car | Journey |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 | 10543.0 |
| mean | 0.5 | 0.3 | 0.3 | 0.3 | 0.5 | 0.5 | 0.3 | 0.7 | 0.4 | 0.3 | 0.5 | 0.4 | 0.6 | 0.5 |
| std | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.3 | 0.2 | 0.2 | 0.2 | 0.2 | 0.5 | 0.4 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.4 | 0.0 | 0.1 | 0.1 | 0.4 | 0.3 | 0.2 | 0.5 | 0.2 | 0.2 | 0.4 | 0.2 | 0.0 | 0.0 |
| 50% | 0.5 | 0.2 | 0.2 | 0.2 | 0.5 | 0.5 | 0.3 | 0.8 | 0.3 | 0.3 | 0.5 | 0.4 | 1.0 | 0.7 |
| 75% | 0.6 | 0.4 | 0.4 | 0.4 | 0.6 | 0.6 | 0.4 | 0.9 | 0.4 | 0.5 | 0.6 | 0.5 | 1.0 | 0.7 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


MinMaxScaler has converted the maximum value of every feature in the training set to 1 and the minimum value to 0.
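
The 0 to 1 bounds are only guaranteed for the data the scaler was fitted on. In this dataset the overall maximum of VehicleSpeedInstantaneous is about 119.6 while the training maximum is about 112.1, so at least one test row will scale above 1. The sketch below reproduces that effect on toy values; the array names are hypothetical, and `clip=True` (available in scikit-learn 0.24 and later) is shown only as an optional way to bound the output, not something the notebook uses.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy values: the test set contains a speed above the training maximum
train_speed = np.array([[0.0], [45.0], [112.1]])
test_speed = np.array([[27.0], [119.6]])

# Fit on the training data only, then transform the test data
demo_scaler = MinMaxScaler().fit(train_speed)
test_scaled = demo_scaler.transform(test_speed)

# Optional: clip transformed values to the 0 to 1 range
clipped = MinMaxScaler(clip=True).fit(train_speed).transform(test_speed)

print(test_scaled.ravel())
print(clipped.ravel())
```

A quick `X_test_scaled.describe()` would show whether any test feature falls outside the 0 to 1 range.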


### 5 Decide which standardised dataset to use

Because MinMaxScaler mapped every feature in the training set to a minimum of 0 and a maximum of 1, the MinMaxScaler output will be used for modelling with the classifier algorithms.
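
One way to carry the chosen scaler into the modelling step is to wrap it in a pipeline with the classifier, so that the scaler is always fitted on the training data only. The sketch below is an illustration on synthetic data; `LogisticRegression` is an arbitrary stand-in, since the classifiers used later are not named in this section.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic features and a binary target that depends on the first two features
X_demo = rng.uniform(0, 100, size=(400, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 100).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# fit() scales with parameters learned from X_tr only, then trains the classifier
model = make_pipeline(MinMaxScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

# score() applies the same fitted scaler to X_te before predicting
accuracy = model.score(X_te, y_te)
print(round(accuracy, 3))
```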