# Practice Materials
* If you have any questions, please contact me!
* sol0917@unist.ac.kr

## Week 7. Driving Process Management Practice

- Fuel management is an important management factor for bus transportation companies because it lets them reduce costs.

- Showing bus drivers an intuitive score is one way to manage their fuel consumption.

- Therefore, we will develop a scoring function for the eco-driving level (EDL) using pre-processed bus driving data.

- The EDL score ranges from 0 to 100, where 0 represents good fuel-usage performance and 100 represents bad fuel-usage performance.

![screensh](https://drive.google.com/uc?export=view&id=1Nv4mxYSYcNeOLBXz73ds1gpEgUdRLMd8)

- In South Korea, the law requires a digital tachograph (DTG) to be installed in vehicles used for transportation businesses, such as city buses and taxis.

- While the vehicle is being driven, the DTG collects driving records and related information about the vehicle and driver, such as speed, acceleration, brake signal, GPS information, and the driver's information.

- A driver's driving process may be implicit in the DTG data.

![screensh](https://movingon.blog.gov.uk/wp-content/uploads/sites/45/2014/04/digital-tachograph.jpg)








### 1. Data preparation

#### 1.1 Data import

In [None]:
# Load packages
import io
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# If you are a Colab user, run this cell
# Upload the data file to Colab
from google.colab import files
file_uploaded = files.upload()
df = pd.read_csv(io.BytesIO(file_uploaded['Driving_Event_Fuel_data_preprocessed.csv']))

In [None]:
# If you are not a Colab user (e.g., Jupyter Notebook or VS Code), copy the CSV file into the same directory as this notebook, then run this cell
# Read the CSV file
df = pd.read_csv('Driving_Event_Fuel_data_preprocessed.csv')

#### 1.2 Data description
- Each value is the **count** of one driving behavior
- Each row represents a trip, that is, a single driving operation (from the start to the end of driving)
- `file_index` and `bus_num` can be ignored.

##### * Driving behaviors affecting fuel usage (derived from previous research)
- Sharp Acceleration (SA)
- Sharp Deceleration (SD)
- Prolonged Acceleration (PA)
- Prolonged Idling (PI)
- Low-speed Running (LR)
- High-speed Cruising (HC)

##### * Target variable
 - Usage of fuel per kilometer (L/km)

In [None]:
# Show a sample of the DataFrame
df.head(10)

In [None]:
# Show summary statistics



In [None]:
#Check the distribution of each variable and target variable.



### 2. Data preprocessing

#### 2.1 Drop unused columns

#### 2.2 Split train and test datasets
- The train and test datasets are split at a ratio of 8:2
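
A minimal sketch of an 8:2 split, assuming scikit-learn's `train_test_split`. The small DataFrame is made-up stand-in data (its feature column names are hypothetical), and `random_state=42` is an arbitrary choice for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the trip data (random values, not real DTG data)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    rng.integers(0, 20, size=(50, 3)), columns=["SA", "SD", "PA"]
)
demo_df["fuel"] = rng.uniform(0.2, 0.6, size=50)

# test_size=0.2 keeps 80% of the rows for training and 20% for testing
demo_train, demo_test = train_test_split(demo_df, test_size=0.2, random_state=42)
print(len(demo_train), len(demo_test))
```

Splitting before scaling matters here, because the scaler in the next step should see only the training rows.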

#### 2.3 Min-max scaler for independent variables
- Normalization: convert each input variable separately to the range 0-1
- Reason: to prevent the magnitude of each variable's values from affecting the estimate of that variable's impact (i.e., its coefficient)
- $x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$
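
The formula above can be sketched with scikit-learn's `MinMaxScaler`. The two tiny DataFrames are made-up numbers, and the names `train_demo`, `test_demo`, and `scaler` are illustrative only. The scaler is fitted on the training data and then reused on the test data, so $x_{min}$ and $x_{max}$ always come from the training set.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical counts for two behaviors (made-up numbers)
train_demo = pd.DataFrame({"SA": [0, 5, 10], "PI": [2, 4, 6]})
test_demo = pd.DataFrame({"SA": [5, 20], "PI": [2, 3]})

# Fit on the training set only, so x_min and x_max come from training data
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = pd.DataFrame(
    scaler.fit_transform(train_demo),
    columns=train_demo.columns,
    index=train_demo.index,
)

# Reuse the fitted scaler on the test set
test_scaled = pd.DataFrame(
    scaler.transform(test_demo),
    columns=test_demo.columns,
    index=test_demo.index,
)
print(train_scaled)
print(test_scaled)
```

Test values outside the training range (such as `SA = 20` here) fall outside 0-1 after scaling, which is expected when the scaler is not refitted.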


In [None]:
#import minmax scaler from sklearn.preprocessing



In [None]:
# Write a function that scales the training dataset with a min-max scaler
def train_minmaxscaler(df, x_col, feature_range):
    
    return scale_df, x_scaler

In [None]:
x_col = [i for i in train_df.columns if i not in ['fuel']]
feature_range = (0,1)
train_df, x_scaler = train_minmaxscaler(train_df, x_col, feature_range)

In [None]:
train_df.head(5)

In [None]:
train_df.describe()

In [None]:
# Write a function that scales the test dataset with the fitted min-max scaler
def test_minmaxscaler(df, x_col, x_scaler):
    
    
    return scale_df

In [None]:
# Use the x_scaler fitted on the training dataset.
test_df = test_minmaxscaler(test_df, x_col, x_scaler)

In [None]:
test_df.head(5)

In [None]:
test_df.describe()

### 3. Estimation of scoring function using Linear Regression

![screensh](https://miro.medium.com/max/1160/1*Jfx203VYFtcM958gbFsiXA.png)

- Scikit-learn linear regression
- Statsmodels linear regression (Ordinary Least Squares)

#### 3.0 Define Metric
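
As a minimal sketch of the two metrics, assuming scikit-learn's `mean_squared_error` and `r2_score`: RMSE is the square root of the mean squared error, and R-squared is the share of target variance explained by the predictions. The helper name `rmse_score` and the four sample values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def rmse_score(y_true, y_pred):
    # Root mean squared error: square root of the MSE, in the target's units
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Made-up values: three exact predictions and one that is off by 2
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 2.0]

print(rmse_score(y_true, y_pred))
print(r2_score(y_true, y_pred))
```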

In [None]:
from sklearn.metrics import r2_score, mean_squared_error

In [None]:
# define r2_score, mean_squared_error



#### 3.1 Statsmodels linear regression (Ordinary Least Squares)

In [None]:
import statsmodels.api as sm

In [None]:
def stat_OLS(train_df, target, drop_for_x=None):
    # separate the independent variables and the dependent variable in the training dataset
    if drop_for_x is not None:
        drop_list = [target, drop_for_x]
    else:
        drop_list = [target]
    train_x = train_df.drop(columns=drop_list)
    train_y = train_df[[target]]

    #add intercept term into train_x

    #define the model

    #train the model

    #store the prediction of model
    train_pred = results.predict(train_x)
    new_df = pd.DataFrame()
    new_df["prediction"] = train_pred

    #print the result
    print('Stats Linear Regression')
    print(results.summary())
    return new_df, results

In [None]:
OLS_result_df, score_function = stat_OLS(train_df, "fuel")

In [None]:
OLS_result_df.describe()

#### 3.3 Result of fuel usage estimation
- Usage of fuel = $0.954 + 47.715 \times SA - 16.458 \times SD + 14.595 \times PA + 5.365 \times PI - 6.563 \times LR + 24.91 \times HC$
- Range of predicted values: 2.452 to 29.953
- R-squared: 0.909

#### 3.4 Scaling for score function
- Normalization: convert the target variable (fuel usage) to the range 0-100, using the minimum and maximum of the predicted values
- $y_{scaled} = 100 \times \frac{y - y_{min}}{y_{max} - y_{min}}$
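
The scaling formula can be sketched as a small function. The name `scale_to_score` is hypothetical, and the three sample values are chosen only to show the two endpoints and the midpoint of the predicted range reported in section 3.3.

```python
import pandas as pd

def scale_to_score(y, y_min, y_max):
    # Map y linearly so that y_min becomes 0 and y_max becomes 100
    return 100 * (y - y_min) / (y_max - y_min)

# Minimum and maximum predicted values reported in section 3.3
y_min, y_max = 2.452, 29.953

# Endpoints and midpoint of the predicted range
fuel = pd.Series([2.452, 16.2025, 29.953])
print(scale_to_score(fuel, y_min, y_max))
```

Because `y_min` and `y_max` come from the predictions, an observed fuel value outside that range maps to a score below 0 or above 100.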

In [None]:
y_max = OLS_result_df["prediction"].max()
y_min = OLS_result_df["prediction"].min()

In [None]:
def score_minmaxscaler(df, y_max, y_min):
    # add code
    
    return scale_df

In [None]:
scaled_train_df = score_minmaxscaler(train_df, y_max, y_min)

In [None]:
scaled_OLS_result_df, score_function = stat_OLS(scaled_train_df, "normalized_fuel", drop_for_x = "fuel")

In [None]:
scaled_OLS_result_df.describe()

#### 3.5 Score function for eco-driving level

- Eco-driving level = $-5.449 + 173.51 \times SA - 59.846 \times SD + 53.072 \times PA + 19.509 \times PI - 23.866 \times LR + 94.176 \times HC$
- Range of predicted values: 0 to 100
- R-squared: 0.909
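
Assuming `normalized_fuel` is computed from `fuel` with the formula in section 3.4, the score is a linear rescaling of fuel usage. The coefficients in section 3.5 should then follow from those in section 3.3 without refitting: each slope is multiplied by $100 / (y_{max} - y_{min})$, and the intercept $b_0$ becomes $100 (b_0 - y_{min}) / (y_{max} - y_{min})$. The sketch below checks this arithmetic with the reported numbers; the variable names are illustrative.

```python
# Coefficients and predicted range reported in section 3.3
intercept = 0.954
coefs = {"SA": 47.715, "SD": -16.458, "PA": 14.595,
         "PI": 5.365, "LR": -6.563, "HC": 24.91}
y_min, y_max = 2.452, 29.953

# Linear rescaling factor from fuel units to the 0-100 score
factor = 100 / (y_max - y_min)

score_intercept = (intercept - y_min) * factor
score_coefs = {name: value * factor for name, value in coefs.items()}

print(round(score_intercept, 2))
print({name: round(value, 2) for name, value in score_coefs.items()})
```

With the reported values, the intercept and five of the six slopes agree with section 3.5 to within rounding (SA, for example, comes out at about 173.50). The HC slope is the exception: 24.91 rescales to about 90.58 rather than 94.176, so one of the two reported HC values may contain a typo.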

### 4. Validation of scoring function


In [None]:
#substitute test dataset into our score function for validation

#add intercept term into test_x

#prediction for test dataset

#calculate r2 and rmse score of prediction 

print("rmse : {}".format(rmse))
print("r2 : {0}".format(r2))

### 5. Checking the validation result with visualization
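
One common way to check a regression visually is to plot predicted values against actual values. The sketch below assumes matplotlib and uses made-up numbers; the function name `plot_actual_vs_predicted` and the axis labels are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_actual_vs_predicted(y_true, y_pred):
    # Scatter the predictions against the observed values
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(y_true, y_pred, s=10, alpha=0.6)
    # Points on the dashed diagonal are predicted exactly
    low = min(np.min(y_true), np.min(y_pred))
    high = max(np.max(y_true), np.max(y_pred))
    ax.plot([low, high], [low, high], linestyle="--", color="gray")
    ax.set_xlabel("Actual score")
    ax.set_ylabel("Predicted score")
    return ax

# Made-up values for illustration only
rng = np.random.default_rng(2)
actual = rng.uniform(0, 100, size=100)
predicted = actual + rng.normal(0, 5, size=100)
ax = plot_actual_vs_predicted(actual, predicted)
```

The tighter the points cluster around the diagonal, the closer the predictions are to the observed values.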

In [None]:
#print plot

### 6. (Additional) XGBoost & SHAP


In [None]:
#!pip install shap
import shap
from xgboost import XGBRegressor, plot_importance 

#### 6.1 XGBoost - fuel prediction

In [None]:
train_x = train_df.drop(columns = 'fuel')
train_y = train_df[['fuel']]

test_x = test_df.drop(columns = 'fuel')
test_y = test_df[['fuel']]

In [None]:
#define the model
XGB = XGBRegressor()

#train the model

#predict with the model

#test the model

#calculate r2 and rmse score of prediction 

print("rmse : {}".format(rmse))
print("r2 : {0}".format(r2))

#### 6.1.1 Scaling for score function

In [None]:
# define y_max, y_min
y_max = None  # fill in: maximum of the predicted values
y_min = None  # fill in: minimum of the predicted values

In [None]:
# scale the train data with scaled target variable


# scale the test data with scaled target variable



In [None]:
# define new model. 

# train the model

# predict with the model

# test the model

# calculate r2 and rmse score of prediction 

print('\ntest RMSE: {:.4f} | test R-squared: {:.4f}'.format(rmse, r2))

In [None]:
# Checking the validation result with visualization


#### 6.2 SHAP Model

In [None]:
# define the SHAP explainer


#### 6.3 SHAP Analysis

In [None]:
# SHAP analysis for one data sample
data_index = None  # fill in: row index of the sample to explain


In [None]:
# SHAP analysis for several data samples
data_index_start = 0
data_index_end = 100


In [None]:
# SHAP analysis for the whole test dataset


In [None]:
# Mean of shap values per variable
