### Reduce the time a Mercedes-Benz spends on the test bench.

### Problem Statement Scenario:
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before it hits the road, the company’s engineers have developed a robust testing system. Because Mercedes-Benz is one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on its production lines. However, optimizing the speed of the testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.

### The following actions should be performed:

 - If for any column(s), the variance is equal to zero, then you need to remove those variable(s).
 - Check for null and unique values for test and train sets.
 - Apply label encoder.
 - Perform dimensionality reduction.
 - Predict your test_df values using XGBoost.
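
The first two actions (removing zero-variance columns and checking unique values) can be sketched as below. This is a minimal illustration on made-up toy data, not the notebook's own code: the helper name `drop_zero_variance` and the toy frames are assumptions introduced here. A column with exactly one distinct value in the training set has zero variance, so counting unique values covers both checks at once.

```python
import pandas as pd

def drop_zero_variance(train_df, test_df, target="y"):
    """Drop feature columns that hold a single value in the training set.

    Hypothetical helper for illustration; `target` names the label column.
    """
    features = train_df.drop(columns=[target])
    # nunique() == 1 means the column is constant, so its variance is zero.
    # Counting unique values also works for string columns.
    is_constant = features.nunique() == 1
    constant_cols = is_constant[is_constant].index.tolist()
    return (train_df.drop(columns=constant_cols),
            test_df.drop(columns=constant_cols),
            constant_cols)

# Toy data standing in for the real train/test frames (made-up values).
toy_train = pd.DataFrame({"y": [90.0, 100.0, 110.0],
                          "X0": ["k", "az", "k"],
                          "X11": [0, 0, 0],
                          "X12": [0, 1, 0]})
toy_test = pd.DataFrame({"X0": ["az", "k", "t"],
                         "X11": [0, 1, 0],
                         "X12": [1, 1, 0]})

toy_train_reduced, toy_test_reduced, dropped = drop_zero_variance(toy_train, toy_test)
print(dropped)  # ['X11']
```

The decision is based on the training set only, and the same columns are then dropped from the test set so both frames keep identical feature columns.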

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

In [2]:
train = pd.read_csv(r'A:\MachineLearning\Project1\train.csv', index_col = 'ID')
test = pd.read_csv(r'A:\MachineLearning\Project1\test.csv', index_col = 'ID')

In [3]:
train.head()

ID,y,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,130.81,k,v,at,a,d,u,j,o,0,...,0,0,1,0,0,0,0,0,0,0
6,88.53,k,t,av,e,d,y,l,o,0,...,1,0,0,0,0,0,0,0,0,0
7,76.26,az,w,n,c,d,x,j,x,0,...,0,0,0,0,0,0,1,0,0,0
9,80.62,az,t,n,f,d,x,l,e,0,...,0,0,0,0,0,0,0,0,0,0
13,78.02,az,v,n,f,d,h,d,n,0,...,0,0,0,0,0,0,0,0,0,0


Replacing strings with numbers in the train and test dataframes. For each string-valued feature, a combined list of the unique strings found in both train and test data is prepared before the strings are replaced with numbers. This ensures that each string is mapped to the same number in both train and test data.

In [5]:
for col in train.columns:
    # only the string-valued (categorical) columns need encoding
    if not pd.api.types.is_numeric_dtype(train[col]):

        # unique strings across train and test, in order of first appearance
        unique_values = pd.concat([train[col], test[col]]).unique()

        # map each string to an integer code shared by both dataframes
        map_dict = {value: code for code, value in enumerate(unique_values)}
        train[col] = train[col].map(map_dict)
        test[col] = test[col].map(map_dict)
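
The task list asks for a label encoder, and the loop above builds the shared mapping by hand. As an alternative sketch, the same idea can be expressed with scikit-learn's `LabelEncoder`, fitted on the combined train and test values of each column. The helper name `label_encode_shared` and the toy frames are assumptions made for this illustration. Note that `LabelEncoder` assigns codes in sorted order of the labels, so the integers differ from the order-of-appearance codes produced above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def label_encode_shared(train_df, test_df):
    """Encode string columns with one LabelEncoder per column,
    fitted on train and test values together (hypothetical helper)."""
    train_enc, test_enc = train_df.copy(), test_df.copy()
    for col in test_df.columns:
        if not pd.api.types.is_numeric_dtype(train_df[col]):
            encoder = LabelEncoder()
            # Fitting on both sets guarantees that every label seen in
            # either set receives the same code in both.
            encoder.fit(pd.concat([train_df[col], test_df[col]]))
            train_enc[col] = encoder.transform(train_df[col])
            test_enc[col] = encoder.transform(test_df[col])
    return train_enc, test_enc

# Toy data (made-up values); "t" appears only in the test set.
toy_train = pd.DataFrame({"y": [90.0, 100.0, 110.0], "X0": ["k", "az", "k"]})
toy_test = pd.DataFrame({"X0": ["az", "t", "k"]})

toy_train_enc, toy_test_enc = label_encode_shared(toy_train, toy_test)
print(toy_train_enc["X0"].tolist())  # [1, 0, 1]
print(toy_test_enc["X0"].tolist())   # [0, 2, 1]
```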

In [19]:
train.head()

ID,y,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,130.81,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
6,88.53,0,1,1,1,0,1,1,0,0,...,1,0,0,0,0,0,0,0,0,0
7,76.26,1,2,2,2,0,2,0,1,0,...,0,0,0,0,0,0,1,0,0,0
9,80.62,1,1,2,3,0,2,1,2,0,...,0,0,0,0,0,0,0,0,0,0
13,78.02,1,0,2,3,0,3,2,3,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
#Checking if train or test data has any NaN value. Also, getting summary of data
print(train.isnull().values.any())
print(test.isnull().values.any())
train.describe()

False
False


,y,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
count,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,...,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0
mean,100.669318,12.110715,6.467569,7.851509,2.415301,0.002138,16.839629,3.031124,11.457591,0.013305,...,0.318841,0.057258,0.314802,0.02067,0.009503,0.008078,0.007603,0.001663,0.000475,0.001426
std,12.679381,8.315637,4.789927,5.644031,1.361654,0.0739,6.357474,2.554581,7.040194,0.11459,...,0.466082,0.232363,0.464492,0.142294,0.097033,0.089524,0.086872,0.040752,0.021796,0.037734
min,72.11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,90.82,6.0,3.0,4.0,2.0,0.0,11.0,1.0,5.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,99.15,11.0,6.0,7.0,2.0,0.0,17.0,2.0,12.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,109.01,15.0,7.0,10.0,3.0,0.0,22.0,6.0,17.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,265.32,46.0,26.0,43.0,6.0,3.0,28.0,11.0,24.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Splitting features and labels, then applying min-max scaling to the features. The scaler is fitted on the training features only and reused to transform the test set.

In [7]:
X_train = train.iloc[:,1:]
y_train = train.iloc[:,0]

from sklearn.preprocessing import MinMaxScaler
scalling = MinMaxScaler().fit(X_train)
X_train_scalled = scalling.transform(X_train)
test_scalled = scalling.transform(test)



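Regression with a linear model. The cross-validation scores (R², the default for `cross_val_score` on a regressor) are hugely negative on every fold, which shows that the plain linear model performs very poorly.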

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

reg = LinearRegression()
print(cross_val_score(reg, X_train_scalled, y_train, cv=10)) 

[-1.36765404e+21 -3.80496550e+18 -1.23861964e+20 -3.76773521e+19
 -6.45164466e+15 -4.18090587e+19 -2.75279312e+20 -1.41088404e+20
 -2.51704139e+19 -4.04043030e+20]
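
Scores of this size are possible because R² is defined as 1 - SS_res / SS_tot and has no lower bound: a single prediction that is far off on a held-out fold can make SS_res enormous. The sketch below computes R² by hand on made-up numbers to show the effect. The function name `r2_score_manual` and the values are assumptions for illustration, not data from the notebook.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed with plain Python."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [90.0, 100.0, 110.0]

# Perfect predictions give exactly 1.
perfect = r2_score_manual(y_true, y_true)

# Always predicting the mean gives exactly 0.
baseline = r2_score_manual(y_true, [100.0, 100.0, 100.0])

# One wildly wrong prediction drives the score to a huge negative value.
wild = r2_score_manual(y_true, [100.0, 100.0, 1e12])

print(perfect, baseline, wild)
```

A negative score therefore means the model is worse than always predicting the mean of the target, and the magnitude reflects how far off the worst predictions are.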


Regression with a Lasso model (L1 regularization). As the number of features is very large, Lasso regularization shrinks the weights of unimportant features, often to exactly zero, and in turn reduces their contribution to the final regression model. The grid search below scores each value of `alpha` by cross-validated R².

In [9]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

grid_values = {'alpha': [0.0235, 0.024, 0.0245]}
grid_lasso = GridSearchCV(Lasso(), param_grid = grid_values, cv=10, scoring = 'r2')
grid_lasso.fit(X_train_scalled, y_train)
predict_lasso = grid_lasso.predict(test_scalled)

print('Mean score matrix: ', grid_lasso.cv_results_['mean_test_score'])
print('Grid best parameter (max. accuracy): ', grid_lasso.best_params_)
print('Grid best score (accuracy): ', grid_lasso.best_score_)

Mean score matrix:  [0.56297736 0.56298228 0.56297693]
Grid best parameter (max. accuracy):  {'alpha': 0.024}
Grid best score (accuracy):  0.5629822810707514
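
To see the feature-selection effect of L1 regularization, one can count how many coefficients the fitted model sets to exactly zero. The sketch below does this on synthetic data in which only two of twenty features carry signal. The data, the `alpha` value, and the variable names are assumptions chosen for illustration. In the notebook, the analogous coefficients would be available as `grid_lasso.best_estimator_.coef_`.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only the first two influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha is an arbitrary illustrative value, not the notebook's tuned one.
lasso = Lasso(alpha=0.5).fit(X, y)

# Coefficients that are exactly zero correspond to dropped features.
n_nonzero = int(np.sum(lasso.coef_ != 0))
print(f"{n_nonzero} of {X.shape[1]} coefficients are non-zero")
```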


In [10]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

grid_values = {'alpha': [40, 40.5, 41]}
grid_ridge = GridSearchCV(Ridge(), param_grid = grid_values, cv=10, scoring = 'r2')
grid_ridge.fit(X_train_scalled, y_train)
predict_ridge = grid_ridge.predict(test_scalled)

print('Mean score matrix: ', grid_ridge.cv_results_['mean_test_score'])
print('Grid best parameter (max. accuracy): ', grid_ridge.best_params_)
print('Grid best score (accuracy): ', grid_ridge.best_score_)

Mean score matrix:  [0.55376436 0.55376492 0.55376475]
Grid best parameter (max. accuracy):  {'alpha': 40.5}
Grid best score (accuracy):  0.5537649195276827


In [13]:
!pip install xgboost

Collecting xgboost
  Downloading https://files.pythonhosted.org/packages/97/ef/05245964011e4fc5aa0d86e2285a41de122ee1c30d69df05ecfd594bd608/xgboost-1.5.2-py3-none-win_amd64.whl (106.6MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.5.2


In [11]:
import xgboost as xgb

grid_values = {'n_estimators': [74,75,76], 'learning_rate': [0.13,0.135,0.14], 'max_depth': [1,2,3]}
grid_xgb = GridSearchCV(xgb.XGBRegressor(), param_grid = grid_values, cv=10, scoring = 'r2')
grid_xgb.fit(X_train_scalled, y_train)
predict_xgb = grid_xgb.predict(test_scalled)

print('Mean score matrix: ', grid_xgb.cv_results_['mean_test_score'])
print('Grid best parameter (max. accuracy): ', grid_xgb.best_params_)
print('Grid best score (accuracy): ', grid_xgb.best_score_)

Mean score matrix:  [0.55597401 0.55620147 0.55675942 0.58165304 0.5818088  0.58176478
 0.57938243 0.57941404 0.57938819 0.55721329 0.55745942 0.55774667
 0.58204994 0.58214103 0.58202328 0.57727201 0.57720782 0.57713946
 0.55807455 0.55816391 0.55836724 0.58153668 0.58149164 0.58166668
 0.57784533 0.57784512 0.57784363]
Grid best parameter (max. accuracy):  {'learning_rate': 0.135, 'max_depth': 2, 'n_estimators': 75}
Grid best score (accuracy):  0.5821410290719221


In [15]:
from sklearn.decomposition import PCA

n_comp = 12
pca = PCA(n_components=n_comp, random_state=420)
pca2_results_train = pca.fit_transform(X_train)



In [20]:
pca2_results_test = pca.transform(test)

In [17]:
pca2_results_train.shape

(4209, 12)
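
The PCA above is fitted on the unscaled `X_train`, where label-encoded columns such as X0 span 0 to 46 while the binary columns span 0 to 1. Because PCA follows variance, wide-range columns can dominate the leading components. The sketch below shows how `explained_variance_ratio_` reveals this on synthetic data shaped like that mix. The data and names are assumptions for illustration, and whether this accounts for the scores obtained with the PCA features is not established by the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in for X_train: one wide-range column (0..46)
# next to nine binary flag columns.
wide = rng.integers(0, 47, size=(500, 1)).astype(float)
flags = rng.integers(0, 2, size=(500, 9)).astype(float)
X_demo = np.hstack([wide, flags])

pca_demo = PCA(n_components=3, random_state=420).fit(X_demo)

# Share of the total variance captured by each component, and cumulatively.
ratios = pca_demo.explained_variance_ratio_
cumulative = np.cumsum(ratios)
print(ratios)
print(cumulative)
```

Inspecting the cumulative ratio is one common way to choose `n_components`. Fitting PCA on the scaled features instead would give every column a comparable range before the components are computed.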

In [21]:
import xgboost as xgb

grid_values = {'n_estimators': [74,75,76], 'learning_rate': [0.13,0.135,0.14], 'max_depth': [1,2,3]}
grid_xgb = GridSearchCV(xgb.XGBRegressor(), param_grid = grid_values, cv=10, scoring = 'r2')
grid_xgb.fit(pca2_results_train, y_train)
predict_xgb = grid_xgb.predict(pca2_results_test)

print('Mean score matrix: ', grid_xgb.cv_results_['mean_test_score'])
print('Grid best parameter (max. accuracy): ', grid_xgb.best_params_)
print('Grid best score (accuracy): ', grid_xgb.best_score_)

Mean score matrix:  [0.33980706 0.34025943 0.34087325 0.40286799 0.40281615 0.40300438
 0.42522907 0.4249557  0.42545702 0.3412049  0.34215943 0.34318138
 0.4004157  0.40096845 0.40118354 0.41900022 0.41951963 0.41991271
 0.34289618 0.34386447 0.34457812 0.40151868 0.40173057 0.40183854
 0.42486474 0.42500232 0.42522631]
Grid best parameter (max. accuracy):  {'learning_rate': 0.13, 'max_depth': 3, 'n_estimators': 76}
Grid best score (accuracy):  0.42545701833705274
