## Project 1 - Mercedes-Benz Greener Manufacturing

#### DESCRIPTION

*Reduce the time a Mercedes-Benz spends on the test bench.*

---

#### Problem Statement Scenario:

    Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with the crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Mercedes-Benz cars are leaders in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

    To ensure the safety and reliability of every unique car configuration before they hit the road, Daimler’s engineers have developed a robust testing system. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on Daimler’s production lines. However, optimizing the speed of their testing system for many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

    You are required to reduce the time that cars spend on the test bench. Others will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.
    
 ---

#### Following actions should be performed:

    * If for any column(s), the variance is equal to zero, then you need to remove those variable(s).
    * Check for null and unique values for test and train sets
    * Apply label encoder.
    * Perform dimensionality reduction.
    * Predict your test_df values using xgboost

### **Loading libraries and Understanding Data**

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
import xgboost as xgb
from xgboost import XGBRFRegressor
from sklearn.metrics import r2_score

import warnings
warnings.filterwarnings('ignore')

In [2]:
train = pd.read_csv(r'E:/Simplilearn/Machine Learning/Projects/Projects for Submission/Project 1 - Mercedes-Benz Greener Manufacturing/Dataset for the project/train.csv')
test = pd.read_csv(r'E:/Simplilearn/Machine Learning/Projects/Projects for Submission/Project 1 - Mercedes-Benz Greener Manufacturing/Dataset for the project/test.csv')

In [3]:
train.head()

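|   | ID | y | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 130.81 | k | v | at | a | d | u | j | o | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 6 | 88.53 | k | t | av | e | d | y | l | o | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 7 | 76.26 | az | w | n | c | d | x | j | x | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 9 | 80.62 | az | t | n | f | d | x | l | e | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 13 | 78.02 | az | v | n | f | d | h | d | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |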


In [4]:
test.head()

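|   | ID | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | X10 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | az | v | n | f | d | t | a | w | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | t | b | ai | a | d | b | g | y | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | az | v | as | f | d | a | j | j | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | az | l | n | f | d | z | l | n | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | w | s | as | c | d | y | i | m | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |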


In [5]:
print(train.shape)
print(test.shape)

(4209, 378)
(4209, 377)


### **If for any column(s), the variance is equal to zero, then you need to remove those variable(s).**

In [6]:
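# numeric_only=True skips the string columns (X0..X8), which recent pandas versions refuse to average
train.var(numeric_only=True)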

ID      5.941936e+06
y       1.607667e+02
X10     1.313092e-02
X11     0.000000e+00
X12     6.945713e-02
            ...     
X380    8.014579e-03
X382    7.546747e-03
X383    1.660732e-03
X384    4.750593e-04
X385    1.423823e-03
Length: 370, dtype: float64

In [7]:
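# numeric_only=True skips the string columns (X0..X8), which recent pandas versions refuse to average
test.var(numeric_only=True)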

ID      5.871311e+06
X10     1.865006e-02
X11     2.375861e-04
X12     6.885074e-02
X13     5.734498e-02
            ...     
X380    8.014579e-03
X382    8.715481e-03
X383    4.750593e-04
X384    7.124196e-04
X385    1.660732e-03
Length: 369, dtype: float64

In [8]:
# Flag the numeric columns whose variance in the training set is exactly zero
n = pd.DataFrame({"a": train.var(numeric_only=True) == 0})
dr = n[n.a].T
list(dr.columns)

['X11',
 'X93',
 'X107',
 'X233',
 'X235',
 'X268',
 'X289',
 'X290',
 'X293',
 'X297',
 'X330',
 'X347']

In [9]:
train.drop(columns=list(dr.columns),inplace=True)
test.drop(columns=list(dr.columns),inplace=True)
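
The zero-variance filter above relies on `var()`, which only looks at numeric columns. A minimal alternative sketch, assuming toy frames and a hypothetical helper name `drop_constant_columns`, counts distinct values instead, so constant string columns would be caught as well. As in the notebook, the decision is made on the training set and the same columns are dropped from the test set.

```python
import pandas as pd


def drop_constant_columns(train_df, test_df):
    """Drop columns that are constant in train_df from both frames.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # A column with at most one distinct value carries no information.
    constant_cols = [
        col for col in train_df.columns
        if train_df[col].nunique(dropna=False) <= 1
    ]
    train_kept = train_df.drop(columns=constant_cols)
    # errors="ignore" tolerates columns that are absent from the test frame.
    test_kept = test_df.drop(columns=constant_cols, errors="ignore")
    return train_kept, test_kept, constant_cols


# Toy frames standing in for the real data (values are made up).
toy_train = pd.DataFrame({"X10": [0, 1, 0], "X11": [0, 0, 0], "X0": ["k", "az", "k"]})
toy_test = pd.DataFrame({"X10": [1, 1, 0], "X11": [0, 1, 0], "X0": ["t", "az", "w"]})

toy_train_kept, toy_test_kept, dropped = drop_constant_columns(toy_train, toy_test)
print(dropped)
```

Note that `X11` varies in the toy test frame but is still dropped, because a model cannot learn anything from a feature that never changes during training.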

### **Checking for NaN values in the train and test sets**

In [10]:
print("Missing values in train -",train.isnull().values.any())
print("Missing values in test -",test.isnull().values.any())

Missing values in train - False
Missing values in test - False
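
Both checks report no missing values, so nothing needs to be imputed in this dataset. If NaNs did appear, a per-column count and a mean fill could look like the sketch below. The toy frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (made-up values, not the project data).
toy = pd.DataFrame({
    "X10": [0.0, np.nan, 1.0],
    "X12": [1.0, 1.0, np.nan],
    "X0": ["k", "az", "k"],
})

# Per-column count of missing values, more informative than one True/False.
missing_per_column = toy.isnull().sum()

# Column means of the numeric columns only; string columns are skipped.
numeric_means = toy.mean(numeric_only=True)

# fillna with a Series fills each column using the value under the same label.
toy_filled = toy.fillna(numeric_means)
print(missing_per_column)
print(toy_filled)
```

With separate train and test sets, the means would normally be computed on the training set and reused for the test set, so that both are filled with the same values.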


### Unique Values in Train and Test datasets

In [11]:
# train: collect the number of unique values and the unique values of every column

columns = []
counts = []
unique = []
for i in train.columns:
    a = train[i].nunique()
    b = train[i].unique()
    columns.append(i)
    counts.append(a)
    unique.append(b)
    print("The no. of unique values of {} column is = {} \n  & unique values of {} column is = \n {} \n".format(i, a, i, b))

The no. of unique values of ID column is = 4209 
  & unique values of ID column is = 
 [   0    6    7 ... 8412 8415 8417] 

The no. of unique values of y column is = 2545 
  & unique values of y column is = 
 [130.81  88.53  76.26 ...  85.71 108.77  87.48] 

The no. of unique values of X0 column is = 47 
  & unique values of X0 column is = 
 ['k' 'az' 't' 'al' 'o' 'w' 'j' 'h' 's' 'n' 'ay' 'f' 'x' 'y' 'aj' 'ak' 'am'
 'z' 'q' 'at' 'ap' 'v' 'af' 'a' 'e' 'ai' 'd' 'aq' 'c' 'aa' 'ba' 'as' 'i'
 'r' 'b' 'ax' 'bc' 'u' 'ad' 'au' 'm' 'l' 'aw' 'ao' 'ac' 'g' 'ab'] 

The no. of unique values of X1 column is = 27 
  & unique values of X1 column is = 
 ['v' 't' 'w' 'b' 'r' 'l' 's' 'aa' 'c' 'a' 'e' 'h' 'z' 'j' 'o' 'u' 'p' 'n'
 'i' 'y' 'd' 'f' 'm' 'k' 'g' 'q' 'ab'] 

The no. of unique values of X2 column is = 44 
  & unique values of X2 column is = 
 ['at' 'av' 'n' 'e' 'as' 'aq' 'r' 'ai' 'ak' 'm' 'a' 'k' 'ae' 's' 'f' 'd'
 'ag' 'ay' 'ac' 'ap' 'g' 'i' 'aw' 'y' 'b' 'ao' 'al' 'h' 'x' 'au' 't' 'an'
 'z' 'ah

The no. of unique values of X131 column is = 2 
  & unique values of X131 column is = 
 [1 0] 

The no. of unique values of X132 column is = 2 
  & unique values of X132 column is = 
 [0 1] 

The no. of unique values of X133 column is = 2 
  & unique values of X133 column is = 
 [0 1] 

The no. of unique values of X134 column is = 2 
  & unique values of X134 column is = 
 [0 1] 

The no. of unique values of X135 column is = 2 
  & unique values of X135 column is = 
 [0 1] 

The no. of unique values of X136 column is = 2 
  & unique values of X136 column is = 
 [1 0] 

The no. of unique values of X137 column is = 2 
  & unique values of X137 column is = 
 [1 0] 

The no. of unique values of X138 column is = 2 
  & unique values of X138 column is = 
 [0 1] 

The no. of unique values of X139 column is = 2 
  & unique values of X139 column is = 
 [0 1] 

The no. of unique values of X140 column is = 2 
  & unique values of X140 column is = 
 [0 1] 

The no. of unique values of X141 column 

The no. of unique values of X358 column is = 2 
  & unique values of X358 column is = 
 [0 1] 

The no. of unique values of X359 column is = 2 
  & unique values of X359 column is = 
 [0 1] 

The no. of unique values of X360 column is = 2 
  & unique values of X360 column is = 
 [0 1] 

The no. of unique values of X361 column is = 2 
  & unique values of X361 column is = 
 [1 0] 

The no. of unique values of X362 column is = 2 
  & unique values of X362 column is = 
 [0 1] 

The no. of unique values of X363 column is = 2 
  & unique values of X363 column is = 
 [0 1] 

The no. of unique values of X364 column is = 2 
  & unique values of X364 column is = 
 [0 1] 

The no. of unique values of X365 column is = 2 
  & unique values of X365 column is = 
 [0 1] 

The no. of unique values of X366 column is = 2 
  & unique values of X366 column is = 
 [0 1] 

The no. of unique values of X367 column is = 2 
  & unique values of X367 column is = 
 [0 1] 

The no. of unique values of X368 column 

In [12]:
# test: collect the number of unique values and the unique values of every column

columns_test = []
counts_test = []
unique_test = []
for i in test.columns:
    a = test[i].nunique()
    b = test[i].unique()
    columns_test.append(i)
    counts_test.append(a)
    unique_test.append(b)
    print("The no. of unique values of {} column is = {} \n  & unique values of {} column is = \n {} \n".format(i, a, i, b))

The no. of unique values of ID column is = 4209 
  & unique values of ID column is = 
 [   1    2    3 ... 8413 8414 8416] 

The no. of unique values of X0 column is = 49 
  & unique values of X0 column is = 
 ['az' 't' 'w' 'y' 'x' 'f' 'ap' 'o' 'ay' 'al' 'h' 'z' 'aj' 'd' 'v' 'ak'
 'ba' 'n' 'j' 's' 'af' 'ax' 'at' 'aq' 'av' 'm' 'k' 'a' 'e' 'ai' 'i' 'ag'
 'b' 'am' 'aw' 'as' 'r' 'ao' 'u' 'l' 'c' 'ad' 'au' 'bc' 'g' 'an' 'ae' 'p'
 'bb'] 

The no. of unique values of X1 column is = 27 
  & unique values of X1 column is = 
 ['v' 'b' 'l' 's' 'aa' 'r' 'a' 'i' 'p' 'c' 'o' 'm' 'z' 'e' 'h' 'w' 'g' 'k'
 'y' 't' 'u' 'd' 'j' 'q' 'n' 'f' 'ab'] 

The no. of unique values of X2 column is = 45 
  & unique values of X2 column is = 
 ['n' 'ai' 'as' 'ae' 's' 'b' 'e' 'ak' 'm' 'a' 'aq' 'ag' 'r' 'k' 'aj' 'ay'
 'ao' 'an' 'ac' 'af' 'ax' 'h' 'i' 'f' 'ap' 'p' 'au' 't' 'z' 'y' 'aw' 'd'
 'at' 'g' 'am' 'j' 'x' 'ab' 'w' 'q' 'ah' 'ad' 'al' 'av' 'u'] 

The no. of unique values of X3 column is = 7 
  & unique values of X3

The no. of unique values of X174 column is = 2 
  & unique values of X174 column is = 
 [0 1] 

The no. of unique values of X175 column is = 2 
  & unique values of X175 column is = 
 [0 1] 

The no. of unique values of X176 column is = 2 
  & unique values of X176 column is = 
 [0 1] 

The no. of unique values of X177 column is = 2 
  & unique values of X177 column is = 
 [0 1] 

The no. of unique values of X178 column is = 2 
  & unique values of X178 column is = 
 [0 1] 

The no. of unique values of X179 column is = 2 
  & unique values of X179 column is = 
 [1 0] 

The no. of unique values of X180 column is = 2 
  & unique values of X180 column is = 
 [0 1] 

The no. of unique values of X181 column is = 2 
  & unique values of X181 column is = 
 [0 1] 

The no. of unique values of X182 column is = 2 
  & unique values of X182 column is = 
 [0 1] 

The no. of unique values of X183 column is = 2 
  & unique values of X183 column is = 
 [0 1] 

The no. of unique values of X184 column 

In [13]:
train_unique_info =pd.DataFrame({"columns":columns, 
              "counts":counts,
               "unique":unique})
train_unique_info.head()

|   | columns | counts | unique |
|---|---|---|---|
| 0 | ID | 4209 | [0, 6, 7, 9, 13, 18, 24, 25, 27, 30, 31, 32, 3... |
| 1 | y | 2545 | [130.81, 88.53, 76.26, 80.62, 78.02, 92.93, 12... |
| 2 | X0 | 47 | [k, az, t, al, o, w, j, h, s, n, ay, f, x, y, ... |
| 3 | X1 | 27 | [v, t, w, b, r, l, s, aa, c, a, e, h, z, j, o,... |
| 4 | X2 | 44 | [at, av, n, e, as, aq, r, ai, ak, m, a, k, ae,... |


In [14]:
test_unique_info =pd.DataFrame({"columns":columns_test, 
                                 "counts":counts_test,
                                 "unique":unique_test})
test_unique_info.head()

|   | columns | counts | unique |
|---|---|---|---|
| 0 | ID | 4209 | [1, 2, 3, 4, 5, 8, 10, 11, 12, 14, 15, 16, 17,... |
| 1 | X0 | 49 | [az, t, w, y, x, f, ap, o, ay, al, h, z, aj, d... |
| 2 | X1 | 27 | [v, b, l, s, aa, r, a, i, p, c, o, m, z, e, h,... |
| 3 | X2 | 45 | [n, ai, as, ae, s, b, e, ak, m, a, aq, ag, r, ... |
| 4 | X3 | 7 | [f, a, c, e, d, g, b] |
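
The counts differ between the two sets (X0 has 47 distinct values in train and 49 in test), which means the test set contains categories the model never sees during training. A small sketch for listing them, assuming a hypothetical helper `unseen_categories` and toy frames:

```python
import pandas as pd


def unseen_categories(train_df, test_df, columns):
    """Map each column to the sorted values found in test but not in train.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return {
        col: sorted(set(test_df[col].unique()) - set(train_df[col].unique()))
        for col in columns
    }


# Toy frames with made-up rows; the real frames would be passed the same way.
toy_train = pd.DataFrame({"X0": ["k", "az", "t"], "X3": ["a", "e", "c"]})
toy_test = pd.DataFrame({"X0": ["az", "bb", "p"], "X3": ["c", "a", "e"]})

report = unseen_categories(toy_train, toy_test, ["X0", "X3"])
print(report)  # {'X0': ['bb', 'p'], 'X3': []}
```

Knowing which categories are test-only helps when choosing an encoding, since any integer assigned to them is a value the model has no training examples for.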


In [15]:
# Replace string categories with integer codes shared by train and test

for col in train.columns:
    if(train[col].dtype != np.float64 and train[col].dtype != np.int64):
        
        # making a list of unique strings in train and test feature
        unique_train = train[col].unique().tolist()
        unique_test = test[col].unique().tolist()
        
        # making a combined list
        for member in unique_test:
            if member not in unique_train:
                unique_train.append(member)
               
        # mapping with numbers
        map_dict = dict(zip(unique_train, range(len(unique_train))))
        train[col] = train[col].replace(to_replace = map_dict)
        test[col] = test[col].replace(to_replace = map_dict)

### Apply label encoder

In [16]:
features = train.iloc[:,2:].values
label = train.iloc[:,1].values
test = test.iloc[:,1:].values
features

array([[ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  1,  1, ...,  0,  0,  0],
       [ 1,  2,  2, ...,  0,  0,  0],
       ...,
       [15,  0,  6, ...,  0,  0,  0],
       [ 3,  4,  3, ...,  0,  0,  0],
       [17,  4, 12, ...,  0,  0,  0]], dtype=int64)

In [17]:
# In [15] already mapped the eight categorical columns to integers, so this
# step re-indexes those codes; each column is fit separately for train and test.
LE = LabelEncoder()
for i in range(8):
    features[:, i] = LE.fit_transform(features[:, i])

for i in range(8):
    test[:, i] = LE.fit_transform(test[:, i])
test

array([[ 1,  0,  2, ...,  0,  0,  0],
       [ 2,  3,  7, ...,  0,  0,  0],
       [ 1,  0,  4, ...,  0,  0,  0],
       ...,
       [13,  0,  4, ...,  0,  0,  0],
       [15,  0,  4, ...,  0,  0,  0],
       [ 2,  7,  7, ...,  0,  0,  0]], dtype=int64)
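
When an encoder is fit separately on train and on test, the same category can end up with different integer codes in the two sets. One way to keep the codes aligned is to fit a single encoder on the values of both sets and then transform each. This is a sketch on toy arrays, not the notebook's exact procedure:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy category columns (made-up values); "bb" occurs only in the test column.
train_col = np.array(["k", "az", "t", "az"])
test_col = np.array(["az", "bb", "k"])

# Fit once on the union so every category gets exactly one code.
encoder = LabelEncoder()
encoder.fit(np.concatenate([train_col, test_col]))

train_codes = encoder.transform(train_col)
test_codes = encoder.transform(test_col)

print(list(encoder.classes_))  # ['az', 'bb', 'k', 't']
print(train_codes)             # [2 0 3 0]
print(test_codes)              # [0 1 2]
```

`LabelEncoder` sorts the classes, so "az" maps to 0 in both arrays. Fitting on the union is only possible when the test features are available up front, as they are in this project.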

In [18]:
# Label-encode the categorical columns of the train DataFrame itself
for f in ["X0","X1","X8","X2","X3","X4","X5","X6"]:
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(train[f].values))
    train[f] = lbl.transform(list(train[f].values))

#### Scaling the data

In [19]:
# Standardize every feature to zero mean and unit variance (StandardScaler is imported in In [1])
sc = StandardScaler()
features = sc.fit_transform(features)

### Performing dimensionality reduction.

In [20]:
# PCA is unsupervised: fit ignores any target argument, so only the features are passed
pca = PCA(n_components=3)
pca.fit(features)

PCA(n_components=3)

In [21]:
pca.explained_variance_ratio_

array([0.06862221, 0.05647352, 0.04476858])
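
The three retained components together explain roughly 17% of the variance in the scaled features (0.069 + 0.056 + 0.045). One common way to choose the number of components is to look at the cumulative explained variance and keep enough components to reach a target such as 95%. The sketch below uses synthetic data as a stand-in for the feature matrix; the 95% target is an assumption, not a value from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled features: 20 columns driven by 5 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(200, 20))

# Fit with all components and accumulate the explained variance ratios.
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)

# scikit-learn can make the same choice when n_components is a float in (0, 1).
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)

print(n_components_95, pca_95.n_components_)
```

Applied to the real feature matrix, the same cumulative curve would show how many components are needed before PCA preserves most of the information.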

In [22]:
features1 = pca.transform(features)

### Preparing the data and performing XGBoost.

In [23]:
x_train,x_test,y_train,y_test  = train_test_split(features1 ,
                                                 label , 
                                                 test_size = 0.2,
                                                 random_state = 1)

In [24]:
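# Note: this rebinds the name xgb from the imported module to the fitted model.
# The model is fit on every row of features1, so x_test is not unseen data here.
xgb = XGBRFRegressor()
xgb.fit(features1,label)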

XGBRFRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
               colsample_bytree=1, gamma=0, gpu_id=-1, importance_type='gain',
               interaction_constraints='', max_delta_step=0, max_depth=6,
               min_child_weight=1, missing=nan, monotone_constraints='()',
               n_estimators=100, n_jobs=0, num_parallel_tree=100,
               objective='reg:squarederror', random_state=0, reg_alpha=0,
               scale_pos_weight=1, tree_method='exact', validate_parameters=1,
               verbosity=None)

In [25]:
xgb.score(features1,label)

0.36686846176542864

In [26]:
xgb.score(x_test,y_test)

0.36309368996191926
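
The `score` method of a scikit-learn style regressor, which `XGBRFRegressor` follows, returns the coefficient of determination R². A NumPy sketch of that metric, with a hypothetical function name `r2` and made-up target values, shows what the numbers above measure:

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    return 1.0 - ss_res / ss_tot


# Made-up test-bench times in seconds.
y_true = [80.0, 90.0, 100.0, 110.0]

perfect = r2(y_true, y_true)        # predictions match exactly
baseline = r2(y_true, [95.0] * 4)   # always predicting the mean of y_true

print(perfect, baseline)  # 1.0 0.0
```

A score of about 0.36 therefore means the model removes roughly a third of the squared error that a constant mean prediction would make.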

In [27]:
t=sc.transform(test)
test1=pca.transform(t)

xgb.predict(test1)

array([ 76.873566,  97.974014,  77.063705, ..., 102.943565, 102.983055,
        97.43947 ], dtype=float32)

### Predict your test_df values using xgboost

In [28]:
x_train,x_test,y_train,y_test = train_test_split(features,
                                                 label,
                                                 test_size=0.2,
                                                 random_state=1)

In [29]:
model = XGBRFRegressor()
model.fit(x_train,y_train)

XGBRFRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
               colsample_bytree=1, gamma=0, gpu_id=-1, importance_type='gain',
               interaction_constraints='', max_delta_step=0, max_depth=6,
               min_child_weight=1, missing=nan, monotone_constraints='()',
               n_estimators=100, n_jobs=0, num_parallel_tree=100,
               objective='reg:squarederror', random_state=0, reg_alpha=0,
               scale_pos_weight=1, tree_method='exact', validate_parameters=1,
               verbosity=None)

In [30]:
#Check the Quality of the model
print(model.score(x_train,y_train))
print(model.score(x_test,y_test))

0.6221611228381827
0.5931391378182237


In [31]:
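# Note: the model was trained on standardized features (In [19]), while test is
# passed here without sc.transform, unlike the prediction in In [27].
model.predict(test)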

array([100.69558 ,  93.771736, 100.264824, ...,  92.53066 , 110.44512 ,
        93.99011 ], dtype=float32)
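
The prediction array on its own has no row identifiers, because the `ID` column is dropped when `test` is converted to a NumPy array in In [16]. A minimal sketch for pairing predictions with IDs, assuming the IDs are saved before that conversion; the helper name `build_submission`, the column names, and the output path are hypothetical.

```python
import numpy as np
import pandas as pd


def build_submission(ids, predictions):
    """Pair each test ID with its predicted test-bench time.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    ids = np.asarray(ids)
    predictions = np.asarray(predictions, dtype=float)
    if ids.shape[0] != predictions.shape[0]:
        raise ValueError("ids and predictions must have the same length")
    return pd.DataFrame({"ID": ids, "y": predictions})


# Made-up IDs and predictions standing in for the saved test IDs and model output.
toy_ids = [1, 2, 3]
toy_preds = [100.7, 93.8, 100.3]

submission = build_submission(toy_ids, toy_preds)
print(submission)
# submission.to_csv("submission.csv", index=False)  # hypothetical output path
```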