In [None]:
"""
Mercedes-Benz Greener Manufacturing.

DESCRIPTION

Reduce the time a Mercedes-Benz spends on the test bench.

Problem Statement Scenario:
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Mercedes-Benz is the leader in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before they hit the road, the company’s engineers have developed a robust testing system. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on Mercedes-Benz’s production lines. However, optimizing the speed of their testing system for many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. Others will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.

The following actions should be performed:

* If for any column(s), the variance is equal to zero, then you need to remove those variable(s).
* Check for null and unique values for test and train sets.
* Apply label encoder.
* Perform dimensionality reduction.
* Predict your test_df values using XGBoost.

"""

In [189]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [190]:
# Reading the train dataset
trn = pd.read_csv(r'C:\D_DRIVE\PGP\3 PGP AI - Machine Learning\Machine-Learning--Projects-master\Projects\Projects for Submission\Project 1 - Mercedes-Benz Greener Manufacturing\Dataset for the project\train\train.csv', delimiter = ',')

#Reading the test dataset
tst = pd.read_csv(r'C:\D_DRIVE\PGP\3 PGP AI - Machine Learning\Machine-Learning--Projects-master\Projects\Projects for Submission\Project 1 - Mercedes-Benz Greener Manufacturing\Dataset for the project\test\test.csv', delimiter = ',')

In [191]:
trn.head()

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


In [192]:
tst.head()

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,az,v,n,f,d,t,a,w,0,...,0,0,0,1,0,0,0,0,0,0
1,2,t,b,ai,a,d,b,g,y,0,...,0,0,1,0,0,0,0,0,0,0
2,3,az,v,as,f,d,a,j,j,0,...,0,0,0,1,0,0,0,0,0,0
3,4,az,l,n,f,d,z,l,n,0,...,0,0,0,1,0,0,0,0,0,0
4,5,w,s,as,c,d,y,i,m,0,...,1,0,0,0,0,0,0,0,0,0


In [193]:
# Insights:
#----------
# The train data set has 378 columns: ID, the target y, and 376 feature columns named X0, X1, X2, ...
# The test data set has 377 columns: the same columns as train except the target y.


# Since there are a lot of columns, we can use the aggregation below to count the columns per datatype

dtype_df = trn.dtypes.reset_index()
dtype_df.columns = ["feature name","dtypes"]
dtype_df.groupby("dtypes").agg("count").reset_index()

Unnamed: 0,dtypes,feature name
0,int64,369
1,float64,1
2,object,8


In [194]:
dtype_df = tst.dtypes.reset_index()
dtype_df.columns = ["feature name","dtypes"]
dtype_df.groupby("dtypes").agg("count").reset_index()

Unnamed: 0,dtypes,feature name
0,int64,369
1,object,8


In [195]:
# Print each column that has null values, together with its null count
for i in trn.columns:
    if trn[i].isnull().sum() > 0:
        print(i, trn[i].isnull().sum())

In [196]:
# No feature has null values in the train data

In [197]:
# Print each column that has null values, together with its null count
for i in tst.columns:
    if tst[i].isnull().sum() > 0:
        print(i, tst[i].isnull().sum())

In [198]:
# No feature has null values in the test data
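The two null checks above can also be written without looping over the columns. The following is a minimal sketch; the helper name `null_counts` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def null_counts(df: pd.DataFrame) -> pd.Series:
    """Return the null count per column, keeping only columns that have nulls."""
    counts = df.isnull().sum()
    return counts[counts > 0]

# Toy frame standing in for trn / tst (made-up values, not from the dataset)
demo = pd.DataFrame({"X0": ["a", None, "b"], "X10": [0, 1, 0]})
result = null_counts(demo)
print(result)
```

An empty result means that no column has missing values, which is what the notebook observes for both data sets.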

In [199]:
# Checking the unique values for each column in train data
for i in trn.columns:
    print(i,'**',trn[i].unique())

ID ** [   0    6    7 ... 8412 8415 8417]
y ** [130.81  88.53  76.26 ...  85.71 108.77  87.48]
X0 ** ['k' 'az' 't' 'al' 'o' 'w' 'j' 'h' 's' 'n' 'ay' 'f' 'x' 'y' 'aj' 'ak' 'am'
 'z' 'q' 'at' 'ap' 'v' 'af' 'a' 'e' 'ai' 'd' 'aq' 'c' 'aa' 'ba' 'as' 'i'
 'r' 'b' 'ax' 'bc' 'u' 'ad' 'au' 'm' 'l' 'aw' 'ao' 'ac' 'g' 'ab']
X1 ** ['v' 't' 'w' 'b' 'r' 'l' 's' 'aa' 'c' 'a' 'e' 'h' 'z' 'j' 'o' 'u' 'p' 'n'
 'i' 'y' 'd' 'f' 'm' 'k' 'g' 'q' 'ab']
X2 ** ['at' 'av' 'n' 'e' 'as' 'aq' 'r' 'ai' 'ak' 'm' 'a' 'k' 'ae' 's' 'f' 'd'
 'ag' 'ay' 'ac' 'ap' 'g' 'i' 'aw' 'y' 'b' 'ao' 'al' 'h' 'x' 'au' 't' 'an'
 'z' 'ah' 'p' 'am' 'j' 'q' 'af' 'l' 'aa' 'c' 'o' 'ar']
X3 ** ['a' 'e' 'c' 'f' 'd' 'b' 'g']
X4 ** ['d' 'b' 'c' 'a']
X5 ** ['u' 'y' 'x' 'h' 'g' 'f' 'j' 'i' 'd' 'c' 'af' 'ag' 'ab' 'ac' 'ad' 'ae'
 'ah' 'l' 'k' 'n' 'm' 'p' 'q' 's' 'r' 'v' 'w' 'o' 'aa']
X6 ** ['j' 'l' 'd' 'h' 'i' 'a' 'g' 'c' 'k' 'e' 'f' 'b']
X8 ** ['o' 'x' 'e' 'n' 's' 'a' 'h' 'p' 'm' 'k' 'd' 'i' 'v' 'j' 'b' 'q' 'w' 'g'
 'y' 'l' 'f' 'u' 'r' 't' 'c']
X

In [96]:
# Checking the unique values for each column in test data
for i in tst.columns:
    print(i,'**',tst[i].unique())

ID ** [   1    2    3 ... 8413 8414 8416]
X0 ** ['az' 't' 'w' 'y' 'x' 'f' 'ap' 'o' 'ay' 'al' 'h' 'z' 'aj' 'd' 'v' 'ak'
 'ba' 'n' 'j' 's' 'af' 'ax' 'at' 'aq' 'av' 'm' 'k' 'a' 'e' 'ai' 'i' 'ag'
 'b' 'am' 'aw' 'as' 'r' 'ao' 'u' 'l' 'c' 'ad' 'au' 'bc' 'g' 'an' 'ae' 'p'
 'bb']
X1 ** ['v' 'b' 'l' 's' 'aa' 'r' 'a' 'i' 'p' 'c' 'o' 'm' 'z' 'e' 'h' 'w' 'g' 'k'
 'y' 't' 'u' 'd' 'j' 'q' 'n' 'f' 'ab']
X2 ** ['n' 'ai' 'as' 'ae' 's' 'b' 'e' 'ak' 'm' 'a' 'aq' 'ag' 'r' 'k' 'aj' 'ay'
 'ao' 'an' 'ac' 'af' 'ax' 'h' 'i' 'f' 'ap' 'p' 'au' 't' 'z' 'y' 'aw' 'd'
 'at' 'g' 'am' 'j' 'x' 'ab' 'w' 'q' 'ah' 'ad' 'al' 'av' 'u']
X3 ** ['f' 'a' 'c' 'e' 'd' 'g' 'b']
X4 ** ['d' 'b' 'a' 'c']
X5 ** ['t' 'b' 'a' 'z' 'y' 'x' 'h' 'g' 'f' 'j' 'i' 'd' 'c' 'af' 'ag' 'ab' 'ac'
 'ad' 'ae' 'ah' 'l' 'k' 'n' 'm' 'p' 'q' 's' 'r' 'v' 'w' 'o' 'aa']
X6 ** ['a' 'g' 'j' 'l' 'i' 'd' 'f' 'h' 'c' 'k' 'e' 'b']
X8 ** ['w' 'y' 'j' 'n' 'm' 's' 'a' 'v' 'r' 'o' 't' 'h' 'c' 'k' 'p' 'u' 'd' 'g'
 'b' 'q' 'e' 'l' 'f' 'i' 'x']
X10 ** [0 1]
X11 ** [0 1]

In [97]:
# Insights:
#----------
# From the above, we can see that most of the features from X10 onwards are binary
# Some are constant in the train data, e.g. X11 and X93 contain only zeros; we will confirm this by computing the variance below
# Once zero variance is confirmed, we can drop such columns as they will not contribute to the model
# The zero-variance features in train are not necessarily constant in test (e.g. X11 takes both 0 and 1 there), so the same features must be removed from test as well

In [56]:
# Variance cannot be computed on categorical (string) features, so we identify the zero-variance features after label encoding

In [40]:
# Encoding Categorical features in train data - X0
print(trn['X0'].unique())
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
trn['X0'] = enc.fit_transform(trn['X0'])

['k' 'az' 't' 'al' 'o' 'w' 'j' 'h' 's' 'n' 'ay' 'f' 'x' 'y' 'aj' 'ak' 'am'
 'z' 'q' 'at' 'ap' 'v' 'af' 'a' 'e' 'ai' 'd' 'aq' 'c' 'aa' 'ba' 'as' 'i'
 'r' 'b' 'ax' 'bc' 'u' 'ad' 'au' 'm' 'l' 'aw' 'ao' 'ac' 'g' 'ab']


In [41]:
# Applying same label encoder to test data X0 column
tst['X0'] = enc.transform(tst['X0'])

ValueError: y contains previously unseen labels: 'av'

In [141]:
# The error above shows that some values in the X0 column of the test set are not present in the train set.
# The encoder was never fitted on them, hence the error when applying it to the test data

# Now we will identify the uncommon values
set(trn['X0'].unique()) - (set(tst['X0'].unique()))

{'aa', 'ab', 'ac', 'q'}

In [142]:
set(tst['X0'].unique()) - (set(trn['X0'].unique()))

{'ae', 'ag', 'an', 'av', 'bb', 'p'}
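The same comparison is needed for every categorical column, so it can be wrapped in a small helper. This is only a sketch: `unseen_values` and the toy frames are hypothetical names and data introduced for illustration.

```python
import pandas as pd

def unseen_values(train: pd.DataFrame, test: pd.DataFrame, columns):
    """For each column, return (values only in train, values only in test)."""
    report = {}
    for col in columns:
        trn_vals = set(train[col].unique())
        tst_vals = set(test[col].unique())
        report[col] = (trn_vals - tst_vals, tst_vals - trn_vals)
    return report

# Toy frames standing in for trn / tst (made-up values)
demo_trn = pd.DataFrame({"X0": ["k", "az", "q"], "X1": ["v", "t", "w"]})
demo_tst = pd.DataFrame({"X0": ["az", "av", "k"], "X1": ["t", "w", "v"]})
report = unseen_values(demo_trn, demo_tst, ["X0", "X1"])
print(report)
```

A column whose entry is a pair of empty sets can be encoded by fitting on train alone; any other column needs an encoder that knows the values from both sides.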

In [143]:
# There are uncommon values on both sides, so we fit the label encoder on the union of the unique values from train & test
enc = LabelEncoder()
enc.fit_transform(['k','az','t','al','o','w', 'j', 'h', 's', 'n', 'ay', 'f', 'x', 'y', 'aj', 'ak', 'am',
 'z', 'q', 'at', 'ap', 'v', 'af', 'a', 'e', 'ai', 'd', 'aq', 'c', 'aa', 'ba', 'as', 'i',
 'r', 'b', 'ax', 'bc', 'u', 'ad', 'au', 'm', 'l', 'aw', 'ao', 'ac', 'g', 'ab', 'ae', 'ag', 'an', 'av', 'bb', 'p'])

array([37, 24, 46, 11, 41, 49, 36, 34, 45, 40, 23, 32, 50, 51,  9, 10, 12,
       52, 43, 18, 15, 48,  6,  0, 31,  8, 30, 16, 29,  1, 26, 17, 35, 44,
       25, 22, 28, 47,  4, 19, 39, 38, 21, 14,  3, 33,  2,  5,  7, 13, 20,
       27, 42], dtype=int64)
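Typing the union of values by hand is error-prone. A possible alternative, sketched below on toy data, is to fit the encoder on the concatenation of the train and test columns. `LabelEncoder` sorts its classes, so the resulting codes depend only on the set of values, not on their order. The variable names here are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy columns standing in for trn['X0'] and tst['X0'] (made-up values)
demo_trn = pd.Series(["k", "az", "q"])
demo_tst = pd.Series(["az", "av", "k"])

# Fit on every value that occurs in either set, then transform each set
enc_demo = LabelEncoder()
enc_demo.fit(pd.concat([demo_trn, demo_tst]))
trn_codes = enc_demo.transform(demo_trn)
tst_codes = enc_demo.transform(demo_tst)

print(list(enc_demo.classes_))
print(trn_codes, tst_codes)
```

Fitting on test values only uses the category labels, never the target, but it does assume the test set is available at training time.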

In [144]:
trn.head()

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


In [145]:
# Now transform the train data using the encoder
trn['X0'] = enc.transform(trn['X0'])

In [146]:
trn.head()

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,37,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,37,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,24,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,24,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,24,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


In [147]:
tst.head()

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,az,v,n,f,d,t,a,w,0,...,0,0,0,1,0,0,0,0,0,0
1,2,t,b,ai,a,d,b,g,y,0,...,0,0,1,0,0,0,0,0,0,0
2,3,az,v,as,f,d,a,j,j,0,...,0,0,0,1,0,0,0,0,0,0
3,4,az,l,n,f,d,z,l,n,0,...,0,0,0,1,0,0,0,0,0,0
4,5,w,s,as,c,d,y,i,m,0,...,1,0,0,0,0,0,0,0,0,0


In [148]:
# Now transform the test data using the encoder
tst['X0'] = enc.transform(tst['X0'])

In [149]:
tst.head()

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,24,v,n,f,d,t,a,w,0,...,0,0,0,1,0,0,0,0,0,0
1,2,46,b,ai,a,d,b,g,y,0,...,0,0,1,0,0,0,0,0,0,0
2,3,24,v,as,f,d,a,j,j,0,...,0,0,0,1,0,0,0,0,0,0
3,4,24,l,n,f,d,z,l,n,0,...,0,0,0,1,0,0,0,0,0,0
4,5,49,s,as,c,d,y,i,m,0,...,1,0,0,0,0,0,0,0,0,0


In [None]:
## From the above we can see that the value 'az' is encoded as 24 in both train & test data, so the encoding is consistent.
# The same procedure can be repeated for the columns X1-X6 and X8

In [150]:
# Now we will identify the uncommon values for X1 column
print(set(trn['X1'].unique()) - (set(tst['X1'].unique())))
print(set(tst['X1'].unique()) - (set(trn['X1'].unique())))

set()
set()


In [151]:
# From the above we can see there are no uncommon values between train & test data for column X1, so we can fit the encoder on train and apply it directly to test
enc = LabelEncoder()
trn['X1'] = enc.fit_transform(trn['X1'])
tst['X1'] = enc.transform(tst['X1'])

In [152]:
# Now we will identify the uncommon values for X2 column
print(set(trn['X2'].unique()) - (set(tst['X2'].unique())))
print(set(tst['X2'].unique()) - (set(trn['X2'].unique())))

{'aa', 'c', 'o', 'l', 'ar'}
{'ad', 'ax', 'u', 'ab', 'w', 'aj'}


In [153]:
# So X2 has some uncommon values, as the X0 column did
print(trn['X2'].unique())
print(tst['X2'].unique())

['at' 'av' 'n' 'e' 'as' 'aq' 'r' 'ai' 'ak' 'm' 'a' 'k' 'ae' 's' 'f' 'd'
 'ag' 'ay' 'ac' 'ap' 'g' 'i' 'aw' 'y' 'b' 'ao' 'al' 'h' 'x' 'au' 't' 'an'
 'z' 'ah' 'p' 'am' 'j' 'q' 'af' 'l' 'aa' 'c' 'o' 'ar']
['n' 'ai' 'as' 'ae' 's' 'b' 'e' 'ak' 'm' 'a' 'aq' 'ag' 'r' 'k' 'aj' 'ay'
 'ao' 'an' 'ac' 'af' 'ax' 'h' 'i' 'f' 'ap' 'p' 'au' 't' 'z' 'y' 'aw' 'd'
 'at' 'g' 'am' 'j' 'x' 'ab' 'w' 'q' 'ah' 'ad' 'al' 'av' 'u']


In [154]:
# Now we define the encoder and fit the values
encX2 = LabelEncoder()
encX2.fit_transform(['at', 'av', 'n', 'e', 'as', 'aq', 'r', 'ai', 'ak', 'm', 'a', 'k','ae', 's', 'f', 'd', 'ag', 
                     'ay', 'ac', 'ap', 'g', 'i', 'aw', 'y', 'b', 'ao', 'al', 'h', 'x', 'au', 't', 'an', 'z', 'ah', 
                     'p', 'am', 'j', 'q', 'af', 'l', 'aa', 'c', 'o', 'ar','ad', 'ax', 'u', 'ab', 'w', 'aj'])

array([20, 22, 38, 29, 19, 17, 42,  9, 11, 37,  0, 35,  5, 43, 30, 28,  7,
       25,  3, 16, 31, 33, 23, 48, 26, 15, 12, 32, 47, 21, 44, 14, 49,  8,
       40, 13, 34, 41,  6, 36,  1, 27, 39, 18,  4, 24, 45,  2, 46, 10],
      dtype=int64)

In [157]:
# Now transform the train & test data using the encoder
trn['X2'] = encX2.transform(trn['X2'])
tst['X2'] = encX2.transform(tst['X2'])

In [158]:
# Now we will identify the uncommon values for X3 column
print(set(trn['X3'].unique()) - (set(tst['X3'].unique())))
print(set(tst['X3'].unique()) - (set(trn['X3'].unique())))

set()
set()


In [159]:
# From the above we can see there are no uncommon values between train & test data for column X3, so we can fit the encoder on train and apply it directly to test
encX3 = LabelEncoder()
trn['X3'] = encX3.fit_transform(trn['X3'])
tst['X3'] = encX3.transform(tst['X3'])

In [160]:
# Now we will identify the uncommon values for X4 column
print(set(trn['X4'].unique()) - (set(tst['X4'].unique())))
print(set(tst['X4'].unique()) - (set(trn['X4'].unique())))

set()
set()


In [161]:
# From the above we can see there are no uncommon values between train & test data for column X4, so we can fit the encoder on train and apply it directly to test
encX4 = LabelEncoder()
trn['X4'] = encX4.fit_transform(trn['X4'])
tst['X4'] = encX4.transform(tst['X4'])

In [162]:
# Now we will identify the uncommon values for X5 column
print(set(trn['X5'].unique()) - (set(tst['X5'].unique())))
print(set(tst['X5'].unique()) - (set(trn['X5'].unique())))

{'u'}
{'z', 'a', 'b', 't'}


In [165]:
# So X5 has some uncommon values, as the X0 and X2 columns did
trn['X5'].unique()

array(['u', 'y', 'x', 'h', 'g', 'f', 'j', 'i', 'd', 'c', 'af', 'ag', 'ab',
       'ac', 'ad', 'ae', 'ah', 'l', 'k', 'n', 'm', 'p', 'q', 's', 'r',
       'v', 'w', 'o', 'aa'], dtype=object)

In [166]:
# Now we define the encoder and fit the values of column X5
encX5 = LabelEncoder()
encX5.fit_transform(['u', 'y', 'x', 'h', 'g', 'f', 'j', 'i', 'd', 'c', 'af', 'ag', 'ab',
       'ac', 'ad', 'ae', 'ah', 'l', 'k', 'n', 'm', 'p', 'q', 's', 'r',
       'v', 'w', 'o', 'aa','z', 'a', 'b', 't'])

array([27, 31, 30, 14, 13, 12, 16, 15, 11, 10,  6,  7,  2,  3,  4,  5,  8,
       18, 17, 20, 19, 22, 23, 25, 24, 28, 29, 21,  1, 32,  0,  9, 26],
      dtype=int64)

In [169]:
# Now transform the train & test data using the encoder
trn['X5'] = encX5.transform(trn['X5'])
tst['X5'] = encX5.transform(tst['X5'])

In [171]:
# Now we will identify the uncommon values for X6 column
print(set(trn['X6'].unique()) - (set(tst['X6'].unique())))
print(set(tst['X6'].unique()) - (set(trn['X6'].unique())))

set()
set()


In [172]:
# From the above we can see there are no uncommon values between train & test data for column X6,
# so we can fit the encoder on train and apply it directly to test
encX6 = LabelEncoder()
trn['X6'] = encX6.fit_transform(trn['X6'])
tst['X6'] = encX6.transform(tst['X6'])

In [174]:
# Now we will identify the uncommon values for X8 column
print(set(trn['X8'].unique()) - (set(tst['X8'].unique())))
print(set(tst['X8'].unique()) - (set(trn['X8'].unique())))

set()
set()


In [175]:
# From the above we can see there are no uncommon values between train & test data for column X8,
# so we can fit the encoder on train and apply it directly to test
encX8 = LabelEncoder()
trn['X8'] = encX8.fit_transform(trn['X8'])
tst['X8'] = encX8.transform(tst['X8'])

In [176]:
# Identify features with zero variance. Variance cannot be computed on string features, so we do this after label encoding
temp = []
for i in trn.columns:
    if trn[i].var()==0:
        temp.append(i)
        
print('No. of features in train data has zero variance:',len(temp))
print('List here:',temp)

No. of features in train data has zero variance: 12
List here: ['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290', 'X293', 'X297', 'X330', 'X347']
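The loop above can also be expressed with a single vectorized variance call. The sketch below assumes a hypothetical helper `zero_variance_columns` and a toy frame; it is not the notebook's own code.

```python
import pandas as pd

def zero_variance_columns(df: pd.DataFrame, exclude=()):
    """Return the names of numeric columns whose variance is exactly zero."""
    numeric = df.select_dtypes(include="number")
    numeric = numeric.drop(columns=list(exclude), errors="ignore")
    variances = numeric.var()
    return variances[variances == 0].index.tolist()

# Toy frame standing in for trn (made-up values)
demo = pd.DataFrame({
    "ID": [0, 6, 7],
    "X10": [0, 1, 0],
    "X11": [0, 0, 0],
    "X93": [0, 0, 0],
})
constant_cols = zero_variance_columns(demo, exclude=["ID"])
print(constant_cols)
```

Keeping the result in a variable avoids retyping the column list when the same columns are later dropped from train and test.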


In [177]:
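# Dropping cols with zero variance
# Note: drop() returns a new DataFrame; the result is only displayed here and not assigned back,
# so trn itself keeps these columns (the dtype counts further below still show 369 int64 columns)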

trn.drop(['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290', 'X293', 'X297', 'X330', 'X347'], axis=1)

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,37,23,20,0,3,27,9,14,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,37,21,22,4,3,31,11,14,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,24,24,38,2,3,30,9,23,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,24,21,38,5,3,30,11,4,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,24,23,38,5,3,14,3,13,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,8405,107.39,10,20,19,2,3,1,3,16,...,1,0,0,0,0,0,0,0,0,0
4205,8406,108.77,36,16,44,3,3,1,7,7,...,0,1,0,0,0,0,0,0,0,0
4206,8412,109.22,10,23,42,0,3,1,6,4,...,0,0,1,0,0,0,0,0,0,0
4207,8415,87.48,11,19,29,5,3,1,11,20,...,0,0,0,0,0,0,0,0,0,0


In [178]:
# The features with zero variance in train should be removed from the test data as well
# Note: as with train, the result of drop() is only displayed and not assigned back to tst

tst.drop(['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290', 'X293', 'X297', 'X330', 'X347'], axis=1)

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,24,23,38,5,3,26,0,22,0,...,0,0,0,1,0,0,0,0,0,0
1,2,46,3,9,0,3,9,6,24,0,...,0,0,1,0,0,0,0,0,0,0
2,3,24,23,19,5,3,0,9,9,0,...,0,0,0,1,0,0,0,0,0,0
3,4,24,13,38,5,3,32,11,13,0,...,0,0,0,1,0,0,0,0,0,0
4,5,49,20,19,2,3,31,8,12,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,8410,9,9,19,5,3,1,9,4,0,...,0,0,0,0,0,0,0,0,0,0
4205,8411,46,1,9,3,3,1,9,24,0,...,0,1,0,0,0,0,0,0,0,0
4206,8413,51,23,19,5,3,1,3,22,0,...,0,0,0,0,0,0,0,0,0,0
4207,8414,10,23,19,0,3,1,2,16,0,...,0,0,1,0,0,0,0,0,0,0
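`DataFrame.drop` returns a new frame and leaves the original untouched unless the result is assigned back or `inplace=True` is passed. A minimal sketch of the assign-back pattern on toy frames (all names and values here are illustrative assumptions):

```python
import pandas as pd

# Toy frames standing in for trn / tst (made-up values)
demo_trn = pd.DataFrame({"X10": [0, 1], "X11": [0, 0], "y": [88.5, 76.2]})
demo_tst = pd.DataFrame({"X10": [1, 0], "X11": [0, 1]})
constant_cols = ["X11"]  # columns found to have zero variance in train

# Without assignment the original frame is unchanged
demo_trn.drop(columns=constant_cols)
still_present = "X11" in demo_trn.columns

# Assigning the result back actually removes the columns
demo_trn = demo_trn.drop(columns=constant_cols)
demo_tst = demo_tst.drop(columns=constant_cols)
print(list(demo_trn.columns), list(demo_tst.columns))
```

The list is derived from train only and then applied to both frames, so train and test keep the same feature columns.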


In [179]:
# Checking that all features in train have a numerical datatype before applying PCA
dtype_df = trn.dtypes.reset_index()
dtype_df.columns = ["feature name","dtypes"]
dtype_df.groupby("dtypes").agg("count").reset_index()

Unnamed: 0,dtypes,feature name
0,int32,8
1,int64,369
2,float64,1


In [180]:
# Checking that all features in test have a numerical datatype before applying PCA
dtype_df = tst.dtypes.reset_index()
dtype_df.columns = ["feature name","dtypes"]
dtype_df.groupby("dtypes").agg("count").reset_index()

Unnamed: 0,dtypes,feature name
0,int32,8
1,int64,369


In [181]:
# Dropping ID from both train & test as it will not be used by the model; the target y is also dropped from the train features

trnPca=trn.drop(['ID','y'],axis=1)
tstPca=tst.drop(['ID'],axis=1)

In [182]:
# Perform dimensionality reduction using PCA
from sklearn.decomposition import PCA
n_comp = 12
pca = PCA(n_components=n_comp, random_state=420)
trnPca= pca.fit_transform(trnPca)
tstPca = pca.transform(tstPca)

In [183]:
trnPca.shape

(4209, 12)

In [184]:
tstPca.shape

(4209, 12)

In [None]:
# Now the 376 X columns (378 train columns minus ID and y) are reduced to 12 columns after applying PCA
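One way to judge whether 12 components are enough is to look at the cumulative explained variance ratio of the fitted PCA. The sketch below uses synthetic binary data as a stand-in for the real features, so the printed number says nothing about this dataset; the variable names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 0/1 features standing in for the real feature matrix
rng = np.random.default_rng(420)
demo_X = rng.integers(0, 2, size=(200, 30)).astype(float)

pca_demo = PCA(n_components=12, random_state=420)
demo_reduced = pca_demo.fit_transform(demo_X)

# Share of the total variance retained by the first k components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(demo_reduced.shape)
print(cumulative[-1])
```

PCA is sensitive to scale, so label-encoded columns with a wide numeric range may dominate the components unless the features are standardized first.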

In [185]:
%pip install xgboost

Note: you may need to restart the kernel to use updated packages.


In [186]:
# ML modeling with XGBoost
import xgboost as xgb
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Defining the model inputs: PCA components as features, y as target
train_X = trnPca
train_y = trn['y']

# Splitting into train & validation sets (80/20)
x_train, x_valid, y_train, y_valid = train_test_split(train_X, train_y, test_size=0.2, random_state=420)

# Wrapping the data in DMatrix objects for XGBoost
d_train = xgb.DMatrix(x_train, label=y_train)
d_valid = xgb.DMatrix(x_valid, label=y_valid)
d_test = xgb.DMatrix(tstPca)
xgb_params = {
    'n_trees': 500,  # not used by xgb.train (see the warning in the output); the number of rounds is passed to xgb.train
    'eta': 0.0050,
    'max_depth': 3,
    'subsample': 0.95,
    'objective': 'reg:linear',  # older name of 'reg:squarederror'
    'eval_metric': 'rmse',
    'base_score': np.mean(train_y),  # base prediction = mean(target)
    'silent': 1  # not used (see the warning in the output)
}

# Custom evaluation metric: R^2 score of the predictions against the labels
def xgb_r2_score(preds, dtrain):
    labels = dtrain.get_label()
    return 'r2', r2_score(labels, preds)

watchlist = [(d_train, 'train'), (d_valid, 'valid')]
# Train for up to 1050 rounds, stopping early if the validation score does not improve for 50 rounds
mdl = xgb.train(xgb_params, d_train, 1050, watchlist, early_stopping_rounds=50, feval=xgb_r2_score, maximize=True, verbose_eval=10)

Parameters: { "n_trees", "silent" } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[0]	train-rmse:12.76558	train-r2:0.00294	valid-rmse:12.22352	valid-r2:0.00314
[10]	train-rmse:12.58890	train-r2:0.03034	valid-rmse:12.03743	valid-r2:0.03326
[20]	train-rmse:12.42513	train-r2:0.05541	valid-rmse:11.86727	valid-r2:0.06040
[30]	train-rmse:12.27386	train-r2:0.07827	valid-rmse:11.71038	valid-r2:0.08508
[40]	train-rmse:12.13359	train-r2:0.09922	valid-rmse:11.56504	valid-r2:0.10765
[50]	train-rmse:12.00464	train-r2:0.11826	valid-rmse:11.43020	valid-r2:0.12833
[60]	train-rmse:11.88540	train-r2:0.13569	valid-rmse:11.30510	valid-r2:0.14731
[70]	train-rmse:11.77336	train-r2:0.15191	valid-rmse:11.18944	valid-r2:0.16467
[80]	train-rmse:11.66823	train-r2:0.16699	valid-rmse:11.07946	valid-r2:0.1

[970]	train-rmse:9.14323	train-r2:0.48851	valid-rmse:9.03607	valid-r2:0.45524
[980]	train-rmse:9.13350	train-r2:0.48959	valid-rmse:9.03317	valid-r2:0.45559
[990]	train-rmse:9.12282	train-r2:0.49079	valid-rmse:9.02643	valid-r2:0.45641
[1000]	train-rmse:9.10990	train-r2:0.49223	valid-rmse:9.02197	valid-r2:0.45694
[1010]	train-rmse:9.09694	train-r2:0.49367	valid-rmse:9.01625	valid-r2:0.45763
[1020]	train-rmse:9.08740	train-r2:0.49473	valid-rmse:9.01177	valid-r2:0.45817
[1030]	train-rmse:9.07623	train-r2:0.49597	valid-rmse:9.01043	valid-r2:0.45833
[1040]	train-rmse:9.06513	train-r2:0.49721	valid-rmse:9.00572	valid-r2:0.45890
[1049]	train-rmse:9.05751	train-r2:0.49805	valid-rmse:9.00293	valid-r2:0.45923
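The `r2` column in the log starts close to zero because `base_score` is the mean of the target and `eta` is small, so the first predictions are close to the mean. R^2 compares the model's squared error with that of always predicting the mean. A minimal NumPy sketch of the formula, with a hypothetical helper name and toy labels:

```python
import numpy as np

def r2_from_predictions(labels, preds):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the label mean."""
    labels = np.asarray(labels, dtype=float)
    preds = np.asarray(preds, dtype=float)
    ss_res = np.sum((labels - preds) ** 2)
    ss_tot = np.sum((labels - labels.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy labels (made-up values)
demo_labels = np.array([80.0, 90.0, 100.0, 110.0])

# Perfect predictions give 1; predicting the mean everywhere gives 0
perfect = r2_from_predictions(demo_labels, demo_labels)
baseline = r2_from_predictions(demo_labels, np.full(4, demo_labels.mean()))
print(perfect, baseline)
```

On this scale, a validation R^2 of about 0.46 means the model removes a little under half of the squared error of the mean-only baseline.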


In [187]:
# Predicting on test set
p_test = mdl.predict(d_test)
p_test

array([ 77.08671,  96.55628,  84.19772, ..., 102.39361, 107.45419,
        96.19262], dtype=float32)

In [188]:
Predicted_Data = pd.DataFrame()
Predicted_Data['y'] = p_test
Predicted_Data.head()

Unnamed: 0,y
0,77.086708
1,96.556282
2,84.197723
3,77.491554
4,110.642532
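`Predicted_Data` holds only the predicted `y` values. The row order is preserved through PCA and prediction, so the predictions can be paired with the `ID` column that `tst` still carries. The sketch below shows the idea on toy values; `demo_ids`, `demo_preds`, and `submission` are illustrative names, not from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for tst['ID'] and p_test (made-up values)
demo_ids = pd.Series([1, 2, 3], name="ID")
demo_preds = np.array([77.09, 96.56, 84.20], dtype=np.float32)

# Pair each prediction with its ID, relying on unchanged row order
submission = pd.DataFrame({"ID": demo_ids.to_numpy(), "y": demo_preds})

# Render as CSV text; pass a file path to to_csv to write it to disk instead
csv_text = submission.to_csv(index=False)
print(csv_text)
```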


In [None]:
# With the above model, we have the below metrics:
# train-rmse:9.05751	train-r2:0.49805
# valid-rmse:9.00293	valid-r2:0.45923