# DESCRIPTION

## Reduce the time a Mercedes-Benz spends on the test bench.

### Problem Statement Scenario:
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before it hits the road, the company’s engineers have developed a robust testing system. For one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on the production lines. However, optimizing the speed of the testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.

## The following actions should be performed:

If the variance of any column is zero, remove that column.

Check for null and unique values for test and train sets.

Apply label encoder.

Perform dimensionality reduction.

Predict your test_df values using XGBoost.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Reading the data set
train_df = pd.read_csv("train.csv")
train_df

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,8405,107.39,ak,s,as,c,d,aa,d,q,...,1,0,0,0,0,0,0,0,0,0
4205,8406,108.77,j,o,t,d,d,aa,h,h,...,0,1,0,0,0,0,0,0,0,0
4206,8412,109.22,ak,v,r,a,d,aa,g,e,...,0,0,1,0,0,0,0,0,0,0
4207,8415,87.48,al,r,e,f,d,aa,l,u,...,0,0,0,0,0,0,0,0,0,0


In [3]:
# Reading the data set
test_df = pd.read_csv("test.csv")
test_df

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,az,v,n,f,d,t,a,w,0,...,0,0,0,1,0,0,0,0,0,0
1,2,t,b,ai,a,d,b,g,y,0,...,0,0,1,0,0,0,0,0,0,0
2,3,az,v,as,f,d,a,j,j,0,...,0,0,0,1,0,0,0,0,0,0
3,4,az,l,n,f,d,z,l,n,0,...,0,0,0,1,0,0,0,0,0,0
4,5,w,s,as,c,d,y,i,m,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,8410,aj,h,as,f,d,aa,j,e,0,...,0,0,0,0,0,0,0,0,0,0
4205,8411,t,aa,ai,d,d,aa,j,y,0,...,0,1,0,0,0,0,0,0,0,0
4206,8413,y,v,as,f,d,aa,d,w,0,...,0,0,0,0,0,0,0,0,0,0
4207,8414,ak,v,as,a,d,aa,c,q,0,...,0,0,1,0,0,0,0,0,0,0


In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 378 entries, ID to X385
dtypes: float64(1), int64(369), object(8)
memory usage: 12.1+ MB


### To check variance, I will separate the categorical features from the numerical (binary) features

In [5]:
features = list(train_df.columns)

In [6]:
# to check the type of data in each of the features
train_df.dtypes.head(20)

ID       int64
y      float64
X0      object
X1      object
X2      object
X3      object
X4      object
X5      object
X6      object
X8      object
X10      int64
X11      int64
X12      int64
X13      int64
X14      int64
X15      int64
X16      int64
X17      int64
X18      int64
X19      int64
dtype: object

In [7]:
# Separating the integer-typed (binary 0/1) features from the other feature types; ID is an identifier, not a feature
binary_features = []
for f in features:
    if train_df[f].dtype == 'int64' and f != 'ID':
        binary_features.append(f)    
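The loop above treats every `int64` column other than `ID` as binary. A minimal sketch of how that assumption could be verified is shown below, on a made-up toy frame; the column `X_count` is hypothetical and exists only to show a non-binary integer column being caught.

```python
import pandas as pd

# Toy frame standing in for train_df (values are made up for illustration)
toy = pd.DataFrame({
    "ID": [0, 6, 7, 9],
    "X10": [0, 1, 0, 0],
    "X11": [0, 0, 0, 0],
    "X_count": [3, 0, 2, 1],   # hypothetical integer column that is NOT binary
})

# Integer-typed columns, excluding the identifier
int_cols = [c for c in toy.columns
            if pd.api.types.is_integer_dtype(toy[c]) and c != "ID"]

# Keep only the integer columns whose values are all 0 or 1
strictly_binary = [c for c in int_cols if toy[c].isin([0, 1]).all()]
not_binary = [c for c in int_cols if c not in strictly_binary]

print(strictly_binary)
print(not_binary)
```

If `not_binary` comes back empty on the real data, the dtype-based split and a value-based split agree.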

In [8]:
binary_features

['X10',
 'X11',
 'X12',
 'X13',
 'X14',
 'X15',
 'X16',
 'X17',
 'X18',
 'X19',
 'X20',
 'X21',
 'X22',
 'X23',
 'X24',
 'X26',
 'X27',
 'X28',
 'X29',
 'X30',
 'X31',
 'X32',
 'X33',
 'X34',
 'X35',
 'X36',
 'X37',
 'X38',
 'X39',
 'X40',
 'X41',
 'X42',
 'X43',
 'X44',
 'X45',
 'X46',
 'X47',
 'X48',
 'X49',
 'X50',
 'X51',
 'X52',
 'X53',
 'X54',
 'X55',
 'X56',
 'X57',
 'X58',
 'X59',
 'X60',
 'X61',
 'X62',
 'X63',
 'X64',
 'X65',
 'X66',
 'X67',
 'X68',
 'X69',
 'X70',
 'X71',
 'X73',
 'X74',
 'X75',
 'X76',
 'X77',
 'X78',
 'X79',
 'X80',
 'X81',
 'X82',
 'X83',
 'X84',
 'X85',
 'X86',
 'X87',
 'X88',
 'X89',
 'X90',
 'X91',
 'X92',
 'X93',
 'X94',
 'X95',
 'X96',
 'X97',
 'X98',
 'X99',
 'X100',
 'X101',
 'X102',
 'X103',
 'X104',
 'X105',
 'X106',
 'X107',
 'X108',
 'X109',
 'X110',
 'X111',
 'X112',
 'X113',
 'X114',
 'X115',
 'X116',
 'X117',
 'X118',
 'X119',
 'X120',
 'X122',
 'X123',
 'X124',
 'X125',
 'X126',
 'X127',
 'X128',
 'X129',
 'X130',
 'X131',
 'X132',
 'X133',

In [9]:
categorical_features = []

for f in features:
    if train_df[f].dtype == 'object':
        categorical_features.append(f)

In [10]:
categorical_features

['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']

In [11]:
train_df[categorical_features]

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8
0,k,v,at,a,d,u,j,o
1,k,t,av,e,d,y,l,o
2,az,w,n,c,d,x,j,x
3,az,t,n,f,d,x,l,e
4,az,v,n,f,d,h,d,n
...,...,...,...,...,...,...,...,...
4204,ak,s,as,c,d,aa,d,q
4205,j,o,t,d,d,aa,h,h
4206,ak,v,r,a,d,aa,g,e
4207,al,r,e,f,d,aa,l,u


In [12]:
train_df[binary_features]

Unnamed: 0,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,0,0,1,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4205,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4206,0,0,1,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4207,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
# check variance for the binary features now

In [14]:
var = train_df[binary_features].var()

In [15]:
var.head(15)

X10    0.013131
X11    0.000000
X12    0.069457
X13    0.054623
X14    0.244893
X15    0.000475
X16    0.002607
X17    0.007547
X18    0.007781
X19    0.089660
X20    0.122430
X21    0.002607
X22    0.079414
X23    0.020248
X24    0.001898
dtype: float64

In [16]:
var = pd.DataFrame(var)

In [17]:
var[1] = binary_features

In [18]:
new_var = {'Binary_Features':var[1],
           'Variance':var[0]}

In [19]:
new_var1 = pd.DataFrame(new_var)

In [20]:
new_var1.reset_index(inplace=True, drop=True)

In [21]:
new_var1.head(15)

Unnamed: 0,Binary_Features,Variance
0,X10,0.013131
1,X11,0.0
2,X12,0.069457
3,X13,0.054623
4,X14,0.244893
5,X15,0.000475
6,X16,0.002607
7,X17,0.007547
8,X18,0.007781
9,X19,0.08966


In [22]:
# To check how many Binary features have variance 0
var0 = new_var1[new_var1['Variance']==0]
var0

Unnamed: 0,Binary_Features,Variance
1,X11,0.0
81,X93,0.0
95,X107,0.0
217,X233,0.0
219,X235,0.0
252,X268,0.0
273,X289,0.0
274,X290,0.0
277,X293,0.0
281,X297,0.0


In [23]:
len(var0)

12

In [24]:
# 12 features have zero variance, so they will be removed
var0['Binary_Features']

1       X11
81      X93
95     X107
217    X233
219    X235
252    X268
273    X289
274    X290
277    X293
281    X297
313    X330
330    X347
Name: Binary_Features, dtype: object

In [25]:
# dropping the features which have 0 variance
train_df.drop(columns=var0['Binary_Features'],inplace=True)

In [26]:
train_df

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,8405,107.39,ak,s,as,c,d,aa,d,q,...,1,0,0,0,0,0,0,0,0,0
4205,8406,108.77,j,o,t,d,d,aa,h,h,...,0,1,0,0,0,0,0,0,0,0
4206,8412,109.22,ak,v,r,a,d,aa,g,e,...,0,0,1,0,0,0,0,0,0,0
4207,8415,87.48,al,r,e,f,d,aa,l,u,...,0,0,0,0,0,0,0,0,0,0


In [27]:
# Now the shape of the data frame has been reduced from 4209 rows × 378 columns to 4209 rows × 366 columns 
# 12 features with variance 0 have been removed from the training data set
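The variance table and the drop can also be written in a few lines. The sketch below uses a made-up toy frame rather than the real data, so the column values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for train_df[binary_features]
toy = pd.DataFrame({
    "X10": [0, 1, 0, 1],
    "X11": [0, 0, 0, 0],   # constant column, variance 0
    "X12": [1, 1, 1, 1],   # constant column, variance 0
    "X13": [0, 0, 1, 0],
})

# Per-column variance, then the names of the columns where it is exactly zero
variances = toy.var()
zero_var_cols = variances[variances == 0].index.tolist()

# Drop them without modifying the original frame
reduced = toy.drop(columns=zero_var_cols)

print(zero_var_cols)
print(reduced.shape)
```

scikit-learn's `VarianceThreshold` offers a comparable filter when the step needs to live inside a pipeline.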

## Check for null and unique values for test and train sets.

In [28]:
train_df.isnull().sum().any()         # there are no null/NA values

False

In [29]:
train_df.duplicated().any()           # there are no duplicated rows

False

In [30]:
train_df.drop(columns="ID",inplace=True)        # Dropping ID column, as it has no significance 
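The cells above check nulls and duplicates on the training set only. A possible way to cover nulls and unique values for both sets is sketched below; the helper `summarize` and the toy frames are hypothetical, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df (made-up values)
toy_train = pd.DataFrame({"X0": ["k", "az", "az"], "X10": [0, 1, 0]})
toy_test = pd.DataFrame({"X0": ["az", "t", None], "X10": [0, 0, 0]})

def summarize(df):
    """Return per-column null counts and distinct-value counts."""
    return pd.DataFrame({
        "nulls": df.isnull().sum(),
        "unique": df.nunique(),   # nunique() ignores missing values
    })

train_summary = summarize(toy_train)
test_summary = summarize(toy_test)

# Categories present in test but never seen in train
unseen = set(toy_test["X0"].dropna()) - set(toy_train["X0"])

print(train_summary)
print(test_summary)
print(unseen)
```

Categories that appear only in the test set matter for the label-encoding step, because an encoder fitted on train alone cannot transform them.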

## Apply label encoder

In [31]:
# Label encoder can be applied to categorical features only

In [32]:
train_df[categorical_features]

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8
0,k,v,at,a,d,u,j,o
1,k,t,av,e,d,y,l,o
2,az,w,n,c,d,x,j,x
3,az,t,n,f,d,x,l,e
4,az,v,n,f,d,h,d,n
...,...,...,...,...,...,...,...,...
4204,ak,s,as,c,d,aa,d,q
4205,j,o,t,d,d,aa,h,h
4206,ak,v,r,a,d,aa,g,e
4207,al,r,e,f,d,aa,l,u


In [33]:
from sklearn.preprocessing import LabelEncoder

In [34]:
le = LabelEncoder()

In [36]:
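# apply() calls le.fit_transform once per column, so each column gets its own 0..n-1 mapping
# and `le` ends up holding only the classes of the last column (X8)
train_df[categorical_features] = train_df[categorical_features].apply(le.fit_transform)
train_df[categorical_features].head(20)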

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8
0,32,23,17,0,3,24,9,14
1,32,21,19,4,3,28,11,14
2,20,24,34,2,3,27,9,23
3,20,21,34,5,3,27,11,4
4,20,23,34,5,3,12,3,13
5,40,3,25,2,3,11,7,18
6,9,19,25,5,3,10,7,18
7,36,13,16,5,3,10,9,0
8,43,20,16,4,3,10,8,7
9,31,3,14,2,3,10,0,4
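To make the integer codes above easier to interpret, here is a small sketch of how `LabelEncoder` assigns them: the distinct values of a column are sorted and numbered from 0. The toy frame and the `encoders` dictionary are assumptions for illustration; keeping one fitted encoder per column is one way to retain every mapping for later use.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for train_df[categorical_features]
toy = pd.DataFrame({"X0": ["k", "az", "ak"], "X3": ["a", "f", "a"]})

encoders = {}            # one fitted encoder per column
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ holds the sorted categories; the position is the integer code
mapping_x0 = {str(c): i for i, c in enumerate(encoders["X0"].classes_)}

print(mapping_x0)
print(encoded)
```

Because the codes follow alphabetical order, they carry no meaning about the target; tree-based models can still split on them, but the ordering itself is arbitrary.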


In [37]:
train_df       # Now all the entries are in numerical format

Unnamed: 0,y,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,130.81,32,23,17,0,3,24,9,14,0,...,0,0,1,0,0,0,0,0,0,0
1,88.53,32,21,19,4,3,28,11,14,0,...,1,0,0,0,0,0,0,0,0,0
2,76.26,20,24,34,2,3,27,9,23,0,...,0,0,0,0,0,0,1,0,0,0
3,80.62,20,21,34,5,3,27,11,4,0,...,0,0,0,0,0,0,0,0,0,0
4,78.02,20,23,34,5,3,12,3,13,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,107.39,8,20,16,2,3,0,3,16,0,...,1,0,0,0,0,0,0,0,0,0
4205,108.77,31,16,40,3,3,0,7,7,0,...,0,1,0,0,0,0,0,0,0,0
4206,109.22,8,23,38,0,3,0,6,4,0,...,0,0,1,0,0,0,0,0,0,0
4207,87.48,9,19,25,5,3,0,11,20,0,...,0,0,0,0,0,0,0,0,0,0


In [38]:
train_df.corr()      # pairwise correlations between all columns, including the target y

Unnamed: 0,y,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
y,1.000000,-0.187081,-0.003032,0.078852,-0.150741,-0.015272,-0.039031,0.004252,0.003328,-0.026985,...,0.029100,0.114005,0.061403,-0.258679,0.067919,0.040932,-0.159815,0.040291,-0.004591,-0.022280
X0,-0.187081,1.000000,-0.271123,-0.139904,-0.070645,0.017988,0.012293,0.037549,0.047735,0.081122,...,0.113272,0.070546,0.045173,-0.102136,0.083352,-0.038618,-0.060401,-0.011174,0.009110,0.011660
X1,-0.003032,-0.271123,1.000000,0.088266,0.205657,-0.020724,0.046417,-0.079119,-0.000306,-0.137193,...,0.056874,-0.102424,-0.248791,0.145282,0.070753,-0.022360,0.120044,-0.029253,0.017603,0.008356
X2,0.078852,-0.139904,0.088266,1.000000,-0.093546,0.002289,-0.017722,0.065778,-0.069932,0.042398,...,-0.174308,0.033697,0.122503,0.131974,0.033645,0.006473,0.024392,-0.019873,-0.002614,-0.004529
X3,-0.150741,-0.070645,0.205657,-0.093546,1.000000,0.015298,-0.008161,-0.048468,-0.001249,0.019663,...,0.051801,-0.105009,-0.588272,0.173723,-0.026446,0.004166,-0.046271,-0.028280,0.007273,0.045180
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
X380,0.040932,-0.038618,-0.022360,0.006473,0.004166,0.002611,0.010434,-0.014059,0.009511,-0.010479,...,-0.061741,-0.022240,-0.061168,-0.013110,-0.008839,1.000000,-0.007899,-0.003683,-0.001968,-0.003410
X382,-0.159815,-0.060401,0.120044,0.024392,-0.046271,0.002533,-0.031128,0.054548,-0.000996,-0.010164,...,-0.059883,-0.021571,-0.059327,-0.012716,-0.008573,-0.007899,1.000000,-0.003572,-0.001908,-0.003307
X383,0.040291,-0.011174,-0.029253,-0.019873,-0.028280,0.001181,-0.007337,-0.021293,0.038712,-0.004740,...,-0.015413,-0.010059,0.035107,-0.005930,-0.003998,-0.003683,-0.003572,1.000000,-0.000890,-0.001542
X384,-0.004591,0.009110,0.017603,-0.002614,0.007273,0.000631,0.007030,0.023867,0.008950,-0.002532,...,-0.014917,-0.005373,0.008694,-0.003168,-0.002136,-0.001968,-0.001908,-0.000890,1.000000,-0.000824
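The full correlation matrix is hard to read at this width. One way to focus on the target, sketched here on made-up toy data, is to take only the `y` column of the matrix and rank the features by absolute correlation.

```python
import pandas as pd

# Toy stand-in for the encoded train_df (made-up values)
toy = pd.DataFrame({
    "y":   [10.0, 20.0, 30.0, 40.0],
    "X10": [0, 0, 1, 1],
    "X12": [1, 0, 1, 0],
    "X13": [1, 1, 0, 0],
})

# Correlation of every feature with the target only
target_corr = toy.corr()["y"].drop("y")

# Rank by strength of the linear relationship, ignoring the sign
ranked = target_corr.abs().sort_values(ascending=False)

print(target_corr)
print(ranked)
```

Pearson correlation captures only linear relationships, so a low value here does not by itself mean a feature is useless to a tree-based model.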


In [None]:
# Splitting the dependent and independent variables

In [39]:
y = train_df['y'].values
y

array([130.81,  88.53,  76.26, ..., 109.22,  87.48, 110.85])

In [40]:
x = train_df.drop('y',axis=1).values
x

array([[32, 23, 17, ...,  0,  0,  0],
       [32, 21, 19, ...,  0,  0,  0],
       [20, 24, 34, ...,  0,  0,  0],
       ...,
       [ 8, 23, 38, ...,  0,  0,  0],
       [ 9, 19, 25, ...,  0,  0,  0],
       [46, 19,  3, ...,  0,  0,  0]])

### Perform dimensionality reduction

In [41]:
from sklearn.decomposition import PCA

In [42]:
pca = PCA(n_components=20 , random_state=111)

In [43]:
x_pca = pca.fit_transform(x)
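The choice of 20 components is not justified in the cells above. A common way to pick the number, sketched below on synthetic data (not the real `x`), is to look at the cumulative explained variance ratio; the 95% threshold is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(111)

# Synthetic stand-in for x: 200 rows, 30 columns driven by 5 latent factors
latent = rng.normal(size=(200, 5))
x_toy = latent @ rng.normal(size=(5, 30)) + 0.01 * rng.normal(size=(200, 30))

# Fit with all components to see how the variance is distributed
pca_full = PCA(random_state=111).fit(x_toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

print(n_needed)
```

On the real data, `pca.explained_variance_ratio_.sum()` for the fitted 20-component model would show how much variance those components retain.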

In [44]:
from sklearn.model_selection import train_test_split

In [45]:
x_train, x_test, y_train, y_test = train_test_split(x_pca,y,test_size = 0.3, random_state = 111)

In [46]:
x_train.shape, x_test.shape

((2946, 20), (1263, 20))

In [47]:
y_train.shape, y_test.shape

((2946,), (1263,))

## Predict your test_df values using XGBoost

In [48]:
import xgboost as xgb
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error as MSE

In [49]:
# eta is XGBoost's alias for learning_rate, so passing both (0.1 and 0.7) is ambiguous
xgb_r = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, subsample=0.85, max_depth=4,
                         colsample_bylevel=0.8, colsample_bynode=0.7, min_child_weight=2,
                         eta=0.1, learning_rate=0.7, seed=111, base_score=y.mean())

In [50]:
xgb_r.fit(x_train,y_train)

XGBRegressor(base_score=100.66931812782134, booster=None, colsample_bylevel=0.8,
             colsample_bynode=0.7, colsample_bytree=1, eta=0.1, gamma=0,
             gpu_id=-1, importance_type='gain', interaction_constraints=None,
             learning_rate=0.7, max_delta_step=0, max_depth=4,
             min_child_weight=2, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=111,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=111,
             subsample=0.85, tree_method=None, validate_parameters=False,
             verbosity=None)

In [51]:
y_train_pred = xgb_r.predict(x_train)

In [52]:
y_test_pred = xgb_r.predict(x_test)

In [53]:
MSE(y_test,y_test_pred)

129.69352069731883

In [54]:
y_test_pred

array([ 98.467705,  76.107895,  87.66711 , ...,  93.43346 , 109.80992 ,
        99.02905 ], dtype=float32)

In [55]:
y_test

array([ 96.76,  77.64,  87.8 , ...,  97.7 , 110.46,  93.28])

In [None]:
# checking the R² score on the training split and on the held-out split

In [56]:
r2_score(y_train,y_train_pred)

0.9246009499179235

In [57]:
r2_score(y_test,y_test_pred)

0.257498271523232
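The training R² of about 0.92 against a held-out R² of about 0.26 indicates that the model fits the training rows far better than unseen ones, and the held-out MSE of 129.69 corresponds to an RMSE of roughly 11.4 in the units of `y`. The sketch below shows how both metrics are computed, using only the six aligned values visible in the truncated outputs above (rounded), so its numbers will not reproduce the full-split scores.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# The six y_test / y_test_pred pairs visible in the outputs above (rounded)
y_true = np.array([96.76, 77.64, 87.80, 97.70, 110.46, 93.28])
y_pred = np.array([98.47, 76.11, 87.67, 93.43, 109.81, 99.03])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# MSE is the mean squared residual; RMSE puts it back in the units of y
mse_manual = ss_res / len(y_true)
rmse = np.sqrt(mse_manual)

print(r2_manual, mse_manual, rmse)
```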

## Predicting test data frame output 

In [None]:
# Applying the same data-cleaning steps to the test set as were applied to the training set

In [58]:
test_df.drop('ID',axis=1,inplace=True)

In [59]:
test_df

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8,X10,X11,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,az,v,n,f,d,t,a,w,0,0,...,0,0,0,1,0,0,0,0,0,0
1,t,b,ai,a,d,b,g,y,0,0,...,0,0,1,0,0,0,0,0,0,0
2,az,v,as,f,d,a,j,j,0,0,...,0,0,0,1,0,0,0,0,0,0
3,az,l,n,f,d,z,l,n,0,0,...,0,0,0,1,0,0,0,0,0,0
4,w,s,as,c,d,y,i,m,0,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,aj,h,as,f,d,aa,j,e,0,0,...,0,0,0,0,0,0,0,0,0,0
4205,t,aa,ai,d,d,aa,j,y,0,0,...,0,1,0,0,0,0,0,0,0,0
4206,y,v,as,f,d,aa,d,w,0,0,...,0,0,0,0,0,0,0,0,0,0
4207,ak,v,as,a,d,aa,c,q,0,0,...,0,0,1,0,0,0,0,0,0,0


In [60]:
test_df[categorical_features]

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8
0,az,v,n,f,d,t,a,w
1,t,b,ai,a,d,b,g,y
2,az,v,as,f,d,a,j,j
3,az,l,n,f,d,z,l,n
4,w,s,as,c,d,y,i,m
...,...,...,...,...,...,...,...,...
4204,aj,h,as,f,d,aa,j,e
4205,t,aa,ai,d,d,aa,j,y
4206,y,v,as,f,d,aa,d,w
4207,ak,v,as,a,d,aa,c,q


In [61]:
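# Caveat: this refits the encoder on the test set, so a category can receive a different
# integer here than in train_df (for example, 'az' is 20 in the train output but 21 here)
test_df[categorical_features] = test_df[categorical_features].apply(le.fit_transform)
test_df[categorical_features]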

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8
0,21,23,34,5,3,26,0,22
1,42,3,8,0,3,9,6,24
2,21,23,17,5,3,0,9,9
3,21,13,34,5,3,31,11,13
4,45,20,17,2,3,30,8,12
...,...,...,...,...,...,...,...,...
4204,6,9,17,5,3,1,9,4
4205,42,1,8,3,3,1,9,24
4206,47,23,17,5,3,1,3,22
4207,7,23,17,0,3,1,2,16
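For the model to interpret test rows the way it interpreted training rows, each category needs the same integer in both sets. One option, sketched here on made-up toy columns, is to fit a single encoder per column on the union of the train and test categories and then call `transform` on each set; this is an illustrative alternative, not what the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for one categorical column of train_df and test_df
toy_train = pd.DataFrame({"X0": ["k", "az", "ak"]})
toy_test = pd.DataFrame({"X0": ["az", "t", "b"]})   # "t" and "b" never appear in train

enc = LabelEncoder()
# Fit once on the union of categories so both frames share one mapping
enc.fit(pd.concat([toy_train["X0"], toy_test["X0"]]))

train_codes = enc.transform(toy_train["X0"])
test_codes = enc.transform(toy_test["X0"])

# For contrast: refitting on test alone gives "az" a different code
refit_codes = LabelEncoder().fit_transform(toy_test["X0"])

print(train_codes, test_codes, refit_codes)
```

With the shared encoder, `"az"` is 1 in both sets; with the refit encoder it becomes 0 in the test set.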


In [62]:
test_df

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8,X10,X11,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,21,23,34,5,3,26,0,22,0,0,...,0,0,0,1,0,0,0,0,0,0
1,42,3,8,0,3,9,6,24,0,0,...,0,0,1,0,0,0,0,0,0,0
2,21,23,17,5,3,0,9,9,0,0,...,0,0,0,1,0,0,0,0,0,0
3,21,13,34,5,3,31,11,13,0,0,...,0,0,0,1,0,0,0,0,0,0
4,45,20,17,2,3,30,8,12,0,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,6,9,17,5,3,1,9,4,0,0,...,0,0,0,0,0,0,0,0,0,0
4205,42,1,8,3,3,1,9,24,0,0,...,0,1,0,0,0,0,0,0,0,0
4206,47,23,17,5,3,1,3,22,0,0,...,0,0,0,0,0,0,0,0,0,0
4207,7,23,17,0,3,1,2,16,0,0,...,0,0,1,0,0,0,0,0,0,0


In [63]:
test_var = test_df[binary_features].var()
test_var

X10     0.018650
X11     0.000238
X12     0.068851
X13     0.057345
X14     0.244859
          ...   
X380    0.008015
X382    0.008715
X383    0.000475
X384    0.000712
X385    0.001661
Length: 368, dtype: float64

In [64]:
test_var = pd.DataFrame(test_var)

In [65]:
test_var[1] = binary_features

In [66]:
test_new_var = {'Binary_Features':test_var[1],
           'Variance':test_var[0]}

In [67]:
test_new_var1 = pd.DataFrame(test_new_var)

In [68]:
test_new_var1.reset_index(inplace=True, drop=True)

In [78]:
test_new_var1.head(10)

Unnamed: 0,Binary_Features,Variance
0,X10,0.01865
1,X11,0.000238
2,X12,0.068851
3,X13,0.057345
4,X14,0.244859
5,X15,0.000712
6,X16,0.002607
7,X17,0.008715
8,X18,0.010114
9,X19,0.09922


In [70]:
# To check how many Binary features have variance 0
test_var0 = test_new_var1[test_new_var1['Variance']==0]
test_var0

Unnamed: 0,Binary_Features,Variance
241,X257,0.0
242,X258,0.0
279,X295,0.0
280,X296,0.0
352,X369,0.0


In [71]:
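# Caveat: these are the columns that are constant in the test set, not the 12 dropped from train_df
test_df.drop(columns=test_var0['Binary_Features'],inplace=True)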

In [72]:
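# Caveat: pca2 is fitted on the test set, so its components need not match those of `pca`, on which xgb_r was trained
pca2 = PCA(n_components=20, random_state=111)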

In [73]:
test_pca = pca2.fit_transform(test_df)

In [74]:
test_pca.shape

(4209, 20)
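A model trained on PCA components expects new rows to be projected onto the same axes. The sketch below, on synthetic data, shows the usual pattern: select the same columns that were kept for training, then call `transform` on the PCA object that was fitted on the training data. The array names and the dropped column index are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(111)
train_toy = rng.normal(size=(100, 8))   # synthetic stand-in for the training features
test_toy = rng.normal(size=(40, 8))     # synthetic stand-in for the test features

# Suppose column 3 was the zero-variance column dropped from train
dropped_idx = [3]
keep = [i for i in range(train_toy.shape[1]) if i not in dropped_idx]

# Fit the projection on the training rows only
pca_toy = PCA(n_components=4, random_state=111).fit(train_toy[:, keep])

# Apply the SAME column selection and the SAME fitted PCA to the test rows
test_proj = pca_toy.transform(test_toy[:, keep])

# For contrast: a PCA refit on the test rows defines different axes
test_refit = PCA(n_components=4, random_state=111).fit_transform(test_toy[:, keep])

print(test_proj.shape)
print(np.allclose(test_proj, test_refit))
```

Both arrays have the same shape, which is why a mismatch of this kind raises no error at prediction time.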

In [75]:
test_df_pred = xgb_r.predict(test_pca)

In [76]:
test_df_pred

array([ 84.61177 , 106.26537 , 119.3957  , ..., 101.437706, 103.113815,
        85.02062 ], dtype=float32)

In [77]:
pd.DataFrame(test_df_pred).to_csv('MB_pred.csv')
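Since `ID` was dropped from `test_df` in place, the saved file contains only a positional index and the predictions. If the predictions need to be matched back to cars, one option is to set the IDs aside before dropping them and write them next to the predictions. The sketch below assumes an `ID,y` column layout and uses made-up values; it writes to an in-memory buffer instead of `MB_pred.csv`.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-ins: IDs saved before the drop, and model predictions
test_ids = pd.Series([1, 2, 3, 4], name="ID")
preds = np.array([84.61, 106.27, 119.40, 101.44], dtype=np.float32)

# Pair each prediction with its ID (assumed output layout: ID, y)
submission = pd.DataFrame({"ID": test_ids, "y": preds})

# index=False keeps the positional index out of the file
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```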