### Problem description:
To ensure the safety and reliability of every unique car configuration before it hits the road, Daimler’s engineers have developed a robust testing system. However, optimizing the speed of that testing system across so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach. For one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on the production lines.

Our objective here was to work with a dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing.

**Step 1 - Read in the data and look at its shape.**

In [1]:
import pandas as pd
import numpy as np
# read in train and test data
train = pd.read_csv('q3_train.csv')
test = pd.read_csv('q3_test.csv')

print(train.shape, test.shape)

(4209, 378) (4209, 377)


**Step 2 - Which is the outcome variable?**

In [2]:
list(set(train.columns) - set(test.columns))

['y']

The column in train that is not in test is `y`, and that is our outcome variable. The remaining 377 columns in the training data are the `ID` column plus 376 predictor variables. They are all anonymized, so we can't make much sense of them!

**Step 3 - Separate out the predictors and the outcome** <br>
I also don't need the `ID` column as a predictor

In [3]:
X_train = train.drop(['y', 'ID'], axis = 1)
y_train = train['y']

In [4]:
print(X_train.shape)
print(y_train.shape)

(4209, 376)
(4209,)


**Step 4 - What are the column data types like?**

In [5]:
print(X_train.dtypes.value_counts())
print(test.dtypes.value_counts())

int64    376
dtype: int64
int64    377
dtype: int64


All of our variables are of the type 'integer'. The question mentions that categorical levels have been converted to numbers. These variables would need to be one-hot encoded so that they can be used in prediction.
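
As a minimal sketch of what one-hot encoding an integer-coded categorical column could look like, the toy frame below uses made-up values and assumes that `X0` is categorical while `X10` is already a binary flag. Which columns of the real data are categorical is not something this sketch can tell you.

```python
import pandas as pd

# Toy data (hypothetical): 'X0' is assumed to be an integer-coded categorical,
# 'X10' is assumed to be a binary flag that needs no encoding
toy = pd.DataFrame({'X0': [3, 1, 3, 2], 'X10': [0, 1, 1, 0]})

# One-hot encode only the categorical column; the binary flag is left untouched
toy_encoded = pd.get_dummies(toy, columns=['X0'], prefix='X0')

print(toy_encoded.columns.tolist())
```

Each level of `X0` becomes its own indicator column, so the model no longer treats the integer codes as ordered quantities.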

**Step 5 - Does our data have missing values?**

In [6]:
def missing_check(df):
    # this gets a column wise sum of missing values
    missing_df = df.isnull().sum().reset_index()
    missing_df.columns = ['variable', 'missing_values']
    missing_df['Total_Rows'] = len(df)
    missing_df['Perc_Missing'] = missing_df['missing_values']*100/len(df)
    # sorts the columns in descending order of missing rows
    missing_df.sort_values('Perc_Missing', ascending = False, inplace = True)
    missing_df = missing_df.loc[missing_df['Perc_Missing']>0, :].reset_index(drop = True)
    if len(missing_df) == 0:
        return "No columns with missing values"
    else:
        return missing_df

In [7]:
print("Missing in train:", missing_check(X_train))
print("Missing in test:", missing_check(test))

Missing in train: No columns with missing values
Missing in test: No columns with missing values


We don't have any missing values in either dataset. This makes our job easier!

**Step 6 - outliers in y?** <br>
We will cap outliers using the median $\pm$ 1.5 * IQR strategy
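
Note that this rule is centred on the median, which differs from the more common Tukey fences (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR). A small sketch of the median-centred version on made-up values (the `_toy` names are illustrative only):

```python
import pandas as pd

# Toy outcome values (made up); 300 is an obvious outlier
y_toy = pd.Series([90.0, 95.0, 100.0, 105.0, 110.0, 300.0])

# Interquartile range and median-centred limits
iqr_toy = y_toy.quantile(0.75) - y_toy.quantile(0.25)
lower_toy = y_toy.median() - 1.5 * iqr_toy
upper_toy = y_toy.median() + 1.5 * iqr_toy

# clip() caps values outside the limits and leaves the rest unchanged
y_capped = y_toy.clip(lower=lower_toy, upper=upper_toy)
print(lower_toy, upper_toy)
print(y_capped.tolist())
```

Only the value of 300 is moved, down to the upper limit. The other five values already lie inside the limits.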

In [8]:
# calculate iqr and get upper and lower limits
iqr = np.percentile(y_train, 75) - np.percentile(y_train, 25)

lower_limit = np.median(y_train) - 1.5* iqr
upper_limit = np.median(y_train) + 1.5* iqr

# cap values outside the limits; clip returns a new Series,
# so nothing is assigned into a slice of the original DataFrame
y_train = y_train.clip(lower = lower_limit, upper = upper_limit)



**Step 8 - Are there variables with just one unique value?** <br>
These do not add value to our outcome predictions, so we will remove these.

In [9]:
# check how many columns have a single unique value
X_train_column_unique_values = X_train.nunique()

# get the names of the columns to be removed
X_train_columns_remove = X_train_column_unique_values[X_train_column_unique_values == 1].index.tolist()

# drop these columns from the dataframe
X_train = X_train.drop(X_train_columns_remove, axis = 1)
X_test = test.drop(X_train_columns_remove + ['ID'], axis = 1)

print("New shape of the train dataframe:", X_train.shape)
print("New shape of the test dataframe:", X_test.shape)

New shape of the train dataframe: (4209, 364)
New shape of the test dataframe: (4209, 364)


**Step 12 - Remove Correlated variables** <br>
Next we find groups of variables whose absolute pairwise correlation is at least 0.6, keep one variable from each group and remove the rest.

In [10]:
correlations = X_train.corr()

In [11]:
def remove_correlated(correlation_df, threshold = 0.6):
    # columns still to be checked, kept in their original order
    # so that the selection is reproducible
    all_columns = list(correlation_df.columns)
    chosen_columns = []
    removed_columns = []

    while len(all_columns) > 0:

        # choose the first column in the list
        col = all_columns[0]

        # add it to the chosen columns list
        chosen_columns.append(col)

        # set criteria to filter variables
        criteria = abs(correlation_df[col]) >= threshold

        # get correlated variables except for the variable itself
        correlated_columns = set(correlation_df.loc[criteria, col].index) - set([col])

        # record the columns that are dropped in this round
        removed_columns.extend(c for c in all_columns if c in correlated_columns)

        # reduce the overall columns to check
        all_columns = [c for c in all_columns if c != col and c not in correlated_columns]

    return chosen_columns

In [12]:
chosen_columns = remove_correlated(correlations)
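
To make the greedy selection easier to follow, here is a small self-contained sketch on a toy frame. The helper name `greedy_correlation_filter` and the data are hypothetical. It follows the same idea: keep the first column, then drop every remaining column whose absolute correlation with it reaches the threshold.

```python
import pandas as pd

def greedy_correlation_filter(frame, threshold=0.6):
    """Keep the first column of each correlated group (hypothetical helper)."""
    corr = frame.corr().abs()
    remaining = list(frame.columns)
    kept = []
    while remaining:
        col = remaining.pop(0)
        kept.append(col)
        # drop every remaining column that is too correlated with the kept one
        remaining = [c for c in remaining if corr.loc[col, c] < threshold]
    return kept

toy = pd.DataFrame({
    'a': [0, 1, 0, 1, 0, 1],
    'b': [0, 1, 0, 1, 1, 1],   # correlation with 'a' is about 0.71
    'c': [1, 1, 0, 0, 1, 0],   # correlation with 'a' is about -0.33
})

kept_columns = greedy_correlation_filter(toy)
print(kept_columns)
```

Column `b` is dropped because it is strongly correlated with `a`, while `c` survives. The result depends on column order, since the first column of each correlated group is always the one that is kept.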

**Step 9 - Align the dataframe columns** <br>
For predictions we want the columns in test and train to be the same. Let's align our dataframes next.

In [13]:
X_train_final = X_train[chosen_columns]
X_train_final, X_test_final = X_train_final.align(X_test, join = 'inner', axis = 1)

print("Shape of train data:", X_train_final.shape)
print("Shape of test data:", X_test_final.shape)

Shape of train data: (4209, 190)
Shape of test data: (4209, 190)
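
For reference, a minimal sketch of what `align` with `join = 'inner'` and `axis = 1` does, on two made-up frames whose columns only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({'X1': [1, 0], 'X2': [0, 1], 'X3': [1, 1]})
right = pd.DataFrame({'X2': [1, 1], 'X3': [0, 0], 'X4': [0, 1]})

# Keep only the columns present in both frames, in the same order
left_aligned, right_aligned = left.align(right, join='inner', axis=1)

print(left_aligned.columns.tolist())
print(right_aligned.columns.tolist())
```

Both frames end up with the shared columns only, which is what a fitted model needs at prediction time.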


**Step 13 - Convert predictor dataframe and outcome dataframe to arrays**

In [14]:
# store feature names
feature_names = X_train_final.columns.tolist()

# convert to arrays
X_train_array = np.array(X_train_final)
y_train_array = np.array(y_train)
test_array = np.array(X_test_final)

from sklearn.model_selection import train_test_split
X_train_train, X_train_val, y_train_train, y_train_val = train_test_split(X_train_array, y_train_array, test_size = 0.2, 
                                                                         random_state = 42)

print(X_train_train.shape)
print(X_train_val.shape)
print(y_train_train.shape)
print(y_train_val.shape)

(3367, 190)
(842, 190)
(3367,)
(842,)
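
The split sizes follow from how scikit-learn handles a fractional `test_size`: the validation size is rounded up and the remaining rows go to training. A quick check of that arithmetic (variable names are illustrative only):

```python
import math

n_samples = 4209
test_size = 0.2

# validation rows: fraction of the data, rounded up
n_val = math.ceil(test_size * n_samples)
# training rows: everything that is left
n_train = n_samples - n_val

print(n_train, n_val)
```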


**Step 14 - Model Building**

**Model 1 - Multiple Linear Regression with LASSO**

In [15]:
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

# define the lasso cv model
cv_model = LassoCV(alphas = None, cv = 5, max_iter = 10000, random_state = 23)
cv_model.fit(X_train_train, y_train_train)
best_alpha = cv_model.alpha_

lasso_model = Lasso(alpha = best_alpha)
lasso_model.fit(X_train_train, y_train_train)

prediction = lasso_model.predict(X_train_val)
r_square = r2_score(y_train_val, prediction)
print(r_square)

0.6340911907190518


A validation $R^2$ of about 0.63 is a reasonable baseline. Let me refit the model on the entire training data and predict on the test data.
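
The score reported by `r2_score` is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. As a small sanity check on made-up numbers (not taken from this dataset), the manual formula and `r2_score` agree:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

A value of 1 means perfect predictions, and a value of 0 means the model does no better than always predicting the mean of `y`.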

In [16]:
# fit on entire train data
lasso_model = Lasso(alpha = best_alpha)
lasso_model.fit(X_train_array, y_train_array)

# predictions
prediction = lasso_model.predict(test_array)

Write the output to a csv for upload to Kaggle

In [17]:
output = pd.DataFrame({'ID': test['ID'],
                      'y': prediction})

output.to_csv('sub_lasso_final.csv', index = False)

**Model 2 - Linear Regression with Ridge** <br>
For this we use the full set of 364 predictors, as limiting the variables to the uncorrelated subset did not improve accuracy.

In [18]:
# store feature names
feature_names = X_train.columns.tolist()

# convert to arrays
X_train_array = np.array(X_train)
y_train_array = np.array(y_train)
test_array = np.array(X_test)

from sklearn.model_selection import train_test_split
X_train_train, X_train_val, y_train_train, y_train_val = train_test_split(X_train_array, y_train_array, test_size = 0.2,
                                                                         random_state = 42)

print(X_train_train.shape)
print(X_train_val.shape)
print(y_train_train.shape)
print(y_train_val.shape)

(3367, 364)
(842, 364)
(3367,)
(842,)


In [19]:
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Ridge

# define the ridge cv model
cv_model = RidgeCV(alphas = 10**np.linspace(10,-6,100)*0.5, cv = 5)
cv_model.fit(X_train_train, y_train_train)
best_alpha = cv_model.alpha_

# fit on the training split
ridge_model = Ridge(alpha = best_alpha)
ridge_model.fit(X_train_train, y_train_train)

# get estimate on X_val
prediction = ridge_model.predict(X_train_val)
r_square = r2_score(y_train_val, prediction)
print(r_square)

0.6310564205536129


Fit the ridge model on the entire training data

In [20]:
# fit on entire train data
ridge_model = Ridge(alpha = best_alpha)
ridge_model.fit(X_train_array, y_train_array)

# predictions
prediction = ridge_model.predict(test_array)

Write the output to a csv for upload to Kaggle

In [21]:
output = pd.DataFrame({'ID': test['ID'],
                      'y': prediction})

output.to_csv('sub_ridge_final.csv', index = False)

We see that RIDGE underperforms LASSO on the test data. This is probably because ridge keeps a nonzero weight on every variable, whereas LASSO can shrink weights to exactly zero, which makes ridge more prone to overfitting here.
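
The difference in how the two penalties treat weights can be illustrated on synthetic data. This is a sketch only: it uses `make_regression` with arbitrary settings and a fixed `alpha = 1.0`, not the Mercedes data or the tuned alphas from above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 50 features, only 5 of which actually drive the target
X_toy, y_toy = make_regression(n_samples=200, n_features=50, n_informative=5,
                               noise=10.0, random_state=0)

lasso_toy = Lasso(alpha=1.0).fit(X_toy, y_toy)
ridge_toy = Ridge(alpha=1.0).fit(X_toy, y_toy)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso_toy.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_toy.coef_ == 0))

print("zero weights with lasso:", n_zero_lasso)
print("zero weights with ridge:", n_zero_ridge)
```

The L1 penalty drops uninformative features entirely, whereas the L2 penalty only shrinks them towards zero.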

**Model 3 - Multi-Layer Perceptron (MLP)**

In [22]:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

# hidden units = (100,), batch_size = 25, learning_rate = 0.00001, max_iter = 1000, score = 0.504

X_train_train, X_train_val, y_train_train, y_train_val = train_test_split(X_train_array, y_train_array, test_size = 0.2,
                                                                         random_state = 42)

mlp_model = MLPRegressor(activation = 'identity', solver = 'sgd', learning_rate = 'constant', 
                         random_state = 42, learning_rate_init = 0.00001, 
                         hidden_layer_sizes = (100,), max_iter = 1000, batch_size = 25)

mlp_model.fit(X_train_train, y_train_train)

predictions = mlp_model.predict(X_train_val)

print(r2_score(y_train_val, predictions))

0.5554866666752313
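
One thing worth noting about this configuration: with `activation = 'identity'` the hidden layer applies no non-linearity, so the network can only represent a linear function of the inputs. The NumPy sketch below uses random, made-up weights (not the fitted model) to show two identity layers collapsing into a single linear map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up inputs and weights: 3 inputs -> 5 hidden units -> 1 output
x = rng.normal(size=(4, 3))
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 1)), rng.normal(size=1)

# Forward pass through two layers with identity activation
two_layer = (x @ W1 + b1) @ W2 + b2

# The same function written as one linear layer
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layer, one_layer))
```

A non-linear activation such as `'relu'` would be needed for the hidden layer to add modelling capacity beyond a linear regression.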


In [23]:
# fit on entire training data
mlp_model.fit(X_train_array, y_train_array)

predictions = mlp_model.predict(test_array)

output = pd.DataFrame({'ID': test['ID'],
                      'y': predictions})

output.to_csv('sub_mlp.csv', index = False)

**Model 4 - Target-encoded features modelled with XGBoost**

Here, instead of using the label-encoded categorical variables directly, we encode each categorical level by the rank of its mean target value.  
For this, the data was downloaded directly from Kaggle, with the categorical variables intact.


In [25]:
train1 = pd.read_csv('train.csv')
test1 = pd.read_csv('test.csv')

# Identifying the categorical columns
train_cols = train1.columns
train_cols_num = train1._get_numeric_data().columns
cat_cols = list(set(train_cols) - set(train_cols_num))
print(cat_cols)

['X4', 'X1', 'X2', 'X3', 'X6', 'X0', 'X5', 'X8']


**ii. Removing the outliers from the dataset**

In [26]:
def limits(k):
    # mean +/- 2 standard deviations
    upper_limit = k.mean() + 2*k.std()
    lower_limit = k.mean() - 2*k.std()
    return (lower_limit,upper_limit)

# compute both limits from the same dataframe (train1)
y_lower, y_upper = limits(train1['y'])
mask = (train1['y'] < y_lower) | (train1['y'] > y_upper)
outlier_indices = train1.index[mask]
train_cleaned = train1.drop(index = outlier_indices)
train_cleaned

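|  | ID | y | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 6 | 88.53 | k | t | av | e | d | y | l | o | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 7 | 76.26 | az | w | n | c | d | x | j | x | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 9 | 80.62 | az | t | n | f | d | x | l | e | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 13 | 78.02 | az | v | n | f | d | h | d | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 18 | 92.93 | t | b | e | c | d | g | h | s | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 25 | 91.91 | o | l | as | f | d | f | j | a | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 27 | 108.67 | w | s | as | e | d | f | i | h | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | 31 | 102.09 | h | r | r | f | d | f | h | p | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11 | 32 | 98.12 | al | r | e | f | d | f | h | o | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 34 | 82.62 | s | b | ai | c | d | f | g | m | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |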
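
As a toy illustration of the mean $\pm$ 2 standard deviations rule used in this cell, the sketch below drops rows outside those limits. The values are made up and the variable names are illustrative only.

```python
import pandas as pd

# Made-up outcome values; 180 is far from the rest
toy = pd.DataFrame({'y': [98.0, 100.0, 102.0, 99.0, 101.0,
                          100.0, 97.0, 103.0, 100.0, 180.0]})

y_mean_toy, y_std_toy = toy['y'].mean(), toy['y'].std()

# keep rows whose y lies within mean +/- 2 standard deviations
keep = toy['y'].between(y_mean_toy - 2 * y_std_toy, y_mean_toy + 2 * y_std_toy)
toy_cleaned = toy[keep]

print(len(toy), "->", len(toy_cleaned))
```

Unlike the capping in Step 6, this approach removes the outlying rows entirely rather than moving them to the limit.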


**iii. Encoding the categorical variables using the rank of the mean target value**

In [27]:
# Adding train + test
train1['eval_set'] = 0; test1['eval_set'] = 1
df = pd.concat([train1, test1], axis=0, copy=True,sort = True)
# Reset index
df.reset_index(drop=True, inplace=True)

# Encoding the variables
def add_new_col(x):
    if x not in new_col.keys(): 
        # label appears in test but not in train: fall back to the middle rank n/2
        # (n is the number of unique labels in train);
        # an alternative could be -100 (something outside the range [0, n-1])
        return int(len(new_col.keys())/2)
    return new_col[x] # rank of the label

for c in cat_cols:
    # get labels and corresponding means
    new_col = train_cleaned.groupby(c).y.mean().sort_values().reset_index()
    
    # make a dictionary, where key is a label and value is the rank of that label
    new_col = new_col.reset_index().set_index(c).drop('y', axis=1)['index'].to_dict()
    
    # add new column to the dataframe
    df[c + '_new'] = df[c].apply(add_new_col)

# drop old categorical columns
df_new = df.drop(cat_cols, axis=1)

# show the result
df_new.head()

# Train test split
X = df.drop(cat_cols, axis=1)

# Train
X_train = X[X.eval_set == 0]
y_train = X_train.pop('y'); 
X_train = X_train.drop(['eval_set', 'ID'], axis=1)

# Test
X_test = X[X.eval_set == 1]
X_test = X_test.drop(['y', 'eval_set', 'ID'], axis=1)

# Base score
y_mean = y_train.mean()
# Shapes

print('Shape X_train: {}\nShape X_test: {}'.format(X_train.shape, X_test.shape))

X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)



Shape X_train: (4209, 376)
Shape X_test: (4209, 376)
(4209, 376)
(4209, 376)
(4209,)
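
The encoding step can be hard to follow because `add_new_col` relies on the loop variable `new_col`. The sketch below shows the same idea on a made-up column with `map`: labels are ranked by their mean target in train, and labels unseen in train fall back to the middle rank. All names and values here are hypothetical.

```python
import pandas as pd

toy_train = pd.DataFrame({'X0': ['a', 'b', 'a', 'c', 'b', 'c'],
                          'y':  [110.0, 80.0, 120.0, 95.0, 90.0, 105.0]})
toy_test = pd.DataFrame({'X0': ['b', 'c', 'zz']})   # 'zz' never appears in train

# Rank the labels by their mean target: lowest mean gets rank 0
means = toy_train.groupby('X0')['y'].mean().sort_values()
rank_map = {label: rank for rank, label in enumerate(means.index)}

# Unseen labels fall back to the middle rank, mirroring the n/2 rule
fallback = len(rank_map) // 2
toy_test['X0_new'] = toy_test['X0'].map(rank_map).fillna(fallback).astype(int)

print(rank_map)
print(toy_test)
```

Because the ranks are derived from the target, they should be computed from training rows only, as the notebook does with `train_cleaned`.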


**iv. Running the models**


In [36]:
import xgboost as xgb
from xgboost import XGBRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

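# note: 'n_trees' is not an XGBoost parameter and has been dropped;
# the number of trees is controlled by num_boost_round
xgb_params = {
    'eta': 0.005,
    'max_depth': 3,
    'subsample': 0.95,
    'colsample_bytree': 0.6,
    'objective': 'reg:squarederror',   # current name of the deprecated 'reg:linear'
    'eval_metric': 'rmse',
    'base_score': np.log(y_mean),
    'verbosity': 0                     # replaces the deprecated 'silent' flag
}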

# form DMatrices for Xgboost training
dtrain = xgb.DMatrix(X_train, np.array(np.log(y_train)))
dtest = xgb.DMatrix(X_test)

# evaluation metric
def the_metric(y_pred, y):
    y_true = y.get_label()
    return 'r2', r2_score(y_true, y_pred)

# xgboost, cross-validation
cv_result = xgb.cv(xgb_params, 
                   dtrain, 
                   num_boost_round=2000, 
                   nfold = 3,
                   early_stopping_rounds=50,
                   feval=the_metric,
                   verbose_eval=False,
                   show_stdv=False
                  )

num_boost_rounds = len(cv_result)
print('num_boost_rounds=' + str(num_boost_rounds))

# train model
model = xgb.train(dict(xgb_params), dtrain, num_boost_round=num_boost_rounds)

# Predict on train and test
y_train_pred = np.exp(model.predict(dtrain))
y_pred = np.exp(model.predict(dtest))


num_boost_rounds=877
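
Because the model is trained on `np.log(y_train)`, the `r2` value computed inside `xgb.cv` is on the log scale, and predictions have to be passed through `np.exp` to get back to seconds. A small sketch with made-up numbers shows that the two scales give different $R^2$ values, so they should not be compared directly:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up testing times and predictions on the original scale
y_true = np.array([75.0, 90.0, 100.0, 110.0, 150.0])
y_hat = np.array([80.0, 88.0, 102.0, 108.0, 130.0])

r2_original = r2_score(y_true, y_hat)
r2_log = r2_score(np.log(y_true), np.log(y_hat))

# exp undoes the log transform applied to the target
y_roundtrip = np.exp(np.log(y_true))

print(r2_original, r2_log)
```

The log transform reduces the influence of very long testing times, which is one plausible reason for using it here.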


In [37]:
# take the IDs from test1, the frame that X_test was built from
output = pd.DataFrame({'id': test1['ID'].astype(np.int32), 'y': y_pred})
output.to_csv('sub_15_encoded.csv', index=False)

<img src="score.png" style="width: 800px"/>

This file was submitted and achieved a score of 0.55107 on the private leaderboard.

### Summary of Modelling attempts

In the model building process, we tried the following:
1. We capped outliers to median $\pm$ 1.5 * IQR values. This helped improve the accuracy of our models from 0.51 to 0.53.
2. We also removed variables that showed high correlations with other variables in the data, using a correlation threshold of 0.6. Although this helped increase accuracy for the LASSO model, it had the opposite effect on the RIDGE and MLP models.
3. Considering that all categorical variables were changed to numbers before the dataset was provided for predictions, we tried to one-hot encode those variables. This, however, led to no improvement in model accuracy and hence it was discarded.
4. The final $R^{2}$ scores we could manage with the four models were:
    * $R^{2} = 0.53440$ with the LASSO model
    * $R^{2} = 0.53309$ with the RIDGE model
    * $R^{2} = 0.50493$ with the MLP model
    * $R^{2} = 0.55107$ with the XGBoost model
