# **Milestone 2**

## **Model Building**

1. The target we want to predict is "Price". We will model its log-transformed version, 'price_log'.
2. Before we proceed to the model, we have to encode the categorical features. We will drop high-cardinality text features such as 'Name'.
3. We'll split the data into train and test sets, so that the model built on the train data can be evaluated on unseen data.
4. Build regression models using the train data.
5. Evaluate the model performance.

### **Split Data**

<li>Step 1: Split the data into X and y.
<li>Step 2: Encode the categorical variables in X using pd.get_dummies.
<li>Step 3: Split the data into train and test sets using train_test_split.

<b>Think about it:</b> Why should we drop 'Name', 'Price', 'price_log', and 'Kilometers_Driven' from X before splitting?
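One way to see why 'Price' and 'price_log' must not stay in X is a small leakage experiment. The sketch below uses synthetic data (the arrays `power` and `price_log` are made-up stand-ins, not the car data) and compares a model that sees only a genuine predictor with one that is also given the target as a feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): one genuine predictor and a noisy log-price target
power = rng.normal(size=200)
price_log = 0.5 * power + rng.normal(scale=1.0, size=200)

# Honest feature matrix: only the genuine predictor
X_honest = power.reshape(-1, 1)
# Leaky feature matrix: the target itself is left among the features
X_leaky = np.column_stack([power, price_log])

honest_r2 = LinearRegression().fit(X_honest, price_log).score(X_honest, price_log)
leaky_r2 = LinearRegression().fit(X_leaky, price_log).score(X_leaky, price_log)

print(f"R-squared without the leak: {honest_r2:.3f}")
print(f"R-squared with the leak:    {leaky_r2:.3f}")
```

The leaky model looks perfect only because it was handed the answer. 'Kilometers_Driven' is dropped for a different reason: its log-transformed copy is already in X. 'Name' is dropped because it is free text with a very large number of distinct values.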

In [277]:
# Initial step: read the data file into the cars_data DataFrame

#Import required libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

#to ignore warnings
import warnings
warnings.filterwarnings('ignore')

#Loading data into cars_data DataFrame from the used_cars.csv data file
cars_data = pd.read_csv('used_cars.csv')
 
# Impute missing values (NaN) with the column median for the following columns
for col_name in ['Price', 'Seats', 'Mileage', 'Power', 'Engine']:
    cars_data[col_name] = cars_data[col_name].fillna(cars_data[col_name].median())

# Remove the S.No. and New_price columns, which are not needed for the regression analysis
cars_data.drop(['S.No.', 'New_price'], axis=1, inplace=True)

# Remove the row at index 2328 from the data
cars_data.drop([2328], inplace=True)

#Print Final cleaned up data set
print (cars_data)

# We can add a log transformed kilometers_driven feature in data
cars_data["kilometers_driven_log"] = np.log(cars_data["Kilometers_Driven"])

# We can Add a log transformed Price feature in data
cars_data["price_log"] = np.log(cars_data["Price"])
 
# Check whether any column still has NaN values
col=['Name', 'Location', 'Year', 'Fuel_Type','Transmission','Kilometers_Driven', 'Owner_Type', 'Seats', 'Engine','Power','Mileage','Price', 'price_log']
print (cars_data[col].isnull().sum())

#cars_data =cars_data.astype(float)

# Remove the limit from the number of displayed columns and rows. It helps to see the entire dataframe while printing it
pd.set_option("display.max_columns", None)
# pd.set_option('display.max_rows', None)
pd.set_option("display.max_rows", 200)

# Step-1
X = cars_data.drop(['Name','Price','price_log','Kilometers_Driven'],axis=1)
y = cars_data[['price_log', 'Price']]
 

                                                   Name    Location  Year  \
0                                Maruti Wagon R LXI CNG      Mumbai  2010   
1                      Hyundai Creta 1.6 CRDi SX Option        Pune  2015   
2                                          Honda Jazz V     Chennai  2011   
3                                     Maruti Ertiga VDI     Chennai  2012   
4                       Audi A4 New 2.0 TDI Multitronic  Coimbatore  2013   
...                                                 ...         ...   ...   
7248                  Volkswagen Vento Diesel Trendline   Hyderabad  2011   
7249                             Volkswagen Polo GT TSI      Mumbai  2015   
7250                             Nissan Micra Diesel XV     Kolkata  2012   
7251                             Volkswagen Polo GT TSI        Pune  2013   
7252  Mercedes-Benz E-Class 2009-2013 E 220 CDI Avan...       Kochi  2014   

      Kilometers_Driven Fuel_Type Transmission Owner_Type  Mileage  Engine 

In [278]:
# Step-2 Use pd.get_dummies(drop_first=True)
X = pd.get_dummies(X,drop_first=True)
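To illustrate what `drop_first=True` does, here is a minimal sketch on a three-row toy frame (the frame `toy` is hypothetical, not part of the car data). Each categorical column is expanded into one indicator column per category, and the first category in sorted order is dropped so that it acts as the baseline.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "Year": [2010, 2015, 2012],
    "Fuel_Type": ["CNG", "Diesel", "Petrol"],
})

# Numeric columns pass through unchanged; 'CNG' is dropped and becomes the baseline
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

Dropping one level per category avoids perfectly collinear indicator columns, which matters for the linear models fitted later.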

In [279]:
# Step-3 Splitting data into training and test set:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train.shape, X_test.shape)

(5076, 24) (2176, 24)


In [280]:
# Let us write a function for calculating r2_score and RMSE on train and test data.
# This function takes as input a model that has already been fitted on price_log,
# converts its predictions back to the Price scale with np.exp, and scores them against Price.
import sklearn.metrics as metrics
def get_model_score(model, flag=True):
    '''
    model : fitted regressor that predicts price_log from X
    flag  : if True, print the scores (default True)

    Returns [train_r2, test_r2, train_rmse, test_rmse], computed on the Price scale.
    '''
    # defining an empty list to store train and test results
    score_list=[] 
    
    # predictions are on the log scale; np.exp converts them back to Price
    pred_train = model.predict(X_train)
    pred_train_ = np.exp(pred_train)
    pred_test = model.predict(X_test)
    pred_test_ = np.exp(pred_test)
    
    train_r2=metrics.r2_score(y_train['Price'],pred_train_)
    test_r2=metrics.r2_score(y_test['Price'],pred_test_)
    # RMSE is the square root of the mean squared error
    train_rmse=np.sqrt(metrics.mean_squared_error(y_train['Price'],pred_train_))
    test_rmse=np.sqrt(metrics.mean_squared_error(y_test['Price'],pred_test_))
    
    # Adding all scores to the list
    score_list.extend((train_r2,test_r2,train_rmse,test_rmse))
    
    # The scores are printed only if flag is True (the default)
    if flag: 
        print("R-sqaure on training set : ",train_r2)
        print("R-square on test set : ",test_r2)
        print("RMSE on training set : ",train_rmse)
        print("RMSE on test set : ",test_rmse)
    
    # returning the list with train and test scores
    return score_list
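The scoring helper above fits on the log scale but reports errors on the Price scale. The following NumPy-only sketch, with made-up prices, shows what that back-transformation does to the error: the same error of 0.1 on the log scale turns into a much larger absolute error for an expensive car than for a cheap one. The helper names `rmse` and `r2` are illustrative, not part of the notebook.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up prices and log-scale predictions that are each off by exactly 0.1
price = np.array([2.0, 5.0, 10.0, 40.0])
pred_log = np.log(price) + np.array([0.1, -0.1, 0.1, -0.1])
pred_price = np.exp(pred_log)  # back-transform to the Price scale

log_rmse = rmse(np.log(price), pred_log)
price_rmse = rmse(price, pred_price)
abs_errors = np.abs(price - pred_price)

print("RMSE on the log scale:  ", round(log_rmse, 3))
print("RMSE on the Price scale:", round(price_rmse, 3))
print("absolute errors per car:", np.round(abs_errors, 3))
```

Because squared errors on the Price scale are dominated by the most expensive cars, R-squared and RMSE computed this way can differ noticeably from the R-squared a library reports for the log-scale fit.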


<hr>

For Regression Problems, some of the algorithms used are :<br>

**1) Linear Regression** <br>
**2) Ridge / Lasso Regression** <br>
**3) Decision Trees** <br>
**4) Random Forest** <br>

### **Fitting a linear model**

Linear Regression can be implemented using: <br>

**1) Sklearn:** https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html <br>
**2) Statsmodels:** https://www.statsmodels.org/stable/regression.html

In [281]:
# import Linear Regression from sklearn
from sklearn.linear_model import LinearRegression

In [282]:
# Create a linear regression model
lr = LinearRegression()

In [283]:
# Fit linear regression model
lr.fit(X_train,y_train['price_log']) 

LinearRegression()

In [284]:
# Get score of the model.
LR_score = get_model_score(lr)

R-sqaure on training set :  0.5628499237591433
R-square on test set :  0.7034208186923349
RMSE on training set :  6.778145833783173
RMSE on test set :  5.619982998714185


#### **Observations from results: _____**
The test R-squared (0.703) is higher than the training R-squared (0.563), and the test RMSE (5.62) is lower than the training RMSE (6.78). Both metrics therefore point the same way: the model predicts the test set better than the training set. This is unusual, because a model normally does at least as well on the data it was fitted on. A plausible explanation is that the scores are computed on Price after np.exp, so a few expensive cars with large errors in the training split can dominate the squared-error terms. RMSE is expressed in the units of Price, so there is no universal threshold for a "good" value; it has to be judged against the spread of Price and against the other models. There is no sign of overfitting, but an R-squared between 0.56 and 0.70 leaves a sizeable share of the price variation unexplained.

#### **Important variables of Linear Regression**

Building a model using statsmodels

In [289]:
# Import Statsmodels 
import statsmodels.api as sm

# Statsmodel api does not add a constant by default. We need to add it explicitly.
x_train = sm.add_constant(X_train)
# Add constant to test data
x_test = sm.add_constant(X_test)

def build_ols_model(train):
    # Create the model
    olsmodel = sm.OLS(y_train["price_log"], train)
    return olsmodel.fit()


# Fit linear model on new dataset
olsmodel1 = build_ols_model(x_train)
print(olsmodel1.summary())

                            OLS Regression Results                            
Dep. Variable:              price_log   R-squared:                       0.731
Model:                            OLS   Adj. R-squared:                  0.729
Method:                 Least Squares   F-statistic:                     595.5
Date:                Sat, 04 Jun 2022   Prob (F-statistic):               0.00
Time:                        10:39:47   Log-Likelihood:                -2707.4
No. Observations:                5076   AIC:                             5463.
Df Residuals:                    5052   BIC:                             5620.
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                 

In [290]:
# Retrieve the coefficient values and p-values and store them in a DataFrame
olsmod = pd.DataFrame(olsmodel1.params, columns=['coef'])
olsmod['pval']=olsmodel1.pvalues

In [291]:
# Filter by significant p-value (pval <= 0.05), sorted by p-value in descending order
olsmod = olsmod.sort_values(by="pval", ascending=False)
pval_filter = olsmod['pval']<=0.05
olsmod[pval_filter]

                           coef       pval
Owner_Type_Second          -0.03473   0.04520219
Owner_Type_Fourth & Above  0.268276   0.04212803
Location_Hyderabad         0.08402    0.01378133
Location_Bangalore         0.097989   0.01062837
Fuel_Type_Diesel           0.206907   0.0007608686
Owner_Type_Third           -0.183473  3.096205e-05
Location_Kolkata           -0.196807  4.085906e-08
kilometers_driven_log      -0.061878  9.57133e-09
Mileage                    -0.012983  2.271471e-10
Engine                     0.000196   6.849539e-13
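Because the model is fitted on price_log, a coefficient is easier to read once it is converted to a percentage change in Price. The sketch below copies three coefficient and p-value pairs from the table above and applies the standard conversion `exp(coef) - 1`; the column name `pct_change` is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Three coefficient / p-value pairs copied from the table above
coefs = pd.DataFrame(
    {
        "coef": [0.206907, -0.196807, -0.034730],
        "pval": [0.0007608686, 4.085906e-08, 0.04520219],
    },
    index=["Fuel_Type_Diesel", "Location_Kolkata", "Owner_Type_Second"],
)

# Keep the significant rows and convert log-scale effects to percent changes in Price
significant = coefs[coefs["pval"] <= 0.05].copy()
significant["pct_change"] = (np.exp(significant["coef"]) - 1.0) * 100.0

print(significant.sort_values("pval").round(4))
```

Read this way, and holding the other features fixed, a diesel car is priced roughly 23% above the dropped baseline fuel type, and a car sold in Kolkata roughly 18% below the dropped baseline location.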


In [292]:
# we are looking for the overall significant variables
pval_filter = olsmod['pval']<=0.05
imp_vars = olsmod[pval_filter].index.tolist()

# map each significant (possibly one-hot encoded) column back to its original variable,
# e.g. 'Owner_Type_Second' -> 'Owner_Type'; numeric columns such as 'Engine' map to themselves
sig_var = []
for col in imp_vars:
    first_part = col.split('_')[0]
    for c in cars_data.columns:
        if first_part in c and c not in sig_var :
            sig_var.append(c)
 

start = '\033[1m'   # ANSI escape code: bold on
end = '\033[0m'     # ANSI escape code: reset formatting
print(start+'Most overall significant variables of LINEAR REGRESSION are '+end,':\n',sig_var)

Most overall significant variables of LINEAR REGRESSION are  :
 ['Owner_Type', 'Location', 'Fuel_Type', 'kilometers_driven_log', 'Mileage', 'Engine', 'Transmission', 'Power', 'Year']


<b>Build Ridge / Lasso Regression similar to Linear Regression:</b><br>

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

In [293]:
# import Ridge/ Lasso Regression from sklearn
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV

In [294]:
# Create a Ridge regression model; alpha=1.0 is scikit-learn's default regularization strength
ridge_lasso = Ridge(alpha=1.0)

In [295]:
# Fit Ridge regression model.
ridge_lasso.fit(X_train,y_train['price_log'])

Ridge()

In [296]:
# Get score of the model.
Ridge_Score=get_model_score(ridge_lasso)

R-sqaure on training set :  0.562671552976983
R-square on test set :  0.7032861344323269
RMSE on training set :  6.7795285395575675
RMSE on test set :  5.621258943527664


In [None]:
# My observations -
# The Ridge model gives almost the same scores as Linear Regression: R-squared of 0.563 on the training set and 0.703 on the test set,
# and RMSE of 6.78 (train) and 5.62 (test). With alpha=1.0 the penalty barely changes the predictions here.
# Only Ridge was fitted in this section; Lasso was imported but not used.
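The Ridge model above uses a fixed alpha. As a hedged sketch of how alpha could instead be chosen by cross-validation, the example below uses `RidgeCV`, which was imported above but not used, on synthetic data. The candidate list `alphas` and the data are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV

rng = np.random.default_rng(1)

# Synthetic regression data (assumption): five features, two of them irrelevant
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=200)

# Let cross-validation pick the regularization strength from a candidate list
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
ridge_cv = RidgeCV(alphas=alphas).fit(X, y)
print("alpha chosen by cross-validation:", ridge_cv.alpha_)

# A larger alpha shrinks the coefficients more strongly
weak = Ridge(alpha=0.01).fit(X, y)
strong = Ridge(alpha=1000.0).fit(X, y)
print("coefficient norm, alpha=0.01:", round(float(np.linalg.norm(weak.coef_)), 3))
print("coefficient norm, alpha=1000:", round(float(np.linalg.norm(strong.coef_)), 3))
```

Ridge penalizes coefficient size, so it is sensitive to feature scale; standardizing the features first is a common companion step.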
    

### **Decision Tree** 

https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html

In [297]:
# import Decision tree for Regression from sklearn
from sklearn.tree import DecisionTreeRegressor

In [298]:
# Create a decision tree regression model
Dtree = DecisionTreeRegressor(random_state = 1) 


In [299]:
# Fit decision tree regression model.
Dtree.fit(X_train,y_train['price_log'])

DecisionTreeRegressor(random_state=1)

In [300]:
# Get score of the model.
Dtree_Score = get_model_score(Dtree)

R-sqaure on training set :  0.9999092296903165
R-square on test set :  0.5825526342226197
RMSE on training set :  0.09767142812748364
RMSE on test set :  6.667538517004757


**Observations from results -**
The training R-squared is almost 1 (0.9999) and the training RMSE is only 0.098, so the tree reproduces the training data almost exactly. On the test set, however, R-squared drops to 0.583 and RMSE rises to 6.67. This large gap between training and test performance is the signature of overfitting: an unrestricted tree keeps splitting until it has memorized the training set. On unseen data the tree is worse than the Linear Regression and Ridge models, which have a test R-squared of about 0.703 and a test RMSE of about 5.62.
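The train and test gap of an unrestricted tree can be reproduced on synthetic data. The sketch below (made-up data and an arbitrary depth limit of 4, both assumptions) fits one tree without a depth limit and one shallow tree, then prints R-squared on both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)

# Synthetic noisy data (assumption): the noise cannot be predicted, only memorized
X = rng.uniform(0, 10, size=(400, 3))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Compare an unrestricted tree (max_depth=None) with a shallow one
results = {}
for depth in (None, 4):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train R2={results[depth][0]:.3f}, test R2={results[depth][1]:.3f}")
```

The unrestricted tree scores essentially 1.0 on its own training data whatever the noise level, which is why a training score alone says little about how the model will price unseen cars.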

Print the importance of each feature in building the tree. The importance of a feature is computed as the (normalized) total reduction of the split criterion brought by that feature, so the values sum to 1. It is also known as the Gini importance.


In [218]:
print(pd.DataFrame(Dtree.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

                                Imp
Power                      0.539645
Year                       0.209081
kilometers_driven_log      0.074535
Engine                     0.053221
Mileage                    0.043895
Transmission_Manual        0.008721
Location_Kolkata           0.007074
Location_Kochi             0.007052
Location_Hyderabad         0.006872
Seats                      0.006360
Owner_Type_Second          0.005467
Location_Coimbatore        0.005242
Location_Mumbai            0.004682
Location_Bangalore         0.004641
Location_Delhi             0.004373
Fuel_Type_Diesel           0.004222
Location_Pune              0.004189
Location_Jaipur            0.002891
Location_Chennai           0.002688
Fuel_Type_Petrol           0.002366
Owner_Type_Third           0.001584
Owner_Type_Fourth & Above  0.001126
Fuel_Type_LPG              0.000072
Fuel_Type_Electric         0.000000


#### **Observations and insights: _____**
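The "Imp" values show that the three features contributing most to the tree's splits are Power, Year (the year of manufacture), and kilometers_driven_log. Engine and Mileage come next. Together these five features account for about 92% of the total importance. Importance measures how much a feature helps the model predict price, which suggests, but does not prove, that these are the attributes that drive used-car prices.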
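A small synthetic check can make the meaning of `feature_importances_` concrete. In the sketch below the target depends only on the first feature; the feature names `signal`, `noise_a`, and `noise_b` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)

# Synthetic data (assumption): only the first column carries information about y
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

tree = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X, y)

# Importances are normalized, so they sum to 1 across all features
imp = pd.Series(tree.feature_importances_, index=["signal", "noise_a", "noise_b"])
imp = imp.sort_values(ascending=False)
print(imp)
print("sum of importances:", round(float(imp.sum()), 6))
```

Impurity-based importances describe how the fitted tree used the features; when features are correlated (as Power and Engine may be), the credit can be split between them in an arbitrary way.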

### **Random Forest**

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

In [319]:
# import Randomforest for Regression from sklearn
from sklearn.ensemble import RandomForestRegressor

In [324]:
# Create a Randomforest regression model 
Rforest = RandomForestRegressor(n_estimators = 100, random_state = 1)

In [340]:
# Fit Randomforest regression model.
Rforest.fit(X_train,y_train['price_log'])

RandomForestRegressor(random_state=1)

In [326]:
# Get score of the model.
Rforest_Score = get_model_score(Rforest)

R-sqaure on training set :  0.591617279038231
R-square on test set :  0.3416452818679234
RMSE on training set :  6.551327473597888
RMSE on test set :  8.37326135201962


**Observations and insights -**
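The Random Forest reaches an R-squared of 0.592 on the training set but only 0.342 on the test set, with RMSE of 6.55 and 8.37 respectively. Its test performance is worse than that of Linear Regression, Ridge, and the single Decision Tree, which is unexpected for an ensemble of trees. The execution counts give a likely reason: the Random Forest cells (In [319] to In [340]) were run after the tuning cell (In [309]), which casts y_train['price_log'] to integers in place, so the forest was probably trained on a truncated target. These scores should be regenerated with the cells run in order before drawing conclusions about Random Forest.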

**Feature Importance**

In [327]:
# Print important features of Randomforest, similar to decision trees
print(pd.DataFrame(Rforest.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

                                Imp
Power                      0.514132
Year                       0.178521
kilometers_driven_log      0.096808
Engine                     0.058547
Mileage                    0.052835
Seats                      0.009245
Transmission_Manual        0.009130
Location_Hyderabad         0.009065
Location_Coimbatore        0.009033
Location_Kolkata           0.008087
Owner_Type_Second          0.007378
Location_Kochi             0.006803
Location_Bangalore         0.006524
Location_Mumbai            0.006114
Location_Delhi             0.005348
Location_Chennai           0.005335
Location_Pune              0.004458
Location_Jaipur            0.004359
Fuel_Type_Diesel           0.003391
Fuel_Type_Petrol           0.003221
Owner_Type_Third           0.001255
Owner_Type_Fourth & Above  0.000343
Fuel_Type_LPG              0.000069
Fuel_Type_Electric         0.000000


#### **Observations and insights: _____**
This model also ranks Power, Year, and kilometers_driven_log as the three most important features, followed by Engine and Mileage, in the same order as the Decision Tree.

### **Hyperparameter Tuning: Decision Tree**

In [309]:
# Importing DecisionTreeClassifier from sklearn.tree Library
# Note: this is a classifier; for a continuous target such as price_log, DecisionTreeRegressor is the matching estimator
from sklearn.tree import DecisionTreeClassifier

# Importing GridSearchCV from sklearn.model_selection Library
from sklearn.model_selection import GridSearchCV

# Choosing the type of estimator. 
dtree_tuned = DecisionTreeClassifier(random_state=1)

# Fit the decision tree classifier.
# The classifier needs discrete labels, so price_log is cast to int. The cast truncates the target
# (for example 2.9 becomes 2) and overwrites y_train['price_log'] for every cell that runs afterwards.
y_train['price_log'] = y_train['price_log'].astype(int)
dtree_tuned.fit(X_train,y_train['price_log'])
 

# Grid of parameters to choose from.
# Check the documentation for all the parameters that the model takes and experiment with those.
# Note: max_features must be at least 1, so the 0 in this range is not a valid setting.
params = [{'max_depth': list(range(1, 5)), 'max_features': list(range(0,14))}]
    
# Run the grid search
grid_obj = GridSearchCV(dtree_tuned, params, cv=4, scoring='r2' )
grid_obj = grid_obj.fit(X_train,y_train['price_log'])

# Set dtree_tuned to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_

# Fit the best estimator to the data. 
dtree_tuned.fit(X_train,y_train['price_log'])

DecisionTreeClassifier(max_depth=4, max_features=11, random_state=1)
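The cell above casts price_log to int before fitting. The NumPy-only sketch below, using made-up prices, shows how much information that cast removes: every log price is truncated to a whole number, so after the back-transformation with np.exp only a few distinct price levels remain.

```python
import numpy as np

# Made-up prices (assumption); all are above 1 so their logs are positive
price = np.array([1.5, 4.0, 7.0, 19.0, 21.0, 50.0])
price_log = np.log(price)

# astype(int) truncates toward zero, e.g. 2.94 -> 2
truncated = price_log.astype(int)

# Back-transforming the truncated target gives only a few coarse price levels
recovered = np.exp(truncated)
ratio = price / recovered

print("log prices:      ", np.round(price_log, 3))
print("truncated logs:  ", truncated)
print("recovered prices:", np.round(recovered, 2))
print("true / recovered:", np.round(ratio, 2))
```

For positive log prices the true price can be up to a factor of e (about 2.72) larger than the value recovered from the truncated target, so any model trained on it starts with a large built-in error.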

In [310]:
# Get score of the dtree_tuned
scorer = get_model_score(dtree_tuned)

R-sqaure on training set :  0.3894411353252082
R-square on test set :  0.406538636018665
RMSE on training set :  8.010495852610374
RMSE on test set :  7.94988590807018


#### **Observations and insights: _____**
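The tuned model scores much lower than the untuned Decision Tree: R-squared is 0.389 on the training set and 0.407 on the test set, and RMSE is about 8.0 on both. The grid search itself is not the main cause. Three choices in the cell above hold the model back: (1) a DecisionTreeClassifier was used for a regression problem, (2) price_log was cast to int, so the model can only predict a handful of whole-number log prices, and (3) max_depth was limited to the range 1 to 4, which allows only very shallow trees. Repeating the search with DecisionTreeRegressor on the untruncated price_log and a wider max_depth range would give a fair comparison.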
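As a hedged sketch of how the same search could be set up for a regression target, the example below runs `GridSearchCV` over a `DecisionTreeRegressor` on synthetic data. The data and the grid values are assumptions chosen for illustration, not tuned settings for the car data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)

# Synthetic continuous target (assumption); no integer cast is needed for a regressor
X = rng.uniform(0, 10, size=(300, 4))
y = np.sin(X[:, 0]) + 0.2 * X[:, 1] + rng.normal(scale=0.3, size=300)

# Illustrative grid: max_features starts at 1 because 0 is not a valid value
param_grid = {
    "max_depth": [2, 4, 6, 8, None],
    "min_samples_leaf": [1, 5, 20],
    "max_features": [1, 2, 4],
}

search = GridSearchCV(DecisionTreeRegressor(random_state=1), param_grid, cv=4, scoring="r2")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated R2:", round(search.best_score_, 3))
```

By default `GridSearchCV` refits the best parameter combination on the full training data, so `search.best_estimator_` can be used directly without a second call to `fit`.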

**Feature Importance**

In [311]:
# Print important features of tuned decision tree similar to decision trees
print(pd.DataFrame(dtree_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

                                Imp
Year                       0.319833
Transmission_Manual        0.270954
Engine                     0.151997
Mileage                    0.137495
Power                      0.105857
kilometers_driven_log      0.013864
Location_Mumbai            0.000000
Owner_Type_Second          0.000000
Owner_Type_Fourth & Above  0.000000
Fuel_Type_Petrol           0.000000
Fuel_Type_LPG              0.000000
Fuel_Type_Electric         0.000000
Fuel_Type_Diesel           0.000000
Location_Pune              0.000000
Location_Kochi             0.000000
Location_Kolkata           0.000000
Location_Jaipur            0.000000
Location_Hyderabad         0.000000
Location_Delhi             0.000000
Location_Coimbatore        0.000000
Location_Chennai           0.000000
Location_Bangalore         0.000000
Seats                      0.000000
Owner_Type_Third           0.000000


#### **Observations and insights: _____**
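In this model, the three most important features are Year, Transmission_Manual, and Engine, followed by Mileage and Power. Only six features have a non-zero importance: a tree limited to max_depth=4 has at most 15 splits, so most features are never used. The importance is also spread more evenly over the top five features than in the untuned tree, where Power alone accounted for 0.54.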

### **Hyperparameter Tuning: Random Forest**

In [317]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import r2_score

# Choose the type of estimator.
# Note: this is a classifier fitted on the integer-cast price_log; RandomForestRegressor is the matching
# estimator for a continuous target. No random_state is set, so results vary between runs.
rforest_tuned = RandomForestClassifier()

# Define the parameters for the grid to choose from 
# Check the documentation for all the parameters that the model takes and experiment with those
# Note: max_features must be at least 1, so the 0 in this range is not a valid setting.
params = [{'max_depth': list(range(1, 5)), 'max_features': list(range(0,14))}]

# Type of scoring used to compare parameter combinations
# scoring='r2' is passed to GridSearchCV below

# Run the grid search
grid_obj = GridSearchCV(rforest_tuned, params, cv=4, scoring='r2' )
grid_obj = grid_obj.fit(X_train,y_train['price_log'])

# Set rforest_tuned to the best combination of parameters
rforest_tuned = grid_obj.best_estimator_

# Fit the best estimator to the data. 
rforest_tuned.fit(X_train,y_train['price_log'])

RandomForestClassifier(max_depth=4, max_features=12)

In [313]:
# Get score of the model.
scorer = get_model_score(rforest_tuned)

R-sqaure on training set :  0.35656363363505317
R-square on test set :  0.382464735290441
RMSE on training set :  8.223343485113766
RMSE on test set :  8.109527528735667


#### **Observations and insights: _____**
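R-squared is low on both the training set (0.357) and the test set (0.382), and both RMSE values are above 8.0. On the test set this model is slightly better than the untuned Random Forest (test RMSE 8.37) but worse than every other model. As with the tuned Decision Tree, the result reflects the use of RandomForestClassifier on an integer-cast price_log and a maximum depth of 4, rather than a weakness of random forests as such. Because no random_state was set, the scores also change slightly from run to run, which is why the comparison table further down reports 0.353 and 0.385 for the same model.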
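The tuned forest in this section is created without a `random_state`, and its scores differ slightly between the cell above and the comparison table later on. The sketch below, on synthetic data, illustrates how fixing the seed makes a forest reproducible; it uses `RandomForestRegressor` and arbitrary settings as assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)

# Synthetic data (assumption)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=200)

# Two forests with the same seed draw the same bootstrap samples and splits
forest_a = RandomForestRegressor(n_estimators=20, random_state=1).fit(X, y)
forest_b = RandomForestRegressor(n_estimators=20, random_state=1).fit(X, y)
same_seed_identical = np.allclose(forest_a.predict(X), forest_b.predict(X))

# A different seed gives a different forest and usually slightly different predictions
forest_c = RandomForestRegressor(n_estimators=20, random_state=2).fit(X, y)
max_diff = float(np.max(np.abs(forest_a.predict(X) - forest_c.predict(X))))

print("same seed, identical predictions:", same_seed_identical)
print("largest prediction difference with another seed:", round(max_diff, 4))
```

Setting `random_state` on the estimator keeps the reported scores stable when cells are re-run.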

**Feature Importance**

In [314]:
# Print important features of tuned random forest similar to decision trees
print(pd.DataFrame(rforest_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

                                Imp
Power                      0.362223
Year                       0.318962
Engine                     0.188191
Transmission_Manual        0.050650
kilometers_driven_log      0.040351
Mileage                    0.019237
Seats                      0.007095
Fuel_Type_Diesel           0.005915
Fuel_Type_Petrol           0.003255
Owner_Type_Second          0.001545
Location_Kolkata           0.001036
Location_Coimbatore        0.000572
Owner_Type_Third           0.000522
Location_Bangalore         0.000240
Location_Kochi             0.000083
Location_Hyderabad         0.000079
Owner_Type_Fourth & Above  0.000046
Location_Jaipur            0.000000
Location_Mumbai            0.000000
Location_Pune              0.000000
Fuel_Type_Electric         0.000000
Fuel_Type_LPG              0.000000
Location_Delhi             0.000000
Location_Chennai           0.000000


#### **Observations and insights: ______**
According to this model, the three most important features for a used car are Power, Year, and Engine. Transmission_Manual and kilometers_driven_log come next, and Mileage is sixth. Seven features have zero importance, which is consistent with the shallow trees (max_depth=4) in this forest.


In [338]:
# defining list of models you have trained
models = [lr, ridge_lasso, Dtree, dtree_tuned, Rforest, rforest_tuned ]

# defining empty lists to add train and test results
r2_train = []
r2_test = []
rmse_train= []
rmse_test= []

# looping through all the models to get the rmse and r2 scores
for model in models:
    # get [train_r2, test_r2, train_rmse, test_rmse] without printing them
    j = get_model_score(model,False)
    r2_train.append(j[0])
    r2_test.append(j[1])
    rmse_train.append(j[2])
    rmse_test.append(j[3])

In [339]:
comparison_frame = pd.DataFrame({'Model':models, 
                                          'Train_r2': r2_train,'Test_r2': r2_test,
                                          'Train_RMSE':rmse_train,'Test_RMSE':rmse_test}) 
comparison_frame

   Model                                              Train_r2  Test_r2   Train_RMSE  Test_RMSE
0  LinearRegression()                                 0.56285   0.703421  6.778146    5.619983
1  Ridge()                                            0.562672  0.703286  6.779529    5.621259
2  DecisionTreeRegressor(random_state=1)              0.999909  0.582553  0.097671    6.667539
3  DecisionTreeClassifier(max_depth=4, max_featur...  0.389441  0.406539  8.010496    7.949886
4  (DecisionTreeRegressor(max_features='auto', ra...  0.591617  0.341645  6.551327    8.373261
5  (DecisionTreeClassifier(max_depth=4, max_featu...  0.352894  0.384961  8.246757    8.093122
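The comparison is easier to read with short model names and a ranking by test error. The sketch below rebuilds the table from the rounded values printed above; the short names and the `r2_gap` column (training minus test R-squared) are illustrative additions.

```python
import pandas as pd

# Values copied from the comparison table above; model names shortened for readability
comparison = pd.DataFrame({
    "Model": [
        "LinearRegression", "Ridge", "DecisionTreeRegressor",
        "DecisionTreeClassifier (tuned)", "RandomForestRegressor",
        "RandomForestClassifier (tuned)",
    ],
    "Train_r2": [0.562850, 0.562672, 0.999909, 0.389441, 0.591617, 0.352894],
    "Test_r2": [0.703421, 0.703286, 0.582553, 0.406539, 0.341645, 0.384961],
    "Train_RMSE": [6.778146, 6.779529, 0.097671, 8.010496, 6.551327, 8.246757],
    "Test_RMSE": [5.619983, 5.621259, 6.667539, 7.949886, 8.373261, 8.093122],
})

# A large positive gap between training and test R-squared points to overfitting
comparison["r2_gap"] = comparison["Train_r2"] - comparison["Test_r2"]

# Rank the models by their error on unseen data
ranked = comparison.sort_values("Test_RMSE").reset_index(drop=True)
print(ranked[["Model", "Test_r2", "Test_RMSE", "r2_gap"]])
```

Ranked this way, the two linear models lead on the test set and the untuned Decision Tree shows by far the largest train and test gap.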


#### **Observations: _____**
The untuned Decision Tree has the highest training R-squared (0.9999) and the lowest training RMSE (0.098), but its test R-squared is only 0.583 and its test RMSE is 6.67, so it overfits the training data. Judged on the test set, which is what matters for predicting new prices, Linear Regression is the best model (test R-squared 0.7034, test RMSE 5.62), with Ridge practically tied (0.7033, 5.62). The untuned Random Forest has the lowest test R-squared (0.342) and the highest test RMSE (8.37), and the two tuned models are only slightly better. The tuned Decision Tree did worse than its untuned version mainly because the tuned models were built with classifiers on an integer-cast target and restricted to a depth of at most 4, not because the grid search chose poorly.


**Note:** You can also try some other algorithms such as kNN and compare the model performance with the existing ones
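A minimal sketch of the kNN idea mentioned in the note, on synthetic data: kNN relies on distances, so the features are standardized inside a pipeline before `KNeighborsRegressor` is applied. The data and `n_neighbors=5` are assumptions, not tuned choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)

# Synthetic data (assumption) with features on very different scales
X = np.column_stack([rng.normal(size=300), rng.normal(scale=1000.0, size=300)])
y = 2.0 * X[:, 0] + 0.001 * X[:, 1] + rng.normal(scale=0.2, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Scale first, then average the targets of the 5 nearest training points
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn.fit(X_tr, y_tr)

test_r2 = knn.score(X_te, y_te)
print("kNN test R2:", round(test_r2, 3))
```

In the notebook, a model fitted this way on X_train and y_train['price_log'] could be passed to get_model_score like the other regressors.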

**Insights**

**Refined insights:**
A data insight is the understanding a person or an organization gains from analyzing data on a particular issue. It supports better decisions than relying on instinct alone.

The issue here is used-car pricing: how to price cars so that the company can sell them profitably. Across the regressors, the five most important features are Power, Year, kilometers driven, Engine, and Mileage. One model (the tuned Decision Tree) also ranks manual transmission highly. Knowing these factors, the company can design a better pricing model. The untuned Decision Tree regressor gives the following feature importances:


| Feature | Importance |
| --- | --- |
| Power | 0.539645 |
| Year | 0.209081 |
| kilometers_driven_log | 0.074535 |
| Engine | 0.053221 |
| Mileage | 0.043895 |
| Transmission_Manual | 0.008721 |


**Comparison of various techniques and their relative performance:**
From my model design, fitting, and scoring experiments, Linear Regression performs best on the test set, with an R-squared of 0.7034 and an RMSE of 5.62; the Ridge model has practically the same metrics. The untuned Decision Tree Regressor comes third on the test set (R-squared 0.583, RMSE 6.67). Its training R-squared of 0.9999 is close to a perfect score, but the drop on the test set shows that it has memorized the training data rather than learned a pattern that generalizes. The tuned Decision Tree is fourth (test R-squared 0.407), the tuned Random Forest fifth (0.385), and the untuned Random Forest last (0.342). The results of the tuned models and of the Random Forest were affected by the use of classifiers and by the integer cast of price_log, so they understate what these methods can do.
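A single 70/30 split can favor one model by chance, so a comparison like this one is often backed up with k-fold cross-validation. The sketch below shows the mechanics with `cross_val_score` on synthetic data; the data, the 5 folds, and the seed are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(7)

# Synthetic data (assumption)
X = rng.normal(size=(250, 4))
y = X @ np.array([1.0, -0.5, 0.25, 0.0]) + rng.normal(scale=0.5, size=250)

# Five shuffled folds: each observation is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R2 per fold:", np.round(scores, 3))
print("mean R2:", round(float(scores.mean()), 3), "std:", round(float(scores.std()), 3))
```

The spread across folds indicates how much a difference between two models on one split can be trusted.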

**Proposal for the final solution design:**
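Based on these runs, I propose Linear Regression (or the practically equivalent Ridge model) as the solution for predicting used-car prices, because it has the best test-set scores (R-squared 0.703, RMSE 5.62). The Decision Tree Regressor has an impressive training R-squared (0.9999), but it overfits and does worse on unseen data. Its feature importances are still useful and practical: Power, Year, and kilometers driven match real-life experience of what drives used-car prices. Before the tree-based models are ruled out, they should be re-run with regressors, on the untruncated price_log, and with a wider tuning grid.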