# Building a KNN-Regression Model to Determine Car Prices


### [Data set information](https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names)


##### Contents:
- Importing data
- Cleaning data
- KNN regression with scikit-learn
    - Model 1: Feature selection and hyperparameter tuning with GridSearchCV
    - Model 2: Nested cross-validation with `cross_val_score` and `GridSearchCV`

In [1]:
# basic imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Importing data

In [2]:
# Read the .data file (local copy, or download from the UCI repository)
#cars = pd.read_csv('data/imports-85.data', header=None)
cars = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data', header=None)
cars.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500


In [3]:
# add header from documentation
header = ['symboling','normalized_losses','make','fuel_type','aspiration','num_doors','body_style','drive_wheels',
 'engine_location','wheel_base','length','width','height','curb_weight','engine_type','num_cylinders',
 'engine_size','fuel_system','bore','stroke','compression_ratio','horsepower','peak_rpm','city_mpg',
 'highway_mpg','price']
cars.columns = header
cars.head(3)

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500


## Cleaning data

In [4]:
# Select features that seem to be numeric
numeric_cols = ['wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_size', 'bore', 'stroke', 
                'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg']
cars[numeric_cols].head() 

Unnamed: 0,wheel_base,length,width,height,curb_weight,engine_size,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg
0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111,5000,21,27
1,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111,5000,21,27
2,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154,5000,19,26
3,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,102,5500,24,30
4,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,115,5500,18,22


In [5]:
# Replace '?' placeholders with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
cars.replace('?', np.nan, inplace=True)
cars.head(10)

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0
5,2,,audi,gas,std,two,sedan,fwd,front,99.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250.0
6,1,158.0,audi,gas,std,four,sedan,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710.0
7,1,,audi,gas,std,four,wagon,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,18920.0
8,1,158.0,audi,gas,turbo,four,sedan,fwd,front,105.8,...,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875.0
9,0,,audi,gas,turbo,two,hatchback,4wd,front,99.5,...,131,mpfi,3.13,3.4,7.0,160,5500,16,22,
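
As an alternative to replacing the placeholders after loading, pandas can treat `'?'` as missing while parsing. The sketch below uses a tiny inline stand-in for the file (the rows and the reduced column set are made up for illustration), so it runs without a download. A side effect is that columns containing only numbers and `'?'` come back with a numeric dtype directly.

```python
import io
import pandas as pd

# Tiny inline stand-in for the raw file (hypothetical rows, same '?' convention)
raw = io.StringIO(
    "3,?,alfa-romero,111,13495\n"
    "2,164,audi,102,13950\n"
    "0,?,audi,160,?\n"
)

# na_values='?' makes the parser read '?' as NaN
sample = pd.read_csv(
    raw,
    header=None,
    na_values="?",
    names=["symboling", "normalized_losses", "make", "horsepower", "price"],
)

na_counts = sample.isnull().sum()
print(na_counts)
print(sample.dtypes)
```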


In [6]:
# print shape
print(cars.shape)

# Count NaNs
cars.isnull().sum()

(205, 26)


symboling             0
normalized_losses    41
make                  0
fuel_type             0
aspiration            0
num_doors             2
body_style            0
drive_wheels          0
engine_location       0
wheel_base            0
length                0
width                 0
height                0
curb_weight           0
engine_type           0
num_cylinders         0
engine_size           0
fuel_system           0
bore                  4
stroke                4
compression_ratio     0
horsepower            2
peak_rpm              2
city_mpg              0
highway_mpg           0
price                 4
dtype: int64

In [7]:
# drop entire 'normalized_losses' column (too many NaNs)
cars.drop('normalized_losses', axis=1, inplace=True)

# drop rows where other values are missing
cars.dropna(inplace=True)

# print new shape, null counts, dataframe head
print(cars.shape)
print(cars.isnull().sum())
cars.head(5)

(193, 25)
symboling            0
make                 0
fuel_type            0
aspiration           0
num_doors            0
body_style           0
drive_wheels         0
engine_location      0
wheel_base           0
length               0
width                0
height               0
curb_weight          0
engine_type          0
num_cylinders        0
engine_size          0
fuel_system          0
bore                 0
stroke               0
compression_ratio    0
horsepower           0
peak_rpm             0
city_mpg             0
highway_mpg          0
price                0
dtype: int64


Unnamed: 0,symboling,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,length,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,audi,gas,std,four,sedan,fwd,front,99.8,176.6,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,audi,gas,std,four,sedan,4wd,front,99.4,176.6,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [8]:
# check datatypes
print(cars[numeric_cols].dtypes)
print('\n price dtype: ',cars.price.dtype)

wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_size            int64
bore                  object
stroke                object
compression_ratio    float64
horsepower            object
peak_rpm              object
city_mpg               int64
highway_mpg            int64
dtype: object

 price dtype:  object


In [9]:
# change data types
cars.bore = cars.bore.astype('float64')
cars.stroke = cars.stroke.astype('float64')
cars.horsepower = cars.horsepower.astype('int64')
cars.peak_rpm = cars.peak_rpm.astype('int64')
cars.price = cars.price.astype('int64')

# recheck types
print(cars[numeric_cols].dtypes)
print('\nprice dtype: ',cars.price.dtype)

wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_size            int64
bore                 float64
stroke               float64
compression_ratio    float64
horsepower             int64
peak_rpm               int64
city_mpg               int64
highway_mpg            int64
dtype: object

price dtype:  int64
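
The per-column `astype` calls above can also be written with `pd.to_numeric`, which infers an integer or float dtype for each column. A minimal sketch on a made-up frame (the values are illustrative, not the full data set):

```python
import pandas as pd

# Hypothetical frame with numeric values stored as strings (object dtype)
demo = pd.DataFrame({
    "bore": ["3.47", "2.68"],
    "horsepower": ["111", "154"],
    "price": ["13495", "16500"],
})

# Apply pd.to_numeric column by column; each column gets an inferred numeric dtype
converted = demo.apply(pd.to_numeric)
print(converted.dtypes)
```

Passing `errors="coerce"` to `pd.to_numeric` would turn unparseable strings into NaN instead of raising, which can help when placeholders are still present.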


In [10]:
# Assign features and target to X,y
X = cars[numeric_cols]
y = cars.price

# Check X,y 
print(y[:2])
X.head(2)

0    13495
1    16500
Name: price, dtype: int64


Unnamed: 0,wheel_base,length,width,height,curb_weight,engine_size,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg
0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111,5000,21,27
1,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111,5000,21,27


## Building a model with scikit-learn

In [11]:
# scikit-learn imports
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV

### Model #1: KNN Regression 
- Pipelined
    - Normalize samples (`Normalizer` rescales each row to unit L2 norm)
    - Feature selection 
    - KNeighborsRegressor model
- GridSearchCV
    - Choosing best features
    - Hyperparameter tuning (best KNN K-value)
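
One caveat about the normalization step: scikit-learn's `Normalizer` rescales each sample (row) to unit norm, whereas `StandardScaler` standardizes each feature (column). For a distance-based model such as KNN the two behave quite differently. The sketch below contrasts them on two made-up rows; the numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Two hypothetical cars: (wheel_base, curb_weight, peak_rpm)
X_demo = np.array([[88.6, 2548.0, 5000.0],
                   [99.8, 2337.0, 5500.0]])

# Normalizer: every row is divided by its own L2 norm
row_scaled = Normalizer().fit_transform(X_demo)
# StandardScaler: every column is shifted to zero mean and scaled to unit variance
col_scaled = StandardScaler().fit_transform(X_demo)

row_norms = np.linalg.norm(row_scaled, axis=1)
col_means = col_scaled.mean(axis=0)
print(row_norms)   # length of each row after Normalizer
print(col_means)   # mean of each column after StandardScaler
```

With `Normalizer`, large-valued columns such as `peak_rpm` still dominate each row, so swapping in `StandardScaler` is an option worth testing if the goal is to put features on a comparable scale.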

In [12]:
%%time

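# build regression pipeline
# note: Normalizer rescales each sample (row) to unit L2 norm, and f_classif is an ANOVA
# F-test that treats the target values as class labels; StandardScaler and f_regression
# are the usual scikit-learn choices for per-feature scaling and a continuous target
pipeline = Pipeline([('normalize', Normalizer()),
                     ('kbest', SelectKBest(f_classif)),
                     ('regressor', KNeighborsRegressor())])

# try regressor__n_neighbors from 1 to 20, and feature count from 1 to len(features)
parameters = {'kbest__k':  list(range(1, X.shape[1]+1)),
              'regressor__n_neighbors': list(range(1,21))}

# 10-fold cross-validated grid search over feature count and K (single level of CV, no outer loop)
model = GridSearchCV(pipeline, parameters, scoring="neg_mean_squared_error", cv=10, verbose=1, n_jobs=-1)
model.fit(X, y)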

Fitting 10 folds for each of 260 candidates, totalling 2600 fits


[Parallel(n_jobs=-1)]: Done 232 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done 2032 tasks      | elapsed:    5.2s


CPU times: user 3.17 s, sys: 65 ms, total: 3.23 s
Wall time: 6.71 s


[Parallel(n_jobs=-1)]: Done 2600 out of 2600 | elapsed:    6.6s finished


#### Detailed model information and performance, best cross-validated hyperparameters

In [13]:
# Best neg_MSE
print("Best score (neg_mse): ", model.best_score_ )
# best RMSE
print("Best RMSE: ",np.abs(model.best_score_) ** (1/2))

# Model information
print('\n',model.best_estimator_)
# Best parameters (feature#, K#)
print('\n',model.best_params_)

Best score (neg_mse):  -34646369.1753
Best RMSE:  5886.11664643

 Pipeline(steps=[('normalize', Normalizer(copy=True, norm='l2')), ('kbest', SelectKBest(k=11, score_func=<function f_classif at 0x113966d90>)), ('regressor', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform'))])

 {'kbest__k': 11, 'regressor__n_neighbors': 5}
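
scikit-learn reports `neg_mean_squared_error` as a negative number so that higher scores are always better. Flipping the sign and taking the square root gives the RMSE in the same units as the target (here, price). A small worked sketch, using the first three prices from the data with made-up predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([13495.0, 16500.0, 13950.0])  # prices from the first rows of the data
y_pred = np.array([14000.0, 15500.0, 15000.0])  # hypothetical predictions

mse = mean_squared_error(y_true, y_pred)
neg_mse = -mse              # the form a 'neg_mean_squared_error' scorer reports
rmse = np.sqrt(-neg_mse)    # back to an error in price units
print(neg_mse, rmse)
```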


#### Feature scores and p-values, features that were dropped from best model, CV information

In [14]:
# Determine names of best features
# Access the feature selector by name in model.best_estimator_
features = model.best_estimator_.named_steps['kbest']

# Get the features scores and p-values
scores = ['%.2f' % elem for elem in features.scores_ ]
print("Feature scores:\n {}".format(scores))
pvalues = ['%.2f' % elem for elem in features.pvalues_]
print("Feature p-values:\n {} ".format(pvalues))

# Get list of features that were used in best model
# get_support returns array of T/F vals for whether or not column was used
best_features = list(X.columns[features.get_support()])

# Determine features that were not included in the model
dropped_features = list(set(numeric_cols) - set(best_features))
print("\nDropped features: ", dropped_features)

# Display detailed information on cross-validated tuning parameters
pd.DataFrame(model.cv_results_)

Feature scores:
 ['5.23', '4.62', '3.50', '7.12', '18.86', '16.41', '1.47', '3.58', '1.42', '9.38', '25.45', '11.27', '11.08']
Feature p-values:
 ['0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.20', '0.00', '0.22', '0.00', '0.00', '0.00', '0.00'] 

Dropped features:  ['compression_ratio', 'bore']


Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_kbest__k,param_regressor__n_neighbors,params,rank_test_score,split0_test_score,split0_train_score,...,split7_test_score,split7_train_score,split8_test_score,split8_train_score,split9_test_score,split9_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
0,0.015075,0.002181,-6.050154e+07,-1.578866e+05,1,1,"{'kbest__k': 1, 'regressor__n_neighbors': 1}",260,-1.177076e+08,-1.225918e+05,...,-7.504637e+07,-1.751922e+05,-3.306489e+07,-1.751922e+05,-3.524387e+07,-1.751922e+05,0.003539,0.000574,4.964924e+07,3.077207e+04
1,0.012159,0.001048,-5.648553e+07,-1.081271e+07,1,2,"{'kbest__k': 1, 'regressor__n_neighbors': 2}",258,-1.047361e+08,-1.165162e+07,...,-5.931670e+07,-1.319941e+07,-3.038005e+07,-4.233406e+06,-3.111730e+07,-1.430886e+07,0.002711,0.000325,3.867909e+07,3.747649e+06
2,0.015101,0.001325,-4.891187e+07,-1.766541e+07,1,3,"{'kbest__k': 1, 'regressor__n_neighbors': 3}",255,-9.256505e+07,-1.669807e+07,...,-3.697380e+07,-2.051591e+07,-1.911869e+07,-6.633635e+06,-3.064793e+07,-2.215382e+07,0.006013,0.000482,3.739091e+07,5.925173e+06
3,0.014314,0.001368,-4.254082e+07,-2.057252e+07,1,4,"{'kbest__k': 1, 'regressor__n_neighbors': 4}",225,-9.135842e+07,-1.965240e+07,...,-3.201910e+07,-2.416377e+07,-1.408598e+07,-9.685738e+06,-2.497820e+07,-2.493007e+07,0.004018,0.000376,3.482715e+07,6.265630e+06
4,0.012922,0.001163,-4.069073e+07,-2.229431e+07,1,5,"{'kbest__k': 1, 'regressor__n_neighbors': 5}",165,-9.249683e+07,-2.109928e+07,...,-3.160704e+07,-2.647568e+07,-1.100696e+07,-1.184720e+07,-2.125013e+07,-2.709943e+07,0.004002,0.000219,3.516095e+07,6.307447e+06
5,0.015280,0.001233,-3.982119e+07,-2.349702e+07,1,6,"{'kbest__k': 1, 'regressor__n_neighbors': 6}",92,-8.731033e+07,-2.187641e+07,...,-2.553549e+07,-2.771610e+07,-9.487037e+06,-1.322288e+07,-1.643064e+07,-2.805297e+07,0.002390,0.000054,3.644335e+07,6.431137e+06
6,0.014568,0.001329,-3.795270e+07,-2.449505e+07,1,7,"{'kbest__k': 1, 'regressor__n_neighbors': 7}",65,-8.913201e+07,-2.218794e+07,...,-2.097353e+07,-2.943772e+07,-1.014751e+07,-1.460954e+07,-1.383032e+07,-2.856904e+07,0.001300,0.000295,3.515131e+07,6.420158e+06
7,0.012885,0.001219,-3.847141e+07,-2.576087e+07,1,8,"{'kbest__k': 1, 'regressor__n_neighbors': 8}",68,-8.803203e+07,-2.408454e+07,...,-2.438709e+07,-3.094156e+07,-1.132060e+07,-1.517405e+07,-1.241746e+07,-3.101286e+07,0.003328,0.000169,3.527448e+07,6.840757e+06
8,0.014612,0.001253,-3.917258e+07,-2.592254e+07,1,9,"{'kbest__k': 1, 'regressor__n_neighbors': 9}",78,-8.853665e+07,-2.432103e+07,...,-2.676350e+07,-2.954703e+07,-1.308904e+07,-1.559222e+07,-1.074013e+07,-3.009890e+07,0.001605,0.000027,3.533468e+07,6.678669e+06
9,0.013419,0.001223,-3.854110e+07,-2.547961e+07,1,10,"{'kbest__k': 1, 'regressor__n_neighbors': 10}",69,-8.763707e+07,-2.364110e+07,...,-2.722476e+07,-2.891743e+07,-1.240584e+07,-1.592449e+07,-1.146457e+07,-3.076358e+07,0.000638,0.000099,3.506006e+07,6.295649e+06
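
The feature scores above come from `f_classif`, which is designed for categorical targets. For a continuous target such as price, scikit-learn provides `f_regression` as the corresponding univariate score function. The sketch below shows it on synthetic data (the data, coefficients and seed are assumptions for illustration), where the target depends on columns 0 and 2 only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 4))
# Hypothetical target: driven by columns 0 and 2, plus a little noise
y_demo = 5.0 * X_demo[:, 0] + 3.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

# Score each column against the continuous target and keep the two best
selector = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)
selected = np.flatnonzero(selector.get_support())

print(selector.scores_.round(2))
print(selected)
```

In the pipeline, this would amount to replacing `SelectKBest(f_classif)` with `SelectKBest(f_regression)`; the selected features and scores reported above may change if that is done.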


### Model #2: Another KNN Regression
#### This time with nested cross-validation (`cross_val_score` + `GridSearchCV`) for a less optimistically biased performance estimate
- Normalization
- Feature selection
- Hyperparameter cross-validation
- Model cross-validation
- RMSE error metric

In [15]:
%%time

# build regression pipeline
pipeline = Pipeline([('normalize', Normalizer()),
                     ('kbest', SelectKBest(f_classif)),
                     ('regressor', KNeighborsRegressor())])

# try regressor__n_neighbors from 1 to 20, and feature count from 1 to len(features)
parameters = {'kbest__k':  list(range(1, X.shape[1]+1)),
              'regressor__n_neighbors': list(range(1,21))}

# outer cross-validation on model, inner cross-validation on hyperparameters
scores = cross_val_score(GridSearchCV(pipeline, parameters, scoring="neg_mean_squared_error", cv=10, verbose=1, n_jobs=-1), 
                         X, y, cv=10, scoring="neg_mean_squared_error", verbose=1)

# convert neg_mse to rmse
rmses = np.abs(scores)**(1/2)
# unbiased estimate
avg_rmse = np.mean(rmses)
print("\nAverage RMSE from 10-fold cross-validation: ", avg_rmse, "\n")

Fitting 10 folds for each of 260 candidates, totalling 2600 fits


[Parallel(n_jobs=-1)]: Done 232 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done 2032 tasks      | elapsed:    4.7s


Fitting 10 folds for each of 260 candidates, totalling 2600 fits


[Parallel(n_jobs=-1)]: Done 2600 out of 2600 | elapsed:    5.9s finished
[Parallel(n_jobs=-1)]: Done 268 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 2368 tasks      | elapsed:    6.4s


Fitting 10 folds for each of 260 candidates, totalling 2600 fits


[Parallel(n_jobs=-1)]: Done 2600 out of 2600 | elapsed:    6.9s finished
[Parallel(n_jobs=-1)]: Done 268 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 2368 tasks      | elapsed:    6.5s


Fitting 10 folds for each of 260 candidates, totalling 2600 fits


[Parallel(n_jobs=-1)]: Done 2600 out of 2600 | elapsed:    7.0s finished
[Parallel(n_jobs=-1)]: Done 232 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 2032 tasks      | elapsed:    5.8s


Fitting 10 folds for each of 260 candidates, totalling 2600 fits


[Parallel(n_jobs=-1)]: Done 2600 out of 2600 | elapsed:    7.1s finished
[Parallel(n_jobs=-1)]: Done 232 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 2032 tasks      | elapsed:    6.0s


Fitting 10 folds for each of 260 candidates, totalling 2600 fits


[Parallel(n_jobs=-1)]: Done 2600 out of 2600 | elapsed:    7.6s finished
[Parallel(n_jobs=-1)]: Done 160 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done 1360 tasks      | elapsed:    4.4s


Fitting 10 folds for each of 260 candidates, totalling 2600 fits


[Parallel(n_jobs=-1)]: Done 2600 out of 2600 | elapsed:    8.7s finished
[Parallel(n_jobs=-1)]: Done 232 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 2032 tasks      | elapsed:    6.5s


Fitting 10 folds for each of 260 candidates, totalling 2600 fits


[Parallel(n_jobs=-1)]: Done 2600 out of 2600 | elapsed:    8.3s finished
[Parallel(n_jobs=-1)]: Done 232 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 2032 tasks      | elapsed:    6.7s


Fitting 10 folds for each of 260 candidates, totalling 2600 fits


[Parallel(n_jobs=-1)]: Done 2600 out of 2600 | elapsed:    8.2s finished
[Parallel(n_jobs=-1)]: Done 268 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 2368 tasks      | elapsed:    6.9s


Fitting 10 folds for each of 260 candidates, totalling 2600 fits


[Parallel(n_jobs=-1)]: Done 2600 out of 2600 | elapsed:    7.5s finished
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 1696 tasks      | elapsed:    5.1s


-4022.75011842

Average RMSE from 10-fold cross-validation:  61.6080568178 

CPU times: user 37.6 s, sys: 798 ms, total: 38.4 s
Wall time: 1min 16s


[Parallel(n_jobs=-1)]: Done 2600 out of 2600 | elapsed:    7.6s finished
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  1.3min finished
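
The nested procedure can be sketched end to end on synthetic data. The data set, the reduced grid, the 3-fold splits, and the use of `StandardScaler` and `f_regression` are all assumptions chosen to keep the example small and fast; they are not the settings used above. Because the outer scores are negative MSE values, each fold's RMSE is `sqrt(-score)` and is expressed in the units of the target.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (hypothetical stand-in for the car data)
X_demo, y_demo = make_regression(n_samples=120, n_features=5, n_informative=3,
                                 noise=10.0, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("kbest", SelectKBest(f_regression)),
                 ("regressor", KNeighborsRegressor())])
grid = {"kbest__k": [2, 3, 5], "regressor__n_neighbors": [1, 3, 5]}

# Inner loop: the grid search tunes hyperparameters on each outer training split
inner = GridSearchCV(pipe, grid, scoring="neg_mean_squared_error", cv=3)
# Outer loop: scores the tuned model on data the grid search never saw
scores = cross_val_score(inner, X_demo, y_demo, scoring="neg_mean_squared_error", cv=3)

# Convert each fold's negative MSE to RMSE, then average
rmses = np.sqrt(-scores)
print(rmses, rmses.mean())
```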
