# Used Car Price Prediction: KNN

### Dataset

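The dataset is a comma-separated (CSV) file, originally described as having 14 columns. The version loaded below has 12 columns (see the `info()` output), and some names differ from the list here (for example, `age` appears instead of `Year`).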

- Location - The location in which the car is being sold or is available for purchase.
- Year - The year or edition of the model.
- KM_Driven - The total distance driven by the previous owner(s), in thousands of kilometers.
- Fuel_Type - The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)
- Transmission - The type of transmission used by the car. (Automatic / Manual)
- Owner_Type - The ownership type. (First, Second, Third, or Fourth & Above)
- Mileage - The standard mileage quoted by the car company, in kmpl or km/kg.
- Engine - The displacement volume of the engine in CC.
- Power - The maximum power of the engine in bhp.
- Seats - The number of seats in the car.
- Price - The price of the car (target).

### Load Dataset

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [6]:
cars_df = pd.read_csv( "https://drive.google.com/uc?export=download&id=10ABViLN4Q7vgIlLvepCduU4B3C6BneJR" )

In [7]:
cars_df.sample(5)

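|     | Location   | Fuel_Type | Transmission | Owner_Type | Seats | Price | age | KM_Driven | make       | mileage | engine | power |
|-----|------------|-----------|--------------|------------|-------|-------|-----|-----------|------------|---------|--------|-------|
| 149 | Delhi      | Petrol    | Manual       | Second     | 5.0   | 3.2   | 5   | 30        | honda      | 19.4    | 1198   | 86.8  |
| 788 | Pune       | Diesel    | Manual       | Second     | 8.0   | 8.0   | 8   | 156       | toyota     | 12.99   | 2494   | 100.0 |
| 786 | Delhi      | Petrol    | Manual       | First      | 5.0   | 4.53  | 3   | 81        | hyundai    | 18.9    | 1197   | 82.0  |
| 728 | Coimbatore | Petrol    | Manual       | First      | 5.0   | 3.32  | 6   | 65        | hyundai    | 21.1    | 814    | 55.2  |
| 402 | Pune       | Petrol    | Manual       | First      | 5.0   | 5.48  | 2   | 14        | volkswagen | 17.0    | 1198   | 73.75 |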


In [8]:
cars_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1038 entries, 0 to 1037
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Location      1038 non-null   object 
 1   Fuel_Type     1038 non-null   object 
 2   Transmission  1038 non-null   object 
 3   Owner_Type    1038 non-null   object 
 4   Seats         1037 non-null   float64
 5   Price         1038 non-null   float64
 6   age           1038 non-null   int64  
 7   KM_Driven     1038 non-null   int64  
 8   make          1038 non-null   object 
 9   mileage       1038 non-null   float64
 10  engine        1038 non-null   int64  
 11  power         1038 non-null   float64
dtypes: float64(4), int64(3), object(5)
memory usage: 97.4+ KB


### Feature Set Selection

In [11]:
cars_df.columns

Index(['Location', 'Fuel_Type', 'Transmission', 'Owner_Type', 'Seats', 'Price',
       'age', 'KM_Driven', 'make', 'mileage', 'engine', 'power'],
      dtype='object')

In [12]:
x_features = ['KM_Driven', 'Fuel_Type', 'age',
              'Transmission', 'Owner_Type', 'Seats',
              'make', 'mileage', 'engine',
              'power', 'Location']

In [13]:
cat_vars = ['Fuel_Type',
                'Transmission', 'Owner_Type',
                'make', 'Location']

In [14]:
num_vars = list(set(x_features) - set(cat_vars))

In [15]:
num_vars

['KM_Driven', 'mileage', 'age', 'engine', 'power', 'Seats']

In [16]:
cars_df[x_features].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1038 entries, 0 to 1037
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   KM_Driven     1038 non-null   int64  
 1   Fuel_Type     1038 non-null   object 
 2   age           1038 non-null   int64  
 3   Transmission  1038 non-null   object 
 4   Owner_Type    1038 non-null   object 
 5   Seats         1037 non-null   float64
 6   make          1038 non-null   object 
 7   mileage       1038 non-null   float64
 8   engine        1038 non-null   int64  
 9   power         1038 non-null   float64
 10  Location      1038 non-null   object 
dtypes: float64(3), int64(3), object(5)
memory usage: 89.3+ KB


### Need for Data Transformation

1. Data imputation for the `Seats` column (1037 of 1038 values are non-null)
    - Mean imputation
2. Categorical encoding for the categorical columns
    - One-hot encoding (OHE)
3. Data scaling for the numerical columns
    - Standard scaling
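
The three transformations listed above can be sketched on a small toy frame before wiring them into a pipeline. The values below are hypothetical and are not taken from the cars dataset; only the column names mirror it.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values, for illustration only)
toy_df = pd.DataFrame({
    "Seats": [5.0, np.nan, 7.0],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol"],
    "engine": [1000, 1500, 2000],
})

# 1. Mean imputation: the missing Seats value is replaced by the column mean
seats_imputed = SimpleImputer(strategy="mean").fit_transform(toy_df[["Seats"]])

# 2. One-hot encoding: one 0/1 column per category (categories sorted alphabetically)
ohe = OneHotEncoder(handle_unknown="ignore")
fuel_encoded = ohe.fit_transform(toy_df[["Fuel_Type"]]).toarray()

# 3. Standard scaling: subtract the mean and divide by the standard deviation
engine_scaled = StandardScaler().fit_transform(toy_df[["engine"]])

print(seats_imputed.ravel())
print(fuel_encoded)
print(engine_scaled.ravel())
```

Scaling matters for KNN in particular because neighbors are found by distance, so an unscaled column with large values (such as `engine`) would otherwise dominate the others.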

### Setting X and y variables

In [17]:
X = cars_df[x_features]
y = cars_df['Price']

### Data Splitting

In [18]:
from sklearn.model_selection import train_test_split

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size = 0.8,
                                                    random_state = 80)

In [20]:
X_train.shape

(830, 11)

In [21]:
X_test.shape

(208, 11)

### Data Imputation

In [22]:
from sklearn.impute import SimpleImputer

In [23]:
imputed_num_vars = ['Seats']

In [24]:
imputed_num_vars

['Seats']

In [25]:
non_imputed_num_vars = list(set(num_vars) - set(imputed_num_vars))

In [26]:
non_imputed_num_vars

['KM_Driven', 'mileage', 'age', 'engine', 'power']

In [27]:
mean_imputer = SimpleImputer(strategy='mean')

### Encode Categorical Variables

In [28]:
from sklearn.preprocessing import OneHotEncoder

In [29]:
ohe_encoder = OneHotEncoder(handle_unknown='ignore')

### Scaling Numerical Variables

In [30]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

### Creating Pipelines

In [31]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [33]:
imputed_num_transformer = Pipeline( steps = [
        ('imputation', mean_imputer),
        ('scaler', scaler)])

In [32]:
non_imputed_num_transformer = Pipeline( steps = [('scaler', scaler)])

In [34]:
cat_transformer = Pipeline( steps = [('ohencoder', ohe_encoder)])

In [37]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num_imputed', imputed_num_transformer, imputed_num_vars),
        ('num_not_imputed', non_imputed_num_transformer, non_imputed_num_vars),
        ('catvars', cat_transformer, cat_vars)])

### KNN (K-Nearest Neighbors)


In [35]:
from sklearn.neighbors import KNeighborsRegressor

In [36]:
# Alternative with uniform neighbor weights (the default):
# knn = KNeighborsRegressor(n_neighbors=20)
knn = KNeighborsRegressor(n_neighbors=20, weights='distance')
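
To see what `weights='distance'` changes, here is a minimal sketch on hypothetical one-dimensional data. With uniform weights the prediction is the plain average of the neighbors' targets; with distance weights each neighbor contributes in proportion to the inverse of its distance from the query point.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D data (hypothetical): three training points and one query
X_toy = np.array([[0.0], [1.0], [3.0]])
y_toy = np.array([10.0, 20.0, 60.0])
query = np.array([[0.5]])

uniform_knn = KNeighborsRegressor(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
distance_knn = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

# Uniform: (10 + 20 + 60) / 3
pred_uniform = uniform_knn.predict(query)[0]

# Distance: weights are 1/0.5, 1/0.5, 1/2.5, so the far point at x = 3 counts much less
pred_distance = distance_knn.predict(query)[0]

print(pred_uniform, pred_distance)
```

The far-away point pulls the uniform prediction up to 30, while the distance-weighted prediction stays near the two close neighbors (about 19.1).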

In [39]:
knn_v1 = Pipeline(steps=[('preprocessor', preprocessor),
                          ('knn', knn)])

In [40]:
knn_v1.fit(X_train, y_train)

In [None]:
from sklearn import set_config
set_config(display='diagram')

In [None]:
knn_v1

### Predict on test set

In [41]:
y_pred = knn_v1.predict(X_test)
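
The notebook computes `y_pred` but does not score it. One possible way to evaluate test-set predictions is sketched below with R² and RMSE; the arrays are hypothetical stand-ins for `y_test` and `y_pred`, not actual results from the model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical stand-ins for y_test and y_pred (not real model output)
y_test_demo = np.array([3.0, 5.0, 8.0, 4.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 4.0])

# R^2: share of the target variance explained by the predictions
r2 = r2_score(y_test_demo, y_pred_demo)

# RMSE: typical prediction error, in the same units as the target
rmse = np.sqrt(mean_squared_error(y_test_demo, y_pred_demo))

print(round(r2, 4), round(rmse, 4))
```

In the notebook itself, the same calls would take `y_test` and `y_pred` as arguments.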

### K Fold Cross Validation

In [42]:
from sklearn.model_selection import cross_val_score

In [43]:
scores = cross_val_score( knn_v1,
                          X_train,
                          y_train,
                          cv = 10,
                          scoring = 'r2')

In [44]:
scores

array([0.82494184, 0.71891728, 0.75005726, 0.8216027 , 0.74097026,
       0.76401927, 0.72677321, 0.79012772, 0.84630204, 0.74544216])

In [45]:
scores.mean()

0.7729153751230442

In [46]:
scores.std()

0.04263903256390474
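
As a rough sketch of what `cross_val_score` does for a regressor when `cv` is an integer, the loop below splits the data with `KFold`, fits a fresh copy of the model on each training part, and scores it on the held-out part. It uses synthetic data from `make_regression` and a bare KNN model rather than the notebook's pipeline, so the numbers are unrelated to the scores above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (an assumption for illustration, not the cars dataset)
X_demo, y_demo = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)
model = KNeighborsRegressor(n_neighbors=5, weights="distance")

# Manual k-fold loop: fit on k-1 folds, score on the remaining fold
manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    manual_scores.append(r2_score(y_demo[val_idx], fold_pred))

# The helper performs the same splits and scoring in one call
auto_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")

print(np.allclose(manual_scores, auto_scores))
```

Passing the whole pipeline to `cross_val_score`, as the notebook does, means the imputer, encoder, and scaler are refit on each training fold, so no statistics from the validation fold leak into preprocessing.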

In [47]:
from joblib import dump

In [48]:
dump(knn_v1, "cars.pkl")

['cars.pkl']
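
A saved model can be read back with `joblib.load`. The sketch below round-trips a small hypothetical KNN model through a temporary file (the name `demo_model.pkl` is illustrative) and checks that the restored model predicts the same values.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.neighbors import KNeighborsRegressor

# Small hypothetical model, standing in for the fitted pipeline
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
model = KNeighborsRegressor(n_neighbors=2).fit(X_demo, y_demo)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    dump(model, model_path)
    restored = load(model_path)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

For the notebook's file, `load("cars.pkl")` would return the full pipeline, which accepts a raw DataFrame with the `x_features` columns. Loading generally assumes a scikit-learn version compatible with the one used for saving.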