## Used Car Price Prediction

In this notebook, we explore the "Used Car Price Prediction" dataset (link: https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset) and apply the "Support Vector Regression" algorithm to it. Support Vector Regression, or SVR for short, is the regression counterpart of the Support Vector Classifier: SVC predicts class labels (see my previous notebook: Breast Cancer Prediction), while SVR predicts a continuous value, which here is the price of a car.

The goal is to fit a model on the training portion of the dataset that predicts prices accurately on held-out test data.

### Step 1: Data cleaning and Preprocessing

We first import the necessary modules, then convert the categorical data to numerical data and fill in the NaN values.

In [None]:
import pandas as pd # For CSV I/O
import numpy as np  # For numerical operations on arrays.
import seaborn as sns # For visualizing the distribution of a column.
import matplotlib.pyplot as plt # For plotting the predictions against the data.

In [None]:
df = pd.read_csv("/kaggle/input/used-car-price-prediction-dataset/used_cars.csv")
df.head()

In [None]:
# df.info() Everything is of object type except "model_year"
# df.duplicated().sum() Returns zero.
df.isna().sum()

As the output above shows, three of the columns ("fuel_type", "accident" and "clean_title") contain a fair number of NaN values that must be fixed. But before that, we have to convert the categorical data, which makes up almost all of this dataset, into numerical data.

In [None]:
# Changing categorical data. 
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

df['brand'] = label_encoder.fit_transform(df['brand'])
df.head() # Brand name changed to numbers.

We create the new features mileage_int and price_int by removing "mi.", "$" and any commas from the original strings so that they can be converted to integers.

In [None]:
# Function for replacing symbols and changing to numerical data. 

def return_mileage(s): 
    mileage = int((s.replace(',','')).replace('mi.',''))
    return mileage

def return_price(s): 
    price = int((s.replace(',','')).replace('$',''))
    return price

In [None]:
df['mileage_int'] = df['milage'].map(return_mileage)
df.drop('milage', axis=1, inplace=True)

df['price_int'] = df['price'].map(return_price)
df.drop('price', axis=1, inplace=True)

df.head()
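The same cleanup can also be written with pandas string methods instead of a Python-level function. The following is only a minimal sketch on a made-up three-row sample (the values are illustrative and not taken from the dataset); it assumes that the digits are the only characters worth keeping.

```python
import pandas as pd

# Hypothetical sample that mimics the raw 'milage' and 'price' string formats.
sample = pd.DataFrame({
    "milage": ["51,000 mi.", "34,742 mi.", "9,835 mi."],
    "price": ["$10,300", "$38,005", "$54,598"],
})

# Remove every non-digit character (commas, spaces, "mi.", "$"), then cast to int.
sample["mileage_int"] = sample["milage"].str.replace(r"\D", "", regex=True).astype(int)
sample["price_int"] = sample["price"].str.replace(r"\D", "", regex=True).astype(int)

print(sample[["mileage_int", "price_int"]])
```

This variant would also drop a decimal point or a minus sign, so it is only suitable when the column holds non-negative whole numbers, as it appears to here.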

Now, we clean up the "accident", "fuel_type" and "clean_title" columns, which are the ones containing NaN values.

In [None]:
# Using .replace to shorten the two accident labels to Yes/No (NaN entries are not affected by this mapping).
df['accident'] = df['accident'].replace(
    {
        'At least 1 accident or damage reported': 'Yes',
        'None reported': 'No'
    }
)

In [None]:
# df['fuel_type'].value_counts() 2 counts of "not supported", 45 counts of "-"
sns.countplot(x='fuel_type', data=df)

In [None]:
df['fuel_type'] = df['fuel_type'].fillna('Gasoline')
df['fuel_type'] = df['fuel_type'].replace(
    {
        '-': 'Hybrid',
        'not supported': 'Hybrid'
    }
)

In [None]:
df['clean_title'] = df['clean_title'].fillna('No')
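The difference between `.replace` and `.fillna` matters in the cells above: a dictionary passed to `.replace` only rewrites the values it lists, while `.fillna` is what fills missing entries. The sketch below illustrates this on a hypothetical four-row column; treating a missing accident report as "No" is an assumption made for the example, not something the dataset states.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same labels and gaps as the 'accident' column.
accident = pd.Series([
    "None reported",
    np.nan,
    "At least 1 accident or damage reported",
    "None reported",
])

# .replace with a dict rewrites only the listed labels; NaN is left untouched.
mapped = accident.replace({
    "At least 1 accident or damage reported": "Yes",
    "None reported": "No",
})
print(mapped.isna().sum())  # the NaN entry is still there

# .fillna fills the remaining NaN (assumption: no report is treated as "No").
filled = mapped.fillna("No")
print(filled.tolist())
```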

We have achieved two things: 
1. Added new numerical features to replace the text columns (mileage/price).
2. Filled in missing values with explicit categories.

However, the majority of the data is still of "object" type, and we need to convert it into numerical data. We can use LabelEncoder().fit_transform() for this task, as implemented below.

In [None]:
df['accident']=label_encoder.fit_transform(df['accident'])
df['fuel_type']=label_encoder.fit_transform(df['fuel_type'])

df['ext_col'] = label_encoder.fit_transform(df['ext_col'])
df['int_col'] = label_encoder.fit_transform(df['int_col'])
df['transmission'] = label_encoder.fit_transform(df['transmission'])
df['engine'] = label_encoder.fit_transform(df['engine'])
df['model'] = label_encoder.fit_transform(df['model'])
df['clean_title'] = label_encoder.fit_transform(df['clean_title'])

In [None]:
df.info()

# Converted everything to numerical data.
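Reusing a single LabelEncoder for every column works for encoding, but each call to fit_transform overwrites the classes learned for the previous column, so the codes can no longer be decoded afterwards. If decoding is needed, one option is to keep one encoder per column. This is only a sketch on a hypothetical sample; the `encoders` dictionary is a name introduced for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample with two categorical columns.
sample = pd.DataFrame({
    "fuel_type": ["Gasoline", "Hybrid", "Diesel", "Gasoline"],
    "transmission": ["A/T", "M/T", "A/T", "A/T"],
})

# Keep a separate fitted encoder per column so the codes can be decoded later.
encoders = {}
for col in ["fuel_type", "transmission"]:
    encoders[col] = LabelEncoder()
    sample[col] = encoders[col].fit_transform(sample[col])

# LabelEncoder assigns integer codes following the sorted order of the labels.
print(list(encoders["fuel_type"].classes_))

# Map the integer codes back to the original labels.
decoded = encoders["fuel_type"].inverse_transform(sample["fuel_type"])
print(list(decoded))
```

Note that the integer codes imply an ordering between categories that has no real meaning, which a distance-based model such as SVR will still take into account.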

### Step 2: Splitting the dataset.

Here, we split the dataset into the independent variables (the features, X) and the dependent variable (the target, y) for SVR training.

In [None]:
X = df.drop('price_int', axis=1)
y = df['price_int']

X.head()

In [None]:
y.head()

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score  # Regression metrics
from sklearn.svm import SVR

NOTE: I used StandardScaler to speed up the steps detailed below. This is not ideal in this scenario, because the DataFrame rebuilt from the scaled array loses its column names: every column is referred to by an integer index that has to be mapped back to its original name. The cell below can be commented out to skip the scaling; I have tried this, and the result is still accurate.

In [None]:
X = StandardScaler().fit_transform(X)
X = pd.DataFrame(X)
X.head()
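The column names do not have to be lost when scaling. A minimal sketch, assuming a small made-up feature table (the two columns and their values are illustrative), passes the original column names and index back to the DataFrame constructor:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical feature table with two numerical columns.
features = pd.DataFrame({
    "model_year": [2010, 2015, 2020],
    "mileage_int": [90000, 50000, 10000],
})

scaler = StandardScaler()

# fit_transform returns a plain NumPy array, so restore the labels explicitly.
scaled = pd.DataFrame(
    scaler.fit_transform(features),
    columns=features.columns,
    index=features.index,
)

print(scaled)
```

With the names preserved, a column can be selected as `scaled["mileage_int"]` rather than by its position.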

### Step 3: Finding our best parameters and splitting the dataset (part 2)

Here, we use train_test_split to split the dataset into training and testing data, and StratifiedKFold() to split the training data into 3 folds. These folds are used to test combinations of C, gamma, and kernel values and find the best parameters for our model.

The best parameters are then used in our SVR, which is tested on our test data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=164)

svr = SVR()
svr_args = {
    'C': [0.01, 0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.05, 0.01, 0.001],
    'kernel': ['rbf', 'poly', 'sigmoid']
}

best_params = GridSearchCV(estimator = svr, 
                           param_grid = svr_args, 
                           cv = cv,
                           verbose = 1,
                           scoring = 'neg_mean_squared_error')

result_svr = best_params.fit(X_train, y_train)
result_svr.best_params_
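StratifiedKFold is intended for classification targets, so for a continuous target such as a price, a plain KFold is the more usual choice. The sketch below shows that variant on synthetic data (the data, the smaller parameter grid, and the names `X_demo`, `y_demo` and `search` are assumptions made for the example). It also places the scaler inside a pipeline so that it is refitted on the training part of each fold.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for the car features and prices.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(120, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.1, size=120)

# make_pipeline names each step after its class, hence the "svr__" prefix below.
pipe = make_pipeline(StandardScaler(), SVR())
param_grid = {
    "svr__C": [0.1, 1, 10],
    "svr__gamma": [0.01, 0.1, 1],
}

# Plain K-fold split: no stratification is needed for a continuous target.
cv = KFold(n_splits=3, shuffle=True, random_state=164)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)  # negated MSE, so values closer to zero are better
```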

### Step 4: Model Training

The results of the GridSearch were: 
- C: 0.01
- gamma: 1
- kernel: rbf (radial basis function)

Now we use these parameters to train our SVR. The predictions on the test data are then plotted over the true data y, and a line is drawn to show the fit of the model.

In [None]:
svr = svr.set_params(**result_svr.best_params_)
svr.fit(X_train, y_train)

prediction = svr.predict(X_test)

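# Sort the test points by mileage so that the prediction line is drawn from left to right.
order = np.argsort(X_test[10].values)

plt.figure(figsize=(16, 8))
plt.scatter(X[10], y, color='darkorange', label='Data') # 10 corresponds to 'mileage_int'
plt.plot(X_test[10].values[order], prediction[order], color='navy', lw=2, label="rbf")
plt.xlabel('Mileage (standardized)')
plt.ylabel('Price')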
plt.title('SVR for predicting used car prices')
plt.legend()

plt.show()
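A plot alone does not say how large the prediction errors are. One way to quantify them is with the regression metrics from scikit-learn. The sketch below uses made-up true and predicted prices (not results from this notebook) only to show the calls; in the notebook the same functions would be applied to y_test and prediction.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted prices, for illustration only.
y_true = np.array([10000, 25000, 40000, 55000])
y_pred = np.array([12000, 24000, 38000, 60000])

mae = mean_absolute_error(y_true, y_pred)           # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
r2 = r2_score(y_true, y_pred)                       # 1.0 would be a perfect fit

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.4f}")
```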


### We have completed our model training! 

The navy line generated above shows the fit that our SVR produces for the dataset when trained with the best parameters from the grid search. If you have any questions or would like to collaborate with me on a project, please email me at akshathmangudi@gmail.com. Good day.

Notebook by Akshath Mangudi