![Used Cars Predicion](https://img.freepik.com/free-vector/businessman-with-smartphone-rents-car-street-via-carsharing-service-carsharing-service-short-periods-rent-best-taxi-alternative-concept_335657-2201.jpg)

In [1]:
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

sns.set()
warnings.filterwarnings("ignore")



In [2]:
# Reading the dataset
data = pd.read_csv("../input/used-cars-price-prediction/train-data.csv")

In [3]:
# Checking for the shape of the data
data.shape

(6019, 14)

In [4]:
# Understand the data
data.head()

Unnamed: 0.1,Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price
0,0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,,1.75
1,1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,,12.5
2,2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,8.61 Lakh,4.5
3,3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,,6.0
4,4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,,17.74


In [5]:
# Check for null values
data.isna().sum()

Unnamed: 0              0
Name                    0
Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
Mileage                 2
Engine                 36
Power                  36
Seats                  42
New_Price            5195
Price                   0
dtype: int64

In [6]:
# Take actions on null values
data.drop(columns=['New_Price'], inplace=True)
data.dropna(inplace=True)

In [7]:
# Divide the data into X & y
X = data.iloc[:, 2:-1]
y = data.iloc[:, -1]

In [8]:
print(y.head(3))
X.head()

0     1.75
1    12.50
2     4.50
Name: Price, dtype: float64


Unnamed: 0,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats
0,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0
1,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0
2,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0
3,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0
4,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0


## **Extract Numeric Values from `Mileage, Engine & Power` Colns**

In [9]:
X.head(3)

Unnamed: 0,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats
0,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0
1,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0
2,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0


In [10]:
def extract_numeric_value(value: str) -> float:
    try:
        strRep = str(value)
        floatRep = ""
        for char in strRep:
            if not char.isalpha() and not char.isspace() and char != '/':
                floatRep += char

        return float(floatRep)
    except:
        return None

In [11]:
X['Mileage'] = X['Mileage'].apply(extract_numeric_value)

In [12]:
X['Mileage'].head(3)

0    26.60
1    19.67
2    18.20
Name: Mileage, dtype: float64

In [13]:
X['Engine'] = X['Engine'].apply(extract_numeric_value)

In [14]:
X['Engine'].head(3)

0     998.0
1    1582.0
2    1199.0
Name: Engine, dtype: float64

In [15]:
X['Power'] = X['Power'].apply(extract_numeric_value)

In [16]:
X['Power'].head(3)

0     58.16
1    126.20
2     88.70
Name: Power, dtype: float64

In [17]:
X.isna().sum()

Location               0
Year                   0
Kilometers_Driven      0
Fuel_Type              0
Transmission           0
Owner_Type             0
Mileage                0
Engine                 0
Power                103
Seats                  0
dtype: int64

In [18]:
# filling the null values in `power` coln
X['Power'] = X['Power'].fillna(X['Power'].median())
X.isna().sum()

Location             0
Year                 0
Kilometers_Driven    0
Fuel_Type            0
Transmission         0
Owner_Type           0
Mileage              0
Engine               0
Power                0
Seats                0
dtype: int64

# **Time to build Models**

In [19]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse, r2_score

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [21]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((4780, 10), (1195, 10), (4780,), (1195,))

In [22]:
# Create ColumnTransformer to encode and scale values of data
clf1 = ColumnTransformer([
    ('encode', OneHotEncoder(drop="first", sparse_output=True, handle_unknown="ignore"), ['Location', 'Fuel_Type', 'Transmission', 'Owner_Type']),
    ('scaling', StandardScaler(), ['Year', 'Kilometers_Driven', 'Mileage', 'Engine', 'Power', 'Seats'])
], remainder="passthrough")

In [23]:
from sklearn.linear_model import LinearRegression

# Use LinearRegression
clf2 = LinearRegression()

In [24]:
pipe = Pipeline([
    ('ColumnTransformer', clf1),
    ('Model', clf2)
])

In [25]:
pipe.fit(X, y)

In [26]:
pipe.named_steps

{'ColumnTransformer': ColumnTransformer(remainder='passthrough',
                   transformers=[('encode',
                                  OneHotEncoder(drop='first',
                                                handle_unknown='ignore'),
                                  ['Location', 'Fuel_Type', 'Transmission',
                                   'Owner_Type']),
                                 ('scaling', StandardScaler(),
                                  ['Year', 'Kilometers_Driven', 'Mileage',
                                   'Engine', 'Power', 'Seats'])]),
 'Model': LinearRegression()}

In [27]:
y_pred = pipe.predict(X_test)

In [28]:
print(f"The r2_score by LinearRegression Model is {r2_score(y_test, y_pred)}")
print(f"The Mean Squared Error by LinearRegression Model is {mse(y_test, y_pred)}")

The r2_score by LinearRegression Model is 0.6415590180270131
The Mean Squared Error by LinearRegression Model is 52.1806579981603


# **Time to call my Best Friend `Random Forest ðŸŒ²`**

In [29]:
from sklearn.ensemble import RandomForestRegressor

clf3 = RandomForestRegressor(n_estimators=100, random_state=42)

In [30]:
pipe2 = Pipeline([
    ('clf1', clf1),
    ('clf3', clf3)
])

In [31]:
pipe2.fit(X_train, y_train)

In [32]:
y_pred = pipe2.predict(X_test)

In [33]:
print(f"The r2_score by Random Forest Regressor Model is {r2_score(y_test, y_pred)}")
print(f"The Mean Squared Error by Random Forest Regressor Model is {mse(y_test, y_pred)}")

The r2_score by Random Forest Regressor Model is 0.8403879723043517
The Mean Squared Error by Random Forest Regressor Model is 23.235793473546472


# **Let's give a chance to `SGDRegressor` too, I feel bad for him ðŸ˜©**

In [34]:
from sklearn.linear_model import SGDRegressor

clf4 = SGDRegressor(max_iter=500, random_state=42)

In [35]:
pipe3 = Pipeline([
    ('clf1', clf1),
    ('clf4', clf4)
])

In [36]:
pipe3.fit(X_train, y_train)

In [37]:
y_pred = pipe3.predict(X_test)

In [38]:
print(f"The r2_score by SGDRegressor Model is {r2_score(y_test, y_pred)}")
print(f"The Mean Squared Error by SGDRegressor Model is {mse(y_test, y_pred)}")

The r2_score by SGDRegressor Model is 0.42923550588988835
The Mean Squared Error by SGDRegressor Model is 83.09001582552642


## **STEPS TAKEN IN THIS NOTEBOOK**
* First checked for null values and handled them
* Second step, Converted colns with mixed dtype to float64 to pass to the model
* In third step, Done One Hot Encoding on colns with Classified values
* In fourth step, Normalize the data to bring all values on same scale

`Well, though performance was not that better to be proud of ðŸ˜£` but, 

### **Random Forest can be clearly seen dominating in model performance, so another night goes to my friend RFR ðŸŒ²**