## Car Price Prediction Model
Features used to predict a car's selling price include:
- Car Brand
- Mileage
- Engine Capacity
- Transmission
- Fuel
- Max Power

### Steps Involved in Building the Model
- Car dataset
- Preprocessing
- Data analysis
- After analysis, separate the dataset into input features (X) and the output feature (y)
- Splitting the dataset into training and test sets
- Model creation (scikit-learn)
- Model or algorithm selection (Linear Regression)
- Saving the model (using pickle)
- Deployment (using Streamlit)

### Import Necessary Libraries
- scikit-learn, tensorflow, pickle, numpy, pandas

In [1]:
import numpy as np
import pandas as pd
import pickle as pk
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder

### Read the dataset

In [2]:
df = pd.read_csv('dataset/Cardetails.csv')

In [3]:
df.head(3)

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4 kmpl,1248 CC,74 bhp,190Nm@ 2000rpm,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14 kmpl,1498 CC,103.52 bhp,250Nm@ 1500-2500rpm,5.0
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7 kmpl,1497 CC,78 bhp,"12.7@ 2,700(kgm@ rpm)",5.0


In [4]:
df.drop(['torque'], axis=1, inplace=True)

In [5]:
df.head(2)

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4 kmpl,1248 CC,74 bhp,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14 kmpl,1498 CC,103.52 bhp,5.0


In [6]:
df.isna().sum()

name               0
year               0
selling_price      0
km_driven          0
fuel               0
seller_type        0
transmission       0
owner              0
mileage          221
engine           221
max_power        215
seats            221
dtype: int64

In [7]:
df.dropna(inplace=True)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7907 entries, 0 to 8127
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           7907 non-null   object 
 1   year           7907 non-null   int64  
 2   selling_price  7907 non-null   int64  
 3   km_driven      7907 non-null   int64  
 4   fuel           7907 non-null   object 
 5   seller_type    7907 non-null   object 
 6   transmission   7907 non-null   object 
 7   owner          7907 non-null   object 
 8   mileage        7907 non-null   object 
 9   engine         7907 non-null   object 
 10  max_power      7907 non-null   object 
 11  seats          7907 non-null   float64
dtypes: float64(1), int64(3), object(8)
memory usage: 803.1+ KB


In [9]:
df.shape

(7907, 12)

In [10]:
df.drop_duplicates(inplace=True)

In [11]:
df.shape

(6718, 12)

### Data analysis

#### Display unique values of object columns

In [12]:
for col in df.select_dtypes(include='object').columns:
    print('-'*90)
    print('-'*90)
    print(f"Unique values of {col} are")
    print(df[col].unique())

------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------
Unique values of name are
['Maruti Swift Dzire VDI' 'Skoda Rapid 1.5 TDI Ambition'
 'Honda City 2017-2020 EXi' ... 'Tata Nexon 1.5 Revotorq XT'
 'Ford Freestyle Titanium Plus Diesel BSIV'
 'Toyota Innova 2.5 GX (Diesel) 8 Seater BS IV']
------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------
Unique values of fuel are
['Diesel' 'Petrol' 'LPG' 'CNG']
------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------
Unique values of seller_type are
['Individual' 'Dealer' 'Trustmark Dealer']
-----------------------------------------------------------------------------------

In [13]:
def get_brand_name(car_name):
    car_name = car_name.split(' ')[0]
    return car_name.strip()

In [14]:
def clean_value(value):
    value = value.split(' ')[0]
    if value == ' ' or value == '':
        value = 0
    return float(value)
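To make the behaviour of `clean_value` concrete, here is a small sketch that calls it on strings shaped like the `mileage`, `engine`, and `max_power` columns. The first two inputs are copied from the rows shown earlier; the input `' bhp'` is a hypothetical case of a value with a unit but no number, which is presumably what the empty-string check is guarding against.

```python
def clean_value(value):
    # Keep the text before the first space, e.g. '23.4 kmpl' -> '23.4'.
    value = value.split(' ')[0]
    # A missing number leaves an empty string, which is mapped to 0.
    if value == ' ' or value == '':
        value = 0
    return float(value)


mileage = clean_value('23.4 kmpl')   # value from the first row of the dataset
engine = clean_value('1248 CC')      # value from the first row of the dataset
no_number = clean_value(' bhp')      # hypothetical input with only a unit

print(mileage, engine, no_number)
```

Note that a missing number becomes `0.0` rather than `NaN`, so such rows would stay in the data with a zero value.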

### Keep only the brand name instead of the full car name

In [15]:
df['name'] = df['name'].apply(get_brand_name)

In [16]:
df['name'].unique()

array(['Maruti', 'Skoda', 'Honda', 'Hyundai', 'Toyota', 'Ford', 'Renault',
       'Mahindra', 'Tata', 'Chevrolet', 'Datsun', 'Jeep', 'Mercedes-Benz',
       'Mitsubishi', 'Audi', 'Volkswagen', 'BMW', 'Nissan', 'Lexus',
       'Jaguar', 'Land', 'MG', 'Volvo', 'Daewoo', 'Kia', 'Fiat', 'Force',
       'Ambassador', 'Ashok', 'Isuzu', 'Opel'], dtype=object)
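The list above contains `'Land'` and `'Ashok'`, which suggests that two-word brands are cut to their first word. The sketch below shows why: `get_brand_name` keeps only the first space-separated token. The name `'Land Rover Discovery Sport'` is a hypothetical example, not a row taken from the dataset.

```python
def get_brand_name(car_name):
    # Keep only the first space-separated token of the full car name.
    car_name = car_name.split(' ')[0]
    return car_name.strip()


samples = [
    'Maruti Swift Dzire VDI',        # from the dataset preview
    'Skoda Rapid 1.5 TDI Ambition',  # from the dataset preview
    'Land Rover Discovery Sport',    # hypothetical two-word brand
]
brands = [get_brand_name(name) for name in samples]
print(brands)
```

This is harmless for label encoding as long as the same truncated names are used when the model is queried later.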

In [17]:
df.head(1)

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,seats
0,Maruti,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4 kmpl,1248 CC,74 bhp,5.0


In [18]:
df['max_power'] = df['max_power'].apply(clean_value)
df['mileage'] = df['mileage'].apply(clean_value)
df['engine'] = df['engine'].apply(clean_value)

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6718 entries, 0 to 8125
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           6718 non-null   object 
 1   year           6718 non-null   int64  
 2   selling_price  6718 non-null   int64  
 3   km_driven      6718 non-null   int64  
 4   fuel           6718 non-null   object 
 5   seller_type    6718 non-null   object 
 6   transmission   6718 non-null   object 
 7   owner          6718 non-null   object 
 8   mileage        6718 non-null   float64
 9   engine         6718 non-null   float64
 10  max_power      6718 non-null   float64
 11  seats          6718 non-null   float64
dtypes: float64(4), int64(3), object(5)
memory usage: 682.3+ KB


In [20]:
df['name'].unique()

array(['Maruti', 'Skoda', 'Honda', 'Hyundai', 'Toyota', 'Ford', 'Renault',
       'Mahindra', 'Tata', 'Chevrolet', 'Datsun', 'Jeep', 'Mercedes-Benz',
       'Mitsubishi', 'Audi', 'Volkswagen', 'BMW', 'Nissan', 'Lexus',
       'Jaguar', 'Land', 'MG', 'Volvo', 'Daewoo', 'Kia', 'Fiat', 'Force',
       'Ambassador', 'Ashok', 'Isuzu', 'Opel'], dtype=object)

### Encode data


In [21]:
label_encoder = LabelEncoder()

In [22]:
# Fit and transform the 'name' column
df['name_encoded'] = label_encoder.fit_transform(df['name'])

#### Compare the number of unique names before and after encoding

In [23]:
print(f"Name length {len(df['name'].unique())}")
print(f"Encoded Name {len(df['name_encoded'].unique())}")

Name length 31
Encoded Name 31


In [24]:
df.drop(['name'], axis=1, inplace=True)
df.rename(columns={'name_encoded': 'name'}, inplace=True)

#### Encode fuel, transmission, seller_type, and owner

In [25]:
# Fit and transform the 'fuel' column
df['fuel'] = label_encoder.fit_transform(df['fuel'])

In [26]:
# Fit and transform the 'transmission' column
df['transmission'] = label_encoder.fit_transform(df['transmission'])

In [27]:
# Fit and transform the 'seller_type' column
df['seller_type'] = label_encoder.fit_transform(df['seller_type'])

In [28]:
# Fit and transform the 'owner' column
df['owner'] = label_encoder.fit_transform(df['owner'])
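The cells above reuse one `label_encoder` object for every column, so each `fit_transform` call overwrites the mapping learned for the previous column. If the mappings are needed later (for example, to encode user input at prediction time), one option is to keep a separate encoder per column. The following is a minimal sketch on a small made-up frame; `df_demo`, `encoders`, and `fuel_mapping` are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small illustrative frame; the category labels match those in the dataset.
df_demo = pd.DataFrame({
    'fuel': ['Diesel', 'Petrol', 'LPG', 'CNG', 'Petrol'],
    'transmission': ['Manual', 'Automatic', 'Manual', 'Manual', 'Automatic'],
})

# One fitted encoder per column, so every mapping stays available.
encoders = {}
for col in ['fuel', 'transmission']:
    enc = LabelEncoder()
    df_demo[col] = enc.fit_transform(df_demo[col])
    encoders[col] = enc

# LabelEncoder assigns codes in sorted label order (stored in classes_).
fuel_mapping = {label: code for code, label in enumerate(encoders['fuel'].classes_)}
print(fuel_mapping)

# The stored encoder can also turn codes back into labels.
decoded = list(encoders['fuel'].inverse_transform([1, 3]))
print(decoded)
```

Because `LabelEncoder` sorts labels alphabetically, `Diesel` maps to 1 and `Petrol` to 3 here, which agrees with the encoded `fuel` values in the notebook output.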

In [29]:
df.head()

Unnamed: 0,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,seats,name
0,2014,450000,145500,1,1,1,0,23.4,1248.0,74.0,5.0,20
1,2014,370000,120000,1,1,1,2,21.14,1498.0,103.52,5.0,26
2,2006,158000,140000,3,1,1,4,17.7,1497.0,78.0,5.0,10
3,2010,225000,127000,1,1,1,0,23.0,1396.0,90.0,5.0,11
4,2007,130000,120000,3,1,1,0,16.1,1298.0,88.2,5.0,20


In [30]:
# Reorder columns to place 'name' at the first position
new_order = ['name'] + [col for col in df.columns if col != 'name']

# Reindex the DataFrame with the new column order
df = df[new_order]

In [31]:
df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,seats
0,20,2014,450000,145500,1,1,1,0,23.4,1248.0,74.0,5.0
1,26,2014,370000,120000,1,1,1,2,21.14,1498.0,103.52,5.0
2,10,2006,158000,140000,3,1,1,4,17.7,1497.0,78.0,5.0
3,11,2010,225000,127000,1,1,1,0,23.0,1396.0,90.0,5.0
4,20,2007,130000,120000,3,1,1,0,16.1,1298.0,88.2,5.0


#### Reset index

In [32]:
df.tail(1)

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,seats
8125,20,2009,382000,120000,1,1,1,0,19.3,1248.0,73.9,5.0


In [33]:
df.reset_index(inplace=True)

In [34]:
df.tail(1)

Unnamed: 0,index,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,seats
6717,8125,20,2009,382000,120000,1,1,1,0,19.3,1248.0,73.9,5.0


In [35]:
df.drop(['index'], axis=1, inplace=True)

In [36]:
df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,seats
0,20,2014,450000,145500,1,1,1,0,23.4,1248.0,74.0,5.0
1,26,2014,370000,120000,1,1,1,2,21.14,1498.0,103.52,5.0
2,10,2006,158000,140000,3,1,1,4,17.7,1497.0,78.0,5.0
3,11,2010,225000,127000,1,1,1,0,23.0,1396.0,90.0,5.0
4,20,2007,130000,120000,3,1,1,0,16.1,1298.0,88.2,5.0


### Split dataset into train and test

In [37]:
X = df.drop(columns=['selling_price'])
y = df['selling_price']

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)
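As a quick check on the split sizes, the sketch below applies the same `train_test_split` call to placeholder arrays with the row count reported by `df.shape` (6718). The arrays `X_demo` and `y_demo` are stand-ins rather than the real features, so only the resulting shapes are meaningful.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 6718  # row count reported by df.shape after dropping duplicates

# Placeholder data with the same number of rows as the cleaned dataset.
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.80, random_state=42
)

# Roughly 80% of the rows go to training and the remainder to testing.
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows end up in the test set on every run.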

### Model Creation

In [40]:
model = LinearRegression()

In [41]:
model.fit(X_train, y_train)

In [42]:
y_pred = model.predict(X_test)

In [44]:
print(f'{y_test.values}    {y_pred}')

[450000 800000 530000 ... 860000  65000 305000]    [ 777946.7591622   700487.11147428  548245.44331118 ... 1084568.115797
 -242403.58199564   35963.04036479]
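The printed predictions include a negative price (about -242403 for a car that sold for 65000), which a plain linear regression can produce because its output is unbounded. Comparing individual values gives only a rough impression, so a summary metric is usually more informative. The sketch below computes R² and mean absolute error with scikit-learn on synthetic data and counts negative predictions. The data, the coefficients, and all variable names are assumptions for illustration, not results from the car dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption standing in for the car features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + 5.0 + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.80, random_state=42
)

demo_model = LinearRegression().fit(X_tr, y_tr)
pred = demo_model.predict(X_te)

r2 = r2_score(y_te, pred)              # 1.0 would be a perfect fit
mae = mean_absolute_error(y_te, pred)  # average absolute error, in target units
n_negative = int((pred < 0).sum())     # predictions below zero

print(f"R2: {r2:.3f}  MAE: {mae:.3f}  negative predictions: {n_negative}")
```

Applied to the notebook's `y_test` and `y_pred`, the same three lines would give a single-number view of how well the model fits and how often it predicts an impossible price.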


In [45]:
y_test.values[-1]

305000

In [46]:
y_pred[-1]

35963.04036478698

In [47]:
y_pred[-20]

1735656.4697262794

In [48]:
y_test.values[-20]

1630000
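The step list at the top mentions saving the model with pickle. A minimal sketch of that step follows, using a tiny stand-in model and a temporary directory. The file name `model.pkl` and the variables `demo_model` and `restored` are hypothetical, and the notebook's own `model` would take the place of `demo_model`.

```python
import os
import pickle as pk
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model; in the notebook this would be the fitted `model`.
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

# Hypothetical output path inside a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), 'model.pkl')

# Serialize the fitted estimator to disk.
with open(model_path, 'wb') as f:
    pk.dump(demo_model, f)

# Load it back and confirm that the predictions are unchanged.
with open(model_path, 'rb') as f:
    restored = pk.load(f)

same = bool(np.allclose(restored.predict(X_demo), demo_model.predict(X_demo)))
print(same)
```

A pickled estimator should be loaded with the same scikit-learn version that created it, and pickle files from untrusted sources should not be loaded at all.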