<h2>Importing the libraries</h2>

In [240]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', None)

<h3>Importing the dataset</h3>

In [278]:
data = pd.read_csv('cardekho.csv')

fuel_to_remove = ['LPG', 'CNG']
owner_to_remove = ['Fourth & Above Owner','Test Drive Car']
seller_to_remove = ['Trustmark Dealer']

data = data[~data['fuel'].isin(fuel_to_remove)]
data = data[~data['owner'].isin(owner_to_remove)]
data = data[~data['seller_type'].isin(seller_to_remove)]


X_data = data.iloc[:, [1] + list(range(3,12))]

print(X_data['fuel'].unique())
print(X_data['seller_type'].unique())
print(X_data['transmission'].unique())
print(X_data['owner'].unique())
print(X_data['max_power'].head())

X_data.loc[:, 'max_power'] = X_data['max_power'].replace(['', 'NA', 'None'], np.nan)
X_data['max_power'] = pd.to_numeric(X_data['max_power'], errors='coerce')
X_data.info()


X = X_data.values
print(X[0])
print(X[2])
y = data.iloc[:,2].values



['Diesel' 'Petrol']
['Individual' 'Dealer']
['Manual' 'Automatic']
['First Owner' 'Second Owner' 'Third Owner']
0        74
1    103.52
2        78
3        90
4      88.2
Name: max_power, dtype: object
<class 'pandas.core.frame.DataFrame'>
Index: 7622 entries, 0 to 8127
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   year                7622 non-null   int64  
 1   km_driven           7622 non-null   int64  
 2   fuel                7622 non-null   object 
 3   seller_type         7622 non-null   object 
 4   transmission        7622 non-null   object 
 5   owner               7622 non-null   object 
 6   mileage(km/ltr/kg)  7420 non-null   float64
 7   engine              7420 non-null   float64
 8   max_power           7426 non-null   float64
 9   seats               7420 non-null   float64
dtypes: float64(4), int64(2), object(4)
memory usage: 655.0+ KB
[2014 145500 'Diesel' 'Individual' 'Manual'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_data['max_power'] = pd.to_numeric(X_data['max_power'], errors='coerce')
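
The "A value is trying to be set on a copy of a slice" warning above is raised because X_data was selected out of data and then assigned into. A minimal sketch of one way to avoid it, assuming a made-up toy frame (the names toy and features and all values are illustrative, not part of the dataset): take an explicit copy before converting the column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "year": [2014, 2015, 2016, 2017],
    "max_power": ["74", "103.52", "", "bhp"],
})

# .copy() makes the selection an independent DataFrame, so later
# assignments do not target a slice of `toy`.
features = toy.iloc[:, [1, 2]].copy()

# errors="coerce" turns anything that cannot be parsed into NaN.
features["max_power"] = pd.to_numeric(features["max_power"], errors="coerce")

print(features.dtypes)
print(features["max_power"].isna().sum())
```

With the explicit copy, the separate replace step for empty strings is optional, since errors="coerce" already maps unparseable entries to NaN.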


<h3>Data Preprocessing</h3>

<h4>Taking care of Missing Data</h4>

In [245]:
from sklearn.impute import SimpleImputer
# Columns 6-9 (mileage, engine, max_power, seats) are the only ones with missing values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 6:10])
X[:, 6:10] = imputer.transform(X[:, 6:10])
print(X)

[[2014 145500 'Diesel' ... 1248.0 74.0 5.0]
 [2014 120000 'Diesel' ... 1498.0 103.52 5.0]
 [2006 140000 'Petrol' ... 1497.0 78.0 5.0]
 ...
 [2009 120000 'Diesel' ... 1248.0 73.9 5.0]
 [2013 25000 'Diesel' ... 1396.0 70.0 5.0]
 [2013 25000 'Diesel' ... 1396.0 70.0 5.0]]
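
To make the mean strategy concrete, here is a small sketch on made-up numbers (train_rows and test_rows are hypothetical names). It also shows a stricter variant than the cell above: the column means are learned from the training rows only and then reused for unseen rows, so held-out data does not influence the fill values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up values: column 0 plays the role of mileage, column 1 of engine.
train_rows = np.array([
    [20.0, 1200.0],
    [np.nan, 1500.0],
    [24.0, np.nan],
])
test_rows = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit() learns one mean per column, ignoring NaN entries.
train_filled = imputer.fit_transform(train_rows)

# transform() reuses the means learned from the training rows.
test_filled = imputer.transform(test_rows)

print(imputer.statistics_)
print(train_filled)
print(test_filled)
```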


<h4>Encoding Categorical Data</h4>

In [276]:
from sklearn.preprocessing import LabelEncoder

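columns_to_encode = [2, 3, 4]  # fuel, seller_type, transmission
encoders = {}

for col in columns_to_encode:
    # One encoder per column, so each fitted mapping stays available later
    encoders[col] = LabelEncoder()
    X[:, col] = encoders[col].fit_transform(X[:, col])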

print(X[0])
print(X[2])

[2006 140000 1 1 1 'Third Owner' 17.7 1497.0 78.0 5.0]
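
The integer codes in the row above follow from how LabelEncoder assigns labels: classes are sorted, then numbered from 0. A short sketch on a toy list (fuel_values is an illustrative name):

```python
from sklearn.preprocessing import LabelEncoder

fuel_values = ["Diesel", "Petrol", "Petrol", "Diesel"]

le = LabelEncoder()
codes = le.fit_transform(fuel_values)

# classes_ holds the sorted labels; a label's position is its integer code.
classes = [str(c) for c in le.classes_]
print(classes)
print(list(codes))

# inverse_transform maps codes back to the original labels.
decoded = [str(v) for v in le.inverse_transform([1, 0])]
print(decoded)
```

Applied to the values printed earlier, this ordering gives Diesel = 0 and Petrol = 1, Dealer = 0 and Individual = 1, Automatic = 0 and Manual = 1.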


In [248]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Column 5 is owner; its one-hot columns are placed first, the remaining columns follow unchanged
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [5])], remainder = 'passthrough')
X = np.array(ct.fit_transform(X))
print(X[0])

[1.0 0.0 0.0 2014 145500 0 1 1 23.4 1248.0 74.0 5.0]
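
The three leading values in the row above are the one-hot columns for owner. The sketch below reproduces that layout on a toy array (all values made up) and shows where to read the category order from. It relies on the default settings of OneHotEncoder and ColumnTransformer and converts the result to a dense array defensively.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy array: column 0 is a year, column 1 is the owner category.
toy = np.array([
    [2014, "First Owner"],
    [2006, "Third Owner"],
    [2010, "Second Owner"],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],
    remainder="passthrough",
)
encoded = ct.fit_transform(toy)
if hasattr(encoded, "toarray"):  # densify if a sparse matrix comes back
    encoded = encoded.toarray()
encoded = np.asarray(encoded)

# The category order determines which one-hot column means what.
categories = [str(c) for c in ct.named_transformers_["encoder"].categories_[0]]
print(categories)
print(encoded[0])
print(encoded[1])
```

Because the encoded columns come first and the passthrough columns follow, any later positional indexing has to account for the shift.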


<h3>Splitting the dataset into training and test sets</h3>

In [250]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
print(X_train, y_train)
print(X_train[0])

[[1.0 0.0 0.0 ... 1248.0 74.0 5.0]
 [1.0 0.0 0.0 ... 1493.0 70.0 7.0]
 [1.0 0.0 0.0 ... 1248.0 88.5 5.0]
 ...
 [1.0 0.0 0.0 ... 1498.0 89.75 5.0]
 [1.0 0.0 0.0 ... 1086.0 62.1 5.0]
 [0.0 1.0 0.0 ... 796.0 34.2 5.0]] [640000 750000 911999 ... 590000 165000 180000]
[1.0 0.0 0.0 2017 110000 0 1 1 25.2 1248.0 74.0 5.0]
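
As a quick check of what the split does, the sketch below uses a toy array in which each target equals its row number, so it is easy to confirm that features and targets stay aligned and that a fixed random_state reproduces the same split. The toy_ names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row i is [2*i, 2*i + 1] and its target is i.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1
)

# Repeating the call with the same random_state gives the same partition.
repeat = train_test_split(toy_X, toy_y, test_size=0.2, random_state=1)

print(toy_X_train.shape, toy_X_test.shape)
print(toy_y_test)
```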


<h3>Feature Scaling</h3>

In [252]:
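# Scaling is left disabled: tree-based models such as random forests split on
# per-feature thresholds, so they do not need standardized inputs.
# from sklearn.preprocessing import StandardScaler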
# sc = StandardScaler()
# X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
# X_test[:, 3:] = sc.transform(X_test[:, 3:])
# print(X_train[0])


<h3>Implementing the Random Forest Regression</h3>

In [254]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 20, random_state = 0)
regressor.fit(X_train, y_train)
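
Once a forest is fitted, its feature_importances_ attribute gives a rough view of which inputs drive the predictions. The sketch below uses synthetic data in which only the first feature matters; the data, toy_model and the choice of 200 samples are assumptions for illustration and say nothing about the car dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
toy_X = rng.uniform(size=(200, 3))
# Only feature 0 influences the target; the other two are noise.
toy_y = 10.0 * toy_X[:, 0] + rng.normal(scale=0.1, size=200)

toy_model = RandomForestRegressor(n_estimators=20, random_state=0)
toy_model.fit(toy_X, toy_y)

# One non-negative score per feature; the scores are normalized to sum to 1.
importances = toy_model.feature_importances_
print(importances)
```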

<h3>Comparing Predicted and Actual Values</h3>

In [256]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[233700.   254999.  ]
 [534249.95 750000.  ]
 [343500.   320000.  ]
 ...
 [409916.67 425000.  ]
 [591208.33 580000.  ]
 [412000.   420000.  ]]
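
Reading errors off a two-column printout is hard. A small sketch of per-row absolute and percentage errors, using only the first three pairs visible in the output above (so the numbers describe those three rows, not the whole test set):

```python
import numpy as np

# First three (predicted, actual) pairs copied from the printed output.
pairs = np.array([
    [233700.0, 254999.0],
    [534249.95, 750000.0],
    [343500.0, 320000.0],
])
shown_pred, shown_actual = pairs[:, 0], pairs[:, 1]

abs_err = np.abs(shown_pred - shown_actual)
pct_err = 100.0 * abs_err / shown_actual

for p, a, e, pct in zip(shown_pred, shown_actual, abs_err, pct_err):
    print(f"pred={p:>10.2f} actual={a:>10.2f} abs_err={e:>10.2f} ({pct:.1f}%)")
```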


In [257]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.9672310073170591
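
For reference, r2_score computes 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. A sketch on made-up numbers that checks the hand-computed value against the library function:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values, unrelated to the car dataset.
toy_true = np.array([3.0, -0.5, 2.0, 7.0])
toy_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((toy_true - toy_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((toy_true - toy_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(toy_true, toy_pred))
```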

In [258]:
import joblib

joblib.dump(regressor, 'trained_model.pkl')

['trained_model.pkl']
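
A saved model is only useful if it loads back and predicts the same values. The sketch below does a round trip with a small throwaway model in a temporary directory (toy_model.pkl and the synthetic data are assumptions for illustration).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
toy_X = rng.uniform(size=(50, 2))
toy_y = toy_X[:, 0] + toy_X[:, 1]

toy_model = RandomForestRegressor(n_estimators=5, random_state=0).fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    joblib.dump(toy_model, path)   # serialize to disk
    restored = joblib.load(path)   # read it back

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(toy_model.predict(toy_X), restored.predict(toy_X))
print(same)
```

In this notebook the fitted imputer and encoders are part of the prediction path too, so it may be worth saving them alongside the model.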

In [259]:
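# Feature order: owner one-hot (First, Second, Third), year, km_driven, fuel,
# seller_type, transmission, mileage, engine, max_power, seats
output = regressor.predict([[0.0, 0.0, 1.0, 2024, 200, 1, 0, 0, 20.0, 1348.0, 70.0, 5.0]])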
output

array([797000.])
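
Hand-writing the encoded vector is easy to get wrong. The sketch below builds the same row from readable values, assuming the category lists printed earlier in the notebook and the alphabetical ordering that LabelEncoder and OneHotEncoder use. build_row is a hypothetical helper, not part of the notebook.

```python
# Category orders, assuming the values printed earlier, sorted alphabetically.
OWNER = ["First Owner", "Second Owner", "Third Owner"]
FUEL = ["Diesel", "Petrol"]
SELLER = ["Dealer", "Individual"]
TRANSMISSION = ["Automatic", "Manual"]


def build_row(owner, year, km_driven, fuel, seller_type, transmission,
              mileage, engine, max_power, seats):
    """Hypothetical helper: encode one car in the model's feature order."""
    one_hot = [1.0 if owner == name else 0.0 for name in OWNER]
    return one_hot + [
        year,
        km_driven,
        FUEL.index(fuel),
        SELLER.index(seller_type),
        TRANSMISSION.index(transmission),
        float(mileage),
        float(engine),
        float(max_power),
        float(seats),
    ]


row = build_row("Third Owner", 2024, 200, "Petrol", "Dealer", "Automatic",
                20.0, 1348.0, 70.0, 5.0)
print(row)
```

The resulting list matches the vector passed to regressor.predict above, which makes it clear that the query describes a third-owner, petrol, automatic car sold by a dealer.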