### Name: Shrey Srivastava

### Batch: 3

### Group: ML043B11

# Major Project - Predicting Prices Of Used Cars

## Importing Libraries

In [1]:
# basic manipulation and formation of data frames 
import pandas as pd
import numpy as np

# visualization libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# modelling libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# model evaluation libraries
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

## Importing the data set

In [2]:
df = pd.read_excel('Data_Train.xlsx')
print('The shape of the data frame is {}. It is shown below'.format(df.shape))
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'Data_Train.xlsx'

## Cleaning the data

In [None]:
df.dtypes

Mileage, Engine and Power hold numerical values, so we will change their type to a numeric one. The units have to be removed first.

### Removing units

#### Engine

In [None]:
new_list_engine = []
for engine in df['Engine']:
    string=str(engine)
    rep=string.replace('CC','')
    new_list_engine.append(rep)
    
new_list_engine
    
df.insert(8, 'Engine (CC)', new_list_engine)
df.drop(['Engine'], axis=1, inplace=True)
    
    
df.head()

#### Power

In [None]:
new_list_power = []
for power in df['Power']:
    string=str(power)
    rep=string.replace('bhp','')
    new_list_power.append(rep)
    
new_list_power
    
df.insert(9, 'Power (bhp)', new_list_power)
df.drop('Power', axis=1, inplace=True)
    
    
df.head()

#### Mileage

To the naked eye it appears that two different units are used, 'km/kg' and 'kmpl'. We will check how many different units there actually are.

In [None]:
df_mileage1 = pd.DataFrame(df['Mileage'].str.contains(pat='km/kg'))
df_mileage1['Mileage'].value_counts()

In [None]:
df_mileage2 = pd.DataFrame(df['Mileage'].str.contains(pat='kmpl'))
df_mileage2['Mileage'].value_counts()

From the results above we can see that only two units appear: 'kmpl', which is used for 5951 observations, and 'km/kg', which is used for the remaining 66. Let us remove them too. Since the two units measure different kinds of fuel (per litre and per kilogram), we will call the mileage variable 'Mileage/unit'.

In [None]:
new_list_mileage = []

for mileage in df['Mileage']:
    string=str(mileage)
    
    if 'kmpl'in string:
        rep=string.replace('kmpl','')
        new_list_mileage.append(rep)
    
    else:
        rep=string.replace('km/kg','')
        new_list_mileage.append(rep)
                        

        
        
new_list_mileage
    
df.insert(7, 'Mileage/unit', new_list_mileage)
df.drop('Mileage', axis=1, inplace=True)
    
    
df.head()
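
The three unit-stripping loops above can also be written as one vectorized helper. The sketch below is only an illustration on made-up rows, not the notebook's own code: `strip_unit` is a hypothetical helper name, and it assumes every valid entry starts with its number, so anything else (a missing cell, or text with no number in it) becomes NaN.

```python
import pandas as pd

# Made-up rows that imitate the raw text columns (illustrative values only).
raw = pd.DataFrame({
    "Mileage": ["26.6 km/kg", "19.67 kmpl", None],
    "Engine": ["998 CC", "1582 CC", None],
    "Power": ["58.16 bhp", "null bhp", "126.2 bhp"],
})


def strip_unit(column: pd.Series) -> pd.Series:
    """Hypothetical helper: keep the leading number of each entry as a float."""
    number = column.astype(str).str.extract(r"^\s*(\d+(?:\.\d+)?)", expand=False)
    # Entries without a leading number were extracted as missing and stay missing.
    return pd.to_numeric(number, errors="coerce")


clean = raw.apply(strip_unit)
print(clean.dtypes)
print(clean)
```

Doing the numeric conversion in the same step means the separate type-changing cells would not be needed for these columns, at the cost of hiding which entries failed to parse.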

Now we will convert the required columns to numerical types

### Changing type

In [None]:
df[["Mileage/unit"]] = df[["Mileage/unit"]].astype("float")
df[["Engine (CC)"]] = df[["Engine (CC)"]].astype("float")

In [None]:
df["Power (bhp)"] = pd.to_numeric(df["Power (bhp)"], errors='coerce')

In [None]:
df.dtypes

### After changing the types, we can now look at the missing values

In [None]:
df.isnull().sum()

To impute the missing values with an appropriate central value, we first need to examine the outliers

### Examining outliers

#### Engine

In [None]:
sns.boxplot(x='Engine (CC)', data=df)

In [None]:
df[df['Engine (CC)']>=3000]

#### Power

In [None]:
sns.boxplot(x='Power (bhp)', data=df)

In [None]:
df[df['Power (bhp)']>=230]

#### Mileage

In [None]:
sns.boxplot(x='Mileage/unit', data=df)

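Here the outliers below the lower bound are isolated, so we remove them by keeping only the rows with a mileage of at least 5. This comparison is also False for missing values, so rows with a missing mileage are dropped by the same filter.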

In [None]:
df= df[df['Mileage/unit']>=5]
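
A small sketch of what a threshold filter like the one above does to missing values. The numbers are made up; the point is only that a comparison with NaN evaluates to False, so an explicit `isna()` term is needed if missing rows should survive the filter.

```python
import numpy as np
import pandas as pd

# Made-up mileage values, including one missing entry.
toy = pd.DataFrame({"Mileage/unit": [0.0, 4.2, 18.9, np.nan, 26.6]})

# Plain threshold: NaN >= 5 is False, so the missing row is dropped as well.
kept = toy[toy["Mileage/unit"] >= 5]

# Variant that removes the low values but keeps the missing row for later imputation.
kept_with_missing = toy[(toy["Mileage/unit"] >= 5) | toy["Mileage/unit"].isna()]

print(len(kept), len(kept_with_missing))
```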

In [None]:
df[df['Mileage/unit']>=29]

#### Seats

In [None]:
sns.boxplot(x='Seats',data=df)

In [None]:
df[df['Seats']>8]

#### Price

In [None]:
sns.boxplot(x='Price', data=df)

In [None]:
df[df['Price']>80]

For all the columns above, the points flagged in the boxplots are not true outliers, because car data is not normally distributed. Price, Engine and Power vary enormously with the make of the car. Domain knowledge about cars, together with a check of the prices and specifications of these flagged data points, confirms that the values are actually correct.

To find the true outliers, we will account for this disparity by creating two new categorical variables: 'Engine_Class' and 'Power_Segment'. These categorical variables take into account the segment of the car.

Now, using our domain knowledge about cars, we find that certain variables in this data are functions of other variables.

First we will try to find the outliers in 'Mileage/unit' by grouping it based on 'Fuel_Type' as mileage is a function of the type of fuel

In [None]:
sns.boxplot(x='Fuel_Type', y='Mileage/unit',data=df)

Let's inspect the values which appear to be outliers

In [None]:
check1 = df.loc[df['Fuel_Type']=='CNG']
check1 = check1[check1['Mileage/unit']<=15]
check1

In [None]:
check2 = df.loc[df['Fuel_Type']=='Petrol']
check2 = check2[check2['Mileage/unit']<=7]
check2

These values are also correct, therefore they are not outliers
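
The grouped boxplot above can be mirrored numerically. The sketch below applies the 1.5 x IQR rule, which box plots use by default for their whiskers, separately within each fuel type. The rows are made up for illustration, and the rule only lists candidates for manual inspection; it does not decide that a value is wrong.

```python
import pandas as pd

# Made-up mileages for two fuel types (illustrative values only).
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol"] * 6 + ["CNG"] * 6,
    "Mileage/unit": [17.0, 18.0, 18.5, 19.0, 20.0, 7.0,
                     25.0, 26.0, 26.5, 27.0, 28.0, 13.0],
})

grouped = toy.groupby("Fuel_Type")["Mileage/unit"]
q1 = grouped.transform(lambda s: s.quantile(0.25))  # per-group first quartile
q3 = grouped.transform(lambda s: s.quantile(0.75))  # per-group third quartile
iqr = q3 - q1

# A row is flagged when it lies more than 1.5 IQRs outside its own group's quartiles.
toy["flagged"] = (toy["Mileage/unit"] < q1 - 1.5 * iqr) | (toy["Mileage/unit"] > q3 + 1.5 * iqr)
print(toy[toy["flagged"]])
```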

Now as mentioned earlier, we will create categorical variables

### Creating Categorical Variables

In [None]:
bins_bhp = np.linspace(df['Power (bhp)'].min(), df['Power (bhp)'].max(), 5)
df['Power_Segment'] = pd.cut(df['Power (bhp)'], bins=bins_bhp, labels=['Low', 'Medium', 'High', 'Very High'], include_lowest=True)
df

In [None]:
bins_cc = np.linspace(df['Engine (CC)'].min(), df['Engine (CC)'].max(), 5)
df['Engine_Class'] = pd.cut(df['Engine (CC)'], bins=bins_cc, labels=['Small', 'Medium', 'Large', 'Mega'], include_lowest=True)
df
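
To make the binning above concrete, here is a minimal sketch of `np.linspace` plus `pd.cut` on a few made-up power values. It shows two properties worth keeping in mind: the bins are equal in width, not in number of cars, and a missing value stays missing after binning.

```python
import numpy as np
import pandas as pd

# Made-up power values, including one missing entry.
power = pd.Series([34.2, 88.5, 140.0, 300.0, 560.0, np.nan], name="Power (bhp)")

# Five equally spaced edges give four equal-width bins between min and max.
edges = np.linspace(power.min(), power.max(), 5)
segment = pd.cut(power, bins=edges,
                 labels=["Low", "Medium", "High", "Very High"],
                 include_lowest=True)

print(edges)
print(segment.value_counts(sort=False))
```

With skewed data most cars fall into the lowest bin, as in this toy example. Quantile-based bins (`pd.qcut`) would balance the counts, but that is a different design choice from the one made in this notebook.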

Here we will try finding outliers in 'Power (bhp)' based on 'Engine_Class', as the power of a car depends on its engine size

In [None]:
sns.boxplot(x='Engine_Class', y='Power (bhp)', data=df)

Let's inspect the values which appear to be outliers

In [None]:
check3 = df.loc[df['Engine_Class']=='Medium']
check3 = check3[check3['Power (bhp)']>=400]
check3

In [None]:
check4 = df.loc[df['Engine_Class']=='Large']
check4 = check4[check4['Power (bhp)']>=400]
check4

These values are also correct and not outliers

Finally, as the price of a car is a function of its engine size and power, we try finding the outliers in 'Price' based on the 'Engine_Class' and 'Power_Segment'

In [None]:
sns.boxplot(x='Engine_Class', y='Price', hue='Power_Segment', data=df)

Let's inspect the values which appear to be outliers

In [None]:
check5 = df.loc[df['Engine_Class']=='Small']
check5 = check5[check5['Power_Segment']=='Medium']
check5 = check5[check5['Price']>=40]
check5

In [None]:
check6 = df.loc[df['Engine_Class']=='Medium']
check6 = check6[check6['Power_Segment']=='Medium']
check6 = check6[check6['Price']>=140]
check6

These values are also accurate

Hence our initial assessment, that the points flagged as outliers are accurate data rather than true outliers, turns out to be correct

Now we will replace the missing values

### Replacing missing values

In [None]:
df.isnull().sum()

Since the data has a lot of outliers (according to the boxplot), it is better to replace the missing values with the median.

In [None]:
# assign the filled column back; inplace=True on a selected column is deprecated in recent pandas
df['Engine (CC)'] = df['Engine (CC)'].fillna(df['Engine (CC)'].median())
df['Power (bhp)'] = df['Power (bhp)'].fillna(df['Power (bhp)'].median())
df['Seats'] = df['Seats'].fillna(df['Seats'].median())
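
A short sketch of why the median is the safer fill value for skewed columns. The numbers are made up: one very powerful car pulls the mean far above the typical value, while the median stays put.

```python
import numpy as np
import pandas as pd

# Made-up power values with one extreme car and one missing entry.
toy = pd.DataFrame({"Power (bhp)": [58.0, 74.0, 88.0, 103.0, 550.0, np.nan]})

median_value = toy["Power (bhp)"].median()  # middle of the five known values
mean_value = toy["Power (bhp)"].mean()      # dragged upwards by the 550 bhp car

# Fill the gap with the median and assign the result back to the column.
toy["Power (bhp)"] = toy["Power (bhp)"].fillna(median_value)
print(median_value, mean_value)
```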

In [None]:
# a mileage of 0 is a placeholder for an unknown value, so it is replaced by the median
df['Mileage/unit'] = df['Mileage/unit'].replace(0, df['Mileage/unit'].median())

Let's check if there are any missing values left

In [None]:
df.isnull().sum()

Since we created two new variables, we will replace the missing values in 'Engine_Class' and 'Power_Segment' based on the range they fall in

In [None]:
bins_cc = np.linspace(df['Engine (CC)'].min(), df['Engine (CC)'].max(), 5)
df['Engine_Class'] = pd.cut(df['Engine (CC)'], bins=bins_cc, labels=['Small', 'Medium', 'Large', 'Mega'], include_lowest=True)
df[df['Engine_Class'].isnull()]

In [None]:
bins_bhp = np.linspace(df['Power (bhp)'].min(), df['Power (bhp)'].max(), 5)
df['Power_Segment'] = pd.cut(df['Power (bhp)'], bins=bins_bhp, labels=['Low', 'Medium', 'High', 'Very High'], include_lowest=True)
df[df['Power_Segment'].isnull()]

In [None]:
df.insert(9,'Engine_Class',df['Engine_Class'],allow_duplicates=True)

In [None]:
df.insert(11,'Power_Segment',df['Power_Segment'],allow_duplicates=True)

In [None]:
df = df.iloc[:,0:14]
df

In [None]:
df.isnull().sum()

As we can see, no missing values are left and the genuine outliers have been dealt with

#### Now the data is completely clean




## Exploratory Data Analysis

In [None]:
df.to_csv('clean_data.csv', index=False)

In [None]:
df.head()

Now the prices of old cars depend on how old they are. So let us create a column which tells us how old the car is.

In [None]:
df.insert(3, 'Years_Old', 2020 - df['Year'])
df.head()

In [None]:
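# numeric_only=True skips text and categorical columns, which pandas 2.0+ no longer ignores by default
df.corr(numeric_only=True)

In [None]:
df.corr(numeric_only=True)['Price']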

We can clearly see that:
- as the car gets older, the price tends to go lower
- the more kilometers driven, the lower the price, although the correlation is not that strong
- the greater the Engine capacity and Power of the car, the higher the price
- Mileage is negatively correlated with the price of cars. This may be due to the fact that luxury cars have lower mileage and greater cost
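
The observations above come from reading one column of the correlation table. A minimal sketch of that step on made-up rows: `numeric_only=True` is a real pandas argument that leaves text columns such as the name out of the calculation, and dropping the 'Price' row removes its trivial correlation with itself.

```python
import pandas as pd

# Made-up cars (illustrative values only).
toy = pd.DataFrame({
    "Name": ["car a", "car b", "car c", "car d", "car e"],
    "Years_Old": [10, 8, 5, 3, 1],
    "Power (bhp)": [70.0, 85.0, 100.0, 140.0, 180.0],
    "Price": [2.1, 3.0, 5.5, 7.2, 9.8],
})

# Pearson correlation of every numeric column with 'Price', strongest first.
corr_with_price = toy.corr(numeric_only=True)["Price"].drop("Price")
corr_with_price = corr_with_price.sort_values(key=abs, ascending=False)
print(corr_with_price)
```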

We have mentioned earlier that there is a huge disparity in the prices of the cars. This is down to the make of the car. Luxury car makers like Audi, Porsche, Range Rover etc. have very expensive cars compared to economy makes like Maruti or Tata

We will now try to create a categorical variable which is based on the make of the car

In [None]:
make = []
for name in df['Name']:
    make.append(name.split(' ')[0])

df.insert(1, 'Make', make)
df.head()

In [None]:
df['Make'].unique()

In [None]:
df['Make'].replace('Land','Land Rover',inplace=True)
df['Make'].replace('Isuzu', 'ISUZU',inplace=True)        
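
The extraction and clean-up of the make shown above can also be sketched without an explicit loop. The car names below are made up for illustration; the two replacements are the ones used in this notebook.

```python
import pandas as pd

# Made-up car names that reproduce the two special cases handled above.
names = pd.Series([
    "Maruti Wagon R LXI CNG",
    "Land Rover Range Rover 2.2L",
    "Isuzu MUX 4WD",
    "ISUZU D-MAX V-Cross",
])

# First word of the name, then repair the split two-word make and the inconsistent casing.
make = names.str.split().str[0].replace({"Land": "Land Rover", "Isuzu": "ISUZU"})
print(make.tolist())
```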

In [None]:
df['Make'].nunique()

In [None]:
plt.figure(figsize=(60,60))
sns.swarmplot(x='Make', y='Price', data=df)

In [None]:
df_make = df[['Make', 'Price']]
df_make_plot = df_make.groupby('Make').mean()
df_make_plot

In [None]:
plt.figure(figsize=(20,20))
sns.set(font_scale=0.65)
sns.boxplot(x='Make', y='Price', data=df)

The box plots above show the distribution of prices within each make, and it is clear that categorical divisions can be made on the 'Make' column. Makers like Maruti, Hyundai and Tata are the cheapest, while the same dataset contains a Lamborghini which costs in excess of 1.2 Cr. The category of the maker therefore makes a huge difference to the price, so we will include the category of make as a variable.

Let us explore this further by plotting the mean price for each type of make

In [None]:
bins = np.linspace(df_make_plot['Price'].min(), df_make_plot['Price'].max(), 15)
fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot(1,1,1)
df_make_plot.plot(kind='bar', fontsize=14, ax=ax1)
ax1.set_xlabel('Make', fontsize=20)
ax1.set_ylabel('Price', fontsize=20)
ax1.set_yticks(bins)
ax1.get_legend().remove()
ax1.axhline(y=8, color='r')
ax1.axhline(y=19, color='y')
ax1.axhline(y=35, color='g')
ax1.axhline(y=60, color='b')
ax1.set_title('Means of Prices for Different Makes', fontsize=30)

Based on the two plots above and the reputation of car brands, we can make categories for 'Make'. We will be making categories based on the divisions made in the bar chart above.

In [None]:
make_category = []

for make in df['Make']:
    
    if make in ['Ambassador', 'Chevrolet', 'Datsun', 'Fiat', 'Ford', 'Honda', 'Hyundai', 'Mahindra', 'Maruti', 'Nissan', 'Renault', 'Skoda', 'Tata', 'Volkswagen']:
        make_category.append('Economy')
    
    elif make in ['Force', 'ISUZU', 'Jeep', 'Mitsubishi', 'Toyota', 'Volvo']:
        make_category.append('Mid Segment')
        
    elif make in ['Audi', 'BMW', 'Mercedes-Benz', 'Mini']:
        make_category.append('Upper Segment')
        
    elif make in ['Bentley', 'Jaguar', 'Land Rover', 'Porsche']:
        make_category.append('Luxury')
        
    elif make=='Lamborghini':
        make_category.append('Super Car')

        
df.insert(2, 'Make_Category', make_category)
df.head()

Now since the 'Make_Category' affects the price, we will encode it and see its correlation with 'Price'.

In [None]:
df['Make_Category'].replace(to_replace=['Economy', 'Mid Segment', 'Upper Segment', 'Luxury', 'Super Car'], value=[1, 2, 3, 4, 5], inplace=True)
df.head()
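
The categorisation and encoding above can be sketched as a dictionary lookup. The segment lists are the ones used in this notebook; the names `SEGMENTS`, `MAKE_TO_CATEGORY` and `CATEGORY_CODE` are hypothetical, and 'SomeNewMake' is a made-up make that stands in for any brand missing from the lists. Unlike an if/elif chain, a lookup leaves an explicit missing value for such a make, so it can be detected instead of silently shortening the result.

```python
import pandas as pd

# Segment lists as used in this notebook.
SEGMENTS = {
    "Economy": ["Ambassador", "Chevrolet", "Datsun", "Fiat", "Ford", "Honda", "Hyundai",
                "Mahindra", "Maruti", "Nissan", "Renault", "Skoda", "Tata", "Volkswagen"],
    "Mid Segment": ["Force", "ISUZU", "Jeep", "Mitsubishi", "Toyota", "Volvo"],
    "Upper Segment": ["Audi", "BMW", "Mercedes-Benz", "Mini"],
    "Luxury": ["Bentley", "Jaguar", "Land Rover", "Porsche"],
    "Super Car": ["Lamborghini"],
}
MAKE_TO_CATEGORY = {make: name for name, makes in SEGMENTS.items() for make in makes}
CATEGORY_CODE = {name: code for code, name in enumerate(SEGMENTS, start=1)}

# 'SomeNewMake' is made up to show what happens to an unlisted make.
makes = pd.Series(["Maruti", "Toyota", "Audi", "Porsche", "Lamborghini", "SomeNewMake"])
category = makes.map(MAKE_TO_CATEGORY)
unmapped = makes[category.isna()].unique().tolist()  # makes that still need a category
codes = category.map(CATEGORY_CODE)                  # 1 (Economy) to 5 (Super Car)
print(unmapped, codes.tolist())
```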

Now let's make this a numerical variable so that its correlation with price can be seen.

In [None]:
df[["Make_Category"]] = df[["Make_Category"]].astype('int')
df.head()

In [None]:
df.corr(numeric_only=True)['Price']

We will now plot each of these variables against 'Price'

In [None]:
sns.regplot(x='Make_Category',y='Price',data=df)

In [None]:
sns.regplot(x='Year',y='Price',data=df)

In [None]:
sns.regplot(x='Years_Old',y='Price',data=df)

In [None]:
sns.regplot(x='Kilometers_Driven',y='Price',data=df)

In [None]:
sns.regplot(x='Mileage/unit',y='Price',data=df)

In [None]:
sns.regplot(x='Engine (CC)',y='Price',data=df)

In [None]:
sns.regplot(x='Power (bhp)',y='Price',data=df)

In [None]:
sns.regplot(x='Seats',y='Price',data=df)

From the regression plots above and the correlation table with 'Price', we can see that 'Year' and 'Years_Old' have correlations of equal magnitude and opposite sign, because 'Years_Old' is just 2020 minus 'Year'. 'Years_Old' is therefore redundant

Hence we will drop the 'Years_Old' column

In [None]:
df.drop('Years_Old', axis=1, inplace=True)
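
A quick numerical check of why one of the two age columns is redundant, on made-up rows. Because 'Years_Old' is a constant minus 'Year', the two correlations with 'Price' differ only in sign, so a model gains nothing from having both.

```python
import pandas as pd

# Made-up cars (illustrative values only).
toy = pd.DataFrame({
    "Year": [2010, 2012, 2015, 2017, 2019],
    "Price": [2.1, 3.0, 5.5, 7.2, 9.8],
})
toy["Years_Old"] = 2020 - toy["Year"]

corr_year = toy["Year"].corr(toy["Price"])
corr_age = toy["Years_Old"].corr(toy["Price"])
print(corr_year, corr_age)  # same magnitude, opposite sign
```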

'Kilometers_Driven' and 'Seats' have poor correlation with 'Price', therefore we will leave them out of the model too, otherwise they may cause overfitting.

### Feature Selection

Based on these inferences, we will create a new data frame which holds the relevant predictor variables together with the target 'Price' ('Name' is kept only for reference). Its model-related columns are:
- Make_Category
- Year
- Mileage/unit
- Engine (CC)
- Power (bhp)
- Price

In [None]:
df_features = df[['Name', 'Make_Category', 'Year', 'Mileage/unit', 'Engine (CC)', 'Power (bhp)', 'Price']]
df_features

## Modelling

We will train the following models over our dataset and choose the best model based on the evaluation
- Multiple Linear Regression
- Random Forest Regression
- Decision Tree Regression

For training we will perform a single train test split and use the same training and testing sets for every model.
For evaluation we use two metrics:
- RMSE on the test set
- Cross Validation Score: the mean R-squared returned by cross_val_score over 4 folds of the full data (finally expressed in percentage here)
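
For reference, both metrics can be computed by hand. The sketch below uses made-up true and predicted prices: RMSE is the square root of the mean squared error, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np

# Made-up true and predicted prices.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))         # in the same unit as the price

ss_res = np.sum(residuals ** 2)                 # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot                        # 1.0 would be a perfect fit

print(rmse, r2)
```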

### Train Test Split

In [None]:
# defining the predictor and target variables
x = df[['Make_Category', 'Year', 'Mileage/unit', 'Engine (CC)', 'Power (bhp)']]
y = df['Price']  # 1-D target; a one-column data frame makes scikit-learn warn when fitting the forest

In [None]:
# we will now split the data into training set and testing set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)
print('Shape of data:\nx_train: {}\nx_test: {}\ny_train: {}\ny_test: {}'.format(x_train.shape, x_test.shape, y_train.shape, y_test.shape))

### Multiple Linear Regression

#### Training

Feature scaling and normalization are not required here: Multiple Linear Regression fitted by ordinary least squares gives the same predictions whatever the units of the predictors, because the coefficients simply rescale to compensate.
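
The scaling argument above can be checked on synthetic data. This is a sketch under stated assumptions: the two columns are randomly generated stand-ins for engine size and power, not the notebook's data. Standardising the features changes the coefficients but leaves the predictions unchanged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for two predictors on very different scales.
rng = np.random.default_rng(101)
X = np.column_stack([rng.uniform(600, 6000, 200), rng.uniform(30, 560, 200)])
y = 0.002 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Standardise each column to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

model_raw = LinearRegression().fit(X, y)
model_scaled = LinearRegression().fit(X_scaled, y)

pred_raw = model_raw.predict(X)
pred_scaled = model_scaled.predict(X_scaled)
print(np.allclose(pred_raw, pred_scaled))    # predictions agree
print(model_raw.coef_, model_scaled.coef_)   # coefficients do not
```

This holds for plain least squares with an intercept. It would not hold for regularised or distance-based models, where scaling does change the fit.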

We will now try to see if Multiple Linear Regression is a good model for the data

In [None]:
lr = LinearRegression()
lr.fit(x_train, y_train)
yhat = lr.predict(x_test)
yhat.shape

#### Evaluation

In [None]:
score_lr = cross_val_score(lr, x, y, cv=4)

In [None]:
print('Mean cross-validated R-squared (%):{}'.format(100*score_lr.mean()))
print('RMSE:{}'.format(np.sqrt(mean_squared_error(y_test, yhat))))
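
A note on what the cross validation score measures: for a regressor, `cross_val_score` without a `scoring` argument returns R-squared per fold, and an integer `cv` splits the rows in their existing order. The sketch below, on synthetic data, shows a variant with shuffled folds and an RMSE score per fold. The data and the choice of a shuffled `KFold` are assumptions made for illustration, not what the cells above do.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustration only).
rng = np.random.default_rng(101)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Shuffled folds guard against any ordering present in the rows.
folds = KFold(n_splits=4, shuffle=True, random_state=101)

r2_scores = cross_val_score(LinearRegression(), X, y, cv=folds, scoring="r2")
# scikit-learn reports errors as negative scores, so flip the sign to get RMSE.
rmse_scores = -cross_val_score(LinearRegression(), X, y, cv=folds,
                               scoring="neg_root_mean_squared_error")

print(100 * r2_scores.mean(), rmse_scores.mean())
```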

In [None]:
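# seaborn 0.12+ requires x and y as keywords; np.ravel flattens both to 1-D
sns.residplot(x=np.ravel(yhat), y=np.ravel(y_test))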

We can see that the residual plot for the Linear Regression model shows a curvature. This implies that Linear Regression is not a good fit for this data. Therefore we will try another model

### Random Forest Regression

#### Training

In [None]:
rfr = RandomForestRegressor(n_estimators=200)

In [None]:
rfr.fit(x_train,y_train)

In [None]:
rfr_pred = rfr.predict(x_test)

In [None]:
plt.figure(figsize = (15,10))
plt.scatter(y_test,rfr_pred)

#### Evaluation

In [None]:
score_rfr = cross_val_score(rfr, x, y, cv=4)

In [None]:
print('Mean cross-validated R-squared (%):{}'.format(100*score_rfr.mean()))
print('RMSE:{}'.format(np.sqrt(mean_squared_error(y_test, rfr_pred))))

### Decision Tree Regression

#### Training

In [None]:
dtree = DecisionTreeRegressor()

In [None]:
dtree.fit(x_train,y_train)

In [None]:
pred = dtree.predict(x_test)

In [None]:
plt.figure(figsize = (15,10))
plt.scatter(y_test,pred)

#### Evaluation

In [None]:
score_dtree = cross_val_score(dtree, x, y, cv=4)

In [None]:
print('Mean cross-validated R-squared (%):{}'.format(100*score_dtree.mean()))
print('RMSE:{}'.format(np.sqrt(mean_squared_error(y_test, pred))))

### Choice of model

After evaluating the 3 models, we see that Random Forest Regression has the least RMSE and the highest Cross Validation Score.
Therefore Random Forest Regression is the best model for the data.

We will now predict the values of price for the test data given using the same model

## Predicting the Prices

### Importing the dataset

In [None]:
test_df = pd.read_excel('Data_Test.xlsx')
test_df.head()

In [None]:
test_df.info()

We will first change the types of a few columns and wrangle the data to the same form as our final training data.

### Removing units

#### Engine

In [None]:
new_list_engine = []
for engine in test_df['Engine']:
    string=str(engine)
    rep=string.replace('CC','')
    new_list_engine.append(rep)
    
new_list_engine
    
test_df.insert(8, 'Engine (CC)', new_list_engine)
test_df.drop(['Engine'], axis=1, inplace=True)
    
    
test_df.head()

#### Power

In [None]:
new_list_power = []
for power in test_df['Power']:
    string=str(power)
    rep=string.replace('bhp','')
    new_list_power.append(rep)
    
new_list_power
    
test_df.insert(9, 'Power (bhp)', new_list_power)
test_df.drop('Power', axis=1, inplace=True)
    
    
test_df.head()

#### Mileage

In [None]:
new_list_mileage = []

for mileage in test_df['Mileage']:
    string=str(mileage)
    
    if 'kmpl'in string:
        rep=string.replace('kmpl','')
        new_list_mileage.append(rep)
    
    else:
        rep=string.replace('km/kg','')
        new_list_mileage.append(rep)
                        

        
        
new_list_mileage
    
test_df.insert(7, 'Mileage/unit', new_list_mileage)
test_df.drop('Mileage', axis=1, inplace=True)
    
    
test_df.head()

### Changing type

In [None]:
test_df[["Mileage/unit"]] = test_df[["Mileage/unit"]].astype("float")
test_df[["Engine (CC)"]] = test_df[["Engine (CC)"]].astype("float")
test_df["Power (bhp)"] = pd.to_numeric(test_df["Power (bhp)"], errors='coerce')
test_df.dtypes

### Dealing with missing values

In [None]:
test_df.isnull().sum()

Here we drop the rows with missing values, since this is the test data we were given and replacing values in it is not up to our discretion.
We still have to remove the genuine outliers, though

In [None]:
test_df.dropna(inplace=True)
test_df.info()

### Removing outliers

In [None]:
sns.boxplot(x='Mileage/unit',data=test_df)

In [None]:
test_df= test_df[test_df['Mileage/unit']>=5]

In [None]:
sns.boxplot(x='Engine (CC)',data=test_df)

In [None]:
sns.boxplot(x='Power (bhp)',data=test_df)

In [None]:
sns.boxplot(x='Seats',data=test_df)

In [None]:
test_df[test_df['Seats']>9]

These values are correct and not outliers

### Creating the same columns as chosen in feature selection

In [None]:
make = []
for name in test_df['Name']:
    make.append(name.split(' ')[0])

test_df.insert(1, 'Make', make)

In [None]:
test_df.head()

In [None]:
test_df['Make'].unique()

In [None]:
test_df['Make'].replace('Land','Land Rover',inplace=True)
test_df['Make'].replace('Isuzu', 'ISUZU',inplace=True)  

In [None]:
test_df['Make'].nunique()

Now since the makers in this dataset are different, we will see which car makers are common and which are exclusive to the new dataset.

In [None]:
exclusive = []
common = []

for make in test_df['Make'].unique():
    
    if make in df['Make'].unique():
        common.append(make)
    
    else:
        exclusive.append(make)
        

print('The common makes are: ',common,'\n')
print('The exclusive makes are: ',exclusive)
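
The comparison above is a set intersection and a set difference, which can be sketched directly with Python sets. The two sets below are small made-up stand-ins; in the notebook they would be built from the 'Make' columns of the training and test data frames.

```python
# Made-up stand-ins for the makes found in each dataset.
train_makes = {"Maruti", "Hyundai", "Audi", "Toyota"}
test_makes = {"Maruti", "Audi", "OpelCorsa"}

common = sorted(test_makes & train_makes)     # makes present in both datasets
exclusive = sorted(test_makes - train_makes)  # makes that appear only in the test data

print("The common makes are:", common)
print("The exclusive makes are:", exclusive)
```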

We see that 'OpelCorsa' is a make which is not there in the training dataset. Therefore, we must classify 'OpelCorsa' in a make category, as 'Make_Category' is a variable used in our model.

For this we will use our domain knowledge about cars as well as perform analysis with other variables to see which category it lies in

In [None]:
sns.regplot(x='Make_Category', y='Engine (CC)', data=df)

In [None]:
sns.regplot(x='Make_Category', y='Power (bhp)', data=df)

From these plots we can see that there is a positive and strong correlation between the make category and the Engine Size and the Power. Therefore, to find out the make_category of 'OpelCorsa' we should compare its Engine size and Power to the other categories

In [None]:
test_df[test_df['Make']=='OpelCorsa']

There is only one car with the make 'OpelCorsa'. We will compare its engine size and power to other makes and see which category it is closest to.

In [None]:
df_make_comp = df[['Make_Category', 'Engine (CC)', 'Power (bhp)']]
df_make_comp = df_make_comp.groupby('Make_Category').mean()
df_make_comp

Now through bar plots we will compare the average values of Engine size and Power of different Make categories to the Engine size and Power of OpelCorsa

In [None]:
fig_test = plt.figure(figsize=(15,15))
axa = fig_test.add_subplot(211)
axb = fig_test.add_subplot(212)
df_make_comp['Engine (CC)'].plot(kind='bar', fontsize=25, ax=axa)
df_make_comp['Power (bhp)'].plot(kind='bar', fontsize=25, ax=axb)
axa.axhline(y=1389, color='r')
axb.axhline(y=88, color='r')
axa.set_title('Comparison of Engine Capacity of OpelCorsa to the mean of other Make Categories', fontsize=25)
axb.set_title('Comparison of Power of OpelCorsa to the mean of other Make Categories', fontsize=25)
axa.annotate('OpelCorsa', xy=(-0.5,1500), xycoords='data', fontsize=20)
axb.annotate('OpelCorsa', xy=(-0.5,100), xycoords='data', fontsize=20)
axa.set_ylabel('Mean Engine(CC)', fontsize=25)
axb.set_ylabel('Mean Power(bhp)', fontsize=25)
axb.set_xlabel('Make_Category', fontsize=25)

We can clearly see that the values of engine size and power are close to the means of make category 1, which corresponds to Economy. Also, after researching the prices of OpelCorsa cars, we can confirm that it lies in the Economy class.
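
The visual comparison above can be backed by a simple distance calculation. In this sketch the per-category means are invented placeholders, not the values computed in this notebook; only the OpelCorsa figures (1389 CC and 88 bhp) are taken from the reference lines drawn in the plot. Each column is divided by its spread so that engine size, which has the larger numbers, does not dominate the distance.

```python
import pandas as pd

# Invented per-category means (placeholders, not the notebook's actual output).
category_means = pd.DataFrame(
    {"Engine (CC)": [1350.0, 2300.0, 2150.0, 2900.0, 5200.0],
     "Power (bhp)": [90.0, 150.0, 190.0, 260.0, 560.0]},
    index=pd.Index([1, 2, 3, 4, 5], name="Make_Category"),
)

# Engine size and power of the single OpelCorsa car, as drawn in the plot above.
car = pd.Series({"Engine (CC)": 1389.0, "Power (bhp)": 88.0})

# Scaled Euclidean distance from the car to each category mean.
scale = category_means.std()
distance = (((category_means - car) / scale) ** 2).sum(axis=1) ** 0.5
closest = distance.idxmin()
print(distance.round(3))
print("Closest category:", closest)
```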

In [None]:
make_category_test = []

for make in test_df['Make']:
    
    if make in ['Ambassador', 'Chevrolet', 'Datsun', 'Fiat', 'Ford', 'Honda', 'Hyundai', 'Mahindra', 'Maruti', 'Nissan', 'Renault', 'Skoda', 'Tata', 'Volkswagen', 'OpelCorsa']:
        make_category_test.append('Economy')
    
    elif make in ['Force', 'ISUZU', 'Jeep', 'Mitsubishi', 'Toyota', 'Volvo']:
        make_category_test.append('Mid Segment')
        
    elif make in ['Audi', 'BMW', 'Mercedes-Benz', 'Mini']:
        make_category_test.append('Upper Segment')
        
    elif make in ['Bentley', 'Jaguar', 'Land Rover', 'Porsche']:
        make_category_test.append('Luxury')
        
    elif make=='Lamborghini':
        make_category_test.append('Super Car')

        
test_df.insert(2, 'Make_Category', make_category_test)

In [None]:
test_df.head()

In [None]:
test_df['Make_Category'].replace(to_replace=['Economy', 'Mid Segment', 'Upper Segment', 'Luxury', 'Super Car'], value=[1, 2, 3, 4, 5], inplace=True)
test_df[["Make_Category"]] = test_df[["Make_Category"]].astype('int')
test_df.head()

Now our test dataset is ready for the model. We can now predict the prices for these cars.

### Predicting

We will now apply the trained model to the given test data

In [None]:
X = test_df[['Make_Category', 'Year', 'Mileage/unit', 'Engine (CC)', 'Power (bhp)']]

In [None]:
predicted_price = rfr.predict(X)

In [None]:
test_df['Predicted Price'] = predicted_price

In [None]:
test_df['Predicted Price'] = test_df['Predicted Price'].round(2)

In [None]:
test_df.drop(['Make', 'Make_Category'], axis=1, inplace=True)

## Final Dataset With the Predicted Prices

We were given this dataset to predict the prices for. The original dataset along with the predicted prices is given below.

In [None]:
test_df

### This concludes the internship's major project. I would like to thank the instructors and support team from Verzeo for their guidance. The concepts taught by them and their efforts solidified the interest for Data Science in me.