<a href="https://colab.research.google.com/github/avigangarde/OIBSIP/blob/main/Task_1_CAR_PRICE_PREDICTION_WITH_MACHINE_LEARNING_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - CAR PRICE PREDICTION WITH MACHINE LEARNING



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member 1 -** Gangarde Avinash B


# **Project Summary -**

The goal of this project is to predict car prices and identify the specifications or features that impact the car price. We have been provided with a labeled dataset containing information about various car attributes and their corresponding prices.

To achieve this objective, we will begin with an exploratory data analysis (EDA) to understand the dataset's structure and the relationships between the features and the target variable (car price). We will then select relevant features, such as engine size, curb weight, horsepower, and body type, based on statistical analysis and domain knowledge.

Next, we will train regression models suited to this task: a decision tree, a random forest, and a gradient boosting regressor. Each model will be fitted on the labeled training data so that it learns the relationship between the chosen features and car prices.

We will evaluate each trained model on a held-out test set using mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and R². We will also perform a feature importance analysis to determine which specifications have the greatest impact on car prices.

The project's outcome will be a prediction model capable of estimating car prices accurately. Moreover, we will identify the key specifications or features that strongly influence the car price, enabling stakeholders to make informed decisions regarding pricing, marketing, and product development.

By leveraging this predictive model and understanding the impact of different specifications on car prices, businesses can optimize their pricing strategy, improve competitiveness, and better meet customer preferences in the automotive market.

# **GitHub Link -**

# **Problem Statement**


In this scenario, you have been provided with a labeled dataset, which means that each data point in the dataset includes both input features (specifications) and the corresponding output variable (car price). The objective is to develop a predictive model that can accurately estimate car prices based on the given specifications/features. Additionally, the goal is to identify which specific specifications or features have a significant impact on the car price.

#### **Define Your Business Objective?**

Business Objective: Optimizing Car Pricing Strategy

The primary business objective for the car price prediction project could be to optimize the car pricing strategy of a company. By accurately predicting car prices, the company can make data-driven decisions to set competitive and profitable prices for their vehicles. This objective can be further broken down into specific goals:

1. **Accurate Pricing:** Develop a robust and reliable car price prediction model that can accurately estimate the market value of vehicles based on various factors such as make, model, year, mileage, condition, and additional features.
2. **Competitive Pricing:** Use the car price predictions to set competitive prices for different car models relative to competitors. The aim is to find the point where prices are attractive to customers while remaining profitable for the company.
3. **Market Insights:** Gain insight into customer preferences and purchasing patterns by analyzing the factors that significantly influence car prices. Knowing which features have the highest impact on price helps the company optimize its product offerings.
4. **Inventory Management:** Improve inventory management by predicting demand and adjusting prices accordingly. By analyzing historical sales data and market trends, the company can make informed decisions about stocking and pricing specific models to minimize inventory holding costs and maximize sales.
5. **Customer Segmentation:** Use the car price prediction model to segment customers by price sensitivity and preferences. This segmentation can guide marketing and promotional strategies, with personalized deals and incentives for different customer groups.
6. **Pricing Optimization:** Continuously refine the car price prediction model using real-time market data, customer feedback, and sales performance, and adapt pricing to changing market dynamics, customer behavior, and economic factors.

Overall, the business objective is to enhance the company's pricing strategy by leveraging data-driven car price predictions, thereby increasing competitiveness, improving profitability, and better meeting the needs of customers in the automotive market.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import the libraries used throughout the notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import missingno as msno

import pylab
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import zscore
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import log_loss
import math
from sklearn.ensemble import RandomForestRegressor
!pip install scikit-optimize
from skopt.space import Real, Categorical, Integer
from skopt import BayesSearchCV
from sklearn import ensemble


from sklearn.metrics import confusion_matrix,classification_report
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

### Dataset First View

In [None]:
# Dataset First Look
df=pd.read_csv("/content/drive/MyDrive/Oasis Infobyte dataset/CAR PRICE PREDICTION WITH MACHINE LEARNING.csv")

In [None]:
# take a first look at dataset
pd.set_option('display.max_columns', None)
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
msno.bar(df)

### What did you know about your dataset?

1. The given data set has 206 rows and 26 columns.
2. The dataset contains zero null values and zero duplicate rows.
3. Ten columns hold string values (`object` dtype) and the remaining 16 columns are numeric.
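
The three observations above can be checked programmatically rather than read off the cell outputs. The following is a minimal sketch on a made-up four-column frame; `toy_info` and its values are assumptions for illustration, not rows from the car dataset.

```python
import pandas as pd

# Toy frame standing in for the car dataset (values are made up).
toy_info = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "carbody": ["sedan", "wagon", "sedan"],
    "enginesize": [130, 152, 109],
    "price": [13495.0, 16500.0, 13950.0],
})

n_rows, n_cols = toy_info.shape
n_missing = int(toy_info.isnull().sum().sum())      # total missing cells
n_duplicates = int(toy_info.duplicated().sum())     # fully duplicated rows
n_numeric = toy_info.select_dtypes(include="number").shape[1]
n_non_numeric = toy_info.select_dtypes(exclude="number").shape[1]

print(f"{n_rows} rows, {n_cols} columns")
print(f"{n_missing} missing values, {n_duplicates} duplicate rows")
print(f"{n_non_numeric} string columns, {n_numeric} numeric columns")
```

Running the same five expressions on `df` should reproduce the counts listed above.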

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description 

The terms mentioned correspond to various attributes or variables related to car information. Here's a description of each term:

**Car_ID**: It refers to a unique identifier assigned to each car in the dataset. This ID distinguishes one car from another.

**Symboling**: Symboling represents the insurance risk rating of a car, that is, how risky the car is to insure relative to its price. The scale runs from -3 to +3, where +3 marks a car as risky and -3 marks it as relatively safe.

**CarName**: CarName represents the name or brand of the car. It specifies the manufacturer or the model of the vehicle.

**Fueltype**: Fueltype describes the type of fuel the car uses. It can be either "gas" (petrol) or "diesel."

**Aspiration**: Aspiration refers to the method used to induce air into the car's engine. It can be either "std" (naturally aspirated) or "turbo" (turbocharged).

**Doornumber**: Doornumber indicates the number of doors the car has. It can be "two" or "four."

**Carbody**: Carbody represents the body style or design of the car, such as sedan, hatchback, convertible, wagon, etc.

**Drivewheel**: Drivewheel specifies the type of wheel that provides power and moves the car. It can be front-wheel drive (FWD), rear-wheel drive (RWD), or four-wheel drive (4WD).

**Enginelocation**: Enginelocation describes the placement of the car's engine. It can be either "front" or "rear."

**Wheelbase**: Wheelbase refers to the distance between the centers of the front and rear wheels. It is an important parameter that influences the stability and handling of the vehicle.

**Carwidth**: Carwidth represents the width of the car body. The dataset does not state a unit, so check the value range with `df.describe()` before assuming millimeters.

**Carheight**: Carheight denotes the height of the car, measured vertically from the ground to the highest point of the vehicle. As with Carwidth, the unit is not stated in the dataset.

**Curbweight**: Curbweight indicates the weight of the car without any occupants or cargo. It includes the weight of all standard equipment, fluids, and fuel.

**Enginetype**: Enginetype specifies the configuration or type of the car's engine, such as "ohc" (overhead camshaft), "ohcf" (overhead camshaft and cam follower), "dohc" (dual overhead camshaft), and so on.

**Cylindernumber**: Cylindernumber refers to the number of cylinders in the car's engine. In this dataset it is stored as a word (for example "four", "six", or "eight"), so it is one of the string columns that must be encoded before modeling.

**Enginesize**: Enginesize indicates the displacement volume of the car's engine. It represents the total capacity of all cylinders and is typically measured in cubic centimeters (cc) or liters (L).

**Fuelsystem**: Fuelsystem represents the type of fuel delivery system used in the car, such as "mpfi" (multi-point fuel injection), "2bbl" (two-barrel carburetor), "4bbl" (four-barrel carburetor), and so on.

**Boreratio**: Boreratio refers to the ratio between the diameter of the engine cylinder and the piston stroke length. It influences the engine's performance and efficiency.

**Stroke**: Stroke represents the length of the piston stroke in the engine, that is, the distance traveled by the piston.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

In [None]:
# Numeric variables to check for outliers
variables = ['symboling', 'wheelbase',
             'carlength', 'carwidth', 'carheight', 'curbweight',
             'enginesize', 'boreratio', 'stroke',
             'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg',
             'price']
# Draw one box plot per variable (each variable gets its own figure)
for variable in variables:
    plt.figure(figsize=(8, 6))
    sns.boxplot(x=df[variable])
    plt.title(f'Box Plot - {variable}')
    plt.xlabel(variable)
    plt.show()

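The box plots flag points beyond the whiskers, which by default sit 1.5 times the interquartile range (IQR) away from the quartiles. The same rule can be applied numerically to list the flagged values. This is a minimal sketch on made-up prices; `iqr_outliers` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up prices: five ordinary values and one extreme one.
demo_prices = pd.Series([7000, 8000, 9000, 10000, 11000, 45000], name="price")
flagged = iqr_outliers(demo_prices)
print(flagged)
```

Calling `iqr_outliers(df['price'])` would list the rows behind the points drawn above the upper whisker in the price box plot.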
## ***3. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Visualize the distribution of car prices
plt.figure(figsize=(10, 6))
sns.histplot(df['price'])
plt.title('Distribution of Car Prices')
plt.xlabel('Price')
plt.ylabel('Count')
plt.show()

##### 2. What is/are the insight(s) found from the chart?

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Visualize the relationship between car prices and engine size
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='enginesize', y='price')
plt.title('Car Price vs. Engine Size')
plt.xlabel('Engine Size')
plt.ylabel('Price')
plt.show()

##### 2. What is/are the insight(s) found from the chart?

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Visualize the average car price by car body type
plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='carbody', y='price')
plt.title('Average Car Price by Body Type')
plt.xlabel('Body Type')
plt.ylabel('Average Price')
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Compare the average car prices based on the fuel type and aspiration.
plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='fueltype', y='price', hue='aspiration')
plt.title('Average Car Price by Fuel Type and Aspiration')
plt.xlabel('Fuel Type')
plt.ylabel('Average Price')
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 5

In [None]:
# Explore the correlation between the numeric variables, including price.
plt.figure(figsize=(15, 10))
# numeric_only=True skips the string columns, which would otherwise raise an error in pandas 2.x
corr_matrix = df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of All Numeric Variables')
plt.show()

In [None]:
# Chart - 5 visualization code
# Explore the correlation between numeric variables related to car prices.
plt.figure(figsize=(10, 8))
corr_matrix = df[['wheelbase', 'carwidth', 'carheight', 'curbweight', 'enginesize', 'price']].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Car Price Variables')
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
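df_subset = df[variables]

# Create the pair plot and title the whole figure
# (plt.title would only label the last subplot of the grid)
pair_grid = sns.pairplot(df_subset)
pair_grid.figure.suptitle('Pairwise Relationships of All Variables', y=1.02)
plt.show()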

##### 2. What is/are the insight(s) found from the chart?

Answer Here


## 4. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Make the dataset analysis-ready.
# The string columns are converted to numbers using integer encoding.
# First, inspect the values in each string column with value_counts().

In [None]:
# value counts for fueltype
print(df.fueltype.value_counts())
# value counts for aspiration
print(df.aspiration.value_counts())
# value counts for doornumber
print(df.doornumber.value_counts())
# value counts for carbody
print(df.carbody.value_counts())
# value counts for drivewheel
print(df.drivewheel.value_counts())
# value counts for enginelocation
print(df.enginelocation.value_counts())
# value counts for enginetype
print(df.enginetype.value_counts())
# value counts for fuelsystem
print(df.fuelsystem.value_counts())

In [None]:
# value counts for cylindernumber
print(df.cylindernumber.value_counts())

In [None]:
from sklearn.preprocessing import LabelEncoder

# String columns to convert to integer codes
categorical_columns = ['fueltype', 'aspiration', 'carbody', 'drivewheel',
                       'enginelocation', 'doornumber', 'enginetype',
                       'cylindernumber', 'fuelsystem']

# Fit a fresh encoder per column; codes follow the sorted order of the labels
for column in categorical_columns:
    df[column] = LabelEncoder().fit_transform(df[column])

In [None]:
# take a look at dataset
df.head()

In [None]:
# check the datatype
df.info()

In [None]:
# convert the CarName values from object to string
df['CarName'] = df['CarName'].astype(str)

The data wrangling is now complete, and the dataset is ready for model implementation.
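
Integer encoding assigns codes in alphabetical order, which imposes an ordering on categories such as body type that have none. Tree-based models usually tolerate this, but linear models can be misled by it. One-hot encoding is a common alternative; the sketch below shows it on a made-up frame (`toy_cars` is an assumption for illustration, not the notebook's `df`).

```python
import pandas as pd

# Made-up rows with two nominal columns and one numeric column.
toy_cars = pd.DataFrame({
    "carbody": ["sedan", "hatchback", "wagon", "sedan"],
    "fueltype": ["gas", "gas", "diesel", "gas"],
    "enginesize": [130, 97, 152, 109],
})

# One indicator column per category; numeric columns pass through unchanged.
one_hot = pd.get_dummies(toy_cars, columns=["carbody", "fueltype"], dtype=int)
print(one_hot.columns.tolist())
print(one_hot.head())
```

The trade-off is width: each category becomes its own column, so high-cardinality columns can inflate the feature matrix considerably.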

In [None]:
# Separate the independent and dependent variables
# X (independent variables): every column except the identifier, the car name, and the target
X = df.drop(['car_ID','CarName', 'price'], axis=1)

# Create y (dependent variable) by selecting only the target variable column
y = df['price']


In [None]:
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt

# Assuming X is your feature matrix and y is the target variable

# Create a Random Forest regressor (fixed seed so the importance ranking is reproducible)
rf = RandomForestRegressor(random_state=42)

# Fit the model to your data
rf.fit(X, y)

# Get feature importances
importances = rf.feature_importances_

# Get feature names
feature_names = X.columns

# Sort feature importances in descending order
indices = np.argsort(importances)[::-1]

# Plot feature importances
plt.figure(figsize=(10, 6))
plt.bar(range(len(feature_names)), importances[indices])
plt.xticks(range(len(feature_names)), feature_names[indices], rotation='vertical')
plt.title('Feature Importances')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.show()

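Impurity-based importances such as `feature_importances_` are computed on the training data and can favor features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data in which only the first feature drives the target; all names in it (`X_perm`, `forest_demo`, and so on) are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on feature 0 only; features 1 and 2 are noise.
rng = np.random.default_rng(0)
X_perm = rng.normal(size=(300, 3))
y_perm = 5.0 * X_perm[:, 0] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_perm, y_perm, test_size=0.25, random_state=0)
forest_demo = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in turn on the held-out split and measure the score drop.
perm_result = permutation_importance(forest_demo, X_te, y_te, n_repeats=10, random_state=0)
perm_ranking = np.argsort(perm_result.importances_mean)[::-1]
for idx in perm_ranking:
    print(f"feature {idx}: {perm_result.importances_mean[idx]:.3f}")
```

Applying this to the car data would require fitting on a training split first, since the cell above fits `rf` on all rows.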

# **5. Model Implementation**

In [None]:
from sklearn.model_selection import train_test_split
# split the data into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Rescale all feature values to the 0-1 range with MinMaxScaler (fitted on the training data only)
from sklearn.preprocessing import MinMaxScaler, StandardScaler
scale = MinMaxScaler()
X_train=scale.fit_transform(X_train)
X_test= scale.transform(X_test)

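Min-max scaling maps each column with `(x - min) / (max - min)`, where `min` and `max` come from the training split only. The sketch below checks the formula against `MinMaxScaler` on made-up numbers (`train_demo` and `test_demo` are assumptions for illustration) and shows that test values outside the training range land outside 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and test rows with two features.
train_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
test_demo = np.array([[4.0, 200.0]])

# Fit on the training rows only, then transform the test row.
scaler_demo = MinMaxScaler().fit(train_demo)
test_scaled = scaler_demo.transform(test_demo)

# The same transformation written out by hand.
col_min = train_demo.min(axis=0)
col_max = train_demo.max(axis=0)
test_manual = (test_demo - col_min) / (col_max - col_min)

print("scaler:", test_scaled)
print("manual:", test_manual)
```

Tree-based models such as the ones used below are insensitive to this kind of monotonic rescaling, so the step mainly matters if a linear or distance-based model is added later.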
In [None]:
# lets check the shape of dataset
X_train.shape

In [None]:
# check the shape of the testing dataset
X_test.shape

In [None]:
# Helper that prints the regression evaluation metrics
def print_evaluate(true, predicted):
    mae = metrics.mean_absolute_error(true, predicted)
    mse = metrics.mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)  # reuse the MSE instead of recomputing it
    r2_square = metrics.r2_score(true, predicted)
    print('MAE:', mae)
    print('MSE:', mse)
    print('RMSE:', rmse)
    print('R2 Square:', r2_square)
    print('__________________________________')

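As a sanity check on `print_evaluate`, the four metrics can be reproduced from their definitions with plain NumPy. This is a minimal sketch on made-up numbers; `regression_metrics` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))            # mean absolute error
    mse = np.mean(residuals ** 2)               # mean squared error
    rmse = np.sqrt(mse)                         # same units as the target
    ss_res = np.sum(residuals ** 2)             # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

metrics_demo = regression_metrics([3, 5, 7], [2, 5, 9])
print(metrics_demo)
```

RMSE and MAE are in the same units as the price, which makes them easier to interpret than MSE when comparing the models below.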
# **1. Decision Tree Regressor**

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Create an instance of the DecisionTreeRegressor (fixed seed so the fitted tree is reproducible)
tree_model = DecisionTreeRegressor(random_state=42)

# Fit the model to the training data
tree_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = tree_model.predict(X_test)

In [None]:
print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, y_pred)

# **2. Random Forest Regressor**

In [None]:
# import packages
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# model implementation
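# 'auto' was removed from max_features in scikit-learn 1.3; for regressors it meant all features, i.e. 1.0
param_grid = {'bootstrap': [True], 'max_depth': [5, 10, None], 'max_features': [1.0, 'log2'], 'n_estimators': [25]}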
rfr = RandomForestRegressor(random_state = 1)

random_forest_model= GridSearchCV(estimator = rfr, param_grid = param_grid, cv = 3, n_jobs = 1, verbose = 0, return_train_score=True)


In [None]:
# Fit the grid search on the training data, then predict on both splits
random_forest_model.fit(X_train, y_train)
train_pred = random_forest_model.predict(X_train)
test_pred_rf = random_forest_model.predict(X_test)

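After a grid search has been fitted, the winning hyperparameters and their cross-validated score are available as attributes. The sketch below runs a small search on synthetic data to show where to look; `X_grid`, `y_grid`, and `search_demo` are assumptions for illustration, and the grid is a reduced version of the one above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the scaled car features.
X_grid, y_grid = make_regression(n_samples=120, n_features=6, noise=5.0, random_state=0)

search_demo = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=25, random_state=1),
    param_grid={"max_depth": [5, 10, None], "max_features": [1.0, "log2"]},
    cv=3,
    scoring="r2",
)
search_demo.fit(X_grid, y_grid)

# Best combination, its mean cross-validated score, and the refitted model.
print("Best parameters:", search_demo.best_params_)
print("Best mean CV R2:", round(search_demo.best_score_, 3))
best_forest = search_demo.best_estimator_
```

In the notebook, the same attributes can be read from `random_forest_model` once it has been fitted.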
In [None]:
# test model on test data set

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, test_pred_rf)

# **3. Gradient Boosting Regressor**

In [None]:
# Model Training
from sklearn.ensemble import GradientBoostingRegressor

gbc_reg = GradientBoostingRegressor(random_state =42)
gbc_reg.fit(X_train,y_train)

In [None]:
# Model Prediction
train_pred = gbc_reg.predict(X_train)
y_pred_gb = gbc_reg.predict(X_test)

In [None]:
# test model on test data set

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, y_pred_gb)

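With roughly 200 rows, a single 80/20 split leaves only about 40 test cars, so the scores above can shift noticeably with the random seed. K-fold cross-validation gives a steadier comparison of the three models. The sketch below uses synthetic data of a similar shape; `X_cv`, `y_cv`, and the model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the car data (assumption: about 200 rows, 23 features).
X_cv, y_cv = make_regression(n_samples=200, n_features=23, n_informative=8,
                             noise=10.0, random_state=42)

candidates = {
    "decision_tree": DecisionTreeRegressor(random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=25, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

# The same five shuffled folds are used for every model.
folds = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_cv, y_cv, cv=folds, scoring="r2")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Passing `X` and `y` from the notebook instead of the synthetic arrays would compare the models on the actual car data.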
## **6. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.