Car Price Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Team
##### **Team Member 1 - 2210992549**
##### **Team Member 2 - 2210992552**
##### **Team Member 3 - 2210990092**
##### **Team Member 4 - 2210992432**

# **Project Summary -**

The "Car Price Prediction" project builds and compares machine learning regression models that predict the sale price of an individual car from its features and specifications.

Key points:

* The dataset contains 205 cars with attributes such as car ID, symboling, car name, fuel type, and physical dimensions.

* There are no missing or duplicate values, so no cleaning was needed for either.

* Car length, width, and curb weight are strongly correlated with car price.

* Certain car manufacturers are more prevalent in the dataset than others, which may reflect market trends or brand preferences.

* Histograms, box plots, and scatter plots were used to explore relationships between variables.

* Pair plots gave an overview of how the main numerical variables interact.

* Groupby operations summarized price and performance figures across categories such as car body style and fuel type.

* The analysis is intended to inform decisions on pricing, marketing, and product development.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The primary challenge addressed by this project is to develop a predictive model that can accurately estimate the market value of a car based on its features and specifications. This involves understanding and quantifying the relationship between a car's sale price and its attributes using historical data. The model must be capable of handling a wide variety of cars, from economical to luxury models, and account for the non-linear and complex interactions between features that influence a car's price.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.



# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd  # For data manipulation and analysis
import matplotlib.pyplot as plt  # For creating visualizations
import seaborn as sns  # For statistical data visualization

### Dataset Loading

In [None]:
# Load the dataset
car_data = pd.read_csv("/content/car_data_300.csv")
car_data

### Dataset First View

In [None]:
# Dataset First Look
print(car_data.head())  # Display first few rows of the DataFrame

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
num_rows, num_cols = car_data.shape  # Get the number of rows and columns in the dataset
print("Number of rows :", num_rows)  # Print the count of rows
print("Number of columns:", num_cols)  # Print the count of columns

### Dataset Information

In [None]:
# Dataset Info
car_data.info()  # Shows a summary of the dataset

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_value_count = car_data.duplicated().sum()  # Count duplicate values in the dataset
print("Duplicate ->", duplicate_value_count)  # Print the count of duplicate values

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values_count = car_data.isnull().sum()  # Count missing values in each column
print("Missing values count per column:")
print(missing_values_count)  # Print the count of missing values


In [None]:
# Visualizing the missing values
# Percentage of missing values, restricted to columns that actually contain nulls
missing_pct = car_data.isnull().mean() * 100
missing_pct = missing_pct[missing_pct > 0]

if missing_pct.empty:
    # With no nulls there is nothing to draw, so report that instead of plotting an empty chart
    print("No missing values found; nothing to plot.")
else:
    plt.figure(figsize=(14, 7))  # Set the size of the plot
    sns.barplot(x=missing_pct.index, y=missing_pct.values)  # One bar per column with nulls
    plt.xticks(rotation=90)  # Rotate labels so long column names stay readable
    plt.xlabel('Columns with Null Values')
    plt.ylabel('Percentage of Missing Values')
    plt.title('Percentage of Missing Values per Column')
    plt.show()

### What did you know about your dataset?

Size: The dataset contains 205 rows and 26 columns, indicating that there are data for 205 car models and 26 different attributes or features recorded for each car.

Columns: The columns include various attributes such as car ID, symboling, car name, fuel type, aspiration, number of doors, body type, drivetrain, engine location, dimensions (wheelbase, car length, car width, car height), curb weight, engine type, number of cylinders, engine size, fuel system, bore ratio, stroke, compression ratio, horsepower, peak rpm, city mpg, highway mpg, and price.

Data Types: The dataset contains a mix of data types including integers, floats, and objects (likely strings).

Missing Values: There are no missing values; every column's non-null count equals the total number of rows.

Duplicates: The dataset has zero duplicate rows, meaning each row represents a unique car model.

Unique Values: Further exploration of the dataset could reveal the unique values present in categorical columns, providing insights into the diversity of car models and their characteristics.



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
column_names = car_data.columns #name of columns
print("column Names:", column_names)

In [None]:
# Dataset Describe
# Generate a summary description of the dataset
my_describe = car_data.describe(include='all')
print(my_describe)

### Variables Description

car_ID: A unique identifier for each car model.

symboling: The insurance risk rating associated with the car, represented as an integer.

CarName: The name of the car model.

fueltype: The type of fuel used by the car (e.g., gas or diesel).

aspiration: The type of aspiration system in the car's engine (e.g., std or turbo).

doornumber: The number of doors in the car (e.g., two or four).

carbody: The body style of the car (e.g., sedan, hatchback, convertible).

drivewheel: The type of drivetrain (e.g., front-wheel drive, rear-wheel drive, four-wheel drive).

enginelocation: The location of the engine in the car (e.g., front or rear).

wheelbase: The distance between the centers of the front and rear wheels.

carlength: The length of the car.

carwidth: The width of the car.

carheight: The height of the car.

curbweight: The weight of the car without occupants or baggage.

enginetype: The type of engine (e.g., dohc, ohcv).

cylindernumber: The number of cylinders in the engine.

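enginesize: The displacement of the engine. In the commonly used version of this dataset the values (roughly 61 to 326) correspond to cubic inches rather than cubic centimeters.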

fuelsystem: The type of fuel injection system used in the engine.

boreratio: The bore ratio of the engine.

stroke: The stroke length of the engine.

compressionratio: The compression ratio of the engine.

horsepower: The horsepower of the engine.

peakrpm: The peak revolutions per minute of the engine.

citympg: The city miles per gallon fuel efficiency rating.

highwaympg: The highway miles per gallon fuel efficiency rating.

price: The price of the car.



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Iterate through each column and print column name along with unique values
for column in car_data.columns:
    unique_values = car_data[column].unique()
    print(f"Column: {column}")
    print("Unique values:")
    print(unique_values)
    print()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Group by carbody and find the average price
avg_price_by_carbody = car_data.groupby('carbody')['price'].mean()

# Group by fueltype and find the median highwaympg
median_highwaympg_by_fueltype = car_data.groupby('fueltype')['highwaympg'].median()

# Group by aspiration and find the maximum horsepower
max_horsepower_by_aspiration = car_data.groupby('aspiration')['horsepower'].max()

# Group by doornumber and find the minimum curbweight
min_curbweight_by_doornumber = car_data.groupby('doornumber')['curbweight'].min()

# Group by enginelocation and find the total number of cars
total_cars_by_enginelocation = car_data.groupby('enginelocation').size()

# Group by symboling and find the average compression ratio
avg_compression_ratio_by_symboling = car_data.groupby('symboling')['compressionratio'].mean()

# Group by cylindernumber and find the average citympg
avg_citympg_by_cylindernumber = car_data.groupby('cylindernumber')['citympg'].mean()

# Displaying the results
print("Average price by carbody:\n", avg_price_by_carbody)
print("\nMedian highwaympg by fueltype:\n", median_highwaympg_by_fueltype)
print("\nMaximum horsepower by aspiration:\n", max_horsepower_by_aspiration)
print("\nMinimum curbweight by doornumber:\n", min_curbweight_by_doornumber)
print("\nTotal cars by enginelocation:\n", total_cars_by_enginelocation)
print("\nAverage compression ratio by symboling:\n", avg_compression_ratio_by_symboling)
print("\nAverage citympg by cylindernumber:\n", avg_citympg_by_cylindernumber)


### What all manipulations have you done and insights you found?

1. Average Price by Car Body Type: Understanding the average price range for different car body types can help in pricing strategies or market segmentation.

2. Median Highway MPG by Fuel Type: Comparing the median highway MPG between different fuel types provides insights into fuel efficiency trends.

3. Maximum Horsepower by Aspiration: Identifying the maximum horsepower for different aspiration types can indicate performance variations.

4. Minimum Curb Weight by Door Number: Analyzing the minimum curb weight based on the number of doors can reveal potential weight differences in car models.

5. Total Cars by Engine Location: Knowing the total number of cars by engine location provides an overview of how prevalent each engine location type is in the dataset.

6. Average Compression Ratio by Symboling: Understanding the average compression ratio for different symbolings may indicate trends in engine design or performance.

7. Average City MPG by Cylinder Number: Comparing the average city MPG based on the number of cylinders provides insights into fuel efficiency variations based on engine configuration.

These insights can be valuable for various purposes such as market analysis, product development, or understanding consumer preferences.
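
A single statistic per group can mislead when some groups contain only a handful of cars. The sketch below shows one way to report the group size next to the mean and median in a single call. It uses a small made-up DataFrame (`toy_cars`) in place of `car_data`; the values and the output column names are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for car_data; the values are made up for illustration only.
toy_cars = pd.DataFrame({
    "carbody": ["sedan", "sedan", "hatchback", "hatchback", "wagon"],
    "price": [10000.0, 14000.0, 7000.0, 9000.0, 12000.0],
})

# Named aggregation: one row per body style, one column per statistic.
price_summary = toy_cars.groupby("carbody")["price"].agg(
    n_cars="count",         # how many cars the statistics are based on
    mean_price="mean",      # average price within the group
    median_price="median",  # less sensitive to a few expensive cars
)
print(price_summary)
```

Reading `n_cars` alongside the averages makes it easier to judge how much weight a group-level figure deserves.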

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(data=car_data, x='price', bins=20,kde=True)
plt.title('Distribution of Car Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()



##### 1. Why did you pick the specific chart?

I picked a histogram because it provides a visual representation of the distribution of car prices, allowing us to see the spread and concentration of prices.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe the most common price range for cars in the dataset, as well as any outliers or unusual patterns in pricing.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can help businesses understand the pricing landscape in the market, identify pricing trends, and make informed pricing decisions.
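
What the histogram suggests visually (a long right tail, a few very expensive cars) can be backed by numbers. The following is a minimal sketch that computes the skewness and flags outliers with the common 1.5 x IQR rule. The `prices` series is hypothetical and stands in for `car_data['price']`; the 1.5 multiplier is a convention, not something prescribed by this project.

```python
import pandas as pd

# Hypothetical prices, not taken from the project dataset.
prices = pd.Series([5000, 6000, 7000, 8000, 9000, 10000, 11000, 45000], dtype=float)

# Quartiles and interquartile range
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = prices[(prices < lower_bound) | (prices > upper_bound)]

# Positive skewness indicates a longer tail towards high prices
print(f"Skewness: {prices.skew():.2f}")
print(f"Outlier bounds: [{lower_bound:.0f}, {upper_bound:.0f}]")
print("Outliers:", outliers.tolist())
```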

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(data=car_data, x='fueltype', y='price')
plt.title('Car Prices by Fuel Type')
plt.xlabel('Fuel Type')
plt.ylabel('Price')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a boxplot because it effectively compares the distribution of car prices across different fuel types.

##### 2. What is/are the insight(s) found from the chart?

The boxplot reveals the median, quartiles, and outliers in car prices for each fuel type. We can identify any significant differences in price distributions between gas and diesel cars.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can help businesses understand how fuel type affects car prices and make strategic decisions regarding inventory management, marketing strategies, and product positioning.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(data=car_data, x='enginesize', y='price')
plt.title('Car Prices vs. Engine Size')
plt.xlabel('Engine Size')
plt.ylabel('Price')
plt.show()

##### 1. Why did you pick the specific chart?

I selected a scatterplot because it helps visualize the relationship between car prices and engine size, allowing us to identify any patterns or correlations between these variables.

##### 2. What is/are the insight(s) found from the chart?

The scatterplot shows the distribution of car prices relative to engine size, helping us identify whether there's a positive correlation (higher prices for larger engines) or a negative correlation (lower prices for larger engines).
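
The direction and strength of the relationship seen in a scatterplot can be summarized with the Pearson correlation coefficient. The sketch below uses a small invented table (`toy`) rather than `car_data`, so the resulting number says nothing about the real dataset; it only illustrates the call.

```python
import pandas as pd

# Invented values for illustration; replace with car_data in the notebook.
toy = pd.DataFrame({
    "enginesize": [90, 100, 120, 150, 200],
    "price": [6000, 7500, 9000, 14000, 30000],
})

# Series.corr uses the Pearson coefficient by default.
# Values near +1 mean price rises with engine size, near -1 the opposite,
# and near 0 no linear relationship.
r = toy["enginesize"].corr(toy["price"])
print(f"Pearson r between engine size and price: {r:.3f}")
```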

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can inform businesses about the pricing dynamics related to engine size, enabling them to make strategic decisions about product offerings, target markets, and pricing strategies.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(data=car_data, x='carbody')
plt.title('Distribution of Car Body Types')
plt.xlabel('Car Body Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a count plot because it effectively displays the count of each car body type, allowing us to compare the frequency of different body types.

##### 2. What is/are the insight(s) found from the chart?

The count plot shows the distribution of car body types in the dataset, helping us identify which body types are most common or least common.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can help businesses understand consumer preferences for car body types, which can inform decisions related to inventory management, marketing strategies, and product development.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(data=car_data, x='horsepower', bins=20, kde=True)
plt.title('Distribution of Car Horsepower')
plt.xlabel('Horsepower')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I selected a histogram to visualize the distribution of car horsepower, which provides insight into the spread and concentration of horsepower values across the dataset.

##### 2. What is/are the insight(s) found from the chart?

The histogram displays the frequency of different horsepower values, helping us identify the most common horsepower ranges and any outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can help businesses understand the horsepower preferences of consumers and make strategic decisions regarding product offerings, marketing strategies, and pricing.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(data=car_data, x='curbweight', y='price')
plt.title('Car Prices vs. Curb Weight')
plt.xlabel('Curb Weight')
plt.ylabel('Price')
plt.show()

##### 1. Why did you pick the specific chart?

I opted for a scatterplot to visualize the relationship between car prices and curb weight, allowing us to identify any patterns or correlations between these variables.

##### 2. What is/are the insight(s) found from the chart?

The scatterplot shows the distribution of car prices relative to curb weight, helping us determine if there's a positive correlation (higher prices for heavier cars) or a negative correlation (lower prices for heavier cars).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can inform businesses about the pricing dynamics related to curb weight, enabling them to make strategic decisions about product offerings, target markets, and pricing strategies.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(data=car_data, x='fueltype', y='price', hue='aspiration')
plt.title('Car Prices by Fuel Type and Aspiration')
plt.xlabel('Fuel Type')
plt.ylabel('Price')
plt.legend(title='Aspiration')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a boxplot to compare the distribution of car prices across different fuel types and aspiration types simultaneously.

##### 2. What is/are the insight(s) found from the chart?

The boxplot reveals the median, quartiles, and outliers in car prices for each combination of fuel type and aspiration, allowing us to identify any significant differences in price distributions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can help businesses understand how both fuel type and aspiration type affect car prices and make strategic decisions regarding inventory management, marketing strategies, and product positioning.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(data=car_data, x='drivewheel', y='price')
plt.title('Car Prices by Drivewheel Type')
plt.xlabel('Drivewheel Type')
plt.ylabel('Price')
plt.show()

##### 1. Why did you pick the specific chart?

I selected a boxplot to compare the distribution of car prices across different drivewheel types, allowing us to identify any variations in price distributions based on drivewheel configuration.

##### 2. What is/are the insight(s) found from the chart?

The boxplot displays the median, quartiles, and outliers in car prices for each drivewheel type, helping us understand the spread and central tendency of prices within each category.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can help businesses understand how drivewheel type affects car prices and make strategic decisions regarding inventory management, marketing strategies, and product positioning.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(data=car_data, x='carbody', y='price', hue='fueltype')
plt.title('Car Prices by Car Body Type and Fuel Type')
plt.xlabel('Car Body Type')
plt.ylabel('Price')
plt.legend(title='Fuel Type')
plt.show()

##### 1. Why did you pick the specific chart?

I opted for a boxplot to compare the distribution of car prices across different combinations of car body types and fuel types, allowing us to identify any variations in price distributions based on these factors.

##### 2. What is/are the insight(s) found from the chart?

The boxplot displays the median, quartiles, and outliers in car prices for each combination of car body type and fuel type, helping us understand how these factors interact to influence prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can inform businesses about the pricing dynamics related to both car body type and fuel type, enabling them to make strategic decisions about inventory management, marketing strategies, and product positioning.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# CarName holds the full model name, so counting it directly counts models, not manufacturers.
# Assumption: the manufacturer is the first whitespace-separated word of CarName.
manufacturer = car_data['CarName'].str.split().str[0].str.lower()

# Calculate counts of each car manufacturer
manufacturer_counts = manufacturer.value_counts()

# Select the top 10 manufacturers for visualization
top_manufacturers = manufacturer_counts.head(10)

# Plotting the pie chart
plt.figure(figsize=(8, 8))
plt.pie(top_manufacturers, labels=top_manufacturers.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Top 10 Car Manufacturers')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()



##### 1. Why did you pick the specific chart?

Choice of Pie Chart: Selected for its effectiveness in illustrating relative proportions among the top 10 car manufacturers.

##### 2. What is/are the insight(s) found from the chart?

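Insights from the Chart: Shows how the dataset's entries are distributed among the ten most frequent manufacturers, highlighting which ones dominate. The percentages are shares within those ten, counted in dataset rows, so they describe representation in the dataset rather than market share in sales.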

##### 3. Will the gained insights help creating a positive business impact?

Positive Business Impact: Enables informed resource allocation, tailored marketing strategies, and identification of growth opportunities, supporting better decision-making.

#### Chart - Correlation Heatmap

In [None]:
# Select only numeric columns for correlation (covers every integer and float dtype)
numeric_data = car_data.select_dtypes(include='number')

# Handle missing values by filling them with the column mean
numeric_data = numeric_data.fillna(numeric_data.mean())

# Generate the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(numeric_data.corr(), annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Pairwise Correlation Heatmap of Numerical Variables')
plt.show()


##### 1. Why did you pick the specific chart?

I selected a heatmap to visualize the pairwise correlations between numerical variables in the dataset, providing insight into the strength and direction of relationships between variables.


##### 2. What is/are the insight(s) found from the chart?

The insights gained from this chart can help businesses understand the interdependencies between different numerical variables and identify potential multicollinearity issues when building predictive models. By understanding which variables are strongly correlated, businesses can make informed decisions about feature selection, model building, and interpretation of results.
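
Reading strongly correlated pairs off a large heatmap by eye is error-prone. One possible way to list them programmatically is sketched below. The DataFrame `toy`, its values, and the 0.9 threshold are assumptions made for this example; in the notebook the same steps would be applied to `numeric_data`.

```python
import numpy as np
import pandas as pd

# Invented numeric columns for illustration only.
toy = pd.DataFrame({
    "carlength": [150.0, 160.0, 170.0, 180.0, 190.0],
    "curbweight": [1900.0, 2100.0, 2400.0, 2700.0, 3100.0],
    "peakrpm": [5000.0, 5800.0, 4800.0, 5500.0, 5200.0],
})

# Absolute correlations, so strong negative relationships are caught as well
corr = toy.corr().abs()

# Keep only the upper triangle: each pair appears once and the diagonal is dropped
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(upper_mask)

threshold = 0.9  # assumed cut-off for "strongly correlated"
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

Pairs that show up here are candidates for dropping or combining before fitting a linear model, since highly collinear features make its coefficients unstable.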

#### Chart - Pair Plot

In [None]:
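# Pair plot of the main numeric features and price
sns.pairplot(car_data[['wheelbase', 'carlength', 'carwidth', 'carheight', 'curbweight', 'enginesize', 'horsepower', 'price']])
# pairplot builds its own grid of subplots, so use a figure-level title
# (plt.title would only label the last subplot)
plt.suptitle('Pair Plot of Numerical Variables', y=1.02)
plt.show()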

##### 1. Why did you pick the specific chart?

The pair plot was chosen for its ability to visualize relationships between multiple pairs of numerical variables in one plot.

##### 2. What is/are the insight(s) found from the chart?

Insights include strong positive correlations between car length/wheelbase, car width/length/wheelbase, curb weight/length/width/wheelbase, engine size/horsepower, and engine size/horsepower/price. These insights can inform decisions on car design, pricing, and marketing strategies.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
# Load the dataset
data = pd.read_csv('/content/car_data_300.csv')

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Data Preprocessing
# Handle missing values
data.dropna(inplace=True)

#### What all missing value imputation techniques have you used and why did you use those techniques?



```
Rows with missing values are dropped. Dropping is reasonable when only a small fraction of rows is incomplete. The earlier check found no missing values in this dataset, so this step acts as a safeguard and removes nothing.
```


### 2. Categorical Encoding

In [None]:
# Encode your categorical columns
# Encode all categorical variables
label_encoder = LabelEncoder()
categorical_cols = ['CarName', 'fueltype', 'aspiration', 'doornumber', 'carbody', 'drivewheel', 'enginelocation', 'enginetype', 'cylindernumber', 'fuelsystem']
for col in categorical_cols:
    data[col] = label_encoder.fit_transform(data[col])

#### What all categorical encoding techniques have you used & why did you use those techniques?

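Label Encoding is used, assigning a unique integer to each category. It is chosen for simplicity. Note that most of these columns (for example carbody, drivewheel, or fuelsystem) are nominal rather than ordinal, so the integer codes impose an arbitrary order. Tree-based models tolerate this reasonably well, but Linear Regression treats the codes as magnitudes, which can distort its coefficients.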
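
For nominal columns, one-hot encoding is a common alternative to integer codes because it does not introduce an order between categories. The sketch below is not the encoding used in this notebook; it uses a made-up frame (`toy`) and `pandas.get_dummies` to show what the result looks like.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "carbody": ["sedan", "hatchback", "wagon", "sedan"],
    "horsepower": [110, 70, 95, 120],
})

# One indicator column per category; drop_first=True removes one category
# (here "hatchback") to avoid perfectly collinear columns in linear models.
encoded = pd.get_dummies(toy, columns=["carbody"], drop_first=True, dtype=int)
print(encoded)
```

A column with many distinct values, such as CarName, would produce a very wide table this way, so it is usually reduced first (for example to the manufacturer) or left out.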

### 3. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Split Data
X = data.drop(columns=['car_ID', 'price'])
y = data['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

##### What data splitting ratio have you used and why?

A test size of 0.2 (20%) is used, meaning 80% of the data is allocated for training and 20% for testing. This ratio is commonly used to strike a balance between having enough data for training and ensuring a sufficient amount for testing model performance.
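
With a dataset of about 205 rows, a 20% hold-out contains only around 41 cars, so a single split can give a noisy estimate. K-fold cross-validation is one way to get a more stable figure. The sketch below runs on synthetic data (`X_demo`, `y_demo`) generated for this example, not on the car dataset; the choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data with the same number of rows as the dataset description.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(205, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=205)

# Shuffle before splitting so each fold is a random sample of the rows
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# One R-squared score per fold; the spread shows how much the estimate varies
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")
print(f"R-squared per fold: {np.round(scores, 4)}")
print(f"Mean: {scores.mean():.4f}, Std: {scores.std():.4f}")
```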

### 4. MODELS TO BE USED


Three regression models are compared: Linear Regression, Decision Tree, and Random Forest. They are instantiated in the model selection cell of the next section.


## ***7. ML Model Implementation***

In [None]:
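# Model Selection
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),  # Fixed seed so results are reproducible
    'Random Forest': RandomForestRegressor(random_state=42)  # Fixed seed so results are reproducible
}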

### ML Model - 1

In [None]:
# Calculate evaluation metrics for each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    accuracy = model.score(X_test, y_test)  # For regressors, score() returns R-squared (same value as r2 above), not a classification accuracy
    print(f'{name}:')
    print(f'Mean Absolute Error: {mae}')
    print(f'Mean Squared Error: {mse}')
    print(f'R-squared: {r2}')
    print(f'Accuracy (R-squared): {accuracy}')
    print('')

Mean Absolute Error (MAE): The average absolute difference between the predicted and actual prices. A lower MAE indicates better performance.

### ML Model - 2

Mean Squared Error (MSE): The average of the squares of the differences between predicted and actual prices. It penalizes larger errors more heavily. A lower MSE indicates better performance.

### ML Model - 3

R-squared (R2): Also known as the coefficient of determination, it measures the proportion of the variance in the dependent variable (car prices) that is predictable from the independent variables. A value of 1 indicates a perfect fit and 0 means the model does no better than always predicting the mean price. On held-out test data R2 can also be negative, which means the model is worse than that mean baseline. Higher R2 values indicate better performance.

Accuracy (R-squared): Accuracy is a classification metric and does not apply to regression. The value printed under this label comes from model.score(), which returns R-squared for scikit-learn regressors, so it is identical to the R-squared line above.
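
The metrics above can be reproduced in a few lines of plain Python, which makes their definitions concrete. The helper `regression_metrics` and the price lists are hypothetical and exist only for this example; RMSE is added because it is expressed in the same unit as the price.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, RMSE, R-squared) computed from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # same unit as the target
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)     # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, r2

# Made-up prices for illustration only.
actual = [10000.0, 12000.0, 14000.0, 16000.0]
close_predictions = [10500.0, 11500.0, 14500.0, 15500.0]
reversed_predictions = [16000.0, 14000.0, 12000.0, 10000.0]

good = regression_metrics(actual, close_predictions)
bad = regression_metrics(actual, reversed_predictions)
print("Close predictions   :", good)
print("Reversed predictions:", bad)
```

The second case yields a negative R-squared, because predicting the prices in reverse order is worse than predicting the mean price for every car.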

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
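
The two steps above could look roughly like the following sketch, which uses `pickle` from the standard library. The tiny model, the data, and the file name `car_price_model.pkl` are placeholders invented for this example; in the notebook, the best-performing model from the comparison would be saved instead.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model trained on made-up data (y = 2 * x).
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X_demo, y_demo)

# Save the fitted model to disk (hypothetical file name in a temporary folder)
model_path = os.path.join(tempfile.mkdtemp(), "car_price_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and predict on unseen input as a sanity check
with open(model_path, "rb") as f:
    restored = pickle.load(f)

X_unseen = np.array([[5.0]])
same = np.allclose(model.predict(X_unseen), restored.predict(X_unseen))
print("Restored model reproduces the original prediction:", same)
```

A pickled model should only be loaded with a compatible scikit-learn version and only from a trusted source, since unpickling can execute arbitrary code.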

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***
