# Project - Used Car Price Prediction

#### ProjectId - LAI-ML-R-001

## Data Information

<b>Type of Data :</b> Tabular (numerical and categorical features)

<b>Total Number of Features:</b> 14

<b>Total Number of Records:</b> 7253

<b>Features Name:</b> S.No., Name, Location, Year, Kilometers_Driven, Fuel_Type, Transmission, Owner_Type, Mileage, Engine, Power, Seats, New_Price, Price

<b>Features Description:</b> 

<b>1. S.No.:</b> A unique identifier for each car entry.

<b>2. Name:</b> The make and model of the car.

<b>3. Location:</b> The city or region where the car is being sold.

<b>4. Year:</b> The year the car was manufactured.

<b>5. Kilometers_Driven:</b> The total distance the car has traveled, measured in kilometers.

<b>6. Fuel_Type:</b> The type of fuel the car uses, such as petrol, diesel, or electric.

<b>7. Transmission:</b> The type of transmission system in the car, either manual or automatic.

<b>8. Owner_Type:</b> The number of previous owners of the car.

<b>9. Mileage:</b> The fuel efficiency of the car, typically measured in kilometers per liter.

<b>10. Engine:</b> The engine capacity of the car, usually measured in cubic centimeters (cc).

<b>11. Power:</b> The maximum power output of the car's engine, usually measured in horsepower (bhp).

<b>12. Seats:</b> The number of seats available in the car.

<b>13. New_Price:</b> The original price of the car when it was new.

<b>14. Price:</b> The current selling price of the car.

# Data Analysis

### Importing Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import missingno
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import seaborn as sns
import squarify

%matplotlib inline

### Loading Data

In [3]:
df = pd.read_csv('../data/data.csv')

### Type of Dataset

In [4]:
type(df)

pandas.core.frame.DataFrame

### Displaying Top 5 Rows

In [None]:
df.head()

### Displaying Bottom 5 rows

In [None]:
df.tail()

#### Challenge*: 

1. If the dataset has more than 30 features, pandas does not show all of them at once; the middle columns are truncated.

To display all columns, an additional display setting is needed.

2. In the same way, displaying all rows requires an additional display setting.
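
A minimal sketch of the display settings mentioned above, using the pandas options API. The `demo` frame and its column names are hypothetical and only stand in for a wide dataset.

```python
import pandas as pd

# Hypothetical wide frame (40 columns) used only for illustration
demo = pd.DataFrame([range(40)], columns=[f"col_{i}" for i in range(40)])

# Temporarily lift the column and row limits for a single display call
with pd.option_context("display.max_columns", None, "display.max_rows", None):
    print(demo)

# Or lift the limits for the rest of the session
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
```

Lifting `display.max_rows` for the whole session can make large frames slow to render, so the `option_context` form is usually the safer choice.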

### Shape of the Dataset

In [None]:
print("Number of Dimensions: ",df.ndim)
shape = df.shape
print("\nTotal Number of Rows: ",shape[0])
print("Total Number of Columns: ",shape[1])

### Displaying Column Names

In [None]:
df.columns

### Displaying DataType of each Column

In [None]:
df.dtypes

### Observed Points*:
    
#### Removing Units from Element Values and Adding to Column Names:

    Examine the Mileage, Engine, and Power columns where units are included in the element values.

    Remove the unit extensions from the element values.

    Add the corresponding units to the column names to clarify the measurements.
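
The observed points above could be sketched as follows. The sample values, the helper `strip_unit`, and the new column names `Engine_CC` and `Power_bhp` are assumptions made for illustration; the real columns may contain other formats.

```python
import pandas as pd

# Hypothetical sample values; the last Power entry is deliberately unparsable
sample = pd.DataFrame({
    "Engine": ["998 CC", "1582 CC", None],
    "Power": ["58.16 bhp", "126.2 bhp", "not available"],
})

def strip_unit(series: pd.Series) -> pd.Series:
    """Keep the leading number of each value; anything unparsable becomes NaN."""
    number = series.str.extract(r"^\s*(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(number, errors="coerce")

# Replace the text columns with numeric ones and move the unit into the name
cleaned = sample.assign(
    Engine=strip_unit(sample["Engine"]),
    Power=strip_unit(sample["Power"]),
).rename(columns={"Engine": "Engine_CC", "Power": "Power_bhp"})

print(cleaned.dtypes)
```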


### Null Values Information

In [None]:
print("Is there any Null Values in the Data: ", df.isnull().sum().any())

In [None]:
print("Total Number of Null Values in the Data: ",df.isnull().sum().sum())

In [None]:
print("Each Feature Null Value is there or Not: \n\n",df.isnull().any())

In [None]:
print("Each Feature Null Value count: \n\n",df.isnull().sum())

In [None]:
print("Displaying Non Null Values Count:\n\n")
missingno.bar(df, figsize=(10,5), fontsize=12)
plt.show()

In [None]:
print("Displaying Non Null Values Count in ascending order:\n\n")
missingno.bar(df, color="dodgerblue", sort="ascending", figsize=(10,5), fontsize=12)
plt.show()

In [None]:
print("In dataset where exactly null values are present:\n\n")
missingno.matrix(df, figsize=(10,5), fontsize=12, sparkline=False)
plt.show()

In [None]:
# Calculate the total number of null values
total_null_values = df.isnull().sum().sum()

# Calculate the total number of non-null values
total_non_null_values = df.notnull().sum().sum()

# Calculate the total number of values in the dataset
total_values = total_null_values + total_non_null_values

# Calculate the percentage of null and non-null values
null_percentage = (total_null_values / total_values) * 100
non_null_percentage = (total_non_null_values / total_values) * 100

# Data for the donut chart
labels = ['Null Values', 'Non-Null Values']
values = [null_percentage, non_null_percentage]

# Create the donut chart
fig = go.Figure(data=[go.Pie(labels=labels,
                             values=values,
                             hole=.4,
                             hoverinfo='label+percent',
                             textinfo='value',
                             textfont_size=15)])

# Update layout for better appearance
fig.update_layout(title_text='Percentage of Null and Non-Null Values in the Dataset',
                  annotations=[dict(text='Values', x=0.5, y=0.5, font_size=20, showarrow=False)])

# Show the plot
fig.show()


In [None]:
# Find the number of null values in each column
null_values = df.isnull().sum()

# Calculate the percentage of null values for each column
null_percent = (null_values / len(df)) * 100

# Filter out columns with no null values
null_percent = null_percent[null_percent > 0]

# Create the donut chart
fig = go.Figure(data=[go.Pie(labels=null_percent.index,
                             values=null_percent,
                             hole=.4,
                             hoverinfo='label+percent',
                             textinfo='value',
                             textfont_size=15)])

# Update layout for better appearance
fig.update_layout(title_text='Percentage of Null Values in Each Column',
                  annotations=[dict(text='Null Values', x=0.5, y=0.5, font_size=20, showarrow=False)])

# Show the plot
fig.show()

#### Requirement Specification

1. Handling Null Values in the Price Feature:

    Identify rows where the price feature has null values.

    Separate these rows and place them in the validation data set.

2. Dropping the New_Price Feature:

    The New_Price feature has more than 50% null values.

    Due to the high percentage of null values, drop the New_Price column.

Note*: If the New_Price column is deemed important in the future, values should be entered manually by referencing data from car websites.

3. Filling Null Values in Mileage, Seats, and Power Features:

    Focus on handling null values in the Mileage, Seats, and Power features.

    Determine the type of null values present (e.g., missing completely at random, missing at random, or missing not at random).

    Identify the best method to fill the null values for each feature based on their nature and impact on the dataset.
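
One possible way to carry out the three requirements above, shown on a hypothetical miniature of the dataset. The names `train_df` and `validation_df`, the 0.5 threshold check, and the median fill for Seats are assumptions, not a decided strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset
sample = pd.DataFrame({
    "Name": ["Car A", "Car B", "Car C", "Car D"],
    "Seats": [5.0, np.nan, 7.0, 5.0],
    "New_Price": [np.nan, np.nan, np.nan, "8.61 Lakh"],
    "Price": [1.75, 12.5, np.nan, 4.5],
})

# 1. Rows without a Price are set aside as validation data
validation_df = sample[sample["Price"].isnull()].copy()
train_df = sample[sample["Price"].notnull()].copy()

# 2. Drop New_Price when more than half of its values are missing
if train_df["New_Price"].isnull().mean() > 0.5:
    train_df = train_df.drop(columns="New_Price")
    validation_df = validation_df.drop(columns="New_Price")

# 3. Assumed fill strategy: use the median for the missing Seats values
train_df["Seats"] = train_df["Seats"].fillna(train_df["Seats"].median())
```

The median is only a placeholder here; the appropriate fill method depends on whether the values are missing at random, as noted in the requirement.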

### Checking Duplicates

In [None]:
print("Any duplicates in the Data Frame: ",df.duplicated().any())

In [None]:
print("Number of duplicates in the dataset: ",df.duplicated().sum())

In [None]:
print("Duplicates rows display here")
duplicate = df[df.duplicated()]
duplicate

#### Note*:

Handling Duplicates in the Dataset:

    Identify if the dataset contains any duplicate entries.

    Find the total number of duplicates present.

    Calculate the percentage of duplicate entries relative to the entire dataset.

    Identify if a particular row is repeated multiple times and determine the frequency of such repetitions.
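
The checks listed in this note could be sketched as follows; the `sample` frame is hypothetical.

```python
import pandas as pd

# Hypothetical frame in which one row appears three times
sample = pd.DataFrame({
    "Name": ["Car A", "Car B", "Car A", "Car A", "Car C"],
    "Year": [2015, 2017, 2015, 2015, 2012],
})

# Is any row a repeat of an earlier row, and how many such rows exist?
has_duplicates = sample.duplicated().any()
duplicate_count = sample.duplicated().sum()

# Share of duplicate rows relative to the whole dataset
duplicate_percent = duplicate_count * 100 / len(sample)

# Frequency of each distinct row; keep only rows that occur more than once
row_frequency = sample.value_counts()
repeated_rows = row_frequency[row_frequency > 1]

print(has_duplicates, duplicate_count, duplicate_percent)
print(repeated_rows)
```

Note that `DataFrame.value_counts` skips rows containing missing values by default, so on the real data the frequency table may need `dropna=False`.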

### Dataset Basic Information

In [None]:
df.info()

<b>Note*: </b> 

Number of float data type columns: 2

Number of integer data type columns: 3

Number of object data type columns: 9

## Understanding Each Feature

## 1. Feature S.No.

In [None]:
print('Feature Name: S.No.')
print("\nDescription: A unique identifier for each car entry")
print("\nType of Data: Unique Values")
print("\nProgramming Data Type information: Integer")
print("\nValue Arrangement: Sequence")
print("\nRange of the values: 0 to 7252")

#### Requirement

Dropping S.No. Unique Column:

    The S.No. unique column is not useful for machine learning applications.

    Drop the S.No. column, as the pandas DataFrame index already identifies each row.

## 2. Feature Name

In [None]:
print('Feature Name: Name')
print("\nDescription:The make and model of the car")
print("\nType of Data: Character")
print("\nStatistics Data Type information: Qualitative, Nominal")
print("\nProgramming Data Type information: Object")
print("\nValue information: Manufacturer, Model, Year and other information")
unique_count = len(df['Name'].unique())
print("\nTotal Number of Categories in the Feature:", unique_count)
print("\nNumber of Occurrences of Each Category:\n\n", df['Name'].value_counts())
print("\nUnique Values in the Feature:", df['Name'].unique())

#### Requirements*:

<b>Unique Categories Identification</b>

Identify all unique categories in the 'Name' column. This is essential for accurate analysis and data integrity.

<b>Separate Manufacturer and Model</b>

Extract the manufacturer and model information from the 'Name' column. This separation helps in more granular analysis and categorization.

<b>Create New Columns</b>

Create two new columns in the DataFrame: 'Manufacturer' and 'Model'. Populate these columns with the extracted manufacturer and model information, respectively. This structure aids in better organization and usability of the dataset.
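
A minimal sketch of the split described above. It assumes the first word of `Name` is the manufacturer and the second word is the model; the sample names are illustrative only.

```python
import pandas as pd

# Illustrative sample of the 'Name' column
sample = pd.DataFrame({
    "Name": [
        "Maruti Wagon R LXI CNG",
        "Hyundai Creta 1.6 CRDi SX Option",
        "Honda Jazz V",
    ]
})

# Split on whitespace at most twice: manufacturer, model, remaining text
parts = sample["Name"].str.split(n=2, expand=True)
sample["Manufacturer"] = parts[0]
sample["Model"] = parts[1]

print(sample[["Manufacturer", "Model"]])
```

This heuristic breaks for manufacturers or models that span two words, so the resulting categories would need a manual review before use.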

## 3. Feature Location

In [None]:
print('Feature Name: Location')
print("\nDescription:The city or region where the car is being sold.")
print("\nType of Data: Character")
print("\nStatistics Data Type information: Qualitative, Nominal")
print("\nProgramming Data Type information: Object")
print("\nValue information: City Names")
unique_count = len(df['Location'].unique())
print("\nTotal Number of Categories in the Feature:", unique_count)
print("\nUnique Values in the Feature:", df['Location'].unique())
print("\nNumber of Occurrences of Each Category:\n\n", df['Location'].value_counts())

# Graph

# Create count plot
ax = sns.countplot(x='Location', data=df)

# Annotate each bin with the count value
for p in ax.patches:
    ax.annotate(
        format(p.get_height(), '.0f'),  # Use a format string to display the count as an integer
        (p.get_x() + p.get_width() / 2., p.get_height()),  # Position the annotation in the center of the bar
        ha='center', va='center',  # Align the text in the center horizontally and vertically
        xytext=(0, 9),  # Offset the text slightly above the bar
        textcoords='offset points'
    )

# Rotate x-axis labels
plt.xticks(rotation=45)

# Add titles and labels
plt.title('Count Plot of Categories with Annotations')
plt.xlabel('Category')
plt.ylabel('Count')

# Show plot
plt.show()


# Calculate percentage
category_counts = df['Location'].value_counts(normalize=True) * 100
category_df = category_counts.reset_index()
category_df.columns = ['Location', 'Percentage']

# Define custom colors
cmap = plt.get_cmap('cividis')  # Choose a colormap
colors = [cmap(i / len(category_df)) for i in range(len(category_df))]

# Plot Treemap
plt.figure(figsize=(10, 6))
squarify.plot(
    sizes=category_df['Percentage'],
    label=category_df['Location'] + ' (' + category_df['Percentage'].round(2).astype(str) + '%)',
    color=colors,  # Apply the custom colors defined above
    alpha=0.8
)

# Add titles and labels
plt.title('Treemap of Category Percentages')
plt.axis('off')  # Remove the axes

# Show plot
plt.show()

## 4. Feature Year

In [None]:
print('Feature Name: Year')
print("\nDescription: The year the car was manufactured.")
print("\nType of Data: Numerical")
print("\nStatistics Data Type information: Quantitative, Discrete")
print("\nProgramming Data Type information: Integer")
print("\nValue information: Car manufacture year")
unique_count = len(df['Year'].unique())
print("\nTotal Number of Categories in the Feature:", unique_count)
print("\nUnique Values in the Feature:", df['Year'].unique())
print("\nNumber of Occurrences of Each Category:\n\n", df['Year'].value_counts())

# Graph

# Create count plot
ax = sns.countplot(x='Year', data=df)

# Annotate each bin with the count value
for p in ax.patches:
    ax.annotate(
        format(p.get_height(), '.0f'),  # Use a format string to display the count as an integer
        (p.get_x() + p.get_width() / 2., p.get_height()),  # Position the annotation in the center of the bar
        ha='center', va='center',  # Align the text in the center horizontally and vertically
        xytext=(0, 9),  # Offset the text slightly above the bar
        textcoords='offset points'
    )

# Rotate x-axis labels
plt.xticks(rotation=45)

# Add titles and labels
plt.title('Count Plot of Categories with Annotations')
plt.xlabel('Category')
plt.ylabel('Count')

# Show plot
plt.show()



# Calculate the count of each year
year_counts = df['Year'].value_counts().reset_index()
year_counts.columns = ['Year', 'Count']

# Plot Pie Chart using Plotly
fig = px.pie(
    year_counts,
    values='Count',  # Specify the column containing the values for the pie slices
    names='Year',
    title='Distribution of Years',
    color_discrete_sequence=px.colors.sequential.Viridis  # Use the 'Viridis' colormap
)

# Show plot
fig.show()

## 6. Feature Fuel_Type

In [None]:
print('Feature Name: Fuel_Type')
print("\nDescription:The type of fuel the car uses, such as petrol, diesel, or electric.")
print("\nType of Data: Character")
print("\nStatistics Data Type information: Qualitative, Nominal")
print("\nProgramming Data Type information: Object")
print("\nValue information: Car fuel information")
unique_count = len(df['Fuel_Type'].unique())
print("\nTotal Number of Categories in the Feature:", unique_count)
print("\nUnique Values in the Feature:", df['Fuel_Type'].unique())
print("\nNumber of Occurrences of Each Category:\n\n", df['Fuel_Type'].value_counts())

# Graph

# Create count plot
ax = sns.countplot(x='Fuel_Type', data=df)

# Annotate each bin with the count value
for p in ax.patches:
    ax.annotate(
        format(p.get_height(), '.0f'),  # Use a format string to display the count as an integer
        (p.get_x() + p.get_width() / 2., p.get_height()),  # Position the annotation in the center of the bar
        ha='center', va='center',  # Align the text in the center horizontally and vertically
        xytext=(0, 9),  # Offset the text slightly above the bar
        textcoords='offset points'
    )

# Rotate x-axis labels
plt.xticks(rotation=45)

# Add titles and labels
plt.title('Count Plot of Categories with Annotations')
plt.xlabel('Category')
plt.ylabel('Count')

# Show plot
plt.show()



# Calculate the count of each fuel type
fuel_counts = df['Fuel_Type'].value_counts().reset_index()
fuel_counts.columns = ['Fuel_Type', 'Count']

# Plot Pie Chart using Plotly
fig = px.pie(
    fuel_counts,
    values='Count',  # Specify the column containing the values for the pie slices
    names='Fuel_Type',
    title='Distribution of Fuel Types',
    color_discrete_sequence=px.colors.sequential.Viridis  # Use the 'Viridis' colormap
)

# Show plot
fig.show()

## 7. Feature Transmission

In [None]:
print('Feature Name: Transmission')
print("\nDescription:The type of transmission system in the car, either manual or automatic.")
print("\nType of Data: Character")
print("\nStatistics Data Type information: Qualitative, Nominal")
print("\nProgramming Data Type information: Object")
print("\nValue information: Transmission system in the car")
unique_count = len(df['Transmission'].unique())
print("\nTotal Number of Categories in the Feature:", unique_count)
print("\nUnique Values in the Feature:", df['Transmission'].unique())
print("\nNumber of Occurrences of Each Category:\n\n", df['Transmission'].value_counts())

# Graph

# Create count plot
ax = sns.countplot(x='Transmission', data=df)

# Annotate each bin with the count value
for p in ax.patches:
    ax.annotate(
        format(p.get_height(), '.0f'),  # Use a format string to display the count as an integer
        (p.get_x() + p.get_width() / 2., p.get_height()),  # Position the annotation in the center of the bar
        ha='center', va='center',  # Align the text in the center horizontally and vertically
        xytext=(0, 9),  # Offset the text slightly above the bar
        textcoords='offset points'
    )

# Rotate x-axis labels
plt.xticks(rotation=45)

# Add titles and labels
plt.title('Count Plot of Categories with Annotations')
plt.xlabel('Category')
plt.ylabel('Count')

# Show plot
plt.show()



# Calculate the count of each transmission type
transmission_counts = df['Transmission'].value_counts().reset_index()
transmission_counts.columns = ['Transmission', 'Count']

# Plot Pie Chart using Plotly
fig = px.pie(
    transmission_counts,
    values='Count',  # Specify the column containing the values for the pie slices
    names='Transmission',
    title='Distribution of Transmission Types',
    color_discrete_sequence=px.colors.sequential.Viridis  # Use the 'Viridis' colormap
)

# Show plot
fig.show()

## 8. Feature Owner_Type

In [None]:
print('Feature Name: Owner_Type')
print("\nDescription: The number of previous owners of the car.")
print("\nType of Data: Character")
print("\nStatistics Data Type information: Qualitative, Ordinal")
print("\nProgramming Data Type information: Object")
print("\nValue information: Previous owners of the car")
unique_count = len(df['Owner_Type'].unique())
print("\nTotal Number of Categories in the Feature:", unique_count)
print("\nUnique Values in the Feature:", df['Owner_Type'].unique())
print("\nNumber of Occurrences of Each Category:\n\n", df['Owner_Type'].value_counts())

# Graph

# Create count plot
ax = sns.countplot(x='Owner_Type', data=df)

# Annotate each bin with the count value
for p in ax.patches:
    ax.annotate(
        format(p.get_height(), '.0f'),  # Use a format string to display the count as an integer
        (p.get_x() + p.get_width() / 2., p.get_height()),  # Position the annotation in the center of the bar
        ha='center', va='center',  # Align the text in the center horizontally and vertically
        xytext=(0, 9),  # Offset the text slightly above the bar
        textcoords='offset points'
    )

# Rotate x-axis labels
plt.xticks(rotation=45)

# Add titles and labels
plt.title('Count Plot of Categories with Annotations')
plt.xlabel('Category')
plt.ylabel('Count')

# Show plot
plt.show()



# Calculate the count of each owner type
owner_counts = df['Owner_Type'].value_counts().reset_index()
owner_counts.columns = ['Owner_Type', 'Count']

# Plot Pie Chart using Plotly
fig = px.pie(
    owner_counts,
    values='Count',  # Specify the column containing the values for the pie slices
    names='Owner_Type',
    title='Number of previous owners',
    color_discrete_sequence=px.colors.sequential.Viridis  # Use the 'Viridis' colormap
)

# Show plot
fig.show()

## 9. Feature Mileage

In [None]:
print('Feature Name: Mileage')
print("\nDescription: The fuel efficiency of the car, typically measured in kilometers per liter.")
print("\nType of Data: Character")
print("\nStatistics Data Type information: Quantitative, Ratio (stored as text with units)")
print("\nProgramming Data Type information: Object")
print("\nValue information: Fuel efficiency with its unit (kmpl or km/kg)")
unique_count = len(df['Mileage'].unique())
print("\nTotal Number of Categories in the Feature:", unique_count)
print("\nUnique Values in the Feature:", df['Mileage'].unique())
print("\nNumber of Occurrences of Each Category:\n\n", df['Mileage'].value_counts())

# Graph

# Create count plot
ax = sns.countplot(x='Mileage', data=df)

# Annotate each bin with the count value
for p in ax.patches:
    ax.annotate(
        format(p.get_height(), '.0f'),  # Use a format string to display the count as an integer
        (p.get_x() + p.get_width() / 2., p.get_height()),  # Position the annotation in the center of the bar
        ha='center', va='center',  # Align the text in the center horizontally and vertically
        xytext=(0, 9),  # Offset the text slightly above the bar
        textcoords='offset points'
    )

# Rotate x-axis labels
plt.xticks(rotation=45)

# Add titles and labels
plt.title('Count Plot of Categories with Annotations')
plt.xlabel('Category')
plt.ylabel('Count')

# Show plot
plt.show()



# Calculate the count of each mileage value
mileage_counts = df['Mileage'].value_counts().reset_index()
mileage_counts.columns = ['Mileage', 'Count']

# Plot Pie Chart using Plotly
fig = px.pie(
    mileage_counts,
    values='Count',  # Specify the column containing the values for the pie slices
    names='Mileage',
    title='Distribution of Mileage',
    color_discrete_sequence=px.colors.sequential.Viridis  # Use the 'Viridis' colormap
)

# Show plot
fig.show()

### Requirements

Separate Number and Units:

    Extract numerical values and their corresponding units from given data.

Maintain Space After Number:

    Ensure there is a space between the numerical value and the unit.

Convert Units:

    Handle units such as kmpl and km/kg by storing the unit in a new column.

Categorical Conversion and Plotting:

    Convert categorical data into ranges for better visualization.

    Apply appropriate graphs to represent the categorical data.

Plot Continuous Analysis:

    Perform continuous data analysis and generate plots accordingly.
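
As a sketch of the first three requirements, the value and unit can be separated with a regular expression. The sample values and the new column names `Mileage_Value` and `Mileage_Unit` are assumptions.

```python
import pandas as pd

# Hypothetical sample of the 'Mileage' column
sample = pd.DataFrame({"Mileage": ["26.6 km/kg", "19.67 kmpl", "18.2 kmpl", None]})

# Capture "<number> <unit>"; rows that do not match become NaN in both groups
parts = sample["Mileage"].str.extract(
    r"^\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>\S+)\s*$"
)

# Numeric part as float, unit kept in its own column
sample["Mileage_Value"] = pd.to_numeric(parts["value"], errors="coerce")
sample["Mileage_Unit"] = parts["unit"]

print(sample)
```

Keeping the unit in its own column makes it possible to analyse kmpl and km/kg records separately, since the two units are not directly comparable.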

## 10. Feature Engine

In [None]:
print('Feature Name: Engine')
print("\nDescription: The engine capacity of the car, usually measured in cubic centimeters (cc).")
print("\nType of Data: Character")
print("\nStatistics Data Type information: Quantitative, Ratio (stored as text with units)")
print("\nProgramming Data Type information: Object")
print("\nValue information: Engine capacity of the car")
unique_count = len(df['Engine'].unique())
print("\nTotal Number of Categories in the Feature:", unique_count)
print("\nUnique Values in the Feature:", df['Engine'].unique())
print("\nNumber of Occurrences of Each Category:\n\n", df['Engine'].value_counts())

# Graph

# Create count plot
ax = sns.countplot(x='Engine', data=df)

# Annotate each bin with the count value
for p in ax.patches:
    ax.annotate(
        format(p.get_height(), '.0f'),  # Use a format string to display the count as an integer
        (p.get_x() + p.get_width() / 2., p.get_height()),  # Position the annotation in the center of the bar
        ha='center', va='center',  # Align the text in the center horizontally and vertically
        xytext=(0, 9),  # Offset the text slightly above the bar
        textcoords='offset points'
    )

# Rotate x-axis labels
plt.xticks(rotation=45)

# Add titles and labels
plt.title('Count Plot of Categories with Annotations')
plt.xlabel('Category')
plt.ylabel('Count')

# Show plot
plt.show()



# Calculate the count of each engine capacity
engine_counts = df['Engine'].value_counts().reset_index()
engine_counts.columns = ['Engine', 'Count']

# Plot Pie Chart using Plotly
fig = px.pie(
    engine_counts,
    values='Count',  # Specify the column containing the values for the pie slices
    names='Engine',
    title='Engine capacity of the car',
    color_discrete_sequence=px.colors.sequential.Viridis  # Use the 'Viridis' colormap
)

# Show plot
fig.show()

### Requirements

Separate Number and Units:

    Extract numerical values and their corresponding units from given data.

Maintain Space After Number:

    Ensure there is a space between the numerical value and the unit.

Convert Units:

    Remove the CC unit from the values and rename the column by appending CC to its name.

Categorical Conversion and Plotting:

    Convert categorical data into ranges for better visualization.

    Apply appropriate graphs to represent the categorical data.

Plot Continuous Analysis:

    Perform continuous data analysis and generate plots accordingly.
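
The conversion into ranges could look like the sketch below, which assumes the unit has already been removed. The sample values, the bin edges, and the labels are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical engine capacities after the unit has been stripped
engine_cc = pd.Series([998, 1197, 1582, 2179, 2993], name="Engine_CC")

# Assumed bin edges; each interval is open on the left and closed on the right
bins = [0, 1000, 1500, 2000, 3000]
labels = ["<=1000", "1001-1500", "1501-2000", "2001-3000"]
engine_range = pd.cut(engine_cc, bins=bins, labels=labels)

# Count the cars per range, ordered by range rather than by frequency
range_counts = engine_range.value_counts().sort_index()

print(range_counts)
```

A count plot over a handful of ranges is far easier to read than one bar per distinct capacity value.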

## 11. Feature Power

In [None]:
print('Feature Name: Power')
print("\nDescription: The maximum power output of the car's engine, usually measured in horsepower (bhp).")
print("\nType of Data: Character")
print("\nStatistics Data Type information: Quantitative, Ratio (stored as text with units)")
print("\nProgramming Data Type information: Object")
print("\nValue information: Maximum power output of the car's engine")
unique_count = len(df['Power'].unique())
print("\nTotal Number of Categories in the Feature:", unique_count)
print("\nUnique Values in the Feature:", df['Power'].unique())
print("\nNumber of Occurrences of Each Category:\n\n", df['Power'].value_counts())

# Graph

# Create count plot
ax = sns.countplot(x='Power', data=df)

# Annotate each bin with the count value
for p in ax.patches:
    ax.annotate(
        format(p.get_height(), '.0f'),  # Use a format string to display the count as an integer
        (p.get_x() + p.get_width() / 2., p.get_height()),  # Position the annotation in the center of the bar
        ha='center', va='center',  # Align the text in the center horizontally and vertically
        xytext=(0, 9),  # Offset the text slightly above the bar
        textcoords='offset points'
    )

# Rotate x-axis labels
plt.xticks(rotation=45)

# Add titles and labels
plt.title('Count Plot of Categories with Annotations')
plt.xlabel('Category')
plt.ylabel('Count')

# Show plot
plt.show()



# Calculate the count of each power value
power_counts = df['Power'].value_counts().reset_index()
power_counts.columns = ['Power', 'Count']

# Plot Pie Chart using Plotly
fig = px.pie(
    power_counts,
    values='Count',  # Specify the column containing the values for the pie slices
    names='Power',
    title='Maximum power output of the car engine',
    color_discrete_sequence=px.colors.sequential.Viridis  # Use the 'Viridis' colormap
)

# Show plot
fig.show()

### Requirements

Separate Number and Units:

    Extract numerical values and their corresponding units from given data.

Maintain Space After Number:

    Ensure there is a space between the numerical value and the unit.

Convert Units:

    Remove the bhp unit from the values and rename the column by appending bhp to its name.

Categorical Conversion and Plotting:

    Convert categorical data into ranges for better visualization.

    Apply appropriate graphs to represent the categorical data.

Plot Continuous Analysis:

    Perform continuous data analysis and generate plots accordingly.

## 12. Feature Seats

In [None]:
print('Feature Name: Seats')
print("\nDescription: The number of seats available in the car.")
print("\nType of Data: Numerical")
print("\nStatistics Data Type information: Quantitative, Discrete")
print("\nProgramming Data Type information: Float")
print("\nValue information: Number of seats")
unique_count = len(df['Seats'].unique())
print("\nTotal Number of Categories in the Feature:", unique_count)
print("\nUnique Values in the Feature:", df['Seats'].unique())
print("\nNumber of Occurrences of Each Category:\n\n", df['Seats'].value_counts())

# Graph

# Create count plot
ax = sns.countplot(x='Seats', data=df)

# Annotate each bin with the count value
for p in ax.patches:
    ax.annotate(
        format(p.get_height(), '.0f'),  # Use a format string to display the count as an integer
        (p.get_x() + p.get_width() / 2., p.get_height()),  # Position the annotation in the center of the bar
        ha='center', va='center',  # Align the text in the center horizontally and vertically
        xytext=(0, 9),  # Offset the text slightly above the bar
        textcoords='offset points'
    )

# Rotate x-axis labels
plt.xticks(rotation=45)

# Add titles and labels
plt.title('Count Plot of Categories with Annotations')
plt.xlabel('Category')
plt.ylabel('Count')

# Show plot
plt.show()



# Calculate the count of each seat number
seat_counts = df['Seats'].value_counts().reset_index()
seat_counts.columns = ['Seats', 'Count']

# Plot Pie Chart using Plotly
fig = px.pie(
    seat_counts,
    values='Count',  # Specify the column containing the values for the pie slices
    names='Seats',
    title='Number of seats',
    color_discrete_sequence=px.colors.sequential.Viridis  # Use the 'Viridis' colormap
)

# Show plot
fig.show()

### Requirements

Handle Zero Number of Seats:
    
    Identify records with zero seats.

    Convert zero seat values to null or manually replace them.

    Investigate and document the reason behind the zero seat values.
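
A minimal sketch of the zero-seat handling, using a hypothetical sample. It takes the "convert to null" option; manual replacement would need the correct value looked up for each car.

```python
import numpy as np
import pandas as pd

# Hypothetical sample containing one record with zero seats
sample = pd.DataFrame({
    "Name": ["Car A", "Car B", "Car C"],
    "Seats": [5.0, 0.0, 7.0],
})

# Keep a copy of the affected records so the cause can be investigated later
zero_seat_rows = sample[sample["Seats"] == 0]

# Treat zero seats as missing so the null-handling step can deal with them
sample["Seats"] = sample["Seats"].replace(0, np.nan)

print(zero_seat_rows)
print(sample)
```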

## 13. Feature New_Price

In [None]:
print('Feature Name: New_Price')
print("\nDescription: The original price of the car when it was new.")
print("\nType of Data: Character")
print("\nStatistics Data Type information: Quantitative, Ratio")
print("\nProgramming Data Type information: Object")
print("\nValue information: The original price of the car when it was new")
unique_count = len(df['New_Price'].unique())
print("\nTotal Number of Categories in the Feature:", unique_count)
print("\nUnique Values in the Feature:", df['New_Price'].unique())
print("\nNumber of Occurrences of Each Category:\n\n", df['New_Price'].value_counts())

# Graph

# Create count plot
ax = sns.countplot(x='New_Price', data=df)

# Annotate each bin with the count value
for p in ax.patches:
    ax.annotate(
        format(p.get_height(), '.0f'),  # Use a format string to display the count as an integer
        (p.get_x() + p.get_width() / 2., p.get_height()),  # Position the annotation in the center of the bar
        ha='center', va='center',  # Align the text in the center horizontally and vertically
        xytext=(0, 9),  # Offset the text slightly above the bar
        textcoords='offset points'
    )

# Rotate x-axis labels
plt.xticks(rotation=45)

# Add titles and labels
plt.title('Count Plot of Categories with Annotations')
plt.xlabel('Category')
plt.ylabel('Count')

# Show plot
plt.show()



# Calculate the count of each new price value
new_price_counts = df['New_Price'].value_counts().reset_index()
new_price_counts.columns = ['New_Price', 'Count']

# Plot Pie Chart using Plotly
fig = px.pie(
    new_price_counts,
    values='Count',  # Specify the column containing the values for the pie slices
    names='New_Price',
    title='The original price of the car',
    color_discrete_sequence=px.colors.sequential.Viridis  # Use the 'Viridis' colormap
)

# Show plot
fig.show()

### Requirement

Handle New_Price Column:

    Identify that more than 70% of the data in the New_Price column is null.

    Understand that the New_Price column represents the showroom price of the car at the time of purchase.

    Recognize that filling the New_Price data using statistical analysis is not feasible and requires manual entry.

    Remove the New_Price column for now.

## 14. Feature Price

In [None]:
print('Feature Name: Price')
print("\nDescription: The current selling price of the car.")
print("\nType of Data: Numerical")
print("\nStatistics Data Type information: Quantitative, Ratio")
print("\nProgramming Data Type information: Float")
print("\nValue information: The current selling price of the car.")

mini = df['Price'].min()
maxi = df['Price'].max()
avg = df['Price'].mean()
med = df['Price'].median()
zeros = (df['Price'] == 0).any()
neg = (df['Price'] < 0).any()
null = df['Price'].isnull().any()
dup = df['Price'].duplicated().any()
# Collect the summary statistics in a one-row DataFrame
price_summary = {'Minimum':[mini],'Maximum':[maxi],'Mean':[avg],'Median':[med],'Zeros':[zeros], 'Negative':[neg],'Missing':[null],'Duplicates':[dup]}
df_price_summary = pd.DataFrame(price_summary)
display(df_price_summary.head())

In [None]:
sns.kdeplot(df,x='Price')
plt.show()

In [None]:
sns.histplot(df,x='Price')
plt.show()

### Requirements

Price Column with Lakh Units:

    Rename the Price column so that the new column name indicates that the price is in lakh units.
    
Handle Null Values in Price Column:

    Identify that the price column has null values.
    
    Understand that deleting records with null values is not an option.
    
    Recognize the importance of the price column for machine learning purposes.
    
    Convert records with null values in the price column to validation data.

# Phase 3

EDA - Exploratory Data Analysis (Relationship Graphs)

We need to work with Power BI here.
