# Python and Jupyter Notebook for EDA

## Notebook shortcuts

* `DD` Delete cell
* `A` Insert cell above
* `B` Insert cell below
* `X` Cut cell
* `V` Paste cell
* `ENTER` Edit cell
* `CTRL+ENTER` Execute cell
* `SHIFT+ENTER` Execute cell and move to the next one

Run shell commands from a cell by prefixing them with `!` (e.g. `!free -h`, `!df -h`, `!pip list`)

## Data load and preparation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Load the CSV file into a pandas DataFrame

In [None]:
file_path = "car_price_dataset.csv"
df = pd.read_csv(file_path)

Check dataframe shape (dimensions)

See dataframe information
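A minimal sketch of the two checks above. The small DataFrame is hypothetical sample data standing in for `car_price_dataset.csv`, and the `Brand` column name is an assumption.

```python
import pandas as pd

# Hypothetical sample standing in for car_price_dataset.csv
df = pd.DataFrame({
    "Brand": ["Kia", "Audi", "Ford", "Kia"],  # assumed column name
    "Year": [2020, 2012, 2018, 2005],
    "Mileage": [28000, 150000, 61000, 230000],
    "Price": [12500, 8300, 10100, 2400],
})

# shape is an attribute (no parentheses): (number of rows, number of columns)
rows, cols = df.shape
print(f"{rows} rows x {cols} columns")

# info() prints column names, non-null counts, dtypes and memory usage
df.info()
```
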

Convert the brand, fuel type and door columns to categorical (astype category)

Look at data types after conversion (dtypes)

Compare memory usage after conversion
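The conversion and memory comparison might look like the sketch below. The data is a hypothetical sample, and `Brand`, `Fuel_Type` and `Doors` are assumed column names. The rows are repeated so that the saving is visible, since on very small frames the category overhead can outweigh the gain.

```python
import pandas as pd

# Hypothetical sample; Brand, Fuel_Type and Doors are assumed column names
df = pd.DataFrame({
    "Brand": ["Kia", "Audi", "Ford", "Kia"] * 1000,
    "Fuel_Type": ["Electric", "Diesel", "Petrol", "Hybrid"] * 1000,
    "Doors": [5, 4, 3, 5] * 1000,
    "Price": [12500, 8300, 10100, 2400] * 1000,
})

# deep=True also counts the memory held by the string values themselves
before = df.memory_usage(deep=True).sum()

# Convert the low-cardinality columns to the category dtype
cat_cols = ["Brand", "Fuel_Type", "Doors"]
df = df.astype({col: "category" for col in cat_cols})

after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(f"Memory usage: {before:,} bytes before, {after:,} bytes after")
```
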

## Data exploration

Look at the first rows of the dataframe

Look at the last rows of the dataframe
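As a rough illustration on hypothetical sample data, `head()` and `tail()` return the first and last rows. Both default to 5 rows and accept a row count.

```python
import pandas as pd

# Hypothetical sample standing in for the car price dataset
df = pd.DataFrame({
    "Year": [2020, 2012, 2018, 2005, 2021, 2015, 2010, 2019],
    "Price": [12500, 8300, 10100, 2400, 17900, 6200, 4100, 9400],
})

first_rows = df.head()   # first 5 rows by default
last_rows = df.tail(3)   # last 3 rows

print(first_rows)
print(last_rows)
```
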

Show descriptive statistics

In [None]:
# Adjust display precision (pandas set_option)
pd.set_option("display.precision", 2)
df.describe()

In [None]:
# Top 5 most expensive cars (nlargest: n, columns)

# Top 5 cheapest cars (nsmallest)
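One possible way to fill in the cell above, shown on hypothetical sample data. `nlargest` and `nsmallest` take the number of rows and the column to rank by.

```python
import pandas as pd

# Hypothetical sample standing in for the car price dataset
df = pd.DataFrame({
    "Year": [2020, 2012, 2018, 2005, 2021, 2015, 2019],
    "Price": [12500, 8300, 10100, 2400, 17900, 6200, 9400],
})

# Five rows with the highest Price, sorted from most to least expensive
top5 = df.nlargest(5, "Price")

# Five rows with the lowest Price, sorted from cheapest upwards
cheapest5 = df.nsmallest(5, "Price")

print(top5)
print(cheapest5)
```
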

In [None]:
# Describe single column

In [None]:
# Count not-null values

In [None]:
# Count values in a category column (value_counts)
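The three cells above (single-column summary, non-null counts, category counts) could be sketched as follows. The data is hypothetical, `Fuel_Type` is an assumed column name, and one price is deliberately missing to show how `count` treats nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; Fuel_Type is an assumed column name
df = pd.DataFrame({
    "Fuel_Type": ["Electric", "Diesel", "Petrol", "Diesel", "Diesel", "Petrol"],
    "Price": [12500, 8300, np.nan, 2400, 17900, 6200],
})

# Descriptive statistics for a single column
price_summary = df["Price"].describe()

# Number of non-null values per column (NaN is excluded)
non_null = df.count()

# Frequency of each category, most common first
fuel_counts = df["Fuel_Type"].value_counts()

print(price_summary)
print(non_null)
print(fuel_counts)
```
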

In [None]:
# Show rows 1 to 5, columns 3 to 4 (indexed location)
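A sketch of position-based selection with `iloc`, on hypothetical data with assumed column names. Positions are zero-based and the end of each slice is exclusive, so `1:6` covers positions 1 to 5 and `3:5` covers positions 3 and 4.

```python
import pandas as pd

# Hypothetical sample; column names and order are assumptions
df = pd.DataFrame({
    "Brand": ["Kia", "Audi", "Ford", "Kia", "BMW", "Audi", "Ford"],
    "Year": [2020, 2012, 2018, 2005, 2021, 2015, 2019],
    "Fuel_Type": ["Electric", "Diesel", "Petrol", "Diesel", "Hybrid", "Petrol", "Diesel"],
    "Transmission": ["Automatic", "Manual", "Manual", "Manual", "Automatic", "Automatic", "Manual"],
    "Mileage": [28000, 150000, 61000, 230000, 9000, 98000, 47000],
    "Price": [12500, 8300, 10100, 2400, 17900, 6200, 9400],
})

# Rows at positions 1..5 and columns at positions 3..4
subset = df.iloc[1:6, 3:5]
print(subset)
```
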

In [None]:
# Sort by Mileage (by, ascending)
sorted_data = df.sort_values(by='Mileage', ascending=True)
sorted_data.head()

In [None]:
# Filter cars with 5 doors and electric fuel type (Boolean indexing, [() & ()])
filtered_cars.head()
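A minimal sketch of the Boolean indexing pattern, assuming the columns are named `Doors` and `Fuel_Type` and that electric cars are labelled `"Electric"` (all three are assumptions about the dataset). Each condition needs its own parentheses because `&` binds more tightly than `==`.

```python
import pandas as pd

# Hypothetical sample; Doors, Fuel_Type and the "Electric" label are assumptions
df = pd.DataFrame({
    "Doors": [5, 4, 5, 3, 5],
    "Fuel_Type": ["Electric", "Electric", "Diesel", "Petrol", "Electric"],
    "Price": [12500, 8300, 10100, 2400, 17900],
})

# Combine two Boolean masks element-wise with & (not the Python keyword `and`)
filtered_cars = df[(df["Doors"] == 5) & (df["Fuel_Type"] == "Electric")]
print(filtered_cars.head())
```
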

Count cars by brand, fuel type and transmission ([[]], groupby, value counts)
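The count above can be produced in at least two equivalent ways, sketched here on hypothetical data with assumed column names `Brand`, `Fuel_Type` and `Transmission`.

```python
import pandas as pd

# Hypothetical sample; Brand and Fuel_Type are assumed column names
df = pd.DataFrame({
    "Brand": ["Kia", "Kia", "Audi", "Ford", "Audi"],
    "Fuel_Type": ["Electric", "Electric", "Diesel", "Petrol", "Diesel"],
    "Transmission": ["Automatic", "Automatic", "Manual", "Manual", "Automatic"],
})

cols = ["Brand", "Fuel_Type", "Transmission"]

# Option 1: select the columns with [[...]] and count unique combinations
counts_vc = df[cols].value_counts()

# Option 2: group by the same columns and take the size of each group
counts_gb = df.groupby(cols).size()

print(counts_vc)
print(counts_gb)
```

`value_counts` sorts by frequency, while `groupby(...).size()` sorts by the group keys, so the choice mostly depends on the ordering that is wanted.
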

In [None]:
# Feature engineering example (Age = current year - year)
df['Age'] = pd.Timestamp.now().year - df['Year']
df[['Year', 'Age', 'Price']].head()

## Data visualization

In [None]:
# Plot transmission distribution as pie chart
df["Transmission"].value_counts().plot(kind="pie")

In [None]:
# Show boxplot of prices
df.boxplot(column='Price', by='Year', vert=True)
plt.xticks(rotation=-45)
plt.show()

In [None]:
# Scatter plot of price against mileage, colored by price
sns.scatterplot(x='Mileage', y='Price', hue='Price', data=df)
plt.show()

In [None]:
# Histogram of prices with a kernel density estimate overlay
sns.histplot(df["Price"], bins=30, kde=True)
plt.show()

In [None]:
# Correlation heatmap (only numeric columns)
numeric_df = df.select_dtypes(include=[np.number])

# Compute the correlation matrix
corr = numeric_df.corr()
# Boolean mask hiding the redundant upper triangle (the matrix is symmetric)
upper_triangle_mask = np.triu(np.ones_like(corr, dtype=bool), k=1)

plt.figure(figsize=(10, 5))
sns.heatmap(corr, mask=upper_triangle_mask, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Feature Correlation Matrix")
plt.show()

In [None]:
# Filtering Examples
luxury_cars = df[df['Price'] > df['Price'].quantile(0.9)]
print("Luxury Cars (Top 10% by Price):")
luxury_cars.head()

In [None]:
# Time Series Analysis: Average Price Per Year
plt.figure(figsize=(10, 5))
df.groupby('Year')['Price'].mean().plot(marker='o', linestyle='-', color='red')
plt.title("Average Car Price Over the Years")
plt.xlabel("Year")
plt.ylabel("Average Price")
plt.grid()
plt.show()

In [None]:
# Price Analysis
plt.figure(figsize=(10, 5))
sns.boxplot(x=df['Year'], y=df['Price'])
plt.xticks(rotation=45)
plt.title("Price Trends Over the Years")
plt.show()

Export dataframe to pickle format


In [None]:
df.to_pickle('car_price.pkl')
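Unlike CSV, pickle keeps dtypes such as `category`, so the converted DataFrame can be reloaded without repeating the preparation steps. The round trip below is a sketch on hypothetical data written to a temporary directory; `Brand` is an assumed column name. Pickle files should only be loaded from trusted sources.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample; Brand is an assumed column name
df = pd.DataFrame({
    "Brand": ["Kia", "Audi", "Ford", "Kia"],
    "Price": [12500, 8300, 10100, 2400],
})
df["Brand"] = df["Brand"].astype("category")

# Write to a temporary location and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "car_price.pkl")
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# The category dtype survives the round trip
print(restored.dtypes)
```
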

## Key Insights and Conclusions

### Brand Impact on Price
- Some brands have significantly higher average prices, which suggests that brand is associated with car value.
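One way to check this kind of claim is to compare the mean price per brand. The sketch below uses hypothetical data and an assumed `Brand` column, and it also reports the number of cars per brand, because an average over very few cars is not reliable.

```python
import pandas as pd

# Hypothetical sample; Brand is an assumed column name
df = pd.DataFrame({
    "Brand": ["Audi", "Kia", "Ford", "Audi", "Kia"],
    "Price": [30000, 12000, 18000, 34000, 14000],
})

# Mean price and number of cars per brand, most expensive brand first
brand_price = (
    df.groupby("Brand")["Price"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(brand_price)
```
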

### Price Trends Over Time
- Newer cars generally have higher prices, while older cars show more price variation, plausibly due to depreciation and condition.

### Car Distribution by Brand
- A few brands dominate the dataset, while others have a relatively small representation.

### Data Considerations and Limitations
- Factors such as maintenance history or regional price differences are not captured in this dataset; including them could further refine these insights.

### Next Steps for Further Analysis
- Use predictive modeling to estimate car prices based on historical data.
