# Car Sales Analysis in the USA

## Introduction
In this project, we analyze car sales data in the US to identify trends in pricing, mileage, and other vehicle characteristics.  
We will:
- Perform an initial exploration of the dataset.
- Handle missing values by filling them with appropriate statistics.
- Remove outliers in model year and price to improve data quality.
- Visualize distributions and relationships between key variables.

The goal is to clean and prepare the dataset for further analysis and insights.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv("vehicles_us.csv")

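# Initial data overview
df.info()             # prints dtypes and non-null counts (returns None, so call it on its own)
print(df.head())      # first five rows
print(df.describe())  # summary statistics for the numeric columns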

## Initial Data Exploration - Summary
- The dataset contains **51,525 records** and **13 columns**.
- Missing values are found in **model_year, cylinders, odometer, paint_color, and is_4wd**.
- Prices vary significantly, with possible outliers that need to be handled.
- The dataset includes categorical and numerical features, requiring different preprocessing approaches.
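
The missing-value counts above can be checked with a small helper. This is a minimal sketch: `missing_summary` is a hypothetical helper name, and the `toy` frame is invented for illustration so the snippet runs without `vehicles_us.csv`. In the notebook you would call it on `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually contain gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy stand-in for df (made-up values, not taken from the real dataset)
toy = pd.DataFrame({
    "price": [9400, 25500, 5500, 1500],
    "model_year": [2011.0, np.nan, 2013.0, 2003.0],
    "is_4wd": [1.0, np.nan, np.nan, np.nan],
})
print(missing_summary(toy))
```

Reporting the percentage alongside the raw count makes it easier to decide whether a column should be filled, flagged, or dropped.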

## Handling Missing Values - Summary
- **model_year**: Filled missing values using the median year per model.
- **cylinders**: Filled missing values using the median cylinders per model.
- **odometer**: Filled missing values using the median mileage per model year.
- After these steps, `model_year`, `cylinders`, and `odometer` have far fewer missing values. `paint_color` and `is_4wd` are not filled in this step.


In [None]:
# Fill missing model_year values with the median year of the same model
df['model_year'] = df['model_year'].fillna(df.groupby('model')['model_year'].transform('median'))

# Fill missing cylinders values with the median cylinder count of the same model
df['cylinders'] = df['cylinders'].fillna(df.groupby('model')['cylinders'].transform('median'))

# Fill missing odometer values with the median mileage of the same model_year.
# fillna() only touches the gaps, so rows whose model_year is still missing keep their odometer reading.
df['odometer'] = df['odometer'].fillna(df.groupby('model_year')['odometer'].transform('median'))
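
The introduction lists outlier removal in model year and price as a goal. One common way to do this is an interquartile-range (IQR) filter, sketched below. The notebook does not say which rule was used, so treat the 1.5 × IQR threshold, the helper name `drop_iqr_outliers`, and the `toy` data as assumptions for illustration only.

```python
import pandas as pd


def drop_iqr_outliers(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    in_range = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    # Rows with a missing value in `column` are kept rather than silently dropped
    return frame[in_range | frame[column].isna()]


# Toy stand-in for df: one extreme price and one extreme model year (made-up values)
toy = pd.DataFrame({
    "price": [5000, 7000, 9000, 11000, 13000, 375000, 8000],
    "model_year": [2008, 2010, 2012, 2014, 2016, 2015, 1908],
})

cleaned = toy
for col in ["price", "model_year"]:
    cleaned = drop_iqr_outliers(cleaned, col)

print(len(toy), "rows before,", len(cleaned), "rows after")
```

The filters are applied one after the other, so the quartiles for `model_year` are computed on the rows that survived the price filter. Computing both sets of bounds on the original frame first is an equally reasonable choice.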

## Saving Cleaned Dataset

After cleaning, we save the dataset as `vehicles_us_cleaned.csv`.  
This file will be used for further analysis and visualization.


In [None]:
# Save cleaned dataset
df.to_csv("vehicles_us_cleaned.csv", index=False)

## Data Visualization

To better understand the data, we visualize:
- **Price Distribution**: A histogram showing how car prices are distributed.
- **Price vs Mileage**: A scatter plot to examine the relationship between car price and mileage.


In [None]:
# Histogram of car prices
plt.figure(figsize=(10,5))
sns.histplot(data=df, x="price", bins=50, kde=True)
plt.title("Distribution of Car Prices")
plt.xlabel("Price ($)")
plt.ylabel("Count")
plt.show()

# Scatter plot: Price vs Odometer
plt.figure(figsize=(10,5))
sns.scatterplot(data=df, x="odometer", y="price", alpha=0.5)
plt.title("Price vs Odometer")
plt.xlabel("Odometer (miles)")
plt.ylabel("Price ($)")
plt.show()
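
A scatter plot shows the shape of the price and mileage relationship, and a correlation coefficient puts a number on it. The sketch below computes Pearson's r and a rank-based (Spearman) correlation. The helper name `price_mileage_correlation` and the `toy` values are hypothetical, and in the notebook the function would be called on `df`.

```python
import pandas as pd


def price_mileage_correlation(frame: pd.DataFrame) -> dict:
    """Return Pearson and Spearman correlations between price and odometer."""
    pair = frame[["price", "odometer"]].dropna()
    pearson = pair["price"].corr(pair["odometer"])
    # Spearman is Pearson on the ranks; it is less sensitive to extreme prices
    spearman = pair["price"].rank().corr(pair["odometer"].rank())
    return {"pearson": pearson, "spearman": spearman}


# Toy stand-in for df: price falls steadily as mileage rises (made-up values)
toy = pd.DataFrame({
    "odometer": [20000, 60000, 100000, 140000, 180000],
    "price": [24000, 18000, 12000, 9000, 4000],
})
result = price_mileage_correlation(toy)
print(result)
```

A clearly negative value supports the reading that higher mileage goes with lower prices, while a large gap between the two coefficients usually points to outliers or a non-linear relationship.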

# Final Summary

## Key Findings:
- Most cars in the dataset are priced **below $20,000**.
- Older vehicles generally have higher mileage, but some high-priced vehicles have unexpectedly high mileage.
- SUVs and sedans dominate the dataset, and condition impacts price significantly.
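
The findings above can be checked directly against the data. The sketch below assumes the body style is stored in a column named `type` and the condition in a column named `condition`. Neither column name is confirmed by the code in this notebook, and the `toy` values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for df (made-up values; column names are assumptions)
toy = pd.DataFrame({
    "price": [4500, 8900, 15000, 21000, 32000, 6000],
    "type": ["sedan", "SUV", "SUV", "truck", "SUV", "sedan"],
    "condition": ["fair", "good", "excellent", "excellent", "like new", "good"],
})

# Share of listings priced below $20,000
share_below_20k = (toy["price"] < 20000).mean()

# Share of each vehicle type, largest first
type_share = toy["type"].value_counts(normalize=True)

# Median price per condition, cheapest first
median_price_by_condition = toy.groupby("condition")["price"].median().sort_values()

print(f"Share below $20,000: {share_below_20k:.1%}")
print(type_share)
print(median_price_by_condition)
```

Medians are used instead of means so that a few very expensive listings do not dominate the comparison between conditions.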

## Next Steps:
- Further analysis of specific models and trends over time.
- More detailed feature engineering, such as creating age-based price adjustments.
- Applying machine learning models for price prediction.
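
As a starting point for the feature-engineering and modeling steps, the sketch below derives a vehicle age and fits a one-variable linear baseline with NumPy. It assumes a `date_posted` column holding the listing date, which does not appear in the code above. The `vehicle_age` feature name and the `toy` values are also hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (made-up values; date_posted is an assumed column)
toy = pd.DataFrame({
    "price": [22000, 18000, 14000, 10000, 6000],
    "model_year": [2017, 2015, 2013, 2011, 2009],
    "date_posted": ["2019-01-15", "2019-03-02", "2019-05-20", "2019-07-11", "2019-09-30"],
})

# Age in years at the time the vehicle was listed
toy["date_posted"] = pd.to_datetime(toy["date_posted"])
toy["vehicle_age"] = toy["date_posted"].dt.year - toy["model_year"]

# Least-squares line: price ≈ slope * vehicle_age + intercept
slope, intercept = np.polyfit(toy["vehicle_age"].to_numpy(), toy["price"].to_numpy(), deg=1)
print(f"Estimated price change per year of age: {slope:.0f}")
print(f"Estimated price at age 0: {intercept:.0f}")
```

A single-feature line like this is only a baseline. Its error gives a reference point for judging whether a richer model that includes mileage, condition, and vehicle type actually improves the predictions.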
