# Exploratory Data Analysis (EDA) of Cars: A Comprehensive Student Guide

Welcome to this in-depth exploration of our car dataset! We'll use Python, pandas, and matplotlib to uncover hidden patterns, relationships, and insights.

**What is EDA?**

Exploratory Data Analysis (EDA) is a crucial first step in any data science project. It's about getting to know your data intimately, understanding its structure, identifying potential issues, and formulating questions for further analysis.

### Setting Up

Let's ensure we have the necessary tools ready:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

### Loading the Data

Our dataset is stored in an Excel file (`.xlsx`). We'll use pandas to read it into a DataFrame, a convenient table-like structure.

In [None]:
# Load the data
df = pd.read_excel('/content/automobile_dataset.xlsx')

### First Look: Peeking at the Data

Before we dive deep, let's get a quick overview of our dataset.

In [None]:
# Display the first 5 rows
# (to_markdown relies on the optional `tabulate` package being installed)
print(df.head().to_markdown(index=False, numalign="left", stralign="left"))

# Print column names and their data types
df.info()



**Key Questions to Explore:**

* What are the different car models in our dataset?
* What range of prices and mileage do we have?
* Are there any missing values?
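
One way to start on these questions is sketched below on a small made-up DataFrame, so that it runs by itself. The column names `Make and model` and `Mileage (mpg)` are assumptions about the real file's headers (only `Price` is named in this guide), and the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset. The values, and the column names
# 'Make and model' and 'Mileage (mpg)', are assumptions for illustration.
toy = pd.DataFrame({
    'Make and model': ['Car A', 'Car B', 'Car C', 'Car D'],
    'Price': [4099, 4749, 3799, 11995],
    'Mileage (mpg)': [22, 17, 22, 17],
})

# How many distinct car models are there?
n_models = toy['Make and model'].nunique()

# Smallest and largest value of price and mileage
ranges = toy[['Price', 'Mileage (mpg)']].agg(['min', 'max'])

# Number of missing values in each column
missing = toy.isnull().sum()

print(f"Distinct models: {n_models}")
print(ranges)
print(missing)
```

On the real data, replace `toy` with `df` and adjust the column names to whatever `df.info()` reports.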

### Data Cleaning: Tidying Up

Real-world datasets often have missing or inconsistent values. Let's check for missing values and decide how to handle them.

In [None]:
# Check for missing values
print(df.isnull().sum())


**Handling Missing Values (Optional):**

* The column `Repair record 1978` has 5 missing values. Since it's a small number compared to the dataset size, we'll drop the rows with missing values for simplicity.

In [None]:
# Drop rows with missing values in 'Repair record 1978'
df.dropna(subset=['Repair record 1978'], inplace=True)
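
Before dropping rows for real, it can help to see how many rows would be lost. The sketch below does this on made-up data and uses the non-`inplace` form of `dropna`, which returns a new DataFrame and leaves the original untouched. The variable names `toy`, `cleaned`, and `rows_dropped` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data with two missing repair records (made-up values)
toy = pd.DataFrame({
    'Price': [4099, 4749, 3799, 4816, 7827],
    'Repair record 1978': [3, np.nan, 3, np.nan, 4],
})

rows_before = len(toy)

# dropna returns a new DataFrame; `toy` itself is not modified
cleaned = toy.dropna(subset=['Repair record 1978'])

rows_dropped = rows_before - len(cleaned)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```

Keeping the original DataFrame around makes it easy to change your mind later, for example to fill the gaps instead of dropping them.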

### Summary Statistics: The Big Picture

Let's calculate summary statistics for the numerical columns to get a sense of central tendencies and spread.

In [None]:
# Calculate descriptive statistics
print(df.describe().to_markdown(numalign="left", stralign="left"))


---

### Analyzing Car Origins and Prices

So far we have looked only at numerical variables. Now we turn to the categorical variable `Car origin` and its relationship with the numerical variable `Price`. We'll group by origin and aggregate to get the mean and median price for each group, then draw a box plot to compare the price distributions.

In [None]:
# Group by `Car origin` and calculate mean and median `Price`
price_stats_by_origin = df.groupby('Car origin')['Price'].agg(['mean', 'median'])

# Print the results
print("Price Statistics by Car Origin:")
print(price_stats_by_origin.to_markdown(numalign="left", stralign="left"))

**Key Insights:**

* On average, foreign cars are slightly more expensive than domestic cars.
* The median price of foreign cars is notably higher than that of domestic cars.

#### Visualizing Price Distribution by Origin

Let's create a box plot to visualize the distribution of prices across car origins.

In [None]:
# Create a box plot to display the distribution of `Price` across different `Car origin` categories
# Prepare data for the boxplot
origins = df['Car origin'].unique()
data = [df[df['Car origin'] == origin]['Price'] for origin in origins]

# Create the figure and axes
plt.figure(figsize=(8, 6))  # Adjust the figure size as needed
ax = plt.axes()

# Create the boxplot
bp = ax.boxplot(data, patch_artist=True)  # patch_artist fills the boxes

# Customize the box colors
colors = ['skyblue', 'salmon']  # Colors for each car origin
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)

# Add labels and title
plt.title('Price Distribution by Car Origin')
plt.ylabel('Price')
plt.xlabel('Car Origin')
plt.xticks(range(1, len(origins) + 1), origins) # Set x-ticks to match the origins

# Show the plot
plt.show()

**Key Insights:**

* The box plot confirms that foreign cars generally have higher prices.
* Domestic cars exhibit a wider range of prices, from very affordable to moderately expensive.
* Foreign cars have a more clustered price distribution, mostly in the mid-to-high range.
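
Statements about "wider range" or "more clustered" can also be checked with numbers. The sketch below computes the range and the interquartile range (IQR) of price per origin on made-up data. The labels `Domestic` and `Foreign` and the helper `iqr` are assumptions for illustration, and the toy values are not taken from the real dataset.

```python
import pandas as pd

# Toy data (made-up prices); origin labels are assumed to be
# 'Domestic' and 'Foreign'
toy = pd.DataFrame({
    'Car origin': ['Domestic', 'Domestic', 'Domestic', 'Domestic',
                   'Foreign', 'Foreign', 'Foreign', 'Foreign'],
    'Price': [3000, 4000, 6000, 15000, 5000, 6000, 7000, 8000],
})

def iqr(s):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return s.quantile(0.75) - s.quantile(0.25)

# One row per origin with min, max, and IQR of price
spread = toy.groupby('Car origin')['Price'].agg(['min', 'max', iqr])

# Full range = max - min
spread['range'] = spread['max'] - spread['min']
print(spread)
```

The IQR is the height of the box in a box plot, so a larger IQR corresponds to a taller box.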

**Student Challenge:**

* Create a categorical variable that labels each car's mileage as high or low.
* Look at the distribution of car prices for high-mileage and low-mileage cars.
* What conclusions can you draw?
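
As a possible starting point, the sketch below splits mileage at its median on made-up data. The column name `Mileage (mpg)`, the new column `Mileage level`, and the choice of the median as the cutoff are all assumptions; another threshold may suit the real data better.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values); 'Mileage (mpg)' is an assumed column name
toy = pd.DataFrame({
    'Price': [4099, 4749, 3799, 11995, 5899, 4499],
    'Mileage (mpg)': [22, 17, 30, 14, 18, 28],
})

# Split at the median: above it counts as 'high', at or below as 'low'
cutoff = toy['Mileage (mpg)'].median()
toy['Mileage level'] = np.where(toy['Mileage (mpg)'] > cutoff, 'high', 'low')

# Compare prices between the two groups
price_by_level = (
    toy.groupby('Mileage level')['Price']
       .agg(['mean', 'median', 'count'])
)
print(price_by_level)
```

The `count` column is worth keeping: if one group is much smaller than the other, its mean and median are less reliable.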



### Impact of Repair Records on Price

Let's investigate how repair records might influence car prices. We'll focus on cars with a repair record of 3.

In [None]:
# Filter the DataFrame to include only cars with a `Repair record 1978` value of 3
filtered_df = df[df['Repair record 1978'] == 3]

# Calculate the mean and median `Price` for this filtered dataset
mean_price_filtered = filtered_df['Price'].mean()
median_price_filtered = filtered_df['Price'].median()

# Print the results
print("\nPrice Statistics for Cars with Repair Record of 3:")
print(f"Mean Price: {mean_price_filtered:.2f}")
print(f"Median Price: {median_price_filtered:.2f}")

Price Statistics for Cars with Repair Record of 3:
Mean Price: 6429.23
Median Price: 4741.00

**Key Insights:**

* Cars with a repair record of 3 have a mean price of \$6429.23 and a median price of \$4741.00.

**Student Challenge:**

1.  Repeat the above analysis for cars with different repair record values (e.g., 1, 2, 4, 5).
2.  Compare the mean and median prices across different repair records.
3.  Do you notice any trends? Does the repair record seem to affect the price significantly?
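
Rather than filtering once per repair value, all values can be summarized in one step with `groupby`. This is a sketch on made-up data, not the result for the real dataset; `price_by_repair` is an illustrative name.

```python
import pandas as pd

# Toy data (made-up values)
toy = pd.DataFrame({
    'Repair record 1978': [1, 2, 3, 3, 4, 4, 5, 5],
    'Price': [4000, 5000, 4500, 8500, 6000, 6200, 5800, 6000],
})

# Mean, median, and number of cars for each repair record value
price_by_repair = (
    toy.groupby('Repair record 1978')['Price']
       .agg(['mean', 'median', 'count'])
)
print(price_by_repair)
```

A large gap between mean and median within a group usually means a few expensive cars are pulling the mean up.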

**Student Challenge:** Price vs. Weight Showdown

Do heavier cars tend to cost more?

Create a scatter plot to visualize the relationship between `Price` and `Weight (lbs.)`.


Hint: Use Matplotlib's scatter function and add labels and a title. Calculate and print the correlation coefficient.
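
The hint could be followed roughly as below. The data is made up so the cell runs without the real file; swap in `df` for `toy` to apply it to the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (made-up values)
toy = pd.DataFrame({
    'Weight (lbs.)': [2000, 2500, 3000, 3500, 4000],
    'Price': [4000, 4500, 5000, 7000, 9000],
})

# Scatter plot with weight on the x-axis and price on the y-axis
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(toy['Weight (lbs.)'], toy['Price'])
ax.set_xlabel('Weight (lbs.)')
ax.set_ylabel('Price')
ax.set_title('Price vs. Weight')

# Pearson correlation coefficient between the two columns
r = toy['Weight (lbs.)'].corr(toy['Price'])
print(f"Correlation: {r:.2f}")
```

In a notebook the figure is displayed when the cell finishes; in a plain script, add `plt.show()` at the end. A coefficient near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one.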


**Student Challenge:** Headroom for Tall Drivers

Identify the top 3 car models with the most headroom (column `Headroom (in.)`).

Hint: Sort the DataFrame by `Headroom (in.)` in descending order and select the top 3 rows.
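
A minimal sketch of the sort-and-select pattern from the hint, on made-up data. The column name `Make and model` is an assumption about the real file.

```python
import pandas as pd

# Toy data (made-up values); 'Make and model' is an assumed column name
toy = pd.DataFrame({
    'Make and model': ['Car A', 'Car B', 'Car C', 'Car D', 'Car E'],
    'Headroom (in.)': [2.5, 4.0, 3.0, 4.5, 3.5],
})

# Sort from most to least headroom, then keep the first 3 rows
top3 = toy.sort_values('Headroom (in.)', ascending=False).head(3)
print(top3)
```

If several cars tie for third place, `head(3)` keeps only three rows. `toy.nlargest(3, 'Headroom (in.)', keep='all')` is an alternative that keeps every tied row.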

**Student Challenge:** Trunk Space Comparison by Origin

Compare the distribution of trunk space (column `Trunk space (cu. ft.)`) between domestic and foreign cars using box plots.

Hint: Filter the DataFrame for each origin, then use Matplotlib's boxplot function.
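
Instead of writing one filter per origin, the per-origin groups can be collected with `groupby`. The sketch below prepares the inputs for a box plot on made-up data. The origin labels and the names `groups`, `labels`, `data`, and `medians` are illustrative assumptions.

```python
import pandas as pd

# Toy data (made-up values); origin labels are assumed
toy = pd.DataFrame({
    'Car origin': ['Domestic', 'Domestic', 'Domestic',
                   'Foreign', 'Foreign', 'Foreign'],
    'Trunk space (cu. ft.)': [11, 16, 21, 8, 11, 14],
})

# One Series of trunk-space values per origin
groups = {
    origin: group['Trunk space (cu. ft.)']
    for origin, group in toy.groupby('Car origin')
}

labels = list(groups)         # one label per box
data = list(groups.values())  # one Series per box, same order as labels

# The median is the line drawn inside each box
medians = {origin: s.median() for origin, s in groups.items()}
print(medians)
```

`data` can then be passed to Matplotlib's `boxplot` function, with `labels` used for the x-axis ticks, in the same way as the price box plot earlier in this guide.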

**Student Challenge:** Displacement and Gear Ratio Exploration

Analyze the relationship between `Displacement (cu. in.)` and `Gear ratio`. Is there a noticeable pattern?

Hint: Create a scatter plot and calculate the correlation coefficient.