
# Exercise 8-2: Analyze the Cars Data

## 🔑 INSTRUCTOR SOLUTION KEY

**CAP3321C - Data Wrangling**

---

## Import the Data

In [None]:
import pandas as pd
import seaborn as sns

In [None]:
# Download the data file from GitHub
!wget -q https://raw.githubusercontent.com/c-marq/CAP3321C-Data-Wrangling/main/data/cars.pkl
print("Data file downloaded successfully!")

In [None]:
# Load the cars data
cars = pd.read_pickle('cars.pkl')
print("Data shape:", cars.shape)

### Task 4: Display the First Five Rows

In [None]:
# ✅ SOLUTION
cars.head()

---

## Part 1: Melt the Data

### Task 5: Melt the enginesize and curbweight Columns

In [None]:
# ✅ SOLUTION
cars_melted = cars.melt(
    id_vars=['price'],
    value_vars=['enginesize', 'curbweight'],
    var_name='feature'
)
cars_melted.head(10)

#### 📝 Instructor Notes - Task 5

**Key Teaching Points:**
- `melt()` transforms wide data to long format
- `id_vars` = columns to keep as-is (identifiers)
- `value_vars` = columns to unpivot into rows
- `var_name` = name for the column holding the old column names
- Default `value_name` is 'value'

**Why melt?**
- Makes it easy to create faceted plots (one plot per feature)
- Required by some plotting functions that expect long format
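
The parameters above can be checked on a tiny frame before running them on the full dataset. This is a minimal sketch: `toy` and its values are made up for illustration and are not taken from `cars.pkl`.

```python
import pandas as pd

# Hypothetical three-row frame; the values are invented for illustration.
toy = pd.DataFrame({
    'price': [5000, 9000, 15000],
    'enginesize': [90, 110, 150],
    'curbweight': [1900, 2300, 2900],
})

toy_melted = toy.melt(
    id_vars=['price'],
    value_vars=['enginesize', 'curbweight'],
    var_name='feature',
)

# Each input row yields one output row per melted column: 3 rows x 2 columns = 6 rows.
print(toy_melted.shape)

# No value_name was passed, so the measurements land in a column named 'value'.
print(list(toy_melted.columns))
```

Students can compare the printed column names with the `x='value'` argument used in Task 6.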

### Task 6: Create a Scatterplot for the Melted Data

In [None]:
# ✅ SOLUTION
sns.relplot(
    data=cars_melted,
    x='value',
    y='price',
    col='feature',
    facet_kws={'sharex': False}
)

#### 📝 Instructor Notes - Task 6

**Key Teaching Points:**
- `col='feature'` creates separate plots for enginesize and curbweight
- `facet_kws={'sharex': False}` allows independent x-axes
  - Important because enginesize and curbweight have very different scales!
- Both show positive correlation with price

**Discussion:** Which feature has a stronger correlation with price?
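
One way to back up the discussion with numbers is to compute the Pearson correlation of each feature with price. The sketch below uses a hypothetical `toy` frame with invented values, so it says nothing about the real cars data; in class the same `corrwith()` call could be run on `cars` instead.

```python
import pandas as pd

# Hypothetical data; replace `toy` with `cars` to answer the question for real.
toy = pd.DataFrame({
    'price': [5000, 9000, 15000, 21000],
    'enginesize': [90, 110, 150, 180],
    'curbweight': [1900, 2600, 2500, 3100],
})

# corrwith() correlates each selected column with the given Series.
correlations = toy[['enginesize', 'curbweight']].corrwith(toy['price'])
print(correlations.sort_values(ascending=False))
```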

---

## Part 2: Rank the Data by Price

### Task 7: Add a priceRank Column

In [None]:
# ✅ SOLUTION
cars['priceRank'] = cars['price'].rank()
cars[['price', 'priceRank']].head()

#### 📝 Instructor Notes - Task 7

**Key Teaching Points:**
- `rank()` assigns 1 to smallest value, n to largest
- Ties are handled by averaging ranks by default
- Other methods: `method='min'`, `method='max'`, `method='first'`
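
The tie-handling methods are easiest to compare side by side. The following sketch uses a hypothetical four-value Series with one tie; the prices are illustrative only.

```python
import pandas as pd

# Hypothetical prices with one tie (two cars at 6695).
prices = pd.Series([5000, 6695, 6695, 7000])

ranks = pd.DataFrame({
    'price': prices,
    'average': prices.rank(),              # default: tied rows share the mean rank
    'min': prices.rank(method='min'),      # tied rows get the lowest rank in the tie
    'max': prices.rank(method='max'),      # tied rows get the highest rank in the tie
    'first': prices.rank(method='first'),  # ties broken by order of appearance
})
print(ranks)
```

The two tied rows receive 2.5 under the default, 2 under `'min'`, 3 under `'max'`, and 2 and 3 under `'first'`.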

### Task 8: Display the Ten Lowest-Priced Rows

In [None]:
# ✅ SOLUTION
cars.nsmallest(10, 'price')[['price', 'priceRank']]

# Note: Rows 8 and 9 have the same price (6695) and their ranks are averaged (8.5)

#### 📝 Instructor Notes - Task 8

**Key Teaching Points:**
- `nsmallest(10, 'price')` is more efficient than `sort_values().head(10)`
- Notice the tied ranks (e.g., 8.5 for two cars at same price)
- This is the default 'average' method for handling ties

**Acceptable Variations:**
```python
cars.sort_values('price').head(10)
```
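
To show students that the two approaches select the same rows, a small check like the one below may help. It is a sketch on a hypothetical `toy` frame with no tied prices; with ties, the order of the tied rows can differ between the two calls.

```python
import pandas as pd

# Hypothetical prices with no ties.
toy = pd.DataFrame({'price': [9000, 5000, 15000, 7000, 12000]})

via_nsmallest = toy.nsmallest(3, 'price')
via_sort = toy.sort_values('price').head(3)

# Both keep the original index labels, so the rows can be compared directly.
print(via_nsmallest)
print(via_sort)
```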

---

## Part 3: Bin the Data with Quantiles

### Task 9: Create Price Bins with qcut()

In [None]:
# ✅ SOLUTION
cars['priceGrade'] = pd.qcut(
    cars['price'],
    q=3,
    labels=['low', 'medium', 'high']
)
cars[['price', 'priceGrade']].head(10)

#### 📝 Instructor Notes - Task 9

**Key Teaching Points:**
- `qcut()` creates equal-SIZE bins (roughly the same number of items in each)
- `cut()` creates equal-WIDTH bins (same range in each)
- `q=3` means tertiles (33.3% in each bin)
- `labels` must contain exactly `q` entries, one per bin

**Comparison:**
- `cut()` - fixed bin edges, unequal counts
- `qcut()` - variable bin edges, equal counts
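
The difference shows up clearly on skewed data. This is a minimal sketch with a hypothetical `prices` Series containing one large outlier; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical right-skewed prices: one value far above the rest.
prices = pd.Series([5, 6, 7, 8, 9, 10, 12, 15, 40])
labels = ['low', 'medium', 'high']

by_quantile = pd.qcut(prices, q=3, labels=labels)  # edges chosen from the data's quantiles
by_width = pd.cut(prices, bins=3, labels=labels)   # three intervals of equal width

print(by_quantile.value_counts().sort_index())  # 3 / 3 / 3
print(by_width.value_counts().sort_index())     # 8 / 0 / 1
```

With `cut()`, the outlier stretches the range so far that the middle bin ends up empty, while `qcut()` still puts three values in each bin.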

### Task 10: Display Value Counts for priceGrade

In [None]:
# ✅ SOLUTION
cars['priceGrade'].value_counts()

# Note: Counts should be roughly equal (~68 each for 205 total cars)

---

## Part 4: Group and Aggregate the Data

### Task 11: Group by priceGrade and Show Min/Max

In [None]:
# ✅ SOLUTION
cars.groupby('priceGrade')['price'].agg(['min', 'max'])

#### 📝 Instructor Notes - Task 11

**Key Teaching Points:**
- `agg()` accepts a list of aggregation functions
- Shows the price ranges for each grade
- Notice that the max of 'low' sits just below the min of 'medium': `qcut()` derives its bin edges from quantiles of the data and each bin includes its right edge, so the grades never overlap
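
The min/max table can be reproduced on a small example to check that the grade ranges do not overlap. The sketch below assumes a hypothetical `toy` frame with invented prices. It also passes `observed=True`, which is not part of the solution above; it is added here because `priceGrade` is categorical and recent pandas versions warn when `observed` is left unspecified.

```python
import pandas as pd

# Hypothetical prices; values are invented for illustration.
toy = pd.DataFrame({'price': [5, 6, 7, 8, 9, 10, 12, 15, 40]})
toy['priceGrade'] = pd.qcut(toy['price'], q=3, labels=['low', 'medium', 'high'])

# A list passed to agg() produces one output column per function.
ranges = toy.groupby('priceGrade', observed=True)['price'].agg(['min', 'max'])
print(ranges)
```

Each grade's `max` is below the next grade's `min`, so every price belongs to exactly one grade.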

### Task 12: Group by carbody and aspiration, Get Average Price

In [None]:
# ✅ SOLUTION
cars.groupby(['carbody', 'aspiration'])['price'].mean()

#### 📝 Instructor Notes - Task 12

**Key Teaching Points:**
- Returns a Series with MultiIndex
- Shows average price for each carbody/aspiration combination
- Turbo cars are generally more expensive than standard

### Task 13: Unstack the aspiration Column

In [None]:
# ✅ SOLUTION
cars.groupby(['carbody', 'aspiration'])['price'].mean().unstack('aspiration')

#### 📝 Instructor Notes - Task 13

**Key Teaching Points:**
- `unstack()` pivots an index level to columns
- Now easy to compare std vs turbo side-by-side
- NaN appears if a combination doesn't exist
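
The NaN behaviour is easy to demonstrate with a frame in which one combination is missing. This is a sketch on a hypothetical `toy` frame; the rows are invented and contain no turbo convertible.

```python
import pandas as pd

# Hypothetical rows: there is no convertible with turbo aspiration.
toy = pd.DataFrame({
    'carbody': ['sedan', 'sedan', 'sedan', 'convertible'],
    'aspiration': ['std', 'std', 'turbo', 'std'],
    'price': [8000, 10000, 15000, 20000],
})

mean_price = toy.groupby(['carbody', 'aspiration'])['price'].mean()
print(mean_price)   # Series with a two-level MultiIndex (3 entries)

wide = mean_price.unstack('aspiration')
print(wide)         # carbody rows, std/turbo columns; convertible/turbo is NaN
```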

### Task 14: Use pivot_table() for the Same Result

In [None]:
# ✅ SOLUTION
price_by_body_aspiration = cars.pivot_table(
    index='carbody',
    columns='aspiration',
    values='price',
    aggfunc='mean'
)
price_by_body_aspiration

#### 📝 Instructor Notes - Task 14

**Key Teaching Points:**
- `pivot_table()` does groupby + agg + unstack in one step
- Same result as Task 13 but cleaner code
- Default `aggfunc` is 'mean', but good practice to specify it

**Discussion:** When to use groupby+unstack vs pivot_table?
- pivot_table is more concise for this exact use case
- groupby is more flexible for complex aggregations
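
Both points can be illustrated on a small frame. The sketch below uses a hypothetical `toy` frame with invented values; the column names `mean_price` and `price_spread` are made up for this example.

```python
import pandas as pd

# Hypothetical rows; values are invented for illustration.
toy = pd.DataFrame({
    'carbody': ['sedan', 'sedan', 'sedan', 'convertible'],
    'aspiration': ['std', 'std', 'turbo', 'std'],
    'price': [8000, 10000, 15000, 20000],
})

# Two routes to the same carbody-by-aspiration table of mean prices.
via_groupby = toy.groupby(['carbody', 'aspiration'])['price'].mean().unstack('aspiration')
via_pivot = toy.pivot_table(index='carbody', columns='aspiration',
                            values='price', aggfunc='mean')
print(via_groupby)
print(via_pivot)

# groupby also accepts named aggregations, including custom functions.
summary = toy.groupby('carbody')['price'].agg(
    mean_price='mean',
    price_spread=lambda s: s.max() - s.min(),
)
print(summary)
```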

### Task 15: Create a Bar Chart

In [None]:
# ✅ SOLUTION
price_by_body_aspiration.plot(kind='bar', title='Average Price by Body Type and Aspiration')

#### 📝 Instructor Notes - Task 15

**Key Teaching Points:**
- Pandas DataFrames have built-in plotting via `.plot()`
- `kind='bar'` creates grouped bar chart
- Columns become different colored bars
- Quick visualization without needing seaborn/matplotlib setup

**Acceptable Variations:**
```python
price_by_body_aspiration.plot.bar()
```

---

## Summary

In this exercise, you practiced data analysis techniques:

**Melting Data:**
- `melt()` - Reshape from wide to long format
- Useful for creating faceted plots

**Ranking:**
- `rank()` - Assign rank values (handles ties by averaging)

**Binning with Quantiles:**
- `pd.qcut()` - Create equal-sized bins based on data distribution
- Compare to `pd.cut()` which uses fixed bin edges

**Grouping and Aggregating:**
- `groupby().agg(['min', 'max'])` - Multiple aggregations
- `groupby().mean().unstack()` - Reshape grouped results
- `pivot_table()` - Combine grouping and reshaping

**Visualization:**
- `df.plot(kind='bar')` - Quick bar charts from DataFrames