<a href="https://colab.research.google.com/github/c-marq/CAP3321C-Data-Wrangling/blob/main/exercises/chapter-08/exercise_8_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 8-2: Analyze the Cars Data

**CAP3321C - Data Wrangling**

---

## Overview

This exercise guides you through the process of analyzing the Cars data. You'll practice melting data, ranking, binning with quantiles, grouping and aggregating, and creating visualizations.

**Group Members:**
- Name 1:
- Name 2:
- Name 3:
- Name 4:

---

## Import the Data

In [None]:
import pandas as pd
import seaborn as sns

In [None]:
# Download the data file from GitHub
!wget -q https://raw.githubusercontent.com/c-marq/CAP3321C-Data-Wrangling/main/data/cars.pkl
print("Data file downloaded successfully!")

In [None]:
# Load the cars data
cars = pd.read_pickle('cars.pkl')
print("Data shape:", cars.shape)

### Task 4: Display the First Five Rows (YOUR CODE)

Display the first five rows of the DataFrame.

**Expected output:** Columns include symboling, fueltype, aspiration, carbody, price, enginesize, curbweight, etc.

In [None]:
# YOUR CODE HERE - display the first 5 rows


---

## Part 1: Melt the Data

Practice reshaping data from wide to long format.

### Task 5: Melt the enginesize and curbweight Columns (YOUR CODE)

Use the `melt()` method to combine the `enginesize` and `curbweight` columns. Name the new variable column `feature` and use the default name of `value` for the value column.

**Hint:** 
- `id_vars` = columns to keep as identifiers (like 'price')
- `value_vars` = columns to melt ('enginesize', 'curbweight')
- `var_name` = name for the new variable column

**Example syntax:**
```python
df_melted = df.melt(
    id_vars=['price'],
    value_vars=['enginesize', 'curbweight'],
    var_name='feature'
)
```

**Expected output:** Long-format DataFrame with 'feature' and 'value' columns
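
To show what wide-to-long reshaping does before you apply it to the cars data, here is a minimal sketch on a made-up two-row DataFrame. The `toy` frame, its `length` and `width` columns, and the `dimension` variable name are all hypothetical and are not part of the cars dataset.

```python
import pandas as pd

# Hypothetical toy data (not the cars dataset)
toy = pd.DataFrame({
    'price': [100, 200],
    'length': [10, 20],
    'width': [5, 6],
})

# Keep 'price' as the identifier and stack 'length' and 'width' into
# two columns: 'dimension' (the old column name) and 'value' (its value)
toy_long = toy.melt(
    id_vars=['price'],
    value_vars=['length', 'width'],
    var_name='dimension',
)

print(toy_long)
```

Each original row appears once per melted column, so two rows and two melted columns give four rows in the long format.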

In [None]:
# YOUR CODE HERE - melt enginesize and curbweight columns


### Task 6: Create a Scatterplot for the Melted Data (YOUR CODE)

Use the seaborn `relplot()` function to create a scatterplot of `price` against `value` from the melted data. Use the `col` parameter to create a separate plot for each feature, and use the `facet_kws` parameter to give each subplot an independent x-axis.

**Hint:**
- x = 'value', y = 'price'
- `col='feature'` creates separate plots
- `facet_kws={'sharex': False}` gives independent x-axes

**Example syntax:**
```python
sns.relplot(data=df_melted, x='value', y='price',
            col='feature', facet_kws={'sharex': False})
```

**Expected output:** Two scatterplots showing enginesize vs price and curbweight vs price

In [None]:
# YOUR CODE HERE - create scatterplot for melted data


---

## Part 2: Rank the Data by Price

Practice using the rank() method.

### Task 7: Add a priceRank Column (YOUR CODE)

Use the `rank()` method to add a `priceRank` column that ranks each row by the price value.

**Hint:** Use `rank()` on the price column.

**Example syntax:**
```python
df['priceRank'] = df['price'].rank()
```

**Expected output:** New column with rank values (1 = lowest price)

In [None]:
# YOUR CODE HERE - add priceRank column


### Task 8: Display the Ten Lowest-Priced Rows (YOUR CODE)

Display the ten rows with the lowest prices, sorted from lowest to highest. Note that the ranks in rows 8 and 9 are averaged because those two rows have the same price.

**Hint:** Use `nsmallest()` or sort and head.

**Example syntax:**
```python
df.nsmallest(10, 'price')
# or
df.sort_values('price').head(10)
```

**Expected output:** 10 rows with lowest prices, note averaged ranks for ties
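
The averaged ranks for tied prices come from the default tie handling of `rank()`, which is `method='average'`. The sketch below uses a made-up Series (`toy_prices` is hypothetical, not taken from the cars data) to compare the default with `method='min'`.

```python
import pandas as pd

# Hypothetical prices with one tie (not from the cars dataset)
toy_prices = pd.Series([5000, 6000, 6000, 7000])

# Default: tied values share the average of the positions they occupy
average_ranks = toy_prices.rank()

# Alternative: tied values all take the lowest position in the tie
min_ranks = toy_prices.rank(method='min')

print(average_ranks.tolist())  # [1.0, 2.5, 2.5, 4.0]
print(min_ranks.tolist())      # [1.0, 2.0, 2.0, 4.0]
```

The two tied values occupy positions 2 and 3, so the default assigns both the rank 2.5.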

In [None]:
# YOUR CODE HERE - display 10 lowest-priced rows


---

## Part 3: Bin the Data with Quantiles

Practice using `qcut()` to create bins that each hold roughly the same number of rows.

### Task 9: Create Price Bins with qcut() (YOUR CODE)

Use the pandas `qcut()` function to create three price bins for the data: low, medium, and high. Store these bins in a new column named `priceGrade`.

**Hint:** 
- `qcut()` chooses bin edges at quantiles, so each bin holds roughly the same number of rows
- Use `q=3` for 3 bins
- Use `labels` to name the bins

**Example syntax:**
```python
df['priceGrade'] = pd.qcut(df['price'], q=3, labels=['low', 'medium', 'high'])
```

**Expected output:** New column with categorical values: low, medium, high

In [None]:
# YOUR CODE HERE - create priceGrade column using qcut


### Task 10: Display Value Counts for priceGrade (YOUR CODE)

Use the `value_counts()` method to display the number of values for each bin in the `priceGrade` column.

**Expected output:** Roughly equal counts for low, medium, and high (since `qcut()` puts about the same number of rows in each bin)
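
To see why quantile bins produce balanced counts, the following sketch compares `pd.qcut()` with `pd.cut()` on a made-up Series that contains one large outlier. The `values` Series and the variable names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical values with one large outlier (not from the cars dataset)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
labels = ['low', 'medium', 'high']

# Quantile-based bins: edges are placed so each bin gets 3 of the 9 values
by_quantile = pd.qcut(values, q=3, labels=labels)

# Equal-width bins: the range 1 to 100 is split into 3 intervals of equal width
by_width = pd.cut(values, bins=3, labels=labels)

print(by_quantile.value_counts().sort_index())
print(by_width.value_counts().sort_index())
```

With equal-width bins, the outlier stretches the range so that eight values fall in the first bin, none in the second, and one in the third. The quantile bins stay balanced at three values each.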

In [None]:
# YOUR CODE HERE - display value counts for priceGrade


---

## Part 4: Group and Aggregate the Data

Practice advanced grouping and aggregation.

### Task 11: Group by priceGrade and Show Min/Max (YOUR CODE)

Group the cars data by the `priceGrade` column. Use the `agg()` method to aggregate the price data with the `min` and `max` functions. This should display the lowest and highest prices for each bin.

**Hint:** Use `agg()` with a list of functions.

**Example syntax:**
```python
df.groupby('priceGrade')['price'].agg(['min', 'max'])
```

**Expected output:** Table showing min and max price for each grade
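
As a small illustration of passing a list of function names to `agg()`, this sketch groups a made-up DataFrame. The `toy` frame and its `grade` and `cost` columns are hypothetical stand-ins, not the cars data.

```python
import pandas as pd

# Hypothetical toy data (not the cars dataset)
toy = pd.DataFrame({
    'grade': ['low', 'low', 'high', 'high'],
    'cost': [1, 3, 10, 7],
})

# One row per group, one column per aggregation function
summary = toy.groupby('grade')['cost'].agg(['min', 'max'])

print(summary)
```

The result is a DataFrame whose index holds the group labels and whose columns are named after the functions in the list.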

In [None]:
# YOUR CODE HERE - group by priceGrade and show min/max price


### Task 12: Group by carbody and aspiration, Get Average Price (YOUR CODE)

Group the data by the `carbody` and `aspiration` columns, and get the average price for each group. This returns a Series whose index is a MultiIndex built from the `carbody` and `aspiration` values.

**Hint:** Use `groupby()` with two columns and `mean()` on price.

**Example syntax:**
```python
df.groupby(['carbody', 'aspiration'])['price'].mean()
```

**Expected output:** Series with MultiIndex showing average price for each combination

In [None]:
# YOUR CODE HERE - group by carbody and aspiration, get mean price


### Task 13: Unstack the aspiration Column (YOUR CODE)

Unstack the `aspiration` column of the index so the aspiration values are displayed as columns.

**Hint:** Chain `unstack()` after the groupby result.

**Example syntax:**
```python
df.groupby(['carbody', 'aspiration'])['price'].mean().unstack('aspiration')
```

**Expected output:** DataFrame with carbody as rows, aspiration (std/turbo) as columns

In [None]:
# YOUR CODE HERE - unstack aspiration column


### Task 14: Use pivot_table() for the Same Result (YOUR CODE)

Use the `pivot_table()` method to accomplish the same task as Tasks 12 and 13.

**Hint:** pivot_table combines grouping, aggregating, and unstacking in one step.

**Example syntax:**
```python
df.pivot_table(index='carbody', columns='aspiration', values='price', aggfunc='mean')
```

**Expected output:** Same result as Task 13
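
The equivalence between the two approaches can be checked on a small made-up DataFrame. In this sketch the `toy` frame and its `shape`, `finish`, and `cost` columns are hypothetical and unrelated to the cars data.

```python
import pandas as pd

# Hypothetical toy data (not the cars dataset)
toy = pd.DataFrame({
    'shape': ['box', 'box', 'box', 'tube', 'tube'],
    'finish': ['matte', 'gloss', 'gloss', 'matte', 'gloss'],
    'cost': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Approach 1: group, aggregate, then move 'finish' from the index to columns
via_groupby = toy.groupby(['shape', 'finish'])['cost'].mean().unstack('finish')

# Approach 2: do all three steps with a single pivot_table() call
via_pivot = toy.pivot_table(
    index='shape', columns='finish', values='cost', aggfunc='mean'
)

print(via_groupby)
print(via_pivot)
```

Both tables have one row per `shape`, one column per `finish`, and the mean `cost` in each cell.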

In [None]:
# YOUR CODE HERE - use pivot_table for same result


### Task 15: Create a Bar Chart (YOUR CODE)

Use the pandas `plot()` method to create a bar chart from the DataFrame created in the previous task.

**Hint:** Use `plot(kind='bar')` or `plot.bar()`.

**Example syntax:**
```python
df.plot(kind='bar')
# or
df.plot.bar()
```

**Expected output:** Bar chart comparing std vs turbo prices by carbody type

In [None]:
# YOUR CODE HERE - create bar chart from pivot_table result


---

## Summary

In this exercise, you practiced data analysis techniques:

**Melting Data:**
- `melt()` - Reshape from wide to long format
- Useful for creating faceted plots

**Ranking:**
- `rank()` - Assign rank values (by default, tied values share the average rank)

**Binning with Quantiles:**
- `pd.qcut()` - Create bins with roughly equal numbers of rows, using quantiles of the data as bin edges
- Compare to `pd.cut()`, which uses equal-width bins or bin edges that you specify

**Grouping and Aggregating:**
- `groupby().agg(['min', 'max'])` - Multiple aggregations
- `groupby().mean().unstack()` - Reshape grouped results
- `pivot_table()` - Combine grouping and reshaping

**Visualization:**
- `df.plot(kind='bar')` - Quick bar charts from DataFrames

---

**Submission:** Save this notebook and submit to Canvas before the deadline.