<a href="https://colab.research.google.com/github/c-marq/CAP3321C-Data-Wrangling/blob/main/exercises/chapter-06/exercise_6_2_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 6-2: Clean the Cars Data

## 🔑 INSTRUCTOR SOLUTION KEY

**CAP3321C - Data Wrangling**

---

**Group Members:**
- SOLUTION KEY

---

## Read the Data

In [None]:
import pandas as pd

In [None]:
# Download the data file from GitHub
!wget -q https://raw.githubusercontent.com/c-marq/CAP3321C-Data-Wrangling/main/data/cars.csv
print("Data file downloaded successfully!")

In [None]:
# Load the cars data
cars = pd.read_csv('cars.csv')
print("Data shape:", cars.shape)

### Task 4: Display the First Five Rows (YOUR CODE)

In [None]:
# ✅ SOLUTION
cars.head()

---

## Part 1: Examine the Data

### Task 5: Run the info() Method (YOUR CODE)

In [None]:
# ✅ SOLUTION
cars.info()

# Note: All 205 entries have non-null values in all columns
# No missing data to handle

#### 📝 Instructor Notes - Task 5

**Key Teaching Points:**
- 205 entries (rows), 26 columns
- All columns show 205 non-null = no missing data
- Mix of int64, float64, and object (string) types
- CarName is object type (string) - this is where the problems are
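
The non-null counts from `info()` can be cross-checked by counting missing values directly. The following is a minimal sketch on a small, hypothetical sample (the `sample` DataFrame and its values are made up for illustration and are not the real cars data):

```python
import pandas as pd

# Hypothetical sample standing in for cars.csv (not the real data)
sample = pd.DataFrame({
    "CarName": ["audi 100ls", "bmw x3", None],
    "price": [13950.0, None, 16430.0],
})

# Count missing values per column; all zeros would mean nothing to fill or drop
missing_per_column = sample.isna().sum()
print(missing_per_column)

# A single yes/no answer for the whole DataFrame
has_missing = bool(sample.isna().any().any())
print("Any missing values?", has_missing)
```

Running `cars.isna().sum()` on the real data should show zero for every column if the `info()` output reports 205 non-null values throughout.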

### Task 6: Examine the fueltype Column (YOUR CODE)

In [None]:
# ✅ SOLUTION
cars['fueltype'].value_counts()

# Note: gas = 185, diesel = 20
# Good candidate for category type

#### 📝 Instructor Notes - Task 6

**Key Teaching Points:**
- `value_counts()` shows frequency of each unique value
- Good for categorical data exploration
- This column is clean (no spelling errors) unlike CarName
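
The solution comment calls `fueltype` a good candidate for the category type. As a rough sketch of what that conversion looks like, assuming a small hypothetical Series named `fuel` rather than the real column:

```python
import pandas as pd

# Hypothetical sample; the real column has 185 gas and 20 diesel rows
fuel = pd.Series(["gas", "gas", "diesel", "gas"], name="fueltype")

# Convert the string column to the category dtype
fuel_cat = fuel.astype("category")

print(fuel_cat.dtype)                 # category
print(list(fuel_cat.cat.categories))  # ['diesel', 'gas']
print(fuel_cat.value_counts())        # counts work the same as before
```

On the real data the equivalent would be `cars['fueltype'] = cars['fueltype'].astype('category')`. The exercise does not require this step.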

### Task 7: Examine the CarName Column (YOUR CODE)

In [None]:
# ✅ SOLUTION
cars['CarName'].unique()

# Spelling errors to spot:
# - maxda (should be mazda)
# - Nissan vs nissan (inconsistent capitalization)
# - toyouta (should be toyota)
# - vokswagen, vw (should be volkswagen)
# - porcshce (should be porsche)

#### 📝 Instructor Notes - Task 7

**Discussion:** Why is data quality important? How could these errors affect analysis?
- If you group by brand, "mazda" and "maxda" would be counted separately
- "volkswagen", "vw", and "vokswagen" would appear as 3 different brands
- This is extremely common in real-world data
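
To make the discussion concrete, this small sketch shows how misspellings split the counts for one brand. The `brands` Series is a hypothetical sample that reuses the misspellings listed above:

```python
import pandas as pd

# Hypothetical sample of brand values, including known misspellings
brands = pd.Series(["mazda", "maxda", "mazda", "vw", "vokswagen", "volkswagen"])

# Before cleaning: two real brands show up as five separate groups
counts_before = brands.value_counts()
print(counts_before)

# After cleaning: the counts collapse to the two real brands
fixed = brands.replace({
    "maxda": "mazda",
    "vw": "volkswagen",
    "vokswagen": "volkswagen",
})
counts_after = fixed.value_counts()
print(counts_after)
```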

---

## Part 2: Fix Spelling and Capitalization Problems

### Task 8: Split CarName into Brand and Name (PRE-FILLED)

Run the cell that adds the `brand` and `name` columns. Note that the two statements in this cell use **lambda expressions**, which you'll learn about in Chapter 7.

In [None]:
# PRE-FILLED: Split CarName into brand and name columns
cars['brand'] = cars.apply(lambda x: x.CarName.split(' ')[0], axis=1)
cars['name'] = cars.apply(lambda x: ' '.join(x.CarName.split(' ')[1:]), axis=1)

# Display to verify
cars[['CarName', 'brand', 'name']].head(10)

#### 📝 Instructor Notes - Task 8

**Key Teaching Points:**
- Lambda expressions are covered in Chapter 7 - just have students run it
- `split(' ')[0]` gets the first word (brand)
- `' '.join(split(' ')[1:])` rejoins everything after the first word (model name)
- We split to fix brand separately, then will recombine later
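
For instructors who want to show an alternative to the row-wise lambda, the same split can be sketched with pandas string methods. This is not the textbook's approach. It assumes a hypothetical `sample` DataFrame and that one-word names should produce an empty `name`, as the lambda version does:

```python
import pandas as pd

# Hypothetical sample; the last row has a brand with no model name
sample = pd.DataFrame({
    "CarName": ["alfa-romero giulia", "mazda glc deluxe", "subaru"]
})

# Split on the first space only: column 0 is the brand, column 1 is the rest
parts = sample["CarName"].str.split(" ", n=1, expand=True)

sample["brand"] = parts[0]
sample["name"] = parts[1].fillna("")  # one-word names have no remainder

print(sample)
```

The vectorized version avoids calling a Python function once per row, which matters little for 205 rows but adds up on larger data.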

### Task 9: Display Unique Brand Values (YOUR CODE)

In [None]:
# ✅ SOLUTION
cars['brand'].unique()

#### 📝 Instructor Notes - Task 9

**Acceptable Variations:**
```python
# Also acceptable:
cars['brand'].value_counts()    # shows counts too
sorted(cars['brand'].unique())  # alphabetical order
```

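**Common Student Errors:**
- Worrying about `cars.brand.unique()` vs `cars['brand'].unique()` - both work, so neither is an error
- Confusing `unique()` with `nunique()` - `unique()` returns the distinct values, `nunique()` returns only how many there are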
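
A quick way to show the difference between the two methods, using a hypothetical `brands` Series:

```python
import pandas as pd

# Hypothetical sample with a repeated value and a capitalization variant
brands = pd.Series(["audi", "bmw", "audi", "Nissan", "nissan"])

unique_values = brands.unique()   # array of the distinct values, in order of appearance
unique_count = brands.nunique()   # a single integer

print(unique_values)
print(unique_count)
```

Note that `Nissan` and `nissan` count as two different values because the comparison is case-sensitive.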

### Task 10: Fix Misspellings with replace() (YOUR CODE)

In [None]:
# ✅ SOLUTION
cars['brand'] = cars['brand'].replace({
    'maxda': 'mazda',
    'Nissan': 'nissan',
    'toyouta': 'toyota',
    'vokswagen': 'volkswagen',
    'vw': 'volkswagen',
    'porcshce': 'porsche'
})

#### 📝 Instructor Notes - Task 10

**Key Teaching Points:**
- `replace()` with a dictionary handles multiple replacements in a single call
- Case matters: 'Nissan' != 'nissan'
- Could also use `.str.lower()` for capitalization, but replace handles both spelling and case

**Common Student Errors:**
- Missing one of the misspellings (especially 'vw')
- Typos in the correct spellings
- Forgetting to reassign back to the column
- Some students may also find 'alfa-romero' (should technically be 'alfa-romeo') - accept either
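
The `.str.lower()` variation mentioned above can be sketched as follows, together with a check that catches a missed misspelling. The `brands` Series and the `corrections` name are hypothetical and chosen only for this illustration:

```python
import pandas as pd

# Hypothetical sample of uncleaned brand values
brands = pd.Series(["maxda", "Nissan", "toyouta", "vw", "porcshce", "audi"])

# Spelling fixes only; capitalization is handled by str.lower() below
corrections = {
    "maxda": "mazda",
    "toyouta": "toyota",
    "vokswagen": "volkswagen",
    "vw": "volkswagen",
    "porcshce": "porsche",
}

# Lowercase first so 'Nissan' needs no dictionary entry, then fix spellings
cleaned = brands.str.lower().replace(corrections)

# Verification: none of the known misspellings should remain
leftover = cleaned[cleaned.isin(list(corrections.keys()))]

print(cleaned.tolist())
print("Misspellings left:", len(leftover))
```

The `leftover` check is a convenient way to spot a student solution that missed an entry such as `'vw'`.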

**Verify your fixes:**

In [None]:
# Verify the fixes
cars['brand'].unique()

# Should show the brands with the six fixes applied ('alfa-romero' is left unchanged):
# alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar,
# mazda, mercedes-benz, mercury, mitsubishi, nissan, peugeot,
# plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo

### Task 11: Combine Brand and Name into CarName (YOUR CODE)

In [None]:
# ✅ SOLUTION
cars['CarName'] = cars['brand'] + ' ' + cars['name']

#### 📝 Instructor Notes - Task 11

**Key Teaching Points:**
- String concatenation with `+` works element-wise on Series
- Must include the space `' '` between brand and name
- This overwrites the original CarName with the corrected version

**Common Student Errors:**
- Forgetting the space: `cars['brand'] + cars['name']` gives 'mazdaglc'
- Using wrong column names
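
The missing-space error is easy to demonstrate on a small sample. The sketch below uses a hypothetical `sample` DataFrame. It also shows one possible way to handle a trailing space, which would appear if any row had an empty `name`:

```python
import pandas as pd

# Hypothetical sample; the last row has an empty model name
sample = pd.DataFrame({
    "brand": ["mazda", "audi", "subaru"],
    "name": ["glc", "100ls", ""],
})

# Common error: no separator between the two parts
without_space = sample["brand"] + sample["name"]

# Correct: element-wise concatenation with a space in between
with_space = sample["brand"] + " " + sample["name"]

# Optional: strip the trailing space left by an empty name
stripped = with_space.str.strip()

print(without_space.tolist())
print(with_space.tolist())
print(stripped.tolist())
```

The `str.strip()` step goes beyond what the exercise asks for. It only matters if some `CarName` values contain a brand with no model name.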

### Task 12: Verify the Fixes (YOUR CODE)

In [None]:
# ✅ SOLUTION
cars.head()

# CarName now combines the corrected brand with the model name,
# e.g., 'alfa-romero giulia'

---

## Part 3: Rename and Drop Columns

### Task 13: Rename Columns (YOUR CODE)

Use the `rename()` method to rename:
- `CarName` column to `brandname`
- `car_ID` column to `carid`

In [None]:
# ✅ SOLUTION
cars = cars.rename(columns={
    'CarName': 'brandname',
    'car_ID': 'carid'
})
cars.columns.tolist()

#### 📝 Instructor Notes - Task 13

**Key Teaching Points:**
- Consistent naming conventions make code easier to read and maintain
- All columns are now lowercase
- `rename()` only changes the columns specified in the dictionary

**Common Student Errors:**
- Case sensitivity: 'CarName' not 'carname' or 'Carname'
- Forgetting to reassign the result
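
Both errors fail silently, which is why students often miss them. A minimal sketch on a hypothetical two-column `sample` DataFrame:

```python
import pandas as pd

# Hypothetical sample with the two columns the task renames
sample = pd.DataFrame({
    "car_ID": [1, 2],
    "CarName": ["audi 100ls", "bmw x3"],
})

# Error 1: the result is discarded, so the DataFrame is unchanged
sample.rename(columns={"CarName": "brandname"})
cols_unchanged = sample.columns.tolist()

# Error 2: a wrong-case key is ignored by default, with no warning
typo = sample.rename(columns={"carname": "brandname"})

# errors="raise" turns a wrong key into a KeyError instead
try:
    sample.rename(columns={"carname": "brandname"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

# Correct: reassign the result and use the exact column names
renamed = sample.rename(columns={"CarName": "brandname", "car_ID": "carid"})

print(cols_unchanged)
print(typo.columns.tolist())
print(typo_detected)
print(renamed.columns.tolist())
```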

### Task 14: Display and Evaluate Columns (YOUR CODE)

Display the first five rows. Note that the `carid` and `symboling` columns don't contain useful data for an analysis.

In [None]:
# ✅ SOLUTION
cars.head()

# Note: carid is just a row number, symboling is an insurance risk rating
# Neither is useful for our analysis
# Also: brand and name columns are no longer needed (data is in brandname)

### Task 15: Drop Unnecessary Columns (YOUR CODE)

Drop the `carid`, `symboling`, `brand`, and `name` columns, and display the first five rows again.

In [None]:
# ✅ SOLUTION
cars = cars.drop(columns=['carid', 'symboling', 'brand', 'name'])

#### 📝 Instructor Notes - Task 15

**Key Teaching Points:**
- The Murach exercise only asks to drop `carid` and `symboling`
- But `brand` and `name` should also be dropped since their data is now in `brandname`
- Students who only drop 2 columns are still correct per the textbook instructions

**Acceptable Variations:**
```python
# Per textbook (only 2 columns):
cars = cars.drop(columns=['carid', 'symboling'])

# Better practice (also remove temp columns):
cars = cars.drop(columns=['carid', 'symboling', 'brand', 'name'])
```
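
One practical note for notebooks: running a `drop()` cell a second time raises a `KeyError`, because the columns are already gone. The sketch below, on a hypothetical `sample` DataFrame that is missing two of the four columns, shows the default behavior and the `errors="ignore"` option:

```python
import pandas as pd

# Hypothetical sample where 'brand' and 'name' no longer exist
sample = pd.DataFrame({
    "carid": [1],
    "symboling": [3],
    "brandname": ["audi 100ls"],
    "price": [13950.0],
})

to_drop = ["carid", "symboling", "brand", "name"]

# Default behavior: a missing column name raises KeyError
try:
    sample.drop(columns=to_drop)
    strict_failed = False
except KeyError:
    strict_failed = True

# errors="ignore" drops the columns that exist and skips the rest
trimmed = sample.drop(columns=to_drop, errors="ignore")

print(strict_failed)
print(trimmed.columns.tolist())
```

Whether to suggest `errors="ignore"` to students is a judgment call, since it can also hide a misspelled column name.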

In [None]:
# Display the final cleaned DataFrame
cars.head()

---

## Summary

In this exercise, you practiced cleaning data:

**Examining Data:**
- `info()` - View column types and non-null counts (to spot missing values)
- `value_counts()` - Count occurrences of each value
- `unique()` - List all unique values in a column

**Fixing Spelling and Capitalization:**
- `replace({...})` - Replace incorrect values with correct ones
- String concatenation - Combine columns with `+`

**Renaming and Dropping Columns:**
- `rename(columns={...})` - Change column names
- `drop(columns=[...])` - Remove unnecessary columns

---

**Submission:** Save this notebook and submit to Canvas before the deadline.