<a href="https://colab.research.google.com/github/c-marq/CAP3321C-Data-Wrangling/blob/main/exercises/chapter-06/exercise_6_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 6-2: Clean the Cars Data

**CAP3321C - Data Wrangling**

---

## Overview

This exercise guides you through cleaning data on various makes and models of cars. You'll practice examining data, fixing spelling and capitalization problems, and renaming and dropping columns.

**Instructions:**
1. Run the setup cells to load the data
2. Complete each task by writing code in the provided cells
3. Some tasks are pre-filled - just run them and observe
4. Tasks marked with **YOUR CODE** require you to write the code

**Group Members:**
- Name 1:
- Name 2:
- Name 3:
- Name 4:

---

## Read the Data

Run the setup cells below to load the data. Do not modify them.

In [None]:
import pandas as pd

In [None]:
# Download the data file from GitHub
!wget -q https://raw.githubusercontent.com/c-marq/CAP3321C-Data-Wrangling/main/data/cars.csv
print("Data file downloaded successfully!")

In [None]:
# Load the cars data
cars = pd.read_csv('cars.csv')
print("Data shape:", cars.shape)

### Task 4: Display the First Five Rows (YOUR CODE)

Display the first five rows of the DataFrame to see the data.

**Expected output:** First 5 rows showing columns like car_ID, symboling, CarName, fueltype, etc.
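
As a rough illustration of how `head()` behaves, here is a minimal sketch on a small made-up DataFrame. The `toy` data and its column names are hypothetical and are not part of the cars dataset.

```python
import pandas as pd

# Hypothetical toy data, not the cars dataset
toy = pd.DataFrame({
    "item": ["a", "b", "c", "d", "e", "f", "g"],
    "price": [10, 20, 30, 40, 50, 60, 70],
})

first_five = toy.head()   # with no argument, head() returns the first 5 rows
first_two = toy.head(2)   # pass a number to change how many rows you get

print(first_five)
```

In a notebook, leaving `toy.head()` as the last line of a cell displays the result as a formatted table without `print()`.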

In [None]:
# YOUR CODE HERE - display the first 5 rows


---

## Part 1: Examine the Data

Understand the structure and quality of the data.

### Task 5: Run the info() Method (YOUR CODE)

Examine the data with the `info()` method. Note that none of the columns seem to have missing data.

**Expected output:** Summary showing 26 columns, all with 205 non-null values
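
To see how `info()` would reveal missing data if there were any, here is a minimal sketch on a made-up DataFrame with one missing value. The `toy` data is hypothetical and is not the cars dataset.

```python
import pandas as pd

# Hypothetical toy data with one missing brand
toy = pd.DataFrame({
    "brand": ["audi", "bmw", None],
    "doors": [2, 4, 4],
})

# info() prints one line per column; a non-null count lower than
# the number of rows means that column has missing values
toy.info()

# isna().sum() gives the number of missing values per column directly
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Here the `brand` column would report 2 non-null values out of 3 rows, which is the pattern to look for in a real dataset.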

In [None]:
# YOUR CODE HERE - run info() on the cars DataFrame


### Task 6: Examine the fueltype Column (YOUR CODE)

Examine the `fueltype` column with the `value_counts()` method. Note that this column stores the string values "gas" and "diesel", which act as category labels for each car.

**Expected output:** Count of 'gas' and 'diesel' values
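
For reference, a minimal sketch of `value_counts()` on a made-up Series. The values and counts here are hypothetical and will not match the cars data.

```python
import pandas as pd

# Hypothetical toy column of category labels
toy = pd.Series(["gas", "gas", "diesel", "gas"], name="fuel")

counts = toy.value_counts()                 # rows per value, most frequent first
shares = toy.value_counts(normalize=True)   # same, as fractions of the total

print(counts)
print(shares)
```

The `normalize=True` option is handy when you care about proportions rather than raw counts.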

In [None]:
# YOUR CODE HERE - run value_counts() on the fueltype column


### Task 7: Examine the CarName Column (YOUR CODE)

Examine the `CarName` column with the `unique()` method. Note that this data contains some spelling errors and inconsistent capitalization.

**Hint:** Look for brand names that appear multiple times with different spellings (e.g., "toyota" vs "toyouta", "volkswagen" vs "vw" vs "vokswagen")

**Expected output:** Array of unique CarName values showing spelling inconsistencies
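
One way to make near-duplicates easier to spot is to sort the unique values so that similar spellings end up next to each other. This is only a sketch on a made-up Series; the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy column with a misspelling and a capitalization variant
toy = pd.Series(["toyota", "Nissan", "toyouta", "nissan", "toyota"])

distinct = toy.unique()   # array of distinct values in order of first appearance

# Sorting case-insensitively places "Nissan" next to "nissan"
# and "toyota" next to "toyouta"
ordered = sorted(distinct, key=str.lower)
print(ordered)
```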

In [None]:
# YOUR CODE HERE - run unique() on the CarName column


---

## Part 2: Fix Spelling and Capitalization Problems

Clean up the inconsistent brand names.

### Task 8: Split CarName into Brand and Name (PRE-FILLED)

Run the cell that adds the `brand` and `name` columns. Note that the two statements in this cell use **lambda expressions**, which you'll learn about in chapter 7.

This code:
- Splits CarName on spaces and takes the first piece as the brand
- Joins the remaining pieces back together with spaces to form the model name

In [None]:
# PRE-FILLED: Split CarName into brand and name columns
cars['brand'] = cars.apply(lambda x: x.CarName.split(' ')[0], axis=1)
cars['name'] = cars.apply(lambda x: ' '.join(x.CarName.split(' ')[1:]), axis=1)

# Display to verify
cars[['CarName', 'brand', 'name']].head(10)
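
For comparison, the same kind of split can be sketched without lambda expressions by using the pandas string methods. This is an alternative shown on made-up data, not the approach the exercise uses; the `toy` DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the last value has no model name
toy = pd.DataFrame({"CarName": ["audi 100 ls", "bmw x3", "subaru"]})

# n=1 splits only on the first space; expand=True returns a DataFrame
# with one column per piece
parts = toy["CarName"].str.split(" ", n=1, expand=True)

toy["brand"] = parts[0]
# Rows with no space have a missing second piece, so fill it with ""
toy["name"] = parts[1].fillna("")

print(toy)
```

Assuming the names are separated by single spaces, this should give the same `brand` and `name` values as the lambda version.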

### Task 9: Display Unique Brand Values (YOUR CODE)

Display each unique value in the `brand` column. This should help you identify spelling mistakes and inconsistent capitalization.

**Hint:** Use `unique()` or `value_counts()` on the brand column.

**Expected output:** List of brand names - look for misspellings and inconsistent capitalization such as:
- "maxda" (should be "mazda")
- "Nissan" vs "nissan"
- "toyouta" (should be "toyota")
- "vokswagen" and "vw" (should be "volkswagen")
- "porcshce" (should be "porsche")

In [None]:
# YOUR CODE HERE - display unique values in the brand column


### Task 10: Fix Misspellings with replace() (YOUR CODE)

Use the `replace()` method to fix misspellings and inconsistent capitalization in the `brand` column.

**Hint:** Use `.replace()` with a dictionary mapping wrong values to correct values.

**Example syntax:**
```python
cars['brand'] = cars['brand'].replace({
    'wrong1': 'correct1',
    'wrong2': 'correct2'
})
```

**Fixes needed:**
- `'maxda'` → `'mazda'`
- `'Nissan'` → `'nissan'`
- `'toyouta'` → `'toyota'`
- `'vokswagen'` → `'volkswagen'`
- `'vw'` → `'volkswagen'`
- `'porcshce'` → `'porsche'`

**Expected output:** All brand names should be correctly spelled and lowercase
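
It may help to know that `replace()` with a dictionary matches whole values only, and the match is case-sensitive. The sketch below shows this on a made-up Series; the brand names are hypothetical.

```python
import pandas as pd

# Hypothetical toy values: a misspelling, a capitalized value,
# and a longer string that merely contains the misspelling
toy = pd.Series(["hunda", "Honda", "hunda accord"])

fixed = toy.replace({"hunda": "honda", "Honda": "honda"})
print(fixed.tolist())
# "hunda accord" is left unchanged because the whole value
# does not equal any dictionary key
```

This is why the brand is split into its own column first: the dictionary keys can then match each brand value exactly.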

In [None]:
# YOUR CODE HERE - fix misspellings in the brand column


**Verify your fixes:** Run the cell below to check that all brands are now correct.

In [None]:
# Verify the fixes
cars['brand'].unique()

### Task 11: Combine Brand and Name into CarName (YOUR CODE)

Store the corrected data in the `CarName` column. To do that, you need to combine the data in the `brand` and `name` columns.

**Hint:** Use string concatenation with `+` and add a space between brand and name.

**Example syntax:**
```python
cars['CarName'] = cars['brand'] + ' ' + cars['name']
```

**Expected output:** CarName column updated with corrected brand names
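
One detail worth checking: if a row has an empty model name, joining with `' '` leaves a trailing space. The sketch below shows this on made-up data and one possible way to tidy it up with `str.strip()`. The `toy` DataFrame and the `full` column are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the second row has an empty model name
toy = pd.DataFrame({"brand": ["audi", "subaru"], "name": ["100 ls", ""]})

# Element-wise string concatenation, one result per row
toy["full"] = toy["brand"] + " " + toy["name"]
raw = toy["full"].tolist()

# Optional cleanup: remove leading and trailing whitespace
toy["full"] = toy["full"].str.strip()
cleaned = toy["full"].tolist()

print(raw)
print(cleaned)
```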

In [None]:
# YOUR CODE HERE - combine brand and name into CarName


### Task 12: Verify the Fixes (YOUR CODE)

Display the first five rows again to see how the data has been fixed.

**Expected output:** First 5 rows showing corrected CarName values

In [None]:
# YOUR CODE HERE - display first 5 rows to verify


---

## Part 3: Rename and Drop Columns

Clean up the column names and remove unnecessary columns.

### Task 13: Rename Columns (YOUR CODE)

Use the `rename()` method to rename:
- `CarName` column to `brandname`
- `car_ID` column to `carid`

That way, these columns use a naming convention that's consistent with the rest of the column names (all lowercase).

**Example syntax:**
```python
cars = cars.rename(columns={
    'OldName1': 'newname1',
    'OldName2': 'newname2'
})
```

**Expected output:** Columns renamed to lowercase convention
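
Be aware that, by default, `rename()` silently skips dictionary keys that don't match an existing column, so a typo in an old name produces no error. The sketch below demonstrates this on made-up data; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"ItemName": ["a"], "item_ID": [1]})

# "Item_ID" is a deliberate typo (capital I); the real column is "item_ID"
renamed = toy.rename(columns={"ItemName": "itemname", "Item_ID": "itemid"})

print(list(renamed.columns))   # the mistyped key is ignored
print(list(toy.columns))       # the original DataFrame is unchanged
```

Checking the column list after renaming is a quick way to confirm that every rename took effect.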

In [None]:
# YOUR CODE HERE - rename CarName to brandname and car_ID to carid


### Task 14: Display and Evaluate Columns (YOUR CODE)

Display the first five rows. Note that the `carid` and `symboling` columns don't contain useful data for an analysis.

**Expected output:** First 5 rows with renamed columns

In [None]:
# YOUR CODE HERE - display first 5 rows


### Task 15: Drop Unnecessary Columns (YOUR CODE)

Drop the `carid`, `symboling`, `brand`, and `name` columns, and display the first five rows again.

**Hint:** Use `.drop(columns=[...])` to remove multiple columns at once.

**Example syntax:**
```python
cars = cars.drop(columns=['col1', 'col2', 'col3', 'col4'])
```

**Expected output:** DataFrame without the carid, symboling, brand, and name columns
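
A common notebook surprise is that running a drop cell twice raises a `KeyError`, because the columns are already gone the second time. The sketch below illustrates this on made-up data; the `toy` DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"id": [1, 2], "temp": ["x", "y"], "price": [10, 20]})

trimmed = toy.drop(columns=["id", "temp"])   # returns a new DataFrame

# Dropping the same columns again fails because they no longer exist
try:
    trimmed.drop(columns=["id", "temp"])
    second_drop_failed = False
except KeyError:
    second_drop_failed = True

# errors="ignore" skips any columns that are not present
still_trimmed = trimmed.drop(columns=["id", "temp"], errors="ignore")

print(list(trimmed.columns), second_drop_failed)
```

If you hit this error in the exercise, rerunning the notebook from the setup cells restores the original columns.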

In [None]:
# YOUR CODE HERE - drop carid, symboling, brand, and name columns


In [None]:
# Display the final cleaned DataFrame
cars.head()

---

## Summary

In this exercise, you practiced cleaning data:

**Examining Data:**
- `info()` - View column types and non-null counts (which reveal missing values)
- `value_counts()` - Count occurrences of each value
- `unique()` - List all unique values in a column

**Fixing Spelling and Capitalization:**
- `replace({...})` - Replace incorrect values with correct ones
- String concatenation - Combine columns with `+`

**Renaming and Dropping Columns:**
- `rename(columns={...})` - Change column names
- `drop(columns=[...])` - Remove unnecessary columns

---

**Submission:** Save this notebook and submit to Canvas before the deadline.