<a href="https://colab.research.google.com/github/c-marq/CAP3321C-Data-Wrangling/blob/main/exercises/chapter-06/exercise_6_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 6-1: Clean the Polling Data

**CAP3321C - Data Wrangling**

---

## Overview

This exercise guides you through cleaning the polling data for the 2016 United States presidential election. As you clean the data, you'll use most of the procedures and methods presented in this chapter.

**Instructions:**
1. Run the setup cells to load the data
2. Complete each task by writing code in the provided cells
3. Some tasks are pre-filled - just run them and observe
4. Tasks marked with **YOUR CODE** require you to write the code

**Group Members:**
- Name 1:
- Name 2:
- Name 3:
- Name 4:

---

## Read the Data

Run these cells to load the data. Do not modify this section.

In [None]:
import pandas as pd

In [None]:
# Download the data file from GitHub
!wget -q https://raw.githubusercontent.com/c-marq/CAP3321C-Data-Wrangling/main/data/president_polls_2016.csv
print("Data file downloaded successfully!")

In [None]:
# Load the polling data
polls = pd.read_csv('president_polls_2016.csv')
print("Data shape:", polls.shape)
polls.head()

---

## Part 1: Examine the Data

Before cleaning data, you need to understand what you're working with.

### Task 4: Run the info() Method (YOUR CODE)

Run the `info()` method on the DataFrame. Make a note of the columns that have missing values.

**Hint:** Columns with missing values will show fewer non-null counts than the total number of rows.

**Expected output:** A summary showing column names, non-null counts, and data types
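
**Illustrative sketch (toy data):** The snippet below is a minimal sketch of how missing values show up in `info()`. It uses a small made-up DataFrame named `toy`, not the polling file, and its column values are assumptions chosen only for illustration. It also shows `isna().sum()`, which counts the missing values in each column directly.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data (not the polling file): 4 rows, some values missing
toy = pd.DataFrame({
    "pollster": ["A", "B", "C", "D"],
    "grade": ["A+", None, "B", None],
    "samplesize": [800.0, 1200.0, np.nan, 950.0],
})

# Columns whose non-null count is below the row count contain missing values
toy.info()

# Count the missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)
```

In this toy data, `grade` and `samplesize` report fewer non-null values than `pollster`, which is the pattern to look for in the real output.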

In [None]:
# YOUR CODE HERE - run info() on the polls DataFrame


### Task 5: Run the nunique() Method (YOUR CODE)

Run the `nunique()` method on the DataFrame. Make a note of the columns with only one value.

**Hint:** Columns with only 1 unique value don't provide useful information for analysis.

**Expected output:** A Series showing the count of unique values for each column
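
**Illustrative sketch (toy data):** As a rough illustration of what `nunique()` reports, the snippet below runs it on a made-up DataFrame named `toy` (an assumption for this example, not the polling data). Note that `nunique()` ignores missing values by default.

```python
import pandas as pd

# Hypothetical toy data (not the polling file)
toy = pd.DataFrame({
    "cycle": [2016, 2016, 2016],      # the same value in every row
    "state": ["Ohio", "Iowa", "Ohio"],
    "grade": ["A", None, "B"],        # the missing value is not counted
})

# One count of distinct values per column
unique_counts = toy.nunique()
print(unique_counts)
```

Here `cycle` has a count of 1, so it carries no information that distinguishes one row from another.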

In [None]:
# YOUR CODE HERE - run nunique() on the polls DataFrame


---

## Part 2: Drop Columns and Rows

Remove unnecessary data to focus on what matters.

### Task 6: Filter Rows by Type (YOUR CODE)

Drop all rows **except** where the `type` column contains a value of `"now-cast"`.

**Hint:** Use `.query()` to filter rows, then reassign to `polls`.

**Example syntax:**
```python
polls = polls.query('column == "value"')
```

**Expected output:** The DataFrame should have fewer rows (only "now-cast" type polls)
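
**Illustrative sketch (toy data):** The snippet below is a small sketch of row filtering on a made-up DataFrame named `toy`, whose column names and values are assumptions for this example. It shows that `.query()` and a boolean mask select the same rows, and that you can compare row counts before and after to confirm the filter worked.

```python
import pandas as pd

# Hypothetical toy data (not the polling file)
toy = pd.DataFrame({
    "population": ["lv", "rv", "lv", "a"],
    "samplesize": [800, 1200, 950, 600],
})

# Two equivalent ways to keep only the rows where population is "lv"
kept_query = toy.query('population == "lv"')
kept_mask = toy[toy["population"] == "lv"]

print(len(toy), "rows before;", len(kept_query), "rows after")
print(kept_query.equals(kept_mask))
```

The filtered rows keep their original index labels. If you want a fresh 0-based index afterward, `reset_index(drop=True)` is one option.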

In [None]:
# YOUR CODE HERE - keep only rows where type == "now-cast"


### Task 7: Verify Rows Were Dropped (YOUR CODE)

Run the `nunique()` method again to make sure the rows have been dropped. The `adjpoll` columns should now have about one third as many unique values as they did before.

**Expected output:** Reduced unique value counts, especially in adjpoll columns

In [None]:
# YOUR CODE HERE - run nunique() to verify


### Task 8: Drop Columns with a Single Value (YOUR CODE)

Drop all columns with a single value. These columns don't provide useful information.

**Hint:** 
- First, identify which columns have only 1 unique value from Task 7
- Use `.drop(columns=[list_of_columns])` to remove them
- Remember to reassign or use `inplace=True`

**Example syntax:**
```python
polls = polls.drop(columns=['col1', 'col2', 'col3'])
```

**Expected output:** DataFrame with fewer columns
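
**Illustrative sketch (toy data):** Instead of typing the column names by hand, you could build the list from the `nunique()` result. The snippet below is one possible approach on a made-up DataFrame named `toy`. The names `counts`, `single_valued`, and `trimmed` are assumptions for this example.

```python
import pandas as pd

# Hypothetical toy data (not the polling file)
toy = pd.DataFrame({
    "cycle": [2016, 2016, 2016],
    "branch": ["President", "President", "President"],
    "state": ["Ohio", "Iowa", "Ohio"],
})

# Select the names of columns that have exactly one distinct value
counts = toy.nunique()
single_valued = counts[counts == 1].index.tolist()
print(single_valued)

# Drop those columns and keep the result in a new variable
trimmed = toy.drop(columns=single_valued)
print(trimmed.columns.tolist())
```

Because `nunique()` ignores missing values by default, a column with one distinct value plus some missing entries would also be selected, so check the list before you drop anything.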

In [None]:
# YOUR CODE HERE - drop columns with only 1 unique value


### Task 9: Drop Rows Where State is "U.S." (YOUR CODE)

Drop all rows where the `state` column has a value of `"U.S."`. We want to focus on state-level polling data.

**Hint:** Use `.query()` with `!=` to keep rows that are NOT "U.S."

**Example syntax:**
```python
polls = polls.query('state != "U.S."')
```

**Expected output:** DataFrame without national-level polls

In [None]:
# YOUR CODE HERE - drop rows where state == "U.S."


### Task 10: Verify with nunique() Again (YOUR CODE)

Run the `nunique()` method one more time to see how the values have changed.

**Expected output:** Updated unique value counts reflecting the cleaned data

In [None]:
# YOUR CODE HERE - run nunique() to see updated values


---

## Part 3: Rename Columns

Make column names more user-friendly.

### Task 11: Rename the rawpoll Columns (YOUR CODE)

Rename each `rawpoll_name` column to `name_pct` where *name* is the candidate's name. For example, rename the `rawpoll_clinton` column to `clinton_pct`.

**Hint:** Use the `.rename()` method with a dictionary mapping old names to new names.

**Example syntax:**
```python
polls = polls.rename(columns={
    'old_name1': 'new_name1',
    'old_name2': 'new_name2'
})
```

**Columns to rename:**
- `rawpoll_clinton` → `clinton_pct`
- `rawpoll_trump` → `trump_pct`
- `rawpoll_johnson` → `johnson_pct`
- `rawpoll_mcmullin` → `mcmullin_pct`

**Expected output:** DataFrame with renamed columns
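
**Illustrative sketch (toy data):** When many columns share a prefix, you could build the rename dictionary with a comprehension rather than writing each pair by hand. The snippet below is a sketch on a made-up DataFrame named `toy` with hypothetical candidate names. The variables `prefix`, `mapping`, and `renamed` are assumptions for this example.

```python
import pandas as pd

# Hypothetical toy data with made-up candidate names (not the polling file)
toy = pd.DataFrame({
    "rawpoll_smith": [45.0, 47.5],
    "rawpoll_jones": [42.0, 40.5],
    "state": ["Ohio", "Iowa"],
})

# Map each prefixed column to "<name>_pct"; other columns are left alone
prefix = "rawpoll_"
mapping = {
    col: col[len(prefix):] + "_pct"
    for col in toy.columns
    if col.startswith(prefix)
}
print(mapping)

renamed = toy.rename(columns=mapping)
print(renamed.columns.tolist())
```

For only four columns, writing the dictionary out explicitly is just as reasonable and arguably easier to read.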

In [None]:
# YOUR CODE HERE - rename the rawpoll columns


---

## Part 4: Fix Data Types

Convert columns to appropriate data types for analysis.

### Task 12: Check Current Data Types (YOUR CODE)

Run the `info()` method on the data to see the current data types.

**Expected output:** Notice that date columns are showing as `object` (string) type

In [None]:
# YOUR CODE HERE - run info() to check data types


### Task 13: Create a List of Datetime Columns (YOUR CODE)

Create a list of columns that should use the datetime type. Look at the column names - which ones contain dates?

**Hint:** Look for columns with "date" in the name.

**Example syntax:**
```python
date_cols = ['col1', 'col2', 'col3']
```

**Expected output:** A list variable containing the names of date columns

In [None]:
# YOUR CODE HERE - create a list of datetime column names


### Task 14: Convert to Datetime (YOUR CODE)

Apply the pandas `to_datetime()` function to each column in your list of datetime columns to convert those columns to the datetime type.

**Hint:** Use a for loop to convert each column.

**Example syntax:**
```python
for col in date_cols:
    polls[col] = pd.to_datetime(polls[col])
```

**Expected output:** Date columns converted to datetime64 type
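
**Illustrative sketch (toy data):** The snippet below sketches the conversion on a made-up DataFrame named `toy`. The column names, the date strings, and the `format` value are assumptions for this example and may not match the polling file. Passing `format` is optional, but it makes the expected layout of the strings explicit.

```python
import pandas as pd

# Hypothetical toy data with made-up column names (not the polling file)
toy = pd.DataFrame({
    "opened_date": ["11/1/2016", "10/28/2016"],
    "closed_date": ["11/3/2016", "10/30/2016"],
})

toy_date_cols = ["opened_date", "closed_date"]
for col in toy_date_cols:
    # format describes month/day/year strings such as "11/1/2016"
    toy[col] = pd.to_datetime(toy[col], format="%m/%d/%Y")

print(toy.dtypes)

# Once converted, date arithmetic works directly on the columns
span_days = (toy["closed_date"] - toy["opened_date"]).dt.days
print(span_days.tolist())
```

Being able to subtract one date column from another is one practical reason to convert these columns instead of leaving them as strings.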

In [None]:
# YOUR CODE HERE - convert date columns to datetime


### Task 15: Convert to Category Type (YOUR CODE)

Convert the `state` and `population` columns to the category type. This is more memory-efficient for columns with a limited number of unique values.

**Hint:** Use `.astype('category')` to convert.

**Example syntax:**
```python
polls['column'] = polls['column'].astype('category')
```

**Expected output:** state and population columns converted to category type
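
**Illustrative sketch (toy data):** To see the memory effect for yourself, you could compare `memory_usage(deep=True)` before and after the conversion. The snippet below does this on a made-up Series of repeated state names. The data and variable names are assumptions for this example, and the exact byte counts depend on your pandas version.

```python
import pandas as pd

# Hypothetical toy data: 3 distinct values repeated across 3,000 rows
states = pd.Series(["Ohio", "Iowa", "Texas"] * 1000)
as_category = states.astype("category")

# deep=True measures the actual string storage, not just the references
original_bytes = states.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(original_bytes, category_bytes)

# The distinct values are stored once, as the categories
print(as_category.cat.categories.tolist())
```

The saving comes from storing each distinct string once and keeping a small integer code per row, so it shrinks as the number of unique values approaches the number of rows.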

In [None]:
# YOUR CODE HERE - convert state and population to category type


### Task 16: Verify Data Types (YOUR CODE)

Run the `info()` method one more time to see the new data types.

**Expected output:** Date columns showing as datetime64, state and population showing as category

In [None]:
# YOUR CODE HERE - run info() to verify new data types


### Task 17: Display the Cleaned Data (YOUR CODE)

Display the `polls` DataFrame to see how it looks after all the cleaning.

**Expected output:** A clean, well-organized DataFrame ready for analysis

In [None]:
# YOUR CODE HERE - display the cleaned polls DataFrame


---

## Summary

In this exercise, you practiced these data-cleaning techniques:

**Examining Data:**
- `info()` - View column types and missing values
- `nunique()` - Count unique values per column

**Dropping Rows and Columns:**
- `query()` - Filter rows based on conditions
- `drop(columns=[...])` - Remove unnecessary columns

**Renaming Columns:**
- `rename(columns={...})` - Change column names

**Fixing Data Types:**
- `pd.to_datetime()` - Convert strings to datetime
- `astype('category')` - Convert to category type

---

**Submission:** Save this notebook and submit to Canvas before the deadline.