## Part 7: Cleaning Data & Casting

We'll create a small DataFrame that purposely contains **missing values** of different kinds:
- Python `None`
- `np.nan`
- Custom placeholders: `"NA"` and `"Missing"`

Then we’ll explore `dropna`, `replace`, `isna`, `fillna`, and type casting.

In [None]:
df_small = pd.DataFrame({
    "first": ["Alice", "Bob", np.nan, "Carol", None],
    "last": ["Smith", "Jones", None, "Lee", "Jones"],
    "email": [
        "alice@example.com",
        np.nan,
        None,
        "carol@example.com",
        "Missing",
    ],
    "uid": ["AS100293", "BJ240806", np.nan, "NA", "SJ251203"],
    "year": [1993, 2006, None, np.nan, 2003],
    # Ages are stored as strings on purpose, to show the casting problem in section 4
    "age": ["32", "19", np.nan, None, "22"],
    # "empty_col": [np.nan, np.nan, np.nan, np.nan, np.nan],
})
df_small

### 1. Dropping missing values with `dropna`

In [None]:
# Default: drop rows (axis='index') with ANY missing value
df_any = df_small.dropna()
df_any

In [None]:
# Explicitly: rows with ANY missing
df_any2 = df_small.dropna(axis="index", how="any")
df_any2

In [None]:
# Drop only the rows where ALL values are missing
df_allrows = df_small.dropna(axis="index", how="all")
df_allrows

In [None]:
# Drop columns where ALL values are missing
# (e.g., 'empty_col', if you uncomment it in df_small above)
df_drop_allcols = df_small.dropna(axis="columns", how="all")
df_drop_allcols

> ⚠️ `dropna(axis="columns", how="any")` drops every column that has at least one missing value. In real data that is often **almost every column**; in `df_small` it is all of them, which leaves a DataFrame with rows but no columns.

In [None]:
# Drop rows where a specific subset has missing values
df_email_req = df_small.dropna(subset=["email"])
df_email_req

In [None]:
# Several columns: drop a row only if ALL of the listed columns are missing
df_uid_or_email_req = df_small.dropna(how="all", subset=["uid", "email"])
df_uid_or_email_req
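
The difference between `how="any"` and `how="all"` on columns, mentioned in the warning above, can be checked on a tiny frame. This is a minimal sketch; the frame `demo` and its values are made up for illustration and are not part of `df_small`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: every column has at least one missing value
demo = pd.DataFrame({
    "name": ["Alice", None, "Carol"],
    "age": [32, 19, np.nan],
})

# how="any" drops each column that contains at least one NaN/None
no_cols = demo.dropna(axis="columns", how="any")
print(no_cols.shape)  # the rows survive, the columns do not

# how="all" only drops columns that are entirely missing, so nothing is dropped here
kept = demo.dropna(axis="columns", how="all")
print(kept.shape)
```

Comparing `.shape` before and after a `dropna` call is a cheap way to notice that a filter removed more than intended.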

### 2. Custom missing values → real NaNs

In [None]:
# Replace custom placeholders with real NaNs
df_clean = df_small.replace({"NA": np.nan, "Missing": np.nan})
df_clean

In [None]:
# Where are the NaNs now?
df_clean.isna()
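
To see why the `replace` step matters, it may help to count missing values before and after it. The sketch below uses a made-up Series `uid` (an assumption for illustration) that mixes one real NaN with the two placeholder strings.

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing a real NaN with two string placeholders
uid = pd.Series(["AS100293", "NA", np.nan, "Missing"])

# Before replacing: only the real NaN is flagged, the placeholders are ordinary strings
before = uid.isna().sum()

# After replacing: the placeholders count as missing too
after = uid.replace({"NA": np.nan, "Missing": np.nan}).isna().sum()

print(before, after)
```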

### 3. Filling missing values with `fillna`

In [None]:
# Example: fill every missing value with a string flag
# (this also fills numeric columns such as 'year', which then hold mixed types)
df_filled_str = df_clean.fillna("MISSING")
df_filled_str

In [None]:
# For numeric columns, a numeric fill is often more useful (e.g., 0 or -1)
df_filled_num = df_clean.copy()
df_filled_num["year"] = df_filled_num["year"].fillna(-1)
df_filled_num
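
When text and numeric columns need different defaults, one option is to pass `fillna` a dict that maps column names to fill values. This is a sketch on a made-up frame `people`; the column names mirror the ones above, but the data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and one numeric column
people = pd.DataFrame({
    "email": ["alice@example.com", None, "carol@example.com"],
    "year": [1993.0, np.nan, 2003.0],
})

# A dict maps column name -> fill value, so each column gets its own default
filled = people.fillna({"email": "MISSING", "year": -1})

print(filled)
print(filled.dtypes)  # 'year' stays numeric because its fill value is a number
```

Columns that are not listed in the dict are left untouched, which makes this form safer than a single frame-wide fill value.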

### 4. Dtypes and why `mean()` may fail

In [None]:
# Check dtypes (note: attribute, not a method)
df_clean.dtypes
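
Reading a dtype listing by eye works for small frames; for wider ones, a programmatic check may be easier. The sketch below, on a made-up frame `people`, uses `select_dtypes` and `pd.api.types.is_numeric_dtype` to find which columns pandas considers numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'age' arrived as text, 'year' as real numbers
people = pd.DataFrame({
    "age": ["32", "19", None],
    "year": [1993, 2006, np.nan],
})

# Columns pandas considers numeric; text columns are excluded even if they look numeric
numeric_cols = people.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Per-column check with the public helper in pandas.api.types
age_is_numeric = pd.api.types.is_numeric_dtype(people["age"])
print(age_is_numeric)
```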

In [None]:
# This will raise a TypeError because 'age' is still strings/NaN, not numbers:
# df_clean["age"].mean()

> `age` contains strings (e.g., `"32"`) and NaNs. You cannot compute a numeric mean on `object` dtype.

### 5. Casting to numeric

- Casting directly to `int` fails if there are NaNs, because plain integer dtypes cannot hold a missing value.
- Use `float` (or `pd.to_numeric`) instead, then compute the mean.

In [None]:
# Safe numeric conversion: numeric strings -> numbers, unparseable values -> NaN
df_clean["age_num"] = pd.to_numeric(df_clean["age"], errors="coerce")
df_clean.dtypes

In [None]:
# Fails: cannot cast NaNs to int directly
# df_clean["age_num"] = df_clean["age_num"].astype(int)   # ValueError

# Works: floats can hold NaN
df_clean["age_num"] = df_clean["age_num"].astype(float)
df_clean.dtypes
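
If whole-number ages are preferred over floats, one alternative is pandas' nullable integer dtype `"Int64"` (capital `I`), which can store missing values. The sketch below is an illustration on a made-up Series `raw_age`, not a step the rest of this part depends on.

```python
import pandas as pd

# Hypothetical raw ages as text, including one unparseable entry
raw_age = pd.Series(["32", "19", None, "unknown", "22"])

# errors="coerce" turns anything unparseable into NaN instead of raising
age_float = pd.to_numeric(raw_age, errors="coerce")

# "Int64" is the nullable integer dtype; missing values are shown as <NA>
age_int = age_float.astype("Int64")

print(age_int)
print(age_int.mean())  # missing values are skipped by default
```

The conversion from float to `"Int64"` is only expected to work when the non-missing values are whole numbers, as they are here.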

In [None]:
# Now you can compute numeric stats
df_clean["age_num"].mean()

### 6. Handling custom missing values when reading CSVs

You can normalize placeholders at load time:

In [None]:
na_vals = ["NA", "Missing"]
df_loaded = pd.read_csv("path/to/file.csv", na_values=na_vals)

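On import, pandas then treats those strings as `NaN`. `"NA"` is already recognized as missing by default; `"Missing"` is not, so it has to be listed in `na_values`.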
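
The effect of `na_values` can be tried without a file on disk by reading from an in-memory string. In this sketch the CSV content and the extra placeholder `"unknown"` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = """first,email,uid
Alice,alice@example.com,AS100293
Bob,Missing,NA
Carol,carol@example.com,unknown
"""

# Without na_values: "NA" is recognized by default, "Missing" and "unknown" are not
df_default = pd.read_csv(io.StringIO(csv_text))

# With na_values: the listed placeholders are treated as missing in addition to the defaults
df_custom = pd.read_csv(io.StringIO(csv_text), na_values=["Missing", "unknown"])

print(df_default.isna().sum())
print(df_custom.isna().sum())
```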

### Exercises for Part 7

#### Exercise 7.1:
1. For all columns in the big dataset, compute the **number** and **percentage** of missing values.
2. Show the **top 10** columns by missingness.

#### Exercise 7.2:
1. Convert `Salary` to numeric.  
2. Compute the mean salary after dropping missing values. 
3. Create a copy with missing `Salary` filled by the global median.  
4. Compute the mean salary on the filled copy and compare.
5. Count how many salaries were missing before the fill.
