
# 🧼 Cleaning Dirty Data (Missing Values & Type Fixes)
## 🔹 LEARNING GOALS:
* Detect and count missing values (`NaN`)
* Fill or drop missing data
* Convert column data types safely
* Understand the difference between `NaN`, `None`, `""`, and type mismatches
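
The last goal is easier to see with a tiny example. The sketch below uses a hypothetical Series named `values` (not part of the dataset used later) to show how pandas treats each kind of "missing" value:

```python
import numpy as np
import pandas as pd

# Four values: two truly missing, one empty string, one number stored as text
values = pd.Series([np.nan, None, "", "42"], dtype="object")

# isnull() flags NaN and None, but an empty string counts as a real value
null_mask = values.isnull()
print(null_mask.tolist())

# NaN is a float that never equals itself; None is a Python singleton
print(np.nan == np.nan)
print(None is None)

# "42" is present but has the wrong type: a str, not a number
print(type(values.iloc[3]).__name__)
```

Because `""` is not counted as null, it usually has to be converted to `NaN` before `.fillna()` or `.dropna()` can act on it.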

## 🧪 1. Load a Messy Dataset


In [None]:
import pandas as pd
import numpy as np

data = {
    "Name": ["Alice", "Bob", "Charlie", "David", None],
    "Age": ["25", "thirty", 35, np.nan, "40"],
    "Signup Date": ["2022-01-01", "not a date", "2022/03/01", None, "April 5, 2022"],
    "Score": [95.5, None, 88.0, 92.5, ""]
}

df = pd.DataFrame(data)
df

## 🧯 2. Detecting Missing or Broken Values

In [None]:
df.isnull()

In [None]:
df.isnull().sum()

In [None]:
df[df.isnull().any(axis=1)]
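
`isnull()` only reports `NaN` and `None`. Values such as `"thirty"` or `""` are present but broken, so they need a separate check. A minimal sketch, using standalone copies of the `Age` and `Score` columns (the names `age`, `score`, `broken_mask`, and `empty_mask` are illustrative):

```python
import numpy as np
import pandas as pd

age = pd.Series(["25", "thirty", 35, np.nan, "40"], dtype="object")
score = pd.Series([95.5, None, 88.0, 92.5, ""], dtype="object")

# A value is "broken" if it is present but cannot be parsed as a number
parsed_age = pd.to_numeric(age, errors="coerce")
broken_mask = age.notnull() & parsed_age.isnull()
print(age[broken_mask].tolist())

# Empty strings are invisible to isnull(), so test for them explicitly
empty_mask = score.eq("")
print(int(empty_mask.sum()))
```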

## 🧹 3. Cleaning Strategy Options

In [None]:
df.fillna({
    "Name": "Unknown",
    "Age": -1,
    "Signup Date": "1970-01-01",
    "Score": 0.0
})

In [None]:
df.dropna()
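
Both `fillna()` and `dropna()` return a new DataFrame and leave `df` unchanged unless the result is assigned back. Plain `dropna()` removes every row with any missing value, which is often too aggressive. The sketch below uses a small hypothetical frame, `df_demo`, to compare the `how`, `subset`, and `thresh` options:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", None, None],
    "Age": [25.0, np.nan, 35.0, np.nan],
    "Score": [95.5, 80.0, np.nan, np.nan],
})

# Default: drop a row if any column is missing
any_missing_dropped = df_demo.dropna()

# Drop a row only if every column is missing
all_missing_dropped = df_demo.dropna(how="all")

# Drop a row only if specific columns are missing
name_required = df_demo.dropna(subset=["Name"])

# Keep rows that have at least 2 non-missing values
at_least_two = df_demo.dropna(thresh=2)

print(len(any_missing_dropped), len(all_missing_dropped),
      len(name_required), len(at_least_two))
```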

## 🧬 4. Data Type Fixes

In [None]:
df.dtypes

In [None]:
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
df["Score"] = pd.to_numeric(df["Score"], errors="coerce")
df["Signup Date"] = pd.to_datetime(df["Signup Date"], errors="coerce")
df.dtypes
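
One caveat for the `Signup Date` column: depending on the pandas version, `pd.to_datetime` on a whole column may infer a single format from the first value and coerce differently formatted dates to `NaT`. A minimal sketch of one workaround, parsing each value on its own (the helper `parse_one` is hypothetical, not a pandas function):

```python
import pandas as pd

dates = pd.Series(
    ["2022-01-01", "not a date", "2022/03/01", None, "April 5, 2022"],
    dtype="object",
)

def parse_one(value):
    """Parse a single value, returning NaT when it is missing or unparseable."""
    if pd.isnull(value):
        return pd.NaT
    return pd.to_datetime(value, errors="coerce")

# Parse element by element so each value gets its own format guess
parsed = pd.Series(pd.to_datetime([parse_one(v) for v in dates]), index=dates.index)
print(parsed)
```

Element-wise parsing is slower than a vectorized call, so it fits small or very messy columns better than large ones.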

## 🩹 5. Impute (Fill In) Fixed Missing Values

In [None]:
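# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Score"] = df["Score"].fillna(df["Score"].mean())

In [None]:
df["Signup Date"] = df["Signup Date"].fillna(df["Signup Date"].min())

In [None]:
df["Name"] = df["Name"].fillna("Unknown")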
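
Imputed values look exactly like real ones afterwards, so it can help to record which rows were filled. The sketch below uses a hypothetical frame `people` and a hypothetical flag column `Age_was_missing`, and assigns the filled column back rather than relying on `inplace=True`:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({"Age": [25.0, np.nan, 35.0, np.nan, 40.0]})

# Record which rows were missing before filling, so the imputation stays visible
people["Age_was_missing"] = people["Age"].isnull()

# The median ignores NaN, so it is computed from the known ages only
people["Age"] = people["Age"].fillna(people["Age"].median())

print(people)
```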

## 🤓 6. Cleaned Data Review

In [None]:
df.info()

In [None]:
df.describe(include="all")

## 🧪 Try It Yourself
Modify the `data` dictionary at the top of this notebook. Add:

* A new column with some `None` and `""` values
* At least one row with all columns filled incorrectly

Then re-run the notebook and fix it step by step.

## 🧠 Mini-Challenge
🗂 Load "`data/survey.csv`" and:

* Identify which columns have missing values
* Use `.isnull().sum()` to get a null report
* Use a mix of `.fillna()`, `.dropna()`, and `pd.to_numeric()` or `pd.to_datetime()` to clean it
* Print a summary with `.info()` and `.describe()`

In [2]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/rugbyprof/3603-Programming-for-Data-Science/refs/heads/main/data/students.csv")

# 1. Check which columns have missing values
print("Missing values per column:")
print(df.isnull().sum())

# 2. Show initial info/describe
print("\nInitial info:")
df.info()  # info() prints its own report and returns None, so no print() wrapper
print("\nInitial describe (numeric only):")
print(df.describe())

# 3. Cleaning: the null report shows no missing values, so nothing to fill or drop

# 4. Final check
print("\nAfter cleaning - missing values per column:")
print(df.isnull().sum())

print("\nFinal info:")
df.info()
print("\nFinal describe (numeric only):")
print(df.describe())

Missing values per column:
first_name       0
last_name        0
math_score       0
science_score    0
dtype: int64

Initial info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   first_name     50 non-null     object
 1   last_name      50 non-null     object
 2   math_score     50 non-null     int64 
 3   science_score  50 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 1.7+ KB

Initial describe (numeric only):
       math_score  science_score
count    50.00000      50.000000
mean     77.56000      80.240000
std      12.17753      11.659384
min      60.00000      62.000000
25%      66.25000      73.000000
50%      75.50000      78.000000
75%      87.00000      89.000000
max     100.00000     100.000000

After cleaning - missing values per column:
first_name       0
last_name        0
math_score       0
science_score    0


## 📝 Summary
**Concept** | **Tool/Function**
------------|------------------
Detect nulls | `df.isnull()`, `df.isnull().sum()`
Drop rows | `df.dropna()`
Fill values | `df.fillna()`
Convert types | `pd.to_numeric()`, `pd.to_datetime()`
Replace values | `df.replace()`
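
`df.replace()` appears in the table but not in the cells above. A minimal sketch of one common use, turning placeholder strings into `NaN` so the other tools can see them (the frame `raw` and the `"N/A"` placeholder are hypothetical):

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "Name": ["Alice", "Bob", "", "N/A", None],
    "Score": [95.5, None, 88.0, 92.5, ""],
})

# Replace every placeholder in the list with NaN across all columns
cleaned = raw.replace(["", "N/A"], np.nan)

# The null report now counts the placeholders as missing
print(cleaned.isnull().sum())
```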