#### 🧠 What Is Data Cleaning?

**Data Cleaning** is the process of preparing data for analysis by handling:

* Missing values
* Inconsistent formatting
* Duplicate records
* Wrong data types
* Outliers or impossible values

**Real-world datasets are almost always messy**, so these skills are necessary before any meaningful analysis or modeling can happen.

---

#### 📘 Course Breakdown (Lesson by Lesson)

---

#### 1. **Handling Missing Values**

* Identify missing values using `isnull()` or `isna()`
* Drop or fill missing data:

  * Drop with `dropna()`
  * Fill with `fillna()` (mean, median, mode, or a specific value)

📌 Example:

```python
# Fill missing age with the median
df['age'] = df['age'].fillna(df['age'].median())
```

💡 You’ll also learn **when** it’s better to drop or fill, based on data context.
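
A minimal sketch of that drop-or-fill decision, assuming a small made-up table (the `name`, `age`, and `city` columns are illustrative, not from the course data). Rows missing an identifier are dropped because there is no sensible guess; numeric and text gaps are filled instead:

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative assumptions.
df = pd.DataFrame({
    "name": ["Ana", "Ben", None, "Dee"],
    "age": [34.0, None, 29.0, 41.0],
    "city": ["Lima", "Oslo", "Pune", None],
})

# Count missing values per column to see where the gaps are.
missing_per_column = df.isna().sum()

# Drop rows that lack an identifier we cannot reasonably guess.
cleaned = df.dropna(subset=["name"]).copy()

# Fill a numeric gap with the median and a text gap with a placeholder.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
cleaned["city"] = cleaned["city"].fillna("unknown")

print(missing_per_column)
print(cleaned)
```

Whether the median or a placeholder is appropriate depends on the column; this is one reasonable default, not a rule.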

---

#### 2. **Scaling and Normalization**

* Why scaling matters: distance-based and gradient-based ML algorithms such as **k-NN, SVM, and regularized linear models** are sensitive to feature scales.
* Learn:

  * **Min-Max Scaling** (0 to 1)
  * **Standardization** (z-score)

📌 Example:

```python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['height', 'weight']] = scaler.fit_transform(df[['height', 'weight']])
```
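
To make the two formulas concrete, here is a sketch that computes both by hand with pandas, assuming made-up `height` and `weight` values. It is meant to show what the scalers do, not to replace them:

```python
import pandas as pd

# Hypothetical measurements; the values are made up for illustration.
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 60.0, 70.0, 80.0],
})

# Min-max scaling: (x - min) / (max - min) maps each column onto [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

# Standardization: (x - mean) / std gives each column mean 0 and std 1.
# ddof=0 uses the population standard deviation, as StandardScaler does.
z_scores = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(z_scores)
```

In a modeling workflow the minimum, maximum, mean, and standard deviation should come from the training split only and then be reused on the test split.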

---

#### 3. **Parsing Dates**

* Convert text-based date columns into **datetime objects**
* Extract date parts like:

  * Year, month, weekday, hour
* Useful for time-based features

📌 Example:

```python
import pandas as pd
# Parse text dates, then pull out the month
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month
```
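
Real date columns often contain entries that do not parse. A hedged sketch, assuming made-up timestamp strings in a `YYYY-MM-DD HH:MM` layout, that states the format explicitly and extracts several parts at once:

```python
import pandas as pd

# Hypothetical raw strings; the last entry is deliberately invalid.
raw = pd.Series(["2021-03-15 08:30", "2021-12-01 17:45", "not a date"])

# An explicit format avoids day/month ambiguity; errors="coerce" turns
# unparseable entries into NaT instead of raising an exception.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M", errors="coerce")

# Collect the date parts that are commonly used as features.
features = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "weekday": parsed.dt.day_name(),
    "hour": parsed.dt.hour,
})
print(features)
```

Counting the `NaT` values afterwards (`parsed.isna().sum()`) shows how many rows failed to parse and need a closer look.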

---

#### 4. **Character Encodings**

* Fix issues when loading text files with special or non-English characters
* Understand encodings such as **UTF-8** and **ISO-8859-1**
* Use the `encoding=` argument in `pd.read_csv()`

📌 Example:

```python
df = pd.read_csv('data.csv', encoding='ISO-8859-1')
```

💡 Passing the file's actual encoding prevents common errors such as `UnicodeDecodeError` and garbled characters.
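
When the encoding is unknown, one common approach is to try a few candidates in order. The sketch below assumes a hypothetical helper named `read_csv_with_fallback` and writes its own small ISO-8859-1 file so it runs on its own:

```python
import os
import tempfile

import pandas as pd

# Write a small file in ISO-8859-1 so the example is self-contained.
# The file contents are made up for illustration.
text = "name,city\nJosé,Málaga\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as handle:
    handle.write(text.encode("ISO-8859-1"))
    path = handle.name


def read_csv_with_fallback(csv_path, encodings=("utf-8", "ISO-8859-1")):
    """Hypothetical helper: try each encoding in order until one decodes."""
    for encoding in encodings:
        try:
            return pd.read_csv(csv_path, encoding=encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {csv_path}")


df, used_encoding = read_csv_with_fallback(path)
os.remove(path)
print(used_encoding, df.iloc[0].tolist())
```

ISO-8859-1 assigns a character to every byte, so it never raises a decode error. That is why it goes last in the list, and why the result is still worth checking by eye for garbled characters.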

---

#### 5. **Inconsistent Data Entry**

* Real datasets often have inconsistent labels:

  * `"New York"` vs `"new york"` vs `"NY"`
  * `"Male"` vs `"M"` vs `"m"`

📌 Fix using:

* `.str.lower()`, `.str.strip()`, `.replace()`, and mapping dictionaries

```python
df['city'] = df['city'].str.lower().str.strip()
# Normalize case first so 'm' and 'M' both map to 'Male'
df['gender'] = df['gender'].str.strip().str.capitalize().replace({'M': 'Male', 'F': 'Female'})
```
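
A slightly fuller sketch of the same idea, using made-up rows modeled on the variants listed above. Case and whitespace are normalized first so the mapping dictionary only has to cover true aliases:

```python
import pandas as pd

# Hypothetical messy labels, modeled on the variants listed above.
df = pd.DataFrame({
    "city": [" New York", "new york", "NY", "Boston "],
    "gender": ["Male", "M", "m", "F"],
})

# Normalize case and whitespace first so fewer variants need mapping.
city = df["city"].str.strip().str.lower()
gender = df["gender"].str.strip().str.lower()

# Map the remaining aliases to one canonical label; values not in the
# dictionary are left unchanged by replace().
df["city"] = city.replace({"ny": "new york"})
df["gender"] = gender.replace({"m": "male", "f": "female"})

# Inspect the result to catch variants the mapping missed.
print(df["city"].unique())
print(df["gender"].unique())
```

Checking `unique()` before and after is a cheap way to confirm that the number of distinct labels dropped to what you expect.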

---

#### 6. **Uniformity and Data Types**

* Convert columns to appropriate types:

  * Strings → datetime
  * Floats → integers (if safe)
  * Object → category (for memory optimization)

📌 Example:

```python
df['category'] = df['category'].astype('category')
```
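
The three conversions listed above can be sketched together, assuming made-up `signup`, `visits`, and `plan` columns. The float-to-integer step is guarded so it only runs when no fractional values would be lost:

```python
import pandas as pd

# Hypothetical columns that were loaded with the wrong types.
df = pd.DataFrame({
    "signup": ["2022-01-05", "2022-02-10", "2022-03-15"],
    "visits": [3.0, 7.0, None],
    "plan": ["free", "pro", "free"],
})

# Strings -> datetime
df["signup"] = pd.to_datetime(df["signup"])

# Floats -> integers, only if every non-missing value is a whole number.
# The nullable "Int64" dtype keeps missing entries as <NA>.
if (df["visits"].dropna() % 1 == 0).all():
    df["visits"] = df["visits"].astype("Int64")

# Object -> category, which usually saves memory for repeated labels.
df["plan"] = df["plan"].astype("category")

print(df.dtypes)
```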

---

#### ✅ **Skills You'll Gain**

* Handle missing or incorrect values confidently
* Standardize messy, inconsistent data
* Parse and manipulate date/time data
* Clean text data for analysis or modeling
* Scale and normalize numerical data properly

---

#### **Would you like:**

* A checklist for common data cleaning tasks?
* Code templates for cleaning your own dataset?
* Practice datasets with real-world data issues?
