# 1. Introduction to Data Cleaning and Preprocessing

Data cleaning and preprocessing are foundational steps in any data analysis or machine learning project. Without clean data, even the most sophisticated models and analyses can produce unreliable or inaccurate results.

## Definition and Importance
### What is Data Cleaning and Preprocessing?
- **Data Cleaning**: The process of identifying and correcting (or removing) errors and inconsistencies in data to improve its quality.
- **Data Preprocessing**: The steps taken to prepare raw data for analysis or modeling, including transformations, scaling, and encoding.

### Why is it Crucial?
1. **Accuracy**: Dirty data can lead to incorrect conclusions or poorly performing models.
2. **Efficiency**: Cleaning data early reduces the time spent debugging later stages.
3. **Interpretability**: Clean and well-preprocessed data make analysis easier to understand and explain.
4. **Model Performance**: Many machine learning algorithms perform better with properly cleaned and preprocessed data.

## Common Issues in Raw Data
1. **Missing Data**:
   - Missing values in critical fields.
2. **Inconsistent Data**:
   - Variations in text data (e.g., 'USA', 'U.S.A.', 'United States').
3. **Incorrect Data Types**:
   - Numeric fields stored as strings, or incorrect datetime formats.
4. **Duplicate Records**:
   - Redundant rows that can skew analysis.
5. **Outliers**:
   - Extreme values that distort distributions.
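The issues listed above can each be surfaced with a one-line check. The following is a minimal sketch on a hypothetical toy table: the `raw` DataFrame and its `country` and `age` columns are invented for illustration and are not part of any dataset used later.

```python
import pandas as pd

# Hypothetical toy data that exhibits each issue listed above
raw = pd.DataFrame({
    "country": ["USA", "U.S.A.", "United States", "USA", "USA"],
    "age": ["25", "31", None, "25", "400"],  # numbers stored as strings
})

missing_per_column = raw.isnull().sum()        # missing data
distinct_countries = raw["country"].nunique()  # inconsistent spellings inflate this count
column_dtypes = raw.dtypes                     # incorrect data types show up here
duplicate_rows = raw.duplicated().sum()        # fully repeated rows

# Convert to numbers before looking for extreme values
age_numeric = pd.to_numeric(raw["age"], errors="coerce")
age_summary = age_numeric.describe()           # min/max reveal outliers such as 400

print(missing_per_column, distinct_countries, duplicate_rows, sep="\n")
print(column_dtypes)
print(age_summary)
```

A quick audit like this is only a starting point. Deciding which values count as errors still depends on what the data is supposed to represent.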

# 2. Handling Missing Data

Missing data is one of the most common issues in datasets and can significantly impact analysis and modeling if not handled appropriately.

## Identifying Missing Data
### Methods to Detect Missing Data
Pandas provides functions to identify missing values:

1. **`isnull()`**:
   - Returns a DataFrame of the same shape with `True` for missing values and `False` otherwise.

2. **`notnull()`**:
   - Returns the opposite of `isnull()`.

3. **`sum()`**:
   - Can be combined with `isnull()` to count missing values in columns or rows.
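As a rough illustration of how these three functions combine, the sketch below counts missing values per column and per row. The `example` DataFrame is an assumed stand-in, not the dataset used in the following cells.

```python
import numpy as np
import pandas as pd

# Assumed stand-in data for illustration
example = pd.DataFrame({
    "Age": [25, np.nan, 30],
    "Salary": [50000, np.nan, np.nan],
})

missing_mask = example.isnull()    # True where a value is missing
present_mask = example.notnull()   # exact complement of missing_mask

missing_per_column = example.isnull().sum()      # count down each column
missing_per_row = example.isnull().sum(axis=1)   # count across each row
missing_share = example.isnull().mean()          # fraction missing per column

print(missing_per_column)
print(missing_per_row)
print(missing_share)
```

Because `True` counts as 1, `sum()` gives counts and `mean()` gives the proportion of missing values, which is often the more useful number when deciding whether to drop or fill a column.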

### Visualizing Missing Data
Visualization tools like Seaborn make it easier to understand patterns of missing data. The `heatmap()` function can highlight missing values in a dataset.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Example DataFrame with missing data
data = {"Name": ['Alice', 'Bob', 'Charlie', 'David'], "Age": [25, np.nan, 30, 35], "Salary": [50000, 60000, np.nan, 70000]}
df = pd.DataFrame(data)
print('Original DataFrame:')
print(df)

# Detecting missing data
print('Missing values per column:')
print(df.isnull().sum())

# Visualizing missing data
plt.figure(figsize=(6, 4))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data Heatmap')
plt.show()

## Techniques to Handle Missing Data

### 1. Removing Missing Data
Sometimes, it's acceptable to remove rows or columns with missing values if the proportion of missing data is small.

#### `dropna()`
- Removes rows or columns with missing data.
- Parameters:
  - `axis=0`: Removes rows with missing data (default).
  - `axis=1`: Removes columns with missing data.
  - `thresh`: Keeps rows/columns with at least a minimum number of non-missing values.

In [None]:
# Drop rows with missing data
print('Drop rows with missing data:')
print(df.dropna())

# Drop columns with missing data
print('Drop columns with missing data:')
print(df.dropna(axis=1))

# Keep rows with at least 2 non-missing values
# (every row in this DataFrame qualifies, so nothing is dropped)
print('Keep rows with at least 2 non-missing values:')
print(df.dropna(thresh=2))
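
Beyond `axis` and `thresh`, `dropna()` also accepts `subset` and `how`, which restrict the check to particular columns. The sketch below rebuilds the same small table under the assumed name `people` so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Same values as the example DataFrame, rebuilt so this block is self-contained
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, np.nan, 30, 35],
    "Salary": [50000, 60000, np.nan, 70000],
})

# Drop a row only when Salary is missing; gaps in other columns are tolerated
salary_known = people.dropna(subset=["Salary"])

# Drop a row only when both Age and Salary are missing (no such row exists here)
not_all_empty = people.dropna(how="all", subset=["Age", "Salary"])

print(salary_known)
print(not_all_empty)
```

Targeting specific columns this way usually discards fewer rows than a blanket `dropna()`.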

### 2. Filling Missing Data
Instead of dropping rows or columns, we can fill the gaps with meaningful values.

#### `fillna()`
- Fills missing values with specified values or strategies.
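- Strategies:
  - Replace with constants (e.g., `fillna(0)`).
  - Replace with statistical values (mean, median, mode).
- The `method='ffill'` and `method='bfill'` arguments of `fillna()` are deprecated in pandas 2.1 and later; call `ffill()` and `bfill()` directly instead.

#### `ffill()`
- Fills each missing value with the last valid value before it (forward-fill). Gaps at the start of a column stay missing.

#### `bfill()`
- Fills each missing value with the next valid value after it (backward-fill). Gaps at the end of a column stay missing.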


In [None]:
# Fill missing values in numeric columns with each column's mean
print('Fill missing values with mean:')
print(df.fillna(df.mean(numeric_only=True)))

# Forward-fill missing values
print('Forward-fill missing values:')
print(df.ffill())

# Backward-fill missing values
print('Backward-fill missing values:')
print(df.bfill())
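
Different columns often call for different fill values. One possible approach, sketched here on a rebuilt copy of the example table (named `people` for illustration), is to pass `fillna()` a dictionary that maps each column to its own statistic.

```python
import numpy as np
import pandas as pd

# Same values as the example DataFrame, rebuilt so this block is self-contained
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, np.nan, 30, 35],
    "Salary": [50000, 60000, np.nan, 70000],
})

# Choose one fill value per column; the median is less sensitive to outliers than the mean
fill_values = {
    "Age": people["Age"].median(),
    "Salary": people["Salary"].mean(),
}
filled = people.fillna(fill_values)

print(filled)
```

Which statistic to use for which column is a modeling decision; the pairing above is only an example.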

### 3. Interpolation
Interpolation estimates missing values based on surrounding data.

#### Types of Interpolation:
- **Linear Interpolation**: Assumes values change linearly between points.
- **Quadratic/Polynomial Interpolation**: Fits a curve through the neighboring points, which can follow non-linear trends more closely than a straight line. In pandas these methods rely on SciPy.
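
To make the linear case concrete, here is a small worked sketch. The `ages` and `readings` Series are assumed sample values. With the default `method='linear'`, pandas treats the rows as equally spaced, so a single gap is filled with the midpoint of its neighbors.

```python
import numpy as np
import pandas as pd

# One gap between 25 and 30: the linear estimate is the midpoint, 27.5
ages = pd.Series([25, np.nan, 30, 35])
linear = ages.interpolate()  # default method is 'linear'

# Two consecutive gaps between 10 and 40 are filled in equal steps
readings = pd.Series([10.0, np.nan, np.nan, 40.0])
filled_readings = readings.interpolate()

print(linear.tolist())
print(filled_readings.tolist())
```

Interpolation assumes the row order is meaningful, as in a time series. For unordered records, a statistical fill is usually the safer choice.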

In [None]:
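# Interpolate numeric columns only; the 'Name' column holds strings
numeric_cols = ['Age', 'Salary']

# Linear interpolation
print('Linear interpolation:')
print(df[numeric_cols].interpolate())

# Quadratic interpolation (requires SciPy to be installed)
print('Quadratic interpolation:')
print(df[numeric_cols].interpolate(method='quadratic'))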

### 4. Advanced Imputation
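For complex datasets, missing values can be estimated with regression or other machine learning models. Scikit-Learn's imputers expose such strategies through a consistent `fit`/`transform` interface. `SimpleImputer` is the simplest of them and serves as a useful baseline.

#### Using `SimpleImputer` from Scikit-Learn:
- Imputes each column with a single statistic or a constant.
- The `strategy` parameter accepts `'mean'`, `'median'`, `'most_frequent'`, or `'constant'`.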

In [None]:
from sklearn.impute import SimpleImputer

# Example: Impute missing values with the median
imputer = SimpleImputer(strategy='median')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
print('DataFrame after imputation:')
print(df)
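
For a model-based alternative, Scikit-Learn also provides `KNNImputer`, which fills each gap from the most similar rows. This is a minimal sketch, not a recommended configuration: the `staff` DataFrame and the choice of `n_neighbors=2` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Assumed sample data with one gap in each column
staff = pd.DataFrame({
    "Age": [25, np.nan, 30, 35, 28],
    "Salary": [50000, 60000, np.nan, 70000, 56000],
})

# Each missing value is replaced by the average of that feature
# over the 2 nearest rows, measured on the features both rows have
knn = KNNImputer(n_neighbors=2)
staff_imputed = pd.DataFrame(
    knn.fit_transform(staff),
    columns=staff.columns,
    index=staff.index,
)

print(staff_imputed)
```

Because the neighbors are found by distance, features on very different scales (such as age and salary) are usually standardized first so that no single column dominates.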