## Data Cleaning

Data cleaning typically involves the following steps:

- **Remove Unwanted Observations**: Eliminate duplicates, irrelevant entries or redundant data that add noise.
- **Fix Structural Errors**: Standardize data formats and variable types for consistency.
- **Manage Outliers**: Detect and handle extreme values that can skew results, either by removal or transformation.
- **Handle Missing Data**: Address gaps using imputation, deletion or advanced techniques to maintain accuracy and integrity.
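
As a rough illustration of the "Fix Structural Errors" step, the sketch below normalizes inconsistent text and converts a numeric column stored as strings. The `raw` DataFrame and its `City` and `Age` values are hypothetical and exist only to demonstrate the idea.

```python
import pandas as pd

# Hypothetical raw data: inconsistent text formatting and numbers stored as strings
raw = pd.DataFrame({
    "City": [" london", "London ", "LONDON", "Paris"],
    "Age": ["25", "31", "unknown", "40"],
})

# Normalize text: trim whitespace and use one consistent casing
raw["City"] = raw["City"].str.strip().str.title()

# Convert to a numeric type; values that cannot be parsed become NaN
raw["Age"] = pd.to_numeric(raw["Age"], errors="coerce")

print(raw["City"].unique())
print(raw.dtypes)
```

Coercing unparseable values to NaN turns a structural problem into a missing-data problem, which can then be handled with deletion or imputation.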

In [1]:
import pandas as pd

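# Small placeholder DataFrame (illustrative values only; replace with your own data)
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Bob", None, "Eve", "Dan"],
    "Age": [25, 32, 32, 41, None, 29],
})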

# Check for duplicates - returns a Series of bools, one per row
duplicated = df.duplicated()

# Calculate percentage of missing values for each column
missing_percent = round((df.isnull().sum() / df.shape[0]) * 100, 2)

# Remove rows where specified columns have missing values
df = df.dropna(subset=['Name'])

# Fill missing values in 'Age' with the column mean
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Calculate the mean and standard deviation (std) of 'Age'
mean = df['Age'].mean()
std = df['Age'].std()

# Define bounds as mean ± 2 * std for outlier detection.
lower_bound = mean - 2 * std
upper_bound = mean + 2 * std

# Filter DataFrame rows within bounds using Boolean indexing.
df2 = df[(df['Age'] >= lower_bound) & (df['Age'] <= upper_bound)]
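
Outliers can also be handled by transformation rather than removal. The sketch below caps extreme values at the same mean ± 2 * std bounds instead of dropping the rows. The `ages` Series is hypothetical example data, and capping is only one possible transformation.

```python
import pandas as pd

# Hypothetical ages with one extreme value
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 40, 120], dtype=float, name="Age")

# Same bounds as the filtering approach: mean ± 2 * std
mean = ages.mean()
std = ages.std()
lower_bound = mean - 2 * std
upper_bound = mean + 2 * std

# Cap values at the bounds instead of removing the rows
ages_clipped = ages.clip(lower=lower_bound, upper=upper_bound)

print(ages_clipped)
```

Capping keeps every row, which helps when the other columns of an outlier row still carry useful information.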

## Data Formatting

#### Min-Max Scaling
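Rescales the values to a specified range, typically 0 to 1, so that the minimum value maps to 0 and the maximum value maps to 1. Because the transformation is linear, it preserves the shape of the original distribution, but it is sensitive to outliers, since the extreme values define the range.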

Used for:
- **Neural networks (NN)**
- **Euclidean distance calculations**: if one feature is "Salary" (0–200,000) and another is "Age" (0–100), the unscaled Salary dominates the distance almost entirely.
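
To make the distance argument concrete, the sketch below applies min-max scaling by hand, x' = (x - min) / (max - min), to two hypothetical people described as [salary, age]. The points `a` and `b` are invented for illustration, and the feature ranges are taken from the Salary and Age example above.

```python
import numpy as np

# Hypothetical people described as [salary, age]
a = np.array([50_000.0, 25.0])
b = np.array([51_000.0, 65.0])

# Unscaled: the 1,000 salary gap swamps the 40-year age gap
dist_raw = np.linalg.norm(a - b)

# Min-max scaling by hand: x' = (x - min) / (max - min)
# Assumed feature ranges: Salary 0-200,000 and Age 0-100
mins = np.array([0.0, 0.0])
maxs = np.array([200_000.0, 100.0])
a_scaled = (a - mins) / (maxs - mins)
b_scaled = (b - mins) / (maxs - mins)

# Scaled: both features are on the same 0-1 scale, so the age gap now drives the distance
dist_scaled = np.linalg.norm(a_scaled - b_scaled)

print(dist_raw, dist_scaled)
```

Before scaling, the distance is almost exactly the salary difference. After scaling, it is driven by the age difference, which is the larger gap relative to each feature's range.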

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))

df_minmax = df.copy()

# Select every numeric column (excludes strings, booleans and datetimes)
num_col_ = df.select_dtypes(include='number').columns

df_minmax[num_col_] = scaler.fit_transform(df_minmax[num_col_])

#### Standardization (Z-score scaling)
Transforms the values to have a mean of 0 and a standard deviation of 1 by centering the data on the mean and dividing by the standard deviation. This suits algorithms that assume roughly Gaussian inputs or require features with zero mean and unit variance. Standardization changes only the location and scale of the data, not the shape of its distribution.

Z = (X - μ) / σ

Where:
- X = Original feature values
- μ = Mean value of X
- σ = Standard deviation of X
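
A minimal sketch of the formula above, using scikit-learn's `StandardScaler` and the equivalent manual computation. The `data` DataFrame and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data
data = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0, 50.0],
    "Salary": [30_000.0, 45_000.0, 60_000.0, 90_000.0],
})

# Fit the scaler and transform each column to zero mean and unit variance
scaler = StandardScaler()
data_std = pd.DataFrame(
    scaler.fit_transform(data), columns=data.columns, index=data.index
)

# Same result by hand: Z = (X - mean) / std
# StandardScaler uses the population standard deviation (ddof=0)
manual = (data - data.mean()) / data.std(ddof=0)

print(np.allclose(data_std, manual))
```

Note that pandas' `std()` defaults to the sample standard deviation (`ddof=1`), so `ddof=0` is needed to match `StandardScaler`.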