# Normalization (Min-Max Scaling)
Normalization rescales each feature to a fixed range, usually [0, 1] or [-1, 1]. This puts all features on the same scale, which matters for distance-based algorithms such as k-nearest neighbors (kNN) and for gradient-trained models such as neural networks.


Formula:

$X_{\text{normalized}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$

In [4]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Sample data
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# Normalize using MinMaxScaler
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)


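# Manual version of the same formula, applied column by column.
# Named differently from `scaler` above so it does not shadow the MinMaxScaler object.
def min_max_scale(x):
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print("Normalized Data:")
print(normalized_data)
print(min_max_scale(data))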


Normalized Data:
[[0.  0.  0. ]
 [0.5 0.5 0.5]
 [1.  1.  1. ]]
[[0.  0.  0. ]
 [0.5 0.5 0.5]
 [1.  1.  1. ]]


When to use:

Suitable for models that rely on distance calculations (e.g., kNN, SVM).
When features have different units (e.g., height in cm and weight in kg).
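
To target a range other than [0, 1], such as the [-1, 1] range mentioned earlier, the same formula can be stretched and shifted: $X' = a + \frac{(X - X_{\min})(b - a)}{X_{\max} - X_{\min}}$. The sketch below is a minimal NumPy version under a few assumptions: `rescale_to_range`, `lo`, and `hi` are hypothetical names chosen for illustration, and mapping constant columns to `lo` is one possible convention, not a fixed rule.

```python
import numpy as np

def rescale_to_range(x, lo=0.0, hi=1.0):
    """Rescale each column of x to [lo, hi] (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    # Constant columns have zero range; divide by 1 instead so they map to lo
    # (an assumption made here to avoid division by zero).
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return lo + (x - col_min) / safe_range * (hi - lo)

# Third column is constant on purpose, to exercise the zero-range case
data = np.array([[1, 2, 5],
                 [4, 5, 5],
                 [7, 8, 5]])

scaled = rescale_to_range(data, lo=-1.0, hi=1.0)
print(scaled)
```

scikit-learn's `MinMaxScaler` exposes the same idea through its `feature_range` parameter, for example `MinMaxScaler(feature_range=(-1, 1))`.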

# Standardization (Z-score Scaling)
#### Standardization rescales the data to have a mean of 0 and a standard deviation of 1. It is useful when the data follows a Gaussian distribution or when distance-based models like kNN or SVM are used.

Formula:

$X_{\text{standardized}} = \frac{X - \mu}{\sigma}$

#### Where:
#### μ is the mean of the data.
#### σ is the standard deviation.

In [5]:
from sklearn.preprocessing import StandardScaler

# Sample data
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# Standardize using StandardScaler
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)

print("Standardized Data:")
print(standardized_data)


Standardized Data:
[[-1.22474487 -1.22474487 -1.22474487]
 [ 0.          0.          0.        ]
 [ 1.22474487  1.22474487  1.22474487]]


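#### When to use:

#### Suitable for algorithms such as linear regression, logistic regression, SVM, and PCA, which are sensitive to feature scale (through regularization, gradient-based training, or variance-based objectives). Standardization does not require normally distributed data, although the result is easiest to interpret when features are roughly Gaussian.
#### Particularly useful when your model assumes features are centered around zero.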
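
#### The formula can also be applied by hand, which makes the definition of σ explicit. This is a minimal sketch assuming NumPy only; the variable names `mu`, `sigma`, and `z_scores` are chosen for illustration. It uses the population standard deviation (`ddof=0`), which is what reproduces the ±1.2247 values printed above.

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]], dtype=float)

mu = data.mean(axis=0)       # per-column mean
sigma = data.std(axis=0)     # per-column population std (NumPy default, ddof=0)
z_scores = (data - mu) / sigma

print(z_scores)
print("Column means:", z_scores.mean(axis=0))
print("Column stds: ", z_scores.std(axis=0))
```

#### Note that pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so a hand-rolled version on a DataFrame will give slightly different numbers unless `ddof=0` is passed.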

# Handling Outliers
#### Outliers are extreme values that can distort the training of many models. Handling outliers is critical to improve model robustness and prevent biased learning.

#### Methods:

#### Truncation (Winsorization): Cap the extreme values to a specific percentile or threshold.
#### IQR Method: Use the interquartile range (IQR) to identify and remove outliers.
#### Example (IQR Method):

In [7]:
import pandas as pd

# Sample data with outliers
data = pd.DataFrame({'value': [10, 20, 30, 1000, 50, 60]})

# Calculate IQR
Q1 = data['value'].quantile(0.25)
Q3 = data['value'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
filtered_data = data[(data['value'] >= lower_bound) & (data['value'] <= upper_bound)]

print("Filtered Data (IQR):")
print(filtered_data)


Filtered Data (IQR):
   value
0     10
1     20
2     30
4     50
5     60


#### When to use:

#### Outliers should be handled if they are errors or represent noise, for example when you are predicting house prices and a few recorded prices fall far outside the plausible range.
#### Different algorithms handle outliers differently, so consider the model you're using.
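
#### When dropping rows is not acceptable, the same IQR boundaries can be used to cap values instead of removing them. The sketch below is one way to do that with pandas' `Series.clip`; the `capped` column name is an assumption made for illustration.

```python
import pandas as pd

# Same sample data as in the IQR example
data = pd.DataFrame({'value': [10, 20, 30, 1000, 50, 60]})

# Compute the IQR boundaries
q1 = data['value'].quantile(0.25)
q3 = data['value'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Cap values outside the boundaries instead of dropping the rows
data['capped'] = data['value'].clip(lower=lower_bound, upper=upper_bound)

print(data)
```

#### All six rows are kept; only the value 1000 is pulled down to the upper boundary.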

# Imputation
#### Imputation is used when there are missing values in the dataset. Several methods exist to replace missing values:

#### Mean/Median Imputation: Replacing missing values with the mean (for roughly normal distributions) or the median (for skewed distributions).
#### KNN Imputation: Imputing missing values by using the k-nearest neighbors algorithm.
#### Predictive Imputation: Using regression or other models to predict the missing values.
#### Example (Mean Imputation):

In [None]:
import numpy as np
from sklearn.impute import SimpleImputer

# Sample data with missing values
# (the array must be float: np.nan cannot be stored in an integer array)
data = np.array([[1, 2, np.nan],
                 [4, np.nan, 6],
                 [7, 8, 9]], dtype=np.float64)

# Impute missing values with mean
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(data)

print("Data after Imputation:")
print(imputed_data)


In [None]:
from sklearn.impute import KNNImputer
import numpy as np

# Sample data with missing values
data = np.array([[1, 2, np.nan],
                 [4, np.nan, 6],
                 [7, 8, 9]])

# Impute missing values using KNN (k=2)
imputer = KNNImputer(n_neighbors=2)
imputed_data = imputer.fit_transform(data)

print("Data after KNN Imputation:")
print(imputed_data)

####  When to use:

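#### Imputation is appropriate when values are missing because of collection gaps or human error and dropping the affected rows would discard useful data. If missingness depends on other features, methods that use those features (KNN or predictive imputation) are usually a better fit than a single mean or median.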
#### If the missing values represent a small portion of the data, imputation can prevent significant data loss.
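
#### The difference between mean and median imputation shows up clearly on skewed data. The sketch below uses pandas' `fillna` on a made-up `income` column (the data and column names are hypothetical) in which one large value drags the mean upward.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column with one missing value
df = pd.DataFrame({'income': [30.0, 35.0, 40.0, np.nan, 1000.0]})

mean_value = df['income'].mean()      # pulled upward by the value 1000
median_value = df['income'].median()  # stays close to the typical values

df['income_mean_filled'] = df['income'].fillna(mean_value)
df['income_median_filled'] = df['income'].fillna(median_value)

print(df)
```

#### Here the mean-filled value (276.25) sits far from every typical observation, while the median-filled value (37.5) does not, which is why the median is usually preferred for skewed features.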

# Truncation (Winsorization)
#### Truncation is a technique used to cap extreme values at a specified threshold. A common form of truncation is Winsorization, where extreme values are replaced with the nearest non-extreme value. This is useful for handling outliers without completely removing them.


In [None]:
import numpy as np
from scipy.stats.mstats import winsorize

# Sample data with outliers
data = np.array([10, 20, 30, 1000, 50, 60, 2000, 80])

# Apply Winsorization to the upper tail only: cap the top 25% (2 of 8 values).
# With only 8 samples, limits of 0.05 round down to zero values and change nothing.
winsorized_data = winsorize(data, limits=[0.0, 0.25])

print("Original Data:", data)
print("Winsorized Data:", winsorized_data)



#### When to use:

#### When outliers are present but should not be completely removed.
#### For models that are sensitive to extreme values (e.g., linear regression, SVM).
#### In financial data where extreme values may be errors or rare but valid cases.

# Data Integration
#### Data integration is the process of combining data from multiple sources into a single unified dataset. This is necessary in data warehousing, big data analytics, and ETL (Extract, Transform, Load) pipelines.

### Types of Data Integration:
#### Schema Integration: Combining different database schemas.
#### Entity Resolution (Deduplication): Identifying and merging records referring to the same entity.
#### Data Fusion: Resolving conflicts in integrated data (e.g., two datasets report different prices for the same product).
#### Example: Merging Datasets in Pandas

In [None]:
import pandas as pd

# Two datasets with common key 'ID'
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Salary': [50000, 60000, 70000]})

# Merge based on 'ID'
merged_df = pd.merge(df1, df2, on='ID')

print("Merged DataFrame:")
print(merged_df)


#### When to use:

#### When consolidating data from multiple sources (e.g., databases, APIs).
#### For creating comprehensive datasets for machine learning.
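
#### Entity resolution, listed above, can be sketched in its simplest form with pandas: normalize a key so that equivalent records compare equal, then drop the duplicates. The source names (`crm`, `web`), the `email_key` column, and the choice to keep the first record are all assumptions for illustration; real pipelines typically need fuzzier matching.

```python
import pandas as pd

# Two hypothetical sources that describe overlapping customers
crm = pd.DataFrame({'email': ['alice@example.com', 'Bob@Example.com'],
                    'name': ['Alice', 'Bob']})
web = pd.DataFrame({'email': [' bob@example.com', 'carol@example.com'],
                    'name': ['Bob', 'Carol']})

# Stack both sources into one table
combined = pd.concat([crm, web], ignore_index=True)

# Normalize the key: strip whitespace and lowercase
combined['email_key'] = combined['email'].str.strip().str.lower()

# Keep the first record per normalized key (here, the CRM record wins)
deduplicated = (combined
                .drop_duplicates(subset='email_key', keep='first')
                .reset_index(drop=True))

print(deduplicated)
```

#### Choosing which record to keep is a small instance of data fusion: `keep='first'` encodes the rule that the first source is trusted over the second.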

# Data Transformation
#### Data transformation involves converting data into a suitable format for analysis. It includes scaling, encoding, and feature engineering.

### Types of Data Transformation:
#### Normalization & Standardization (already covered above).
#### Log Transformation (to reduce skewness in data).
#### Encoding (converting categorical data into numerical form).
#### Feature Engineering (creating new meaningful features).
#### Example: Log Transformation

In [None]:
import numpy as np
import pandas as pd

# Sample data with skewed distribution
data = pd.DataFrame({'Value': [10, 50, 100, 500, 1000, 5000]})

# Apply log transformation with log1p, i.e. log(1 + x), which is also defined for x = 0
data['Log_Value'] = np.log1p(data['Value'])

print("Data after Log Transformation:")
print(data)


#### When to use:

#### When data is highly skewed (e.g., income distribution).
#### To improve the fit of linear models, whose assumption of normally distributed residuals is often violated when the target or features are heavily skewed.
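
#### Encoding, listed above among the transformation types, can be illustrated with one-hot encoding. This is a minimal sketch using pandas' `get_dummies`; the `color` and `price` columns are made-up sample data.

```python
import pandas as pd

# Hypothetical data with one categorical and one numeric column
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green'],
                   'price': [10, 12, 9, 11]})

# One-hot encode 'color': one 0/1 indicator column per category
encoded = pd.get_dummies(df, columns=['color'], dtype=int)

print(encoded)
```

#### Each row has exactly one indicator set to 1. One-hot encoding suits categories without a natural order; for ordered categories an integer mapping may be more appropriate.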

# Data Reduction
#### Data reduction is used to simplify datasets while retaining important information. This is critical for big data applications where computational efficiency is important.

### Methods of Data Reduction:
#### Feature Selection – Removing irrelevant features.
#### Principal Component Analysis (PCA) – Reducing dimensionality by finding principal components.
#### Sampling – Using a representative subset of the data.
#### Aggregation – Grouping data to reduce granularity (e.g., monthly instead of daily sales).
#### Example: PCA for Dimensionality Reduction

In [None]:
from sklearn.decomposition import PCA
import numpy as np

# Sample data (5 samples, 3 features)
data = np.array([[2.5, 2.4, 1.2],
                 [0.5, 0.7, 0.8],
                 [2.2, 2.9, 1.5],
                 [1.9, 2.2, 1.3],
                 [3.1, 3.0, 1.7]])

# Apply PCA to reduce to 2 dimensions
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(data)

print("Reduced Data:")
print(reduced_data)


#### When to use:

#### When dealing with high-dimensional datasets (e.g., image processing, text data).
#### To improve computational efficiency and avoid the curse of dimensionality.
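
#### To judge how much information a PCA reduction retains, it helps to look at the share of variance each component explains. The sketch below computes that share directly with NumPy's SVD on the centered data; the variable names are chosen for illustration. The resulting ratios correspond to what scikit-learn reports in `PCA.explained_variance_ratio_`.

```python
import numpy as np

# Same sample data as in the PCA example (5 samples, 3 features)
data = np.array([[2.5, 2.4, 1.2],
                 [0.5, 0.7, 0.8],
                 [2.2, 2.9, 1.5],
                 [1.9, 2.2, 1.3],
                 [3.1, 3.0, 1.7]])

# PCA works on mean-centered data
centered = data - data.mean(axis=0)

# Singular values come back in descending order
_, singular_values, _ = np.linalg.svd(centered, full_matrices=False)

# Variance along each component is proportional to the squared singular value
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_variance_ratio)

print("Explained variance ratio:", explained_variance_ratio)
print("Cumulative:", cumulative)
```

#### A common heuristic is to keep the smallest number of components whose cumulative ratio passes a chosen threshold such as 0.95; the threshold itself is a judgment call.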