# **Algorithms for Data Processing**


## **1. Data Cleaning**  
- Handling Missing Values:  
  - Mean/Median/Mode Imputation  
  - K-Nearest Neighbors (KNN) Imputation  
  - Interpolation  
- Outlier Detection and Removal:  
  - Z-score Method  
  - IQR (Inter-quartile Range) Method  
  - Isolation Forest  
  - Local Outlier Factor (LOF)  

In [74]:
import numpy as np
import pandas as pd
import scipy.stats as stats

## **Mean Imputation**
Mean imputation is a technique used to handle missing values in a dataset by replacing them with the **mean** (average) of the available data for that feature. It is commonly applied to numerical data. The mean is calculated as the sum of all values divided by the number of values. This method works best when the data is roughly symmetric, without significant skewness or outliers, because the mean is sensitive to extreme values.

---

## **Median Imputation**
Median imputation replaces missing values with the **median** (middle value) of the available data for that feature. The median is the value that separates the higher half of the data from the lower half. This method is particularly useful for numerical data that is skewed or contains outliers, as the median is less sensitive to extreme values compared to the mean.

---

## **Mode Imputation**
Mode imputation is used to replace missing values with the **mode** (most frequent value) of the available data for that feature. This technique is typically applied to categorical data or discrete numerical data where a clear dominant value exists. The mode represents the value that appears most frequently in the dataset.

---


In [None]:
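# Mean, median, and mode imputation

# Sample data
data = {'A': [1, 2, np.nan, 4, 5]}
df = pd.DataFrame(data)

# Mean imputation (each method fills its own copy so it sees the original NaN)
meanVal = df['A'].mean()
dfMean = df.copy()
dfMean['A'] = dfMean['A'].fillna(meanVal)
print("Mean:\n", dfMean)

# Median imputation
medianVal = df['A'].median()
dfMedian = df.copy()
dfMedian['A'] = dfMedian['A'].fillna(medianVal)
print("Median:\n", dfMedian)

# Mode imputation (mode() returns a Series because ties are possible; take the first value)
modeVal = df['A'].mode()[0]
dfMode = df.copy()
dfMode['A'] = dfMode['A'].fillna(modeVal)
print("Mode:\n", dfMode)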



### **K-Nearest Neighbors (KNN) Imputation**
KNN imputation is a technique used to handle missing values by replacing them with values from the **k-nearest neighbors** in the dataset. It leverages the idea that similar data points (rows) should have similar values for their features.

---

### **How KNN Imputation Works**
1. **Step 1: Identify Missing Values**  
   Locate the missing values in the dataset.

2. **Step 2: Compute Distances**  
   For each row with missing values, compute the distance (e.g., Euclidean, Manhattan) to all other rows using the available features.

3. **Step 3: Select Nearest Neighbors**  
   Identify the **k-nearest neighbors** (rows) with the smallest distances.

4. **Step 4: Impute Missing Values**  
   - For **numerical features**: Replace the missing value with the **mean** or **median** of the corresponding feature values from the k-nearest neighbors.
   - For **categorical features**: Replace the missing value with the **mode** (most frequent value) of the corresponding feature values from the k-nearest neighbors.

---

### **Key Parameters**
- **n_neighbors (k)**: The number of neighbors to consider. A smaller k may overfit, while a larger k may smooth out patterns.
- **Distance Metric**: Common metrics include Euclidean, Manhattan, or Minkowski distance.
- **Weights**: Neighbors can be weighted by their distance (closer neighbors have more influence).
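
The effect of the **Weights** parameter can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a library API: the helper name `weightedNeighborMean` and its `eps` parameter are made up for this example, and inverse-distance weighting is only one possible weighting scheme.

```python
import numpy as np

def weightedNeighborMean(distances, values, k=2, eps=1e-12):
    """Impute one value as the inverse-distance-weighted mean of the k nearest neighbors."""
    distances = np.asarray(distances, dtype=float)
    values = np.asarray(values, dtype=float)
    nearest = np.argsort(distances)[:k]          # indices of the k closest neighbors
    weights = 1.0 / (distances[nearest] + eps)   # closer neighbors get larger weights
    return np.sum(weights * values[nearest]) / np.sum(weights)

# Three candidate neighbors: distances to the incomplete row and their feature values
imputed = weightedNeighborMean([1.0, 3.0, 9.0], [10.0, 20.0, 50.0], k=2)
print(imputed)  # about 12.5, pulled toward the closer neighbor
```

With uniform weights the same two neighbors would give 15, so weighting by distance shifts the estimate toward the most similar row.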

---


### **Advantages of KNN Imputation**
- Preserves relationships between features.
- Works well for datasets with complex patterns.
- Flexible for both numerical and categorical data.

---

### **Disadvantages of KNN Imputation**
- Computationally expensive for large datasets.
- Sensitive to the choice of k and distance metric.
- Requires scaling of features for accurate distance calculations.

---

[Code](https://www.geeksforgeeks.org/k-nearest-neighbours/)


In [None]:

def euclideanDistance(a, b):
    """Compute Euclidean distance between two vectors, ignoring NaNs."""
    mask = ~np.isnan(a) & ~np.isnan(b)  # Consider only non-NaN values
    if not np.any(mask):
        return np.inf  # If all values are NaN, return infinity
    return np.sqrt(np.sum((a[mask] - b[mask])**2))

def kNN(X, k=3):
    """KNN Imputation for missing values in a NumPy array."""
    xImputed = X.copy()  # Copy of original data
    nRows, nColumns = X.shape

    for rowIndex in range(nRows):
        for columnIndex in range(nColumns):
            if np.isnan(X[rowIndex, columnIndex]):  # Check for missing value
                distances = []
                
                # Find K nearest neighbors
                for neighborIndex in range(nRows):
                    if neighborIndex != rowIndex and not np.isnan(X[neighborIndex, columnIndex]):
                        dist = euclideanDistance(X[rowIndex], X[neighborIndex])
                        distances.append((dist, X[neighborIndex, columnIndex]))

                # Sort by distance and select K closest
                distances.sort(key=lambda x: x[0])
                kNeighbors = [val for _, val in distances[:k]]

                # Impute missing value with mean of K neighbors
                if kNeighbors:
                    xImputed[rowIndex, columnIndex] = np.mean(kNeighbors)

    return xImputed

# Example dataset with missing values (NaN)
X = np.array([[1.0, 2.0, np.nan],
              [2.0, np.nan, 3.0],
              [np.nan, 4.0, 5.0],
              [3.0, 4.0, 6.0]])

# Apply KNN imputation
X_imputed = kNN(X, k=2)

print("Original Data with Missing Values:")
print(X)

print("\nImputed Data:")
print(X_imputed)


## **Algorithm for Interpolation**  

Interpolation in a **matrix (2D data)** depends on the nature of missing values and the structure of the data. The best algorithm depends on:  

1. **Smoothness of Data**  
   - If data follows a smooth trend → **Spline Interpolation or Bicubic Interpolation**  
   - If data has sudden jumps → **Nearest-Neighbor Interpolation**  

2. **Computational Efficiency**  
   - If speed is important → **Linear Interpolation or Nearest-Neighbor**  
   - If accuracy is important → **Polynomial or Spline Interpolation**  

---

### **Common Matrix Interpolation Methods**  

| Algorithm | Description | Best For |
|-----------|-------------|----------|
| **Bilinear Interpolation** | Uses linear interpolation in both row and column directions | Image resizing, smooth surfaces |
| **Bicubic Interpolation** | Uses cubic polynomials for smoother results | High-quality image scaling |
| **Nearest-Neighbor Interpolation** | Uses the closest available value | Discrete data, quick estimation |
| **Spline Interpolation** | Fits smooth curves across the data points | Geospatial data, scientific applications |
| **Kriging Interpolation** | A geostatistical method that models spatial correlation | Geographic and environmental data |

---


### **Best Choice for Matrix Interpolation**
| **Scenario** | **Best Algorithm** | **Best Library** |
|-------------|----------------|---------------|
| **Image Processing** | Bicubic Interpolation | OpenCV (`cv2`) |
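| **Smooth Data (Geospatial, Climate)** | Kriging, Spline | SciPy (`scipy.interpolate`) for splines; a dedicated kriging package such as PyKrige for kriging |
| **Quick Estimation** | Nearest-Neighbor | SciPy (`scipy.interpolate.griddata` with `method='nearest'`); note that `numpy.interp` performs 1D linear interpolation |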
| **General Purpose** | Bilinear, Bicubic | SciPy (`griddata`) |
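
To make the bilinear entry in the tables above concrete, here is a minimal NumPy sketch. The function name `bilinearInterpolate` is hypothetical, and the sketch assumes the query position lies inside the grid (0 ≤ row ≤ rows − 1, 0 ≤ col ≤ cols − 1).

```python
import numpy as np

def bilinearInterpolate(grid, row, col):
    """Estimate the value at fractional position (row, col) of a 2D array."""
    r0, c0 = int(np.floor(row)), int(np.floor(col))
    r1 = min(r0 + 1, grid.shape[0] - 1)   # clamp at the last row
    c1 = min(c0 + 1, grid.shape[1] - 1)   # clamp at the last column
    dr, dc = row - r0, col - c0           # fractional offsets inside the cell

    # Interpolate along the columns on both bounding rows, then between the rows
    top = (1 - dc) * grid[r0, c0] + dc * grid[r0, c1]
    bottom = (1 - dc) * grid[r1, c0] + dc * grid[r1, c1]
    return (1 - dr) * top + dr * bottom

grid = np.array([[0.0, 10.0],
                 [20.0, 30.0]])
center = bilinearInterpolate(grid, 0.5, 0.5)
print(center)  # 15.0, the average of the four corners
```

Bicubic interpolation follows the same idea but fits cubic polynomials over a 4×4 neighborhood, which is why it is smoother and slower.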

---


In [None]:

def linearInterpolation(X):
    """Perform linear interpolation on 1D NumPy array with missing values (NaN)."""
    n = len(X)
    for i in range(n):
        if np.isnan(X[i]):  # If missing value found
            left, right = None, None

            # Find the nearest left non-NaN value
            for k in range(i - 1, -1, -1):
                if not np.isnan(X[k]):
                    left = (k, X[k])
                    break

            # Find the nearest right non-NaN value
            for k in range(i + 1, n):
                if not np.isnan(X[k]):
                    right = (k, X[k])
                    break

            # If both left and right exist, apply linear interpolation
            if left and right:
                x1, y1 = left
                x2, y2 = right
                X[i] = y1 + (y2 - y1) * (i - x1) / (x2 - x1)

    return X

def nearestNeighborInterpolation(X):
    """Perform nearest-neighbor interpolation on 1D NumPy array with missing values."""
    original = X.copy()  # Look up neighbors in the unfilled data only
    n = len(X)
    for i in range(n):
        if np.isnan(original[i]):  # If missing value found
            left, right = None, None

            # Find the nearest left non-NaN value and remember its index
            for k in range(i - 1, -1, -1):
                if not np.isnan(original[k]):
                    left = (k, original[k])
                    break

            # Find the nearest right non-NaN value and remember its index
            for k in range(i + 1, n):
                if not np.isnan(original[k]):
                    right = (k, original[k])
                    break

            # Use whichever known value is closer (ties go to the left)
            if left is not None and right is not None:
                X[i] = left[1] if (i - left[0]) <= (right[0] - i) else right[1]
            elif left is not None:
                X[i] = left[1]
            elif right is not None:
                X[i] = right[1]

    return X

def polynomialInterpolation(X, degree=2):
    """Perform polynomial interpolation on a 1D NumPy array with missing values."""
    xKnown = np.where(~np.isnan(X))[0]  # Indices of known values
    yKnown = X[xKnown]  # Known values
    xMissing = np.where(np.isnan(X))[0]  # Indices of missing values

    # Fit a polynomial curve to the known data points
    polyCoefficients = np.polyfit(xKnown, yKnown, degree)
    poly_func = np.poly1d(polyCoefficients)

    # Predict missing values using the polynomial function
    X[xMissing] = poly_func(xMissing)

    return X

# Example dataset with missing values
X = np.array([1.0, np.nan, 3.0, 4.0, np.nan, 6.0, np.nan, 8.0])

# Apply interpolation methods
XLinear = linearInterpolation(X.copy())
XNearest = nearestNeighborInterpolation(X.copy())
XPoly = polynomialInterpolation(X.copy(), degree=2)

print("Original Data with Missing Values:")
print(X)

print("\nLinear Interpolation:")
print(XLinear)

print("\nNearest-Neighbor Interpolation:")
print(XNearest)

print("\nPolynomial Interpolation (Degree 2):")
print(XPoly)


In [None]:
# Using a library
feature = np.array([0, 1, 2, 3, 4])
target = np.array([0, 1, np.nan, 3, 4])

# Linearly interpolate the missing value from the known points
known = ~np.isnan(target)
yInterpolated = np.interp(feature, feature[known], target[known])
print(yInterpolated)

# For 2D data use SciPy, and for image matrices use OpenCV (use cases are described above)




# **Outlier Detection and Removal**  

### **What is an Outlier?**  
An **outlier** is a data point that significantly differs from the rest of the dataset. Outliers can occur due to errors, variability in data, or rare events. Detecting and removing outliers is essential in **data preprocessing** for improving machine learning model performance.  

---

## **Methods for Outlier Detection and Removal**  

### **1. Z-score Method (Standard Score Method)**  
 **Concept:**  
- Measures how many standard deviations a data point is from the mean.  
- If a data point's **Z-score is too high or too low**, it's considered an outlier.  

 **Formula:**  
$
Z = \frac{(X - \mu)}{\sigma}
$
Where:  
- $ X $ = data point  
- $ \mu $ = mean of the dataset  
- $ \sigma $ = standard deviation  

 **Threshold:**  
- Commonly used threshold: **|Z| > 3**, that is Z > 3 or Z < -3 (for normally distributed data, about 99.7% of values fall within 3 standard deviations of the mean).  

 **Steps:**  
1. Compute the mean ($\mu$) and standard deviation ($\sigma$).  
2. Calculate the Z-score for each data point.  
3. Remove points where **|Z-score| > threshold (usually 3)**.  

---

### **2. IQR (Interquartile Range) Method**  
 **Concept:**  
- Uses the **middle 50% of the data** to detect outliers.  
- Any value **below the lower bound or above the upper bound** is considered an outlier.  

 **Formula:**  
$
IQR = Q3 - Q1
$
$
\text{Lower Bound} = Q1 - 1.5 \times IQR
$
$
\text{Upper Bound} = Q3 + 1.5 \times IQR
$
Where:  
- **Q1 (25th percentile)** = First quartile  
- **Q3 (75th percentile)** = Third quartile  
- **IQR (Interquartile Range)** = Spread of middle 50% of data  

 **Steps:**  
1. Compute **Q1 (25th percentile)** and **Q3 (75th percentile)**.  
2. Compute **IQR = Q3 - Q1**.  
3. Compute **upper** and **lower bounds**.  
4. Remove values outside these bounds.  
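
The four steps above can be sketched directly with NumPy. The sample array is the same one used for the other outlier examples in this notebook; the variable names are illustrative.

```python
import numpy as np

data = np.array([10, 12, 14, 15, 16, 100])  # 100 is an outlier

# Steps 1 and 2: quartiles and interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Step 3: bounds at 1.5 * IQR beyond the quartiles
lowerBound = q1 - 1.5 * iqr
upperBound = q3 + 1.5 * iqr

# Step 4: keep only the values inside the bounds
filteredData = data[(data >= lowerBound) & (data <= upperBound)]
print(lowerBound, upperBound)
print(filteredData)
```

Because the quartiles barely move when a single extreme value is added, the bounds stay tight around the bulk of the data and 100 is dropped.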

---

### **3. Isolation Forest (IForest)**  
 **Concept:**  
- A machine learning algorithm that isolates anomalies by **randomly selecting features** and **splitting data points**.  
- Outliers get isolated faster because they lie in low-density regions.  

 **How It Works?**  
1. Randomly select a feature and split it at a random value.  
2. Build a tree where normal points need **more splits** to be isolated, while **outliers are isolated quickly**.  
3. Compute an **anomaly score**, where higher values indicate outliers.  

 **Pros:**  
- Works on **high-dimensional data**  
- **Unsupervised** (no need for labeled data)  
- Efficient for large datasets  


---

### **4. Local Outlier Factor (LOF)**  
 **Concept:**  
- Measures how **isolated a data point is** compared to its neighbors.  
- Uses **density comparison**—if a point is in a low-density region, it's an outlier.  

 **How It Works?**  
1. Compute the **local density** of each point by looking at its **k nearest neighbors**.  
2. Compare each point’s density with its neighbors.  
3. If a point’s density is **significantly lower** than its neighbors, it's an outlier.  
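
For reference, the three steps correspond to the standard LOF quantities. The notation is introduced only for this note: $N_k(p)$ is the set of k nearest neighbors of a point $p$, and $d(p, o)$ is the distance between $p$ and $o$.

$
\text{reach-dist}_k(p, o) = \max\{\, k\text{-distance}(o),\; d(p, o) \,\}
$
$
\text{lrd}_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \text{reach-dist}_k(p, o) \right)^{-1}
$
$
\text{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\text{lrd}_k(o)}{\text{lrd}_k(p)}
$

A value close to 1 means the point is about as dense as its neighbors, while values well above 1 indicate an outlier.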

 **Pros:**  
- Works well in **clusters**  
- **Unsupervised** (no need for labeled data)  
- Detects **local anomalies** (useful when outliers are not globally different but locally different)  

---

## **Comparison of Outlier Detection Methods**  

| Method | Type | Best For | Pros | Cons |
|--------|------|----------|------|------|
| **Z-score** | Statistical | Normally distributed data | Simple, fast | Sensitive to skewed data |
| **IQR Method** | Statistical | Skewed data, small datasets | Robust to skewed data | Ignores data distribution |
| **Isolation Forest** | Machine Learning | Large, high-dimensional datasets | Works well on large datasets | Needs tuning |
| **LOF** | Machine Learning | Clustering-based outlier detection | Detects **local** anomalies | Computationally expensive |

---

## **Final Recommendation**
- **For Normally Distributed Data** → **Z-score**  
- **For Skewed Data / Small Data** → **IQR**  
- **For Large & High-Dimensional Data** → **Isolation Forest**  
- **For Clustered Data** → **Local Outlier Factor (LOF)**  


In [None]:

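# Sample dataset
data = np.array([10, 12, 14, 15, 16, 100])  # 100 is an outlier

# Compute mean and standard deviation
mean = np.mean(data)
standardDeviation = np.std(data)

# Compute Z-scores
zScores = (data - mean) / standardDeviation

# Remove outliers. With only n = 6 points, |Z| can never exceed sqrt(n - 1) ≈ 2.24
# (population standard deviation), so the usual threshold of 3 would flag nothing.
# A threshold of 2 is used for this small sample.
threshold = 2
filteredData = data[np.abs(zScores) < threshold]
print(filteredData)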




## **Raw Implementation of Isolation Forest**

### **How It Works**
- Randomly selects a feature and a split value.
- Constructs a tree by recursively splitting the data.
- Outliers are isolated quickly in shallow trees.
- The average depth of a point determines its anomaly score.
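
For reference, the usual way to turn the average depth into a score is shown here. The symbols are introduced only for this note: $h(x)$ is the path length of point $x$ in one tree, $E[h(x)]$ is its average over all trees, and $n$ is the number of points used to build each tree.

$
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}, \qquad c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad H(i) \approx \ln(i) + 0.5772156649
$

Scores close to 1 indicate anomalies, and scores well below 0.5 indicate normal points.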

---

## **Comparison of Isolation Forest vs Local Outlier Factor (LOF)**

| **Feature**             | **Isolation Forest**            | **Local Outlier Factor (LOF)**    |
|------------------------|---------------------------------|-----------------------------------|
| **Type**              | Tree-based method               | Density-based method             |
| **Best for**          | High-dimensional data           | Small and structured datasets    |
| **Computational Cost**| Fast (O(n log n))               | Slow (O(n²) for large datasets)  |
| **Interpretable**     | Yes (Tree splits)               | Hard to interpret densities      |
| **Works with Clusters** | No (assumes global outliers)  | Yes (detects local outliers)     |

---

## **Final Thoughts**
- **Use Isolation Forest** when working with **large, high-dimensional datasets**.
- **Use LOF** when outliers are **locally different from neighbors**.


In [None]:

class IsolationTree:
    def __init__(self, maxDepth):
        self.maxDepth = maxDepth
        self.left = None
        self.right = None
        self.splitFeature = None
        self.splitValue = None
        self.size = 0

    def fit(self, X, depth=0):
        if depth >= self.maxDepth or X.shape[0] <= 1:
            return
        
        self.splitFeature = np.random.randint(0, X.shape[1])
        minValue, maxValue = np.min(X[:, self.splitFeature]), np.max(X[:, self.splitFeature])

        if minValue == maxValue:
            return

        self.splitValue = np.random.uniform(minValue, maxValue)
        
        leftMask = X[:, self.splitFeature] < self.splitValue
        xLeft, xRight = X[leftMask], X[~leftMask]

        self.left = IsolationTree(self.maxDepth)
        self.right = IsolationTree(self.maxDepth)

        self.left.fit(xLeft, depth + 1)
        self.right.fit(xRight, depth + 1)

class IsolationForest:
    def __init__(self, nTrees=100, maxDepth=10):
        self.nTrees = nTrees
        self.maxDepth = maxDepth
        self.trees = []
        self.sampleSize = None  # Number of points each tree is built from

    def fit(self, X):
        self.sampleSize = min(256, X.shape[0])
        self.trees = [IsolationTree(self.maxDepth) for _ in range(self.nTrees)]
        for tree in self.trees:
            sampleIndices = np.random.choice(X.shape[0], size=self.sampleSize, replace=False)
            tree.fit(X[sampleIndices])

    def pathLength(self, x, tree, depth=0):
        # A node with no split value is a leaf, so the path ends here
        if tree is None or tree.splitValue is None:
            return depth

        if x[tree.splitFeature] < tree.splitValue:
            return self.pathLength(x, tree.left, depth + 1)
        else:
            return self.pathLength(x, tree.right, depth + 1)

    def anomalyScore(self, X):
        pathLengths = np.array([np.mean([self.pathLength(x, tree) for tree in self.trees]) for x in X])
        # Normalize by c(n), the average path length for the training sub-sample size n
        n = self.sampleSize
        c = 2 * (np.log(n - 1) + 0.5772156649) - (2 * (n - 1) / n)
        scores = 2 ** (-pathLengths / c)
        return scores

    def predict(self, X, threshold=0.6):
        scores = self.anomalyScore(X)
        return np.where(scores > threshold, -1, 1)  # -1 for outliers, 1 for normal points

# Example usage
X = np.array([[10], [12], [14], [15], [16], [100]])  # 100 is an outlier

model = IsolationForest(nTrees=50, maxDepth=8)
model.fit(X)

predictions = model.predict(X)
print(predictions)  # -1 indicates outlier, 1 indicates normal


In [None]:

class LocalOutlierFactor:
    """Minimal LOF. Assumes the points being scored are the fitted points and nNeighbors < len(X)."""
    def __init__(self, nNeighbors=3):
        self.nNeighbors = nNeighbors
        self.X = None

    def fit(self, X):
        self.X = X

    def euclideanDistance(self, a, b):
        return np.sqrt(np.sum((a - b) ** 2))

    def kNearest(self, point):
        """Return (k nearest neighbors, k-distance); index 0 is skipped because it is the point itself."""
        distances = np.array([self.euclideanDistance(point, x) for x in self.X])
        order = np.argsort(distances)[1:self.nNeighbors + 1]
        return self.X[order], distances[order[-1]]

    def reachabilityDistance(self, point, neighbor):
        return max(self.kNearest(neighbor)[1], self.euclideanDistance(point, neighbor))

    def localReachabilityDensity(self, point):
        # Average reachability distance to the k nearest neighbors only
        neighbors, _ = self.kNearest(point)
        distances = np.array([self.reachabilityDistance(point, neighbor) for neighbor in neighbors])
        return 1 / (np.mean(distances) + 1e-10)  # Avoid division by zero

    def localOutlierFactor(self, point):
        neighbors, _ = self.kNearest(point)
        lrd_point = self.localReachabilityDensity(point)
        lrd_neighbors = np.array([self.localReachabilityDensity(neighbor) for neighbor in neighbors])
        return np.mean(lrd_neighbors) / (lrd_point + 1e-10)  # Avoid division by zero

    def predict(self, X, threshold=1.5):
        lof_scores = np.array([self.localOutlierFactor(point) for point in X])
        return np.where(lof_scores > threshold, -1, 1)  # -1 for outliers, 1 for normal points

# Example usage
X = np.array([[10], [12], [14], [15], [16], [100]])  # 100 is an outlier

lof_model = LocalOutlierFactor(nNeighbors=2)
lof_model.fit(X)

predictions = lof_model.predict(X)
print(predictions)  # -1 indicates outlier, 1 indicates normal


# **Feature Scaling in Machine Learning**  

## **What is Feature Scaling?**  
Feature scaling is a **data preprocessing technique** used to normalize or standardize numerical features in a dataset. Many machine learning algorithms, especially those relying on **distance-based calculations** (e.g., K-Nearest Neighbors, SVM, PCA), perform better when features are on a similar scale.  

---

## **1. Standardization (Z-score Normalization)**  
 **Concept:**  
- Transforms data to have a **mean of 0** and a **standard deviation of 1**.  
- Helps in **normalizing** features with **different units** or **ranges**.  

 **Formula:**  
$
X_{\text{scaled}} = \frac{X - \mu}{\sigma}
$
Where:  
- $ X $ = Original value  
- $ \mu $ = Mean of the feature  
- $ \sigma $ = Standard deviation of the feature  

 **Best for:**  
- **Normally distributed data**  
- Models that assume **zero-mean and unit variance** (e.g., PCA, Logistic Regression, SVM)  
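
A minimal NumPy sketch of the standardization formula follows. The function name `standardize` is illustrative, and the population standard deviation (NumPy's default) is assumed.

```python
import numpy as np

def standardize(data):
    """Scale a 1D array to zero mean and unit (population) standard deviation."""
    data = np.asarray(data, dtype=float)
    std = np.std(data)
    if std == 0:
        return np.zeros_like(data)   # constant feature: nothing to scale
    return (data - np.mean(data)) / std

scaled = standardize([50, 60, 70, 80, 90])
print(scaled)
print(scaled.mean(), scaled.std())  # approximately 0 and 1
```

In a train/test setting, the mean and standard deviation would normally be computed on the training data only and reused for the test data.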

---

## **2. Min-Max Scaling (Normalization)**  
 **Concept:**  
- Rescales data to a fixed range, usually **[0,1]** or **[-1,1]**.  
- Retains the **original distribution** but compresses values into a small range.  

 **Formula:**  
$
X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}
$
Where:  
- $ X $ = Original value  
- $ X_{\min} $, $ X_{\max} $ = Minimum and maximum values in the dataset  

 **Best for:**  
- **Algorithms that require bounded values** (e.g., Neural Networks, K-Means Clustering)  
- **Features with different units**  

---

## **3. Robust Scaling (for Handling Outliers)**  
 **Concept:**  
- Uses the **median** and **interquartile range (IQR)** instead of the mean and standard deviation.  
- **Reduces the effect of outliers** by scaling data based on robust statistics.  

 **Formula:**  
$
X_{\text{scaled}} = \frac{X - \text{median}(X)}{Q3 - Q1}
$
Where:  
- $ \text{median}(X) $ = 50th percentile (Median) of the feature  
- $ Q1 $ = 25th percentile (First quartile)  
- $ Q3 $ = 75th percentile (Third quartile)  

 **Best for:**  
- **Datasets with extreme outliers**  
- **Skewed distributions**  

---

## **4. Log Transformation**  
 **Concept:**  
- **Reduces right-skewed distributions** by applying a logarithmic function.  
- **Compresses large values** and expands small values.  
- Helps **normalize** data that follows a power-law distribution.  

 **Formula:**  
$
X_{\text{scaled}} = \log(X + c)
$
Where:  
- $ c $ is a small constant (to avoid log(0) errors).  

 **Best for:**  
- **Data with exponential growth** (e.g., income, population, price distributions).  
- **Handling skewness and heteroscedasticity (unequal variance)**.  

---

## **5. Power Transformation (Box-Cox & Yeo-Johnson)**  
 **Concept:**  
- **Box-Cox and Yeo-Johnson** are parametric transformations that make data more **normally distributed**.  
- Box-Cox requires strictly **positive** data, whereas Yeo-Johnson also works with **zero** and **negative** values.  

### **(a) Box-Cox Transformation**  
- Works only on **positive data** ($ X > 0 $).  
- Uses a parameter **$ \lambda $** to transform data.  

 **Formula:**  
$
X_{\text{scaled}} =
\begin{cases} 
\frac{X^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\log(X), & \lambda = 0
\end{cases}
$
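
The piecewise Box-Cox formula can be written out for a fixed $\lambda$ as a short sketch. The function name `boxCoxTransform` is hypothetical; in practice `scipy.stats.boxcox` also estimates $\lambda$ from the data when none is supplied.

```python
import numpy as np

def boxCoxTransform(x, lam):
    """Apply the Box-Cox formula for a given lambda to strictly positive data."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return np.log(x)              # limiting case of the formula
    return (x ** lam - 1) / lam

values = np.array([1.0, 4.0, 9.0])
print(boxCoxTransform(values, 0.5))   # 2 * (sqrt(x) - 1)
print(boxCoxTransform(values, 0))     # natural logarithm
```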

---

### **(b) Yeo-Johnson Transformation**  
- Works on **both positive and negative** data.  
- Similar to Box-Cox but designed for datasets containing **zero or negative values**.  

 **Formula:**  
$
X_{\text{scaled}} =
\begin{cases} 
\frac{(X + 1)^{\lambda} - 1}{\lambda}, & X \geq 0, \lambda \neq 0 \\
\log(X + 1), & X \geq 0, \lambda = 0 \\
-\frac{(-X + 1)^{2 - \lambda} - 1}{2 - \lambda}, & X < 0, \lambda \neq 2 \\
-\log(-X + 1), & X < 0, \lambda = 2
\end{cases}
$
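
The four cases of the Yeo-Johnson formula translate into the following sketch for a fixed $\lambda$. The function name `yeoJohnsonTransform` is hypothetical; `scipy.stats.yeojohnson` additionally fits $\lambda$ when it is not given.

```python
import numpy as np

def yeoJohnsonTransform(x, lam):
    """Apply the Yeo-Johnson formula for a given lambda; accepts any real values."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0

    # Non-negative values: Box-Cox applied to (x + 1)
    if lam != 0:
        out[pos] = ((x[pos] + 1) ** lam - 1) / lam
    else:
        out[pos] = np.log1p(x[pos])

    # Negative values: mirrored transform with exponent (2 - lambda)
    if lam != 2:
        out[~pos] = -(((-x[~pos] + 1) ** (2 - lam)) - 1) / (2 - lam)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

values = np.array([-3.0, 0.0, 3.0])
print(yeoJohnsonTransform(values, 1.0))  # lambda = 1 leaves the data unchanged
```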


## **Comparison of Feature Scaling Methods**

| **Method**               | **Best For**                  | **Handles Outliers?** | **Works with Negative Values?** |
|-------------------------|-----------------------------|----------------------|--------------------------------|
| **Standardization (Z-score)** | Normal distribution, SVM, PCA | ❌ No | ✅ Yes |
| **Min-Max Scaling**      | Neural networks, bounded data | ❌ No | ✅ Yes |
| **Robust Scaling**       | Data with outliers | ✅ Yes | ✅ Yes |
| **Log Transformation**   | Skewed data, large values | ❌ No | ❌ No (Must be positive) |
| **Box-Cox Transformation** | Normalizing non-normal data | ❌ No | ❌ No (Must be positive) |
| **Yeo-Johnson Transformation** | Normalizing non-normal data | ✅ Yes | ✅ Yes |

---

## **Final Recommendations**
- **Use Standardization (Z-score)** when working with **normally distributed data**.
- **Use Min-Max Scaling** when data should be **bounded within a fixed range**.
- **Use Robust Scaling** if the data contains **outliers**.
- **Use Log Transformation** for **right-skewed data**.
- **Use Box-Cox or Yeo-Johnson** for **non-normal distributions**.


In [None]:
# Z-score standardization uses the same formula as the Z-scores computed in the outlier example above


# Min Max Scaling (Normalization)

data = np.array([50, 60, 70, 80, 90])


def minMaxScaling(data):
    minValue = np.min(data)
    maxValue = np.max(data)
    normalizedData = (data - minValue) / (maxValue - minValue)
    return normalizedData

print(minMaxScaling(data))


## Robust Scaling (Outliers)

def robustScaling(data):
    median = np.median(data)
    Q1 = np.percentile(data, 25)
    Q3 = np.percentile(data, 75)
    IQR = Q3 - Q1

    # Center on the median and scale by the interquartile range
    robustScaledData = (data - median) / IQR
    return robustScaledData
print(robustScaling(data))



# Logged Transformation

def loggedTransformation(data):
    loggedTransformData = np.log(data + 1)
    return loggedTransformData

print(loggedTransformation(data))



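# Box-Cox (requires strictly positive data)
def boxCox(data):
    # stats.boxcox returns the transformed data and the fitted lambda
    transformed, fittedLambda = stats.boxcox(data)
    return transformed


# Yeo-Johnson (also accepts zero and negative values)

def yeoJohnson(data):
    # stats.yeojohnson likewise returns the transformed data and the fitted lambda
    transformed, fittedLambda = stats.yeojohnson(data)
    return transformed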



**Encoding categorical data** is a process used to convert categorical variables into numerical values so that machine learning algorithms can understand them. Here’s an overview of the common techniques used:

1. **One-Hot Encoding**:
   - This method creates new binary columns for each category in a feature. If a category exists, the corresponding column is marked with a `1`, and others are marked with `0`.
   - **Example**: For the "Color" feature with categories `['Red', 'Green', 'Blue']`, it would create 3 new columns: `Color_Red`, `Color_Green`, `Color_Blue`. A row with `Color = 'Red'` would become `[1, 0, 0]`.

2. **Label Encoding**:
   - Label encoding assigns a unique integer to each category. The categories are mapped to numerical values, usually starting from `0`.
   - **Example**: For `['Red', 'Green', 'Blue']`, the encoding might look like this: `Red -> 0`, `Green -> 1`, `Blue -> 2`.

3. **Target Encoding (Mean Encoding)**:
   - This method replaces each category with the mean of the target variable for that category. It is particularly useful when dealing with high cardinality features.
   - **Example**: If you are predicting house prices and have a feature `Neighborhood`, you would replace each neighborhood with the average price of houses in that neighborhood.

4. **Frequency Encoding**:
   - This technique encodes categories based on the frequency of their occurrence in the dataset. Each category is replaced with the number of times it appears.
   - **Example**: For `['Red', 'Green', 'Blue', 'Red', 'Red']`, `Red` would be encoded as `3`, `Green` as `1`, and `Blue` as `1`.

5. **Binary Encoding**:
   - Binary encoding is a mix of one-hot encoding and label encoding. It first converts the category labels to integers and then transforms those integers into binary code. Each digit of the binary code is represented by a column.
   - **Example**: For categories `['Red', 'Green', 'Blue']`, label encoding would first map them to integers: `Red -> 0`, `Green -> 1`, `Blue -> 2`. Then, the binary encoding of these integers would be:
     - `0 -> 00`
     - `1 -> 01`
     - `2 -> 10`
     These values would be split across two binary columns.

6. **Ordinal Encoding**:
   - This technique is used when the categories have a meaningful order. Each category is assigned an integer based on its position in the order.
   - **Example**: For the feature `Size` with categories `['Small', 'Medium', 'Large']`, ordinal encoding would map them as `Small -> 0`, `Medium -> 1`, `Large -> 2`.
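
Label, frequency, ordinal, and binary encoding can each be sketched in a line or two of pandas. The column names and the `sizeOrder` mapping below are illustrative assumptions, and label codes are assigned in order of first appearance, which matches the `Red -> 0`, `Green -> 1`, `Blue -> 2` example above.

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Red', 'Red'],
    'Size': ['Small', 'Large', 'Medium', 'Small', 'Large'],
})

# Label encoding: one integer per category, in order of first appearance
labelCodes, labelCategories = pd.factorize(df['Color'])
df['Color_label'] = labelCodes

# Frequency encoding: replace each category with its count
df['Color_freq'] = df['Color'].map(df['Color'].value_counts())

# Ordinal encoding: the meaningful order is supplied explicitly
sizeOrder = {'Small': 0, 'Medium': 1, 'Large': 2}
df['Size_ordinal'] = df['Size'].map(sizeOrder)

# Binary encoding: write each label code in binary, one column per bit
nBits = max(1, int(labelCodes.max()).bit_length())
for bit in range(nBits):
    shift = nBits - 1 - bit          # most significant bit first
    df[f'Color_bin{bit}'] = (labelCodes >> shift) & 1

print(df)
```

Three categories need only two binary columns here, whereas one-hot encoding would need three; the saving grows with the number of categories.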


In [None]:
# Sample data
df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red'],
    'Price': [100, 200, 150, 180, 120]
})
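def oneHotEncoding(data):
    # One binary column per category in 'Color'
    return pd.get_dummies(data, columns=['Color'])

print(oneHotEncoding(df))


def targetEncoding(data):
    # Replace each category with the mean target ('Price') of that category
    categoryMeans = data.groupby('Color')['Price'].mean()
    encoded = data.copy()
    encoded['Color_encoded'] = encoded['Color'].map(categoryMeans)
    return encoded

print(targetEncoding(df))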


### **Feature Selection**

Feature selection is a crucial step in the data preprocessing pipeline that aims to reduce the number of input features to the model while retaining the most important ones. By removing irrelevant or redundant features, it improves model performance, reduces over-fitting, speeds up computation, and provides better interpretability.

There are three main types of feature selection methods: **Filter Methods**, **Wrapper Methods**, and **Embedded Methods**.

---

### **1. Filter Methods**
Filter methods evaluate the importance of features independently of any machine learning model. They rely on statistical tests to measure the relevance of features to the target variable. These methods are typically fast and can be used as a preprocessing step before applying more complex models.

- **Mutual Information**: Measures the dependency between two variables. Higher mutual information means the feature provides more relevant information about the target variable.
- **Chi-Square Test**: A statistical test used to determine if two categorical variables are independent. It is often used for evaluating categorical features in relation to the target variable.
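- **ANOVA (Analysis of Variance) Test**: Used to determine if there are significant differences between the means of different groups. It applies when one variable is categorical and the other is continuous: either a categorical feature with a continuous target, or (as in the common ANOVA F-test for classification) a continuous feature with a categorical target, where the target classes define the groups.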
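
As an illustration of a filter method, the chi-square statistic can be computed from a contingency table with plain NumPy. This sketch returns only the statistic (no p-value or degrees of freedom); the function name and the toy arrays are assumptions made for the example.

```python
import numpy as np

def chiSquareStatistic(feature, target):
    """Chi-square statistic for two categorical arrays of equal length."""
    feature, target = np.asarray(feature), np.asarray(target)
    fValues, fIdx = np.unique(feature, return_inverse=True)
    tValues, tIdx = np.unique(target, return_inverse=True)

    # Observed contingency table: rows are feature categories, columns are target classes
    observed = np.zeros((len(fValues), len(tValues)))
    np.add.at(observed, (fIdx, tIdx), 1)

    # Expected counts if the feature and the target were independent
    rowTotals = observed.sum(axis=1, keepdims=True)
    columnTotals = observed.sum(axis=0, keepdims=True)
    expected = rowTotals * columnTotals / observed.sum()
    return np.sum((observed - expected) ** 2 / expected)

target      = ['yes', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no']
informative = ['a',   'a',   'b',  'b',  'a',   'b',  'a',   'b']   # mirrors the target
unrelated   = ['x',   'y',   'x',  'y',  'x',   'y',  'y',   'x']   # balanced across classes

print(chiSquareStatistic(informative, target))  # 8.0
print(chiSquareStatistic(unrelated, target))    # 0.0
```

A filter would rank features by this statistic and keep the highest-scoring ones, without training any model.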

---

### **2. Wrapper Methods**
Wrapper methods evaluate subsets of features by training a machine learning model on them and assessing its performance. These methods are computationally expensive but can potentially find the best subset of features.

- **Recursive Feature Elimination (RFE)**: A technique that recursively removes the least important features based on the performance of the model. The model is trained repeatedly on different subsets of features to identify the most important ones.
- **Forward/Backward Feature Selection**:
  - **Forward Selection**: Starts with no features and iteratively adds features that improve the model's performance.
  - **Backward Selection**: Starts with all features and iteratively removes features that have the least impact on model performance.
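
Forward selection can be sketched as a greedy loop around any model. The version below is a simplified illustration: it uses ordinary least squares and the training error as the score, whereas a real wrapper would usually score each candidate subset on held-out data. All function names and the synthetic data are assumptions.

```python
import numpy as np

def residualSumOfSquares(X, y):
    """Fit ordinary least squares with an intercept and return the residual sum of squares."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(residuals @ residuals)

def forwardSelection(X, y, nFeatures):
    """Greedily add the feature that lowers the error the most."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < nFeatures:
        errors = {j: residualSumOfSquares(X[:, selected + [j]], y) for j in remaining}
        best = min(errors, key=errors.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data: only columns 1 and 3 influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 1] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

chosen = forwardSelection(X, y, nFeatures=2)
print(chosen)  # expected to contain columns 1 and 3
```

Backward selection is the mirror image: start from all columns and repeatedly drop the one whose removal increases the error the least.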

---

### **3. Embedded Methods**
Embedded methods perform feature selection during the model training process. These methods are efficient because they combine feature selection and model training into one process.

- **LASSO (L1 Regularization)**: LASSO adds a penalty to the model that encourages the coefficients of less important features to be zero, effectively performing feature selection. This is commonly used in linear regression models.
- **Decision Tree Feature Importance**: Decision trees and tree-based algorithms, like Random Forest and XGBoost, can compute the importance of each feature based on how well they reduce impurity (e.g., Gini index or entropy).
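- **SHAP (SHapley Additive exPlanations)**: SHAP values provide a unified measure of feature importance for any machine learning model. They are based on cooperative game theory and explain how much each feature contributes to a model's prediction. Strictly speaking, SHAP is a post-hoc explanation method applied after training rather than an embedded one, but its importance scores are often used to rank and select features.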
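
To show how an L1 penalty drives coefficients to exactly zero, here is a minimal coordinate-descent sketch of LASSO. It assumes roughly centered features and target (no intercept is fitted), and the function names, `alpha` value, and synthetic data are all illustrative choices rather than a reference implementation.

```python
import numpy as np

def softThreshold(value, threshold):
    """Shrink value toward zero by threshold; values within the threshold become zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lassoCoordinateDescent(X, y, alpha, nIterations=100):
    """Minimize (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    columnScale = (X ** 2).sum(axis=0) / n
    for _ in range(nIterations):
        for j in range(p):
            # Residual with feature j's current contribution removed
            partialResidual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partialResidual / n
            w[j] = softThreshold(rho, alpha) / columnScale[j]
    return w

# Synthetic data: only columns 1 and 3 influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 1] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

w = lassoCoordinateDescent(X, y, alpha=0.5)
print(np.round(w, 3))  # irrelevant columns are expected to get a coefficient of 0
```

The surviving coefficients are shrunk toward zero as well, which is the price of the penalty; features with a zero coefficient are the ones the model has effectively discarded.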

---

### **Dimensionality Reduction**

Dimensionality reduction is the process of reducing the number of features or variables in a dataset, while retaining as much information as possible. This is particularly useful for improving computational efficiency, reducing noise, and mitigating the curse of dimensionality in machine learning models. Dimensionality reduction techniques are broadly categorized into linear and non-linear methods.

Here are the most commonly used dimensionality reduction techniques:

---

### **1. Principal Component Analysis (PCA)**
PCA is a **linear** dimensionality reduction technique that transforms the original features into a smaller set of new features, called **principal components**. These principal components are linear combinations of the original features and are ordered in such a way that the first few components retain most of the variance (information) in the dataset.

#### Key Concepts:
- PCA aims to retain as much of the dataset's variance as possible while reducing its dimensionality.
- It does this by identifying the directions (principal components) in which the data varies the most.
- It involves an eigenvalue decomposition of the covariance matrix of the data.

#### Advantages:
- PCA is widely used for feature extraction and noise reduction.
- It can reduce over-fitting by replacing correlated features with a smaller set of uncorrelated components.
- PCA can help visualize high-dimensional data by projecting it onto two or three principal components.
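
How much variance each principal component retains can be checked with a short sketch. The example below uses the singular value decomposition of the centered data, which is equivalent to an eigendecomposition of the covariance matrix; the synthetic data (a third feature that nearly duplicates the first) is an assumption chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): the third feature is almost a copy of the first
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
X = np.column_stack([x1, x2, x1 + 0.05 * rng.normal(size=100)])

# Center the data, then decompose it
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)

# Variance captured by each component and its share of the total
explained_variance = singular_values ** 2 / (len(X) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Keep the first two components for a 2D projection
projected = X_centered @ components[:2].T

print("Explained variance ratio:", np.round(explained_variance_ratio, 4))
print("Projected shape:", projected.shape)
```

Because the third feature is redundant, the last component should carry almost no variance, so dropping it loses very little information.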

---

### **2. Linear Discriminant Analysis (LDA)**
LDA is a **supervised** dimensionality reduction technique used to find a linear combination of features that best separates two or more classes in the dataset. Unlike PCA, which focuses on variance, LDA maximizes the separation between classes.

#### Key Concepts:
- LDA tries to maximize the **between-class variance** and minimize the **within-class variance**.
- It is commonly used in classification tasks to reduce the dimensionality of data while preserving class separability.
- The number of dimensions after LDA is at most the number of classes minus one (and never more than the number of original features).

#### Advantages:
- LDA is particularly useful when the dataset has labeled data and is used for class separation.
- It can improve classification performance by projecting data onto a lower-dimensional space that is more suitable for classification.

---

### **3. t-Distributed Stochastic Neighbor Embedding (t-SNE)**
t-SNE is a **non-linear** dimensionality reduction technique primarily used for the visualization of high-dimensional data in a lower-dimensional space (typically 2D or 3D). It is effective in preserving local structure, making it ideal for visualizing clusters and patterns in complex datasets.

#### Key Concepts:
- t-SNE minimizes the divergence between probability distributions that represent pairwise similarities between data points in high-dimensional and low-dimensional spaces.
- It models data points in the high-dimensional space as probability distributions and seeks to map them to a lower-dimensional space with similar probability distributions.

#### Advantages:
- t-SNE is particularly effective in visualizing clusters or groupings in high-dimensional data.
- It is widely used in fields like bio-informatics, NLP, and computer vision for exploring data.

#### Limitations:
- t-SNE is computationally expensive and does not preserve global structure (e.g., distances between clusters may not be preserved well).
- It is typically used for visualization rather than for actual model training or feature extraction.
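
The objective that t-SNE minimizes can be illustrated without running the full optimizer. The sketch below only evaluates the KL divergence between high-dimensional and low-dimensional similarities for two candidate embeddings. It is simplified under stated assumptions: a fixed Gaussian bandwidth `sigma` replaces the per-point perplexity calibration of real t-SNE, the data is synthetic, and all function names are hypothetical.

```python
import numpy as np

def pairwise_squared_distances(points):
    diff = points[:, None, :] - points[None, :, :]
    return (diff ** 2).sum(axis=-1)

def high_dim_affinities(X, sigma=1.0):
    """Symmetrized Gaussian similarities P (fixed bandwidth, a simplification)."""
    logits = -pairwise_squared_distances(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)  # a point is not its own neighbour
    conditional = np.exp(logits)
    conditional /= conditional.sum(axis=1, keepdims=True)
    return (conditional + conditional.T) / (2.0 * len(X))

def low_dim_affinities(Y):
    """Student-t (one degree of freedom) similarities Q in the embedding."""
    inverse = 1.0 / (1.0 + pairwise_squared_distances(Y))
    np.fill_diagonal(inverse, 0.0)
    return inverse / inverse.sum()

def kl_divergence(P, Q, eps=1e-12):
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

# Synthetic data (assumption): two well-separated clusters in 10 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, size=(20, 10)),
               rng.normal(loc=5.0, size=(20, 10))])

# A 2D embedding that keeps the clusters, and one that mixes them up
Y_good = X[:, :2].copy()
Y_bad = Y_good.copy()
Y_bad[:10], Y_bad[20:30] = Y_good[20:30], Y_good[:10]

P = high_dim_affinities(X)
kl_good = kl_divergence(P, low_dim_affinities(Y_good))
kl_bad = kl_divergence(P, low_dim_affinities(Y_bad))
print(f"KL (clusters preserved): {kl_good:.3f}")
print(f"KL (clusters mixed):     {kl_bad:.3f}")
```

The embedding that keeps neighbours together yields the lower divergence; t-SNE searches for such an embedding by gradient descent on this quantity.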

---

### **4. AutoEncoders**
AutoEncoders are a type of **neural network** used for unsupervised dimensionality reduction. An auto-encoder consists of an encoder and a decoder. The encoder compresses the input data into a smaller representation (latent space), while the decoder attempts to reconstruct the original data from this compressed form.

#### Key Concepts:
- AutoEncoders learn to encode data into a lower-dimensional space by training the network to minimize the reconstruction error between the input and the output.
- The middle layer (latent space) represents the reduced-dimensionality version of the data.
- They can be used for both dimensionality reduction and anomaly detection.

#### Advantages:
- AutoEncoders can capture complex, non-linear relationships in the data.
- They are flexible and can be trained on different types of data, including images, text, and time-series data.

#### Limitations:
- AutoEncoders can be computationally expensive, especially for large datasets.
- The quality of dimensionality reduction depends on the architecture and the choice of hyper-parameters.
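
The encoder/decoder idea can be sketched with a tiny auto-encoder written directly in NumPy. This is a minimal example under several assumptions: a single `tanh` encoder layer and a linear decoder, no bias terms, full-batch gradient descent, synthetic data, and arbitrary hyper-parameters (`latent_dim=2`, `learning_rate=0.1`, 500 epochs). Real auto-encoders are usually built with a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 4 observed features driven by 2 hidden factors
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 4))
X = factors @ mixing + 0.05 * rng.normal(size=(200, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)

n_samples, n_features = X.shape
latent_dim = 2
W_enc = 0.1 * rng.normal(size=(n_features, latent_dim))
W_dec = 0.1 * rng.normal(size=(latent_dim, n_features))
learning_rate = 0.1
losses = []

for epoch in range(500):
    # Forward pass: encode to the latent space, then decode
    latent = np.tanh(X @ W_enc)
    reconstruction = latent @ W_dec
    error = reconstruction - X
    losses.append(float(np.mean(error ** 2)))

    # Backward pass: gradients of the mean squared reconstruction error
    grad_output = 2.0 * error / error.size
    grad_W_dec = latent.T @ grad_output
    grad_latent = grad_output @ W_dec.T
    grad_pre_activation = grad_latent * (1.0 - latent ** 2)
    grad_W_enc = X.T @ grad_pre_activation

    W_enc -= learning_rate * grad_W_enc
    W_dec -= learning_rate * grad_W_dec

# The latent codes are the reduced-dimensionality representation
latent_codes = np.tanh(X @ W_enc)
print(f"Reconstruction loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
print("Latent shape:", latent_codes.shape)
```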

---

### **5. Independent Component Analysis (ICA)**
ICA is a **linear** dimensionality reduction technique that aims to find statistically independent components in the data. Unlike PCA, which seeks components with maximum variance, ICA looks for components that are as statistically independent as possible.

#### Key Concepts:
- ICA is useful when the data consists of mixed signals, such as in **blind source separation** problems, where the goal is to recover the original sources from their mixtures (e.g., separating audio signals from a mixture of sounds).
- It is based on the assumption that the observed data is a linear combination of statistically independent sources.
- ICA uses higher-order statistical moments to separate the signals.

#### Advantages:
- ICA is particularly useful in signal processing, especially for problems like separating mixed audio signals.
- It can uncover hidden factors that are independent of each other.

#### Limitations:
- ICA can be sensitive to noise and may not work well with highly correlated data.
- It assumes that the underlying sources are independent, which may not always be true for all types of data.

---

### Summary of Differences:
- **PCA** focuses on variance and is linear, making it suitable for data with linear relationships.
- **LDA** is supervised and focuses on class separability, often used in classification problems.
- **t-SNE** is non-linear and best for visualization of high-dimensional data, especially to uncover clusters.
- **AutoEncoders** are neural networks that can model complex, non-linear relationships for dimensionality reduction.
- **ICA** focuses on finding statistically independent components and is often used in signal processing applications.

Each method has its strengths and weaknesses, and the choice of technique depends on the nature of the data and the problem at hand.

In [None]:


# Sample data
df = pd.DataFrame({
    'Feature1': [1, 2, 3, 4, 5],
    'Feature2': [10, 20, 30, 40, 50],
    'Feature3': [100, 200, 300, 400, 500]
})
def principalComponentAnalysis(data):
    # Step 1: Standardize the data (use the argument rather than the global df)
    data = data.values
    mean = np.mean(data, axis=0)
    standardDeviation = np.std(data, axis=0)
    dataStandardized = (data - mean) / standardDeviation

    # Step 2: Compute the covariance matrix
    covarianceMatrix = np.cov(dataStandardized, rowvar=False)

    # Step 3: Compute eigenvalues and eigenvectors
    eigenValues, eigenVectors = np.linalg.eigh(covarianceMatrix)

    # Step 4: Sort eigenvalues and eigenvectors
    sortedIndices = np.argsort(eigenValues)[::-1]
    eigenValuesSorted = eigenValues[sortedIndices]
    eigenVectorsSorted = eigenVectors[:, sortedIndices]

    # Step 5: Select top k eigenvectors (k=2 for 2D projection)
    k = 2
    topEigenVectors = eigenVectorsSorted[:, :k]

    # Step 6: Project data onto the new subspace
    pcaResult = np.dot(dataStandardized, topEigenVectors)
    print(f"PCA Result (2D Projection):\n{pcaResult}")
    return pcaResult

principalComponentAnalysis(df)

In [None]:
# Sample data (3 classes)
df = pd.DataFrame({
    'Feature1': [1, 2, 3, 6, 7, 8, 10, 11, 12],
    'Feature2': [4, 5, 6, 9, 10, 11, 15, 16, 17],
    'Target': [0, 0, 0, 1, 1, 1, 2, 2, 2]
})


def linearDiscriminantAnalysis(data):
    # Separate the feature columns from the class labels
    features = data.drop(columns=['Target'])
    labels = data['Target']
    nFeatures = features.shape[1]

    # Step 1: Compute class means ('Target' becomes the index after groupby)
    classMeans = data.groupby('Target').mean()

    # Step 2: Compute the within-class scatter matrix (Sw)
    Sw = np.zeros((nFeatures, nFeatures))
    for target in labels.unique():
        centered = (features[labels == target] - classMeans.loc[target]).values
        Sw += centered.T @ centered

    # Step 3: Compute the between-class scatter matrix (Sb)
    overallMean = features.mean()
    Sb = np.zeros((nFeatures, nFeatures))
    for target in labels.unique():
        n = (labels == target).sum()
        meanDiff = (classMeans.loc[target] - overallMean).values
        Sb += n * np.outer(meanDiff, meanDiff)

    # Step 4: Solve the generalized eigenvalue problem.
    # pinv is used because Sw is singular for this sample data
    # (both features deviate identically within each class).
    eigenValues, eigenVectors = np.linalg.eig(np.linalg.pinv(Sw).dot(Sb))
    eigenValues, eigenVectors = eigenValues.real, eigenVectors.real

    # Step 5: Sort eigenvalues and eigenvectors in descending order
    sortedIndices = np.argsort(eigenValues)[::-1]
    eigenVectorsSorted = eigenVectors[:, sortedIndices]

    # Step 6: Project data onto the new subspace
    ldaResult = np.dot(features.values, eigenVectorsSorted[:, :2])
    print(f"LDA Result (2D Projection):\n{ldaResult}")
    return ldaResult


linearDiscriminantAnalysis(df)


### **Data Transformation & Augmentation**

Data transformation and augmentation are techniques used to improve the quality and diversity of data for machine learning models. **Data transformation** involves modifying data to make it more suitable for modeling, while **data augmentation** refers to artificially increasing the size of the dataset by generating new data points through various methods.

---

### **1. Polynomial Features**
Polynomial features add powers of the original features and interaction terms between them, allowing linear models to capture non-linear relationships. This transformation is especially useful when you want to apply a linear model but your data exhibits polynomial behavior.

#### **Mathematical Explanation:**
- If your original feature is $ x $, the polynomial transformation will generate higher-degree features. For example, for a degree of 2, the transformation would create $ x^2 $.
- For two features $ x_1 $ and $ x_2 $, a polynomial transformation of degree 2 adds $ x_1^2 $, $ x_2^2 $, and the interaction term $ x_1x_2 $; higher degrees add higher powers and products.

Given a dataset with features $ x_1, x_2, \dots, x_n $, a polynomial transformation of degree 2 will create the following features:

$
x_1, x_2, \dots, x_n, x_1^2, x_2^2, \dots, x_n^2, x_1x_2, x_1x_3, \dots
$

#### **Purpose:**
- Polynomial features help linear models like linear regression to approximate more complex relationships between features.
- The expanded feature space can capture non-linear relationships, even though the model itself is linear.
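
The expansion described above can be written in a few lines. The sketch below is one possible implementation, with a hypothetical helper name `polynomial_features`; it omits the constant (bias) column and orders the output as all degree-1 terms followed by all degree-2 terms.

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_features(X, degree=2):
    """Return all monomials of the input columns from degree 1 up to `degree`."""
    n_features = X.shape[1]
    columns = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(n_features), d):
            # Multiply the chosen columns together, e.g. (0, 1) -> x1 * x2
            columns.append(np.prod(X[:, list(combo)], axis=1))
    return np.column_stack(columns)

X = np.array([[2.0, 3.0],
              [1.0, 4.0]])
X_poly = polynomial_features(X, degree=2)

# Columns: x1, x2, x1^2, x1*x2, x2^2
print(X_poly)
# [[ 2.  3.  4.  6.  9.]
#  [ 1.  4.  1.  4. 16.]]
```

A linear model fitted on `X_poly` is still linear in its coefficients, yet it can represent quadratic curves in the original features.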

---

### **2. Discretization (Binning)**
Discretization (also called binning) is a process that converts continuous data into discrete bins or intervals. This is useful when you want to convert continuous variables into categorical ones for modeling or when data exhibits certain groupings.

#### **Mathematical Explanation:**
- Discretization works by defining intervals (bins) in the feature space and assigning each data point to one of those bins. For example, you might define bins for ages: $ [0, 18) $, $ [18, 30) $, $ [30, 50) $, and $ [50, \infty) $.
- Let the feature $ x $ be a continuous variable. The discretized value $ b $ can be computed by:

$
b = \text{bin}(x) 
$

where $ \text{bin}(x) $ maps $ x $ into one of the defined bins based on predefined thresholds.

#### **Purpose:**
- Discretization is helpful when the underlying data has inherent categories or thresholds, such as age groups or income brackets.
- It also reduces the influence of outliers and can improve the performance of certain machine learning models by simplifying the relationship between features.
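
The age-group example above maps directly onto `np.digitize`, which returns the index of the bin each value falls into. The sample ages and label strings below are made up for illustration.

```python
import numpy as np

# Hypothetical ages and the thresholds for [0, 18), [18, 30), [30, 50), [50, inf)
ages = np.array([5, 17, 18, 29, 30, 49, 50, 80])
thresholds = np.array([18, 30, 50])
labels = np.array(["[0, 18)", "[18, 30)", "[30, 50)", "[50, inf)"])

# With the default right=False, a value equal to a threshold goes to the upper bin
bin_index = np.digitize(ages, thresholds)
age_group = labels[bin_index]

for age, group in zip(ages, age_group):
    print(f"{age:>3} -> {group}")
```

For DataFrame columns, `pd.cut` offers similar behaviour and returns a categorical column.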

---

### **3. Data Augmentation**
Data augmentation refers to artificially increasing the size of the dataset by generating new data points. It is commonly used in tasks like image and text classification, where generating additional data helps the model generalize better and reduces overfitting.

#### **For Image Data:**
In the context of images, augmentation techniques involve creating modified versions of the images to add variety to the dataset. Common transformations include:

##### **a) Flipping:**
Flipping an image horizontally or vertically. This helps the model generalize better, as many objects look similar when flipped (e.g., faces, vehicles).

- **Horizontal flip**: Reflects the image across a vertical axis.
- **Vertical flip**: Reflects the image across a horizontal axis.

##### **b) Rotation:**
Rotating an image by a certain degree (e.g., 90°, 180°, or any arbitrary angle). This introduces variety into the dataset, especially useful for cases where objects can appear at different orientations.

- **Rotation matrix**: If the image is represented by a matrix, rotating it involves applying a transformation matrix to the image coordinates.

##### **c) Scaling:**
Scaling (zooming in or out) changes the size of an image. This helps the model learn invariances in object size, making it robust to different object scales in the input data.

- **Scaling formula**: A scaling factor $ \alpha $ (zooming in for $ \alpha > 1 $, zooming out for $ \alpha < 1 $) maps each pixel coordinate $ (x, y) $ of the original image to:

$
(x', y') = (\alpha x, \alpha y)
$
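
The three image transformations can be sketched on a small array standing in for an image. This is a simplified illustration: the "image" is a made-up 3×4 grid of integers, rotation is limited to 90° steps via `np.rot90`, and scaling is treated as resampling the pixel grid by a factor `alpha` with nearest-neighbour lookup (the hypothetical helper `scale_nearest`). Image libraries provide arbitrary angles and smoother interpolation.

```python
import numpy as np

# A tiny stand-in for a grayscale image (3 rows, 4 columns)
image = np.arange(12).reshape(3, 4)

# a) Flipping
horizontal_flip = np.fliplr(image)  # mirror left/right
vertical_flip = np.flipud(image)    # mirror top/bottom

# b) Rotation by 90 degrees counter-clockwise
rotated_90 = np.rot90(image)

# c) Scaling by a factor alpha using nearest-neighbour sampling
def scale_nearest(image, alpha):
    height, width = image.shape[:2]
    new_height = max(1, int(round(height * alpha)))
    new_width = max(1, int(round(width * alpha)))
    # For each output pixel, look up the source pixel it came from
    rows = np.minimum((np.arange(new_height) / alpha).astype(int), height - 1)
    cols = np.minimum((np.arange(new_width) / alpha).astype(int), width - 1)
    return image[rows][:, cols]

zoomed = scale_nearest(image, 2.0)

print("Horizontal flip:\n", horizontal_flip)
print("Rotated shape:", rotated_90.shape)
print("Zoomed shape:", zoomed.shape)
```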

#### **For Text Data:**
Text data can also be augmented to increase diversity and reduce overfitting. Some common techniques include:

##### **a) Synonym Replacement:**
This involves replacing words in the text with their synonyms. By using thesauruses or pre-trained word embeddings, we can find synonyms and replace them in a way that does not change the overall meaning of the text.

- **Example**: Replacing the word "happy" with "joyful".

##### **b) Back Translation:**
Back translation involves translating the text into another language and then translating it back into the original language. This process introduces variation in the phrasing of the sentences, which can help improve the model's robustness.

- **Example**: Translating "I love programming" to French ("J'adore la programmation") and then back to English ("I adore programming").
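
Synonym replacement can be sketched with a small lookup table. The `SYNONYMS` dictionary and the `synonym_replacement` function below are hypothetical stand-ins; a real pipeline would draw synonyms from a thesaurus or word embeddings and would handle punctuation and word sense. Back translation is not shown because it depends on an external translation model.

```python
import random

# Hypothetical synonym table used only for illustration
SYNONYMS = {
    "happy": ["joyful", "cheerful"],
    "love": ["adore"],
}

def synonym_replacement(sentence, synonyms, probability=1.0, seed=0):
    """Replace each word that has a known synonym with the given probability."""
    rng = random.Random(seed)
    words = []
    for word in sentence.split():
        options = synonyms.get(word.lower())
        if options and rng.random() < probability:
            words.append(rng.choice(options))
        else:
            words.append(word)
    return " ".join(words)

augmented = synonym_replacement("I love programming", SYNONYMS)
print(augmented)  # I adore programming
```

Lowering `probability` keeps more of the original wording, which helps when too many substitutions would change the meaning of the sentence.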

---

### **Purpose of Data Augmentation:**
- **For Image Data:**
  - It reduces overfitting by introducing more variations of the images.
  - It encourages models to be robust to transformations like rotations, translations, or changes in scale, which is useful in real-world applications.
  
- **For Text Data:**
  - It helps with increasing the diversity of text, especially when data is scarce.
  - It can help capture a wider range of expressions and increase the model's generalization ability.

---

### Summary:

- **Polynomial Features**: Transform features into polynomial forms, capturing non-linear relationships.
- **Discretization (Binning)**: Convert continuous features into discrete bins.
- **Data Augmentation**:
  - **Image Augmentation**: Techniques like flipping, rotation, and scaling to create diverse versions of the same image.
  - **Text Augmentation**: Methods like synonym replacement and back translation to generate diverse text representations.

These methods are integral for enhancing model performance, improving generalization, and overcoming issues like overfitting, especially when data is limited.