# <h1 style='text-align:center'> Data Preprocessing for Machine Learning </h1>

---
---

## Overview
Data preprocessing is a critical step in the machine learning pipeline that involves transforming raw, messy data into a clean and structured format suitable for modeling. Proper preprocessing improves model performance, reduces bias, and ensures more accurate predictions.

> "Garbage in, garbage out" – If the input data is flawed, even the best machine learning algorithms will fail.

---

## Importance of Data Preprocessing
1. **Improves model accuracy** – Clean data helps the model learn patterns effectively.
2. **Reduces noise and bias** – Removes irrelevant or misleading information.
3. **Speeds up training** – Smaller, cleaner datasets require less computation.
4. **Prevents model failure** – Handles missing values, outliers, and inconsistent data.

---

## Steps in Data Preprocessing

### 1. Data Cleaning
- **Handling Missing Values:** Remove or fill missing data using mean, median, mode, or predictive models.
- **Removing Duplicates:** Duplicate rows or records are eliminated to prevent bias.
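
As a rough illustration of both cleaning steps, the sketch below uses pandas on a made-up table; the column names and values are hypothetical, and median imputation is only one of the fill strategies listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age and one exact duplicate row
raw = pd.DataFrame({
    "age": [25, np.nan, 35, 35, 28],
    "city": ["Pune", "Delhi", "Mumbai", "Mumbai", "Delhi"],
})

# Remove exact duplicate rows first so they do not skew the fill value
deduped = raw.drop_duplicates().reset_index(drop=True)

# Fill the remaining missing ages with the median of the observed values
cleaned = deduped.assign(age=deduped["age"].fillna(deduped["age"].median()))

print(cleaned)
```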

### 2. Handling Outliers
- Outliers can distort learning.
- Techniques:
  - Z-score method
  - Interquartile Range (IQR)
  - Clipping extreme values
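
The three techniques can be sketched on a small, made-up salary series. The 1.5 × IQR fences are a common convention rather than a fixed rule, and the values below are hypothetical.

```python
import pandas as pd

# Hypothetical salaries with one extreme value
salary = pd.Series([40_000, 50_000, 60_000, 75_000, 1_000_000])

# IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = salary.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = (salary < lower) | (salary > upper)

# Z-score rule: distance from the mean in (population) standard deviations
z_scores = (salary - salary.mean()) / salary.std(ddof=0)

# Clipping: cap values at the IQR fences instead of dropping rows
clipped = salary.clip(lower=lower, upper=upper)

print(iqr_outliers.tolist())
print(z_scores.round(2).tolist())
print(clipped.tolist())
```

With only five rows, the largest possible population z-score is 2, so a `|z| > 3` threshold flags nothing here even though the IQR rule does. This is one reason to compare methods on small samples.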

> **Note:** *Steps 1 and 2 are covered in detail in Phase_2_Statistics_EDA.*

### [3. Encoding Categorical Data](#3-data-preprocessing-for-categorical-data-in-machine-learning)
- Converts non-numeric categories to numeric values, because most machine learning models operate on numbers rather than text.

### [4. Feature Scaling (Numerical)](#4-numerical-data-preprocessing-for-machine-learning)
- Ensures all features are on a similar scale.

### [5. Text Data Preprocessing](#5-text-data-preprocessing-for-machine-learning)
- Clean, normalize, and convert raw text into structured numerical representations suitable for machine learning models.

### [6. Image Data Preprocessing](#6-image-data-preprocessing-for-machine-learning)
- Standardize, enhance, and transform raw images into formats and scales suitable for machine learning algorithms.

### [7. Feature Selection & Extraction](#7-feature-selection-and-feature-extraction-for-machine-learning)
- Reduces irrelevant or redundant features to improve efficiency.

### [References](#References)
- Official Documentation of Tools
---

## Common Mistakes to Avoid
- Scaling before splitting train and test sets  
- Applying inconsistent encoding between training and testing data  
- Ignoring outliers without analysis  
- Removing too much data during cleaning  
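
The first mistake is the easiest to show in code. The sketch below, on a hypothetical single-feature array, splits first and then learns the scaling statistics from the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature dataset
X = np.arange(20, dtype=float).reshape(-1, 1)
y = (X.ravel() > 9).astype(int)

# Split first, so the test rows never influence the scaling statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```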

---

## Tools & Libraries
- **Pandas** – Data manipulation  
- **NumPy** – Numerical operations  
- **Scikit-learn** – Preprocessing and ML utilities  
- **PIL, OpenCV, re** – Text and image preprocessing

---

## Summary
Data preprocessing transforms raw data into a clean, structured form, enabling machine learning algorithms to learn efficiently. It is one of the most important steps in the ML workflow and often has the largest impact on model performance.

---

## 3. Data Preprocessing for Categorical Data in Machine Learning

---

## Introduction

Categorical data represents features that contain **label values rather than numeric values**. Examples include:

- Gender: `Male`, `Female`, `Other`
- Color: `Red`, `Blue`, `Green`
- Product Category: `Electronics`, `Clothing`, `Furniture`

Machine learning models require **numerical inputs**, so categorical data must be transformed.

---

## Categorical Data Types

1. **Nominal:** Categories with no intrinsic order.  
   Example: `Red`, `Blue`, `Green`

2. **Ordinal:** Categories with a meaningful order.  
   Example: `Low`, `Medium`, `High`
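
One way to make the distinction explicit is pandas' categorical type, sketched below on hypothetical values. The `ordered` flag records whether comparisons between categories are meaningful.

```python
import pandas as pd

# Nominal: categories have no order
color = pd.Categorical(["Red", "Blue", "Green"])

# Ordinal: the order Low < Medium < High is declared explicitly
priority = pd.Categorical(
    ["Low", "High", "Medium"],
    categories=["Low", "Medium", "High"],
    ordered=True,
)

print(color.ordered)         # False
print(priority.ordered)      # True
print(list(priority.codes))  # integer codes follow the declared order
```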

---

## Preprocessing Techniques

### 1. Label Encoding
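Assigns an integer to each category. scikit-learn's `LabelEncoder` is intended for target labels and numbers categories in sorted (alphabetical) order, so use it with care: on nominal features the integers imply an order that does not exist, and on ordinal features the sorted order may not match the real one.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])
```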

### 2. One-Hot Encoding

Creates binary columns for each category. Ideal for **nominal data**.

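```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# scikit-learn >= 1.2 uses `sparse_output`; older releases used `sparse`
ohe = OneHotEncoder(sparse_output=False)
encoded_features = ohe.fit_transform(df[['color']])

# Alternative with pandas
df = pd.get_dummies(df, columns=['color'])
```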

### 3. Ordinal Encoding

Maps ordered categories to integers reflecting their order.

```python
from sklearn.preprocessing import OrdinalEncoder

ordinal_mapping = [['Low', 'Medium', 'High']]
oe = OrdinalEncoder(categories=ordinal_mapping)
df['priority_encoded'] = oe.fit_transform(df[['priority']])
```

### 4. Frequency / Count Encoding

Replaces categories with the **frequency of their occurrence** in the dataset.

```python
# Absolute counts; use value_counts(normalize=True) for relative frequencies
freq = df['color'].value_counts()
df['color_freq'] = df['color'].map(freq)
```

### Implementation Example

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Sample dataset
data = {'color': ['Red', 'Blue', 'Green', 'Red'],
        'size': ['S', 'M', 'L', 'S'],
        'priority': ['Low', 'High', 'Medium', 'Medium']}

df = pd.DataFrame(data)

# Label Encoding (color); shown for demonstration only, since color is nominal
le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])

# One-Hot Encoding (size)
df = pd.get_dummies(df, columns=['size'])

# Ordinal Encoding (priority) with an explicit category order
ordinal_mapping = [['Low', 'Medium', 'High']]
oe = OrdinalEncoder(categories=ordinal_mapping)
df['priority_encoded'] = oe.fit_transform(df[['priority']])

print(df)
```

### Tips & Best Practices

* Use **one-hot encoding** for nominal features to avoid misleading order relationships.
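* Use **ordinal encoding** with an explicit category order for features with meaningful order; `LabelEncoder` sorts categories alphabetically and is meant for target labels.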
* Be cautious of **high cardinality** features; frequency encoding or embedding layers may help.
* Avoid **data leakage**: fit encoders on **training data only** and transform test data separately.
* Always **handle missing values** before encoding.
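
The leakage tip can be sketched as follows. The train/test frames are hypothetical, and `handle_unknown="ignore"` is one possible policy for categories that appear only at prediction time.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical split: "Green" appears only in the test set
train = pd.DataFrame({"color": ["Red", "Blue", "Red"]})
test = pd.DataFrame({"color": ["Blue", "Green"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train[["color"]])  # learn categories from the training data only

test_encoded = ohe.transform(test[["color"]]).toarray()
print(ohe.categories_[0])
print(test_encoded)
```

Because the encoder never saw `Green`, that row becomes all zeros instead of raising an error or creating a column the model was not trained on.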

---

## 4. Numerical Data Preprocessing for Machine Learning

### Introduction

Numerical features are continuous or discrete numbers in your dataset. Examples include:

- Age: 21, 35, 42  
- Salary: 40000, 75000, 120000  
- Temperature: 36.5, 38.2, 40.1  

Preprocessing numerical features is crucial to ensure ML algorithms **perform optimally**.

---

### Common Issues with Numerical Data

1. **Missing Values**: Some entries might be NaN or blank.  
2. **Different Scales**: Features can have vastly different ranges.  
3. **Outliers**: Extreme values can distort models.  
4. **Skewed Distributions**: Non-normal distributions can affect algorithms.  

---

### Preprocessing Techniques

#### 1. Scaling / Normalization

**a) Min-Max Scaling**

Scales values to the range [0, 1].

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['salary_scaled']] = scaler.fit_transform(df[['salary']])
```

**b) Standardization (Z-score)**

Centers data around 0 with standard deviation 1.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['salary_std']] = scaler.fit_transform(df[['salary']])
```
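
To see what the two scalers compute, the sketch below reproduces them by hand on a hypothetical salary column and checks the results against scikit-learn. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical salary column as a 2D array (n_samples, n_features)
salary = np.array([[40_000.0], [50_000.0], [60_000.0], [75_000.0], [120_000.0]])

# Min-max: (x - min) / (max - min)
manual_minmax = (salary - salary.min()) / (salary.max() - salary.min())

# Z-score: (x - mean) / std, with the population standard deviation
manual_std = (salary - salary.mean()) / salary.std()

assert np.allclose(manual_minmax, MinMaxScaler().fit_transform(salary))
assert np.allclose(manual_std, StandardScaler().fit_transform(salary))
```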

#### 2. Feature Transformation

* **Log Transformation**: Reduces right-skewed distributions
* **Square Root / Cube Root Transformation**: Reduces skewness

```python
import numpy as np
df['salary_log'] = np.log1p(df['salary'])
```

---

#### 3. Discretization / Binning

Convert continuous features into bins or categories.

```python
df['age_group'] = pd.cut(df['age'], bins=[0,18,35,60,100], labels=['Child','Youth','Adult','Senior'])
```
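
As a complement to fixed bin edges, quantile-based binning gives each bin roughly the same number of rows. The sketch below compares the two on hypothetical ages; the `Q1` to `Q4` labels are illustrative names.

```python
import pandas as pd

# Hypothetical ages
age = pd.Series([5, 17, 22, 34, 41, 58, 63, 79])

# Bins with explicit edges (right-inclusive by default)
fixed = pd.cut(age, bins=[0, 18, 35, 60, 100],
               labels=["Child", "Youth", "Adult", "Senior"])

# Quantile bins: each bin receives about the same number of rows
quantile = pd.qcut(age, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(fixed.tolist())
print(quantile.value_counts().sort_index())
```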

---

### Implementation Example

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer

# Sample dataset
data = {'age': [25, np.nan, 35, 45, 28],
        'salary': [50000, 60000, 75000, 120000, 40000]}

df = pd.DataFrame(data)

# Handle missing values
imputer = SimpleImputer(strategy='mean')
df['age'] = imputer.fit_transform(df[['age']])

# Standardize age
scaler = StandardScaler()
df['age_std'] = scaler.fit_transform(df[['age']])

# Min-Max scale salary
minmax = MinMaxScaler()
df['salary_scaled'] = minmax.fit_transform(df[['salary']])

# Log transform salary
df['salary_log'] = np.log1p(df['salary'])

print(df)
```


---

### Tips & Best Practices

* Always **fit scalers and imputers on training data** and transform test data separately.
* Use **StandardScaler for algorithms sensitive to variance** (e.g., SVM, KNN).
* Use **MinMaxScaler for algorithms requiring bounded input** (e.g., Neural Networks).
* Treat **outliers carefully**; sometimes they carry important information.
* Check the **distribution** of your data before choosing transformation methods.
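
The last tip can be made concrete by measuring skewness before and after a transformation. This is a minimal sketch on hypothetical salaries, using pandas' sample skewness.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed salaries
salary = pd.Series([40_000, 45_000, 50_000, 60_000, 75_000, 120_000, 400_000])

skew_before = salary.skew()
skew_after = np.log1p(salary).skew()

print(f"skewness before log1p: {skew_before:.2f}")
print(f"skewness after log1p:  {skew_after:.2f}")
```

A positive value indicates a long right tail. If the skewness is already close to zero, a log transform is unlikely to help.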

---

## 5. Text Data Preprocessing for Machine Learning

---

### Introduction

Text data, unlike numeric data, is **unstructured**. Examples include:

- Reviews: `"The product is amazing!"`  
- Tweets: `"I love this movie #awesome"`  
- Emails: `"Dear user, your account is updated"`

Raw text cannot be directly fed into ML models. Preprocessing is necessary to **clean, normalize, and convert text into numerical features**.

---

### Common Issues in Text Data

1. **Inconsistent casing**: `"Machine Learning"` vs `"machine learning"`  
2. **Punctuation, numbers, symbols**  
3. **Stopwords**: Common words like "the", "is", "and" add noise  
4. **Different word forms**: `"run"`, `"running"`, `"ran"`  
5. **High dimensionality** in vectorized text  

---

### Preprocessing Techniques

#### 1. Lowercasing
Convert all text to lowercase for consistency.

```python
text = "Machine Learning is FUN!"
text = text.lower()
# Output: "machine learning is fun!"
```

---

#### 2. Removing Punctuation, Numbers, and Whitespace

Remove characters that don’t contribute to meaning.

```python
import re

text = "I love ML 101!!!"
text = re.sub(r'[^a-zA-Z\s]', '', text)   # drop digits and punctuation
text = re.sub(r'\s+', ' ', text).strip()  # collapse and trim whitespace
# Output: "I love ML"
```

---

#### 3. Tokenization

Split text into words or subwords.

```python
from nltk.tokenize import word_tokenize

tokens = word_tokenize("I love machine learning")
# Output: ['I', 'love', 'machine', 'learning']
```

---

#### 4. Stopword Removal

Remove common words that don’t carry significant meaning.

```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if w.lower() not in stop_words]
# Output: ['love', 'machine', 'learning']
```

---

#### 5. Stemming and Lemmatization

Reduce words to their root form.

**Stemming:**

```python
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in tokens]
# Output: ['love', 'machin', 'learn']
```

**Lemmatization:**

```python
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in tokens]
# Output: ['love', 'machine', 'learning']
```

---

#### 6. Vectorization

Convert text into numerical representations:

* **Bag of Words (BoW)**
* **TF-IDF (Term Frequency-Inverse Document Frequency)**
* **Word Embeddings (Word2Vec, GloVe, FastText)**

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love machine learning", "ML is fun"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
```

---

#### 7. Handling Out-of-Vocabulary Words

For embeddings or NLP models, handle unknown words by:

* Assigning a special token `<UNK>`
* Using subword tokenization (BPE, WordPiece)
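
The `<UNK>` approach can be sketched in plain Python. The vocabulary, the `encode` helper, and the choice of id 0 for `<UNK>` are illustrative assumptions, not a fixed convention.

```python
# Build a vocabulary from hypothetical training tokens; id 0 is reserved for <UNK>
train_tokens = ["i", "love", "machine", "learning"]
vocab = {"<UNK>": 0}
for token in train_tokens:
    vocab.setdefault(token, len(vocab))

def encode(tokens):
    """Map tokens to ids, sending unseen words to the <UNK> id."""
    return [vocab.get(token, vocab["<UNK>"]) for token in tokens]

ids = encode(["i", "love", "deep", "learning"])
print(ids)  # "deep" was not in the training vocabulary
```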

---

### Implementation Example

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data looked up by newer NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')

# Sample text
corpus = ["I love machine learning!", "Natural Language Processing is amazing."]

cleaned_corpus = []

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

for doc in corpus:
    doc = doc.lower()  # Lowercase
    doc = re.sub(r'[^a-z\s]', '', doc)  # Remove punctuation
    tokens = word_tokenize(doc)  # Tokenization
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]  # Lemmatization + Stopwords
    cleaned_corpus.append(" ".join(tokens))

# Vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned_corpus)

print(cleaned_corpus)
print(X.toarray())
```

---

### Tips & Best Practices

* Always **clean text** before vectorization.
* Prefer **lemmatization** over stemming when meaning matters, because stems such as `machin` are often not real words.
* For deep learning, **word embeddings** often outperform BoW or TF-IDF.
* Handle **special characters, emojis, and URLs** in social media text.
* Consider **n-grams** for capturing context.

---



## 6. Image Data Preprocessing for Machine Learning

### Introduction

Image data is **high-dimensional and unstructured**. Examples include:

- Photographs of objects or faces  
- Medical images (X-rays, MRIs)  
- Handwritten digits (MNIST dataset)  

Raw image pixels often require preprocessing before feeding into ML models.

---

### Common Issues in Image Data

1. **Different sizes and resolutions**  
2. **Different color scales** (RGB, grayscale)  
3. **High dimensionality**  
4. **Noise or artifacts**  
5. **Limited dataset size**

---

### Preprocessing Techniques

#### 1. Resizing
Resize images to a uniform size for consistent input dimensions.

```python
from PIL import Image
import numpy as np

img = Image.open('image.jpg')
resized_img = img.resize((128, 128))
resized_array = np.array(resized_img)
```

Or using OpenCV:


```python
import cv2

img = cv2.imread('image.jpg')
resized_img = cv2.resize(img, (128, 128))
```

---

#### 2. Normalization / Scaling

Scale pixel values to a standard range, usually `[0,1]`.

```python
normalized_img = resized_img / 255.0
```

---

#### 3. Data Augmentation

Generate new images from existing ones to increase dataset size:

* Rotation
* Flipping
* Translation
* Brightness adjustment

```python
# Rotation example using OpenCV
(h, w) = resized_img.shape[:2]
center = (w // 2, h // 2)
matrix = cv2.getRotationMatrix2D(center, angle=30, scale=1.0)
rotated_img = cv2.warpAffine(resized_img, matrix, (w, h))

# Horizontal flip
flipped_img = cv2.flip(resized_img, 1)
```
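
Brightness adjustment, listed above, can be sketched with NumPy alone. The random image and the offset of 40 are hypothetical. The cast to a wider integer type matters because `uint8` arithmetic wraps around instead of saturating.

```python
import numpy as np

# Hypothetical 4x4 RGB image with uint8 pixels
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Brightness: add an offset in a wider dtype, then clip back to [0, 255]
brighter = np.clip(image.astype(np.int16) + 40, 0, 255).astype(np.uint8)

# Vertical flip: reverse the row axis
flipped_vertical = image[::-1, :, :]
```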

---

#### 4. Grayscale Conversion

Convert RGB images to grayscale to reduce dimensionality if color is not important.

```python
gray_img = cv2.cvtColor(resized_img, cv2.COLOR_BGR2GRAY)
```

Or using PIL:

```python
gray_img = img.convert('L')
```

---

#### 5. Denoising / Filtering

Reduce noise in images using filters:

```python
denoised_img = cv2.GaussianBlur(resized_img, (5, 5), 0)
```

---

#### 6. Image Flattening

Flatten image arrays (2D grayscale or 3D color) into 1D vectors for traditional ML models like SVM or Random Forest.

```python
flattened_img = resized_img.flatten()
```
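
For a whole dataset, the same idea is usually applied per image while keeping the sample axis. The sketch below uses a hypothetical batch of blank images to show the resulting shape.

```python
import numpy as np

# Hypothetical batch of 10 RGB images, 128 x 128 pixels each
batch = np.zeros((10, 128, 128, 3), dtype=np.float32)

# Keep the sample axis, flatten everything else: one row per image
X_flat = batch.reshape(len(batch), -1)

print(X_flat.shape)  # (10, 49152) because 128 * 128 * 3 = 49152
```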

---

### Implementation Example

```python
import cv2
import numpy as np
from PIL import Image

# Load image using PIL
img = Image.open('image.jpg')

# Resize
resized_img = img.resize((128, 128))
resized_array = np.array(resized_img)

# Convert the resized image to grayscale
gray_img = resized_img.convert('L')
gray_array = np.array(gray_img)

# Normalize
normalized_img = resized_array / 255.0

# Data augmentation: rotation
(h, w) = resized_array.shape[:2]
center = (w // 2, h // 2)
matrix = cv2.getRotationMatrix2D(center, angle=30, scale=1.0)
rotated_img = cv2.warpAffine(resized_array, matrix, (w, h))

# Horizontal flip
flipped_img = cv2.flip(resized_array, 1)

# Flatten for ML
flattened_img = normalized_img.flatten()

print("Resized shape:", resized_array.shape)
print("Grayscale shape:", gray_array.shape)
print("Flattened shape:", flattened_img.shape)
```

---

### Tips & Best Practices

* Always **resize images** to the same dimensions for batch processing.
* Normalize pixel values to **improve convergence** of ML models.
* Use **data augmentation** to reduce overfitting in small datasets.
* Use **grayscale** only when color information is unnecessary.
* For CNNs, keep **3D shapes** (`height x width x channels`) instead of flattening.
* Save preprocessed images using PIL or OpenCV to avoid repeated computation.

---

## 7. Feature Selection and Feature Extraction for Machine Learning

### Introduction

Features (or variables) are the **inputs** to machine learning models. Not all features are equally informative.  

- **Feature Selection:** Choose the most relevant features from the existing dataset.  
- **Feature Extraction:** Transform the data into a lower-dimensional space while preserving important information.  

Proper feature selection and extraction improve **model accuracy, reduce overfitting, and decrease training time**.

---

### Objectives

- **Feature Selection Objective:**  
  *"Identify and retain the most relevant features to improve model performance and interpretability."*

- **Feature Extraction Objective:**  
  *"Transform original features into a lower-dimensional representation that preserves essential information for modeling."*

---

### Feature Selection

#### 1. Filter Methods
Use statistical measures to score features independently of the model.

- **Techniques:** Correlation, Chi-square, ANOVA, Mutual Information

```python
from sklearn.feature_selection import SelectKBest, chi2

# chi2 requires non-negative feature values (e.g., counts or min-max scaled data)
X_new = SelectKBest(chi2, k=5).fit_transform(X, y)
```

---

#### 2. Wrapper Methods

Use a predictive model to evaluate feature subsets.

* **Techniques:** Recursive Feature Elimination (RFE), Sequential Feature Selection

```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
rfe = RFE(model, n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)
```

---

#### 3. Embedded Methods

Feature selection is built into the model training.

* **Techniques:** Lasso Regression (L1), Tree-based feature importance

```python
from sklearn.linear_model import Lasso

# Assumes X is a pandas DataFrame and y is a numeric target
model = Lasso(alpha=0.01)
model.fit(X, y)
selected_features = X.columns[model.coef_ != 0]  # features with non-zero weights
```
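
Tree-based feature importance, the other embedded technique listed above, can be sketched on the Iris dataset. The forest size and `random_state` are arbitrary choices for reproducibility.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances are non-negative and sum to 1
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```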

---

### Feature Extraction

#### 1. Principal Component Analysis (PCA)

Transforms features into **uncorrelated components** while retaining maximum variance.

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
```
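
How much variance the components retain can be read from `explained_variance_ratio_`. This is a minimal sketch on the Iris dataset, used here only as a convenient example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)
print("Total retained:", pca.explained_variance_ratio_.sum())
```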

#### 2. Linear Discriminant Analysis (LDA)

Projects features to maximize **class separability**. Often used in classification tasks.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# n_components must be <= min(n_classes - 1, n_features)
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)
```

#### 3. t-SNE / UMAP

Non-linear dimensionality reduction for **visualization and clustering**.

```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
```

---

### Implementation Example

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

# Load dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Feature Selection: top 2 features
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature Extraction: PCA to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print("Selected Features Shape:", X_selected.shape)
print("PCA Features Shape:", X_pca.shape)
```
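
`fit_transform` returns a plain array, so the names of the selected columns are lost. One way to recover them, sketched on the same Iris data, is the selector's boolean mask from `get_support()`.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

selector = SelectKBest(f_classif, k=2).fit(X, y)

# get_support() returns a boolean mask over the original columns
selected_names = X.columns[selector.get_support()].tolist()
print(selected_names)
```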

---

### Tips & Best Practices

* **Feature Selection** reduces noise and improves interpretability.
* **Feature Extraction** is useful for high-dimensional datasets (images, text).
* Use **PCA** when you want to preserve variance, **LDA** when class separation is important.
* Combine **feature selection and extraction** for optimal results.
* Avoid **leakage**: fit transformations only on training data, then apply to test data.
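
One way to follow the leakage tip is to wrap the transformations and the model in a single pipeline, so that cross-validation re-fits every step on the training folds only. The choice of classifier and `k=2` below are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling, selection, and the model are fitted together on each training fold
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=2),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```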

---

## References

1. [Scikit-learn preprocessing documentation](https://scikit-learn.org/stable/modules/preprocessing.html)
2. [One-hot encoding explained](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/)
3. [Handling categorical data in ML](https://towardsdatascience.com/encoding-categorical-features-21a2651a065c)
4. [Data preprocessing in ML](https://towardsdatascience.com/data-preprocessing-for-machine-learning-3b61b4f94d7f)
5. [Feature scaling techniques](https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/)
6. [NLTK Documentation](https://www.nltk.org/)
7. [Scikit-learn Feature Extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
8. [Text Preprocessing in NLP](https://towardsdatascience.com/text-preprocessing-for-nlp-5e0f746abf1)
9. [Word Embeddings](https://machinelearningmastery.com/what-are-word-embeddings/)
10. [OpenCV Documentation](https://docs.opencv.org/)
11. [Pillow (PIL) Documentation](https://pillow.readthedocs.io/en/stable/)
12. [Image Preprocessing Techniques](https://towardsdatascience.com/image-preprocessing-for-deep-learning-64113f78588f)
13. [Data Augmentation in OpenCV](https://learnopencv.com/data-augmentation-techniques-for-deep-learning/)
14. [Scikit-learn Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
15. [PCA Tutorial](https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c)
16. [LDA in Python](https://scikit-learn.org/stable/modules/lda_qda.html)
17. [t-SNE Guide](https://distill.pub/2016/misread-tsne/)

---
---

# <center> *End of Topic* </center>