## 1. Feature Scaling

* Feature scaling in machine learning is the process of transforming numerical features so that they are on a similar scale.
* Without feature scaling, many algorithms (particularly distance-based and gradient-based ones) give more weight to features with larger numeric values, regardless of the units those values are measured in.
* For example, 10 cm and 10 m are both just the number 10 to the algorithm, while 10 m written as 1000 cm looks 100 times larger.
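
To make the effect of scale concrete, here is a minimal sketch using two made-up houses described by lot area and room count. The values and the assumed feature ranges are hypothetical and chosen only for illustration.

```python
import numpy as np

# Two hypothetical houses: (lot area in square feet, number of rooms).
a = np.array([8450.0, 3.0])
b = np.array([9450.0, 9.0])

# Unscaled Euclidean distance: the 1000-unit gap in lot area
# swamps the 6-room gap.
raw_distance = np.linalg.norm(a - b)

# Divide each feature by an assumed typical range so both are comparable.
feature_range = np.array([10000.0, 10.0])
scaled_distance = np.linalg.norm((a - b) / feature_range)

print(raw_distance)     # almost entirely due to lot area
print(scaled_distance)  # now mostly due to the room count
```

After scaling, the room count contributes most of the distance, which is the kind of change a distance-based model such as KNN would see.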

In [1]:
import pandas as pd
df = pd.read_csv('SampleFile.csv')
print(df.head())

   LotArea  MSSubClass
0     8450          60
1     9600          20
2    11250          60
3     9550          70
4    14260          60


### Min-Max Scaling
This method of scaling involves two steps:

1. Find the minimum and the maximum value of the column.
2. Subtract the minimum value from each entry and divide the result by the difference between the maximum and the minimum value.

The scaled data lies in the range 0 to 1.
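
The two steps can be sketched by hand with NumPy. This example uses only the five `LotArea` values shown in the sample output above, so its results will differ from a scaler fitted on the full file.

```python
import numpy as np

# First five LotArea values from the sample output above.
lot_area = np.array([8450.0, 9600.0, 11250.0, 9550.0, 14260.0])

# Step 1: find the minimum and the maximum of the column.
col_min = lot_area.min()
col_max = lot_area.max()

# Step 2: (x - min) / (max - min)
scaled = (lot_area - col_min) / (col_max - col_min)

print(scaled.round(3))
```

The smallest value maps to 0 and the largest to 1; everything else falls in between.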

In [2]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, 
                         columns=df.columns)
scaled_df.head()

    LotArea  MSSubClass
0  0.03342     0.235294
1  0.038795    0.0
2  0.046507    0.235294
3  0.038561    0.294118
4  0.060576    0.235294


### Standardization
This method of scaling is based on the mean and the variance of the data.

1. Calculate the mean and the standard deviation of the feature to be standardized.
2. Subtract the mean from each entry and divide the result by the standard deviation.

The result is centered at mean 0 with standard deviation 1.
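
As a rough sketch, the same two steps can be written directly in NumPy. It again uses only the five `LotArea` values from the sample output, so the numbers are illustrative rather than a reproduction of the full-file result.

```python
import numpy as np

lot_area = np.array([8450.0, 9600.0, 11250.0, 9550.0, 14260.0])

# Step 1: mean and standard deviation.
mean = lot_area.mean()
std = lot_area.std()  # population standard deviation (ddof=0), as StandardScaler uses

# Step 2: z = (x - mean) / std
z = (lot_area - mean) / std

print(z.round(3))
```

The standardized values have mean 0 and standard deviation 1, up to floating-point error.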

In [3]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
                         columns=df.columns)
print(scaled_df.head())

    LotArea  MSSubClass
0 -0.207142    0.073375
1 -0.091886   -0.872563
2  0.073480    0.073375
3 -0.096897    0.309859
4  0.375148    0.073375


### Robust Scaling
This method of scaling uses two statistical measures of the data:
* Median
* Interquartile range (IQR)

Subtract the median from each entry and divide the result by the interquartile range. Because the median and the IQR are barely affected by extreme values, this method is less sensitive to outliers than the two methods above.
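
A minimal hand-written version of this calculation, assuming the default 25th and 75th percentiles for the interquartile range and using the five sample `LotArea` values only:

```python
import numpy as np

lot_area = np.array([8450.0, 9600.0, 11250.0, 9550.0, 14260.0])

# The two statistics the method relies on.
median = np.median(lot_area)
q1, q3 = np.percentile(lot_area, [25, 75])
iqr = q3 - q1

# (x - median) / IQR
robust = (lot_area - median) / iqr

print(median, iqr)
print(robust.round(3))
```

The entry equal to the median maps to 0, and the other values are expressed in units of the IQR.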

In [4]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,  
                         columns=df.columns)
print(scaled_df.head())

    LotArea  MSSubClass
0 -0.254076         0.2
1  0.030015        -0.6
2  0.437624         0.2
3  0.017663         0.4
4  1.181201         0.2


## 2. Encoding Techniques
- **Encoding** is the process of converting **categorical data** (text or labels) into **numeric format** so that machine learning algorithms can understand and process it.
- Many ML algorithms work only with numerical inputs.

Reasons:
- Models operate on numeric arrays, so text labels must be converted before they can be used in calculations.
- ML algorithms like **Linear Regression**, **Logistic Regression**, **SVM**, and most tree-based implementations require numeric data.
- Helps in **feature engineering** and improves **model performance**.



### Types of Encoding

#### 1. Label Encoding
- Assigns a **unique integer** to each category.
- Ex - ['Red', 'Green', 'Blue'] -> [2, 1, 0] (scikit-learn's `LabelEncoder` numbers the categories in sorted order)

- **Pros:** Simple, no extra columns created.  
- **Cons:** Can mislead algorithms into thinking there’s an **ordinal relationship** between categories.


In [5]:
from sklearn.preprocessing import LabelEncoder

data = ['Red', 'Green', 'Blue', 'Red']

le = LabelEncoder()
encoded_data = le.fit_transform(data)
print(f"Encoded Data: {encoded_data}")

Encoded Data: [2 1 0 2]
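
To see which integer belongs to which category, the fitted encoder can be inspected. This is a small sketch using the `classes_` attribute and `inverse_transform` of `LabelEncoder`; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

data = ['Red', 'Green', 'Blue', 'Red']

le = LabelEncoder()
encoded = le.fit_transform(data)

# classes_ holds the categories in sorted order; each index is the code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
decoded = [str(label) for label in le.inverse_transform(encoded)]
print(decoded)
```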


#### 2. One-Hot Encoding
- Creates a binary column for each category.
- ex:
        Red   → [1, 0, 0]
        Blue  → [0, 1, 0]
        Green → [0, 0, 1]

- Pros: No ordinal assumption.
- Cons: Increases dimensionality for large categorical features.

In [6]:
import pandas as pd

data = ['Red', 'Blue', 'Green', 'Red']

df = pd.DataFrame(data, columns=['Color'])
one_hot_encoded = pd.get_dummies(df['Color'])

print(one_hot_encoded)

    Blue  Green    Red
0  False  False   True
1   True  False  False
2  False   True  False
3  False  False   True
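
Two optional parameters of `pd.get_dummies` are often useful here. The sketch below shows `dtype=int` for 0/1 output and `drop_first=True` for dropping one redundant column; whether dropping a column is appropriate depends on the model.

```python
import pandas as pd

colors = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})

# dtype=int yields 0/1 instead of True/False.
full = pd.get_dummies(colors['Color'], dtype=int)

# drop_first=True drops the first category (Blue); a Blue row is then
# represented by zeros in all remaining columns.
reduced = pd.get_dummies(colors['Color'], dtype=int, drop_first=True)

print(full)
print(reduced)
```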


#### 3. Ordinal Encoding
- Ordinal Encoding is used for ordinal data, where categories have a natural order.

- Assigns integers to categories based on their order.

- Ex - [small, medium, large] -> [1,2,3]

- Pros: Maintains order; reduces dimensionality.

- Cons: Not suitable for nominal categories.

In [7]:
from sklearn.preprocessing import OrdinalEncoder
data = [['Low'], ['Medium'], ['High'], ['Medium'], ['Low']]

encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded_data = encoder.fit_transform(data)

print(f"Encoded Ordinal Data:\n {encoded_data}")

Encoded Ordinal Data:
 [[0.]
 [1.]
 [2.]
 [1.]
 [0.]]
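
The explicit `categories` argument matters. As a short illustration, the sketch below compares it with the default behaviour, which sorts string categories alphabetically and so loses the intended order.

```python
from sklearn.preprocessing import OrdinalEncoder

data = [['Low'], ['Medium'], ['High'], ['Medium'], ['Low']]

# Default: categories are sorted, giving High=0, Low=1, Medium=2.
default_codes = OrdinalEncoder().fit_transform(data).ravel().tolist()

# Explicit order: Low=0, Medium=1, High=2.
explicit_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
explicit_codes = explicit_encoder.fit_transform(data).ravel().tolist()

print(default_codes)
print(explicit_codes)
```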


#### 4. Target Encoding
- Replaces a category with the mean of the target variable for that category.
- This technique is especially useful when there is a relationship between the categorical feature and the target variable.
- Pros: Captures relationship to target variable.
- Cons: Risk of overfitting; must apply smoothing/statistical techniques and ensure leakage prevention (e.g., CV or holdout techniques).


In [8]:
pip install category_encoders

Note: you may need to restart the kernel to use updated packages.


In [9]:
import pandas as pd
import category_encoders as ce

df = pd.DataFrame(
    {'City': ['London', 'Paris', 'London', 'Berlin'], 'Target': [1, 0, 1, 0]}
)

encoder = ce.TargetEncoder(cols=['City'])
df_tgt = encoder.fit_transform(df['City'], df['Target'])

print(f"Encoded Target Data:\n{df_tgt}")

Encoded Target Data:
       City
0  0.570926
1  0.434946
2  0.570926
3  0.434946
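
The library output above is not the raw per-city mean (London would be 1.0) because the encoder shrinks each category mean toward the global mean of 0.5. The sketch below shows the idea with a simple additive smoothing formula; this is not the formula `category_encoders` uses, and the smoothing strength `m` is a hypothetical value picked for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'City': ['London', 'Paris', 'London', 'Berlin'], 'Target': [1, 0, 1, 0]}
)

global_mean = df['Target'].mean()
stats = df.groupby('City')['Target'].agg(['mean', 'count'])

# Hypothetical smoothing strength: larger m pulls rare categories
# harder toward the global mean.
m = 2
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

encoded = df['City'].map(smoothed)
print(encoded)
```

In practice the category statistics should be computed on training folds only, so that a row's own target value does not leak into its encoding.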


#### 5. Frequency Encoding
- Frequency Encoding assigns each category a value based on its frequency in the dataset.
- This technique can be useful for handling features with many unique categories.
- Pros: Low computational and storage requirements.
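- Cons: Categories with the same frequency receive the same value (Green and Blue in the example below); can introduce data leakage if the frequencies are computed on the full dataset rather than on the training data only.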

In [11]:
import pandas as pd
data = ['Red', 'Green', 'Blue', 'Red', 'Red']
series_data = pd.Series(data)
frequency_encoding = series_data.value_counts()

encoded_data = [frequency_encoding[x] for x in data]
print("Encoded Data:", encoded_data)

Encoded Data: [3, 1, 1, 3, 3]
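
A common variant uses relative frequencies instead of raw counts, which keeps the values between 0 and 1. A minimal pandas sketch, with illustrative variable names:

```python
import pandas as pd

colors = pd.Series(['Red', 'Green', 'Blue', 'Red', 'Red'])

# normalize=True returns each category's share of the rows.
rel_freq = colors.value_counts(normalize=True)

# map() looks up the share for every row.
encoded = colors.map(rel_freq)
print(encoded.tolist())
```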


## 3. Imputation Techniques
- **Imputation** is the process of replacing **missing or null values** in a dataset with substituted values.
- Many ML algorithms **cannot handle missing values directly**.
- Proper imputation improves **data quality** and **model performance**.

##### **Types of Missing Data**
1. **MCAR (Missing Completely at Random)**  
   - Missingness has **no relationship** to any data, observed or unobserved.
2. **MAR (Missing at Random)**  
   - Missingness depends on **observed variables**.
3. **MNAR (Missing Not at Random)**  
   - Missingness depends on **unobserved data**, including the missing values themselves.


### 1. Mean / Median / Mode Imputation
- *Mean* → for numerical data (normally distributed)
- *Median* → for numerical data (skewed distribution)
- *Mode* → for categorical data

- Pros: Simple and fast
- Cons: Ignores relationships between features

In [17]:
import numpy as np

data = np.array([1, 2, np.nan, 4, 5])

mean_value = np.nanmean(data)

data = np.where(np.isnan(data), mean_value, data)

print(data)

[1. 2. 3. 4. 5.]
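
The median and mode cases can be sketched the same way with pandas. The column names and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'income': [30.0, 32.0, np.nan, 35.0, 500.0],  # skewed by one large value
    'color': ['Red', None, 'Blue', 'Red', 'Red'],
})

# Median for the skewed numeric column.
df['income'] = df['income'].fillna(df['income'].median())

# Mode (most frequent value) for the categorical column.
df['color'] = df['color'].fillna(df['color'].mode()[0])

print(df)
```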


### 2. Constant Value Imputation
- Replace missing values with a specific constant (e.g., 0, "Unknown").
- Pros: Useful when the fact that a value is missing carries meaning of its own
- Cons: May create artificial patterns
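
A minimal sketch with pandas, assuming a hypothetical table where a missing discount means no discount and a missing city is simply unknown:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'discount': [5.0, np.nan, 10.0],
    'city': ['Paris', None, 'Berlin'],
})

# A dict lets each column get its own constant.
filled = df.fillna({'discount': 0, 'city': 'Unknown'})
print(filled)
```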

### 3. Forward Fill / Backward Fill
- Forward Fill: Fill with previous value
- Backward Fill: Fill with next value
- Pros: Works well in time-series data
- Cons: May propagate incorrect values
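
Both fills are available as pandas methods. The sketch below uses a small made-up daily temperature series.

```python
import numpy as np
import pandas as pd

temps = pd.Series(
    [20.0, np.nan, np.nan, 23.0, np.nan],
    index=pd.date_range('2024-01-01', periods=5, freq='D'),
)

forward = temps.ffill()   # carry the last known value forward
backward = temps.bfill()  # pull the next known value backward

print(forward.tolist())
print(backward.tolist())
```

Note that a backward fill leaves the last entry missing here, because there is no later value to copy from.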

### 4. KNN Imputation
- This technique imputes missing values using the 𝑘 nearest neighbors based on other variables. 
- It’s useful when there is a strong correlation between features.
- Pros: Captures feature relationships
- Cons: Computationally expensive for large datasets

In [20]:
import numpy as np
from sklearn.impute import KNNImputer

data = np.array([[1, 2, np.nan], 
                 [3, np.nan, 5], 
                 [4, 6, 7]])

# Initialize the KNN Imputer with 2 neighbors
imputer = KNNImputer(n_neighbors=2, weights="uniform")

# Apply KNN Imputation
data_imputed = imputer.fit_transform(data)

print("Dataset After KNN Imputation:\n", data_imputed)

Dataset After KNN Imputation:
 [[1. 2. 6.]
 [3. 4. 5.]
 [4. 6. 7.]]


### 5. Multivariate Imputation by Chained Equations (MICE)
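- Models each feature with missing values as a function of the other features, cycling through the features until the imputed values stabilize.
- Multiple imputation generates several imputed datasets, each containing different plausible values for the missing data, and combines the results of analyzing them.
- Steps:
    1. Generate 𝑚 imputed datasets.
    2. Analyze each dataset separately.
    3. Pool the 𝑚 results into a single estimate (commonly with Rubin's rules, which average the estimates and combine the within- and between-imputation variance).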
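
The three steps can be approximated with scikit-learn's `IterativeImputer`, which is still marked experimental and must be enabled explicitly. This is a rough sketch, not a full MICE implementation: the data are made up, the "analysis" is just a column mean, and only the point estimates are pooled.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

data = np.array([[1.0, 2.0, np.nan],
                 [3.0, np.nan, 5.0],
                 [4.0, 6.0, 7.0],
                 [5.0, 8.0, 9.0]])

# Step 1: generate m imputed datasets; sample_posterior=True draws
# different plausible values for each seed.
m = 3
imputed_sets = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(data)
    for seed in range(m)
]

# Step 2: analyze each dataset (here: the mean of every column).
estimates = [imputed.mean(axis=0) for imputed in imputed_sets]

# Step 3: pool the m estimates by averaging them.
pooled_means = np.mean(estimates, axis=0)
print(pooled_means)
```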

### 6. Regression Imputation
- Predicts missing values using a regression model trained on other features.
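
A minimal sketch, assuming a single predictor `X` and a target column `y` with one missing entry; the data are made up so that `y` equals `2 * X + 1`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, np.nan, 9.0, 11.0])

# Train on the rows where y is observed.
observed = ~np.isnan(y)
model = LinearRegression().fit(X[observed], y[observed])

# Predict the missing entries from their X values.
y_imputed = y.copy()
y_imputed[~observed] = model.predict(X[~observed])

print(y_imputed)
```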

### 7. Dropping Missing Values
- Drop rows with missing values, or drop columns that have too many of them.
- As a rule of thumb, if a feature has more than 80% null values, drop the feature.
- Pros: Simple; no imputed values are introduced.
- Cons: Can cause data loss.
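
Both kinds of dropping can be sketched with pandas. The table is made up, and the 80% threshold follows the rule of thumb mentioned above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0, 5.0],
    'b': [np.nan, np.nan, np.nan, np.nan, np.nan],  # entirely missing
    'c': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Fraction of missing values per column.
null_fraction = df.isna().mean()

# Drop columns with more than 80% missing values.
kept = df.loc[:, null_fraction <= 0.8]

# Then drop any remaining rows that still contain a missing value.
cleaned = kept.dropna()
print(cleaned)
```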

##### Best Practices

- Identify missing data type (MCAR, MAR, MNAR) before choosing method.

- Avoid data leakage: split into train and test sets first, then fit the imputer on the training set only and apply it to both.

- For time-series, prefer forward/backward fill.

- For complex datasets, consider KNN or MICE.

- Compare model performance after imputation. 
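
The leakage point can be sketched with scikit-learn's `SimpleImputer`: the statistic is learned from the training rows only and then reused for the test rows. The data and split settings are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [np.nan], [7.0], [8.0]])

# Split first, so the test rows never influence the imputation statistic.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy='mean')
X_train_imp = imputer.fit_transform(X_train)  # learn the mean from train only
X_test_imp = imputer.transform(X_test)        # reuse that mean on test

print(imputer.statistics_)
```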

## 4. Feature Selection 
- Feature selection is the process of choosing the most important input variables (features) from your dataset that contribute the most to predicting the target variable.
- Instead of feeding all available features to a model, we pick only the relevant ones to:

    - Reduce overfitting
    - Improve model accuracy
    - Make models faster and easier to interpret
    - Reduce training cost

##### Feature Selection Techniques

| **Category**         | **How it Works**                                                                                         | **Example Techniques**                                                       |
| -------------------- | -------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------- |
| **Filter Methods**   | Select features based on statistical scores before model training. Independent of ML algorithm.          | Correlation Coefficient, Chi-square test, ANOVA F-test, Mutual Information   |
| **Wrapper Methods**  | Evaluate subsets of features by training a model multiple times and choosing the best performing subset. | Forward Selection, Backward Elimination, Recursive Feature Elimination (RFE) |
| **Embedded Methods** | Feature selection happens automatically during model training.                                           | Lasso (L1) Regularization, Decision Tree Feature Importance, ElasticNet      |
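
As an illustration of the three categories in the table, the sketch below applies one technique from each to scikit-learn's bundled Iris dataset. The dataset, the choice of two features, and the estimators are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Filter: keep the 2 features with the highest ANOVA F-score.
filter_mask = SelectKBest(score_func=f_classif, k=2).fit(X, y).get_support()

# Wrapper: recursive feature elimination around a logistic regression.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
wrapper_mask = rfe.support_

# Embedded: importances produced while a decision tree is trained.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
importances = tree.feature_importances_

print(filter_mask)
print(wrapper_mask)
print(importances.round(3))
```

The boolean masks mark the selected columns, and the tree importances sum to 1, so they can be read as relative contributions.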
