# Assignment (17th March) : Feature Engineering - 1

### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

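**ANS:** **`Missing values`** in a dataset are entries where no value is stored for a variable in an observation (typically shown as `NaN`, `None`, or an empty cell). **`Handling missing values`** is essential because they can bias estimates, shrink the usable sample, and lead to incorrect conclusions; many algorithms also refuse to train on incomplete input. **`Some algorithms that can tolerate missing values`** (this depends on the implementation, not only on the algorithm) include:
   - Decision Trees (implementations that route missing values down a dedicated or surrogate branch)
   - Random Forests (when built on such trees)
   - k-Nearest Neighbors (only with a distance measure that skips missing coordinates; plain Euclidean distance fails on them)
   - XGBoost (with built-in handling that learns a default split direction for missing values)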

### Q2: List down techniques used to handle missing data. Give an example of each with Python code.

**ANS:** Here are some techniques to handle missing data along with Python code examples:

In [1]:
import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, None, 4], 'B': [5, None, 7, 8]}
df = pd.DataFrame(data)

In [2]:
# 1. Drop rows with any missing values
df_dropped = df.dropna()
print(df_dropped)

     A    B
0  1.0  5.0
3  4.0  8.0


In [3]:
# 2. Fill missing values with mean
df_filled_mean = df.fillna(df.mean())
print(df_filled_mean)

          A         B
0  1.000000  5.000000
1  2.000000  6.666667
2  2.333333  7.000000
3  4.000000  8.000000


In [6]:
# 3. Fill missing values with median
df_filled_median = df.fillna(df.median())   
print(df_filled_median)

     A    B
0  1.0  5.0
1  2.0  7.0
2  2.0  7.0
3  4.0  8.0


In [5]:
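# 4. Fill missing values with mode
# mode() returns every tied value in sorted order, so [0] picks the smallest one
# (here each column has all-unique values, so the column minimum is used)
df_filled_mode = df.apply(lambda x: x.fillna(x.mode()[0]), axis=0)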
print(df_filled_mode)

     A    B
0  1.0  5.0
1  2.0  5.0
2  1.0  7.0
3  4.0  8.0


### Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

**ANS:** 

1. **Imbalanced data** occurs when the classes in a dataset are not represented equally, meaning one class has significantly more samples than the other(s).

2. If **imbalanced data is not handled**:
   - The model may become biased towards the majority class.
   - The model may show high accuracy but poor performance on the minority class.
   - Important signals from the minority class can be ignored.
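   - Accuracy becomes misleading: a model that always predicts the majority class can score very high while detecting no minority cases, so precision, recall, and F1-score on the minority class should be reported alongside it.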
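
The gap between high accuracy and poor minority-class performance can be shown with a few lines of NumPy. This is a minimal sketch on hypothetical labels (95 negatives, 5 positives) and a "model" that always predicts the majority class.

```python
import numpy as np

# Hypothetical labels: 95 majority-class samples (0) and 5 minority-class samples (1).
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its input and always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
true_positives = ((y_true == 1) & (y_pred == 1)).sum()
minority_recall = true_positives / (y_true == 1).sum()

print(f"accuracy: {accuracy:.2f}")                 # accuracy: 0.95
print(f"minority recall: {minority_recall:.2f}")   # minority recall: 0.00
```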

### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

**ANS:** 

1. **Up-sampling** (or oversampling) involves increasing the number of instances in the minority class by duplicating or creating synthetic data points until the classes are balanced.

2. **Down-sampling** (or undersampling) involves reducing the number of instances in the majority class by randomly removing samples until the classes are balanced.

**`Example Scenarios:`**

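- **When `Up-sampling` is Required**: A fraud detection dataset where fraudulent transactions are much fewer than non-fraudulent ones, and the dataset is too small to afford discarding any data.

- **When `Down-sampling` is Required**: An email classification dataset in which non-spam emails far outnumber spam, and the majority class is large enough that discarding part of it still leaves plenty of training data.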
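
Both operations can be sketched with pandas alone. The toy frame below (8 majority rows, 2 minority rows) and the names `samples`, `feature`, and `label` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: 8 majority rows (label 0) and 2 minority rows (label 1).
samples = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})
majority = samples[samples["label"] == 0]
minority = samples[samples["label"] == 1]

# Up-sampling: draw minority rows with replacement until they match the majority.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up])

# Down-sampling: draw majority rows without replacement down to the minority size.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["label"].value_counts().to_dict())
print(downsampled["label"].value_counts().to_dict())
```

Up-sampling keeps all 8 majority rows and repeats minority rows, while down-sampling keeps only 2 of the majority rows.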

### Q5: What is data Augmentation? Explain SMOTE.

**ANS:** `Data augmentation` involves creating new data points from the existing data using various techniques. It is commonly used in image processing to increase the size of the dataset by applying transformations like rotations, flips, and crops to the original images.

**`SMOTE (Synthetic Minority Over-sampling Technique)`**:
- SMOTE is a data augmentation technique used specifically for handling imbalanced datasets. It generates synthetic samples for the minority class by interpolating between existing minority class examples.

Explanation:
1. **Selection**: For each instance in the minority class, SMOTE selects one or more of its nearest neighbors from the same class.
2. **Interpolation**: Synthetic examples are created by taking the difference between the feature vector of the chosen nearest neighbor and that of the selected instance, multiplying it by a random number between 0 and 1, and adding the result to the feature vector of the selected instance. Each new point therefore lies on the line segment joining the two original points.
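
The two steps above can be sketched in NumPy. This is a simplified illustration, not the reference SMOTE implementation: the function name `smote_sample`, its parameters, and the four sample points are all hypothetical.

```python
import numpy as np

def smote_sample(minority, k=3, n_new=5, seed=0):
    """Generate n_new synthetic points from a 2-D array of minority samples."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        # Selection: pick a minority point and one of its k nearest minority neighbours.
        i = rng.integers(len(minority))
        distances = np.linalg.norm(minority - minority[i], axis=1)
        neighbour_ids = np.argsort(distances)[1:k + 1]  # position 0 is the point itself
        j = rng.choice(neighbour_ids)
        # Interpolation: x_new = x_i + gap * (x_j - x_i), with gap drawn from [0, 1).
        gap = rng.random()
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

# Hypothetical minority-class samples with two features each.
minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_points = smote_sample(minority)
print(new_points)
```

Because every synthetic point is a blend of two existing minority points, none of them can fall outside the range already covered by the minority class.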

### Q6: What are outliers in a dataset? Why is it essential to handle outliers?

**ANS:** **`Outliers`** in a dataset are data points that significantly differ from the majority of observations. They may be unusually high or low compared to other values. `Handling outliers` is essential because:
   - Outliers can skew and mislead statistical analyses, affecting measures such as mean and standard deviation.
   - They can distort patterns and trends in data visualization.
   - Outliers may lead to biased or incorrect model predictions.
   - They can indicate data entry errors or rare, but significant, events that need special attention.
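
A small made-up sample shows how a single outlier pulls the mean away from the median. The 1.5 × IQR fence used to flag it is one common rule of thumb, assumed here for illustration; it is not the only way to define an outlier.

```python
import numpy as np

# Hypothetical measurements; 95 is far from the rest.
values = np.array([10, 12, 11, 13, 12, 11, 95])

# The mean is dragged towards the outlier, the median is not.
print("mean:", round(values.mean(), 2), "median:", np.median(values))

# Flag points outside the 1.5 * IQR fences.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print("fences:", lower, upper)
print("flagged:", outliers)
```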

### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

**ANS:** To handle missing data:

1. **Removing Rows**: Drop rows with missing values.
2. **Filling Values**: Use mean, median, mode, or other values to fill gaps.
3. **Forward/Backward Fill**: Propagate existing values forward or backward.
4. **Interpolation**: Estimate missing values based on surrounding data.
5. **Imputation Algorithms**: Use techniques like k-Nearest Neighbors or regression.
6. **Domain Knowledge**: Fill based on specific knowledge about the data.
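
Techniques 3 to 5 can be sketched as follows, reusing the same small sample values as the Q2 examples. The frame name `customers` is hypothetical, and `KNNImputer` assumes scikit-learn is available.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

customers = pd.DataFrame({"A": [1, 2, np.nan, 4], "B": [5, np.nan, 7, 8]})

# Forward fill: copy the last seen value downwards.
forward_filled = customers.ffill()

# Backward fill: copy the next seen value upwards.
backward_filled = customers.bfill()

# Linear interpolation between the neighbouring values in each column.
interpolated = customers.interpolate()

# k-NN imputation: fill each gap from the rows most similar on the other columns.
imputer = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(imputer.fit_transform(customers), columns=customers.columns)

print(forward_filled, backward_filled, interpolated, knn_filled, sep="\n\n")
```

Forward and backward fill are mainly meaningful for ordered data such as time series; for unordered customer records, statistical or model-based imputation is usually the safer choice.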

### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

**ANS:** To determine if missing data is missing at random or if there is a pattern:

1. **Visual Inspection**: Use heatmaps or missing-value matrix plots to see whether the gaps cluster in particular rows or columns.
2. **Summary Statistics**: Compare the distributions of the other variables for rows with and without the missing value.
3. **Missing Data Indicators**: Create a 0/1 indicator column for each variable with gaps and analyze its correlation with the other variables.
4. **Little's MCAR Test**: Test the null hypothesis that the data are Missing Completely at Random (MCAR); a small p-value suggests the missingness is not completely random.
5. **Predictive Modeling**: Train a model to predict the missingness indicator from the other features; performance clearly better than chance means the missingness follows a pattern.
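
The indicator-based checks can be sketched with pandas. The data below is simulated so that `income` is deliberately missing more often for customers under 30; the column names and probabilities are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd

# Simulated data: income is missing far more often for customers under 30.
rng = np.random.default_rng(0)
age = rng.integers(18, 70, size=500)
income = rng.normal(50_000, 10_000, size=500)
missing_probability = np.where(age < 30, 0.4, 0.05)
income[rng.random(500) < missing_probability] = np.nan
customers = pd.DataFrame({"age": age, "income": income})

# Missingness indicator: 1 where income is missing, 0 otherwise.
customers["income_missing"] = customers["income"].isna().astype(int)

# Compare another variable between rows with and without the gap.
mean_age_by_group = customers.groupby("income_missing")["age"].mean()
print(mean_age_by_group)

# Correlation between the indicator and another variable.
indicator_correlation = customers["income_missing"].corr(customers["age"])
print("correlation with age:", round(indicator_correlation, 3))
```

A clear difference in mean age between the two groups, or a correlation well away from zero, indicates that the missingness depends on age and is therefore not completely at random.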

### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

**ANS:** Some strategies that you can use to evaluate the performance of your machine learning model on this imbalanced dataset are as follows:

1. **Confusion Matrix**: Analyze true/false positives/negatives.
2. **Precision and Recall**: Measure the model's accuracy on the minority class.
3. **F1 Score**: Balance precision and recall.
4. **ROC Curve and AUC**: Assess performance across thresholds.
5. **Precision-Recall Curve**: Focus on the performance of the minority class.
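6. **Class Weight Adjustment**: A training-time remedy, not a metric; incorporating class weights during training often improves the minority-class metrics above.
7. **Resampling Techniques**: Also a training-time remedy; apply SMOTE or random under-sampling to the training data only, so that the evaluation data keeps the real class distribution.
8. **Stratified K-Fold Cross-Validation**: Preserve the original class proportions in every fold so that each validation fold contains minority-class cases.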
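
The metrics in items 1 to 5 are available in `sklearn.metrics`. The sketch below assumes scikit-learn is installed and uses ten made-up patients (two with the condition); `y_pred` is `y_score` thresholded at 0.5.

```python
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical labels and model outputs: 1 = has the condition.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.35, 0.7, 0.8, 0.4]
y_pred = [1 if score >= 0.5 else 0 for score in y_score]

# Confusion matrix for binary labels, unpacked as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)  # TN, FP, FN, TP: 7 1 1 1

# Threshold-dependent metrics for the positive (minority) class.
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print("precision, recall, F1:", precision, recall, f1)

# Threshold-free summaries computed from the scores.
roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)
print("ROC AUC:", roc_auc, "average precision:", pr_auc)
```

Accuracy here would be 0.8, yet the model finds only one of the two patients with the condition, which is what the recall of 0.5 makes visible.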

### Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

**ANS:** To balance the dataset and down-sample the majority class:

1. **Random Under-sampling**: Remove samples from the majority class randomly.
2. **Cluster-Based Under-sampling**: Use clustering algorithms to select representative samples from the majority class.
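3. **Tomek Links**: Find pairs of opposite-class samples that are each other's nearest neighbor and remove the majority-class member of each pair, which cleans up the class boundary.
4. **Edited Nearest Neighbors (ENN)**: Remove majority-class samples whose label disagrees with most of their k nearest neighbors.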

These methods help balance the dataset by reducing the number of samples in the majority class.
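
Tomek-link removal can be sketched in NumPy. A Tomek link is a pair of opposite-class points that are each other's nearest neighbor; the helper name `tomek_link_majority_indices` and the six sample points are hypothetical, and the brute-force distance matrix is only practical for small datasets.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority-class points that belong to a Tomek link."""
    # Pairwise Euclidean distances; a point is never its own neighbour.
    distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(distances, np.inf)
    nearest = distances.argmin(axis=1)

    to_remove = []
    for i, j in enumerate(nearest):
        is_mutual = nearest[j] == i
        # Keep only mutual nearest neighbours from opposite classes.
        if is_mutual and y[i] != y[j] and y[i] == majority_label:
            to_remove.append(i)
    return np.array(to_remove, dtype=int)

# Hypothetical data: label 0 is the majority class, label 1 the minority class.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 1.0], [3.0, 3.0], [3.1, 3.2]])
y = np.array([0, 0, 0, 1, 0, 0])

removed = tomek_link_majority_indices(X, y, majority_label=0)
X_clean = np.delete(X, removed, axis=0)
y_clean = np.delete(y, removed)

print("removed indices:", removed)  # removed indices: [2]
print("remaining class counts:", np.bincount(y_clean))
```

Only the majority point sitting right next to the minority sample is dropped, so this method cleans the boundary rather than equalizing class counts; it is often combined with random under-sampling when a fully balanced dataset is needed.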

### Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

**ANS:** To balance the dataset and up-sample the minority class:

1. **Random Over-sampling**: Duplicate samples from the minority class.
2. **SMOTE (Synthetic Minority Over-sampling Technique)**: Generate synthetic samples for the minority class.
3. **ADASYN (Adaptive Synthetic Sampling)**: Similar to SMOTE but focuses on generating samples in harder-to-learn regions.
4. **Borderline-SMOTE**: Generate synthetic samples near the decision boundary of the minority class.

These methods help increase the representation of the minority class in the dataset.