## Feature Engineering 1

Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some 
algorithms that are not affected by missing values.

Ans. Missing values in a dataset refer to data points that are not available or not recorded for certain observations. They can occur due to various reasons, such as measurement errors, data entry mistakes, or limitations in the data collection process.

**Reasons why handling missing values is essential:**

1. **Data integrity:** Missing values can compromise the integrity and reliability of the dataset. They can lead to biased results and incorrect conclusions if not handled appropriately.

2. **Statistical analysis:** Many statistical methods, such as regression analysis and clustering, require complete data to produce accurate results. Missing values can hinder the applicability of these techniques and lead to unreliable estimates.

3. **Data visualization:** Missing values can make data visualizations difficult to interpret. They can create gaps or inconsistencies in plots and graphs, making it challenging to identify patterns and trends.

4. **Machine learning:** Missing values can pose challenges for machine learning algorithms. Some algorithms may be sensitive to missing data and may produce inaccurate predictions or fail to learn effectively.
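
As a quick illustration of the points above, the following minimal sketch audits missing values in a small table and shows how a single `NaN` can silently change a summary statistic. The column names and values are hypothetical toy data, and pandas and NumPy are assumed to be installed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one incomplete row.
df = pd.DataFrame({
    "Age": [20, 30, np.nan, 50, 40],
    "Income": [10000, 20000, np.nan, 40000, 30000],
})

# Count and share of missing values per column.
missing_count = df.isna().sum()
missing_share = df.isna().mean()

# NumPy's plain mean propagates NaN, while pandas skips NaN by default.
numpy_mean = np.mean(df["Age"].to_numpy())
pandas_mean = df["Age"].mean()

print(missing_count)
print(missing_share)
print(numpy_mean, pandas_mean)
```

The two means differ only because of how each library treats `NaN`, which is why it helps to decide on a missing-value strategy explicitly rather than rely on defaults.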

**Algorithms not affected by missing values:**

Strictly speaking, no algorithm is completely unaffected by missing values, but some can be trained and used without a separate imputation step. Whether this works in practice depends on the implementation.

1. **Decision trees:** Decision trees make decisions based on a series of splits in the data. Some implementations handle missing values natively, for example through surrogate splits (CART) or by sending missing values down a default branch. Gradient-boosted tree libraries such as XGBoost and LightGBM learn this default direction for each split.

2. **Random forests:** Random forests are an ensemble method that combines multiple decision trees. They inherit whatever missing-value handling the underlying tree implementation provides.

3. **Naive Bayes:** Naive Bayes is a probabilistic classification algorithm that assumes the features are conditionally independent given the class label. Because of this assumption, a missing feature can in principle be left out of the probability calculation for that observation. Many library implementations still require complete input.

By contrast, k-nearest neighbors (k-NN) and support vector machines (SVMs) are affected by missing values in their standard form. k-NN needs every feature to compute distances between data points, and an SVM needs every feature to evaluate its kernel and decision boundary, so both usually require imputation or deletion of incomplete rows first.

Q2: List the techniques used to handle missing data. Give an example of each with Python code.

Ans. Techniques used to handle missing data are:

1. **Missing Data Imputation:** Replacing missing values with imputed values.
2. **Multiple Imputation:** Generating multiple sets of imputed values and combining the results.
3. **Predictive Modeling:** Using a machine learning model to predict missing values.
4. **Case Deletion:** Removing cases with missing values.

Example of each technique:

In [2]:
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # enables IterativeImputer (still experimental)
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

# 1.
# Imputing missing values with the mean of the feature.
df = pd.DataFrame({'Age': [20, 30, np.nan, 50, 40],
                    'Income': [10000, 20000, np.nan, 40000, 30000]})

# Assign the result back instead of calling fillna(inplace=True) on a column selection.
df['Age'] = df['Age'].fillna(df['Age'].mean())

print(df)

# 2.
# Using a MICE-style (Multivariate Imputation by Chained Equations) imputer.
# scikit-learn's IterativeImputer is used here. One fit_transform call returns a
# single imputed dataset; for multiple imputation, repeat the call with
# sample_posterior=True and different random_state values, then pool the results.
df1 = pd.DataFrame({'Age': [20, 30, np.nan, 50, 40],
                    'Income': [10000, 20000, np.nan, 40000, 30000]})

imputer = IterativeImputer(random_state=0)
imputed_df = pd.DataFrame(imputer.fit_transform(df1), columns=df1.columns)

print(imputed_df)

# 3.
# Using a decision tree model to predict missing age values.
df2 = pd.DataFrame({'Age': [20, 30, np.nan, 50, 40],
                    'Income': [10000, 20000, np.nan, 40000, 30000]})

# Train only on rows where both Age and Income are known.
known = df2.dropna(subset=['Age', 'Income'])
model = DecisionTreeRegressor(random_state=0)
model.fit(known[['Income']], known['Age'])

# The row with the missing Age also lacks Income, so fill Income with its mean first.
df2['Income'] = df2['Income'].fillna(df2['Income'].mean())

# Predict and impute the missing Age values.
missing_age = df2['Age'].isnull()
df2.loc[missing_age, 'Age'] = model.predict(df2.loc[missing_age, ['Income']])

print(df2)

# 4.
# Dropping rows with missing age values.
df3 = pd.DataFrame({'Age': [20, 30, np.nan, 50, 40],
                    'Income': [10000, 20000, np.nan, 40000, 30000]})

df3 = df3.dropna(subset=['Age'])

print(df3)

Output of the first `print(df)` (mean imputation of `Age`):

    Age   Income
0  20.0  10000.0
1  30.0  20000.0
2  35.0      NaN
3  50.0  40000.0
4  40.0  30000.0


Q3: Explain imbalanced data. What will happen if imbalanced data is not handled?

Ans. **Imbalanced Data**:

Imbalanced data is a dataset where the distribution of different classes is highly skewed, meaning one class significantly outnumbers the other classes. This imbalance can cause problems for machine learning algorithms, as they may prioritize the majority class and overlook the minority class.


**Problems caused by Imbalanced data**:

1. **Biased Models:** When a model is trained on imbalanced data, it tends to be biased towards the majority class. This means that the model is more likely to correctly classify the majority class and is less likely to accurately classify the minority class.

2. **Poor Performance on Minority Class:** Due to the limited data points, the minority class often has insufficient representation in the training set. The model learns to classify the majority class correctly but has difficulty recognizing and classifying the minority class.

3. **Evaluation Metrics Misinterpretation:** Standard evaluation metrics such as overall accuracy may be misleading in the presence of imbalanced data. A model with high accuracy on the majority class but low accuracy on the minority class may still be considered successful based on overall accuracy. This can lead to incorrect conclusions about the model's performance.

4. **Overfitting to Majority Class:** With limited minority class data, the model may learn to overfit to the majority class to optimize performance metrics that emphasize the majority class. This can result in poor generalization and suboptimal performance on the minority class in real-world scenarios.

5. **Difficulty in Tuning Hyperparameters:** Imbalanced data can make it more challenging to optimize hyperparameters of a machine learning algorithm effectively. The optimal hyperparameter settings for majority and minority classes may differ, so finding a balance can be difficult.
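
The misleading-accuracy problem described above can be reproduced in a few lines. This is a minimal sketch with hypothetical labels (95 majority, 5 minority) and a trivial "model" that always predicts the majority class. It needs only the Python standard library.

```python
# Hypothetical labels: 95 negatives (majority) and 5 positives (minority).
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
minority_recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}")
print(f"minority recall = {minority_recall:.2f}")
```

The accuracy is high even though not a single minority example is found, which is why overall accuracy alone should not be used to judge a model on imbalanced data.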


**Consequences of not handling imbalanced data**:

1. **Reduced Model Reliability:** Models trained on imbalanced data may not be reliable for decision-making tasks, especially those involving the minority class. Predictions can be inaccurate, leading to incorrect conclusions or actions.

2. **Inaccessible Minority Class Insights:** Overlooking imbalanced data can result in missing valuable insights about the minority class. These insights may be critical for decision-making processes that require a comprehensive understanding of all classes.

3. **Legal and Ethical Issues:** In certain domains, ignoring imbalanced data can have legal and ethical implications. For example, in applications where fairness and equal treatment are essential (e.g., healthcare, finance), failing to address imbalanced data can lead to unfair outcomes and discrimination against the minority class.

Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.
Ans.  **Up-sampling**

In the context of imbalanced datasets, up-sampling (also called over-sampling) increases the number of minority-class examples in the training set, either by duplicating existing examples (random over-sampling) or by generating synthetic ones (for example with SMOTE). The same term is used in signal processing for raising a signal's sampling rate by interpolation, which is a different operation.

**Example:**

Suppose a fraud-detection dataset contains 9,900 legitimate and 100 fraudulent transactions. It can be up-sampled by drawing fraudulent rows with replacement until both classes have 9,900 rows. No information is discarded, but the repeated rows increase the risk of overfitting to the minority class.

**Down-sampling**

Down-sampling (also called under-sampling) reduces the number of majority-class examples in the training set, typically by keeping only a random subset of them. In signal processing, the term instead refers to lowering a signal's sampling rate by decimation.

**Example:**

The same dataset can be down-sampled by keeping all 100 fraudulent transactions and a random 100 of the 9,900 legitimate ones. Training becomes faster, but most of the majority-class information is thrown away.

**When are up-sampling and down-sampling required?**

Resampling is worth considering when class imbalance causes a model to ignore the minority class:

* **Up-sampling** is preferable when the dataset is small or minority examples are scarce, so that discarding majority rows would leave too little data to learn from.
* **Down-sampling** is preferable when the dataset is large enough that a subset of the majority class is still representative, or when training time and memory are a constraint.
* **A combination** of moderate up-sampling and moderate down-sampling can be used when neither extreme is acceptable.
* **In every case**, only the training data should be resampled. Validation and test data should keep the original class distribution so that the evaluation reflects real-world conditions.
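
In the class-imbalance sense used in the rest of this document, both operations can be sketched with pandas. The example below assumes a hypothetical table with a `label` column (90 rows of class 0, 10 rows of class 1) and uses `DataFrame.sample` for the random draws.

```python
import pandas as pd

# Hypothetical imbalanced data: 90 rows of class 0, 10 rows of class 1.
df = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 90 + [1] * 10,
})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Up-sampling: draw minority rows with replacement until they match the majority.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up])

# Down-sampling: draw majority rows without replacement down to the minority size.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["label"].value_counts().to_dict())
print(downsampled["label"].value_counts().to_dict())
```

Up-sampling keeps every original row but repeats minority rows, while down-sampling keeps every minority row but discards most majority rows.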

Q5: What is data Augmentation? Explain SMOTE.
Ans. 
**Data Augmentation:**
Data augmentation is a technique used to increase the size of a dataset by adding slightly modified copies of existing data points. This helps reduce the impact of overfitting during model training, leading to improved generalization performance.

**SMOTE:**
Synthetic Minority Over-sampling Technique (SMOTE) is a popular data augmentation technique specifically designed to handle imbalanced datasets, where the minority class is significantly smaller than the majority class. Here's how SMOTE works:

1. **Identify the Minority Class:**
   The first step is to identify the minority class, which is the class with fewer data points.

2. **Calculate the Nearest Neighbors:**
   For each data point in the minority class, find its k nearest neighbors within the same class.

3. **Generate Synthetic Data Points:**
   For a minority class data point `x`, randomly select one of its k nearest neighbors `x_nn`. Then create a synthetic data point on the line segment connecting the two: `x_new = x + λ · (x_nn − x)`, where `λ` is drawn uniformly at random between 0 and 1.

4. **Label the Synthetic Point:**
   Assign the minority class label to the new point. No extra noise is added to the features. The random choice of neighbor and of `λ` is what makes each synthetic point differ from the original data points.

5. **Repeat the Process:**
   Repeat steps 3 and 4 until enough synthetic data points have been generated to balance the dataset.
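
The interpolation idea behind SMOTE can be sketched in plain NumPy. This is a simplified illustration, not the reference implementation: the function name `smote_sketch`, its parameters, and the five sample points are all hypothetical, and base points are picked at random instead of being visited one by one.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points from minority samples X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)

    # Pairwise Euclidean distances between minority points.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                 # pick a minority point
        nb = neighbors[base, rng.integers(k)]  # pick one of its k neighbors
        lam = rng.random()                     # interpolation factor in [0, 1)
        synthetic[i] = X_min[base] + lam * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical 2-D minority class with five points.
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
new_points = smote_sketch(X_minority, n_new=10)
print(new_points.shape)
```

Every synthetic point lies on a segment between two existing minority points, so it stays inside the region the minority class already occupies. For real projects, a maintained implementation such as the one in the imbalanced-learn package is usually a better choice than hand-written code.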

**Advantages of SMOTE:**

- Addresses Imbalanced Datasets: SMOTE directly tackles the problem of imbalanced datasets by oversampling the minority class, making it more representative in the dataset.

- Avoids Exact Duplicates: Unlike random oversampling, SMOTE generates new points by interpolating between neighboring minority-class samples, so the added data stays within the region of feature space that the minority class already occupies instead of repeating existing rows.

- Improves Classification Performance: By balancing the dataset, SMOTE helps classification algorithms better learn the patterns and boundaries between different classes, leading to improved classification performance, especially for the minority class.

**Disadvantages of SMOTE:**

- Potential Overfitting: SMOTE introduces synthetic data points that are similar to the existing data points, which can lead to overfitting, particularly when the dataset is small.

- Noisy Data: The synthetic data points generated by SMOTE can be noisy and might not represent the true distribution of the minority class.

- Computational Cost: Generating synthetic data points can be computationally intensive, especially for large datasets, and may require significant processing power and time.

Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Ans. **Outliers in a Dataset:**

Outliers are extreme values in a dataset that deviate significantly from the other data points. They can be unusually high or unusually low compared to the rest of the data. Outliers can occur due to various reasons, such as:

1. **Measurement Errors:** Errors during data collection or measurement can lead to outliers. These errors can be caused by faulty equipment, human mistakes, or improper data entry.

2. **Data Entry Errors:** Incorrectly entering data can also result in outliers. For example, accidentally typing an extra zero or decimal point can create an outlier value.

3. **Rare Events:** Sometimes, outliers represent genuine extreme events or observations that occur infrequently. These events can be crucial for understanding the tails of the distribution.

4. **Anomalous Data Points:** Outliers can arise from anomalous data points that do not follow the general pattern or distribution of the dataset. This can be due to unique characteristics, exceptional circumstances, or underlying factors.

**Importance of Handling Outliers:**

1. **Robustness and Accuracy:** Outliers can significantly impact statistical analyses and model predictions if not handled appropriately. Ignoring outliers can lead to biased results, incorrect conclusions, and poor model performance.

2. **Detecting Errors and Anomalies:** Outliers can be a valuable indicator of errors or anomalies in the data. By examining outliers, data analysts can identify potential data issues, measurement errors, or unusual observations that require further investigation.

3. **Preserving Data Integrity:** Handling outliers deliberately, rather than deleting them automatically, protects the integrity of the data. Removing outliers can exclude valuable information, because some outliers provide insights into extreme scenarios, rare events, or unique characteristics that contribute to a comprehensive understanding of the data.

4. **Improving Model Performance:** When used effectively, outlier handling techniques can improve the accuracy and reliability of models. By removing or mitigating the influence of outliers, models can be trained on data that better represents the underlying patterns and relationships.

5. **Transparency and Reproducibility:** Proper handling of outliers ensures transparency and reproducibility in data analysis and modeling. By clearly documenting the methods used to address outliers, researchers and practitioners can facilitate the replication of studies and encourage collaboration.

**Methods for Handling Outliers:**

1. **Data Cleaning:** Identifying and correcting errors or inconsistencies in the data can help eliminate outliers caused by measurement or data entry errors.

2. **Winsorization:** This method replaces values beyond a chosen percentile (for example, below the 5th or above the 95th) with the value at that percentile, reducing their impact on statistical analyses.

3. **Trimming:** Trimming involves removing a fixed percentage or a specific number of data points from both ends of the distribution, excluding extreme outliers.

4. **Robust Statistical Methods:** Robust statistical methods are less sensitive to outliers and can provide more reliable results even in the presence of extreme values.

5. **Transformation:** Transforming the data using logarithmic or other nonlinear transformations can reduce the impact of outliers and improve the symmetry of the distribution.

6. **Outlier Detection Algorithms:** Automated algorithms can be used to detect outliers based on statistical measures, distance metrics, or machine learning techniques.
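
Three of the methods above (detection with a statistical rule, capping, and transformation) can be sketched with NumPy. The measurements are hypothetical, and the capping step clips at the IQR fences as a simple stand-in for percentile-based winsorization.

```python
import numpy as np

# Hypothetical measurements with one obvious outlier (250).
values = np.array([12.0, 14.0, 15.0, 15.0, 16.0, 17.0, 18.0, 19.0, 21.0, 250.0])

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Winsorization-style capping: clip values to the IQR fences.
capped = np.clip(values, lower, upper)

# Log transform: compresses the right tail (values must be positive).
logged = np.log(values)

print(values[is_outlier])
print(capped.max(), logged.max())
```

Capping keeps the row in the dataset while limiting its influence, whereas the log transform keeps the ordering of all values and only shrinks the distance between them.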


Q7: You are working on a project that requires analyzing customer data. However, you notice that some of 
the data is missing. What are some techniques you can use to handle the missing data in your analysis?
Ans.

1. **Imputation:**

   - **Mean/Median/Mode Imputation:** Replace missing values with the mean, median, or mode of the variable for which data is missing. This method is simple and straightforward, but it is only unbiased when the data is missing completely at random (MCAR), and it shrinks the variable's variance and weakens its correlations with other variables.
   - **K-Nearest Neighbors (KNN) Imputation:** Find the k most similar data points to the data point with missing values, measured on the observed features, and then use the mean or median of the k data points to impute the missing value. This method is more sophisticated than mean/median/mode imputation, but it requires more computational resources and is sensitive to feature scaling.

2. **Multiple Imputation:**

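   - Create multiple plausible complete datasets by imputing missing values with a method that includes random variation, such as chained equations (MICE) with draws from a predictive distribution. Deterministic methods such as mean imputation would produce identical datasets each time.
   - Analyze each complete dataset separately and then pool the results (for example with Rubin's rules) to obtain final estimates and standard errors.
   - This method is more computationally intensive than single imputation, but its standard errors reflect the uncertainty introduced by the missing values.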

3. **Propensity Score Matching:**

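   - If missingness depends on observed characteristics (missing at random, MAR), you can estimate each observation's propensity (probability) of having missing data and use it to match incomplete observations with similar complete ones, or to weight the complete cases.
   - The matched or weighted complete cases can then be analyzed in place of the full dataset. Because the method relies only on observed characteristics, it does not correct for data that is missing not at random (MNAR).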

4. **Model-Based Imputation:**

   - Use a statistical model to predict the missing values based on the observed values.
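   - This method is more complex than the other methods, but it can produce more accurate imputations when missingness depends on observed variables (MAR). Data that is missing not at random (MNAR) generally requires modeling the missingness mechanism itself.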

5. **Missing Indicator Variable:**

   - Create a binary variable that indicates whether or not a value is missing for a particular variable.
   - Then, include this variable in your analysis as an independent variable.
   - This will allow you to assess the impact of missing data on your results.

6. **Exclusion:**

   - If the missing data is a small proportion of the overall dataset and is missing completely at random (MCAR), you may be able to simply exclude the observations with missing data from your analysis.
   - However, this can lead to bias if the missing data is informative.
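
Some of the techniques above can be combined in a short sketch. The customer table, its column names, and the choice of `n_neighbors=2` are hypothetical. The example assumes pandas and scikit-learn (for `KNNImputer`) are available.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical customer data with gaps in both columns.
customers = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38, 29],
    "monthly_spend": [120.0, 150.0, 135.0, np.nan, 210.0, 110.0],
})

# Missing-indicator variables: record where values were absent before filling.
flagged = customers.copy()
for col in ["age", "monthly_spend"]:
    flagged[f"{col}_missing"] = customers[col].isna().astype(int)

# Median imputation: simple, but it ignores relationships between columns.
median_filled = customers.fillna(customers.median())

# KNN imputation: fill each gap from the 2 most similar rows.
knn = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(knn.fit_transform(customers), columns=customers.columns)

print(flagged)
print(median_filled)
print(knn_filled)
```

Because KNN imputation relies on distances, features on very different scales (such as age and spend here) would normally be standardized first. That step is left out to keep the sketch short.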


Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are 
some strategies you can use to determine if the missing data is missing at random or if there is a pattern 
to the missing data?

Ans.

1. **Visualize the Data**:

- Create visualizations, such as scatter plots, histograms, or heatmaps, that represent the distribution of the data and highlight missing values.
- Observe the patterns in the missing values. For example, are they concentrated in specific regions, values, or groups?

2. **Check for Missing Data Patterns**:

- Examine the distribution of missing data across variables. Are certain variables more likely to have missing values?
- Look for relationships between missing data and other variables. For example, are missing values more common for observations with certain values of other variables?
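- Conduct statistical tests, such as chi-square tests, t-tests, or Little's MCAR test, to determine if there are significant differences in the distribution of missing data across groups or categories. These tests can reveal dependence on observed variables, but they cannot rule out dependence on the unobserved values themselves.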

3. **Compare Missing and Non-Missing Cases**:

- Compare the characteristics of observations with missing data and those without missing data.
- Conduct statistical tests to determine if there are significant differences between the two groups in terms of their mean values, distributions, or other relevant statistics.

4. **Analyze Missing Data Patterns over Time**:

- If you have longitudinal data, examine how missing data patterns change over time.
- Look for trends or patterns in the occurrence of missing data, such as seasonal variations or changes in data collection methods.

5. **Correlate Missing Data with Other Variables**:

- Assess the correlation between missing data and other variables in your dataset.
- Look for relationships between the missing data and variables that might be related to the missing values.

6. **Review Data Collection and Recording Processes**:

- Examine the data collection and recording processes to understand how missing data might have occurred.
- Identify potential sources of error or bias that may have contributed to the missing data.

7. **Perform Sensitivity Analysis**:

- Conduct sensitivity analysis to assess the impact of missing data on your results.
- Implement different strategies for handling missing data and compare the outcomes to determine the robustness of your findings.

8. **Consult with Domain Experts**:

- If available, consult with domain experts who are familiar with the data and the context of its collection.
- Seek their insights into potential reasons for missing data and strategies for dealing with it.
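
The comparison of missing and non-missing cases can be sketched as follows. The data is simulated so that `income` is deliberately more likely to be missing for older people. All names and probabilities are hypothetical, and the example assumes pandas and SciPy are installed.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data where income is more likely to be missing for older people.
rng = np.random.default_rng(1)
n = 500
age = rng.integers(18, 80, size=n)
income = rng.normal(50000, 10000, size=n)
p_missing = np.where(age > 60, 0.4, 0.05)
income_obs = np.where(rng.random(n) < p_missing, np.nan, income)
df = pd.DataFrame({"age": age, "income": income_obs})

# Step 1: indicator for missing income.
df["income_missing"] = df["income"].isna()

# Step 2: compare an observed variable between the two groups.
group_means = df.groupby("income_missing")["age"].mean()

# Step 3: Welch t-test on age, rows with vs. without income.
age_missing = df.loc[df["income_missing"], "age"]
age_present = df.loc[~df["income_missing"], "age"]
t_stat, p_value = stats.ttest_ind(age_missing, age_present, equal_var=False)

print(group_means)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value suggests that missingness is related to age, so the data is unlikely to be missing completely at random. A large p-value would not prove the opposite, because missingness could still depend on variables that were not tested or on the missing values themselves.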

Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the 
dataset do not have the condition of interest, while a small percentage do. What are some strategies you 
can use to evaluate the performance of your machine learning model on this imbalanced dataset?

Ans.   
1. **Choose Appropriate Metrics:** 
   - Accuracy can be deceptive in imbalanced datasets, as it is biased towards the majority class. 
   - Instead, use metrics that consider the minority class, such as:
     - **F1 Score:** Harmonic mean of precision and recall, which penalizes false positives and false negatives equally.
     - **Recall:** Proportion of actual positives correctly classified.
     - **Area Under the Precision-Recall Curve (AUPRC):** Measures the area under the precision-recall curve, which provides a more comprehensive evaluation.
     - **Matthews Correlation Coefficient (MCC):** Considers true positives, false positives, true negatives, and false negatives, providing a balanced assessment of classification performance.


2. **Resampling Techniques:** 
   - **Oversampling:** Replicates examples from the minority class to balance the dataset. Techniques include: 
     - **Random Oversampling:** Randomly selects samples from the minority class and duplicates them.
     - **Synthetic Minority Oversampling Technique (SMOTE):** Creates synthetic minority class instances by interpolating between existing samples. 
     - **Adaptive Synthetic Sampling (ADASYN):** Oversamples minority class instances based on their difficulty to classify.


3. **Cost-Sensitive Learning:** 
   - Assign different costs to different classes during training. This encourages the model to minimize the misclassification costs of the minority class.


4. **Classification Threshold Adjustment:** 
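   - Adjust the classification threshold to favor the minority class. This can be done by:
     - Lowering the threshold on the predicted probability of the minority (positive) class, which increases its recall at the cost of more false positives.
     - Choosing the threshold from the precision-recall curve, for example the value that maximizes the F1 score or meets a required minimum recall.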


5. **Model Selection and Tuning:** 
   - Select and tune the machine learning model carefully. Consider algorithms that support class weighting or tend to cope well with imbalance, such as random forests or gradient boosting machines.


6. **Ensemble Methods:** 
   - Use ensemble methods, such as bagging or random forests, which can help reduce bias and improve performance on imbalanced data.


7. **Cross-Validation and Stratification:** 
   - Use stratified cross-validation or repeated cross-validation to ensure that the evaluation is robust and representative of the entire dataset.


8. **Visualize Model Performance:** 
   - Create confusion matrices, ROC curves, and precision-recall curves to visualize the model's performance on the imbalanced dataset.
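
The metrics and the threshold adjustment described above can be computed directly from the confusion-matrix counts. The sketch below uses only the Python standard library. The helper name `binary_metrics`, the ten labels, and the predicted scores are hypothetical.

```python
import math

def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and MCC for labels in {0, 1} (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical predicted probabilities for 10 patients (3 have the condition).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.40, 0.55, 0.90]

for threshold in (0.5, 0.35):
    y_pred = [int(s >= threshold) for s in scores]
    print(threshold, binary_metrics(y_true, y_pred))
```

Lowering the threshold from 0.5 to 0.35 catches every positive case in this toy data but also flags more healthy patients, which is the recall versus precision trade-off that has to be weighed in a medical setting.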

Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is 
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to 
balance the dataset and down-sample the majority class?

Ans.

1. **Down-sampling:**
    - **Random Undersampling:** Randomly select a subset of data points from the majority class to reduce its size and bring it closer to the size of the minority class.

2. **Over-sampling:**
    - **Random Oversampling:** Duplicating data points from the minority class multiple times to increase its size and match the size of the majority class.

3. **Resampling Techniques:**
    - **Synthetic Minority Oversampling Technique (SMOTE):** Generate synthetic data points in the minority class by interpolating between existing data points.
    - **Adaptive Synthetic Sampling (ADASYN):** Focuses on generating synthetic data points for minority class instances that are difficult to classify.

4. **Cost-Sensitive Learning:**
    - Assign different costs to different classes during the training process. This encourages the model to focus more on correctly classifying the minority class.

5. **Data Augmentation:**
    - For image classification tasks, apply transformations such as cropping, flipping, and color jittering to the minority class data points to create more diverse samples.

6. **Sampling Based on Class Distribution:**
    - Use a stratified sampling approach to ensure that the proportions of each class in the training set reflect the real-world distribution.

7. **Model Selection and Tuning:**
    - Use cross-validation to evaluate different models and hyperparameters to find the best performing model for the imbalanced dataset.

8. **Algorithms Designed for Imbalanced Data:**
    - Use methods built for imbalanced data, such as ROSE (Random Over-Sampling Examples), balanced random forests, EasyEnsemble, or RUSBoost.

9. **Performance Metrics:**
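    - Use performance metrics that remain informative under class imbalance, such as the F1-score, area under the precision-recall curve (AUPRC), or Matthews correlation coefficient (MCC). The area under the receiver operating characteristic curve (AUC-ROC) is also common, but it can look optimistic when the minority class is very rare.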
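
A common workflow is to split the data with stratification first and then down-sample the majority class in the training part only. The sketch below illustrates this with hypothetical survey data (900 satisfied, 100 dissatisfied customers) and assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical survey data: 900 satisfied (1) and 100 dissatisfied (0) customers.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 900 + [0] * 100)

# Stratified split keeps the 90/10 ratio in both the train and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Randomly under-sample the majority class in the training part only.
majority_idx = np.flatnonzero(y_train == 1)
minority_idx = np.flatnonzero(y_train == 0)
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
balanced_idx = np.concatenate([kept_majority, minority_idx])
rng.shuffle(balanced_idx)

X_train_bal, y_train_bal = X_train[balanced_idx], y_train[balanced_idx]

print(np.bincount(y_train_bal))  # class counts in the balanced training set
print(np.bincount(y_test))       # class counts in the untouched test set
```

Leaving the test set at its original class ratio means the evaluation still reflects how often dissatisfied customers actually occur.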

Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a 
project that requires you to estimate the occurrence of a rare event. What methods can you employ to 
balance the dataset and up-sample the minority class?

Ans.
1. **Resampling Techniques:**

 - **Random Oversampling:** Simply duplicate minority class instances randomly to increase their count. It's a straightforward method, but it can lead to overfitting.


 - **Adaptive Synthetic Sampling (ADASYN):** This technique assigns different sampling probabilities to minority class instances based on their difficulty to classify. Instances that are harder to classify are more likely to be oversampled.


 - **Synthetic Minority Oversampling Technique (SMOTE):** SMOTE creates new minority class instances by interpolating between an existing instance and one of its nearest minority-class neighbors. Because the new instances are not exact copies, the classifier is less prone to overfitting than with random oversampling.


 - **Borderline-SMOTE:** This variation of SMOTE focuses on oversampling minority class instances that are close to the decision boundary. It helps to improve the classifier's performance in these critical regions.


2. **Cost-Sensitive Learning:**

 - **Cost-Sensitive Classification:** Assign different misclassification costs to different classes. The classifier is then trained to minimize the overall cost, which encourages it to make fewer mistakes on the minority class.


 - **Cost-Sensitive Resampling:** Modify the resampling techniques mentioned above to incorporate misclassification costs. This ensures that the oversampled minority class instances are more representative and impactful.


3. **Ensembles and Stacking:**

 - **Ensemble Learning:** Combine multiple models trained on different subsets of the data. By combining the predictions of these models, you can achieve better overall performance, even on the minority class.


 - **Stacking:** Train multiple models on different subsets of the data and use their predictions as features for a final model. This approach allows the final model to learn from the strengths of the individual models and make more accurate predictions on the minority class.


4. **Data Augmentation:**

 - **Synthetic Data Generation:** Generate synthetic minority class instances using generative models like GANs (Generative Adversarial Networks) or VAEs (Variational Autoencoders). This technique can substantially increase the size of the minority class, but the quality of the synthetic data depends on how well the generative model is trained, which is difficult when minority examples are scarce.


5. **Sampling Strategies:**

 - **Random Under-Sampling:** Randomly remove majority class instances to reduce the imbalance ratio. However, this approach can discard valuable information from the majority class.
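
Cost-sensitive learning is often implemented through class weights. The sketch below computes weights inversely proportional to class frequency, using the formula `n_samples / (n_classes * class_count)`, which is the same heuristic scikit-learn applies for `class_weight="balanced"`. The helper name `balanced_class_weights` and the label counts are hypothetical.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical rare-event labels: 980 non-events (0) and 20 events (1).
y = np.array([0] * 980 + [1] * 20)
class_weights = balanced_class_weights(y)

# Per-sample weights that can be passed to estimators accepting sample weights.
sample_weights = np.array([class_weights[label] for label in y.tolist()])

print(class_weights)
```

With these weights, both classes contribute the same total weight to the loss, so the model is penalized far more for missing a rare event than for misclassifying a common non-event. Unlike up-sampling, this does not add any rows to the dataset.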
