### March 17 Assignment, Feature Engineering-I

#### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

#### Ans:
- Missing values in a dataset refer to the absence of a particular value or information in one or more variables or observations. Missing values can occur due to various reasons, such as data collection errors, incomplete surveys, or technical issues.

- It is essential to handle missing values because they can bias parameter estimates, reduce statistical power, and lead to incorrect conclusions. They also affect machine learning: many model implementations cannot accept missing data directly, so unhandled gaps can cause errors or degrade model performance.

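- Few algorithms are truly unaffected by missing values, but some can work with them without a separate imputation step. Tree-based methods are the main example: XGBoost, LightGBM, and scikit-learn's histogram-based gradient boosting learn which branch to send missing values down, and classical CART can use surrogate splits. Naive Bayes can, in principle, skip a missing feature when computing class probabilities, although not every implementation supports this. K-nearest neighbors (KNN), by contrast, relies on distance calculations and generally needs complete or imputed data.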

#### Q2: List down techniques used to handle missing data. Give an example of each with python code.

#### Ans:
Techniques used to handle missing data include:

- Mean/Median/Mode imputation: Replace missing values with the mean, median, or mode of the variable.
- Forward Fill/Backward Fill: Propagate the last observed value forward or the next observed value backward to fill missing values.
- Dropping missing values: Remove observations or variables with missing values.
- Model-based imputation: Use statistical models to estimate missing values based on other variables.
- Multiple imputation: Generate multiple imputations using statistical models and combine the results.

Example using Python code:

```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Create a DataFrame with missing values
data = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                     'B': [6, np.nan, 8, np.nan, 10]})

# Mean imputation
mean_imputer = SimpleImputer(strategy='mean')
data_mean_imputed = mean_imputer.fit_transform(data)
print(data_mean_imputed)

# Forward fill
data_forward_filled = data.ffill()
print(data_forward_filled)

# Dropping missing values
data_dropped = data.dropna()
print(data_dropped)
```

Output:

```
[[ 1.  6.]
 [ 2.  8.]
 [ 3.  8.]
 [ 4.  8.]
 [ 5. 10.]]
     A     B
0  1.0   6.0
1  2.0   6.0
2  2.0   8.0
3  4.0   8.0
4  5.0  10.0
     A     B
0  1.0   6.0
4  5.0  10.0
```
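
The example above covers mean imputation, forward fill, and dropping rows. As a rough sketch of model-based imputation, the block below fits a straight line for one column from another and uses it to predict the missing entries. The data, the column names, and the choice of a simple linear fit via `np.polyfit` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'B' is exactly 2 * 'A', with two values missing
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'B': [2.0, np.nan, 6.0, np.nan, 10.0]})

# Fit B = slope * A + intercept using only the rows where B is observed
observed = df['B'].notna()
slope, intercept = np.polyfit(df.loc[observed, 'A'], df.loc[observed, 'B'], deg=1)

# Predict B only for the rows where it is missing
df_imputed = df.copy()
df_imputed.loc[~observed, 'B'] = slope * df.loc[~observed, 'A'] + intercept
print(df_imputed)
```

Multiple imputation repeats this kind of model-based fill several times with added randomness and pools the results. In scikit-learn, `KNNImputer` and the experimental `IterativeImputer` are possible starting points for model-based approaches.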


#### Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

#### Ans:
- Imbalanced data refers to a situation where the classes or categories in a dataset are not represented equally. In other words, there is a significant disparity in the number of instances between different classes.

- If imbalanced data is not handled, models tend to favor the majority class and perform poorly on the minority class, which shows up as low precision, recall, or F1-score for that class. Overall accuracy can also be misleading: on a dataset where 95% of instances belong to the majority class, a model that always predicts the majority class reaches 95% accuracy without identifying a single minority instance.

#### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

#### Ans:
Up-sampling and down-sampling are resampling techniques used to address class imbalance.

- Up-sampling involves increasing the number of instances in the minority class to match the number of instances in the majority class. This can be achieved by randomly duplicating existing instances or generating synthetic instances based on existing minority class samples.

- Down-sampling involves decreasing the number of instances in the majority class to match the number of instances in the minority class. This can be done by randomly selecting a subset of instances from the majority class.

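- The choice between up-sampling and down-sampling depends on the specific problem and dataset characteristics. Up-sampling is useful when the minority class has few samples and no data can be spared, for example a fraud dataset with a few hundred fraudulent transactions among many thousands of legitimate ones. Down-sampling is appropriate when the majority class is so large that a random subset still represents it well, and it has the side benefit of reducing training time.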
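
Both techniques can be sketched with pandas alone. The DataFrame, its column names, and the 8-to-2 class split below are hypothetical and chosen only to keep the output small.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 8 majority rows (label 0), 2 minority rows (label 1)
df = pd.DataFrame({'feature': range(10),
                   'label': [0] * 8 + [1] * 2})

majority = df[df['label'] == 0]
minority = df[df['label'] == 1]

# Up-sampling: draw minority rows with replacement until they match the majority count
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
df_upsampled = pd.concat([majority, minority_up])

# Down-sampling: draw a subset of majority rows without replacement
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
df_downsampled = pd.concat([majority_down, minority])

print(df_upsampled['label'].value_counts())
print(df_downsampled['label'].value_counts())
```

In practice, resampling is usually applied to the training split only, so that the test set keeps the original class distribution.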

#### Q5: What is data Augmentation? Explain SMOTE.

#### Ans:
- Data augmentation is a technique used to increase the size of a dataset by creating new synthetic samples based on existing samples. It is commonly employed in machine learning tasks, especially when the available dataset is limited.

- SMOTE (Synthetic Minority Over-sampling Technique) is a specific data augmentation method used to address class imbalance.

- It generates synthetic samples for the minority class by interpolating between existing minority class samples. For each minority sample x, SMOTE finds its k nearest minority-class neighbors, picks one of them at random, and places a new sample at a random point on the line segment between the two: x_new = x + λ (x_neighbor − x), where λ is drawn uniformly from [0, 1].

- SMOTE helps to balance the class distribution and provides more training samples for the minority class, improving the performance of machine learning models on imbalanced datasets.
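
The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: the function name `smote_sketch`, its parameters, and the sample points are all assumptions, and it only handles numeric features.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=2, seed=0):
    """Generate n_new synthetic points from minority samples X_min (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))           # pick a minority sample at random
        nn = rng.choice(neighbors[base])          # pick one of its k neighbors
        lam = rng.random()                        # interpolation weight in [0, 1)
        synthetic[i] = X_min[base] + lam * (X_min[nn] - X_min[base])
    return synthetic

# Hypothetical minority-class points with two features
X_minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])
X_new = smote_sketch(X_minority, n_new=6)
print(X_new)
```

Every synthetic point lies between two real minority points, so the new samples stay inside the region the minority class already occupies. For real projects, the `SMOTE` class in the imbalanced-learn package (`imblearn.over_sampling`) is a more complete option.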

#### Q6: What are outliers in a dataset? Why is it essential to handle outliers?

#### Ans:
- Outliers in a dataset are observations that deviate significantly from the majority of other observations. They can be extreme values or data points that are inconsistent with the rest of the data. Outliers can arise due to measurement errors, data corruption, or genuine anomalies in the data.

- Handling outliers is essential because they can distort statistical analyses and modeling results. Outliers can disproportionately influence statistical measures such as the mean and standard deviation, leading to biased estimates. Outliers can also affect the performance of machine learning models by influencing the fitting process and reducing the accuracy of predictions.

- It is important to identify and handle outliers appropriately, considering the specific context and characteristics of the data. Outliers caused by data entry or measurement errors can be corrected or removed. Genuine extreme values are often better kept and tamed, for example by applying a log transformation to right-skewed data or by capping values at a chosen threshold. Robust statistical techniques that are less sensitive to outliers, such as median-based methods or robust regression models, can also be employed.
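
One common way to flag outliers is the interquartile range (IQR) rule. The sketch below uses made-up measurements and the conventional 1.5 × IQR fences, and also shows how a single extreme value pulls the mean while the median barely moves.

```python
import numpy as np

# Hypothetical measurements with one extreme value
values = np.array([10, 12, 11, 13, 12, 11, 95], dtype=float)

# IQR rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

print("Outliers:", values[is_outlier])
print("Mean with outlier:   ", values.mean())
print("Mean without outlier:", values[~is_outlier].mean())
print("Median (robust):     ", np.median(values))

# An alternative to removal: cap values at the fences
capped = np.clip(values, lower, upper)
print("Capped:", capped)
```

Whether to remove, cap, or keep a flagged value remains a judgment call that depends on whether the value is an error or a genuine observation.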

#### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

#### Ans:
When dealing with missing data in customer data analysis, some techniques that can be used include:

- Mean/median imputation: Fill missing values with the mean or median of the respective variable.
- Mode imputation: Fill missing values with the mode (most frequent value) of the respective variable.
- Predictive imputation: Use machine learning models or regression models to predict missing values based on other variables.
- Multiple imputation: Generate multiple imputations by modeling the missing values based on other variables and combine the results.
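- Dropping missing values: Exclude observations with missing values from the analysis. This is safest when the values are missing completely at random and only a small share of rows is affected. Otherwise it can bias the results and waste data.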

The choice of technique depends on the nature of the data, the amount of missingness, and the specific requirements of the analysis.
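
As a small illustration for customer data, the sketch below first measures how much is missing and then fills a numeric column with its median and a categorical column with its mode. The table, column names, and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table with a numeric and a categorical column
customers = pd.DataFrame({
    'age': [25, 32, np.nan, 41, 38],
    'plan': ['basic', 'premium', 'basic', None, 'basic'],
})

# Share of missing values per column, to judge how much is missing
print(customers.isna().mean())

cleaned = customers.copy()
# Numeric column: the median is less sensitive to extreme values than the mean
cleaned['age'] = cleaned['age'].fillna(cleaned['age'].median())
# Categorical column: fill with the most frequent category
cleaned['plan'] = cleaned['plan'].fillna(cleaned['plan'].mode()[0])
print(cleaned)
```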

#### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

#### Ans:
Strategies to determine if the missing data is missing at random or exhibits a pattern can include:

- Visual analysis: Plotting patterns of missing data using heatmaps or missing data matrices to identify any systematic patterns or relationships.
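- Statistical tests: Conducting statistical tests to determine if the missingness is associated with other variables in the dataset, for example by comparing an observed variable between rows with and without missing values (t-test or chi-square test) or by applying Little's MCAR test.
- Missing data mechanisms: Classifying the missingness as missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). Evidence against MCAR can be found in the observed data, but distinguishing MAR from MNAR generally requires domain knowledge, because it depends on the unobserved values themselves.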

By investigating the patterns and mechanisms of missing data, researchers can make informed decisions on how to handle missing data appropriately and minimize potential biases in the analysis.
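
A simple first check is to build a missingness indicator and see whether it relates to other columns. The data below is invented so that `income` is more often missing for older customers; the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'income' tends to be missing for older customers
df = pd.DataFrame({
    'age':    [22, 25, 31, 35, 48, 52, 60, 67],
    'income': [30, 35, 42, 50, np.nan, 61, np.nan, np.nan],
})

# Fraction of missing values per column
print(df.isna().mean())

# Indicator column: 1 where income is missing, 0 otherwise
df['income_missing'] = df['income'].isna().astype(int)

# Compare an observed variable between rows with and without missing income
group_means = df.groupby('income_missing')['age'].mean()
print(group_means)

# Correlation between the missingness indicator and the observed variable
corr = df['income_missing'].corr(df['age'])
print(corr)
```

A clear difference between the two groups suggests the data is not missing completely at random. A formal test on the same groups would then quantify whether the difference is larger than chance would explain.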

#### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

#### Ans:
Strategies to evaluate the performance of machine learning models on an imbalanced dataset with a low percentage of occurrences include:

- Precision, recall, and F1-score: These metrics focus on the performance of the minority class and provide insights into the model's ability to correctly identify the occurrences.
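- Area under the Receiver Operating Characteristic curve (ROC-AUC): This metric assesses the overall performance of the model by considering the trade-off between true positive rate (sensitivity) and false positive rate. On heavily imbalanced data it can look optimistic, so the area under the precision-recall curve is often reported alongside it.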
- Confusion matrix: Analyzing the confusion matrix can provide detailed information about the model's performance, including true positives, true negatives, false positives, and false negatives.
- Stratified evaluation: Using stratified train/test splits or stratified k-fold cross-validation so that every fold preserves the original class ratio and contains enough minority examples to evaluate on.

By considering these strategies, researchers can obtain a comprehensive understanding of the model's performance and make appropriate adjustments to improve classification accuracy and predictive power.
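
To make the metrics concrete, the sketch below computes them by hand from the confusion-matrix counts. The labels and predictions are hypothetical: 10 of 100 patients have the condition, and the imagined model finds 6 of them while raising 4 false alarms.

```python
# Hypothetical labels: 1 = has the condition (rare), 0 = does not
y_true = [0] * 90 + [1] * 10
# Hypothetical predictions: 4 false alarms among the negatives, 6 of 10 positives found
y_pred = [0] * 86 + [1] * 4 + [1] * 6 + [0] * 4

# Confusion-matrix counts
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted positives, how many are real
recall = tp / (tp + fn)      # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts "no condition"
baseline_accuracy = y_true.count(0) / len(y_true)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} (always-negative baseline={baseline_accuracy:.2f})")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The accuracy is barely above the always-negative baseline, while recall shows that 4 of 10 patients with the condition were missed. In scikit-learn, `confusion_matrix`, `classification_report`, `roc_auc_score`, and `average_precision_score` in `sklearn.metrics` compute these quantities directly.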

#### Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

#### Ans:
To balance a dataset and down-sample the majority class when dealing with an unbalanced dataset, a common method is random undersampling. This involves randomly selecting a subset of instances from the majority class (here, the satisfied customers) to match the number of instances in the minority class. The goal is to reduce the dominance of the majority class and achieve a more balanced representation of the classes in the dataset. Because undersampling discards data, it should be applied to the training set only, leaving the test set at its original class distribution.

#### Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

#### Ans:
To balance a dataset and up-sample the minority class when dealing with a low percentage of occurrences, the simplest method is random oversampling, which duplicates randomly chosen minority instances. A common alternative is synthetic oversampling, and one popular technique is SMOTE (Synthetic Minority Over-sampling Technique). SMOTE generates synthetic samples by interpolating between existing minority class samples and their nearest neighbors. By creating synthetic instances, SMOTE increases the number of instances in the minority class, thereby improving its representation and addressing the class imbalance.