## Feature Engineering 1

#### 1. What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

Ans:  
Missing values in a dataset refer to the absence of data for one or more variables for some records. This can happen for various reasons, such as data collection errors, non-responses in surveys, or issues in data entry.  
  
Why is it Essential to Handle Missing Values?  
1. Accuracy: Missing values can lead to biased or incorrect results in statistical analysis and machine learning models. If not handled properly, they can skew the interpretation of data and model performance.  
  
2. Completeness: Many algorithms require a complete dataset to function correctly. Missing data can prevent the application of these algorithms or reduce the quality of their predictions.
  
3. Bias: Ignoring missing values or not addressing them appropriately can introduce bias into the model. For example, if missing data is not random and is related to certain patterns in the dataset, it can lead to inaccurate inferences.
  
4. Efficiency: Handling missing values helps in improving the efficiency and effectiveness of data analysis and model training, leading to more reliable and robust results.
  
Algorithms That Can Tolerate Missing Values  
No algorithm is entirely unaffected by missing data, and support depends on the implementation, but some cope with it better than others:  
1. Decision Trees: Some implementations can split nodes based on the available data, and some (for example CART) allow surrogate splits that route records with missing values through a backup rule.
  
2. Random Forests: As ensembles of decision trees, random forests inherit the strategy of the underlying tree implementation, such as surrogate splits or treating missing values as a separate category.
  
3. k-Nearest Neighbors (k-NN): Standard k-NN cannot compute a distance when values are missing, but the distance metric can be adapted, for example by considering only the dimensions where both the query point and the neighbor have values.
  
4. Naive Bayes: Because features are treated as conditionally independent, a missing feature can simply be left out of the likelihood calculation, or treated as a separate category in some implementations.
  
5. Some implementations of Neural Networks: Standard networks require complete numeric input. Handling missing data usually relies on explicit mechanisms such as masking or missing-value indicator inputs. Dropout is a regularization technique and does not handle missing data by itself.
  
6. Robust Regression Models: These are designed to limit the influence of outliers rather than to handle missing values, so they generally still require complete or imputed inputs.
  
Choosing the right method to handle missing values depends on the nature of the dataset and the specific requirements of the analysis or modeling task.
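
As a rough illustration of implementation-dependent support, the sketch below fits a tree-based model and a linear model on the same data containing `NaN`. It assumes scikit-learn is installed; the synthetic data, the 20% missing rate, and the choice of `HistGradientBoostingClassifier` (a gradient-boosted tree model, used here as one example of native `NaN` support) are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic features and a simple binary target (illustrative only).
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Knock out roughly 20% of the entries to simulate missing data.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.2] = np.nan

# The histogram-based gradient boosting model learns, at each split,
# which branch samples with missing values should follow.
tree_model = HistGradientBoostingClassifier(max_iter=50, random_state=0)
tree_model.fit(X_missing, y)
tree_accepts_nan = True

# Logistic regression has no such mechanism and rejects NaN input.
try:
    LogisticRegression().fit(X_missing, y)
    linear_accepts_nan = True
except ValueError:
    linear_accepts_nan = False

print(tree_accepts_nan, linear_accepts_nan)
```

For models like the second one, the missing values have to be dropped or imputed before training.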



#### 2. List down techniques used to handle missing data. Give an example of each with python code.

Ans:  
Handling missing data is a critical aspect of data preprocessing in data science and machine learning. There are several techniques used to handle missing data, each with its own advantages and use cases.
Some common techniques for handling missing data are:
1. Dropping rows with missing values.
2. Dropping columns that have many missing values.
3. Data imputation:  
   a. mean  
   b. median (more robust when there are many outliers)  
   c. mode (for categorical features)  
4. Using machine learning algorithms to impute missing values.

In [1]:
import seaborn as sns

In [2]:
df = sns.load_dataset('titanic')

In [3]:
df.head()

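|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |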


In [5]:
df.shape

(891, 15)

In [6]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [16]:
## Dropping rows which contain missing values

df_drop_row = df.dropna()

In [17]:
print(df_drop_row.shape)
print(df_drop_row.isnull().sum())

(182, 15)
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64


In [18]:
## Dropping columns that contain missing values

df_drop_col = df.dropna(axis = 1)

In [19]:
print(df_drop_col.shape)
print(df_drop_col.isnull().sum())

(891, 11)
survived      0
pclass        0
sex           0
sibsp         0
parch         0
fare          0
class         0
who           0
adult_male    0
alive         0
alone         0
dtype: int64


In [20]:
## imputation 
##mean

df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [22]:
## imputing mean for the missing values in age column
df["age_mean"] = df["age"].fillna(df["age"].mean())

In [24]:
## imputing median for the missing values in the age column
df["age_median"] = df["age"].fillna(df["age"].median())

In [25]:
df[["age_mean","age_median","age"]]

|   | age_mean | age_median | age |
|---|---|---|---|
| 0 | 22.000000 | 22.0 | 22.0 |
| 1 | 38.000000 | 38.0 | 38.0 |
| 2 | 26.000000 | 26.0 | 26.0 |
| 3 | 35.000000 | 35.0 | 35.0 |
| 4 | 35.000000 | 35.0 | 35.0 |
| ... | ... | ... | ... |
| 886 | 27.000000 | 27.0 | 27.0 |
| 887 | 19.000000 | 19.0 | 19.0 |
| 888 | 29.699118 | 28.0 | NaN |
| 889 | 26.000000 | 26.0 | 26.0 |


In [28]:
df[df["embarked"].isnull()]

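|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | age_mean | age_median |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 1 | 1 | female | 38.0 | 0 | 0 | 80.0 | NaN | First | woman | False | B | NaN | yes | True | 38.0 | 38.0 |
| 829 | 1 | 1 | female | 62.0 | 0 | 0 | 80.0 | NaN | First | woman | False | B | NaN | yes | True | 62.0 | 62.0 |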


In [29]:
df["embarked"].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [35]:
## mode() skips NaN by default, so no explicit filtering is needed
embarked_mode = df["embarked"].mode()[0]

In [36]:
df["embarked_mode"] = df["embarked"].fillna(embarked_mode)

In [37]:
df[df["embarked"].isnull()][["embarked","embarked_mode"]]

|   | embarked | embarked_mode |
|---|---|---|
| 61 | NaN | S |
| 829 | NaN | S |
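
The fourth technique in the list, imputing with a machine learning algorithm, can be sketched with scikit-learn's `KNNImputer`, which fills each missing value from the rows that are closest on the observed features. This is a minimal sketch, assuming scikit-learn is installed. The small `toy` frame and its values are hypothetical stand-ins that only borrow the `age` and `fare` column names, so the example runs without downloading the titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: one missing age, all fares observed.
toy = pd.DataFrame({
    "age":  [22.0, 38.0, np.nan, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
})

# Each missing value is replaced by the average of that feature
# over the 2 nearest rows, measured on the features that are present.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputed)
```

Unlike mean or median imputation, the filled value depends on the other features of the row, which can preserve relationships between columns at the cost of extra computation.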


#### 3. Explain imbalanced data. What will happen if imbalanced data is not handled?

Ans:  
Imbalanced Data refers to a situation in a classification problem where the classes are not represented equally in the dataset. This means that one class (often the target or outcome class) is significantly more frequent than the other(s). For example, in a medical dataset for detecting a rare disease, there might be many more instances of "healthy" patients compared to "diseased" patients.  
  
Consequences of Not Handling Imbalanced Data  
If imbalanced data is not handled appropriately, it can lead to several problems:  
  
1. Poor Model Performance: Traditional classifiers may perform poorly on the minority class because they tend to be biased towards the majority class. For example, in a binary classification problem with a 95% majority class and a 5% minority class, a model that always predicts the majority class would achieve 95% accuracy. However, this model would fail to identify any instances of the minority class, which could be crucial.  

2. Misleading Accuracy Metrics: Accuracy can be a misleading metric for imbalanced datasets, because a high score can be achieved simply by predicting the majority class most of the time. In the 95%/5% example above, the 95% accuracy says nothing about how well the minority class is detected.  

3. Inadequate Predictions: The model might not learn to distinguish between the classes effectively, leading to poor recall or precision for the minority class. This can be particularly problematic in applications like fraud detection, where missing out on fraudulent transactions (minority class) could be costly.  

4. Overfitting to Majority Class: The model might overfit to the majority class due to its prevalence, failing to generalize well to the minority class.

#### 4. What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

Ans:  
Up-sampling and Down-sampling are techniques used to address class imbalance in datasets, where one class (the minority class) has significantly fewer instances than the other (the majority class).  
  
**Up-sampling:**  
Up-sampling involves increasing the number of instances in the minority class to match or exceed the number of instances in the majority class. This can be achieved by either duplicating existing instances or generating synthetic instances.  
**When Up-sampling is Required:**  
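Example Scenario: In a credit card fraud detection system, fraud cases (minority class) are much less frequent than non-fraud cases (majority class). A model trained on this imbalanced dataset may perform poorly at identifying fraud because it is biased towards the majority class. Up-sampling the fraud cases, by duplication or synthetic generation, gives the model more minority examples to learn from, and it is the usual choice when the dataset is too small to afford discarding majority-class data.  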

**Down-sampling:**   
Down-sampling involves reducing the number of instances in the majority class to match or approach the number of instances in the minority class. This technique helps in balancing the dataset but can lead to loss of information as some instances from the majority class are removed.  
**When Down-sampling is Required:**  
Example Scenario: In a fraud detection system where non-fraud cases dominate the dataset, the model may become biased towards non-fraud cases. Down-sampling the majority class can help balance the dataset, making it easier for the model to learn about the minority class without being overwhelmed by the majority class. It is most appropriate when the dataset is large enough that discarding majority-class instances still leaves ample training data.
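
Both techniques can be sketched with plain pandas sampling. The snippet below is a minimal illustration, assuming a hypothetical fraud dataset with 90 non-fraud and 10 fraud rows; the column names `amount` and `is_fraud` are made up for the example.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 10 fraud rows, 90 non-fraud rows.
df = pd.DataFrame({
    "amount": range(100),
    "is_fraud": [1] * 10 + [0] * 90,
})
minority = df[df["is_fraud"] == 1]
majority = df[df["is_fraud"] == 0]

# Up-sampling: draw minority rows with replacement until the classes match.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
df_up = pd.concat([majority, minority_up])

# Down-sampling: keep only a random subset of the majority rows.
majority_down = majority.sample(n=len(minority), random_state=42)
df_down = pd.concat([majority_down, minority])

print(df_up["is_fraud"].value_counts())
print(df_down["is_fraud"].value_counts())
```

The up-sampled frame keeps every original row but repeats fraud cases, while the down-sampled frame is balanced at the cost of discarding 80 non-fraud rows.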

#### 5. What is Data Augmentation? Explain SMOTE.

Ans:  
*Data Augmentation* refers to techniques used to increase the diversity and amount of data available for training machine learning models without actually collecting new data. It involves creating new training examples by transforming the existing data, which can help improve model performance, particularly in cases of limited data or imbalanced datasets.  

**SMOTE (Synthetic Minority Over-sampling Technique):**  
SMOTE is a widely used technique for addressing class imbalance in datasets, especially in classification problems. It works by generating synthetic instances for the minority class rather than simply duplicating existing instances. This helps the model to learn better and generalize more effectively.  
  
How SMOTE Works:  
  
1. Identify Neighbors: For each instance in the minority class, identify its k-nearest neighbors using a distance metric (e.g., Euclidean distance).  
2. Generate Synthetic Samples: Create synthetic samples by interpolating between the minority class instance and its neighbors. Specifically, for each instance, new instances are generated at random positions along the line segments joining it to its neighbors.  
  
Benefits of SMOTE:  
1. Improves Class Distribution: By generating new examples, it balances the class distribution.  
2. Reduces Overfitting: Because it creates new examples rather than exact duplicates, it is less prone to overfitting than simple duplication.
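
The two steps described above can be sketched in a few lines of NumPy. This is a simplified illustration of the interpolation idea, not the reference SMOTE implementation; the function name `smote_sketch`, its parameters, and the four sample points are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points from the minority samples X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Step 1: find the k nearest minority neighbors of every minority sample.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Step 2: interpolate between a sample and one of its neighbors.
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))      # pick a minority sample
        nn = rng.choice(neighbors[base])     # pick one of its neighbors
        lam = rng.random()                   # position along the segment
        synthetic[i] = X_min[base] + lam * (X_min[nn] - X_min[base])
    return synthetic

X_min = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
new_points = smote_sketch(X_min, n_new=6)
print(new_points)
```

Every synthetic point lies on a segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.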


#### 6. What are outliers in a dataset? Why is it essential to handle outliers?

Ans:   
*Outliers* are data points that differ significantly from other observations in a dataset. They can be unusually high or low compared to the majority of the data. Outliers can arise due to variability in the data, errors in data collection, or they might indicate significant phenomena or anomalies.  
  
Characteristics of Outliers:  
1. Extreme Values: They lie far away from the central tendency (mean or median) of the data.  
2. Influential Points: They can affect the statistical properties of the dataset, such as mean, variance, and correlation.  
  
Why is it Essential to Handle Outliers?  
1. Impact on Statistical Measures:  
Mean and Variance: Outliers can skew the mean and inflate the variance, leading to misleading statistical summaries.  
Correlation: They can distort the relationships between variables, affecting correlation and regression analysis.  

2. Model Performance:  
Model Assumptions: Many statistical models assume that data is normally distributed. Outliers can violate these assumptions and lead to incorrect conclusions.  
Predictive Accuracy: Outliers can negatively impact the performance of machine learning models, as they may lead to overfitting or underfitting. For example, in regression tasks, outliers can disproportionately affect the fit of the model.  

3. Data Quality:  
Errors and Noise: Outliers may indicate errors or inconsistencies in data collection or entry. Addressing these issues can improve the overall data quality.  
Robustness: Handling outliers ensures that the analysis or model is robust and reliable.  

4. Anomaly Detection:  
Special Cases: In some contexts, outliers may represent important anomalies or events, such as fraud in financial transactions or rare diseases in medical data. Identifying these can be critical for specific applications. 
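
The effect on statistical measures is easy to see numerically. The sketch below uses a small hypothetical sample (values in thousands) and adds a single extreme value to it.

```python
import numpy as np

# Hypothetical sample without outliers, plus one extreme value.
values = np.array([42, 45, 47, 50, 52, 55, 58], dtype=float)
with_outlier = np.append(values, 1000.0)

mean_before, mean_after = values.mean(), with_outlier.mean()
median_before, median_after = np.median(values), np.median(with_outlier)
std_before, std_after = values.std(), with_outlier.std()

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
print(f"std:    {std_before:.2f} -> {std_after:.2f}")
```

The mean and standard deviation change drastically while the median barely moves, which is why the median is often preferred as a summary, or as an imputation value, when outliers are present.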

#### 7. You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

Ans:  
Handling missing data is a crucial step in data preprocessing, as it can significantly affect the quality and outcomes of your analysis. There are several techniques you can use to address missing data, depending on the nature of the data and the extent of the missing values. Here are some common techniques:  
  
1. Remove Missing Data:  
Technique: Remove rows or columns that contain missing values.  
Remove Rows: If only a small proportion of rows contain missing values, it might be acceptable to drop these rows.  
Remove Columns: If a large proportion of a column is missing, consider removing the column.

2. Fill (Impute) Missing Data:  
Technique: Replace missing values with a specific value, such as the mean, median, mode, or a constant.  
    1. Mean/Median/Mode Imputation: Mean and median suit numerical data; mode suits categorical data.  
    2. Constant Imputation: Replacing missing values with a fixed value.
  
3. Forward Fill / Backward Fill:  
Technique: Use the previous or next value in the column to fill missing data, which is most meaningful for ordered data such as time series.  
    1. Forward Fill (ffill): Fill missing values with the previous value in the column.  
    2. Backward Fill (bfill): Fill missing values with the next value in the column.

4. Predictive Modeling:  
Technique: Use machine learning models to predict missing values based on other features.  
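
Constant, forward, and backward fill can be compared side by side with pandas. This is a minimal sketch on a hypothetical ordered series; the name `monthly_spend` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered customer series with gaps.
s = pd.Series([10.0, np.nan, np.nan, 40.0, np.nan], name="monthly_spend")

filled = pd.DataFrame({
    "original": s,
    "constant": s.fillna(0.0),  # replace gaps with a fixed value
    "ffill": s.ffill(),         # carry the previous value forward
    "bfill": s.bfill(),         # pull the next value backward
})
print(filled)
```

Note that backward fill leaves the last gap unfilled because no later value exists, and forward fill would do the same for a gap at the start of the series.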

#### 8. You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

Ans:  
Visualization: One strategy is to visualize the data using plots and graphs to see if there are any patterns or trends in the missing data. For example, a heatmap can be used to show which values are missing in the dataset.  
  
Summary statistics: Another strategy is to split the records into those with and those without the missing value, then compare summary statistics of the other variables across the two groups. Significant differences between the groups suggest that the data is not missing completely at random.  

Imputation: Imputation can give a rough additional hint. If values imputed from other features are similar to the observed values, this is consistent with the data missing at random. If the imputed values differ systematically from the observed ones, this may suggest that there is a pattern to the missing data. This check is only suggestive, since the true missing values are unknown.  
  
Statistical tests: Statistical tests can also be used to determine if the missing data is missing at random. For example, a chi-square test can be used to test whether a missing-value indicator is independent of other categorical variables in the dataset.  
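
The chi-square check can be sketched as follows, assuming SciPy is installed. The data is synthetic and deliberately constructed so that `income` is missing more often for one customer segment; the column names and missing rates are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "segment": rng.choice(["basic", "premium"], size=n),
    "income": rng.normal(50_000, 10_000, size=n),
})

# By construction, income is missing far more often for "basic" customers.
p_missing = np.where(df["segment"] == "basic", 0.30, 0.05)
df.loc[rng.random(n) < p_missing, "income"] = np.nan

# Cross-tabulate the missingness indicator against the other variable.
df["income_missing"] = df["income"].isna()
table = pd.crosstab(df["segment"], df["income_missing"])

chi2, p_value, dof, expected = chi2_contingency(table.to_numpy())
print(table)
print(f"chi2={chi2:.1f}, p-value={p_value:.3g}")
```

A small p-value indicates that missingness depends on `segment`, so the values are not missing completely at random. A large p-value does not prove randomness; it only means this particular variable shows no detectable association.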

#### 9. Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

Ans:  
Evaluating machine learning models on imbalanced datasets, especially in scenarios like medical diagnosis where the condition of interest is rare, requires special attention to ensure that the model's performance is assessed accurately. Here are some strategies to evaluate performance effectively:  
  
**Resampling:**  
Because such datasets are imbalanced by nature, resampling techniques such as oversampling the minority class or undersampling the majority class can be used to balance the training data. Resampling should be applied to the training set only, with the model evaluated on held-out data that keeps the original class distribution, using metrics such as precision, recall, F1-score, and ROC-AUC.  
  
**Appropriate Metrics:**  
Accuracy is not a reliable metric for imbalanced datasets, as a model can achieve high accuracy by simply predicting the majority class. Instead, focus on metrics that give a better picture of model performance across both classes:  
  
1. Precision: The proportion of true positive predictions out of all positive predictions made by the model. It is crucial for evaluating how well the model avoids false positives.  
2. Recall (Sensitivity or True Positive Rate): The proportion of actual positive cases correctly identified by the model. It is crucial for evaluating how well the model captures the positive cases.  
3. F1 Score: The harmonic mean of precision and recall. It balances precision and recall, especially useful when the class distribution is imbalanced.
4. ROC-AUC (Receiver Operating Characteristic - Area Under Curve): Measures the model’s ability to distinguish between classes. AUC ranges from 0 to 1, where 0.5 corresponds to random guessing and higher values indicate better performance.
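
To see why accuracy alone misleads, the sketch below scores a trivial "model" that always predicts the majority class. It assumes scikit-learn is installed; the 95/5 class split is a hypothetical example.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical test set: 95 healthy patients (0), 5 with the condition (1).
y_true = np.array([0] * 95 + [1] * 5)

# A trivial baseline that always predicts "healthy".
y_pred = np.zeros(100, dtype=int)
y_score = np.zeros(100)  # constant scores carry no ranking information

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
auc = roc_auc_score(y_true, y_score)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} roc_auc={auc:.2f}")
```

The baseline reaches high accuracy yet has zero recall and an AUC no better than chance, which is exactly the failure that precision, recall, F1, and ROC-AUC are meant to expose.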

#### 10. When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

Ans:  
When dealing with an unbalanced dataset where the majority of customers report being satisfied, it’s important to use methods to balance the dataset to improve the performance and fairness of your machine learning models. Down-sampling the majority class is one such method. Here’s a detailed look at techniques you can use to balance your dataset by down-sampling the majority class:  
  
1. Random undersampling: This method involves randomly selecting a subset of observations from the majority class to match the size of the minority class. This can be done with RandomUnderSampler from the imblearn library in Python.  
  
2. Tomek links: This method identifies pairs of observations that are each other's nearest neighbors and belong to different classes. The majority-class observation in each pair is then removed, which cleans up the class boundary and modestly reduces the majority class. This can be done with TomekLinks from the imblearn library in Python.  
  
3. Synthetic minority oversampling technique (SMOTE): This method generates synthetic observations for the minority class rather than down-sampling, and it can be combined with an undersampling step to balance the dataset. For example, SMOTETomek from the imblearn library in Python combines SMOTE with Tomek link removal.
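
The Tomek link idea can be illustrated without imblearn. The function below is a hand-rolled NumPy sketch, not imblearn's `TomekLinks` implementation; its name, the one-dimensional sample data, and the labels are hypothetical.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority samples that form a Tomek link."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Nearest neighbor of every sample (excluding itself).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)

    to_drop = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i  # i and j are each other's nearest neighbor
        if mutual and y[i] != y[j] and y[i] == majority_label:
            to_drop.append(i)
    return to_drop

# Hypothetical data: label 0 = satisfied (majority), 1 = unsatisfied.
X = np.array([[0.0], [1.0], [2.0], [5.0], [5.4], [9.0]])
y = np.array([0, 0, 0, 0, 1, 1])

drop = tomek_link_majority_indices(X, y, majority_label=0)
X_clean = np.delete(X, drop, axis=0)
y_clean = np.delete(y, drop)
print(drop, y_clean)
```

Only the majority sample sitting right next to a minority sample is removed, so this method cleans the class boundary rather than fully balancing the classes. It is often paired with random undersampling or SMOTE for that reason.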

#### 11. You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

Ans:  

When dealing with an imbalanced dataset where the occurrence of a rare event (minority class) is very low, it's important to use techniques that can up-sample the minority class to create a more balanced dataset. This can help improve the performance and fairness of your machine learning models. Here are some methods and techniques you can use to up-sample the minority class:

1. Synthetic Minority Over-sampling Technique (SMOTE)
Technique: SMOTE generates synthetic samples for the minority class by interpolating between existing minority class examples. This helps to create a larger and more diverse set of examples for the minority class.

2. Adaptive Synthetic Sampling (ADASYN)
Technique: ADASYN builds on SMOTE but focuses on generating more synthetic examples near the decision boundary. It creates more synthetic data where the minority class is difficult to classify.

3. Borderline-SMOTE
Technique: Borderline-SMOTE is a variant of SMOTE that only generates synthetic samples near the boundary between the majority and minority classes, focusing on areas where the model is likely to make errors.

4. Random Over-Sampling
Technique: Randomly duplicate existing examples from the minority class to increase its size. This method is simple but may lead to overfitting due to duplication.

5. SMOTE-NC (SMOTE for Nominal and Continuous Features)
Technique: SMOTE-NC is an extension of SMOTE that can handle both numerical and categorical features. It's useful if your dataset contains a mix of feature types.  