## Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

**ANSWER**
Missing values in a dataset occur when no data value is stored for a variable in an observation. These can arise due to incomplete data collection, errors during data entry, or other reasons.

Handling missing values is essential because:
- They can introduce bias in analysis.
- They may distort model performance by making the dataset unrepresentative.
- Certain algorithms cannot handle missing values and may break or provide inaccurate results.

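**Algorithms that can tolerate missing values (depending on the implementation):**
- Decision Trees (for example, CART-style trees that use surrogate splits)
- Random Forests (when the underlying tree implementation supports missing values)
- Gradient Boosting (XGBoost and LightGBM handle missing values natively)

K-Nearest Neighbors is *not* robust to missing values: distances cannot be computed on incomplete rows, so the data must be imputed first.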

## Q2: List down techniques used to handle missing data. Give an example of each with Python code.

**ANSWER**
Common techniques to handle missing data:
1. **Removing missing values**  
   ```python
   import pandas as pd
   df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, None, 8]})
   df_cleaned = df.dropna()  # Drop rows with missing values
   ```
   
2. **Mean/Median/Mode imputation**
   ```python
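   # Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
   df['A'] = df['A'].fillna(df['A'].mean())  # Replace NaN in column A with the column mean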
   ```

3. **Forward/Backward Fill**
   ```python
   df = df.ffill()  # Forward fill: carry the last valid value down (fillna(method=...) is deprecated)
   ```

4. **Imputation using algorithms (e.g., KNN Imputer)**
   ```python
   from sklearn.impute import KNNImputer
   imputer = KNNImputer(n_neighbors=2)
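   # fit_transform returns a NumPy array, so rebuild the DataFrame with the original labels
   df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)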
   ```

## Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

**ANSWER**
Imbalanced data occurs when one class significantly outnumbers the other(s), which is common in scenarios like fraud detection or disease diagnosis.

If not handled, machine learning models tend to become biased toward the majority class, resulting in:
- High accuracy for the majority class, but poor performance for the minority class.
- Misleading performance metrics that fail to capture the model’s ability to predict the minority class.
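
The effect described above can be reproduced with a tiny sketch. The 95/5 class split and the always-predict-majority "model" are assumptions chosen for illustration, not data from a real project.

```python
# Hypothetical labels: 95 negatives (majority) and 5 positives (minority).
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of actual positives that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
# accuracy=0.95, minority recall=0.00
```

The accuracy looks strong even though not a single minority case is detected.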


## Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

**ANSWER**
- **Up-sampling:** Increases the number of instances in the minority class by duplicating or generating synthetic samples.
- **Down-sampling:** Reduces the number of instances in the majority class by randomly removing samples.

**Example:**
In a dataset where 90% of data belongs to class A and 10% to class B:
- **Up-sampling:** Add more instances to class B (using SMOTE or random sampling).
- **Down-sampling:** Remove some instances from class A.

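Both are used when the dataset is imbalanced and the class distribution needs to be balanced for better model performance. Up-sampling is usually preferred when the dataset is small and discarding rows would lose information; down-sampling suits large datasets where the majority class is plentiful and training cost matters.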
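
The 90/10 example above can be sketched with random resampling from the standard library. The tuples standing in for rows and the fixed seed are illustrative assumptions; a real project would resample DataFrame rows instead.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical dataset: 90 samples of class "A" and 10 of class "B".
majority = [("A", i) for i in range(90)]
minority = [("B", i) for i in range(10)]

# Up-sampling: keep every minority sample and add random duplicates
# (drawn with replacement) until the class sizes match.
extra = random.choices(minority, k=len(majority) - len(minority))
upsampled = majority + minority + extra

# Down-sampling: keep a random subset of the majority class (drawn without
# replacement) the same size as the minority class; the rest is discarded.
majority_downsampled = random.sample(majority, k=len(minority))
downsampled = majority_downsampled + minority

print(len(upsampled), len(downsampled))  # 180 20
```

Resampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.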


## Q5: What is data Augmentation? Explain SMOTE.

**ANSWER**
**Data augmentation** is the process of artificially increasing the size of a dataset by creating modified versions of existing data. It is commonly used in image processing and other fields where data is scarce.

**SMOTE (Synthetic Minority Over-sampling Technique):**  
A technique that generates synthetic samples for the minority class by interpolating between a minority sample and one of its k nearest minority-class neighbors. It helps balance datasets where the minority class is underrepresented.
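
The interpolation idea can be illustrated with a minimal, plain-Python sketch. The function name `smote_sketch`, its parameters, and the sample points are hypothetical; this is a simplified illustration of the mechanism, not the reference SMOTE implementation (libraries such as imbalanced-learn provide a tested one).

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points, each lying on the segment between a
    minority point and one of its k nearest minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbors of `base` among the other minority points.
        others = minority[:i] + minority[i + 1:]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical 2-D minority samples.
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
new_points = smote_sketch(minority_points, n_new=3)
print(new_points)
```

Because each new point is a convex combination of two existing minority points, the synthetic samples stay inside the region the minority class already occupies.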

## Q6: What are outliers in a dataset? Why is it essential to handle outliers?

**ANSWER**
Outliers are data points that differ significantly from other observations in the dataset. They can be caused by variability in the data, measurement errors, or other factors.

Handling outliers is essential because:
- They can skew and distort statistical measures.
- They may negatively affect the performance of certain models, such as linear regression.
- Outliers may indicate rare but important events (e.g., fraud), so they should be investigated before being removed or capped.
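
As a small illustration of how a single outlier distorts a statistic, the sketch below compares the mean and median and then flags outliers with the common 1.5 × IQR rule. The values are made up, and the IQR rule is one conventional choice among several.

```python
import statistics

# Hypothetical order values; 300 is the suspicious observation.
values = [10, 12, 11, 13, 12, 11, 300]

print(statistics.mean(values))    # pulled far upward by the outlier
print(statistics.median(values))  # 12, barely affected

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]
print(outliers)  # [300]
```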

## Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

**ANSWER**
Techniques to handle missing data in this scenario:
- **Removing rows/columns** with too many missing values.
- **Imputing missing values** using the mean, median, or mode.
- **Using predictive models** like KNN or decision trees to impute missing values.
- **Forward/Backward filling** based on time-series continuity.

## Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

**ANSWER**
Strategies to identify patterns in missing data:
- **Visualize missing data** using heatmaps or bar plots to observe trends.
- **Use statistical tests** such as Little's MCAR test to check if data is missing completely at random (MCAR).
- **Correlation analysis** between missingness and other variables to identify if certain variables are associated with missing data.
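
One simple way to look for a relationship between missingness and another variable is to compare missing rates across groups. The records, the `age_group` and `income` field names, and the values below are hypothetical.

```python
# Hypothetical customer records; None marks a missing income.
records = [
    {"age_group": "18-30", "income": None},
    {"age_group": "18-30", "income": None},
    {"age_group": "18-30", "income": 32000},
    {"age_group": "31-50", "income": 54000},
    {"age_group": "31-50", "income": 61000},
    {"age_group": "31-50", "income": None},
    {"age_group": "51+", "income": 47000},
    {"age_group": "51+", "income": 52000},
    {"age_group": "51+", "income": 58000},
]

# Count rows and missing incomes per age group.
counts = {}
for row in records:
    total, missing = counts.get(row["age_group"], (0, 0))
    counts[row["age_group"]] = (total + 1, missing + (row["income"] is None))

# Missing rate of `income` within each group.
missing_rate = {group: missing / total for group, (total, missing) in counts.items()}
print(missing_rate)
```

Rates that differ strongly between groups suggest the missingness depends on an observed variable, which argues against MCAR. On a sample this small the difference proves nothing, so a formal test is still needed.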


## Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

**ANSWER**
Strategies for evaluating models on imbalanced datasets:
- **Use precision, recall, and F1-score** rather than accuracy.
- **Use the Area Under the ROC Curve (AUC-ROC)** to evaluate the model's ability to discriminate between classes. When the positive class is very rare, the precision-recall curve is often more informative.
- **Confusion matrix** analysis to understand true positives, false positives, and false negatives.
- **Use stratified k-fold cross-validation** so that every fold preserves the original class proportions and the minority class appears in both training and test splits.
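
The metrics listed above follow directly from the confusion matrix. The sketch below computes them by hand for made-up labels (1 = has the condition), to show how accuracy can look acceptable while precision and recall tell a more cautious story.

```python
# Hypothetical ground truth and predictions: 4 patients with the condition, 16 without.
y_true = [1] * 4 + [0] * 16
y_pred = [1, 1, 1, 0] + [1, 1] + [0] * 14

# Confusion matrix cells.
pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of the predicted positives, how many are real
recall = tp / (tp + fn)     # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.85 precision=0.60 recall=0.75 f1=0.67
```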


## Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

**ANSWER**
Methods to down-sample the majority class:
- **Random under-sampling:** Randomly remove instances from the majority class.
- **Cluster-based sampling:** Group majority samples into clusters and then select representative samples to reduce bias.
- **Edited Nearest Neighbor (ENN):** Removes majority-class samples whose label disagrees with the majority of their k nearest neighbors, which cleans up noisy points near the class boundary.
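
Random under-sampling is a one-line random draw, so the sketch below focuses on the ENN idea instead. The function name, the 2-D points, and the `"satisfied"`/`"unsatisfied"` labels are hypothetical, and this is a simplified illustration rather than a library implementation.

```python
import math
from collections import Counter

def edited_nearest_neighbours(points, labels, majority_label, k=3):
    """Drop majority-class samples whose k nearest neighbors mostly carry
    a different label; all other samples are kept."""
    kept_points, kept_labels = [], []
    for i, (point, label) in enumerate(zip(points, labels)):
        if label == majority_label:
            # Indices of the k nearest other samples.
            neighbors = sorted(
                (j for j in range(len(points)) if j != i),
                key=lambda j: math.dist(point, points[j]),
            )[:k]
            vote = Counter(labels[j] for j in neighbors).most_common(1)[0][0]
            if vote != label:
                continue  # neighbors disagree: treat as noisy and drop it
        kept_points.append(point)
        kept_labels.append(label)
    return kept_points, kept_labels

# Hypothetical data: a cluster of satisfied customers, a cluster of
# unsatisfied ones, and one satisfied point sitting inside the second cluster.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
          (5.2, 5.2), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["satisfied"] * 6 + ["unsatisfied"] * 3

kept_points, kept_labels = edited_nearest_neighbours(points, labels, "satisfied")
print(len(points), "->", len(kept_points))  # 9 -> 8
```

Only the satisfied point at `(5.2, 5.2)` is removed, because all three of its nearest neighbors are unsatisfied. ENN therefore cleans the boundary rather than forcing the classes to equal size.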


## Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

**ANSWER**
Methods to up-sample the minority class:
- **Random over-sampling:** Randomly duplicate instances from the minority class.
- **SMOTE (Synthetic Minority Over-sampling Technique):** Generate synthetic samples based on the nearest neighbors.
- **ADASYN (Adaptive Synthetic Sampling):** Similar to SMOTE, but generates more synthetic samples around minority points that are harder to classify, i.e., those with many majority-class neighbors.
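
What distinguishes ADASYN from SMOTE is how it decides where to generate samples. The sketch below shows only that allocation step under simplifying assumptions: the function name and data are hypothetical, counts are rounded (so they may not always sum exactly to `n_new`), and returning zeros when no minority point has majority neighbors is a choice made for this example.

```python
import math

def adasyn_allocation(minority, majority, n_new, k=2):
    """Return how many synthetic samples to generate around each minority
    point, proportional to the share of majority points among its k nearest
    neighbors (a rough measure of how hard the point is to classify)."""
    everyone = [(p, "min") for p in minority] + [(p, "maj") for p in majority]
    ratios = []
    for i, point in enumerate(minority):
        # Minority points come first in `everyone`, so index i is the point itself.
        others = everyone[:i] + everyone[i + 1:]
        nearest = sorted(others, key=lambda item: math.dist(point, item[0]))[:k]
        ratios.append(sum(lab == "maj" for _, lab in nearest) / k)
    total = sum(ratios)
    if total == 0:
        return [0] * len(minority)  # assumption: nothing to allocate
    return [round(n_new * r / total) for r in ratios]

# Hypothetical data: three minority points far from the majority class,
# one surrounded by it, and one on the border.
minority = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (3.0, 3.0), (2.0, 3.0)]
majority = [(3.0, 3.2), (3.2, 3.0), (2.8, 3.0), (4.0, 4.0), (5.0, 5.0)]

allocation = adasyn_allocation(minority, majority, n_new=6)
print(allocation)  # [0, 0, 0, 4, 2]
```

The isolated minority cluster receives no synthetic samples, while the points bordering the majority class receive all of them. Each allocated sample would then be created by SMOTE-style interpolation.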
