# SECTION A – THEORETICAL QUESTIONS

Q1. What are the most common reasons for missing data in ETL pipelines?
```
The most common reasons for missing data in ETL pipelines are:

1. Data entry errors – Users skip optional fields or enter incomplete forms.
2. System integration issues – Records are lost or left unmatched when data from multiple sources is merged.
3. Schema mismatch – Source and destination databases have different structures, so unmapped fields load as empty.
4. Sensor or system failure – In IoT or automated systems, devices may fail to record data.
5. Data corruption during transformation – Errors in transformation logic can remove values.
6. Delayed data arrival – Some data may not be available at the time of extraction.
```
Q2. Why is blindly deleting rows with missing values considered a bad practice in ETL?


```
Blindly deleting rows with missing values is bad practice because:

* It may remove a large amount of useful data.
* It can introduce bias if the values are not missing at random.
* It reduces sample size, which can lower model accuracy.
* Missing data may contain important patterns (e.g., high-income customers not disclosing salary).

Therefore, missing data should be analyzed before removal.
```
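To make the data-loss point concrete, here is a minimal sketch using pandas on the eight-row customer table from Section B (only the columns that contain gaps are reproduced). The variable names are illustrative assumptions, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Customer table from Section B, reduced to the columns relevant here.
df = pd.DataFrame({
    "Customer_ID": [101, 102, 103, 104, 105, 106, 107, 108],
    "Monthly_Sales": [12000, np.nan, 15000, np.nan, 18000, np.nan, 14000, 16000],
    "Income": [65000, np.nan, 72000, np.nan, 58000, 61000, np.nan, 69000],
    "Region": ["West", "South", "South", "North", None, "West", "East", "North"],
})

# Profile missingness per column before deciding what to drop.
missing_per_column = df.isna().sum()

# Blind listwise deletion: drop every row that has at least one missing value.
complete_rows = df.dropna()
rows_lost = len(df) - len(complete_rows)

print(missing_per_column)
print(f"Rows lost: {rows_lost} of {len(df)} ({rows_lost / len(df):.1%})")
# Rows lost: 5 of 8 (62.5%)
```

Dropping every incomplete row keeps only customers 101, 103 and 108, which is why the gaps should be analyzed column by column first.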
Q3. Difference between Listwise Deletion and Column Deletion


```
1. Listwise Deletion
 > Removes entire rows where any value is missing.
 > Used when only a small share of rows has missing values (e.g., less than 5%).
 > Appropriate scenario: When analyzing survey data where only a few respondents skipped one question.

2. Column Deletion
 > Removes an entire column (feature) if it contains too many missing values.
 > Used when the column is mostly empty and not important.
 > Appropriate scenario: A “Middle Name” column missing for 90% of customers.
```
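The two deletion strategies can be sketched in pandas as follows. The survey-style data, the column names, and the 50% cut-off are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Middle_Name is mostly empty, Age has one skipped answer.
df = pd.DataFrame({
    "Respondent": [1, 2, 3, 4, 5],
    "Middle_Name": [None, None, None, "K.", None],
    "Age": [34, 29, np.nan, 41, 38],
})

# Column deletion: drop columns whose share of missing values exceeds a threshold.
threshold = 0.5  # assumed cut-off, tune per project
missing_share = df.isna().mean()
cols_to_drop = missing_share[missing_share > threshold].index
df_cols = df.drop(columns=cols_to_drop)

# Listwise deletion: drop the rows that still contain any missing value.
df_rows = df_cols.dropna()

print(df_cols.columns.tolist())
print(len(df_rows), "rows remain after listwise deletion")
```

Dropping the sparse column first means listwise deletion afterwards removes only the one respondent who skipped Age, instead of almost every row.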
Q4. Why is median imputation preferred over mean imputation for skewed data such as income?


```
Median is preferred because:

 > Income data is usually right-skewed (few very high values).
 > Mean is affected by extreme outliers.
 > Median represents the middle value and is more robust.

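Example:
If incomes are ₹20k, ₹25k, ₹30k, ₹35k and ₹5,00,000:
Mean = ₹6,10,000 / 5 = ₹1,22,000, pulled far above four of the five incomes by the single ₹5,00,000 value.
Median = ₹30,000 (the middle value), which still reflects a typical income.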
```
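The arithmetic for the five incomes in the example can be checked with Python's standard `statistics` module; this is only a quick verification sketch.

```python
from statistics import mean, median

# Incomes from the example above, in rupees.
incomes = [20_000, 25_000, 30_000, 35_000, 500_000]

mean_income = mean(incomes)      # dragged upward by the single outlier
median_income = median(incomes)  # middle value of the sorted list

print(f"Mean:   {mean_income:,.0f}")
print(f"Median: {median_income:,.0f}")
```

Filling a missing income with the mean would assign a value higher than four of the five observed incomes, while the median stays close to what most customers earn.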
Q5. What is forward fill and in what type of dataset is it most useful?


```
Forward fill is a method where each missing value is replaced with the most recent known value before it.

It is most useful in:

  > Time-series datasets
  > Stock prices
  > Sensor readings
  > Daily sales reports

Example:
If Monday = 100, Tuesday = missing → Tuesday becomes 100.
```
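The mechanism can be sketched in a few lines of plain Python. The helper name `forward_fill` and the Wednesday value are hypothetical; in pandas the same operation is available as `Series.ffill()`.

```python
def forward_fill(values):
    """Return a list where each None is replaced by the last non-None value seen."""
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        # A gap at the very start has nothing to copy from and stays None.
        filled.append(last_seen)
    return filled


# Monday and Tuesday come from the example above; Wednesday is made up.
daily_sales = {"Monday": 100, "Tuesday": None, "Wednesday": 120}
filled = dict(zip(daily_sales, forward_fill(daily_sales.values())))
print(filled)
```

Note that forward fill relies on the order of the records, which is why it fits time-ordered data best.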
Q6. Why should flagging missing values be done before imputation in an ETL workflow?



```
Flagging missing values before imputation:

 > Preserves the information that a value was originally missing; once imputed, filled values cannot be told apart from real ones.
 > Helps machine learning models detect patterns in missingness.
 > Prevents loss of important business signals.

Example:
Customers who don’t provide income might belong to a specific risk group.
```
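The ordering matters in practice, as this minimal pandas sketch shows. The five income values are illustrative, and median imputation is just one possible choice for the second step.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Income": [65000, np.nan, 72000, np.nan, 58000]})

# Step 1: flag first, while NaN still marks the originally missing rows.
df["Income_Missing_Flag"] = df["Income"].isna().astype(int)

# Step 2: impute afterwards (median used here as an example strategy).
df["Income"] = df["Income"].fillna(df["Income"].median())

# Reversing the steps would yield an all-zero flag,
# because no NaN would be left to detect after imputation.
print(df)
```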
Q7. Consider a scenario where income is missing for many customers.
 How can this missingness itself provide business insights?


```
Missing income itself can reveal patterns such as:

1. High-income individuals may avoid sharing salary.
2. Low-income customers may skip income due to hesitation.
3. Certain regions or customer segments may have higher missing rates.
4. It may indicate privacy concerns or trust issues.

Business Insight:
The company can:

 > Adjust marketing strategy.
 > Improve form design.
 > Identify high-risk or high-value customer segments.
```
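One way to look for such segment-level patterns is to compute the missing-income rate per group. The sketch below uses the Region and Income columns of the Section B dataset; with only eight customers the rates are purely illustrative and far too small a sample for real conclusions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Customer_ID": [101, 102, 103, 104, 105, 106, 107, 108],
    "Income": [65000, np.nan, 72000, np.nan, 58000, 61000, np.nan, 69000],
    "Region": ["West", "South", "South", "North", None, "West", "East", "North"],
})

# 1 = income missing, 0 = income present.
df["Income_Missing_Flag"] = df["Income"].isna().astype(int)

# Mean of a 0/1 flag per group = share of customers with missing income.
# Customer 105 is left out because groupby skips rows with a missing Region.
missing_rate_by_region = (
    df.groupby("Region")["Income_Missing_Flag"]
    .mean()
    .sort_values(ascending=False)
)
print(missing_rate_by_region)
```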
# SECTION B – PRACTICAL QUESTIONS
Q8. Listwise Deletion

Remove all rows where Region is missing.

Tasks:

> Identify affected rows

> Show the dataset after deletion

> Mention how many records were lost


```
Step 1: Identify affected rows
From the dataset:

Customer_ID 105 (Amit Verma) → Region = NaN

Step 2: Dataset After Deletion

| Customer_ID | Name        | City      | Monthly_Sales | Income | Region |
| ----------- | ----------- | --------- | ------------- | ------ | ------ |
| 101         | Rahul Mehta | Mumbai    | 12000         | 65000  | West   |
| 102         | Anjali Rao  | Bengaluru | NaN           | NaN    | South  |
| 103         | Suresh Iyer | Chennai   | 15000         | 72000  | South  |
| 104         | Neha Singh  | Delhi     | NaN           | NaN    | North  |
| 106         | Karan Shah  | Ahmedabad | NaN           | 61000  | West   |
| 107         | Pooja Das   | Kolkata   | 14000         | NaN    | East   |
| 108         | Riya Kapoor | Jaipur    | 16000         | 69000  | North  |

Step 3: Number of Records Lost
Original records = 8

Remaining records = 7

Records lost = 1
```
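The same three steps can be reproduced in pandas. This is a sketch that rebuilds the dataset by hand from the tables in this section; the variable names are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Customer_ID": [101, 102, 103, 104, 105, 106, 107, 108],
    "Name": ["Rahul Mehta", "Anjali Rao", "Suresh Iyer", "Neha Singh",
             "Amit Verma", "Karan Shah", "Pooja Das", "Riya Kapoor"],
    "City": ["Mumbai", "Bengaluru", "Chennai", "Delhi",
             "Pune", "Ahmedabad", "Kolkata", "Jaipur"],
    "Monthly_Sales": [12000, np.nan, 15000, np.nan, 18000, np.nan, 14000, 16000],
    "Income": [65000, np.nan, 72000, np.nan, 58000, 61000, np.nan, 69000],
    "Region": ["West", "South", "South", "North", None, "West", "East", "North"],
})

# Step 1: identify rows where Region is missing.
affected = df[df["Region"].isna()]

# Step 2: drop only those rows; gaps in other columns are left untouched.
df_clean = df.dropna(subset=["Region"])

# Step 3: count the records lost.
records_lost = len(df) - len(df_clean)

print(affected[["Customer_ID", "Name"]])
print(df_clean)
print("Records lost:", records_lost)
```

Passing `subset=["Region"]` is what restricts the deletion to this one column; a bare `dropna()` would also remove rows with missing Monthly_Sales or Income.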
Q9. Imputation

Handle missing values in Monthly_Sales using:

Forward Fill

Tasks:

Apply forward fill

Show before vs after values

Explain why forward fill is suitable here


```
| Customer_ID | Name        | Monthly_Sales (Before) |
| ----------- | ----------- | ---------------------- |
| 101         | Rahul Mehta | 12000                  |
| 102         | Anjali Rao  | NaN                    |
| 103         | Suresh Iyer | 15000                  |
| 104         | Neha Singh  | NaN                    |
| 105         | Amit Verma  | 18000                  |
| 106         | Karan Shah  | NaN                    |
| 107         | Pooja Das   | 14000                  |
| 108         | Riya Kapoor | 16000                  |

Replace each missing value with the previous available value.

| Customer_ID | Name        | Monthly_Sales (After) |
| ----------- | ----------- | --------------------- |
| 101         | Rahul Mehta | 12000                 |
| 102         | Anjali Rao  | 12000 ✅               |
| 103         | Suresh Iyer | 15000                 |
| 104         | Neha Singh  | 15000 ✅               |
| 105         | Amit Verma  | 18000                 |
| 106         | Karan Shah  | 18000 ✅               |
| 107         | Pooja Das   | 14000                 |
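| 108         | Riya Kapoor | 16000                 |

Why forward fill is suitable here:
Every gap (Customer_ID 102, 104, 106) is a single missing value directly preceded by a valid one, so each gap is filled and all 8 records are kept.
This assumes the row order is meaningful (for example, records entered in sequence); if the rows are unrelated customers, the previous customer's sales are only a rough stand-in.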
```
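In pandas, the before and after columns can be produced with `Series.ffill()`. A minimal sketch, with the column name `Monthly_Sales_After` chosen only for this illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Customer_ID": [101, 102, 103, 104, 105, 106, 107, 108],
    "Monthly_Sales": [12000, np.nan, 15000, np.nan, 18000, np.nan, 14000, 16000],
})

# Forward fill: each NaN takes the value of the nearest valid row above it.
df["Monthly_Sales_After"] = df["Monthly_Sales"].ffill()

print(df)
```

Keeping the original column next to the filled one makes the before vs after comparison explicit and leaves the raw values available for auditing.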
Q10. Flagging Missing Data

Create a flag column for missing Income.

Tasks:

Create Income_Missing_Flag (0 = present, 1 = missing)

Show updated dataset

Count how many customers have missing income


```
Step 1: Identify Missing Income Values
Missing Income for:

 Customer_ID 102 (Anjali Rao)

 Customer_ID 104 (Neha Singh)

 Customer_ID 107 (Pooja Das)

Step 2: Create New Column
Income_Missing_Flag

 0 = Income Present

 1 = Income Missing

| Customer_ID | Name        | City      | Monthly_Sales | Income | Region | Income_Missing_Flag |
| ----------- | ----------- | --------- | ------------- | ------ | ------ | ------------------- |
| 101         | Rahul Mehta | Mumbai    | 12000         | 65000  | West   | 0                   |
| 102         | Anjali Rao  | Bengaluru | NaN           | NaN    | South  | 1                   |
| 103         | Suresh Iyer | Chennai   | 15000         | 72000  | South  | 0                   |
| 104         | Neha Singh  | Delhi     | NaN           | NaN    | North  | 1                   |
| 105         | Amit Verma  | Pune      | 18000         | 58000  | NaN    | 0                   |
| 106         | Karan Shah  | Ahmedabad | NaN           | 61000  | West   | 0                   |
| 107         | Pooja Das   | Kolkata   | 14000         | NaN    | East   | 1                   |
| 108         | Riya Kapoor | Jaipur    | 16000         | 69000  | North  | 0                   |

Step 3: Count of Customers with Missing Income
Total missing income customers = 3

(Customer_ID 102, 104, 107)


```
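The flag column and the count can be derived in pandas as sketched below; `missing_count` and `missing_ids` are illustrative variable names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Customer_ID": [101, 102, 103, 104, 105, 106, 107, 108],
    "Income": [65000, np.nan, 72000, np.nan, 58000, 61000, np.nan, 69000],
})

# isna() gives True/False; casting to int yields 1 = missing, 0 = present.
df["Income_Missing_Flag"] = df["Income"].isna().astype(int)

# Summing the flag counts the customers with missing income.
missing_count = int(df["Income_Missing_Flag"].sum())
missing_ids = df.loc[df["Income_Missing_Flag"] == 1, "Customer_ID"].tolist()

print(df)
print("Customers with missing income:", missing_count, missing_ids)
```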















