## Handling Missing Data in ETL

**Q1. What are the most common reasons for missing data in ETL pipelines?**

Answer ▶

1. Source System Issues:
Problems often originate before the data even enters the pipeline.

- Human Error:

- Schema Changes:

- Corrupted Source Data:

2. Extraction (Ingestion) Failures:
Issues occurring while pulling data from the source.

- API Rate Limits:

- Network Latency/Timeouts:

- Incremental Load Logic:

3. Transformation & Logic Errors:
Data is often lost or filtered out during processing.

- Incorrect Joins:

- Data Type Mismatches:

- Aggressive Filtering:
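As one illustration of how a join can lose data during transformation, the sketch below compares an inner join with a left join in pandas. The `orders` and `customers` tables and their values are hypothetical, made up for this example only.

```python
import pandas as pd

# Hypothetical tables: order 3 references a customer that is absent from `customers`.
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 11, 99]})
customers = pd.DataFrame({"customer_id": [10, 11], "name": ["Ana", "Ben"]})

# An inner join silently drops the order with no matching customer.
inner = orders.merge(customers, on="customer_id", how="inner")

# A left join keeps every order; indicator=True adds a `_merge` column
# that marks which rows found no match on the right side.
left = orders.merge(customers, on="customer_id", how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"]

print(len(orders), len(inner), len(left))
print(unmatched[["order_id", "customer_id"]])
```

Comparing row counts before and after a join is a cheap way to notice this kind of loss early.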

4. Loading & Destination Constraints:
The final step where data is written to the data warehouse.

- Constraint Violations:

- Buffer/Capacity Issues:

**Q2. Why is blindly deleting rows with missing values considered a bad practice in ETL?**

Answer ▶

1. It Introduces Statistical Bias

Data is rarely missing purely by chance. In statistics, this is the difference between data being Missing Completely at Random (MCAR) and Missing Not at Random (MNAR).

The Situation: Consider a consumer survey in which individuals with higher incomes are less inclined to disclose their pay.

The Error: If you remove every row with a missing "Salary" field, you disproportionately eliminate high earners from your dataset.

The Result: Your final analysis is biased toward lower salaries, and the remaining data no longer represents the real-world population. Business decisions built on it inherit that bias.
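The effect can be sketched with a tiny synthetic example. The salary figures and the 100,000 cutoff are invented for illustration; the "true" values would of course be unknown in a real survey.

```python
import pandas as pd

# Hypothetical survey: assume respondents earning 100,000 or more declined to answer.
true_salary = pd.Series([30_000, 40_000, 50_000, 150_000, 200_000])
reported = true_salary.where(true_salary < 100_000)  # high salaries become NaN

true_mean = true_salary.mean()            # mean over the whole population
observed_mean = reported.dropna().mean()  # mean after deleting rows with missing salary

print(true_mean, observed_mean)
```

Under these assumptions the mean after deletion is less than half the true mean, because the deleted rows are exactly the high earners.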

2. Considerable Information Loss

A single row in an ETL pipeline may have hundreds or even thousands of columns. If only one non-critical column (such as "Middle Name") is missing, deleting the entire row also discards the valid data in every other column.

For example, a row may lack a "Zip Code" but still contain valuable information such as "Purchase History," "Age," and "Product Preference."

Deleting that row weakens downstream models by shrinking the sample available for analyzing those other, fully valid variables.
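A minimal pandas sketch of the difference, assuming a hypothetical customer table in which only `customer_id` and `age` are treated as critical:

```python
import pandas as pd

# Hypothetical data: customer 2 is missing only the non-critical zip_code.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 51, 27],
    "zip_code": ["94107", None, "10001"],
})

# Blind deletion: any missing value removes the whole row.
dropped_any = df.dropna()

# Targeted deletion: only missing values in critical columns remove a row.
kept = df.dropna(subset=["customer_id", "age"])

print(len(dropped_any), len(kept))
```

Restricting `dropna` with `subset` keeps customer 2's valid `age` available for downstream analysis.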

3. Reduced Statistical Power

Even if the data is missing completely at random, deleting rows reduces your sample size ($N$), which widens confidence intervals and makes real effects harder to detect.
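For reference, the link between sample size and precision can be written with the standard error of a sample mean. This is a textbook result rather than something stated above; $\sigma$ denotes the population standard deviation and $N$ the number of rows retained:

$$
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{N}}
$$

Dropping half the rows (from $N$ to $N/2$) multiplies the standard error by $\sqrt{2} \approx 1.41$, so estimates become about 41% noisier even when no bias is introduced.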

4. Nightmares in Operations and Auditing

Row counts are the main metric used for health checks in a production ETL setup.

Data Reconciliation: If your source system sends 10,000 rows but your warehouse loads only 9,500 because you silently dropped 500 rows with missing values, the mismatch triggers "data loss" alerts.
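One way to keep row counts reconcilable is to route incomplete rows to a separate output instead of discarding them. The sketch below is an assumption about how that might look in pandas; `split_incomplete` and the sample data are hypothetical names and values, not part of any library.

```python
import pandas as pd

def split_incomplete(df: pd.DataFrame, required: list[str]) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split rows into (complete, rejected) based on the required columns."""
    is_incomplete = df[required].isna().any(axis=1)
    return df[~is_incomplete], df[is_incomplete]

# Hypothetical source batch with two rows missing `amount`.
source = pd.DataFrame({"id": [1, 2, 3, 4], "amount": [10.0, None, 7.5, None]})
loaded, rejected = split_incomplete(source, required=["amount"])

# Reconciliation: every source row is accounted for in one of the two outputs.
assert len(loaded) + len(rejected) == len(source)
print(f"loaded={len(loaded)} rejected={len(rejected)} source={len(source)}")
```

Writing `rejected` to a quarantine table (or at least logging its count) lets an auditor explain the gap between source and warehouse row counts.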

**Q3. Explain the difference between:**

**Listwise deletion**

**Column deletion**

Answer ▶

1. Listwise Deletion (Row Deletion)

This technique, also referred to as "Complete Case Analysis," eliminates a whole record from the dataset if even one value is absent.

Reasoning: "If I don't know everything about this customer, I will ignore them entirely."

When to apply:

- The data is missing completely at random (MCAR).

- Very few rows (less than 5% of the entire dataset) are affected.

- The missing value is in the "target" variable (the one you are trying to predict), where imputation could add excessive noise.

Benefits: Easy to apply; preserves the distribution of the remaining variables (assuming the data is MCAR).

Cons: Can significantly reduce the sample size; may introduce bias if the data is not missing completely at random.
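A hedged sketch of listwise deletion with the 5% rule of thumb applied as a guard. The helper `listwise_delete`, its `max_fraction` parameter, and the sample data are hypothetical; the code cannot check the MCAR assumption, which has to be judged separately.

```python
import pandas as pd

def listwise_delete(df: pd.DataFrame, max_fraction: float = 0.05) -> pd.DataFrame:
    """Drop incomplete rows, but only if they are at most `max_fraction` of the data."""
    incomplete_fraction = df.isna().any(axis=1).mean()
    if incomplete_fraction > max_fraction:
        raise ValueError(
            f"{incomplete_fraction:.1%} of rows are incomplete; "
            "listwise deletion would discard too much data"
        )
    return df.dropna()

# Hypothetical dataset: 1 incomplete row out of 40 (2.5%).
df = pd.DataFrame({"x": list(range(40)), "y": [None] + [1.0] * 39})
cleaned = listwise_delete(df)
print(len(df), len(cleaned))
```

Raising an error instead of deleting silently forces a deliberate choice when the missing share is larger than expected.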

2. Column Deletion (Feature Dropping)

This technique removes a variable (column) from the dataset when a large share of its values are missing.

Reasoning: "This specific question (variable) was left blank by so many people that it is useless for analysis."


When to apply:

- More than roughly 60-70% of the column's values are missing.

- Neither the analysis nor the hypothesis depends on the variable.

- The variable is redundant because it correlates strongly with another, fully populated column.

Benefits: Keeps the maximum number of observations (rows); reduces dimensionality, which simplifies the model.

Cons: You lose whatever useful information the variable held in the rows where it was populated.
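Column deletion can be sketched as a threshold on each column's missing fraction. The helper `drop_sparse_columns`, the 0.6 default (taken from the lower end of the range above), and the sample columns are assumptions for illustration.

```python
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.6) -> pd.DataFrame:
    """Drop columns whose fraction of missing values exceeds `threshold`."""
    missing_fraction = df.isna().mean()  # per-column share of NaN values
    sparse = missing_fraction[missing_fraction > threshold].index
    return df.drop(columns=sparse)

# Hypothetical data: `age` is 20% missing, `fax_number` is 80% missing.
df = pd.DataFrame({
    "age": [25, 31, None, 47, 52],
    "fax_number": [None, None, None, None, "555-0100"],
})
trimmed = drop_sparse_columns(df)
print(list(trimmed.columns), len(trimmed))
```

All five rows survive; only the mostly empty column is removed.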

**Q4. Why is median imputation preferred over mean imputation for skewed data such as income?**

Answer ▶

1. The Outlier "Pull"

The arithmetic mean is the sum of all values divided by the number of values, so every extreme value shifts it. If your income distribution is right-skewed (positively skewed):

- A small number of very high incomes inflates the mean.

- The median (the middle value) stays where the bulk of the population sits.

2. A Specific Illustration

Consider the yearly salaries of a small team:

- Employees: $45,000, $50,000, $55,000, and $40,000.

- CEO: $2,000,000.

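Computing the central tendency:

Mean: $438,000 (a figure that represents no one on the team).

Median: $50,000 (an accurate picture of a "typical" employee).

Imputing a missing salary with the median therefore yields a plausible value, while the mean would assign $438,000 to an ordinary employee.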
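The arithmetic above can be checked in pandas. The sixth, missing salary in `with_gap` is a hypothetical addition used only to show what each imputation strategy would fill in.

```python
import pandas as pd

# The five salaries from the example above.
salaries = pd.Series([45_000, 50_000, 55_000, 40_000, 2_000_000])
mean_salary = salaries.mean()
median_salary = salaries.median()

# Hypothetical sixth employee whose salary is missing.
with_gap = pd.Series([45_000, 50_000, 55_000, 40_000, 2_000_000, None])
filled_median = with_gap.fillna(with_gap.median())  # NaN is skipped when computing the median
filled_mean = with_gap.fillna(with_gap.mean())      # NaN is skipped when computing the mean

print(mean_salary, median_salary)
print(filled_median.iloc[-1], filled_mean.iloc[-1])
```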

**Q5. What is forward fill and in what type of dataset is it most useful?**

Answer ▶

Forward fill, also known as "Last Observation Carried Forward" (LOCF) in statistics or `ffill` in Python's pandas library, is an imputation technique that copies the most recent valid observation into each following missing entry until the next valid observation appears.

Forward fill is most useful for time series or longitudinal datasets, where rows are ordered sequentially and a value can reasonably be assumed to persist until it is next updated.
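A short pandas sketch, assuming a hypothetical daily temperature series with gaps:

```python
import pandas as pd

# Hypothetical daily sensor readings; None marks days with no reading.
readings = pd.Series(
    [20.5, None, None, 21.0, None],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
    name="temperature_c",
)

# Carry the last valid reading forward into every following gap.
filled = readings.ffill()

# `limit` caps how many consecutive gaps are filled, so stale values do not spread far.
filled_limited = readings.ffill(limit=1)

print(filled.tolist())
print(filled_limited.tolist())
```

The data must be sorted by time before calling `ffill`, and a gap at the very start of the series stays empty because there is no earlier observation to carry forward.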

**Q6. Why should flagging missing values be done before imputation in an ETL workflow?**

Answer ▶

1. "Missingness" is Often a Predictor (Information Signal)

The fact that data is missing can be just as important as the data itself.

2. Differentiating "Real" vs. "Synthetic" Data

Imputed values are estimates, not facts.

3. Monitoring Data Quality & Drift

You need to know if your data sources are degrading over time.

4. Reversibility and Experimentation

Imputation strategies often change.
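The flag-then-impute order can be sketched as follows. The column names and values are hypothetical, and the median is used only as one possible imputation strategy.

```python
import pandas as pd

# Hypothetical customer incomes with two missing values.
df = pd.DataFrame({"income": [52_000.0, None, 61_000.0, None, 48_000.0]})

# 1. Flag first, while the NaN values are still visible.
df["income_was_missing"] = df["income"].isna()

# 2. Then impute into a new column, leaving the raw column untouched.
df["income_imputed"] = df["income"].fillna(df["income"].median())

# The flag doubles as a data-quality metric that can be tracked per load.
missing_rate = df["income_was_missing"].mean()
print(df)
print(f"missing rate: {missing_rate:.0%}")
```

Because the raw column and the flag are preserved, a different imputation strategy can be applied later without re-extracting the source data.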

**Q7. Consider a scenario where income is missing for many customers. How can this missingness itself provide business insights?**

Answer ▶

1. Risk & Creditworthiness Signal

In financial services, customers who refuse to disclose income often represent a distinct risk profile.

2. Price Sensitivity & Discount Targeting

People who guard their financial data often behave differently in pricing discussions.

3. User Experience (UX) Friction Points

If a specific demographic consistently drops off at the "Income" field in your sign-up form, it reveals a flaw in your onboarding process.

4. Segmentation for Marketing Strategy

"Missing Income" is effectively a customer segment.