**Q1. Data Understanding**

**Identify all data quality issues present in the dataset that can cause problems during data loading**

Answer ▶

1. Duplicate Records

2. Missing Values (Nulls)

3. Data Type Inconsistency

4. Inconsistent Date Formatting

**Q2. Primary Key Validation**

**Assume Order_ID is the Primary Key.**

**a) Is the dataset violating the Primary Key rule?**

**b) Which record(s) cause this violation?**

Answer ▶

a) The Primary Key rule is being broken by the dataset.

Two primary requirements must be met by a primary key:

- Uniqueness: The Primary Key column cannot contain the same value in two rows.

- Non-Null: There cannot be any null or empty values in the column.

b) The violation is caused by Order_ID: O101.

Row 1: Order_ID is O101.

Row 4: Order_ID is O101.

If Order_ID were rigidly enforced as a Primary Key, the database would reject this data during loading since O101 appears twice in the Order_ID column. In order to fix this, the duplicate entry must be eliminated before the data can be put into a structured table.
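The rejection described above can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the table name `orders`, the single-column layout, and the IDs in rows 2 and 3 are assumptions; only the repeated key O101 in rows 1 and 4 comes from the dataset.

```python
import sqlite3

# In-memory database; table name and layout are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (Order_ID TEXT PRIMARY KEY)")

# Order_ID values in row order. O101 appears in row 1 and again in row 4;
# the IDs used for rows 2 and 3 are assumed.
order_ids = ["O101", "O102", "O103", "O101"]

rejected = []
for row_number, order_id in enumerate(order_ids, start=1):
    try:
        conn.execute("INSERT INTO orders (Order_ID) VALUES (?)", (order_id,))
    except sqlite3.IntegrityError:
        # Raised when the PRIMARY KEY uniqueness constraint is violated.
        rejected.append((row_number, order_id))

print(rejected)  # [(4, 'O101')]
```

The first O101 is accepted and the second is refused, which is why the duplicate has to be removed before loading rather than after.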

**Q3. Missing Value Analysis**

**Which column(s) contain missing values?**

**a) List the affected records**

**b) Explain why loading these records without handling missing values is risky**

Answer ▶

a) The Sales_Amount column contains a missing value.

- Affected record: Order_ID O102, which has the value "Null" in the Sales_Amount column.

b) Risks of Loading Without Handling Missing Values

Loading records with missing values, like the "Null" found in O102, poses several risks to data integrity and system stability:

- ETL and Loading Failures: The entire data loading procedure will fail and terminate if the Sales_Amount column in the destination database table is set up with a NOT NULL constraint.

- Inaccurate Financial Reporting: The system will overlook the "Null" entry when executing computations such as SUM(Sales_Amount), producing an underestimated total revenue that does not accurately reflect actual business activity.

- Skewed Data Analytics: Statistical metrics such as averages and medians become unreliable. SQL's AVG() skips NULLs, but if the row for O102 is still counted as a transaction (for example, by dividing the total by COUNT(*)) or the missing value is loaded as 0, the average sales value will come out lower than the true figure.
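The reporting risks above can be seen in a small sketch using `sqlite3`. The table layout and the amount for O103 are hypothetical; O102 is stored as a real NULL to show how aggregates treat it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (Order_ID TEXT, Sales_Amount REAL)")

# O102 has no amount (a real NULL). The O103 amount is a hypothetical value.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("O101", 4500.0), ("O102", None), ("O103", 1500.0)],
)

total, avg_non_null, row_count, value_count = conn.execute(
    "SELECT SUM(Sales_Amount), AVG(Sales_Amount), "
    "COUNT(*), COUNT(Sales_Amount) FROM orders"
).fetchone()

# SUM and AVG skip the NULL; dividing by COUNT(*) treats O102 as a zero sale.
avg_per_row = total / row_count

print(total)         # 6000.0 (O102 contributes nothing)
print(avg_non_null)  # 3000.0 (average over the 2 non-NULL rows)
print(avg_per_row)   # 2000.0 (average over all 3 rows)
```

The two averages differ only because of the missing value, so the same dashboard could show either number depending on how the measure is defined.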

**Q4. Data Type Validation**

**Identify records where Sales_Amount violates expected data type rules.**

**a) Which record(s) will fail numeric validation?**

**b) What would happen if this dataset is loaded into a SQL table with Sales_Amount as DECIMAL ?**

Answer ▶

a)

Order_ID O102: The literal string "Null" is present in this record. Although "NULL" is a legal state in many systems, it will fail validation if it is being read as a string or if the loader expects only numeric characters.

Order_ID O104: "Three Thousand" appears in this record. Numerical data type requirements are obviously broken because this is a string of alphabetical letters rather than a numeric format (such as 3000).

b)

- Conversion Error or Type Mismatch: The SQL engine will try to cast every incoming value to a decimal. Text cannot be converted into a decimal, so when the engine reaches record O104 ("Three Thousand"), it will throw a Data Conversion Error or Invalid Numeric Literal error.

- Failure to Load: Depending on the SQL settings or ETL tool, the failure may affect more than the offending row. In strict mode, the entire batch load fails and rolls back, so no data is stored in the table.

- Handling of "Null": The loader may be able to correctly convert the string "Null" to a database NULL for record O102 if the target column is defined as DECIMAL and permits nulls (NULL). Nevertheless, this record will also cause a load failure if the column is NOT NULL.
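A minimal sketch of the numeric validation described above, using Python's `decimal` module. The helper name `parse_sales_amount`, the rule that the literal token "Null" means a missing value, and the raw form "4500" for O101 are assumptions made for illustration.

```python
from decimal import Decimal, InvalidOperation

def parse_sales_amount(raw):
    """Return a Decimal, None for a missing value, or raise ValueError."""
    text = raw.strip()
    # Assumed convention: an empty cell or the token "Null" means missing.
    if text == "" or text.lower() == "null":
        return None
    try:
        value = Decimal(text)
    except InvalidOperation:
        raise ValueError(f"not numeric: {raw!r}") from None
    if not value.is_finite():
        # Decimal accepts "NaN" and "Infinity"; a DECIMAL column does not.
        raise ValueError(f"not a finite number: {raw!r}")
    return value

results = {}
for order_id, raw in [("O101", "4500"), ("O102", "Null"), ("O104", "Three Thousand")]:
    try:
        results[order_id] = parse_sales_amount(raw)
    except ValueError as exc:
        results[order_id] = f"error: {exc}"

print(results)
```

O101 converts cleanly, O102 becomes `None` (loadable only if the column permits NULL), and O104 produces an error, matching the behaviour described for a DECIMAL column.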

**Q5. Date Format Consistency**

**The Order_Date column has multiple formats.**

**a) List all date formats present in the dataset.**

**b) Why is this a problem during data loading?**

Answer ▶

a) Date Formats Present in the Dataset
There are two distinct date formats used in the Order_Date column:

DD-MM-YYYY: (Day-Month-Year)

Example: 12-01-2024

YYYY/MM/DD: (Year/Month/Day)

Example: 2024/01/18 (used for Order_ID O103).

b) Why This Is a Problem During Data Loading

- Parsing Errors: To convert text into a DATE data type, most loading tools (such as Python's pandas or SQL Server Integration Services) expect a single, predetermined format. If the loader encounters a format it does not expect, such as a switch from dashes to slashes, it will probably raise an "Invalid Date Format" error and fail the load.

- Problems with Sorting and Filtering: If the dates are loaded as strings (text) to avoid conversion errors, they will not sort chronologically. String comparison works character by character, so 12-01-2024 and 2024/01/18 are ordered by their leading characters rather than by calendar date, which makes monthly reports and date-range filters unreliable.

- Downstream Calculation Errors: Many business operations rely on date arithmetic. These computations will produce errors or null results if the value is not correctly recognized as a date.
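The sorting problem can be illustrated with a short sketch. The helper `parse_order_date` and the third sample date (20-01-2024) are hypothetical; the two format patterns are the ones observed in Order_Date.

```python
from datetime import datetime

# The two formats observed in the Order_Date column.
KNOWN_FORMATS = ("%d-%m-%Y", "%Y/%m/%d")

def parse_order_date(text):
    """Try each known format in turn; raise ValueError if none matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")

# 20-01-2024 is a hypothetical extra value added to make the effect visible.
raw_dates = ["2024/01/18", "12-01-2024", "20-01-2024"]

as_text = sorted(raw_dates)                         # character-by-character order
as_dates = sorted(raw_dates, key=parse_order_date)  # true chronological order

print(as_text)   # ['12-01-2024', '20-01-2024', '2024/01/18']
print(as_dates)  # ['12-01-2024', '2024/01/18', '20-01-2024']
```

Sorted as text, 18 January lands after 20 January; once each value is parsed into a real date, the order is correct.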

**Q6. Load Readiness Decision**

**Based on the dataset condition:**

**a) Should this dataset be loaded directly into the database? (Yes/No)**

**b) Justify your answer with at least three reasons**

Answer ▶

a) No

b)

- Primary Key Violations: Order_ID O101 has a duplicate entry in the dataset. Most database management systems (DBMS) would reject the file or give an error as soon as they found the duplicate since a Primary Key needs to be unique for each record.

- Data Type Inconsistencies: Non-numeric values such as "Null" and "Three Thousand" are present in the Sales_Amount column. These text strings will result in a fatal conversion error during loading if the database expects a DECIMAL or INTEGER type.

- Lack of Standardization: "Silent" data problems could result from inconsistent date formats, such as alternating between DD-MM-YYYY and YYYY/MM/DD. Because the system is unable to properly parse or sort the mixed formats, analytical queries or time-based reports would be erroneous even after the load is completed.

- Incomplete Data: O102 has a "Null" value for Sales_Amount, a crucial business metric, so financial analysis (such as total revenue or average order value) would be biased or incomplete from the start.

**Q7. Pre-Load Validation Checklist**

**List the exact pre-load validation checks you would perform on this dataset before loading**

Answer ▶

1. Validation of Uniqueness

- Primary Key Check: Confirm that there are only unique values in the Order_ID column.

- Duplicate Row Check: Look for duplicate entries in the dataset (such as the O101 duplicate) and either flag them for review or remove them.

2. Validation of Data Type and Schema

- Numeric Consistency Check: Confirm that every entry in the Sales_Amount column is entirely numeric.

- String-to-Numeric Conversion: Identify text-based numbers (like "Three Thousand") that need to be converted to a numeric format, either manually or automatically.

- Schema Alignment: Verify that the length and precision of the values match the column definitions in the destination database.

3. Completeness and Null Verification

- Required Field Check: Look for any empty or "Null" strings in the NOT NULL fields.

- Business Logic Check: Verify that important business fields, such as Sales_Amount, are equal to or larger than zero.

4. Standardization of Formats

- Date Format Unification: Before trying to load, make sure all dates adhere to a single standard, such as ISO 8601 (YYYY-MM-DD), using a regex or date-parser check.

- Delimiter/Separator Check: Verify that the date column uses consistent separators (dashes vs. slashes).

5. Referential Integrity (Suggested but optional)

- Validation of Foreign Keys: In a real database, verify that each Customer_ID (for example, C001) exists in the master Customers table, so that no orphaned records are loaded.
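Checks 1 to 4 above could be combined into a single pre-load report, sketched below. The function name `validate`, the row layout, and most sample values are hypothetical; only the O101 duplicate, the "Null" in O102, the 2024/01/18 date on O103, and "Three Thousand" in O104 come from the dataset. The referential integrity check is left out because it needs a Customers table.

```python
import re
from collections import Counter

NUMERIC = re.compile(r"\d+(\.\d+)?")         # unsigned number, so negatives fail too
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")  # shape check only, not calendar validity

def validate(rows):
    """Return (row_number, column, problem) tuples for every failed check."""
    issues = []
    id_counts = Counter(row["Order_ID"] for row in rows)
    for number, row in enumerate(rows, start=1):
        if id_counts[row["Order_ID"]] > 1:
            issues.append((number, "Order_ID", "duplicate key"))
        amount = row["Sales_Amount"].strip()
        if amount == "" or amount.lower() == "null":
            issues.append((number, "Sales_Amount", "missing value"))
        elif not NUMERIC.fullmatch(amount):
            issues.append((number, "Sales_Amount", "not numeric or negative"))
        if not ISO_DATE.fullmatch(row["Order_Date"].strip()):
            issues.append((number, "Order_Date", "not YYYY-MM-DD"))
    return issues

# Illustrative rows; values not mentioned in the answers above are assumed.
rows = [
    {"Order_ID": "O101", "Order_Date": "12-01-2024", "Sales_Amount": "4500"},
    {"Order_ID": "O102", "Order_Date": "2024-01-15", "Sales_Amount": "Null"},
    {"Order_ID": "O103", "Order_Date": "2024/01/18", "Sales_Amount": "1500"},
    {"Order_ID": "O101", "Order_Date": "12-01-2024", "Sales_Amount": "4500"},
    {"Order_ID": "O104", "Order_Date": "2024-01-20", "Sales_Amount": "Three Thousand"},
]

issues = validate(rows)
for issue in issues:
    print(issue)
```

An empty result would mean the file passes these checks; any entry in the list is a reason to hold the load.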

**Q8. Cleaning Strategy**

**Describe the step-by-step cleaning actions required to make this dataset load-ready**

Answer ▶

Step 1: Handle Duplicates

- Identify: Locate rows where all values are identical, specifically the two entries for Order_ID O101.

- Action: Remove the redundant row to satisfy the Primary Key Uniqueness constraint.

- SQL Tip: You can use DISTINCT or ROW_NUMBER() to isolate unique records.

Step 2: Standardize the Order_Date Column

- Identify: Find records that deviate from the primary format, specifically O103 (2024/01/18).

- Action: Convert all date strings into a single, consistent format (preferably ISO 8601: YYYY-MM-DD).

- Result: All dates should look like 2024-01-12, 2024-01-15, etc., to ensure they are sortable and parsable.

Step 3: Resolve Data Type Mismatches

- Identify: Locate non-numeric text in the Sales_Amount column, specifically O104 ("Three Thousand").

- Action: Manually or programmatically convert "Three Thousand" to the numeric value 3000.

- Verification: Ensure every entry in the column is a valid numeric value.

Step 4: Address Missing (Null) Values

- Identify: Locate the "Null" entry in record O102.

- Action: Decide on a treatment strategy:

  - Imputation: Replace "Null" with the mean or median of the other sales (if appropriate for your analysis).

  - Default Value: Replace "Null" with a default such as 0.00 if business rules allow.

  - Removal: Delete the row if the missing data makes the record useless for reporting.

Step 5: Final Schema Validation

- Action: Cast the cleaned columns to their final data types (e.g., Order_ID as VARCHAR, Sales_Amount as DECIMAL, and Order_Date as DATE).

- Check: Run a final scan to ensure no "Null" values remain in columns that require data.
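Steps 1 to 4 might be sketched as a single cleaning pass, as below. The names `clean`, `to_iso_date`, and `WORD_AMOUNTS` are hypothetical, and most sample values are assumed. The sketch turns the "Null" string into a real `None` and leaves the choice among imputation, default value, and removal to the caller, since that is a business decision.

```python
from datetime import datetime
from decimal import Decimal

# Manual corrections for amounts written as words (assumed mapping).
WORD_AMOUNTS = {"three thousand": "3000"}
DATE_FORMATS = ("%d-%m-%Y", "%Y/%m/%d")

def to_iso_date(text):
    """Step 2: convert either known format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")

def clean(rows):
    cleaned, seen = [], set()
    for row in rows:
        # Step 1: keep only the first row for each Order_ID.
        if row["Order_ID"] in seen:
            continue
        seen.add(row["Order_ID"])
        # Step 3: apply manual corrections, then convert to Decimal.
        raw = row["Sales_Amount"].strip()
        raw = WORD_AMOUNTS.get(raw.lower(), raw)
        # Step 4: represent a missing amount as None (a real NULL).
        amount = None if raw.lower() in ("", "null") else Decimal(raw)
        cleaned.append({
            "Order_ID": row["Order_ID"],
            "Order_Date": to_iso_date(row["Order_Date"]),
            "Sales_Amount": amount,
        })
    return cleaned

# Illustrative rows; values not mentioned in the answers above are assumed.
raw_rows = [
    {"Order_ID": "O101", "Order_Date": "12-01-2024", "Sales_Amount": "4500"},
    {"Order_ID": "O102", "Order_Date": "15-01-2024", "Sales_Amount": "Null"},
    {"Order_ID": "O103", "Order_Date": "2024/01/18", "Sales_Amount": "1500"},
    {"Order_ID": "O101", "Order_Date": "12-01-2024", "Sales_Amount": "4500"},
    {"Order_ID": "O104", "Order_Date": "20-01-2024", "Sales_Amount": "Three Thousand"},
]

cleaned = clean(raw_rows)
for row in cleaned:
    print(row)
```

Deduplicating on Order_ID alone is safe here because the two O101 rows are identical; if they differed, the conflict would need review rather than silent removal.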

**Q9. Loading Strategy Selection**

**Assume this dataset represents daily sales data.**

**a) Should a Full Load or Incremental Load be used?**

**b) Justify your choice.**

Answer ▶

a) For this specific dataset in its current "dirty" state, you should perform an Incremental Load, but only after a thorough cleaning process in a staging area.

b)
- Efficiency and Performance: Daily sales datasets grow continuously. A Full Load, which re-uploads the complete history every day, becomes slower and uses more bandwidth than necessary as the company grows. An Incremental Load processes only the new records (the delta) from that day.

- Avoiding Data Loss: You run the risk of unintentionally wiping or locking your complete current sales history during the overwrite if you choose to use a Full Load and the procedure fails because of the problems we discovered (such as the "Three Thousand" string or duplicate IDs). A failure only impacts the current batch since incremental loading isolates the new data.

- Transaction Integrity: Sales data is largely "append-only": it rarely changes after an order is placed. Incremental loading fits this pattern by appending new rows without touching previously approved and audited historical records.

- Auditability: With an incremental approach, you can determine exactly when each daily batch was loaded. This makes it easier to trace data quality problems to a particular day, such as the issues found in this dataset.
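One simple way to implement an incremental load is to insert only the rows whose key is not yet in the target table, one transaction per batch. The sketch below uses `sqlite3`; the table name `sales`, the function `incremental_load`, and the sample batches are assumptions, and the batches are presumed to be already cleaned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "Order_ID TEXT PRIMARY KEY, Order_Date TEXT, Sales_Amount REAL)"
)

def incremental_load(conn, batch):
    """Insert only rows whose Order_ID is not already in the target table."""
    existing = {row[0] for row in conn.execute("SELECT Order_ID FROM sales")}
    delta = [row for row in batch if row[0] not in existing]
    # One transaction per batch: commit on success, roll back on any error,
    # so a bad batch never touches previously loaded history.
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", delta)
    return len(delta)

# Hypothetical cleaned batches; day 2 re-sends O101 along with a new order.
day1 = [("O101", "2024-01-12", 4500.0)]
day2 = [("O101", "2024-01-12", 4500.0), ("O103", "2024-01-18", 1500.0)]

loaded_day1 = incremental_load(conn, day1)
loaded_day2 = incremental_load(conn, day2)
total_rows = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

print(loaded_day1, loaded_day2, total_rows)  # 1 1 2
```

Comparing keys is only one option; a date or timestamp watermark is a common alternative when the source can be filtered by load date.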

**Q10. BI Impact Scenario**

**Assume this dataset was loaded without cleaning and connected to a BI dashboard.**

**a) What incorrect results might appear in Total Sales KPI?**

**b) Which records specifically would cause misleading insights?**

**c) Why would BI tools not detect these issues automatically?**

Answer ▶

a) Incorrect Results in Total Sales KPI

- Inflated Revenue: The Total Sales figure would be artificially high because the duplicated record for O101 ($4,500) would be counted twice.

- Understated Revenue: Because record O104 ("Three Thousand") is text-based, a BI tool trying to sum a numeric column might treat it as a 0 or exclude it entirely, missing $3,000 in actual sales.

- Inaccurate Average Order Value (AOV): Since the tool might count the number of rows (6 rows) but only sum the numeric values, your average sales calculation would be mathematically skewed.
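The three effects above can be put into numbers with a small sketch. Only the O101 amount (4,500, duplicated) and the O104 text value (worth 3,000) come from the dataset; the O103 amount, the five-row sample, and the lenient summing rule in `naive_total` are assumptions about how a tool might behave.

```python
# Rows as they would sit in the table after an uncleaned load.
loaded = [
    ("O101", 4500),
    ("O102", None),              # missing amount
    ("O103", 1500),              # hypothetical amount
    ("O101", 4500),              # duplicate of row 1
    ("O104", "Three Thousand"),  # text instead of 3000
]

def naive_total(rows):
    """Sum like a lenient tool: missing or non-numeric cells add nothing."""
    return sum(value for _, value in rows if isinstance(value, (int, float)))

# What the figures should be once the duplicate is removed and O104 is fixed.
# O102 is excluded because its true amount is unknown.
corrected = [("O101", 4500), ("O103", 1500), ("O104", 3000)]

dashboard_total = naive_total(loaded)
true_total = naive_total(corrected)
dashboard_aov = dashboard_total / len(loaded)
true_aov = true_total / len(corrected)

print(dashboard_total, true_total)  # 10500 9000
print(dashboard_aov, true_aov)      # 2100.0 3000.0
```

In this sample the duplicate adds 4,500 while the text value hides 3,000, so the errors partly cancel in the total and still distort the average, which makes the problem harder to spot.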

b)
- O101 (Duplicate)
- O102 (Null)
- O103 (Format)
- O104 (Text)

