This project automates cleaning and validating an Excel dataset using Python and pandas. It was designed for a real-world sales dataset and performs column-wise validation, including ID format checks, categorical standardization, numeric conversion, date parsing, and derived column verification.
- Input File: `sales.xlsx` (expected to be placed in the `data/` directory)
- Output File: `cleaned_sales.xlsx` (or with a timestamp)
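
As a rough illustration of the file handling described above, the output name could be built as in the sketch below. The helper `build_output_path`, the timestamp format, and writing the output into `data/` are assumptions made for this example; the README only names the two files and the input directory.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

# Assumed layout: the README places the input in data/; using the same
# directory for the output is a guess made for this sketch.
DATA_DIR = Path("data")
INPUT_FILE = DATA_DIR / "sales.xlsx"


def build_output_path(timestamped: bool = False,
                      now: Optional[datetime] = None) -> Path:
    """Return the output path, optionally with a timestamp suffix.

    Hypothetical helper; the timestamp format is an assumption.
    """
    if not timestamped:
        return DATA_DIR / "cleaned_sales.xlsx"
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return DATA_DIR / f"cleaned_sales_{stamp}.xlsx"
```

With pandas, the input would then typically be loaded with `pd.read_excel(INPUT_FILE)` and the result saved with `df.to_excel(build_output_path(timestamped=True), index=False)`.
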
- Drop Duplicates from the dataset
- Validate IDs (e.g., must match `TXN_1234567` format)
- Categorical Text Cleaning (standardize case, validate against allowed values)
- Numeric Column Validation (convert invalid entries to `NaN`)
- Datetime Parsing (invalid dates converted to `NaT`)
- Derived Column Check: Validates if `Total Spent = Quantity × Price Per Unit` and fixes incorrect values
- Column-wise Summary Logs printed for transparency
- Well-documented and modular code using functions and docstrings
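
The cleaning steps listed above could be sketched with pandas roughly as follows. The function name `clean_sales`, the column names (`Transaction ID`, `Category`, `Quantity`, `Transaction Date`), the `ID Valid` flag column, the allowed category values, and the reading of the ID format as `TXN_` followed by seven digits are all assumptions for illustration; the project's actual code may differ.

```python
import pandas as pd

# Assumptions for illustration: column names, allowed values, and the
# interpretation of "TXN_1234567 format" as "TXN_" plus seven digits.
ID_PATTERN = r"TXN_\d{7}"
ALLOWED_CATEGORIES = {"Food", "Furniture"}


def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of df; invalid entries become NaN/NaT."""
    out = df.drop_duplicates().copy()

    # IDs: flag rows whose ID does not match the whole pattern.
    ids = out["Transaction ID"].astype(str)
    out["ID Valid"] = ids.str.fullmatch(ID_PATTERN).astype(bool)

    # Categorical text: trim, standardize case, blank out unknown values.
    category = out["Category"].str.strip().str.title()
    out["Category"] = category.where(category.isin(ALLOWED_CATEGORIES))

    # Numeric column: anything that cannot be parsed becomes NaN.
    out["Quantity"] = pd.to_numeric(out["Quantity"], errors="coerce")

    # Dates: anything that cannot be parsed becomes NaT.
    out["Transaction Date"] = pd.to_datetime(
        out["Transaction Date"], errors="coerce"
    )

    # Column-wise summary log.
    for col in ["Category", "Quantity", "Transaction Date"]:
        print(f"{col}: {out[col].isna().sum()} missing or invalid")
    print(f"Transaction ID: {(~out['ID Valid']).sum()} invalid")
    return out
```

Returning a copy rather than modifying the input in place keeps the raw data available for comparison against the cleaned result.
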
--- Validation Report: Total Spent ---
Invalid rows found: 123
Sample invalid values:
   Quantity  Price Per Unit  Total Spent
0         3             5.0           14
1         2             7.0           15
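
A report like the one above could be produced by a check along these lines. This is a minimal sketch, not the project's actual implementation: the function name `validate_total_spent`, the `sample_size` parameter, and the use of a floating-point tolerance are assumptions, and the columns are assumed to have been converted to numeric types already.

```python
import numpy as np
import pandas as pd


def validate_total_spent(df: pd.DataFrame, sample_size: int = 5) -> pd.DataFrame:
    """Report rows where Total Spent != Quantity * Price Per Unit and
    return a copy with those rows corrected. Hypothetical helper."""
    out = df.copy()
    out["Total Spent"] = pd.to_numeric(
        out["Total Spent"], errors="coerce"
    ).astype(float)
    expected = out["Quantity"] * out["Price Per Unit"]

    # Compare with a tolerance so floating-point noise is not flagged;
    # rows where the expected value is unknown (NaN) are left alone.
    close = np.isclose(out["Total Spent"].to_numpy(), expected.to_numpy())
    invalid = pd.Series(~close, index=out.index) & expected.notna()

    print("--- Validation Report: Total Spent ---")
    print(f"Invalid rows found: {invalid.sum()}")
    print("Sample invalid values:")
    cols = ["Quantity", "Price Per Unit", "Total Spent"]
    print(out.loc[invalid, cols].head(sample_size))

    # Fix: overwrite incorrect totals with the recomputed value.
    out.loc[invalid, "Total Spent"] = expected[invalid]
    return out
```

In the sample rows shown above, 3 × 5.0 should be 15 and 2 × 7.0 should be 14, so both stored totals would be flagged and replaced.
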

