This project focuses on cleaning a real-world employee layoffs dataset sourced from Kaggle. The raw data was imported into MySQL, where several SQL operations were used to clean and prepare the dataset for analysis.
The objective was to resolve issues in the raw data such as:
- Duplicate records
- Inconsistent text formatting
- Missing or null values
- Incorrect data types
- MySQL
- MySQL Workbench
- Kaggle (for data source)
- Identified duplicate rows using
ROW_NUMBER()with a CTE. - Retained only the first occurrence, and deleted the rest.
- Trimmed extra white spaces from the
companycolumn. - Replaced inconsistent values like:
"united states"β"United States""crypto currency"β"crypto"
- Converted the
datecolumn from TEXT to DATE type usingSTR_TO_DATE()andALTER TABLE.
- Replaced empty strings in
industrywithNULL. - Used self-joins to fill in missing industry values based on the company.
- Removed rows where both
total_laid_offandpercentage_laid_offwere null.
- Dropped helper columns like
row_numused during the cleaning process.
The dataset is now clean, consistent, and ready for:
- Exploratory Data Analysis (EDA)
- Dashboard development (e.g., in Power BI, Tableau)
- Reporting and visualization



