aravind178/Data_cleaning_using_SQL

💾 Data Cleaning using SQL

📄 Project Description

This project focuses on cleaning a real-world employee layoffs dataset sourced from Kaggle. The raw data was imported into MySQL, where several SQL operations were used to clean and prepare the dataset for analysis.

The objective was to resolve issues in the raw data such as:

  • Duplicate records
  • Inconsistent text formatting
  • Missing or null values
  • Incorrect data types

🛠️ Tools Used

  • MySQL
  • MySQL Workbench
  • Kaggle (for data source)

🧹 Data Cleaning Workflow

✅ 1. Duplicate Removal

  • Identified duplicate rows using ROW_NUMBER() with a CTE.
  • Retained only the first occurrence, and deleted the rest.

[Image: Duplicate removal in SQL]
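
The duplicate-removal step can be sketched as below. This is a minimal illustration, not the repository's exact script: the table names `layoffs_staging` and `layoffs_staging2` are hypothetical, and the `PARTITION BY` list uses only columns mentioned in this README, whereas a real run would normally list every column in the table. Because a CTE is not an updatable target in MySQL, the sketch previews duplicates with the CTE and then deletes from a working table that stores the row number.

```sql
-- Assumed names: layoffs_staging holds the raw import,
-- layoffs_staging2 is the working copy used for cleaning.

-- 1) Preview duplicates: any row numbered 2 or higher repeats an earlier row.
WITH duplicate_cte AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY company, industry, total_laid_off,
                            percentage_laid_off, `date`
           ) AS row_num
    FROM layoffs_staging
)
SELECT *
FROM duplicate_cte
WHERE row_num > 1;

-- 2) Materialize the same numbering in a working table so rows can be deleted.
CREATE TABLE layoffs_staging2 AS
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, industry, total_laid_off,
                        percentage_laid_off, `date`
       ) AS row_num
FROM layoffs_staging;

-- 3) Keep the first occurrence (row_num = 1) and delete the rest.
DELETE FROM layoffs_staging2
WHERE row_num > 1;
```

If MySQL Workbench rejects the `DELETE` with a safe-update error, safe update mode (`SQL_SAFE_UPDATES`) is probably enabled for the session.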


✅ 2. Data Standardization

  • Trimmed leading and trailing whitespace from the company column.
  • Replaced inconsistent values like:
    • "united states" → "United States"
    • "crypto currency" → "crypto"
  • Converted the date column from TEXT to DATE type using STR_TO_DATE() and ALTER TABLE.

[Image: Standardizing text and dates]
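
A rough sketch of the standardization steps follows. Several details are assumptions rather than facts from the project: the working table name `layoffs_staging2`, the `country` column as the home of "united states", the `industry` column as the home of "crypto currency", and the `'%m/%d/%Y'` date format, which must match however the raw text dates are actually written.

```sql
-- Assumed working table: layoffs_staging2.

-- Remove leading and trailing spaces from company names.
UPDATE layoffs_staging2
SET company = TRIM(company);

-- Normalize inconsistent labels (column names are assumptions).
-- With a case-insensitive collation, this also matches other casings.
UPDATE layoffs_staging2
SET country = 'United States'
WHERE country = 'united states';

UPDATE layoffs_staging2
SET industry = 'crypto'
WHERE industry = 'crypto currency';

-- Rewrite the text dates in ISO form, assuming they look like 12/31/2022.
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

-- Change the column type once every value parses as a date.
ALTER TABLE layoffs_staging2
MODIFY COLUMN `date` DATE;
```

Converting the values before changing the column type means the `ALTER TABLE` only has to accept strings that are already in `YYYY-MM-DD` form.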


✅ 3. Handling Null & Missing Values

  • Replaced empty strings in industry with NULL.
  • Used self-joins to fill in missing industry values based on the company.
  • Removed rows where both total_laid_off and percentage_laid_off were null.

[Image: Null value treatment in SQL]
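
The null-handling steps might look like the following sketch. The table name `layoffs_staging2` is hypothetical, and the self-join assumes that rows with the same `company` share the same industry, which may not hold for every dataset.

```sql
-- Assumed working table: layoffs_staging2.

-- Treat empty strings as missing so that one test (IS NULL) covers both cases.
UPDATE layoffs_staging2
SET industry = NULL
WHERE industry = '';

-- Fill a missing industry from another row of the same company (self-join).
UPDATE layoffs_staging2 AS t1
JOIN layoffs_staging2 AS t2
  ON t1.company = t2.company
SET t1.industry = t2.industry
WHERE t1.industry IS NULL
  AND t2.industry IS NOT NULL;

-- Drop rows that carry no layoff figures at all.
DELETE FROM layoffs_staging2
WHERE total_laid_off IS NULL
  AND percentage_laid_off IS NULL;
```

Companies with no populated industry in any row stay NULL after the self-join, since there is nothing to copy from.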


✅ 4. Final Cleanup

  • Dropped helper columns like row_num used during the cleaning process.

[Image: Final cleanup]
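
Dropping the helper column is a single statement. As before, the table name `layoffs_staging2` is an assumption; `row_num` is the helper column named above.

```sql
-- Remove the helper column that was only needed to find duplicates.
ALTER TABLE layoffs_staging2
DROP COLUMN row_num;

-- Optional spot check of the cleaned table.
SELECT *
FROM layoffs_staging2
LIMIT 10;
```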


📌 Outcome

The dataset is now clean, consistent, and ready for:

  • Exploratory Data Analysis (EDA)
  • Dashboard development (e.g., in Power BI, Tableau)
  • Reporting and visualization
