Skip to content

bellonbits/SQL-Data-Cleaning-for-Beginners

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 

Repository files navigation

SQL Data Cleaning for Beginners 🧹

What is Data Cleaning?

Data cleaning means fixing problems in your data so it's accurate and ready to use.

Let's Practice with Our Data

Step 0: Understand Your Data Structure

First, let's see what our data looks like:

-- See what tables we have
SELECT name FROM sqlite_master WHERE type='table';

-- Check customer_info table structure
SELECT * FROM customer_info LIMIT 5;

-- Check products table structure  
SELECT * FROM products LIMIT 5;

-- Check sales table structure
SELECT * FROM sales LIMIT 5;

Step 1: Explore the Data

Look for problems in the data:

-- Check for missing values in customer names
SELECT * FROM customer_info 
WHERE full_name IS NULL OR full_name = '';

-- Check for unusual prices (too high or too low)
SELECT * FROM products 
WHERE price <= 0 OR price > 10000;

-- Look for duplicate customer names
SELECT full_name, COUNT(*) 
FROM customer_info 
GROUP BY full_name 
HAVING COUNT(*) > 1;

Step 2: Standardize Data Formats

Fix formatting issues:

-- Make all names follow the same format (First Last)
UPDATE customer_info 
SET full_name = TRIM(full_name);

-- Make all locations start with capital letter
UPDATE customer_info 
SET location = UPPER(SUBSTR(location, 1, 1)) || LOWER(SUBSTR(location, 2));

-- Round prices to 2 decimal places
UPDATE products 
SET price = ROUND(price, 2);

Step 3: Remove Duplicates

Clean up duplicate entries:

-- Remove duplicate customers (keep the one with the lowest ID)
DELETE FROM customer_info 
WHERE customer_id NOT IN (
    SELECT MIN(customer_id) 
    FROM customer_info 
    GROUP BY full_name, location
);

Step 4: Fix Data Issues

Correct specific problems:

-- Find products with missing names
SELECT * FROM products WHERE product_name IS NULL;

-- Find sales with invalid product or customer IDs
SELECT * FROM sales 
WHERE product_id NOT IN (SELECT product_id FROM products)
OR customer_id NOT IN (SELECT customer_id FROM customer_info);

Quick Practice Exercises

Try these simple exercises:

  1. Find all customers from Nairobi

    SELECT * FROM customer_info WHERE location = 'Nairobi';
  2. Find the most expensive product

    SELECT * FROM products ORDER BY price DESC LIMIT 1;
  3. Count customers by location

    SELECT location, COUNT(*) 
    FROM customer_info 
    GROUP BY location 
    ORDER BY COUNT(*) DESC;

Tips for Beginners

  1. Always backup your data before cleaning
  2. Test your changes on a small sample first
  3. Work step by step - don't try to fix everything at once
  4. Check your work after each change

Remember!

✅ Data cleaning makes your data better for analysis
✅ Simple fixes can solve big problems
✅ Practice with small changes first
✅ Always double-check your work

Happy cleaning! 🧼

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published