Data cleaning means fixing problems in your data so it's accurate and ready to use.
First, let's see what our data looks like:
-- See what tables we have
SELECT name FROM sqlite_master WHERE type='table';
-- Check customer_info table structure
SELECT * FROM customer_info LIMIT 5;
-- Check products table structure
SELECT * FROM products LIMIT 5;
-- Check sales table structure
SELECT * FROM sales LIMIT 5;
Look for problems in the data:
-- Check for missing values in customer names
SELECT * FROM customer_info
WHERE full_name IS NULL OR full_name = '';
-- Check for unusual prices (too high or too low)
SELECT * FROM products
WHERE price <= 0 OR price > 10000;
-- Look for duplicate customer names
SELECT full_name, COUNT(*)
FROM customer_info
GROUP BY full_name
HAVING COUNT(*) > 1;
Fix formatting issues:
-- Make all names follow the same format (First Last)
UPDATE customer_info
SET full_name = TRIM(full_name);
-- Make all locations start with capital letter
UPDATE customer_info
SET location = UPPER(SUBSTR(location, 1, 1)) || LOWER(SUBSTR(location, 2));
-- Round prices to 2 decimal places
UPDATE products
SET price = ROUND(price, 2);
Clean up duplicate entries:
-- Remove duplicate customers (keep the one with the lowest ID)
DELETE FROM customer_info
WHERE customer_id NOT IN (
SELECT MIN(customer_id)
FROM customer_info
GROUP BY full_name, location
);
Correct specific problems:
-- Find products with missing names
SELECT * FROM products WHERE product_name IS NULL;
-- Find sales with invalid product or customer IDs
SELECT * FROM sales
WHERE product_id NOT IN (SELECT product_id FROM products)
OR customer_id NOT IN (SELECT customer_id FROM customer_info);
Try these simple exercises:
-
Find all customers from Nairobi
SELECT * FROM customer_info WHERE location = 'Nairobi';
-
Find the most expensive product
SELECT * FROM products ORDER BY price DESC LIMIT 1;
-
Count customers by location
SELECT location, COUNT(*) FROM customer_info GROUP BY location ORDER BY COUNT(*) DESC;
- Always backup your data before cleaning
- Test your changes on a small sample first
- Work step by step - don't try to fix everything at once
- Check your work after each change
✅ Data cleaning makes your data better for analysis
✅ Simple fixes can solve big problems
✅ Practice with small changes first
✅ Always double-check your work
Happy cleaning! 🧼