# Ungraded Lab: Data Cleaning Challenges

## 📋 Overview 
Welcome to the Data Cleaning Challenges lab! As a data analyst at BookCycle, you're tasked with cleaning a messy dataset containing transaction records. This lab will guide you through common data cleaning techniques using SQL, helping you prepare the data for analysis.

## 🎯 Learning Outcomes
By the end of this lab, you will be able to:
- Identify and handle missing values in SQL
- Standardize date formats across a dataset
- Remove duplicate records from a table
- Transform data to ensure consistency

## 📚 Dataset Information
We'll be working with the <b>transactions_uncleaned_data.csv</b> file, which contains BookCycle's transaction records. This dataset includes information such as <b>transaction IDs, dates, store locations, customer IDs, book IDs, sale prices, payment methods, and whether the transaction was online or in-store.</b>

## 🖥️ Activities

### Activity 1: Exploring the Dataset  

Before we start cleaning, it's crucial to understand our data. Let's explore the transactions_uncleaned_data table to identify potential issues.

<b>Step 1: </b> Connect to the database and view the first few rows:

In [1]:
import sqlite3
import pandas as pd
from datetime import datetime

# Setting up the database. DO NOT edit the code given below
from db_setup import setup_database
setup_database() 

# Connect to the database
conn = sqlite3.connect('bookcycle.db')

✅ Database setup complete: Tables created and populated with data!


In [2]:
query = """
SELECT * 
FROM transactions_uncleaned_data
LIMIT 5;
"""
df = pd.read_sql_query(query, conn)
display(df)

Unnamed: 0,transaction_id,date_time,store_location,customer_id,book_id,sale_price,payment_method,is_online
0,T1001,15/01/23 09:23,University,C1045,B1055,11.99,credit,0.0
1,T1002,15/01/23 10:15,Downtown,C1023,B1032,10.99,,
2,,15/01/23 10:45,,C1078,B1036,11.99,cash,0.0
3,T1004,15/01/23 11:30,University,C1012,B1071,13.99,credit,1.0
4,T1005,15/01/23 13:45,,C1034,B1075,12.99,debit,0.0
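
Alongside the first few rows, it can help to see how SQLite has typed each column. The sketch below is a minimal illustration rather than part of the lab's setup: it builds a hypothetical in-memory table named `transactions_demo` and reads its schema with `PRAGMA table_info`. In the lab you would run the same PRAGMA against `transactions_uncleaned_data` through the existing `conn`.

```python
import sqlite3

# Hypothetical stand-in table; in the lab, use the existing `conn` instead.
demo = sqlite3.connect(":memory:")
demo.execute("""
    CREATE TABLE transactions_demo (
        transaction_id TEXT,
        date_time TEXT,
        sale_price REAL,
        is_online REAL
    )
""")

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = demo.execute("PRAGMA table_info(transactions_demo)").fetchall()
schema = {name: col_type for _, name, col_type, *_ in columns}

for name, col_type in schema.items():
    print(f"{name}: {col_type}")

demo.close()
```

Knowing that `date_time` is stored as text, for instance, explains why date formats need to be standardized explicitly later in the lab.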


<b>Step 2:</b> Count the missing values in each column:

In [3]:
query = """
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN transaction_id IS NULL THEN 1 ELSE 0 END) AS missing_transaction_id,
       SUM(CASE WHEN date_time IS NULL THEN 1 ELSE 0 END) AS missing_date_time,
       SUM(CASE WHEN store_location IS NULL THEN 1 ELSE 0 END) AS missing_store_location,
       SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS missing_customer_id,
       SUM(CASE WHEN book_id IS NULL THEN 1 ELSE 0 END) AS missing_book_id,
       SUM(CASE WHEN sale_price IS NULL THEN 1 ELSE 0 END) AS missing_sale_price,
       SUM(CASE WHEN payment_method IS NULL THEN 1 ELSE 0 END) AS missing_payment_method,
       SUM(CASE WHEN is_online IS NULL THEN 1 ELSE 0 END) AS missing_is_online
FROM transactions_uncleaned_data;
"""
df = pd.read_sql_query(query, conn)
display(df)

Unnamed: 0,total_rows,missing_transaction_id,missing_date_time,missing_store_location,missing_customer_id,missing_book_id,missing_sale_price,missing_payment_method,missing_is_online
0,122,1,6,8,0,1,0,6,3
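
The query above tests only `IS NULL`, while later steps in this lab also treat empty strings as missing. The sketch below shows one possible way to count both cases per column. The helper `count_missing` and the in-memory table `demo_sales` are hypothetical names used for illustration only.

```python
import sqlite3

def count_missing(conn, table, columns):
    """Count, per column, values that are NULL or blank after trimming."""
    # Table and column names are interpolated directly into the SQL,
    # so pass trusted names only.
    parts = [
        f"SUM(CASE WHEN {col} IS NULL OR TRIM({col}) = '' THEN 1 ELSE 0 END)"
        for col in columns
    ]
    row = conn.execute(f"SELECT {', '.join(parts)} FROM {table}").fetchone()
    return dict(zip(columns, row))

# Hypothetical demo data mixing NULLs, empty strings, and whitespace.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE demo_sales (transaction_id TEXT, store_location TEXT, payment_method TEXT)"
)
demo.executemany(
    "INSERT INTO demo_sales VALUES (?, ?, ?)",
    [
        ("T1", "Downtown", "cash"),
        (None, "", "credit"),
        ("T3", None, ""),
        ("T4", "  ", None),
    ],
)

missing = count_missing(
    demo, "demo_sales", ["transaction_id", "store_location", "payment_method"]
)
print(missing)
demo.close()
```

If the counts from this kind of check differ from the `IS NULL` counts, the difference is the number of blank strings hiding in that column.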


<b>Step 3:</b> Examine unique values in categorical columns:

In [4]:
query = """
SELECT DISTINCT store_location FROM transactions_uncleaned_data;
"""
df = pd.read_sql_query(query, conn)
display(df)

Unnamed: 0,store_location
0,University
1,Downtown
2,
3,Suburban


In [5]:
query = """
SELECT DISTINCT payment_method FROM transactions_uncleaned_data;
"""

df = pd.read_sql_query(query, conn)
display(df)

Unnamed: 0,payment_method
0,credit
1,
2,cash
3,debit


 <b>💡 Tip:</b> Pay attention to any inconsistencies in the data, such as misspellings or variations in formatting.
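
One way to surface formatting variations is to group values on a normalized form and count how many raw spellings collapse into it. The following is a sketch on a hypothetical in-memory table (`demo_sales`) with made-up values; it is not output from the BookCycle data.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_sales (store_location TEXT)")
demo.executemany(
    "INSERT INTO demo_sales VALUES (?)",
    [("Downtown",), ("downtown",), (" Downtown ",), ("University",), ("Univercity",)],
)

# Group on a trimmed, lower-cased form and count the distinct raw
# spellings that map to each normalized value.
query = """
SELECT LOWER(TRIM(store_location)) AS normalized,
       COUNT(DISTINCT store_location) AS raw_variants,
       COUNT(*) AS row_count
FROM demo_sales
GROUP BY normalized
ORDER BY normalized;
"""
variants = demo.execute(query).fetchall()
for normalized, raw_variants, row_count in variants:
    print(normalized, raw_variants, row_count)

demo.close()
```

A `raw_variants` value above 1 points to case or whitespace differences. Misspellings such as the made-up "Univercity" still appear as separate groups, so the list needs a visual check as well.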

### Activity 2: Handling Missing Values  

Missing values can skew our analysis. Let's handle them appropriately.

<b>Step 1:</b> Remove rows with missing transaction_id (as this is our primary key):

In [6]:
# Delete rows that have no transaction ID, then commit the change
conn.execute("""
    DELETE FROM transactions_uncleaned_data
    WHERE transaction_id IS NULL;
""")
conn.commit() 


In [7]:
# Confirm changes
df = pd.read_sql_query("SELECT * FROM transactions_uncleaned_data LIMIT 5;", conn)
display(df)

Unnamed: 0,transaction_id,date_time,store_location,customer_id,book_id,sale_price,payment_method,is_online
0,T1001,15/01/23 09:23,University,C1045,B1055,11.99,credit,0.0
1,T1002,15/01/23 10:15,Downtown,C1023,B1032,10.99,,
2,T1004,15/01/23 11:30,University,C1012,B1071,13.99,credit,1.0
3,T1005,15/01/23 13:45,,C1034,B1075,12.99,debit,0.0
4,T1006,15/01/23 14:20,Downtown,C1056,B1001,8.99,,0.0


 <b>💡 Tip:</b> Notice that the row with a missing ("None") transaction_id is no longer there.
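
Besides viewing the first rows, you can confirm a deletion numerically. `conn.execute` returns a cursor whose `rowcount` attribute reports how many rows a `DELETE` affected. The sketch below demonstrates this on a hypothetical in-memory table (`demo_sales`), not on the lab database.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_sales (transaction_id TEXT, sale_price REAL)")
demo.executemany(
    "INSERT INTO demo_sales VALUES (?, ?)",
    [("T1", 9.99), (None, 11.99), ("T3", 7.99)],
)

before = demo.execute("SELECT COUNT(*) FROM demo_sales").fetchone()[0]

# The cursor returned by execute() carries the number of deleted rows.
cursor = demo.execute("DELETE FROM demo_sales WHERE transaction_id IS NULL")
demo.commit()
deleted = cursor.rowcount

after = demo.execute("SELECT COUNT(*) FROM demo_sales").fetchone()[0]
print(f"before={before}, deleted={deleted}, after={after}")

demo.close()
```

Checking that `before - deleted == after` is a quick guard against a `WHERE` clause that matches more rows than intended.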

<b>Step 2:</b> Update missing store_location to 'Unknown':

In [8]:
conn.execute("""
UPDATE transactions_uncleaned_data
SET store_location = 'Unknown'
WHERE store_location IS NULL OR store_location = '';
""")
conn.commit()


In [9]:
# Confirm changes
df = pd.read_sql_query("SELECT * FROM transactions_uncleaned_data", conn)
display(df)

Unnamed: 0,transaction_id,date_time,store_location,customer_id,book_id,sale_price,payment_method,is_online
0,T1001,15/01/23 09:23,University,C1045,B1055,11.99,credit,0.0
1,T1002,15/01/23 10:15,Downtown,C1023,B1032,10.99,,
2,T1004,15/01/23 11:30,University,C1012,B1071,13.99,credit,1.0
3,T1005,15/01/23 13:45,Unknown,C1034,B1075,12.99,debit,0.0
4,T1006,15/01/23 14:20,Downtown,C1056,B1001,8.99,,0.0
...,...,...,...,...,...,...,...,...
116,T1006,15/01/23 14:20,Unknown,C1056,B1001,8.99,,0.0
117,T1007,15/01/23 15:30,University,C1089,B1047,12.99,credit,0.0
118,T1008,15/01/23 16:45,Suburban,C1067,B1095,9.99,,0.0
119,T1009,16/01/23 09:15,University,C1023,B1013,7.99,credit,1.0
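
If you would rather leave the stored values untouched, the same substitution can be applied at query time. The sketch below is an alternative to the `UPDATE` above, not a replacement for it: it combines `TRIM`, `NULLIF`, and `COALESCE` in a `SELECT` on a hypothetical in-memory table (`demo_sales`).

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_sales (transaction_id TEXT, store_location TEXT)")
demo.executemany(
    "INSERT INTO demo_sales VALUES (?, ?)",
    [("T1", "Downtown"), ("T2", None), ("T3", ""), ("T4", "  ")],
)

# NULLIF turns blank strings into NULL; COALESCE then supplies 'Unknown'.
rows = demo.execute("""
    SELECT transaction_id,
           COALESCE(NULLIF(TRIM(store_location), ''), 'Unknown') AS store_location
    FROM demo_sales
    ORDER BY transaction_id
""").fetchall()

for row in rows:
    print(row)

demo.close()
```

The trade-off is that every query must repeat the expression, whereas the `UPDATE` fixes the data once.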


<b>Step 3:</b> Fill in missing payment_method values with the most common method:

In [10]:
conn.execute("""
UPDATE transactions_uncleaned_data
SET payment_method = (
    SELECT payment_method
    FROM transactions_uncleaned_data
    WHERE payment_method IS NOT NULL AND payment_method != ''
    GROUP BY payment_method
    ORDER BY COUNT(*) DESC
    LIMIT 1
)
WHERE payment_method IS NULL OR payment_method = '';
""")
conn.commit()

In [11]:
# Confirm changes
df = pd.read_sql_query("SELECT * FROM transactions_uncleaned_data", conn)
display(df)

Unnamed: 0,transaction_id,date_time,store_location,customer_id,book_id,sale_price,payment_method,is_online
0,T1001,15/01/23 09:23,University,C1045,B1055,11.99,credit,0.0
1,T1002,15/01/23 10:15,Downtown,C1023,B1032,10.99,credit,
2,T1004,15/01/23 11:30,University,C1012,B1071,13.99,credit,1.0
3,T1005,15/01/23 13:45,Unknown,C1034,B1075,12.99,debit,0.0
4,T1006,15/01/23 14:20,Downtown,C1056,B1001,8.99,credit,0.0
...,...,...,...,...,...,...,...,...
116,T1006,15/01/23 14:20,Unknown,C1056,B1001,8.99,credit,0.0
117,T1007,15/01/23 15:30,University,C1089,B1047,12.99,credit,0.0
118,T1008,15/01/23 16:45,Suburban,C1067,B1095,9.99,credit,0.0
119,T1009,16/01/23 09:15,University,C1023,B1013,7.99,credit,1.0


 <b>💡 Tip:</b> Compare the <b>payment_method</b> column before and after running the query above. Which payment method replaces the <b><em>None</em></b> values?
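
To answer that question without scanning the table, you can list the frequency of each payment method before filling anything in. The sketch below runs on a hypothetical in-memory table (`demo_sales`) with made-up counts. It also adds a secondary sort key, since `LIMIT 1` alone gives no guarantee about which value wins when two methods are tied.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_sales (payment_method TEXT)")
demo.executemany(
    "INSERT INTO demo_sales VALUES (?)",
    [("credit",), ("credit",), ("credit",), ("cash",), ("cash",),
     ("debit",), (None,), ("",)],
)

# Frequency of each non-missing payment method, most common first.
# The second sort key makes the order deterministic when counts tie.
counts = demo.execute("""
    SELECT payment_method, COUNT(*) AS n
    FROM demo_sales
    WHERE payment_method IS NOT NULL AND payment_method != ''
    GROUP BY payment_method
    ORDER BY n DESC, payment_method ASC
""").fetchall()

most_common = counts[0][0]
print(counts, most_common)

demo.close()
```

Filling every gap with the mode shifts the distribution toward that value, so it is worth noting how many rows were imputed.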

### Activity 3: Standardizing Date Formats  

Consistent date formats are crucial for accurate time-based analysis.

<b>Step 1:</b> Examine the current date_time formats:

In [12]:
query = """
SELECT DISTINCT date_time FROM transactions_uncleaned_data;
"""

df = pd.read_sql_query(query, conn)
display(df)

Unnamed: 0,date_time
0,15/01/23 09:23
1,15/01/23 10:15
2,15/01/23 11:30
3,15/01/23 13:45
4,15/01/23 14:20
...,...
91,31/01/23 09:25
92,31/01/23 10:40
93,31/01/23 11:50
94,31/01/23 13:35


<b>Step 2:</b> Update the date_time column to a standard format and handle any remaining nonstandard formats:


In [13]:
df = pd.read_sql_query("SELECT * FROM transactions_uncleaned_data", conn)

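# Parse the day-first strings (e.g. 15/01/23 09:23) into datetime values.
# errors='coerce' turns any value that cannot be parsed into NaT (missing).
converted_dates = pd.to_datetime(df['date_time'], errors='coerce', dayfirst=True)
# Write the dates back as text in the standard 'YYYY-MM-DD HH:MM' layout
df['date_time'] = converted_dates.dt.strftime('%Y-%m-%d %H:%M')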

# Reload cleaned DataFrame into SQLite
df.to_sql('transactions_uncleaned_data', conn, index=False, if_exists='replace')


121

 <b>💡 Tip:</b> Always validate your date conversions to ensure accuracy.
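
One simple way to validate is to re-parse every stored value with the target format and collect anything that fails. The sketch below uses only the standard library. The helper name `find_invalid_dates` and the sample values are hypothetical, and in the lab you would pass in the `date_time` column instead.

```python
from datetime import datetime

def find_invalid_dates(values, fmt="%Y-%m-%d %H:%M"):
    """Return the values that are missing or do not parse with `fmt`."""
    invalid = []
    for value in values:
        if value is None:
            invalid.append(value)
            continue
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            # Wrong layout or an impossible calendar date.
            invalid.append(value)
    return invalid

# Hypothetical sample: one good value, one old-format value,
# one missing value, and one impossible date.
sample = ["2023-01-15 09:23", "15/01/23 10:15", None, "2023-02-30 08:00"]
bad = find_invalid_dates(sample)
print(bad)
```

Because the conversion step uses `errors='coerce'`, values that could not be parsed become missing rather than raising an error, so a check like this is the main way to notice them.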

### Activity 4: Removing Duplicates  

Duplicate records can inflate counts and totals in our analysis. Let's remove them.

<b>Step 1:</b> Identify potential duplicates:

In [14]:
query = """
SELECT *, COUNT(*) as count
FROM transactions_uncleaned_data
GROUP BY transaction_id, date_time, store_location, customer_id, book_id, sale_price, payment_method, is_online
HAVING count > 1;
"""
display(pd.read_sql_query(query, conn))

Unnamed: 0,transaction_id,date_time,store_location,customer_id,book_id,sale_price,payment_method,is_online,count
0,T1005,2023-01-15 13:45,University,C1034,B1075,12.99,debit,0.0,2
1,T1007,2023-01-15 15:30,University,C1089,B1047,12.99,credit,0.0,3
2,T1009,2023-01-16 09:15,University,C1023,B1013,7.99,credit,1.0,2
3,T1014,2023-01-16 15:40,Downtown,C1056,B1059,9.99,cash,0.0,2
4,T1084,2023-01-28 11:35,Suburban,C1012,B1097,7.99,credit,0.0,2
5,T1085,2023-01-28 13:45,Suburban,C1023,B1060,11.99,cash,0.0,2
6,T1086,2023-01-28 14:25,Downtown,C1045,B1032,10.99,credit,1.0,2
7,T1098,2023-01-31 10:40,Downtown,C1056,B1003,7.99,cash,0.0,2
8,T1099,2023-01-31 11:50,University,C1089,B1075,12.99,credit,1.0,2
9,T1100,2023-01-31 13:35,University,C1012,B1027,10.99,debit,0.0,2


<b>Step 2:</b> Remove duplicates, keeping only distinct rows:

We can use <b>SELECT DISTINCT</b> to retrieve only unique rows in our analysis, removing any duplicates:

In [15]:
query = """
SELECT DISTINCT *
FROM transactions_uncleaned_data
"""
df = pd.read_sql_query(query, conn)
display(df)

# Reload cleaned DataFrame into SQLite
df.to_sql('transactions_uncleaned_data', conn, index=False, if_exists='replace')

Unnamed: 0,transaction_id,date_time,store_location,customer_id,book_id,sale_price,payment_method,is_online
0,T1001,2023-01-15 09:23,University,C1045,B1055,11.99,credit,0.0
1,T1002,2023-01-15 10:15,Downtown,C1023,B1032,10.99,credit,
2,T1004,2023-01-15 11:30,University,C1012,B1071,13.99,credit,1.0
3,T1005,2023-01-15 13:45,Unknown,C1034,B1075,12.99,debit,0.0
4,T1006,2023-01-15 14:20,Downtown,C1056,B1001,8.99,credit,0.0
...,...,...,...,...,...,...,...,...
105,T1013,2023-01-16 14:25,Unknown,C1034,B1027,10.99,credit,0.0
106,T1024,2023-01-16 15:40,Downtown,C1056,,9.99,credit,0.0
107,T1006,2023-01-15 14:20,Unknown,C1056,B1001,8.99,credit,0.0
108,T1008,2023-01-15 16:45,Suburban,C1067,B1095,9.99,credit,0.0


110
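
The approach above round-trips the table through pandas. As an alternative sketch, duplicates can also be removed inside SQLite by keeping the lowest `rowid` in each group of identical rows. This assumes an ordinary rowid table (not one created `WITHOUT ROWID`); the table `demo_sales` and its values are hypothetical.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE demo_sales (transaction_id TEXT, store_location TEXT, is_online REAL)"
)
demo.executemany(
    "INSERT INTO demo_sales VALUES (?, ?, ?)",
    [
        ("T1", "Downtown", 0.0),
        ("T1", "Downtown", 0.0),     # exact duplicate
        ("T2", "University", None),
        ("T2", "University", None),  # duplicate that includes a NULL
        ("T3", "Suburban", 1.0),
    ],
)

# Keep the first physical row of each group; GROUP BY treats NULLs as equal,
# so rows that match except for shared NULLs are grouped together.
demo.execute("""
    DELETE FROM demo_sales
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM demo_sales
        GROUP BY transaction_id, store_location, is_online
    )
""")
demo.commit()

remaining = demo.execute(
    "SELECT transaction_id FROM demo_sales ORDER BY rowid"
).fetchall()
print(remaining)

demo.close()
```

Listing the grouping columns explicitly makes it clear which columns define a duplicate, which is useful when you later want to narrow or widen that definition.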

#### ⚙️ Test Your Work:
Run the following query to verify your cleaning:

In [16]:
query = """
SELECT COUNT(*) as total_rows,
       COUNT(DISTINCT transaction_id) as unique_transactions,
       MIN(date_time) as earliest_date,
       MAX(date_time) as latest_date,
       COUNT(DISTINCT store_location) as unique_locations,
       COUNT(DISTINCT payment_method) as unique_payment_methods
FROM transactions_uncleaned_data;
"""

df = pd.read_sql_query(query,conn)
display(df)

Unnamed: 0,total_rows,unique_transactions,earliest_date,latest_date,unique_locations,unique_payment_methods
0,110,99,2023-01-15 09:23,2023-01-31 13:35,4,3


### Close the Connection
It's good practice to close the database connection when you're done.

In [17]:
# Close the database connection 
conn.close()
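
In scripts, a common pattern is to wrap the connection in `contextlib.closing` so it is closed even if an error occurs partway through. Note that using a `sqlite3` connection directly in a `with` statement manages the transaction but does not close the connection. The sketch below uses a hypothetical in-memory database.

```python
import sqlite3
from contextlib import closing

# closing() calls demo.close() when the block exits, even on an exception.
with closing(sqlite3.connect(":memory:")) as demo:
    demo.execute("CREATE TABLE demo_sales (transaction_id TEXT)")
    demo.execute("INSERT INTO demo_sales VALUES ('T1')")
    demo.commit()
    row_count = demo.execute("SELECT COUNT(*) FROM demo_sales").fetchone()[0]

# Any further use of the closed connection raises ProgrammingError.
try:
    demo.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(row_count, still_open)
```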

## ✅ Success Checklist
- All transaction_ids are unique and not null
- No missing values in store_location or payment_method
- All dates are in the format 'YYYY-MM-DD HH:MM' (for example, 2023-01-15 09:23)
- No duplicate rows based on transaction_id and date_time
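
The verification output earlier reports more total rows than unique transaction IDs, so it is worth checking the first item explicitly. The sketch below lists IDs that appear on more than one row. The helper `repeated_ids` and the table `demo_sales` are hypothetical; in the lab you would point the query at `transactions_uncleaned_data`.

```python
import sqlite3

def repeated_ids(conn, table):
    """Return (transaction_id, row_count) for IDs found on more than one row."""
    # The table name is interpolated directly, so pass a trusted name only.
    return conn.execute(f"""
        SELECT transaction_id, COUNT(*) AS n
        FROM {table}
        GROUP BY transaction_id
        HAVING COUNT(*) > 1
        ORDER BY transaction_id
    """).fetchall()

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_sales (transaction_id TEXT, store_location TEXT)")
demo.executemany(
    "INSERT INTO demo_sales VALUES (?, ?)",
    [("T1", "Downtown"), ("T1", "Unknown"), ("T2", "University")],
)

repeats = repeated_ids(demo, "demo_sales")
print(repeats)
demo.close()
```

Rows like the two made-up `T1` entries share an ID but differ in another column, so `SELECT DISTINCT *` keeps both. Deciding which one to keep is a judgment call that depends on the analysis.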


## 🔍 Common Issues & Solutions 

- Problem: Duplicate removal affects more rows than expected. 
  - Solution: Review the duplicate identification query and adjust criteria if necessary.
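
One way to review the criteria before deleting anything is to count how many rows each duplicate definition would remove. The sketch below compares a full-row definition with an ID-only definition on a hypothetical in-memory table; `rows_removed` and `demo_sales` are illustrative names.

```python
import sqlite3

def rows_removed(conn, table, key_columns):
    """Rows that would be dropped if one row were kept per key combination."""
    # Names are interpolated directly into the SQL, so pass trusted names only.
    keys = ", ".join(key_columns)
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    groups = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT 1 FROM {table} GROUP BY {keys})"
    ).fetchone()[0]
    return total - groups

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_sales (transaction_id TEXT, store_location TEXT)")
demo.executemany(
    "INSERT INTO demo_sales VALUES (?, ?)",
    [("T1", "Downtown"), ("T1", "Downtown"), ("T1", "Unknown"), ("T2", "University")],
)

full_row = rows_removed(demo, "demo_sales", ["transaction_id", "store_location"])
id_only = rows_removed(demo, "demo_sales", ["transaction_id"])
print(full_row, id_only)
demo.close()
```

A large gap between the two numbers suggests that many rows share an ID but differ elsewhere, which is a sign to inspect them before choosing the stricter rule.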

## ➡️ Summary

Congratulations on completing the Data Cleaning Challenges lab! You've now gained valuable skills in handling real-world data issues, which will significantly enhance your ability to prepare datasets for accurate analysis and reporting in your future projects at BookCycle and beyond.


### 🔑 Key Points
- Always explore your data before cleaning
- Handle missing values based on the nature of the data and analysis requirements
- Standardize formats, especially for dates and categorical variables
- Remove duplicates carefully, considering which records to keep