# Customer Dataset Generation for SQL Workshop

This notebook generates a focused customer dataset that includes exactly the fields needed to demonstrate the SQL concepts in the workshop. The data structure is deliberately kept simple and includes only what's necessary for the workshop examples.

Generated fields:
- name (TEXT) - For string operations and pattern matching
- city (TEXT) - For location-based queries and string concatenation
- country (TEXT) - For working with codes and IN operators
- items_purchased (INTEGER) - For numeric operations and range filters
- price_per_item (REAL) - For calculations and NULL handling
- last_purchase (TEXT) - For date functions and filtering
- account_balance (REAL) - For additional numeric comparisons

The generated data will be exported as:
1. `customers.csv` - CSV format
2. `customers.sqlite` - SQLite database for workshop queries

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import sqlite3

# Set random seed for reproducibility
np.random.seed(42)

In [2]:
# Generate minimal sample data that covers all concepts
n_customers = 25  # Enough for examples, not too many

# Sample data for each field - just enough variety for meaningful examples
names = [
    "John Smith", "Maria Garcia", "Li Wei", "Emma Brown", "Ahmed Hassan",  # English, Spanish, Chinese, English, Arabic
    "Sarah Johnson", "Carlos Rodriguez", "Anna Kowalski", "James Wilson", "Yuki Tanaka",  # English, Spanish, Polish, English, Japanese
    "Elena Popov", "Michel Dubois", "Sofia Santos", "Lars Andersen", "Aisha Patel",  # Russian, French, Portuguese, Danish, Indian
    "Diego Martinez", "Lucy Chen", "Ivan Petrov", "Mary Williams", "Raj Kumar",  # Spanish, Chinese, Russian, English, Indian
    "Hans Schmidt", "Isabella Silva", "Fatima Al-Said", "Jun Park", "Anna Ivanova"  # German, Portuguese, Arabic, Korean, Russian
]

# Cities - mix of major cities with good geographic distribution
cities = [
    "New York", "London", "Tokyo", "Paris", "Sydney",  # Major global cities
    "Berlin", "Mumbai", "São Paulo", "Toronto", "Shanghai",  # More major cities
    "Madrid", "Moscow", "Dubai", None, "Mexico City",  # Some NULLs
    "Amsterdam", "Cairo", "Stockholm", None, "Los Angeles",  # More NULLs
    "Rome", "Hong Kong", "Istanbul", "Seoul", "Bangkok"  # Additional variety
]

# Countries - good mix for grouping examples
countries = [
    "US", "GB", "JP", "FR", "AU",  # 5 major countries
    "DE", "IN", "BR", "CA", "CN",  # 5 more countries
    "ES", "RU", "AE", None, "MX",  # Some NULLs
    "NL", "EG", "SE", None, "US",  # More NULLs, plus some duplicates
    "IT", "HK", "TR", "KR", "TH"   # Additional variety
]

# Generate numeric data as lists for easier NULL handling
items_purchased = [np.random.randint(1, 20) for _ in range(n_customers)]
price_per_item = [round(np.random.uniform(10.0, 100.0), 2) for _ in range(n_customers)]
account_balance = [round(np.random.uniform(100.0, 1000.0), 2) for _ in range(n_customers)]

# Generate last_purchase dates - 12 months is enough for examples
base_date = datetime(2024, 1, 1)
last_purchases = []
for i in range(n_customers):
    days = np.random.randint(0, 365)
    date = base_date + timedelta(days=days)
    last_purchases.append(date.strftime('%Y-%m-%d'))

# Add NULL values strategically
null_indices = np.random.choice(n_customers, 6, replace=False)  # 6 distinct rows; 9 NULL cells in total
for idx in null_indices[:2]:
    items_purchased[idx] = None
for idx in null_indices[2:4]:
    price_per_item[idx] = None
for idx in null_indices[4:]:
    account_balance[idx] = None
for idx in null_indices[:3]:  # Some NULL dates
    last_purchases[idx] = None

# Create the data dictionary
data = {
    'name': names,
    'city': cities,
    'country': countries,
    'items_purchased': items_purchased,
    'price_per_item': price_per_item,
    'last_purchase': last_purchases,
    'account_balance': account_balance
}

# Create DataFrame
df = pd.DataFrame(data)

In [3]:
# Show the first few rows and dataset info
print("Preview of the generated dataset:")
print("\nFirst few rows:")
print(df.head())

print("\nDataset Info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

# Export to CSV
csv_path = 'customers.csv'
df.to_csv(csv_path, index=False)
print(f"\nDataset exported to {csv_path}")

# Export to SQLite
sqlite_path = 'customers.sqlite'
with sqlite3.connect(sqlite_path) as conn:
    df.to_sql('customers', conn, if_exists='replace', index=False)
print(f"Dataset exported to {sqlite_path}")

# Print value ranges to confirm suitability for workshop examples
print("\nValue ranges in the dataset:")
numeric_cols = ['items_purchased', 'price_per_item', 'account_balance']
for col in numeric_cols:
    non_null = df[col].dropna()
    print(f"\n{col}:")
    print(f"  Min: {non_null.min():.2f}")
    print(f"  Max: {non_null.max():.2f}")
    print(f"  Null count: {df[col].isnull().sum()}")

Preview of the generated dataset:

First few rows:
           name      city country  items_purchased  price_per_item  \
0    John Smith  New York      US              NaN           56.28   
1  Maria Garcia    London      GB             15.0             NaN   
2        Li Wei     Tokyo      JP             11.0           14.18   
3    Emma Brown     Paris      FR              8.0           64.68   
4  Ahmed Hassan    Sydney      AU              7.0           25.35   

  last_purchase  account_balance  
0          None           945.55  
1    2024-08-18           905.34  
2    2024-02-10           638.11  
3    2024-01-28           929.69  
4    2024-05-14           179.64  

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             25 non-null     object 
 1   city             23 non-null     object 
 2   country     

# Data Dictionary

The generated dataset includes a minimal set of records and fields needed for the SQL workshop examples:

| Column | Type | Description | Example | Notes |
|--------|------|-------------|---------|-------|
| name | TEXT | Customer's full name | "John Smith" | Never NULL, 25 unique names |
| city | TEXT | Customer's city | "New York" | 2 NULL values (8%) |
| country | TEXT | Country code | "US" | 2 NULL values (8%) |
| items_purchased | INTEGER | Number of items purchased | 5 | 2 NULL values (8%), Range: 1-19 (`np.random.randint(1, 20)` excludes the upper bound) |
| price_per_item | REAL | Price per item in USD | 45.99 | 2 NULL values (8%), Range: $10-$100 |
| last_purchase | TEXT | Date of last purchase | "2024-06-15" | 3 NULL values (12%), Range: 2024 dates |
| account_balance | REAL | Account balance in USD | 523.45 | 2 NULL values (8%), Range: $100-$1000 |

This minimal dataset supports all SQL concepts in the workshop:

1. **Basic Queries**
   - SELECT and column selection
   - Column aliases
   - String operations (name, city, country)
   - Numeric calculations (items_purchased × price_per_item)
   - Date operations (last_purchase)

2. **WHERE Clause**
   - Basic comparisons (>, <, =, etc.)
   - Pattern matching (LIKE with name, city)
   - Multiple conditions (AND, OR)
   - NULL handling (IS NULL, IS NOT NULL)
   - Range checks (BETWEEN)

3. **Sorting and Pagination**
   - ORDER BY (all columns)
   - Multiple sort columns
   - LIMIT/OFFSET (25 rows is enough)

4. **Aggregates and Grouping**
   - Basic aggregates (COUNT, SUM, AVG, etc.)
   - GROUP BY (country, city, month)
   - HAVING clause
   - NULL handling in groups

The data has been minimized while ensuring:
- Enough rows for pagination (25)
- Sufficient groups for aggregation (23 distinct cities, 22 distinct country codes)
- Strategic NULL values
- Reasonable numeric ranges
- Good date distribution (1 year)
- Diverse text data for pattern matching
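
As a rough illustration of how these properties are used in the workshop topics above, here is a minimal sketch that runs a grouping query, a month extraction, and a pagination query. It assumes a hypothetical five-row stand-in table in an in-memory SQLite database, not the generated 25-row dataset, and uses only Python's built-in `sqlite3` module.

```python
import sqlite3

# Hypothetical five-row stand-in for the customers table (not the generated data).
rows = [
    ("John Smith", "US", 3, 10.0, "2024-01-15"),
    ("Lucy Chen", "US", 2, 15.0, "2024-01-20"),
    ("Li Wei", "JP", 5, 20.0, "2024-02-10"),
    ("Emma Brown", None, 1, 8.0, None),
    ("Yuki Tanaka", "JP", None, 12.0, "2024-02-28"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (name TEXT, country TEXT, items_purchased INTEGER, "
    "price_per_item REAL, last_purchase TEXT)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?)", rows)

# GROUP BY collects all NULL countries into a single group.
by_country = conn.execute(
    "SELECT country, COUNT(*) AS n FROM customers GROUP BY country ORDER BY country"
).fetchall()

# last_purchase is stored as ISO-8601 text, so strftime can extract the month.
by_month = conn.execute(
    "SELECT strftime('%m', last_purchase) AS month, COUNT(*) AS n "
    "FROM customers WHERE last_purchase IS NOT NULL "
    "GROUP BY month ORDER BY month"
).fetchall()

# LIMIT/OFFSET pagination: second page with a page size of 2.
page_two = conn.execute(
    "SELECT name FROM customers ORDER BY name LIMIT 2 OFFSET 2"
).fetchall()
conn.close()

print(by_country)  # [(None, 1), ('JP', 2), ('US', 2)]
print(by_month)    # [('01', 2), ('02', 2)]
print(page_two)    # [('Li Wei',), ('Lucy Chen',)]
```

In SQLite, NULL sorts before other values in ascending order, which is why the NULL country group appears first.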

## 1. Creating Sample Data with NULL Values

First, we'll create a sample customer dataset that includes NULL values in several columns. This will help us demonstrate various NULL handling scenarios. We'll:

1. Import required libraries
2. Create a DataFrame with NULL values
3. Create an SQLite database
4. Load our data into the database

In [4]:
# Import required libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

# Create sample customer data with NULL values
customer_data = {
    'name': ['John Smith', 'Sarah Johnson', 'Michael Lee', 'Emma Davis', 'David Wilson'],
    'city': ['New York', 'Chicago', 'Seattle', None, 'Boston'],
    'country': ['US', 'US', 'US', 'US', 'US'],
    'phone': ['555-0101', None, '555-0303', '555-0404', None],
    'items_purchased': [3, 2, None, 1, 4],
    'price_per_item': [10.0, 15.0, 20.0, None, 12.0]
}

# Create DataFrame
customers_df = pd.DataFrame(customer_data)

# Create SQLite database in memory
engine = create_engine('sqlite:///:memory:')
customers_df.to_sql('customers', engine, index=False)

print("Sample Customer Data:")
print(customers_df)

Sample Customer Data:
            name      city country     phone  items_purchased  price_per_item
0     John Smith  New York      US  555-0101              3.0            10.0
1  Sarah Johnson   Chicago      US      None              2.0            15.0
2    Michael Lee   Seattle      US  555-0303              NaN            20.0
3     Emma Davis      None      US  555-0404              1.0             NaN
4   David Wilson    Boston      US      None              4.0            12.0


## 2. Common Mistake: Using Equality with NULL

A common mistake when working with NULL values is trying to use the equality operator (`=`). Let's see why this doesn't work:

* `NULL` represents an unknown value
* Comparing anything with `NULL` yields `NULL` (unknown), not TRUE or FALSE
* `WHERE column = NULL` will never match any rows, because `WHERE` keeps only rows whose condition is TRUE

In [5]:
# Incorrect query - trying to find customers with NULL city
incorrect_query = """
SELECT name, city
FROM customers
WHERE city = NULL
"""

print("Attempting to find customers with NULL city using incorrect method (WHERE city = NULL):")
print("\nResults:")
print(pd.read_sql_query(incorrect_query, engine))
print("\nNote: No results were returned because '= NULL' never evaluates to TRUE")

Attempting to find customers with NULL city using incorrect method (WHERE city = NULL):

Results:
Empty DataFrame
Columns: [name, city]
Index: []

Note: No results were returned because '= NULL' never evaluates to TRUE
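
To see this three-valued logic directly, the following minimal sketch evaluates a few comparisons as scalar expressions. It assumes only Python's built-in `sqlite3` module and a throwaway in-memory database, separate from the `engine` used above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# sqlite3 maps SQL NULL to Python None; SQLite represents TRUE/FALSE as 1/0.
eq_null, neq_null, is_null = conn.execute(
    "SELECT NULL = NULL, 'Paris' != NULL, NULL IS NULL"
).fetchone()
conn.close()

print(eq_null)   # None: even NULL = NULL is unknown
print(neq_null)  # None: != NULL is unknown as well
print(is_null)   # 1: IS NULL always gives a definite answer
```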


## 3. Correct Approach: Using IS NULL

The correct way to check for NULL values is using the `IS NULL` operator. This operator is specifically designed to handle the special nature of NULL values. Let's see how it works:

* `IS NULL` checks if a value is NULL
* `IS NOT NULL` checks if a value is not NULL
* Both operators always return TRUE or FALSE, never NULL, so they behave predictably in a `WHERE` clause

In [6]:
# Correct query - finding customers with NULL city
correct_query = """
SELECT name, city
FROM customers
WHERE city IS NULL
"""

print("Finding customers with NULL city using correct method (WHERE city IS NULL):")
print("\nResults:")
print(pd.read_sql_query(correct_query, engine))

# Also demonstrate IS NOT NULL
not_null_query = """
SELECT name, city
FROM customers
WHERE city IS NOT NULL
"""

print("\nFinding customers with non-NULL city (WHERE city IS NOT NULL):")
print("\nResults:")
print(pd.read_sql_query(not_null_query, engine))

Finding customers with NULL city using correct method (WHERE city IS NULL):

Results:
         name  city
0  Emma Davis  None

Finding customers with non-NULL city (WHERE city IS NOT NULL):

Results:
            name      city
0     John Smith  New York
1  Sarah Johnson   Chicago
2    Michael Lee   Seattle
3   David Wilson    Boston


## 4. Advanced NULL Handling with COALESCE

COALESCE is a powerful function for handling NULL values in SQL. It returns the first non-NULL value in a list of expressions. Common uses include:

* Providing default values for NULL fields
* Creating derived columns that handle NULL values gracefully
* Labeling missing values in the results of queries that filter with `IS NULL`

Let's explore some practical examples:

In [7]:
# Example 1: Using COALESCE for default values
coalesce_query = """
SELECT 
    name,
    COALESCE(city, 'Location Unknown') as city,
    COALESCE(items_purchased, 0) as items_purchased,
    COALESCE(price_per_item, 0.0) as price_per_item
FROM customers
"""

print("Using COALESCE to replace NULL values with defaults:")
print("\nResults:")
print(pd.read_sql_query(coalesce_query, engine))

# Example 2: Finding customers with any missing contact information
missing_info_query = """
SELECT 
    name,
    COALESCE(city, 'Unknown City') as city,
    COALESCE(phone, 'No Phone') as phone
FROM customers
WHERE city IS NULL 
   OR phone IS NULL
"""

print("\nFinding customers with missing contact information:")
print("\nResults:")
print(pd.read_sql_query(missing_info_query, engine))

# Example 3: Calculating total value with NULL handling
total_value_query = """
SELECT 
    name,
    COALESCE(items_purchased, 0) * COALESCE(price_per_item, 0) as total_value
FROM customers
ORDER BY total_value DESC
"""

print("\nCalculating total value with NULL handling:")
print("\nResults:")
print(pd.read_sql_query(total_value_query, engine))

Using COALESCE to replace NULL values with defaults:

Results:
            name              city  items_purchased  price_per_item
0     John Smith          New York              3.0            10.0
1  Sarah Johnson           Chicago              2.0            15.0
2    Michael Lee           Seattle              0.0            20.0
3     Emma Davis  Location Unknown              1.0             0.0
4   David Wilson            Boston              4.0            12.0

Finding customers with missing contact information:

Results:
            name          city     phone
0  Sarah Johnson       Chicago  No Phone
1     Emma Davis  Unknown City  555-0404
2   David Wilson        Boston  No Phone

Calculating total value with NULL handling:

Results:
            name  total_value
0   David Wilson         48.0
1     John Smith         30.0
2  Sarah Johnson         30.0
3    Michael Lee          0.0
4     Emma Davis          0.0
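
The examples above pass two arguments to COALESCE, but it accepts a longer fallback chain and returns the first non-NULL argument. The sketch below illustrates that, along with the NULL propagation that COALESCE prevents in arithmetic. The `contacts` table and its `email` column are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table used only for this illustration.
conn.execute("CREATE TABLE contacts (name TEXT, phone TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("A", "555-0101", "a@example.com"),
        ("B", None, "b@example.com"),
        ("C", None, None),
    ],
)

# Fallback chain: phone first, then email, then a literal default.
best_contact = conn.execute(
    "SELECT name, COALESCE(phone, email, 'no contact') FROM contacts ORDER BY name"
).fetchall()

# Arithmetic with NULL yields NULL unless the NULL is replaced first.
raw_total, safe_total = conn.execute(
    "SELECT 3 * NULL, 3 * COALESCE(NULL, 0)"
).fetchone()
conn.close()

print(best_contact)  # [('A', '555-0101'), ('B', 'b@example.com'), ('C', 'no contact')]
print(raw_total)     # None
print(safe_total)    # 0
```

Replacing a missing price or quantity with 0 is a modeling choice: it makes the total computable, but it also makes "unknown" indistinguishable from "nothing purchased".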


## Summary and Best Practices

When working with NULL values in SQL, remember:

1. **Never use `= NULL` or `!= NULL`**
   * Always use `IS NULL` or `IS NOT NULL` instead

2. **Use COALESCE for default values**
   * Helps avoid NULL propagation in calculations
   * Makes reports more readable
   * Ensures consistent data handling

3. **Consider NULL in your database design**
   * Decide whether columns should allow NULL values
   * Use NOT NULL constraints when appropriate
   * Document your NULL handling strategy

4. **Be careful with aggregate functions**
   * Most aggregate functions ignore NULL values
   * `COUNT(*)` counts every row, while `COUNT(column)` counts only rows where `column` is not NULL
   * Use COALESCE when a NULL should be treated as a concrete value (such as 0) in a calculation

These practices will help you handle NULL values correctly and write more robust SQL queries.
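
The aggregate behavior in point 4 can be checked with a short sketch. It reuses the `items_purchased` values from the sample data in section 1, loaded into a separate in-memory database through Python's built-in `sqlite3` module (an assumption made to keep the example self-contained).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, items_purchased INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        ("John Smith", 3),
        ("Sarah Johnson", 2),
        ("Michael Lee", None),  # NULL items_purchased
        ("Emma Davis", 1),
        ("David Wilson", 4),
    ],
)

count_rows, count_items, total, avg_skip_null, avg_null_as_zero = conn.execute(
    """
    SELECT COUNT(*),
           COUNT(items_purchased),
           SUM(items_purchased),
           AVG(items_purchased),
           AVG(COALESCE(items_purchased, 0))
    FROM customers
    """
).fetchone()
conn.close()

print(count_rows, count_items)          # 5 4
print(total)                            # 10
print(avg_skip_null, avg_null_as_zero)  # 2.5 2.0
```

`AVG(items_purchased)` divides by the four non-NULL rows, while the COALESCE version divides by all five, so the two averages answer different questions.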

In [8]:
# Generate income data for pivot/melt and JOIN examples
# This complements the customers table with income information

# Strategic selection of names:
# - Some customers have multiple income sources (John Smith, Maria Garcia, Li Wei)
# - Some customers have no income data (Emma Brown, Carlos Rodriguez, many others)
# - Some people have income but aren't customers (Jennifer Wilson, Robert Taylor, Amanda Lee)

income_records = []

# Customers with multiple income sources
multi_income_customers = [
    ('John Smith', [('Salary', 3000, 'Monday'), ('Freelance', 500, 'Friday')]),
    ('Maria Garcia', [('Salary', 2500, 'Monday'), ('Bonus', 1000, 'Friday')]),
    ('Li Wei', [('Salary', 4000, 'Monday'), ('Investment', 200, 'Wednesday')]),
    ('Ahmed Hassan', [('Salary', 3500, 'Monday'), ('Consulting', 1500, 'Tuesday'), ('Rental', 800, 'Thursday')]),
    ('Sarah Johnson', [('Salary', 2800, 'Monday'), ('Teaching', 400, 'Wednesday')])
]

for name, sources in multi_income_customers:
    for source, amount, day in sources:
        income_records.append({
            'name': name,
            'income_source': source,
            'amount': amount,
            'day': day
        })

# Customers with single income source
single_income_customers = [
    ('Yuki Tanaka', 'Salary', 3200, 'Monday'),
    ('Elena Popov', 'Pension', 1800, 'Friday'),
    ('Lars Andersen', 'Salary', 4500, 'Monday'),
    ('Diego Martinez', 'Salary', 2700, 'Monday'),
    ('Mary Williams', 'Part-time', 1200, 'Tuesday')
]

for name, source, amount, day in single_income_customers:
    income_records.append({
        'name': name,
        'income_source': source,
        'amount': amount,
        'day': day
    })

# People with income who are NOT in the customers table
non_customers = [
    ('Jennifer Wilson', 'Salary', 3100, 'Monday'),
    ('Robert Taylor', 'Salary', 2900, 'Monday'),
    ('Robert Taylor', 'Uber', 600, 'Saturday'),
    ('Amanda Lee', 'Freelance', 2200, 'Thursday'),
    ('Michael Brown', 'Salary', 3400, 'Monday'),
    ('Patricia Davis', 'Consulting', 4000, 'Wednesday')
]

for name, source, amount, day in non_customers:
    income_records.append({
        'name': name,
        'income_source': source,
        'amount': amount,
        'day': day
    })

# Create DataFrame
income_df = pd.DataFrame(income_records)

# Add some NULL values for demonstration
null_indices = np.random.choice(len(income_df), 2, replace=False)
for idx in null_indices:
    income_df.loc[idx, 'amount'] = None

# Display the income data
print("\nIncome Data Preview:")
print(f"Total records: {len(income_df)}")
print(f"Unique people: {income_df['name'].nunique()}")
print(f"People in both tables: {len(set(income_df['name'].unique()) & set(df['name'].unique()))}")
print(f"Income-only people: {len(set(income_df['name'].unique()) - set(df['name'].unique()))}")
print(f"Customers-only people: {len(set(df['name'].unique()) - set(income_df['name'].unique()))}")
print("\nFirst 10 income records:")
print(income_df.head(10))

# Export to CSV
income_csv_path = 'income.csv'
income_df.to_csv(income_csv_path, index=False)
print(f"\nIncome data exported to {income_csv_path}")

# Add to SQLite database
with sqlite3.connect(sqlite_path) as conn:
    income_df.to_sql('income', conn, if_exists='replace', index=False)

    # Verify both tables exist
    cursor = conn.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    tables = cursor.fetchall()
    print(f"\nTables in database: {[t[0] for t in tables]}")

# Show income distribution for pivot examples
print("\nIncome by day (for pivot examples):")
day_summary = income_df.groupby('day')['name'].count().sort_values(ascending=False)
print(day_summary)

Income Data Preview:
Total records: 22
Unique people: 15
People in both tables: 10
Income-only people: 5
Customers-only people: 15

First 10 income records:
            name income_source  amount        day
0     John Smith        Salary  3000.0     Monday
1     John Smith     Freelance   500.0     Friday
2   Maria Garcia        Salary  2500.0     Monday
3   Maria Garcia         Bonus  1000.0     Friday
4         Li Wei        Salary     NaN     Monday
5         Li Wei    Investment   200.0  Wednesday
6   Ahmed Hassan        Salary  3500.0     Monday
7   Ahmed Hassan    Consulting  1500.0    Tuesday
8   Ahmed Hassan        Rental   800.0   Thursday
9  Sarah Johnson        Salary     NaN     Monday

Income data exported to income.csv

Tables in database: ['customers', 'income']

Income by day (for pivot examples):
day
Monday       11
Friday        3
Wednesday     3
Thursday      2
Tuesday       2
Saturday      1
Name: name, dtype: int64
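
The overlap between the two tables (people in both, income-only, customers-only) is what makes the different JOIN types return different results. The following is a minimal sketch of that idea, assuming a hypothetical three-person subset loaded into an in-memory SQLite database rather than the exported `customers.sqlite` file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, country TEXT)")
conn.execute("CREATE TABLE income (name TEXT, income_source TEXT, amount REAL)")

# Hypothetical subset: one person in both tables, one customer-only, one income-only.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("John Smith", "US"), ("Emma Brown", "FR")],
)
conn.executemany(
    "INSERT INTO income VALUES (?, ?, ?)",
    [
        ("John Smith", "Salary", 3000),
        ("John Smith", "Freelance", 500),
        ("Jennifer Wilson", "Salary", 3100),
    ],
)

# INNER JOIN keeps only people present in both tables.
inner = conn.execute(
    "SELECT c.name, i.income_source FROM customers c "
    "JOIN income i ON i.name = c.name ORDER BY i.income_source"
).fetchall()

# LEFT JOIN keeps every customer; customers without income get a NULL total.
left = conn.execute(
    "SELECT c.name, SUM(i.amount) FROM customers c "
    "LEFT JOIN income i ON i.name = c.name GROUP BY c.name ORDER BY c.name"
).fetchall()

# Anti-join: people with income who are not customers.
income_only = conn.execute(
    "SELECT DISTINCT i.name FROM income i "
    "LEFT JOIN customers c ON c.name = i.name WHERE c.name IS NULL"
).fetchall()
conn.close()

print(inner)        # [('John Smith', 'Freelance'), ('John Smith', 'Salary')]
print(left)         # [('Emma Brown', None), ('John Smith', 3500.0)]
print(income_only)  # [('Jennifer Wilson',)]
```

Joining on `name` works here because the names are unique within `customers`; a real schema would normally join on a dedicated key column.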
