In [None]:
# Customer Segmentation Dashboard: Phase 1 - Data Collection and Preparation
**Objective**: Source and set up e-commerce customer data in an SQLite database for RFM analysis.

**Tools**:
- Python (pandas, sqlite3)
- SQLite database
- Dataset: Kaggle 'carrie1/ecommerce-data'

**Prerequisite**: Confirm the dataset path from the Kaggle download (checked in the first code cell).

**Steps**:
1. Create an SQLite database.
2. Import the dataset into the database.
3. Explore the data structure (schema and sample rows).

In [18]:
import pandas as pd
import sqlite3
import os

# Path to the Kaggle CSV, relative to the notebook's directory
csv_file_path = '../data/data.csv'

if os.path.exists(csv_file_path):
    print(f"SUCCESS: Data file found at {csv_file_path}")
else:
    print(f"ERROR: Data file not found at {csv_file_path}")

SUCCESS: Data file found at ../data/data.csv


In [None]:
## Step 1: Set Up SQLite Database

Create a new SQLite database to store the e-commerce data. The database is saved as `ecommerce_data.db` in the project root, one level above the notebook's directory. `sqlite3.connect` creates the file if it does not already exist.
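
The connection in this notebook stays open across several cells and is closed by hand at the end of Step 3. As a minimal sketch of an alternative, `contextlib.closing` can close the connection automatically. The names `demo_db_path` and `demo_conn` are illustrative and not part of the notebook, and an in-memory database is assumed so that no file is written.

```python
import sqlite3
from contextlib import closing

# ":memory:" is a stand-in so the sketch leaves no file behind;
# the notebook itself uses '../ecommerce_data.db'.
demo_db_path = ":memory:"

# closing() calls demo_conn.close() on exit, even if a query raises.
with closing(sqlite3.connect(demo_db_path)) as demo_conn:
    sqlite_version = demo_conn.execute("SELECT sqlite_version();").fetchone()[0]

print("SQLite library version:", sqlite_version)
```

Using the connection directly as a context manager (`with sqlite3.connect(...) as conn:`) only commits or rolls back the transaction; it does not close the connection, which is why `closing` is used here.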

In [13]:
# Create/connect to SQLite database
db_path = '../ecommerce_data.db'  # Store in project root
conn = sqlite3.connect(db_path)
print(f'Connected to SQLite database at {db_path}')

Connected to SQLite database at ../ecommerce_data.db


In [None]:
## Step 2: Import Dataset into SQLite

Read the CSV file into a pandas DataFrame and write it to an SQLite table named `transactions`. If the table exists, it will be replaced.
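
Because the import replaces any existing table, it can be worth confirming that the number of rows in SQLite matches the DataFrame. The sketch below illustrates that check on a tiny made-up DataFrame and an in-memory database; the table name `transactions_demo`, the sample values, and the `chunksize` setting are assumptions for illustration, not the notebook's actual import.

```python
import sqlite3
import pandas as pd

# Tiny stand-in for the real CSV; the values are made up for illustration.
demo_df = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2"],
    "Quantity": [6, 8, 2],
    "UnitPrice": [2.55, 2.75, 3.39],
})

demo_conn = sqlite3.connect(":memory:")
try:
    # chunksize limits how many rows are written per batch.
    demo_df.to_sql("transactions_demo", demo_conn,
                   if_exists="replace", index=False, chunksize=2)
    imported_rows = demo_conn.execute(
        "SELECT COUNT(*) FROM transactions_demo;"
    ).fetchone()[0]
finally:
    demo_conn.close()

print("Rows match:", imported_rows == len(demo_df))
```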

In [14]:
# Read CSV into DataFrame
df = pd.read_csv(csv_file_path, encoding='latin1')  # latin1 avoids decode errors from non-UTF-8 bytes in the file
print('DataFrame shape:', df.shape)

# Write to SQLite table
table_name = 'transactions'
df.to_sql(table_name, conn, if_exists='replace', index=False)
print(f"Data imported into table '{table_name}' successfully.")

DataFrame shape: (541909, 8)
Data imported into table 'transactions' successfully.


In [None]:
## Step 3: Explore Data Structure

Inspect the table schema (column names and types) and view sample data to confirm the import.
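
The column types reported by `PRAGMA table_info` follow from the pandas dtypes at import time. As a rough sketch on made-up data (the table name `schema_demo` and the sample values are assumptions), a numeric column containing a missing value becomes a float column in pandas and is therefore stored as `REAL`, which would explain a `REAL` type on an ID column.

```python
import sqlite3
import pandas as pd

demo_df = pd.DataFrame({
    "InvoiceNo": ["A1", "A2"],
    "Quantity": [6, 2],
    "CustomerID": [17850.0, None],  # a missing value forces a float column
})

demo_conn = sqlite3.connect(":memory:")
try:
    demo_df.to_sql("schema_demo", demo_conn, index=False)
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    rows = demo_conn.execute("PRAGMA table_info(schema_demo);").fetchall()
finally:
    demo_conn.close()

# Map each column name to the type SQLite recorded for it.
column_types = {name: col_type for _, name, col_type, *_ in rows}
print(column_types)
```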

In [15]:
# Get table schema
schema_query = f'PRAGMA table_info({table_name});'
schema = pd.read_sql_query(schema_query, conn)
print('Table Schema:')
print(schema[['name', 'type']])

# Get sample data (first 5 rows)
sample_data_query = f'SELECT * FROM {table_name} LIMIT 5;'
sample_data = pd.read_sql_query(sample_data_query, conn)
print('\nSample Data:')
print(sample_data)

# Get total row count
count_query = f'SELECT COUNT(*) AS row_count FROM {table_name};'
row_count = pd.read_sql_query(count_query, conn)
print('\nTotal Rows:', row_count['row_count'].iloc[0])

# Close connection
conn.close()
print('Database connection closed.')

Table Schema:
          name     type
0    InvoiceNo     TEXT
1    StockCode     TEXT
2  Description     TEXT
3     Quantity  INTEGER
4  InvoiceDate     TEXT
5    UnitPrice     REAL
6   CustomerID     REAL
7      Country     TEXT

Sample Data:
  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

      InvoiceDate  UnitPrice  CustomerID         Country  
0  12/1/2010 8:26       2.55     17850.0  United Kingdom  
1  12/1/2010 8:26       3.39     17850.0  United Kingdom  
2  12/1/2010 8:26       2.75     17850.0  United Kingdom  
3  12/1/2010 8:26       3.39     17850.0  United Kingdom  
4  12/1/2010 8:26       3.39

In [None]:
## Summary

- Dataset imported into SQLite table `transactions`.
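- Schema has eight columns: `InvoiceNo`, `StockCode`, `Description`, `Quantity`, `InvoiceDate`, `UnitPrice`, `CustomerID`, and `Country`. `InvoiceDate` is stored as TEXT and `CustomerID` as REAL.
- Total rows: 541,909 (the row count of the imported DataFrame).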
- Next steps: Clean and preprocess data in Phase 2.
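
As a hedged preview of the kind of checks Phase 2 might start with, the sketch below parses `InvoiceDate` strings and counts missing `CustomerID` values on a two-row sample. The month/day/year format string is an assumption inferred from the sample output, and the second row is made up for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "InvoiceDate": ["12/1/2010 8:26", "12/1/2010 8:28"],  # second value is made up
    "CustomerID": [17850.0, None],
})

# Assumed format: month/day/year hour:minute, as suggested by the sample rows.
sample["InvoiceDate"] = pd.to_datetime(sample["InvoiceDate"], format="%m/%d/%Y %H:%M")

# Rows without a customer cannot be attributed to anyone in an RFM analysis.
missing_customers = int(sample["CustomerID"].isna().sum())

print(sample.dtypes)
print("Rows missing CustomerID:", missing_customers)
```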