In [None]:
# Customer Segmentation Dashboard: Phase 1 - Data Collection and Preparation

**Objective**: Source and set up e-commerce customer data in an SQLite database for RFM analysis.

**Tools**:
- Python (pandas, sqlite3)
- SQLite database
- Dataset: Kaggle 'carrie1/ecommerce-data'

**Setup**: Download the dataset from Kaggle and confirm its path.

**Steps**:
1. Create an SQLite database.
2. Import the dataset into the database.
3. Explore the data structure (schema and sample rows).
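
Confirming the dataset path could be sketched as below. `find_csv` is a hypothetical helper (it is not part of kagglehub), and the sketch assumes the download directory contains a file named `data.csv` somewhere beneath it.

```python
from pathlib import Path


def find_csv(download_dir, filename="data.csv"):
    """Return the first file named `filename` found under `download_dir`.

    Hypothetical helper: searches recursively and raises if nothing matches.
    """
    matches = sorted(Path(download_dir).rglob(filename))
    if not matches:
        raise FileNotFoundError(f"{filename} not found under {download_dir}")
    return matches[0]
```

With the directory returned by `kagglehub.dataset_download`, a call such as `find_csv(path)` would give a path that can be passed to `pd.read_csv` directly, which avoids hardcoding the location.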

In [None]:
import pandas as pd
import sqlite3
import kagglehub

# Download dataset and confirm path
path = kagglehub.dataset_download('carrie1/ecommerce-data')
print('Path to dataset files:', path)

# Path to data.csv. The file was copied from the download directory above
# into ../data/; update this if it lives elsewhere.
csv_file_path = '../data/data.csv'

In [None]:
## Step 1: Set Up SQLite Database

Create an SQLite database (or open it, if it already exists) to store the e-commerce data. The database is saved as `ecommerce_data.db` in the project root, one level above the notebook's working directory.
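
A minimal sketch of the connection step, assuming a temporary directory stands in for the project root. `sqlite3.connect` does not create missing parent directories, so the sketch creates them first; the `db` subfolder is made up for illustration.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path

# Assumption: a temporary directory stands in for the project root.
project_root = Path(tempfile.mkdtemp())
db_path = project_root / "db" / "ecommerce_data.db"

# Create the parent directory first; connect() would fail without it.
db_path.parent.mkdir(parents=True, exist_ok=True)

# closing() ensures the connection is closed even if a query raises.
with closing(sqlite3.connect(db_path)) as conn:
    version = conn.execute("SELECT sqlite_version();").fetchone()[0]

print(f"Connected to {db_path} (SQLite {version})")
```

Wrapping the connection in `closing()` is one way to avoid leaving it open if a later cell fails before `conn.close()` is reached.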

In [None]:
# Create/connect to SQLite database
db_path = '../ecommerce_data.db'  # Store in project root
conn = sqlite3.connect(db_path)
print(f'Connected to SQLite database at {db_path}')

In [None]:
## Step 2: Import Dataset into SQLite

Read the CSV file into a pandas DataFrame and write it to an SQLite table named `transactions`. If the table exists, it will be replaced.
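
To illustrate the replace behaviour, the sketch below writes a small DataFrame to an in-memory database several times. The rows are made up for illustration and are not taken from the Kaggle dataset.

```python
import sqlite3

import pandas as pd

# Made-up rows for illustration only; not taken from the dataset.
demo = pd.DataFrame(
    {
        "InvoiceNo": ["A1", "A2", "A3"],
        "Quantity": [6, 2, 1],
        "UnitPrice": [2.55, 3.39, 7.65],
    }
)

conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk

# Writing twice with if_exists='replace' drops and recreates the table,
# so re-running the import does not duplicate rows.
demo.to_sql("transactions", conn, if_exists="replace", index=False)
demo.to_sql("transactions", conn, if_exists="replace", index=False)
rows_after_replace = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]

# By contrast, if_exists='append' adds the rows to the existing table.
demo.to_sql("transactions", conn, if_exists="append", index=False)
rows_after_append = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]

conn.close()
print("After replace:", rows_after_replace)
print("After append:", rows_after_append)
```

Using `replace` keeps the import cell safe to re-run, at the cost of discarding anything previously stored in the table.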

In [None]:
# Read CSV into DataFrame
# latin1 avoids the UnicodeDecodeError that the default UTF-8 decoding can raise on this file
df = pd.read_csv(csv_file_path, encoding='latin1')
print('DataFrame shape:', df.shape)

# Write to SQLite table
table_name = 'transactions'
df.to_sql(table_name, conn, if_exists='replace', index=False)
print(f"Data imported into table '{table_name}' successfully.")

In [None]:
## Step 3: Explore Data Structure

Inspect the table schema (column names and types) and view sample data to confirm the import.
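
As a rough guide to reading the schema output: when pandas writes through a plain `sqlite3` connection, it chooses the SQLite column types from the DataFrame dtypes. The sketch below shows this on a made-up three-column table (the values are invented, and the table name `demo` is hypothetical).

```python
import sqlite3

import pandas as pd

# Made-up rows for illustration only; not taken from the dataset.
demo = pd.DataFrame(
    {
        "InvoiceNo": ["A1", "A2"],   # strings
        "Quantity": [6, 2],          # integers
        "UnitPrice": [2.55, 3.39],   # floats
    }
)

conn = sqlite3.connect(":memory:")
demo.to_sql("demo", conn, index=False)

# PRAGMA table_info returns one row per column:
# cid, name, type, notnull, dflt_value, pk
schema = pd.read_sql_query("PRAGMA table_info(demo);", conn)
conn.close()

column_types = dict(zip(schema["name"], schema["type"]))
print(column_types)
```

A numeric column that contains missing values is held as float by pandas, so it would appear as `REAL` in the schema even if its values are whole numbers.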

In [None]:
# Get table schema
schema_query = f'PRAGMA table_info({table_name});'
schema = pd.read_sql_query(schema_query, conn)
print('Table Schema:')
print(schema[['name', 'type']])

# Get sample data (first 5 rows)
sample_data_query = f'SELECT * FROM {table_name} LIMIT 5;'
sample_data = pd.read_sql_query(sample_data_query, conn)
print('\nSample Data:')
print(sample_data)

# Get total row count
count_query = f'SELECT COUNT(*) AS row_count FROM {table_name};'
row_count = pd.read_sql_query(count_query, conn)
print('\nTotal Rows:', row_count['row_count'].iloc[0])

# Close connection
conn.close()
print('Database connection closed.')

In [None]:
## Summary

- Dataset imported into SQLite table `transactions`.
- Schema includes columns such as `InvoiceNo`, `CustomerID`, `InvoiceDate`, `Quantity`, and `UnitPrice`.
- Total rows: [Replace with row count from output].
- Next steps: Clean and preprocess data in Phase 2.