In [1]:
import pandas as pd
import sqlite3

# Exploratory Data Analysis (EDA)

This step opens a connection to the SQLite database file sales.db in the ../database/ directory. The connection is assigned to the variable conn and is used by every query in this notebook. Despite its name, db_directory holds the path to the database file, not to a directory.

In [2]:
db_directory = "../database/sales.db"
conn = sqlite3.connect(db_directory)
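
sqlite3.connect creates a new, empty database file when the path does not exist, so a mistyped path only surfaces later as a "no such table" error. A minimal sketch of a sanity check follows. It uses an in-memory database as a stand-in for sales.db, and the names demo_conn and table_names are illustrative assumptions rather than part of the notebook.

```python
import sqlite3

# In-memory database standing in for ../database/sales.db (assumption for illustration).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE orders (order_id TEXT, order_status TEXT)")

# sqlite_master lists every table stored in the database.
table_names = [
    row[0]
    for row in demo_conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(table_names)
demo_conn.close()
```

Running the same SELECT against conn should list the tables the queries below rely on; an empty list would suggest the path is wrong.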

## Monthly Sales Performance

This SQL query retrieves monthly sales performance data by joining the orders and order_items tables. It calculates:
- The total number of unique delivered orders (total_orders)
- The total revenue (revenue) from these orders, computed as the sum of item prices

The results are grouped by purchase month, ordered chronologically, and loaded into a DataFrame named order_sales_df for later visualization. A month with no delivered orders produces no row, which is why 2016-11 is absent from the output.

In [3]:
query="""
SELECT
    strftime('%Y-%m', o.order_purchase_timestamp) AS month,
    COUNT(DISTINCT o.order_id) AS total_orders,
    SUM(oi.price) AS revenue
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_status = 'delivered'
GROUP BY month
ORDER BY month;
"""

order_sales_df = pd.read_sql(query, conn)
order_sales_df.head(10)

| | month | total_orders | revenue |
|---|---|---|---|
| 0 | 2016-09 | 1 | 134.97 |
| 1 | 2016-10 | 265 | 40325.11 |
| 2 | 2016-12 | 1 | 10.9 |
| 3 | 2017-01 | 750 | 111798.36 |
| 4 | 2017-02 | 1653 | 234223.4 |
| 5 | 2017-03 | 2546 | 359198.85 |
| 6 | 2017-04 | 2303 | 340669.68 |
| 7 | 2017-05 | 3545 | 489159.25 |
| 8 | 2017-06 | 3135 | 421923.37 |
| 9 | 2017-07 | 3872 | 481604.52 |
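
The monthly query counts orders with COUNT(DISTINCT o.order_id) because the join yields one row per order item, not one row per order. The sketch below illustrates this on a toy in-memory database. The table contents, the extra item_rows column, and the names conn_demo and monthly_demo are assumptions made for the example.

```python
import sqlite3

import pandas as pd

# Toy data: order 'a' has two items, order 'b' has one, order 'c' is not delivered.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE orders (order_id TEXT, order_status TEXT, order_purchase_timestamp TEXT);
CREATE TABLE order_items (order_id TEXT, price REAL);
INSERT INTO orders VALUES
    ('a', 'delivered', '2017-01-05 10:00:00'),
    ('b', 'delivered', '2017-01-20 09:30:00'),
    ('c', 'canceled',  '2017-01-21 12:00:00');
INSERT INTO order_items VALUES ('a', 10.0), ('a', 5.0), ('b', 20.0), ('c', 99.0);
""")

monthly_demo = pd.read_sql(
    """
    SELECT
        strftime('%Y-%m', o.order_purchase_timestamp) AS month,
        COUNT(*) AS item_rows,                      -- one per joined item row
        COUNT(DISTINCT o.order_id) AS total_orders, -- one per order
        SUM(oi.price) AS revenue
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    WHERE o.order_status = 'delivered'
    GROUP BY month
    """,
    conn_demo,
)
conn_demo.close()
print(monthly_demo)
```

In this toy data the delivered orders contribute three item rows but only two orders, so a plain COUNT(*) would overstate the order count.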


## Sales by Product Category

This SQL query analyzes sales performance by product category. It joins the order_items, orders, products, and cat_translate tables to retrieve:
- The English name of the product category (category)
- The number of unique delivered orders (total_orders)
- The total revenue (revenue) generated by each category

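The results are grouped by product category and sorted in descending order based on revenue. Because every join is an inner join, items whose product has no category, or whose category has no entry in cat_translate, are excluded from the totals.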

In [4]:
query="""
SELECT
    ct.product_category_name_english AS category,
    COUNT(DISTINCT o.order_id) AS total_orders,
    SUM(oi.price) AS revenue
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN cat_translate ct ON p.product_category_name = ct.product_category_name
WHERE o.order_status = 'delivered'
GROUP BY category
ORDER BY revenue DESC;
"""

category_sales_df = pd.read_sql(query, conn)
category_sales_df.head()

| | category | total_orders | revenue |
|---|---|---|---|
| 0 | health_beauty | 8647 | 1233131.72 |
| 1 | watches_gifts | 5493 | 1165898.98 |
| 2 | bed_bath_table | 9272 | 1023434.76 |
| 3 | sports_leisure | 7529 | 954673.55 |
| 4 | computers_accessories | 6529 | 888613.62 |


## Sales by State

This SQL query examines total sales and order volume by Brazilian state. It performs the following operations:
- Retrieves delivered orders and their corresponding customer zip code prefixes.
- Joins with the order_items table to calculate total sales (total_sales).
- Joins with the geolocation table to map zip code prefixes to their respective states (geolocation_state).

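The data is grouped by state and sorted in descending order of total sales.

Caution: the total_sales values in the output are inflated. The geolocation table evidently holds more than one row per zip code prefix, so the join repeats each order item once per matching row before SUM(oi.price) is applied. The SP total of roughly 459.7 million is far more than the monthly revenue figures, which are a few hundred thousand each, could add up to. total_orders is unaffected because it uses COUNT(DISTINCT ...).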

In [5]:
query = """
SELECT 
    g.geolocation_state, 
    COUNT(DISTINCT oc.order_id) AS total_orders, 
    SUM(oi.price) AS total_sales
FROM 
    (SELECT
        o.order_id,
        c.customer_zip_code_prefix
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    WHERE o.order_status = 'delivered'
) oc
JOIN order_items oi ON oc.order_id = oi.order_id
JOIN geolocation g ON oc.customer_zip_code_prefix = g.geolocation_zip_code_prefix
GROUP BY g.geolocation_state
ORDER BY total_sales DESC;
"""

state_sales_df = pd.read_sql_query(query, conn)
state_sales_df.head()

| | geolocation_state | total_orders | total_sales |
|---|---|---|---|
| 0 | SP | 40480 | 459653300.0 |
| 1 | MG | 11345 | 298350500.0 |
| 2 | RJ | 12337 | 222452800.0 |
| 3 | RS | 5351 | 79214660.0 |
| 4 | PR | 4912 | 59076610.0 |
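
Joining to a lookup table that has several rows per key multiplies the rows being summed. The sketch below reproduces that effect on toy data and shows one possible remedy, which is to join against a de-duplicated (prefix, state) list. The simplified order_items table with a zip_prefix column and the names conn_demo, naive_total and deduped_total are assumptions for illustration. Grouping by customers.customer_state instead would be another option.

```python
import sqlite3

# Toy data: one item worth 100.0, and three geolocation rows for the same prefix.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE order_items (order_id TEXT, zip_prefix TEXT, price REAL);
CREATE TABLE geolocation (geolocation_zip_code_prefix TEXT, geolocation_state TEXT);
INSERT INTO order_items VALUES ('a', '01000', 100.0);
INSERT INTO geolocation VALUES ('01000', 'SP'), ('01000', 'SP'), ('01000', 'SP');
""")

# Naive join: the single item is repeated once per geolocation row.
naive_total = conn_demo.execute("""
    SELECT SUM(oi.price)
    FROM order_items oi
    JOIN geolocation g ON oi.zip_prefix = g.geolocation_zip_code_prefix
""").fetchone()[0]

# De-duplicated join: each prefix maps to each state at most once.
deduped_total = conn_demo.execute("""
    SELECT SUM(oi.price)
    FROM order_items oi
    JOIN (
        SELECT DISTINCT geolocation_zip_code_prefix, geolocation_state
        FROM geolocation
    ) g ON oi.zip_prefix = g.geolocation_zip_code_prefix
""").fetchone()[0]
conn_demo.close()

print(naive_total, deduped_total)
```

If a prefix ever maps to more than one state, the de-duplicated join would still count that item once per state, so the mapping is worth checking before relying on this approach.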


## Customers with High Purchase Frequency

This SQL query finds customers who have placed more than two orders. It applies no order_status filter, so canceled and undelivered orders count as well. The query performs the following steps:
- Joins the orders and customers tables.
- Groups the data by customer_unique_id.
- Filters to include only customers with more than two orders (HAVING COUNT(o.order_id) > 2).
- Orders the results by the total number of orders in descending order.

In [9]:
query="""
SELECT
    c.customer_unique_id AS customer,
    COUNT(o.order_id) as total_order
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY customer
HAVING COUNT(o.order_id) > 2
ORDER BY total_order DESC
;
"""

customer_retention_df = pd.read_sql(query, conn)
customer_retention_df.head()

| | customer | total_order |
|---|---|---|
| 0 | 8d50f5eadf50201ccdcedfb9e2ac8455 | 17 |
| 1 | 3e43e6105506432c953e165fb2acf44c | 9 |
| 2 | ca77025e7201e3b30c44b472ff346268 | 7 |
| 3 | 6469f99c1f9dfae7733b25662e7f1782 | 7 |
| 4 | 1b6c7548a2a1f9037c1fd3ddfed95f33 | 7 |
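
Whether a customer passes the HAVING threshold can depend on which order statuses are counted. The sketch below compares the query as written with a delivered-only variant on toy data. The variant, the toy rows, and the names conn_demo, all_status_df and delivered_df are assumptions for illustration, not something the notebook runs.

```python
import sqlite3

import pandas as pd

# Toy data: customer u1 has three orders, one of which was canceled.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE customers (customer_id TEXT, customer_unique_id TEXT);
CREATE TABLE orders (order_id TEXT, customer_id TEXT, order_status TEXT);
INSERT INTO customers VALUES ('c1', 'u1'), ('c2', 'u1'), ('c3', 'u1');
INSERT INTO orders VALUES
    ('o1', 'c1', 'delivered'),
    ('o2', 'c2', 'delivered'),
    ('o3', 'c3', 'canceled');
""")

base_query = """
SELECT c.customer_unique_id AS customer, COUNT(o.order_id) AS total_order
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
{where}
GROUP BY customer
HAVING COUNT(o.order_id) > 2
"""

# Same query as the notebook: every status counts.
all_status_df = pd.read_sql(base_query.format(where=""), conn_demo)
# Hypothetical variant: only delivered orders count.
delivered_df = pd.read_sql(
    base_query.format(where="WHERE o.order_status = 'delivered'"), conn_demo
)
conn_demo.close()

print(len(all_status_df), len(delivered_df))
```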


## Customer Distribution by State

This SQL query analyzes the distribution of customers across different states. It performs the following operations:
- Groups the data by customer_state.
- Counts the number of unique customers (customer_unique_id) in each state.
- Orders the results by the total number of customers in descending order.

In [10]:
query="""
SELECT 
    customer_state, 
    COUNT(DISTINCT customer_unique_id) AS total_customers
FROM customers
GROUP BY customer_state
ORDER BY total_customers DESC;
"""

geo_df = pd.read_sql(query, conn)
geo_df.head()

| | customer_state | total_customers |
|---|---|---|
| 0 | SP | 40302 |
| 1 | RJ | 12384 |
| 2 | MG | 11259 |
| 3 | RS | 5277 |
| 4 | PR | 4882 |


## Order Delivery Delays

This SQL query calculates delivery delays and the actual versus estimated delivery times for each order. It performs the following operations:
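- Computes the number of days of delay (delay_days) by subtracting the estimated delivery date from the actual delivery date, so positive values mean late deliveries and negative values mean early ones. CAST(... AS INTEGER) drops the fractional part, truncating toward zero.
- Calculates the estimated_delivery_days (difference between the purchase date and estimated delivery date) and actual_delivery_days (difference between the purchase date and actual delivery date), truncated to whole days in the same way.
- Keeps only orders that have an actual delivery date (order_delivered_customer_date IS NOT NULL).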
- Orders the results by delay_days in descending order to identify the most delayed orders.

In [11]:
query = """
SELECT
    CAST(JULIANDAY(order_delivered_customer_date) - JULIANDAY(order_estimated_delivery_date) AS INTEGER) as delay_days,
    order_id,
    order_purchase_timestamp,
    order_estimated_delivery_date,
    order_delivered_customer_date,
    CAST(JULIANDAY(order_estimated_delivery_date) - JULIANDAY(order_purchase_timestamp) AS INTEGER) AS estimated_delivery_days,
    CAST(JULIANDAY(order_delivered_customer_date) - JULIANDAY(order_purchase_timestamp) AS INTEGER) AS actual_delivery_days
FROM orders
WHERE order_delivered_customer_date IS NOT NULL
ORDER BY delay_days DESC;
"""

delivery_df = pd.read_sql(query, conn)
delivery_df.head()

| | delay_days | order_id | order_purchase_timestamp | order_estimated_delivery_date | order_delivered_customer_date | estimated_delivery_days | actual_delivery_days |
|---|---|---|---|---|---|---|---|
| 0 | 188 | 1b3190b2dfa9d789e1f14c05b647a14a | 2018-02-23 14:57:35 | 2018-03-15 00:00:00 | 2018-09-19 23:24:07 | 19 | 208 |
| 1 | 181 | ca07593549f1816d26a572e06dc1eab6 | 2017-02-21 23:31:27 | 2017-03-22 00:00:00 | 2017-09-19 14:36:39 | 28 | 209 |
| 2 | 175 | 47b40429ed8cce3aee9199792275433f | 2018-01-03 09:44:01 | 2018-01-19 00:00:00 | 2018-07-13 20:51:31 | 15 | 191 |
| 3 | 167 | 2fe324febf907e3ea3f2aa9650869fa5 | 2017-03-13 20:17:10 | 2017-04-05 00:00:00 | 2017-09-19 17:00:07 | 22 | 189 |
| 4 | 166 | 285ab9426d6982034523a855f55a885e | 2017-03-08 22:47:40 | 2017-04-06 00:00:00 | 2017-09-19 14:00:04 | 28 | 194 |
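
The JULIANDAY subtraction returns a fractional number of days, and CAST(... AS INTEGER) truncates it toward zero. The sketch below checks this with the dates from the first output row and with a hypothetical order delivered half a day early. The helper delay and the second pair of dates are assumptions for illustration.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")

def delay(delivered: str, estimated: str) -> tuple:
    """Return (fractional delay in days, truncated delay) as computed by SQLite."""
    return conn_demo.execute(
        """
        SELECT
            JULIANDAY(?) - JULIANDAY(?),
            CAST(JULIANDAY(?) - JULIANDAY(?) AS INTEGER)
        """,
        (delivered, estimated, delivered, estimated),
    ).fetchone()

# Dates taken from the first row of the output above.
late_raw, late_days = delay("2018-09-19 23:24:07", "2018-03-15 00:00:00")
# Hypothetical order delivered twelve hours before the estimate.
early_raw, early_days = delay("2018-01-10 12:00:00", "2018-01-11 00:00:00")
conn_demo.close()

print(late_raw, late_days)
print(early_raw, early_days)
```

The second case gets delay_days = 0 even though the raw difference is negative, so orders that arrive less than a day early are indistinguishable from orders that arrive exactly on time.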


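Calling conn.close() closes the connection to the SQLite database and releases the underlying file handle. Note that close() does not commit a pending transaction, so any uncommitted changes would be discarded. This notebook only reads from the database, so nothing needs to be committed. The DataFrames created above remain available in memory after the connection is closed.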

In [14]:
conn.close()
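
One way to make sure the connection is closed even if a query raises is to wrap it in contextlib.closing. A sqlite3 connection used directly in a with statement manages the transaction but does not close the connection. The sketch below is a minimal illustration on an in-memory database, with demo_conn and demo_df as assumed names.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# closing() calls demo_conn.close() when the block exits, even after an exception.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute("CREATE TABLE orders (order_id TEXT)")
    demo_conn.execute("INSERT INTO orders VALUES ('a'), ('b')")
    demo_df = pd.read_sql("SELECT COUNT(*) AS n FROM orders", demo_conn)

# The DataFrame is still usable here, but the connection is not.
print(demo_df)
```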

## Saving DataFrames to CSV Files

In this step, the following DataFrames are saved as CSV files in the ../data/processed/ directory:
- order_sales_df is saved as order_sales.csv.
- category_sales_df is saved as category_sales.csv.
- state_sales_df is saved as state_sales.csv.
- customer_retention_df is saved as cust_retention.csv.
- geo_df is saved as customer_state.csv.
- delivery_df is saved as delivery_days.csv.

The index=False parameter ensures that the DataFrame index is not included as an additional column in the CSV files.

In [None]:
order_sales_df.to_csv('../data/processed/order_sales.csv', index=False)
category_sales_df.to_csv('../data/processed/category_sales.csv', index=False)
state_sales_df.to_csv('../data/processed/state_sales.csv', index=False)
customer_retention_df.to_csv('../data/processed/cust_retention.csv', index=False)
geo_df.to_csv('../data/processed/customer_state.csv', index=False)
delivery_df.to_csv('../data/processed/delivery_days.csv', index=False)
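
DataFrame.to_csv does not create missing directories, so the calls above fail if ../data/processed/ does not exist yet. The sketch below shows one way to create the directory first and write several frames in a loop. It writes to a temporary directory as a stand-in for ../data/processed/, and the frames dictionary with its toy contents is an assumption for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Temporary directory standing in for ../data/processed/ (assumption for illustration).
out_dir = Path(tempfile.mkdtemp()) / "processed"
out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

# Map each output file stem to its DataFrame; toy data replaces the real frames.
frames = {
    "order_sales": pd.DataFrame({"month": ["2017-01"], "total_orders": [750]}),
    "customer_state": pd.DataFrame({"customer_state": ["SP"], "total_customers": [40302]}),
}

for name, frame in frames.items():
    frame.to_csv(out_dir / f"{name}.csv", index=False)

# Reading a file back confirms that no index column was written.
round_trip = pd.read_csv(out_dir / "order_sales.csv")
print(list(round_trip.columns))
```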