# Case Study #7: Balanced Tree Clothing Co.
The case study questions presented here were created by [**Data With Danny**](https://linktr.ee/datawithdanny). They are part of the [**8 Week SQL Challenge**](https://8weeksqlchallenge.com/).

My SQL queries are written in the `PostgreSQL 15` dialect and run from a `Jupyter Notebook`, which lets us view the query results immediately and document the queries alongside them.

For more details about the **Case Study #7**, click [**here**](https://8weeksqlchallenge.com/case-study-7/).

## Table of Contents
### [1. Importing Libraries](#Import)
### [2. Tables of the Database](#Tables)
### [3. Case Study Questions](#CaseStudyQuestions)
- [A. High Level Sales Analysis](#A)
- [B. Transaction Analysis](#B)
- [C. Product Analysis](#C)
- [D. Bonus Challenge](#D)

___
<a id = 'Import'></a>
## 1. Importing Libraries

In [1]:
import pandas as pd
import psycopg2 as pg2
import os
import warnings

warnings.filterwarnings("ignore")

### Connecting to the database from Jupyter Notebook

In [2]:
# Get PostgreSQL password
mypassword = os.getenv("POSTGRESQL_PASSWORD")

try:
    conn = pg2.connect(user = 'postgres', password = mypassword, database = 'balanced_tree')
    cursor = conn.cursor()
    print("Database connection successful")
except pg2.Error as err:
    print(f"Error: '{err}'")

Database connection successful


___
<a id = 'Tables'></a>
## 2. Tables of the database

First, let's verify that the `balanced_tree` schema of the connected database contains the 4 expected tables.

In [3]:
cursor.execute("""
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'balanced_tree'
""")

table_names = []
print('--- Tables within "balanced_tree" database --- ')
for table in cursor:
    print(table[1])
    table_names.append(table[1])

--- Tables within "balanced_tree" database --- 
product_hierarchy
product_prices
product_details
sales


Here are the 4 datasets of the "balanced_tree" database. For more details about each dataset, please click [**here**](https://8weeksqlchallenge.com/case-study-7/).

In [4]:
for table in table_names:
    print("Table: ", table)
    display(pd.read_sql("SELECT * FROM balanced_tree." + table, conn))

Table:  product_hierarchy


Unnamed: 0,id,parent_id,level_text,level_name
0,1,,Womens,Category
1,2,,Mens,Category
2,3,1.0,Jeans,Segment
3,4,1.0,Jacket,Segment
4,5,2.0,Shirt,Segment
5,6,2.0,Socks,Segment
6,7,3.0,Navy Oversized,Style
7,8,3.0,Black Straight,Style
8,9,3.0,Cream Relaxed,Style
9,10,4.0,Khaki Suit,Style


Table:  product_prices


Unnamed: 0,id,product_id,price
0,7,c4a632,13
1,8,e83aa3,32
2,9,e31d39,10
3,10,d5e9a6,23
4,11,72f5d4,19
5,12,9ec847,54
6,13,5d267b,40
7,14,c8d436,10
8,15,2a2353,57
9,16,f084eb,36


Table:  product_details


Unnamed: 0,product_id,price,product_name,category_id,segment_id,style_id,category_name,segment_name,style_name
0,c4a632,13,Navy Oversized Jeans - Womens,1,3,7,Womens,Jeans,Navy Oversized
1,e83aa3,32,Black Straight Jeans - Womens,1,3,8,Womens,Jeans,Black Straight
2,e31d39,10,Cream Relaxed Jeans - Womens,1,3,9,Womens,Jeans,Cream Relaxed
3,d5e9a6,23,Khaki Suit Jacket - Womens,1,4,10,Womens,Jacket,Khaki Suit
4,72f5d4,19,Indigo Rain Jacket - Womens,1,4,11,Womens,Jacket,Indigo Rain
5,9ec847,54,Grey Fashion Jacket - Womens,1,4,12,Womens,Jacket,Grey Fashion
6,5d267b,40,White Tee Shirt - Mens,2,5,13,Mens,Shirt,White Tee
7,c8d436,10,Teal Button Up Shirt - Mens,2,5,14,Mens,Shirt,Teal Button Up
8,2a2353,57,Blue Polo Shirt - Mens,2,5,15,Mens,Shirt,Blue Polo
9,f084eb,36,Navy Solid Socks - Mens,2,6,16,Mens,Socks,Navy Solid


Table:  sales


Unnamed: 0,prod_id,qty,price,discount,member,txn_id,start_txn_time
0,c4a632,4,13,17,True,54f307,2021-02-13 01:59:43.296000
1,5d267b,4,40,17,True,54f307,2021-02-13 01:59:43.296000
2,b9a74d,4,17,17,True,54f307,2021-02-13 01:59:43.296000
3,2feb6b,2,29,17,True,54f307,2021-02-13 01:59:43.296000
4,c4a632,5,13,21,True,26cc98,2021-01-19 01:39:00.345600
...,...,...,...,...,...,...,...
15090,9ec847,1,54,13,True,f15ab3,2021-03-20 12:01:22.944000
15091,2a2353,3,57,13,True,f15ab3,2021-03-20 12:01:22.944000
15092,e83aa3,5,32,1,True,93620b,2021-03-01 07:11:24.662400
15093,d5e9a6,2,23,1,True,93620b,2021-03-01 07:11:24.662400


___
<a id = 'CaseStudyQuestions'></a>
## 3. Case Study Questions

<a id = 'A'></a>
## A. High Level Sales Analysis

#### 1. What was the total quantity sold for all products?

In [5]:
pd.read_sql("""
SELECT TO_CHAR(SUM(qty), 'FM 999,999') AS total_quantity
FROM balanced_tree.sales
""", conn)

Unnamed: 0,total_quantity
0,45216


Alternatively, we can break down the total quantity by product.

In [6]:
pd.read_sql("""
SELECT 
    pd.product_name, 
    SUM(s.qty) AS total_quantity
FROM balanced_tree.sales s
JOIN balanced_tree.product_details pd ON s.prod_id = pd.product_id
GROUP BY pd.product_name
ORDER BY total_quantity DESC
""", conn)

Unnamed: 0,product_name,total_quantity
0,Grey Fashion Jacket - Womens,3876
1,Navy Oversized Jeans - Womens,3856
2,Blue Polo Shirt - Mens,3819
3,White Tee Shirt - Mens,3800
4,Navy Solid Socks - Mens,3792
5,Black Straight Jeans - Womens,3786
6,Pink Fluro Polkadot Socks - Mens,3770
7,Indigo Rain Jacket - Womens,3757
8,Khaki Suit Jacket - Womens,3752
9,Cream Relaxed Jeans - Womens,3707


___
#### 2. What is the total generated revenue for all products before discounts?

In [7]:
pd.read_sql("""
SELECT TO_CHAR(SUM(qty*price), 'FM$ 999,999,999.99') AS total_sales
FROM balanced_tree.sales
""", conn)

Unnamed: 0,total_sales
0,"$ 1,289,453."


Alternatively, we can break down the total sales by product.

In [8]:
pd.read_sql("""
SELECT 
    product_name, 
    TO_CHAR(total_sales, 'FM$ 999,999')  AS total_sales
FROM
(
    SELECT pd.product_name, SUM(s.qty * s.price) AS total_sales
    FROM balanced_tree.sales s
    JOIN balanced_tree.product_details pd ON s.prod_id = pd.product_id
    GROUP BY pd.product_name
) ts
ORDER BY ts.total_sales DESC   -- sort on the numeric subquery column, not the formatted text
""", conn)

Unnamed: 0,product_name,total_sales
0,Blue Polo Shirt - Mens,"$ 217,683"
1,Grey Fashion Jacket - Womens,"$ 209,304"
2,White Tee Shirt - Mens,"$ 152,000"
3,Navy Solid Socks - Mens,"$ 136,512"
4,Black Straight Jeans - Womens,"$ 121,152"
5,Pink Fluro Polkadot Socks - Mens,"$ 109,330"
6,Khaki Suit Jacket - Womens,"$ 86,296"
7,Indigo Rain Jacket - Womens,"$ 71,383"
8,White Striped Socks - Mens,"$ 62,135"
9,Navy Oversized Jeans - Womens,"$ 50,128"


___
#### 3. What was the total discount amount for all products?

In [9]:
pd.read_sql("""
SELECT TO_CHAR(SUM(qty * price * (discount/100::NUMERIC)), 'FM$ 999,999.99') AS total_discounts
FROM balanced_tree.sales
""", conn)

Unnamed: 0,total_discounts
0,"$ 156,229.14"
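As a sanity check on the formula, the arithmetic for a single row can be worked by hand. The sketch below uses the first row of the `sales` preview shown earlier (qty 4, price 13, discount 17) and treats `discount` as a whole-number percentage, as the query does. The helper name `line_amounts` is hypothetical.

```python
from decimal import Decimal

def line_amounts(qty, price, discount_pct):
    """Return (gross, discount, net) for one sales row; discount_pct is a whole-number percentage."""
    gross = Decimal(qty) * Decimal(price)
    discount = gross * Decimal(discount_pct) / Decimal(100)
    return gross, discount, gross - discount

# First row of the sales preview: qty=4, price=13, discount=17
gross, discount, net = line_amounts(4, 13, 17)
print(gross, discount, net)
```

In the SQL, the `100::NUMERIC` cast is applied to the divisor before the division, so `discount/100::NUMERIC` is a numeric division rather than an integer one.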


___
<a id = 'B'></a>
## B. Transaction Analysis

#### 1. How many unique transactions were there?

In [10]:
pd.read_sql("""
SELECT COUNT(DISTINCT txn_id) AS nb_transactions
FROM balanced_tree.sales
""", conn)

Unnamed: 0,nb_transactions
0,2500


**Result**<br>
There are 2,500 unique transactions made at Balanced Tree Clothing Co.

___
#### 2. What is the average unique products purchased in each transaction?

In [11]:
pd.read_sql("""
WITH count_unique_product_ids_cte AS
(
    SELECT txn_id, COUNT(DISTINCT prod_id) AS nb_unique_products
    FROM balanced_tree.sales
    GROUP BY txn_id
)
SELECT ROUND(AVG(nb_unique_products), 1) AS avg_unique_products
FROM count_unique_product_ids_cte
""", conn)

Unnamed: 0,avg_unique_products
0,6.0


**Result**<br>
The average number of distinct products purchased per transaction is 6.

___
#### 3. What are the 25th, 50th and 75th percentile values for the revenue per transaction?

In [12]:
pd.read_sql("""
WITH revenue_per_transaction_cte AS
(
    SELECT 
        txn_id, 
        SUM((qty * price) * (1 - discount/100::NUMERIC)) AS revenue
    FROM balanced_tree.sales
    GROUP BY txn_id
)
SELECT 
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY revenue) AS percentile_25th,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY revenue) AS percentile_50th,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY revenue) AS percentile_75th
FROM revenue_per_transaction_cte
""", conn)

Unnamed: 0,percentile_25th,percentile_50th,percentile_75th
0,326.405,441.225,572.7625
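`PERCENTILE_CONT` interpolates linearly between neighbouring values instead of picking an existing row, which is why the results above are not necessarily revenues of actual transactions. A minimal sketch of that interpolation, assuming made-up revenues (the function name `percentile_cont` and the sample values are illustrative, not taken from the dataset):

```python
import math

def percentile_cont(values, fraction):
    """Linearly interpolated percentile, mirroring the idea behind PERCENTILE_CONT."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)   # fractional index into the sorted list
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower                  # how far we are between the two neighbours
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight

sample_revenues = [10.0, 20.0, 30.0, 40.0]     # hypothetical per-transaction revenues
quartiles = [percentile_cont(sample_revenues, f) for f in (0.25, 0.50, 0.75)]
print(quartiles)
```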


___
#### 4. What is the average discount value per transaction?

In [13]:
pd.read_sql("""
SELECT 
    CONCAT('$ ', ROUND(AVG(discount_value),2)) AS average_discount
FROM 
(
    SELECT txn_id, SUM(qty * price * (discount/100::NUMERIC)) AS discount_value
    FROM balanced_tree.sales
    GROUP BY txn_id
) dv
""", conn)

Unnamed: 0,average_discount
0,$ 62.49


___
#### 5. What is the percentage split of all transactions for members vs non-members?

We can use the `total revenue` as the metric for the percentage split between members and non-members. The analysis shows that members bought more at Balanced Tree Clothing Co. than non-members: members account for 60.31% of revenue and non-members for 39.69%.

In [14]:
pd.read_sql("""
WITH txn_revenue_cte AS
(
    SELECT txn_id, member, SUM((qty * price) * (1 - discount/100::NUMERIC)) AS revenue
    FROM balanced_tree.sales
    GROUP BY txn_id, member
)
SELECT 
    member, 
    CONCAT(ROUND(SUM(revenue)/(SELECT SUM(revenue) FROM txn_revenue_cte)::NUMERIC * 100,2), ' %') AS percentage_split
FROM txn_revenue_cte
GROUP BY member
""", conn)

Unnamed: 0,member,percentage_split
0,False,39.69 %
1,True,60.31 %


Alternatively, we can use the `number of transactions` made by members and non-members as the metric. The query below counts rows of `sales` (one row per product in a transaction) rather than distinct `txn_id` values, so it is a proxy for the transaction count. Even so, the percentage split stays close to the revenue-based figures.

In [15]:
pd.read_sql("""
SELECT 
    CONCAT(ROUND(SUM(CASE WHEN member = True THEN 1 ELSE 0 END)/COUNT(*)::NUMERIC * 100,2), ' %') AS member_percentage,
    CONCAT(ROUND(SUM(CASE WHEN member = False THEN 1 ELSE 0 END)/COUNT(*)::NUMERIC * 100,2), ' %') AS nonmember_percentage
FROM balanced_tree.sales
""", conn)

Unnamed: 0,member_percentage,nonmember_percentage
0,60.03 %,39.97 %
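Because a row in `sales` is one product within a transaction, a row-level share and a transaction-level share can differ. The sketch below, using invented rows (the tuples and variable names are hypothetical), computes both so the two ways of counting can be compared:

```python
# Each tuple is (txn_id, member); one tuple per product line, as in the sales table.
rows = [
    ("t1", True), ("t1", True), ("t1", True),   # member transaction with 3 lines
    ("t2", False),                              # non-member transaction with 1 line
    ("t3", True), ("t3", True),                 # member transaction with 2 lines
    ("t4", False), ("t4", False),               # non-member transaction with 2 lines
]

# Share of rows belonging to members (what COUNT(*) over sales measures).
member_row_share = sum(1 for _, member in rows if member) / len(rows) * 100

# Share of distinct transactions made by members.
txn_member = {txn_id: member for txn_id, member in rows}
member_txn_share = sum(txn_member.values()) / len(txn_member) * 100

print(round(member_row_share, 2), round(member_txn_share, 2))
```

In SQL, the transaction-level variant would typically count `DISTINCT txn_id` per `member` value instead of counting rows.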


___
#### 6. What is the average revenue for member transactions and non-member transactions?

In [16]:
pd.read_sql("""
WITH txn_revenue_cte AS
(
    SELECT txn_id, member, SUM((qty * price) * (1 - discount/100::NUMERIC)) AS revenue 
    FROM balanced_tree.sales
    GROUP BY txn_id, member
)
SELECT member, CONCAT('$ ', ROUND(AVG(revenue),2)) AS avg_revenue
FROM txn_revenue_cte
GROUP BY member
""", conn)

Unnamed: 0,member,avg_revenue
0,False,$ 452.01
1,True,$ 454.14


___
<a id = 'C'></a>
## C. Product Analysis

#### 1. What are the top 3 products by total revenue before discount?

In [17]:
pd.read_sql("""
SELECT product_name, TO_CHAR(total_revenue, 'FM$ 999,999') AS total_revenue
FROM
(
    SELECT 
        pd.product_name, 
        SUM(s.qty * s.price) AS total_revenue
    FROM balanced_tree.sales s
    JOIN balanced_tree.product_details pd ON s.prod_id = pd.product_id
    GROUP BY pd.product_name
) tr
ORDER BY tr.total_revenue DESC   -- sort on the numeric subquery column, not the formatted text
LIMIT 3
""", conn)

Unnamed: 0,product_name,total_revenue
0,Blue Polo Shirt - Mens,"$ 217,683"
1,Grey Fashion Jacket - Womens,"$ 209,304"
2,White Tee Shirt - Mens,"$ 152,000"


___
#### 2. What is the total quantity, revenue and discount for each segment?

In [18]:
pd.read_sql("""
SELECT 
    pd.segment_name AS segment, 
    TO_CHAR(SUM(s.qty), 'FM 999,999') AS total_quantity, 
    TO_CHAR(SUM(s.qty * s.price), 'FM$ 999,999') AS total_revenue, 
    TO_CHAR(SUM(s.qty * s.price * s.discount/100::NUMERIC), 'FM$ 999,999') AS total_discounts
FROM balanced_tree.sales s
JOIN balanced_tree.product_details pd ON s.prod_id = pd.product_id
GROUP BY pd.segment_name
""", conn)

Unnamed: 0,segment,total_quantity,total_revenue,total_discounts
0,Shirt,11265,"$ 406,143","$ 49,594"
1,Jeans,11349,"$ 208,350","$ 25,344"
2,Jacket,11385,"$ 366,983","$ 44,277"
3,Socks,11217,"$ 307,977","$ 37,013"


___
#### 3. What is the top selling product for each segment?

In [19]:
pd.read_sql("""
WITH sales_cte AS
(
    SELECT 
        pd.segment_name AS segment, 
        pd.product_name AS product, 
        SUM(s.qty) AS total_sold_items,
        DENSE_RANK() OVER (PARTITION BY pd.segment_name ORDER BY SUM(s.qty) DESC) AS rank
    FROM balanced_tree.sales s
    JOIN balanced_tree.product_details pd ON s.prod_id = pd.product_id
    GROUP BY pd.segment_name, pd.product_name
    ORDER BY pd.segment_name
)
SELECT 
    segment, 
    product, 
    total_sold_items 
FROM sales_cte
WHERE rank = 1
""", conn)

Unnamed: 0,segment,product,total_sold_items
0,Jacket,Grey Fashion Jacket - Womens,3876
1,Jeans,Navy Oversized Jeans - Womens,3856
2,Shirt,Blue Polo Shirt - Mens,3819
3,Socks,Navy Solid Socks - Mens,3792


___
#### 4. What is the total quantity, revenue and discount for each category?

In [20]:
pd.read_sql("""
SELECT 
    category,
    TO_CHAR(total_quantity, 'FM 999,999') AS total_quantity,
    TO_CHAR(total_revenue, 'FM$ 999,999') AS total_revenue,
    TO_CHAR(total_discount, 'FM$ 999,999.99') AS total_discount
FROM
(
    SELECT 
        pd.category_name AS category, 
        SUM(s.qty) AS total_quantity, 
        SUM(s.qty * s.price) AS total_revenue, 
        SUM(s.qty * s.price * s.discount/100::NUMERIC) AS total_discount
    FROM balanced_tree.sales s
    JOIN balanced_tree.product_details pd ON s.prod_id = pd.product_id
    GROUP BY pd.category_name
) t
""", conn)

Unnamed: 0,category,total_quantity,total_revenue,total_discount
0,Mens,22482,"$ 714,120","$ 86,607.71"
1,Womens,22734,"$ 575,333","$ 69,621.43"


___
#### 5. What is the top selling product for each category?

In [21]:
pd.read_sql("""
WITH category_sales_cte AS
(
    SELECT 
        pd.category_name AS category, 
        pd.product_name AS product, 
        SUM(s.qty) AS total_sold_items,
        RANK() OVER (PARTITION BY pd.category_name ORDER BY SUM(s.qty) DESC) AS rank
    FROM balanced_tree.sales s
    JOIN balanced_tree.product_details pd ON s.prod_id = pd.product_id
    GROUP BY 1,2
)
SELECT 
    category, 
    product, 
    TO_CHAR(total_sold_items, 'FM 999,999') AS total_sold_items
FROM category_sales_cte 
WHERE rank = 1 
""", conn)

Unnamed: 0,category,product,total_sold_items
0,Mens,Blue Polo Shirt - Mens,3819
1,Womens,Grey Fashion Jacket - Womens,3876


___
#### 6. What is the percentage split of revenue by product for each segment?

In [22]:
pd.read_sql("""
SELECT 
    pd.segment_name AS segment, 
    pd.product_name AS product,
    TO_CHAR(SUM(s.qty * s.price), 'FM$ 999,999') AS total_revenue,
    CONCAT(ROUND(SUM(s.qty * s.price)/SUM(SUM(s.qty * s.price)) OVER (PARTITION BY pd.segment_name)::NUMERIC * 100,1), ' %') AS percent_split
FROM balanced_tree.sales s
JOIN balanced_tree.product_details pd ON s.prod_id = pd.product_id
GROUP BY pd.segment_name, pd.product_name
ORDER BY pd.segment_name, SUM(s.qty * s.price)   -- sort on numeric revenue; percent_split is text
""", conn)

Unnamed: 0,segment,product,total_revenue,percent_split
0,Jacket,Indigo Rain Jacket - Womens,"$ 71,383",19.5 %
1,Jacket,Khaki Suit Jacket - Womens,"$ 86,296",23.5 %
2,Jacket,Grey Fashion Jacket - Womens,"$ 209,304",57.0 %
3,Jeans,Cream Relaxed Jeans - Womens,"$ 37,070",17.8 %
4,Jeans,Navy Oversized Jeans - Womens,"$ 50,128",24.1 %
5,Jeans,Black Straight Jeans - Womens,"$ 121,152",58.1 %
6,Shirt,Teal Button Up Shirt - Mens,"$ 36,460",9.0 %
7,Shirt,White Tee Shirt - Mens,"$ 152,000",37.4 %
8,Shirt,Blue Polo Shirt - Mens,"$ 217,683",53.6 %
9,Socks,White Striped Socks - Mens,"$ 62,135",20.2 %
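The `SUM(SUM(...)) OVER (PARTITION BY ...)` expression divides each product's revenue by the total of its segment. As a rough check, the same arithmetic can be redone in plain Python for the three Jacket rows from the output above (the dictionary and variable names are illustrative):

```python
jacket_revenue = {
    "Indigo Rain Jacket - Womens": 71_383,
    "Khaki Suit Jacket - Womens": 86_296,
    "Grey Fashion Jacket - Womens": 209_304,
}

# Plays the role of the window SUM over the segment partition.
segment_total = sum(jacket_revenue.values())

percent_split = {
    product: round(revenue / segment_total * 100, 1)
    for product, revenue in jacket_revenue.items()
}
print(segment_total, percent_split)
```

The segment total matches the Jacket revenue reported in question 2 of this section.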


___
#### 7. What is the percentage split of revenue by segment for each category?

In [23]:
pd.read_sql("""
SELECT 
    pd.category_name AS category, 
    pd.segment_name AS segment, 
    TO_CHAR(SUM(s.qty * s.price), 'FM$ 999,999') AS total_revenue,
    CONCAT(ROUND(SUM(s.qty * s.price)/SUM(SUM(s.qty * s.price)) OVER (PARTITION BY pd.category_name)::NUMERIC * 100, 1), ' %') AS percent_split
FROM balanced_tree.sales s
JOIN balanced_tree.product_details pd ON s.prod_id = pd.product_id
GROUP BY pd.category_name, pd.segment_name
""", conn)

Unnamed: 0,category,segment,total_revenue,percent_split
0,Mens,Socks,"$ 307,977",43.1 %
1,Mens,Shirt,"$ 406,143",56.9 %
2,Womens,Jeans,"$ 208,350",36.2 %
3,Womens,Jacket,"$ 366,983",63.8 %


___
#### 8. What is the percentage split of total revenue by category?

In [24]:
pd.read_sql("""
SELECT 
    pd.category_name AS category, 
    TO_CHAR(SUM(s.qty * s.price), 'FM$ 999,999') AS total_revenue,
    CONCAT(ROUND(SUM(s.qty * s.price)/(SELECT SUM(qty * price) FROM balanced_tree.sales)::NUMERIC * 100, 1), ' %') AS percent_split
FROM balanced_tree.sales s
JOIN balanced_tree.product_details pd ON s.prod_id = pd.product_id
GROUP BY pd.category_name
""", conn)

Unnamed: 0,category,total_revenue,percent_split
0,Mens,"$ 714,120",55.4 %
1,Womens,"$ 575,333",44.6 %


___
#### 9. What is the total transaction `“penetration”` for *each product*? (hint: penetration = number of transactions where at least 1 quantity of a product was purchased divided by total number of transactions)

In [25]:
pd.read_sql("""
SELECT 
    pd.product_name, 
    COUNT(txn_id) AS nb_transactions,
    CONCAT(ROUND(COUNT(txn_id)/(SELECT COUNT(DISTINCT txn_id) FROM balanced_tree.sales)::NUMERIC*100,2), ' %') AS penetration_percent
FROM balanced_tree.sales s
JOIN balanced_tree.product_details pd ON s.prod_id = pd.product_id
GROUP BY pd.product_name
ORDER BY nb_transactions DESC   -- sort on the numeric count; penetration_percent is text
""", conn)

Unnamed: 0,product_name,nb_transactions,penetration_percent
0,Navy Solid Socks - Mens,1281,51.24 %
1,Grey Fashion Jacket - Womens,1275,51.00 %
2,Navy Oversized Jeans - Womens,1274,50.96 %
3,White Tee Shirt - Mens,1268,50.72 %
4,Blue Polo Shirt - Mens,1268,50.72 %
5,Pink Fluro Polkadot Socks - Mens,1258,50.32 %
6,Indigo Rain Jacket - Womens,1250,50.00 %
7,Khaki Suit Jacket - Womens,1247,49.88 %
8,Black Straight Jeans - Womens,1246,49.84 %
9,White Striped Socks - Mens,1243,49.72 %
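The penetration formula from the hint can be sketched on a handful of invented rows. The data and names below are hypothetical. Using a set per product means a transaction is counted once even if the product appears on several of its lines:

```python
from collections import defaultdict

# Hypothetical (txn_id, product) rows; a product may appear on several lines of one transaction.
rows = [
    ("t1", "socks"), ("t1", "jeans"),
    ("t2", "socks"),
    ("t3", "jacket"), ("t3", "socks"), ("t3", "socks"),
    ("t4", "jeans"),
]

txns_by_product = defaultdict(set)
for txn_id, product in rows:
    txns_by_product[product].add(txn_id)      # sets count each transaction once

total_txns = len({txn_id for txn_id, _ in rows})
penetration = {
    product: round(len(txns) / total_txns * 100, 2)
    for product, txns in txns_by_product.items()
}
print(penetration)
```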


___
#### 10. What is the most common combination of at least 1 quantity of any 3 products in a 1 single transaction?

This question involves combinations. Let's recall the formula for the number of combinations:<br>

$$\frac{n!}{k!(n-k)!}$$

where<br>
- $n$ is the number of objects in the set (here, Balanced Tree sells 12 products in total: $n=12$)
- $k$ is the number of distinct objects in each combination (here, each combination contains 3 distinct products: $k=3$)

Hence, there are `220 possible combinations of 3 different items`:

$$\frac{n!}{k!(n-k)!} = \frac{12!}{3!(12-3)!} = \frac{12!}{3! \times 9!} = \frac{12 \times 11 \times 10}{3 \times 2 \times 1} = 4 \times 5 \times 11 = 220$$


To answer this question, we will employ `SELF JOINS` to generate the possible combinations. This approach yields `220 rows`, indicating a total of 220 combinations of 3 distinct items.
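The count of 220 can be double-checked outside SQL. The sketch below uses Python's `math.comb` and also enumerates the triples with `itertools.combinations`; the product labels are placeholders, not the real product names.

```python
import math
from itertools import combinations

n_products, k = 12, 3
n_combinations = math.comb(n_products, k)     # n! / (k! (n-k)!)

# Enumerating the triples explicitly gives the same count; the labels are placeholders.
products = [f"product_{i}" for i in range(1, n_products + 1)]
triples = list(combinations(products, k))
print(n_combinations, len(triples))
```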

In [26]:
query10 = """
WITH products_cte AS
(
    -- Filter the data to include only those txn_ids that have at least 3 products in a single transaction
    
    SELECT 
        s.txn_id,
        pd.product_name
    FROM
    (
        SELECT 
            txn_id, 
            COUNT(prod_id) AS nb_products
        FROM balanced_tree.sales 
        GROUP BY txn_id
    
    ) count_products
    JOIN balanced_tree.sales s ON count_products.txn_id = s.txn_id
    JOIN balanced_tree.product_details pd ON s.prod_id = pd.product_id
    WHERE nb_products >= 3
)
SELECT 
    p.product_name AS product1, 
    p1.product_name AS product2, 
    p2.product_name AS product3, 
    COUNT(p.txn_id) AS nb_occurences     -- Count the occurrences of each combination
FROM products_cte p
JOIN products_cte p1                     -- SELF JOIN Table 1 to Table 2
ON p.txn_id = p1.txn_id 
AND p.product_name < p1.product_name     -- Strict ordering drops duplicates and permutations
JOIN products_cte p2                     -- SELF JOIN Table 1 to Table 3
ON p.txn_id = p2.txn_id 
AND p1.product_name < p2.product_name    -- Also implies p.product_name < p2.product_name
GROUP BY p.product_name, p1.product_name, p2.product_name
ORDER BY nb_occurences DESC
"""

pd.read_sql(query10, conn)

Unnamed: 0,product1,product2,product3,nb_occurences
0,Grey Fashion Jacket - Womens,Teal Button Up Shirt - Mens,White Tee Shirt - Mens,352
1,Black Straight Jeans - Womens,Indigo Rain Jacket - Womens,Navy Solid Socks - Mens,349
2,Black Straight Jeans - Womens,Grey Fashion Jacket - Womens,Pink Fluro Polkadot Socks - Mens,347
3,Blue Polo Shirt - Mens,Grey Fashion Jacket - Womens,White Striped Socks - Mens,347
4,Navy Oversized Jeans - Womens,Teal Button Up Shirt - Mens,White Tee Shirt - Mens,347
...,...,...,...,...
215,Cream Relaxed Jeans - Womens,Pink Fluro Polkadot Socks - Mens,White Tee Shirt - Mens,290
216,Indigo Rain Jacket - Womens,Khaki Suit Jacket - Womens,Pink Fluro Polkadot Socks - Mens,289
217,Cream Relaxed Jeans - Womens,White Striped Socks - Mens,White Tee Shirt - Mens,288
218,Indigo Rain Jacket - Womens,Khaki Suit Jacket - Womens,Navy Oversized Jeans - Womens,287


**Result**<br>
The most common combination of 3 distinct items bought in a single transaction is:
- Grey Fashion Jacket - Womens
- Teal Button Up Shirt - Mens
- White Tee Shirt - Mens

In [27]:
pd.read_sql(query10 + 'LIMIT 1', conn)

Unnamed: 0,product1,product2,product3,nb_occurences
0,Grey Fashion Jacket - Womens,Teal Button Up Shirt - Mens,White Tee Shirt - Mens,352
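The self-join logic can also be illustrated procedurally: generate every ordered triple within a transaction and count how often each one occurs. This is only a sketch on invented transactions (the data and variable names are hypothetical), not a replacement for the query above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions mapped to the distinct products they contain.
transactions = {
    "t1": {"jacket", "jeans", "shirt", "socks"},
    "t2": {"jacket", "jeans", "shirt"},
    "t3": {"jeans", "socks"},                  # fewer than 3 products: contributes no triple
    "t4": {"hat", "jacket", "jeans", "shirt"},
}

triple_counts = Counter()
for products in transactions.values():
    # sorted() fixes one canonical order per triple, like the "<" conditions in the joins
    for triple in combinations(sorted(products), 3):
        triple_counts[triple] += 1

most_common_triple, occurrences = triple_counts.most_common(1)[0]
print(most_common_triple, occurrences)
```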


___
<a id = 'D'></a>
## D. Bonus Challenge
Use a single SQL query to transform the `product_hierarchy` and `product_prices` datasets to the `product_details` table.

Hint: you may want to consider using a recursive CTE to solve this problem!

Let's recall the `product_details` table.

In [28]:
pd.read_sql("""
SELECT *
FROM balanced_tree.product_details
""", conn)

Unnamed: 0,product_id,price,product_name,category_id,segment_id,style_id,category_name,segment_name,style_name
0,c4a632,13,Navy Oversized Jeans - Womens,1,3,7,Womens,Jeans,Navy Oversized
1,e83aa3,32,Black Straight Jeans - Womens,1,3,8,Womens,Jeans,Black Straight
2,e31d39,10,Cream Relaxed Jeans - Womens,1,3,9,Womens,Jeans,Cream Relaxed
3,d5e9a6,23,Khaki Suit Jacket - Womens,1,4,10,Womens,Jacket,Khaki Suit
4,72f5d4,19,Indigo Rain Jacket - Womens,1,4,11,Womens,Jacket,Indigo Rain
5,9ec847,54,Grey Fashion Jacket - Womens,1,4,12,Womens,Jacket,Grey Fashion
6,5d267b,40,White Tee Shirt - Mens,2,5,13,Mens,Shirt,White Tee
7,c8d436,10,Teal Button Up Shirt - Mens,2,5,14,Mens,Shirt,Teal Button Up
8,2a2353,57,Blue Polo Shirt - Mens,2,5,15,Mens,Shirt,Blue Polo
9,f084eb,36,Navy Solid Socks - Mens,2,6,16,Mens,Socks,Navy Solid


**Result**<br>
I used 2 CTEs to address this question:
- `category_cte`: retrieves the category rows and renames their columns.
- `segment_cte`: retrieves the segment rows, renames their columns, and joins each segment to its parent category.

In [29]:
pd.read_sql("""
WITH category_cte AS
(
    -- Retrieve data for category items
    
    SELECT 
        id AS category_id, 
        level_text AS category_name
    FROM balanced_tree.product_hierarchy 
    WHERE level_name = 'Category'
), 

segment_cte AS
(
    -- Retrieve data for segment items
    
    SELECT 
        cat.category_id,
        ph.id AS segment_id, 
        cat.category_name,
        ph.level_text AS segment_name
    FROM balanced_tree.product_hierarchy ph
    JOIN category_cte cat ON ph.parent_id = cat.category_id
    WHERE level_name = 'Segment'
)

SELECT
    pp.product_id,
    pp.price,
    CONCAT(ph.level_text, ' ', segment_name, ' - ', category_name) AS product_name,
    category_id,
    segment_id, 
    ph.id AS style_id,
    category_name,
    segment_name,
    ph.level_text AS style_name

FROM balanced_tree.product_hierarchy ph
JOIN balanced_tree.product_prices pp ON ph.id = pp.id
LEFT JOIN segment_cte seg ON ph.parent_id = seg.segment_id
WHERE level_name = 'Style'
""", conn)

Unnamed: 0,product_id,price,product_name,category_id,segment_id,style_id,category_name,segment_name,style_name
0,c4a632,13,Navy Oversized Jeans - Womens,1,3,7,Womens,Jeans,Navy Oversized
1,e83aa3,32,Black Straight Jeans - Womens,1,3,8,Womens,Jeans,Black Straight
2,e31d39,10,Cream Relaxed Jeans - Womens,1,3,9,Womens,Jeans,Cream Relaxed
3,d5e9a6,23,Khaki Suit Jacket - Womens,1,4,10,Womens,Jacket,Khaki Suit
4,72f5d4,19,Indigo Rain Jacket - Womens,1,4,11,Womens,Jacket,Indigo Rain
5,9ec847,54,Grey Fashion Jacket - Womens,1,4,12,Womens,Jacket,Grey Fashion
6,5d267b,40,White Tee Shirt - Mens,2,5,13,Mens,Shirt,White Tee
7,c8d436,10,Teal Button Up Shirt - Mens,2,5,14,Mens,Shirt,Teal Button Up
8,2a2353,57,Blue Polo Shirt - Mens,2,5,15,Mens,Shirt,Blue Polo
9,f084eb,36,Navy Solid Socks - Mens,2,6,16,Mens,Socks,Navy Solid


Alternatively, we can achieve the same result without using CTEs. Instead, we can obtain the desired outcome by utilizing `SELF JOINS` in the query.

In [30]:
pd.read_sql("""
SELECT 
    pp.product_id,
    pp.price,
    CONCAT(ph2.level_text, ' ', ph1.level_text, ' - ', ph.level_text) AS product_name,
    ph.id AS category_id,
    ph1.id AS segment_id,
    ph2.id AS style_id,
    ph.level_text AS category_name,
    ph1.level_text AS segment_name,
    ph2.level_text AS style_name

FROM balanced_tree.product_hierarchy ph
JOIN balanced_tree.product_hierarchy ph1 ON ph.id = ph1.parent_id  -- SELF JOIN Table 1 to Table 2
JOIN balanced_tree.product_hierarchy ph2 ON ph1.id = ph2.parent_id -- SELF JOIN Table 2 to Table 3
LEFT JOIN balanced_tree.product_prices pp ON ph2.id = pp.id        -- LEFT JOIN Table 3 to the "product_prices" table
ORDER BY category_id, segment_id, style_id
""", conn) 

Unnamed: 0,product_id,price,product_name,category_id,segment_id,style_id,category_name,segment_name,style_name
0,c4a632,13,Navy Oversized Jeans - Womens,1,3,7,Womens,Jeans,Navy Oversized
1,e83aa3,32,Black Straight Jeans - Womens,1,3,8,Womens,Jeans,Black Straight
2,e31d39,10,Cream Relaxed Jeans - Womens,1,3,9,Womens,Jeans,Cream Relaxed
3,d5e9a6,23,Khaki Suit Jacket - Womens,1,4,10,Womens,Jacket,Khaki Suit
4,72f5d4,19,Indigo Rain Jacket - Womens,1,4,11,Womens,Jacket,Indigo Rain
5,9ec847,54,Grey Fashion Jacket - Womens,1,4,12,Womens,Jacket,Grey Fashion
6,5d267b,40,White Tee Shirt - Mens,2,5,13,Mens,Shirt,White Tee
7,c8d436,10,Teal Button Up Shirt - Mens,2,5,14,Mens,Shirt,Teal Button Up
8,2a2353,57,Blue Polo Shirt - Mens,2,5,15,Mens,Shirt,Blue Polo
9,f084eb,36,Navy Solid Socks - Mens,2,6,16,Mens,Socks,Navy Solid
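Both queries follow the `parent_id` links from a style up to its segment and then its category. As a sketch of that lookup outside SQL, the snippet below walks three rows copied from the `product_hierarchy` preview; the dictionary layout and the function name `build_product_name` are hypothetical.

```python
# id -> (parent_id, level_text, level_name), copied from the product_hierarchy preview
hierarchy = {
    1: (None, "Womens", "Category"),
    3: (1, "Jeans", "Segment"),
    7: (3, "Navy Oversized", "Style"),
}

def build_product_name(style_id):
    """Follow parent_id from style to segment to category and assemble the product name."""
    segment_id, style_name, _ = hierarchy[style_id]
    category_id, segment_name, _ = hierarchy[segment_id]
    _, category_name, _ = hierarchy[category_id]
    return f"{style_name} {segment_name} - {category_name}"

print(build_product_name(7))
```

The resulting string has the same `style segment - category` shape that the `CONCAT` expressions in the queries produce.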


In [31]:
conn.close()