# Gold Layer ETL
Time to ingest data into empty schemas once again. This time we are bringing the data from the silver layer into the gold layer. 

In [0]:
import seaborn as sns
import matplotlib.pyplot as plt
import pyspark.sql.functions as F

### a) Customer Profile Table
Which features of the customer base are the most relevant to maximizing expenditure?

In [0]:
%sql
INSERT INTO gold.customer_profile
SELECT
    c.education,
    c.marital_status,
    c.kidhome,
    c.teenhome,
    c.income,
    c.complain,

    -- Total spending = MntWines + MntFruits + ... + MntSweetProducts (note: MntGoldProds is excluded, as it is a different kind of category)
    SUM(
        f.MntWines +
        f.MntFruits +
        f.MntMeatProducts +
        f.MntFishProducts +
        f.MntSweetProducts
    ) AS total_spending,

    -- Average Spending 
    AVG(
        f.MntWines +
        f.MntFruits +
        f.MntMeatProducts +
        f.MntFishProducts +
        f.MntSweetProducts
    ) AS avg_spent_per_customer

FROM silver.FACT_Sales f
JOIN silver.DIM_Customer c
  ON f.fk_customer = c.id_customer

GROUP BY
    c.education,
    c.marital_status,
    c.kidhome,
    c.teenhome,
    c.income,
    c.complain;
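
The five-column sum that defines total spending is typed out by hand in several queries in this notebook, which makes it easy for one copy to drift from the others. A minimal sketch of one way around that, assuming the queries are assembled in Python: `SPENDING_COLUMNS` and `spending_expr` are hypothetical names, and the `COALESCE` wrapper is an added assumption that the original query does not make.

```python
# Hypothetical helper: builds the SQL expression for total spending once,
# so the five-column sum is not copied by hand into every query.
SPENDING_COLUMNS = [
    "MntWines",
    "MntFruits",
    "MntMeatProducts",
    "MntFishProducts",
    "MntSweetProducts",  # MntGoldProds is deliberately left out, as in the query above
]


def spending_expr(alias="f", columns=SPENDING_COLUMNS):
    """Return e.g. 'COALESCE(f.MntWines, 0) + COALESCE(f.MntFruits, 0) + ...'."""
    # COALESCE is an added assumption: it keeps one NULL amount from
    # turning the whole row's sum into NULL.
    return " + ".join(f"COALESCE({alias}.{col}, 0)" for col in columns)


# Example query text that reuses the expression.
query = f"""
SELECT
    f.fk_customer,
    SUM({spending_expr('f')}) AS total_spending
FROM silver.FACT_Sales f
GROUP BY f.fk_customer
"""
print(query)
```

In a notebook cell the resulting string could then be passed to `spark.sql(query)`.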


In [0]:
%sql
-- Test
SELECT * FROM gold.customer_profile 
ORDER BY total_spending DESC 

This is not very useful, since there are too many combinations, but it already shows which (highly specific) groups are the biggest spenders. Single master's graduates with no children take the lead.
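
One likely reason for the large number of combinations is that `income` is a near-continuous value, so almost every customer ends up in a group of their own. The sketch below illustrates the effect on made-up rows (not taken from the dataset); `count_groups` is a hypothetical helper.

```python
# Made-up customer rows, for illustration only (not taken from the dataset).
rows = [
    {"education": "Master", "marital_status": "Single", "kidhome": 0, "income": 58138},
    {"education": "Master", "marital_status": "Single", "kidhome": 0, "income": 61223},
    {"education": "Master", "marital_status": "Single", "kidhome": 0, "income": 71613},
    {"education": "PhD", "marital_status": "Married", "kidhome": 1, "income": 26646},
]


def count_groups(rows, keys):
    """Number of distinct GROUP BY combinations for the given key columns."""
    return len({tuple(row[k] for k in keys) for row in rows})


with_income = count_groups(rows, ["education", "marital_status", "kidhome", "income"])
without_income = count_groups(rows, ["education", "marital_status", "kidhome"])

print(with_income)     # 4: every distinct income value opens a new group
print(without_income)  # 2
```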

In [0]:
df_gold = spark.table("gold.customer_profile")

df_gold.select(
    F.corr("income", "total_spending").alias("corr_income_spending")
).display()


As expected, the correlation between income and total spending is positive and moderate to strong, at 0.64.
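
`F.corr` returns the Pearson correlation coefficient. For reference, here is a minimal plain-Python sketch of the same formula, run on made-up numbers rather than the notebook's data. The name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values, only to exercise the function.
income = [20000, 40000, 60000, 80000]
spending = [100, 300, 500, 700]  # exactly linear in income
print(pearson(income, spending))  # close to 1.0 for a perfectly linear relation
```

The sketch does not handle a constant column, where the denominator is zero.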

In [0]:
%sql
SELECT
  education,
  ROUND(AVG(total_spending), 2) AS avg_spending
FROM gold.customer_profile
GROUP BY education
ORDER BY avg_spending DESC;
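
A caveat when reading this result: `total_spending` in `gold.customer_profile` is already a sum per group, so `AVG(total_spending)` is a mean of group totals, not a mean per customer. The two differ whenever a group holds more than one customer. The sketch below uses made-up rows and a hypothetical `customers` count that the real table does not store.

```python
# Made-up group rows shaped like gold.customer_profile, plus a hypothetical
# `customers` count that the real table does not store.
groups = [
    {"education": "PhD", "total_spending": 900, "customers": 3},
    {"education": "PhD", "total_spending": 500, "customers": 1},
]

# What AVG(total_spending) computes: the mean of the group totals.
mean_of_totals = sum(g["total_spending"] for g in groups) / len(groups)

# Per-customer mean: all spending divided by all customers.
per_customer = (
    sum(g["total_spending"] for g in groups)
    / sum(g["customers"] for g in groups)
)

print(mean_of_totals)  # 700.0
print(per_customer)    # 350.0
```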

In [0]:
%sql
SELECT
  marital_status,
  ROUND(AVG(total_spending),2) AS avg_spending
FROM gold.customer_profile
GROUP BY marital_status
ORDER BY avg_spending DESC;


The incorrect categories are gone, but the new 'Unknown' category does not appear here, for reasons that are not yet clear.
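
One possible explanation, stated here as an assumption and not verified against the data, is that the 'Unknown' customers have no rows in `silver.FACT_Sales`, in which case the inner join drops them. The sketch below reproduces that situation on made-up rows and lists the categories that exist in the dimension but do not survive the join.

```python
# Made-up rows, for illustration only. Column names follow the silver tables.
dim_customer = [
    {"id_customer": 1, "marital_status": "Single"},
    {"id_customer": 2, "marital_status": "Married"},
    {"id_customer": 3, "marital_status": "Unknown"},
]
fact_sales = [
    {"fk_customer": 1},
    {"fk_customer": 2},
]

# Customer ids that have at least one fact row.
ids_with_sales = {row["fk_customer"] for row in fact_sales}

# Categories that survive the inner join vs. categories present in the dimension.
joined = {c["marital_status"] for c in dim_customer if c["id_customer"] in ids_with_sales}
in_dimension = {c["marital_status"] for c in dim_customer}

print(sorted(in_dimension - joined))  # ['Unknown']
```

In the notebook, a comparable check would be to compare `SELECT DISTINCT marital_status FROM silver.dim_customer` with the result above, or to run a `LEFT ANTI JOIN` from the dimension to the fact table. If the category is missing from the dimension itself, the cause lies upstream in the silver layer.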

In [0]:
df_gold = spark.table("gold.customer_profile")

df_gold.select(
    F.corr("kidhome", "total_spending").alias("corr_kidhome_spending")
).display()


In [0]:
df_gold = spark.table("gold.customer_profile")

df_gold.select(
    F.corr("teenhome", "total_spending").alias("corr_teenhome_spending")
).display()


Interestingly, having children at home is negatively correlated with total spending. Additionally, the correlation is weaker for teenagers than for young kids, which might suggest that the effect depends on the child's age.

In [0]:
%sql
SELECT
  kidhome,
  ROUND(AVG(total_spending),2) AS avg_spending
FROM gold.customer_profile
GROUP BY kidhome
ORDER BY kidhome;


In summary:

In [0]:
pdf = df_gold.select(
    "total_spending",
    "income",
    "kidhome",
    "teenhome"
).toPandas()

sns.heatmap(pdf.corr(), annot=True, cmap="coolwarm")
plt.show()

Evidently, total spending is positively correlated with income, but it is intriguing that it is *negatively* correlated with having kids at home. Having a teen at home seems not to be relevant. This could suggest that it is wiser to market the company's products to young adults who have income but no family yet.

### Customer Spending Summary
No modeling was used, only joins.

In [0]:
%skip
%sql
SELECT
  education,
  marital_status,
  SUM(MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts) AS total_spending,
  ROUND(
    AVG(
      MntWines + MntFruits + MntMeatProducts +
      MntFishProducts + MntSweetProducts
    ), 2
  ) AS avg_spending,
  COUNT(*) AS customers
FROM silver.fact_sales fs
JOIN silver.dim_customer dc
  ON fs.fk_customer = dc.id_customer
GROUP BY education, marital_status
ORDER BY avg_spending DESC;

In [0]:
%skip
%sql
WITH customer_spending AS (
    SELECT
        f.fk_customer,

        -- Total spending per customer
        SUM(
            f.MntWines +
            f.MntFruits +
            f.MntMeatProducts +
            f.MntFishProducts +
            f.MntSweetProducts
        ) AS total_spending

    FROM silver.FACT_Sales f
    GROUP BY f.fk_customer
)

SELECT
    c.education,
    c.marital_status,
    c.kidhome,
    c.teenhome,
    c.income,
    c.complain,

    COUNT(DISTINCT c.id_customer) AS total_customers,

    -- Main metrics
    AVG(cs.total_spending) AS avg_spending_per_customer,
    SUM(cs.total_spending) AS total_spending_group

FROM customer_spending cs
JOIN silver.DIM_Customer c
  ON cs.fk_customer = c.id_customer

GROUP BY
    c.education,
    c.marital_status,
    c.kidhome,
    c.teenhome,
    c.income,
    c.complain
ORDER BY avg_spending_per_customer DESC;


In [0]:
%sql
SELECT
  dc.Education,
  dc.Marital_Status,
  SUM(
    fs.MntWines +
    fs.MntFruits +
    fs.MntMeatProducts +
    fs.MntFishProducts +
    fs.MntSweetProducts
  ) AS total_spending
FROM silver.fact_sales fs
JOIN silver.dim_customer dc
  ON fs.fk_customer = dc.id_customer
GROUP BY
  dc.Education,
  dc.Marital_Status

### d) Which is the most popular method of purchasing?


In [0]:
%sql
CREATE OR REPLACE TABLE gold.purchase_channel_summary AS
SELECT
    ROUND(AVG(NumWebPurchases), 2) AS avg_web,
    ROUND(AVG(NumStorePurchases), 2) AS avg_store,
    ROUND(AVG(NumCatalogPurchases), 2) AS avg_catalog
FROM silver.fact_sales;

In [0]:
display(spark.table("gold.purchase_channel_summary"))

The average consumer purchases the least from the company's catalog and the most from physical stores. While the reasons can only be conjectured for now, this is nevertheless a valuable insight that might, for instance, suggest that catalogs are not worth maintaining, or that stores should at the very least receive the same level of attention and not be neglected.
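
The ranking can also be read off programmatically instead of by eye. The sketch below uses made-up averages standing in for the single row of `gold.purchase_channel_summary`. Only their ordering mirrors the observation above; the numbers themselves are not results.

```python
# Made-up averages standing in for the single row of
# gold.purchase_channel_summary; the real values come from the table.
summary = {"avg_web": 4.0, "avg_store": 6.0, "avg_catalog": 2.5}

# Sort channels from most to least used.
ranking = sorted(summary.items(), key=lambda item: item[1], reverse=True)

most_used, least_used = ranking[0][0], ranking[-1][0]
print(most_used)   # avg_store
print(least_used)  # avg_catalog
```

In the notebook, the dictionary could presumably be obtained with `spark.table("gold.purchase_channel_summary").first().asDict()`.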

# Extra