# Data Mart Case Study:

## Table of Contents

- [Raw data](#Raw-data)

- [Case Study Questions](#A.-Data-Cleansing-Steps)
    - [Data Cleansing Steps](#A.-Data-Cleansing-Steps)
    - [Data Exploration](#B.-Data-Exploration)
    - [Before & After Analysis](#C.-Before-&-After-Analysis)
    - [Bonus Question](#D.-Bonus-Question)


## Raw data

The dataset has one table: **weekly_sales**

In [1]:
%reload_ext sql

In [2]:
%%sql

mysql://root:MyN3wP4ssw0rd@localhost:3306/uditdb

### Entity Relationship Diagram

![Entity relationship diagram for the weekly_sales table](https://drive.google.com/uc?id=1iG_eG-zVa3gCQXFXhfou3mgmr8mmfMs4)

[Back to top](#Data-Mart-Case-Study:)

## A. Data Cleansing Steps

1. [Convert the week_date to a DATE format](#A1.-Convert-the-week_date-to-a-DATE-format)<br><br>
2. [Add a week_number as the second column for each week_date value, for example any value from the 1st of January to 7th of January will be 1, 8th to 14th will be 2 etc](#A2.-Add-a-week_number-as-the-second-column-for-each-week_date-value,-for-example-any-value-from-the-1st-of-January-to-7th-of-January-will-be-1,-8th-to-14th-will-be-2-etc)<br><br>
3. [Add a month_number with the calendar month for each week_date value as the 3rd column](#A3.-Add-a-month_number-with-the-calendar-month-for-each-week_date-value-as-the-3rd-column)<br><br>
4. [Add a calendar_year column as the 4th column containing either 2018, 2019 or 2020 values](#A4.-Add-a-calendar_year-column-as-the-4th-column-containing-either-2018,-2019-or-2020-values)<br><br>
5. [Add a new column called age_band after the original segment column using the following mapping on the number inside the segment value](#A5.-Add-a-new-column-called-age_band-after-the-original-segment-column-using-the-following-mapping-on-the-number-inside-the-segment-value)<br><br>
6. [Add a new demographic column using the following mapping for the first letter in the segment values:](#A6.-Add-a-new-demographic-column-using-the-following-mapping-for-the-first-letter-in-the-segment-values:)<br><br>
7. [Ensure all null string values with an "unknown" string value in the original segment column as well as the new age_band and demographic columns](#A7.-Ensure-all-null-string-values-with-an-"unknown"-string-value-in-the-original-segment-column-as-well-as-the-new-age_band-and-demographic-columns)<br><br>
8. [Generate a new avg_transaction column as the sales value divided by transactions rounded to 2 decimal places for each record](#A8.-Generate-a-new-avg_transaction-column-as-the-sales-value-divided-by-transactions-rounded-to-2-decimal-places-for-each-record)<br><br>

[Back to top](#Data-Mart-Case-Study:)

## B. Data Exploration

1. [What day of the week is used for each week_date value?](#B1.-What-day-of-the-week-is-used-for-each-week_date-value?)<br><br>
2. [What range of week numbers are missing from the dataset?](#B2.-What-range-of-week-numbers-are-missing-from-the-dataset?)<br><br>
3. [How many total transactions were there for each year in the dataset?](#B3.-How-many-total-transactions-were-there-for-each-year-in-the-dataset?)<br><br>
4. [What is the total sales for each region for each month?](#B4.-What-is-the-total-sales-for-each-region-for-each-month?)<br><br>
5. [What is the total count of transactions for each platform](#B5.-What-is-the-total-count-of-transactions-for-each-platform)<br><br>
6. [What is the percentage of sales for Retail vs Shopify for each month?](#B6.-What-is-the-percentage-of-sales-for-Retail-vs-Shopify-for-each-month?)<br><br>
7. [What is the percentage of sales by demographic for each year in the dataset?](#B7.-What-is-the-percentage-of-sales-by-demographic-for-each-year-in-the-dataset?)<br><br>
8. [Which age_band and demographic values contribute the most to Retail sales?](#B8.-Which-age_band-and-demographic-values-contribute-the-most-to-Retail-sales?)<br><br>
9. [Can we use the avg_transaction column to find the average transaction size for each year for Retail vs Shopify? If not - how would you calculate it instead?](#B9.-Can-we-use-the-avg_transaction-column-to-find-the-average-transaction-size-for-each-year-for-Retail-vs-Shopify?-If-not---how-would-you-calculate-it-instead?)<br><br>

[Back to top](#Data-Mart-Case-Study:)

## C. Before & After Analysis


1. [What is the total sales for the 4 weeks before and after 2020-06-15? What is the growth or reduction rate in actual values and percentage of sales?](#C1.-What-is-the-total-sales-for-the-4-weeks-before-and-after-2020-06-15?-What-is-the-growth-or-reduction-rate-in-actual-values-and-percentage-of-sales?)<br><br>
2. [What about the entire 12 weeks before and after?](#C2.-What-about-the-entire-12-weeks-before-and-after?)<br><br>
3. [How do the sale metrics for these 2 periods before and after compare with the previous years in 2018 and 2019?](#C3.-How-do-the-sale-metrics-for-these-2-periods-before-and-after-compare-with-the-previous-years-in-2018-and-2019?)<br><br>

[Back to top](#Data-Mart-Case-Study:)

## D. Bonus Question

1. [Which areas of the business have the highest negative impact in sales metrics performance in 2020 for the 12 week before and after period?](#D1.-Which-areas-of-the-business-have-the-highest-negative-impact-in-sales-metrics-performance-in-2020-for-the-12-week-before-and-after-period?)<br><br>

[Back to top](#Data-Mart-Case-Study:)

# A. Data Cleansing Steps

<span style="background-color: #ffffa6;">In a single query, perform the following operations and generate a new table in the data_mart schema named **clean_weekly_sales**:</span>


## A1. Convert the <span style="background-color: #a6ffff;">week_date</span> to a <span style="background-color: #a6ffff;">DATE</span> format

## A2. Add a <span style="background-color: #a6ffff;">week_number</span> as the second column for each <span style="background-color: #a6ffff;">week_date</span> value, for example any value from the 1st of January to 7th of January will be 1, 8th to 14th will be 2 etc

## A3. Add a <span style="background-color: #a6ffff;">month_number</span> with the calendar month for each <span style="background-color: #a6ffff;">week_date</span> value as the 3rd column

## A4. Add a <span style="background-color: #a6ffff;">calendar_year</span> column as the 4th column containing either 2018, 2019 or 2020 values

## A5. Add a new column called <span style="background-color: #a6ffff;">age_band</span> after the original <span style="background-color: #a6ffff;">segment</span> column using the following mapping on the number inside the <span style="background-color: #a6ffff;">segment</span> value

| Segment | Age Band        |
|---------|-----------------|
| 1       | Young Adults    |
| 2       | Middle Aged     |
| 3 or 4  | Retirees        |


## A6. Add a new <span style="background-color: #a6ffff;">demographic</span> column using the following mapping for the first letter in the <span style="background-color: #a6ffff;">segment</span> values:


| Segment | Demographic |
|---------|-------------|
| C       | Couples     |
| F       | Families    |
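
The A5 and A6 mappings can also be sketched outside SQL. The snippet below is a minimal Python illustration, not part of the original notebook. It assumes segment codes look like `C3` or `F1` and that missing segments arrive as the literal string `null`; the name `split_segment` is hypothetical.

```python
from typing import Tuple

# Lookup tables copied from the A5 and A6 mapping tables.
AGE_BANDS = {"1": "Young Adults", "2": "Middle Aged", "3": "Retirees", "4": "Retirees"}
DEMOGRAPHICS = {"C": "Couples", "F": "Families"}


def split_segment(segment: str) -> Tuple[str, str]:
    """Return (age_band, demographic) for a segment code such as 'C3'."""
    # The last character carries the age band, the first the demographic.
    # Anything not in the lookup tables (for example 'null') maps to 'unknown'.
    age_band = AGE_BANDS.get(segment[-1:], "unknown")
    demographic = DEMOGRAPHICS.get(segment[:1], "unknown")
    return age_band, demographic


print(split_segment("C3"))
print(split_segment("null"))
```

Under these assumptions `C3` maps to Retirees and Couples, while `null` falls through to `unknown` in both columns.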

## A7. Ensure all <span style="background-color: #a6ffff;">null</span> string values with an <span style="background-color: #a6ffff;">"unknown"</span> string value in the original <span style="background-color: #a6ffff;">segment</span> column as well as the new <span style="background-color: #a6ffff;">age_band</span> and <span style="background-color: #a6ffff;">demographic</span> columns

## A8. Generate a new <span style="background-color: #a6ffff;">avg_transaction</span> column as the <span style="background-color: #a6ffff;">sales</span> value divided by <span style="background-color: #a6ffff;">transactions</span> rounded to 2 decimal places for each record

In [33]:
%%sql

USE uditdb;
DROP TABLE IF EXISTS clean_weekly_sales;
CREATE TABLE clean_weekly_sales AS
  SELECT Str_to_date(week_date, '%d/%m/%Y')           AS week_date,
          Week(Str_to_date(week_date, '%d/%m/%Y')) + 1 AS week_number,
          Month(Str_to_date(week_date, '%d/%m/%Y'))    AS month_number,
          Year(Str_to_date(week_date, '%d/%m/%Y'))     AS calendar_year,
          region,
          platform,
          Replace(segment, "null", "unknown")          AS segment,
          CASE
            WHEN RIGHT(segment, 1) = '1' THEN "Young Adults"
            WHEN RIGHT(segment, 1) = '2' THEN "Middle Aged"
            WHEN RIGHT(segment, 1) IN ( '3', '4' ) THEN "Retirees"
            ELSE "unknown"
          END                                          AS age_band,
          CASE
            WHEN LEFT(segment, 1) = "C" THEN "Couples"
            WHEN LEFT(segment, 1) = "F" THEN "Families"
            ELSE "unknown"
          END                                          AS demographic,
          customer_type,
          transactions,
          sales,
          Round(sales / transactions, 2)               AS avg_transaction
   FROM   weekly_sales;

 * mysql://root:***@localhost:3306/uditdb
0 rows affected.
0 rows affected.
17117 rows affected.


[]
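
The `week_number` expression in the query, `Week(...) + 1`, is not the same rule as the one worded in A2 (1st to 7th of January is week 1). The sketch below compares the two for a single date. It assumes MySQL's default `WEEK()` mode 0 (weeks start on Sunday, days before the first Sunday are week 0), which Python's `%U` format code mirrors; the helper names are hypothetical.

```python
from datetime import date


def week_number_a2(d: date) -> int:
    """Week number as worded in A2: Jan 1-7 is 1, Jan 8-14 is 2, and so on."""
    day_of_year = d.timetuple().tm_yday
    return (day_of_year - 1) // 7 + 1


def week_number_query(d: date) -> int:
    """Approximation of Week(d) + 1 under MySQL's default mode 0 (assumed)."""
    # %U counts weeks starting on Sunday; days before the first Sunday are week 0.
    return int(d.strftime("%U")) + 1


baseline = date(2020, 6, 15)
print(week_number_a2(baseline), week_number_query(baseline), baseline.isocalendar()[1])
```

For this date the A2 wording gives 24, while the query's expression gives 25, which coincides with the ISO week number. Either definition is workable as long as later sections use the same one consistently.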

[Back to top](#Data-Mart-Case-Study:)

# B. Data Exploration

## B1. What day of the week is used for each <span style="background-color: #a6ffff;">week_date</span> value?

In [35]:
%%sql

SELECT DISTINCT Dayname(week_date) AS day_of_week
FROM   clean_weekly_sales 

 * mysql://root:***@localhost:3306/uditdb
1 rows affected.


day_of_week
Monday


[Back to top](#Data-Mart-Case-Study:)

## B2. What range of week numbers are missing from the dataset?

In [40]:
%%sql

SELECT DISTINCT week_number
FROM clean_weekly_sales
ORDER BY 1

# Missing week numbers are: 1 to 12 and 37 to 52

 * mysql://root:***@localhost:3306/uditdb
24 rows affected.


week_number
13
14
15
16
17
18
19
20
21
22


[Back to top](#Data-Mart-Case-Study:)

## B3. How many total transactions were there for each year in the dataset?

In [83]:
%%sql

SELECT calendar_year,
       Round(Sum(transactions) / Power(10, 6), 2) AS
       total_transactions_in_millions
FROM   clean_weekly_sales 
GROUP  BY 1 

 * mysql://root:***@localhost:3306/uditdb
3 rows affected.


calendar_year,total_transactions_in_millions
2020,375.81
2019,365.64
2018,346.41


[Back to top](#Data-Mart-Case-Study:)

## B4. What is the total sales for each region for each month?

In [84]:
%%sql

SELECT Monthname(week_date)  AS month,
       region,
       Round(Sum(sales) / Power(10, 6), 2) AS
       total_sales_in_millions
FROM   clean_weekly_sales
GROUP  BY month_number, 1, 2
ORDER  BY month_number, 3 DESC

 * mysql://root:***@localhost:3306/uditdb
49 rows affected.


month,region,total_sales_in_millions
March,OCEANIA,783.28
March,AFRICA,567.77
March,ASIA,529.77
March,USA,225.35
March,CANADA,144.63
March,SOUTH AMERICA,71.02
March,EUROPE,35.34
April,OCEANIA,2599.77
April,AFRICA,1911.78
April,ASIA,1804.63


[Back to top](#Data-Mart-Case-Study:)

## B5. What is the total count of transactions for each platform

In [86]:
%%sql

SELECT platform,
       Round(Sum(transactions) / Power(10, 6), 2) AS
       total_transactions_in_millions
FROM   clean_weekly_sales 
GROUP  BY 1 

 * mysql://root:***@localhost:3306/uditdb
2 rows affected.


platform,total_transactions_in_millions
Retail,1081.93
Shopify,5.93


[Back to top](#Data-Mart-Case-Study:)

## B6. What is the percentage of sales for Retail vs Shopify for each month?

In [90]:
%%sql

SELECT 
    MONTHNAME(week_date) AS month,
    ROUND(
        SUM(CASE WHEN platform = 'Retail' THEN sales END) * 100 / SUM(sales),
        2
    ) AS Retail_percentage,
    ROUND(
        SUM(CASE WHEN platform = 'Shopify' THEN sales END) * 100 / SUM(sales),
        2
    ) AS Shopify_percentage
FROM clean_weekly_sales
GROUP BY month_number, 1
ORDER BY month_number;


 * mysql://root:***@localhost:3306/uditdb
7 rows affected.


month,Retail_percentage,Shopify_percentage
March,97.54,2.46
April,97.59,2.41
May,97.3,2.7
June,97.27,2.73
July,97.29,2.71
August,97.08,2.92
September,97.38,2.62


[Back to top](#Data-Mart-Case-Study:)

## B7. What is the percentage of sales by demographic for each year in the dataset?

In [93]:
%%sql

SELECT 
    calendar_year,
    ROUND(
        SUM(CASE WHEN demographic = 'Couples' THEN sales END) * 100 / SUM(sales),
        2
    ) AS Couples_percentage,
    ROUND(
        SUM(CASE WHEN demographic = 'Families' THEN sales END) * 100 / SUM(sales),
        2
    ) AS Families_percentage,
    ROUND(
        SUM(CASE WHEN demographic = 'unknown' THEN sales END) * 100 / SUM(sales),
        2
    ) AS unknown_percentage
FROM clean_weekly_sales
GROUP BY 1
ORDER BY 1;

 * mysql://root:***@localhost:3306/uditdb
3 rows affected.


calendar_year,Couples_percentage,Families_percentage,unknown_percentage
2018,26.38,31.99,41.63
2019,27.28,32.47,40.25
2020,28.72,32.73,38.55


[Back to top](#Data-Mart-Case-Study:)

## B8. Which <span style="background-color: #a6ffff;">age_band</span> and <span style="background-color: #a6ffff;">demographic</span> values contribute the most to Retail sales?

In [101]:
%%sql

SELECT age_band,
       demographic,
       round(Sum(sales)*100/(select sum(sales) from  clean_weekly_sales where platform = "Retail")) AS retail_sales_percent
FROM   clean_weekly_sales
WHERE  platform = 'Retail'
GROUP  BY 1,
          2
ORDER  BY 3 DESC

 * mysql://root:***@localhost:3306/uditdb
7 rows affected.


age_band,demographic,retail_sales_percent
unknown,unknown,41
Retirees,Families,17
Retirees,Couples,16
Middle Aged,Families,11
Young Adults,Couples,7
Middle Aged,Couples,5
Young Adults,Families,4


[Back to top](#Data-Mart-Case-Study:)

## B9. Can we use the <span style="background-color: #a6ffff;">avg_transaction</span> column to find the average transaction size for each year for Retail vs Shopify? If not - how would you calculate it instead?

_Clarification:_ No. avg_transaction is a per-row ratio, so averaging it weights every row equally regardless of how many transactions that row covers. To get the average transaction size for a whole group, sum all the sales and divide by the sum of all the transactions.

In [117]:
%%sql

SELECT calendar_year,
       Round(Sum(CASE
                   WHEN platform = "Retail" THEN sales
                 END) / Sum(CASE
                              WHEN platform = "Retail" THEN transactions
                            END), 2) AS retail_avg_transaction,
       Round(Sum(CASE
                   WHEN platform = "Shopify" THEN sales
                 END) / Sum(CASE
                              WHEN platform = "Shopify" THEN transactions
                            END), 2) AS shopify_avg_transaction
FROM   clean_weekly_sales
GROUP  BY 1 ORDER BY 1

 * mysql://root:***@localhost:3306/uditdb
3 rows affected.


calendar_year,retail_avg_transaction,shopify_avg_transaction
2018,36.56,192.48
2019,36.83,183.36
2020,36.56,179.03
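
The difference between averaging the per-row `avg_transaction` values and dividing total sales by total transactions is easy to see on a toy example. The two rows below are hypothetical and are not taken from `weekly_sales`:

```python
# Hypothetical rows: (sales, transactions). Not taken from weekly_sales.
rows = [(1000.0, 10), (3000.0, 100)]

# Per-row average, as stored in the avg_transaction column.
per_row = [round(sales / transactions, 2) for sales, transactions in rows]

# Averaging the per-row values weights both rows equally.
mean_of_ratios = round(sum(per_row) / len(per_row), 2)

# Dividing the totals weights each row by its transaction count.
total_sales = sum(sales for sales, _ in rows)
total_transactions = sum(transactions for _, transactions in rows)
ratio_of_sums = round(total_sales / total_transactions, 2)

print(mean_of_ratios, ratio_of_sums)
```

Here the mean of the per-row values is 65.0, while total sales over total transactions is 36.36, because the second row covers ten times as many transactions as the first.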


[Back to top](#Data-Mart-Case-Study:)

# C. Before & After Analysis

<span style="background-color: #ffffa6;">This technique is typically used to measure the impact of an important event by comparing a period before it with a period after it.<br><br>We take the **week_date** value of **2020-06-15** as the baseline week in which the Data Mart sustainable packaging changes came into effect.<br><br>All **week_date** values from **2020-06-15** onwards belong to the period __**AFTER**__ the change, and the earlier **week_date** values belong to the period __**BEFORE**__.<br><br>Using this analysis approach, we answer the following questions:</span>


## C1. What is the total sales for the 4 weeks before and after 2020-06-15? What is the growth or reduction rate in actual values and percentage of sales?

In [4]:
%%sql

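with bef_aft as (
select 
-- Offsets are whole weeks from the baseline 2020-06-15, which itself has offset 0.
-- "between 1 and 4" therefore starts at 2020-06-22 and skips the baseline week.
-- Use "between 0 and 3" to count the baseline week as the first AFTER week.
sum(case when TIMESTAMPDIFF(WEEK, '2020-06-15', week_date) between 1 and 4 then sales end) as post_sales,
sum(case when TIMESTAMPDIFF(WEEK, '2020-06-15', week_date) between -4 and -1 then sales end) as pre_sales
from clean_weekly_sales)

select *, round((post_sales-pre_sales)*100/pre_sales,2) as percentage_change from bef_aft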

 * mysql://root:***@localhost:3306/uditdb
1 rows affected.


post_sales,pre_sales,percentage_change
2334905223,2345878357,-0.47
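
The question also asks for the change in actual values, which the query does not return. It follows directly from the two totals in the result above; a small sketch of the arithmetic, with hypothetical variable names:

```python
pre_sales = 2_345_878_357   # pre_sales from the query result above
post_sales = 2_334_905_223  # post_sales from the query result above

# Change in actual values, then the same change relative to the BEFORE period.
absolute_change = post_sales - pre_sales
percentage_change = round(absolute_change * 100 / pre_sales, 2)

print(absolute_change, percentage_change)
```

For these totals the absolute change is -10,973,134, which corresponds to the -0.47% reported by the query.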


[Back to top](#Data-Mart-Case-Study:)

## C2. What about the entire 12 weeks before and after?

In [3]:
%%sql

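with bef_aft as (
select 
-- Offset 0 (the baseline week 2020-06-15) is excluded here, and offset 12 falls in
-- week 37, which B2 lists as missing, so post_sales appears to sum 11 weeks of data
-- against 12 for pre_sales. "between 0 and 11" would compare 12 weeks with 12.
sum(case when TIMESTAMPDIFF(WEEK, '2020-06-15', week_date) between 1 and 12 then sales end) as post_sales,
sum(case when TIMESTAMPDIFF(WEEK, '2020-06-15', week_date) between -12 and -1 then sales end) as pre_sales
from clean_weekly_sales)

select *, round((post_sales-pre_sales)*100/pre_sales,2) as percentage_change from bef_aft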

 * mysql://root:***@localhost:3306/uditdb
1 rows affected.


post_sales,pre_sales,percentage_change
6403922405,7126273147,-10.14
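
The twelve-week change is far larger than the four-week one, so it is worth checking how many weeks of data each window actually contains. The sketch below assumes the dataset holds week numbers 13 to 36 (as found in B2) and that the baseline week 2020-06-15 is week 25 under the `Week(...) + 1` numbering used in the cleansing query; the function and constant names are hypothetical.

```python
PRESENT_WEEKS = set(range(13, 37))  # weeks 13..36, assumed from B2
BASELINE_WEEK = 25                  # assumed week number of 2020-06-15


def weeks_with_data(first_offset: int, last_offset: int) -> int:
    """Count weeks in [baseline + first_offset, baseline + last_offset] that have data."""
    window = range(BASELINE_WEEK + first_offset, BASELINE_WEEK + last_offset + 1)
    return sum(1 for week in window if week in PRESENT_WEEKS)


print("before (-12..-1):", weeks_with_data(-12, -1))
print("after  (1..12):  ", weeks_with_data(1, 12))
print("after  (0..11):  ", weeks_with_data(0, 11))
```

Under these assumptions the after window of offsets 1 to 12 holds 11 weeks of data against 12 in the before window, which would account for part of the drop. Offsets 0 to 11 compare 12 weeks with 12 and also treat 2020-06-15 as the first AFTER week, as the section introduction describes.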


[Back to top](#Data-Mart-Case-Study:)

## C3. How do the sale metrics for these 2 periods before and after compare with the previous years in 2018 and 2019?

_Clarification:_ What would the percentage change between the before and after periods have been if the same change had occurred in the same week of the other calendar years?

In [13]:
%%sql

with bef_aft as (
select calendar_year,
sum(case when week_number - (week('2020-06-15')+1) between 1 and 12 then sales end) as post_sales,
sum(case when week_number - (week('2020-06-15')+1) between -12 and -1 then sales end) as pre_sales
from clean_weekly_sales group by 1)

select *, round((post_sales-pre_sales)*100/pre_sales,2) as percentage_change from bef_aft

 * mysql://root:***@localhost:3306/uditdb
3 rows affected.


calendar_year,post_sales,pre_sales,percentage_change
2020,6403922405,7126273147,-10.14
2019,6303557285,6883386397,-8.42
2018,5976449777,6396562317,-6.57


[Back to top](#Data-Mart-Case-Study:)

# D. Bonus Question

## D1. Which areas of the business have the highest negative impact in sales metrics performance in 2020 for the 12 week before and after period?

- region
- platform
- age_band
- demographic
- customer_type

In [29]:
%%sql

with bef_aft as (
select region,
sum(case when TIMESTAMPDIFF(WEEK, '2020-06-15', week_date) between 1 and 12 then sales end) as post_sales,
sum(case when TIMESTAMPDIFF(WEEK, '2020-06-15', week_date) between -12 and -1 then sales end) as pre_sales
from clean_weekly_sales group by 1)

select region, round((post_sales-pre_sales)*100/pre_sales,2) as percentage_change from bef_aft order by 2

 * mysql://root:***@localhost:3306/uditdb
7 rows affected.


region,percentage_change
ASIA,-11.19
OCEANIA,-10.96
SOUTH AMERICA,-10.27
CANADA,-10.08
USA,-9.64
AFRICA,-8.6
EUROPE,-3.74


In [30]:
%%sql

with bef_aft as (
select platform,
sum(case when TIMESTAMPDIFF(WEEK, '2020-06-15', week_date) between 1 and 12 then sales end) as post_sales,
sum(case when TIMESTAMPDIFF(WEEK, '2020-06-15', week_date) between -12 and -1 then sales end) as pre_sales
from clean_weekly_sales group by 1)

select platform, round((post_sales-pre_sales)*100/pre_sales,2) as percentage_change from bef_aft order by 2

 * mysql://root:***@localhost:3306/uditdb
2 rows affected.


platform,percentage_change
Retail,-10.41
Shopify,-1.6


In [31]:
%%sql

with bef_aft as (
select age_band,
sum(case when TIMESTAMPDIFF(WEEK, '2020-06-15', week_date) between 1 and 12 then sales end) as post_sales,
sum(case when TIMESTAMPDIFF(WEEK, '2020-06-15', week_date) between -12 and -1 then sales end) as pre_sales
from clean_weekly_sales group by 1)

select age_band, round((post_sales-pre_sales)*100/pre_sales,2) as percentage_change from bef_aft order by 2

 * mysql://root:***@localhost:3306/uditdb
4 rows affected.


age_band,percentage_change
unknown,-11.18
Middle Aged,-10.06
Retirees,-9.33
Young Adults,-9.05


In [32]:
%%sql

with bef_aft as (
select demographic,
sum(case when TIMESTAMPDIFF(WEEK, '2020-06-15', week_date) between 1 and 12 then sales end) as post_sales,
sum(case when TIMESTAMPDIFF(WEEK, '2020-06-15', week_date) between -12 and -1 then sales end) as pre_sales
from clean_weekly_sales group by 1)

select demographic, round((post_sales-pre_sales)*100/pre_sales,2) as percentage_change from bef_aft order by 2

 * mysql://root:***@localhost:3306/uditdb
3 rows affected.


demographic,percentage_change
unknown,-11.18
Families,-9.94
Couples,-8.95


In [33]:
%%sql

with bef_aft as (
select customer_type,
sum(case when TIMESTAMPDIFF(WEEK, '2020-06-15', week_date) between 1 and 12 then sales end) as post_sales,
sum(case when TIMESTAMPDIFF(WEEK, '2020-06-15', week_date) between -12 and -1 then sales end) as pre_sales
from clean_weekly_sales group by 1)

select customer_type, round((post_sales-pre_sales)*100/pre_sales,2) as percentage_change from bef_aft order by 2

 * mysql://root:***@localhost:3306/uditdb
3 rows affected.


customer_type,percentage_change
Guest,-10.92
Existing,-10.34
New,-6.93
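
The five cells above differ only in the grouping column. One way to avoid the repetition is to build the query text from a template. The sketch below is an assumption about how that could look in Python, not the notebook's own method; `impact_query` is a hypothetical helper, and the allow-list ensures that only known column names are interpolated into the SQL.

```python
ALLOWED_DIMENSIONS = ("region", "platform", "age_band", "demographic", "customer_type")

# Same query text as the cells above, with the grouping column left as a placeholder.
QUERY_TEMPLATE = """
with bef_aft as (
select {dim},
sum(case when TIMESTAMPDIFF(WEEK, '2020-06-15', week_date) between 1 and 12 then sales end) as post_sales,
sum(case when TIMESTAMPDIFF(WEEK, '2020-06-15', week_date) between -12 and -1 then sales end) as pre_sales
from clean_weekly_sales group by 1)
select {dim}, round((post_sales-pre_sales)*100/pre_sales,2) as percentage_change
from bef_aft order by 2
"""


def impact_query(dimension: str) -> str:
    """Return the before/after query text grouped by one allowed column."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unsupported dimension: {dimension!r}")
    return QUERY_TEMPLATE.format(dim=dimension).strip()


for dim in ALLOWED_DIMENSIONS:
    # Print the line that changes between the five generated queries.
    print(impact_query(dim).splitlines()[1])
```

Each generated string could then be passed to whichever SQL client the notebook uses.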


[Back to top](#Data-Mart-Case-Study:)