# Breakfast at the Frat: A Time Series Analysis

This dataset contains sales and promotion information on the top five products from each of the top three brands within four selected categories (mouthwash, pretzels, frozen pizza, and boxed cereal), gathered from a sample of stores over 156 weeks. It includes:

- Unit sales, households, visits, and spend data by product, store, and week
- Base Price and Actual Shelf Price, to determine a product’s discount, if any
- Promotional support details (e.g., sale tag, in-store display), if applicable for the given product/store/week
- Store information, including size and location, as well as a price tier designation (e.g., upscale vs. value)
- Product information, including UPC, size, and description
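
As a minimal sketch of how the two price fields could be turned into a discount, the helper below expresses the gap between Base Price and Actual Shelf Price as a fraction of the base price. The function name `discount_fraction` is hypothetical; the sample values are the `BASE_PRICE` and `PRICE` of the first transaction row shown later in this notebook.

```python
def discount_fraction(base_price: float, shelf_price: float) -> float:
    """Return the discount as a fraction of the base price (0.0 means no discount)."""
    if base_price <= 0:
        raise ValueError("base_price must be positive")
    return (base_price - shelf_price) / base_price


# BASE_PRICE 1.57 and PRICE 1.39, taken from the first transaction row in this notebook.
print(round(discount_fraction(1.57, 1.39), 4))
```

A value of zero means the product sold at its base price that week; whether a small positive value counts as a promotion is a judgment call this sketch does not make.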

To identify outliers, consider examining:

- The ratio of units to visits
- The ratio of visits to households
- Items that may be out of stock or discontinued at a store
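
The two ratios above can be sketched in plain Python as follows. The helper names and the cut-off values are assumptions chosen only for illustration; the three sample tuples are the `UNITS`, `VISITS`, and `HHS` values from the transaction rows shown later in this notebook.

```python
def ratios(units: int, visits: int, households: int):
    """Return (units per visit, visits per household); None where the denominator is zero."""
    units_per_visit = units / visits if visits else None
    visits_per_household = visits / households if households else None
    return units_per_visit, visits_per_household


def looks_like_outlier(units, visits, households,
                       max_units_per_visit=5.0, max_visits_per_household=3.0):
    """Flag a row whose ratios exceed the (hypothetical) thresholds or cannot be computed."""
    upv, vph = ratios(units, visits, households)
    if upv is None or vph is None:
        return True
    return upv > max_units_per_visit or vph > max_visits_per_household


# (UNITS, VISITS, HHS) from the three sample transaction rows in this notebook.
sample_rows = [(13, 13, 13), (20, 18, 18), (14, 14, 14)]
for row in sample_rows:
    print(row, ratios(*row), looks_like_outlier(*row))
```

In practice the thresholds would more likely come from the observed distribution of each ratio (for example, a high percentile) than from fixed constants.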

**Source:** https://www.dunnhumby.com/source-files/

In [29]:
from pyspark.sql import SparkSession
import pandas as pd

In [2]:
spark = SparkSession.builder \
    .appName("breakfast") \
    .getOrCreate()

In [3]:
product_data_folder = "dataset/products"
store_data_folder = "dataset/stores"
transaction_data_folder = "dataset/transactions"

### Perform ETL to Answer the Following Questions

1. What is the range of prices offered on products?
1. What is the impact of promotions on units per visit, by geography?
1. For which products would you lower the price to increase sales?

In [48]:
product_df = (
    spark.read
    .option("header", True)
    .csv(product_data_folder)
)

In [55]:
product_df.show(5)

+----------+--------------------+-------------+--------------------+--------------------+------------+
|       UPC|         DESCRIPTION| MANUFACTURER|            CATEGORY|        SUB_CATEGORY|PRODUCT_SIZE|
+----------+--------------------+-------------+--------------------+--------------------+------------+
|1111009477|PL MINI TWIST PRE...|PRIVATE LABEL|          BAG SNACKS|            PRETZELS|       15 OZ|
|1111009497|   PL PRETZEL STICKS|PRIVATE LABEL|          BAG SNACKS|            PRETZELS|       15 OZ|
|1111009507|   PL TWIST PRETZELS|PRIVATE LABEL|          BAG SNACKS|            PRETZELS|       15 OZ|
|1111035398|PL BL MINT ANTSPT...|PRIVATE LABEL|ORAL HYGIENE PROD...|MOUTHWASHES (ANTI...|      1.5 LT|
|1111038078|PL BL MINT ANTSPT...|PRIVATE LABEL|ORAL HYGIENE PROD...|MOUTHWASHES (ANTI...|      500 ML|
+----------+--------------------+-------------+--------------------+--------------------+------------+
only showing top 5 rows



In [61]:
store_df = (
    spark.read
    .option("header", True)
    .csv(store_data_folder)
)

In [62]:
store_df.show(3)

+--------+------------------+-----------------+-----------------------+--------+--------------+-----------------+-------------------+------------------+
|STORE_ID|        STORE_NAME|ADDRESS_CITY_NAME|ADDRESS_STATE_PROV_CODE|MSA_CODE|SEG_VALUE_NAME|PARKING_SPACE_QTY|SALES_AREA_SIZE_NUM|AVG_WEEKLY_BASKETS|
+--------+------------------+-----------------+-----------------------+--------+--------------+-----------------+-------------------+------------------+
|     389|        SILVERLAKE|         ERLANGER|                     KY|   17140|    MAINSTREAM|              408|              46073|             24767|
|    2277|ANDERSON TOWNE CTR|       CINCINNATI|                     OH|   17140|       UPSCALE|             null|              81958|             54053|
|    4259|     WARSAW AVENUE|       CINCINNATI|                     OH|   17140|         VALUE|             null|              48813|             31177|
+--------+------------------+-----------------+-----------------------+--------+--

In [63]:
transactions_df = spark.read.option("header",True).csv(transaction_data_folder)

In [64]:
transactions_df.show(3)

+-------------+---------+----------+-----+------+---+-----+-----+----------+-------+-------+--------+
|WEEK_END_DATE|STORE_NUM|       UPC|UNITS|VISITS|HHS|SPEND|PRICE|BASE_PRICE|FEATURE|DISPLAY|TPR_ONLY|
+-------------+---------+----------+-----+------+---+-----+-----+----------+-------+-------+--------+
|    14-Jan-09|      367|1111009477|   13|    13| 13|18.07| 1.39|      1.57|      0|      0|       1|
|    14-Jan-09|      367|1111009497|   20|    18| 18| 27.8| 1.39|      1.39|      0|      0|       0|
|    14-Jan-09|      367|1111009507|   14|    14| 14|19.32| 1.38|      1.38|      0|      0|       0|
+-------------+---------+----------+-----+------+---+-----+-----+----------+-------+-------+--------+
only showing top 3 rows
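
Because the CSV is read without a schema, `WEEK_END_DATE` arrives as text such as `14-Jan-09`. A minimal sketch of turning that text into a real date, which any weekly time series work would need, is shown below. The helper name `parse_week_end` is hypothetical, and the `%b` month code assumes English month abbreviations in the active locale.

```python
from datetime import date, datetime


def parse_week_end(raw: str) -> date:
    """Parse a WEEK_END_DATE string such as '14-Jan-09' (day, month abbreviation, 2-digit year)."""
    return datetime.strptime(raw.strip(), "%d-%b-%y").date()


week_end = parse_week_end("14-Jan-09")
print(week_end.isoformat())  # 2009-01-14
```

Once parsed, the values sort chronologically, whereas the raw strings would sort by day of month first.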



In [80]:
# print(transactions_df.groupBy('UPC').agg({'PRICE': 'min'}).collect())
# print(transactions_df.groupBy('UPC').agg({'PRICE': 'max'}).collect())

In [65]:
product_df.createOrReplaceTempView('products')
transactions_df.createOrReplaceTempView('transactions')
store_df.createOrReplaceTempView('stores')

In [92]:
# 1. What is the range of prices offered on products?

products_price_range = spark.sql("""
                            select p.UPC
                                , p.DESCRIPTION
                                , p.CATEGORY
                                -- PRICE is read from CSV as a string, so cast it before min/max
                                , min(cast(t.PRICE as double)) as MIN_PRICE
                                , max(cast(t.PRICE as double)) as MAX_PRICE
                                , avg(t.PRICE) as AVG_PRICE
                            from products p
                            left join transactions t
                            on p.UPC = t.UPC
                            group by p.UPC, p.DESCRIPTION, p.CATEGORY
                            order by p.UPC
                        """)

products_price_range.show()

# destination = "products_price_range"
# products_price_range.write.partitionBy("year","month","day").mode("overwrite").csv(destination)

+----------+--------------------+--------------------+---------+---------+------------------+
|       UPC|         DESCRIPTION|            CATEGORY|MIN_PRICE|MAX_PRICE|         AVG_PRICE|
+----------+--------------------+--------------------+---------+---------+------------------+
|1111009477|PL MINI TWIST PRE...|          BAG SNACKS|     0.89|     1.83| 1.300309097001017|
|1111009497|   PL PRETZEL STICKS|          BAG SNACKS|     0.86|     1.69|1.3023260869563715|
|1111009507|   PL TWIST PRETZELS|          BAG SNACKS|      0.8|     1.69|1.3116138175375258|
|1111035398|PL BL MINT ANTSPT...|ORAL HYGIENE PROD...|        1|     4.69|3.1535704656228067|
|1111038078|PL BL MINT ANTSPT...|ORAL HYGIENE PROD...|     0.47|     3.08|1.4523977596016782|
|1111038080|PL ANTSPTC SPG MN...|ORAL HYGIENE PROD...|     0.46|     4.18|1.4451583583208103|
|1111085319|PL HONEY NUT TOAS...|         COLD CEREAL|     1.07|     1.99| 1.759916380968303|
|1111085345|      PL RAISIN BRAN|         COLD CEREAL|     0
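
One caveat when reading price ranges: the CSV files are loaded with only the `header` option, so Spark treats every column, including `PRICE`, as a string, and `min`/`max` over strings compare character by character rather than numerically. The sketch below reproduces that effect in plain Python. The three price strings are made-up values chosen only to expose the problem; they are not taken from the dataset.

```python
# Hypothetical price strings (not from the dataset) that cross the 10.00 boundary.
prices_as_text = ["9.99", "10.49", "0.8"]

# String comparison is lexicographic: "10.49" sorts before "9.99" because "1" < "9".
text_min, text_max = min(prices_as_text), max(prices_as_text)

# Converting to numbers first gives the intended ordering.
prices = [float(p) for p in prices_as_text]
num_min, num_max = min(prices), max(prices)

print("as text:   ", text_min, text_max)
print("as numbers:", num_min, num_max)
```

Casting `PRICE` to a numeric type in the query, or supplying a schema when reading the CSV, avoids this. The ranges are only affected for products whose prices differ in the number of digits before the decimal point.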

In [88]:
# 2. What is the impact on units/visit of promotions by geographies?
# First step only: count transaction rows per store (promotion impact is not computed here)

store_count_trans = spark.sql("""
                        select s.STORE_ID
                            , s.STORE_NAME
                            , count(t.STORE_NUM) AS NUM_TRANSACTIONS
                        from stores s
                        left join transactions t
                        on s.STORE_ID = t.STORE_NUM
                        group by s.STORE_ID, s.STORE_NAME
                        order by NUM_TRANSACTIONS
                    """)

store_count_trans.show()

+--------+--------------------+----------------+
|STORE_ID|          STORE_NAME|NUM_TRANSACTIONS|
+--------+--------------------+----------------+
|    8035|      OVER-THE-RHINE|            4119|
|   23055|WALNUT HILLS/PEEBLES|            4723|
|   11967|     NORTHBOROUGH SQ|            5104|
|    2523|  LANDMARK PLACE S/C|            5258|
|   15755| KROGER JUNCTION S/C|            5391|
|   10019|      AT EASTEX FRWY|            5477|
|   25253|          HIGHWAY 75|            5732|
|    2541|             NORWOOD|            5801|
|   17599|             KEARNEY|            5880|
|     367|      15TH & MADISON|            5946|
|     387|      TOWN & COUNTRY|            6054|
|   21485|             HOUSTON|            6087|
|    4521|  PARKWAY SQUARE S/C|            6113|
|    6431|        AT WARD ROAD|            6117|
|   23349|             GARLAND|            6136|
|   12011|             SHERMAN|            6138|
|   25233| ANTOINE TOWN CENTER|            6183|
|   19523|         D
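
The store counts above do not yet relate promotions to units per visit. A minimal sketch of that aggregation in plain Python follows. It rests on several assumptions: a row counts as promoted when any of `FEATURE`, `DISPLAY`, or `TPR_ONLY` is set; geography means the store's state; and the mapping `store_state` uses the placeholder `"STATE_A"`, because the state of store 367 does not appear in the output shown in this notebook. The three rows are the sample transactions shown earlier.

```python
from collections import defaultdict


def units_per_visit_by_group(rows, store_state):
    """Return {(state, promoted): total units / total visits} for the given rows."""
    totals = defaultdict(lambda: [0, 0])  # key -> [units, visits]
    for r in rows:
        # Assumption: any of the three promotion flags marks the row as promoted.
        promoted = bool(r["FEATURE"] or r["DISPLAY"] or r["TPR_ONLY"])
        key = (store_state[r["STORE_NUM"]], promoted)
        totals[key][0] += r["UNITS"]
        totals[key][1] += r["VISITS"]
    # Skip groups with zero visits to avoid dividing by zero.
    return {k: u / v for k, (u, v) in totals.items() if v}


# Sample rows from this notebook's transactions output (store 367).
rows = [
    {"STORE_NUM": 367, "UNITS": 13, "VISITS": 13, "FEATURE": 0, "DISPLAY": 0, "TPR_ONLY": 1},
    {"STORE_NUM": 367, "UNITS": 20, "VISITS": 18, "FEATURE": 0, "DISPLAY": 0, "TPR_ONLY": 0},
    {"STORE_NUM": 367, "UNITS": 14, "VISITS": 14, "FEATURE": 0, "DISPLAY": 0, "TPR_ONLY": 0},
]
store_state = {367: "STATE_A"}  # placeholder: the real state would come from the stores table

result = units_per_visit_by_group(rows, store_state)
for key in sorted(result):
    print(key, round(result[key], 4))
```

The same grouping could be expressed in Spark SQL by joining `transactions` to `stores` and grouping on the state column and a promotion flag. Summing units and visits before dividing weights each group by its traffic, rather than averaging per-row ratios.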

In [94]:
# 3. Which products would you lower the price to increase sales?

product_price_trans = spark.sql("""
                            select p.UPC
                                , p.DESCRIPTION
                                , p.CATEGORY
                                , t.PRICE 
                                , count(t.PRICE)
                            from products p
                            left join transactions t
                            on p.UPC = t.UPC
                            group by p.UPC, p.DESCRIPTION, p.CATEGORY, t.PRICE
                            order by p.UPC, t.PRICE
                        """)

product_price_trans.show()

+----------+--------------------+----------+-----+------------+
|       UPC|         DESCRIPTION|  CATEGORY|PRICE|count(PRICE)|
+----------+--------------------+----------+-----+------------+
|1111009477|PL MINI TWIST PRE...|BAG SNACKS| 0.89|           2|
|1111009477|PL MINI TWIST PRE...|BAG SNACKS|  0.9|           3|
|1111009477|PL MINI TWIST PRE...|BAG SNACKS| 0.91|           7|
|1111009477|PL MINI TWIST PRE...|BAG SNACKS| 0.92|          19|
|1111009477|PL MINI TWIST PRE...|BAG SNACKS| 0.93|          31|
|1111009477|PL MINI TWIST PRE...|BAG SNACKS| 0.94|          62|
|1111009477|PL MINI TWIST PRE...|BAG SNACKS| 0.95|          95|
|1111009477|PL MINI TWIST PRE...|BAG SNACKS| 0.96|         229|
|1111009477|PL MINI TWIST PRE...|BAG SNACKS| 0.97|         419|
|1111009477|PL MINI TWIST PRE...|BAG SNACKS| 0.98|         627|
|1111009477|PL MINI TWIST PRE...|BAG SNACKS| 0.99|         401|
|1111009477|PL MINI TWIST PRE...|BAG SNACKS|    1|         101|
|1111009477|PL MINI TWIST PRE...|BAG SNA

In [104]:
# Transform product_size

spark.sql("""
    select p.UPC
    , p.PRODUCT_SIZE
    , case when contains(p.PRODUCT_SIZE, ' ')  
        then substring_index(p.PRODUCT_SIZE, ' ', 1)
        else p.PRODUCT_SIZE
      end AS PRODUCT_NUM
    from products p
""").show()

+----------+------------+-----------+
|       UPC|PRODUCT_SIZE|PRODUCT_NUM|
+----------+------------+-----------+
|1111009477|       15 OZ|         15|
|1111009497|       15 OZ|         15|
|1111009507|       15 OZ|         15|
|1111035398|      1.5 LT|        1.5|
|1111038078|      500 ML|        500|
|1111038080|      500 ML|        500|
|1111085319|    12.25 OZ|      12.25|
|1111085345|       20 OZ|         20|
|1111085350|       18 OZ|         18|
|1111087395|     32.7 OZ|       32.7|
|1111087396|     30.5 OZ|       30.5|
|1111087398|     29.6 OZ|       29.6|
|1600027527|    12.25 OZ|      12.25|
|1600027528|       18 OZ|         18|
|1600027564|       12 OZ|         12|
|2066200530|     13.2 OZ|       13.2|
|2066200531|     13.3 OZ|       13.3|
|2066200532|     14.7 OZ|       14.7|
|2840002333|       10 OZ|         10|
|2840004768|       16 OZ|         16|
+----------+------------+-----------+
only showing top 20 rows
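
The query above keeps only the numeric part of `PRODUCT_SIZE`, so `500` (ML) and `1.5` (LT) end up on different scales. A minimal sketch that keeps the unit next to the number is shown below. The helper name `parse_product_size` is hypothetical, and the sketch assumes every size has the form `<number> <unit>`, as in the rows displayed.

```python
def parse_product_size(size: str):
    """Split a size such as '12.25 OZ' into (12.25, 'OZ'); the unit is None if absent."""
    parts = size.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("empty product size")
    value = float(parts[0])  # raises ValueError if the first token is not numeric
    unit = parts[1] if len(parts) > 1 else None
    return value, unit


# Sizes copied from the output above.
for raw in ["15 OZ", "1.5 LT", "500 ML", "12.25 OZ"]:
    print(raw, "->", parse_product_size(raw))
```

With the unit retained, sizes can later be normalized within a category, for example before comparing price per unit of volume or weight.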

