This notebook explores the Anaplan sales forecast extract (000s) as pulled on a single date. It considers:
- which columns are available
- data quality
- data inconsistencies

For more information on outcomes, see the documentation at https://swirecocacola.sharepoint.com/:w:/s/ERA/Ec3i8vaHr95Ftz03OEsJ5YYBEWVDm6ir3HVcHOzY6bLPNA?e=BigXd6

#Set Up

##Importing Modules

In [0]:
#Import SPARK Libraries
from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql.types import *

##Loading Data

In [0]:
#view contents of the dbfs

In [0]:
%fs 
ls dbfs:/user/hive/warehouse/rgb_db.db/ 

path,name,size,modificationTime
dbfs:/user/hive/warehouse/rgb_db.db/forecast_extract_01/,forecast_extract_01/,0,1674844310000
dbfs:/user/hive/warehouse/rgb_db.db/test_table/,test_table/,0,1674690806000


In [0]:
#read table
df = spark.table('rgb_db.forecast_extract_01')

In [0]:
df.select('retailer').distinct().groupby().count().show()

+-----+
|count|
+-----+
|   59|
+-----+



#EDA

In [0]:
df.limit(6).display()

Time,Version,Value_Type,Period_Code,Year,Period,Week,WeekofPeriod,Super_Division,Profit_Center,Sales_Office,Sales_Office_Description,Super_Channel,Super_Channel_Description,Retailer,Retailer_Description,Cold_Drink_Channel,Cold_Drink_Channel_Description,Material,Material_SWEEP_Category_ID,Material_SWEEP_Category_Description,Material_PkgCat,Material_PkgCat_Description,Material_Packs_per_Case,Wholesale_Price,Discount,NSI_ADV,CMA/JBP_Allowances,CCF_National_on_Invoice,CCF_Regional_on_Invoice,CTM_on_Invoice,CCF_National_Accrual,CCF_Regional_Accrual,CTM_National_Accrual,CTM_Regional_Accrual,DNNSI,LMP,Dr_Pepper_CSD_Funding,Full_Service_Commission,COGSIPINCL,COGSIPEXCL,Facilitation_Fees,Reimbursements,ReimbursementsBaseFunding,DNGP,Effective_Unit_Price,Invoice_Price,Volume,Volume_SPC,Volume_EQV,Transactions,Forecast_DNNSI,Forecast_DNGP,Forecast_Volume,Forecast_Volume_SPC,Forecast_Volume_EQV,Retail_Price,Take_Rate,Price_Multiple
Week 14 FY22,000S,30,202204,2022,4,14,1,RM,4400335000,G235,"JOHNSTOWN, CO",A,ALL OTHER,0,ALL OTHER PLAN KA,,,100278,1,SWEEP-SPARKLING,AA,SSD CANS 12Z 6PK 4CT,4,-30.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,-30.0,0.0,0.0,0.0,-7.12011,-6.75,0.0,0.0,0.0,-23.25,0.0,-30.0,-1.0,-1.0,-1.5,-4.0,0.0,7.12011,0.0,0.0,0.0,0.0,0.0,
Week 15 FY22,000S,30,202204,2022,4,15,2,RM,4400335000,G235,"JOHNSTOWN, CO",A,ALL OTHER,0,ALL OTHER PLAN KA,,,100278,1,SWEEP-SPARKLING,AA,SSD CANS 12Z 6PK 4CT,4,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-4.3,0.0,0.0,0.0,4.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
Week 31 FY22,000S,30,202208,2022,8,31,1,RM,4400335000,G235,"JOHNSTOWN, CO",A,ALL OTHER,0,ALL OTHER PLAN KA,,,100278,1,SWEEP-SPARKLING,AA,SSD CANS 12Z 6PK 4CT,4,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.43,4.43,0.25,0.25,0.375,0.0,0.0,
Week 39 FY22,000S,30,202209,2022,9,39,5,RM,4400335000,G235,"JOHNSTOWN, CO",A,ALL OTHER,0,ALL OTHER PLAN KA,,,100278,1,SWEEP-SPARKLING,AA,SSD CANS 12Z 6PK 4CT,4,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.15,22.15,1.25,1.25,1.875,0.0,0.0,
Week 3 FY22,000S,30,202201,2022,1,3,3,RM,4400351000,G151,"SCOTTSBLUFF, NE",A,ALL OTHER,0,ALL OTHER PLAN KA,,,100278,1,SWEEP-SPARKLING,AA,SSD CANS 12Z 6PK 4CT,4,4830.0,4508.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,322.0,0.0,0.0,0.0,1602.00796,1365.05,0.0,0.0,0.0,-1043.05,0.0,322.0,161.0,161.0,241.5,644.0,2852.92,1250.9120399999997,161.0,161.0,241.5,0.0,0.0,
Week 4 FY22,000S,30,202201,2022,1,4,4,RM,4400351000,G151,"SCOTTSBLUFF, NE",A,ALL OTHER,0,ALL OTHER PLAN KA,,,100278,1,SWEEP-SPARKLING,AA,SSD CANS 12Z 6PK 4CT,4,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-169.99,0.0,0.0,0.0,169.99,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,


## Time
time | period_code | year | period | week | weekofperiod

Weeks 1-53 of FY22 and weeks 1-52 of FY23 are included.  
The period columns are as expected.  
FIXME: There is no date recording when this forecast was pulled.  
DROP time  
DROP period_code

TODO: Explain how Anaplan weeks work.  
  
Per the business contact, the weeks should match up.  
Anaplan sometimes breaks the last week of a year out into a week 53.

In [0]:
#distinct time labels; sorted as strings, so 'Week 9' lands above 'Week 53'
df.select('time').distinct().orderBy(['time'], ascending=[False]).show()

+------------+
|        time|
+------------+
| Week 9 FY23|
| Week 9 FY22|
| Week 8 FY23|
| Week 8 FY22|
| Week 7 FY23|
| Week 7 FY22|
| Week 6 FY23|
| Week 6 FY22|
|Week 53 FY22|
|Week 52 FY23|
|Week 52 FY22|
|Week 51 FY23|
|Week 51 FY22|
|Week 50 FY23|
|Week 50 FY22|
| Week 5 FY23|
| Week 5 FY22|
|Week 49 FY23|
|Week 49 FY22|
|Week 48 FY23|
+------------+
only showing top 20 rows
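
Because `time` is a string, the listing above is in lexicographic order rather than chronological order. A minimal sketch of a chronological sort key follows, in plain Python so it can be checked without a cluster. The helper name `time_sort_key` is hypothetical, and mapping `FY22` to 2022 is an assumption based on the `Year` column in the sample rows.

```python
import re

# Labels look like "Week 9 FY23": a week number and a two-digit fiscal year.
TIME_LABEL = re.compile(r"^Week (\d{1,2}) FY(\d{2})$")


def time_sort_key(label):
    """Turn 'Week 9 FY23' into (2023, 9) so labels sort chronologically."""
    match = TIME_LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unexpected time label: {label!r}")
    week, fiscal_year = int(match.group(1)), int(match.group(2))
    # Assumption: FY22 means 2022, as the Year column suggests.
    return 2000 + fiscal_year, week


labels = ["Week 9 FY23", "Week 53 FY22", "Week 52 FY23", "Week 5 FY22"]
print(sorted(labels, key=time_sort_key))
```

In Spark itself, ordering by the existing `year` and `week` columns would likely give the same result without any parsing.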



In [0]:
#checking period is as expected
(df.select(['period_code'
          , 'year'
          , 'period'
          , 'week'
          , 'weekofperiod'])
 .distinct()
 .orderBy(['year', 'week'])
 .show(5)
)

+-----------+----+------+----+------------+
|period_code|year|period|week|weekofperiod|
+-----------+----+------+----+------------+
|     202201|2022|     1|   1|           1|
|     202201|2022|     1|   2|           2|
|     202201|2022|     1|   3|           3|
|     202201|2022|     1|   4|           4|
|     202202|2022|     2|   5|           1|
+-----------+----+------+----+------------+
only showing top 5 rows
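
The rows sampled in this notebook look consistent with a 4-4-5 week pattern and with `period_code` being `year * 100 + period`. This is an inference from the rows shown, not a confirmed description of the Anaplan calendar. The sketch below encodes that assumption and compares it with observed rows. The name `expected_calendar` is hypothetical, and week 53 is deliberately left out because its handling is not documented here.

```python
# Assumed pattern: periods of 4, 4 and 5 weeks, repeated for four quarters.
WEEKS_PER_PERIOD = [4, 4, 5] * 4  # 12 periods, 52 weeks


def expected_calendar(year, week):
    """Return (period_code, period, weekofperiod) under the assumed pattern."""
    if not 1 <= week <= 52:
        # Week 53 handling is unknown, so it is not modelled.
        raise ValueError(f"week {week} is outside the assumed 52-week pattern")
    remaining = week
    for period, length in enumerate(WEEKS_PER_PERIOD, start=1):
        if remaining <= length:
            return year * 100 + period, period, remaining
        remaining -= length


# (year, week) -> (period_code, period, weekofperiod), copied from outputs
# shown in this notebook.
observed = {
    (2022, 3): (202201, 1, 3),
    (2022, 5): (202202, 2, 1),
    (2022, 14): (202204, 4, 1),
    (2022, 31): (202208, 8, 1),
    (2022, 39): (202209, 9, 5),
    (2023, 8): (202302, 2, 4),
}
mismatches = {k: v for k, v in observed.items() if expected_calendar(*k) != v}
print(mismatches)  # an empty dict means the sampled rows fit the assumption
```

If the pattern holds across the full extract, `period_code` carries no information beyond `year` and `period`, which supports dropping it.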



##Version
version | value_type

Both columns hold a single constant value across the extract (000S and 30).  
DROP version  
DROP value_type

In [0]:
df.select(['version', 'value_type']).distinct().show()

+-------+----------+
|version|value_type|
+-------+----------+
|   000S|        30|
+-------+----------+



##Location
super_division | profit_center | sales_office | sales_office_description

All looks normal.  
FIXME: Need to see how the new location structure will look in this data.  
DROP sales_office_description

Did the new location structure get uploaded to Anaplan?  
  
They built the new hierarchy on top of the old hierarchy.  
Eventually they want to drop the old one.  
Right now they are using the new profit centers.  
The cutover was January 1st.

In [0]:
df.select(['super_division','profit_center','sales_office','sales_office_description']).show(6)

+--------------+-------------+------------+------------------------+
|super_division|profit_center|sales_office|sales_office_description|
+--------------+-------------+------------+------------------------+
|            RM|   4400335000|        G235|           JOHNSTOWN, CO|
|            RM|   4400335000|        G235|           JOHNSTOWN, CO|
|            RM|   4400335000|        G235|           JOHNSTOWN, CO|
|            RM|   4400335000|        G235|           JOHNSTOWN, CO|
|            RM|   4400351000|        G151|         SCOTTSBLUFF, NE|
|            RM|   4400351000|        G151|         SCOTTSBLUFF, NE|
+--------------+-------------+------------+------------------------+
only showing top 6 rows



In [0]:
df.select(['sales_office']).distinct().count()

Out[7]: 45

In [0]:
#descriptions are 1 to 1 with offices, so we can drop the description column to save space

(df.select(['sales_office','sales_office_description']) #select relevant columns
 .groupby('sales_office') #group by sales office
 .agg(F.count_distinct('sales_office_description').alias('count')) #count distinct descriptions per sales office, named 'count'
 .sort('count', ascending=False) #sort descending to see if there are duplicates
 .show() #display to console
)

+------------+-----+
|sales_office|count|
+------------+-----+
|        G131|    1|
|        G182|    1|
|        G294|    1|
|        G292|    1|
|        G151|    1|
|        G181|    1|
|        G222|    1|
|        G144|    1|
|        G267|    1|
|        G113|    1|
|        G236|    1|
|        G163|    1|
|        G183|    1|
|        G281|    1|
|        G143|    1|
|        G223|    1|
|        G282|    1|
|        G175|    1|
|        G141|    1|
|        G266|    1|
+------------+-----+
only showing top 20 rows



In [0]:
#each profit center maps to one sales office, and each sales office has 3 profit centers. Keep both, as translating between these grains can be tricky

(df.select(['sales_office','profit_center']) 
 .groupby('sales_office')
 .agg(F.count_distinct('profit_center').alias('count'))
 .sort('count', ascending=False)
 .show(6)
)

df.select(['sales_office','profit_center']).filter(col('sales_office')=='G131').distinct().show() 

+------------+-----+
|sales_office|count|
+------------+-----+
|        G131|    3|
|        G182|    3|
|        G294|    3|
|        G292|    3|
|        G151|    3|
|        G181|    3|
+------------+-----+
only showing top 6 rows

+------------+-------------+
|sales_office|profit_center|
+------------+-------------+
|        G131|   4400531000|
|        G131|   4400331000|
|        G131|   4400731000|
+------------+-------------+
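
The cell above checks only one direction (profit centers per sales office). A minimal sketch of a helper that checks either direction is shown below, in plain Python on pairs copied from the outputs in this notebook. The helper name `multi_valued_keys` is hypothetical. On the real data the pairs would come from the distinct `(sales_office, profit_center)` combinations.

```python
from collections import defaultdict


def multi_valued_keys(pairs):
    """Return {key: sorted values} for keys that map to more than one value."""
    seen = defaultdict(set)
    for key, value in pairs:
        seen[key].add(value)
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


# (sales_office, profit_center) pairs taken from the outputs in this notebook.
pairs = [
    ("G131", 4400531000),
    ("G131", 4400331000),
    ("G131", 4400731000),
    ("G235", 4400335000),
    ("G151", 4400351000),
]

# Office -> profit center: G131 fans out to three profit centers.
print(multi_valued_keys(pairs))

# Profit center -> office: swap the pair order. An empty dict means every
# profit center in this sample belongs to exactly one office.
print(multi_valued_keys((pc, office) for office, pc in pairs))
```

The same helper could be reused for the code-to-description checks in the sections that follow.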



In [0]:
#could drop super_division since sales office already implies it, but wait to see the new structure first

(df.select(['sales_office','super_division']) 
 .groupby('sales_office')
 .agg(F.count_distinct('super_division').alias('count'))
 .sort('count', ascending=False)
 .show(6)
)

+------------+-----+
|sales_office|count|
+------------+-----+
|        G131|    1|
|        G182|    1|
|        G294|    1|
|        G292|    1|
|        G151|    1|
|        G181|    1|
+------------+-----+
only showing top 6 rows



##Customer
super_channel | super_channel_description | retailer | retailer_description | cold_drink_channel | cold_drink_channel_description

All looks normal.  
cold_drink_channel is about 78% null. Is it needed?  
DROP super_channel_description  
DROP retailer_description  
DROP cold_drink_channel_description

In [0]:
df.select(['super_channel','super_channel_description','retailer','retailer_description', 'cold_drink_channel', 'cold_drink_channel_description']).show(6)

+-------------+-------------------------+--------+--------------------+------------------+------------------------------+
|super_channel|super_channel_description|retailer|retailer_description|cold_drink_channel|cold_drink_channel_description|
+-------------+-------------------------+--------+--------------------+------------------+------------------------------+
|            A|                ALL OTHER|       0|   ALL OTHER PLAN KA|              null|                          null|
|            A|                ALL OTHER|       0|   ALL OTHER PLAN KA|              null|                          null|
|            A|                ALL OTHER|       0|   ALL OTHER PLAN KA|              null|                          null|
|            A|                ALL OTHER|       0|   ALL OTHER PLAN KA|              null|                          null|
|            A|                ALL OTHER|       0|   ALL OTHER PLAN KA|              null|                          null|
|            A|         

In [0]:
(df.select(['super_channel','super_channel_description']) 
 .groupby('super_channel')
 .agg(F.count_distinct('super_channel_description').alias('count'))
 .sort('count', ascending=False)
 .show(6)
)

+-------------+-----+
|super_channel|count|
+-------------+-----+
|            F|    1|
|            M|    1|
|            V|    1|
|            D|    1|
|            O|    1|
|            C|    1|
+-------------+-----+
only showing top 6 rows



In [0]:
(df.select(['retailer','retailer_description']) 
 .groupby('retailer')
 .agg(F.count_distinct('retailer_description').alias('count'))
 .sort('count', ascending=False)
 .show(6)
)

+--------+-----+
|retailer|count|
+--------+-----+
|     673|    1|
|     193|    1|
|     115|    1|
|   15332|    1|
|     412|    1|
|   46159|    1|
+--------+-----+
only showing top 6 rows



In [0]:
(df.select(['cold_drink_channel','cold_drink_channel_description']) 
 .groupby('cold_drink_channel')
 .agg(F.count_distinct('cold_drink_channel_description').alias('count'))
 .sort('count', ascending=False)
 .show(6)
)

+------------------+-----+
|cold_drink_channel|count|
+------------------+-----+
|                85|    1|
|                65|    1|
|                91|    1|
|                35|    1|
|                55|    1|
|                10|    1|
+------------------+-----+
only showing top 6 rows



In [0]:
null_ct = df.filter(col('cold_drink_channel').isNull()).count()
total_ct = df.count()
null_pct = null_ct / total_ct
print(null_pct)

0.7849939828926793
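
The same null-share calculation can be applied to every column at once. Below is a minimal sketch in plain Python, assuming rows are available as dictionaries. The helper name `null_fractions` is hypothetical and the sample rows are made up for illustration, not taken from the extract.

```python
def null_fractions(rows):
    """Return {column: share of rows where the value is None}."""
    rows = list(rows)
    if not rows:
        return {}
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}


# Made-up sample rows, for illustration only.
sample = [
    {"retailer": 0, "cold_drink_channel": None},
    {"retailer": 193, "cold_drink_channel": None},
    {"retailer": 673, "cold_drink_channel": None},
    {"retailer": 412, "cold_drink_channel": 85},
]
print(null_fractions(sample))
```

On the Spark DataFrame, an untested equivalent would be `df.select([F.mean(F.col(c).isNull().cast('int')).alias(c) for c in df.columns])`, which computes all shares in one pass.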


##Material  
material | material_sweep_category_id | material_sweep_category_description | material_pkgCat | material_PkgCat_description | material_packs_per_case

All of these attributes are available on the material table, so we can drop everything except the material id.  
DROP material_sweep_category_id  
DROP material_sweep_category_description  
DROP material_pkgCat  
DROP material_PkgCat_description  
DROP material_packs_per_case

FIXME: Why does material not uniquely identify a row?  
How can a single material have multiple package categories?  
  
This does not make sense to the business contact. He expects material to uniquely identify a row.

In [0]:
df.select(['material', 'material_sweep_category_id', 'material_sweep_category_description', 'material_pkgCat', 'material_PkgCat_description', 'material_packs_per_case']).show(6)

+--------+--------------------------+-----------------------------------+---------------+---------------------------+-----------------------+
|material|material_sweep_category_id|material_sweep_category_description|material_pkgCat|material_PkgCat_description|material_packs_per_case|
+--------+--------------------------+-----------------------------------+---------------+---------------------------+-----------------------+
|  100278|                         1|                    SWEEP-SPARKLING|             AA|       SSD CANS 12Z 6PK 4CT|                      4|
|  100278|                         1|                    SWEEP-SPARKLING|             AA|       SSD CANS 12Z 6PK 4CT|                      4|
|  100278|                         1|                    SWEEP-SPARKLING|             AA|       SSD CANS 12Z 6PK 4CT|                      4|
|  100278|                         1|                    SWEEP-SPARKLING|             AA|       SSD CANS 12Z 6PK 4CT|                      4|
|  100

In [0]:
#each material is in just one SWEEP
(df.select('material','material_sweep_category_id','material_sweep_category_description')
 .groupBy('material')
 .agg(F.count_distinct('material_sweep_category_id').alias('count'))
 .sort('count', ascending = False)
 .show(25)
)

+--------+-----+
|material|count|
+--------+-----+
|  157307|    1|
|  130629|    1|
|  151791|    1|
|  157410|    1|
|  411653|    1|
|  102079|    1|
|  151762|    1|
|  133255|    1|
|  158050|    1|
|  129089|    1|
|  900128|    1|
|  103942|    1|
|  152061|    1|
|  154834|    1|
|  157196|    1|
|  156853|    1|
|  156123|    1|
|  156849|    1|
|  410845|    1|
|  145206|    1|
|  156283|    1|
|  135337|    1|
|  151818|    1|
|  151944|    1|
|  151413|    1|
+--------+-----+
only showing top 25 rows



###Material Package Category

In [0]:
(df.select('material','material_pkgcat','material_pkgcat_description')
 .filter(col('material')==119826)
 .groupBy('material','material_pkgcat','material_pkgcat_description')
 .count()
 .show()
)

+--------+---------------+---------------------------+------+
|material|material_pkgcat|material_pkgcat_description| count|
+--------+---------------+---------------------------+------+
|  119826|             CC|            SSD NR 20Z 24CT|101958|
|  119826|             BN|          SSD NR 13.2Z 24CT|   320|
+--------+---------------+---------------------------+------+



Per Snowflake, this material should be CC. Only 320 of the 102,278 rows for material 119826 (about 0.3%) carry BN.

##Features

FIXME: what do the non-forecast columns represent?  
FIXME: what do the forecast columns represent?

In [0]:
df.select(df.columns[24:]).limit(10).display()

Wholesale_Price,Discount,NSI_ADV,CMA/JBP_Allowances,CCF_National_on_Invoice,CCF_Regional_on_Invoice,CTM_on_Invoice,CCF_National_Accrual,CCF_Regional_Accrual,CTM_National_Accrual,CTM_Regional_Accrual,DNNSI,LMP,Dr_Pepper_CSD_Funding,Full_Service_Commission,COGSIPINCL,COGSIPEXCL,Facilitation_Fees,Reimbursements,ReimbursementsBaseFunding,DNGP,Effective_Unit_Price,Invoice_Price,Volume,Volume_SPC,Volume_EQV,Transactions,Forecast_DNNSI,Forecast_DNGP,Forecast_Volume,Forecast_Volume_SPC,Forecast_Volume_EQV,Retail_Price,Take_Rate,Price_Multiple
-30.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,-30.0,0.0,0.0,0.0,-7.12011,-6.75,0.0,0.0,0.0,-23.25,0.0,-30.0,-1.0,-1.0,-1.5,-4.0,0.0,7.12011,0.0,0.0,0.0,0.0,0.0,
0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-4.3,0.0,0.0,0.0,4.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.43,4.43,0.25,0.25,0.375,0.0,0.0,
0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.15,22.15,1.25,1.25,1.875,0.0,0.0,
4830.0,4508.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,322.0,0.0,0.0,0.0,1602.00796,1365.05,0.0,0.0,0.0,-1043.05,0.0,322.0,161.0,161.0,241.5,644.0,2852.92,1250.9120399999997,161.0,161.0,241.5,0.0,0.0,
0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-169.99,0.0,0.0,0.0,169.99,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
90.0,81.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,29.85108,25.62,0.0,0.0,0.0,-16.62,0.0,9.0,3.0,3.0,4.5,12.0,53.16,23.30892,3.0,3.0,4.5,0.0,0.0,
0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2.65,0.0,0.0,0.0,2.65,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
4800.0,4320.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,480.0,0.0,0.0,0.0,1592.0576,1366.67,0.0,0.0,0.0,-886.67,0.0,480.0,160.0,160.0,240.0,640.0,2835.2,1243.1423999999995,160.0,160.0,240.0,0.0,0.0,
30.0,27.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,9.95036,-132.82,0.0,0.0,0.0,135.82,0.0,3.0,1.0,1.0,1.5,4.0,17.72,7.769639999999999,1.0,1.0,1.5,0.0,0.0,


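dnnsi is wholesale_price less discount and nsi_adv. The check below mostly holds: three of the four rows shown differ only by floating-point noise, while one differs materially (532.9 vs 536.04).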

In [0]:
(df.select(['wholesale_price', 'discount', 'nsi_adv', 'dnnsi'])
 .withColumn('test', F.round(col('wholesale_price') - col('discount') - col('NSI_ADV'),2))
 .filter(col('dnnsi') != col('test'))
 .show(4)
)

+------------------+-----------------+------------------+-----------------+------+
|   wholesale_price|         discount|           nsi_adv|            dnnsi|  test|
+------------------+-----------------+------------------+-----------------+------+
|            1560.0|            650.0|            281.32|628.6800000000001|628.68|
|            1260.0|           562.26|161.70000000000002|            532.9|536.04|
|1014.5390418340355|202.9078083668071|350.79378269814845|  460.83745076908|460.84|
| 1700.122834645669|340.0245669291338| 587.8458054593176|772.2524622572176|772.25|
+------------------+-----------------+------------------+-----------------+------+
only showing top 4 rows
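
To separate rounding noise from real differences, the comparison can use a tolerance instead of exact inequality. The sketch below replays the four rows shown above in plain Python. The function name `dnnsi_mismatches` and the half-cent tolerance are assumptions, not part of the original analysis.

```python
def dnnsi_mismatches(rows, tol=0.005):
    """Return rows where wholesale - discount - nsi_adv is not within tol of dnnsi."""
    flagged = []
    for wholesale, discount, nsi_adv, dnnsi in rows:
        expected = wholesale - discount - nsi_adv
        if abs(dnnsi - expected) > tol:
            flagged.append((wholesale, discount, nsi_adv, dnnsi, round(expected, 2)))
    return flagged


# (wholesale_price, discount, nsi_adv, dnnsi) copied from the output above.
rows = [
    (1560.0, 650.0, 281.32, 628.6800000000001),
    (1260.0, 562.26, 161.70000000000002, 532.9),
    (1014.5390418340355, 202.9078083668071, 350.79378269814845, 460.83745076908),
    (1700.122834645669, 340.0245669291338, 587.8458054593176, 772.2524622572176),
]
print(dnnsi_mismatches(rows))  # only the 532.9 vs 536.04 row is flagged
```

In Spark, the same idea would be a filter on `F.abs(col('dnnsi') - col('test')) > 0.005` with `test` left unrounded.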



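dngp appears to be dnnsi less COGS (IP excluded) and less reimbursements. The check below subtracts only cogsipexcl, so the rows it flags may reflect reimbursements, rounding, or some other component.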

In [0]:
(df.select(['dnnsi', 'cogsipexcl', 'dngp'])
 .withColumn('test', col('dnnsi') - col('cogsipexcl'))
 .filter(col('dngp') != col('test'))
 .show(4)
)

+------+----------+------+------------------+
| dnnsi|cogsipexcl|  dngp|              test|
+------+----------+------+------------------+
|456.96|    367.97|104.95| 88.98999999999995|
|456.96|    366.94|105.98| 90.01999999999998|
|587.52|     472.3|135.74|115.21999999999997|
|423.14|    286.95|149.87|            136.19|
+------+----------+------+------------------+
only showing top 4 rows
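
One way to find the missing component is to compute the residual `dngp - (dnnsi - cogsipexcl)` and compare it with candidate columns such as `reimbursements`. The sketch below does this in plain Python for the four rows shown above. The function name `dngp_residuals` is hypothetical.

```python
# (dnnsi, cogsipexcl, dngp) copied from the output above.
rows = [
    (456.96, 367.97, 104.95),
    (456.96, 366.94, 105.98),
    (587.52, 472.3, 135.74),
    (423.14, 286.95, 149.87),
]


def dngp_residuals(rows):
    """Return dngp - (dnnsi - cogsipexcl) per row, rounded to cents."""
    return [round(dngp - (dnnsi - cogs), 2) for dnnsi, cogs, dngp in rows]


print(dngp_residuals(rows))
```

All four residuals are positive and well above rounding error, so a further term does seem to be involved. Whether that term is `reimbursements` would need to be checked against the full rows.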



In [0]:
df.select(df.columns[24:]).filter( (col('effective_unit_price') != 0) & (col('invoice_price') == 0) ).limit(10).display()

Wholesale_Price,Discount,NSI_ADV,CMA/JBP_Allowances,CCF_National_on_Invoice,CCF_Regional_on_Invoice,CTM_on_Invoice,CCF_National_Accrual,CCF_Regional_Accrual,CTM_National_Accrual,CTM_Regional_Accrual,DNNSI,LMP,Dr_Pepper_CSD_Funding,Full_Service_Commission,COGSIPINCL,COGSIPEXCL,Facilitation_Fees,Reimbursements,ReimbursementsBaseFunding,DNGP,Effective_Unit_Price,Invoice_Price,Volume,Volume_SPC,Volume_EQV,Transactions,Forecast_DNNSI,Forecast_DNGP,Forecast_Volume,Forecast_Volume_SPC,Forecast_Volume_EQV,Retail_Price,Take_Rate,Price_Multiple
0.0,0.0,1.03,0.0,0.0,0,0.0,0.0,0.0,1.03,0.0,-1.03,0.0,-23.92,0.0,392.5376,1.34,0.0,0.0,0.0,-2.37,362.96,0.0,52.0,52.0,78.0,104.0,697.205557613931,328.587957613931,51.152278621711744,51.152278621711744,76.72841793256762,0.0,0.0,
0.0,0.0,0.4,0.0,0.0,0,0.0,0.0,0.0,0.4,0.0,-0.4,0.0,-9.2,0.0,184.9412,0.0,0.0,0.0,0.0,-0.4,141.6,0.0,20.0,20.0,30.0,40.0,294.2234400395458,118.4822400395458,21.289684518056852,21.289684518056852,31.93452677708528,0.0,0.0,
0.0,0.0,0.08,0.0,0.0,0,0.0,0.0,0.0,0.08,0.0,-0.08,0.0,-1.84,0.0,29.70556,1.95,0.0,0.0,0.0,-2.03,28.32,0.0,4.0,4.0,6.0,8.0,54.50703425528229,26.641474255282294,3.944069048862683,3.944069048862683,5.916103573294024,0.0,0.0,
0.0,0.0,0.04,0.0,0.0,0,0.0,0.0,0.0,0.04,0.0,-0.04,0.0,-0.92,0.0,15.0976,-62.18,0.0,0.0,0.0,62.14,12.56,0.0,2.0,2.0,3.0,4.0,27.024232154355506,12.846632154355502,2.2006703708758555,2.2006703708758555,3.3010055563137835,0.0,0.0,
0.0,0.0,0.46,0.0,0.0,0,0.0,0.0,0.0,0.46,0.0,-0.46,0.0,0.0,0.0,45.13089,-3.73,0.0,0.0,0.0,3.27,39.48,0.0,7.0,8.75,8.203299999999999,21.0,109.45403287208929,64.32314287208929,7.202344730676402,9.002930913345503,8.440427789879676,0.0,0.0,
0.0,0.0,6.36,0.0,0.0,0,0.0,0.0,0.0,6.36,0.0,-6.36,0.0,0.0,0.0,649.61184,4.91,0.0,0.0,0.0,-11.27,541.4399999999999,0.0,96.0,120.0,112.5024,288.0,1501.0320718106143,851.4202318106143,98.77160438314236,123.46450547892798,115.75044317660456,0.0,0.0,
0.0,0.0,0.32,0.0,0.0,0,0.0,0.0,0.0,0.32,0.0,-0.32,0.0,0.0,0.0,83.94752,0.0,0.0,0.0,0.0,-0.32,47.68,0.0,16.0,16.0,24.0,64.0,0.0,-83.94752,0.0,0.0,0.0,0.0,0.0,
0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,104.30401666666668,99.0,0.0,0.0,0.0,-104.30401666666668,13.750000000000002,0.0,9.166666666666668,4.583333333333334,9.166666666666668,110.0,0.0,-104.30401666666668,9.166666666666668,4.583333333333334,9.166666666666668,3.0,1.0,2 for
0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,142.23275,135.0,0.0,0.0,0.0,-142.23275,15.625,0.0,12.5,6.25,12.5,150.0,0.0,-142.23275,12.5,6.25,12.5,5.0,1.0,4 for
0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,176.27186386548615,167.30817355243877,0.0,0.0,0.0,-176.27186386548615,19.36437193893967,0.0,15.491497551151737,7.745748775575867,15.491497551151737,185.8979706138208,0.0,-176.27186386548615,15.491497551151737,7.745748775575867,15.491497551151737,5.0,1.0,4 for


In [0]:
df.select(['dnnsi', 'dngp', 'volume', 'forecast_dnnsi', 'forecast_dngp', 'forecast_volume']).show()

+-----+--------+------+------------------+------------------+------------------+
|dnnsi|    dngp|volume|    forecast_dnnsi|     forecast_dngp|   forecast_volume|
+-----+--------+------+------------------+------------------+------------------+
|-30.0|  -23.25|  -1.0|               0.0|           7.12011|               0.0|
|  0.0|     4.3|   0.0|               0.0|               0.0|               0.0|
|  0.0|     0.0|   0.0|              4.43|              4.43|              0.25|
|  0.0|     0.0|   0.0|             22.15|             22.15|              1.25|
|322.0|-1043.05| 161.0|2852.9199999999996|1250.9120399999997|             161.0|
|  0.0|  169.99|   0.0|               0.0|               0.0|               0.0|
|  9.0|  -16.62|   3.0|             53.16|23.308919999999997|               3.0|
|  0.0|    2.65|   0.0|               0.0|               0.0|               0.0|
|480.0| -886.67| 160.0|            2835.2|1243.1423999999997|             160.0|
|  3.0|  135.82|   1.0|     

Are historical rows actuals, with future rows simply populating the non-forecast columns from the forecast?  
Not quite: not all volumes match in future weeks.

In [0]:
(df.select(['year',
            'week',
            F.when(col('volume') == col('forecast_volume'),1).otherwise(0).alias('match')]
          )
 .groupBy('year', 'week')
 .agg(F.sum('match').alias('matches'), F.count('match').alias('total') )
 .withColumn('pct', col('matches') / col('total') )
 .sort(['year','week'], ascending=[1,1])
 .display()
)

year,week,matches,total,pct
2022,1,23787,131306,0.1811569920643382
2022,2,40704,160193,0.2540934997159676
2022,3,48511,162178,0.2991219524226467
2022,4,50948,172967,0.2945532962935126
2022,5,26174,149377,0.1752210849059761
2022,6,48015,171885,0.2793437472728859
2022,7,48822,172723,0.2826606763430463
2022,8,52976,178372,0.296997286569641
2022,9,26861,152903,0.1756734661844437
2022,10,51486,177739,0.2896719346907544


In [0]:
df.select('year','week','volume','forecast_volume','material_PkgCat_description','sales_office_description')\
.filter(col('volume') != col('forecast_volume'))\
.filter( (col('year')==2023) & (col('week')==34) )\
.show(6)

+----+----+------------------+-------------------+---------------------------+------------------------+
|year|week|            volume|    forecast_volume|material_PkgCat_description|sales_office_description|
+----+----+------------------+-------------------+---------------------------+------------------------+
|2023|  34|              40.0|               42.0|        WTR DASANI 16Z 24CT|            GLENDALE, AZ|
|2023|  34|              53.0|                0.0|        WTR DASANI 16Z 24CT|               TEMPE, AZ|
|2023|  34|               0.0| 0.7773333333333333|       EWTR PA PWR 20Z 12CT|           ARLINGTON, WA|
|2023|  34|               0.0|0.22266666666666668|       EWTR PA PWR 20Z 12CT|           ARLINGTON, WA|
|2023|  34|0.6666666666666666| 1.3333333333333333|       CFE DUNKIN COFFEE...|               OGDEN, UT|
|2023|  34|0.3333333333333333| 0.6666666666666666|       CFE DUNKIN COFFEE...|               OGDEN, UT|
+----+----+------------------+-------------------+--------------

##Grain

FIXME: It looks like we have duplicate rows   
  
FIXME: Does the allocation break retailer into sales offices and super channels and cold drink channels?

###Duplicate Rows

In [0]:
(df.select(['year','week','sales_office','material','retailer','super_channel','cold_drink_channel','material_pkgcat']) 
 .groupby(['year','week','sales_office','material','retailer','super_channel','cold_drink_channel','material_pkgcat'])
 .count()
 .sort('count', ascending=False)
 .show(20)
)

+----+----+------------+--------+--------+-------------+------------------+---------------+-----+
|year|week|sales_office|material|retailer|super_channel|cold_drink_channel|material_pkgcat|count|
+----+----+------------+--------+--------+-------------+------------------+---------------+-----+
|2023|   8|        G237|  119826|       0|            V|              null|             BN|    5|
|2023|  20|        G151|  119826|       0|            V|              null|             BN|    5|
|2023|  13|        G131|  119826|       0|            V|              null|             BN|    5|
|2023|   5|        G238|  119826|       0|            V|              null|             BN|    5|
|2023|   8|        G235|  119826|       0|            V|              null|             BN|    5|
|2023|  18|        G131|  119826|       0|            V|              null|             BN|    5|
|2023|  20|        G132|  119826|       0|            V|              null|             BN|    5|
|2023|   5|        G

In [0]:
(df.filter( (col('year') == 2023) \
           & (col('week') == 8) \
           & (col('sales_office') == 'G237') \
           & (col('material') == 119826) \
           & (col('retailer') == 0) \
           & (col('super_channel') == 'V') \
           & (col('material_pkgcat') == 'BN')
          )
 .display()
)

Time,Version,Value_Type,Period_Code,Year,Period,Week,WeekofPeriod,Super_Division,Profit_Center,Sales_Office,Sales_Office_Description,Super_Channel,Super_Channel_Description,Retailer,Retailer_Description,Cold_Drink_Channel,Cold_Drink_Channel_Description,Material,Material_SWEEP_Category_ID,Material_SWEEP_Category_Description,Material_PkgCat,Material_PkgCat_Description,Material_Packs_per_Case,Wholesale_Price,Discount,NSI_ADV,CMA/JBP_Allowances,CCF_National_on_Invoice,CCF_Regional_on_Invoice,CTM_on_Invoice,CCF_National_Accrual,CCF_Regional_Accrual,CTM_National_Accrual,CTM_Regional_Accrual,DNNSI,LMP,Dr_Pepper_CSD_Funding,Full_Service_Commission,COGSIPINCL,COGSIPEXCL,Facilitation_Fees,Reimbursements,ReimbursementsBaseFunding,DNGP,Effective_Unit_Price,Invoice_Price,Volume,Volume_SPC,Volume_EQV,Transactions,Forecast_DNNSI,Forecast_DNGP,Forecast_Volume,Forecast_Volume_SPC,Forecast_Volume_EQV,Retail_Price,Take_Rate,Price_Multiple
Week 8 FY23,000S,30,202302,2023,2,8,4,RM,4400337000,G237,"PUEBLO, CO",V,VALUE,0,ALL OTHER PLAN KA,,,119826,1,SWEEP-SPARKLING,BN,SSD NR 13.2Z 24CT,24,0.4848484848484848,0.4848484848484848,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1100631313131313,0.1397147474747475,0.0,0.0,0.0,-0.1100631313131313,0.0,0.0,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.2424242424242424,0.0,-0.1100631313131313,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.0,0.0,
Week 8 FY23,000S,30,202302,2023,2,8,4,RM,4400337000,G237,"PUEBLO, CO",V,VALUE,0,ALL OTHER PLAN KA,,,119826,1,SWEEP-SPARKLING,BN,SSD NR 13.2Z 24CT,24,0.4848484848484848,0.4848484848484848,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1100631313131313,0.1397147474747475,0.0,0.0,0.0,-0.1100631313131313,0.0,0.0,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.2424242424242424,0.0,-0.1100631313131313,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.0,0.0,
Week 8 FY23,000S,30,202302,2023,2,8,4,RM,4400337000,G237,"PUEBLO, CO",V,VALUE,0,ALL OTHER PLAN KA,,,119826,1,SWEEP-SPARKLING,BN,SSD NR 13.2Z 24CT,24,0.4848484848484848,0.4848484848484848,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1100631313131313,0.1397147474747475,0.0,0.0,0.0,-0.1100631313131313,0.0,0.0,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.2424242424242424,0.0,-0.1100631313131313,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.0,0.0,
Week 8 FY23,000S,30,202302,2023,2,8,4,RM,4400337000,G237,"PUEBLO, CO",V,VALUE,0,ALL OTHER PLAN KA,,,119826,1,SWEEP-SPARKLING,BN,SSD NR 13.2Z 24CT,24,0.4848484848484848,0.4848484848484848,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1100631313131313,0.1397147474747475,0.0,0.0,0.0,-0.1100631313131313,0.0,0.0,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.2424242424242424,0.0,-0.1100631313131313,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.0,0.0,
Week 8 FY23,000S,30,202302,2023,2,8,4,RM,4400337000,G237,"PUEBLO, CO",V,VALUE,0,ALL OTHER PLAN KA,,,119826,1,SWEEP-SPARKLING,BN,SSD NR 13.2Z 24CT,24,0.4848484848484848,0.4848484848484848,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1100631313131313,0.1397147474747475,0.0,0.0,0.0,-0.1100631313131313,0.0,0.0,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.2424242424242424,0.0,-0.1100631313131313,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.0,0.0,
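
The five rows above appear identical in every column, not just in the grouping keys. A minimal sketch for counting fully identical rows is shown below in plain Python. The helper name `duplicate_groups` is hypothetical, and the rows are trimmed to a few columns (year, week, sales_office, material, material_pkgcat, volume) copied from the outputs in this notebook.

```python
from collections import Counter


def duplicate_groups(rows):
    """Return {row: count} for rows that occur more than once."""
    counts = Counter(tuple(r) for r in rows)
    return {row: n for row, n in counts.items() if n > 1}


# Five copies of the duplicated row plus one unrelated row for contrast.
duplicated = (2023, 8, "G237", 119826, "BN", 0.0101010101010101)
rows = [duplicated] * 5 + [(2023, 2, "G236", 119826, "CC", 1.562779067690659)]

print(duplicate_groups(rows))
```

On the Spark DataFrame, `df.count() - df.distinct().count()` would give the number of surplus rows across the whole extract.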


###Allocation

In [0]:
(df.select(['year','week','sales_office','material','retailer','super_channel','cold_drink_channel']) 
 .groupby(['year','week','sales_office','material','retailer','super_channel','cold_drink_channel'])
 .count()
 .sort('count', ascending=False)
 .show(20)
)

+----+----+------------+--------+--------+-------------+------------------+-----+
|year|week|sales_office|material|retailer|super_channel|cold_drink_channel|count|
+----+----+------------+--------+--------+-------------+------------------+-----+
|2023|   2|        G236|  119826|       0|            V|              null|    6|
|2023|  19|        G236|  119826|       0|            V|              null|    6|
|2023|   8|        G236|  119826|       0|            V|              null|    6|
|2023|   5|        G236|  119826|       0|            V|              null|    6|
|2023|  22|        G171|  119826|     193|            R|              null|    6|
|2023|   5|        G151|  119826|       0|            V|              null|    5|
|2023|   2|        G132|  119826|       0|            V|              null|    5|
|2023|   2|        G238|  119826|       0|            V|              null|    5|
|2023|   2|        G131|  119826|       0|            V|              null|    5|
|2023|   5|     

In [0]:
#inspect the rows behind the first key in the list above
#(no backslashes needed: the expression is already inside parentheses)
(df.filter( (col('year') == 2023)
           & (col('week') == 2)
           & (col('sales_office') == 'G236')
           & (col('material') == 119826)
           & (col('retailer') == 0)
           & (col('super_channel') == 'V')
           #& (col('material_pkgcat') == 'BN')
          )
 .display()
)

Time,Version,Value_Type,Period_Code,Year,Period,Week,WeekofPeriod,Super_Division,Profit_Center,Sales_Office,Sales_Office_Description,Super_Channel,Super_Channel_Description,Retailer,Retailer_Description,Cold_Drink_Channel,Cold_Drink_Channel_Description,Material,Material_SWEEP_Category_ID,Material_SWEEP_Category_Description,Material_PkgCat,Material_PkgCat_Description,Material_Packs_per_Case,Wholesale_Price,Discount,NSI_ADV,CMA/JBP_Allowances,CCF_National_on_Invoice,CCF_Regional_on_Invoice,CTM_on_Invoice,CCF_National_Accrual,CCF_Regional_Accrual,CTM_National_Accrual,CTM_Regional_Accrual,DNNSI,LMP,Dr_Pepper_CSD_Funding,Full_Service_Commission,COGSIPINCL,COGSIPEXCL,Facilitation_Fees,Reimbursements,ReimbursementsBaseFunding,DNGP,Effective_Unit_Price,Invoice_Price,Volume,Volume_SPC,Volume_EQV,Transactions,Forecast_DNNSI,Forecast_DNGP,Forecast_Volume,Forecast_Volume_SPC,Forecast_Volume_EQV,Retail_Price,Take_Rate,Price_Multiple
Week 2 FY23,000S,30,202301,2023,1,2,2,RM,4400336000,G236,"DENVER, CO",V,VALUE,0,ALL OTHER PLAN KA,,,119826,1,SWEEP-SPARKLING,CC,SSD NR 20Z 24CT,24,84.39006965529559,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,84.39006965529559,0.0,0.0,0.0,22.42239462404001,24.40737408465797,0.0,0.0,0.0,61.96767503125558,0.0,84.39006965529559,1.562779067690659,1.562779067690659,3.906947669226648,37.50669762457581,84.39006965529559,61.96767503125558,1.562779067690659,1.562779067690659,3.906947669226648,0.0,0.0,
Week 2 FY23,000S,30,202301,2023,1,2,2,RM,4400336000,G236,"DENVER, CO",V,VALUE,0,ALL OTHER PLAN KA,,,119826,1,SWEEP-SPARKLING,BN,SSD NR 13.2Z 24CT,24,0.4848484848484848,0.4848484848484848,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1449269696969697,0.1577568686868687,0.0,0.0,0.0,-0.1449269696969697,0.0,0.0,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.2424242424242424,0.0,-0.1449269696969697,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.0,0.0,
Week 2 FY23,000S,30,202301,2023,1,2,2,RM,4400336000,G236,"DENVER, CO",V,VALUE,0,ALL OTHER PLAN KA,,,119826,1,SWEEP-SPARKLING,BN,SSD NR 13.2Z 24CT,24,0.4848484848484848,0.4848484848484848,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1449269696969697,0.1577568686868687,0.0,0.0,0.0,-0.1449269696969697,0.0,0.0,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.2424242424242424,0.0,-0.1449269696969697,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.0,0.0,
Week 2 FY23,000S,30,202301,2023,1,2,2,RM,4400336000,G236,"DENVER, CO",V,VALUE,0,ALL OTHER PLAN KA,,,119826,1,SWEEP-SPARKLING,BN,SSD NR 13.2Z 24CT,24,0.4848484848484848,0.4848484848484848,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1449269696969697,0.1577568686868687,0.0,0.0,0.0,-0.1449269696969697,0.0,0.0,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.2424242424242424,0.0,-0.1449269696969697,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.0,0.0,
Week 2 FY23,000S,30,202301,2023,1,2,2,RM,4400336000,G236,"DENVER, CO",V,VALUE,0,ALL OTHER PLAN KA,,,119826,1,SWEEP-SPARKLING,BN,SSD NR 13.2Z 24CT,24,0.4848484848484848,0.4848484848484848,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1449269696969697,0.1577568686868687,0.0,0.0,0.0,-0.1449269696969697,0.0,0.0,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.2424242424242424,0.0,-0.1449269696969697,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.0,0.0,
Week 2 FY23,000S,30,202301,2023,1,2,2,RM,4400336000,G236,"DENVER, CO",V,VALUE,0,ALL OTHER PLAN KA,,,119826,1,SWEEP-SPARKLING,BN,SSD NR 13.2Z 24CT,24,0.4848484848484848,0.4848484848484848,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1449269696969697,0.1577568686868687,0.0,0.0,0.0,-0.1449269696969697,0.0,0.0,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.2424242424242424,0.0,-0.1449269696969697,0.0101010101010101,0.0101010101010101,0.0145833333333333,0.0,0.0,
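
The six rows above share one key, but they are not six different records: there is one CC row and five identical BN rows. A minimal pure-Python sketch of separating exact duplicates from distinct rows is shown here. It uses only three columns hand-copied from the output above (package category, wholesale price, volume), so it illustrates the idea and is not a full-row comparison; the variable names are hypothetical.

```python
from collections import Counter

# Abbreviated copies of the six rows above:
# (material_pkgcat, wholesale_price, volume). Illustrative subset of columns only.
rows = [
    ("CC", 84.39006965529559, 1.562779067690659),
] + [("BN", 0.4848484848484848, 0.0101010101010101)] * 5

# Count how often each identical row occurs.
row_counts = Counter(rows)

# Rows that remain after removing exact duplicates.
distinct_rows = list(row_counts)

# Rows that occur more than once, with their multiplicity.
repeated = {row: n for row, n in row_counts.items() if n > 1}

print(len(rows), "rows,", len(distinct_rows), "distinct")
for row, n in repeated.items():
    print(row[0], "row repeated", n, "times")
```

On the full DataFrame, a comparable first check would be to compare `df.count()` with `df.distinct().count()`, assuming all columns should take part in the comparison.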


#Other

##Materials with multiple package categories

In [0]:
#list materials by number of distinct package categories to show this affects multiple products
(df.select('material','material_pkgcat','material_pkgcat_description')
 .groupBy('material')
 .agg(F.count_distinct('material_pkgcat').alias('count'))
 .sort('count', ascending = False)
 .show(25)
)

+--------+-----+
|material|count|
+--------+-----+
|  153428|    3|
|  156853|    2|
|  156849|    2|
|  119826|    2|
|  156845|    2|
|  156850|    2|
|  156834|    2|
|  155604|    2|
|  410321|    2|
|  116366|    2|
|  156847|    2|
|  156851|    2|
|  410314|    2|
|  156848|    2|
|  156893|    2|
|  410150|    2|
|  151776|    2|
|  155590|    2|
|  156836|    2|
|  156842|    2|
|  156837|    2|
|  156832|    2|
|  156846|    2|
|  156843|    2|
|  410344|    2|
+--------+-----+
only showing top 25 rows



In [0]:
#how many materials have more than one package category?
#answer: 25
(df.select('material','material_pkgcat','material_pkgcat_description')
 .groupBy('material')
 .agg(F.count_distinct('material_pkgcat').alias('count'))
 .filter(col('count') > 1)
 .count()
)

Out[18]: 25

In [0]:
#let's look at a few examples
(df.select('material','material_pkgcat','material_pkgcat_description')
 .filter( (col('material') == 153428) | (col('material') == 119826) )
 .groupby('material','material_pkgcat','material_pkgcat_description')
 .count()
 .orderBy(['material','count'], ascending =[True,False])
 .show()
)

+--------+---------------+---------------------------+------+
|material|material_pkgcat|material_pkgcat_description| count|
+--------+---------------+---------------------------+------+
|  119826|             CC|            SSD NR 20Z 24CT|101958|
|  119826|             BN|          SSD NR 13.2Z 24CT|   320|
|  153428|             JI|       ENG MNSTR 15.5/16...| 30993|
|  153428|             JE|       ENG MNSTR JAVA 15...|    73|
|  153428|             JL|       ENG MNSTR 15.5/16...|     1|
+--------+---------------+---------------------------+------+
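
In both examples one package category accounts for almost all rows. A small sketch of how the dominant category and the share of minority rows could be derived per material follows, using the counts copied from the output above. Whether the minority categories are errors or legitimate second packages is not established here; treating the most frequent category as the reference is an assumption made for illustration.

```python
from collections import defaultdict

# (material, material_pkgcat, row count) copied from the output above.
pkgcat_counts = [
    (119826, "CC", 101958),
    (119826, "BN", 320),
    (153428, "JI", 30993),
    (153428, "JE", 73),
    (153428, "JL", 1),
]

totals = defaultdict(int)   # total rows per material
dominant = {}               # material -> (pkgcat, rows) with the highest row count

for material, pkgcat, n in pkgcat_counts:
    totals[material] += n
    if material not in dominant or n > dominant[material][1]:
        dominant[material] = (pkgcat, n)

# Share of rows that do NOT carry the dominant package category.
minority_share = {
    material: (totals[material] - dominant[material][1]) / totals[material]
    for material in totals
}

for material in totals:
    print(material, dominant[material][0], f"{minority_share[material]:.4%}")
```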



##Incorrect Grain

In [0]:
#when retailer = 0, we can have multiple channels, so we will analyze it separately

#all other retailers
#show a list of what should be the grain, but has duplicates
(df.select('retailer','profit_center','material','year','week','forecast_volume')
 .filter(col('retailer') != 0)
 .groupby('retailer','profit_center','material','year','week')
 .count()
 .sort('count', ascending=False)
 .show(40)
)

+--------+-------------+--------+----+----+-----+
|retailer|profit_center|material|year|week|count|
+--------+-------------+--------+----+----+-----+
|     193|   4400317000|  119826|2023|  22|    6|
|   99649|   4400394000|  151776|2023|  49|    3|
|     318|   4400394000|  151776|2023|  49|    3|
|     318|   4400394000|  151776|2023|  41|    3|
|   99649|   4400394000|  151776|2023|  44|    3|
|     318|   4400394000|  151776|2023|  43|    3|
|   99649|   4400394000|  151776|2023|  46|    3|
|   99649|   4400394000|  151776|2023|  48|    3|
|   99649|   4400394000|  151776|2023|  42|    3|
|     318|   4400394000|  151776|2023|  42|    3|
|   99649|   4400394000|  151776|2023|  45|    3|
|     318|   4400394000|  151776|2023|  45|    3|
|   99649|   4400394000|  151776|2023|  43|    3|
|     318|   4400394000|  151776|2023|  46|    3|
|     115|   4400362000|  156845|2023|  35|    3|
|     318|   4400394000|  151776|2023|  48|    3|
|     318|   4400394000|  151776|2023|  44|    3|


In [0]:
#how many key combinations have this issue?
#about 7,000 after stripping out the known package category issue (material_pkgcat is added to the key)
(df.select('retailer','profit_center','material','year','week','forecast_volume','material_pkgcat')
 .filter(col('retailer') != 0)
 .groupby('retailer','profit_center','material','year','week','material_pkgcat')
 .count()
 .filter(col('count') > 1)
 .count()
)

Out[47]: 7117

In [0]:
#example of multiple rows with different values
#both rows have the same package category, but forecast_volume differs, so these are not exact duplicates
(df.select('retailer','profit_center','material','year','week','forecast_volume','super_channel','material_pkgcat')
 .filter( (col('retailer') == 773) &
         (col('profit_center') == 4400362000) &
         (col('material') == 117603) &
         (col('year') == 2023) &
         (col('week') == 22)
        )
 .display()
)

retailer,profit_center,material,year,week,forecast_volume,super_channel,material_pkgcat
773,4400362000,117603,2023,22,904.358894686415,H,AI
773,4400362000,117603,2023,22,123.83176991855063,H,AI


In [0]:
#example of multiple rows with duplicate values
#the CC row reflects the separate package category issue; the five BN rows appear to be exact duplicates
(df.select('retailer','profit_center','material','year','week','forecast_volume','super_channel','material_pkgcat')
 .filter( (col('retailer') == 193) &
         (col('profit_center') == 4400317000) &
         (col('material') == 119826) &
         (col('year') == 2023) &
         (col('week') == 22)
        )
 .show(8)
)

+--------+-------------+--------+----+----+-------------------+-------------+---------------+
|retailer|profit_center|material|year|week|    forecast_volume|super_channel|material_pkgcat|
+--------+-------------+--------+----+----+-------------------+-------------+---------------+
|     193|   4400317000|  119826|2023|  22| 1.0000000000000002|            R|             CC|
|     193|   4400317000|  119826|2023|  22|0.09090909090909091|            R|             BN|
|     193|   4400317000|  119826|2023|  22|0.09090909090909091|            R|             BN|
|     193|   4400317000|  119826|2023|  22|0.09090909090909091|            R|             BN|
|     193|   4400317000|  119826|2023|  22|0.09090909090909091|            R|             BN|
|     193|   4400317000|  119826|2023|  22|0.09090909090909091|            R|             BN|
+--------+-------------+--------+----+----+-------------------+-------------+---------------+
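
The two examples above show different failure modes: retailer 773 has conflicting forecast volumes on the same key, while retailer 193 has identical rows repeated. A minimal sketch of labelling each key group accordingly follows, using the rows copied from the two outputs above. The key includes `material_pkgcat`, as in the count query earlier; the `classify` helper and its labels are hypothetical names introduced for illustration.

```python
from collections import defaultdict

# (retailer, profit_center, material, year, week, material_pkgcat, forecast_volume)
# copied from the two example outputs above.
rows = [
    (773, 4400362000, 117603, 2023, 22, "AI", 904.358894686415),
    (773, 4400362000, 117603, 2023, 22, "AI", 123.83176991855063),
    (193, 4400317000, 119826, 2023, 22, "CC", 1.0000000000000002),
] + [(193, 4400317000, 119826, 2023, 22, "BN", 0.09090909090909091)] * 5

# Group forecast volumes by the candidate grain key (everything but the value).
groups = defaultdict(list)
for *key, forecast_volume in rows:
    groups[tuple(key)].append(forecast_volume)

def classify(values):
    """Label a key group by how its forecast volumes relate to each other."""
    if len(values) == 1:
        return "unique"
    if len(set(values)) == 1:
        return "exact duplicates"
    return "conflicting values"

labels = {key: classify(values) for key, values in groups.items()}

for key, label in labels.items():
    print(key, label)
```

Splitting the groups this way matters for any later fix: exact duplicates could plausibly be dropped, whereas conflicting values need a decision on which row (or what aggregate) is correct.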



##Retailers in Multiple Channels

In [0]:
#count distinct channels per retailer; only ALL OTHER PLAN KA (retailer 0) appears in more than one channel
(df.select('retailer','retailer_description','super_channel','super_channel_description')
 .groupBy('retailer', 'retailer_description')
 .agg(F.count_distinct('super_channel').alias('count'))
 .sort('count', ascending = False)
 .show(25)
)

+--------+--------------------+-----+
|retailer|retailer_description|count|
+--------+--------------------+-----+
|       0|   ALL OTHER PLAN KA|    9|
|   99910|   CR AO INDEPENDENT|    1|
|   15065|          EXTRA MILE|    1|
|     301|       FAMILY DOLLAR|    1|
|     110|           FOOD CITY|    1|
|   99044|ALBERTSONS COS PO...|    1|
|   11164|         CIRCLE K RM|    1|
|     769|              TARGET|    1|
|   99649|ALBERTSONS COS SE...|    1|
|     491|              HAGGEN|    1|
|     331|      SPEEDWAY/GIANT|    1|
|     836|           WALGREENS|    1|
|     115|              BASHAS|    1|
|     673|     7-ELEVEN (7-11)|    1|
|   99920|      CR AO REGIONAL|    1|
|     487|MAVERIK COUNTRY S...|    1|
|     357|     GROCERY OUTLETS|    1|
|   36916|      UNITED PACIFIC|    1|
|      80|                 AFS|    1|
|     193|            CIRCLE K|    1|
|     709| SMITH'S FOOD & DRUG|    1|
|     638|           ROSAUER'S|    1|
|     439|          LOAF-N-JUG|    1|
|   61035|  

In [0]:
#it appears ALL OTHER PLAN KA (retailer 0) is allocated across channels, so super_channel should be included in the grain
(df.select('retailer','retailer_description','super_channel','super_channel_description')
 .filter(col('retailer') == 0)
 .groupby('retailer','retailer_description','super_channel','super_channel_description')
 .count()
 .sort('super_channel_description', ascending = True)
 .show(25)
)

+--------+--------------------+-------------+-------------------------+-------+
|retailer|retailer_description|super_channel|super_channel_description|  count|
+--------+--------------------+-------------+-------------------------+-------+
|       0|   ALL OTHER PLAN KA|            A|                ALL OTHER| 148424|
|       0|   ALL OTHER PLAN KA|            C|                     CLUB|  80925|
|       0|   ALL OTHER PLAN KA|            R|       CONVENIENCE RETAIL|   2009|
|       0|   ALL OTHER PLAN KA|            D|                     DRUG| 128624|
|       0|   ALL OTHER PLAN KA|            F|             FULL SERVICE| 884536|
|       0|   ALL OTHER PLAN KA|            M|                     MASS|  53064|
|       0|   ALL OTHER PLAN KA|            O|               ON PREMISE|2845039|
|       0|   ALL OTHER PLAN KA|            S|             SUPER MARKET| 686720|
|       0|   ALL OTHER PLAN KA|            V|                    VALUE| 258608|
+--------+--------------------+---------