### Data Checks

Check keys: each row is identified by the composite key `upc`, `calendar_date`, `store_id`, `geography_id`.

* Which products are in all files? Only include products for which we have a complete data set.
* Do store IDs appear in multiple geographies?
* Is `calendar_date` consistent across all files? Only include the date range for which we have all data.
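
One way to check the key itself is to count rows whose key columns repeat an earlier row. The sketch below assumes the four columns listed above form the full key; `count_duplicate_keys` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

KEY = ['upc', 'calendar_date', 'store_id', 'geography_id']

def count_duplicate_keys(df, key=KEY):
    """Return the number of rows whose key repeats an earlier row."""
    return int(df.duplicated(subset=key).sum())

# Toy frame (made-up values): the last row repeats the first row's key.
toy = pd.DataFrame({
    'upc': [1, 1, 1],
    'calendar_date': pd.to_datetime(['2018-01-08', '2018-01-09', '2018-01-08']),
    'store_id': [10, 10, 10],
    'geography_id': [100, 100, 100],
})
n_dupes = count_duplicate_keys(toy)
print(n_dupes)
```

On the real data the same call could be applied to each of the four frames once they are loaded; a result of 0 would mean the key is unique.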

In [1]:
import pandas as pd

In [3]:
fore = pd.read_csv('../data/Case Study - Forecast Data.txt', sep='\t', encoding='utf-16',
                  parse_dates=['calendar_date'])

In [4]:
depot = pd.read_csv('../data/Case Study - Depot Data.txt', sep='\t', encoding='utf-16',
                   parse_dates=['calendar_date'])

In [5]:
instore = pd.read_csv('../data/Case Study - In Store Data.txt', sep='\t', encoding='utf-16',
                     parse_dates=['calendar_date'])

In [6]:
cinv = pd.read_csv('../data/Case Study - Closing Inventory.txt', sep='\t', encoding='utf-16',
                   parse_dates=['calendar_date'])

### What are the shapes of the data?

In [7]:
fore.shape

(2648539, 34)

In [8]:
fore.shape[0] == depot.shape[0] == cinv.shape[0] == instore.shape[0]

True

All four files have the same number of rows.

#### Date range

In [10]:
def return_daterange(df):
    return df['calendar_date'].min(), df['calendar_date'].max()

In [12]:
return_daterange(fore), return_daterange(depot), return_daterange(cinv), return_daterange(instore)

((Timestamp('2018-01-08 00:00:00'), Timestamp('2018-12-07 00:00:00')),
 (Timestamp('2018-01-08 00:00:00'), Timestamp('2018-12-07 00:00:00')),
 (Timestamp('2018-01-08 00:00:00'), Timestamp('2018-12-07 00:00:00')),
 (Timestamp('2018-01-08 00:00:00'), Timestamp('2018-12-07 00:00:00')))
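
Here all four ranges agree, so nothing needs trimming. If they had differed, the overlapping range could be derived as the latest start and the earliest end. A minimal sketch, with a hypothetical helper `common_daterange` and made-up toy frames:

```python
import pandas as pd

def common_daterange(frames, column='calendar_date'):
    """Return (start, end) of the date range covered by every frame."""
    start = max(df[column].min() for df in frames)
    end = min(df[column].max() for df in frames)
    return start, end

# Toy frames (made-up dates) with different coverage.
a = pd.DataFrame({'calendar_date': pd.to_datetime(['2018-01-08', '2018-12-07'])})
b = pd.DataFrame({'calendar_date': pd.to_datetime(['2018-02-01', '2018-11-30'])})
start, end = common_daterange([a, b])
print(start, end)
```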

#### Stores in multiple geographies?

In [13]:
(fore.groupby('store_id')['geography_id'].nunique() == 1).value_counts()

True    623
Name: geography_id, dtype: int64

In [14]:
(cinv.groupby('store_id')['geography_id'].nunique() == 1).value_counts()

True    623
Name: geography_id, dtype: int64

In [15]:
pd.merge(pd.DataFrame(fore['upc'].drop_duplicates()), pd.DataFrame(depot['upc'].drop_duplicates()), how='outer').shape

(143, 1)

All files have:
* dates between 2018-01-08 and 2018-12-07
* store IDs that each appear in only one geography (checked in `fore` and `cinv`)
* 143 products (the union of `upc` values across `fore` and `depot`)
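
The outer merge above gives the union of `upc` values in two of the files. To confirm which products appear in every file, a set intersection may be more direct. This is a sketch only; `common_values` is a hypothetical helper and the frames are made-up toy data.

```python
import pandas as pd

def common_values(frames, column):
    """Return the set of values of `column` present in every frame."""
    sets = [set(df[column].dropna().unique()) for df in frames]
    return set.intersection(*sets)

# Toy frames (made-up product codes).
a = pd.DataFrame({'upc': [1, 2, 3]})
b = pd.DataFrame({'upc': [2, 3, 4]})
c = pd.DataFrame({'upc': [3, 2, 2]})
shared = common_values([a, b, c], 'upc')
print(sorted(shared))
```

On the real data, `common_values([fore, depot, instore, cinv], 'upc')` would give the products with a complete data set, and its length could be compared against 143.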

#### Is Closing Inventory the same as In Store?

In [16]:
(cinv == instore[cinv.columns]).apply(lambda x: x.value_counts()).transpose()

| column | False | True |
|---|---|---|
| upc | | 2648539.0 |
| calendar_date | | 2648539.0 |
| calendar_id | | 2648539.0 |
| store_id | | 2648539.0 |
| geography_id | | 2648539.0 |
| shelf_life | | 2648539.0 |
| units_per_tray | | 2648539.0 |
| closing_inventory_min_neg_over_shelf_life_minus_2_days | 2554345.0 | 94194.0 |
| closing_inventory_neg_count_over_1_day | | 2648539.0 |
| closing_inventory_neg_count_over_shelf_life_minus_2_days | | 2648539.0 |


**Note:** Closing Inventory and In Store are identical except for:
* closing_inventory_min_neg_over_shelf_life_minus_2_days
* closing_inventory_on_day

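Join these two columns to the final data with a `cinv` prefix.

**Note 2:** The files are in fact the same. The only differences come from NA values: `NaN == NaN` evaluates to `False`, so missing values never match in an element-wise comparison.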
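
To separate genuine differences from NA-versus-NA mismatches, the comparison can treat two missing values in the same cell as a match. The sketch below uses a hypothetical helper `count_true_mismatches` on made-up toy frames; it assumes both frames share the same index and columns.

```python
import numpy as np
import pandas as pd

def count_true_mismatches(left, right):
    """Count cells per column that differ, treating NaN vs NaN as a match."""
    both_nan = left.isna() & right.isna()
    differs = (left != right) & ~both_nan
    return differs.sum()

# Toy frames (made-up values): only the last cell of 'x' really differs.
left = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 2.0, 3.0]})
right = pd.DataFrame({'x': [1.0, np.nan, 4.0], 'y': [np.nan, 2.0, 3.0]})
mismatches = count_true_mismatches(left, right)
print(mismatches.to_dict())
```

A plain `left != right` flags three cells in this toy example, two of which are only NaN-versus-NaN.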

### Forecast data

In [23]:
fore['actual_store_need_over_lead_time'].value_counts()

 0.00      61705
-6.00      45957
-4.00      45739
-8.00      45176
-2.00      43568
           ...  
 40.52         1
 40.73         1
-62.51         1
-62.74         1
 403.00        1
Name: actual_store_need_over_lead_time, Length: 11574, dtype: int64

### Forecast

* Forecast demand and predicted waste should never be negative

In [35]:
fore['forecast_demand_on_day'].min(), fore['predicted_waste_on_day'].min()

(0.0, 0.0)

### Depot

* `depot_delivered_qty_on_day` should not be negative
* `depot_lvl_required_qty_over_supplier_lead_time` has many negative values. Is this right?

In [42]:
(depot['depot_delivered_qty_on_day'] < 0).value_counts()

False    2648464
True          75
Name: depot_delivered_qty_on_day, dtype: int64

Remove rows with a negative `depot_delivered_qty_on_day` from the data.

In [45]:
(depot['depot_lvl_required_qty_over_supplier_lead_time'] < 0).value_counts()

False    2149185
True      499354
Name: depot_lvl_required_qty_over_supplier_lead_time, dtype: int64

In [49]:
(depot['depot_ordered_qty_over_supplier_lead_time'] < 0).value_counts()

False    2648539
Name: depot_ordered_qty_over_supplier_lead_time, dtype: int64

### In Store

* `waste_value_on_day` cannot be negative. How does a fractional waste value arise? Is it a fraction of a tray, or the effect of a reduced price?
* `stock_out_ind_on_day` and `ranging_indicator_on_day` should each be 0 or 1, and mutually exclusive

In [56]:
(instore['waste_value_on_day'] < 0).value_counts()

False    2648537
True           2
Name: waste_value_on_day, dtype: int64

In [61]:
(instore['stock_out_ind_on_day'] + instore['ranging_indicator_on_day']).value_counts()

1.0    1650382
2.0     130273
0.0      11806
dtype: int64

Remove rows where the sum is not 1 (that is, where it is 0 or 2).
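
Note that the three counts above add up to 1,792,461, fewer than the 2,648,539 rows in the file. `value_counts()` drops NaN by default, so the remaining rows presumably have a missing value in at least one indicator. A filter of the form `!= 1` would remove those rows as well. The sketch below, on made-up toy data, counts the two cases separately before filtering.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): sums are 1, 1, 2, 0 and NaN.
toy = pd.DataFrame({
    'stock_out_ind_on_day':     [1.0, 0.0, 1.0, 0.0, np.nan],
    'ranging_indicator_on_day': [0.0, 1.0, 1.0, 0.0, 1.0],
})
indicator_sum = toy['stock_out_ind_on_day'] + toy['ranging_indicator_on_day']

exactly_one = indicator_sum == 1   # rows to keep
missing = indicator_sum.isna()     # rows where the sum is undefined

print(int(exactly_one.sum()), int(missing.sum()))
kept = toy[exactly_one]
```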

### Closing Inventory

* Both `instore` and `cinv` contain a column called `closing_inventory_on_day`

In [88]:
(instore['closing_inventory_on_day'] == cinv['closing_inventory_on_day']).value_counts()

True     1792461
False     856078
Name: closing_inventory_on_day, dtype: int64

In [89]:
instore['closing_inventory_on_day'].isna().value_counts()

False    1792461
True      856078
Name: closing_inventory_on_day, dtype: int64

In [91]:
cinv['closing_inventory_on_day'].isna().value_counts()

False    1792461
True      856078
Name: closing_inventory_on_day, dtype: int64

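They are the same column: the 856,078 mismatches equal the number of NA values in each file, and NA values never compare equal.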

In [101]:
(cinv['closing_inventory_on_day'] < 0).value_counts()

False    2612704
True       35835
Name: closing_inventory_on_day, dtype: int64

Closing inventory cannot be negative.

### Next:
1. Remove rows where `depot_delivered_qty_on_day` is negative
2. Remove rows where `closing_inventory_on_day` is negative
3. Remove rows where `waste_value_on_day` is negative
4. Remove rows where `(instore['stock_out_ind_on_day'] + instore['ranging_indicator_on_day']) != 1`
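
The four steps can be combined into a single boolean mask. The sketch below assumes the files have already been joined into one frame containing all the relevant columns, which this notebook has not done yet. The helper `clean` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def clean(df):
    """Keep rows that pass the four checks listed above."""
    indicator_sum = df['stock_out_ind_on_day'] + df['ranging_indicator_on_day']
    keep = (
        # `~(x < 0)` keeps NaN rows; `x >= 0` would drop them instead.
        ~(df['depot_delivered_qty_on_day'] < 0)
        & ~(df['closing_inventory_on_day'] < 0)
        & ~(df['waste_value_on_day'] < 0)
        # Rows with a NaN indicator sum fail this test and are dropped.
        & (indicator_sum == 1)
    )
    return df[keep]

# Toy frame (made-up values): rows 1 to 4 each violate exactly one rule.
toy = pd.DataFrame({
    'depot_delivered_qty_on_day': [5.0, -1.0, 5.0, 5.0, 5.0, 5.0],
    'closing_inventory_on_day':   [2.0, 2.0, -3.0, 2.0, 2.0, np.nan],
    'waste_value_on_day':         [0.5, 0.5, 0.5, -0.1, 0.5, 0.5],
    'stock_out_ind_on_day':       [1.0, 1.0, 1.0, 1.0, 1.0, 0.0],
    'ranging_indicator_on_day':   [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
})
cleaned = clean(toy)
print(cleaned.index.tolist())
```

Whether rows with a missing quantity should be kept or dropped is a design choice; this sketch keeps them for the three quantity columns and drops them for the indicator check.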