## 1. Setup & Environment Configuration

**Objective:** Import required libraries, configure paths, validate environment

**Activities:**
- Import pandas, numpy, dask for data manipulation
- Define path constants for data/raw/ and docs/
- Test imports and display versions
- Configure warnings and display settings

**Expected output:** Confirmation that environment is ready

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import dask
import dask.dataframe as dd
from pathlib import Path
import warnings

# Package versions
print("Package Versions:")
print(f"  pandas: {pd.__version__}")
print(f"  numpy: {np.__version__}")
print(f"  dask: {dask.__version__}")

Package Versions:
  pandas: 2.1.4
  numpy: 1.26.4
  dask: 2025.11.0


In [2]:
# Configure environment
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

print("OK - Display settings configured")

OK - Display settings configured


In [None]:
# Determine current directory (works in both scripts and notebooks)
current_dir = Path(__file__).parent if '__file__' in globals() else Path.cwd()
project_root = current_dir.parent if current_dir.name == 'notebooks' else current_dir
print(f"Project root: {project_root.resolve()}")

# Define path constants relative to project root
DATA_RAW = project_root / 'data' / 'raw'
DOCS = project_root / 'docs'

# Verify paths exist
assert DATA_RAW.exists(), f"ERROR - Path not found: {DATA_RAW}"
assert DOCS.exists(), f"ERROR - Path not found: {DOCS}"

print("OK - Paths validated:")
print(f"  DATA_RAW: {DATA_RAW.resolve()}")
print(f"  DOCS: {DOCS.resolve()}")


Project root: C:\Users\adiaz\OneDrive\Dokumente\PythonScripts\MasterClass\Demand-forecasting-in-retail
OK - Paths validated:
  DATA_RAW: C:\Users\adiaz\OneDrive\Dokumente\PythonScripts\MasterClass\Demand-forecasting-in-retail\data\raw
  DOCS: C:\Users\adiaz\OneDrive\Dokumente\PythonScripts\MasterClass\Demand-forecasting-in-retail\docs
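
The `notebooks` name check above only works when the notebook sits exactly one level below the project root. A more tolerant approach, sketched below, walks up the directory tree until it finds a folder containing `data/raw`. The helper name `find_project_root` and the choice of `data/raw` as the marker are assumptions made for illustration, not part of the original notebook.

```python
from pathlib import Path
import tempfile


def find_project_root(start: Path, marker: tuple = ("data", "raw")) -> Path:
    """Return the nearest ancestor of `start` (inclusive) containing the marker dir."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if candidate.joinpath(*marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No ancestor of {start} contains {Path(*marker)}")


# Demonstration on a throwaway tree whose layout mirrors the notebook's project.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "project"
    (root / "data" / "raw").mkdir(parents=True)
    (root / "notebooks" / "week1").mkdir(parents=True)

    # Start two levels below the root; the helper should still find it.
    found = find_project_root(root / "notebooks" / "week1")
    root_matches = found == root.resolve()
    print(root_matches)
```

In the notebook this could be called as `find_project_root(Path.cwd())`, which would behave the same whether the kernel starts in the root or in a nested folder.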


## 2. Load Support Files (Small CSVs)

**Objective:** Load and validate all small CSV files (stores, items, oil, holidays, transactions)

**Activities:**
- Load 5 support CSV files into pandas DataFrames
- Display shape, columns, and data types for each
- Check for missing values
- Convert date columns to datetime format
- Display first few rows for validation

**Expected output:** 
- 5 DataFrames loaded successfully
- Schema validation report
- Missing value counts per file

In [11]:
DATA_RAW / 'stores.csv'

WindowsPath('c:/Users/adiaz/OneDrive/Dokumente/PythonScripts/MasterClass/Demand-forecasting-in-retail/data/raw/stores.csv')

In [12]:
# Load stores.csv
df_stores = pd.read_csv(DATA_RAW / 'stores.csv')

print("stores.csv loaded:")
print(f"  Shape: {df_stores.shape}")
print(f"  Columns: {list(df_stores.columns)}")
print(f"  Missing values: {df_stores.isnull().sum().sum()}")
print(f"\nFirst 3 rows:")
print(df_stores.head(3))

stores.csv loaded:
  Shape: (54, 5)
  Columns: ['store_nbr', 'city', 'state', 'type', 'cluster']
  Missing values: 0

First 3 rows:
   store_nbr   city      state type  cluster
0          1  Quito  Pichincha    D       13
1          2  Quito  Pichincha    D       13
2          3  Quito  Pichincha    D        8


In [13]:
# Load items.csv
df_items = pd.read_csv(DATA_RAW / 'items.csv')

print("items.csv loaded:")
print(f"  Shape: {df_items.shape}")
print(f"  Columns: {list(df_items.columns)}")
print(f"  Missing values: {df_items.isnull().sum().sum()}")
print(f"\nFirst 3 rows:")
print(df_items.head(3))

items.csv loaded:
  Shape: (4100, 4)
  Columns: ['item_nbr', 'family', 'class', 'perishable']
  Missing values: 0

First 3 rows:
   item_nbr     family  class  perishable
0     96995  GROCERY I   1093           0
1     99197  GROCERY I   1067           0
2    103501   CLEANING   3008           0


In [14]:
# Load oil.csv
df_oil = pd.read_csv(DATA_RAW / 'oil.csv')

print("oil.csv loaded:")
print(f"  Shape: {df_oil.shape}")
print(f"  Columns: {list(df_oil.columns)}")
print(f"  Missing values: {df_oil.isnull().sum().sum()}")
print(f"\nData types:")
print(df_oil.dtypes)
print(f"\nFirst 3 rows:")
print(df_oil.head(3))

oil.csv loaded:
  Shape: (1218, 2)
  Columns: ['date', 'dcoilwtico']
  Missing values: 43

Data types:
date           object
dcoilwtico    float64
dtype: object

First 3 rows:
         date  dcoilwtico
0  2013-01-01         NaN
1  2013-01-02       93.14
2  2013-01-03       92.97


In [15]:
# Load holidays_events.csv
df_holidays = pd.read_csv(DATA_RAW / 'holidays_events.csv')

print("holidays_events.csv loaded:")
print(f"  Shape: {df_holidays.shape}")
print(f"  Columns: {list(df_holidays.columns)}")
print(f"  Missing values: {df_holidays.isnull().sum().sum()}")
print(f"\nData types:")
print(df_holidays.dtypes)
print(f"\nFirst 3 rows:")
print(df_holidays.head(3))

holidays_events.csv loaded:
  Shape: (350, 6)
  Columns: ['date', 'type', 'locale', 'locale_name', 'description', 'transferred']
  Missing values: 0

Data types:
date           object
type           object
locale         object
locale_name    object
description    object
transferred      bool
dtype: object

First 3 rows:
         date     type    locale locale_name                    description  \
0  2012-03-02  Holiday     Local       Manta             Fundacion de Manta   
1  2012-04-01  Holiday  Regional    Cotopaxi  Provincializacion de Cotopaxi   
2  2012-04-12  Holiday     Local      Cuenca            Fundacion de Cuenca   

   transferred  
0        False  
1        False  
2        False  


In [16]:
# Load transactions.csv
df_transactions = pd.read_csv(DATA_RAW / 'transactions.csv')

print("transactions.csv loaded:")
print(f"  Shape: {df_transactions.shape}")
print(f"  Columns: {list(df_transactions.columns)}")
print(f"  Missing values: {df_transactions.isnull().sum().sum()}")
print(f"\nData types:")
print(df_transactions.dtypes)
print(f"\nFirst 3 rows:")
print(df_transactions.head(3))

transactions.csv loaded:
  Shape: (83488, 3)
  Columns: ['date', 'store_nbr', 'transactions']
  Missing values: 0

Data types:
date            object
store_nbr        int64
transactions     int64
dtype: object

First 3 rows:
         date  store_nbr  transactions
0  2013-01-01         25           770
1  2013-01-02          1          2111
2  2013-01-02          2          2358


In [17]:
# Create summary dictionary
support_files_summary = {
    'File': ['stores.csv', 'items.csv', 'oil.csv', 'holidays_events.csv', 'transactions.csv'],
    'Rows': [len(df_stores), len(df_items), len(df_oil), len(df_holidays), len(df_transactions)],
    'Columns': [df_stores.shape[1], df_items.shape[1], df_oil.shape[1], df_holidays.shape[1], df_transactions.shape[1]],
    'Missing': [int(df.isnull().sum().sum()) for df in (df_stores, df_items, df_oil, df_holidays, df_transactions)]
}

df_summary = pd.DataFrame(support_files_summary)
print("Support Files Summary:")
print(df_summary.to_string(index=False))
print("\nOK - All 5 support files loaded successfully")

Support Files Summary:
               File  Rows  Columns  Missing
         stores.csv    54        5        0
          items.csv  4100        4        0
            oil.csv  1218        2       43
holidays_events.csv   350        6        0
   transactions.csv 83488        3        0

OK - All 5 support files loaded successfully
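
The activities for this section list converting date columns to datetime, but the `date` columns above are still `object` dtype. A minimal sketch of that conversion follows, using a toy frame with the same date strings as the first rows of oil.csv. The explicit `format` argument is a choice made here, not something the notebook specifies.

```python
import pandas as pd

# Toy frame with the same `date` strings as the first rows of oil.csv above.
toy_oil = pd.DataFrame(
    {
        "date": ["2013-01-01", "2013-01-02", "2013-01-03"],
        "dcoilwtico": [None, 93.14, 92.97],
    }
)

# An explicit format makes parsing strict: a malformed date raises an error
# instead of being guessed at.
toy_oil["date"] = pd.to_datetime(toy_oil["date"], format="%Y-%m-%d")

print(toy_oil.dtypes)
print(toy_oil["date"].min(), "to", toy_oil["date"].max())
```

For the real files, passing `parse_dates=['date']` to `pd.read_csv` should achieve the same result at load time.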


## 3. Inspect train.csv with Dask

**Objective:** Load large train.csv file using Dask and inspect structure without loading full dataset into memory

**Activities:**
- Use Dask to read train.csv (479 MB file)
- Display schema and estimated row count
- Check for missing values
- Sample first rows for validation
- Document file characteristics

**Expected output:** 
- Train data structure confirmed
- Missing value percentages calculated
- Memory-efficient inspection complete

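**Note:** With about 125M rows, train.csv is too large to load comfortably into memory with pandas. Dask evaluates lazily, reading the file partition by partition only when a result is requested.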
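
Processing a file piece by piece means totals are accumulated per piece rather than computed on one in-memory table. As a rough illustration of that idea without Dask, the sketch below streams a tiny in-memory CSV through pandas' `chunksize` reader. This is a swapped-in technique for illustration, not what the notebook uses, and the `onpromotion` values in the last two rows are made up.

```python
import io
import pandas as pd

# Tiny stand-in for train.csv. The onpromotion values in the last two rows
# are hypothetical; the column layout matches the real file.
csv_text = """id,date,store_nbr,item_nbr,unit_sales,onpromotion
0,2013-01-01,25,103665,7.0,
1,2013-01-01,25,105574,1.0,
2,2013-01-01,25,105575,2.0,
3,2013-01-01,25,108079,1.0,False
4,2013-01-01,25,108701,1.0,True
"""

total_rows = 0
missing = None

# Each chunk is an ordinary DataFrame holding at most `chunksize` rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    chunk_missing = chunk.isnull().sum()
    missing = chunk_missing if missing is None else missing + chunk_missing

print(f"Rows: {total_rows}")
print(f"Missing onpromotion: {missing['onpromotion']}")
```

Only one chunk is held in memory at a time, which is the same property that lets Dask handle the full file.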

In [18]:
# Load train.csv with Dask (lazy evaluation)
print("Loading train.csv with Dask (this may take a moment)...")
df_train = dd.read_csv(DATA_RAW / 'train.csv')

print("OK - train.csv loaded (Dask DataFrame)")
print(f"\nColumns: {list(df_train.columns)}")
print(f"Data types:")
print(df_train.dtypes)

Loading train.csv with Dask (this may take a moment)...
OK - train.csv loaded (Dask DataFrame)

Columns: ['id', 'date', 'store_nbr', 'item_nbr', 'unit_sales', 'onpromotion']
Data types:
id                       int64
date           string[pyarrow]
store_nbr                int64
item_nbr                 int64
unit_sales             float64
onpromotion            float64
dtype: object


In [19]:
# Compute actual row count (triggers computation)
print("Computing row count (this will take time - processing 125M rows)...")
train_length = len(df_train)
print(f"OK - Total rows: {train_length:,}")

# Check missing values per column
print("\nComputing missing values per column...")
missing_counts = df_train.isnull().sum().compute()
missing_pct = (missing_counts / train_length * 100).round(2)

print("\nMissing Values:")
for col in df_train.columns:
    count = missing_counts[col]
    pct = missing_pct[col]
    print(f"  {col:<15} {count:>12,} ({pct:>6}%)")

Computing row count (this will take time - processing 125M rows)...
OK - Total rows: 125,497,040

Computing missing values per column...

Missing Values:
  id                         0 (   0.0%)
  date                       0 (   0.0%)
  store_nbr                  0 (   0.0%)
  item_nbr                   0 (   0.0%)
  unit_sales                 0 (   0.0%)
  onpromotion       21,657,651 ( 17.26%)
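
The notebook's later summary records a decision to fill missing `onpromotion` values with False. A minimal sketch of that fill on a toy column is shown below. It assumes promotions appear as 1.0/0.0 in the float column Dask inferred; the extra `onpromotion_was_missing` flag is an optional addition made here so the original gaps stay traceable.

```python
import pandas as pd

# Toy column mimicking onpromotion as float64 with NaN (the 1.0/0.0 encoding
# is an assumption for this sketch).
toy = pd.DataFrame({"onpromotion": [float("nan"), 0.0, 1.0, float("nan")]})

# Optional: remember which rows were originally missing before filling.
toy["onpromotion_was_missing"] = toy["onpromotion"].isna()

# Fill gaps with 0 (no promotion) and cast to a proper boolean column.
toy["onpromotion"] = toy["onpromotion"].fillna(0).astype(bool)

print(toy)
```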


In [20]:
# Take the first 1,000 rows (a head slice, not a random sample); the first partition alone holds enough rows
print("Sampling first 1000 rows...")
df_train_sample = df_train.head(1000)

print("\nFirst 5 rows:")
print(df_train_sample.head())

print("\nBasic statistics for unit_sales:")
print(df_train_sample['unit_sales'].describe())

# Check for negative values
negative_count = (df_train_sample['unit_sales'] < 0).sum()
print(f"\nNegative unit_sales in sample: {negative_count} rows")

Sampling first 1000 rows...

First 5 rows:
   id        date  store_nbr  item_nbr  unit_sales  onpromotion
0   0  2013-01-01         25    103665        7.00          NaN
1   1  2013-01-01         25    105574        1.00          NaN
2   2  2013-01-01         25    105575        2.00          NaN
3   3  2013-01-01         25    108079        1.00          NaN
4   4  2013-01-01         25    108701        1.00          NaN

Basic statistics for unit_sales:
count   1000.00
mean       5.87
std        8.31
min        0.50
25%        1.00
50%        3.00
75%        7.00
max       90.00
Name: unit_sales, dtype: float64

Negative unit_sales in sample: 0 rows


## 4. Identify Guayas Stores

**Objective:** Filter stores to Guayas region for project scope

**Activities:**
- Query stores.csv WHERE state = 'Guayas'
- Count Guayas stores
- Display store types and clusters in Guayas
- Export Guayas store_nbr list for train filtering

**Expected output:** 
- List of Guayas store identifiers
- Guayas store characteristics (types, clusters, cities)

In [21]:
# Filter stores to Guayas region
guayas_stores = df_stores[df_stores['state'] == 'Guayas'].copy()

print(f"Total stores in dataset: {len(df_stores)}")
print(f"Stores in Guayas: {len(guayas_stores)}")
print(f"Percentage: {len(guayas_stores)/len(df_stores)*100:.1f}%")

print("\nGuayas stores:")
print(guayas_stores)

Total stores in dataset: 54
Stores in Guayas: 11
Percentage: 20.4%

Guayas stores:
    store_nbr       city   state type  cluster
23         24  Guayaquil  Guayas    D        1
25         26  Guayaquil  Guayas    D       10
26         27      Daule  Guayas    D        1
27         28  Guayaquil  Guayas    E       10
28         29  Guayaquil  Guayas    E       10
29         30  Guayaquil  Guayas    C        3
31         32  Guayaquil  Guayas    C        3
33         34  Guayaquil  Guayas    B        6
34         35     Playas  Guayas    C        3
35         36   Libertad  Guayas    E       10
50         51  Guayaquil  Guayas    A       17


In [22]:
# Analyze store types in Guayas
print("Store types in Guayas:")
print(guayas_stores['type'].value_counts().sort_index())

print("\nStore clusters in Guayas:")
print(guayas_stores['cluster'].value_counts().sort_index())

print("\nCities in Guayas:")
print(guayas_stores['city'].value_counts())

# Extract store_nbr list for filtering
guayas_store_nbrs = guayas_stores['store_nbr'].tolist()
print(f"\nGuayas store_nbr list ({len(guayas_store_nbrs)} stores):")
print(guayas_store_nbrs)

Store types in Guayas:
type
A    1
B    1
C    3
D    3
E    3
Name: count, dtype: int64

Store clusters in Guayas:
cluster
1     2
3     3
6     1
10    4
17    1
Name: count, dtype: int64

Cities in Guayas:
city
Guayaquil    8
Daule        1
Playas       1
Libertad     1
Name: count, dtype: int64

Guayas store_nbr list (11 stores):
[24, 26, 27, 28, 29, 30, 32, 34, 35, 36, 51]
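
The exported `store_nbr` list is meant for filtering train.csv. A minimal sketch of that filter with `isin` on a toy frame follows. The store 25 rows are copied from the sample output earlier in the notebook; the rows for stores 24 and 51 are hypothetical.

```python
import pandas as pd

guayas_store_nbrs = [24, 26, 27, 28, 29, 30, 32, 34, 35, 36, 51]

# Toy slice of train.csv: store 25 rows come from the notebook's sample output,
# the store 24 and store 51 rows are hypothetical.
toy_train = pd.DataFrame(
    {
        "store_nbr": [25, 25, 24, 51],
        "item_nbr": [103665, 105574, 103665, 105574],
        "unit_sales": [7.0, 1.0, 3.0, 2.0],
    }
)

# Keep only rows whose store is in the Guayas list.
toy_guayas = toy_train[toy_train["store_nbr"].isin(guayas_store_nbrs)]
print(toy_guayas)
```

Dask Series also provide `isin`, so the same mask should carry over to `df_train` with little change.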


## 5. Identify Top-3 Product Families

**Objective:** Determine top-3 product families by item count for scope reduction

**Activities:**
- Count unique items per product family
- Rank families by item count
- Select top-3 families
- Display family characteristics

**Expected output:** 
- Top-3 families list with item counts
- Percentage of total items covered
- Family names for train filtering

In [23]:
# Count items per family
items_per_family = df_items['family'].value_counts().reset_index()
items_per_family.columns = ['family', 'item_count']
items_per_family = items_per_family.sort_values('item_count', ascending=False).reset_index(drop=True)

print(f"Total product families: {len(items_per_family)}")
print(f"Total items: {len(df_items)}")

print("\nTop-10 families by item count:")
print(items_per_family.head(10).to_string(index=False))

# Select top-3
top_3_families = items_per_family.head(3)
top_3_family_names = top_3_families['family'].tolist()

print(f"\nTop-3 families selected:")
print(top_3_families.to_string(index=False))

print(f"\nTop-3 families cover {top_3_families['item_count'].sum():,} items ({top_3_families['item_count'].sum()/len(df_items)*100:.1f}% of total)")

Total product families: 33
Total items: 4100

Top-10 families by item count:
       family  item_count
    GROCERY I        1334
    BEVERAGES         613
     CLEANING         446
      PRODUCE         306
        DAIRY         242
PERSONAL CARE         153
 BREAD/BAKERY         134
    HOME CARE         108
         DELI          91
        MEATS          84

Top-3 families selected:
   family  item_count
GROCERY I        1334
BEVERAGES         613
 CLEANING         446

Top-3 families cover 2,393 items (58.4% of total)
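
train.csv has an `item_nbr` column but no `family` column, so filtering by family first requires translating the family names into item numbers. A minimal sketch of that lookup follows. The first three rows are copied from the items.csv output above; the PRODUCE row and its item number are hypothetical.

```python
import pandas as pd

top_3_family_names = ["GROCERY I", "BEVERAGES", "CLEANING"]

# First three rows are from items.csv above; the PRODUCE row is hypothetical.
toy_items = pd.DataFrame(
    {
        "item_nbr": [96995, 99197, 103501, 999999],
        "family": ["GROCERY I", "GROCERY I", "CLEANING", "PRODUCE"],
    }
)

# Item numbers belonging to the selected families.
top_3_item_nbrs = toy_items.loc[
    toy_items["family"].isin(top_3_family_names), "item_nbr"
].tolist()
print(top_3_item_nbrs)
```

The resulting list can then be used with `isin` on `item_nbr`, in the same way as the store filter.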


## 6. Summary & Export Findings

**Objective:** Consolidate inventory findings and export for documentation

**Activities:**
- Create comprehensive inventory summary
- Document Guayas scope (11 stores)
- Document top-3 families scope (2,393 items)
- Export summary to CSV for data_inventory.md update
- Calculate expected filtering impact on train.csv

**Expected output:** 
- Complete inventory summary dictionary
- Summary CSV exported to docs/
- Filtering estimates documented
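
The filtering-impact estimate listed in the activities can be sketched with plain arithmetic on the figures reported earlier in this notebook. The count of store-item combinations is exact. The row estimate assumes rows are spread evenly across stores and items, which real sales data will not satisfy, so treat it as an order-of-magnitude guide only.

```python
# Figures reported earlier in this notebook.
train_rows = 125_497_040
stores_total, guayas_stores = 54, 11
items_total, top_3_items = 4100, 2393

# Exact number of store-item combinations in scope.
store_item_pairs = guayas_stores * top_3_items

# Crude row estimate: ASSUMES rows are spread evenly over stores and items.
scope_fraction = (guayas_stores / stores_total) * (top_3_items / items_total)
estimated_rows = round(train_rows * scope_fraction)

print(f"Store-item pairs in scope: {store_item_pairs:,}")
print(f"Share of train.csv under the even-spread assumption: {scope_fraction:.1%}")
print(f"Estimated rows: {estimated_rows:,}")
```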

In [24]:
# Create inventory summary
inventory_summary = {
    # File characteristics
    'stores_total': len(df_stores),
    'items_total': len(df_items),
    'families_total': df_items['family'].nunique(),
    'oil_records': len(df_oil),
    'oil_missing': int(df_oil['dcoilwtico'].isnull().sum()),
    'holidays_records': len(df_holidays),
    'transactions_records': len(df_transactions),
    'train_rows': train_length,
    'train_columns': len(df_train.columns),
    
    # Guayas scope
    'guayas_stores': len(guayas_stores),
    'guayas_store_list': str(guayas_store_nbrs),
    'guayas_pct_of_stores': f"{len(guayas_stores)/len(df_stores)*100:.1f}%",
    
    # Top-3 families scope
    'top_3_families': str(top_3_family_names),
    'top_3_items': top_3_families['item_count'].sum(),
    'top_3_pct_of_items': f"{top_3_families['item_count'].sum()/len(df_items)*100:.1f}%",
    
    # Data quality
    'onpromotion_missing_count': int(missing_counts['onpromotion']),
    'onpromotion_missing_pct': f"{missing_pct['onpromotion']:.2f}%",
}

print("Inventory Summary:")
print("=" * 60)
for key, value in inventory_summary.items():
    print(f"{key:<30} {value}")

Inventory Summary:
============================================================
stores_total                   54
items_total                    4100
families_total                 33
oil_records                    1218
oil_missing                    43
holidays_records               350
transactions_records           83488
train_rows                     125497040
train_columns                  6
guayas_stores                  11
guayas_store_list              [24, 26, 27, 28, 29, 30, 32, 34, 35, 36, 51]
guayas_pct_of_stores           20.4%
top_3_families                 ['GROCERY I', 'BEVERAGES', 'CLEANING']
top_3_items                    2393
top_3_pct_of_items             58.4%
onpromotion_missing_count      21657651
onpromotion_missing_pct        17.26%


In [25]:
# Export summary to CSV for documentation
df_summary_export = pd.DataFrame([inventory_summary])
output_path = DOCS / 'inventory_summary.csv'
df_summary_export.to_csv(output_path, index=False)

print(f"OK - Summary exported to: {output_path.resolve()}")

# Also create a more readable text summary
summary_text = f"""
DATA INVENTORY SUMMARY
Generated: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M')}

FILE CHARACTERISTICS:
- stores.csv: {len(df_stores)} stores
- items.csv: {len(df_items)} items across {df_items['family'].nunique()} families
- oil.csv: {len(df_oil)} records ({int(df_oil['dcoilwtico'].isnull().sum())} missing values)
- holidays_events.csv: {len(df_holidays)} holiday records
- transactions.csv: {len(df_transactions):,} transaction records
- train.csv: {train_length:,} rows, 6 columns

PROJECT SCOPE (Guayas Region):
- Stores selected: {len(guayas_stores)} of {len(df_stores)} ({len(guayas_stores)/len(df_stores)*100:.1f}%)
- Store IDs: {guayas_store_nbrs}
- Top-3 families: {top_3_family_names}
- Items covered: {top_3_families['item_count'].sum():,} of {len(df_items)} ({top_3_families['item_count'].sum()/len(df_items)*100:.1f}%)

DATA QUALITY NOTES:
- onpromotion missing: {int(missing_counts['onpromotion']):,} rows ({missing_pct['onpromotion']:.2f}%)
- Decision: Fill with False (assume no promotion)
- train.csv size: 479 MB (requires Dask)
- Negative unit_sales: To be investigated in full dataset

NEXT STEPS (Day 2):
1. Filter train.csv to Guayas stores only
2. Filter to top-3 families only
3. Random sample 300K rows for development
4. Begin EDA on filtered dataset
"""

print(summary_text)

# Save text summary
with open(DOCS / 'inventory_summary.txt', 'w', encoding='utf-8') as f:
    f.write(summary_text)

print(f"\nOK - Text summary also saved to: {(DOCS / 'inventory_summary.txt').resolve()}")

OK - Summary exported to: C:\Users\adiaz\OneDrive\Dokumente\PythonScripts\MasterClass\Demand-forecasting-in-retail\docs\inventory_summary.csv

DATA INVENTORY SUMMARY
Generated: 2025-11-11 04:45

FILE CHARACTERISTICS:
- stores.csv: 54 stores
- items.csv: 4100 items across 33 families
- oil.csv: 1218 records (43 missing values)
- holidays_events.csv: 350 holiday records
- transactions.csv: 83,488 transaction records
- train.csv: 125,497,040 rows, 6 columns

PROJECT SCOPE (Guayas Region):
- Stores selected: 11 of 54 (20.4%)
- Store IDs: [24, 26, 27, 28, 29, 30, 32, 34, 35, 36, 51]
- Top-3 families: ['GROCERY I', 'BEVERAGES', 'CLEANING']
- Items covered: 2,393 of 4100 (58.4%)

DATA QUALITY NOTES:
- onpromotion missing: 21,657,651 rows (17.26%)
- Decision: Fill with False (assume no promotion)
- train.csv size: 479 MB (requires Dask)
- Negative unit_sales: To be investigated in full dataset

NEXT STEPS (Day 2):
1. Filter train.csv to Guayas stores only
2. Filter to top-3 families only
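3. Random sample 300K rows for development
4. Begin EDA on filtered dataset


OK - Text summary also saved to: C:\Users\adiaz\OneDrive\Dokumente\PythonScripts\MasterClass\Demand-forecasting-in-retail\docs\inventory_summary.txt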

In [26]:
# Notebook completion summary
print("=" * 70)
print("NOTEBOOK COMPLETE: d01_w01_SETUP_data_inventory.ipynb")
print("=" * 70)

print("\nACCOMPLISHMENTS:")
print("✓ All 5 support files loaded and validated")
print("✓ train.csv inspected with Dask (125,497,040 rows)")
print("✓ Guayas region scope defined (11 stores, 20.4% of total)")
print("✓ Top-3 families identified (2,393 items, 58.4% of total)")
print("✓ Inventory summary exported to docs/")

print("\nFILES CREATED:")
print(f"  - {(DOCS / 'inventory_summary.csv').resolve()}")
print(f"  - {(DOCS / 'inventory_summary.txt').resolve()}")

print("\nKEY FINDINGS:")
print("  - Dataset: 125M rows, 6 columns, 479 MB")
print("  - Scope filter: 11 stores × 2,393 items = 26,323 store-item pairs")
print("  - Missing data: onpromotion 17.26% (will fill with False)")
print("  - Oil data: 43 missing values (3.5%)")

print("\nNEXT STEPS (Day 2):")
print("  1. Filter train.csv to Guayas stores")
print("  2. Filter to top-3 families")
print("  3. Random sample 300K rows")
print("  4. Export guayas_sample_300k.csv")

print("\nREADY FOR DAY 2 ✓")

======================================================================
NOTEBOOK COMPLETE: d01_w01_SETUP_data_inventory.ipynb
======================================================================

ACCOMPLISHMENTS:
✓ All 5 support files loaded and validated
✓ train.csv inspected with Dask (125,497,040 rows)
✓ Guayas region scope defined (11 stores, 20.4% of total)
✓ Top-3 families identified (2,393 items, 58.4% of total)
✓ Inventory summary exported to docs/

FILES CREATED:
  - C:\Users\adiaz\OneDrive\Dokumente\PythonScripts\MasterClass\Demand-forecasting-in-retail\docs\inventory_summary.csv
  - C:\Users\adiaz\OneDrive\Dokumente\PythonScripts\MasterClass\Demand-forecasting-in-retail\docs\inventory_summary.txt

KEY FINDINGS:
  - Dataset: 125M rows, 6 columns, 479 MB
  - Scope filter: 11 stores × 2,393 items = 26,323 store-item pairs
  - Missing data: onpromotion 17.26% (will fill with False)
  - Oil data: 43 missing values (3.5%)

NEXT STEPS (Day 2):
  1. Filter train.csv to Guayas stores
  2. Filter to top-3 families
  3. Random sample 300K rows
  4. Export guayas_sample_300k.csv

READY FOR DAY 2 ✓
