# Corporación Favorita Grocery Sales Forecasting
**w01_d01_SETUP_data_inventory.ipynb**

**Author:** Alberto Diaz Durana  
**Date:** November 2025  
**Purpose:** Project setup, environment configuration, and comprehensive data inventory

---

## Objectives

This notebook accomplishes the following:

- Configure Python virtual environment and install base packages
- Download Kaggle dataset files (8 CSVs)
- Create project directory structure
- Generate comprehensive data inventory (schema, size, coverage)
- Document data quality baseline (missing values, duplicates)
- Establish data filtering criteria (Guayas region, top-3 families)

---

## Business Context

**Why proper setup matters:**

A solid foundation ensures:
- Reproducible environment across sessions
- Complete understanding of available data
- Clear scope definition (300K sample from 125M rows)
- Quality baseline for downstream analysis
- Efficient storage and access patterns

**Key decisions made:**
- Geographic scope: Guayas region only (11 stores)
- Product scope: Top-3 families (GROCERY I, BEVERAGES, CLEANING)
- Sample size: 300K rows (manageable for 4-week timeline)
- Date range: 2013-2017 (full training period available)

**Deliverables:**
- Virtual environment with pinned packages
- Project directory structure
- Data inventory report (8 files documented)
- Filtering criteria defined
- README and setup documentation

---

## Input Dependencies

External:
- Kaggle API credentials configured
- Internet connection for dataset download
- Python 3.11+ installed

Dataset:
- Kaggle competition: "favorita-grocery-sales-forecasting"
- 8 CSV files (train.csv ~479 MB, others <50 MB each)
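
Before loading anything, it can help to confirm that the download produced every expected file. The sketch below assumes the usual file names from the Kaggle competition and that they are already extracted into a single folder. `EXPECTED_FILES` and `find_missing_files` are illustrative names, not part of this project.

```python
from pathlib import Path

# Assumed names of the 8 competition CSVs (verify against your own download)
EXPECTED_FILES = [
    "train.csv", "test.csv", "sample_submission.csv", "stores.csv",
    "items.csv", "transactions.csv", "oil.csv", "holidays_events.csv",
]

def find_missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]
```

Calling `find_missing_files("data/raw")` should return an empty list once all files are in place.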

---

## 1. Setup & Environment Configuration

**Objective:** Import required libraries, configure paths, validate environment

**Activities:**
- Import pandas, numpy, dask for data manipulation
- Define path constants for data/raw/ and docs/
- Test imports and display versions
- Configure warnings and display settings

**Expected output:** Confirmation that environment is ready
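
One possible way to make "environment is ready" checkable is to collect problems into a list rather than relying on an import error. This is a minimal sketch; `check_environment` is a hypothetical helper, and the package list and minimum Python version are taken from this notebook's stated dependencies.

```python
import sys
from importlib.util import find_spec

def check_environment(required=("pandas", "numpy", "dask"), min_python=(3, 11)):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if sys.version_info[:2] < min_python:
        found = f"{sys.version_info[0]}.{sys.version_info[1]}"
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required, found {found}")
    for name in required:
        # find_spec returns None when a top-level package cannot be located
        if find_spec(name) is None:
            problems.append(f"Package not installed: {name}")
    return problems
```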

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import dask
import dask.dataframe as dd
from pathlib import Path
import warnings

# Package versions
print("Package Versions:")
print(f"  pandas: {pd.__version__}")
print(f"  numpy: {np.__version__}")
print(f"  dask: {dask.__version__}")

In [None]:
# Configure environment
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

print("OK - Display settings configured")

In [None]:
# Determine current directory (works in both scripts and notebooks)
current_dir = Path(__file__).parent if '__file__' in globals() else Path.cwd()
project_root = current_dir.parent if current_dir.name == 'notebooks' else current_dir
print(f"Project root: {project_root.resolve()}")

# Define path constants relative to project root
DATA_RAW = project_root / 'data' / 'raw'
DOCS = project_root / 'docs'

# Verify paths exist
assert DATA_RAW.exists(), f"ERROR - Path not found: {DATA_RAW}"
assert DOCS.exists(), f"ERROR - Path not found: {DOCS}"

print("OK - Paths validated:")
print(f"  DATA_RAW: {DATA_RAW.resolve()}")
print(f"  DOCS: {DOCS.resolve()}")


## 2. Load Support Files (Small CSVs)

**Objective:** Load and validate all small CSV files (stores, items, oil, holidays, transactions)

**Activities:**
- Load 5 support CSV files into pandas DataFrames
- Display shape, columns, and data types for each
- Check for missing values
- Convert date columns to datetime format
- Display first few rows for validation

**Expected output:** 
- 5 DataFrames loaded successfully
- Schema validation report
- Missing value counts per file
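
The per-file checks listed above can be gathered by one small helper so that counts are computed rather than typed in by hand. The sketch below runs on a tiny in-memory stand-in for oil.csv with illustrative values; `summarize_frame` is a hypothetical helper, not part of the notebook.

```python
import io
import pandas as pd

def summarize_frame(name, df):
    """Collect shape and total missing-value count for one DataFrame."""
    return {
        "File": name,
        "Rows": len(df),
        "Columns": df.shape[1],
        "Missing": int(df.isnull().sum().sum()),
    }

# Tiny stand-in for oil.csv (illustrative values, one missing price)
toy_csv = io.StringIO(
    "date,dcoilwtico\n2013-01-01,\n2013-01-02,93.14\n2013-01-03,92.97\n"
)
# parse_dates converts the date column to datetime while reading
toy_oil = pd.read_csv(toy_csv, parse_dates=["date"])

summary = summarize_frame("oil.csv", toy_oil)
print(summary)
print(toy_oil.dtypes)
```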

In [None]:
# Quick check: display the resolved path to stores.csv
DATA_RAW / 'stores.csv'

In [None]:
# Load stores.csv
df_stores = pd.read_csv(DATA_RAW / 'stores.csv')

print("stores.csv loaded:")
print(f"  Shape: {df_stores.shape}")
print(f"  Columns: {list(df_stores.columns)}")
print(f"  Missing values: {df_stores.isnull().sum().sum()}")
print(f"\nFirst 3 rows:")
print(df_stores.head(3))

In [None]:
# Load items.csv
df_items = pd.read_csv(DATA_RAW / 'items.csv')

print("items.csv loaded:")
print(f"  Shape: {df_items.shape}")
print(f"  Columns: {list(df_items.columns)}")
print(f"  Missing values: {df_items.isnull().sum().sum()}")
print(f"\nFirst 3 rows:")
print(df_items.head(3))

In [None]:
# Load oil.csv
df_oil = pd.read_csv(DATA_RAW / 'oil.csv', parse_dates=['date'])  # parse date column as datetime

print("oil.csv loaded:")
print(f"  Shape: {df_oil.shape}")
print(f"  Columns: {list(df_oil.columns)}")
print(f"  Missing values: {df_oil.isnull().sum().sum()}")
print(f"\nData types:")
print(df_oil.dtypes)
print(f"\nFirst 3 rows:")
print(df_oil.head(3))

In [None]:
# Load holidays_events.csv
df_holidays = pd.read_csv(DATA_RAW / 'holidays_events.csv', parse_dates=['date'])  # parse date column as datetime

print("holidays_events.csv loaded:")
print(f"  Shape: {df_holidays.shape}")
print(f"  Columns: {list(df_holidays.columns)}")
print(f"  Missing values: {df_holidays.isnull().sum().sum()}")
print(f"\nData types:")
print(df_holidays.dtypes)
print(f"\nFirst 3 rows:")
print(df_holidays.head(3))

In [None]:
# Load transactions.csv
df_transactions = pd.read_csv(DATA_RAW / 'transactions.csv', parse_dates=['date'])  # parse date column as datetime

print("transactions.csv loaded:")
print(f"  Shape: {df_transactions.shape}")
print(f"  Columns: {list(df_transactions.columns)}")
print(f"  Missing values: {df_transactions.isnull().sum().sum()}")
print(f"\nData types:")
print(df_transactions.dtypes)
print(f"\nFirst 3 rows:")
print(df_transactions.head(3))

In [None]:
# Create summary dictionary
support_files_summary = {
    'File': ['stores.csv', 'items.csv', 'oil.csv', 'holidays_events.csv', 'transactions.csv'],
    'Rows': [len(df_stores), len(df_items), len(df_oil), len(df_holidays), len(df_transactions)],
    'Columns': [df_stores.shape[1], df_items.shape[1], df_oil.shape[1], df_holidays.shape[1], df_transactions.shape[1]],
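    # Compute missing counts from the data instead of hardcoding them
    'Missing': [int(df.isnull().sum().sum()) for df in (df_stores, df_items, df_oil, df_holidays, df_transactions)]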
}

df_summary = pd.DataFrame(support_files_summary)
print("Support Files Summary:")
print(df_summary.to_string(index=False))
print("\nOK - All 5 support files loaded successfully")

## 3. Inspect train.csv with Dask

**Objective:** Load large train.csv file using Dask and inspect structure without loading full dataset into memory

**Activities:**
- Use Dask to read train.csv (479 MB file)
- Display schema and estimated row count
- Check for missing values
- Sample first rows for validation
- Document file characteristics

**Expected output:** 
- Train data structure confirmed
- Missing value percentages calculated
- Memory-efficient inspection complete

**Note:** train.csv (about 125M rows) is too large to load comfortably into memory with pandas. Dask reads it lazily in partitions and only computes when a result is requested.
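
The idea behind partitioned processing can be illustrated without Dask: read the file in fixed-size chunks and accumulate counts, so only one chunk is in memory at a time. This is a pandas-only sketch on a tiny in-memory CSV with made-up rows, not the Dask workflow used in this notebook. `count_missing_chunked` is a hypothetical helper.

```python
import io
import pandas as pd

def count_missing_chunked(source, chunksize=2):
    """Count rows and per-column missing values one chunk at a time."""
    total_rows = 0
    missing = None
    for chunk in pd.read_csv(source, chunksize=chunksize):
        total_rows += len(chunk)
        chunk_missing = chunk.isnull().sum()
        missing = chunk_missing if missing is None else missing + chunk_missing
    return total_rows, missing

# Made-up rows that follow the train.csv column layout
toy_train = io.StringIO(
    "id,date,store_nbr,item_nbr,unit_sales,onpromotion\n"
    "0,2013-01-01,25,103665,7.0,\n"
    "1,2013-01-01,25,105574,1.0,\n"
    "2,2013-01-01,25,105575,2.0,False\n"
    "3,2013-01-01,25,108079,1.0,True\n"
    "4,2013-01-01,25,108701,1.0,\n"
)
rows, missing = count_missing_chunked(toy_train)
print(rows, missing["onpromotion"])
```

Dask applies the same principle but also builds a task graph and can process partitions in parallel.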

In [None]:
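# Open train.csv with Dask (lazy: only a sample is read to infer the schema)
print("Opening train.csv with Dask (lazy evaluation)...")
# onpromotion mixes missing values with True/False, so the dtype inferred from
# the first rows can fail on later partitions; read it as object to be safe
df_train = dd.read_csv(DATA_RAW / 'train.csv', dtype={'onpromotion': 'object'})

print("OK - train.csv opened (Dask DataFrame, not yet computed)")
print(f"\nColumns: {list(df_train.columns)}")
print("Data types:")
print(df_train.dtypes)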

In [None]:
# Compute actual row count (triggers computation)
print("Computing row count (this will take time - processing 125M rows)...")
train_length = len(df_train)
print(f"OK - Total rows: {train_length:,}")

# Check missing values per column
print("\nComputing missing values per column...")
missing_counts = df_train.isnull().sum().compute()
missing_pct = (missing_counts / train_length * 100).round(2)

print("\nMissing Values:")
for col in df_train.columns:
    count = missing_counts[col]
    pct = missing_pct[col]
    print(f"  {col:<15} {count:>12,} ({pct:>6}%)")

In [None]:
# Take the first 1000 rows for a quick look (head of the file, not a random sample)
print("Reading first 1000 rows...")
df_train_sample = df_train.head(1000)  # default reads only the first partition

print("\nFirst 5 rows:")
print(df_train_sample.head())

print("\nBasic statistics for unit_sales:")
print(df_train_sample['unit_sales'].describe())

# Check for negative values
negative_count = (df_train_sample['unit_sales'] < 0).sum()
print(f"\nNegative unit_sales in sample: {negative_count} rows")

## 4. Identify Guayas Stores

**Objective:** Filter stores to Guayas region for project scope

**Activities:**
- Query stores.csv WHERE state = 'Guayas'
- Count Guayas stores
- Display store types and clusters in Guayas
- Export Guayas store_nbr list for train filtering

**Expected output:** 
- List of Guayas store identifiers
- Guayas store characteristics (types, clusters, cities)
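
The exported `store_nbr` list is meant to drive the later filter on train.csv. A minimal sketch of that pattern on toy data follows. The rows are made up, and only the column names mirror stores.csv.

```python
import pandas as pd

# Toy stores table (made-up rows using the stores.csv column names)
stores = pd.DataFrame({
    "store_nbr": [1, 24, 26, 44],
    "city": ["Quito", "Guayaquil", "Guayaquil", "Quito"],
    "state": ["Pichincha", "Guayas", "Guayas", "Pichincha"],
    "type": ["D", "D", "D", "A"],
    "cluster": [13, 1, 10, 5],
})
# Boolean mask on state, then pull the identifiers into a plain list
guayas_ids = stores.loc[stores["state"] == "Guayas", "store_nbr"].tolist()

# Toy sales rows; keep only those from the selected stores
sales = pd.DataFrame({
    "store_nbr": [1, 24, 26, 44, 24],
    "unit_sales": [3.0, 1.0, 2.0, 5.0, 4.0],
})
guayas_sales = sales[sales["store_nbr"].isin(guayas_ids)]
print(guayas_ids, len(guayas_sales))
```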

In [None]:
# Filter stores to Guayas region
guayas_stores = df_stores[df_stores['state'] == 'Guayas'].copy()

print(f"Total stores in dataset: {len(df_stores)}")
print(f"Stores in Guayas: {len(guayas_stores)}")
print(f"Percentage: {len(guayas_stores)/len(df_stores)*100:.1f}%")

print("\nGuayas stores:")
print(guayas_stores)

In [None]:
# Analyze store types in Guayas
print("Store types in Guayas:")
print(guayas_stores['type'].value_counts().sort_index())

print("\nStore clusters in Guayas:")
print(guayas_stores['cluster'].value_counts().sort_index())

print("\nCities in Guayas:")
print(guayas_stores['city'].value_counts())

# Extract store_nbr list for filtering
guayas_store_nbrs = guayas_stores['store_nbr'].tolist()
print(f"\nGuayas store_nbr list ({len(guayas_store_nbrs)} stores):")
print(guayas_store_nbrs)

## 5. Identify Top-3 Product Families

**Objective:** Determine top-3 product families by item count for scope reduction

**Activities:**
- Count unique items per product family
- Rank families by item count
- Select top-3 families
- Display family characteristics

**Expected output:** 
- Top-3 families list with item counts
- Percentage of total items covered
- Family names for train filtering
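
The ranking step can be reduced to a `value_counts` call followed by `head`. The sketch below uses a made-up items table with a subset of the items.csv columns; `top_families` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy items table (made-up rows; item_nbr and family mirror items.csv)
items = pd.DataFrame({
    "item_nbr": range(1, 11),
    "family": ["GROCERY I"] * 4 + ["BEVERAGES"] * 3 + ["CLEANING"] * 2 + ["DAIRY"],
})

def top_families(items_df, n=3):
    """Return the n families with the most items and their share of all items."""
    # value_counts sorts by count in descending order by default
    counts = items_df["family"].value_counts().head(n)
    coverage_pct = counts.sum() / len(items_df) * 100
    return counts.index.tolist(), coverage_pct

names, pct = top_families(items)
print(names, f"{pct:.1f}%")
```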

In [None]:
# Count items per family
items_per_family = df_items['family'].value_counts().reset_index()
items_per_family.columns = ['family', 'item_count']
items_per_family = items_per_family.sort_values('item_count', ascending=False).reset_index(drop=True)

print(f"Total product families: {len(items_per_family)}")
print(f"Total items: {len(df_items)}")

print("\nTop-10 families by item count:")
print(items_per_family.head(10).to_string(index=False))

# Select top-3
top_3_families = items_per_family.head(3)
top_3_family_names = top_3_families['family'].tolist()

print(f"\nTop-3 families selected:")
print(top_3_families.to_string(index=False))

print(f"\nTop-3 families cover {top_3_families['item_count'].sum():,} items ({top_3_families['item_count'].sum()/len(df_items)*100:.1f}% of total)")

## 6. Summary & Export Findings

**Objective:** Consolidate inventory findings and export for documentation

**Activities:**
- Create comprehensive inventory summary
- Document Guayas scope (11 stores)
- Document top-3 families scope (2,393 items)
- Export summary to CSV for data_inventory.md update
- Calculate expected filtering impact on train.csv

**Expected output:** 
- Complete inventory summary dictionary
- Summary CSV exported to docs/
- Filtering estimates documented
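
The filtering estimate is simple arithmetic: the number of store-item pairs in scope and, optionally, an upper bound on rows if every pair had one row per day. The sketch below assumes that one-row-per-day bound. `scope_size` and `n_days` are illustrative names, and the real row count is likely lower if not every item sells in every store on every day.

```python
def scope_size(n_stores, n_items, n_days=None):
    """Count store-item pairs and, if n_days is given, the row upper bound."""
    pairs = n_stores * n_items
    max_rows = pairs * n_days if n_days is not None else None
    return pairs, max_rows

# Scope figures from this notebook: 11 Guayas stores, 2,393 items in the top-3 families
pairs, _ = scope_size(11, 2393)
print(f"{pairs:,} store-item pairs")
```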

In [None]:
# Create inventory summary with the assistance of AI
inventory_summary = {
    # File characteristics
    'stores_total': len(df_stores),
    'items_total': len(df_items),
    'families_total': df_items['family'].nunique(),
    'oil_records': len(df_oil),
    'oil_missing': int(df_oil.isnull().sum().sum()),  # computed, not hardcoded
    'holidays_records': len(df_holidays),
    'transactions_records': len(df_transactions),
    'train_rows': train_length,
    'train_columns': len(df_train.columns),
    
    # Guayas scope
    'guayas_stores': len(guayas_stores),
    'guayas_store_list': str(guayas_store_nbrs),
    'guayas_pct_of_stores': f"{len(guayas_stores)/len(df_stores)*100:.1f}%",
    
    # Top-3 families scope
    'top_3_families': str(top_3_family_names),
    'top_3_items': top_3_families['item_count'].sum(),
    'top_3_pct_of_items': f"{top_3_families['item_count'].sum()/len(df_items)*100:.1f}%",
    
    # Data quality
    'onpromotion_missing_count': int(missing_counts['onpromotion']),
    'onpromotion_missing_pct': f"{missing_pct['onpromotion']:.2f}%",
}

print("Inventory Summary:")
print("=" * 60)
for key, value in inventory_summary.items():
    print(f"{key:<30} {value}")

In [None]:
# Export summary to CSV for documentation
df_summary_export = pd.DataFrame([inventory_summary])
output_path = DOCS / 'inventory_summary.csv'
df_summary_export.to_csv(output_path, index=False)

print(f"OK - Summary exported to: {output_path.resolve()}")

# Also create a more readable text summary
summary_text = f"""
DATA INVENTORY SUMMARY
Generated: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M')}

FILE CHARACTERISTICS:
- stores.csv: {len(df_stores)} stores
- items.csv: {len(df_items)} items across {df_items['family'].nunique()} families
- oil.csv: {len(df_oil)} records ({int(df_oil.isnull().sum().sum())} missing values)
- holidays_events.csv: {len(df_holidays)} holiday records
- transactions.csv: {len(df_transactions):,} transaction records
- train.csv: {train_length:,} rows, {len(df_train.columns)} columns

PROJECT SCOPE (Guayas Region):
- Stores selected: {len(guayas_stores)} of {len(df_stores)} ({len(guayas_stores)/len(df_stores)*100:.1f}%)
- Store IDs: {guayas_store_nbrs}
- Top-3 families: {top_3_family_names}
- Items covered: {top_3_families['item_count'].sum():,} of {len(df_items)} ({top_3_families['item_count'].sum()/len(df_items)*100:.1f}%)

DATA QUALITY NOTES:
- onpromotion missing: {int(missing_counts['onpromotion']):,} rows ({missing_pct['onpromotion']:.2f}%)
- Decision: Fill with False (assume no promotion)
- train.csv size: 479 MB (requires Dask)
- Negative unit_sales: To be investigated in full dataset

NEXT STEPS (Day 2):
1. Filter train.csv to Guayas stores only
2. Filter to top-3 families only
3. Random sample 300K rows for development
4. Begin EDA on filtered dataset
"""

print(summary_text)

# Save text summary
with open(DOCS / 'inventory_summary.txt', 'w', encoding='utf-8') as f:
    f.write(summary_text)

print(f"\nOK - Text summary also saved to: {(DOCS / 'inventory_summary.txt').resolve()}")

In [None]:
# Notebook completion summary
print("=" * 70)
print("NOTEBOOK COMPLETE: w01_d01_SETUP_data_inventory.ipynb")
print("=" * 70)

print("\nACCOMPLISHMENTS:")
print("✓ All 5 support files loaded and validated")
print("✓ train.csv inspected with Dask (125,497,040 rows)")
print("✓ Guayas region scope defined (11 stores, 20.4% of total)")
print("✓ Top-3 families identified (2,393 items, 58.4% of total)")
print("✓ Inventory summary exported to docs/")

print("\nFILES CREATED:")
print(f"  - {(DOCS / 'inventory_summary.csv').resolve()}")
print(f"  - {(DOCS / 'inventory_summary.txt').resolve()}")

print("\nKEY FINDINGS:")
print(f"  - Dataset: 125M rows, 6 columns, 479 MB")
print(f"  - Scope filter: 11 stores × 2,393 items = 26,323 store-item combinations")
print(f"  - Missing data: onpromotion 17.26% (will fill with False)")
print(f"  - Oil data: 43 missing values (3.5%)")

print("\nNEXT STEPS (Day 2):")
print("  1. Filter train.csv to Guayas stores")
print("  2. Filter to top-3 families")
print("  3. Random sample 300K rows")
print("  4. Export guayas_sample_300k.csv")

print("\nREADY FOR DAY 2 ✓")