## 1. Setup & Environment Configuration

**Objective:** Import required libraries, configure paths, validate environment

**Activities:**
- Import pandas, numpy, dask for data manipulation
- Define path constants for data/raw/ and docs/
- Test imports and display versions
- Configure warnings and display settings

**Expected output:** Confirmation that environment is ready

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import dask
import dask.dataframe as dd
from pathlib import Path
import warnings

# Package versions
print("Package Versions:")
print(f"  pandas: {pd.__version__}")
print(f"  numpy: {np.__version__}")
print(f"  dask: {dask.__version__}")

Package Versions:
  pandas: 2.1.4
  numpy: 1.26.4
  dask: 2025.11.0


In [2]:
# Configure environment
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

print("OK - Display settings configured")

OK - Display settings configured
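
`warnings.filterwarnings('ignore')` in the cell above silences every warning, including ones that could point to real data problems. A narrower filter is sketched below as one possible alternative. The helper name `configure_warnings` and the choice of categories to hide are assumptions for illustration, not part of the original notebook.

```python
import warnings


def configure_warnings(hidden=(FutureWarning, DeprecationWarning)):
    """Hide only the given warning categories (hypothetical helper)."""
    for category in hidden:
        # filterwarnings() places the new rule ahead of existing rules by default
        warnings.filterwarnings("ignore", category=category)


configure_warnings()
```

With this setup, a `UserWarning` raised by a library would still be displayed, while deprecation notices stay hidden.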


In [None]:
# Determine current directory (works in both scripts and notebooks)
current_dir = Path(__file__).parent if '__file__' in globals() else Path.cwd()
project_root = current_dir.parent if current_dir.name == 'notebooks' else current_dir
print(f"Project root: {project_root.resolve()}")

# Define path constants relative to project root
DATA_RAW = project_root / 'data' / 'raw'
DOCS = project_root / 'docs'

# Verify paths exist
assert DATA_RAW.exists(), f"ERROR - Path not found: {DATA_RAW}"
assert DOCS.exists(), f"ERROR - Path not found: {DOCS}"

print("OK - Paths validated:")
print(f"  DATA_RAW: {DATA_RAW.resolve()}")
print(f"  DOCS: {DOCS.resolve()}")


Project root: C:\Users\adiaz\OneDrive\Dokumente\PythonScripts\MasterClass\Demand-forecasting-in-retail
OK - Paths validated:
  DATA_RAW: C:\Users\adiaz\OneDrive\Dokumente\PythonScripts\MasterClass\Demand-forecasting-in-retail\data\raw
  DOCS: C:\Users\adiaz\OneDrive\Dokumente\PythonScripts\MasterClass\Demand-forecasting-in-retail\docs
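
The root detection above depends on the notebook being started from the project root or from a folder named `notebooks`. A more general approach is to walk up the directory tree until a known marker is found. The sketch below assumes the marker is a `data/` directory; `find_project_root` is a hypothetical helper, and the demo runs on a throwaway directory tree rather than the real project.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker="data"):
    """Return the closest ancestor of `start` (or `start` itself) that contains `marker`/."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}/' found at or above {start}")


# Demo on a temporary tree: <tmp>/data/raw and <tmp>/notebooks
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp).resolve()
    (demo_root / "data" / "raw").mkdir(parents=True)
    (demo_root / "notebooks").mkdir()
    root_found = find_project_root(demo_root / "notebooks") == demo_root

print(f"Root located from notebooks/: {root_found}")
```

In the notebook, the equivalent call would be `project_root = find_project_root(Path.cwd())`.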


## 2. Load Support Files (Small CSVs)

**Objective:** Load and validate all small CSV files (stores, items, oil, holidays, transactions)

**Activities:**
- Load 5 support CSV files into pandas DataFrames
- Display shape, columns, and data types for each
- Check for missing values
- Convert date columns to datetime format
- Display first few rows for validation

**Expected output:** 
- 5 DataFrames loaded successfully
- Schema validation report
- Missing value counts per file

In [11]:
DATA_RAW / 'stores.csv'

WindowsPath('c:/Users/adiaz/OneDrive/Dokumente/PythonScripts/MasterClass/Demand-forecasting-in-retail/data/raw/stores.csv')

In [12]:
# Load stores.csv
df_stores = pd.read_csv(DATA_RAW / 'stores.csv')

print("stores.csv loaded:")
print(f"  Shape: {df_stores.shape}")
print(f"  Columns: {list(df_stores.columns)}")
print(f"  Missing values: {df_stores.isnull().sum().sum()}")
print("\nFirst 3 rows:")
print(df_stores.head(3))

stores.csv loaded:
  Shape: (54, 5)
  Columns: ['store_nbr', 'city', 'state', 'type', 'cluster']
  Missing values: 0

First 3 rows:
   store_nbr   city      state type  cluster
0          1  Quito  Pichincha    D       13
1          2  Quito  Pichincha    D       13
2          3  Quito  Pichincha    D        8


In [13]:
# Load items.csv
df_items = pd.read_csv(DATA_RAW / 'items.csv')

print("items.csv loaded:")
print(f"  Shape: {df_items.shape}")
print(f"  Columns: {list(df_items.columns)}")
print(f"  Missing values: {df_items.isnull().sum().sum()}")
print("\nFirst 3 rows:")
print(df_items.head(3))

items.csv loaded:
  Shape: (4100, 4)
  Columns: ['item_nbr', 'family', 'class', 'perishable']
  Missing values: 0

First 3 rows:
   item_nbr     family  class  perishable
0     96995  GROCERY I   1093           0
1     99197  GROCERY I   1067           0
2    103501   CLEANING   3008           0


In [14]:
# Load oil.csv
df_oil = pd.read_csv(DATA_RAW / 'oil.csv')

print("oil.csv loaded:")
print(f"  Shape: {df_oil.shape}")
print(f"  Columns: {list(df_oil.columns)}")
print(f"  Missing values: {df_oil.isnull().sum().sum()}")
print("\nData types:")
print(df_oil.dtypes)
print("\nFirst 3 rows:")
print(df_oil.head(3))

oil.csv loaded:
  Shape: (1218, 2)
  Columns: ['date', 'dcoilwtico']
  Missing values: 43

Data types:
date           object
dcoilwtico    float64
dtype: object

First 3 rows:
         date  dcoilwtico
0  2013-01-01         NaN
1  2013-01-02       93.14
2  2013-01-03       92.97


In [15]:
# Load holidays_events.csv
df_holidays = pd.read_csv(DATA_RAW / 'holidays_events.csv')

print("holidays_events.csv loaded:")
print(f"  Shape: {df_holidays.shape}")
print(f"  Columns: {list(df_holidays.columns)}")
print(f"  Missing values: {df_holidays.isnull().sum().sum()}")
print("\nData types:")
print(df_holidays.dtypes)
print("\nFirst 3 rows:")
print(df_holidays.head(3))

holidays_events.csv loaded:
  Shape: (350, 6)
  Columns: ['date', 'type', 'locale', 'locale_name', 'description', 'transferred']
  Missing values: 0

Data types:
date           object
type           object
locale         object
locale_name    object
description    object
transferred      bool
dtype: object

First 3 rows:
         date     type    locale locale_name                    description  \
0  2012-03-02  Holiday     Local       Manta             Fundacion de Manta   
1  2012-04-01  Holiday  Regional    Cotopaxi  Provincializacion de Cotopaxi   
2  2012-04-12  Holiday     Local      Cuenca            Fundacion de Cuenca   

   transferred  
0        False  
1        False  
2        False  


In [16]:
# Load transactions.csv
df_transactions = pd.read_csv(DATA_RAW / 'transactions.csv')

print("transactions.csv loaded:")
print(f"  Shape: {df_transactions.shape}")
print(f"  Columns: {list(df_transactions.columns)}")
print(f"  Missing values: {df_transactions.isnull().sum().sum()}")
print("\nData types:")
print(df_transactions.dtypes)
print("\nFirst 3 rows:")
print(df_transactions.head(3))

transactions.csv loaded:
  Shape: (83488, 3)
  Columns: ['date', 'store_nbr', 'transactions']
  Missing values: 0

Data types:
date            object
store_nbr        int64
transactions     int64
dtype: object

First 3 rows:
         date  store_nbr  transactions
0  2013-01-01         25           770
1  2013-01-02          1          2111
2  2013-01-02          2          2358
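
The five load cells above repeat the same read-and-report steps. They could be folded into one helper, sketched below. `load_and_report` is a hypothetical name, and the demo reads two rows copied from the oil.csv preview through `io.StringIO` so the block runs without the real files.

```python
import io

import pandas as pd


def load_and_report(source, name):
    """Read a CSV and print the same report as the cells above (hypothetical helper)."""
    df = pd.read_csv(source)
    print(f"{name} loaded:")
    print(f"  Shape: {df.shape}")
    print(f"  Columns: {list(df.columns)}")
    print(f"  Missing values: {int(df.isnull().sum().sum())}")
    return df


# Stand-in for a real file: the first two rows shown in the oil.csv preview
demo_csv = io.StringIO("date,dcoilwtico\n2013-01-01,\n2013-01-02,93.14\n")
df_oil_demo = load_and_report(demo_csv, "oil_demo.csv")
```

In the notebook, a call such as `df_oil = load_and_report(DATA_RAW / 'oil.csv', 'oil.csv')` would replace the corresponding cell.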


In [17]:
# Create summary dictionary
support_files_summary = {
    'File': ['stores.csv', 'items.csv', 'oil.csv', 'holidays_events.csv', 'transactions.csv'],
    'Rows': [len(df_stores), len(df_items), len(df_oil), len(df_holidays), len(df_transactions)],
    'Columns': [df_stores.shape[1], df_items.shape[1], df_oil.shape[1], df_holidays.shape[1], df_transactions.shape[1]],
    'Missing': [int(df.isnull().sum().sum()) for df in (df_stores, df_items, df_oil, df_holidays, df_transactions)]
}

df_summary = pd.DataFrame(support_files_summary)
print("Support Files Summary:")
print(df_summary.to_string(index=False))
print("\nOK - All 5 support files loaded successfully")

Support Files Summary:
               File  Rows  Columns  Missing
         stores.csv    54        5        0
          items.csv  4100        4        0
            oil.csv  1218        2       43
holidays_events.csv   350        6        0
   transactions.csv 83488        3        0

OK - All 5 support files loaded successfully
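
The activity list for this section includes converting date columns to datetime, but the outputs above show `date` still stored as `object` in oil.csv, holidays_events.csv and transactions.csv. One way to do the conversion is sketched below on a small stand-in frame. `df_tx_demo` is a hypothetical name, and the explicit `format` assumes every date follows the `YYYY-MM-DD` pattern seen in the previews.

```python
import pandas as pd

# Stand-in rows copied from the transactions.csv preview above
df_tx_demo = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "2013-01-02"],
    "store_nbr": [25, 1, 2],
    "transactions": [770, 2111, 2358],
})

# With the default errors="raise", a value that does not match the format stops the conversion
df_tx_demo["date"] = pd.to_datetime(df_tx_demo["date"], format="%Y-%m-%d")

print(df_tx_demo.dtypes)
print(f"Date range: {df_tx_demo['date'].min().date()} to {df_tx_demo['date'].max().date()}")
```

The same call could be applied to `df_oil`, `df_holidays` and `df_transactions`. Passing `parse_dates=['date']` to `pd.read_csv` at load time is another option.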


## 3. Inspect train.csv with Dask

**Objective:** Open the large train.csv file with Dask and inspect its structure without loading the full dataset into memory

**Activities:**
- Use Dask to read train.csv (479 MB file)
- Display schema and estimated row count
- Check for missing values
- Sample first rows for validation
- Document file characteristics

**Expected output:** 
- Train data structure confirmed
- Missing value percentages calculated
- Memory-efficient inspection complete

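**Note:** train.csv has about 125 million rows, too many to load comfortably into memory with pandas. Dask reads the file lazily in partitions, so only the data needed for each computation is held in memory at once.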

In [18]:
# Load train.csv with Dask (lazy evaluation)
print("Loading train.csv with Dask (this may take a moment)...")
df_train = dd.read_csv(DATA_RAW / 'train.csv')

print("OK - train.csv loaded (Dask DataFrame)")
print(f"\nColumns: {list(df_train.columns)}")
print("Data types:")
print(df_train.dtypes)

Loading train.csv with Dask (this may take a moment)...
OK - train.csv loaded (Dask DataFrame)

Columns: ['id', 'date', 'store_nbr', 'item_nbr', 'unit_sales', 'onpromotion']
Data types:
id                       int64
date           string[pyarrow]
store_nbr                int64
item_nbr                 int64
unit_sales             float64
onpromotion            float64
dtype: object


In [19]:
# Compute actual row count (triggers computation)
print("Computing row count (this will take time - processing 125M rows)...")
train_length = len(df_train)
print(f"OK - Total rows: {train_length:,}")

# Check missing values per column
print("\nComputing missing values per column...")
missing_counts = df_train.isnull().sum().compute()
missing_pct = (missing_counts / train_length * 100).round(2)

print("\nMissing Values:")
for col in df_train.columns:
    count = missing_counts[col]
    pct = missing_pct[col]
    print(f"  {col:<15} {count:>12,} ({pct:>6}%)")

Computing row count (this will take time - processing 125M rows)...
OK - Total rows: 125,497,040

Computing missing values per column...

Missing Values:
  id                         0 (   0.0%)
  date                       0 (   0.0%)
  store_nbr                  0 (   0.0%)
  item_nbr                   0 (   0.0%)
  unit_sales                 0 (   0.0%)
  onpromotion       21,657,651 ( 17.26%)


In [None]:
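# Take the first 1000 rows to inspect the data (the head of the file, not a random sample).
# head() reads only the first partition by default; npartitions=-1 would process every partition.
print("Sampling first 1000 rows...")
df_train_sample = df_train.head(1000)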

print("\nFirst 5 rows:")
print(df_train_sample.head())

print("\nBasic statistics for unit_sales:")
print(df_train_sample['unit_sales'].describe())

# Check for negative values
negative_count = (df_train_sample['unit_sales'] < 0).sum()
print(f"\nNegative unit_sales in sample: {negative_count} rows")

Sampling first 1000 rows...
