## Data Preparation for Linear Regression
This notebook outlines the data preparation steps for a linear regression analysis. It organizes the workflow into different tiers:

### Bronze Tier
The Bronze Tier contains the raw tables, stored as they were ingested.

### Silver Tier
The Silver Tier contains curated, cleaned, and joined data.

#### Table `encoded_train_df`:  
Numerical variables: `total_daily_sales`, `days_since_earliest_date`, `transactions`, `onpromotion`.  

Categorical variables: `store_nbr`, `city`, `state`, `type`, `cluster`, `day_of_week`, `day_of_month`, `month`, `year`.  
All categorical variables are one-hot encoded using the prefix `is_<varname>_`, which together with the default `_` separator yields column names such as `is_state__Ohio` (note the double underscore).


This tier is designed for analysis and modeling, providing a structured dataset with relevant features for linear regression.
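
The double underscore comes from how the prefix and the separator combine. The following is a minimal sketch, assuming the encoding is done with `pandas.get_dummies` and its default `prefix_sep='_'`; the two-row frame `toy_df` is made up for illustration and is not part of the dataset.

```python
import pandas as pd

# Hypothetical toy frame; the real table has many more columns.
toy_df = pd.DataFrame({"state": ["Ohio", "Texas"], "transactions": [10, 20]})

# The prefix already ends with "_", and get_dummies adds its own "_" separator,
# so the resulting column names contain a double underscore.
encoded = pd.get_dummies(toy_df, columns=["state"], prefix="is_state_", dtype=int)

print(list(encoded.columns))
```

Columns that are not encoded keep their original names, so only the `is_` columns carry the double underscore.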

In [0]:
# imports
import os
from datetime import datetime

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql import DataFrame as SparkDataFrame
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline

In [0]:
VOLUME_ROOT_PATH = "/Volumes/cscie103_catalog/final_project/data"
# raw data
VOLUME_BRONZE_DIR = f"{VOLUME_ROOT_PATH}/bronze"
# place where prepared data is written
VOLUME_SILVER_DIR = f"{VOLUME_ROOT_PATH}/silver"

# ensure all paths exist
for path in [VOLUME_BRONZE_DIR, VOLUME_SILVER_DIR]:
    # exist_ok=True makes this a no-op when the directory is already there
    os.makedirs(path, exist_ok=True)

In [0]:
# load the data from local volumes
bronze_filenames = {
    'holidays_events': 'holidays_v1',
    'oil': 'oil',
    'sample_submission': 'sample_submission',
    'stores': 'stores',
    'test': 'test',
    'train': 'train',
    'transactions': 'transactions'
}
silver_filenames = {
    'holidays_events': 'holidays',
    'oil': 'oil',
    'sample_submission': 'sample_submission',
    'stores': 'stores',
    'test': 'test',
    'train': 'train',
    'transactions': 'transactions'
}

# read from Bronze Tier
bstores_df = spark.read.parquet(f"{VOLUME_BRONZE_DIR}/stores")
btransactions_df = spark.read.parquet(f"{VOLUME_BRONZE_DIR}/transactions")
btrain_df = spark.read.parquet(f"{VOLUME_BRONZE_DIR}/train")
btest_df = spark.read.parquet(f"{VOLUME_BRONZE_DIR}/test")
bholidays_events_df = spark.read.parquet(f"{VOLUME_BRONZE_DIR}/holidays")
# oil_df = spark.read.parquet(f"{VOLUME_BRONZE_DIR}/oil")

## Silver Tier

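Produce and persist a table with the following schema:

| Column | Type |
|---|---|
| `store_nbr` | int |
| `date` | date |
| `id` | int |
| `family` | string |
| `sales` | double |
| `onpromotion` | int |
| `transactions` | int |
| `city` | string |
| `state` | string |
| `type` | string |
| `cluster` | int |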
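
A schema like the one listed above can be checked mechanically before the table is persisted. The sketch below is one possible way to do that, not part of the original notebook: `EXPECTED_SCHEMA` and `schema_mismatches` are hypothetical names, and the input is assumed to be a list of `(name, type)` pairs, which is the shape of a Spark DataFrame's `.dtypes` attribute.

```python
# Expected Silver schema, copied from the listing above.
EXPECTED_SCHEMA = {
    "store_nbr": "int", "date": "date", "id": "int", "family": "string",
    "sales": "double", "onpromotion": "int", "transactions": "int",
    "city": "string", "state": "string", "type": "string", "cluster": "int",
}

def schema_mismatches(actual_dtypes, expected=EXPECTED_SCHEMA):
    """Return {column: (expected_type, actual_type)} for every difference.

    A type of None means the column is missing on that side.
    """
    actual = dict(actual_dtypes)
    problems = {}
    for name in sorted(set(expected) | set(actual)):
        if expected.get(name) != actual.get(name):
            problems[name] = (expected.get(name), actual.get(name))
    return problems

# Hand-written example: `cluster` is missing and `transactions` came back as bigint.
example_dtypes = [
    (name, "bigint" if name == "transactions" else dtype)
    for name, dtype in EXPECTED_SCHEMA.items()
    if name != "cluster"
]
print(schema_mismatches(example_dtypes))
```

In the notebook this could be called with the `.dtypes` of the joined training dataframe; an empty result would mean the schema matches.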

In [0]:
def smart_na_drop(df):
    """
    Drops every row that has a null value in any column and prints how many rows were removed.
    """
    before = df.count()
    df = df.dropna()
    after = df.count()
    print(f"dropped {before - after} rows")
    return df

In [0]:
btrain_df = smart_na_drop(btrain_df)
btransactions_df = smart_na_drop(btransactions_df)
bstores_df = smart_na_drop(bstores_df)
bholidays_events_df = smart_na_drop(bholidays_events_df)

### Holidays

In [0]:
bholidays_events_df.printSchema()
display(bholidays_events_df)

In [0]:
def rows_to_value(df, name):
    return [ row[name] for row in df ]

display(rows_to_value(bholidays_events_df.select('locale_name').distinct().collect(), 'locale_name'))
bholidays_events_df.select('locale').distinct().show()
bholidays_events_df.select('type').distinct().show()

In [0]:
# Preparation of holidays data (holidays_events_df):
# 1. Drop transferred holidays, i.e. rows that were moved to another date.
#    Identifiable by 'type' = 'Transfer'
# 2. Explode nationwide holidays to one row per state, identifiable by 'locale_name' = 'Ecuador'
# 3. Deduplicate dates. This assumes that all remaining holiday types are actual holidays.
# 4. Construct a new dataframe with 3 columns: 'date', 'state', 'is_holiday' from the holidays df
# 5. Write to Silver tier

sholidays_events_df = bholidays_events_df

# 1. Drop transferred holidays (rows with 'type' = 'Transfer').
#    The filter must use the 'type' column; 'locale_name' never holds the value 'Transfer'.
sholidays_events_df = sholidays_events_df.where(F.col('type') != 'Transfer')

# 2. Explode nationwide holidays to one row per state, identifiable by 'locale_name' = 'Ecuador'
# list of states is provided by the stores_df
ecuador_states = [ row['state'] for row in bstores_df.select('state').distinct().collect()]

# add array with all the states to 'Ecuador' rows
sholidays_events_df = sholidays_events_df.withColumn(
    'locale_name_array',
    F.when(
        F.col('locale_name') == 'Ecuador',
        F.array([ F.lit(s) for s in ecuador_states ])
    ).otherwise(
        F.array(F.col('locale_name'))
    )
)
# Explode the array into one row per state and
# 4. Construct a new dataframe with 3 columns: 'date', 'state', 'is_holiday'
sholidays_events_df = sholidays_events_df.select(
    'date',
    F.explode('locale_name_array').alias('state'),
    F.lit(1).alias('is_holiday') 
)

# 3. Deduplicate rows, keeping one row per (date, state) pair
sholidays_events_df = sholidays_events_df.dropDuplicates(['date', 'state'])

# 5. Write to Silver tier
sholidays_events_df.write.mode('overwrite').parquet(f"{VOLUME_SILVER_DIR}/{silver_filenames.get('holidays_events')}")

In [0]:
sholidays_events_df.printSchema()
display(sholidays_events_df)
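
The explode and deduplicate steps can be illustrated on a small in-memory table. This is a pandas sketch rather than the Spark code used above; the rows in `toy_holidays` are made up, and `toy_states` is a hypothetical stand-in for the list of states taken from the stores table.

```python
import pandas as pd

# Made-up holiday rows: one nationwide entry and a duplicated regional entry.
toy_holidays = pd.DataFrame({
    "date": ["2013-01-01", "2013-04-01", "2013-04-01"],
    "locale_name": ["Ecuador", "Cotopaxi", "Cotopaxi"],
})
toy_states = ["Cotopaxi", "Guayas"]  # hypothetical list of states

# Nationwide rows get the full state list; every other row keeps its own locale name.
toy_holidays["state"] = toy_holidays["locale_name"].apply(
    lambda name: toy_states if name == "Ecuador" else [name]
)

# One row per (date, state), flagged as a holiday.
exploded = (
    toy_holidays.explode("state")[["date", "state"]]
    .drop_duplicates()
    .assign(is_holiday=1)
    .reset_index(drop=True)
)
print(exploded)
```

The nationwide row fans out into one row per state, while the duplicated regional row collapses into a single row.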

### Transactions

In [0]:
btransactions_df.printSchema()
display(btransactions_df)

In [0]:
btransactions_df.write.mode('overwrite').parquet(f"{VOLUME_SILVER_DIR}/{silver_filenames.get('transactions')}")
stransactions_df = spark.read.parquet(f"{VOLUME_SILVER_DIR}/{silver_filenames.get('transactions')}")

### Stores

In [0]:
bstores_df.printSchema()
display(bstores_df)

In [0]:
# 1. One hot encode categorical columns: city, state, type, cluster
bstores_categorical_cols = ['city', 'state', 'type', 'cluster']

bstores_indexers = [
    StringIndexer(inputCol=col, outputCol=col + '_idx', handleInvalid='keep')
    for col in bstores_categorical_cols
]

bstores_encoders = [
    OneHotEncoder(
        inputCols=[col + '_idx'], 
        outputCols=['is_' + col],
    )
    for col in bstores_categorical_cols
]

bstores_pipeline = Pipeline(stages=bstores_indexers + bstores_encoders)
bstores_pipeline_model = bstores_pipeline.fit(bstores_df)
bstores_df_encoded = bstores_pipeline_model.transform(bstores_df)
display(bstores_df_encoded)

bstores_ohe_cols = ['is_' + col for col in bstores_categorical_cols]
# Keep the original columns plus the one-hot vectors; this leaves out the temporary '_idx' columns
bstores_df_cols_to_select = bstores_df.columns + bstores_ohe_cols

sstores_df = bstores_df_encoded.select(bstores_df_cols_to_select)

display(sstores_df)

In [0]:
# drop original categorical cols
sstores_df = sstores_df.drop(*bstores_categorical_cols)

display(sstores_df)

### Train

## Other

In [0]:
%skip
# This shows the rows which are dropped in the cell below after merging with transactions
test_df = btrain_df
test_df = test_df.join(btransactions_df, on=['date', 'store_nbr'], how='left')
test_df = test_df.withColumn(
    'transactions',
    F.when(F.col('sales') == 0, 0).otherwise(F.col('transactions'))
)
# show only rows where nulls are present
test_df.where(F.col('transactions').isNull()).show()

In [0]:
# 1. Merge with transactions data
#       .a Fill transactions as 0 when sales is 0
#       .b Drop rows where any column is null
# 2. Merge with stores_df

strain_df = btrain_df

# 1. Merge with transactions data
strain_df = strain_df.join(btransactions_df, on=['date', 'store_nbr'], how='left')
strain_df = strain_df.withColumn(
    'transactions',
    F.when(F.col('sales') == 0, 0).otherwise(F.col('transactions'))
)
strain_df = smart_na_drop(strain_df) # expected to drop 3248 rows

# 2. Merge with stores_df
strain_df = strain_df.join(bstores_df, ['store_nbr'], how='left')
strain_df = smart_na_drop(strain_df) # expected to drop 0 rows
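
The fill-then-drop rule for `transactions` can be seen on a few rows. The following pandas sketch mirrors the Spark logic above under the assumption that a missing transaction count is only acceptable when `sales` is 0; the frames `toy_sales` and `toy_transactions` are invented for illustration.

```python
import pandas as pd

toy_sales = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-01", "2013-01-02"],
    "store_nbr": [1, 2, 1],
    "sales": [0.0, 5.0, 7.0],
})
toy_transactions = pd.DataFrame({
    "date": ["2013-01-02"], "store_nbr": [1], "transactions": [40],
})

# Left join keeps every sales row, with missing values where no transaction count exists.
merged = toy_sales.merge(toy_transactions, on=["date", "store_nbr"], how="left")

# Zero sales implies zero transactions; other rows keep the joined value (possibly missing).
merged["transactions"] = merged["transactions"].where(merged["sales"] != 0, 0)

# Rows that still lack a transaction count have non-zero sales and are dropped.
cleaned = merged.dropna()
print(f"dropped {len(merged) - len(cleaned)} rows")
```

Only the row with non-zero sales and no matching transaction record is removed, which is the kind of row the skipped cell above is meant to surface.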

In [0]:
# Schema of strain_df
strain_df.printSchema()

strain_df.display()

In [0]:
# 1. Add columns day_of_week, day_of_month, month, year
# 2. Add column days_since_earliest_date as number of days since earliest date
# 3. Drop date column

# 1. Add columns day_of_week, day_of_month, month, year
strain_df = strain_df.withColumn('day_of_week', F.dayofweek(F.col('date')))
strain_df = strain_df.withColumn('day_of_month', F.dayofmonth(F.col('date')))
strain_df = strain_df.withColumn('month', F.month(F.col('date')))
strain_df = strain_df.withColumn('year', F.year(F.col('date')))

# 2. Add column days_since_earliest_date
earliest_date = strain_df.select(F.min('date')).collect()[0][0] # Normally is 2013-01-01
strain_df = strain_df.withColumn(
    'days_since_earliest_date',
    F.datediff(F.col('date'), F.lit(earliest_date))
)

# 3. Drop date column
strain_df = strain_df.drop('date')
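
One detail worth keeping in mind when reading the encoded columns: Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), which differs from Python's and pandas' Monday-based conventions. The sketch below reproduces the derived date features for a single date using only the standard library; `spark_day_of_week` and `date_features` are hypothetical helper names, not functions from the notebook.

```python
from datetime import date

def spark_day_of_week(d: date) -> int:
    """Mimic Spark's dayofweek numbering: 1 = Sunday, ..., 7 = Saturday."""
    return d.isoweekday() % 7 + 1

def date_features(d: date, earliest: date) -> dict:
    """Build the same date-derived features as the cell above for one date."""
    return {
        "day_of_week": spark_day_of_week(d),
        "day_of_month": d.day,
        "month": d.month,
        "year": d.year,
        "days_since_earliest_date": (d - earliest).days,
    }

earliest = date(2013, 1, 1)  # the earliest date noted in the cell above
print(date_features(date(2013, 1, 6), earliest))
```

Here 2013-01-06 is a Sunday, so `day_of_week` is 1 and `days_since_earliest_date` is 5.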

In [0]:
# Preparation of the data for Linear Regression
# 1. Convert strain_df to a pandas dataframe
# 2. One-hot encode all categorical variables:
#    store_nbr, city, state, type, cluster, day_of_week, day_of_month, month, year

# 1. Convert strain_df to a pandas dataframe (this collects all rows to the driver)
setrain_df = strain_df.toPandas()

# 2. One-hot encode all categorical variables
for colname in ['store_nbr', 'city', 'state', 'type', 'cluster', 'day_of_week', 'day_of_month', 'month', 'year']:
    setrain_df = pd.get_dummies(
        setrain_df,
        columns=[colname],
        dtype=int,
        prefix=f'is_{colname}_'
    )
display(setrain_df)

In [0]:
# 1. Convert setrain_df back to spark dataframe
# 2. Write setrain_df to silver table

# 1. Convert setrain_df back to spark dataframe
setrain_df = spark.createDataFrame(setrain_df)

# 2. Write setrain_df to silver table
setrain_df.write.mode("overwrite").parquet(f"{VOLUME_SILVER_DIR}/encoded_train_df")