# Data Preparation Complete - Walmart Sales Forecast

This notebook carries out all of the data preparation steps described in the file "[Các bước làm tham khảo tiếp theo]".

## Structure:

**PHASE 1: DATA PREPARATION**
- 1.1. Create df_main_weekly
- 1.2. Create df_events_daily
- 1.3. Create df_feature_calendar_weekly

**PHASE 2: FEATURE ENGINEERING**
- 2.1. Merge & Check
- 2.2. Create "Payday Pulse" Features
- 2.3. Create "Holiday" Features
- 2.4. Create "Lag/Rolling" Features
- 2.5. Create "Interaction" Features

**PHASE 3: SAVE THE OUTPUT FILES**

---


## 0. Setup and Import Libraries


In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
import warnings
import os
from datetime import datetime, timedelta

warnings.filterwarnings('ignore')

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

print("✅ Libraries imported successfully!")


✅ Libraries imported successfully!


### Helper Functions

Define the helper functions (taken from the file `data_prep_utils.py`).


In [2]:
def get_us_holidays(year):
    """Return the US holidays for a year as {'YYYY-MM-DD': (name, impact)}.

    Floating holidays are hard-coded for 2010-2012 only.
    """
    holidays = {}
    
    holidays[f'{year}-01-01'] = ('New Years Day', 1)
    
    # Super Bowl
    super_bowl_dates = {2010: '2010-02-07', 2011: '2011-02-06', 2012: '2012-02-05'}
    if year in super_bowl_dates:
        holidays[super_bowl_dates[year]] = ('Super Bowl', 3)
    
    # Presidents Day
    presidents_day_dates = {2010: '2010-02-15', 2011: '2011-02-21', 2012: '2012-02-20'}
    if year in presidents_day_dates:
        holidays[presidents_day_dates[year]] = ('Presidents Day', 1)
    
    # Memorial Day
    memorial_day_dates = {2010: '2010-05-31', 2011: '2011-05-30', 2012: '2012-05-28'}
    if year in memorial_day_dates:
        holidays[memorial_day_dates[year]] = ('Memorial Day', 1)
    
    holidays[f'{year}-07-04'] = ('Independence Day', 1)
    
    # Labor Day
    labor_day_dates = {2010: '2010-09-06', 2011: '2011-09-05', 2012: '2012-09-03'}
    if year in labor_day_dates:
        holidays[labor_day_dates[year]] = ('Labor Day', 3)
    
    # Thanksgiving
    thanksgiving_dates = {2010: '2010-11-25', 2011: '2011-11-24', 2012: '2012-11-22'}
    if year in thanksgiving_dates:
        holidays[thanksgiving_dates[year]] = ('Thanksgiving', 5)
    
    holidays[f'{year}-12-25'] = ('Christmas', 5)
    holidays[f'{year}-12-24'] = ('Christmas Eve', 3)
    
    return holidays


def get_week_end_date(date):
    """Return the WeekEndDate (the Friday that closes the Saturday-to-Friday week) for a date."""
    # weekday(): 0=Monday, 4=Friday, 6=Sunday.
    # (4 - weekday) % 7 is 0 for Friday, 6 for Saturday, 5 for Sunday and 4..1 for Monday..Thursday.
    return date + timedelta(days=(4 - date.weekday()) % 7)


def is_tax_refund_season(date):
    """Return 1 if the date falls in tax refund season (Feb 15 to Apr 15), else 0."""
    month = date.month
    day = date.day
    if month == 2 and day >= 15:
        return 1
    elif month == 3:
        return 1
    elif month == 4 and day <= 15:
        return 1
    return 0


def calculate_weeks_since_payday(group):
    """Return the number of weeks since the most recent payday week within one (Store, Dept) group."""
    weeks_since = []
    last_payday_week = None
    
    for idx, row in group.iterrows():
        if row['is_semimonthly_payweek'] == 1:
            last_payday_week = row['WeekEndDate']
            weeks_since.append(0)
        elif last_payday_week is not None:
            weeks_diff = (row['WeekEndDate'] - last_payday_week).days // 7
            weeks_since.append(weeks_diff)
        else:
            weeks_since.append(np.nan)
    
    return pd.Series(weeks_since, index=group.index)


def piecewise_decay(weeks):
    """Return the payday decay value from a piecewise function of weeks since payday."""
    if weeks == 0:
        return 1.0
    elif weeks == 1:
        return 0.7
    elif weeks >= 2:
        return 0.4
    else:
        return 0.0


def get_christmas_date(year):
    """Return Christmas Day for the given year."""
    return pd.Timestamp(f'{year}-12-25')


def get_thanksgiving_date(year):
    """Return Thanksgiving (the fourth Thursday of November) for the given year."""
    nov_1 = pd.Timestamp(f'{year}-11-01')
    # Offset to the first Thursday (weekday 3); the result always falls on Nov 1-7
    first_thursday = nov_1 + timedelta(days=(3 - nov_1.weekday()) % 7)
    thanksgiving = first_thursday + timedelta(days=21)
    return thanksgiving


def calculate_weeks_until_holiday(date, holiday_func):
    """Return the number of whole weeks until the next occurrence of a holiday."""
    year = date.year
    holiday_date = holiday_func(year)
    
    # If the holiday has already passed this year, use next year's date
    if date > holiday_date:
        holiday_date = holiday_func(year + 1)
    
    weeks_diff = (holiday_date - date).days // 7
    return weeks_diff

print("✅ Helper functions defined!")


✅ Helper functions defined!


In [3]:
# Configure paths
DATA_PATH = '../data/'
PROCESSED_PATH = '../data/processed/'

print(f"📁 Data path: {DATA_PATH}")
print(f"📁 Processed path: {PROCESSED_PATH}")


📁 Data path: ../data/
📁 Processed path: ../data/processed/


---

# PHASE 1: DATA PREPARATION

## 1.1. Create df_main_weekly

**Objectives:**
- Merge 3 files: train.csv, stores.csv, features.csv
- Process the MarkDown columns (fillna, derived features)
- Handle negative Weekly_Sales (returns)
- Validate the data


In [4]:
# Load the datasets
print("🔄 Loading datasets...")
train_df = pd.read_csv(DATA_PATH + 'train.csv')
stores_df = pd.read_csv(DATA_PATH + 'stores.csv')
features_df = pd.read_csv(DATA_PATH + 'features.csv')

print(f"📈 Train data shape: {train_df.shape}")
print(f"🏪 Stores data shape: {stores_df.shape}")
print(f"🌡️ Features data shape: {features_df.shape}")


🔄 Loading datasets...
📈 Train data shape: (421570, 5)
🏪 Stores data shape: (45, 3)
🌡️ Features data shape: (8190, 12)


In [5]:
# Merge the 3 files
print("🔄 Merging datasets...")
df_main = pd.merge(train_df, stores_df, on='Store', how='left')
df_main = pd.merge(df_main, features_df, on=['Store', 'Date'], how='left', suffixes=('', '_features'))

# Drop the duplicate IsHoliday column brought in from features.csv
if 'IsHoliday_features' in df_main.columns:
    df_main = df_main.drop(columns=['IsHoliday_features'])

print(f"✅ Merged data shape: {df_main.shape}")
print("\nSample data:")
df_main.head()


🔄 Merging datasets...
✅ Merged data shape: (421570, 16)

Sample data:


Unnamed: 0,Store,Dept,Date,Weekly_Sales,IsHoliday,Type,Size,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment
0,1,1,2010-02-05,24924.5,False,A,151315,42.31,2.572,,,,,,211.096358,8.106
1,1,1,2010-02-12,46039.49,True,A,151315,38.51,2.548,,,,,,211.24217,8.106
2,1,1,2010-02-19,41595.55,False,A,151315,39.93,2.514,,,,,,211.289143,8.106
3,1,1,2010-02-26,19403.54,False,A,151315,46.63,2.561,,,,,,211.319643,8.106
4,1,1,2010-03-05,21827.9,False,A,151315,46.5,2.625,,,,,,211.350143,8.106
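The two left merges above only preserve the row count if `stores.csv` has one row per `Store` and `features.csv` has one row per (`Store`, `Date`). A minimal sketch of how that assumption could be enforced with the `validate` argument of `pd.merge`. The frames `train_toy`, `stores_toy` and `stores_dup` are made-up stand-ins, not the real files:

```python
import pandas as pd

# Toy stand-ins for train.csv and stores.csv (values are illustrative only)
train_toy = pd.DataFrame({
    'Store': [1, 1, 2],
    'Dept': [1, 2, 1],
    'Weekly_Sales': [100.0, 50.0, 75.0],
})
stores_toy = pd.DataFrame({'Store': [1, 2], 'Type': ['A', 'B']})

# validate='many_to_one' raises MergeError if 'Store' is duplicated on the right side
merged_toy = pd.merge(train_toy, stores_toy, on='Store', how='left',
                      validate='many_to_one')

# A left merge against unique keys keeps the row count unchanged
rows_preserved = len(merged_toy) == len(train_toy)

# With a duplicated key on the right, the same call fails instead of adding rows
stores_dup = pd.concat([stores_toy, stores_toy.iloc[[0]]], ignore_index=True)
try:
    pd.merge(train_toy, stores_dup, on='Store', how='left', validate='many_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```

In this notebook the merged shape (421570 rows) equals the training shape, which is consistent with unique keys on the right-hand side.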


In [6]:
# Convert Date to datetime and rename it to WeekEndDate
df_main['Date'] = pd.to_datetime(df_main['Date'])
df_main = df_main.rename(columns={'Date': 'WeekEndDate'})

print(f"📅 Time range: {df_main['WeekEndDate'].min()} to {df_main['WeekEndDate'].max()}")

# Check that every WeekEndDate is a Friday
df_main['WeekDay'] = df_main['WeekEndDate'].dt.day_name()
print("\nWeekday distribution:")
print(df_main['WeekDay'].value_counts())


📅 Time range: 2010-02-05 00:00:00 to 2012-10-26 00:00:00

Weekday distribution:
WeekDay
Friday    421570
Name: count, dtype: int64


In [7]:
# Process the MarkDown columns
markdown_cols = ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']

# Fill NA with 0
df_main[markdown_cols] = df_main[markdown_cols].fillna(0)

# md_missing_any: 1 if all 5 columns are 0 after the fill.
# Despite the name, this is an "all missing or zero" flag: once NaN has become 0,
# a missing value can no longer be told apart from a genuine zero markdown.
df_main['md_missing_any'] = ((df_main[markdown_cols] == 0).all(axis=1)).astype(int)

# md_sum: total of the 5 MarkDown columns
df_main['md_sum'] = df_main[markdown_cols].sum(axis=1)

print(f"📊 MarkDowns missing (all zeros): {df_main['md_missing_any'].sum()} records")
print("\nMarkDowns sum statistics:")
print(df_main['md_sum'].describe())


📊 MarkDowns missing (all zeros): 270138 records

MarkDowns sum statistics:
count    421570.000000
mean       6684.041435
std       14750.941552
min           0.000000
25%           0.000000
50%           0.000000
75%        8075.260000
max      160510.610000
Name: md_sum, dtype: float64
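Because the flag above is computed after `fillna(0)`, it cannot separate "no markdown recorded" from "markdown of exactly zero". A minimal sketch of one alternative, assuming it is acceptable to record missingness before filling. The column names `md_all_missing` and `md_any_missing` and the toy frame are hypothetical, not part of the notebook:

```python
import numpy as np
import pandas as pd

markdown_cols_toy = ['MarkDown1', 'MarkDown2']  # shortened list for the example
md_toy = pd.DataFrame({
    'MarkDown1': [np.nan, 0.0, 120.5],
    'MarkDown2': [np.nan, 0.0, np.nan],
})

# Record missingness BEFORE filling, so NaN and a true 0 stay distinguishable
md_toy['md_all_missing'] = md_toy[markdown_cols_toy].isna().all(axis=1).astype(int)
md_toy['md_any_missing'] = md_toy[markdown_cols_toy].isna().any(axis=1).astype(int)

# Fill and sum exactly as in the cell above
md_toy[markdown_cols_toy] = md_toy[markdown_cols_toy].fillna(0)
md_toy['md_sum'] = md_toy[markdown_cols_toy].sum(axis=1)
```

Here the second row (two genuine zeros) is not flagged as missing, whereas an all-zeros check after the fill would flag it.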


In [8]:
# Handle negative Weekly_Sales (returns)
# Create returns_flag and returns_abs before clipping
df_main['returns_flag'] = (df_main['Weekly_Sales'] < 0).astype(int)
# Magnitude of the negative sales; 0 where sales are non-negative
df_main['returns_abs'] = df_main['Weekly_Sales'].clip(upper=0).abs()

print(f"📊 Negative sales processed: {df_main['returns_flag'].sum()} records")
print(f"📊 Total returns value: ${df_main['returns_abs'].sum():,.2f}")

# Clip Weekly_Sales to >= 0
df_main['Weekly_Sales'] = df_main['Weekly_Sales'].clip(lower=0)

print("✅ All Weekly_Sales are now >= 0")


📊 Negative sales processed: 1285 records
📊 Total returns value: $88,161.56
✅ All Weekly_Sales are now >= 0


In [9]:
# Validation
print("🔍 Validating data quality...")

# Check that WeekEndDate has no NA
assert df_main['WeekEndDate'].notna().all(), "WeekEndDate has NA!"

# Check that Store and Dept have no NA
assert df_main['Store'].notna().all(), "Store has NA!"
assert df_main['Dept'].notna().all(), "Dept has NA!"

# Check that Weekly_Sales (after clipping) is non-negative
assert (df_main['Weekly_Sales'] >= 0).all(), "Weekly_Sales still negative!"

print("✅ All validations passed!")

# Keep a copy as df_main_weekly
df_main_weekly = df_main.copy()
print(f"\n📊 df_main_weekly created! Shape: {df_main_weekly.shape}")
print(f"📊 Columns: {len(df_main_weekly.columns)}")


🔍 Validating data quality...
✅ All validations passed!

📊 df_main_weekly created! Shape: (421570, 21)
📊 Columns: 21


## 1.2. Create df_events_daily

**Objectives:**
- Build a daily calendar from min_date to max_date
- Add Payday features (SNAP, semimonthly, tax refund)
- Build the US holiday event calendar (2010-2012)


In [10]:
# Build a daily calendar from min_date to max_date
min_date = df_main_weekly['WeekEndDate'].min()
max_date = df_main_weekly['WeekEndDate'].max()

date_range = pd.date_range(start=min_date, end=max_date, freq='D')
df_events_daily = pd.DataFrame({'Date': date_range})

print(f"📅 Date range: {min_date.date()} to {max_date.date()}")
print(f"📅 Total days: {len(df_events_daily)}")


📅 Date range: 2010-02-05 to 2012-10-26
📅 Total days: 995


In [11]:
# Add the Payday features
# is_snap_window_1: days 1-10 of the month
df_events_daily['is_snap_window_1'] = (df_events_daily['Date'].dt.day <= 10).astype(int)

# is_snap_window_2: days 11-20 of the month
df_events_daily['is_snap_window_2'] = ((df_events_daily['Date'].dt.day >= 11) & 
                                       (df_events_daily['Date'].dt.day <= 20)).astype(int)

# is_semimonthly_payday: the 15th or the last day of the month
df_events_daily['is_semimonthly_payday'] = ((df_events_daily['Date'].dt.day == 15) | 
                                           (df_events_daily['Date'].dt.is_month_end)).astype(int)

# is_tax_refund_season: Feb 15 to Apr 15 every year
df_events_daily['is_tax_refund_season'] = df_events_daily['Date'].apply(is_tax_refund_season)

print("✅ Payday features added!")
print(f"   SNAP window 1 days: {df_events_daily['is_snap_window_1'].sum()}")
print(f"   SNAP window 2 days: {df_events_daily['is_snap_window_2'].sum()}")
print(f"   Semimonthly payday days: {df_events_daily['is_semimonthly_payday'].sum()}")
print(f"   Tax refund season days: {df_events_daily['is_tax_refund_season'].sum()}")


✅ Payday features added!
   SNAP window 1 days: 326
   SNAP window 2 days: 330
   Semimonthly payday days: 65
   Tax refund season days: 181


In [12]:
# Build the US holiday event calendar (2010-2012)
all_holidays = {}
for year in [2010, 2011, 2012]:
    all_holidays.update(get_us_holidays(year))

# Map the holidays onto df_events_daily; dates without a holiday get ('', 0)
date_keys = df_events_daily['Date'].dt.strftime('%Y-%m-%d')
df_events_daily['HolidayName'] = date_keys.map(lambda x: all_holidays.get(x, ('', 0))[0])
df_events_daily['holiday_impact'] = date_keys.map(lambda x: all_holidays.get(x, ('', 0))[1])

print("✅ Holiday events added!")
print(f"   Total holidays: {(df_events_daily['HolidayName'] != '').sum()}")
print("\nHoliday distribution:")
print(df_events_daily[df_events_daily['HolidayName'] != '']['HolidayName'].value_counts())


✅ Holiday events added!
   Total holidays: 23

Holiday distribution:
HolidayName
Super Bowl          3
Presidents Day      3
Memorial Day        3
Independence Day    3
Labor Day           3
Thanksgiving        2
Christmas Eve       2
Christmas           2
New Years Day       2
Name: count, dtype: int64
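The floating holidays in `get_us_holidays` are hard-coded per year. As a cross-check, the sketch below derives three of them from their calendar rules and compares the result with the hard-coded dates. The helper `nth_weekday_of_month` and the `rules` table are assumptions introduced for this example, not part of `data_prep_utils.py`:

```python
import pandas as pd

def nth_weekday_of_month(year, month, weekday, n):
    """Return the n-th given weekday (0=Monday) of a month as a Timestamp."""
    first = pd.Timestamp(year=year, month=month, day=1)
    offset = (weekday - first.weekday()) % 7
    return first + pd.Timedelta(days=offset + 7 * (n - 1))

# Hard-coded dates copied from get_us_holidays
hard_coded = {
    'Presidents Day': ['2010-02-15', '2011-02-21', '2012-02-20'],
    'Labor Day': ['2010-09-06', '2011-09-05', '2012-09-03'],
    'Thanksgiving': ['2010-11-25', '2011-11-24', '2012-11-22'],
}

# (month, weekday, n): 3rd Monday of Feb, 1st Monday of Sep, 4th Thursday of Nov
rules = {
    'Presidents Day': (2, 0, 3),
    'Labor Day': (9, 0, 1),
    'Thanksgiving': (11, 3, 4),
}

mismatches = []
for name, (month, weekday, n) in rules.items():
    for year, expected in zip([2010, 2011, 2012], hard_coded[name]):
        computed = nth_weekday_of_month(year, month, weekday, n).strftime('%Y-%m-%d')
        if computed != expected:
            mismatches.append((name, year, computed, expected))
```

An empty `mismatches` list means the hard-coded table agrees with the rules for these three holidays. A rule-based helper would also extend past 2012 without editing the table.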


In [13]:
# View a sample of df_events_daily
print("📊 df_events_daily shape:", df_events_daily.shape)
print("\nSample data:")
df_events_daily.head(10)


📊 df_events_daily shape: (995, 7)

Sample data:


Unnamed: 0,Date,is_snap_window_1,is_snap_window_2,is_semimonthly_payday,is_tax_refund_season,HolidayName,holiday_impact
0,2010-02-05,1,0,0,0,,0
1,2010-02-06,1,0,0,0,,0
2,2010-02-07,1,0,0,0,Super Bowl,3
3,2010-02-08,1,0,0,0,,0
4,2010-02-09,1,0,0,0,,0
5,2010-02-10,1,0,0,0,,0
6,2010-02-11,0,1,0,0,,0
7,2010-02-12,0,1,0,0,,0
8,2010-02-13,0,1,0,0,,0
9,2010-02-14,0,1,0,0,,0


## 1.3. Create df_feature_calendar_weekly

**Objectives:**
- Add WeekEndDate to df_events_daily
- Group by WeekEndDate and aggregate the features


In [14]:
# Add WeekEndDate to df_events_daily
df_events_daily['WeekEndDate'] = df_events_daily['Date'].apply(get_week_end_date)

print("✅ WeekEndDate added to df_events_daily!")
print("\nSample WeekEndDate mapping:")
print(df_events_daily[['Date', 'WeekEndDate']].head(10))


✅ WeekEndDate added to df_events_daily!

Sample WeekEndDate mapping:
        Date WeekEndDate
0 2010-02-05  2010-02-05
1 2010-02-06  2010-02-12
2 2010-02-07  2010-02-12
3 2010-02-08  2010-02-12
4 2010-02-09  2010-02-12
5 2010-02-10  2010-02-12
6 2010-02-11  2010-02-12
7 2010-02-12  2010-02-12
8 2010-02-13  2010-02-19
9 2010-02-14  2010-02-19


In [15]:
# Group by WeekEndDate and aggregate
df_feature_calendar_weekly = df_events_daily.groupby('WeekEndDate').agg({
    # A binary flag is 1 for the week if it is 1 on any day, i.e. its max
    'is_snap_window_1': 'max',
    'is_snap_window_2': 'max',
    'is_semimonthly_payday': 'max',
    'is_tax_refund_season': 'max',
    'holiday_impact': 'max',
    # First holiday of the week. holiday_impact uses the max, so name and impact can
    # refer to different holidays when a week has more than one (Christmas Eve and
    # Christmas 2011 both fall in the week ending 2011-12-30).
    'HolidayName': lambda x: x[x != ''].iloc[0] if (x != '').any() else ''
}).reset_index()

# Rename the columns
df_feature_calendar_weekly = df_feature_calendar_weekly.rename(columns={
    'is_snap_window_1': 'is_snap_window_1_week',
    'is_snap_window_2': 'is_snap_window_2_week',
    'is_semimonthly_payday': 'is_semimonthly_payweek',
    'is_tax_refund_season': 'is_tax_refund_season_week',
    'holiday_impact': 'holiday_impact_week',
    'HolidayName': 'holiday_name_week'
})

print(f"✅ df_feature_calendar_weekly created! Shape: {df_feature_calendar_weekly.shape}")
print("\nSample data:")
df_feature_calendar_weekly.head(10)


✅ df_feature_calendar_weekly created! Shape: (143, 7)

Sample data:


Unnamed: 0,WeekEndDate,is_snap_window_1_week,is_snap_window_2_week,is_semimonthly_payweek,is_tax_refund_season_week,holiday_impact_week,holiday_name_week
0,2010-02-05,1,0,0,0,0,
1,2010-02-12,1,1,0,0,3,Super Bowl
2,2010-02-19,0,1,1,1,1,Presidents Day
3,2010-02-26,0,1,0,1,0,
4,2010-03-05,1,0,1,1,0,
5,2010-03-12,1,1,0,1,0,
6,2010-03-19,0,1,1,1,0,
7,2010-03-26,0,1,0,1,0,
8,2010-04-02,1,0,1,1,0,
9,2010-04-09,1,0,0,1,0,


In [16]:
# View statistics of the weekly features
print("📊 Weekly features statistics:")
print(f"\n   SNAP window 1 weeks: {df_feature_calendar_weekly['is_snap_window_1_week'].sum()}")
print(f"   SNAP window 2 weeks: {df_feature_calendar_weekly['is_snap_window_2_week'].sum()}")
print(f"   Semimonthly payweeks: {df_feature_calendar_weekly['is_semimonthly_payweek'].sum()}")
print(f"   Tax refund season weeks: {df_feature_calendar_weekly['is_tax_refund_season_week'].sum()}")
print(f"   Weeks with holidays: {(df_feature_calendar_weekly['holiday_name_week'] != '').sum()}")


📊 Weekly features statistics:

   SNAP window 1 weeks: 75
   SNAP window 2 weeks: 76
   Semimonthly payweeks: 65
   Tax refund season weeks: 28
   Weeks with holidays: 22


---

# PHASE 2: FEATURE ENGINEERING

## 2.1. Merge & Check

**Objectives:**
- Merge df_main_weekly with df_feature_calendar_weekly
- Sanity check: verify uniqueness


In [17]:
# Merge df_main_weekly with df_feature_calendar_weekly
df_final = pd.merge(df_main_weekly, df_feature_calendar_weekly, on='WeekEndDate', how='left')

# Fill NA in the new columns
fill_cols = ['is_snap_window_1_week', 'is_snap_window_2_week', 'is_semimonthly_payweek', 
             'is_tax_refund_season_week', 'holiday_impact_week']
for col in fill_cols:
    df_final[col] = df_final[col].fillna(0)

df_final['holiday_name_week'] = df_final['holiday_name_week'].fillna('')

print(f"✅ Merge completed! Shape: {df_final.shape}")


✅ Merge completed! Shape: (421570, 27)


In [18]:
# Sanity check: verify that (Store, Dept, WeekEndDate) is unique
print("🔍 Sanity check...")
duplicates = df_final.groupby(['Store', 'Dept', 'WeekEndDate']).size()
if (duplicates > 1).any():
    print(f"⚠️ Warning: Found {((duplicates > 1).sum())} duplicate (Store, Dept, WeekEndDate) combinations!")
    print(duplicates[duplicates > 1].head())
else:
    print("✅ No duplicates found! Each (Store, Dept, WeekEndDate) is unique.")


🔍 Sanity check...
✅ No duplicates found! Each (Store, Dept, WeekEndDate) is unique.
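If the uniqueness check ever fails, it helps to see the offending rows rather than only their count. A minimal sketch using `DataFrame.duplicated` on a toy frame with one deliberately repeated key. The frame and the names `dup_rows` and `n_duplicate_keys` are illustrative:

```python
import pandas as pd

keys = ['Store', 'Dept', 'WeekEndDate']
toy = pd.DataFrame({
    'Store': [1, 1, 1],
    'Dept': [1, 1, 1],
    'WeekEndDate': pd.to_datetime(['2010-02-05', '2010-02-12', '2010-02-12']),
    'Weekly_Sales': [10.0, 20.0, 30.0],
})

# keep=False marks every row that shares its key with another row
dup_rows = toy[toy.duplicated(subset=keys, keep=False)]

# Number of distinct keys that occur more than once
n_duplicate_keys = dup_rows.drop_duplicates(subset=keys).shape[0]
```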


## 2.2. Create "Payday Pulse" Features

**Objectives:**
- Create weeks_since_payday_15_eom
- Create payday_decay_exp and payday_decay_piecewise


In [19]:
# Sort by Store, Dept, WeekEndDate so that row order follows time within each group
df_final = df_final.sort_values(['Store', 'Dept', 'WeekEndDate']).reset_index(drop=True)

# weeks_since_payday_15_eom: number of weeks since the most recent is_semimonthly_payweek
df_final['weeks_since_payday_15_eom'] = df_final.groupby(['Store', 'Dept']).apply(
    calculate_weeks_since_payday, include_groups=False
).reset_index(level=[0, 1], drop=True)

# Fill NA with a large sentinel value (rows before the group's first payday week)
df_final['weeks_since_payday_15_eom'] = df_final['weeks_since_payday_15_eom'].fillna(999)

print("✅ weeks_since_payday_15_eom created!")
print("\nStatistics:")
print(df_final['weeks_since_payday_15_eom'].describe())


✅ weeks_since_payday_15_eom created!

Statistics:
count    421570.000000
mean         15.445034
std         120.555547
min           0.000000
25%           0.000000
50%           1.000000
75%           1.000000
max         999.000000
Name: weeks_since_payday_15_eom, dtype: float64
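`calculate_weeks_since_payday` walks each group row by row with `iterrows`, which tends to be slow on a frame of this size. A minimal sketch of a vectorized equivalent on toy data, assuming the frame is already sorted by group and date. The toy frame and the names `payweek_date`, `last_payweek` and `weeks_since` are illustrative:

```python
import pandas as pd

toy = pd.DataFrame({
    'Store': [1, 1, 1, 1, 2, 2],
    'Dept': [1, 1, 1, 1, 1, 1],
    'WeekEndDate': pd.to_datetime(['2010-02-05', '2010-02-12', '2010-02-19',
                                   '2010-02-26', '2010-02-19', '2010-02-26']),
    'is_semimonthly_payweek': [0, 0, 1, 0, 1, 0],
}).sort_values(['Store', 'Dept', 'WeekEndDate']).reset_index(drop=True)

# Keep the date only on payday weeks, then carry it forward inside each group
payweek_date = toy['WeekEndDate'].where(toy['is_semimonthly_payweek'] == 1)
last_payweek = payweek_date.groupby([toy['Store'], toy['Dept']]).ffill()

# Whole weeks since the last payday week; NaN where the group has seen none yet
weeks_since = (toy['WeekEndDate'] - last_payweek).dt.days // 7
```

The leading NaN values correspond to the rows that the cell above fills with the sentinel 999.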


In [20]:
# Create the decay features
# Option 1: exponential decay
df_final['payday_decay_exp'] = np.exp(-0.25 * df_final['weeks_since_payday_15_eom'])

# Option 2: piecewise decay
df_final['payday_decay_piecewise'] = df_final['weeks_since_payday_15_eom'].apply(piecewise_decay)

print("✅ Payday decay features created!")
print("\nPayday decay statistics:")
print(df_final[['payday_decay_exp', 'payday_decay_piecewise']].describe())


✅ Payday decay features created!

Payday decay statistics:
       payday_decay_exp  payday_decay_piecewise
count      4.215700e+05           421570.000000
mean       8.514152e-01                0.807216
std        1.734981e-01                0.195337
min       3.427308e-109                0.400000
25%        7.788008e-01                0.700000
50%        7.788008e-01                0.700000
75%        1.000000e+00                1.000000
max        1.000000e+00                1.000000


## 2.3. Create "Holiday" Features

**Objectives:**
- Create weeks_until_christmas and weeks_until_thanksgiving
- Create is_pre_christmas_window_week and is_pre_thanksgiving_window_week


In [21]:
# Create the holiday countdown features
df_final['weeks_until_christmas'] = df_final['WeekEndDate'].apply(
    lambda x: calculate_weeks_until_holiday(x, get_christmas_date)
)
df_final['weeks_until_thanksgiving'] = df_final['WeekEndDate'].apply(
    lambda x: calculate_weeks_until_holiday(x, get_thanksgiving_date)
)

print("✅ Holiday countdown features created!")
print("\nStatistics:")
print(f"   weeks_until_christmas - min: {df_final['weeks_until_christmas'].min()}, max: {df_final['weeks_until_christmas'].max()}")
print(f"   weeks_until_thanksgiving - min: {df_final['weeks_until_thanksgiving'].min()}, max: {df_final['weeks_until_thanksgiving'].max()}")


✅ Holiday countdown features created!

Statistics:
   weeks_until_christmas - min: 0, max: 51
   weeks_until_thanksgiving - min: 0, max: 51


In [22]:
# Create the holiday window features
df_final['is_pre_christmas_window_week'] = (df_final['weeks_until_christmas'] <= 3).astype(int)
df_final['is_pre_thanksgiving_window_week'] = (df_final['weeks_until_thanksgiving'] <= 2).astype(int)

# The sums count (Store, Dept, week) rows of df_final, not distinct weeks
print("✅ Holiday window features created!")
print(f"   Pre-Christmas rows: {df_final['is_pre_christmas_window_week'].sum()}")
print(f"   Pre-Thanksgiving rows: {df_final['is_pre_thanksgiving_window_week'].sum()}")


✅ Holiday window features created!
   Pre-Christmas rows: 23856
   Pre-Thanksgiving rows: 17654


## 2.4. Create "Lag/Rolling" Features

**Objectives:**
- Create lag features (t-52, t-1, t-2, t-4)
- Create rolling statistics (mean, std)


In [23]:
# Create the lag features
df_final = df_final.sort_values(['Store', 'Dept', 'WeekEndDate']).reset_index(drop=True)

# lag_sales_t_52: the "last year" feature (52 weeks earlier)
df_final['lag_sales_t_52'] = df_final.groupby(['Store', 'Dept'])['Weekly_Sales'].shift(52)

# lag_sales_t_1, lag_sales_t_2, lag_sales_t_4: short-term lags
df_final['lag_sales_t_1'] = df_final.groupby(['Store', 'Dept'])['Weekly_Sales'].shift(1)
df_final['lag_sales_t_2'] = df_final.groupby(['Store', 'Dept'])['Weekly_Sales'].shift(2)
df_final['lag_sales_t_4'] = df_final.groupby(['Store', 'Dept'])['Weekly_Sales'].shift(4)

print("✅ Lag features created!")
print("\nLag features missing values:")
print(df_final[['lag_sales_t_52', 'lag_sales_t_1', 'lag_sales_t_2', 'lag_sales_t_4']].isna().sum())


✅ Lag features created!

Lag features missing values:
lag_sales_t_52    160487
lag_sales_t_1       3331
lag_sales_t_2       6625
lag_sales_t_4      13134
dtype: int64


In [24]:
# Create the rolling features
# Both statistics are computed inside each (Store, Dept) group via transform, on sales
# shifted by one week so the current week never leaks into its own feature.
# Chaining .rolling() after groupby(...).shift(1) would roll across group boundaries, and
# its single-level index makes reset_index(level=[0, 1]) raise
# "IndexError: Too many levels: Index has only 1 level, not 2".
sales_by_group = df_final.groupby(['Store', 'Dept'])['Weekly_Sales']

# rolling_mean_sales_4_weeks: mean of the previous 4 weeks
df_final['rolling_mean_sales_4_weeks'] = sales_by_group.transform(
    lambda s: s.shift(1).rolling(window=4, min_periods=1).mean()
)

# rolling_std_sales_4_weeks: standard deviation of the previous 4 weeks
# (NaN when fewer than 2 previous weeks are available)
df_final['rolling_std_sales_4_weeks'] = sales_by_group.transform(
    lambda s: s.shift(1).rolling(window=4, min_periods=1).std()
)

print("✅ Rolling features created!")
print("\nRolling features statistics:")
print(df_final[['rolling_mean_sales_4_weeks', 'rolling_std_sales_4_weeks']].describe())
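A minimal sketch, on toy data, of a shifted rolling mean that stays strictly within each (Store, Dept) group by using `transform`. The toy frame and the column name `rolling_mean_prev_4` are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    'Store': [1, 1, 1, 2, 2],
    'Dept': [1, 1, 1, 1, 1],
    'Weekly_Sales': [10.0, 20.0, 30.0, 100.0, 200.0],
})

# transform applies the function to each group and returns a result aligned to toy's index
toy['rolling_mean_prev_4'] = toy.groupby(['Store', 'Dept'])['Weekly_Sales'].transform(
    lambda s: s.shift(1).rolling(window=4, min_periods=1).mean()
)
```

The first row of Store 2 is NaN rather than an average of Store 1's sales, which is the property a rolling window applied to the ungrouped, shifted Series would not guarantee.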

## 2.5. Create "Interaction" Features

**Objectives:**
- Create interact_snap_x_type_c
- Create interact_holiday_x_impact
- Create interact_tax_x_temp


In [None]:
# Create the interaction features
# interact_snap_x_type_c: SNAP window 1 x Store Type C
df_final['interact_snap_x_type_c'] = df_final['is_snap_window_1_week'] * (df_final['Type'] == 'C').astype(int)

# interact_holiday_x_impact: Pre-Christmas window x holiday impact
df_final['interact_holiday_x_impact'] = df_final['is_pre_christmas_window_week'] * df_final['holiday_impact_week']

# interact_tax_x_temp: Tax refund season x Temperature
df_final['interact_tax_x_temp'] = df_final['is_tax_refund_season_week'] * df_final['Temperature']

print("✅ Interaction features created!")
print("\nInteraction features statistics:")
print(df_final[['interact_snap_x_type_c', 'interact_holiday_x_impact', 'interact_tax_x_temp']].describe())


---

# PHASE 3: SAVE THE OUTPUT FILES

**Objectives:**
- Save df_main_weekly.csv
- Save df_events_daily.csv
- Save df_feature_calendar_weekly.csv
- Save df_final_for_model.csv


In [None]:
# Save the output files
print("💾 Saving output files...")

# Create the processed directory if it does not exist yet
os.makedirs(PROCESSED_PATH, exist_ok=True)

# Save df_main_weekly
df_main_weekly.to_csv(PROCESSED_PATH + 'df_main_weekly.csv', index=False)
print(f"✅ Saved: df_main_weekly.csv ({df_main_weekly.shape})")

# Save df_events_daily
df_events_daily.to_csv(PROCESSED_PATH + 'df_events_daily.csv', index=False)
print(f"✅ Saved: df_events_daily.csv ({df_events_daily.shape})")

# Save df_feature_calendar_weekly
df_feature_calendar_weekly.to_csv(PROCESSED_PATH + 'df_feature_calendar_weekly.csv', index=False)
print(f"✅ Saved: df_feature_calendar_weekly.csv ({df_feature_calendar_weekly.shape})")

# Save df_final_for_model
df_final.to_csv(PROCESSED_PATH + 'df_final_for_model.csv', index=False)
print(f"✅ Saved: df_final_for_model.csv ({df_final.shape})")

print("\n🎉 All files saved successfully!")
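CSV does not store column types, so `WeekEndDate` comes back as text unless it is parsed on load. A minimal sketch of the round trip using a temporary directory and a toy frame, so it does not touch `PROCESSED_PATH`. All names below are illustrative:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    'Store': [1, 2],
    'WeekEndDate': pd.to_datetime(['2010-02-05', '2010-02-12']),
    'Weekly_Sales': [24924.5, 46039.49],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.csv')
    toy.to_csv(path, index=False)

    # Without parse_dates the date column comes back as plain strings
    reloaded_raw = pd.read_csv(path)
    # parse_dates restores the datetime dtype
    reloaded = pd.read_csv(path, parse_dates=['WeekEndDate'])

raw_is_datetime = pd.api.types.is_datetime64_any_dtype(reloaded_raw['WeekEndDate'])
parsed_is_datetime = pd.api.types.is_datetime64_any_dtype(reloaded['WeekEndDate'])
```

Any downstream notebook that reads `df_final_for_model.csv` would likely want the same `parse_dates` argument before sorting or differencing by date.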


---

# RESULTS SUMMARY

## Final Dataset Summary


In [None]:
print("="*80)
print("📊 FINAL DATASET SUMMARY")
print("="*80)

print(f"\n📊 Final dataset shape: {df_final.shape}")
print(f"📊 Total columns: {len(df_final.columns)}")
print(f"📅 Time range: {df_final['WeekEndDate'].min()} to {df_final['WeekEndDate'].max()}")
print(f"🏪 Stores: {df_final['Store'].nunique()}, Departments: {df_final['Dept'].nunique()}")

print("\n💰 Weekly Sales statistics:")
print(df_final['Weekly_Sales'].describe())


In [None]:
print(f"\n📋 All columns ({len(df_final.columns)}):")
for i, col in enumerate(df_final.columns, 1):
    print(f"  {i:2d}. {col}")

print("\n" + "="*80)
print("✅ DATA PREPARATION COMPLETED SUCCESSFULLY!")
print("="*80)


In [None]:
# View a sample of df_final
print("Sample of final dataset:")
df_final.head()
