<a href="https://colab.research.google.com/github/Wezz-git/AI-samples/blob/main/(RandomForestRegressor)_Advanced_Feature_Engineering_(Time).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The Business Problem:

You're a data scientist for a large retail chain. They have sales data and want a model to forecast demand. Your boss knows that sales are not random; they depend on the calendar. Sales tend to be higher on Fridays, in December, and on holidays.

The "Real-World" Skill: Learn to engineer features from a datetime object. Take a single (date) column and "explode" it into multiple, smarter columns like (Month), (DayOfWeek), and (IsHoliday).

Model: RandomForestRegressor (like in Day 2) because tree ensembles can pick up non-linear patterns in these new features without any scaling.

In [None]:
import pandas as pd

# Data of: Store Sales - Time Series Forecasting. Using 'train.csv'
# Load the data
df = pd.read_csv('/content/sample_data/train.csv')

# Print first 5 rows
print(df.head())

# Print technical summary
print(df.info())

   id        date  store_nbr      family  sales  onpromotion
0   0  2013-01-01        1.0  AUTOMOTIVE    0.0          0.0
1   1  2013-01-01        1.0   BABY CARE    0.0          0.0
2   2  2013-01-01        1.0      BEAUTY    0.0          0.0
3   3  2013-01-01        1.0   BEVERAGES    0.0          0.0
4   4  2013-01-01        1.0       BOOKS    0.0          0.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2209503 entries, 0 to 2209502
Data columns (total 6 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   date         object 
 2   store_nbr    float64
 3   family       object 
 4   sales        float64
 5   onpromotion  float64
dtypes: float64(3), int64(1), object(2)
memory usage: 101.1+ MB
None


Convert the Date Column

- Use the pd.to_datetime() function to parse the text dates into real datetime values.

In [None]:
# 'date' has dtype object because the CSV stores it as a string. The model cannot use "2013-01-01" directly.
# 1 - Convert the 'date' column from an object (text) to a datetime
df['date'] = pd.to_datetime(df['date'])

# 2 - Check the work
print("\n-- Updated Data Summary --")
print(df.info())


-- Updated Data Summary --
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2209503 entries, 0 to 2209502
Data columns (total 6 columns):
 #   Column       Dtype         
---  ------       -----         
 0   id           int64         
 1   date         datetime64[ns]
 2   store_nbr    float64       
 3   family       object        
 4   sales        float64       
 5   onpromotion  float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 101.1+ MB
None
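One thing worth checking after the conversion is whether any date failed to parse. A missing or malformed date becomes NaT, and features derived from it come back as floats with NaN rather than integers. The following is a minimal sketch on a hypothetical toy frame (the data and the `toy` and `n_missing` names are assumptions, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy data: the last row mimics a truncated or missing date.
toy = pd.DataFrame({"date": ["2013-01-01", "2013-01-02", None]})

# errors="coerce" turns anything unparseable into NaT instead of raising.
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d", errors="coerce")

# Count the rows without a usable date before engineering features from them.
n_missing = int(toy["date"].isna().sum())
print(f"Rows with a missing date: {n_missing}")

# With a NaT present, .dt.month is upcast to float64 to hold the NaN.
print(toy["date"].dt.month.dtype)
```

If the count is non-zero, those rows can be inspected or dropped before the feature engineering step.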


Feature Engineering

- Use of the special .dt accessor to "explode" the single date column into four new, much smarter, numeric columns.

In [None]:
# ('df' is in memory with a datetime 'date' column)

print("Engineering new features from the 'date' column..")

# 1 - Get the month (1-12)
df['month'] = df['date'].dt.month

# 2 - Get the day of the week (0=Monday, 6=Sunday)
df['day_of_week'] = df['date'].dt.dayofweek

# 3 - Get the year (e.g., 2013, 2014)
df['year'] = df['date'].dt.year

# 4 - Get the day of the year (1-365, or 1-366 in a leap year)
df['day_of_year'] = df['date'].dt.dayofyear

# 5 - Check the work
print("\n-- New Features Created --")
print(df[['month', 'day_of_week', 'year', 'day_of_year']].head())

Engineering new features from the 'date' column..

-- New Features Created --
   month  day_of_week    year  day_of_year
0    1.0          1.0  2013.0          1.0
1    1.0          1.0  2013.0          1.0
2    1.0          1.0  2013.0          1.0
3    1.0          1.0  2013.0          1.0
4    1.0          1.0  2013.0          1.0


The output shows that for the first 5 rows (all on date 2013-01-01):

- month is 1 (January)

- day_of_week is 1 (a Tuesday)

- year is 2013

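- day_of_year is 1

The values print as floats (1.0 rather than 1). In pandas this typically means at least one row has a missing date (NaT), so the new columns were upcast to float64 to hold a NaN.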
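The same idea extends to flag columns such as the (IsHoliday) feature mentioned at the start. Below is a rough sketch, assuming a hand-written holiday list; the dates in `holidays` and the column names `is_holiday` and `is_weekend` are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical holiday list for illustration only; a real project would load
# an actual calendar for the stores' country.
holidays = pd.to_datetime(["2013-01-01", "2013-12-25"])

toy = pd.DataFrame(
    {"date": pd.to_datetime(["2013-01-01", "2013-01-05", "2013-12-25"])}
)

# 1 if the date appears in the holiday list, else 0
toy["is_holiday"] = toy["date"].isin(holidays).astype(int)

# 1 for Saturday (5) or Sunday (6), else 0
toy["is_weekend"] = (toy["date"].dt.dayofweek >= 5).astype(int)

print(toy)
```

Both flags are plain 0/1 integers, so they can sit alongside month and day_of_week as model inputs.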

Final Preprocessing

1- Handle Text: We still have the family column (with text like 'GROCERY I'). We need to convert this to numbers.

2- Select Columns: We need to drop the original date column (since we have our new features) and the id column.
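To make the "convert text to numbers" step concrete, here is a small sketch of one-hot encoding on a hypothetical three-row frame (the `toy` data is invented for illustration). Passing dtype=int is an optional choice that yields 0/1 columns instead of True/False:

```python
import pandas as pd

# Hypothetical toy data with a text category column
toy = pd.DataFrame({
    "family": ["AUTOMOTIVE", "BEAUTY", "AUTOMOTIVE"],
    "sales": [0.0, 2.0, 1.0],
})

# One new column per category value; the original 'family' column is removed.
encoded = pd.get_dummies(toy, columns=["family"], dtype=int)

print(encoded.columns.tolist())
# ['sales', 'family_AUTOMOTIVE', 'family_BEAUTY']
```

Each row has a 1 in exactly one of the new family_ columns, which is what lets a numeric model tell the categories apart.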

In [None]:
# 1 - One-Hot Encode the 'family' column
# Use of 'get_dummies'
df_processed = pd.get_dummies(df, columns=['family'])

# 2 - Drop rows with missing target values
print(f"Original row count: {len(df_processed)}")
df_processed = df_processed.dropna(subset=['sales'])
print(f"Row count after cleaning target: {len(df_processed)}")

# 3 - Create final features (X) and target (y)
y = df_processed['sales']

# Our features 'X' are all the columns we just built
# We drop the 'id' (it's just an ID)
# We drop 'date' (we already engineered features from it)
# We drop 'sales' (it's our target)

X = df_processed.drop(columns=['id', 'date', 'sales'])

# 4 - Check the work
print("\n-- Final Features (X) --")
print(X.head())

print("\n-- Final Target (y) --")
print(y.head())

Original row count: 2209503
Row count after cleaning target: 2209502

-- Final Features (X) --
   store_nbr  onpromotion  month  day_of_week    year  day_of_year  \
0        1.0          0.0    1.0          1.0  2013.0          1.0   
1        1.0          0.0    1.0          1.0  2013.0          1.0   
2        1.0          0.0    1.0          1.0  2013.0          1.0   
3        1.0          0.0    1.0          1.0  2013.0          1.0   
4        1.0          0.0    1.0          1.0  2013.0          1.0   

   family_AUTOMOTIVE  family_BABY CARE  family_BEAUTY  family_BEVERAGES  ...  \
0               True             False          False             False  ...   
1              False              True          False             False  ...   
2              False             False           True             False  ...   
3              False             False          False              True  ...   
4              False             False          False             False  ...   

   

Train and Evaluate

- Use of RandomForestRegressor

In [None]:
# 1- Import tools
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# 2 - Split the data
# Use the standard 80/20 split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3 - Initialize the model
# n_jobs=-1 uses all available CPU cores to speed up training
print("Initializing Random Forest..")
model = RandomForestRegressor(n_estimators=20, random_state=42, n_jobs=-1)

# 4 - Train the model
print("Training model..")

model.fit(X_train, y_train)
print("Training complete!")

# 5 - Evaluate with RMSE (root mean squared error, in the same units as sales)
predictions = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"\nModel RMSE: {rmse:,.2f}")

Initializing Random Forest..
Training model..
Training complete!

Model RMSE: 283.47
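A random 80/20 split mixes past and future dates, so the test rows can fall between training rows in time. For forecasting, a common alternative is to hold out the most recent period instead. The sketch below shows that idea on synthetic data; everything in it (the `toy` frame, the generated sales values, the 80% cutoff) is an assumption for illustration, not the notebook's method or result:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical synthetic data standing in for the real sales table
rng = np.random.default_rng(42)
dates = pd.date_range("2013-01-01", periods=200, freq="D")
toy = pd.DataFrame({
    "date": dates,
    "day_of_week": dates.dayofweek,
    "month": dates.month,
})
# Invented sales pattern: a bump on Friday-Sunday plus a little noise
toy["sales"] = 100 + 10 * (toy["day_of_week"] >= 4) + rng.normal(0, 1, len(toy))

# Sort by date, then hold out the most recent 20% as the test set
toy = toy.sort_values("date")
cutoff = int(len(toy) * 0.8)
train, test = toy.iloc[:cutoff], toy.iloc[cutoff:]

features = ["day_of_week", "month"]
model = RandomForestRegressor(n_estimators=20, random_state=42)
model.fit(train[features], train["sales"])

# Score only on dates the model has never seen
rmse = np.sqrt(mean_squared_error(test["sales"], model.predict(test[features])))
print(f"Time-based split RMSE: {rmse:.2f}")
```

Because every test date is later than every training date, the score is closer to what the model would face when forecasting real future demand.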
