# Feature Engineering Pipeline
## Unified approach combining all best practices from drafts

**Sources:**
- Basic RFM features from `baseline.ipynb` (23 features)
- Periodic aggregations from `ozon-fresh-categories_NS.ipynb` (277 features)
- UMAP embeddings from `categoricalembeddinglowdim.ipynb` (27 features)
- Temporal patterns from `features_from_pdf.ipynb`
- Advanced features (new)

**Target:** ~70-80 features initially → expand if needed

## 1. Imports and Configuration

In [2]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import os
import glob
from typing import Dict, List
import warnings
import gc
from tqdm import tqdm
warnings.filterwarnings('ignore')

# For UMAP embeddings (optional - can add later)
# from umap import UMAP
# from sklearn.preprocessing import StandardScaler

print("Libraries loaded successfully!")

Libraries loaded successfully!


In [3]:
# Configuration
DATA_PATH = '../docs'  # Your loaded data

# Date ranges
TRAIN_START_DATE = pd.Timestamp('2024-03-01')   # 4 months of history for training (March-June)
TRAIN_END_DATE = pd.Timestamp('2024-06-30')
VAL_START_DATE = pd.Timestamp('2024-07-01')
VAL_END_DATE = pd.Timestamp('2024-07-31')
TEST_START_DATE = pd.Timestamp('2024-08-01')
NUM_PERIODS = 4 

print(f"Train period: {TRAIN_START_DATE.date()} to {TRAIN_END_DATE.date()}")
print(f"Validation period: {VAL_START_DATE.date()} to {VAL_END_DATE.date()}")
print(f"Test prediction: {TEST_START_DATE.date()}")

Train period: 2024-03-01 to 2024-06-30
Validation period: 2024-07-01 to 2024-07-31
Test prediction: 2024-08-01
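
One detail of the date ranges above is worth checking: `pd.Timestamp('2024-06-30')` is midnight at the start of that day, so a `timestamp <= end_date` filter keeps almost nothing from the final day. The sketch below uses hypothetical toy events to compare the inclusive filter with a half-open one. The half-open form is a suggested alternative, not what the cells below do.

```python
import pandas as pd

# Toy events around the end of a window (hypothetical data).
events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-06-29 23:59:59",
        "2024-06-30 00:00:00",
        "2024-06-30 12:30:00",
    ])
})

end_date = pd.Timestamp("2024-06-30")  # midnight at the START of June 30

# Inclusive comparison against midnight keeps events only up to 00:00:00.
kept_inclusive = events[events["timestamp"] <= end_date]

# Half-open comparison against the next midnight keeps the whole last day.
kept_half_open = events[events["timestamp"] < end_date + pd.Timedelta(days=1)]

print(len(kept_inclusive), len(kept_half_open))
```

Switching the notebook to the half-open form would change the row counts reported in the outputs below.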


## 2. Data Loading

In [4]:
# Load actions_history (multiple parquet files)
print("Loading actions_history...")
actions_files = sorted(glob.glob(os.path.join(DATA_PATH, 'actions_history', '*.parquet')))
print(f"Found {len(actions_files)} action files")

if len(actions_files) == 0:
    raise FileNotFoundError(f"No parquet files found in {os.path.join(DATA_PATH, 'actions_history')}")

actions_list = []
for file in tqdm(actions_files, desc="Loading actions"):
    try:
        df = pd.read_parquet(file)
        # Convert timestamp to datetime
        df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
        actions_list.append(df)
    except Exception as e:
        print(f"Error loading {file}: {e}")
        raise

actions_history = pd.concat(actions_list, ignore_index=True)
print(f"Actions history shape: {actions_history.shape}")
print(f"Memory usage: {actions_history.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Display basic info
print(f"\nDate range: {actions_history['timestamp'].min()} to {actions_history['timestamp'].max()}")
print(f"Unique users: {actions_history['user_id'].nunique():,}")
print(f"Unique products: {actions_history['product_id'].nunique():,}")

del actions_list
gc.collect()

Loading actions_history...
Found 53 action files


Loading actions: 100%|██████████| 53/53 [00:03<00:00, 14.47it/s]


Actions history shape: (182001544, 6)
Memory usage: 5207.11 MB

Date range: 2011-05-28 00:26:26 to 2024-07-31 23:59:58
Unique users: 5,224,053
Unique products: 374,821


0
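
The loaded frame takes about 5 GB, so reducing integer widths may help before the heavy group-bys. The sketch below is one possible approach and not part of the original pipeline: `downcast_integers` is a hypothetical helper, and the toy columns only mimic the ID columns shown in the sample output.

```python
import numpy as np
import pandas as pd

def downcast_integers(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with integer columns stored in the smallest safe signed dtype."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out

# Toy frame with int64 columns (hypothetical values).
toy = pd.DataFrame({
    "user_id": np.array([7158706, 2762233, 6415797], dtype="int64"),
    "action_type_id": np.array([5, 1, 3], dtype="int64"),
    "widget_name_id": np.array([22, 22, 7], dtype="int64"),
})

small = downcast_integers(toy)
print(small.dtypes)
print(toy.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

Columns that contain NaN (such as `page_product_id` in the sample) are floats and are left untouched by this helper.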

In [5]:
# Load search_history (multiple parquet files)
print("Loading search_history...")
search_files = sorted(glob.glob(os.path.join(DATA_PATH, 'search_history', '*.parquet')))
print(f"Found {len(search_files)} search files")

search_list = []
for file in tqdm(search_files, desc="Loading searches"):
    df = pd.read_parquet(file)
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
    search_list.append(df)

search_history = pd.concat(search_list, ignore_index=True)
print(f"Search history shape: {search_history.shape}")
print(f"Memory usage: {search_history.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

del search_list
gc.collect()

Loading search_history...
Found 32 search files


Loading searches: 100%|██████████| 32/32 [00:08<00:00,  3.65it/s]


Search history shape: (78160845, 5)
Memory usage: 7893.96 MB


0

In [6]:
# Load product information and other metadata
print("Loading product information...")
product_information = pd.read_csv(os.path.join(DATA_PATH, 'product_information.csv'))
print(f"Product information shape: {product_information.shape}")

# Test users
test_users = pd.read_csv(os.path.join(DATA_PATH, 'test_users.csv'))
print(f"Test users shape: {test_users.shape}")

# Display sample
print("\nSample of actions_history:")
print(actions_history.head())
print("\nAction types:")
print(actions_history['action_type_id'].value_counts().sort_index())

Loading product information...
Product information shape: (238443, 8)
Test users shape: (2068424, 1)

Sample of actions_history:
    user_id           timestamp  product_id  page_product_id  action_type_id  \
0   7158706 2024-03-02 13:14:27   162625954              NaN               5   
1   2762233 2024-05-28 14:20:44   148481523              NaN               5   
2   6415797 2024-04-28 18:18:30   371796916              NaN               5   
3  11178472 2024-06-08 10:35:43   887739173              NaN               5   
4   2695403 2024-04-12 11:14:52   163600519              NaN               5   

   widget_name_id  
0              22  
1              22  
2              22  
3              22  
4              22  

Action types:
action_type_id
1    66968540
2     4065805
3    31306914
5    79660285
Name: count, dtype: int64


## 3. Target Creation

In [7]:
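# Validation target (for training)
# Users who made an order (action_type_id == 3) in July 2024
# NOTE: VAL_END_DATE is midnight at the start of 2024-07-31, so the `<=` filter
# below drops nearly all events from that final day.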
print("Creating validation target...")

val_actions = actions_history[
    (actions_history['timestamp'] >= VAL_START_DATE) &
    (actions_history['timestamp'] <= VAL_END_DATE)
].copy()

val_target = (
    val_actions
    .assign(has_order=(val_actions['action_type_id'] == 3).astype(int))
    .groupby('user_id', as_index=False)
    .agg(target=('has_order', 'max'))
)

print(f"\nTotal users: {val_target.shape[0]:,}")
print(f"\nTarget distribution:")
print(val_target['target'].value_counts())

# Calculate class imbalance
positive_ratio = val_target['target'].mean()
print(f"\nPositive class ratio: {positive_ratio:.2%}")

del val_actions
gc.collect()

Creating validation target...

Total users: 1,835,147

Target distribution:
target
0    1200425
1     634722
Name: count, dtype: int64

Positive class ratio: 34.59%


0
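
The target construction can be traced on a few rows. The sketch below uses hypothetical toy actions and the same `assign` / `groupby` / `max` pattern as the cell above. Note that only users with at least one action in the validation window get a row, so users who were inactive in July never appear as negatives.

```python
import pandas as pd

# Toy validation-window actions (hypothetical); 3 = order, as in the notebook.
actions = pd.DataFrame({
    "user_id":        [1, 1, 2, 3, 3],
    "action_type_id": [1, 3, 5, 1, 2],
})

target = (
    actions
    .assign(has_order=(actions["action_type_id"] == 3).astype(int))
    .groupby("user_id", as_index=False)
    .agg(target=("has_order", "max"))  # 1 if the user placed any order
)

print(target)
```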

## 4. Feature Generation Functions
### 4.1 Basic RFM Features (from baseline.ipynb)

In [8]:
def generate_basic_rfm_features(
    user_df: pd.DataFrame,
    start_date: pd.Timestamp,
    end_date: pd.Timestamp
) -> pd.DataFrame:
    """
    Generate basic RFM features for each action type.
    Based on baseline.ipynb approach.
    
    Returns:
        DataFrame with 6 features per action type plus 3 search features
        (27 in total when every action type is present)
    """
    print("\n=== Generating Basic RFM Features ===")
    
    df = user_df.copy()
    
    actions_id_to_suf = {
        1: "click",
        2: "favorite",
        3: "order",
        5: "to_cart",
    }
    
    # Filter actions for the period
    period_actions = actions_history[
        (actions_history['timestamp'] >= start_date) &
        (actions_history['timestamp'] <= end_date)
    ].copy()
    
    # Merge with product info for prices
    period_actions = period_actions.merge(
        product_information[['product_id', 'discount_price']],
        on='product_id',
        how='left'
    )
    
    for action_id, suffix in actions_id_to_suf.items():
        print(f"  Processing {suffix}s...")
        
        action_data = period_actions[period_actions['action_type_id'] == action_id].copy()
        
        if len(action_data) == 0:
            continue
        
        # Aggregate by user
        aggs = action_data.groupby('user_id').agg(
            **{
                f'num_products_{suffix}': ('product_id', 'count'),
                f'num_unique_products_{suffix}': ('product_id', 'nunique'),
                f'sum_discount_price_{suffix}': ('discount_price', 'sum'),
                f'max_discount_price_{suffix}': ('discount_price', 'max'),
                f'last_{suffix}_time': ('timestamp', 'max'),
                f'first_{suffix}_time': ('timestamp', 'min'),
            }
        ).reset_index()
        
        # Calculate recency features
        reference_date = end_date + timedelta(days=1)
        aggs[f'days_since_last_{suffix}'] = (reference_date - aggs[f'last_{suffix}_time']).dt.days
        aggs[f'days_since_first_{suffix}'] = (reference_date - aggs[f'first_{suffix}_time']).dt.days
        
        # Drop timestamp columns
        aggs = aggs.drop(columns=[f'last_{suffix}_time', f'first_{suffix}_time'])
        
        # Merge with main dataframe
        df = df.merge(aggs, on='user_id', how='left')
    
    # Search aggregations
    print("  Processing searches...")
    suffix = 'search'
    
    period_searches = search_history[
        (search_history['timestamp'] >= start_date) &
        (search_history['timestamp'] <= end_date)
    ].copy()
    
    if len(period_searches) > 0:
        search_aggs = period_searches.groupby('user_id').agg(
            **{
                f'num_{suffix}': ('search_query', 'count'),
                f'last_{suffix}_time': ('timestamp', 'max'),
                f'first_{suffix}_time': ('timestamp', 'min'),
            }
        ).reset_index()
        
        reference_date = end_date + timedelta(days=1)
        search_aggs[f'days_since_last_{suffix}'] = (reference_date - search_aggs[f'last_{suffix}_time']).dt.days
        search_aggs[f'days_since_first_{suffix}'] = (reference_date - search_aggs[f'first_{suffix}_time']).dt.days
        
        search_aggs = search_aggs.drop(columns=[f'last_{suffix}_time', f'first_{suffix}_time'])
        
        df = df.merge(search_aggs, on='user_id', how='left')
    
    # Count generated features
    new_features = len(df.columns) - len(user_df.columns)
    print(f"  Generated {new_features} RFM features")
    
    return df
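
As a worked example of the recency columns, the sketch below reproduces the `days_since_first_*` / `days_since_last_*` arithmetic on hypothetical toy orders. The reference date is the day after `end_date`, as in the function above, and `.dt.days` truncates partial days.

```python
import pandas as pd

end_date = pd.Timestamp("2024-06-30")
reference_date = end_date + pd.Timedelta(days=1)  # 2024-07-01 00:00

# Toy orders (hypothetical data).
orders = pd.DataFrame({
    "user_id": [1, 1, 2],
    "timestamp": pd.to_datetime([
        "2024-03-05 10:00", "2024-06-20 18:30", "2024-06-29 09:00",
    ]),
})

agg = orders.groupby("user_id").agg(
    first_time=("timestamp", "min"),
    last_time=("timestamp", "max"),
).reset_index()

# Whole days elapsed between each event and the reference date.
agg["days_since_first_order"] = (reference_date - agg["first_time"]).dt.days
agg["days_since_last_order"] = (reference_date - agg["last_time"]).dt.days

print(agg[["user_id", "days_since_first_order", "days_since_last_order"]])
```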

### 4.2 Temporal Features (from features_from_pdf.ipynb)

In [9]:
def generate_temporal_features(
    user_df: pd.DataFrame,
    start_date: pd.Timestamp,
    end_date: pd.Timestamp
) -> pd.DataFrame:
    """
    Generate temporal pattern features:
    - Favorite day of week
    - Average hour of activity
    - Number of unique active days
    - Lifecycle features (is_new_user, lifetime)
    
    From features_from_pdf.ipynb
    """
    print("\n=== Generating Temporal Features ===")
    
    df = user_df.copy()
    
    actions_id_to_suf = {
        1: "click",
        2: "favorite",
        3: "order",
        5: "to_cart",
    }
    
    period_actions = actions_history[
        (actions_history['timestamp'] >= start_date) &
        (actions_history['timestamp'] <= end_date)
    ].copy()
    
    # Add temporal columns
    period_actions['day_of_week'] = period_actions['timestamp'].dt.dayofweek
    period_actions['hour'] = period_actions['timestamp'].dt.hour
    period_actions['date'] = period_actions['timestamp'].dt.date
    
    for action_id, suffix in actions_id_to_suf.items():
        print(f"  Processing {suffix} temporal patterns...")
        
        action_data = period_actions[period_actions['action_type_id'] == action_id].copy()
        
        if len(action_data) == 0:
            continue
        
        temporal_aggs = action_data.groupby('user_id').agg(
            **{
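                # NOTE: the mean of day_of_week is only a rough proxy;
                # a true "favorite" day would be the per-user mode
                f'favorite_day_of_week_{suffix}': ('day_of_week', 'mean'),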
                f'avg_hour_{suffix}': ('hour', 'mean'),
                f'num_unique_days_{suffix}': ('date', 'nunique'),
                f'first_time_{suffix}': ('timestamp', 'min'),
            }
        ).reset_index()
        
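        # Is new user: first action falls within the last 30 days of the window.
        # This equals 2024-06-01 for the training window; a hard-coded date
        # would mislabel users in the longer test window.
        temporal_aggs[f'is_new_user_{suffix}'] = (
            temporal_aggs[f'first_time_{suffix}'] >= end_date - pd.Timedelta(days=29)
        ).astype(int)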
        
        temporal_aggs = temporal_aggs.drop(columns=[f'first_time_{suffix}'])
        
        df = df.merge(temporal_aggs, on='user_id', how='left')
    
    # Add lifetime features (days between first and last activity)
    for suffix in ['click', 'favorite', 'order', 'to_cart']:
        first_col = f'days_since_first_{suffix}'
        last_col = f'days_since_last_{suffix}'
        
        if first_col in df.columns and last_col in df.columns:
            df[f'lifetime_{suffix}'] = df[first_col] - df[last_col]
    
    print(f"  Generated temporal features")
    return df
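
The function above stores the mean of `day_of_week` under the name `favorite_day_of_week_*`. The sketch below, on hypothetical toy data, shows how the mean can differ from the most frequent day and one vectorized way to get the mode. This is an alternative for comparison, not the notebook's implementation; ties are broken toward the earlier weekday by assumption.

```python
import pandas as pd

# Toy actions (hypothetical data).
actions = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2024-06-03",  # Monday
        "2024-06-10",  # Monday
        "2024-06-16",  # Sunday
        "2024-06-05",  # Wednesday
        "2024-06-07",  # Friday
    ]),
})
actions["day_of_week"] = actions["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6

# What the notebook computes: the average weekday index.
mean_day = actions.groupby("user_id")["day_of_week"].mean()

# Most frequent day per user: count (user, day) pairs, then keep the top one.
counts = actions.groupby(["user_id", "day_of_week"]).size().reset_index(name="n")
mode_day = (
    counts.sort_values(["user_id", "n", "day_of_week"], ascending=[True, False, True])
    .drop_duplicates("user_id")
    .set_index("user_id")["day_of_week"]
)

print(mean_day)
print(mode_day)
```

For user 1 the mean lands on Wednesday even though no activity happened on that day, which is why the mode may be the more faithful "favorite day" signal.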

### 4.3 Conversion Features

In [10]:
def generate_conversion_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Generate conversion rate features:
    - click_to_order_conversion
    - favorite_to_order_conversion
    - to_cart_to_order_conversion
    - searches_to_order_ratio
    - {action}_per_day (unique products per active day)
    """
    print("\n=== Generating Conversion Features ===")
    
    df = df.copy()
    
    # Conversion rates
    for suffix in ['click', 'favorite', 'to_cart']:
        num_col = f'num_products_{suffix}'
        if num_col in df.columns and 'num_products_order' in df.columns:
            df[f'{suffix}_to_order_conversion'] = (
                df['num_products_order'] / df[num_col].replace(0, np.nan)
            )
    
    # Search to order ratio
    if 'num_search' in df.columns and 'num_products_order' in df.columns:
        df['searches_to_order_ratio'] = (
            df['num_search'] / df['num_products_order'].replace(0, np.nan)
        )
    
    # Actions per day
    for suffix in ['click', 'favorite', 'to_cart', 'order']:
        num_col = f'num_unique_products_{suffix}'
        days_col = f'num_unique_days_{suffix}'
        
        if num_col in df.columns and days_col in df.columns:
            df[f'{suffix}_per_day'] = (
                df[num_col] / df[days_col].replace(0, np.nan)
            )
    
    print("  Generated conversion features")
    return df
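
The `.replace(0, np.nan)` guard on the denominator is what keeps these ratios finite. A small sketch on hypothetical toy counts compares the guarded division with a naive one:

```python
import numpy as np
import pandas as pd

# Toy per-user counts (hypothetical); NaN means the user had no such actions.
df = pd.DataFrame({
    "num_products_click": [10.0, 0.0, np.nan],
    "num_products_order": [2.0, 1.0, 3.0],
})

# Guarded division: a zero denominator becomes NaN instead of inf.
df["click_to_order_conversion"] = (
    df["num_products_order"] / df["num_products_click"].replace(0, np.nan)
)

# Naive division for comparison: the zero denominator yields inf.
naive = df["num_products_order"] / df["num_products_click"]

print(df["click_to_order_conversion"].tolist())
print(naive.tolist())
```

The NaN results are later replaced by the -1 sentinel in the cleaning step, so "no clicks" and "clicks but no orders" remain distinguishable.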

### 4.4 Advanced Features (NEW)

In [11]:
def generate_advanced_features(
    user_df: pd.DataFrame,
    start_date: pd.Timestamp,
    end_date: pd.Timestamp
) -> pd.DataFrame:
    """
    Generate advanced behavioral features:
    - Discount purchase ratio
    - Category diversity
    - Widget diversity
    - Average order price (a simple price-sensitivity proxy)
    """
    print("\n=== Generating Advanced Features ===")
    
    df = user_df.copy()
    
    period_actions = actions_history[
        (actions_history['timestamp'] >= start_date) &
        (actions_history['timestamp'] <= end_date)
    ].copy()
    
    # 1. Discount purchase ratio
    print("  Calculating discount ratios...")
    order_actions = period_actions[period_actions['action_type_id'] == 3].copy()
    
    if len(order_actions) > 0:
        order_actions = order_actions.merge(
            product_information[['product_id', 'price', 'discount_price']],
            on='product_id',
            how='left'
        )
        
        order_actions['has_discount'] = (
            order_actions['price'] > order_actions['discount_price']
        ).astype(int)
        
        discount_aggs = order_actions.groupby('user_id').agg(
            discount_purchase_ratio=('has_discount', 'mean'),
            avg_order_price=('discount_price', 'mean')
        ).reset_index()
        
        df = df.merge(discount_aggs, on='user_id', how='left')
    
    # 2. Category diversity
    print("  Calculating category diversity...")
    interaction_actions = period_actions[
        period_actions['action_type_id'].isin([1, 2, 3, 5])
    ].copy()
    
    if len(interaction_actions) > 0:
        interaction_actions = interaction_actions.merge(
            product_information[['product_id', 'category_id']],
            on='product_id',
            how='left'
        )
        
        category_aggs = interaction_actions.groupby('user_id').agg(
            num_unique_categories=('category_id', 'nunique'),
            total_interactions=('category_id', 'count')
        ).reset_index()
        
        category_aggs['category_diversity'] = (
            category_aggs['num_unique_categories'] / 
            category_aggs['total_interactions']
        )
        
        category_aggs = category_aggs.drop(columns=['total_interactions'])
        df = df.merge(category_aggs, on='user_id', how='left')
    
    # 3. Widget diversity
    print("  Calculating widget diversity...")
    widget_aggs = period_actions.groupby('user_id').agg(
        num_unique_widgets=('widget_name_id', 'nunique')
    ).reset_index()
    
    df = df.merge(widget_aggs, on='user_id', how='left')
    
    print("  Generated advanced features")
    return df
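
To make the category diversity ratio concrete, the sketch below computes it on hypothetical toy interactions. A value near 1 means nearly every interaction touched a different category, and a value near 0 means the user stays within a few categories.

```python
import pandas as pd

# Toy interactions already joined with category_id (hypothetical data).
interactions = pd.DataFrame({
    "user_id":     [1, 1, 1, 1, 2, 2],
    "category_id": [10, 10, 10, 20, 30, 40],
})

agg = interactions.groupby("user_id").agg(
    num_unique_categories=("category_id", "nunique"),
    total_interactions=("category_id", "count"),  # count ignores NaN categories
).reset_index()

agg["category_diversity"] = agg["num_unique_categories"] / agg["total_interactions"]

print(agg)
```

Because the denominator grows with activity, heavy users tend to score lower on this ratio even when they browse many categories, so `num_unique_categories` is worth keeping alongside it.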

### 4.5 Periodic Aggregations (from ozon-fresh-categories_NS.ipynb)

In [12]:
def generate_periodic_aggregations(
    user_df: pd.DataFrame,
    start_date: pd.Timestamp,
    end_date: pd.Timestamp,
    num_periods: int = 4
) -> pd.DataFrame:
    """
    Periodic aggregations from the NS notebook.
    Splits the data into periods (4 weeks + older) and aggregates each one separately.
    This captures temporal dynamics: recent vs. historical behavior.
    
    Periods:
    - 0: last 7 days
    - 1: 8-14 days ago
    - 2: 15-21 days ago
    - 3: 22-28 days ago
    - 4: >28 days ago (normalized by length)
    
    For each period-user-action combination:
    - Counts of actions, products, categories, widgets
    - Price statistics (mean, max, min)
    - Std of timestamps (consistency of activity)
    - Most frequent category (categorical feature)
    
    Returns ~260 features for actions (13 aggregates x 5 periods x 4 action types),
    plus periodic search features
    """
    print("\n=== Generating Periodic Aggregations ===")
    print(f"  Periods: {num_periods} weeks + older")
    
    df = user_df.copy()
    
    # Filter actions
    period_actions = actions_history[
        (actions_history['timestamp'] >= start_date) &
        (actions_history['timestamp'] <= end_date) &
        (actions_history['user_id'].isin(user_df['user_id']))
    ].copy()
    
    if len(period_actions) == 0:
        print("  No data")
        return df
    
    # Merge with product_information
    period_actions = period_actions.merge(
        product_information[['product_id', 'category_id', 'price', 'discount_price']],
        on='product_id',
        how='left'
    )
    
    # Fill missing values
    period_actions['category_id'] = period_actions['category_id'].fillna(10000).astype(int)
    period_actions['price'] = period_actions['price'].fillna(period_actions['price'].mean())
    period_actions['discount_price'] = period_actions['discount_price'].fillna(
        period_actions['discount_price'].mean()
    )
    
    # Compute the period (0 = most recent week, 4 = older)
    period_actions['period'] = ((end_date - period_actions['timestamp']).dt.days // 7).clip(upper=num_periods)
    
    # Timestamp as an integer (units of 1,000 seconds) for std
    period_actions['timestamp_int'] = (period_actions['timestamp'].astype(int) / 1e12).astype(int)
    
    print(f"  Aggregating by period-user-action...")
    
    # Aggregate by period, user, and action type
    aggregated = period_actions.groupby(['user_id', 'period', 'action_type_id'], as_index=False).agg(
        num_actions=('timestamp', 'nunique'),
        timestamp_std=('timestamp_int', 'std'),
        num_products=('product_id', 'nunique'),
        count_products=('product_id', 'count'),
        unique_widget_actions=('widget_name_id', 'nunique'),
        num_categories=('category_id', 'nunique'),
        category_mode=('category_id', lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else 10000),
        
        price_mean=('price', 'mean'),
        price_max=('price', 'max'),
        price_min=('price', 'min'),
        
        discount_price_mean=('discount_price', 'mean'),
        discount_price_max=('discount_price', 'max'),
        discount_price_min=('discount_price', 'min'),
    )
    
    # Normalize features for period 4 (variable length)
    features_to_normalize = [
        'num_actions', 'num_products', 'count_products', 
        'unique_widget_actions', 'num_categories', 'timestamp_std'
    ]
    
    divisor = (end_date - pd.Timedelta(f"{num_periods*7} days") - start_date).days
    if divisor > 0:
        aggregated.loc[aggregated['period'] == num_periods, features_to_normalize] = (
            aggregated.loc[aggregated['period'] == num_periods, features_to_normalize] / divisor
        )
    
    print(f"  Pivoting...")
    
    # Pivot to wide format
    features = [
        'num_actions', 'num_products', 'count_products', 'unique_widget_actions',
        'num_categories', 'category_mode',
        'price_mean', 'price_max', 'price_min',
        'discount_price_mean', 'discount_price_max', 'discount_price_min',
        'timestamp_std'
    ]
    
    aggregated_wide = aggregated.pivot_table(
        index='user_id',
        columns=['period', 'action_type_id'],
        values=features,
        fill_value=0
    )
    
    # Flatten column names
    aggregated_wide.columns = [
        f"{feat}_{period}_{action}"
        for feat, period, action in aggregated_wide.columns
    ]
    
    aggregated_wide = aggregated_wide.reset_index()
    
    # Merge
    df = df.merge(aggregated_wide, on='user_id', how='left')
    
    # Fill nulls with zeros
    periodic_cols = [col for col in df.columns if col not in user_df.columns]
    df[periodic_cols] = df[periodic_cols].fillna(0)
    
    new_features = len(df.columns) - len(user_df.columns)
    print(f"  Generated {new_features} periodic features for actions")
    
    # Also aggregate search_history by period
    print(f"  Aggregating search_history...")
    
    period_searches = search_history[
        (search_history['timestamp'] >= start_date) &
        (search_history['timestamp'] <= end_date) &
        (search_history['user_id'].isin(user_df['user_id']))
    ].copy()
    
    if len(period_searches) > 0:
        period_searches['period'] = (
            (end_date - period_searches['timestamp']).dt.days // 7
        ).clip(upper=num_periods)
        
        period_searches['timestamp_int'] = (
            period_searches['timestamp'].astype(int) / 1e12
        ).astype(int)
        
        search_agg = period_searches.groupby(['user_id', 'period', 'action_type_id'], as_index=False).agg(
            num_actions=('timestamp', 'nunique'),
            timestamp_std_search=('timestamp_int', 'std'),
            unique_widget_search=('widget_name_id', 'nunique'),
        )
        
        # Normalize period 4
        if divisor > 0:
            search_agg.loc[search_agg['period'] == num_periods, ['num_actions', 'timestamp_std_search', 'unique_widget_search']] = (
                search_agg.loc[search_agg['period'] == num_periods, ['num_actions', 'timestamp_std_search', 'unique_widget_search']] / divisor
            )
        
        search_wide = search_agg.pivot_table(
            index='user_id',
            columns=['period', 'action_type_id'],
            values=['num_actions', 'unique_widget_search', 'timestamp_std_search'],
            fill_value=0
        )
        
        search_wide.columns = [
            f"{feat}_{period}_{action}"
            for feat, period, action in search_wide.columns
        ]
        
        search_wide = search_wide.reset_index()
        df = df.merge(search_wide, on='user_id', how='left')
        
        search_cols = [col for col in search_wide.columns if col != 'user_id']
        df[search_cols] = df[search_cols].fillna(0)
        
        print(f"  Added {len(search_cols)} periodic search features")
    
    total_new = len(df.columns) - len(user_df.columns)
    print(f"  Total periodic features: {total_new}")
    
    return df
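
The period bucketing and the wide pivot are the two steps of this function that are easiest to get wrong, so a reduced sketch may help. It uses hypothetical toy actions and a single aggregate (`num_actions`), with the same `// 7` bucketing, `clip`, `pivot_table`, and column-flattening pattern as above.

```python
import pandas as pd

end_date = pd.Timestamp("2024-06-30")
num_periods = 4

# Toy actions (hypothetical data).
actions = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "action_type_id": [1, 1, 3, 1],
    "timestamp": pd.to_datetime(
        ["2024-06-28", "2024-06-20", "2024-03-15", "2024-06-05"]
    ),
})

# Weeks back from end_date; everything older than 4 weeks lands in bucket 4.
actions["period"] = (
    (end_date - actions["timestamp"]).dt.days // 7
).clip(upper=num_periods)

counts = actions.groupby(
    ["user_id", "period", "action_type_id"], as_index=False
).agg(num_actions=("timestamp", "nunique"))

# One column per (feature, period, action type); absent combinations become 0.
wide = counts.pivot_table(
    index="user_id",
    columns=["period", "action_type_id"],
    values=["num_actions"],
    fill_value=0,
)
wide.columns = [f"{feat}_{period}_{action}" for feat, period, action in wide.columns]
wide = wide.reset_index()

print(wide)
```

Only combinations that occur in the data produce columns, so the train and test frames can end up with different column sets if some period and action pair is absent in one of them.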

## 5. Generate Features for Training Set

In [13]:
print("=" * 60)
print("GENERATING TRAINING FEATURES")
print("=" * 60)

# Start with users and target
df_train = val_target.copy()

# 1. Basic RFM features
df_train = generate_basic_rfm_features(
    df_train,
    start_date=TRAIN_START_DATE,
    end_date=TRAIN_END_DATE
)

# 2. Temporal features
df_train = generate_temporal_features(
    df_train,
    start_date=TRAIN_START_DATE,
    end_date=TRAIN_END_DATE
)

# 3. Conversion features
df_train = generate_conversion_features(df_train)

# 4. Advanced features
df_train = generate_advanced_features(
    df_train,
    start_date=TRAIN_START_DATE,
    end_date=TRAIN_END_DATE
)

# 5. Periodic aggregations
df_train = generate_periodic_aggregations(
    df_train,
    start_date=TRAIN_START_DATE,
    end_date=TRAIN_END_DATE,
    num_periods=NUM_PERIODS
)

print("\n" + "=" * 60)
print(f"TOTAL FEATURES GENERATED: {len(df_train.columns) - 2}")  # -2 for user_id and target
print("=" * 60)

GENERATING TRAINING FEATURES

=== Generating Basic RFM Features ===
  Processing clicks...
  Processing favorites...
  Processing orders...
  Processing to_carts...
  Processing searches...
  Generated 27 RFM features

=== Generating Temporal Features ===
  Processing click temporal patterns...
  Processing favorite temporal patterns...
  Processing order temporal patterns...
  Processing to_cart temporal patterns...
  Generated temporal features

=== Generating Conversion Features ===
  Generated conversion features

=== Generating Advanced Features ===
  Calculating discount ratios...
  Calculating category diversity...
  Calculating widget diversity...
  Generated advanced features

=== Generating Periodic Aggregations ===
  Periods: 4 weeks + older
  Aggregating by period-user-action...
  Pivoting...
  Generated 260 periodic features for actions
  Agg

In [14]:
# Show feature summary
print("\nFeature Summary:")
print(f"Total columns: {len(df_train.columns)}")
print(f"Total rows: {df_train.shape[0]:,}")
print(f"\nFirst 20 columns:")
print(df_train.columns[:20].tolist())

# Check for nulls
print(f"\nColumns with nulls (>0%):")
null_pcts = (df_train.isnull().sum() / len(df_train) * 100).sort_values(ascending=False)
null_cols = null_pcts[null_pcts > 0]
if len(null_cols) > 0:
    for col, pct in null_cols.head(20).items():
        print(f"  {col}: {pct:.2f}%")
else:
    print("  No null values found!")


Feature Summary:
Total columns: 337
Total rows: 1,835,147

First 20 columns:
['user_id', 'target', 'num_products_click', 'num_unique_products_click', 'sum_discount_price_click', 'max_discount_price_click', 'days_since_last_click', 'days_since_first_click', 'num_products_favorite', 'num_unique_products_favorite', 'sum_discount_price_favorite', 'max_discount_price_favorite', 'days_since_last_favorite', 'days_since_first_favorite', 'num_products_order', 'num_unique_products_order', 'sum_discount_price_order', 'max_discount_price_order', 'days_since_last_order', 'days_since_first_order']

Columns with nulls (>0%):
  favorite_to_order_conversion: 90.42%
  max_discount_price_favorite: 83.67%
  days_since_last_favorite: 83.16%
  sum_discount_price_favorite: 83.16%
  num_unique_days_favorite: 83.16%
  is_new_user_favorite: 83.16%
  lifetime_favorite: 83.16%
  days_since_first_favorite: 83.16%
  favorite_day_of_week_favorite: 83.16%
  avg_hour_favorite: 83.16%
  num_unique_products_favorite: 8

## 6. Generate Features for Test Set

In [15]:
print("=" * 60)
print("GENERATING TEST FEATURES")
print("=" * 60)

# Start with test users
df_test = test_users.copy()
df_test['target'] = 0  # Dummy target for consistency

# 1. Basic RFM features (using data up to VAL_END_DATE)
df_test = generate_basic_rfm_features(
    df_test,
    start_date=TRAIN_START_DATE,
    end_date=VAL_END_DATE
)

# 2. Temporal features
df_test = generate_temporal_features(
    df_test,
    start_date=TRAIN_START_DATE,
    end_date=VAL_END_DATE
)

# 3. Conversion features
df_test = generate_conversion_features(df_test)

# 4. Advanced features
df_test = generate_advanced_features(
    df_test,
    start_date=TRAIN_START_DATE,
    end_date=VAL_END_DATE
)

# 5. Periodic aggregations
df_test = generate_periodic_aggregations(
    df_test,
    start_date=TRAIN_START_DATE,
    end_date=VAL_END_DATE,
    num_periods=NUM_PERIODS
)

print("\n" + "=" * 60)
print(f"TEST FEATURES GENERATED: {len(df_test.columns) - 2}")
print("=" * 60)

GENERATING TEST FEATURES

=== Generating Basic RFM Features ===
  Processing clicks...
  Processing favorites...
  Processing orders...
  Processing to_carts...
  Processing searches...
  Generated 27 RFM features

=== Generating Temporal Features ===
  Processing click temporal patterns...
  Processing favorite temporal patterns...
  Processing order temporal patterns...
  Processing to_cart temporal patterns...
  Generated temporal features

=== Generating Conversion Features ===
  Generated conversion features

=== Generating Advanced Features ===
  Calculating discount ratios...
  Calculating category diversity...
  Calculating widget diversity...
  Generated advanced features

=== Generating Periodic Aggregations ===
  Periods: 4 weeks + older
  Aggregating by period-user-action...
  Pivoting...
  Generated 260 periodic features for actions
  Aggre

## 7. Feature Selection and Cleaning

In [16]:
# Get feature columns (exclude user_id and target)
feature_cols = [col for col in df_train.columns if col not in ['user_id', 'target']]
print(f"Total features before cleaning: {len(feature_cols)}")

# Fill nulls with -1 (indicator for missing)
print("\nFilling null values with -1...")
df_train[feature_cols] = df_train[feature_cols].fillna(-1)
df_test[feature_cols] = df_test[feature_cols].fillna(-1)

print("Null values filled.")

Total features before cleaning: 335

Filling null values with -1...
Null values filled.
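
Indexing `df_test[feature_cols]` assumes the test frame contains every training column. In this run both frames have 337 columns, but pivoted features can differ between windows. The sketch below, on hypothetical toy frames, shows one defensive option (not used in the notebook): `reindex` the test frame to the training column order before filling.

```python
import pandas as pd

# Toy frames (hypothetical); the test frame lacks feature f_b.
train = pd.DataFrame({
    "user_id": [1, 2], "target": [0, 1], "f_a": [1.0, None], "f_b": [3.0, 4.0],
})
test = pd.DataFrame({"user_id": [3], "target": [0], "f_a": [5.0]})

feature_cols = [c for c in train.columns if c not in ("user_id", "target")]

# Missing columns are created as NaN; any extra test columns would be dropped.
test_aligned = test.reindex(columns=["user_id", "target"] + feature_cols)

# Same -1 sentinel as the cell above.
train[feature_cols] = train[feature_cols].fillna(-1)
test_aligned[feature_cols] = test_aligned[feature_cols].fillna(-1)

print(test_aligned)
```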


In [17]:
# Check for infinite values
print("\nChecking for infinite values...")

# Replace inf with very large number
df_train = df_train.replace([np.inf, -np.inf], 999999)
df_test = df_test.replace([np.inf, -np.inf], 999999)

print("Infinite values handled.")


Checking for infinite values...
Infinite values handled.
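
The cell above replaces infinities without reporting whether any exist. A minimal sketch of an explicit check on a hypothetical toy frame follows; the per-column count is an addition for illustration, while the `replace` call mirrors the notebook.

```python
import numpy as np
import pandas as pd

# Toy feature frame (hypothetical data).
df = pd.DataFrame({
    "ratio": [0.5, np.inf, -np.inf],
    "count": [1.0, 2.0, 3.0],
})

# Count infinite values per numeric column and keep only affected columns.
numeric = df.select_dtypes(include=[np.number])
inf_counts = np.isinf(numeric).sum()
inf_counts = inf_counts[inf_counts > 0]
print(inf_counts)

# Same replacement as in the notebook.
cleaned = df.replace([np.inf, -np.inf], 999999)
```

Note that this replacement maps negative infinity to the same large positive value as positive infinity, so the sign is lost.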


In [18]:
# Basic feature statistics
print("\nFeature Statistics:")
print(f"Total features: {len(feature_cols)}")
print(f"Train shape: {df_train.shape}")
print(f"Test shape: {df_test.shape}")

# Feature types
numeric_features = df_train[feature_cols].select_dtypes(include=[np.number]).columns.tolist()
print(f"\nNumeric features: {len(numeric_features)}")


Feature Statistics:
Total features: 335
Train shape: (1835147, 337)
Test shape: (2068424, 337)

Numeric features: 335


## 8. Save Features

In [19]:
# Save as parquet (compressed, fast)
output_dir = '../results'
os.makedirs(output_dir, exist_ok=True)

print("Saving features...")

df_train.to_parquet(os.path.join(output_dir, 'features_train_v2.parquet'), index=False)
df_test.to_parquet(os.path.join(output_dir, 'features_test_v2.parquet'), index=False)

print(f"\nFeatures saved to {output_dir}/")
print(f"  - features_train_v2.parquet: {df_train.shape}")
print(f"  - features_test_v2.parquet: {df_test.shape}")

Saving features...

Features saved to ../results/
  - features_train_v2.parquet: (1835147, 337)
  - features_test_v2.parquet: (2068424, 337)


In [20]:
# Save feature names
with open(os.path.join(output_dir, 'feature_names_v2.txt'), 'w') as f:
    for col in feature_cols:
        f.write(f"{col}\n")

print(f"\nFeature names saved to {output_dir}/feature_names_v2.txt")
print(f"Total: {len(feature_cols)} features")


Feature names saved to ../results/feature_names_v2.txt
Total: 335 features


## 9. Feature Summary

In [21]:
print("\n" + "=" * 70)
print("FEATURE ENGINEERING COMPLETE")
print("=" * 70)

print("\n📊 Summary:")
print(f"  Total features generated: {len(feature_cols)}")
print(f"  Training samples: {df_train.shape[0]:,}")
print(f"  Test samples: {df_test.shape[0]:,}")
print(f"  Positive class ratio: {positive_ratio:.2%}")

print("\n✅ Feature Categories:")
print("  - Basic RFM features (Recency, Frequency, Monetary)")
print("  - Temporal patterns (day of week, hour, activity consistency)")
print("  - Conversion features (funnel metrics)")
print("  - Advanced features (diversity, price sensitivity)")
print("  - Periodic aggregations (weekly windows plus older history)")

print("\n📁 Output Files:")
print(f"  - {output_dir}/features_train_v2.parquet")
print(f"  - {output_dir}/features_test_v2.parquet")
print(f"  - {output_dir}/feature_names_v2.txt")

print("\n🎯 Next Steps:")
print("  1. Review feature distributions in EDA")
print("  2. Train models (03_modeling.ipynb)")
print("  3. Analyze feature importance")
print("  4. Consider adding UMAP embeddings")

print("\n" + "=" * 70)


FEATURE ENGINEERING COMPLETE

📊 Summary:
  Total features generated: 335
  Training samples: 1,835,147
  Test samples: 2,068,424
  Positive class ratio: 34.59%

✅ Feature Categories:
  - Basic RFM features (Recency, Frequency, Monetary)
  - Temporal patterns (day of week, hour, activity consistency)
  - Conversion features (funnel metrics)
  - Advanced features (diversity, price sensitivity)
  - Periodic aggregations (weekly windows plus older history)

📁 Output Files:
  - ../results/features_train_v2.parquet
  - ../results/features_test_v2.parquet
  - ../results/feature_names_v2.txt

🎯 Next Steps:
  1. Review feature distributions in EDA
  2. Train models (03_modeling.ipynb)
  3. Analyze feature importance
  4. Consider adding UMAP embeddings



In [22]:
# Display sample of features
print("\nSample of generated features:")
display_cols = ['user_id', 'target'] + feature_cols[:10]
df_train[display_cols].head(10)


Sample of generated features:


|   | user_id | target | num_products_click | num_unique_products_click | sum_discount_price_click | max_discount_price_click | days_since_last_click | days_since_first_click | num_products_favorite | num_unique_products_favorite | sum_discount_price_favorite | max_discount_price_favorite |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16 | 0 | 1.0 | 1.0 | 335.0 | 335.0 | 118.0 | 118.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 1 | 34 | 0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 2 | 36 | 1 | 9.0 | 9.0 | 20407.0 | 17257.0 | 49.0 | 73.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 3 | 53 | 0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 4 | 54 | 0 | 1.0 | 1.0 | 110.0 | 110.0 | 4.0 | 4.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 5 | 58 | 0 | 6.0 | 6.0 | 1007.0 | 389.0 | 60.0 | 116.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 6 | 62 | 0 | 35.0 | 30.0 | 7378.0 | 647.0 | 19.0 | 96.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 7 | 64 | 1 | 103.0 | 77.0 | 37495.0 | 5310.0 | 10.0 | 121.0 | 1.0 | 1.0 | 318.0 | 318.0 |
| 8 | 83 | 0 | 30.0 | 29.0 | 17255.0 | 3373.0 | 12.0 | 96.0 | 2.0 | 2.0 | 2353.0 | 1199.0 |
| 9 | 90 | 0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
