# Simplified Recommender System

A streamlined recommender system framework with NLP capabilities for easy experimentation.

## Features

- Simple, focused implementation for quick iteration
- Pretrained NLP models for text features
- Time-based data splitting for realistic evaluation
- Optional hyperparameter tuning
- Minimal dependencies

## Installation

1. Install core dependencies:

```bash
pip install polars catboost optuna scikit-learn pandas numpy
```

2. Install NLP dependencies:

```bash
pip install sentence-transformers
```

## Project Structure

The simplified framework consists of these key components:

- `simplified_data_loader.py` - Handles data loading and time-based splitting
- `simplified_experiment.py` - Manages the experiment lifecycle
- `simplified_hyperparameter_tuner.py` - Provides optional parameter tuning
- `text_processor.py` - Processes text features using pretrained models
- `feature_factory.py` - Your existing feature generation code (unmodified)
- `model_factory.py` - Your existing model factory (unmodified)
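
The time-based splitting handled by `simplified_data_loader.py` can be pictured as cutting the event log at two timestamps, so that feature history, training rows, and validation rows never overlap in time. The sketch below is a minimal standard-library illustration, not the loader's actual code: the `time_split` helper, the tuple-based event format, and the default window lengths are all assumptions made for this example.

```python
from datetime import datetime, timedelta


def time_split(events, train_days=40, val_days=14):
    """Split (timestamp, payload) events into history / train / validation.

    Hypothetical helper: boundaries are measured back from the latest event.
    """
    if not events:
        return [], [], []
    end = max(ts for ts, _ in events)
    val_start = end - timedelta(days=val_days)
    train_start = val_start - timedelta(days=train_days)

    # History feeds feature generation; train and validation supply labels.
    history = [e for e in events if e[0] < train_start]
    train = [e for e in events if train_start <= e[0] < val_start]
    validation = [e for e in events if e[0] >= val_start]
    return history, train, validation


# Toy event log: one event per day for 100 days.
start = datetime(2023, 9, 23)
events = [(start + timedelta(days=d), f"event_{d}") for d in range(100)]
history, train, validation = time_split(events)
print(len(history), len(train), len(validation))
```

Because every validation event is strictly later than every training event, the validation score approximates how the model would behave on future, unseen traffic.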



## Basic Usage

```python
from lavka_recsys.config import Config
from lavka_recsys.simplified_experiment import SimpleExperiment
from lavka_recsys.text_processor import register_text_embedding_features

# Register text embedding features
register_text_embedding_features()

# Create configuration
config = Config({
    "experiment_name": "simple_recommender",
    "model": {
        "type": "catboost",
        "config": {
            "catboost": {
                "iterations": 500,
                "learning_rate": 0.1,
                "depth": 6
            }
        }
    },
    "features": [
        "count_purchase_user_product",
        "ctr_product",
        "user_stats", 
        "product_stats",
        "store_stats",
        "product_embeddings",
        "category_embeddings"
    ],
    "target": "CartUpdate_Purchase_vs_View",
    "data": {
        "train_path": "data/train.parquet",
        "test_path": "data/test.parquet"
    },
    "text_processing": {
        "model_type": "sentence-transformers",
        "model_name": "all-MiniLM-L6-v2",
        "embedding_dimensions": 20
    }
})

# Create and run experiment
experiment = SimpleExperiment("my_experiment", config)
results = experiment.run()
print(f"Metrics: {results['metrics']}")

# Generate predictions
submission = experiment.predict()
```


## Text Feature Options

The system supports multiple pretrained NLP models:

1. **Sentence Transformers** (recommended):
   - Modern, transformer-based embeddings
   - Configuration: `"model_type": "sentence-transformers"`
   - Good models: `"all-MiniLM-L6-v2"` (small and fast, 384-dimensional output), `"all-mpnet-base-v2"` (larger and slower, 768-dimensional output, generally higher-quality embeddings)

2. **Word2Vec** (optional):
   - Classic word embeddings
   - Configuration: `"model_type": "word2vec"`
   - Good models: `"glove-wiki-gigaword-100"`, `"word2vec-google-news-300"`
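
Pretrained sentence encoders emit far more dimensions than the `"embedding_dimensions": 20` used in the configuration, so the vectors have to be shrunk somewhere. How `text_processor.py` does this is not shown here. The sketch below assumes PCA, which is one common choice; `reduce_dimensions` and `embed_texts` are hypothetical names, and the random matrix only stands in for real embeddings.

```python
import numpy as np


def reduce_dimensions(embeddings, n_components=20):
    """Project row vectors onto their top principal components (PCA via SVD)."""
    X = np.asarray(embeddings, dtype=np.float64)
    X = X - X.mean(axis=0)  # center each column
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    k = min(n_components, vt.shape[0])
    return X @ vt[:k].T


def embed_texts(texts, model_name="all-MiniLM-L6-v2", n_components=20):
    """Encode texts with sentence-transformers, then reduce with PCA."""
    # Imported lazily so the PCA helper works without the NLP dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    vectors = model.encode(list(texts))  # NumPy array, one row per text
    return reduce_dimensions(vectors, n_components)


# Stand-in for real embeddings: 100 random 384-dimensional vectors.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(100, 384))
reduced = reduce_dimensions(fake_embeddings, n_components=20)
print(reduced.shape)
```

Fitting the projection on the product catalogue once and reusing it for train and test rows keeps the reduced features comparable across splits.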


## Hyperparameter Tuning

To tune hyperparameters first, call `run_with_tuning()` instead of `run()`, then pass the best parameters it returns to `predict()`:

```python
results = experiment.run_with_tuning()
best_params = results['best_params']
submission = experiment.predict(best_params)
```
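
For orientation, a tuning loop of this kind can be sketched with Optuna, which is among the core dependencies. This is not the code in `simplified_hyperparameter_tuner.py`: the function names, the search ranges, and the `evaluate` callable are assumptions for illustration. Only the four parameter names and the 20-trial default are taken from the tuner's log output.

```python
def suggest_catboost_params(trial):
    """Sample one CatBoost configuration from a hypothetical search space."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "depth": trial.suggest_int("depth", 4, 10),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1.0, 10.0),
        "iterations": trial.suggest_int("iterations", 100, 1000),
    }


def tune(evaluate, n_trials=20):
    """Maximize evaluate(params) -> validation score using Optuna."""
    import optuna  # imported lazily; only needed when tuning actually runs

    study = optuna.create_study(direction="maximize")
    study.optimize(
        lambda trial: evaluate(suggest_catboost_params(trial)),
        n_trials=n_trials,
    )
    return study.best_params, study.best_value
```

Here `evaluate` would be a function that trains a model on the training window with the given parameters and returns its validation AUC, for example `best_params, best_auc = tune(my_validation_auc)` with a user-supplied `my_validation_auc`.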

## Customization

For custom feature engineering, add your feature generators to `feature_factory.py` as before; the simplified framework uses that file unmodified.

## Memory Requirements

- **Base System**: ~2 GB RAM
- **With Sentence Transformers**: ~3 GB RAM (depends on model size)
- **With Word2Vec**: ~2.5 GB RAM

## Troubleshooting

- If you encounter memory issues, switch to a smaller pretrained model or lower `"embedding_dimensions"`
- For faster debugging, set `"sample_size": 10000` in the `"data"` section of your configuration
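
One way to keep a debug variant next to the full configuration is to merge a small override dictionary into a copy of the base settings. The `with_overrides` helper below is a hypothetical convenience, not part of `lavka_recsys`, and the reduced embedding size of 8 is an arbitrary illustrative value.

```python
import copy


def with_overrides(base, overrides):
    """Return a deep copy of base with nested overrides merged in."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


base_config = {
    "data": {"train_path": "data/train.parquet", "sample_size": None},
    "text_processing": {"model_name": "all-MiniLM-L6-v2", "embedding_dimensions": 20},
}

# Debug variant: subsample rows and shrink the embeddings.
debug_config = with_overrides(base_config, {
    "data": {"sample_size": 10000},
    "text_processing": {"embedding_dimensions": 8},
})
print(debug_config["data"])
```

The base dictionary is left untouched, so the full-size run and the debug run can live in the same notebook.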

In [1]:
import os
from datetime import datetime

In [2]:
%load_ext autoreload
%autoreload 2
from lavka_recsys.config import Config
from lavka_recsys.experiment import Experiment
from lavka_recsys.text_processor import register_text_embedding_features

register_text_embedding_features()

True

In [3]:
config_dict = {
    "experiment_name": "simple_recommender",
    "model": {
        "type": "catboost",
        "config": {
            "catboost": {
                "iterations": 500,
                "learning_rate": 0.1,
                "depth": 6,
                "verbose": 100
            }
        }
    },
    # Essential features only
    "features": [
        # Basic statistics
        "count_purchase_user_product",
        "ctr_product", 
        "user_stats",
        "product_stats",
        "store_stats",
        "recency_user_store",
        "time_features",
        "time_window_user_product",
        "session_features",
        "frequency_features",
        "product_popularity_trend",
        "cross_features",
        "user_segments",

        # NLP embeddings
        "product_embeddings",
        "category_embeddings"
    ],
    "target": "CartUpdate_Purchase_vs_View",
    "data": {
        "train_path": "data/train.parquet",
        "test_path": "data/test.parquet",
        "sample_size": None  # Use full dataset (or set to a number for testing)
    },
    "output": {
        "results_dir": "results",
        "save_model": True,
        "save_predictions": True
    },
    # Text processing with pretrained model
    "text_processing": {
        "model_type": "sentence-transformers",
        "model_name": "all-MiniLM-L6-v2",  # Small but effective model
        "embedding_dimensions": 20
    }
}

In [4]:
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
experiment_name = f"recommender_{timestamp}"

# Create configuration
config = Config(config_dict)

In [None]:
# Create experiment
experiment = Experiment(experiment_name, config)

print(f"Starting experiment: {experiment_name}")

# Choose one of these options:
    
# Option 1: Run without hyperparameter tuning (fastest)
# results = experiment.run()

# Option 2: Run with hyperparameter tuning (better results)
results = experiment.run_with_tuning()
    
print(f"Experiment metrics: {results['metrics']}")

2025-04-13 00:04:19,396 - lavka_recsys.experiment=recommender_20250413_000419 - INFO - Saved configuration to results/recommender_20250413_000419_1cbc16_config.json
2025-04-13 00:04:20,089 - lavka_recsys.DataLoader - INFO - Loaded train data: 14954417 rows
2025-04-13 00:04:20,089 - lavka_recsys.DataLoader - INFO - Loaded test data: 565231 rows
Starting experiment: recommender_20250413_000419
2025-04-13 00:04:20,090 - lavka_recsys.experiment=recommender_20250413_000419 - INFO - Starting experiment with tuning: recommender_20250413_000419_1cbc16
2025-04-13 00:04:20,092 - lavka_recsys.HyperparameterTuner - INFO - Starting hyperparameter tuning with 20 trials


[I 2025-04-13 00:04:20,093] A new study created in memory with name: no-name-da630aa6-283e-4bfe-a248-d0d3461e2925


2025-04-13 00:04:20,095 - lavka_recsys.HyperparameterTuner - INFO - Trial 0: Testing parameters {'learning_rate': 0.025855262377220935, 'depth': 6, 'l2_leaf_reg': 6.056010767225593, 'iterations': 831}
2025-04-13 00:04:22,185 - lavka_recsys.DataLoader - INFO - Validation split: history until 2023-11-07 21:53:44
2025-04-13 00:04:22,186 - lavka_recsys.DataLoader - INFO - Training: 2023-11-07 21:53:44 to 2023-12-17 21:53:44
2025-04-13 00:04:22,187 - lavka_recsys.DataLoader - INFO - Validation: 2023-12-17 21:53:44 to 2023-12-31 21:53:44
2025-04-13 00:04:22,187 - lavka_recsys.DataLoader - INFO - Split data: 12237090 history, 2025540 train, 691787 validation rows
2025-04-13 00:04:22,188 - lavka_recsys.FeatureFactory - INFO - Generating features: count_purchase_user_product, ctr_product, user_stats, product_stats, store_stats, recency_user_store, time_features, time_window_user_product, session_features, frequency_features, product_popularity_trend, cross_features, user_segments, product_embed

[I 2025-04-13 00:13:26,752] Trial 0 finished with value: 0.7372242166578341 and parameters: {'learning_rate': 0.025855262377220935, 'depth': 6, 'l2_leaf_reg': 6.056010767225593, 'iterations': 831}. Best is trial 0 with value: 0.7372242166578341.


2025-04-13 00:13:26,757 - lavka_recsys.HyperparameterTuner - INFO - Trial 1: Testing parameters {'learning_rate': 0.08531604484376926, 'depth': 10, 'l2_leaf_reg': 4.670743279863423, 'iterations': 283}
2025-04-13 00:13:32,559 - lavka_recsys.DataLoader - INFO - Validation split: history until 2023-11-07 21:53:44
2025-04-13 00:13:32,563 - lavka_recsys.DataLoader - INFO - Training: 2023-11-07 21:53:44 to 2023-12-17 21:53:44
2025-04-13 00:13:32,564 - lavka_recsys.DataLoader - INFO - Validation: 2023-12-17 21:53:44 to 2023-12-31 21:53:44
2025-04-13 00:13:32,565 - lavka_recsys.DataLoader - INFO - Split data: 12237090 history, 2025540 train, 691787 validation rows
2025-04-13 00:13:32,566 - lavka_recsys.FeatureFactory - INFO - Generating features: count_purchase_user_product, ctr_product, user_stats, product_stats, store_stats, recency_user_store, time_features, time_window_user_product, session_features, frequency_features, product_popularity_trend, cross_features, user_segments, product_embed

In [None]:
print(f"Experiment metrics: {results['metrics']}")
    
# Print top 10 most important features
print("\nTop 10 most important features:")
top_features = sorted(
    results['feature_importance'].items(), 
    key=lambda x: x[1], 
    reverse=True
)[:10]

for feature, importance in top_features:
    print(f"{feature}: {importance:.6f}")

Experiment metrics: {'auc': 0.7355136520346217, 'logloss': 0.12800882401530772}

Top 10 most important features:
mean_interval_days: 10.078548
count_purchase_u_p_right: 9.299074
product_total_purchases: 7.772214
product_total_purchases_right: 5.748494
purchases_month_u_p: 3.857708
user_product_purchase_cross: 3.370766
product_total_views_right: 3.049936
user_total_interactions: 2.894758
user_total_purchases: 2.821779
product_total_views: 2.763572


In [None]:
best_params = results.get('best_params')  # Will be None if using run() without tuning
submission = experiment.predict(best_params)

print("\nPredictions generated. First 5 rows:")
print(submission.head(5))

print(f"\nExperiment completed: {experiment_name}")

2025-04-12 23:28:02,345 - lavka_recsys.experiment=recommender_20250412_232422 - INFO - Training final model for prediction
2025-04-12 23:28:05,677 - lavka_recsys.DataLoader - INFO - Final split: history until 2023-12-17 21:53:44
2025-04-12 23:28:05,681 - lavka_recsys.DataLoader - INFO - Final training: 2023-12-17 21:53:44 to 2023-12-31 21:53:44
2025-04-12 23:28:05,683 - lavka_recsys.DataLoader - INFO - Final split data: 14262630 history, 691787 train rows
2025-04-12 23:28:05,683 - lavka_recsys.FeatureFactory - INFO - Generating features: count_purchase_user_product, ctr_product, user_stats, product_stats, store_stats, recency_user_store, time_features, time_window_user_product, session_features, frequency_features, product_popularity_trend, cross_features, user_segments, count_purchase_user_product, count_purchase_user_store, ctr_product, recency_user_product, user_stats, product_stats, store_stats, city_stats
2025-04-12 23:28:25,836 - lavka_recsys.FeatureFactory - INFO - Joined featur

In [None]:
submission

| index (u32) | request_id (u64) | predict (f64) |
|---|---|---|
| 65088 | 5858536007999182875 | 0.728352 |
| 190350 | 5858536007999182875 | 0.728352 |
| 210994 | 5858536007999182875 | 0.728352 |
| 243320 | 5858536007999182875 | 0.728352 |
| 465993 | 5858536007999182875 | 0.728352 |
| … | … | … |
| 511269 | 15600691769453273983 | 0.000753 |
| 433797 | 12645323795377778123 | 0.000751 |
| 368625 | 16619401595275987764 | 0.000604 |
| 430331 | 14789876256328305967 | 0.00056 |
