# Text Features Demo (Revised)

This notebook demonstrates the text-based features implemented in the recommendation system. These features aim to capture semantic relationships between products, and between products and users' historical interactions, to create more personalized and relevant recommendations.

In [1]:
import os
import polars as pl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns # For better plot aesthetics
from IPython.display import display
from pathlib import Path

from lavka_recsys import Config, Experiment, setup_logging

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
setup_logging()

<Logger lavka_recsys (DEBUG)>

## 1. Load Configuration and Define Text Features

We'll load the default configuration and then explicitly define which text features we want to test.

**Important note on embedding dimensions:** The `num_cols` parameter in the `@FeatureFactory.register` decorators in `text_processor.py` (e.g., for `product_embeddings` and `category_embeddings`) must equal the number of embedding dimensions actually produced. When PCA reduction is applied, that number is set by `text_processing.embedding_dimensions` in the configuration; otherwise it is the raw embedding size of the pre-trained model. For this demo, we set it explicitly.
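
One way to catch a dimension mismatch early is to compare the generated column names against the configured size. The sketch below is illustrative only: `expected_embedding_columns` and `check_embedding_columns` are hypothetical helpers, not part of `lavka_recsys`, and the naming scheme `<prefix>_<i>` is assumed from the `product_embed_X` / `cat_embed_X` columns mentioned in this notebook.

```python
def expected_embedding_columns(prefix: str, n_dims: int) -> list[str]:
    """Column names a generator is assumed to emit, e.g. product_embed_0..N-1."""
    return [f"{prefix}_{i}" for i in range(n_dims)]


def check_embedding_columns(columns: list[str], prefix: str, n_dims: int) -> None:
    """Raise ValueError if the generated columns do not match the configured size."""
    found = [c for c in columns if c.startswith(f"{prefix}_")]
    expected = expected_embedding_columns(prefix, n_dims)
    if sorted(found) != sorted(expected):
        raise ValueError(f"{prefix}: expected {n_dims} columns, found {len(found)}")


# Hypothetical feature-frame columns for a 3-dimensional embedding.
generated = ["user_id", "product_embed_0", "product_embed_1", "product_embed_2"]
check_embedding_columns(generated, "product_embed", 3)  # passes silently
```

Calling the check with `n_dims=20` on the same columns would raise a `ValueError`, which is the kind of mismatch the note above warns about.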

In [None]:
# Load base configuration from file
config = Config.load('default_config.yaml')

# Define the text feature generator names as registered in text_processor.py
ALL_TEXT_FEATURE_GENERATORS = [
    'product_embeddings',          # Generates product_embed_X columns
    'category_embeddings',         # Generates cat_embed_X columns
    # 'user_product_similarity',     # Generates similarity scores (weighted, max history)
    'text_similarity_cluster',     # Generates product_text_cluster and cluster_X_ratio columns
    'text_diversity_features'      # Generates distance_from_centroid, relative_diversity
]

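# This MUST match the num_cols in their respective @FeatureFactory.register decorators in text_processor.py.
# Use the same key path that is read back below (feature_config.text_processing.*).
config.set('feature_config.text_processing.embedding_dimensions', 20)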

print(f"Base configuration loaded. Intended text features for demo: {ALL_TEXT_FEATURE_GENERATORS}")
print(f"Configured embedding dimensions (after potential PCA): {config['feature_config.text_processing.embedding_dimensions']}")
print(f"Configured number of text clusters: {config['feature_config.text_processing.n_clusters']}")

Base configuration loaded. Intended text features for demo: ['product_embeddings', 'category_embeddings', 'user_product_similarity', 'text_similarity_cluster', 'text_diversity_features']
Configured embedding dimensions (after potential PCA): 20
Configured number of text clusters: 15


## 2. Feature Overview (Updated)

Let's review the text features we'll be demonstrating (names reflect the revised `text_processor.py`):

1.  **`product_embeddings`**: Generates dense vector representations (embeddings) for product names.
    * Outputs columns like `product_embed_0`, `product_embed_1`, ...
2.  **`category_embeddings`**: Generates embeddings for product category names.
    * Outputs columns like `cat_embed_0`, `cat_embed_1`, ...
3.  **`user_product_similarity`**: Calculates similarity scores between target products and a user's historical interactions (purchases, cart additions) based on their embeddings.
    * `purchase_weighted_similarity`: Cosine similarity to the weighted average embedding of user's purchased items.
    * `cart_weighted_similarity`: Cosine similarity to the weighted average embedding of user's cart items.
    * `max_purchase_similarity_history`: Maximum cosine similarity to any single item previously purchased by the user.
    * `max_cart_similarity_history`: Maximum cosine similarity to any single item previously added to cart by the user.
4.  **`text_similarity_cluster`**: Clusters products based on the semantic similarity of their text embeddings (e.g., product names).
    * `product_text_cluster` (Categorical): The ID of the semantic cluster the product belongs to.
    * `cluster_purchase_ratio`: Ratio of a user's purchases from the target product's cluster to their total purchases.
    * `cluster_cart_ratio`: Ratio of a user's cart additions from the target product's cluster to their total cart additions.
    * `cluster_view_ratio`: Ratio of a user's views from the target product's cluster to their total views.
5.  **`text_diversity_features`**: Measures how textually different or novel a target product is compared to the user's historical interactions.
    * `distance_from_centroid`: Euclidean distance between the target product's embedding and the centroid of the user's historically interacted items' embeddings.
    * `relative_diversity`: The `distance_from_centroid` normalized by the user's average historical interaction diversity.
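
To make the similarity scores in item 3 concrete, here is a minimal pure-Python sketch on toy 2-d vectors. It is not the code in `text_processor.py`: the helper names are hypothetical, and the weighting scheme (here, made-up purchase counts) is an assumption, since the notebook does not state how the weights are derived.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_average(vectors, weights):
    """Weighted mean of a list of equal-length vectors."""
    total = sum(weights)
    dims = len(vectors[0])
    return [
        sum(w * v[d] for v, w in zip(vectors, weights)) / total
        for d in range(dims)
    ]


# Toy embeddings: two purchased items and one candidate product.
history = [[1.0, 0.0], [0.0, 1.0]]
weights = [3.0, 1.0]   # assumed weighting, e.g. purchase counts
target = [1.0, 0.0]

profile = weighted_average(history, weights)                    # user's weighted profile
weighted_sim = cosine_similarity(target, profile)               # cf. purchase_weighted_similarity
max_sim = max(cosine_similarity(target, h) for h in history)    # cf. max_purchase_similarity_history
```

The weighted score summarizes the user's overall taste, while the maximum score reacts to a single very similar past item, so the two can differ noticeably for users with varied histories.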

These features help capture semantic relationships and user preferences that might be missed by traditional collaborative filtering or count-based features, especially for new or rare items.
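
The cluster ratios and diversity features can be sketched the same way. Again, this is an illustration under stated assumptions rather than the actual implementation: the helper names are hypothetical, and "average historical interaction diversity" is interpreted here as the mean distance of the user's own history items from their centroid.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def diversity_features(target, history):
    """Return (distance_from_centroid, relative_diversity) for one user-product pair."""
    center = centroid(history)
    distance = euclidean(target, center)
    # Assumed baseline: mean distance of the user's own history from its centroid.
    baseline = sum(euclidean(h, center) for h in history) / len(history)
    relative = distance / baseline if baseline > 0 else 0.0
    return distance, relative


def cluster_ratio(history_clusters, target_cluster):
    """Share of a user's past interactions that fall in the target product's cluster."""
    if not history_clusters:
        return 0.0
    return history_clusters.count(target_cluster) / len(history_clusters)


history = [[0.0, 0.0], [2.0, 0.0]]   # centroid is [1.0, 0.0]
distance, relative = diversity_features([1.0, 3.0], history)
ratio = cluster_ratio([4, 4, 7, 2], target_cluster=4)   # 2 of 4 interactions
```

A `relative_diversity` above 1 under this interpretation means the candidate sits farther from the user's centroid than their typical past item does.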

## 3. Run Experiment with Text Features

Let's configure and run an experiment that includes every text feature generator enabled in `ALL_TEXT_FEATURE_GENERATORS` above.

In [None]:
# Set experiment name and output directory
experiment_name_with_text = "text_features_demo_with_text"
results_dir_with_text = Path(config.get('output.results_dir', 'results')) / experiment_name_with_text
os.makedirs(results_dir_with_text, exist_ok=True)

# Add text features to the list of features to be generated
current_features = config.get('features', [])
# dict.fromkeys removes duplicates while keeping a stable feature order (set() does not)
updated_features = list(dict.fromkeys(current_features + ALL_TEXT_FEATURE_GENERATORS))

# Configure experiment settings for the demo.
# Work on a copy so the base config is left untouched for the comparison run below.
text_exp_config = (
    config.copy()
    .set('features', updated_features)
    .set('experiment.type', 'single_run')
    .set('experiment.use_hyperparameter_tuning', False)
    .set('output.results_dir', str(results_dir_with_text))
    .set('data.sample_fraction', config.get('data.sample_fraction', 0.1))
)

print(f"Running experiment WITH text features. Output directory: {results_dir_with_text}")
print(f"Features to be generated: {text_exp_config.get('features')}")

# Create and run experiment
results_with_text = None
try:
    text_experiment_runner = Experiment(experiment_name_with_text, text_exp_config)
    text_experiment_runner.setup()
    results_with_text = text_experiment_runner.run()
    print("Experiment WITH text features completed.")
except Exception as e:
    print(f"ERROR running experiment WITH text features: {e}")
    import traceback
    traceback.print_exc()

Running experiment WITH text features. Output directory: results/text_features_demo_with_text
Features to be generated: None
2025-05-18 00:33:27,766 - lavka_recsys.Experiment(text_features_demo_with_text_9e9c4d) - INFO - Initialized experiment: text_features_demo_with_text_9e9c4d
2025-05-18 00:33:27,773 - lavka_recsys.Experiment(text_features_demo_with_text_9e9c4d) - INFO - Config saved: results/text_features_demo_with_text/text_features_demo_with_text_9e9c4d_config.json
2025-05-18 00:33:27,774 - lavka_recsys.Experiment(text_features_demo_with_text_9e9c4d) - INFO - Setting up experiment environment...
2025-05-18 00:33:27,776 - lavka_recsys.DataLoader - INFO - Loading training data from ../../data/lavka/train.parquet


2025-05-18 00:33:28,070 - lavka_recsys.DataLoader - INFO - Loading test data from ../../data/lavka/test.parquet
2025-05-18 00:33:28,406 - lavka_recsys.DataLoader - INFO - Holdout Split:
2025-05-18 00:33:28,431 - lavka_recsys.DataLoader - INFO -   train:	2022-12-31 18:46:42 → 2024-01-03 17:31:52 (15_070_276 rows, 367 days)
2025-05-18 00:33:28,434 - lavka_recsys.DataLoader - INFO -   holdout:	2024-01-03 17:56:48 → 2024-02-02 17:34:51 (1_438_338 rows, 29 days)
2025-05-18 00:33:28,435 - lavka_recsys.Experiment(text_features_demo_with_text_9e9c4d) - INFO - Setup complete.
2025-05-18 00:33:28,436 - lavka_recsys.Experiment(text_features_demo_with_text_9e9c4d) - INFO - Starting experiment run...
2025-05-18 00:33:28,792 - lavka_recsys.DataLoader - INFO - Validation Split:
2025-05-18 00:33:28,815 - lavka_recsys.DataLoader - INFO -   train_history:	2022-12-31 18:46:42 → 2023-11-04 17:16:23 (12_082_523 rows, 307 days)
2025-05-18 00:33:28,819 - lavka_recsys.DataLoader - INFO -   train_target:	2023-


## 4. Analyze Results and Feature Importance (With Text Features)

Let's examine the metrics and see how important our text features are in the model.
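
Besides looking at individual features, it can help to ask what share of the total importance the text features carry as a group. The sketch below assumes `feature_importance` is a plain name-to-score mapping, as the cell that follows treats it; `text_importance_share` is a hypothetical helper and the example numbers are made up.

```python
import re

# Name fragments assumed to mark text-derived features, following the column
# names listed in the feature overview.
TEXT_PATTERN = re.compile(r"embed|similarity|cluster|diversity|centroid", re.IGNORECASE)


def text_importance_share(importances):
    """Fraction of total importance carried by features whose names match TEXT_PATTERN."""
    total = sum(importances.values())
    if total == 0:
        return 0.0
    text_total = sum(v for name, v in importances.items() if TEXT_PATTERN.search(name))
    return text_total / total


# Made-up importances, for illustration only.
example = {
    "product_embed_0": 10.0,
    "cluster_purchase_ratio": 15.0,
    "user_total_purchases": 50.0,   # hypothetical non-text feature
    "price": 25.0,                  # hypothetical non-text feature
}
share = text_importance_share(example)   # 25 of 100 importance points
```

Name-based matching is a heuristic: a non-text feature whose name happens to contain "similarity" or "cluster" would be counted as well, so the pattern list should be checked against the actual feature names.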

In [None]:
if results_with_text and 'metrics' in results_with_text:
    print("\nExperiment Metrics (WITH Text Features):")
    for metric, value in results_with_text['metrics'].items():
        print(f"  {metric}: {value:.4f}")
else:
    print("No metrics found for the experiment WITH text features.")

if results_with_text and 'feature_importance' in results_with_text:
    importances = results_with_text['feature_importance']
    if importances:
        importance_df = pd.DataFrame({
            'feature': list(importances.keys()),
            'importance': list(importances.values())
        }).sort_values('importance', ascending=False)
        
        plt.figure(figsize=(12, 8))
        top_n = min(30, len(importance_df))
        sns.barplot(x='importance', y='feature', data=importance_df.head(top_n), palette='viridis')
        plt.title(f'Top {top_n} Feature Importances (WITH Text Features)')
        plt.tight_layout()
        plt.show()
        
        # Define patterns to identify text-based features by their typical name components.
        # 'cluster' already matches 'product_text_cluster', so it needs no separate entry.
        text_feature_patterns = ['embed', 'similarity', 'cluster', 'diversity', 'centroid']
        text_feature_regex = '|'.join(text_feature_patterns)
        
        text_importance_df = importance_df[importance_df['feature'].str.contains(text_feature_regex, case=False, na=False)]
        
        print("\nText Feature Importances (from model with text features):")
        if not text_importance_df.empty:
            display(text_importance_df)
        else:
            print("No text features found in the importance list based on patterns.")
            print(f"Patterns used: {text_feature_patterns}")
            print("Available features:", importance_df['feature'].tolist()[:20]) 
    else:
        print("Feature importance data is empty.")
else:
    print("No feature importance found for the experiment WITH text features.")

## 5. Run Comparison Experiment Without Text Features

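To quantify the impact, let's run a similar experiment but explicitly exclude our text feature generators. Note that the cell below also sets `feature_selection.enabled` to `True` with `feature_selection.n_features` set to 50, which the first run did not set explicitly, so any difference in metrics may not come from the text features alone.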

In [None]:
experiment_name_no_text = "text_features_demo_no_text"
results_dir_no_text = Path(config.get('output.results_dir', 'results')) / experiment_name_no_text
os.makedirs(results_dir_no_text, exist_ok=True)

no_text_exp_config = config.copy()

# Exclude text features
base_features = no_text_exp_config.get('features', [])
features_without_text_generators = [f for f in base_features if f not in ALL_TEXT_FEATURE_GENERATORS]
no_text_exp_config.set('features', features_without_text_generators)

# Configure experiment settings
no_text_exp_config.set('experiment.type', 'single_run')
no_text_exp_config.set('experiment.use_hyperparameter_tuning', False)
no_text_exp_config.set('feature_selection.enabled', True) 
no_text_exp_config.set('feature_selection.n_features', 50) 
no_text_exp_config.set('output.results_dir', str(results_dir_no_text))
no_text_exp_config.set('data.sample_fraction', config.get('data.sample_fraction', 0.1))

print(f"Running experiment WITHOUT text features. Output directory: {results_dir_no_text}")
print(f"Features to be generated: {no_text_exp_config.get('features')}")

# Create and run experiment
results_no_text = None
try:
    no_text_experiment_runner = Experiment(experiment_name_no_text, no_text_exp_config)
    no_text_experiment_runner.setup()
    results_no_text = no_text_experiment_runner.run()
    print("Experiment WITHOUT text features completed.")
except Exception as e:
    print(f"ERROR running experiment WITHOUT text features: {e}")
    import traceback
    traceback.print_exc()

## 6. Compare Performance

Let's visualize the performance difference between the model trained with text features and the one without.

In [None]:
if results_with_text and 'metrics' in results_with_text and results_no_text and 'metrics' in results_no_text:
    metrics_with = results_with_text['metrics']
    metrics_without = results_no_text['metrics']
    
    # Get common metrics for fair comparison
    common_metric_keys = sorted(set(metrics_with) & set(metrics_without))
    
    if common_metric_keys:
        metrics_to_plot_df = pd.DataFrame({
            'Metric': common_metric_keys,
            'With Text Features': [metrics_with.get(m, 0) for m in common_metric_keys],
            'Without Text Features': [metrics_without.get(m, 0) for m in common_metric_keys]
        })
        
        display(metrics_to_plot_df.set_index('Metric'))
        
        # Plotting
        metrics_to_plot_df.set_index('Metric').plot(kind='bar', figsize=(12, 7))
        plt.title('Performance Comparison: With vs. Without Text Features')
        plt.ylabel('Score')
        plt.xticks(rotation=45, ha='right')
        plt.legend(title='Model Type')
        plt.tight_layout()
        plt.show()
        
        print("\nPercentage Improvement with Text Features:")
        for metric_name in common_metric_keys:
            val_with = metrics_with.get(metric_name, 0)
            val_without = metrics_without.get(metric_name, 0)
            if val_without != 0:
                improvement = ((val_with - val_without) / abs(val_without)) * 100
                print(f"  {metric_name}: {improvement:.2f}%")
            else:
                # Relative change is undefined when the baseline is 0
                print(f"  {metric_name}: N/A (baseline is 0, score with text features is {val_with:.4f})")
    else:
        print("No common metrics found between the two experiments for comparison.")
else:
    print("Metrics not available for one or both experiments. Cannot compare performance.")

## 7. Create Kaggle Submission (Optional)

If the model with text features performs well, you can generate a submission file.

In [None]:
if results_with_text: # Check if the experiment ran successfully
    print("\nAttempting to create Kaggle submission file from the model WITH text features...")
    try:
        # The Experiment class instance is 'text_experiment_runner'
        submission_df = text_experiment_runner.create_kaggle_submission()
        
        submission_path = results_dir_with_text / "text_features_submission.csv"
        submission_df.write_csv(submission_path)
        
        print(f"\nSubmission file saved to: {submission_path}")
        print("Submission Preview (first 5 rows):")
        display(submission_df.head(5))
    except AttributeError as e:
        print(f"Could not create submission. Experiment object might not be fully available or method missing: {e}")
    except Exception as e:
        print(f"An error occurred during submission file creation: {e}")
        import traceback
        traceback.print_exc()
else:
    print("Skipping submission file generation as the experiment with text features did not complete successfully or yield results.")

## 8. Conclusion & Next Steps

In this notebook, we've demonstrated the integration and potential impact of several text-based features. By comparing model performance with and without these features, we can assess their contribution.

Key takeaways:
* Text features can capture semantic nuances that other feature types might miss.
* The importance of individual text features can vary depending on the dataset and model.
* Configuration needs care: the embedding dimension must match the `num_cols` registered for each generator, and the choice of pre-trained model for `TextProcessor` determines the raw embedding size.

Potential next steps for further improvement:
* **Hyperparameter Tuning**: Tune the parameters of the text processing (e.g., `embedding_dimensions`, `n_clusters`) and the main recommendation model (e.g., CatBoost parameters) when text features are included.
* **Advanced Text Models**: Experiment with larger or more domain-specific pre-trained models for `TextProcessor` if computational resources allow.
* **Different Similarity/Distance Metrics**: Explore alternatives to cosine similarity or Euclidean distance where appropriate.
* **Interaction with Other Features**: Investigate how text features interact with other existing features in your model.
* **Error Analysis**: Deep dive into cases where the model with text features performs significantly better or worse to understand their effect more granularly.
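
As a small illustration of why the choice of metric matters, the sketch below compares three distance functions on toy vectors. The vectors and helper names are made up for this example, and `cosine_distance` assumes non-zero inputs.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; compares direction only (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


query = [1.0, 0.0]
candidates = {
    "same_direction": [5.0, 0.0],   # same direction as the query, much longer
    "nearby": [1.0, 1.0],           # close in space, different direction
}
metrics = {
    "cosine": cosine_distance,
    "euclidean": euclidean_distance,
    "manhattan": manhattan_distance,
}

# For each metric, pick the candidate it considers closest to the query.
closest = {
    name: min(candidates, key=lambda c: metric(query, candidates[c]))
    for name, metric in metrics.items()
}
```

Here cosine distance picks `same_direction`, while Euclidean and Manhattan distance both pick `nearby`. Whether embedding length carries useful signal for the products at hand is worth checking before switching metrics.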