# Text Features Demo

This notebook demonstrates the text-based features added to the recommendation system. These features measure semantic relationships between candidate products and each user's purchase and cart history, which helps make recommendations more personalized.

In [None]:
import os
import polars as pl
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from pathlib import Path
from IPython.display import display  # explicit import so the cells also run outside Jupyter

from lavka_recsys.config import Config
from lavka_recsys.experiment import Experiment, ExperimentType

## 1. Load Configuration

We'll use the updated configuration with text features enabled.

In [None]:
# Load from file
config = Config.from_file('default_config.yaml')

# Check which text features are enabled
TEXT_FEATURE_NAMES = [
    'product_embeddings', 'category_embeddings', 'user_product_distance',
    'text_similarity_cluster', 'text_diversity_features',
]
text_features = [feature for feature in config.features if feature in TEXT_FEATURE_NAMES]
print(f"Enabled text features: {text_features}")

## 2. Feature Overview

Let's review three of the text features we've implemented:

1. **user_product_distance**: Calculates weighted similarity between target products and a user's purchase and cart history
   - `purchase_weighted_similarity`: Similarity between the product and the weighted purchase history
   - `cart_weighted_similarity`: Similarity between the product and the weighted cart history
   - `min_purchase_similarity`: Similarity to the closest purchased product
   - `min_cart_similarity`: Similarity to the closest carted product

2. **text_similarity_cluster**: Clusters products based on semantic similarity
   - `cluster`: Which semantic cluster the product belongs to
   - `cluster_purchase_ratio`: How often the user buys from this cluster
   - `cluster_cart_ratio`: How often the user adds items from this cluster to the cart

3. **text_diversity_features**: Measures novelty relative to a user's typical purchases
   - `distance_from_centroid`: How different the product is from the user's typical purchases
   - `relative_diversity`: Normalized novelty metric
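
As a rough illustration of the `user_product_distance` outputs, the sketch below computes a weighted-history similarity and a closest-item similarity with NumPy. It assumes cosine similarity over dense embeddings and uses interaction counts as weights. The function names, the toy vectors, and the weighting scheme are hypothetical and are not taken from `lavka_recsys`. In particular, the closest item is taken here as the one with the highest cosine similarity, which may differ from how `min_purchase_similarity` is actually computed.

```python
import numpy as np


def cosine_to_target(history: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Cosine similarity between each history row and a single target vector."""
    history_norms = np.linalg.norm(history, axis=1)
    target_norm = np.linalg.norm(target)
    return history @ target / (history_norms * target_norm)


def history_similarity_features(history, weights, target):
    """Hypothetical helper: similarity of a target product to a user's history."""
    # Weighted profile: weighted average of the history embeddings.
    profile = np.average(history, axis=0, weights=weights)
    weighted = profile @ target / (np.linalg.norm(profile) * np.linalg.norm(target))
    # Closest item: highest cosine similarity among individual history items.
    closest = cosine_to_target(history, target).max()
    return {
        "weighted_similarity": float(weighted),
        "closest_similarity": float(closest),
    }


# Toy 2-D embeddings; real text embeddings would have many more dimensions.
history = np.array([[1.0, 0.0], [0.0, 1.0]])
weights = np.array([3.0, 1.0])  # assumed weights, e.g. purchase counts
target = np.array([1.0, 0.0])

features = history_similarity_features(history, weights, target)
print(features)
```

The same helper could be applied once to purchase history and once to cart history to obtain the purchase and cart variants.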

These features help capture semantic relationships between products and user preferences in ways that traditional collaborative filtering can't.
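
A minimal sketch of how `cluster_purchase_ratio`, `distance_from_centroid`, and `relative_diversity` might be computed is shown below. It assumes cluster labels have already been assigned to products, uses Euclidean distance, and normalizes novelty by the average distance of the user's own purchases from their centroid. All of these choices, along with the helper names and toy data, are assumptions made for illustration and not the `lavka_recsys` implementation.

```python
import numpy as np


def cluster_purchase_ratio(purchased_clusters, target_cluster) -> float:
    """Share of a user's purchases that fall in the target product's cluster."""
    purchased_clusters = np.asarray(purchased_clusters)
    if purchased_clusters.size == 0:
        return 0.0  # no history: treat the ratio as zero
    return float(np.mean(purchased_clusters == target_cluster))


def diversity_features(purchase_embeddings: np.ndarray, target_embedding: np.ndarray):
    """Distance of a target product from the centroid of a user's purchases."""
    centroid = purchase_embeddings.mean(axis=0)
    distance = float(np.linalg.norm(target_embedding - centroid))
    # Assumed normalizer: how far the user's own purchases sit from the centroid.
    typical_spread = float(np.linalg.norm(purchase_embeddings - centroid, axis=1).mean())
    relative = distance / typical_spread if typical_spread > 0 else 0.0
    return {"distance_from_centroid": distance, "relative_diversity": relative}


# Toy data: cluster ids of four purchased products, target product in cluster 0.
ratio = cluster_purchase_ratio([0, 0, 1, 2], target_cluster=0)

purchases = np.array([[0.0, 0.0], [2.0, 0.0]])
target = np.array([1.0, 3.0])
diversity = diversity_features(purchases, target)

print(ratio, diversity)
```

With this normalization, a `relative_diversity` above 1 would mean the product is further from the centroid than the user's typical purchase.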

## 3. Run Simple Experiment

Let's run a simple experiment to see how these features perform.

In [None]:
# Set experiment name and output directory
experiment_name = "text_features_demo"
results_dir = f"results/{experiment_name}"

# Create output directory if it doesn't exist
os.makedirs(results_dir, exist_ok=True)

# Configure our experiment
text_config = config.copy()
text_config.set('experiment.type', 'single_run')
text_config.set('experiment.use_hyperparameter_tuning', False)
text_config.set('feature_selection.enabled', True)
text_config.set('feature_selection.n_features', 30)  # Using more features to ensure our text features are included
text_config.set('output.results_dir', results_dir)
text_config.set('data.sample_fraction', 0.1)  # Use a smaller dataset for faster execution

# Create experiment
text_experiment = Experiment(experiment_name, text_config)

# Setup and run experiment
text_experiment.setup()
results = text_experiment.run()

## 4. Analyze Results and Feature Importance

Let's examine how important our text features are in the model.

In [None]:
# Print metrics
print("Experiment Metrics:")
if 'metrics' in results:
    for metric, value in results['metrics'].items():
        print(f"  {metric}: {value:.4f}")

# Check feature importance
if 'feature_importance' in results:
    # Get all feature importance
    importances = results['feature_importance']
    
    # Create DataFrame for visualization
    importance_df = pd.DataFrame({
        'feature': list(importances.keys()),
        'importance': list(importances.values())
    }).sort_values('importance', ascending=False)
    
    # Plot top features
    plt.figure(figsize=(12, 6))
    top_n = min(20, len(importance_df))
    plt.barh(importance_df['feature'][:top_n], importance_df['importance'][:top_n])
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.title(f'Top {top_n} Feature Importances')
    plt.gca().invert_yaxis()  # Display with highest importance at the top
    plt.tight_layout()
    plt.show()
    
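    # Show text feature importances specifically.
    # Note: this is a substring heuristic, so a non-text feature whose name
    # contains one of these keywords would also be matched.
    text_keywords = ['embedding', 'text', 'distance', 'similarity', 'diversity', 'cluster']
    text_importance = importance_df[importance_df['feature'].str.contains('|'.join(text_keywords))]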
    
    print("\nText Feature Importances:")
    display(text_importance.head(20))
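
Selecting text features by keyword can also pick up unrelated features whose names happen to contain a keyword such as `distance`. One alternative, sketched here, is to filter on the exact output column names listed in the feature overview. The toy importances and the non-text feature names (`price`, `delivery_distance`) are made up for the example, and the output columns of `product_embeddings` and `category_embeddings` are omitted because this notebook does not list them.

```python
import pandas as pd

# Output columns named in the feature overview of this notebook.
TEXT_FEATURE_COLUMNS = {
    "purchase_weighted_similarity", "cart_weighted_similarity",
    "min_purchase_similarity", "min_cart_similarity",
    "cluster", "cluster_purchase_ratio", "cluster_cart_ratio",
    "distance_from_centroid", "relative_diversity",
}


def select_text_importances(importance_df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose feature name is an exact match in the allowlist."""
    mask = importance_df["feature"].isin(TEXT_FEATURE_COLUMNS)
    return importance_df[mask].sort_values("importance", ascending=False)


# Toy importances; values and non-text feature names are hypothetical.
toy_importances = pd.DataFrame({
    "feature": ["price", "delivery_distance", "min_cart_similarity", "relative_diversity"],
    "importance": [0.40, 0.25, 0.20, 0.15],
})

selected = select_text_importances(toy_importances)
print(selected)
```

Here `delivery_distance` is excluded, whereas a substring match on `distance` would have included it.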

## 5. Comparison Experiment Without Text Features

To quantify the impact of our text features, let's run a comparison experiment without them.

In [None]:
# Create a configuration without text features
no_text_config = text_config.copy()
no_text_features = [f for f in no_text_config.features if f not in text_features]
no_text_config.set('features', no_text_features)
no_text_config.set('output.results_dir', f"{results_dir}/no_text")

# Create output directory if it doesn't exist
os.makedirs(f"{results_dir}/no_text", exist_ok=True)

# Create and run experiment
no_text_experiment = Experiment(f"{experiment_name}_no_text", no_text_config)
no_text_experiment.setup()
no_text_results = no_text_experiment.run()

# Print metrics
print("Experiment Metrics Without Text Features:")
if 'metrics' in no_text_results:
    for metric, value in no_text_results['metrics'].items():
        print(f"  {metric}: {value:.4f}")

## 6. Compare Performance

Let's visualize the performance difference between models with and without text features.

In [None]:
# Extract metrics for comparison
if 'metrics' in results and 'metrics' in no_text_results:
    # Get common metrics
    common_metrics = set(results['metrics'].keys()) & set(no_text_results['metrics'].keys())
    
    # Prepare data for plotting
    metrics_to_plot = ['auc', 'precision', 'recall', 'f1', 'ndcg']
    metrics_to_plot = [m for m in metrics_to_plot if m in common_metrics]
    
    if metrics_to_plot:
        # Create comparison bar chart
        plt.figure(figsize=(10, 6))
        x = range(len(metrics_to_plot))
        width = 0.35
        
        with_text_values = [results['metrics'].get(m, 0) for m in metrics_to_plot]
        without_text_values = [no_text_results['metrics'].get(m, 0) for m in metrics_to_plot]
        
        plt.bar(x, with_text_values, width, label='With Text Features')
        plt.bar([i + width for i in x], without_text_values, width, label='Without Text Features')
        
        plt.xlabel('Metric')
        plt.ylabel('Value')
        plt.title('Performance Comparison: With vs. Without Text Features')
        plt.xticks([i + width/2 for i in x], metrics_to_plot)
        plt.legend()
        plt.tight_layout()
        plt.show()
        
        # Calculate percentage improvement
        print("\nPercentage Improvement with Text Features:")
        for i, metric in enumerate(metrics_to_plot):
            with_val = with_text_values[i]
            without_val = without_text_values[i]
            if without_val > 0:  # Avoid division by zero
                improvement = (with_val - without_val) / without_val * 100
                print(f"  {metric}: {improvement:.2f}%")
    else:
        print("No common metrics found for comparison.")
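
For reference, the percentage reported above is a relative change against the run without text features. The small helper below restates that arithmetic with a worked example; the helper name and the metric values are invented for illustration and are not results from this notebook.

```python
def relative_improvement(with_value: float, without_value: float):
    """Percentage change of with_value relative to without_value.

    Returns None when the baseline is zero, since the ratio is undefined.
    """
    if without_value == 0:
        return None
    return (with_value - without_value) / without_value * 100


# Made-up values: a metric of 0.66 with text features versus 0.60 without.
improvement = relative_improvement(0.66, 0.60)
print(f"{improvement:.2f}%")
```

A relative change can look large when the baseline metric is small, so it may be worth reading it alongside the absolute difference.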

## 7. Create Kaggle Submission

Let's create a submission file using our model with text features.

In [None]:
# Create submission
submission_df = text_experiment.create_kaggle_submission()

# Save submission to CSV
submission_path = f"{results_dir}/text_features_submission.csv"
submission_df.write_csv(submission_path)

print(f"Submission saved to {submission_path}")
print("\nSubmission Preview (first 5 rows):")
display(submission_df.head(5))

## 8. Conclusion

In this notebook, we've demonstrated three text-based features:

1. **user_product_distance**: Weighted similarity between products and a user's purchase and cart history
2. **text_similarity_cluster**: Product clusters based on semantic similarity
3. **text_diversity_features**: Novelty metrics comparing products to a user's typical purchases

These features are meant to capture semantic relationships that traditional collaborative filtering methods miss, especially for new or rare items. Comparing runs with and without them gives a first estimate of their impact on recommendation quality; note that both runs here use a 10% data sample and no hyperparameter tuning.

Potential next steps:
- Fine-tune embedding model parameters
- Experiment with different similarity metrics
- Test other clustering algorithms beyond k-means
- Combine text features with collaborative filtering in different ways
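
To make the similarity-metric step more concrete, the sketch below shows how cosine similarity and Euclidean distance can rank the same candidates differently. The embeddings, candidate names, and helper functions are hypothetical; the point is only that the choice of metric can change which product counts as closest.

```python
import numpy as np


def cosine_score(query: np.ndarray, item: np.ndarray) -> float:
    """Cosine similarity: compares direction and ignores vector length."""
    return float(query @ item / (np.linalg.norm(query) * np.linalg.norm(item)))


def euclidean_score(query: np.ndarray, item: np.ndarray) -> float:
    """Negated Euclidean distance, so a higher score always means more similar."""
    return float(-np.linalg.norm(query - item))


def rank_candidates(score_fn, query, candidates):
    """Return candidate names ordered from most to least similar."""
    scores = {name: score_fn(query, vector) for name, vector in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


query = np.array([1.0, 0.0])
candidates = {
    "same_direction_far": np.array([3.0, 0.0]),  # same direction, larger norm
    "nearby_off_axis": np.array([0.6, 0.6]),     # closer in space, different direction
}

cosine_ranking = rank_candidates(cosine_score, query, candidates)
euclidean_ranking = rank_candidates(euclidean_score, query, candidates)
print(cosine_ranking, euclidean_ranking)
```

If embeddings are L2-normalized beforehand, the two metrics produce the same ranking, so the difference matters mainly for unnormalized vectors.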