## Summary

This notebook implements a **Hybrid Recommendation System** for FocusDesk that combines collaborative and content-based filtering.

### Key Components:

1. **Data Processing**: Loaded and preprocessed 1000 users, 500 packages, and 20,000 interaction events
2. **Weighted Interaction Matrix**: Created a user-item matrix with event weights (Booking=1.0, Start booking=1.0, Click=0.2, View=0.05; all other event types fall back to 0.05)
3. **Content-Based Filtering**: Used 16-dimensional text embeddings and cosine similarity
4. **Collaborative Filtering**: Applied TruncatedSVD (20 components) for matrix factorization
5. **Hybrid Model**: Combined both approaches with 60% collaborative + 40% content-based weighting

### Benefits of the Hybrid Approach:

- **Personalization**: Leverages user behavior patterns (collaborative filtering)
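- **Relevance**: Considers package content similarity (content-based filtering)
- **Cold Start Mitigation**: The content-based component scores packages from their embeddings alone, so it can rank packages with little interaction history. Users with no interactions still fall back to collaborative scores only in `recommend`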
- **Diversity**: Balances exploration (content) and exploitation (collaborative)

### Next Steps:

- Fine-tune the hybrid weights (60/40 split) based on A/B testing
- Add additional features (price, duration, difficulty level)
- Implement online learning to update recommendations in real-time
- Add diversity and serendipity metrics to evaluation

In [None]:
# Compare with individual models
print("=" * 80)
print("MODEL COMPARISON")
print("=" * 80)

# Collaborative Filtering only
print(f"\n{'Collaborative Filtering Only':^80}")
print("-" * 80)
collab_recs = get_collaborative_recommendations(random_user_id, n=5)
for idx, row in collab_recs.iterrows():
    print(f"{idx + 1}. {row['title']}")
    print(f"   Predicted Score: {row['predicted_score']:.4f}\n")

# Content-Based Filtering only (seeded with the user's highest-weighted package;
# idxmax picks the top interaction score, not the most recent event)
user_package_scores = interaction_matrix.loc[random_user_id]
already_interacted = user_package_scores[user_package_scores > 0].index.tolist()
if len(already_interacted) > 0:
    last_package = user_package_scores[user_package_scores > 0].idxmax()
    print(f"\n{'Content-Based Filtering Only':^80}")
    print(f"Based on last interacted package: {packages_df[packages_df['package_id'] == last_package]['title'].values[0]}")
    print("-" * 80)
    content_recs = get_content_recommendations(last_package, n=5)
    for idx, row in content_recs.iterrows():
        print(f"{idx + 1}. {row['title']}")
        print(f"   Similarity Score: {row['similarity_score']:.4f}\n")

print("\n" + "=" * 80)
print("EVALUATION COMPLETE")
print("=" * 80)

MODEL COMPARISON

                          Collaborative Filtering Only                          
--------------------------------------------------------------------------------
1. Crash Mathematics Package
   Predicted Score: 0.3570

2. Master History Package
   Predicted Score: 0.1834

3. Intro English Package
   Predicted Score: 0.1017

4. Master Physics Package
   Predicted Score: 0.0511

5. Advanced Programming Package
   Predicted Score: 0.0501


                          Content-Based Filtering Only                          
Based on last interacted package: Master Chemistry Package
--------------------------------------------------------------------------------
1. Crash English Package
   Similarity Score: 0.6376

2. Intro Physics Package
   Similarity Score: 0.5917

3. Advanced Economics Package
   Similarity Score: 0.5766

4. Intro English Package
   Similarity Score: 0.5696

5. Intro Art Package
   Similarity Score: 0.5664


EVALUATION COMPLETE


## Comparison: Individual Model Recommendations

Let's compare the hybrid recommendations with individual model outputs.

In [None]:
# Generate recommendations using the hybrid model
print(f"\n{'HYBRID RECOMMENDATIONS (60% Collaborative + 40% Content-Based)':^80}")
print("=" * 80)

hybrid_recommendations = recommend(random_user_id, n=5)

print("\nTop 5 Recommended Packages:\n")
for idx, row in hybrid_recommendations.iterrows():
    print(f"{idx + 1}. {row['title']}")
    print(f"   Hybrid Score: {row['hybrid_score']:.4f}")
    print()


         HYBRID RECOMMENDATIONS (60% Collaborative + 40% Content-Based)         

Top 5 Recommended Packages:

1. Crash Mathematics Package
   Hybrid Score: 0.7436

2. Master History Package
   Hybrid Score: 0.5105

3. Crash English Package
   Hybrid Score: 0.3587

4. Intro Physics Package
   Hybrid Score: 0.3435

5. Advanced Economics Package
   Hybrid Score: 0.3428



In [None]:
# Select a random user for testing
np.random.seed(42)
random_user_id = np.random.choice(interaction_matrix.index)

print("=" * 80)
print("EVALUATION: Testing Hybrid Recommendation System")
print("=" * 80)
print(f"\nSelected User ID: {random_user_id}")

# Get user's interaction history
user_interactions = events_df[events_df['user_id'] == random_user_id].copy()
user_interactions = user_interactions.merge(
    packages_df[['package_id', 'title']], 
    on='package_id', 
    how='left'
)

print(f"\n{'User Interaction History':^80}")
print("-" * 80)
print(f"Total interactions: {len(user_interactions)}")
print(f"\nBreakdown by event type:")
event_summary = user_interactions.groupby('event_type').size()
for event_type, count in event_summary.items():
    print(f"  - {event_type.capitalize()}: {count}")

print(f"\n{'Recent Interactions (Last 10)':^80}")
print("-" * 80)
# Rows keep the order of events_df, so "recent" here assumes the file is stored chronologically
recent_interactions = user_interactions.sort_index(ascending=False).head(10)
for idx, row in recent_interactions.iterrows():
    print(f"{row['event_type'].upper():8} | {row['title']}")

print(f"\n{'Top Interacted Packages (by weighted score)':^80}")
print("-" * 80)
user_package_scores = interaction_matrix.loc[random_user_id]
top_packages = user_package_scores[user_package_scores > 0].sort_values(ascending=False).head(5)
for package_id, score in top_packages.items():
    package_title = packages_df[packages_df['package_id'] == package_id]['title'].values[0]
    print(f"Score: {score:.2f} | {package_title}")

EVALUATION: Testing Hybrid Recommendation System

Selected User ID: user_00103

                            User Interaction History                            
--------------------------------------------------------------------------------
Total interactions: 22

Breakdown by event type:
  - Booking: 1
  - Click: 6
  - Message: 3
  - Search: 1
  - View: 11

                         Recent Interactions (Last 10)                          
--------------------------------------------------------------------------------
VIEW     | Crash English Package
VIEW     | Master History Package
VIEW     | Crash Mathematics Package
VIEW     | Crash Design Package
CLICK    | Crash Mathematics Package
MESSAGE  | nan
CLICK    | Advanced Art Package
VIEW     | Master Programming Package
VIEW     | Crash Chemistry Package
MESSAGE  | nan

                  Top Interacted Packages (by weighted score)                   
--------------------------------------------------------------------------------
Score

## 9. Evaluation and Testing

Test the hybrid recommendation system with a random user, showing their interaction history and personalized recommendations.

In [None]:
def recommend(user_id, n=5, collaborative_weight=0.6, content_weight=0.4):
    """
    Hybrid recommendation system combining collaborative and content-based filtering.
    
    Parameters:
    -----------
    user_id : str
        The user ID to get recommendations for
    n : int
        Number of recommendations to return (default: 5)
    collaborative_weight : float
        Weight for collaborative filtering (default: 0.6)
    content_weight : float
        Weight for content-based filtering (default: 0.4)
    
    Returns:
    --------
    DataFrame with top N recommended package titles and hybrid scores
    """
    if user_id not in predicted_scores_df.index:
        print(f"User ID {user_id} not found!")
        return pd.DataFrame()
    
    # Step 1: Get collaborative filtering scores
    user_predictions = predicted_scores_df.loc[user_id]
    
    # Get packages the user has already interacted with
    interacted_packages = interaction_matrix.loc[user_id]
    already_interacted = interacted_packages[interacted_packages > 0].index.tolist()
    
    # Step 2: Pick a seed package for content-based filtering
    if len(already_interacted) > 0:
        # idxmax returns the highest-weighted package (the first one on ties), not the most recent
        last_package_id = interacted_packages[interacted_packages > 0].idxmax()
        
        # Get content-based scores
        if last_package_id in content_similarity_df.index:
            content_scores = content_similarity_df[last_package_id]
        else:
            # Fallback: use zero content scores
            content_scores = pd.Series(0, index=content_similarity_df.index)
    else:
        # If user has no interactions, use only collaborative filtering
        content_scores = pd.Series(0, index=user_predictions.index)
        content_weight = 0
        collaborative_weight = 1.0
    
    # Step 3: Normalize scores to 0-1 range for fair combination
    # Normalize collaborative scores
    collab_min, collab_max = user_predictions.min(), user_predictions.max()
    if collab_max > collab_min:
        collab_normalized = (user_predictions - collab_min) / (collab_max - collab_min)
    else:
        collab_normalized = user_predictions
    
    # Normalize content scores
    content_min, content_max = content_scores.min(), content_scores.max()
    if content_max > content_min:
        content_normalized = (content_scores - content_min) / (content_max - content_min)
    else:
        content_normalized = content_scores
    
    # Step 4: Combine scores with weighted average
    # Align indices
    common_packages = collab_normalized.index.intersection(content_normalized.index)
    hybrid_scores = (
        collaborative_weight * collab_normalized[common_packages] +
        content_weight * content_normalized[common_packages]
    )
    
    # Step 5: Filter out already interacted packages
    uninteracted_scores = hybrid_scores[~hybrid_scores.index.isin(already_interacted)]
    
    # Step 6: Get top N recommendations
    top_recommendations = uninteracted_scores.sort_values(ascending=False).head(n)
    
    # Create result DataFrame
    result = pd.DataFrame({
        'package_id': top_recommendations.index,
        'hybrid_score': top_recommendations.values
    })
    
    # Add package titles
    result = result.merge(
        packages_df[['package_id', 'title']], 
        on='package_id', 
        how='left'
    )
    
    return result[['package_id', 'title', 'hybrid_score']]

# Test the hybrid recommendation function
test_user_id = interaction_matrix.index[10]
print(f"Testing Hybrid Recommendation System for User ID: {test_user_id}")
print("\nTop 5 Hybrid Recommendations:")
print(recommend(test_user_id, n=5))

Testing Hybrid Recommendation System for User ID: user_00011

Top 5 Hybrid Recommendations:
  package_id                       title  hybrid_score
0   pkg_0032    Advanced Physics Package      0.782209
1   pkg_0330     Crash Chemistry Package      0.719212
2   pkg_0030      Master Biology Package      0.592355
3   pkg_0468  Master Mathematics Package      0.588874
4   pkg_0240   Crash Mathematics Package      0.583956


## 8. Hybrid Recommendation Engine

**Goal**: Combine collaborative and content-based filtering for final recommendations.

**Hybrid Formula:**
```
Hybrid Score = (0.6 × Collaborative Score) + (0.4 × Content Score)
```

The hybrid approach:
1. Gets collaborative scores (user preferences)
2. Gets content scores (based on the user's highest-weighted package)
3. Normalizes both to 0-1 range
4. Combines with weighted average
5. Returns top N unvisited packages
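
The five steps above can be sketched on toy data. This is a minimal illustration in plain Python, not the notebook's `recommend` function: the package names, the score values, and the helper `min_max` are all made up for the example.

```python
def min_max(scores):
    """Rescale a dict of scores to the 0-1 range; leave it unchanged if all values are equal."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return dict(scores)
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

# Hypothetical raw scores for four packages (illustrative values only)
collab = {"pkg_a": 0.30, "pkg_b": 0.10, "pkg_c": 0.00, "pkg_d": 0.20}
content = {"pkg_a": 0.20, "pkg_b": 0.80, "pkg_c": 0.50, "pkg_d": -0.10}
already_interacted = {"pkg_c"}

collab_n = min_max(collab)
content_n = min_max(content)

# Weighted average with the 60/40 split, over packages present in both score sets
hybrid = {
    pkg: 0.6 * collab_n[pkg] + 0.4 * content_n[pkg]
    for pkg in collab_n.keys() & content_n.keys()
}

# Drop packages the user has already interacted with, then rank by hybrid score
ranked = sorted(
    ((pkg, score) for pkg, score in hybrid.items() if pkg not in already_interacted),
    key=lambda item: item[1],
    reverse=True,
)
top_2 = ranked[:2]
```

Normalizing before blending matters because raw SVD scores and cosine similarities live on different scales; without it, the 60/40 weights would not reflect the intended balance.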

In [None]:
def get_collaborative_recommendations(user_id, n=5):
    """
    Get top N package recommendations for a user based on collaborative filtering.
    
    Parameters:
    -----------
    user_id : str
        The user ID to get recommendations for
    n : int
        Number of recommendations to return (default: 5)
    
    Returns:
    --------
    DataFrame with package_id, title, subject, and predicted_score
    """
    if user_id not in predicted_scores_df.index:
        print(f"❌ Error: User ID '{user_id}' not found!")
        return pd.DataFrame()
    
    # Get predicted scores for the user
    user_predictions = predicted_scores_df.loc[user_id]
    
    # Get packages the user has already interacted with
    interacted_packages = interaction_matrix.loc[user_id]
    already_interacted = interacted_packages[interacted_packages > 0].index.tolist()
    
    # Filter out already interacted packages
    uninteracted_predictions = user_predictions[~user_predictions.index.isin(already_interacted)]
    
    # Get top N recommendations
    top_recommendations = uninteracted_predictions.sort_values(ascending=False).head(n)
    
    # Create result DataFrame
    result = pd.DataFrame({
        'package_id': top_recommendations.index,
        'predicted_score': top_recommendations.values
    })
    
    # Add package details
    result = result.merge(
        packages_df[['package_id', 'title', 'subject']], 
        on='package_id', 
        how='left'
    )
    
    return result[['package_id', 'title', 'subject', 'predicted_score']]


# Test the function
test_user_id = interaction_matrix.index[5]

print(f"Testing Collaborative Filtering Recommendations")
print(f"="*70)
print(f"User ID: {test_user_id}")
print(f"\nTop 5 Recommended Packages:")
print("="*70)
recommendations = get_collaborative_recommendations(test_user_id, n=5)
for idx, row in recommendations.iterrows():
    print(f"{idx+1}. {row['title']}")
    print(f"   Subject: {row['subject']} | Predicted Score: {row['predicted_score']:.4f}\n")

Testing collaborative filtering for User ID: user_00001

Top 5 Recommended Packages:
  package_id                       title  predicted_score
0   pkg_0316        Crash Design Package         0.010364
1   pkg_0324  Master Programming Package         0.010156
2   pkg_0327    Advanced History Package         0.006363
3   pkg_0108  Advanced Economics Package         0.006361
4   pkg_0393       Crash English Package         0.006043


## 7. Collaborative Filtering Recommendation Function

**Goal**: Implement a function to get recommendations based on collaborative filtering.

This function predicts scores for packages the user hasn't interacted with and returns the top N recommendations.
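
As a rough sketch of the selection step, the snippet below filters out already-seen packages and keeps the highest-scoring remainder. It uses plain Python rather than pandas, and the scores, package names, and the helper `top_n` are hypothetical.

```python
import heapq

# Hypothetical predicted scores for one user (illustrative values)
predicted = {"pkg_a": 0.36, "pkg_b": 0.18, "pkg_c": 0.10, "pkg_d": 0.05}
# Packages with a non-zero entry in the user's interaction row
already_interacted = {"pkg_a"}

def top_n(scores, exclude, n=5):
    """Return the n highest-scoring (package, score) pairs that are not in `exclude`."""
    candidates = ((pkg, s) for pkg, s in scores.items() if pkg not in exclude)
    return heapq.nlargest(n, candidates, key=lambda item: item[1])

recs = top_n(predicted, already_interacted, n=2)
```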

In [None]:
# Apply TruncatedSVD (Matrix Factorization)
n_components = 20
random_state = 42

print(f"Training TruncatedSVD with {n_components} components...")

svd = TruncatedSVD(n_components=n_components, random_state=random_state)
user_factors = svd.fit_transform(interaction_matrix)
package_factors = svd.components_.T

print(f"✓ SVD Model trained successfully!")
print(f"\nModel Details:")
print(f"  - Number of components: {n_components}")
print(f"  - User factors shape: {user_factors.shape}")
print(f"  - Package factors shape: {package_factors.shape}")
print(f"  - Explained variance ratio: {svd.explained_variance_ratio_.sum():.4f}")

# Reconstruct the prediction matrix
predicted_scores = np.dot(user_factors, package_factors.T)
predicted_scores_df = pd.DataFrame(
    predicted_scores,
    index=interaction_matrix.index,
    columns=interaction_matrix.columns
)

print(f"\n✓ Prediction matrix reconstructed!")
print(f"  - Shape: {predicted_scores_df.shape}")
print(f"  - Data type: {predicted_scores_df.dtypes.iloc[0]}")

print(f"\nSample Predicted Scores (First 5×5):")
print("="*70)
print(predicted_scores_df.iloc[:5, :5].round(4))

SVD Model trained successfully!
Number of components: 20
User factors shape: (1000, 20)
Package factors shape: (500, 20)
Explained variance ratio: 0.1596

Predicted scores matrix shape: (1000, 500)


## 6. Collaborative Filtering - Matrix Factorization

**Goal**: Apply TruncatedSVD to learn latent user and package features.

TruncatedSVD decomposes the interaction matrix into:
- **User factors**: Latent user preferences
- **Package factors**: Latent package characteristics

We'll use 20 components to capture the main patterns in user-package interactions.
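
In matrix terms, the factorization can be written as below. The symbols are notation introduced for this note rather than names from the code: $R$ is the 1000 × 500 interaction matrix, $k = 20$ is the number of components, and $\hat{r}_{ui}$ is the predicted score of user $u$ for package $i$.

$$
R \approx U_k \Sigma_k V_k^{\top}, \qquad \hat{r}_{ui} = \sum_{f=1}^{k} (U_k \Sigma_k)_{uf} \, (V_k)_{if}
$$

With scikit-learn's `TruncatedSVD`, `fit_transform` returns $U_k \Sigma_k$ (the user factors) and `components_` holds $V_k^{\top}$, so multiplying the two yields the reconstructed score matrix.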

In [None]:
def get_content_recommendations(package_id, n=5):
    """
    Get top N similar packages based on content similarity.
    
    Parameters:
    -----------
    package_id : str
        The package ID to find similar packages for
    n : int
        Number of recommendations to return (default: 5)
    
    Returns:
    --------
    DataFrame with package_id, title, and similarity_score
    """
    if package_id not in content_similarity_df.index:
        print(f"❌ Error: Package ID '{package_id}' not found!")
        return pd.DataFrame()
    
    # Get similarity scores for the given package
    similarities = content_similarity_df[package_id].sort_values(ascending=False)
    
    # Exclude the package itself and get top N
    similar_packages = similarities[similarities.index != package_id].head(n)
    
    # Create result DataFrame
    result = pd.DataFrame({
        'package_id': similar_packages.index,
        'similarity_score': similar_packages.values
    })
    
    # Add package titles
    result = result.merge(
        packages_df[['package_id', 'title', 'subject']], 
        on='package_id', 
        how='left'
    )
    
    return result[['package_id', 'title', 'subject', 'similarity_score']]


# Test the function
test_package_id = packages_df['package_id'].iloc[0]
test_package_title = packages_df[packages_df['package_id'] == test_package_id]['title'].values[0]

print(f"Testing Content-Based Recommendations")
print(f"="*70)
print(f"Input Package: {test_package_id} - '{test_package_title}'")
print(f"\nTop 5 Similar Packages:")
print("="*70)
recommendations = get_content_recommendations(test_package_id, n=5)
for idx, row in recommendations.iterrows():
    print(f"{idx+1}. {row['title']}")
    print(f"   Subject: {row['subject']} | Similarity: {row['similarity_score']:.4f}\n")

Testing content-based recommendations for Package ID: pkg_0001
Package: Crash Design Package

Top 5 Similar Packages:
  package_id                     title  similarity_score
0   pkg_0111      Crash Design Package          0.750600
1   pkg_0314      Intro Design Package          0.640719
2   pkg_0369     Master Design Package          0.618573
3   pkg_0329     Crash Physics Package          0.617461
4   pkg_0300  Advanced Physics Package          0.599338


## 5. Content-Based Recommendation Function

**Goal**: Implement a function to get similar packages based on content.

This function takes a `package_id` and returns the top N most similar packages using the cosine similarity scores.

In [None]:
# Create embedding matrix from parsed vectors
embedding_matrix = np.vstack(packages_df['embedding_vector'].values)

print(f"✓ Embedding matrix created!")
print(f"  - Shape: {embedding_matrix.shape}")
print(f"  - Data type: {embedding_matrix.dtype}")

# Calculate cosine similarity matrix
content_similarity = cosine_similarity(embedding_matrix)

print(f"\n✓ Cosine similarity computed!")
print(f"  - Similarity matrix shape: {content_similarity.shape}")

# Create DataFrame for easier lookup
content_similarity_df = pd.DataFrame(
    content_similarity,
    index=packages_df['package_id'],
    columns=packages_df['package_id']
)

print(f"\n✓ Content Similarity Matrix ready!")
print(f"\nSimilarity Statistics:")
print(f"  - Min similarity: {content_similarity.min():.4f}")
print(f"  - Max similarity: {content_similarity.max():.4f}")
print(f"  - Mean similarity: {content_similarity.mean():.4f}")

print(f"\nSample Similarity Matrix (First 5×5):")
print("="*70)
print(content_similarity_df.iloc[:5, :5].round(4))

Embedding matrix shape: (500, 16)

Content Similarity Matrix Shape: (500, 500)

Sample similarities (first 5x5):
package_id  pkg_0001  pkg_0002  pkg_0003  pkg_0004  pkg_0005
package_id                                                  
pkg_0001    1.000000 -0.075379 -0.390542 -0.101092 -0.255605
pkg_0002   -0.075379  1.000000 -0.099113 -0.159718  0.051001
pkg_0003   -0.390542 -0.099113  1.000000  0.106459 -0.313470
pkg_0004   -0.101092 -0.159718  0.106459  1.000000  0.018445
pkg_0005   -0.255605  0.051001 -0.313470  0.018445  1.000000


## 4. Content-Based Model - Cosine Similarity

**Goal**: Calculate similarity between packages based on their text embeddings.

We'll compute a **Cosine Similarity Matrix** where each cell (i, j) represents how similar package i is to package j based on their 16-dimensional text embeddings.
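
For intuition, here is a small sketch of the same computation on made-up 3-dimensional vectors, written in plain Python instead of `sklearn`. The package names and embedding values are hypothetical, and returning 0.0 for zero-norm vectors is a choice made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings (the notebook uses 16 dimensions)
embeddings = {
    "pkg_x": [1.0, 0.0, 1.0],
    "pkg_y": [2.0, 0.0, 2.0],    # same direction as pkg_x, different length
    "pkg_z": [-1.0, 0.0, -1.0],  # opposite direction to pkg_x
}

# Pairwise similarity keyed by (package_i, package_j)
ids = list(embeddings)
similarity = {
    (i, j): cosine(embeddings[i], embeddings[j]) for i in ids for j in ids
}
```

Cosine similarity depends only on direction, so `pkg_x` and `pkg_y` come out as (near) 1 despite their different lengths, while opposing vectors come out as (near) -1. Values therefore range from -1 to 1, not 0 to 1.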

In [None]:
# Create User-Item Interaction Matrix
interaction_matrix = interaction_scores.pivot(
    index='user_id',
    columns='package_id',
    values='score'
).fillna(0)  # Fill missing values with 0

print(f"✓ User-Item Interaction Matrix created!")
print(f"\nMatrix Dimensions:")
print(f"  - Users (rows): {interaction_matrix.shape[0]:,}")
print(f"  - Packages (columns): {interaction_matrix.shape[1]:,}")
print(f"  - Total cells: {interaction_matrix.shape[0] * interaction_matrix.shape[1]:,}")

# Calculate sparsity
total_cells = interaction_matrix.shape[0] * interaction_matrix.shape[1]
zero_cells = (interaction_matrix == 0).sum().sum()
sparsity = (zero_cells / total_cells) * 100

print(f"\nMatrix Sparsity:")
print(f"  - Zero cells: {zero_cells:,}")
print(f"  - Non-zero cells: {total_cells - zero_cells:,}")
print(f"  - Sparsity: {sparsity:.2f}%")

print(f"\nSample Matrix (First 5 users × 5 packages):")
print("="*70)
print(interaction_matrix.iloc[:5, :5])

Interaction Matrix Shape: (1000, 500)
Users: 1000, Packages: 500
Sparsity: 97.24%

Interaction Matrix (first 5x5):
package_id  pkg_0001  pkg_0002  pkg_0003  pkg_0004  pkg_0005
user_id                                                     
user_00001       0.0       0.0       0.0       0.0       0.0
user_00002       0.0       0.0       0.0       0.0       0.0
user_00003       0.0       0.0       0.0       0.0       0.0
user_00004       0.0       0.2       0.0       0.0       0.0
user_00005       0.0       0.0       0.0       0.0       0.0


In [None]:
# Aggregate weighted scores by user-package pairs
interaction_scores = events_df.groupby(['user_id', 'package_id'])['weight'].sum().reset_index()
interaction_scores.columns = ['user_id', 'package_id', 'score']

print(f"✓ Interaction scores aggregated!")
print(f"\nInteraction Summary:")
print(f"  - Total unique user-package pairs: {len(interaction_scores):,}")
print(f"  - Unique users: {interaction_scores['user_id'].nunique():,}")
print(f"  - Unique packages: {interaction_scores['package_id'].nunique():,}")
print(f"\nScore Statistics:")
print(f"  - Min score: {interaction_scores['score'].min():.4f}")
print(f"  - Max score: {interaction_scores['score'].max():.4f}")
print(f"  - Mean score: {interaction_scores['score'].mean():.4f}")
print(f"\nSample Interaction Scores:")
print("="*70)
print(interaction_scores.head(10).to_string(index=False))

Total unique user-package interactions: 14954

Sample interaction scores:
      user_id package_id  score
0  user_00001   pkg_0050   0.20
1  user_00001   pkg_0053   0.05
2  user_00001   pkg_0135   0.05
3  user_00001   pkg_0145   0.05
4  user_00001   pkg_0214   0.05
5  user_00001   pkg_0232   0.05
6  user_00001   pkg_0242   0.05
7  user_00001   pkg_0337   0.05
8  user_00001   pkg_0344   0.05
9  user_00001   pkg_0410   0.05


In [None]:
# Define event type weights
event_weights = {
    'booking': 1.0,      # Strongest signal - actual purchase
    'start_booking': 1.0, # Also treat booking initiation as strong signal
    'click': 0.2,        # Medium signal - clicked for details
    'view': 0.05         # Weak signal - just browsed
}

# Apply weights to event types
events_df['weight'] = events_df['event_type'].map(event_weights)

# Check for any unmapped event types
unmapped_events = events_df[events_df['weight'].isna()]
if len(unmapped_events) > 0:
    print(f"⚠ Warning: Found {len(unmapped_events)} unmapped events")
    print(f"  Unique unmapped types: {unmapped_events['event_type'].unique()}")
    # Fill unmapped with minimal weight (assign back; inplace fillna on a column selection is unreliable)
    events_df['weight'] = events_df['weight'].fillna(0.05)
else:
    print("✓ All event types mapped successfully!")

print(f"\nEvent Type Distribution:")
print("="*50)
event_counts = events_df['event_type'].value_counts()
for event_type, count in event_counts.items():
    weight = event_weights.get(event_type, 0.05)
    print(f"  {event_type:15s}: {count:6,} events (weight: {weight})")

Unique unmapped types: ['start_booking' 'search' 'rating' 'message']

Event type distribution:
event_type
view             9983
click            3611
search           2444
message          2392
start_booking     608
rating            583
booking           379
Name: count, dtype: int64


## 3. Feature Engineering - Interaction Matrix

**Goal**: Create a weighted User-Item Interaction Matrix based on event types.

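**Event Weights:**
- **Booking** = 1.0 (strong positive signal)
- **Start booking** = 1.0 (treated as an equally strong signal)
- **Click** = 0.2 (medium interest)
- **View** = 0.05 (weak interest)
- **Any other event type** = 0.05 (fallback for unmapped events such as search, message, and rating)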

We'll aggregate these weighted scores to create a matrix where rows represent users, columns represent packages, and values represent interaction strength.
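
The aggregation can be illustrated on a tiny, made-up event log. This sketch uses plain Python instead of the notebook's pandas `groupby` and `pivot`; the user and package names are hypothetical, and the 0.05 fallback mirrors the default the notebook applies to unmapped event types.

```python
from collections import defaultdict

EVENT_WEIGHTS = {"booking": 1.0, "click": 0.2, "view": 0.05}
FALLBACK_WEIGHT = 0.05  # applied to event types without an explicit weight

# Hypothetical event log: (user_id, package_id, event_type)
events = [
    ("user_1", "pkg_a", "view"),
    ("user_1", "pkg_a", "click"),
    ("user_1", "pkg_b", "booking"),
    ("user_2", "pkg_a", "view"),
    ("user_2", "pkg_a", "view"),
]

# Sum the weights per (user, package) pair
scores = defaultdict(float)
for user_id, package_id, event_type in events:
    scores[(user_id, package_id)] += EVENT_WEIGHTS.get(event_type, FALLBACK_WEIGHT)

# Pivot into a dense user x package matrix, filling missing pairs with 0
users = sorted({u for u, _ in scores})
packages = sorted({p for _, p in scores})
matrix = [[scores.get((u, p), 0.0) for p in packages] for u in users]
```

Because weights are summed, repeated weak events accumulate: two views score 0.10, which is still well below a single click plus a view (0.25) or a booking (1.0).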

In [None]:
# Parse text_embedding column from JSON strings to numpy arrays
def parse_embedding(embedding_str):
    """Convert JSON string to numpy array"""
    try:
        return np.array(json.loads(embedding_str))
    except Exception as e:
        print(f"Warning: Failed to parse embedding: {e}")
        return np.zeros(16)  # Return zero vector if parsing fails

packages_df['embedding_vector'] = packages_df['text_embedding'].apply(parse_embedding)

# Verify the parsing
sample_embedding = packages_df['embedding_vector'].iloc[0]
print("✓ Text embedding parsing complete!")
print(f"\nEmbedding Details:")
print(f"  - Dimension: {len(sample_embedding)}")
print(f"  - Data type: {type(sample_embedding)}")
print(f"  - Sample vector (Package 1):\n    {sample_embedding}")
print(f"  - Vector shape: {sample_embedding.shape}")

# Verify all embeddings were parsed correctly
valid_embeddings = packages_df['embedding_vector'].apply(lambda x: len(x) == 16).sum()
print(f"\n✓ Successfully parsed {valid_embeddings}/{len(packages_df)} embeddings")

Text embedding parsing complete!
Embedding dimension: 16
Sample embedding vector:
[ 0.3342 -0.1553 -1.9078 -0.8604 -0.4136  1.8877  0.5566 -1.3355  0.486
 -1.5473  1.0827 -0.4711 -0.0936  1.3258 -1.2872 -1.3971]


## 2. Data Preprocessing

**Goal**: Parse the `text_embedding` column from JSON strings into numpy arrays.

The `packages.csv` file contains a `text_embedding` column stored as JSON strings representing 16-dimensional vectors. We need to convert these strings into numpy arrays for cosine similarity calculations.
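
A minimal sketch of the parsing step, assuming each cell holds a JSON list of 16 numbers. It returns plain Python lists rather than numpy arrays, and the length check is an addition for this example, not something the notebook's parser does.

```python
import json

EMBEDDING_DIM = 16  # dimension stated in the text above

def parse_embedding(raw, dim=EMBEDDING_DIM):
    """Parse a JSON list of numbers; return a zero vector if the input is malformed."""
    try:
        values = json.loads(raw)
        if not isinstance(values, list) or len(values) != dim:
            raise ValueError("unexpected embedding shape")
        return [float(v) for v in values]
    except (TypeError, ValueError):
        # json.JSONDecodeError is a ValueError subclass, so bad JSON lands here too
        return [0.0] * dim

good = parse_embedding(json.dumps([0.5] * 16))
bad = parse_embedding("not json")
```

Falling back to a zero vector keeps the embedding matrix rectangular, but such rows carry no content signal, so it is worth counting how many packages hit the fallback.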

In [None]:
# Standardize column names and display sample data
events_df = events_df.rename(columns={
    'userId': 'user_id',
    'eventType': 'event_type',
    'packageId': 'package_id'
})

packages_df = packages_df.rename(columns={
    'packageId': 'package_id'
})

print("Sample Data Preview:")
print("\n" + "="*80)
print("USERS (First 3 rows)")
print("="*80)
print(users_df.head(3))

print("\n" + "="*80)
print("PACKAGES (First 3 rows)")
print("="*80)
print(packages_df[['package_id', 'title', 'subject', 'price', 'text_embedding']].head(3))

print("\n" + "="*80)
print("EVENTS (First 5 rows)")
print("="*80)
print(events_df[['user_id', 'event_type', 'package_id']].head(5))

Sample Users Data:
       userId  age  country language educationLevel  isEducator  \
0  user_00001   31   Canada       en     HighSchool           0   
1  user_00002   26       UK       en     HighSchool           0   
2  user_00003   40  Germany       ta     HighSchool           0   

                    createdAt learningPreferences_subjects learningStyle  \
0  2025-01-21T05:33:34.719956                    Chemistry        Visual   
1  2025-01-27T05:33:34.719956  History|Mathematics|Biology         Mixed   
2  2025-03-06T05:33:34.719956   Design|Programming|English      Auditory   

  academicLevel timePreferences  aiFeatures_interactionCount  \
0      Advanced         Morning                           23   
1  Intermediate       Afternoon                           15   
2      Beginner         Evening                           25   

        aiFeatures_lastActive  
0  2025-09-08T05:33:34.720972  
1  2025-09-26T05:33:34.721034  
2  2025-11-09T05:33:34.721092  

Sample Packages Data:

In [None]:
# Load all CSV files
data_path = 'ML/recommender_dataset/'  # Path to the data files

users_df = pd.read_csv(data_path + 'users.csv')
packages_df = pd.read_csv(data_path + 'packages.csv')
events_df = pd.read_csv(data_path + 'events.csv')
ranking_examples_df = pd.read_csv(data_path + 'ranking_examples.csv')

print("Data loaded successfully!")
print(f"\nUsers: {len(users_df)} rows")
print(f"Packages: {len(packages_df)} rows")
print(f"Events: {len(events_df)} rows")
print(f"Ranking Examples: {len(ranking_examples_df)} rows")

Data loaded successfully!

Users: 1000 rows
Packages: 500 rows
Events: 20000 rows
Ranking Examples: 20000 rows


In [None]:
# Load CSV files from the ML/recommender_dataset directory
data_path = 'ML/recommender_dataset/'

users_df = pd.read_csv(data_path + 'users.csv')
packages_df = pd.read_csv(data_path + 'packages.csv')
events_df = pd.read_csv(data_path + 'events.csv')
ranking_examples_df = pd.read_csv(data_path + 'ranking_examples.csv')

print("✓ Data loaded successfully!")
print(f"\nDataset Summary:")
print(f"  - Users: {len(users_df):,} rows")
print(f"  - Packages: {len(packages_df):,} rows")
print(f"  - Events: {len(events_df):,} rows")
print(f"  - Ranking Examples: {len(ranking_examples_df):,} rows")

Column names standardized to snake_case format
Events columns: ['user_id', 'event_type', 'package_id', 'sessionTime', 'paid']
Packages columns: ['package_id', 'educatorId', 'title', 'description', 'keywords']


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import json
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
import warnings

warnings.filterwarnings('ignore')

print("✓ Libraries imported successfully!")
print(f"  - pandas: {pd.__version__}")
print(f"  - numpy: {np.__version__}")

Libraries imported successfully!


## 1. Setup & Data Loading

Import necessary libraries and load the synthetic data from CSV files.

# FocusDesk Hybrid Recommendation System

This notebook implements a **Hybrid Recommendation System** for the FocusDesk educational platform, combining:

1. **Content-Based Filtering**: Uses text embeddings and cosine similarity
2. **Collaborative Filtering**: Uses matrix factorization (TruncatedSVD) on user interactions
3. **Hybrid Approach**: Weighted combination (60% Collaborative + 40% Content-Based)

---

**Data Sources:**
- `users.csv` - 1000 user profiles
- `packages.csv` - 500 educational packages with 16D text embeddings
- `events.csv` - 20,000 user interaction events (view, click, search, message, start_booking, rating, booking)
- `ranking_examples.csv` - Pre-processed training data (optional)

## Model Accuracy & Performance Evaluation

Evaluate the recommendation system using standard metrics like Precision@K, Recall@K, and NDCG.
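
To make the three metrics concrete, the following worked example computes them by hand for one hypothetical user. The package names and the ranked list are invented, and relevance is binary (a held-out package either appears in the list or it does not).

```python
import math

actual = {"pkg_a", "pkg_b", "pkg_c"}                       # held-out relevant packages
predicted = ["pkg_a", "pkg_x", "pkg_b", "pkg_y", "pkg_z"]  # ranked top-5 list
k = 5

top_k = predicted[:k]
hits = [p in actual for p in top_k]  # [True, False, True, False, False]

precision = sum(hits) / len(top_k)   # 2 hits out of 5 recommendations
recall = sum(hits) / len(actual)     # 2 hits out of 3 relevant packages

# DCG rewards hits near the top: 0-based position i is discounted by log2(i + 2)
dcg = sum(1.0 / math.log2(i + 2) for i, hit in enumerate(hits) if hit)
# Ideal DCG: every relevant package ranked first
idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(actual), k)))
ndcg = dcg / idcg
```

Here precision is 0.4 and recall is about 0.67. NDCG is about 0.70: the first hit is at rank 1 but the second sits at rank 3 and the third relevant package is missing entirely.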

In [None]:
# Evaluation Metrics for Recommendation System
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

def precision_at_k(actual, predicted, k=5):
    """Calculate Precision@K: share of the top-k predictions that are relevant."""
    predicted = list(predicted)[:k]
    if not predicted:
        return 0.0  # No predictions: avoid dividing by zero

    # Count each relevant item once, even if it is predicted more than once
    num_hits = len(set(predicted) & set(actual))

    return num_hits / len(predicted)

def recall_at_k(actual, predicted, k=5):
    """Calculate Recall@K: share of the relevant items found in the top-k predictions."""
    predicted = list(predicted)[:k]

    # Use sets so a duplicated prediction cannot push recall above 1.0
    num_hits = len(set(predicted) & set(actual))

    return num_hits / len(actual) if len(actual) > 0 else 0.0

def ndcg_at_k(actual, predicted, k=5):
    """Calculate Normalized Discounted Cumulative Gain@K"""
    if len(predicted) > k:
        predicted = predicted[:k]
    
    dcg = 0.0
    for i, p in enumerate(predicted):
        if p in actual:
            dcg += 1.0 / np.log2(i + 2)
    
    idcg = sum([1.0 / np.log2(i + 2) for i in range(min(len(actual), k))])
    
    return dcg / idcg if idcg > 0 else 0.0

print("Evaluation metrics defined successfully!")
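
As a quick sanity check, the three metrics can be verified by hand on a toy ranking. The sketch below is self-contained so it can run outside the notebook: it re-implements the same binary-relevance definitions in compact form, and the function name `toy_metrics` and the package IDs are made up for illustration.

```python
import numpy as np

def toy_metrics(actual, predicted, k=5):
    """Compact binary-relevance Precision@K, Recall@K and NDCG@K (illustrative)."""
    top_k = predicted[:k]
    # 1.0 where the ranked item is in the held-out set, else 0.0
    hits = [1.0 if p in actual else 0.0 for p in top_k]
    precision = sum(hits) / len(top_k) if top_k else 0.0
    recall = sum(hits) / len(actual) if actual else 0.0
    # Rank i (0-indexed) is discounted by log2(i + 2)
    dcg = sum(h / np.log2(i + 2) for i, h in enumerate(hits))
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(actual), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg

# Hypothetical example: 3 held-out packages, 2 of them found at ranks 1 and 3
actual = {"p1", "p4", "p7"}
predicted = ["p4", "p2", "p1", "p9", "p5"]
precision, recall, ndcg = toy_metrics(actual, predicted, k=5)
print(f"precision={precision:.3f} recall={recall:.3f} ndcg={ndcg:.3f}")
```

Working it by hand: two of five predictions are relevant (precision 2/5), two of three held-out items are recovered (recall 2/3), and NDCG is (1 + 0.5) / (1 + 1/log2(3) + 0.5), roughly 0.704.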

In [None]:
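# Create test set: for each user, hold out their highest-weighted interactions
# (the matrix stores aggregated weights, not timestamps, so this is not a "most recent" split)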
test_users = []
test_actual = []
train_interactions = interaction_matrix.copy()

print("Creating train/test split...")
print(f"Original interaction matrix shape: {interaction_matrix.shape}")

# For each user, identify their top interacted packages as "ground truth"
for user_id in interaction_matrix.index[:100]:  # Evaluate on first 100 users
    user_interactions = interaction_matrix.loc[user_id]
    interacted = user_interactions[user_interactions > 0].sort_values(ascending=False)
    
    if len(interacted) >= 3:  # Only evaluate users with at least 3 interactions
        # Hold out the top 20% of interactions (ranked by weight) for testing
        n_test = max(1, int(len(interacted) * 0.2))
        test_packages = interacted.index[:n_test].tolist()
        
        test_users.append(user_id)
        test_actual.append(test_packages)
        
        # Remove test interactions from training matrix
        train_interactions.loc[user_id, test_packages] = 0

print("\nTest set created!")
print(f"Number of test users: {len(test_users)}")
print(f"Average test packages per user: {np.mean([len(x) for x in test_actual]):.2f}")
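
The split in this cell ranks each user's packages by interaction weight and holds out the top 20%. If event timestamps are available, a time-based hold-out is a common alternative that avoids testing only on the strongest signals. Below is a minimal sketch, assuming an events table with hypothetical columns `user_id`, `package_id`, and `timestamp`. The actual schema of `events.csv` is not shown here, so these names and the helper `temporal_holdout` are assumptions.

```python
import pandas as pd

def temporal_holdout(events, test_frac=0.2, min_events=3):
    """Send each user's most recent `test_frac` of events to the test set.

    Users with fewer than `min_events` events stay entirely in training.
    Column names are assumptions, not the notebook's confirmed schema.
    """
    events = events.sort_values(["user_id", "timestamp"])
    # Position of each event within its user's history (0 = oldest)
    pos = events.groupby("user_id").cumcount()
    # Number of events per user, broadcast back to every row
    size = events.groupby("user_id")["package_id"].transform("size")
    n_test = (size * test_frac).astype(int).clip(lower=1)
    is_test = (pos >= size - n_test) & (size >= min_events)
    return events[~is_test], events[is_test]

# Toy data: user 1 has five events (deliberately out of order), user 2 has two
toy_events = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 1, 2, 2],
    "package_id": [10, 11, 12, 13, 14, 20, 21],
    "timestamp":  pd.to_datetime([
        "2024-01-05", "2024-01-01", "2024-01-03", "2024-01-02", "2024-01-04",
        "2024-01-01", "2024-01-02",
    ]),
})
train_events, test_events = temporal_holdout(toy_events)
print(test_events)
```

With this toy data only user 1's latest event (package 10) is held out; user 2 has too few events and stays in training. The weighted interaction matrix would then be built from `train_events` alone.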

In [None]:
# Retrain models on training data only
print("Retraining models on training data...")

# Retrain SVD on training data
svd_train = TruncatedSVD(n_components=20, random_state=42)
user_factors_train = svd_train.fit_transform(train_interactions)
package_factors_train = svd_train.components_.T

predicted_scores_train = np.dot(user_factors_train, package_factors_train.T)
predicted_scores_train_df = pd.DataFrame(
    predicted_scores_train,
    index=train_interactions.index,
    columns=train_interactions.columns
)

print("Training complete!")
print(f"Explained variance: {svd_train.explained_variance_ratio_.sum():.4f}")
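
The reconstruction step in this cell multiplies user factors by package factors to obtain a dense score for every user-package pair. The sketch below shows the same idea on a toy matrix with plain NumPy. It uses an exact SVD via `np.linalg.svd` instead of scikit-learn's `TruncatedSVD` (whose default solver is randomized, so values can differ slightly); the function name `low_rank_scores` and the toy weights are illustrative assumptions.

```python
import numpy as np

def low_rank_scores(matrix, n_components):
    """Rank-k approximation of an interaction matrix via exact SVD."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(n_components, len(s))
    user_factors = U[:, :k] * s[:k]   # plays the role of fit_transform's output
    item_factors = Vt[:k, :]          # plays the role of components_
    return user_factors @ item_factors

# Toy weighted interactions: 3 users x 4 packages
toy = np.array([
    [1.0, 0.2, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.05],
    [0.0, 0.0, 1.0, 0.2],
])
scores = low_rank_scores(toy, n_components=2)
print(np.round(scores, 3))
```

Cells that were zero in the input generally receive non-zero scores in the low-rank reconstruction, which is what lets the model rank packages a user has not interacted with.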

In [None]:
# Evaluate Collaborative Filtering Model
print("="*80)
print("EVALUATING COLLABORATIVE FILTERING MODEL")
print("="*80)

collab_metrics = {
    'precision@5': [],
    'precision@10': [],
    'recall@5': [],
    'recall@10': [],
    'ndcg@5': [],
    'ndcg@10': []
}

for user_id, actual_packages in zip(test_users, test_actual):
    # Get predictions from collaborative filtering
    user_pred = predicted_scores_train_df.loc[user_id]
    
    # Remove packages user already interacted with in training
    train_interacted = train_interactions.loc[user_id]
    train_interacted_packages = train_interacted[train_interacted > 0].index.tolist()
    
    # Get top predictions
    available_pred = user_pred[~user_pred.index.isin(train_interacted_packages)]
    top_10 = available_pred.sort_values(ascending=False).head(10).index.tolist()
    top_5 = top_10[:5]
    
    # Calculate metrics
    collab_metrics['precision@5'].append(precision_at_k(actual_packages, top_5, k=5))
    collab_metrics['precision@10'].append(precision_at_k(actual_packages, top_10, k=10))
    collab_metrics['recall@5'].append(recall_at_k(actual_packages, top_5, k=5))
    collab_metrics['recall@10'].append(recall_at_k(actual_packages, top_10, k=10))
    collab_metrics['ndcg@5'].append(ndcg_at_k(actual_packages, top_5, k=5))
    collab_metrics['ndcg@10'].append(ndcg_at_k(actual_packages, top_10, k=10))

# Print results
print("\nCollaborative Filtering Results:")
print("-"*80)
for metric, values in collab_metrics.items():
    print(f"{metric:20s}: {np.mean(values):.4f} (±{np.std(values):.4f})")

print(f"\nEvaluated on {len(test_users)} users")

In [None]:
# Evaluate Hybrid Model
print("\n" + "="*80)
print("EVALUATING HYBRID MODEL (60% Collaborative + 40% Content)")
print("="*80)

hybrid_metrics = {
    'precision@5': [],
    'precision@10': [],
    'recall@5': [],
    'recall@10': [],
    'ndcg@5': [],
    'ndcg@10': []
}

for user_id, actual_packages in zip(test_users, test_actual):
    # Get collaborative scores
    collab_scores = predicted_scores_train_df.loc[user_id]
    
    # Get content-based scores from the highest-weighted training interaction
    # (idxmax picks the package with the largest weight, not the most recent one)
    train_interacted = train_interactions.loc[user_id]
    train_interacted_packages = train_interacted[train_interacted > 0]
    
    if len(train_interacted_packages) > 0:
        seed_package = train_interacted_packages.idxmax()
        content_scores = content_similarity_df[seed_package]
    else:
        # No training interactions left: fall back to a neutral content score
        content_scores = pd.Series(0.0, index=collab_scores.index)
    
    # Normalize and combine
    collab_norm = (collab_scores - collab_scores.min()) / (collab_scores.max() - collab_scores.min() + 1e-10)
    content_norm = (content_scores - content_scores.min()) / (content_scores.max() - content_scores.min() + 1e-10)
    
    hybrid_scores = 0.6 * collab_norm + 0.4 * content_norm
    
    # Get top predictions (excluding training interactions)
    available_pred = hybrid_scores[~hybrid_scores.index.isin(train_interacted_packages.index)]
    top_10 = available_pred.sort_values(ascending=False).head(10).index.tolist()
    top_5 = top_10[:5]
    
    # Calculate metrics
    hybrid_metrics['precision@5'].append(precision_at_k(actual_packages, top_5, k=5))
    hybrid_metrics['precision@10'].append(precision_at_k(actual_packages, top_10, k=10))
    hybrid_metrics['recall@5'].append(recall_at_k(actual_packages, top_5, k=5))
    hybrid_metrics['recall@10'].append(recall_at_k(actual_packages, top_10, k=10))
    hybrid_metrics['ndcg@5'].append(ndcg_at_k(actual_packages, top_5, k=5))
    hybrid_metrics['ndcg@10'].append(ndcg_at_k(actual_packages, top_10, k=10))

# Print results
print("\nHybrid Model Results:")
print("-"*80)
for metric, values in hybrid_metrics.items():
    print(f"{metric:20s}: {np.mean(values):.4f} (±{np.std(values):.4f})")

print(f"\nEvaluated on {len(test_users)} users")
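
The normalize-and-combine step in the loop can be factored into a small helper, which also makes the 60/40 weight easy to vary. This is a minimal sketch on toy data: the helper names `min_max` and `blend_scores` are hypothetical, and the `reindex` call is an added safeguard that assumes both score Series are keyed by package ID.

```python
import pandas as pd

def min_max(scores, eps=1e-10):
    """Rescale a Series to roughly [0, 1]; eps avoids division by zero."""
    return (scores - scores.min()) / (scores.max() - scores.min() + eps)

def blend_scores(collab, content, collab_weight=0.6):
    """Weighted sum of min-max normalized collaborative and content scores."""
    # Align on the collaborative index so both Series cover the same packages
    content = content.reindex(collab.index).fillna(0.0)
    return collab_weight * min_max(collab) + (1.0 - collab_weight) * min_max(content)

# Toy scores for three hypothetical packages
collab = pd.Series([0.0, 5.0, 10.0], index=["p1", "p2", "p3"])
content = pd.Series([1.0, 1.0, 0.0], index=["p1", "p2", "p3"])
hybrid = blend_scores(collab, content)
ranking = hybrid.sort_values(ascending=False).index.tolist()
print(hybrid.round(3).to_dict(), ranking)
```

Here `p2` ranks first because it scores moderately on both signals. A constant Series normalizes to all zeros, so a user with no content signal is effectively ranked by the collaborative scores alone.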

In [None]:
# Comparison Summary
print("\n" + "="*80)
print("MODEL ACCURACY COMPARISON SUMMARY")
print("="*80)

comparison_df = pd.DataFrame({
    'Collaborative Filtering': [
        np.mean(collab_metrics['precision@5']),
        np.mean(collab_metrics['precision@10']),
        np.mean(collab_metrics['recall@5']),
        np.mean(collab_metrics['recall@10']),
        np.mean(collab_metrics['ndcg@5']),
        np.mean(collab_metrics['ndcg@10'])
    ],
    'Hybrid Model': [
        np.mean(hybrid_metrics['precision@5']),
        np.mean(hybrid_metrics['precision@10']),
        np.mean(hybrid_metrics['recall@5']),
        np.mean(hybrid_metrics['recall@10']),
        np.mean(hybrid_metrics['ndcg@5']),
        np.mean(hybrid_metrics['ndcg@10'])
    ]
}, index=['Precision@5', 'Precision@10', 'Recall@5', 'Recall@10', 'NDCG@5', 'NDCG@10'])

print("\n", comparison_df)

# Calculate improvement
print("\n" + "="*80)
print("Hybrid Model Improvement over Collaborative Filtering:")
print("="*80)
for metric in comparison_df.index:
    collab_val = comparison_df.loc[metric, 'Collaborative Filtering']
    hybrid_val = comparison_df.loc[metric, 'Hybrid Model']
    improvement = ((hybrid_val - collab_val) / collab_val * 100) if collab_val > 0 else 0
    print(f"{metric:20s}: {improvement:+.2f}%")

print("\n" + "="*80)
# NDCG@10 is a ranking-quality score in [0, 1], not a classification accuracy
print(f"Hybrid Model NDCG@10: {np.mean(hybrid_metrics['ndcg@10']):.4f}")
print("="*80)