Key Motivation for Embedding in Recommender Systems

Recommender systems rely on fast retrieval of rich user-item interaction histories, along with item metadata and user preferences.
Embedding related data together (e.g., user history inside user docs, rating summaries inside movie docs) lets a single read return everything a recommendation request needs, which reduces read latency and simplifies feature extraction for models.

Document-Oriented Model: Benefits for Recommender Systems

1. Reduced Join Overhead

In a normalized schema:
- To build a user's rating history, you'd need to join USERS ↔ RATINGS ↔ MOVIES, possibly even GENOME_SCORES.
- This is costly in terms of latency and compute.

In the embedded MongoDB schema:
- All needed info (ratings, genres, tags, etc.) is localized in a single document (movie_doc or user_doc).
- This is ideal for read-heavy recommendation workloads.

--> Embedding avoids joins: each fetch is a single indexed lookup by key instead of a multi-collection join.
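
The contrast can be sketched with plain Python dictionaries standing in for collections. The field names (`ratings`, `title`, `genres`) and the sample values are assumptions for illustration, not the actual schema.

```python
# Normalized layout: separate "tables" that must be joined at read time.
movies = {
    10: {"movieId": 10, "title": "Movie A", "genres": ["Drama"]},
    20: {"movieId": 20, "title": "Movie B", "genres": ["Comedy"]},
}
ratings = [
    {"userId": 1, "movieId": 10, "rating": 4.0},
    {"userId": 1, "movieId": 20, "rating": 3.5},
    {"userId": 2, "movieId": 10, "rating": 5.0},
]


def history_normalized(user_id):
    """Rebuild a user's history by scanning RATINGS and joining MOVIES."""
    return [
        {
            "movieId": r["movieId"],
            "title": movies[r["movieId"]]["title"],
            "rating": r["rating"],
        }
        for r in ratings
        if r["userId"] == user_id
    ]


# Embedded layout: the history already lives inside the user document.
user_docs = {
    1: {
        "userId": 1,
        "ratings": [
            {"movieId": 10, "title": "Movie A", "rating": 4.0},
            {"movieId": 20, "title": "Movie B", "rating": 3.5},
        ],
    }
}


def history_embedded(user_id):
    """One keyed lookup returns the whole history."""
    return user_docs[user_id]["ratings"]
```

Both functions return the same history; the embedded version simply does less work per read, at the cost of storing the movie title redundantly.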

2. Query Efficiency and Caching

Embedding allows you to:
- Retrieve a user document and get their full rating history in one query.
- Fetch a movie document and instantly access:
    - Metadata (title, genres, year)
    - Tag genome (semantic info)
    - Ratings statistics

3. Recommendation Model Friendliness
Machine learning pipelines often need:
- A user's full interaction history.
- Movie content + aggregated rating stats.

--> Fast batch export for training (user histories are ready).
--> Real-time feature access for serving models (single-document reads).
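
As a rough illustration of single-document feature access, the sketch below derives a few features from one embedded user document. The document shape, the feature names, and the 4.0-star "liked" threshold are all hypothetical.

```python
from statistics import mean


def user_features(user_doc):
    """Derive simple features from a single embedded user document."""
    entries = user_doc.get("ratings", [])
    if not entries:
        return {"n_ratings": 0, "mean_rating": None, "liked": []}
    return {
        "n_ratings": len(entries),
        "mean_rating": mean(e["rating"] for e in entries),
        # Hypothetical threshold: treat ratings >= 4.0 as "liked".
        "liked": [e["movieId"] for e in entries if e["rating"] >= 4.0],
    }


doc = {
    "userId": 1,
    "ratings": [
        {"movieId": 10, "rating": 4.0},
        {"movieId": 20, "rating": 3.0},
    ],
}
features = user_features(doc)
```

The same function works for batch export (map it over all user documents) and for serving (call it on the one document fetched for the request).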

4. Scalability & Sharding
MongoDB’s sharding strategy benefits from:
- Embedding all interactions per user in the user_doc → easy to shard by userId
- Embedding stats and tag genomes in movie_doc → easy to shard by movieId

--> This enables horizontal scalability for systems with millions of users/movies.
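
The idea of routing by shard key can be sketched as follows. This is a deliberately simplified model (hash modulo the number of shards), not MongoDB's actual routing, which assigns ranges of key values to shards; `shard_for` is a hypothetical helper.

```python
import hashlib
from collections import Counter


def shard_for(key, n_shards):
    """Map a shard-key value to a shard index via a stable hash (simplified)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


# Each user document lands on exactly one shard, so everything embedded
# in it (the full interaction history) is read from a single place.
placement = Counter(shard_for(user_id, 4) for user_id in range(10_000))
```

With a well-distributed key such as a hashed userId, documents spread roughly evenly across shards; this says nothing about traffic, which can still be skewed by very active users or popular movies.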




When Is Embedding Clearly Better?
- You have read-heavy, write-rarely patterns (typical for recsys).
- You’re building real-time or near real-time personalized services.
- You want to optimize for locality (getting all relevant data in one fetch).

When to Avoid Full Embedding
- If user or movie interaction histories are huge (e.g. millions of entries per document).
- When data is highly write-intensive (e.g., frequent rating/tag updates from many users).
- When your application needs strict normalization for transactional integrity.


Pros of Embedding in a Distributed System

1. Improved Data Locality
- All needed data is available in one place (e.g., all of a user’s interactions in one document).
- This reduces the need for cross-node lookups and network I/O.

2. Atomic Updates (Per Document)
- MongoDB supports atomic operations at the document level.
- If all interactions are embedded in one document, a single write updates them consistently.
- This avoids partial writes and the need for distributed (multi-document) transactions.
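
A minimal sketch of such a single-document write: the helper below builds the filter and update documents that append one rating and bump a counter in the same operation. `$push` and `$inc` are standard MongoDB update operators; the field names (`ratings`, `ratingCount`, `timestamp`) and the `db.users` collection mentioned in the comment are assumptions.

```python
def build_rating_push(user_id, movie_id, rating, timestamp):
    """Return (filter, update) documents for appending one rating."""
    filter_doc = {"userId": user_id}
    update_doc = {
        # Append the new rating to the embedded array...
        "$push": {
            "ratings": {
                "movieId": movie_id,
                "rating": rating,
                "timestamp": timestamp,
            }
        },
        # ...and increment the counter in the same atomic write.
        "$inc": {"ratingCount": 1},
    }
    return filter_doc, update_doc


flt, upd = build_rating_push(1, 10, 4.5, 1_700_000_000)
# With PyMongo and a hypothetical `users` collection this would be
# applied as: db.users.update_one(flt, upd)
```

Because both changes target the same document, either both are applied or neither is. Keeping a movie document's statistics in step with this write is a separate problem, since that involves a second document.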

Cons of Embedding in a Distributed System 

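1. Document Growth and Size Limits
- MongoDB has a 16 MB document size limit, so an embedded array that grows without bound (e.g., a very active user's ratings) can eventually hit it.

2. Load Imbalance
- Highly active users or very popular movies may create hot shards (nodes receiving disproportionate traffic).
- This breaks uniform load distribution, hurting scalability.

3. Data Duplication
- Shared metadata (e.g., a movie title) that is copied into several documents must be updated in every copy.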
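
The 16 MB limit can be turned into a rough capacity estimate. The sketch below uses compact JSON length as a stand-in for the stored size; real BSON sizes differ, and the sample entry shape and the 1 KB overhead allowance are assumptions.

```python
import json

MAX_DOC_BYTES = 16 * 1024 * 1024  # MongoDB's 16 MB document limit


def approx_entry_bytes(entry):
    """Rough size proxy: compact JSON length (BSON size will differ)."""
    return len(json.dumps(entry, separators=(",", ":")).encode("utf-8"))


def max_embedded_entries(entry, overhead_bytes=1024):
    """Estimate how many such entries fit in one document."""
    return (MAX_DOC_BYTES - overhead_bytes) // approx_entry_bytes(entry)


sample = {"movieId": 123456, "rating": 4.5, "timestamp": 1700000000}
print(max_embedded_entries(sample))
```

Under these assumptions the ceiling is on the order of hundreds of thousands of entries per document, so histories with millions of entries would need a different layout (for example, splitting the history across several documents).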


CAP Trade-offs: Embedded vs. Normalized

1. Consistency (C)
Embedded:
Rating count and avgRating in a movie doc can become stale if updates aren't carefully managed.

Hard to guarantee atomicity across user and movie docs if a user changes a rating.

Normalized:
Easier to enforce strong consistency. Changes to RATINGS auto-reflect in analytics queries.

Better for OLAP, training pipelines, or regulatory correctness.

Conclusion: Favor normalized when C is a top priority.
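
To make the staleness risk concrete, here is a sketch of the bookkeeping an embedded movie document needs whenever a rating is added or changed. The `ratingCount` and `avgRating` field names follow the text above; the helper functions are hypothetical.

```python
def apply_new_rating(stats, rating):
    """Fold one new rating into the running count and average."""
    n = stats["ratingCount"]
    new_n = n + 1
    new_avg = (stats["avgRating"] * n + rating) / new_n
    return {"ratingCount": new_n, "avgRating": new_avg}


def apply_changed_rating(stats, old_rating, new_rating):
    """Adjust the average when an existing rating is edited."""
    n = stats["ratingCount"]
    if n == 0:
        raise ValueError("cannot change a rating on a movie with no ratings")
    new_avg = stats["avgRating"] + (new_rating - old_rating) / n
    return {"ratingCount": n, "avgRating": new_avg}


stats = {"ratingCount": 2, "avgRating": 4.0}   # ratings: 3.0, 5.0
stats = apply_new_rating(stats, 1.0)           # ratings: 3.0, 5.0, 1.0
stats = apply_changed_rating(stats, 1.0, 4.0)  # ratings: 3.0, 5.0, 4.0
```

If either step is skipped or applied twice, the embedded statistics drift from the true values, whereas a normalized schema would recompute them from RATINGS on demand.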

2. Availability (A)
Embedded:
A single-node read can retrieve all data for a request.

Even in degraded mode (e.g., during a partition), the app can keep serving reads from replicas, provided it allows secondary reads and tolerates possibly stale data.

Normalized:
Reconstructing a user’s behavior requires multiple reads → each one is a potential point of failure.

During network partitions, joins that span unreachable nodes will fail.

Conclusion: Embedded wins for A, crucial for real-time recommendations.

3. Partition Tolerance (P)
Partition tolerance is non-negotiable in distributed systems, so both models must work under it. The key difference is how gracefully each handles it:

Embedded:
All data for a user or movie is co-located (if sharded wisely), so reads/writes remain local during a partition.

Normalized:
Requires fetching data from multiple collections/shards. This increases exposure to partition issues.

Conclusion: Embedded is more resilient to partitions and localized failures.



| Use Case                                                   | CAP Priority | Recommended Structure                       |
| ---------------------------------------------------------- | ------------ | ------------------------------------------- |
| Real-time recommender (UI)                                 | **AP**       | Embedded                                  |
| Offline ML model training                                  | **CP**       | Normalized (or hybrid with periodic sync) |
| High-throughput content serving                            | **AP**       | Embedded                                  |
| Operational updates to shared metadata (e.g., movie title) | **C**        | Normalized                                |


# TODO - DO RECSYS QUERIES FOR REAL RECOMMENDATIONS
