## 💡 Step 5: Recommendation Function & Evaluation Metrics

---
This step operationalizes the final selected similarity model by generating Top-K recommendations and evaluating their quality using list-level proxy metrics appropriate for content-based systems under data-limited conditions.

*   **Approach**: Item-to-item Top-K retrieval using the final similarity matrix
*   **Evaluation**: List-level proxy metrics (ILD, Catalog Coverage)
*   **Scope**: Multiple anchor titles to assess robustness and consistency
*   **Constraint**: No user behavior, ratings, or supervised labels
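
The item-to-item Top-K retrieval listed above can be sketched with NumPy alone. This is a minimal illustration on a toy similarity matrix, not the notebook's `get_recommendations` function; the name `top_k_similar` and the toy values are assumptions made for the example.

```python
import numpy as np

def top_k_similar(similarity_matrix, query_idx, top_k=10):
    """Return positions of the top_k most similar items, excluding the query."""
    scores = np.asarray(similarity_matrix[query_idx], dtype=float).ravel().copy()
    scores[query_idx] = -np.inf              # mask the query so it can never be returned
    k = max(min(top_k, scores.size - 1), 0)  # cannot return more items than exist
    # Stable sort keeps the lower position first when two scores tie.
    return np.argsort(-scores, kind="stable")[:k]

# Toy symmetric similarity matrix for four items (illustrative values only).
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.4],
    [0.2, 1.0, 0.1, 0.7],
    [0.9, 0.1, 1.0, 0.3],
    [0.4, 0.7, 0.3, 1.0],
])
print(top_k_similar(toy_sim, query_idx=0, top_k=2))   # [2 3]
```

Masking the query by position, rather than dropping the first ranked result, keeps the query out of the list even when another item ties with it at the maximum score.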


In [None]:
# =============================================================================
# STEP 5: RECOMMENDATION FUNCTION & EVALUATION METRICS
# =============================================================================

"""
Generate Top-K content recommendations and evaluate their quality using
proxy metrics appropriate for content-based recommendation systems.

This step uses the final similarity matrix selected during model evaluation and selection,
which represents the best-performing content similarity approach among the tested alternatives.
This matrix is applied consistently to generate recommendations and assess their quality
under data-limited conditions.

What this step does:
- Defines a reusable recommendation function
- Retrieves Top-K similar titles for a given item
- Computes proxy evaluation metrics aligned with Step 1 KPIs

What this step does NOT do:
- Train a predictive model
- Use user behavior or ratings
- Perform online A/B testing
"""

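import numpy as np                       # Used for averaging and positional lookups.
import pandas as pd                      # Used for the summary and variance tables.
from textwrap import fill                # Used to wrap long descriptions when printing.
from IPython.display import display      # Used to render DataFrames in the notebook.

# ---------------------------------------------------------------------------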
# 5a. Recommendation Retrieval Function
# ---------------------------------------------------------------------------
# Purpose: Retrieve Top-K similar titles based on the *final selected*
# content similarity matrix produced in Step 4, ensuring continuity
# between model selection and recommendation generation.

def get_recommendations(title, df, similarity_matrix, top_k=10):      # Retrieve Top-K similar titles.
                                                                      # top_k is set to 10 to align with the evaluation
                                                                      # and selection process in Step 4 and to reflect
                                                                      # common industry defaults for content discovery.
    """
    Retrieve Top-K content recommendations for a given title.

    Purpose:
    Provide a simple and explainable way to surface similar titles based
    on the final, selected content similarity model.
    """
    if title not in df["title"].values:                                # Validate that the input title exists in the catalog.
        raise ValueError("Title not found in the catalog.")

    # Resolve potential duplicate title ambiguity
    # Purpose: Ensure recommendation lookup is deterministic and does not
    # silently select an arbitrary title when duplicates exist.
    matches = df[df["title"] == title]

    if len(matches) > 1:
        raise ValueError(
            f"Ambiguous title '{title}'. Multiple catalog entries found. "
            "Refine the title selection or use a unique identifier."
        )

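    # Use the row *position*, not the index label, so that the similarity-matrix
    # lookup and the later df.iloc call agree even if df has a non-default index.
    idx = int(np.flatnonzero((df["title"] == title).to_numpy())[0])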

    # Handle sparse vs dense similarity matrices safely
    if hasattr(similarity_matrix, "toarray"):                              # Check if similarity matrix is sparse.
        scores = similarity_matrix[idx].toarray().ravel()                  # Convert query row to dense for ranking.
    else:
        scores = similarity_matrix[idx]                                    # Use directly if matrix is dense.

    similarity_scores = [                                                  # Pair each catalog position with its similarity score,
        (i, score) for i, score in enumerate(scores) if i != idx           # dropping the query title itself by position.
    ]

    # Rank titles by similarity score in descending order
    similarity_scores = sorted(
        similarity_scores, key=lambda x: x[1], reverse=True
    )

    # Retain Top-K results. The query title was excluded above by position, so a
    # tie at the maximum score cannot leave it in the list.
    similarity_scores = similarity_scores[:top_k]
    recommended_indices = [i[0] for i in similarity_scores]            # Extract positions of recommended titles.

    return df.iloc[recommended_indices][["title", "type", "listed_in"]]  # Return interpretable recommendation fields.


# ---------------------------------------------------------------------------
# 5b. Intra-list Diversity (ILD)
# ---------------------------------------------------------------------------
# Purpose: Measure how diverse a single recommendation list is in terms
# of genre composition, reinforcing the diversity objectives evaluated
# during model comparison in Step 4.

def compute_intra_list_diversity(recommendations):                    # Define a function to calculate recommendation diversity.
    """
    Compute Intra-list Diversity (ILD) based on genre overlap.

    Purpose:
    Quantify how varied the recommended titles are to avoid repeatedly
    suggesting near-duplicate content.
    """
    genres = recommendations["listed_in"].str.split(", ")              # Split genre strings into individual genre labels.
    genre_sets = genres.apply(set)                                     # Convert genre lists into sets for comparison.

    similarities = []                                                  # Initialize storage for pairwise similarity scores.

    for i in range(len(genre_sets)):
        for j in range(i + 1, len(genre_sets)):
            intersection = genre_sets.iloc[i] & genre_sets.iloc[j]     # Identify shared genres between two titles.
            union = genre_sets.iloc[i] | genre_sets.iloc[j]            # Identify total unique genres.
            similarities.append(len(intersection) / len(union))        # Jaccard similarity for categorical overlap.

    if not similarities:                                               # Handle edge cases with insufficient comparisons.
        return 0.0

    return 1 - np.mean(similarities)                                   # Higher values indicate greater diversity.


# ---------------------------------------------------------------------------
# 5c. Catalog Coverage
# ---------------------------------------------------------------------------
# Purpose: Measure how much of the catalog is exposed by a single
# recommendation list, complementing the multi-query coverage analysis
# performed during Step 4 evaluation.

def compute_catalog_coverage(recommended_titles, total_titles):       # Define a function to compute catalog exposure.
    """
    Compute Catalog Coverage (CC).

    Purpose:
    Assess whether recommendations surface a broad range of titles
    rather than repeatedly promoting the same items.
    """
    unique_recommended = recommended_titles["title"].nunique()         # Count unique titles in the recommendation list.
    return unique_recommended / total_titles                           # Normalize by total catalog size.


# ---------------------------------------------------------------------------
# 5d. Example Recommendation & Metric Inspection
# ---------------------------------------------------------------------------
# Purpose: Demonstrate how the selected model behaves for an individual
# title and how proxy metrics can be interpreted at the recommendation
# list level.

example_titles = [
    "Dick Johnson Is Dead",        # Example Title #1 (fixed anchor)
    df["title"].iloc[6847],        # Example Title #2; this and the remaining titles use arbitrarily chosen, fixed row positions
    df["title"].iloc[94],          # Example Title #3
    df["title"].iloc[198],         # Example Title #4
    df["title"].iloc[4156]         # Example Title #5
]


for idx, example_title in enumerate(example_titles, start=1):

    recommendations = get_recommendations(     # Generate Top-K recommendations using the
        example_title,                          # final similarity matrix selected in Step 4.
        df,
        cosine_sim_matrix,                       # Backward-compatible variable name mapped
        top_k=10                                # to the selected model.
    )

    ild_score = compute_intra_list_diversity(recommendations)     # Compute Intra-list Diversity score.
    catalog_coverage = compute_catalog_coverage(                  # Compute Catalog Coverage score.
        recommendations, df.shape[0]
    )

    print(f"EXAMPLE TITLE #{idx}: {example_title}")                                       # Display example title
    print("\n")

    example_description = df.loc[df["title"] == example_title, "description"].values[0]   # Display description for contextual grounding

    print("DESCRIPTION:")
    print(fill(example_description, width=100))                                           # Wrap text for readability
    print("\n")

    print(
        f"'{example_title}' serves as the anchor item \n"
        f"for evaluating how effectively the final similarity model retrieves \n"
        f"related content based on learned content similarity signals. \n"
    )
    print("\n")

    print(f"THE TOP-{len(recommendations)} RECOMMENDATIONS")
    display(recommendations)                                              # Display recommended titles.
    print(
        f"The Top-{len(recommendations)} recommendations reflect titles \n"
        f"that the final model considers most similar to '{example_title}'. \n"
        f"The mix of content types and genres provides an initial qualitative check \n"
        f"on relevance and thematic consistency. \n"
    )
    print("\n")

    print("Intra-list Diversity (ILD):", round(ild_score, 3))             # Display diversity metric.
    print(
        f"An ILD score of {ild_score:.3f} indicates the degree of variety \n"
        f"within the recommendation list. Higher values suggest reduced redundancy \n"
        f"and stronger genre diversity, aligning with the diversity targets defined \n"
        f"in the success metrics. \n"
    )
    print("\n")

    print(f"Catalog Coverage (CC): {catalog_coverage:.3f}")               # Display coverage metric.
    print(
        f"A Catalog Coverage score of {catalog_coverage:.3f} indicates that this \n"
        f"single Top-{len(recommendations)} recommendation list surfaces roughly \n"
        f"{catalog_coverage:.1%} of the total catalog. This low value is expected \n"
        f"in a single-anchor evaluation and reflects a precision-focused retrieval \n"
        f"rather than broad catalog exploration.\n"
    )

    print("-" * 38)
    print("\n")

# Summary Across Example Titles
# ---------------------------------------------------------------------------
# Purpose: Aggregate proxy metrics across multiple anchor titles to
# demonstrate robustness and variability in recommendation behavior.

summary_rows = []

for example_title in example_titles:
    recommendations = get_recommendations(
        example_title,
        df,
        cosine_sim_matrix,
        top_k=10
    )

    summary_rows.append({
        "Example Title": example_title,
        "Intra-list Diversity (ILD)": round(compute_intra_list_diversity(recommendations), 3),
        "Catalog Coverage (CC)": round(
            compute_catalog_coverage(recommendations, df.shape[0]), 3
        )
    })

summary_df = pd.DataFrame(summary_rows)

print("SUMMARY OF RECOMMENDATION METRICS ACROSS EXAMPLE TITLES")
display(summary_df)
print("\n")
print(
    "The summary table shows that Intra-list Diversity (ILD) varies across anchor titles, \n"
    "while Catalog Coverage (CC) is fixed by the Top-K list size and the catalog size. \n"
    "Variation in ILD is expected in a content-based recommender system, as titles differ \n"
    "in genre breadth, thematic specificity, and metadata richness. Narrow or niche titles \n"
    "tend to produce more tightly clustered recommendations with lower diversity, while \n"
    "broadly categorized titles surface a wider range of genres, increasing ILD.\n\n"
    "Evaluating recommendations across multiple anchor titles strengthens validation \n"
    "by showing that the observed behavior does not depend on a single example. \n"
    "With only a handful of anchors, this analysis is an indicative check of consistency \n"
    "under data-limited conditions rather than proof of generalizability.\n"
)
print("\n")


# Variance Statistics (Across Example Titles)
# ---------------------------------------------------------------------------
# Purpose: Quantify how much recommendation diversity and exposure
# vary across different anchor titles.

ild_variance = summary_df["Intra-list Diversity (ILD)"].var()
cc_variance = summary_df["Catalog Coverage (CC)"].var()

ild_std = summary_df["Intra-list Diversity (ILD)"].std()
cc_std = summary_df["Catalog Coverage (CC)"].std()

variance_df = pd.DataFrame({
    "Metric": ["Intra-list Diversity (ILD)", "Catalog Coverage (CC)"],
    "Variance": [round(ild_variance, 4), round(cc_variance, 6)],
    "Standard Deviation": [round(ild_std, 4), round(cc_std, 6)]
})

print("VARIANCE STATISTICS ACROSS EXAMPLE TITLES")
display(variance_df)
print("\n")
ild_var = variance_df.loc[variance_df["Metric"] == "Intra-list Diversity (ILD)", "Variance"].values[0]
ild_std = variance_df.loc[variance_df["Metric"] == "Intra-list Diversity (ILD)", "Standard Deviation"].values[0]

cc_var = variance_df.loc[variance_df["Metric"] == "Catalog Coverage (CC)", "Variance"].values[0]
cc_std = variance_df.loc[variance_df["Metric"] == "Catalog Coverage (CC)", "Standard Deviation"].values[0]

print(
    f"INTERPRETATION:\n"
    f"Intra-list Diversity (ILD) has a variance of {ild_var:.3f} and a standard \n"
    f"deviation of {ild_std:.3f} on a 0-1 scale. The smaller these values, the more \n"
    f"consistent list diversity is across anchor titles; larger values indicate that \n"
    f"diversity depends on the anchor's genre breadth.\n\n"
    f"Catalog Coverage (CC) has a variance of {cc_var:.6f} and a standard deviation \n"
    f"of {cc_std:.6f}. Near-zero dispersion is expected in a single-anchor, Top-K \n"
    f"setting, because coverage is fixed by the recommendation list size and the \n"
    f"catalog size rather than by the model.\n\n"
    f"Overall, these statistics describe how predictable list-level behavior is \n"
    f"across the chosen anchors under data-limited conditions. They are computed \n"
    f"from a small number of anchors and should be read as indicative.\n"
)
print("\n")


### 𓂃🖊 Key Findings

The evaluation shows that the final recommender produces thematically relevant and context-aware recommendations across diverse anchor titles. Intra-list Diversity tracks content specificity, ranging from 0.20 for narrow family titles to 0.78 for broader crime and drama content, while documentaries exhibit moderate diversity (ILD 0.38–0.71). Catalog Coverage remains at 0.001 (≈ 0.1% per Top-10 list), as expected for a single precision-focused list. Across the anchors, ILD variance is 0.056 (standard deviation 0.237) and CC variance is ≈ 0.000: coverage is fixed by the Top-K size, while diversity varies with the anchor's genre breadth. Together these results support the model's suitability for deployment under data-limited conditions.
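
The coverage figure follows directly from the list size and the catalog size quoted in this section. A quick check, assuming a Top-10 list with no repeated titles and the 8,809-title catalog; the helper name `single_list_coverage` is illustrative and not part of the notebook:

```python
def single_list_coverage(list_size, catalog_size):
    """Share of the catalog surfaced by one list of unique titles."""
    return list_size / catalog_size

cc = single_list_coverage(10, 8809)
print(f"{cc:.6f}")   # 0.001135
print(f"{cc:.3f}")   # 0.001
print(f"{cc:.1%}")   # 0.1%
```

Because every anchor uses the same list size and the same catalog, this value is identical for all anchors, which is why its variance is zero by construction.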

*   **Qualitative Relevance**:
    *  Example titles (e.g., Dick Johnson Is Dead, Ghost Rider, Show Dogs) retrieve genre- and theme-consistent recommendations, validating semantic similarity behavior

*   **Intra-list Diversity (ILD)**:
    *  Ranges from **0.20** (Show Dogs, niche family genre) to **0.78** (King of Boys: The Return of the King, broader crime/drama mix)
    *  Documentary and politically themed titles show moderate diversity (ILD **0.38–0.71**)
    *  **Variance: 0.056, Std. Dev.: 0.237**, indicating that diversity depends on the anchor's genre breadth while staying within the 0.20–0.78 range

*   **Catalog Coverage (CC)**:
    *  Consistently **0.001** per Top-10 list (≈ **0.1%** of the 8,809-title catalog)
    *  Near-zero variance confirms coverage is structurally constrained by fixed Top-K size, not model instability

*   **Robustness Validation**:
    *  Multi-anchor evaluation shows performance is not driven by a single example
    *  Behavior adapts to title specificity while remaining predictable and controlled

*   **Business Interpretation**:
    *  Delivers precision-focused recommendations with appropriate diversity
    *  Avoids erratic or overly repetitive outputs
    *  Suitable for real-world deployment in cold-start and metadata-only settings
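
To make the ILD figures above concrete, the sketch below recomputes the metric for a hypothetical three-title list. The genre strings are invented for illustration, and the function name `jaccard_ild` is an assumption; it mirrors the pairwise Jaccard logic of `compute_intra_list_diversity` but takes plain strings so that it runs on its own.

```python
from itertools import combinations

def jaccard_ild(genre_strings):
    """1 minus the mean pairwise Jaccard similarity of comma-separated genre lists."""
    genre_sets = [set(g.split(", ")) for g in genre_strings]
    pairs = list(combinations(genre_sets, 2))
    if not pairs:                      # fewer than two items: no pairs to compare
        return 0.0
    sims = [len(a & b) / len(a | b) for a, b in pairs]
    return 1 - sum(sims) / len(sims)

# Hypothetical recommendation list (not taken from the catalog).
toy_list = [
    "Dramas, International Movies",
    "Dramas, Thrillers",
    "Documentaries",
]
print(round(jaccard_ild(toy_list), 3))   # 0.889
```

Only the first two titles share a genre (Jaccard 1/3), the other two pairs share none, so the mean similarity is 1/9 and the ILD is 8/9. A list of titles with identical genre sets would score 0.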

---