CSE 158/258 - Recommender Sys & Web Mining - FA24
# Assignment 2 Report

Andrew Choi (A69033628), Chanbin Na (AA18087468), Jonghee Chun (A69033997), Kenny Hwang (A99021639)

[GitHub Repo](https://github.com/cjychoi/cse258-fa24/tree/main)

## 1. Exploratory Dataset Analysis

**Dataset: Food.com Recipes and Interactions**\
Source: https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions

This dataset includes 180K+ recipes and 700K+ recipe reviews spanning 18 years of user interactions and uploads on Food.com. For our analysis, we use two files: **RAW_interactions.csv** and **RAW_recipes.csv**. We selected this dataset for its rich and diverse information. Some of the interesting fields include:
```
- user_id: Unique user ID values
- recipe_id: Unique recipe ID values
- rating: Rating given by the user (0-5 scale)
- review: Review of the recipe (text)
- minutes: Minutes to prepare the recipe
- tags: Food.com tags for each recipe
- nutrition: Nutrition information (calories, total fat, sugar, sodium, protein, saturated fat, and carbohydrates)
- n_steps: Number of steps in recipe
- ingredients: List of ingredient names
- n_ingredients: Number of ingredients
- description: User-provided description
```

At first glance, several fields appear likely to correlate with one another. For example, `minutes`, `n_steps`, and `n_ingredients` may affect a user's `rating`, because a complicated recipe can be a burden to cook and may therefore receive a lower rating.

We summarized the properties of the dataset and obtained the following results:\
<img src="img/data_summary.png" width="350">
<img src="img/data_ratings.png" width="500">
<img src="img/data_ingredients.png" width="500">
<img src="img/data_recipe_sub.png" width="504">
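
The summary statistics shown above can be reproduced with a few counters. The following is a minimal sketch using only the standard library, not the exact code behind the figures. It assumes that `ingredients` is stored as the text of a Python list and that the recipes file has a `submitted` date column in `YYYY-MM-DD` form; the function name `summarize` and the toy rows are hypothetical.

```python
import ast
from collections import Counter

def summarize(recipes, interactions):
    """Compute headline statistics from lists of row dicts (one dict per CSV row)."""
    rating_counts = Counter(int(row["rating"]) for row in interactions)
    ingredient_counts = Counter()
    per_year = Counter()
    total_ingredients = 0
    for row in recipes:
        # Assumption: the `ingredients` cell holds the text of a Python list.
        ingredients = ast.literal_eval(row["ingredients"])
        ingredient_counts.update(set(ingredients))
        total_ingredients += len(ingredients)
        # Assumption: `submitted` is a date string such as "2007-03-15".
        per_year[int(row["submitted"][:4])] += 1
    return {
        "n_recipes": len(recipes),
        "n_interactions": len(interactions),
        "avg_ingredients": total_ingredients / len(recipes),
        "rating_counts": dict(rating_counts),
        "top_ingredients": ingredient_counts.most_common(2),
        "recipes_per_year": dict(per_year),
    }

# Tiny made-up rows for illustration only (not taken from the dataset).
toy_recipes = [
    {"id": "1", "ingredients": "['salt', 'butter', 'flour']", "submitted": "2007-01-02"},
    {"id": "2", "ingredients": "['salt', 'onion']", "submitted": "2007-05-09"},
    {"id": "3", "ingredients": "['salt', 'butter', 'sugar', 'egg']", "submitted": "2010-11-30"},
]
toy_interactions = [
    {"user_id": "a", "recipe_id": "1", "rating": "5"},
    {"user_id": "b", "recipe_id": "1", "rating": "5"},
    {"user_id": "a", "recipe_id": "2", "rating": "4"},
    {"user_id": "c", "recipe_id": "3", "rating": "0"},
]
summary = summarize(toy_recipes, toy_interactions)
```

With the real files, the row dicts could come from `csv.DictReader` opened on **RAW_recipes.csv** and **RAW_interactions.csv**.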

### Analysis
- **Dataset Characteristics:**
    - The two raw files are extensive, with over 230,000 recipes and 1.1 million interactions, which is large enough for machine learning tasks such as recommendation or rating prediction.
    - The average number of ingredients per recipe (around 9.55) suggests that most recipes are relatively simple, with a manageable number of ingredients.

- **Rating Distribution:**
    - The ratings are heavily skewed towards the highest score (5), which might indicate a bias in user reviews or a tendency for users to rate recipes they liked rather than disliked.
    - The small number of ratings below 4 suggests that poorly received recipes may either not be reviewed often or that users prefer not to leave negative feedback.

- **Ingredients Popularity:**
    - Common ingredients like salt, butter, sugar, and onion dominate the dataset, reflecting their foundational role in a wide variety of recipes.
    - These ingredients are versatile and likely to appear in both simple and complex recipes, which might impact model predictions for recipe similarity or ingredient importance.

- **Temporal Trends:**
    - Recipe submissions peaked around 2007, reflecting either a surge in platform popularity or increased engagement during that period.
    - The sharp decline after 2007 might be due to changes in platform usage, competition from other recipe-sharing platforms, or a reduction in user engagement.

- **Insights for Predictive Modeling:**
    - The skewed rating distribution could pose challenges for regression models, requiring careful handling, such as rebalancing or weighted loss functions, to mitigate bias.
    - The dominance of certain ingredients might mean they have less predictive power for ratings, as they are present in a large proportion of recipes.
    - Temporal trends could be factored into models to study how recipe popularity or user engagement changed over time, which might also correlate with rating patterns.
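
One way to make the weighted-loss idea concrete is to weight each sample by the inverse frequency of its rating. The sketch below is an illustration under that assumption, not something we have applied to the data; `inverse_frequency_weights` and `weighted_mse` are hypothetical helper names, and the ratings are made up.

```python
from collections import Counter

def inverse_frequency_weights(ratings):
    """Weight each sample so that every rating value carries the same total weight."""
    counts = Counter(ratings)
    n, k = len(ratings), len(counts)
    # Each distinct rating value ends up with total weight n / k.
    return [n / (k * counts[r]) for r in ratings]

def weighted_mse(y_true, y_pred, weights):
    """Mean squared error in which each squared error is scaled by its sample weight."""
    total = sum(w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights))
    return total / sum(weights)

# Made-up, heavily skewed ratings for illustration only.
toy_ratings = [5, 5, 5, 5, 5, 5, 4, 4, 1]
weights = inverse_frequency_weights(toy_ratings)

# A model that always predicts 5 looks better under the plain MSE
# than under the rebalanced one.
always_five = [5.0] * len(toy_ratings)
plain = weighted_mse(toy_ratings, always_five, [1.0] * len(toy_ratings))
balanced = weighted_mse(toy_ratings, always_five, weights)
```

On this toy input the rebalanced error is larger than the plain one, because the rare low rating is no longer drowned out by the many 5s.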


## 2. Predictive Task and Evaluation

### Predictive Task 1

The goal is to predict the rating. As a first step, we analyzed whether any features of the recipes correlate with the rating. We examined all the numerical features (`minutes`, `n_steps`, `n_ingredients`, and the `nutrition` values: calories, total fat, sugar, sodium, protein, saturated fat, and carbohydrates) to determine whether any of them correlates with the rating.

<img src="img/correlationMatrix.png" width="500">

In addition to calculating the correlations, we fit a separate single-feature linear regression of `rating` on each feature and recorded its coefficient, intercept, and R² score to quantify the relationship.


| Feature          | Coefficient | Intercept | R² Score |
|-------------------|-------------|-----------|----------|
| minutes          | 0.000       | 4.411     | 0.000    |
| n_steps          | -0.005      | 4.455     | 0.000    |
| n_ingredients    | -0.001      | 4.422     | 0.000    |
| calories         | -0.000      | 4.418     | 0.000    |
| total_fat        | -0.000      | 4.415     | 0.000    |
| sugar            | -0.000      | 4.413     | 0.000    |
| sodium           | -0.000      | 4.412     | 0.000    |
| protein          | -0.000      | 4.414     | 0.000    |
| saturated_fat    | -0.000      | 4.415     | 0.000    |
| carbohydrates    | -0.000      | 4.415     | 0.000    |

The analysis shows that `minutes`, `n_steps`, `n_ingredients`, and all the `nutrition` values have negligible correlations with `rating`. Each single-feature regression has a near-zero coefficient and an R² score of 0.000 (to three decimal places), so none of these features explains the variance in `rating` on its own.
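
For reference, each row of the table corresponds to a one-variable least-squares fit. The sketch below shows how the coefficient, intercept, and R² can be computed in closed form. It is an illustration with made-up numbers, not the code that produced the table, and `fit_single_feature` is a hypothetical name. It assumes neither the feature nor the rating is constant.

```python
def fit_single_feature(x, y):
    """Ordinary least squares for y = intercept + slope * x, plus the R^2 score."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r2 = 1.0 - ss_res / syy
    return slope, intercept, r2

# Made-up feature values and ratings for illustration only.
toy_steps = [1, 2, 3, 4]
toy_rating = [5, 4, 5, 4]
slope, intercept, r2 = fit_single_feature(toy_steps, toy_rating)
```

The coefficient depends on the units of the feature (a per-minute slope is naturally tiny), whereas R² does not, so R² is the more reliable of the two columns for comparing features.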



### Predictive Task 2
The task is to recommend recipes similar to a given recipe based on their ingredient similarity. The **Jaccard Similarity** is used to calculate the similarity between the sets of ingredients in recipes. The results provide the top 5 similar recipes for a given target recipe.

### Data Processing
- Input Features: The ingredients of each recipe are used as input features. These are stored in the `ingredients` column of the dataset.
- Preprocessing: The `ingredients` column is converted into a set for each recipe, enabling the computation of Jaccard Similarity.
- Mapping: Recipe IDs are mapped to their respective ingredient sets, using the `id` column for identification.
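
The preprocessing and mapping steps above can be sketched as follows. This assumes the `ingredients` cell holds the text of a Python list; the lowercasing and whitespace stripping are an optional normalization added for the example, and `build_ingredient_sets` is a hypothetical name.

```python
import ast

def build_ingredient_sets(rows):
    """Map each recipe id to the set of its ingredient names."""
    ingredient_sets = {}
    for row in rows:
        # Assumption: the `ingredients` cell holds the text of a Python list.
        names = ast.literal_eval(row["ingredients"])
        # Optional normalization (an assumption of this sketch).
        ingredient_sets[int(row["id"])] = {name.strip().lower() for name in names}
    return ingredient_sets

# Two rows modeled on recipes from the Results listing in this report.
toy_rows = [
    {"id": "408958", "ingredients": "['popcorn', 'butter', 'olive oil', 'salt']"},
    {"id": "104441", "ingredients": "['garlic', 'butter', 'olive oil', 'salt', 'salt']"},
]
ingredient_sets = build_ingredient_sets(toy_rows)
```

Using sets drops repeated ingredient names, so each ingredient counts once per recipe in the similarity computation.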

### Model
- Jaccard Similarity:
    - Measures the similarity between two sets as the ratio of the intersection size to the union size, i.e. `|A ∩ B| / |A ∪ B|` for ingredient sets A and B.
- Baseline:
    - The baseline for this task is a random recommendation, which serves as a benchmark to compare the performance of the Jaccard-based similarity recommendation system.
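
A minimal sketch of the Jaccard-based recommender and the random baseline follows. The function names and the 7-ingredient target set are hypothetical; the two 4-ingredient recipes are taken from the Results listing in this report. A target that shares 3 ingredients with a 4-ingredient recipe gives 3 / 8 = 0.375, which rounds to the 0.38 seen in that listing.

```python
import heapq
import random

def jaccard(a, b):
    """Intersection size divided by union size; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(target_id, ingredient_sets, k=5):
    """Return the k highest (similarity, recipe_id) pairs, excluding the target itself."""
    target = ingredient_sets[target_id]
    scored = (
        (jaccard(target, other), rid)
        for rid, other in ingredient_sets.items()
        if rid != target_id
    )
    return heapq.nlargest(k, scored)

def random_baseline(target_id, ingredient_sets, k=5, seed=0):
    """Baseline: pick up to k other recipes uniformly at random."""
    candidates = [rid for rid in ingredient_sets if rid != target_id]
    return random.Random(seed).sample(candidates, min(k, len(candidates)))

toy_sets = {
    # Hypothetical target recipe (id 0) with made-up ingredients.
    0: {"squash", "honey", "butter", "olive oil", "salt", "cumin", "pepper"},
    408958: {"popcorn", "butter", "olive oil", "salt"},
    104441: {"garlic", "butter", "olive oil", "salt"},
    # Hypothetical recipe with no overlap.
    1: {"flour", "sugar", "egg"},
}
top = most_similar(0, toy_sets, k=2)
baseline = random_baseline(0, toy_sets, k=5)
```

Scoring every recipe against the target is linear in the number of recipes per query, which is acceptable for a single lookup but may need an inverted index from ingredient to recipes if many queries are run.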

### Results
```
Top 5 Recipes Similar to 'arriba   baked winter squash mexican style':

Recipe: berber spice roasted chickpeas (ID: 514675)
Ingredients: dried garbanzo beans, salt, olive oil, mixed spice
Jaccard Similarity: 0.38

Recipe: ed s homemade microwave buttery popcorn (ID: 408958)
Ingredients: popcorn, butter, olive oil, salt
Jaccard Similarity: 0.38

Recipe: honey roasted peanuts (ID: 147856)
Ingredients: peanuts, butter, honey, salt
Jaccard Similarity: 0.38

Recipe: julia child method of preparing garlic (ID: 104441)
Ingredients: garlic, butter, olive oil, salt
Jaccard Similarity: 0.38

Recipe: potatoes rissole (ID: 72347)
Ingredients: russet potatoes, salt, butter, olive oil
Jaccard Similarity: 0.38
```

**Ingredient Overlap:**
- Many of the recommended recipes have overlapping ingredients like salt, olive oil, and butter, which are common foundational ingredients.
- This high overlap in foundational ingredients likely drives the similarity scores.

**Limitations:**
- The recommendations are heavily influenced by the most commonly occurring ingredients (e.g., salt, olive oil), which could reduce diversity in recommendations.
- Recipes with unique or specialized ingredients may not be well-represented due to their rarity in the dataset.

**Potential Enhancements:**
- Weighting ingredient importance could improve the diversity of recommendations by prioritizing unique ingredients over ubiquitous ones.
- Incorporating additional features such as tags, preparation steps, or user ratings could refine the recommendation system.
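
The first enhancement could be sketched as an IDF-style weighted Jaccard, where an ingredient that appears in every recipe gets weight 0. This is one possible weighting scheme and an assumption of this example, not something we have implemented; the function names and toy recipes are hypothetical.

```python
import math
from collections import Counter

def idf_weights(ingredient_sets):
    """Weight = log(N / document frequency); ubiquitous ingredients get weight 0."""
    n = len(ingredient_sets)
    doc_freq = Counter(name for s in ingredient_sets.values() for name in s)
    return {name: math.log(n / count) for name, count in doc_freq.items()}

def weighted_jaccard(a, b, weights):
    """Sum of weights over the intersection divided by sum of weights over the union."""
    union = sum(weights.get(name, 0.0) for name in a | b)
    if union == 0.0:
        return 0.0
    return sum(weights.get(name, 0.0) for name in a & b) / union

# Made-up recipes for illustration only.
toy_sets = {
    1: {"salt", "butter", "popcorn"},
    2: {"salt", "butter", "garlic"},
    3: {"salt", "saffron", "rice"},
    4: {"salt", "saffron", "garlic"},
}
weights = idf_weights(toy_sets)
plain_score = len(toy_sets[1] & toy_sets[2]) / len(toy_sets[1] | toy_sets[2])
weighted_score = weighted_jaccard(toy_sets[1], toy_sets[2], weights)
```

In this toy example the plain Jaccard score of recipes 1 and 2 is 0.5, but the weighted score drops to 0.25 because the shared `salt` no longer contributes.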

### Evaluation of Predictive Task

**Evaluation:**
- The recommendations are evaluated qualitatively by examining the relevance of the similar recipes.
- Quantitative evaluation can include precision and recall by comparing recommendations to user interaction data (if available).
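
A quantitative check along these lines could be sketched as precision@k and recall@k. How the "relevant" set is derived from the interaction data (for example, other recipes the same users rated highly) is an assumption left open here; `precision_recall_at_k` is a hypothetical name, and the relevant set below is made up.

```python
def precision_recall_at_k(recommended, relevant, k=5):
    """Precision and recall of the first k recommended ids against a relevant set."""
    top_k = recommended[:k]
    hits = sum(1 for rid in top_k if rid in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Recommended ids from the Results listing; the relevant set is hypothetical.
recommended_ids = [514675, 408958, 147856, 104441, 72347]
relevant_ids = {408958, 72347, 999999}
precision, recall = precision_recall_at_k(recommended_ids, relevant_ids, k=5)
```

Running the same function on the random baseline's output would give the comparison point for the Jaccard-based system.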

**Baseline:**
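- Random recipe recommendations serve as the baseline for comparison. In our qualitative inspection, the Jaccard Similarity-based system returns recipes that share more ingredients with the target than random picks would; we have not yet measured this quantitatively.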

**Validity:**
- The approach is valid as it directly leverages shared ingredient sets, which align with user expectations for similar recipes.

## 3. Model Description

### Proposed Model:

Description of the model(s) used and why they were chosen.\
Explanation of how the model was implemented and optimized.

### Challenges

The major challenge we faced was determining the correlation between the features and the ratings.

## 4. Related Literature

## 5. Results and Conclusions

## 6. References

**Food.com Recipes and Interactions (Kaggle)**\
Crawled data from Food.com (GeniusKitchen) online recipe aggregator\
https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions?select=PP_recipes.csv


**Generating Personalized Recipes from Historical User Preferences**\
Bodhisattwa Prasad Majumder*, Shuyang Li*, Jianmo Ni, Julian McAuley\
EMNLP, 2019\
https://www.aclweb.org/anthology/D19-1613/\
https://aclanthology.org/D19-1613.pdf
