# The Relationship Between Recipe Complexity and User Ratings

**Name(s)**: Yidong Shi & Yuntao Shan

**Website Link**: (your website link)

In [23]:
import pandas as pd
import numpy as np
from pathlib import Path

from dsc80_utils import *
import plotly.express as px
pd.options.plotting.backend = 'plotly'


## Step 1: Introduction

Food is one of the most universally loved aspects of life. Across cultures and generations, it brings people together, evokes emotion, and sparks creativity. There are countless ways to prepare a single dish, and new recipes are created every day as people experiment with different ingredients and techniques to delight their families and friends. In American cuisine, regional styles reflect this diversity in cooking complexity. For example, **Texas-style barbecue** is known for its slow-smoked meats that take hours to prepare, reflecting a high-complexity, time-intensive tradition. In contrast, dishes like **California avocado toast** or **New York deli sandwiches** embrace speed and simplicity while still delivering bold flavors. As the culinary world continues to evolve, so do the ways people approach cooking at home. Some prefer quick and easy meals that save time, while others find joy in crafting complex, multi-step recipes. This variety leads us to an intriguing question: **Does the complexity of a recipe influence how much users enjoy it?**

### Dataset Overview

Our analysis uses two datasets derived from Food.com, covering user-submitted recipes and reviews from 2008 onward.

---


#### `RAW_recipes.csv` (approx. **83,782 rows × 12 columns**)

This dataset includes information about each recipe submitted by users. Each row represents a single recipe and contains metadata including preparation time, nutritional content, and steps.

| Column           | Description                                                                 |
|------------------|-----------------------------------------------------------------------------|
| `'name'`           | Recipe name                                                                 |
| `'id'`             | Recipe ID                                                                   |
| `'minutes'`        | Minutes to prepare recipe                                                   |
| `'contributor_id'` | User ID who submitted this recipe                                           |
| `'submitted'`      | Date recipe was submitted                                                   |
| `'tags'`           | Food.com tags for recipe                                                    |
| `'nutrition'`      | Nutrition information in the form [calories (#), total fat (PDV), sugar (PDV), <br>sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]; <br>PDV stands for “percentage of daily value” |
| `'n_steps'`        | Number of steps in recipe                                                   |
| `'steps'`          | Text for recipe steps, in order                                             |
| `'description'`    | User-provided description                                                   |

---

#### `interactions.csv` (approx. **731,927 rows × 5 columns**)

This dataset contains user interactions with recipes in the form of ratings and reviews. Each row corresponds to one user’s interaction with one recipe.

| Column     | Description             |
|------------|-------------------------|
| `'user_id'`  | User ID                 |
| `'recipe_id'`| Recipe ID               |
| `'date'`     | Date of interaction     |
| `'rating'`   | Rating given            |
| `'review'`   | Review text             |

To explore the relationship between recipe complexity and user ratings, we focus on two key variables that represent complexity: `minutes` and `n_steps`, which capture the time required and the number of steps needed to complete a recipe, respectively.

We began by cleaning the dataset to ensure accuracy in our analysis. Specifically, we replaced all ratings of 0 in the interactions dataset with `NaN`, as a rating of 0 likely represents missing or skipped feedback. We then merged the recipes and interactions datasets and computed the average rating for each unique recipe, stored in a new column `avg_rating`.

To facilitate a more structured comparison, we plan to create categorical definitions of complexity levels in future steps. For example, we may classify recipes as **"low complexity"** if they have both preparation time and steps below the dataset's median, and **"high complexity"** if both are above the median.

The most relevant columns for our analysis are:
- `minutes`: Total time required to prepare a recipe.
- `n_steps`: The number of preparation steps.
- `rating`: The rating given by a user in a single interaction.
- `avg_rating`: The average rating for each recipe.

By examining how `minutes` and `n_steps` relate to user ratings, we aim to understand whether users favor simpler or more elaborate dishes. The results may offer insights into user behavior on Food.com and help recipe creators balance efficiency and depth when developing new recipes.


## Step 2: Data Cleaning and Exploratory Data Analysis

In [2]:
recipes = pd.read_csv('RAW_recipes.csv')
interactions = pd.read_csv('interactions.csv')

# Replace rating=0 with np.nan
interactions['rating'] = interactions['rating'].replace(0, np.nan)

# Compute average rating per recipe
avg_rating = interactions.groupby('recipe_id')['rating'].mean()

# Merge average rating into recipes dataset
recipes['avg_rating'] = recipes['id'].map(avg_rating)

# Display the updated dataset
recipes[['id', 'minutes', 'n_steps', 'avg_rating']].head()

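|   | id | minutes | n_steps | avg_rating |
|---|--------|---------|---------|------------|
| 0 | 333281 | 40 | 10 | 4.0 |
| 1 | 453467 | 45 | 12 | 5.0 |
| 2 | 306168 | 40 | 6 | 5.0 |
| 3 | 286009 | 120 | 7 | 5.0 |
| 4 | 475785 | 90 | 17 | 5.0 |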


In [3]:
fig = px.scatter(
    recipes,
    x='minutes',
    y='avg_rating',
    title='Preparation Time vs Average Recipe Rating',
    labels={'minutes': 'Preparation Time (minutes)', 'avg_rating': 'Average Rating'}
)
fig.show()

This plot explores the relationship between preparation time and average rating. Although there is no strong correlation, extremely long or short recipes show more variability in ratings.

In [4]:
fig = px.scatter(
    recipes,
    x='n_steps',
    y='avg_rating',
    title='Number of Steps vs Average Recipe Rating',
    labels={'n_steps': 'Number of Steps', 'avg_rating': 'Average Rating'}
)
fig.show()

This scatter plot shows the number of preparation steps versus the average rating. There appears to be a slight upward trend—recipes with more steps tend to receive marginally higher ratings, although the overall variance is large.
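To put a number on a trend like the one read off this scatter plot, one option is a rank (Spearman) correlation, which is less sensitive to a handful of recipes with extreme `n_steps` or `minutes`. The sketch below is illustrative only: `toy` is a made-up DataFrame standing in for `recipes`, and its values are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be the `recipes` DataFrame.
toy = pd.DataFrame({
    'n_steps':    [3, 5, 8, 12, 20, 30],
    'avg_rating': [4.2, 4.0, 4.5, 4.4, 4.8, 4.7],
})

# Spearman correlation is the Pearson correlation of the ranks,
# so a few very long recipes cannot dominate the result.
rho = toy['n_steps'].rank().corr(toy['avg_rating'].rank())
print(f"Spearman rho: {rho:.3f}")
```

On the real data, rows with a missing `avg_rating` would need to be dropped first (for example with `dropna(subset=['avg_rating'])`).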

In [5]:
median_minutes = recipes['minutes'].median()
median_steps = recipes['n_steps'].median()

# Define function for labeling complexity
def label_complexity(row):
    if row['minutes'] > median_minutes and row['n_steps'] > median_steps:
        return 'high'
    elif row['minutes'] < median_minutes and row['n_steps'] < median_steps:
        return 'low'
    else:
        return 'medium'

# Apply the function to create a new column
recipes['complexity_level'] = recipes.apply(label_complexity, axis=1)

# Check distribution
recipes['complexity_level'].value_counts()

complexity_level
medium    34723
high      24630
low       24429
Name: count, dtype: int64

To facilitate our hypothesis test, we created a new column called `complexity_level` that categorizes each recipe based on its preparation time and number of steps.

Specifically:
- A recipe is labeled **high complexity** if both `minutes` and `n_steps` are above their respective medians.
- It is labeled **low complexity** if both values are below the medians.
- Recipes that fall in between (e.g., high time but low steps) are labeled **medium** and excluded from our initial hypothesis test.

This categorical column allows us to compare average ratings across complexity levels and test whether high-complexity recipes are rated differently from low-complexity ones.
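The labeling rules above can also be written without a row-wise `apply`. The following is a minimal sketch using `np.select` on a made-up DataFrame (`toy` is hypothetical, not our data); it assumes the same rule as above, so any recipe that sits exactly at a median falls into the `medium` group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `recipes`.
toy = pd.DataFrame({
    'minutes': [10, 20, 30, 60, 120],
    'n_steps': [3, 9, 6, 4, 15],
})

median_minutes = toy['minutes'].median()
median_steps = toy['n_steps'].median()

# Boolean masks for the two "pure" groups.
is_high = (toy['minutes'] > median_minutes) & (toy['n_steps'] > median_steps)
is_low = (toy['minutes'] < median_minutes) & (toy['n_steps'] < median_steps)

# np.select uses the first matching condition; everything else is 'medium'.
toy['complexity_level'] = np.select([is_high, is_low], ['high', 'low'], default='medium')
print(toy)
```

Because the masks are computed on whole columns at once, this form tends to scale better than `apply(..., axis=1)` on a table with tens of thousands of rows.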

In [6]:
# Plot
fig = px.box(
    recipes.dropna(subset=['avg_rating']),
    x='complexity_level',
    y='avg_rating',
    title='Average Rating by Recipe Complexity Level',
    labels={'complexity_level': 'Recipe Complexity', 'avg_rating': 'Average Rating'}
)
fig.show()

This boxplot shows how average ratings vary by recipe complexity level. We define high-complexity recipes as those with both preparation time and step count above the median, and low-complexity recipes as those with both below. The plot shows that high-complexity recipes tend to receive slightly higher ratings on average, but the spread within each group is quite large.
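A per-group summary table is a useful companion to a boxplot like this one, since it states the group sizes, centers, and spreads explicitly. The sketch below runs on a small made-up DataFrame (`toy` is hypothetical); the column names mirror ours, but the numbers are illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for `recipes`.
toy = pd.DataFrame({
    'complexity_level': ['low', 'low', 'medium', 'high', 'high', 'high'],
    'avg_rating':       [4.0, 5.0, 4.5, 4.0, None, 5.0],
})

# Drop unrated recipes, then summarize ratings within each complexity level.
summary = (
    toy.dropna(subset=['avg_rating'])
       .groupby('complexity_level')['avg_rating']
       .agg(['count', 'mean', 'median', 'std'])
)
print(summary)
```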

## Step 3: Assessment of Missingness

In [7]:
# TODO

## Step 4: Hypothesis Testing

We plan to test whether the average user rating of high-complexity recipes is significantly different from that of low-complexity recipes.

To do so, we define:
- **High-complexity recipes** as those with both `minutes` and `n_steps` above the median.
- **Low-complexity recipes** as those with both `minutes` and `n_steps` below the median.

### Null Hypothesis (H₀):
There is no difference in the average rating between high-complexity and low-complexity recipes.

### Alternative Hypothesis (H₁):
There is a significant difference in the average rating between high-complexity and low-complexity recipes.

We will use the **difference in average rating** between the two groups as our test statistic, and conduct a **permutation test** to determine if the observed difference is statistically significant.
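Because the alternative hypothesis is two-sided, one way to match it is to compare the absolute value of each shuffled difference against the absolute observed difference. The following is a minimal sketch under that assumption: `permutation_test_two_sided` is a hypothetical helper, the ratings are made up, and a seeded generator is used only to make the example reproducible.

```python
import numpy as np

def permutation_test_two_sided(high, low, n_iter=1000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = np.random.default_rng(seed)
    high = np.asarray(high, dtype=float)
    low = np.asarray(low, dtype=float)
    pooled = np.concatenate([high, low])
    n_high = len(high)

    obs_diff = high.mean() - low.mean()
    diffs = np.empty(n_iter)
    for i in range(n_iter):
        # Shuffle the pooled ratings and re-split into groups of the original sizes.
        shuffled = rng.permutation(pooled)
        diffs[i] = shuffled[:n_high].mean() - shuffled[n_high:].mean()

    # Two-sided: count shuffles at least as extreme in either direction.
    p_value = float(np.mean(np.abs(diffs) >= abs(obs_diff)))
    return obs_diff, p_value

# Made-up ratings for illustration only.
high_ratings = [4.5, 5.0, 4.0, 4.8, 4.6]
low_ratings = [4.3, 4.9, 4.1, 4.7, 4.2]
obs_diff, p_value = permutation_test_two_sided(high_ratings, low_ratings)
print(f"Observed difference: {obs_diff:.4f}, two-sided p-value: {p_value:.4f}")
```

With real data, missing `avg_rating` values would have to be removed before calling the helper, since `mean()` on a NumPy array propagates `NaN`.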



In [19]:
# Drop rows where n_steps equals the median
median_steps = recipes['n_steps'].median()
filtered = recipes[recipes['n_steps'] != median_steps].copy()

# Create the label column: 'high' or 'low'
filtered['step_group'] = np.where(filtered['n_steps'] > median_steps, 'high', 'low')

# Compute the observed difference in mean rating (high minus low)
grouped = filtered.groupby('step_group')['avg_rating'].mean()
obs_diff = grouped['high'] - grouped['low']

# Permutation test: shuffle the group labels and recompute the difference
diffs = []
for _ in range(1000):
    shuffled = filtered.copy()
    shuffled['step_group'] = np.random.permutation(shuffled['step_group'])

    grouped = shuffled.groupby('step_group')['avg_rating'].mean()
    diff = grouped['high'] - grouped['low']
    diffs.append(diff)

# Compute the p-value (right-tailed test)
p_val = np.mean(np.array(diffs) >= obs_diff)

print(f"Observed Difference: {obs_diff:.4f}")
print(f"P-value: {p_val:.4f}")


Observed Difference: -0.0050
P-value: 0.8540


In [None]:

fig = px.histogram(diffs, nbins=30, title='Permutation Test: Difference in Avg Rating')
fig.add_vline(x=obs_diff, line_dash='dash', line_color='red')
fig.show()

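**Result:**  
The test above was run on a split by `n_steps` alone: recipes with more steps than the median were labeled "high", recipes with fewer steps were labeled "low", and recipes at exactly the median were dropped. It therefore does not yet use the combined `minutes`-and-`n_steps` definition of complexity stated in the hypotheses.  
Using a permutation test with 1000 iterations, the observed difference in average rating (high minus low) was **-0.0050**, and the right-tailed **p-value was 0.8540** in the run shown above. No random seed is set, so the p-value varies slightly between runs.

**Conclusion:**  
Since the p-value is much greater than 0.05, we **fail to reject the null hypothesis**. Because the p-value is right-tailed, this means that we found **no statistically significant evidence** that recipes with more steps receive higher user ratings; it is not a test of the two-sided alternative stated above. In fact, the observed difference was slightly negative.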



## Step 5: Framing a Prediction Problem

We plan to build a model to predict the average user rating (`avg_rating`) of a recipe based on its complexity.

This is a **regression problem**, as `avg_rating` is a continuous numerical variable. Once ratings of 0 are replaced with `NaN` (as in Step 2), it ranges from 1 to 5.

We chose this variable because user rating is a key indicator of a recipe's quality and popularity on Food.com. Being able to predict rating based on recipe features could help the platform better recommend recipes to users, and provide contributors with guidance for optimizing their submissions.

Our model will use features such as `minutes`, `n_steps`, and potentially tags or nutrition values (e.g., calories, sugar, protein) to estimate the expected rating of a new or unseen recipe.



In [9]:
# TODO

## Step 6: Baseline Model

In [28]:
import ast

# ---------- Load the data ----------
recipes = pd.read_csv('RAW_recipes.csv')
ratings  = pd.read_csv('interactions.csv')

# ---------- Merge in the average rating ----------
# Note: unlike Step 2, ratings of 0 are not replaced with NaN here,
# so they count toward each recipe's average.
recipes = recipes.merge(
    ratings.groupby('recipe_id')['rating'].mean().rename('avg_rating'),
    left_on='id', right_index=True
)

# ---------- Parse nutrition ----------
def parse_nutrition(row):
    try:
        return ast.literal_eval(row)          # safely convert the string to a list
    except (ValueError, SyntaxError, TypeError):
        return [np.nan]*7                     # fall back to NaN on malformed input

nutrition_parsed = recipes['nutrition'].apply(parse_nutrition)
nutrition_df = pd.DataFrame(
    nutrition_parsed.tolist(),
    index=recipes.index,                      # keep rows aligned with `recipes` for the concat
    columns=['calories', 'total_fat', 'sugar', 'sodium',
             'protein', 'saturated_fat', 'carbohydrates']
)

recipes = pd.concat([recipes, nutrition_df], axis=1)


In [30]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline        import Pipeline
from sklearn.compose         import ColumnTransformer
from sklearn.preprocessing   import StandardScaler
from sklearn.impute          import SimpleImputer
from sklearn.linear_model    import LinearRegression
from sklearn.metrics         import mean_squared_error

recipes = recipes.dropna(subset=['avg_rating']).copy()

# 1. Nutrition was already parsed in the previous cell (parse_nutrition).

# 2. Select features and split into training / validation sets
numeric_cols = ['n_steps', 'minutes',
                'calories', 'protein', 'carbohydrates', 'sugar']

X = recipes[numeric_cols]
y = recipes['avg_rating']

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Preprocessing + linear regression pipeline
numeric_pipe = Pipeline([
    ('imputer', SimpleImputer()),      # mean imputation (the default strategy)
    ('scaler',  StandardScaler())
])

baseline = Pipeline([
    ('prep', ColumnTransformer([('num', numeric_pipe, numeric_cols)])),
    ('reg', LinearRegression())
])

# 4. Train and evaluate
baseline.fit(X_train, y_train)
preds = baseline.predict(X_val)
rmse  = np.sqrt(mean_squared_error(y_val, preds))   # avoids the deprecated `squared=False` argument
print(f'Baseline RMSE: {rmse:.3f}')

Baseline RMSE: 1.096



**Baseline Model**

We framed the task of predicting `avg_rating` as regression and built a baseline `LinearRegression` model.

| Feature | Type | Notes |
|---------|------|-------|
| `n_steps` | numeric | preparation steps |
| `minutes` | numeric | prep time (min) |
| `calories`, `protein`, `carbohydrates`, `sugar` | numeric | extracted from `nutrition` list |

We used a `Pipeline` with `SimpleImputer` (mean) and `StandardScaler` for preprocessing.  
On an 80/20 train–validation split, the baseline achieved **RMSE = 1.096**, as printed by the cell above.
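An RMSE is easier to interpret next to a trivial reference, such as always predicting the training-set mean rating. The sketch below shows that comparison on made-up numbers; `rmse`, `y_train_toy`, and `y_val_toy` are hypothetical names introduced for illustration and are not part of our pipeline.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up ratings for illustration only.
y_train_toy = np.array([5.0, 4.0, 5.0, 3.0, 5.0])
y_val_toy = np.array([4.0, 5.0, 5.0])

# Constant predictor: every validation recipe gets the training mean.
constant_pred = np.full(len(y_val_toy), y_train_toy.mean())
reference_rmse = rmse(y_val_toy, constant_pred)
print(f"Constant-mean RMSE: {reference_rmse:.3f}")
```

If a fitted model's validation RMSE is not clearly below this reference on the same split, its features are adding little beyond the overall average rating.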

---

**Plan for Improvement**

1. **Feature Engineering**  
   - Text length of `description`  
   - Count of tags, presence of specific tag categories  
   - Interaction terms (e.g., `calories / minutes`)

2. **Model Selection & Tuning**  
   - Try non-linear models: `RandomForestRegressor`, `GradientBoostingRegressor`, XGBoost  
   - Use `GridSearchCV` for hyper-parameter search (e.g., number of trees, max depth)

3. **Data Cleaning**  
   - Filter recipes with < 3 reviews (reduce label noise)  
   - Log-transform skewed numeric features if helpful

We will compare models on the same validation split and report the best performing RMSE in Step 7.
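The feature-engineering ideas in the plan above could be sketched as follows. This is a rough illustration on a made-up DataFrame (`toy` and the `features` column names are hypothetical), and it assumes that `tags` is stored as the string form of a Python list, as `nutrition` is.

```python
import ast
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for `recipes`.
toy = pd.DataFrame({
    'description': ['quick weeknight dinner', None, 'slow smoked brisket for a crowd'],
    'tags': ["['30-minutes-or-less', 'easy']", "['desserts']", "['barbecue', 'meat', 'weekend']"],
    'minutes': [25, 45, 600],
    'calories': [350.0, 500.0, 1200.0],
})

features = pd.DataFrame({
    # Text length of the description (missing descriptions count as length 0).
    'desc_len': toy['description'].fillna('').str.len(),
    # Number of tags, parsed from the list-like string.
    'n_tags': toy['tags'].apply(lambda s: len(ast.literal_eval(s))),
    # Log-transform to tame the long right tail of preparation times.
    'log_minutes': np.log1p(toy['minutes']),
    # Interaction term; clip avoids division by zero for 0-minute recipes.
    'calories_per_minute': toy['calories'] / toy['minutes'].clip(lower=1),
})
print(features)
```

In a scikit-learn pipeline, transformations like these would typically be wrapped in a `FunctionTransformer` so that they are applied identically to the training and validation sets.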


## Step 7: Final Model

In [11]:
# TODO

## Step 8: Fairness Analysis

In [12]:
# TODO