# **Project 2 Code and Visualizations**

The following code provides the workflow, functions, analysis, and insights for Project 2 by group Justus von Liebig.

Project Members: Allison Nguyen, Emily Wu, Wendy Peng, Emma Azhan, Magaly Santos, Noah Mujica

# Table of Contents

- **Data Setup**

- **Deliverable [A] - Population of Interest**

- **Deliverable [A] - Dietary Reference Intakes**

- **Deliverable [A] - Food Prices**

- **Deliverable [A] - Nutritional Content**

- **Deliverable [A] - Solution**

- **Deliverable [B] - Solution Sensitivity**

- **Deliverable [B] - Solution Total Cost**

- **Unit Tests**

## Data Setup

In [None]:
%pip install eep153_tools
%pip install python_gnupg
%pip install -U gspread_pandas

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import linprog as lp
from eep153_tools.sheets import read_sheets

In [None]:
def format_id(id, zeropadding=0):
    """Nice string format for any id, string or numeric.

    The optional zeropadding parameter takes an integer z and
    left-pads the result with zeros to at least z characters.
    """
    if pd.isnull(id) or id in ['','.']: return None

    try:  # If numeric, return as string int
        return ('%d' % id).zfill(zeropadding)
    except TypeError:  # Not numeric
        return id.split('.')[0].strip().zfill(zeropadding)
    except ValueError:
        return None

data_url = "https://docs.google.com/spreadsheets/d/1qCxS3mh29miTIFQJ9IDs4cKUjgepZU37SbJO9v0_fOE/edit?gid=1569303630#gid=1569303630"

In [None]:
recipes = read_sheets(data_url, sheet="recipes")
recipes = (recipes
           .assign(parent_foodcode = lambda df: df["parent_foodcode"].apply(format_id),
                   ingred_code = lambda df: df["ingred_code"].apply(format_id))
           .rename(columns={"parent_desc": "recipe"}))

nutrition = (read_sheets(data_url, sheet="nutrients")
             .assign(ingred_code = lambda df: df["ingred_code"].apply(format_id)))

recipes.head()


In [None]:
# Control and Cancer Diets
control_diet = read_sheets(data_url, sheet="control diet")
cancer_diet = read_sheets(data_url, sheet="cancer diet")

control_diet["parent_foodcode"] = control_diet["parent_foodcode"].astype(str)
cancer_diet["parent_foodcode"] = cancer_diet["parent_foodcode"].astype(str)
recipes["parent_foodcode"] = recipes["parent_foodcode"].astype(str)


control_diet_df = recipes[recipes["parent_foodcode"].isin(control_diet["parent_foodcode"])]
cancer_diet_df = recipes[recipes["parent_foodcode"].isin(cancer_diet["parent_foodcode"])]


cancer_diet.head()

In [None]:
display(nutrition.head())

**Deliverable [A] - Population of Interest**


Our population of interest is patients in the United States aged 50 and up with breast cancer, colorectal cancer, or leukemia, standardising across sex and racial demographics. For each cancer type we standardise the ideal diet prescribed to patients, extrapolating from macronutrient recommendations to obtain a best estimate of a generic patient diet. As a baseline for comparison, we use average American dietary information as a control: the best-estimated diet of the average American aged 50 and up, simplifying across sex and race. By comparing the three test groups against the control, we hope to draw meaningful conclusions about how minimum-cost diets for cancer patients differ from those for people without cancer.

**Deliverable [A] - Dietary Reference Intakes**

Write a function that takes as arguments the characteristics of a person (e.g., age, sex) and returns a `pandas.Series` of Dietary Reference Intakes (DRIs) or "Recommended Daily Allowances" (RDAs) of a variety of nutrients appropriate for your population of interest.

In [None]:
rda = read_sheets(data_url, sheet="rda")
rda = rda.set_index("Nutrient")
#rda.columns, rda.head()

In [None]:
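def diet_ref(sex, cancer_group='control', age_group="51U"):
    """Return the DRIs for one group as a pandas.Series indexed by nutrient.

    The rda sheet names its columns '{sex}_{age_group}_{cancer_group}'.
    """
    col_name = f"{sex}_{age_group}_{cancer_group}"

    if col_name not in rda.columns:
        raise ValueError(f"Column '{col_name}' not found in the dataset.")

    return rda[col_name]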
        

In [None]:
diet_ref("Female", cancer_group="leukemia").head()
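
The lookup above depends on the `rda` sheet naming its columns `{sex}_{age_group}_{cancer_group}`. A minimal sketch of that convention on a made-up table follows; `toy_rda` and `toy_diet_ref` are hypothetical names, and the values are placeholders for illustration, not real DRIs.

```python
import pandas as pd

# Toy stand-in for the "rda" sheet; the numbers are illustrative placeholders only.
toy_rda = pd.DataFrame(
    {
        "Constraint Type": ["RDA", "RDA", "UL"],
        "Female_51U_control": [2000.0, 46.0, 2300.0],
        "Female_51U_leukemia": [2200.0, 60.0, 2300.0],
    },
    index=pd.Index(["Energy", "Protein", "Sodium"], name="Nutrient"),
)


def toy_diet_ref(sex, cancer_group="control", age_group="51U", table=toy_rda):
    """Return the column named '{sex}_{age_group}_{cancer_group}' as a Series."""
    col_name = f"{sex}_{age_group}_{cancer_group}"
    if col_name not in table.columns:
        raise ValueError(f"Column '{col_name}' not found in the dataset.")
    return table[col_name]


leukemia_ref = toy_diet_ref("Female", cancer_group="leukemia")
print(leukemia_ref)
```

Requesting a group that has no column (for example `toy_diet_ref("Male")` on this toy table) raises a `ValueError` rather than silently returning nothing.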

**Deliverable [A] - Food Prices**



In [None]:
prices = read_sheets(data_url, sheet="prices")[["food_code", "year", "price"]]

prices["food_code"] = prices["food_code"].apply(format_id)

prices = prices.set_index(["year", "food_code"])
#print(prices.index.levels[0])

# we'll focus on the latest price data
prices = prices.xs("2017/2018", level="year")

# drop rows of prices where the price is "NA"
prices = prices.dropna(subset="price")

print(f"We have prices for {prices.shape[0]} unique recipes (FNDDS food codes)")

In [None]:
prices.head()

In [None]:
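def process_diet_data(diet_df, nutrition_df, prices_df):
    """Aggregate ingredient nutrients to the recipe level and align with prices.

    Returns the recipe-by-nutrient table, the matching prices (indexed by
    recipe name), and the transposed nutrient-by-recipe table.
    """
    # Work on a copy so the caller's DataFrame is not modified in place
    diet_df = diet_df.copy()

    # Normalize ingredient weights to shares of each recipe's total weight
    diet_df["ingred_wt"] = diet_df["ingred_wt"] / diet_df.groupby("parent_foodcode")["ingred_wt"].transform("sum")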

    # Merge with nutrition data to get nutrient profiles
    df = diet_df.merge(nutrition_df, how="left", on="ingred_code")

    # Multiply nutrients per 100g by the weight of that ingredient
    numeric_cols = list(df.select_dtypes(include=["number"]).columns)
    numeric_cols.remove("ingred_wt")
    df[numeric_cols] = df[numeric_cols].mul(df["ingred_wt"], axis=0)

    # Sum nutrients at the parent foodcode level
    df = df.groupby("parent_foodcode").agg({**{col: "sum" for col in numeric_cols}, "recipe": "first"})

    # Rename index for clarity
    df.index.name = "recipe_id"

    # Extract food names
    food_names = df["recipe"]

    # Align recipes and prices based on common indices
    common_recipes = df.index.intersection(prices_df.index)
    df = df.loc[common_recipes]
    prices = prices_df.loc[common_recipes]

    # Rename prices index with actual food names
    prices.index = prices.index.map(food_names)

    # Transpose the final nutrient table
    A_all = df.T

    return df, prices, A_all
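
To make the weighting step concrete, here is a small sketch of the same normalize, merge, and sum sequence on a hypothetical two-ingredient recipe. The food codes, ingredient codes, and nutrient values are all made up for illustration.

```python
import pandas as pd

# Hypothetical recipe "900": 150 g of ingredient "A" and 50 g of ingredient "B".
toy_recipes = pd.DataFrame({
    "parent_foodcode": ["900", "900"],
    "ingred_code": ["A", "B"],
    "ingred_wt": [150.0, 50.0],
})
# Made-up nutrient content per 100 g of each ingredient.
toy_nutrition = pd.DataFrame({
    "ingred_code": ["A", "B"],
    "Protein": [10.0, 2.0],
    "Energy": [200.0, 40.0],
})

# Step 1: convert ingredient weights into shares of the recipe's total weight.
totals = toy_recipes.groupby("parent_foodcode")["ingred_wt"].transform("sum")
shares = toy_recipes["ingred_wt"] / totals

# Step 2: attach nutrient profiles and scale each one by the ingredient's share.
merged = toy_recipes.assign(share=shares).merge(toy_nutrition, how="left", on="ingred_code")
nutrient_cols = ["Protein", "Energy"]
weighted = merged[nutrient_cols].mul(merged["share"], axis=0)

# Step 3: sum over ingredients to get nutrients per 100 g of the finished recipe.
per_recipe = weighted.groupby(merged["parent_foodcode"]).sum()
print(per_recipe)
```

With shares of 0.75 and 0.25, the recipe works out to 8 units of protein and 160 units of energy per 100 g.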


In [None]:
# Process recipes diet
recipes_df, recipes_prices, recipes_A_all = process_diet_data(recipes, nutrition, prices)

# Process control diet
control_df, control_prices, control_A_all = process_diet_data(control_diet_df, nutrition, prices)

# Process cancer diet
cancer_df, cancer_prices, cancer_A_all = process_diet_data(cancer_diet_df, nutrition, prices)

In [None]:
tol = 1e-6  # Quantities in the solution smaller than this are treated as zero

def min_cost(sex, cancer_group, age_group="51U", A_all=cancer_A_all, p=cancer_prices):
    """Solve the minimum-cost diet problem for one demographic group.

    The control group uses the control foods and prices; every other group
    uses the nutrient table and prices passed in (the cancer diet by default).
    Returns the scipy OptimizeResult.
    """
    col_name = f"{sex}_{age_group}_{cancer_group}"
    if cancer_group == "control":
        A_all = control_A_all
        p = control_prices

    # Lower bounds (RDA/AI) and upper bounds (UL) for this group
    bmin = rda.loc[rda['Constraint Type'].isin(['RDA', 'AI']), col_name]
    bmax = rda.loc[rda['Constraint Type'].isin(['UL']), col_name]

    # Keep only the nutrients that appear in bmin/bmax
    Amin = A_all.reindex(bmin.index)
    Amax = A_all.reindex(bmax.index)

    # Drop nutrients with no food data from both A and b so their shapes agree
    has_min = Amin.notna().any(axis=1).to_numpy()
    has_max = Amax.notna().any(axis=1).to_numpy()
    Amin, bmin = Amin[has_min], bmin[has_min]
    Amax, bmax = Amax[has_max], bmax[has_max]

    # Stack as A x >= b; linprog expects A_ub x <= b_ub, hence the sign flips
    b = pd.concat([bmin, -bmax])
    A = pd.concat([Amin, -Amax])

    result = lp(p, -A, -b, method='highs')
    if not result.success:
        raise RuntimeError(f"No feasible diet found for {col_name}: {result.message}")
    print(f"Cost of diet for {col_name} is ${result.fun:.2f} per day.")

    return result
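
The sign conventions in `min_cost` can be easier to follow on a tiny problem. The sketch below uses two hypothetical foods, one minimum requirement, and one upper limit (all numbers invented), stacks them as in `min_cost`, and flips signs so that `linprog` sees everything as `A_ub x <= b_ub`.

```python
import pandas as pd
from scipy.optimize import linprog

foods = ["food_1", "food_2"]
p = pd.Series([1.0, 4.0], index=foods)  # made-up price per 100 g

# Minimum requirement: at least 60 units of protein.
Amin = pd.DataFrame([[10.0, 30.0]], index=["Protein"], columns=foods)
bmin = pd.Series([60.0], index=["Protein"])

# Upper limit: at most 1300 units of sodium.
Amax = pd.DataFrame([[400.0, 100.0]], index=["Sodium"], columns=foods)
bmax = pd.Series([1300.0], index=["Sodium"])

# Write both as "A x >= b" by negating the upper-limit rows.
A = pd.concat([Amin, -Amax])
b = pd.concat([bmin, -bmax])

# linprog minimizes p @ x subject to A_ub x <= b_ub and x >= 0 by default,
# so "A x >= b" is passed as "-A x <= -b".
toy_result = linprog(p.to_numpy(), A_ub=-A.to_numpy(), b_ub=-b.to_numpy(), method="highs")
toy_diet = pd.Series(toy_result.x, index=foods)
print(toy_diet.round(2))
print(f"cost = {toy_result.fun:.2f}")
```

In this toy problem the cheaper protein source is also the saltier one, so the sodium limit binds and the solution mixes both foods (3 units of `food_1`, 1 unit of `food_2`, cost 7).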

In [None]:
#Example
result_example = min_cost("Female", cancer_group='colon', age_group="51U")

***Is our solution edible?***

In [None]:
diet = pd.Series(result_example.x,index=cancer_prices.index)

print("\nYou'll be eating (in 100s of grams or milliliters):")
print(round(diet[diet >= tol], 2))
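
Beyond reading the list of foods, one way to sanity-check a solution is to compute the nutrients it delivers and compare them with the bounds. The sketch below does this on invented numbers; `toy_A`, `toy_diet`, `lower`, and `upper` are hypothetical stand-ins for the notebook's nutrient table, solution vector, and DRI bounds.

```python
import pandas as pd

foods = ["food_1", "food_2"]
# Made-up nutrient content per 100 g (rows: nutrients, columns: foods).
toy_A = pd.DataFrame(
    [[10.0, 30.0], [400.0, 100.0]],
    index=["Protein", "Sodium"],
    columns=foods,
)
toy_diet = pd.Series([3.0, 1.0], index=foods)  # hypothetical solution, in 100 g units
lower = pd.Series({"Protein": 60.0})           # minimum requirements
upper = pd.Series({"Sodium": 1300.0})          # upper limits

# Nutrients delivered by the diet: one value per nutrient row.
outcome = toy_A @ toy_diet

# Allow a small numerical tolerance when comparing against the bounds.
check_tol = 1e-6
meets_min = outcome[lower.index] >= lower - check_tol
meets_max = outcome[upper.index] <= upper + check_tol

report = pd.DataFrame({"outcome": outcome, "min": lower, "max": upper})
print(report)
print("All constraints satisfied:", bool(meets_min.all() and meets_max.all()))
```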

***Control Diets***

In [None]:
control_male = min_cost("Male", cancer_group='control', age_group="51U")

In [None]:
control_female = min_cost("Female", cancer_group='control', age_group="51U")

***Colon Cancer Diets***

In [None]:
colon_male = min_cost("Male", cancer_group='colon', age_group="51U")

In [None]:
colon_female = min_cost("Female", cancer_group='colon', age_group="51U")

***Breast Cancer Diets***


In [None]:
breast_male = min_cost("Male", cancer_group='breast', age_group="51U")

In [None]:
breast_female = min_cost("Female", cancer_group='breast', age_group="51U")

***Leukemia Diets***

In [None]:
leukemia_male = min_cost("Male", cancer_group='leukemia', age_group="51U")

In [None]:
leukemia_female = min_cost("Female", cancer_group='leukemia', age_group="51U")

In [None]:
leu_diet = pd.Series(leukemia_female.x,index=cancer_prices.index)

print("\nYou'll be eating (in 100s of grams or milliliters):")
print(round(leu_diet[leu_diet >= tol], 2))

***Minimum Cost Diet Visualizations***

In [None]:
categories = [
    ("Male", "control", "51U", control_male.fun),
    ("Female", "control", "51U", control_female.fun),
    ("Male", "colon", "51U", colon_male.fun),
    ("Female", "colon", "51U", colon_female.fun),
    ("Male", "breast", "51U", breast_male.fun),
    ("Female", "breast", "51U", breast_female.fun),
    ("Male", "leukemia", "51U", leukemia_male.fun),
    ("Female", "leukemia", "51U", leukemia_female.fun),
]

df_min_cost = pd.DataFrame(categories, columns=["Sex", "Cancer Group", "Age Group", "Min Cost ($/day)"])

In [None]:
df_min_cost

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(data=df_min_cost, x="Cancer Group", y="Min Cost ($/day)", hue="Sex", palette="coolwarm")
plt.xlabel("Cancer Group")
plt.ylabel("Min Cost ($/day)")
plt.title("Minimum Cost of Diets by Cancer Group and Sex")
plt.legend(title="Sex")
plt.show()

In [None]:
rda.head()

In [None]:
df_grouped = df_min_cost.groupby(["Cancer Group", "Sex"], sort=False)[["Min Cost ($/day)"]].mean().unstack().reset_index()
df_grouped.columns = df_grouped.columns.droplevel(0)

df_grouped = df_grouped.rename(columns={"": "Cancer Group", "Female": "Female Cost", "Male": "Male Cost"})
df_grouped["Cost Difference (Male - Female)"] = df_grouped["Male Cost"] - df_grouped["Female Cost"]
df_grouped["% Difference (Male - Female)"] = (df_grouped["Male Cost"] - df_grouped["Female Cost"]) / df_grouped["Female Cost"] * 100

df_grouped.head()


In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(
    data=df_grouped, 
    x="Cancer Group", 
    y="% Difference (Male - Female)", 
    palette="coolwarm"
)

# Labels and title
plt.axhline(0, color="gray", linestyle="dashed")  # Reference line at 0
plt.xlabel("Cancer Group")
plt.ylabel("Percent Cost Difference (Male - Female)")
plt.title("Difference in Minimum Diet Cost Between Men and Women Across Cancer Groups")

# Show plot
plt.show()


In [None]:
control_costs = df_grouped[df_grouped["Cancer Group"] == "control"].set_index("Cancer Group")

# Create new columns for % change compared to control
df_grouped["% Change from Control (Female)"] = ((df_grouped["Female Cost"] - control_costs["Female Cost"].values[0]) / control_costs["Female Cost"].values[0]) * 100
df_grouped["% Change from Control (Male)"] = ((df_grouped["Male Cost"] - control_costs["Male Cost"].values[0]) / control_costs["Male Cost"].values[0]) * 100
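
The percent-change calculation is (group cost - control cost) / control cost × 100, applied separately to each sex. A compact sketch of the same arithmetic on hypothetical costs (the dollar figures are invented, not results from this notebook):

```python
import pandas as pd

# Hypothetical daily costs in dollars, for illustration only.
toy_costs = pd.DataFrame(
    {"Female Cost": [4.00, 5.00], "Male Cost": [5.00, 5.50]},
    index=pd.Index(["control", "colon"], name="Cancer Group"),
)

# Percent change relative to the control row, computed column by column.
baseline = toy_costs.loc["control"]
pct_change = (toy_costs - baseline) / baseline * 100
print(pct_change.round(1))
```

Here the control row is 0% by construction, and the hypothetical colon group comes out 25% higher for women and 10% higher for men.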

In [None]:

df_melted = df_grouped[df_grouped["Cancer Group"] != "control"].melt(id_vars=["Cancer Group"], 
                            value_vars=["% Change from Control (Female)", "% Change from Control (Male)"], 
                            var_name="Sex", 
                            value_name="% Change from Control")

df_melted["Sex"] = df_melted["Sex"].replace({
    "% Change from Control (Female)": "Female",
    "% Change from Control (Male)": "Male"
})

plt.figure(figsize=(10, 6))
sns.barplot(data=df_melted, x="Cancer Group", y="% Change from Control", hue="Sex", palette=["blue", "red"])

plt.axhline(0, color="gray", linestyle="dashed")  # Reference line at 0%
plt.xlabel("Cancer Group")
plt.ylabel("Cost Change from Control (%)")
plt.title("Percentage Change in Diet Cost Compared to Control Group")
plt.legend(title="Sex")

# Show plot
plt.show()
