# 🌟 Machine Learning for Nutrition & Food Science: Cooking Up Insights! 🥗

Welcome to this tasty Jupyter Notebook on **machine learning (ML)** for nutrition and food science! Whether you're snacking at home 🍎 or stirring ideas in a classroom, this guide will take you on a flavorful journey through ML techniques to analyze nutrition and food data. We'll explore **classification** (e.g., healthy vs. unhealthy diets), **regression** (e.g., predicting nutrient content), **feature selection** (e.g., identifying key nutrients), and **model evaluation**! 🍴

Expect hands-on code, fun exercises, and hidden treats (click a "💡 Hint" or "💡 Solution" toggle to reveal it)! Let's dig in! 🚀

## 1. Introduction to Machine Learning in Nutrition & Food Science 📊

Nutrition and food science data are like a recipe book 📖—full of ingredients (nutrients), dishes (diets), and quality checks (food safety). Machine learning helps us:

- **Classify** diets or foods (e.g., healthy vs. unhealthy).
- **Predict** nutritional values (e.g., calories or protein content).
- **Identify** key factors (e.g., nutrients driving health outcomes).

We'll use Python with `scikit-learn`, `pandas`, and `matplotlib` to build ML models. No culinary degree needed—just curiosity! 😄

**Exercise 1**: Why is ML useful for nutrition and food science compared to traditional statistical methods? Jot down your thoughts (no code needed).

<details>
<summary>💡 Hint</summary>
Consider how ML handles complex datasets, non-linear relationships, and interactions between nutrients or food components.
</details>

Let's start by loading the necessary libraries:

In [None]:
# Setup for Google Colab: Fetch datasets automatically or manually
%run ../../bootstrap.py    # installs requirements + editable package

import fns_toolkit as fns

# Import libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import RFE


## 2. Logistic Regression: Classifying Diets 🍎

Logistic regression is a classic ML method for binary classification (e.g., healthy vs. unhealthy diets). It’s simple yet powerful for nutrition data!

### 2.1 Logistic Regression in Action

Let's create a synthetic nutrition dataset (e.g., nutrient profiles) and classify diets as healthy or unhealthy.

In [None]:


# Generate synthetic nutrition dataset (60 samples, 10 nutrients/features)
np.random.seed(11088)
data = pd.DataFrame({
    'Calories': np.random.normal(500, 100, 60),
    'Protein_g': np.random.normal(30, 5, 60),
    'Carbs_g': np.random.normal(60, 10, 60),
    'Fat_g': np.random.normal(20, 5, 60),
    'Fiber_g': np.random.normal(10, 2, 60),
    'Sugar_g': np.random.normal(15, 5, 60),
    'Sodium_mg': np.random.normal(800, 200, 60),
    'Vitamin_C_mg': np.random.normal(50, 10, 60),
    'Calcium_mg': np.random.normal(300, 50, 60),
    'Iron_mg': np.random.normal(5, 1, 60)
})
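# Labels are drawn at random, independently of the nutrients, so no model
# can be expected to beat chance (about 50% accuracy) on this toy data.
labels = np.random.choice([0, 1], size=60)  # 0 = unhealthy, 1 = healthy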

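# Split into training and testing sets first, so the test set stays unseen
X_train_raw, X_test_raw, y_train, y_test = train_test_split(data, labels, test_size=0.3, random_state=42)

# Standardize using statistics learned from the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)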

# Train logistic regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)

# Predict and evaluate
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Logistic Regression Accuracy: {accuracy:.2f} 🎉')

# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix for Diet Classification 🥗')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

**Explanation**:
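- **StandardScaler**: Rescales each nutrient to zero mean and unit variance, so features measured in different units (kcal, g, mg) are comparable.
- **LogisticRegression**: Predicts diet class (healthy/unhealthy) from nutrient profiles.
- **Confusion Matrix**: Counts true negatives, false positives, false negatives, and true positives (rows are true classes, columns are predicted classes).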
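
To make the three pieces above concrete, here is a minimal, self-contained sketch. It assumes a hypothetical toy dataset (`X_demo`, `y_demo`) whose label really depends on two features, which is not the notebook's data, and it wraps the scaler and classifier in a scikit-learn pipeline so the scaler only ever sees the training split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 200 "diets" with 4 nutrient-like features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
# Assumed rule, for illustration only: the label depends on features 0 and 1 plus noise.
y_demo = (X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training split only, then the classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_te, y_hat, labels=[0, 1]).ravel()
precision = tp / (tp + fp)  # of the diets predicted healthy, how many were healthy
recall = tp / (tp + fn)     # of the healthy diets, how many were found

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} (sklearn: {precision_score(y_te, y_hat):.2f})")
print(f"recall={recall:.2f} (sklearn: {recall_score(y_te, y_hat):.2f})")
```

Precision and recall are often more informative than accuracy alone when one class (say, unhealthy diets) is rare.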

**Exercise 2**: Set `C=0.1` in the `LogisticRegression` model to strengthen regularization (`C` is the inverse of the regularization strength; the default is `1.0`). Does the accuracy change? Why?

<details>
<summary>💡 Solution</summary>
Change the model line to:
```python
log_reg = LogisticRegression(C=0.1, random_state=42)
```
Smaller `C` means stronger regularization, which shrinks the coefficients. This reduces overfitting, but accuracy can drop if the model becomes too constrained.
</details>
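
One way to explore this exercise more systematically is to sweep several values of `C` with cross-validation. The sketch below is only an illustration: it assumes a hypothetical dataset from `make_classification` (not the notebook's data), and the grid of `C` values is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 10 features, only 3 of which carry signal.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

cv_means = {}    # mean 5-fold accuracy per value of C
coef_norms = {}  # size of the fitted coefficient vector per value of C
for C in [0.01, 0.1, 1.0, 10.0]:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C))
    cv_means[C] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    # Refit on all data just to inspect how strongly the coefficients are shrunk.
    model.fit(X_demo, y_demo)
    coef_norms[C] = np.linalg.norm(model.named_steps["logisticregression"].coef_)
    print(f"C={C:<5} mean CV accuracy={cv_means[C]:.2f} ||coef||={coef_norms[C]:.2f}")
```

The coefficient norm shrinks as `C` gets smaller, which is the regularization at work; whether accuracy moves depends on the data.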

**Learn More**: Check out [scikit-learn's Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for more details! 📚

## 3. Random Forest: Identifying Key Nutrients 🌳

Random Forest is like a team of decision trees that classifies diets or ranks nutrients by importance. It’s great for pinpointing key dietary factors!

### 3.1 Random Forest in Action

Let’s use a Random Forest to classify diets and identify important nutrients.

In [None]:


# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred_rf = rf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f'Random Forest Accuracy: {accuracy_rf:.2f} 🌟')

# Feature importance (important nutrients)
importance = rf.feature_importances_
plt.figure(figsize=(10, 6))
plt.bar(data.columns, importance, color='green', alpha=0.7)
plt.title('Nutrient Importance for Diet Classification 🥕')
plt.xlabel('Nutrient')
plt.ylabel('Importance')
plt.xticks(rotation=45)
plt.tight_layout()  # keep the rotated nutrient labels from being clipped
plt.savefig('rf_importance.png')

**Explanation**:
- **RandomForestClassifier**: Combines multiple decision trees for robust predictions.
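- **feature_importances_**: Impurity-based scores (summing to 1) that rank nutrients by how much they reduce impurity across the trees' splits. With the random labels used here, the ranking reflects noise rather than real dietary effects.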
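
Impurity-based importances are computed on the training data, so it can help to cross-check them on held-out data. The sketch below uses `permutation_importance` from `sklearn.inspection` as one such check. The dataset, the feature names, and the "more fiber, less sugar" rule are all hypothetical and chosen only so that the right answer is known in advance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
names = ["Fiber_g", "Sugar_g", "Sodium_mg", "Iron_mg"]  # illustrative names only
X_demo = rng.normal(size=(300, 4))
# Assumed rule: more fiber and less sugar means "healthy"; the other columns are noise.
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure the drop in accuracy.
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)

for name, mdi, perm in zip(names, forest.feature_importances_, result.importances_mean):
    print(f"{name:<10} impurity-based={mdi:.3f} permutation={perm:.3f}")
```

When both measures agree on the top nutrients, the ranking is more trustworthy than either one alone.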

**Exercise 3**: Increase `n_estimators` to 200. Does the accuracy improve? Are the top nutrients the same?

<details>
<summary>💡 Hint</summary>
More trees reduce variance but may not significantly change feature rankings. Compare the accuracy and plot!
</details>

**Learn More**: Explore [Random Forests in scikit-learn](https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees) for more ML fun! 🚀

## 4. Support Vector Machines (SVM): Classifying Food Quality ⚔️

SVMs find the best boundary to separate classes, ideal for tasks like classifying food quality (e.g., fresh vs. spoiled) based on nutritional or sensory data.

### 4.1 SVM in Action

Let’s classify foods as fresh or spoiled using an SVM with a linear kernel.

In [None]:


# Train SVM (reusing the same synthetic data; the labels are reinterpreted as food quality)
svm = SVC(kernel='linear', random_state=42)
svm.fit(X_train, y_train)  # Labels: 0 = spoiled, 1 = fresh

# Predict and evaluate
y_pred_svm = svm.predict(X_test)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print(f'SVM Accuracy: {accuracy_svm:.2f} 🛡️')

# Plot confusion matrix
cm_svm = confusion_matrix(y_test, y_pred_svm)
plt.figure(figsize=(6, 4))
sns.heatmap(cm_svm, annot=True, fmt='d', cmap='Greens', cbar=False)
plt.title('Confusion Matrix for Food Quality Classification 🍊')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.savefig('svm_cm.png')

**Explanation**:
- **SVC**: Support Vector Classifier, here with a linear kernel.
- **Hyperplane**: The decision boundary. The SVM chooses the hyperplane with the largest margin between the fresh and spoiled classes.
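
For a linear kernel, the hyperplane can be read directly off the fitted model. The following sketch assumes two hypothetical, well-separated clusters (`X_demo`, `y_demo`) rather than the notebook's data, and shows that the decision function is `w · x + b`, with a margin width of `2 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: two clusters in 2D standing in for "spoiled" (0) and "fresh" (1).
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=-2.0, size=(30, 2)),
    rng.normal(loc=2.0, size=(30, 2)),
])
y_demo = np.array([0] * 30 + [1] * 30)

svm_demo = SVC(kernel="linear", C=1.0).fit(X_demo, y_demo)

w = svm_demo.coef_[0]       # normal vector of the separating hyperplane
b = svm_demo.intercept_[0]  # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

# For a linear kernel, the decision function is simply w . x + b.
scores = X_demo @ w + b

print(f"w={w}, b={b:.3f}")
print(f"margin width={margin_width:.3f}")
print(f"support vectors used: {len(svm_demo.support_vectors_)}")
```

Points with a positive score are assigned to class 1 and points with a negative score to class 0; only the support vectors determine where the boundary sits.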

**Exercise 4**: Try an RBF kernel (`kernel='rbf'`) in the SVM. Does the accuracy improve? Why might this happen?

<details>
<summary>💡 Solution</summary>
Change the SVM line to:
```python
svm = SVC(kernel='rbf', random_state=42)
```
The RBF kernel captures non-linear patterns, which may improve accuracy if the data has complex relationships.
</details>
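
To see when the kernel choice matters, it helps to try both kernels on data that is known to be non-linear. This sketch assumes a hypothetical two-ring dataset from `make_circles` (not the notebook's data), where no straight line can separate the classes.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical data: an inner ring (class 1) surrounded by an outer ring (class 0).
X_demo, y_demo = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

kernel_scores = {}
for kernel in ["linear", "rbf"]:
    scores = cross_val_score(SVC(kernel=kernel), X_demo, y_demo, cv=5)
    kernel_scores[kernel] = scores.mean()
    print(f"{kernel:<6} mean CV accuracy={kernel_scores[kernel]:.2f}")
```

On the notebook's randomly labeled data, neither kernel should be expected to do much better than chance, because there is no pattern to capture.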

**Learn More**: Dive into [SVMs in scikit-learn](https://scikit-learn.org/stable/modules/svm.html) for more insights! 🧠

## 5. Regression with Gradient Boosting: Predicting Nutrient Content 📈

Sometimes we need to predict continuous outcomes, like the calorie content of a meal. Gradient Boosting is a powerful ML method for regression tasks.

### 5.1 Gradient Boosting in Action

Let’s predict calorie content using Gradient Boosting.

In [None]:


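# Generate synthetic calorie outcome
outcome = data['Calories'] + np.random.normal(0, 20, 60)  # Calories plus measurement noise

# Use other nutrients as features. In this synthetic data they are generated
# independently of Calories, so expect predictions no better than the mean.
features = data.drop(columns=['Calories'])

# Split data first, then fit a separate scaler on the training set only
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(features, outcome, test_size=0.3, random_state=42)
scaler_reg = StandardScaler()
X_train_reg = scaler_reg.fit_transform(X_train_reg)
X_test_reg = scaler_reg.transform(X_test_reg)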

# Train Gradient Boosting
gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)
gbr.fit(X_train_reg, y_train_reg)

# Predict and evaluate
y_pred_gbr = gbr.predict(X_test_reg)
mse = mean_squared_error(y_test_reg, y_pred_gbr)
print(f'Gradient Boosting MSE: {mse:.2f} 📉')

# Plot predictions vs true values
plt.figure(figsize=(8, 6))
plt.scatter(y_test_reg, y_pred_gbr, c='purple', alpha=0.7)
plt.plot([y_test_reg.min(), y_test_reg.max()], [y_test_reg.min(), y_test_reg.max()], 'r--')
plt.xlabel('True Calories (kcal)')
plt.ylabel('Predicted Calories (kcal)')
plt.title('Gradient Boosting Predictions for Calories 🥐')
plt.grid(True)
plt.savefig('gbr_predictions.png')

**Explanation**:
- **GradientBoostingRegressor**: Builds trees sequentially, each one fitted to the errors left by the trees before it.
- **MSE**: Mean squared error, the average squared difference between predicted and true values (in kcal² here; lower is better).
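
An MSE value is hard to judge on its own, so a common sanity check is to compare it with a baseline that always predicts the training mean. The sketch below assumes hypothetical macronutrient data in which calories follow the general Atwater factors (4 kcal/g for protein and carbohydrate, 9 kcal/g for fat) plus noise; none of these variables come from the notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical macronutrients in grams for 300 meals.
rng = np.random.default_rng(0)
protein = rng.normal(30, 5, 300)
carbs = rng.normal(60, 10, 300)
fat = rng.normal(20, 5, 300)
X_demo = np.column_stack([protein, carbs, fat])
# Assumed ground truth: Atwater general factors plus measurement noise.
y_demo = 4 * protein + 4 * carbs + 9 * fat + rng.normal(0, 10, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = GradientBoostingRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

mse_baseline = mean_squared_error(y_te, baseline.predict(X_te))
mse_model = mean_squared_error(y_te, model.predict(X_te))
rmse_model = np.sqrt(mse_model)  # back in kcal, easier to interpret than kcal^2
r2_model = r2_score(y_te, model.predict(X_te))

print(f"Baseline MSE: {mse_baseline:.1f}")
print(f"Model MSE:    {mse_model:.1f} (RMSE {rmse_model:.1f} kcal, R^2 {r2_model:.2f})")
```

If the model's MSE is not clearly below the baseline's, the features probably carry little information about the target.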

**Exercise 5**: Increase `n_estimators` to 200 in the Gradient Boosting model. Does the MSE decrease? Why?

<details>
<summary>💡 Hint</summary>
More trees improve the fit to the training data but risk overfitting. Check whether the test MSE stabilizes or worsens!
</details>
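
Rather than retraining for each value of `n_estimators`, one option is to train once and replay the predictions stage by stage with `staged_predict`. This is a sketch on hypothetical data (`X_demo`, `y_demo`, with an assumed mix of linear and non-linear signal), not on the notebook's variables.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: the target depends on two of five features, plus noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 5))
y_demo = 3 * X_demo[:, 0] + np.sin(2 * X_demo[:, 1]) + rng.normal(0, 0.5, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

gbr_demo = GradientBoostingRegressor(n_estimators=200, random_state=42).fit(X_tr, y_tr)

# staged_predict yields the ensemble's prediction after 1, 2, ..., 200 trees.
train_mse = [mean_squared_error(y_tr, pred) for pred in gbr_demo.staged_predict(X_tr)]
test_mse = [mean_squared_error(y_te, pred) for pred in gbr_demo.staged_predict(X_te)]

best_n = int(np.argmin(test_mse)) + 1  # number of trees with the lowest test MSE
print(f"Train MSE: {train_mse[0]:.2f} (1 tree) -> {train_mse[-1]:.2f} (200 trees)")
print(f"Test MSE:  {test_mse[0]:.2f} (1 tree) -> {test_mse[-1]:.2f} (200 trees)")
print(f"Lowest test MSE at {best_n} trees: {test_mse[best_n - 1]:.2f}")
```

Training error keeps falling as trees are added, while test error typically flattens out or creeps back up; the gap between the two is the overfitting the hint refers to.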

**Learn More**: Explore [Gradient Boosting in scikit-learn](https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting) for more details! 🌟

## 6. Feature Selection: Finding Key Nutritional Factors ⭐

Nutrition datasets often have many features. Feature selection helps us pick the most important nutrients (e.g., for health outcomes) using Recursive Feature Elimination (RFE).

### 6.1 RFE with Random Forest

Let’s use RFE to select the top 5 nutrients for diet classification.

In [None]:

# Apply RFE with Random Forest
rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=5)
rfe.fit(X_train, y_train)

# Get selected features
selected_features = data.columns[rfe.support_]
print('Top 5 Nutrients:', selected_features.tolist(), '🎯')

# Train Random Forest on selected features
X_train_selected = X_train[:, rfe.support_]
X_test_selected = X_test[:, rfe.support_]
rf_selected = RandomForestClassifier(n_estimators=100, random_state=42)
rf_selected.fit(X_train_selected, y_train)
y_pred_selected = rf_selected.predict(X_test_selected)
accuracy_selected = accuracy_score(y_test, y_pred_selected)
print(f'Accuracy with Selected Features: {accuracy_selected:.2f} 🚀')

**Explanation**:
- **RFE**: Repeatedly fits the model and drops the least important nutrient until the requested number remains.
- **selected_features**: The top 5 nutrients for diet classification.
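
Beyond the final selection, a fitted `RFE` object also records the order in which features were eliminated in its `ranking_` attribute. The sketch below illustrates this on hypothetical data where only two columns matter; the feature names and the labeling rule are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
# Illustrative names only; the values are random, not real nutrient measurements.
names = np.array(["Fiber_g", "Sugar_g", "Sodium_mg", "Iron_mg", "Calcium_mg", "Vitamin_C_mg"])
X_demo = rng.normal(size=(300, 6))
# Assumed rule: only the first two columns determine the label.
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(int)

selector = RFE(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    n_features_to_select=2,
    step=1,  # drop one feature per round
)
selector.fit(X_demo, y_demo)

# ranking_ is 1 for kept features; larger numbers were eliminated earlier.
for name, rank in sorted(zip(names, selector.ranking_), key=lambda pair: pair[1]):
    print(f"rank {rank}: {name}")
```

Looking at the ranks just above 1 shows which nutrients narrowly missed the cut, which is useful when deciding how many features to keep.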

**Exercise 6**: Change `n_features_to_select` to 3. Does the accuracy drop? Why might this happen?

<details>
<summary>💡 Solution</summary>
Fewer features may reduce model performance if critical information is lost. Compare the accuracy and consider the trade-off!
</details>

**Learn More**: Check out [Feature Selection in scikit-learn](https://scikit-learn.org/stable/modules/feature_selection.html) for more techniques! 🧠

## 7. Summary: Your ML Toolkit for Nutrition & Food Science 🧰

Here's what you've cooked up:

- **Logistic Regression** 📊: Simple classification of diets.
- **Random Forest** 🌳: Robust classification and nutrient ranking.
- **SVM** ⚔️: Margin-based classification, e.g., of food quality.
- **Gradient Boosting** 📈: Regression for continuous targets such as nutrient content.
- **Feature Selection** ⭐: Identifies key nutritional factors.

**Final Exercise**: Download a real nutrition dataset (e.g., from [USDA FoodData Central](https://fdc.nal.usda.gov/)) and apply one of these ML methods. Write a short paragraph summarizing your results!

**What's Next?** Try advanced ML techniques like neural networks or combine ML with PCA for nutrition data. Keep exploring, and happy analyzing! 😄