# **Lecture 3: Machine Learning Modeling**

## **Learning Outcomes**
By the end of this lecture, students will be able to:
- Understand and implement basic machine learning models for regression and classification tasks.
- Prepare datasets for machine learning, including train-test splitting and feature scaling.
- Evaluate the performance of machine learning models using appropriate metrics.
- Compare different machine learning algorithms and interpret their results.

---

## **Objectives**
- Introduce regression and classification concepts in machine learning.
- Demonstrate regression modeling to predict continuous values, such as `Nutrition Density`.
- Implement classification modeling to predict categorical values, such as `Food_Group_LabelEncoded`.
- Provide insights into feature importance and performance evaluation techniques.

---

## **What We Will Do in This Notebook**
1. **Dataset Preparation**:
   - Load the dataset (`final_nutrition_data.csv`).
   - Perform train-test split.
   - Scale the features using `StandardScaler`.

2. **Regression Task**:
   - Define regression targets and features.
   - Train and evaluate **Linear Regression** and **Random Forest Regressor** models.
   - Compare models using metrics like MAE, MSE, and \(R^2\).
   - Visualize regression predictions.

3. **Classification Task**:
   - Define classification targets and features.
   - Train and evaluate **Logistic Regression** and **Random Forest Classifier** models.
   - Use classification metrics such as precision, recall, and F1-score.
   - Visualize classification results, including confusion matrix and feature importance.

4. **Insights and Summary**:
   - Interpret model results and key takeaways from both tasks.
   - Discuss potential improvements and further applications.

---


## **Step 1: Dataset Preparation**

In [1]:
### Load the Dataset

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset
df = pd.read_csv("final_nutrition_data.csv")

# Display first few rows
print(df.head())

   Caloric Value       Fat  Saturated Fats  Monounsaturated Fats  \
0       1.068000  3.054126        4.273590              1.743794   
1      -0.793001 -0.295498        0.251201             -0.460053   
2      -1.006007 -0.571100       -0.169747             -0.625342   
3      -1.129326 -1.016303       -0.777782             -0.905781   
4      -0.243669  0.573708        1.326956              0.146005   

   Polyunsaturated Fats  Carbohydrates    Sugars   Protein  Dietary Fiber  \
0             -0.173496      -0.473409  2.087806 -0.205790      -0.481632   
1             -0.883996      -0.657250  2.756392 -0.992850      -0.350809   
2             -0.846695      -0.632181  0.368585 -0.947875      -0.481632   
3             -0.817387      -0.615468  0.464097 -0.767976      -0.481632   
4             -0.617559      -0.699032 -0.491026 -0.329471      -0.481632   

   Cholesterol  ...    Copper      Iron  Magnesium  Manganese  Phosphorus  \
0     1.116587  ... -0.631792 -0.962709  -0.572509 

In [3]:
# Get column names as a list
columns = df.columns.tolist()
print("Columns as a list:", columns)


Columns as a list: ['Caloric Value', 'Fat', 'Saturated Fats', 'Monounsaturated Fats', 'Polyunsaturated Fats', 'Carbohydrates', 'Sugars', 'Protein', 'Dietary Fiber', 'Cholesterol', 'Sodium', 'Water', 'Vitamin A', 'Vitamin B1', 'Vitamin B11', 'Vitamin B12', 'Vitamin B2', 'Vitamin B3', 'Vitamin B5', 'Vitamin B6', 'Vitamin C', 'Vitamin D', 'Vitamin E', 'Vitamin K', 'Calcium', 'Copper', 'Iron', 'Magnesium', 'Manganese', 'Phosphorus', 'Potassium', 'Selenium', 'Zinc', 'Nutrition Density', 'Food_Group_LabelEncoded']


### **What Questions Can We Ask for Regression?**
Regression modeling helps answer questions about **continuous numeric targets** based on independent features. Examples include:
- **Caloric Value Prediction**: 
  - Can we predict the `Caloric Value` of food items based on features like `Fat`, `Protein`, and `Carbohydrates`? 
- **Nutrition Density Analysis**:
  - How well do features like `Fat`, `Sugars`, and `Dietary Fiber` explain variations in `Nutrition Density`?  
- **Micronutrient Contribution**:
  - What is the contribution of vitamins and minerals (e.g., `Vitamin A`, `Calcium`) to changes in `Cholesterol` levels?

---

## **Next Steps**
### 1. Define the Problem
- **Target Variable**: Identify the column to predict (e.g., `Caloric Value`).
- **Features (Predictors)**: Select independent variables that contribute to the prediction.

### 2. Prepare Data
- Perform a train-test split to ensure proper evaluation of the model.
- Standardize or normalize features if necessary.

### 3. Build Regression Models
- Use models like:
  - Linear Regression
  - Decision Tree Regressor
  - Random Forest Regressor

### 4. Evaluate Performance
- Metrics to evaluate the regression models:
  - **Mean Absolute Error (MAE)**
  - **Mean Squared Error (MSE)**
  - **R-squared (R²)**

### 5. Extend the Analysis
- Explore feature importance to understand the key drivers of predictions.
- Experiment with advanced models like Gradient Boosting or Neural Networks for improved results.
- Use hyperparameter tuning for optimization.

---

### Let’s move forward and start coding the regression process!

## Step 1: Train a Regression Model with All Features

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Define features and target
X_all = df[['Caloric Value', 'Fat', 'Saturated Fats', 'Monounsaturated Fats', 'Polyunsaturated Fats', 'Carbohydrates', 'Sugars', 'Protein', 
            'Dietary Fiber', 'Sodium', 'Water', 'Vitamin A', 'Vitamin B1', 'Vitamin B11', 'Vitamin B12', 'Vitamin B2', 
            'Vitamin B3', 'Vitamin B5', 'Vitamin B6', 'Vitamin C', 'Vitamin D', 'Vitamin E', 'Vitamin K', 'Calcium', 'Copper', 'Iron', 
            'Magnesium', 'Manganese', 'Phosphorus', 'Potassium', 'Selenium', 'Zinc', 'Nutrition Density', 'Food_Group_LabelEncoded']]
y_all = df['Cholesterol']

In [7]:
X_all

Unnamed: 0,Caloric Value,Fat,Saturated Fats,Monounsaturated Fats,Polyunsaturated Fats,Carbohydrates,Sugars,Protein,Dietary Fiber,Sodium,...,Copper,Iron,Magnesium,Manganese,Phosphorus,Potassium,Selenium,Zinc,Nutrition Density,Food_Group_LabelEncoded
0,1.068000,3.054126,4.273590,1.743794,-0.173496,-0.473409,2.087806,-0.205790,-0.481632,-0.182895,...,-0.631792,-0.962709,-0.572509,-0.436944,-0.055764,-0.330419,0.313679,-0.018519,0.826281,0.237607
1,-0.793001,-0.295498,0.251201,-0.460053,-0.883996,-0.657250,2.756392,-0.992850,-0.350809,-0.845990,...,-0.938118,-1.077764,-1.055642,-0.840155,-1.166761,-1.019963,-1.461763,-1.075020,-0.960414,0.724952
2,-1.006007,-0.571100,-0.169747,-0.625342,-0.846695,-0.632181,0.368585,-0.947875,-0.481632,-0.744316,...,-0.289427,-1.068559,-0.987435,-0.391125,-0.950813,-0.821959,-0.343892,-0.995028,-0.650830,-0.670900
3,-1.129326,-1.016303,-0.777782,-0.905781,-0.817387,-0.615468,0.464097,-0.767976,-0.481632,-0.624959,...,-0.586744,-1.016785,-0.828285,-0.711861,-0.275501,-0.753111,-1.034342,-0.622234,-0.068084,-0.676917
4,-0.243669,0.573708,1.326956,0.146005,-0.617559,-0.699032,-0.491026,-0.329471,-0.481632,-0.720002,...,-0.703869,-1.050151,-0.583877,-0.556075,0.192388,-0.923896,-0.442528,0.585195,2.363612,1.236363
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180,-0.905109,-1.037503,-0.779653,-0.919004,-0.879556,-0.548617,-0.491026,-0.329471,-0.481632,-0.684637,...,1.764761,0.533002,0.325550,-0.381961,-0.788852,-0.495867,-1.100099,-0.320377,-0.641216,1.531177
181,0.843783,-0.592300,-0.637466,-0.790630,-0.528746,-0.055590,-0.491026,2.661355,-0.481632,2.248456,...,-0.037159,3.179259,1.394127,0.076234,3.090643,-0.462777,-0.113742,0.736124,0.910816,-0.791232
182,-0.658471,-0.422699,-0.450379,-0.294765,-0.173496,-0.172579,-0.491026,-0.846682,-0.481632,-0.624959,...,-0.037159,-0.272381,-0.833969,-0.771426,-0.857046,-0.857717,0.971250,2.849125,-0.781873,0.345906
183,-0.793001,-0.804302,-0.684238,-0.570246,-0.795184,-0.732457,-0.491026,-0.093353,-0.481632,-0.403927,...,-0.397542,-0.732600,-0.299681,-0.647714,-0.413783,-0.447300,1.069886,-0.773163,-0.820358,1.188230


In [9]:
y_all

0      1.116587
1     -1.001910
2     -0.729099
3     -0.927813
4     -0.217157
         ...   
180   -0.382191
181    1.291725
182   -0.392295
183   -0.240733
184   -0.695418
Name: Cholesterol, Length: 185, dtype: float64

In [11]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

In [13]:
# Train a linear regression model
model_all = LinearRegression()
model_all.fit(X_train, y_train)

In [15]:
# Predict on the test set
y_pred_all = model_all.predict(X_test)

In [17]:
# Create a DataFrame to display actual and predicted values
comparison_df = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred_all
})

# Display the first few rows of the DataFrame
print(comparison_df.head())

       Actual  Predicted
19  -0.523648  -0.178671
42   0.372250   1.743655
156 -0.301358   0.381486
111 -0.429343   0.034472
148 -0.429343  -0.338209


In [19]:
# Evaluate the model
mae_all = mean_absolute_error(y_test, y_pred_all)
mse_all = mean_squared_error(y_test, y_pred_all)
r2_all = r2_score(y_test, y_pred_all)

In [21]:
print(f"Performance with All Features:")
print(f"Mean Absolute Error (MAE): {mae_all}")
print(f"Mean Squared Error (MSE): {mse_all}")
print(f"R-squared (R²): {r2_all}")


Performance with All Features:
Mean Absolute Error (MAE): 0.5426082940843728
Mean Squared Error (MSE): 0.4635843328926965
R-squared (R²): 0.2896595494077906


**Performance Metrics:**
1. **Mean Absolute Error (MAE):** 
   - The MAE of `0.5426` indicates that, on average, the predicted values are off by approximately `0.533` units from the actual values.
   - MAE provides an intuitive measure of prediction error in the same units as the target variable (`Caloric Value`).

2. **Mean Squared Error (MSE):**
   - The MSE of `0.4635` reflects the average squared difference between the predicted and actual values.
   - MSE penalizes larger errors more heavily than smaller ones, making it sensitive to outliers. 

3. **R-squared (R²):**
   - The R² value of `0.2896` means that the model explains about `28.16%` of the variance in the target variable (`Caloric Value`) using all the provided features.
   - This relatively low R² suggests that many features in the dataset do not strongly contribute to predicting `Caloric Value` or that the relationship between the features and the target variable is weak or nonlinear.

---

### **Insights**
1. **Model Fit:**
   - The low R² value indicates that the model may not be fitting the data well, potentially due to:
     - Irrelevant or weakly correlated features.
     - A nonlinear relationship that is not captured effectively by the linear regression model.

2. **Error Analysis:**
   - The MAE and MSE values suggest room for improvement in prediction accuracy. High error metrics relative to the data's range indicate the need for better feature selection or alternative modeling techniques.

---

### **Next Steps**
1. **Feature Selection:**
   - Use techniques like Recursive Feature Elimination (RFE) to identify the most predictive features and exclude irrelevant ones.
   
2. **Nonlinear Models:**
   - Explore nonlinear models like decision trees, random forests, or gradient boosting to capture complex relationships in the data.

3. **Feature Engineering:**
   - Consider transforming features or creating interaction terms that might better explain the variability in `Caloric Value`.



## Using Machine Learning for Feature Selection
To improve the model's performance, we will use machine learning techniques such as feature importance ranking with tree-based methods or recursive feature elimination (RFE) to select relevant features. Below is the process.

In [24]:
from sklearn.feature_selection import RFE

# Initialize the regressor
regressor = LinearRegression()

# Apply Recursive Feature Elimination
rfe = RFE(estimator=regressor, n_features_to_select=10)  # Select top 10 features
rfe.fit(X_all, y_all)

# Get the selected features
selected_features_rfe = X_all.columns[rfe.support_]
print(f"Selected Features by RFE: {selected_features_rfe}")

Selected Features by RFE: Index(['Caloric Value', 'Fat', 'Saturated Fats', 'Carbohydrates', 'Protein',
       'Vitamin C', 'Calcium', 'Iron', 'Potassium', 'Nutrition Density'],
      dtype='object')


Explanation:

RFE recursively eliminates the least important features based on model performance.
Here, we use RFE to select the top 10 features contributing to predicting Cholesterol.

In [26]:
# Use selected features from RFE
X_selected = df[selected_features_rfe]

# Train-test split with selected features
X_train_selected, X_test_selected, y_train_selected, y_test_selected = train_test_split(X_selected, y_all, test_size=0.2, random_state=42)

# Train a new linear regression model
model_selected = LinearRegression()
model_selected.fit(X_train_selected, y_train_selected)

# Predict on the test set
y_pred_selected = model_selected.predict(X_test_selected)

# Evaluate the model
mae_selected = mean_absolute_error(y_test_selected, y_pred_selected)
mse_selected = mean_squared_error(y_test_selected, y_pred_selected)
r2_selected = r2_score(y_test_selected, y_pred_selected)

print(f"Performance with Selected Features:")
print(f"Mean Absolute Error (MAE): {mae_selected}")
print(f"Mean Squared Error (MSE): {mse_selected}")
print(f"R-squared (R²): {r2_selected}")


Performance with Selected Features:
Mean Absolute Error (MAE): 0.40627593402503137
Mean Squared Error (MSE): 0.32970142014895376
R-squared (R²): 0.49480549981461885


Let's apply a Random Forest Regressor, a nonlinear model that can capture complex interactions between features and improve the accuracy of predictions. We'll also evaluate its performance using the same metrics: MAE, MSE, and R².

In [28]:
# Import necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Initialize the Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model on the training set
rf_model.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = rf_model.predict(X_test)

# Evaluate performance
mae_rf = mean_absolute_error(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

# Display results
print("Performance with Random Forest Regressor:")
print(f"Mean Absolute Error (MAE): {mae_rf}")
print(f"Mean Squared Error (MSE): {mse_rf}")
print(f"R-squared (R²): {r2_rf}")

# Create a DataFrame to show actual vs predicted values
actual_vs_pred_rf = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred_rf
})

# Display the DataFrame
print(actual_vs_pred_rf.head())


Performance with Random Forest Regressor:
Mean Absolute Error (MAE): 0.4419504744129756
Mean Squared Error (MSE): 0.3933250530775419
R-squared (R²): 0.39731635517334063
       Actual  Predicted
19  -0.523648  -0.221333
42   0.372250   2.110664
156 -0.301358  -0.360366
111 -0.429343  -0.184689
148 -0.429343  -0.616977


In [30]:
# Initialize the Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model on the training set
rf_model.fit(X_train_selected, y_train_selected)

# Predict on the test set
y_pred_rf = rf_model.predict(X_test_selected)

# Evaluate performance
mae_rf = mean_absolute_error(y_test_selected, y_pred_selected)
mse_rf = mean_squared_error(y_test_selected, y_pred_selected)
r2_rf = r2_score(y_test_selected, y_pred_selected)

# Display results
print("Performance with Random Forest Regressor with RFE:")
print(f"Mean Absolute Error (MAE): {mae_rf}")
print(f"Mean Squared Error (MSE): {mse_rf}")
print(f"R-squared (R²): {r2_rf}")

# Create a DataFrame to show actual vs predicted values
actual_vs_pred_rf = pd.DataFrame({
    'Actual': y_test_selected,
    'Predicted': y_pred_rf
})

# Display the DataFrame
print(actual_vs_pred_rf.head())


Performance with Random Forest Regressor with RFE:
Mean Absolute Error (MAE): 0.40627593402503137
Mean Squared Error (MSE): 0.32970142014895376
R-squared (R²): 0.49480549981461885
       Actual  Predicted
19  -0.523648  -0.011976
42   0.372250   2.507015
156 -0.301358  -0.561572
111 -0.429343   0.195395
148 -0.429343  -0.455580


# Next Steps to Improve Accuracy

While the Random Forest Regressor with RFE-selected features has shown an improvement in performance metrics, there are still steps we can take to further refine the model and improve accuracy. Below are some strategies:

---

## 1. Hyperparameter Tuning
- Use techniques like Grid Search or Randomized Search to optimize the hyperparameters of the Random Forest Regressor, such as:
  - `n_estimators`: Number of trees in the forest.
  - `max_depth`: Maximum depth of the trees.
  - `min_samples_split`: Minimum number of samples required to split a node.
  - `min_samples_leaf`: Minimum number of samples required to be a leaf node.

---

## 2. Feature Engineering
- **Interaction Features**: Create new features by combining existing ones, e.g., `Fat * Protein` or `Sugars / Carbohydrates`.
- **Polynomial Features**: Add non-linear combinations of the features using polynomial transformations.
- **Log Transformations**: Apply log transformations to skewed numerical features to normalize their distribution.

---

## 3. Model Complexity
- **Boosting Algorithms**:
  - Try Gradient Boosting Machines (GBM), XGBoost, or CatBoost. These models are often more effective for non-linear relationships.
- **Ensemble Models**:
  - Combine predictions from multiple models using techniques like stacking or blending.

---

## 4. Cross-Validation
- Use k-fold cross-validation to ensure the model's performance is robust and not overfitted to a particular train-test split.

---

## 5. Dimensionality Reduction
- Use methods like Principal Component Analysis (PCA) or feature selection based on importance scores to reduce noise and focus on the most impactful features.

---

## 6. Outlier Handling
- Analyze residuals (differences between actual and predicted values) to identify and handle outliers that might be influencing the model.

---

## 7. Fine-Tuning Data Preprocessing
- Standardize or normalize numerical features to ensure all variables are on the same scale.
- Check for multicollinearity between features and drop highly correlated ones if necessary.

---

## 8. Neural Networks
- For complex relationships, try using a basic feed-forward neural network. Neural networks can capture intricate patterns that traditional ML models may miss.

---

## 9. Model Interpretation
- Use SHAP (SHapley Additive exPlanations) or feature importance plots to interpret which features contribute most to the predictions. Adjust feature selection or engineering accordingly.

---

## 10. Experiment with Other Targets
- Test whether related variables (e.g., `Nutrition Density` or `Protein`) can also serve as prediction targets to uncover additional insights.

---
