# 🤖 Predictive Model Performance
## How do you decide which predictive model to use?
In this notebook, we evaluate several machine learning models to predict whether a plant has medicinal properties based on its taxonomy and other descriptive features. Below is a brief overview of the models used and how they conceptually approach the classification task:

### 1. **Logistic Regression**
- **Type**: Linear model
- **Concept**: Estimates the probability that a plant is medicinal using a weighted combination of input features.
- **Strengths**: Simple, interpretable, fast.
- **Limitations**: Assumes a linear relationship between features and the log-odds of the outcome; struggles with complex patterns.
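
As a minimal sketch of the weighted combination described above, the snippet below passes a weighted sum of features through the sigmoid function and then recovers the log-odds. The weights, bias, and feature values are made up for illustration; they are not fitted to this notebook's data.

```python
import math

def predict_proba(features, weights, bias):
    """Logistic regression score: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for two made-up features (illustration only).
weights = [0.8, -0.5]
bias = 0.0

p = predict_proba([1.0, 1.0], weights, bias)
# The log-odds are linear in the features: log(p / (1 - p)) == bias + w . x
log_odds = math.log(p / (1.0 - p))
print(round(p, 3), round(log_odds, 3))
```

The log-odds equal the weighted sum exactly, which is the linearity assumption listed under the limitations.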

### 2. **Decision Tree**
- **Type**: Non-linear, rule-based
- **Concept**: Splits the data into branches based on feature thresholds to arrive at a prediction at the leaves.
- **Strengths**: Easy to visualize and understand; captures non-linear relationships.
- **Limitations**: Can overfit the training data if not pruned or regularized.
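
The threshold search behind a single split can be sketched as follows. This is a simplified illustration using Gini impurity on one numeric feature; the `edibility` values and group labels are invented, and a real tree implementation repeats this search recursively over all features.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split `value <= threshold`."""
    best = (None, float("inf"))
    # Skip the largest value so both sides of the split are non-empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Invented toy data: one numeric feature and two classes.
edibility = [0, 1, 1, 3, 4, 5]
group = ["a", "a", "a", "b", "b", "b"]

threshold, impurity = best_threshold(edibility, group)
print(threshold, impurity)
```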

### 3. **Random Forest**
- **Type**: Ensemble (of decision trees)
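- **Concept**: Trains many decision trees, each on a bootstrap sample of the data with a random subset of features considered at each split, and aggregates their predictions (majority vote or averaged probabilities) to reduce variance.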
- **Strengths**: More accurate and robust than a single tree; reduces overfitting.
- **Limitations**: Less interpretable; slower than simpler models.
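
The bagging-and-voting idea can be sketched in plain Python. This is a deliberately simplified stand-in, not how `RandomForestClassifier` is implemented: each "tree" is a one-split stump on a single feature, the data is invented, and per-split feature subsampling is omitted because there is only one feature.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a one-split stand-in for a tree (two classes); returns a predict function."""
    classes = sorted(set(ys))
    if len(classes) == 1:  # the bootstrap sample happened to contain one class
        only = classes[0]
        return lambda x: only
    lo, hi = classes
    mean_lo = sum(x for x, y in zip(xs, ys) if y == lo) / ys.count(lo)
    mean_hi = sum(x for x, y in zip(xs, ys) if y == hi) / ys.count(hi)
    threshold = (mean_lo + mean_hi) / 2
    if mean_lo <= mean_hi:
        return lambda x: lo if x <= threshold else hi
    return lambda x: hi if x <= threshold else lo

def fit_forest(xs, ys, n_trees=25, seed=42):
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw n row indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict(forest, x):
    """Majority vote across all trees."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Invented toy data: one feature, two well-separated classes.
xs = [0.0, 1.0, 2.0, 3.0, 7.0, 8.0, 9.0, 10.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(xs, ys)
print(predict(forest, 1.5), predict(forest, 8.5))
```

Because each stump sees a different resample, individual thresholds vary, but the vote is stable; that averaging is where the variance reduction comes from.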

### 4. **Gradient Boosting (e.g., GBT)**
- **Type**: Ensemble (boosted decision trees)
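- **Concept**: Trains trees sequentially; each new tree is fit to the negative gradient of the loss with respect to the current ensemble's predictions (for squared error, simply the residuals), so it corrects the errors left by the trees before it.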
- **Strengths**: High accuracy; handles complex patterns well.
- **Limitations**: Can overfit if not tuned properly; computationally intensive.
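
The sequential error-correction loop can be sketched as below. To keep the arithmetic visible, this sketch uses squared-error regression with one-split stumps rather than the classification loss used in this notebook; the data, `n_rounds`, and `learning_rate` values are illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Least-squares one-split regressor fitted to the current residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # keep both sides of the split non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Return final training predictions and the training MSE after each round."""
    base = sum(ys) / len(ys)  # start from a constant prediction
    preds = [base] * len(xs)
    history = []
    for _ in range(n_rounds):
        # For the loss 1/2 * (y - p)^2 the negative gradient is the residual y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return preds, history

# Invented toy data with a step between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

preds, mse_history = boost(xs, ys)
print(mse_history[0], mse_history[-1])
```

The training error never increases from one round to the next, which is also why an untuned number of rounds can overfit.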

### 5. **Support Vector Machine (SVM)**
- **Type**: Maximum-margin classifier
- **Concept**: Finds the optimal boundary (hyperplane) that best separates medicinal from non-medicinal plants by maximizing the margin between classes.
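- **Strengths**: Works well in high-dimensional spaces; the margin acts as a regularizer, which helps limit overfitting.
- **Limitations**: Training time grows quickly with the number of samples, so it is a poor fit for large datasets; performance depends on the kernel and its hyperparameters.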
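
To make "margin" concrete, the sketch below measures the geometric margin of one given hyperplane on four invented points. The hyperplane `w`, `b` is a hypothetical choice rather than the output of a fitted `SVC`; an SVM searches for the `w` and `b` that make this quantity as large as possible.

```python
import math

def decision_value(x, w, b):
    """Signed score w . x + b; its sign is the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical hyperplane x1 + x2 - 3 = 0 and four invented labelled points.
w, b = [1.0, 1.0], -3.0
points = {(1.0, 1.0): -1, (2.0, 0.0): -1, (2.0, 2.0): +1, (3.0, 1.0): +1}

norm = math.sqrt(sum(wi * wi for wi in w))
# label * score / ||w|| is the distance to the hyperplane, positive when correct.
distances = {p: label * decision_value(p, w, b) / norm for p, label in points.items()}
margin = min(distances.values())
print(round(margin, 4))
```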

### 6. **XGBoost**
- **Type**: Gradient-boosted tree ensemble (optimized)
- **Concept**: An efficient and regularized implementation of gradient boosting that adds boosting trees iteratively to correct previous errors.
- **Strengths**: Often state-of-the-art in structured data problems; fast and scalable.
- **Limitations**: Complex; requires tuning; less interpretable.
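
The regularization mentioned above shows up in the leaf weights and split gains. The sketch below evaluates the standard second-order formulas from the XGBoost objective, with `reg_lambda` as the L2 penalty and `gamma` as the per-split cost; the gradient values are invented, and the functions are illustrative helpers, not part of the `xgboost` API.

```python
def leaf_weight(grads, hessians, reg_lambda=1.0):
    """Optimal leaf value: -G / (H + lambda), with G and H summed over the leaf."""
    return -sum(grads) / (sum(hessians) + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting one leaf into a left and a right child."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

# For squared error, gradient = prediction - target and hessian = 1 per row.
# Invented leaf: three rows, each currently under-predicted by 1.0.
grads = [-1.0, -1.0, -1.0]
hessians = [1.0, 1.0, 1.0]

unregularized = leaf_weight(grads, hessians, reg_lambda=0.0)
regularized = leaf_weight(grads, hessians, reg_lambda=1.0)
gain = split_gain(-3.0, 3.0, 3.0, 3.0, reg_lambda=1.0)
print(unregularized, regularized, gain)
```

With `reg_lambda=0` the leaf fully corrects the error; a positive penalty shrinks the step, which is one of the ways the method resists overfitting.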

---

Each model captures different aspects of the underlying patterns in the data. By comparing their performance across different evaluation strategies (e.g., with SMOTE and downsampling), we aim to identify not only which models are accurate, but also which ones are most robust under real-world conditions.
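
The two rebalancing strategies named above can be contrasted with a small pure-Python sketch. The oversampler here is a simplified SMOTE-like stand-in: it interpolates between random pairs of same-class rows, whereas real SMOTE interpolates toward one of the k nearest neighbours. The rows and labels are invented. Resampling of either kind is normally applied to the training split only, so that the test set contains real samples.

```python
import random
from collections import Counter

def group_by_class(rows, labels):
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups

def downsample(rows, labels, seed=42):
    """Randomly drop rows until every class matches the rarest class."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = min(len(members) for members in groups.values())
    out = [(row, label) for label, members in groups.items()
           for row in rng.sample(members, target)]
    return [r for r, _ in out], [l for _, l in out]

def oversample_by_interpolation(rows, labels, seed=42):
    """Grow every class to the largest class size with synthetic rows."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = max(len(members) for members in groups.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, members in groups.items():
        for _ in range(target - len(members)):
            a, b = rng.choice(members), rng.choice(members)
            t = rng.random()
            # Synthetic row lies on the segment between two real same-class rows.
            out_rows.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
            out_labels.append(label)
    return out_rows, out_labels

# Invented training rows: six "common" plants and two "rare" ones.
train_rows = [[0.0, 1.0], [0.2, 0.9], [0.1, 1.1], [0.3, 1.0],
              [0.2, 1.2], [0.0, 0.8], [5.0, 4.0], [5.5, 4.2]]
train_labels = ["common"] * 6 + ["rare"] * 2

down_rows, down_labels = downsample(train_rows, train_labels)
up_rows, up_labels = oversample_by_interpolation(train_rows, train_labels)
print(Counter(down_labels), Counter(up_labels))
```

Downsampling discards data from the majority class, while oversampling keeps every real row but adds synthetic ones; the comparison in this notebook is meant to show how sensitive each model is to that choice.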


In [4]:
import pandas as pd
import ast
from collections import Counter

from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_curve, auc
from imblearn.over_sampling import SMOTE

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

import ipywidgets as widgets
import plotly.express as px
import plotly.graph_objs as go
from ipywidgets import Tab

# ----------------------------------
# 1) DATA LOADING & PREPROCESSING
# ----------------------------------

df = pd.read_csv('pfaf_plants_merged.csv')
df_countries = pd.read_excel('plants_native_countries.xlsx')
df_countries = df_countries.rename(columns={'Scientific name': 'Scientific Name'})

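# NOTE: this joins on 'Family', so each plant is paired with every row of
# df_countries from the same family; after drop_duplicates below, a plant keeps
# the native_countries of its first match, which may describe another species.
# If per-species data is intended, 'Scientific Name' (renamed above) may be the
# better join key.
df = df.merge(
    df_countries[['Family', 'native_countries']],
    on='Family', how='left'
)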
# clean
needed_columns = ['Family','Scientific Name','medicinal_rating_search','use_keyword']
df = df.dropna(subset=['native_countries']).drop_duplicates(subset=['Scientific Name'])
df = df.dropna(subset=needed_columns).copy()
df.columns = df.columns.str.strip()

# filter
df = df[df['medicinal_rating_search'] > 0]
df['medicinal_property'] = (
    df['use_keyword'].astype(str)
         .str.lower()
         .str.split(r';|,', regex=True)
         .str[0]
         .str.strip()
)
# rename ratings
rating_map = {'Edibility Rating':'edibility','Other Uses Rating':'other_uses'}
df = df.rename(columns=rating_map)
# drop missing core features
core_cols = ['Family','Scientific Name','medicinal_property','edibility','other_uses','native_countries']
df = df.dropna(subset=core_cols)
# target
df['medicinal_group'] = df['medicinal_property']
# parse list strings

df['native_countries'] = df['native_countries'].apply(
    lambda x: ast.literal_eval(x) if isinstance(x, str) else x
)
# keep groups with >=50
counts = df['medicinal_group'].value_counts()
keep = counts[counts >= 50].index
df = df[df['medicinal_group'].isin(keep)].reset_index(drop=True)

# encode target
label_encoder = LabelEncoder()
label_encoder.fit(df['medicinal_group'])
classes = list(label_encoder.classes_)

# palette and color map
target_colors = px.colors.qualitative.Dark24
color_map = {cls: target_colors[i % len(target_colors)] for i, cls in enumerate(classes)}

# ----------------------------------
# 2) EXPERIMENT FUNCTION
# ----------------------------------

def run_feature_experiment(df, label_encoder, feature, use_smote=True):
    # build feature matrix
    if feature == 'Family':
        le = LabelEncoder()
        X = le.fit_transform(df['Family']).reshape(-1, 1)
    elif feature == 'Scientific Name':
        le = LabelEncoder()
        X = le.fit_transform(df['Scientific Name']).reshape(-1, 1)
    elif feature in ['edibility', 'other_uses']:
        X = df[[feature]].values
    else:  # native_countries
        mlb = MultiLabelBinarizer()
        X = mlb.fit_transform(df['native_countries'])

    y = label_encoder.transform(df['medicinal_group'])

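    # split first so the test set contains only real (non-synthetic) samples
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=0.2,
        random_state=42,
        stratify=y
    )

    # apply SMOTE to the training split only; resampling before the split
    # would leak information about test samples into the training data
    if use_smote:
        min_cls = min(Counter(y_train).values())
        k = max(1, min_cls - 1)
        X_train, y_train = SMOTE(random_state=42, k_neighbors=k).fit_resample(X_train, y_train)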

    # define models
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000, class_weight='balanced'),
        'Decision Tree': DecisionTreeClassifier(class_weight='balanced'),
        'Random Forest': RandomForestClassifier(class_weight='balanced'),
        'Gradient Boosting': GradientBoostingClassifier(),
        'SVM': SVC(probability=True, class_weight='balanced'),
        'XGBoost': XGBClassifier(eval_metric='logloss')
    }

    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_true_lbl = label_encoder.inverse_transform(y_test)
        y_pred_lbl = label_encoder.inverse_transform(y_pred)
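        # zero_division=0 keeps the 0.0 score for classes that are never
        # predicted, without emitting a warning for each one
        report = classification_report(
            y_true_lbl, y_pred_lbl, output_dict=True, zero_division=0
        )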
        results[name] = {'report': report}
    return results

# ----------------------------------
# 3) RUN & COLLECT METRICS
# ----------------------------------

features = ['Family', 'Scientific Name', 'edibility', 'other_uses', 'native_countries']
all_metrics = []
for feat in features:
    res = run_feature_experiment(df, label_encoder, feat, use_smote=True)
    for m_name, m_res in res.items():
        for cls, met in m_res['report'].items():
            if cls in ['accuracy', 'macro avg', 'weighted avg']:
                continue
            all_metrics.extend([
                {'Feature': feat, 'Model': m_name, 'Class': cls, 'Metric': 'Precision', 'Value': met['precision']},
                {'Feature': feat, 'Model': m_name, 'Class': cls, 'Metric': 'Recall',    'Value': met['recall']},
                {'Feature': feat, 'Model': m_name, 'Class': cls, 'Metric': 'F1',        'Value': met['f1-score']}
            ])

df_feat_viz = pd.DataFrame(all_metrics)

# ----------------------------------
# 4) INTERACTIVE METRIC VISUALIZATION & SAVE HTML
# ----------------------------------

metric_dd = widgets.Dropdown(options=['F1', 'Precision', 'Recall'], value='F1', description='Metric:')
feat_dd   = widgets.Dropdown(options=features, value=features[0], description='Feature:')
class_dd  = widgets.Dropdown(options=sorted(df_feat_viz['Class'].unique()), value=sorted(df_feat_viz['Class'].unique())[0], description='Class:')

def update_plot(metric, feature, target_class):
    sub = df_feat_viz[
        (df_feat_viz['Metric'] == metric) &
        (df_feat_viz['Feature'] == feature) &
        (df_feat_viz['Class'] == target_class)
    ]
    fig = px.bar(
        sub,
        x='Model',
        y='Value',
        color='Class',
        color_discrete_map=color_map,
        title=f"{metric} for '{target_class}' using '{feature}'",
        labels={'Value': metric}
    )
    fig.update_layout(yaxis=dict(range=[0,1]))
    fig.show()
    # save HTML
    filename = f"metrics_{feature.replace(' ', '_')}_{target_class.replace(' ', '_')}_{metric}.html"
    fig.write_html(filename, include_plotlyjs='cdn')

out = widgets.interactive_output(
    update_plot,
    {'metric': metric_dd, 'feature': feat_dd, 'target_class': class_dd}
)

display(widgets.HBox([metric_dd, feat_dd, class_dd]), out)

# ----------------------------------
# 5) INTERACTIVE ROC CURVES PER FEATURE & SAVE HTML
# ----------------------------------

threshold = 0.7
roc_tabs   = Tab()
roc_childs = []

for feat in features:
    # prepare X, y
    if feat == 'Family':
        le = LabelEncoder()
        X = le.fit_transform(df['Family']).reshape(-1, 1)
    elif feat == 'Scientific Name':
        le = LabelEncoder()
        X = le.fit_transform(df['Scientific Name']).reshape(-1, 1)
    elif feat in ['edibility', 'other_uses']:
        X = df[[feat]].values
    else:  # native_countries
        mlb = MultiLabelBinarizer()
        X = mlb.fit_transform(df['native_countries'])
    y = label_encoder.transform(df['medicinal_group'])
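    # split first, then oversample only the training portion so the ROC
    # curves are computed on real, unseen samples
    Xt, Xs, yt, ys = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    minc = min(Counter(yt).values())
    k = max(1, minc - 1)
    Xt, yt = SMOTE(random_state=42, k_neighbors=k).fit_resample(Xt, yt)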
    model = XGBClassifier(eval_metric='logloss')
    model.fit(Xt, yt)
    yproba = model.predict_proba(Xs)
    # compute ROC per class
    roc_data = {}
    for i, cls in enumerate(classes):
        ybin = (ys == i).astype(int)
        fpr, tpr, _ = roc_curve(ybin, yproba[:, i])
        roc_data[cls] = (fpr, tpr, auc(fpr, tpr))
    filtered = {cls: dat for cls, dat in roc_data.items() if dat[2] > threshold}

    out_tab = widgets.Output()
    with out_tab:
        fig = go.Figure()
        # diagonal
        fig.add_trace(go.Scatter(
            x=[0,1], y=[0,1], mode='lines', line=dict(color='black', dash='dash'), showlegend=False, hoverinfo='none'
        ))
        for cls, (fpr, tpr, score) in filtered.items():
            fig.add_trace(go.Scatter(
                x=fpr, y=tpr,
                mode='lines',
                name=f"{cls} (AUC={score:.2f})",
                line=dict(color=color_map[cls], width=2),
                opacity=0.7
            ))
        fig.update_layout(
            title=f"ROC Curves for '{feat}' (AUC > {threshold})",
            xaxis_title='False Positive Rate',
            yaxis_title='True Positive Rate',
            legend=dict(title='Click to isolate', orientation='h', x=0, y=-0.2, itemclick='toggleothers', itemdoubleclick='toggle'),
            margin=dict(l=50, r=50, t=50, b=100),
            width=800, height=600, clickmode='none'
        )
        fig.show()
        # save ROC HTML
        fname = f"roc_{feat.replace(' ', '_')}.html"
        fig.write_html(fname, include_plotlyjs='cdn')

    roc_childs.append(out_tab)

roc_tabs.children = roc_childs
for i, feat in enumerate(features):
    roc_tabs.set_title(i, feat)

display(roc_tabs)


`BaseEstimator._validate_data` is deprecated in 1.6 and will be removed in 1.7. Use `sklearn.utils.validation.validate_data` instead. This function becomes public and is part of the scikit-learn developer API.


Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.



Parameters: { "use_label_encoder" } are not used.




