🔬 Welcome to the PLS-DA Classification Notebook

In this notebook, you will explore how **Partial Least Squares–Discriminant Analysis (PLS-DA)** can be used to classify chemical compounds based on their features. The goal is to understand how different chemical descriptors contribute to separating compounds by their crystal structure type.

🧭 What You'll Be Doing

This activity continues the exploration from the PCA notebook, but now with a supervised machine learning approach. All compounds are already labeled with one of three structure types: **CsCl-type**, **NaCl-type**, or **ZnS-type**.

Here’s what you’ll do:
1. Load a set of **133 compositional features**, generated using the Composition Analyzer/Featurizer (CAF).
2. Use PLS-DA to evaluate how well the compounds can be separated based on these features.
3. Test different combinations of features to build a model that’s accurate and chemically explainable.

⚙️ Understanding the Features

Each compound is featurized using operations like:
- **Mean values** (e.g., average electronegativity)
- **Differences** (e.g., radius A - radius B)
- **Ratios** (e.g., melting point A / melting point B)
- **Max and min** values for both elements

These features are numerical representations of underlying chemical ideas — and your task is to find out which ones matter most for distinguishing structure types.
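
As a rough illustration of these operations, the sketch below derives mean, difference, ratio, max, and min features for one elemental property of a binary compound A-B. The helper `binary_features`, the feature names, and the small lookup table are hypothetical. CAF's actual naming and any stoichiometric weighting may differ, and the electronegativities are approximate Pauling values included only for the example.

```python
from statistics import mean

# Approximate Pauling electronegativities (illustrative lookup, not CAF's data).
ELECTRONEGATIVITY = {"Na": 0.93, "Cl": 3.16, "Cs": 0.79, "Zn": 1.65, "S": 2.58}


def binary_features(a, b, table, name):
    """Return simple A/B descriptors for one property of a binary compound."""
    va, vb = table[a], table[b]
    return {
        f"{name}_mean": mean([va, vb]),  # unweighted average of A and B
        f"{name}_diff": va - vb,         # A minus B
        f"{name}_ratio": va / vb,        # A divided by B
        f"{name}_max": max(va, vb),
        f"{name}_min": min(va, vb),
    }


nacl = binary_features("Na", "Cl", ELECTRONEGATIVITY, "en")
for key, value in nacl.items():
    print(f"{key}: {value:.3f}")
```

Applying the same five operations to many elemental properties is how a table with over a hundred columns can arise from only two elements per compound.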

In [None]:
filepath = "data/1929_Mendeleev_features_binary.csv"

🛠 Manual Feature Selection

Just like in the PCA notebook, you can:
- Select features or groups of features manually using interactive widgets.
- Observe how these selections affect classification performance and structure separation.

This interactive process helps you:
- Test hypotheses about what matters (e.g., is size difference enough?)
- Explore the idea of **feature relevance**
- Build intuition about chemical trends through data

In [None]:
from pls_da.plsda import run_plsda_analysis

run_plsda_analysis(filepath, target_column="Class")

📊 Evaluating the Model

Next, evaluate the model over a range of component counts to determine the **optimal number of PLS components**. The choice balances:
- **Underfitting** (not enough components)
- **Overfitting** (too many components)
- **Accuracy** and **explainability**

The notebook provides score plots, confusion matrices, and performance graphs so you can evaluate the model's behavior under different settings.

In [None]:
from pls_da.plsda import evaluate_n_components_plsda

fig, scores = evaluate_n_components_plsda(
    filepath,
    target_column="Class",
    scoring="accuracy",
    max_components=15,
    verbose=False,
)

🤖 Automated Feature Selection

After exploring manually, you can move on to **automated feature selection** using:

- **Forward Selection**: Start with no features and add them one at a time to optimize accuracy.
- **Backward Elimination**: Start with all features and remove the least useful ones.

Each method allows you to control:
- The number of features to include or retain
- How many PLS components to use
- Whether to visualize model performance and scores

These tools help reduce dimensionality and improve interpretability of your model — while still achieving strong performance.

You can also set `n_components=` in these functions to the best value found in the model evaluation section.

In [None]:
from pls_da.feature import forward_selection_plsda, backward_elimination_plsda

# Forward selection example:
selected_feats, perf_hist = forward_selection_plsda(
    filepath, 
    target_column="Class", 
    max_features=15, 
    n_components=2,
    scoring='accuracy', 
    verbose=True, 
    visualize=True,
    interactive_scatter=True
)
print("Selected features via forward selection:", selected_feats)

In [None]:
# Backward elimination example:
remaining_feats, perf_hist_back = backward_elimination_plsda(
    filepath, 
    target_column="Class", 
    min_features=5, 
    n_components=2,
    scoring='accuracy', 
    verbose=True, 
    visualize=True,
    interactive_scatter=True
)
print("Remaining features via backward elimination:", remaining_feats)

Use your chemistry background and curiosity to experiment with the data. The more you test, the more connections you’ll discover between descriptors and structure — turning chemical knowledge into predictive power.