🔬 Welcome to the PLS-DA Classification Notebook

In this notebook, you will explore how **Partial Least Squares–Discriminant Analysis (PLS-DA)** can be used to classify chemical compounds based on their features. The goal is to understand how different chemical descriptors contribute to separating compounds by their crystal structure type.

🧭 What You'll Be Doing

This activity continues the exploration from the PCA notebook, but now with a supervised machine learning approach. All compounds are already labeled with one of three structure types: **CsCl-type**, **NaCl-type**, or **ZnS-type**.

Here’s what you’ll do:
1. Load a set of **133 compositional features**, generated using the Composition Analyzer/Featurizer (CAF).
2. Use PLS-DA to evaluate how well the compounds can be separated based on these features.
3. Test different combinations of features to build a model that’s accurate and chemically explainable.

⚙️ Understanding the Features

Each compound is featurized using operations like:
- **Mean values** (e.g., average electronegativity)
- **Differences** (e.g., radius A - radius B)
- **Ratios** (e.g., melting point A / B)
- **Maximum and minimum** values across the two elements

These features are numerical representations of underlying chemical ideas — and your task is to find out which ones matter most for distinguishing structure types.
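As an illustration of how these operations turn two element properties into feature columns, here is a minimal sketch. The helper `binary_features` is hypothetical and is not part of CAF; the column suffixes imitate names that appear in this dataset (such as `Pauling_EN_max`), and the inputs are the Pauling electronegativities of Na (0.93) and Cl (3.16).

```python
def binary_features(name, a, b):
    """Build simple A/B features from one property of elements A and B.

    Hypothetical helper for illustration; CAF's own implementation may differ.
    """
    return {
        f"{name}_A": a,
        f"{name}_B": b,
        f"{name}_A-B": a - b,          # difference
        f"{name}_A/B": a / b,          # ratio
        f"{name}_max": max(a, b),      # larger of the two values
        f"{name}_min": min(a, b),      # smaller of the two values
        f"{name}_avg": (a + b) / 2,    # unweighted mean
    }


# Pauling electronegativities of Na and Cl.
features = binary_features("Pauling_EN", 0.93, 3.16)
for column, value in features.items():
    print(f"{column}: {value:.3f}")
```

Each key in the returned dictionary corresponds to one column of the feature table, so a single property pair already yields several candidate descriptors.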

In [1]:
filepath = "data/1929_Mendeleev_features_binary.csv"

🛠 Manual Feature Selection

Just like in the PCA notebook, you can:
- Select features or groups of features manually using interactive widgets.
- Observe how these selections affect classification performance and structure separation.

This interactive process helps you:
- Test hypotheses about what matters (e.g., is size difference enough?)
- Explore the idea of **feature relevance**
- Build intuition about chemical trends through data
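Outside the widgets, a group of related features can also be picked by name. The sketch below assumes that features derived from the same property share a name prefix, as the column names in this dataset suggest; `select_by_prefix` is a hypothetical helper, not part of the `pls_da` package.

```python
# A few column names taken from this dataset's feature table.
columns = [
    "CIF_radius_A", "CIF_radius_B", "CIF_radius_A-B",
    "Pauling_EN_A", "Pauling_EN_B", "density_A-B", "group_A",
]


def select_by_prefix(columns, *prefixes):
    """Return the columns whose names start with any of the given prefixes."""
    return [c for c in columns if c.startswith(prefixes)]


# Hypothesis: is size information alone enough to separate the structure types?
size_features = select_by_prefix(columns, "CIF_radius")
print(size_features)
```

The resulting list can then be used to subset the feature table before fitting a model, which makes a hypothesis such as "size difference is enough" directly testable.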

In [2]:
from pls_da.plsda import run_plsda_analysis

run_plsda_analysis(filepath, target_column="Class")


📊 Evaluating the Model

Next, evaluate models with different numbers of components to determine the **optimal number of PLS components**. This balances:
- **Underfitting** (not enough components)
- **Overfitting** (too many components)
- **Accuracy** and **explainability**

The notebook provides score plots, confusion matrices, and performance graphs so you can evaluate the model's behavior under different settings.

In [3]:
from pls_da.plsda import evaluate_n_components_plsda 

fig, scores = evaluate_n_components_plsda(filepath, 
                                          target_column="Class", 
                                          scoring="accuracy", 
                                          max_components=15, 
                                          verbose=False)


divide by zero encountered in matmul


overflow encountered in matmul


invalid value encountered in matmul



🤖 Automated Feature Selection

After exploring manually, you can move on to **automated feature selection** using:

- **Forward Selection**: Start with no features and add them one at a time, each time keeping the feature that most improves accuracy.
- **Backward Elimination**: Start with all features and remove the least useful ones.
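The two strategies can be sketched as generic greedy loops. This is an illustration under stated assumptions, not the code behind `forward_selection_plsda` or `backward_elimination_plsda`: the functions below are hypothetical, `score` is any callable that returns a number for a list of features (for example a cross-validated PLS-DA accuracy), and `toy_score` with features `f1` to `f4` is a made-up stand-in so the example runs on its own.

```python
def forward_selection(candidates, score, max_features):
    """Greedily add the feature that gives the highest score at each step."""
    selected, history = [], []
    remaining = list(candidates)
    while remaining and len(selected) < max_features:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append(score(selected))
    return selected, history


def backward_elimination(candidates, score, min_features):
    """Greedily drop the feature whose removal leaves the highest score."""
    selected = list(candidates)
    history = [score(selected)]
    while len(selected) > min_features:
        drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
        selected.remove(drop)
        history.append(score(selected))
    return selected, history


# Toy scoring function: rewards two "informative" features, penalizes set size.
informative = {"f2", "f4"}


def toy_score(features):
    return len(informative & set(features)) - 0.01 * len(features)


candidates = ["f1", "f2", "f3", "f4"]
forward_result, forward_history = forward_selection(candidates, toy_score, max_features=2)
backward_result, backward_history = backward_elimination(candidates, toy_score, min_features=2)
print("forward:", forward_result)
print("backward:", backward_result)
```

Both loops are greedy, so they evaluate far fewer subsets than an exhaustive search but are not guaranteed to find the best possible subset.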

Each method allows you to control:
- The number of features to include or retain
- How many PLS components to use
- Whether to visualize model performance and scores

These tools help reduce dimensionality and improve interpretability of your model — while still achieving strong performance.

You can also set `n_components` in these functions to the best value found in the model evaluation section.
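One possible rule for choosing that value is to take the smallest number of components whose score is close to the best one, which favors simpler models. The sketch below assumes the scores are available as a dictionary mapping the number of components to accuracy; the actual structure returned by `evaluate_n_components_plsda` is not shown here, and both `pick_n_components` and the numbers are hypothetical placeholders.

```python
def pick_n_components(scores, tolerance=0.01):
    """Smallest number of components scoring within `tolerance` of the best."""
    best = max(scores.values())
    return min(k for k, s in scores.items() if s >= best - tolerance)


# Placeholder scores for illustration only; they are not results from this dataset.
example_scores = {1: 0.60, 2: 0.85, 3: 0.90, 4: 0.905, 5: 0.88}
n_components = pick_n_components(example_scores)
print("chosen n_components:", n_components)
```

Setting `tolerance=0` reduces the rule to picking the single highest score.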

In [4]:
from pls_da.feature import forward_selection_plsda, backward_elimination_plsda

# Forward selection example:
selected_feats, perf_hist = forward_selection_plsda(
    filepath, 
    target_column="Class", 
    max_features=15, 
    n_components=2,
    scoring='accuracy', 
    verbose=True, 
    visualize=True,
    interactive_scatter=True
)
print("Selected features via forward selection:", selected_feats)

Iteration 1: Initial selected = ['density_A-B', 'Gilman_A'], Score = 0.5000

Iteration 2: Added 'group_A', Selected = ['density_A-B', 'Gilman_A', 'group_A'], Score = 0.8250

Iteration 3: Added 'ionization_energy_B', Selected = ['density_A-B', 'Gilman_A', 'group_A', 'ionization_energy_B'], Score = 0.9000

Iteration 4: Added 'specific_heat_A/B', Selected = ['density_A-B', 'Gilman_A', 'group_A', 'ionization_energy_B', 'specific_heat_A/B'], Score = 0.9500

Iteration 5: Added 'melting_point_K_min', Selected = ['density_A-B', 'Gilman_A', 'group_A', 'ionization_energy_B', 'specific_heat_A/B', 'melting_point_K_min'], Score = 0.9750

Iteration 6: Added 'Martynov_Batsanov_EN_max', Selected = ['density_A-B', 'Gilman_A', 'group_A', 'ionization_energy_B', 'specific_heat_A/B', 'melting_point_K_min', 'Martynov_Batsan


Selected features via forward selection: ['density_A-B', 'Gilman_A', 'group_A', 'ionization_energy_B', 'specific_heat_A/B', 'melting_point_K_min', 'Martynov_Batsanov_EN_max', 'Pauling_radius_CN12_avg', 'density_weighted_norm_A+B', 'normalized_index_A', 'normalized_index_B', 'density_max', 'melting_point_K_A-B', 'CIF_radius_A/B', 'avg_index']


In [5]:
# Backward elimination example:
remaining_feats, perf_hist_back = backward_elimination_plsda(
    filepath, 
    target_column="Class", 
    min_features=5, 
    n_components=2,
    scoring='accuracy', 
    verbose=True, 
    visualize=True,
    interactive_scatter=True
)
print("Remaining features via backward elimination:", remaining_feats)

Iteration 0: All features, Score = 0.8000

Iteration 1: Removed 'specific_heat_min', Remaining = ['index_A', 'index_B', 'normalized_index_A', 'normalized_index_B', 'largest_index', 'smallest_index', 'avg_index', 'atomic_weight_weighted_A+B', 'atomic_weight_A/B', 'atomic_weight_A-B', 'period_A', 'period_B', 'group_A', 'group_B', 'group_A-B', 'Mendeleev_number_A', 'Mendeleev_number_B', 'Mendeleev_number_A-B', 'valencee_total_A', 'valencee_total_B', 'valencee_total_A-B', 'valencee_total_A+B', 'valencee_total_weighted_A+B', 'valencee_total_weighted_norm_A+B', 'unpaired_electrons_A', 'unpaired_electrons_B', 'unpaired_electrons_A-B', 'unpaired_electrons_A+B', 'unpaired_electrons_weighted_A+B', 'unpaired_electrons_weighted_norm_A+B', 'Gilman_A', 'Gilman_B', 'Gilman_A-B', 'Gilman_A+B', 'Gilman_weighted_A+B', 'Gilman_weighted_norm_A+B', 'Z_eff_A', 'Z_eff_B', 'Z_eff_A-B', 'Z_eff_A/B', 'Z_eff_max', 'Z_eff_min', 'Z_eff_avg', 'Z_eff_weighted_norm_A+B', 'ion


Remaining features via backward elimination: ['atomic_weight_A-B', 'group_A', 'Mendeleev_number_A', 'Mendeleev_number_B', 'valencee_total_weighted_A+B', 'valencee_total_weighted_norm_A+B', 'unpaired_electrons_A', 'unpaired_electrons_A-B', 'Gilman_A', 'Z_eff_A', 'Z_eff_A-B', 'Z_eff_min', 'Z_eff_weighted_norm_A+B', 'ionization_energy_B', 'ionization_energy_A/B', 'ionization_energy_avg', 'polyhedron_distortion_B', 'CIF_radius_A', 'CIF_radius_B', 'CIF_radius_A-B', 'CIF_radius_weighted_norm_A+B', 'Pauling_EN_A', 'Pauling_EN_B', 'Pauling_EN_max', 'Pauling_EN_min', 'Pauling_EN_weighted_norm_A+B', 'Martynov_Batsanov_EN_A', 'Martynov_Batsanov_EN_min', 'density_A', 'density_B', 'density_A-B', 'density_avg', 'specific_heat_A-B', 'specific_heat_max', 'cohesive_energy_A', 'cohesive_energy_A/B', 'cohesive_energy_avg']


Use your chemistry background and curiosity to experiment with the data. The more you test, the more connections you’ll discover between descriptors and structure — turning chemical knowledge into predictive power.