## Data Science Classification Exercise: MDS-2 (Light) Dataset
Copyright (C) 2025, B. Zeller-Plumhoff

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the [GNU General Public License](https://www.gnu.org/licenses/gpl-3.0.html) for more details.

This Jupyter Notebook was created by Berit Zeller-Plumhoff and Bianca Guedert. The code for importing the data set and displaying two examples was taken from the [MDS companion web page](https://mds-book.org/Content/datasets) - the corresponding software is published under the MIT license.


&#x1F50D; **Overview**: In this exercise, you will work with the **MDS-2 (Light)** dataset, which you can access from the [MDS Datasets](https://mds-book.org/Content/datasets) page. The goal is to apply classification algorithms to the dataset, following the steps and methods covered in class. 

&#x1F4D1; **Requirements**: Perform hyperparameter optimization (Grid Search) for **all the classifiers** we studied in the lectures.

&#x1F4C5; **Deadline**: You have until next week (28/01) to complete this task.


### Objectives &#x2705;

- [ ] Load the MDS-2 (Light) dataset and prepare it for classification.
- [ ] Split the data into training, validation, and testing sets.
- [ ] Apply **Grid Search** hyperparameter optimization for all the classifiers we covered in the lectures.
- [ ] Evaluate the performance of each classifier and select the best model.
- [ ] Compare the results using metrics such as accuracy and the confusion matrix.
- [ ] Visualize the performance of the classifiers and their feature importances.
- [ ] Compare your model results with those reported in [The Materials Data-Science Book](https://mds-book.org/).



### Hands on &#129304;

&#128187; We begin by loading the required libraries.

In [None]:
import pandas as pd # library for organizing data
import numpy as np # library for numerical computations
from sklearn import linear_model # linear models, e.g. LogisticRegression for classification
from sklearn.model_selection import train_test_split, KFold, cross_val_score # tools for splitting a data set into training and test data and for cross-validation
from sklearn.inspection import DecisionBoundaryDisplay # library to display decision boundaries of classifiers
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, precision_score, recall_score, RocCurveDisplay, roc_curve, roc_auc_score, log_loss, accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

import matplotlib.pyplot as plt # library for plotting (not interactive)

### Loading Ising Dataset &#x1F504;

Load the Ising Light dataset, a smaller version of the original Ising dataset with 16x16px images, to explore the structure-property relationship. See the [MDS-2 (Light) Dataset](https://mds-book.org/Content/datasets) page to find out more about it.

In [None]:
from mdsdata import load_Ising_light  

images, labels, temperatures = load_Ising_light()

fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(8, 4))
ax0.imshow(images[10])
ax0.set(title=f"T={temperatures[10]:.2f}, label={labels[10]}")
ax1.imshow(images[3000])
ax1.set(title=f"T={temperatures[3000]:.2f}, label={labels[3000]}")
plt.show()

### Data Processing and Splitting &#x1F4C8;

The following code flattens each image into a feature vector and splits the data into a combined training/validation set and a test set. The training/validation set can then be split further (or cross-validated) during the grid search for hyperparameter tuning, while the test set is kept aside for the final evaluation.

In [4]:
# flatten each image into a 1D feature vector; you can initially select a fraction of the images
# in order to reduce the computation time while testing your code
X = images.reshape(len(images), -1)
y = labels

# split your data set into training/validation and testing - from here you can perform further splits for your grid hyperparameter search
X_trainval, X_test, y_trainval, y_test = train_test_split(X,y,test_size=0.1)
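
If you prefer an explicit validation set over cross-validation, the further split mentioned above could look roughly like the following sketch. It runs on a synthetic stand-in for the image stack, so the names `demo_images`, `demo_labels`, and the split sizes are assumptions for illustration only and not part of the exercise data. `stratify` keeps the class proportions equal across the splits, and `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 200 random 16x16 "images" with balanced binary labels.
rng = np.random.default_rng(0)
demo_images = rng.integers(0, 2, size=(200, 16, 16))
demo_labels = np.repeat([0, 1], 100)

# Flatten each image into a feature vector of 16 * 16 = 256 pixels.
X_demo = demo_images.reshape(len(demo_images), -1)

# First hold out the test set (10 %), keeping the class balance.
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, demo_labels, test_size=0.1, stratify=demo_labels, random_state=0
)

# Then carve a validation set (20 % of the remainder) out of the training/validation data.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.2, stratify=y_tv, random_state=0
)

print(X_demo.shape, X_tr.shape, X_va.shape, X_te.shape)
```

When you use cross-validated grid search, the second split is not needed, because the folds take over the role of the validation set.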

### Grid Search Hyperparameter Tuning

Perform Grid Search hyperparameter optimization for all the classifiers discussed in class.

&#x1F4A1; **Tip**: In `sklearn`, the `Pipeline` class allows you to chain multiple steps (e.g., data preprocessing and model training) into a single estimator. This ensures that your transformations (like scaling or encoding) are fitted on the training data only and then applied consistently during both training and testing. It makes your workflow more efficient, reduces errors, and keeps your code clean!

To learn more about `Pipeline`, check out the official [sklearn documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) 📖.
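
As a rough illustration of how a `Pipeline` and a grid search fit together, here is a minimal sketch for a single classifier. It uses synthetic data from `make_classification` instead of the Ising images, and the step names (`scaler`, `clf`) and the parameter grid are arbitrary choices made for this example, not recommended values. Parameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption: stands in for the flattened images).
X_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=0)
X_syn_trainval, X_syn_test, y_syn_trainval, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.1, random_state=0
)

# Chain scaling and the classifier so that the scaler is refitted inside every CV fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", KNeighborsClassifier()),
])

# Illustrative grid only; keys follow the "<step name>__<parameter>" convention.
param_grid = {
    "clf__n_neighbors": [3, 5, 7],
    "clf__weights": ["uniform", "distance"],
}

# 5-fold cross-validated grid search on the training/validation data.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_syn_trainval, y_syn_trainval)

print("best parameters:", search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```

The same pattern can be repeated for the other classifiers by swapping the `clf` step and its grid. After fitting, `search.best_estimator_` is the pipeline refitted on all training/validation data with the best parameters.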

In [None]:
#start code here

### Model Evaluation

After finding the best hyperparameters for each classifier, evaluate the model performance on the test set.
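
One possible way to structure the evaluation is sketched below, again on synthetic data rather than the Ising images. The random forest and its settings are placeholders for whichever tuned model you obtained, and the names `X_holdout` and `y_holdout` are made up for this example. The sketch computes the accuracy and the confusion matrix on held-out data and ranks the features by importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption: stands in for the flattened images).
X_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=0)
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

# Placeholder model; in the exercise this would be the best estimator from the grid search.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_fit, y_fit)
y_pred = model.predict(X_holdout)

# Accuracy: fraction of correctly classified held-out samples.
acc = accuracy_score(y_holdout, y_pred)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_holdout, y_pred)

# Indices of the five most important features, most important first.
top_features = np.argsort(model.feature_importances_)[::-1][:5]

print(f"accuracy: {acc:.3f}")
print("confusion matrix:")
print(cm)
print("top 5 features:", top_features)
```

For the image data, the importance vector has one entry per pixel, so it can be reshaped to 16x16 and displayed with `imshow`. The confusion matrix can be plotted with `ConfusionMatrixDisplay`, which is already imported above.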

In [None]:
#start code here

### If you find it necessary, keep adding titles to organize your code!