## 19.05.2023 Logistic Regression

Copyright (C) 2023, B. Zeller-Plumhoff

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the [GNU General Public License](https://www.gnu.org/licenses/gpl-3.0.html) for more details.

This Jupyter Notebook was created by Berit Zeller-Plumhoff for the course "Data Science for Material Scientists" at Kiel University. 

Within the notebook, you will plot a sigmoid function for different parameter values. You will then use logistic regression to attempt a simplistic prediction of tumour malignancy based on one morphological feature of the tumour. Finally, you will perform a logistic regression to classify metallic and non-metallic materials based on their composition.

We begin by loading the required libraries. Note that in addition to the libraries you have used so far, we make use of [matminer](https://hackingmaterials.lbl.gov/matminer/#) and [pymatgen](https://pymatgen.org/). Have a closer look at the respective websites to learn more about these.

The publication by [Ward et al., 2018](https://www.sciencedirect.com/science/article/abs/pii/S0927025618303252)<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1) gives further information on matminer. The toolkit accesses external databases, such as the [Materials Project](https://materialsproject.org/), where a large amount of materials data has been accumulated and published.

In [None]:
import pandas as pd # library for organizing data
import numpy as np # library for numerical computations
from sklearn import linear_model # the linear_model module provides linear models, including LogisticRegression
from sklearn.metrics import log_loss, accuracy_score # these functions calculate the log loss and the accuracy of a classification
from sklearn.model_selection import train_test_split # this library enables the splitting of a data set into training and test data
from sklearn.datasets import load_breast_cancer # access a medical dataset of breast cancer malignancy given morphological features
from sklearn.inspection import DecisionBoundaryDisplay # library to display decision boundaries of classifiers

import matplotlib.pyplot as plt # library for plotting (not interactive)
import matminer.datasets as mm # library for data mining materials properties, accesses published datasets
from matminer.featurizers.conversions import StrToComposition # converts a string denoting a material composition into the composition
from matminer.featurizers.composition.element import ElementFraction # determines the element fraction for a given composition
from pymatgen.core import Composition # materials analysis library, module used to analyse the chemical composition of a compound

#### Logistic regression function

We will begin by gaining some practical understanding of the sigmoid function used in logistic regression. To start with, define a function that takes both the (1D) feature vector $x$ and the array $\theta$ (containing the intercept $\theta_0$ and slope $\theta_1$) as input and returns the values of the sigmoid function for all given $x$.
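
For orientation, one possible sketch of such a function is shown here. It assumes the common parametrisation $\sigma(x) = 1/(1+e^{-(\theta_0+\theta_1 x)})$; the name `sigmoid_sketch` is chosen for illustration only and is not part of the notebook.

```python
import numpy as np

def sigmoid_sketch(x, theta):
    """Evaluate 1 / (1 + exp(-(theta_0 + theta_1 * x))) element-wise."""
    x = np.asarray(x, dtype=float)
    # linear part: intercept theta[0] plus slope theta[1] times x
    z = theta[0] + theta[1] * x
    return 1.0 / (1.0 + np.exp(-z))

# quick check: the sigmoid is 0.5 where theta_0 + theta_1 * x = 0
values = sigmoid_sketch(np.array([-10.0, 0.0, 10.0]), np.array([0.0, 1.0]))
print(values)
```

The middle value should be 0.5, with the outer values close to 0 and 1.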

In [None]:
def log_reg_func():
    
    # determine the value of the sigmoid function
    
    
    #return the result
    return

Based on the function you have defined, make a figure with two subplots (arranged vertically), where in the first subplot, you vary $\theta_0 \in \{-4,0,4\}$, while $\theta_1=1$ and in the second subplot you vary $\theta_1 \in \{-4,0.5,1,4\}$, while $\theta_0=0$. Use $x \in \left[-10,10\right]$ with $\Delta x = 0.1$. Make sure to include a legend in each plot. Let both plots share the x-axis, so that the tick labels and the axis label only need to be included in the lower plot.
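
A minimal sketch of such a figure follows, assuming the sigmoid parametrisation $1/(1+e^{-(\theta_0+\theta_1 x)})$. The helper `sigmoid_sketch` and the variable names are illustrative assumptions, and the styling choices are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid_sketch(x, theta):
    # assumed form of the logistic function: intercept theta[0], slope theta[1]
    return 1.0 / (1.0 + np.exp(-(theta[0] + theta[1] * x)))

x = np.linspace(-10, 10, 201)  # 201 points on [-10, 10] give a spacing of 0.1

# two vertically stacked subplots sharing the x-axis
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True)
fig.subplots_adjust(hspace=0.05)  # reduce the vertical gap between the plots

# upper plot: vary the intercept, keep the slope at 1
for theta_0 in (-4, 0, 4):
    ax_top.plot(x, sigmoid_sketch(x, (theta_0, 1)), label=rf"$\theta_0={theta_0}$")
ax_top.set_ylabel("sigmoid")
ax_top.legend()

# lower plot: vary the slope, keep the intercept at 0
for theta_1 in (-4, 0.5, 1, 4):
    ax_bottom.plot(x, sigmoid_sketch(x, (0, theta_1)), label=rf"$\theta_1={theta_1}$")
ax_bottom.set_xlabel("x")
ax_bottom.set_ylabel("sigmoid")
ax_bottom.legend()
```

In a notebook the figure is displayed at the end of the cell; in a script you would add `plt.show()`.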

In [None]:
# set x
x=
# initialize the subplots

# adjust the vertical spacing between the plots to minimize it

# plot the function for varying theta_0 in the upper graph
for i in 

# plot the function for varying theta_1 in the lower graph
for i in 

# show the plot


### Logistic regression

Now that we understand how $\theta$ influences the shape of the logistic regression function, we want to actually employ it for classification. Define a classification function that makes use of _scikit-learn_ to perform classification using logistic regression given an input feature vector and observation vector for training. The function should output the predicted labels/classes for the training features, as well as the fitted classifier.
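
One way such a function could look is sketched here and tried on synthetic data. The function name, the synthetic two-cluster data and the default `LogisticRegression` settings are assumptions made for illustration.

```python
import numpy as np
from sklearn import linear_model

def classification_sketch(X, y):
    """Fit a logistic regression on (X, y) and return training predictions and the model."""
    model = linear_model.LogisticRegression()
    model.fit(X, y)            # X must be 2D: (n_samples, n_features)
    y_pred = model.predict(X)  # predicted class labels for the training features
    return y_pred, model

# synthetic 1D example: two overlapping clusters centred at -2 and +2
rng = np.random.default_rng(0)
x_demo = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)]).reshape(-1, 1)
y_demo = np.array([0] * 50 + [1] * 50)

pred, clf = classification_sketch(x_demo, y_demo)
print("training accuracy:", (pred == y_demo).mean())
```

Note the `reshape(-1, 1)`: scikit-learn estimators expect a 2D feature array even when only a single feature is used.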

In [None]:
def classification():
    
    # Define the model using linear_model and LogisticRegression from scikit-learn
    model = 
    
    # train the model using .fit
    
        
    # Use the model to predict the entire set of data using .predict
    
    
    # return the predictions and the fitted model
    return 

We want to apply this classifier to the [Breast Cancer Wisconsin Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)), which is available in _scikit-learn_. Start by loading and displaying the dataset.

In [None]:
# load the breast cancer dataset into a variable

# display that variable to attain a better understanding of its structure


Without specifying any parameters when loading the dataset, you have loaded a Bunch object. You should convert it into a pandas data frame. Display the data frame to have a look at the different features. Note that a target of 1 means that the tumour is benign, while a target of 0 means that it is malignant.
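
A possible sketch of this conversion is given here; the variable names `cancer` and `cancer_df` are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()  # Bunch object with .data, .target, .feature_names, ...

# build the data frame from the feature matrix and label the columns
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# append the class labels (0 = malignant, 1 = benign) as an extra column
cancer_df["target"] = cancer.target

print(cancer_df.shape)
print(list(cancer.target_names))
```

As an alternative, `load_breast_cancer(as_frame=True)` returns a Bunch whose `frame` attribute already holds a data frame including the target column.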

In [None]:
# convert the data part of the cancer bunch object to a dataframe

# add a target column to the dataframe to which you assign the target of the cancer bunch object

# display the dataframe


As you can see, within this data set we have 30 features that can be used to predict the cancer malignancy. In this instance, we will not use all features simultaneously, but you should test for each feature whether it alone could be used to predict the tumour malignancy with little error. Save all trained models in a list for later access. Display the log loss testing error for each feature.
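
The loop described above could be sketched as follows. The split ratio, `random_state` and `max_iter=1000` are assumptions chosen for this illustration, and the bar chart is left out. Note that `log_loss` expects predicted probabilities (from `predict_proba`), not predicted class labels.

```python
import numpy as np
from sklearn import linear_model
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
y = cancer.target

losses, models = [], []
for j, name in enumerate(cancer.feature_names):
    X = cancer.data[:, [j]]  # keep the single feature as a 2D column
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    model = linear_model.LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # log loss on the held-out data, computed from class probabilities
    losses.append(log_loss(y_test, model.predict_proba(X_test)))
    models.append(model)

best = int(np.argmin(losses))
print("lowest test log loss:", cancer.feature_names[best], losses[best])
```

Because the features are not scaled, scikit-learn may emit convergence warnings for some of them; increasing `max_iter` or standardising the features are common remedies.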

In [None]:
# assign the target column of the data frame to your observation variable y

# initialize empty lists for the log loss and the different models for each feature

# perform training the classifier for each feature
for i in 
    # assign the feature vector for training
    
    # split the data into training and test data
    
    # perform regression and prediction for training and predict y for the testing data
    
    
    # add the model and error to the respective lists
    
    

# plot the error for each feature in a horizontal bar chart


You can see how the use of different features for the classification of the cancer might lead to very different results. Perform a classification with the best performing feature and plot the classification probability against the feature values (selecting only every 5th data point), highlighting the observed class by the colour of the marker. In addition to the scatter plot, plot the underlying probability function and indicate the probability threshold by a horizontal line. Include a legend.
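
The "underlying probability function" of a fitted one-feature classifier is the sigmoid from the first section, evaluated with the fitted intercept and slope. The sketch below checks this numerically and computes the feature value at which the probability crosses 0.5. The choice of the feature `worst perimeter` is an assumption for illustration, not a claim that it performs best, and plotting is omitted.

```python
import numpy as np
from sklearn import linear_model
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
col = list(cancer.feature_names).index("worst perimeter")  # assumed example feature
X = cancer.data[:, [col]]
y = cancer.target

model = linear_model.LogisticRegression(max_iter=1000).fit(X, y)

# evenly spaced values over the range of the feature
x_grid = np.linspace(X.min(), X.max(), 200).reshape(-1, 1)
p_model = model.predict_proba(x_grid)[:, 1]  # probability of class 1 (benign)

# the same curve from the fitted parameters and the sigmoid formula
theta_0 = model.intercept_[0]
theta_1 = model.coef_[0, 0]
p_manual = 1.0 / (1.0 + np.exp(-(theta_0 + theta_1 * x_grid[:, 0])))

# the probability equals 0.5 where theta_0 + theta_1 * x = 0
x_threshold = -theta_0 / theta_1
print(np.allclose(p_model, p_manual), x_threshold)
```

In the plot, `x_grid` and `p_model` give the probability curve, and a horizontal line at 0.5 intersects it at `x_threshold`.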

In [None]:
# identify the column index for which the log loss error is minimal
min_idx=
# initialize the figure

# plot every 5th entry of the feature vector and the predicted probability determined by the classifier
# each scatter marker should be assigned the correct class in colour

# set up an array in the overall range of the feature vector with minimal log loss error and even step size

# plot the underlying probability function with the vector you have just defined

# plot the horizontal threshold line

# prepare your legend

# add axes labels and limits


Assuming that we may want to use not one but two features for the classification, perform the classification with the __worst area__ and __worst perimeter__. Use all data points of these two features for training. We now want to plot the 2D decision boundary for the resulting classifier. Follow the first example given [here](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.DecisionBoundaryDisplay.html) to plot the decision boundary underneath a 2D scatterplot of the features and their designated target.

In [None]:
# setup your feature and observation vector for training


# perform the classification based on the vectors you defined


# generate two 2D features using numpy meshgrid


# create a grid based on the 2D features (see the online example)


# generate the prediction for the grid and reshape it into 2D 


# create the decision boundary


# set the colour map for your plot and display both the decision boundary, as well as the scatter plot of the
# actual features and their actual classes



What do you observe? Comment on the quality of the prediction.

### Predicting metallicity

Finally, we want to apply logistic regression to a materials science dataset. Therefore, we will access the dataset based on the publication from [Zhuo et al., 2018](https://pubs.acs.org/doi/pdf/10.1021/acs.jpclett.8b00124)<a name="cite_ref-2"></a>[<sup>[2]</sup>](#cite_note-2) that can be accessed through __matminer__. Use the _load_dataset_ function to load the __matbench_expt_is_metal__ dataset - it will automatically be loaded as a pandas dataframe. Display the dataset.

You can get a more detailed step by step introduction of different features that we will explore in this section by looking at [this lesson](https://workshop.materialsproject.org/lessons/08_ml_matminer/matminer-notes/) of the materials project workshop.

As you can see, the dataset contains the composition of each compound and whether or not it is a metal. The latter information was determined based on the experimentally determined bandgap. You can get more information on the dataset by _printing_ the return of the function _get_all_dataset_info_, which takes the name of the dataset as input.

As you can see, the composition given in the _composition_ column of the dataframe is a string. To use the composition as features, we ultimately want to create a number of feature vectors equal to the number of elements in the periodic table where, for any observation of a certain compound, the entry in that feature vector contains the stoichiometric fraction of said element. To visualize this, run the following cell. Comment on what operations have been performed here. Play around with the dataframe entry that is evaluated to see how the output changes. Discuss what you are seeing with the group.
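
To make the idea of element-fraction features concrete, here is a dependency-free sketch. It is a simplified stand-in and not how pymatgen or matminer implement it: the parser `element_fractions_sketch` is hypothetical, handles only simple formulas without parentheses, and the three-element label list is a truncated assumption.

```python
import re
from collections import Counter

def element_fractions_sketch(formula):
    """Return {element: atomic fraction} for a simple formula such as 'Fe2O3'."""
    counts = Counter()
    # an element symbol is one capital letter plus an optional lowercase letter,
    # followed by an optional amount (a missing amount means 1)
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] += float(amount) if amount else 1.0
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}

fractions = element_fractions_sketch("Fe2O3")
print(fractions)  # {'Fe': 0.4, 'O': 0.6}

# fixed-length feature vector: one entry per element label, zero if absent
labels = ["H", "O", "Fe"]  # truncated, hypothetical label list
vector = [fractions.get(element, 0.0) for element in labels]
print(vector)  # [0.0, 0.6, 0.4]
```

The real feature vector has one entry per element of the periodic table, so most entries are zero for any given compound.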

In [None]:
# this code is partly taken and adapted from https://workshop.materialsproject.org/lessons/08_ml_matminer/matminer-notes/
# which is available under the BSD 3-Clause License, Copyright (c) 2019, shreddd, All rights reserved.
# For the list of conditions of the license, please see https://choosealicense.com/licenses/bsd-3-clause/

idx=4
form=df["composition"][idx]
print(form)
comp=Composition(df["composition"][idx])
print(comp)
ef=ElementFraction()
element_fraction_labels = ef.feature_labels()
print(element_fraction_labels)
compef=ef.featurize(comp)
print(compef)

We now want to apply this operation to the whole dataset. To this end, we wish to use the _featurize_dataframe_ operation from _ElementFraction()_ but need to process our dataframe first to be able to do so.

In the first step, rename the column __composition__ into __formula__.

Secondly, we need to convert the chemical formula to a composition, which can be done by using _StrToComposition_ and its _featurize_dataframe_ operation. Apply this to the dataframe and display it. This operation may take around 5 minutes.

In [None]:
stc = 
df = 
df.head()

You can now apply the _featurize_dataframe_ operation from _ElementFraction()_ to the dataframe. However, this will take approximately 10 minutes. If you want to take a break, uncomment and run this cell.

In [None]:
#df = ef.featurize_dataframe(df, "composition", ignore_errors=True)
#df.to_pickle('./ismetal_df.pkl')

Otherwise, I have saved the final dataframe as .pkl for you to load at this stage.

In [None]:
load_df=pd.read_pickle('ismetal_df.pkl')
load_df.head()

Assign the loaded dataframe to your original one and look at scatter plots for some elements (displaying the element column in $x$ and the target value in $y$).

Finally, we are ready to perform our classification task. Assign the dataframe __without__ the columns "formula", "is_metal" and "composition" to your feature variable X and the "is_metal" column to the observation y. Then split this data into training and test data and train the classifier. Predict the classes for the testing data and calculate and display both the log loss error for training and testing data, as well as the accuracy score.
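
The workflow can be sketched on a small synthetic data frame that mimics the column layout. The toy data, the rule that generates `is_metal`, and the split settings are all assumptions for illustration; the real dataset has one column per element and stores `is_metal` as a boolean.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# toy stand-in for the featurized data frame (two element columns only)
rng = np.random.default_rng(1)
o_frac = rng.uniform(0.0, 1.0, 200)
toy_df = pd.DataFrame({
    "formula": ["toy"] * 200,
    "composition": ["toy"] * 200,
    "Fe": 1.0 - o_frac,
    "O": o_frac,
    "is_metal": (o_frac < 0.4).astype(int),  # invented rule, for illustration only
})

# features: everything except the label and the two non-numeric columns
X = toy_df.drop(columns=["formula", "is_metal", "composition"])
y = toy_df["is_metal"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = linear_model.LogisticRegression().fit(X_train, y_train)

results = {}
for name, X_part, y_part in (("train", X_train, y_train), ("test", X_test, y_test)):
    loss = log_loss(y_part, model.predict_proba(X_part))  # needs probabilities
    acc = accuracy_score(y_part, model.predict(X_part))   # needs class labels
    results[name] = (loss, acc)
    print(f"{name}: log loss = {loss:.3f}, accuracy = {acc:.3f}")
```

Comparing the training and testing values gives a first indication of whether the model overfits.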

In [None]:
# assign the variables

# split your data sets


# perform classification and prediction for training and testing data


# print log loss and accuracy for training and testing data


What do these results mean? Discuss.

If you have time left, run the following cell and look up the functions that are used in the _scikit-learn_ documentation. What operations are carried out and what does the final result mean?

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

kfold = KFold(n_splits=30, random_state=1, shuffle=True)

lr = linear_model.LogisticRegression()
scores = cross_val_score(lr, X, y, scoring='accuracy', cv=kfold)

print('Mean accuracy: {:.3f}'.format(np.mean(scores)))
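
To see what `KFold` does on its own, the following sketch splits ten sample indices into five folds. The small sample count and `n_splits=5` are assumptions chosen to keep the output readable.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=1)

test_indices = []
for fold, (train_idx, test_idx) in enumerate(kfold_demo.split(samples)):
    # each fold holds out a different subset for testing and trains on the rest
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
    test_indices.extend(test_idx.tolist())
```

Every sample appears in exactly one test fold. `cross_val_score` fits a fresh copy of the model on each training part and scores it on the corresponding test part, so the mean accuracy is an average over all folds.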

<a name="cite_note-1"></a>1.[^](#cite_ref-1) Ward, L., Dunn, A., Faghaninia, A., Zimmermann, N. E. R., Bajaj, S., Wang, Q., Montoya, J. H., Chen, J., Bystrom, K., Dylla, M., Chard, K., Asta, M., Persson, K., Snyder, G. J., Foster, I., Jain, A. Matminer: An open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60-69 (2018).

<a name="cite_note-2"></a>2.[^](#cite_ref-2) Zhuo, Y., Tehrani, A. M., Brgoch, J. J. Phys. Chem. Lett. 9(7), 1668–1673 (2018). https://doi.org/10.1021/acs.jpclett.8b00124