## 28.04.2023 Linear regression

Copyright (C) 2023, B. Zeller-Plumhoff

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the [GNU General Public License](https://www.gnu.org/licenses/gpl-3.0.html) for more details.

In this notebook, you will write functions to read data sets into a pandas data frame, display the data using matplotlib and plotly and then apply linear regression to estimate the Young's modulus of the material.

The data and code in the first part of the Jupyter notebook (Young's modulus) are based on those published by Michael N Sakano, Saaketh Desai and Alejandro Strachan from Purdue University on [nanoHUB](https://nanohub.org/tools/youngsmod) under the GNU General Public License version 3. They were adapted by Berit Zeller-Plumhoff for the course Data Science for Materials Scientists at Kiel University. The modifications include shortening the Jupyter notebook, adding comments and rephrasing text passages.

### Libraries

We begin by loading the libraries needed for the required tasks. These include:
1. [Pandas](https://pandas.pydata.org/) to load and organize the data
2. [Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) to apply the linear regression model
3. [Matplotlib](https://matplotlib.org/) and [plotly](https://plotly.com/python/) for plotting the data

In [1]:
import pandas as pd # This library is for developing data structures in the form of tables
import numpy as np # This library is for scientific operations and data manipulation in matrices
from sklearn import datasets, linear_model # This library helps to develop the linear model
from sklearn.metrics import mean_squared_error # This library adds error metrics to our model

import matplotlib.pyplot as plt # This library is for visualizing the curves
import plotly.express as px # This library is for visualizing advanced graphics
import plotly.graph_objs as go # This library is the graphical object for plotly

### Data

This notebook is set up to deal with data contained in .csv files and organized in two columns, namely strain and stress (in MPa).

We will begin by loading the dataset provided on nanoHUB, which is based on Hollomon, J. H., Tensile Stress-Strain Curves of a 70-30 Brass (1944).

In [None]:
# Set the name of the data file, which should be organized in two columns,
# with strain in column 1 and stress in column 2.
# The .csv file should be in the same folder as the Jupyter notebook;
# otherwise, you need to adjust the filename to include the path
filename = 

# load the data from the .csv into the data variable, which is a pandas
# dataframe
data = 

# display the "head" (first five rows) of the dataframe
# try different commands of displaying the dataframe, 
# e.g. data, data.tail()
data.head()
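
As a minimal sketch of the loading step, the example below reads a two-column table with `pd.read_csv`. To stay self-contained it reads from an in-memory string instead of a file; the column names `strain` and `stress` and all values are made up for illustration and are not taken from the nanoHUB dataset.

```python
import io
import pandas as pd

# Hypothetical file contents standing in for a two-column .csv file;
# the column names and values are invented for this illustration.
csv_text = """strain,stress
0.0000,0.0
0.0005,55.0
0.0010,110.0
0.0015,165.0
0.0020,205.0
0.0025,230.0
"""

# pd.read_csv accepts a file path or a file-like object; with a real file
# you would pass its name as a string instead of the StringIO object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())   # first five rows
print(data.shape)    # (number of rows, number of columns)
```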

There are different ways in which you can work with the data in the pandas dataframe. You can assign different columns to variables or continue working with the dataframe itself.

Create two variables, named strain and stress, and assign the respective dataframe columns to them. Additionally, transform the strain so that the variable contains the values in %. To double-check the assignment, print the first 5 values of both variables.

In [None]:
strain = 
stress = 
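
One possible way to do this is sketched below on a small made-up dataframe. It assumes that the strain is stored as a dimensionless fraction, so that multiplying by 100 gives percent; check this assumption against your own file before applying it.

```python
import pandas as pd

# Hypothetical dataframe; column names and values are invented.
data = pd.DataFrame({"strain": [0.000, 0.001, 0.002],
                     "stress": [0.0, 110.0, 205.0]})

# Select the columns by position, so the code does not depend on their names.
strain = data.iloc[:, 0] * 100   # assumed fraction -> percent
stress = data.iloc[:, 1]         # stress in MPa

print(strain.head())
print(stress.head())
```

Selecting by name, e.g. `data["strain"]`, works equally well once you know the column headers of your file.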

### Plotting

Since we have now loaded the data, you can visualize it using matplotlib or plotly. The advantage of plotly is the interactiveness of the resulting plot.

Display the stress-strain curve of the loaded data, including axis labels and a figure title.
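
A minimal matplotlib sketch of such a plot is given below. The curve is synthetic (a steep linear part followed by a flatter part) and only serves to show the calls for labels, title and legend.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stress-strain curve, invented for illustration
strain = np.linspace(0.0, 1.0, 51)                       # strain in %
stress = np.where(strain < 0.2,
                  1000.0 * strain,                       # steep initial part
                  200.0 + 100.0 * (strain - 0.2))        # flatter part, in MPa

fig, ax = plt.subplots()
ax.plot(strain, stress, "o-", markersize=3, label="synthetic data")
ax.set_xlabel("Strain (%)")
ax.set_ylabel("Stress (MPa)")
ax.set_title("Stress-strain curve")
ax.legend()
```

For an interactive version, `plotly.express` offers `px.line` and `px.scatter`, which take the x and y data in a similar way.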

Based on the data, decide on the maximum strain defining the linear regime of deformation. Based on this region, you will fit a linear curve to the data to determine the Young's modulus of the material. 
Automatically find the final array index of the strain variable pertaining to this maximum strain using the [numpy where](https://numpy.org/doc/stable/reference/generated/numpy.where.html) function.

In [None]:
# insert the maximum strain you wish to include in the fitting for the
# Young's modulus based on the graph above
max_strain = 

# use the numpy where function to define the maximum index of the variable
# to include
max_index = 
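
The behaviour of `np.where` with a single condition can be sketched as follows. The strain values and the threshold are made up; the sketch assumes the strain increases monotonically, so the last matching index marks the end of the selected range.

```python
import numpy as np

# Synthetic strain values in %, invented for illustration
strain = np.array([0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30])
max_strain = 0.15  # hypothetical threshold

# With only a condition, np.where returns a tuple of index arrays,
# one per dimension; [0] selects the indices along the single axis.
indices = np.where(strain <= max_strain)[0]
max_index = indices[-1]   # last index that still satisfies the condition

print(max_index)                  # 3
print(strain[: max_index + 1])    # slice that includes the element at max_index
```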

### Linear regression

Based on the determined range, assign the strain and stress values for training to the variables x_train and y_train. Check which input format the scikit-learn linear regression model requires: the features must ultimately be passed as a 2D numpy array with one row per sample and one column per feature.

In [None]:
x_train=
y_train=
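
The shape requirement can be illustrated with the sketch below, which uses made-up values and a hypothetical `max_index`. The call `reshape(-1, 1)` turns a 1D array of n values into an array of shape (n, 1).

```python
import pandas as pd

# Synthetic values, invented for illustration
strain = pd.Series([0.00, 0.05, 0.10, 0.15, 0.20])    # in %
stress = pd.Series([0.0, 55.0, 110.0, 165.0, 205.0])  # in MPa
max_index = 3  # hypothetical end of the linear regime

# scikit-learn estimators expect X with shape (n_samples, n_features)
x_train = strain.to_numpy()[: max_index + 1].reshape(-1, 1)
y_train = stress.to_numpy()[: max_index + 1]

print(x_train.shape)  # (4, 1)
print(y_train.shape)  # (4,)
```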

Following the definition of your training data, you will now write a function that uses that data to train a linear regression model based on the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). The function should take your training data as input and output the parameters of the fitted model, as well as the fitted stress values.

In [None]:
def regression(X_train, Y_train):
    
    # Define the model using linear_model and LinearRegression from scikit-learn
    model = 
    # train the model using .fit
    model.fit()
        
    # Use the model to predict the entire set of data using .predict
    predictions = model.predict() 
    
    # return the predictions and the fitted model
    return 
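
For orientation, the general fit and predict pattern of scikit-learn is sketched below on noise-free synthetic data lying on y = 2x + 1. The helper name `fit_line` is hypothetical and the sketch is one possible structure, not the reference solution for the function above.

```python
import numpy as np
from sklearn import linear_model

def fit_line(x, y):
    """Hypothetical helper: fit y = a*x + b, return predictions and model."""
    model = linear_model.LinearRegression()
    model.fit(x, y)                  # x must have shape (n_samples, 1)
    predictions = model.predict(x)   # fitted values at the training points
    return predictions, model

# Synthetic data on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
y = np.array([1.0, 3.0, 5.0, 7.0])

predictions, model = fit_line(x, y)
print(model.coef_[0], model.intercept_)   # slope and intercept of the fit
```

After fitting, the slope is available as `model.coef_` (one entry per feature) and the intercept as `model.intercept_`.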

Apply the function you have just defined to your data. Based on the predictions, determine the mean squared error of the predictions and print the linear function as well as the error.

In [None]:
predictions, model = 

# Print model and mean squared error and variance score
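
The error metrics can be computed as in the following sketch, which uses made-up data scattered around y = 2x + 1. It assumes that "variance score" refers to the coefficient of determination R^2 returned by `r2_score`, as in older scikit-learn examples.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data scattered around y = 2x + 1, invented for illustration
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

model = linear_model.LinearRegression().fit(x, y)
predictions = model.predict(x)

mse = mean_squared_error(y, predictions)   # mean of squared residuals
r2 = r2_score(y, predictions)              # 1.0 would be a perfect fit

print(f"y = {model.coef_[0]:.3f} * x + {model.intercept_:.3f}")
print(f"Mean squared error: {mse:.4f}")
print(f"R^2 (variance score): {r2:.4f}")
```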


Plot the original experimental data, the training data and the linear model predictions in one plot with different colours, adding a legend.

Calculate and print the Young's modulus of the material in GPa, and cross-check your computed value with the published literature. 
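
If the strain is given in % and the stress in MPa, the slope of the fitted line has units of MPa per % strain, so a unit conversion is needed. The arithmetic is sketched below with a hypothetical slope value that is not a result for the brass data.

```python
# Hypothetical slope of the fitted line, in MPa per % strain
slope_mpa_per_percent = 1100.0

# 1 % strain corresponds to a dimensionless strain of 0.01,
# so the slope per unit strain is 100 times larger.
slope_mpa = slope_mpa_per_percent * 100.0    # MPa per unit strain
youngs_modulus_gpa = slope_mpa / 1000.0      # 1 GPa = 1000 MPa

print(f"Young's modulus: {youngs_modulus_gpa:.1f} GPa")  # Young's modulus: 110.0 GPa
```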

Now repeat the above with different thresholds for the maximum strain and evaluate how your linear model and Young's modulus change. Then repeat the workflow for the additional files Brass_grainsize_0_020mm.csv and Brass_grainsize_0_025mm.csv to assess how the Young's modulus changes with grain size. The grain sizes of the three data sets are 0.015, 0.02 and 0.025 mm, respectively.

## Polynomial regression

Following the linear regression example above, you should now implement a polynomial regression, which is a more general case of linear regression where features of the form $x^d \,, d\in \mathbb{N}$ may be included. We will use an example that determines the radius of gyration $R_g$ for a set of small angle X-ray scattering data. The radius of gyration is the average electron density weighted square distance of scatterers from the centre of the object. It is related to the dimensions of certain well-defined shapes, such as spheres and ellipses, but different structures can exhibit the same $R_g$.
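
The idea that polynomial regression is linear regression on the features $x, x^2, \dots$ can be sketched as follows. The example uses noise-free synthetic data on y = 1 + 2x + 3x^2 and the `PolynomialFeatures` transformer from scikit-learn; it is an illustration of the general technique, not part of the SAXS exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, noise-free data on y = 1 + 2x + 3x^2
x = np.linspace(-1.0, 1.0, 11).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() + 3.0 * x.ravel() ** 2

# Expand x into the features [x, x^2]; the constant term is left
# to the intercept of the linear model (include_bias=False).
features = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)

model = LinearRegression().fit(features, y)
print(model.intercept_, model.coef_)   # intercept, then coefficients of x and x^2
```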

Load the data stored in the file saxs_data.csv.

The radius of gyration is determined in the so-called Guinier zone, which is the beginning of the scattering curve. You may assume that the maximum $q$ you need to consider is $q=0.01$ nm$^{-1}$. For this range of $q$, you can determine $R_g$ as follows: $$\ln(I)=\ln(I_0)-\frac{1}{3}q^2R_g^2$$
Here $I$ is the scattered intensity and $I_0$ its value extrapolated to $q=0$. Based on this information and using the linear regression model from scikit-learn, determine $R_g$.
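
Because $\ln(I)$ is linear in $q^2$, a straight-line fit of $\ln(I)$ against $q^2$ has slope $-R_g^2/3$ and intercept $\ln(I_0)$. The sketch below shows this on synthetic data generated from the Guinier relation itself; the values of `rg_true` and `i0_true` are made up and the real data in saxs_data.csv will contain noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic Guinier data; rg_true and i0_true are invented values
rg_true = 50.0    # nm
i0_true = 1000.0  # arbitrary units
q = np.linspace(0.001, 0.02, 40)                         # nm^-1
intensity = i0_true * np.exp(-(q * rg_true) ** 2 / 3.0)

# Keep only the Guinier zone
q_max = 0.01
mask = q <= q_max

# ln(I) = ln(I0) - (Rg^2 / 3) * q^2, so q^2 is the single feature
x = (q[mask] ** 2).reshape(-1, 1)
y = np.log(intensity[mask])

model = LinearRegression().fit(x, y)
rg = np.sqrt(-3.0 * model.coef_[0])   # slope = -Rg^2 / 3
i0 = np.exp(model.intercept_)         # intercept = ln(I0)

print(f"R_g = {rg:.2f} nm, I_0 = {i0:.1f}")
```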