# Part 1: Working with linear and quadratic regressions in Python

In earlier experiments, we plotted some data and obtained the equation of a best-fit line.  Today we'd like to test multiple equations for fitting our data to find the mathematical relationship between phi (the fraction of organic solvent) and k' (the retention factor)!

You will be using an HPLC simulator to determine the retention factor of two compounds, caffeine and diethylformamide, as a function of phi.  This data should be entered in your ELN in a table like the example below (but without the "mepar" column), and then saved as a "comma delimited values" (.csv) file (but <i>not</i> UTF-8) into the Chem220 folder where this Jupyter notebook lives.

![HPLCDataSetupImage.png](attachment:HPLCDataSetupImage.png)
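
As a rough sketch of what reading such a table can look like, the example below parses a small comma-delimited table with one header row.  The numbers are made up (not simulator output), the in-memory text stands in for a real .csv file, and the choice of `np.genfromtxt` is an assumption; other readers would work too.

```python
import io
import numpy as np

# Made-up stand-in for a saved .csv file: one header row, then one row per phi value.
csv_text = """phi,dieth,caff
0.10,8.2,1.9
0.20,4.1,0.8
0.30,2.0,0.3
"""

# genfromtxt accepts a filename or a file-like object.
# skip_header=1 skips the row of column titles.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(data.shape)    # (number of rows, number of columns)
phi = data[:, 0]     # every row, first column
dieth = data[:, 1]   # every row, second column
print("phi =", phi)
```

With a real file, the first argument would be the filename as a string instead of the `io.StringIO` object.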

Once you have made this file, move on to the code below:

In [None]:
# Import a number of extra packages into python. Here are the ones we need today:

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

# Next, read in the .csv datafile that you just made.  Then print it.




# Note the format here; each row is for a different phi value, the first column is the list of phi values, 
# and the other columns are the retention factors for each compound. 

# Now show that you know how to slice out phi, dieth, and caff from the 2-D data array into separate list 
# variables.  Here is an example.  Add the other slices.
phi = data[:,0]


# Write some classy print statements to show that your array slicing worked correctly.  An example:

print ("The values for phi are " + str(phi))



## Fake zeros or "nan"s?
Notice that you have some zeros or "nan" values in the caffeine array.  These are not valid data points -- they are not actually zero, but were rounded to zero by the output limit of the HPLC simulator program, which is set to four digits after the decimal. Furthermore, these zeros will cause major problems in our line fitting later, so we need to remove them from the array.  

Adapt one of the functions in the next code block to clean up the zeros (or nans).  Then write a classy print statement to show yourself that it worked.
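
Here is a small sketch, on made-up numbers, of how the two clean-up approaches behave.  Note that `np.trim_zeros` only strips zeros from the two ends of a 1-D array, while a boolean mask removes them wherever they occur.  The array names are placeholders, not the ones you must use.

```python
import numpy as np

# Made-up retention factors with "fake zeros" at the end.
x = np.array([1.9, 0.8, 0.3, 0.0, 0.0])

# np.trim_zeros strips zeros only from the front and back of a 1-D array.
trimmed = np.trim_zeros(x)

# A boolean mask keeps every element that is not zero, wherever it sits.
masked = x[x != 0]

# The same masking idea handles "nan" entries.
y = np.array([1.9, np.nan, 0.3, np.nan])
no_nan = y[~np.isnan(y)]

print("trimmed:", trimmed)
print("masked: ", masked)
print("no_nan: ", no_nan)
```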

In [None]:
# Here is an example function that removes "nan" values from a numpy array called x.
# x = x[np.logical_not(np.isnan(x))]


# Here is an example function that removes leading and trailing zeros from a 1-D numpy array called x.
# x = np.trim_zeros(x)


# Classy print statement:


Now that you've removed zeros from the "caff" array, you also need to create a special phi array that has only the matching values for each of the retention factors in the caff array.  Use the Numpy delete function to make the appropriate phi_caff array in the next code block.
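
A minimal sketch of keeping an x array in step with a shortened y array, using made-up values and placeholder names.  The mask-based alternative at the end is an assumption about what you might prefer if zeros could show up in the middle of the data.

```python
import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
y = np.array([1.9, 0.8, 0.3, 0.0, 0.0])   # made-up values; the last two are "fake zeros"

y_clean = np.trim_zeros(y)                 # three valid points remain

# Delete every element of x from index 3 onward so the lengths match.
x_short = np.delete(x, np.s_[3:])

# Alternative: build one mask from y and apply it to both arrays.
keep = y != 0
x_masked = x[keep]

print(x_short, y_clean)
print("same length?", len(x_short) == len(y_clean))
```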

In [None]:
# Here is an example delete function that removes all data points from the array "x" starting at index 6 (the 
# seventh element, since Python counts from zero).  Adapt it to create the "phi_caff" array that you want from "phi". 
# x_shortened = np.delete(x,np.s_[6:])


# Now write a print statement to make sure it worked correctly.



## Plotting a standard curve 
Now it's time to plot our data with phi on the x-axis and the retention factors on the y-axis.  Adapt code from previous labs to plot the caff and dieth datasets.


In [None]:
# Paste and adapt code here.



## Linear Fits
Now you can copy and adapt the Python code to calculate slope, intercept, R, p and u_m from each dataset.  However, you will want to differentiate the values calculated in each case by giving them names such as "m_caff" instead of "m" when you list all the variables to be calculated by each linregress function.  (Otherwise the second regression will overwrite the output from the first one...)
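
The naming idea can be sketched on two made-up datasets, here called "a" and "b" (placeholder names).  `stats.linregress` returns five values that can be unpacked in order: slope, intercept, r, p, and the standard error of the slope.

```python
import numpy as np
import scipy.stats as stats

x = np.array([0.1, 0.2, 0.3, 0.4])
y_a = np.array([2.1, 3.9, 6.2, 7.8])    # made-up dataset "a"
y_b = np.array([9.0, 7.1, 4.9, 3.0])    # made-up dataset "b"

# Distinct variable names keep the second fit from overwriting the first.
m_a, b_a, r_a, p_a, u_m_a = stats.linregress(x, y_a)
m_b, b_b, r_b, p_b, u_m_b = stats.linregress(x, y_b)

# R squared is simply the square of the correlation coefficient r.
r2_a = r_a**2
r2_b = r_b**2

print(f"a: y = {m_a:.3f} x + {b_a:.3f}, R^2 = {r2_a:.4f}")
print(f"b: y = {m_b:.3f} x + {b_b:.3f}, R^2 = {r2_b:.4f}")
```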

In [None]:
# Paste and adapt code here.



# Write some code to calculate R_squared:


# Classy output print statements giving the equation of the line and the R2 values:



## Adding the 'best fit line' to your graph

Once again, paste and adapt code to plot the 2 datasets along with the best fit lines.  

In [None]:
# Copy and adapt code here.



Upload your saved graph into the ELN, adding a caption that explains the colors and symbols, and some narrative that describes what the graph is all about.


## Quadratic equation fitting

OK, you're done with the linear fitting -- and hopefully you are dissatisfied with how your data is fit by linear functions.  Let's get Python to try quadratic functions instead!  The polynomial regression 'polyfit' function in the numpy module returns the fitting coefficients, from highest order to the lowest, that result in minimized sum-of-squares error of the y-residuals.  In other words, for a 2nd order polynomial fit, the first number output is the coefficient of the $ x^{2} $ term, the second number is the coefficient of the <i>x</i> term, and the last number in the output is the intercept.  
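
To illustrate the coefficient ordering, this sketch fits a 2nd order polynomial to made-up points that lie exactly on a known parabola, so the output can be checked by eye.  The variable names are placeholders.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x**2 - 3.0 * x + 1.0      # made-up data lying exactly on a parabola

# Degree-2 fit: coefficients come back highest order first.
z2, z1, z0 = np.polyfit(x, y, 2)

# Evaluate the fitted parabola on a fine grid, e.g. to draw a smooth curve.
x_fine = np.linspace(x.min(), x.max(), 50)
y_fit = z2 * x_fine**2 + z1 * x_fine + z0

print(f"y = {z2:.3f} x^2 + {z1:.3f} x + {z0:.3f}")
```

Plotting `y_fit` against `x_fine`, rather than against the handful of original x values, is one way to get a smooth curve instead of straight segments between data points.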

Adapt the code below:

In [None]:
# Here is sample code to calculate a 3rd order polynomial fit on x and y data, outputting the four coefficients 
# in order.  Adapt this code to calculate 2nd order polynomial fits for your data, being careful to differentiate 
# the output variables for caffeine and diethylformamide.

# z3, z2, z1, z0 = np.polyfit(x, y, 3)




## Graphing a polynomial fit
Now that you have your polynomial equations, it's time to add them to the graph and see how close they fit the data.  Reuse the code from the linear fit graphing code block, but update the prediction lines (and prediction variable names).


In [None]:
# Graph the data with polynomial fits included



Save and upload this graph into LabArchives.  <i>Record observations on the quality of the quadratic fit, relative to the linear fit, in your ELN as narrative to accompany the graph.</i>  Note that retention factors cannot be negative for a model to be functional.  <i>Which of your models (fits) so far are predicting negative retention factors at some values of phi?</i>



## Exponential fitting

While it is possible to take the same datasets and try fitting exponential functions directly, it is easier to transform the data by taking the natural logarithm of all the y values.  If the data are exponential, graphing ln <i>y</i> vs. <i>x</i> will linearize them, and in this case we can use our knowledge of linear regression error analysis again.  So, let's try this next!
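
The reasoning behind the transformation: if y = A·exp(m·x), then ln y = m·x + ln A, which is a straight line with slope m and intercept ln A.  The sketch below checks this on made-up data generated from a known exponential; the names and numbers are illustrative only.

```python
import numpy as np
import scipy.stats as stats

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
y = 20.0 * np.exp(-6.0 * x)          # made-up exponential data: y = A * exp(m * x)

ln_y = np.log(y)                     # np.log is the natural logarithm

# A straight-line fit to (x, ln y) recovers m as the slope and ln A as the intercept.
m_ln, b_ln, r_ln, p_ln, u_m_ln = stats.linregress(x, ln_y)

print(f"ln y = {m_ln:.3f} x + {b_ln:.3f}")
print(f"A = exp(intercept) = {np.exp(b_ln):.3f}")
```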

In the code block below, use the natural logarithm function "np.log" on the numpy arrays "caff", and "dieth".


In [None]:
# Transform your retention factor lists into lists of ln(k) values.


# Now paste and adapt your code to create linear least squares fits to the transformed data.


# Write some code to calculate R_squared:


# Classy output print statements giving the equation of the line and the R2 values:




Finally, graph the transformed data and the linear equations fit to it.  This time, get all the data to appear on <i>one</i> graph with labeled axes.  Make sure the two datasets are different colors and symbols.  (To plot multiple datasets on one graph, just repeat the plt.plot line of code for each additional set of data after the first one.)


In [None]:
# Graphing code


<i>Record your observations once again on the quality of the exponential fit relative to the others, and upload the summary graph to the ELN.  
Which of these models can you use to predict retention factors at other phi values?  In other words, what is the mathematical relationship between k' and phi?</i>  

For your best model (linear, quadratic, or exponential), record in your ELN the fit parameters (slope and intercept or Z2, Z1, Z0) calculated by Python for each compound.

# Part 2:  Other compounds

Now that you know the model, you can easily calculate the fit coefficients of the retention factor function for other compounds based on <i>only two data points each</i> from the HPLC simulator.  Since the amount of data is very small, we'll just enter in the twelve numbers by hand.  But don't forget to use the correct model!  Use the code block below:

Alternatively, if you don't want to mess with all these variable names, you can decide to make an output array called "regressions" and design the data structure yourself, similar to what we did in Module 3.
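
One possible shape for such a data structure is sketched here, as an assumption rather than the required design: a dictionary that maps each compound name to its (slope, intercept) pair.  The compound labels and numbers are made up, and the y values stand for whatever (possibly transformed) quantity your model fits.

```python
import numpy as np

x_two = np.array([0.2, 0.6])              # two made-up x values
y_by_name = {                             # hypothetical labels and made-up y values
    "compound_A": np.array([1.5, -0.5]),
    "compound_B": np.array([2.0, 0.4]),
}

regressions = {}                          # name -> (slope, intercept)
for name, y in y_by_name.items():
    # With exactly two points, the "fit" is just the line through them.
    slope = (y[1] - y[0]) / (x_two[1] - x_two[0])
    intercept = y[0] - slope * x_two[0]
    regressions[name] = (slope, intercept)

for name, (m, b) in regressions.items():
    print(f"{name}: y = {m:.3f} x + {b:.3f}")
```

Because two points define a line exactly, there is no scatter and therefore no meaningful R2 value.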

In [None]:
# Calculate slope and intercept from two datapoints.
# Fill in the two values into each numpy array below.  Don't forget to use the correct model!  

phi_two = np.array([])
nbenz = np.array([])
benzalc = np.array([])
phenol = np.array([])
acetanil = np.array([])
mepar = np.array([])

# Transform data if necessary


# Find the regression for each compound, according to the correct model



# Classy output print statements giving the equation of the line (but there are no R2 values):



As usual, to check our work let's graph the equations along with the data to make sure they match.  Use the colors "r", "y", "g", "b", and "m" for the five compounds.

In [None]:
# Graph the data and the fits



Upload this plot to the ELN, along with the fit coefficients (m, b, or z2,z1,z0) for each compound.  You now have the information you need to predict retention time of seven compounds at <i>any</i> mixing ratio of water and acetonitrile!  

## Building the model
At this point, we want to construct a large data array called "modlnk" that contains all of the modeled ln<i>k</i>' values (the natural logs of the retention factors) for all seven compounds at all of the mobile phase mixtures (phi).  We will use <b>for</b> loops to do this.

Our data structure for this large array will be to have a column for each compound, and a row for each phi value.
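
A stripped-down sketch of this row-per-x, column-per-compound layout, using two hypothetical compounds with made-up slopes and intercepts.  Generating the x values from integers (i / 100) is a design choice assumed here to avoid the round-off that builds up when 0.01 is added repeatedly.

```python
# Hypothetical slopes and intercepts for two compounds (made-up numbers).
m_a, b_a = -5.0, 2.5
m_b, b_b = -4.0, 2.8

xlist = []
for i in range(10, 21):          # integers 10 through 20 inclusive
    xlist.append(i / 100)        # 0.10, 0.11, ..., 0.20

table = []
for x in xlist:
    newrow = [m_a * x + b_a, m_b * x + b_b]   # one column per compound
    table.append(newrow)                      # one row per x value

print(len(xlist), "x values;", len(table), "rows")
print(table[0])
```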

In [None]:
# According to the lab handout, we want to calculate and store values for lnk for all phi values from 0.10 to 0.80 
# with a step size of 0.01.  
# First, let's write a FOR loop to generate this list of phi values and call it "philist".  Start by declaring
# philist as an empty array:


# We'll just go from 0.10 to 0.20 for practice at first.  In the FOR loop, use "i" in an equation to generate the 
# phi values you want, then append them to philist using code like the following:
# philist.append(value)



# Did it work?  

# Now we'll use a second FOR loop that will step through each phi value in philist, calculate lnk for all compounds,
# and add these results to the 2-D array called "modlnk".  Each row of modlnk will then line up with the matching
# phi value in the 1-D array "philist" that you made above.  In the code below,
# make sure the slope and intercept variable names match the ones you've calculated in earlier code blocks, then run
# this code.   
# First declare modlnk as an empty array:
modlnk = []

# Now calculate lnk for all seven compounds and store them in the modlnk array.  Note efficient FOR statement.
for i in philist:
    modlnk_acetanil = i*(m_ln_acetanil)+b_ln_acetanil
    modlnk_benzalc = i*(m_ln_benzalc)+b_ln_benzalc
    modlnk_caff = i*(m_ln_caff)+b_ln_caff
    modlnk_dieth = 
    modlnk_mepar = 
    modlnk_nbenz = 
    modlnk_phenol = 
    newrow = [modlnk_acetanil, modlnk_benzalc, modlnk_caff, modlnk_dieth, modlnk_mepar, modlnk_nbenz, modlnk_phenol]
    modlnk.append (newrow)

print ('modlnk:')
print (modlnk) 
print ('philist:')
print (philist)

# If the output looks good, change the first FOR loop limit to do all the calculations between phi = 0.1 and 0.8.

If your code was successful, you should see a row-by-row printout of the array for phi values between 0.1 and 0.2, followed by a list of all the phi values.  If you see this, <b>you are now ready to modify the first FOR loop to go all the way from 0.10 to 0.80.</b>

This should produce a very LONG list of numbers -- all of our modeled ln<i>k</i>' values for all of the compounds! 

## Sorting each row
Our next task is that we need to find the difference between the two closest-together numbers in every row.  This is a complex task, but we can break it down into several small steps to automate it.  Here is the strategy we will use:  first, we will make a new 2-D array with the ln<i>k</i> values within each row sorted from lowest to highest.  Second, we'll calculate the difference between adjacent numbers in the sorted array.  Third, we'll sort the differences array, so that the smallest difference will be listed first in each row.  At that point it will be easy to slice out the first column of the array, which will contain the difference between the two closest-together numbers in each row in our original array.  Problem solved!
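
As a compact cross-check of this four-step strategy (not the loop-based approach used in the rest of this notebook), the same steps can be sketched with NumPy functions on a tiny made-up array.

```python
import numpy as np

# Made-up 2-D array: each row is one condition, each column one compound.
a = np.array([[3.0, 1.0, 2.5],
              [0.2, 1.4, 1.0]])

a_sorted = np.sort(a, axis=1)          # step 1: sort within each row (returns a copy)
diffs = np.diff(a_sorted, axis=1)      # step 2: differences between adjacent entries
diffs_sorted = np.sort(diffs, axis=1)  # step 3: smallest difference first in each row
closest = diffs_sorted[:, 0]           # step 4: slice out the first column

print(closest)
```

Here `np.sort` leaves the original array `a` untouched, which sidesteps the overwriting issue discussed further down.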

In order to execute the first step, we will use a ready-made <b>subroutine</b>.  In the code below, first comes the <b>def</b>inition of a subroutine called "sortRowWise" which does the sorting in a FOR loop, and then prints out the resulting sorted 2-D array in a second pair of <b>nested FOR loops</b>.  The subroutine ends at the line "return 0" (as does the indentation).  Below the subroutine is the <b>driver code</b> that sets up an array of test data and then passes it to the subroutine in a <b>call statement</b> with the syntax <i>SubroutineName(variable1,variable2 ...)</i>.  The variables sent to the subroutine can be numbers or arrays of any dimension.

### How subroutines work
When Python runs this code block, it reads each subroutine definition and stores it, but it does not execute the code inside a subroutine until that subroutine is called.  So the first code that actually runs is the line beginning with "m = ".  The next line calls the "sortRowWise" subroutine and sends it the array "m".  Python then finds where this subroutine is defined and runs it, line by line, until it hits "return 0," the end of the subroutine.  (The subroutine can call the input variable(s) whatever it wants -- the name does not have to match the call statement ("m"), although in this case it does.)  Once the subroutine ends, Python jumps back to the call statement with the array "m" that has been modified by the subroutine, and continues through the program from there.  Subroutines are convenient ways to incorporate common tasks -- and other people's code -- into your Python code.
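
The order of events can be seen in a tiny sketch with a hypothetical subroutine called `double_in_place`, which uses its own name ("values") for the list it receives and changes that list in place.

```python
def double_in_place(values):
    # "values" is the subroutine's own name for whatever list the caller sends.
    for i in range(len(values)):
        values[i] = values[i] * 2
    return 0

print("The definition has been stored, but its body has not run yet.")

numbers = [1, 2, 3]
status = double_in_place(numbers)   # call statement: the body runs now

print(status)    # the value handed back by "return"
print(numbers)   # the caller's list was changed by the subroutine
```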

Check out the following code, then run it.  Dr. D has added a lot of comments to help you understand how it works.  Note how the print statement in the subroutine gives you a different look than the print statement in the last line of the program:

In [None]:
# Here is a Python3 subroutine called "sortRowWise" to sort values in each row from smallest to largest, 
# in a 2D matrix.  Dr. D took this code from the computer science portal GeeksforGeeks.org. 

# Defines a subroutine called "sortRowWise" which you can run on the array of your choice.  In the subroutine,
# it will use the name "m" for the array, but you can send it an array with another name in the call statement.  
def sortRowWise(m): 
      
    # One by one sort individual rows. 
    for i in range(len(m)):   # len(m) is the number of rows in the matrix.
        m[i].sort()           # Uses a sort function!
          
    # printing the sorted matrix 
    for i in range(len(m)): 
        for j in range(len(m[i])):   # len(m[i]) is the number of numbers (columns) represented in row m[i]
            print(m[i][j], end=" ")  # i is the row and j is the column
        print() 
          
    return 0  #Tells Python the function is over, go back to the call statement line.
  

# Driver code to test out how it works.
# Make a sample array called "m"
m = [[9, 8, 7, 1 ],[7, 3, 0, 2],[9, 5, 3, 2 ],[ 6, 3, 1, 2]] 
  
# This line calls the function and sends it the array "m".    
sortRowWise(m) 

# This code is contributed by shubhamsingh10  
print(m)

OMG, it worked on the first try!  Thanks, Shubhamsingh10, you're the best!  

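Now it's your turn.  Write a call statement that tells the sortRowWise function to operate on your massive matrix of ln<i>k</i>' values.  But before you do that, first copy the matrix into a fresh matrix with a new name -- otherwise the subroutine will overwrite your original matrix.  Be aware that a plain assignment such as "modlnksort = modlnk" only creates a second name for the same matrix; to get an independent copy of a list of lists, use something like "copy.deepcopy(modlnk)" from Python's built-in copy module.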
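
Why the kind of copy matters can be sketched with a small made-up list of lists.  The variable names are placeholders.

```python
import copy

original = [[3, 1, 2], [9, 7, 8]]

alias = original                    # NOT a copy: both names refer to the same lists
shallow = list(original)            # new outer list, but the same inner row lists
deep = copy.deepcopy(original)      # fully independent copy, rows included

for row in deep:
    row.sort()                      # sorting the deep copy leaves the original alone

unchanged = (original == [[3, 1, 2], [9, 7, 8]])
print("original untouched after sorting the deep copy?", unchanged)

for row in shallow:
    row.sort()                      # the rows are shared, so the original changes too

print("original after sorting the shallow copy:", original)
```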

In [None]:
def sortRowWise(m): 
      
    # One by one sort individual rows. 
    for i in range(len(m)):  
        m[i].sort()  
          
    # printing the sorted matrix 
    for i in range(len(m)): 
        for j in range(len(m[i])): 
            print(m[i][j], end=" ") 
        print() 
          
    return 0
  

# Add your own three lines of driver code.  
# Line 1:  copy your modlnk array into a new array called "modlnksort" (a true copy, not just a second name for the 
# same array) so the original won't get overwritten by the subroutine


# Line 2:  call the function and send it the "modlnksort" array


# Line 3:  Re-print the results, to make sure the subroutine has passed the correct data back.  



Look over your data.  Each row should now be ordered from lowest (most negative) to highest (most positive).  At this point, you should feel really happy that you're not trying to do this in Excel.

## Calculating the difference between adjacent numbers in each row
Now that you feel really happy that the numbers in each row are in order, we can use Python to calculate the differences between each number and its row-neighbors, and store these results in a new array.  Here goes, with a subroutine written by Dr. D!  Take a look at how it works: an outer <b>for</b> loop steps through each row (i), and nested within it is another <b>for</b> loop that moves across the row through the columns (j) to calculate the differences.

Notice that this subroutine works with two arrays, an input array (first) and an output array (second).  This is another way to solve the problem of overwriting data, but in this case it is necessary because the input and output arrays have different dimensions.

Once you've looked over the subroutine code, write your own driver code again.

In [None]:
# Subroutine to calculate the difference between all adjacent numbers in each row of a 2-D array

def RowDiffs(m, diff):    #This subroutine must be called with an input array (listed first) and an empty output array (2nd)

    # Take the differences of all adjacent numbers in each individual row. 
    for i in range(len(m)):          # For loop to step through rows
        diffrow = []                 # Set up empty array to hold the differences between numbers in this row
        for j in range(len(m[i])-1): # For loop to go down the row.  If a row holds 7 numbers, there will be 6 differences
            difftemp = (m[i][j+1] - m[i][j])   # Calculate the difference between 2 adjacent numbers
            # print(difftemp)
            diffrow.append(difftemp) # Append the difference into the row array
        # print(diffrow)
        diff.append(diffrow)         # Append the difference row to a 2-D array called "diff"
    
    # printing the matrix of differences.     
    for i in range(len(diff)): 
        for j in range(len(diff[i])): 
            print(diff[i][j], end=" ") 
        print() 
    return 0

# This code is contributed by ddehaan.  Learning so much about coding by teaching this class!  :-)  

# Add your own three lines of driver code.  
# Line 1:  declare an empty array called "diff"


# Line 2:  call the RowDiffs subroutine, passing two arrays (input array "modlnksort" first, empty output array 
# "diff" second) 


# Line 3:  print out the output array, so that you can check the data transfer from the subroutine went correctly.



The "diff" array now contains all the differences between the ln<i>k</i>' values of adjacent peaks at a given acetonitrile / water mixture.  We are interested in the two closest peaks under each condition, so we need to extract the smallest number in each row.  Use the <b>sortRowWise</b> subroutine to put the numbers in each row of "diff" in order, and place these ordered results in a new output array called "diffsort".  (Because sortRowWise sorts the array it is given in place, copy "diff" into "diffsort" first and then sort "diffsort".)

In [None]:
# Sorting the data again to put the smallest value in each row in column 0

# Subroutine
 

# Driver code



Did you print out "diffsort" to make sure your subroutine call worked correctly?  Good!  

Our last step is to slice out the first column of data from the array, because the first column now has the differences in ln<i>k</i>' between the two closest peaks at each condition.  You now know how to do this -- but you can also borrow and adapt code from the first code block in this Jupyter notebook to help you!  Note that you may have to convert your array to a Numpy array ("newarray = np.array(oldarray)") for slicing to work properly.
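
A short sketch of why the conversion is needed, using a made-up list of lists with placeholder names: the two-index slice works on a Numpy array but raises a TypeError on a plain Python list.

```python
import numpy as np

nested = [[0.4, 0.8], [0.1, 0.9], [0.3, 0.5]]   # made-up list of lists

try:
    nested[:, 0]                 # plain lists do not accept a [rows, columns] index
except TypeError as err:
    print("plain list:", err)

arr = np.array(nested)           # convert so that 2-D slicing works
first_col = arr[:, 0]            # every row, column 0

print(arr.shape)
print(first_col)
```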

As always, check that your code is working properly by examining the output of a print statement.

In [None]:
# Convert to Numpy array and slice out first column



At this point, you have two 1-D arrays:  "philist" from several code blocks back, which contains the list of all the phi values from 0.1 to 0.8 in steps of 0.01; and the array that you just sliced out in the previous code block, which should be the corresponding closest-peak "delta ln<i>k</i>" values.  These arrays should be the same length!  

Your last Python task is to graph these two arrays against each other to make a window plot -- the phi values should be on the x-axis.  Use the code block below to graph the data as a red line, and be sure to include axis labels as usual.  
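
For reference, a bare-bones line plot with labeled axes might look like the sketch below.  The y values are a made-up stand-in for your minimum differences, and the axis wording and the filename in the final comment are assumptions to adapt.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.1, 0.8, 71)            # 0.10 to 0.80 in steps of 0.01
y = 0.5 * np.abs(np.sin(8 * x))          # made-up stand-in for the minimum differences

fig = plt.figure()
plt.plot(x, y, "r-")                     # "r-" draws a red line
plt.xlabel("phi")
plt.ylabel("minimum difference (made-up data)")
plt.title("Window plot sketch")
# plt.savefig("window_plot.png") would write the figure to a file (hypothetical filename)
```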

In [None]:
# Make and save "window plot"



## Extra credit option
In order to interpret your graph without having to eyeball phi values off of it, it would be convenient to make a long table that shows the paired "philist" and minimum difference values so that you could exactly locate the peaks.  Remember how the print statements in the subroutine for loop came out so neatly?  For extra credit, write (or adapt) a for loop that will print out the paired values, one pair per line.  Include labels at the top of the output.
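
One way to pair up two lists in a for loop is Python's built-in `zip`, sketched here on made-up values with placeholder names.  The format codes (such as `6.2f`) set the column width and the number of decimal places.

```python
xs = [0.10, 0.11, 0.12]          # made-up x values
ys = [0.532, 0.498, 0.467]       # made-up paired values

print(f"{'x':>6} {'y':>8}")      # column labels, right-aligned to match the numbers

lines = []
for x, y in zip(xs, ys):         # zip walks through both lists in step
    line = f"{x:6.2f} {y:8.3f}"
    lines.append(line)
    print(line)
```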

In [None]:
# Printing a table of values



          

Wow, you made it!


## Submission Instructions
In your ELN, upload each of the graphs you made in this notebook into the Expt 7 R&A section, and explain what each graph is telling you.   
Save this notebook with your name in the title and upload it, too!  There are a few questions to answer in the ELN, and you will use the HPLC simulator one more time to test the predictions of the chromatography model / window plot you just made in this Jupyter notebook.