# Practice Data Analysis in Python


Welcome to using Python for data science! This tutorial is designed to help you get used to doing data analysis in Python. The format of this tutorial is the same as that of your coding assignments/modules.
This notebook loads data from simulated UV-Vis absorbance spectra of a dye molecule at different concentrations. You will calculate the molar absorption coefficient of the dye.

This notebook is split into 3 parts:
- A. Plot the dye spectra
- B. Find the peak absorbance
- C. Calculate the molar absorption coefficient and generate a Beer's law plot.

This notebook is also **interactive**, and as such there are lines where you must write the code yourself to complete the tasks. Sections where you must write your own code are numbered and labelled with the heading:

## Task #0:

Detailed instructions for each coding task are written below its heading. These tasks are **in addition** to adding appropriate filenames where indicated.

There are **eight** tasks in total in this notebook.

Please make sure to read **all** of the instructions and comments **very** carefully. This is essential for making sure your code runs smoothly and error-free.

***Happy coding!***

## Part A: Plot the dye spectra

You will notice that there is a CSV file in this folder labelled "dye_absorbance_spectrum.csv". We'll first load this file into Python.

### Load and view the data in Python using `pandas`

The data is loaded using a Python module known as `pandas`, which stores the data in a *dataframe*. In the blocks below, we will go through how to load the data and how to view it.

In [None]:
# First we need to load the python modules we need
# This is just pandas for now
import pandas as pd

# Now we need to load the data using the pandas read_csv function
# This function reads the data from the file and loads it into a pandas dataframe
# This is saved in a variable called dye_absorbance_spectra
dye_absorbance_spectra = pd.read_csv("dye_absorbance_spectrum.csv", delimiter=",")

# Now let's take a look at the data
# but only the first 5 rows
print(dye_absorbance_spectra.head())

As we can see, the dataframe shows the same information you would see if you opened the file in Excel. Since it also preserves the column headings, we can use those to look at subsets of the data and call each column separately if necessary. This is important since we will need to do calculations using the absorbance values.

The syntax to do this is as follows:
1. First, write the name of the variable that represents the *dataframe*
2. Next add square brackets
3. Add quotation marks inside the square brackets
4. Inside the quotes, enter the *exact* name of the column

So, for example, if I wanted to view just the data for **Dye Sample 1** I would use this code:

In [None]:
# Viewing only the first sample
# We have to make sure that the column name is exactly the same as in the CSV file
print(dye_absorbance_spectra["Dye Sample 1"])

We can also view multiple columns at the same time by putting the column names, separated by commas, inside an extra set of square brackets:

In [None]:
# Viewing both the wavelength and the first sample
# We have to make sure that the column names are exactly the same as in the CSV file
# Here we added an extra set of square brackets to specify that we want to view multiple columns
# And separated the column names by a comma
print(dye_absorbance_spectra[["Wavelength (nm)", "Dye Sample 1"]])
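
To make the difference between single and double brackets concrete, here is a minimal sketch on a tiny, made-up table. The dataframe `toy` and its values are hypothetical and exist only for this illustration; they are not the real dye data.

```python
import pandas as pd

# Hypothetical miniature table; the values are made up for illustration
toy = pd.DataFrame({
    "Wavelength (nm)": [400, 410, 420],
    "Dye Sample 1": [0.10, 0.15, 0.22],
    "Dye Sample 2": [0.20, 0.30, 0.44],
})

# Single brackets with one name return a one-dimensional Series
one_column = toy["Dye Sample 1"]

# Double brackets (a list of names) return a smaller DataFrame
two_columns = toy[["Wavelength (nm)", "Dye Sample 1"]]

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
print(two_columns.shape)           # (3, 2)
```

The inner pair of brackets is simply a Python list of column names, which is why the names inside it are separated by commas.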

## Task #1:
For practice, in the space below, fill in the code to view the *Wavelength (nm)* and *Dye Sample 4* columns by adding the column names *in between* the quotes.

In [None]:
# Viewing only the wavelength and the fourth sample 
# Fill in the correct column names in between the quotes
print(dye_absorbance_spectra[["", ""]])

### Creating variables for each column

Now that we're a little more familiar with dataframes, we can move on to doing calculations with the loaded data. However, before doing this, it is very helpful to store the columns in separate variables since the syntax for calling a variable is much shorter.

We will also convert the data into an *array* using the module `numpy`. This will make calculations easier. Think of arrays as vectors.
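
As a rough illustration of why arrays make calculations easier, the sketch below applies arithmetic to a whole array at once. The array `readings` and the baseline value are made up for this example.

```python
import numpy as np

# Made-up absorbance readings, for illustration only
readings = np.array([0.10, 0.20, 0.40])

# Arithmetic applies to every element at once, with no loop needed
doubled = readings * 2
shifted = readings - 0.10   # e.g. subtracting a hypothetical baseline

print(doubled)
print(shifted)
print(readings.max())
```

This element-by-element behaviour is what "think of arrays as vectors" means in practice.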

## Task #2

**Write in the correct column headings in the empty quotes below**

In [None]:
# First let's import numpy
import numpy as np

# We'll store the columns as individual variables
# We will also convert the data to numpy arrays
# I will do the first two for you, you can do the rest as practice
wavelength = np.array(dye_absorbance_spectra["Wavelength (nm)"])
sample_1 = np.array(dye_absorbance_spectra["Dye Sample 1"])

# Now you do the rest
sample_2 = np.array(dye_absorbance_spectra[""])
sample_3 = np.array(dye_absorbance_spectra[""])
sample_4 = np.array(dye_absorbance_spectra[""])
sample_5 = np.array(dye_absorbance_spectra[""])

### Plotting using `matplotlib`

To create plots in Python, we will be using a library called `matplotlib`. This is a very extensive library that can do a lot of powerful tasks, but we will only use some of its features for this course.

#### Storing each measurement as a variable
Since all the measurements share the same x-axis, we can define a variable that stores the wavelength column, so we don't have to call it every time using the `pandas` notation.

#### Creating a plot canvas
The first thing we want to do when plotting is to make sure our plot is well formatted. To do this, we will create a *canvas*. Think of this like hitting the **scatter plot** button in Excel to create an empty plot. We'll then label our axes.
  
When you run this cell, you should see an empty plot with nicely formatted axes.

In [None]:
# Importing matplotlib for plotting
import matplotlib.pyplot as plt

# Create a canvas for the figure that's 8 inches wide and 6 inches tall
plt.figure(figsize=(8, 6))

# This changes the fontsize of the ticks on the axis
plt.tick_params(labelsize=14)

# This labels the axis and also changes their fontsize
plt.xlabel("Wavelength (nm)", fontsize=16)
plt.ylabel("Absorbance (a.u.)", fontsize=16)

# This shows the plot
plt.show()

#### Adding data to the plot
This is done by calling `plt.plot` and having the x and y data in parentheses. Therefore, the general format for adding data to a plot looks like this:  
`plt.plot(x, y)`

For example, this is what the plot for just the wavelength and the first dye sample would look like:

In [None]:
# Create a canvas for the figure that's 8 inches wide and 6 inches tall
plt.figure(figsize=(8, 6))

# An example plot of the wavelength against the absorbance of the first sample
plt.plot(wavelength, sample_1, label="Dye Sample 1")

# This labels the axis and also changes their fontsize
plt.xlabel("Wavelength (nm)", fontsize=16)
plt.ylabel("Absorbance (a.u.)", fontsize=16)

# This changes the fontsize of the ticks on the axis
plt.tick_params(labelsize=14)

# Adding the legend
plt.legend()
# Showing the plot
plt.show()
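
Calling `plt.plot` more than once before `plt.show()` draws every curve on the same canvas. The sketch below shows this with two synthetic bell-shaped curves; `x_values`, `curve_a`, and `curve_b` are made-up names and data, not the dye measurements.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic x-values and two made-up bell-shaped curves (not the real dye data)
x_values = np.linspace(400, 700, 301)
curve_a = 1.0 * np.exp(-((x_values - 550) / 30) ** 2)
curve_b = 0.5 * np.exp(-((x_values - 550) / 30) ** 2)

plt.figure(figsize=(8, 6))

# Each plt.plot call adds one more curve to the same canvas
plt.plot(x_values, curve_a, label="Curve A")
plt.plot(x_values, curve_b, label="Curve B")

# Count the curves drawn so far on the current axes
line_count = len(plt.gca().get_lines())

plt.xlabel("x (arbitrary units)", fontsize=16)
plt.ylabel("y (arbitrary units)", fontsize=16)
plt.legend()
plt.show()
```

Each `label=` value becomes one entry in the legend, so giving every curve a distinct label keeps the plot readable.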

#### Your turn to plot

## Task #3

We'll combine both of the previous steps to plot the data we have. 

You will add your data to the graph below the large comment:
 `'''PLOTTING THE DATA'''`

For ease, I've added the code to plot Dye Samples 1-5, **all you need to do is**: replace the "x" and the "y" with appropriate variables and edit text after `label=` with the sample name you would like on the legend of the plot. 

***Note:*** labels *have to* be between quotation marks. I have set up the first plot for you as an example.

In [None]:
# Create a canvas for the figure that's 8 inches wide and 6 inches tall
plt.figure(figsize=(8, 6))

'''PLOTTING THE DATA'''
# Plot of the wavelength against the absorbance of the first sample
# I've done the first one for you
plt.plot(wavelength, sample_1, label="Dye Sample 1")

# Now you do the rest
plt.plot(x, y, label="")
plt.plot(x, y, label="")
plt.plot(x, y, label="")
plt.plot(x, y, label="")

# This changes the fontsize of the ticks on the axis
plt.tick_params(labelsize=14)

# This labels the axis and also changes their fontsize
plt.xlabel("Wavelength (nm)", fontsize=16)
plt.ylabel("Absorbance (a.u.)", fontsize=16)

# Don't need to edit this but I will explain everything I am doing here
# Set limits for the x and y axis
plt.xlim(400, 700)
plt.ylim(-0.05, 1.2)

# Add the plot legend
plt.legend(fontsize=16)

# This makes sure that the plot is formatted correctly
plt.tight_layout()
plt.show()

## Part B: Find the peak absorbance

### Wavelength of peak absorbance
The wavelength of the peak absorbance for this dye sample ($\lambda_{max}$) is:
$$\lambda_{max} = 578\ \text{nm}$$

For this course, $\lambda_{max}$ will always be given to you. So, all we need to do is find the **absorbance** values at a wavelength of 578 nm for all our dye samples.
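
Although $\lambda_{max}$ is provided here, it can help to see how a peak position could be read off a spectrum. The following is a minimal sketch on synthetic data; the names `wavelengths_nm` and `spectrum` are made up for this example and are not variables from this notebook.

```python
import numpy as np

# Synthetic spectrum with a single peak placed at 578 nm (made-up data)
wavelengths_nm = np.arange(400, 701)           # 400, 401, ..., 700
spectrum = np.exp(-((wavelengths_nm - 578) / 25) ** 2)

# np.argmax returns the position of the largest value in the array
peak_index = np.argmax(spectrum)
peak_wavelength = wavelengths_nm[peak_index]

print(peak_wavelength)  # 578
```

Real spectra contain noise, so a peak found this way may shift by a point or two from the accepted value.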

### Finding the absorbance at 578 nm
We will basically be doing Python's version of hitting **CTRL/CMD + F** to find a particular value. The syntax may look complicated at first, but know that the only thing we need to worry about is the number that's highlighted.

As an example, the code below just asks Python to select the row where the wavelength equals 430 nm and show the value in every column.

In [None]:
# This will print all of the column values where the wavelength is 430 nm
print(dye_absorbance_spectra[dye_absorbance_spectra["Wavelength (nm)"] == 430])
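
The one-line lookup above does two things at once. The sketch below splits it into two steps on a tiny, made-up table; `toy`, `mask`, and `row` are hypothetical names used only here.

```python
import pandas as pd

# Hypothetical miniature table; the values are made up for illustration
toy = pd.DataFrame({
    "Wavelength (nm)": [420, 430, 440],
    "Dye Sample 1": [0.22, 0.31, 0.40],
})

# Step 1: the comparison gives one True/False per row
mask = toy["Wavelength (nm)"] == 430
print(mask.tolist())  # [False, True, False]

# Step 2: indexing with the mask keeps only the rows marked True
row = toy[mask]
print(row)
```

Exact comparison with `==` works well when the wavelengths are stored as whole numbers. If a file stored them with decimals, a tolerance-based comparison such as `np.isclose` would likely be the safer choice.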

## Task #4
We need to do the same thing but for $\lambda_{max}$ so edit the line below to reflect that change by replacing the `?` with the appropriate value

In [None]:
# This is the absorbance at lambda_max for the samples
# This code is incomplete, you need to fill in the question mark
lambda_max = dye_absorbance_spectra[dye_absorbance_spectra["Wavelength (nm)"] == ?]
                        
# This will print the absorbance at 578 nm for the samples
print(lambda_max)

## Task #5
#### Store the Absorbance Data in Excel
To make things easier, you can store the absorbance at 578 nm and the concentration of each run in the Excel sheet labelled **"dye_absorptivity_determination.xlsx"** which should be stored in the same folder as this notebook.

For ease, I have added the sample numbers and concentrations in the Excel sheet. You just need to fill in the absorbance values at 578 nm from the result above.

**DO NOT CHANGE ANY OF THE COLUMN HEADINGS. THIS WILL RESULT IN AN ERROR IN THE CODE** 

The procedure is:

1. Download the excel file
2. Enter the data in the respective column
3. Save the file
4. Reupload the file into the same folder. **Do not change the name of the file**
5. Click "overwrite" when the dialog asks.

#### Uploading the Absorbance Data
We will be uploading the data using `pandas`, same as the spectra.

In [None]:
# Upload the calibration data from the excel file and store it as a dataframe
beers_law_data = pd.read_excel("dye_absorptivity_determination.xlsx", engine="openpyxl")

# Storing the concentration and absorbance values in variables
# We are also converting them to numpy arrays for easier calculations later
concentration = np.asarray(beers_law_data["Concentration (uM)"])
absorbance = np.asarray(beers_law_data["Absorbance 578nm"])

# Test if this works
print(concentration, absorbance)

## Part C: Molar Absorptivity and Beer's law plot.
### Fitting the data using a simple linear model
To fit, we'll be using the Python library `scipy.stats` and its linear regression function, `linregress`. This is basically the Python equivalent of **LINEST** in Excel. It will automatically fit our data to a linear model:
$$y = mx + c$$

To actually fit this, all we need to do is to call the function `linregress` which takes in two arguments:
- `x` = the independent variable
- `y` = the dependent variable

We will store the results of the fit as a variable `regression_results`
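
To show what a call to `linregress` returns, here is a minimal sketch using made-up points that lie exactly on a known line. The names `x_demo`, `y_demo`, and `demo_fit` are hypothetical and unrelated to the variables in your experiment.

```python
import numpy as np
from scipy.stats import linregress

# Made-up points that lie exactly on the line y = 2x + 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2.0 * x_demo + 1.0

demo_fit = linregress(x=x_demo, y=y_demo)

# The result bundles the fitted parameters as named attributes
print(demo_fit.slope, demo_fit.intercept)
```

Because the points were generated from $y = 2x + 1$, the fitted slope and intercept should come out very close to 2 and 1.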

## Task #6
Identify the independent and dependent variable in your experiment (i.e. x and y). Next, find the appropriate variables from above which correspond to the independent and dependent variable. 

Finally, *complete* the *right-hand side* of the expressions inside the parentheses: `x=` and `y=` with the variables identified above.

If you completed the task correctly, the cell should print out:

**Regression Successful**

In [None]:
# Let's import the linregress function from scipy
from scipy.stats import linregress

# Now to perform the linear regression
# This is stored in a variable called regression_results

# Fill in the x and y values for the linear regression
regression_results = linregress(x=, y=)

print('Regression Successful')

### Fit results
For ease, we'll store the fitted slope and intercept, their standard errors, and the $r^2$ value as variables, and then print the results.

In [None]:
# Here are the slope and y-intercept
slope = regression_results.slope
intercept = regression_results.intercept

# And their standard errors
slope_err = regression_results.stderr
intercept_err = regression_results.intercept_stderr

# And the r-squared value
r_squared = regression_results.rvalue**2

# Printing the results
print(f'Slope = {slope} ± {slope_err}')
print(f'Intercept = {intercept} ± {intercept_err}')
print(f'r^2 = {r_squared}')
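
For readers curious about where $r^2$ comes from, the sketch below compares `rvalue**2` with the textbook definition, one minus the ratio of the residual to the total sum of squares. The data points are made up for illustration, and all names ending in `_demo` or `_manual` are hypothetical.

```python
import numpy as np
from scipy.stats import linregress

# Made-up, slightly scattered points for illustration
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

demo_fit = linregress(x_demo, y_demo)

# r^2 from the definition: 1 - (residual sum of squares / total sum of squares)
predicted = demo_fit.slope * x_demo + demo_fit.intercept
ss_residual = np.sum((y_demo - predicted) ** 2)
ss_total = np.sum((y_demo - y_demo.mean()) ** 2)
r_squared_manual = 1 - ss_residual / ss_total

print(r_squared_manual, demo_fit.rvalue ** 2)
```

For a straight-line fit with an intercept, the two numbers should agree to within rounding error. A value close to 1 means the line accounts for almost all of the variation in the data.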

### Find the fitted y-values for plotting

Using the fitted slope and intercept, we can visualize the "goodness of fit" by plotting both the fitted data and the experimental data on the same plot. 

To do this, we need to calculate the "fitted y-values" by using the calculated slope and intercept. Additionally, we want to extrapolate the fit, so it spans a longer range than the measurements we took. This will help us better visualize how the two variables relate to each other and how good our fit is.

In [None]:
import matplotlib.pyplot as plt

# We'll first create some equally spaced concentration values so the fit is extended
conc_fit = np.linspace(concentration.min()*0.8, concentration.max()*1.2, 1000)
# Now we'll calculate the y-values for the fit using the equation of a line
absorbance_fit = slope * conc_fit + intercept
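
If `np.linspace` is new to you, the short sketch below shows what it produces and how a line is evaluated on the result. The numbers, including `demo_slope` and `demo_intercept`, are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# np.linspace(start, stop, num) returns num evenly spaced values, endpoints included
grid = np.linspace(0.0, 10.0, 5)
print(grid.tolist())  # [0.0, 2.5, 5.0, 7.5, 10.0]

# Evaluating a line on the grid uses element-wise arithmetic
demo_slope, demo_intercept = 0.5, 1.0   # hypothetical values
line_values = demo_slope * grid + demo_intercept
print(line_values.tolist())  # [1.0, 2.25, 3.5, 4.75, 6.0]
```

Using many points (1000 in the cell above) is more than a straight line strictly needs, but it does no harm and carries over directly to curved fits.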

## Task #7:
More practice plotting! You have three things to do:
- Replace the `"INSERT X-LABEL HERE"` and `"INSERT Y-LABEL HERE"` with the proper labels for the axes. Make sure to add your label **in-between** the quotation marks.
- For the ***Experimental Data***, replace the `x` and `y` with the appropriate variable names.
- Do the same for the ***Fitted Data***

For the last two, replace the `x` and `y` with the variable names you defined above in these lines:
- `experimental_plot, = plt.plot(x, y, "o", markersize=8, label="Experimental Data", color="red")`
- `fitted_plot, = plt.plot(x, y, "-", label="Fitted Data", color="black")`

In [None]:
# Now to plot
# Creating the canvas, same as before 
plt.figure(figsize=(8, 6))

# Changing the size of labels
plt.tick_params(labelsize=14)

# Add the axes labels here by replacing the text in between the quotation marks
plt.xlabel("INSERT X-LABEL HERE", fontsize=16)
plt.ylabel("INSERT Y-LABEL HERE", fontsize=16)

# Plotting our specific data

# Plotting the experimental data as a scatter plot
# The "o" argument tells matplotlib to plot the data points as circles
# The markersize argument changes the size of the circles
# The label argument is used to create a legend
# The color argument changes the color of the data points

# Replace the "x" and "y" with the correct variables: remember, variable names are not in quotes
experimental_plot, = plt.plot(x, y, "o", markersize=8, label="Experimental Data", color="red")

# Replace the "x" and "y" with the correct variables for the fitted line
# Remember these are the fitted concentration and absorbance values we calculated above
# The "-" argument tells matplotlib to plot the data points as a line
# everything else is the same, we do not need a markersize for a line
fitted_plot, = plt.plot(x, y, "-", label="Fitted Data", color="black")

# Additionally, we will add some text to the plot 
# to display the slope, intercept and r-squared value with their errors
# *this is not required syntax for you to know, but it is very useful*
plt.text(0.95, 0.05, 
         f"m = {round(slope, 5)} ± {round(slope_err, 5)}" + 
         f"\nc = {round(intercept, 3)} ± {round(intercept_err, 3)}" +
         f"\n$r^2$ = {r_squared:0.4g}",
         ha="right", va="bottom", transform=plt.gca().transAxes, fontsize=16)

# Now we add the legend
plt.legend(fontsize=16)

# This makes sure that the plot is formatted correctly
plt.tight_layout()

plt.show()

## Task #8

Now that we have the Beer's law plot and the regression, we can find the molar absorptivity. In the code below, there is an *incomplete* variable for molar absorptivity and the error. Fill in the `?` with the correct variables for the molar absorptivity and error, based on the regression results above.


In [None]:
molar_absorptivity = ?  * 1e6 # in units of M^-1 cm^-1, I've converted it for you
molar_absorptivity_err = ? * 1e6 # in units of M^-1 cm^-1

# Printing the molar absorptivity and its error
print(f'Molar Absorptivity = {molar_absorptivity} ± {molar_absorptivity_err} M^-1 cm^-1')
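
The factor of `1e6` in the cell above comes from the concentration column being in µM. As a rough sketch of that unit conversion alone, the example below uses a made-up number; `value_per_micromolar` is a hypothetical name and value, not a result from this notebook.

```python
# Hypothetical value in uM^-1 cm^-1, chosen only to illustrate the conversion
value_per_micromolar = 0.05

# 1 uM = 1e-6 M, so a "per uM" quantity is multiplied by 1e6 to become "per M"
value_per_molar = value_per_micromolar * 1e6

print(f"{value_per_molar:.3g} M^-1 cm^-1")  # 5e+04 M^-1 cm^-1
```

The same factor applies to the error, since an uncertainty carries the same units as the quantity it describes.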