# Big Data Assessment of Measurement Accuracy in Suncreams

Suncream (also known as sunblock or suntan lotion) is an essential cosmetic product that protects skin from the sun's harmful ultraviolet (UV) light.
The active ingredient in many suncream products is titanium dioxide, which absorbs the sun's harmful light, limiting its impact on our skin. 
However, companies that produce suncream frequently fail to report the amount of titanium dioxide in their products. 

Analytical scientists can use experimental measurements to estimate the amount of titanium dioxide in a suncream. 
This approach involves the calibration of instrumentation with samples of a known concentration. 
The calibrated instrumentation can then be used to estimate the titanium dioxide concentration in the unknown sample. 
Instrument calibration is an exercise in big data, and we must interpret our results using statistics and data modelling.

## Getting Started with Jupyter

You may have used a Jupyter Notebook before. 
However, before starting the data analysis, we will quickly refresh some important aspects. 

### Interface Elements

There are a few parts of the Notebook interface that are worth drawing attention to:

1. The **Notebook/file tabs**. Similar to modern web browsers, JupyterLab allows many files to be open simultaneously within a tabbed interface. 
2. The **toolbar** contains buttons for common actions when working with Notebooks; hovering over a button with the cursor will pop up relevant information.
3. The **cell** is the box in which Python code or Markdown can be written, depending on the cell type. 
4. The **cell prompt** indicates whether a cell has been run; a cell that has not been run reads `In [ ]:`, while a cell that has been run reads `In [x]:`, where `x` is a number that indicates the order in which the cells were run. 

![](./images/interface.png)
Some important interface elements in the Jupyter Notebook.

### Cells

Cells make up the body of a Notebook. 
When a new Notebook is opened, it will contain a single empty cell. 
Other cells can be added below the currently selected one by running the cell, pressing the "+" button in the toolbar or by using the keyboard shortcut of pressing "B" (the shortcut "A" can be used to add a cell above the currently selected one).
Cells can be of different types; there are two particularly important ones to be aware of. 

#### Code Cells

A code cell contains Python code that can be executed. 
When the cell is run, the Notebook will display any output from the final line of the cell directly below that cell. 

![](./images/code-cell.png)
An example of a code cell that has been run; the Python code in the cell adds 4 and 3 to return 7.

A cell is run by either clicking on the &#9658; icon in the toolbar or using "Control + Enter" (Windows) or "Command + Enter" (macOS) on the keyboard.
When the cell is run, any output should be printed below the code cell. 

#### Markdown Cells

The type of cell can be changed using the drop-down menu in the toolbar. 
After "Code", the most important type of cell is "Markdown". 
A markdown cell contains text that is formatted using [Markdown](https://www.markdownguide.org), which is a lightweight markup language for writing HTML documents. 
When a markdown cell is "run", the markdown is formatted to HTML, and the formatted text is shown in place of the cell.

![](./images/markdown.png)
A markdown cell that has not been run yet, showing the raw markdown.

![](./images/markdown-rendered.png)
The rendered markdown, with the nicely formatted equations.

#### Active Cells

The currently active cell is indicated by being highlighted.
The presence of the cursor, the blinking `|` symbol, indicates whether the cell is currently in command mode or edit mode. 

##### Command Mode

When in command mode, the cell content cannot be edited but keyboard shortcuts can be used to cut, paste, and move whole cells. 
All of the keyboard shortcuts can be found [online](https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330).

![](./images/command.png)
A Notebook cell in command mode.

##### Edit Mode

From command mode, pressing Enter or clicking in the input text area of a cell will switch the cell to edit mode. 
When in edit mode, code or markdown can be written. 

![](./images/edit-cursor.gif)
A Notebook cell in edit mode.

## How We Will Work

In this workshop, there are some code cells that you should run without modification and some that require editing. 
Those that require editing will have `◽◽◽` symbols in them and will raise an error if run without changing them. 

> **Task 1.1**
> 
> Run the code cell below; you should see the output `Hello World!` below the code cell. 
> 

In [None]:
print('Hello World!')

> **Task 1.2**
> 
> Using the code cell below, calculate the sum `1 + 2 + 3 + 4 + 5`. 
> Note that you need to replace the `◽◽◽` with your own input. 
> 

In [None]:
◽◽◽

> **Task 1.3**
> 
> Complete the function below to find the square root of 25. 
>

In [None]:
import numpy as np

print(np.sqrt(◽◽◽))

## Background 

Analytical scientists use the Beer-Lambert law to measure the concentration of a species in solution. 
The Beer-Lambert law has the following form, 

$$
A = \varepsilon l c,
$$

where $A$ is the absorbance of light of a given wavelength by a solution of concentration $c$ over a path length $l$, and $\varepsilon$ is the molar absorption coefficient of the species of interest.
$\varepsilon$ is not generally known for a given species; therefore, it is necessary to produce what is known as a calibration curve. 

A calibration curve is made by making solutions of known concentration of the given species and measuring the absorbance. 
Since the relation above is linear, plotting this data will give a straight line with a gradient of $\varepsilon l$, where the path length $l$ is a property of the measurement device and is therefore known to the scientist.

With knowledge of $\varepsilon$, it is then possible to rearrange the equation above to give, 

$$
c = \frac{A}{\varepsilon l}.
$$

This rearrangement means that it is possible to measure $A$ for a new solution of unknown concentration and estimate that concentration. 
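
As a minimal sketch of this process, the cell below builds a calibration curve from invented, noise-free data and then inverts the Beer-Lambert law for a new absorbance. 
The value of `true_epsilon`, the standard concentrations, and `new_absorbance` are assumptions chosen purely for illustration; they are not taken from the workshop dataset. 

```python
import numpy as np

# Hypothetical values, chosen only for illustration
true_epsilon = 800.0   # molar absorption coefficient in M^-1 cm^-1 (assumed)
path_length = 1.0      # path length in cm

# Calibration standards of known concentration (mol/L)
concentrations = np.array([0.0002, 0.0004, 0.0006, 0.0008, 0.0010])
# Noise-free absorbances from the Beer-Lambert law, A = epsilon * l * c
absorbances = true_epsilon * path_length * concentrations

# Fit a straight line; the gradient of the line is epsilon * l
gradient, intercept = np.polyfit(concentrations, absorbances, 1)
estimated_epsilon = gradient / path_length

# Rearranged law, c = A / (epsilon * l), applied to a new measurement
new_absorbance = 0.5
estimated_concentration = new_absorbance / (estimated_epsilon * path_length)
print(estimated_epsilon, estimated_concentration)
```

Because the invented data contain no noise, the fit recovers the assumed coefficient; real measurements scatter around the line, which is why the rest of the workshop considers uncertainty. 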

In this workshop, we will use this same process to estimate the concentration of titanium dioxide in a sample of suncream; we will also consider the error bounds on this measurement. 
First, we will read in a large dataset to provide an accurate estimate of the gradient and an uncertainty bound on this value. 

## Data Ingestion and Inspection

In this workshop, we will be trying to interpret data collected by hundreds of analytical scientists worldwide. 
Luckily, the scientists have agreed on a single way to store their data, and someone has compiled it into a single file. 
The Python function below lets us read this file into the computer's memory. 

> **Task 2.1**
> 
> Run the cell below; this will output a table of data. 
> This table is a `pandas.DataFrame` object, with the variable name `data`, that we will work with throughout. 
>

In [None]:
import pandas as pd

data = pd.read_csv('data.csv')
data

> **Task 2.2**
>
> Look at the table above; how many scientists contributed data?
> Run the cell below; what information has been output?
>

In [None]:
data.shape

The cell below will produce a histogram of the absorbance values at 0.000939 mol/L measured by each scientist. 
This concentration is the row of data with the index of `3` (note that there is a 3 at the start of that row in the table above). 
We do not want to include the `Concentration` column in our histogram, so we only want the data from the 2nd column onwards, which in Python is written with the index `1:` (Python counts indices from zero).
Therefore, we use the indexer `iloc[3, 1:]` to access this data and plot its histogram.

In [None]:
ax = data.iloc[3, 1:].hist()
ax.set_xlabel('Absorbance')
ax.set_ylabel('Frequency')
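
To see what `iloc[3, 1:]` selects, here is a small sketch on a toy table. 
The `toy` DataFrame and its values are invented for illustration and only mimic the layout of the workshop data. 

```python
import pandas as pd

# Toy table with invented values, laid out like the workshop data
toy = pd.DataFrame({
    'Concentration': [0.1, 0.2, 0.3, 0.4],
    'Scientist 1': [1.0, 2.0, 3.0, 4.0],
    'Scientist 2': [1.1, 2.1, 3.1, 4.1],
})

# The row at position 3, and every column from position 1 onwards
row = toy.iloc[3, 1:]
print(row)
```

The result is a `pandas.Series` holding one value per scientist, with the `Concentration` column left out. 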

> **Task 2.3**
>
> You will be assigned to a breakout room in your group to compare and discuss the data distribution above.
> 
> Below are some examples of common data distributions; which is the most similar to the data above?
>
![Examples of four common statistical distributions: Normal, log-Normal, Uniform, Chi-squared](./distributions.png)

A summary of the data in the histogram can be generated with the `describe()` method. 
This gives what is known as summary statistics. 

In [None]:
data.iloc[3, 1:].describe()
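
As a rough illustration of two of these summary statistics, the sketch below computes the mean and standard deviation of a handful of invented numbers and compares them with the output of `describe()`. 
The `values` Series is an assumption made for this example only. 

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # invented numbers

mean = values.mean()
# By default, pandas reports the sample standard deviation (ddof=1)
std = values.std()
# The same quantity written out by hand
manual_std = (((values - mean) ** 2).sum() / (len(values) - 1)) ** 0.5

summary = values.describe()
print(summary['mean'], summary['std'])
```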

> **Task 2.4**
>
> In your breakout rooms, discuss your understanding of a data set's *mean* and *standard deviation* and why there is a variation in the measured absorbance between the different scientists. 
>

## Calculation of $\varepsilon$

We discussed above that the calibration curve aims to estimate the value of the molar absorption coefficient, $\varepsilon$.
The estimation of $\varepsilon$ is found by calculating the gradient of the straight line, where the concentration is the *x*-axis and the absorbance is the *y*-axis. 
The gradient is calculated for one scientist's measurements with the code below. 

In [None]:
from scipy.stats import linregress

linregress(data['Concentration'], data['Scientist 1'])

Notice that this returns the `slope` and the `intercept`. 
The slope is the gradient of the line, and since the measurements were made with a 1 cm cuvette, we can substitute a value of $1$ for $l$ and get the value of $\varepsilon$. 
$\varepsilon$ estimated by the first scientist was 774.87 M<sup>-1</sup>cm<sup>-1</sup>.
We noticed above that there was a variation in the measured absorbance; therefore, there will also be a variation in the estimate of $\varepsilon$. 
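
The individual values of a `linregress` result can be read as attributes, which is useful when only the gradient is needed. 
The sketch below uses invented, noise-free data; the coefficient of 800 and the offset of 0.01 are assumptions for illustration and are unrelated to the workshop data. 

```python
import numpy as np
from scipy.stats import linregress

# Invented calibration data: a straight line with a small offset
concentration = np.array([0.0002, 0.0004, 0.0006, 0.0008, 0.0010])
absorbance = 800.0 * concentration + 0.01  # hypothetical gradient and intercept

result = linregress(concentration, absorbance)

# The fitted values are available as attributes of the result
print(result.slope, result.intercept)
```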

> **Task 3.1**
> 
> How do you think we would calculate the variance in the estimates of $\varepsilon$?
> Consider how you can use computers to perform repetitive tasks. 
>

Computers are great at repeating the same process; unlike humans, they don't get bored. 

> **Task 3.2**
>
> Below is a Python loop; complete the code inside the loop (again by changing the `◽◽◽`) to use the `linregress` function (see above) and print the `LinregressResult` for every scientist's data. 
>

In [None]:
for i in range(1, 517):
    print(◽◽◽(data['Concentration'], data[f'Scientist {i}']))

An important tool in data science is linear algebra, which is the backbone of modern machine learning methods. 
Linear algebra is important in machine learning as it enables the manipulation of large amounts of data in computationally efficient ways. 
Below is the code to compute $\varepsilon$ for all datasets using linear algebra.

In [None]:
import numpy as np

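# Column vector of concentrations, one row per concentration
X = np.array([data.iloc[:, 0]]).T
# Matrix of absorbances, one column per scientist
y = np.array(data.iloc[:, 1:])
# Least-squares gradient for every scientist at once (normal equations)
epsilon = pd.Series((np.linalg.inv(X.T @ X) @ X.T @ y)[0])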
epsilon.describe()
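
To unpack the matrix expression, the sketch below applies it to a tiny invented dataset and checks the result against the same calculation written out by hand. 
The arrays `x` and `y` are assumptions for illustration. 
Note that, because `X` holds only the concentrations and no column of ones, this expression fits a straight line through the origin, so its gradients may differ slightly from the `slope` of `linregress`, which also fits an intercept. 

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # invented concentrations
y = np.array([[2.0, 4.0],       # invented absorbances,
              [4.0, 8.0],       # one column per hypothetical scientist
              [6.0, 12.0]])

# Column vector of shape (3, 1)
X = x[:, np.newaxis]
# Normal equations, giving one gradient per column of y
gradients = (np.linalg.inv(X.T @ X) @ X.T @ y)[0]

# With a single column in X, this reduces to sum(x * y) / sum(x * x)
by_hand = (X * y).sum(axis=0) / (x ** 2).sum()
print(gradients, by_hand)
```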

> **Task 3.3**
> 
> Similar to the histogram above, modify the cell below to plot the histogram of estimated epsilon values.
>

In [None]:
ax = epsilon.◽◽◽()
ax.set_xlabel('Epsilon')
ax.set_ylabel('Frequency')

## Estimation of Concentration of Unknown

From the calibrated value of $\varepsilon$, the concentration of some solution can be estimated by rearranging the Beer-Lambert law. 
However, instead of a single value for $\varepsilon$, we have a distribution of values.
Therefore, we can estimate a range of concentrations for a single absorbance. 

> **Task 4.1**
>
> In the cell below, use the rearranged Beer-Lambert law to calculate the distribution of concentration values for a measured absorbance of 0.42647.
> Store this distribution as the variable `new_concentration`.
>

In [None]:
◽◽◽

> **Task 4.2**
> 
> Use the `describe` method from above to probe the summary statistics of this result in the cell below.
>

In [None]:
◽◽◽

> **Task 4.3**
> 
> Finally, plot the histogram of this distribution in the cell below. 
>

In [None]:
◽◽◽