# Big Data Assessment of Measurement Accuracy in Suncreams

Suncream is an essential cosmetic product that protects our skin from harmful ultraviolet (UV) light emitted by the sun. 
The active ingredient in many suncream products is titanium dioxide, which absorbs the harmful light of the sun, limiting its impact on our skin. 
However, companies that produce suncream frequently fail to report the amount of titanium dioxide in their products. 

Analytical scientists can use experimental measurements to estimate the amount of titanium dioxide in a suncream. 
This involves the calibration of instrumentation with samples of a known concentration. 
The calibrated instrumentation can then be used to estimate the concentration of titanium dioxide in the unknown sample. 
Instrument calibration is an exercise in big data, where we must use statistics and data modelling to interpret our results.

## Getting Started with Jupyter

Previously, you may have interacted with a Jupyter Notebook. 
But before we get started on the data analysis, we will quickly refresh some important aspects. 

### Interface Elements

There are a few parts of the Notebook interface that are worth drawing attention to ({numref}`interface`):

1. The **Notebook/file tabs**. Similar to modern web browsers, JupyterLab allows many files to be open simultaneously within a tabbed interface. 
2. The **toolbar** contains buttons for common actions relating to working with Notebooks; hovering over a button with the cursor will pop up relevant information.
3. The **cell**, in which Python code or Markdown can be written, depending on the cell type. 
4. The **prompt** indicates whether a cell has been run: a cell that has not been run reads `In [ ]:`, while a run cell reads `In [x]:`, where `x` is a number that indicates the order in which the cells were run. 

```{figure} ./images/interface.png
---
name: interface
---
Some important interface elements in the Jupyter Notebook.
```

### Cells

Cells make up the body of a Notebook. 
When a new Notebook is opened, it will contain a single empty cell. 
Other cells can be added below the currently selected one by running the cell, pressing the "+" button in the toolbar or by using the keyboard shortcut of pressing "B" (the shortcut "A" can be used to add a cell above the currently selected one).
Cells can be of different types; there are two particularly important ones to be aware of. 

#### Code Cells

````{margin}
```{note}
This book will focus on using Python, but Jupyter Notebooks can support other programming languages, such as Julia, R, and Scala.
```
````
A code cell contains Python code that can be executed. 
When the cell is run, the notebook will display any output from the final line of the cell directly below the cell. 

```{figure} ./images/code-cell.png
---
name: code-cell
---
An example of a code cell that has been run. The Python code in the cell performs the addition of 4 and 3, returning 7.
```

A cell is run by either clicking on the &#9658; icon in the toolbar or using "Control + Enter" (Windows) or "Command + Enter" (macOS) on the keyboard.
````{margin}
```{note}
It is **strongly advised** to type code examples that appear in this book. 
Actively typing code gives one's brain more time to think about what it is doing, rather than passively reading it or copying and pasting. 
```
````
````{admonition} Task
:class: important
Nearly every introductory programming resource starts with a ["Hello, World!" exercise](https://en.wikipedia.org/wiki/%22Hello,_World!%22_program), where you make the computer print the phrase "Hello, World!", and this course is no different. 

**Create** a new Notebook, **rename** the file to `Hello-World.ipynb` and into the first code cell **type** the following:
```
print("Hello, World!")
```
````
When the cell is run, the phrase should be printed below the code cell. 

#### Markdown Cells

The type of cell can be changed using the drop-down menu in the toolbar. 
After "Code", the most important type of cell is "Markdown". 
A markdown cell contains text that is formatted using [Markdown](https://www.markdownguide.org), which is a lightweight markup language for writing {term}`HTML` documents. 
When a markdown cell is "run", the markdown is formatted to {term}`HTML`, and the formatted text is shown in place of the cell ({numref}`markdown-rendered`).

````{margin}
```{note}
Equations are supported in markdown when written using [LaTeX](https://www.overleaf.com/learn/latex/Tutorials) syntax; LaTeX is a popular typesetting language.
```
````

```{figure} ./images/markdown.png
---
name: markdown
---
A markdown cell that has not been run yet, showing the raw markdown.
```

```{figure} ./images/markdown-rendered.png
---
name: markdown-rendered
---
The rendered markdown, with the nicely formatted equations.
```

```{admonition} Task
:class: important
**Write** a Markdown cell that describes what you ate for breakfast this morning.
```

#### Active Cells

The currently active cell is indicated by being highlighted.
The presence of the cursor, the blinking `|` symbol, indicates that the cell is in edit mode; when no cursor is shown, the cell is in command mode. 

##### Command Mode

When in command mode, the cell content cannot be edited but keyboard shortcuts can be used to cut, paste, and move whole cells. 
All of the keyboard shortcuts can be found [online](https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330).

```{figure} ./images/command.png
---
name: command
---
A Notebook cell in command mode.
```

##### Edit Mode

From command mode, pressing Enter or clicking in the input text area of a cell will switch the cell to edit mode. 
When in edit mode, code or markdown can be written. 

```{figure} ./images/edit-cursor.gif
---
name: edit-cursor
---
A Notebook cell in edit mode.
```

## How We Will Work

In this workshop, there are some code cells that you should run without modification and some that require editing. 
Those that require editing will have `◽◽◽` symbols in them and will raise an error if run without changing them. 

> **Task 1.1**
> 
> Run the code cell below; you should see the output `Hello World!` below the code cell. 
> 

In [None]:
print('Hello World!')


> **Task 1.2**
> 
> In the code cell below, calculate `1 + 2 + 3 + 4 + 5`. 
> 

In [None]:
# Delete this text and add your code. 

> **Task 1.3**
> 
> Use the function below to find the square root of 25. 
> Note that you need to replace the `◽◽◽` with your own input. 

In [5]:
import numpy as np

print(np.sqrt(◽◽◽))

## Background 

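The Beer-Lambert law relates the absorbance, $A$, of a solution to the concentration, $c$, of the absorbing species, the path length, $l$, that the light travels through the sample, and the molar absorption coefficient, $\varepsilon$:
$$
A = \varepsilon c l
$$
A plot of absorbance against concentration for samples of known concentration (a calibration curve) is therefore a straight line with a gradient of $\varepsilon l$.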
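
As a rough numerical illustration of the Beer-Lambert law, $A = \varepsilon c l$, the sketch below computes the absorbance expected for a given concentration and then inverts the relationship to recover that concentration. 
The function names and the values of `epsilon`, `concentration`, and `path_length` are made up for illustration and are not taken from the workshop data. 

```python
def absorbance(epsilon, concentration, path_length=1.0):
    """Beer-Lambert law: A = epsilon * c * l."""
    return epsilon * concentration * path_length


def concentration_from_absorbance(A, epsilon, path_length=1.0):
    """Rearranged Beer-Lambert law: c = A / (epsilon * l)."""
    return A / (epsilon * path_length)


# Hypothetical values, chosen only for illustration.
epsilon = 800.0          # molar absorption coefficient / M^-1 cm^-1
concentration = 0.001    # concentration / M
path_length = 1.0        # cuvette path length / cm

A = absorbance(epsilon, concentration, path_length)
print(A)
print(concentration_from_absorbance(A, epsilon, path_length))
```

Doubling either the concentration or the path length in this sketch doubles the absorbance, which is why a calibration curve of absorbance against concentration is expected to be a straight line.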

## Data Ingestion and Inspection

In this workshop, we will be trying to interpret data collected by hundreds of analytical scientists around the world. 
Luckily, the scientists have agreed on a single way to store their data, and someone has compiled this data into a single file. 
We can read this file into the computer's memory with the Python function below. 

> **Task 2.1**
> 
> Run the cell below; this will output a table of data. 
>

In [1]:
import pandas as pd

data = pd.read_csv('data.csv')
data

| | Concentration | Scientist 1 | Scientist 2 | Scientist 3 | Scientist 4 | Scientist 5 | Scientist 6 | Scientist 7 | Scientist 8 | Scientist 9 | ... | Scientist 507 | Scientist 508 | Scientist 509 | Scientist 510 | Scientist 511 | Scientist 512 | Scientist 513 | Scientist 514 | Scientist 515 | Scientist 516 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -0.010427 | 0.000172 | 0.020222 | -0.000128 | -0.012638 | -0.005316 | -0.006502 | 0.016013 | 0.007789 | ... | -0.003736 | -0.002232 | 0.013459 | 0.014796 | -0.015781 | 0.0122 | -0.000238 | 0.002373 | 0.014419 | -0.000766 |
| 1 | 0.000313 | 0.227803 | 0.244555 | 0.272823 | 0.24356 | 0.218793 | 0.243459 | 0.245485 | 0.233017 | 0.251453 | ... | 0.244756 | 0.243838 | 0.270062 | 0.258457 | 0.262739 | 0.220057 | 0.255971 | 0.26701 | 0.271163 | 0.219794 |
| 2 | 0.000626 | 0.49057 | 0.506207 | 0.455445 | 0.443846 | 0.446346 | 0.456844 | 0.451649 | 0.475799 | 0.48578 | ... | 0.564463 | 0.474359 | 0.444966 | 0.491105 | 0.486606 | 0.504127 | 0.502802 | 0.48487 | 0.475246 | 0.51266 |
| 3 | 0.000939 | 0.735781 | 0.71111 | 0.700538 | 0.736499 | 0.752933 | 0.721121 | 0.72035 | 0.731274 | 0.735381 | ... | 0.7411 | 0.691185 | 0.679709 | 0.780782 | 0.702164 | 0.709524 | 0.716961 | 0.75553 | 0.73606 | 0.734654 |
| 4 | 0.001252 | 0.988789 | 0.951622 | 0.978607 | 0.996031 | 0.943388 | 0.99288 | 1.021223 | 0.979691 | 0.99803 | ... | 0.959817 | 1.014572 | 0.998936 | 1.006795 | 0.942309 | 0.951383 | 0.958462 | 0.961245 | 0.966661 | 0.988449 |
| 5 | 0.001565 | 1.236602 | 1.205545 | 1.211876 | 1.190975 | 1.216797 | 1.204562 | 1.209305 | 1.309807 | 1.273408 | ... | 1.174011 | 1.194791 | 1.181516 | 1.221964 | 1.168727 | 1.198511 | 1.202758 | 1.2533 | 1.251559 | 1.206661 |
| 6 | 0.001878 | 1.414791 | 1.469524 | 1.467979 | 1.391937 | 1.490945 | 1.438907 | 1.49808 | 1.504191 | 1.515837 | ... | 1.43079 | 1.523416 | 1.441522 | 1.45436 | 1.504852 | 1.425782 | 1.40481 | 1.509046 | 1.474033 | 1.431838 |
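
A table read with `pd.read_csv` can also be inspected programmatically rather than by eye. 
The sketch below is a minimal illustration using a tiny, made-up CSV held in a string; the variable names `csv_text` and `toy`, the values, and the assumption that the layout resembles the workshop file are all for illustration only. 

```python
import io

import pandas as pd

# A tiny, made-up CSV: one concentration column followed by
# one column of readings per scientist.
csv_text = """Concentration,Scientist 1,Scientist 2
0.0,0.01,-0.02
0.001,0.80,0.79
0.002,1.61,1.58
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # the column names, in order
print(toy.dtypes)         # the data type of each column
```

The same attributes (`shape`, `columns`, `dtypes`) are available on any DataFrame, including one that is too wide to be displayed in full.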


> **Task 2.2**
>
> Look at the table above. How many scientists contributed data?
>

We will plot a histogram of the measured absorbance at a concentration of 0.000939 mol/L. 
This row of data has an index of `3` (note the 3 at the start of the row in the table above). 
Therefore, to access this data, we use the notation `iloc[3, 1:]`.
The `1:` is because we do not want to include the `Concentration` column. 

In [None]:
ax = data.iloc[3, 1:].hist()
ax.set_xlabel('Absorbance')
ax.set_ylabel('Frequency')
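
To see what the `iloc[3, 1:]` notation selects, it can help to try it on a table small enough to read in full. 
The following is a minimal sketch using a made-up DataFrame called `toy`; the values are invented for illustration and are not workshop measurements. 

```python
import pandas as pd

# A small, made-up table: concentrations plus two columns of readings.
toy = pd.DataFrame({
    'Concentration': [0.0, 0.1, 0.2, 0.3],
    'Scientist 1': [0.01, 0.52, 0.98, 1.49],
    'Scientist 2': [-0.02, 0.49, 1.03, 1.51],
})

# Row with index 3, every column from position 1 onwards
# (position 0 is 'Concentration', so it is left out).
row = toy.iloc[3, 1:]
print(row)

# For comparison, the whole row including the concentration.
print(toy.iloc[3, :])
```

The first number inside the square brackets picks the row and the second picks the columns, both counted from zero.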

> **Task 2.3**
>
> In your group, compare the distribution of the data to the following common distributions. Which distribution is the most similar to the data?
>
![Examples of four common statistical distributions: Normal, log-Normal, Uniform, Chi-squared](./distributions.png)

It is possible to generate a summary of information about this distribution of data using the `describe()` method. 

In [None]:
data.iloc[3, 1:].describe()
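
The entries reported by `describe()` can be reproduced individually, which may help when interpreting them. 
The sketch below uses a short, made-up Series called `readings` (not workshop data) and compares the reported mean and standard deviation with the equivalent NumPy calculations. 

```python
import numpy as np
import pandas as pd

# Made-up absorbance readings, for illustration only.
readings = pd.Series([0.71, 0.74, 0.72, 0.75, 0.73])

summary = readings.describe()
print(summary)

# The same mean and standard deviation computed directly with NumPy.
# pandas reports the sample standard deviation, which corresponds to ddof=1.
mean = np.mean(readings.to_numpy())
std = np.std(readings.to_numpy(), ddof=1)
print(mean, std)
```

Note that `np.std` uses `ddof=0` by default, so without the `ddof=1` argument it would give a slightly smaller value than pandas.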

> **Task 2.4**
>
> In your groups, discuss your understanding of the *mean* and *standard deviation* of a set of data and why there is a variation in the measured absorbance between the different scientists. 

## Calculation of $\varepsilon$

As discussed in the background, the aim of the calibration curve is to compute the value of the molar absorption coefficient, $\varepsilon$. 
We can find $\varepsilon$ as the gradient of the straight line, where the concentration is the *x*-axis and the absorbance is the *y*-axis. 
This can be achieved for a single scientist's measurements with the code below. 

In [None]:
from scipy.stats import linregress

linregress(data['Concentration'], data['Scientist 1'])

Notice that the result includes the `slope` and the `intercept`. 
The slope is the gradient of the line, which by the Beer-Lambert law is equal to $\varepsilon l$. 
Since the measurements were made with a 1 cm cuvette, the value of $\varepsilon$ estimated by the first scientist was 774.87 M<sup>-1</sup>cm<sup>-1</sup>.
We noticed above that there was a variation in the measured absorbance, therefore there will also be a variation in the estimate of $\varepsilon$. 
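
The individual values in the result returned by `linregress` can be accessed as attributes. 
The sketch below shows this on made-up calibration data; the arrays `concentration` and `absorbance` are invented for illustration and are not taken from `data.csv`. 

```python
import numpy as np
from scipy.stats import linregress

# Made-up calibration data lying close to a straight line.
concentration = np.array([0.0, 0.001, 0.002, 0.003, 0.004])
absorbance = np.array([0.01, 0.79, 1.62, 2.38, 3.21])

result = linregress(concentration, absorbance)

# Individual fitted quantities are available as attributes.
print(result.slope)      # gradient of the line
print(result.intercept)  # value of the line at zero concentration
print(result.rvalue)     # correlation coefficient
print(result.stderr)     # standard error of the slope
```

Accessing `result.slope` directly is convenient when only the gradient is needed, for example when storing one value per dataset.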

> **Task 3.1**
> 
> Discuss in your groups, how you would calculate the variance in the estimates of $\varepsilon$. 
> Consider how you can use computers to perform repetitive tasks. 
>

Computers are great at doing the same process over and over; unlike humans, they don't get bored. 

> **Task 3.2**
>
> Below is a Python loop. Complete the code inside the loop (again by changing the `◽◽◽`) to print the `LinregressResult` for every scientist's data. 
>

In [None]:
for i in range(1, 517):
    print(◽◽◽(data['Concentration'], data[f'Scientist {i}']))

An important tool in data science is linear algebra, which is the backbone of modern machine learning methods. 
Below is the code to compute $\varepsilon$ for all of the datasets using linear algebra.

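In [6]:
# Concentrations as a single-column matrix (no intercept term, so each fitted line passes through the origin).
X = np.array([data.iloc[:, 0]]).T
# Absorbances, with one column per scientist.
y = np.array(data.iloc[:, 1:])
# Least-squares gradient for every scientist at once, from the normal equations.
epsilon = pd.Series((np.linalg.inv(X.T @ X) @ X.T @ y)[0])
epsilon.describe()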

count    516.000000
mean     778.610376
std       12.338032
min      744.133037
25%      770.505811
50%      779.269358
75%      786.394712
max      810.489133
dtype: float64
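
The expression `np.linalg.inv(X.T @ X) @ X.T @ y` in the cell above is the normal-equations solution to a least-squares problem. 
Because `X` holds only the concentration column, with no column of ones, each fitted line is forced through the origin, so the gradients can differ slightly from those given by `linregress`, which also fits an intercept. 
The sketch below uses made-up data (the arrays `x` and `y` are invented for illustration) to check the normal-equations result against the closed-form expression for a line through the origin and against NumPy's least-squares solver. 

```python
import numpy as np

# Made-up data: 4 concentrations, 3 sets of absorbance readings (one per column).
x = np.array([0.001, 0.002, 0.003, 0.004])
y = np.array([
    [0.79, 0.81, 0.80],
    [1.61, 1.58, 1.60],
    [2.38, 2.42, 2.41],
    [3.21, 3.19, 3.20],
])

X = x.reshape(-1, 1)  # design matrix with a single column (no intercept)

# Normal equations, as in the cell above.
slopes_normal = (np.linalg.inv(X.T @ X) @ X.T @ y)[0]

# Closed form for a line through the origin: sum(x * y) / sum(x * x).
slopes_closed = (x[:, None] * y).sum(axis=0) / (x * x).sum()

# NumPy's least-squares solver, which avoids forming the inverse explicitly.
slopes_lstsq = np.linalg.lstsq(X, y, rcond=None)[0][0]

print(slopes_normal)
print(slopes_closed)
print(slopes_lstsq)
```

All three approaches should agree to within floating-point precision; the matrix form simply handles every column of `y` in a single operation instead of looping.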

> **Task 3.3**
> 
> Similar to the histogram above, modify the cell below to plot the histogram of the estimated values of epsilon.
>

In [None]:
ax = epsilon.◽◽◽()
ax.set_xlabel('Epsilon')
ax.set_ylabel('Frequency')

## Estimation of Concentration of Unknown

It is common practice, from the calibrated value of $\varepsilon$, to estimate the concentration of some unknown from a measured absorbance. 
However, now, instead of a single value for $\varepsilon$, we have a distribution of values.
Therefore, for a single absorbance, we can estimate a range of concentrations. 

By rearranging the Beer-Lambert law, we can calculate the concentration from the absorbance. 
$$
c = \frac{A}{\varepsilon l}
$$
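
For reference, the rearrangement can be written out step by step, together with a worked number. 
Starting from the Beer-Lambert law, dividing both sides by $\varepsilon l$ isolates the concentration. 
The absorbance of 0.5 used here is a hypothetical value chosen for illustration, combined with the first scientist's estimate of $\varepsilon$ and the 1 cm cuvette mentioned earlier. 
$$
A = \varepsilon c l
\quad\Rightarrow\quad
c = \frac{A}{\varepsilon l}
= \frac{0.5}{774.87\ \mathrm{M^{-1}\,cm^{-1}} \times 1\ \mathrm{cm}}
\approx 6.45 \times 10^{-4}\ \mathrm{M}
$$
The units of the path length cancel, leaving a concentration in M, and a larger value of $\varepsilon$ gives a smaller estimated concentration for the same absorbance. 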

> **Task 4.1**
>
> Calculate the distribution of concentration values for a measured absorbance of 0.42647, in the cell below.
> Store this distribution as the variable `new_concentration`.
>

In [7]:
◽◽◽

> **Task 4.2**
> 
> Use the `describe` method from above to probe the summary statistics of this result in the cell below.
>

In [9]:
◽◽◽

count    516.000000
mean       0.000609
std        0.000010
min        0.000585
25%        0.000603
50%        0.000609
75%        0.000616
max        0.000637
dtype: float64

> **Task 4.3**
> 
> Finally, plot the histogram of this distribution in the cell below. 
>

In [None]:
◽◽◽