# Processing growth data from *Escherichia coli* - the world's most popular model organism

First version: Daniel P Brink, Division of Applied Microbiology, Lund University.

Second version: Jens Uhlig, Division of Chemical Physics, Lund University

### Intended Learning Outcomes:
**1. To gain experience in using the following libraries:**
- [matplotlib](https://matplotlib.org) for visualizing 2D and 3D data. 
- [NumPy](https://www.numpy.org) for manipulating and doing operations on arrays.
- [pandas](https://pandas.pydata.org) for reading in data and representing data in tables.  
- [seaborn](https://seaborn.pydata.org/index.html) for setting nice colour palettes for matplotlib plots
- [SciPy](https://www.scipy.org) or [lmfit](https://lmfit.github.io/lmfit-py/) for fitting.
- The [os](https://docs.python.org/3/library/os.html) module for reading and navigating directories on the hard drive 

      
**2. To gain experience in using Jupyter notebook relevant features such as:**
- Marking up text and equations with Markdown and LaTeX.
- Data visualization.
- Magic commands (%).
- Inserting images.
- Built in help functionality.


**3. Understanding basic scientific techniques and models such as:**
- Loading and parsing datasets containing microbial growth data.
- Processing multiple biological replicates from the same experiment. Calculation and visualization of averages and standard deviations. 
- Visualizing microbial growth data and calculation of growth rates
- Evaluating mathematical expressions.

**4. Basic python scripting such as:**
- Datatypes and objects.
- Datastructures.
- Loops.
- Functions.

**5. Searching documentation and help online** 

**6. Generation of publication ready figures.**

### Imports

In [None]:
from IPython.display import Markdown, IFrame, Image
import ipywidgets as widgets 

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns # the seaborn package can be used to set good-looking colour palettes for plots

import os # Later on in this notebook (Sections 2 and 3), we will be accessing and reading directories on the hard drive using the os module.
from os.path import isfile, join

## Introduction

### The notebook

In this project, you will work with real growth data from the bacterium *Escherichia coli*. You will practice loading and handling large datasets split over multiple files, working with data from replicated experiments, and making plots and performing calculations that are commonly used for microbial growth data. The notebook is divided into 3 main sections:

1. [Plotting growth data from *E. coli*](#section1)
2. [Automating the processing of multiple datasets using loops](#section2)
3. [Calculating microbial growth rates](#section3)

This notebook has been designed in a way that will make us practice a concept known as *stepwise refinement*. This is a fancy way of saying that we can first design several simple components that we want our final script to have, and then bit by bit combine the components into a more complex project. This means that instead of asking you to build a complex script from the start, we begin by implementing one of the components we would like our final code to have. We test it, and once the component works the way we want it to, we can take the code and use it to add new functionality to our project. In this manner, we will eventually be able to reach fairly complex pieces of code, with the added benefit that we have probably understood each step of the code along the way! 

Chances are that as you are working on the notebook, you will not think too much about doing stepwise refinement. But as you reach the more challenging tasks later on in the notebook, you will hopefully be able to use the knowledge you have gained along the way to help you solve the more advanced exercises.

## 1. Plotting growth data from *E. coli*
<a id='section1'></a>

As you have probably learned in your previous courses, microbes can often grow across a range of temperatures but tend to have an optimal temperature at which they grow the fastest. The span of the temperature range and the optimal growth temperature typically differ between species, and sometimes even between strains of the same species. In textbooks, the optimal growth temperature of *E. coli* is often reported to be 37 °C, i.e., the average temperature of the human body. In this notebook you will work with a public dataset containing growth data from *Escherichia coli* strain K12 NCM3722 (Katipoglu-Yazan et al., 2023). The cells were grown in a minimal medium with glucose as substrate at 19 different temperatures ranging from 27 °C to 45 °C. Using this dataset, we will investigate how temperature affects the growth of *E. coli* and see if it really grows the fastest at 37 °C. 

First check whether the data is already present in a subfolder. If it is not:

Start by downloading the dataset called *Ecoli_MicroplateGrowth_Filtered.zip* from this link: https://doi.org/10.57745/GCKG7W (or 
Copy the .zip file to the same directory that you have downloaded this notebook to. Unzip the archive to a new directory called *ecoli_growth_data*. (You are of course free to name the folder whatever name you want, but please note that in the code examples below we will assume that the 19 files from the zip are located in a directory named *ecoli_growth_data*.) Please make sure that this new directory only contains the .csv files and nothing else. This will make our lives easier when we later on in the notebook will work with automating the loading of the files.

The growth data we will be using was captured using 48-well microtiter plates (Figure 1) at 19 different temperatures ranging from 27 °C to 45 °C. Each temperature setting was assessed in 40 replicates, i.e., 40 of the 48 wells were occupied with cells. Each well typically has a maximum volume of 400 µl, and to avoid samples spilling over from one well to another when the plate is agitated for mixing, each well would probably contain 200-300 µl of culture. The cells were grown in a minimal medium with glucose as substrate. Sterilized growth medium was added to the wells, which were then inoculated from a pre-culture of *E. coli* cells to a low initial biomass concentration. The plate was then incubated in an automated plate reader that controls the temperature, shaking, and measurement of optical density (OD). The cells eventually started to divide and there was an increase in the number of cells, i.e. growth. The growth was measured as discrete data points, but at a fairly frequent measurement interval. In this dataset, the OD was measured every 10 minutes.

<br>
<center><img src="images/fig1.png" width="400"></center>
<p><center><b>Figure 1:</b> An overview of a 48-well microtitre plate. Source: <a href="https://commons.wikimedia.org/wiki/File:48-well-plate.svg">Wikimedia Commons</a> </center></p>

**(1) Biotechnology task:** <br>
What is Optical Density (OD)? How is it measured? What alternative methods exist to measure biomass? What are the benefits and drawbacks of OD compared to the other methods?

In [None]:
# -- YOUR ANSWER HERE --
# ---------------------

### 1.1. Load the data using Pandas and make a first plot

In this first section of the notebook, we will work with the data captured at 37°C. This data is stored in the file named *Ecoli_MicroplateGrowth_37C_Filtered.csv*.<br>

To load the data in the file: 

**Ecoli_MicroplateGrowth_37C_Filtered.csv**
that can be found in the folder with name **ecoli_growth_data**

we can use the read_csv() function that was introduced in the notebook from the Pandas lecture.

The whole string, *ecoli_growth_data/Ecoli_MicroplateGrowth_37C_Filtered.csv*, is known as a path. Specifically, it is a *relative path*, since it only tells how to get to the file starting from the current directory. An *absolute path*, on the other hand, is valid from any directory in the file system. The *absolute path* is then the current working directory plus the relative path.

Using just the tricks we learned before:

    1. os.getcwd() returns the current working directory as a string.
    2. os.sep gives the path separator of the current operating system as a string.
    3. join links the pieces of text with that operating-system-specific separator into the full path to the file.
    
The following code thus gives the path to your file on whatever operating system you are working on:

```Python
os.sep.join([os.getcwd(),'ecoli_growth_data','Ecoli_MicroplateGrowth_37C_Filtered.csv'])
```
        

*Bonus comment for those who are interested in a little more advanced discussion:*<br>

Forward (/) and backward (\\) slashes have special functions in many programming languages. Backslashes are often used to enable the printing of special characters, tabs, and newlines in a string; this is known as [string escape](https://www.w3schools.com/python/gloss_python_escape_characters.asp). The issue is that in Windows, paths are by default written using backslashes, which can lead to errors when Python interprets the backslash of the path as an escape character. Unix-like systems, such as Linux and macOS, instead use forward slashes for paths, so you might get errors if your code was developed on one type of OS but run on another. 
There are several ways to get around this problem. One is to use the special package **pathlib**. A second is to use the *r'...'* raw-string notation, which tells Python to treat backslashes in the string literally instead of as escape characters. My preferred approach is the method above, which uses only strings.
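
The two alternatives mentioned in the bonus comment can be sketched as follows. This is only an illustration and not part of the exercise; the string `data\temp` is an invented example chosen because it contains the escape sequence `\t`.

```python
import os
from pathlib import Path

# A normal string: "\t" is interpreted as a single tab character.
escaped = "data\temp"
# A raw string: the backslash is kept as a literal character.
raw = r"data\temp"

print(len(escaped))  # 8, because "\t" counts as one character
print(len(raw))      # 9, backslash and "t" are two characters

# pathlib inserts the correct separator for the current operating system.
csv_path = Path.cwd() / "ecoli_growth_data" / "Ecoli_MicroplateGrowth_37C_Filtered.csv"

# The string-based approach from above, for comparison.
csv_path_str = os.sep.join(
    [os.getcwd(), "ecoli_growth_data", "Ecoli_MicroplateGrowth_37C_Filtered.csv"]
)
print(csv_path)
print(csv_path_str)
```

Both printed paths should normally point to the same location; which style you use is mostly a matter of taste.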

**(2) Python task:** <br>
Load the data from the 37°C experiments as a Pandas dataframe. Display the dataframe so that you can have a look at the data. 
Verify that the index consists of numbers and not text. It should look like this:
<br>
<center><img src="images/fig6a.jpg" width="600"></center>
<p><center><b>Figure 6:</b> DataFrame content </center></p>

In [None]:
# -- YOUR CODE HERE --
# ---------------------

The dataframe loaded from *Ecoli_MicroplateGrowth_37C_Filtered.csv* consists of 125 rows and 41 columns. Each row corresponds to a time point in which the whole microplate was scanned. The column "Time_in_hr" shows the time of capture. <br> <br>
The other columns contain the 40 different biological replicates of the same strain. The numbering of the replicates is based on the well coordinates shown in Figure 1 above. This means that in each .csv file there are: <br>

$$
  40 \text{ replicates} \times 125 \text{ time points} = 5000 \text{ data points}
$$

The size of this dataset is thus a good example of a case where we will benefit a lot from using a systematic programming approach to processing, plotting and calculating key parameters. By working with this notebook you will hopefully understand the benefits of using a programming approach to analysing your data. Imagine handling 19 files that each contain 5000 data points using Excel... It is doable, but it will likely take a lot of effort.


**(3) Python task:** <br>
The headings of the columns in a dataframe often contain valuable descriptions of the data it contains. In the previous task, you most likely did not get a display of all the headings in the dataframe - it might have been truncated. Let's practice how to extract all the headings. Make a list of all the column headings from the dataframe. Compare the headings to the drawing in Figure 1 above. Do they match? Are all wells represented, or are there some wells that are not included?

In [None]:
# -- YOUR CODE HERE --
# ---------------------

Now let's try visualizing some of the data! There are 40 replicates in each .csv file, so let's start by plotting one of the replicates, e.g. A1. Plot the *Time_in_hr* column on the X-axis versus the OD values from the A1 replicate on the Y-axis. Try to reproduce this figure:

<center><img src="images/fig2.png"></center>
<p><center><b>Figure 2:</b> <i>E. coli</i> growth at 37 °C from replicate A1. </center></p>

**(4) Python task:** <br>
Plot the results from the A1 replicate (Time versus OD) using the matplotlib syntax you have learned in the course. Add axis labels and a title. Use a marker, such as 'o', to show that the data contains discrete datapoints.


In [None]:
# -- YOUR CODE HERE --
# ---------------------

**(5) Biotechnology task:** <br>
Microbial growth curves can be divided in different phases based on the rate of growth within each phase. What are the names of the different phases of the growth curve?

Hint: if you need to refresh your memory, you can read the Wikipedia article: [Bacterial growth](https://en.wikipedia.org/wiki/Bacterial_growth). <br>

In [None]:
# -- YOUR ANSWER HERE --
# ---------------------

### 1.2. Processing and plotting data from multiple replicates

A characteristic trait of biological systems is that they are *noisy*. This means that there can be a high variability between multiple repeated measurements of the same parameter. In microbiology, even two cells derived from the same parental cell can have small differences in their phenotypes depending on the local environment and the genotype of the cell (i.e., mutations). For instance, one cell might be at a position in the culture medium with a slightly lower sugar concentration, and consequently grow slightly slower than its sibling cell that is positioned in a place with a little higher availability of sugars. Situations like this can cause the signal of the parameter that we want to measure to vary considerably. When the variability becomes too high, we risk getting more noise than signal in the system we're measuring and, as a consequence, we will have a hard time interpreting the results.

Because of the inherent variability in our world, all scientific experiments - be it in natural sciences, medicine, or social sciences - require replicates and statistics. Since biological systems can have an especially high variability, biological research requires us to perform many replicates. Performing experiments in replicates also allows us to do statistical analyses of the data to assess the variability and evaluate if a certain response is significantly different from another. In this notebook we focus on basic ways of handling data from multiple replicates using averages and standard deviations. (We will not go into statistical tests such as the *t-test* or *ANOVA*, since this is not a course in statistical analysis. However, for those of you who are interested in such tests, the scipy package comes with functions to perform different statistical tests.)

Furthermore, in the biological sciences, we make a distinction between *technical replicates* and *biological replicates*. 

A *technical replicate* means that a measurement is performed multiple times using the same instrument and sample. For example, we inoculate a single colony in a shake flask, incubate it and let it grow for a while and then take a sample and measure the biomass content. We can measure this sample multiple times in the same instrument. Most likely we will get slightly different absolute numbers each time. This is the technical variability of the sample. It shows the variability of the method that was used, for instance how much a spectrophotometer differs between multiple measurements of the same sample. If we measure the sample multiple times, we can calculate the average value and the standard deviation of all the replicates. Ideally, we would like the method to produce similar results each time, i.e. we would like the standard deviation between the measurements to be as small as possible. Most mathematical software has built-in functions for taking the average and standard deviation. When working with Pandas dataframes, we can use .mean() and .std() to do this.

A *biological replicate* is when we compare samples that are biologically separated. When working with single cells, we do this by using a different cell culture for each sample. The easiest way to achieve this is to take a new colony from our agar plate for each culture that we inoculate. Then we prepare our sample just like we would for the technical replicate and perform the measurement. Since biological systems are known to be noisy, the biological replicates will typically have a larger variability than the technical replicates. To be able to show that something is biologically reproducible, we typically first use the technical replicates to ensure that the measurement method is reliable, and then use the biological replicates when performing the statistics.

The metadata of the Katipoglu-Yazan et al. (2023) dataset is a little vague on whether the replicates are technical or biological replicates. However, it says that the measurements for the different temperatures were made using "40 replicated cultures", which hopefully means that all 40 replicates were biological, i.e. each coming from a different starting colony on the agar plate. For the sake of this notebook, we will assume that the replicates in the dataset are biological replicates.

OK, let's have a look at the variability between the replicates in your dataset!

**(6) Python task:** <br>
Plot the OD results from four different replicates - A1, B2, C3, and D4 - as individual curves in a single plot. Use a different colour for each replicate. Use a marker to show that the data contains discrete measurements. Add labels to the axes, a legend for the curves, and a title to help the reader understand the plot.

In [None]:
# -- YOUR CODE HERE --
# ---------------------

**(7) Biotechnology task:** <br>
Look at the plot you just generated. Do you see any differences between the replicates? If yes, what do you think is the reason?

In [None]:
# -- YOUR ANSWER HERE --
# ---------------------

As mentioned above, a common approach of handling multiple replicates is to calculate the mean and standard deviation of all the replicates. In short, the thinking is that if the standard deviation from the mean is small enough, the reproducibility of the experimental setup can be considered good. Let's calculate the mean and the standard deviation for the replicates in the 37 °C file.
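
As a small illustration of `.mean()` and `.std()` applied across replicates, here is a sketch on an invented toy dataframe (three time points, three wells). The values and the variable names are made up for this example and are not taken from the dataset.

```python
import pandas as pd

# Invented toy data: three time points, three replicate wells.
toy = pd.DataFrame({
    "Time_in_hr": [0.0, 1.0, 2.0],
    "A1": [0.10, 0.20, 0.40],
    "A2": [0.12, 0.22, 0.46],
    "A3": [0.08, 0.18, 0.34],
})

wells = toy.drop(columns="Time_in_hr")  # keep only the replicate columns
row_mean = wells.mean(axis=1)           # axis=1: combine across columns, one value per time point
row_std = wells.std(axis=1)             # sample standard deviation (pandas uses ddof=1 by default)

print(row_mean)
print(row_std)
```

Note that the time column is dropped before the statistics are calculated; otherwise it would be averaged together with the OD values.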

**(8) Python task:** <br>
Using the dataframe we created earlier as a starting point, create a new dataframe that contains the time, the OD of each replicate, and the mean, standard deviation, and median of all replicates.<br>
Hint: Remember that DataFrames have multiple axes.

In [None]:
# -- YOUR CODE HERE --
# ---------------------

Now we have a dataframe containing the mean and standard deviation from all the 40 replicates at 37 °C. Let's investigate what the mean of our data looks like.

**(9) Python task:** <br>
In a single figure, plot the mean and the median of the replicates together with the data from replicates A1, B2, C3, and D4. Plot the mean in black so that it stands out from the other curves. Add labels to the axes, a legend for the curves, and a title to help the reader understand the plot. In the label for the mean data, please specify the number of replicates used to calculate the mean (n=40). 

Plot the mean last, so that it sits on top. 

You can play with the "alpha" and "ms" parameters to influence the transparency and marker size. Remember that by first creating a figure and then plotting into that specific figure, you have full control of what is plotted.
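
The effect of `alpha`, `ms`, and the plotting order can be seen in the following sketch. It uses invented, randomly generated curves rather than the *E. coli* data, so treat it as a pattern to adapt rather than a solution.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented toy data: three noisy "replicates" of the same curve.
rng = np.random.default_rng(0)
t = np.linspace(0, 5, 20)
replicates = [np.exp(0.5 * t) + rng.normal(0, 0.3, t.size) for _ in range(3)]

fig, ax = plt.subplots()  # create the figure and axes first
for i, y in enumerate(replicates):
    # alpha sets the transparency, ms the marker size
    ax.plot(t, y, "o-", alpha=0.4, ms=3, label=f"replicate {i + 1}")

# Plotted last, so the mean is drawn on top of the other curves.
ax.plot(t, np.mean(replicates, axis=0), "k-", lw=2, label="mean (n=3)")
ax.set_xlabel("Time (h)")
ax.set_ylabel("Signal (a.u.)")
ax.legend()
```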

Bonus: 
Create the same plot using only the original DataFrame (meaning without using a separate Frame that contains the processed data).

In [None]:
# -- YOUR CODE HERE --
# ---------------------

**(10) Biotechnology task:** <br>
In the figure you just plotted, the mean is probably closer to the A1, B2, and C3 replicates than to the D4 replicate. What does that tell us about the overall performance of the bacterial cells from this experiment (37°C)? Do you think that the mean is a good representation of the replicates in this dataset? Is there a difference between Mean and Median?

In [None]:
# -- YOUR ANSWER HERE --
# ---------------------

An issue with using the mean of all our replicates is that it does not show how much the different replicates varied, only how they performed on average. To better show that the data contains variability, it is common to add the standard deviation as error bars to our mean data. The figure below shows an example of what these plots often look like. 


<center><img src="images/fig3.png" width="500"></center>
<p><center><b>Figure 3:</b> Example of a plot of the mean values of discrete datapoints and error bars with the corresponding standard deviation.</center></p>


One way to plot error bars with matplotlib is to use this function:
```Python
ax.errorbar()
```

You can read the documentation <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.errorbar.html#matplotlib.axes.Axes.errorbar">here</a>.
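
A minimal sketch of an `ax.errorbar()` call is shown below. The numbers are invented for the illustration; in the task that follows, the mean and standard deviation come from your own dataframe.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented toy values: mean and standard deviation at five time points.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
mean_od = np.array([0.05, 0.08, 0.15, 0.30, 0.55])
std_od = np.array([0.005, 0.01, 0.02, 0.05, 0.08])

fig, ax = plt.subplots()
# yerr gives the length of the error bar above and below each point.
eb = ax.errorbar(t, mean_od, yerr=std_od, fmt="o-", linewidth=1, capsize=3)
ax.set_xlabel("Time (h)")
ax.set_ylabel("OD")
```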

**(11) Python task:** <br>
Plot the mean of the replicates and display the standard deviation as error bars.

Hint: the arguments *linewidth* and *capsize* of ax.errorbar() can be used to control the look of the errorbars

In [None]:
# -- YOUR CODE HERE --
# ---------------------

**(12) Biotechnology task:** <br>
Your plot probably showed that the standard deviation was smaller at certain time points and higher at other time points. Can you come up with any biological explanation as to why this could be?

In [None]:
# -- YOUR ANSWER HERE --
# ---------------------

And just like that, we have created a small piece of code that reads the 40 replicates x 125 time points = 5000 datapoints from the 37 °C dataset, calculates the mean and the standard deviation of all the replicates for each time point, and plots the results. Very handy!

We should keep in mind that the code we have produced will only work for .csv files structured exactly like the data from Katipoglu-Yazan et al. (2023), i.e., each column is a biological replicate and each row is a new time point. If you get OD data from another experiment made by a different person, the data might be organized in a different way... This highlights the importance of knowing what your input data looks like, and that you may or may not need a reorganizing step before you send the data to your script. There is no standard format for OD data! But organizing it the way Katipoglu-Yazan et al. (2023) did comes very naturally when working with Pandas dataframes. Also keep in mind that if you change to another programming language, there might be other preferred ways of organizing the data - if this had been a course in the R language (another popular language in the biological sciences), we might have wanted to arrange and label the data in a slightly different way.

**(13 Advanced) Python task (using LLM):** <br>
From our previous plots we have observed that some replicates deviate more from the rest than others. We have yet to plot all replicates, so let's do that in this bonus task. In a real-life project, plotting all replicates would probably be the first thing we would do so that we could get a feeling for our data, but since the focus of this notebook is to practice Python, we skipped that step earlier.

The challenge of plotting 40 replicates in a single plot is that the figure will be hard to read: there will be many colours and markers. Instead, it might be more useful to split the data into subplots. For instance, you could make one subplot per row in the microtiter plate, i.e., one plot for A1-A6, one for B1-B6, etc. Remember that rows E and F in the dataset we are currently working with include eight datapoints and not six.

Is there a pattern in the data? Are the replicates from the same row of the microtiter plate reproducible? Are the replicates from the same column of the microtiter plate reproducible? If you want, you can load the data from another temperature too, and check if any patterns you observed in the 37 °C data are also present there.

Compare the variability of the replicates you have plotted to the standard deviations from the plot with the mean OD and standard deviation you just made in Task 11. What are your thoughts?

There are again multiple ways to make these selections. One is to use **index-based slicing** with iloc, which we practised in the numpy notebook. A second is to use the very powerful **grouping function** from pandas. A third is something in between: analysing the names of the columns. Finally, pandas also supports a so-called hierarchical index, which is designed to handle statistical data just like this.
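
The third approach, selecting wells by analysing the column names, could look like the sketch below. The dataframe is an invented toy example with only two plate rows (A and B); the variable names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Invented toy data with wells from two plate rows.
t = np.arange(5)
toy = pd.DataFrame({"Time_in_hr": t,
                    "A1": t * 1.0, "A2": t * 1.1,
                    "B1": t * 2.0, "B2": t * 2.2})

row_letters = ["A", "B"]
fig, axes = plt.subplots(1, len(row_letters), sharey=True, figsize=(8, 3))
for ax, letter in zip(axes, row_letters):
    # Pick all columns whose name starts with the current letter, e.g. ['A1', 'A2'].
    wells = [c for c in toy.columns if c.startswith(letter)]
    for well in wells:
        ax.plot(toy["Time_in_hr"], toy[well], "o-", ms=3, label=well)
    ax.set_title(f"Row {letter}")
    ax.set_xlabel("Time (h)")
    ax.legend()
axes[0].set_ylabel("OD")
```

This works here because the time column does not start with any of the row letters; with other column names the selection rule would need to be stricter.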

**ChatGPT & Co.**
Normally you would now start to play with the format and try to find code. However, this is a place where we now have tools that do this for you, provided you know what you want. 

**Task** 

Use an LLM (ChatGPT or similar) to test what this prompt gives you: 

*I have a pandas DataFrame with name "df" whose column names mix letters and numbers, for example A1, A2, B10, and C3. I want to split each column name into the letter part and the number part, and then make those the two levels of a hierarchical MultiIndex for the columns in my DataFrame. Make a copy of my DataFrame into the variable "df1" and apply the change there. Name the letters "Columns" and the number "Rows".*


A good response most likely will use the package "re" and the output should result in something looking like this:
<center><img src="images/fig7.jpg"></center>
<p><center><b>Figure 7:</b> Multi index data</center></p>

Check by plotting that the data still looks the same.

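Now you can use the powerful grouping functions from pandas like this:

```Python
# groupby(..., axis=1) is deprecated in recent pandas versions, so we transpose,
# group along the index, and transpose back instead.
df1.T.groupby(level='Columns').mean().T.plot()
```
or 
```Python
df1.T.groupby(level='Rows').mean().T.plot()
```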

to make the statistics and plot it. What do you observe? How safe is it to use statistical arguments to filter out wells from the data?
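
To see what grouping over one level of a hierarchical column index does, here is a sketch on an invented toy frame. The MultiIndex is built by hand here (not by splitting column names), and the level names follow the prompt above, where the letters are called "Columns" and the numbers "Rows".

```python
import pandas as pd

# Invented toy frame with a two-level column index.
cols = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("B", 1), ("B", 2)], names=["Columns", "Rows"]
)
toy = pd.DataFrame([[1.0, 3.0, 10.0, 30.0],
                    [2.0, 4.0, 20.0, 40.0]], columns=cols)

# Transpose, group the (now row) labels by one level, average, transpose back.
by_letter = toy.T.groupby(level="Columns").mean().T  # one column per letter
by_number = toy.T.groupby(level="Rows").mean().T     # one column per number

print(by_letter)
print(by_number)
```

In `by_letter`, the column "A" holds the average of wells A1 and A2 at each time point; in `by_number`, the column 1 holds the average of A1 and B1.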

In [None]:
# -- YOUR ANSWER HERE --
# ---------------------

## 2. Automating the processing of multiple datasets using loops
<a id='section2'></a>

By now we have written code for loading the 37 °C growth data from its .csv file, calculating means and standard deviations, and plotting the data. Since microbial growth data is a very common data type in microbiology, this type of code is useful to have. You will probably be able to reuse this code in future courses and projects, if you want. In this section of the notebook, we will work on making the code even more useful. 

As you probably remember from when you unzipped the growth data, the zip archive contained 19 files. If we want to make the same plot that we made for the 37 °C data (Task 11), we could use the code we have already written and repeat it manually for each of the 19 files. There is nothing wrong with processing data in that way. But it is time consuming, cumbersome, and prone to error. Instead of doing this manually, we could write a loop to have Python do the work for us. We would like the loop to find all the .csv files in the directory we made, then load the .csv files one by one, perform the calculations and plots, and then repeat this process for the remaining .csv files.

To do this, we will make use of a Python module called *os* that allows us to e.g. move between and look inside directories on the hard drive programmatically. In particular, we will use the following function to get a list of all our filenames: 

```python
os.listdir() 
```



**(14) Python task:** <br>
Test the following code for yourself. By using *os.listdir()* we can get a list of all the files in a given directory. In the examples below we will use it on the *ecoli_growth_data* subdirectory located in our starting directory. 

What result did you get? Did it contain the file names of all 19 .csv files you downloaded earlier? Were any other file types detected?

In [None]:
# -- YOUR CODE HERE --
# ---------------------

Perhaps you noticed that the list of files was not presented in alphabetical order? That is because os.listdir() returns a list with arbitrary internal order. While there are easy ways to sort the order of files, what we actually want is to extract the temperature from the filename. 
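
The behaviour of `os.listdir()` can be tried out safely in a temporary directory. The sketch below creates three empty files with invented names, lists them, and then filters and sorts the result.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create three empty files with invented names.
    for name in ["b_data.csv", "a_data.csv", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()

    all_files = os.listdir(tmp)  # order is not guaranteed
    # Keep only the .csv files and sort them alphabetically.
    csv_files = sorted(f for f in all_files if f.endswith(".csv"))

print(all_files)
print(csv_files)  # ['a_data.csv', 'b_data.csv']
```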

Have a look at the structure of the file names: the temperature is the second-to-last part of the name. We can split a string with:
```python
'this_string_has_underscores'.split('_') 
```
into a list from which we select which element we want.
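
Applied to the 37 °C file name, the splitting could look like the sketch below. It assumes that the temperature part always has the form of a whole number followed by "C", as in this file name; check your own file list before relying on that.

```python
filename = "Ecoli_MicroplateGrowth_37C_Filtered.csv"

parts = filename.split("_")
print(parts)  # ['Ecoli', 'MicroplateGrowth', '37C', 'Filtered.csv']

temp_text = parts[-2]               # second-to-last element: '37C'
temp = int(temp_text.rstrip("C"))   # strip the trailing 'C' and convert to a number
print(temp)   # 37
```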

When loading data like this a dictionary is the best way to hold different data of different shapes. The temperatures we just extracted from the filename can act as "keys" and we can reuse our code from above to read each file, take the Mean of all the wells and attach this DataFrame to the dictionary.

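**(15) Python task:** <br>
Complete the loop in the cell below so that each .csv file is read, the mean of all wells is calculated, and the result is stored in the dictionary with the temperature as key. <br>
Hint: For the contents of the loop, you can most likely reuse part of the code you wrote in Section 1.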

In [1]:
path_to_my_files=os.sep.join([os.getcwd(),'ecoli_growth_data'])
my_files=os.listdir(path_to_my_files)
data_dictionary={}

for f in my_files:                     
    if f.endswith('.csv'):             # We add an if statement here to make sure that only .csv files are passed to pd.read_csv. If we for instance have a .txt file or even a subdirectory in our data folder, we will get errors since only .csv formatted files will work.
        full_filepath=os.sep.join([path_to_my_files, f]) # We have a list of all the file names, but we also need to give the path to the file since all our data is in another folder than our current folder
        # -- YOUR CODE HERE --
        # ----read the file and assign it to the DataFrame "df"  ----
        # ----calculate the average of all wells and assign it to df_mean ----
        # ----split the filename (contained in f), extract the temperature as number and assign it to "temp"----
        # place the averaged data into your dictionary with the temperature as key:
        # data_dictionary[temp]=df_mean      



**(16a) Python task:** <br>
To plot all the averaged curves, we can loop over the keys in this dictionary and plot each of the averaged datasets into the same plot. 
We then use the temperature as the label. Set the axis limits appropriately and label all axes of the plot. Give the legend a title that describes what the numbers mean. Change the colour of each line by using the **enumerate** function in the loop to select a new colour from a list of colours that you generate as shown below. Instead of "jet" you can choose any colormap from here: [Matplotlib colormaps](https://matplotlib.org/stable/gallery/color/colormap_reference.html)

```python
import matplotlib
cmap = matplotlib.colormaps['jet']
colors = cmap(np.linspace(0, 1, 19))
```
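
How **enumerate** and such a colour list work together is sketched below on an invented dictionary with three toy curves. The dictionary name `toy_curves` and the choice of the "viridis" colormap are assumptions made for this illustration.

```python
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

# Invented toy dictionary: temperature -> curve.
t = np.linspace(0, 5, 30)
toy_curves = {27: np.exp(0.3 * t), 37: np.exp(0.6 * t), 45: np.exp(0.2 * t)}

cmap = matplotlib.colormaps["viridis"]
colors = cmap(np.linspace(0, 1, len(toy_curves)))  # one RGBA colour per curve

fig, ax = plt.subplots()
# enumerate gives a running index i that we use to pick the colour.
for i, temp in enumerate(sorted(toy_curves)):
    ax.plot(t, toy_curves[temp], color=colors[i], label=str(temp))
ax.set_xlabel("Time (h)")
ax.set_ylabel("Signal (a.u.)")
ax.legend(title="Temperature (°C)")
```

Sorting the keys before looping makes the colour gradient follow the temperature.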



In [None]:
# -- YOUR CODE HERE --
# ---------------------

**(16b) Python task:** <br>
See if you can convert the dictionary into a Pandas DataFrame, inspect the frame and plot it using the "jet" colormap from matplotlib. You will note that many entries are filled with NaN, since data is missing there. 

You can use the "interpolate" function to fill these holes, but you must be aware that this method generates "datapoints" that were never measured, which is usually to be avoided. Plot the interpolated data and inspect it. Can you identify the regions that are problematic?

```Python
df_interpolated = df.interpolate(method="linear", axis=0, limit_direction="both")
```
It is better to accept the holes and instead use a plotting function that interpolates the data just for the plot, but not for the analysis. 
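
Where the NaN entries come from, and what `interpolate` does with them, can be seen in this small sketch. The two series are invented and deliberately sampled at different time points.

```python
import pandas as pd

# Invented toy curves measured at different time points (the index is time in hours).
s27 = pd.Series([0.10, 0.20, 0.40], index=[0.0, 1.0, 2.0])
s37 = pd.Series([0.10, 0.30, 0.90], index=[0.0, 0.5, 1.0])

# Building a DataFrame aligns both series on the union of their time points;
# wherever a series has no measurement, pandas inserts NaN.
toy = pd.DataFrame({27: s27, 37: s37})
print(toy)

# Fill the holes. These filled values were never measured!
filled = toy.interpolate(method="linear", axis=0, limit_direction="both")
print(filled)
```

Keep in mind that `method="linear"` treats the rows as equally spaced and ignores the actual time values in the index, which is one more reason to be careful with the filled values.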



In [None]:
# -- YOUR CODE HERE --
# ---------------------

**(17) Biotechnology task:** <br>
Microbes have a range of temperatures in which they can grow. If the temperature of the environment becomes too low or too high for the microbe, the cells will become stressed and eventually cease their propagation. Looking at the graph you have just generated, at which temperatures does the growth of this *E. coli* strain seem to be affected? Does it look like this *E. coli* strain grows the fastest at 37 °C?

In [None]:
# -- YOUR ANSWER --
# ---------------------

## 3. Calculating microbial growth rates
<a id='section3'></a>

The plots we have made so far have probably given you an indication of how this particular *E. coli* strain grows at different temperatures. Remember that the only difference between the data from the .csv files is the temperature - all other parameters were kept constant in all experiments. The optimal temperature for cultivation of *E. coli* is stated by many textbooks to be 37°C, which happens to be the average temperature of the human body. In this section, we will use the fact that our dataset contains growth data from 27-45°C to see if the maximum growth rate of this strain actually occurs at 37°C.

The visual inspection of the plots indeed gives us a lot of information. In addition to plotting the data, it is also common to quantify the growth by calculating the growth rate, $µ$. Specifically, we often want to calculate the maximum growth rate, $µ_{max}$, from different growth curves and then compare the values. Here we will eventually calculate and compare the $µ_{max}$ from all the 19 temperatures from our dataset.

### 3.1. The maths behind an exponential growth curve 
As you discussed yourselves in Task 5 back in Section 1, microbial growth curves consist of several different growth phases. A common way to describe the growth during the lag phase and the exponential phase is by using exponential models of the form:

$$
  Y=Y_{0} e^{µ (t-t_{0})} 
$$

where $t$ is the time in hours, $Y$ is the number of cells at time point $t$, $µ$ is the growth rate at that time point, and $Y_{0}$ is the number of cells at the starting time point $t_{0}$.

This means that we can solve for the growth rate ($µ$) by:
$$
  µ=\frac{ ln( \frac{Y}{Y_{0}} ) }{t-t_{0}}
$$

According to the properties of logarithms, we can re-write the equation as:

$$
  µ=\frac{ ln(Y) - ln(Y_{0}) }{t-t_{0}}
$$

This is the equation we will be working with in this section. You can compare this to the general finite-difference formula for estimating the derivative (slope) of a curve between two nearby points, and hopefully see that they are very similar: 

$$
  \frac{dY}{dX} \approx \frac{Y-Y_{0}}{X-X_{0}}
$$

That means that to find $µ$ at a given point in a microbial growth curve, we need to be able to calculate the slope of the natural logarithm of the biomass curve at that point.

One way of finding $µ$ from a microbial growth curve is to plot the natural logarithm ($ln$) of the number of cells versus time. We call this a semi-log y plot. 

We are often interested in finding the maximum growth rate $µ_{max}$, which is found at the point of the semi-log y plot with the largest $µ$ (steepest slope).
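
As a small worked example of the two-point formula, the sketch below computes $µ$ from two readings. The time points and OD values are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical readings: (time in hours, OD)
t0, Y0 = 2.0, 0.10
t, Y = 3.0, 0.20

# µ = (ln(Y) - ln(Y0)) / (t - t0), in units of 1/h
mu = (np.log(Y) - np.log(Y0)) / (t - t0)
print(mu)
```

The result is the slope of the straight line connecting the two points in a semi-log y plot.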


**NB! In many programming languages, the function for the natural logarithm is *log()* and not *ln()***. The base-10 logarithm is usually written as *log10()*, and likewise the base-2 logarithm as *log2()*. The benefit of working with the natural logarithm is that it can easily be used to transform exponential functions, since by definition:

$$
  ln(e) = 1
$$
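
In NumPy, the three logarithms mentioned above are available as `np.log`, `np.log10` and `np.log2`. The short sketch below also shows, with made-up values for $Y_{0}$ and $µ$, that the natural logarithm turns an exponential curve into a straight line with slope $µ$.

```python
import numpy as np

print(np.log(np.e))      # natural logarithm (ln)
print(np.log10(100.0))   # base-10 logarithm
print(np.log2(8.0))      # base-2 logarithm

# ln(Y0 * e^(µ t)) = ln(Y0) + µ t, i.e. a straight line in t.
# Y0 and mu are hypothetical values chosen for illustration.
Y0, mu = 0.05, 0.8
t = np.array([0.0, 1.0, 2.0])
ln_Y = np.log(Y0 * np.exp(mu * t))
print(np.diff(ln_Y))     # the increase per hour is constant and equal to mu
```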


**(18) Python task:** <br>
The first step in using the ln-method for calculating the maximum growth rate ($µ_{max}$) is to transform the OD values to ln(OD). Calculate the natural logarithm of the entries in the DataFrame from Task 16b and plot the results. 

Logarithm-transformed plots can sometimes look very similar to their non-logarithmic counterparts. Once you have made the plot, note the scale on the y-axis. Make a new figure with two axes, one with the linear and one with the logarithmic data. Compare the two!

In [None]:
# -- YOUR CODE HERE --
# ---------------------

**(19) Python task:** <br>
Using the ln data of the mean ODs from all temperatures, calculate the $µ_{max}$ for each temperature (either by looping over the curves or with all curves combined in a DataFrame). Use the equation above, which states that $ µ=(ln(Y) - ln(Y_{0})) / (t-t_{0})$.

One way of approaching this is to first calculate the slope ($µ$) for every point on the curve using the np.gradient function. If df is the DataFrame, the x-values are its index, which can be accessed with x=df.index.values, and the y-values can be accessed with df.values.

Once all the µ values are known, we can e.g. use the Pandas function .max() to find the maximum µ value. 
When doing such calculations it is good practice to plot the derivatives before taking the .max(), to verify that the values make sense.

If done right it will look something like this:
<center><img src="images/fig8.jpg" width="400"></center>
<p><center><b>Figure 7:</b> Derivatives as function of time for each temperature. </center></p>

Finally, plot the $\mu_{max}$ as a function of temperature.
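
The sketch below shows how np.gradient can be applied to a DataFrame whose index is the time. It uses a single synthetic logistic curve with invented parameters, not the experimental data, and the column label 37 is only a placeholder; treat it as an illustration of the function calls rather than as the solution.

```python
import numpy as np
import pandas as pd

# Hypothetical logistic growth curve; time (h) is the index of the frame.
t = np.linspace(0, 10, 101)
od = 0.05 * 2.0 / (0.05 + (2.0 - 0.05) * np.exp(-0.9 * t))
df = pd.DataFrame({37: od}, index=t)

ln_od = np.log(df)

# Slope of ln(OD) versus time at every row.
# axis=0 differentiates down the rows, using the index as x-values.
mu = pd.DataFrame(
    np.gradient(ln_od.values, df.index.values, axis=0),
    index=df.index,
    columns=df.columns,
)

mu_max = mu.max()        # one value per column, i.e. per temperature
t_at_max = mu.idxmax()   # time at which the steepest slope occurs
print(mu_max)
```

With real, noisy data the largest value of the gradient can be caused by a single noisy point, which is one reason to plot `mu` before trusting `mu.max()`.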

In [None]:
# -- YOUR CODE HERE --
# ---------------------


**(20) Biotechnology task:** <br>
The formula we have been using to calculate exponential microbial growth can also be used to calculate how long it takes for the total amount of cells to double. This is known as doubling time, or generation time. How is the doubling time calculated? Use the $µ_{max}$ value you calculated in the last task to calculate the corresponding doubling time.


In [None]:
# -- YOUR ANSWER --
# ---------------------

In the previous two tasks, we used the mean OD value to calculate the $µ_{max}$. What could be the issue of calculating the $µ_{max}$ of the mean OD? Do you think that we would get another $µ_{max}$ estimate if we first calculate the $µ_{max}$ of each replicate, and then take the mean of the $µ_{max}$ values from all the replicates?

**(21) Biotechnology task:** <br>
Now that you have plotted the average $µ_{max}$ for each temperature in the dataset, what did you find? Did the fastest maximal growth rate occur at 37°C like the textbooks often state? 
<br> - If yes, explain why this temperature would be beneficial for the growth of this bacterium. 
<br> - If no, what could be the reason why? 
<br> - Were there any results in this plot that were different from what you had expected?
<br> - Compare this figure to the plot you made with the mean and standard deviation in Task 16 in Section 2. Do the results match?


In [None]:
# -- YOUR ANSWER --
# ---------------------

# Previous Bonus tasks

## 3.3. Fitting curves to the experimental data

As we have discussed in this chapter, we often describe microbial growth as an exponential function, but that is only true until the end of the exponential phase. In fact, what we often see is more of an S-shaped curve, i.e., a <a href="https://en.wikipedia.org/wiki/Logistic_function">logistic curve</a>. In our data, we also have a decay phase after a short stationary phase, which may or may not be an issue for fitting a logistic model. Most likely, we will have some trouble if we try to fit a model to the whole growth curve, regardless of the model we choose. But if we only focus on the data up until we reach the stationary phase, an exponential function describes the data well enough. If we try to fit this exponential function: 

$$
  f(x)=a e^{b x} +c
$$


to the data up until the end of the exponential phase, we can get a figure that looks something like this:

<center><img src="images/fig4.png"></center>
<p><center><b>Figure 4:</b> Example of an exponential model fitted to the lag and exponential phase of the data from the 37 °C experiment. </center></p>


In this data, the end of the exponential phase seems to occur around rows 34-37 in the 37 °C data. A piece of code to drop all rows in the dataframe after row 36 and save the modified data as a new dataframe could look like:

```python
df_copy=df.drop(df.index[36:])
```

(Note that this particular piece of code only works if you have kept the index as row numbers (default setting when you generate a new dataframe) and not changed it to be the Time column).
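
To check what the line above does, the sketch below applies it to a small made-up dataframe and compares the result with position-based slicing using `iloc`. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 50-row frame with the default 0..49 row-number index.
df = pd.DataFrame({"Time": np.arange(50) * 0.25,
                   "OD": np.linspace(0.05, 1.5, 50)})

df_copy = df.drop(df.index[36:])   # removes the rows labelled 36, 37, ...
df_head = df.iloc[:36]             # keeps the first 36 rows by position

print(len(df_copy), df_copy.index[-1])
print(df_copy.equals(df_head))
```

Note that rows are counted from 0, so the last row that is kept has the label 35.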

**(24 BONUS) Python task:** <br>
Plot the mean OD data from the 37 °C dataset as discrete points in a scatter plot. Then, using the same data points that you used to plot the data, fit a curve using the exponential model $f(x)=a e^{b x} +c$. Once you have the fitted curve, use the parameters of the fit to get an estimate of the $µ$. Did the $µ$ you calculated using the curve fit differ from the $µ_{max}$ you calculated using the ln-method? If yes, why do you think that there is a difference?

Hint: 
If you plan on using *lmfit* or Scipy's *curve_fit*, please remember that it requires you to provide it with initial guesses for the parameters of the mathematical function you want to fit.
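
A minimal sketch of how initial guesses are passed to SciPy's *curve_fit* through the `p0` argument. The data are synthetic and noise-free, generated from the model itself with invented parameter values, so the fit only demonstrates the mechanics; the function name `exponential` and the guesses are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def exponential(x, a, b, c):
    """Exponential model f(x) = a * exp(b * x) + c."""
    return a * np.exp(b * x) + c

# Synthetic data generated from the model with made-up parameters.
x = np.linspace(0, 5, 30)
y = exponential(x, 0.05, 0.8, 0.02)

# Initial guesses for a, b and c, in the order of the function arguments.
p0 = [0.1, 0.5, 0.0]
popt, pcov = curve_fit(exponential, x, y, p0=p0)

a_fit, b_fit, c_fit = popt
print(a_fit, b_fit, c_fit)
```

Without `p0`, *curve_fit* starts every parameter at 1, which can be far from a sensible value for growth data and may prevent the fit from converging.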





In [None]:
# -- YOUR CODE HERE --
# ---------------------

## Graphical excellence

**(25) Python Task:** <br> 
To finish up the work we have been doing in the notebook, we will sum up some of the main results from our analysis into a single publication-ready figure. This means that the figure is ready to be used in e.g. a scientific report, PowerPoint presentation, or conference poster. Reproduce the plot shown below. (Seaborn was used to set the colour palette). You may also make improvements to the figure if you want. Perhaps you find that there are too many curves to plot in a single figure with this number of colours? Perhaps you want to split the data into two subplots, or maybe even skip some of the curves? Anyway, the resulting figure should be something that you would gladly hand over to your supervisor! <br>

Do you understand the plot? It should be apparent if you completed all the previous exercises. If you are not sure, please ask your instructor!

<center><img src="images/fig5.png"></center>
<p><center><b>Figure 5:</b> Main results from this notebook. </center></p>


In [None]:
# -- YOUR CODE HERE --
# ---------------------

# References

Katipoglu-Yazan et al (2023). Data on the influence of temperature on the growth of *Escherichia coli* in a minimal medium containing glucose as the sole carbon source for the joint computation of growth yields and rates at each temperature from 27 to 45°C. https://doi.org/10.57745/GCKG7W

# Suggested answers to the tasks

### How to get the answers:
Enter the password in the next code block. If the password is correct, you can execute each of the code blocks below to get decrypted suggestions for how to solve the tasks.

```python
fernet = Fernet(b'!!!TYPE PASSWORD HERE!!!')
```
Note that the password must start with b' and end with ' .

In [None]:
from cryptography.fernet import Fernet
if 0:
    key = Fernet.generate_key()
    with open('secret.key', 'wb') as f:
        f.write(key)
else:
    with open('secret.key', 'r') as f:
        key=f.readline()
fernet = Fernet(key) # enter password here!

In [None]:
multiline="""
"""
encrypted=fernet.encrypt(multiline.encode('utf-8'))
print(encrypted)

In [None]:
# Function to split into letter part and number part and Create MultiIndex from tuples with custom level names
encrypted=b'gAAAAABom5xghEKvUBOh29ZkyqV_jQm_ji0fxaje4aiHXQDyNOqk-dlEDMZm40ngmleRuie0LforCAHHxIeRMC6QBFpRxlyWX6GQH1ltL8dCedMV2Q-7Vn5pOylVZBPpR7tJ_1Q_lVFnHSXifKRyk-u4-Dqd0hx70N_6wISXDT4TviJAYaMbXIcGfofYHuFEPkYhwUbjG2QyUuOsGd4Mqk8vj25lEjBFGa4ms7mP9r-uBBEVhIki0ctheyoQjHA0DpdagFP7WA8qoD4xqxBKzhBFooY1CC8Vy6SNgs-N94-cuoDc_CnkZZONpbeYQWL6Z9dQ0p1qqg5HFyMI-lmwSH-LVeNFKeUR00QAXxAFVH-YOBJpePfWLrCn8GMvniVO0TryWMJs4hGg8Q4BXKGHGJ4zNOcyyXe01XYreNz6wzLt_dQFfWNh4cJraqlcZcOOxTzafNNBmJdXKq2B_IsXPoP3H1S2aYzuxIe_GpGZVhoMvk_EvPNCz5gJLVRPYi5muFSYdfy4JoMj'
decrypted = fernet.decrypt(encrypted).decode()
print(decrypted)

In [None]:
# Read the files in the folder, load each file as a DataFrame, average all columns, read the temperature from the filename and store the results in a dictionary
encrypted=b'gAAAAABom51lBZXuB7a0Dw6AEowooXBK7Ue85aYdZqd4rgWH2TmEql9Dy-4BIlOvEgTwe4QzAjP7soEQl2M2xPmtcd3svS46tsKPd9vozOEalVUbDkWPWLXAi7x9_j0UdUFk7_Zm-6G3Ooqqsgf-JRwPY6iNSvDMmLA6ilbJaI8_hqVAZQ9A0eJ_ybgPTrSNv9tRBO0kvbvR-665xd0cJHFOoJpc7daQbYZZsuV6qeSkn-Y6LiKAt4U7YS1tTMdAKB-AIvrZPaYKxXIRbU7cMe78OOIx5CwfJMHr_ZQigdPy-AbQ3R-bYdSivLVSN0JkPOAcITIoZ1leCUtH6CZ9JeA0RoHndwIG3dy_-LkpofGhFbmeD6ExoI_huk_Ojkbv7ZlHMVdgzjNpLymIdHKKbwBgqhvED2qaaMDXqvF354XVtTdynTfUIFgmjva75ahBRPRR9zO0ATejkdcWeC_FY8xzMWscUzs39pIsVpGz5hS3Xa7qLJkmEukV_sPR8GOqUbX0H6EQkcJ2iys8DxQA0QjlnTT9p7Bv49HizQEBO2G-5ZzLHPTC0bZMXJJE84aXPf0idIsxjSW3L_MmLxe7npFxEtI5r6AL0AgV7RbNYphzbVZDO3dhWnq0MS23PCjp2W2WFz8_-l43EGWVCqZiooZJG0Fu6ULcj9PsLUsGsxlvN1nYUgWoH55DynHMQzeedR-EhnIVWF4jqgv_D_hwRcRQy9Rp5WtJafVyrrPVCrRO5Mz9o9T1mu_I-SOSY35mtJRjRcUlAlHjpz2Tkoy2NX5J6NPslKfwMyKiDsfUymbOn1EUmuSU33ttHXsPsbudlMhXeWdhPApKa6f-xFPVvraY-hoOeEM2aM_7d4XTBRBG6i9D3q0aT7Bm-KYe8Jyddjmg-8qp78tYCZJvCD2cB0-LFYcDkhbBpHQadFmzPuYpP5ewWch4Bg-kHvzyvNK445ist6Z3OZARNevVUDErwNm3afx_HsacxXWvOpi_FifHPJZMW5JRdw-ouSE_ZJRbxU7N00AgidntvfuJD6cxPL75BXgkGS9vmmA2dHSMjrUDuYc-Vqy2meeGOt-RDFyINX5k1zYpP8e6MO4_MGomoG0VCd7bJHI-7QMHoEVQ8WWwRqYmkAlivoAJPp3crmyeD793daX5cWc7UkHG7mkKb8kHxHjUeoXJZg=='
decrypted = fernet.decrypt(encrypted).decode()
print(decrypted)

In [None]:
#Filter the columns after outliers
encrypted=b'gAAAAABom52nFftlMpAiCKR8Ipwno6hvu1DUVK3eEXVHE8NilunMA2TgttWvGtfzQlyQ9-s9q-Rc4kxYARTG5ZvQFZ33XQHtge1xG7c0lzVe67-phXd6PGmn_J1USA3GDnAQHCsP2bUZWeZGybumqg1UXbBFu69KTwcb3kfm6KDndMSd0ToOWOTopF33AZfwLFPfNenuYkxKPzgi9CT3I_fncKs6hR-AciZss6l46w7QeJHJJa7-97PHs8Ay-TivIghNufQuHeTAXpM4B0AeRB-piTnE7IklST1Fx9r68Htc3UJBLXRhXvIxjspJ4Ox1n4c5Kvt0s7IbsQzBES8dgZFI1X-pVSwQVa43br8WRjjh_8t-UFsG_EPtCyCyOxQfy_-yzJXYeDBuF7CgweM0B7LYbr98yGcYx1PX5q_hJWjp7wwOOh27PWk='
decrypted = fernet.decrypt(encrypted).decode()
print(decrypted)

In [None]:
#interpolate the missing rows
encrypted=b'gAAAAABom53ehelH2qcmYTy6TAL-jPVe4-DfDYZWEysPLGU1TZt6T41IJsvB48WW3KvRcQn_mlWiIx0zRMomMl3_M4-evF-n8NZ9O5I4Xoyd1dqDop9vshQG2pDwbp2b_ilzAwcNtqlrGYrk5nKjuoiuLdm0-PUOVSZM72tGAzQTMj9OyxeuFmsCCcKmX7p5BZFmU7j14asVUlvn5IBDrCwFJ-wxvcz_WQ=='
decrypted = fernet.decrypt(encrypted).decode()
print(decrypted)

In [None]:
#compare lin and log
encrypted=b'gAAAAABom54DBbIQQAsa38EcWtj69tkCpxXeuA4VtlCAtG8qSkajjlRtfI7lFqz2uo6BChW7eVqSp0zmxKFIzNAa-VfMM3sY_a6nmALU599ZSzXmjMi15uLWnJ3FwdHvnrJEtwECxiqg4sMM9TDEVSgEOKQiImlE3dylr1SzkzO_pf3nihg1NltxPeaG8UfatKSab1uQZDFMgpYTqDrP7bu0M4ahQ8124w=='
decrypted = fernet.decrypt(encrypted).decode()
print(decrypted)

In [None]:
#First plot the mu_max, then plot the maximum for all temperatures
encrypted=b'gAAAAABom55vfCymLVnjQQwXBT8dGK0bwm3arfvwJl4kHrySGC4lzsizCkX4xszNyYamlpIBJAWZMyHrZLUJPgWOqboM9CdqUN7YOco9lrqADy3wV8xnQANUVHq8b0muvbk2Bdms5F0cFW-I6nmTQi3X4uqLzm8IThhvKSOMjc7CMBOZ8D8UEeSy4LQZPW6ghjsOOtiVd5SS6OeC9YwbUTkdH6qRRqHQkeTtRVIojQvCwdmt1q-Kdud0jjBprm_oz6oNk10d01_YKgxg_sVyMzw_h74Xkng-aODS-LoU3rf3gk26T3SQspaXAx2xX1yrTPFXQ7DERgiD6v2JFf8eXIx1EZCJVjwCyo5nuwyibqUC2elvjVtpwJH-e6EcWql89klacn0-M6QBktCImtT3y5y_-iw8ozVoEdWGwlpaHnA2QK9dmkfM3qxuHscKF1kRKKfTaNIMN11Q97gsjP0kgCNPjlVLCs5xIhQpkUFEH960iNHmtiM8AoktbFpWyemVEsCY9E4yuyHQD4UuU1KDa5MkYxz1uSRZqwaDkxMqf0uYrQvfVKjFUVJySBxp9ms6AnyiID15TUmVgmibeu7cZhmFQgi3sZVXbFGBZOXQuxprVv3zj6HL-mDolGmnlPXQn9_Ind0hclAWSsYoM8ZLVsKfmIz9d2bKORMYlNfMiD32WIdn8XA7r-_3b30tkPQd6ZHxOuNGa4LPTZPmHL7_vB1bDjurSU42MOnz5xT0H6X6ccOzvtVmi871wpVnfu03nn0z50x2cGCYwPuxUtSrNVibRjnLU-QQkh6DkgcJnFjsiNByeQUMimkTl4OZAS-6QfEVz6k3NoWGyZU8oLpHJe9Jdou64Ai5n8yoM14kACwWIfShOt0rs8IwfJ2UJgs3c_ugVhvxHwUfsRGLRfXmjhOE4KpIxu8kOOCSPfHmqqVWsSfoY0xlfYiRrHNFIdXJHQB3eJB3Otuz9gOGKxzM7h3sr_VK71xe4g=='
decrypted = fernet.decrypt(encrypted).decode()
print(decrypted)

