# A short introduction to Pandas, NumPy, and Matplotlib

This is a short introduction to using Pandas, NumPy, and Matplotlib. Here, we are going to load an Excel file containing a measured rate constant ($k$) at different temperatures ($T$). The temperature is in kelvin and the rate constant has units of M$^{1/2}~\text{s}$.

You may recall that we expect the rate constant to follow the [Arrhenius equation](https://en.wikipedia.org/wiki/Arrhenius_equation):

\begin{equation}
k = A \text{e}^{\frac{-E_\text{a}}{RT}}
\end{equation}

where $A$ is a pre-exponential factor, $E_\text{a}$ the activation energy, and $R = 8.3145$ J/(K mol) is the gas constant. According to this equation, we should have an exponential relation between $k$ and the (inverse) temperature. Or, if we take the natural logarithm of this equation,

\begin{equation}
\ln k = \ln A -\frac{E_\text{a}}{RT}
\end{equation}

we get a linear relationship between the logarithm of $k$ and the inverse temperature ($1/T$). We shall make some plots here to see if this is the case.
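
As a quick numerical check of this claim, here is a minimal sketch. The values of `A_demo` and `Ea_demo` are made up for illustration and have nothing to do with the measured data. If $\ln k$ is linear in $1/T$, the slope between any two neighbouring points should equal $-E_\text{a}/R$:

```python
import numpy as np

R = 8.3145        # gas constant, J/(K mol)
A_demo = 1.0e11   # hypothetical pre-exponential factor (assumption)
Ea_demo = 150e3   # hypothetical activation energy in J/mol (assumption)

T_demo = np.array([700.0, 750.0, 800.0])
k_demo = A_demo * np.exp(-Ea_demo / (R * T_demo))

# Slope of ln k against 1/T between neighbouring points:
slopes = np.diff(np.log(k_demo)) / np.diff(1.0 / T_demo)
print(slopes)         # every entry should be close to -Ea_demo / R
print(-Ea_demo / R)
```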

To do this, we will use three Python libraries:
* [Pandas](https://pandas.pydata.org/) for loading the raw data.
* [NumPy](https://numpy.org/) for working with numerical arrays and mathematics.
* [Matplotlib](https://matplotlib.org/) for plotting and making figures.

## Reading the data

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
#plt.style.use('seaborn-talk')  # Change the style of plots.
plt.style.use(['./tkj4175.mplstyle', 'seaborn-notebook'])
%matplotlib notebook

There are two extra lines in the code above:
1. `plt.style.use(...)`, which changes the style of the plots produced (the commented-out line with `'seaborn-talk'` shows an alternative style). There is a list of predefined styles in the [matplotlib documentation](https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html). Here, you can experiment with different styles. Perhaps `ggplot` is more to your liking? Note that in Matplotlib 3.6 and later, the seaborn styles carry a `seaborn-v0_8-` prefix (for example `seaborn-v0_8-notebook`), so you may have to adjust the style name. You can also define your [own styles](https://matplotlib.org/stable/tutorials/introductory/customizing.html) or import styles that others have created. [Here is a collection of styles for scientific figures](https://github.com/garrettj403/SciencePlots) which may come in handy for your next Nature paper.

2. `%matplotlib notebook` which makes the plots interactive (so that we can zoom and pan).

We will load the Excel file using Pandas. This will give us a Pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), which is, to quote Pandas, a:

> Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Essentially, it is a table containing the data. Let us load it and show the contents:

In [None]:
data = pd.read_excel('rate_constant.xlsx')
data

We see here that we have two columns: `"rate constant"` and `"temperature"`. We can access the columns as follows:

In [None]:
data['rate constant']

In [None]:
data['temperature']

The code above gives us a Pandas [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) (a single column of the table). If we are only interested in the actual values, we can get those in the following way:

In [None]:
temperature = data['temperature'].values
print(temperature,'Type:', type(temperature))

The code above returns the temperature as a NumPy array. This is handy since we then can do NumPy-type operations. For instance, we could check what the largest and smallest temperatures are:

In [None]:
print('Largest temperature:', temperature.max())
print('Smallest temperature:', temperature.min())

Or, since we are probably going to use the inverse temperature, we can calculate that:

In [None]:
inverse_temperature = 1.0 / temperature
print(inverse_temperature)

We can add the inverse temperature to our original data table by doing:

In [None]:
data['inverse temperature'] = 1.0 / data['temperature']
data
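
To recap the different objects we have met so far, here is a small sketch. The table `example` contains made-up numbers (an assumption for illustration, not the contents of `rate_constant.xlsx`). Single brackets select a column as a Series, double brackets give a one-column DataFrame, and `.values` gives the underlying NumPy array:

```python
import pandas as pd

# Made-up numbers, only for illustration:
example = pd.DataFrame({'temperature': [700.0, 750.0, 800.0],
                        'rate constant': [0.01, 0.1, 1.0]})

column = example['temperature']          # single brackets: a pandas Series
table = example[['temperature']]         # double brackets: a one-column DataFrame
values = example['temperature'].values   # the underlying NumPy array

for obj in (column, table, values):
    print(type(obj).__name__)
```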

**Task for you**: Add a new column with the natural logarithm of the rate constant to the table. For calculating the logarithm, you can use the function `np.log()` from NumPy:

In [None]:
np.log(100)

In [None]:
np.log(np.array([1.0, 2.0, 2.718281828]))
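
Note that `np.log()` is the natural logarithm (base e), which is the one we need for the linearised Arrhenius equation. The short sketch below contrasts it with the base-10 logarithm `np.log10()`:

```python
import numpy as np

natural = np.log(np.e)      # natural logarithm of e
base_ten = np.log10(100.0)  # base-10 logarithm of 100
print(natural, base_ten)    # 1.0 2.0
```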

In [None]:
# Write your code here.
data['logk'] = np.log(data['rate constant'])
data

## Some Pandas selections
With Pandas, we can do selections from our data. For instance, we might be interested only in the measurements where the temperature is higher than 770 K. We can do this by a selection:

In [None]:
data[data['temperature'] > 770]  # Select rows, based on the "temperature" column

We can also do multiple conditions at the same time, for instance, where the temperature is above 770 and the rate constant is above 0.5: 

In [None]:
data[(data['temperature'] > 770) & (data['rate constant'] > 0.5)]
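
To see what happens inside such a selection, we can look at the conditions on their own. The sketch below uses a made-up table (`example`, an assumption for illustration only). Each comparison gives a Series of `True`/`False` values, and `&` combines them row by row. The parentheses are needed because `&` binds more tightly than `>` in Python:

```python
import pandas as pd

# Made-up values, only for illustration:
example = pd.DataFrame({'temperature': [700.0, 760.0, 780.0, 800.0],
                        'rate constant': [0.01, 0.3, 0.4, 0.9]})

hot = example['temperature'] > 770       # boolean Series, one entry per row
fast = example['rate constant'] > 0.5    # another boolean Series
both = hot & fast                        # element-wise "and"
print(both.tolist())                     # [False, False, False, True]

selected = example[both]                 # keep only the rows where both is True
print(selected)
```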

If we know exactly what row/column we want, we can also use the index (row number, column number) to look up values:

In [None]:
data.iloc[0]  # Get the first row, Python starts counting from 0

In [None]:
data.iloc[0, 1]  # Get first row, second column, this should be the temperature 700

In [None]:
data.iloc[:, 1]  # Get all rows, second column

In [None]:
data.iloc[1:4, 0]  # Get rows between 1 and 4 (4 not included), first column

**Task for you**: Find the temperature(s) where the rate constant is smaller than 0.1:

In [None]:
# Write your code here.
data[data['rate constant'] < 0.1]

In [None]:
# Just the temperatures:
select = data[data['rate constant'] < 0.1]
select['temperature'].values

## Making a plot
First, we will plot the rate constant as a function of the temperature. You will then be asked to modify the plot to show the logarithm of the rate constant as a function of the inverse temperature.

**Note**: Matplotlib has a large [gallery](https://matplotlib.org/stable/gallery/) of example plots which is worth checking out.

In [None]:
# First, we just get the values that we are to plot:
temperature = data['temperature'].values
rate_constant = data['rate constant'].values

We will now plot the rate constant as a function of the temperature:

1. We create an empty figure and an axis for plotting. A figure can contain many
   axes (for instance, if we wanted to have two plots next to each other, we would create two axes),
   but here we are going to just create one axis and name it `ax1`.
2. We add a line plot to the axis by using the axis method `plot()`. The syntax for plotting is:
   ```python
   ax1.plot(x_values, y_values)
   ```
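
For completeness, here is a minimal sketch of a figure with two axes next to each other. The names `ax_left` and `ax_right` and the placeholder data are assumptions made for this illustration:

```python
import numpy as np
from matplotlib import pyplot as plt

x_values = np.linspace(0, 1, 20)  # placeholder data, not the measurements

# One figure with one row and two columns of axes:
fig, (ax_left, ax_right) = plt.subplots(nrows=1, ncols=2, constrained_layout=True)
ax_left.plot(x_values, x_values)       # line plot in the left axis
ax_right.plot(x_values, x_values**2)   # line plot in the right axis
```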

In [None]:
# We then create a figure as follows:
fig, ax1 = plt.subplots(constrained_layout=True)  # Create a figure and axis for plotting.
# We set constrained layout to be True above to remove some white borders and to make sure that all text is visible
ax1.plot(temperature, rate_constant);

The plot above is perhaps not so nice. It is missing a few things (like labels).
Let us try another version, where we add labels to the x-axis and y-axis, and a title for the whole plot:

In [None]:
fig, ax1 = plt.subplots(constrained_layout=True)
ax1.plot(temperature, rate_constant)
ax1.set(xlabel='Temperature', ylabel='Rate constant', title='Arrhenius');

The plot still looks a bit strange, since matplotlib is just connecting the points we give it with straight lines.
Let us try to plot it differently - as a [scatter plot](https://matplotlib.org/stable/gallery/shapes_and_collections/scatter.html):

In [None]:
fig, ax1 = plt.subplots(constrained_layout=True)
ax1.scatter(temperature, rate_constant)
ax1.set(xlabel='Temperature', ylabel='Rate constant', title='Arrhenius');

Most matplotlib methods accept many arguments that can be used to change the appearance of the plot. We will here change some properties of the scatter plot:
1. We make the symbols larger, by using the `"s"` argument.
2. We change the [symbol](https://matplotlib.org/stable/api/markers_api.html) by using the `"marker"` argument.
3. We change the [color](https://matplotlib.org/stable/gallery/color/named_colors.html) by using the `"color"` argument.
4. We add a label to the plot. A label can automatically be added to a legend for the figure:

In [None]:
fig, ax1 = plt.subplots(constrained_layout=True)
ax1.scatter(temperature, rate_constant, s=100, marker='X', color='darkorange', label='Measured points')
ax1.set(xlabel='Temperature', ylabel='Rate constant', title='Arrhenius')
ax1.legend();  # Add a legend to the figure!

We can add multiple plots to the same axis. Let us also add the line plot, together with the scatter plot. Here, we change the width of the plotted line to make it thicker and we change 
the [style](https://matplotlib.org/stable/gallery/lines_bars_and_markers/linestyles.html) of it:

In [None]:
fig, ax1 = plt.subplots(constrained_layout=True)
ax1.plot(temperature, rate_constant, linewidth=5, linestyle='dotted', label='Some lines')
ax1.scatter(temperature, rate_constant, s=100, marker='X', color='darkorange', label='Measured points')
ax1.set(xlabel='Temperature', ylabel='Rate constant', title='Arrhenius')
ax1.legend();  # Add a legend to the figure!

In the plot above, it seems like the line is drawn on top of the scatter plot. We can control the order of the elements being drawn by using another parameter `"zorder"`:

In [None]:
fig, ax1 = plt.subplots(constrained_layout=True)
ax1.plot(temperature, rate_constant, linewidth=5, linestyle='dotted', label='Some lines', zorder=1)
ax1.scatter(temperature, rate_constant, s=100, marker='X', color='darkorange', label='Measured points', zorder=2)
ax1.set(xlabel='Temperature', ylabel='Rate constant', title='Arrhenius')
ax1.legend();  # Add a legend to the figure!

**Task for you**: Make a figure where you plot the logarithm of the rate constant as a function of the inverse temperature. Change the labels on the x- and y-axes according to what you plot, and add
a descriptive figure title. You can also experiment with colors, symbol types etc.

But let us not forget the science: Does it look like the logarithm of the rate constant varies linearly with the inverse temperature?

In [None]:
fig, ax = plt.subplots(constrained_layout=True)
ax.set_title('Logarithm of rate constant as function of inverse temperature')
# Multiply the inverse temperature by 1000 to make them easier to read:
ax.scatter(1000 * data['inverse temperature'].values, data['logk'].values, marker='^', s=200)
ax.set(xlabel='1000 × 1 / T (K⁻¹)', ylabel='ln k');

> But let us not forget the science: Does it look like the logarithm of the rate constant varies linearly with the inverse temperature?

Yes, it sure looks like a linear dependence.

## Fitting a line to the measured data (a bit more advanced)

In TKJ4175 we will learn about least squares regression.
We can use least squares regression to find the line that best approximates the data.
Let us jump ahead and do that here with NumPy.

NumPy has a convenient function for doing this, and it is called [polyfit](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html). It can be used as follows:

```python
p = np.polyfit(x, y, deg)
```

Where `x` and `y` are the values (x and y) we use for the fitting and `deg` is a number which tells NumPy what kind of polynomial to fit. For instance, if we put the number `1` in place of `deg` we will get a straight line. If we put the number `2` in place of `deg` we will get a second order polynomial.

The parameters of the fitted polynomial are returned and here stored in the variable `p`.
In general, `polyfit` can be used to fit a polynomial of order `deg` and the parameters are returned in the following order:

```python
p(x) = p[0] * x**deg + p[1] * x**(deg - 1) + ... + p[deg]
```

Translated to the case where we fit a straight line (`deg = 1`) we get:

```python
p(x) = p[0] * x + p[1]
```

That is, `p[0]` will contain the slope and `p[1]` will contain the intercept (constant term) of the straight line.
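
To make the ordering of the parameters concrete, here is a small sketch with a known second-order polynomial. The arrays `x_demo` and `y_demo` are made up for this illustration. The fitted parameters should come back with the highest power first:

```python
import numpy as np

x_demo = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y_demo = 3 * x_demo**2 - 2 * x_demo + 5   # a known second-order polynomial

p_demo = np.polyfit(x_demo, y_demo, 2)
print(p_demo)  # approximately [3, -2, 5]: highest power first, constant term last
```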

For testing this out, we will create some dummy data that follows the equation `y = 2x + 1` and fit a straight line:

In [None]:
# Define some x-values:
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Calculate some y-values:
y = 2 * x + 1

In [None]:
# Plot x and y:
fig, ax1 = plt.subplots(constrained_layout=True)
ax1.scatter(x, y)
ax1.set(xlabel='x', ylabel='y');

Let us now fit a straight line to the x- and y-points we have:

In [None]:
p = np.polyfit(x, y, 1)
print(p)

In [None]:
print(f'The equation is: {p[0]:.2f}*x + {p[1]:.2f}')

We have now fitted the equation for a straight line and find that the slope
is 2 and the intercept is 1, as expected.
We would now like to plot that line to see how good the fit is (obviously, here it is going to look very good!).

One way of doing this could be to calculate the straight line values:

```python
yline = p[0]*x + p[1]
```
and plot those as a function of x. If we, for some reason, decided to fit a 100th-order polynomial, it would be a hassle to write out the calculation of the polynomial values:

```python
yline = p[0]*x**100 + p[1]*x**99 + # many, many more terms
```

We will therefore use a NumPy function to evaluate the polynomial (for us: the straight line) we have fitted. This function is called [polyval](https://numpy.org/doc/stable/reference/generated/numpy.polyval.html):

```python
yline = np.polyval(p, x)
```

This function will evaluate a polynomial with parameters `p` at the given `x` values.
The benefit of using this function is that we do not have to write out the equation we are evaluating explicitly.
If we put `deg=1` or `deg=100` in `polyfit`, we can still evaluate
the fitted polynomial using `yline = np.polyval(p, x)`.
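
As a small check, the sketch below compares the explicit straight-line formula with `np.polyval` for hand-picked parameters (`p_demo` and `x_demo` are made up for this illustration):

```python
import numpy as np

p_demo = np.array([2.0, 1.0])        # slope 2, constant term 1
x_demo = np.array([0.0, 1.0, 2.0])

explicit = p_demo[0] * x_demo + p_demo[1]    # written out by hand
with_polyval = np.polyval(p_demo, x_demo)    # evaluated by NumPy
print(explicit, with_polyval)                # both are [1. 3. 5.]
```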

In [None]:
# Calculate the y-values for the straight line:
yline = np.polyval(p, x)
# And add them to the plot:
# Plot x and y:
fig, ax1 = plt.subplots(constrained_layout=True)
ax1.scatter(x, y, label='Original points', s=100)
ax1.plot(x, yline, label='Fitted line', color='xkcd:melon', linewidth=3, ls='--')
ax1.set(xlabel='x', ylabel='y')
ax1.legend();

Here are a couple more examples of using `polyfit`:

In [None]:
# Here is another example where we fit a 10th-order polynomial to some data:
x = np.linspace(0, 4*np.pi, 100)  # generate 100 values between 0 and 4π
y = np.cos(x)  # Calculate the cosine of x

p = np.polyfit(x, y, 10)  # Fit a 10th order polynomial to the x, cos(x) data.

# Calculate the y-values on the fitted line:
yline = np.polyval(p, x)

# Plot the original x, cos(x) data and the fitted line:
fig, ax1 = plt.subplots(constrained_layout=True)
ax1.scatter(x, y, label='y = cos(x)', s=75)
# alpha = 0.8 makes the fitted line slightly transparent
ax1.plot(x, yline, label='Fitted line', color='xkcd:blood orange', linewidth=3, alpha=0.8)
ax1.set(xlabel='x', ylabel='y')
ax1.legend();

In [None]:
# And another example where we just try out different orders of polynomials:
x = np.linspace(0, 2.5, 100)  # generate 100 values between 0 and 2.5
y = np.exp(x**2)  # Calculate e^(x²)

# Plot the original x, exp(x²) data and the fitted lines:
fig, ax1 = plt.subplots(constrained_layout=True)
ax1.scatter(x, y, label='y = exp(x²)', s=25, color='0.6')

for deg in range(1, 5):
    p = np.polyfit(x, y, deg)  # Fit a "deg" order polynomial to the x, exp(x²) data.
    # Calculate the y-values on the fitted line:
    yline = np.polyval(p, x)
    # alpha = 0.8 makes the fitted line slightly transparent
    ax1.plot(x, yline, label=f'Fitted polynomial, order {deg}', linewidth=3, alpha=0.8)
ax1.set(xlabel='x', ylabel='y')
ax1.legend();
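
If we want a number to go with the visual impression, one option is to compare the sum of squared residuals (the quantity that least squares minimises) for the different orders. This is a sketch that goes slightly beyond the example above, and the names `x_demo`, `y_demo`, and `rss` are introduced here only for illustration:

```python
import numpy as np

x_demo = np.linspace(0, 2.5, 100)
y_demo = np.exp(x_demo**2)

rss = []  # sum of squared residuals for each polynomial order
for deg in range(1, 5):
    p_demo = np.polyfit(x_demo, y_demo, deg)
    residuals = y_demo - np.polyval(p_demo, x_demo)
    rss.append(np.sum(residuals**2))
    print(f'order {deg}: sum of squared residuals = {rss[-1]:.3g}')
```

A smaller value means that the polynomial follows the points more closely. It does not by itself tell us that the model is a sensible one.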

**Tasks for you**: 

1. Fit a straight line to the logarithm of the rate constant as a function
   of the inverse temperature. Plot your fitted line together with the data used to fit it.
   You should get that the parameters of the straight line are about
   ```python
   [-21880.87290684,     26.6620931]
   ```

2. (Science part!) Can you show that the activation energy is approximately 180 kJ/mol? 

3. Plot your fitted line in the form

   \begin{equation}
   k = A \text{e}^{\frac{-E_\text{a}}{RT}}
   \end{equation}

   together with the original rate constant and temperature data. Does it seem to approximate the original
   data well?

In [None]:
# 1
x = data['inverse temperature'].values
y = data['logk'].values
p = np.polyfit(x, y, deg=1)
print('Parameters:', p)

In [None]:
# 1 - plotting part:
fitted_line = np.polyval(p, x)

fig, ax = plt.subplots(constrained_layout=True)
ax.set_title('Logarithm of rate constant as function of inverse temperature')
ax.plot(x, fitted_line, label=f'Fitted line\ny = {p[0]:.2g}x + {p[1]:.2g}', ls=':', color='black')
ax.scatter(data['inverse temperature'].values, data['logk'].values, marker='^')
ax.legend()
ax.set(xlabel='1 / T (K⁻¹)', ylabel='ln k');
# display number on x-axis in scientific notation, that is 0.001 -> 1 * 1e-3
ax.ticklabel_format(axis='x', style='scientific', scilimits=(-1, 1)) 

We have found parameters for,

\begin{equation}
\ln k = \ln A -\frac{E_\text{a}}{RT}
\end{equation}

written as:
`y = p[1] + p[0] * x`

where:
- `y` is $\ln k$,
- `x` is $1/T$,
- `p[1]` is $\ln A$,
- and `p[0]` is $-\frac{E_\text{a}}{R}$.

This means that the activation energy is:

$E_\text{a} = - R \times \text{p[0]}$


In [None]:
R = 8.3145
activation_energy = -R * p[0]
print(f'Activation energy = {activation_energy / 1000:.2f} kJ/mol')

> Can you show that the activation energy is approximately 180 kJ/mol?

Yes, the activation energy is approximately 180 kJ/mol.

In [None]:
# 3
# We evaluate k using the parameters we have found:
A = np.exp(p[1])
# Let us evaluate it for several temperatures between 670 and 840:
# (We treat the fitted equation as a model, and use it to evaluate k for
# different temperatures.)
T = np.linspace(670, 840, 100)
k_fit = A * np.exp(-activation_energy / (R*T))

fig, ax = plt.subplots(constrained_layout=True)
ax.scatter(data['temperature'].values, data['rate constant'].values, label='Measurements')
ax.set(xlabel='Temperature (K)', ylabel='Rate constant (M$^{1/2}$ s)')
ax.plot(T, k_fit, label='Model')
ax.set_title('Rate constant as a function of temperature')
ax.legend();

> Plot your fitted line in the form
>
> \begin{equation}k = A \text{e}^{\frac{-E_\text{a}}{RT}}\end{equation}
>
> together with the original rate constant and temperature data. Does it seem to approximate the original data well?

Yes, from the plot above, it seems to approximate the original data reasonably well.
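
As a final sanity check of the whole procedure, we can run it on synthetic data where we know the answer. In the sketch below, `A_true` and `Ea_true` are hypothetical values chosen only for this test (they are not the parameters of the measured reaction). We generate rate constants from the Arrhenius equation, fit a straight line to $\ln k$ against $1/T$, and see if we recover the values we started from:

```python
import numpy as np

R = 8.3145         # gas constant, J/(K mol)
A_true = 1.0e11    # hypothetical pre-exponential factor (assumption)
Ea_true = 150e3    # hypothetical activation energy in J/mol (assumption)

# Synthetic, noise-free "measurements":
T_demo = np.linspace(700, 800, 6)
k_demo = A_true * np.exp(-Ea_true / (R * T_demo))

# Fit ln k = ln A - Ea / (R T) as a straight line in 1/T:
p_demo = np.polyfit(1.0 / T_demo, np.log(k_demo), 1)
Ea_fit = -R * p_demo[0]     # slope is -Ea / R
A_fit = np.exp(p_demo[1])   # constant term is ln A
print(f'Ea = {Ea_fit / 1000:.1f} kJ/mol, A = {A_fit:.3g}')
```

Since the synthetic data contain no noise, the fitted values should agree with the ones we put in, up to rounding errors.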