### Exercises

#### Question 1

The accompanying file `data.csv` contains observations of some value `x` recorded at times `t`.

Given this data, we want to calculate the rate of change of this value over time. We'll do this by taking two consecutive observations, say $x(t_i)$ and $x(t_{i+1})$, and approximating the rate of change with this formula:

$$
v(t_{i+1}) = \frac{x(t_{i+1}) - x(t_i)}{t_{i+1} - t_i}
$$

For example, if the data looks like this:

```
t     x
0.1   10
0.2   12
0.4   14
0.5   15
```

Then the first row of data would be considered $t_0$, the second row $t_1$, and so on.

We can then approximate the rate of change starting at $v_1 = v(t_1)$, which would be calculated as:

$$
v_1 = \frac{12 - 10}{0.2 - 0.1} = 20.0
$$

Similarly, $v_2$ would be calculated as:

$$
v_2 = \frac{14 - 12}{0.4 - 0.2} = 10.0
$$
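The worked example above can be reproduced with array slicing. This is a minimal sketch using the four sample rows from the text; the names `t`, `x`, and `v` are chosen here for illustration only.

```python
import numpy as np

# Sample observations from the example table above
t = np.array([0.1, 0.2, 0.4, 0.5])
x = np.array([10.0, 12.0, 14.0, 15.0])

# x[1:] holds x(t_1), x(t_2), ... and x[:-1] holds x(t_0), x(t_1), ...
# so the element-wise difference pairs each observation with its predecessor
v = (x[1:] - x[:-1]) / (t[1:] - t[:-1])

print(v)  # roughly [20. 10. 10.], up to floating-point rounding
```

The result has one element fewer than the input, since there is no rate of change for the first observation; `v[0]` corresponds to $v_1$.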

Use NumPy to create an array that holds the calculated rates of change, then determine the minimum, maximum, average, and standard deviation of the rate of change.

In [1]:
import pandas as pd
import numpy as np

In [6]:
url = "https://raw.githubusercontent.com/anhailing/python-fundamentals/main/29%20-%20NumPy/12%20-%20Exercises/data.csv"
df = pd.read_csv(url)
df

Unnamed: 0,t,x
0,0.092,14.765675
1,0.200,20.259227
2,0.296,25.246365
3,0.390,28.591960
4,0.494,35.583875
...,...,...
95,9.607,490.923872
96,9.702,495.846965
97,9.796,500.089664
98,9.900,504.174303


In [8]:
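# pandas approach: diff() gives row-to-row differences; the first row has
# no predecessor, so it becomes NaN and dropna() removes it
df_diff = df.diff().dropna()
# rate of change = change in x divided by change in t
df_diff["v"] = df_diff["x"] / df_diff["t"]
df_diff["v"].describe()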

count    99.000000
mean     49.981252
std       9.089487
min      29.427399
25%      43.956498
50%      49.784833
75%      57.635012
max      69.073005
Name: v, dtype: float64
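
Since the question asks for NumPy arrays, the same calculation can also be sketched without pandas. The helper name `rate_of_change_stats` is an assumption made for this example, and it is demonstrated on the small sample table from the question text, not on `data.csv`. One detail worth noting: `describe()` in pandas reports the sample standard deviation (`ddof=1`), while `np.std` defaults to the population form (`ddof=0`), so `ddof=1` is passed here to match.

```python
import numpy as np


def rate_of_change_stats(t, x):
    """Return the rates of change and summary statistics (hypothetical helper)."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    # np.diff returns consecutive differences: a[i + 1] - a[i]
    v = np.diff(x) / np.diff(t)
    stats = {
        "min": v.min(),
        "max": v.max(),
        "mean": v.mean(),
        # ddof=1 gives the sample standard deviation, as pandas does
        "std": v.std(ddof=1),
    }
    return v, stats


# Sample rows from the example table in the question text
v, stats = rate_of_change_stats([0.1, 0.2, 0.4, 0.5], [10, 12, 14, 15])
print(v)
print(stats)
```

With the real data, `df["t"].to_numpy()` and `df["x"].to_numpy()` could be passed in instead of the sample lists.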

#### Question 2

In linear regression we try to find the coefficients `m` (slope) and `c` (y-intercept) of a straight line

$$
y = mx + c
$$

that provides the "best" fit given some `x` and `y` data. This formula then allows us to predict `y` values for given `x` values.

Given an array of `n` `(x, y)` data pairs, these coefficients can be calculated very simply.

A bit of terminology first:

- Let `X` mean the column of `x` values.
- Let `Y` mean the column of `y` values.
- Let `XX` mean a column calculated by multiplying each `x` in the `X` column by itself.
- Let `XY` mean a column calculated by multiplying (pairwise) the `x` and `y` values from the `X` and `Y` columns.

Then, given some column (say `X`), the symbol $\sum{X}$ means the sum of all the elements in that column.

Similarly, the symbol $\sum{XY}$ means the sum of the values obtained by multiplying (pairwise) the values in `X` and `Y`.
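
These column definitions map directly onto element-wise NumPy operations. A small sketch, using made-up values chosen only for illustration:

```python
import numpy as np

# Illustrative values only, not taken from data.csv
X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 7.0])

XX = X * X  # each x multiplied by itself
XY = X * Y  # pairwise products of x and y

sum_X = np.sum(X)    # 1 + 2 + 3 = 6.0
sum_Y = np.sum(Y)    # 2 + 4 + 7 = 13.0
sum_XX = np.sum(XX)  # 1 + 4 + 9 = 14.0
sum_XY = np.sum(XY)  # 2 + 8 + 21 = 31.0
```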

Given those definitions, the formulas for calculating the "best" values of `m` and `c` are given by:

$$
m = \frac{n\sum{XY} - \sum{X}\sum{Y}}{n\sum{XX} - (\sum{X})^2}
$$

$$
c = \frac{\sum{Y}\sum{XX} - \sum{X}\sum{XY}}{n\sum{XX} - (\sum{X})^2}
$$

(where `n` is the number of `(x,y)` pairs in our data set.)
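
These formulas are consistent with the usual least-squares reading of "best": choosing `m` and `c` to minimize the sum of squared vertical distances between the line and the data points. A brief sketch of the derivation, assuming that interpretation:

$$
S(m, c) = \sum_{i=1}^{n} (y_i - m x_i - c)^2
$$

Setting both partial derivatives of $S$ to zero gives two linear equations in `m` and `c`:

$$
\frac{\partial S}{\partial m} = 0 \;\Rightarrow\; \sum{XY} = m\sum{XX} + c\sum{X}
$$

$$
\frac{\partial S}{\partial c} = 0 \;\Rightarrow\; \sum{Y} = m\sum{X} + nc
$$

Solving this pair of equations for `m` and `c` yields the two formulas above, which share the denominator $n\sum{XX} - (\sum{X})^2$.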

Using the same data we saw in Question 1, calculate the values for `m` and `c` for that data set given the formulas above.

You can think of the `t` column in the data as the `X` column, and the `x` column in the data as the `Y` column: we are trying to predict the value of `x` given a value of `t`.

This will result in a straight line that "best" fits through the data.
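
Before applying the formulas to the real data, it can help to check them on points whose line is known in advance. The following is a minimal sketch: `linear_fit` is a hypothetical helper name, the sample points are made up to lie exactly on $y = 2x + 1$, and `np.polyfit` is used only as an independent cross-check.

```python
import numpy as np


def linear_fit(x, y):
    """Return (m, c) from the closed-form sums (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sum_x, sum_y = x.sum(), y.sum()
    sum_xx, sum_xy = (x * x).sum(), (x * y).sum()
    # Both coefficients share the same denominator
    denom = n * sum_xx - sum_x ** 2
    m = (n * sum_xy - sum_x * sum_y) / denom
    c = (sum_y * sum_xx - sum_x * sum_xy) / denom
    return m, c


# Points that lie exactly on y = 2x + 1, so the fit should recover m = 2, c = 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
m, c = linear_fit(x, y)

# np.polyfit with degree 1 returns [slope, intercept]
m_ref, c_ref = np.polyfit(x, y, 1)
print(m, c)
```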

Compare the slope of this regression line to the average rate of change you calculated in Question 1.

In [15]:
X = df["t"].values
Y = df["x"].values
n = len(X)

In [18]:
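# slope: numerator and denominator from the formula for m
m_num = n * np.sum(X * Y) - np.sum(X) * np.sum(Y)
denom = n * np.sum(X ** 2) - np.sum(X) ** 2
m = m_num / denom

# intercept: same denominator as the slope
c_num = np.sum(Y) * np.sum(X ** 2) - np.sum(X) * np.sum(X * Y)
c = c_num / denom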

In [19]:
m, c

(49.978008206387344, 10.081268844890284)
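
For the comparison requested in the question, the slope can be set against the mean rate of change from Question 1. The sketch below hard-codes the two values printed in the outputs above instead of recomputing them; the variable names are chosen for illustration.

```python
# Values copied from the notebook outputs above
mean_v = 49.981252      # mean of v reported by describe() in Question 1
m = 49.978008206387344  # regression slope from Question 2

# Relative difference between the two estimates of the overall rate of change
rel_diff = abs(mean_v - m) / abs(m)
print(f"relative difference: {rel_diff:.2e}")
```

The two numbers agree to well under 0.1%, which is consistent with data that is close to linear in `t`: both the mean of the consecutive slopes and the regression slope then estimate the same underlying rate.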