### Solutions

#### Question 1

The accompanying file `data.csv` contains information for the value `x` of something observed at time `t`.

Given this data, we want to calculate the rate of change of this value over time. We'll do this by taking two consecutive observations, say $x(t_i)$ and $x(t_{i+1})$, and approximating the rate of change using this formula:

$$
v(t_{i+1}) = \frac{x(t_{i+1}) - x(t_i)}{t_{i+1} - t_i}
$$

For example, if the data looks like this:

```
t     x
0.1   10
0.2   12
0.4   14
0.5   15
```

Then the first row of data would be considered $t_0$, the second row $t_1$, etc.

And we can start approximating the rate of change starting at $v_1$ which would be calculated as:

$$
v_1 = \frac{12 - 10}{0.2 - 0.1} = 20.0
$$

Similarly, $v_2$ would be calculated as:

$$
v_2 = \frac{14 - 12}{0.4 - 0.2} = 10.0
$$
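
As a quick check of the arithmetic above, the same calculation can be sketched in NumPy using the four sample rows. The names `t`, `x`, and `v` are chosen here for illustration only.

```python
import numpy as np

# The four sample rows from the table above.
t = np.array([0.1, 0.2, 0.4, 0.5])
x = np.array([10.0, 12.0, 14.0, 15.0])

# Difference between each observation and the one before it,
# divided element-wise to give v_1, v_2, v_3.
v = (x[1:] - x[:-1]) / (t[1:] - t[:-1])
print(v)  # approximately [20. 10. 10.], up to floating-point rounding
```

Note that four observations yield only three rates, since each rate needs a pair of consecutive rows.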

Use NumPy arrays to create an array that holds the calculated rates of change and determine the minimum, maximum, average and standard deviation of the rate of change.

##### Solution

In [1]:
import numpy as np
import csv

We'll start by importing the data first:

In [2]:
with open('data.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip header row
    raw_data = list(reader)
    
raw_data

[['0.092', '14.765674972872079'],
 ['0.2', '20.259226923447223'],
 ['0.296', '25.246364712175524'],
 ['0.39', '28.59196014284041'],
 ['0.494', '35.5838751542487'],
 ['0.605', '39.92405609554009'],
 ['0.699', '44.900143003396344'],
 ['0.806', '50.111998705949176'],
 ['0.89', '55.33744839374389'],
 ['1.003', '61.13682148020512'],
 ['1.109', '64.5004524657241'],
 ['1.195', '69.43382286277865'],
 ['1.304', '75.21429131416473'],
 ['1.394', '80.68024464266546'],
 ['1.51', '85.69045993029255'],
 ['1.596', '90.67365382390035'],
 ['1.699', '94.26145999596629'],
 ['1.801', '100.1501806890637'],
 ['1.893', '104.02236089943838'],
 ['2.006', '110.57471384341925'],
 ['2.098', '114.10017544334872'],
 ['2.193', '120.51135990363724'],
 ['2.302', '125.75356699020317'],
 ['2.402', '129.9848906541477'],
 ['2.508', '135.67840981521636'],
 ['2.599', '139.99229760200282'],
 ['2.691', '145.23099677011123'],
 ['2.799', '149.00349744583'],
 ['2.893', '155.49635992161282'],
 ['2.992', '158.88764234748376'],
 ['3

We have to convert our data to floats, so we could do it this way:

In [3]:
data = [[float(t), float(x)] for t, x in raw_data]
data

[[0.092, 14.765674972872079],
 [0.2, 20.259226923447223],
 [0.296, 25.246364712175524],
 [0.39, 28.59196014284041],
 [0.494, 35.5838751542487],
 [0.605, 39.92405609554009],
 [0.699, 44.900143003396344],
 [0.806, 50.111998705949176],
 [0.89, 55.33744839374389],
 [1.003, 61.13682148020512],
 [1.109, 64.5004524657241],
 [1.195, 69.43382286277865],
 [1.304, 75.21429131416473],
 [1.394, 80.68024464266546],
 [1.51, 85.69045993029255],
 [1.596, 90.67365382390035],
 [1.699, 94.26145999596629],
 [1.801, 100.1501806890637],
 [1.893, 104.02236089943838],
 [2.006, 110.57471384341925],
 [2.098, 114.10017544334872],
 [2.193, 120.51135990363724],
 [2.302, 125.75356699020317],
 [2.402, 129.9848906541477],
 [2.508, 135.67840981521636],
 [2.599, 139.99229760200282],
 [2.691, 145.23099677011123],
 [2.799, 149.00349744583],
 [2.893, 155.49635992161282],
 [2.992, 158.88764234748376],
 [3.11, 164.85985880020678],
 [3.196, 168.9285892684912],
 [3.297, 175.05784410369083],
 [3.403, 180.14110291795092],
 [3.50

And then we can load this data up into a NumPy array:

In [4]:
data = np.array(data)
data

array([[9.20000000e-02, 1.47656750e+01],
       [2.00000000e-01, 2.02592269e+01],
       [2.96000000e-01, 2.52463647e+01],
       [3.90000000e-01, 2.85919601e+01],
       [4.94000000e-01, 3.55838752e+01],
       [6.05000000e-01, 3.99240561e+01],
       [6.99000000e-01, 4.49001430e+01],
       [8.06000000e-01, 5.01119987e+01],
       [8.90000000e-01, 5.53374484e+01],
       [1.00300000e+00, 6.11368215e+01],
       [1.10900000e+00, 6.45004525e+01],
       [1.19500000e+00, 6.94338229e+01],
       [1.30400000e+00, 7.52142913e+01],
       [1.39400000e+00, 8.06802446e+01],
       [1.51000000e+00, 8.56904599e+01],
       [1.59600000e+00, 9.06736538e+01],
       [1.69900000e+00, 9.42614600e+01],
       [1.80100000e+00, 1.00150181e+02],
       [1.89300000e+00, 1.04022361e+02],
       [2.00600000e+00, 1.10574714e+02],
       [2.09800000e+00, 1.14100175e+02],
       [2.19300000e+00, 1.20511360e+02],
       [2.30200000e+00, 1.25753567e+02],
       [2.40200000e+00, 1.29984891e+02],
       [2.508000
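
As an aside, the read, convert, and load steps can probably be collapsed into one call to `np.loadtxt`. The sketch below uses an in-memory stand-in for `data.csv` so that it runs on its own. The header names `t,x` are an assumption about the file, and the three rows are copied from the output shown earlier.

```python
import io
import numpy as np

# Stand-in for data.csv: the header names are an assumption; the three
# rows are copied from the output shown earlier.
sample_csv = io.StringIO(
    "t,x\n"
    "0.092,14.765674972872079\n"
    "0.2,20.259226923447223\n"
    "0.296,25.246364712175524\n"
)

# skiprows=1 drops the header line; each remaining field is parsed as a float.
sample = np.loadtxt(sample_csv, delimiter=",", skiprows=1)
print(sample.shape)  # (3, 2)
```

With the real file, `np.loadtxt('data.csv', delimiter=',', skiprows=1)` should give the same array as above, assuming the file has a single header row and comma-separated values.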

Now that we have our data in a NumPy array, we can calculate the differences in the `t` values and the `x` values this way:

In [5]:
delta_t = data[1:, 0] - data[:-1, 0]
delta_t

array([0.108, 0.096, 0.094, 0.104, 0.111, 0.094, 0.107, 0.084, 0.113,
       0.106, 0.086, 0.109, 0.09 , 0.116, 0.086, 0.103, 0.102, 0.092,
       0.113, 0.092, 0.095, 0.109, 0.1  , 0.106, 0.091, 0.092, 0.108,
       0.094, 0.099, 0.118, 0.086, 0.101, 0.106, 0.105, 0.098, 0.089,
       0.112, 0.102, 0.099, 0.084, 0.103, 0.113, 0.086, 0.101, 0.104,
       0.106, 0.087, 0.108, 0.099, 0.105, 0.099, 0.098, 0.108, 0.091,
       0.1  , 0.106, 0.098, 0.088, 0.108, 0.102, 0.105, 0.091, 0.111,
       0.096, 0.103, 0.094, 0.094, 0.105, 0.099, 0.107, 0.091, 0.093,
       0.101, 0.115, 0.103, 0.098, 0.1  , 0.087, 0.111, 0.087, 0.112,
       0.086, 0.103, 0.098, 0.106, 0.107, 0.098, 0.102, 0.097, 0.094,
       0.097, 0.114, 0.086, 0.107, 0.107, 0.095, 0.094, 0.104, 0.092])

In [6]:
delta_x = data[1:, 1] - data[:-1, 1]
delta_x

array([5.49355195, 4.98713779, 3.34559543, 6.99191501, 4.34018094,
       4.97608691, 5.2118557 , 5.22544969, 5.79937309, 3.36363099,
       4.9333704 , 5.78046845, 5.46595333, 5.01021529, 4.98319389,
       3.58780617, 5.88872069, 3.87218021, 6.55235294, 3.5254616 ,
       6.41118446, 5.24220709, 4.23132366, 5.69351916, 4.31388779,
       5.23869917, 3.77250068, 6.49286248, 3.39128243, 5.97221645,
       4.06873047, 6.12925484, 5.08325881, 5.53027051, 3.83900341,
       5.57787913, 5.30294698, 5.20821309, 4.43967735, 4.85431361,
       4.12607127, 6.6884015 , 4.2947909 , 4.44810383, 5.59862584,
       4.72659501, 4.69967985, 6.22392009, 4.21258033, 4.55307339,
       6.60443984, 3.17419266, 6.27195317, 4.68220561, 4.86311587,
       5.33692035, 4.3533792 , 4.06025552, 6.61132684, 4.45682406,
       4.88490988, 5.56644901, 5.00376658, 5.81802311, 3.49455089,
       5.57134869, 4.32461004, 6.50352017, 4.57637459, 4.92455496,
       3.99038786, 5.11638838, 4.7968517 , 5.26003847, 4.91070

And we can then calculate the rates of change this way:

In [7]:
rates = delta_x / delta_t
rates

array([50.86622176, 51.94935197, 35.59144075, 67.22995203, 39.1007292 ,
       52.93709476, 48.7089318 , 62.20773438, 51.32188572, 31.73236779,
       57.36477206, 53.03182065, 60.73281476, 43.1915111 , 57.94411504,
       34.83306963, 57.73255581, 42.08891533, 57.98542428, 38.32023478,
       67.48615221, 48.093643  , 42.31323664, 53.71244492, 47.40536029,
       56.94238226, 34.93056181, 69.07300506, 34.25537804, 50.61200384,
       47.3108194 , 60.68569144, 47.95527183, 52.66924298, 39.17350419,
       62.6727992 , 47.34774089, 51.06091262, 44.84522578, 57.78944777,
       40.05894433, 59.18939376, 49.93942905, 44.04063203, 53.8329408 ,
       44.59051897, 54.01930861, 57.62888975, 42.55131646, 43.3626037 ,
       66.71151349, 32.38972102, 58.07364049, 51.45280885, 48.63115874,
       50.34830522, 44.42223678, 46.13926723, 61.21598928, 43.69435355,
       46.52295121, 61.16976936, 45.07897815, 60.60440737, 33.92767856,
       59.26966689, 46.00648977, 61.93828733, 46.22600595, 46.02

Of course, we could just do all this in one step as well since we want to perform the same difference calculations on each column:

In [8]:
delta = data[1:] - data[:-1]
delta

array([[0.108     , 5.49355195],
       [0.096     , 4.98713779],
       [0.094     , 3.34559543],
       [0.104     , 6.99191501],
       [0.111     , 4.34018094],
       [0.094     , 4.97608691],
       [0.107     , 5.2118557 ],
       [0.084     , 5.22544969],
       [0.113     , 5.79937309],
       [0.106     , 3.36363099],
       [0.086     , 4.9333704 ],
       [0.109     , 5.78046845],
       [0.09      , 5.46595333],
       [0.116     , 5.01021529],
       [0.086     , 4.98319389],
       [0.103     , 3.58780617],
       [0.102     , 5.88872069],
       [0.092     , 3.87218021],
       [0.113     , 6.55235294],
       [0.092     , 3.5254616 ],
       [0.095     , 6.41118446],
       [0.109     , 5.24220709],
       [0.1       , 4.23132366],
       [0.106     , 5.69351916],
       [0.091     , 4.31388779],
       [0.092     , 5.23869917],
       [0.108     , 3.77250068],
       [0.094     , 6.49286248],
       [0.099     , 3.39128243],
       [0.118     , 5.97221645],
       [0.

And then the rates are simply:

In [9]:
rates = delta[:, 1] / delta[:, 0]
rates

array([50.86622176, 51.94935197, 35.59144075, 67.22995203, 39.1007292 ,
       52.93709476, 48.7089318 , 62.20773438, 51.32188572, 31.73236779,
       57.36477206, 53.03182065, 60.73281476, 43.1915111 , 57.94411504,
       34.83306963, 57.73255581, 42.08891533, 57.98542428, 38.32023478,
       67.48615221, 48.093643  , 42.31323664, 53.71244492, 47.40536029,
       56.94238226, 34.93056181, 69.07300506, 34.25537804, 50.61200384,
       47.3108194 , 60.68569144, 47.95527183, 52.66924298, 39.17350419,
       62.6727992 , 47.34774089, 51.06091262, 44.84522578, 57.78944777,
       40.05894433, 59.18939376, 49.93942905, 44.04063203, 53.8329408 ,
       44.59051897, 54.01930861, 57.62888975, 42.55131646, 43.3626037 ,
       66.71151349, 32.38972102, 58.07364049, 51.45280885, 48.63115874,
       50.34830522, 44.42223678, 46.13926723, 61.21598928, 43.69435355,
       46.52295121, 61.16976936, 45.07897815, 60.60440737, 33.92767856,
       59.26966689, 46.00648977, 61.93828733, 46.22600595, 46.02
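
NumPy also provides `np.diff`, which computes the same consecutive differences as the slicing approach. A minimal sketch, using only the first three rows of the data so that it runs on its own (the name `sample` is illustrative):

```python
import numpy as np

# First three rows of the data set, copied from the output above.
sample = np.array([
    [0.092, 14.765674972872079],
    [0.2, 20.259226923447223],
    [0.296, 25.246364712175524],
])

# axis=0 differences consecutive rows, so each row of `delta` is (delta_t, delta_x).
delta = np.diff(sample, axis=0)
rates = delta[:, 1] / delta[:, 0]
print(rates)  # close to the first two rates shown above
```

Either form works. Slicing makes the "next row minus current row" logic explicit, while `np.diff` is shorter.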

We can then calculate the minimum, maximum, mean, and standard deviation of the rates of change:

In [10]:
np.amin(rates)

29.42739859222142

In [11]:
np.amax(rates)

69.07300506151955

In [12]:
np.mean(rates)

49.98125178748103

In [13]:
np.std(rates)

9.043463532187504
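
One detail worth knowing: `np.std` computes the population standard deviation by default (dividing by $n$). If the rates are treated as a sample, passing `ddof=1` divides by $n - 1$ instead. A small sketch with made-up values (the array `values` is purely illustrative and not taken from `data.csv`):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])  # illustrative values, not from data.csv

population_std = np.std(values)        # divides by n (ddof=0, the default)
sample_std = np.std(values, ddof=1)    # divides by n - 1

print(population_std, sample_std)
```

The sample version is always slightly larger, and the gap shrinks as the number of observations grows.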

#### Question 2

In linear regression we try to find the coefficients `m` (slope) and `c` (y-intercept) of a straight line

$$
y = mx + c
$$

that provides the "best" fit given some `x` and `y` data. This formula then allows us to predict `y` values for given `x` values.

Given an array of `n` `(x, y)` data pairs, these coefficients can be calculated very simply.

A bit of terminology first:

- Let `X` mean the column of `x` values.
- Let `Y` mean the column of `y` values.
- Let `XX` mean a column calculated by multiplying each `x` in the `X` column by itself.
- Let `XY` mean a column calculated by multiplying (pairwise) the `x` and `y` values from the `X` and `Y` columns.

Then, given some column (say `X`), this symbol: $\sum{X}$ means the sum of all the elements in the column.

Similarly, the symbol $\sum{XY}$ means the sum of the values obtained by multiplying (pairwise) the values in `X` and `Y`.

Given those definitions, the formulas for calculating the "best" values of `m` and `c` are given by:

$$
m = \frac{n\sum{XY} - \sum{X}\sum{Y}}{n\sum{XX} - (\sum{X})^2}
$$

$$
c = \frac{\sum{Y}\sum{XX} - \sum{X}\sum{XY}}{n\sum{XX} - (\sum{X})^2}
$$

(where `n` is the number of `(x,y)` pairs in our data set.)
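
To make the formulas concrete, here is a minimal sketch on four made-up points that lie exactly on the line $y = 2x + 1$, so the formulas should recover $m = 2$ and $c = 1$. The points and the variable names are illustrative assumptions, not part of the data set.

```python
import numpy as np

# Made-up points lying exactly on y = 2x + 1.
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([1.0, 3.0, 5.0, 7.0])
n = len(X)

sum_x, sum_y = np.sum(X), np.sum(Y)
sum_xx, sum_xy = np.sum(X * X), np.sum(X * Y)

# Both formulas share the same denominator.
denominator = n * sum_xx - sum_x ** 2
m = (n * sum_xy - sum_x * sum_y) / denominator
c = (sum_y * sum_xx - sum_x * sum_xy) / denominator
print(m, c)  # 2.0 1.0
```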

Using the same data we saw in Question 1, calculate the values for `m` and `c` for that data set given the formulas above.

You can think of the `t` column in the data as the `X` column, and the `x` values in the data as the `Y` column - we are trying to predict the value of `x` given a value of `t`.

This will result in a straight line that "best" fits through the data.

Compare the slope of this regression line to the average rate of change you calculated in Question 1.

##### Solution

We already saw how to load the data in Question 1.

I'll do the import, the conversion to floats, and the loading into a NumPy array in a single step to simplify our earlier code a bit.

In [14]:
import numpy as np
import csv

with open('data.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip header row
    data = np.array([[float(t), float(x)] for t, x in reader])
    
data


array([[9.20000000e-02, 1.47656750e+01],
       [2.00000000e-01, 2.02592269e+01],
       [2.96000000e-01, 2.52463647e+01],
       [3.90000000e-01, 2.85919601e+01],
       [4.94000000e-01, 3.55838752e+01],
       [6.05000000e-01, 3.99240561e+01],
       [6.99000000e-01, 4.49001430e+01],
       [8.06000000e-01, 5.01119987e+01],
       [8.90000000e-01, 5.53374484e+01],
       [1.00300000e+00, 6.11368215e+01],
       [1.10900000e+00, 6.45004525e+01],
       [1.19500000e+00, 6.94338229e+01],
       [1.30400000e+00, 7.52142913e+01],
       [1.39400000e+00, 8.06802446e+01],
       [1.51000000e+00, 8.56904599e+01],
       [1.59600000e+00, 9.06736538e+01],
       [1.69900000e+00, 9.42614600e+01],
       [1.80100000e+00, 1.00150181e+02],
       [1.89300000e+00, 1.04022361e+02],
       [2.00600000e+00, 1.10574714e+02],
       [2.09800000e+00, 1.14100175e+02],
       [2.19300000e+00, 1.20511360e+02],
       [2.30200000e+00, 1.25753567e+02],
       [2.40200000e+00, 1.29984891e+02],
       [2.508000

So here, the `X` column is the first column (the time column), and the `Y` column is the second column (the observed value column).

We can certainly assign those individual columns to variable names:

In [15]:
X = data[:, 0]
Y = data[:, 1]

We also need the value for `n`:

In [16]:
n = len(X)

Then we can simply use NumPy's element-wise (vectorized) operations together with `np.sum` to implement our formulas:

$$
m = \frac{n\sum{XY} - \sum{X}\sum{Y}}{n\sum{XX} - (\sum{X})^2}
$$

In [17]:
m = (n * np.sum(X * Y) - np.sum(X) * np.sum(Y)) / (n * np.sum(X * X) - (np.sum(X)) ** 2)
m

49.978008206387344

$$
c = \frac{\sum{Y}\sum{XX} - \sum{X}\sum{XY}}{n\sum{XX} - (\sum{X})^2}
$$

In [18]:
c = (np.sum(Y) * np.sum(X * X) - np.sum(X) * np.sum(X * Y)) / (n * np.sum(X * X) - (np.sum(X)) ** 2)
c

10.081268844890284

So the "best" straight line through our data is given by:

$$
x = m t + c \approx 49.98\, t + 10.08
$$

Comparing the slope `m` we obtained here (`49.978`) with the average rate of change we calculated in Question 1 (`49.981`), we see that the two values are very close.

(I won't get into the math here, but if $y = f(x)$, then the rate of change of $y$ (w.r.t. $x$) is given by the derivative of the function (i.e. $\frac{df}{dx}$). And for a linear equation, that derivative is the slope `m`.)
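
As a sanity check on the closed-form formulas, the result can be compared against `np.polyfit`, which fits a least-squares polynomial (degree 1 here, i.e. a straight line). The sketch below uses only the first five rows of the data so that it runs on its own. The names `m_fit` and `c_fit` are illustrative.

```python
import numpy as np

# First five rows of the data set, copied from the output in Question 1.
X = np.array([0.092, 0.2, 0.296, 0.39, 0.494])
Y = np.array([14.765674972872079, 20.259226923447223, 25.246364712175524,
              28.59196014284041, 35.5838751542487])
n = len(X)

# Closed-form coefficients from the formulas above.
denominator = n * np.sum(X * X) - np.sum(X) ** 2
m = (n * np.sum(X * Y) - np.sum(X) * np.sum(Y)) / denominator
c = (np.sum(Y) * np.sum(X * X) - np.sum(X) * np.sum(X * Y)) / denominator

# np.polyfit returns coefficients highest power first: [slope, intercept].
m_fit, c_fit = np.polyfit(X, Y, 1)
print(np.isclose(m, m_fit), np.isclose(c, c_fit))
```

Running the same comparison on the full `X` and `Y` arrays from the solution should show the same agreement, up to floating-point rounding.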