### Exercises

#### Question 1

The accompanying file `data.csv` contains observations of a value `x` recorded at time `t`.

Given this data, we want to calculate the rate of change of this value over time. We'll do this by taking two consecutive observations, say $x(t_i)$ and $x(t_{i+1})$, and approximating the rate of change using this formula:

$$
v(t_{i+1}) = \frac{x(t_{i+1}) - x(t_i)}{t_{i+1} - t_i}
$$

For example, if the data looks like this:

```
t     x
0.1   10
0.2   12
0.4   14
0.5   15
```

Then the first row of data would be considered $t_0$, the second row $t_1$, etc.

We can then start approximating the rate of change at $v_1 = v(t_1)$, which would be calculated as:

$$
v_1 = \frac{12 - 10}{0.2 - 0.1} = 20.0
$$

Similarly, $v_2$ would be calculated as:

$$
v_2 = \frac{14 - 12}{0.4 - 0.2} = 10.0
$$
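The two hand calculations above can be checked with a minimal sketch. It assumes the four sample rows are typed in directly as arrays (the names `t`, `x` and `v` are chosen here for illustration) and uses `np.diff`, which returns the differences between consecutive elements:

```python
import numpy as np

# the four sample rows from the table above
t = np.array([0.1, 0.2, 0.4, 0.5])
x = np.array([10.0, 12.0, 14.0, 15.0])

# np.diff returns consecutive differences, so this computes
# (x[i+1] - x[i]) / (t[i+1] - t[i]) for every adjacent pair of rows
v = np.diff(x) / np.diff(t)

print(v)  # approximately 20, 10 and 10 (v_1, v_2, v_3)
```

Note that four observations yield only three rates of change, since there is no earlier row to pair with $t_0$.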

Use NumPy arrays to create an array that holds the calculated rates of change and determine the minimum, maximum, average and standard deviation of the rate of change.

##### Solution 1

In [2]:
import numpy as np
import csv

In [3]:
file = 'data.csv'

In [123]:
# check out file
with open(file) as f:
    next(f) # skip header
    for _ in range(3):
        print(next(f).strip())

0.092,14.765674972872079
0.2,20.259226923447223
0.296,25.246364712175524


In [56]:
from collections import namedtuple

In [57]:
RateOfChange = namedtuple('RateOfChange', 'min max mean std')

In [87]:
# func uses NumPy, csv, namedtuple module
def calc_rate_of_change(file):
    # read file from disk
    with open(file) as f:
        reader = csv.reader(f)
        header = next(reader)
        data = np.array(list(reader)).astype(np.float64)
    
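    # calculate rate of change: difference in x divided by difference in t
    #roc = np.subtract(data[1:, 1], data[:-1, 1]) / np.subtract(data[1:, 0], data[:-1, 0])
    # alternatively: take row differences, reverse the columns to [dx, dt],
    # then reduce each row by division to get dx / dt
    roc = np.divide.reduce(np.subtract(data[1:], data[:-1])[:, ::-1], 1)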
    
    # return minimum, maximum, mean, and standard deviation
    return RateOfChange(np.amin(roc), np.amax(roc), np.mean(roc), np.std(roc))

In [88]:
calc_rate_of_change(file)

RateOfChange(min=np.float64(29.42739859222142), max=np.float64(69.07300506151955), mean=np.float64(49.98125178748103), std=np.float64(9.043463532187504))
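The `np.divide.reduce` one-liner in the solution is compact but not obvious at first glance. As a rough sketch, assuming the small sample table from the question rather than the real file (the names `sample`, `diffs`, `explicit` and `reduced` are illustrative), the two variants can be compared side by side:

```python
import numpy as np

# sample rows from the question, as (t, x) columns
sample = np.array([[0.1, 10.0],
                   [0.2, 12.0],
                   [0.4, 14.0],
                   [0.5, 15.0]])

# row-wise differences; the columns are now [dt, dx]
diffs = sample[1:] - sample[:-1]

# explicit form: dx / dt, column by column
explicit = diffs[:, 1] / diffs[:, 0]

# reduce form: reverse the columns to [dx, dt], then divide along each row
reduced = np.divide.reduce(diffs[:, ::-1], axis=1)

print(np.allclose(explicit, reduced))
```

Both forms perform the same division per row, so which one to use is mostly a matter of readability.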

#### Question 2

In linear regression we try to find the coefficients `m` (slope) and `c` (y-intercept) of a straight line

$$
y = mx + c
$$

that provides the "best" fit given some `x` and `y` data. This formula then allows us to predict `y` values for given `x` values.

Given an array of `n` `(x, y)` data pairs, these coefficients can be calculated very simply.

A bit of terminology first:

- Let `X` mean the column of `x` values.
- Let `Y` mean the column of `y` values.
- Let `XX` mean a column calculated by multiplying each `x` in the `X` column by itself.
- Let `XY` mean a column calculated by multiplying the `x` and `y` values from the `X` and `Y` columns.

Then, given some column (say `X`), this symbol: $\sum{X}$ means the sum of all the elements in the column.

Similarly, the symbol $\sum{XY}$ means the sum of the values obtained by multiplying (pairwise) the values in `X` and `Y`.

Given those definitions, the formulas for calculating the "best" values of `m` and `c` are given by:

$$
m = \frac{n\sum{XY} - \sum{X}\sum{Y}}{n\sum{XX} - (\sum{X})^2}
$$

$$
c = \frac{\sum{Y}\sum{XX} - \sum{X}\sum{XY}}{n\sum{XX} - (\sum{X})^2}
$$

(where `n` is the number of `(x,y)` pairs in our data set.)
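To make the formulas concrete, here is a minimal sketch on made-up data that lies exactly on the line $y = 2x + 1$ (the arrays and variable names are assumptions for illustration, not the exercise data). It also cross-checks the result against `np.polyfit`, which fits a degree-1 polynomial by least squares:

```python
import numpy as np

# toy data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
n = x.shape[0]

# the sums that appear in the formulas
sum_x, sum_y = np.sum(x), np.sum(y)
sum_xx = np.sum(x * x)
sum_xy = np.sum(x * y)

# both formulas share the same denominator
denom = n * sum_xx - sum_x ** 2
m = (n * sum_xy - sum_x * sum_y) / denom
c = (sum_y * sum_xx - sum_x * sum_xy) / denom

# cross-check: np.polyfit returns coefficients from highest degree down
m_fit, c_fit = np.polyfit(x, y, 1)

print(m, c)
```

For this data the sums are $\sum{X} = 6$, $\sum{Y} = 16$, $\sum{XX} = 14$ and $\sum{XY} = 34$, so $m = 40 / 20 = 2$ and $c = 20 / 20 = 1$, recovering the line the data was generated from.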

Using the same data we saw in Question 1, calculate the values for `m` and `c` for that data set given the formulas above.

You can think of the `t` column in the data as the `X` column, and the `x` values in the data as the `Y` column - we are trying to predict the value of `x` given a value of `t`.

This will result in a straight line that "best" fits through the data.

Compare the slope of this regression line to the average rate of change you calculated in Question 1.

##### Solution 2

Two main things to do:
1. calculate the values for `m` and `c` using the formulas given above
2. compare the slope of the regression line `m` to the average (`mean`) rate of change from *Question 1*

In [111]:
def calc_mc(arr):
    # split the columns and count the (x, y) pairs
    x, y = arr[:, 0], arr[:, 1]
    n = arr.shape[0]

    # sums that appear in both formulas
    sum_x, sum_y = np.sum(x), np.sum(y)
    sum_xx = np.sum(np.square(x))
    sum_xy = np.sum(np.multiply(x, y))

    # shared denominator: (n * sum of xx) - (sum of x)^2
    denom = n * sum_xx - np.square(sum_x)

    # slope m, numerator: (n * sum of xy) - (sum of x * sum of y)
    m = np.divide(n * sum_xy - sum_x * sum_y, denom)

    # y-intercept c, numerator: (sum of y * sum of xx) - (sum of x * sum of xy)
    c = np.divide(sum_y * sum_xx - sum_x * sum_xy, denom)

    # create namedtuple class for LinearReg
    LinearReg = namedtuple('LinearReg', 'm c')

    return LinearReg(m, c)

In [112]:
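data = np.loadtxt(file, delimiter=',', skiprows=1)  # load the same CSV as Question 1, skipping the header
calc_mc(data)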

LinearReg(m=np.float64(49.978008206387344), c=np.float64(10.081268844890284))

In [124]:
# compare regression slope `m` to average rate of change in Question 1
linear_reg = calc_mc(data)

# grab average rate of change from Question 1
roc = calc_rate_of_change(file)

# find percentage difference between slope and average
def calc_percdiff(slope, avg):
    # use the arguments rather than the globals so the function works for any pair of values
    perc_diff = np.abs((avg - slope) / ((slope + avg) / 2)) * 100
    result = f'Both values are very similar with a percentage difference of {float(np.around(perc_diff, 4))}%: not up to 1%' \
            if perc_diff < 1 else 'Both values differ by 1% or more'
    return result

calc_percdiff(linear_reg.m, roc.mean)

'Both values are very similar with a percentage difference of 0.0065%: not up to 1%'