#### https://online.stat.psu.edu/stat505/lesson/2

This lesson is concerned with linear combinations (linear transformations) of the variables. With the coefficients $c_1, \dots, c_p$ collected in the p x 1 vector $c$:
$$
Y = c_1 X_1 + c_2 X_2 + \dots + c_p X_p = X c
$$

In [1]:
import pandas as pd 
import numpy as np
from IPython.display import Image

#### Introductory Example

In [2]:
# sep=r"\s+" splits on runs of whitespace; delim_whitespace is deprecated in recent pandas
df = pd.read_table("nutrient.txt", sep=r"\s+", header=None, index_col=0)

In [3]:
df.columns = ['calcium', 'iron', 'protein', 'vitamin A', 'vitamin C']

We may want to know the total intake of vitamins A and C. Vitamin A is measured in micrograms while vitamin C is measured in milligrams, so we convert vitamin A to milligrams (1 microgram = 0.001 milligrams) before adding:
$$
Y = 0.001 X_4 + X_5
$$

In matrix form, the coefficient vector **c** is:
$$
c = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0.001 \\ 1 \end{bmatrix}
$$

In [4]:
c = np.matrix([0, 0, 0, 0.001, 1]).T
c

matrix([[0.   ],
        [0.   ],
        [0.   ],
        [0.001],
        [1.   ]])

In [7]:
naive_form = 0.001 * df['vitamin A'] + df['vitamin C']

In [11]:
matrix_form = df.values @ c

In [22]:
(naive_form != np.array(matrix_form).flatten()).sum()

0
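
The exact comparison above finds no mismatches for this data, but two floating-point computations that take different routes can differ in their last bits. A tolerance-based check with `np.allclose` is usually more robust. The sketch below uses a synthetic array `X` as a stand-in for the nutrient data (an assumption made only so the example is self-contained):

```python
import numpy as np

# Synthetic stand-in for the nutrient data (assumption: 5 numeric columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))

# Coefficient vector for Y = 0.001 * X_4 + X_5.
c = np.array([0, 0, 0, 0.001, 1.0])

naive = 0.001 * X[:, 3] + X[:, 4]   # column-by-column form
matrix = X @ c                      # matrix-vector form

# Compare within a floating-point tolerance instead of exact equality.
same = np.allclose(naive, matrix)
print(same)
```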

#### LC Mean

$$
Y = X c = c_1 X_1 + ... + c_p X_p
$$

$$
E[Y] = c_1 E[X_1] + ... + c_p E[X_p] = c_1 \mu_1 + ... + c_p \mu_p
$$

Recall that if X is n x p, $E[X]$ collects the mean of each feature (each column), so $E[X] = \mu_X$ is 1 x p.

$E[Y]$ is a linear combination of the elements of $E[X]$, which is a scalar:
$$
E[Y] = E[X] c
$$
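
The sample version of this identity follows from the linearity of averaging. As a short derivation, write $x_i$ for the i-th row of X (1 x p) and $\bar{x}$ for the 1 x p row vector of column means (notation introduced here only for illustration):
$$
\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i = \frac{1}{n} \sum_{i=1}^n x_i c = \left( \frac{1}{n} \sum_{i=1}^n x_i \right) c = \bar{x} c
$$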

In [31]:
matrix_form_mean = df.mean() @ c

In [32]:
matrix_form_mean

array([79.76808175])

In [33]:
naive_form_mean = 0.001 * df['vitamin A'].mean() + df['vitamin C'].mean()

In [34]:
print(matrix_form_mean)
print(naive_form_mean)

[79.76808175]
79.76808175033923


#### LC Variance

Notation: the covariance matrix is $\Sigma$, and the covariance between $X_1$ and $X_2$ is $\Sigma_{12}$.

$$
Y = c_1 X_1 + ... + c_p X_p
$$

Here we do not have to assume that the $X_i$ are independent and identically distributed; the covariance terms account for any dependence between them:
$$
var(Y) = c_1^2 var(X_1) + ... + c_p^2 var(X_p) + \sum_{i \neq j} c_i c_j cov(X_i, X_j)
$$

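Later you will see the sample version of this expression, where $s_i^2$ is the sample variance of $X_i$ and $s_{jk}$ is the sample covariance of $X_j$ and $X_k$. Each unordered pair is counted once, hence the factor of 2: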
$$
var(Y) = \sum_{i=1}^p c_i^2 s_i^2 + 2 \sum_{j<k} c_j c_k s_{jk}
$$
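
The expanded sample formula can be checked numerically. The sketch below uses synthetic data and arbitrary coefficients (both assumptions chosen for illustration, not taken from the nutrient file) and compares the expanded sum with the sample variance of $Y$ computed directly:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))          # synthetic data: 50 rows, 4 variables
c = np.array([0.5, -1.0, 2.0, 0.25])  # arbitrary illustrative coefficients

S = np.cov(X, rowvar=False)           # sample covariance matrix (divides by n - 1)
p = len(c)

# Variance terms: sum_i c_i^2 s_i^2
var_terms = sum(c[i] ** 2 * S[i, i] for i in range(p))

# Covariance terms: 2 * sum_{j<k} c_j c_k s_jk
cov_terms = 2 * sum(c[j] * c[k] * S[j, k]
                    for j in range(p) for k in range(j + 1, p))

expanded = var_terms + cov_terms
direct = (X @ c).var(ddof=1)          # sample variance of Y computed directly
print(np.isclose(expanded, direct))
```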

Since $cov(X_i, X_i) = var(X_i)$, the variance terms are just the $i = j$ terms of a single double sum, so var(Y) can be written as:
$$
var(Y) = \sum_{i=1}^p \sum_{j=1}^p c_i c_j cov(X_i, X_j) = \sum_{i=1}^p \sum_{j=1}^p c_i c_j \Sigma_{ij}
$$

Recall that for a product of the form row vector times matrix times column vector, where the row vector $r$ is 1 x n, the matrix $B$ is n x k, and the column vector $c$ is k x 1, we have
$$
r B c = \sum_{i=1}^n \sum_{j=1}^k r_i B_{ij} c_j = \sum_{i=1}^n \sum_{j=1}^k r_i c_j B_{ij}
$$
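
This identity is easy to confirm numerically. A minimal sketch with small random arrays (the names `r`, `B`, and `col` and their sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 3, 4
r = rng.normal(size=n)        # row vector, length n
B = rng.normal(size=(n, k))   # n x k matrix
col = rng.normal(size=k)      # column vector, length k

# Explicit double sum over every (i, j) pair.
double_sum = sum(r[i] * B[i, j] * col[j] for i in range(n) for j in range(k))

# The same quantity as a chained matrix product.
product = r @ B @ col
print(np.isclose(double_sum, product))
```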

In [37]:
matrix_form.shape

(737, 1)

Connecting this fact with the double-sum formula for $var(Y)$, we can write var(Y) in the form $r B c$ with $r = c^T$ and $B = \Sigma$:

$$
var(Y) = c^T \Sigma c
$$

In [49]:
matrix_form.var(ddof=1)

5463.059027756587

In [58]:
c.T @ df.cov().values @ c

matrix([[5463.05902776]])