# **Q1)**

> Study on infant respiratory disease

> Dependent variable: binary category (1/0 flag) indicating whether or not an infant developed the disease in the first year of life

> Explanatory variables: **(1)** gender of each infant (binary category boy/girl); **(2)** how fed (three categories: breast-fed, bottle-fed, supplement)

Fitted model in R:

```R
fit <- glm(disease/(disease + nondisease) ~ gender + food,
family = binomial, weights = disease + nondisease, data=babyfood)
```

### a) State the model that has been fitted.

The model fitted is a **logistic regression model**, i.e. a binomial GLM whose linear predictor contains main effects for **gender** and **food**, and whose response is the proportion of infants in each group who developed the disease. This model has been implemented in R via GLM using the binomial family with no link function specified. As no link function has been explicitly given, R will have used the canonical link function for the binomial distribution, which is the logit link function. The weights, $w_{i}$, of the model are the total number of infants (with and without the disease) in each group.

$g_{\text{canonical}}(\pi_{i}) = g_{\text{logit}}(\pi_{i}) = \ln\frac{\pi_{i}}{(1 - \pi_{i})}$
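As a small illustration of the logit link and its inverse, the sketch below uses base R's `qlogis()` and `plogis()`. The probabilities in `pi.example` are made-up values for illustration, not taken from the babyfood data.

```r
# hypothetical probabilities, for illustration only
pi.example <- c(0.1, 0.5, 0.9)

# logit link: maps a probability in (0, 1) onto the real line
eta.example <- qlogis(pi.example)

# the same values from the formula ln(pi / (1 - pi))
eta.by.hand <- log(pi.example / (1 - pi.example))

# inverse logit: maps the linear predictor back to a probability
pi.recovered <- plogis(eta.example)
```

A probability of 0.5 corresponds to a linear predictor of 0, and probabilities either side of it map to negative and positive values respectively.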

### b) Interpret the significance of the parameters for each explanatory variable.

The **girl** parameter for the **gender** explanatory variable is significant at the 5% level, suggesting that gender plays a significant role in whether or not an infant develops respiratory disease in the first year of their lives. The **breast-fed** parameter for the **food** explanatory variable is highly significant, with a $p$-value of $1.22 \times 10^{-5}$ meaning the parameter is significant at the 0.1% level.

The **supplement** parameter has a large standard error with respect to its parameter estimate, and thus has a small $z$-score which is not significant at the 5% or even the 10% level. Therefore the model suggests that taking food supplements has no significant effect on whether an infant develops respiratory disease in the first year of their life.

When dealing with categorical variables in R, one of the classes of the variable is included in the intercept. With us having two categorical variables, *one of the classes from each variable* would have been included in the intercept. For **gender** this would have been the class **boy**, and for **food** this would have been the class **bottle-fed**. This boy / bottle-fed hybrid baseline parameter is highly significant, with a $p$-value below $2 \times 10^{-16}$, and is therefore clearly significant at the 0.1% significance level. To untangle these parameters from each other to help with our interpretation of the model, we could choose a different class order for each explanatory variable, and thus have different parameters contribute towards the intercept parameter.
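The reordering of classes described above can be sketched with base R's `relevel()`. The factor `food.example` and its level names are assumptions made for illustration; the actual coding in the babyfood data may differ.

```r
# hypothetical factor mirroring the three feeding categories
food.example <- factor(c("Bottle", "Breast", "Suppl", "Breast"))

# default baseline: the first level in alphabetical order
baseline.default <- levels(food.example)[1]

# make breast-fed the baseline class instead
food.releveled <- relevel(food.example, ref = "Breast")
baseline.new <- levels(food.releveled)[1]
```

Refitting the GLM with the releveled factor would change which class is absorbed into the intercept, without changing the fitted probabilities.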

### c) Does the model provide a good fit?

By looking at the residual deviance of the model in R which represents the model deviance, and knowing that the scale parameter for a binomial GLM, $\phi$, is equal to 1, we know that the residual deviance = deviance, $D$ = **scaled deviance**, $S$, and thus we can use the residual deviance in a goodness-of-fit test:

$S = D/ \phi = D \sim \chi^{2}_{n-p}$

Plugging in the scaled deviance of $0.72192$ and the degrees of freedom of $6-4=2$, we produce the model $p$-value from the following R code:

In [52]:
deviance <- 0.72192

n <- 6

p <- 4

df.n.sub.p <- n-p

pchisq(deviance, df.n.sub.p, lower.tail = FALSE)

With a $p$-value of $0.70$ we do not have sufficient evidence to reject the null hypothesis that the current model is adequate relative to the saturated model, which has $n-p$ additional parameters. We therefore have no evidence that those $n-p$ additional parameters differ from zero, and hence no evidence that our model requires additional parameters. Our model is deemed adequate.

**We can therefore say that the model provides a good fit**.

It's interesting to look at the explanatory variable contributions to the deviance in the ANOVA table, and it's clear that the **food** variable has the largest and most significant impact of the two variables in reducing the residual deviance.

As we search for an adequate model that's also parsimonious, we may wish to re-fit the model without **gender** and see how the $p$-value of the model changes.

### d) R code to calculate p-value of the model

As shown in **c)**, the code to calculate the $p$-value for the model:

In [53]:
pchisq(deviance, df.n.sub.p, lower.tail = FALSE)

### e) Interpret the effect of breast-feeding and derive a 95% confidence interval for it

The breast-feeding effect has an estimated coefficient of $-0.6693$ with a standard error of $0.1530$, giving a $z$ statistic of $-4.374$, which is highly significant. The negative sign means that, for a given gender, breast-fed infants have lower log-odds of developing the disease than the bottle-fed baseline.

The two-sided $p$-value associated with that $z$ statistic (also given in the R summary) is $1.22 \times 10^{-5}$, so the breast-feeding effect is significant at the 0.1% level. It can be reproduced using the following code:

In [50]:
2*pnorm(-4.374, lower.tail = TRUE)

The number of standard errors to produce a 95% confidence interval is given by:

In [37]:
# using the standard normal distribution
num.se.95 <- qnorm(0.025, lower.tail = FALSE)

num.se.95

This is a well-known value: an interval of $\pm 1.96$ standard errors around the estimate gives a 95% confidence interval when the estimator is (approximately) normally distributed.

We can now plug in these values to produce the 95% confidence interval:

In [38]:
parameter.estimate <- -0.6693
parameter.se <- 0.1530

parameter.estimate + c(-num.se.95*parameter.se,num.se.95*parameter.se)

Therefore the 95% confidence interval for the breast feeding coefficient is:

> $[-0.969, -0.369]$
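To help with interpretation, the coefficient and its interval can be moved from the log-odds scale to the odds-ratio scale by exponentiating. The sketch below reuses the estimate and standard error quoted above; the variable names are new and chosen for illustration.

```r
# values quoted from the model summary above
beta.hat <- -0.6693
beta.se <- 0.1530

# 95% Wald interval on the log-odds scale
ci.log.odds <- beta.hat + c(-1, 1) * qnorm(0.975) * beta.se

# exponentiate to move to the odds-ratio scale
odds.ratio <- exp(beta.hat)
ci.odds.ratio <- exp(ci.log.odds)
```

The odds ratio is roughly $0.51$ with an interval of roughly $[0.38, 0.69]$, so for a given gender the odds of disease for a breast-fed infant are estimated to be about half those of a bottle-fed infant.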

### f) Give an unbiased estimator of the scale parameter, $\phi$.

I.e. what estimator provides: $E[\hat{\phi}] = \phi$?

We can use the property that the expectation of a $\chi^{2}$ random variable is the number of degrees of freedom of that random variable.

If 

> $S = \frac{D}{\phi} \sim \chi^{2}_{n-p}$,

then 

> $E[\frac{D}{\phi}] = n-p$,

and therefore

> $E[\frac{D}{n-p}] = \phi$

and thus $\frac{D}{n-p}$ is an unbiased estimator for $\phi$:

> $\hat{\phi} = \frac{D}{n-p}$

In [54]:
scale.estimate <- deviance / df.n.sub.p
scale.estimate

> $\hat{\phi} = 0.36096$

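As a sanity check, for a binomial model we'd expect this to be fairly close to 1. With only 2 degrees of freedom the estimate is very variable, so a value of $0.36$ isn't alarming, and being below 1 it gives no indication of overdispersion.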

### g) Which model would be most reasonable to endorse between:

```r
fit2 <- glm(disease/(disease + nondisease) ~ gender + food,
family = binomial(link="identity"), weights = disease + nondisease)
```

*and*

```r
fit3 <- glm(disease/(disease + nondisease) ~ gender + food,
family = binomial(link="probit"), weights = disease + nondisease)
```

The only difference between the two models is the choice of link function. I think the **fit3** model with the **probit** link function would be the superior model compared with the **fit2** model, which uses the **identity** link function.

Link functions are the mathematical machinery within GLMs to fit non-linear models, mapping the mean value for observation $i$ (i.e. the prediction) to the linear predictor, $\eta_{i}$, via $g(\mu_{i}) = \eta_{i}$. The **identity** link function, as the name suggests, provides no non-linearity in the mapping between $\mu_{i}$ and $\eta_{i}$, whereby $\mu_{i} = \eta_{i}$, such as is the case in the general linear model.

Since we have seen the non-linear **logit** link function perform well in the above goodness-of-fit tests, we expect that a non-linear S-shaped function, like the **probit** link function, will fit the data better than a linear fit via the **identity** link.

Secondly (but perhaps most importantly), the inverse of the **probit** link function maps the linear predictor from $(-\infty, \infty)$ into the interval $(0, 1)$, which is essential for fitting a model to a **proportion** with a range between $0$ and $1$. The linear **identity** link places no such constraint on the fitted values, which can range from $- \infty$ to $\infty$ and so may fall outside $[0, 1]$.
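The range argument can be illustrated with a short sketch. The linear predictor values in `eta.grid` are made up for illustration and are not fitted values from **fit2** or **fit3**.

```r
# hypothetical linear predictor values, for illustration only
eta.grid <- c(-3, -1, 0, 1, 3)

# fitted means implied by each inverse link
mu.identity <- eta.grid        # identity: mu = eta
mu.probit <- pnorm(eta.grid)   # probit: mu = Phi(eta)

# which fitted means are valid probabilities?
valid.identity <- mu.identity >= 0 & mu.identity <= 1
valid.probit <- mu.probit >= 0 & mu.probit <= 1
```

Every probit-based mean is a valid probability, whereas the identity link returns invalid values whenever the linear predictor strays outside $[0, 1]$.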

### h) For the following model, what is the deviance, $D$, on $d$ degrees of freedom? And what is $d$?

```r
fit4 <- glm(disease/(disease + nondisease) ~ gender*food,
family = binomial, weights = disease + nondisease)
```

Using the ```gender*food``` rather than ```gender+food``` syntax, we're adding the following two interaction terms to the model (in addition to the model already fit with $2$ degrees of freedom):

* ```genderGirl:foodBreast```;
* ```genderGirl:foodSuppl```.

Each of the two interaction terms requires a single degree of freedom.

We will therefore have zero degrees of freedom remaining: $d = 0$.

This means that $n-p = 0$ and therefore that $p$, the number of parameters in the **fit4** model, is equal to the number of observations, $n$. **fit4** is therefore the saturated model.

This will result in us fitting the saturated model, and thus comparing the saturated model (**fit4**) against itself in our hypothesis test. Our generalised likelihood ratio, $\Lambda$, will therefore equal 1, and our scaled deviance $S = -2\ln\Lambda$ will equal 0, as $\ln 1 = 0$.

Finally, since the scaled deviance, $S = 0$, then the deviance, $D = 0$.

* $d = 0$
* $D = 0$.
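The parameter count can be checked without the data by building the design matrices for the six gender-by-food groups. This is a sketch only: the level names `Boy`, `Girl`, `Bottle`, `Breast` and `Suppl` are assumed for illustration.

```r
# the six gender-by-food groups; level names are assumed for illustration
design <- expand.grid(gender = c("Boy", "Girl"),
                      food = c("Bottle", "Breast", "Suppl"))

# design matrices for the additive and the interaction models
X.additive <- model.matrix(~ gender + food, data = design)
X.interaction <- model.matrix(~ gender * food, data = design)

# residual degrees of freedom: n - p
df.additive <- nrow(design) - ncol(X.additive)
df.interaction <- nrow(design) - ncol(X.interaction)
```

The additive model leaves 2 residual degrees of freedom and the interaction model leaves 0, in line with the argument above.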

# **Q2)**

### a) Classification rule derivation

Starting with some definitions:

* $\pi_{i}$ is the prior probability that an individual selected at random belongs to population $i$.
* $C(i|j)$ is the cost of incorrectly allocating an individual to population $i$, when they really belong to population $j$.

**Bayes rule**:

Allocate to population 1 if:

> $\dfrac{f_{1}(\mathbf{x})}{f_{2}(\mathbf{x})} \geq \dfrac{\pi_{2} C(1|2)}{\pi_{1}C(2|1)}$,

otherwise allocate population 2.

&nbsp;

In the multivariate normal case (with $p$ variables), where the observation vectors $\mathbf{X}_{i} \sim \text{MVN}_{p}(\mathbf{\mu_{i}}, \Sigma)$, then our two probability density functions for population 1 and 2 are:

> $f_{1}(\mathbf{x}) = \dfrac{1}{{\left |2 \pi \Sigma \right |}^{1/2}} \exp\left \{-\dfrac{1}{2}(\mathbf{x} - \mathbf{\mu_{1}})^{T} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu_{1}})   \right \}$,

and

> $f_{2}(\mathbf{x}) = \dfrac{1}{{\left |2 \pi \Sigma \right |}^{1/2}} \exp\left \{-\dfrac{1}{2}(\mathbf{x} - \mathbf{\mu_{2}})^{T} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu_{2}})   \right \}$,

respectively.

&nbsp;

We can therefore calculate the likelihood ratio as:

> $\dfrac{f_{1}(\mathbf{x})}{f_{2}(\mathbf{x})} = \exp\left \{-\dfrac{1}{2}(\mathbf{x} - \mathbf{\mu_{1}})^{T} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu_{1}}) + \dfrac{1}{2}(\mathbf{x} - \mathbf{\mu_{2}})^{T} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu_{2}})  \right \}$,

where the $\dfrac{1}{{\left |2 \pi \Sigma \right |}^{1/2}}$ terms in each pdf cancel out.

We can then calculate the log-likelihood ratio by taking natural logs of both sides:

> $\ln \left ( \dfrac{f_{1}(\mathbf{x})}{f_{2}(\mathbf{x})} \right ) = -\dfrac{1}{2}(\mathbf{x} - \mathbf{\mu_{1}})^{T} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu_{1}}) + \dfrac{1}{2}(\mathbf{x} - \mathbf{\mu_{2}})^{T} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu_{2}})$

and factorising the $\dfrac{1}{2}$:

> $\ln \left ( \dfrac{f_{1}(\mathbf{x})}{f_{2}(\mathbf{x})} \right ) = -\dfrac{1}{2} \left ( (\mathbf{x} - \mathbf{\mu_{1}})^{T} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu_{1}}) -(\mathbf{x} - \mathbf{\mu_{2}})^{T} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu_{2}}) \right )$.

&nbsp;

We can expand out the $(\mathbf{x} - \mathbf{\mu_{1}})^{T} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu_{1}}) -(\mathbf{x} - \mathbf{\mu_{2}})^{T} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu_{2}})$ terms:

Firstly the $(\mathbf{x} - \mathbf{\mu_{1}})^{T} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu_{1}})$ term:

> $= (\mathbf{x}^{T}\Sigma^{-1} - \mathbf{\mu_{1}}^{T}\Sigma^{-1})(\mathbf{x} - \mathbf{\mu_{1}})$

> $= \mathbf{x}^{T}\Sigma^{-1}\mathbf{x} - \mathbf{x}^{T}\Sigma^{-1}\mathbf{\mu_{1}} - \mathbf{\mu_{1}}^{T}\Sigma^{-1}\mathbf{x} + \mathbf{\mu_{1}}^{T}\Sigma^{-1}\mathbf{\mu_{1}}$

Each of these four components is a scalar, and the transpose of a scalar is the scalar itself, i.e. $a^{T} = a$, therefore:

> $\mathbf{\mu_{1}}^{T}\Sigma^{-1}\mathbf{x} = \left ( \mathbf{\mu_{1}}^{T}\Sigma^{-1}\mathbf{x} \right )^{T} = \mathbf{x}^{T}(\Sigma^{-1})^{T}\mathbf{\mu_{1}} = \mathbf{x}^{T}(\Sigma^{T})^{-1}\mathbf{\mu_{1}} = \mathbf{x}^{T}\Sigma^{-1}\mathbf{\mu_{1}}$,

since $\Sigma$ is a symmetric matrix, $\Sigma = \Sigma^{T}$.

We can therefore write:

> $(\mathbf{x} - \mathbf{\mu_{1}})^{T} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu_{1}}) = \mathbf{x}^{T}\Sigma^{-1}\mathbf{x} - 2\mathbf{x}^{T}\Sigma^{-1}\mathbf{\mu_{1}} + \mathbf{\mu_{1}}^{T}\Sigma^{-1}\mathbf{\mu_{1}}$

and

> $(\mathbf{x} - \mathbf{\mu_{2}})^{T} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu_{2}}) = \mathbf{x}^{T}\Sigma^{-1}\mathbf{x} - 2\mathbf{x}^{T}\Sigma^{-1}\mathbf{\mu_{2}} + \mathbf{\mu_{2}}^{T}\Sigma^{-1}\mathbf{\mu_{2}}$.

Thus we can expand out:

> $(\mathbf{x} - \mathbf{\mu_{1}})^{T} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu_{1}}) -(\mathbf{x} - \mathbf{\mu_{2}})^{T} \Sigma^{-1}(\mathbf{x} - \mathbf{\mu_{2}})$

to

> $- 2\mathbf{x}^{T}\Sigma^{-1}\mathbf{\mu_{1}} + \mathbf{\mu_{1}}^{T}\Sigma^{-1}\mathbf{\mu_{1}} + 2\mathbf{x}^{T}\Sigma^{-1}\mathbf{\mu_{2}} - \mathbf{\mu_{2}}^{T}\Sigma^{-1}\mathbf{\mu_{2}}$,

where we can factor out the $-2\mathbf{x}^{T}\Sigma^{-1}$ terms:

> $-2\mathbf{x}^{T}\Sigma^{-1}(\mathbf{\mu_{1}} - \mathbf{\mu_{2}}) + \mathbf{\mu_{1}}^{T}\Sigma^{-1}\mathbf{\mu_{1}} - \mathbf{\mu_{2}}^{T}\Sigma^{-1}\mathbf{\mu_{2}}$.

&nbsp;

The $\mathbf{\mu_{1}}^{T}\Sigma^{-1}\mathbf{\mu_{1}} - \mathbf{\mu_{2}}^{T}\Sigma^{-1}\mathbf{\mu_{2}}$ terms follow the difference of two squares, and can be written as $(\mathbf{\mu_{1}} + \mathbf{\mu_{2}})^{T}\Sigma^{-1}(\mathbf{\mu_{1}} - \mathbf{\mu_{2}})$.
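To make this step explicit, expanding the product shows that the two cross terms cancel, using only the symmetry of $\Sigma^{-1}$ established earlier:

> $(\mathbf{\mu_{1}} + \mathbf{\mu_{2}})^{T}\Sigma^{-1}(\mathbf{\mu_{1}} - \mathbf{\mu_{2}}) = \mathbf{\mu_{1}}^{T}\Sigma^{-1}\mathbf{\mu_{1}} - \mathbf{\mu_{1}}^{T}\Sigma^{-1}\mathbf{\mu_{2}} + \mathbf{\mu_{2}}^{T}\Sigma^{-1}\mathbf{\mu_{1}} - \mathbf{\mu_{2}}^{T}\Sigma^{-1}\mathbf{\mu_{2}}$

> $= \mathbf{\mu_{1}}^{T}\Sigma^{-1}\mathbf{\mu_{1}} - \mathbf{\mu_{2}}^{T}\Sigma^{-1}\mathbf{\mu_{2}}$,

since $\mathbf{\mu_{2}}^{T}\Sigma^{-1}\mathbf{\mu_{1}} = \mathbf{\mu_{1}}^{T}\Sigma^{-1}\mathbf{\mu_{2}}$ (a scalar equals its own transpose).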

&nbsp;

Our log-likelihood ratio can therefore be written:

> $\ln \left ( \dfrac{f_{1}(\mathbf{x})}{f_{2}(\mathbf{x})} \right ) = -\dfrac{1}{2} \left ( -2\mathbf{x}^{T}\Sigma^{-1}(\mathbf{\mu_{1}} - \mathbf{\mu_{2}}) + (\mathbf{\mu_{1}} + \mathbf{\mu_{2}})^{T}\Sigma^{-1}(\mathbf{\mu_{1}} - \mathbf{\mu_{2}}) \right )$

> $= \mathbf{x}^{T}\Sigma^{-1}(\mathbf{\mu_{1}} - \mathbf{\mu_{2}}) -\dfrac{1}{2}(\mathbf{\mu_{1}} + \mathbf{\mu_{2}})^{T}\Sigma^{-1}(\mathbf{\mu_{1}} - \mathbf{\mu_{2}})$

&nbsp;

The common factor in both terms is $\Sigma^{-1}(\mathbf{\mu_{1}} - \mathbf{\mu_{2}})$. Let $\mathbf{L} = \Sigma^{-1}(\mathbf{\mu_{1}} - \mathbf{\mu_{2}})$ such that our log-likelihood ratio takes the form:

> $\ln \left ( \dfrac{f_{1}(\mathbf{x})}{f_{2}(\mathbf{x})} \right ) =  \mathbf{x}^{T}\mathbf{L} -\dfrac{1}{2}(\mathbf{\mu_{1}} + \mathbf{\mu_{2}})^{T}\mathbf{L}$.

Note again that each of the two above terms is a scalar ($\mathbf{L}$ is a $p \times 1$ vector), so we can shuffle the order of the terms by taking the transpose of each:

> $\ln \left ( \dfrac{f_{1}(\mathbf{x})}{f_{2}(\mathbf{x})} \right ) =  \mathbf{L}^{T}\mathbf{x} -\dfrac{1}{2}\mathbf{L}^{T}(\mathbf{\mu_{1}} + \mathbf{\mu_{2}})$.

&nbsp;

Recall that as:

> $\dfrac{f_{1}(\mathbf{x})}{f_{2}(\mathbf{x})} \geq \dfrac{\pi_{2} C(1|2)}{\pi_{1}C(2|1)}$,

the log-likelihood ratio relationship must be (as the natural log is a monotonically increasing function):

> $\ln \left ( \dfrac{f_{1}(\mathbf{x})}{f_{2}(\mathbf{x})} \right ) \geq \ln \left ( \dfrac{\pi_{2} C(1|2)}{\pi_{1}C(2|1)} \right )$,

therefore:

> $\mathbf{L}^{T}\mathbf{x} -\dfrac{1}{2}\mathbf{L}^{T}(\mathbf{\mu_{1}} + \mathbf{\mu_{2}}) \geq \ln \left ( \dfrac{\pi_{2} C(1|2)}{\pi_{1}C(2|1)} \right )$.

&nbsp;

---

**Our allocation rule is therefore:**

**Allocate to population 1 if:**

> $\mathbf{L}^{T}\mathbf{x} -\dfrac{1}{2}\mathbf{L}^{T}(\mathbf{\mu_{1}} + \mathbf{\mu_{2}}) \geq \ln \left ( \dfrac{\pi_{2} C(1|2)}{\pi_{1}C(2|1)} \right )$,

**otherwise allocate to population 2.**

---
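The allocation rule derived above can be sketched as a small R function. This is a minimal illustration under stated assumptions: the function name `allocate`, its argument names and the toy parameters are all hypothetical, and the population parameters are treated as known.

```r
# Minimal sketch of the allocation rule; all names here are hypothetical.
# cost.1.given.2 = C(1|2), cost.2.given.1 = C(2|1)
allocate <- function(x, mu1, mu2, Sigma,
                     prior1 = 0.5, prior2 = 0.5,
                     cost.1.given.2 = 1, cost.2.given.1 = 1) {
  # L = Sigma^{-1} (mu1 - mu2), solved without forming the inverse
  L <- solve(Sigma, mu1 - mu2)

  # left-hand side: L'x - 0.5 * L'(mu1 + mu2)
  score <- sum(L * x) - 0.5 * sum(L * (mu1 + mu2))

  # right-hand side: log of the prior-and-cost ratio
  threshold <- log((prior2 * cost.1.given.2) / (prior1 * cost.2.given.1))

  if (score >= threshold) 1L else 2L
}

# toy example with made-up parameters
mu1.toy <- c(1, 1)
mu2.toy <- c(-1, -1)
Sigma.toy <- diag(2)

group.a <- allocate(c(2, 2), mu1.toy, mu2.toy, Sigma.toy)
group.b <- allocate(c(-2, -2), mu1.toy, mu2.toy, Sigma.toy)
```

With equal priors and costs the threshold is $\ln 1 = 0$, so the rule reduces to allocating each observation to the population whose mean is nearer in Mahalanobis distance.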


### b) Mahalanobis distance and misclassification probability

We are provided with the mean results for four tests administered to two groups.

We are also provided with the pooled *within-groups* sample covariance matrix, $\mathbf{S}_{U}$, which is an unbiased estimator for the population covariance matrix, $\Sigma$, whereby:

> $\mathbf{S}_{U} = \dfrac{(n_{1}-1)\mathbf{S}_{1U} + (n_{2}-1)\mathbf{S}_{2U}}{n_{1} + n_{2} - 2}$.

The squared Mahalanobis distance (a measure of distance between two **population** means), $\alpha$, is calculated as:

> $\alpha = (\mu_{1} - \mu_{2})^{T}\Sigma^{-1}(\mu_{1} - \mu_{2})$.

Since we don't know the population parameters, we can calculate the sample squared Mahalanobis distance:

> $\hat{\alpha} = (\bar{\mathbf{x}}_{1} - \bar{\mathbf{x}}_{2})^{T}\mathbf{S}_{U}^{-1}(\bar{\mathbf{x}}_{1} - \bar{\mathbf{x}}_{2})$.

We therefore take the inverse of the pooled sample covariance matrix, $\mathbf{S}_{U}^{-1}$, which is provided as **invSp** in the question, and calculate $\hat{\alpha}$.

In [69]:
# sample mean vector for senile group
senile.37 <- c(12.57, 9.57, 11.49, 7.97)

# sample mean vector for non-senile group
non.senile.12 <- c(8.75, 5.35, 8.50, 4.75)

# the four different tests performed by each individual
subtest <- c('Information','Similarities','Arithmetic','Picture completion')

df.means <- data.frame(subtest, senile.37, non.senile.12)

df.means

| subtest (`<fct>`) | senile.37 (`<dbl>`) | non.senile.12 (`<dbl>`) |
|---|---|---|
| Information | 12.57 | 8.75 |
| Similarities | 9.57 | 5.35 |
| Arithmetic | 11.49 | 8.50 |
| Picture completion | 7.97 | 4.75 |


In [70]:
# symmetric sample covariance matrix
sample.covar <- data.frame(Information = c(11.2553,9.4042,7.1489,3.3830),
                 Similarities = c(9.4042,13.5318,7.3830,2.5532),
                 Arithmetic = c(7.1489,7.3830,11.5744,2.6170),
                 PictureCompletion = c(3.3830,2.5532,2.6170,5.8085))

sample.covar.inv <- solve(sample.covar)

sample.covar.inv

| | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Information | 0.25907356 | -0.1357645 | -0.05877998 | -0.06473009 |
| Similarities | -0.1357645 | 0.18645117 | -0.03833003 | 0.01438476 |
| Arithmetic | -0.05877998 | -0.03833003 | 0.15098314 | -0.01694172 |
| PictureCompletion | -0.06473009 | 0.01438476 | -0.01694172 | 0.21117177 |


### Squared Mahalanobis distance calculation:

In [79]:
alpha.hat <- t(senile.37 - non.senile.12) %*% sample.covar.inv %*% (senile.37 - non.senile.12)

alpha.hat[1,1]

We estimate the squared Mahalanobis distance to be $2.43$.

The probability that an individual is allocated to the wrong group, i.e. is misclassified, assuming equal prior probabilities and equal misclassification costs, is estimated by $\Phi \left ( - \dfrac{\sqrt{\hat{\alpha}}}{2} \right )$, where $\Phi$ is the standard normal distribution function.

In [77]:
pnorm(-0.5 * sqrt(alpha.hat[1,1]), lower.tail = TRUE)

**We calculate a misclassification probability of $0.22$.**

Note, this probability may be *underestimated* for small sample sizes.

### c) PCA

*100 individuals were given 3 scores based on a series of tests designed to measure an underlying attitude.*

(i) The third eigenvector has an eigenvalue of zero. This means that the linear combination of the three variables defined by the third eigenvector has zero variance, i.e. there's an exact linear relationship between the three variables. The third principal component therefore accounts for *zero variance*, and one of the variables is redundant.

(ii) The first principal component accounts for $\dfrac{296.724}{296.724 + 25.276} \approx 92\%$ of the variance. This dominating component is a measure of the overall size, with all three coefficients having similar magnitude and *sign*.

The second principal component accounts for the remaining $8\%$ of the variance (as the third component accounts for zero variance), but interestingly, shows there to be a *contrast between the second and third variables*, with opposite sign, whilst the first variable has no impact.
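The proportions quoted above follow directly from the eigenvalues. A short sketch of the arithmetic, using only the three eigenvalues given in the question (the variable names are new):

```r
# eigenvalues quoted above; the third is exactly zero
eigenvalues <- c(296.724, 25.276, 0)

# percentage of the total variance per principal component
pct.variance <- 100 * eigenvalues / sum(eigenvalues)

# running total across components
cum.pct.variance <- cumsum(pct.variance)
```

The first two components together account for all of the variance, which is another way of seeing that the data lie in a two-dimensional subspace.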

There's therefore interest in the interplay between the second and third variables, and since one of the variables should be removed, it makes sense that it's the first. 

When re-running the PCA on the second and third variable, you'd be running PCA on the following unbiased sample covariance matrix:

> $\mathbf{S}_{U} = \begin{bmatrix}
52 & 36\\ 
36 & 73
\end{bmatrix}$

Where the expression to find the eigenvalues, indicating the proportions of the variance accounted for by each of the two principal components is:

> $\det\begin{bmatrix}
52 - \lambda & 36\\ 
36 & 73 - \lambda
\end{bmatrix} = 0$

I'd expect there to still be an overall size component as the first principal component that accounts for most of the variance, where both coefficients have the same sign. I'd then expect the second component, with a significantly lower eigenvalue / variance, to contain coefficients with the **same magnitudes, swapped between the two variables, and of opposite sign to each other**.

This is because we're down to two variables and two principal components, and the two eigenvectors must be orthogonal: if the first is $(a, b)^{T}$, the second must be $\pm(-b, a)^{T}$. The sizes of the weights are therefore shared between the two linear combinations, but the second component is a contrast, consistent with the contrasting interaction between the second and third variables that we saw when performing PCA initially with the three variables.
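The expectation above can be checked directly with base R's `eigen()`, using the $2 \times 2$ covariance matrix given earlier. The object names `S.reduced`, `eig.reduced` and `pct.reduced` are chosen here for illustration.

```r
# covariance matrix of the second and third variables, as given above
S.reduced <- matrix(c(52, 36,
                      36, 73), nrow = 2, byrow = TRUE)

eig.reduced <- eigen(S.reduced)

# variances of the two principal components
eig.reduced$values

# loadings; the overall sign of each column is arbitrary
eig.reduced$vectors

# percentage of the variance explained by each component
pct.reduced <- 100 * eig.reduced$values / sum(eig.reduced$values)
```

The eigenvalues come out as $100$ and $25$, so the first component explains $80\%$ of the variance and the second $20\%$. The loadings have magnitudes $(0.6, 0.8)$ and $(0.8, 0.6)$, with the second component's coefficients taking opposite signs.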

This initial variable removal process of finding and removing highly correlated variables is useful as a pre-processing step prior to fitting a regression. Highly correlated variables inflate the standard errors of the regression coefficients and make the estimates unstable, which is clearly sub-optimal.

Note that by reducing the number of variables, we've reduced the dimensionality of our data, from three dimensions to two, by *eliminating* one of our underlying variables. This is one of the two ways in which PCA can be used to reduce the dimensions in our data. The other way is to select the principal components that explain most of the variance in the data, where these components represent linear combinations of the underlying variables. We can treat these principal components as 'new' variables, and hence work with fewer principal components than our original number of variables, whilst maintaining most of the variance in the data in this new representation.

As we've also reduced the number of our variable dimensions from 3 to 2, we can more easily visually interpret the interplay between the two variables in a plot (I find 3-D plots often tricky to interpret!), as the PCA analysis has revealed the best 2-D projection of the test score data.

This iterative PCA process could also help to feed back to the test designers that the three scores that they've produced aren't truly orthogonal when it comes to measuring the underlying attitude - essentially one of those tests has been wasted.

As an interesting aside, I suspect the PCA analysis from the above would look similar to the PCA analysis that would result from comparing the returns of two similar stocks in a paired analysis: e.g. Coke and Pepsi. The results of which are shown below:

In [129]:
# dataframe with returns for Coca Cola and Pepsi going back to the 1960s.
df.pair <- read.csv('coke_pepsi_paired_returns.csv')

df.pair[1:5,]

| return_coke (`<dbl>`) | return_pepsi (`<dbl>`) |
|---|---|
| -0.003780718 | -0.002890173 |
| -0.001897533 | -0.014492754 |
| -0.008593156 | -0.008823529 |
| -0.004755695 | -0.005934718 |
| 0.001233141 | 0.0 |


In [125]:
# producing the sample covariance matrix
S <- var(df.pair)
S

| | return_coke | return_pepsi |
|---|---|---|
| return_coke | 0.0002328181 | 0.0001219119 |
| return_pepsi | 0.0001219119 | 0.0002464368 |


In [127]:
# producing the eigenvalues and eigenvectors
eig <- eigen(S)
eig

eigen() decomposition
$values
[1] 0.0003617293 0.0001175255

$vectors
          [,1]       [,2]
[1,] 0.6871070 -0.7265563
[2,] 0.7265563  0.6871070


In [128]:
# proportion of variance between the first and second principal components
(eig$values/sum(eig$values)) * 100

---

**END OF ASSIGNMENT**

---