# Lecture Material Notebook

## Resources:
 * [R Quick Helps](http://www.statmethods.net/)
 * [Body Dimensions](https://www.openintro.org/stat/data/bdims.php)
 
## Topics In Notebook
 * Reducing dimensions through projections

In [None]:
# Pull seed data down
download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")

names(bdims)

In [None]:
# Remove Gender, which is a factor
lessData <- bdims[!names(bdims) %in% c('sex')]
ncol(lessData)
summary(lessData)

__Reference__: http://www.sthda.com/english/wiki/principal-component-analysis-in-r-prcomp-vs-princomp-r-software-and-data-mining

In [None]:
cor(lessData)

In [None]:
# Compute the Principal Components 
pca <- princomp(lessData, cor=TRUE)

In [None]:
summary(pca) # print variance accounted for

### IMPORTANT PCA information

The __Proportion of Variance__ and __Cumulative Proportion__ help you see how important or significant the components are.  

Note that the first principal component (PC) captures __0.6248721__ of the total variance, or about 62.5%.
The proportions across all components sum to 1.0.

Looking at the __Cumulative Proportion__ row, we can see that PCs 1 through 19 together capture 99% of the total variance.
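
The two summary rows can be reproduced by hand from the component standard deviations. The cell below is a minimal sketch that uses the built-in `USArrests` data as a stand-in (an assumption, so the cell runs without the download); the names `pca_demo`, `prop_var`, `cum_var`, and `n_99` are illustrative only.

```r
# Stand-in data: built-in USArrests (assumption; the notebook itself uses lessData)
pca_demo <- princomp(USArrests, cor = TRUE)

# Variance of each component is the square of its standard deviation
variances <- pca_demo$sdev^2

# Proportion of Variance: each component's share of the total
prop_var <- variances / sum(variances)

# Cumulative Proportion: running total of those shares
cum_var <- cumsum(prop_var)

print(round(rbind(prop_var, cum_var), 3))

# Smallest number of components whose cumulative share reaches 99%
n_99 <- which(cum_var >= 0.99)[1]
print(n_99)
```

Swapping `USArrests` for `lessData` should give the same numbers that `summary(pca)` prints.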

In [None]:
loadings(pca) # pc loadings

### Scree plot

Next, we will look at the trend of variance captured as we progress from the first PC to the last.
This is typically called a _Scree_ plot.

In [None]:
plot(pca,type="lines") # scree plot

We can see that after the first two PCs, each additional component contributes very little variance.
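
The heights in a scree plot are the component variances. With `cor=TRUE`, these are the eigenvalues of the correlation matrix. The sketch below checks this on the built-in `USArrests` data (a stand-in chosen so the cell is self-contained; `pca_demo` and `eig` are illustrative names). For a `princomp` object, `plot()` dispatches to `screeplot()`, which is called directly here.

```r
# Stand-in data: built-in USArrests (assumption; the notebook itself uses lessData)
pca_demo <- princomp(USArrests, cor = TRUE)

# Component variances, i.e. the values drawn on the scree plot
eig <- pca_demo$sdev^2

# With cor = TRUE these match the eigenvalues of the correlation matrix
print(cbind(princomp = eig, eigen = eigen(cor(USArrests))$values))

# Same picture as plot(pca_demo, type = "lines")
screeplot(pca_demo, type = "lines", main = "Scree plot (stand-in data)")
```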

In [None]:
reduced <- pca$scores[,1:2] # the first 2 principal components
summary(reduced)

biplot(pca) 

# Interpreting a Biplot


---

From: http://forrest.psych.unc.edu/research/vista-frames/help/lecturenotes/lecture13/biplot.html

As used in Principal Component Analysis, the axes of a biplot are a pair of principal components. These axes are drawn in black and are labeled Comp.1, Comp.2.

A biplot uses **points to represent the scores of the observations** on the _principal components_, and it __uses vectors to represent the coefficients of the variables on the principal components__. ...

__Interpreting Points__: The relative location of the points can be interpreted. Points that are close together correspond to observations that have similar scores on the components displayed in the plot. To the extent that these components fit the data well, the points also correspond to observations that have similar values on the variables.

...  
[ 
Points that are close together are data members with similar projections (positions) in the transformed space.
That is, their vector components share similar trends in the original data space.
]  
...

__Interpreting Vectors__: Both the direction and length of the vectors can be interpreted. Vectors point away from the origin in some direction.

A vector points in the direction which is most like the variable represented by the vector. This is the direction which has the highest squared multiple correlation with the principal components. The length of the vector is proportional to the squared multiple correlation between the fitted values for the variable and the variable itself.

The fitted values for a variable are the result of projecting the points in the space orthogonally onto the variable's vector (to do this, you must imagine extending the vector in both directions). The observations whose points project furthest in the direction in which the vector points are the observations that have the most of whatever the variable measures. Those points that project at the other end have the least. Those projecting in the middle have an average amount.

Thus, vectors that point in the same direction correspond to variables that have similar response profiles, and can be interpreted as having similar meaning in the context set by the data. 

---
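
The points in a biplot are the scores, and the vectors come from the loadings. The two are linked: projecting the standardized data onto the loadings gives the scores. The cell below is a minimal sketch of that relationship on the built-in `USArrests` data (a stand-in, used so the cell is self-contained); `pca_demo`, `z`, and `manual_scores` are illustrative names.

```r
# Stand-in data: built-in USArrests (assumption; the notebook itself uses lessData)
pca_demo <- princomp(USArrests, cor = TRUE)

# Standardize the data with the center and scale that princomp stored
z <- scale(USArrests, center = pca_demo$center, scale = pca_demo$scale)

# Project each observation onto the loading vectors
manual_scores <- z %*% unclass(pca_demo$loadings)

# Largest difference from the scores princomp computed (should be near zero)
print(max(abs(manual_scores - pca_demo$scores)))

# The first two columns are the point coordinates that biplot() is built from
print(head(manual_scores[, 1:2]))
```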



## Factor Analysis

Factor analysis looks for underlying (latent) driving variables that give rise to the observed variables.

In [None]:
# Maximum Likelihood Factor Analysis
# Load the raw data here; the next cell extracts 2 factors
# with varimax rotation
students <- read.csv("/dsa/data/all_datasets/student_prefs/student_subject_preferences.csv")
summary(students)
cor(students)

In [None]:
efa <- factanal(students, 2, rotation="varimax")

print(efa, digits=2, cutoff=0.3, sort=TRUE)

### Factor Analysis Interpretation

The way to interpret factors is to look at the observed variables that each factor contributes to:

__Factor 1__ : contributes to Biology, Geography, and Chemistry  
__Factor 2__ : contributes to Algebra, Calculus, and Statistics  

Can we assign a conceptual label to each factor based on the observed variables it contributes to?

Yes!  We can associate the first factor with _Science_ and the second factor with _Math_.  If these were scores on standardized tests, we could use the factor analysis to group students into sets of "Science Kids" and "Math Kids".
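
To place individual students in the factor space, `factanal` can return per-observation factor scores through its `scores` argument. The cell below is a sketch on simulated data, not the course file: the two latent variables, the 0.8 weights, the noise level, and the names `sim`, `efa_demo`, and `dominant` are all assumptions made for illustration.

```r
# Simulated stand-in data (assumption): two independent latent traits
set.seed(42)
n <- 300
science <- rnorm(n)
math <- rnorm(n)

# Each observed variable is driven by one latent trait plus noise
sim <- data.frame(
  Biology    = 0.8 * science + rnorm(n, sd = 0.6),
  Geography  = 0.8 * science + rnorm(n, sd = 0.6),
  Chemistry  = 0.8 * science + rnorm(n, sd = 0.6),
  Algebra    = 0.8 * math + rnorm(n, sd = 0.6),
  Calculus   = 0.8 * math + rnorm(n, sd = 0.6),
  Statistics = 0.8 * math + rnorm(n, sd = 0.6)
)

# Same call as in the notebook, plus regression-based factor scores
efa_demo <- factanal(sim, factors = 2, rotation = "varimax",
                     scores = "regression")

# One row per simulated student, one column per factor
print(head(efa_demo$scores))

# Which factor each variable loads on most strongly
dominant <- apply(abs(unclass(efa_demo$loadings)), 1, which.max)
print(dominant)

# Students in the factor space
plot(efa_demo$scores, xlab = "Factor 1", ylab = "Factor 2")
```

Which factor ends up labeled "Science" or "Math" depends on the order in which `factanal` extracts them, so check the loadings before naming the axes.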



In [None]:
# plot factor 1 by factor 2
# named factor_loadings to avoid masking base R's load()
factor_loadings <- efa$loadings[,1:2]
plot(factor_loadings,type="n") # set up plot
text(factor_loadings,labels=names(students),cex=.7) # add variable names 

A noticeable result of plotting the original variables in the factor space is that they separate into groups along the factor axes.