# Factor Analysis

Another important thing to think about when analyzing surveys is how well the items "hang together" and whether you are measuring more than one concept in your survey.  You may ask several questions about a similar broad topic, but is that all one topic, or does it really have some subtopics in it? *Factor analysis* has the answer to these questions and more! The basic goal of factor analysis is to see how items fall together and to see if they group in any particular patterns that make sense logically.

## Types of Factor Analysis

There are two broad types of factor analysis: *exploratory factor analysis* and *confirmatory factor analysis*. Exploratory factor analysis, abbreviated *EFA*, is used when you don't really have an inkling of what your data will yield.  You are intrepid explorers, traversing unknown survey data worlds! Confirmatory factor analysis, abbreviated *CFA* (so original!), is either for after you have completed EFA or when you are so confident about what your data holds you feel you can skip the EFA and just want a validation check. You are confirming your thoughts about the data with CFA. An example of when you might proceed straight to CFA is when you have already used a validated, previously studied set of survey items, and just want to make sure that your data is behaving the same way as it did for others.  

The most common type of factor analysis is definitely EFA, and it's a good thing, because it's easier, too! CFA is actually a form of structural equation modeling (SEM), and you won't get into that here. However, you will learn how to rock the heck out of an EFA, and that knowledge will take you a long way!

---

## Assumptions of EFA

There are only three assumptions for EFA - yes you heard that right - three! Let the party commence! 

---

### Sample Size 

Although there are many different opinions about sample size for EFA, the safest rule you can follow is to have at least 300 data points. However, you may be able to get away with as few as 150 data points if you have a small number of survey questions you're examining and those survey questions are moderately correlated with each other. 

---

### Absence of Multicollinearity

*Multicollinearity*, or having a lot of overlap between variables, is a problem, because it will make sorting your survey items into distinct groups quite difficult. Chances are that if your survey items all have really high multicollinearity, then you should have asked fewer survey items, because they are all getting at the same concept! You can test for multicollinearity by running a correlation matrix on all your survey items. If an item correlates with another item at .9 or higher (in either direction, so -.9 or lower counts too), then it's got to go, and you'll want to eliminate it from your analysis. Though that's a good guideline, you may run into situations where lower correlations also cause problems. You'll be able to catch this by looking at the *determinant* of the correlation matrix, a single number that summarizes how much the items overlap as a whole. The closer it is to 0, the more overlap there is, so when you check the determinant, you are looking for a value greater than .00001.
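
As a rough sketch of the determinant check, the base R function `det()` can be applied to a correlation matrix. The data here are simulated, and the names `items`, `shared`, and `q1_copy` are made up for illustration; with real survey data you would pass your own data frame to `cor()`.

```{r}
# Simulated survey (illustrative only): 300 respondents, 4 items sharing one trait
set.seed(42)
shared <- rnorm(300)
items <- data.frame(
  q1 = shared + rnorm(300),
  q2 = shared + rnorm(300),
  q3 = shared + rnorm(300),
  q4 = shared + rnorm(300)
)

# Determinant of the correlation matrix for the four moderately related items
item_det <- det(cor(items))
item_det > 0.00001   # TRUE means the items pass the determinant rule

# Adding a near-duplicate of q1 pushes the determinant toward zero
items$q1_copy <- items$q1 + rnorm(300, sd = 0.001)
copy_det <- det(cor(items))
copy_det > 0.00001   # FALSE means the items fail the determinant rule
```

The second check shows why the determinant is useful: one redundant item is enough to drag the value for the whole set below the cutoff.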

---

### Some Relationship between Survey Items

Although multicollinearity is to be avoided, it's important that there is some relationship between your survey items. Otherwise, they probably shouldn't be grouped together at all! So you'll also want to scan your correlation matrix for any item that correlates with several other items at .3 or lower (ignoring the sign), which is a good indication it's not going to play nicely with the others and should be removed. You can also run a catch-all test to make sure that there is some relation between all the variables - this is *Bartlett's test*, which you will want to be significant, since it tests whether your correlation matrix differs from an *identity matrix*, or a matrix in which no variable is related to any other (correlations of 0 everywhere off the diagonal).
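
To make the idea behind Bartlett's test concrete, here is a minimal base R sketch of the usual chi-square approximation: the statistic is -(n - 1 - (2p + 5)/6) times the log of the determinant of the correlation matrix, with p(p - 1)/2 degrees of freedom, where n is the number of respondents and p is the number of items. The function name `bartlett_sphericity` and the simulated data frame `related` are assumptions made for this example, not part of any package.

```{r}
# Hypothetical helper: Bartlett's test of sphericity from raw data
bartlett_sphericity <- function(data) {
  n <- nrow(data)                       # number of respondents
  p <- ncol(data)                       # number of items
  R <- cor(data)                        # correlation matrix
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  p_value <- pchisq(chisq, df, lower.tail = FALSE)
  list(chisq = chisq, df = df, p.value = p_value)
}

# Simulated items (illustrative only) that all share one underlying trait
set.seed(1)
shared <- rnorm(300)
related <- data.frame(
  a = shared + rnorm(300),
  b = shared + rnorm(300),
  c = shared + rnorm(300)
)

related_test <- bartlett_sphericity(related)
related_test$p.value < 0.05   # significant: the matrix is not an identity matrix
```

A significant result only says the items are related enough to be worth factoring; it says nothing about how many factors there are.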

---

## Factor Rotation

The other big thing you need to know about EFA before diving in is *factor rotation*. In order to better see which survey items belong to which factor, you will want to rotate the factor axes. You can rotate them while keeping them at 90 degrees to each other, which is called *orthogonal rotation* and is really meant for when you theoretically don't think your factors are related, or you can use *oblique rotation*, which does not hold the axes at right angles. Oblique rotation is for when you theoretically believe your factors should be related. The most common types of orthogonal rotation are *varimax* and *quartimax*. The most common types of oblique rotation are *oblimin* and *promax*. You don't need to know the mathematical differences between them, and chances are, you will use a process of trial and error in which you'll try at least two different rotation types for each data set.

In the image below, you'll see that Figure A shows off the raw data, which is scattered all over the place. Figure B, in the middle, shows a type of orthogonal rotation, in which the axes have been turned but are still at 90 degrees to each other. And Figure C shows a type of oblique rotation, which also turns the axes, just without keeping them at 90 degrees to each other. In this example, the data remain spread apart (probably because there are only three data points), but in most cases, as you rotate, the data will start to clump together, forming factors.
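
If you'd like to see the difference between the two rotation families in numbers rather than pictures, here is a small sketch using only base R. It relies on `factanal()`, `varimax()`, and `promax()` from the built-in `stats` package and on R's built-in `ability.cov` example data, not on the survey data or packages used later in this lesson, so treat it as an illustration only.

```{r}
# Unrotated two-factor solution on R's built-in ability.cov data
fit <- factanal(factors = 2, covmat = ability.cov, rotation = "none")
raw_loadings <- loadings(fit)

# One orthogonal rotation and one oblique rotation of the same loadings
orth <- varimax(raw_loadings)
obl <- promax(raw_loadings)

# For an orthogonal rotation, t(rotmat) %*% rotmat is the identity matrix,
# which is the algebra behind "the axes stay at right angles"
round(crossprod(orth$rotmat), 3)

# For an oblique rotation it is not, because the axes are allowed to tilt
round(crossprod(obl$rotmat), 3)

# Compare how the items load on each factor after each rotation
round(unclass(orth$loadings), 2)
round(unclass(obl$loadings), 2)
```

Comparing the two loading tables side by side is a small version of the trial-and-error process described above.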

# Factor Analysis Setup

Now that you understand the basics of factor analysis, you will run one of your own in R!

---

## Load Libraries

You will need to install and load several libraries in order to complete factor analysis in R. You will use ```corpcor``` for working with correlation matrices and ```GPArotation``` for the factor rotations. ```psych``` will run the factor analysis itself and help you with interpreting the factor loadings, and ```IDPmisc``` can be used to remove missing data.

```{r}
# Install the packages (only needed once per machine)
install.packages(c("corpcor", "GPArotation", "psych", "IDPmisc"))

# Load the libraries
library("corpcor")
library("GPArotation")
library("psych")
library("IDPmisc")
```

---

## Load in Data

For this walkthrough, you will be using **[data from a survey on financial wellbeing](https://repo.exeterlms.com/documents/V2/DataScience/Metrics-Data-Processing/financialWB.zip)**. The codebook is located **[here](https://s3.amazonaws.com/files.consumerfinance.gov/f/documents/cfpb_nfwbs-puf-codebook.pdf)**. Check out the variable list starting on page 5 if you'd like to know what all the survey items are (or at least the ones you'll be working with).

---


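```{r}
financialWB <- read.csv('../data/financialWB.csv')
head(financialWB)
```

| | PUF_ID | sample | fpl | SWB_1 | SWB_2 | SWB_3 | FWBscore | FWB1_1 | FWB1_2 | FWB1_3 | ⋯ | PPMSACAT | PPREG4 | PPREG9 | PPT01 | PPT25 | PPT612 | PPT1317 | PPT18OV | PCTLT200FPL | finalwt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | ⋯ | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` |
| 1 | 10350 | 2 | 3 | 5 | 5 | 6 | 55 | 3 | 3 | 3 | ⋯ | 1 | 4 | 8 | 0 | 0 | 0 | 0 | 1 | 0 | 0.3672919 |
| 2 | 7740 | 1 | 3 | 6 | 6 | 6 | 51 | 2 | 2 | 3 | ⋯ | 1 | 2 | 3 | 0 | 0 | 0 | 0 | 2 | 0 | 1.3275607 |
| 3 | 13699 | 1 | 3 | 4 | 3 | 4 | 49 | 3 | 3 | 3 | ⋯ | 1 | 4 | 9 | 0 | 0 | 0 | 1 | 2 | 1 | 0.8351558 |
| 4 | 7267 | 1 | 3 | 6 | 6 | 6 | 49 | 3 | 3 | 3 | ⋯ | 1 | 3 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 1.410871 |
| 5 | 7375 | 1 | 3 | 4 | 4 | 4 | 49 | 3 | 3 | 3 | ⋯ | 1 | 2 | 4 | 0 | 0 | 1 | 0 | 4 | 1 | 4.2606681 |
| 6 | 10910 | 1 | 3 | 5 | 7 | 5 | 67 | 5 | 1 | 1 | ⋯ | 1 | 2 | 3 | 0 | 0 | 0 | 0 | 2 | 0 | 0.7600609 |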


## Question Setup

With the data above, you will be determining how a set of questions from the financial wellbeing survey hang together and whether there are any subscales. To do this, you will perform factor analysis. In factor analysis, there are no x or y variables - you are simply seeing how variables fit together.

---

## Data Wrangling

Before you begin, there is one data wrangling item that needs to take place - you will subset your data.  The function you'll use in R for factor analysis does not allow you to specify variables, so you'll need to trim your data to only the variables you are interested in looking at to begin with. In order to subset, take a look at the data and identify the columns you want to keep. In this case, you want the items that start with ```FWB```. They are contained in columns numbered 8-17.  With the below code, you will only have those columns in your new dataset to use:

```{r}
financialWB1 <- financialWB[, 8:17]
head(financialWB1)
```

| | FWB1_1 | FWB1_2 | FWB1_3 | FWB1_4 | FWB1_5 | FWB1_6 | FWB2_1 | FWB2_2 | FWB2_3 | FWB2_4 |
|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 3 | 3 | 3 | 3 | 2 | 3 | 2 | 3 | 2 | 4 |
| 2 | 2 | 2 | 3 | 3 | 3 | 4 | 2 | 2 | 2 | 3 |
| 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| 4 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| 5 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| 6 | 5 | 1 | 1 | 1 | 1 | 1 | 2 | 5 | 2 | 2 |
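
Column positions can shift if the file ever changes, so another option is to select the items by name. The sketch below uses a made-up three-row data frame called `survey` that only mimics a few of the column names in this data set. Notice that the pattern has to be a bit more specific than "starts with FWB", because `FWBscore` starts with those letters too and is not one of the survey items.

```{r}
# Toy data frame (illustrative only) with a few column names from the survey
survey <- data.frame(
  PUF_ID   = c(10350, 7740, 13699),
  FWBscore = c(55, 51, 49),
  FWB1_1   = c(3, 2, 3),
  FWB1_2   = c(3, 2, 3),
  FWB2_1   = c(2, 2, 3),
  finalwt  = c(0.3672919, 1.3275607, 0.8351558)
)

# Keep columns named FWB1_<number> or FWB2_<number>; this skips FWBscore
item_cols <- grep("^FWB[12]_[0-9]+$", names(survey), value = TRUE)
survey_items <- survey[, item_cols]

names(survey_items)
```

Selecting by name takes a little more typing than `8:17`, but it keeps working if columns are added or reordered.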


---

## Test Assumptions

Now that you have the columns you'll be examining in the factor analysis, you'll need to test the assumptions for them! You will be looking at sample size and how well the variables relate to each other.

---

### Sample Size

Sample size should ideally be 300 or more. Luckily, there are 6,394 rows here, so you have met this assumption!
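
Here is a quick sketch of how you might check this in code. The data frame `responses` is simulated for illustration. Counting only complete rows is an assumption about how missing answers will be handled; if you plan to keep partially answered surveys, `nrow()` alone is the number to look at.

```{r}
# Simulated responses (illustrative only): 320 people, 2 items, some answers missing
set.seed(7)
responses <- data.frame(
  q1 = sample(1:5, 320, replace = TRUE),
  q2 = sample(1:5, 320, replace = TRUE)
)
responses$q1[1:10] <- NA

# Total rows, and rows with an answer for every item
n_rows <- nrow(responses)
n_complete <- sum(complete.cases(responses))

# Does the usable sample meet the 300-row guideline?
n_complete >= 300
```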

---

### Absence of Multicollinearity

Next, you will test for the absence of multicollinearity. The first way to do this is with a correlation matrix.  You can use the function ```cor()``` to do that: 

```{r}
financialWBmatrix <- cor(financialWB1)
```




To view it, you can use the ```View()``` function, which is easier to read in R than printed output, and you can make use of the ```round()``` function so that you are only seeing two decimal places, which makes things easier to sort through. The ```2``` in the code below indicates the number of decimal places you would like to see.

```{r}
View(round(financialWBmatrix, 2))
```

| | FWB1_1 | FWB1_2 | FWB1_3 | FWB1_4 | FWB1_5 | FWB1_6 | FWB2_1 | FWB2_2 | FWB2_3 | FWB2_4 |
|---|---|---|---|---|---|---|---|---|---|---|
| FWB1_1 | 1.00 | 0.68 | -0.48 | 0.69 | -0.42 | -0.44 | -0.57 | 0.65 | -0.48 | -0.45 |
| FWB1_2 | 0.68 | 1.00 | -0.48 | 0.70 | -0.36 | -0.45 | -0.51 | 0.60 | -0.43 | -0.41 |
| FWB1_3 | -0.48 | -0.48 | 1.00 | -0.49 | 0.53 | 0.62 | 0.60 | -0.50 | 0.52 | 0.55 |
| FWB1_4 | 0.69 | 0.70 | -0.49 | 1.00 | -0.35 | -0.43 | -0.50 | 0.62 | -0.46 | -0.43 |
| FWB1_5 | -0.42 | -0.36 | 0.53 | -0.35 | 1.00 | 0.47 | 0.50 | -0.41 | 0.43 | 0.44 |
| FWB1_6 | -0.44 | -0.45 | 0.62 | -0.43 | 0.47 | 1.00 | 0.52 | -0.45 | 0.44 | 0.51 |
| FWB2_1 | -0.57 | -0.51 | 0.60 | -0.50 | 0.50 | 0.52 | 1.00 | -0.60 | 0.64 | 0.61 |
| FWB2_2 | 0.65 | 0.60 | -0.50 | 0.62 | -0.41 | -0.45 | -0.60 | 1.00 | -0.53 | -0.46 |
| FWB2_3 | -0.48 | -0.43 | 0.52 | -0.46 | 0.43 | 0.44 | 0.64 | -0.53 | 1.00 | 0.55 |
| FWB2_4 | -0.45 | -0.41 | 0.55 | -0.43 | 0.44 | 0.51 | 0.61 | -0.46 | 0.55 | 1.00 |


---
In the correlation matrix, you want to look at only half the matrix (remember that the top and bottom halves along the diagonal are mirror images of each other). As you go down the columns, starting to look only after the 1.0 on the diagonal, look for any correlations of .9 or higher, ignoring the sign (so -.9 or lower counts too). This would indicate really high multicollinearity, and if there's an item with a correlation that strong, you will most likely want to remove that item. A quick scan indicates that nothing reaches .9 here and you are good to go.
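
With ten items a visual scan is manageable, but with more items you may prefer to let R do the scanning. The following is a minimal sketch on simulated data; the names `items`, `r_upper`, and `high_pairs` are made up for illustration, and `q4` is deliberately built as a near-copy of `q1` so that there is something to find.

```{r}
# Simulated items (illustrative only); q4 is almost a copy of q1
set.seed(123)
shared <- rnorm(300)
items <- data.frame(
  q1 = shared + rnorm(300),
  q2 = shared + rnorm(300),
  q3 = shared + rnorm(300)
)
items$q4 <- items$q1 + rnorm(300, sd = 0.1)

r <- cor(items)

# Look at one half only: blank out the diagonal and the lower triangle
r_upper <- r
r_upper[lower.tri(r_upper, diag = TRUE)] <- NA

# Find every pair with a correlation of .9 or higher, ignoring the sign
high <- which(abs(r_upper) >= 0.9, arr.ind = TRUE)
high_pairs <- data.frame(
  item_1 = rownames(r)[high[, "row"]],
  item_2 = colnames(r)[high[, "col"]],
  r = round(r_upper[high], 2)
)
high_pairs
```

Changing the comparison to `abs(r_upper) <= 0.3` would list the weakly related pairs instead, which is the scan described under "Some Relationship between Survey Items."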

---

#### Bartlett's Test

To double check that the items in your correlation matrix really are related to one another, you can also run Bartlett's test with this simple line:

```{r}
cortest.bartlett(financialWB1)
```


R was not square, finding R from data

