# Tutorial 6: Factor Analysis to Measure Tolerance
**Date**: March 2023

**Background**
In the previous tutorial, you assessed the reliability and validity of questionnaires. In this tutorial, you will perform an exploratory factor analysis (EFA) to measure tolerance in the following case study.

**Case-Study on Tolerance**
The goal of this case study is to examine the attitudes of 150 students towards tolerance, where tolerance refers to the acceptance of diversity. The researchers involved in this case study developed eight questions to capture various expressions of tolerance. The questions (q1-q8) are listed below, together with the other variables in the dataset:


**Table 1**. Description of variables in the dataset

| no. | variable | description |
|-----|----------|-------------|
| 1 | id | Anonymized unique identifier per individual |
| 2 | age | Age of student |
| 3 | height | Height (scale in cm, e.g. 183) |
| 4 | country | Where do you come from? (Country) |
| 5 | language | How many languages do you speak at home with your family? |
| 6 | freq_travel | How many different countries have you travelled to? |
| 7 | q1 | Five-point Likert scale from 'strongly disagree' to 'strongly agree': [People should have the right to live how they wish] |
| 8 | q2 | Five-point Likert scale from 'strongly disagree' to 'strongly agree': [It is important that people have the freedom to live their life as they choose] |
| 9 | q3 | Five-point Likert scale from 'strongly disagree' to 'strongly agree': [It is okay for people to live as they wish as long as they do not harm other people] |
| 10 | q4 | Five-point Likert scale from 'strongly disagree' to 'strongly agree': [I respect other people’s beliefs and opinions] |
| 11 | q5 | Five-point Likert scale from 'strongly disagree' to 'strongly agree': [I respect other people’s opinions even when I do not agree] |
| 12 | q6 | Five-point Likert scale from 'strongly disagree' to 'strongly agree': [I like to spend time with people who are different from me] |
| 13 | q7 | Five-point Likert scale from 'strongly disagree' to 'strongly agree': [I like people who challenge me to think about the world in a different way] |
| 14 | q8 | Five-point Likert scale from 'strongly disagree' to 'strongly agree': [Society benefits from a diversity of traditions and lifestyles] |



The responses to the survey questions were recorded in the "tolerance survey dataset". 

The dataset is available here: https://raw.githubusercontent.com/MaastrichtU-IDS/global-studies/main/semester4/tutorial4/inputs/tolerance_survey_data.csv 


**Today's Objectives**

In today's tutorial, you will conduct an exploratory factor analysis (EFA) on the given dataset. The goals of the factor analysis are as follows:

- Conduct a Kaiser-Meyer-Olkin (KMO) test to assess the suitability of survey data (only items q1-q8) for factor analysis
- Find the appropriate number of Factors for the given survey items (only q1-q8)
- Interpret the factor loadings and find out which survey items/variables are related to which factors
- Compute the different types of variance to model the interrelationships among variables

**Intended Learning Outcomes**
- M5. Understand the fundamentals of factor analysis and identify opportunities to combine variables;

- M6. Conduct your own factor analysis in Python;


### Install the factor_analyzer package

#### For MacOS:
Press Command + Space Bar on your Mac keyboard, type “Terminal”, and open the Terminal.
Type `pip install factor_analyzer==0.3.2` and hit Enter.
Restart the kernel of your Jupyter Notebook. Now try to import the package by running `from factor_analyzer import FactorAnalyzer` in a notebook cell.

#### For Windows:
Type “Anaconda Prompt” in the Windows search bar and open the Anaconda Prompt.
Type `pip install factor_analyzer==0.3.2` and hit Enter.
Restart the kernel of your Jupyter Notebook. Now try to import the package by running `from factor_analyzer import FactorAnalyzer` in a notebook cell.

#### For Colab:
Run `!pip install factor_analyzer==0.3.2` in a Code cell. Now try to import: `from factor_analyzer import FactorAnalyzer` in a cell in the Colab notebook.


#### Import the necessary libraries you will need to perform factor analysis on the survey data.

In [None]:
import pandas as pd
from factor_analyzer import FactorAnalyzer #package for Exploratory Factor Analysis (EFA)

#### 0. Import the dataset
Read the CSV file and store it in a dataframe named `df`.

In [None]:
url = 'https://raw.githubusercontent.com/MaastrichtU-IDS/global-studies/main/semester4/tutorial4/inputs/tolerance_survey_data.csv'
df = pd.read_csv(url)

#### 1. Drop the columns: 'id' , 'age', 'height', 'country', 'language' and 'freq_travel' of `df`
For conducting factor analysis we are only interested in questions q1-q8 as they are related to tolerance. We are not interested in studying the attributes of respondents.

In [None]:
df = df.drop(['id','age','height','country','language','freq_travel'], axis=1)

#### 2. Transform the Likert-scale responses to numbers by using the following mapping:

Strongly Agree ---> 5

Agree ---> 4

Neutral ---> 3

Disagree ---> 2

Strongly Disagree ---> 1

In [None]:
df_transformed = df.replace(['Strongly Agree',
                   'Agree', 
                   'Neutral', 
                   'Disagree', 
                   'Strongly Disagree'], [5,4,3,2,1])
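
The same mapping can also be written as a dictionary, which keeps each label next to its number. The sketch below uses a small made-up dataframe (`toy`, not the tutorial dataset) and assumes the labels are spelled exactly as in the mapping above; any label that is missing from the dictionary becomes `NaN`.

```python
import pandas as pd

# Toy responses (hypothetical, not the tutorial dataset)
toy = pd.DataFrame({
    "q1": ["Strongly Agree", "Agree", "Neutral"],
    "q2": ["Disagree", "Strongly Disagree", None],
})

likert_to_number = {
    "Strongly Agree": 5,
    "Agree": 4,
    "Neutral": 3,
    "Disagree": 2,
    "Strongly Disagree": 1,
}

# Map each column separately; values not found in the dictionary become NaN
toy_numeric = toy.apply(lambda column: column.map(likert_to_number))
print(toy_numeric)
```

Compared with `replace`, `map` turns unexpected labels into missing values instead of silently leaving them as text, which can make spelling problems easier to spot.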

#### 3. Print the first 10 rows of the transformed dataframe.

In [None]:
df_transformed.head(10)

#### 4. Handle missing values

Compute the mean of each column (question) and replace the missing values in a column with the mean of that column. Then check whether the missing values have been replaced, for instance by printing the values of a column, e.g. for q1: `df_transformed.q1.values`.

In [None]:
#Mean of all the columns
column_means = df_transformed.mean()
column_means

In [None]:
#Replace missing values in each column by using the mean of that particular column
df_transformed = df_transformed.fillna(column_means)

In [None]:
#Print the values of column q1
df_transformed.q1.values
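
Instead of inspecting the values by eye, the number of missing values per column can be counted before and after the replacement. The sketch below illustrates this on a made-up dataframe (`toy` is hypothetical, not the tutorial dataset).

```python
import numpy as np
import pandas as pd

# Toy numeric responses with one missing value per column (hypothetical data)
toy = pd.DataFrame({"q1": [5.0, np.nan, 3.0], "q2": [2.0, 4.0, np.nan]})

print(toy.isna().sum())              # missing values per column before imputation

toy_filled = toy.fillna(toy.mean())  # each NaN is replaced by its own column mean

print(toy_filled.isna().sum())       # missing values per column after imputation
print(toy_filled)
```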

##### Plot the correlation matrix

What patterns do you observe? Which survey items can be grouped together?

In [None]:
df_transformed.corr().style.background_gradient(cmap="Blues")

#### 5. Check the suitability of the data for Factor Analysis using Kaiser-Meyer-Olkin (KMO) test

Conduct a Kaiser-Meyer-Olkin (KMO) test to evaluate whether the data is suitable for factor analysis, and interpret the result of the test.

Check the documentation of the `calculate_kmo()` function here: https://factor-analyzer.readthedocs.io/en/latest/factor_analyzer.html?highlight=kmo#factor_analyzer.factor_analyzer.calculate_kmo and use it to check the suitability of the transformed dataframe (without missing values) for factor analysis.

Interpret the KMO value using the table provided in the lecture and decide whether the dataset is suitable for factor analysis.

In [None]:
from factor_analyzer import calculate_kmo
#Write the code to check the suitability of the data for Factor Analysis
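
To see what the KMO statistic measures, it can help to compute it by hand once. The sketch below is not the package function: it is a plain NumPy illustration of the usual definition (squared correlations compared with squared partial correlations), run on simulated data. The names `kmo_overall` and `toy_items` are made up for this example.

```python
import numpy as np

def kmo_overall(data):
    """Overall KMO for an (observations x items) array, computed by hand."""
    corr = np.corrcoef(data, rowvar=False)
    inv_corr = np.linalg.inv(corr)
    # Partial correlations are obtained from the inverse correlation matrix
    scale = np.sqrt(np.outer(np.diag(inv_corr), np.diag(inv_corr)))
    partial = -inv_corr / scale
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diagonal] ** 2)     # squared correlations
    p2 = np.sum(partial[off_diagonal] ** 2)  # squared partial correlations
    return r2 / (r2 + p2)

rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 1))                     # one common source
toy_items = shared + 0.5 * rng.normal(size=(200, 4))   # four related items

kmo_value = kmo_overall(toy_items)
print(round(kmo_value, 3))
```

The value lies between 0 and 1. It is close to 1 when the items share a lot of common variance, so the partial correlations are small relative to the ordinary correlations.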

#### 6. Conduct Exploratory Factor Analysis


In [None]:
#How many items are there in the survey ?
no_of_items = df_transformed.shape[1] # we have 8 questions = number of columns = the second parameter of df.shape
no_of_items

In [None]:
from factor_analyzer import FactorAnalyzer
fa = FactorAnalyzer(n_factors=no_of_items, method='principal') # set number of factors to the number of items and method ='principal'
fa.fit(df_transformed)

Initially, the number of factors is set to the number of survey items in the dataset. So, in our case that will be `no_of_items = 8`. However, not all factors provide useful information about the common variance shared by the different survey items. Therefore, we will use the criteria discussed in the lecture to choose the optimal number of factors.

1. Retain all factors with an eigenvalue greater than 1.
2. Use the scree plot to determine the point of inflection and the optimal number of factors.
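
The first criterion can be illustrated on simulated data. The sketch below assumes that the eigenvalues in question are those of the item correlation matrix, and computes them directly with NumPy rather than with the package. The toy data (`toy_items`) are built from two independent latent variables, so two large eigenvalues are to be expected.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: two groups of three items, each group driven by its own latent variable
latent = rng.normal(size=(300, 2))
toy_items = np.hstack([
    latent[:, [0]] + 0.6 * rng.normal(size=(300, 3)),
    latent[:, [1]] + 0.6 * rng.normal(size=(300, 3)),
])

corr = np.corrcoef(toy_items, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]  # largest first

# Criterion 1: count the eigenvalues greater than 1
n_retained = int(np.sum(eigenvalues > 1))

print(np.round(eigenvalues, 2))
print("Factors with eigenvalue > 1:", n_retained)
```

The eigenvalues of a correlation matrix add up to the number of items, so an eigenvalue above 1 means that the factor accounts for more variance than a single item does.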

#### 7. Print the eigenvalues for all the factors.

Go to the following documentation of the factor_analyzer package: https://factor-analyzer.readthedocs.io/en/latest/genindex.html. Find the function `get_eigenvalues()` and study its documentation. Apply the `get_eigenvalues()` function to print the eigenvalues of all the factors.

In [None]:
#Print the eigenvalues

#### 8. Make a scree plot using matplotlib
The x-axis should show the factor number and the y-axis the corresponding eigenvalue. A line should connect the points, as illustrated in the lecture slides.

In [None]:
import matplotlib.pyplot as plt

# x-axis should print the number of factors (= no. of items so from 1 to 8)
#Use the range function to get numbers from 1 to 8. Check documentation of range() here: https://www.w3schools.com/python/ref_func_range.asp
x_axis = range() # fill the correct parameters for the range function

# y-axis should contain the eigenvalues
y_axis = 

In [None]:
#Documentation: https://matplotlib.org/3.5.1/plot_types/index.html

#First make a scatter plot (just plot the points without the line)


#Then make a line plot connecting these points


#Add a plot title
plt.title('Scree Plot')

# Add the label for the x-axis
plt.xlabel('Put the appropriate label')

# Add the label for the y-axis
plt.ylabel('Put the appropriate label')

#Add the grid lines (optional)
plt.grid()

#Show the plot
plt.show()
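
For orientation, here is a complete scree plot drawn from made-up numbers. The eigenvalues in `toy_eigenvalues` are hypothetical values for a six-item survey, not results from the tolerance dataset, and the dashed line at 1 is an optional addition that marks the eigenvalue criterion.

```python
import matplotlib.pyplot as plt

# Hypothetical eigenvalues for a six-item survey, largest first
toy_eigenvalues = [2.6, 1.4, 0.8, 0.5, 0.4, 0.3]
factor_numbers = range(1, len(toy_eigenvalues) + 1)  # 1, 2, ..., 6

plt.scatter(factor_numbers, toy_eigenvalues)   # the points
plt.plot(factor_numbers, toy_eigenvalues)      # the line connecting them
plt.axhline(y=1, linestyle="--")               # reference line: eigenvalue = 1
plt.title("Scree Plot (toy data)")
plt.xlabel("Factor number")
plt.ylabel("Eigenvalue")
plt.grid()
plt.show()
```

In this toy example the curve flattens after the second factor, and only the first two eigenvalues lie above the dashed line, so both criteria would point to two factors.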

#### 9. How many factors will you choose based on the eigenvalues and your interpretation of the scree plot?

#### 10. Fit the Factor Analysis on the retained factors

Now let's fit the factor analysis model again, this time with the number of retained factors, to compute the factor loadings for each of the 8 survey items.

Use the `FactorAnalyzer()` class and its `fit()` method again to fit the new factor analysis model with the number of retained factors.

In [None]:
# Put the n_factors equal to the number of factors retained
# set rotation = 'varimax' which is a type of orthogonal rotation
# set method = 'principal' for principal factor extraction method of exploratory factor analysis
no_of_retained_factors = 
fa = FactorAnalyzer(n_factors = no_of_retained_factors , method='principal', rotation ='varimax') #default rotation is set to promax (oblique rotation)
fa.fit(df_transformed) 

#### 11. Print the factor loadings matrix. 
Use `fa.loadings_` to print the factor loadings matrix

In [None]:
fa.loadings_

#### 12. Create a pandas DataFrame named `df_loadings` from the factor loading array, `fa.loadings_`.
If you have retained m factors then the names of the columns of the DataFrame should be: Factor 0, Factor 1, ........ Factor m-1. So if m=3 then your column names are: Factor 0, Factor 1 and Factor 2

In [None]:
#Use the following syntax to create a DataFrame from an array. Put the correct values for the array_name, index and columns
# df_loadings = pd.DataFrame(array_name, index = put correct value here , columns = put correct value here )

# After creating the df_loadings, try this-> df_loadings.style.background_gradient(cmap = "Reds")
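
The naming pattern can be tried out on a made-up array first. In the sketch below, `toy_loadings`, `item_names` and `factor_names` are hypothetical and only illustrate how `index` and `columns` label the rows and columns; the loadings are invented numbers, not results.

```python
import numpy as np
import pandas as pd

# Hypothetical loadings for 4 items on m = 2 factors (not real results)
toy_loadings = np.array([
    [0.82, 0.10],
    [0.75, 0.21],
    [0.15, 0.88],
    [0.05, 0.69],
])

item_names = ["item_a", "item_b", "item_c", "item_d"]
# One column name per factor: Factor 0, Factor 1, ..., Factor m-1
factor_names = [f"Factor {i}" for i in range(toy_loadings.shape[1])]

toy_df_loadings = pd.DataFrame(toy_loadings, index=item_names, columns=factor_names)
print(toy_df_loadings)
```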

#### 13. Interpret the factor loadings
1. Mention the most representative factor (out of the retained factors) for each survey item/question. Only consider factor loadings above +0.60 or below -0.60.

2. Mention the most representative survey items which load onto each of the retained Factors. 

Write your answer in the Markdown cell below.
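
The threshold can also be applied programmatically, which is a useful check on a reading done by eye. The sketch below works on made-up loadings (`toy_df_loadings` is hypothetical, not a result from the tolerance dataset) and uses the absolute value so that strong negative loadings are kept as well.

```python
import pandas as pd

# Hypothetical loadings (not real results)
toy_df_loadings = pd.DataFrame(
    {"Factor 0": [0.82, 0.75, 0.15, -0.64], "Factor 1": [0.10, 0.21, 0.88, 0.30]},
    index=["item_a", "item_b", "item_c", "item_d"],
)

threshold = 0.60
strong = toy_df_loadings.abs() > threshold  # True where |loading| > 0.60

# Items that load strongly on each factor
for factor in toy_df_loadings.columns:
    print(factor, "->", list(strong.index[strong[factor]]))

# Factor with the largest absolute loading for each item
print(toy_df_loadings.abs().idxmax(axis=1))
```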


#### 14. Communalities
Print the communalities for each of the survey items/questions.

Go to the following documentation of the factor_analyzer package: https://factor-analyzer.readthedocs.io/en/latest/genindex.html . Find the function `get_communalities()` and study its documentation. Apply the `get_communalities()` function to print the communalities.

In [None]:
#Print the communalities for each of the survey items/questions.

#### 15. Create a dataframe with one column and number of rows equal to the number of survey items to store the communalities for each survey item. 

In [None]:
df_communalities = pd.DataFrame(fa.get_communalities(), index=df_transformed.columns, columns=['Communalities'])
df_communalities
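
Under the standard definition, the communality of an item is the sum of its squared loadings across the retained factors, that is, the share of the item's variance explained by the factors. The sketch below works this out for made-up loadings (`toy_loadings` is hypothetical), assuming standardized items so that the remaining share, the uniqueness, is one minus the communality.

```python
import numpy as np

# Hypothetical loadings for 4 items on 2 factors (not real results)
toy_loadings = np.array([
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.9],
    [0.0, 0.6],
])

# Communality: sum of squared loadings across factors, one value per item (row)
toy_communalities = (toy_loadings ** 2).sum(axis=1)

# Uniqueness: the part of each item's variance not explained by the factors
toy_uniqueness = 1 - toy_communalities

print(toy_communalities)
print(toy_uniqueness)
```

For the first item, for example, the communality is 0.8² + 0.1² = 0.65, so 65% of its variance is shared with the two factors.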

#### 16. Variance in Factors
Use the `get_factor_variance()` function to calculate the factor variance information, including the variance, proportional variance and cumulative variance for each factor. Refer to the documentation here: https://factor-analyzer.readthedocs.io/en/latest/genindex.html and find out how to use the `get_factor_variance()` function.


In [None]:
#Print the variance for all retained factors

Convert the output above to a DataFrame named `df_variance` for a better representation of the factor variances. Pay attention to the index names and column names.

In [None]:
#Use the following syntax to create a DataFrame from an array. Put the correct values for the array_name, index and columns
#df_variance = pd.DataFrame(array_name, index = put correct value here , columns = put correct value here )
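
The three quantities can be worked out by hand from a loading matrix, which shows how they relate to each other. The sketch below uses the usual definitions on made-up loadings (`toy_loadings` is hypothetical). It is meant as an illustration of the arithmetic, not as a substitute for `get_factor_variance()`.

```python
import numpy as np
import pandas as pd

# Hypothetical loadings for 4 items on 2 factors (not real results)
toy_loadings = np.array([
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.9],
    [0.0, 0.6],
])
n_items = toy_loadings.shape[0]

variance = (toy_loadings ** 2).sum(axis=0)  # sum of squared loadings per factor (column)
proportional = variance / n_items           # share of the total variance per factor
cumulative = np.cumsum(proportional)        # running total over the factors

toy_df_variance = pd.DataFrame(
    [variance, proportional, cumulative],
    index=["Variance", "Proportional variance", "Cumulative variance"],
    columns=["Factor 0", "Factor 1"],
)
print(toy_df_variance)
```

In this toy example the two factors together explain 59% of the total variance of the four items.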

#### 17. Representation of survey data in terms of Factors
Create a new dataframe named `df_Factors` by applying the computed factors to the `df_transformed` dataframe containing 150 observations. By doing so, you reduce the `df_transformed` dataframe from 8 items to the number of retained factors. Use the `fit_transform()` function from the documentation: https://factor-analyzer.readthedocs.io/en/latest/genindex.html

In [None]:
#Write the code here to represent the survey data in terms of Factors

In [None]:
#Convert the transformed array into a dataframe