# Factor Analysis in Machine Learning / Data Analysis

In [49]:
# Import necessary libraries
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

import numpy as np


In [7]:
data = pd.read_csv('pima_indians_diabetes.csv')
data.head()

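|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |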


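1. **Standardizing the Data**: The features are measured in very different units (for example, Glucose versus DiabetesPedigreeFunction), so we use StandardScaler to give each feature zero mean and unit variance. This makes the loadings comparable across features.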
2. **Performing Factor Analysis**: The FactorAnalysis object from sklearn.decomposition is used. The n_components parameter specifies how many factors to extract.
3. **Factor Loadings**: These show the contribution of each feature to the extracted factors.
4. **Extracted Factors**: The transformed dataset represents the original data in terms of the extracted latent factors.

### What Are Factor Loadings?

Factor loadings are numerical values that indicate how strongly each original variable (feature) is associated with each factor extracted during factor analysis. When the variables are standardized and the factors are uncorrelated, a loading can be read as the model-implied correlation between a variable and a factor; it shows how much the variable "loads" onto the factor.

Mathematically, factor loadings are the coefficients that express each observed variable as a linear combination of the factors, plus variable-specific noise. The factors are not defined as linear combinations of the variables.
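To make the direction of the relationship concrete, here is the standard factor model written out. The symbols are notational assumptions introduced for this sketch and do not appear elsewhere in the notebook: $x$ is a $p$-dimensional observation, $z$ the $k$ latent factors, $W$ the $p \times k$ loading matrix (the transpose of scikit-learn's `components_`), $\mu$ the feature means, and $\Psi$ the diagonal noise covariance.

$$
x = W z + \mu + \varepsilon, \qquad z \sim \mathcal{N}(0, I_k), \qquad \varepsilon \sim \mathcal{N}(0, \Psi), \qquad \Psi = \operatorname{diag}(\psi_1, \dots, \psi_p)
$$

$$
\operatorname{Cov}(x) = W W^\top + \Psi, \qquad \operatorname{Cov}(x_j, z_m) = W_{jm}
$$

Under these assumptions, if $x_j$ has unit variance (as after standardization), then $W_{jm}$ is the correlation between variable $j$ and factor $m$, and $\sum_m W_{jm}^2 + \psi_j \approx 1$.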

### How to Interpret Factor Loadings
1. Magnitude:
    * A high factor loading (closer to -1 or 1) indicates that the variable has a strong relationship with the corresponding factor.
    * A low factor loading (close to 0) indicates that the variable contributes little to the corresponding factor.
2. Sign:
    * A positive loading indicates a direct relationship between the variable and the factor. When the factor increases, the variable tends to increase.
    * A negative loading indicates an inverse relationship between the variable and the factor. When the factor increases, the variable tends to decrease.
3. Dominant Loadings:
    * Variables with the largest absolute loadings on a factor are the most strongly associated with that factor. They help label or interpret the factor.
4. Factor Structure:
    * Variables that have high loadings on the same factor are interpreted as being related or part of the same underlying construct.
    * If a variable has high loadings on multiple factors, it may indicate that the variable is influenced by multiple constructs.
  
### Example of Interpreting Factor Loadings

Assume the following factor loadings from a dataset analyzing health metrics:

| Variable | Factor 1 | Factor 2 | Factor 3 |
|----------|----------|----------|----------|
| Glucose | 0.85 | 0.12 | -0.03 |
| Insulin | 0.90 | -0.05 | 0.08 |
| BMI | 0.60 | 0.10 | 0.50 |
| Age | -0.10 | 0.80 | 0.02 |

* Factor 1 (e.g., "Metabolic Factor"): Glucose and Insulin have high positive loadings, suggesting Factor 1 is strongly related to metabolic activity.
* Factor 2 (e.g., "Age Factor"): Age has a high loading, indicating that this factor is related primarily to the age of the individual.
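* Factor 3 (e.g., "BMI Influence"): BMI is the only variable with a notable loading on Factor 3 (0.50), so this factor may capture body-composition variation that Factor 1 does not explain. BMI also loads 0.60 on Factor 1, so it appears to be influenced by both factors.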
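The reading above can be reproduced programmatically. The following is a minimal sketch that uses the illustrative loadings from the table, not the fitted model. The 0.4 cutoff for a "salient" loading is an assumed rule of thumb, and the variable names are hypothetical.

```python
import pandas as pd

# Illustrative loadings copied from the example table above
example_loadings = pd.DataFrame(
    {
        "Factor 1": [0.85, 0.90, 0.60, -0.10],
        "Factor 2": [0.12, -0.05, 0.10, 0.80],
        "Factor 3": [-0.03, 0.08, 0.50, 0.02],
    },
    index=["Glucose", "Insulin", "BMI", "Age"],
)

salient_threshold = 0.4  # assumed cutoff; adjust to taste
abs_loadings = example_loadings.abs()

# Factor with the largest absolute loading for each variable
dominant_factor = abs_loadings.idxmax(axis=1)

# Variables with salient loadings on more than one factor (cross-loadings)
n_salient = (abs_loadings >= salient_threshold).sum(axis=1)
cross_loaded = n_salient[n_salient > 1].index.tolist()

print(dominant_factor)
print("Cross-loaded variables:", cross_loaded)
```

With these numbers BMI is assigned to Factor 1 by magnitude but is also flagged as cross-loaded, which matches the caveat about Factor 3.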

In [9]:
# Step 1: Prepare the data
# Drop any non-numerical or target columns (e.g., "Outcome")
features = data.drop(columns=["Outcome"])

# Standardize the data
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)


In [11]:
# Step 2: Perform Factor Analysis
# Specify the number of factors to extract
n_factors = 3
factor_analysis = FactorAnalysis(n_components=n_factors, random_state=42)
factors = factor_analysis.fit_transform(features_scaled)
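The choice of three factors above is fixed by hand. One possible way to compare candidate values is cross-validated log-likelihood, since `FactorAnalysis.score` returns the average log-likelihood of held-out samples. The sketch below runs on synthetic stand-in data (the names `X_demo`, `cv_scores`, and `best_k` are hypothetical), so it illustrates the procedure rather than giving a result for the diabetes dataset.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 8 observed features generated from 3 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 8))
X_demo = latent @ mixing + 0.5 * rng.normal(size=(500, 8))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Mean held-out log-likelihood for each candidate number of factors
cv_scores = {}
for k in range(1, 7):
    fa = FactorAnalysis(n_components=k, random_state=42)
    cv_scores[k] = cross_val_score(fa, X_demo_scaled, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("Best number of factors by CV log-likelihood:", best_k)
```

To apply the same idea here, `features_scaled` would replace `X_demo_scaled`. Interpretability of the resulting loadings is still worth weighing alongside the score.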


In [13]:
# Step 3: Analyze the results
# Get factor loadings
loadings = pd.DataFrame(
    factor_analysis.components_.T,  # components_ is (n_factors, n_features); transpose so rows are features
    columns=[f"Factor{i+1}" for i in range(n_factors)],
    index=features.columns
)

# Display factor loadings
print("Factor Loadings:")
print(loadings)



Factor Loadings:
                           Factor1   Factor2   Factor3
Pregnancies              -0.119365  0.577885 -0.012766
Glucose                   0.330945  0.331624  0.064792
BloodPressure             0.112565  0.285480  0.346381
SkinThickness             0.503759 -0.053898  0.388483
Insulin                   0.938834  0.062961 -0.097143
BMI                       0.275897  0.100511  0.671914
DiabetesPedigreeFunction  0.210509  0.058953  0.123855
Age                      -0.109435  0.914474 -0.033471
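Unrotated loadings such as these can spread a variable across several factors, which makes labeling harder. As a hedged aside, scikit-learn (version 0.24 and later) accepts a `rotation` argument on `FactorAnalysis`; a varimax rotation tends to push each loading toward either zero or a large magnitude without changing how well the model fits. The sketch below uses synthetic stand-in data and hypothetical variable names, so its loadings are not those of the diabetes dataset.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 8 observed features generated from 3 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 8))
X_demo = StandardScaler().fit_transform(
    latent @ mixing + 0.5 * rng.normal(size=(500, 8))
)

fa_plain = FactorAnalysis(n_components=3, random_state=42).fit(X_demo)
fa_varimax = FactorAnalysis(
    n_components=3, rotation="varimax", random_state=42
).fit(X_demo)

# Rows are features, columns are factors
loadings_plain = fa_plain.components_.T
loadings_varimax = fa_varimax.components_.T

# An orthogonal rotation leaves the model-implied covariance unchanged
same_fit = np.allclose(fa_plain.get_covariance(), fa_varimax.get_covariance())

print(np.round(loadings_varimax, 2))
print("Same implied covariance:", same_fit)
```

If rotated loadings are used for interpretation, the downstream factor scores should come from the same rotated model so that the labels and the features agree.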


In [19]:
# Create a DataFrame for the extracted factors
factors_df = pd.DataFrame(factors, columns=[f"Factor{i+1}" for i in range(n_factors)])

# Display the first few rows of the factors
factors_df.head()


|   | Factor1 | Factor2 | Factor3 |
|---|---|---|---|
| 0 | -0.578652 | 1.249269 | 0.558757 |
| 1 | -0.62536 | -0.417515 | -0.065258 |
| 2 | -0.696552 | 0.09391 | -0.673719 |
| 3 | 0.115889 | -1.038568 | -0.30293 |
| 4 | 0.964153 | -0.060139 | 0.827289 |


### Train RandomForest Model

* Data Preparation:
    * factors_df: Contains the extracted factors from the Factor Analysis.
    * target: Contains the Outcome column as the target variable for prediction.
* Train-Test Split: We split the dataset into training (80%) and testing (20%) sets.
* RandomForest Model: A RandomForestClassifier is used to train the model with the extracted factors as input features.
* Model Evaluation: accuracy_score and classification_report are used to measure the model’s performance.
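One caveat about this workflow: the scaler and the factor model are fitted on the full dataset before the train-test split, so the test rows influence the preprocessing. A possible alternative, sketched below, is to split first and chain the steps in a scikit-learn `Pipeline` so that everything is fitted on the training rows only. The data here is a synthetic stand-in generated with `make_classification`, and names such as `pipe` and `X_demo` are hypothetical, so the accuracy it prints says nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import FactorAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the feature matrix (768 rows, 8 features)
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, random_state=0
)

# Split first, so the test rows never influence the scaler or the factor model
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("fa", FactorAnalysis(n_components=3, random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ]
)
pipe.fit(X_tr, y_tr)  # each step is fitted on the training rows only

test_accuracy = pipe.score(X_te, y_te)
print("Held-out accuracy:", test_accuracy)
```

A fitted pipeline also simplifies prediction on new rows, since `pipe.predict` applies scaling, factor transformation, and classification in one call.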


In [37]:
# Step 1: Prepare the target variable
target = data["Outcome"]

In [39]:
# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(factors_df, target, test_size=0.2, random_state=42)

In [41]:
# Step 3: Train a RandomForest model using the extracted factors
rf_model = RandomForestClassifier(random_state=42, n_estimators=100)
rf_model.fit(X_train, y_train)

In [43]:
# Step 4: Make predictions
y_pred = rf_model.predict(X_test)

In [45]:
# Step 5: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the RandomForest model:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy of the RandomForest model: 0.6948051948051948

Classification Report:
              precision    recall  f1-score   support

           0       0.75      0.79      0.77        99
           1       0.58      0.53      0.55        55

    accuracy                           0.69       154
   macro avg       0.67      0.66      0.66       154
weighted avg       0.69      0.69      0.69       154



### Making Predictions

* New Data Point: The new_data array contains raw (unscaled) values for the same features, in the same column order, as the original dataset.
* Standardization: The scaler object from the training process is reused to standardize the new data.
* Factor Transformation: The factor_analysis object from the training process is used to transform the new data into the factor space.
* Prediction: The trained rf_model predicts the Outcome (diabetes status) based on the factors.
    * **predict_proba** provides the probabilities for each class (e.g., Class 0: No Diabetes, Class 1: Diabetes).

In [54]:
# Step 1: Example of a new data point (raw, unscaled values in the training column order)
new_data = np.array([
    [2, 110, 80, 25, 130, 30.5, 0.5, 35]  # Example feature values
])

In [62]:
# Step 2: Standardize the new data (use the same scaler as used for training)
new_data_scaled = scaler.transform(new_data)
new_data_scaled



array([[-0.54791859, -0.34096773,  0.56322275,  0.27998931,  0.4358859 ,
        -0.18943689,  0.08493691,  0.14967911]])

In [60]:
# Step 3: Apply Factor Analysis to transform the new data
new_data_factors = factor_analysis.transform(new_data_scaled)
new_data_factors

array([[ 0.3851519 ,  0.09206998, -0.11336205]])

In [64]:
# Step 4: Use the trained RandomForest model to make a prediction
new_prediction = rf_model.predict(new_data_factors)
new_prediction_proba = rf_model.predict_proba(new_data_factors)



In [66]:
# Step 5: Display the prediction and probabilities
print("Predicted Outcome:", new_prediction[0])
print("Prediction Probabilities (Class 0 and Class 1):", new_prediction_proba[0])

Predicted Outcome: 0
Prediction Probabilities (Class 0 and Class 1): [0.56 0.44]
