<a href="https://colab.research.google.com/github/Utkarshmishra2k2/Factor-Analysis-PCA-on-Airline-Data/blob/main/Factor_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install factor-analyzer

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import statsmodels.stats.outliers_influence as sms
import statsmodels.api as sm

In [None]:
from factor_analyzer import calculate_bartlett_sphericity
from factor_analyzer import calculate_kmo
from factor_analyzer import FactorAnalyzer

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics

In [None]:
from sklearn.cluster import KMeans

In [None]:
data_02 = pd.read_csv("https://raw.githubusercontent.com/UM1412/Data-Set/main/FactorAnalysisTrain.csv")

# Factor Analysis

We keep only the columns that hold ratings on a Likert scale.

In [None]:
data_01 = data_02.iloc[:, 8:22]

In [None]:
data_01.sample(10)

## Factorability

### 01) Bartlett's Test of Sphericity

**The two primary tests commonly used to assess the suitability of a dataset for Factor Analysis are Bartlett's Test of Sphericity and the Kaiser-Meyer-Olkin (KMO) Test.**

Bartlett's Test of Sphericity checks whether the variables are correlated strongly enough for factor analysis to be worthwhile. Its null hypothesis is that the correlation matrix is an identity matrix, i.e. that the variables are uncorrelated. A significant result (a small p-value) rejects that hypothesis and supports conducting factor analysis to uncover latent factors within the dataset.

In [None]:
chi_square, p_value = calculate_bartlett_sphericity(data_01)

print("Chi-Square Statistic: ", chi_square)
print("P-value: ", p_value)

**Interpretation**
<br/>The p-value is lower than 0.05, so we reject the hypothesis that the variables are uncorrelated: this dataset is suitable for factor analysis.
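
As a rough illustration of what the test computes, the chi-square statistic can be reproduced from the correlation matrix alone. The sketch below is not the `factor_analyzer` implementation; the function name `bartlett_sphericity_statistic` and the synthetic `toy` data are assumptions made for this example.

```python
import numpy as np

def bartlett_sphericity_statistic(data):
    """Return (chi_square, degrees_of_freedom) for Bartlett's test of sphericity."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    # slogdet is numerically safer than log(det(...)) when the matrix is near-singular
    _, log_det = np.linalg.slogdet(corr)
    chi_square = -(n - 1 - (2 * p + 5) / 6) * log_det
    dof = p * (p - 1) // 2
    return chi_square, dof

# Synthetic data (hypothetical): four variables driven by one shared component
rng = np.random.default_rng(0)
common = rng.normal(size=(500, 1))
toy = common + 0.5 * rng.normal(size=(500, 4))

chi_square, dof = bartlett_sphericity_statistic(toy)
print("Chi-square:", chi_square, "Degrees of freedom:", dof)
```

The p-value is the upper-tail probability of a chi-square distribution with `dof` degrees of freedom, which could be obtained with `scipy.stats.chi2.sf(chi_square, dof)` if SciPy is available.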

### 02) KMO-test

The Kaiser-Meyer-Olkin (KMO) measure is a statistical tool employed to evaluate the suitability of a dataset for factor analysis. It assesses the extent to which variables in the dataset share common variance, which is crucial for factor analysis. The KMO measure quantifies the proportion of variance among variables that is shared, providing insight into whether the dataset meets the fundamental assumption of factor analysis.

In [None]:
kmo_all, kmo_model = calculate_kmo(data_01)

print("KMO for All Variables:", kmo_all)
print("KMO for Model:", kmo_model)

**Interpretation**
<br/>The overall Kaiser-Meyer-Olkin (KMO) measure exceeds 0.6, and each variable individually also has a KMO score above 0.6.
<br/>Based on these results, I concluded that this dataset is well suited for factor analysis.
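
To make "shared variance" concrete, here is a minimal sketch of the KMO calculation: it compares squared correlations with squared partial correlations, which are derived from the inverse of the correlation matrix. The helper name `kmo_sketch` and the synthetic `toy` data are assumptions for illustration only.

```python
import numpy as np

def kmo_sketch(data):
    """Return (per-variable KMO, overall KMO) from raw observations."""
    data = np.asarray(data, dtype=float)
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    scale = np.sqrt(np.diag(inv))
    # Partial correlation between i and j, controlling for all other variables
    partial = -inv / np.outer(scale, scale)
    # Only off-diagonal entries enter the measure
    np.fill_diagonal(corr, 0.0)
    np.fill_diagonal(partial, 0.0)
    r2 = corr ** 2
    q2 = partial ** 2
    per_variable = r2.sum(axis=0) / (r2.sum(axis=0) + q2.sum(axis=0))
    overall = r2.sum() / (r2.sum() + q2.sum())
    return per_variable, overall

# Synthetic data (hypothetical): four variables driven by one shared component
rng = np.random.default_rng(0)
common = rng.normal(size=(500, 1))
toy = common + 0.5 * rng.normal(size=(500, 4))

per_variable, overall = kmo_sketch(toy)
print("Per-variable KMO:", per_variable)
print("Overall KMO:", overall)
```

Values close to 1 mean the partial correlations are small relative to the raw correlations, i.e. most of the correlation is explained by what the variables have in common.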

## Standardization

In [None]:
scaler = StandardScaler()
data_03 = scaler.fit_transform(data_01)

In [None]:
data_03

## Principal Component Analysis

In [None]:
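# Note: this rebinds the name PCA from the class to the fitted instance,
# so this cell cannot be re-run without importing PCA again.
PCA = PCA()
PCA.fit(data_03)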

## Deciding the Number of Factors

Decide on the number of principal components to retain. This decision can be based on the cumulative explained variance (usually aiming for a high cumulative variance, e.g., 65-90%) or by using criteria such as the Kaiser criterion (retain components with eigenvalues greater than 1) or scree plot inspection.

In [None]:
result = pd.DataFrame({
    "Eigen": PCA.explained_variance_,
    "Variance_ratio":PCA.explained_variance_ratio_ * 100,
    "CumulativeVariance": (PCA.explained_variance_ratio_ * 100).cumsum()
})
result.index = ['comp ' + str(i+1) for i in result.index]

result

**Interpretation**

Since the cumulative explained variance exceeds 65% at Component 4, we will extract 4 components. The Kaiser criterion points the same way, with the eigenvalues greater than 1 ending at Component 4.
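
The two retention rules used above can be sketched in a few lines. The function name `components_to_retain` and the eigenvalues in `toy_eigenvalues` are made up for illustration; they are not the values from this dataset.

```python
from itertools import accumulate

def components_to_retain(eigenvalues, variance_target=0.65):
    """Return (count by Kaiser criterion, count needed to reach variance_target)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    # Kaiser criterion: keep components whose eigenvalue exceeds 1
    kaiser_count = sum(1 for value in ordered if value > 1.0)
    # Cumulative share of variance after each component
    cumulative_share = [running / total for running in accumulate(ordered)]
    variance_count = next(
        index + 1
        for index, share in enumerate(cumulative_share)
        if share >= variance_target
    )
    return kaiser_count, variance_count

toy_eigenvalues = [3.0, 1.5, 1.2, 0.8, 0.5]  # hypothetical values
print(components_to_retain(toy_eigenvalues))  # -> (3, 3)
```

When the two rules disagree, the scree plot is a useful third opinion.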

In [None]:
a = PCA.explained_variance_
num_components = len(a)
plt.figure(figsize=(15,15))
# Use the actual number of components for the x-axis instead of a hard-coded range
plt.plot(np.arange(1, num_components + 1), a, marker="*", linestyle="-")
plt.title('Scree Plot')
plt.xlabel('Principal Component Index')
plt.ylabel('Eigenvalue')
plt.grid(True)
plt.show()

The scree plot shows an elbow at Component 4. We can therefore conclude that the components up to and including Component 4 capture most of the variance in the data, while the components after the elbow capture less significant variance.

## Factor Analysis Type

### Rotation == None

In [None]:
Factor_01 = FactorAnalyzer(n_factors=4, rotation=None)
Factor_01.fit(data_03)

In [None]:
laoding_01 = pd.DataFrame(Factor_01.loadings_, index=data_01.columns, columns=[f'Factor{i+1}' for i in range(4)])
laoding_01

### Rotation == Promax

In [None]:
Factor_02 = FactorAnalyzer(n_factors=4, rotation="promax")
Factor_02.fit(data_03)

In [None]:
laoding_02 = pd.DataFrame(Factor_02.loadings_, index=data_01.columns, columns=[f'Factor{i+1}' for i in range(4)])
laoding_02

### Rotation == Quartimax

In [None]:
Factor_03 = FactorAnalyzer(n_factors=4, rotation="quartimax")
Factor_03.fit(data_03)

In [None]:
laoding_03 = pd.DataFrame(Factor_03.loadings_, index=data_01.columns, columns=[f'Factor{i+1}' for i in range(4)])
laoding_03

### Rotation == Geomin (Orthogonal)

In [None]:
Factor_04 = FactorAnalyzer(n_factors=4, rotation="geomin_ort")
Factor_04.fit(data_03)

In [None]:
laoding_05 = pd.DataFrame(Factor_04.loadings_, index=data_01.columns, columns=[f'Factor{i+1}' for i in range(4)])
laoding_05

### Rotation == Varimax

In [None]:
Factor_05 = FactorAnalyzer(n_factors=4, rotation="varimax")
Factor_05.fit(data_03)

In [None]:
laoding = pd.DataFrame(Factor_05.loadings_, index=data_01.columns, columns=[f'Factor{i+1}' for i in range(4)])

In [None]:
laoding

**Factor Making**


To name each factor, we considered the variables with the highest loadings on it (three or four per factor), using the varimax-rotated loadings.

Factor 1 is labeled "Inflight Comfort & Quality" due to its strong association with aspects enhancing the quality of time spent inside the aircraft.
- Cleanliness (0.854)
- Food and drink (0.77)
- Inflight entertainment (0.766)
- Seat Comfort (0.754)

Factor 2 is denoted as "Customer Service Quality" since it primarily encompasses components linked to the provision of high-quality service throughout the entire journey, from boarding to arrival.
- Inflight service (0.799)
- Baggage handling (0.76)
- On-board service (0.7)
- Leg Room (0.4832)

Factor 3 is titled "Convenience and Efficiency" as it predominantly reflects elements aimed at optimizing time efficiency during the entire travel experience.
- Ease of online booking (0.766)
- Gate location (0.68)
- Departure/Arrival time convenient (0.589)

The fourth and final factor is named "Technological Accessibility" owing to its strong association with technological advancements facilitating convenient access to flight-related services.
- Online boarding (0.7565)
- Inflight Wi-Fi service (0.478)
- Ease of online booking (0.463)
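
Picking the top-loading variables for a factor can be automated instead of read off the table by eye. The sketch below assumes loadings stored as a plain dictionary; the function name `top_loading_variables`, the item names and the numbers in `toy_loadings` are hypothetical and are not taken from this notebook's results.

```python
def top_loading_variables(loadings, factor_index, k=3):
    """Return the k (variable, loading) pairs with the largest absolute loading."""
    ranked = sorted(
        loadings.items(),
        key=lambda item: abs(item[1][factor_index]),
        reverse=True,
    )
    return [(name, values[factor_index]) for name, values in ranked[:k]]

# Hypothetical loadings: variable -> [loading on factor 0, loading on factor 1]
toy_loadings = {
    "item_a": [0.85, 0.10],
    "item_b": [0.77, 0.05],
    "item_c": [0.08, 0.80],
    "item_d": [-0.03, 0.76],
}

print(top_loading_variables(toy_loadings, factor_index=0, k=2))
# -> [('item_a', 0.85), ('item_b', 0.77)]
```

Ranking by absolute value matters because a strongly negative loading is just as informative as a strongly positive one.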

In [None]:
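# Compute factor scores for every respondent.
# Note: this scores with Factor_04 (geomin_ort) on the raw ratings in data_01,
# while the factor names above come from the varimax fit (Factor_05) on the
# standardized data_03; Factor_05.transform(data_03) would match those names.
data_04 = Factor_04.transform(data_01)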

In [None]:
data_05 = pd.DataFrame()

In [None]:
data_05['Inflight Comfort and Quality'] = data_04[:, 0]
data_05['Customer Service Quality'] = data_04[:, 1]
data_05['Convenience and Efficiency'] = data_04[:, 2]
data_05['Technological Accessibility'] = data_04[:, 3]

In [None]:
data_01.shape

In [None]:
data_05.shape

In [None]:
data_05.corr(method = 'pearson')

In [None]:
(pd.DataFrame(Factor_05.get_factor_variance(),index=['Variance','Proportional Var','Cumulative Var'],columns = ["Inflight Comfort and Quality","Customer Service Quality","Convenience and Efficiency","Technological Accessibility"]))

In [None]:
pd.DataFrame(Factor_05.get_communalities(),index=laoding.index,columns=['Communalities'])

In [None]:
factor_loadings = {
    "Inflight Comfort & Quality": {"Cleanliness": 0.854, "Food and Drink": 0.77, "Inflight Entertainment": 0.766,"Seat Comfort":0.754},
    "Customer Service Quality": {"Inflight Service": 0.799, "Baggage Handling": 0.76, "On-board Service": 0.7,"Leg Room":0.4832},
    "Convenience and Efficiency": {"Ease of online booking": 0.766, "Gate location": 0.68, "Inflight wifi service": 0.05,"Departure/Arrival Time":0.589},
    "Technological Accessibility": {"Online boarding": 0.7565, "Inflight wifi service": 0.478, "Ease of online booking": 0.463}
}
factor_loadings_df = pd.DataFrame.from_dict(factor_loadings, orient='index')
factor_loadings_df = factor_loadings_df.transpose()
plt.figure(figsize=(10, 6))
sns.heatmap(factor_loadings_df, annot=True, cmap="YlGnBu", cbar=False)
plt.title('Factor Loadings')
plt.xlabel('Factors')
plt.ylabel('Variables')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Logistic Regression

### Label Encoding

In [None]:
label_encoder = LabelEncoder()

In [None]:
data_05["Result"] = label_encoder.fit_transform(data_02['satisfaction'])

In [None]:
data_05.info()

In [None]:
Y = data_05["Result"]
X = sm.add_constant(data_05[['Inflight Comfort and Quality', 'Customer Service Quality','Convenience and Efficiency', 'Technological Accessibility']])

In [None]:
X

**Binary Target Variable**

In [None]:
print(data_02['satisfaction'].value_counts())

In [None]:
print(data_05["Result"].value_counts())

**Interpretation**

There are only two outcomes ("neutral or dissatisfied" and "satisfied"), so this is a binary classification problem and we will use binary logistic regression, fitted below with `sm.Logit`.


In [None]:
model  = sm.Logit(Y, X).fit_regularized(alpha=0.1)

In [None]:
print(model.summary())

Hypothesis:

H0: the coefficient of a given factor is zero (the factor is not significant) vs H1: not H0

Decision Criteria:

Since every p-value is less than 0.05 (the significance level), we reject H0 for each factor, i.e., all four factors are significant.
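
For reference, the z statistic and p-value reported for each coefficient in the model summary follow the usual Wald test: the coefficient divided by its standard error, compared against a standard normal distribution. The sketch below uses a hypothetical coefficient and standard error, not values from this model.

```python
import math

def wald_test(coefficient, standard_error):
    """Return (z, two-sided p-value) for H0: coefficient == 0."""
    z = coefficient / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p_value = wald_test(coefficient=0.5, standard_error=0.1)  # hypothetical inputs
print("z =", z, "p-value =", p_value)
print("Reject H0 at the 5% level:", p_value < 0.05)
```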

**The Logistic Regression Equation Is Given As**

**log(p/(1-p)) =  -6.6653 +  0.7408 * Inflight Comfort and Quality + 0.7172 * Customer Service Quality + 0.4942 * Convenience and Efficiency + 0.9806 * Technological Accessibility**
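
The equation gives log-odds, which can be converted to a probability with the logistic function. The sketch below plugs in the coefficients stated above; since `LabelEncoder` assigns labels in sorted order, class 1 corresponds to "satisfied". The function name `satisfaction_probability` and the factor scores in `example_scores` are hypothetical.

```python
import math

INTERCEPT = -6.6653
COEFFICIENTS = {
    "Inflight Comfort and Quality": 0.7408,
    "Customer Service Quality": 0.7172,
    "Convenience and Efficiency": 0.4942,
    "Technological Accessibility": 0.9806,
}

def satisfaction_probability(factor_scores):
    """Convert factor scores to P(satisfied) using the fitted equation."""
    log_odds = INTERCEPT + sum(
        COEFFICIENTS[name] * factor_scores[name] for name in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical respondent: the same score on every factor
example_scores = {name: 2.0 for name in COEFFICIENTS}
print("P(satisfied):", satisfaction_probability(example_scores))
```

Because every coefficient is positive, a higher score on any factor raises the predicted probability of satisfaction.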



# K Means

In [None]:
# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
data_06 = data_02.iloc[:, 8:22].copy()
data_06["Result"] = data_02['satisfaction']

In [None]:
X = data_06.drop(columns=["Result"])
y = data_06["Result"]

In [None]:
data_06

In [None]:
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
y_encoded = label_encoder.fit_transform(y)

In [None]:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') #within cluster sum of squares
plt.show()

In [None]:
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)

In [None]:
kmeans.fit(X_scaled)

In [None]:
cluster_centers = kmeans.cluster_centers_

In [None]:
labels = kmeans.labels_

In [None]:
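# Caution: k-means cluster ids (0 to 3 here) are arbitrary and are not aligned
# with the encoded class labels (0/1), so this raw match count is only a rough indication.
correct_labels = sum(y_encoded == labels)
print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y_encoded.size))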

In [None]:
inertia = kmeans.inertia_

In [None]:
silhouette_score = metrics.silhouette_score(X_scaled, labels, metric='euclidean')

In [None]:
print("Silhouette Score:", silhouette_score)

In [None]:
print('Accuracy score: {0:0.2f}'.format(correct_labels/float(y.size)))
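
Because cluster ids are arbitrary, a direct equality check against class labels can understate how well the clusters separate the classes. One common alternative, sketched here under the assumption that each cluster is assigned to its majority class (a purity score), is shown below; the function name `majority_vote_accuracy` and the toy labels are made up for illustration.

```python
from collections import Counter, defaultdict

def majority_vote_accuracy(true_labels, cluster_ids):
    """Map each cluster to its most common true label and score the result."""
    members = defaultdict(list)
    for true_label, cluster_id in zip(true_labels, cluster_ids):
        members[cluster_id].append(true_label)
    # For each cluster, count the samples that carry its majority label
    matched = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return matched / len(true_labels)

toy_true = [0, 0, 0, 1, 1, 1]      # hypothetical class labels
toy_clusters = [2, 2, 3, 0, 0, 1]  # hypothetical cluster ids

naive = sum(t == c for t, c in zip(toy_true, toy_clusters)) / len(toy_true)
print("Naive match rate:", naive)
print("Majority-vote accuracy:", majority_vote_accuracy(toy_true, toy_clusters))
```

Purity tends to increase with the number of clusters, so a label-permutation-invariant measure such as `metrics.adjusted_rand_score` from scikit-learn is also worth reporting alongside it.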