#### Exploratory Data Analysis (EDA)

This notebook outlines the EDA process applied to the logs dataset. It includes descriptive statistics, visualizations (histograms, box plots), correlation analysis (heatmap and pair plots), and dimensionality reduction using PCA. The aim is to identify redundant features, understand data distribution, and inform further feature selection for learning analytics.


In [None]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the dataset (assume CSV format)
df = pd.read_csv('Mode2Logs.csv') # data converted to CSV for ease




##### 1. Descriptive Statistics and Histograms

This section provides a numerical summary of the dataset and visualizes the distribution of each feature using histograms.


In [None]:
# -------------------------------
# a. Descriptive Statistics
# -------------------------------
print("Summary Statistics:")
print(df.describe())

# Plot a histogram for each feature (the sessionID identifier is excluded)
df.drop(columns=['sessionID']).hist(figsize=(15, 10), bins=20)
plt.suptitle("Histograms of Features")
plt.show()
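
Alongside `describe()`, it can help to tabulate missing values and skewness per feature, since both affect how the histograms should be read. The following is a minimal sketch on made-up data: the `demo` frame and the `touch_count` column are hypothetical stand-ins, not part of the logs dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the logs DataFrame; all values are synthetic.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "sessionID": np.arange(100),
    "session_duration": rng.exponential(scale=300.0, size=100),  # right-skewed
    "touch_count": rng.normal(loc=20.0, scale=5.0, size=100),    # hypothetical column
})
demo.loc[[3, 7], "touch_count"] = np.nan  # inject two missing values

# Keep numeric feature columns only (the identifier is dropped first).
numeric = demo.drop(columns=["sessionID"]).select_dtypes(include="number")

# One row per feature: number of missing values and sample skewness.
overview = pd.DataFrame({
    "missing": numeric.isna().sum(),
    "skew": numeric.skew(),
})
print(overview)
```

A strongly positive skew suggests a long right tail, which is worth keeping in mind when choosing histogram bins or considering a log transform.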


##### 2.1 Box Plots for Outlier Detection

Box plots show the median, interquartile range, and potential outliers for each feature. Note that features like `session_duration` and `avg_touch_play_duration` take much larger values than the rest, so on a shared axis they compress the boxes of the other features.


In [None]:
# Box plots to detect outliers
plt.figure(figsize=(15, 5))
sns.boxplot(data=df.drop(columns=['sessionID']))  # Exclude non-numeric or identifier columns
plt.title("Box Plots for Features")
plt.xticks(rotation=90)
plt.show()
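
The points drawn beyond the whiskers follow the usual 1.5 × IQR rule (the default whisker length in seaborn and matplotlib). As a rough numerical companion to the plot, the sketch below counts such points per feature. The helper `iqr_outlier_counts`, the `demo` frame, and the `touch_count` column are hypothetical names introduced only for illustration.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Tiny synthetic example: one extreme session duration, no extreme touch counts.
demo = pd.DataFrame({
    "session_duration": [10, 12, 11, 13, 12, 11, 10, 300],
    "touch_count": [5, 6, 5, 7, 6, 5, 6, 6],
})
counts = iqr_outlier_counts(demo)
print(counts)
```

On the real data this could be called as `iqr_outlier_counts(df.drop(columns=['sessionID']))`, assuming the remaining columns are numeric.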


##### 2.2 Box Plots Without High-Scale Features

Dropping the high-scale features `session_duration` and `avg_touch_play_duration` gives the remaining features a shared axis on which their spreads and outliers are visible. The data itself is not rescaled.


In [None]:

# Box plots with the identifier and high-scale features excluded
excluded_cols = ['sessionID', 'session_duration', 'avg_touch_play_duration']

plt.figure(figsize=(15, 5))
sns.boxplot(data=df.drop(columns=excluded_cols))  # exclude high-scale features
plt.title("Box Plots for Features (High-Scale Features Excluded)")
plt.xticks(rotation=90)
plt.show()
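
An alternative to dropping the high-scale features is to standardize every column, so that all features can share one axis. The sketch below does this with plain pandas on a made-up frame; `demo` and `touch_count` are hypothetical, and `ddof=0` is chosen to match what `StandardScaler` computes.

```python
import pandas as pd

# Synthetic frame with two large-valued columns and one small-valued column.
demo = pd.DataFrame({
    "session_duration": [1200.0, 900.0, 1500.0, 600.0],
    "avg_touch_play_duration": [800.0, 400.0, 1000.0, 200.0],
    "touch_count": [4.0, 6.0, 5.0, 9.0],  # hypothetical column
})

# Z-score each column: subtract the mean, divide by the population std.
scaled = (demo - demo.mean()) / demo.std(ddof=0)
print(scaled.round(3))

# The scaled frame could then be plotted directly, e.g. sns.boxplot(data=scaled).
```

After standardization every column has mean 0 and unit variance, so the box heights reflect shape and outliers rather than raw units.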


##### 3. Correlation Analysis

Correlation analysis helps to identify linear relationships between features. We use a heatmap and pair plots to visualize these relationships.


In [None]:

# -------------------------------
# b. Correlation Analysis
# -------------------------------
# Compute the Pearson correlation matrix over the numeric feature columns
corr_matrix = df.drop(columns=['sessionID']).corr(numeric_only=True)
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
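
To turn the heatmap into a shortlist of potentially redundant features, one option is to list the feature pairs whose absolute correlation exceeds a threshold. This is only a sketch: the helper `high_corr_pairs`, the 0.9 threshold, and the column names `total_play_time` and `touch_count` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs with |Pearson r| >= threshold, strongest first."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep the strict upper triangle so each pair appears once, without the diagonal.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)

# Synthetic data: total_play_time is an exact linear function of session_duration.
demo = pd.DataFrame({"session_duration": [1.0, 2.0, 3.0, 4.0, 5.0]})
demo["total_play_time"] = 2.0 * demo["session_duration"] + 1.0
demo["touch_count"] = [2.0, 1.0, 4.0, 3.0, 5.0]

pairs = high_corr_pairs(demo, threshold=0.9)
print(pairs)
```

Pairs that appear in this list are candidates for dropping one of the two features, subject to domain judgment.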



##### Pair Plots

Pair plots show a scatter plot for every pair of features, with each feature's own distribution on the diagonal. They complement the heatmap by revealing non-linear relationships and clusters that a single correlation coefficient cannot capture.


In [None]:
# Pair plots (for a subset of features if there are too many)
sns.pairplot(df.drop(columns=['sessionID']).iloc[:, :8])  # using the first 8 features for clarity
plt.suptitle("Pair Plots for Selected Features", y=1.02)
plt.show()
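
Because `DataFrame.corr()` reports Pearson correlation by default, a monotonic but non-linear relationship can look weaker in the heatmap than it does in a pair plot. The sketch below illustrates this on made-up data by comparing Pearson with a rank-based (Spearman) coefficient, computed here as Pearson on ranks. The `demo` frame and its columns `x` and `y` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic, perfectly monotonic but strongly non-linear relationship.
demo = pd.DataFrame({"x": np.arange(1, 11, dtype=float)})
demo["y"] = np.exp(demo["x"])

# Pearson measures linear association on the raw values.
pearson = demo.corr().loc["x", "y"]

# Spearman is Pearson applied to the ranks of the values.
spearman = demo.rank().corr().loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two coefficients for a feature pair is a hint to inspect its scatter plot for curvature.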


##### 4. Dimensionality Reduction using PCA

Principal Component Analysis (PCA) projects the features onto a smaller set of orthogonal principal components, ordered by how much of the data's variance each one captures. The features are standardized first so that large-scale features such as `session_duration` do not dominate the components.


In [None]:
# -------------------------------
# c. Dimensionality Reduction: PCA
# -------------------------------
# Select features (excluding sessionID)
features = df.drop(columns=['sessionID']).columns
X = df[features].fillna(0).values  # replace missing values with 0 (a simple placeholder that can bias the standardized values)

# Standardize the data (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=5)  # Adjust the number of components based on explained variance
principal_components = pca.fit_transform(X_scaled)

print("Explained Variance Ratio for PCA components:")
print(pca.explained_variance_ratio_)

# Visualize the first two principal components
plt.figure(figsize=(8, 6))
plt.scatter(principal_components[:, 0], principal_components[:, 1], alpha=0.7)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA: PC1 vs PC2")
plt.show()
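
One common way to adjust the number of components is to fit PCA with all components and keep the smallest number whose cumulative explained variance reaches a target. The sketch below uses synthetic data built from two latent factors; the 95% target, the variable names, and the data itself are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: six observed features driven by two independent latent factors.
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))
mixing = np.array([
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],  # features 0-2 follow factor 0
    [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],  # features 3-5 follow factor 1
])
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

# Standardize, then fit PCA with all components retained.
X_std = StandardScaler().fit_transform(X_demo)
pca_full = PCA().fit(X_std)

# Smallest number of components whose cumulative variance reaches the target.
target = 0.95
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, target) + 1)

print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components needed:", n_needed)
```

Applied to the notebook's data, the same steps would use `X_scaled` in place of `X_std`, and the resulting count could replace the fixed `n_components=5`.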


#### Summary

In this notebook, we conducted an in-depth EDA on VR session logs. We:
- Analyzed descriptive statistics to understand the basic distribution of features.
- Used histograms and box plots to identify outliers and evaluate feature scales.
- Applied correlation analysis (via heatmaps and pair plots) to reveal relationships among features.
- Performed PCA for dimensionality reduction and plotted the first two components to look for clusters that may inform further user segmentation and feature selection for learning analytics.

These steps help ensure that our subsequent analysis and model-building are based on a robust understanding of the data.


In [None]:
# Export the transposed summary statistics (one row per feature) as an HTML table
html_table = df.describe().T.to_html()

with open("Feature_describe.html", "w") as f:
    f.write(html_table)