A Python toolkit for analyzing student-AI interaction data in online education platforms. This repository implements a complete data analysis pipeline with four modules corresponding to the research paper sections.
The pipeline is organized into four modules. The number in parentheses after each module name is the corresponding section of the paper:
- Module 1: Conversation Session Identification and Segmentation (3.3.1)
  - Two-stage dialogue segmentation algorithm
  - Time threshold (15-minute inactivity) + LLM topic segmentation
- Module 2: Feature Engineering (3.3.2)
  - Extract 10 features from segmented sessions
  - Behavioral, cognitive, and temporal engagement features
- Module 3: Cluster Analysis (3.3.3)
  - PCA dimensionality reduction
  - K-Means clustering with elbow method
  - Comprehensive visualization
- Module 4: Process Mining (3.3.4)
  - First-Order Markov Model (FOMM)
  - Student-level engagement pattern analysis
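
The time-threshold stage of Module 1 can be illustrated with a minimal sketch. It is not the code in `two_stage_dialog_split.py`: the column names `student_id` and `timestamp` are hypothetical (the raw CSV files may use different names, see `CHINESE_TERMS.md`), and treating a gap of strictly more than 15 minutes as a session boundary is an assumption. The LLM topic-segmentation stage is not shown.

```python
import pandas as pd

def assign_sessions(df, gap_minutes=15):
    """Label each message with a per-student session index based on inactivity gaps."""
    df = df.sort_values(["student_id", "timestamp"]).copy()
    # Time elapsed since the same student's previous message (NaT for the first one).
    gap = df.groupby("student_id")["timestamp"].diff()
    # A session starts at a student's first message or after a long enough pause.
    new_session = (gap.isna() | (gap > pd.Timedelta(minutes=gap_minutes))).astype(int)
    # Counting session starts per student yields a 0-based session index.
    df["session_id"] = new_session.groupby(df["student_id"]).cumsum() - 1
    return df

# Hypothetical column names and toy data, for illustration only.
messages = pd.DataFrame({
    "student_id": ["s1", "s1", "s1", "s2"],
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 09:05",
        "2024-01-01 09:30", "2024-01-01 09:10",
    ]),
})
sessions = assign_sessions(messages)
```

In this toy input, the third message of `s1` arrives 25 minutes after the second, so it opens a new session for that student.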
Requirements:
- Python 3.8+ (required by dependencies: pandas>=2.0.0, numpy>=1.24.0, scikit-learn>=1.3.0)
- See `requirements.txt` for the full dependency list
Installation:

    # Install dependencies
    pip install -r requirements.txt

Usage:

    # Module 1: Session segmentation
    cd module_1_session_segmentation
    python two_stage_dialog_split.py \
        --input_folder /path/to/raw/dialog/csv \
        --output_folder /path/to/segmented/output \
        [--enable_llm] \
        [--api_key YOUR_API_KEY]

    # Module 2: Feature extraction
    cd module_2_feature_engineering
    python run.py \
        --dialog_folder /path/to/segmented/dialogs \
        --class_time_file /path/to/class_time_range_by_school.csv \
        --class_schedule_file /path/to/class_schedule.csv \
        --output_file extracted_features.csv \
        --output_folder /path/to/output

    # Module 3: Clustering
    cd module_3_cluster_analysis
    python run.py \
        --features_file /path/to/extracted_features.csv \
        --output_folder /path/to/clustering_results \
        --max_k 10 \
        --variance_threshold 0.8

    # Module 4: Process mining
    cd module_4_process_mining
    python fomm.py \
        --cluster_csv /path/to/clustered_features.csv \
        --output_dir /path/to/process_mining_results

Project structure:

    data_analysis/
    ├── README.md                          # This file
    ├── LICENSE                            # MIT License
    ├── requirements.txt                   # Python dependencies
    ├── CHINESE_TERMS.md                   # Chinese column names and values reference
    ├── FEATURE_ENGINEERING.md             # Feature definitions and formulas
    ├── .gitignore                         # Git ignore rules
    ├── module_1_session_segmentation/     # Module 1: Session segmentation
    │   ├── __init__.py
    │   └── two_stage_dialog_split.py
    ├── module_2_feature_engineering/      # Module 2: Feature extraction
    │   ├── __init__.py
    │   ├── feature_extraction.py
    │   ├── feature_utils.py
    │   ├── io_utils.py
    │   └── run.py
    ├── module_3_cluster_analysis/         # Module 3: Clustering
    │   ├── __init__.py
    │   ├── clustering.py
    │   ├── pca_reduction.py
    │   └── run.py
    └── module_4_process_mining/           # Module 4: Process mining
        ├── __init__.py
        └── fomm.py

Each module contains:
- `__init__.py`: Module-level documentation (visible when importing)
- Main entry point:
  - Module 1: `two_stage_dialog_split.py`
  - Module 2: `run.py`
  - Module 3: `run.py`
  - Module 4: `fomm.py`
- Additional utility files as needed
Feature Engineering Files:
- `extracted_features.csv`: Feature matrix with 10 features per session
- `histograms_before_log/`: Feature distribution plots before log transformation
- `histograms_after_log/`: Feature distribution plots after log transformation
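
The two histogram folders indicate that features are log-transformed before further analysis. The exact transform is not stated here, so the following is only a sketch that assumes `log(1 + x)` via `numpy.log1p`; the feature names are hypothetical (see `FEATURE_ENGINEERING.md` for the actual definitions).

```python
import numpy as np
import pandas as pd

# Toy right-skewed data; the column names are hypothetical placeholders.
features = pd.DataFrame({
    "qa_turns": [1, 2, 3, 50],
    "session_duration": [30, 60, 90, 3600],
})

# log1p computes log(1 + x), which is defined at zero and compresses long right tails.
# Assumption: the pipeline may use a different log variant.
log_features = np.log1p(features)

# Sample skewness before and after, as a quick check that the tails were compressed.
skew_before = features.skew()
skew_after = log_features.skew()
```

Comparing `skew_before` with `skew_after` gives a numeric counterpart to inspecting the before and after histograms.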
PCA Analysis Files:
- `pca_01_explained_variance_ratio.png`: Explained variance ratio by component
- `pca_02_cumulative_explained_variance.png`: Cumulative explained variance ratio
- `pca_03_feature_loadings_heatmap.png`: Feature loadings in principal components
- `pca_04_2d_projection.png`: 2D PCA projection scatter plot
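
As a rough illustration of how a variance threshold such as `--variance_threshold 0.8` can determine the number of components, the sketch below uses scikit-learn's `PCA` with a fractional `n_components`. Whether `pca_reduction.py` works this way is an assumption, as is the standardization step; the data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # synthetic stand-in for 10 session features

# Standardize so that no feature dominates purely because of its scale (assumed step).
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) keeps the smallest number of components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.8, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)  # cumulative variance curve
loadings = pca.components_  # shape: (n_components kept, n_features)
```

`pca.explained_variance_ratio_`, `cumulative`, and `loadings` are the quantities one would plot to obtain figures like the first three listed above.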
Clustering Analysis Files:
- `clustering_01_elbow_method.png`: Elbow method for optimal cluster number
- `clustering_02_stability_report.txt`: Clustering stability report (ARI scores)
- `clustering_03_clustered_features.csv`: Features with cluster labels
- `clustering_04_class_distribution.csv`: Class-level cluster distribution
- `clustering_05_pca_visualization.png`: PCA space clustering visualization
- `clustering_06_cluster_size_distribution.png`: Cluster size distribution bar chart
- `clustering_07_qa_turns_distribution.png`: QA turns distribution by cluster
- `clustering_08_feature_heatmap.png`: Cluster feature means heatmap
- `clustering_09_course_progress_distribution.png`: Course progress distribution by cluster
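
The elbow curve and the ARI-based stability report can be approximated with the sketch below. It runs on synthetic blobs rather than the real feature matrix, and the stability procedure (re-running K-Means with different seeds and comparing labels to a reference run) is an assumption about how such a report might be produced, not a description of `clustering.py`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]],
                  cluster_std=0.5, random_state=0)

# Elbow method: record the within-cluster sum of squares (inertia) for each k.
max_k = 10
inertias = []
for k in range(1, max_k + 1):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Stability: compare labelings from different seeds against a reference run.
chosen_k = 3  # hypothetical choice, read off the elbow curve
reference = KMeans(n_clusters=chosen_k, n_init=10, random_state=0).fit_predict(X)
ari_scores = []
for seed in range(1, 6):
    labels = KMeans(n_clusters=chosen_k, n_init=10, random_state=seed).fit_predict(X)
    ari_scores.append(adjusted_rand_score(reference, labels))
```

The adjusted Rand index is invariant to how cluster labels are numbered, which makes it suitable for comparing runs whose label ids may be permuted.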
Process Mining Files:
- `transition_probability_heatmap.png`: State transition probability matrix (heatmap)
- `transition_probability_heatmap_with_counts.png`: Transition probability matrix with count annotations
- `transition_count_matrix.csv`: Raw transition count matrix
- `transition_probability_matrix.csv`: Transition probability matrix (CSV format)
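
A first-order Markov model over cluster labels reduces to counting consecutive state pairs per student and normalizing each row. The sketch below shows that computation on toy data; the column names `student_id` and `cluster` are hypothetical, and it assumes rows are already ordered in time within each student. It is not the code in `fomm.py`.

```python
import pandas as pd

# Toy input: one row per session, in chronological order within each student.
sessions = pd.DataFrame({
    "student_id": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "cluster":    [0,    1,    1,    1,    0,    1],
})

# Pair each session with the same student's next session.
sessions["next_cluster"] = sessions.groupby("student_id")["cluster"].shift(-1)

# A student's last session has no successor, so it contributes no transition.
transitions = sessions.dropna(subset=["next_cluster"]).astype({"next_cluster": int})

# Rows are the current state, columns are the next state.
counts = pd.crosstab(transitions["cluster"], transitions["next_cluster"])

# Row-normalize the counts to obtain transition probabilities.
probs = counts.div(counts.sum(axis=1), axis=0)
```

A state that never occurs as a source or as a target will be missing from the corresponding axis of `counts`; reindexing both axes to the full list of states would make the matrix square.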
This project is licensed under the MIT License - see the LICENSE file for details.