An interactive Streamlit application for Gaussian Mixture Model clustering with a robust Expectation-Maximization implementation. Upload your CSV data, preprocess it, configure the EM algorithm, and explore comprehensive visualizations of clustering results.
- Automatic detection and removal of non-numeric/ID columns
- Multiple missing value strategies (drop rows, fill with mean/median)
- Outlier handling using IQR method (clip or remove)
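The IQR rule behind this option can be sketched in a few lines of NumPy (an illustrative version using the conventional 1.5×IQR Tukey fences; the app's internal implementation may differ):

```python
import numpy as np

def iqr_bounds(x, k=1.5):
    """Tukey fences: values outside [Q1 - k*IQR, Q3 + k*IQR] count as outliers."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def handle_outliers(x, mode="clip"):
    lo, hi = iqr_bounds(x)
    if mode == "clip":                   # pull outliers to the nearest fence
        return np.clip(x, lo, hi)
    return x[(x >= lo) & (x <= hi)]      # mode == "remove": drop them

x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])
clipped = handle_outliers(x, "clip")     # the 100.0 is pulled down to the fence
```

Clipping preserves the row count (useful when rows carry other features), while removal shrinks the dataset.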
- Feature selection and standardization (StandardScaler)
- Real-time preprocessing report
- K-means++ initialization for intelligent starting points
- Log-sum-exp trick for numerical stability
- Covariance regularization to prevent singular matrices
- Multiple random restarts to avoid local optima
- Full convergence tracking with log-likelihood history
- BIC and AIC scoring for model selection
- Data Overview Tab: Data preview, statistical summary, feature distributions, correlation heatmap
- Clusters Tab: 2D/3D PCA scatter plots, cluster size distribution, per-cluster feature violin plots
- Convergence Tab: Log-likelihood convergence curve, per-iteration improvement chart
- Model Selection Tab: BIC/AIC comparison across K values, automatic best-K detection
- Parameters Tab: Component-wise parameters (means, std devs, covariances), responsibility heatmap
- Python 3.9 or higher
- Clone the repository and navigate to the project directory:
  cd /path/to/GMM

- Create a virtual environment (recommended):

  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install dependencies:

  pip install -r requirements.txt

The dependencies listed in requirements.txt:

  streamlit>=1.56.0
  numpy>=2.4.4
  pandas>=3.0.2
  scipy>=1.17.1
  scikit-learn>=1.6.1
  plotly>=6.7.0

Start the app with:

  streamlit run main.py

The app will open in your browser at http://localhost:8501
- Upload Data: Click "Browse files" to upload a CSV file
- Preprocess: Configure data cleaning in the sidebar
- Handle missing values (drop/mean/median)
- Enable/disable standardization
- Auto-drop ID/text columns
- Handle outliers (clip/remove)
- Select specific features (optional)
- Configure EM Algorithm:
- Set number of components (K)
- Adjust max iterations and tolerance
- Set covariance regularization
- Choose number of random restarts
- Run Simulation: Click the "Run Simulation" button
- Explore Results: Navigate through tabs to analyze results
The Expectation-Maximization implementation includes several robustness features:
Instead of random initialization, we use K-means++, which:
- Selects initial centers that are spread apart
- Reduces likelihood of poor local optima
- Improves convergence speed
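A minimal NumPy sketch of K-means++ seeding (illustrative; the engine may instead reuse scikit-learn's implementation):

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """K-means++ seeding: each new center is drawn with probability
    proportional to its squared distance to the nearest chosen center."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        c = np.array(centers)                           # (m, d) centers chosen so far
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)),           # two well-separated blobs
               rng.normal(5.0, 0.1, (50, 2))])
centers = kmeanspp_init(X, 2, rng)                      # one center per blob, almost surely
```

Because distant points dominate the sampling distribution, the chosen centers tend to land in different clusters, which is exactly why this seeding reduces bad local optima.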
- Log-sum-exp trick: Prevents underflow when computing responsibilities
- Cholesky decomposition: Stable computation of log Gaussian densities
- Covariance regularization: Adds small diagonal terms to prevent singularity
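The log-sum-exp E-step can be sketched as follows (pure NumPy; function and variable names are illustrative, not the engine's actual API):

```python
import numpy as np

def responsibilities(log_pdf, log_weights):
    """E-step with the log-sum-exp trick.
    log_pdf is (n, k): per-component log Gaussian densities for each point."""
    log_joint = log_pdf + log_weights            # log pi_k + log N(x | mu_k, Sigma_k)
    m = log_joint.max(axis=1, keepdims=True)     # subtract the row max for stability
    log_norm = m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))
    resp = np.exp(log_joint - log_norm)          # responsibilities; rows sum to 1
    return resp, log_norm.sum()                  # also the total log-likelihood

log_pdf = np.array([[-1000.0, -1001.0]])         # naive exp() would underflow to 0/0
resp, ll = responsibilities(log_pdf, np.log([0.5, 0.5]))
```

With naive exponentiation both densities underflow to zero and the normalization divides 0 by 0; subtracting the maximum first keeps at least one term at exp(0) = 1.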
The algorithm runs multiple times with different initializations and returns the best solution based on log-likelihood.
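The restart logic amounts to a simple loop; here single_run is a hypothetical stand-in for one full EM fit returning its parameters and final log-likelihood:

```python
def best_of_restarts(X, k, n_init, single_run):
    """Run EM n_init times with different seeds; keep the highest log-likelihood."""
    best = None
    for seed in range(n_init):
        params, ll = single_run(X, k, seed)
        if best is None or ll > best[1]:
            best = (params, ll)
    return best

# stand-in run: pretends different seeds reach different local optima
fake_run = lambda X, k, seed: ({"seed": seed}, float(-abs(seed - 3)))
params, ll = best_of_restarts(None, 2, 6, fake_run)
```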
- Tracks log-likelihood at each iteration
- Stops when improvement falls below tolerance
- Provides full convergence history for analysis
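The stopping rule can be sketched as:

```python
def converged(history, tol=1e-4):
    """Declare convergence when the latest log-likelihood gain drops below tol."""
    return len(history) >= 2 and (history[-1] - history[-2]) < tol

# toy history: big early gains, then a tiny final improvement
history = [-512.3, -498.1, -497.9, -497.89999]
done = converged(history)
```

Because EM never decreases the log-likelihood, the per-iteration gain is a natural convergence signal, and storing the full history is what drives the convergence tab's plots.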
- BIC (Bayesian Information Criterion): Penalizes model complexity
- AIC (Akaike Information Criterion): Balances fit and complexity
- Automatic sweep across K values for optimal selection
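Both criteria follow from the final log-likelihood; for a full-covariance GMM with K components in d dimensions, the free-parameter count is (K−1) + Kd + Kd(d+1)/2. A minimal sketch (lower is better for both criteria):

```python
import numpy as np

def gmm_n_params(k, d):
    """Free parameters of a full-covariance GMM: weights + means + covariances."""
    return (k - 1) + k * d + k * d * (d + 1) // 2

def bic(ll, k, d, n):
    return gmm_n_params(k, d) * np.log(n) - 2.0 * ll

def aic(ll, k, d):
    return 2 * gmm_n_params(k, d) - 2.0 * ll

# sweep sketch (fit is hypothetical): pick the K with the smallest score
# scores = {k: bic(fit(X, k).log_likelihood, k, d, n) for k in range(1, 9)}
```

BIC's log(n) penalty grows with sample size, so it tends to pick fewer components than AIC on large datasets.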
A sample credit card customer dataset (GMM_dataset.csv) is included with features:
- BALANCE: Account balance
- PURCHASES: Total purchases
- ONEOFF_PURCHASES: Maximum single purchase
- CASH_ADVANCE: Cash advance amount
- CREDIT_LIMIT: Credit limit
- PAYMENTS: Total payments
- And more...
GMM/
├── main.py # Streamlit application
├── gmm_engine.py # Robust GMM/EM implementation
├── requirements.txt # Python dependencies
├── GMM_dataset.csv # Sample dataset
└── README.md # This file
When data has more than 2 dimensions, PCA is automatically applied for scatter plots:
- 2D view: First two principal components
- 3D view: First three principal components
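For reference, the projection can be sketched with a plain SVD-based PCA (the app presumably uses scikit-learn's PCA; this is an equivalent minimal version):

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project centered data onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # PCA requires centered data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # (n, n_components) coordinates

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
XY = pca_project(X, 2)                           # coordinates for the 2D scatter plot
```

SVD returns components in decreasing order of explained variance, so the first two (or three) columns are exactly the axes shown in the scatter plots.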
The app uses Streamlit's caching (@st.cache_data) to:
- Store preprocessing results
- Cache trained models
- Accelerate repeated computations
- Efficient numpy/scipy operations
- Vectorized computations
- Optimized for datasets up to 100K rows
Issue: "ModuleNotFoundError: No module named 'sklearn'"
- Solution: Run pip install scikit-learn
Issue: Dashboard not loading
- Solution: Ensure you're in the project directory and venv is activated
Issue: Slow convergence
- Solution: Reduce max iterations, increase tolerance, or reduce n_init
Issue: Singular matrix warnings
- Solution: Increase the reg_covar parameter in the sidebar
This project is provided as-is for educational and research purposes.
Contributions are welcome! Areas for improvement:
- Additional preprocessing options
- More visualization types
- Parallel processing for large datasets
- Additional initialization methods
- Online/batch EM variants