This project explores evolutionary relationships between species based on codon usage frequencies using Machine Learning & Distance Metrics. The primary goal is to calculate evolutionary distances using Euclidean distance and analyze the correlation between species.
Additionally, a Random Forest model is trained to predict species classification based on codon frequency, and evolutionary trends are visualized.
- Preprocess Codon Usage Data
- Compute Evolutionary Distances
- Apply Machine Learning (Random Forest Regression)
- Feature Importance Analysis
- Visualize Species Relationships
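As a rough illustration of the distance step, the sketch below builds a pairwise Euclidean distance matrix with SciPy. The species labels, codon columns, and frequency values are made-up toy data, and `closest_pair` is a hypothetical helper, not part of the project code.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Toy codon frequencies (invented values, not real measurements).
freqs = pd.DataFrame(
    {
        "UUU": [0.020, 0.021, 0.035],
        "UUC": [0.018, 0.019, 0.010],
        "GCG": [0.007, 0.006, 0.030],
    },
    index=["species_a", "species_b", "species_c"],
)

# pdist returns the condensed pairwise distances;
# squareform expands them into a symmetric matrix.
dist = pd.DataFrame(
    squareform(pdist(freqs.to_numpy(), metric="euclidean")),
    index=freqs.index,
    columns=freqs.index,
)


def closest_pair(d):
    """Return the labels of the two distinct rows with the smallest distance."""
    masked = d.to_numpy().copy()
    np.fill_diagonal(masked, np.inf)  # ignore self-distances
    i, j = np.unravel_index(np.argmin(masked), masked.shape)
    return d.index[i], d.columns[j]
```

Calling `closest_pair(dist)` on the toy table picks the two rows whose codon profiles differ least, which is the sense in which a smaller distance is read as a closer relationship.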
The project uses a Codon Usage Frequency dataset (codon_usage.csv), where:
- Each row represents a species
- Columns contain relative codon frequencies
- SpeciesName is the label column
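A minimal loading sketch, assuming the layout described above: one row per species, a SpeciesName column, and numeric codon columns. The codon column names and values in the inline CSV are invented stand-ins for codon_usage.csv, and `load_codon_table` is a hypothetical helper.

```python
import io

import pandas as pd

# Stand-in for codon_usage.csv; everything except the SpeciesName
# column name is invented for illustration.
csv_text = """SpeciesName,UUU,UUC,GCG
species_a,0.020,0.018,0.007
species_b,0.021,bad,0.006
species_c,0.035,0.010,0.030
"""


def load_codon_table(source):
    """Read the table, index it by species, and keep only fully numeric rows."""
    df = pd.read_csv(source)
    df = df.set_index("SpeciesName")
    # Non-numeric entries become NaN so they can be dropped below.
    df = df.apply(pd.to_numeric, errors="coerce")
    return df.dropna()


table = load_codon_table(io.StringIO(csv_text))
```

With the real file, the same helper could be called as `load_codon_table("codon_usage.csv")`. Dropping rows with unparseable values is one possible choice; imputing them would be another.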
- Python
- Pandas, NumPy for data processing
- SciPy for Euclidean distance computation
- Scikit-learn for ML modeling
- Matplotlib for visualization
# Clone the repository
git clone https://github.com/your-repo/evolutionary-distance
cd evolutionary-distance
# Install dependencies
Install the necessary dependencies listed above.
- Closely related species have smaller distances (e.g., Rattus norvegicus & Mus musculus)
- Machine Learning captures key codon importance in species classification
- Visualization helps interpret relationships effectively
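To show how codon importance might be read from a Random Forest, here is a small sketch on synthetic data. It uses scikit-learn's `RandomForestClassifier` as an assumption, since this README mentions both classification and regression. The codon names, the two group labels, and all values are invented, and only the first column is constructed to separate the groups.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
codons = ["UUU", "UUC", "GCG", "CUG"]  # hypothetical feature names

# Synthetic frequencies: 100 samples per group, all columns drawn
# from the same distribution.
X = rng.normal(loc=0.02, scale=0.002, size=(200, len(codons)))
y = np.array([0] * 100 + [1] * 100)

# Shift only the first codon for group 1 so it is the informative feature.
X[y == 1, 0] += 0.02

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# feature_importances_ holds one impurity-based score per column.
importances = dict(zip(codons, model.feature_importances_))
top_codon = max(importances, key=importances.get)
```

Here `top_codon` should come out as the column that was shifted, which is the kind of signal the feature importance analysis looks for in real codon data.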
- Improve the model with deep learning (CNNs for sequence data)
- Explore other distance metrics (Manhattan, Mahalanobis)
- Extend analysis to larger genetic datasets
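For the alternative metrics mentioned above, a possible starting point is SciPy's `pdist`, which accepts `"cityblock"` (Manhattan) and `"mahalanobis"`. The data below is random toy input, and using a pseudo-inverse of the covariance matrix is an assumption made here to cope with singular covariances (likely when there are more codons than species).

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.random((6, 3))  # 6 toy "species", 3 toy "codon" features

euclidean = pdist(X, metric="euclidean")
manhattan = pdist(X, metric="cityblock")

# Mahalanobis needs the inverse covariance of the features;
# pinv is used so a singular covariance does not raise an error.
VI = np.linalg.pinv(np.cov(X, rowvar=False))
mahalanobis = pdist(X, metric="mahalanobis", VI=VI)
```

All three results are condensed distance vectors of the same length, so they can be passed through `scipy.spatial.distance.squareform` and compared side by side.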
Reach me at Ayush04coder@gmail.com
Contributions & PRs are welcome!