## Portfolio assignment: Identifying Biological Substitutes for Synthetic Compounds Using Clustering

**Author**: F.Feenstra

### Objective:
The goal of this assignment is to use unsupervised clustering techniques to identify biological compounds that could potentially serve as substitutes for synthetic chemicals. You will explore the use of chemical fingerprints, feature extraction, and clustering algorithms to find relationships between synthetic and natural compounds based on structural similarity.

---

### Assignment Overview:

1. **Introduction to the Problem**:
   - Synthetic compounds are widely used in various industries, but there is a growing interest in finding biological alternatives due to environmental and health concerns.
   - Your task is to use unsupervised clustering to identify potential biological substitutes for a given set of synthetic chemicals with specific functionalities (e.g., antimicrobial, antioxidant, or surfactant properties).

2. **Dataset**:
   - You will be provided with two datasets:
     - **Synthetic Compounds**: A list of synthetic chemicals with known structures.
     - **Natural Compounds**: A database of natural products with biological activity.
   - Both datasets will include chemical structures in the form of SMILES strings.
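The file names and column layout depend on the files you receive, so the following is only a minimal sketch. It assumes, purely for illustration, that each dataset is a CSV file with a header row and a `smiles` column. It shows one way to combine both datasets into a single list while remembering where each compound came from. The function name `load_compounds`, the column names, and the two example rows are hypothetical stand-ins.

```python
import csv
import io


def load_compounds(handle, source):
    """Read SMILES from a CSV file handle and tag each row with its origin.

    Assumes a header row with a column named 'smiles' (hypothetical layout).
    Rows with an empty SMILES field are skipped.
    """
    rows = []
    for record in csv.DictReader(handle):
        smiles = (record.get("smiles") or "").strip()
        if smiles:
            rows.append({"smiles": smiles, "source": source})
    return rows


# Tiny in-memory stand-ins for the two provided files.
synthetic_csv = io.StringIO(
    "name,smiles\ntriclosan,Oc1cc(Cl)ccc1Oc1ccc(Cl)cc1Cl\nblank,\n"
)
natural_csv = io.StringIO("name,smiles\nthymol,Cc1ccc(C(C)C)c(O)c1\n")

compounds = (
    load_compounds(synthetic_csv, "synthetic")
    + load_compounds(natural_csv, "natural")
)
```

Keeping the `source` label next to every SMILES string makes it easy to see later which clusters mix both origins.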

---

### Assignment Tasks:

1. **Data Preprocessing**:
   - Load both datasets and check them for missing or unparsable SMILES strings.
   - Use RDKit to generate chemical fingerprints for all compounds.
   - Alternatively, calculate molecular descriptors (e.g., molecular weight or logP) if you prefer continuous numerical features.
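As a possible starting point for the fingerprint step, the sketch below turns a list of SMILES strings into a binary matrix of Morgan fingerprints. It assumes a recent RDKit release that provides `rdFingerprintGenerator`. The radius of 2 and the 2048-bit length are common defaults chosen here for illustration, not requirements of the assignment, and `smiles_to_fingerprints` is a hypothetical helper name.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def smiles_to_fingerprints(smiles_list, radius=2, n_bits=2048):
    """Convert SMILES strings to a binary Morgan fingerprint matrix.

    Returns (matrix, kept_indices). SMILES that RDKit cannot parse are
    skipped, and kept_indices records which input positions survived.
    """
    generator = rdFingerprintGenerator.GetMorganGenerator(
        radius=radius, fpSize=n_bits
    )
    rows, kept = [], []
    for index, smiles in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smiles)  # returns None for unparsable input
        if mol is None:
            continue
        rows.append(generator.GetFingerprintAsNumPy(mol))
        kept.append(index)
    matrix = np.array(rows, dtype=np.uint8).reshape(len(rows), n_bits)
    return matrix, kept


# The third entry is deliberately invalid to show how it is filtered out.
fingerprints, kept = smiles_to_fingerprints(["CCO", "c1ccccc1O", "not_a_smiles"])
```

Returning the kept indices alongside the matrix lets you drop the same rows from any accompanying metadata, such as compound names or origin labels.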

3. **Clustering**:
   - Choose an appropriate clustering algorithm (e.g., k-means, hierarchical clustering, or DBSCAN) and justify your choice.
   - Perform clustering on the feature representations of the compounds.
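One option among many for this step is hierarchical clustering on Jaccard distances, which for binary fingerprints equal 1 minus the Tanimoto similarity. The sketch below uses SciPy for this. The average linkage and the fixed number of clusters are assumptions made for the example, and `cluster_fingerprints` is a hypothetical helper name. Other algorithms may suit your data better.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_fingerprints(fingerprints, n_clusters):
    """Average-linkage clustering on Jaccard (1 - Tanimoto) distances.

    Returns one integer cluster label (starting at 1) per row.
    """
    distances = pdist(np.asarray(fingerprints, dtype=bool), metric="jaccard")
    tree = linkage(distances, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy 8-bit "fingerprints": rows 0-2 share the first bits, rows 3-5 the last.
toy = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 1],
])
labels = cluster_fingerprints(toy, n_clusters=2)
```

Note that `criterion="maxclust"` yields at most `n_clusters` clusters, so ties in the merge heights can produce fewer than requested.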
   
4. **Dimensionality Reduction and Visualization**:
   - Use t-SNE or UMAP to visualize the clustering results.
   - Show how the synthetic and biological compounds are distributed across the clusters.
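The visualization step could look roughly like the sketch below, which runs scikit-learn's t-SNE on a precomputed Jaccard distance matrix and draws clusters as colours and compound origin as marker shapes. The perplexity, the random seed, and the helper names `embed_fingerprints` and `plot_embedding` are illustrative assumptions. The demo data is random bits, not real chemistry.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE


def embed_fingerprints(fingerprints, perplexity=30.0, seed=0):
    """Project binary fingerprints to 2D with t-SNE on Jaccard distances."""
    distances = squareform(
        pdist(np.asarray(fingerprints, dtype=bool), metric="jaccard")
    )
    # A precomputed metric requires random initialisation in scikit-learn.
    tsne = TSNE(n_components=2, metric="precomputed", init="random",
                perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(distances)


def plot_embedding(coords, labels, sources):
    """Scatter plot: colour encodes the cluster, marker encodes the origin."""
    labels = np.asarray(labels)
    sources = np.asarray(sources)
    fig, ax = plt.subplots(figsize=(6, 5))
    for source, marker in (("synthetic", "o"), ("natural", "^")):
        mask = sources == source
        if not mask.any():
            continue
        ax.scatter(coords[mask, 0], coords[mask, 1], c=labels[mask],
                   cmap="tab10", vmin=labels.min(), vmax=labels.max(),
                   marker=marker, label=source)
    ax.set_xlabel("t-SNE 1")
    ax.set_ylabel("t-SNE 2")
    ax.legend(title="origin")
    return fig


# Random 64-bit vectors as stand-ins for real fingerprints.
rng = np.random.default_rng(0)
demo_bits = rng.integers(0, 2, size=(20, 64))
coords = embed_fingerprints(demo_bits, perplexity=5)
```

Keep in mind that t-SNE preserves local neighbourhoods rather than global distances, so the spacing between far-apart groups in the plot should not be over-interpreted.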

5. **Analysis**:
   - Identify clusters that contain both synthetic and natural compounds.
   - Discuss the potential of the natural compounds in these clusters as substitutes for the synthetic ones.
   - Provide a brief explanation of how chemical similarity might indicate similar biological activity.
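To find clusters that contain both kinds of compounds, a small helper like the one below may be enough. It is a sketch that assumes every compound carries an origin label that is exactly `"synthetic"` or `"natural"`. These label strings and the function name `find_mixed_clusters` are hypothetical.

```python
from collections import defaultdict


def find_mixed_clusters(labels, sources):
    """Group compound indices by cluster and keep clusters with both origins.

    Returns {cluster_label: {"synthetic": [indices], "natural": [indices]}}.
    Assumes each entry in sources is either "synthetic" or "natural".
    """
    members = defaultdict(lambda: {"synthetic": [], "natural": []})
    for index, (label, source) in enumerate(zip(labels, sources)):
        members[label][source].append(index)
    return {
        label: groups
        for label, groups in members.items()
        if groups["synthetic"] and groups["natural"]
    }


labels = [1, 1, 2, 2, 3]
sources = ["synthetic", "natural", "synthetic", "synthetic", "natural"]
mixed = find_mixed_clusters(labels, sources)
# Only cluster 1 contains both a synthetic and a natural compound.
```

The natural compounds listed for each mixed cluster are the candidates to discuss as potential substitutes for the synthetic compounds in the same cluster.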

6. **Discussion**:
   - Reflect on the limitations of your approach, such as the choice of features or clustering method.
  

---

#### Learning Outcomes
By completing this assignment, you will:

- Gain practical experience with unsupervised machine learning for chemical structure analysis.
- Develop skills in conducting experiments, optimizing models, and documenting results.
- Learn to evaluate clustering outcomes.
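One common way to evaluate a clustering without ground-truth labels is the silhouette score. The sketch below compares several cluster counts on a toy example. It assumes average-linkage clustering on Jaccard distances, and `silhouette_by_k` is a hypothetical helper name. Other internal validation measures are equally valid choices.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score


def silhouette_by_k(fingerprints, k_values):
    """Return {k: silhouette score} for average-linkage Jaccard clusterings."""
    condensed = pdist(np.asarray(fingerprints, dtype=bool), metric="jaccard")
    square = squareform(condensed)
    tree = linkage(condensed, method="average")
    scores = {}
    for k in k_values:
        labels = fcluster(tree, t=k, criterion="maxclust")
        # The silhouette score is only defined for 2 to n - 1 clusters.
        if 1 < len(set(labels)) < len(labels):
            scores[k] = silhouette_score(square, labels, metric="precomputed")
    return scores


# Toy 8-bit "fingerprints" with two well-separated groups.
toy = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
])
scores = silhouette_by_k(toy, k_values=(2, 3))
best_k = max(scores, key=scores.get)
```

A higher score indicates tighter, better-separated clusters, but it says nothing about whether the clusters are chemically or biologically meaningful, so it complements rather than replaces inspection of the cluster contents.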

---

#### Assessment Criteria
- Organized solution: The portfolio is well organized. Code is divided into functions or class methods, follows coding standards, and is adequately documented. Code that is not written in functions or methods will not be reviewed. The assignment can be easily reproduced by others.
- Problem Understanding and Formulation: Demonstrates a clear understanding of the problem being addressed.
- Literature: Cites recent and authoritative sources.
- Data Preprocessing and Exploration: Thoroughly preprocesses the data to handle missing values, outliers, and other data quality issues. Explores the dataset to gain insights and understand its characteristics.
- Model Selection and Architecture: Chooses appropriate unsupervised machine learning algorithms for the given problem. Provides a rationale for the choices based on the characteristics of the data and the problem.
- Results and Discussion: Interprets and discusses the results in a meaningful way. Compares the results to baselines. Draws conclusions that are supported by evidence.
- Critical Thinking and Problem-Solving: Demonstrates critical thinking skills by addressing challenges and proposing insightful solutions.
- Presentation and Communication: Explains concepts clearly and defines technical terms appropriately.

---

### Resources:
- **RDKit Documentation**: [RDKit](https://www.rdkit.org/docs/)

Good luck, and have fun exploring the world of cheminformatics and unsupervised learning! 