# High-Level Design Document

## Project Title
Prediction of LC50 Value using Quantitative Structure–Activity Relationship (QSAR) Models

## Overview
This project develops a predictive model for the LC50 value, the concentration of a chemical compound that kills 50% of a test batch of fish over a 96-hour exposure. It uses Quantitative Structure–Activity Relationship (QSAR) modeling, in which molecular descriptors serve as the input features of machine learning regressors. The final predictor is an ensemble that combines Random Forest, Support Vector Regression (SVR), and Gradient Boosting Regressor (GBR) models. A Flask API exposes the model, and the application is deployed as a scalable Azure web service.

## System Architecture
The system architecture comprises several key components, each addressing a specific aspect of the project. These components include:

1. **Data Collection and Preprocessing**
2. **Model Training**
3. **Model Deployment**
4. **User Interface**
5. **Logging and Monitoring**

### Components

#### 1. Data Collection and Preprocessing
Data collection and preprocessing are critical steps in this project. The quality and structure of the data significantly influence the model's performance. This component involves gathering data from reliable sources, cleaning and preparing the data, and performing feature engineering.

- **Data Sources:**
  - **ECOTOX Database:** Managed by the US Environmental Protection Agency, this database provides comprehensive in vivo test data on fish for numerous chemical substances. The data includes various chemical properties and their effects on aquatic life.
  - **ECHA Data:** Additional data from the European Chemicals Agency (ECHA) is utilized to supplement and enrich the dataset, ensuring a diverse and comprehensive dataset for model training.

- **Preprocessing Steps:**
  - **Data Cleaning:** 
    - Handling missing values by either imputation or removal to ensure data integrity.
    - Correcting inconsistencies in the data (e.g., standardizing units of measurement) to maintain uniformity.
  - **Feature Engineering:**
    - Selecting relevant molecular descriptors that significantly influence the LC50 value.
    - Creating new features through domain knowledge to enhance the model's predictive power.
  - **Data Transformation:**
    - Normalizing or standardizing data to ensure uniform scale across different features.
    - Encoding categorical variables (e.g., species, exposure route) into numerical formats suitable for machine learning models.
  - **Splitting Data:**
    - Dividing the dataset into training and testing sets so that performance is measured on data the model has not seen, which makes overfitting visible.

#### 2. Model Training
Model training involves selecting, training, and optimizing various machine learning algorithms to accurately predict the LC50 values. This component also includes creating an ensemble model that combines the strengths of multiple individual models.

- **Machine Learning Models:**
  - **Random Forest Regressor:**
    - An ensemble method that builds multiple decision trees and merges their results for more accurate and robust predictions.
    - Utilizes bootstrap aggregating (bagging) to improve model accuracy and reduce overfitting.
  - **Support Vector Regressor (SVR):**
    - A regression method that fits a function so that most training points lie within a tolerance margin (epsilon) of it, penalizing only the points that fall outside the margin.
    - Effective in handling high-dimensional data and capturing complex relationships between features.
  - **Gradient Boosting Regressor (GBR):**
    - An ensemble technique that builds models sequentially, with each model attempting to correct the errors of its predecessor.
    - Highly effective in improving prediction accuracy through iterative refinement.

- **Ensemble Model:**
  - Combines the predictions from Random Forest, SVR, and GBR models to improve accuracy and robustness.
  - **Implementation:**
    - Train each model independently on the training data to capture different aspects of the underlying patterns.
    - Aggregate the predictions from each model during inference to obtain the final prediction, typically through averaging or weighted averaging.

- **Tools and Libraries:**
  - **scikit-learn:** For implementing and training machine learning models, providing a wide range of tools for model selection, training, and evaluation.
  - **pandas:** For data manipulation and preprocessing, enabling efficient handling of large datasets.
  - **pickle:** For model serialization and saving trained models for deployment, ensuring easy and efficient model loading during inference.

#### 3. Model Deployment
Deploying the trained models as a web service involves setting up the infrastructure, creating an API for interaction, and ensuring the service is scalable and reliable. This component focuses on making the predictive model accessible to end-users through a web-based interface.

- **Deployment Platform:**
  - **Azure Web App Service:** Chosen for its managed hosting, scalability options, and integration with Azure's monitoring and maintenance tools.

- **Deployment Process:**
  - **Flask API:**
    - Develop a Flask application to handle HTTP requests and provide predictions. Flask is a lightweight and flexible web framework that facilitates quick development and deployment.
    - Expose endpoints for users to input data and receive predictions, ensuring a seamless and user-friendly experience.
  - **Azure Web App Service:**
    - Configure and deploy the Flask application on Azure Web App Service, leveraging Azure's capabilities for scalability and reliability.
    - Ensure the web service can handle varying loads and provide continuous availability, minimizing downtime and ensuring a high-quality user experience.

#### 4. User Interface
The user interface (UI) is designed to facilitate easy interaction with the model, allowing users to input molecular descriptors and receive LC50 predictions. This component ensures that the application is accessible and user-friendly.

- **Front-end:**
  - **HTML Forms:**
    - Simple forms that collect the input data from users.
    - One field for each of the six molecular descriptors required for prediction.
  - **Result Display:**
    - Display the predicted LC50 value in a user-friendly format, providing clear and understandable results.

- **Back-end:**
  - **Flask Framework:**
    - Handle form submissions and data processing.
    - Pass the submitted descriptors to the trained models and return the prediction to the user.

#### 5. Logging and Monitoring
Logging and monitoring are essential for maintaining the health and performance of the deployed service. This component focuses on tracking and managing the application's performance and reliability.

- **Logging:**
  - **Python Logging Library:**
    - Record key events, errors, and system performance metrics, providing detailed insights into the application's operations.
    - Store logs in a centralized location for easy access and analysis, ensuring that issues can be quickly identified and resolved.

- **Monitoring:**
  - **Azure Monitoring Tools:**
    - Utilize Azure’s built-in monitoring tools for real-time log streaming, diagnostics, and performance monitoring.
    - Set up alerts for critical events and performance anomalies, ensuring that potential issues are quickly identified and addressed.

## Functional Requirements

### User Interface
- **Input Form:**
  - Fields for six molecular descriptors: descriptor1, descriptor2, descriptor3, descriptor4, descriptor5, descriptor6. These fields will collect the necessary input data for the model.
- **Result Display:**
  - Display the predicted LC50 value based on the input descriptors, providing users with clear and accurate results.

### API Endpoints
- **GET /:**
  - Render the main input form (index page) where users enter data.
- **POST /predict:**
  - Accept form data, process it, and return the predicted LC50 value.

## Non-Functional Requirements

### Performance
- **Latency:**
  - Keep prediction response time within acceptable limits so that users receive quick feedback.
- **Scalability:**
  - The system should handle multiple simultaneous requests without significant degradation in performance as demand varies.

### Security
- **Data Validation:**
  - Validate user inputs to prevent erroneous data submissions and potential security risks, ensuring that only valid and reliable data is processed.
- **Error Handling:**
  - Implement graceful error handling to ensure the application remains stable in the event of unexpected inputs or errors, providing users with clear error messages and maintaining application stability.
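
The validation and error-handling requirements above can be sketched as a small parsing function. This is a minimal illustration, not the project's actual code: the field names follow the `descriptor1` to `descriptor6` labels used in this document, and `parse_descriptors` is a hypothetical helper name.

```python
import math

# Field names as listed in the input form section of this document.
DESCRIPTOR_FIELDS = [f"descriptor{i}" for i in range(1, 7)]


def parse_descriptors(form):
    """Return the six descriptors as floats, or raise ValueError with a clear message."""
    values = []
    for name in DESCRIPTOR_FIELDS:
        raw = form.get(name)
        if raw is None or str(raw).strip() == "":
            raise ValueError(f"Missing value for {name}")
        try:
            value = float(raw)
        except (TypeError, ValueError):
            raise ValueError(f"{name} must be a number, got {raw!r}") from None
        # float() accepts "nan" and "inf", which a model cannot use meaningfully.
        if not math.isfinite(value):
            raise ValueError(f"{name} must be finite, got {raw!r}")
        values.append(value)
    return values
```

The caller can catch `ValueError` and show its message to the user instead of letting the request fail with a server error.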

## Deployment Strategy

### Azure Configuration
- **Web App Service:**
  - Create and configure an Azure Web App Service to host the Flask application, leveraging Azure's capabilities for scalability and reliability.
- **Continuous Deployment:**
  - Set up continuous deployment from a GitHub repository to Azure, ensuring that updates to the codebase are automatically deployed, minimizing manual intervention and ensuring quick deployment of new features and fixes.
- **Environment Variables:**
  - Configure necessary environment variables and app settings for the application to function correctly in the cloud environment, ensuring that the application is correctly configured and secure.
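
Azure app settings are exposed to the application as environment variables, so the configuration step above might be read in Python roughly as follows. The variable names `MODEL_PATH` and `LOG_LEVEL` and their defaults are assumptions chosen for illustration; the document does not specify them.

```python
import os


def load_settings(environ=os.environ):
    """Read deployment settings, falling back to local-development defaults."""
    return {
        # Hypothetical setting: location of the pickled model file.
        "model_path": environ.get("MODEL_PATH", "model.pkl"),
        # Hypothetical setting: logging verbosity.
        "log_level": environ.get("LOG_LEVEL", "INFO").upper(),
    }
```

Passing the mapping as a parameter keeps the function easy to test without touching the real environment.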

### Steps to Deploy
1. **Create a Web App Service in Azure:**
   - Use the Azure portal or CLI to create a new web app service.
2. **Configure Deployment Source:**
   - Link the GitHub repository to the Azure web app service so that code changes are deployed automatically.
3. **Set Up Environment Variables:**
   - Define any required environment variables, such as paths to model files or API keys, in the Azure app settings.
4. **Deploy the Application:**
   - Push code changes to the GitHub repository to trigger the continuous deployment process, ensuring that the latest code is deployed to the web service.
5. **Monitor Deployment Logs:**
   - Check deployment logs in the Azure portal to ensure the application is deployed successfully and is running correctly, identifying and addressing any issues that arise during deployment.

## Monitoring and Maintenance

### Logging
- Implement logging using Python’s logging library to track key events, errors, and performance metrics, providing detailed insights into the application's operations.
- Configure Azure to store and manage these logs, providing easy access for monitoring and debugging, ensuring that issues can be quickly identified and resolved.

### Monitoring Tools
- Utilize Azure’s monitoring and diagnostics tools to track application performance, detect anomalies, and receive alerts for critical events, ensuring that the application's health and performance are continuously monitored.
- Set up dashboards in Azure Monitor to visualize key metrics and monitor the health of the application in real-time, providing a comprehensive view of the application's performance.

### Regular Updates
- Periodically update the models with new data to maintain and improve prediction accuracy, ensuring that the models remain up-to-date and reliable.
- Regularly update dependencies and libraries to ensure the application remains secure and performant, addressing any security vulnerabilities and ensuring compatibility with the latest tools and technologies.

## Detailed Design

### Data Collection and Preprocessing

#### Data Sources
- **ECOTOX Database:**
  - The ECOTOX Database, managed by the US Environmental Protection Agency, provides comprehensive in vivo test data on fish for numerous chemical substances. The data includes various chemical properties and their effects on aquatic life, such as toxicity levels, exposure times, and environmental conditions.
  - The data from the ECOTOX Database is essential for building a robust and accurate predictive model, as it provides a rich source of historical data on chemical toxicity.

- **ECHA Data:**
  - Additional data from the European Chemicals Agency (ECHA) is utilized to supplement and enrich the dataset. The ECHA data includes detailed information on chemical substances, their properties, and their effects on human health and the environment.
  - By combining data from both the ECOTOX Database and ECHA, the project ensures a diverse and comprehensive dataset, which enhances the model's ability to generalize and make accurate predictions.

#### Data Cleaning
- **Handling Missing Values:**
  - Missing values in the dataset can significantly impact the performance of the predictive model. Therefore, it is crucial to handle missing values appropriately.
  - **Imputation:** Missing values can be imputed using various techniques such as mean, median, or mode imputation, or more advanced methods like k-nearest neighbors (KNN) imputation.
  - **Removal:** In cases where missing values cannot be imputed or are present in a significant portion of the dataset, the affected rows or columns can be removed to ensure data integrity.

- **Correcting Inconsistencies:**
  - Data inconsistencies, such as differing units of measurement or variations in data formats, can lead to errors in model training and prediction.
  - Standardizing units of measurement and ensuring uniform data formats are essential steps in the data cleaning process.
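
The cleaning steps above (imputation, removal, unit standardization) can be illustrated with a library-free sketch. In the project these operations would normally be done with pandas; plain Python is used here only to keep the example self-contained. The column names (`logp`, `lc50`) and the unit labels are hypothetical.

```python
from statistics import median

# Conversion factors to mg/L; the set of unit labels is an illustrative assumption.
TO_MG_PER_L = {"mg/L": 1.0, "ug/L": 1e-3, "g/L": 1e3}


def to_mg_per_l(value, unit):
    """Convert a concentration to mg/L so that all records share one unit."""
    return value * TO_MG_PER_L[unit]


def impute_median(rows, column):
    """Fill missing (None) entries of a column with the median of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def drop_incomplete(rows, required):
    """Remove rows in which any required column is missing."""
    return [r for r in rows if all(r[c] is not None for c in required)]
```

A reasonable policy is to impute descriptor columns but drop rows whose target value (the LC50 measurement) is missing, since an imputed target would teach the model invented data.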

#### Feature Engineering
- **Selecting Relevant Molecular Descriptors:**
  - Molecular descriptors are numerical values that describe the chemical and physical properties of a molecule. These descriptors are crucial for building a predictive model that accurately estimates the LC50 value.
  - The project involves selecting relevant molecular descriptors that significantly influence the LC50 value. This selection is based on domain knowledge, literature review, and exploratory data analysis.

- **Creating New Features:**
  - In addition to selecting existing molecular descriptors, the project involves creating new features through domain knowledge and feature engineering techniques.
  - For example, combining existing descriptors to create new, more informative features can enhance the model's predictive power.

#### Data Transformation
- **Normalization/Standardization:**
  - Normalizing or standardizing the data ensures that all features have a uniform scale, which is essential for many machine learning algorithms.
  - **Normalization:** Scaling the data to a range between 0 and 1.
  - **Standardization:** Transforming the data to have a mean of 0 and a standard deviation of 1.

- **Encoding Categorical Variables:**
  - Categorical variables, such as species or exposure route, need to be encoded into numerical formats suitable for machine learning models.
  - **One-Hot Encoding:** Representing categorical variables as binary vectors.
  - **Label Encoding:** Assigning a unique numerical value to each category.
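
The four transformations above can be written out in a few lines each. This is a library-free sketch for illustration; scikit-learn provides equivalent transformers, and the category values in the usage note are made up.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Scale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return the sorted categories and one binary vector per value."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


def label_encode(values):
    """Return the category-to-integer mapping and the encoded values."""
    mapping = {c: i for i, c in enumerate(sorted(set(values)))}
    return mapping, [mapping[v] for v in values]
```

For example, `one_hot(["oral", "water", "oral"])` yields the categories `["oral", "water"]` and the vectors `[[1, 0], [0, 1], [1, 0]]`. Label encoding imposes an integer order on the categories, so it is best suited to ordinal variables or tree-based models. Scaling parameters should be computed on the training set only and then reused for the test set.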

#### Splitting Data
- **Training and Testing Sets:**
  - Dividing the dataset into training and testing sets is crucial for evaluating the model's performance on unseen data and for detecting overfitting.
  - The training set is used to train the machine learning models, while the testing set is used to evaluate their performance.
  - **Validation Set:** In addition to the training and testing sets, a validation set can be used to tune hyperparameters and select the best model.
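
A three-way split as described above might look like the sketch below. The 70/15/15 proportions and the seed are assumptions for illustration; the document does not state the split ratios used.

```python
import random


def train_val_test_split(rows, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle rows reproducibly and cut them into train, validation, and test lists."""
    shuffled = list(rows)  # copy so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

Fixing the seed makes the split reproducible, so that model comparisons across runs use the same held-out data.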

### Model Training

#### Machine Learning Models
- **Random Forest Regressor:**
  - Random Forest is an ensemble method that builds multiple decision trees and merges their results to improve accuracy and robustness.
  - **Bootstrap Aggregating (Bagging):** Random Forest draws bootstrap samples (random samples with replacement) of the training data, trains a decision tree on each sample, and averages the trees' predictions.

- **Support Vector Regressor (SVR):**
  - SVR is a regression method that fits a function so that most training points lie within a tolerance margin (epsilon) of it, penalizing only the points that fall outside the margin.
  - **Kernel Trick:** SVR uses kernel functions to implicitly map the input data into a higher-dimensional space, enabling it to capture non-linear relationships between features.

- **Gradient Boosting Regressor (GBR):**
  - Gradient Boosting is an ensemble technique that builds models sequentially, with each model attempting to correct the errors of its predecessor.
  - **Boosting:** At each iteration, GBR fits a new tree to the residual errors (the negative gradient of the loss) of the current ensemble and adds it, scaled by a learning rate.
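
As a rough sketch of how the three regressors above could be trained and compared with scikit-learn: the data here is synthetic (six random features standing in for the six molecular descriptors), and every hyperparameter value is an illustrative assumption rather than a setting taken from this project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the real dataset: 200 compounds, six descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 0.8, -0.2]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "rf": RandomForestRegressor(n_estimators=100, random_state=0),
    # SVR is sensitive to feature scale, so it is wrapped with a scaler.
    "svr": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1)),
    "gbr": GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=0),
}

# Train each model independently and record its R^2 score on the held-out set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
```

The per-model scores give a baseline against which the ensemble can be judged.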

#### Ensemble Model
- **Combining Predictions:**
  - The ensemble model combines predictions from Random Forest, SVR, and GBR models to improve accuracy and robustness.
  - **Aggregation Methods:**
    - **Averaging:** The final prediction is an average of the predictions from the individual models.
    - **Weighted Averaging:** Assigning weights to each model based on their performance and aggregating their predictions.
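
Both aggregation methods above reduce to one function, since plain averaging is weighted averaging with equal weights. The sketch below assumes predictions are held in a dictionary keyed by model name; the function name and the weights in the usage note are hypothetical.

```python
def weighted_average(predictions, weights=None):
    """Combine per-model predictions into one ensemble prediction per sample.

    predictions: dict mapping model name -> list of predictions (equal lengths).
    weights: dict mapping model name -> non-negative weight; None means equal weights.
    """
    names = list(predictions)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_samples = len(predictions[names[0]])
    # Normalizing by the total lets callers pass unnormalized weights.
    return [
        sum(weights[name] * predictions[name][i] for name in names) / total
        for i in range(n_samples)
    ]
```

For instance, with predictions `{"rf": [1.0], "svr": [2.0], "gbr": [3.0]}` and weights `{"rf": 1, "svr": 1, "gbr": 2}`, the result is `[2.25]`. Weights are best derived from validation-set performance rather than test-set performance, so the test set remains an unbiased check.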

#### Tools and Libraries
- **scikit-learn:**
  - Provides a wide range of tools for model selection, training, and evaluation.
  - Includes implementations of Random Forest, SVR, and GBR models, as well as tools for data preprocessing and transformation.

- **pandas:**
  - Enables efficient handling and manipulation of large datasets.
  - Provides tools for data cleaning, transformation, and feature engineering.

- **pickle:**
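  - Used to serialize trained models to disk so they can be loaded at deployment time without retraining.
  - Pickle files should be loaded only from trusted sources, because unpickling can execute arbitrary code, and with a library version compatible with the one used for training.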
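
Saving and reloading with pickle can be sketched as below. A plain dictionary stands in for a trained model so the example runs without other dependencies; the helper names and the file name are hypothetical.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a trained model object to a file."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously saved model object."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip with a stand-in object; a fitted estimator is saved the same way.
stand_in = {"weights": [0.2, 0.3, 0.5]}
with tempfile.TemporaryDirectory() as tmp:
    model_file = os.path.join(tmp, "ensemble.pkl")
    save_model(stand_in, model_file)
    restored = load_model(model_file)
```

Loading the model once at application start-up, rather than on every request, keeps prediction latency low.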

### Model Deployment

#### Deployment Platform
- **Azure Web App Service:**
  - Azure provides a robust infrastructure for deploying and managing web applications.
  - Offers scalability options and integration with various tools for monitoring and maintenance.

#### Deployment Process
- **Flask API:**
  - Develop a Flask application to handle HTTP requests and provide predictions.
  - **Endpoints:**
    - **GET /:** Render the main input form (index page) for users to enter data.
    - **POST /predict:** Accept form data, process it, and return the predicted LC50 value.

- **Azure Web App Service:**
  - Configure and deploy the Flask application on Azure Web App Service.
  - Ensure the web service can handle varying loads and provide continuous availability.
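
The two Flask endpoints described above might be wired up as in this minimal sketch. Several parts are assumptions made so the example runs on its own: `predict_lc50` is a placeholder that returns the mean of the inputs instead of calling the trained ensemble, the form is an inline HTML string rather than a template, and the JSON response format is an illustrative choice.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Field names as listed in the input form section of this document.
DESCRIPTOR_FIELDS = [f"descriptor{i}" for i in range(1, 7)]

INDEX_HTML = (
    '<form method="post" action="/predict">'
    + "".join(f'<input name="{name}" placeholder="{name}">' for name in DESCRIPTOR_FIELDS)
    + '<button type="submit">Predict</button></form>'
)


def predict_lc50(descriptors):
    """Placeholder for the ensemble model; returns the mean so the sketch runs."""
    return sum(descriptors) / len(descriptors)


@app.route("/", methods=["GET"])
def index():
    # Serve the input form.
    return INDEX_HTML


@app.route("/predict", methods=["POST"])
def predict():
    try:
        descriptors = [float(request.form[name]) for name in DESCRIPTOR_FIELDS]
    except (KeyError, ValueError):
        # Missing or non-numeric fields produce a client error, not a crash.
        return jsonify(error="All six descriptors must be numeric"), 400
    return jsonify(lc50=predict_lc50(descriptors))
```

Flask's built-in test client (`app.test_client()`) can exercise both routes without starting a server, which is convenient for checking the endpoints before deployment.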

### User Interface

#### Front-end
- **HTML Forms:**
  - Simple and intuitive forms to collect input data from users.
  - Fields to input the six molecular descriptors required for prediction.

- **Result Display:**
  - Display the predicted LC50 value in a user-friendly format.

#### Back-end
- **Flask Framework:**
  - Handle form submissions and data processing.
  - Interface with the trained models to generate predictions.

### Logging and Monitoring

#### Logging
- **Python Logging Library:**
  - Record key events, errors, and system performance metrics.
  - Store logs in a centralized location for easy access and analysis.
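
A basic configuration of Python's `logging` module for this purpose could look like the sketch below. The logger name `lc50_app` and the message format are assumptions; writing to standard output is chosen because hosting platforms typically collect it into their log streams.

```python
import logging
import sys


def configure_logging(level="INFO"):
    """Return the application logger, attaching a stdout handler on first use."""
    logger = logging.getLogger("lc50_app")
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = configure_logging()
logger.info("prediction served in %.1f ms", 12.3)
```

Passing values as arguments, as in the last line, defers string formatting until the message is actually emitted.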

#### Monitoring
- **Azure Monitoring Tools:**
  - Utilize Azure’s built-in monitoring tools for real-time log streaming, diagnostics, and performance monitoring.
  - Set up alerts for critical events and performance anomalies.
