PubChem-Bioactivity-Classification

Welcome to the PubChem Bioactivity Classification project! This project aims to classify molecular bioactivity as active or inactive using machine learning and deep learning techniques.

Project Objectives

The main objective of this project is to build and evaluate ML and DL models that predict the bioactivity of molecules from their chemical structure. The work proceeds in the following steps:

Read and clean the data: The raw PubChem bioactivity data will be read into a data frame and cleaned by skipping rows that do not contain data and deleting NaN and duplicate values in the 'PUBCHEM_EXT_DATASOURCE_SMILES' column.
Describe the molecular structure and the binary classification: The molecular structure will be described using the MinHash fingerprint (MHFP) method, which calculates a fingerprint for each molecule. The 'PUBCHEM_ACTIVITY_SCORE' column will be converted to a binary activity label and set as the target variable (y).
Prepare the data for machine learning and deep learning: The data will be split into training and test sets and scaled using appropriate techniques.
Machine learning: At least three scikit-learn ML models will be tried and the best model will be selected using cross-validation. The parameters of the best model will be optimized using grid search.
Deep learning: A deep neural network (DNN) will be built, compiled, and fit to the data. The learning curve (loss vs. epoch) will be plotted and the performance of the model will be optimized by varying the number of layers, the number of neurons per layer, the dimensionality of the MHFP, and/or the number of epochs.
Evaluate the models: The performance of the ML and DL models will be evaluated using appropriate metrics, such as accuracy, precision, and recall.

Data Preprocessing

Before we can build and evaluate the ML and DL models, we need to prepare the data by performing the following preprocessing steps:

Read the data into a data frame.
Skip rows that do not contain data.
Use only the columns 'PUBCHEM_EXT_DATASOURCE_SMILES' and 'PUBCHEM_ACTIVITY_SCORE'.
Drop rows with NaN values and remove duplicates in 'PUBCHEM_EXT_DATASOURCE_SMILES'.
Use MHFP to calculate the fingerprint of each molecule and reformat the result into a data frame that can be used as the feature matrix X.
Convert 'PUBCHEM_ACTIVITY_SCORE' to a binary activity label and set it as y.
Split the data into training and test sets.
Scale the features using appropriate techniques.
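
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the inline CSV is a made-up stand-in for a PubChem export, the number of descriptive rows to skip (n_descr_rows), the activity threshold of 40, and the helper name load_clean are all assumptions, and random bits stand in for the MHFP fingerprints, which are not computed here.

```python
import io

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

SMILES_COL = "PUBCHEM_EXT_DATASOURCE_SMILES"
SCORE_COL = "PUBCHEM_ACTIVITY_SCORE"

# Made-up stand-in for a PubChem export; the two rows after the header
# mimic descriptive rows that do not contain data.
RAW_CSV = """PUBCHEM_RESULT_TAG,PUBCHEM_EXT_DATASOURCE_SMILES,PUBCHEM_ACTIVITY_SCORE
RESULT_TYPE,,
RESULT_DESCR,,
1,CCO,0
2,CCO,0
3,c1ccccc1,85
4,,10
5,CC(=O)O,45
6,CCN,5
7,CCCC,60
8,CCOC,0
"""


def load_clean(source, n_descr_rows=2):
    """Read the two columns of interest, skipping rows without data (hypothetical helper)."""
    df = pd.read_csv(
        source,
        skiprows=list(range(1, n_descr_rows + 1)),  # keep the header at line 0
        usecols=[SMILES_COL, SCORE_COL],
    )
    # Drop rows with a missing SMILES string, then duplicate SMILES strings.
    df = df.dropna(subset=[SMILES_COL]).drop_duplicates(subset=[SMILES_COL])
    return df.reset_index(drop=True)


df = load_clean(io.StringIO(RAW_CSV))

# Assumed threshold: scores of 40 or more count as active.
ACTIVITY_THRESHOLD = 40
y = (df[SCORE_COL] >= ACTIVITY_THRESHOLD).astype(int).to_numpy()

# Placeholder features: random bits instead of real MHFP fingerprints.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(len(df), 16))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, stratify=y, random_state=0
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(df.shape, y.tolist(), X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing step.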

Machine Learning

Once the data is prepared, we can start building and evaluating ML models. We will use scikit-learn to try at least three different ML models and use cross-validation to select the best model. Then, we will use grid search to optimize the parameters of the best model.
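
A minimal sketch of this selection procedure is shown below. The three candidate models, the parameter grids, and the synthetic data from make_classification are illustrative assumptions; the README does not say which models or parameters the project uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training fingerprints and labels.
X_train, y_train = make_classification(n_samples=200, n_features=32, random_state=0)

# Three candidate models (an assumed selection).
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "svc": SVC(),
}

# Mean 5-fold cross-validated accuracy for each candidate.
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# Small, assumed parameter grids; only the winner's grid is searched.
param_grids = {
    "logistic_regression": {"C": [0.1, 1.0, 10.0]},
    "random_forest": {"max_depth": [None, 5, 10]},
    "svc": {"C": [0.1, 1.0, 10.0]},
}
search = GridSearchCV(
    candidates[best_name], param_grids[best_name], cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

print(best_name, search.best_params_, round(search.best_score_, 3))
```

After fitting, search.best_estimator_ holds the tuned model refit on all the training data, ready to be scored on the held-out test set.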

Deep Learning

After selecting the best ML model, we will build, compile, and fit a DNN to the data. We will plot the learning curve (loss vs. epoch) to visualize the training process. We will also vary the number of layers, the number of neurons per layer, the dimensionality of the MHFP, and/or the number of epochs to optimize the performance of the model.
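
As a rough sketch of this step, the following builds, compiles, and fits a small network with the Keras API in tensorflow and collects the per-epoch loss values that make up the learning curve. The layer sizes, the epoch count, the synthetic data, and the helper name build_dnn are assumptions for illustration only.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for fingerprint features and binary activity labels.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(256, 64)).astype("float32")
y_train = (X_train[:, :8].sum(axis=1) > 4).astype("float32")


def build_dnn(n_features, hidden_units=(64, 32)):
    """Build and compile a binary classifier (hypothetical helper).

    hidden_units sets both the number of hidden layers and their widths.
    """
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(n_features,)))
    for units in hidden_units:
        model.add(tf.keras.layers.Dense(units, activation="relu"))
    # One sigmoid unit outputs the predicted probability of "active".
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


model = build_dnn(X_train.shape[1])
history = model.fit(
    X_train, y_train, epochs=5, batch_size=32, validation_split=0.2, verbose=0
)

# Loss per epoch: these two lists are the learning curve to plot.
train_loss = history.history["loss"]
val_loss = history.history["val_loss"]
for epoch, (tl, vl) in enumerate(zip(train_loss, val_loss), start=1):
    print(f"epoch {epoch}: loss={tl:.4f} val_loss={vl:.4f}")
```

Changing hidden_units or epochs and comparing the resulting curves is one simple way to explore the variations described above; the two loss lists can be passed to any plotting library.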

The code requires Python 3 and the following libraries:

pandas
numpy
scikit-learn
tensorflow (for deep learning)

You can install these libraries with pip, for example: pip install pandas numpy scikit-learn tensorflow

To run the code, clone the repository and navigate to the project directory. Then, open the main.ipynb notebook, for example with Jupyter (installed separately):

git clone https://github.com/azzaouiyazid/PubChem-Bioactivity-Classification.git
cd PubChem-Bioactivity-Classification
jupyter notebook main.ipynb

Running the notebook cells in order executes the project milestones and outputs the results.

Additional Notes

The raw data is available in the data folder of the repository.
The scikit-learn and tensorflow libraries provide a wide range of machine learning and deep learning models that you can experiment with. Try out different models and parameters to see which ones work best for your data.
Make sure to tune the hyperparameters of the models to get the best performance. You can use techniques such as grid search or random search to optimize the hyperparameters.
Don't forget to evaluate the models using appropriate metrics and compare the results to choose the best model for your data.
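
As a small illustration of the metrics mentioned above, the sketch below computes accuracy, precision, and recall with scikit-learn. The labels and predictions are invented for the example, and the helper name evaluate is hypothetical.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score


def evaluate(y_true, y_pred):
    """Return the three metrics for binary labels (hypothetical helper)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }


# Invented test labels and predictions (1 = active, 0 = inactive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

metrics = evaluate(y_true, y_pred)
print(metrics)
# 3 true positives, 2 false positives, 1 false negative, 2 true negatives:
# accuracy = 5/8 = 0.625, precision = 3/5 = 0.6, recall = 3/4 = 0.75
```

Applying the same function to each model's test-set predictions gives directly comparable numbers.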
