✍️ Contribution period: Olawumi Salaam #1033

Closed
14 of 15 tasks
OlawumiSalaam opened this issue Mar 8, 2024 · 18 comments

Comments

@OlawumiSalaam
Contributor

OlawumiSalaam commented Mar 8, 2024

Week 1 - Get to know the community

  • Join the communication channels
  • Open a GitHub issue (this one!)
  • Install the Ersilia Model Hub and test the simplest model
  • Install Docker if needed, and test another model
  • Write a motivation statement to work at Ersilia
  • Submit your first contribution to the Outreachy site

Week 2 - Get Familiar with Machine Learning for Chemistry

  • Select a model from the list suggested in GitBook
  • Download and serve the model via the Ersilia Model Hub to ensure it works
  • Open a repository on your GitHub user with all the necessary files
  • Select and clean a dataset of 1000 molecules (example notebook 1)
  • Run predictions for the molecules on the selected model and evaluate the results

Week 3 - Validate a Model in the Wild

  • Find a suitable dataset with sufficient experimental results
  • Clean and standardize the dataset
  • Run predictions and calculate metrics.

Week 4 - Prepare your final application

  • Submit the final application in the Outreachy website
@OlawumiSalaam
Contributor Author

Screenshot 2024-03-08 114023 I had a difficult time testing Docker after installation and was stuck on it for a whole day. I was frustrated, but I needed to fix it in order to continue with the installation.

@OlawumiSalaam
Contributor Author

Screenshot 2024-03-08 131025 I picked up where I left off and debugged the issue using online resources. Here is how I solved it:
  • I turned on Windows Subsystem for Linux and checked that virtualization was enabled.
  • I installed Ubuntu 22.04.3.
  • I opened the terminal, followed all the installation instructions in the Ersilia guide, and was able to get Docker working.

@OlawumiSalaam
Contributor Author

Screenshot 2024-03-08 130750 Successfully installed Ersilia Model Hub

@OlawumiSalaam
Contributor Author

Screenshot 2024-03-08 130828 I checked that the CLI worked on my terminal and I explored the available commands. This is my first time using Ubuntu and I was happy I was able to use it. Another skill and tool learnt.

@OlawumiSalaam
Contributor Author

Screenshot 2024-03-08 170354 when I try to serve model eos2r5a, I am getting this error. There is an issue with loading JSON data. @GemmaTuron @DhanshreeA

@OlawumiSalaam
Contributor Author

Screenshot 2024-03-08 215151 After confirming that Ersilia is recognized in my CLI, I proceeded to test a simple model to ensure the functionality of the system by calculating the molecular weight of the molecules. The testing process was successful.

@OlawumiSalaam
Contributor Author

Screenshot 2024-03-09 081455

The aim of this task is to test Ersilia with Docker. I achieved this through these steps:

  1. Another simple model from the hub was selected and its model image was pulled.
  2. The Ersilia environment was activated.
  3. A test was performed, using the CLI, on the model that was fetched from DockerHub.
  4. The model generated Morgan fingerprints for the provided molecule.

I successfully tested Ersilia's compatibility with Docker and confirmed its ability to fetch, execute, and produce results using models obtained from DockerHub.

@Ajoke23
Contributor

Ajoke23 commented Mar 9, 2024

Screenshot 2024-03-08 170354 when I try to serve model eos2r5a, I am getting this error. There is an issue with loading JSON data. @GemmaTuron @DhanshreeA

Hi @OlawumiSalaam
I think this is resolved, right? I remember seeing you post about this on the #debugging channel on Slack, and I gave my suggestion by forwarding a comment/suggestion @DhanshreeA made regarding plunging.

Please let me know if you are still encountering any errors.

@OlawumiSalaam
Contributor Author

Amidst the Covid-19 lockdown, my husband's absence left me helpless, unable to provide for our children. This made me realize how dependent I had become despite having a degree in chemistry, and I decided to change my situation. Research led me to consider technology as a viable option. However, entering this male-dominated field was met with disapproval from my local communities, who believed in traditional gender roles. Biases from peers and educators added to my challenges. My capabilities were questioned because I am a woman rather than judged on merit. This affected my confidence and led to imposter syndrome, where I sometimes doubt my abilities despite my accomplishments. This stereotyping limited my access to key projects and mentors and affected my professional development.

In a bid to seek opportunities to break this barrier by getting a role to help advance my career, I stumbled upon Outreachy Internship. I realized Outreachy opens doors to the world of free software and the main goal is to make tech more diverse and fairer by working on real projects and teaming up with others to upskill through contributions to Open source projects. Outreachy helps people who are underrepresented in tech to break down barriers and make sure everyone is included in the tech world.

However, my interest in working on Ersilia Projects sparked when the contribution stage opened and I was browsing through the different projects. Specifically, I discovered that Ersilia Open Source charity is focused on strengthening the research capacity for infectious and neglected diseases by developing and implementing novel artificial intelligence and machine learning tools. After doing research on how Machine Learning can be implemented in drug discovery, I discovered it encompasses a range of applications that have the potential to revolutionize the industry. Machine learning algorithms can predict the biological activity of compounds, allowing researchers to focus their efforts on the most promising candidates. These algorithms can identify potential drug candidates by analyzing chemical structures and properties.

Aside from earning a bachelor's degree in chemistry, I have also developed my skills in Artificial Intelligence and Machine Learning. I am able to execute the full Machine Learning project life cycle: defining the problem statement, data gathering, data cleaning and preprocessing, exploratory data analysis to reveal hidden insights, modelling the data, and interpreting the results for both supervised and unsupervised machine learning tasks. My past training and projects gave me opportunities to learn the Python programming language using VS Code, Jupyter Notebook and Google Colab, along with frameworks and tools like TensorFlow, Keras, scikit-learn, OpenCV, PyTorch, NumPy, pandas, Matplotlib, Seaborn, Plotly, Git and GitHub. I am also proficient with model evaluation metrics such as mean absolute error, R-squared, accuracy score, precision, recall, F1-score, ROC-AUC and the classification report, which will be needed for regression and classification tasks in Ersilia projects. However, I recognize that there is still much more to learn in the field of AI, and I am eager to expand my knowledge and skills in this rapidly advancing field through this internship. My immediate need is to learn tools for accessing molecular databases such as PubChem and ChEMBL and for processing chemical and biological data.

My experiences in my previous role and projects have equipped me with communication, collaboration, problem-solving, time management, project management, and analytical skills. I am confident that these skills will enable me to make the most of the internship program. I look forward to leveraging my skills to understand issues, make meaningful contributions and succeed in the program. Starting from the contribution stage, I will be learning from industry experts, absorbing their valuable insights and feedback, and gaining a broader perspective of the industry.

After this internship, I see myself having a breakthrough and becoming successful. Success for me is about continuously learning, growing, and striving to reach my full potential. I intend to be in a role that will help me contribute significantly to developing AI and Machine Learning technologies. I see myself working as a deep learning scientist applying my expertise to solve complex problems and develop innovative solutions. My goal is to be at the forefront of cutting-edge research and development, leveraging AI to address real-world challenges as it relates to Sustainable Development Goals (SDGs). I look forward to mentoring and inspiring people from underrepresented groups like me to pursue careers in technology and AI. This will be my own way of helping Outreachy achieve its goals and giving back to the society.

I believe that being selected will provide me with invaluable experiences and also open doors to networking opportunities, mentorship, and potential business collaborations. Overall, all this will help me advance my career growth and enable me to make meaningful contributions to the AI industry.

@OlawumiSalaam
Contributor Author

Screenshot 2024-03-08 170354 when I try to serve model eos2r5a, I am getting this error. There is an issue with loading JSON data. @GemmaTuron @DhanshreeA

Hi @OlawumiSalaam I think this is resolved, right? I remember seeing you post about this on the #debugging channel on Slack, and I gave my suggestion by forwarding a comment/suggestion @DhanshreeA made regarding plunging.

Please let me know if you are still encountering any errors.

@DhanshreeA . Thank you. The error has been resolved.

@OlawumiSalaam
Contributor Author

Motivation Letter For Ersilia Project

Amidst the Covid-19 lockdown, my husband's absence left me helpless, unable to provide for our children. This made me realize how dependent I had become despite having a degree in chemistry, and I decided to change my situation. Research led me to consider technology as a viable option. However, entering this male-dominated field was met with disapproval from my local communities, who believed in traditional gender roles. Biases from peers and educators added to my challenges. My capabilities were questioned because I am a woman rather than judged on merit. This affected my confidence and led to imposter syndrome, where I sometimes doubt my abilities despite my accomplishments. This stereotyping limited my access to key projects and mentors and affected my professional development.
In a bid to seek opportunities to break this barrier by getting a role to help advance my career, I stumbled upon Outreachy Internship. I realized Outreachy opens doors to the world of free software and the main goal is to make tech more diverse and fairer by working on real projects and teaming up with others to upskill through contributions to Open source projects. Outreachy helps people who are underrepresented in tech to break down barriers and make sure everyone is included in the tech world.
However, my interest in working on Ersilia Projects sparked when the contribution stage opened and I was browsing through the different projects. Specifically, I discovered that Ersilia Open Source charity is focused on strengthening the research capacity for infectious and neglected diseases by developing and implementing novel artificial intelligence and machine learning tools. After doing research on how Machine Learning can be implemented in drug discovery, I discovered it encompasses a range of applications that have the potential to revolutionize the industry. Machine learning algorithms can predict the biological activity of compounds, allowing researchers to focus their efforts on the most promising candidates. These algorithms can identify potential drug candidates by analyzing chemical structures and properties.
Aside from earning a bachelor's degree in chemistry, I have also developed my skills in Artificial Intelligence and Machine Learning. I am able to execute the full Machine Learning project life cycle: defining the problem statement, data gathering, data cleaning and preprocessing, exploratory data analysis to reveal hidden insights, modelling the data, and interpreting the results for both supervised and unsupervised machine learning tasks. My past training and projects gave me opportunities to learn the Python programming language using VS Code, Jupyter Notebook and Google Colab, along with frameworks and tools like TensorFlow, Keras, scikit-learn, OpenCV, PyTorch, NumPy, pandas, Matplotlib, Seaborn, Plotly, Git and GitHub. I am also proficient with model evaluation metrics such as mean absolute error, R-squared, accuracy score, precision, recall, F1-score, ROC-AUC and the classification report, which will be needed for regression and classification tasks in Ersilia projects. However, I recognize that there is still much more to learn in the field of AI, and I am eager to expand my knowledge and skills in this rapidly advancing field through this internship. My immediate need is to learn tools for accessing molecular databases such as PubChem and ChEMBL and for processing chemical and biological data.
My experiences in my previous role and projects have equipped me with communication, collaboration, problem-solving, time management, project management, and analytical skills. I am confident that these skills will enable me to make the most of the internship program. I look forward to leveraging my skills to understand issues, make meaningful contributions and succeed in the program. Starting from the contribution stage, I will be learning from industry experts, absorbing their valuable insights and feedback, and gaining a broader perspective of the industry.
After this internship, I see myself having a breakthrough and becoming successful. Success for me is about continuously learning, growing, and striving to reach my full potential. I intend to be in a role that will help me contribute significantly to developing AI and Machine Learning technologies. I see myself working as a deep learning scientist applying my expertise to solve complex problems and develop innovative solutions. My goal is to be at the forefront of cutting-edge research and development, leveraging AI to address real-world challenges related to the Sustainable Development Goals, in particular SDG 3. It is still very unfortunate that Nigeria has not won the battle against malaria and has the greatest number of malaria cases in the world. We still lose friends and family to malaria. Malaria is a major public health concern in Nigeria, with an estimated 68 million cases and 194,000 deaths due to the disease in 2021. Nigeria has the highest burden of malaria globally, accounting for nearly 27% of the global malaria burden. Knowing that the Ersilia initiative is at the forefront of building AI-powered solutions, I would like to be part of the mission. I look forward to mentoring and inspiring people from underrepresented groups like me to pursue careers in technology and AI. This will be my own way of helping Outreachy achieve its goals and giving back to society.
I believe that being selected will provide me with invaluable experiences and also open doors to networking opportunities, mentorship, and potential collaborations on global health issues. Overall, all this will help me advance my career growth and enable me to make meaningful contributions to the AI industry.

@OlawumiSalaam
Contributor Author

OlawumiSalaam commented Mar 13, 2024

📆 WEEK 2: Get Familiar with Machine Learning for Chemistry

The task for this week is to get familiar with Machine Learning for chemistry data and to validate a model for predicting ADME properties of small molecules. The aim is to test the accuracy and reproducibility of the model. Aqueous solubility is an important physicochemical property that influences the pharmacokinetic properties of compounds and is a very important factor in drug discovery. To achieve this objective, the model with EOS model ID eos74bo was selected, and the task is in 3 steps:

Task 1

Model: eos74bo
Model Description
Aqueous Kinetic Solubility
Kinetic aqueous solubility (μg/mL) was experimentally determined using the same SOP in over 200 NCATS drug discovery projects. A final dataset of 11780 non-redundant molecules and their associated solubility was used to train a SVM classifier. Approximately half of the dataset has poor solubility (< 10 μg/mL), and two-thirds of these low soluble molecules report values of < 1 μg/mL. A subset of the data used is available at PubChem (AID 1645848).

Identifiers
EOS model ID: eos74bo
Slug: ncats-solubility
Characteristics
Input: Compound
Input Shape: Single
Task: Classification
Output: Probability
Output Type: Float
Output Shape: Single
Interpretation: Probability of a compound having poor solubility (< 10 µg/mL)

  1. A GitHub repository was created for the task and can be found here.
  2. A list of 1000 molecules was obtained from the public ChEMBL database, and I represented them as standard SMILES with their InChIKeys. The data was saved in the input folder.
  3. Predictions for the preprocessed 1000 molecules were generated and plotted.

This was achieved in the following steps:
• Import necessary libraries
• Data preprocessing
• Model bias evaluation

• Import necessary libraries. I installed and imported the necessary libraries, such as miniconda, rdkit, pandas, numpy and matplotlib, and specified the necessary folder paths in my mounted Google Drive.

• Data preprocessing. The model was tested on 1000 molecules from public repositories, represented as standard SMILES. The data is in the input folder inside the data folder. I loaded a list of molecules I obtained from ChEMBL and processed them to make sure I had the standard SMILES representation of each compound and its associated InChIKey. This was done by applying the standardise_smile and inchikey functions I defined in my src folder. After that, the standardised SMILES were converted into a list, because that is the input format the model accepts.

• Model bias evaluation.
The model was fetched and served from the Ersilia Model Hub, ready to be used for predictions on new data. The "predict api" was used and the output was generated. The predictions (output) were saved to a CSV file in the output folder inside the data folder, and they were visualised with a scatter plot, histogram, distplot and bar chart using the matplotlib and seaborn libraries. The plots are saved in the plot folder.
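
Because eos74bo returns, for each molecule, a probability of poor solubility (< 10 µg/mL), the saved predictions can be summarised numerically before plotting. The sketch below is illustrative only: the probabilities are made up, and the 0.5 cutoff and the function name `summarise_predictions` are assumptions, not taken from the notebook.

```python
from statistics import mean


def summarise_predictions(probabilities, cutoff=0.5):
    """Summarise probabilities of poor solubility (the 0.5 cutoff is an assumption)."""
    flagged = [p for p in probabilities if p >= cutoff]
    return {
        "n_molecules": len(probabilities),
        "mean_probability": mean(probabilities),
        "n_predicted_poor": len(flagged),
        "fraction_predicted_poor": len(flagged) / len(probabilities),
    }


# Hypothetical model outputs for five molecules (not real predictions)
example_probabilities = [0.10, 0.80, 0.55, 0.30, 0.95]
summary = summarise_predictions(example_probabilities)
print(summary)
```

A summary like this complements the histogram: it shows at a glance whether the model flags most of the input library as poorly soluble.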

@OlawumiSalaam
Contributor Author

@GemmaTuron @DhanshreeA please I am waiting for your feedback

@DhanshreeA
Member

@OlawumiSalaam good work on Task 1! I believe you can submit the final application.

@OlawumiSalaam
Contributor Author

OlawumiSalaam commented Mar 25, 2024

@DhanshreeA @GemmaTuron, please find my updated task below. Your feedback will be valuable in improving it.
Week 2 Task 2

  • Identify a result you could reproduce from the paper

The task is to reproduce the result of the ADME@NCATS solubility model as described by the authors in this publication. I will be working on a subset of the training data that is available in the PubChem database. The experimental data associated with the compounds are open for public access. The dataset has 2532 records, and the details can be found here: PubChem AID 1645848. I aim to reproduce this result, AUC-ROC: 0.93 +/- 0.00 from the 5-fold cross-validation, as stated here.

  • Implementation of the Author’s model
  1. Installation
    The installation requires conda, which I already had installed.
    Chemprop can be installed either from PyPI via pip or from source (i.e., directly from the git repo). I installed Chemprop from PyPI in Ubuntu on my computer by running the following commands:
    conda create -n chemprop python=3.8
    conda activate chemprop
    conda install -c conda-forge rdkit
    pip install git+https://github.com/bp-kelley/descriptastorus
    pip install chemprop

  2. Data cleaning. The data was preprocessed and the notebook can be found here

  3. Still in the same notebook, after the necessary imports, the GCNN_model was trained using these commands:
    import chemprop

    # Define command-line arguments
    arguments = [
        '--data_path', data_path,
        '--dataset_type', 'classification',
        '--save_dir', checkpoint_dir,
        '--epochs', '50',
        '--num_folds', '5',
        '--save_smiles_splits',
        '--quiet',
    ]
    # Parse arguments
    args = chemprop.args.TrainArgs().parse_args(arguments)
    # Move the model and data to the GPU if available
    args.device = device
    # Run cross-validation
    mean_score, std_score = chemprop.train.cross_validate(args=args, train_func=chemprop.train.run_training)

I was able to achieve Overall test auc = 0.869393 +/- 0.019119, and the full result is in
quiet.log

Here are my observations: the ADME@NCATS solubility model was trained on a dataset of 22,209 records, but only 2,529 have been made publicly available in PubChem.
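
The reported `0.869393 +/- 0.019119` is a mean and spread over the five cross-validation folds. As a rough illustration of how such a summary is formed, the sketch below splits sample indices into five folds and aggregates per-fold scores. The fold scores are hypothetical placeholders (not the values in quiet.log), the helper names are made up, and using the population standard deviation is an assumption about how the spread is computed.

```python
import random
from statistics import mean, pstdev


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def summarise_folds(scores):
    """Return (mean, population standard deviation) of per-fold scores."""
    return mean(scores), pstdev(scores)


folds = k_fold_indices(n_samples=2529, k=5)

# Hypothetical per-fold AUC values, NOT the ones obtained in the thread
fold_scores = [0.85, 0.86, 0.87, 0.88, 0.89]
mean_score, std_score = summarise_folds(fold_scores)
print(f"Overall test auc = {mean_score:.6f} +/- {std_score:.6f}")
```

With roughly 2,500 records, each fold holds about 500 molecules, so the spread across folds can be noticeably larger than for a model evaluated on the full 22,209-record set.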

  • Result comparison of the ADME@NCATS solubility model and the eos74bo model from the Ersilia Model Hub
  • A subset of the NPC marketed drugs was downloaded from the Supplemental Material; it contains 185 compounds.

| Model | AUC-ROC | BACC | Sensitivity | Specificity | Kappa |
|---|---|---|---|---|---|
| ADME@NCATS | 0.84 | 0.84 | 0.81 | 0.86 | 0.61 |
| eos74bo | 0.84 | 0.84 | 0.81 | 0.86 | 0.61 |

Result Discussion
A detailed notebook on the model result comparison is found here.
The ADME@NCATS solubility model and eos74bo produced the same result on the same test set. I observed that the ADME@NCATS model was trained as a graph convolutional neural network on 22,209 records, while eos74bo from the Ersilia Model Hub used a dataset of 11,780 molecules to train a Support Vector Machine classifier.

@DhanshreeA
Member

Good work @OlawumiSalaam ! Please go ahead and create your final application. :)

@OlawumiSalaam
Contributor Author

OlawumiSalaam commented Apr 1, 2024

To improve the quality of my contributions, I researched further how to improve model accuracy and reliability, and found that data quality is an important factor. I therefore extended my data preprocessing to ensure that no compound found in the training data was present in the NPC data used to validate the model. Initially the NPC test data contained 185 compounds; after preprocessing, it was down to 176 compounds. These were used to generate predictions with both the eos74bo and ADME@NCATS models, and performance was evaluated on the same classification metrics. The two models achieved the same results, both in the plotted ROC curves and in the other metrics.
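
Removing test compounds that also appear in the training data can be done by comparing a structure identifier such as the InChIKey. The following is a minimal sketch under stated assumptions: the record layout, the function name `remove_overlap` and the placeholder keys are all hypothetical and do not come from the notebook.

```python
def remove_overlap(test_records, train_inchikeys):
    """Drop test records whose InChIKey also appears in the training set."""
    train_keys = set(train_inchikeys)  # set lookup keeps the filter fast
    return [r for r in test_records if r["inchikey"] not in train_keys]


# Placeholder identifiers standing in for real InChIKeys
train_inchikeys = ["KEY-A", "KEY-B"]
test_records = [
    {"name": "mol_1", "inchikey": "KEY-A"},  # also in training data
    {"name": "mol_2", "inchikey": "KEY-C"},
    {"name": "mol_3", "inchikey": "KEY-D"},
]

cleaned = remove_overlap(test_records, train_inchikeys)
print(f"{len(test_records)} -> {len(cleaned)} test compounds after removing overlap")
```

Matching on InChIKey rather than on raw SMILES avoids missing duplicates that are merely written differently, provided both sets were standardised the same way.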

| Model | AUC-ROC | BACC | Sensitivity | Specificity | Kappa |
|---|---|---|---|---|---|
| ADME@NCATS | 0.82 | 0.82 | 0.78 | 0.86 | 0.58 |
| eos74bo | 0.82 | 0.82 | 0.78 | 0.86 | 0.58 |

Result Interpretation
An AUC-ROC (Area Under the Receiver Operating Characteristic Curve) score of 0.8235 means that the model is relatively good at differentiating between the positive and negative classes; the closer the AUC-ROC score is to 1, the better the model's performance.
The model's balanced accuracy is 82.35% across both classes (i.e., active and inactive). This is a useful metric when dealing with an imbalanced dataset, as is the case with the NPC dataset that I used.
With a sensitivity of 0.7838, the model correctly identifies 78.38% of the positive cases (i.e., low-solubility compounds) out of all actual positive cases (all low-solubility compounds in the data), and it correctly identifies 86.33% of the negative cases (high-solubility compounds) out of all actual negative cases (all high-solubility compounds in the data). Sensitivity is really important because we want to identify low-solubility compounds correctly, so that we do not mistakenly use poorly soluble compounds to develop drugs, which would be very costly.
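
The metrics interpreted above can all be derived from a 2x2 confusion matrix. The sketch below uses made-up counts (not the NPC results) and treats low solubility as the positive class; the function name is hypothetical. The last lines check that the reported sensitivity and specificity average to the reported balanced accuracy.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, balanced accuracy and Cohen's kappa."""
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)  # recall on the positive (low-solubility) class
    specificity = tn / (tn + fp)  # recall on the negative (high-solubility) class
    balanced_accuracy = (sensitivity + specificity) / 2
    observed = (tp + tn) / total  # plain accuracy
    # Agreement expected by chance, from the row and column totals
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total**2
    kappa = (observed - expected) / (1 - expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "kappa": kappa,
    }


# Hypothetical confusion-matrix counts, for illustration only
metrics = classification_metrics(tp=40, fn=10, tn=120, fp=30)
print(metrics)

# Consistency check on the values reported in this comment
reported_bacc = (0.7838 + 0.8633) / 2
print(f"Balanced accuracy from reported values: {reported_bacc:.5f}")
```

The check gives about 0.8236, which agrees with the reported 0.8235 up to rounding of the inputs.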

Result Discussion.
The detailed notebook on the model result comparison has been updated.
The ADME@NCATS solubility model and eos74bo produced the same result on the same test set. I observed that the ADME@NCATS model was trained as a graph convolutional neural network on 22,209 records, while eos74bo from the Ersilia Model Hub used a dataset of 11,780 molecules to train a Support Vector Machine classifier.
I also noticed that the metric scores decreased, but I think the result is more reliable.

@OlawumiSalaam
Contributor Author

OlawumiSalaam commented Apr 2, 2024

WEEK 3: Validate a Model in the Wild

A detailed notebook for implementation is found here.

The goal of the task is to measure the performance of the model on an external dataset obtained from a public repository like ChEMBL, PubChem, Therapeutics Data Commons or MoleculeNet. I searched through PubChem, Therapeutics Data Commons and MoleculeNet, but most of the aqueous solubility datasets I found were for regression tasks. I eventually found a dataset on PubChem whose experimental data was suitable for binary classification.
About the Dataset
The test dataset that I used for the task contains 4510 records and is a subset of Aqueous Solubility from MLSMR Stock Solutions (PubChem AID 1996).
The first step was to import the necessary libraries and specify the folder paths.
I read the test data into a pandas dataframe and inspected the first 5 rows.
To understand the test dataset, I did exploratory data analysis by checking the shape and the class distribution (i.e., how many molecules are in each class, obtained by running the value_counts method on the solubility column). The result is the count of low-solubility and high-solubility molecules in the test data. For more context, I visualised the distribution with a pie chart from the matplotlib.pyplot module. The analysis revealed an imbalanced dataset, with 70.4% high-solubility molecules and 29.6% low-solubility molecules.
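
The class-distribution check described above was done with pandas `value_counts`; the dependency-free sketch below does the equivalent with `collections.Counter`. The labels are toy data and the function name is hypothetical.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, percentage)} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, 100 * n / total) for label, n in counts.items()}


# Toy labels, not the PubChem AID 1996 data
labels = ["high"] * 7 + ["low"] * 3
distribution = class_distribution(labels)
for label, (n, pct) in distribution.items():
    print(f"{label}: {n} molecules ({pct:.1f}%)")
```

Knowing the class ratio up front is what motivates reporting balanced accuracy later, since plain accuracy can look good on an imbalanced set simply by favouring the majority class.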
The next step was to clean and preprocess the data:

  1. I checked that every record in the test data contains a valid SMILES string.
  2. I checked for missing values and duplicated values.
  3. Using the standardized_smile function already defined in the src folder, I converted the SMILES into standardized SMILES in a new column.
  4. I got the InChIKey representation of each standardized molecule by applying the get_inchikey function, already defined in the src folder, to the standardized_smile column.
  5. I checked for repeated molecules between the test and train data and found 3. I removed these 3 repeated molecules to avoid bias and proceeded to run predictions with the cleaned data.
    Running Predictions
  6. The eos74bo model was fetched and served from the Ersilia Model Hub successfully.
  7. The standardized SMILES were converted to a list to serve as input for the model.
  8. The predictions were run and the prediction output was generated.
    Performance Evaluation
  9. I generated the ROC curve and computed other metrics such as AUC-ROC, balanced accuracy, sensitivity and specificity.
    The result obtained is shown below:

| Model | AUC-ROC | BACC | Sensitivity | Specificity |
|---|---|---|---|---|
| eos74bo | 0.7411 | 0.7411 | 0.7637 | 0.7186 |

Result Interpretation
An AUC-ROC (Area Under the Receiver Operating Characteristic Curve) score of 0.7411 means that the model is relatively good at differentiating between the positive (low solubility) and negative (high solubility) classes; the closer the AUC-ROC score is to 1, the better the model's performance.
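
For reference, AUC-ROC can be read as the probability that a randomly chosen positive is scored higher than a randomly chosen negative. The sketch below computes it that way on toy data; the function name and the 0.5 cutoff are assumptions. It also shows a general property worth keeping in mind when reading the table: if AUC is computed from thresholded 0/1 predictions instead of probabilities, it reduces to the balanced accuracy. The thread does not state how the AUC was computed, so this is only a possible reason why the two columns match.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("Both classes are needed to compute AUC-ROC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = low solubility (positive), 0 = high solubility (negative)
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]

auc_from_probabilities = roc_auc(labels, scores)

# Thresholding first (assumed 0.5 cutoff) discards the ranking information
hard_labels = [1 if s >= 0.5 else 0 for s in scores]
auc_from_hard_labels = roc_auc(labels, hard_labels)

print(auc_from_probabilities, auc_from_hard_labels)
```

In this toy case the hard-label AUC equals the balanced accuracy of the thresholded predictions (sensitivity 2/3, specificity 2/3), while the probability-based AUC is higher.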

The model's balanced accuracy is 74.11% across both classes (i.e., active and inactive). This is a useful metric when dealing with an imbalanced dataset, as is the case with the test dataset that I used.

A sensitivity score of 0.7637 means the model correctly identifies 76.37% of the positive cases (i.e., low-solubility compounds) out of all actual positive cases (all low-solubility compounds in the data). Sensitivity is really important because we want to identify low-solubility compounds correctly, so that we do not mistakenly use poorly soluble compounds to develop drugs, which would be very costly; the focus is on minimizing Type II errors.

A specificity score of 0.7186 indicates that approximately 71.86% of the actual negative cases (high-solubility molecules) are correctly identified as negative by the model.
Conclusion
The results are based on a cleaned version of the test dataset, obtained by removing the four samples common to the training and test datasets.
