✍️ Contribution period: Olawumi Salaam #1033
Comments
When I try to serve model eos2r5a, I get this error. There is an issue with loading JSON data. @GemmaTuron @DhanshreeA |
Hi @OlawumiSalaam, please let me know if you are still encountering any errors. |
During the Covid-19 lockdown, my husband's absence left me helpless and unable to provide for our children. This made me realize how dependent I had become despite having a degree in chemistry, and I decided to change my situation. Research led me to consider technology as a viable option. However, entering this male-dominated field was met with disapproval from my local community, which believed in traditional gender roles. Biases from peers and educators added to my challenges. My capabilities were questioned because I am a woman, rather than judged on merit. This affected my confidence and led to imposter syndrome, where I sometimes doubt my abilities despite my accomplishments. This stereotyping limited my access to key projects and mentors and affected my professional development. While looking for opportunities to break this barrier and find a role that would advance my career, I came across the Outreachy internship. I realized that Outreachy opens doors to the world of free software, and that its main goal is to make tech more diverse and fairer by having interns work on real projects and team up with others to upskill through contributions to open source projects. Outreachy helps people who are underrepresented in tech to break down barriers and makes sure everyone is included in the tech world. My interest in working on Ersilia projects was sparked when the contribution stage opened and I was browsing through the different projects. Specifically, I discovered that the Ersilia Open Source charity is focused on strengthening research capacity for infectious and neglected diseases by developing and implementing novel artificial intelligence and machine learning tools. After researching how machine learning can be applied in drug discovery, I found that it encompasses a range of applications that have the potential to revolutionize the industry.
Machine learning algorithms can predict the biological activity of compounds, allowing researchers to focus their efforts on the most promising candidates. These algorithms can identify potential drug candidates by analyzing chemical structures and properties. Aside from earning a bachelor's degree in chemistry, I have also developed my skills in artificial intelligence and machine learning. I am able to execute the full machine learning project life cycle: defining the problem statement, gathering data, cleaning and preprocessing data, exploratory data analysis to reveal hidden insights, modelling, and interpreting results for both supervised and unsupervised machine learning tasks. My past training and projects gave me opportunities to learn the Python programming language using VS Code, Jupyter Notebook and Google Colab, together with tools and frameworks such as TensorFlow, Keras, scikit-learn, OpenCV, PyTorch, NumPy, pandas, Matplotlib, Seaborn, Plotly, Git and GitHub. I am also proficient with model evaluation metrics such as mean absolute error, R-squared, accuracy, precision, recall, F1-score, ROC-AUC and the classification report, which will be needed for regression and classification tasks in Ersilia projects. However, I recognize that there is still much more to learn in the field of AI, and I am eager to expand my knowledge and skills in this rapidly advancing field through this internship. My most immediate need is to learn tools for accessing molecular databases such as PubChem and ChEMBL and for processing chemical and biological data. My experiences in my previous role and projects have equipped me with communication, collaboration, problem-solving, time management, project management, and analytical skills. I am confident that these skills will enable me to make the most of the internship program. I look forward to using them to understand issues, make meaningful contributions and succeed in the program.
Starting from the contribution stage, I will be learning from industry experts, absorbing their valuable insight and feedback, and gaining a broader perspective of the industry. After this internship, I see myself having a breakthrough and becoming successful. Success for me is about continuously learning, growing, and striving to reach my full potential. I intend to be in a role that will help me contribute significantly to developing AI and machine learning technologies. I see myself working as a deep learning scientist, applying my expertise to solve complex problems and develop innovative solutions. My goal is to be at the forefront of cutting-edge research and development, leveraging AI to address real-world challenges related to the Sustainable Development Goals (SDGs). I look forward to mentoring and inspiring people from underrepresented groups like me to pursue careers in technology and AI. This will be my own way of helping Outreachy achieve its goals and giving back to society. I believe that being selected will provide me with invaluable experience and also open doors to networking opportunities, mentorship, and potential business collaborations. Overall, this will help me advance my career and enable me to make meaningful contributions to the AI industry. |
@DhanshreeA Thank you. The error has been resolved. |
📆 WEEK 2: Get Familiar with Machine Learning for Chemistry
The task for this week is to get familiar with machine learning for chemistry data and to validate a model for predicting ADME properties of small molecules. The aim is to test the accuracy and reproducibility of the model. Aqueous solubility is an important physicochemical property that influences the pharmacokinetic properties of compounds, and it is a very important factor in drug discovery. To achieve these objectives, the model with EOS ID eos74bo was selected, and the task has 3 steps:
Task 1
Model: eos74bo
Identifiers
This was achieved in the following steps:
• I installed and imported the necessary tools and libraries, such as Miniconda, RDKit, pandas, NumPy and Matplotlib, and specified the necessary folder paths in my mounted Google Drive.
• Data preprocessing: The model was tested on 1000 molecules from public repositories, represented as standard SMILES. The data is in the input folder inside the data folder. I loaded a list of molecules I obtained from ChEMBL and processed them to make sure I had the standard SMILES representation of each compound and the InChIKey associated with it. This was done by applying the standardise_smile and inchikey functions I defined in my src folder. After that, the standardised SMILES were converted into a list, because that is the data format the model accepts.
• Model bias evaluation. |
@GemmaTuron @DhanshreeA please I am waiting for your feedback |
@OlawumiSalaam good work on Task 1! I believe you can submit the final application. |
@DhanshreeA @GemmaTuron Here is my updated task. Your feedback will be valuable in improving it.
The task is to reproduce the result of the ADME@NCATS Solubility model as described by the authors in this publication. I will be working on the subset of the training data that is available in the PubChem database. The experimental data associated with the compounds are open for public access. The dataset has 2532 records and the details can be found here: PubChem AID 1645848. I aim to reproduce this result, AUC-ROC: 0.93 +/- 0.00 from the 5-fold cross-validation, as stated here.
I achieved an overall test AUC of 0.869393 +/- 0.019119. Here are my observations: The ADME@NCATS solubility model was trained on 22,209 records, but only 2,529 have been made publicly available in PubChem.
Result Discussion. |
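As a rough illustration of how a figure such as "AUC = mean +/- spread" over 5 folds can be produced, here is a minimal sketch in plain Python. It is not the notebook's code: the function names `roc_auc` and `summarise_folds` are hypothetical, the AUC is computed with the pairwise (Mann-Whitney) definition, and the spread is assumed to be the population standard deviation across folds (the default of `numpy.std`).

```python
from statistics import mean, pstdev


def roc_auc(y_true, y_score):
    """ROC AUC via the pairwise definition: the fraction of
    (positive, negative) pairs in which the positive has the higher score.
    Ties count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarise_folds(fold_results):
    """fold_results: list of (y_true, y_score) pairs, one per held-out fold.
    Returns (mean AUC, population standard deviation of the fold AUCs)."""
    aucs = [roc_auc(y_true, y_score) for y_true, y_score in fold_results]
    return mean(aucs), pstdev(aucs)
```

Calling `summarise_folds` with the held-out labels and predicted probabilities of each of the 5 folds gives the two numbers reported in the "mean +/- spread" form. The pairwise loop is quadratic in fold size, which is acceptable for a few thousand compounds; `sklearn.metrics.roc_auc_score` is the usual choice in practice.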
Good work @OlawumiSalaam ! Please go ahead and create your final application. :) |
To improve the quality of my contributions, I researched further how to improve model accuracy and reliability. I found that data quality is an important factor. I therefore extended my data preprocessing to ensure that no compound found in the training data was present in the NPC data used to validate the model. Initially, the NPC test data contained 185 compounds, but after preprocessing it was down to 176 compounds. These were used to generate predictions with both the eos74bo and ADME@NCATS models, and the performance was evaluated on the same classification metrics. The two models produced the same ROC curve and the same values for the other metrics.
Result Interpretation Result Discussion. |
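The overlap check described above (dropping validation compounds that also appear in the training data) can be sketched as follows. This is only an illustration under assumptions: it supposes compounds are matched by InChIKey, that each record is a dict with an "inchikey" field, and the name `remove_training_overlap` and the placeholder keys are hypothetical rather than taken from the notebook.

```python
def remove_training_overlap(test_rows, train_inchikeys):
    """Return the test rows whose InChIKey does not occur in the training set.

    test_rows: list of dicts, each with an "inchikey" entry (assumed layout).
    train_inchikeys: iterable of InChIKey strings from the training data.
    """
    train_keys = set(train_inchikeys)  # set lookup keeps the filter linear
    return [row for row in test_rows if row["inchikey"] not in train_keys]


# Placeholder identifiers, not real InChIKeys.
train = ["KEY-A", "KEY-B"]
test = [
    {"inchikey": "KEY-B", "label": 1},  # also in training, so it is dropped
    {"inchikey": "KEY-C", "label": 0},
    {"inchikey": "KEY-D", "label": 1},
]
clean_test = remove_training_overlap(test, train)
```

Matching on InChIKey rather than on raw SMILES avoids missing overlaps when the same compound is written with different SMILES strings.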
WEEK 3: Validate a Model in the Wild
A detailed notebook for the implementation is found here. The goal of the task is to measure the performance of the model on an external dataset obtained from a public repository like ChEMBL, PubChem, Therapeutics Data Commons or MoleculeNet. I searched through PubChem, Therapeutics Data Commons and MoleculeNet, but the datasets I found for aqueous solubility were for regression tasks. I did, however, find a dataset on PubChem whose experimental data was suitable for binary classification.
Result Interpretation
The model's balanced accuracy across both classes (i.e., active and inactive) is 74.11%. This is a useful metric when dealing with an imbalanced dataset, which is the case for the test dataset that I used. A sensitivity of 0.7637 means the model correctly identifies 76.37% of the positive cases (i.e., low-solubility compounds) out of all actual positive cases (all low-solubility compounds in the data). Sensitivity is really important here because we want to identify low-solubility compounds correctly, so that we do not mistakenly carry poorly soluble compounds into drug development, which would be very costly; the focus is therefore on minimizing Type 2 errors (false negatives). A specificity of 0.7186 indicates that approximately 71.86% of the actual negative cases (high-solubility molecules) are correctly identified as negative by the model. |
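The three metrics discussed above are linked: balanced accuracy is the average of sensitivity and specificity, and the reported values agree with that, since (0.7637 + 0.7186) / 2 is about 0.7411. A minimal sketch of the calculation from binary labels follows; the function name `classification_summary` is hypothetical, and the positive class (1) is assumed to mean low solubility, as in the interpretation above.

```python
def classification_summary(y_true, y_pred):
    """Sensitivity, specificity and balanced accuracy for binary labels.
    Label 1 is the positive class (assumed here: low solubility)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives (Type 2)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives (Type 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Toy labels for illustration only, not the PubChem test data.
summary = classification_summary(
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 1, 1],
)
```

Because each class contributes equally to the average, balanced accuracy is not inflated by the majority class the way plain accuracy can be on an imbalanced test set.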
Week 1 - Get to know the community
Week 2 - Get Familiar with Machine Learning for Chemistry
Week 3 - Validate a Model in the Wild
Week 4 - Prepare your final application