aaryan2/Synhibit

Synhibit - Using ML to Prevent Cancers


Overview

This project is the result of a roughly month-long research effort aimed at taking a new approach to cancer therapy. As someone who has been researching cancer diagnostics for the past couple of years, I found it alarming that such a widespread disease, one that medicine still can't consistently cure, wasn't being approached the way most other diseases are. Despite the enormous sums that go into funding cancer research, it was a pretty big shock to see that nearly all of that money goes into developing cures, while comparatively little time, effort, or capital is directed at finding a preventative measure like the vaccines we have for other diseases such as the flu.

By its nature, cancer is a genetic condition, and it encompasses a vast number of forms that affect specific regions of the body (e.g. brain, lung, and bone cancers). Cancers disrupt the growth controls put in place by the host's healthy cells and initiate uncontrolled cell growth that can form solid tumours, metastasize (spread), destroy organs, and ultimately lead to death unless cured. Or, even better, prevented. Long before someone develops cancer, their cells have to undergo a set of mutations in the specific regions of their DNA that oversee growth, survival, and rate of reproduction.

Normally, our cells can detect these abrupt changes to their genetic "instruction manual" and initiate a process called apoptosis, which essentially lets them self-destruct for the safety of their host. Cancer, however, has found a workaround through driver mutations, which are much too large for cells to attempt to fix and often even disable the part of the cell's DNA that initiates apoptosis:


Cancerous vs. Healthy Cells


A healthy person would have to accumulate roughly 2-9 of these mutations to actually develop cancer (depending on the type of cancer and the mutations acquired), but it's important to keep in mind that many people inherit some of these mutations, which means fewer accumulated mutations are needed for cancer to take hold. Mutations arise through the external factors we're exposed to, and every factor triggers a cascade of cellular and chemical reactions in us. These 'cascades' are known as cellular pathways. When we're exposed to negative factors (e.g. smoking, alcohol, poor diet), the corresponding harmful pathways get expressed and can lead to the driver mutations that cause cancers. Of course, there are other ways mutations can occur, but cellular pathways form the basis of the cycle.

But what if we could proactively inhibit the pathways that can cause cancer when expressed?

With investment in the field of cancer prevention only recently beginning to ramp up, an opportunity has opened to harness the power of machine learning (ML) for this approach - a field that uses statistical models fitted by computers to find patterns in data too large or complex for humans to work through by hand. If we successfully trained a model on the main compounds that inhibit the expression of cancer-causing pathways, it could change the way we see cancer treatment. Using the compounds screened, or even generated, by the model, we could discover much more specific, safe, and affordable compounds to repress the mutations that cause our cells to go off course, and potentially even develop a preventative vaccine for cancer!


Gathering Data

To develop this model, I needed a dataset, but I couldn't find an existing one. The next best option was to create my own, so that's what I did. Initially, the plan was to gather data automatically using the PubChem API, but I couldn't get that to work (a future step, though). So I went with manually picking data points from an extensive catalogue of cancer-pathway-inhibiting compounds on this website => https://www.bocsci.com/tag/cancer-381.html, as well as from Wikipedia's list of carcinogens. Carcinogens were basically the only compounds I could be confident would not have therapeutic effects against cancer, since you can never really be sure about that with other random compounds. The site was especially useful because it lists compounds that selectively target specific cellular pathways, and as you'll see later, those came in really handy as extra features when training the model:



Some of the compounds that were analyzed
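For the automated route mentioned above, here is a minimal sketch of what a PubChem lookup could look like. It is not the code used for this project (the dataset here was assembled by hand), and it assumes Python 3 and PubChem's PUG REST interface. The helper names (`property_url`, `parse_properties`, `fetch_properties`) and the choice of properties are my own illustrative picks, not the features in the actual spreadsheet.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
# Assumed example properties; the real feature set may differ.
PROPERTIES = ("MolecularWeight", "XLogP", "HBondDonorCount", "HBondAcceptorCount")


def property_url(name, properties=PROPERTIES):
    """Build a PUG REST URL that looks a compound up by name."""
    return "{}/compound/name/{}/property/{}/JSON".format(
        PUG_REST, quote(name, safe=""), ",".join(properties)
    )


def parse_properties(payload):
    """Return the first property record from a PUG REST JSON response."""
    rows = payload["PropertyTable"]["Properties"]
    return dict(rows[0])


def fetch_properties(name):
    """Download and parse one compound (needs network access)."""
    with urlopen(property_url(name), timeout=30) as response:
        return parse_properties(json.load(response))


# Offline example using a hand-written payload in the PUG REST shape.
sample = {"PropertyTable": {"Properties": [{"CID": 2244, "MolecularWeight": "180.16"}]}}
print(property_url("aspirin", ("MolecularWeight",)))
print(parse_properties(sample))
```

Calling `fetch_properties("aspirin")` would hit the live service, so a real scraper should also pause between requests and handle compounds that PubChem cannot resolve by name.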

So, after hours of scouring compounds and Ctrl+C'ing data into a spreadsheet, the dataset was done. Now the fun could begin - training the model! First, I've got to mention that, while I do have some experience developing ML models, I realized that coding one from scratch would take a lot of unnecessary time and effort for this scenario. With this in mind, I decided to go with one of my favourite platforms for developing classification ML algorithms without code - MATLAB! Usually, platforms that work without code limit how much you can customize, but with MATLAB you don't really face the same barriers, since you can customize in most of the ways you could with code.



Conclusion

By importing the dataset I created into the MATLAB workspace and using the Classification Learner app, I created an Optimizable Naive Bayes model. I chose it for two major reasons: I tried all the models available in the app and it performed the best, and the Naive Bayes algorithm tends to perform well on noisier, more widely distributed (scattered) datasets such as this one.
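For readers who want to see what a Naive Bayes classifier does under the hood, here is a small sketch in plain Python. It is not the MATLAB model: it assumes every feature is numeric and Gaussian-distributed and does no hyperparameter tuning, and the toy rows, the two feature names, and the `fit`/`predict` helpers are all hypothetical.

```python
import math
from collections import defaultdict


def fit(rows, labels):
    """Estimate a prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, var + 1e-9))  # small floor avoids division by zero
        model[label] = (len(members) / len(rows), stats)
    return model


def predict(model, row):
    """Pick the class with the highest log posterior."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            # Log of the Gaussian density, treating features as independent.
            total += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


# Hypothetical toy data: [molecular weight, logP]; 1 = inhibitor, 0 = carcinogen.
rows = [[350.0, 2.1], [410.0, 3.0], [380.0, 2.6],
        [120.0, 0.4], [150.0, 1.0], [135.0, 0.7]]
labels = [1, 1, 1, 0, 0, 0]
model = fit(rows, labels)
print(predict(model, [390.0, 2.8]))
print(predict(model, [140.0, 0.6]))
```

The "naive" part is the loop that simply adds up per-feature log densities, which treats each feature as independent of the others given the class.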

Finally, for what you might have been waiting for since the start - the RESULTS! Well, see for yourself:

ROC Curve

Confusion Matrix
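If you haven't read a confusion matrix before, this short sketch shows how the usual summary numbers come out of its four cells. The counts are made up purely for illustration and are not the counts from this model; `summarize` is a hypothetical helper.

```python
def summarize(tp, fn, fp, tn):
    """Derive common metrics from the four cells of a binary confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,          # share of all predictions that were right
        "precision": tp / (tp + fp),            # how many flagged inhibitors were real
        "recall": tp / (tp + fn),               # true positive rate (ROC y-axis)
        "false_positive_rate": fp / (fp + tn),  # ROC x-axis
    }


# Hypothetical counts, NOT this project's results.
counts = {"tp": 8, "fn": 2, "fp": 1, "tn": 9}
metrics = summarize(**counts)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Each point on a ROC curve is one (false positive rate, recall) pair, obtained by moving the classifier's decision threshold.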

The model ended up getting an insanely high 92.3% accuracy! That suggests we may now have a way of consistently discovering new compounds that could inhibit cancers before they develop, rather than treating them once they happen. Of course, this is incredibly promising, but for it to really be considered a success, we need to obtain similar results with a much, much larger dataset. Don't worry, because I'm getting my weapons (Python 2.7 and the PubChem API) ready for what's possibly going to be the largest chemical web-scraping project the world has ever seen! Hopefully, Synhibit can go on to help in the discovery of these compounds, especially when there are trillions to choose from, and who knows? Maybe the project could lead to the contents of your future cancer vaccine. Till then, stay tuned, because the updates will keep coming. I hope you enjoyed!

About

An open-source ML cancer pathway inhibition prediction system
