Cancer_diagnosis

A machine learning model to help molecular pathologists classify genetic variations into 9 classes.

@author: Akash Kumar
LinkedIn: https://www.linkedin.com/in/akash-kumar916/

We understand that analyzing text is a difficult challenge but, believe it or not, it is the current state of the art when it comes to interpreting genetic variants.

The workflow is as follows:

1. A molecular pathologist selects a list of genetic variations of interest to analyze.
2. The molecular pathologist searches the medical literature for evidence relevant to those genetic variations.
3. Finally, the molecular pathologist spends a huge amount of time analyzing the evidence for each variation in order to classify it.

Our goal here is to replace step 3 with a machine learning model. The molecular pathologist will still have to decide which variations are of interest and collect the relevant evidence for them. But the last step, which is also the most time-consuming, will be fully automated.

A genetic mutation can be classified into one of nine different classes.

This is not a trivial task, since interpreting clinical evidence is very challenging even for human specialists. Therefore, modeling the clinical evidence (text) will be critical for the success of the approach.

The training and test data sets are each provided as two files. One (training/test_variants) provides the information about the genetic mutations, whereas the other (training/test_text) provides the clinical evidence (text) that human experts used to classify the genetic mutations. The two files are linked via the ID field.

For example, the genetic mutation (row) with ID=15 in the file training_variants was classified using the clinical evidence (text) from the row with ID=15 in the file training_text.
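The ID-based link between the two files can be sketched as a simple join. This is a minimal illustration using only the Python standard library, not the code from the notebook. The tiny in-memory CSV samples, the gene and variation values, the comma separator, and the helper name `join_on_id` are all assumptions made for the example. The real files come from the Kaggle link and may use different headers or delimiters.

```python
import csv
import io

# Hypothetical miniature stand-ins for training_variants and training_text.
# Column names follow the list in this README; the real headers and
# separators may differ.
variants_csv = """ID,Gene,Variations,Class
15,GENE_A,VAR_1,4
16,GENE_B,VAR_2,7
"""
text_csv = """ID,Text
15,Clinical evidence for the first variation.
16,Clinical evidence for the second variation.
"""


def join_on_id(variants_file, text_file):
    """Attach each variant's clinical text to its record via the shared ID."""
    text_by_id = {row["ID"]: row["Text"] for row in csv.DictReader(text_file)}
    joined = []
    for row in csv.DictReader(variants_file):
        # Variants without matching text get an empty string instead of being dropped.
        joined.append({**row, "Text": text_by_id.get(row["ID"], "")})
    return joined


rows = join_on_id(io.StringIO(variants_csv), io.StringIO(text_csv))
for row in rows:
    print(row["ID"], row["Gene"], row["Class"], row["Text"][:30])
```

With pandas, the same join would typically be a single `merge` call on the `ID` column.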

Finally, to make it more exciting, some of the test data is machine-generated to prevent hand labeling.

Data Link: https://www.kaggle.com/c/msk-redefining-cancer-treatment/data

We have two data files: one contains the information about the genetic mutations, and the other contains the clinical evidence (text) that human experts/pathologists use to classify the genetic mutations.
Both data files share a common column called ID.
Data file information:
- training_variants (ID, Gene, Variations, Class)
- training_text (ID, Text)

Performance metric

Metric(s):
- Multi-class log-loss
- Confusion matrix
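The two metrics can be illustrated with a small standard-library sketch. It is not taken from the notebook. The 1..9 class labels, the sample predictions, and the function names are assumptions made for the example.

```python
import math

N_CLASSES = 9  # assumption: classes are labeled 1..9


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip the probability so that log(0) can never occur.
        p = min(max(probs[label - 1], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)


def confusion_matrix(y_true, y_pred, n_classes=N_CLASSES):
    """matrix[i][j] counts samples of true class i+1 predicted as class j+1."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t - 1][p - 1] += 1
    return matrix


# Hypothetical true labels and a uniform predictor that knows nothing.
y_true = [1, 4, 7, 4]
uniform = [[1 / N_CLASSES] * N_CLASSES for _ in y_true]
baseline = multiclass_log_loss(y_true, uniform)  # equals ln(9)

# Hypothetical hard predictions: the third sample is misclassified as class 2.
y_pred = [1, 4, 2, 4]
cm = confusion_matrix(y_true, y_pred)
```

A uniform predictor over nine classes scores ln 9, roughly 2.197, which gives a reference point: a useful model should reach a log-loss below that. The confusion matrix then shows which classes are being mistaken for which.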
