# Predict the effect of Genetic Variants to enable Personalized Medicine
## This is a UCSD ML Bootcamp Capstone Project 
#### by Zhanyang Zhu Zhanyang.Zhu@Gmail.com
####    5/2022-9/2022
Data is taken from [Kaggle Personalized Medicine: Redefining Cancer Treatment](https://www.kaggle.com/competitions/msk-redefining-cancer-treatment/data)

In this competition you will develop algorithms to classify genetic mutations based on clinical evidence (text).

A genetic mutation can be classified into one of nine classes.

This is not a trivial task since interpreting clinical evidence is very challenging even for human specialists. Therefore, modeling the clinical evidence (text) will be critical for the success of your approach.

Both the training and the test data sets are provided as two separate files. One (training/test_variants) provides the information about the genetic mutations, whereas the other (training/test_text) provides the clinical evidence (text) that our human experts used to classify the genetic mutations. The two files are linked via the ID field.

For example, the genetic mutation (row) with ID=15 in the file training_variants was classified using the clinical evidence (text) from the row with ID=15 in the file training_text.
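The ID link described above can be sketched with two tiny in-memory frames. This is only an illustration: the `Gene`, `Variation`, and `Class` values are the first three rows shown later in this notebook, while the `Text` strings and the names `variants`, `text`, and `merged` are placeholders introduced here.

```python
import pandas as pd

# Toy stand-ins for training_variants and training_text.
# The Text values are placeholders, not real clinical evidence.
variants = pd.DataFrame(
    {
        "ID": [0, 1, 2],
        "Gene": ["FAM58A", "CBL", "CBL"],
        "Variation": ["Truncating Mutations", "W802*", "Q249E"],
        "Class": [1, 2, 2],
    }
).set_index("ID")
text = pd.DataFrame(
    {"ID": [0, 1, 2], "Text": ["evidence 0", "evidence 1", "evidence 2"]}
).set_index("ID")

# Join on the shared ID index; validate raises if an ID repeats on either side.
merged = pd.merge(
    variants, text,
    left_index=True, right_index=True,
    how="inner", validate="one_to_one",
)
print(merged)
```

Using `validate="one_to_one"` is a design choice for this sketch: it turns a silent row duplication into an explicit error if the IDs are not unique.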

Finally, to make it more exciting, some of the test data is machine-generated to prevent hand labeling. You will submit all the results of your classification algorithm, and we will ignore the machine-generated samples.

## File descriptions

- training_variants: a comma-separated file containing the description of the genetic mutations used for training. Fields are ID (the id of the row used to link the mutation to the clinical evidence), Gene (the gene where this genetic mutation is located), Variation (the amino acid change for this mutation), Class (1-9, the class this genetic mutation has been classified into).
- training_text: a double-pipe (||) delimited file that contains the clinical evidence (text) used to classify genetic mutations. Fields are ID (the id of the row used to link the clinical evidence to the genetic mutation), Text (the clinical evidence used to classify the genetic mutation).
- test_variants: a comma-separated file containing the description of the genetic mutations used for testing. Fields are ID (the id of the row used to link the mutation to the clinical evidence), Gene (the gene where this genetic mutation is located), Variation (the amino acid change for this mutation).
- test_text: a double-pipe (||) delimited file that contains the clinical evidence (text) used to classify genetic mutations. Fields are ID (the id of the row used to link the clinical evidence to the genetic mutation), Text (the clinical evidence used to classify the genetic mutation).
- submissionSample: a sample submission file in the correct format.
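Because the text files use a two-character delimiter, `pd.read_csv` has to treat the separator as a regular expression. The sketch below parses a small in-memory sample instead of the real file. It assumes the first line of the file is a header that does not follow the `||` layout (here written as `ID,Text`), which is why that line is skipped and the column names are supplied explicitly; the sample rows and the name `demo_text` are illustrative.

```python
import io
import pandas as pd

# In-memory stand-in for training_text. The header line "ID,Text" is an assumption.
sample = io.StringIO(
    "ID,Text\n"
    "0||Cyclin-dependent kinases (CDKs) regulate ...\n"
    "1||Abstract Background Non-small cell lung cancer ...\n"
)

# r"\|\|" is a regex matching the literal "||"; regex separators need the python engine.
demo_text = pd.read_csv(
    sample,
    sep=r"\|\|",
    engine="python",
    skiprows=1,        # skip the header line, which is not "||" delimited
    header=None,
    names=["ID", "Text"],
    index_col="ID",
)
print(demo_text)
```

Passing `engine="python"` explicitly avoids the parser warning that pandas emits when it has to fall back from the C engine for a regex separator.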

# Set up 

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

pd.options.display.max_rows    = 99
pd.options.display.max_columns = 99
pd.options.display.width       = 80

In [3]:
pd.__version__

'1.3.4'

In [52]:
# read in training variants:
tn_var = pd.read_csv("training_variants", index_col='ID')

In [53]:
tn_var.head()

        Gene             Variation  Class
ID                                       
0     FAM58A  Truncating Mutations      1
1        CBL                 W802*      2
2        CBL                 Q249E      2
3        CBL                 N454D      3
4        CBL                 L399V      4


In [71]:
# display the training variants (info without parentheses shows the bound method's repr, not the summary):
tn_var.info

<bound method DataFrame.info of         Gene             Variation  Class
ID                                       
0     FAM58A  Truncating Mutations      1
1        CBL                 W802*      2
2        CBL                 Q249E      2
3        CBL                 N454D      3
4        CBL                 L399V      4
...      ...                   ...    ...
3316   RUNX1                 D171N      4
3317   RUNX1                 A122*      1
3318   RUNX1               Fusions      1
3319   RUNX1                  R80C      4
3320   RUNX1                  K83E      4

[3321 rows x 3 columns]>
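The cell above references `info` without calling it, so the notebook shows the bound method's repr (which embeds the whole frame) rather than the usual column summary. A minimal sketch of the called form, on a two-row frame built from the rows shown above (the names `demo`, `buf`, and `summary` are introduced here for illustration):

```python
import io
import pandas as pd

demo = pd.DataFrame(
    {
        "Gene": ["FAM58A", "CBL"],
        "Variation": ["Truncating Mutations", "W802*"],
        "Class": [1, 2],
    }
)

# info() writes its summary to a buffer (stdout by default) and returns None,
# so the text is captured in a StringIO to make it available as a string.
buf = io.StringIO()
demo.info(buf=buf)
summary = buf.getvalue()
print(summary)
```

The summary lists each column with its non-null count and dtype, which is usually the quickest way to spot missing values after loading.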

In [91]:
tn_var.Class.value_counts()

7    953
4    686
1    568
2    452
6    275
5    242
3     89
9     37
8     19
Name: Class, dtype: int64
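The counts above show a strongly imbalanced label distribution. As a quick sketch, the shares and the largest-to-smallest ratio can be computed directly from those numbers; the series below is typed in from the output above, and the names `class_counts`, `class_share`, and `imbalance_ratio` are introduced here for illustration.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
class_counts = pd.Series(
    {7: 953, 4: 686, 1: 568, 2: 452, 6: 275, 5: 242, 3: 89, 9: 37, 8: 19},
    name="Class",
)

total = class_counts.sum()                       # number of training rows
class_share = (class_counts / total).round(3)    # fraction of rows per class
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_share)
print(f"largest / smallest class: {imbalance_ratio:.1f}")
```

The largest class (7) is roughly 50 times the size of the smallest (8), so any train/validation split should probably be stratified by `Class`.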

In [67]:
# read in training text data ("||" is matched as a regex, which needs a raw string and the python engine):
tn_text = pd.read_csv("training_text", sep=r'\|\|', engine='python', skiprows=1, header=None, names=['ID', 'Text'], index_col='ID')


In [68]:
tn_text.head()

                                                 Text
ID                                                   
0   Cyclin-dependent kinases (CDKs) regulate a var...
1   Abstract Background Non-small cell lung canc...
2   Abstract Background Non-small cell lung canc...
3   Recent evidence has demonstrated that acquired...
4   Oncogenic mutations in the monomeric Casitas B...


In [72]:
tn_text.info

<bound method DataFrame.info of                                                    Text
ID                                                     
0     Cyclin-dependent kinases (CDKs) regulate a var...
1      Abstract Background  Non-small cell lung canc...
2      Abstract Background  Non-small cell lung canc...
3     Recent evidence has demonstrated that acquired...
4     Oncogenic mutations in the monomeric Casitas B...
...                                                 ...
3316  Introduction  Myelodysplastic syndromes (MDS) ...
3317  Introduction  Myelodysplastic syndromes (MDS) ...
3318  The Runt-related transcription factor 1 gene (...
3319  The RUNX1/AML1 gene is the most frequent targe...
3320  The most frequent mutations associated with le...

[3321 rows x 1 columns]>

In [83]:
tn_var.iloc[3320]

Gene         RUNX1
Variation     K83E
Class            4
Name: 3320, dtype: object

In [99]:
tn_text.Text[3320]

"The most frequent mutations associated with leukemia are recurrent somatic chromosomal translocations or inversions, many of which involve the polyomavirus enhancer-binding protein or core-binding factor transcriptional regulation complex (PEBP2/CBF). Several translocations involve the α subunit of this complex, the RUNX1 gene (also called AML1, CBFα2, or PEBP2αB) on chromosome 21q22.1 (t(8;21), t(3;21), and t(12;21)). Additionally, the β subunit of the complex, PEBP2β also called CBFβ, is disrupted in inv(16)(p13;q22).1 An abundance of evidence points to the existence of genes that predispose to hematologic malignancies. However, large multiple-generation families with hematologic malignancies alone are rare.2 Only 2 loci for familial hematologic malignancies have been identified to date, 1 on chromosome 21q22.13 and the other on 16q22.4 5 These loci contain RUNX1 andPEBP2β/CBFβ, respectively.Studies of families that demonstrate single-gene inheritance for leukemia predisposition sho