# MSclassifier: a flexible tool to create a classifier based on mutational signatures

Creating a *neural network* or any other kind of *classifier* requires a certain level of expertise. **MSclassifier offers an easy-to-use, hands-off approach to developing a classifier** based on the mutational profile of samples.

To train MSclassifier, the **only required input** is the **path to a folder containing the vcf or maf files** of all samples we wish to classify, and **two lists** containing the names of all known *positive* and *negative* samples, respectively.
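
As a minimal sketch of preparing these inputs, the snippet below writes the two ground truth lists into the folder that holds the variant files. The one-sample-name-per-line layout is an assumption (the expected file format is not spelled out here), and `write_sample_list` is a hypothetical helper, not part of MSclassifier. The sample names come from the dataset used in this example.

```python
from pathlib import Path
import tempfile


def write_sample_list(path, sample_names):
    """Write one sample name per line (assumed list format)."""
    Path(path).write_text("\n".join(sample_names) + "\n")


# A temporary folder stands in for the folder that holds the vcf/maf files.
vcf_folder = Path(tempfile.mkdtemp())
write_sample_list(vcf_folder / "Deficient.txt", ["SHGSOC022", "SHGSOC001"])
write_sample_list(vcf_folder / "Proficient.txt", ["AOCS_084", "AOCS_159"])

# Read one list back to check what was written.
deficient = (vcf_folder / "Deficient.txt").read_text().split()
```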

If we only wish to predict samples based on a **pretrained model**, this package also allows for such functionality and does not require any lists of ground truth samples (although they can be added in order to check the performance of the model).


In this example, we use the Semple group's High Grade Serous Ovarian Cancer (HGSOC) cohort to create a classifier that predicts Homologous Recombination (HR) deficiency.
As we go through the steps of this example, we explain further functionality of MSclassifier.

# Create a project

In order to get started we need to load the module MSclassifier, and create a class object with essential information about our project.

In [1]:
import MSclassifier

path='/home/elatorre/Desktop/HGSOC datasets/HGSOC VCF filtered/'
path_proficient = path+ 'Proficient.txt'
path_deficient = path + 'Deficient.txt'

HGSOC = MSclassifier.signature_classifier (vcf=path , 
                                          positive=path_deficient, 
                                          negative=path_proficient,
                                          project_name='HGSOC_HR')

We can now access the information of our project as attributes of HGSOC, the newly created MSclassifier class object:

In [2]:
f' Project {HGSOC.project_name} will use the following features to train a model: {HGSOC.feature_list}'

" Project HGSOC_HR will use the following features to train a model: ['SBS96', 'ID83', 'DBS78']"

## Load vcf files

We can now load the vcf files using the **.load_vcf()** method. This method calls SigProfilerMatrixGeneratorFunc to load the .vcf files and prepare them for signature extraction.

In [3]:
HGSOC.load_vcf()

The given input files do not appear to be in the correct vcf format. Skipping this file:  Proficient.txt
The given input files do not appear to be in the correct vcf format. Skipping this file:  Deficient.txt
Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 90.8 seconds.
Starting matrix generation for INDELs...Completed! Elapsed time: 42.83 seconds.
Matrices generated for 188 samples with 0 errors. Total of 3303977 SNVs, 23360 DINUCs, and 313090 INDELs were successfully analyzed.


**.load_vcf()** also creates a dataset with all the information about the input vcf files and ground truth samples. This dataset is stored in the **.data** attribute.

In [4]:
HGSOC.data

| | sample | class | Sample type | training |
|---|---|---|---|---|
| 0 | AOCS_084 | -1.0 | Proficient | 0.0 |
| 1 | AOCS_159 | -1.0 | Proficient | 0.0 |
| 2 | DO29146 | 0.0 | Unknown | 0.0 |
| 3 | SHGSOC050 | -1.0 | Proficient | 0.0 |
| 4 | SHGSOC022 | 1.0 | Deficient | 0.0 |
| ... | ... | ... | ... | ... |
| 181 | SHGSOC001 | 1.0 | Deficient | 0.0 |
| 182 | DO30650 | 0.0 | Unknown | 0.0 |
| 183 | AOCS_168 | 0.0 | Unknown | 0.0 |
| 184 | DO28412 | -1.0 | Proficient | 0.0 |
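
The *class* column above encodes the ground truth as 1 (positive, here Deficient), -1 (negative, here Proficient) and 0 (Unknown). The following is a small sketch of that encoding. `label_samples` is a hypothetical helper written for illustration and is not MSclassifier's own code.

```python
def label_samples(samples, positive, negative):
    """Return (sample, class, sample type) rows: 1 positive, -1 negative, 0 unknown."""
    positive, negative = set(positive), set(negative)
    rows = []
    for name in samples:
        if name in positive:
            rows.append((name, 1.0, "Deficient"))
        elif name in negative:
            rows.append((name, -1.0, "Proficient"))
        else:
            # Samples in neither list have no ground truth.
            rows.append((name, 0.0, "Unknown"))
    return rows


rows = label_samples(
    ["AOCS_084", "DO29146", "SHGSOC022"],
    positive=["SHGSOC022"],
    negative=["AOCS_084"],
)
```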


# Training a model

## Extracting *de novo* signatures
If we wish to train a new model, the first step is to extract a new set of mutational signatures for each feature in our feature_list.

We use the method **.signature_train()** to first split our dataset into a training and a test set, and then extract *de novo* signatures on the training set using SigProfiler extractor.
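
To make the splitting step concrete, here is a sketch of a class-balanced (stratified) split that flags each labelled sample as training (1) or test (0). The 70/30 fraction, the stratification and the fixed seed are all assumptions made for illustration. How **.signature_train()** actually splits the data is not described here, and `stratified_split` is a hypothetical helper.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Map each labelled sample to 1 (training) or 0 (test), class by class."""
    rng = random.Random(seed)
    training = {}
    for cls in sorted(set(labels.values())):
        members = sorted(name for name, c in labels.items() if c == cls)
        rng.shuffle(members)
        n_train = round(train_fraction * len(members))
        for i, name in enumerate(members):
            training[name] = 1 if i < n_train else 0
    return training


# Toy labels: ten positive (1) and ten negative (-1) samples.
labels = {f"def_{i}": 1 for i in range(10)}
labels.update({f"pro_{i}": -1 for i in range(10)})
training = stratified_split(labels)
```

Splitting class by class keeps the proportion of positive and negative samples the same in both sets.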

In [5]:
# Since the aim is not to reconstruct all samples with high accuracy, 
# we choose a low integer, end = 5, as the limit of de novo extracted mutational signatures.
HGSOC.signature_train(end=5)


************** Reported Current Memory Use: 0.25 GB *****************

Normalization Cutoff is : 13195
Extracting signature 1 for mutation type 96
process 1 continues please wait... 
process 1 continues please wait... 
process 1 continues please wait... 
execution time: 0 seconds 
process 1 continues please wait... 
process 1 continues please wait... 
process 1 continues please wait... 

execution time: 0 seconds 
process 1 continues please wait... 
execution time: 0 seconds 
execution time: 0 seconds 
execution time: 0 seconds 
process 1 continues please wait... 
execution time: 0 seconds 
execution time: 0 seconds 
execution time: 0 seconds 







Time taken to collect 8 iterations for 1 signatures is 0.37 seconds
Optimization time is 0.05347132682800293 seconds
The reconstruction error is 0.3272, average process stability is 1.0 and 
the minimum process stability is 1.0 for 1 signatures


Extracting signature 2 for mutation type 96
process 2 continues please wait... 
execution tim



 
Your Job Is Successfully Completed! Thank You For Using SigProfiler Extractor.
 

************** Reported Current Memory Use: 0.26 GB *****************

Normalization Cutoff is : 1038
Extracting signature 1 for mutation type INDEL
process 1 continues please wait... 
execution time: 0 seconds 

process 1 continues please wait... 
process 1 continues please wait... 
execution time: 0 seconds 
execution time: 0 seconds 

process 1 continues please wait... 

process 1 continues please wait... 
execution time: 0 seconds 

execution time: 0 seconds 
process 1 continues please wait... 

execution time: 0 seconds 

process 1 continues please wait... 
execution time: 0 seconds 

process 1 continues please wait... 
execution time: 1 seconds 

Time taken to collect 8 iterations for 1 signatures is 0.74 seconds
Optimization time is 0.13062238693237305 seconds
The reconstruction error is 0.5311, average process stability is 1.0 and 
the minimum process stability is 1.0 for 1 signatures


Extrac


Time taken to collect 8 iterations for 5 signatures is 6.53 seconds
Optimization time is 0.07741355895996094 seconds
The reconstruction error is 0.1015, average process stability is 0.64 and 
the minimum process stability is -0.3 for 5 signatures




 
Your Job Is Successfully Completed! Thank You For Using SigProfiler Extractor.
 

************** Reported Current Memory Use: 0.26 GB *****************

Normalization Cutoff is : 93
Extracting signature 1 for mutation type DINUC
process 1 continues please wait... 
process 1 continues please wait... 
process 1 continues please wait... 
process 1 continues please wait... 
process 1 continues please wait... 
execution time: 0 seconds 
execution time: 0 seconds 
execution time: 0 seconds 

execution time: 0 seconds 


process 1 continues please wait... 

execution time: 0 seconds 
process 1 continues please wait... 

execution time: 0 seconds 
execution time: 0 seconds 


process 1 continues please wait... 
execution time: 0 seconds 

Time 

execution time: 1 seconds 

process 5 continues please wait... 
execution time: 1 seconds 

process 5 continues please wait... 
execution time: 1 seconds 

process 5 continues please wait... 
execution time: 2 seconds 

Time taken to collect 8 iterations for 5 signatures is 2.49 seconds
Optimization time is 0.07621574401855469 seconds
The reconstruction error is 0.3327, average process stability is 0.24 and 
the minimum process stability is -0.06 for 5 signatures




 
Your Job Is Successfully Completed! Thank You For Using SigProfiler Extractor.
 


As mentioned, this is the first step towards training our model. We can now access the features and mutational signature profiles used in this model as attributes of the model.

In [3]:
print(HGSOC.model.features)

# each element of .model.signatures contains the signatures associated with each feature in feature_list
HGSOC.model.signatures[0] 

['SBS96_pro_1', 'SBS96_pro_2', 'SBS96_pro_3', 'SBS96_pro_4', 'SBS96_pro_5', 'SBS96_pro_6', 'SBS96_def_1', 'SBS96_def_2', 'SBS96_def_3', 'SBS96_def_4', 'ID83_pro_1', 'ID83_pro_2', 'ID83_pro_3', 'ID83_pro_4', 'ID83_pro_5', 'ID83_def_1', 'ID83_def_2', 'ID83_def_3', 'ID83_def_4', 'DBS78_pro_1', 'DBS78_pro_2', 'DBS78_def_1']


Unnamed: 0,MutationsType,SBS96_pro_1,SBS96_pro_2,SBS96_pro_3,SBS96_pro_4,SBS96_pro_5,SBS96_pro_6,SBS96_def_1,SBS96_def_2,SBS96_def_3,SBS96_def_4
0,A[C>A]A,0.012891,0.014074,0.035418,0.016570,0.020621,0.022471,0.026262,0.018367,0.009627,0.012937
1,A[C>A]C,0.008083,0.012209,0.028185,0.013828,0.008931,0.015942,0.023796,0.008670,0.008024,0.007279
2,A[C>A]G,0.000371,0.005163,0.005273,0.003408,0.004835,0.002125,0.003194,0.003277,0.003329,0.001538
3,A[C>A]T,0.005357,0.008736,0.030913,0.009730,0.018642,0.013256,0.021790,0.003675,0.015397,0.009522
4,A[C>G]A,0.015892,0.004059,0.013470,0.011232,0.017217,0.003567,0.011814,0.017011,0.021471,0.009817
...,...,...,...,...,...,...,...,...,...,...,...
91,T[T>C]T,0.021713,0.013618,0.008536,0.004923,0.009385,0.008414,0.011110,0.011944,0.014943,0.007514
92,T[T>G]A,0.008381,0.002205,0.003892,0.008598,0.005978,0.001955,0.004072,0.007553,0.007493,0.003619
93,T[T>G]C,0.007077,0.003274,0.003143,0.003449,0.001473,0.002151,0.003867,0.006168,0.005559,0.000719
94,T[T>G]G,0.009633,0.003177,0.008037,0.004275,0.003322,0.004083,0.006366,0.008478,0.009383,0.002966
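
The *MutationsType* column above lists the 96 single base substitution channels: six substitution classes, each flanked by one of four 5' bases and one of four 3' bases. As an illustration, the labels can be generated in the same order as the rows shown in the table (5' base first, then substitution, then 3' base). This ordering is read off the table, not taken from SigProfiler's code.

```python
from itertools import product

BASES = "ACGT"
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

# Order: 5' base, then substitution, then 3' base.
sbs96_labels = [
    f"{five}[{sub}]{three}"
    for five, sub, three in product(BASES, SUBSTITUTIONS, BASES)
]
```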


## Fitting the extracted signatures on all samples

These newly extracted signatures have been extracted only from the training set. Therefore, the next step is to fit these signatures on all samples in the dataset.
We do so through the **.signature_fit()** method.

In [6]:
HGSOC.signature_fit()

We can now check that our dataset has been updated.

In [7]:
HGSOC.data

Unnamed: 0,sample,class,Sample type,training,prediction,SVM prediction,SBS96_pro_1,SBS96_pro_2,SBS96_pro_3,SBS96_pro_4,...,ID83_pro_3,ID83_pro_4,ID83_pro_5,ID83_def_1,ID83_def_2,ID83_def_3,ID83_def_4,DBS78_pro_1,DBS78_pro_2,DBS78_def_1
0,AOCS_084,-1.0,Proficient,0,-0.690888,Proficient,1.697787,1.150595,1.192019,-0.354043,...,-0.658393,-0.600821,-0.179078,-0.658393,-0.648761,-0.399734,1.225963,-0.774997,-0.636970,1.411967
1,AOCS_159,-1.0,Proficient,0,-1.245736,Proficient,1.127304,2.023549,0.599706,0.139680,...,-0.993338,-0.709343,0.570821,-0.934397,0.089365,-0.603553,-0.631501,-1.226645,1.222836,0.003809
2,DO29146,0.0,Unknown,0,-1.179443,Proficient,-0.445669,2.187945,1.232186,0.061045,...,0.069129,1.653104,-0.184718,-0.691333,-0.698727,-0.698727,-0.698727,-1.379841,0.421538,0.958303
3,SHGSOC050,-1.0,Proficient,1,-1.013028,Proficient,0.034584,1.149668,2.199172,-0.872373,...,0.524405,-0.745904,-0.584137,-0.758913,-0.758913,-0.758913,-0.225584,1.307106,-0.185999,-1.121106
4,SHGSOC022,1.0,Deficient,0,1.085344,Deficient,-0.685520,-0.685520,-0.015071,-0.685520,...,0.489635,-0.484880,-0.683908,2.129365,1.285420,-0.683908,-0.683908,-0.707107,-0.707107,1.414214
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181,SHGSOC001,1.0,Deficient,1,1.216777,Deficient,-0.287690,-0.501142,-0.602954,-0.604282,...,-0.594751,-0.365527,-0.594751,0.634003,-0.081085,2.619184,-0.594751,-0.707107,-0.707107,1.414214
182,DO30650,0.0,Unknown,0,-1.032635,Proficient,1.049832,2.361886,0.809569,-0.700275,...,-0.777625,-0.462113,1.613345,-0.391910,-0.184605,-0.269014,-0.777625,-1.340242,1.061034,0.279209
183,AOCS_168,0.0,Unknown,0,0.009468,Deficient,-0.004540,0.682050,0.866545,-0.301971,...,-1.032636,-0.650992,-0.333889,1.730972,-0.768225,-1.032636,1.130214,1.354107,-0.323788,-1.030319
184,DO28412,-1.0,Proficient,1,-1.022459,Proficient,1.780988,1.745248,0.641153,-0.634612,...,1.070283,2.188663,-0.754452,-0.754452,-0.754452,-0.754452,0.464642,-0.707107,1.414214,-0.707107
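
The signature columns in this table look standardized per sample within each mutation type. For example, the three DBS78 values of SHGSOC022 are -0.707107, -0.707107 and 1.414214, which is what a z-score yields when only one of three exposures is non-zero. Assuming this is the normalization applied (an inference from the numbers, not something stated above), it can be sketched as follows.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# One non-zero exposure out of three, as in the DBS78 columns of SHGSOC022.
scaled = [round(z, 6) for z in zscore([0.0, 0.0, 5.0])]
# approximately [-0.707107, -0.707107, 1.414214]
```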


## Training the classifier

We acknowledge that *biology is rarely binary*. That is why, instead of training a standard classifier, MSclassifier trains a **regression model** and looks for a **margin maximizer** that optimally classifies samples. Effectively, MSclassifier outputs both a continuous classification scale and a tentative binary prediction. This approach allows the user to decide whether MSclassifier's binary classification is correct, or whether a different threshold might suit their needs better.

We now have the data ready for input into a regression model. The method **.model_fit()** searches a predefined grid of possible neural networks for the one that returns the best result and saves it as the regressor model. This method then uses a standard linear SVM to determine the margin maximizer that will be used for classification purposes.
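
To illustrate what a margin maximizer does on a one-dimensional score, consider the simplest case in which the two classes are perfectly separable. There the maximum-margin threshold is the midpoint between the highest negative score and the lowest positive score. The sketch below covers only that separable case with toy scores. It is not MSclassifier's implementation, and a soft-margin linear SVM may place the threshold elsewhere when classes overlap.

```python
def max_margin_threshold(scores, labels):
    """Midpoint between the closest opposite-class scores (separable 1-D case)."""
    highest_negative = max(s for s, y in zip(scores, labels) if y < 0)
    lowest_positive = min(s for s, y in zip(scores, labels) if y > 0)
    if highest_negative >= lowest_positive:
        raise ValueError("scores are not separable by a single threshold")
    return (highest_negative + lowest_positive) / 2


# Toy regression outputs with their ground truth classes.
scores = [-1.2, -0.9, -0.4, 0.6, 1.0, 1.1]
labels = [-1, -1, -1, 1, 1, 1]
threshold = max_margin_threshold(scores, labels)
predicted = ["Deficient" if s > threshold else "Proficient" for s in scores]
```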

We will explain in a later section **how to use your own regression model** of choice rather than the default one.


In [8]:
HGSOC.model_fit()

Best: -0.141703 using {'activation': 'logistic', 'alpha': 1e-16, 'hidden_layer_sizes': 22, 'solver': 'adam'}
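
The parameter names in the line above hint at how the grid is organised: every combination of the candidate values is one network to evaluate. The sketch below enumerates such a grid. Only the parameter names and the values reported in the "Best:" lines of this example come from the output. The remaining candidate values are hypothetical, since the actual grid is not shown.

```python
from itertools import product

# Hypothetical candidate values; only the parameter names and the values
# reported in the "Best:" lines of this example come from the output.
param_grid = {
    "activation": ["logistic", "tanh"],
    "alpha": [1e-16, 1e-4],
    "hidden_layer_sizes": [5, 22],
    "solver": ["adam", "lbfgs"],
}

# One dictionary per combination of parameter values.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
```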


We can check the predictions of our model on the test set through the **.test_check()** method.

In [10]:
HGSOC.test_check()

We can also see which features drive this model through the **.model.importances** attribute.

In [11]:
HGSOC.model.importances

| | importance |
|---|---|
| ID83_pro_1 | 0.055704 |
| ID83_def_1 | 0.0481 |
| SBS96_pro_2 | 0.0432 |
| ID83_pro_2 | 0.039632 |
| SBS96_pro_1 | 0.031938 |
| SBS96_def_3 | 0.027352 |
| ID83_def_3 | 0.025728 |
| ID83_def_2 | 0.021386 |
| SBS96_def_1 | 0.017168 |
| SBS96_pro_6 | 0.011792 |


One of the parameters that we can tweak in a model is the set of features used as inputs. For example, we can select only the top 5 features in terms of importance.

In [16]:
# .model.importances is sorted by importance, so the first five index labels are the top features
top_features = list(HGSOC.model.importances.index[:5])
top_features

['ID83_pro_1', 'ID83_def_1', 'SBS96_pro_2', 'ID83_pro_2', 'SBS96_pro_1']

In [17]:
HGSOC.model.features=top_features
HGSOC.model_fit()
HGSOC.test_check()

Best: -0.118920 using {'activation': 'tanh', 'alpha': 1e-16, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}


## Predicting samples classification

Once we are happy with a chosen model, we can predict the outcome of our model on the whole dataset using the **.model_predict()** method, and view the result through the **.plot** attribute.

In [18]:
HGSOC.model_predict()
HGSOC.plot.show()

New columns have been added to the dataset: **prediction** holds the continuous score, and **SVM prediction** holds the binary prediction based on the margin maximizer threshold.

In [19]:
HGSOC.data

Unnamed: 0,sample,class,Sample type,training,SBS96_pro_1,SBS96_pro_2,SBS96_pro_3,SBS96_pro_4,SBS96_pro_5,SBS96_pro_6,...,ID83_pro_5,ID83_def_1,ID83_def_2,ID83_def_3,ID83_def_4,DBS78_pro_1,DBS78_pro_2,DBS78_def_1,prediction,SVM prediction
0,AOCS_084,-1.0,Proficient,0,1.697787,1.150595,1.192019,-0.354043,-0.160285,-1.106252,...,-0.179078,-0.658393,-0.648761,-0.399734,1.225963,-0.774997,-0.636970,1.411967,-1.000236,Proficient
1,AOCS_159,-1.0,Proficient,0,1.127304,2.023549,0.599706,0.139680,0.216260,-0.935028,...,0.570821,-0.934397,0.089365,-0.603553,-0.631501,-1.226645,1.222836,0.003809,-1.000236,Proficient
2,DO29146,0.0,Unknown,0,-0.445669,2.187945,1.232186,0.061045,0.375905,-0.664314,...,-0.184718,-0.691333,-0.698727,-0.698727,-0.698727,-1.379841,0.421538,0.958303,-1.000374,Proficient
3,SHGSOC050,-1.0,Proficient,1,0.034584,1.149668,2.199172,-0.872373,-0.149508,-0.563857,...,-0.584137,-0.758913,-0.758913,-0.758913,-0.225584,1.307106,-0.185999,-1.121106,-1.000236,Proficient
4,SHGSOC022,1.0,Deficient,0,-0.685520,-0.685520,-0.015071,-0.685520,-0.338512,-0.124855,...,-0.683908,2.129365,1.285420,-0.683908,-0.683908,-0.707107,-0.707107,1.414214,1.000227,Deficient
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181,SHGSOC001,1.0,Deficient,1,-0.287690,-0.501142,-0.602954,-0.604282,-0.593936,-0.375127,...,-0.594751,0.634003,-0.081085,2.619184,-0.594751,-0.707107,-0.707107,1.414214,0.999797,Deficient
182,DO30650,0.0,Unknown,0,1.049832,2.361886,0.809569,-0.700275,-0.394999,-0.614754,...,1.613345,-0.391910,-0.184605,-0.269014,-0.777625,-1.340242,1.061034,0.279209,-1.000373,Proficient
183,AOCS_168,0.0,Unknown,0,-0.004540,0.682050,0.866545,-0.301971,-0.403636,-1.409465,...,-0.333889,1.730972,-0.768225,-1.032636,1.130214,1.354107,-0.323788,-1.030319,0.910942,Deficient
184,DO28412,-1.0,Proficient,1,1.780988,1.745248,0.641153,-0.634612,-0.573573,-0.809225,...,-0.754452,-0.754452,-0.754452,-0.754452,0.464642,-0.707107,1.414214,-0.707107,-1.000202,Proficient
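
Because the continuous *prediction* is kept next to the binary call, we can re-threshold it ourselves when a stricter or looser cut-off suits the application better. The following sketch uses a few scores copied from the table above. The threshold values 0.0 and 0.95 are arbitrary choices for illustration (not the threshold MSclassifier found), and `classify` is a hypothetical helper.

```python
# (sample, continuous prediction) pairs copied from the table above.
predictions = [
    ("AOCS_084", -1.000236),
    ("SHGSOC022", 1.000227),
    ("AOCS_168", 0.910942),
    ("DO30650", -1.000373),
]


def classify(score, threshold=0.0):
    """Binary call from a continuous score; the threshold value is an assumption."""
    return "Deficient" if score > threshold else "Proficient"


default_calls = {name: classify(score) for name, score in predictions}
strict_calls = {name: classify(score, threshold=0.95) for name, score in predictions}
```

With the stricter threshold, AOCS_168 (score 0.910942) is no longer called Deficient, while the other three calls stay the same.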


Further, since this is a binary classification, the confusion matrix of our problem is stored in the **.confusion_matrix** attribute and the accuracy in the **.accuracy** attribute.

In [20]:
print(HGSOC.confusion_matrix)
f'This model has an accuracy of {HGSOC.accuracy}.'

              precision    recall  f1-score   support

  Proficient       0.98      0.98      0.98        42
   Deficient       0.97      0.97      0.97        40

    accuracy                           0.98        82
   macro avg       0.98      0.98      0.98        82
weighted avg       0.98      0.98      0.98        82



'This model has an accuracy of 0.975609756097561.'
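
As a worked check of the report above, the rounded figures are consistent with 41 of the 42 Proficient samples and 39 of the 40 Deficient samples being classified correctly. These counts are inferred from the report, not read from the model output. From them, the metrics follow directly:

```python
# Counts consistent with the report above: 41 of 42 Proficient and
# 39 of 40 Deficient samples classified correctly (inferred, not read from the model).
tp_pro, fn_pro = 41, 1   # Proficient predicted as Proficient / as Deficient
tp_def, fn_def = 39, 1   # Deficient predicted as Deficient / as Proficient

accuracy = (tp_pro + tp_def) / (tp_pro + fn_pro + tp_def + fn_def)
recall_pro = tp_pro / (tp_pro + fn_pro)
recall_def = tp_def / (tp_def + fn_def)
# A misclassified sample of one class is a false positive for the other class.
precision_pro = tp_pro / (tp_pro + fn_def)
precision_def = tp_def / (tp_def + fn_pro)
```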

There is also a **.ROC_curve** attribute.

In [21]:
HGSOC.ROC_curve

# Exporting the project

The **.export()** method saves the project in the output folder within the VCF folder. Exported projects can be loaded back with pickle.

In [13]:
HGSOC.export()

import pickle
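# Open the exported project in a context manager so the file is closed after loading
with open(path + 'output/HGSOC_HR.p', 'rb') as handle:
    new_load = pickle.load(handle)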

# Using a trained model in a new dataset

MSclassifier also allows the user to take a pretrained model and use it to classify samples in a new dataset.

In this example we use HGSOC.model to predict HR deficiency in an unseen cohort.

In [22]:
# First we load the trained model
saved_model = HGSOC.model

path_unseen_cohort='/home/elatorre/Desktop/HGSOC datasets/VCF Canadian/'
path_deficient_unseen_cohort = path_unseen_cohort + 'Deficient.txt'

unseen_cohort = MSclassifier.signature_classifier (vcf=path_unseen_cohort , 
                                          positive=path_deficient_unseen_cohort,
                                          project_name='Unseen_Cohort_HR',
                                          model=saved_model # We use the saved model as input
                                          )

Notice that we only have ground truth information about HR deficient samples. This is not a problem as we do not want to train a new model.

**If a model is given** at the start of a project, **MSclassifier only requires the path to the vcf files** to predict their outcome with the given model.

## Predicting the outcome on an unseen cohort with no training set

In order to predict the outcome in a new cohort with no training set, we need to do the following:

    - .load_vcf()
    - .signature_fit() 
    - .model_predict() 

In [23]:
unseen_cohort.load_vcf()
unseen_cohort.signature_fit()
unseen_cohort.model_predict()
f' This model uses {unseen_cohort.model.features} as features '

The given input files do not appear to be in the correct vcf format. Skipping this file:  Deficient.txt
Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 18.7 seconds.
Starting matrix generation for INDELs...Completed! Elapsed time: 10.67 seconds.
Matrices generated for 60 samples with 0 errors. Total of 569691 SNVs, 6224 DINUCs, and 50558 INDELs were successfully analyzed.


" This model uses ['ID83_pro_1', 'ID83_def_1', 'SBS96_pro_2', 'ID83_pro_2', 'SBS96_pro_1'] as features "

In [24]:
unseen_cohort.plot.show()

As we can see, our HR predictor model has done an excellent job at predicting HR deficiency in a completely unseen cohort using just 5 features as inputs.

# Using your own model

As mentioned, finding the regressor model that is right for your purposes can be tricky, and the neural network that MSclassifier produces might not be the best choice.

In this example we use a third-party regressor as the classifier of the HGSOC project. First, we retrieve the X_train and y_train arrays needed to fit the model, using the **MSclassifier.model.train_set** function. Then we fit the regressor and pass it to the **.model_fit()** method, which stores it as HGSOC.model.classifier and finds the margin maximizer.

MSclassifier's .export() functionality is only retained if the third-party model is a scikit-learn model. MSclassifier also accepts TensorFlow models, but such a model needs to be exported separately with the TensorFlow .save method.


In this example we use a Keras (TensorFlow) model.

In [32]:
X_train, y_train = MSclassifier.model.train_set(HGSOC)

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2  # only l2 is used below

# Build a baseline regressor
regressor = Sequential()
regressor.add(Dropout(0.2, input_shape=(X_train.shape[1],)))
regressor.add(Dense(units=X_train.shape[1]+5,kernel_regularizer=l2(0.01)))
regressor.add(Dropout(0.2))
regressor.add(Dense(units=1, activation='tanh'))
regressor.summary()

regressor.compile(optimizer='rmsprop', loss='mean_squared_error',  metrics=['mae','accuracy'])

regressor.fit(X_train, y_train, validation_split=0.2, batch_size=20, epochs=90,verbose=0)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout (Dropout)            (None, 5)                 0         
_________________________________________________________________
dense (Dense)                (None, 10)                60        
_________________________________________________________________
dropout_1 (Dropout)          (None, 10)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 11        
Total params: 71
Trainable params: 71
Non-trainable params: 0
_________________________________________________________________


<tensorflow.python.keras.callbacks.History at 0x7efd70452910>

Then we use the trained regressor as the input of the **.model_fit()** method. The next steps are as usual.

In [33]:
# We pass the fitted regressor to .model_fit()
HGSOC.model_fit(regressor)
HGSOC.test_check()
HGSOC.model_predict()
HGSOC.plot.show()
print(HGSOC.confusion_matrix)

              precision    recall  f1-score   support

  Proficient       0.98      0.98      0.98        42
   Deficient       0.97      0.97      0.97        40

    accuracy                           0.98        82
   macro avg       0.98      0.98      0.98        82
weighted avg       0.98      0.98      0.98        82



## Exporting and loading an MSclassifier object with a TensorFlow model

Since this is a TensorFlow model, we need to export the project in two steps:

In [34]:
# step 1: export the tensorflow model
path_to_model= path+'output/'+HGSOC.project_name + '_tfmodel'
HGSOC.model.classifier.save(path_to_model)
# step 2: export the MSclassifier object
HGSOC.model.classifier=None
HGSOC.export()

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: /home/elatorre/Desktop/HGSOC datasets/HGSOC VCF filtered/output/HGSOC_HR_tfmodel/assets


In order to load an MSclassifier object that was trained with a TensorFlow model, we also need to proceed in two steps.

In [35]:
import pickle
from tensorflow import keras

# import the model
model = keras.models.load_model(path_to_model)

# path already ends with '/', so no leading slash is needed here
path_to_MSclassifier = path + 'output/MSclassifier_alt.p'
with open(path_to_MSclassifier, 'rb') as handle:
    HGSOC = pickle.load(handle)
HGSOC.model.classifier=model

# We can now check that indeed we have loaded the model correctly
HGSOC.model.classifier

<tensorflow.python.keras.saving.saved_model.load.Sequential at 0x7efd184f2dd0>

# Training a model on synthetic exome data

Thanks to SigProfiler's exome filtering capability, we can easily train a model on synthetic exome data. This is achieved by simply setting exome=True when creating a project.

In [36]:
import MSclassifier

path='/home/elatorre/Desktop/HGSOC datasets/HGSOC VCF filtered/'
path_proficient = path+ 'Proficient.txt'
path_deficient = path + 'Deficient.txt'

HGSOC_exome = MSclassifier.signature_classifier (vcf=path , 
                                          positive=path_deficient, 
                                          negative=path_proficient,
                                          exome=True,
                                          project_name='HGSOC_exome')
HGSOC_exome.load_vcf()
HGSOC_exome.signature_train()
HGSOC_exome.signature_fit()
HGSOC_exome.model_fit()
HGSOC_exome.model_predict()
HGSOC_exome.plot.show()
HGSOC_exome.plot.write_image(path+"output/nn_exome.svg")

The given input files do not appear to be in the correct vcf format. Skipping this file:  Proficient.txt
The given input files do not appear to be in the correct vcf format. Skipping this file:  Deficient.txt
Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 114.94 seconds.
Starting matrix generation for INDELs...Completed! Elapsed time: 59.62 seconds.
Matrices generated for 188 samples with 1 errors. Total of 3303977 SNVs, 23360 DINUCs, and 313089 INDELs were successfully analyzed.

************** Reported Current Memory Use: 0.7 GB *****************

Normalization Cutoff is : 220
Extracting signature 1 for mutation type 96
process 1 continues please wait... 
process 1 continues please wait... 
process 1 continues please wait... 
execution time: 0 seconds 
execution time: 0 seconds 
process 1 continues please wait... 
execution time: 0 seconds 

execution time: 0 seconds 



process 1 continues please wait... 
execution time: 0 seconds 
process 1 continues plea



 
Your Job Is Successfully Completed! Thank You For Using SigProfiler Extractor.
 

************** Reported Current Memory Use: 0.7 GB *****************

Normalization Cutoff is : 52
Extracting signature 1 for mutation type INDEL
process 1 continues please wait... 
process 1 continues please wait... 
process 1 continues please wait... 
execution time: 0 seconds 

process 1 continues please wait... 
execution time: 0 seconds 
process 1 continues please wait... 
process 1 continues please wait... 
process 1 continues please wait... 
process 1 continues please wait... 
execution time: 0 seconds 
execution time: 0 seconds 
execution time: 0 seconds 


execution time: 0 seconds 
execution time: 0 seconds 

execution time: 0 seconds 




Time taken to collect 8 iterations for 1 signatures is 0.44 seconds
Optimization time is 0.14651226997375488 seconds
The reconstruction error is 0.5588, average process stability is 1.0 and 
the minimum process stability is 1.0 for 1 signatures


Extractin

Let us test this model on the unseen cohort.

In [7]:
# First we load the trained model
saved_model = HGSOC_exome.model

path_unseen_cohort='/home/elatorre/Desktop/HGSOC datasets/VCF Canadian/'
path_deficient_unseen_cohort = path_unseen_cohort + 'Deficient.txt'

unseen_cohort_exome = MSclassifier.signature_classifier (vcf=path_unseen_cohort , 
                                          positive=path_deficient_unseen_cohort,
                                          project_name='Unseen_Cohort_Exome_HR',
                                          exome=True,
                                          model=saved_model # We use the saved model as input
                                          )
                                         
unseen_cohort_exome.load_vcf()
unseen_cohort_exome.signature_fit()
unseen_cohort_exome.model_predict()
unseen_cohort_exome.plot.show()
unseen_cohort_exome.plot.write_image(path_unseen_cohort+"output/cc_exome.svg")

The given input files do not appear to be in the correct vcf format. Skipping this file:  Deficient.txt
Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 23.42 seconds.
Starting matrix generation for INDELs...Completed! Elapsed time: 10.55 seconds.
Matrices generated for 60 samples with 0 errors. Total of 569691 SNVs, 6224 DINUCs, and 50558 INDELs were successfully analyzed.


In [8]:
unseen_cohort_exome.plot