how to train a model #7

ShangjinTan · 2019-03-18T08:55:30Z

Hi Deepsignal,

I am impressed by the high sensitivity and accuracy of deepsignal in calling methylation sites. I would very much like to try it in my study. Here I have a few questions.

1. Deepsignal only provides a human CpG model. I want to extract all methylation motifs (not only CpG) of all methylation types (6mA, 5mC, 4mC) from microorganisms. So it seems I have to train a custom model. Am I right?
2. deepsignal extract can extract features for training. Could you please explain a little about what exactly is extracted?
3. I have tried deepsignal extract on the example yeast data. The methy_label of every position is '1'. Does '1' mean that the position will be used for training? What does '1' mean?
4. If the result of deepsignal extract is used for training a model, how does deepsignal know which base is methylated?
5. deepsignal extract extracts selected motifs with the same mod_loc. I want to extract all types of motifs (probably with different mod_loc), including novel motifs. Does this mean that deepsignal extract is not applicable to my case?
6. For training a model, if the input is a pool of all methylation types, is there a requirement on the number of samples per type, or per specific motif of a type?
7. Could you please give some advice on how to prepare the files for training a model?

Thank you so much.
Shangjin

PengNi · 2019-03-20T02:29:14Z

Hi @ShangjinTan ,

Thanks for your interest.

1. Currently, different motifs/methylation types need different deepsignal models. A custom model is needed for a non-CpG methylation type.
2. The extraction module extracts five kinds of features (one for the CNN and four for the RNN) for deepsignal. One line represents one sample for training/testing. The details of the output format are in the README, and more details are in the preprint manuscript.
3. The methy_label has two possible values, [0, 1]: 0 represents unmethylated, 1 represents methylated.
4. Same as 3.
5. Any motif sequence that follows the IUPAC alphabet can be trained (check the --motifs and --mod_loc options). However, deepsignal cannot guarantee high performance for multiple methylation types (or even multiple motifs) with a single model. So far we have tested models for CpG, GATC (6mA), and CCWGG (5mC) separately.
6. Same as 5.
7. To train a model, both methylated and unmethylated samples from reads are necessary. The samples can be chosen either from methylase-treated/PCR-amplified data, or based on bisulfite sequencing or another sequencing technique.
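
To make the IUPAC motif and mod_loc idea above more concrete, here is a minimal Python sketch. It is not deepsignal's own code; the function names are hypothetical, and it assumes mod_loc is a 0-based offset of the targeted base inside the motif.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif such as CCWGG into a compiled regex."""
    body = "".join(IUPAC[base] for base in motif.upper())
    # The lookahead lets overlapping occurrences be reported as well.
    return re.compile("(?=(" + body + "))")


def find_mod_sites(seq, motif, mod_loc):
    """Return 0-based positions of the targeted base for every motif hit.

    mod_loc is assumed to be the 0-based offset of the targeted base
    within the motif (an assumption of this sketch).
    """
    pattern = motif_to_regex(motif)
    return [m.start() + mod_loc for m in pattern.finditer(seq.upper())]
```

For example, `find_mod_sites("GATCGATC", "GATC", 1)` reports the A of each GATC on the given strand. The reverse strand would need the same scan on the reverse complement.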

The chosen samples can then be shuffled and split into training and validation datasets. According to our experiments, a model can be trained to achieve high performance with at most 20M samples for training and at least 10k samples for validation (half positive samples, half negative samples).
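
The shuffle-and-split step could look roughly like the sketch below. It is only an illustration under stated assumptions: each sample is one tab-separated line whose last column is the methy_label ("0" or "1"), and `balanced_split` is a hypothetical helper, not a script shipped with deepsignal.

```python
import random


def balanced_split(lines, n_valid, seed=0):
    """Shuffle samples and hold out a validation set with equal class counts.

    Assumes each line is tab-separated with the label in the last column
    ("0" = unmethylated, "1" = methylated).
    """
    rng = random.Random(seed)
    pos = [l for l in lines if l.rstrip("\n").split("\t")[-1] == "1"]
    neg = [l for l in lines if l.rstrip("\n").split("\t")[-1] == "0"]
    rng.shuffle(pos)
    rng.shuffle(neg)

    half = n_valid // 2
    if half > len(pos) or half > len(neg):
        raise ValueError("not enough samples of one class for validation")

    # Validation gets half positives and half negatives; the rest is training.
    valid = pos[:half] + neg[:half]
    train = pos[half:] + neg[half:]
    rng.shuffle(valid)
    rng.shuffle(train)
    return train, valid
```

In this sketch only the validation set is forced to be balanced. The training portion keeps whatever class ratio remains, so downsample the larger class first if you want it balanced too.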

Some scripts in /scripts may be useful. Feel free to ask for more details or scripts by email: nipeng at csu.edu.cn.

Best,
Peng

ardakdemir · 2019-07-16T07:16:40Z

I am interested in training my own model.
Would it be possible for you to share with the community the datasets you have used for training?
Or any reference to a database containing a dataset that can be used for training deepsignal (raw nanopore signals and methylation labels for each read) would also be very much appreciated!

Thanks in advance!

PengNi · 2019-07-16T08:52:46Z

Hi @ardakdemir ,

First, you can check out nanopolish. Its data (PRJEB13021) contains R9 reads of E. coli and human NA12878. The reads are either fully methylated or fully unmethylated for 5mC.
The dataset from signalAlign contains reads for 6mA in GATC.
We also used 30x R9.4 reads of human NA12878 (PRJEB23027) from this work (nbt.4060). We obtained the high-confidence 5mC positions from bisulfite sequencing (ENCFF835NTC).
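
As a rough illustration of picking high-confidence positions from bisulfite data, the sketch below filters bedMethyl-style rows. Everything specific here is an assumption rather than the procedure we actually used: it assumes column 10 is read coverage and column 11 is percent methylation, and the thresholds (`min_cov`, `meth_pct`, `unmeth_pct`) are hypothetical defaults you should choose yourself.

```python
def high_confidence_sites(bed_lines, min_cov=5, meth_pct=100.0, unmeth_pct=0.0):
    """Split bedMethyl-style rows into methylated / unmethylated site sets.

    Assumed layout (not confirmed for any particular file): column 1 chrom,
    column 2 start, column 6 strand, column 10 coverage, column 11 percent
    methylation. Thresholds are illustrative only.
    """
    methylated, unmethylated = set(), set()
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track")):
            continue
        fields = line.split()
        chrom, start, strand = fields[0], int(fields[1]), fields[5]
        cov, pct = int(fields[9]), float(fields[10])
        if cov < min_cov:
            continue  # too few bisulfite reads to trust this site
        if pct >= meth_pct:
            methylated.add((chrom, start, strand))
        elif pct <= unmeth_pct:
            unmethylated.add((chrom, start, strand))
    return methylated, unmethylated
```

Sites with intermediate methylation levels are simply dropped, so that only clearly methylated or clearly unmethylated positions end up as labels.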

Best,
Peng

ardakdemir · 2019-07-16T11:07:47Z

Thanks a lot for the suggestions!

Best

Arda

ardakdemir · 2019-10-12T02:01:36Z

Dear @PengNi

"First you can check out nanopolish. The data (PRJEB13021) contains R9 reads of E.coli and Human NA12878. The reads are either totally methylated or totally unmethylated for 5mC."

The dataset you mentioned above contains many files. Which ones did you use for training? And how should I infer whether the reads are methylated or unmethylated? Is the information contained inside the fast5 files?

PengNi · 2019-10-12T02:50:46Z

Hi @ardakdemir ,

We used E. coli R9 reads for training and testing. You can recognize the type of each file by its filename: a filename containing "pcr" means the reads are unmethylated, and one containing "pcr_MSssI" means the reads are methylated. You can read their paper to double-check.
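
If you want to assign labels automatically, a small sketch like the following may help. `label_from_filename` is a hypothetical helper, and matching case-insensitively is an assumption. Note that "pcr_MSssI" also contains "pcr", so the longer tag has to be checked first.

```python
import os


def label_from_filename(path):
    """Return 1 (methylated), 0 (unmethylated) or None (unknown)."""
    name = os.path.basename(path).lower()
    # Check the longer tag first: "pcr_MSssI" also contains "pcr".
    if "pcr_msssi" in name:
        return 1
    if "pcr" in name:
        return 0
    return None
```

The returned value can be used directly as the methy_label (0 or 1) for every read coming from that file.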

Best,
Peng

ardakdemir · 2019-10-12T03:34:25Z

Thanks a lot!

ardakdemir · 2019-10-12T11:12:55Z

How can I obtain the same reference you used for mapping the fast5 files for E. coli K12 ER2925?

I could not find any reference for ER2925.

PengNi · 2019-10-12T11:29:27Z

I used this reference: ftp://ftp.ensemblgenomes.org/pub/release-29/bacteria//fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/dna/Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.2.29.dna.genome.fa.gz

ardakdemir · 2019-10-12T12:00:19Z

Thanks! I also downloaded that, but tombo gives a "Poor raw to expected signal matching" error and suggests (revert with `tombo filter clear_filters`). Did you experience anything similar?

PengNi · 2019-10-12T12:08:29Z

tombo only supports R9.4+ reads. If you want to process the E. coli R9 2D reads, you can use nanoraw (https://github.com/marcus1487/nanoraw).

Also, I suggest you use R9.4 reads (maybe human NA12878 (PRJEB23027)) for experiments too. Nanopore may no longer use R9 2D flowcells.

ardakdemir · 2019-10-12T12:56:25Z

Thanks a lot for the information. I wonder how using the raw basecalls would affect the final performance at the read level. Do you think we can skip the resquiggle step and do the methylation calling directly from nanopore basecalls? We may not always have a reference for the resquiggle step.

PengNi · 2019-10-12T13:03:30Z

Emm, in my opinion, it makes no sense to call methylation without a reference. We always need to align reads to a genome to do some analysis.

PengNi closed this as completed Mar 22, 2019

Jerry-0591 mentioned this issue Jul 31, 2020

Question about train dataset #50

Closed

This was referenced Aug 31, 2020

Example training and validation data #53

Closed

The original FAST5 file looked for 6mA loci #55

Closed
