how to train a model #7

Closed
ShangjinTan opened this issue Mar 18, 2019 · 13 comments

Comments

@ShangjinTan

ShangjinTan commented Mar 18, 2019

Hi Deepsignal,

I am impressed by the high sensitivity and accuracy of deepsignal in calling methylation sites. I would very much like to try it in my study. Here I have a few questions.

  1. Deepsignal only provides a human CpG model. I want to extract all methylation motifs (not only CpG) of all methylation types (6mA, 5mC, 4mC) from microorganisms, so it seems I have to train a custom model. Am I right?

  2. deepsignal extract can extract features for training. Could you please explain a little bit about what exactly is extracted?

  3. I have tried deepsignal extract on the example yeast data. The methy_label of every position is '1'. Does '1' mean that the position will be used for training? What does '1' mean?

  4. If the result of deepsignal extract is used for training a model, how can deepsignal know which base is methylated?

  5. deepsignal extracts selected motifs with the same mod_loc. I want to extract all types of motifs (probably with different mod_loc values), including novel motifs. Does this mean that deepsignal extract is not applicable to my case?

  6. For training a model, if the input is a pool of all methylation types, is there a requirement on the number of samples of each type, or of each specific motif of a type?

  7. Could you please give some advice on how to prepare the files for training a model?

Thank you so much.
Shangjin

@PengNi
Collaborator

PengNi commented Mar 20, 2019

Hi @ShangjinTan ,

Thanks for your interest.

  1. Currently different motifs/methylation types need different deepsignal models. A custom model is needed for a non-CpG methylation type.

  2. The extraction module extracts five kinds of features (one for the CNN and four for the RNN) for deepsignal. Each output line represents one sample for training/testing. The output format is described in the README, and more details are in the preprint manuscript.

  3. The methy_label takes one of two values: 0 for unmethylated, 1 for methylated.

  4. See answer 3.

  5. Any motif sequence that follows the IUPAC alphabet can be trained (check the --motifs and --mod_loc options). However, deepsignal cannot guarantee high performance for multiple methylation types (or even multiple motifs) with a single model. So far we have tested models for CpG, GATC (6mA), and CCWGG (5mC) separately.

  6. See answer 5.

  7. To train a model, both methylated and unmethylated samples from reads are necessary. The samples can be chosen from methylase-treated/PCR-amplified data, or selected based on bisulfite sequencing or another sequencing technique.

    The chosen samples can then be shuffled and split into training and validation datasets. According to our experiments, a model can be trained to achieve high performance with at most 20 million samples for training and at least 10k samples for validation (half positive samples, half negative samples).
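The shuffle-and-split step in answer 7 can be sketched as follows. This is only an illustration, not a script from the deepsignal repository: the function name `balance_shuffle_split` and its parameters are hypothetical, and it assumes the methylated and unmethylated feature lines have already been loaded into two Python lists.

```python
import random


def balance_shuffle_split(pos_samples, neg_samples, n_valid, seed=0):
    """Return (train, valid) lists with equal numbers of positives and negatives.

    pos_samples / neg_samples: feature lines already labeled 1 / 0.
    n_valid: total number of validation samples (must be even).
    """
    if n_valid % 2 != 0:
        raise ValueError("n_valid must be even to keep the 50/50 balance")
    rng = random.Random(seed)
    # Downsample the larger class so both classes have the same size.
    n_each = min(len(pos_samples), len(neg_samples))
    half_valid = n_valid // 2
    if half_valid >= n_each:
        raise ValueError("not enough samples for the requested validation size")
    # rng.sample returns items in random order, so slicing below is unbiased.
    pos = rng.sample(pos_samples, n_each)
    neg = rng.sample(neg_samples, n_each)
    valid = pos[:half_valid] + neg[:half_valid]
    train = pos[half_valid:] + neg[half_valid:]
    # Mix the two classes so neither set is ordered by label.
    rng.shuffle(valid)
    rng.shuffle(train)
    return train, valid
```

Downsampling the larger class is one simple way to reach the half-positive, half-negative balance mentioned above; other balancing strategies would work as well.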

Some of the scripts in /scripts may be useful. Feel free to email nipeng at csu.edu.cn for more details or scripts.
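To illustrate how an IUPAC motif and a modification offset (answer 5) select candidate sites, here is a minimal standalone sketch. It is not deepsignal's own code: `motif_sites` is a hypothetical helper, and it assumes `mod_loc` is a 0-based offset of the targeted base within the motif.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_sites(seq, motif, mod_loc):
    """Return 0-based positions of the targeted base for every motif hit in seq."""
    if not 0 <= mod_loc < len(motif):
        raise ValueError("mod_loc must index a base inside the motif")
    pattern = "".join(IUPAC[base] for base in motif.upper())
    # The lookahead lets overlapping hits be reported as well.
    regex = re.compile("(?=(" + pattern + "))")
    return [m.start() + mod_loc for m in regex.finditer(seq.upper())]
```

For example, `motif_sites("ACCAGGTCCTGGA", "CCWGG", 1)` reports the second C of each CCWGG hit, and `motif_sites(seq, "GATC", 1)` reports the A of each GATC hit. Only the forward strand is scanned in this sketch.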

Best,
Peng

@PengNi PengNi closed this as completed Mar 22, 2019
@ardakdemir

I am interested in training my own model.
Would it be possible for you to share with the community the datasets you have used for training?
Or any reference to a database containing a dataset that can be used for training deepsignal (raw nanopore signals and methylation labels for each read) would also be very much appreciated!

Thanks in advance!

@PengNi
Collaborator

PengNi commented Jul 16, 2019

Hi @ardakdemir ,

  1. First you can check out nanopolish. The data (PRJEB13021) contains R9 reads of E.coli and Human NA12878. The reads are either totally methylated or totally unmethylated for 5mC.

  2. The dataset from signalAlign contains reads for 6mA in GATC.

  3. We also used 30x R9.4 reads of human NA12878 (PRJEB23027) from this work (nbt.4060). We obtained the high-confidence 5mC positions from bisulfite sequencing (ENCFF835NTC).
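Selecting high-confidence sites from bisulfite data, as in item 3, might look like the sketch below. The thread does not say which thresholds were used, so `min_cov`, `meth_pct`, and `unmeth_pct` are purely illustrative assumptions, as is the function name. The column layout assumed here is the ENCODE bedMethyl convention (coverage in column 10, percent methylated in column 11).

```python
def high_confidence_sites(bed_lines, min_cov=5, meth_pct=100.0, unmeth_pct=0.0):
    """Split bedMethyl records into methylated and unmethylated site lists.

    Returns two lists of (chrom, start, strand) tuples.
    Threshold defaults are illustrative assumptions, not values from the thread.
    """
    methylated, unmethylated = [], []
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track")):
            continue  # skip blank lines and header lines
        fields = line.split()
        chrom, start, strand = fields[0], int(fields[1]), fields[5]
        coverage, pct = int(fields[9]), float(fields[10])
        if coverage < min_cov:
            continue  # too few bisulfite reads to trust the call
        if pct >= meth_pct:
            methylated.append((chrom, start, strand))
        elif pct <= unmeth_pct:
            unmethylated.append((chrom, start, strand))
    return methylated, unmethylated
```

Sites with intermediate methylation levels are dropped by this sketch, which keeps only positions that can be labeled 1 or 0 with little ambiguity.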

Best,
Peng

@ardakdemir

Thanks a lot for the suggestions!

Best

Arda

@ardakdemir

Dear @PengNi

"First you can check out nanopolish. The data (PRJEB13021) contains R9 reads of E.coli and Human NA12878. The reads are either totally methylated or totally unmethylated for 5mC."

The dataset you mentioned above contains many files. Which ones did you use for training? And how should I infer whether the reads are methylated or unmethylated? Is the information contained inside the fast5 files?

@PengNi
Collaborator

PengNi commented Oct 12, 2019

Hi @ardakdemir ,

We used the E. coli R9 reads for training and testing. You can tell the file types apart by their filenames: a filename containing "pcr" means the reads are unmethylated, while "pcr_MSssI" means the reads are methylated. You can check their paper to confirm.
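The filename rule above can be written as a small helper. This is a sketch under the assumption that the tags appear in the file names exactly as spelled in this thread (case-sensitive); `label_from_filename` and the example file names are hypothetical.

```python
import os


def label_from_filename(path):
    """Map a file name to a label: 1 methylated, 0 unmethylated, None unknown."""
    name = os.path.basename(path)
    # Test the longer tag first: every "pcr_MSssI" name also contains "pcr".
    if "pcr_MSssI" in name:
        return 1
    if "pcr" in name:
        return 0
    return None
```

For example, `label_from_filename("ecoli_pcr_MSssI_run1.tar")` gives 1 and `label_from_filename("ecoli_pcr_run1.tar")` gives 0. The order of the two checks matters, since checking "pcr" first would label the methylated files as unmethylated.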

Best,
Peng

@ardakdemir

Thanks a lot!

@ardakdemir

ardakdemir commented Oct 12, 2019

How can I obtain the same reference you used for mapping the fast5 files for:

E. coli K12 ER2925

I could not find any reference for ER2925

@PengNi
Collaborator

PengNi commented Oct 12, 2019

I used this reference: ftp://ftp.ensemblgenomes.org/pub/release-29/bacteria//fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/dna/Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.2.29.dna.genome.fa.gz

@ardakdemir

ardakdemir commented Oct 12, 2019 via email

@PengNi
Collaborator

PengNi commented Oct 12, 2019

tombo only supports R9.4+ reads. If you want to process the E.coli R9 2D reads, you can use nanoraw.

Also, I suggest you use R9.4 reads (for example human NA12878 (PRJEB23027)) for your experiments too. Nanopore may no longer use R9 2D flowcells.

@ardakdemir

ardakdemir commented Oct 12, 2019 via email

@PengNi
Collaborator

PengNi commented Oct 12, 2019

Emm, in my opinion, it makes no sense to call methylation without a reference. We always need to align reads to a genome to do some analysis.
