how to compare with the last the column for the same chromosome name? #1237

Closed
Fei-Guang opened this Issue Nov 3, 2016 · 7 comments

Comments

Projects
None yet
2 participants
@Fei-Guang

Fei-Guang commented Nov 3, 2016

$ samtools view rdd1.bam |head -1

 NB501244AR:119:HJY3WBGXY:3:11508:7857:8792      16      chr1    10002   0       40M     *       0       0 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC        ///E////6E////EEAEEE/EEEEEEEEEEEEAEAAA/A        XT:A:R  NM:i:0  X0:i:594        XM:i:0  XO:i:0  XG:i:0  MD:Z:40 

$ cat rdd2.kmer | head -10
chr1 1 0.42235 0.01501 10001 110000
chr1 2 0.41104 0.01254 60001 160000
chr1 3 0.42826 0.00282 110001 177417 227418 260000
chr1 4 0.4151 0.00288 160001 177417 227418 267719 317720 360000
chr1 5 0.39534 0.00166 260001 267719 317720 410000
chr1 6 0.42745 0.0008 360001 460000
chr1 7 0.47991 0.0041 410001 471368 521369 560000
chr1 8 0.44 0.00977 460001 471368 521369 610000
chr1 9 0.39048 0.00875 560001 660000
chr1 10 0.42579 0.00624 610001 710000

what i want to do is to compare the 4th column of rdd1.bam with the 5th column and last column of rdd2.kmer if both rdd has the same chromosome name(The third column of rdd1 should be same as the first column of rdd2)

if the 4th column of rdd1.bam in range [the 5th column of rdd2.kmer, the last column of rdd2.kmer]
the count[chr1] ++

how can i do it with adam0.20?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Nov 3, 2016

Member

Hi @Fei-Guang! How do you know which line of the second file (rdd2.kmer) corresponds to which read? If the first read in rdd1.bam should line up with the first line of rdd2.kmer, I'd do:

import org.bdgenomics.adam.rdd.ADAMContext._
val reads = sc.loadAlignments("rdd1.bam")
val lines = sc.textFile("rdd2.kmer")

// conceptually, we could just do reads.rdd.zip(lines), but!
// we aren't guaranteed that both RDDs have the same number
// of records in each partition, so zipWithIndex followed by join
// is (slower, but) safer
val zippedLinesAndReads = reads.rdd
  .zipWithIndex
  .map(_.swap)
  .join(lines.zipWithIndex.map(_.swap))

val countsByChromosome = zippedLinesAndReads.flatMap(kv => {
  val (_, (read, line)) = kv

  // get the range from the rdd2.kmer file
  val columns = line.split("\t[") // i assume this is tab delimited?
  val start = columns(4).toLong
  val end = columns(5).toLong

  // is the alignment start position between the start and end pos from the line?
  // if yes, emit the chromosome name and 1
  if (start <= read.getStart && read.getStart < end) {
    Some((read.getContigName, 1))
  } else {
    None
  }
}).reduceByKeyLocally(_ + _)
Member

fnothaft commented Nov 3, 2016

Hi @Fei-Guang! How do you know which line of the second file (rdd2.kmer) corresponds to which read? If the first read in rdd1.bam should line up with the first line of rdd2.kmer, I'd do:

import org.bdgenomics.adam.rdd.ADAMContext._
val reads = sc.loadAlignments("rdd1.bam")
val lines = sc.textFile("rdd2.kmer")

// conceptually, we could just do reads.rdd.zip(lines), but!
// we aren't guaranteed that both RDDs have the same number
// of records in each partition, so zipWithIndex followed by join
// is (slower, but) safer
val zippedLinesAndReads = reads.rdd
  .zipWithIndex
  .map(_.swap)
  .join(lines.zipWithIndex.map(_.swap))

val countsByChromosome = zippedLinesAndReads.flatMap(kv => {
  val (_, (read, line)) = kv

  // get the range from the rdd2.kmer file
  val columns = line.split("\t[") // i assume this is tab delimited?
  val start = columns(4).toLong
  val end = columns(5).toLong

  // is the alignment start position between the start and end pos from the line?
  // if yes, emit the chromosome name and 1
  if (start <= read.getStart && read.getStart < end) {
    Some((read.getContigName, 1))
  } else {
    None
  }
}).reduceByKeyLocally(_ + _)
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Nov 3, 2016

Member

What's the format of the rdd2.kmer file? If it is a typical feature format (e.g., BED, GTF), this code would be slightly cleaner.

Member

fnothaft commented Nov 3, 2016

What's the format of the rdd2.kmer file? If it is a typical feature format (e.g., BED, GTF), this code would be slightly cleaner.

@Fei-Guang

This comment has been minimized.

Show comment
Hide comment
@Fei-Guang

Fei-Guang Nov 4, 2016

hello, @fnothaft

Both rdd should have the same chromosome name, it means
The third column of rdd1 should be same as the first column of rdd2.

take the first line of rdd1 as an example, the chromosome name is chr1.

Many thanks for the help

Fei-Guang commented Nov 4, 2016

hello, @fnothaft

Both rdd should have the same chromosome name, it means
The third column of rdd1 should be same as the first column of rdd2.

take the first line of rdd1 as an example, the chromosome name is chr1.

Many thanks for the help

@Fei-Guang

This comment has been minimized.

Show comment
Hide comment
@Fei-Guang

Fei-Guang Nov 4, 2016

hello @fnothaft

the format of the rdd2.kmer is defined by our engineer, not in BED and GTF format

hello @fnothaft

the format of the rdd2.kmer is defined by our engineer, not in BED and GTF format

@Fei-Guang Fei-Guang changed the title from how to compare with the last the column? to how to compare with the last the column for the same chromosome name? Nov 4, 2016

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Nov 4, 2016

Member

Hi @Fei-Guang! Where are you running that? Are you running that in spark-shell? ADAM relies on a custom Kryo serializer Registrator for serialization. If you use ./bin/adam-shell, this starts a Spark shell where the serialization config (and classpath) are set up.

Member

fnothaft commented Nov 4, 2016

Hi @Fei-Guang! Where are you running that? Are you running that in spark-shell? ADAM relies on a custom Kryo serializer Registrator for serialization. If you use ./bin/adam-shell, this starts a Spark shell where the serialization config (and classpath) are set up.

@Fei-Guang

This comment has been minimized.

Show comment
Hide comment
@Fei-Guang

Fei-Guang Nov 4, 2016

hello @fnothaft

what should be changed in the code if the line of the second file (rdd2.kmer) corresponds the line of the rdd1.bam by the chromosome name?

Fei-Guang commented Nov 4, 2016

hello @fnothaft

what should be changed in the code if the line of the second file (rdd2.kmer) corresponds the line of the rdd1.bam by the chromosome name?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Nov 4, 2016

Member

@Fei-Guang I think you'd want:

import org.bdgenomics.adam.rdd.ADAMContext._
val reads = sc.loadAlignments("rdd1.bam")
val lines = sc.textFile("rdd2.kmer")
  .map(line => {
  // get the range from the rdd2.kmer file
  val columns = line.split("\t[") // i assume this is tab delimited?
  val contig = columns(0)
  val start = columns(4).toLong
  val end = columns(5).toLong

  (contig, (start, end))
})

// join against the other RDD keyed by contig name
val zippedLinesAndReads = reads.rdd.flatMap(read => {
  // check whether the read is mapped, lest we get a null pointer exception
  if (read.getReadMapped) {
    Some((read.getContigName, read.getStart))
  } else {
    None
  }
}).join(lines)

val countsByChromosome = zippedLinesAndReads.flatMap(kv => {
  val (contigName, (readStart, (start, end))) = kv

  // is the alignment start position between the start and end pos from the line?
  // if yes, emit the chromosome name and 1
  if (start <= readStart && readStart < end) {
    Some((contigName, 1))
  } else {
    None
  }
}).reduceByKeyLocally(_ + _)
Member

fnothaft commented Nov 4, 2016

@Fei-Guang I think you'd want:

import org.bdgenomics.adam.rdd.ADAMContext._
val reads = sc.loadAlignments("rdd1.bam")
val lines = sc.textFile("rdd2.kmer")
  .map(line => {
  // get the range from the rdd2.kmer file
  val columns = line.split("\t[") // i assume this is tab delimited?
  val contig = columns(0)
  val start = columns(4).toLong
  val end = columns(5).toLong

  (contig, (start, end))
})

// join against the other RDD keyed by contig name
val zippedLinesAndReads = reads.rdd.flatMap(read => {
  // check whether the read is mapped, lest we get a null pointer exception
  if (read.getReadMapped) {
    Some((read.getContigName, read.getStart))
  } else {
    None
  }
}).join(lines)

val countsByChromosome = zippedLinesAndReads.flatMap(kv => {
  val (contigName, (readStart, (start, end))) = kv

  // is the alignment start position between the start and end pos from the line?
  // if yes, emit the chromosome name and 1
  if (start <= readStart && readStart < end) {
    Some((contigName, 1))
  } else {
    None
  }
}).reduceByKeyLocally(_ + _)

@Fei-Guang Fei-Guang closed this Nov 7, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment