**Background**

You will use Spark and Python to process genomic data. The human genome consists of about 3 billion nucleotides; the genome of the nematode (roundworm) C. elegans is smaller. The genome sequences are stored as FASTA files. For the purposes of this exercise, treat lower and upper case as the same. Recall that FASTA files have comment lines starting with '>' that must be excluded from the analysis. For the exercises below, assume that k=20 for the k-mers.
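The preprocessing rules above (ignore case, skip '>' comment lines, take windows of length k within a single line) can be sketched in plain Python without Spark. This is only a minimal illustration under those assumptions; the helper name `kmers_in_lines` and the toy records are hypothetical and not part of the exercise data.

```python
def kmers_in_lines(lines, k):
    """Yield every k-mer from FASTA-style lines (hypothetical helper).

    Comment lines starting with '>' are skipped, case is ignored, and
    k-mers never span two lines.
    """
    for line in lines:
        line = line.strip().upper()
        if not line or line.startswith('>'):
            continue
        # A line of length n contains n - k + 1 windows (none if n < k).
        for i in range(len(line) - k + 1):
            yield line[i:i + k]


# Toy input, made up for illustration; k=3 keeps the output readable.
toy_fasta = [">I dna:chromosome", "acgTA", "gg"]
print(list(kmers_in_lines(toy_fasta, 3)))
```

The second sequence line is shorter than k, so it contributes no k-mers at all.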

In [1]:
%%spark

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
165,application_1522938745830_0133,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


In [2]:
hadoop = sc._jvm.org.apache.hadoop

fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration() 
path = hadoop.fs.Path('/data/c_elegans')

for f in fs.get(conf).listStatus(path):
    print(f.getPath())

hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.I.fa
hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.II.fa
hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.III.fa
hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.IV.fa
hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.V.fa
hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.X.fa

**Exercise 2 (50 points)**

Write a program using `spark` to find the 5 most common k-mers (sliding windows of length k) in the human genome. Ignore case when processing k-mers. You can work one line at a time - we will ignore k-mers that wrap around lines. You should write a function that takes a path to FASTA files and a value for k, and returns a key-value RDD of k-mer counts. Remember to strip comment lines that begin with '>' from the analysis.

Use k=20

**Note**: The textFile method takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Please set this parameter to 60 - it will speed up processing.
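To make the partitioning rule in the note concrete, here is a rough back-of-the-envelope sketch. It assumes a single file, one partition per 128MB block by default, and that a larger requested value simply wins. The helper `estimate_partitions` is hypothetical and is not a Spark API; the number of partitions Spark actually creates is decided by the underlying Hadoop input format and may differ somewhat.

```python
import math

HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB default quoted in the note above


def estimate_partitions(file_size_bytes, requested=None, block_size=HDFS_BLOCK_SIZE):
    """Rough partition count for one file (hypothetical helper, not a Spark API)."""
    per_block = max(1, int(math.ceil(file_size_bytes / float(block_size))))
    if requested is None:
        return per_block
    # Asking for more partitions than blocks raises the count; asking for
    # fewer does not merge blocks in this simplified model.
    return max(per_block, requested)


# A hypothetical 300 MB file: one partition per block by default,
# or the requested number when that is larger.
size = 300 * 1024 * 1024
print(estimate_partitions(size))
print(estimate_partitions(size, requested=60))
```

In this simplified model, small inputs that fit in a few blocks would otherwise be processed by only a few tasks, which is why requesting more partitions can help.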

**Check**: Use the C. elegans genome at `/data/c_elegans`. You should get 

```
[
(u'ATATATATATATATATATAT', 2217), 
(u'TATATATATATATATATATA', 2184), 
(u'CTCTCTCTCTCTCTCTCTCT', 1373), 
(u'TCTCTCTCTCTCTCTCTCTC', 1361), 
(u'AGAGAGAGAGAGAGAGAGAG', 1033)
]
```

In [3]:
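def most_common(path, k):
    """Return an RDD of (k-mer, count) pairs for the FASTA files at `path`."""
    counts = (sc.textFile(path, 60)  # request 60 partitions
            # Normalise case and drop surrounding whitespace
            .map(lambda line: line.strip().upper())
            # Skip FASTA comment lines
            .filter(lambda line: not line.startswith('>'))
            # Sliding windows of length k within a line; lines shorter than k yield none
            .flatMap(lambda line: [line[i:i+k] for i in range(len(line) - k + 1)])
            .map(lambda kmer: (kmer, 1))
            .reduceByKey(lambda x, y: x + y)
            )
    return counts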

In [4]:
file_path = '/data/c_elegans/*fa'
k = 20
counts = most_common(file_path,k)
counts

PythonRDD[6] at RDD at PythonRDD.scala:48

In [5]:
counts.takeOrdered(5, key=lambda x: -x[1])

[(u'ATATATATATATATATATAT', 2217), (u'TATATATATATATATATATA', 2184), (u'CTCTCTCTCTCTCTCTCTCT', 1373), (u'TCTCTCTCTCTCTCTCTCTC', 1361), (u'AGAGAGAGAGAGAGAGAGAG', 1033)]
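For a quick sanity check on a small input, the same count-then-rank logic can be reproduced locally. The sketch below is a plain-Python stand-in, not the Spark solution: `Counter` plays the role of `map` plus `reduceByKey`, and `heapq.nsmallest` with a negated count mirrors `takeOrdered(5, key=lambda x: -x[1])`. The function name `top_kmers` and the toy lines are hypothetical.

```python
from collections import Counter
import heapq


def top_kmers(lines, k, n):
    """Count k-mers in FASTA-style lines and return the n most frequent."""
    counts = Counter()
    for line in lines:
        line = line.strip().upper()
        if line.startswith('>'):
            continue
        counts.update(line[i:i + k] for i in range(len(line) - k + 1))
    # Ordering by the negated count puts the largest counts first.
    return heapq.nsmallest(n, counts.items(), key=lambda kv: -kv[1])


# Toy input, made up for illustration.
toy = [">toy record", "atatat", "ATATCC"]
print(top_kmers(toy, 2, 2))
```

Negating the count is a simple way to turn an ascending "take the smallest" operation into "take the largest" without sorting the whole collection.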

**Exercise 3 (10 points)** 

As a simple QC measure, we can assume that the k-mers that have a count of only 1 are due to sequencing errors. Put all the k-mers with a count of 2 or more in a Spark DataFrame with two columns (sequence, count). 

In [6]:
f_counts = counts.filter(lambda x: x[1]>=2)
df = spark.createDataFrame(f_counts,['sequence','count'])
df.show()

+--------------------+-----+
|            sequence|count|
+--------------------+-----+
|CTGAAAAAATCAGCTTTTAT|    5|
|AAATGCGTTTTTTGAACTTA|    3|
|TACATCTTCTTTTGTAAGAC|    2|
|AGCGACAATTACTCAGTTTT|    2|
|GAAATTGTCAACTAACAAAA|    2|
|GCAAATCAAGGAAATGTTTG|    2|
|TGTCAACCTGGCCCATATTG|    2|
|CTCACCGATTCAAGTCTTAA|    2|
|ATTTAATTTAAAAAAAATTA|    2|
|CTCTCACCGGAGTGGCTGTT|    2|
|TTATTTGTTTTTTTTTTGGA|    2|
|GAGAAATACTACGCTACGAA|    2|
|CGGAAGAAGAAAAATGAGAA|    2|
|TACAAATGATTAAAACAAAA|    2|
|TTTTCGATCCACTGCTCCAA|    2|
|TTTATCGAATTTTTTAGAGA|    2|
|AAAGATTCATAAATCATGAA|    2|
|TCTACCCAGTTATTGGACCA|    2|
|GGCTCAGGATTTTGGTTGAG|    2|
|TATTTTTAATATTTTTTTTT|    2|
+--------------------+-----+
only showing top 20 rows

**Exercise 4 (10 points)**

Find all k-mers with count greater than 1 that are palindromes.

In [7]:
def palindrome(kmer):
    """Return True if the k-mer reads the same forwards and backwards."""
    return kmer == kmer[::-1]

In [8]:
p = f_counts.filter(lambda x: palindrome(x[0]))

In [9]:
p.count()

228
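The check used here is a plain string palindrome: the k-mer equals its own reversal. In molecular biology the word "palindrome" is also commonly used for a sequence that equals its reverse complement. The sketch below contrasts the two on k-mers taken from the outputs in this notebook. The reverse-complement function is a hypothetical alternative shown only for comparison, and it is not the definition used in this exercise.

```python
COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}


def is_string_palindrome(kmer):
    """True if the k-mer equals its own reversal (the definition used here)."""
    return kmer == kmer[::-1]


def is_reverse_complement_palindrome(kmer):
    """Hypothetical alternative, NOT the definition used in this notebook.

    Bases other than A, C, G, T (for example N) are left unchanged.
    """
    rc = ''.join(COMPLEMENT.get(base, base) for base in reversed(kmer))
    return kmer == rc


for kmer in ['AAGAAGAAGAAGAAGAAGAA', 'ATATATATATATATATATAT']:
    print('%s %s %s' % (kmer,
                        is_string_palindrome(kmer),
                        is_reverse_complement_palindrome(kmer)))
```

Under these definitions the two k-mers behave oppositely, so the choice of definition changes which sequences are reported.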

In [10]:
p.collect()

[(u'AAATAAATAAAATAAATAAA', 6), (u'GGTCCTCCTGGTCCTCCTGG', 2), (u'TAAATTAAATTAAATTAAAT', 24), (u'TAGTTTTTTTTTTTTTTGAT', 2), (u'AAAAACCAAAAAACCAAAAA', 7), (u'AAGAAGAAGAAGAAGAAGAA', 40), (u'AAAAAATAAAAAATAAAAAA', 10), (u'AAAAATAAAAAAAATAAAAA', 4), (u'AATAAATAAAAAATAAATAA', 3), (u'TTAGGATTCGGCTTAGGATT', 2), (u'AAAGAAAAAAAAAAAAGAAA', 3), (u'TTTTGAAAAAAAAAAGTTTT', 5), (u'TTTAATTAATTAATTAATTT', 4), (u'GGCTTCGGCTTCGGCTTCGG', 2), (u'ATTAATTAATTAATTAATTA', 5), (u'TTTTTTATTTTTTATTTTTT', 9), (u'AAAAAAAACTTCAAAAAAAA', 2), (u'TTAATTAATTTTAATTAATT', 2), (u'AAAAAACAAAAAACAAAAAA', 15), (u'TTTTTTTTGAAGTTTTTTTT', 4), (u'TTTTTTTTAAAATTTTTTTT', 4), (u'TTTAAAAAATTAAAAAATTT', 2), (u'TCCCCCCCCCCCCCCCCCCT', 5), (u'TGGGGTCTGGGGTCTGGGGT', 10), (u'AAAAAAAATAATAAAAAAAA', 2), (u'AACTTTTTCAACTTTTTCAA', 2), (u'TTTAGAAAAAAAAAAGATTT', 2), (u'AAAAAAACTTTTCAAAAAAA', 2), (u'ATTTTTTAAAAAATTTTTTA', 3), (u'AATTAAAAAAAAAAAATTAA', 2), (u'AAAATTTTAAAATTTTAAAA', 7), (u'ATTTTTTTAAAATTTTTTTA', 2), (u'TACCATATACCATATACCAT', 3), (u'A