# Spark Solution to Secondary Sort

### Option #1
* Read and buffer all of the values for a given key in an array or list, then sort those values inside the reducer. This works only when the set of values for each reducer key is small enough to fit in memory.
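Before moving to Spark, the idea behind Option #1 can be sketched in plain Python. This is only an illustration, not the Spark implementation: the function name `in_memory_secondary_sort` is hypothetical, and the sample records are copied from the first rows of the data set used in this notebook.

```python
from collections import defaultdict

# Sample (name, time, value) records, copied from the first rows of the data set.
records = [("x", 2, 9), ("y", 2, 5), ("x", 1, 3), ("y", 1, 7), ("y", 3, 1), ("x", 3, 6)]

def in_memory_secondary_sort(rows):
    """Buffer every (time, value) pair per name, then sort each buffer by time.

    Hypothetical helper: it mimics what a reducer does in Option #1.
    """
    buffers = defaultdict(list)
    for name, time, value in rows:
        buffers[name].append((time, value))  # all values for a key are held in memory
    # Sort each key's buffer by the first tuple element (time).
    return {name: sorted(values, key=lambda t: t[0]) for name, values in buffers.items()}

result = in_memory_secondary_sort(records)
print(result)
```

Every value for a key sits in one list before the sort runs, which is why this approach depends on each key's values fitting in memory.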

#### Step 1: Create an RDD

In [1]:
filePath = "TimeSeriesData"

In [2]:
tsRDD = sc.textFile(filePath)
type(tsRDD)

pyspark.rdd.RDD

In [3]:
tsRDD.take(7)

['name time value', 'x 2 9', 'y 2 5', 'x 1 3', 'y 1 7', 'y 3 1', 'x 3 6']

#### Step 2: Filter the header row out of the data

In [4]:
head = tsRDD.first()

In [5]:
tshRDD = tsRDD.filter(lambda data: data != head)

In [6]:
tshRDD.first()

'x 2 9'

#### Step 3: Convert to a PairRDD

* A PairRDD holds key-value pairs.
* Here each pair is a tuple(key, value):
    * Key -> name
    * Value -> tuple(time, value)
* Prefer tuples over lists for these records: like RDDs, tuples are **immutable**.
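The immutability point can be checked in plain Python. The snippet below is a small illustration only; the variable names are made up for this example.

```python
# A record shaped like the pairs used in this step: (name, (time, value)).
pair = ("x", (2, 9))

try:
    pair[0] = "y"          # tuples reject item assignment
    mutated = True
except TypeError:
    mutated = False

# The same record built from lists can be changed in place without any error.
as_list = ["x", [2, 9]]
as_list[1][0] = 99

print(mutated)   # False
print(as_list)   # ['x', [99, 9]]
```

A record that cannot be changed after it is built is safer to pass through transformations, which matches how RDDs themselves behave.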

In [7]:
# Parse a row such as "x 2 9" into (name, (time, value))
def parseTs(row):
    read = row.split()  # split on any run of whitespace
    return (read[0], (int(read[1]), int(read[2])))

In [8]:
tsPairRDD = tshRDD.map(parseTs)

In [9]:
tsPairRDD.take(5)

[('x', (2, 9)), ('y', (2, 5)), ('x', (1, 3)), ('y', (1, 7)), ('y', (3, 1))]

#### Step 4: Group PairRDD elements by the key (name) and sort the values by time

In [10]:
tsPairGRDD = tsPairRDD.groupByKey()

In [11]:
tsPairGRDD.mapValues(list).collect()

[('y', [(2, 5), (1, 7), (3, 1)]),
 ('p', [(2, 6), (4, 7), (1, 9), (6, 0), (7, 3)]),
 ('x', [(2, 9), (1, 3), (3, 6)]),
 ('z', [(1, 4), (2, 8), (3, 7), (4, 0)])]

In [12]:
tsPairGRDD.mapValues(lambda x: sorted(x, key=lambda t: t[0])).collect()

[('y', [(1, 7), (2, 5), (3, 1)]),
 ('p', [(1, 9), (2, 6), (4, 7), (6, 0), (7, 3)]),
 ('x', [(1, 3), (2, 9), (3, 6)]),
 ('z', [(1, 4), (2, 8), (3, 7), (4, 0)])]

In [13]:
tsPairGRDD.mapValues(lambda x: sorted(x, key=lambda t: t[0], reverse=True)).collect()

[('y', [(3, 1), (2, 5), (1, 7)]),
 ('p', [(7, 3), (6, 0), (4, 7), (2, 6), (1, 9)]),
 ('x', [(3, 6), (2, 9), (1, 3)]),
 ('z', [(4, 0), (3, 7), (2, 8), (1, 4)])]
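The ascending and descending cells differ only in the `reverse` flag, so the sort could be factored into one function. The sketch below is a plain-Python illustration; `sort_by_time` is a hypothetical helper, and the input list is the grouped values for key `'p'` shown in the output of cell 11.

```python
def sort_by_time(values, descending=False):
    """Return (time, value) pairs ordered by time. Hypothetical helper."""
    return sorted(values, key=lambda t: t[0], reverse=descending)

# Grouped values for key 'p', copied from the groupByKey output above.
p_values = [(2, 6), (4, 7), (1, 9), (6, 0), (7, 3)]

ascending = sort_by_time(p_values)
descending = sort_by_time(p_values, descending=True)

print(ascending)    # [(1, 9), (2, 6), (4, 7), (6, 0), (7, 3)]
print(descending)   # [(7, 3), (6, 0), (4, 7), (2, 6), (1, 9)]
```

With the grouped RDD from this step, the helper could be passed directly as `tsPairGRDD.mapValues(sort_by_time)`, and the descending case as `tsPairGRDD.mapValues(lambda v: sort_by_time(v, descending=True))`.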