### Pair RDD

A Pair RDD is an RDD whose elements are key-value pairs. It is useful when data is naturally organized by key and each key can have several values (for example, all transactions of a customer, generated in real time).

Common Pair RDD operations include:

- **mapValues()**
- **countByKey()**
- **groupByKey()**
- **reduceByKey()**
- **aggregateByKey()**
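
To make the semantics of these operations concrete, the sketch below emulates four of them in plain Python on an in-memory list of pairs. It is not Spark code: the helper names (`map_values`, `count_by_key`, `group_by_key`, `reduce_by_key`) and the sample data are hypothetical, and a real RDD evaluates lazily and in parallel across partitions.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample data: (customer, amount) pairs.
pairs = [("ana", 10), ("bob", 5), ("ana", 7), ("bob", 1), ("cy", 3)]

def map_values(pairs, f):
    """Apply f to each value and leave the key untouched (like mapValues)."""
    return [(k, f(v)) for k, v in pairs]

def count_by_key(pairs):
    """Count the records for each key (like countByKey)."""
    counts = defaultdict(int)
    for k, _ in pairs:
        counts[k] += 1
    return dict(counts)

def group_by_key(pairs):
    """Collect all values for each key into a list (like groupByKey)."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def reduce_by_key(pairs, f):
    """Fold the values of each key with a binary function (like reduceByKey)."""
    return {k: reduce(f, vs) for k, vs in group_by_key(pairs).items()}

doubled = map_values(pairs, lambda v: v * 2)
counts = count_by_key(pairs)
groups = group_by_key(pairs)
totals = reduce_by_key(pairs, lambda a, b: a + b)
```

In this emulation `reduce_by_key` is built on top of `group_by_key` for brevity; Spark's `reduceByKey()` instead combines values within each partition before shuffling, which is why it is usually preferred over `groupByKey()` for aggregations.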

**Creating an RDD**

In [1]:
cars = sc.textFile('aux/datasets/cars.csv')

In [2]:
cars.take(5)

['MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE',
 'subaru,gas,std,two,hatchback,fwd,four,69,4900,31,36,5118',
 'chevrolet,gas,std,two,hatchback,fwd,three,48,5100,47,53,5151',
 'mazda,gas,std,two,hatchback,fwd,four,68,5000,30,31,5195',
 'toyota,gas,std,two,hatchback,fwd,four,62,4800,35,39,5348']

**Creating a Pair RDD**

In [3]:
rddCars = cars.map(lambda line: (line.split(',')[0], line.split(',')[7]))

In [4]:
rddCars.take(5)

[('MAKE', 'HP'),
 ('subaru', '69'),
 ('chevrolet', '48'),
 ('mazda', '68'),
 ('toyota', '62')]
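
The lambda above splits each line twice. A minimal plain-Python sketch of the same key-value extraction, splitting once per line, is shown below; the helper name `to_make_hp` is hypothetical, and the sample lines are copied from the `cars.take(5)` output.

```python
# Sample lines copied from the cars.take(5) output above.
lines = [
    "MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE",
    "subaru,gas,std,two,hatchback,fwd,four,69,4900,31,36,5118",
    "chevrolet,gas,std,two,hatchback,fwd,three,48,5100,47,53,5151",
]

def to_make_hp(line):
    """Return (MAKE, HP) from one CSV line: column 0 is the key, column 7 the value."""
    fields = line.split(",")
    return (fields[0], fields[7])

pairs = [to_make_hp(line) for line in lines]
print(pairs)  # [('MAKE', 'HP'), ('subaru', '69'), ('chevrolet', '48')]
```

Under the same assumption, the named function could be passed to Spark directly as `cars.map(to_make_hp)`.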

**Removing the header**

In [5]:
header = rddCars.first()

In [11]:
rddCars_02 = rddCars.filter(lambda line: line != header)

In [12]:
rddCars_02.take(5)

[('subaru', '69'),
 ('chevrolet', '48'),
 ('mazda', '68'),
 ('toyota', '62'),
 ('mitsubishi', '68')]

**Pairing each manufacturer's HP value with a count of 1**

In [15]:
rddCars_03 = rddCars_02.mapValues(lambda hp: (hp, 1))

In [16]:
rddCars_03.collect()

[('subaru', ('69', 1)),
 ('chevrolet', ('48', 1)),
 ('mazda', ('68', 1)),
 ('toyota', ('62', 1)),
 ('mitsubishi', ('68', 1)),
 ('honda', ('60', 1)),
 ('nissan', ('69', 1)),
 ('dodge', ('68', 1)),
 ('plymouth', ('68', 1)),
 ('mazda', ('68', 1)),
 ('mitsubishi', ('68', 1)),
 ('dodge', ('68', 1)),
 ('plymouth', ('68', 1)),
 ('chevrolet', ('70', 1)),
 ('toyota', ('62', 1)),
 ('dodge', ('68', 1)),
 ('honda', ('58', 1)),
 ('toyota', ('62', 1)),
 ('honda', ('76', 1)),
 ('chevrolet', ('70', 1)),
 ('nissan', ('69', 1)),
 ('mitsubishi', ('68', 1)),
 ('dodge', ('68', 1)),
 ('plymouth', ('68', 1)),
 ('mazda', ('68', 1)),
 ('isuzu', ('78', 1)),
 ('mazda', ('68', 1)),
 ('nissan', ('69', 1)),
 ('honda', ('76', 1)),
 ('toyota', ('62', 1)),
 ('toyota', ('70', 1)),
 ('mitsubishi', ('88', 1)),
 ('subaru', ('73', 1)),
 ('nissan', ('55', 1)),
 ('subaru', ('82', 1)),
 ('honda', ('76', 1)),
 ('toyota', ('70', 1)),
 ('honda', ('76', 1)),
 ('honda', ('76', 1)),
 ('nissan', ('69', 1)),
 ('nissan', ('69', 1)),
 ...]

**Calculating total HP and number of cars by manufacturer**

In [21]:
# Convert HP to int first: a key with a single record (e.g. mercury) never reaches the reduce function.
manufacturers = rddCars_03.mapValues(lambda v: (int(v[0]), v[1])).reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))

In [22]:
manufacturers.collect()

[('chevrolet', (188, 3)),
 ('mazda', (1390, 16)),
 ('mitsubishi', (1353, 13)),
 ('nissan', (1846, 18)),
 ('dodge', (675, 8)),
 ('plymouth', (607, 7)),
 ('saab', (760, 6)),
 ('volvo', (1408, 11)),
 ('alfa-romero', (376, 3)),
 ('mercedes-benz', (1170, 8)),
 ('jaguar', (614, 3)),
 ('subaru', (1035, 12)),
 ('toyota', (2969, 32)),
 ('honda', (1043, 13)),
 ('isuzu', (168, 2)),
 ('volkswagen', (973, 12)),
 ('peugot', (1098, 11)),
 ('audi', (687, 6)),
 ('bmw', (1111, 8)),
 ('mercury', (175, 1)),
 ('porsche', (764, 4))]
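
`reduceByKey()` only calls its function when a key has at least two values; a key with a single record is returned unchanged. The plain-Python emulation below (the `reduce_by_key` helper and the sample records are hypothetical, not Spark API) sketches why converting HP to `int` before reducing keeps every result in the same numeric form.

```python
from functools import reduce

def reduce_by_key(pairs, f):
    """Emulate reduceByKey: fold the values of each key with f."""
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    # reduce() returns a lone value as-is, without ever calling f.
    return {k: reduce(f, vs) for k, vs in groups.items()}

# Hypothetical (make, (hp, 1)) records with HP still stored as a string.
records = [("isuzu", ("78", 1)), ("isuzu", ("90", 1)), ("mercury", ("175", 1))]

# Converting inside the reduce function misses single-record keys.
late = reduce_by_key(
    records, lambda a, b: (int(a[0]) + int(b[0]), a[1] + b[1])
)

# Converting first gives every key an integer total.
typed = [(k, (int(hp), n)) for k, (hp, n) in records]
early = reduce_by_key(typed, lambda a, b: (a[0] + b[0], a[1] + b[1]))

print(late)   # {'isuzu': (168, 2), 'mercury': ('175', 1)}
print(early)  # {'isuzu': (168, 2), 'mercury': (175, 1)}
```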

**Calculating the average HP by manufacturer**

In [24]:
manufacturers.mapValues(lambda value: round(int(value[0]) / int(value[1]), 2)).collect()

[('chevrolet', 62.67),
 ('mazda', 86.88),
 ('mitsubishi', 104.08),
 ('nissan', 102.56),
 ('dodge', 84.38),
 ('plymouth', 86.71),
 ('saab', 126.67),
 ('volvo', 128.0),
 ('alfa-romero', 125.33),
 ('mercedes-benz', 146.25),
 ('jaguar', 204.67),
 ('subaru', 86.25),
 ('toyota', 92.78),
 ('honda', 80.23),
 ('isuzu', 84.0),
 ('volkswagen', 81.08),
 ('peugot', 99.82),
 ('audi', 114.5),
 ('bmw', 138.88),
 ('mercury', 175.0),
 ('porsche', 191.0)]
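
The list at the top of this section mentions `aggregateByKey()`, which the walkthrough does not use. As a hedged sketch, the plain-Python emulation below shows how the same average could be built from a zero value, a per-record function, and a merge function. The helper `aggregate_by_key`, the two-partition layout, and the sample records are assumptions made for illustration, not Spark API.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Emulate aggregateByKey over a list of partitions (lists of pairs)."""
    partials = []
    for part in partitions:
        acc = {}
        for k, v in part:
            # seq_op folds one raw value into the partition-local accumulator.
            acc[k] = seq_op(acc.get(k, zero), v)
        partials.append(acc)
    merged = {}
    for acc in partials:
        for k, a in acc.items():
            # comb_op merges accumulators for the same key across partitions.
            merged[k] = comb_op(merged[k], a) if k in merged else a
    return merged

# Hypothetical (make, hp) records split across two partitions.
partitions = [
    [("isuzu", "78"), ("honda", "60")],
    [("isuzu", "90"), ("honda", "58"), ("honda", "76")],
]

totals = aggregate_by_key(
    partitions,
    (0, 0),                                          # (hp_sum, car_count)
    lambda acc, hp: (acc[0] + int(hp), acc[1] + 1),  # add one record
    lambda a, b: (a[0] + b[0], a[1] + b[1]),         # merge two accumulators
)
averages = {make: round(s / n, 2) for make, (s, n) in totals.items()}
print(averages)  # {'isuzu': 84.0, 'honda': 64.67}
```

Because the zero value already has the `(sum, count)` shape, this approach needs no separate `mapValues()` step, and single-record keys are converted like any other. In PySpark the equivalent call would presumably take the same three arguments, for example `rddCars_02.aggregateByKey((0, 0), seq_op, comb_op)` with the two lambdas bound to those names.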