### Transformations

In [1]:
numbersList_01 = [124, 901, 652, 102, 397]

In [2]:
type(numbersList_01)

list

In [3]:
rddNumbersList_01 = sc.parallelize(numbersList_01)

In [4]:
type(rddNumbersList_01)

pyspark.rdd.RDD

In [5]:
rddNumbersList_01.collect()

[124, 901, 652, 102, 397]

In [6]:
rddNumbersList_01.count()

5

In [7]:
rddCars = sc.textFile('aux/datasets/cars.csv')

In [8]:
type(rddCars)

pyspark.rdd.RDD

In [9]:
rddCars.first()

'MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE'

In [10]:
rddCars.take(5)

['MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE',
 'subaru,gas,std,two,hatchback,fwd,four,69,4900,31,36,5118',
 'chevrolet,gas,std,two,hatchback,fwd,three,48,5100,47,53,5151',
 'mazda,gas,std,two,hatchback,fwd,four,68,5000,30,31,5195',
 'toyota,gas,std,two,hatchback,fwd,four,62,4800,35,39,5348']

**Each action triggers a new computation of the RDD. Calling cache() marks the RDD to be kept in memory once it has been computed, so later actions can reuse it without recomputing it.**

In [11]:
rddCars.cache()

aux/datasets/cars.csv MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:0
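To make the effect of caching concrete, here is a plain-Python analogy (not Spark code). The `LazyDataset` class, its methods, and its `computations` counter are hypothetical and exist only for this sketch: transformations record a recipe, actions run it, and `cache()` lets a later action reuse the stored result.

```python
class LazyDataset:
    """Toy stand-in for an RDD: it holds a recipe, not the data (hypothetical class)."""

    def __init__(self, source, transforms=()):
        self._source = source                  # raw input, e.g. the lines of a file
        self._transforms = tuple(transforms)   # transformations, applied lazily
        self._use_cache = False
        self._cached = None                    # materialized result, once cached
        self.computations = 0                  # how many times the pipeline really ran

    def map(self, fn):
        # A transformation returns a NEW dataset and computes nothing yet.
        return LazyDataset(self._source, self._transforms + (fn,))

    def cache(self):
        # Only marks the dataset; the data is stored by the first action.
        self._use_cache = True
        return self

    def _compute(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        data = list(self._source)
        for fn in self._transforms:
            data = [fn(x) for x in data]
        if self._use_cache:
            self._cached = data
        return data

    def collect(self):
        return list(self._compute())

    def count(self):
        return len(self._compute())


lines = ["a,b", "c,d"]

uncached = LazyDataset(lines).map(lambda s: s.replace(",", "\t"))
uncached.count()
uncached.collect()
print(uncached.computations)  # 2: each action recomputed the pipeline

cached = LazyDataset(lines).map(lambda s: s.replace(",", "\t")).cache()
cached.count()
cached.collect()
print(cached.computations)    # 1: the second action reused the stored result
```

Note that `map()` leaves `lines` untouched, which mirrors the immutability of RDDs.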

In [12]:
rddCars.collect()

['MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE',
 'subaru,gas,std,two,hatchback,fwd,four,69,4900,31,36,5118',
 'chevrolet,gas,std,two,hatchback,fwd,three,48,5100,47,53,5151',
 'mazda,gas,std,two,hatchback,fwd,four,68,5000,30,31,5195',
 'toyota,gas,std,two,hatchback,fwd,four,62,4800,35,39,5348',
 'mitsubishi,gas,std,two,hatchback,fwd,four,68,5500,37,41,5389',
 'honda,gas,std,two,hatchback,fwd,four,60,5500,38,42,5399',
 'nissan,gas,std,two,sedan,fwd,four,69,5200,31,37,5499',
 'dodge,gas,std,two,hatchback,fwd,four,68,5500,37,41,5572',
 'plymouth,gas,std,two,hatchback,fwd,four,68,5500,37,41,5572',
 'mazda,gas,std,two,hatchback,fwd,four,68,5000,31,38,6095',
 'mitsubishi,gas,std,two,hatchback,fwd,four,68,5500,31,38,6189',
 'dodge,gas,std,four,hatchback,fwd,four,68,5500,31,38,6229',
 'plymouth,gas,std,four,hatchback,fwd,four,68,5500,31,38,6229',
 'chevrolet,gas,std,two,hatchback,fwd,four,70,5400,38,43,6295',
 'toyota,gas,std,two,hatchback,fwd,four,62,4800,31,3

**Lazy Evaluation - Using map() to create a new RDD (transformation)**

In [13]:
rddCarsByTab = rddCars.map(lambda line: line.replace(',', '\t'))

In [14]:
rddCarsByTab.take(5)

['MAKE\tFUELTYPE\tASPIRE\tDOORS\tBODY\tDRIVE\tCYLINDERS\tHP\tRPM\tMPG-CITY\tMPG-HWY\tPRICE',
 'subaru\tgas\tstd\ttwo\thatchback\tfwd\tfour\t69\t4900\t31\t36\t5118',
 'chevrolet\tgas\tstd\ttwo\thatchback\tfwd\tthree\t48\t5100\t47\t53\t5151',
 'mazda\tgas\tstd\ttwo\thatchback\tfwd\tfour\t68\t5000\t30\t31\t5195',
 'toyota\tgas\tstd\ttwo\thatchback\tfwd\tfour\t62\t4800\t35\t39\t5348']

**Because RDDs are immutable, map() returns a new RDD and the original remains unchanged**

In [15]:
rddCars.take(5)

['MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE',
 'subaru,gas,std,two,hatchback,fwd,four,69,4900,31,36,5118',
 'chevrolet,gas,std,two,hatchback,fwd,three,48,5100,47,53,5151',
 'mazda,gas,std,two,hatchback,fwd,four,68,5000,30,31,5195',
 'toyota,gas,std,two,hatchback,fwd,four,62,4800,35,39,5348']

**Lazy Evaluation - Using filter() to create a new RDD (transformation)**

In [16]:
rddToyota = rddCars.filter(lambda line: 'toyota' in line)

In [17]:
rddToyota.count()

32

In [18]:
rddToyota.take(5)

['toyota,gas,std,two,hatchback,fwd,four,62,4800,35,39,5348',
 'toyota,gas,std,two,hatchback,fwd,four,62,4800,31,38,6338',
 'toyota,gas,std,four,hatchback,fwd,four,62,4800,31,38,6488',
 'toyota,gas,std,four,wagon,fwd,four,62,4800,31,37,6918',
 'toyota,gas,std,four,sedan,fwd,four,70,4800,30,37,6938']

**RDDs are held in memory only while the application runs, so save any result you want to keep. collect() brings the data from the executors to the driver process, which can then write an output file.**

**Save as CSV**

In [19]:
# The context manager closes the file even if the write fails.
with open('aux/datasets/cars-toyota.csv', 'w') as rddSaved:
    rddSaved.write('\n'.join(rddToyota.collect()))

**Save as TXT (saveAsTextFile() creates a directory with this name containing one part file per partition)**

In [20]:
rddToyota.saveAsTextFile('aux/datasets/toyota.txt')
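The CSV file written in cell 19 has no header row, because filter() kept only the lines containing 'toyota'. The sketch below shows one way to write the header first. It is plain Python, the list `toyota_rows` stands in for `rddToyota.collect()`, and the temporary output path is an assumption made to keep the example self-contained.

```python
import os
import tempfile

header = 'MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE'

# Stand-in for rddToyota.collect(); two rows copied from the output above.
toyota_rows = [
    'toyota,gas,std,two,hatchback,fwd,four,62,4800,35,39,5348',
    'toyota,gas,std,two,hatchback,fwd,four,62,4800,31,38,6338',
]

# Hypothetical output location inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), 'cars-toyota.csv')

# Write the header, then the data rows, one per line.
with open(out_path, 'w') as f:
    f.write('\n'.join([header] + toyota_rows) + '\n')

# Read the file back to confirm what was written.
with open(out_path) as f:
    written = f.read().splitlines()

print(len(written))  # 3: one header line plus two data rows
```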

### Set Operations

In [21]:
rddFruits_01 = sc.parallelize(['Apple', 'Orange', 'Grape', 'Lemon'])
rddFruits_02 = sc.parallelize(['Melon', 'Grape', 'Banana'])

**Union**

In [22]:
rddFruits_01.union(rddFruits_02).distinct().collect()

['Orange', 'Grape', 'Lemon', 'Apple', 'Melon', 'Banana']

**Intersection**

In [23]:
rddFruits_01.intersection(rddFruits_02).collect()

['Grape']
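The reason for chaining distinct() after union() is that union() keeps duplicates, whereas intersection() already returns each common element once. A plain-Python analogy (lists and sets, not Spark code) shows the same behaviour; the variable names are illustrative.

```python
fruits_01 = ['Apple', 'Orange', 'Grape', 'Lemon']
fruits_02 = ['Melon', 'Grape', 'Banana']

# union() behaves like list concatenation: duplicates are kept.
union_with_duplicates = fruits_01 + fruits_02

# distinct() then drops the repeated 'Grape', like converting to a set.
union_distinct = set(union_with_duplicates)

# intersection() keeps only the elements present in both inputs.
common = set(fruits_01) & set(fruits_02)

print(len(union_with_duplicates))  # 7: 'Grape' appears twice
print(sorted(union_distinct))      # ['Apple', 'Banana', 'Grape', 'Lemon', 'Melon', 'Orange']
print(common)                      # {'Grape'}
```

As with sets, the order of the elements returned by distinct() is not guaranteed.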

### Joins (Inner, Left Outer, Right Outer)

In [24]:
rddNames_01 = sc.parallelize(['John', 'Mark', 'Peter']).map(lambda name: (name, 1))
rddNames_02 = sc.parallelize(['Bill', 'Peter', 'Steve']).map(lambda name: (name, 1))

In [25]:
rddNames_01.join(rddNames_02).collect()

[('Peter', (1, 1))]

In [26]:
rddNames_01.leftOuterJoin(rddNames_02).collect()

[('John', (1, None)), ('Mark', (1, None)), ('Peter', (1, 1))]

In [27]:
rddNames_01.rightOuterJoin(rddNames_02).collect()

[('Bill', (None, 1)), ('Peter', (1, 1)), ('Steve', (None, 1))]
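The three results above differ only in which unmatched keys are kept. A minimal plain-Python sketch of that logic, assuming lists of (key, value) pairs in place of RDDs; the helper `left_outer_join` is hypothetical and is not part of the Spark API.

```python
def left_outer_join(left, right):
    """Join two lists of (key, value) pairs, keeping every key from `left`."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)

    joined = []
    for key, value in left:
        # Keys missing on the right are paired with None instead of being dropped.
        for other in right_by_key.get(key, [None]):
            joined.append((key, (value, other)))
    return joined


names_01 = [(name, 1) for name in ['John', 'Mark', 'Peter']]
names_02 = [(name, 1) for name in ['Bill', 'Peter', 'Steve']]

left = left_outer_join(names_01, names_02)
print(left)   # [('John', (1, None)), ('Mark', (1, None)), ('Peter', (1, 1))]

# Inner join: the left outer join minus the unmatched rows.
# Filtering on None is valid here only because no real value is None.
inner = [(k, (v, w)) for k, (v, w) in left if w is not None]
print(inner)  # [('Peter', (1, 1))]

# Right outer join: a left outer join with the roles swapped.
right = [(k, (w, v)) for k, (v, w) in left_outer_join(names_02, names_01)]
print(right)  # [('Bill', (None, 1)), ('Peter', (1, 1)), ('Steve', (None, 1))]
```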

### Distinct

In [28]:
numbersList_02 = [100, 200, 300, 400, 400, 500]

In [29]:
rddNumbersList_02 = sc.parallelize(numbersList_02)

In [30]:
rddNumbersList_02.distinct().collect()

[100, 200, 300, 400, 500]

### Transformation and Cleaning

In [31]:
rddCars.collect()

['MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE',
 'subaru,gas,std,two,hatchback,fwd,four,69,4900,31,36,5118',
 'chevrolet,gas,std,two,hatchback,fwd,three,48,5100,47,53,5151',
 'mazda,gas,std,two,hatchback,fwd,four,68,5000,30,31,5195',
 'toyota,gas,std,two,hatchback,fwd,four,62,4800,35,39,5348',
 'mitsubishi,gas,std,two,hatchback,fwd,four,68,5500,37,41,5389',
 'honda,gas,std,two,hatchback,fwd,four,60,5500,38,42,5399',
 'nissan,gas,std,two,sedan,fwd,four,69,5200,31,37,5499',
 'dodge,gas,std,two,hatchback,fwd,four,68,5500,37,41,5572',
 'plymouth,gas,std,two,hatchback,fwd,four,68,5500,37,41,5572',
 'mazda,gas,std,two,hatchback,fwd,four,68,5000,31,38,6095',
 'mitsubishi,gas,std,two,hatchback,fwd,four,68,5500,31,38,6189',
 'dodge,gas,std,four,hatchback,fwd,four,68,5500,31,38,6229',
 'plymouth,gas,std,four,hatchback,fwd,four,68,5500,31,38,6229',
 'chevrolet,gas,std,two,hatchback,fwd,four,70,5400,38,43,6295',
 'toyota,gas,std,two,hatchback,fwd,four,62,4800,31,3

In [32]:
def cleanRdd(line):
    # Each RDD element is one CSV line; split it into its fields.
    attrList = line.split(',')

    # DOORS (index 3): replace the spelled-out number with a digit.
    attrList[3] = '2' if attrList[3] == "two"  else attrList[3]
    attrList[3] = '4' if attrList[3] == "four" else attrList[3]
    # BODY (index 4): upper-case in place. Writing to index 5 would overwrite DRIVE.
    attrList[4] = attrList[4].upper()

    return ",".join(attrList)

In [33]:
rddCarsClean = rddCars.map(cleanRdd)

In [34]:
rddCarsClean

PythonRDD[62] at RDD at PythonRDD.scala:53

In [35]:
rddCarsClean.collect()

['MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE',
 'subaru,gas,std,2,HATCHBACK,fwd,four,69,4900,31,36,5118',
 'chevrolet,gas,std,2,HATCHBACK,fwd,three,48,5100,47,53,5151',
 'mazda,gas,std,2,HATCHBACK,fwd,four,68,5000,30,31,5195',
 'toyota,gas,std,2,HATCHBACK,fwd,four,62,4800,35,39,5348',
 'mitsubishi,gas,std,2,HATCHBACK,fwd,four,68,5500,37,41,5389',
 'honda,gas,std,2,HATCHBACK,fwd,four,60,5500,38,42,5399',
 'nissan,gas,std,2,SEDAN,fwd,four,69,5200,31,37,5499',
 'dodge,gas,std,2,HATCHBACK,fwd,four,68,5500,37,41,5572',
 'plymouth,gas,std,2,HATCHBACK,fwd,four,68,5500,37,41,5572',
 'mazda,gas,std,2,HATCHBACK,fwd,four,68,5000,31,38,6095',
 'mitsubishi,gas,std,2,HATCHBACK,fwd,four,68,5500,31,38,6189',
 'dodge,gas,std,4,HATCHBACK,fwd,four,68,5500,31,38,6229',
 'plymouth,gas,std,4,HATCHBACK,fwd,four,68,5500,31,38,6229',
 'chevrolet,gas,std,2,HATCHBACK,fwd,four,70,5400,38,43,6295',
 't
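A variation on the cleaning function, sketched in plain Python under a few assumptions: the lookup table `WORD_TO_DIGIT`, the column constants, and the name `clean_line` are all hypothetical, and the column positions are read off the header row shown earlier. A dictionary lookup avoids one conditional per word and extends to the CYLINDERS column as well.

```python
# Hypothetical lookup table; extend it with any other spelled-out numbers in the data.
WORD_TO_DIGIT = {'two': '2', 'three': '3', 'four': '4', 'five': '5', 'six': '6', 'eight': '8'}

# Column positions taken from the header row of cars.csv.
DOORS, BODY, CYLINDERS = 3, 4, 6


def clean_line(line):
    """Return one CSV line with numeric DOORS/CYLINDERS and upper-case BODY."""
    fields = line.split(',')
    for idx in (DOORS, CYLINDERS):
        # Unknown words (including the header names) are left unchanged.
        fields[idx] = WORD_TO_DIGIT.get(fields[idx], fields[idx])
    fields[BODY] = fields[BODY].upper()
    return ','.join(fields)


print(clean_line('chevrolet,gas,std,two,hatchback,fwd,three,48,5100,47,53,5151'))
# chevrolet,gas,std,2,HATCHBACK,fwd,3,48,5100,47,53,5151
```

Because the function takes a single line and returns a single line, it could be passed to map() in the same way as cleanRdd.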

### Actions

**reduce() - Sum of values**

In [36]:
numbersList_03 = [124, 901, 652, 102, 397, 124, 901, 652]

In [37]:
rddNumbersList_03 = sc.parallelize(numbersList_03)

In [38]:
rddNumbersList_03.collect()

[124, 901, 652, 102, 397, 124, 901, 652]

In [39]:
rddNumbersList_03.reduce(lambda n1, n2: n1 + n2)

3853

**reduce() - Finding the line with fewest characters**

In [40]:
rddCars.reduce(lambda word_01, word_02: word_01 if len(word_01) < len(word_02) else word_02)

'bmw,gas,std,two,sedan,rwd,six,182,5400,16,22,41315'

**reduce() - Average city MPG using a custom function**

In [41]:
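def getMpg(line):

    # reduce() feeds partial sums back in, so integers pass through unchanged.
    if(isinstance(line, int)):
        return line

    attrList = line.split(',')

    # MPG-CITY is column 9; the header value is not numeric and counts as 0.
    return int(attrList[9]) if attrList[9].isdigit() else 0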

In [42]:
round(rddCars.reduce(lambda avg_01, avg_02: getMpg(avg_01) + getMpg(avg_02)) / (rddCars.count() - 1), 2)

25.15
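The reduce above works because getMpg accepts both raw lines and partial sums. An alternative that avoids the type check is to map each line to a (sum, count) pair first and then reduce with a function that is associative and commutative. The sketch below is plain Python using `functools.reduce` on three rows copied from the dataset, not Spark code; the helper names are hypothetical.

```python
from functools import reduce

lines = [
    'MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE',
    'subaru,gas,std,two,hatchback,fwd,four,69,4900,31,36,5118',
    'chevrolet,gas,std,two,hatchback,fwd,three,48,5100,47,53,5151',
    'mazda,gas,std,two,hatchback,fwd,four,68,5000,30,31,5195',
]

MPG_CITY = 9  # column position taken from the header row


def to_sum_count(line):
    """Map one line to (mpg, 1); non-numeric rows such as the header give (0, 0)."""
    value = line.split(',')[MPG_CITY]
    return (int(value), 1) if value.isdigit() else (0, 0)


def add_pairs(a, b):
    """Combine two (sum, count) pairs; the order of combination does not matter."""
    return (a[0] + b[0], a[1] + b[1])


total, count = reduce(add_pairs, map(to_sum_count, lines))
print(round(total / count, 2))  # 36.0
```

Counting inside the pair also removes the need for the separate count() call and the manual `- 1` for the header.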

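**takeSample() - Random sample; the first argument (True) samples with replacement, so an element can appear more than once**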

In [43]:
rddTeams_01 = sc.parallelize(['Lakers', 'Bulls', '76ers', 'Celtics', 'Spurs', 'Mavericks', 'Bucks'])

In [44]:
rddTeams_01.takeSample(True, 3)

['Mavericks', 'Lakers', 'Bulls']

In [45]:
rddTeams_01.takeSample(True, 3)

['76ers', 'Bulls', 'Bulls']
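The repeated 'Bulls' in the second result is a consequence of sampling with replacement. The difference can be sketched with the standard `random` module (plain Python, not Spark code); the fixed seed is an assumption made so that the sketch is repeatable. takeSample() also accepts an optional seed as its third argument.

```python
import random

teams = ['Lakers', 'Bulls', '76ers', 'Celtics', 'Spurs', 'Mavericks', 'Bucks']
rng = random.Random(42)  # fixed seed so repeated runs give the same samples

# With replacement: every draw picks from the full list, so repeats are possible.
with_replacement = rng.choices(teams, k=3)

# Without replacement: a picked element is not available for later draws.
without_replacement = rng.sample(teams, k=3)

print(len(with_replacement))          # 3
print(len(set(without_replacement)))  # 3: never contains a repeat
```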

**countByKey()**

In [46]:
rddTeams_02 = sc.parallelize(['Lakers', 'Bulls', '76ers', 'Celtics', 'Bulls', 'Mavericks', 'Celtics'])

In [47]:
rddTeams_02.map(lambda key: (key, 1)).countByKey().items()

dict_items([('Lakers', 1), ('Bulls', 2), ('76ers', 1), ('Celtics', 2), ('Mavericks', 1)])
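countByKey() ignores the values and counts how many pairs share each key, returning the result to the driver as a dictionary. A plain-Python analogy using `collections.Counter` (not Spark code; the variable names are illustrative):

```python
from collections import Counter

teams = ['Lakers', 'Bulls', '76ers', 'Celtics', 'Bulls', 'Mavericks', 'Celtics']

# Same shape as the RDD after map(lambda key: (key, 1)).
pairs = [(team, 1) for team in teams]

# Count occurrences of each key; the values are not used.
counts = Counter(key for key, _ in pairs)

print(dict(counts))
# {'Lakers': 1, 'Bulls': 2, '76ers': 1, 'Celtics': 2, 'Mavericks': 1}
```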