## RDDs

In [1]:
import findspark
findspark.init('/home/rich/spark/spark-2.4.3-bin-hadoop2.7')
import pandas as pd
from pyspark import SparkConf, SparkContext

## Spark Context

In [2]:
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf = conf)

In [3]:
num = range(1,101)
spark_data = sc.parallelize(num)

In [4]:
spark_data

PythonRDD[1] at RDD at PythonRDD.scala:53

## RDDs from Parallelized collections

In [7]:
# Create an RDD from a list of words
RDD = sc.parallelize(["Spark", "is", "a", "framework", "for", "Big Data processing"])

# Print out the type of the created object
print("The type of RDD is", type(RDD))

The type of RDD is <class 'pyspark.rdd.RDD'>


## RDDs from External Datasets

In [23]:
file_path = "./data/hello.txt"

In [24]:
# Create a fileRDD from file_path
fileRDD = sc.textFile(file_path)

# Check the type of fileRDD
print("The file type of fileRDD is", type(fileRDD))

The file type of fileRDD is <class 'pyspark.rdd.RDD'>


## Partitions in data

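SparkContext's textFile() method takes an optional second argument, minPartitions, that sets the minimum number of partitions. It is a lower bound rather than an exact count, so Spark may create more partitions than requested.

Increasing the number of partitions can improve performance by letting more tasks run in parallel, although very small partitions add scheduling overhead.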

In [25]:
# Check the number of partitions in fileRDD
print("Number of partitions in fileRDD is", fileRDD.getNumPartitions())

# Create a fileRDD_part from file_path with 5 partitions
fileRDD_part = sc.textFile(file_path, minPartitions = 5)

# Check the number of partitions in fileRDD_part
print("Number of partitions in fileRDD_part is", fileRDD_part.getNumPartitions())

Number of partitions in fileRDD is 1
Number of partitions in fileRDD_part is 6
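
To make the idea of a partition concrete, here is a plain-Python sketch (no Spark required) that cuts a collection into contiguous, near-equal slices. The helper `split_into_partitions` is hypothetical and only illustrates the concept. How Spark splits a text file also depends on the file's size and block boundaries, which is one reason the output above reports 6 partitions when a minimum of 5 was requested.

```python
def split_into_partitions(items, num_partitions):
    """Hypothetical helper: cut a list into contiguous, near-equal slices."""
    items = list(items)
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]

# Same data as spark_data above: the integers 1..100
partitions = split_into_partitions(range(1, 101), 4)
partition_sizes = [len(p) for p in partitions]
print(partition_sizes)
```

Each slice could then be processed by a separate worker, which is where the parallel speed-up comes from.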


## Map and Collect

The map() transformation takes in a function and applies it to each element in the RDD.

In [26]:
numbRDD = spark_data

# map transformation to cube numbers
cubedRDD = numbRDD.map(lambda x: x**3)

# Collect the results - use only on small datasets
numbers_all = cubedRDD.collect()

for numb in numbers_all[:10]:
    print(numb)

1
8
27
64
125
216
343
512
729
1000
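
Spark transformations such as map() are lazy: nothing is computed until an action such as collect() asks for results. Python's built-in `map` behaves in a similar way, so the sketch below uses it as a rough local analogy (it is not Spark code). The `calls` list is an illustrative device for observing when the function actually runs.

```python
calls = []

def cube(x):
    calls.append(x)  # record each call so we can see when work happens
    return x ** 3

lazy_cubes = map(cube, range(1, 101))  # builds a lazy iterator, computes nothing
calls_before = len(calls)

# Pulling 10 items forces exactly 10 calls to cube()
first_ten = [next(lazy_cubes) for _ in range(10)]
calls_after = len(calls)

print(calls_before, calls_after)
print(first_ten)
```

In the notebook cell above, collect() materializes all 100 cubes before the slice `[:10]` is applied. That is fine at this size, but it is the reason collect() should be kept to small datasets.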


## Filter and Count

The filter() transformation returns a new RDD containing only the elements for which the given function returns True. It is useful for filtering large datasets based on a keyword.

In [27]:
# Filter the fileRDD
fileRDD_filter = fileRDD.filter(lambda line: 'Spark' in line)

print("Total number of lines with Spark keyword is:", fileRDD_filter.count())

for line in fileRDD_filter.take(4):
    print(line)

Total number of lines with Spark keyword is: 2
third Spark
Spark sixth
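
The same filter, count, and take steps can be sketched in plain Python to show two details of the cell above: the `in` test is case-sensitive, and asking for 4 items returns fewer when fewer match. The `lines` list is made-up sample data, not the contents of hello.txt.

```python
from itertools import islice

# Made-up sample lines (an assumption, not the real file)
lines = [
    "first line",
    "second line",
    "third Spark",
    "a lowercase spark line",
    "fifth line",
    "Spark sixth",
]

matching = [line for line in lines if "Spark" in line]  # like filter()
match_count = len(matching)                             # like count()
first_four = list(islice(matching, 4))                  # like take(4)

print(match_count)
print(first_four)
```

To match regardless of case, one option is to compare against `line.lower()` instead.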


## reduceByKey and Collect

One of the most popular pair RDD transformations is reduceByKey(), which operates on key-value (k, v) pairs and merges the values for each key using an associative and commutative reduce function.

In [28]:
# Create a pair RDD named Rdd from key-value tuples
Rdd = sc.parallelize([(1,2),(3,4),(3,6),(4,5)])

# Apply reduceByKey() operation on Rdd
Rdd_Reduced = Rdd.reduceByKey(lambda x, y: x+y)

# Iterate over the result and print the output
for num in Rdd_Reduced.collect():
    print("Key {} has {} Counts".format(num[0], num[1]))

Key 1 has 2 Counts
Key 3 has 10 Counts
Key 4 has 5 Counts
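
Why does the reduce function need to be associative? Values for a key may be combined within each partition first and the partial results merged afterwards, so the grouping of operations must not change the answer. The plain-Python sketch below imitates that two-stage merge. `reduce_by_key` is a hypothetical helper written for illustration, not part of PySpark.

```python
def reduce_by_key(pairs, func):
    """Hypothetical helper: merge the values of each key with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

def add(x, y):
    return x + y

pairs = [(1, 2), (3, 4), (3, 6), (4, 5)]

# Reduce everything in a single pass
single_pass = reduce_by_key(pairs, add)

# Reduce two "partitions" separately, then merge the partial results
part_a = reduce_by_key(pairs[:2], add)
part_b = reduce_by_key(pairs[2:], add)
two_stage = reduce_by_key(list(part_a.items()) + list(part_b.items()), add)

print(single_pass)
print(two_stage == single_pass)
```

With addition both routes agree. A function such as subtraction would not give a stable result under this kind of regrouping.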


## sortByKey and Collect

sortByKey() sorts a pair RDD by its keys.

In [29]:
# Sort Rdd_Reduced by key in descending order
Rdd_Reduced_Sort = Rdd_Reduced.sortByKey(ascending=False)

# Iterate over the result
for num in Rdd_Reduced_Sort.collect():
    print("Key {} has {} Counts".format(num[0], num[1]))

Key 4 has 5 Counts
Key 3 has 10 Counts
Key 1 has 2 Counts


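## countByKey

Use the pair RDD Rdd created earlier and count the number of elements for each key with countByKey(). Unlike the transformations above, countByKey() is an action: it returns the counts to the driver as a dictionary.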

In [32]:
# Count the elements per key with the countByKey() action
total = Rdd.countByKey()

print("The type of total is", type(total))

# Iterate over the total and print the output
for k, v in total.items():
    print("key", k, "has", v, "counts")

The type of total is <class 'collections.defaultdict'>
key 1 has 1 counts
key 3 has 2 counts
key 4 has 1 counts


In [34]:
for num in Rdd.collect():
    print(num)

(1, 2)
(3, 4)
(3, 6)
(4, 5)
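
It is easy to confuse countByKey() with reduceByKey(): the first counts how many elements share a key, while the second combines their values. The plain-Python sketch below shows the difference on the same four pairs. It is a local analogy using `collections.Counter`, not Spark code.

```python
from collections import Counter

pairs = [(1, 2), (3, 4), (3, 6), (4, 5)]

# Analogue of countByKey(): how many elements carry each key
key_counts = Counter(key for key, _ in pairs)

# Analogue of reduceByKey(lambda x, y: x + y): sum of the values per key
value_sums = {}
for key, value in pairs:
    value_sums[key] = value_sums.get(key, 0) + value

print(dict(key_counts))
print(value_sums)
```

Key 3 appears twice, so its count is 2 while its summed value is 10, matching the two notebook outputs above.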


## Create a base RDD and transform it

Write code that finds the most common words in the Complete Works of William Shakespeare.

1. Create a base RDD from the Complete_Shakespeare.txt file.
2. Use an RDD transformation to create a long list of words from each element of the base RDD.
3. Remove stop words from your data.
4. Create a pair RDD where each element is a tuple (w, 1), with w being a word.
5. Group the elements of the pair RDD by key (word) and add up their values.
6. Swap the keys (words) and values (counts) so that the key is the count and the value is the word.
7. Finally, sort the RDD in descending order and print the 10 most frequent words and their frequencies.
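
Before running the steps on the full text, it may help to trace them on a few lines in plain Python. The sketch below mirrors each step with a local equivalent. `sample_lines` and `sample_stop_words` are made-up illustrative data, and the comments name the RDD operation each line stands in for.

```python
from collections import Counter

# Made-up sample data (assumptions for illustration only)
sample_lines = [
    "To be or not to be",
    "that is the question",
    "to thine own self be true",
]
sample_stop_words = {"to", "or", "not", "that", "is", "the"}

words = [w for line in sample_lines for w in line.split()]       # flatMap
kept = [w for w in words if w.lower() not in sample_stop_words]  # filter
pairs = [(w, 1) for w in kept]                                   # map to (w, 1)

counts = Counter()
for w, one in pairs:                                             # reduceByKey
    counts[w] += one

swapped = [(c, w) for w, c in counts.items()]                    # swap key and value
top = sorted(swapped, key=lambda p: p[0], reverse=True)[:3]      # sortByKey, take

print(len(words), len(kept))
print(top[0])
```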


In [171]:
file_path = "./data/shake1.txt"

In [172]:
# Create a baseRDD from the file path
baseRDD = sc.textFile(file_path)

# Split the lines of baseRDD into words
splitRDD = baseRDD.flatMap(lambda x: x.split())

# Count the total number of words
print("Total number of words in splitRDD:", splitRDD.count())


Total number of words in splitRDD: 128576


## Remove stop words and reduce the dataset

After splitting the lines of the file into a long list of words with the flatMap() transformation, the next step is to remove stop words from the data.

After removing stop words, you'll create a pair RDD where each element is a tuple (k, v), with k the key and v the value. In this example, the pair RDD is composed of (w, 1), where w is a word in the RDD and 1 is its initial count. Finally, you'll combine the values that share a key using the reduceByKey() operation.

In [173]:
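# Could use the stop word list from nltk instead
import csv
stop_word_list = []
with open("./data/stopwords.txt") as csv_file:
    # With '\n' as the delimiter, each line is read as a single field
    csv_reader = csv.reader(csv_file, delimiter='\n')
    for row in csv_reader:
        if not row:
            continue  # skip blank lines, which would otherwise raise IndexError
        # Strip quotes, commas and spaces left over from the file's formatting
        word = row[0].replace("'","").replace(",","").replace(" ","")
        stop_word_list.append(word)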

In [174]:
# Keep only words whose lowercase form is not in stop_word_list
splitRDD_no_stop = splitRDD.filter(lambda x: x.lower() not in stop_word_list)

# Create a tuple of the word and 1
splitRDD_no_stop_words = splitRDD_no_stop.map(lambda w: (w, 1))

# Count the occurrences of each word (word frequency) in the pair RDD
resultRDD = splitRDD_no_stop_words.reduceByKey(lambda x, y: x + y)


## Print word frequencies

After combining the values (counts) that share a key (word), print the word frequencies using the take(N) action. collect() would also work, but it is not recommended as a best practice because it returns every element of the RDD to the driver. take(N) returns only N elements.

What if we want the top 10 words? First, swap the key (word) and value (count) so that the key is the count and the value is the word. Then sort the pair RDD by key (count) in descending order and print the top 10 words.

In [175]:
# Display the first 10 words and their frequencies
for word in resultRDD.take(10):
    print(word)

('Project', 9)
('Gutenberg', 7)
('EBook', 1)
('Complete', 3)
('Works', 3)
('William', 11)
('Shakespeare,', 1)
('Shakespeare', 12)
('eBook', 2)
('use', 38)
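
The output above lists 'EBook' and 'eBook' as separate words, and likewise 'Shakespeare,' and 'Shakespeare'. This happens because the filter lowercases a word only for the stop-word comparison and leaves the word itself unchanged. If merged counts are wanted, one possible refinement is to normalize each word before counting. The sketch below shows the idea in plain Python. `normalize` is a hypothetical helper, and the word list simply repeats a few entries seen in the output.

```python
import string
from collections import Counter

# A few words as they appear in the output above
raw_words = ["EBook", "eBook", "eBook", "Shakespeare,", "Shakespeare"]

def normalize(word):
    """Hypothetical helper: lowercase and trim surrounding punctuation."""
    return word.lower().strip(string.punctuation)

raw_counts = Counter(raw_words)
normalized_counts = Counter(normalize(w) for w in raw_words)

print(dict(raw_counts))
print(dict(normalized_counts))
```

In the notebook, the same function could be applied with a map() step before the pairs are built. Whether to do so depends on whether case and punctuation should count as differences.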


In [176]:
# Swap the keys and values 
resultRDD_swap = resultRDD.map(lambda x: (x[1], x[0]))

In [177]:
# Sort the keys in descending order
resultRDD_swap_sort = resultRDD_swap.sortByKey(ascending=False)

In [178]:
resultRDD_swap_sort.take(10)

[(1531, 'my'),
 (776, 'me'),
 (650, 'thou'),
 (574, 'thy'),
 (393, 'shall'),
 (311, 'would'),
 (295, 'good'),
 (286, 'thee'),
 (273, 'love'),
 (269, 'Enter')]

In [179]:
# Show the top 10 most frequent words and their frequencies
for word in resultRDD_swap_sort.take(10):
    print("{} has {} counts".format(word[1], word[0]))

my has 1531 counts
me has 776 counts
thou has 650 counts
thy has 574 counts
shall has 393 counts
would has 311 counts
good has 295 counts
thee has 286 counts
love has 273 counts
Enter has 269 counts
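
Swapping and fully sorting works, but only the largest few entries are needed. As a plain-Python illustration of picking the top N by count without swapping, the sketch below uses `heapq.nlargest` with a key function on (word, count) pairs. The counts are copied from the output above. PySpark's RDD API has a comparable action, takeOrdered(num, key), which may be worth exploring for the same purpose.

```python
import heapq

# (word, count) pairs taken from the results above, in arbitrary order
word_counts = [("thou", 650), ("my", 1531), ("love", 273), ("me", 776), ("thy", 574)]

# Select the three pairs with the largest counts, no swap needed
top_three = heapq.nlargest(3, word_counts, key=lambda pair: pair[1])

for word, count in top_three:
    print("{} has {} counts".format(word, count))
```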
