<a href="https://colab.research.google.com/github/besherh/BigDataManagement/blob/main/SparkNotebooks/helloworld_spark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Setting up PySpark in Colab
Spark is written in the Scala programming language and requires the Java Virtual Machine (JVM) to run. Therefore, our first task is to install Java.



In [None]:
!apt-get install -y openjdk-8-jdk-headless

Reading package lists... Done
Building dependency tree       
Reading state information... Done
openjdk-8-jdk-headless is already the newest version (8u282-b08-0ubuntu1~18.04).
0 upgraded, 0 newly installed, 0 to remove and 13 not upgraded.


Next, we will install Apache Spark 3.0.2 with Hadoop 2.7.


In [None]:
!wget https://apache.mirrors.nublue.co.uk/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz


--2021-02-27 22:44:48--  https://apache.mirrors.nublue.co.uk/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz
Resolving apache.mirrors.nublue.co.uk (apache.mirrors.nublue.co.uk)... 141.0.161.104, 2a01:61c0:1:10:141:0:161:104
Connecting to apache.mirrors.nublue.co.uk (apache.mirrors.nublue.co.uk)|141.0.161.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 220488957 (210M) [application/octet-stream]
Saving to: ‘spark-3.0.2-bin-hadoop2.7.tgz.2’


2021-02-27 22:45:12 (8.92 MB/s) - ‘spark-3.0.2-bin-hadoop2.7.tgz.2’ saved [220488957/220488957]



Now, we just need to extract that archive.


In [None]:
!tar xf spark-3.0.2-bin-hadoop2.7.tgz


There is one last thing that we need to install, and that is the findspark library. It locates Spark on the system and makes pyspark importable as a regular library.



In [None]:
!pip install -q findspark


Now that we have installed all the necessary dependencies in Colab, it is time to set the environment paths. This will enable us to run PySpark in the Colab environment.


In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.2-bin-hadoop2.7"
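A wrong path here usually only surfaces later, when Spark fails to start. As a minimal sketch (the helper name missing_home_dirs is hypothetical, not part of findspark or PySpark), the two variables can be checked right after they are set:

```python
import os
from pathlib import Path


def missing_home_dirs(var_names=("JAVA_HOME", "SPARK_HOME")):
    """Return {variable: value} for each variable that is unset or not a directory."""
    problems = {}
    for name in var_names:
        value = os.environ.get(name)
        if value is None or not Path(value).is_dir():
            problems[name] = value
    return problems


# An empty dict means both variables point at existing directories.
print(missing_home_dirs())
```

If the result is not empty, the version in the path most likely does not match the archive that was downloaded and extracted.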


We need to locate Spark in the system. For that, we import findspark and use the findspark.init() method. Calling findspark.find() afterwards returns the Spark location that was found.

In [None]:
import findspark
findspark.init()
findspark.find()

'/content/spark-3.0.2-bin-hadoop2.7'

Now, we can import SparkSession from pyspark.sql and create a SparkSession, which is the entry point to Spark.

You can give a name to the session using appName() and add some configurations with config() if you wish.

In [None]:
from pyspark.sql import SparkSession

# Build (or reuse) a local session named "Colab" with the Spark UI on port 4050.
spark = (
    SparkSession.builder
    .master("local")
    .appName("Colab")
    .config("spark.ui.port", "4050")
    .getOrCreate()
)

Finally, print the SparkSession variable.

In [None]:
spark


#optional
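If you want to view the Spark UI, you need a few more lines of code to create a public URL for the UI page. The commands below start an ngrok tunnel to port 4050, the Spark UI port configured above, and then query ngrok's local API on port 4040 for the tunnel details.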

In [None]:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
get_ipython().system_raw('./ngrok http 4050 &')
!curl -s http://localhost:4040/api/tunnels

--2021-02-27 22:50:34--  https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 3.223.68.239, 3.209.148.13, 3.229.59.32, ...
Connecting to bin.equinox.io (bin.equinox.io)|3.223.68.239|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13773305 (13M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip.1’


2021-02-27 22:50:36 (17.1 MB/s) - ‘ngrok-stable-linux-amd64.zip.1’ saved [13773305/13773305]

Archive:  ngrok-stable-linux-amd64.zip
replace ngrok? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ngrok                   
{"tunnels":[{"name":"command_line","uri":"/api/tunnels/command_line","public_url":"https://8b97c6c3e3e9.ngrok.io","proto":"https","config":{"addr":"http://localhost:4050","inspect":true},"metrics":{"conns":{"count":0,"gauge":0,"rate1":0,"rate5":0,"rate15":0,"p50":0,"p90":0,"p95":0,"p99":0},"http":{"count":0,"rate1":0,"rate5":0,"rate15":0,"p50":0,"p90":0,"p95":0,"p99":0}
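The response from the tunnels API is JSON, and the address to open in a browser is the public_url field of each tunnel. A minimal sketch of pulling it out, assuming the response has the shape shown above (the function name public_urls and the sample payload are made up for illustration):

```python
import json


def public_urls(tunnels_json):
    """Return the public_url of every tunnel in an ngrok /api/tunnels response."""
    payload = json.loads(tunnels_json)
    return [tunnel["public_url"] for tunnel in payload.get("tunnels", [])]


# Hypothetical, shortened response with the same shape as the real output.
sample = '{"tunnels": [{"name": "command_line", "public_url": "https://example.ngrok.io", "proto": "https"}]}'
print(public_urls(sample))
```

In the notebook, the string passed to public_urls would be the text returned by the curl command rather than this sample.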

#Loading data into PySpark
The SparkContext, sc, is the main entry point for working with RDDs in Python. The textFile() method reads the file into a Resilient Distributed Dataset (RDD), with each line in the file being an element in the RDD collection. The path /content/shakespeare.txt specifies the location of the file in the Colab filesystem.



In [None]:
from pyspark import SparkContext

# Reuse the context that is already running, or create one if none exists.
sc = SparkContext.getOrCreate()

# Download the text from the link below, upload it to Colab, and save it as "shakespeare.txt":
# https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
lines = sc.textFile("/content/shakespeare.txt")

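We can verify the file was successfully loaded by calling the count() method, which returns the number of elements in the RDD. Because textFile() is lazy, count() is also the first point at which a wrong path would raise an error: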

In [None]:
lines.count()

124456

#Split each line into words.
Next, we will split each line into words and store them in an RDD called words. The flatMap() method iterates over every line in the RDD, and lambda line : line.split(" ") is executed on each line. The lambda notation defines an anonymous function in Python, i.e., a function defined without a name. In this case, the anonymous function takes a single argument, line, and calls split(" "), which splits the line into a list of words.

In [None]:
words = lines.flatMap(lambda line : line.split(" "))
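One detail of split(" ") is worth seeing in plain Python, since it probably explains the large count for the empty string '' in the results at the end of this notebook. The snippet below runs without Spark; the sample line is made up:

```python
line = "To be,  or not"  # note the two spaces after the comma

# split(" ") cuts at every single space, so consecutive spaces yield empty strings.
print(line.split(" "))   # ['To', 'be,', '', 'or', 'not']

# split() with no argument splits on runs of whitespace and drops empty strings.
print(line.split())      # ['To', 'be,', 'or', 'not']

# A blank line behaves differently too: one empty "word" versus no words at all.
blank = ""
print(blank.split(" "), blank.split())   # [''] []
```

If empty strings should not be counted as words, calling line.split() inside the lambda is one possible alternative.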


#Assign initial count value to each word. 
Next, we will create tuples for each word with an initial count of 1:
The map() method iterates over every word in the words RDD, and the lambda expression creates a tuple with the word and a value of 1.
Note that in the previous step we used flatMap, but here we used map. In this step, we want to create a tuple for every word, i.e., we have a one-to-one mapping between the input words and output tuples. In the previous step, we wanted to split each line into a set of words, i.e., there is a one-to-many mapping between input lines and output words. In general, use map when the number of inputs to number of outputs is one-to-one, and flatMap for one-to-many (or one-to-none).


In [None]:
tuples = words.map(lambda word : (word,1))
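The difference between map and flatMap can be mimicked with ordinary Python lists. This is only an analogue of the RDD behaviour, not Spark code, and the names lines_demo, mapped, flat_mapped, and tuples_demo are made up for the illustration:

```python
from itertools import chain

lines_demo = ["to be or", "not to be"]  # hypothetical stand-in for the lines RDD

# map analogue: one output per input, so we get one list of words per line.
mapped = [line.split(" ") for line in lines_demo]
print(mapped)       # [['to', 'be', 'or'], ['not', 'to', 'be']]

# flatMap analogue: the per-line lists are flattened into one sequence of words.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines_demo))
print(flat_mapped)  # ['to', 'be', 'or', 'not', 'to', 'be']

# map analogue again: exactly one (word, 1) tuple per word.
tuples_demo = [(word, 1) for word in flat_mapped]
print(tuples_demo[:2])  # [('to', 1), ('be', 1)]
```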

#Sum all word count values.
We can sum the counts in the tuples for each word into a new RDD called counts:

In [None]:
counts = tuples.reduceByKey(lambda a,b: (a + b))

The reduceByKey() method calls the lambda expression for all the tuples with the same word. The lambda expression has two arguments, a and b, which are the count values in two tuples.
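To make the pairwise combination concrete, here is a plain-Python sketch of what reduceByKey() computes. The helper reduce_by_key is a hypothetical single-machine analogue, not Spark's implementation, which combines values within and across partitions:

```python
def reduce_by_key(pairs, func):
    """Combine the values of pairs that share a key using a two-argument function."""
    result = {}
    for key, value in pairs:
        # The first value for a key is stored as-is; later ones are folded in with func.
        result[key] = func(result[key], value) if key in result else value
    return result


pairs_demo = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
print(reduce_by_key(pairs_demo, lambda a, b: a + b))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Because Spark may combine values in any order, the function passed to reduceByKey() should be associative and commutative, which addition is.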

#Write word counts to text file.
We can write the counts RDD to a directory named outputDir in the current folder:

In [1]:
counts.coalesce(1).saveAsTextFile("outputDir")


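The coalesce() method combines all the RDD partitions into a single partition since we want a single output file, and saveAsTextFile() writes the RDD to the specified location. That location is a directory, and each partition is written to its own part file inside it, which is why the result appears in part-00000.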
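Running the save cell a second time typically fails, because saveAsTextFile() refuses to write into an output directory that already exists. A minimal sketch of clearing the old output first, assuming the directory is on the local Colab filesystem (the helper name clear_output_dir is hypothetical):

```python
import shutil
from pathlib import Path


def clear_output_dir(path):
    """Delete path if it is an existing directory; return True when something was removed."""
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Calling clear_output_dir("outputDir") before the saveAsTextFile() line makes the cell safe to rerun, at the cost of discarding the previous results.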

#View result

In [None]:
!cat "/content/outputDir/part-00000"

('This', 1105)
('is', 7851)
('the', 23242)
('100th', 1)
('Etext', 4)
('file', 14)
('presented', 11)
('by', 2824)
('Project', 13)
('Gutenberg,', 1)
('and', 18297)
('in', 9576)
('cooperation', 1)
('with', 6722)
('World', 5)
('Library,', 2)
('Inc.,', 1)
('from', 2283)
('their', 1934)
('Library', 4)
('of', 15544)
('Future', 3)
('Shakespeare', 45)
('CDROMS.', 1)
('', 517065)
('Gutenberg', 11)
('often', 116)
('releases', 1)
('Etexts', 3)
('that', 7531)
('are', 2917)
('NOT', 225)
('placed', 10)
('Public', 1)
('Domain!!', 1)
('*This', 1)
('has', 326)
('certain', 116)
('copyright', 7)
('implications', 1)
('you', 9081)
('should', 1387)
('read!*', 1)
('<<THIS', 220)
('ELECTRONIC', 442)
('VERSION', 221)
('OF', 1490)
('THE', 342)
('COMPLETE', 223)
('WORKS', 221)
('WILLIAM', 244)
('SHAKESPEARE', 223)
('IS', 445)
('COPYRIGHT', 221)
('1990-1993', 221)
('BY', 663)
('WORLD', 221)
('LIBRARY,', 221)
('INC.,', 221)
('AND', 672)
('PROVIDED', 222)
('PROJECT', 222)
('GUTENBERG', 221)
('ETEXT', 223)
('ILLINOIS