In [26]:
# install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# install spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

# unzip the spark file to the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz

# point JAVA_HOME and SPARK_HOME at the installed locations
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"


# install findspark using pip
!pip install -q findspark
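
Before importing PySpark, it can help to confirm that both environment variables point at real directories. The following is a minimal sketch; `missing_dirs` is a hypothetical helper name, not part of Spark or findspark.

```python
import os

def missing_dirs(var_names, environ=os.environ):
    """Return the names of variables that are unset or do not point at a directory."""
    missing = []
    for name in var_names:
        path = environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# Report which of the two variables set above still need attention.
print(missing_dirs(["JAVA_HOME", "SPARK_HOME"]))
```

An empty list means both paths exist; any name that is printed should be fixed before calling `findspark.init()`.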

## Question 1

* SparkContext is the original Spark entry point. It holds the configuration for running a Spark application and represents the connection to the cluster. It is critical because it manages memory and IO operations as well as task execution across the cluster's worker nodes. It was the main entry point in Spark 1.x.

* SparkSession, introduced in Spark 2.0, is a unified entry point that wraps a SparkContext and adds support for the DataFrame and Dataset APIs. DataFrames carry a schema, so Spark can optimize queries on them, which usually makes them more efficient and more convenient than raw RDDs for structured and semi-structured data.

* Multiple SparkSessions can run on the same SparkContext. SparkContext and SparkSession are important because they manage and optimize job execution across the nodes of a cluster.

* They also help in building scalable pipelines and support a variety of data sources, such as HDFS.

## Question 2

In [27]:
import findspark
findspark.init()  # locate Spark via SPARK_HOME and add it to sys.path
import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext

In [35]:
import os
import json


from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

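# the SparkContext is available from the session
sc = spark.sparkContext

# read the file line by line, then parse each line as one JSON object
emp = sc.textFile(os.environ.get('SPARK_HOME') + '/examples/src/main/resources/employees.json')
data = emp.map(json.loads)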
data.collect()

[{'name': 'Michael', 'salary': 3000},
 {'name': 'Andy', 'salary': 4500},
 {'name': 'Justin', 'salary': 3500},
 {'name': 'Berta', 'salary': 4000}]
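
This works because `employees.json` holds one JSON object per line, so each line can be parsed on its own. The sketch below shows the same per-line parsing on a plain Python list; the sample lines are reconstructed from the output above rather than read from the file.

```python
import json

# Sample lines reconstructed from the output above; Spark reads the real file.
lines = [
    '{"name": "Michael", "salary": 3000}',
    '{"name": "Andy", "salary": 4500}',
]

# Local equivalent of emp.map(json.loads).collect().
records = [json.loads(line) for line in lines]
print(records)
```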

## Question 3

In [38]:
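# HiveContext is deprecated since Spark 2.0;
# SparkSession.builder.enableHiveSupport() is the current equivalent
hc = HiveContext(sc)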

data_df = hc.read.json(os.environ.get('SPARK_HOME') + '/examples/src/main/resources/employees.json')
data_df.printSchema()

root
 |-- name: string (nullable = true)
 |-- salary: long (nullable = true)



## Question 3a.

In [39]:
# registerTempTable is deprecated; createOrReplaceTempView is its replacement
data_df.createOrReplaceTempView("employee")

In [40]:
data_df.show()

+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
+-------+------+



## Question 3b.

In [41]:
# sort by the column name rather than its position, highest salary first
df2 = spark.sql("select * from employee order by salary desc")

In [42]:
df2.show()

+-------+------+
|   name|salary|
+-------+------+
|   Andy|  4500|
|  Berta|  4000|
| Justin|  3500|
|Michael|  3000|
+-------+------+



## Question 4a.

In [43]:
import csv

# parse each line as one CSV record; the result is an RDD of lists, not a DataFrame
data_df = sc.textFile(os.environ.get('SPARK_HOME') + '/examples/src/main/resources/people.txt')\
    .map(lambda line: list(csv.reader([line]))[0])

In [44]:
rows = data_df.collect()

## Question 4b.

In [45]:
for row in rows:
    print(row)

['Michael', ' 29']
['Andy', ' 30']
['Justin', ' 19']
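
The ages above keep the space that follows each comma in `people.txt`. One way to clean this up, sketched here on plain Python strings that mirror the three printed records, is to pass `skipinitialspace=True` to `csv.reader` and convert the age to an integer.

```python
import csv

# Sample lines mirroring the three records printed above.
lines = ["Michael, 29", "Andy, 30", "Justin, 19"]

# skipinitialspace=True drops the blank that follows each comma.
reader = csv.reader(lines, skipinitialspace=True)
people = [(name, int(age)) for name, age in reader]
print(people)
```

The same `csv.reader` option can be used inside the `map` call on the RDD.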


## Question 5

* SQLContext is the entry point to Spark SQL, while HiveContext extends it with access to Hive. Hive stores its tables on a file system such as HDFS, either as plain text or in column-oriented formats such as ORC and Parquet.

* Advantages of HiveContext: It is extremely simple to import data from Hive into Spark applications. It has a querying interface that is similar to SQL and thus simple to use. It can read and write a wide range of file formats.

* Advantages of SQLContext: It has a SQL-like interface for querying data, which makes it simple to use. It supports a wide range of file formats.

* Disadvantages of HiveContext and SQLContext: They give less low-level control than RDDs, so they can be slower than hand-tuned RDD code for some workloads. They are a poor fit for unstructured data.

* Advantages of RDDs: They are extremely flexible and powerful. They support a wide range of file formats. They provide control over the pipeline, which can be used to optimize performance. Hand-tuned RDD code can be faster than HiveContext or SQLContext queries, although it does not benefit from Spark SQL's query optimizer.

* Disadvantages of RDDs: They are harder to work with, and they take longer to optimize for performance.

## References
* https://towardsdatascience.com/pyspark-on-google-colab-101-d31830b238be
* https://spark.apache.org/docs/latest/api/python/getting_started/index.html