![image.png](attachment:image.png)



*Note: Apache Spark is a trademark of the Apache Software Foundation.*

Spark's APIs are divided into two types:


1. <a href="http://spark.apache.org/docs/latest/rdd-programming-guide.html" target=_blank>Unstructured API: RDD falls under Unstructured API</a>
2. <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html" target=_blank>Structured API: DataFrames, Datasets and SQL tables/views fall under Structured API</a>




![image.png](attachment:image.png)

# Unstructured API: Resilient Distributed Datasets (RDDs)

RDDs are a special data type, tailor-made for Apache Spark. Indeed, the first performance boost in Apache Spark came from the innovative nature of the RDD structure. An RDD can be thought of as a distributed collection of elements.

Traditional systems such as MapReduce, though distributed in nature, make little use of the main memory of the cluster: intermediate results are written to disk between steps. This makes them a poor fit for computations that reuse intermediate results, such as iterative algorithms. RDDs fill this gap by allowing intermediate results to be kept in memory.

RDDs are data structures designed to process big data efficiently. Their main properties are:

 - Distributed collection of data
 - Fault tolerance
 - Parallel operations
 - Ability to use varied data sources
 
 
Once a user submits code, a Spark application is created, which starts the driver program. The driver program is responsible for initiating the SparkContext, which is the most crucial step in the entire process.

### SparkContext and SparkConf

The SparkContext is the entry point to the Spark environment: it represents the connection to the machines of the Spark cluster and is required to run any operation.

The SparkContext establishes a connection to a Spark execution environment. Through it you can create Spark jobs and access Spark services until the SparkContext is stopped. It does not execute the code itself; it turns each job into an execution plan of stages and tasks, which are then run by the executors.

SparkConf, on the other hand, is used to configure the working environment in Spark. Two settings are mandatory and have to be defined before any Spark application can run:
 - *spark.master*
 - *spark.app.name*
 
*spark.master* sets the cluster manager that Spark connects to. You could use Spark Standalone, Hadoop YARN or Apache Mesos as your cluster manager within the Spark environment.

*spark.app.name* assigns a name to the set of operations that you submit to the Spark environment. Naming the Spark application helps you identify your Spark jobs, which is useful when you want to check the status of submitted jobs or analyse them later.

Other attributes that you can configure using SparkConf are environment specifications such as the number of executors, the memory allocated to different machines, etc.

In [None]:
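# Loading SparkContext and SparkConf from the pyspark library
# Setting the application name and cluster mode using SparkConf

from pyspark import SparkContext, SparkConf

# The master URL for YARN is the lowercase string "yarn"
conf = SparkConf().setAppName("Creating RDD").setMaster("yarn")

# getOrCreate is called on the class, so the conf is applied when the context is created
sc = SparkContext.getOrCreate(conf=conf)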

In [None]:
# Checking the SparkContext object

sc

## Creating RDDs

There are two ways to create RDDs:

 - *parallelize method*: 
     This method is helpful when you have an existing collection of data in the driver program.
     
 - *textFile method*: 
     This method is used when you are loading a text dataset stored in an external storage system, such as the local filesystem, a shared filesystem, HDFS, or any other Hadoop-supported filesystem. Each line of the file becomes one element of the RDD.

In [None]:
# Creating an RDD using the parallelize method

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)  # distributes the driver-side list as an RDD

In [None]:
# Creating an RDD using the textFile method

distFile = sc.textFile("/common_folder/pyspark_data/flight-data/json/2015-summary.json")  # each line of the file becomes one RDD element

In [None]:
# Printing the created RDD (collect brings every element back to the driver)

distData.collect()

In [None]:
# data type

type(distData)

In [None]:
# data type

type(distFile)

In [None]:
# Printing the first 5 elements of the RDD

distData.take(5)