# RDD Creation 

## Overview 

In [1]:
from pyspark import SparkContext, SparkConf

In [3]:
conf = SparkConf().setAppName('Simple App').setMaster('local')
sc = SparkContext(conf=conf)

###### The master URL passed to Spark can be in one of the following formats:
    
| Master URL | Meaning |
| --- | --- |
| `local` | Run Spark locally with one worker thread (i.e. no parallelism at all). |
| `local[K]` | Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). |
| `yarn` | Connect to a YARN cluster in client or cluster mode, depending on the value of `--deploy-mode`. The cluster location is read from the `HADOOP_CONF_DIR` or `YARN_CONF_DIR` variable. |
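
The `local[K]` form suggests matching K to the number of cores on the machine. A minimal sketch of building that master string with the standard library follows. `local_master` is a hypothetical helper, not part of PySpark, and `os.cpu_count()` reports logical cores, which may differ from the physical core count.

```python
import os


def local_master(threads=None):
    """Return a `local[K]` master URL (hypothetical helper, not a PySpark API).

    When `threads` is None, K falls back to the logical core count
    reported by the operating system, or 1 if that is unavailable.
    """
    k = threads if threads is not None else (os.cpu_count() or 1)
    if k < 1:
        raise ValueError("thread count must be at least 1")
    return f"local[{k}]"


master = local_master()
print(master)
```

The resulting string could be passed to `setMaster` in place of `'local'` in the configuration cell above.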

###### Note: Spark applications on YARN can be launched in two deploy modes. In cluster mode, the Spark driver runs inside an application master process managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is used only to request resources from YARN.

### Resilient Distributed Datasets (RDDs)

#### Parallelized Collections

In [7]:
# Create a parallelized collection holding the numbers 1 to 5
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
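
To give a rough picture of what parallelizing a collection means, the sketch below cuts a list into contiguous slices in plain Python, one slice per partition. It is a simplified model that runs without Spark. `slice_into_partitions` and the choice of 2 partitions are assumptions for illustration, and Spark's actual distribution of elements may differ.

```python
def slice_into_partitions(items, num_partitions):
    """Cut `items` into `num_partitions` contiguous slices (illustrative only)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    n = len(items)
    # Slice i covers the index range [i * n // p, (i + 1) * n // p).
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


data = [1, 2, 3, 4, 5]
partitions = slice_into_partitions(data, 2)
print(partitions)  # [[1, 2], [3, 4, 5]]
```

Each slice stands in for the portion of the data that one task would process independently.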

#### External Datasets

In [8]:
distFile = sc.textFile("EmployData.csv")

Some notes on reading files with Spark:

>If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.

>All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

>The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
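
The partition rule in the last note can be modelled with a little arithmetic. The sketch below is a simplification of that statement, not Spark's actual split computation. `estimate_partitions`, the 128 MB block size, and the 300 MB file size are assumptions chosen for illustration.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default block size


def estimate_partitions(file_size, requested=None, block_size=BLOCK_SIZE):
    """Estimate the partition count under the simplified one-per-block rule."""
    blocks = max(1, math.ceil(file_size / block_size))
    if requested is None:
        return blocks
    # A request can raise the count but never push it below the block count.
    return max(blocks, requested)


size = 300 * 1024 * 1024  # hypothetical 300 MB file
print(estimate_partitions(size))     # 3
print(estimate_partitions(size, 8))  # 8
print(estimate_partitions(size, 1))  # 3
```

In this model, asking for 8 partitions takes effect, while asking for 1 is overridden by the three blocks the file occupies.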

In [11]:
# Display the first 10 lines of the RDD using the take() action
for line in distFile.take(10):
    print(line)

employee_name,email,job_profile
James,james@gmail.com,Sr. Developer
Smith,Smith@gmail.com,Project Lead
