<h2>RDD Creation</h2>
<p>
You can create an RDD in either of two ways:
<ul>
<li>
parallelizing an existing collection
</li>
<li>
referencing a dataset in an external storage system (HDFS, S3 and many more). 
</li>
</ul>
</p>
<p>
<strong>
Before creating an RDD, initialize a SparkSession using the builder pattern defined in the SparkSession class.
</strong>
</p>
<p height="200" width="100%" style="background:black;font-size:20px;color:white">
# Imports</br>
from pyspark.sql import SparkSession</br>
</br>
# Create SparkSession</br>
spark = (SparkSession.builder</br>
      .master("local[1]")</br>
      .appName("SparkByExamples.com")</br>
      .getOrCreate())</br>
</p>
<p>
<strong>
master():
</strong>
When running on a cluster, pass your cluster's master URL as the argument to master().</br> Typically this is yarn (Yet Another Resource Negotiator) or a Mesos master URL, depending on your cluster setup.
</p>
<p>
<strong>
local[x]:
</strong>
When running in local mode, specify ‘local[x]’, where ‘x’ is an integer greater than 0 that sets the number of worker threads. This value also becomes the default number of partitions for RDDs created with parallelize().</br> Ideally, set ‘x’ to match the number of CPU cores available on your system.
</p>
<p>
<strong>
appName():
</strong>
Sets the name of your application, which is shown in the Spark web UI.
</p>
<p>
<strong>
getOrCreate():
</strong>
Returns the existing SparkSession if one already exists; otherwise, creates a new one.
</p>
<h2>Using sparkContext.parallelize()</h2>
<p>
Using the <strong>parallelize()</strong> function of SparkContext (sparkContext.parallelize()), you can create an RDD. This function distributes an existing collection from your driver program to form a parallelized RDD.</br> Use this method when the data is already in memory, whether it was loaded from a file or from a database. All of the data must be present in the driver program before the RDD is created.
</p>
<img width="200" height="300" src="parallelize.png">
<p height="50" width="100%" style="background:black;font-size:20px;color:white">
# Create RDD from parallelize</br>  
data = [1,2,3,4,5,6,7,8,9,10,11,12]</br>
rdd = spark.sparkContext.parallelize(data)</br>
</p>
<h2>Using sparkContext.textFile()</h2>
<p>
Use the textFile() method to read a .txt file into an RDD. Each line of the file becomes one element of the RDD.
</p>
<p height="50" width="100%" style="background:black;font-size:20px;color:white">
# Create RDD from external Data source</br>
rdd2 = spark.sparkContext.textFile("/path/textFile.txt")</br>
</p>
<h2>Using sparkContext.wholeTextFiles()</h2>
<p>
The <strong>wholeTextFiles()</strong> function returns a pair RDD in which each key is a file path and each value is the full content of that file.
</p>
<p height="50" width="100%" style="background:black;font-size:20px;color:white">
# Read an entire file into an RDD as a single record.</br>
rdd3 = spark.sparkContext.wholeTextFiles("/path/textFile.txt")</br>
</p>
<h2>Using sparkContext.emptyRDD()</h2>
<p>
Using the <strong>emptyRDD()</strong> method on sparkContext, you can create an RDD that has no data and no partitions.
</p>
<p height="100" width="100%" style="background:black;font-size:20px;color:white">
# Create an empty RDD with no partitions</br>
rdd = spark.sparkContext.emptyRDD()</br>
</br>
# Note: the typed form emptyRDD[String] is Scala syntax.</br>
# PySpark's emptyRDD() takes no type parameter.</br>
</p>
<h2>Creating an empty RDD with partitions</h2>
<p height="50" width="100%" style="background:black;font-size:20px;color:white">
# Create an empty RDD with partitions</br>
rdd2 = spark.sparkContext.parallelize([], 10)  # This creates 10 partitions</br>
</p>