RDDs are created primarily in two different ways:

- parallelizing an existing collection, and
- referencing a dataset in an external storage system (HDFS, S3, and many more).

Before we look at examples, let's first initialize a SparkSession using the builder pattern defined in the SparkSession class. While initializing, we need to provide the master and the application name, as shown below. In a real application, you would pass the master through spark-submit instead of hardcoding it in the Spark application.

In [6]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
      .master("local[1]") \
      .appName("SparkByExamples.com") \
      .getOrCreate()

- master() – If you are running on a cluster, pass your cluster manager's master URL as the argument to master(). Usually it is either [yarn (Yet Another Resource Negotiator)](https://sparkbyexamples.com/hadoop/how-yarn-works/) or mesos, depending on your cluster setup.

Use local[x] when running locally rather than on a cluster. x should be an integer greater than 0; it is the number of worker threads Spark runs, which is also the default number of partitions for RDDs created with parallelize(). Ideally, x should be the number of CPU cores you have; local[*] uses all available cores.

- appName() – Used to set your application name.

- getOrCreate() – Returns the existing SparkSession object if there is one; otherwise it creates a new one.

Note: Creating a SparkSession object internally creates a SparkContext, and only one SparkContext can be active per JVM.

# Create RDD using sparkContext.parallelize()

In [9]:
#Create RDD from parallelize
# data = sc.parallelize([('Amber', 22), ('Alfred', 23), ('Skye', 4),
#                        ('Albert', 12), ('Amber', 9)])
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
rdd = spark.sparkContext.parallelize(data)

![imgs](https://i0.wp.com/sparkbyexamples.com/wp-content/uploads/2020/08/rdd-creation.png?w=1183&ssl=1)

For production applications, we mostly create RDDs from external storage systems like HDFS, S3, HBase, etc. To keep this PySpark RDD tutorial simple, we create RDDs from files on the local system or from a Python list.

## parallelize can hold mixed structures

RDDs are *schema-less* data structures.

In [12]:
sc = spark.sparkContext  # shorthand used in the cells below
data_heterogenous = sc.parallelize([('Ferrari', 'fast'), {
    'Porsche': 100000
}, ['Spain', 'visited', 4504]]).collect()
data_heterogenous

[('Ferrari', 'fast'), {'Porsche': 100000}, ['Spain', 'visited', 4504]]

Because collect() returns a plain Python list, you can access the elements as follows.

Accessing the tuple:

In [13]:
data_heterogenous[0]

('Ferrari', 'fast')

Accessing a value in the dictionary:

In [14]:
data_heterogenous[1]['Porsche']

100000

# Create RDD using sparkContext.textFile()
Using the textFile() method we can read a text (.txt) file into an RDD; each line of the file becomes one element of the RDD.

# Create RDD from external Data source

In [11]:
data_from_file = sc.textFile('/Users/liuning/data/VS14MORT.txt.gz', 4)
data_from_file.take(2)

['                   1                                          2101  M1087 432311  4M4                2014U7CN                                    I64 238 070   24 0111I64                                                                                                                                                                           01 I64                                                                                                  01  11                                 100 601',
 '                   1                                          2101  M1058 371708  4D3                2014U7CN                                    I250214 062   21 0311I250 61I272 62E669                                                                                                                                                            03 I250 E669 I272                                                                                       01  11                                 100 601']

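Note that to execute the code above you will have to change the path to wherever the data is stored on your machine. The second argument (4) is the minimum number of partitions to split the file into. The dataset can be downloaded from http://tomdrabas.com/data/VS14MORT.txt.gz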

# Create RDD using sparkContext.wholeTextFiles()
The wholeTextFiles() function returns a pair RDD in which the key is the file path and the value is the file content.

# Reads each file into the RDD as a single record.
rdd3 = spark.sparkContext.wholeTextFiles("/path/textFile.txt")

Besides text files, we can also create RDDs from CSV, JSON, and other formats.

# Create empty RDD using sparkContext.emptyRDD()
Using the emptyRDD() method on sparkContext we can create an RDD with no data. This method creates an empty RDD with no partitions.

# Creates empty RDD with no partition   

In [22]:
rdd = spark.sparkContext.emptyRDD()  # emptyRDD is a method, so it must be called
rdd.isEmpty()
# rddString = spark.sparkContext.emptyRDD[String]  (Scala syntax, not valid in PySpark)

True

# Creating empty RDD with partition
Sometimes we may need to write an empty RDD to files by partition. In this case, you should create an empty RDD with partitions.

# Create empty RDD with partition

In [24]:
rdd2 = spark.sparkContext.parallelize([], 10)  #This creates 10 partitions
rdd2

ParallelCollectionRDD[15] at readRDDFromFile at PythonRDD.scala:274

# Creating RDD from DataFrame and vice-versa
Although more advanced APIs are available on top of RDDs, we often need to convert a DataFrame to an RDD or an RDD to a DataFrame. Below are several examples.

In [11]:
data = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)]
rdd = spark.sparkContext.parallelize(data)
# Converts RDD to DataFrame
dfFromRDD1 = rdd.toDF()
dfFromRDD1

DataFrame[_1: bigint, _2: bigint]

In [13]:
# Converts RDD to DataFrame with column names
dfFromRDD2 = rdd.toDF(["col1", "col2"])
dfFromRDD2

DataFrame[col1: bigint, col2: bigint]

In [17]:
# Using createDataFrame() - convert RDD to DataFrame
df = spark.createDataFrame(rdd, ["col1", "col2"])
# Convert DataFrame to RDD
rdd = df.rdd

# RDD Parallelize
When we use the parallelize(), textFile(), or wholeTextFiles() methods of SparkContext to create an RDD, Spark automatically splits the data into partitions based on resource availability. For example, when you run parallelize() on a laptop with master local[*], it creates as many partitions as there are cores available on your system.

getNumPartitions() – This is an RDD function that returns the number of partitions our dataset is split into.

In [27]:
print("initial partition count:" + str(rdd2.getNumPartitions()))

initial partition count:10

Set the number of partitions manually – We can also set the number of partitions ourselves. All we need to do is pass the number of partitions as the second parameter to these functions, for example:

In [29]:
spark.sparkContext.parallelize([1, 2, 3, 4, 56, 7, 8, 9, 12, 3], 10)

ParallelCollectionRDD[16] at readRDDFromFile at PythonRDD.scala:274

# Repartition and Coalesce
Sometimes we need to change the number of partitions of an RDD. PySpark provides two ways to do this: the repartition() method, which shuffles data from all nodes (a full shuffle), and the coalesce() method, which moves data from as few partitions as possible when reducing the partition count. For example, if you have data in 4 partitions, coalesce(2) moves data from just 2 of them.

Both functions take the target number of partitions, as shown below. Note that [repartition()](https://sparkbyexamples.com/pyspark/pyspark-repartition-vs-coalesce/) is a very expensive operation because it shuffles data from all nodes in the cluster.

In [31]:
reparRdd = rdd2.repartition(4)
print("re-partition count:" + str(reparRdd.getNumPartitions()))

re-partition count:4


Note: Both repartition() and coalesce() return a new RDD; the original RDD is left unchanged.