<h1>Spark</h1>
<ul><li>Spark is a distributed big data engine for processing and streaming large amounts of data quickly and in a fault-tolerant way. It is not a storage system itself; it reads from and writes to external stores such as HDFS or local files</li>
<li>Spark is written in Scala, but offers a Python API (PySpark) that exposes most of the same functionality from Python</li>
<li>The basic building block of Spark is the Resilient Distributed Dataset (RDD). All actions and transformations operate on RDDs. Bear in mind that RDDs are immutable: they cannot be modified once they are created. The only way to change the data is to derive a new RDD from an existing one with a transformation such as map or filter</li></ul>
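
The immutability point can be illustrated without Spark. The sketch below is a plain-Python analogy, not PySpark code: a tuple stands in for an RDD, and the names `base` and `doubled` are made up for illustration. A "transformation" builds a new collection and leaves the original untouched.

```python
# Plain-Python analogy for RDD immutability (no Spark involved).
# `base` and `doubled` are illustrative names, not part of any API.
base = (1, 2, 3, 4, 5)  # a tuple, like an RDD, cannot be modified in place

# Deriving a new collection plays the role of a transformation such as map
doubled = tuple(n * 2 for n in base)

print(base)     # the original is unchanged
print(doubled)  # the derived collection holds the new values
```

In PySpark the corresponding step is calling a transformation such as `map` on an existing RDD, which returns a new RDD rather than changing the old one.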

<h1>Prerequisites to run PySpark</h1>
<ul><li>Apache Spark plus Hadoop</li>
<li>Java 8 or higher</li>
<li>Python 3</li></ul>

In [3]:
# Locate the local Spark installation and make PySpark importable
import findspark
findspark.init()

In [5]:
import pyspark
from pyspark import SparkConf, SparkContext

In [6]:
# "local[2]" runs Spark on this machine with 2 worker threads
conf = SparkConf().setMaster("local[2]").setAppName("CreatingRDD")

In [8]:
sc = SparkContext(conf=conf)

# 1. Parallelizing data

In [9]:
x = sc.parallelize([("spark", 1), ("hadoop", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])

In [10]:
x,y

(ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:262,
 ParallelCollectionRDD[1] at readRDDFromFile at PythonRDD.scala:262)

In [12]:
x.collect(), y.collect()

([('spark', 1), ('hadoop', 4)], [('spark', 2), ('hadoop', 5)])

# 2. Create new RDDs from existing RDDs

In [13]:
from operator import add

nums = sc.parallelize([1, 2, 3, 4, 5])
# reduce is an action: it returns a plain Python value, not a new RDD
adding = nums.reduce(add)
print("Adding all the elements -> %i" %(adding))

Adding all the elements -> 15


In [14]:
adding

15
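
For intuition, `reduce` folds the elements pairwise with the given function. The sketch below reproduces that fold locally with `functools.reduce`. It is only an analogy: Spark reduces each partition separately and then merges the partial results, which is why the function passed to `reduce` should be associative and commutative. The split into `part_a` and `part_b` is an arbitrary stand-in for two partitions.

```python
from functools import reduce
from operator import add

values = [1, 2, 3, 4, 5]

# Fold left to right: ((((1 + 2) + 3) + 4) + 5)
total = reduce(add, values)

# Emulate two partitions reduced separately and then merged.
# The split point is arbitrary and chosen only for illustration.
part_a, part_b = values[:2], values[2:]
merged = add(reduce(add, part_a), reduce(add, part_b))

print(total, merged)
```

Both values agree because addition gives the same result regardless of how the elements are grouped or ordered.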

In [15]:
words = sc.parallelize(
        ["scala",
        "java",
        "hadoop",
        "spark",
        "akka",
        "spark vs hadoop",
        "pyspark",
        "pyspark and spark"]
)
words.collect()

['scala',
 'java',
 'hadoop',
 'spark',
 'akka',
 'spark vs hadoop',
 'pyspark',
 'pyspark and spark']

In [16]:
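# filter is a transformation: it returns a new RDD and leaves `words` unchanged
words_filter = words.filter(lambda x: 'spark' in x)
# collect is an action: it runs the job and returns the results as a list
filtered = words_filter.collect()
print("Filtered RDD -> %s" %(filtered))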

Filtered RDD -> ['spark', 'spark vs hadoop', 'pyspark', 'pyspark and spark']
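
The predicate `'spark' in x` is a substring test, which is why `'pyspark'` passes the filter as well. The local sketch below uses Python's built-in `filter` on the same list (an analogy only, with no Spark involved) to contrast substring matching with whole-word matching. The `whole_word` variant is an added illustration, not part of the original notebook.

```python
words = ["scala", "java", "hadoop", "spark", "akka",
         "spark vs hadoop", "pyspark", "pyspark and spark"]

# Substring test, the same predicate as in the cell above
substring = list(filter(lambda x: "spark" in x, words))

# Whole-word test: split on whitespace and look for an exact token
whole_word = list(filter(lambda x: "spark" in x.split(), words))

print(substring)
print(whole_word)
```

The same lambda could be passed to the RDD's `filter` if whole-word matching is what is wanted.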


# 3. From external files

In [37]:
data = sc.textFile('file:///home/boom/Documents/programming/pyspark/data_files/people.txt')

In [38]:
data.collect()

['Michael, 29', 'Caine, 87', 'Zupper, 22', 'Xerin, 45']

In [39]:
data.top(2)

['Zupper, 22', 'Xerin, 45']
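
`top(2)` returns the two largest elements in descending order. Because each element here is a whole line of text, the comparison is alphabetical, not by age. The sketch below reproduces that ordering locally with `heapq.nlargest` and, as an added illustration, ranks by age using a hypothetical helper named `age`. It is a plain-Python analogy and does not run Spark.

```python
import heapq

lines = ["Michael, 29", "Caine, 87", "Zupper, 22", "Xerin, 45"]

# Largest two by plain string comparison, matching the output above
by_string = heapq.nlargest(2, lines)

def age(line):
    """Parse the integer that follows the comma (hypothetical helper)."""
    return int(line.split(",")[1])

# Largest two by the numeric age field instead
by_age = heapq.nlargest(2, lines, key=age)

print(by_string)
print(by_age)
```

`RDD.top` also accepts a `key` argument, so passing a similar parsing function to `data.top` should rank by age; that call is not exercised here.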

In [41]:
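# flatMap is lazy: it only describes the new RDD, nothing is computed yet
dataf = data.flatMap(lambda x: x.split(',')[::2])
# Printing an RDD shows its description, not its contents; use take or collect for values
print("Flat-mapped RDD -> %s" %(dataf))

Flat-mapped RDD -> PythonRDD[23] at RDD at PythonRDD.scala:53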


In [42]:
dataf.take(2)

['Michael', 'Caine']
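
To see why `flatMap` yields bare names, it helps to compare it with `map`. Splitting `'Michael, 29'` on the comma gives two pieces, and the `[::2]` slice keeps only the first, so each line produces a one-element list. `map` would keep those lists nested, while `flatMap` flattens them. The sketch below mimics both locally with a list comprehension and `itertools.chain`. It is an analogy in plain Python, and `names_only` is an illustrative name.

```python
from itertools import chain

lines = ["Michael, 29", "Caine, 87", "Zupper, 22", "Xerin, 45"]

def names_only(line):
    # "Michael, 29".split(",") -> ["Michael", " 29"]; [::2] keeps index 0 only
    return line.split(",")[::2]

# map-like behaviour: one list per input line
mapped = [names_only(line) for line in lines]

# flatMap-like behaviour: the per-line lists are flattened into one sequence
flat_mapped = list(chain.from_iterable(mapped))

print(mapped)
print(flat_mapped)
```

Taking the first two elements of `flat_mapped` gives the same names as `dataf.take(2)` above.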