References:

- [Charles Bochet - Get Started with PySpark and Jupyter Notebook in 3 Minutes](https://blog.sicara.com/get-started-pyspark-jupyter-guide-tutorial-ae2fe84f594f)
- [Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia - Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, 2015](https://proquest-safaribooksonline-com.proxy.library.cmu.edu/book/databases/business-intelligence/9781449359034/preface/idp4948496_html#X2ludGVybmFsX0h0bWxWaWV3P3htbGlkPTk3ODE0NDkzNTkwMzQlMkZjaGFwX3BhaXJfcmRkc19odG1sJnF1ZXJ5PQ==)

In [None]:
import findspark
findspark.init()

In [None]:
import pyspark
sc = pyspark.SparkContext(appName='my')

**The code above adds PySpark to sys.path at runtime and creates a Spark context for later use.**

## A simple example

In [36]:
lines = sc.textFile('data/README.md') # the README file from the findspark project page on GitHub

In [37]:
lines.count()

50

In [38]:
lines.first()

'# Find spark'

In [39]:
pythonlines = lines.filter(lambda line: "Python" in line)

In [40]:
pythonlines.first()

'Findspark can add a startup file to the current IPython profile so that the environment vaiables will be properly set and pyspark will be imported upon IPython startup. This file is created when `edit_profile` is set to true.'

> Transformations and actions are different because of the way Spark computes RDDs. Although you can define new RDDs any time, Spark computes them only in a lazy fashion—that is, the first time they are used in an action. This approach might seem unusual at first, but makes a lot of sense when you are working with Big Data. For instance, consider Example 3-2 and Example 3-3, where we defined a text file and then filtered the lines that include Python. If Spark were to load and store all the lines in the file as soon as we wrote lines = sc.textFile(...), it would waste a lot of storage space, given that we then immediately filter out many lines. Instead, once Spark sees the whole chain of transformations, it can compute just the data needed for its result. In fact, for the first() action, Spark scans the file only until it finds the first matching line; it doesn’t even read the whole file.
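The lazy behaviour described in the quote can be imitated in plain Python with generators. This is only an analogy, not how Spark is implemented: `read_lines`, `stats`, and the sample `source` list are hypothetical names made up for the illustration. The "transformation" builds a pipeline without reading anything, and the "action" pulls lines only until the first match is found.

```python
# Plain-Python analogy of lazy evaluation (not Spark code).
stats = {"lines_read": 0}  # counts how many source lines were actually consumed

def read_lines(source):
    """Yield lines one at a time, recording how many were pulled."""
    for line in source:
        stats["lines_read"] += 1
        yield line

# Hypothetical stand-in for the contents of a text file.
source = [
    "# Find spark",
    "",
    "PySpark isn't on sys.path by default",
    "Works with Python",
    "more text",
    "Python again",
]

# "Transformation": builds a lazy pipeline; nothing is read yet.
python_lines = (line for line in read_lines(source) if "Python" in line)
read_before_action = stats["lines_read"]

# "Action": pulls just enough lines to produce the first match.
first_match = next(python_lines)
read_after_action = stats["lines_read"]

print(read_before_action, first_match, read_after_action)
```

Only four of the six lines are consumed, because the scan stops at the first line containing "Python".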

In [41]:
pythonlines = lines.filter(lambda line: "Python" in line).persist()

In [42]:
pythonlines.first()

'Findspark can add a startup file to the current IPython profile so that the environment vaiables will be properly set and pyspark will be imported upon IPython startup. This file is created when `edit_profile` is set to true.'

> Finally, Spark’s RDDs are by default recomputed each time you run an action on them. If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it using RDD.persist(). In practice, you will often use persist() to load a subset of your data into memory and query it repeatedly.
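To make the recompute-versus-persist distinction concrete, here is a toy model in plain Python. `ToyDataset` is a hypothetical class written for this note and is not part of PySpark; unlike Spark, its `first()` materializes the whole dataset. It only counts how often the underlying computation runs.

```python
# Toy model of recomputation vs. persist() (hypothetical, not Spark code).
class ToyDataset:
    """Recomputes its data on every action unless persist() was called."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that produces the data
        self._persisted = False
        self._cache = None
        self.compute_calls = 0    # how many times the data was actually computed

    def persist(self):
        # Only marks the dataset; nothing is computed until the first action.
        self._persisted = True
        return self

    def _materialize(self):
        if self._persisted and self._cache is not None:
            return self._cache
        self.compute_calls += 1
        data = list(self._compute())
        if self._persisted:
            self._cache = data
        return data

    def count(self):
        return len(self._materialize())

    def first(self):
        return self._materialize()[0]

words = ["pandas", "i like pandas", "spark"]

plain = ToyDataset(lambda: [w for w in words if "pandas" in w])
plain.count()
plain.first()

cached = ToyDataset(lambda: [w for w in words if "pandas" in w]).persist()
cached.count()
cached.first()

print(plain.compute_calls, cached.compute_calls)
```

Two actions on the plain dataset trigger two computations, while the persisted one is computed once and then served from the cache.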

## Two ways to create an RDD
- loading external files
    - as in the previous example
- parallelizing an existing dataset
    - mostly used for prototyping and testing, since it requires the entire dataset to fit in the memory of one machine

In [None]:
lines = sc.parallelize(["pandas", "i like pandas"])
lines.first()

## Transformations vs. Actions
- transformations return RDDs and are executed lazily
    - filter()
    - union(), distinct(), intersection(), subtract(), cartesian(), ...
    - map(), flatMap()
- actions return other types of data and kick off computations
    - count()
    - take()
    - reduce(), fold()
    - aggregate()
    - ...

The set operations, with the exception of union(), are generally _very_ expensive, since they require shuffling data across the network.
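A rough way to see why union() is cheap while distinct() is not is to model an RDD as a list of partitions in plain Python. This is a simplified sketch under that assumption, not Spark's actual implementation; `union`, `shuffle_by_hash`, and `distinct` below are hypothetical helper functions. Union simply keeps both sets of partitions, whereas removing duplicates first needs every equal value routed to the same partition.

```python
# Simplified model: an "RDD" is a list of partitions (lists). Not Spark code.

def union(parts_a, parts_b):
    """Keep both lists of partitions as they are; duplicates stay, nothing moves."""
    return parts_a + parts_b

def shuffle_by_hash(parts, num_parts):
    """Route every element to the partition chosen by its hash (the 'shuffle')."""
    out = [[] for _ in range(num_parts)]
    for part in parts:
        for x in part:
            out[hash(x) % num_parts].append(x)
    return out

def distinct(parts, num_parts):
    """Equal values must meet in one partition before duplicates can be dropped."""
    return [sorted(set(p)) for p in shuffle_by_hash(parts, num_parts)]

a = [[1, 2], [3, 3]]
b = [[3, 4], [1, 5]]

unioned = union(a, b)
deduped = distinct(unioned, 2)

print(unioned)
print(deduped)
```

In the model, `union` never touches individual elements, while `distinct` has to visit and relocate every one of them; in a cluster, that relocation is network traffic.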

> One issue to watch out for when passing functions is inadvertently serializing the object containing the function. When you pass a function that is the member of an object, or contains references to fields in an object (e.g., self.field), Spark sends the entire object to worker nodes, which can be much larger than the bit of information you need (see Example 3-19). Sometimes this can also cause your program to fail, if your class contains objects that Python can’t figure out how to pickle.

> The following is a good example of how to avoid passing functions that contain field references:
```python
class WordFunctions(object):
  ...
  def getMatchesNoReference(self, rdd):
      # Safe: extract only the field we need into a local variable
      query = self.query
      return rdd.filter(lambda x: query in x)
```
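The cost of shipping a whole object instead of one field can be estimated with the standard `pickle` module. This is only a rough proxy: Spark uses its own serialization machinery, and the `big_lookup` field and its size are assumptions invented for this sketch. The object's fields (`vars(obj)`) stand in for what has to travel when the whole object is captured.

```python
import pickle
import threading

class WordFunctions(object):
    def __init__(self, query):
        self.query = query
        # Hypothetical large field that the filter never needs.
        self.big_lookup = list(range(100000))

searcher = WordFunctions("Python")

# Size of all the object's fields vs. size of the single field we need.
whole_state_bytes = len(pickle.dumps(vars(searcher)))
query_only_bytes = len(pickle.dumps(searcher.query))
print(whole_state_bytes, query_only_bytes)

# Some fields cannot be pickled at all, e.g. a lock.
try:
    pickle.dumps({"query": "Python", "lock": threading.Lock()})
    pickling_failed = False
except TypeError:
    pickling_failed = True
print(pickling_failed)
```

Copying `self.query` into a local variable, as in `getMatchesNoReference`, means only the small string has to be serialized.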

In [43]:
for line in lines.take(10):
    print(line)


    # Find spark

    PySpark isn't on sys.path by default, but that doesn't mean it can't be used as a regular library.
    You can address this by either symlinking pyspark into your site-packages,
    or adding pyspark to sys.path at runtime. `findspark` does the latter.

    To initialize PySpark, just call

    ```python
    import findspark


In [49]:
nums = sc.parallelize(list(range(100)))
squared = nums.map(lambda x: x*x)
for nn in squared.take(10):
    print('{}'.format(nn))

0
1
4
9
16
25
36
49
64
81


In [48]:
lines = sc.parallelize(['hello world', 'hello spark', 'hello allen'])
words = lines.flatMap(lambda pp: pp.split(' ')).collect() # use collect to trigger an action
for word in words:
    print(word)
    

hello
world
hello
spark
hello
allen
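The difference between map() and flatMap() can be seen without Spark by applying the same split function to an ordinary Python list. This is a plain-Python analogy: map keeps one output element per input element (here, a list of words per line), while flatMap flattens those lists into a single sequence of words.

```python
from itertools import chain

lines = ['hello world', 'hello spark', 'hello allen']

# map-like: one result per input line, so we get a list of lists.
mapped = [line.split(' ') for line in lines]

# flatMap-like: the per-line lists are flattened into one list of words.
flat_mapped = list(chain.from_iterable(line.split(' ') for line in lines))

print(mapped)
print(flat_mapped)
```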


In [53]:
for el in lines.cartesian(nums).take(10):
    print(el)

('hello world', 0)
('hello world', 1)
('hello world', 2)
('hello world', 3)
('hello world', 4)
('hello world', 5)
('hello world', 6)
('hello world', 7)
('hello world', 8)
('hello world', 9)


In [55]:
sumCount = nums.aggregate(
    (0, 0),  # zero value of the accumulator: (running sum, running count)
    lambda acc, value: (acc[0] + value, acc[1] + 1),  # fold one RDD element into an accumulator
    lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]),  # merge the results of two accumulators
)
print(sumCount[0] / float(sumCount[1]))  # the average

49.5
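The roles of the three arguments to aggregate() become clearer when the same computation is emulated on explicit partitions in plain Python. `toy_aggregate` is a hypothetical helper written for this note, and the split into four partitions of 25 elements is an assumption; Spark decides the real partitioning itself.

```python
from functools import reduce

def toy_aggregate(partitions, zero, seq_op, comb_op):
    """Fold each partition with seq_op, then merge the partial results with comb_op."""
    per_partition = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, per_partition, zero)

values = list(range(100))
# Assumed layout: four partitions of 25 elements each.
partitions = [values[i:i + 25] for i in range(0, 100, 25)]

sum_count = toy_aggregate(
    partitions,
    (0, 0),                                          # zero value: (sum, count)
    lambda acc, value: (acc[0] + value, acc[1] + 1), # fold one element into an accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]),         # merge two accumulators
)
average = sum_count[0] / float(sum_count[1])
print(sum_count, average)
```

Because the zero value is used once per partition and once more in the final merge, it should be a neutral element such as `(0, 0)`.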


In [56]:
nums.mean()

49.5

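**In Scala and Java, methods of this kind are defined only for specific types of RDDs.**
For example, mean() and variance() are available on numeric RDDs, and join() on key-value pair RDDs. In Python, they are all defined on the base RDD class and fail at runtime if the data in the RDD has the wrong type.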