## Task 1: Calculate π in parallel with Apache Spark using the Monte Carlo method


One way to estimate the value of π (3.141592...) is the Monte Carlo method. Consider a circle of radius 0.5 enclosed by a 1 × 1 square. The area of the circle is $\pi r^2 = \pi/4$ and the area of the square is 1, so dividing the area of the circle by the area of the square gives $\pi/4$.

We then generate a large number of uniformly distributed random points and plot them on the graph. These points can be in any position within the square, i.e. between (0,0) and (1,1). If they fall within the circle, they are coloured red; otherwise they are coloured blue. We keep track of the total number of points and the number of points that are inside the circle. If we divide the number of points within the circle, $N_{inner}$, by the total number of points, $N_{total}$, we should get a value that is an approximation of the ratio of the areas we calculated above, $\pi/4$.

In other words,

$$\frac{\pi}{4} \approx \frac{N_{inner}}{N_{total}}$$

$$\pi \approx \frac{4 N_{inner}}{N_{total}}$$

<img src='./montecarlo.png'>
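
Before moving to Spark, the estimate can be sketched in plain Python. This is only a minimal illustration of the setup described above (a circle of radius 0.5 centered at (0.5, 0.5) inside the unit square). The function name `estimate_pi`, its `seed` parameter, and the seed value are assumptions made for this sketch and are not part of the notebook.

```python
import random


def estimate_pi(num_points, seed=None):
    """Estimate pi by sampling points uniformly in the unit square."""
    # A private generator, so that passing a seed makes a run repeatable.
    rng = random.Random(seed)
    n_inner = 0
    for _ in range(num_points):
        x, y = rng.random(), rng.random()
        # Inside the circle of radius 0.5 centered at (0.5, 0.5)?
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 < 0.25:
            n_inner += 1
    # N_inner / N_total approximates pi / 4, so multiply by 4.
    return 4 * n_inner / num_points


estimate = estimate_pi(100_000, seed=42)  # the seed value is arbitrary
print(estimate)
```

The notebook cells below test `x**2 + y**2 < 1` instead, which selects a quarter of a circle of radius 1. That region also covers π/4 of the unit square, so both variants estimate the same ratio.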

In [1]:
import pyspark
from pyspark.sql import SparkSession
import random

In [2]:
# create spark context
sc = pyspark.SparkContext(appName="Pi")
spark = SparkSession(sc)

In [3]:
# number of points to be used during the simulation
num_samples = 100000
# num_samples = 100000000

In [2]:
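def gen_random_point(p):
    # generate random x and y coordinates in [0, 1); the input p is ignored
    x = random.random()
    y = random.random()
    return (x, y)

# point is a tuple (x, y)
def inside_the_circle(point):
    # returns True if the point lies inside the quarter of the unit circle
    # (radius 1, centered at the origin) that overlaps the unit square;
    # this region also has area π/4, so the ratio matches the one derived above
    x = point[0]
    y = point[1]
    return (x**2 + y**2 < 1)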

In [12]:
# plain Python baseline (no Spark): the same map/filter pipeline run locally
l = range(num_samples)
k = len(list(filter(inside_the_circle, map(gen_random_point, l))))
4 * (k / num_samples)

3.14172

In [5]:
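# build the pipeline: one random point per input element, keeping only the points inside the circle
# (transformations are lazy; nothing runs until an action such as take() or count() is called)
sparkJob = sc.parallelize(range(0, num_samples)) \
    .map(gen_random_point) \
    .filter(inside_the_circle)

# action: fetch the first 5 remaining points to the driver
sparkJob.take(5)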

[(0.7733220271447795, 0.5451749199388612),
 (0.16667746438446662, 0.3861032287810393),
 (0.17811503014270713, 0.02760527504565391),
 (0.5323115568620385, 0.8082912235606978),
 (0.03581442393918355, 0.38632518110024927)]

In [6]:
points_in_circle = sparkJob.count()

pi_estimate = 4.0 * points_in_circle / num_samples

print(pi_estimate)

3.1478


### Convert the list of points into a DataFrame

A DataFrame is like a spreadsheet, but it is stored partitioned across the servers in your Spark cluster. You can apply the standard operations you would normally use on structured data, such as filtering, projections, and joins, and they are executed in parallel across the nodes. We will look at DataFrames in detail later in the course.



In [7]:
points_in_circle_df = sparkJob.toDF(["point_x", "point_y"])
points_in_circle_df.printSchema()

root
 |-- point_x: double (nullable = true)
 |-- point_y: double (nullable = true)



DataFrames have some nice properties. For example, you can filter rows with expressions, so you don't need closures like the ones used in the previous examples.

In [8]:
points_in_circle_df.where("point_x > 0.9 and point_y < 0.1").take(10)

[Row(point_x=0.9929514713106562, point_y=0.05981262003230803),
 Row(point_x=0.9454315286737294, point_y=0.012655744128695856),
 Row(point_x=0.9178649173156344, point_y=0.03058842636433068),
 Row(point_x=0.916504603690335, point_y=0.02483951575131127),
 Row(point_x=0.9455943222429517, point_y=0.07967301581441122),
 Row(point_x=0.9918079069699787, point_y=0.056238790433995045),
 Row(point_x=0.9252100032746915, point_y=0.0895698754674834),
 Row(point_x=0.9679890981573258, point_y=0.037182969426968304),
 Row(point_x=0.9728967141483011, point_y=0.046676922974952406),
 Row(point_x=0.9961561607993109, point_y=0.028224569531642207)]

Whether you are working with RDDs or DataFrames, the result data can be copied back to the local memory of the notebook (which, in this case, is your Spark driver application). After that, you can access the data as a regular Python list. To obtain the entire result dataset, use the **collect()** method.

Note, however, that **collect()** is not used very often in practice. As we will see later, it is quite rare that we need to copy the entire result dataset back to the driver's local memory.

In [9]:
my_points = points_in_circle_df.where("point_x > 0.9 and point_y < 0.1").collect()
# note that my_points is no longer a Spark data type: it is a regular Python list copied into the main memory
# of the driver application (i.e. this notebook)
print(type(my_points))
# print the first 10 elements of the list
print(my_points[:10])

<class 'list'>
[Row(point_x=0.9929514713106562, point_y=0.05981262003230803), Row(point_x=0.9454315286737294, point_y=0.012655744128695856), Row(point_x=0.9178649173156344, point_y=0.03058842636433068), Row(point_x=0.916504603690335, point_y=0.02483951575131127), Row(point_x=0.9455943222429517, point_y=0.07967301581441122), Row(point_x=0.9918079069699787, point_y=0.056238790433995045), Row(point_x=0.9252100032746915, point_y=0.0895698754674834), Row(point_x=0.9679890981573258, point_y=0.037182969426968304), Row(point_x=0.9728967141483011, point_y=0.046676922974952406), Row(point_x=0.9961561607993109, point_y=0.028224569531642207)]
