## Spark

In [None]:
from pyspark.sql import SparkSession
import time


In [None]:
spark = (SparkSession.builder.appName("cs544")
         .master("spark://boss:7077")
         .config("spark.executor.memory", "512M")
         .getOrCreate())

Chaining method calls lets us set parameters on the `Builder` object, because each call returns the `Builder` itself.
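To see why chaining works, here is a minimal pure-Python sketch. `ToyBuilder` and its `options` dictionary are hypothetical names made up for illustration; this is not PySpark's actual implementation, only the general pattern of returning `self` from each setter.

```python
class ToyBuilder:
    """Hypothetical stand-in for a builder; not PySpark's actual class."""

    def __init__(self):
        self.options = {}

    def appName(self, name):
        self.options["app.name"] = name
        return self  # returning self is what makes chaining possible

    def master(self, url):
        self.options["master"] = url
        return self

    def config(self, key, value):
        self.options[key] = value
        return self


# Each call returns the same object, so the calls can be strung together.
builder = (ToyBuilder()
           .appName("cs544")
           .master("spark://boss:7077")
           .config("spark.executor.memory", "512M"))
print(builder.options)
```

Because every setter hands back the same object, stopping the chain early (as in the next two cells) still gives you a usable builder.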

In [None]:
SparkSession.builder.appName("cs544")

In [None]:
SparkSession.builder.appName("cs544").master("spark://boss:7077")

#### Web server access

Once you create the Spark session, you can access the web UI, which listens on port 4040.

Open `localhost:4040` in your browser to see detailed information about your Spark application. The "Executors" tab gives details about the executors running on the cluster nodes.

`spark.sparkContext` is the entry point for everything RDD-related.

In [None]:
sc = spark.sparkContext

Let's create a list containing the numbers from 0 up to (but not including) 1M.

In [None]:
nums = list(range(1_000_000))

### RDD creation

In [None]:
rdd = sc.parallelize(nums)

### `lambda` syntax

- anonymous functions
- `lambda ARGUMENTS: EXPRESSION`
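A short plain-Python illustration of the syntax above; the names `inverse`, `inverse_def`, and `add` are only for this example.

```python
# A lambda is an expression that evaluates to a function.
inverse = lambda x: 1 / x

# The equivalent named function written with def.
def inverse_def(x):
    return 1 / x

print(inverse(4))       # 0.25
print(inverse_def(4))   # 0.25

# Lambdas can take several arguments.
add = lambda a, b: a + b
print(add(2, 3))        # 5

# They are most useful when passed directly to another function.
doubled = list(map(lambda x: x * 2, [1, 2, 3]))
print(doubled)          # [2, 4, 6]
```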

### Transformation: `map`

Let's compute the inverse of each number.

In [None]:
inverses = rdd.map(lambda x: 1/x)

### Action

- An action is what triggers the actual computation (or work)

We could get all results using `collect`, but be careful: that is a lot of data to bring back and store in the driver's RAM.

In [None]:
# inverses.collect() # ACTION to get all the numbers

### Action

Let's get the first N results instead using `take(<N>)`.

In [None]:
# Any potential problems in running this?


How can we fix the `ZeroDivisionError`?

### Filter

Let's filter out any values <= 0.

In [None]:
inverses = rdd.filter(lambda x: x > 0).map(lambda x: 1/x)
inverses
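As a rough pure-Python analogy (this is not Spark code), the built-in `map` and `filter` are also lazy: building the pipeline does no work, and the error only shows up once results are requested. The small `nums` list here is a stand-in for the 1M-element list.

```python
nums = list(range(5))  # small stand-in; note that it starts at 0

# Building the lazy pipeline raises no error, even though 1/0 is lurking.
lazy_inverses = map(lambda x: 1 / x, nums)

# Requesting results (the analogue of an action) triggers the work.
try:
    first_three = [next(lazy_inverses) for _ in range(3)]
except ZeroDivisionError:
    first_three = None
    print("the error only appears once results are requested")

# Filtering out non-positive values before mapping avoids the division by zero.
safe_inverses = list(map(lambda x: 1 / x, filter(lambda x: x > 0, nums)))
print(safe_inverses[:2])  # [1.0, 0.5]
```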

### Action

Let's compute the mean of all the inverses.

In [None]:
inverses.mean()

### Partitioning

Number of partitions.

Let's create 10 partitions.

In [None]:
rdd = sc.parallelize(nums, 10)
rdd.getNumPartitions()

In [None]:
inverses = rdd.filter(lambda x: x > 0).map(lambda x: 1/x)
inverses.mean()

#### How to read the Spark job progress bar

For example, `(4 + 2) / 10` means:
- 4 tasks are done
- 2 tasks are running
- 10 tasks in total
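The reading above can be sketched as a tiny parser. `parse_progress` is a hypothetical helper written for this note, not part of Spark; it accepts the three numbers with or without parentheses and also derives how many tasks have not started yet.

```python
import re


def parse_progress(text):
    """Parse 'DONE + RUNNING / TOTAL', with or without parentheses."""
    pattern = r"\s*\(?\s*(\d+)\s*\+\s*(\d+)\s*\)?\s*/\s*(\d+)\s*"
    match = re.fullmatch(pattern, text)
    if match is None:
        raise ValueError(f"unrecognized progress string: {text!r}")
    done, running, total = map(int, match.groups())
    return {
        "done": done,
        "running": running,
        "total": total,
        "pending": total - done - running,  # tasks not started yet
    }


progress = parse_progress("(4 + 2) / 10")
print(progress)
# {'done': 4, 'running': 2, 'total': 10, 'pending': 4}
print(parse_progress("4 + 2 / 10") == progress)  # True
```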

### RDD caching

RDD sampling: `<rdd>.sample(...)`
- Reproducible (seeded) pseudorandom sampling is not always possible in Spark because the partitions might be different every time. However, if you have the same partitions every time, then sampling with a fixed seed is deterministic.
- So, how can we get a reproducible sample?
  - Sample
  - Save results in a file
  - Only use that file
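The dependence on partitioning can be illustrated with a toy pure-Python model. It assumes each partition is sampled independently by its own generator derived from the seed and the partition index; that is a modeling assumption for illustration, not a description of Spark's internals. It also samples without replacement for simplicity, and `split_into_partitions` and `toy_sample` are hypothetical helpers.

```python
import random


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous chunks of (nearly) equal size."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def toy_sample(partitions, fraction, seed):
    """Sample each partition independently with its own seeded generator."""
    picked = []
    for index, partition in enumerate(partitions):
        rng = random.Random(seed + index)  # assumption: per-partition seed
        picked.extend(x for x in partition if rng.random() < fraction)
    return picked


data = list(range(1000))
four_parts = split_into_partitions(data, 4)
two_parts = split_into_partitions(data, 2)

same_a = toy_sample(four_parts, fraction=0.1, seed=544)
same_b = toy_sample(four_parts, fraction=0.1, seed=544)
different = toy_sample(two_parts, fraction=0.1, seed=544)

print(same_a == same_b)     # True: same partitions and same seed
print(same_a == different)  # same seed, different partitions
```

In this model the second comparison is expected to come out `False`, which is why saving the sample to a file and reusing that file is the safer route to reproducibility.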

In [None]:
sample = rdd.sample(True, fraction=0.1, seed=544)

How long does it take to compute the mean of the sample?

Let's cache the results. Calling `cache()` is fast because no work is done yet; it only marks the RDD to be cached.

In [None]:
sample.cache()

The first time you "use" a cached RDD, it will be slower than just running the computation itself. Why? It is doing the task work plus the extra caching work.
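A simplified pure-Python model of this behavior is sketched below. `ToyRDD` is a hypothetical class invented for illustration and is far simpler than a real RDD; it only shows that `cache()` sets a flag, the first use computes and stores, and later uses read the stored data.

```python
class ToyRDD:
    """Hypothetical, simplified model of a lazily computed, cacheable dataset."""

    def __init__(self, compute):
        self._compute = compute    # function that produces the data
        self._should_cache = False
        self._cached = None
        self.compute_calls = 0     # how many times the real work ran

    def cache(self):
        self._should_cache = True  # only sets a flag; no work happens here
        return self

    def _data(self):
        if self._cached is not None:
            return self._cached    # later uses: read the stored result
        self.compute_calls += 1
        data = self._compute()
        if self._should_cache:
            self._cached = data    # first use: compute AND store
        return data

    def mean(self):
        data = self._data()
        return sum(data) / len(data)


toy = ToyRDD(lambda: list(range(0, 1000, 10))).cache()
print(toy.compute_calls)  # 0: cache() did no work
print(toy.mean())         # 495.0 (computed and stored)
print(toy.mean())         # 495.0 (read from the stored copy)
print(toy.compute_calls)  # 1
```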

In [None]:
start_time = time.time()
print(sample.mean())
end_time = time.time()
end_time - start_time

Let's try it again.

In [None]:
start_time = time.time()
print(sample.mean())
end_time = time.time()
end_time - start_time

That doesn't give us much improvement. Why not?

We started with a big dataset (1M numbers). Sampling is a narrow transformation: it avoids shuffling data across partitions, so the sample keeps the same number of partitions as the original RDD, each now holding only a fraction of the data.

Solution: repartition after sampling.

In [None]:
sample = rdd.sample(True, fraction=0.1, seed=544).???.cache()

Again, the first run will be slower because we are doing both the compute and the caching work.

In [None]:
start_time = time.time()
print(sample.mean())
end_time = time.time()
end_time - start_time

In [None]:
start_time = time.time()
print(sample.mean())
end_time = time.time()
end_time - start_time

Better performance than before.

### Spark DataFrames

In [None]:
! wget https://ms.sites.cs.wisc.edu/cs544/data/ghcnd-stations.txt

In [None]:
df = spark.read.text("ghcnd-stations.txt")

In [None]:
df

In [None]:
type(df), type(df.rdd)

Let's take a peek at the first 10 lines of this Spark DataFrame.

Why doesn't this work? Where is our data?

#### Moving data to HDFS

In [None]:
ghcnd-stations.txt

Let's read the data from HDFS.

In [None]:
df = spark.read.text(???)

In [None]:
!head ghcnd-stations.txt

In [None]:
df.take(10)

Let's convert the Spark DataFrame to a pandas DataFrame. **Be careful!** The entire dataset might not fit into memory.

In [None]:
# Limit to first 10 rows


In [None]:
pandas_df = df.limit(10).toPandas()
pandas_df

#### Extract station ID using pandas

In [None]:
pandas_df["value"]

We can add the station ID as a new column to the same pandas DataFrame because it is mutable.

In [None]:
pandas_df["station"] = pandas_df["value"].str[:11]
pandas_df

#### Extract station ID using Spark

`from pyspark.sql.functions import col, expr`<br>
`expr(<SQL>)`
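One detail worth keeping in mind when moving from the pandas slice to a SQL expression: SQL-style `substring` counts positions from 1, while Python slices count from 0. The sketch below mimics that in plain Python. `sql_substring` is a hypothetical helper (it only handles positions of 1 or more), and the station line is made up for illustration.

```python
def sql_substring(text, pos, length):
    """Mimic a SQL-style substring with a 1-based start position."""
    if pos < 1:
        raise ValueError("this sketch only handles pos >= 1")
    start = pos - 1  # convert the 1-based position to a 0-based index
    return text[start:start + length]


# Made-up line in a fixed-width layout; the first 11 characters are the ID.
line = "AB123456789  12.3456  -65.4321   10.0    EXAMPLE STATION"

station_py = line[:11]                     # Python slice, 0-based
station_sql = sql_substring(line, 1, 11)   # SQL-style, 1-based

print(station_py)                 # AB123456789
print(station_py == station_sql)  # True
```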

In [None]:
#substring


We **cannot** add the station ID as a new column to the same Spark DataFrame because it is immutable. Recall that Spark DataFrames are built on Spark SQL, which is built on RDDs, which are immutable. Instead, we create a new DataFrame that has the extra column.

In [None]:
df

In [None]:
df2

In [None]:
df2.limit(10).toPandas()