# Lab Section 1


-----
Part 1: RDD and Spark Basics
-----

Let's get familiar with the basics of Spark (PySpark). 

## 1) Spark Contexts

Check whether you already have a `SparkContext`. If not, create one. A `SparkContext` specifies where your cluster is, i.e. the resources for all your distributed computation. Create your `SparkContext` as follows:
   
```python
import pyspark
sc = pyspark.SparkContext()
```

## 2) Making your RDD

Spark keeps your data in __Resilient Distributed Datasets (RDDs)__. An RDD is a collection of data partitioned across machines. Each group of records that is processed by a single thread (*task*) on a single machine is called a *partition*. Using RDDs, Spark can process your data in parallel across the cluster.
   
You can create an RDD from a list, from a file or from an existing RDD.
   
Let's create an RDD from a Python list:
   
```python
list_rdd = sc.parallelize([419, 789, 57, 83, 805, 898, 419, 260, 83, 872])
```


## 3) Designing a Transformation Plan

RDDs are lazy: they don't perform any computation until a result is needed. Transformations only create a plan for changing the data.

```python
list_rdd.filter(lambda x: x > 800)  # Find all values greater than 800
```

Note that this transformation produces no results. It is simply a plan to manipulate the data. Each RDD knows what it has to do when it is asked to produce data. Actions then instantiate the plan to produce a value.

When you use `take()` or `first()` to inspect an RDD, does it load the entire file or just the partitions it needs to produce the results? It is smart and lazy: it loads only the partitions it needs.
 
```python
list_rdd.first()  # Views the first entry
list_rdd.take(2)  # Views the first two entries
```

## 4) Collecting results

If you want all the data from the partitions sent back to the driver, use `collect()`. However, if your dataset is large, this will __kill__ ☠ the driver. Only do this when you are developing with a small test dataset.
   
```python
list_rdd.collect()
```

----
Part 2: Explore Apple stock prices
----

![](http://apple-stock-news.com/wp-content/uploads/2016/01/appppp.jpg)

Normally we would download the Apple stock prices ourselves, either by opening the link in Chrome or by downloading it with Python:

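```python
from urllib.request import urlopen  # Python 3 replacement for the Python 2 urllib2 module

url = 'http://real-chart.finance.yahoo.com/table.csv?s=AAPL&g=d&ignore=.csv'
csv = urlopen(url).read()  # raw bytes of the CSV
with open('aapl.csv', 'wb') as f:  # binary mode, because read() returns bytes
    f.write(csv)
```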

This has fortunately been done for you, and waits ready for you in the `/data` folder included in this directory, as `aapl.csv`.


### (Group) Exercises

Using Spark, answer the questions below in your small groups (in many cases you can use shell scripting to check your answers):

A) How many records are there in this CSV?

B) Find the average *adjusted close* price of the stock. Also find the min, max, variance, and standard deviation.

C) Find the dates of the 3 highest adjusted close prices.

D) Find the dates of the 3 lowest adjusted close prices.

E) Find the number of days on which the stock price fell, i.e. the close price was lower than the open.

F) Find the number of days on which the stock price rose.

G) Find the number of days on which the stock price neither fell nor rose.

H) To find out how much the stock price changed on a particular day, convert the close and the open prices to natural log values using `math.log()` and then take the difference between the close and the open. This gives you the log change in the price. Find the 3 days on which the price increased the most.

I) The log change price lets you calculate the average change by taking the average of the log changes. Calculate the average change in log price over the entire range of prices.
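
The log-change arithmetic in H and I can be checked on a few numbers in plain Python before applying it to the RDD. This is only a sketch: the prices below are made-up values for illustration, not rows from `aapl.csv`, and `log_change` is a hypothetical helper name.

```python
import math

def log_change(open_price, close_price):
    """Log change over one day: log(close) - log(open)."""
    return math.log(close_price) - math.log(open_price)

# Made-up (open, close) pairs, not real AAPL data.
days = [(100.0, 110.0), (110.0, 100.0), (100.0, 100.0)]
changes = [log_change(o, c) for o, c in days]

# A rise and the matching fall cancel out in log terms, so the average is about 0.
average_change = sum(changes) / len(changes)
print(changes)
print(average_change)
```

A rise from 100 to 110 followed by a fall back to 100 averages to zero in log terms, whereas the simple percentage changes (+10% and about -9.09%) would not cancel.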

---
Hints
----

[Python RDD documentation](https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.html#pyspark.RDD)