# Big Data and Spark Introduction

## Matrix Factorization in Spark

- Our data is large
- MovieLens 100k is tiny
- We use MovieLens 20M, which is 200x larger
- How does that compare to what large corporations deal with?

- Netflix
    - 100 million users
    - we have ~130k users
    - 15k titles (hard to find data)
    - we have 26k movies
    - 500k movies exist (not including TV shows)
- YouTube
    - 1.8 billion users
    - 7 billion videos
    - unlike standard datasets, M > N (more items than users)
    

## Big Data
- Unfortunately a buzzword
- I think of it as any technology that requires distributed computing
- e.g. Database sharding
    - User IDs 1-1000 on one machine
    - User IDs 1001-2000 on another machine
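
The ID-range scheme above can be sketched as a small routing function. This is only an illustration: `shard_for_user` and `SHARD_SIZE` are hypothetical names, and the sketch assumes user IDs start at 1 and every machine holds exactly 1000 consecutive IDs.

```python
SHARD_SIZE = 1000  # assumed number of user IDs stored per machine


def shard_for_user(user_id, shard_size=SHARD_SIZE):
    """Return the 0-based index of the machine that stores this user."""
    if user_id < 1:
        raise ValueError("user IDs are assumed to start at 1")
    return (user_id - 1) // shard_size


print(shard_for_user(1))     # IDs 1-1000 live on machine 0
print(shard_for_user(1000))
print(shard_for_user(1001))  # IDs 1001-2000 live on machine 1
```

A real sharding scheme would often hash the ID instead of using fixed ranges, so that new users spread evenly across machines.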
    

## Sharding Files

## Distributed Compute
- Imagine having to do

```python
for i in range(N):
    ...  # do some work for item i
```

- but N is very large
- can split the work across multiple nodes
- what if each machine does a separate chunk of the loop?
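
A minimal local sketch of "each machine does a separate chunk of the loop" might look like this. Threads stand in for worker machines here, and `split_range`, `work_on_chunk` and `chunked_sum_of_squares` are hypothetical helper names; the loop body (summing squares) is just a placeholder for real work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(n, num_chunks):
    """Split range(n) into num_chunks contiguous (start, stop) pairs."""
    base, extra = divmod(n, num_chunks)
    bounds, start = [], 0
    for k in range(num_chunks):
        stop = start + base + (1 if k < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def work_on_chunk(bounds):
    """The loop body for one chunk; here it just sums squares."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def chunked_sum_of_squares(n, num_workers=4):
    chunks = split_range(n, num_workers)
    # Each worker handles one chunk; threads stand in for machines.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(work_on_chunk, chunks))
    # Combine the partial results from all workers.
    return sum(partial_sums)


total = chunked_sum_of_squares(1000)
print(total)
```

The important property is that each chunk can be processed without knowing anything about the others; only the small partial results need to be combined at the end.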

- you don't have to manually write code to pass data to worker machines
- In the old days you'd have to, but these days we have more redundant, fault-tolerant libraries such as Hadoop and Spark
- Main tip: Keep in mind how we modify data in a Pandas DataFrame

```python
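def some_func(row):
    out = do_something_to(row)  # placeholder for your per-row logic
    return out

# apply the function to every row of a DataFrame
df['new_column'] = df.apply(some_func, axis=1)

# pure Python equivalent (map returns a lazy iterator in Python 3)
result = map(some_func, some_sequence)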
```
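
To make the pattern concrete, here is a small pure-Python sketch of a map step followed by a reduce step. The sample lines and the `line_length` function are made up for illustration; Spark RDDs expose `map` and `reduce` methods that take functions of the same shape.

```python
from functools import reduce


def line_length(line):
    """Per-record function: takes one record, returns one output."""
    return len(line)


lines = ["1,2,3.5", "1,29,3.5", "2,2,4.0"]  # made-up sample records

lengths = map(line_length, lines)  # lazy: nothing is computed yet
total_chars = reduce(lambda a, b: a + b, lengths)  # combine the outputs
print(total_chars)
```

Because `line_length` only looks at its own record, the map step can run on any machine that holds a piece of the data.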

## History Lesson
- MapReduce was developed at Google
- Hadoop (an open-source implementation of MapReduce) was built at Yahoo and is now an Apache project
- Spark builds on top of Hadoop and makes MapReduce-style code easier to write
- That's why the Spark install comes with Hadoop
- Hadoop is written in Java, so you also have to install Java

## Section Outline

- Mostly learning how the Matrix Factorization API in Spark works
- Most of the work is in setup
- 2 methods
    - 1) Local
    - 2) Amazon EC2 cluster (more realistic)
- If you are already working for a company doing big data, you probably already have a dedicated ops team for provisioning machines
- Don't make your manager angry by accidentally spending thousands of dollars on EC2


# Setting up Spark in your local Environment

- Windows
    - use Ubuntu inside a VirtualBox VM
- Mac
    - get Homebrew
    - `xcode-select --install`
    - install Java: `brew cask install java` (newer Homebrew versions use `brew install --cask` instead of `brew cask install`)
    - `brew install scala`
    - `brew install apache-spark`
    - `pip install pyspark`
    - `spark-shell`
        - typing `1 + 2` should evaluate to `3`
    - `python`
- Ubuntu
    - `sudo apt update`
    - `sudo apt install default-jdk`
    - `sudo apt install scala`
    - `pip install pyspark`

## Matrix Factorization in Spark

```python
# pyspark.mllib is Spark's machine learning library
# (roughly the PySpark counterpart of scipy)
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
import os

# load in the data
# `sc` is the SparkContext (created automatically in the pyspark shell)
data = sc.textFile("./01.RecommanderSystem/data/movielens-20m-dataset/rating.csv")
```

```python
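# convert into a sequence of Rating objects
# each line is: userId,movieId,rating (ratings such as 1, 1.5, 3.5, 5)
# note: a header line, if present, must be filtered out first,
# since int() would fail on it
ratings = data.map(
    lambda l: l.split(',')
).map(
    lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))
)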
```
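
The parsing step can be tried without Spark. The sketch below is an assumption-laden illustration: `parse_rating` is a hypothetical helper, the namedtuple is a stand-in for `pyspark.mllib.recommendation.Rating`, and the sample lines (including the header row) are made up rather than taken from the real file.

```python
from collections import namedtuple

# Stand-in for pyspark's Rating so this runs without Spark.
Rating = namedtuple("Rating", ["user", "product", "rating"])


def parse_rating(line):
    """Turn 'userId,movieId,rating,...' into a Rating; None for bad lines."""
    fields = line.split(",")
    try:
        return Rating(int(fields[0]), int(fields[1]), float(fields[2]))
    except (ValueError, IndexError):
        return None  # header row or malformed line


lines = [
    "userId,movieId,rating,timestamp",  # assumed header row
    "1,2,3.5,2005-04-02 23:53:47",
    "1,29,3.5,2005-04-02 23:31:16",
]
ratings = [r for r in map(parse_rating, lines) if r is not None]
print(ratings)
```

With an RDD, the same function could presumably be used as `data.map(parse_rating).filter(lambda r: r is not None)`.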

```python
# split into train and test
train, test = ratings.randomSplit([0.8, 0.2])
```
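
One detail worth knowing about a weighted random split: each record is assigned independently, so the 80/20 proportions are only approximate. The pure-Python sketch below illustrates that idea; `random_split` is a hypothetical helper written for this note, not Spark's implementation.

```python
import random


def random_split(records, weights, seed=None):
    """Assign each record to one split by an independent weighted draw."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # cumulative boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for rec in records:
        x = rng.random()
        for k, b in enumerate(bounds):
            # the last split also catches any floating-point rounding gap
            if x < b or k == len(bounds) - 1:
                splits[k].append(rec)
                break
    return splits


train, test = random_split(range(1000), [0.8, 0.2], seed=0)
print(len(train), len(test))  # close to 800 / 200, but not exact
```

Passing a fixed seed makes the split reproducible; Spark's `randomSplit` also accepts an optional `seed` argument.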

```python
# train the model
# K is the latent dimensionality (rank)
K = 10
# number of ALS iterations
epochs = 10
model = ALS.train(train, K, epochs)
```
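
For reference, the model being fit can be written down as follows. This is the standard regularized matrix factorization objective, given as a sketch: the symbols $w_i$, $u_j$, $\Omega$ and $\lambda$ are notation introduced here, and Spark's exact regularization scaling may differ.

$$
\hat{r}_{ij} = w_i^\top u_j, \qquad
J = \sum_{(i,j) \in \Omega} \left( r_{ij} - w_i^\top u_j \right)^2
  + \lambda \left( \sum_{i=1}^{N} \lVert w_i \rVert^2 + \sum_{j=1}^{M} \lVert u_j \rVert^2 \right)
$$

Here $w_i \in \mathbb{R}^K$ is the vector for user $i$, $u_j \in \mathbb{R}^K$ is the vector for movie $j$, and $\Omega$ is the set of observed (user, movie) pairs. Alternating least squares holds all $u_j$ fixed and solves for each $w_i$, then swaps the roles, and repeats for the chosen number of iterations.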

```python
# evaluate on the training set
x = train.map(lambda p: (p[0], p[1]))  # (user_id, movie_id) pairs
p = model.predictAll(x).map(lambda r: ((r[0], r[1]), r[2]))
# join on the key (user_id, movie_id)
ratesAndPreds = train.map(lambda r: ((r[0], r[1]), r[2])).join(p)
# each row of the result is: ((user_id, movie_id), (rating, prediction))
mse = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("train mse: %s" % mse)
```
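
The join-then-average logic above can be checked on a tiny example without Spark. In this sketch, `join_by_key` and `mean_squared_error` are hypothetical helpers that mimic an inner join on `(user_id, movie_id)`, and the ratings and predictions are made-up numbers.

```python
def join_by_key(left, right):
    """Inner join of two lists of (key, value) pairs."""
    right_index = {}
    for key, value in right:
        right_index.setdefault(key, []).append(value)
    return [
        (key, (lv, rv))
        for key, lv in left
        for rv in right_index.get(key, [])
    ]


def mean_squared_error(rates_and_preds):
    errors = [(r - p) ** 2 for _, (r, p) in rates_and_preds]
    return sum(errors) / len(errors)


actual = [((1, 10), 4.0), ((1, 20), 3.0), ((2, 10), 5.0)]
predicted = [((1, 10), 3.5), ((2, 10), 4.0)]  # no prediction for (1, 20)

joined = join_by_key(actual, predicted)
mse = mean_squared_error(joined)
print(joined)
print("mse: %s" % mse)
```

Note that the pair without a prediction silently drops out of the join. The same steps should work for the test set by using `test` in place of `train`, keeping in mind that pairs the model cannot score would be left out of the average in the same way.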

```python
import os
import sys
from pyspark import SparkContext
from pyspark import SparkConf

# create the SparkContext (`sc`) when running outside the pyspark shell
conf = SparkConf()
conf.setAppName("spark-ntlk-env")

sc = SparkContext(conf=conf)
```




## Spark submit

```bash
spark-submit --master spark://localhost:7077 spark2.py
```

![](http://2.bp.blogspot.com/-72rQVVqBQ_c/VfCe_hwW23I/AAAAAAAAMYQ/BflO3g-Kd6c/s640/1.png)

```python
from pyspark import SparkContext

sc = SparkContext("local", "Your App Name Here")

# load in the data
# data = sc.textFile("../large_files/movielens-20m-dataset/small_rating.csv")
data = sc.textFile("../large_files/movielens-20m-dataset/rating.csv.gz")
```

# Setting up Spark on AWS

- 0) Install Spark locally
- 1) Create AWS key pair and credentials
- 2) Clone the spark-ec2 tool from github
- 3) Create a cluster
- 4) Run our script, spark2.py
- 5) Destroy cluster


# Set up key pair

![](https://docs.slashdb.com/user-guide/assets/keypairs.png)

![](https://buildkiteassets.com/assets/docs/tutorials/elastic_ci_stack_aws/ec2-create-key-pair-8f7684bec4b3d001b9e20633617548fa552b48734f59508f29e8b608978a7011.png)

## Save the PEM in a safe place

## Create Credentials
![](https://docs.aws.amazon.com/IAM/latest/UserGuide/images/security-credentials-user.shared.console.png)

![](https://cloud-gc.readthedocs.io/en/stable/_images/access_key_console.png)

## Set Environment Variables

- `export AWS_ACCESS_KEY_ID=...`
- `export AWS_SECRET_ACCESS_KEY=...`
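
Before launching anything, it can help to confirm that both variables are visible to child processes. The check below is a small sketch; `missing_aws_vars` and `REQUIRED_VARS` are hypothetical names introduced for this example.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_vars()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("AWS credentials found in the environment")
```

The function accepts any dict-like object, which makes it easy to test without touching the real environment.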

## Clone the spark-ec2 repo

[spark-ec2 repo](https://github.com/amplab/spark-ec2)

## Launch your cluster
- run from the folder you just cloned

```bash
./spark-ec2 -k AWS2018-3 -i AWS2018-3.pem -s 5 --spot-price=0.04 launch rec-cluster2

# the PEM file must not be readable by other users
chmod 400 AWSKey2018-2.pem
```

- `-k` (`--key-pair`): name of your key pair
- `-i` (`--identity-file`): the PEM file you downloaded
- `-s` (`--slaves`): number of slave machines
- `--spot-price`: bid amount (in dollars)
    - on-demand: expensive, guaranteed
    - spot pricing: cheap, you must bid enough to win, can be interrupted
- Default instance type is m3.large (2 vCPUs, 7.5 GB RAM), about $0.03/h
    - might change in the future since m3 is an older generation and m4 is current
    - use `--instance-type` to change it
- `--region`: specifies the region
- `rec-cluster2` is the cluster name

## Copy the URL of your master node

- shown in the terminal, e.g. http://ec2_xxxxx-computexx.amazonaws.com:8080 (port 8080 is the Spark master web UI)

## Run the code

```bash
spark-submit --master spark://ec2-xxxx-amazonaws.com:7077 spark2.py
```

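## Java heap memory error

- increase the executor and driver memory
- options must come before the script name; anything after `spark2.py` is passed to the script as an argument

```bash
spark-submit --master spark://ec2-xxxx-amazonaws.com:7077 --executor-memory 10g --driver-memory 10g spark2.py
```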

## Data files
- data would be stored on HDFS or S3
- recent Spark versions can have trouble with the S3 protocol
- if your company has HDFS, try it


## Destroy cluster

```bash
./spark-ec2 destroy rec-cluster2
```

# Reference

[pyspark in virtualenv](https://henning.kropponline.de/2016/09/17/running-pyspark-with-virtualenv/)  
[movielens dataset](https://grouplens.org/datasets/movielens/)

![](https://raw.githubusercontent.com/databricks/spark-training/master/website/img/matrix_factorization.png)