
# Spark Labs -- Python V3

## Getting the Data

The dataset is located [here](https://s3.amazonaws.com/elephantscale-public/data/data.zip)

- Option 1: Click the link above to download the archive.

- Option 2: Use a command-line client such as `wget`:

```bash
wget 'https://s3.amazonaws.com/elephantscale-public/data/data.zip'
```
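If `wget` is not available, the same download can be done from Python with the standard library alone. The following is a minimal sketch, not part of the original labs: the function name `download_and_extract` and the choice to unpack the archive into the current directory are assumptions, and the labs may expect the data in a different location.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the link above.
DATA_URL = "https://s3.amazonaws.com/elephantscale-public/data/data.zip"


def download_and_extract(url=DATA_URL, dest_dir="."):
    """Download a zip archive into dest_dir, unpack it there, and
    return the list of member names in the archive.

    Hypothetical helper for illustration; dest_dir is an assumption.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Save the archive as data.zip inside dest_dir.
    archive = dest / "data.zip"
    urllib.request.urlretrieve(url, str(archive))

    # Unpack next to the archive and report what was extracted.
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return zf.namelist()
```

Calling `download_and_extract()` with no arguments fetches the dataset into the current working directory; pass a different `dest_dir` to place it elsewhere.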

# Labs

### Setup

- On the VM: follow [setup-vm-python](setup-vm-python.ipynb) to set up the Spark and Jupyter environment.
- (Optional): follow [setup-local](setup-local.ipynb) to set up your laptop.
- **Test your setup by running [Testing123](testing-123.ipynb)**
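Before opening the test notebook, a quick check that the expected Python packages are importable can save some time. This is only a rough sketch and not a replacement for the Testing123 notebook: the module names `pyspark` and `notebook` are assumptions about what the setup installs, and a Spark installation that is located through `SPARK_HOME` rather than installed as a Python package will show up as missing here even when it works.

```python
from importlib.util import find_spec


def check_modules(names):
    """Return {module_name: bool} telling whether each top-level
    module can be found by the current Python interpreter."""
    return {name: find_spec(name) is not None for name in names}


# Assumed module names; adjust to whatever the setup notebook installs.
status = check_modules(["pyspark", "notebook"])
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```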

### Python Intro Labs (Reference only, not done in class)

The __`python-labs`__ directory contains some basic labs in Python. If you are new to Python, give them a try.

### 1 - Jupyter Intro

- [1.1 - Hello Jupyter](python-basics/hello-jupyter.ipynb)
   
### 2 - Spark Intro

- [2.1 - Run Spark](02-intro/2.1-install-spark-python.ipynb)
- [2.2 - Spark Shell](02-intro/2.2-shell-python.ipynb) 

### 3 - Spark Core

- [3.1 - RDD basics](03-rdd/3.1-rdd-basics-python.ipynb)
- [3.2 - Dataset basics](03-rdd/3.2-dataset-basics-python.ipynb)
- [3.3 - Caching](03-rdd/3.3-caching-python.ipynb)


### 4 - Dataframes

- [4.1 - Dataframes](04-dataframe/4.1-dataframe-python.ipynb)
- [4.2 - Spark SQL](04-dataframe/4.2-sql-python.ipynb)
- [4.3 - Dataset](04-dataframe/4.3-dataset-python.ipynb)
- [4.4 - Caching 2 SQL](04-dataframe/4.4-caching-2-sql-python.ipynb)
- [4.5 - Spark & Hive (Hadoop)](04-dataframe/4.5-spark-and-hive-python.ipynb)
- [4.6 - Data formats](04-dataframe/4.6-data-formats-python.ipynb)

### 5 - API

- [5.1 - Submit first application](05-api/5.1-submit-python.ipynb)


### Practice Labs for end of day 2

- [Practice Lab 1 - Analyze Spark Commit logs](practice-labs/commit-logs-python.ipynb)
- [Practice Lab 2 - Analyze clickstream logs](practice-labs/clickstream-python.ipynb)


### 6 - MLlib

- [6.1 - K-means](06-mllib/kmeans/kmeans-python.ipynb)
- [6.2 - Recommendations](06-mllib/recs/dating.ipynb)
- [6.3 - Classification](06-mllib/classification/churn_svm.ipynb)

### 7 - GraphX

- [7.1 - Influencers (Twitter)](07-graphx/7.1-influencer-python.ipynb)
- [7.2 - Shortest path (in LinkedIn)](07-graphx/7.2-shortest-path-python.ipynb)

### 8 - Streaming

- [8.1 - Streaming over TCP](08-streaming/8.1-over-tcp/README-python.ipynb)
- [8.2 - Windowed Count](08-streaming/8.2-window/README-python.ipynb)
- [8.3 - Kafka Streaming](08-streaming/8.3-kafka/README-python.ipynb)
- 8.4 - Structured Streaming
    * [8.4a - Structured Streaming 1](08-streaming/8.4-structured/README1-python.ipynb)
    * [8.4b - Structured Streaming 2 (JSON)](08-streaming/8.4-structured/README2-python.ipynb)

### 9 - Operations

- [9.1 - Cluster setup](09-ops/9.1-cluster-setup.md)


### 10 - Practice Labs

- [Practice Lab 1 - Analyze Spark Commit logs](practice-labs/commit-logs-python.ipynb)
- [Practice Lab 2 - Analyze clickstream logs](practice-labs/clickstream-python.ipynb)
- [Practice Lab 3 - Analyze house sales data](practice-labs/house-sales-python.ipynb)
- [Practice Lab 4 - Optimize Spark query](practice-labs/optimize-query-python.ipynb)
