# Working with Larger Datasets with a Scalable Solution!
- Consider the full 2015 TLC Taxi Dataset (roughly 2GB per month, or about 24GB for the year).
- Datasets with more than 20k rows would be hard for Excel, but fine for `pandas`.
- A file with more than 100 million rows (a few GB) is large for `pandas`.
- Although `pandas` would be sufficient for each month, how about a whole year?

That's right, use Spark 3.0!
![image.png](https://spark.apache.org/images/spark-logo-trademark.png)

## Method 1 (Docker)
We will be using:
- Docker for Windows 10 with WSL2 Backend (https://www.docker.com/)
- AWS Glue container (https://hub.docker.com/r/amazon/aws-glue-libs)

Steps:  
0. (Pre-Req) Install WSL2.
1. Download and install Docker.
2. Set WSL2 as the backend and restart.
3. Launch WSL2 and run `docker pull amazon/aws-glue-libs:glue_libs_1.0.0_image_01`
    - tag=`glue_libs_1.0.0_image_01` was the latest as of July 2021.
4. Create and start the container:
```bash
docker run -itd -p 8888:8888 -p 4040:4040 -v %UserProfile%\.aws:/root/.aws:rw -v C:\Users\YOUR_USERNAME\Documents\GitHub:/home/jupyter/jupyter_default_dir --name glue_jupyter amazon/aws-glue-libs:glue_libs_1.0.0_image_01 /home/jupyter/jupyter_start.sh
```
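    - `-p` maps a host port to a container port (`8888` for Jupyter, `4040` for the Spark UI)
    - `-v` mounts a local directory into the container so your files are available inside it
    - `--name` sets the container name (the container ID will differ on each machine)
    - The paths above use Windows Command Prompt syntax (`%UserProfile%`, `C:\...`); adjust them if you run the command from a WSL2 shell.
5. Check that the container is running with `docker ps`.
6. Open Jupyter Notebook in your browser at `http://localhost:8888` and start a `PySpark` kernel.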
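
If the notebook page does not load, it can help to confirm that the mapped ports are actually reachable from the host. The following is a minimal sketch using only the Python standard library; the helper name `port_is_open` is hypothetical, and the port numbers are taken from the `-p` flags in the `docker run` command above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # connection refused, timed out, or host unreachable
        return False


if __name__ == "__main__":
    # 8888 is the Jupyter port; 4040 (Spark UI) only responds while a Spark session is running
    for name, port in [("Jupyter", 8888), ("Spark UI", 4040)]:
        status = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {status}")
```

A "not reachable" result for port 8888 usually means the container is not running, so `docker ps` is the next thing to check.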

## Method 2 (Preferred)
We will be using:
- Ubuntu 20.04 (WSL2) or macOS.

Steps:  
1. Windows 10 users: install WSL2. macOS users: please ensure your terminal shell is set to `bash`.
2. Set up your Python environment (e.g. `pip3 install notebook pandas numpy ...`)
3. Install `Java` and `PySpark`:  
- Linux
```bash
# install java
sudo apt install openjdk-8-jdk -y
# set JAVA_HOME system-wide
echo 'JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"' | sudo tee -a /etc/environment
# apply to environment
source /etc/environment
# install spark
pip3 install pyspark
```
- macOS
```bash
# install java 8 and link it to the system java wrapper
brew install openjdk@8
sudo ln -sfn /usr/local/opt/openjdk@8/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-8.jdk
# set JAVA_HOME (older macOS versions default to bash, newer ones default to zsh)
echo 'export JAVA_HOME="$(/usr/libexec/java_home -v1.8)"' | tee -a $HOME/.bashrc $HOME/.zshrc
# reload the shell configuration so JAVA_HOME takes effect
source $HOME/.bashrc ; source $HOME/.zshrc
# install spark
# note: if you use anaconda/conda environments, make sure pip3 belongs to the
# active environment, or install with conda directly
pip3 install pyspark
```

- Macs with M1 chips may need to follow this guide to install the Java JDK:
https://code2care.org/q/install-native-java-jdk-jre-on-apple-silicon-m1-mac
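
After the installation steps, a quick sanity check can show whether Java and PySpark are visible from Python. This is a minimal sketch using only the standard library; the function name `check_spark_prereqs` is hypothetical, and it only inspects the environment without changing anything.

```python
import importlib.util
import os
import shutil
import subprocess


def check_spark_prereqs() -> dict:
    """Collect basic facts about the local Java and PySpark setup without raising."""
    java_path = shutil.which("java")
    report = {
        "java_on_path": java_path,
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
        "java_version": None,
        # find_spec locates the package without importing it
        "pyspark_installed": importlib.util.find_spec("pyspark") is not None,
    }
    if java_path is not None:
        try:
            # `java -version` writes its banner to stderr, not stdout
            result = subprocess.run(
                [java_path, "-version"], capture_output=True, text=True, timeout=30
            )
            lines = result.stderr.strip().splitlines()
            report["java_version"] = lines[0] if lines else None
        except (OSError, subprocess.TimeoutExpired):
            pass  # leave java_version as None if the call fails
    return report


for key, value in check_spark_prereqs().items():
    print(f"{key}: {value}")
```

If `JAVA_HOME` prints as `None`, open a new terminal (or re-run the `source` step) so the variable is picked up.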

## Preparation for the Next Part
This is a prerequisite for the next tutorial. To be ready:
1. You must already have `PySpark` installed.
2. You need the dataset downloaded.
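
One way to confirm the second point is to list which monthly files are still missing. The sketch below assumes the files live in `../data/large` and are named `yellow_tripdata_2015-MM.csv`, matching the download code in this section; the helper name `missing_months` is hypothetical.

```python
from pathlib import Path


def missing_months(data_dir: str, year: int = 2015) -> list:
    """Return the month numbers (1-12) whose CSV file is absent from data_dir."""
    folder = Path(data_dir)
    return [
        m for m in range(1, 13)
        if not (folder / f"yellow_tripdata_{year}-{m:02d}.csv").is_file()
    ]


missing = missing_months("../data/large")
print("All 12 months present" if not missing else f"Missing months: {missing}")
```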

The code below downloads all 2015 data directly from the Amazon S3 bucket. The files total approximately 21.3GB, so make sure you have ample storage space. You only need to run this once.

```python
from os import makedirs
from os.path import getsize
from urllib.request import urlretrieve

output_dir = "../data/large"
fname_template = "yellow_tripdata_2015"

# create the output directory if it does not exist yet
makedirs(output_dir, exist_ok=True)

for m in range(1, 13):
    # zero-pad the month number, e.g. 1 -> "01"
    month = str(m).zfill(2)
    out = f"{fname_template}-{month}.csv"
    url = f"https://s3.amazonaws.com/nyc-tlc/trip+data/{out}"
    urlretrieve(url, f"{output_dir}/{out}")

    # 1073741824 bytes = 1GB (2**30)
    size_gb = getsize(f"{output_dir}/{out}") / 1073741824
    print(f"Done downloading {out} to {output_dir} with size {size_gb:.2f}GB")
```
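
Because the download is large, it may be worth checking free disk space before running the code above. This is a minimal sketch using `shutil.disk_usage` from the standard library; the helper names and the 24GB threshold are assumptions chosen to leave some headroom above the stated dataset size.

```python
import shutil

# assumption: a little headroom above the ~21.3GB dataset
REQUIRED_GB = 24


def free_gb(path: str = ".") -> float:
    """Return the free space, in GB (2**30 bytes), on the disk that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def has_enough_space(path: str = ".", required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the disk holding `path` has at least `required_gb` free."""
    return free_gb(path) >= required_gb


# "." is used because the target directory may not exist yet;
# point this at the disk that will hold ../data/large
print(f"Free space: {free_gb('.'):.1f}GB, enough for the download: {has_enough_space('.')}")
```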

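To verify that you've installed PySpark correctly, run the code below:

```python
from pyspark.sql import SparkSession

# Create a spark session (which will run spark jobs)
spark = SparkSession.builder.getOrCreate()

# Read a small sample CSV, treating the first row as column names
sdf = spark.read.csv('../data/sample.csv', header=True)

# In a notebook, this displays the DataFrame's column names and types
sdf
```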