# Load Sparkify Data

This notebook is the first step of the Sparkify project. First, it downloads the medium-sized Sparkify dataset from video.udacity-data.com into the local `/data` directory using the `curl` command below. Second, it uploads the medium-sized dataset (462MB) and the mini dataset (128MB), which must be added to `/data` manually, to a Cloud Storage bucket.

The additional notebooks `run_exploratory_data_analysis.ipynb` and `run_pipe_sparkify` read the data directly from Cloud Storage, because a local file would have to be available at the same path on every worker node. As the [official Spark documentation](https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets) puts it:

> If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
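
As a minimal sketch of what reading from Cloud Storage looks like, the helper below builds a `gs://` URI from the bucket and blob names used later in this notebook. The function names `gcs_uri` and `read_events` are hypothetical and not part of the project code, and `read_events` assumes that a `SparkSession` already exists and that the Cloud Storage connector for Hadoop is available on the cluster.

```python
def gcs_uri(bucket_name: str, blob_name: str) -> str:
    """Build a gs:// URI, which resolves to the same object on every worker."""
    return f"gs://{bucket_name}/{blob_name.lstrip('/')}"


def read_events(spark, bucket_name: str, blob_name: str):
    """Read a JSON dataset from Cloud Storage into a DataFrame.

    Assumption: `spark` is an existing SparkSession and the cluster can
    resolve gs:// paths (Cloud Storage connector installed).
    """
    return spark.read.json(gcs_uri(bucket_name, blob_name))


# Only the URI is built here; no Spark session is needed for this line.
print(gcs_uri("pyspark-cluster-202205", "data/mini_sparkify_event_data.json"))
```

Because the URI does not depend on any node's local filesystem, the driver and all workers see the same file without copying it around.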

## Requirements
1. The directory `/data` must exist.
2. The file `mini_sparkify_event_data.json` must be present in the `/data` directory.
3. The notebook must be executed from a PySpark kernel.
4. In the folder `/authentication`, there must be a file `gcp_client_secrets.json` with secrets to access GCP.
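
The file-related requirements above could be checked before running the cells with a short script like the one below. This is only a sketch: the name `missing_requirements` is hypothetical, and it assumes that `/data` and `/authentication` are meant relative to the project directory set in the first code cell.

```python
from pathlib import Path

# Paths from the requirements list, assumed relative to the project directory.
REQUIRED_PATHS = [
    "data",
    "data/mini_sparkify_event_data.json",
    "authentication/gcp_client_secrets.json",
]


def missing_requirements(project_dir):
    """Return the required paths that do not exist under project_dir."""
    root = Path(project_dir)
    return [rel for rel in REQUIRED_PATHS if not (root / rel).exists()]


missing = missing_requirements("/home/Sparkify-churn")
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All required files and directories are present.")
```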

In [1]:
project_dir = "/home/Sparkify-churn"

In [2]:
import os
os.chdir(project_dir)
from data_transfer import transfer_files_with_cloud_storage

In [3]:
bucket_name = "pyspark-cluster-202205"
file_path_mini = "data/mini_sparkify_event_data.json"
file_path_medium = "data/medium_sparkify_event_data.json"

In [4]:
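# download the medium Sparkify dataset into the local data directory
# -o overwrites the target, so re-running the cell does not append a second copy;
# {file_path_medium} is expanded by IPython to the value of the Python variable
!curl "https://video.udacity-data.com/topher/2018/December/5c1d6681_medium-sparkify-event-data/medium-sparkify-event-data.json" -o {file_path_medium}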

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  231M  100  231M    0     0  43.8M      0  0:00:05  0:00:05 --:--:-- 47.9M
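
A quick sanity check after the download can catch a truncated or accidentally duplicated file before it is uploaded. The sketch below assumes the dataset is stored as one JSON object per line; the helper name `check_json_lines` is hypothetical and not part of the project code.

```python
import json
import os


def check_json_lines(path, max_lines=100):
    """Return (size_in_bytes, records_checked) for a JSON-lines file.

    Parses at most max_lines lines and raises ValueError on the first
    non-empty line that is not valid JSON.
    """
    size = os.path.getsize(path)
    checked = 0
    with open(path, encoding="utf-8") as fh:
        for line_number, line in enumerate(fh, start=1):
            if line_number > max_lines:
                break
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}: line {line_number} is not valid JSON"
                ) from exc
            checked += 1
    return size, checked


# Run the check only if the file is present in the working directory.
if os.path.exists("data/medium_sparkify_event_data.json"):
    print(check_json_lines("data/medium_sparkify_event_data.json"))
```

Comparing the reported size with the total shown by `curl` is a simple way to notice a file that was written more than once.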


In [5]:
# upload the mini and medium Sparkify datasets from the local dir to Cloud Storage
for file_path in [file_path_mini, file_path_medium]:
    transfer_files_with_cloud_storage(
        bucket_name=bucket_name,
        local_file_name=os.path.join(project_dir, file_path),
        remote_blob_name=file_path,
        transfer_option="export",
        project_dir=project_dir,
    )

Uploading '/home/Sparkify-churn/data/mini_sparkify_event_data.json' to GCP bucket 'pyspark-cluster-202205' as 'data/mini_sparkify_event_data.json'...
Done.

Uploading '/home/Sparkify-churn/data/medium_sparkify_event_data.json' to GCP bucket 'pyspark-cluster-202205' as 'data/medium_sparkify_event_data.json'...
Done.
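
The implementation of `transfer_files_with_cloud_storage` lives in the project's `data_transfer` module and is not shown in this notebook. As a rough illustration of what an upload step like this could look like with the `google-cloud-storage` client library, consider the sketch below. The names `upload_file` and `make_client` are hypothetical, and `make_client` assumes the credentials file is a service account key, which may not match the format of `gcp_client_secrets.json`.

```python
def upload_file(client, bucket_name, local_file_name, remote_blob_name):
    """Upload one local file to a Cloud Storage bucket.

    `client` is expected to behave like google.cloud.storage.Client.
    """
    print(
        f"Uploading '{local_file_name}' to GCP bucket "
        f"'{bucket_name}' as '{remote_blob_name}'..."
    )
    blob = client.bucket(bucket_name).blob(remote_blob_name)
    blob.upload_from_filename(local_file_name)
    print("Done.")
    return blob


def make_client(key_path):
    """Create a storage client from a key file.

    Assumption: key_path points at a service account key in JSON format.
    """
    # Imported lazily so the sketch can be read without the package installed.
    from google.cloud import storage

    return storage.Client.from_service_account_json(key_path)
```

Passing the client in as an argument keeps the upload logic independent of how credentials are obtained, which also makes it easy to exercise with a stand-in object.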

