# Downloading Kaggle datasets to Azure File Datasets

This sample notebook shows how to download a Kaggle dataset into an AzureML FileDataset without using your local bandwidth: the download runs on remote AzureML compute.

## Preparation

To get started, you need the following:

- An active AzureML workspace
- Your `config.json` filled in with the correct workspace information (see documentation)
- Your Kaggle authentication details (on kaggle.com, open your account page and choose Create New API Token in the API section)

## Authentication

There are two options to authenticate:

- Store your credentials (username and key) in the Azure Key Vault instance that is linked to your AzureML workspace (you need the right authorization to do so). Create two secrets with the corresponding values: `KAGGLE-USERNAME` and `KAGGLE-KEY`
- Pass the values as arguments to your script
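
The two options can be sketched as follows. In `azureml-core`, `Workspace.get_default_keyvault()` returns a key vault object with `set_secret(name, value)` and `get_secret(name)` methods. The helper names `store_kaggle_secrets` and `resolve_kaggle_credentials` are hypothetical, and how the arcus `KaggleDataCollector` reads the secrets internally is not shown in this notebook, so treat the lookup order as an assumption about the intended behaviour.

```python
def store_kaggle_secrets(keyvault, username, key):
    """Write the two secrets this notebook expects into the given key vault."""
    keyvault.set_secret(name='KAGGLE-USERNAME', value=username)
    keyvault.set_secret(name='KAGGLE-KEY', value=key)


def resolve_kaggle_credentials(keyvault=None, username=None, key=None):
    """Prefer explicitly passed values; otherwise fall back to the key vault."""
    if username and key:
        return username, key
    if keyvault is None:
        raise ValueError('Provide username and key, or a key vault to read them from')
    return keyvault.get_secret('KAGGLE-USERNAME'), keyvault.get_secret('KAGGLE-KEY')


# Usage against a real workspace (requires azureml-core and a valid config.json):
# from azureml.core import Workspace
# keyvault = Workspace.from_config().get_default_keyvault()
# store_kaggle_secrets(keyvault, 'my-kaggle-user', 'my-kaggle-api-key')
```

Keeping the key vault as a parameter makes the helpers easy to exercise with a stand-in object before touching real secrets.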
            
## Configuration parameters

Set the configuration parameters for your script in the cell below.

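- `kaggle_dataset_name`: the name of the dataset you want to download, in the form `user/data-set-name`. You can get it from `Copy API command` under the three-dots menu on the dataset's kaggle.com page
- `kaggle_user_name`: your Kaggle username. Leave this empty if you are using Azure Key Vault
- `kaggle_user_key`: the secret key of your Kaggle user. Leave this empty if you are using Azure Key Vault
- `use_key_vault`: set to `True` to read the Kaggle credentials from Azure Key Vault instead of passing them as arguments
- `remote_compute_name`: the name of the AzureML compute target that runs the script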
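
Since the dataset name is usually copied from the Kaggle website, a small check can catch a wrongly pasted value before a remote run is started. The sketch below is not part of arcus: `dataset_slug` is a hypothetical helper, and it assumes the copied API command ends with the dataset name (for example `kaggle datasets download -d user/data-set-name`).

```python
import re

# Assumed slug shape: owner/dataset, built from letters, digits, '_', '-' and '.'
_SLUG = re.compile(r'^[A-Za-z0-9][\w.-]*/[A-Za-z0-9][\w.-]*$')


def dataset_slug(text):
    """Return the owner/dataset name from a bare name or a copied API command."""
    tokens = text.split()
    if not tokens:
        raise ValueError('Empty Kaggle dataset name')
    # In both cases the dataset name is the last whitespace-separated token
    candidate = tokens[-1]
    if not _SLUG.match(candidate):
        raise ValueError(f'Not a valid Kaggle dataset name: {candidate!r}')
    return candidate


print(dataset_slug('kaggle datasets download -d new-york-state/nys-farm-product-dealer-licenses-currently-issued'))
```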

In [22]:
kaggle_dataset_name = 'new-york-state/nys-farm-product-dealer-licenses-currently-issued'
kaggle_user_name = ''
kaggle_user_key = ''
use_key_vault = True
remote_compute_name = 'cpu-cluster'

In [23]:
import os
import os.path
import sys
from datetime import date

from azureml.widgets import RunDetails
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

## Connect to your AzureML workspace

In [24]:
from arcus.azureml.environment.aml_environment import AzureMLEnvironment
azure_config_file = '../.azureml/config.json'
aml = AzureMLEnvironment.Create(config_file=azure_config_file)

Connected to AzureML workspace
>> Name: codit-ai-incubators-ml
>> Subscription: c1537527-c126-428d-8f72-1ac9f2c63c1f
>> Resource group: codit-ai-incubators


In [30]:
training_name = 'Arcus-Kaggle-Download'
trainer = aml.start_experiment(training_name)
trainer.setup_training(training_name)

## Setup of script

The following script will be executed in the cloud.

In [31]:
%%writefile Arcus-Kaggle-Download/train.py

# General references
import argparse
import os

# Add arcus references
from arcus.azureml.datacollection.kagglecollection import KaggleDataCollector
from arcus.azureml.environment.aml_environment import AzureMLEnvironment

##########################################
### Parse arguments and prepare environment
##########################################

parser = argparse.ArgumentParser()

# If you want to parse arguments that get passed through the estimator, this can be done here
parser.add_argument('--kaggle_user', type=str, dest='user', default=None, help='Kaggle user name')
parser.add_argument('--kaggle_key', type=str, dest='key', default=None, help='Kaggle user secret')
parser.add_argument('--kaggle_dataset', type=str, dest='dataset', default=None, help='Kaggle data set name')
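# type=bool would turn any non-empty string (including 'False') into True, so parse the text explicitly
parser.add_argument('--use_keyvault', type=lambda v: str(v).lower() in ('true', '1', 'yes'), dest='usekeyvault', default=True, help='Indicate to use Key Vault')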

args, unknown = parser.parse_known_args()
kaggle_user = args.user
kaggle_secret = args.key
kaggle_dataset = args.dataset
use_key_vault = args.usekeyvault

# Load the environment from the Run context, so you can access any dataset
aml_environment = AzureMLEnvironment.CreateFromContext()
collector = KaggleDataCollector()
collector.copy_to_azureml(
    aml_environment,
    kaggle_dataset,
    local_path='kaggle_data',
    user_name=kaggle_user,
    user_key=kaggle_secret,
    use_key_vault=use_key_vault,
    force_download=True,
)

print('Download finished')

Overwriting Arcus-Kaggle-Download/train.py
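
One detail worth checking in scripts like this: command-line values arrive as strings, and Python's `bool('False')` is `True`, so a flag declared with `type=bool` cannot be switched off from the command line. A minimal sketch of an explicit converter follows; the name `str_to_bool` is hypothetical and the accepted spellings are a choice, not a standard.

```python
import argparse


def str_to_bool(value):
    """Convert common textual spellings of a boolean into a real bool."""
    text = str(value).strip().lower()
    if text in ('true', '1', 'yes', 'y'):
        return True
    if text in ('false', '0', 'no', 'n'):
        return False
    raise argparse.ArgumentTypeError(f'Expected a boolean, got {value!r}')


parser = argparse.ArgumentParser()
parser.add_argument('--use_keyvault', type=str_to_bool, dest='usekeyvault', default=True)

print(parser.parse_args(['--use_keyvault', 'False']).usekeyvault)  # False
print(parser.parse_args([]).usekeyvault)                           # True (the default)
```

Raising `argparse.ArgumentTypeError` for unknown spellings makes a typo fail loudly instead of silently selecting one of the two authentication paths.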


## Launch script on cloud compute

In [34]:
if use_key_vault:
    # Using Key Vault authentication, so only the dataset name needs to be passed
    args = {
        '--kaggle_dataset': kaggle_dataset_name
    }
else:
    # Not using Key Vault authentication, so the secrets are passed as well
    args = {
        '--kaggle_user': kaggle_user_name,
        '--kaggle_key': kaggle_user_key,
        '--kaggle_dataset': kaggle_dataset_name,
        '--use_keyvault': False,
    }

mask_stage_run = trainer.start_training(
    training_name,
    environment_type='sklearn',
    script_parameters=args,
    compute_target=remote_compute_name,
    gpu_compute=False,
)

Getting environment for type sklearn
Taking AzureML-Scikit-learn-0.20.3 as base environment
https://ml.azure.com/experiments/Arcus-Kaggle-Download/runs/Arcus-Kaggle-Download_1598357096_2e38122b?wsid=/subscriptions/c1537527-c126-428d-8f72-1ac9f2c63c1f/resourcegroups/codit-ai-incubators/workspaces/codit-ai-incubators-ml
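
How `trainer.start_training` hands `script_parameters` to the script is internal to arcus and not shown in this notebook. Assuming the dictionary is flattened into `--name value` pairs on the command line, every value reaches `train.py` as a string, including booleans. The hypothetical helper `to_argv` below mimics that flattening so the effect is visible locally.

```python
def to_argv(script_parameters):
    """Flatten a {'--name': value} dictionary into a command-line argument list."""
    argv = []
    for name, value in script_parameters.items():
        # Every value is stringified, so False becomes the text 'False'
        argv.extend([name, str(value)])
    return argv


example_parameters = {'--kaggle_dataset': 'owner/dataset', '--use_keyvault': False}
print(to_argv(example_parameters))
# ['--kaggle_dataset', 'owner/dataset', '--use_keyvault', 'False']
```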

