# How to Kaggle the Engineer way. Act 1: VS Code Containers

Or, how to engineer the heck out of your Kaggle development environment: a play in 2 acts.

My entire Kaggle work, including what's discussed in this article, is published [here](https://github.com/Witalia008/kaggle-public).

## Intro and Motivation

I am a software engineer who not so long ago decided to embark on a new challenge: becoming a data scientist.
As part of that, I started participating in competitions on Kaggle (I've done just 2 so far).
However, I hit quite a troublesome ramp-up stage at the beginning. I spent most of the time in those first two competitions setting up my development environment. I just couldn't bear all those manual and repetitive tasks: downloading datasets, hard-coding paths, manually submitting through the website, and, most importantly, getting lost in all the inputs and outputs of the models (i.e. tracking versions).

So, I decided that I would save myself a lot of time later if I spent some time setting things up right at the beginning, i.e. doing it the Software Engineering way.

This is a story in two acts:
* Act 1 is my setup of VS Code with Containers for local development that mimics the environment of Kaggle kernels.
* Act 2 is my setup of Google Colab to run independently yet work with Kaggle.

At the end of Act 1 you will be able to develop locally and train on Kaggle without any manual steps. Act 2 will be in an upcoming article.

## Act 1 - Local containerised environment in VS Code

During the first competition I took part in (an image classification problem), I started simply by using Kaggle kernels online, since my laptop doesn't have a GPU and it would take me ages to train the models.

However, kernels very quickly proved not to be enough:
* First of all, I wanted to track everything on GitHub, so I'd end up downloading everything locally, committing, etc., every time I made a change. That switching between tools just wasn't acceptable.
* Second of all, I wanted to be able to debug my code. I believe that debugging is a powerful tool, and that's just a fact, even if it is a "thorn in the side" of those who like using Vim for typing code.



### Overview of VS Code Remote: Containers

Containers are a great tool if you don't feel like doing the setup steps over and over again.
They allow you to define the configuration steps once and then reuse that environment any other time. Or, better yet, you can use one pre-defined by someone else, e.g. by Kaggle.

You can find extensive information on how to use the VS Code Remote: Containers extension to work inside containers [here](https://code.visualstudio.com/docs/remote/containers).

### Get Kaggle container image

Kaggle kernels on their website work exactly this way - they have a pre-defined container image that gets loaded for each user's notebook, thus isolating their environments.

And, luckily for us, they have their images published.
* [Here](https://github.com/Kaggle/docker-python) they have information on their Python images.
* You can peek inside their [Dockerfile](https://github.com/Kaggle/docker-python/blob/master/Dockerfile) to see what they're installing inside.
* The images are published on Google Container Registry: [CPU-only](https://gcr.io/kaggle-images/python) and [GPU](https://gcr.io/kaggle-gpu-images/python).

### Configure VS Code

* First, you need to install the [VS Code Remote Development Extension Pack](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack).

* Second, create a `.devcontainer/devcontainer.json` file to store your environment definition ([more info](https://code.visualstudio.com/docs/remote/containers#_create-a-devcontainerjson-file)):
```json
{
    "name": "Kaggle Dev CPU",
    "image": "gcr.io/kaggle-images/python", // Remember to pull latest before rebuilding.
    "extensions": [
        "ms-python.python",
    ],
    "settings": {
        "terminal.integrated.shell.linux": "/bin/bash",
        "python.pythonPath": "/opt/conda/bin/python"
    },
    "devPort": 8888,
    "shutdownAction": "none"
}
```
* Run the VS Code command `Remote-Containers: Rebuild and Reopen in Container`.
* Your VS Code window title should now report `... [Dev Container: Kaggle Dev CPU] - ...`.

### Mimic Kaggle environment

Even though development would now happen in VS Code, I still wanted to train on Kaggle because of the GPU.
And, again, I wanted as few manual steps as possible when switching from running locally to running on Kaggle. So, I decided to *mimic* the Kaggle environment locally so that the script/notebook wouldn't even know the difference.

I already had the same container running, so I just needed to mimic the folder locations:
- `/kaggle/input`: the folder to which the datasets are mapped.
- `/kaggle/working`: the folder where the output is stored (and, current working directory for the notebook/script).

To achieve this:
* Create `input` and `output` folders where it's convenient for you. I chose the `data/` folder within my workspace.
* Create mappings for the container in `devcontainer.json`:
```json
"mounts": [
    "type=bind,source=${localWorkspaceFolder}/data/input,target=/kaggle/input",
    "type=bind,source=${localWorkspaceFolder}/data/output,target=/kaggle/output",
],
```
The above configuration mounts the `data/input` folder from my local workspace into the container at `/kaggle/input`, just like it is on Kaggle. It also maps an extra folder, `data/output` to `/kaggle/output`, so that notebooks can persist data outside of the container.

* Create a script `.devcontainer/setup.sh` that would be executed by VS Code after container creation:
```bash
#!/bin/bash
# Create Kaggle's working directory; -p keeps the script safe to re-run.
mkdir -p /kaggle/working
```

Don't forget to make it executable: `chmod +x .devcontainer/setup.sh`.

And tell VS Code to run it (in `devcontainer.json`):
```json
"postCreateCommand": ".devcontainer/setup.sh",
```
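
To confirm that the container really looks like a Kaggle kernel, a quick check along these lines can be run once the container is up. This is only a minimal sketch: `check_kaggle_layout` and `list_input_files` are hypothetical helper names, and the `root` parameter exists only so the functions can be tried against a scratch directory.

```python
import os
from pathlib import Path


def check_kaggle_layout(root="/kaggle"):
    """Return the expected Kaggle-style folders that are missing under root."""
    expected = ["input", "working"]
    return [name for name in expected if not (Path(root) / name).is_dir()]


def list_input_files(root="/kaggle"):
    """Return sorted paths of all files under <root>/input, relative to it."""
    input_dir = Path(root) / "input"
    found = []
    for dirname, _, filenames in os.walk(input_dir):
        for filename in filenames:
            found.append(str((Path(dirname) / filename).relative_to(input_dir)))
    return sorted(found)


if __name__ == "__main__":
    missing = check_kaggle_layout()
    if missing:
        print(f"Missing folders under /kaggle: {missing}")
    else:
        print("Layout OK")
        for path in list_input_files():
            print(path)
```

If `working` is reported as missing, the `postCreateCommand` script probably did not run; if `input` is missing, the mount is the first thing to check.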

### Getting datasets from Kaggle

At first, I would simply download the datasets manually (to the `data/input` folder) and name them the same way a Kaggle kernel would.
However, when I started using additional datasets, libraries, etc., I went looking for a Kaggle API or tool to automate the process. And luckily [there is one](https://github.com/Kaggle/kaggle-api).

So, I decided to set up VS Code tasks (you can learn more [here](https://code.visualstudio.com/docs/editor/tasks)) that would run the commands to download datasets, files, etc.:
* Get your Kaggle API credentials as described [here](https://github.com/Kaggle/kaggle-api#api-credentials).
* I placed `kaggle.json` in my workspace folder (**make sure to add it to `.gitignore`!**).
* Then we need to make sure it's in the home directory inside the container, so that the API can be used from within the container (without having to switch to local mode).

Add a script called `.devcontainer/setup-mounted.sh` (this one will run after the code has been mounted):
```bash
#!/bin/bash
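# Set up a link to the API key in root's home.
# Assumes the workspace is mounted at /workspaces/kaggle.
# -p and -f keep the script safe to re-run on every attach.
mkdir -p /root/.kaggle
ln -sf /workspaces/kaggle/kaggle.json /root/.kaggle/kaggle.json
chmod 600 /root/.kaggle/kaggle.json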
```

And tell VS Code to run this script after attaching to the container:
```json
"postAttachCommand": ".devcontainer/setup-mounted.sh",
```
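
Before wiring up any tasks, it can help to verify that the credentials file ended up where the Kaggle CLI looks for it. The following is a rough sketch, not part of the original setup: `check_kaggle_credentials` is a hypothetical helper, and the default path simply mirrors the one used in the script above.

```python
import json
import stat
from pathlib import Path


def check_kaggle_credentials(path="/root/.kaggle/kaggle.json"):
    """Return a list of problems found with the Kaggle API credentials file."""
    cred_file = Path(path)
    if not cred_file.is_file():
        return [f"{cred_file} does not exist"]

    problems = []
    # The Kaggle CLI warns when the key is readable by other users.
    mode = stat.S_IMODE(cred_file.stat().st_mode)
    if mode & 0o077:
        problems.append(f"permissions are {oct(mode)}, expected 0o600")

    try:
        data = json.loads(cred_file.read_text())
    except json.JSONDecodeError:
        return problems + ["file is not valid JSON"]
    if not isinstance(data, dict):
        return problems + ["file does not contain a JSON object"]

    # The downloaded token holds a "username" and a "key" field.
    for field in ("username", "key"):
        if not data.get(field):
            problems.append(f"missing field: {field}")
    return problems


if __name__ == "__main__":
    for problem in check_kaggle_credentials():
        print(problem)
```

An empty result means the file exists, is private to the current user, and contains both fields; it does not prove that the key itself is still valid.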
  
* After the Kaggle API is set up, add VS Code tasks (in `.vscode/tasks.json`):

```json
"tasks": [
  {
      "label": "kaggle dataset download",
      "type": "shell",
      "command": "kaggle datasets download ${input:userName}/${input:datasetName} -p ${input:datasetName} --unzip --force",
      "options": {
          "cwd": "/kaggle/input"
      },
      "problemMatcher": []
  }
]
```

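The above task downloads a dataset given in the format `<username>/<dataset name>` into the `/kaggle/input` directory, just like it would appear in Kaggle kernels. The `${input:userName}` and `${input:datasetName}` variables are prompted for when the task runs; they must be declared in the `inputs` section of `tasks.json`.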
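
The same download can also be triggered from a script or notebook cell instead of a VS Code task. Below is a minimal sketch that shells out to the same `kaggle datasets download` command used in the task above; the function names are hypothetical, and it assumes the `kaggle` CLI is on the `PATH` and the credentials are already in place.

```python
import subprocess


def build_download_command(user_name, dataset_name):
    """Build the same CLI call that the VS Code task above runs."""
    return [
        "kaggle", "datasets", "download", f"{user_name}/{dataset_name}",
        "-p", dataset_name,
        "--unzip", "--force",
    ]


def download_dataset(user_name, dataset_name, input_dir="/kaggle/input"):
    """Download and unzip a dataset into <input_dir>/<dataset_name>."""
    command = build_download_command(user_name, dataset_name)
    # check=True raises CalledProcessError if the kaggle CLI exits with an error.
    subprocess.run(command, cwd=input_dir, check=True)
```

Passing the command as a list avoids shell quoting issues, and keeping the command builder separate makes it easy to inspect what would be run before actually running it.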

### Run environment-specific logic (if necessary)

This can be useful if you only have a CPU on your machine and can only afford to train for 2 epochs locally, but would like to run the full training on Kaggle (for instance, 30 epochs).

For this, I used an environment variable that is only set in VS Code.
* Tell VS Code to define this environment variable in `devcontainer.json`:
```json
"containerEnv": {
    "KAGGLE_MODE": "DEV"
},
```
* Use it in code to check whether you're running locally or not:
```python
import os
DEVMODE = os.getenv("KAGGLE_MODE") == "DEV"
print(f"DEV MODE: {DEVMODE}")
EPOCHS = 2 if DEVMODE else 30
```
This is the step I liked the least, since I had to repeat it in every notebook... However, it was only needed when creating a notebook, so it was still much better than doing everything manually every time I wanted to switch between environments.
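
One way to make the repeated part smaller is to wrap the check in a tiny helper, so that each notebook only states the two values it cares about. This is just a sketch of that idea: `is_dev_mode` and `pick` are hypothetical names, and the snippet would still have to be pasted into (or otherwise made available to) each notebook on Kaggle, which this article does not cover.

```python
import os


def is_dev_mode(environ=None):
    """True when KAGGLE_MODE=DEV, i.e. inside the local dev container."""
    environ = os.environ if environ is None else environ
    return environ.get("KAGGLE_MODE") == "DEV"


def pick(dev_value, full_value, environ=None):
    """Return dev_value locally and full_value everywhere else."""
    return dev_value if is_dev_mode(environ) else full_value


# Short run locally, full training on Kaggle.
EPOCHS = pick(2, 30)
print(f"DEV MODE: {is_dev_mode()}, EPOCHS: {EPOCHS}")
```

The optional `environ` argument is there only so the behaviour can be checked with a plain dictionary instead of the real process environment.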

### Extra

You can enable a number of useful extensions for your work. In `devcontainer.json`:
```json
"extensions": [
    "ms-python.python",
    "ms-python.vscode-pylance",
    // Editing/dev process
    "streetsidesoftware.code-spell-checker",
    "wayou.vscode-todo-highlight",
    "janisdd.vscode-edit-csv",
    "davidanson.vscode-markdownlint",
    // VCS helpers
    "donjayamanne.githistory",
    "eamodio.gitlens"
],
```

And don't forget a bunch of useful [settings](https://github.com/Witalia008/kaggle-public/blob/master/.vscode/settings.json) (like formatting and linting your code) for your VS Code environment. 

### Entire setup

You can refer to the files:
* [.devcontainer](https://github.com/Witalia008/kaggle-public/tree/master/.devcontainer) folder.
* [.vscode](https://github.com/Witalia008/kaggle-public/tree/master/.vscode) folder.
* [cassava-inference.py](https://github.com/Witalia008/kaggle-public/blob/master/cassava-leaf-disease-classification/cassava-inference.py) - sample Python script.

### Further improvements

Something I haven't done, but is possible with the Kaggle API:
* It allows you to upload notebooks and run them ([read more](https://github.com/Kaggle/kaggle-api#kernels)), so you won't have to go to the website manually.
* It also allows you to submit to a competition (will be covered in Act 2).
* It has some other features that might be useful for you, like listing the leaderboard, etc.

## Summary

Following the steps described in this article, you can set up a development environment on your local machine that is very similar to the one on Kaggle, with the additional perks of a version control system and debugging (and any other advantage of using an IDE).

This can be achieved in a few steps:
* Configuring VS Code to develop inside a Kaggle container.
* Setting up directory structure similar to Kaggle's with mapping to local machine's storage.
* Using Kaggle API to download datasets and more.
* Using environment variables to have environment-specific logic in the code.

I hope this setup helps ease your work when participating in Kaggle competitions.