# How to Kaggle the Engineer way. Act 2: Google Colab

Or, how to engineer the heck out of your Kaggle development environment: a play in 2 acts.

This is Act 2 of the play; you can read Act 1 [here](https://link.medium.com/kOoq47KYrfb).

My entire Kaggle work, including what's discussed in this article, is published [here](https://github.com/Witalia008/kaggle-public).

## Prologue/Intro

In the first part of the story I described how to set up VS Code with containers for Kaggle. To me it is of the utmost importance to use methodical approaches, whether in engineering work or in data science. Because of that belief, my ramp-up on Kaggle competitions was in large part setting things up so that I would save time in the future.

In this article, I'll describe how I improved on my initial approach of using VS Code for development for the second competition I took part in. As the title suggests, I moved to Google Colab, but I still wanted to have a proper setup.

## Shortcomings of previous approach

In the first competition, having set up VS Code with the Kaggle container, I was developing locally and training on Kaggle. I did my best to minimize any repetitive tasks involved in switching between the environments. However, there were still a few issues:
* I still had to switch between two environments.
* Kaggle would not let me use a specific version of a dataset (for instance, when training 5 folds, I'd have to download each fold's output and upload them all as a dataset for the ensemble).
* Including dependencies was a pain: I had to upload them, again, as a dataset and then map their location.

In this article I'll describe how I used Google Colab to overcome all of these nuisances and get some additional perks, like a better GPU and a better runtime in general.

## Act 2 - Google Colaboratory for Kaggle with GitHub

Google Colab is a great way to get a free GPU for your data science work. For a small additional fee, you can get Colab Pro with V100/P100 GPUs, which are a few times faster than the T4 ones that you get on Kaggle. It also has other nice features, like mounting Google Drive files, working with GitHub, and more, described below.

I figured it would be an improvement over Kaggle in training speed and ease of use, and I would not need to develop locally. With a bit more research, it looked like I wouldn't even need to leave the Colab environment and could do everything within it (kind of like not leaving my flat).

### Setting up GitHub for Colab in Google Drive

The first thing I did was make sure I could still work with Git without having to switch between environments.

The bad approach: you can open files directly from your private or public repo, edit them, and commit back, as described [here](https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb). There are a few issues, however:
* The commits are only for one notebook, i.e. Colab only works within the context of a single notebook.
* Because of the above, you cannot have dependencies (other files from the repo that the notebook imports).

The good thing is, Colab allows you to [mount your Google Drive](https://www.marktechpost.com/2019/06/07/how-to-connect-google-colab-with-google-drive/).
Once it is mounted, it is a regular folder on the VM instance on which Colab is running.

So, why not clone your GitHub repo?
* Create a GitHub Personal Access Token to use in place of your password - [link](https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token).
* Have some temporary notebook, let's call it `terminal.ipynb`, stored in your Google Drive, with your Drive mounted:
```python
from google.colab import drive
drive.mount("/content/drive")
```
* In Colab Pro, you can then open a terminal instance - [link](https://stackoverflow.com/questions/59318692/how-can-i-run-shell-terminal-in-google-colab).
* In any Colab, you can run terminal commands from the notebook cells:
```python
%%bash
cd /content/drive/MyDrive/
git clone https://<username>:<PAT>@github.com/<username>/<repo name>.git
cd <repo name>
```
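If you would rather not type the token into a cell that might later get committed, one option is to build the clone URL in Python. This is only a sketch: `GITHUB_PAT` is a hypothetical environment variable name, and `my-user` / `my-repo` are placeholders. Keep in mind that Git stores the remote URL, token included, in the clone's `.git/config`.

```python
import os
from urllib.parse import quote


def authenticated_clone_url(username: str, repo: str, token: str) -> str:
    """Build an HTTPS clone URL that embeds a Personal Access Token."""
    user = quote(username, safe="")
    # Percent-encode the token so special characters cannot break the URL.
    secret = quote(token, safe="")
    return f"https://{user}:{secret}@github.com/{user}/{quote(repo, safe='')}.git"


# GITHUB_PAT is a hypothetical variable name; set it in the session beforehand.
token = os.environ.get("GITHUB_PAT", "")
clone_url = authenticated_clone_url("my-user", "my-repo", token)
# Then, for example: subprocess.run(["git", "clone", clone_url], check=True)
```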

At this point, you have all your code git-cloned to your Google Drive, and you can open any notebook within.
I've also set up some code in a [notebook](https://github.com/Witalia008/kaggle-public/blob/master/setup-colab-for-kaggle.ipynb) to work with Git, but I ended up using just the terminal commands.

The only shortcoming is that you have to mount Google Drive at the beginning of each notebook, but I think that's relatively fine.
```python
try:
    from google.colab import drive
    drive.mount("/content/drive")
    %cd /content/drive/MyDrive/Colab\ Notebooks/kaggle
except ImportError:  # google.colab can only be imported inside Colab
    print("Not in Colab")
```
I've created myself a [template notebook](https://github.com/Witalia008/kaggle-public/blob/master/template-colab-kaggle-nb.ipynb) which I copy as a baseline for any other notebook.
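If several cells need to know where they are running, the check can be pulled out into a small helper. This is a minimal sketch, not the code from the template notebook; `in_colab` is a name I made up for illustration.

```python
def in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
    except ImportError:
        return False
    return True


if in_colab():
    from google.colab import drive

    drive.mount("/content/drive")
else:
    print("Not in Colab")
```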

NOTE: as an additional cool hack, to avoid authorizing Colab to access your Drive each time, you can set up auto-mount for each notebook as described [here](https://stackoverflow.com/questions/52808143/colab-automatic-authentication-of-connection-to-google-drive-persistent-per-n).

### Set up for Kaggle

The next step was to set up all the directories needed for Kaggle (`/kaggle/input`, `/kaggle/working`), the Kaggle API secrets, etc. This setup has to run at the beginning of each notebook, so it should live in a script that each notebook imports and whose functions it calls.

You cannot really edit Python scripts in Colab, but you can make your notebooks write to files:
```python
%%writefile setup_colab.py
def setup_colab_for_kaggle():
    ...
```
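One thing to watch for: if a notebook has already imported the script and you then rewrite the file, Python keeps using the old module until it is reloaded. The sketch below demonstrates this with a throwaway module in a temporary directory; `demo_setup` and `greet` are hypothetical names used only for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Throwaway module in a temporary directory, standing in for the setup script.
module_dir = Path(tempfile.mkdtemp())
module_path = module_dir / "demo_setup.py"
sys.path.insert(0, str(module_dir))

module_path.write_text("def greet():\n    return 'v1'\n")
import demo_setup

first = demo_setup.greet()

# Rewrite the file (as %%writefile would). The new source has a different
# length, so the cached bytecode of the old version is not reused.
module_path.write_text("def greet():\n    return 'version 2'\n")
importlib.invalidate_caches()
demo_setup = importlib.reload(demo_setup)
second = demo_setup.greet()
```

Restarting the runtime has the same effect, but reloading is quicker while you are still iterating on the script.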

In that script, let's perform the necessary setup:
* Map and/or create directories:
```python
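from pathlib import Path

kaggle_dir = Path("/kaggle")
drive_content_dir = Path("/content/drive/MyDrive/kaggle")
kaggle_dir.mkdir(exist_ok=True)
# local_working is a parameter of the setup function: keep /kaggle/working on the VM or on Drive.
if local_working:
    (kaggle_dir / "working").mkdir(exist_ok=True)
target_content_dirs = ["input", "output"] + ([] if local_working else ["working"])
for content_dir in target_content_dirs:
    (kaggle_dir / content_dir).symlink_to(drive_content_dir / content_dir)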
```
* Set up the Kaggle API token. Upload your `kaggle.json` file to your Drive (I placed it in the repo folder, though if you do too, **make sure to add it to the `.gitignore`**).
```python
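from pathlib import Path

drive_sources_dir = Path("/content/drive/MyDrive/Colab Notebooks/kaggle")
kaggle_config = Path.home() / ".kaggle"
kaggle_config.mkdir(exist_ok=True)  # the directory may not exist on a fresh VM
(kaggle_config / "kaggle.json").symlink_to(drive_sources_dir / "kaggle.json")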
```
* You could use the same approach to set up Weights & Biases API key or any other key.
* Put any other setup code in that file.
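Because the setup runs at the top of every notebook, and cells tend to get re-run, it helps if creating the links is safe to repeat. Here is a minimal sketch of one way to do that; `ensure_symlink` is a hypothetical helper, not the function from my script, and the demo uses a temporary directory instead of `/kaggle` so it can run anywhere.

```python
import tempfile
from pathlib import Path


def ensure_symlink(link: Path, target: Path) -> bool:
    """Create `link` pointing at `target`; return True if it was created now.

    If `link` already points at `target`, nothing happens. If something else
    already exists at `link`, symlink_to raises FileExistsError.
    """
    if link.is_symlink() and link.resolve() == target.resolve():
        return False
    link.parent.mkdir(parents=True, exist_ok=True)
    link.symlink_to(target)
    return True


# Demo in a scratch directory standing in for Drive and /kaggle.
root = Path(tempfile.mkdtemp())
target = root / "drive" / "input"
target.mkdir(parents=True)
link = root / "kaggle" / "input"

created_first = ensure_symlink(link, target)   # creates the link
created_second = ensure_symlink(link, target)  # no-op on the second run
```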

There's quite a bit of additional logic, so the full code is in the [notebook](https://github.com/Witalia008/kaggle-public/blob/master/setup-colab-for-kaggle.ipynb) and the [output script](https://github.com/Witalia008/kaggle-public/blob/master/setup_colab.py).

Now, each notebook would just have to import the code and run the setup function:
```python
from setup_colab import setup_colab_for_kaggle
setup_colab_for_kaggle(check_env=False, local_working=True)
```

Using my script, the output would tell us about the resulting setup:
```
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Colab Notebooks/kaggle
Content of Drive Kaggle data dir (/content/drive/MyDrive/kaggle): ['/content/drive/MyDrive/kaggle/input', '/content/drive/MyDrive/kaggle/working', '/content/drive/MyDrive/kaggle/output']
Content of Kaggle data dir (/kaggle): ['/kaggle/working', '/kaggle/output', '/kaggle/input']
Content of Kaggle data subdir (/kaggle/input): ['/kaggle/input/vinbigdata']
Content of Kaggle data subdir (/kaggle/output): ['/kaggle/output/vbdyolo-out']
Content of Kaggle data subdir (/kaggle/working): []
Content of Kaggle config dir (/root/.kaggle): ['/root/.kaggle/kaggle.json']
Loaded environment variables from .env file: ['WANDB_API_KEY'].
```
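The last line of that output comes from loading API keys out of a `.env` file. The sketch below shows one way to do it with only the standard library, assuming a simple `KEY=VALUE` format; it is an illustration and not necessarily how my script does it (the `python-dotenv` package is a more complete option).

```python
import os
from pathlib import Path
from typing import List


def load_env_file(path: Path) -> List[str]:
    """Load KEY=VALUE pairs from `path` into os.environ and return the keys."""
    loaded = []
    for raw_line in path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not a pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop surrounding whitespace and optional quotes around the value.
        os.environ[key] = value.strip().strip('"').strip("'")
        loaded.append(key)
    return loaded


# Example (the path is hypothetical):
# print("Loaded environment variables:", load_env_file(Path(".env")))
```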

Cool. From this point on, the setup mimics that of Kaggle kernels and of the local environment described in the previous article, so the notebooks won't know the difference when switching between environments (should one want to). But let's take this even further so we won't have to switch at all.

### Getting data from Kaggle

At first, I would download the data locally, then upload it to my Drive, and it would end up mapped to the correct location as shown above. However, that wasn't too practical, and most importantly, *Colab was very slow in accessing files from Google Drive*. It turns out that it's much faster to download the dataset each time into the local storage of Colab's VM. In your notebook just have:
```
!kaggle competitions download vinbigdata-chest-xray-abnormalities-detection -f train.csv -p {INPUT_FOLDER_DATA} --unzip
!kaggle datasets download xhlulu/vinbigdata-chest-xray-resized-png-1024x1024 -p {INPUT_FOLDER_PNG} --unzip
```

One epoch of training EfficientNet-B3 with the data read from Google Drive would take ~1 hour, whereas downloading 8GB takes around 8 minutes and training 1 epoch then comes down to ~15 minutes. A 4x speed improvement per epoch, not bad!
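Since the download cell gets re-run whenever you restart the kernel or re-execute the notebook within the same session, you may want to skip it when the files are already on the VM. A minimal sketch, assuming the `kaggle` CLI is installed and configured; `download_dataset_once` is a hypothetical helper, and the `run` argument exists only so the logic can be tried without network access.

```python
import subprocess
from pathlib import Path


def download_dataset_once(dataset: str, dest: Path, run=subprocess.run) -> bool:
    """Download and unzip a Kaggle dataset into `dest` unless it already has files.

    Returns True if a download was started, False if it was skipped.
    """
    dest.mkdir(parents=True, exist_ok=True)
    if any(dest.iterdir()):
        return False
    run(
        ["kaggle", "datasets", "download", dataset, "-p", str(dest), "--unzip"],
        check=True,
    )
    return True


# Example:
# download_dataset_once(
#     "xhlulu/vinbigdata-chest-xray-resized-png-1024x1024",
#     Path("/kaggle/input/vinbigdata-png"),  # hypothetical destination
# )
```

The VM's storage does not survive between sessions, so this only saves time within one session.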

### Versioning datasets and models

This section will shed some light on why we needed to set up the Kaggle API token.

Being systematic and methodical about my approaches to Kaggle competitions, I started thinking about how to organize the data I use and the models that get produced. Kind of like Git for models and data. And, in fact, I found out there's [Data Version Control](https://dvc.org/), and some other tools. However, just for Kaggle competitions, something simpler would do. One option would be to just store everything in folders, but that's not quite neat, data could be lost, and it's hard to attach notes to each version. So I chose to store everything as Kaggle datasets.

That could be done as easily as uploading files at the end of each notebook:
* Place the files you want to upload in one folder (let's say, a `yolo_pred/` folder with YOLO predictions and a `yolo.pt` file).
* Create a `dataset-metadata.json` file, which is parsed and understood by the Kaggle API:
```python
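import json
from pathlib import Path

# folder_path, user_name and dataset_name are defined earlier in the notebook.
with open(Path(folder_path) / "dataset-metadata.json", "w") as f:
    json.dump({
        "title": dataset_name,
        "id": f"{user_name}/{dataset_name}",
        "licenses": [{ "name": "CC0-1.0" }]
    }, f, indent=4)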
```
* Use a Kaggle API command to upload the files at the end of your notebook execution (e.g. model training).
If this is the first upload:
```
!kaggle datasets create -p {OUTPUT_FOLDER_CUR} -r zip
```
Or, if the dataset already exists, create a new version:
```
!kaggle datasets version -m "{version_message}" -p {OUTPUT_FOLDER_CUR} -r zip
```
* Now, in your post-processing, you can download a specific version (a specific trained model and its output) to use for submission:
```
!kaggle datasets download "username/dataset-name" -v {yolo_version} -p {version_data["path"]} --unzip --force
```
Spoiler alert: this won't work with the official Kaggle API, see the next section.
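The upload steps above can be wrapped into two small helpers, so that each training notebook only decides whether this is the first upload. This is a sketch with hypothetical function names; the CLI arguments are the same ones used in the commands above.

```python
import json
from pathlib import Path


def write_dataset_metadata(folder: Path, user_name: str, dataset_name: str) -> Path:
    """Write the dataset-metadata.json that the Kaggle API expects in `folder`."""
    metadata = {
        "title": dataset_name,
        "id": f"{user_name}/{dataset_name}",
        "licenses": [{"name": "CC0-1.0"}],
    }
    metadata_path = folder / "dataset-metadata.json"
    metadata_path.write_text(json.dumps(metadata, indent=4))
    return metadata_path


def upload_command(folder: Path, first_upload: bool, message: str = "") -> list:
    """Return the kaggle CLI arguments for creating or versioning a dataset."""
    if first_upload:
        return ["kaggle", "datasets", "create", "-p", str(folder), "-r", "zip"]
    return ["kaggle", "datasets", "version", "-m", message, "-p", str(folder), "-r", "zip"]


# Example:
# write_dataset_metadata(Path(OUTPUT_FOLDER_CUR), user_name, dataset_name)
# subprocess.run(upload_command(Path(OUTPUT_FOLDER_CUR), False, "fold 3"), check=True)
```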

### Fixing Kaggle API for our needs

As a bit of a digression from the main story of setting everything up, I needed to add/fix a few features in the Kaggle API. The push for this came when, at one point, I wasn't able to reproduce a good result I had obtained with an earlier version. I had everything versioned and every setting recorded, so that should have been close to impossible... Yet it was happening. Then I noticed I was getting the same result for any version I selected. It turned out that the Kaggle API was silently ignoring my version request and always downloading the latest one.
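In hindsight, a cheap way to catch this kind of problem early is to fingerprint what was actually downloaded and compare the fingerprints across versions: if two supposedly different versions hash the same, something is off. A minimal sketch with a hypothetical `folder_digest` helper:

```python
import hashlib
from pathlib import Path


def folder_digest(folder: Path) -> str:
    """Return a SHA-256 hex digest over the relative paths and contents of all files."""
    digest = hashlib.sha256()
    for path in sorted(p for p in folder.rglob("*") if p.is_file()):
        digest.update(path.relative_to(folder).as_posix().encode())
        with path.open("rb") as f:
            # Read in 1 MiB chunks so large model files don't fill the memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


# Example (folders are hypothetical):
# if folder_digest(Path("yolo_v3")) == folder_digest(Path("yolo_v7")):
#     print("Warning: both versions contain identical files")
```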

Their GitHub repo mentioned that they welcome changes, so I decided to take matters into my own hands (I really needed the functionality for the competition).
* First, I needed to make sure Kaggle would download a specific version. My PR is [here](https://github.com/Kaggle/kaggle-api/pull/335).
* Second, since I was already on a roll, I fixed downloading a single file from a competition so that it would work correctly (the full dataset was ~200GB, so I was downloading just the `.csv` file with train labels and getting ~8GB of the same images from another dataset). My PR is [here](https://github.com/Kaggle/kaggle-api/pull/336).

Apparently, though, the process of getting changes into the Kaggle API is slow (they have a private repo, and some dev has to go and merge the changes...).
So, for now, both changes live in my fork [here](https://github.com/Witalia008/kaggle-api/tree/witalia-main).

You can install Kaggle API with my fixes from your Colab notebook like so:
```
!pip install -U git+https://github.com/Witalia008/kaggle-api.git@witalia-main
```

### Submitting to Kaggle

For the particular competition I was taking part in, the submission was just a `.csv` file, so I would simply run the following command at the end of my post-processing notebook:
```bash
!kaggle competitions submit \
    vinbigdata-chest-xray-abnormalities-detection \
    -f {WORK_FOLDER}/submission.csv \
    -m "{submission_message}"
```

And then check on the result of that submission:
```bash
!kaggle competitions submissions vinbigdata-chest-xray-abnormalities-detection
```
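Submissions are usually limited per day, so it can be worth a quick sanity check of the file before submitting. The sketch below is a hypothetical helper that validates the header and counts the rows; the expected column names are something you pass in yourself (take them from the competition's sample submission).

```python
import csv
from pathlib import Path


def check_submission(path: Path, expected_columns: list) -> int:
    """Validate the CSV header and return the number of data rows.

    Raises ValueError if the header differs or the file has no data rows.
    """
    with path.open(newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header != expected_columns:
            raise ValueError(f"Unexpected header: {header}")
        row_count = sum(1 for _ in reader)
    if row_count == 0:
        raise ValueError("Submission has no rows")
    return row_count


# Example (column names are an assumption, check the sample submission):
# check_submission(Path(WORK_FOLDER) / "submission.csv", ["image_id", "PredictionString"])
```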

### Entire setup

You can refer to the files with the complete implementation:
* [setup-colab-for-kaggle.ipynb](https://github.com/Witalia008/kaggle-public/blob/master/setup-colab-for-kaggle.ipynb) - the notebook with code to work with Git, and also setup-colab logic.
* [setup_colab.py](https://github.com/Witalia008/kaggle-public/blob/master/setup_colab.py) - the setup script (as produced by the above notebook), which contains functions to map directories, configure keys, etc.
* [template-colab-kaggle-nb.ipynb](https://github.com/Witalia008/kaggle-public/blob/master/template-colab-kaggle-nb.ipynb) - the template notebook that has required setup which I use as a base for other notebooks.
* [vbd-yolov5.ipynb](https://github.com/Witalia008/kaggle-public/blob/master/vinbigdata-chest-xray-abnormalities-detection/vbd-yolov5.ipynb) - example notebook that downloads datasets, trains a model, and stores output as a dataset.
* [vbd-postprocess-yolo.ipynb](https://github.com/Witalia008/kaggle-public/blob/master/vinbigdata-chest-xray-abnormalities-detection/vbd-postprocess-yolo.ipynb) - example notebook that downloads multiple versions of the previous notebook's outputs as datasets, ensembles them, and submits to Kaggle.

### Further improvements

The problem with this approach is losing a full IDE's capabilities (like debugging, or working with file formats other than notebooks). There are limited debugging capabilities with pdb in notebooks, and one can edit files with Vim or Nano through the terminal, but that's not very convenient. But I hear there's a way to connect VS Code to Google Colab ([link](https://amitness.com/vscode-on-colab/)), so I'll definitely try that out, and maybe there will be a sequel.

## Epilogue/Summary

The setup described in this article is a further improvement to the Kaggle development environment: with a few simple steps, it lets you download data, do your data science work, store the models/output, and submit to competitions without ever leaving Colab notebooks.

This can be achieved with the following steps:
* Mounting Google Drive and operating through terminal commands.
* Cloning the Git repo and tracking files as you would on a local machine.
* Creating a script to run setup commands at the beginning of each notebook to mimic the Kaggle environment, set up API keys, and more.
* Using Kaggle API to download datasets and competition data from notebooks into Colab VM.
* Using Kaggle API to store model versions and their outputs as datasets.
* Using Kaggle API to submit to competitions directly from notebooks.

I hope that you find this setup useful and that it lets you spend less time on configuring everything and more time on solving Kaggle competitions. Good luck :)