```python
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.vision import *
```

## Data and Motivations

The idea here is to make a dog classifier from the [Stanford Dogs dataset](http://vision.stanford.edu/aditya86/ImageNetDogs/). It's basically the same as the pets classifier from Fast.ai lesson one, but with more dogs and no cats. (My real motivation is that my mom likes huskies and my sister likes German shorthairs, and neither was included in the Fast.ai pet classifier.)

This is my first time using Google Cloud and my first try at ML, so I'm going to list everything.

To get the dataset I first tried the fastai function `untar_data`, but it threw the error `ReadError: not a gzip file`: the function expects a gzip-compressed archive and the Stanford file is a plain, uncompressed `.tar`. Easy enough: I'm using Google Cloud, so I can just download it on the VM with:

`wget http://vision.stanford.edu/aditya86/ImageNetDogs/images.tar`

and then unpack it with:

`tar -xvf images.tar`
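
If you'd rather stay inside the notebook, the unpacking step can be sketched with Python's standard `tarfile` module. This is a minimal sketch, not what `untar_data` does internally; the helper name `extract_plain_tar` and the destination directory are my own choices. The key detail is the mode string: `"r:"` reads an uncompressed tar, whereas forcing gzip mode on a plain tar raises the same `ReadError` as above.

```python
import tarfile
from pathlib import Path


def extract_plain_tar(archive_path, dest_dir):
    """Extract an uncompressed .tar archive and return the top-level names in dest_dir.

    Hypothetical helper for illustration; only extract archives you trust.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:" = read without compression. Opening a plain tar with "r:gz"
    # raises tarfile.ReadError ("not a gzip file").
    with tarfile.open(archive_path, mode="r:") as tar:
        tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())
```

For example, `extract_plain_tar("images.tar", "/home/jupyter/data/StanfordDogs")` would unpack the archive downloaded by the `wget` command above into that directory.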


**Access to Gcloud**

When I first set up the VM it worked like a charm. But the next day, when I started it back up, I wasn't able to connect to the Jupyter notebook. The original command specified port 8080:

`gcloud compute ssh --zone=$ZONE jupyter@$INSTANCE_NAME -- -L 8080:localhost:8080`

I noticed that when I started the notebook the server was on `8889`. Reconnecting to the shell with port `8889` did the trick. (Although I'm not sure what changed from yesterday.)

## The first attempt

```python
images_path = "/home/jupyter/data/StanfordDogs/Images/"
tfms = get_transforms(do_flip=False)
data = ImageDataBunch.from_folder(images_path)
```

Unfortunately, this won't work, because the images aren't split up into subdirectories for `train`, `valid` and `test`, which is the standard for ImageNet-type datasets.

The current dataset is structured like so:

```
- Images/
  - n02110185-Siberian_husky
    + n02110185_184.jpg
    + ...
    + ...
```

(I also don't see german shorthair in this list. Oh well!)
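
Scanning that many folder names by eye is easy to get wrong, so a case-insensitive search over the class folders is a quick sanity check. This is a small sketch; `find_breeds` is a hypothetical helper of mine, and it assumes every class is a direct subfolder of `Images/` as shown above.

```python
from pathlib import Path


def find_breeds(images_dir, keyword):
    """Return class folder names that contain keyword, ignoring case, '_' and '-'."""

    def squash(text):
        # Treat "Short-haired", "short_haired" and "short haired" alike.
        return text.lower().replace("_", " ").replace("-", " ")

    needle = squash(keyword)
    return sorted(
        d.name
        for d in Path(images_dir).iterdir()
        if d.is_dir() and needle in squash(d.name)
    )
```

Searching for a short keyword works best, since breed names are often spelled differently than expected: `find_breeds("/home/jupyter/data/StanfordDogs/Images/", "german")` lists every folder with "german" in its name.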

So, basically, I need to split the images up so that `ImageDataBunch.from_folder` can understand them. Like the following:

```
path\
  train\
    clas1\
    clas2\
    ...
  valid\
    clas1\
    clas2\
    ...
  test\
```
  
I recently read that people tend to use 50% to 90% for training and the rest for validation.
  
training data | .      | .      | validation data | test data
------------- |--------|--------|-----------------|----------
fold 1        | fold 2 | fold 3 | fold 4          | fold 5

I'll use folds 1-3 for training, fold 4 for validation and fold 5 for testing.

So I need a script that, for each dog folder, moves 60% of the images to `train/`, 20% to `valid/` and 20% to `test/`. While I'm at it I'll normalize the directory names to get better class names. (They're all over the place in terms of lower- and upper-case.)
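
Here is one way such a script could look. It's a sketch under a few assumptions of mine rather than a definitive implementation: it copies files instead of moving them (so the original `Images/` folder stays intact), it keeps class subfolders under `test/` so the labels aren't lost, and it cleans class names by dropping everything up to the first `-` and lower-casing the rest. The function names, the fixed seed and the destination layout are all hypothetical.

```python
import random
import shutil
from pathlib import Path


def clean_class_name(folder_name):
    """Turn e.g. 'n02110185-Siberian_husky' into 'siberian_husky'."""
    _, _, breed = folder_name.partition("-")
    return (breed or folder_name).lower()


def split_dataset(src_dir, dest_dir, fractions=(0.6, 0.2, 0.2), seed=42):
    """Copy each class folder's images into train/valid/test under dest_dir.

    fractions gives the train and valid shares; test gets whatever is left.
    Returns the number of images copied into each split.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    counts = {"train": 0, "valid": 0, "test": 0}
    for class_dir in sorted(Path(src_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        images = sorted(p for p in class_dir.iterdir() if p.is_file())
        rng.shuffle(images)
        n_train = int(len(images) * fractions[0])
        n_valid = int(len(images) * fractions[1])
        splits = {
            "train": images[:n_train],
            "valid": images[n_train:n_train + n_valid],
            "test": images[n_train + n_valid:],
        }
        class_name = clean_class_name(class_dir.name)
        for split_name, files in splits.items():
            target = Path(dest_dir) / split_name / class_name
            target.mkdir(parents=True, exist_ok=True)
            for image in files:
                shutil.copy2(image, target / image.name)
            counts[split_name] += len(files)
    return counts
```

Calling `split_dataset("/home/jupyter/data/StanfordDogs/Images", "/home/jupyter/data/StanfordDogs/split")` would then produce the `train/`, `valid/` and `test/` layout described above (the `split` destination is just an example). Because the counts are rounded down per class, `test/` picks up any leftover images.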