# Gathering and Labeling Data


## Setting up the folder structure

We're going to create a set of folders to keep our images, and eventually our model, organized. The first step is to add a `workspace/training_demo` folder like this:

```
internship/
├─ models/
│  ├─ community/
│  ├─ official/
│  ├─ orbit/
│  ├─ research/
│  └─ ...
└─ workspace/
   └─ training_demo/
```

And within the `training_demo` folder, we need this structure:

```
training_demo/
├─ annotations/
├─ exported-models/
├─ images/
│  ├─ test/
│  └─ train/
├─ models/
└─ pre-trained-models/
```


* `annotations`: This folder will be used to store the respective TensorFlow `*.record` files, which contain the list of annotations for our dataset images.

* `exported-models`: This folder will be used to store exported versions of our trained model(s).

* `images`: This folder contains a copy of all the images in our dataset, as well as the respective `*.xml` files produced for each one, once labelImg is used to annotate objects.

* `images/train`: This folder contains a copy of all images, and the respective `*.xml` files, which will be used to train our model.

* `images/test`: This folder contains a copy of all images, and the respective `*.xml` files, which will be used to test our model.

* `models`: This folder will contain a sub-folder for each training job. Each sub-folder will contain the training pipeline configuration file `*.config`, as well as all files generated during the training and evaluation of our model.

* `pre-trained-models`: This folder will contain the downloaded pre-trained models, which shall be used as a starting checkpoint for our training jobs.
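
If you'd rather not create these folders by hand, here is a minimal sketch that builds the same layout with Python's `pathlib`. The function name `create_training_folders` is only an illustration, and the sketch assumes you call it from the `internship` folder.

```python
from pathlib import Path

# Sub-folders of training_demo, matching the tree shown above.
SUBFOLDERS = [
    "annotations",
    "exported-models",
    "images/test",
    "images/train",
    "models",
    "pre-trained-models",
]


def create_training_folders(root):
    """Create the training folder layout under `root` and return the paths."""
    root = Path(root)
    created = []
    for sub in SUBFOLDERS:
        folder = root / sub
        # parents=True also creates `root` itself (and `images/`) if missing;
        # exist_ok=True makes the function safe to run more than once.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

For example, `create_training_folders('workspace/training_demo')` creates the demo layout. Passing a different folder name, such as `workspace/lastname_project`, would set up the same structure for your final project.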


When it comes time to do your final project, you will re-create the `training_demo` folder, but name it something like `lastname_project`. The sub-folders and the process from here on out will be the same.

To gather images for the demo:

* Open the Anaconda command prompt and activate the `internship` environment. Navigate to the `workspace/training_demo/images` folder.
* Download chromedriver from https://sites.google.com/a/chromium.org/chromedriver/downloads and put the `chromedriver.exe` file in the `images` folder.
* Run this command to get the tool we need: `git clone https://github.com/ultralytics/google-images-download`
* Run `python google-images-download\bing_scraper.py --search "pasta sauce shelf" --limit 100 --download --chromedriver ./chromedriver.exe --size medium --format jpg` to execute the image retrieval script and download a batch of images. When you do your final project, you will change "pasta sauce shelf" to your topic. The files will download to a sub-folder in `images`.
> Note: if you get an error saying that the chromedriver isn't correct, try downloading a different version. The chromedriver version needs to match the version of Chrome installed on your machine.
* Inspect your images and verify that they look like what you want to identify.
* Move all the images to the `training_demo/images` folder.
* Run the next cell to rename the files to sequential numbers.


```python
import os

# This path is relative to the current working directory
path = 'workspace/training_demo/images/'
counter = 1
for f in sorted(os.listdir(path)):
    # Split off the extension and lower-case it so .JPG and .PNG files are included
    suffix = os.path.splitext(f)[1].lower()
    if suffix in ('.jpg', '.png'):
        # If you have more than 999 training files, update the :03d to :04d, but this shouldn't be an issue
        new = '{:03d}{}'.format(counter, suffix)
        os.rename(os.path.join(path, f), os.path.join(path, new))
        counter += 1
```

We now need to run the image labeler to label these images. See the documentation here for instructions on labeling: [Image Labeler](https://github.com/tzutalin/labelImg).

* Navigate back to the `training_demo/images` folder (`cd ..`)
* Run the image labeler: `labelimg`. This should pop up a new window for labeling images. We'll use the PascalVOC format.
* Open the source folder `training_demo/images`. Set the save folder as `training_demo/images` - this will put the image data in the right place for us.

We'll go through the basics of image labeling (documentation here: https://github.com/tzutalin/labelImg#usage). There is also a video tutorial that covers this: https://youtu.be/K_mFnvzyLvc?t=9m13s.

We need to label all of the jars of pasta sauce with the label `pasta_sauce_jar`. It is important that we use the same label for all the objects we care about. 

We should now have a collection of `.jpg` and `.xml` files in the `images` folder.
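
Before moving on, it can help to confirm that every image has an `.xml` file and that the same label was used throughout. The sketch below is one way to check this, assuming labelImg saved its Pascal VOC files next to the images with matching file names. The function name `summarize_annotations` is made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def summarize_annotations(image_dir):
    """Return (label_counts, images_without_xml) for a folder of labeled images."""
    image_dir = Path(image_dir)
    label_counts = Counter()
    missing = []
    for image in sorted(image_dir.iterdir()):
        if image.suffix.lower() not in {".jpg", ".png"}:
            continue
        xml_file = image.with_suffix(".xml")
        if not xml_file.exists():
            # This image has not been labeled (or was saved elsewhere).
            missing.append(image.name)
            continue
        # Pascal VOC stores each box as an <object> element with a <name> child.
        root = ET.parse(str(xml_file)).getroot()
        for obj in root.iter("object"):
            label_counts[obj.findtext("name")] += 1
    return label_counts, missing
```

Calling `summarize_annotations('workspace/training_demo/images')` should give a single label, `pasta_sauce_jar`, in the counts and an empty list of missing files. Any extra label (for example a misspelling) or any file name in the second list points to an image that needs another pass in labelImg.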

# Week 3 Project Work

We'll start off by labeling a single item for training our models, but we'll eventually want to train a multi-class model that can identify different objects. The project this week is to pick the object you want to detect and get 100-200 images labeled for that object. You can use the image download trick we used above, or you can gather your own images. If you take pictures, be sure to vary the angles and distances from the object. You want a variety of backgrounds, too. Basically, try to capture the object in as many real-world places and angles as you would expect to encounter once the final model is in use.

It is also possible to capture videos and extract the frames as images. I'd recommend searching for tools to help you do this ([like this site](https://www.raymond.cc/blog/extract-video-frames-to-images-using-vlc-media-player/)). The same advice applies to videos as to still images: capture a variety of angles, distances, and backgrounds.
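
If you're comfortable with Python, one alternative to the VLC approach linked above is OpenCV. This is only a sketch: it assumes the `opencv-python` package is installed (it is not part of the setup in this guide), and `extract_frames`, `frame_filename`, and the `every_n` parameter are names made up for illustration.

```python
from pathlib import Path


def frame_filename(saved_count, suffix="jpg"):
    """Build a zero-padded file name such as frame_0007.jpg."""
    return "frame_{:04d}.{}".format(saved_count, suffix)


def extract_frames(video_path, out_dir, every_n=30):
    """Save every `every_n`-th frame of a video as an image; return how many were saved."""
    # Imported here so the helper above works without OpenCV installed.
    import cv2  # third-party: pip install opencv-python

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    capture = cv2.VideoCapture(str(video_path))
    if not capture.isOpened():
        raise IOError("Could not open video: {}".format(video_path))

    index = 0
    saved = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            # No more frames (or the video could not be decoded further).
            break
        if index % every_n == 0:
            cv2.imwrite(str(out_dir / frame_filename(saved)), frame)
            saved += 1
        index += 1
    capture.release()
    return saved
```

Saving only every 30th frame (roughly one per second for a 30 fps video) avoids filling your dataset with near-identical images. Adjust `every_n` to suit how quickly the camera moves in your video.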