# Creating the Huacos Data Set from Google Images

The next few lines follow the instructions from https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson2-download.ipynb and collect images from Google for the categories of Huacos that we are interested in. The Google search terms used were:

* mochica ceramics
* chimu ceramics
* chavin ceramics
* nazca ceramics
* paracas ceramics
* inca ceramics
* tiahuanaco ceramics

**NB** A quick visual inspection of the images shows two problems. First, not all of the images returned are actual photographs, even though we asked Google to search only for photographs. Second, several images appeared in more than one search, i.e. the same 'huaco' was labeled as both Inca and Nazca. Some manual cleaning of the data set will be needed, as these problems could affect the accuracy of the results.
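
Once the images are on disk, the second problem can be partly checked mechanically. The following is a minimal sketch, assuming one sub-folder per class under a common root; `find_cross_class_duplicates` is a hypothetical helper, not part of fastai. It only catches byte-identical files, so resized or re-encoded copies of the same huaco would still need a visual check.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_cross_class_duplicates(root):
    """Return {digest: [paths]} for files whose bytes occur in more than one class folder."""
    by_digest = defaultdict(list)
    # Assumption: images live exactly one level below `root`, e.g. root/nazca/0001.jpg
    for file in sorted(Path(root).glob('*/*')):
        if file.is_file():
            digest = hashlib.sha256(file.read_bytes()).hexdigest()
            by_digest[digest].append(file)
    # Keep only content that shows up under at least two different class folders.
    return {digest: files for digest, files in by_digest.items()
            if len({f.parent.name for f in files}) > 1}
```

Calling it on the data folder used later in this notebook (`data/huacos`) would return one entry per image that is filed under two or more classes.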

In [1]:
from fastai.vision import *

## Create Directories for the data

First create the directories for the data and the `csv` files that point to the data, then download the data.

Classification classes. For now I am limiting the classes to the four for which I have found good data. The other classes do not return good search results and produce images that belong to the other classes. Until I find better pictures I will limit this notebook to mochica, chimu, chavin and nazca.

In [None]:
#classes = ['inca', 'paracas', 'nazca', 'chimu', 'mochica', 'chavin', 'tiahuanaco']

classes = ['nazca', 'chimu', 'mochica', 'chavin']

## Download the images
The next lines of code have to be run once for each category. Set up the values using the cells in earlier parts of this notebook.

In [3]:
path = Path('data/huacos')

In [None]:
folder = 'mochica'
file = 'urls_mochica.csv'
dest = path/folder
dest.mkdir(parents=True, exist_ok=True)

print(file)
print(dest)

In [None]:
download_images(path/file, dest, max_pics=400)
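
The same folder, file and destination pattern repeats for every category, so it could also be written as a loop. This is a sketch only: `download_plan` and `download_all` are hypothetical helpers, and the downloader is passed in as an argument so that the example does not depend on fastai being installed. It assumes the URL files follow the `urls_<category>.csv` naming used in this notebook.

```python
from pathlib import Path


def download_plan(root, categories):
    """Return one (url_file, dest_folder) pair per category."""
    root = Path(root)
    return [(root / f'urls_{name}.csv', root / name) for name in categories]


def download_all(root, categories, downloader, max_pics=400):
    """Create each destination folder, then hand it to `downloader`."""
    for url_file, dest in download_plan(root, categories):
        dest.mkdir(parents=True, exist_ok=True)
        # `downloader` is expected to accept (url_file, dest, max_pics=...)
        downloader(url_file, dest, max_pics=max_pics)
```

With the names from this notebook the call would look like `download_all(path, ['mochica', 'chimu'], download_images)`.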

In [4]:
folder = 'chimu'
file = 'urls_chimu.csv'
dest = path/folder
print(dest)

data/huacos/chimu


In [5]:
download_images(path/file, dest, max_pics=400)

Error  Invalid URL '': No schema supplied. Perhaps you meant http://?
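
The `Invalid URL ''` errors above suggest that the URL file contains empty lines, since the URL being reported is an empty string. A possible fix is to clean the file before downloading. The sketch below assumes one URL per line; `keep_valid_urls` and `clean_url_file` are hypothetical helpers, not fastai functions.

```python
from pathlib import Path


def keep_valid_urls(lines):
    """Strip whitespace and keep only lines that start with http:// or https://."""
    stripped = (line.strip() for line in lines)
    return [url for url in stripped if url.startswith(('http://', 'https://'))]


def clean_url_file(url_file):
    """Rewrite `url_file` in place and return the number of lines dropped."""
    url_file = Path(url_file)
    lines = url_file.read_text().splitlines()
    urls = keep_valid_urls(lines)
    url_file.write_text('\n'.join(urls) + '\n')
    return len(lines) - len(urls)
```

Running `clean_url_file(path/file)` before the download cell should remove the blank entries. Note that it overwrites the file, so keep a copy if the original matters.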

In [None]:
folder = 'chavin'
file = 'urls_chavin.csv'
dest = path/folder
print(dest)
download_images(path/file, dest, max_pics=400)

In [None]:
folder = 'inca'
file = 'urls_inca.csv'
dest = path/folder
print(dest)
download_images(path/file, dest, max_pics=400)

In [None]:
#paracas pottery -nazca -mochica -chimu -inca -skull -ballestas -reserve -islas -pisco filetype:jpg
folder = 'paracas'
file = 'urls_paracas.csv'
dest = path/folder
print(dest)
download_images(path/file, dest, max_pics=400)

In [None]:
#nazca pottery -nazca -mochica -chimu -inca -skull -ballestas -reserve -islas -pisco -paracas filetype:jp
folder = 'nazca'
file = 'urls_nazca.csv'
dest = path/folder
print(dest)
download_images(path/file, dest, max_pics=400)

In [None]:
#tiahuanaco pottery -nazca -mochica -chimu -inca -skull -inca -ballestas -reserve -islas -pisco -paracas
folder = 'tiahuanaco'
file = 'urls_tiahuanaco.csv'
dest = path/folder
print(dest)
download_images(path/file, dest, max_pics=400)

## Look at the Data

In [None]:
np.random.seed(42)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
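
`valid_pct=0.2` holds out a random 20% of the images for validation, and fixing the seed first keeps that split the same from run to run. A rough illustration of the idea in plain Python follows. This is not fastai's code: `split_indices` is a hypothetical helper and uses Python's `random` module rather than NumPy, so the actual indices will differ from the ones fastai picks.

```python
import random


def split_indices(n_items, valid_pct=0.2, seed=42):
    """Shuffle range(n_items) with a fixed seed and split off a validation share."""
    rng = random.Random(seed)       # private generator, so the seed is always applied
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)  # number of validation items
    return idxs[cut:], idxs[:cut]   # (train, valid)
```

Calling the function twice with the same seed gives the same two lists, which is what makes results comparable between runs.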

In [None]:
data.classes

In [None]:
data.show_batch(rows=3, figsize=(7,8))

Some statistics about the data set: the classes, the number of classes, and the lengths of the training and validation sets.

In [None]:
data.classes, data.c, len(data.train_ds), len(data.valid_ds)
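
The totals above do not show how the images are distributed across classes. A small sketch that counts the files in each class folder directly on disk is given below; `count_images_per_class` is a hypothetical helper, and the list of image suffixes is an assumption.

```python
from collections import Counter
from pathlib import Path

# Assumption: these suffixes cover the downloaded files.
IMAGE_SUFFIXES = {'.jpg', '.jpeg', '.png', '.gif', '.bmp'}


def count_images_per_class(root):
    """Count image files in each immediate sub-folder of `root`."""
    counts = Counter()
    for file in Path(root).glob('*/*'):
        if file.is_file() and file.suffix.lower() in IMAGE_SUFFIXES:
            counts[file.parent.name] += 1
    return counts
```

A strongly unbalanced result from `count_images_per_class(path)` would be one more thing to keep in mind when reading the error rate.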

## First pass at training a Model

Let's train a model. We will use the 'raw' data that we downloaded. We don't expect very good results because we know that the data set has the problems already mentioned, and we also noticed several items that were not photographs. These will affect the results, but we need a trained model so that we can clean up the data set later on.

### Train the Model

In [None]:
learn = cnn_learner(data, models.resnet34, metrics=error_rate)

In [None]:
learn.fit_one_cycle(4)

The results are not very good. Let's see if we can improve them.

In [None]:
learn.save('stage-1')

In [None]:
learn.unfreeze()

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(2, max_lr=slice(3e-5,3e-4))
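
The `slice(3e-5,3e-4)` passed as `max_lr` asks for different learning rates in different parts of the network: the lowest for the earliest layers and the highest for the head. As far as I understand fastai v1, the values in between are spaced geometrically across the learner's layer groups. The sketch below reproduces that arithmetic in plain Python; `spread_learning_rates` is a hypothetical name, and the number of layer groups is left as a parameter because it depends on the model.

```python
def spread_learning_rates(start, stop, n_groups):
    """Geometrically spaced rates from `start` (earliest layers) to `stop` (head)."""
    if n_groups == 1:
        return [stop]
    # Constant multiplier between neighbouring layer groups.
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]
```

For example, `spread_learning_rates(3e-5, 3e-4, 3)` gives three rates, with the middle one being the geometric mean of the two ends.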

In [None]:
learn.save('stage-2')

In [None]:
learn.load('stage-2');

In [None]:
interp = ClassificationInterpretation.from_learner(learn)

In [None]:
interp.plot_confusion_matrix()

The confusion matrix above confirms my initial statement that the model is not doing very well. Let's try to clean up the data and remove images that should not be there to see if we can improve things.
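
For readers new to the plot: each row of a confusion matrix is an actual class, each column a predicted class, and the off-diagonal cells are the mistakes. The sketch below shows how the counts and the error rate relate, using made-up labels; `confusion_counts` and `error_rate_from_matrix` are hypothetical helpers, not the fastai implementation.

```python
from collections import Counter


def confusion_counts(actual, predicted, classes):
    """Return a matrix where entry [i][j] counts items of class i predicted as class j."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in classes] for a in classes]


def error_rate_from_matrix(matrix):
    """Share of items that fall outside the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 1 - correct / total


# Illustrative labels only, not results from this notebook.
classes = ['chimu', 'nazca']
actual = ['chimu', 'chimu', 'nazca', 'nazca']
predicted = ['chimu', 'nazca', 'nazca', 'nazca']
matrix = confusion_counts(actual, predicted, classes)
```

Here one of the two chimu items is predicted as nazca, so one of the four items lands off the diagonal.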

In [None]:
learn.lr_find()

## Cleaning up the data

As I mentioned earlier, the data that was collected contains images that should not be in the data set at all or that are filed under the wrong class. These 'rogue' images will hurt the accuracy and performance of the model, so we are going to clean the data set using the `ImageCleaner` widget from `fastai.widgets`. With this widget we can prune our top losses, removing photos that don't belong.

In [None]:
from fastai.widgets import *

First we need to get the file paths from our top losses. We can do this with `.from_toplosses`. We then feed the top-loss indexes and the corresponding dataset to `ImageCleaner`. In order to clean the entire set of images, we need to create a new dataset without the split.

In [None]:
db = (ImageList.from_folder(path)
                   .split_none()
                   .label_from_folder()
                   .transform(get_transforms(), size=224)
                   .databunch()
     )

Create a new learner that uses the new databunch with all of the images and load the weights saved earlier.

In [None]:
learn_cln = cnn_learner(db, models.resnet34, metrics=error_rate)

learn_cln.load('stage-2');

In [None]:
ds, idxs = DatasetFormatter().from_toplosses(learn_cln)
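
Conceptually, the top losses are just the items ordered by how badly the model got them wrong, worst first, so that the most suspicious images are reviewed before the rest. A minimal sketch of that ordering in plain Python, with `top_loss_indices` as a hypothetical helper and illustrative loss values:

```python
def top_loss_indices(losses, n=None):
    """Return item indices sorted by loss, largest first, optionally only the first n."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order if n is None else order[:n]
```

With losses `[0.1, 2.5, 0.7, 1.9]`, the items would be presented in the order 1, 3, 2, 0.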

Now we run the image cleaner.

In [None]:
ImageCleaner(ds, idxs, path)

## View Cleaned up data

In [None]:
#If you already cleaned your data, run this cell instead of the one before
np.random.seed(42)
data = ImageDataBunch.from_csv(path, folder=".", valid_pct=0.2, csv_labels='cleaned.csv',
         ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

In [None]:
data.classes

In [None]:
data.show_batch(rows=3, figsize=(7,8))
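
It can also help to check how many images per class survived the cleaning. The sketch below reads `cleaned.csv` with the standard library. It assumes the file has a header row with a `label` column, which is how I understand the widget writes it; `label_counts` is a hypothetical helper.

```python
import csv
from collections import Counter


def label_counts(csv_path, label_column='label'):
    """Count rows per label in a CSV file that has a header row."""
    with open(csv_path, newline='') as f:
        return Counter(row[label_column] for row in csv.DictReader(f))
```

In this notebook the call would be `label_counts(path/'cleaned.csv')`.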

Since we already cleaned up the data from `top_losses` we run a different set of commands to create `db` and the new data bunch. Otherwise all results will be overwritten by the new run of `ImageCleaner`

In [None]:
db = (ImageList.from_csv(path, 'cleaned.csv', folder='.')
                    .split_none()
                    .label_from_df()
                    .transform(get_transforms(), size=224)
                    .databunch()
      )

Create a new learner

In [None]:
learn_cln = cnn_learner(db, models.resnet34, metrics=error_rate)

In [None]:
#ds, idxs = DatasetFormatter().from_similars(learn_cln)

In [None]:
#ImageCleaner(ds, idxs, path, duplicates=True)

In [None]:
# Recreate the databunch
#db = (ImageList.from_csv(path, 'cleaned.csv', folder='.')
#                    .split_none()
#                    .label_from_df()
#                    .transform(get_transforms(), size=224)
#                    .databunch()
#      )

In [None]:
#New learner
#learn_cln = cnn_learner(db, models.resnet34, metrics=error_rate)

In [None]:
learn_cln.fit_one_cycle(4)

In [None]:
learn_cln.save('stage-3')

In [None]:
learn_cln.unfreeze()

In [None]:
learn_cln.lr_find()

In [None]:
learn_cln.recorder.plot()

In [None]:
learn_cln.fit_one_cycle(16, 1e-4)

In [None]:
interp = ClassificationInterpretation.from_learner(learn_cln)

In [None]:
interp.plot_confusion_matrix()