# Animals-10 Data Exploration
The [Animals-10 Dataset](https://www.kaggle.com/datasets/alessiocorrado99/animals10?resource=download) was created by Corrado Alessio. I have several goals for this exploration:
- Explore Dataset and its characteristics
- Identify issues
- Develop solutions
- Preprocess for image classification training

# Exploring the Dataset
To get a feel for the dataset, the first step is to read how the author describes it:

> Hello everyone!
> This is the dataset I have used for my matriculation thesis.
> It contains about 28K medium quality animal images belonging to 10 categories: dog, cat, horse, spyder, butterfly, chicken, sheep, cow, squirrel, elephant.
> I have used it to test different image recognition networks: from homemade CNNs (~80% accuracy) to Google Inception (98%). It could simulate a smart gallery for a researcher (like a biologist).
> All the images have been collected from "google images" and have been checked by human. There is some erroneous data to simulate real conditions (eg. images taken by users of your app).
> The main directory is divided into folders, one for each category. Image count for each category varies from 2K to 5 K units.

As we can see, the author says the dataset includes 10 categories: `dog`, `cat`, `horse`, `spider` (spelled "spyder" in the description), `butterfly`, `chicken`, `sheep`, `cow`, `squirrel`, and `elephant`. We should confirm this, since one of the key lessons I have learned is not to trust a claim about data until I have verified it myself. So let's look at the zip file and see what is included. The file was downloaded and extracted into the `data` folder.

The data zip file can be downloaded from [here](https://www.kaggle.com/datasets/alessiocorrado99/animals10?resource=download).

In [1]:
import os

data_dir = 'data'
data_files = sorted(os.listdir(data_dir))
print(data_files)

['raw-img', 'translate.py']


As we can see, there are two entries: a `raw-img` folder and a `translate.py` Python file. Looking at `translate.py`, it contains a `dict` that translates the class names from Italian to English. This will be useful later. The next step is to look inside the `raw-img` folder.

In [2]:
raw_dir = data_dir + '/' + data_files[0]
raw_files = sorted(os.listdir(raw_dir))

print(raw_files)
# for later conversion back ['cane', 'cavallo', 'elefante', 'farfalla', 'gallina', 'gatto', 'mucca', 'pecora', 'ragno', 'scoiattolo']

['cane', 'cavallo', 'elefante', 'farfalla', 'gallina', 'gatto', 'mucca', 'pecora', 'ragno', 'scoiattolo']


These look to be the class folders! The only issue is that the names are in Italian. This can easily be fixed with the `translate.py` file found earlier. There are many ways to use this file. In my case, it is easier to move it to the root directory of the notebook than to turn it into a proper module: since the file only contains a single `dict`, I decided the extra steps weren't worth it.
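
For comparison, one of the other options is to leave the file where it is and load it by path. This is a minimal sketch using the standard `importlib` machinery; the helper name `load_module_from_path` and the stand-in file are hypothetical and exist only so the example runs without the dataset:

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(name, path):
    """Import a single .py file from an arbitrary location."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Stand-in for data/translate.py (hypothetical contents, two entries only).
with tempfile.TemporaryDirectory() as tmp:
    stand_in = Path(tmp) / "translate.py"
    stand_in.write_text("translate = {'cane': 'dog', 'gatto': 'cat'}\n")
    demo_module = load_module_from_path("translate_demo", stand_in)

print(demo_module.translate["cane"])
```

With the real data the call would presumably be `load_module_from_path('translate', 'data/translate.py')`. The next cell takes the simpler route of moving the file.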

In [3]:
# Check whether the second file (translate.py) exists; if it does, move it
if len(data_files) > 1:
    os.rename(data_dir + '/' + data_files[1], data_files[1])
    print('Moved translate.py to root directory of notebook')
else:
    print('translate.py may have already been moved or is missing')

import translate # The dict given to us by dataset author

# Translate Italian folder names to English
for folder in raw_files:
    os.rename(raw_dir + '/' + folder,
              raw_dir + '/' + translate.translate[folder])

Moved translate.py to root directory of notebook


KeyError: 'ragno'

What happened? It seems the `ragno` key doesn't exist in the imported dictionary. Let's investigate by printing the dictionary.

In [4]:
print(translate.translate)

{'cane': 'dog', 'cavallo': 'horse', 'elefante': 'elephant', 'farfalla': 'butterfly', 'gallina': 'chicken', 'gatto': 'cat', 'mucca': 'cow', 'pecora': 'sheep', 'scoiattolo': 'squirrel', 'dog': 'cane', 'elephant': 'elefante', 'butterfly': 'farfalla', 'chicken': 'gallina', 'cat': 'gatto', 'cow': 'mucca', 'spider': 'ragno', 'squirrel': 'scoiattolo'}


Ah! Taking a closer look at this `dict`, you can see that a translation is indeed missing: there is no `'ragno': 'spider'` entry, only the reverse `'spider': 'ragno'`. Just another reason to always expect your data to be wrong until you can prove otherwise. This is a simple fix. We will also check which folders were already renamed and which ones still need to be translated.

Note: The file may have been corrected by the time you follow these steps yourself. Do your own investigation and see if the issue still exists!

In [5]:
fixedTranslate = translate.translate.copy()
fixedTranslate.update({'ragno':'spider'})
print(fixedTranslate)
print('\nCurrent Folder Names:')
print(os.listdir(raw_dir))

{'cane': 'dog', 'cavallo': 'horse', 'elefante': 'elephant', 'farfalla': 'butterfly', 'gallina': 'chicken', 'gatto': 'cat', 'mucca': 'cow', 'pecora': 'sheep', 'scoiattolo': 'squirrel', 'dog': 'cane', 'elephant': 'elefante', 'butterfly': 'farfalla', 'chicken': 'gallina', 'cat': 'gatto', 'cow': 'mucca', 'spider': 'ragno', 'squirrel': 'scoiattolo', 'ragno': 'spider'}

Current Folder Names:
['horse', 'sheep', 'elephant', 'cat', 'scoiattolo', 'chicken', 'ragno', 'dog', 'cow', 'butterfly']
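
The missing key was found here by reading the printed `dict`. As a sketch, the same gap can be surfaced programmatically before any folder is renamed. The variable names `folders`, `original` and `missing` are illustrative, and the values are copied from the outputs shown earlier:

```python
# Folder names as listed from raw-img before any renaming.
folders = ['cane', 'cavallo', 'elefante', 'farfalla', 'gallina',
           'gatto', 'mucca', 'pecora', 'ragno', 'scoiattolo']

# The dict as printed from translate.py, before the fix.
original = {
    'cane': 'dog', 'cavallo': 'horse', 'elefante': 'elephant',
    'farfalla': 'butterfly', 'gallina': 'chicken', 'gatto': 'cat',
    'mucca': 'cow', 'pecora': 'sheep', 'scoiattolo': 'squirrel',
    'dog': 'cane', 'elephant': 'elefante', 'butterfly': 'farfalla',
    'chicken': 'gallina', 'cat': 'gatto', 'cow': 'mucca',
    'spider': 'ragno', 'squirrel': 'scoiattolo',
}

# Any folder without an entry would raise KeyError in the rename loop.
missing = [name for name in folders if name not in original]
print(missing)  # ['ragno']
```

Running a check like this first means the rename loop either handles every folder or is not started at all.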


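Now that we have fixed the translation dict, the only remaining step is to rename the untranslated folders. As seen above, the only two folders that were not translated are `scoiattolo` and `ragno`: the loop stopped at `ragno`, and `scoiattolo` comes after it in sorted order, so it was never reached. Since there are only two folders, we can quickly translate them manually.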

In [6]:
os.rename(raw_dir + '/ragno',
          raw_dir + '/' + fixedTranslate['ragno'])
os.rename(raw_dir + '/scoiattolo',
          raw_dir + '/' + fixedTranslate['scoiattolo'])
print(os.listdir(raw_dir))

['horse', 'sheep', 'elephant', 'cat', 'chicken', 'dog', 'spider', 'squirrel', 'cow', 'butterfly']
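
The two manual renames are enough for this case. As a sketch of a re-runnable alternative, the helper below skips folders that are already translated and reports folders it cannot translate instead of stopping half-way. The helper name `translate_folders` and the demo folders are hypothetical. It also assumes a mapping that only goes from Italian to English, since the dict in `translate.py` holds English-to-Italian entries as well and could rename folders back on a second pass.

```python
import os
import tempfile


def translate_folders(raw_dir, mapping):
    """Rename sub-folders of raw_dir using an Italian -> English mapping.

    Returns the names that have no translation. Folders whose name is
    already a translation target are left alone, so repeated calls are safe.
    """
    already_done = set(mapping.values())
    untranslated = []
    for folder in sorted(os.listdir(raw_dir)):
        if folder in already_done:
            continue  # renamed on an earlier run
        target = mapping.get(folder)
        if target is None:
            untranslated.append(folder)  # report instead of raising KeyError
            continue
        os.rename(os.path.join(raw_dir, folder),
                  os.path.join(raw_dir, target))
    return untranslated


# Demo on a throwaway directory; 'ragno' is left out of the mapping on purpose.
italian_to_english = {'cane': 'dog', 'gatto': 'cat'}

with tempfile.TemporaryDirectory() as tmp:
    for name in ('cane', 'gatto', 'ragno'):
        os.mkdir(os.path.join(tmp, name))
    first_run = translate_folders(tmp, italian_to_english)
    second_run = translate_folders(tmp, italian_to_english)
    final_names = sorted(os.listdir(tmp))

print(first_run, second_run, final_names)
```

Both runs report `ragno` as untranslated, and the second run leaves the renamed folders untouched.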


Now we have the correct folder names. With this we can compare the expected classes against what was actually in the dataset archive. We expect to see 10 categories: `dog`, `cat`, `horse`, `spider`, `butterfly`, `chicken`, `sheep`, `cow`, `squirrel`, and `elephant`. Comparing this list with the folder names printed above, they match! So we can confirm that we have these 10 classes to work with. Now it's time to dig a bit deeper and look at the stats of each of these classes.

In [15]:
raw_files = os.listdir(raw_dir)
print('Sample of files from Cat folder: ')
print(sorted(os.listdir(raw_dir + '/' + 'cat')[:10]))
print('\n')

# Print number of images
for folder in sorted(raw_files):
    folder_files = os.listdir(raw_dir + '/' + folder)
    print('# of files in ' + folder + ': ' + str(len(folder_files)))

Sample of files from Cat folder: 
['1018.jpeg', '1043.jpeg', '1223.jpeg', '126.jpeg', '269.jpeg', '498.jpeg', '960.jpeg', '991.jpeg', 'ea35b80d2ef1013ed1584d05fb1d4e9fe777ead218ac104497f5c978a7e8b7bc_640.jpg', 'ea37b8082ef1013ed1584d05fb1d4e9fe777ead218ac104497f5c978a7e8b7bc_640.jpg']


# of files in butterfly: 2112
# of files in cat: 1668
# of files in chicken: 3098
# of files in cow: 1866
# of files in dog: 4863
# of files in elephant: 1446
# of files in horse: 2623
# of files in sheep: 1820
# of files in spider: 4821
# of files in squirrel: 1862
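
As a quick check against the author's description, the per-class counts can be summarized in a few lines. This is a sketch only: the `counts` dict is typed in by hand from the printout above, and in the notebook it would more likely be built inside the counting loop.

```python
# File counts per class, copied from the printout above.
counts = {
    'butterfly': 2112, 'cat': 1668, 'chicken': 3098, 'cow': 1866,
    'dog': 4863, 'elephant': 1446, 'horse': 2623, 'sheep': 1820,
    'spider': 4821, 'squirrel': 1862,
}

total = sum(counts.values())
smallest = min(counts, key=counts.get)
largest = max(counts, key=counts.get)
ratio = counts[largest] / counts[smallest]

print(f'total files: {total}')
print(f'smallest class: {smallest} ({counts[smallest]})')
print(f'largest class: {largest} ({counts[largest]})')
print(f'largest / smallest: {ratio:.2f}')
```

With these numbers the total is 26,179 files and the smallest class (`elephant`) has 1,446. Both are a little below the "about 28K" and "2K to 5K" figures in the author's description, and the largest class is more than three times the size of the smallest, which is worth keeping in mind for training.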


In [None]:
# Graph this as a distribution
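
The cell above is still a placeholder. One possible way to fill it, as a sketch, is a text bar chart that needs nothing beyond the standard library; a plotting library would serve equally well. The `counts` values are copied from the earlier printout, and `scale` is an arbitrary choice.

```python
# File counts per class, copied from the earlier printout.
counts = {
    'butterfly': 2112, 'cat': 1668, 'chicken': 3098, 'cow': 1866,
    'dog': 4863, 'elephant': 1446, 'horse': 2623, 'sheep': 1820,
    'spider': 4821, 'squirrel': 1862,
}

scale = 100  # files represented by one '#' (arbitrary)

# One line per class, largest class first.
chart_lines = [
    f'{name:<10} {n:>5} ' + '#' * round(n / scale)
    for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
]

print('\n'.join(chart_lines))
```

Sorting by count puts `dog` and `spider` at the top and makes the class imbalance visible at a glance.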