# **EPIC 2 - DATA UNDERSTANDING**

## Objectives

- Collect the dataset of animal images from Kaggle and rectify any deficiencies in the dataset before moving on to epic 3: data preparation.

### Acceptance Criteria
- The dataset should contain images of various species and breeds of animals, as agreed with the client.
- The images should be of sufficient quality and quantity to train the model.

## Tasks
- Download the dataset from Kaggle.
- Explore the dataset to understand its structure and contents.
- Run statistical tests and visualise the dataset.

## Inputs

- Kaggle dataset of animal images.
- Kaggle JSON file for authentication. 

## Outputs

- An image dataset of sufficient quality and quantity, stored in inputs/datasets.

---

# Change working directory

In [1]:
import os
current_dir = os.getcwd()
print("Current working directory is:", current_dir)

Current working directory is: /Users/gingermale/Documents/repos/PP5/pet-image-classifier/jupyter_notebooks


In [2]:
os.chdir(os.path.dirname(current_dir)) # Change the current working directory to the parent directory
current_dir = os.getcwd() # Get the new current working directory
print("Changing the working directory to parent folder:", current_dir)

Changing the working directory to parent folder: /Users/gingermale/Documents/repos/PP5/pet-image-classifier
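
Re-running the cell above moves the working directory up one more level each time. As a hedged alternative, the sketch below only changes directory while the notebook is still inside its own folder. The helper name `move_to_project_root` and the default folder name `jupyter_notebooks` (taken from the path printed above) are assumptions for illustration.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Change to the parent folder only if we are still inside the notebooks folder."""
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        # Still in the notebooks folder, so step up to the project root
        os.chdir(cwd.parent)
    # Otherwise leave the working directory untouched
    return Path.cwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, so the cell is safe to re-run.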


# Install Kaggle

In [3]:
!pip install kaggle

Collecting kaggle
  Downloading kaggle-1.6.14.tar.gz (82 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm
  Downloading tqdm-4.66.4-py3-none-any.whl (78 kB)
Collecting python-slugify
  Using cached python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Collecting bleach
  Using cached bleach-6.1.0-py3-none-any.whl (162 kB)
Collecting webencodings
  Using cached webencodings-0.5.1-py2.py3-none-any.whl (11 kB)
Collecting text-unidecode>=1.3
  Using cached text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Installing collected packages: webencodings, text-unidecode, tqdm, python-slugify, bleach, kaggle
  DEPRECATION: kaggle is being installed using the legacy 'setup.py install' method, because it does not ha

Set the Kaggle configuration directory to the current working directory and restrict the permissions of the authentication file.

In [5]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd() # set environment variable
! chmod 600 kaggle.json # make sure the kaggle.json file is not public
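
The same configuration step can be sketched in pure Python, which also fails early if the token is missing. This is a minimal sketch, not the notebook's method: the helper name `configure_kaggle` is hypothetical, and it assumes a POSIX file system where owner-only permissions are meaningful.

```python
import os
import stat
from pathlib import Path


def configure_kaggle(config_dir: str) -> Path:
    """Point the Kaggle CLI at config_dir and restrict kaggle.json to its owner."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected an API token at {token}")
    # The Kaggle CLI reads its credentials from this directory
    os.environ["KAGGLE_CONFIG_DIR"] = str(Path(config_dir))
    # Read/write for the owner only, the same effect as `chmod 600`
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return token
```

In this notebook it would be called as `configure_kaggle(os.getcwd())` once the working directory is the project root.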

Set the Kaggle dataset path and destination folder, then download the dataset.

In [8]:
KaggleDatasetPath = "rafsunahmad/choose-your-pet"
DestinationFolder = "datasets/pet_data"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/rafsunahmad/choose-your-pet
License(s): other
Downloading choose-your-pet.zip to datasets/pet_data
100%|██████████████████████████████████████| 1.02G/1.02G [06:38<00:00, 2.94MB/s]
100%|██████████████████████████████████████| 1.02G/1.02G [06:38<00:00, 2.75MB/s]


Unzip the downloaded dataset and delete the archive:

In [9]:
! unzip {DestinationFolder}/choose-your-pet.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/choose-your-pet.zip

Archive:  datasets/pet_data/choose-your-pet.zip
  inflating: datasets/pet_data/Choose pet datset for kaggle/Abyssinian/Image_1.jpg  
  inflating: datasets/pet_data/Choose pet datset for kaggle/Abyssinian/Image_10.jpg  
  inflating: datasets/pet_data/Choose pet datset for kaggle/Abyssinian/Image_11.jpg  
  inflating: datasets/pet_data/Choose pet datset for kaggle/Abyssinian/Image_12.jpg  
  inflating: datasets/pet_data/Choose pet datset for kaggle/Abyssinian/Image_13.jpg  
  inflating: datasets/pet_data/Choose pet datset for kaggle/Abyssinian/Image_14.jpg  
  inflating: datasets/pet_data/Choose pet datset for kaggle/Abyssinian/Image_15.jpg  
  inflating: datasets/pet_data/Choose pet datset for kaggle/Abyssinian/Image_16.jpg  
  inflating: datasets/pet_data/Choose pet datset for kaggle/Abyssinian/Image_17.jpg  
  inflating: datasets/pet_data/Choose pet datset for kaggle/Abyssinian/Image_18.jpg  
  inflating: datasets/pet_data/Choose pet datset for kaggle/Abyssinian/Image_19.jpg  
  infla
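
Where the `unzip` command is not available (for example on Windows), the extraction step could be approximated with the standard library. The sketch below is an assumption-laden alternative to the shell cell above; the helper name `extract_and_remove` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_and_remove(archive: str, destination: str) -> int:
    """Extract archive into destination, delete the archive, return the member count."""
    archive_path = Path(archive)
    with zipfile.ZipFile(archive_path) as zf:
        members = zf.namelist()
        zf.extractall(destination)
    # Only reached if extraction did not raise, mirroring `unzip ... && rm ...`
    archive_path.unlink()
    return len(members)
```

With the variables defined earlier, a call might look like `extract_and_remove(f"{DestinationFolder}/choose-your-pet.zip", DestinationFolder)`.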

# Initial data exploration

The dataset was reviewed with the client to identify any missing animals. The following dog breeds were identified as missing:

- Springer spaniel
- Border collie
- Jack Russell terrier
- Pug

The client was satisfied that the dataset was not deficient in any other animal that was commonly seen during consults.
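
To support this kind of review, the class folders and the number of images in each can be listed programmatically. The sketch below assumes the layout seen in the unzip output (one sub-folder per animal, images directly inside it); the helper name `count_images_per_class` and the set of accepted file extensions are assumptions.

```python
from collections import Counter
from pathlib import Path

# Assumption: only these extensions count as images
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(dataset_dir: str) -> Counter:
    """Map each class folder name to the number of image files directly inside it."""
    counts = Counter()
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts
```

Printing `count_images_per_class(...).most_common()` for the dataset folder gives a quick view of which classes are under-represented.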

In addition, the dataset was visually 'sense-checked' for duplicates and any other anomalies. The following were identified:
- A number of images were duplicated.
- Some images were incorrectly labelled.

Because of the number of duplicates and incorrectly labelled images, the dataset was cleaned to remove them. The following code compares every pair of files and identifies 17 duplicate pairs.

In [19]:
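import os
import filecmp

def find_duplicates(directory):
    """Return pairs of files under directory whose contents are identical."""
    # Collect the path of every file in the directory tree
    files = []
    for dirpath, dirnames, filenames in os.walk(directory):
        files.extend([os.path.join(dirpath, f) for f in filenames])
    duplicates = []
    # Compare each file against every file that has not been popped yet
    while files:
        current_file = files.pop()
        for file in files:
            # shallow=False compares file contents rather than only os.stat() metadata
            if filecmp.cmp(current_file, file, shallow=False):
                duplicates.append((current_file, file))
    return duplicates

# The working directory is the project root (see "Change working directory")
duplicates = find_duplicates(os.path.join(os.getcwd(), 'inputs'))
print(f"Number of duplicate pairs: {len(duplicates)}")
print(duplicates)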

Number of duplicate pairs: 17
[('/Users/gingermale/Documents/repos/PP5/pet-image-classifier/inputs/Bulldog/Image_22.jpg', '/Users/gingermale/Documents/repos/PP5/pet-image-classifier/inputs/French Bulldog/Image_8.jpg'), ('/Users/gingermale/Documents/repos/PP5/pet-image-classifier/inputs/Bulldog/Image_10.jpg', '/Users/gingermale/Documents/repos/PP5/pet-image-classifier/inputs/French Bulldog/Image_24.jpg'), ('/Users/gingermale/Documents/repos/PP5/pet-image-classifier/inputs/Sugar Gliders/Image_18.jpg', '/Users/gingermale/Documents/repos/PP5/pet-image-classifier/inputs/Sugar Gliders/Image_24.jpg'), ('/Users/gingermale/Documents/repos/PP5/pet-image-classifier/inputs/Sugar Gliders/Image_26.jpg', '/Users/gingermale/Documents/repos/PP5/pet-image-classifier/inputs/Sugar Gliders/Image_23.jpg'), ('/Users/gingermale/Documents/repos/PP5/pet-image-classifier/inputs/Sugar Gliders/Image_23.png', '/Users/gingermale/Documents/repos/PP5/pet-image-classifier/inputs/Sugar Gliders/Image_20.png'), ('/Users/g
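
The pairwise comparison above does work proportional to the square of the number of files. A hedged alternative is to hash each file once and group paths that share a digest. This is a sketch of a swapped-in technique (SHA-256 content hashing), not the method used in the cell above, and the helper name `group_duplicates` is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_duplicates(directory: str) -> list[list[str]]:
    """Return groups of file paths whose contents share the same SHA-256 digest."""
    by_hash = defaultdict(list)
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            # Hash the full file contents; each file is read exactly once
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(str(path))
    # Keep only digests that occur for more than one file
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Note that this reports groups rather than pairs: a group of n identical files corresponds to n(n-1)/2 pairs in the output of the pairwise approach, so the two counts are not directly comparable.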

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the otherwise empty try block is valid Python
except Exception as e:
  print(e)
