# 4.0 Data Cleaning and Collection
With our ResNet-50 model constructed, we need data to train it so that it can classify our pictures as "Dog", "Cat", or neither. We will first source the data, then rescale it, and finally wrap it in a suitable `DataLoader` class.

## 4.1 Data Sourcing
Kaggle is a great site for finding datasets, and you can even test your model on them to compare it with other people's. We will use this [Kaggle dataset for dogs and cats classification](https://www.kaggle.com/datasets/bhavikjikadara/dog-and-cat-classification-dataset/data).

Let's download the Kaggle dataset into `data/raw`.

In [None]:
!kaggle datasets download -d bhavikjikadara/dog-and-cat-classification-dataset -p ../data/raw
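
Re-running the cell above downloads the archive again each time. As a minimal sketch, assuming the Kaggle CLI is installed and configured, the same command can be wrapped in a small Python helper that skips the download when the zip file is already in place. The helper name `download_if_missing` and its `run` parameter are hypothetical and only serve this illustration.

```python
import subprocess
from pathlib import Path

DATASET = "bhavikjikadara/dog-and-cat-classification-dataset"


def download_if_missing(raw_dir="../data/raw", run=subprocess.run):
    """Download the dataset archive unless it already exists in raw_dir.

    `run` is injectable (hypothetical parameter) so the helper can be
    exercised without the Kaggle CLI. Returns True if a download was
    triggered and False if the archive was already present.
    """
    raw_dir = Path(raw_dir)
    # The archive is named after the dataset slug, as in the unzip step.
    archive = raw_dir / (DATASET.split("/")[1] + ".zip")
    if archive.exists():
        return False
    raw_dir.mkdir(parents=True, exist_ok=True)
    # Same command as the notebook cell, expressed as an argument list.
    run(["kaggle", "datasets", "download", "-d", DATASET, "-p", str(raw_dir)],
        check=True)
    return True
```

Calling `download_if_missing()` a second time should then return `False` without contacting Kaggle.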

With the dataset downloaded, we unzip it.

In [None]:
from zipfile import ZipFile

raw_folder = '../data/raw'

with ZipFile(raw_folder + '/dog-and-cat-classification-dataset.zip', 'r') as zip_ref:
    zip_ref.extractall(raw_folder)
    
!mv ../data/raw/PetImages/* ../data/raw
!rm -rf ../data/raw/PetImages
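
The `mv` and `rm -rf` shell commands above assume a Unix-like shell. As a hedged alternative, the same flattening step can be sketched in pure Python with `shutil` and `pathlib`. The function name `flatten_extracted_folder` is hypothetical, and the sketch assumes the archive extracts into a single `PetImages` folder, as the commands above imply.

```python
import shutil
from pathlib import Path


def flatten_extracted_folder(raw_dir, nested_name="PetImages"):
    """Move everything inside raw_dir/nested_name up into raw_dir,
    then remove the now-empty nested folder."""
    raw_dir = Path(raw_dir)
    nested = raw_dir / nested_name
    if not nested.is_dir():
        # Nothing to do: already flattened, or the archive layout differs.
        return
    for child in nested.iterdir():
        shutil.move(str(child), str(raw_dir / child.name))
    # rmdir only succeeds on an empty folder, which guards against data loss.
    nested.rmdir()
```

With the notebook's variables, this would be called as `flatten_extracted_folder(raw_folder)` right after `extractall`.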

With the image dataset extracted to our raw folder, let's load the images into our notebook.

In [None]:
import os
import matplotlib.pyplot as plt
import random
import cv2

def load_images_from(folder):
    """Read every readable image in `folder` and return them as RGB arrays."""
    images = []
    for filename in os.listdir(folder):
        img = cv2.imread(os.path.join(folder, filename))
        if img is None:
            # cv2.imread returns None for corrupt or non-image files; skip them
            continue
        # OpenCV loads images as BGR; convert to RGB for matplotlib
        images.append(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    return images


def load_raw_images():
    """Load the cat and dog images from the raw data folder."""
    raw_cat_images = load_images_from(os.path.join(raw_folder, 'Cat'))
    raw_dog_images = load_images_from(os.path.join(raw_folder, 'Dog'))
    return raw_cat_images, raw_dog_images

In [None]:
raw_cat_images, raw_dog_images = load_raw_images()
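
Before rescaling, it can help to check how much the image sizes actually vary. The following is a minimal sketch, not part of the original pipeline: the helper name `summarize_image_sizes` is hypothetical, and it only assumes each image exposes a NumPy-style `.shape` attribute, which the arrays returned by `cv2.imread` do.

```python
from collections import Counter


def summarize_image_sizes(images, top=3):
    """Return the number of images and the `top` most common
    (height, width) pairs among them."""
    # shape is (height, width, channels); keep only the spatial part
    sizes = Counter(tuple(img.shape[:2]) for img in images)
    return len(images), sizes.most_common(top)
```

For example, `summarize_image_sizes(raw_cat_images)` returns the number of loaded cat images together with their most frequent resolutions.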

With the raw images loaded, let's build a function to visualize them. It'll be super cute!

In [None]:
def visualize_pet_images(raw_cat_images, raw_dog_images):
    """Show a 5x5 grid of randomly chosen cat and dog images with their labels."""
    _, ax = plt.subplots(5, 5, figsize=(8, 8))
    for i in range(5):
        for j in range(5):
            # Pick a class at random, then a random image from that class
            is_cat = random.randint(0, 1)
            label = "Cat" if is_cat else "Dog"
            images = raw_cat_images if is_cat else raw_dog_images

            ax[i, j].imshow(random.choice(images))
            ax[i, j].axis("off")
            ax[i, j].set_title(label)


visualize_pet_images(raw_cat_images, raw_dog_images)

Cute, huh?