### Checklist for submission

It is extremely important to make sure that:

1. Everything runs as expected (no bugs when running cells);
2. The output from each cell corresponds to its code (don't change any cell's contents without rerunning it afterwards);
3. All outputs are present (don't delete any of the outputs);
4. All the places that say `# YOUR CODE HERE` or "**Your answer:** (fill in here)" are filled in;
5. No notebook cells have been copied/pasted. Inserting new cells is allowed, but it should not be necessary.
6. The notebook contains some hidden metadata which is important during our grading process. **Make sure not to corrupt any of this metadata!** The metadata may for example be corrupted if you copy/paste any notebook cells, or if you perform an unsuccessful git merge / git pull. It may also be pruned completely if using Google Colab, so watch out for this. Searching for "nbgrader" when opening the notebook in a text editor should take you to the important metadata entries.
7. Although we will try our very best to avoid this, it may happen that bugs are found after an assignment is released, and that we will push an updated version of the assignment to GitHub. If this happens, it is important that you update to the new version, while making sure the notebook metadata is properly updated as well. The safest way to make sure nothing gets messed up is to start from scratch on a clean updated version of the notebook, copy/pasting your code from the cells of the previous version into the cells of the new version.
8. If you need to have multiple parallel versions of this notebook, make sure not to move them to another directory.
9. Although you are not forced to work exclusively in the course `conda` environment, you need to make sure that the notebook will run in that environment, i.e. that you have not added any additional dependencies.

**FOR HA1, HA2 ONLY:** Failing to meet any of these requirements might lead to either a subtraction of points (at best) or a request for resubmission (at worst).

We advise you to perform the following steps before submission to ensure that requirements 1, 2, and 3 are always met: **Restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). This might take a while, so plan ahead. Finally, press the "Save and Checkpoint" button before handing in, to make sure that all your changes are saved to this .ipynb file.
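
Requirement 6 can also be spot-checked from Python instead of a text editor. The following is a minimal sketch, not part of the official grading tooling: it assumes the notebook is a standard JSON `.ipynb` file with the nbgrader entries stored under each cell's `metadata`, and the helper name `count_nbgrader_cells` is made up for illustration.

```python
import json
from pathlib import Path


def count_nbgrader_cells(notebook_path):
    """Count cells whose metadata has an "nbgrader" entry (hypothetical helper)."""
    # An .ipynb file is plain JSON with a top-level "cells" list.
    notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    return sum(
        1
        for cell in notebook.get("cells", [])
        if "nbgrader" in cell.get("metadata", {})
    )
```

Calling it with the notebook's filename should give a positive number; a count of zero would suggest that the metadata has been pruned, for example by Google Colab.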

### Fill in name of notebook file
This might seem silly, but the version check below needs to know the filename of the current notebook, which is not trivial to find out programmatically.

You might want to have several parallel versions of the notebook, and it is fine to rename the notebook as long as it stays in the same directory. **However**, if you do rename it, you also need to update its own filename below:

In [None]:
nb_fname = "XXX.ipynb"

### Fill in group number and member names (use NAME2 and GROUP only for HA1 and HA2):

In [None]:
NAME1 = ""
NAME2 = ""
GROUP = ""

### Check Python version

In [None]:
from platform import python_version_tuple

assert (
    python_version_tuple()[:2] == ("3", "11")
), "You are not running Python 3.11. Make sure to run Python through the course Conda environment."

### Check that notebook server has access to all required resources, and that notebook has not moved

In [None]:
import os

nb_dirname = os.path.abspath("")
assignment_name = os.path.basename(nb_dirname)
assert assignment_name in [
    "IHA1",
    "IHA2",
    "HA1",
    "HA2",
], "[ERROR] The notebook appears to have been moved from its original directory"

### Verify correct nb_fname

In [None]:
from IPython.display import HTML, display

try:
    display(
        HTML(
            r'<script>if("{nb_fname}" != IPython.notebook.notebook_name) {{ alert("You have filled in nb_fname = \"{nb_fname}\", but this does not seem to match the notebook filename \"" + IPython.notebook.notebook_name + "\"."); }}</script>'.format(
                nb_fname=nb_fname
            )
        )
    )
except NameError:
    assert False, "Make sure to fill in the nb_fname variable above!"

### Verify that your notebook is up-to-date and not corrupted in any way

In [None]:
import sys

sys.path.append("..")
from ha_utils import check_notebook_uptodate_and_not_corrupted

check_notebook_uptodate_and_not_corrupted(nb_dirname, nb_fname)

# Create the project structure

This is a helper notebook to create the folder structure necessary for HA1.
Start looking at `HA1.ipynb` and revisit this notebook when needed.

Run it from the folder that contains the `dogs-vs-cats.zip` file you downloaded from Kaggle.

In [None]:
# For dealing with files we use the `Path` class from the built-in Python module `pathlib`.
# It provides a nice abstraction of the file system, compared to working with strings only.
# It also makes your code more portable, i.e. easier to share with someone using another operating system.
# Some file system operations are not covered by `Path`, and we use `shutil` for those.
import shutil
from pathlib import Path

# For splitting the data
from sklearn.model_selection import train_test_split
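
As a quick illustration of why `pathlib` is convenient, the sketch below joins path components with the `/` operator and queries parts of a filename. The directory and file names are made-up examples, and nothing is read from or written to disk.

```python
from pathlib import PurePosixPath

# PurePosixPath is used here only so the printed result is the same on every OS;
# the notebook itself uses Path, which picks the right flavour automatically.
example_dir = PurePosixPath("data") / "train_all"  # hypothetical directory
example_file = example_dir / "cat.123.jpg"         # hypothetical file

print(example_file)         # data/train_all/cat.123.jpg
print(example_file.name)    # cat.123.jpg
print(example_file.suffix)  # .jpg
print(example_file.parent)  # data/train_all
```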

## Data unpacking and re-organization

**NOTE**: This script assumes that you have the `dogs-vs-cats.zip` in the same directory as this notebook, but you can set the path to the data directory manually.

In [None]:
data_path = Path.cwd()
zip_file = data_path / "dogs-vs-cats.zip"

if not ((data_path / "test").exists() and (data_path / "train_all").exists()):
    if not zip_file.exists():
        raise FileNotFoundError(
            "Download and place `{}` in the current directory (`{}`)".format(
                zip_file.name, data_path
            )
        )
    # This is a list of all the directories and files this notebook will produce.
    # If you have run this before, we will delete them and start over from `dogs-vs-cats.zip`.
    # Notice how we use the `map` function to conveniently run `Path(<filename>)` on all strings in our list,
    # to turn them into portable file paths.
    pre_existing_items = map(
        lambda x: data_path / Path(x),
        [
            "test1.zip",
            "test",
            "val",
            "train.zip",
            "train",
            "train_all",
            "sampleSubmission.csv",
            "small_train",
            "small_val",
        ],
    )
    
    for item in pre_existing_items:
        if item.exists():
            # We need to use different functions for files and directories.
            if item.is_dir():
                shutil.rmtree(item)
            elif item.is_file():
                item.unlink()
            else:
                print("Unknown item: {}, remove manually".format(item))
    
    
    # Depending on your machine the following might take some seconds to run
    shutil.unpack_archive(data_path / Path("dogs-vs-cats.zip"), data_path)
    shutil.unpack_archive(data_path / Path("test1.zip"), data_path)
    shutil.unpack_archive(data_path / Path("train.zip"), data_path)
    
    (data_path / Path("test1")).rename(data_path / "test")
    (data_path / Path("train")).rename(data_path / "train_all")
    
    
    # Remove the nested zip files
    (data_path / Path("test1.zip")).unlink()
    (data_path / Path("train.zip")).unlink()
    
    # Take a look at your current directory. Apart from notebook files (those ending in *.ipynb) you should see
    # dogs-vs-cats.zip
    # sampleSubmission.csv
    # test
    # train_all

## Examination

Now we'll examine the data inside the directory `train_all`.
It contains files like this:

```
<id>.jpg
<id>.jpg
```
where each id is of the form `<label>.<number>`, e.g. `cat.123`.

The directory `test` contains unlabelled images:

```
<id>.jpg
```
where each id is a single number.
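
The training naming scheme can be unpacked with `pathlib`. This is a small sketch; `parse_train_filename` is a hypothetical helper, not something the assignment provides.

```python
from pathlib import Path


def parse_train_filename(path):
    """Split a training filename such as `cat.123.jpg` into (label, number)."""
    # `stem` drops only the final suffix: "cat.123.jpg" -> "cat.123"
    label, number = Path(path).stem.split(".")
    return label, int(number)


print(parse_train_filename("train_all/cat.123.jpg"))  # ('cat', 123)
```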

Predictions on these unknown images are what you would submit to participate in the contest.

Let's count them separately.

In [None]:
train_all_path = data_path / "train_all"

# Get a list of all filenames inside (these will be used for training and validation)
# The asterisk '*' is a so called wildcard, i.e. we tell the `glob` method to find any cat/dog images,
# regardless of their id.
all_cat_filenames = list(train_all_path.glob("cat.*.jpg"))
all_dog_filenames = list(train_all_path.glob("dog.*.jpg"))

test_path = data_path / "test"
all_test_filenames = list(test_path.glob("*.jpg"))
print(f"Found {len(all_cat_filenames)} images of cats.")
print(f"Found {len(all_dog_filenames)} images of dogs.")
print(f"Found {len(all_test_filenames)} test images")

## Create a smaller dataset

We'll create `'small_train'` and `'small_val'` folders for a smaller subset of the original dataset (the assignment asks for 10%).

In [None]:
# Get a subset of the entire training dataset (10%).
# Both lists are passed to the same call, so the same random indices are drawn from each;
# this requires the two lists to have the same length.
_, few_cat_filenames, _, few_dog_filenames = train_test_split(all_cat_filenames, all_dog_filenames, test_size=0.1, random_state=1)
print(f"The smaller dataset has {len(few_cat_filenames)} images of cats and {len(few_dog_filenames)} images of dogs.")
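
Here `train_test_split` is used purely as a seeded random sampler: the "test" part of the split is kept as the subset and the rest is discarded. The idea can be sketched with the standard library alone. Note that this swaps in `random.Random.sample` for illustration, so it will not reproduce the exact files chosen by scikit-learn; `sample_fraction` and the fake filenames are hypothetical.

```python
import random


def sample_fraction(items, fraction, seed):
    """Return a reproducible random subset holding `fraction` of `items`."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    n_keep = round(len(items) * fraction)
    return rng.sample(list(items), n_keep)


fake_filenames = [f"cat.{i}.jpg" for i in range(1000)]  # stand-in for real paths
subset = sample_fraction(fake_filenames, fraction=0.1, seed=1)
print(len(subset))  # 100
```

Fixing the seed plays the same role as `random_state`: rerunning the cell selects the same files.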

In [None]:
# Split it into training and validation sets
split_ratio_small_dataset = 0.3

(few_cat_filenames_train, few_cat_filenames_val,
     few_dog_filenames_train, few_dog_filenames_val) = train_test_split(few_cat_filenames,
                                                                       few_dog_filenames,
                                                                       test_size=split_ratio_small_dataset,
                                                                       random_state=2)

print("The smaller dataset will be comprised of:")
print(f"Training:\t{len(few_cat_filenames_train)} cats and {len(few_dog_filenames_train)} dogs")
print(f"Validation:\t{len(few_cat_filenames_val)} cats and {len(few_dog_filenames_val)} dogs")

In [None]:
# Create the train and val directories and subdirectories
subdirectories = {
    data_path / "small_train/cats": few_cat_filenames_train,
    data_path / "small_train/dogs": few_dog_filenames_train,
    data_path / "small_val/cats": few_cat_filenames_val,
    data_path / "small_val/dogs": few_dog_filenames_val,
}

for subdirectory in subdirectories.keys():
    subdirectory = Path(subdirectory)
    subdirectory.mkdir(parents=True, exist_ok=True)


# Put the training and validation data in the respective folders
def fill_sub_dir(sub_dir, file_subset):
    """Copy the files in `file_subset` from `train_all` to `<sub_dir>`.
    A more efficient solution would be to use "symbolic links" (see https://kb.iu.edu/d/abbe),
    but for simplicity hard copies are used instead.
    """
    for file in file_subset:
        file_path = data_path / sub_dir / file.name
        shutil.copyfile(file, file_path)


for sub_dir, file_subset in subdirectories.items():
    fill_sub_dir(sub_dir, file_subset)
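
The symbolic-link alternative mentioned in the docstring could look roughly like the sketch below. `link_sub_dir` is a hypothetical name, and the example runs on a throwaway file in a temporary directory rather than on the dataset. Creating symbolic links may require extra privileges on Windows.

```python
import tempfile
from pathlib import Path


def link_sub_dir(sub_dir, file_subset):
    """Create symbolic links in `sub_dir` pointing at each file in `file_subset`."""
    sub_dir = Path(sub_dir)
    sub_dir.mkdir(parents=True, exist_ok=True)
    for file in file_subset:
        link = sub_dir / file.name
        if not link.exists():
            # Resolve to an absolute path so the link stays valid
            # regardless of the current working directory.
            link.symlink_to(file.resolve())


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "train_all" / "cat.0.jpg"
    source.parent.mkdir()
    source.write_bytes(b"not a real image")

    link_sub_dir(Path(tmp) / "small_train" / "cats", [source])
    linked = Path(tmp) / "small_train" / "cats" / "cat.0.jpg"
    print(linked.is_symlink(), linked.read_bytes() == source.read_bytes())
```

A link takes almost no disk space, but it breaks if the original file in `train_all` is moved or deleted, which is one reason plain copies are the simpler choice here.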

## Create a bigger dataset

Now we create the `val` and `train` folders for the bigger dataset. Note that this dataset will still be a portion of the entire dataset (50%).

In [None]:
# Get a subset of the entire training dataset (50%)
_, big_cat_filenames, _, big_dog_filenames = train_test_split(
    all_cat_filenames, all_dog_filenames, test_size=0.5, random_state=1
)
print(f"The bigger dataset has {len(big_cat_filenames)} images of cats and {len(big_dog_filenames)} images of dogs.")

Specify the train/val split (to something reasonable).
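
To judge what is reasonable, it can help to translate a ratio into image counts. The sketch below assumes the validation count is the ratio times the number of images, rounded up; treat the exact rounding as an assumption rather than a guarantee about `train_test_split`. `split_sizes` is a hypothetical helper and the 1000-image figure is only an example.

```python
import math


def split_sizes(n_images, val_ratio):
    """Return (n_train, n_val) for a given validation ratio."""
    if not 0 < val_ratio < 1:
        raise ValueError("val_ratio must be strictly between 0 and 1")
    n_val = math.ceil(n_images * val_ratio)  # assumption: round the validation share up
    return n_images - n_val, n_val


print(split_sizes(1000, 0.25))  # (750, 250)
```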

In [None]:
split_ratio_big_dataset = None  # Fill in here

if split_ratio_big_dataset is None:
    raise ValueError("`split_ratio_big_dataset` must have a value between 0 and 1.")

# Split it
(big_cat_filenames_train, big_cat_filenames_val,
     big_dog_filenames_train, big_dog_filenames_val) = train_test_split(big_cat_filenames,
                                                                       big_dog_filenames,
                                                                       test_size=split_ratio_big_dataset,
                                                                       random_state=3)

print("The bigger dataset will be comprised of:")
print(f"Train:\t{len(big_cat_filenames_train)} cats and {len(big_dog_filenames_train)} dogs.")
print(f"Val:\t{len(big_cat_filenames_val)} cats and {len(big_dog_filenames_val)} dogs")

In [None]:
# Create the train and val directories and subdirectories
subdirectories = {
    data_path / "train/cats": big_cat_filenames_train,
    data_path / "train/dogs": big_dog_filenames_train,
    data_path / "val/cats": big_cat_filenames_val,
    data_path / "val/dogs": big_dog_filenames_val,
}

for subdirectory in subdirectories.keys():
    subdirectory = Path(subdirectory)
    subdirectory.mkdir(parents=True, exist_ok=True)

for sub_dir, file_subset in subdirectories.items():
    fill_sub_dir(sub_dir, file_subset)