# Description

This notebook resizes all images in the following dataset:

- `03_the_wildfire_dataset`

from their current size to 250 x 250 pixels.

**Dataset Source:** https://www.kaggle.com/datasets/elmadafri/the-wildfire-dataset/data

# Directory Structure

It is assumed that the directory structure for this dataset is organized as follows:
```bash
jupyter_notebooks/
|
data_preprocessing/
└── 03_the_wildfire_dataset/
    ├── test/
    │   ├── fire/
    │   │   ├── Both_smoke_and_fire/
    │   │   └── Smoke_from_fires/
    │   └── nofire/
    │       ├── Fire_confounding_elements/
    │       ├── Forested_areas_without_confounding_elements/
    │       └── Smoke_confounding_elements/
    ├── train/
    │   ├── fire/
    │   │   ├── Both_smoke_and_fire/
    │   │   └── Smoke_from_fires/
    │   └── nofire/
    │       ├── Fire_confounding_elements/
    │       ├── Forested_areas_without_confounding_elements/
    │       └── Smoke_confounding_elements/
    └── val/
        ├── fire/
        │   ├── Both_smoke_and_fire/
        │   └── Smoke_from_fires/
        └── nofire/
            ├── Fire_confounding_elements/
            ├── Forested_areas_without_confounding_elements/
            └── Smoke_confounding_elements/
```

# Steps

In part 1 (`01_EDA_preprocessing_03dataset_part1.ipynb`) we corrected all the images that had an incorrect number of channels and saved the resulting dataframe to `03_image_list_updated.csv`.

In this notebook we'll resize all images from their current sizes to 250x250. The top level folder will be `03_the_wildfire_dataset_250x250`, and the subdirectory structure will mimic that of the original dataset.
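
As a minimal sketch of how the subdirectory structure can be mirrored under a new top-level folder, the example below recreates a few subdirectories with `os.makedirs`. The helper name `mirror_subdirs` and the shortened `subdirs` list are illustrative assumptions, and the demo runs inside a temporary directory rather than on the real dataset.

```python
import os
import tempfile

# Hypothetical subset of the dataset's subdirectories (relative paths).
subdirs = [
    "./test/fire/Smoke_from_fires",
    "./train/nofire/Fire_confounding_elements",
]


def mirror_subdirs(new_root, subdirs):
    """Create every subdirectory under new_root and return the created paths."""
    created = []
    for subdir in subdirs:
        # normpath removes the leading "./" from the joined path
        target = os.path.normpath(os.path.join(new_root, subdir))
        # exist_ok=True makes the call safe to repeat when the cell is rerun
        os.makedirs(target, exist_ok=True)
        created.append(target)
    return created


# Demonstrate in a throwaway directory instead of the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    new_root = os.path.join(tmp, "03_the_wildfire_dataset_250x250")
    created = mirror_subdirs(new_root, subdirs)
    all_exist = all(os.path.isdir(p) for p in created)
```

Because the subdirectory paths are relative, the same list can be joined to either the original or the resized top-level folder.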

# Import Usual Libraries

In [1]:
import numpy as np
import pandas as pd

import os

Read the image list file we saved in part 1.

In [2]:
df = pd.read_csv('03_image_list_updated.csv')
df.head()

Unnamed: 0,label,item,channels,width,height,issues
0,1,./test/fire/Smoke_from_fires/52230132421_efbcf...,3,1280,960,no
1,1,./test/fire/Smoke_from_fires/50517815722_17ae2...,3,2000,1500,no
2,1,./test/fire/Smoke_from_fires/41094811384_1382b...,3,6000,4000,no
3,1,./test/fire/Smoke_from_fires/37342469502_36f0e...,3,2048,1536,no
4,1,./test/fire/Smoke_from_fires/45922878832_c4755...,3,5727,3222,no


In [3]:
path_to_dataset = '../data_preprocessing/03_the_wildfire_dataset/'

In [84]:
df.shape

(2699, 6)

# Define Paths to Import Custom Modules

In [15]:
import sys 
sys.path.append('..')
print(sys.path)

['/Users/rodrigo/anaconda3/envs/pytorch_env/lib/python312.zip', '/Users/rodrigo/anaconda3/envs/pytorch_env/lib/python3.12', '/Users/rodrigo/anaconda3/envs/pytorch_env/lib/python3.12/lib-dynload', '', '/Users/rodrigo/anaconda3/envs/pytorch_env/lib/python3.12/site-packages', '..']


In [16]:
from src.data.dataset_contents import all_subdir_list

In [17]:
help(all_subdir_list)

Help on function all_subdir_list in module src.data.dataset_contents:

all_subdir_list(path_to_dataset, levels)
    Function that makes a list of subdirectories in a dataset folder.

    Parameters
    ----------
    path_to_dataset : string
        Path (absolute or relative) to contents of image dataset folder.
        The dataset is assumed to have the following structure
        dataset/
            folder1/
                subfolder1.1/
            folder2/
                subfolder2.1/
                subfolder2.2/
            folder3/

    levels : integer
        Number of nested levels in the image dataset

    Returns
    -------
    List
        Subdirectory list



In [18]:
dir_list = all_subdir_list(path_to_dataset,3)

Made a list with 15 directories


In [19]:
dir_list

['./test/fire/Smoke_from_fires',
 './test/fire/Both_smoke_and_fire',
 './test/nofire/Forested_areas_without_confounding_elements',
 './test/nofire/Fire_confounding_elements',
 './test/nofire/Smoke_confounding_elements',
 './train/fire/Smoke_from_fires',
 './train/fire/Both_smoke_and_fire',
 './train/nofire/Forested_areas_without_confounding_elements',
 './train/nofire/Fire_confounding_elements',
 './train/nofire/Smoke_confounding_elements',
 './val/fire/Smoke_from_fires',
 './val/fire/Both_smoke_and_fire',
 './val/nofire/Forested_areas_without_confounding_elements',
 './val/nofire/Fire_confounding_elements',
 './val/nofire/Smoke_confounding_elements']

We have defined the following variables to be used in the cell below:
- `dir_list`: list of subdirectories in dataset folder
- `path_to_dataset`: string with the path to the original dataset
- `df`: dataframe with a list of all the 'clean' images

In [63]:
import os
import shutil
from torchvision.io import read_image
from torchvision.utils import save_image
from torchvision.transforms import v2

# save original directory
original_dir = os.getcwd()

# change to dataset directory
os.chdir(path_to_dataset)

# get dataset folder name (basename works with any path separator)
curr_wd = os.getcwd()
folder_name = os.path.basename(curr_wd)

new_folder = f'../{folder_name}_250x250'

for subdir in dir_list:
    print('Working on:\t',subdir)
    
    path_new_subdir = os.path.join(new_folder,subdir)
    
    # make new subdirectory (no error if it already exists)
    os.makedirs(path_new_subdir, exist_ok=True)

    # filter images by subdirectory
    # (regex=False so that `.` in the path is matched literally)
    record_filter = df['item'].str.contains(subdir, regex=False)
    df_filtered = df[record_filter]

    # temporary list for all renamed images
    temp_img_list = []
    
    print('Updating images')
    
    for path_image in df_filtered['item']:

        # read image in dataset directory
        original_img = read_image(path_image)

        # resize image to 250 x 250 pixels
        temp = v2.Resize(size = (250,250))(original_img)

        # update image name from `name.extension` 
        # to `name_250x250_.extension`
        # (splitext handles extensions of any length, e.g. `.jpeg`)
        path_image_name, extension = os.path.splitext(path_image)

        path_new_image_name = f'{path_image_name}_250x250_{extension}'
        
        # save image in original_dataset/subdir/
        # (read_image returns uint8 values in [0, 255]; save_image expects floats in [0, 1])
        save_image(temp/255, path_new_image_name)

        temp_img_list.append(path_new_image_name)
        
    print('Moving files')

    for img in temp_img_list:
        shutil.move(img,path_new_subdir)



# back to original directory
os.chdir(original_dir)

Working on:	 ./test/fire/Smoke_from_fires
Updating images
Moving files
Working on:	 ./test/fire/Both_smoke_and_fire
Updating images
Moving files
Working on:	 ./test/nofire/Forested_areas_without_confounding_elements
Updating images
Moving files
Working on:	 ./test/nofire/Fire_confounding_elements
Updating images
Moving files
Working on:	 ./test/nofire/Smoke_confounding_elements
Updating images
Moving files
Working on:	 ./train/fire/Smoke_from_fires
Updating images


Invalid SOS parameters for sequential JPEG


Moving files
Working on:	 ./train/fire/Both_smoke_and_fire
Updating images




Moving files
Working on:	 ./train/nofire/Forested_areas_without_confounding_elements
Updating images
Moving files
Working on:	 ./train/nofire/Fire_confounding_elements
Updating images


Invalid SOS parameters for sequential JPEG


Moving files
Working on:	 ./train/nofire/Smoke_confounding_elements
Updating images




Moving files
Working on:	 ./val/fire/Smoke_from_fires
Updating images
Moving files
Working on:	 ./val/fire/Both_smoke_and_fire
Updating images
Moving files
Working on:	 ./val/nofire/Forested_areas_without_confounding_elements
Updating images
Moving files
Working on:	 ./val/nofire/Fire_confounding_elements
Updating images
Moving files
Working on:	 ./val/nofire/Smoke_confounding_elements
Updating images
Moving files
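
The renaming step in the loop above (`name.extension` to `name_250x250_.extension`) can be isolated into a small helper. The following is a sketch only: the function name `resized_name` and the example file names are hypothetical. It relies on `os.path.splitext`, which splits at the last dot of the file name and therefore works for extensions of any length.

```python
import os


def resized_name(path_image, suffix="_250x250_"):
    """Insert `suffix` between the file stem and its extension."""
    stem, extension = os.path.splitext(path_image)
    return f"{stem}{suffix}{extension}"


# Hypothetical file names, used only to illustrate the naming scheme.
print(resized_name("./test/fire/Smoke_from_fires/example.jpg"))
print(resized_name("./test/fire/Smoke_from_fires/example.jpeg"))
```

Keeping the naming rule in one function makes it easier to change the suffix later without touching the resize loop.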


# Make a List of All Images
Once the resizing and moving are done, make a new dataframe with all the images using `all_images_list`.

In [64]:
from src.data.dataset_contents import all_images_list

In [65]:
all_images_list?

Signature: all_images_list(path_to_dataset, directory_list, label_list)
Docstring:
Function that lists all images contained in the subdirectories of a dataset,
opens each one by one, and returns a dataframe containing all image names as well
as their labels and size.

Parameters
----------
path_to_dataset : string
    Path (absolute or relative) to contents of image dataset folder. 
    The dataset is assumed to have the following structure
    dataset/
        folder1/
            subfolder1.1/
        folder2/
            subfolder2.1/
            subfolder2.2/
        folder3/

directory_list : list
    List with all the subdirectories contained in the dataset.
label_list : list
    List with the numeric categories for each of the directories in `directory_list`

Returns
-------
Dataframe
    All the contents of the dataset into a dataframe containing 
    item

We defined `dir_list` previously.

In [67]:
dir_list

['./test/fire/Smoke_from_fires',
 './test/fire/Both_smoke_and_fire',
 './test/nofire/Forested_areas_without_confounding_elements',
 './test/nofire/Fire_confounding_elements',
 './test/nofire/Smoke_confounding_elements',
 './train/fire/Smoke_from_fires',
 './train/fire/Both_smoke_and_fire',
 './train/nofire/Forested_areas_without_confounding_elements',
 './train/nofire/Fire_confounding_elements',
 './train/nofire/Smoke_confounding_elements',
 './val/fire/Smoke_from_fires',
 './val/fire/Both_smoke_and_fire',
 './val/nofire/Forested_areas_without_confounding_elements',
 './val/nofire/Fire_confounding_elements',
 './val/nofire/Smoke_confounding_elements']

Next the list of labels: 1 for fire, 0 for no fire.

In [68]:
label_list = [1,1,0,0,0,1,1,0,0,0,1,1,0,0,0]
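
The hard-coded list has to stay in the same order as `dir_list`. As a hedged alternative, the labels could be derived from the second path component of each subdirectory (`fire` or `nofire`). The helper `label_for` below is an illustrative assumption, not part of `src.data.dataset_contents`. A plain substring test for `fire` would not work here, because `nofire` and `Smoke_from_fires` both contain it.

```python
dir_list = [
    "./test/fire/Smoke_from_fires",
    "./test/fire/Both_smoke_and_fire",
    "./test/nofire/Forested_areas_without_confounding_elements",
    "./test/nofire/Fire_confounding_elements",
    "./test/nofire/Smoke_confounding_elements",
    "./train/fire/Smoke_from_fires",
    "./train/fire/Both_smoke_and_fire",
    "./train/nofire/Forested_areas_without_confounding_elements",
    "./train/nofire/Fire_confounding_elements",
    "./train/nofire/Smoke_confounding_elements",
    "./val/fire/Smoke_from_fires",
    "./val/fire/Both_smoke_and_fire",
    "./val/nofire/Forested_areas_without_confounding_elements",
    "./val/nofire/Fire_confounding_elements",
    "./val/nofire/Smoke_confounding_elements",
]


def label_for(subdir):
    """Return 1 if the class folder is 'fire', 0 if it is 'nofire'."""
    parts = [p for p in subdir.split("/") if p not in ("", ".")]
    class_folder = parts[1]  # parts[0] is the split: test / train / val
    if class_folder not in ("fire", "nofire"):
        raise ValueError(f"unexpected class folder: {class_folder}")
    return 1 if class_folder == "fire" else 0


label_list = [label_for(d) for d in dir_list]
print(label_list)
```

With this approach the labels follow `dir_list` automatically if the order of the subdirectories changes.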

Now the path to the resized dataset:

In [69]:
path_to_dataset = '../data_preprocessing/03_the_wildfire_dataset_250x250/'

In [70]:
new_df = all_images_list(path_to_dataset,dir_list,label_list)

Completed list of images
Reading from image list
Finished reviewing all images


# Review DataFrame Contents
Just to make sure everything looks correct.

In [72]:
new_df.head()

Unnamed: 0,label,item,channels,width,height,issues
0,1,./test/fire/Smoke_from_fires/37342469502_36f0e...,3,250,250,no
1,1,./test/fire/Smoke_from_fires/28347651877_ce21e...,3,250,250,no
2,1,./test/fire/Smoke_from_fires/50380847162_24a48...,3,250,250,no
3,1,./test/fire/Smoke_from_fires/26131736898_9e6a8...,3,250,250,no
4,1,./test/fire/Smoke_from_fires/30227808988_2cd8f...,3,250,250,no


In [73]:
new_df.query('width != 250')

Unnamed: 0,label,item,channels,width,height,issues


In [75]:
new_df.query('height != 250')

Unnamed: 0,label,item,channels,width,height,issues


In [76]:
new_df.shape

(2699, 6)

In [77]:
new_df.query('channels !=3')

Unnamed: 0,label,item,channels,width,height,issues


# Save Annotations File

Lastly, we have to save `new_df` into annotation files suitable for PyTorch. In total there will be 3 files:
- train
- validation
- test

In [107]:
train_filter = new_df['item'].str.contains('/train')
val_filter = new_df['item'].str.contains('/val')
test_filter = new_df['item'].str.contains('/test')

In [108]:
print('number of images')
print('train',train_filter.sum())
print('val',val_filter.sum())
print('test',test_filter.sum())

number of images
train 1887
val 402
test 410


In [104]:
410+402+1887

2699

In [105]:
new_df.shape

(2699, 6)
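
The filters above match `/train`, `/val`, and `/test` anywhere in the path, which works for this dataset's folder names. A stricter variant, sketched below with plain Python and hypothetical file names, assigns each image to a split by its first path component and checks that every image lands in exactly one split.

```python
from collections import Counter

# Hypothetical image paths following the dataset's layout.
items = [
    "./train/fire/Smoke_from_fires/a.jpg",
    "./train/nofire/Fire_confounding_elements/b.jpg",
    "./val/fire/Both_smoke_and_fire/c.jpg",
    "./test/nofire/Smoke_confounding_elements/d.jpg",
]


def split_of(item):
    """Return the first real path component: 'train', 'val', or 'test'."""
    parts = [p for p in item.split("/") if p not in ("", ".")]
    return parts[0]


counts = Counter(split_of(i) for i in items)

# Every image belongs to exactly one split, so the counts add up to the total.
assert sum(counts.values()) == len(items)
print(dict(counts))
```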

In [109]:
train_df = new_df.loc[train_filter,['item','label']]
train_df.head()

Unnamed: 0,item,label
410,./train/fire/Smoke_from_fires/malachi-brooks--...,1
411,./train/fire/Smoke_from_fires/52442599185_acbd...,1
412,./train/fire/Smoke_from_fires/35666924341_c69e...,1
413,./train/fire/Smoke_from_fires/52295601315_39ff...,1
414,./train/fire/Smoke_from_fires/33607918741_a241...,1


In [110]:
train_df.to_csv('labels_03_train_dataset.csv', index = False, header = False)

In [111]:
val_df = new_df.loc[val_filter,['item','label']]
val_df.to_csv('labels_03_val_dataset.csv', index = False, header = False)

In [112]:
test_df = new_df.loc[test_filter,['item','label']]
test_df.to_csv('labels_03_test_dataset.csv', index = False, header = False)

In [113]:
! ls labels_03_*.csv

labels_03_test_dataset.csv  labels_03_val_dataset.csv
labels_03_train_dataset.csv


In [114]:
! mv labels_03_*.csv ../data_preprocessing/03_the_wildfire_dataset_250x250/
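
Since the files are written with `index = False, header = False`, each row holds only `item,label`. The sketch below illustrates that layout with the standard `csv` module: it writes two hypothetical rows to a temporary file and reads them back. The file name `labels_example.csv` and the image names are assumptions made for the example.

```python
import csv
import os
import tempfile

# Hypothetical rows in the same (item, label) layout as the annotation files.
rows = [
    ("./train/fire/Smoke_from_fires/example_a_250x250_.jpg", 1),
    ("./train/nofire/Fire_confounding_elements/example_b_250x250_.jpg", 0),
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "labels_example.csv")

    # Write without header or index column.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

    # Read back: the first field is the relative image path, the second the label.
    with open(path, newline="") as f:
        annotations = [(item, int(label)) for item, label in csv.reader(f)]

print(annotations)
```

Any loader that consumes these files has to be told that there is no header row; otherwise the first image would be treated as column names.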