01_EDA_preprocessing_03dataset_part2.ipynb

# Preprocessing and EDA on 03_wildfire_dataset - part 2

Rodrigo Becerra Carrillo

https://github.com/bcrodrigo

# Introduction

In this notebook I'll resize all images in the following dataset:

- `03_wildfire_dataset`

from their current size to 250 x 250 pixels.

**Dataset Source:** https://www.kaggle.com/datasets/elmadafri/the-wildfire-dataset/data

## Directory Structure

The directory structure for this dataset is assumed to be organized as follows:

```bash
jupyter_notebooks/
|
data_preprocessing/
└── 03_wildfire_dataset/
    ├── test/
    │   ├── fire/
    │   │   ├── Both_smoke_and_fire/
    │   │   └── Smoke_from_fires/
    │   └── nofire/
    │       ├── Fire_confounding_elements/
    │       ├── Forested_areas_without_confounding_elements/
    │       └── Smoke_confounding_elements/
    ├── train/
    │   ├── fire/
    │   │   ├── Both_smoke_and_fire/
    │   │   └── Smoke_from_fires/
    │   └── nofire/
    │       ├── Fire_confounding_elements/
    │       ├── Forested_areas_without_confounding_elements/
    │       └── Smoke_confounding_elements/
    └── val/
        ├── fire/
        │   ├── Both_smoke_and_fire/
        │   └── Smoke_from_fires/
        └── nofire/
            ├── Fire_confounding_elements/
            ├── Forested_areas_without_confounding_elements/
            └── Smoke_confounding_elements/
```

## Steps

In part 1 (`01_EDA_preprocessing_03dataset_part1.ipynb`) we corrected all the images that had an incorrect number of channels and saved the resulting dataframe to `03_image_list_updated.csv`.

In this notebook we'll resize all images from their current sizes to 250x250. The top level folder will be `03_wildfire_dataset_250x250`, and the subdirectory structure will mimic that of the original dataset.
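As a rough illustration of what "mimicking the subdirectory structure" means, the sketch below maps an image path relative to the original dataset onto the resized dataset folder. The function name `mirrored_path`, the file name `example.jpg`, and the assumption that the resized folder sits next to the original one are all hypothetical; the actual destination is decided by `resize_all_images`.

```python
from pathlib import PurePosixPath


def mirrored_path(item, src_root, suffix="_250x250"):
    """Map a path relative to the source dataset onto the resized dataset.

    Assumes (hypothetically) that the resized folder is a sibling of the
    original one, named by appending `suffix` to the original folder name.
    """
    src_root = PurePosixPath(src_root)
    dst_root = src_root.with_name(src_root.name + suffix)
    # PurePosixPath drops the leading './' of `item`, so the join stays clean
    return dst_root / PurePosixPath(item)


print(mirrored_path("./test/fire/Smoke_from_fires/example.jpg",
                    "../data_preprocessing/03_wildfire_dataset/"))
# ../data_preprocessing/03_wildfire_dataset_250x250/test/fire/Smoke_from_fires/example.jpg
```

Only the top-level folder name changes; everything below it keeps the same relative path, so the `item` column of the dataframe can be reused for both datasets.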

# Import Usual Libraries

In [1]:
import numpy as np
import pandas as pd
import os

## Define Path to Custom Modules

In [2]:
import sys
sys.path.append('..')

In [5]:
sys.path

['/Users/rodrigo/anaconda3/envs/pytorch_env/lib/python312.zip',
 '/Users/rodrigo/anaconda3/envs/pytorch_env/lib/python3.12',
 '/Users/rodrigo/anaconda3/envs/pytorch_env/lib/python3.12/lib-dynload',
 '/Users/rodrigo/anaconda3/envs/pytorch_env/lib/python3.12/site-packages',
 '..']

## Import Helper Functions

In [6]:
from src.data.dataset_contents import all_subdir_list

In [7]:
help(all_subdir_list)

Help on function all_subdir_list in module src.data.dataset_contents:

all_subdir_list(path_to_dataset, levels)
    Function that makes a list of subdirectories in a dataset folder.

    Parameters
    ----------
    path_to_dataset : string
        Path (absolute or relative) to contents of image dataset folder.
        The dataset is assumed to have the following structure
        dataset/
            folder1/
                subfolder1.1/
            folder2/
                subfolder2.1/
                subfolder2.2/
            folder3/

    levels : integer
        Number of nested levels in the image dataset

    Returns
    -------
    List
        Subdirectory list



In [8]:
from src.data.image_prep import resize_all_images

In [9]:
help(resize_all_images)

Help on function resize_all_images in module src.data.image_prep:

resize_all_images(path_to_dataset, dir_list, df_all_images, new_height, new_width)
    Function to resize all images in a dataset to a new square size
    It will create a new directory containing all the resized images accordingly

    Parameters
    ----------
    path_to_dataset : str
        Path (absolute or relative) to contents of image dataset folder.
        The dataset is assumed to have the following structure
        dataset/
            folder1/
                subfolder1.1/
            folder2/
                subfolder2.1/
                subfolder2.2/
            folder3/

    dir_list : list[str]
        A list of directories generated with `all_subdir_list`

    df_all_images : dataframe
        A dataframe listing all the images in a datasaet to be resized
        It is assumed to have at least the following columns: `item`, `label`,`channels`,`height`,`width`

    new_height : int
        Resized ima

## Define Path to Dataset

In [15]:
path_to_dataset = '../data_preprocessing/03_wildfire_dataset/'

In [16]:
! ls ../data_preprocessing/03_wildfire_dataset/

[1m[36mtest[m[m  [1m[36mtrain[m[m [1m[36mval[m[m


Read the image list file we saved in part 1.

In [17]:
df = pd.read_csv('03_image_list_updated.csv')
df.head()

Unnamed: 0,label,item,channels,width,height,issues
0,1,./test/fire/Smoke_from_fires/52230132421_efbcf...,3,1280,960,no
1,1,./test/fire/Smoke_from_fires/50517815722_17ae2...,3,2000,1500,no
2,1,./test/fire/Smoke_from_fires/41094811384_1382b...,3,6000,4000,no
3,1,./test/fire/Smoke_from_fires/37342469502_36f0e...,3,2048,1536,no
4,1,./test/fire/Smoke_from_fires/45922878832_c4755...,3,5727,3222,no


In [13]:
df.shape

(2699, 6)

In [18]:
dir_list = all_subdir_list(path_to_dataset,3)

Made a list with 15 directories


In [19]:
dir_list

['./test/fire/Smoke_from_fires',
 './test/fire/Both_smoke_and_fire',
 './test/nofire/Forested_areas_without_confounding_elements',
 './test/nofire/Fire_confounding_elements',
 './test/nofire/Smoke_confounding_elements',
 './train/fire/Smoke_from_fires',
 './train/fire/Both_smoke_and_fire',
 './train/nofire/Forested_areas_without_confounding_elements',
 './train/nofire/Fire_confounding_elements',
 './train/nofire/Smoke_confounding_elements',
 './val/fire/Smoke_from_fires',
 './val/fire/Both_smoke_and_fire',
 './val/nofire/Forested_areas_without_confounding_elements',
 './val/nofire/Fire_confounding_elements',
 './val/nofire/Smoke_confounding_elements']
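For readers without access to `src.data.dataset_contents`, the sketch below shows one way a list like this could be built with the standard library. It is not the implementation of `all_subdir_list`; the name `list_subdirs_at_depth` is hypothetical, and the ordering is alphabetical, so it may differ from the order shown above.

```python
import os


def list_subdirs_at_depth(root, levels):
    """Return './a/b/c'-style paths of directories exactly `levels` below `root`."""
    root = os.path.normpath(root)
    found = []
    for current, dirnames, _ in os.walk(root):
        dirnames.sort()  # sort in place so the traversal order is deterministic
        rel = os.path.relpath(current, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth == levels:
            found.append("./" + rel.replace(os.sep, "/"))
            dirnames.clear()  # stop descending below the requested depth
    return found
```

Called as `list_subdirs_at_depth(path_to_dataset, 3)` on the structure described in the introduction, this would be expected to yield 15 entries of the form `./<split>/<class>/<subclass>`.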

# Resize Images

We have now defined the following variables:
- `dir_list`: list of subdirectories in dataset folder
- `path_to_dataset`: string with the path to the original dataset
- `df`: dataframe with a list of all the 'clean' images

In [20]:
resize_all_images(path_to_dataset,dir_list,df,250,250)


Working on:	 ./test/fire/Smoke_from_fires
Updating images
Moving resized images to ../03_wildfire_dataset_250x250/./test/fire/Smoke_from_fires

Working on:	 ./test/fire/Both_smoke_and_fire
Updating images
Moving resized images to ../03_wildfire_dataset_250x250/./test/fire/Both_smoke_and_fire

Working on:	 ./test/nofire/Forested_areas_without_confounding_elements
Updating images
Moving resized images to ../03_wildfire_dataset_250x250/./test/nofire/Forested_areas_without_confounding_elements

Working on:	 ./test/nofire/Fire_confounding_elements
Updating images
Moving resized images to ../03_wildfire_dataset_250x250/./test/nofire/Fire_confounding_elements

Working on:	 ./test/nofire/Smoke_confounding_elements
Updating images
Moving resized images to ../03_wildfire_dataset_250x250/./test/nofire/Smoke_confounding_elements

Working on:	 ./train/fire/Smoke_from_fires
Updating images


Invalid SOS parameters for sequential JPEG


Moving resized images to ../03_wildfire_dataset_250x250/./train/fire/Smoke_from_fires

Working on:	 ./train/fire/Both_smoke_and_fire
Updating images




Moving resized images to ../03_wildfire_dataset_250x250/./train/fire/Both_smoke_and_fire

Working on:	 ./train/nofire/Forested_areas_without_confounding_elements
Updating images
Moving resized images to ../03_wildfire_dataset_250x250/./train/nofire/Forested_areas_without_confounding_elements

Working on:	 ./train/nofire/Fire_confounding_elements
Updating images


Invalid SOS parameters for sequential JPEG


Moving resized images to ../03_wildfire_dataset_250x250/./train/nofire/Fire_confounding_elements

Working on:	 ./train/nofire/Smoke_confounding_elements
Updating images




Moving resized images to ../03_wildfire_dataset_250x250/./train/nofire/Smoke_confounding_elements

Working on:	 ./val/fire/Smoke_from_fires
Updating images
Moving resized images to ../03_wildfire_dataset_250x250/./val/fire/Smoke_from_fires

Working on:	 ./val/fire/Both_smoke_and_fire
Updating images
Moving resized images to ../03_wildfire_dataset_250x250/./val/fire/Both_smoke_and_fire

Working on:	 ./val/nofire/Forested_areas_without_confounding_elements
Updating images
Moving resized images to ../03_wildfire_dataset_250x250/./val/nofire/Forested_areas_without_confounding_elements

Working on:	 ./val/nofire/Fire_confounding_elements
Updating images
Moving resized images to ../03_wildfire_dataset_250x250/./val/nofire/Fire_confounding_elements

Working on:	 ./val/nofire/Smoke_confounding_elements
Updating images
Moving resized images to ../03_wildfire_dataset_250x250/./val/nofire/Smoke_confounding_elements


# Make a List of All Images

Once resizing and moving are done, we'll build a new dataframe using the helper function `all_images_list`.

In [21]:
from src.data.dataset_contents import all_images_list

In [22]:
help(all_images_list)

Help on function all_images_list in module src.data.dataset_contents:

all_images_list(path_to_dataset, directory_list, label_list)
    Function that lists all images contained in the subdirectories of a dataset,
    opens each one by one, and returns a dataframe containing all image names as well
    as their labels and size.

    Parameters
    ----------
    path_to_dataset : string
        Path (absolute or relative) to contents of image dataset folder.
        The dataset is assumed to have the following structure
        dataset/
            folder1/
                subfolder1.1/
            folder2/
                subfolder2.1/
                subfolder2.2/
            folder3/

    directory_list : list
        List with all the subdirectories contained in the dataset.
    label_list : list
        List with the numeric categories for each of the directories in `directory_list`

    Returns
    -------
    Dataframe
        All the contents of the dataset into a dataframe cont

We defined `dir_list` previously:

In [23]:
dir_list

['./test/fire/Smoke_from_fires',
 './test/fire/Both_smoke_and_fire',
 './test/nofire/Forested_areas_without_confounding_elements',
 './test/nofire/Fire_confounding_elements',
 './test/nofire/Smoke_confounding_elements',
 './train/fire/Smoke_from_fires',
 './train/fire/Both_smoke_and_fire',
 './train/nofire/Forested_areas_without_confounding_elements',
 './train/nofire/Fire_confounding_elements',
 './train/nofire/Smoke_confounding_elements',
 './val/fire/Smoke_from_fires',
 './val/fire/Both_smoke_and_fire',
 './val/nofire/Forested_areas_without_confounding_elements',
 './val/nofire/Fire_confounding_elements',
 './val/nofire/Smoke_confounding_elements']

Next the list of labels: 1 for fire, 0 for no fire.

In [24]:
label_list = [1,1,0,0,0,1,1,0,0,0,1,1,0,0,0]
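A hand-typed label list has to stay in the same order as `dir_list`. As an alternative, the labels could be derived from the directory names themselves. The sketch below assumes every entry has the form `./<split>/<class>/<subclass>` with `<class>` equal to `fire` or `nofire`; the function name `labels_from_dirs` is hypothetical.

```python
def labels_from_dirs(directories):
    """Return 1 for directories under `fire` and 0 for those under `nofire`."""
    mapping = {"fire": 1, "nofire": 0}
    labels = []
    for directory in directories:
        # Drop the leading '.' and any empty pieces: ['test', 'fire', 'Smoke_from_fires']
        parts = [p for p in directory.split("/") if p not in ("", ".")]
        labels.append(mapping[parts[1]])  # parts[1] is the class folder
    return labels


example_dirs = [
    "./test/fire/Smoke_from_fires",
    "./test/fire/Both_smoke_and_fire",
    "./test/nofire/Forested_areas_without_confounding_elements",
    "./test/nofire/Fire_confounding_elements",
    "./test/nofire/Smoke_confounding_elements",
]
print(labels_from_dirs(example_dirs))  # [1, 1, 0, 0, 0]
```

A directory whose class folder is neither `fire` nor `nofire` raises a `KeyError`, which surfaces an unexpected layout early instead of silently mislabeling it.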

Now we define the path to the resized dataset:

In [25]:
path_to_resized_dataset = '../data_preprocessing/03_wildfire_dataset_250x250/'

In [26]:
new_df = all_images_list(path_to_resized_dataset,dir_list,label_list)

Completed list of images
Reading from image list
Finished reviewing all images


## Review DataFrame Contents
Just to make sure everything looks correct.

In [27]:
new_df.head()

Unnamed: 0,label,item,channels,width,height,issues
0,1,./test/fire/Smoke_from_fires/37342469502_36f0e...,3,250,250,no
1,1,./test/fire/Smoke_from_fires/28347651877_ce21e...,3,250,250,no
2,1,./test/fire/Smoke_from_fires/50380847162_24a48...,3,250,250,no
3,1,./test/fire/Smoke_from_fires/26131736898_9e6a8...,3,250,250,no
4,1,./test/fire/Smoke_from_fires/30227808988_2cd8f...,3,250,250,no


In [28]:
new_df.query('width != 250')

Unnamed: 0,label,item,channels,width,height,issues


In [29]:
new_df.query('height != 250')

Unnamed: 0,label,item,channels,width,height,issues


In [30]:
new_df.shape

(2699, 6)

In [31]:
new_df.query('channels !=3')

Unnamed: 0,label,item,channels,width,height,issues


# Save Annotations File

Lastly, we have to save `new_df` into annotation files suitable for PyTorch. In total there will be 3 files:
- train
- validation
- test

In [32]:
train_filter = new_df['item'].str.contains('/train')
val_filter = new_df['item'].str.contains('/val')
test_filter = new_df['item'].str.contains('/test')

In [33]:
print('Number of images:')
print('train',train_filter.sum())
print('val',val_filter.sum())
print('test',test_filter.sum())

Number of images:
train 1887
val 402
test 410


In [34]:
410+402+1887

2699

In [35]:
new_df.shape

(2699, 6)

All numbers add up correctly.
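The filters above use substring matching, so a file whose own name starts with `train`, `val`, or `test` would match more than one filter. The matching totals suggest that does not happen here. A slightly stricter check, sketched below, counts items by the split folder itself; it assumes every `item` starts with `./<split>/`, and the names `split_counts` and `example_items` are hypothetical.

```python
from collections import Counter


def split_counts(items):
    """Count items per split, taking the split from './<split>/...'."""
    # './train/fire/x.jpg'.split('/') -> ['.', 'train', 'fire', 'x.jpg']
    return Counter(item.split("/")[1] for item in items)


example_items = [
    "./train/fire/Smoke_from_fires/a.jpg",
    "./train/nofire/Smoke_confounding_elements/b.jpg",
    "./val/fire/Both_smoke_and_fire/train_c.jpg",  # file name starts with 'train'
    "./test/nofire/Fire_confounding_elements/d.jpg",
]
counts = split_counts(example_items)
print(dict(counts))
```

With the notebook's dataframe, `split_counts(new_df['item'])` would be expected to reproduce the three counts printed above.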

In [36]:
train_df = new_df.loc[train_filter,['item','label']]
train_df.head()

Unnamed: 0,item,label
410,./train/fire/Smoke_from_fires/malachi-brooks--...,1
411,./train/fire/Smoke_from_fires/52442599185_acbd...,1
412,./train/fire/Smoke_from_fires/35666924341_c69e...,1
413,./train/fire/Smoke_from_fires/52295601315_39ff...,1
414,./train/fire/Smoke_from_fires/33607918741_a241...,1


In [37]:
train_df.to_csv('labels_03_train_dataset.csv', index = False, header = False)

In [38]:
val_df = new_df.loc[val_filter,['item','label']]
val_df.to_csv('labels_03_val_dataset.csv', index = False, header = False)

In [39]:
test_df = new_df.loc[test_filter,['item','label']]
test_df.to_csv('labels_03_test_dataset.csv', index = False, header = False)
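Because the files are written with `index = False` and `header = False`, each row contains only an image path and its label. The sketch below reads such a file back with the standard library, which can serve as a quick sanity check before handing it to a dataset class. The function name `read_annotations` is hypothetical and the code is independent of any PyTorch API.

```python
import csv


def read_annotations(csv_path):
    """Read a headerless `item,label` file into a list of (path, label) tuples."""
    with open(csv_path, newline="") as handle:
        # Skip empty rows; convert the label column to int
        return [(row[0], int(row[1])) for row in csv.reader(handle) if row]
```

For example, `len(read_annotations('labels_03_test_dataset.csv'))` would be expected to match the test count reported earlier.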

In [40]:
! ls labels_03_*.csv

labels_03_test_dataset.csv  labels_03_val_dataset.csv
labels_03_train_dataset.csv


In [41]:
! mv labels_03_*.csv ../data_preprocessing/03_wildfire_dataset_250x250/

In [42]:
! ls labels_03_*.csv

zsh:1: no matches found: labels_03_*.csv
