01_EDA_preprocessing_02dataset.ipynb

# Preprocessing and EDA on 02_deepfire_dataset

Rodrigo Becerra Carrillo

https://github.com/bcrodrigo

# Introduction

In this notebook I'll perform preprocessing and EDA on the following dataset:

- `02_deepfire_dataset` 

**Source:** https://www.kaggle.com/datasets/alik05/forest-fire-dataset

## Directory Structure

The dataset is assumed to be organized as follows. Note that I renamed the top-level folder to `02_deepfire_dataset` and created two new folders (`fire/` and `nofire/`) for the images in the `testing` folder.

```bash
.
├── data_preprocessing/
│   └── 02_deepfire_dataset/
│       ├── testing/
│       │   ├── fire/
│       │   └── nofire/
│       └── training/
│           ├── fire/
│           └── nofire/
│
├── jupyter_notebooks/
```
where the `jupyter_notebooks/` folder contains this notebook.

## Steps
We will do the following for the `training/` and `testing/` directories:
1. Create a dataframe listing all images, labels, channels, width and height
2. Verify that all images have the correct number of channels (3) and the same width and height (250 x 250)
3. Crop or replace images as needed
4. Update the dataframe accordingly
5. Save the annotations as a .csv file to be used by PyTorch

Note that the final .csv files are meant to be read by the dataloader in a separate notebook.

# Import Usual Libraries

In [1]:
import numpy as np
import pandas as pd
import os

## Define Path to Custom Modules

In [2]:
import sys
sys.path.append('..')

In [5]:
sys.path

['/Users/rodrigo/anaconda3/envs/pytorch_env/lib/python312.zip',
 '/Users/rodrigo/anaconda3/envs/pytorch_env/lib/python3.12',
 '/Users/rodrigo/anaconda3/envs/pytorch_env/lib/python3.12/lib-dynload',
 '/Users/rodrigo/anaconda3/envs/pytorch_env/lib/python3.12/site-packages',
 '..']

## Import Helper Functions

In [6]:
from src.data.dataset_contents import all_subdir_list

In [29]:
help(all_subdir_list)

Help on function all_subdir_list in module src.data.dataset_contents:

all_subdir_list(path_to_dataset, levels)
    Function that makes a list of subdirectories in a dataset folder.

    Parameters
    ----------
    path_to_dataset : string
        Path (absolute or relative) to contents of image dataset folder.
        The dataset is assumed to have the following structure
        dataset/
            folder1/
                subfolder1.1/
            folder2/
                subfolder2.1/
                subfolder2.2/
            folder3/

    levels : integer
        Number of nested levels in the image dataset

    Returns
    -------
    List
        Subdirectory list
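
The source of `all_subdir_list` is not shown in this notebook. As a rough illustration of what listing subdirectories at a fixed depth can look like, here is a minimal sketch using only the standard library. The name `list_subdirs_at_depth` is hypothetical and this is not the implementation in `src.data.dataset_contents`.

```python
import os
import tempfile
from pathlib import Path


def list_subdirs_at_depth(root, levels):
    """Return relative paths of directories exactly `levels` below `root`.

    Hypothetical helper for illustration only.
    """
    root = Path(root)
    found = []
    for path in root.rglob('*'):
        rel = path.relative_to(root)
        # keep only directories at the requested nesting depth
        if path.is_dir() and len(rel.parts) == levels:
            found.append('./' + rel.as_posix())
    return sorted(found)


# Build a throwaway tree that mimics the dataset layout
with tempfile.TemporaryDirectory() as tmp:
    for split in ('training', 'testing'):
        for cls in ('fire', 'nofire'):
            os.makedirs(os.path.join(tmp, split, cls))
    demo_dirs = list_subdirs_at_depth(tmp, 2)

print(demo_dirs)
```

This sketch sorts its result, so the order may differ from what the real helper returns. Any label list built by hand has to follow the order actually returned.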



In [8]:
from src.data.dataset_contents import all_images_list

In [9]:
help(all_images_list)

Help on function all_images_list in module src.data.dataset_contents:

all_images_list(path_to_dataset, directory_list, label_list)
    Function that lists all images contained in the subdirectories of a dataset,
    opens each one by one, and returns a dataframe containing all image names as well
    as their labels and size.

    Parameters
    ----------
    path_to_dataset : string
        Path (absolute or relative) to contents of image dataset folder.
        The dataset is assumed to have the following structure
        dataset/
            folder1/
                subfolder1.1/
            folder2/
                subfolder2.1/
                subfolder2.2/
            folder3/

    directory_list : list
        List with all the subdirectories contained in the dataset.
    label_list : list
        List with the numeric categories for each of the directories in `directory_list`

    Returns
    -------
    Dataframe
        All the contents of the dataset into a dataframe cont

## Define Path to Dataset

In [10]:
path_to_dataset = '../data_preprocessing/02_deepfire_dataset/'

In [11]:
! ls ../data_preprocessing/02_deepfire_dataset/

testing  training


In [12]:
dir_list = all_subdir_list(path_to_dataset,2)
dir_list

Made a list with 4 directories


['./training/fire', './training/nofire', './testing/fire', './testing/nofire']

Now let's make a list with the labels matching the elements of `dir_list`, note 'fire' is labelled as 1 and 'nofire' as 0.

In [13]:
label_list = [1,0,1,0]
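
As an alternative to typing the labels by hand, they could be derived from the last component of each directory path, which keeps them in sync with `dir_list` if its order changes. This is a sketch; the `class_to_label` mapping is a name introduced here for illustration.

```python
dir_list = ['./training/fire', './training/nofire', './testing/fire', './testing/nofire']

# Map the class folder name to its numeric label: 'fire' -> 1, 'nofire' -> 0
class_to_label = {'fire': 1, 'nofire': 0}

# The class name is the last component of each path
label_list = [class_to_label[d.rsplit('/', 1)[-1]] for d in dir_list]
print(label_list)  # [1, 0, 1, 0]
```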

In [14]:
df = all_images_list(path_to_dataset,dir_list,label_list)

Completed list of images
Reading from image list
Finished reviewing all images


In [15]:
df.head()

Unnamed: 0,label,item,channels,width,height,issues
0,1,./training/fire/fire_0489.jpg,3,250,250,no
1,1,./training/fire/fire_0338.jpg,3,250,250,no
2,1,./training/fire/fire_0310.jpg,3,250,250,no
3,1,./training/fire/fire_0476.jpg,3,250,250,no
4,1,./training/fire/fire_0462.jpg,3,250,250,no


In [16]:
df.shape

(1900, 6)

In [17]:
(df['issues'] == 'yes').sum()

0

Note that there were no issues while reading the images. Let's now look at the items that have an incorrect number of channels or incorrect dimensions.

In [18]:
df.query('channels != 3')

Unnamed: 0,label,item,channels,width,height,issues


In [19]:
df.query('width != 250')

Unnamed: 0,label,item,channels,width,height,issues
782,0,./training/nofire/nofire_0790.jpg,3,256,256,no
805,0,./training/nofire/nofire_0801.jpg,3,256,256,no
806,0,./training/nofire/nofire_0815.jpg,3,256,256,no
834,0,./training/nofire/nofire_0793.jpg,3,256,256,no
836,0,./training/nofire/nofire_0792.jpg,3,256,256,no
...,...,...,...,...,...,...
1874,0,./testing/nofire/nofire_0771.jpg,3,256,256,no
1875,0,./testing/nofire/nofire_0943.jpg,3,256,256,no
1883,0,./testing/nofire/nofire_0774.jpg,3,256,256,no
1891,0,./testing/nofire/nofire_0762.jpg,3,256,256,no


In [20]:
df.query('height != 250')

Unnamed: 0,label,item,channels,width,height,issues
782,0,./training/nofire/nofire_0790.jpg,3,256,256,no
805,0,./training/nofire/nofire_0801.jpg,3,256,256,no
806,0,./training/nofire/nofire_0815.jpg,3,256,256,no
834,0,./training/nofire/nofire_0793.jpg,3,256,256,no
836,0,./training/nofire/nofire_0792.jpg,3,256,256,no
...,...,...,...,...,...,...
1874,0,./testing/nofire/nofire_0771.jpg,3,256,256,no
1875,0,./testing/nofire/nofire_0943.jpg,3,256,256,no
1883,0,./testing/nofire/nofire_0774.jpg,3,256,256,no
1891,0,./testing/nofire/nofire_0762.jpg,3,256,256,no


We see there are 74 images with incorrect dimensions (256 x 256 instead of 250 x 250).
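
Rather than querying width and height separately, the distinct image sizes can be summarized in one step. The sketch below uses a small toy dataframe with made-up rows, not the real dataset, and assumes the same `width` and `height` column names.

```python
import pandas as pd

# Toy stand-in for `df`; the values are illustrative only
toy = pd.DataFrame({
    'item': ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg'],
    'channels': [3, 3, 3, 3],
    'width': [250, 256, 250, 256],
    'height': [250, 256, 250, 256],
})

# Count how many images share each (width, height) pair
size_counts = toy.groupby(['width', 'height']).size()
print(size_counts)

# Rows that deviate from the 250 x 250 target in either dimension
bad = toy.query('width != 250 or height != 250')
n_bad = len(bad)
print(n_bad)
```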

# Correcting for Oversized Images

In [21]:
from src.data.image_prep import crop_image

In [22]:
help(crop_image)

Help on function crop_image in module src.data.image_prep:

crop_image(path_to_dataset, df_oversized, new_height, new_width)
    Function that crops a set of images and saves a new file accordingly

    Parameters
    ----------
    path_to_dataset : string
        Path (absolute or relative) to contents of image dataset folder.
        The dataset is assumed to have the following structure
        dataset/
            folder1/
                subfolder1.1/
            folder2/
                subfolder2.1/
                subfolder2.2/
            folder3/

    df_oversized : dataframe
        A dataframe listing all the 'oversized' images, to be cropped.
        It is assumed to have the following columns: `item`, `label`,`channels`,`height`,`width`

    new_height : integer
        Cropped image height in pixels

    new_width : integer
        Cropped image width in pixels

    Returns
    -------
    Dataframe
        Original `df_oversized` with three additonal columns
        - 
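
The help text above does not say which region of the image `crop_image` keeps. One common choice is a centered crop, and the arithmetic for that case is sketched below. The function `center_crop_box` is hypothetical and only illustrates the calculation.

```python
def center_crop_box(width, height, new_width, new_height):
    """Return (left, upper, right, lower) for a centered crop.

    Illustrative only: the region actually kept by crop_image
    is not shown in this notebook.
    """
    if new_width > width or new_height > height:
        raise ValueError('crop size exceeds image size')
    left = (width - new_width) // 2
    upper = (height - new_height) // 2
    return (left, upper, left + new_width, upper + new_height)


# A 256 x 256 image trimmed to 250 x 250 loses 3 pixels on each side
box = center_crop_box(256, 256, 250, 250)
print(box)  # (3, 3, 253, 253)
```

The tuple follows the (left, upper, right, lower) convention accepted by Pillow's `Image.crop`.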

In [25]:
df_oversized = df.query('height !=250').copy()

In [26]:
# just double checking both height and width are the same
df_oversized.query('height == width')

Unnamed: 0,label,item,channels,width,height,issues
782,0,./training/nofire/nofire_0790.jpg,3,256,256,no
805,0,./training/nofire/nofire_0801.jpg,3,256,256,no
806,0,./training/nofire/nofire_0815.jpg,3,256,256,no
834,0,./training/nofire/nofire_0793.jpg,3,256,256,no
836,0,./training/nofire/nofire_0792.jpg,3,256,256,no
...,...,...,...,...,...,...
1874,0,./testing/nofire/nofire_0771.jpg,3,256,256,no
1875,0,./testing/nofire/nofire_0943.jpg,3,256,256,no
1883,0,./testing/nofire/nofire_0774.jpg,3,256,256,no
1891,0,./testing/nofire/nofire_0762.jpg,3,256,256,no


In [27]:
df_oversized = crop_image(path_to_dataset,df_oversized,250,250)


Cropped ./training/nofire/nofire_0790.jpg to 250 x 250
Cropped ./training/nofire/nofire_0801.jpg to 250 x 250
Cropped ./training/nofire/nofire_0815.jpg to 250 x 250
Cropped ./training/nofire/nofire_0793.jpg to 250 x 250
Cropped ./training/nofire/nofire_0792.jpg to 250 x 250
Cropped ./training/nofire/nofire_0779.jpg to 250 x 250
Cropped ./training/nofire/nofire_0341.jpg to 250 x 250
Cropped ./training/nofire/nofire_0806.jpg to 250 x 250
Cropped ./training/nofire/nofire_0812.jpg to 250 x 250
Cropped ./training/nofire/nofire_0769.jpg to 250 x 250
Cropped ./training/nofire/nofire_0797.jpg to 250 x 250
Cropped ./training/nofire/nofire_0768.jpg to 250 x 250
Cropped ./training/nofire/nofire_0807.jpg to 250 x 250
Cropped ./training/nofire/nofire_0811.jpg to 250 x 250
Cropped ./training/nofire/nofire_0795.jpg to 250 x 250
Cropped ./training/nofire/nofire_0794.jpg to 250 x 250
Cropped ./training/nofire/nofire_0780.jpg to 250 x 250
Cropped ./training/nofire/nofi

In [28]:
df_oversized.head()

Unnamed: 0,label,item,channels,width,height,issues,cropped_item,new_height,new_width
782,0,./training/nofire/nofire_0790.jpg,3,256,256,no,./training/nofire/nofire_0790_cropped.jpg,0,0
805,0,./training/nofire/nofire_0801.jpg,3,256,256,no,./training/nofire/nofire_0801_cropped.jpg,0,0
806,0,./training/nofire/nofire_0815.jpg,3,256,256,no,./training/nofire/nofire_0815_cropped.jpg,0,0
834,0,./training/nofire/nofire_0793.jpg,3,256,256,no,./training/nofire/nofire_0793_cropped.jpg,0,0
836,0,./training/nofire/nofire_0792.jpg,3,256,256,no,./training/nofire/nofire_0792_cropped.jpg,0,0


Now let's make an updated dataframe where we are going to overwrite `item` with `cropped_item`. Note that we did not reset the index of `df_oversized`, so we can use that to find the appropriate rows.

In [30]:
df_updated = df.copy()

In [32]:
df_updated.query('width !=250')

Unnamed: 0,label,item,channels,width,height,issues
782,0,./training/nofire/nofire_0790.jpg,3,256,256,no
805,0,./training/nofire/nofire_0801.jpg,3,256,256,no
806,0,./training/nofire/nofire_0815.jpg,3,256,256,no
834,0,./training/nofire/nofire_0793.jpg,3,256,256,no
836,0,./training/nofire/nofire_0792.jpg,3,256,256,no
...,...,...,...,...,...,...
1874,0,./testing/nofire/nofire_0771.jpg,3,256,256,no
1875,0,./testing/nofire/nofire_0943.jpg,3,256,256,no
1883,0,./testing/nofire/nofire_0774.jpg,3,256,256,no
1891,0,./testing/nofire/nofire_0762.jpg,3,256,256,no


In [36]:
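# df_oversized kept the original index, so each label `i` addresses the same row in df_updated
for i, cropped_item in zip(df_oversized.index, df_oversized['cropped_item'].values):
    # point the annotation at the cropped file
    df_updated.at[i, 'item'] = cropped_item
    # record the post-crop dimensions
    df_updated.at[i, 'width'] = 250
    df_updated.at[i, 'height'] = 250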

In [37]:
df_updated.query('width !=250')

Unnamed: 0,label,item,channels,width,height,issues


We've successfully updated the dataframe.
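
The row-by-row update above can also be written as index-aligned assignment with `.loc`, which avoids the explicit loop. The sketch below shows the idea on a toy dataframe; the file names are made up, and the column names are assumed to match those used in this notebook.

```python
import pandas as pd

# Toy stand-in for `df`
full = pd.DataFrame({
    'item': ['a.jpg', 'b.jpg', 'c.jpg'],
    'width': [250, 256, 250],
    'height': [250, 256, 250],
})

# The subset keeps the original index label (1 in this case)
sub = full.query('width != 250').copy()
sub['cropped_item'] = ['b_cropped.jpg']

updated = full.copy()
# Assigning a Series aligns on index labels, so only row 1 changes
updated.loc[sub.index, 'item'] = sub['cropped_item']
updated.loc[sub.index, ['width', 'height']] = 250
print(updated)
```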

# Save Annotations File

Lastly, we save `df_updated` as annotation files suitable for PyTorch. There will be two files, one per split:
- train
- test

In [39]:
train_filter = df_updated['item'].str.contains('/train')
test_filter = df_updated['item'].str.contains('/test')

In [40]:
print('Number of images:')
print('train',train_filter.sum())
print('test',test_filter.sum())

Number of images:
train 1520
test 380


In [41]:
1520 + 380

1900

In [42]:
df_updated.shape

(1900, 6)

All numbers add up correctly.
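
Beyond comparing totals, one can check that the two filters form a partition, meaning every row belongs to exactly one split. The sketch below uses toy paths and `str.startswith` on the `./training/` and `./testing/` prefixes. This is a slightly stricter match than `str.contains`, and it assumes all paths begin with one of those prefixes.

```python
import pandas as pd

# Toy item paths; file names are illustrative only
items = pd.Series([
    './training/fire/fire_0001.jpg',
    './training/nofire/nofire_0001.jpg',
    './testing/fire/fire_0002.jpg',
])

is_train = items.str.startswith('./training/')
is_test = items.str.startswith('./testing/')

# XOR is True only when exactly one of the two masks is True for a row
partition_ok = bool((is_train ^ is_test).all())
print(partition_ok)
```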

In [43]:
train_df = df_updated.loc[train_filter,['item','label']]
train_df.head()

Unnamed: 0,item,label
0,./training/fire/fire_0489.jpg,1
1,./training/fire/fire_0338.jpg,1
2,./training/fire/fire_0310.jpg,1
3,./training/fire/fire_0476.jpg,1
4,./training/fire/fire_0462.jpg,1


In [44]:
train_df.to_csv('labels_02_train_dataset.csv', index = False, header = False)

In [45]:
test_df = df_updated.loc[test_filter,['item','label']]
test_df.to_csv('labels_02_test_dataset.csv', index = False, header = False)
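
Because the files are written with `header = False`, whatever reads them later has to be told there is no header row. Otherwise pandas treats the first image as column names. The sketch below shows a round trip through an in-memory buffer; the column names passed to `read_csv` are an assumption chosen to mirror `item` and `label`.

```python
import io
import pandas as pd

# Toy annotations; file names are illustrative only
toy = pd.DataFrame({
    'item': ['./training/fire/fire_0001.jpg', './training/nofire/nofire_0001.jpg'],
    'label': [1, 0],
})

# Write without index or header, as done in this notebook
buffer = io.StringIO()
toy.to_csv(buffer, index=False, header=False)
buffer.seek(0)

# Read back, supplying the column names explicitly
annotations = pd.read_csv(buffer, header=None, names=['item', 'label'])
print(annotations)
```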

In [47]:
! ls labels_02_*_dataset.csv

labels_02_test_dataset.csv  labels_02_train_dataset.csv




In [48]:
! mv labels_02_*_dataset.csv ../data_preprocessing/02_deepfire_dataset



In [49]:
! ls labels_02_*_dataset.csv

zsh:1: no matches found: labels_02_*_dataset.csv
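
The shell commands above depend on the local shell. A portable alternative is to move the files from Python with `shutil.move`. The sketch below demonstrates this inside a temporary directory with placeholder files, so it does not touch the real dataset folders.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Recreate the relevant part of the project layout in a scratch location
    src_dir = Path(tmp) / 'jupyter_notebooks'
    dst_dir = Path(tmp) / 'data_preprocessing' / '02_deepfire_dataset'
    src_dir.mkdir(parents=True)
    dst_dir.mkdir(parents=True)
    for name in ('labels_02_train_dataset.csv', 'labels_02_test_dataset.csv'):
        (src_dir / name).write_text('placeholder\n')

    # Move every matching annotation file into the dataset folder
    for csv_path in sorted(src_dir.glob('labels_02_*_dataset.csv')):
        shutil.move(str(csv_path), str(dst_dir / csv_path.name))

    moved = sorted(p.name for p in dst_dir.iterdir())
    remaining = list(src_dir.glob('labels_02_*_dataset.csv'))

print(moved)
print(remaining)
```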


