# Description

This notebook performs exploratory data analysis (EDA) on the following dataset:

- `02_forest_fire_dataset` 

**Source:** https://www.kaggle.com/datasets/alik05/forest-fire-dataset

## Directory Structure

The directory structure for this dataset is assumed to be organized as follows:
```bash
.
├── data_preprocessing/
│   └── 02_forest_fire_dataset/
│       ├── testing/
│       │   ├── fire/
│       │   └── nofire/
│       └── training/
│           ├── fire/
│           └── nofire/
```
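
The assumed layout can be checked before running the rest of the notebook. The following is a minimal sketch; `missing_folders` and `EXPECTED_SUBFOLDERS` are hypothetical names that do not appear elsewhere in this notebook.

```python
from pathlib import Path

# Sub-folders the notebook expects inside the dataset folder
EXPECTED_SUBFOLDERS = [
    "training/fire",
    "training/nofire",
    "testing/fire",
    "testing/nofire",
]

def missing_folders(dataset_root):
    """Return the expected sub-folders that do not exist under dataset_root."""
    root = Path(dataset_root)
    return [sub for sub in EXPECTED_SUBFOLDERS if not (root / sub).is_dir()]

# Path used later in this notebook; an empty list means the layout matches
print(missing_folders("../data_preprocessing/02_forest_fire_dataset"))
```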

## Steps
We will do the following for each of the `training/` and `testing/` directories:
1. Create a dataframe listing all images with their labels, channels, width and height
2. Check that all images have the correct number of channels (3) and the same width and height (250 x 250)
3. Crop or replace images as needed
4. Update the dataframe accordingly
5. Save the annotations as a .csv file to be used by PyTorch

Note: the final .csv files are meant to be used by the dataloader in a separate notebook.
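
To illustrate how such an annotation file might be consumed later, here is a minimal sketch of an index with the same `__len__` / `__getitem__` protocol that PyTorch map-style datasets use. It assumes a header-less file with rows of `item,label`, as written in this notebook. `AnnotationIndex` is a hypothetical name, and the class returns image paths rather than loaded tensors so that it runs without PyTorch.

```python
import csv
import io
import os

class AnnotationIndex:
    """Map-style index over a header-less annotation file with rows of `item,label`."""

    def __init__(self, csv_file, img_dir):
        # csv_file: any iterable of text lines (an open file works)
        self.img_dir = img_dir
        self.rows = [(row[0], int(row[1])) for row in csv.reader(csv_file) if row]

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        item, label = self.rows[idx]
        # A real dataset would load the image here, e.g. with torchvision.io.read_image
        return os.path.join(self.img_dir, item), label

# Two rows in the format written by this notebook
sample = io.StringIO(
    "training/nofire/nofire_0169.jpg,0\n"
    "training/nofire/nofire_0633.jpg,0\n"
)
index = AnnotationIndex(sample, "../data_preprocessing/02_forest_fire_dataset")
print(len(index), index[0])
```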

# Define Helper Functions
Note: in the future I want Jupyter to import all helper functions from the `src/` folder instead of defining them here.
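
One possible way to do this is to add `src/` to the module search path at the top of the notebook. This is a sketch under assumptions: `src/` is taken to be a sibling of the notebook folder, and `dataset_helpers` is a hypothetical module name.

```python
import sys
from pathlib import Path

# Assumption: the notebook runs from a folder that has `src/` as a sibling
src_dir = (Path.cwd().parent / "src").resolve()

# Put src/ on the module search path once, so its modules can be imported by name
if str(src_dir) not in sys.path:
    sys.path.insert(0, str(src_dir))

# Hypothetical module name; this only works once such a file exists in src/
# from dataset_helpers import review_dataset, crop_images
```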

In [1]:
import numpy as np
import pandas as pd
import os

from torchvision.io import read_image

def review_dataset(path_to_dataset,dataset_foldername,img_folders):
    '''
    Function that builds the annotations dataframe for an image dataset and
    records the channels, height and width of each image.
    The dataset is assumed to be divided into two folders according to the labels.
    
    Parameters
    ----------
    path_to_dataset : str
        String with the path pointing to the folder containing the dataset
    dataset_foldername : str
        Name of the folder containing all images
    img_folders : list
        List with the folder names containing the images. Both folders are assumed to be inside 
        `dataset_foldername` and each position in the list is going to be mapped to a binary category
        img_folders[0] --> category 0
        img_folders[1] --> category 1
    
    Returns
    -------
    dataframe
        A dataframe containing the list of images, their binary label, and the
        channels, height and width of each image
    
    '''
    assert isinstance(path_to_dataset,str),'path_to_dataset must be a string'
    assert isinstance(dataset_foldername,str),'dataset_foldername must be a string'
    assert isinstance(img_folders,list),'img_folders must be a list'
    
    Nfolders = len(img_folders)
    
    df_list = []
    
    for index in range(Nfolders):
    
        curr_folder = img_folders[index]
    
        assert isinstance(curr_folder,str),'element of img_folders must be a string'
    
        # os.path.join works whether or not path_to_dataset ends with '/'
        curr_path = os.path.join(path_to_dataset, dataset_foldername, curr_folder)
    
        image_list = os.listdir(curr_path)
        labels = index * np.ones(len(image_list), dtype = int)
    
        temp_df = pd.DataFrame(columns = ['item'], data = image_list)
    
        # update each item to include the image folder 
        temp_df['item'] = temp_df['item'].apply(lambda x: curr_folder + '/' + x)
    
        # add column with labels
        temp_df['label'] = labels
    
        df_list.append(temp_df.copy())
    
    # concatenate dataframe list into a single file
    all_images_df = pd.concat(df_list, axis = 0, ignore_index = True)
    
    print('\n Created dataframe with all images\n')
    print(all_images_df.info())
    
    original_dir = os.getcwd()
    
    os.chdir(os.path.join(path_to_dataset, dataset_foldername))
    
    # preallocate columns for image channels and size
    all_images_df['channels'] = 0
    all_images_df['height'] = 0
    all_images_df['width'] = 0
    
    image_list = all_images_df['item'].to_list()
    
    for index,image in enumerate(image_list):
    
        img_shape = read_image(image).shape
    
        all_images_df.at[index,'channels'] = img_shape[0]
        all_images_df.at[index,'height'] = img_shape[1]
        all_images_df.at[index,'width'] = img_shape[2]

    os.chdir(original_dir)
    
    return all_images_df
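
The helper above changes the working directory with `os.chdir` and restores it at the end. If reading an image raises an error, the notebook is left inside the dataset folder. A small context manager avoids this; the sketch below uses the hypothetical name `working_directory` (Python 3.11 and later also provide `contextlib.chdir` for the same purpose).

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Temporarily switch to `path`, restoring the previous directory on exit."""
    original_dir = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises, so the notebook never stays in `path`
        os.chdir(original_dir)

# Demonstration with a throwaway directory
start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    with working_directory(tmp):
        inside = os.getcwd()
after = os.getcwd()
print(after == start)
```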

In [2]:
from torchvision.transforms.functional import crop
from torchvision.utils import save_image
from torchvision.io import read_image

def crop_images(df_oversized,path_to_dataset,dataset_foldername,new_height,new_width):
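    '''
    Helper function that crops oversized images and saves the cropped copies.
    Each image is cropped starting from its top-left corner.
    
    Parameters
    ----------
    df_oversized : dataframe
        Dataframe listing the oversized images. It is assumed to have the
        following columns: item, label, channels, height, width
    path_to_dataset : str
        String with the path pointing to the folder containing the dataset
    dataset_foldername : str
        Name of the folder containing all images
    new_height : int
        Height of the cropped images
    new_width : int
        Width of the cropped images
    
    Returns
    -------
    dataframe
        The input dataframe with an added 'cropped_item' column
    
    '''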
    assert isinstance(new_height,int) and isinstance(new_width,int), 'new dimensions should be integers'
    
    df_oversized['cropped_item'] = ''    
    
    original_dir = os.getcwd()
    
    os.chdir(os.path.join(path_to_dataset, dataset_foldername))
    
    for i in df_oversized.index:
    
        # look up by index label and column name, so a non-reset index also works
        image_path_name = df_oversized.at[i,'item']
    
        # split the path into base name and extension (e.g. '.jpg')
        base_name, extension = os.path.splitext(image_path_name)
    
        # rename image, including path
        cropped_image_path_name = base_name + '_cropped' + extension
        
        # update dataframe
        df_oversized.at[i,'cropped_item'] = cropped_image_path_name
        
        # read image
        img = read_image(image_path_name)
        
        # crop image
        temp = crop(img,0,0,new_height,new_width)
        
        # save image -- need to normalize to the 0 to 1 interval
        save_image(temp/255,cropped_image_path_name)
    
    print(f'\nCropped all images to {new_height} x {new_width}\n')
    
    os.chdir(original_dir)
    
    return df_oversized
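
Note that `crop(img,0,0,new_height,new_width)` keeps the top-left corner of each image, so a 256 x 256 image loses its last 6 rows and columns. If a centered crop were preferred instead, the offsets could be computed as in this sketch. `center_crop_box` is a hypothetical helper, not part of this notebook.

```python
def center_crop_box(height, width, new_height, new_width):
    """Return (top, left) offsets that center a new_height x new_width window."""
    if new_height > height or new_width > width:
        raise ValueError("crop window is larger than the image")
    top = (height - new_height) // 2
    left = (width - new_width) // 2
    return top, left

# Oversized sizes found in this dataset
print(center_crop_box(256, 256, 250, 250))  # (3, 3)
print(center_crop_box(252, 252, 250, 250))  # (1, 1)
```

The resulting offsets would replace the two zeros in the call, as in `crop(img, top, left, new_height, new_width)`. `torchvision.transforms.functional.center_crop` is another option for the same purpose.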

# Define Paths to Training and Testing Datasets

In [3]:
! pwd

/Users/rodrigo/Documents/BrainStation/Capstone Project/capstone_project/jupyter_notebooks


In [4]:
! ls ../data_preprocessing/02_forest_fire_dataset/

labels_02_forest_fire_dataset.csv      testing
labels_02_forest_fire_dataset_prep.csv training
temp.jpg


In [5]:
# paths are relative to jupyter notebook location
path_to_dataset = '../data_preprocessing/'
dataset_foldername = '02_forest_fire_dataset'
img_folders_training = ['training/nofire','training/fire']
img_folders_testing = ['testing/nofire','testing/fire']

In [6]:
df_training = review_dataset(path_to_dataset,dataset_foldername,img_folders_training)


 Created dataframe with all images

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1632 entries, 0 to 1631
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   item    1632 non-null   object
 1   label   1632 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 25.6+ KB
None


In [7]:
df_training.head()

Unnamed: 0,item,label,channels,height,width
0,training/nofire/nofire_0169.jpg,0,3,250,250
1,training/nofire/nofire_0633.jpg,0,3,250,250
2,training/nofire/nofire_0155.jpg,0,3,250,250
3,training/nofire/nofire_0141.jpg,0,3,250,250
4,training/nofire/nofire_0627.jpg,0,3,250,250


We can see that the training dataframe lists 1632 images in total.

In [8]:
df_testing = review_dataset(path_to_dataset,dataset_foldername,img_folders_testing)


 Created dataframe with all images

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   item    380 non-null    object
 1   label   380 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 6.1+ KB
None


In [9]:
df_testing.head()

Unnamed: 0,item,label,channels,height,width
0,testing/nofire/nofire_0800.jpg,0,3,256,256
1,testing/nofire/nofire_0828.jpg,0,3,250,250
2,testing/nofire/nofire_0196.jpg,0,3,250,250
3,testing/nofire/nofire_0357.jpg,0,3,250,250
4,testing/nofire/nofire_0431.jpg,0,3,250,250


The testing dataset has 380 images. The head shows that at least one image is 256 x 256, which is bigger than the expected 250 x 250. We'll explore this in detail in the next sections.

# Checking image sizes for Training Set

In [10]:
df_training['channels'].value_counts()

channels
3    1632
Name: count, dtype: int64

All 1632 images in the training set have 3 channels. No surprises here.

In [11]:
df_training['height'].value_counts()

height
250    1576
256      55
252       1
Name: count, dtype: int64

In [12]:
df_training['width'].value_counts()

width
250    1576
256      55
252       1
Name: count, dtype: int64

We see that 56 images in the dataset do not have the same 250 x 250 size as all the others. They are bigger, with heights and widths of 256 or 252.
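
The two value counts above are computed separately, so they do not show by themselves that the same images are oversized in both dimensions. Counting (height, width) pairs makes this explicit. Below is a minimal sketch with made-up lists; in the notebook the two sequences would be `df_training['height']` and `df_training['width']`.

```python
from collections import Counter

# Toy values standing in for the 'height' and 'width' columns
heights = [250, 250, 256, 252, 250]
widths = [250, 250, 256, 252, 250]

# Count each (height, width) pair
size_counts = Counter(zip(heights, widths))
print(size_counts.most_common())

# Pairs that differ from the expected 250 x 250
expected = (250, 250)
off_size = {size: n for size, n in size_counts.items() if size != expected}
print(off_size)
```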

In [13]:
#get all the oversized entries
df_training_oversized = df_training.query('width > 250').copy()

#re-start the indexing
df_training_oversized.reset_index(drop = True, inplace = True)

In [14]:
df_training_oversized.head()

Unnamed: 0,item,label,channels,height,width
0,training/nofire/nofire_0790.jpg,0,3,256,256
1,training/nofire/nofire_0801.jpg,0,3,256,256
2,training/nofire/nofire_0815.jpg,0,3,256,256
3,training/nofire/nofire_0793.jpg,0,3,256,256
4,training/nofire/nofire_0792.jpg,0,3,256,256


In [15]:
df_training_oversized.shape

(56, 5)

There are exactly 56 images that are oversized in both dimensions. We'll add another column to store the name of the resized image.

In [36]:
# inputs for image cropping
new_height = 250
new_width = 250

In [25]:
df_training_oversized = crop_images(df_training_oversized,path_to_dataset,dataset_foldername,new_height,new_width)


Cropped all images to 250 x 250



In [27]:
df_training_oversized.head()

Unnamed: 0,item,label,channels,height,width,cropped_item
0,training/nofire/nofire_0790.jpg,0,3,256,256,training/nofire/nofire_0790_cropped.jpg
1,training/nofire/nofire_0801.jpg,0,3,256,256,training/nofire/nofire_0801_cropped.jpg
2,training/nofire/nofire_0815.jpg,0,3,256,256,training/nofire/nofire_0815_cropped.jpg
3,training/nofire/nofire_0793.jpg,0,3,256,256,training/nofire/nofire_0793_cropped.jpg
4,training/nofire/nofire_0792.jpg,0,3,256,256,training/nofire/nofire_0792_cropped.jpg


Now we have to update the original dataframe with the cropped images and labels.

In [32]:
# extract only the rows that have height (and width) of 250
df_correct_size = df_training.query('height == 250').copy()
df_correct_size.reset_index(drop = True, inplace = True)
df_correct_size.drop(labels = ['channels','height','width'],axis = 1, inplace = True)

In [33]:
# extract 'cropped_item' and 'label' columns from df_training_oversized
df_updated_sizes = df_training_oversized[['cropped_item','label']].copy()
df_updated_sizes.columns = ['item','label']

In [34]:
# concatenate dataframes
updated_items_labels = pd.concat([df_correct_size,df_updated_sizes], axis = 0)

In [35]:
! pwd

/Users/rodrigo/Documents/BrainStation/Capstone Project/capstone_project/jupyter_notebooks


In [37]:
path_to_dataset

'../data_preprocessing/'

In [38]:
dataset_foldername

'02_forest_fire_dataset'

In [39]:
full_path_new_annotations = path_to_dataset + dataset_foldername + '/' + 'labels_02_train_dataset_prep.csv'

In [41]:
updated_items_labels.to_csv(full_path_new_annotations,index = False, header = False)

In [42]:
! ls ../data_preprocessing/02_forest_fire_dataset/

labels_02_forest_fire_dataset.csv      temp.jpg
labels_02_forest_fire_dataset_prep.csv testing
labels_02_train_dataset_prep.csv       training


The labels file has been created for the training dataset. Now we move to the testing set.
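
Before moving on, the written file can be checked against the disk. The sketch below reads a header-less `item,label` file and reports the items that do not exist under the dataset folder. `missing_items` is a hypothetical helper, and the demonstration uses a temporary folder with placeholder file names rather than the real dataset.

```python
import csv
import os
import tempfile

def missing_items(annotations_path, dataset_dir):
    """Return the items listed in the annotation file that are not files on disk."""
    with open(annotations_path, newline="") as f:
        items = [row[0] for row in csv.reader(f) if row]
    return [item for item in items if not os.path.isfile(os.path.join(dataset_dir, item))]

# Demonstration on a throwaway folder with one existing and one missing image
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "training", "nofire"))
    open(os.path.join(tmp, "training", "nofire", "a.jpg"), "wb").close()
    annotations = os.path.join(tmp, "labels.csv")
    with open(annotations, "w", newline="") as f:
        csv.writer(f).writerows([["training/nofire/a.jpg", 0], ["training/nofire/b.jpg", 0]])
    result = missing_items(annotations, tmp)

print(result)  # ['training/nofire/b.jpg']
```

In this notebook the call would look like `missing_items(full_path_new_annotations, path_to_dataset + dataset_foldername)`, where an empty list means every listed image was found.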

# Checking image sizes for Testing Set

In [43]:
df_testing['channels'].value_counts()

channels
3    380
Name: count, dtype: int64

No surprises in the number of channels.

In [44]:
df_testing['height'].value_counts()

height
250    362
256     18
Name: count, dtype: int64

In [45]:
df_testing['width'].value_counts()

width
250    362
256     18
Name: count, dtype: int64

We see there are 18 images that are oversized. We'll follow a similar procedure as before to crop them down to 250 x 250.

In [46]:
#get all the oversized entries
df_testing_oversized = df_testing.query('width > 250').copy()

#re-start the indexing
df_testing_oversized.reset_index(drop = True, inplace = True)

In [47]:
df_testing_oversized.shape

(18, 5)

Only 18 entries are oversized.

In [48]:
df_testing_oversized.head()

Unnamed: 0,item,label,channels,height,width
0,testing/nofire/nofire_0800.jpg,0,3,256,256
1,testing/nofire/nofire_0791.jpg,0,3,256,256
2,testing/nofire/nofire_0785.jpg,0,3,256,256
3,testing/nofire/nofire_0787.jpg,0,3,256,256
4,testing/nofire/nofire_0796.jpg,0,3,256,256


In [49]:
new_height = 250
new_width = 250
df_testing_oversized = crop_images(df_testing_oversized,path_to_dataset,dataset_foldername,new_height,new_width)


Cropped all images to 250 x 250



In [50]:
df_testing_oversized.head()

Unnamed: 0,item,label,channels,height,width,cropped_item
0,testing/nofire/nofire_0800.jpg,0,3,256,256,testing/nofire/nofire_0800_cropped.jpg
1,testing/nofire/nofire_0791.jpg,0,3,256,256,testing/nofire/nofire_0791_cropped.jpg
2,testing/nofire/nofire_0785.jpg,0,3,256,256,testing/nofire/nofire_0785_cropped.jpg
3,testing/nofire/nofire_0787.jpg,0,3,256,256,testing/nofire/nofire_0787_cropped.jpg
4,testing/nofire/nofire_0796.jpg,0,3,256,256,testing/nofire/nofire_0796_cropped.jpg


Let's repeat the same steps as before and update the records that have oversized images.

In [51]:
# extract only the rows that have height (and width) of 250
df_correct_size = df_testing.query('height == 250').copy()
df_correct_size.reset_index(drop = True, inplace = True)
df_correct_size.drop(labels = ['channels','height','width'],axis = 1, inplace = True)

In [52]:
# extract 'cropped_item' and 'label' columns from df_testing_oversized
df_updated_sizes = df_testing_oversized[['cropped_item','label']].copy()
df_updated_sizes.columns = ['item','label']

In [53]:
# concatenate dataframes
updated_testing_labels = pd.concat([df_correct_size,df_updated_sizes], axis = 0)

In [54]:
full_path_new_annotations = path_to_dataset + dataset_foldername + '/' + 'labels_02_test_dataset_prep.csv'

In [55]:
updated_testing_labels.to_csv(full_path_new_annotations,index = False, header = False)

In [56]:
# to double check labels file was created
! ls ../data_preprocessing/02_forest_fire_dataset/

labels_02_forest_fire_dataset.csv      temp.jpg
labels_02_forest_fire_dataset_prep.csv testing
labels_02_test_dataset_prep.csv        training
labels_02_train_dataset_prep.csv
