## Exploratory data analysis of the fragments

While training the model, we found that some layers contributed more to ink detection than others; the middle layers in particular appeared to be the most relevant. We therefore investigated this further through exploratory data analysis.

In [None]:
import os
import gc
import glob
import json
from collections import defaultdict
import multiprocessing as mp
from pathlib import Path
from types import SimpleNamespace
from typing import Dict, List, Optional, Tuple
import warnings
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import pandas as pd
import PIL.Image as Image
from sklearn.metrics import fbeta_score
from sklearn.exceptions import UndefinedMetricWarning
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
from tqdm import tqdm
import memory_profiler

In [None]:
PREFIX = 'kaggle//input//vesuvius-challenge//train//'
BUFFER = 30   # Buffer size in x and y direction
Z_START = 16  # First slice in the z direction to use
Z_DIM = 32    # Number of slices in the z direction
TRAINING_EPOCHS = 20000
VALIDATION_EPOCHS = 500
LEARNING_RATE = 0.03
BATCH_SIZE = 32
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Rectangle coordinates; replace these with the correct values
rectangle = [(1100, 3500, 700, 950), (1000, 1000, 1200, 1200), (1500, 2500, 1200, 1200)]
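
As a quick sanity check on the z-range settings, the sketch below lists which layer indices `Z_START` and `Z_DIM` select. It assumes each fragment has 65 layers (indices 0 to 64); `N_LAYERS` is a hypothetical name introduced only for this illustration and should be adjusted if the data differs.

```python
Z_START = 16   # first slice used, as in the configuration above
Z_DIM = 32     # number of slices used, as in the configuration above
N_LAYERS = 65  # assumption: number of layers per fragment

# Layer indices that fall inside the selected z-range
selected = list(range(Z_START, Z_START + Z_DIM))

# Number of layers left out below and above the selected range
below = Z_START
above = N_LAYERS - (Z_START + Z_DIM)

print(f"layers {selected[0]}..{selected[-1]} selected, "
      f"{below} skipped below, {above} skipped above")
```

Under that assumption, the selected range sits roughly in the middle of the stack, which matches the focus on the middle layers.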

In [None]:
# Run a full garbage collection and report how many
# unreachable objects were collected
collected = gc.collect()
print("Garbage collector: collected %d objects." % collected)

## Mean pixel value plots

We first examined how the mean pixel value varies across layers. We generated one plot for each of the three fragments: [Fragment 1 mean pixel values](mean_pixel_value_frag_1.png), [Fragment 2 mean pixel values](mean_pixel_value_frag_2.png), and [Fragment 3 mean pixel values](mean_pixel_value_frag_3.png).

When a new mask later became available, we regenerated the plots. The updated versions are [Fragment 1 mean pixel values](new_mean_pixel_value_frag_1.png), [Fragment 2 mean pixel values](new_mean_pixel_value_frag_2.png), and [Fragment 3 mean pixel values](new_mean_pixel_value_frag_3.png).
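
The statistic plotted for each layer is the mean intensity over the pixels inside the fragment mask. The sketch below uses small synthetic arrays (stand-ins, not challenge data) to check that two ways of computing it agree: zeroing the pixels outside the mask and rescaling the whole-image mean, or averaging only the masked pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions, not challenge data):
# a 16-bit "layer" scaled to [0, 1] and a boolean fragment mask.
layer = rng.integers(0, 65536, size=(40, 60)).astype(np.float64) / 65535.0
mask = np.zeros((40, 60), dtype=bool)
mask[5:30, 10:50] = True

# Route 1: zero out pixels outside the mask, average over the whole image,
# then rescale by (total pixels / masked pixels).
multiplier = mask.size / mask.sum()
rescaled_mean = (layer * mask).mean() * multiplier

# Route 2: average only the pixels inside the mask.
masked_mean = layer[mask].mean()

print(rescaled_mean, masked_mean)
```

Both routes give the same value up to floating-point rounding; boolean indexing simply avoids the explicit rescaling step.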



In [None]:
# Loop over the three training fragments
for j in range(1, 4):
    # Free memory left over from the previous fragment
    collected = gc.collect()
    print("Garbage collector: collected %d objects." % collected)

    # Read the fragment mask as a boolean array (True inside the fragment)
    mask = np.array(Image.open(PREFIX + str(j) + "//mask.png").convert('1'))
    a = []

    # Loop over the layer files in sorted (layer) order
    for filename in sorted(glob.glob(PREFIX + str(j) + "//surface_volume/*.tif")):
        # Read the 16-bit layer and scale pixel values to [0, 1]
        layer = np.array(Image.open(filename), dtype=np.float32) / 65535.0

        # Mean over the pixels inside the mask only; this equals the
        # whole-image mean of (layer * mask) rescaled by total / masked pixels
        a.append(layer[mask].mean())

    # Plot the mean pixel value of each layer
    plt.plot(np.array(a))
    plt.title(f"Frag_{j}: The mean pixel values vs layers")
    plt.xlabel("Layers")
    plt.ylabel("Mean pixel value")

    # Save the plot to a file and display it
    plt.savefig(f"mean_pixel_value_frag_{j}.png")
    plt.show()


## Correlation plots

Finally, we computed the Pearson correlation between each layer's pixel values and the ink labels, and plotted it against the layer number. The plots are available at [Correlation plot frag 1](Correlation_value_frag_1.png), [Correlation plot frag 2](Correlation_value_frag_2.png), and [Correlation plot frag 3](Correlation_value_frag_3.png).
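
For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below is a minimal NumPy version on synthetic data (not challenge data); `pearson_r` is a hypothetical helper written for this illustration, and the "informative" layer is constructed by adding the labels to noise.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two arrays, flattened to 1-D."""
    x = np.asarray(x, dtype=np.float64).ravel()
    y = np.asarray(y, dtype=np.float64).ravel()
    xc = x - x.mean()  # centre both variables
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

rng = np.random.default_rng(0)

# Synthetic binary "ink labels" (assumption: about 30% of pixels are ink)
labels = rng.random((50, 50)) < 0.3

# One layer whose intensity depends on the labels, one that is pure noise
informative = rng.normal(size=(50, 50)) + 2.0 * labels
uninformative = rng.normal(size=(50, 50))

r_inf = pearson_r(informative, labels)
r_un = pearson_r(uninformative, labels)
print(f"informative: {r_inf:.3f}, uninformative: {r_un:.3f}")
```

A layer that carries ink signal shows a clearly non-zero coefficient, while a noise-only layer stays near zero, which is the pattern the per-layer plots are meant to reveal.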


In [None]:
from scipy import stats

# Loop over the three training fragments
for j in range(1, 4):
    # Free memory left over from the previous fragment
    collected = gc.collect()
    print("Garbage collector: collected %d objects." % collected)

    # Read the fragment mask and the ink labels as boolean arrays
    mask = np.array(Image.open(PREFIX + str(j) + "//mask.png").convert('1'))
    labels = np.array(Image.open(PREFIX + str(j) + "//inklabels.png").convert('1')).flatten()
    # Cast the boolean labels to float so both inputs are numeric
    labels = labels.astype(np.float32)

    a = []

    # Loop over the layer files in sorted (layer) order
    for filename in sorted(glob.glob(PREFIX + str(j) + "//surface_volume/*.tif")):
        # Read the 16-bit layer, scale to [0, 1], and zero out pixels outside the mask
        X = (np.array(Image.open(filename), dtype=np.float32) / 65535.0 * mask).flatten()

        # Pearson correlation between this layer's pixel values and the ink labels
        a.append(stats.pearsonr(X, labels)[0])

    # Plot the correlation of each layer
    plt.plot(np.array(a))
    plt.title(f"Frag_{j}: Correlation vs layers")
    plt.xlabel("Layers")
    plt.ylabel("Correlation")

    # Save the plot to a file and display it
    plt.savefig(f"Correlation_value_frag_{j}.png")
    plt.show()


## Conclusion

These plots allow us to conclude that the middle layers are indeed responsible for most of the in