# How to Prepare Image Datasets and Models

This guide walks you through preparing image datasets and models for testing on AI Verify.

To test models that take images as input, you need the following:
- Dataset: a folder of images for testing
- Annotated Ground Truth Dataset: a DataFrame containing the file names of the images, along with their ground truth labels
- Model: a pipeline that processes image file paths before feeding the result into the final estimator


## 1. Dataset Preparation
<a id='section1'></a>
AI Verify can process images stored in a folder, so you can prepare your test data as a folder of images.

An example of the required folder structure:
<pre>
└── image_folder
    ├── 0.png
    ├── 1.png
    ├── 2.png
    ├── 3.png
    ├── 4.png
        ...
    ├── 195.png
    ├── 196.png
    ├── 197.png
    ├── 198.png
    └── 199.png
</pre>

Upon upload, AI Verify converts this folder into a pandas DataFrame with a column named 'image_directory' that contains the file paths to these images. Keep this in mind, because the model pipeline must accept a DataFrame of this form as its input.

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: middle;
    }

    .dataframe thead th {
        text-align: middle;
    }
</style>
<table border="1" class="dataframe" align="left">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>image_directory</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>/home/documents/aiverify/uploads/image_folder/0.png</td>
    </tr>
    <tr>
      <th>1</th>
      <td>/home/documents/aiverify/uploads/image_folder/1.png</td>
    </tr>
    <tr>
      <th>2</th>
      <td>/home/documents/aiverify/uploads/image_folder/2.png</td>
    </tr>
    <tr>
      <th>3</th>
      <td>/home/documents/aiverify/uploads/image_folder/3.png</td>
    </tr>
    <tr>
      <th>4</th>
      <td>/home/documents/aiverify/uploads/image_folder/4.png</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
    </tr>
    <tr>
      <th>195</th>
      <td>/home/documents/aiverify/uploads/image_folder/195.png</td>
    </tr>
    <tr>
      <th>196</th>
      <td>/home/documents/aiverify/uploads/image_folder/196.png</td>
    </tr>
    <tr>
      <th>197</th>
      <td>/home/documents/aiverify/uploads/image_folder/197.png</td>
    </tr>
    <tr>
      <th>198</th>
      <td>/home/documents/aiverify/uploads/image_folder/198.png</td>
    </tr>
    <tr>
      <th>199</th>
      <td>/home/documents/aiverify/uploads/image_folder/199.png</td>
    </tr>
  </tbody>
</table>
</div>
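
To make this shape concrete, the sketch below builds a comparable DataFrame from a local folder. It is only an illustration and not AI Verify's actual implementation: the helper name folder_to_dataframe is hypothetical, and sorting the paths is an assumption made here for reproducibility.

```python
import os
import tempfile

import pandas as pd


def folder_to_dataframe(folder):
    """Hypothetical helper: list the files in `folder` as an 'image_directory' column."""
    # Sorting is an assumption made for reproducibility, not documented AI Verify behavior
    paths = sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )
    return pd.DataFrame(paths, columns=["image_directory"])


# Demonstrate on a throwaway folder containing three empty placeholder files
with tempfile.TemporaryDirectory() as image_folder:
    for name in ("0.png", "1.png", "2.png"):
        open(os.path.join(image_folder, name), "wb").close()
    demo_df = folder_to_dataframe(image_folder)

print(demo_df.columns.tolist())
```

Any model you upload should therefore expect its input to arrive as a DataFrame of file paths rather than as pixel arrays.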

## 2. Annotated Ground Truth Dataset
While the test dataset can be uploaded as a folder, as detailed in [1. Dataset Preparation](#section1), an annotated ground truth dataset must be uploaded alongside it. The purpose of this dataset is to provide a mapping between the image file names and the corresponding ground truth labels.

This section shows an example of how to prepare this dataset, starting from a DataFrame that contains the labels for the test dataset.

First, import the relevant libraries:

In [1]:
# Standard library
import os
import pickle
import warnings
from os.path import join
from pathlib import Path

# Third-party
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

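In [2]:
# Load the pickled DataFrame of test labels and rename its default column 0 to 'labels'
with open('test_labels.sav', 'rb') as f:
    test_labels = pickle.load(f)
test_labels = test_labels.rename(columns={0: 'labels'})
display(test_labels)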

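<table border="1" class="dataframe">
  <thead>
    <tr><th></th><th>labels</th></tr>
  </thead>
  <tbody>
    <tr><th>0</th><td>9</td></tr>
    <tr><th>1</th><td>2</td></tr>
    <tr><th>2</th><td>1</td></tr>
    <tr><th>3</th><td>1</td></tr>
    <tr><th>4</th><td>6</td></tr>
    <tr><th>...</th><td>...</td></tr>
    <tr><th>195</th><td>2</td></tr>
    <tr><th>196</th><td>8</td></tr>
    <tr><th>197</th><td>3</td></tr>
    <tr><th>198</th><td>6</td></tr>
  </tbody>
</table>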
198,6


Next, create a DataFrame that contains the file names of the images that these labels map to.
In this example, the order of the labels in test_labels corresponds to the ascending numeric order of the files in the folder containing the test images.

In [3]:
test_dir_path = './test/'

# Sort numerically by file stem so that 10.png comes after 9.png, not after 1.png
file_names = sorted(os.listdir(test_dir_path),
                    key=lambda name: int(os.path.splitext(name)[0]))

# Build the DataFrame once, after all file names have been collected
file_names_df = pd.DataFrame(file_names, columns=['file_name'])

display(file_names_df)

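<table border="1" class="dataframe">
  <thead>
    <tr><th></th><th>file_name</th></tr>
  </thead>
  <tbody>
    <tr><th>0</th><td>0.png</td></tr>
    <tr><th>1</th><td>1.png</td></tr>
    <tr><th>2</th><td>2.png</td></tr>
    <tr><th>3</th><td>3.png</td></tr>
    <tr><th>4</th><td>4.png</td></tr>
    <tr><th>...</th><td>...</td></tr>
    <tr><th>195</th><td>195.png</td></tr>
    <tr><th>196</th><td>196.png</td></tr>
    <tr><th>197</th><td>197.png</td></tr>
    <tr><th>198</th><td>198.png</td></tr>
  </tbody>
</table>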
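
The numeric sort key used when listing the folder matters because plain string sorting would misorder the files and pair images with the wrong labels. The short sketch below, using a made-up list of file names, shows the difference.

```python
import os

names = ["10.png", "2.png", "1.png", "0.png"]

# Plain string sorting compares character by character, so "10.png" lands before "2.png"
lexicographic = sorted(names)

# Sorting on the integer file stem restores the intended numeric order
numeric = sorted(names, key=lambda name: int(os.path.splitext(name)[0]))

print(lexicographic)  # ['0.png', '1.png', '10.png', '2.png']
print(numeric)        # ['0.png', '1.png', '2.png', '10.png']
```

Note that this key raises a ValueError for any file whose stem is not an integer, so the folder should contain only the numbered image files.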


Create the annotated dataset by joining file_names_df and test_labels column-wise.
This gives the annotated ground truth dataset required by AI Verify: one column contains the file names, and the other contains the ground truth labels.

In [4]:
# Concatenate column-wise; rows are aligned on the index of both DataFrames
annotated_ground_truth = pd.concat((file_names_df, test_labels), axis=1)
with open('annotated_ground_truth.sav', 'wb') as f:
    pickle.dump(annotated_ground_truth, f)
display(annotated_ground_truth)

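<table border="1" class="dataframe">
  <thead>
    <tr><th></th><th>file_name</th><th>labels</th></tr>
  </thead>
  <tbody>
    <tr><th>0</th><td>0.png</td><td>9</td></tr>
    <tr><th>1</th><td>1.png</td><td>2</td></tr>
    <tr><th>2</th><td>2.png</td><td>1</td></tr>
    <tr><th>3</th><td>3.png</td><td>1</td></tr>
    <tr><th>4</th><td>4.png</td><td>6</td></tr>
    <tr><th>...</th><td>...</td><td>...</td></tr>
    <tr><th>195</th><td>195.png</td><td>2</td></tr>
    <tr><th>196</th><td>196.png</td><td>8</td></tr>
    <tr><th>197</th><td>197.png</td><td>3</td></tr>
    <tr><th>198</th><td>198.png</td><td>6</td></tr>
  </tbody>
</table>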
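
Because pd.concat with axis=1 aligns rows on the index, a length mismatch between the file names and the labels produces missing values rather than an error. A minimal sanity check, sketched below on small stand-in data, can catch this before upload. The helper name check_annotations and the sample values are hypothetical.

```python
import pandas as pd


def check_annotations(annotated, file_names):
    """Hypothetical check: every file has exactly one non-missing label."""
    problems = []
    if annotated.isna().any().any():
        problems.append("missing values found")
    if annotated["file_name"].duplicated().any():
        problems.append("duplicate file names found")
    if set(annotated["file_name"].dropna()) != set(file_names):
        problems.append("file names do not match the folder contents")
    return problems


# Stand-in data: three files but only two labels
files = pd.DataFrame(["0.png", "1.png", "2.png"], columns=["file_name"])
labels = pd.DataFrame([9, 2], columns=["labels"])

mismatched = pd.concat((files, labels), axis=1)
print(check_annotations(mismatched, files["file_name"]))

# With one label per file, the check reports no problems
matched = pd.concat((files, pd.DataFrame([9, 2, 1], columns=["labels"])), axis=1)
print(check_annotations(matched, files["file_name"]))
```

In the example from this guide, the equivalent call would be check_annotations(annotated_ground_truth, file_names_df['file_name']).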


## 3. Model Preparation (Example: Scikit-learn Pipeline)
To test image models with AI Verify, the model must likewise take a pandas DataFrame of image file paths as input. This means you need to train a pipeline model, as shown in the example below.

### Step 1: Creating a DataFrame of image file paths
For the folders of images that you have on hand, convert them into pandas DataFrames with a column named 'image_directory' containing the file paths.

In this example, the user has a folder (./train) containing the images used for training the model.

In [5]:
train_dir_path = './train/'

# Sort numerically by file stem, then prefix each file name with the folder path
train_files = sorted(os.listdir(train_dir_path),
                     key=lambda name: int(os.path.splitext(name)[0]))
train_dirs = [join(train_dir_path, name) for name in train_files]

train_df = pd.DataFrame(train_dirs, columns=['image_directory'])

print("DataFrame for training dataset:")
display(train_df)

DataFrame for training dataset:


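<table border="1" class="dataframe">
  <thead>
    <tr><th></th><th>image_directory</th></tr>
  </thead>
  <tbody>
    <tr><th>0</th><td>./train/0.png</td></tr>
    <tr><th>1</th><td>./train/1.png</td></tr>
    <tr><th>2</th><td>./train/2.png</td></tr>
    <tr><th>3</th><td>./train/3.png</td></tr>
    <tr><th>4</th><td>./train/4.png</td></tr>
    <tr><th>...</th><td>...</td></tr>
    <tr><th>995</th><td>./train/995.png</td></tr>
    <tr><th>996</th><td>./train/996.png</td></tr>
    <tr><th>997</th><td>./train/997.png</td></tr>
    <tr><th>998</th><td>./train/998.png</td></tr>
  </tbody>
</table>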


### Step 2: Loading the training labels
In this example, the user has a saved file 'train_labels.sav' containing the labels for the images in the training dataset above.

In [6]:
# Load the pickled DataFrame of training labels
with open('train_labels.sav', 'rb') as f:
    train_labels = pickle.load(f)
display(train_labels)

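<table border="1" class="dataframe">
  <thead>
    <tr><th></th><th>0</th></tr>
  </thead>
  <tbody>
    <tr><th>0</th><td>9</td></tr>
    <tr><th>1</th><td>0</td></tr>
    <tr><th>2</th><td>0</td></tr>
    <tr><th>3</th><td>3</td></tr>
    <tr><th>4</th><td>0</td></tr>
    <tr><th>...</th><td>...</td></tr>
    <tr><th>995</th><td>7</td></tr>
    <tr><th>996</th><td>3</td></tr>
    <tr><th>997</th><td>3</td></tr>
    <tr><th>998</th><td>9</td></tr>
  </tbody>
</table>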


### Step 3: Training a custom pipeline
With the training dataset and labels prepared, you can now define and train a custom pipeline that loads images from their file paths and makes predictions with the final estimator.

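In [7]:
import numpy as np
import pandas as pd
from PIL import Image

class imageProcessingStage():
    """Pipeline stage that turns a column of image file paths into pixel features."""

    def __init__(self, dir_column):
        # Name of the DataFrame column that holds the image file paths
        self.dir_column = dir_column

    def transform(self, X, y=None):
        """Load each image, divide its pixel values by 255, and flatten it into one row."""
        # Every image is expected to be 100 x 100 pixels with 3 channels
        height, width, channel = 100, 100, 3
        images = []
        for path in X[self.dir_column]:
            image_array = np.array(Image.open(path)) / 255.
            # reshape raises a ValueError if the image has a different size
            images.append(image_array.reshape(height * width * channel))
        return pd.DataFrame(images)

    def fit(self, X, y=None):
        # Nothing to learn; present so the class can be used as a Pipeline step
        return self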
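
The transform step above assumes that every image is 100 x 100 pixels with 3 channels, so each one flattens to 100 * 100 * 3 = 30000 features. The sketch below illustrates this with synthetic NumPy arrays standing in for loaded images (no files are read), and shows that an image of any other shape makes the reshape fail.

```python
import numpy as np

height, width, channel = 100, 100, 3

# Synthetic stand-in for np.array(Image.open(path)) on a 100 x 100 RGB image
rgb_image = np.zeros((height, width, channel), dtype=np.uint8)
flat = (rgb_image / 255.).reshape(height * width * channel)
print(flat.shape)  # (30000,)

# A greyscale image of the same size has only 100 * 100 values,
# so the same reshape raises a ValueError
grey_image = np.zeros((height, width), dtype=np.uint8)
try:
    (grey_image / 255.).reshape(height * width * channel)
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)  # True
```

If your images vary in size or colour mode, one possible approach is to normalise them with Pillow's convert('RGB') and resize((width, height)) before converting them to arrays. Whether that is appropriate depends on how your model was trained.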

In [8]:
pipe = Pipeline([
    ('preprocess images', imageProcessingStage(dir_column = 'image_directory')),
    ('model',  LogisticRegression())])

Training the pipeline:

In [9]:
pipe.fit(train_df, train_labels)

Pipeline(steps=[('preprocess images',
                 <__main__.imageProcessingStage object at 0x000001D17FB3F8C8>),
                ('model', LogisticRegression())])

Save the trained pipeline:

In [10]:
with open('pipeline_file.sav', 'wb') as f:
    pickle.dump(pipe, f)

To test this model, upload a model folder containing:
- A Python file that defines the classes used in the pipeline (i.e. imageProcessingStage in this example). Tip: Remember to include the relevant library imports.
- The trained pipeline file (i.e. 'pipeline_file.sav' in this example)
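
The Python file is needed because pickle stores class instances by reference: the saved file records the module and name of each class, not its code. The sketch below illustrates the mechanism with a standard library class, Fraction, as a stand-in. The same applies to a custom class such as imageProcessingStage, whose definition must be importable when the pipeline is loaded.

```python
import pickle
from fractions import Fraction

payload = pickle.dumps(Fraction(1, 3))

# The payload names the module and class to import at load time, not the class's code
print(b"fractions" in payload, b"Fraction" in payload)

# Loading succeeds because the fractions module is importable here
restored = pickle.loads(payload)
print(restored)
```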