# MonReader_ViT
- Author: Yumo Bai
- Email: baiym104@gmail.com
- Date: May 3

Now that we have a working CNN model, we can leverage a pretrained Vision Transformer (ViT) model to look for a better solution. We will access the ViT model through the HuggingFace API.

## Package Installation & Setup

In [1]:
!pip install transformers "datasets>=1.17.0" tensorboard --upgrade
!sudo apt-get install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 15 not upgraded.


In [2]:
# Log into our HuggingFace account to access the models
from huggingface_hub import notebook_login

notebook_login()



In this example, we are going to fine-tune `google/vit-base-patch16-224-in21k`, a Vision Transformer (ViT) pre-trained on ImageNet-21k (14 million images, 21,843 classes) at a resolution of 224x224.
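
The model name encodes the input geometry: `patch16` refers to 16x16-pixel patches and `224` to the input resolution. As a quick worked example, the number of tokens the encoder sees per image can be computed as below. This is a sketch of the standard ViT patching arithmetic, not code taken from the library, and the helper name `vit_sequence_length` is made up for illustration.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens for a square image: one per patch plus the [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1  # +1 for the [CLS] classification token


print(vit_sequence_length(224, 16))  # 14 * 14 patches + 1 = 197
```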

In [3]:
model_id = "google/vit-base-patch16-224-in21k"

### Preparing & Preprocessing the Dataset

Since we are using a custom dataset, we need to convert it into a `Dataset` instance so that the model can be fine-tuned on it.

In [4]:
# Unzip the dataset
import os
import zipfile
import numpy as np

local_zip = './images.zip'
# the context manager closes the archive even if extraction fails
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
  zip_ref.extractall('.')

In [5]:
import datasets

def create_image_folder_dataset(root_path):
  """creates `Dataset` from image folder structure"""

  # class names are hard-coded and must match the sub-folder names under `root_path`
  _CLASS_NAMES= ['flip', 'notflip']
  # define the `datasets` features
  features=datasets.Features({
                      "img": datasets.Image(),
                      "label": datasets.features.ClassLabel(names=_CLASS_NAMES),
                  })
  # temp list holding datapoints for creation
  img_data_files=[]
  label_data_files=[]
  # load images into list for creation
  for img_class in _CLASS_NAMES:
    for img in os.listdir(os.path.join(root_path,img_class)):
      path_=os.path.join(root_path,img_class,img)
      img_data_files.append(path_)
      label_data_files.append(img_class)
  # create dataset
  ds = datasets.Dataset.from_dict({"img":img_data_files,"label":label_data_files},features=features)
  return ds

In [6]:
ROOT_DIR = './images'
TRAIN_DIR = os.path.join(ROOT_DIR, 'training')
TEST_DIR = os.path.join(ROOT_DIR, 'testing')

train_ds = create_image_folder_dataset(TRAIN_DIR)
test_ds = create_image_folder_dataset(TEST_DIR)
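
Before preprocessing, it can be worth checking how balanced the two classes are, since a skewed split would make plain accuracy harder to interpret. The following is a minimal sketch that assumes the same `flip` / `notflip` sub-folder layout used above; `count_images_per_class` is a hypothetical helper, not part of `datasets`.

```python
import os
from collections import Counter


def count_images_per_class(root_path, class_names=("flip", "notflip")):
    """Count the files in each class sub-folder of `root_path`."""
    counts = Counter()
    for class_name in class_names:
        class_dir = os.path.join(root_path, class_name)
        counts[class_name] = len(os.listdir(class_dir))
    return counts


# Example usage with the directories defined in the cell above:
# print(count_images_per_class(TRAIN_DIR))
# print(count_images_per_class(TEST_DIR))
```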

#### Image Processing

In [7]:
from transformers import ViTFeatureExtractor
import tensorflow as tf

# Set up GPU
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

feature_extractor = ViTFeatureExtractor.from_pretrained(model_id)

# learn more about data augmentation here: https://www.tensorflow.org/tutorials/images/data_augmentation
# note: this pipeline is defined here but is not applied in `process` below
data_augmentation = tf.keras.Sequential(
    [
        tf.keras.layers.Resizing(720, 720),
        tf.keras.layers.Rescaling(1./255),
    ],
    name="data_augmentation",
)
# preprocess a batch of images with the ViT feature extractor (adds `pixel_values`)
def process(examples):
    examples.update(feature_extractor(examples['img']))
    return examples

Found GPU at: /device:GPU:0
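
For intuition about what the feature extractor does to each pixel, here is a small sketch. It assumes the checkpoint's preprocessing config rescales pixel values by 1/255 and normalizes with a per-channel mean and standard deviation of 0.5; `feature_extractor.image_mean` and `feature_extractor.image_std` can be inspected to confirm this. The helper `normalize_pixel` is illustrative only and works on a single value, whereas the real extractor also resizes each image to 224x224.

```python
def normalize_pixel(value, mean=0.5, std=0.5, scale=255.0):
    """Map a raw 0-255 pixel value to the normalized range fed to the model.

    The default mean/std/scale values are assumptions about the checkpoint config.
    """
    return (value / scale - mean) / std


print(normalize_pixel(0))    # darkest pixel
print(normalize_pixel(255))  # brightest pixel
```

Under these assumptions, raw pixel values in [0, 255] end up in [-1, 1].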




In [8]:
# we are also renaming our label col to labels to use `.to_tf_dataset` later
train_ds = train_ds.rename_column("label", "labels")
test_ds = test_ds.rename_column("label", "labels")

train_ds = train_ds.map(process, batched=True, batch_size=8)
test_ds = test_ds.map(process, batched=True, batch_size=8)

Map:   0%|          | 0/2392 [00:00<?, ? examples/s]

Map:   0%|          | 0/597 [00:00<?, ? examples/s]

Now that the images have been processed, we need to convert them into TensorFlow datasets to prepare for training.

In [9]:
from huggingface_hub import HfFolder
import tensorflow as tf

img_class_labels = ['flip', 'notflip']

id2label = {str(i): label for i, label in enumerate(img_class_labels)}
label2id = {v: k for k, v in id2label.items()}

num_train_epochs = 15
train_batch_size = 8
eval_batch_size = 8
learning_rate = 3e-5
weight_decay_rate=0.01
num_warmup_steps=0
output_dir='MReader'
hub_token = HfFolder.get_token() # or your token directly "hf_xxx"
hub_model_id = f'{model_id.split("/")[1]}-MR'
fp16=True

# Train in mixed-precision float16
if fp16:
  tf.keras.mixed_precision.set_global_policy("mixed_float16")

In [10]:
from transformers import DefaultDataCollator

# Data collator that stacks the preprocessed examples into batches of TensorFlow tensors
# (no padding is needed because every image has the same shape after preprocessing)
data_collator = DefaultDataCollator(return_tensors="tf")

# converting our train dataset to tf.data.Dataset
tf_train_dataset = train_ds.to_tf_dataset(
   columns=['pixel_values'],
   label_cols=["labels"],
   shuffle=True,
   batch_size=train_batch_size,
   collate_fn=data_collator)

# converting our test dataset to tf.data.Dataset
tf_eval_dataset = test_ds.to_tf_dataset(
   columns=['pixel_values'],
   label_cols=["labels"],
   shuffle=False,  # keep evaluation order fixed so predictions can be matched to labels
   batch_size=eval_batch_size,
   collate_fn=data_collator)

Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 


With the datasets converted into TensorFlow datasets, we can load the pretrained model and compile it for fine-tuning.

In [11]:
from transformers import TFViTForImageClassification, create_optimizer
import tensorflow as tf

# create optimizer with weight decay
num_train_steps = len(tf_train_dataset) * num_train_epochs
optimizer, lr_schedule = create_optimizer(
    init_lr=learning_rate,
    num_train_steps=num_train_steps,
    weight_decay_rate=weight_decay_rate,
    num_warmup_steps=num_warmup_steps,
)

# load pre-trained ViT model
model = TFViTForImageClassification.from_pretrained(
    model_id,
    num_labels=len(img_class_labels),
    id2label=id2label,
    label2id=label2id,
)

# define loss
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# define metrics
metrics=[
    tf.keras.metrics.SparseCategoricalAccuracy(name="accuracy"),
    # note: with only two classes, top-3 accuracy is always 1.0 and carries no information
    tf.keras.metrics.SparseTopKCategoricalAccuracy(3, name="top-3-accuracy"),
]

# compile model
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=metrics
              )


Some layers from the model checkpoint at google/vit-base-patch16-224-in21k were not used when initializing TFViTForImageClassification: ['vit/pooler/dense/kernel:0', 'vit/pooler/dense/bias:0']
- This IS expected if you are initializing TFViTForImageClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFViTForImageClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
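
To get a sense of the learning-rate schedule behind the optimizer, here is a minimal sketch. It assumes that `create_optimizer` with `num_warmup_steps=0` decays the learning rate linearly from `init_lr` to 0 over `num_train_steps` (its default behavior as far as we are aware), and it uses the 2,392 training examples reported by the `Map` progress output. `linear_decay_lr` is a hypothetical helper, not a library function.

```python
import math

num_train_examples = 2392  # taken from the Map progress output
train_batch_size = 8
num_train_epochs = 15
learning_rate = 3e-5

steps_per_epoch = math.ceil(num_train_examples / train_batch_size)
num_train_steps = steps_per_epoch * num_train_epochs


def linear_decay_lr(step, init_lr=learning_rate, total_steps=num_train_steps):
    """Learning rate after `step` optimizer updates, assuming linear decay to zero."""
    step = min(step, total_steps)
    return init_lr * (1 - step / total_steps)


print(steps_per_epoch, num_train_steps)        # batches per epoch, total updates
print(linear_decay_lr(0))                      # rate at the first step
print(linear_decay_lr(4 * steps_per_epoch))    # rate after four full epochs
```

Because the schedule is sized for all 15 epochs, a run that is cut short by early stopping ends while the learning rate is still well above zero.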


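We add callbacks to log training to TensorBoard, to stop training early once the validation accuracy stops improving (which limits overfitting), and to push the model to the Hub.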

In [12]:
import os
from transformers.keras_callbacks import PushToHubCallback
from keras.callbacks import TensorBoard as TensorboardCallback, EarlyStopping

callbacks=[]

callbacks.append(TensorboardCallback(log_dir=os.path.join(output_dir,"logs")))
callbacks.append(EarlyStopping(monitor="val_accuracy",patience=1))
if hub_token:
  callbacks.append(PushToHubCallback(output_dir=output_dir,
                                     hub_model_id=hub_model_id,
                                     hub_token=hub_token))

/content/MReader is already a clone of https://huggingface.co/XO-Appleton/vit-base-patch16-224-in21k-MR. Make sure you pull the latest changes with `repo.git_pull()`.
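
The effect of `patience=1` on a maximized metric such as `val_accuracy` can be sketched in plain Python. This is a simplified model of the patience rule, not the Keras implementation (it ignores options such as `min_delta`), and the accuracy values are made up for illustration.

```python
def early_stopping_epoch(val_scores, patience=1):
    """Return the 1-based epoch at which training would stop, or None if it never does.

    Training stops once the monitored score has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation accuracies per epoch (not results from this notebook)
print(early_stopping_epoch([0.970, 0.990, 0.998, 0.998]))
```

With `patience=1`, a single epoch without improvement is enough to end training, so one noisy epoch can stop the run early; a larger patience trades extra training time for more tolerance.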


In [13]:
with tf.device('/device:GPU:0'):

  train_results = model.fit(
      tf_train_dataset,
      validation_data=tf_eval_dataset,
      callbacks=callbacks,
      epochs=num_train_epochs,
  )

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15


### Model Evaluation

In [14]:
model.evaluate(tf_eval_dataset)



[0.012616408057510853, 0.9983249306678772, 1.0]

Our model achieved 99.83% accuracy on the testing dataset. Let's dive further and examine the ROC and F1-score of the model.

In [15]:
logits = model.predict(tf_eval_dataset, verbose=1)['logits']
preds = np.argmax(logits, axis=1)
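
Given predictions like these, the F1-score and ROC AUC can be computed once the true labels are available in the same order as the predictions, which requires iterating over an unshuffled evaluation dataset. The following dependency-free sketch shows the calculation on toy data; the helper names and values are made up for illustration, and class index 1 (`notflip`) is treated as the positive class. In practice, `sklearn.metrics.f1_score` and `sklearn.metrics.roc_auc_score` compute the same quantities.

```python
import math


def positive_probability(logit_pair):
    """Softmax probability of class 1 for a two-class logit pair."""
    l0, l1 = logit_pair
    return 1.0 / (1.0 + math.exp(l0 - l1))


def f1_score_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc_binary(y_true, scores, positive=1):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration, not results from this notebook)
toy_logits = [(2.0, -1.0), (0.5, 1.5), (-1.0, 3.0), (1.0, 0.0), (0.0, 0.5)]
toy_labels = [0, 1, 1, 1, 0]
toy_preds = [int(l1 > l0) for l0, l1 in toy_logits]
toy_scores = [positive_probability(pair) for pair in toy_logits]

print(f1_score_binary(toy_labels, toy_preds))
print(roc_auc_binary(toy_labels, toy_scores))
```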



The model is deployed at [my HuggingFace Space](https://huggingface.co/spaces/XO-Appleton/XO-Appleton-vit-base-patch16-224-in21k-MR). We tested it with images from this project's existing data and with random images downloaded from the internet. The model achieves high accuracy on the existing data but struggles with images from other sources, where it tends to predict most pages as flipping. This shows that the model is still limited to the specific source it was trained on, and we probably need to add more image sources of flipping and non-flipping pages.

## Save the Model

In [35]:
# Locally
model.save_pretrained("MR_ViT_model")

In [36]:
# To the huggingface hub
from huggingface_hub import HfApi

api = HfApi()

user = api.whoami(hub_token)


feature_extractor.save_pretrained(output_dir)

api.upload_file(
    token=hub_token,
    repo_id=f"{user['name']}/{hub_model_id}",
    path_or_fileobj=os.path.join(output_dir,"preprocessor_config.json"),
    path_in_repo="preprocessor_config.json",
)

'https://huggingface.co/XO-Appleton/vit-base-patch16-224-in21k-MR/blob/main/preprocessor_config.json'