# MonReader_ViT
- Author: Yumo Bai
- Email: baiym104@gmail.com
- Date: May 3

With a functional CNN model in place, we can now use a pretrained Vision Transformer (ViT) model to look for a better solution. We will access the ViT model through the HuggingFace API.

## Package Installation & Setup

In [1]:
!pip install transformers "datasets>=1.17.0" tensorboard --upgrade
!sudo apt-get install git-lfs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting datasets>=1.17.0
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting tensorboard
  Downloading tensorboard-2.13.0-py3-none-any.whl (5.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloadi

In [2]:
# Log into our HuggingFace account to access the models
from huggingface_hub import notebook_login

notebook_login()


In this example, we are going to fine-tune `google/vit-base-patch16-224-in21k`, a Vision Transformer (ViT) pre-trained on ImageNet-21k (14 million images, 21,843 classes) at a resolution of 224x224.

In [3]:
model_id = "google/vit-base-patch16-224-in21k"
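
The model name encodes the input geometry: `patch16` refers to 16x16-pixel patches and `224` to the input resolution. As a rough sketch of that arithmetic (the helper `vit_sequence_length` is hypothetical and not part of any library), the number of tokens the encoder sees for a square image could be computed like this:

```python
def vit_sequence_length(image_size: int, patch_size: int, add_cls_token: bool = True) -> int:
    """Number of tokens a ViT encoder sees for a square image (illustrative helper)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # ViT prepends one classification token to the patch sequence
    return num_patches + (1 if add_cls_token else 0)


tokens_224 = vit_sequence_length(224, 16)  # 14 * 14 patches + 1 classification token
tokens_720 = vit_sequence_length(720, 16)  # 45 * 45 patches + 1 classification token
print(tokens_224, tokens_720)
```

The sequence length grows quadratically with the image side, so feeding images much larger than the pre-training resolution is considerably more expensive and may not match the position embeddings the checkpoint was trained with.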

### Preparing & Preprocessing the Dataset

Since we are using a custom dataset, we need to convert it into a `Dataset` instance so that the model can be fine-tuned on it.

In [4]:
# Unzip the dataset
import os
import zipfile

local_zip = './images.zip'
# the context manager closes the archive even if extraction fails
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall('.')
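
Before building the datasets, it can help to confirm that the archive unpacked into the expected layout. The following is a minimal sketch that runs on a throwaway archive rather than the real `images.zip`; the helper `extract_and_list` and the file names inside the demo archive are hypothetical, while the `training`/`testing` and `flip`/`notflip` folder names follow the ones used in this notebook.

```python
import os
import tempfile
import zipfile


def extract_and_list(zip_path: str, dest_dir: str) -> list:
    """Extract zip_path into dest_dir and return the sorted top-level entries."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))


# Self-contained demo on a temporary archive (a stand-in for ./images.zip)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "images.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("images/training/flip/0001.jpg", b"")
        zf.writestr("images/training/notflip/0002.jpg", b"")
        zf.writestr("images/testing/flip/0003.jpg", b"")
        zf.writestr("images/testing/notflip/0004.jpg", b"")

    out_dir = os.path.join(tmp, "out")
    top_level = extract_and_list(demo_zip, out_dir)
    splits = sorted(os.listdir(os.path.join(out_dir, "images")))
    print(top_level, splits)
```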

In [5]:
import datasets

def create_image_folder_dataset(root_path):
  """creates `Dataset` from image folder structure"""

  # class names; each must match a sub-folder name under `root_path`
  _CLASS_NAMES= ['flip', 'notflip']
  # define the `datasets` features
  features=datasets.Features({
                      "img": datasets.Image(),
                      "label": datasets.features.ClassLabel(names=_CLASS_NAMES),
                  })
  # temp list holding datapoints for creation
  img_data_files=[]
  label_data_files=[]
  # load images into list for creation
  for img_class in _CLASS_NAMES:
    for img in os.listdir(os.path.join(root_path,img_class)):
      path_=os.path.join(root_path,img_class,img)
      img_data_files.append(path_)
      label_data_files.append(img_class)
  # create dataset
  ds = datasets.Dataset.from_dict({"img":img_data_files,"label":label_data_files},features=features)
  return ds
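
To make the folder walk above easier to inspect in isolation, here is a standard-library-only sketch of the same idea. It builds the dictionary of columns that would be handed to `Dataset.from_dict`, with two additions that are assumptions rather than part of the original function: file names are sorted for a reproducible order, and non-image files are skipped. The helper name `collect_image_paths` and the extension list are illustrative.

```python
import os
import tempfile

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")  # assumption: adjust to the dataset's formats


def collect_image_paths(root_path, class_names=("flip", "notflip")):
    """Return parallel lists of image paths and label names in a stable order."""
    paths, labels = [], []
    for class_name in class_names:
        class_dir = os.path.join(root_path, class_name)
        for file_name in sorted(os.listdir(class_dir)):  # sorted for reproducibility
            if file_name.lower().endswith(IMAGE_EXTENSIONS):  # skips files such as .DS_Store
                paths.append(os.path.join(class_dir, file_name))
                labels.append(class_name)
    return {"img": paths, "label": labels}


# Demo on a temporary folder tree with empty placeholder files
with tempfile.TemporaryDirectory() as root:
    demo_files = {"flip": ["b.jpg", "a.jpg", ".DS_Store"], "notflip": ["c.png"]}
    for class_name, files in demo_files.items():
        os.makedirs(os.path.join(root, class_name))
        for file_name in files:
            open(os.path.join(root, class_name, file_name), "w").close()

    columns = collect_image_paths(root)
    demo_names = [os.path.basename(p) for p in columns["img"]]
    demo_labels = columns["label"]
    print(demo_names, demo_labels)
```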

In [6]:
ROOT_DIR = './images'
TRAIN_DIR = os.path.join(ROOT_DIR, 'training')
TEST_DIR = os.path.join(ROOT_DIR, 'testing')

train_ds = create_image_folder_dataset(TRAIN_DIR)
test_ds = create_image_folder_dataset(TEST_DIR)

#### Image Augmentation

In [7]:
from transformers import ViTFeatureExtractor
import tensorflow as tf

feature_extractor = ViTFeatureExtractor.from_pretrained(model_id)

# learn more about data augmentation here: https://www.tensorflow.org/tutorials/images/data_augmentation
data_augmentation = tf.keras.Sequential(
    [
        tf.keras.layers.Resizing(720, 720),
        tf.keras.layers.Rescaling(1./255),
    ],
    name="data_augmentation",
)
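# use Keras image preprocessing layers on each batch of examples
def augmentation(examples):
    # add the feature extractor's outputs (including `pixel_values`) to the batch
    examples.update(feature_extractor(examples['img']))
    # note: this assignment overwrites the `pixel_values` set on the previous line
    examples["pixel_values"] = [data_augmentation(image) for image in examples["img"]]
    return examples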
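
The two preprocessing paths used in this cell need not produce values on the same scale. The sketch below, in plain Python, compares a `Rescaling(1./255)` step with a rescale-then-normalize step. It assumes a per-channel mean and standard deviation of 0.5 for the feature extractor; this is an assumption that can be checked against `feature_extractor.image_mean` and `feature_extractor.image_std`. The function names are illustrative only.

```python
def rescale(pixel: int) -> float:
    """Map an 8-bit pixel value to [0, 1], as a Rescaling(1./255) layer does."""
    return pixel / 255.0


def rescale_and_normalize(pixel: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Rescale to [0, 1], then standardize; mean/std of 0.5 are assumed values."""
    return (rescale(pixel) - mean) / std


for value in (0, 255):
    print(value, rescale(value), rescale_and_normalize(value))
```

Under these assumptions the first path yields values in [0, 1] and the second in [-1, 1], so it is worth deciding deliberately which of the two the model should receive.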


In [8]:
# rename the `label` column to `labels` so that `.to_tf_dataset` can be used later
train_ds = train_ds.rename_column("label", "labels")
test_ds = test_ds.rename_column("label", "labels")

In [None]:
train_ds = train_ds.map(augmentation, batched=True, batch_size=8)
test_ds = test_ds.map(augmentation, batched=True, batch_size=8)

Map:   0%|          | 0/2392 [00:00<?, ? examples/s]
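
With `batched=True`, the mapped function receives a dictionary of lists holding up to `batch_size` examples at a time, which is why `augmentation` indexes `examples["img"]` as a list. The following is a simplified emulation of that slicing in plain Python, not the `datasets` implementation; `iter_batches` and the toy columns are hypothetical.

```python
def iter_batches(columns: dict, batch_size: int):
    """Yield dict-of-lists slices, mimicking the batches passed to a batched map function."""
    num_rows = len(next(iter(columns.values())))
    for start in range(0, num_rows, batch_size):
        yield {name: values[start:start + batch_size] for name, values in columns.items()}


# Toy columns standing in for the real dataset
toy = {
    "img": [f"img_{i}.jpg" for i in range(20)],
    "labels": [i % 2 for i in range(20)],
}
batch_sizes = [len(batch["img"]) for batch in iter_batches(toy, batch_size=8)]
print(batch_sizes)

# Number of calls needed for the 2392 examples shown in the progress bar
num_batches_full = -(-2392 // 8)  # ceiling division
print(num_batches_full)
```

The last batch may be smaller than `batch_size`, so a batched function should not assume a fixed batch length.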

