# Part 1: Downloading Dataset
Download the `ocr-data.zip` file from either Kaggle or Google Drive. This notebook currently uses the Kaggle API. Before running the cells below, upload your `kaggle.json` credentials file to the Colab working directory.

## Method 1: Using Kaggle API
Downloading the dataset through the Kaggle API has been much faster than through the Google Drive API; the cause has not been investigated.

In [27]:
# Install required libraries
!pip install -q kaggle

ERROR: Operation cancelled by user

In [28]:
# create the .kaggle directory in the home directory
!mkdir -p ~/.kaggle
# copy kaggle.json to ~/.kaggle
!cp kaggle.json ~/.kaggle
# change file permission for kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
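
The three shell commands above can also be expressed in Python, which makes a missing `kaggle.json` fail with a clear message instead of a terse `cp` error. This is a minimal sketch: `install_kaggle_credentials` and its `home` parameter are illustrative names, not part of the Kaggle tooling.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(source, home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner.

    `home` defaults to the current user's home directory; it is only a
    parameter so the function can be tried out on a scratch directory.
    """
    source = Path(source)
    if not source.is_file():
        raise FileNotFoundError(f"{source} not found; upload kaggle.json first")
    kaggle_dir = Path(home or Path.home()) / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)   # mkdir -p ~/.kaggle
    target = kaggle_dir / "kaggle.json"
    shutil.copyfile(source, target)                 # cp kaggle.json ~/.kaggle
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)       # chmod 600
    return target
```

Calling `install_kaggle_credentials("kaggle.json")` from the Colab working directory would then replace the cell above.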

In [29]:
# download ocr dataset
!kaggle datasets download -d aidapearson/ocr-data

ocr-data.zip: Skipping, found more recently modified local copy (use --force to force download)


In [30]:
# unzip the retrieved dataset into `raw_train_data`
# -o overwrites existing files without prompting (a prompt would block the notebook), -q suppresses the file listing
!unzip -o -q ocr-data.zip -d raw_train_data
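
When `raw_train_data` already contains files from an earlier run, the extraction can also be done from Python so that existing files are simply skipped, which is the behaviour of `unzip -n`. The following is a sketch using only the standard library; `extract_missing` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_missing(zip_path, dest_dir):
    """Extract only the archive members that do not exist yet under dest_dir.

    Returns the names of the members that were actually extracted.
    """
    dest = Path(dest_dir)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            # skip directory entries and files already present on disk
            if member.is_dir() or (dest / member.filename).exists():
                continue
            archive.extract(member, dest)
            extracted.append(member.filename)
    return extracted
```

A second call on the same archive and directory returns an empty list, so the cell can be re-run safely.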

## Method 2: Using Google Drive API
Downloading through the Google Drive API was slower than through the Kaggle API, so this method is currently not used and the cells below are commented out.

In [31]:

# !pip install google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client

In [32]:
# # Authenticate with service account credentials
# from google.oauth2 import service_account
# from google.auth.transport.requests import Request
# # Build the Drive client and upload/download files
# from googleapiclient.discovery import build
# from googleapiclient.http import MediaFileUpload, MediaIoBaseDownload
# import io

In [33]:
# credentials = service_account.Credentials.from_service_account_file(
#     '/content/project-50021-415714-5d993bed20f3.json',
#     scopes=['https://www.googleapis.com/auth/drive']
# )
# # Authenticate the credentials
# credentials.refresh(Request())
# # Build a Drive service object
# drive_service = build('drive', 'v3', credentials=credentials)

In [34]:
# Example: List files in Drive
# results = drive_service.files().list(pageSize=10).execute()
# items = results.get('files', [])

# if not items:
#     print('No files found.')
# else:
#     print('Files:')
#     for item in items:
#         print(f"{item['name']} ({item['id']})")

In [35]:
# # Upload the local archive to Drive
# file_name = "ocr-data.zip"

# file_metadata = {"name": file_name}
# media = MediaFileUpload(f"/content/{file_name}", mimetype="application/zip")
# uploaded_file = drive_service.files().create(body=file_metadata, media_body=media, fields='id').execute()


In [36]:

# # Download the archive from Drive in chunks
# file_id = "1uiNTe5CL3USzw2N-ShVWyD9IxdUlicYe"
# # pylint: disable=maybe-no-member
# request = drive_service.files().get_media(fileId=file_id)
# with io.FileIO('ocr-data.zip', mode='wb') as fh:
#     downloader = MediaIoBaseDownload(fh, request)
#     done = False
#     while not done:
#         status, done = downloader.next_chunk()
#         print(f"Download {int(status.progress() * 100)}%.")


# Part 2: A custom Dataset object
The dataset object below should be adapted to the model being trained.

### Helper function to return, for each image, the list of bounding box coordinates of its LaTeX characters


In [37]:
import json

def create_bounding_box_labels(input_json_file, box_labels_dir=None):
    """
    Map each image file name to the bounding boxes of its LaTeX characters.

    Returns a dict of the form {"<uuid>.jpg": [[xmin, ymin, xmax, ymax], ...]}.
    `box_labels_dir` is currently unused; it is kept so existing calls still work.
    """
    with open(input_json_file, 'r') as f:
        data = list(json.load(f))
    data.sort(key=lambda x: x["uuid"])
    bounding_box_dict = {}
    for d in data:
        # name of the image file these boxes belong to
        file_name = f"{d['uuid']}.jpg"
        # coordinate lists for this item of the JSON array
        image_data = d["image_data"]
        # one [xmin, ymin, xmax, ymax] list per LaTeX character
        bounding_box_dict[file_name] = [
            list(box) for box in zip(image_data["xmins"], image_data["ymins"],
                                     image_data["xmaxs"], image_data["ymaxs"])
        ]
    return bounding_box_dict

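def set_default(obj):
    """Convert sets to lists; usable as the `default=` hook of `json.dumps`."""
    if isinstance(obj, set):
        return list(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")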
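
`set_default` is not called anywhere in this section. Its shape suggests it is meant as the `default=` hook of `json.dumps`, which is invoked for any object the encoder cannot serialise on its own. The sketch below shows that use on made-up data; the `labels` dictionary is purely illustrative.

```python
import json


def set_default(obj):
    """Convert sets to lists so that json.dumps can serialise them."""
    if isinstance(obj, set):
        return list(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Hypothetical label record containing a set, which json cannot encode directly.
labels = {"a.jpg": {"boxes": [[0, 0, 10, 10]], "classes": {"latex"}}}

# Without default=set_default this call raises TypeError.
encoded = json.dumps(labels, default=set_default)
decoded = json.loads(encoded)
```

After the round trip the set comes back as a list, so code reading the file should not rely on set semantics.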

### Dataset object for the model to be trained with.

Currently `MathDataset` only handles two class labels:
- a LaTeX object
- not a LaTeX object

TODO

Add more classes such as:
- number
- operator ($+$, $-$, $\div$, $\times$, etc.)
- symbol (arrow, fraction, parenthesis)
- functions ($\lim$, $\tan$, $\sin$, $\cos$, etc.)
- mathematical variables ($x$, $y$, $z$, $\alpha$, $\beta$, etc.)
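
One way the extra classes could be encoded is a lookup from a LaTeX token to an integer class id. The sketch below assumes that each bounding box comes with its LaTeX token, which the code in this notebook does not yet read from the JSON. The class ids, the token sets, and the name `classify_token` are all hypothetical. Id 0 is left free for the background class, the convention used by torchvision's detection models.

```python
# Hypothetical class ids; 0 is reserved for "background / not a LaTeX object".
CLASS_IDS = {"number": 1, "operator": 2, "symbol": 3, "function": 4, "variable": 5}

# Small illustrative token sets, not an exhaustive vocabulary.
OPERATORS = {"+", "-", "=", "\\div", "\\times"}
FUNCTIONS = {"\\lim", "\\tan", "\\sin", "\\cos"}
SYMBOLS = {"\\rightarrow", "\\frac", "(", ")"}


def classify_token(token):
    """Return the class id for a single LaTeX token."""
    if token.isdigit():
        return CLASS_IDS["number"]
    if token in OPERATORS:
        return CLASS_IDS["operator"]
    if token in FUNCTIONS:
        return CLASS_IDS["function"]
    if token in SYMBOLS:
        return CLASS_IDS["symbol"]
    # everything else (x, y, \alpha, ...) is treated as a variable
    return CLASS_IDS["variable"]
```

With such a mapping, the `labels` tensor would hold `classify_token(...)` for each box instead of a constant 1.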

In [46]:
import os

import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class MathDataset(Dataset):
    """
    Dataset object for a single batch in the dataset
    """
    def __init__(self, batch_number):
        self.batch_dir = f"raw_train_data/batch_{batch_number}/background_images"
        self.file_names = sorted([filename
                                  for dirname, _, filenames in os.walk(self.batch_dir)
                                  for filename in filenames])
        self.no_of_files = len(self.file_names)

        training_label_file_name = f"raw_train_data/batch_{batch_number}/JSON/kaggle_data_{batch_number}.json"
        bounding_box_labels_dir = f"content/output/bounding_box_labels/batch_{batch_number}"
        self.bounding_box_dict = create_bounding_box_labels(training_label_file_name, bounding_box_labels_dir)

        # preprocessing applied to every binary image
        self.process = transforms.Compose([
            transforms.PILToTensor(),  # convert it to a tensor
            transforms.Resize((600, 600), antialias=True)  # resize it to 600 x 600
        ])

    def __getitem__(self, idx):
        """
        each item is a tuple of (image: Tensor, target: dict {boxes: Tensor[N, 4], labels: Tensor[N]})
        """
        file_name = self.file_names[idx]
        image = Image.open(f"{self.batch_dir}/{file_name}")  # open colour image
        # convert colour image to black and white image
        # (convert_image_to_binary is not defined in this section; it must be defined earlier in the notebook)
        binary_image = convert_image_to_binary(image, thresh=127)
        # apply preprocessing to the binary_image
        final_image = self.process(binary_image).float()
        # create target object for training
        target = {}
        # "boxes" holds one bounding box per detected object. Each bounding box is just a list of
        # (xmin, ymin, xmax, ymax)
        # NOTE: the boxes are used exactly as stored in the JSON; they are not rescaled to the 600 x 600 image.
        target["boxes"] = torch.tensor(self.bounding_box_dict[file_name], dtype=torch.float32)
        # "labels" is a list of class labels for each bounding box in "boxes".
        # TODO: Currently every label is just 1, but once we have more classes, the "labels" entry of target will have a larger domain of possible values.
        target["labels"] = torch.ones(len(target["boxes"]), dtype=torch.int64)
        return final_image, target

    def __len__(self):
        return self.no_of_files
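
`__getitem__` resizes every image to 600 x 600 but passes the bounding boxes through unchanged. If the coordinates in the JSON are pixel positions in the original image (an assumption; they could also be normalised to the range 0 to 1), they need the same scaling as the image. A minimal sketch of that step in plain Python, where `rescale_boxes` is a hypothetical helper:

```python
def rescale_boxes(boxes, original_size, new_size=(600, 600)):
    """Scale [xmin, ymin, xmax, ymax] boxes from original_size to new_size.

    Both sizes are (width, height), matching PIL's Image.size.
    Assumes the input boxes are pixel coordinates in the original image.
    """
    orig_w, orig_h = original_size
    new_w, new_h = new_size
    scale_x = new_w / orig_w
    scale_y = new_h / orig_h
    return [[xmin * scale_x, ymin * scale_y, xmax * scale_x, ymax * scale_y]
            for xmin, ymin, xmax, ymax in boxes]
```

Inside `__getitem__` this would be called with `image.size` before the boxes are converted to a tensor.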

# Part 3: Object Detection Model
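TODO:
Code for training an object detection model.
We should try several object detection models.

Candidate models include:
1. Faster R-CNN
2. Mask R-CNN
3. YOLO (You Only Look Once)
4. RetinaNet

Faster R-CNN, Mask R-CNN, and RetinaNet are built into `torchvision.models.detection`, along with others such as SSD and FCOS. YOLO is not part of PyTorch or torchvision and would need a third-party implementation. Mask R-CNN also expects a segmentation mask per object, which the current `target` dictionary does not provide.

Then evaluate their performance and choose the best one.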
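
Comparing detectors requires a common measure of how well a predicted box matches a ground-truth box. Standard detection metrics such as mean average precision are built on intersection over union (IoU). The notebook does not yet fix a metric, so the following is only a sketch of the basic building block, with `box_iou` as an illustrative name.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # width and height of the overlap, clamped at zero when the boxes are disjoint
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```

A prediction is commonly counted as correct when its IoU with a ground-truth box exceeds a chosen threshold such as 0.5.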