## Creating Image Datasets with Hugging Face to run tests

Adapted from their [documentation](https://huggingface.co/docs/datasets/en/image_dataset) on building image data sets and a [tutorial](https://huggingface.co/docs/datasets/en/upload_dataset#upload-with-python) on uploading datasets with python.

Outside of Python you will need to:
1. Download images and CSV from Google Drive
2. Review the images to make sure they meet your needs

In this tutorial, you will do the following in python:
1. Check that the meta data is accurate
2. Put them in the proper folder structure for the evaluation dataset
3. Upload them to the Hugging Face Hub


Once all of thiss is done, you will want to create a dataset card as a [markdown file](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md) or via the [web UI](https://huggingface.co/docs/datasets/en/upload_dataset#create-a-dataset-card)

By the end of the tutorial you'll be able to create image data sets on Hugging Face from local images.

## Getting the images on your local drive
For the purposes of this tutorial we are going to assume you have a set of images in a folder on your local drive, or a compressed file that contains images (i.e., a .zip or .tar). To start with you will:

1. Create a folder at `Tutorials/Data/images`
2. Extract the images into the new folder
3. Copy your metadata as `metadata.csv` into the folder with the images. NOTE: you will likely need to rename whatever CSV you have.


Once you have your images in the folder you will be able to move on to using python to verify everything.

## Make sure the metadata matches your data
To make a quality dataset we will want to do some preliminary quality control by making sure that all of the images in the folder have proper information in the metadata file. You can do this manually for small datasets but it will behoove you to develop a reusable function for larger datasets.

In [1]:
#start by importing the libraries we'll need
import pandas as pd
import huggingface_hub
import os

folderPath = "Data/images/"
meta = "metadata.csv"

In [4]:
metadata = pd.read_csv(f"{folderPath + meta}", index_col="file_name")
display(metadata) #Note this will not work for larger datasets

Unnamed: 0_level_0,error_description,answer,level
file_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
IMG-4065.PNG,"Multiply entries, don't add","1, 1, 1, 1",Matrix computation
IMG-4064.PNG,Forgot negative sign,x=-2,Simple algebra
IMG-4063.PNG,Order of operations,97,Simple arithmetic
IMG-4062.PNG,Order of operations,5,Simple arithmetic
IMG-4061.PNG,Matrices not compatible,DNE; does not exist,Matrix computation
IMG-4044.PNG,"Sum numbers, do not mush digits together",30,Simple arithmetic
IMG-4047.PNG,Log law applied incorrectly,ln(45),Algebra
IMG-4048.PNG,Differentiated with respect to wrong variable,0,Calculus
IMG-4049.PNG,Integrated with respect to the wrong variable,xy,Calculus
IMG-4050.PNG,Differentiated with respect to wrong variable,y,Calculus


In [20]:
metadata.to_json("Data/images/metadata.json")

In [None]:
def get_file_names(directory):
    names = []
    for x in os.listdir(directory):
        if x.endswith(".PNG"):
            #insert your code below
            pass
    return names

In [None]:
#Get the file names and then verify that all rows in the metadata are associated with files
image_names = get_file_names(folderPath)
metadata["verified"] = metadata.file_name.apply(lambda x: True if x in image_names else False) 
print(f"Are all of the images in the metadata present? {len(metadata) == metadata.verified.sum()}")

Are all of the images in the metadata present? True


In [58]:
metadata.head()

Unnamed: 0,file_name,error_description,answer,level,verified
0,IMG-4065.PNG,"Multiply entries, don't add","1, 1, 1, 1",Matrix computation,True
1,IMG-4064.PNG,Forgot negative sign,x=-2,Simple algebra,True
2,IMG-4063.PNG,Order of operations,97,Simple arithmetic,True
3,IMG-4062.PNG,Order of operations,5,Simple arithmetic,True
4,IMG-4061.PNG,Matrices not compatible,DNE; does not exist,Matrix computation,True


In [None]:
#Check to make sure all of the images in the folder correspond to a row in the metadata
print(f"Do all of the images in the folder have a row in the metadata.csv? {}") #insert your code in the curly brackets

Do all of the images in the folder have a row in the metadata.csv? True


## Uploading the dataset to Hugging Face Hub
Now that we have made sure all of the images in the metadata are present we will upload the dataset to huggingface.

In [6]:
#import necessary additional library
from datasets import load_dataset, Dataset

Dataset.cleanup_cache_files

dataset = load_dataset("imagefolder", data_dir="Data/images")
#dataset.push_to_hub("butterswords/math_helper_smoke_test", split="train", private=True)

Downloading data: 100%|██████████| 21/21 [00:00<00:00, 349525.33files/s]
Generating train split: 0 examples [00:00, ? examples/s]


DatasetGenerationError: An error occurred while generating the dataset

## Creating your dataset card