![File Corpus](https://raw.githubusercontent.com/arvest-data-in-context/ml-notebooks/refs/heads/main/docs/images/notebooks/files-to-corpus.png)

In this notebook, we shall see how we can turn a folder of files on your computer into a corpus. We will gather the information about each file, create a IIIF Manifest for each file, and then upload them to [Arvest](https://arvest.app).

# 0. Setup

Let's begin by installing and importing all of the different components we will need.

In [None]:
print("Installing and importing packages...")

# Uninstall and reinstall packages for a clean environment
!pip uninstall -q -y arvestapi
!pip uninstall -q -y arvesttools
!pip uninstall -q -y jhutils
!pip uninstall -q -y iiif_prezi3
!pip install -q --disable-pip-version-check git+https://github.com/arvest-data-in-context/arvest-api.git
!pip install -q --disable-pip-version-check git+https://github.com/arvest-data-in-context/arvest-api-tools.git
!pip install -q --disable-pip-version-check git+https://github.com/jdchart/jh-py-utils.git
!pip install -q --disable-pip-version-check git+https://github.com/iiif-prezi/iiif-prezi3.git
!pip install -q --disable-pip-version-check git+https://github.com/distant-viewing/dvt.git

# Import packages
import arvestapi
import arvesttools.manifest_creation
from jhutils.local_files import read_json, write_json, collect_files, get_file_info, read_txt
import jhutils.online_files
from jhutils.misc import print_progress_bar, slugify
from jhutils.html import html_to_png
import os
import iiif_prezi3
import shutil
from PIL import ImageDraw

TEMP_FOLDER = os.path.join(os.getcwd(), "_TEMP")
# Create the temporary working folder if it does not exist yet
os.makedirs(TEMP_FOLDER, exist_ok=True)

print("👍 Ready!")

# 1. Get files and info
To start, let's collect all of the files in a `SOURCE_FOLDER`, and then gather each file's information using the `get_file_info` helper function. The second argument of this function hides private file information when set to `True`.

In [None]:
SOURCE_FOLDER = "/Users/jacob/Documents/sound/projects"

print(f"Collecting info about the files in {SOURCE_FOLDER}...")
file_list = collect_files(SOURCE_FOLDER)
file_info_list = []

for file in file_list:
    file_info_list.append(get_file_info(file, False)) # True hides private information

print(f"👍 Found info about {len(file_info_list)} files!")

To see the results, print a range of the files here:

In [None]:
for file_info in file_info_list[0:1]:
    print(file_info)
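
To give an idea of what each entry contains, here is a minimal sketch of how similar information could be gathered with the standard library alone. The `sketch_file_info()` function is a hypothetical stand-in, not the `jhutils` implementation; its keys are only assumed from the ones this notebook reads later (`basename`, `dir`, `size_bytes`, `mimetype`, `created`, `modified`).

```python
import mimetypes
import os
import tempfile
from datetime import datetime, timezone


def sketch_file_info(path):
    """Hypothetical stand-in for get_file_info(); not the jhutils implementation."""
    stats = os.stat(path)
    mimetype, _encoding = mimetypes.guess_type(path)
    return {
        "basename": os.path.basename(path),
        "dir": os.path.dirname(os.path.abspath(path)),
        "size_bytes": stats.st_size,
        "mimetype": mimetype,  # None when the extension is not recognised
        # st_ctime is the creation time on Windows, but the metadata-change time on most Unix systems
        "created": datetime.fromtimestamp(stats.st_ctime, tz=timezone.utc).isoformat(),
        "modified": datetime.fromtimestamp(stats.st_mtime, tz=timezone.utc).isoformat(),
    }


# Demonstrate on a throwaway file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "notes.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("hello")
    demo_info = sketch_file_info(demo_path)

print(demo_info)
```

The real helper may return more fields or format dates differently, so print an entry from `file_info_list` to check what you actually have.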

# 2. Create media
Next, we will need a media file to upload for each file. For image files, we shall upload the actual image to Arvest, but for other file types, we shall create a small image that gives the file's basic information.
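
The image-or-not decision comes down to the main type of the file's MIME type. As a rough sketch, assuming the MIME type is either `None` or a string such as `image/png`, the check could look like this (`is_image_mimetype()` is a hypothetical helper name, not part of any of the packages above):

```python
import mimetypes


def is_image_mimetype(mimetype):
    """Return True when the MIME type's main type is "image" (hypothetical helper)."""
    if mimetype is None:
        return False
    return str(mimetype).split("/")[0] == "image"


# Guess a MIME type from a few example filenames and classify each one
for name in ["photo.png", "track.wav", "notes.unknownext"]:
    guessed, _encoding = mimetypes.guess_type(name)
    print(name, guessed, is_image_mimetype(guessed))
```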

To create the image, we insert the file info into an HTML template string and pass it to the `html_to_png()` helper, which renders it to a PNG image.

In [None]:
data = {}

for i, file_info in enumerate(file_info_list):
    print_progress_bar(i + 1, len(file_info_list), f"(Creating media for {file_info['basename']})...")

    original_file = os.path.join(file_info['dir'], file_info['basename'])

    # Image files are copied as-is; every other file type gets a generated info image
    mimetype = file_info['mimetype']
    was_image = mimetype is not None and str(mimetype).split('/')[0] == "image"

    if was_image:
        media_file = os.path.join(TEMP_FOLDER, file_info['basename'])
        shutil.copy(original_file, media_file)

        data[original_file] = {
            "media_file" : media_file,
            "was_image" : was_image
        }
    else:
        image_path = os.path.join(TEMP_FOLDER, f"{os.path.splitext(os.path.basename(file_info['basename']))[0]}.png")

        html_template = read_txt(os.path.join(os.getcwd(), "html_template.html"))
        html_template = html_template.replace("&&FILENAME", file_info["basename"])
        html_template = html_template.replace("&&FILESIZE", str(file_info["size_bytes"]))
        html_template = html_template.replace("&&MIMETYPE", str(file_info["mimetype"]))
        html_template = html_template.replace("&&FILEDIR", str(file_info["dir"]))
        html_template = html_template.replace("&&CREATED", str(file_info["created"]))
        html_template = html_template.replace("&&MODIFIED", str(file_info["modified"]))

        # Render the filled-in template to a PNG; the returned value is stored as "media_positions"
        image_positions = await html_to_png(html_template, image_path, wrapper_id = "wrapper", element_ids = ["location_element", "size_element", "mimetype_element", "created_element", "modified_element"])

        data[original_file] = {
            "media_file" : image_path,
            "media_positions" : image_positions,
            "was_image" : was_image
        }

write_json(os.path.join(TEMP_FOLDER, "_media_data.json"), data)
write_json(os.path.join(TEMP_FOLDER, "_file_data.json"), {"files" : file_info_list})
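
The chain of `replace()` calls above can also be written as a small helper. The following is a minimal sketch, not the notebook's method: `fill_template()` is a hypothetical name, the template string is a stand-in (the real `html_template.html` is not shown here), and HTML-escaping the values is an added precaution for filenames that contain characters such as `<` or `&`.

```python
import html


def fill_template(template, values):
    """Replace each &&KEY placeholder with its HTML-escaped value (hypothetical helper)."""
    # Replace longer keys first so a key that is a prefix of another cannot clobber it
    for key in sorted(values, key=len, reverse=True):
        template = template.replace(f"&&{key}", html.escape(str(values[key])))
    return template


# Stand-in template; the real html_template.html is not shown in this notebook
demo_template = '<div id="wrapper"><p id="size_element">&&FILENAME: &&FILESIZE bytes</p></div>'
demo_html = fill_template(demo_template, {"FILENAME": "a<b>.txt", "FILESIZE": 5})
print(demo_html)
```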

# 3. Upload to Arvest
Now we can upload everything to Arvest in the form of media items and IIIF Manifests that include all of the metadata.

First, we need to "connect" to Arvest using the Arvest API package. For this, we need our user email and our password, which we will give to an instance of the `arvestapi.Arvest()` class. For convenience, we've saved ours in a JSON file, whose path is stored in `LOGIN_DATA`.

In [None]:
# First, let's connect to our Arvest account:
LOGIN_DATA = os.path.join(os.getcwd(), "login_private.json")
credentials = read_json(LOGIN_DATA)

ar = arvestapi.Arvest(credentials["email"], credentials["password"])
print(f"👍 Successfully connected to Arvest with \"{ar.profile.name}\"")
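
If you have not created the credentials file yet, the sketch below shows the shape the cell above expects: a JSON object with an `email` key and a `password` key. The `load_credentials()` helper and the placeholder values are hypothetical; the sketch writes to a throwaway folder so that it does not touch your real `login_private.json`.

```python
import json
import os
import tempfile


def load_credentials(path):
    """Read a JSON credentials file and check the two keys the notebook uses."""
    with open(path, "r", encoding="utf-8") as handle:
        credentials = json.load(handle)
    missing = [key for key in ("email", "password") if key not in credentials]
    if missing:
        raise KeyError(f"Missing keys in {path}: {missing}")
    return credentials


# Demonstrate with placeholder values in a throwaway file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "login_private.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        json.dump({"email": "you@example.com", "password": "not-a-real-password"}, handle)
    demo_credentials = load_credentials(demo_path)

print(sorted(demo_credentials))
```

Keeping the file out of version control (for example by listing it in `.gitignore`) helps avoid publishing your password by accident.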

We'll first need to upload all of the images to Arvest. To do this, we'll use the `add_media()` function. We'll keep track of the media items so that we can create our IIIF Manifests from them afterwards.

In [None]:
media_data = read_json(os.path.join(TEMP_FOLDER, "_media_data.json"))
files_data = read_json(os.path.join(TEMP_FOLDER, "_file_data.json"))["files"]
arvest_media_items = {}

for i, file_info in enumerate(files_data):
    print_progress_bar(i + 1, len(files_data), f"(uploading {file_info['basename']})...")

    original_file = os.path.join(file_info['dir'], file_info['basename'])
    media_path = media_data[original_file]["media_file"]

    added_media = ar.add_media(path = media_path)

    added_media.update_title(file_info['basename'])
    added_media.update_description("An item from my file corpus")
    
    media_metadata = added_media.get_metadata()
    media_metadata["creator"] = "Folder to corpus tutorial script"
    media_metadata["identifier"] = "&&FOLDER-TO-CORPUS"
    added_media.update_metadata(media_metadata)

    arvest_media_items[original_file] = added_media

print("👍 Finished uploading to Arvest!")

Now we can create a IIIF Manifest from our media files using the [arvesttools](https://github.com/arvest-data-in-context/arvest-api-tools) `media_to_manifest()` function.

In [None]:
for i, file_info in enumerate(files_data):
    print_progress_bar(i + 1, len(files_data), f"(uploading {file_info['basename']})...")

    original_file = os.path.join(file_info['dir'], file_info['basename'])

    media_item = arvest_media_items[original_file]

    manifest = arvesttools.manifest_creation.media_to_manifest(media_item)

    metadata = []
    for key in file_info:
        metadata.append({
            "label" : {"en" : [f"{key}"]},
            "value" : {"en" : [f"{str(file_info[key])}"]}
        })
    
    manifest.metadata = metadata
    manifest.label = {"en" : [f"{file_info['basename']}"]}


    out_path = os.path.join(TEMP_FOLDER, f"{slugify(file_info['basename'])}-manifest.json")
    write_json(out_path, manifest.dict())
    added_manifest = ar.add_manifest(path = out_path)

    added_manifest.update_title(file_info['basename'])
    added_manifest.update_description("An item from my file corpus")
    if media_item.thumbnail_url is not None:
        added_manifest.update_thumbnail_url(media_item.thumbnail_url)
    
    manifest_metadata = added_manifest.get_metadata()
    manifest_metadata["creator"] = "Folder to corpus tutorial script"
    manifest_metadata["identifier"] = "&&FOLDER-TO-CORPUS"
    added_manifest.update_metadata(manifest_metadata)

print("👍 Finished uploading to Arvest!")

You can now view the Manifests in your Arvest [workspace](https://workspace.arvest.app/).
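
The metadata loop in the cell above builds the label/value structure that IIIF Presentation 3 uses for descriptive metadata: a list of entries, each with a `label` and a `value` given as language maps. As a self-contained sketch (the `info_to_iiif_metadata()` name is hypothetical), the same transformation on a plain dictionary looks like this:

```python
def info_to_iiif_metadata(file_info, language="en"):
    """Turn a flat dict into a list of IIIF label/value metadata entries."""
    return [
        {
            "label": {language: [str(key)]},
            "value": {language: [str(value)]},
        }
        for key, value in file_info.items()
    ]


demo_metadata = info_to_iiif_metadata({"basename": "notes.txt", "size_bytes": 5})
print(demo_metadata)
```

Every value is converted to a string because language-map values are lists of strings.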

# 4. Cleanup
To finish, let's clean up our mess! First, we can delete the temporary folder where the media files and our Manifests were created.

In [None]:
shutil.rmtree(TEMP_FOLDER)
print(f"🗑️ {TEMP_FOLDER} removed!")

And finally, if we want, we can remove the items uploaded to Arvest.

**⚠️ Warning: there's no going back after using the remove function, so be careful! To avoid accidental removal, we've added a `REMOVE` variable that needs to be set to `True` for the code to run.**

In [None]:
REMOVE = False

if REMOVE:
    count = 0
    print("Removing files...")

    # Get all of our manifests:
    all_media = ar.get_manifests()
    
    for i, media_file in enumerate(all_media):
        print_progress_bar(i + 1, len(all_media), f"(Processing file {i + 1}/{len(all_media)})")
        
        # Get the media item's metadata and check if it matches some conditions:
        media_metadata = media_file.get_metadata()
        if media_metadata["creator"] == "Folder to corpus tutorial script" and media_metadata["identifier"] == "&&FOLDER-TO-CORPUS":
            
            # Remove the item:
            media_file.remove()
            count = count + 1

    # Get all of our media files:
    all_media = ar.get_medias()
    
    for i, media_file in enumerate(all_media):
        print_progress_bar(i + 1, len(all_media), f"(Processing file {i + 1}/{len(all_media)})")
        
        # Get the media item's metadata and check if it matches some conditions:
        media_metadata = media_file.get_metadata()
        if media_metadata["creator"] == "Folder to corpus tutorial script" and media_metadata["identifier"] == "&&FOLDER-TO-CORPUS":
            
            # Remove the item:
            media_file.remove()
            count = count + 1

    print(f"🗑️ Removed {count} items (manifests and media files)!")
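
Before setting `REMOVE` to `True`, it can be reassuring to check which items the condition would match. The sketch below is a dry run on plain dictionaries, under the assumption that `get_metadata()` returns a dictionary (as the indexing in the cell above suggests). `TUTORIAL_TAGS` and `is_tutorial_item()` are hypothetical names, and `.get()` is used so that an item without these keys is skipped instead of raising a `KeyError`.

```python
TUTORIAL_TAGS = {
    "creator": "Folder to corpus tutorial script",
    "identifier": "&&FOLDER-TO-CORPUS",
}


def is_tutorial_item(metadata, tags=TUTORIAL_TAGS):
    """Return True when every tag is present in metadata with the expected value."""
    return all(metadata.get(key) == value for key, value in tags.items())


# Plain dicts stand in for the result of get_metadata()
demo_items = [
    {"creator": "Folder to corpus tutorial script", "identifier": "&&FOLDER-TO-CORPUS"},
    {"creator": "Someone else", "identifier": "&&FOLDER-TO-CORPUS"},
    {"title": "No tags at all"},
]
to_remove = [item for item in demo_items if is_tutorial_item(item)]
print(f"{len(to_remove)} of {len(demo_items)} items would be removed")
```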