# Download ORACC Directory ZIP Files (Google Colab version)

Authors: Melinee Her, Tiffany Lee

**Melinee Her** - Summer 2023

Context:
Building on the methods Tiffany set up, my goal is to capture the directories of all ORACC projects and download the corresponding ZIP files.

**Tiffany Lee** - Summer 2022

Context: Last April, I was trying to use utilities from [Computational Assyriology (Compass)](https://github.com/niekveldhuis/compass), a project by Professor Niek Veldhuis, to download datasets from various databases. What I found was that his tool uses the file paths of the [Oracc: The Open Richly Annotated Cuneiform Corpus](http://oracc.museum.upenn.edu/) database to download zip files of the datasets. With that in mind, I needed to produce a list of file paths to automate the process of downloading.

This notebook seeks to explore solutions to produce a list of file paths/URLs and download the ZIP files corresponding to the items in the list.

# Notebook Setup

## Mount Google Drive folder

The code snippet below mounts Google Drive so that we can work with our Drive files through the file browser or the command line. Running it triggers a permissions prompt.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

#Set folder for remote drive
#folder = '/content/drive/My Drive/FactGrid Cuneiform (AWCA)/people/Melinee'
folder = '/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/people/Melinee/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Load Libraries

Next, we load the libraries needed for this notebook. There will also be code prompts below to load libraries in case any user wants to run only a portion of the notebook instead of the whole thing.

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm.auto import tqdm
import os
import ipywidgets as widgets
from zipfile import ZipFile
from urllib.request import urlopen

#added imports
import re

## Create Download Directory
Create a directory called `ORACC_zips`. If the directory already exists, do nothing. We can change this directory according to what is needed.

In [None]:
# os.path.join avoids a doubled slash, since `folder` already ends with '/'
os.makedirs(os.path.join(folder, "ORACC_zips"), exist_ok=True)

# Obtain ZIP File Paths from [Oracc](http://oracc.museum.upenn.edu/) Projects Page

Before we can download anything, we need to know where the project data lives. In the [Compass](https://github.com/niekveldhuis/compass) repository, Professor Veldhuis identified that the URLs of the project ZIP files follow this format: "http://oracc.museum.upenn.edu/json/asbp.zip".
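As a minimal sketch of that pattern, the helper below builds a ZIP URL from a project or subproject path. The names `ORACC_JSON_BASE` and `oracc_zip_url` are hypothetical and only used for illustration; the hyphen substitution for subprojects mirrors what the download code later in this notebook does.

```python
# Base address of the ZIP files, taken from the example URL above
ORACC_JSON_BASE = "http://oracc.museum.upenn.edu/json"

def oracc_zip_url(project_path):
    """Build the ZIP download URL for a project or subproject path (hypothetical helper)."""
    # Subproject separators become hyphens: 'asbp/ninmed' -> 'asbp-ninmed'
    name = project_path.strip("/").replace("/", "-")
    return f"{ORACC_JSON_BASE}/{name}.zip"

print(oracc_zip_url("asbp"))
print(oracc_zip_url("asbp/ninmed"))
```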

I noticed that the links on the [Oracc Project List](http://oracc.museum.upenn.edu/projectlist.html) page followed the pattern in the URL provided above. I thought it would be a good idea to compile a list of project ZIP file URLs by extracting them from the [Oracc Project List](http://oracc.museum.upenn.edu/projectlist.html). To do so, I found a Stack Overflow solution that extracts website file paths with the `requests` and `bs4` libraries ([How to extract URLs from an HTML page in Python - Stack Overflow](https://stackoverflow.com/questions/15517483/how-to-extract-urls-from-an-html-page-in-python)).

Thus, this part seeks to obtain the file paths for the ZIP files of each project linked on the ORACC Project Page to prepare for the downloads. Later, we will use the file paths to prepare formatted download URLs to download the project ZIP files.
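For comparison, here is a sketch of the same idea using only the standard library's `html.parser` instead of string searching. This is not the approach used in this notebook; the class `HrefCollector`, the function `extract_hrefs`, and the sample HTML are all hypothetical and exist only to show how `href` values can be collected from anchor tags without a network call.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_hrefs(html):
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs

# Hand-made sample, loosely modeled on the project list page
sample_html = """
<ul>
  <li><a href="./adsd">ADsD</a></li>
  <li><a href='./adsd/adart1'>ADART 1</a></li>
  <li><a class="ext" href="https://example.org/">external</a></li>
</ul>
"""
print(extract_hrefs(sample_html))
```

A parser-based approach handles single-quoted attributes and attributes that precede `href`, both of which a plain string search for `a href` would miss.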

### Define `getFilePathListFromURL` Method

In [None]:
def getFilePath(page, tofind):
    """
    :param page: html of web page, as a string
    :param tofind: html marker to search for (e.g. 'a href')
    :return url: the first double-quoted value after tofind, or None
    :return end_quote: index of the closing quote (0 if tofind is absent)
    This is a utility method for getFilePathListFromURL().
    """
    start_link = page.find(tofind)
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote

def getFilePathListFromURL(source_url, tofind):
    """
    :param source_url: url of the source web page that we will extract file
                       paths from
    :param tofind: html marker to search for (e.g. 'a href')
    :return path_list: list of file paths in the source_url page
    """
    response = requests.get(source_url)
    # Name the parser explicitly to avoid bs4's "no parser specified" warning
    page = str(BeautifulSoup(response.content, "html.parser"))
    path_list = []
    while True:
        path, n = getFilePath(page, tofind)
        if path is None:   # no more matches on the page
            break
        page = page[n:]
        if path:           # skip empty href values instead of stopping early
            path_list.append(path)
    return path_list

Run the `getFilePathListFromURL` method on the Oracc Project List to retrieve a list of all hyperlinked URL paths.

In [None]:
project_path_list = getFilePathListFromURL("http://oracc.museum.upenn.edu/projectlist.html", 'a href')
project_path_list[:5]



['/',
 './adsd',
 './adsd',
 'https://oeaw.academia.edu/ReinhardPirngruber',
 './adsd/adart1']

## Formatting the File Paths

Now we have a list of every hyperlink on the [Oracc](http://oracc.museum.upenn.edu/) project list page, which includes the project paths we need. However, we will need to do some cleanup and formatting to make it easier for utilities to download the files. Some observations I have made include the following:

(1) The site root (`/`) is listed as the first path, but it is redundant, so I will remove the first item in the list.

(2) Since the download utilities take project names and subproject paths rather than relative paths, I will remove the leading `./` from each path.

(3) While the website addresses containing "http", "https", or "www" may be useful for metadata purposes, I think it's best to filter them out for now.

(4) The list contains duplicates, so I will remove those too.
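The four observations above can be sketched on a small hand-made sample before defining the full method. The function name `clean_paths` is hypothetical, and unlike the method defined below, this sketch drops the root entry by value (`"/"`) rather than by position.

```python
def clean_paths(raw_paths):
    """Apply the four cleanup steps to a list of scraped hrefs (hypothetical helper)."""
    cleaned = []
    for path in raw_paths:
        if path == "/":                         # (1) drop the site root
            continue
        if "http" in path or "www" in path:     # (3) drop external links
            continue
        if path.startswith("./"):               # (2) strip the relative prefix
            path = path[2:]
        cleaned.append(path)
    # (4) remove duplicates while keeping the original order
    return list(dict.fromkeys(cleaned))

sample = [
    "/",
    "./adsd",
    "./adsd",
    "https://oeaw.academia.edu/ReinhardPirngruber",
    "./adsd/adart1",
]
print(clean_paths(sample))
```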

### Define `FormatFilePathListForDownload()` Method

In [None]:
def FormatFilePathListForDownload(project_list):
  """
  :param project_list: list of project file paths
  :return project_list_formatted: list of formatted project file paths
  """
  project_list_formatted = []

  # skip the first item (the site root '/') and drop external hyperlinks
  for path in project_list[1:]:
    if "http" in path or "www" in path:
      continue
    # strip the leading './' only when it is actually present
    if path.startswith("./"):
      path = path[2:]
    project_list_formatted.append(path)

  # remove duplicates while preserving the original order
  project_list_formatted = list(dict.fromkeys(project_list_formatted))
  return project_list_formatted

### Run `FormatFilePathListForDownload()` Method
This cleans up the original list of urls to include only the project and subproject paths.

In [None]:
project_path_list_clean = FormatFilePathListForDownload(project_path_list)
project_path_list_clean

['adsd',
 'adsd/adart1',
 'adsd/adart2',
 'adsd/adart3',
 'adsd/adart5',
 'adsd/adart6',
 'aemw',
 'aemw/alalakh/idrimi',
 'aemw/amarna',
 'akklove',
 'amgg',
 'ario',
 'armep',
 'arrim',
 'asbp',
 'asbp/ninmed',
 'asbp/rlasb',
 'atae',
 'atae/assur',
 'atae/burmarina',
 'atae/durkatlimmu',
 'atae/durszarrukin',
 'atae/guzana',
 'atae/huzirina',
 'atae/imgurenlil',
 'atae/kalhu',
 'atae/kunalia',
 'atae/mallanate',
 'atae/marqasu',
 'atae/nineveh',
 'atae/samal',
 'atae/szibaniba',
 'atae/tilbarsip',
 'atae/tuszhan',
 'babcity',
 'blms',
 'borsippa',
 'btmao',
 'btto',
 'cams',
 'cams/akno',
 'cams/anzu',
 'cams/barutu',
 'cams/etana',
 'cams/gkab',
 'cams/ludlul',
 'cams/selbi',
 'cams/tlab',
 'cdli',
 'ckst',
 'cmawro',
 'cmawro/cmawr1',
 'cmawro/cmawr2',
 'cmawro/cmawr3',
 'cmawro/maqlu',
 'contrib',
 'contrib/amarna',
 'contrib/lambert',
 'ctij',
 'dcclt',
 'dcclt/ebla',
 'dcclt/jena',
 'dcclt/nineveh',
 'dcclt/signlists',
 'dccmt',
 'dsst',
 'ecut',
 'eisl',
 'epsd2',
 'etcsri',
 

### Optional: Concatenate the list to single string instead of a list of strings

If we want to use some of Professor Veldhuis' [Compass](https://github.com/niekveldhuis/compass) utilities, it is helpful to have the path list be a single string.

In [None]:
project_path_list_clean_string = ', '.join(project_path_list_clean)
project_path_list_clean_string


'adsd, adsd/adart1, adsd/adart2, adsd/adart3, adsd/adart5, adsd/adart6, aemw, aemw/alalakh/idrimi, aemw/amarna, akklove, amgg, ario, armep, arrim, asbp, asbp/ninmed, asbp/rlasb, atae, atae/assur, atae/burmarina, atae/durkatlimmu, atae/durszarrukin, atae/guzana, atae/huzirina, atae/imgurenlil, atae/kalhu, atae/kunalia, atae/mallanate, atae/marqasu, atae/nineveh, atae/samal, atae/szibaniba, atae/tilbarsip, atae/tuszhan, babcity, blms, borsippa, btmao, btto, cams, cams/akno, cams/anzu, cams/barutu, cams/etana, cams/gkab, cams/ludlul, cams/selbi, cams/tlab, cdli, ckst, cmawro, cmawro/cmawr1, cmawro/cmawr2, cmawro/cmawr3, cmawro/maqlu, contrib, contrib/amarna, contrib/lambert, ctij, dcclt, dcclt/ebla, dcclt/jena, dcclt/nineveh, dcclt/signlists, dccmt, dsst, ecut, eisl, epsd2, etcsri, glass, hbtin, lacost, lovelyrics, nere, nimrud, obel, obmc, obta, ogsl, oimea, pnao, qcat, riao, ribo, ribo/bab7scores, ribo/babylon10, ribo/babylon2, ribo/babylon3, ribo/babylon4, ribo/babylon5, ribo/babylon6,
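If the string later needs to be turned back into a list, a round trip like the following sketch works. The variable names here are hypothetical, and the assumption that a consumer splits on commas is mine rather than something stated in the Compass documentation.

```python
paths = ["adsd", "adsd/adart1", "aemw/alalakh/idrimi"]

# Join into one comma-separated string
joined = ", ".join(paths)

# Split on commas and trim whitespace to recover the list
restored = [p.strip() for p in joined.split(",") if p.strip()]

print(joined)
print(restored == paths)
```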

# Digging for more corpora
Some projects (e.g., epsd2, tcma) have corpora buried within their respective Oracc project pages.

Note: These were found by already knowing which projects have buried corpora. No single method can be generalized to find them, because each project website differs in its hyperlinks, paths, and style.

In [None]:
buried = ['epsd2/earlylit', 'epsd2/literary', 'epsd2/praxis', 'epsd2/praxis/liturgy', "tcma/ali1",
      'epsd2/admin/ebla', 'epsd2/admin/ed12', 'epsd2/admin/ed3b', 'epsd2/admin/lagash2',
      'epsd2/admin/oakk', 'epsd2/admin/oldbab', 'epsd2/admin/ur3',

      "tcma/amarna","tcma/assur","tcma/barri","tcma/bazmusian","tcma/billa", "tcma/brak","tcma/chuera","tcma/emar",
      "tcma/fekheriye","tcma/giricano","tcma/hana","tcma/haradum","tcma/hatti","tcma/kalhu","tcma/kartn","tcma/kulishinas",
      "tcma/miscellaneous","tcma/nineveh","tcma/nippur","tcma/nuzi","tcma/qitar","tcma/rimah","tcma/suri",
      "tcma/taban","tcma/tsa1","tcma/tsh1","tcma/ugarit"]

In [None]:
final = project_path_list_clean + buried
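Plain concatenation keeps any path that appears in both lists twice. As a hedged alternative, the sketch below merges lists while dropping repeats and preserving order; `merge_unique` and the two sample lists are hypothetical.

```python
def merge_unique(*path_lists):
    """Concatenate path lists, keeping only the first occurrence of each path."""
    merged = []
    for paths in path_lists:
        merged.extend(paths)
    # dict keys keep insertion order, so this is an order-preserving dedupe
    return list(dict.fromkeys(merged))

scraped = ["epsd2", "tcma", "epsd2/literary"]
extra = ["epsd2/literary", "epsd2/praxis", "tcma/assur"]
print(merge_unique(scraped, extra))
```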

# Download Method 1: Use [Compass](https://github.com/niekveldhuis/compass) code


First, let's try using the Compass method of downloading [Oracc](http://oracc.museum.upenn.edu/) project files.


Quoted from [compass/2_1_0_download_ORACC-JSON.ipynb](https://github.com/niekveldhuis/compass/blob/master/2_1_Data_Acquisition_ORACC/2_1_0_download_ORACC-JSON.ipynb):

> For larger projects (such as [DCCLT](http://oracc.org/dcclt)) the `zip` file may be 25Mb or more. Downloading may take some time and it may be necessary to chunk the downloading process. The `iter_content()` function in the `requests` library takes care of that.

> In order to show a progress bar (with `tqdm`) we need to know how large the file to be downloaded is (this value is is then fed to the `total` parameter). The http protocol provides a key `content-length` in the headers (a dictionary) that indicates file length. Not all servers provide this field - if `content-length` is not avalaible it is set to 0. With the `total` value of 0 `tqdm` will show a bar and will count the number of chunks received, but it will not indicate the degree of progress.
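The two ideas in the quote, reading `content-length` with a fallback of 0 and writing the body chunk by chunk, can be sketched without a network connection. Here `io.BytesIO` stands in for the HTTP response body and a plain dictionary stands in for the response headers; `expected_size` and `save_in_chunks` are hypothetical names, and the actual download method below uses `requests` and `tqdm` instead.

```python
import io

def expected_size(headers):
    """Return the declared body size in bytes, or 0 if it is missing or malformed."""
    try:
        return int(headers.get("content-length", 0))
    except (TypeError, ValueError):
        return 0

def save_in_chunks(stream, out_file, chunk_size=1024):
    """Copy stream to out_file one chunk at a time; return the bytes written."""
    written = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # an empty read means the stream is exhausted
            break
        out_file.write(chunk)
        written += len(chunk)  # this is the value a progress bar would be fed
    return written

# Stand-ins for a real response: 2500 bytes arrive as chunks of 1024, 1024, 452
fake_body = b"x" * 2500
headers = {"content-length": str(len(fake_body))}
buffer = io.BytesIO()
written = save_in_chunks(io.BytesIO(fake_body), buffer, chunk_size=1024)
print(expected_size(headers), written)
```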





## Define `downloadProjectsWithCompass()` Method

In [None]:
import requests
from tqdm.auto import tqdm

In [None]:
def downloadProjectsWithCompass(project_list, destination):
  """
  :param project_list: list of project file paths
  :param destination: destination directory to put the downloaded files in
  :return download_url_list: list of formatted download links of project ZIP
                             files that were successfully downloaded
  :return url_not_found_list: list of formatted download links of project ZIP
                              files that failed to download
  """
  CHUNK = 1024
  download_url_list = []
  url_not_found_list = []
  for project in project_list:
      # subproject separators become hyphens in the ZIP file name
      proj = project.replace('/', '-')
      url = f"http://oracc.museum.upenn.edu/json/{proj}.zip"
      file = f'{destination}/{proj}.zip'
      with requests.get(url, stream=True) as request:
          if request.status_code == 200:   # meaning that the file exists
              total_size = int(request.headers.get('content-length', 0))
              tqdm.write(f'Saving {url} as {file}')
              # the context managers close the progress bar and the file when done
              with tqdm(total=total_size, unit='B', unit_scale=True, desc=project) as t, open(file, 'wb') as f:
                  for c in request.iter_content(chunk_size=CHUNK):
                      f.write(c)
                      t.update(len(c))
              download_url_list.append(url)
          else:
              tqdm.write(f"WARNING: {url} does not exist.")
              url_not_found_list.append(url)
  return download_url_list, url_not_found_list

## Run `downloadProjectsWithCompass()` Method

Set the destination directory and run the `downloadProjectsWithCompass()` method with the necessary parameters.

In [None]:
destination = folder + 'ORACC_zips'
# `final` combines the scraped project paths with the buried corpora
download_url_list, url_not_found_list = downloadProjectsWithCompass(final, destination)

In [None]:
destination = folder + 'ORACC_zips'
download_url_list, url_not_found_list = downloadProjectsWithCompass(buried, destination)

## Examine Downloaded Files in Destination Directory

Use bash `ls` command to check the contents of the directory after downloading.

In [None]:
!ls "{folder}ORACC_zips"
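Beyond eyeballing the `ls` output, a short check can report which expected ZIP files are absent from the destination directory. This is a sketch under the assumption that files are named with the same hyphen substitution used by the download method; `find_missing_zips` is a hypothetical helper, and the demo runs against a temporary directory rather than Google Drive.

```python
import os
import tempfile

def find_missing_zips(project_list, destination):
    """Return the project paths whose ZIP file is not present in destination."""
    missing = []
    for project in project_list:
        filename = project.replace("/", "-") + ".zip"
        if not os.path.isfile(os.path.join(destination, filename)):
            missing.append(project)
    return missing

# Demo: only adsd.zip exists, so adsd/adart1 is reported as missing
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "adsd.zip"), "wb").close()
    demo_missing = find_missing_zips(["adsd", "adsd/adart1"], tmp)
print(demo_missing)
```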

Examine the successful download URLs.

In [None]:
download_url_list

Examine the failed download URLs.

In [None]:
url_not_found_list

# Download Method 2: Download using Stack Overflow code

This method reuses modified [Compass](https://github.com/niekveldhuis/compass) code from the method above to build the download links for the project ZIP files, then downloads each link with a separate helper.

I found a solution from Stack Overflow using the `os` and `requests` libraries to perform this process: [Download file from URL and save it in a folder Python - Stack Overflow](https://stackoverflow.com/questions/56950987/download-file-from-url-and-save-it-in-a-folder-python). In my experience it runs *slower* outside of Google Colab, though I have not determined why.

### Define `downloadProjectsWithStackOverflow()` Method and associated utility methods

In [None]:
import os
from tqdm.auto import tqdm
import requests

In [None]:
def getDownloadURL(file_path):
  """
  :param file_path: a project file path
  :return url: a formatted download URL
  :return url_exists: True if a url exists, False if a url doesn't exist
  This is a utility method for getDownloadURLList() to prepare a formatted
  download URL.
  """
  file = file_path.replace('/', '-')
  url = f"http://oracc.museum.upenn.edu/json/{file}.zip"
  # stream=True lets us read the status code without fetching the whole body
  with requests.get(url, stream=True) as request:
    url_exists = request.status_code == 200   # 200 means that the file exists
  if url_exists:
    print(f"SUCCESS: {url} exists.")
  else:
    print(f"WARNING: {url} does not exist.")
  return url, url_exists

In [None]:
def getDownloadURLList(file_path_list):
  """
  :param file_path_list: list of file paths
  :return download_url_list: list of formatted download links of project ZIP
                             files that exist on the server
  :return url_not_found_list: list of formatted download links of project ZIP
                              files that were not found on the server
  This is a utility method for downloadProjectsWithStackOverflow() to
  prepare a list of URLs to download the project ZIP files.
  """
  download_url_list = []
  url_not_found_list = []
  for file_path in file_path_list:
    url, url_exists = getDownloadURL(file_path)
    if url_exists:
      download_url_list.append(url)
    else:
      url_not_found_list.append(url)
  return download_url_list, url_not_found_list

In [None]:
def downloadFile(url: str, destination: str):
    """
    :param url: URL of file to download
    :param destination: destination directory to put the downloaded files in
    This is a utility method for downloadProjectsWithStackOverflow() to
    download a single file given a URL and a destination directory path.
    The destination directory is assumed to exist already.
    """
    filename = url.split('/')[-1].replace(" ", "_")  # be careful with file names
    file_path = os.path.join(destination, filename)

    # the context manager releases the connection once the download finishes
    with requests.get(url, stream=True) as r:
        if r.ok:
            print("saving to", os.path.abspath(file_path))
            with open(file_path, 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024 * 8):
                    if chunk:
                        f.write(chunk)
                        f.flush()
                        os.fsync(f.fileno())
        else:  # HTTP status code 4XX/5XX
            print("Download failed: status code {}\n{}".format(r.status_code, r.text))

In [None]:
def downloadProjectsWithStackOverflow(project_list, destination):
  """
  :param project_list: list of project file paths
  :param destination: destination directory to put the downloaded files in
  :return download_url_list: list of formatted download links of project ZIP
                             files that were found and downloaded
  :return url_not_found_list: list of formatted download links of project ZIP
                              files that were not found
  """
  download_url_list, url_not_found_list = getDownloadURLList(project_list)
  for url in download_url_list:
    downloadFile(url, destination)
  return download_url_list, url_not_found_list

## Run `downloadProjectsWithStackOverflow()` Method

Use bash `ls` command to check the contents of the destination directory before running the `downloadProjectsWithStackOverflow()` method.

In [None]:
!ls "{folder}ORACC_zips"

In [None]:
destination = folder + "ORACC_zips"
download_url_list, url_not_found_list = downloadProjectsWithStackOverflow(project_path_list_clean, destination)

SUCCESS: http://oracc.museum.upenn.edu/json/adsd.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/adsd-adart1.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/adsd-adart2.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/adsd-adart3.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/adsd-adart5.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/adsd-adart6.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/aemw.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/aemw-alalakh-idrimi.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/aemw-amarna.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/akklove.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/amgg.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/ario.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/armep.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/arrim.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/asbp.zip exists.
SUCCESS: http://oracc.m

## Examine the Downloaded Files in the Destination Directory

Use bash `ls` command to check if we downloaded the files we needed.

In [None]:
!ls "{folder}ORACC_zips"
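A file appearing in the listing does not guarantee that it is a usable archive, since an interrupted download or an error page saved under a `.zip` name would also show up. The sketch below uses the standard `zipfile` module to check each file; `check_zip` is a hypothetical helper, and the demo files (including the member name `catalogue.json`) are made up for illustration in a temporary directory.

```python
import os
import tempfile
import zipfile

def check_zip(path):
    """Return True if path is a readable ZIP archive whose members pass a CRC check."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as zf:
        # testzip() returns the name of the first corrupt member, or None
        return zf.testzip() is None

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.zip")
    bad = os.path.join(tmp, "bad.zip")
    # A valid archive with one made-up member
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("catalogue.json", "{}")
    # An HTML error page saved under a .zip name
    with open(bad, "wb") as f:
        f.write(b"<html>not a zip</html>")
    results = {"good": check_zip(good), "bad": check_zip(bad)}
print(results)
```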

Examine the successful download URLs.

In [None]:
download_url_list

Examine the failed download URLs.

In [None]:
url_not_found_list