# Download ORACC Directory ZIP Files (Google Colab version)

## Author: Tiffany Lee

Context: Last April, I was trying to use utilities from [Computational Assyriology (Compass)](https://github.com/niekveldhuis/compass), a project by Professor Niek Veldhuis, to download datasets from various databases. What I found was that his tool uses the file paths of the [Oracc: The Open Richly Annotated Cuneiform Corpus](http://oracc.museum.upenn.edu/) database to download zip files of the datasets. With that in mind, I needed to produce a list of file paths to automate the process of downloading. 

This notebook seeks to explore solutions to produce a list of file paths/URLs and download the ZIP files corresponding to the items in the list.

# Notebook Setup

Use bash `pwd` command to examine the current directory.

In [1]:
!pwd

/home/tiffany/awca/tablet_zip/download_oracc


## Load Libraries

Next, we load the libraries needed for this notebook. There will also be code prompts below to load libraries in case any user wants to run only a portion of the notebook instead of the whole thing.

In [2]:
import requests
from bs4 import BeautifulSoup
from tqdm.auto import tqdm
import os
import ipywidgets as widgets
from zipfile import ZipFile

## Create Download Directory
Create a directory called `ORACC`. If the directory already exists, do nothing. We can change this directory according to what is needed.

In [3]:
import os

In [5]:
os.makedirs("oracc", exist_ok = True)

# Obtain Zip File File Paths from [Oracc](http://oracc.museum.upenn.edu/) Projects Page

Before we can download, we need to know where to download the data for the projects. In the [Compass](https://github.com/niekveldhuis/compass) repository, Professor Veldhuis identified the pattern of the URLs corresponding to the ZIP files for the project data used the following format: "http://oracc.museum.upenn.edu/json/asbp.zip".

I noticed that the links on the [Oracc Project List](http://oracc.museum.upenn.edu/projectlist.html) page followed the pattern in the URL provided above. I thought it would be a good idea to compile a list of project ZIP file URLs by extracting it from [Oracc Project List](http://oracc.museum.upenn.edu/projectlist.html). To do so, I found a Stack Overflow solution to extract website file paths with the `requests` and `bs4` libraries ([How to extract URLs from an HTML page in Python - Stack Overflow](https://stackoverflow.com/questions/15517483/how-to-extract-urls-from-an-html-page-in-python)).

Thus, this part seeks to obtain the file paths for the ZIP files of each project linked on the ORACC Project Page to prepare for the downloads. Later, we will use the file paths to prepare formatted download URLs to download the project ZIP files.

In [6]:
import requests
from bs4 import BeautifulSoup

### Define `getFilePathListFromURL` Method

In [7]:
def getFilePath(page):
    """
    :param page: html of web page 
    :return urls or file paths in that page 
    This is a utility method for getFilePathListFromHTMLPage().
    """
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote

def getFilePathListFromURL(source_url):
    """
    :param source_url: url of the source web page that we will extract file 
                       paths from 
    :return path_list: list of file paths in the source_url page 
    """
    response = requests.get(source_url)
    page = str(BeautifulSoup(response.content))
    path_list = []
    while True:
        path, n = getFilePath(page)
        page = page[n:]
        if path:
            path_list.append(path)
        else:
            break
    return path_list

Run `getFilePathListFromURL` Method

In [8]:
project_path_list = getFilePathListFromURL(source_url = "http://oracc.museum.upenn.edu/projectlist.html")
project_path_list

['/',
 './adsd',
 './adsd',
 'https://oeaw.academia.edu/ReinhardPirngruber',
 './adsd/adart1',
 './adsd/adart1',
 './adsd/adart2',
 './adsd/adart2',
 './adsd/adart3',
 './adsd/adart3',
 './adsd/adart6',
 './adsd/adart6',
 './aemw',
 './aemw',
 './aemw/alalakh/idrimi',
 './aemw/alalakh/idrimi',
 'http://neareast.jhu.edu/bios/jacob-lauinger/',
 './aemw/amarna',
 './aemw/amarna',
 './akklove',
 './akklove',
 'https://www.harrassowitz-verlag.de/title_961.ahtml',
 './amgg',
 './amgg',
 './ario',
 './ario',
 ' https://www.en.ag.geschichte.uni-muenchen.de/staff/staff/heitmann-gordon/index.html',
 'https://www.humboldt-foundation.de/web/home.html',
 './armep',
 './armep',
 'http://www.en.ag.geschichte.uni-muenchen.de/chairs/chair_radner/index.html',
 'https://www.humboldt-foundation.de/web/home.html',
 './arrim',
 './arrim',
 ' https://www.humboldt-foundation.de/web/home.html',
 'http://www.en.ag.geschichte.uni-muenchen.de/chairs/chair_radner/index.html',
 './asbp',
 './asbp',
 './asbp/ninmed'

## Formatting the File Paths

Now we have a list of project ZIP file paths in the [Oracc](http://oracc.museum.upenn.edu/) website. However, we will need to do some clean up and formatting to make it easier for utilties to download the files. Some observations I have made include the following:

(1) I noticed that the project directory itself was listed as part of the paths (`/`) but it is redundant so I will remove the first item in the list.

(2) Since the download widget doesn't take relative paths and only the project names and their corresponding subproject path, I will be removing the `./` part of the paths.

(3) While the website addresses with "http", "https", and "www", may be useful for metadata purposes, I think it's best to filter them out for now. 

(4) Removed duplicates in url_list too.

### Define `FormatFilePathListForDownload()` Method

In [9]:
def FormatFilePathListForDownload(project_list):
  """
  :param project_list: list of project file paths 
  :return project_list_formatted: list of formatted project file paths
  """
  project_list_formatted = project_list[1:] 
  project_list_formatted = [path[2:] for path in project_list_formatted 
                            if ("http" or "www") not in path]  
  project_list_formatted = [url for n, url in enumerate(project_list_formatted) 
                           if url not in project_list_formatted[:n]]
  return project_list_formatted

### Run `FormatFilePathListForDownload()` Method

In [10]:
project_path_list_clean = FormatFilePathListForDownload(project_path_list)
project_path_list_clean

['adsd',
 'adsd/adart1',
 'adsd/adart2',
 'adsd/adart3',
 'adsd/adart6',
 'aemw',
 'aemw/alalakh/idrimi',
 'aemw/amarna',
 'akklove',
 'amgg',
 'ario',
 'armep',
 'arrim',
 'asbp',
 'asbp/ninmed',
 'asbp/rlasb',
 'atae',
 'atae/assur',
 'atae/burmarina',
 'atae/durkatlimmu',
 'atae/durszarrukin',
 'atae/guzana',
 'atae/huzirina',
 'atae/imgurenlil',
 'atae/kalhu',
 'atae/kunalia',
 'atae/mallanate',
 'atae/marqasu',
 'atae/nineveh',
 'atae/samal',
 'atae/szibaniba',
 'atae/tilbarsip',
 'atae/tuszhan',
 'babcity',
 'blms',
 'btto',
 'cams',
 'cams/akno',
 'cams/anzu',
 'cams/barutu',
 'cams/etana',
 'cams/gkab',
 'cams/ludlul',
 'cams/selbi',
 'cams/tlab',
 'cdli',
 'ckst',
 'cmawro',
 'cmawro/cmawr1',
 'cmawro/cmawr2',
 'cmawro/cmawr3',
 'cmawro/maqlu',
 'contrib',
 'contrib/amarna',
 'contrib/lambert',
 'ctij',
 'w.tau.ac.il/humanities/archaeology/ancient_israel.html',
 'dcclt',
 'dcclt/ebla',
 'dcclt/jena',
 'dcclt/nineveh',
 'dcclt/signlists',
 'dccmt',
 'dsst',
 'ecut',
 'epsd2',
 

### Optional: Concatenate the list to single string instead of a list of strings

If we want to use some of Professor Veldhuis' [Compass](https://github.com/niekveldhuis/compass) utilities, it is helpful to have the path list be a single string.

In [11]:
project_path_list_clean_string = ', '.join([url for url in project_path_list_clean])
project_path_list_clean_string

'adsd, adsd/adart1, adsd/adart2, adsd/adart3, adsd/adart6, aemw, aemw/alalakh/idrimi, aemw/amarna, akklove, amgg, ario, armep, arrim, asbp, asbp/ninmed, asbp/rlasb, atae, atae/assur, atae/burmarina, atae/durkatlimmu, atae/durszarrukin, atae/guzana, atae/huzirina, atae/imgurenlil, atae/kalhu, atae/kunalia, atae/mallanate, atae/marqasu, atae/nineveh, atae/samal, atae/szibaniba, atae/tilbarsip, atae/tuszhan, babcity, blms, btto, cams, cams/akno, cams/anzu, cams/barutu, cams/etana, cams/gkab, cams/ludlul, cams/selbi, cams/tlab, cdli, ckst, cmawro, cmawro/cmawr1, cmawro/cmawr2, cmawro/cmawr3, cmawro/maqlu, contrib, contrib/amarna, contrib/lambert, ctij, w.tau.ac.il/humanities/archaeology/ancient_israel.html, dcclt, dcclt/ebla, dcclt/jena, dcclt/nineveh, dcclt/signlists, dccmt, dsst, ecut, epsd2, etcsri, glass, hbtin, lacost, lovelyrics, nere, nimrud, obel, obmc, obta, ogsl, oimea, pnao, qcat, riao, ribo, ribo/bab7scores, ribo/babylon10, ribo/babylon2, ribo/babylon3, ribo/babylon4, ribo/baby

# Download Method 1: Use [Compass](https://github.com/niekveldhuis/compass) code


First, let's try using the Compass method of downloading [Oracc](http://oracc.museum.upenn.edu/) project files.


Quoted from [compass/2_1_0_download_ORACC-JSON.ipynb](https://github.com/niekveldhuis/compass/blob/master/2_1_Data_Acquisition_ORACC/2_1_0_download_ORACC-JSON.ipynb):

> For larger projects (such as [DCCLT](http://oracc.org/dcclt)) the `zip` file may be 25Mb or more. Downloading may take some time and it may be necessary to chunk the downloading process. The `iter_content()` function in the `requests` library takes care of that.

> In order to show a progress bar (with `tqdm`) we need to know how large the file to be downloaded is (this value is is then fed to the `total` parameter). The http protocol provides a key `content-length` in the headers (a dictionary) that indicates file length. Not all servers provide this field - if `content-length` is not avalaible it is set to 0. With the `total` value of 0 `tqdm` will show a bar and will count the number of chunks received, but it will not indicate the degree of progress.





## Define `downloadProjectsWithCompass()` Method

In [12]:
import requests
from tqdm.auto import tqdm

In [13]:
def downloadProjectsWithCompass(project_list, destination):
  """
  :param project_list: list of project file paths 
  :param destination: destination directory to put the downloaded files in
  :return download_url_list: list of formatted download links of project ZIP
                             files that were successfully download
  :return url_not_found_list: list of formatted download links of project ZIP 
                              files that failed to download
  """
  CHUNK = 1024
  download_url_list = []
  url_not_found_list = []
  for project in project_list:
      proj = project.replace('/', '-')
      url = f"http://oracc.museum.upenn.edu/json/{proj}.zip"
      #file = f'jsonzip/{proj}.zip'
      #file = f'drive/MyDrive/AWCA/tablet_zip/ORACC/{proj}.zip'
      file = f'{destination}/{proj}.zip'
      with requests.get(url, stream=True) as request:
          if request.status_code == 200:   # meaning that the file exists
              total_size = int(request.headers.get('content-length', 0))
              tqdm.write(f'Saving {url} as {file}')
              t=tqdm(total=total_size, unit='B', unit_scale=True, desc = project)
              with open(file, 'wb') as f:
                  for c in request.iter_content(chunk_size=CHUNK):
                      t.update(len(c))
                      f.write(c)
              download_url_list.append(url)
          else:
              tqdm.write(f"WARNING: {url} does not exist.")
              url_not_found_list.append(url)       
  return download_url_list, url_not_found_list

## Run `downloadProjectsWithCompass()` Method

Use bash `ls` command to check the contents of the destination directory before running the `downloadProjectsWithCompass()` method.

In [42]:
!ls oracc

unzipped


Set the destination directory and run the `downloadProjectsWithCompass()` method with the necessary parameters.

In [43]:
destination = 'oracc'
download_url_list, url_not_found_list = downloadProjectsWithCompass(project_path_list_clean, destination)

Saving http://oracc.museum.upenn.edu/json/adsd.zip as oracc/adsd.zip


adsd:   0%|          | 0.00/8.90M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/adsd-adart1.zip as oracc/adsd-adart1.zip


adsd/adart1:   0%|          | 0.00/5.05M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/adsd-adart2.zip as oracc/adsd-adart2.zip


adsd/adart2:   0%|          | 0.00/7.33M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/adsd-adart3.zip as oracc/adsd-adart3.zip


adsd/adart3:   0%|          | 0.00/10.6M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/adsd-adart6.zip as oracc/adsd-adart6.zip


adsd/adart6:   0%|          | 0.00/4.80M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/aemw.zip as oracc/aemw.zip


aemw:   0%|          | 0.00/4.80k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/aemw-alalakh-idrimi.zip as oracc/aemw-alalakh-idrimi.zip


aemw/alalakh/idrimi:   0%|          | 0.00/210k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/aemw-amarna.zip as oracc/aemw-amarna.zip


aemw/amarna:   0%|          | 0.00/4.25M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/akklove.zip as oracc/akklove.zip


akklove:   0%|          | 0.00/2.08M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/amgg.zip as oracc/amgg.zip


amgg:   0%|          | 0.00/131k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/ario.zip as oracc/ario.zip


ario:   0%|          | 0.00/2.69M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/armep.zip as oracc/armep.zip


armep:   0%|          | 0.00/224M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/arrim.zip as oracc/arrim.zip


arrim:   0%|          | 0.00/6.43k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/asbp.zip as oracc/asbp.zip


asbp:   0%|          | 0.00/5.41M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/asbp-ninmed.zip as oracc/asbp-ninmed.zip


asbp/ninmed:   0%|          | 0.00/3.20M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/asbp-rlasb.zip as oracc/asbp-rlasb.zip


asbp/rlasb:   0%|          | 0.00/146k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae.zip as oracc/atae.zip


atae:   0%|          | 0.00/74.9M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae-assur.zip as oracc/atae-assur.zip


atae/assur:   0%|          | 0.00/7.13M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae-burmarina.zip as oracc/atae-burmarina.zip


atae/burmarina:   0%|          | 0.00/353k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae-durkatlimmu.zip as oracc/atae-durkatlimmu.zip


atae/durkatlimmu:   0%|          | 0.00/2.26M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae-durszarrukin.zip as oracc/atae-durszarrukin.zip


atae/durszarrukin:   0%|          | 0.00/230k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae-guzana.zip as oracc/atae-guzana.zip


atae/guzana:   0%|          | 0.00/1.08M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae-huzirina.zip as oracc/atae-huzirina.zip


atae/huzirina:   0%|          | 0.00/1.53M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae-imgurenlil.zip as oracc/atae-imgurenlil.zip


atae/imgurenlil:   0%|          | 0.00/605k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae-kalhu.zip as oracc/atae-kalhu.zip


atae/kalhu:   0%|          | 0.00/10.1M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae-kunalia.zip as oracc/atae-kunalia.zip


atae/kunalia:   0%|          | 0.00/727k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae-mallanate.zip as oracc/atae-mallanate.zip


atae/mallanate:   0%|          | 0.00/1.28M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae-marqasu.zip as oracc/atae-marqasu.zip


atae/marqasu:   0%|          | 0.00/857k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae-nineveh.zip as oracc/atae-nineveh.zip


atae/nineveh:   0%|          | 0.00/56.6M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae-samal.zip as oracc/atae-samal.zip


atae/samal:   0%|          | 0.00/92.3k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae-szibaniba.zip as oracc/atae-szibaniba.zip


atae/szibaniba:   0%|          | 0.00/423k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae-tilbarsip.zip as oracc/atae-tilbarsip.zip


atae/tilbarsip:   0%|          | 0.00/267k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/atae-tuszhan.zip as oracc/atae-tuszhan.zip


atae/tuszhan:   0%|          | 0.00/427k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/babcity.zip as oracc/babcity.zip


babcity:   0%|          | 0.00/4.04M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/blms.zip as oracc/blms.zip


blms:   0%|          | 0.00/13.5M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/btto.zip as oracc/btto.zip


btto:   0%|          | 0.00/3.74M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/cams.zip as oracc/cams.zip


cams:   0%|          | 0.00/529k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/cams-akno.zip as oracc/cams-akno.zip


cams/akno:   0%|          | 0.00/12.5M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/cams-anzu.zip as oracc/cams-anzu.zip


cams/anzu:   0%|          | 0.00/629k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/cams-barutu.zip as oracc/cams-barutu.zip


cams/barutu:   0%|          | 0.00/196k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/cams-etana.zip as oracc/cams-etana.zip


cams/etana:   0%|          | 0.00/196k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/cams-gkab.zip as oracc/cams-gkab.zip


cams/gkab:   0%|          | 0.00/29.3M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/cams-ludlul.zip as oracc/cams-ludlul.zip


cams/ludlul:   0%|          | 0.00/612k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/cams-selbi.zip as oracc/cams-selbi.zip


cams/selbi:   0%|          | 0.00/168k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/cams-tlab.zip as oracc/cams-tlab.zip


cams/tlab:   0%|          | 0.00/5.77M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/ckst.zip as oracc/ckst.zip


ckst:   0%|          | 0.00/1.24M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/cmawro.zip as oracc/cmawro.zip


cmawro:   0%|          | 0.00/11.9M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/cmawro-cmawr1.zip as oracc/cmawro-cmawr1.zip


cmawro/cmawr1:   0%|          | 0.00/6.10M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/cmawro-cmawr2.zip as oracc/cmawro-cmawr2.zip


cmawro/cmawr2:   0%|          | 0.00/6.09M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/cmawro-cmawr3.zip as oracc/cmawro-cmawr3.zip


cmawro/cmawr3:   0%|          | 0.00/3.17M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/cmawro-maqlu.zip as oracc/cmawro-maqlu.zip


cmawro/maqlu:   0%|          | 0.00/3.56M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/contrib-amarna.zip as oracc/contrib-amarna.zip


contrib/amarna:   0%|          | 0.00/4.16M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/contrib-lambert.zip as oracc/contrib-lambert.zip


contrib/lambert:   0%|          | 0.00/792 [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/ctij.zip as oracc/ctij.zip


ctij:   0%|          | 0.00/3.45M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/dcclt.zip as oracc/dcclt.zip


dcclt:   0%|          | 0.00/70.0M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/dcclt-ebla.zip as oracc/dcclt-ebla.zip


dcclt/ebla:   0%|          | 0.00/2.16M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/dcclt-jena.zip as oracc/dcclt-jena.zip


dcclt/jena:   0%|          | 0.00/902k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/dcclt-nineveh.zip as oracc/dcclt-nineveh.zip


dcclt/nineveh:   0%|          | 0.00/17.3M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/dcclt-signlists.zip as oracc/dcclt-signlists.zip


dcclt/signlists:   0%|          | 0.00/12.2M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/dccmt.zip as oracc/dccmt.zip


dccmt:   0%|          | 0.00/4.70M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/dsst.zip as oracc/dsst.zip


dsst:   0%|          | 0.00/8.71M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/ecut.zip as oracc/ecut.zip


ecut:   0%|          | 0.00/6.47M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/epsd2.zip as oracc/epsd2.zip


epsd2:   0%|          | 0.00/318M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/etcsri.zip as oracc/etcsri.zip


etcsri:   0%|          | 0.00/12.2M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/glass.zip as oracc/glass.zip


glass:   0%|          | 0.00/947k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/hbtin.zip as oracc/hbtin.zip


hbtin:   0%|          | 0.00/25.6M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/lacost.zip as oracc/lacost.zip


lacost:   0%|          | 0.00/1.16M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/lovelyrics.zip as oracc/lovelyrics.zip


lovelyrics:   0%|          | 0.00/5.15k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/nere.zip as oracc/nere.zip


nere:   0%|          | 0.00/275k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/nimrud.zip as oracc/nimrud.zip


nimrud:   0%|          | 0.00/255k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/obel.zip as oracc/obel.zip


obel:   0%|          | 0.00/2.32M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/obmc.zip as oracc/obmc.zip


obmc:   0%|          | 0.00/2.72M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/obta.zip as oracc/obta.zip


obta:   0%|          | 0.00/437k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/ogsl.zip as oracc/ogsl.zip


ogsl:   0%|          | 0.00/216k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/oimea.zip as oracc/oimea.zip


oimea:   0%|          | 0.00/55.2M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/pnao.zip as oracc/pnao.zip


pnao:   0%|          | 0.00/129k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/qcat.zip as oracc/qcat.zip


qcat:   0%|          | 0.00/11.5M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/riao.zip as oracc/riao.zip


riao:   0%|          | 0.00/40.4M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/ribo.zip as oracc/ribo.zip


ribo:   0%|          | 0.00/7.56M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/ribo-bab7scores.zip as oracc/ribo-bab7scores.zip


ribo/bab7scores:   0%|          | 0.00/1.27M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/ribo-babylon10.zip as oracc/ribo-babylon10.zip


ribo/babylon10:   0%|          | 0.00/107k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/ribo-babylon2.zip as oracc/ribo-babylon2.zip


ribo/babylon2:   0%|          | 0.00/1.26M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/ribo-babylon3.zip as oracc/ribo-babylon3.zip


ribo/babylon3:   0%|          | 0.00/147k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/ribo-babylon4.zip as oracc/ribo-babylon4.zip


ribo/babylon4:   0%|          | 0.00/73.0k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/ribo-babylon5.zip as oracc/ribo-babylon5.zip


ribo/babylon5:   0%|          | 0.00/22.6k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/ribo-babylon6.zip as oracc/ribo-babylon6.zip


ribo/babylon6:   0%|          | 0.00/4.22M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/ribo-babylon7.zip as oracc/ribo-babylon7.zip


ribo/babylon7:   0%|          | 0.00/9.53M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/ribo-babylon8.zip as oracc/ribo-babylon8.zip


ribo/babylon8:   0%|          | 0.00/294k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/ribo-sources.zip as oracc/ribo-sources.zip


ribo/sources:   0%|          | 0.00/4.13M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/rimanum.zip as oracc/rimanum.zip


rimanum:   0%|          | 0.00/2.94M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/rinap.zip as oracc/rinap.zip


rinap:   0%|          | 0.00/21.7M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/rinap-rinap1.zip as oracc/rinap-rinap1.zip


rinap/rinap1:   0%|          | 0.00/3.75M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/rinap-rinap2.zip as oracc/rinap-rinap2.zip


rinap/rinap2:   0%|          | 0.00/9.35M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/rinap-rinap3.zip as oracc/rinap-rinap3.zip


rinap/rinap3:   0%|          | 0.00/11.0M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/rinap-rinap4.zip as oracc/rinap-rinap4.zip


rinap/rinap4:   0%|          | 0.00/7.14M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/rinap-rinap5.zip as oracc/rinap-rinap5.zip


rinap/rinap5:   0%|          | 0.00/16.7M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/rinap-scores.zip as oracc/rinap-scores.zip


rinap/scores:   0%|          | 0.00/14.4M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/rinap-sources.zip as oracc/rinap-sources.zip


rinap/sources:   0%|          | 0.00/26.7M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao.zip as oracc/saao.zip


saao:   0%|          | 0.00/64.6M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-aebp.zip as oracc/saao-aebp.zip


saao/aebp:   0%|          | 0.00/15.2M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-knpp.zip as oracc/saao-knpp.zip


saao/knpp:   0%|          | 0.00/14.0M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa01.zip as oracc/saao-saa01.zip


saao/saa01:   0%|          | 0.00/4.99M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa02.zip as oracc/saao-saa02.zip


saao/saa02:   0%|          | 0.00/2.70M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa03.zip as oracc/saao-saa03.zip


saao/saa03:   0%|          | 0.00/4.06M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa04.zip as oracc/saao-saa04.zip


saao/saa04:   0%|          | 0.00/8.16M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa05.zip as oracc/saao-saa05.zip


saao/saa05:   0%|          | 0.00/4.99M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa06.zip as oracc/saao-saa06.zip


saao/saa06:   0%|          | 0.00/7.04M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa07.zip as oracc/saao-saa07.zip


saao/saa07:   0%|          | 0.00/3.82M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa08.zip as oracc/saao-saa08.zip


saao/saa08:   0%|          | 0.00/7.19M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa09.zip as oracc/saao-saa09.zip


saao/saa09:   0%|          | 0.00/778k [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa10.zip as oracc/saao-saa10.zip


saao/saa10:   0%|          | 0.00/8.77M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa11.zip as oracc/saao-saa11.zip


saao/saa11:   0%|          | 0.00/3.01M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa12.zip as oracc/saao-saa12.zip


saao/saa12:   0%|          | 0.00/3.62M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa13.zip as oracc/saao-saa13.zip


saao/saa13:   0%|          | 0.00/3.82M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa14.zip as oracc/saao-saa14.zip


saao/saa14:   0%|          | 0.00/6.24M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa15.zip as oracc/saao-saa15.zip


saao/saa15:   0%|          | 0.00/5.86M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa16.zip as oracc/saao-saa16.zip


saao/saa16:   0%|          | 0.00/3.99M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa17.zip as oracc/saao-saa17.zip


saao/saa17:   0%|          | 0.00/4.52M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa18.zip as oracc/saao-saa18.zip


saao/saa18:   0%|          | 0.00/4.71M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa19.zip as oracc/saao-saa19.zip


saao/saa19:   0%|          | 0.00/5.49M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa20.zip as oracc/saao-saa20.zip


saao/saa20:   0%|          | 0.00/4.61M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saa21.zip as oracc/saao-saa21.zip


saao/saa21:   0%|          | 0.00/3.81M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/saao-saas2.zip as oracc/saao-saas2.zip


saao/saas2:   0%|          | 0.00/1.43M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/suhu.zip as oracc/suhu.zip


suhu:   0%|          | 0.00/1.25M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/tcma.zip as oracc/tcma.zip


tcma:   0%|          | 0.00/16.8M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/tsae.zip as oracc/tsae.zip


tsae:   0%|          | 0.00/110M [00:00<?, ?B/s]

Saving http://oracc.museum.upenn.edu/json/xcat.zip as oracc/xcat.zip


xcat:   0%|          | 0.00/6.59k [00:00<?, ?B/s]

## Examine Downloaded Files in Destination Directory

Use bash `ls` command to check the contents of the directory after downloading.

In [44]:
!ls oracc

adsd-adart1.zip		 cams-tlab.zip	      ribo-babylon8.zip
adsd-adart2.zip		 cams.zip	      ribo-sources.zip
adsd-adart3.zip		 ckst.zip	      ribo.zip
adsd-adart6.zip		 cmawro-cmawr1.zip    rimanum.zip
adsd.zip		 cmawro-cmawr2.zip    rinap-rinap1.zip
aemw-alalakh-idrimi.zip  cmawro-cmawr3.zip    rinap-rinap2.zip
aemw-amarna.zip		 cmawro-maqlu.zip     rinap-rinap3.zip
aemw.zip		 cmawro.zip	      rinap-rinap4.zip
akklove.zip		 contrib-amarna.zip   rinap-rinap5.zip
amgg.zip		 contrib-lambert.zip  rinap-scores.zip
ario.zip		 ctij.zip	      rinap-sources.zip
armep.zip		 dcclt-ebla.zip       rinap.zip
arrim.zip		 dcclt-jena.zip       saao-aebp.zip
asbp-ninmed.zip		 dcclt-nineveh.zip    saao-knpp.zip
asbp-rlasb.zip		 dcclt-signlists.zip  saao-saa01.zip
asbp.zip		 dcclt.zip	      saao-saa02.zip
atae-assur.zip		 dccmt.zip	      saao-saa03.zip
atae-burmarina.zip	 dsst.zip	      saao-saa04.zip
atae-durkatlimmu.zip	 ecut.zip	      saao-saa05.zip
atae-durszarrukin.zip	 epsd2.zip	      saao-saa06.zip
a

Examine the successful download URLs.

In [45]:
download_url_list

['http://oracc.museum.upenn.edu/json/adsd.zip',
 'http://oracc.museum.upenn.edu/json/adsd-adart1.zip',
 'http://oracc.museum.upenn.edu/json/adsd-adart2.zip',
 'http://oracc.museum.upenn.edu/json/adsd-adart3.zip',
 'http://oracc.museum.upenn.edu/json/adsd-adart6.zip',
 'http://oracc.museum.upenn.edu/json/aemw.zip',
 'http://oracc.museum.upenn.edu/json/aemw-alalakh-idrimi.zip',
 'http://oracc.museum.upenn.edu/json/aemw-amarna.zip',
 'http://oracc.museum.upenn.edu/json/akklove.zip',
 'http://oracc.museum.upenn.edu/json/amgg.zip',
 'http://oracc.museum.upenn.edu/json/ario.zip',
 'http://oracc.museum.upenn.edu/json/armep.zip',
 'http://oracc.museum.upenn.edu/json/arrim.zip',
 'http://oracc.museum.upenn.edu/json/asbp.zip',
 'http://oracc.museum.upenn.edu/json/asbp-ninmed.zip',
 'http://oracc.museum.upenn.edu/json/asbp-rlasb.zip',
 'http://oracc.museum.upenn.edu/json/atae.zip',
 'http://oracc.museum.upenn.edu/json/atae-assur.zip',
 'http://oracc.museum.upenn.edu/json/atae-burmarina.zip',
 'ht

Examine the failed download URLs.

In [46]:
url_not_found_list

['http://oracc.museum.upenn.edu/json/cdli.zip',
 'http://oracc.museum.upenn.edu/json/contrib.zip',
 'http://oracc.museum.upenn.edu/json/w.tau.ac.il-humanities-archaeology-ancient_israel.html.zip',
 'http://oracc.museum.upenn.edu/json/rinap-rinap5p1.zip']

# Download Method 2: Download using Stack Overflow code

Using the links generated from some modified [Compass](https://github.com/niekveldhuis/compass) code from the above method, we can obtain the download links to the project ZIP files.

I found a solution from Stack Overflow using the `os` and `requests` libraries to perform this process: [Download file from URL and save it in a folder Python - Stack Overflow](https://stackoverflow.com/questions/56950987/download-file-from-url-and-save-it-in-a-folder-python).

### Define `downloadProjectsWithStackOverflow()` Method and associated utility methods

In [62]:
import os
from tqdm.auto import tqdm
import requests

In [63]:
def getDownloadURL(file_path):
  """
  :param file: a file path
  :return url: a formatted download URL
  :return url_exists: True if a url exists, False if a url doesn't exist
  This is a utility method for getDownloadURLList() to prepare a formatted
  download URL.
  """
  file = file_path.replace('/', '-')
  url = f"http://oracc.museum.upenn.edu/json/{file}.zip"
  url_exists = False
  with requests.get(url, stream=True) as request:
    if request.status_code == 200:   # meaning that the file exists
      print(f"SUCCESS: {url} exists.")
      url_exists = True
    else:
      print(f"WARNING: {url} does not exist.")
      url_exists = False
  return url, url_exists

In [64]:
def getDownloadURLList(file_path_list):
  """
  :param file_path_list: list of file paths 
  :return download_url_list: list of formatted download links of project ZIP
                             files that were successfully download
  :return url_not_found_list: list of formatted download links of project ZIP 
                              files that failed to download
  This is a utility method for downloadProjectsWithStackOverflow() to
  prepare a list of URLs to download the project ZIP files.
  """
  download_url_list = []
  url_not_found_list = []
  for file_path in file_path_list:
    url, url_exists = getDownloadURL(file_path)
    if url_exists:
      download_url_list.append(url)
    else:
      url_not_found_list.append(url)    
  return download_url_list, url_not_found_list

In [65]:
def downloadFile(url: str, destination: str):
#    if not os.path.exists(destination):
#        os.makedirs(destination)  # create folder if it does not exist
    """
    :param url: URL of file to download 
    :param destination: destination directory to put the downloaded files in
    This is a utility method for downloadProjectsWithStackOverflow() to 
    download a single file given a URL and a destination directory path.
    """
    filename = url.split('/')[-1].replace(" ", "_")  # be careful with file names
    file_path = os.path.join(destination, filename)

    r = requests.get(url, stream=True)
    if r.ok:
        print("saving to", os.path.abspath(file_path))
        with open(file_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024 * 8):
                if chunk:
                    f.write(chunk)
                    f.flush()
                    os.fsync(f.fileno())
    else:  # HTTP status code 4XX/5XX
        print("Download failed: status code {}\n{}".format(r.status_code, r.text))

In [66]:
def downloadProjectsWithStackOverflow(project_list, destination):
  """
  :param project_list: list of project file paths 
  :param destination: destination directory to put the downloaded files in
  :return download_url_list: list of formatted download links of project ZIP
                             files that were successfully download
  :return url_not_found_list: list of formatted download links of project ZIP 
                              files that failed to download
  """
  download_url_list, url_not_found_list = getDownloadURLList(project_list)
  if not destination.endswith("/"):
    destination = destination + "/"
  for url in download_url_list:
    downloadFile(url, destination)
  return download_url_list, url_not_found_list

## Run `downloadProjectsWithStackOverflow()` Method

Use bash `ls` command to check the contents of the destination directory before running the `downloadProjectsWithStackOverflow()` method.

In [68]:
!ls oracc

unzipped


In [69]:
destination = "oracc"
download_url_list, url_not_found_list = downloadProjectsWithStackOverflow(project_list, destination)

SUCCESS: http://oracc.museum.upenn.edu/json/adsd.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/adsd-adart1.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/adsd-adart2.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/adsd-adart3.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/adsd-adart6.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/aemw.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/aemw-alalakh-idrimi.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/aemw-amarna.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/akklove.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/amgg.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/ario.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/armep.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/arrim.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/asbp.zip exists.
SUCCESS: http://oracc.museum.upenn.edu/json/asbp-ninmed.zip exists.
SUCCESS: http://oracc.m

## Examine the Downloaded Files in the Destination Directory

Use bash `ls` command to check if we downloaded the files we needed.

In [70]:
!ls oracc

adsd-adart1.zip		 cams-ludlul.zip      ribo-babylon4.zip
adsd-adart2.zip		 cams-selbi.zip       ribo-babylon5.zip
adsd-adart3.zip		 cams-tlab.zip	      ribo-babylon6.zip
adsd-adart6.zip		 cams.zip	      ribo-babylon7.zip
adsd.zip		 ckst.zip	      ribo-babylon8.zip
aemw-alalakh-idrimi.zip  cmawro-cmawr1.zip    ribo-sources.zip
aemw-amarna.zip		 cmawro-cmawr2.zip    ribo.zip
aemw.zip		 cmawro-cmawr3.zip    rimanum.zip
akklove.zip		 cmawro-maqlu.zip     rinap-rinap1.zip
amgg.zip		 cmawro.zip	      rinap-rinap2.zip
ario.zip		 contrib-amarna.zip   rinap-rinap3.zip
armep.zip		 contrib-lambert.zip  rinap-rinap4.zip
arrim.zip		 ctij.zip	      rinap-rinap5.zip
asbp-ninmed.zip		 dcclt-ebla.zip       rinap-scores.zip
asbp-rlasb.zip		 dcclt-jena.zip       rinap-sources.zip
asbp.zip		 dcclt-nineveh.zip    rinap.zip
atae-assur.zip		 dcclt-signlists.zip  saao-aebp.zip
atae-burmarina.zip	 dcclt.zip	      saao-knpp.zip
atae-durkatlimmu.zip	 dccmt.zip	      saao-saa01.zip
atae-durszarrukin.zip	 dsst.zip

Examine the successful download URLs.

In [71]:
download_url_list

['http://oracc.museum.upenn.edu/json/adsd.zip',
 'http://oracc.museum.upenn.edu/json/adsd-adart1.zip',
 'http://oracc.museum.upenn.edu/json/adsd-adart2.zip',
 'http://oracc.museum.upenn.edu/json/adsd-adart3.zip',
 'http://oracc.museum.upenn.edu/json/adsd-adart6.zip',
 'http://oracc.museum.upenn.edu/json/aemw.zip',
 'http://oracc.museum.upenn.edu/json/aemw-alalakh-idrimi.zip',
 'http://oracc.museum.upenn.edu/json/aemw-amarna.zip',
 'http://oracc.museum.upenn.edu/json/akklove.zip',
 'http://oracc.museum.upenn.edu/json/amgg.zip',
 'http://oracc.museum.upenn.edu/json/ario.zip',
 'http://oracc.museum.upenn.edu/json/armep.zip',
 'http://oracc.museum.upenn.edu/json/arrim.zip',
 'http://oracc.museum.upenn.edu/json/asbp.zip',
 'http://oracc.museum.upenn.edu/json/asbp-ninmed.zip',
 'http://oracc.museum.upenn.edu/json/asbp-rlasb.zip',
 'http://oracc.museum.upenn.edu/json/atae.zip',
 'http://oracc.museum.upenn.edu/json/atae-assur.zip',
 'http://oracc.museum.upenn.edu/json/atae-burmarina.zip',
 'ht

Examine the failed download URLs.

In [72]:
url_not_found_list

['http://oracc.museum.upenn.edu/json/cdli.zip',
 'http://oracc.museum.upenn.edu/json/contrib.zip',
 'http://oracc.museum.upenn.edu/json/w.tau.ac.il-humanities-archaeology-ancient_israel.html.zip',
 'http://oracc.museum.upenn.edu/json/rinap-rinap5p1.zip']

# Unzip Files

After we downloaded all of the ZIP files, the next step may be to unzip and examine its contents.

I recently learned about a library called `zipfile` that handles this step from this website: [Python unzip: How To Extract Single or Multiple Files](https://appdividend.com/2022/01/19/python-unzip/).

Below we will define some methods that will help to unzip project files.

In [47]:
from zipfile import ZipFile

## Create Unzipped Files Directory
Create a directory called `unzipped`. If the directory already exists, do nothing. We can change this directory according to what is needed.

In [48]:
os.makedirs("oracc/unzipped", exist_ok = True)

## Define `unzipFile()` Method

This is a method that unzips a single file.

In [49]:
def unzipFile(file, source_directory, destination):
  """
  :param file: ZIP file name
  :param source_directory: source directory of the ZIP file
  :param destination: destination directory to put the downloaded files in
  This is a method that unzips a single file. Utility for `unzipMultipleFiles().`
  """
  if not source_directory.endswith("/"):
    source_directory = source_directory + "/"
  if not destination.endswith("/"):
    destination = destination + "/"
  file_path = source_directory + file
  print(file_path)
  file_name = file[:file.rfind(".zip")]
  with ZipFile(file_path, "r") as zipObj:
      zipObj.extractall(f"{destination}{file_name}")
  file_name = file[:file.rfind(".zip")]
  print(f'Unzipped {file}. See {destination}{file_name}.')

### Run `unzipFile() Method`

We will use the `adsd-adart1.zip` as an example to run this `unzipFile()` method.

In [51]:
unzipFile("adsd-adart1.zip", "oracc", "oracc/unzipped")

oracc/adsd-adart1.zip
Unzipped adsd-adart1.zip. See oracc/unzipped/adsd-adart1.


Use bash `ls` command with option `-R` to get a recursive directory listing of the unzipped file.

In [53]:
!ls -R oracc/unzipped/adsd-adart1/

oracc/unzipped/adsd-adart1/:
adsd

oracc/unzipped/adsd-adart1/adsd:
adart1

oracc/unzipped/adsd-adart1/adsd/adart1:
adsd-adart1-portal.json  corpusjson	 index-cat.json  index-txt.json
catalogue.json		 gloss-akk.json  index-lem.json  metadata.json
cat.geojson		 gloss-qpn.json  index-qpn.json  sortcodes.json
corpus.json		 index-akk.json  index-tra.json

oracc/unzipped/adsd-adart1/adsd/adart1/corpusjson:
X102611.json  X102831.json  X103090.json  X103430.json	X103753.json
X102612.json  X102832.json  X103210.json  X103460.json	X103780.json
X102613.json  X102840.json  X103221.json  X103570.json	X103790.json
X102620.json  X102861.json  X103222.json  X103610.json	X103801.json
X102630.json  X102862.json  X103223.json  X103661.json	X103802.json
X102640.json  X102870.json  X103224.json  X103670.json	X103811.json
X102661.json  X102880.json  X103241.json  X103680.json	X103812.json
X102662.json  X102890.json  X103242.json  X103690.json	X103813.json
X102701.json  X102910.json  X103280.json  X103700.j

## Define `unzipMultipleFiles()` Method

This is a method that unzips multiple given files.

In [54]:
def unzipFile(file, source_directory, destination):
  """
  :param file: ZIP file name
  :param source_directory: source directory of the ZIP file
  :param destination: destination directory to put the downloaded files in
  This is a method that unzips a single file. Utility for unzipMultipleFiles().
  """
  if not source_directory.endswith("/"):
    source_directory = source_directory + "/"
  if not destination.endswith("/"):
    destination = destination + "/"
  file_path = source_directory + file
  print(file_path)
  file_name = file[:file.rfind(".zip")]
  with ZipFile(file_path, "r") as zipObj:
      zipObj.extractall(f"{destination}{file_name}")
  file_name = file[:file.rfind(".zip")]
  print(f'Unzipped {file}. See {destination}{file_name}.')

In [55]:
def unzipMultipleFiles(file_list, source_directory, destination):
  """
  :param file: ZIP file name
  :param source_directory: source directory of the ZIP file
  :param destination: destination directory to put the downloaded files in
  This is a method that unzips multiple files. Uses unzipFile().
  """
  if not source_directory.endswith("/"):
    source_directory = source_directory + ("/")
  if not destination.endswith("/"):
    destination = destination + "/"
  for file in file_list:
    try:
        unzipFile(file, source_directory, destination)
    except (FileNotFoundError, IOError):
        print("File not found. Wrong file path.")

### Run `unzipMultipleFiles()` Method

We will use the `file_list` as an example to run this `unzipMultipleFiles()` method.

In [56]:
file_list = ['adsd-adart2.zip', 'adsd-adart3.zip', 'cams-tlab.zip', 'cams.zip', 
             'cdli.zip', 'ckst.zip', 'ribo.zip']
source_directory = "oracc"
destination = "oracc/unzipped"
unzipMultipleFiles(file_list, source_directory, destination)

oracc/adsd-adart2.zip
Unzipped adsd-adart2.zip. See oracc/unzipped/adsd-adart2.
oracc/adsd-adart3.zip
Unzipped adsd-adart3.zip. See oracc/unzipped/adsd-adart3.
oracc/cams-tlab.zip
Unzipped cams-tlab.zip. See oracc/unzipped/cams-tlab.
oracc/cams.zip
Unzipped cams.zip. See oracc/unzipped/cams.
oracc/cdli.zip
File not found. Wrong file path.
oracc/ckst.zip
Unzipped ckst.zip. See oracc/unzipped/ckst.
oracc/ribo.zip
Unzipped ribo.zip. See oracc/unzipped/ribo.


Use bash `ls` command with option `-R` to get a recursive directory listing of the unzipped files.

In [60]:
!ls -R oracc/unzipped/adsd-adart2/

oracc/unzipped/adsd-adart2/:
adsd

oracc/unzipped/adsd-adart2/adsd:
adart2

oracc/unzipped/adsd-adart2/adsd/adart2:
adsd-adart2-portal.json  corpusjson	 index-cat.json  index-txt.json
catalogue.json		 gloss-akk.json  index-lem.json  metadata.json
cat.geojson		 gloss-qpn.json  index-qpn.json  sortcodes.json
corpus.json		 index-akk.json  index-tra.json

oracc/unzipped/adsd-adart2/adsd/adart2/corpusjson:
X201641.json  X201770.json  X201880.json  X202001.json	X202270.json
X201642.json  X201781.json  X201891.json  X202002.json	X202291.json
X201650.json  X201782.json  X201892.json  X202011.json	X202292.json
X201661.json  X201783.json  X201893.json  X202012.json	X202300.json
X201662.json  X201784.json  X201901.json  X202013.json	X202304.json
X201671.json  X201791.json  X201902.json  X202014.json	X202320.json
X201672.json  X201792.json  X201903.json  X202021.json	X202340.json
X201681.json  X201793.json  X201904.json  X202022.json	X202370.json
X201682.json  X201794.json  X201905.json  X202030.j

In [61]:
!ls -R oracc/unzipped/ribo/

oracc/unzipped/ribo/:
ribo

oracc/unzipped/ribo/ribo:
catalogue.json	gloss-sux-x-emesal.json  index-tra.json
cat.geojson	index-akk.json		 index-txt.json
corpus.json	index-cat.json		 metadata.json
corpusjson	index-lem.json		 ribo-portal.json
gloss-akk.json	index-qpn.json		 sortcodes.json
gloss-qpn.json	index-sux.json
gloss-sux.json	index-sux-x-emesal.json

oracc/unzipped/ribo/ribo/corpusjson:
