# Extract words_df and df_cat DataFrames from ZIP files

Notebook by Melinee Her

See how to download ZIP files of data from the ORACC project list in this COLAB:
https://colab.research.google.com/drive/1EAahNUcBXxk6-BXc68Dhp5hPj2bXWF-C?usp=sharing

The goal for this notebook is to take a series of project zip files and export the words_df and df_cat DataFrames for each project.

# Mount Google Drive folder + Load Libraries

The code below mounts Google Drive so that its files can be accessed from the file browser or the command line. Running it triggers a permissions prompt.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
#any necessary imports
import pandas as pd
import zipfile
from zipfile import ZipFile
import json
import requests
from tqdm import tqdm
import os
import errno
import re
import random
import numpy as np
import sys
import copy
import networkx as nx
from pathlib import Path

#Set folder for remote drive
#folder = '/content/drive/My Drive/FactGrid Cuneiform (AWCA)/people/Melinee'
folder = '/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/people/Melinee/'

#importing utils for the method which downloads the current text json files
os.chdir(folder + 'network/utils/')
from utils import oracc_download

# This is a user defined module that searches through the texts to find the entities in the text that
# are people and places, to be imported as nodes into the network
os.chdir(folder + 'network/')
import rank_parser4 as rp

The steps:
1. Unzip every known project ZIP file.
2. Parse the JSON files in each unzipped project.
3. Build the words_df and df_cat DataFrames and name their indexes ("number" for words_df, "id_text" for df_cat).
4. Create a directory that stores the words_df and df_cat of every project.

For reference, I will show an example of the process for a single project first, followed by the automatic version for all projects below.

# Unzip project adsd-adart1
This section is for a single project and does not have to be run before automating the process. It is an in-depth look at the process from start to finish.

After successfully running the first Colab notebook, which creates the "ORACC_zips" directory, we create a new directory called "ORACC_UNZIPPED" that holds all of the unzipped project files.

In [None]:
#use the same base folder as the destination passed to unzipFile() below
os.makedirs(folder + "ORACC_UNZIPPED", exist_ok = True)

The unzipFile function unzips a single file into a given destination directory.

In [None]:
def unzipFile(file, source_directory, destination):
  """
  Unzip a single ZIP file. Utility for `unzipMultipleFiles()`.
  :param file: ZIP file name
  :param source_directory: source directory of the ZIP file
  :param destination: destination directory to put the extracted files in
  """
  if not source_directory.endswith("/"):
    source_directory = source_directory + "/"
  if not destination.endswith("/"):
    destination = destination + "/"
  file_path = source_directory + file
  print(file_path)
  #the output folder is named after the ZIP file, without the ".zip" extension
  file_name = file[:file.rfind(".zip")]
  with ZipFile(file_path, "r") as zipObj:
      zipObj.extractall(f"{destination}{file_name}")
  print(f'Unzipped {file}. See {destination}{file_name}.')
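
One caveat about the slice `file[:file.rfind(".zip")]`: if the name does not contain `.zip`, `rfind` returns -1 and the last character of the name is silently dropped. The following is a minimal sketch of the same unzip step using `pathlib`, which avoids that edge case. The helper name `unzip_to_folder` is hypothetical (it is not part of the notebook's utils), and the archive is a throwaway stand-in built in a temporary directory so the example runs anywhere.

```python
import tempfile
import zipfile
from pathlib import Path


def unzip_to_folder(zip_path, destination):
    """Extract zip_path into destination/<archive name without extension>.

    Hypothetical helper for illustration only.
    """
    zip_path = Path(zip_path)
    target = Path(destination) / zip_path.stem   # 'adsd-adart1.zip' -> 'adsd-adart1'
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(target)
    return target


# Build a tiny stand-in archive so the sketch is self-contained.
workdir = Path(tempfile.mkdtemp())
demo_zip = workdir / "adsd-adart1.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("adsd/adart1/catalogue.json", "{}")

extracted = unzip_to_folder(demo_zip, workdir / "ORACC_UNZIPPED")
print(extracted.name)
print((extracted / "adsd" / "adart1" / "catalogue.json").exists())
```

Returning the target path makes it easy to check afterwards which folder a given archive ended up in.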

### Run `unzipFile()` Method

For our example, we use the adsd-adart1 project. First, unzip adsd-adart1.zip (located in the ORACC_zips directory) into the ORACC_UNZIPPED directory by calling unzipFile().

In [None]:
unzipFile("adsd-adart1.zip", folder + "ORACC_zips", folder + "ORACC_UNZIPPED")

/content/drive/MyDrive/Melinee/ORACC_zips/adsd-adart1.zip
Unzipped adsd-adart1.zip. See /content/drive/MyDrive/Melinee/ORACC_UNZIPPED/adsd-adart1.


## Now we will process this via JSON parsing

With the project file unzipped, we have access to the project's JSON files. The parsejson function walks the nested "cdl" structure of a text recursively and collects every lemma (each "f" entry) together with its word ID, line label, and text ID.

In [None]:
def parsejson(text):
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject:
            #print('cdl in JSON')
            parsejson(JSONobject)
        if "label" in JSONobject:
            meta_d["label"] = JSONobject['label']
        if "f" in JSONobject:
            lemma = JSONobject["f"]
            #if "ftype" in JSONobject:   # you don't need this - useful for distinguishing between regular text and year names
            #    lemma['ftype'] = JSONobject['ftype']
            lemma["id_word"] = JSONobject["ref"]
            lemma['label'] = meta_d["label"]
            lemma["id_text"] = meta_d["id_text"]
            lemm_l.append(lemma)
            #print('Appending Lemma: ' + str(lemma))
        #if "strict" in JSONobject and JSONobject["strict"] == "1":
         #   lemma = {key: JSONobject[key] for key in dollar_keys}
         #   lemma["id_word"] = JSONobject["ref"]
         #   lemma["id_text"] = meta_d["id_text"]
         #   lemm_l.append(lemma)
    return
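
To make the recursion easier to follow, here is a small self-contained sketch of the same traversal. It is a variant, not the notebook's function: the name `collect_lemmas` is hypothetical, state is passed as arguments instead of the globals `meta_d` and `lemm_l`, and a missing "cdl" key is tolerated. The toy document is invented; only its key names ("cdl", "label", "f", "ref") mirror what parsejson reads.

```python
def collect_lemmas(node, state, found):
    """Recursively walk a 'cdl' list and collect every 'f' (lemma) dict."""
    for item in node.get("cdl", []):
        if "cdl" in item:
            # Nested chunk: descend first, sharing the same state and result list.
            collect_lemmas(item, state, found)
        if "label" in item:
            # Remember the most recent line label for the lemmas that follow.
            state["label"] = item["label"]
        if "f" in item:
            lemma = dict(item["f"])          # copy so the input is not modified
            lemma["id_word"] = item["ref"]
            lemma["label"] = state["label"]
            lemma["id_text"] = state["id_text"]
            found.append(lemma)
    return found


# Invented toy document for illustration.
toy_text = {
    "cdl": [
        {
            "cdl": [
                {"label": "o 1"},
                {"ref": "X000001.1.1", "f": {"lang": "akk", "form": "GE₆"}},
                {"ref": "X000001.1.2", "f": {"lang": "akk", "form": "28"}},
                {"label": "o 2"},
                {"ref": "X000001.2.1", "f": {"lang": "akk", "form": "x"}},
            ]
        }
    ]
}

lemmas = collect_lemmas(toy_text, {"label": None, "id_text": "X000001"}, [])
for lemma in lemmas:
    print(lemma["id_word"], lemma["label"], lemma["form"])
```

Each collected lemma carries the label of the line it appeared on, which is why the label is stored in shared state rather than read from the word entry itself.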

Using the utils from the network folder provided by FactGrid, we can download the project of interest with the function oracc_download.

In [None]:
#gets utils, downloads projects
os.chdir(folder + 'network/')

projects = ['adsd/adart1']
projects = oracc_download(projects,'') #DOWNLOAD REDUNDANCY
projects

Saving http://build-oracc.museum.upenn.edu/json/adsd-adart1.zip as jsonzip/adsd-adart1.zip.




adsd/adart1:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

['adsd/adart1']

Finally, we use the downloaded project and the corresponding JSON files to create the project's words_df and df_cat.

In [None]:
lemm_l = []
meta_d = {"label": None, "id_text": None}
#dollar_keys = ["extent", "scope", "state"]

df_cat = pd.DataFrame()
used_pnums = []
cat_d = {}
for project in projects:
  print('Project: ' + str(project))
  z = zipfile.ZipFile('jsonzip/' + project.replace('/','-') + '.zip')
  #print(file + " does not exist or is not a proper ZIP file")
  files = z.namelist()     # list of all the files in the ZIP
  files = [name for name in files if "corpusjson" in name and name[-5:] == '.json']
  cat_file = z.read(project + '/catalogue.json').decode('utf-8')
  cat_json = json.loads(cat_file)

  cat_d.update(dict(cat_json['members']))
  #df_cat = pd.concat([df_cat,pd.DataFrame(cat_json['members']).T])                          #that holds all the P, Q, and X numbers.

  for filename in tqdm(files):       #iterate over the file names
      id_text = filename[-12:-5]
      if id_text in used_pnums:
        continue
      else:
        used_pnums.append(id_text)
      meta_d["id_text"] = id_text

      st = z.read(filename).decode('utf-8')       #read and decode the json file of one particular text
      #if empty file, skip, else run through parsejson
      if len(st) < 1:
        print(filename); #prints out the empty
      else:
        data_json = json.loads(st)                # make it into a json object (essentially a dictionary)
        #print(str(data_json))
        parsejson(data_json)               # and send to the parsejson() function
  z.close()

  df_cat = pd.DataFrame(cat_d).T
  words_df = pd.DataFrame(lemm_l)

Project: adsd/adart1



100%|██████████| 89/89 [00:01<00:00, 85.81it/s]
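
A note on `filename[-12:-5]` in the loop above: the slice assumes that every text ID is exactly seven characters long (one letter plus six digits) and sits directly before `.json`. The sketch below shows an alternative that takes the file name without its extension, whatever its length. The helper name `text_id` is hypothetical; the example path is one that appears in the output of this notebook.

```python
from pathlib import PurePosixPath


def text_id(member_name):
    """Return the file name without its extension, e.g. 'P288859'."""
    return PurePosixPath(member_name).stem


name = "tcma/assur/corpusjson/P288859.json"
print(text_id(name))
# For standard seven-character IDs both approaches agree.
print(text_id(name) == name[-12:-5])
```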


Here's a quick look at these dataframes.

In [None]:
words_df.rename(columns={0: 'number'}, inplace=True)
words_df.index.name = 'number'
words_df.head(3)

| number | lang | form | delim | gdl | pos | id_word | label | id_text | cf | gw | sense | norm | epos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | akk | x | | [{'x': 'ellipsis', 'id': 'X103322.2.1.0', 'bre... | u | X103322.2.1 | o 1' | X103322 | | | | | |
| 1 | akk | GE₆ | | [{'gg': 'logo', 'gdl_type': 'logo', 'group': [... | N | X103322.2.2 | o 1' | X103322 | mūšu | night | night | mūšu | N |
| 2 | akk | 28 | | [{'n': 'n', 'sexified': '2(u) 8(diš)', 'form':... | n | X103322.2.3 | o 1' | X103322 | | | | | |


In [None]:
df_cat.rename(columns={0:'id'}, inplace=True)
df_cat.index.name = 'id_text'
df_cat.head(3)

| id_text | langs | project | id_text | designation | copy | photo | museum_no | text_comments | accession_no | ancient_year | ... | object_type | period | provenience | pleiades_id | pleiades_coord | supergenre | trans | tablet_comments | date_comments | bibilography |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X102611 | 0x08000000 | adsd/adart1 | X102611 | AD -261A | LBAT 249 | ADART I plate 66 | BM 41690 | A obv. 9': At the beginning of the line either... | 81-6-25,308 | SE 50 | ... | tablet | Hellenistic | Babylon | 893951 | [44.422236,32.543395] | unknown | [en] | | | |
| X102612 | 0x08000000 | adsd/adart1 | X102612 | AD -261B | ADART I plate 68 | ADART I plates 66 and 67 | BM 32245 + 32404 | B obv. 11': Eclipse of -261 Dec. 21 (not visib... | S+76-11-17,1972+2137 | SE 50 | ... | tablet | Hellenistic | Babylon | 893951 | [44.422236,32.543395] | unknown | [en] | Both pieces [-261B&C] are almost certainly par... | | |
| X102613 | 0x08000000 | adsd/adart1 | X102613 | AD -261C | ADART I plate 68 | ADART I plates 66 and 67 | BM 41615 + 41913 | | 81-6-25,230+533 | SE 50 | ... | tablet | Hellenistic | Babylon | 893951 | [44.422236,32.543395] | unknown | [en] | Both pieces [-261B&C] are almost certainly par... | | |


The last step is to export these dataframes to a specified directory. The directory {folder}ORACC_DFS/PROJECT_DFS holds the exported words_df and df_cat files for all projects.

An alternative, project-oriented layout nests a separate folder for each project, e.g. ORACC_DFS/PROJECT_DFS/PROJECT/SUBPROJECT (such as ORACC_DFS/PROJECT_DFS/adsd/adart1), and stores that project's dataframes inside it. The nested layout suits project-oriented case studies, while a single flat folder gives quicker and simpler access to all project dataframes.
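
For the nested, project-oriented layout, a minimal sketch of how the output folder could be derived from a project path is given here. The helper name `project_output_dir` is hypothetical, and a temporary directory stands in for the Drive folder so the example can run outside Colab.

```python
import tempfile
from pathlib import Path


def project_output_dir(base_dir, project):
    """Return (and create) base_dir/ORACC_DFS/PROJECT_DFS/<project>/<subproject>."""
    out_dir = Path(base_dir) / "ORACC_DFS" / "PROJECT_DFS" / Path(*project.split("/"))
    out_dir.mkdir(parents=True, exist_ok=True)   # creates all missing parents
    return out_dir


base = Path(tempfile.mkdtemp())                  # stand-in for the Drive folder
out_dir = project_output_dir(base, "adsd/adart1")
print(out_dir.relative_to(base).as_posix())
```

The cells that follow use the flat layout instead, where the slash in the project path is replaced by a hyphen in the file name.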

In [None]:
os.makedirs(folder + 'ORACC_DFS', exist_ok=True)

In [None]:
file = project.replace("/", "-")
os.makedirs(folder + 'ORACC_DFS/PROJECT_DFS', exist_ok=True)
#note the "/" after PROJECT_DFS so that the files land inside that folder
words_df.to_csv(folder + 'ORACC_DFS/PROJECT_DFS/' + file + '-words-df.csv')
df_cat.to_csv(folder + 'ORACC_DFS/PROJECT_DFS/' + file + '-df-cat.csv')

!ls /content/drive/MyDrive/Melinee/ORACC_DFS/PROJECT_DFS

adsd-adart1-df-cat.csv	adsd-adart1-words-df.csv


With this single project example finished, we can move on to the automated version of this process.

# Automated Process for Unzipping Project Files and Exporting DataFrames
This section defines the process for unzipping all project files, parsing through the JSON files, and extracting and exporting the dataframes for each project.

## Define `unzipMultipleFiles()` Method

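The unzipMultipleFiles() method unzips every file in a given list by calling unzipFile() on each one. unzipFile() is defined again here so that this section can be run without the single-project example.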

In [11]:
def unzipFile(file, source_directory, destination):
  """
  Unzip a single ZIP file. Utility for unzipMultipleFiles().
  :param file: ZIP file name
  :param source_directory: source directory of the ZIP file
  :param destination: destination directory to put the extracted files in
  """
  if not source_directory.endswith("/"):
    source_directory = source_directory + "/"
  if not destination.endswith("/"):
    destination = destination + "/"
  file_path = source_directory + file
  print(file_path)
  #the output folder is named after the ZIP file, without the ".zip" extension
  file_name = file[:file.rfind(".zip")]
  with ZipFile(file_path, "r") as zipObj:
      zipObj.extractall(f"{destination}{file_name}")
  print(f'Unzipped {file}. See {destination}{file_name}.')

In [12]:
def unzipMultipleFiles(file_list, source_directory, destination):
  """
  Unzip multiple files. Uses unzipFile().
  :param file_list: list of ZIP file names
  :param source_directory: source directory of the ZIP files
  :param destination: destination directory to put the extracted files in
  """
  if not source_directory.endswith("/"):
    source_directory = source_directory + "/"
  if not destination.endswith("/"):
    destination = destination + "/"
  for file in file_list:
    try:
        unzipFile(file, source_directory, destination)
    except (FileNotFoundError, IOError):
        #report which file failed and continue with the rest of the list
        print(f"Could not unzip {file}. Check the file path.")
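
When many archives are processed in one call, it can help to get back a list of the files that failed instead of only printing a message. The following is a sketch of that idea under stated assumptions: the function name `unzip_all` and its return value are hypothetical and not part of this notebook, and the archives are created in a temporary directory for the demonstration.

```python
import tempfile
import zipfile
from pathlib import Path


def unzip_all(file_list, source_directory, destination):
    """Unzip each archive; return (unzipped, failed) lists of file names."""
    unzipped, failed = [], []
    for file in file_list:
        archive_path = Path(source_directory) / file
        try:
            with zipfile.ZipFile(archive_path, "r") as archive:
                archive.extractall(Path(destination) / archive_path.stem)
            unzipped.append(file)
        except (FileNotFoundError, zipfile.BadZipFile) as error:
            # Keep going, but remember which archive could not be read.
            failed.append(file)
            print(f"Skipped {file}: {error}")
    return unzipped, failed


source = Path(tempfile.mkdtemp())
with zipfile.ZipFile(source / "adsd-adart1.zip", "w") as archive:
    archive.writestr("adsd/adart1/catalogue.json", "{}")

done, missing = unzip_all(["adsd-adart1.zip", "not-there.zip"], source, source / "out")
print(done, missing)
```

The returned `failed` list can then be compared against the project list to see which downloads need to be repeated.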

## DataFrame Extraction

This part includes:
1. defining the parsejson function
2. creating a path list of all relevant projects
3. unzipping all files
4. downloading all of the projects from ORACC
5. running through all projects, parsing every usable file, and exporting the dataframes

Splitting the final path list into smaller sections [p1,p2,p3,p4,p5,p6] proved the most efficient way to run the final code cell: errors (e.g. missing files, empty corpora) surface sooner, every project can be accounted for, and no single run takes too long.
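
The sections used in this notebook are split by hand, roughly alphabetically. As an alternative, a list can also be split mechanically into fixed-size batches. A minimal sketch, with the hypothetical helper name `split_into_batches` and a short sample list taken from the project paths:

```python
def split_into_batches(projects, batch_size):
    """Split a project list into consecutive batches of at most batch_size."""
    return [projects[i:i + batch_size] for i in range(0, len(projects), batch_size)]


sample = ['adsd', 'adsd/adart1', 'adsd/adart2', 'adsd/adart3', 'adsd/adart5']
batches = split_into_batches(sample, 2)
print(batches)
```

Concatenating the batches gives back the original list, so no project is lost or duplicated by the split.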

In [13]:
#parses through json file (repeated code from section above)
def parsejson(text):
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject:
            #print('cdl in JSON')
            parsejson(JSONobject)
        if "label" in JSONobject:
            meta_d["label"] = JSONobject['label']
        if "f" in JSONobject:
            lemma = JSONobject["f"]
            #if "ftype" in JSONobject:   # you don't need this - useful for distinguishing between regular text and year names
            #    lemma['ftype'] = JSONobject['ftype']
            lemma["id_word"] = JSONobject["ref"]
            lemma['label'] = meta_d["label"]
            lemma["id_text"] = meta_d["id_text"]
            lemm_l.append(lemma)
            #print('Appending Lemma: ' + str(lemma))
        #if "strict" in JSONobject and JSONobject["strict"] == "1":
         #   lemma = {key: JSONobject[key] for key in dollar_keys}
         #   lemma["id_word"] = JSONobject["ref"]
         #   lemma["id_text"] = meta_d["id_text"]
         #   lemm_l.append(lemma)
    return

## Get the list of the projects

The ORACC_downloads notebook exports a CSV file, "project_paths_list_df", that lists the project paths. We use these paths to build a list of file names in the format 'project'.zip so that we can unzip all files.


In [14]:
#this list was found from the ORACC_ZIPS colab notebook
#includes all projects on the Oracc Project List page and the buried file paths from epsd [see final and redacted in cell below]
path_list = ['adsd','adsd/adart1','adsd/adart2','adsd/adart3','adsd/adart5','adsd/adart6','aemw','aemw/alalakh/idrimi','aemw/amarna','akklove','amgg',
             'ario','armep','arrim','asbp','asbp/ninmed','asbp/rlasb','atae','atae/assur','atae/burmarina','atae/durkatlimmu','atae/durszarrukin',
             'atae/guzana','atae/huzirina','atae/imgurenlil','atae/kalhu','atae/kunalia','atae/mallanate','atae/marqasu','atae/nineveh','atae/samal',
             'atae/szibaniba','atae/tilbarsip','atae/tuszhan','babcity','blms','borsippa','btmao','btto','cams','cams/akno','cams/anzu','cams/barutu',
             'cams/etana','cams/gkab','cams/ludlul','cams/selbi','cams/tlab','cdli','ckst','cmawro','cmawro/cmawr1','cmawro/cmawr2','cmawro/cmawr3',
             'cmawro/maqlu','contrib','contrib/amarna','contrib/lambert','ctij','dcclt','dcclt/ebla','dcclt/jena','dcclt/nineveh','dcclt/signlists',
             'dccmt','dsst','ecut','eisl','epsd2','etcsri','glass','hbtin','lacost','lovelyrics','nere','nimrud','obel','obmc','obta','ogsl','oimea',
             'pnao','qcat','riao','ribo','ribo/bab7scores','ribo/babylon10','ribo/babylon2','ribo/babylon3','ribo/babylon4','ribo/babylon5','ribo/babylon6',
             'ribo/babylon7','ribo/babylon8','ribo/sources','rimanum','rinap','rinap/rinap1','rinap/rinap2','rinap/rinap3','rinap/rinap4','rinap/rinap5',
             'rinap/rinap5p1','rinap/scores','rinap/sources','saao','saao/aebp','saao/knpp','saao/saa01','saao/saa02','saao/saa03','saao/saa04','saao/saa05',
             'saao/saa06','saao/saa07','saao/saa08','saao/saa09','saao/saa10','saao/saa11','saao/saa12','saao/saa13','saao/saa14','saao/saa15','saao/saa16',
             'saao/saa17','saao/saa18','saao/saa19','saao/saa20','saao/saa21','saao/saas2','suhu','tcma','tsae','xcat',
             'epsd2/admin', 'epsd2/earlylit', 'epsd2/literary', 'epsd2/praxis', 'epsd2/praxis/liturgy']

In [15]:
#projects alphabetical, starts with 'a'
p1 = ['adsd','adsd/adart1','adsd/adart2','adsd/adart3','adsd/adart5','adsd/adart6','aemw/alalakh/idrimi','aemw/amarna','akklove', 'ario',
      'armep','asbp','asbp/ninmed','asbp/rlasb','atae','atae/assur','atae/burmarina','atae/durkatlimmu',
      'atae/guzana','atae/huzirina','atae/imgurenlil','atae/kalhu','atae/mallanate','atae/marqasu',
      'atae/nineveh','atae/samal','atae/szibaniba','atae/tilbarsip','atae/tuszhan']


#projects alphabetical, starts with 'b' through 'e'
p2 = ['babcity','blms','borsippa','btmao','btto','cams','cams/akno','cams/anzu','cams/barutu','cams/etana','cams/ludlul',
      'cams/selbi','cams/tlab','ckst','cmawro','cmawro/cmawr1','cmawro/cmawr2','cmawro/cmawr3', 'cmawro/maqlu','contrib/amarna', 'ctij',
      'dcclt','dcclt/ebla','dcclt/jena','dcclt/nineveh','dcclt/signlists','dccmt','dsst', 'ecut', 'eisl','epsd2','etcsri',]


#projects alphabetical, 'g' through 'r'
p3 = ['glass','hbtin','lacost','nere','obel','obmc','obta','oimea','qcat','riao',
      'ribo','ribo/bab7scores','ribo/babylon10','ribo/babylon2','ribo/babylon3','ribo/babylon4','ribo/babylon5','ribo/babylon6',
      'ribo/babylon7','ribo/babylon8','ribo/sources','rimanum','rinap','rinap/rinap1','rinap/rinap2','rinap/rinap3',
      'rinap/rinap4','rinap/rinap5','rinap/rinap5p1','rinap/scores','rinap/sources',]


#projects alphabetical, 's' through 'x'
p4 = ['saao','saao/aebp','saao/knpp','saao/saa01','saao/saa02','saao/saa03','saao/saa04','saao/saa05','saao/saa06', 'saao/saa07','saao/saa08',
      'saao/saa09','saao/saa10','saao/saa11','saao/saa12','saao/saa13','saao/saa14','saao/saa15','saao/saa16','saao/saa17',
      'saao/saa18','saao/saa19','saao/saa20','saao/saa21','saao/saas2','suhu','tcma','tsae','xcat']

#the buried projects
p5 = ['epsd2/earlylit', 'epsd2/literary', 'epsd2/praxis', 'epsd2/praxis/liturgy','epsd2/admin/ed12', 'epsd2/admin/ed3b', 'epsd2/admin/lagash2',
      'epsd2/admin/oakk', 'epsd2/admin/oldbab', 'epsd2/admin/ur3']

p6 =  ["tcma/ali1","tcma/amarna","tcma/assur","tcma/barri","tcma/bazmusian","tcma/billa", "tcma/brak","tcma/chuera","tcma/emar",
      "tcma/fekheriye","tcma/giricano","tcma/hana","tcma/haradum","tcma/hatti","tcma/kalhu","tcma/kartn","tcma/kulishinas",
      "tcma/miscellaneous","tcma/nineveh","tcma/nippur","tcma/nuzi","tcma/qitar","tcma/rimah","tcma/suri",
      "tcma/taban","tcma/tsa1","tcma/tsh1","tcma/ugarit"]

pfinal = p1+p2+p3+p4+p5+p6

#redacted projects:
#amgg (no corpus), 'arrim' (no corpus), atae/durszarrukin (no corpus), atae/kunalia (no corpus), aemw (no corpus, split to idrimi and amarna), cdli (no catalogue)
#cams/gkab (has no cdl to read, but holds texts***), contrib/lambert (no corpus), lovelyrics (no corpus), nimrud (no corpus), ogsl (no corpus), pnao (no corpus)

#more information about imperfect projects
#ario [Q007131, Q007189] empty, ctij [X000533.json] empty, cdli [P002296, P005702], cmawro/cmawr3 [Q007424, Q007432] empty
#ecut [Q000000, Q006881, Q007089, Q008031, Q008040, Q008089, Q008091, Q008092, Q008217, Q008226]
#lacost [P226580, P281779, P432130, P464355, P464358] empty, saao/saa04 [P237481] empty, saao/saa07 [P335792]
#epsd2/admin/ebla does not exist.
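
The problems noted in the comments above (no corpus, no catalogue, empty text files) can also be detected before a project is parsed. The sketch below shows one way to do that with the standard `zipfile` module. The function name `check_project_zip` and the shape of its report are assumptions made for illustration, and the archive is an invented in-memory stand-in for a downloaded project ZIP.

```python
import io
import zipfile


def check_project_zip(archive, project):
    """Report problems that would break or skew the extraction loop."""
    names = archive.namelist()
    corpus = [n for n in names if "corpusjson" in n and n.endswith(".json")]
    return {
        "has_catalogue": f"{project}/catalogue.json" in names,
        "corpus_files": len(corpus),
        # Empty members have an uncompressed size of zero bytes.
        "empty_files": [n for n in corpus if archive.getinfo(n).file_size == 0],
    }


# In-memory stand-in for a downloaded project ZIP (contents are invented).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo/sub/catalogue.json", '{"members": {}}')
    archive.writestr("demo/sub/corpusjson/X000001.json", '{"cdl": []}')
    archive.writestr("demo/sub/corpusjson/X000002.json", "")

with zipfile.ZipFile(buffer) as archive:
    report = check_project_zip(archive, "demo/sub")
print(report)
```

A project with `has_catalogue` set to False or `corpus_files` equal to 0 would be a candidate for the list of excluded projects.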

## Run `unzipMultipleFiles()` Method

We recommend running this only once, since there is no need to unzip all of the files each time this notebook is run.

In [16]:
#establishes the directory to place all unzipped files
os.makedirs(folder + "ORACC_UNZIPPED", exist_ok = True)

We will use the `file_list` to run the `unzipMultipleFiles()` method.

In [None]:
files = [word + '.zip' for word in p5]
file_list = [word.replace("/", "-") for word in files]
source_directory = folder + "ORACC_zips"
destination = folder + "ORACC_UNZIPPED"
unzipMultipleFiles(file_list, source_directory, destination)

## Download the words_df and df_cat for all projects

To process one of the split sections [p1,p2,p3,p4,p5,p6], pass that list as the first parameter of oracc_download; the cell after it then parses those projects and exports their dataframes. Using pfinal makes both cells below take a long time to run.

In [None]:
os.chdir(folder + 'network/')
projects = oracc_download(pfinal,'') #DOWNLOAD REDUNDANCY

In [None]:
#dollar_keys = ["extent", "scope", "state"]

for project in projects:
  #reset the accumulators for every project so that each exported CSV
  #contains only that project's words and catalogue entries
  lemm_l = []
  meta_d = {"label": None, "id_text": None}
  used_pnums = []
  cat_d = {}
  print('Project: ' + str(project))
  z = zipfile.ZipFile('jsonzip/' + project.replace('/','-') + '.zip')
  #print(file + " does not exist or is not a proper ZIP file")
  files = z.namelist()     # list of all the files in the ZIP
  files = [name for name in files if "corpusjson" in name and name[-5:] == '.json']
  cat_file = z.read(project + '/catalogue.json').decode('utf-8')
  cat_json = json.loads(cat_file)

  cat_d.update(dict(cat_json['members']))
  #df_cat = pd.concat([df_cat,pd.DataFrame(cat_json['members']).T])                          #that holds all the P, Q, and X numbers.

  for filename in tqdm(files):       #iterate over the file names
      id_text = filename[-12:-5]
      if id_text in used_pnums:
        continue
      else:
        used_pnums.append(id_text)
      meta_d["id_text"] = id_text

      st = z.read(filename).decode('utf-8')       #read and decode the json file of one particular text
      #if empty file, skip, else run through parsejson
      if len(st) < 1:
        print(filename); #prints out the empty
      else:
        data_json = json.loads(st)                # make it into a json object (essentially a dictionary)
        #print(str(data_json))
        parsejson(data_json)               # and send to the parsejson() function
  z.close()

  df_cat = pd.DataFrame(cat_d).T
  words_df = pd.DataFrame(lemm_l)
  words_df.index.name = 'number'
  df_cat.index.name = 'id_text'
  #exports the df_cat and words_df dataFrames as csv files into a master folder
  file = project.replace("/", "-")
  os.makedirs(folder + 'ORACC_DFS/PROJECT_DFS', exist_ok=True)
  words_df.to_csv(folder + 'ORACC_DFS/PROJECT_DFS/' + file + '-words-df.csv')
  df_cat.to_csv(folder + 'ORACC_DFS/PROJECT_DFS/' + file + '-df-cat.csv')

Project: tcma/barri


100%|██████████| 4/4 [00:00<00:00, 436.76it/s]


Project: tcma/brak


100%|██████████| 5/5 [00:00<00:00, 301.99it/s]


Project: tcma/bazmusian


100%|██████████| 7/7 [00:00<00:00, 188.83it/s]


Project: tcma/tsh1


100%|██████████| 250/250 [00:01<00:00, 127.06it/s]


Project: tcma/fekheriye


100%|██████████| 14/14 [00:00<00:00, 758.09it/s]


Project: tcma/miscellaneous


100%|██████████| 24/24 [00:00<00:00, 550.60it/s]


Project: tcma/nineveh


100%|██████████| 9/9 [00:00<00:00, 102.63it/s]


Project: tcma/emar


100%|██████████| 3/3 [00:00<00:00, 494.83it/s]


Project: tcma/hana


100%|██████████| 3/3 [00:00<00:00, 175.58it/s]


Project: tcma/nippur


100%|██████████| 4/4 [00:00<00:00, 403.85it/s]


Project: tcma/giricano


100%|██████████| 15/15 [00:00<00:00, 299.33it/s]


Project: tcma/amarna


100%|██████████| 3/3 [00:00<00:00, 209.52it/s]


Project: tcma/tsa1


100%|██████████| 17/17 [00:00<00:00, 446.81it/s]


Project: tcma/kalhu


100%|██████████| 1/1 [00:00<00:00, 49.27it/s]


Project: tcma/qitar


100%|██████████| 1/1 [00:00<00:00, 100.27it/s]


Project: tcma/suri


100%|██████████| 1/1 [00:00<00:00, 114.86it/s]


Project: tcma/rimah


100%|██████████| 124/124 [00:00<00:00, 679.55it/s]


Project: tcma/chuera


100%|██████████| 97/97 [00:00<00:00, 537.80it/s]


Project: tcma/assur


 12%|█▏        | 132/1130 [00:00<00:02, 412.80it/s]

tcma/assur/corpusjson/P288859.json


100%|██████████| 1130/1130 [00:02<00:00, 420.06it/s]


Project: tcma/kartn


100%|██████████| 62/62 [00:00<00:00, 132.77it/s]


Project: tcma/ali1


100%|██████████| 24/24 [00:00<00:00, 561.29it/s]


Project: tcma/hatti


100%|██████████| 11/11 [00:00<00:00, 260.58it/s]


Project: tcma/taban


100%|██████████| 13/13 [00:00<00:00, 445.87it/s]


Project: tcma/ugarit


100%|██████████| 5/5 [00:00<00:00, 203.25it/s]


Project: tcma/haradum


100%|██████████| 2/2 [00:00<00:00, 270.04it/s]


Project: tcma/kulishinas


100%|██████████| 10/10 [00:00<00:00, 356.51it/s]


Project: tcma/nuzi


100%|██████████| 1/1 [00:00<00:00, 226.23it/s]


Project: tcma/billa


100%|██████████| 66/66 [00:00<00:00, 674.83it/s]
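
After a long run it is worth confirming that every project produced both CSV files. The sketch below checks an export folder against a project list, using the file-name pattern from the export step (`<project>-words-df.csv` and `<project>-df-cat.csv`). The helper name `missing_exports` is hypothetical, and the files are empty placeholders created in a temporary directory.

```python
import tempfile
from pathlib import Path


def missing_exports(projects, export_dir):
    """Return the projects that lack either of their two CSV exports."""
    export_dir = Path(export_dir)
    missing = []
    for project in projects:
        stem = project.replace("/", "-")
        expected = [
            export_dir / f"{stem}-words-df.csv",
            export_dir / f"{stem}-df-cat.csv",
        ]
        if not all(path.is_file() for path in expected):
            missing.append(project)
    return missing


# Placeholder exports: tcma/brak is deliberately missing its df-cat file.
export_dir = Path(tempfile.mkdtemp())
for name in ["tcma-barri-words-df.csv", "tcma-barri-df-cat.csv", "tcma-brak-words-df.csv"]:
    (export_dir / name).write_text("")

print(missing_exports(["tcma/barri", "tcma/brak"], export_dir))
```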


We have accomplished:

1. unzipping all project files from a given directory
2. checking each file of each project for content
3. exporting the words_df and df_cat dataframes of each project as CSV files