
Cleaned, parsed, and analyzed Hebrew textual corpus of documents pertaining to construction, planning, and architecture.


bdar-lab/heb_architecture_corpus


Hebrew textual corpus on construction, planning, and architecture

Curated, processed, parsed, and analyzed by the Big Data in Architectural Research Lab, Faculty of Architecture and Town Planning, Technion - IIT

Primary researcher: Dr. Or Aleksandrowicz
Project supervisors: Dr. Daniel Rosenberg, Dr. Omri Shafer-Raviv
Project advisors: Dr. Noam Ordan, Dr. Nick Howell
Project assistants: Dina El Qasem, Hodaya Saada, Mai Sabbah, Sherry-Atara Khasdan, Naama Koren, Shiran-Ester Shnaiderman

The construction industry is one of the main economic sectors in Israel, and it is expected to maintain its central position in the coming decades in light of the country's rapid population growth. Unlike in many developed countries, where new construction is slow due to low population growth, the built-up area in Israel doubles every 25 years. The creation of a Hebrew textual corpus on construction, planning, and architecture is expected to facilitate and expedite the development of NLP-based tools for application and assimilation in technological fields related to the construction industry.

The corpus consists of Hebrew documents from a wide variety of contemporary and historical sources, including legislative decrees, regulatory guidelines, research reports, academic studies, and professional journals. In developing the corpus, we used both born-digital and scanned printed publications, which went through a process of optical character recognition (OCR), cleaning, and tagging. Tagging was performed using the Trankit Python toolkit.

The corpus holds 22,382,594 words in 1,218 documents.

Using the CQPweb interface, you can browse the corpus contents and run textual queries on its items. To access this system, please contact omrish@technion.ac.il.

This work was supported by the Israel Innovation Authority. The corpus is available for all types of use in NLP research and development under the CC BY 4.0 license (Attribution 4.0 International).

We wish to thank Vicky Davydov, Lena Avrahami, and Shai Zack from the Library of the Faculty of Architecture and Town Planning (Technion), as well as Moti Yeger, Director of the Technion's Central Library, and Prof. Rafael Sacks, Head of the National Building Research Institute, for the help they have provided to the project since its inception.

Cite: Aleksandrowicz, O., Rosenberg, D., Shafer-Raviv, O., Ordan, N. (2024). Hebrew textual corpus on construction, planning, and architecture. GitHub. https://github.com/bdar-lab/heb_architecture_corpus.

Contents

  • /conllu/ - Morphologically analyzed texts in CoNLL-U format
  • /txt/ - Plain text
  • full_IIA_corpus_8.2.2024.csv - Full corpus metadata
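
As a rough illustration of how the files under /conllu/ might be read, the sketch below parses the standard ten-column CoNLL-U layout using only the Python standard library and counts universal part-of-speech tags. It is not part of the repository: the function names (`parse_conllu`, `upos_counts`, `iter_corpus`) are hypothetical, and the `*.conllu` file extension is an assumption that should be checked against the actual file names.

```python
from collections import Counter
from pathlib import Path

# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # skip lines that do not have exactly ten columns
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


def upos_counts(sentences):
    """Count UPOS tags over syntactic words only."""
    counts = Counter()
    for sentence in sentences:
        for token in sentence:
            # Multiword-token ranges ("1-2") and empty nodes ("1.1")
            # have non-integer IDs and are skipped.
            if token["id"].isdigit():
                counts[token["upos"]] += 1
    return counts


def iter_corpus(conllu_dir):
    """Yield (file name, parsed sentences) for each file in a directory.

    Assumption: the files use a ".conllu" extension.
    """
    for path in sorted(Path(conllu_dir).glob("*.conllu")):
        yield path.name, parse_conllu(path.read_text(encoding="utf-8"))
```

Skipping multiword-token ranges matters for Hebrew in particular, where prefixed prepositions, conjunctions, and the definite article are typically split from the surface word, so counting every row would count such words twice. With a local clone, something like `for name, sents in iter_corpus("conllu"): print(name, upos_counts(sents).most_common(5))` would print the most frequent tags per document.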
