# Downloading OSzK MEK data as PDFs

To build a Hungarian LLM training corpus, we collect and clean publicly available books that are no longer under licence protection, or never were. These books are catalogued in the MEK (Magyar Elektronikus Könyvtár) web application of OSzK (Országos Széchényi Könyvtár) and can be downloaded freely.

## Preparation
To download all of the PDFs, make sure you have:
- 280+ GByte of free disk space
- about 2 days for a single-threaded download (the MEK server limits downloads to 1 MByte/s)
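
The free-space requirement can be checked before starting the download. The snippet below is a minimal sketch using the standard library; the helper name `has_enough_space` and the decimal interpretation of "GByte" (10^9 bytes) are assumptions made for illustration.

```python
import shutil

# 280 GByte from the list above, interpreted as decimal gigabytes (assumption)
REQUIRED_BYTES = 280 * 10**9


def has_enough_space(path: str = ".", required: int = REQUIRED_BYTES) -> bool:
    """Return True if the filesystem holding `path` has at least `required` free bytes."""
    return shutil.disk_usage(path).free >= required


if not has_enough_space("."):
    print("Warning: less than 280 GByte free in the current directory")
```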

Install the necessary tools:
- beautifulsoup4 (to parse HTML pages for metadata)

Then create the directory that will store the documents.

In [1]:
!pip install beautifulsoup4



In [7]:
!mkdir hun_books

## Do the actual download
Fortunately, the site identifies every document with an auto-incrementing integer.
Downloadable books have IDs from 0 to 25338. The URLs under which those books are accessible
are also easy to derive from the ID:
- info page: https://mek.oszk.hu/01200/01234/
- PDF document: https://mek.oszk.hu/01200/01234/01234.pdf
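
The URL pattern can be expressed as a small helper. This is a sketch based only on the two example URLs above: the parent directory is the ID rounded down to the nearest hundred, and both parts are zero-padded to five digits. The function name `book_urls` is hypothetical.

```python
from typing import Tuple

BASE_URL = "https://mek.oszk.hu"


def book_urls(book_id: int) -> Tuple[str, str]:
    """Return (info page URL, PDF URL) for a MEK book ID."""
    bucket = (book_id // 100) * 100  # e.g. 1234 -> 1200
    info_url = f"{BASE_URL}/{bucket:05}/{book_id:05}/"
    pdf_url = f"{info_url}{book_id:05}.pdf"
    return info_url, pdf_url


info_url, pdf_url = book_urls(1234)
print(info_url)
print(pdf_url)
```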

For some books no PDF version is available, so the request returns either HTTP 404 or an HTML error page instead of a document.
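
Both failure modes can be detected from the response itself. The sketch below takes the status code, the `Content-Type` header and the body as plain values, so it can be tried without network access. The function name `looks_like_pdf` is hypothetical, and the additional check for the `%PDF-` file signature goes beyond the header check used in the download code of this notebook.

```python
def looks_like_pdf(status_code: int, content_type: str, body: bytes) -> bool:
    """Return True only for a successful response that actually carries a PDF."""
    if status_code != 200:
        return False  # e.g. HTTP 404 for a missing document
    # Drop optional parameters such as "; charset=utf-8" before comparing
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "application/pdf":
        return False  # e.g. an HTML error page served with status 200
    # PDF files start with the signature "%PDF-"
    return body.startswith(b"%PDF-")


print(looks_like_pdf(200, "application/pdf", b"%PDF-1.4\n"))
print(looks_like_pdf(200, "text/html; charset=utf-8", b"<html>"))
```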

What we need to do is:
- in a for loop, compute (guess) the PDF URL for each ID
- try to download it
- if the download was successful, scrape the corresponding info page for metadata

In [5]:
import requests
import os
import json
from datetime import date, datetime
from typing import List
from bs4 import BeautifulSoup

cntr_download = 0
cntr_metadata = 0

class Metadata:
    def __init__(self, tags: List[str], author: str, title: str, release_date: date):
        self.tags = tags
        self.author = author
        self.title = title
        self.release_date = release_date

    def to_dict(self):
        return {
            "tags": self.tags,
            "author": self.author,
            "title": self.title,
            "release_date": self.release_date.isoformat()  # Convert date to string
        }
    
    def to_json(self):
        return json.dumps(self.to_dict(), indent=4, ensure_ascii=False)

def parse_metadata(content):
    soup = BeautifulSoup(content, "html.parser")
    authorelement = soup.find("h4", class_="author")
    if authorelement is None:
        authortext = ""
    else:
        authortext = authorelement.text
    title = soup.find("h3", class_="title")
    topic = soup.find("div", class_="topic")
    subtopic = soup.find("div", class_="subtopic")
    reldate = soup.find("div", class_="reldate")
    return Metadata(
        tags=[topic.text, subtopic.text],
        author=authortext,
        title=title.text,
        release_date=datetime.fromisoformat(reldate.text).date()
    )    

def download_metadata(url, folder, filename):
    global cntr_metadata
    try:
        # A timeout keeps a stalled connection from blocking the whole run
        page = requests.get(url, timeout=60)
        page.raise_for_status()
        metadata = parse_metadata(page.content)
        with open(f"{folder}/{filename}.json", 'w', encoding='utf-8') as file:
            file.write(metadata.to_json())
        cntr_metadata += 1
    except Exception as e:
        # Report which book failed and why instead of hiding the cause
        print(f"Metadata error: {filename} ({e})")

def download_pdf(url, folder, filename):
    global cntr_download
    path = f"{folder}/{filename}.pdf"
    if os.path.exists(path):
        # Already downloaded in an earlier run
        cntr_download += 1
        return
    try:
        response = requests.get(f"{url}{filename}.pdf", timeout=60)
        response.raise_for_status()
        # The header may be missing or carry parameters, so compare only its start
        content_type = response.headers.get("content-type", "")
        if not content_type.startswith("application/pdf"):
            raise ValueError(f"Unexpected content type: {content_type}")
        os.makedirs(folder, exist_ok=True)
        with open(path, 'wb') as file:
            file.write(response.content)
        cntr_download += 1
        print(f"Download complete: {filename}")
        download_metadata(url.replace("/pdf/", "/"), folder, filename)
    except Exception as e:
        # Some books keep their PDF in a "pdf/" subdirectory; retry there once
        if "/pdf/" not in url:
            download_pdf(f"{url}pdf/", folder, filename)
        else:
            print(f"Not found/not PDF: {filename}")

for i in range(1, 26000):
    # Integer division groups the IDs into directories of 100 books each
    url = f"https://mek.oszk.hu/{i // 100 * 100:05}/{i:05}/"
    folder = f"hun_books/{i // 100:03}"
    filename = f"{i:05}"
    download_pdf(url, folder, filename)

print(f"All/downloaded/metadata : 25338/{cntr_download}/{cntr_metadata}")

Not found/not PDF: 26000
Not found/not PDF: 26001
Not found/not PDF: 26002
Not found/not PDF: 26003
Not found/not PDF: 26004
Not found/not PDF: 26005
Not found/not PDF: 26006
Not found/not PDF: 26007
Not found/not PDF: 26008
Not found/not PDF: 26009
Not found/not PDF: 26010
Not found/not PDF: 26011
Not found/not PDF: 26012
Not found/not PDF: 26013
Not found/not PDF: 26014
Not found/not PDF: 26015
Not found/not PDF: 26016
Not found/not PDF: 26017
Not found/not PDF: 26018
Not found/not PDF: 26019
Not found/not PDF: 26020
Not found/not PDF: 26021
Not found/not PDF: 26022
Not found/not PDF: 26023
Not found/not PDF: 26024
Not found/not PDF: 26025
Not found/not PDF: 26026
Not found/not PDF: 26027
Not found/not PDF: 26028
Not found/not PDF: 26029
Not found/not PDF: 26030
Not found/not PDF: 26031
Not found/not PDF: 26032
Not found/not PDF: 26033
Not found/not PDF: 26034
Not found/not PDF: 26035
Not found/not PDF: 26036
Not found/not PDF: 26037
Not found/not PDF: 26038
Not found/not PDF: 26039


KeyboardInterrupt: 