The way to install VikParuchuri/marker on Windows 10. #12

avan06 · 2023-12-01T15:39:24Z

The most challenging part of installing Marker on Windows is the detectron2 package developed by Facebook Research. detectron2 is not very Windows-friendly: it has essentially no official Windows support or installation guidance.

The following records the process of installing VikParuchuri/marker on Windows 10.

To install the detectron2 package on Windows, you need to clone detectron2 and make some modifications before installation:

Compilation of detectron2 requires a C/C++ compiler. I have MSVC (Visual Studio 2022) cl.exe in my environment, and you must have a similar C/C++ compiler in your environment.
Visual Studio Download: https://visualstudio.microsoft.com/vs/community/
Compilation of detectron2 requires NVIDIA CUDA's nvcc. You must install the CUDA Toolkit first. I installed version 12.3.
CUDA Toolkit Download: https://developer.nvidia.com/cuda-downloads
The torch package may also need to be installed. I installed the latest version provided by PyTorch:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Install wheel:
pip install wheel
Clone detectron2:
git clone https://github.com/facebookresearch/detectron2.git
Apply Viliami's fix for the "identifier 'single_box_iou_rotated' is undefined" error (see the comment in facebookresearch/detectron2#1601, "nvcc.exe failed with exit status 1: Problem installing detectron2 on Windows 10").
Install the local detectron2 clone:
pip install -e detectron2

If everything goes smoothly, detectron2 should be installed. If there are any issues, you'll need to check the error logs for further investigation.
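
Before starting the build, it can help to confirm that the tools named in the steps above are reachable from the current shell. The following is a minimal sketch, not part of the original steps: the function name `missing_tools` and the tool list are illustrative assumptions. Note that cl.exe is normally only on PATH inside a Visual Studio developer command prompt, so a "missing" result for cl may just mean the wrong shell is open.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Tools mentioned in the steps above: git for cloning, cl (MSVC) for C/C++,
# and nvcc from the CUDA Toolkit. The list itself is an assumption of this sketch.
REQUIRED_TOOLS = ("git", "cl", "nvcc")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("git, cl and nvcc were all found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.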

Install the Windows versions of Tesseract and Ghostscript:

To install Tesseract OCR on Windows, run tesseract-ocr-w64-setup-5.3.3.20231005.exe or a newer version:
https://digi.bib.uni-mannheim.de/tesseract/
To install Ghostscript on Windows, run gs10021w64.exe or a newer version:
https://ghostscript.readthedocs.io/en/gs10.02.0/Install.html

Installing VikParuchuri/marker

git clone https://github.com/VikParuchuri/marker.git
Remove detectron2 from VikParuchuri/marker/requirements.txt and install it manually using the steps above.
The nougat entry in VikParuchuri/marker/requirements.txt installs the wrong package. Remove it from requirements.txt and install the one developed by facebookresearch (https://github.com/facebookresearch/nougat) instead:
pip install nougat-ocr
Install the missing dependencies.
pip install -r requirements.txt
pip install ftfy
pip install spellchecker
pip install pyspellchecker
pip install ocrmypdf
pip install nltk
pip install thefuzz
pip uninstall python-magic
pip install python-magic-bin
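
The manual edit to requirements.txt described above (dropping the detectron2 and nougat entries) could also be scripted. This is a rough sketch under assumptions: the exact form of those entries in requirements.txt is not shown in this thread, so the matcher below simply drops any non-comment line that mentions either name as a whole token, and the helper names are hypothetical.

```python
import re
from pathlib import Path
from typing import Iterable, List

# Entries the steps above say to remove and install by hand.
DROP = ("detectron2", "nougat")


def mentions(line: str, name: str) -> bool:
    """True if `name` appears in `line` as a whole token (so 'nougat-ocr' is not 'nougat')."""
    pattern = r"(?<![A-Za-z0-9_\-])" + re.escape(name) + r"(?![A-Za-z0-9_\-])"
    return re.search(pattern, line, flags=re.IGNORECASE) is not None


def filter_requirements(lines: Iterable[str], drop: Iterable[str] = DROP) -> List[str]:
    """Return the lines with every non-comment line mentioning a dropped name removed."""
    names = tuple(drop)
    kept = []
    for line in lines:
        stripped = line.strip()
        is_comment = stripped.startswith("#")
        if not is_comment and any(mentions(stripped, name) for name in names):
            continue
        kept.append(line)
    return kept


if __name__ == "__main__":
    path = Path("requirements.txt")
    if path.exists():
        for line in filter_requirements(path.read_text(encoding="utf-8").splitlines()):
            print(line)
    else:
        print("requirements.txt not found in the current directory.")
```

Printing the filtered list rather than overwriting the file keeps the original intact; redirect the output to a new file if the result looks right.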

avan06 · 2023-12-01T16:34:32Z

In the process of converting multiple files (convert.py), I found that the ray package was missing.
However, installing the latest version resulted in an error. It worked fine after installing version 2.7.1.

pip install ray==2.7.1

vasnt · 2023-12-02T03:00:51Z

Could it be installed using Pinokio? It seems like a convenient method.
https://twitter.com/cocktailpeanut/status/1730635095654953039

SimonB97 · 2023-12-04T20:42:15Z

> can be installed using pinokio as it seems convenient method? https://twitter.com/cocktailpeanut/status/1730635095654953039

This would be great, or something similar. The above process is way too complicated for the average user, unless it wasn't meant for Windows users.

VikParuchuri · 2023-12-05T21:23:55Z

Thank you for documenting these steps! I think I may have caused some confusion here - requirements.txt was an old development artifact, and the actual install process is to use poetry install (like in the README). This should be a lot simpler than the process above.

I'll remove requirements.txt now.

SimonB97 · 2023-12-06T12:16:30Z

> Thank you for documenting these steps! I think I may have caused some confusion here - requirements.txt was an old development artifact, and the actual install process is to use poetry install (like in the README). This should be a lot simpler than the process above.
>
> I'll remove requirements.txt now.

Hi, could you please clarify for me which of the steps above will be replaced by doing poetry install to make it run on Windows? Thanks!

VikParuchuri · 2023-12-06T16:56:34Z

@SimonB97 I don't have windows, so I can't test, but poetry install should replace all of the steps except the tesseract/ghostscript install.

SuperMaxine · 2023-12-11T11:41:42Z

Is there any way to run the installation through Docker? Having a complete image should alleviate a lot of the manual work, but I'm not quite sure how nvcc runs on Windows via Docker Desktop. The only way I know of is to use the NVIDIA toolkit image as the base image.

goapurva · 2023-12-12T16:04:58Z

Need some help. I am on a Windows 10 machine.
I am trying to run convert_single.py as follows. I do have pydantic version 1.10.13 installed. But I am getting an error related to pydantic.

(bsrp310) C:\Users\Starlord\marker>pip show pydantic
Name: pydantic
Version: 1.10.13
Summary: Data validation and settings management using python type hints
Home-page: https://github.com/pydantic/pydantic
Author: Samuel Colvin
Author-email: s@muelcolvin.com
License: MIT
Location: c:\programdata\anaconda3\envs\bsrp310\lib\site-packages
Requires: typing-extensions
Required-by: label-studio

(bsrp310) C:\Users\Starlord\marker>python convert_single.py .\6941.pdf .\6941_converted.md --parallel_factor 2 --max_pages 10
Traceback (most recent call last):
  File "C:\Users\Starlord\marker\convert_single.py", line 3, in <module>
    from marker.convert import convert_single_pdf
  File "C:\Users\Starlord\marker\marker\convert.py", line 3, in <module>
    from marker.cleaners.table import merge_table_blocks, create_new_tables
  File "C:\Users\Starlord\marker\marker\cleaners\table.py", line 2, in <module>
    from marker.schema import Line, Span, Block, Page
  File "C:\Users\Starlord\marker\marker\schema.py", line 4, in <module>
    from pydantic import BaseModel, field_validator
ImportError: cannot import name 'field_validator' from 'pydantic' (C:\ProgramData\anaconda3\envs\bsrp310\lib\site-packages\pydantic\__init__.cp310-win_amd64.pyd)

VikParuchuri · 2023-12-12T17:08:48Z

The pyproject.toml file (listing the package that will be installed with poetry install) lists pydantic 2 as a requirement. You're using pydantic 1. Did you install with poetry install?
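
A quick way to check which pydantic is active in the current environment is sketched below. It is only an illustration: the helper names are made up for this example, and the check rests on the fact that `field_validator` was introduced in pydantic 2 (pydantic 1 used `validator`), which matches the ImportError shown above.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Leading integer of a version string such as '1.10.13'."""
    return int(version.split(".", 1)[0])


def installed_major(package: str) -> Optional[int]:
    """Major version of an installed package, or None if it is not installed."""
    try:
        return major_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


def supports_field_validator(major: Optional[int]) -> bool:
    """field_validator exists from pydantic 2 onward."""
    return major is not None and major >= 2


if __name__ == "__main__":
    major = installed_major("pydantic")
    if supports_field_validator(major):
        print(f"pydantic {major}.x found; field_validator should import.")
    else:
        print(f"pydantic major version is {major}; marker needs pydantic 2.")
```

Running this inside the same environment that runs convert_single.py shows whether another package (Label Studio in the case above) has pinned pydantic 1.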

goapurva · 2023-12-12T17:35:50Z

> The pyproject.toml file (listing the package that will be installed with poetry install) lists pydantic 2 as a requirement. You're using pydantic 1. Did you install with poetry install?

Thanks for the reply. The pydantic version I have came with Label Studio, which I am using for image annotation. I think I'll have to start over by creating a new virtual environment to avoid the conflict.

Thanks again!

goapurva · 2023-12-12T18:57:36Z

Hi @VikParuchuri

This time I tried with poetry. Everything seems to have gone ok.

However, when I tried to run convert_single.py, I am getting the following error, towards the bottom of the page:

(marker-py3.10) (base) C:\Users\Starlord\marker>python convert_single.py .\6941.pdf .\6941_output.md --parallel_factor 2 --max_pages 10
preprocessor_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 502/502 [00:00<?, ?B/s]
C:\Users\Starlord\AppData\Local\pypoetry\Cache\virtualenvs\marker-T8AhNGQT-py3.10\lib\site-packages\huggingface_hub\file_download.py:147: UserWarning: huggingface_hub cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\Starlord\.cache\huggingface\hub. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the HF_HUB_DISABLE_SYMLINKS_WARNING environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.61k/1.61k [00:00<?, ?B/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 798k/798k [00:00<00:00, 12.8MB/s]
merges.txt: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 14.6MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.11M/2.11M [00:00<00:00, 10.4MB/s]
added_tokens.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75.0/75.0 [00:00<?, ?B/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 267/267 [00:00<?, ?B/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25.6k/25.6k [00:00<?, ?B/s]
added_tokens.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.06k/3.06k [00:00<?, ?B/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.79k/2.79k [00:00<?, ?B/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
preprocessor_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 502/502 [00:00<?, ?B/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.61k/1.61k [00:00<?, ?B/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 798k/798k [00:00<00:00, 10.2MB/s]
merges.txt: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 12.2MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.11M/2.11M [00:00<00:00, 12.2MB/s]
added_tokens.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75.0/75.0 [00:00<?, ?B/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 267/267 [00:00<?, ?B/s]
Traceback (most recent call last):
  File "C:\Users\Starlord\marker\convert_single.py", line 3, in <module>
    from marker.convert import convert_single_pdf
  File "C:\Users\Starlord\marker\marker\convert.py", line 18, in <module>
    import magic
  File "C:\Users\Starlord\AppData\Local\pypoetry\Cache\virtualenvs\marker-T8AhNGQT-py3.10\lib\site-packages\magic\__init__.py", line 209, in <module>
    libmagic = loader.load_lib()
  File "C:\Users\Starlord\AppData\Local\pypoetry\Cache\virtualenvs\marker-T8AhNGQT-py3.10\lib\site-packages\magic\loader.py", line 49, in load_lib
    raise ImportError('failed to find libmagic. Check your installation')
ImportError: failed to find libmagic. Check your installation

VikParuchuri · 2023-12-12T18:59:39Z

This is a system requirement (see the brew packages that need to be installed). I don't know how to install this on windows, but it should be possible. You'll need to install the other system requirements, too.
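
One way to tell whether libmagic is the only thing missing is to try the import on its own. The sketch below is an illustration, not a confirmed fix: `probe_import` is a hypothetical helper, and the hint it prints simply repeats the python-magic-bin swap from the first comment in this thread, which has not been verified against a poetry-based install.

```python
import importlib
from typing import Optional, Tuple


def probe_import(module_name: str) -> Tuple[bool, Optional[str]]:
    """Try to import a module; return (True, None) or (False, error message)."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        # python-magic raises ImportError when it cannot locate libmagic,
        # as seen in the traceback above.
        return False, str(exc)
    return True, None


if __name__ == "__main__":
    ok, error = probe_import("magic")
    if ok:
        print("import magic works; libmagic was found.")
    else:
        print("import magic failed:", error)
        print("Earlier in this thread, python-magic was replaced with "
              "python-magic-bin on Windows; that may be worth trying.")
```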

goapurva · 2023-12-12T19:01:10Z

Thanks @VikParuchuri -- I will look into it.

scillidan · 2023-12-25T02:20:19Z

I used Python 3.9 on Windows 10.

Install requirements:

scoop install python39
pip install poetry
scoop install ghostscript tesseract

Get the tessdata_best:

git clone https://github.com/tesseract-ocr/tessdata_best

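Then clone marker and create a virtual environment with the Python 3.9 interpreter (written here as python; the executable name may differ on your system):

git clone https://github.com/VikParuchuri/marker
cd marker
python -m venv venv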
venv\Scripts\activate.bat
poetry install

I use a GPU, so:

pip uninstall torch

Install CUDA, then install the corresponding version of torch. I used CUDA 11.8, but other versions should also work.
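
After reinstalling torch, a short check like the following can confirm whether the CUDA build is the one actually being imported. It is a sketch only; the function name is made up, and it falls back to a plain message when torch is absent so it can be run in any environment.

```python
def describe_torch_device() -> str:
    """Summarise whether torch is installed and whether it can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    if torch.cuda.is_available():
        # torch.version.cuda is the CUDA version the wheel was built against.
        name = torch.cuda.get_device_name(0)
        return f"torch {torch.__version__} (CUDA {torch.version.cuda}), device: {name}"
    return f"torch {torch.__version__} is installed, but CUDA is not available"


if __name__ == "__main__":
    print(describe_torch_device())
```

If the message reports that CUDA is not available, the installed wheel is probably a CPU build or does not match the installed CUDA version.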

Create the local.env:

TORCH_DEVICE=cuda
TESSDATA_PREFIX=WhereIs\tessdata_best
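
To catch typos in local.env before a long conversion run, the file can be sanity-checked with a few lines of Python. This is a simplified sketch and an assumption throughout: it is not how marker itself reads the file, it only understands plain KEY=VALUE lines, and the helper names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# The two settings shown above.
REQUIRED_KEYS = ("TORCH_DEVICE", "TESSDATA_PREFIX")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: Dict[str, str], required: Iterable[str] = REQUIRED_KEYS) -> List[str]:
    """Return required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path("local.env")
    if not env_path.exists():
        print("local.env not found in the current directory.")
    else:
        values = parse_env(env_path.read_text(encoding="utf-8"))
        print("Missing or empty keys:", missing_keys(values) or "none")
        prefix = values.get("TESSDATA_PREFIX", "")
        if prefix and not Path(prefix).is_dir():
            print("TESSDATA_PREFIX does not point to an existing directory:", prefix)
```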

Then:

python convert_single.py in.pdf out.md

It told me to enable Developer Mode ("Enable your device for development").
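
The huggingface_hub warning quoted earlier in this thread explains the prompt: the cache uses symlinks, which on Windows need Developer Mode or an administrator shell, and caching still works without them at the cost of extra disk space. Whether the current shell can create symlinks can be probed directly. The snippet below is a minimal sketch with a hypothetical function name.

```python
import os
import tempfile


def can_create_symlinks() -> bool:
    """Try to create a symlink in a temporary directory and report whether it worked."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        with open(target, "w", encoding="utf-8") as handle:
            handle.write("probe")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            # On Windows this typically means Developer Mode is off
            # and the shell is not elevated.
            return False
        return os.path.islink(link)


if __name__ == "__main__":
    if can_create_symlinks():
        print("Symlinks can be created from this shell.")
    else:
        print("Symlinks are not available from this shell.")
```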

Finally, it printed a few more warnings, but it worked well:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
D:\binr\pdf_marker\venv\lib\site-packages\torch\functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\TensorShape.cpp:3527.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

preyasprathap · 2024-01-14T20:31:35Z

I am having issues when running poetry install:

poetry install

  RuntimeError

  The Poetry configuration is invalid:
    - Additional properties are not allowed ('group' was unexpected)


  at /usr/lib/python3/dist-packages/poetry/core/factory.py:43 in create_poetry
       39│             message = ""
       40│             for error in check_result["errors"]:
       41│                 message += "  - {}\n".format(error)
       42│
    →  43│             raise RuntimeError("The Poetry configuration is invalid:\n" + message)
       44│
       45│         # Load package
       46│         name = local_config["name"]
       47│         version = local_config["version"]
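
A likely explanation, not confirmed in this thread: dependency groups in pyproject.toml were added in Poetry 1.2, and older releases reject the `group` key with exactly this kind of validation error. The path in the traceback (/usr/lib/python3/dist-packages) suggests a distribution-packaged Poetry, which is often older. The sketch below checks the installed version; the helper names are illustrative.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

Version = Tuple[int, int, int]


def parse_poetry_version(output: str) -> Optional[Version]:
    """Extract (major, minor, patch) from `poetry --version` output, if present."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def supports_dependency_groups(version: Optional[Version]) -> bool:
    """Dependency groups are available from Poetry 1.2.0 onward."""
    return version is not None and version >= (1, 2, 0)


if __name__ == "__main__":
    if shutil.which("poetry") is None:
        print("poetry was not found on PATH.")
    else:
        result = subprocess.run(
            ["poetry", "--version"], capture_output=True, text=True, check=False
        )
        version = parse_poetry_version(result.stdout)
        print("Detected Poetry version:", version)
        if not supports_dependency_groups(version):
            print("This Poetry may be too old for pyproject.toml files that use groups.")
```

If the version turns out to be below 1.2, installing a current Poetry release (for example with pip, as done earlier in this thread) would be the first thing to try.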

avan06 mentioned this issue Dec 1, 2023

Windows Installation? #5

Closed

nekiee13 mentioned this issue Dec 11, 2023

Missing requirements.txt in repository, and indexer error on execution #41

Closed

VikParuchuri closed this as completed Dec 13, 2023

nekiee13 mentioned this issue Dec 24, 2023

How to use it on windows? #45

Closed
