The way to install VikParuchuri/marker on Windows 10. #12

Closed
avan06 opened this issue Dec 1, 2023 · 15 comments

@avan06
avan06 commented Dec 1, 2023

The most challenging aspect of installing Marker on Windows lies in the detectron2 package developed by Facebook Research. Facebook Research is not very Windows-friendly, and they basically do not support or provide installation guidance for Windows.

The following records the process of installing VikParuchuri/marker on Windows 10.


To install the detectron2 package on Windows, you need to clone detectron2 and make some modifications before installation:

  1. Compilation of detectron2 requires a C/C++ compiler. I have MSVC (Visual Studio 2022) cl.exe in my environment, and you must have a similar C/C++ compiler in your environment.
    Visual Studio Download: https://visualstudio.microsoft.com/vs/community/

  2. Compilation of detectron2 requires NVIDIA CUDA's nvcc. You must install the CUDA Toolkit first. I installed version 12.3.
    CUDA Toolkit Download: https://developer.nvidia.com/cuda-downloads

  3. The torch package may also need to be installed. I installed the latest version provided by PyTorch:
    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

  4. Install wheel:
    pip install wheel

  5. Clone detectron2:
    git clone https://github.com/facebookresearch/detectron2.git

  6. Apply Viliami's fix for the "identifier 'single_box_iou_rotated' is undefined" error. (See the comment in facebookresearch/detectron2#1601, "nvcc.exe failed with exit status 1: Problem installing detectron2 on Windows 10".)

  7. Install the local clone of detectron2:
    pip install -e detectron2

If everything goes smoothly, detectron2 should be installed. If there are any issues, you'll need to check the error logs for further investigation.
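
Before starting the build, the prerequisites from steps 1 to 4 can be checked with a short script. This is only a minimal sketch: the helper name find_build_tools is hypothetical (it is not part of detectron2 or marker), and it assumes cl.exe and nvcc are expected on PATH. On Windows, cl.exe is typically on PATH only inside a Visual Studio developer command prompt.

```python
import shutil


def find_build_tools(names=("cl", "nvcc", "git")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_build_tools().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")

    # torch is optional here; report its CUDA build only if it is installed.
    try:
        import torch
    except ImportError:
        print("torch: not installed")
    else:
        # torch.version.cuda is None for CPU-only builds.
        print(f"torch {torch.__version__}, CUDA build: {torch.version.cuda}")
```

If cl or nvcc is reported as missing, the detectron2 build will most likely fail at the compile step, so it is worth fixing PATH first.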


Installing the Windows versions of Tesseract and Ghostscript

  1. Install Tesseract OCR on Windows:
    run tesseract-ocr-w64-setup-5.3.3.20231005.exe or a newer version
    https://digi.bib.uni-mannheim.de/tesseract/

  2. Install Ghostscript on Windows:
    run gs10021w64.exe or a newer version
    https://ghostscript.readthedocs.io/en/gs10.02.0/Install.html


Installing VikParuchuri/marker

  1. git clone https://github.com/VikParuchuri/marker.git
  2. Remove detectron2 from VikParuchuri/marker/requirements.txt and install it manually using the aforementioned steps
  3. The nougat entry in VikParuchuri/marker/requirements.txt points to the wrong repository. Remove it from requirements.txt and install the package developed by facebookresearch (https://github.com/facebookresearch/nougat) instead:
    pip install nougat-ocr
  4. Install the missing dependencies.
    pip install -r requirements.txt
    pip install ftfy
    pip install spellchecker
    pip install pyspellchecker
    pip install ocrmypdf
    pip install nltk
    pip install thefuzz
    pip uninstall python-magic
    pip install python-magic-bin
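
After the pip commands above, a quick check can show which of the packages are still not importable. This is a rough sketch under stated assumptions: the helper missing_modules is hypothetical, and the list uses import names rather than pip names (pyspellchecker is imported as spellchecker, python-magic-bin as magic).

```python
import importlib.util

# Import names (not pip names) for the packages installed above.
# Assumption: pyspellchecker is imported as "spellchecker" and
# python-magic-bin as "magic".
MODULES = ("ftfy", "spellchecker", "ocrmypdf", "nltk", "thefuzz", "magic")


def missing_modules(names=MODULES):
    """Return the import names that cannot be found in this environment."""
    # find_spec only locates a top-level module; it does not import it.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```

Note that finding a module does not guarantee that it loads: magic, for example, can still fail at import time if the libmagic DLL is absent.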
@avan06
Author

avan06 commented Dec 1, 2023

In the process of converting multiple files (convert.py), I found that the ray package was missing.
However, installing the latest version resulted in an error. It worked fine after installing version 2.7.1.

pip install ray==2.7.1

@vasnt

vasnt commented Dec 2, 2023

Can it be installed using Pinokio? That seems like a convenient method.
https://twitter.com/cocktailpeanut/status/1730635095654953039

@SimonB97

SimonB97 commented Dec 4, 2023

This would be great, or something similar. The above process is way too complicated for the average user, unless it wasn't meant for Windows users.

@VikParuchuri
Owner

Thank you for documenting these steps! I think I may have caused some confusion here - requirements.txt was an old development artifact, and the actual install process is to use poetry install (like in the README). This should be a lot simpler than the process above.

I'll remove requirements.txt now.

@SimonB97

SimonB97 commented Dec 6, 2023

Hi, could you please clarify for me which of the steps above will be replaced by doing poetry install to make it run on Windows? Thanks!

@VikParuchuri
Owner

@SimonB97 I don't have windows, so I can't test, but poetry install should replace all of the steps except the tesseract/ghostscript install.

@SuperMaxine

Is there any way to make the installation run through docker? Having a complete image should alleviate a lot of the manual work, but I'm not quite sure how nvcc runs on Windows via docker desktop. The only way I know of is to use the NVIDIA tools kit image as the base image.

@goapurva

Need some help. I am on a Windows 10 machine.
I am trying to run convert_single.py as follows. I do have pydantic version 1.10.13 installed. But I am getting an error related to pydantic.

(bsrp310) C:\Users\Starlord\marker>pip show pydantic
Name: pydantic
Version: 1.10.13
Summary: Data validation and settings management using python type hints
Home-page: https://github.com/pydantic/pydantic
Author: Samuel Colvin
Author-email: s@muelcolvin.com
License: MIT
Location: c:\programdata\anaconda3\envs\bsrp310\lib\site-packages
Requires: typing-extensions
Required-by: label-studio

(bsrp310) C:\Users\Starlord\marker>python convert_single.py .\6941.pdf .\6941_converted.md --parallel_factor 2 --max_pages 10
Traceback (most recent call last):
  File "C:\Users\Starlord\marker\convert_single.py", line 3, in <module>
    from marker.convert import convert_single_pdf
  File "C:\Users\Starlord\marker\marker\convert.py", line 3, in <module>
    from marker.cleaners.table import merge_table_blocks, create_new_tables
  File "C:\Users\Starlord\marker\marker\cleaners\table.py", line 2, in <module>
    from marker.schema import Line, Span, Block, Page
  File "C:\Users\Starlord\marker\marker\schema.py", line 4, in <module>
    from pydantic import BaseModel, field_validator
ImportError: cannot import name 'field_validator' from 'pydantic' (C:\ProgramData\anaconda3\envs\bsrp310\lib\site-packages\pydantic\__init__.cp310-win_amd64.pyd)

@VikParuchuri
Owner

The pyproject.toml file (listing the package that will be installed with poetry install) lists pydantic 2 as a requirement. You're using pydantic 1. Did you install with poetry install?
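
The mismatch can be confirmed from inside the active environment. The following is a minimal sketch, not part of marker: the helpers major_version and pydantic_status are hypothetical names, and the check relies only on the fact that field_validator was introduced in pydantic 2.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a dotted version such as '1.10.13'."""
    return int(version_string.split(".")[0])


def pydantic_status():
    """Describe whether the installed pydantic can provide field_validator."""
    try:
        installed = version("pydantic")
    except PackageNotFoundError:
        return "pydantic is not installed"
    if major_version(installed) < 2:
        return f"pydantic {installed} found; field_validator needs pydantic 2"
    return f"pydantic {installed} looks compatible"


if __name__ == "__main__":
    print(pydantic_status())
```

Running this in the environment from the traceback above should report the 1.x version that pip show printed.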

@goapurva

Thanks for the reply. The pydantic version I have came with Label Studio, which I am using for image annotation. I think I'll have to start over by creating a new virtual environment to avoid the conflict.

Thanks again!

@goapurva

Hi @VikParuchuri

This time I tried with poetry. Everything seems to have gone ok.

However, when I tried to run convert_single.py, I am getting the following error, towards the bottom of the page:

(marker-py3.10) (base) C:\Users\Starlord\marker>python convert_single.py .\6941.pdf .\6941_output.md --parallel_factor 2 --max_pages 10
preprocessor_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 502/502 [00:00<?, ?B/s]
C:\Users\Starlord\AppData\Local\pypoetry\Cache\virtualenvs\marker-T8AhNGQT-py3.10\lib\site-packages\huggingface_hub\file_download.py:147: UserWarning: huggingface_hub cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\Starlord\.cache\huggingface\hub. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the HF_HUB_DISABLE_SYMLINKS_WARNING environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.61k/1.61k [00:00<?, ?B/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 798k/798k [00:00<00:00, 12.8MB/s]
merges.txt: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 14.6MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.11M/2.11M [00:00<00:00, 10.4MB/s]
added_tokens.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75.0/75.0 [00:00<?, ?B/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 267/267 [00:00<?, ?B/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25.6k/25.6k [00:00<?, ?B/s]
added_tokens.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.06k/3.06k [00:00<?, ?B/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.79k/2.79k [00:00<?, ?B/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
preprocessor_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 502/502 [00:00<?, ?B/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.61k/1.61k [00:00<?, ?B/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 798k/798k [00:00<00:00, 10.2MB/s]
merges.txt: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 12.2MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.11M/2.11M [00:00<00:00, 12.2MB/s]
added_tokens.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75.0/75.0 [00:00<?, ?B/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 267/267 [00:00<?, ?B/s]
Traceback (most recent call last):
  File "C:\Users\Starlord\marker\convert_single.py", line 3, in <module>
    from marker.convert import convert_single_pdf
  File "C:\Users\Starlord\marker\marker\convert.py", line 18, in <module>
    import magic
  File "C:\Users\Starlord\AppData\Local\pypoetry\Cache\virtualenvs\marker-T8AhNGQT-py3.10\lib\site-packages\magic\__init__.py", line 209, in <module>
    libmagic = loader.load_lib()
  File "C:\Users\Starlord\AppData\Local\pypoetry\Cache\virtualenvs\marker-T8AhNGQT-py3.10\lib\site-packages\magic\loader.py", line 49, in load_lib
    raise ImportError('failed to find libmagic. Check your installation')
ImportError: failed to find libmagic. Check your installation

@VikParuchuri
Owner

This is a system requirement (see the brew packages that need to be installed). I don't know how to install this on windows, but it should be possible. You'll need to install the other system requirements, too.
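
Whether a libmagic shared library is discoverable at all can be probed with ctypes. This is a hedged sketch: find_libmagic is a hypothetical helper, and the candidate names are an assumption about how libmagic builds are commonly named, not the exact list that python-magic searches.

```python
import ctypes.util

# Assumption: common file-name stems for libmagic builds on different systems.
CANDIDATES = ("magic", "libmagic", "magic1", "libmagic-1")


def find_libmagic(candidates=CANDIDATES):
    """Return the first libmagic library found by the system loader, else None."""
    for name in candidates:
        path = ctypes.util.find_library(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print("libmagic:", find_libmagic() or "not found")
```

If nothing is found on Windows, one option mentioned earlier in this thread is to uninstall python-magic and install python-magic-bin, which bundles the library.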

@goapurva

Thanks @VikParuchuri -- I will look into it.

@scillidan

scillidan commented Dec 25, 2023

I used Python 3.9 (the scoop python39 package) on Windows 10.

Install requirements:

scoop install python39
pip install poetry
scoop install ghostscript tesseract

Get the tessdata_best:

git clone https://github.com/tesseract-ocr/tessdata_best

Then:

git clone https://github.com/VikParuchuri/marker
cd marker
python(39) -m venv venv
venv\Scripts\activate.bat
poetry install

I used GPU, so:

pip uninstall torch

Install CUDA, then install the corresponding version of torch. I used CUDA 11.8, but other versions should also work.

Create the local.env:

TORCH_DEVICE=cuda
TESSDATA_PREFIX=WhereIs\tessdata_best
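
The file holds plain KEY=VALUE pairs. As an illustration of the format only, here is a minimal sketch of reading such a file. The function parse_env is hypothetical and is not how marker itself loads local.env.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    sample = "TORCH_DEVICE=cuda\nTESSDATA_PREFIX=WhereIs\\tessdata_best\n"
    print(parse_env(sample))
```

WhereIs is a placeholder for the directory where tessdata_best was cloned.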

Then:

python convert_single.py in.pdf out.md

It told me to "Enable your device for development" (Windows Developer Mode).

Finally, it printed some other warnings, but it worked well:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
D:\binr\pdf_marker\venv\lib\site-packages\torch\functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\TensorShape.cpp:3527.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

@preyasprathap

I am having issues when running poetry install

poetry install

  RuntimeError

  The Poetry configuration is invalid:
    - Additional properties are not allowed ('group' was unexpected)


  at /usr/lib/python3/dist-packages/poetry/core/factory.py:43 in create_poetry
       39│             message = ""
       40│             for error in check_result["errors"]:
       41│                 message += "  - {}\n".format(error)
       42│
    →  43│             raise RuntimeError("The Poetry configuration is invalid:\n" + message)
       44│
       45│         # Load package
       46│         name = local_config["name"]
       47│         version = local_config["version"]
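
One possible cause, not confirmed in this thread: dependency groups (the 'group' key in pyproject.toml) require Poetry 1.2 or newer, and the /usr/lib/python3/dist-packages path suggests a distribution-packaged Poetry that may be older. A rough sketch for checking this follows; parse_poetry_version and supports_groups are hypothetical helper names.

```python
import re
import subprocess


def parse_poetry_version(output):
    """Extract (major, minor) from `poetry --version` output, or None."""
    match = re.search(r"(\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def supports_groups(output):
    """True if the reported Poetry version is at least 1.2."""
    parsed = parse_poetry_version(output)
    return parsed is not None and parsed >= (1, 2)


if __name__ == "__main__":
    try:
        out = subprocess.run(
            ["poetry", "--version"], capture_output=True, text=True, check=False
        ).stdout
    except FileNotFoundError:
        print("poetry not found on PATH")
    else:
        print(out.strip(), "-> groups supported:", supports_groups(out))
```

If the check reports an older version, installing a current Poetry with the official installer or pipx, rather than the system package, may resolve the error.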
