How do i extract images from pdf file #26

Closed
muntasirhossain1 opened this issue Apr 3, 2024 · 2 comments

@muntasirhossain1

I want to extract all the images.

@dmester
Owner

dmester commented Apr 6, 2024

There is no direct API for extracting images, but you can get hold of them by implementing a custom ImageResolver. It's a bit of a hack, but here is a working example:

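using System.IO;
using System.Threading;
using PdfToSvg;

// Nest this class and Main inside your own program class.
// Saves every image the SVG converter encounters to outputDirectory.
private class ImageExtractor : ImageResolver
{
    private readonly string outputDirectory;
    private int count;

    public ImageExtractor(string outputDirectory)
    {
        this.outputDirectory = outputDirectory;
        // Make sure the output directory exists before any image is written.
        Directory.CreateDirectory(outputDirectory);
    }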

    public override string ResolveImageUrl(Image image, CancellationToken cancellationToken)
    {
        var content = image.GetContent(cancellationToken);
        // Anything that is not reported as JPEG gets a .png extension.
        var extension = image.ContentType == "image/jpeg" ? ".jpeg" : ".png";
        var outputFileName = "image" + ++count + extension;
        var outputPath = Path.Combine(outputDirectory, outputFileName);

        File.WriteAllBytes(outputPath, content);

        // The returned string is used as the image URL in the generated SVG.
        return outputFileName;
    }
}

public static void Main()
{
    var inputFile = "<enter path to PDF here>";
    var outputDir = "<enter path to output directory here>";

    using (var doc = PdfDocument.Open(inputFile))
    {
        var options = new SvgConversionOptions
        {
            ImageResolver = new ImageExtractor(outputDir),
        };

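        foreach (var page in doc.Pages)
        {
            // The SVG itself is discarded; converting the page is only
            // done to make the converter call ImageExtractor for each image.
            page.ToSvgString(options);
        }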
    }
}

I'll see if I can add a dedicated API for accessing images in a future version.

dmester added a commit that referenced this issue Apr 20, 2024
@dmester
Owner

dmester commented Apr 20, 2024

There is now a dedicated API for accessing images from a PDF:

using (var document = PdfDocument.Open("input.pdf"))
{
    var imageNo = 1;

    foreach (var image in document.Images)
    {
        var content = image.GetContent();
        var fileName = $"image{imageNo++}{image.Extension}";
        File.WriteAllBytes(fileName, content);
    }
}

This was added in version 1.3.0.

@dmester dmester closed this as completed Apr 20, 2024