How to parse embedded file (OLE object) in pptx/docx #644

Closed
1 of 2 tasks
hong1997 opened this issue Dec 3, 2019 · 7 comments

Comments

@hong1997

hong1997 commented Dec 3, 2019

Before submitting an issue, please fill this out

Is this a:

  • Issue with the OpenXml library
  • Question on library usage

How can I parse embedded files (OLE objects) in pptx/docx?
They are mostly OLE objects, such as object1.bin.
Is there a good way to parse them?
After unzipping the OLE objects, I found several kinds of formats:
[four screenshots; images not preserved]

I couldn't find a good general way to do this.
I checked the source code of the Tika parser; it extracts them with a rule-based method...

@ashahabov
Contributor

Use the following code example to get the OLE objects from the first slide of a presentation:

using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using P = DocumentFormat.OpenXml.Presentation;

public static IEnumerable<P.GraphicFrame> GetOleObjects(string pptxFilePath)
{
    using (var doc = PresentationDocument.Open(pptxFilePath, false))
    {
        // Takes the first slide part in the package, which is not
        // necessarily the first slide in presentation order.
        var sld = doc.PresentationPart.SlideParts.First().Slide;

        // OLE objects are stored inside graphic frame elements.
        // ToList() materializes the result before the document is disposed.
        return sld.CommonSlideData.ShapeTree
            .OfType<P.GraphicFrame>()
            .Where(frame => frame.Descendants<P.OleObject>().Any())
            .ToList();
    }
}

@hong1997
Author

hong1997 commented Dec 8, 2019

Hi adamshakhabov, thanks for your reply! As far as I know, the OLE object is stored in the embedded object parts (X.MainDocumentPart.EmbeddedObjectParts), and I am asking for a way to parse the OLE object, not just retrieve it.
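
For reference, reaching the raw embedded parts mentioned above could look roughly like the sketch below. It only copies each part's bytes to disk and does not interpret the OLE data. The class name `EmbeddedPartDumper`, its method names, and the output directory parameter are hypothetical and chosen for illustration; the sketch assumes the Open XML SDK package is referenced.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;

// Hypothetical helper, not part of the Open XML SDK.
public static class EmbeddedPartDumper
{
    // Saves the embedded OLE object parts and embedded package parts
    // of a .docx file into outputDir.
    public static void DumpFromDocx(string docxPath, string outputDir)
    {
        using (var doc = WordprocessingDocument.Open(docxPath, false))
        {
            var main = doc.MainDocumentPart;
            Save(main.EmbeddedObjectParts.Cast<OpenXmlPart>()
                     .Concat(main.EmbeddedPackageParts), outputDir);
        }
    }

    // Does the same for every slide of a .pptx file.
    public static void DumpFromPptx(string pptxPath, string outputDir)
    {
        using (var doc = PresentationDocument.Open(pptxPath, false))
        {
            foreach (var slidePart in doc.PresentationPart.SlideParts)
            {
                // GetPartsOfType<T>() is available on every part container.
                Save(slidePart.GetPartsOfType<EmbeddedObjectPart>().Cast<OpenXmlPart>()
                         .Concat(slidePart.GetPartsOfType<EmbeddedPackagePart>()), outputDir);
            }
        }
    }

    private static void Save(IEnumerable<OpenXmlPart> parts, string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        foreach (var part in parts)
        {
            // part.Uri is the part's path inside the package.
            var fileName = Path.GetFileName(part.Uri.OriginalString);
            using (var source = part.GetStream())
            using (var target = File.Create(Path.Combine(outputDir, fileName)))
            {
                source.CopyTo(target);
            }
        }
    }
}
```

The saved files are still in whatever container format the embedding application wrote, so this covers only the "getting" half of the question, not the parsing.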

@ashahabov
Contributor

Hi @hong1997!

I don't think the Open XML SDK has a specific method for reading an OLEObject element (parsing its properties). Could you say more precisely which feature of the OLEObject you are trying to parse?

Also, it would help if you attached a pptx file that contains this OLEObject case.

@ThomasBarnekow
Collaborator

@hong1997 and @adamshakhabov, GitHub issues are not the place to ask and discuss questions regarding Open XML SDK library usage. You should ask usage-related questions on stackoverflow.com, where you will already find a large number of questions and answers tagged with openxml or openxml-sdk.

In this specific case, another user already asked about how he could extract OLE-embedded files from Word documents, and I provided an accepted answer.

@hong1997
Author

hong1997 commented Dec 8, 2019

@ThomasBarnekow, thanks for the info; I will close the issue. However, the answer you provided handles only one kind of OLE structure. As you can see from my description, only the last kind of OLE object can be handled by the class you provided.
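
Since embedded objects can arrive in more than one container format, a rule-based first step (in the spirit of the Tika approach mentioned earlier) could be to inspect the leading bytes of each extracted file before choosing a parser. The sketch below is only an illustration under that assumption: the class name `EmbeddedFormatSniffer`, the returned labels, and the choice of three signatures are hypothetical, and the list is far from complete.

```csharp
using System.Linq;

// Hypothetical helper: guesses the container format from well-known magic bytes.
public static class EmbeddedFormatSniffer
{
    // OLE compound file (structured storage) header signature.
    private static readonly byte[] OleCompoundFile =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // ZIP local file header "PK\x03\x04" (also used by docx/xlsx/pptx packages).
    private static readonly byte[] Zip = { 0x50, 0x4B, 0x03, 0x04 };

    // "%PDF"
    private static readonly byte[] Pdf = { 0x25, 0x50, 0x44, 0x46 };

    public static string Sniff(byte[] data)
    {
        if (StartsWith(data, OleCompoundFile)) return "ole-compound-file";
        if (StartsWith(data, Zip)) return "zip";
        if (StartsWith(data, Pdf)) return "pdf";
        return "unknown";
    }

    private static bool StartsWith(byte[] data, byte[] signature) =>
        data.Length >= signature.Length &&
        data.Take(signature.Length).SequenceEqual(signature);
}
```

A result of "ole-compound-file" only identifies the outer container; its inner streams would still have to be read with a compound-file reader to recover the actual payload.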

@lindexi
Contributor

lindexi commented Feb 29, 2020

Some OLE objects can be shown as a WMF image because they contain a fallback element. Here is my code that saves the fallback element to a file: https://github.com/lindexi/lindexi_gd/tree/d182ca9f0cece56d32a801923a1fdffa64f95dfd/NallwerewawchailawileeForeehakel

Some OLE objects can be converted using WinForms; see the DotNet Heaven article "Read OLE Object type image field in C#.net".

@twsouthwick
Member

Thanks everyone for an interesting discussion. This looks to have been resolved so I'll close the issue.
