<a href="https://colab.research.google.com/github/coatless-r-n-d/colab-notes/blob/main/09-markitdown-python-package-demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MarkItDown Tutorial

This notebook demonstrates how to use the MarkItDown library to convert various document formats into Markdown text suitable for LLMs. We'll cover web pages, local files (PDF, DOCX, XLSX), input streams, and the command-line tool, and show the conversion result for each.

## Setup

First, let's install and import the necessary packages:

In [None]:
!pip install markitdown fpdf

Collecting markitdown
  Downloading markitdown-0.0.1a2-py3-none-any.whl.metadata (3.3 kB)
Collecting fpdf
  Downloading fpdf-1.7.2.tar.gz (39 kB)
  Preparing metadata (setup.py) ... done
Collecting mammoth (from markitdown)
  Downloading mammoth-1.8.0-py2.py3-none-any.whl.metadata (24 kB)
Collecting markdownify (from markitdown)
  Downloading markdownify-0.14.1-py3-none-any.whl.metadata (8.5 kB)
Collecting pathvalidate (from markitdown)
  Downloading pathvalidate-3.2.1-py3-none-any.whl.metadata (12 kB)
Collecting pdfminer-six (from markitdown)
  Downloading pdfminer.six-20240706-py3-none-any.whl.metadata (4.1 kB)
Collecting puremagic (from markitdown)
  Downloading puremagic-1.28-py3-none-any.whl.metadata (5.8 kB)
Collecting pydub (from markitdown)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-pptx (from markitdown)
  Downloading python_pptx-1.0.2-py3-none-any.whl.metadata (2.5 kB)
Collecting speechrecognition (from markitdown)
  Downlo

Upgrade the copy of `python-docx` that ships with Google Colab:

In [None]:
!pip install --upgrade python-docx



In [None]:
from markitdown import MarkItDown
import requests
import tempfile
import os

## Basic Usage

Let's create a MarkItDown instance that we'll use throughout this tutorial:

In [None]:
converter = MarkItDown()

## Converting Web Content

### HTML Pages

Let's convert a simple webpage to markdown:

In [None]:
# Convert a webpage
url = "https://example.com"
result = converter.convert(url)
print(result.text_content[:500] + "...")


# Example Domain

This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.

[More information...](https://www.iana.org/domains/example)

...
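The preview above appends `...` even when the converted text is shorter than the cutoff. A minimal sketch of a gentler preview helper follows; the `preview` function is hypothetical and not part of MarkItDown.

```python
def preview(text: str, limit: int = 500) -> str:
    """Return at most `limit` characters, adding an ellipsis only if text was cut."""
    if len(text) <= limit:
        return text
    # Prefer cutting at the last space before the limit so no word is split.
    cut = text.rfind(" ", 0, limit)
    if cut <= 0:
        cut = limit
    return text[:cut].rstrip() + "..."


print(preview("short text"))              # short input is returned unchanged
print(preview("word " * 200, limit=20))   # long input is cut at a word boundary
```

With this helper, `print(preview(result.text_content))` could replace the manual slice.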


### Wikipedia Articles

MarkItDown has special handling for Wikipedia pages, so the output starts with the article title followed by the article body:

In [None]:
wiki_url = "https://en.wikipedia.org/wiki/Markdown"
result = converter.convert(wiki_url)
print(result.text_content[:500] + "...")

# Markdown

Plain text markup language
For the marketing term, see [Price markdown](/wiki/Price_markdown "Price markdown").

Markdown
|  | |
| --- | --- |
| [Filename extensions](/wiki/Filename_extension "Filename extension") | `.md`, `.markdown`[[1]](#cite_note-df-2022-1)[[2]](#cite_note-rfc7763-2) |
| [Internet media type](/wiki/Media_type "Media type") | `text/markdown`[[2]](#cite_note-rfc7763-2) |
| [Uniform Type Identifier (UTI)](/wiki/Uniform_Type_Identifier "Uniform Type Identifier") | `n...


## Working with Local Files

### PDF Files

Let's create a sample PDF and convert it:

In [None]:
# Create a temporary PDF file
from fpdf import FPDF

pdf = FPDF()
pdf.add_page()
pdf.set_font("Arial", size=12)
pdf.cell(200, 10, txt="Test PDF Document", ln=1, align="C")
pdf.multi_cell(0, 10, txt="This is a sample PDF document created for testing MarkItDown conversion.")

with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as tmp:
    pdf_path = tmp.name
    pdf.output(pdf_path)

# Convert the PDF
result = converter.convert(pdf_path)
print(result.text_content)

# Clean up
os.unlink(pdf_path)

This is a sample PDF document created for testing MarkItDown conversion.

Test PDF Document
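The create, convert, delete pattern used above leaves the temporary file behind if the conversion raises before `os.unlink` runs. One possible way to guarantee cleanup is a small context manager; `temp_path` is a hypothetical helper name, not something provided by MarkItDown.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_path(suffix: str):
    """Yield the path of a closed temporary file and delete it afterwards."""
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    tmp.close()  # close our handle so other libraries can write to the path
    try:
        yield tmp.name
    finally:
        # Runs on normal exit and on exceptions alike.
        if os.path.exists(tmp.name):
            os.unlink(tmp.name)


with temp_path(".pdf") as pdf_path:
    print(os.path.exists(pdf_path))  # True while inside the block
print(os.path.exists(pdf_path))      # False once the block exits
```

Inside the `with` block you would write the file (for example `pdf.output(pdf_path)`) and call `converter.convert(pdf_path)`.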




### Word Documents (DOCX)

Let's create and convert a Word document:

In [None]:
from docx import Document

# Create a sample Word document
doc = Document()
doc.add_heading('Sample Document', 0)
doc.add_paragraph('This is a paragraph in the document.')
doc.add_heading('Section 1', level=1)
doc.add_paragraph('Content in section 1.')

with tempfile.NamedTemporaryFile(suffix='.docx', delete=False) as tmp:
    docx_path = tmp.name
    doc.save(docx_path)

# Convert the document
result = converter.convert(docx_path)
print(result.text_content)

# Clean up
os.unlink(docx_path)



Sample Document

This is a paragraph in the document.

# Section 1

Content in section 1.




### Comments in Word Documents

Next, let's explore whether we can extract comments from a Word document.

In [None]:
from docx import Document
from markitdown import MarkItDown
import docx
from docx.oxml.shared import qn, OxmlElement

def add_comment(doc, paragraph, comment_text):
    # Get the paragraph element
    p = paragraph._p

    # Create comment
    comment = OxmlElement("w:comment")
    comment.set(qn("w:id"), "1")
    comment.set(qn("w:author"), "Author")
    comment.set(qn("w:date"), "2024-01-01T12:00:00")
    comment.set(qn("w:initials"), "A")

    # Add comment text
    comment_p = OxmlElement("w:p")
    comment_r = OxmlElement("w:r")
    comment_t = OxmlElement("w:t")
    comment_t.text = comment_text
    comment_r.append(comment_t)
    comment_p.append(comment_r)
    comment.append(comment_p)

    # Make sure we have a comments part
    if not doc.part.comments_part:
        doc.part.add_comments_part()

    # Add comment to document
    doc.part.comments_part._element.append(comment)

    # Create comment range start
    comment_start = OxmlElement("w:commentRangeStart")
    comment_start.set(qn("w:id"), "1")
    p.addprevious(comment_start)

    # Create comment reference
    comment_ref = OxmlElement("w:commentReference")
    comment_ref.set(qn("w:id"), "1")
    r = p.find(qn("w:r"))
    if r is not None:
        r.append(comment_ref)

    # Create comment range end
    comment_end = OxmlElement("w:commentRangeEnd")
    comment_end.set(qn("w:id"), "1")
    p.addnext(comment_end)

# Create document
doc = Document()
doc.add_heading('Document with Comments', 0)
p = doc.add_paragraph('This is the main text. It should have a comment attached.')

# Add comment
add_comment(doc, p, "This is a comment on the paragraph.")

# Save document
doc.save('test_with_comments2.docx')

# Convert and print
converter = MarkItDown()
result = converter.convert('test_with_comments2.docx')
print("Converted content:")
print(result.text_content)

Converted content:


Document with Comments

This is the main text. It should have a comment attached.




Unfortunately, it does not look like it! The comment text is missing from the converted output. We're keeping an eye on this issue.
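Until comments appear in the converted text, they can be read straight from the file. A `.docx` file is a ZIP archive, and Word stores comments in the `word/comments.xml` part. The sketch below uses only the standard library; `read_docx_comments` is a hypothetical helper, and it returns an empty list when the archive has no comments part.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in DOCX files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_docx_comments(path):
    """Return (author, text) tuples from word/comments.xml, or [] if absent."""
    with zipfile.ZipFile(path) as archive:
        if "word/comments.xml" not in archive.namelist():
            return []
        root = ET.fromstring(archive.read("word/comments.xml"))

    comments = []
    for comment in root.iter(f"{{{W_NS}}}comment"):
        author = comment.get(f"{{{W_NS}}}author", "")
        # Join every text run inside the comment body.
        text = "".join(t.text or "" for t in comment.iter(f"{{{W_NS}}}t"))
        comments.append((author, text))
    return comments
```

Calling `read_docx_comments('test_with_comments2.docx')` should list the comment only if the comments part was actually written to the file, which is worth checking before blaming the converter.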



### Excel Spreadsheets (XLSX)

Now let's create and convert an Excel spreadsheet:

In [None]:
import pandas as pd

# Create sample data
df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Bob'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris']
})

# Save to Excel
with tempfile.NamedTemporaryFile(suffix='.xlsx', delete=False) as tmp:
    xlsx_path = tmp.name
    df.to_excel(xlsx_path, index=False)

# Convert the spreadsheet
result = converter.convert(xlsx_path)
print(result.text_content)

# Clean up
os.unlink(xlsx_path)

## Sheet1
| Name | Age | City |
| --- | --- | --- |
| John | 25 | New York |
| Jane | 30 | London |
| Bob | 35 | Paris |
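In the output above the worksheet appears under a `## Sheet1` heading. Assuming every worksheet in a larger workbook is emitted under its own `## <sheet name>` heading in the same way (an assumption based only on this one-sheet output), the converted text could be split per sheet as sketched below. `split_sheets` is a hypothetical helper.

```python
def split_sheets(markdown_text: str) -> dict:
    """Map each '## <sheet name>' heading to the Markdown that follows it."""
    sheets = {}
    current = None
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sheets[current] = []
        elif current is not None:
            sheets[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sheets.items()}


# Hand-written sample in the same shape as the converter output above.
sample = (
    "## Sheet1\n| Name | Age |\n| --- | --- |\n| John | 25 |\n\n"
    "## Sheet2\n| City |\n| --- |\n| Paris |\n"
)
for name, table in split_sheets(sample).items():
    print(name, "->", len(table.splitlines()), "lines")
```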


## Stream Conversion

MarkItDown can also convert from binary input streams. A stream has no filename, so pass `file_extension` as a hint about the format:

In [None]:
from io import BytesIO

# Create a sample text stream
text = "Hello, this is a test stream!\nIt has multiple lines.\n"
stream = BytesIO(text.encode('utf-8'))

# Convert the stream, specifying the file extension
result = converter.convert_stream(stream, file_extension='.txt')  # Note the added file_extension parameter
print(result.text_content)

Hello, this is a test stream!
It has multiple lines.
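When the bytes come from an upload or a download, the original filename is often still known. A small wrapper can derive the `file_extension` argument from it. This is a sketch: `convert_bytes` is a hypothetical helper, and it only relies on the `convert_stream(stream, file_extension=...)` call shown above.

```python
import os
from io import BytesIO


def convert_bytes(converter, data: bytes, filename: str):
    """Convert in-memory bytes, using the filename only to infer the extension."""
    extension = os.path.splitext(filename)[1].lower()
    if not extension:
        raise ValueError(f"cannot infer a file extension from {filename!r}")
    return converter.convert_stream(BytesIO(data), file_extension=extension)
```

For the example above, `convert_bytes(converter, text.encode('utf-8'), 'notes.txt')` would make the same call as the cell does by hand.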



## Error Handling

Let's see how MarkItDown handles errors:

In [None]:
try:
    # Try to convert a non-existent file
    result = converter.convert('nonexistent.pdf')
except Exception as e:
    print(f"Error: {str(e)}")

Error: local variable 'res' referenced before assignment
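The message above says nothing about the real problem, which is that the file does not exist. With the version installed in this notebook, a simple guard is to check the path before converting. The sketch below is one way to do that; `convert_local_file` is a hypothetical helper and is meant for local paths only, since `convert` also accepts URLs.

```python
import os


def convert_local_file(converter, path: str):
    """Convert a local file, failing early with a clear error if it is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No such file: {path!r}")
    return converter.convert(path)
```

Calling `convert_local_file(converter, 'nonexistent.pdf')` then raises a `FileNotFoundError` that names the missing path.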


## Working with Optional Features



## CLI Usage

First, we need to create a sample document:

In [13]:
# Create a sample DOCX file
import docx
doc = docx.Document()
doc.add_heading('Sample Document')
doc.add_paragraph('This is a test document for trying out the markitdown CLI tool.')
doc.save('sample.docx')

From there, we can call the `markitdown` CLI in three equivalent ways:

In [12]:
%%bash
# Convert a file:
markitdown sample.docx

# Convert using pipe:
cat sample.docx | markitdown

# Convert using input redirection:
markitdown < sample.docx


# Sample Document

This is a test document for trying out the markitdown CLI tool.



# Sample Document

This is a test document for trying out the markitdown CLI tool.



# Sample Document

This is a test document for trying out the markitdown CLI tool.
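The CLI writes the Markdown to standard output, as the three results above show. To keep the result, the output can be captured from Python and written to a file. The following is a sketch under that assumption only; `cli_to_file` and its `command` parameter are hypothetical, and no CLI flags beyond the input path are used.

```python
import subprocess


def cli_to_file(input_path, output_path, command=("markitdown",)):
    """Run a converter CLI on input_path and save its stdout to output_path."""
    completed = subprocess.run(
        [*command, input_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write(completed.stdout)
    return len(completed.stdout)
```

For the sample document, `cli_to_file('sample.docx', 'sample.md')` would store the converted text in `sample.md`.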




### Using a Custom Requests Session

You can use a custom requests session with custom headers:

In [None]:
session = requests.Session()
session.headers.update({
    'User-Agent': 'MarkItDown Tutorial/1.0'
})

converter_with_session = MarkItDown(requests_session=session)
result = converter_with_session.convert('https://example.com')
print(result.text_content[:200] + "...")


# Example Domain

This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.

[More information...](https...


## Cleanup

Remember to properly close and clean up any resources:

In [None]:
# Close the requests session
session.close()

## Using with LLMs

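Here's an example of how you might use MarkItDown with an LLM to describe an image. It needs the `openai` package and an `OPENAI_API_KEY` environment variable; without the key, the client raises the error shown after the code. The `mlm_client` and `mlm_model` argument names match the 0.0.1a2 release installed above; later releases use `llm_client` and `llm_model`: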

In [None]:
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(mlm_client=client, mlm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)

OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable
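The error above comes from the OpenAI client, not from MarkItDown: no API key was configured. A small check before creating the client gives a clearer message. This is a sketch; `get_openai_api_key` is a hypothetical helper, and the only thing it assumes is the `OPENAI_API_KEY` variable named in the error.

```python
import os


def get_openai_api_key(env=None) -> str:
    """Return the OpenAI API key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = (env.get("OPENAI_API_KEY") or "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before creating the OpenAI client."
        )
    return key
```

The key it returns can be passed explicitly, for example `OpenAI(api_key=get_openai_api_key())`, which is the other option the error message mentions.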

## Conclusion

We've seen how `MarkItDown` turns web pages, PDFs, Word documents, spreadsheets, and streams into text that LLMs can work with. That opens the door to building document Q&A systems, analyzing research papers and legal documents, extracting text from presentations, processing images and audio recordings, and much more.


Stay up to date with the project over at <https://github.com/microsoft/markitdown>