# Document Extraction

**ATTENTION** This is a *brute force* workflow, meaning that it makes an LLM call for every piece of text that is being analyzed. 
This can be **expensive** 💰💰💰, so use at your own risk and monitor your costs!

---------------

Here, we'll be extracting content from a longer document.

The basic workflow is the following:

1. Load the document.
2. Optionally clean it up.
3. Split it into chunks.
4. Run extraction on each chunk.
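
The four steps above can be sketched without any LLM at all. This is a minimal illustration, not kor's implementation: every function name is hypothetical, and `extract` is a regex stand-in for the LLM call so that the example runs offline.

```python
import re
from typing import Dict, List


def load(raw: str) -> str:
    """Step 1: load the document (here it is already an in-memory string)."""
    return raw


def clean(text: str) -> str:
    """Step 2: optional clean-up; here we only collapse runs of blank lines."""
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def chunk(text: str) -> List[str]:
    """Step 3: split into pieces; here, one piece per paragraph."""
    return [piece for piece in text.split("\n\n") if piece]


def extract(piece: str) -> List[Dict[str, str]]:
    """Step 4: stand-in for the LLM call; pulls markdown links with a regex."""
    return [
        {"name": name, "link": link}
        for name, link in re.findall(r"\[([^\]]+)\]\(([^)]+)\)", piece)
    ]


raw = "[Beef](/tv/beef)\n\n\n\n[Dave](/tv/dave)\n\n[Swarm](/tv/swarm)"
pieces = chunk(clean(load(raw)))

# One extract call per piece: this is the "brute force" part of the workflow.
records = [record for piece in pieces for record in extract(piece)]
print(len(pieces), records)
```

In the real workflow, step 4 is the expensive one, because each piece triggers a separate LLM request.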

We'll apply this workflow to an HTML file.

For our clean up, we'll reduce HTML to markdown. This is a lossy step, which can sometimes improve extraction results, and sometimes make extraction worse.
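
To see why the conversion is lossy, consider data that lives in an HTML attribute rather than in visible text. The sketch below uses the standard library's `html.parser` as a stand-in for the markdown conversion (it is not what `MarkdownifyHTMLProcessor` does internally), and the HTML fragment with its `data-score` attribute is made up for illustration.

```python
from html.parser import HTMLParser
from typing import List


class TextOnly(HTMLParser):
    """Collect visible text and drop every tag and attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())


# Hypothetical fragment: the score lives only in an attribute.
html = '<a href="/tv/beef" data-score="98"><span>Beef</span></a>'

parser = TextOnly()
parser.feed(html)
parser.close()
text = " ".join(parser.parts)

print(text)          # the title survives
print("98" in html)  # the score is present in the raw HTML
print("98" in text)  # and absent from the cleaned text
```

A markdown conversion keeps more than this stand-in does (links, for example), but anything stored only in attributes or element structure is likely to be dropped in the same way.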


When scraping HTML, you may need to execute JavaScript, and in some cases you may also need an additional sleep period to give the JavaScript time to run.

Here's a piece of code that executes JavaScript using Playwright:

```python
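import asyncio

from playwright.async_api import async_playwright


async def a_download_html(url: str, extra_sleep: int) -> str:
    """Download the rendered HTML of a page, optionally sleeping extra_sleep seconds."""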

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="load")
        if extra_sleep:
            await asyncio.sleep(extra_sleep)
        html_content = await page.content()
        await browser.close()
    return html_content
```

Another possibility is to use LangChain's Selenium URL loader: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/url.html#selenium-url-loader

---------
 
Again, this can be **expensive** 💰💰💰, so use at your own risk and monitor your costs!

In [1]:
%load_ext autoreload
%autoreload 2

import sys

sys.path.insert(0, "../../")

In [2]:
from typing import List, Optional
import itertools
import requests

import pandas as pd
from pydantic import BaseModel, Field, validator
from kor import extract_from_documents, from_pydantic, create_extraction_chain
from kor.documents.html import MarkdownifyHTMLProcessor
from langchain import OpenAI
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

## LLM

Instantiate an LLM. 

Try experimenting with the cheaper davinci models or with `gpt-3.5-turbo` before trying the more expensive `text-davinci-003` or GPT-4.

In some cases, providing a better prompt (with more examples) can help make up for using a smaller model. 

In [3]:
llm = OpenAI(
    model_name="text-davinci-003",
    temperature=0,
    max_tokens=-1,
)

## Schema

In [17]:
class ShowOrMovie(BaseModel):
    name: str = Field(
        description="The name of the movie or tv show",
    )
    season: Optional[str] = Field(
        description="Season of TV show. Extract as a digit stripping Season prefix.",
    )
    year: Optional[str] = Field(
        description="Year when the movie / tv show was released",
    )
    latest_episode: Optional[str] = Field(
        description="Date when the latest episode was released",
    )
    link: Optional[str] = Field(description="Link to the movie / tv show.")

    # rating -- not included because rating on rottentomatoes is in the html elements
    # you could try extracting it by using the raw HTML (rather than markdown)
    # or you could try doing something similar on imdb

    @validator("name")
    def name_must_not_be_empty(cls, v):
        if not v:
            raise ValueError("Name must not be empty")
        return v


schema, extraction_validator = from_pydantic(
    ShowOrMovie,
    description="Extract information about popular movies/tv shows including their name, year, link and rating.",
    examples=[
        (
            "[Rain Dogs Latest Episode: Apr 03](/tv/rain_dogs)",
            {"name": "Rain Dogs", "latest_episode": "Apr 03", "link": "/tv/rain_dogs"},
        )
    ],
    many=True,
)

In [5]:
chain = create_extraction_chain(
    llm,
    schema,
    encoder_or_encoder_class="csv",
    validator=extraction_validator,
    input_formatter="triple_quotes",
)

## Download

Let's download one of the Rotten Tomatoes pages. (It is my favorite site for finding out about movies!)

In [6]:
url = "https://www.rottentomatoes.com/browse/tv_series_browse/sort:popular"
response = requests.get(url)  # See the note at the top about pages that need JavaScript

Remember that in some cases you will need to execute JavaScript! Here's a snippet:

```python
from langchain.document_loaders import SeleniumURLLoader

# The loader expects a list of URLs and returns a list of documents.
documents = SeleniumURLLoader(urls=[url]).load()
```

## Extract

Now, let's use LangChain building blocks to assemble whatever pipeline you need for your own purposes.

Create a LangChain document with the HTML content.

In [7]:
doc = Document(page_content=response.text)

Convert to markdown

**ATTENTION** This step is lossy and may end up removing information that's relevant for extraction. You can always try pushing the raw HTML through if you're not worried about cost.

In [8]:
md = MarkdownifyHTMLProcessor().process(doc)

Break the document into chunks so that each one fits in the context window.

In [9]:
split_docs = RecursiveCharacterTextSplitter().split_documents([md])
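
Each chunk becomes one LLM call, so chunk size drives both the cost and how much context the model sees at once. The sketch below is a simplified, hypothetical stand-in for what a character-based splitter does (it is not LangChain's implementation): it packs pieces greedily up to a size limit and falls back to a finer separator when a single piece is too long.

```python
from typing import List, Sequence


def split_text(
    text: str, chunk_size: int, separators: Sequence[str] = ("\n\n", "\n", " ")
) -> List[str]:
    """Split text into chunks of at most chunk_size characters where possible."""
    if len(text) <= chunk_size or not separators:
        return [text]

    separator, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for part in text.split(separator):
        if len(part) > chunk_size:
            # The part alone is too long: flush, then retry with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(split_text(part, chunk_size, finer))
            continue
        candidate = current + separator + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


sample = "[Swarm](/tv/swarm)\n\n[You](/tv/you)\n\n" + "word " * 10
chunks = split_text(sample, chunk_size=35)
for c in chunks:
    print(len(c), repr(c))
```

With the real splitter, the `chunk_size` and `chunk_overlap` constructor arguments control the same trade-off: smaller chunks mean more calls, while larger chunks risk exceeding the model's context window.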

In [10]:
print(split_docs[-1].page_content)

[Shrinking

 Latest Episode: Mar 24](/tv/shrinking)

Watch the trailer for The Order

[The Order](/tv/the_order)

Watch the trailer for Swarm

[Swarm

 Latest Episode: Mar 17](/tv/swarm)

Watch the trailer for The Last Kingdom

[The Last Kingdom](/tv/the_last_kingdom)

Watch the trailer for Extrapolations

[Extrapolations

 Latest Episode: Apr 07](/tv/extrapolations)

Watch the trailer for Rain Dogs

[Rain Dogs

 Latest Episode: Apr 03](/tv/rain_dogs)

Watch the trailer for You

[You

 Latest Episode: Mar 09](/tv/you)

Watch the trailer for Great Expectations

[Great Expectations

 Latest Episode: Apr 02](/tv/great_expectations_2023)

[War Sailor

 Latest Episode: Apr 02](/tv/war_sailor)

Watch the trailer for Poker Face

[Poker Face

 Latest Episode: Mar 09](/tv/poker_face)

Watch the trailer for She-Hulk: Attorney at Law

[She-Hulk: Attorney at Law](/tv/she_hulk_attorney_at_law)

No results

 Reset Filters

 Load more

Close video

See Details

See Details

* [Help](/help_desk)
* [Ab

In [11]:
len(split_docs)

4

Run the extraction on every chunk using the schema defined above.

In [12]:
from langchain.callbacks import get_openai_callback

In [13]:
with get_openai_callback() as cb:
    document_extraction_results = await extract_from_documents(
        chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True
    )
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")

Total Tokens: 7204
Prompt Tokens: 6175
Completion Tokens: 1029
Successful Requests: 4
Total Cost (USD): $0.14407999999999999
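
The reported cost is consistent with a flat rate of $0.02 per 1,000 tokens, applied to prompt and completion tokens alike. The rate is inferred from the numbers above and is an assumption; pricing changes over time, so check the provider's current price list. A quick check of the arithmetic:

```python
# Token counts copied from the callback output above.
prompt_tokens = 6175
completion_tokens = 1029

# Assumed flat rate in USD per 1,000 tokens (inferred, not taken from a price list).
usd_per_1k_tokens = 0.02

total_tokens = prompt_tokens + completion_tokens
total_cost = total_tokens * usd_per_1k_tokens / 1000

print(total_tokens)
print(round(total_cost, 5))
```

At a flat rate, cost grows roughly linearly with the amount of text pushed through the chain, which is why long documents get expensive quickly.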


In [18]:
validated_data = list(
    itertools.chain.from_iterable(
        extraction["validated_data"] for extraction in document_extraction_results
    )
)
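
Because `extract_from_documents` was called with `return_exceptions=True`, a failed chunk may come back as an exception object instead of a result dictionary, in which case indexing it with `"validated_data"` would fail. That behavior is an assumption based on the flag's name. The sketch below uses made-up stand-in results rather than real extraction output to show one way of separating the two cases before flattening.

```python
import itertools
from typing import Any, Dict, List, Union

# Stand-in results: two successful chunks and one failure (all values made up).
results: List[Union[Dict[str, Any], Exception]] = [
    {"validated_data": [{"name": "Beef"}, {"name": "Dave"}]},
    ValueError("simulated failure for one chunk"),
    {"validated_data": [{"name": "Swarm"}]},
]

# Separate failures from successes before indexing into the results.
failures = [r for r in results if isinstance(r, Exception)]
successes = [r for r in results if not isinstance(r, Exception)]

validated = list(
    itertools.chain.from_iterable(r["validated_data"] for r in successes)
)

print(f"{len(validated)} records, {len(failures)} failed chunks")
```

Keeping the failures in a separate list also makes it easy to log them or retry only the affected chunks.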

In [19]:
len(validated_data)

55

In [20]:
pd.DataFrame(record.dict() for record in validated_data)

|   | name | season | year | latest_episode | link |
|---|------|--------|------|----------------|------|
| 0 | Beef | 1.0 |  |  | /tv/beef/s01 |
| 1 | Dave | 3.0 |  |  | /tv/dave/s03 |
| 2 | Schmigadoon! | 2.0 |  |  | /tv/schmigadoon/s02 |
| 3 | Pretty Baby: Brooke Shields | 1.0 |  |  | /tv/pretty_baby_brooke_shields/s01 |
| 4 | Tiny Beautiful Things | 1.0 |  |  | /tv/tiny_beautiful_things/s01 |
| 5 | Grease: Rise of the Pink Ladies | 1.0 |  |  | /tv/grease_rise_of_the_pink_ladies/s01 |
| 6 | Jury Duty | 1.0 |  |  | /tv/jury_duty/s01 |
| 7 | The Crossover | 1.0 |  |  | /tv/the_crossover/s01 |
| 8 | Transatlantic | 1.0 |  |  | /tv/transatlantic/s01 |
| 9 | Race to Survive: Alaska | 1.0 |  |  | /tv/race_to_survive_alaska/s01 |
