# Document Extraction

Here, we'll be extracting content from a longer document.


The basic workflow is the following:

1. Load the document
2. Clean up the document (optional)
3. Split the document into chunks
4. Extract from *every* chunk of text

-------------

**ATTENTION** This is a *brute force* workflow -- there will be an LLM call for every piece of text that is being analyzed. 
This can be **expensive** 💰💰💰, so use at your own risk and monitor your costs!

---------------

Let's apply this workflow to an HTML file.

We'll reduce HTML to markdown. This is a lossy step, which can sometimes improve extraction results, and sometimes make extraction worse.
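
To make "lossy" concrete, here is a minimal sketch using only the standard library. It is not how `MarkdownifyHTMLProcessor` works internally; the `TinyMarkdownConverter` class, the sample HTML, and its `data-score` attribute are all made up for illustration. The converter keeps link text and link targets but discards every other attribute, so anything stored only in attributes is lost.

```python
from html.parser import HTMLParser
from typing import List


class TinyMarkdownConverter(HTMLParser):
    """Hypothetical, deliberately tiny HTML-to-markdown converter."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._href_stack: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Keep only the link target; every other attribute is discarded.
            self._href_stack.append(dict(attrs).get("href") or "")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a" and self._href_stack:
            self.parts.append(f"]({self._href_stack.pop()})")

    def handle_data(self, data):
        # Text nodes are kept verbatim.
        self.parts.append(data)


def to_markdown(html: str) -> str:
    converter = TinyMarkdownConverter()
    converter.feed(html)
    converter.close()
    return "".join(converter.parts)


# Made-up sample: the score lives only in an attribute.
sample_html = '<a href="/tv/rain_dogs" data-score="92"><span>Rain Dogs</span></a>'
markdown = to_markdown(sample_html)
print(markdown)
```

The link survives, but the made-up score does not. A smaller, cleaner input is cheaper for the LLM to process, but a field that exists only in the markup can no longer be extracted.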

When scraping HTML, executing JavaScript may be necessary to get the page fully rendered.

Here's a piece of code that executes JavaScript using Playwright:


```python
import asyncio

from playwright.async_api import async_playwright


async def a_download_html(url: str, extra_sleep: int) -> str:
    """Download an HTML from a URL.
    
    In some pathological cases, an extra sleep period may be needed.
    """

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="load")
        if extra_sleep:
            await asyncio.sleep(extra_sleep)
        html_content = await page.content()
        await browser.close()
    return html_content
```

Another possibility is to use the Selenium URL loader: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/url.html#selenium-url-loader

---------
 
Again, this can be **expensive** 💰💰💰, so use at your own risk and monitor your costs!

In [1]:
%load_ext autoreload
%autoreload 2

import sys

sys.path.insert(0, "../../")

In [12]:
from typing import List, Optional
import itertools
import requests

import pandas as pd
from pydantic import BaseModel, Field, field_validator
from kor import extract_from_documents, from_pydantic, create_extraction_chain
from kor.documents.html import MarkdownifyHTMLProcessor
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI

## LLM

Instantiate an LLM. 

Try experimenting with the cheaper davinci models or with gpt-3.5-turbo before moving to the more expensive davinci-003 or GPT-4.

In some cases, providing a better prompt (with more examples) can help make up for using a smaller model.


-------------------

Quality can vary a **lot** depending on which LLM is used and how many examples are provided.

-------------------

In [13]:
# Using gpt-3.5-turbo, which is pretty cheap but gives lower-quality results
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

## Schema

In [15]:


class ShowOrMovie(BaseModel):
    name: str = Field(
        description="The name of the movie or tv show",
    )
    season: Optional[str] = Field(
        description="Season of TV show. Extract as a digit stripping Season prefix.",
    )
    year: Optional[str] = Field(
        description="Year when the movie / tv show was released",
    )
    latest_episode: Optional[str] = Field(
        description="Date when the latest episode was released",
    )
    link: Optional[str] = Field(description="Link to the movie / tv show.")

    # rating -- not included because rating on rottentomatoes is in the html elements
    # you could try extracting it by using the raw HTML (rather than markdown)
    # or you could try doing something similar on imdb

    @field_validator("name")
    @classmethod
    def name_must_not_be_empty(cls, v):
        if not v:
            raise ValueError("Name must not be empty")
        return v


schema, extraction_validator = from_pydantic(
    ShowOrMovie,
    description="Extract information about popular movies/tv shows including their name, season, year, latest episode and link.",
    examples=[
        (
            "[Rain Dogs Latest Episode: Apr 03](/tv/rain_dogs)",
            {"name": "Rain Dogs", "latest_episode": "Apr 03", "link": "/tv/rain_dogs"},
        )
    ],
    many=True,
)

In [16]:
chain = create_extraction_chain(
    llm,
    schema,
    encoder_or_encoder_class="csv",
    validator=extraction_validator,
    input_formatter="triple_quotes",
)
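
The `"csv"` encoder setting means records are exchanged with the model as tabular text rather than JSON. The sketch below only illustrates the general idea of round-tripping records through CSV with the standard library. It is not Kor's encoder, the exact format Kor prompts for may differ, and the helper names `encode_records` and `decode_records` are hypothetical.

```python
import csv
import io
from typing import Dict, List

FIELDS = ["name", "season", "year", "latest_episode", "link"]


def encode_records(records: List[Dict[str, str]]) -> str:
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Keys missing from a record are written as empty cells.
        writer.writerow(record)
    return buffer.getvalue()


def decode_records(text: str) -> List[Dict[str, str]]:
    """Parse CSV text back into one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


records = [{"name": "Rain Dogs", "latest_episode": "Apr 03", "link": "/tv/rain_dogs"}]
encoded = encode_records(records)
decoded = decode_records(encoded)
print(encoded)
print(decoded)
```

Note that every value comes back as a string and that missing fields come back as empty strings, which is one reason to run the parsed rows through a validator afterwards.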

## Download

Let's download a page listing popular TV series from my favorite movie review site.

In [17]:
url = "https://www.rottentomatoes.com/browse/tv_series_browse/sort:popular"
response = requests.get(url)  # See the note at the top about using Selenium or Playwright

Remember that in some cases you will need to execute JavaScript! Here's a snippet:

```python
from langchain_community.document_loaders import SeleniumURLLoader

# SeleniumURLLoader takes a list of URLs and returns a list of documents.
documents = SeleniumURLLoader(urls=[url]).load()
```

## Extract

Use langchain building blocks to assemble whatever pipeline you need for your own purposes.

Create a langchain document with the HTML content.

In [18]:
doc = Document(page_content=response.text)

Convert to markdown

**ATTENTION** This step is lossy and may end up removing information that's relevant for extraction. You can always try pushing the raw HTML through if you're not worried about cost.

In [19]:
md = MarkdownifyHTMLProcessor().process(doc)

Break the document into chunks so that each one fits in the context window

In [20]:
split_docs = RecursiveCharacterTextSplitter().split_documents([md])
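
As a rough intuition for what chunking does, here is a simplified sketch of a fixed-size splitter with overlap. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which tries to split on separators first; the function name and the `chunk_size` and `overlap` values are assumptions chosen for illustration.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "Latest Episode: Jul 11 " * 20
chunks = split_with_overlap(text, chunk_size=100, overlap=20)
print(len(chunks), [len(c) for c in chunks])
```

Whatever the splitting strategy, a boundary can land in the middle of an entry, so the first or last record in a chunk may be incomplete. Each chunk also becomes one LLM call, so the chunk count directly drives cost.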

In [21]:
print(split_docs[-1].page_content)

Latest Episode: Jul 11](/tv/vikings_valhalla)

Watch the trailer for Land of Women

[89%

 35%

 Land of Women

 Latest Episode: Jul 10](/tv/land_of_women)

[100%

 68%

 The Mole

 Latest Episode: Jul 12](/tv/the_mole_2022)

Watch the trailer for Bridgerton

[84%

 73%

 Bridgerton

 Latest Episode: Jun 13](/tv/bridgerton)

Watch the trailer for Your Honor

[50%

 68%

 Your Honor](/tv/your_honor_2020)

Watch the trailer for Mayor of Kingstown

[51%

 89%

 Mayor of Kingstown

 Latest Episode: Jul 07](/tv/mayor_of_kingstown)

Watch the trailer for The Serpent Queen

[100%

 84%

 The Serpent Queen

 Latest Episode: Jul 12](/tv/the_serpent_queen)

Watch the trailer for True Detective

[79%

 57%

 True Detective](/tv/true_detective)

[67%

 Mirzapur

 Latest Episode: Jul 05](/tv/mirzapur)

Watch the trailer for Hotel Cocaine

[70%

 Hotel Cocaine

 Latest Episode: Jul 07](/tv/hotel_cocaine)

[86%

 69%

 A Good Girl's Guide to Murder

 Latest Episode: Jul 01](/tv/a_good_girls_guide_to_

In [22]:
len(split_docs)

4

Run extraction

In [23]:
from langchain_community.callbacks import get_openai_callback

In [24]:
with get_openai_callback() as cb:
    document_extraction_results = await extract_from_documents(
        chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True
    )
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")

Total Tokens: 6342
Prompt Tokens: 5390
Completion Tokens: 952
Successful Requests: 4
Total Cost (USD): $0.009989000000000001
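
Since there is one LLM call per chunk, the numbers above can be turned into a rough projection for larger documents. The sketch below assumes rates of $0.0015 per 1K prompt tokens and $0.002 per 1K completion tokens. Those rates are inferred because they reproduce the reported total; they are not stated in this notebook, and current pricing may differ. The projection is linear and assumes chunks similar in size to the four above.

```python
# Token counts reported by the callback above.
PROMPT_TOKENS = 5390
COMPLETION_TOKENS = 952
REQUESTS = 4

# Assumed per-1K-token rates in USD; check your provider's current pricing.
PROMPT_RATE_PER_1K = 0.0015
COMPLETION_RATE_PER_1K = 0.002


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in USD for the given token counts under the assumed rates."""
    return (
        prompt_tokens / 1000 * PROMPT_RATE_PER_1K
        + completion_tokens / 1000 * COMPLETION_RATE_PER_1K
    )


total_cost = estimate_cost(PROMPT_TOKENS, COMPLETION_TOKENS)
cost_per_chunk = total_cost / REQUESTS


def projected_cost(num_chunks: int) -> float:
    """Linear extrapolation from the average cost per chunk."""
    return cost_per_chunk * num_chunks


print(f"Estimated total: ${total_cost:.6f}")
print(f"Average per chunk: ${cost_per_chunk:.6f}")
print(f"Projected for 1,000 chunks: ${projected_cost(1000):.2f}")
```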


In [25]:
validated_data = list(
    itertools.chain.from_iterable(
        extraction["validated_data"] for extraction in document_extraction_results
    )
)

In [26]:
len(validated_data)

50

Extraction is not perfect, but you can use a better LLM or provide more examples!

In [27]:
pd.DataFrame(record.model_dump() for record in validated_data)

|   | name | season | year | latest_episode | link |
|---|------|--------|------|----------------|------|
| 0 | Sunny | 1.0 |  |  | /tv/sunny/s01 |
| 1 | Vikings: Valhalla | 3.0 |  |  | /tv/vikings_valhalla/s03 |
| 2 | Sunny | 1.0 |  |  | /tv/sunny/s01 |
| 3 | Vikings: Valhalla | 3.0 |  |  | /tv/vikings_valhalla/s03 |
| 4 | Sausage Party: Foodtopia | 1.0 |  |  | /tv/sausage_party_foodtopia/s01 |
| 5 | The Serpent Queen | 2.0 |  |  | /tv/the_serpent_queen/s02 |
| 6 | Me | 1.0 |  |  | /tv/me/s01 |
| 7 | The Bachelorette | 21.0 |  |  | /tv/the_bachelorette/s21 |
| 8 | Mastermind: To Think Like a Killer | 1.0 |  |  | /tv/mastermind_to_think_like_a_killer/s01 |
| 9 | Melissa Etheridge: I'm Not Broken | 1.0 |  |  | /tv/melissa_etheridge_im_not_broken/s01 |
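
The rows above contain repeats (Sunny and Vikings: Valhalla each appear twice). Whatever the cause, a cheap post-processing step is to drop duplicate records. This is a minimal sketch that assumes each record is a plain dict, for example the result of dumping a validated model to a dict. The `deduplicate` helper and the choice of `(name, link)` as the identity key are assumptions; pick whichever key fits your data.

```python
from typing import Dict, Iterable, List, Optional

Record = Dict[str, Optional[str]]


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Keep the first record for each (name, link) pair, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = (record.get("name"), record.get("link"))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Toy records mirroring the first four rows of the table.
records = [
    {"name": "Sunny", "season": "1", "link": "/tv/sunny/s01"},
    {"name": "Vikings: Valhalla", "season": "3", "link": "/tv/vikings_valhalla/s03"},
    {"name": "Sunny", "season": "1", "link": "/tv/sunny/s01"},
    {"name": "Vikings: Valhalla", "season": "3", "link": "/tv/vikings_valhalla/s03"},
]
unique_records = deduplicate(records)
print(len(unique_records))
```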
