# Document URLs

Every entity should have an associated document which is the *legal instrument* that defines the entity. These should be stored under the `document_url` field of the entity. This notebook uses the webcrawler and search functionality included in the repository to find the correct document url for a sample of conservation areas.

<div class="alert alert-warning">
This notebook assumes that all documentation_urls have been validated when the document_url search begins. This allows us to do a shallow crawl from the documentation_url of the entity rather than a deep crawl from the homepage.

This matters because PDFs are typically stored in a generic file folder (e.g. www.council-name.gov.uk/files/all_the_pdfs.pdf), so we can't use keyword filters on the URL as we did in the documentation_url notebook. Without those filters, the entire website is crawled and a large number of files are returned, which makes the notebook slow and memory-hungry to run. Shallow crawls from validated documentation_urls, by contrast, find the document_url in a much shorter time.
</div>

In [None]:
import logging

logging.basicConfig(level=logging.ERROR)

In [None]:
import pickle
import urllib.parse

import matplotlib.pyplot as plt
import polars as pl
import requests
import seaborn as sns
from markitdown import MarkItDown

from data_quality_utils.crawler import Crawler
from data_quality_utils.crawler.utils import clean_url
from data_quality_utils.similarity_searcher import SimilaritySearcher

## 1. Load Sample

To test the approach, we include a small set of conservation areas with known `document_urls` in this repository. In this section, we load that data and then query datasette to find the organisation responsible for each. This will give us the key details we need to scrape the organisations' websites and find the PDFs.

In [None]:
test_cases_df = pl.read_csv("data/test_data.csv")
DATA_FILE = "datasette_data.csv"
QUERY_DATA = True

In [None]:
if QUERY_DATA:
    # get data from datasette
    datasette_base_url = "https://datasette.planning.data.gov.uk/digital-land.csv"

    query = """
    select 
    l.entity,
    o.website,
    o.organisation 
    from lookup as l
    left join organisation as o
    on l.organisation=o.organisation 
    where l.entity in {}
    and o.website != 'https://historicengland.org.uk'
    """.format(
        tuple(test_cases_df["entity"].to_list())
    )
    encoded_query = urllib.parse.urlencode({"sql": query})

    r = requests.get(f"{datasette_base_url}?{encoded_query}", auth=("user", "pass"))
    # Fail early rather than saving an error page as the CSV
    r.raise_for_status()

    with open(DATA_FILE, "wb") as f_out:
        f_out.write(r.content)
data = pl.read_csv(DATA_FILE)
test_cases_df = test_cases_df.join(data, on="entity").unique()
test_cases_df = test_cases_df.with_columns(
    document_url=test_cases_df["document_url"].str.replace_all("%20", " ")
)
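
One caveat with formatting a Python tuple straight into the `in {}` clause: a single-element tuple renders as `(101,)`, which is not valid SQL. Below is a minimal sketch of a safer way to build the clause, assuming the entity IDs are integers. `sql_in_clause` is a hypothetical helper for illustration and is not part of the repository.

```python
def sql_in_clause(entity_ids):
    """Render integer IDs as a SQL list such as "(1, 2, 3)".

    Hypothetical helper: casting to int guards against non-numeric
    values being interpolated into the query string.
    """
    ids = [int(entity_id) for entity_id in entity_ids]
    if not ids:
        raise ValueError("at least one entity ID is required")
    return "(" + ", ".join(str(i) for i in ids) + ")"


# Illustrative IDs only
print(sql_in_clause([101]))
print(sql_in_clause([101, 102]))
```

The result can be passed to `.format(...)` in place of the raw tuple.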

In [None]:
test_cases_df

## 2. Scrape Websites

The web crawler was introduced for the `documentation_url` challenge. It takes the homepage URL of an organisation (a council website) and crawls it for pages that match a set of keyword filters, so that it does not scrape the entire website, including the many pages that are not of interest.

We have extended the crawler to return the URLs of PDFs when the `crawl_type` parameter is set to `pdf`. We will use this setting to scrape all PDFs from the websites in the test data.
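
To illustrate why the starting page and depth limit matter, here is a rough sketch of a depth-limited, breadth-first crawl over a toy link graph. It is not the repository's `Crawler`: the `LINKS` graph, its paths and the `crawl_pdfs` function are all made up for illustration.

```python
from collections import deque

# Toy link graph standing in for a council website (all paths are made up).
LINKS = {
    "/": ["/planning", "/bins"],
    "/planning": ["/planning/conservation-areas"],
    "/bins": ["/files/bin-calendar.pdf"],
    "/planning/conservation-areas": ["/files/area-appraisal.pdf"],
}


def crawl_pdfs(start, max_depth):
    """Breadth-first crawl returning PDF links within max_depth hops of start."""
    seen = {start}
    queue = deque([(start, 0)])
    pdfs = []
    while queue:
        page, depth = queue.popleft()
        if page.endswith(".pdf"):
            pdfs.append(page)
            continue  # PDFs are leaves: nothing to follow
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pdfs


# Shallow crawl from the relevant page finds only the relevant PDF.
print(crawl_pdfs("/planning/conservation-areas", max_depth=1))
# Deep crawl from the homepage also picks up unrelated PDFs.
print(crawl_pdfs("/", max_depth=3))
```

In this toy model the shallow crawl returns one PDF, while the deep crawl returns every PDF on the site and has to visit every page to do so.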

### Example PDF Scraping
As a quick example, we show that a crawl with a `max_depth` of 1, starting from the correct `documentation_url`, is fast and returns the right PDF for Napsbury.

In [None]:
# Shallow crawl: follow links one level from the start URL and collect PDFs
max_depth = 1
filters = [
    {"type": "ContentTypeFilter", "allowed_types": ["text/html", "application/pdf"]},
]

crawler = Crawler(
    max_depth=max_depth,
    keyword_scorer=None,
    filters=filters,
    cache_enabled=False,
    crawl_type="pdf",
)

In [None]:
crawl_data = await crawler.deep_crawl("https://www.stalbans.gov.uk/conservation-areas")

In [None]:
napsbury_document_url = test_cases_df.filter(test_cases_df["name"] == "Napsbury")[
    "document_url"
][0]
napsbury_document_url in crawl_data

### Scrape all test cases
Next we use this functionality to get the PDF URLs for all test cases. For now, set the max_depth to 6, but ideally you would have already validated the `documentation_url` for all entities and can use a max_depth of one or two.

In [None]:
pdf_urls = dict()

for documentation_url in test_cases_df["documentation_url"]:
    if documentation_url not in pdf_urls:
        print(40 * "*")
        print(f"Starting {documentation_url}")
        crawl_data = await crawler.deep_crawl(documentation_url)
        pdf_urls[documentation_url] = crawl_data

The crawl often recovers the same link several times, with and without `www.` or with different conventions for encoding spaces in the file name. Deduplicate by normalising these differences.

In [None]:
for website, url_list in pdf_urls.items():
    if url_list:
        # Normalise each URL, then drop duplicates while preserving order
        pdf_urls[website] = list(dict.fromkeys(map(clean_url, url_list)))
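
For intuition, the kind of normalisation described above could look roughly like the sketch below. This is an assumption about the general approach, not the actual implementation of `clean_url` in `data_quality_utils`. `normalise_url` and the example URLs are hypothetical.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def normalise_url(url):
    """Illustrative normaliser: drop a leading "www." and decode escapes like %20."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return urlunsplit(
        (parts.scheme, host, unquote(parts.path), parts.query, parts.fragment)
    )


urls = [
    "https://www.example.gov.uk/files/Area%20Appraisal.pdf",
    "https://example.gov.uk/files/Area Appraisal.pdf",
]
# dict.fromkeys removes duplicates while keeping first-seen order
deduplicated = list(dict.fromkeys(normalise_url(u) for u in urls))
print(deduplicated)
```

Both variants collapse to a single entry once the host and the path encoding are made consistent.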

Store these to prevent re-scraping.

In [None]:
with open("data/pdf_urls.pickle", "wb") as f:
    pickle.dump(pdf_urls, f)

## 3. Markitdown
We use the markitdown package to convert the PDF at each URL to Markdown text. This will allow us to perform a search.

In [None]:
with open("data/pdf_urls.pickle", "rb") as f:
    pdf_urls = pickle.load(f)

In [None]:
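def convert_pdf(pdf_url):
    """Convert the PDF at pdf_url to Markdown text; return "Fail." on any error."""
    try:
        md = MarkItDown(enable_plugins=False)
        text = md.convert(pdf_url).markdown
        return text
    except Exception:
        # Broad catch is deliberate: one unreadable PDF should not stop the batch
        return "Fail."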


def document_df_from_urls(website, url_list):
    df = pl.DataFrame(data=url_list, schema=["id"])
    df = df.with_columns(text=df["id"].map_elements(convert_pdf, return_dtype=str))
    return df

The similarity search function expects this format: a DataFrame in which every document has an ID (here the URL, for simplicity) and its associated Markdown text. We process all test cases in this way.

In [None]:
pdf_dfs = dict()
for website, url_list in pdf_urls.items():
    print(website)
    if website not in pdf_dfs:
        pdf_dfs[website] = document_df_from_urls(website=website, url_list=url_list)
        with open("data/pdf_dfs.pickle", "wb") as f:
            pickle.dump(pdf_dfs, f)

## 4. Search

Finally, we use the similarity searcher to find the document most similar to a query that represents the text we'd expect to find in the entity definition document. To greatly simplify this process, we assume that the name of the conservation area appears in its document. `SimilaritySearcher` embeds text after removing all irrelevant documents, so the use of keywords will remove the vast majority of PDFs and make this much faster.
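
The filter-then-rank idea can be sketched in plain Python. This is only an illustration under stated assumptions: bag-of-words cosine similarity stands in for the embeddings that `SimilaritySearcher` uses, and `filter_then_rank`, `tokens`, `cosine` and the example documents are all hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_then_rank(query, documents, keyword):
    """Keep documents mentioning keyword, then sort them by similarity to query."""
    candidates = {
        doc_id: text
        for doc_id, text in documents.items()
        if keyword.lower() in text.lower()
    }
    query_vec = Counter(tokens(query))
    scored = [
        (doc_id, cosine(query_vec, Counter(tokens(text))))
        for doc_id, text in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up documents for illustration
documents = {
    "appraisal.pdf": "Napsbury conservation area character statement and historic interest",
    "minutes.pdf": "Committee minutes mentioning Napsbury car park charges",
    "bins.pdf": "Bin collection calendar",
}
ranked = filter_then_rank(
    "conservation area character statement", documents, keyword="Napsbury"
)
print(ranked)
```

Only documents that pass the keyword filter are scored, which is where the time saving comes from when the scoring step is an expensive embedding.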

In [None]:
with open("data/pdf_dfs.pickle", "rb") as f:
    pdf_dfs = pickle.load(f)

In [None]:
pdf_dfs

In [None]:
for key, df in pdf_dfs.items():
    df = df.with_columns(id=df["id"].map_elements(clean_url, return_dtype=str))
    pdf_dfs[key] = df

In [None]:
def pretty_print_results(sorted_df, num_results):
    # print top n urls with similarity scores
    print("\nTop Similar PDFs:\n" + "=" * 40)
    for i in range(min(num_results, len(sorted_df))):
        url = sorted_df.get_column("id")[i]
        score = sorted_df.get_column("similarity")[i]
        print(f"{i+1}. {url.ljust(60)} | Similarity: {score:.4f}")

In [None]:
query = """
Section 69 of the Planning (Listed Buildings and Conservation Areas) Act 1990 states that
every local authority shall determine areas of special architectural or historic interest and
designate them as conservation areas.

The aims of this Character Statement are to show the way in which the form of the
conservation area has evolved and to assess its present character; to indicate the principles
to be adopted in considering planning applications in the area; and to form a framework
within which more detailed proposals may be formulated.
"""

In [None]:
searcher = SimilaritySearcher(strategy="best_of_three")
results = dict()
for (
    entity_id,
    entity_name,
    correct_url,
    documentation_url,
    _,
    _,
) in test_cases_df.iter_rows():
    document_df = pdf_dfs[documentation_url]
    results[entity_name] = searcher.search(
        query=query, document_df=document_df, keyword_filters=[entity_name]
    )

In [None]:
pretty_print_results(results["Napsbury"], num_results=5)

In [None]:
ranks = list()
num_unclassified = 0
for (
    entity_id,
    entity_name,
    correct_url,
    documentation_url,
    _,
    _,
) in test_cases_df.iter_rows():
    if not results[entity_name]["id"].is_empty():
        rank = results[entity_name]["id"].index_of(clean_url(correct_url))
        if rank is not None:
            ranks.append(rank + 1)
            continue
        else:
            num_unclassified += 1
    else:
        num_unclassified += 1
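
A single summary number can complement the rank list. The sketch below computes top-k accuracy from 1-based ranks in plain Python. `rank_of`, `top_k_accuracy` and the example file names are hypothetical and use ordinary lists rather than the polars results above.

```python
def rank_of(correct_id, ranked_ids):
    """Return the 1-based rank of correct_id, or None if it is absent."""
    try:
        return ranked_ids.index(correct_id) + 1
    except ValueError:
        return None


def top_k_accuracy(ranks, k):
    """Share of test cases whose correct document is ranked k or better."""
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return hits / len(ranks) if ranks else 0.0


# Made-up search outputs for three entities
example_ranks = [
    rank_of("a.pdf", ["a.pdf", "b.pdf"]),  # found first
    rank_of("d.pdf", ["c.pdf", "d.pdf"]),  # found second
    rank_of("z.pdf", ["e.pdf"]),           # not found
]
print(top_k_accuracy(example_ranks, k=1))
print(top_k_accuracy(example_ranks, k=3))
```

Cases where the correct document is not found count as misses at every k, which matches how `num_unclassified` is treated above.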

In [None]:
fig, ax = plt.subplots(figsize=(3.5, 4), tight_layout=True)
ax = sns.histplot(
    x=ranks + num_unclassified * [10],
    binwidth=1,
    discrete=True,
    color="#00625E",
    stat="proportion",
)
ax.set_title("Search Ranking of Correct Document")
ax.set_xlabel("Rank")
ax.set_ylabel("Proportion")
ax.set_xticks(range(1, 11))
ax.set_xticklabels([f"{i}" for i in range(1, 8)] + ["", "", "Not Found"])
plt.show()