# Documentation URL Search

In this notebook, we demonstrate how the tools in this repository can be used to validate and search for documentation URLs for entities in the planning.data.gov.uk data.

The approach involves two main steps: a web crawler and an embedding similarity search. The web crawler finds all potentially relevant pages of a council's website using very simple filters, and the similarity search then finds the pages most relevant to a user input query. By making this query reflect the expected content of the page the `documentation_url` would point to, we should find some strong candidate pages.

This notebook demonstrates the two tools (crawler and search) and runs the search for a small number of test cases.

In [None]:
import os
import urllib.parse
from statistics import mode

import matplotlib.pyplot as plt
import numpy as np
import polars as pl
import requests
import seaborn as sns

from data_quality_utils.crawler import Crawler
from data_quality_utils.similarity_searcher import (
    SimilaritySearcher,
)

In [None]:
# get data from datasette
datasette_base_url = "https://datasette.planning.data.gov.uk/digital-land.csv"

query = """
select *
from source as s
left join organisation as o
on s.organisation=o.organisation
where s.collection="conservation-area"
"""
encoded_query = urllib.parse.urlencode({"sql": query})

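r = requests.get(f"{datasette_base_url}?{encoded_query}", auth=("user", "pass"))
r.raise_for_status()  # stop here rather than saving an error page as CSV

filename = "data/datasette_data.csv"
os.makedirs(os.path.dirname(filename), exist_ok=True)  # make sure data/ exists
with open(filename, "wb") as f_out:
    f_out.write(r.content)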

In [None]:
# group by organisation as we're looking for one page per council
filename = "data/datasette_data.csv"

council_data = (
    pl.read_csv(filename)
    .group_by("name")
    .agg(pl.col("website").first(), pl.col("documentation_url"))
)
council_data = council_data.with_columns(pl.col("website").str.strip_chars_end("/"))
council_data.head()

## 1. Web crawler

The web crawler takes a homepage URL of an organisation (council website) and crawls it to look for pages talking about conservation areas.

The crawler looks for links on a single page, puts them in a queue and then iteratively checks them until it finds what it was looking for or reaches a stopping criterion, such as a maximum depth (how many clicks away from the home page).

In order to save time, we can define some scorers or filters which tell the crawler which pages to prioritise or ignore. In this case, some common patterns of what a user needs to click to get to the page of interest are _"planning"_, _"building"_, _"heritage"_ or _"conservation"_.

The crawler uses a *"best-first strategy"*, which uses the scorers and filters to visit the most relevant pages first, rather than a depth-first or breadth-first search.
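
To make the ordering concrete, here is a minimal sketch of a best-first traversal over a toy site. The link graph, the keyword list and the `score` function are all hypothetical and chosen for illustration; this is not the implementation used by `Crawler`, only the general idea of popping the highest-scoring page from a priority queue.

```python
import heapq

# Hypothetical toy site: each page maps to the links found on it.
TOY_SITE = {
    "/": ["/news", "/planning", "/bins"],
    "/planning": ["/planning/heritage", "/planning/apply"],
    "/planning/heritage": ["/planning/heritage/conservation-areas"],
    "/news": ["/news/events"],
}
KEYWORDS = ["planning", "heritage", "conservation"]


def score(url):
    """Count how many keywords appear in the URL (higher is more relevant)."""
    return sum(keyword in url for keyword in KEYWORDS)


def best_first_crawl(start, max_depth):
    """Return pages in the order a best-first crawler would visit them."""
    # heapq is a min-heap, so scores are negated to pop the best page first.
    queue = [(-score(start), 0, start)]
    seen = {start}
    visited = []
    while queue:
        _, depth, url = heapq.heappop(queue)
        visited.append(url)
        if depth == max_depth:
            continue  # stopping criterion: do not follow links any deeper
        for link in TOY_SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(queue, (-score(link), depth + 1, link))
    return visited


visit_order = best_first_crawl("/", max_depth=3)
print(visit_order)
```

In this toy example the conservation page is reached before `/news` or `/bins` are visited, whereas a breadth-first search would visit every top-level page first.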

The crawler extracts the HTML from the pages and turns them into markdown. This is because it's more readable and easier to work with in the next steps. The crawler returns a list of pairs of (_url_, _markdown_).


### 1.1 Basic usage

In [None]:
council = council_data[0, "name"]
homepage = council_data[0, "website"]

In [None]:
crawler = Crawler(
    max_depth=1,  # clicks from home page
    cache_enabled=False,
)
print(f"Crawling {council}")
crawl_data = await crawler.deep_crawl(homepage)

### 1.2 Filters and scorers

In the previous example, we kept the crawl depth low to speed up crawling and used no other settings. However, higher depths are required to find most documentation URLs, so here we show how to use filters and keyword scorers to limit the time taken and the number of pages returned while still collecting all relevant pages.

These filters and scorers are one of the key things to change when trying to search for new entities.
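
As a rough illustration of how glob patterns such as `*conservation*` select URLs, the sketch below uses the standard library's `fnmatch` module. This is an assumption made for illustration: the crawler's own `URLPatternFilter` may apply different matching rules (for example around case), and the example URLs are made up.

```python
from fnmatch import fnmatchcase

PATTERNS = ["*conservation*", "*planning*", "*building*"]


def url_matches(url, patterns=PATTERNS):
    """Return True if the lower-cased URL matches any glob pattern."""
    return any(fnmatchcase(url.lower(), pattern) for pattern in patterns)


# Hypothetical URLs, not taken from any real council website.
example_urls = [
    "https://www.example.gov.uk/planning/conservation-areas",
    "https://www.example.gov.uk/bins-and-recycling",
    "https://www.example.gov.uk/Building-Control",
]
kept = [url for url in example_urls if url_matches(url)]
print(kept)
```

Trying candidate patterns against a handful of known-good URLs in this way is a quick check before committing to a multi-hour crawl.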



In [None]:
keyword_scorer = {
    "keywords": [
        "conservation",
        "conservation area",
        "planning",
        "building",
        "urban",
        "heritage",
        "resident",
    ],
    "weight": 0.8,
}

filters = [
    {
        "type": "SEOFilter",
        "threshold": 0.6,
        "keywords": ["conservation", "area", "planning", "heritage", "resident"],
    },
    {
        "type": "ContentRelevanceFilter",
        "query": "conservation area or planning data",
        "threshold": 0.2,
    },
    {"type": "ContentTypeFilter", "allowed_types": ["text/html"]},
    {
        "type": "URLPatternFilter",
        "patterns": ["*conservation*", "*planning*", "*building*"],
    },
]

In [None]:
council = council_data[1, "name"]
homepage = council_data[1, "website"]

In [None]:
crawler = Crawler(
    max_depth=4,  # clicks from home page
    cache_enabled=False,
    keyword_scorer=keyword_scorer,
    filters=filters,
)
print(f"Crawling {council}")
crawl_data = await crawler.deep_crawl(homepage)

### 1.3 Downloading the test set

In order to test the search functionality, we have collected 25 correct documentation URLs and included them in this repository. In this section, we scrape the relevant websites for those test cases and store them. This can take several hours but since each site is saved as it is scraped, you only need to run it once.

In [None]:
async def scrape_council(
    council_name,
    council_website,
    max_depth=6,
    keyword_scorer=None,
    filters=None,
    cache_enabled=False,
    save_dfs=True,
    load_dfs=True,
    data_dir="",
):
    crawler = Crawler(
        max_depth=max_depth,
        keyword_scorer=keyword_scorer,
        filters=filters,
        cache_enabled=cache_enabled,
    )

    short_council_name = council_website.split(".")[1]
    save_path = f"{data_dir}{short_council_name}.csv"

    print("=" * 40 + f"\nProcessing {council_name}...\n")

    if os.path.isdir(data_dir) and load_dfs:
        if f"{short_council_name}.csv" in os.listdir(data_dir):
            crawl_df = pl.read_csv(save_path)
            return crawl_df

    # crawl url
    crawl_data = await crawler.deep_crawl(council_website)

    crawl_df = pl.DataFrame(crawl_data, schema=["id", "text"], orient="row")

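    if save_dfs:
        # create the output directory on first use so write_csv does not fail
        if data_dir:
            os.makedirs(data_dir, exist_ok=True)
        crawl_df.write_csv(save_path)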

    return crawl_df

In [None]:
test_df = pl.read_csv("data/page_ranking_truth_df.csv")
test_df = test_df.with_columns(
    pl.col("correct_documentation_url").str.strip_chars_end("/")
)
search_tests = council_data.join(test_df, on="website", how="inner")

In [None]:
filters = [
    {"type": "ContentTypeFilter", "allowed_types": ["text/html"]},
    {
        "type": "URLPatternFilter",
        "patterns": ["*conservation*", "*planning*", "*building*"],
    },
]

In [None]:
scraped_data = dict()

for council_name, council_website, _, _ in search_tests.iter_rows():
    scraped_data[council_website] = await scrape_council(
        council_name=council_name,
        council_website=council_website,
        keyword_scorer=None,
        filters=filters,
        save_dfs=True,
        load_dfs=True,
        data_dir="data/crawled_sites/",
    )

## 2. Embedding search

Embedding generates a vector representation of our scraped markdown text. The embedding model is trained so that this vector numerically captures the meaning of the text. As a result, we can measure how close two pieces of text are in meaning using the cosine similarity of their embedding vectors, a standard mathematical measure of the similarity of vectors.
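
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on made-up three-dimensional vectors; real embedding vectors have hundreds of dimensions, and the similarity searcher may compute this differently (for example with a vectorised library).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical "embeddings" chosen so the geometry is easy to see.
query_vec = [1.0, 2.0, 0.0]
same_direction_vec = [2.0, 4.0, 0.0]  # parallel to the query
orthogonal_vec = [0.0, 0.0, 3.0]  # shares no direction with the query

sim_same = cosine_similarity(query_vec, same_direction_vec)
sim_orthogonal = cosine_similarity(query_vec, orthogonal_vec)
print(sim_same, sim_orthogonal)
```

Vectors pointing the same way score close to 1 regardless of their length, and orthogonal vectors score 0, which is why the measure suits comparing texts of different sizes.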

Our goal is to find the webpage with the highest cosine similarity to our example prompt. The prompt can be user-specified for the type of thing we are searching for: here it describes conservation areas, but in principle it could be changed for article 4 directions or similar.

At present we have three strategies:
1) Embed the entire webpage.
2) Chunk the webpage into smaller texts ("chunks") and find the page with the best matching chunk.
3) Chunk and find the webpage where the three most similar chunks to the query have the highest average.

We recommend the third approach. Using chunks gives a better range of similarities from bad matches to strong matches, as the meaning of key parts of a page is not washed out by averaging over the whole page. However, using a few chunks rather than the single best one prevents matches to pages that simply link to the page you want with a short paragraph of description.
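
The difference between the second and third strategies can be sketched with made-up chunk similarities. The page names and scores below are hypothetical, and the helper functions are illustrative only; they are not the code behind `SimilaritySearcher`.

```python
# Hypothetical chunk similarities for two pages (made-up numbers).
chunk_scores = {
    "landing-page": [0.9, 0.2, 0.1],  # one strong teaser paragraph
    "conservation-areas": [0.8, 0.7, 0.7, 0.3],  # consistently relevant
}


def best_chunk(scores):
    """Strategy 2: score a page by its single best chunk."""
    return max(scores)


def mean_of_top(scores, k=3):
    """Strategy 3: score a page by the mean of its k best chunks."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


best_by_single = max(chunk_scores, key=lambda page: best_chunk(chunk_scores[page]))
best_by_top_three = max(chunk_scores, key=lambda page: mean_of_top(chunk_scores[page]))
print(best_by_single, best_by_top_three)
```

With these numbers the single-best-chunk rule picks the landing page on the strength of one paragraph, while averaging the top three chunks picks the page that is relevant throughout.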

With either chunking strategy, we are trying to find the webpage _that **has** the chunk_ (or chunks) with the highest embedding similarity. The parameter `chunk_size` determines the approximate size of these chunks, which are split at one of the `separators`. There is also `chunk_overlap`, which specifies how much of the previous chunk is repeated at the start of the next and is useful for preserving context.
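
To show how `chunk_size` and `chunk_overlap` interact, here is a simplified sketch that cuts text into fixed-size character windows. It deliberately ignores `separators`, so it is an assumption-laden simplification rather than the splitter used in this repository; the function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and below chunk_size")
    # Each new chunk starts chunk_overlap characters before the previous one ends.
    step = chunk_size - chunk_overlap
    return [text[start : start + chunk_size] for start in range(0, len(text), step)]


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each chunk begins with the last two characters of the one before it, which is how a sentence straddling a chunk boundary keeps some of its context.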

To limit the number of matches returned, set `num_results`, which cuts off the list after sorting by similarity, and set `num_printing_results` to control how many of these are printed after the crawler and embedding model have run. These scores can be saved to and loaded from `data_dir` using `save_dfs` and `load_dfs`.

In [None]:
prompt_template = """
Conservation Areas in {}

We are committed to preserving the historic and architectural character of our borough
through designated conservation areas. These areas protect buildings, streets, and 
landscapes of special significance, ensuring that any changes respect their unique heritage. 
If you live or own property in a conservation area, additional planning controls may 
apply to alterations, demolitions, and new developments. Our aim is to balance modern 
needs with the protection of our historic environment. For more information on conservation 
area guidelines, planning applications, and how you can contribute to local heritage preservation, 
please visit our planning and conservation pages. You will find maps, appraisal documents and the
list of council conservation areas. 
"""

In [None]:
def pretty_print_results(sorted_df, num_results):
    """
    Print top n URLs with their similarity score.
    Assumes df is sorted.
    """
    print("\nTop Similar Pages:\n" + "=" * 40)
    for i in range(min(num_results, len(sorted_df))):
        url = sorted_df.get_column("id")[i]
        score = sorted_df.get_column("similarity")[i]
        print(f"{i+1}. {url.ljust(60)} | Similarity: {score:.4f}")

In [None]:
searcher = SimilaritySearcher(strategy="document")
prompt = prompt_template.format("Wirral")
res = searcher.search(prompt, scraped_data["https://www.wirral.gov.uk"])

In [None]:
pretty_print_results(res, num_results=5)

## 3. Evaluating results with chunking

For this section, we manually labelled 25 councils with the true page for their conservation area list. We run tests and rank the results here to assess the performance of our model. The manually defined list of true documentation URLs was loaded above as `test_df` and used to filter our main dataset into `search_tests`. We also stripped the final slash and will clean URLs in other ways throughout.

Embedding all of the pages from all of the councils can take a while. We speed this up by using keyword filters to remove pages that don't at least mention a concept we expect to see on the page we are searching for. These should be used sparingly, really only when you are certain the correct page will contain a given word.
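
The idea of a keyword pre-filter can be sketched as follows. The pages are made up, and the rule used here (keep a page if it mentions at least one keyword, ignoring case) is an assumption; the `keyword_filters` argument of the searcher may behave differently.

```python
# Hypothetical scraped pages as (url, markdown) pairs.
pages = [
    ("https://www.example.gov.uk/conservation-areas", "Our Conservation Areas and maps"),
    ("https://www.example.gov.uk/planning/apply", "Apply for planning permission"),
    ("https://www.example.gov.uk/bins", "Bin collection days"),
]


def keyword_filter(pages, keywords):
    """Keep pages whose text mentions at least one keyword, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        (url, text)
        for url, text in pages
        if any(keyword in text.lower() for keyword in lowered)
    ]


candidates = keyword_filter(pages, ["conservation"])
print(f"{len(candidates)} of {len(pages)} pages left to embed")
```

Only the surviving pages need to be embedded, which is where the time saving comes from; the cost is that a correct page that never uses the keyword is lost before the search begins.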

In [None]:
STRATEGY = "best_of_three"
searcher = SimilaritySearcher(strategy=STRATEGY)

results = dict()
for name, website, _, correct_url in search_tests.iter_rows():
    # Scraper collects page sections as separate pages so drop duplicates
    # eg. main.html#main and main.html are considered different pages
    document_df = scraped_data[website].unique(subset=["text"])

    # clean URLs for matching
    document_df = document_df.with_columns(
        id=document_df["id"].map_elements(lambda x: x.split("#")[0].strip("/"))
    )

    prompt = prompt_template.format(name)
    results[website] = searcher.search(
        query=prompt, document_df=document_df, keyword_filters=["conservation"]
    )

In [None]:
ranks = list()
num_unclassified = 0

for _, website, _, correct_url in search_tests.iter_rows():
    rank = results[website]["id"].index_of(correct_url)
    if rank is not None:
        ranks.append(rank + 1)
    else:
        num_unclassified += 1

In [None]:
height = ranks.count(mode(ranks))

fig, ax = plt.subplots(figsize=(6, 4))
ax = sns.histplot(
    x=ranks, binwidth=1, discrete=True, color="#00625E", stat="proportion"
)
ax.set_title(f"Frequency of Ranks - {num_unclassified} unclassified")
ax.set_xlabel("Rank")
ax.set_ylabel("Proportion")  # stat="proportion" normalises the counts
ax.set_xticks(range(1, 25))

plt.tight_layout()
plt.show()

In [None]:
def mean_reciprocal_rank(ranks):
    size = len(ranks)
    return sum([1 / rank for rank in ranks]) / size

In [None]:
mrr = mean_reciprocal_rank(ranks)
print(f"Reciprocal of Mean Reciprocal Rank: {1/mrr}")
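
As a small worked example of this metric, the sketch below applies the same formula to three made-up ranks. The reciprocal of the mean reciprocal rank is the harmonic mean of the ranks, so it can be read as a typical rank of the correct page. Note that councils whose correct page was not found are excluded from `ranks` above, so they do not lower the score.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all ranked items (ranks start at 1)."""
    return sum(1 / rank for rank in ranks) / len(ranks)


# Hypothetical ranks for three councils: first, second and fourth place.
example_ranks = [1, 2, 4]
example_mrr = mean_reciprocal_rank(example_ranks)  # (1 + 1/2 + 1/4) / 3 = 7/12
example_typical_rank = 1 / example_mrr  # harmonic mean of the ranks = 12/7

print(f"MRR: {example_mrr:.3f}, reciprocal: {example_typical_rank:.3f}")
```

A perfect searcher that always ranks the correct page first gives an MRR of 1, and a single poor rank moves the harmonic mean much less than it would move the ordinary mean.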

In [None]:
# ranks found for the test councils (unclassified councils are excluded)
ranks

### Case Studies

Let's look a little further at the ones that did not work out.

In [None]:
correct_page = search_tests.filter(
    search_tests["website"] == "https://www.camden.gov.uk"
)["correct_documentation_url"][0]
print(
    scraped_data["https://www.camden.gov.uk"].filter(pl.col("id") == correct_page)[
        "text"
    ]
)

For Dover, the web scraper simply does not parse the website properly. It only crawls four pages: the use of ASPX file types breaks the crawler even though the content is ultimately HTML. Crawl4AI only checks the file type, not the content, when deciding what to filter.

In [None]:
correct_page = search_tests.filter(
    search_tests["website"] == "https://www.dover.gov.uk"
)["correct_documentation_url"][0]
print(
    scraped_data["https://www.dover.gov.uk"].filter(pl.col("id") == correct_page)[
        "text"
    ]
)
print(len(scraped_data["https://www.dover.gov.uk"]))

This may be down to the `.aspx` extension used by the website, so the web crawler may not be appropriate in this case.