# Scraping web pages
This notebook creates our initial data representation of the contents of the We Know IT (WKIT) website. For a list of URLs, the page text is fetched, structured, and stored locally as JSON to enable further use in the RAG notebooks.

### Imports
The following libraries are what we need in order to scrape the web-based content.
- **requests** is used to fetch HTML content from web URLs
- **BeautifulSoup** is a parsing library that performs well at extracting text content from HTML
- **pandas** is a go-to library for data science operations on a textual corpus. I will only introduce it briefly below; refer to its documentation for a fuller picture of what the framework can do.

In [1]:
import requests
from IPython.display import display, HTML
from bs4 import BeautifulSoup
from pyquery import PyQuery as pq
import pandas as pd

### Part 1 - Identifying and fetching the URLs manually
For this example on the We Know IT website, I have manually collected all the relevant URLs in a list (defined further down). Optionally, one can programmatically crawl all sub-URLs hierarchically. Up to you!
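
If you want to go the programmatic route, the core step is extracting the same-site links from a page you have already fetched. Below is a minimal sketch of that step using only the standard library (not BeautifulSoup, to keep it dependency-free). The names `LinkCollector` and `extract_internal_links` and the sample HTML are my own assumptions for illustration, not part of this notebook's pipeline.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_internal_links(html, base_url):
    """Return the sorted, de-duplicated links in `html` that share base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    links = set()
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" parts
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == base_host:
            links.add(absolute)
    return sorted(links)


# Made-up HTML snippet, only for demonstration
sample_html = """
<a href="/om-oss/">Om oss</a>
<a href="tjanster/design/#top">Design</a>
<a href="https://example.com/">External</a>
<a href="mailto:info@example.com">Mail</a>
"""
internal_links = extract_internal_links(sample_html, "https://www.weknowit.se/")
print(internal_links)
```

A full crawler would repeat this for every newly discovered link while keeping a set of visited URLs, and should respect the site's robots.txt and rate limits.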

In [2]:

def fetch_html_content(url):
    """
    Fetch the HTML content of a web URL.

    Parameters:
    url (str): The URL of the webpage to fetch.

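    Returns:
    str or None: The HTML content of the webpage, or None if the request failed.
    """
    try:
        # A timeout keeps the notebook from hanging on an unresponsive server
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an exception for HTTP errors
        return response.text
    except requests.RequestException as e:
        print(f"An error occurred while fetching {url}: {e}")
        return None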

### Part 2 - Parsing the content
Here is a very simple default approach to parsing the content using **BeautifulSoup**. We use the standard `html.parser` along with the `get_text()` method. `separator=' '` joins the extracted text fragments with a single space (as in standard text), and `strip=True` strips leading and trailing whitespace from each fragment.

In [3]:
def parse_html_content(html_content):
    """
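    Parse HTML content using BeautifulSoup.

    Parameters:
    html_content (str): The raw HTML of a web page.

    Returns:
    str: The text content of the page, with fragments joined by single spaces.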
    """
    soup = BeautifulSoup(html_content, 'html.parser')
    content = soup.get_text(separator=' ', strip=True)
    return content
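
To make the effect of `separator` and `strip` concrete, here is a small sketch on a made-up HTML snippet (the snippet is my own, not taken from the We Know IT site). It compares the default `get_text()` call with the one used in `parse_html_content`.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet, only for demonstration
sample_html = "<div><h1> Title </h1><p>Hello <b>world</b>!</p></div>"
soup = BeautifulSoup(sample_html, "html.parser")

raw_text = soup.get_text()                             # fragments concatenated as-is
clean_text = soup.get_text(separator=" ", strip=True)  # fragments stripped, then joined

print(repr(raw_text))    # ' Title Hello world!'
print(repr(clean_text))  # 'Title Hello world !'
```

Note the space before the exclamation mark in the second result: every text fragment is joined with the separator, including fragments split by inline tags such as `<b>`. For embedding purposes this is usually harmless, but it is good to know where the extra spaces come from.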

### The manually identified URLs

In [4]:
urls = ['https://www.weknowit.se/om-oss/', 
        'https://www.weknowit.se/karriar/', 
        'https://www.weknowit.se/tjanster/konsultuthyrning/', 
        'https://www.weknowit.se/tjanster/webbutveckling/',
        'https://www.weknowit.se/tjanster/apputveckling/',
        'https://www.weknowit.se/tjanster/design/',
        'https://www.weknowit.se/tjanster/digital-marknadsforing/',
        'https://www.weknowit.se/tjanster/hosting-forvaltning/',
        'https://www.weknowit.se/kunder/vernivia-webbutveckling-marknadsforing/',
        'https://www.weknowit.se/kunder/urw-systemutveckling/',
        'https://www.weknowit.se/kunder/place2place-webbutveckling/',
        'https://www.weknowit.se/kunder/varldskulturmuseet-webbutveckling/',
        'https://www.weknowit.se/kunder/kundcase-folkes-biluthyrning/',
        'https://www.weknowit.se/kunder/kundcase-boujt/',
        'https://www.weknowit.se/kunder/kundcase-world-hidden-cash/',
        'https://www.weknowit.se/kunder/kundcase-kavlinge-kommun-volontarportal/'
        ]
contents = []


### Part 3 - Constructing the DataFrame

Now we move on to some simple data engineering. The first function is a helper that adds useful metadata to the final dataset by determining whether a web page concerns **customers**, **services**, or **general company information**.

#### Why Metadata?

This is an important question when considering data engineering for NLP (Natural Language Processing) tasks. While this codebase will not cover the applications of metadata filtering, it's still important to explain why it is good practice.

Metadata attached to embedded text chunks (turn to the README.md file for a conceptual introduction) is preserved in its original form. This enables filtering downstream when querying the vector database, often by category.

Consider a case like the one we have here, where there is a lot of information divided into three categories: **customers**, **services**, and **general**. Suppose all text chunks have such a metadata tag, and a user asks the system a question like:

*Has We Know IT previously worked with any customers requiring payment services in their app?*

A Large Language Model (LLM) could be prompted, few-shot prompted, or trained to classify this question. It would likely belong to the class **"customers"**, and thus we only need to look for relevant context among the chunks that have the **"customers"** metadata tag.

**In short, such filtering can be very beneficial to reduce complexity and increase performance!**

### Summary of Benefits
- **Preservation**: Metadata remains intact in its original form.
- **Filtering**: Enables efficient filtering of relevant data.
- **Classification**: Improves the classification of user queries.
- **Performance**: Reduces complexity and enhances performance.

By organizing your data with appropriate metadata, you can significantly improve the efficiency and effectiveness of your NLP tasks.
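
As a rough sketch of the idea (not something this codebase implements), the filtering step could look like the following. The chunk texts are placeholders, and `classify_question` is a hypothetical keyword-based stand-in for the LLM classifier described above; a real vector database would apply the filter as part of the similarity query.

```python
# Placeholder chunks; in practice these come from the scraped DataFrame
chunks = [
    {"text": "Placeholder text about a customer case.", "category": "customers"},
    {"text": "Placeholder text about a service.", "category": "services"},
    {"text": "Placeholder text about the company.", "category": "general"},
]


def classify_question(question):
    """Hypothetical stand-in for an LLM classifier: naive keyword matching."""
    lowered = question.lower()
    if "customer" in lowered:
        return "customers"
    if "service" in lowered:
        return "services"
    return "general"


def filter_chunks(chunks, category):
    """Keep only the chunks whose metadata tag matches the category."""
    return [chunk for chunk in chunks if chunk["category"] == category]


question = (
    "Has We Know IT previously worked with any customers "
    "requiring payment services in their app?"
)
category = classify_question(question)
candidates = filter_chunks(chunks, category)
print(category, len(candidates))
```

Only the chunks in `candidates` would then need to be compared against the question embedding. Note that the example question mentions both "customers" and "services", which is exactly why a real classifier is preferable to keyword matching.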



In [2]:
def determine_category(url):
    """
    Determine the category based on the URL.

    Parameters:
    url (str): The URL to analyze.

    Returns:
    str: The category based on the URL.
    """
    if 'kunder' in url:
        return 'customers'
    elif 'tjanster' in url:
        return 'services'
    else:
        return 'general'

def process_urls(urls):
    """
    Process a list of URLs to fetch and parse their HTML content, then store the results in a DataFrame.

    Parameters:
    urls (list): A list of URLs to process.

    Returns:
    pd.DataFrame: A DataFrame with columns for the key, parsed text, URL, and category.
    """
    data = []
    
    for url in urls:
        html_content = fetch_html_content(url)
        if html_content:
            parsed_text = parse_html_content(html_content)
            key = url.rstrip('/').split('/')[-1]  # Extract the key from the URL
            category = determine_category(url)  # Determine the category
            data.append({'key': key, 'text': parsed_text, 'url': url, 'category': category})
    
    df = pd.DataFrame(data)
    return df

### Part 4 - Build the DataFrame

The function above is called with the list of URLs we constructed. **Note this:** The specific requirements for filtering and parsing may vary from case to case. Here, the parsing was pretty straightforward. I programmatically extract the key from the last path segment of the URL, and determine the category we discussed earlier from the URL as a whole.

Note that the line:

```python
key = url.rstrip('/').split('/')[-1]
```

is not a universal solution, but rather tailored to the *We Know IT* website URLs. Anyway, I want to point out that you should aim to capture as much metadata as possible during the data engineering phase. **The more the merrier!** You never know what might become relevant in the future of the app.
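
If you want something slightly more robust, one option is to work on the parsed URL path instead of the raw string. The sketch below is an assumption of mine, not what `process_urls` does: `describe_url` and `CATEGORY_BY_SEGMENT` are hypothetical names, and the query-string and blog URLs in the usage lines are made up to show the edge cases.

```python
from urllib.parse import urlparse

# Maps the first path segment to a category; anything else is "general"
CATEGORY_BY_SEGMENT = {"kunder": "customers", "tjanster": "services"}


def describe_url(url):
    """Derive key and category from the URL path, ignoring query and fragment."""
    segments = [segment for segment in urlparse(url).path.split("/") if segment]
    if not segments:
        return {"key": "", "category": "general"}
    return {
        "key": segments[-1],
        "category": CATEGORY_BY_SEGMENT.get(segments[0], "general"),
    }


print(describe_url("https://www.weknowit.se/tjanster/design/"))
# Made-up variants: a query string, and a path that merely contains "kunder"
print(describe_url("https://www.weknowit.se/kunder/kundcase-boujt/?ref=start"))
print(describe_url("https://www.weknowit.se/blogg/nya-kunder/"))
```

With the plain string approach, the second URL would get `?ref=start` as its key, and the third would be tagged **customers** because the substring `kunder` appears somewhere in it.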

Imagine a scenario: a customer of ours receives a chatbot on their website from us, and then after a while goes:

> "Hey, can you make sure the info presented by the chatbot is always ordered by how recently the information was uploaded?"

If you have little metadata, this would be a pain in the ass. Luckily for you, you anticipated that such things might happen, so you added a `created_at` field to the metadata of every document. Now you can simply apply an ordering mechanism instead of rebuilding.
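
A minimal sketch of what that could look like, assuming records shaped like the dictionaries built in `process_urls`. The `make_record` helper, the example.com URLs, and the dates are hypothetical. Also note that a timestamp taken at scrape time says when *we* collected the page, not when the site owner uploaded it; for the latter you would need a date from the page itself.

```python
from datetime import datetime, timezone


def make_record(key, text, url, category, created_at=None):
    """Build one record and stamp it with a UTC ISO 8601 timestamp."""
    if created_at is None:
        created_at = datetime.now(timezone.utc)
    return {
        "key": key,
        "text": text,
        "url": url,
        "category": category,
        "created_at": created_at.isoformat(),
    }


# Made-up records with fixed timestamps so the ordering is easy to follow
records = [
    make_record("a", "older text", "https://example.com/a/", "general",
                datetime(2024, 1, 1, tzinfo=timezone.utc)),
    make_record("b", "newer text", "https://example.com/b/", "general",
                datetime(2024, 6, 1, tzinfo=timezone.utc)),
]

# Order by recency: parse the stored strings back into datetimes for comparison
newest_first = sorted(
    records,
    key=lambda record: datetime.fromisoformat(record["created_at"]),
    reverse=True,
)
print([record["key"] for record in newest_first])
```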

For the purpose of demonstration, this function only creates a pandas DataFrame with **4 columns**:
- **key**: the last path segment of the URL, based on the routing structure of the app.
- **text**: the actual text we will be embedding.
- **url**: the original URL.
- **category**: the category we determined above.


In [9]:
df = process_urls(urls)

In [10]:
display(df)

| | key | text | url | category |
|---|---|---|---|---|
| 0 | om-oss | Nyfiken På We Know IT? \| Om Oss Våra kunder Vå... | https://www.weknowit.se/om-oss/ | general |
| 1 | karriar | Karriar \| We Know IT Våra kunder Våra tjänster... | https://www.weknowit.se/karriar/ | general |
| 2 | konsultuthyrning | Konsultuthyrning \| Din Framtida Talang \| We Kn... | https://www.weknowit.se/tjanster/konsultuthyrn... | services |
| 3 | webbutveckling | Webbutveckling För Företag & Organisationer \| ... | https://www.weknowit.se/tjanster/webbutveckling/ | services |
| 4 | apputveckling | Apputveckling \| Vi Bygger Er App \| We Know IT ... | https://www.weknowit.se/tjanster/apputveckling/ | services |
| 5 | design | Design \| UX/UI \| Grafisk Design \| Designprotot... | https://www.weknowit.se/tjanster/design/ | services |
| 6 | digital-marknadsforing | Digital Marknadsföring \| Maximera Din Synlighe... | https://www.weknowit.se/tjanster/digital-markn... | services |
| 7 | hosting-forvaltning | Hosting & Förvaltning \| Snabb, Säker, Online \|... | https://www.weknowit.se/tjanster/hosting-forva... | services |
| 8 | vernivia-webbutveckling-marknadsforing | Vernivia \| Design, Webbutveckling & Marknadsfö... | https://www.weknowit.se/kunder/vernivia-webbut... | customers |
| 9 | urw-systemutveckling | Ärendehanteringssystem åt Unibail-Rodamco-West... | https://www.weknowit.se/kunder/urw-systemutvec... | customers |


### Part 5 - Save locally
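For now, we store this DataFrame as JSON for further usage in other notebooks. With `orient='records'` and `lines=True`, each row is written as one JSON object on its own line (the JSON Lines format), rather than as a single JSON array.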

In [11]:
df.to_json('data.json', orient='records', lines=True)
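
To illustrate the file layout, here is a small standard-library sketch that writes and reads the same one-object-per-line format. The records and the temporary path are placeholders of mine. In the other notebooks, `pd.read_json('data.json', lines=True)` should load the real file back into a DataFrame.

```python
import json
import os
import tempfile

# Placeholder records shaped like the rows of the DataFrame above
records = [
    {"key": "om-oss", "text": "placeholder text",
     "url": "https://www.weknowit.se/om-oss/", "category": "general"},
    {"key": "design", "text": "placeholder text",
     "url": "https://www.weknowit.se/tjanster/design/", "category": "services"},
]

# Write to a temporary location so we do not overwrite the real data.json
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")  # one JSON object per line

# Read it back line by line, skipping any blank lines
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded[0]["key"])
```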