## Fandom (dot) Com Scraper + Package Builder

This notebook can be used to scrape fandom (dot) com wiki pages in order to prepare a data package for the charlie discord bot.

#### Usage

This notebook does not automatically create an index of URLs to scrape. To get started, you must manually create the `urls.txt` file: a plaintext list of URLs, one per line. The order of the URLs in this file matters. When de-duplication runs, the page listed first keeps any content it shares with later pages. [See here for more info](#de-dupe).
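
A minimal sketch of how the list could be read so that blank lines and stray whitespace do not cause trouble in later steps. The helper name `load_urls` and the example URLs are hypothetical and not part of the notebook. It keeps the first occurrence of each URL, so the ordering described above is preserved.

```python
def load_urls(lines):
    """Return stripped, non-empty URLs in their original order, dropping repeats."""
    seen = set()
    urls = []
    for line in lines:
        url = line.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls


# hypothetical content of urls.txt
example_lines = [
    'https://example.fandom.com/wiki/Main_Page\n',
    '\n',
    'https://example.fandom.com/wiki/Some_Character\n',
    'https://example.fandom.com/wiki/Main_Page\n',
]
print(load_urls(example_lines))
```

With a real file, the same helper could be fed the open file object, for example `load_urls(open('urls.txt', encoding='utf-8'))`.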

A couple of the scripts in here make use of the OpenAI API. To use them, put `OPENAI_API_KEY="<your key here>"` in a file called `.env` in the working directory.

#### Preparation

In [None]:
# prep in case you do not have the nltk modules installed
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

#### Downloading

Using pyppeteer to scrape has worked well for me. If working on other sites check the output of this step to see you are getting the data you expect.

In [None]:
import os.path as path
import asyncio
from pyppeteer import launch
import os

url_file = 'urls.txt'
download_dir = 'downloaded_data'
log_dir = 'logs'
log_name = 'download_log'
no_redownload = True

# ensure the log directory exists
if not path.exists(log_dir):
    os.mkdir(log_dir)

download_errors = open(path.join(log_dir, log_name), 'a', encoding='utf-8', newline='\n')  # nopep8
cwd = os.getcwd()

# ensure the download directory exists
if not path.exists(download_dir):
    os.mkdir(download_dir)


async def download(url: str):
    save_name = url.split('/')[-1].strip()
    # colons are not allowed in filenames on windows...
    save_name = save_name.replace(':', '__').replace('%27', '\'') + '.html'

    if no_redownload and path.exists(path.join(cwd, download_dir, save_name)):
        # check if the file has content
        with open(path.join(cwd, download_dir, save_name), 'r', encoding='utf-8') as f:
            content = f.read()

        if len(content) != 0:
            print(f'Skipping {url} because it already exists and is non-empty')
            return

    print(f'Downloading {url} to {path.join(cwd, download_dir, save_name)}')

    browser = await launch()
    page = await browser.newPage()
    try:
        await page.goto(url)
        content = await page.content()

        with open(path.join(download_dir, save_name), 'w', encoding='utf-8', newline='\n') as f:
            f.write(content)

        if len(content) == 0:
            print(f'Warning: {url} is empty')
            download_errors.write(f'Warning: {url} is empty\n')
        else:
            print(f'Successfully downloaded {url}')
    except Exception as e:
        # include the exception so the log says what actually went wrong
        print(f'Error with {url}: {e}')
        download_errors.write(f'Error with {url}: {e}\n')

    finally:
        await browser.close()


async def main():
    # read in the list of urls from a file, dropping newlines and blank lines
    with open(url_file, 'r') as f:
        urls = [line.strip() for line in f if line.strip()]

    # download each url
    for i, url in enumerate(urls):
        await download(url)
        # wait a bit to be nice
        if i < len(urls) - 1:
            await asyncio.sleep(5)

await asyncio.get_event_loop().create_task(main())
download_errors.close()

print(f'Logged errors to {path.join(cwd, log_dir, log_name)}')
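
The mapping from URL to file name is repeated in every later cell, so it is worth seeing in isolation. The sketch below wraps the same rule in a hypothetical helper, `url_to_save_name` (the name and the example URLs are assumptions, not part of the notebook). Because only the last path segment is used, two URLs that end in the same segment would overwrite each other.

```python
def url_to_save_name(url: str) -> str:
    """Mirror the notebook's naming rule: last path segment, ':' -> '__', '%27' -> "'"."""
    name = url.split('/')[-1].strip()
    return name.replace(':', '__').replace('%27', "'")


# hypothetical URLs, for illustration only
print(url_to_save_name('https://example.fandom.com/wiki/Category:Items\n'))
print(url_to_save_name('https://example.fandom.com/wiki/Dragon%27s_Lair'))
```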


#### Extraction

We now want to extract the text content of the page, disregarding the headers, footers, and most of the navigation. This step is what makes this notebook highly specific to fandom wiki pages. If you want to adapt it to other sites, major changes will be needed here.

In [None]:
import os
import os.path as path
from bs4 import BeautifulSoup
import re

url_file = 'urls.txt'
download_dir = 'downloaded_data'
log_dir = 'logs'
extracted_data = 'extracted_data'
log_name = 'extraction_log'

# ensure the log directory exists
if not path.exists(log_dir):
    os.mkdir(log_dir)

extraction_errors = open(path.join(log_dir, log_name), 'a', encoding='utf-8', newline='\n')  # nopep8
cwd = os.getcwd()

# ensure the extraction directory exists
if not path.exists(extracted_data):
    os.mkdir(extracted_data)


# use the url file to get the list of files to extract
with open(url_file, 'r') as f:
    urls = f.readlines()

for url in urls:
    save_name = url.split('/')[-1].strip()
    # colons are not allowed in filenames on windows...
    save_name = save_name.replace(':', '__').replace('%27', '\'')
    input_file = save_name + '.html'
    save_name = save_name + '.txt'

    # read in the file
    with open(path.join(download_dir, input_file), 'r', encoding='utf-8') as f:
        content = f.read()

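    soup = BeautifulSoup(content, 'html.parser')
    soup = soup.find('div', {'class': 'mw-parser-output'})
    if soup is None:
        # the page has no wiki content container, so there is nothing to extract
        print(f'Warning: no content found in {input_file}')
        extraction_errors.write(f'Warning: no content found in {input_file}\n')
        continue
    # remove all tables with the class 'navbox'
    for navbox in soup.find_all('table', {'class': 'navbox'}):
        navbox.decompose()
    text = soup.get_text()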

    # remove lines with only whitespace
    text = re.sub(r'^\s+$', '', text, flags=re.MULTILINE)

    # if the line ends in "[]" we should insert a newline at the start of the line
    text = re.sub(r'^(.*)(\[\])$', r'\n\1\2', text, flags=re.MULTILINE)

    # replace all 3+ newlines with a double newline
    text = re.sub(r'\n{3,}', '\n\n', text)

    if len(text) < 100:
        print(f'Warning: low content in {save_name}')
        extraction_errors.write(f'Warning: low content in {save_name}\n')

    # save to file
    with open(path.join(extracted_data, save_name), 'w', encoding='utf-8', newline='\n') as f:
        f.write(text)

    print(f'Extracted {path.join(cwd, extracted_data, save_name)}')
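
To see what the three regular expressions in the cell above do, it can help to run them on a small string. The sketch below applies the same patterns in the same order. The helper name `tidy` and the sample text are made up for illustration, and the assumption that a line ending in `[]` is a section heading is a guess based on the comment in the cell.

```python
import re


def tidy(text: str) -> str:
    """Apply the extraction cell's three whitespace clean-ups in order."""
    # blank out lines that contain only whitespace
    text = re.sub(r'^\s+$', '', text, flags=re.MULTILINE)
    # put an empty line before any line ending in "[]" (assumed to be a heading)
    text = re.sub(r'^(.*)(\[\])$', r'\n\1\2', text, flags=re.MULTILINE)
    # collapse runs of 3+ newlines into a single blank line
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text


# made-up sample resembling get_text() output
sample = 'Intro line.\nHistory[]\nSome history.\n \t \n\n\n\nMore text.'
print(tidy(sample))
```

The result has one blank line before the `History[]` line and one before `More text.`, which is the blank-line chunk separator the de-dupe step relies on.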


#### De-Dupe

This is the step in which the ordering of the URL list comes into play. The first file on the list will be the one that keeps the duplicated chunks. Intuitively, the duplicated data is likely to be indexes/navigation, acknowledgements, branding, and other low-detail or even unwanted content. It makes sense for this data to live in the files that are themselves indexes/navigation, overviews, or other miscellaneous pages, so list those first. This increases the density of high-detail information in the pages where we are most likely to be looking up specifics, which should result in better performance of the agent.

In [None]:
# de-duplication
# chunks of text are separated by a blank line

import os
import os.path as path

url_file = 'urls.txt'
extracted_data = 'extracted_data'
deduped_data = 'deduped_data'
cwd = os.getcwd()
verbose = True

# ensure the deduplication directory exists
if not path.exists(deduped_data):
    os.mkdir(deduped_data)

# de-duplicate the data

hash_set = set()

# use the url file to get the list of files to dedupe
with open(url_file, 'r') as f:
    urls = f.readlines()

for url in urls:
    save_name = url.split('/')[-1].strip()
    # colons are not allowed in filenames on windows...
    save_name = save_name.replace(':', '__').replace('%27', '\'')
    input_file = save_name + '.txt'
    save_name = save_name + '.txt'

    # read in the file
    with open(path.join(extracted_data, input_file), 'r', encoding='utf-8', newline='\n') as f:
        content = f.read()

    # split the content into chunks
    chunks = content.split('\n\n')

    # save the deduped data
    save_file = open(path.join(deduped_data, save_name), 'w', encoding='utf-8', newline='\n')  # nopep8

    # add each chunk to the hash set
    for chunk in chunks:
        if chunk not in hash_set:
            save_file.write(chunk + '\n\n')
            hash_set.add(chunk)
        elif verbose:
            print(f'Duplicate chunk found: {chunk}')

    print(f'Deduped {path.join(cwd, deduped_data, save_name)}')

    save_file.close()
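
The effect of the ordering is easiest to see on in-memory data. Below is a minimal sketch of the same first-seen-wins idea, assuming pages are given as `(name, text)` pairs in `urls.txt` order. The function name `dedupe_in_order` and the page contents are hypothetical. Unlike the cell above, it joins the kept chunks without a trailing blank line.

```python
def dedupe_in_order(pages):
    """pages: list of (name, text) pairs, in urls.txt order.

    Returns a dict mapping name -> text with any chunk already seen in an
    earlier page removed. Chunks are separated by a blank line.
    """
    seen = set()
    result = {}
    for name, text in pages:
        kept = []
        for chunk in text.split('\n\n'):
            if chunk not in seen:
                seen.add(chunk)
                kept.append(chunk)
        result[name] = '\n\n'.join(kept)
    return result


# hypothetical page contents sharing one footer chunk
pages = [
    ('Main_Page', 'Welcome to the wiki.\n\nCommunity content is available under a license.'),
    ('Some_Character', 'A brave knight.\n\nCommunity content is available under a license.'),
]
for name, text in dedupe_in_order(pages).items():
    print(f'{name}: {text!r}')
```

Here the shared footer stays in `Main_Page` and is dropped from `Some_Character`; swapping the two entries would reverse that.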


#### Cleaning

Data cleaning can be done with an LLM.

**Note:** This makes OpenAI calls and will need the API key set in the `.env` file.

**WARNING!!!** This prompt is badly calibrated. Check that you are not getting garbage out of this step.


In [None]:
# cleaning prototyping
# NOTE: openai.ChatCompletion is the legacy interface and requires openai<1.0

import openai
from dotenv import load_dotenv
import os
import os.path as path

url_file = 'urls.txt'
deduped_data = 'deduped_data'
cleaned_data = 'cleaned_data'
log_dir = 'logs'
log_name = 'cleaning_log'
cwd = os.getcwd()

# ensure the log directory exists
if not path.exists(log_dir):
    os.mkdir(log_dir)

# ensure the cleaning directory exists
if not path.exists(cleaned_data):
    os.mkdir(cleaned_data)

# open the log file
cleaning_errors = open(path.join(log_dir, log_name), 'a', encoding='utf-8', newline='\n')  # nopep8

load_dotenv()

openai.api_key = os.getenv('OPENAI_API_KEY')


def clean(text, filename, line_number):
    if len(text) == 0:
        return text

    messages = [
        {'role': 'user', 'content': f'Clean the following text, inserting spaces where it looks like spaces need to be inserted. Only output the cleaned text.\n\n{text}\n\n'},
    ]

    retry_count = 3

    while retry_count > 0:
        try:
            response = openai.ChatCompletion.create(
                model='gpt-3.5-turbo-0613',
                messages=messages,
                temperature=0.0,
                functions=[
                    {
                        "name": "abort_cleaning",
                        "description": "If the input text is already clean, abort the cleaning process.",
                        "parameters": {
                            "type": "object",
                            "properties": {},
                            "required": [],
                        },
                    }
                ],
                function_call="auto",
                # "Clean" - 28629
                # "clean" - 18883
                logit_bias={28629: -1, 18883: -1},
            )

            response_message = response["choices"][0]["message"]  # type: ignore # nopep8
            if response_message.get("function_call"):
                print(f'Aborting cleaning of {filename}:{line_number} >>> {text}')  # nopep8
                return text
            else:
                return response_message["content"].strip()

        except Exception as e:
            retry_count -= 1
            if retry_count > 0:
                print(f'Error: {e} (retrying)')
            else:
                print(f'Error: {e}')
                cleaning_errors.write(f'Error in {filename}:{line_number} (cleaning line skipped): {e}\n')  # nopep8
    # all retries failed, so fall back to the uncleaned text
    return text


# use the url file to get the list of files to clean
with open(url_file, 'r') as f:
    urls = f.readlines()

for url in urls:
    save_name = url.split('/')[-1].strip()
    # colons are not allowed in filenames on windows...
    save_name = save_name.replace(':', '__').replace('%27', '\'')
    input_file = save_name + '.txt'
    save_name = save_name + '.txt'

    # read in the file
    with open(path.join(deduped_data, input_file), 'r', encoding='utf-8', newline='\n') as f:
        content = f.read()

    # we do the cleaning line by line for best results
    lines = content.split('\n')
    for i, line in enumerate(lines):
        lines[i] = clean(line, save_name, i)

    # rejoin the lines
    content = '\n'.join(lines)

    # save the cleaned data
    with open(path.join(cleaned_data, save_name), 'w', encoding='utf-8', newline='\n') as f:
        f.write(content.strip())

    print(f'Cleaned {path.join(cwd, deduped_data, save_name)} written to {path.join(cwd, cleaned_data, save_name)}')  # nopep8
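
Since the prompt only asks the model to insert spaces, one cheap way to spot garbage is to compare the input and output with all whitespace removed: if they differ, the model changed more than spacing. This is a minimal sketch of that idea, not something the notebook currently does, and `only_spacing_changed` is a hypothetical helper name.

```python
def only_spacing_changed(original: str, cleaned: str) -> bool:
    """True if the two strings are identical once all whitespace is removed."""
    return ''.join(original.split()) == ''.join(cleaned.split())


# spacing fixed, text untouched
print(only_spacing_changed('TheQuick brownfox', 'The Quick brown fox'))
# the text itself was rewritten
print(only_spacing_changed('TheQuick brownfox', 'A quick brown fox.'))
```

In the loop above, a line whose cleaned version fails this check could be logged and left unchanged.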


#### Named Entity Recognition

By running named entity recognition on the data we can create a decent way to perform activity detection.

In [None]:
# FIXME I have had problems with doing this step well, and so right now we are doing something very lazy

import os

url_file = 'urls.txt'
named_entities = 'named_entities.txt'
cwd = os.getcwd()

with open(url_file, 'r') as f:
    urls = f.readlines()

output_file = open(named_entities, 'w', encoding='utf-8', newline='\n')

for url in urls:
    entity = url.split('/')[-1].strip()
    entity = entity.replace('%27', '\'').replace('_', ' ')
    output_file.write(entity + '\n')

output_file.close()

print(f'Named entities written to {os.path.join(cwd, named_entities)}')


#### Chunking

We now need to break the pages up into slightly overlapping chunks for ingestion.

In [None]:
import os.path as path
import os
import nltk
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

target_word_count_per_chunk = 300
url_file = 'urls.txt'
cleaned_data = 'cleaned_data'
chunked_data = 'chunked_data'
log_dir = 'logs'
log_name = 'chunking_log'
cwd = os.getcwd()

# ensure the log directory exists
if not path.exists(log_dir):
    os.mkdir(log_dir)

# ensure the chunking directory exists
if not path.exists(chunked_data):
    os.mkdir(chunked_data)

# use the url file to get the list of files to chunk
with open(url_file, 'r') as f:
    urls = f.readlines()

for url in urls:
    save_name = url.split('/')[-1].strip()
    # colons are not allowed in filenames on windows...
    save_name = save_name.replace(':', '__').replace('%27', '\'')
    input_file = save_name + '.txt'

    page = open(path.join(cleaned_data, input_file), 'r', encoding='utf-8', newline='\n')  # nopep8
    text = page.read()
    page.close()

    sentences = nltk.sent_tokenize(text)
    chunks = []
    current_sentences = []
    current_chunk_word_count = 0
    for sentence in sentences:
        current_sentences.append(sentence)
        current_chunk_word_count += len(sentence.split())
        if current_chunk_word_count >= target_word_count_per_chunk:
            chunks.append('\n\n'.join(current_sentences))
            # overlap the chunks by 1 sentence
            current_sentences = [sentence]
            current_chunk_word_count = 0

    # keep the trailing partial chunk, unless it is only the overlap sentence
    if len(current_sentences) > 1 or (current_sentences and not chunks):
        chunks.append('\n\n'.join(current_sentences))

    # write the chunks to a file
    i = 0
    for chunk in chunks:
        # turn the file-safe name back into a readable page title
        good_file_name = save_name.replace('__', ':')
        good_file_name = good_file_name.replace('_', ' ')
        outfile = open(path.join(chunked_data, f"{save_name}_{i}.txt"), 'w', encoding='utf-8', newline='\n')  # nopep8
        outfile.write(f"{good_file_name} --------------\n\n")
        outfile.write(chunk)
        outfile.close()
        print(f'Chunked {path.join(cwd, cleaned_data, input_file)} into {path.join(cwd, chunked_data, save_name)}_{i}.txt')  # nopep8
        i += 1
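
The overlap logic can be tried without NLTK by passing in a ready-made list of sentences. This is a minimal sketch under that assumption. The function name `chunk_sentences` and the sample sentences are hypothetical, and it joins sentences with a space rather than the blank line used in the cell above. It also keeps the trailing partial chunk, which a chunker of this kind needs to do to avoid dropping the end of a page.

```python
def chunk_sentences(sentences, target_words):
    """Group sentences into chunks of roughly target_words words.

    Each new chunk starts with the last sentence of the previous one, so
    neighbouring chunks overlap by one sentence.
    """
    chunks = []
    current = []
    word_count = 0
    for sentence in sentences:
        current.append(sentence)
        word_count += len(sentence.split())
        if word_count >= target_words:
            chunks.append(' '.join(current))
            current = [sentence]  # carry the last sentence over as overlap
            word_count = 0        # the carried-over sentence is not counted
    # keep whatever is left, unless it is only the carried-over sentence
    if len(current) > 1 or (current and not chunks):
        chunks.append(' '.join(current))
    return chunks


sentences = ['One two three.', 'Four five six.', 'Seven eight nine.', 'Ten eleven.']
for chunk in chunk_sentences(sentences, target_words=6):
    print(repr(chunk))
```

With a target of 6 words, the second sentence closes the first chunk and also opens the second one.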
            


#### Ingesting and Exporting

We now need to ingest the data into the vector store.

I am going to redo the package format, but I have not done this yet. Once I have, I will update this section with a cell that builds the package. For now, however, you need to use the `package_builder.js` script to create the package.

```bash
npm install
npm run create -- my_package_name
```

#### Utilities

In [47]:
# a script to clean up all the temporary directories and files

import os

download_dir = 'downloaded_data'
extracted_data = 'extracted_data'
deduped_data = 'deduped_data'
cleaned_data = 'cleaned_data'
chunked_data = 'chunked_data'
log_dir = 'logs'

for dir in [download_dir, extracted_data, deduped_data, cleaned_data, chunked_data, log_dir]:
    if os.path.exists(dir):
        for file in os.listdir(dir):
            os.remove(os.path.join(dir, file))
        os.rmdir(dir)
