# Project GenAIssance Demo

This demonstration lets you scrape data from a website, or a subsection of a website, ingest it into an in-memory vector database, and then submit a natural language query that is sent to the ChatGPT API along with context from your data. This lets you use the power of ChatGPT with your own data.

**Important:** This demo is running in a limited environment and should not be used for very large websites. If you have a larger website, you can use the filter options described below to limit the scope of your job. You can use this demo effectively for up to a few hundred pages. If you attempt to scrape a very large amount of content, it may run slowly or fail. 

## Step 1: Scrape the web content to gather your data

In the first step, we will use the Scrapy Python library to capture data from your website.

### About input variables

In ALL cases, this demo requires that the user inputs their OPENAI_API_KEY when prompted in the exercises below.

In the sections below, you can execute the code with the example variables provided, or you can enter your own desired values to customize the demo with your desired data. If you would like to customize the demo with your data, please read this explanation to understand the values you will need to provide. 

#### The ALLOWED_DOMAINS variable

This should be the domain of the website or subsite you want to gather data from, without the scheme or path. A more specific domain is better. For example, if you want to access content only from the VMware documentation, you would set ALLOWED_DOMAINS to `docs.vmware.com`.
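If you only have a page URL at hand, the domain can be derived from it. The following is a minimal sketch using the standard library; `domain_from_url` is a hypothetical helper and is not part of the demo code.

```python
from urllib.parse import urlparse


def domain_from_url(url: str) -> str:
    """Return the host part of a URL, suitable as an ALLOWED_DOMAINS value."""
    # urlparse only fills in the host when a scheme (or a leading //) is present,
    # so prepend // for bare values such as "docs.vmware.com/en".
    parsed = urlparse(url if "//" in url else "//" + url)
    return parsed.hostname or ""


print(domain_from_url(
    "https://docs.vmware.com/en/VMware-Tanzu-Application-Platform/1.4/tap/overview.html"
))
```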

#### The START_URL variable

This is the seed URL that the web scraper will use as its starting point. If you are targeting an entire website, you could just enter the homepage URL for the website. But if you are targeting a subsection within a website, such as the documentation for a specific product, you should use the main page for the subsection you want to target. For example, if you wanted to collect the TAP 1.4 documentation, a good START_URL value would be the main page for the TAP 1.4 documentation: `https://docs.vmware.com/en/VMware-Tanzu-Application-Platform/1.4/tap/overview.html`.

The scraper starts with this URL, scans all links on the page that fall within ALLOWED_DOMAINS, and then recursively scans each discovered page until it has identified all page URLs in ALLOWED_DOMAINS, subject to the ALLOW_RULE described below.
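The recursive scan can be pictured as a breadth-first walk over links. The sketch below is an illustration only: it walks a hypothetical in-memory link graph (`LINKS`, with made-up `example.com` URLs) instead of fetching real pages, and it compares hosts exactly, which is a simplification of how Scrapy handles allowed domains.

```python
import re
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real pages: URL -> links found on that page.
LINKS = {
    "https://docs.example.com/en/product/1.4/overview.html": [
        "https://docs.example.com/en/product/1.4/install.html",
        "https://docs.example.com/en/product/1.3/overview.html",
        "https://other.example.org/blog.html",
    ],
    "https://docs.example.com/en/product/1.4/install.html": [
        "https://docs.example.com/en/product/1.4/overview.html",
        "https://docs.example.com/en/product/1.4/upgrade.html",
    ],
}


def crawl(start_url, allowed_domain, allow_rule):
    """Breadth-first walk that only follows links passing both filters."""
    pattern = re.compile(allow_rule)
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        page = queue.popleft()
        for link in LINKS.get(page, []):
            in_domain = urlparse(link).hostname == allowed_domain
            if in_domain and pattern.search(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)


for url in crawl(
    "https://docs.example.com/en/product/1.4/overview.html",
    "docs.example.com",
    "en/product/1.4/",
):
    print(url)
```

The 1.3 page and the page on the other domain are never visited, because each fails one of the two filters.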

#### The ALLOW_RULE Variable

The ALLOW_RULE is a URL pattern that limits the scope of the pages the scraper will index. Scrapy's LinkExtractor treats the value as a regular expression and searches for it anywhere in the URL, so a distinctive substring of the URL is usually enough. For example, if you want to index the Tanzu Application Platform (TAP) 1.4 documentation, you will find that all the page URLs within the TAP 1.4 docs share a common pattern:

The TAP 1.4 Documentation main page URL is: `https://docs.vmware.com/en/VMware-Tanzu-Application-Platform/1.4/tap/overview.html`. This URL includes the substring `en/VMware-Tanzu-Application-Platform/1.4`. If you look through the TAP documentation, you will find that all English language pages within the TAP 1.4 docs have this same URL pattern. Accordingly, an ALLOW_RULE value that would capture the TAP 1.4 documentation would be `en/VMware-Tanzu-Application-Platform/1.4/*`.

The ALLOW_RULE restricts the scraper from finding pages that do not include this pattern. The scraper starts crawling your site with the START_URL. Accordingly, if you were to use a START_URL of `docs.vmware.com` with the ALLOW_RULE `en/VMware-Tanzu-Application-Platform/1.4/*`, it would not work because there are no direct links to the TAP 1.4 Docs from the homepage, and the ALLOW_RULE prevents the scraper from crawling any links that do not match the ALLOW_RULE pattern.
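To check a candidate ALLOW_RULE before running a crawl, you can test it against a few URLs. This is a minimal sketch that assumes the rule is applied as a regular-expression search over the full URL, which is how Scrapy's LinkExtractor handles its `allow` argument; `is_allowed` is a hypothetical helper.

```python
import re

ALLOW_RULE = "en/VMware-Tanzu-Application-Platform/1.4/*"


def is_allowed(url, allow_rule=ALLOW_RULE):
    """True if the rule matches anywhere in the URL."""
    return re.search(allow_rule, url) is not None


candidates = [
    "https://docs.vmware.com/en/VMware-Tanzu-Application-Platform/1.4/tap/overview.html",
    "https://docs.vmware.com/en/VMware-Tanzu-Application-Platform/1.5/tap/overview.html",
    "https://docs.vmware.com/",
]
for url in candidates:
    print(is_allowed(url), url)
```

Only the first URL passes. A crawl seeded at the homepage would stop there unless the homepage links directly to a matching page.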

#### The PARSING_RULE variable (Optional)

If desired, the parsing rule lets you filter the data you gather based on an additional pattern in the URL. For example, assume you wanted to gather all the pages in the TAP 1.4 documentation specific to installation. There are many different ways to install TAP, and there are several different pages and subsections related to installation. You may find that the string "install" is present in all of the URLs you want. If you set the PARSING_RULE to `install`, the scraper would only return URLs that include the substring `install`. The value is compiled as a regular expression, so do not wrap it in glob-style wildcards: `*install*` is not a valid pattern.
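The effect of a PARSING_RULE can be previewed on a list of URLs. The sketch below assumes the rule is compiled with Python's `re` module, as Scrapy's LinkExtractor does for its `allow` argument; `is_valid_rule` and `filter_urls` are hypothetical helpers, and the URLs are made up.

```python
import re


def is_valid_rule(rule):
    """True if the rule compiles as a Python regular expression."""
    try:
        re.compile(rule)
        return True
    except re.error:
        return False


def filter_urls(urls, parsing_rule):
    """Keep only the URLs in which the rule matches somewhere."""
    pattern = re.compile(parsing_rule)
    return [u for u in urls if pattern.search(u)]


urls = [
    "https://docs.example.com/en/product/1.4/install-intro.html",
    "https://docs.example.com/en/product/1.4/overview.html",
    "https://docs.example.com/en/product/1.4/cli/install.html",
]
print(is_valid_rule("install"), is_valid_rule("*install*"))
print(filter_urls(urls, "install"))
```

A leading `*` has nothing to repeat in regular-expression syntax, so the glob-style form fails to compile, while the plain substring works.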

#### The BODY_TAG variable (Optional)

Your results will be best if you can focus on the main sections of the web pages that you want to gather. Web pages carry a lot of metadata, and while this model works well even with less-clean data, cleaner data gives better results. In most cases, you probably want the main body of visible text from each page. Most well-organized websites wrap the main body of text on every page in an HTML element that can be identified with a CSS selector. For example, on docs.vmware.com, the main body of text on every page is wrapped inside a div matched by the selector `div.rhs-center.article-wrapper`. This selector will not work on other websites, as each website has its own tagging scheme. If you do not know a selector like this for your desired site, you can leave this blank. The demo code still does a very good job of identifying the relevant content from your input data for queries.

The BODY_TAG variable is used after your site is crawled and the URLs you want have been downloaded. A parse function cleans the gathered data, and as part of that process it can extract only the content matched by the BODY_TAG selector. This lets the cleaning job discard irrelevant data such as side nav bars, Twitter links, and headers/footers, so the result only includes the main body of visible text on each page.
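To illustrate what selecting by BODY_TAG achieves, here is a standard-library sketch that keeps only the text inside the first element matching a `tag.class1.class2` selector. It is a simplified stand-in for the Scrapy CSS selection used in the demo, not the demo's implementation; `BodyTextExtractor` and `extract_body_text` are hypothetical names, and only this simple selector form is supported.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect text inside the first <tag> that carries all the given classes."""

    def __init__(self, tag, classes):
        super().__init__()
        self.tag = tag
        self.classes = set(classes)
        self.depth = 0       # nesting depth of <tag> elements inside the match
        self.done = False    # set once the matched element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag or self.done:
            return
        if self.depth:
            self.depth += 1  # a nested element with the same tag name
        elif self.classes <= set((dict(attrs).get("class") or "").split()):
            self.depth = 1   # entered the matching element

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_body_text(html, body_tag):
    tag, *classes = body_tag.split(".")
    parser = BodyTextExtractor(tag, classes)
    parser.feed(html)
    return " ".join(parser.chunks)


page = (
    '<html><body><nav>Menu</nav>'
    '<div class="rhs-center article-wrapper"><h1>Install</h1>'
    '<div><p>Run the installer.</p></div></div>'
    '<footer>Legal</footer></body></html>'
)
print(extract_body_text(page, "div.rhs-center.article-wrapper"))
```

The navigation and footer text are dropped because they sit outside the matched element.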

### How the scraper works in this demo

Assuming you use the default input variables set in the code cell below:
```
os.environ['ALLOWED_DOMAINS'] = 'docs.vmware.com'
os.environ['START_URL'] = 'https://docs.vmware.com/en/VMware-Tanzu-Application-Platform/1.5/tap/overview.html'
os.environ['ALLOW_RULE'] = 'en/VMware-Tanzu-Application-Platform/1.5/*'
os.environ['PARSING_RULE'] = ''
os.environ['BODY_TAG'] = 'div.rhs-center.article-wrapper'
```

The UrlScraperSpider class starts crawling at the start URL, gathers all URLs it can find subject to the ALLOWED_DOMAINS and ALLOW_RULE parameters, and writes the list of URLs that need to be collected to `urls.txt`. After it finishes, the HtmlScraperSpider class downloads and cleans the content from each of the identified URLs, and can optionally improve the cleaning with a BODY_TAG if one is provided. The example input variables will scrape the Tanzu Application Platform 1.5 documentation from the VMware documentation website.
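Before the URL list is written, the demo's `parse_item` method strips the `#fragment` from each link, so that several anchors on one page collapse to a single entry. A minimal standalone sketch of that normalization follows; `normalize` and `unique_pages` are hypothetical helper names and the URLs are made up.

```python
from urllib.parse import urlparse


def normalize(url):
    """Drop the #fragment so anchors on the same page map to one URL."""
    return urlparse(url)._replace(fragment="").geturl()


def unique_pages(urls):
    """Return the sorted set of distinct pages among the given links."""
    return sorted({normalize(u) for u in urls})


links = [
    "https://docs.example.com/a.html#one",
    "https://docs.example.com/a.html#two",
    "https://docs.example.com/b.html",
]
print(unique_pages(links))
```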

### Execute the scraping logic

In [None]:
# The following installs are required. Please execute this code block without changing anything.
!pip install scrapy
!pip install crochet
message = "The required libraries have been installed"
print(message)

In [None]:
# The following imports are required. Please execute this code block without changing anything.
import os
import scrapy
import logging
from crochet import setup, wait_for
from scrapy.utils.log import configure_logging
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy import signals
from urllib.parse import urlparse

# Initialize crochet
setup()

# Configure Scrapy logging
configure_logging()
logging.getLogger("scrapy").setLevel(logging.INFO)

message = "The dependencies have been imported"
print(message)

_The following code block contains pre-populated input values that you may want to change. If you want to use your own data source, please modify the values below before executing._ 

In [None]:
# In a container we would typically export these environment variables in the shell.
# Because this notebook runs directly in Python, we set them here instead, still as
# environment variables, so the rest of the code runs the same as it would server-side.
os.environ['ALLOWED_DOMAINS'] = 'docs.vmware.com'
os.environ['START_URL'] = 'https://docs.vmware.com/en/VMware-Tanzu-Application-Platform/1.5/tap/overview.html'
os.environ['ALLOW_RULE'] = 'en/VMware-Tanzu-Application-Platform/1.5/*'
os.environ['PARSING_RULE'] = ''
os.environ['BODY_TAG'] = 'div.rhs-center.article-wrapper'
message = "The Envars have been loaded"
print(message)

In [None]:
# Execute this code block without changing anything. 
allowed_domains = os.environ.get('ALLOWED_DOMAINS')
start_url = os.environ.get('START_URL')
allow_rule = os.environ.get('ALLOW_RULE')
body_tag = os.environ.get('BODY_TAG')
parsing_rule = os.environ.get('PARSING_RULE')
message = "The Envars have been imported"
print(message)

_The following code block is the UrlScraperSpider class. You do not need to change anything. Just execute it to load the class, and we will call it in a subsequent step._

In [None]:
class UrlScraperSpider(CrawlSpider):
    name = 'url_scraper'

    def __init__(self, allowed_domains, start_url, allow_rule, parsing_rule=None, *args, **kwargs):
        self.allowed_domains = [allowed_domains]
        self.start_urls = [start_url]
        self.allow_rule = [allow_rule]
        self.rules = (
            Rule(LinkExtractor(allow=allow_rule), callback='parse_item', follow=True),
        )
        self.parsing_rule = parsing_rule
        self.urls = set()  

        super(UrlScraperSpider, self).__init__(*args, **kwargs)

        self._compile_rules()

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(UrlScraperSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        logging.info("UrlScraperSpider started...")

    def spider_closed(self, spider):
        logging.info("UrlScraperSpider finished.")

    def parse_item(self, response):
        links = LinkExtractor(allow=self.allow_rule).extract_links(response)
        if self.parsing_rule:
            links = LinkExtractor(allow=self.parsing_rule).extract_links(response)
            
        for link in links:
            url = urlparse(link.url)._replace(fragment='').geturl()
            self.urls.add(url)
        return {}


    def closed(self, reason):
        with open('urls.txt', 'w') as f:
            f.write('\n'.join(sorted(self.urls)))
message = "The UrlScraperSpider class has been loaded"
print(message)

_Now load the HtmlScraperSpider class, which when called will download and clean the content of the URLs._

In [None]:
# Execute this code block, do not change anything
class HtmlScraperSpider(scrapy.Spider):
    name = 'html_scraper'

    def __init__(self, allowed_domains, body_tag=None, *args, **kwargs):
        super(HtmlScraperSpider, self).__init__(*args, **kwargs)
        self.allowed_domains = [allowed_domains]
        os.makedirs('html_downloads', exist_ok=True)
        self.start_urls = self.get_start_urls()
        self.body_tag = body_tag

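    def get_start_urls(self):
        # Read the URL list written by UrlScraperSpider, skipping blank lines
        # (an empty string is not a valid start URL).
        with open('urls.txt', 'r') as f:
            return [line.strip() for line in f if line.strip()]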

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(HtmlScraperSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        logging.info("HtmlScraperSpider started...")

    def spider_closed(self, spider):
        logging.info("HtmlScraperSpider finished.")

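    def parse(self, response):
        # Name the output file after the last URL path segment; fall back to
        # 'index.html' when the path is empty or ends with a slash.
        name = urlparse(response.url).path.rstrip('/').split('/')[-1] or 'index.html'
        filename = os.path.join('html_downloads', name)

        if self.body_tag:
            # Keep only the first element matched by the BODY_TAG CSS selector
            content = response.css(self.body_tag).get()
        else:
            # Use the decoded text: Selector(text=...) expects a str, not bytes
            content = response.text

        if content:
            # Strip the markup and keep only the text nodes
            content = Selector(text=content).xpath('//text()').getall()
            content = ''.join(content).strip()

            with open(filename, 'w', encoding='utf-8') as f:
                f.write(content)

        return {}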
message = "The HtmlScraperSpider class has been loaded"
print(message)

_Now that the classes have been loaded, call them by executing the code block below:_

__Note:__ The command block below could take a minute or two to complete.

In [None]:
runner = CrawlerRunner(get_project_settings())

@wait_for(timeout=None)
def run_spider(spider_class, **kwargs):
    return runner.crawl(spider_class, **kwargs)

# Run the spiders
run_spider(UrlScraperSpider, allowed_domains=allowed_domains, start_url=start_url, allow_rule=allow_rule, parsing_rule=parsing_rule)
run_spider(HtmlScraperSpider, allowed_domains=allowed_domains, body_tag=body_tag)
message = "The spiders have completed running"
print(message)

### Review Step 1 Outputs

Congratulations, you made it through step 1! You have now downloaded and cleaned your input data!

- In the left nav bar, you should see a file named "urls.txt". This file was created by the UrlScraperSpider class and contains a list of all the URLs that were identified given your input criteria. Double click on the file to view it.
- In the left nav bar, you should also see the html_downloads directory. Double click on it to view the html files that were downloaded.
  - Double click on some of the html files. You will see that each one is now just plain text, because it has been cleaned for the next step, vectorizing the data.
  - Click on the folder icon in the nav bar, so you can see the root folder contents
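
If you prefer to check the outputs from code rather than the file browser, a short summary like the one below can help. It is a sketch that assumes the default output locations used in this demo (`urls.txt` and `html_downloads`); `summarize_outputs` is a hypothetical helper.

```python
from pathlib import Path


def summarize_outputs(urls_file="urls.txt", downloads_dir="html_downloads"):
    """Count discovered URLs and downloaded files, and list any empty files."""
    urls_path = Path(urls_file)
    urls = []
    if urls_path.exists():
        lines = urls_path.read_text(encoding="utf-8").splitlines()
        urls = [line for line in lines if line.strip()]

    downloads = Path(downloads_dir)
    files = sorted(p for p in downloads.glob("*") if p.is_file()) if downloads.is_dir() else []

    return {
        "urls": len(urls),
        "files": len(files),
        "empty_files": [p.name for p in files if p.stat().st_size == 0],
    }


print(summarize_outputs())
```

A file count lower than the URL count can simply mean that several URLs share the same final path segment, since files are named after that segment.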