
![image](img/spider_bot.png)

## Investor Relations Web Scraping bot
This notebook launches a Gradio interface for scraping a website. It is a utility notebook, created to quickly gather documents from IR sites and build a knowledge base (KB).
I've tuned the scraper to walk the Investor Relations tree of a company website and save all document files (xls, pdf, Word, etc.), but not the HTML content.

Because of the way Scrapy works with async loops, I had to put the spider in a separate script and run it as a subprocess so that it works in a Jupyter notebook.

It can be used to scrape multiple websites (one at a time). Scraped files are saved in a kb/{domain} subdirectory (the website tree structure is **not** preserved).
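To make the flat layout concrete, here is a minimal sketch of how a document URL could map to a path under kb/{domain}. The helper name `kb_path_for` and the `root` parameter are hypothetical and only illustrate the idea; the actual naming logic lives in spider_runner.py and may differ.

```python
import os
from urllib.parse import urlparse, unquote


def kb_path_for(file_url: str, root: str = "kb") -> str:
    """Map a document URL to a flat path under kb/{domain}/ (hypothetical helper)."""
    parsed = urlparse(file_url)
    domain = parsed.netloc
    # Drop a leading "www." so both host variants share one folder
    if domain.startswith("www."):
        domain = domain[len("www."):]
    # Keep only the last path segment: the site tree is deliberately discarded
    filename = os.path.basename(unquote(parsed.path)) or "index"
    return os.path.join(root, domain, filename)


print(kb_path_for("https://www.example.com/investor-relations/reports/2023/annual.pdf"))
```

One consequence of a flat layout is that two files with the same name in different site folders map to the same path, so a later download would overwrite an earlier one unless the names are made unique.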

Uses **spider_runner.py**, which needs to be in the same directory as the notebook (the notebook checks for it and aborts if it is missing).


### Scraping logic
Scrapy does a pretty decent job of getting the necessary files, although some dynamic sites will not yield the best results. A more robust scraper would probably require moving to Selenium in a future upgrade. Still, the tool is quite practical in many cases, since many companies keep their IR websites static. You may need to tweak the link-following patterns. I have kept them very simple: the spider follows any link that has 'investor-relations/' in it and limits the number of links followed per page to avoid endless crawling.
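The link-handling rule described above can be sketched in plain Python, independent of Scrapy. This is only an illustration under assumptions: the extension list, the per-page limit of 20, and the function name `classify_links` are not taken from spider_runner.py and should be adjusted to match it.

```python
from urllib.parse import urljoin, urlparse

# Assumed values for illustration; tune them to the target site
DOC_EXTENSIONS = (".pdf", ".xls", ".xlsx", ".doc", ".docx", ".ppt", ".pptx", ".csv")
FOLLOW_PATTERN = "investor-relations/"
MAX_FOLLOW_PER_PAGE = 20


def classify_links(page_url, hrefs, max_follow=MAX_FOLLOW_PER_PAGE):
    """Split the links found on one page into documents to save and pages to follow."""
    documents, follow = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        path = urlparse(absolute).path.lower()
        if path.endswith(DOC_EXTENSIONS):
            # A document: download it, do not crawl it
            if absolute not in documents:
                documents.append(absolute)
        elif (FOLLOW_PATTERN in absolute
              and absolute not in follow
              and len(follow) < max_follow):
            # An IR page: queue it, up to the per-page limit
            follow.append(absolute)
    return documents, follow


docs, pages = classify_links(
    "https://example.com/investor-relations/",
    ["reports/q1.pdf", "/investor-relations/news/", "/careers/"],
)
print(docs)
print(pages)
```

Matching on the URL path rather than the full URL means a link such as `report.pdf?download=1` is still recognized as a document.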

In a real application we would run the spider class inside the application itself, which would make real-time updates of the output simpler. For an interactive notebook I find this approach sufficient.
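If partial real-time feedback is wanted while keeping the subprocess approach, one option is to read the child process output line by line instead of waiting for it to finish. The sketch below is an assumption-laden illustration, not what this notebook does: `stream_process` is a hypothetical helper, and it relies on the fact that Gradio event handlers may be generator functions whose yielded values update the output component.

```python
import subprocess
import sys


def stream_process(cmd):
    """Yield the accumulated output of `cmd` each time a new line arrives."""
    log = ""
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr, where Scrapy writes its log by default
        text=True,
        bufsize=1,  # line-buffered reads in text mode
    ) as proc:
        for line in proc.stdout:
            log += line
            yield log
        returncode = proc.wait()
    yield log + f"[exit code {returncode}]"


# Demo with a trivial child process standing in for spider_runner.py
for snapshot in stream_process([sys.executable, "-c", "print('a'); print('b')"]):
    print(repr(snapshot))
```

A handler built on this would yield each snapshot to the output textbox, so the log grows while the spider is still running.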

In [None]:
import subprocess, os, sys
import gradio as gr
from urllib.parse import urlparse, urljoin


# from urllib.parse import urljoin, urlparse
# from scrapy.crawler import CrawlerRunner
# from scrapy.utils.log import configure_logging
# from twisted.internet import reactor, defer
# import asyncio

is_scraper_completed = False  # global flag: has the scraper finished?
status_value = "Ready"

with gr.Blocks() as scraper_ui:
    gr.Markdown("## Web Scraper")
    gr.Markdown("This is a simple web scraper that can be used to scrape investor relations pages.")
    
    url = gr.Textbox(label="Enter URL", placeholder="https://example.com")
    
    status = gr.Textbox(label="Status", interactive=False, value="Ready to scrape. Enter a URL and press Enter.", lines=5)

    def run_scraper(url):
        """Run spider_runner.py as a subprocess and return (log output, status message)."""
        # Rebind the module-level variables instead of creating function locals
        global is_scraper_completed, status_value
        url = url.strip()
        if not url.startswith("http"):
            url = "http://" + url
        # Extract the domain from the URL, dropping only a leading "www."
        domain = urlparse(url).netloc
        if domain.startswith("www."):
            domain = domain[len("www."):]
        # Every return supplies two values, one per output component (output, status)
        if not domain:
            return "", "Invalid URL. Please enter a valid URL."
        # Check if the spider_runner.py file exists
        if not os.path.exists('spider_runner.py'):
            return "", "Error: spider_runner.py not found. Please ensure it is in the current directory."
        # Run the spider with the same Python interpreter as the notebook kernel
        try:
            result = subprocess.run([sys.executable, 'spider_runner.py', url, domain], check=True, text=True, capture_output=True)
            status_value = f"Scraping completed for {url}."
            is_scraper_completed = True
            return result.stderr, status_value
        except subprocess.CalledProcessError as e:
            is_scraper_completed = True
            status_value = "Error during scraping. Check the logs for details."
            # Show the captured log of the failed run when there is one
            return e.stderr or f"Error: {e}", status_value
    
    output = gr.Textbox(label="Output", interactive=False)
    
    url.submit(run_scraper, inputs=url, outputs=[output,status]) 

scraper_ui.launch(inbrowser=True)