<a href="https://colab.research.google.com/github/crimeacs/vision_scraper/blob/main/Website_to_Embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%shell
#@title Install
#@markdown Run this cell to install dependencies

# Set up Selenium and Chrome for Google Colab.
# Skip this cell if you run the notebook in a local Jupyter or other local Python environment.
# Note: a cell magic such as %%shell has to be the first line of the cell.
sudo apt -y update
sudo apt install -y wget curl unzip
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb
dpkg -i libu2f-udev_1.1.4-1_all.deb
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb
CHROME_DRIVER_VERSION=`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/$CHROME_DRIVER_VERSION/chromedriver_linux64.zip -P /tmp/
unzip -o /tmp/chromedriver_linux64.zip -d /tmp/
chmod +x /tmp/chromedriver
mv /tmp/chromedriver /usr/local/bin/chromedriver
pip install -q selenium
pip install -q chromedriver-autoinstaller
pip install -q openai

In [2]:
#@title Set up
#@markdown Run this cell to set up everything we need
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import chromedriver_autoinstaller

# setup chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless') # ensure GUI is off
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# set path to chromedriver as per your configuration
chromedriver_autoinstaller.install()

from PIL import Image
import base64
import io
import json

# Encode an image file as a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def download_website(url):
    # set up the webdriver
    driver = webdriver.Chrome(options=chrome_options)

    # Navigate to the website
    driver.get(url)

    # Set the size of the browser window (adjust as needed)
    width, height = 1200, 1080
    driver.set_window_size(width, height)

    screenshots = []
    max_scrolls = 6
    current_scroll = 0

    while True:
        print(f'Taking screenshot {current_scroll + 1} of at most {max_scrolls}')
        # Give the page time to render before capturing
        time.sleep(2)
        screenshots.append(driver.get_screenshot_as_png())

        # Offset of the next viewport, one window height further down
        next_offset = height * (current_scroll + 1)

        # Stop once the screenshot limit is reached
        if current_scroll + 1 >= max_scrolls:
            print("Reached the max scroll limit.")
            break

        # Scroll down and wait for lazy-loaded content (adjust sleep time as needed)
        driver.execute_script(f"window.scrollTo(0, {next_offset});")
        time.sleep(2)

        # Re-read the page height, which can grow as content loads
        page_height = driver.execute_script("return document.body.scrollHeight")

        # Stop if the next viewport would start at or past the end of the page
        if page_height <= next_offset:
            print("Reached the end of the page.")
            break

        current_scroll += 1

    driver.quit()

    # Process and save screenshots
    combined_height = height * len(screenshots)
    combined_image = Image.new("RGB", (width, combined_height))

    current_height = 0
    for img_data in screenshots:
        img = Image.open(io.BytesIO(img_data))
        combined_image.paste(img, (0, current_height))
        current_height += height

    # Save the stitched image and read it back from the same path
    image_path = "combined_screenshot.png"
    combined_image.save(image_path)
    print('Done')

    # Return the image as a base64 string
    return encode_image(image_path)

def analyse_content(user_prompt, system_prompt, base64_image, max_tokens):
    # Relies on the global `client` created in the API key cell
    response_scoring = client.chat.completions.create(
        model="gpt-4-vision-preview",
        # response_format={ "type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant with the following characteristics: "
                    f"{system_prompt}. Your task is to analyse the given website."
                ),
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            # The screenshot is a PNG, so label the data URL accordingly
                            "url": f"data:image/png;base64,{base64_image}",
                        },
                    },
                ],
            },
        ],
        max_tokens=max_tokens,
    )

    # The client returns a typed response object, so read the text directly
    return response_scoring.choices[0].message.content
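
The scrolling loop in `download_website` moves the viewport down one window height at a time and stops at the bottom of the page or at the scroll cap. The sketch below isolates that arithmetic as a pure function so it can be checked without a browser. `scroll_offsets` and its parameter names are hypothetical and not part of the notebook, and the sketch assumes a fixed page height.

```python
def scroll_offsets(page_height, viewport_height, max_shots):
    """Return the y offsets at which screenshots would be taken.

    Hypothetical helper: one offset per viewport-sized step, stopping at the
    bottom of the page or after max_shots offsets, whichever comes first.
    """
    if viewport_height <= 0 or max_shots <= 0:
        return []
    offsets = []
    y = 0
    while y < page_height and len(offsets) < max_shots:
        offsets.append(y)
        y += viewport_height
    return offsets


# A 3000 px page viewed through a 1080 px window needs three screenshots
print(scroll_offsets(page_height=3000, viewport_height=1080, max_shots=6))
```

On a live page the height can grow while scrolling because of lazy loading, which is why the notebook re-reads `document.body.scrollHeight` on every iteration instead of planning all offsets up front.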

In [4]:
#@title Input your OpenAI API Key
#@markdown Please input your own OpenAI API Key and run the cell

from openai import OpenAI

# Ask the user to input their OpenAI API key
api_key = "YOUR API KEY" #@param {"type" : "string"}
client = OpenAI(api_key=api_key)
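
Typing the key into the form stores it in the notebook, which is easy to share by accident. As one possible alternative, the sketch below prefers the `OPENAI_API_KEY` environment variable and falls back to the form value. `resolve_api_key` is a hypothetical helper, not something the notebook or the `openai` package provides.

```python
import os


def resolve_api_key(form_value, env=os.environ):
    """Pick an API key, preferring the OPENAI_API_KEY environment variable.

    Hypothetical helper: falls back to the form value and rejects the
    untouched placeholder text so a missing key fails early.
    """
    placeholders = {"YOU API KEY", "YOUR API KEY"}
    for candidate in (env.get("OPENAI_API_KEY"), form_value):
        if candidate and candidate.strip() and candidate.strip().upper() not in placeholders:
            return candidate.strip()
    raise ValueError("No OpenAI API key provided")


# Example with an explicit mapping instead of the real environment
print(resolve_api_key("YOUR API KEY", env={"OPENAI_API_KEY": "sk-example"}))
```

In the notebook this would be used as `client = OpenAI(api_key=resolve_api_key(api_key))`.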


In [33]:
#@title # Scrape
import os
from tqdm import tqdm
import base64
import time

#@markdown Please insert URL to the website you want to scrape
url = 'https://github.com/features/copilot' #@param {"type":"string"}

#@markdown System prompt defines the role of your scraper. Default is a good start
system_prompt = 'You are a Scraping Assistant that is able to summarize content of any website from its screenshot.' #@param {"type":"string"}

#@markdown User prompt defines the task for the scraper. Default is a good start
user_prompt = 'Please summarize the content of this screenshot. Give me an executive summary of this webpage and their business.' #@param {"type":"string"}

#@markdown How long do you want your summary to be
max_tokens = 200 #@param {"type":"integer"}

base64_image = download_website(url)
results = analyse_content(user_prompt, system_prompt, base64_image, max_tokens)

print(results)

This screenshot depicts a webpage for GitHub Copilot, advertised as the world's most widely adopted AI developer tool. The site emphasizes Copilot's potential to enhance developer productivity and accelerate the pace of software development. The tool is described as the competitive advantage that developers ask for by name and is positioned as an industry standard.

Key highlights include:
- Adoption by over 37,000 businesses and the usage by one in three Fortune 500 companies.
- A developer preference of 55% according to a Stack Overflow 2022 Survey.

GitHub Copilot is designed by AI leaders to provide confidence to its users, with endorsements from reputable organizations such as Shopify, Mercado Libre, Mercedes-Benz, and Fidelity. The webpage also mentions a case study featuring Duolingo, which uses GitHub Copilot as a force multiplier for its engineering team.

The AI coding assistant is showcased as a tool that can assist in writing code, helping developers to focus more on creati
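
The summary above stops mid-word, which is consistent with the reply reaching the `max_tokens = 200` limit. Raising `max_tokens` is the direct fix. If the limit has to stay, a small post-processing step can drop the dangling fragment. The sketch below is one way to do that, assuming sentences end in `.`, `!` or `?`; `trim_to_last_sentence` is a hypothetical helper, not part of the notebook.

```python
import re


def trim_to_last_sentence(text):
    """Drop a trailing partial sentence.

    Returns the text unchanged when it contains no complete sentence.
    """
    # Sentence enders count only when followed by whitespace or the end of
    # the text, so version strings such as 1.1.4 are not split.
    matches = list(re.finditer(r"[.!?](?=\s|$)", text))
    if not matches:
        return text
    return text[: matches[-1].end()]


print(trim_to_last_sentence("It writes code. It helps developers focus more on creati"))
```

In the notebook this would be applied as `print(trim_to_last_sentence(results))`.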

In [22]:
#@title BONUS: Compute an embedding of the summary (if you know what an embedding is)
#@markdown The embedding is saved as a .npy file in the current working directory

import numpy as np
response = client.embeddings.create(
    input=results,
    model="text-embedding-ada-002"
)

website_embedding = response.data[0].embedding
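# Name the file after the site's host (e.g. github.com.npy); splitting the URL
# on '.' would leave slashes from the path in the file name
from urllib.parse import urlparse
np.save(f"{urlparse(url).netloc}.npy", website_embedding)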
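
A saved embedding is mostly useful for comparing one page with another. A common measure for that is cosine similarity, sketched below in plain Python so it runs without extra packages. The short vectors are made-up stand-ins for real embeddings, and `cosine_similarity` is a hypothetical helper rather than part of the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two website embeddings
site_a = [0.1, 0.3, 0.5]
site_b = [0.2, 0.6, 1.0]  # same direction as site_a, twice the length
print(round(cosine_similarity(site_a, site_b), 6))
```

With the files written by this cell, the inputs would come from something like `np.load("github.com.npy").tolist()`, assuming that file name.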