# Prerequisites and Scraping
## Prerequisites
### What to input to the model
The model we will be using is WebOrganizer/TopicClassifier-NoURL. It expects the plain text of a page as input, in the following form:
RS Supervisor Clarifies Washington Role in Tax-Exempt Targeting

Susan Walsh/AP file photo

An Internal Revenue Service supervisor at the center of the tax-exempt applications process that erupted in political scandal in May has contradicted agency assertions that the mishandling was confined to an office in Cincinnati, but she offered no evidence that top IRS or Obama administration officials directed the added scrutiny.

As reported by the Associated Press on Monday, Holly Paz, who recently left her Washington post as a top deputy in the exempt organizations division, told congressional interviewers she reviewed some 20-to-30 of the applications from tea party and other conservative groups whose applications were delayed.

According to AP’s review, transcripts from the joint investigation by the House Oversight and Government Reform Committee and the Ways and Means Committee show that “Paz described an agency in which IRS supervisors in Washington worked closely with agents in the field but didn’t fully understand what those agents were doing. Paz said agents in Cincinnati openly talked about handling `tea party’ cases, but she thought the term was merely shorthand for all applications from groups that were politically active * conservative and liberal,” wrote reporter Stephen Ohlemacher.

Six IRS staff have been interviewed so far by the committee. AP examined the transcripts of three of them, namely Paz, and the Cincinnati office's Gary Muthert and Elizabeth Hofacre.

“It’s very fact-and-circumstance intensive. So it’s a difficult issue,” said Paz, who said the scrutiny began in February 2010. “Oftentimes what we will do, and what we did here, is we’ll transfer it to (the technical unit), get someone who’s well-versed on that area of the law working the case so they can see what the issues are,” the transcript said. “The goal with that is ultimately to develop some guidance or a tool that can be given to folks in (the Cincinnati office) to help them in working the cases themselves.”

Paz added that her IRS colleagues are not political and were sorting the applications using the shorthand rhetoric of the groups under the assumption that it was for internal purposes only. “Because they are so apolitical, they are not as sensitive as we would like them to be as to how things might appear,” Paz said, according to the AP.

Meanwhile, the interviewing continues to prompt partisan friction between House Oversight Chairman Darrell Issa, R-Calif., and Ranking Minority Member Rep. Elijah Cummings, D-Md., over whether to release the transcripts in excerpts or in their entirety.

Over the weekend, Issa said that as the interviews continue, "We're learning about…officials who had reason to believe something was very wrong but tried keep that under wraps for as long as possible.” Cummings, who has threatened to release the full transcripts if Issa does not do so by Monday, said Issa was leaking"cherry-picked excerpts that show no White House involvement whatsoever in the identification and screening of these cases."

Close [ x ] More from GovExec

### The Outputs of the model
The model outputs one of 24 labels. The mapping from label to id, in order, is:
* Adult - 0
* Art & Design - 1
* Software Dev - 2
* Crime & Law - 3
* Education & Jobs - 4
* Hardware - 5
* Entertainment - 6
* Social Life - 7
* Fashion and Beauty - 8
* Finance & Business - 9
* Food & Dining - 10
* Games - 11
* Health - 12
* History - 13
* Home & Hobbies - 14
* Industrial - 15
* Literature - 16
* Politics - 17
* Religion - 18
* Science & Tech - 19
* Software - 20
* Sports & Fitness - 21
* Transportation - 22
* Travel - 23

## Scraping


In [4]:
from collections import Counter

## Check word count in an input instance


Text = '''Criminal Law – SORA – Registration (access required)

POSTED: 09:22 AM Tuesday, July 12, 2011
BY: Ed Wesoloski
TAGS: , , , , , , , , , ,

Homeless sex offenders are not excused from complying with the Sex Offenders Registration Act (SORA), MCL 28.721 et seq. “[H]omelessness does not preclude an offender from entering a police station and reporting to a law enforcement agency regarding the offender’s residence or domicile. “The Legislature intended SORA to be a comprehensive system that requires all [...]

Michigan Lawyers Weekly Daily Alert


CLE & Events Calendar

Follow us on social media'''

words = Text.split(" ")

print(f"The number of words in this instance: {Counter(words).total()}")

The number of words in this instance: 92


### Heuristic in our Scraping
As seen in the example above, a small input instance contains 92 words, so we will set 80 words as the minimum that our first scraping job should extract.
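The threshold check can be sketched as a small helper. This is a minimal sketch, not the final script: `MIN_WORDS`, `count_words` and `has_enough_words` are hypothetical names. Note that it splits on any whitespace with `str.split()`, whereas the cell above uses `Text.split(" ")`, which only splits on single spaces, so the two counts can differ on text that contains newlines.

```python
MIN_WORDS = 80  # minimum word count chosen in the heuristic above


def count_words(text: str) -> int:
    """Count tokens separated by any whitespace (spaces, tabs, newlines)."""
    return len(text.split())


def has_enough_words(text: str, min_words: int = MIN_WORDS) -> bool:
    """Return True when the scraped text reaches the minimum word count."""
    return count_words(text) >= min_words


sample = "Criminal Law\n\nPOSTED: 09:22 AM\tTuesday"
print(count_words(sample))       # 6
print(has_enough_words(sample))  # False
```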

### Scraping Procedure
We'll use BeautifulSoup to parse the HTML response of a request and clean it into a form suitable for model input.
* First, the URL string itself should be validated; for that we will use a regex.
* Then, the URL can be scraped using requests and BeautifulSoup.
* If the content in the response is < 80 words, then we'll use Selenium.
* Selenium can handle interactions with a webpage, but the project should handle any type of webpage, so Selenium will be used in a generic way that suits most webpages with light use of JS and AJAX.
* The output lengths of both HTML parsing and Selenium will be compared to decide the final input to the model.
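The decision logic in the steps above can be sketched without any network access. Everything here is an assumption for illustration: the regex is one possible pattern (it requires an `http`/`https` scheme and a dotted hostname, so it rejects IP addresses and `localhost`), and the function names are hypothetical.

```python
import re

MIN_WORDS = 80  # threshold from the heuristic section

# Assumed pattern: http(s) scheme, dotted hostname, optional port, optional path/query.
URL_PATTERN = re.compile(
    r"^https?://"
    r"(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,}"
    r"(?::\d{1,5})?"
    r"(?:[/?#]\S*)?$"
)


def is_valid_url(url: str) -> bool:
    """Step 1: validate the URL string before requesting it."""
    return bool(URL_PATTERN.match(url.strip()))


def needs_browser_fallback(static_text: str, min_words: int = MIN_WORDS) -> bool:
    """Step 3: fall back to a browser when static parsing yields too few words."""
    return len(static_text.split()) < min_words


def pick_model_input(static_text: str, rendered_text: str) -> str:
    """Step 5: keep whichever extraction produced more words (ties keep static)."""
    return max(static_text, rendered_text, key=lambda t: len(t.split()))


print(is_valid_url("https://en.wikipedia.org/wiki/BERT_(language_model)"))  # True
print(is_valid_url("angel-tamang.vercel.app"))                              # False
```

Requiring the scheme up front also avoids passing a bare hostname to a browser driver later on.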

## Model Output mapped to Google's Categories list
Google's category list is available [here](https://cloud.google.com/natural-language/docs/categories).

In [1]:
# Checking the list

with open('../data/google_list.txt') as f:
    google_list = f.read()

!head ../data/google_list.txt

/Adult
/Arts & Entertainment/Celebrities & Entertainment News
/Arts & Entertainment/Other
/Arts & Entertainment/Comics & Animation/Anime & Manga
/Arts & Entertainment/Comics & Animation/Cartoons
/Arts & Entertainment/Comics & Animation/Comics
/Arts & Entertainment/Comics & Animation/Other
/Arts & Entertainment/Entertainment Industry/Film & TV Industry
/Arts & Entertainment/Entertainment Industry/Recording Industry
/Arts & Entertainment/Entertainment Industry/Other


In [2]:
google_categories = google_list.split("\n")
print(google_categories)

['/Adult', '/Arts & Entertainment/Celebrities & Entertainment News', '/Arts & Entertainment/Other', '/Arts & Entertainment/Comics & Animation/Anime & Manga', '/Arts & Entertainment/Comics & Animation/Cartoons', '/Arts & Entertainment/Comics & Animation/Comics', '/Arts & Entertainment/Comics & Animation/Other', '/Arts & Entertainment/Entertainment Industry/Film & TV Industry', '/Arts & Entertainment/Entertainment Industry/Recording Industry', '/Arts & Entertainment/Entertainment Industry/Other', '/Arts & Entertainment/Events & Listings/Bars, Clubs & Nightlife', '/Arts & Entertainment/Events & Listings/Concerts & Music Festivals', '/Arts & Entertainment/Events & Listings/Event Ticket Sales', '/Arts & Entertainment/Events & Listings/Expos & Conventions', '/Arts & Entertainment/Events & Listings/Film Festivals', '/Arts & Entertainment/Events & Listings/Food & Beverage Events', '/Arts & Entertainment/Events & Listings/Live Sporting Events', '/Arts & Entertainment/Events & Listings/Movie Lis

In [3]:
from collections import Counter
print(Counter(google_categories).total())

1092


### Mapping Decision
As we can see, Google's category list is huge (1092 classes!). The model we use does classify into Adult and some other categories that appear in Google's list. But mapping means the content must be further classified into the subcategories within a category, e.g. /Business & Industrial/Business Services/Commercial Distribution. With the model we are using, we can map the Finance & Business output to /Business & Industrial, but the model cannot categorize content as deeply as Google's model can. So **mapping our model's output to Google's list is not possible**, and we scrap the idea of mapping our model's output.
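The problem can be illustrated with a small sketch. The list below is a hand-picked excerpt of paths that appear above, not the full `google_categories`, and `LABEL_TO_TOP_LEVEL` is a hypothetical, hand-written mapping. A model label can at best select a top-level prefix, and nothing in the model's output tells us which of the paths under that prefix to choose.

```python
# Excerpt only; the real list has 1092 entries.
google_categories = [
    "/Adult",
    "/Arts & Entertainment/Other",
    "/Arts & Entertainment/Comics & Animation/Comics",
    "/Business & Industrial/Business Services/Commercial Distribution",
]

# Assumed mapping from model labels to Google's top-level categories.
LABEL_TO_TOP_LEVEL = {
    "Adult": "/Adult",
    "Entertainment": "/Arts & Entertainment",
    "Finance & Business": "/Business & Industrial",
}


def candidates_for(label: str, categories: list[str]) -> list[str]:
    """Return every Google category that sits at or under the label's top level."""
    prefix = LABEL_TO_TOP_LEVEL[label]
    return [c for c in categories if c == prefix or c.startswith(prefix + "/")]


print(candidates_for("Adult", google_categories))               # one exact match
print(len(candidates_for("Entertainment", google_categories)))  # 2 candidates, no way to pick
```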

## Final Workflow 
* Scrape the URL.
* Feed the scraped content to the open-source model.

## Scraping tests
### Requests and BeautifulSoup4

In [10]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/BERT_(language_model)"

page = requests.get(url)

if page.status_code == 200:
    print(page.text)
else:
    print("Something went wrong requesting the URL")

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>BERT (language model) - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vect

In [12]:
# Using bs4

soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   BERT (language model) - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-c

In [14]:
# We will get all the data in <p> tags

p_tags = soup.find_all(['p'])

for tag in p_tags:
    print(tag.get_text(separator=' ',strip= True))

Bidirectional encoder representations from transformers ( BERT ) is a language model introduced in October 2018 by researchers at Google . [ 1 ] [ 2 ] It learns to represent text as a sequence of vectors using self-supervised learning . It uses the encoder-only transformer architecture. BERT dramatically improved the state-of-the-art for large language models . As of 2020 [update] , BERT is a ubiquitous baseline in natural language processing (NLP) experiments. [ 3 ]
BERT is trained by masked token prediction and next sentence prediction. As a result of this training process, BERT learns contextual, latent representations of tokens in their context, similar to ELMo and GPT-2 . [ 4 ] It found applications for many natural language processing tasks, such as coreference resolution and polysemy resolution. [ 5 ] It is an evolutionary step over ELMo , and spawned the study of "BERTology", which attempts to interpret what is learned by BERT. [ 3 ]
BERT was originally implemented in the Engli

In [16]:
## The above is good enough; let's try another webpage that is structured differently from Wikipedia

url2 = "https://huggingface.co/WebOrganizer/TopicClassifier-NoURL"

page2 = requests.get(url2)

soup2 = BeautifulSoup(page2.text, 'html.parser')

for tag in soup2.find_all('p'):
    print(tag.get_text(separator=' ',strip= True))

[ Paper ] [ Website ] [ GitHub ]
The TopicClassifier-NoURL organizes web content into 17 categories based on the text contents of web pages (without using URL information).
The model is a gte-base-en-v1.5 with 140M parameters fine-tuned on the following training data:
This classifier expects input in the following format:
Example:
You can convert the logits of the model with a softmax to obtain a probability distribution over the following 24 categories (in order of labels, also see id2label and label2id in the model config):
The full definitions of the categories can be found in the taxonomy config .
We recommend that you use the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory efficient attention. This requires installing xformers (see more here ) and loading the model like:
Files info
Base model


In [17]:
## OK, some content is missing; let's inspect which tags it is in

print(soup2.prettify())

<!DOCTYPE html>
<html class="">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0, user-scalable=no" name="viewport"/>
  <meta content="We’re on a journey to advance and democratize artificial intelligence through open source and open science." name="description"/>
  <meta content="1321688464574422" property="fb:app_id"/>
  <meta content="summary_large_image" name="twitter:card"/>
  <meta content="@huggingface" name="twitter:site"/>
  <meta content="https://cdn-thumbnails.huggingface.co/social-thumbnails/models/WebOrganizer/TopicClassifier-NoURL.png" name="twitter:image"/>
  <meta content="WebOrganizer/TopicClassifier-NoURL · Hugging Face" property="og:title"/>
  <meta content="website" property="og:type"/>
  <meta content="https://huggingface.co/WebOrganizer/TopicClassifier-NoURL" property="og:url"/>
  <meta content="https://cdn-thumbnails.huggingface.co/social-thumbnails/models/WebOrganizer/TopicClassifier-NoURL.png" property="og:image"/>
  <lin

In [18]:
## OK, the missing content is inside lists that have no <p> parent, so we include <ul> and <ol> as well

parsing_all = soup2.find_all(['p', 'ul', 'ol'])

output = []

for tag in parsing_all:
    # Skip if it's <ul> or <ol> and is inside a <p>
    if tag.name in ['ul', 'ol'] and tag.find_parent('p'):
        continue
    # Otherwise, get its text
    output.append(tag.get_text(separator=' ', strip=True))

all_text = '\n'.join(output)
print(all_text)

Models Datasets Spaces Posts Docs Enterprise Pricing Log In Sign Up
WebOrganizer/TopicClassifier-NoURL Usage Citation
Usage Citation


[ Paper ] [ Website ] [ GitHub ]
The TopicClassifier-NoURL organizes web content into 17 categories based on the text contents of web pages (without using URL information).
The model is a gte-base-en-v1.5 with 140M parameters fine-tuned on the following training data:
WebOrganizer/TopicAnnotations-Llama-3.1-8B : 1M documents annotated by Llama-3.1-8B (first-stage training) WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8 : 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)
WebOrganizer/FormatClassifier WebOrganizer/FormatClassifier-NoURL WebOrganizer/TopicClassifier WebOrganizer/TopicClassifier-NoURL ← you are here!
This classifier expects input in the following format:
Example:
You can convert the logits of the model with a softmax to obtain a probability distribution over the following 24 categories (in order of labels, also see id

In [19]:
# OK, the parsing misses code snippets; such code examples might be crucial to the model, so let's include them too

parsing_all = soup2.find_all(['p', 'ul', 'ol', 'pre'])

output = []

for tag in parsing_all:
    # Skip if it's <ul>, <ol> or <pre> and is inside a <p>
    if tag.name in ['ul', 'ol', 'pre'] and tag.find_parent('p'):
        continue
    # Otherwise, get its text
    output.append(tag.get_text(separator=' ', strip=True))

all_text = '\n'.join(output)
print(all_text)

Models Datasets Spaces Posts Docs Enterprise Pricing Log In Sign Up
WebOrganizer/TopicClassifier-NoURL Usage Citation
Usage Citation


[ Paper ] [ Website ] [ GitHub ]
The TopicClassifier-NoURL organizes web content into 17 categories based on the text contents of web pages (without using URL information).
The model is a gte-base-en-v1.5 with 140M parameters fine-tuned on the following training data:
WebOrganizer/TopicAnnotations-Llama-3.1-8B : 1M documents annotated by Llama-3.1-8B (first-stage training) WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8 : 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)
WebOrganizer/FormatClassifier WebOrganizer/FormatClassifier-NoURL WebOrganizer/TopicClassifier WebOrganizer/TopicClassifier-NoURL ← you are here!
This classifier expects input in the following format:
{text}
Example:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained( "WebOrganizer/TopicClassifier-N

In [21]:
words_in_text = all_text.split(" ")
print(Counter(words_in_text).total())

321


In [26]:
# Checking headers of our request

print(page2.request.headers)

{'User-Agent': 'python-requests/2.32.3', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}


In [32]:
## The header above is minimal and can get blocked by anti-bot systems, so a browser-like header will be set according to a variable

def get_browser_headers(browser="chrome"):
    browser = browser.lower()

    if browser == "chrome" or browser == "brave":
        user_agent = (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
        )
    elif browser == "firefox":
        user_agent = (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) "
            "Gecko/20100101 Firefox/124.0"
        )
    else:
        raise ValueError("Unsupported browser. Choose 'chrome', 'brave', or 'firefox'.")

    headers = {
        "User-Agent": user_agent,
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://www.google.com/",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Connection": "keep-alive",
    }

    return headers

headers = get_browser_headers("firefox")

response_with_header = requests.get(url2, headers=headers)
print(response_with_header.status_code)
print(response_with_header.request.headers)


200
{'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Connection': 'keep-alive', 'Accept-Language': 'en-US,en;q=0.9', 'Referer': 'https://www.google.com/'}


The above experiment with requests and bs4 is good enough. In our final script we'll also wrap requests with cloudscraper for Cloudflare-protected sites. Now let's move on to Selenium.

### Selenium

In [58]:
# Experiment on Brave driver
import shutil
import time  # needed for time.sleep() below
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service 
from webdriver_manager.chrome import ChromeDriverManager

get_brave_linux = shutil.which("brave")

print(get_brave_linux)
user_agent = "Chrome/120.0.0.0"

options  = Options()
options.binary_location = get_brave_linux
options.add_argument("--headless=new") 
options.add_argument("--disable-gpu")
options.add_argument(f"user-agent={user_agent}")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--remote-debugging-port=9222") 

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# Let's check my portfolio website as it uses heavy JS and client side rendering

url = "https://angel-tamang.vercel.app"  # driver.get() needs a full URL including the scheme

driver.get(url)
time.sleep(3) # For JS rendered components

html = driver.page_source
driver.quit()

print(html)

/snap/bin/brave


SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 114
Current browser version is 136.0.7103.93 with binary path /snap/bin/brave
Stacktrace:
#0 0x55b7a3bc04e3 <unknown>
#1 0x55b7a38efc76 <unknown>
#2 0x55b7a391d04a <unknown>
#3 0x55b7a39184a1 <unknown>
#4 0x55b7a3915029 <unknown>
#5 0x55b7a3953ccc <unknown>
#6 0x55b7a395347f <unknown>
#7 0x55b7a394ade3 <unknown>
#8 0x55b7a39202dd <unknown>
#9 0x55b7a392134e <unknown>
#10 0x55b7a3b803e4 <unknown>
#11 0x55b7a3b843d7 <unknown>
#12 0x55b7a3b8eb20 <unknown>
#13 0x55b7a3b85023 <unknown>
#14 0x55b7a3b531aa <unknown>
#15 0x55b7a3ba96b8 <unknown>
#16 0x55b7a3ba9847 <unknown>
#17 0x55b7a3bb9243 <unknown>
#18 0x7f543a28f6ba <unknown>


In [52]:
import os
# Using the driver already installed on the device
def find_file(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)
    return None

file_name = "chromedriver"
search_path = "/home/"

result = find_file(file_name, search_path)

if result:
    print(f"File found at: {result}")
else:
    print("File not found.")

File found at: /home/angel-tamang/.wdm/drivers/chromedriver/linux64/114.0.5735.90/chromedriver


In [53]:
service = Service(result)

driver = webdriver.Chrome(service=service, options=options)

# Let's check my portfolio website as it uses heavy JS and client side rendering

url = "https://angel-tamang.vercel.app"  # driver.get() needs a full URL including the scheme

driver.get(url)
time.sleep(3) # For JS rendered components

html = driver.page_source
driver.quit()

print(html)

SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 114
Current browser version is 136.0.7103.93 with binary path /snap/bin/brave
Stacktrace:
#0 0x5648b21714e3 <unknown>
#1 0x5648b1ea0c76 <unknown>
#2 0x5648b1ece04a <unknown>
#3 0x5648b1ec94a1 <unknown>
#4 0x5648b1ec6029 <unknown>
#5 0x5648b1f04ccc <unknown>
#6 0x5648b1f0447f <unknown>
#7 0x5648b1efbde3 <unknown>
#8 0x5648b1ed12dd <unknown>
#9 0x5648b1ed234e <unknown>
#10 0x5648b21313e4 <unknown>
#11 0x5648b21353d7 <unknown>
#12 0x5648b213fb20 <unknown>
#13 0x5648b2136023 <unknown>
#14 0x5648b21041aa <unknown>
#15 0x5648b215a6b8 <unknown>
#16 0x5648b215a847 <unknown>
#17 0x5648b216a243 <unknown>
#18 0x7fb5e588f6ba <unknown>


In [61]:
import subprocess
import re
# OK, so we can try downloading the driver that matches the current browser version

def get_brave_version():
    brave_path = shutil.which("brave-browser") or shutil.which("brave")
    if not brave_path:
        return "Brave browser not found"

    try:
        output = subprocess.check_output([brave_path, "--version"], stderr=subprocess.STDOUT)
        version_str = output.decode("utf-8").strip()  # e.g. "Brave Browser 114.0.5735.90"
        
        # Extract version number using regex
        match = re.search(r"(\d+\.\d+\.\d+\.\d+)", version_str)
        if match:
            return match.group(1)
        else:
            return "Version number not found"
    except Exception as e:
        return f"Error getting Brave version: {e}"

print(get_brave_version())

service = Service(ChromeDriverManager(driver_version=get_brave_version()).install())

driver = webdriver.Chrome(service=service, options=options)

driver.get(url)
time.sleep(3) # For JS rendered components

html = driver.page_source
driver.quit()

print(html)


136.1.78.97


Exception: No such driver version 136.1.78.97 for linux64
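The version printed by `brave --version` here (136.1.78.97) is not the same string as the browser version reported in the earlier `SessionNotCreatedException` (136.0.7103.93), and the driver lookup failed for the former. One option, sketched below, is to read the version out of that error message instead. This assumes the message keeps the format shown above; `chromium_version_from_error` is a hypothetical helper, and whether a driver exists for that exact version has not been verified here.

```python
import re


def chromium_version_from_error(message: str):
    """Extract the four-part browser version from a ChromeDriver mismatch message."""
    match = re.search(r"Current browser version is (\d+(?:\.\d+){3})", message)
    return match.group(1) if match else None


# Message text copied from the exception output earlier in this notebook.
error_message = (
    "session not created: This version of ChromeDriver only supports Chrome version 114\n"
    "Current browser version is 136.0.7103.93 with binary path /snap/bin/brave"
)

version = chromium_version_from_error(error_message)
print(version)                # 136.0.7103.93
print(version.split(".")[0])  # 136 (major version)
```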

### Managing Selenium for Brave seems tricky, so for now let's look at Chrome, which is widely supported

In [70]:


google_binary = shutil.which("google-chrome")

options = Options()
options.binary_location = google_binary  # point Selenium at Chrome instead of Brave
# options.add_argument("--headless=new")
options.add_argument("--disable-gpu")
options.add_argument(f"user-agent={user_agent}")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--remote-debugging-port=9222")

service = Service(ChromeDriverManager(driver_version="136.0.7103.113").install())

driver = webdriver.Chrome(service=service, options=options)
driver.implicitly_wait(2)  # set the wait before use; the session no longer exists after quit()

driver.get(url)

html = driver.page_source

driver.quit()

print(html)

SessionNotCreatedException: Message: session not created
from disconnected: unable to connect to renderer
Stacktrace:
#0 0x55d47cffa75a <unknown>
#1 0x55d47ca9d0a0 <unknown>
#2 0x55d47cadd041 <unknown>
#3 0x55d47cad8095 <unknown>
#4 0x55d47cad2c0f <unknown>
#5 0x55d47cb22d75 <unknown>
#6 0x55d47cb22296 <unknown>
#7 0x55d47cb14173 <unknown>
#8 0x55d47cae0d4b <unknown>
#9 0x55d47cae19b1 <unknown>
#10 0x55d47cfbf90b <unknown>
#11 0x55d47cfc380a <unknown>
#12 0x55d47cfa7662 <unknown>
#13 0x55d47cfc4394 <unknown>
#14 0x55d47cf8c49f <unknown>
#15 0x55d47cfe8538 <unknown>
#16 0x55d47cfe8716 <unknown>
#17 0x55d47cff95c6 <unknown>
#18 0x7f6fc2c8f6ba <unknown>


### Decision on Selenium
Even with matching driver and browser versions of Chrome there are problems; this could also be caused by my underlying system. For that reason I will switch to testing Playwright in the next notebook.