[WebSearcher](https://github.com/gitronald/WebSearcher) is a Python package that facilitates obtaining and parsing search results from Google text search. Compared to `webbotparser`, it supports parsing more diverse results (ads, knowledge boxes, etc.), but only Google text results (for now). We can utilize its parsing capabilities on search result pages downloaded using [WebBot](https://github.com/gesiscss/WebBot) as follows:

## Installation

In [1]:
%pip install git+https://github.com/gitronald/WebSearcher@dev # the dev branch is more up to date than the PyPI release
%pip install pandas

Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/gitronald/WebSearcher@dev
  Cloning https://github.com/gitronald/WebSearcher (to revision dev) to /tmp/pip-req-build-9mjxviny
  Running command git clone --filter=blob:none --quiet https://github.com/gitronald/WebSearcher /tmp/pip-req-build-9mjxviny
  Running command git checkout -b dev --track origin/dev
  Switched to a new branch 'dev'
  Branch 'dev' set up to track remote branch 'dev' from 'origin'.
  Resolved https://github.com/gitronald/WebSearcher to commit 7e04755ea4562949e7b7a0391e1d10f8eeead075
  Preparing metadata (setup.py) ... done
Collecting tldextract (from WebSearcher==2023.5.19)
  Downloading tldextract-3.4.4-py3-none-any.whl (93 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 93.3/93.3 kB 2.5 MB/s eta 0:00:00
Collecting brotli (from WebSearcher==2023.5.19)
  Downloading Brotli-1.0.9-cp38-cp38-

## Import packages

In [2]:
import WebSearcher as ws
import json
import pandas as pd

## Usage

Initialize the WebSearcher search engine object:

In [3]:
se = ws.SearchEngine()

WebSearcher doesn't (yet) have a function to load external HTML for parsing. So we do this manually:

In [4]:
filename = 'testdata/www.google.com_Climate change_text_2023-08-02_14_50_10.html'
with open(filename, 'r') as file:
    se.html = file.read()
    se.serp_id = filename
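
The manual loading step can be wrapped in a small helper. This is a minimal sketch, not part of WebSearcher: `load_serp_html` is a hypothetical name, and it relies only on the `html` and `serp_id` attributes set above. The UTF-8 default is an assumption about how the pages were saved.

```python
from pathlib import Path


def load_serp_html(engine, path, encoding="utf-8"):
    """Read a saved result page and attach it to `engine`.

    Hypothetical helper (not part of WebSearcher): it sets the same two
    attributes as the manual snippet above, `html` and `serp_id`.
    The UTF-8 default is an assumption about how the page was saved.
    """
    path = Path(path)
    engine.html = path.read_text(encoding=encoding)
    engine.serp_id = str(path)
    return engine
```

With the objects from above, `load_serp_html(se, filename)` followed by `se.parse_results()` would take the place of the `with` block.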

Parse the results using the parser provided by WebSearcher:

In [5]:
se.parse_results()
print(json.dumps(se.results, indent=2))

[
  {
    "sub_rank": 0,
    "type": "unknown",
    "cmpt_rank": 0,
    "serp_rank": 0,
    "crawl_id": null,
    "serp_id": "testdata/www.google.com_Climate change_text_2023-08-02_14_50_10.html",
    "qry": "Climate change",
    "lang": "de",
    "lhs_bar": false
  },
  {
    "type": "general",
    "sub_rank": 0,
    "cite": null,
    "details": "",
    "text": null,
    "cmpt_rank": 1,
    "serp_rank": 1,
    "crawl_id": null,
    "serp_id": "testdata/www.google.com_Climate change_text_2023-08-02_14_50_10.html",
    "qry": "Climate change",
    "lang": "de",
    "lhs_bar": false
  },
  {
    "type": "general",
    "sub_rank": 0,
    "title": "Home \u2013 Climate Change: Vital Signs of the Planet",
    "url": "https://climate.nasa.gov/",
    "cite": "https://climate.nasa.gov",
    "details": "",
    "text": "Vital Signs of the Planet: Global  Climate Change  and  Global Warming . Current news and data streams about  global warming  and  climate change  from NASA.",
    "timestamp": nu
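
For a quick overview of what the parser found, the result dictionaries can be tallied by their `type` field. The following is a minimal sketch using only the standard library. `count_result_types` is a hypothetical helper, and `sample_results` is a hand-written stand-in that mimics the structure printed above; in the notebook you would pass `se.results` instead.

```python
from collections import Counter


def count_result_types(results):
    """Count how often each value of the "type" key occurs."""
    # .get() guards against entries that lack a "type" key (an assumption;
    # every entry shown above has one).
    return Counter(r.get("type", "missing") for r in results)


# Stand-in data mimicking the structure of se.results shown above.
sample_results = [
    {"type": "unknown", "serp_rank": 0},
    {"type": "general", "serp_rank": 1},
    {"type": "general", "serp_rank": 2, "url": "https://climate.nasa.gov/"},
]

print(count_result_types(sample_results))
```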

We can also convert them to a Pandas dataframe:

In [6]:
pd.DataFrame(se.results)

Unnamed: 0,sub_rank,type,cmpt_rank,serp_rank,crawl_id,serp_id,qry,lang,lhs_bar,cite,details,text,title,url,timestamp
0,0,unknown,0,0,,testdata/www.google.com_Climate change_text_20...,Climate change,de,False,,,,,,
1,0,general,1,1,,testdata/www.google.com_Climate change_text_20...,Climate change,de,False,,,,,,
2,0,general,2,2,,testdata/www.google.com_Climate change_text_20...,Climate change,de,False,https://climate.nasa.gov,,Vital Signs of the Planet: Global Climate Cha...,Home – Climate Change: Vital Signs of the Planet,https://climate.nasa.gov/,
3,0,general,3,3,,testdata/www.google.com_Climate change_text_20...,Climate change,de,False,https://climate.nasa.gov › effects,,“ Climate change ” encompasses global warming...,Effects | Facts – Climate Change: Vital Signs ...,https://climate.nasa.gov/effects/,
4,0,general,4,4,,testdata/www.google.com_Climate change_text_20...,Climate change,de,False,https://en.wikipedia.org › wiki › Climate_change,,"In common usage, climate change describes g...",Climate change,https://en.wikipedia.org/wiki/Climate_change,
5,0,general,5,5,,testdata/www.google.com_Climate change_text_20...,Climate change,de,False,https://www.ipcc.ch,,The Intergovernmental Panel on Climate Change...,IPCC — Intergovernmental Panel on Climate Change,https://www.ipcc.ch/,
6,0,general,6,6,,testdata/www.google.com_Climate change_text_20...,Climate change,de,False,https://climate.ec.europa.eu › caus...,,The main driver of climate change is the gre...,Causes of climate change,https://climate.ec.europa.eu/climate-change/ca...,
7,0,general,7,7,,testdata/www.google.com_Climate change_text_20...,Climate change,de,False,https://royalsociety.org › projects,,Climate change • Climate change refers to l...,Climate change: evidence and causes,https://royalsociety.org/topics-policy/project...,
8,0,general,8,8,,testdata/www.google.com_Climate change_text_20...,Climate change,de,False,https://www.bbc.com › news › sci...,,29.06.2023 — Climate change is a shift in t...,What is climate change? A really simple guide,https://www.bbc.com/news/science-environment-2...,29.06.2023
9,0,general,9,9,,testdata/www.google.com_Climate change_text_20...,Climate change,de,False,https://climateknowledgeportal.worldbank.org ›...,,Climate change is the significant variation o...,What is Climate Change,https://climateknowledgeportal.worldbank.org/o...,
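
The first two rows above carry no title or URL, so a common follow-up is to keep only the rows that point to a page and to select the columns of interest. The following is a minimal sketch with hand-written stand-in rows (`sample_results` is illustrative, not real parser output); the column names are the ones shown in the table above.

```python
import pandas as pd

# Stand-in rows mimicking se.results; in the notebook, use se.results instead.
sample_results = [
    {"type": "unknown", "serp_rank": 0},
    {"type": "general", "serp_rank": 1, "cite": None, "text": None},
    {
        "type": "general",
        "serp_rank": 2,
        "title": "Home \u2013 Climate Change: Vital Signs of the Planet",
        "url": "https://climate.nasa.gov/",
    },
]

df = pd.DataFrame(sample_results)

# Keep only results that have a URL, then a readable subset of columns.
links = df.dropna(subset=["url"])[["serp_rank", "type", "title", "url"]]
links = links.reset_index(drop=True)
print(links)
```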


Another example: parsing the second page.

In [7]:
filename = 'testdata/www.google.com_Climate change_text_2023-08-02_14_50_18.html'
with open(filename, 'r') as file:
    se.html = file.read()
    se.serp_id = filename

In [8]:
se.parse_results()  # parse the newly loaded page; without this call, se.results still holds the first page's results
pd.DataFrame(se.results)
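
When several saved pages belong to the same query, the load-and-parse steps can be repeated in a loop and the results combined. The following is a minimal sketch, assuming the engine object behaves as in the cells above (assign `html` and `serp_id`, call `parse_results()`, read `results`). `parse_pages` and the added `page` key are hypothetical and not part of WebSearcher.

```python
from pathlib import Path


def parse_pages(engine, filenames, encoding="utf-8"):
    """Parse saved result pages in order and return one combined list.

    Hypothetical helper: `engine` is expected to expose `html`, `serp_id`,
    `parse_results()` and `results`, as the SearchEngine object does in the
    cells above. Each result gets an extra "page" key (1-based), which is an
    addition of this sketch, not a WebSearcher field.
    """
    combined = []
    for page_number, filename in enumerate(filenames, start=1):
        engine.html = Path(filename).read_text(encoding=encoding)
        engine.serp_id = str(filename)
        engine.parse_results()  # must be called again for every new page
        for result in engine.results:
            combined.append({**result, "page": page_number})
    return combined
```

In the notebook, `pd.DataFrame(parse_pages(se, [first_file, second_file]))` would give a single table covering both pages, where `first_file` and `second_file` stand for the two paths used above.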
