### Install WebBotParser

In [1]:
%pip install git+https://github.com/gesiscss/WebBotParser@tutorial-v0.1.1#egg=webbotparser

Defaulting to user installation because normal site-packages is not writeable
Collecting webbotparser
  Cloning https://github.com/gesiscss/WebBotParser (to revision tutorial-v0.1.1) to /tmp/pip-install-8shjtifb/webbotparser_233b53f59206440f9306e2883af24410
  Running command git clone --filter=blob:none --quiet https://github.com/gesiscss/WebBotParser /tmp/pip-install-8shjtifb/webbotparser_233b53f59206440f9306e2883af24410
  Resolved https://github.com/gesiscss/WebBotParser to commit ca104e66b4b422c539d71ce013dd8af29865bb33
  Preparing metadata (setup.py) ... done
Note: you may need to restart the kernel to use updated packages.


### Import WebBotParser and specify which engine to use

In [2]:
from webbotparser import WebBotParser
parser = WebBotParser(engine = 'DuckDuckGo News')

### Extract search results and metadata from a given results page

In [3]:
file = './testdata/duckduckgo.com_climate change_news_2023-07-28_17_48_39.html'
metadata, results = parser.get_results(file, with_metadata=True)
metadata

{'result type': 'news',
 'engine': 'duckduckgo.com',
 'query': 'climate change',
 'date': Timestamp('2023-07-28 17:48:39')}
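
Most of the metadata fields shown above also appear in the file name of the archived page. The following is a minimal sketch of reading them from the name alone. It assumes the naming scheme `<engine>_<query>_<result type>_<date>_<time>.html` seen in the test data. `metadata_from_filename` is a hypothetical helper and is not part of the webbotparser package, which may derive its metadata differently.

```python
import re
from datetime import datetime
from pathlib import Path

# Assumed naming scheme: <engine>_<query>_<result type>_<YYYY-MM-DD>_<HH>_<MM>_<SS>.html
FILENAME_PATTERN = re.compile(
    r"^(?P<engine>[^_]+)_(?P<query>.+)_(?P<result_type>[^_]+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_(?P<time>\d{2}_\d{2}_\d{2})$"
)

def metadata_from_filename(path):
    """Hypothetical helper: read basic metadata from a WebBot-style file name."""
    match = FILENAME_PATTERN.match(Path(path).stem)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    stamp = f"{match['date']} {match['time']}"
    return {
        "result type": match["result_type"],
        "engine": match["engine"],
        "query": match["query"],
        "date": datetime.strptime(stamp, "%Y-%m-%d %H_%M_%S"),
    }

print(metadata_from_filename(
    "./testdata/duckduckgo.com_climate change_news_2023-07-28_17_48_39.html"
))
```

Fields such as `page` or `total results` cannot be recovered this way, since they are not encoded in the file name.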

In [4]:
results

Unnamed: 0,title,link,text,source,has_image,published
0,Sudden explosion of dangerous fungus Candida a...,https://www.cbsnews.com/news/candida-auris-cli...,Candida auris is a globally emerging public he...,CBS News,True,31 minutes ago
1,G-20 Ministers Fail to Agree on Key Climate Is...,https://www.bloomberg.com/news/articles/2023-0...,The Group of 20 environment and climate minist...,Bloomberg L.P.,True,14 minutes ago
2,Chapter Zero to Train Southern African Boards ...,https://www.bloomberg.com/news/articles/2023-0...,"Chapter Zero, an initiative to educate non-exe...",Bloomberg L.P.,True,3 hours ago
3,Climate Change is Changing How We Dream,https://time.com/6298730/climate-change-dreams/,The Harris Poll survey also showed that 43% of...,Time,True,22 hours ago
4,Another effect of climate change? More flight ...,https://www.cbsnews.com/news/climate-change-fl...,Travelers have had to suffer through a record ...,CBS News,True,18 hours ago
...,...,...,...,...,...,...
68,Two UK hotspots to bear brunt of climate chang...,https://www.msn.com/en-gb/weather/topstories/t...,Although July 2023 has been much cooler than t...,Bristol Live on MSN.com,True,6 hours ago
69,Hudson Technologies: Long-Term Tailwinds From ...,https://www.msn.com/en-us/money/companies/huds...,Summary Hudson Technologies provides sustainab...,Seeking Alpha on MSN.com,True,9 hours ago
70,Celo and TinyTap partner for climate change fo...,https://www.cryptonewsz.com/celo-and-tinytap-p...,Celo announces its partnership with TinyTap to...,cryptonewsz,False,4 hours ago
71,Climate Change,https://www.voanews.com/z/6837,A large wildfire burning on the Greek island o...,Voice of America,True,3 days ago
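
The `published` column holds relative times such as "31 minutes ago". For analysis it can help to convert them into approximate absolute timestamps, using the crawl date from the metadata as the reference point. The sketch below is one way to do this. `to_absolute` is a hypothetical helper, not a webbotparser function, and it only covers English labels with fixed-length units.

```python
import re
from datetime import datetime, timedelta

# Seconds per unit; months and years are omitted because their length varies.
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400, "week": 604800}

def to_absolute(published, crawl_time):
    """Hypothetical helper: turn '31 minutes ago' into an approximate datetime."""
    match = re.fullmatch(r"(\d+) (second|minute|hour|day|week)s? ago", published.strip())
    if match is None:
        return None  # absolute dates and unsupported labels are left to the caller
    amount, unit = int(match.group(1)), match.group(2)
    return crawl_time - timedelta(seconds=amount * UNIT_SECONDS[unit])

# A plain datetime standing in for metadata['date'] from the cell above.
crawl_time = datetime(2023, 7, 28, 17, 48, 39)
print(to_absolute("31 minutes ago", crawl_time))  # 2023-07-28 17:17:39
print(to_absolute("3 days ago", crawl_time))      # 2023-07-25 17:48:39
```

Applied to the DataFrame, this could look like `results['published'].apply(to_absolute, crawl_time=metadata['date'])`.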


### Only get the metadata

In [5]:
parser = WebBotParser(engine = 'Google Text')
file = './testdata/google.com_elections_text_2023-03-15_13_48_09.html'
parser.get_metadata(file)

{'result type': 'text',
 'engine': 'google.com',
 'query': 'elections',
 'page': 1,
 'date': Timestamp('2023-03-15 13:48:09'),
 'total results': 783000000}

### Extract all search results from a directory, in correct order
This is particularly useful for search engines that paginate their results, e.g. news results on Google. For engines that load more results on scrolling (e.g. DuckDuckGo text results), this step is not needed.

In [6]:
parser = WebBotParser(engine = 'Google News')
directory = './testdata/google_news/'
metadata, results = parser.get_results_from_dir(directory)
metadata

{'result type': 'news',
 'engine': 'google.com',
 'query': 'elections',
 'total results': 138000000}

In [7]:
# examine the last 5 results
results[-5:]

Unnamed: 0,title,link,text,source,has_image,published,date,page,position
45,Elections in Nigeria: 2023 General Elections,https://www.ifes.org/tools-resources/election-...,Nigeria will hold general elections on Saturda...,The International Foundation for Electoral Sys...,True,vor 4 Wochen,2023-03-15 13:49:00,5,5
46,Japan PM faces test on April 23 as 5 parliamen...,https://mainichi.jp/english/articles/20230315/...,TOKYO (Kyodo) -- Japanese Prime Minister Fumio...,The Mainichi,True,vor 1 Stunde,2023-03-15 13:49:00,5,6
47,24 women seeking to be state governors in Satu...,https://www.premiumtimesng.com/news/587633-24-...,Eighteen parties are presenting candidates for...,Premium Times Nigeria,True,vor 1 Tag,2023-03-15 13:49:00,5,7
48,"2023 elections not perfect, APC chairman admits",https://www.premiumtimesng.com/news/headlines/...,"Speaking on the lapses, Mr Adamu said that eve...",Premium Times Nigeria,True,vor 1 Tag,2023-03-15 13:49:00,5,8
49,2023 Elections: Nigeria’s Fate in the Hands of...,https://www.thisdaylive.com/index.php/2023/03/...,"From all indications, the Supreme Court will l...",THISDAYLIVE,True,vor 1 Stunde,2023-03-15 13:49:00,5,9
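
To illustrate what "correct order" can mean for a directory of result pages, the sketch below sorts file names by the crawl timestamp embedded in them. This assumes that pages are saved in the order in which they were visited and that the file names follow the scheme used in the test data. How `get_results_from_dir` actually orders pages is not shown here. `crawl_timestamp` is a hypothetical helper, and two of the three file names are made up.

```python
from datetime import datetime

def crawl_timestamp(filename):
    """Hypothetical helper: read the trailing timestamp of a WebBot-style file name."""
    stem = filename.rsplit(".", 1)[0]          # drop the .html extension
    stamp = "_".join(stem.rsplit("_", 4)[1:])  # e.g. '2023-03-15_13_48_37'
    return datetime.strptime(stamp, "%Y-%m-%d_%H_%M_%S")

files = [
    "google.com_elections_news_2023-03-15_13_48_52.html",  # made-up example
    "google.com_elections_news_2023-03-15_13_48_37.html",  # from the test data
    "google.com_elections_news_2023-03-15_13_48_45.html",  # made-up example
]

# Sort by crawl time so that earlier pages come first.
ordered = sorted(files, key=crawl_timestamp)
for name in ordered:
    print(name)
```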


### Extract images

WebBot archives images inline in the HTML file of the search results, i.e., they are neither stored as separate files on your drive nor fetched from the original source when you view the downloaded results page. This allows us to extract the images directly from the HTML file for further analysis. The engines and result types that WebBotParser supports out of the box also support image extraction:

In [8]:
parser = WebBotParser('Google News',
                      extract_images=True,
                      #extract_images_prefix='video', # optional, default is ''
                      # for available formats see: https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html
                      extract_images_format='PNG', # optional, default is 'JPEG'
                      extract_images_to_dir='extracted_images' # optional, default is 'extracted_images'
                      )
metadata, results = parser.get_results('./testdata/google_news/google.com_elections_news_2023-03-15_13_48_37.html')

We can use some magical IPython formatting to directly view the images as well:

In [9]:
from IPython.display import HTML  # public import path for IPython's display classes

results['image preview'] = "<img src='" + results['image'] + "'/>"
HTML(results[['title','has_image','image preview']].to_html(escape=False))

Unnamed: 0,title,has_image,image preview
0,Turkey's Kilicdaroglu ahead of Erdogan two months before ...,True,
1,Dutch farmers look to reap gains in elections,True,
2,Dutch go to polls in midterm provincial elections,True,
3,Local elections of national consequence begin in Netherlands,True,
4,"Greek polls: Mitsotakis, Tsipras in neck-and-neck race before ...",True,
5,Inside the courts and challenging election outcomes,True,
6,Turkish citizens abroad to start voting for elections in April | Daily Sabah,True,
7,"Elections for the water boards take place today, and you can ...",True,
8,Electoral Commission on municipal by-elections of 22 March ...,False,
9,Join our latest event on the 2024 EU Elections,True,
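
As a rough illustration of what extracting inline images involves, the sketch below pulls base64-encoded images out of an HTML string. It assumes the images are embedded as `data:` URIs, which is a common way to inline images but is not confirmed here for WebBot archives. `inline_images` is a hypothetical helper and says nothing about how WebBotParser is implemented internally.

```python
import base64
import re

# Matches base64-encoded image data URIs such as data:image/jpeg;base64,...
DATA_URI = re.compile(
    r"data:image/(?P<format>[a-zA-Z0-9.+-]+);base64,(?P<payload>[A-Za-z0-9+/=]+)"
)

def inline_images(html):
    """Hypothetical helper: return (format, raw bytes) for each inline image."""
    return [
        (m["format"], base64.b64decode(m["payload"]))
        for m in DATA_URI.finditer(html)
    ]

# A made-up snippet; the payload is placeholder bytes, not a real image.
payload = base64.b64encode(b"not a real image").decode("ascii")
html = f"<img src='data:image/png;base64,{payload}'/>"

images = inline_images(html)
print(images[0][0], len(images[0][1]))  # png 16
```

The decoded bytes could then be written to disk or opened with an imaging library such as Pillow.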


### Initialize the WebBotParser with custom queries

This example shows how to initialize WebBotParser with a custom result_selector, queries, and metadata_extractor. This is necessary for parsing result types that WebBotParser does not cover out of the box. It may also become necessary if a search engine changes its layout so that the predefined queries or result selectors no longer work.

In [10]:
from webbotparser import GoogleParser
import pandas as pd

# some custom functions for extracting information from individual results
# if most queries are custom, rewrite the __evaluate_query function instead.
def get_published_date(_soup):
    # return None if the selector finds no matching element
    try:
        date = _soup.select('div.P7xzyf > span:last-child')[0].get_text()
    except IndexError:
        date = None
    return date

def get_duration(_soup):
    duration = _soup.select('div.J1mWY')
    if len(duration) == 1: # some videos don't have a duration
        min_sec = duration[0].get_text().split(':')
        return pd.to_timedelta(int(min_sec[0])*60 + int(min_sec[1]), unit='seconds')
    return None

# queries discriminate the parts of an individual result
# WebBotParser supports text, attribute, exists, and custom queries
my_queries = [
        {'name': 'title', 'type': 'text', 'selector': 'h3'},
        {'name': 'link', 'type': 'attribute', 'selector': 'div.ct3b9e > div > a', 'attribute': 'href'},
        {'name': 'text', 'type': 'text', 'selector': 'div.Uroaid'},
        {'name': 'source', 'type': 'text', 'selector': 'span.Zg1NU'},
        {'name': 'published', 'type': 'custom', 'function': get_published_date}, # pass a function for custom queries.
        {'name': 'duration', 'type': 'custom', 'function': get_duration}
    ]

# the result_selector is used to find the individual results, returned as a list
my_result_selector = 'div.MjjYud'

# initialize a custom WebBotParser for Google Video results (also provided for out of the box usage with the webbotparser package)
parser = WebBotParser(
    queries = my_queries,
    result_selector = my_result_selector,
    metadata_extractor = GoogleParser.google_metadata # you can re-use parts such as this metadata_extractor already defined in the webbotparser package
)

In [11]:
# some warnings might be shown due to a malformed webpage; they should not affect the rest of the results
metadata, results = parser.get_results('./testdata/google.com_elections_videos_2023-03-15_13_19_24.html')

  result[query['name']] = self.__evaluate_query(query, _soup)
  result[query['name']] = self.__evaluate_query(query, _soup)
  result[query['name']] = self.__evaluate_query(query, _soup)
  result[query['name']] = self.__evaluate_query(query, _soup)


In [12]:
metadata

{'result type': 'videos',
 'engine': 'google.com',
 'query': 'elections',
 'page': 1,
 'date': Timestamp('2023-03-15 13:19:24'),
 'total results': 193000000}

In [13]:
# examine the results
results.dropna()

Unnamed: 0,title,link,text,source,published,duration
0,Presidential Election Process | USAGov,https://www.usa.gov/election,"Electoral College. In other U.S. elections, ca...",USA.gov,12 Sept 2016,0 days 00:02:22
1,Elections | Tennessee Secretary of State,https://sos.tn.gov/elections,Tre Hargett was elected by the Tennessee Gener...,Tennessee Secretary of State,29 Dec 2010,0 days 00:00:39
2,Elections misinformation policies - YouTube Help,https://support.google.com/youtube/answer/1083...,Voter suppression: · Candidate eligibility: · ...,Google Help,15 Jul 2021,0 days 00:02:53
3,Elections - Department of Political and Peaceb...,https://dppa.un.org/en/elections,An electoral officer (right) assists a voter a...,Department of Political and Peacebuilding Affairs,31 May 2019,0 days 00:05:37
4,Bexar County Elections Department - Bexar County,https://www.bexar.org/1568/Elections-Department,The Bexar County Elections Department is respo...,Bexar County,30 Sept 2019,0 days 00:03:21
5,Committee on Elections - 03/14/23 - YouTube,https://www.youtube.com/watch?v=eABxOrPeM2I,S.F. 1827 (Carlson) Major political party defi...,YouTube,14 hours ago,0 days 00:01:59
6,By-elections and supplementary elections,https://www.aec.gov.au/elections/supplementary...,A supplementary election must be held if a can...,Australian Electoral Commission,22 Oct 2007,0 days 00:00:42
7,Voting on election day - NSW Electoral Commission,https://elections.nsw.gov.au/voters/voting-opt...,"If you're an eligible voter, it is compulsory ...",NSW Electoral Commission,6 days ago,0 days 00:05:10
8,Postal voting - NSW Electoral Commission,https://elections.nsw.gov.au/voters/voting-opt...,If your circumstances make it difficult to vot...,NSW Electoral Commission,13 Jan 2023,0 days 00:03:38
9,Welcome to voting - NSW Electoral Commission,https://elections.nsw.gov.au/voters/welcome-to...,Are you new to voting? Find out what to expect...,NSW Electoral Commission,2 weeks ago,0 days 00:04:19
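
The `get_duration` function above expects labels of the form `minutes:seconds`. If a layout also shows hours (an assumption, since no such result appears in this test file), a slightly more general parser may be useful. The sketch below uses only the standard library. `parse_duration` is a hypothetical helper.

```python
from datetime import timedelta

def parse_duration(label):
    """Hypothetical helper: convert 'M:SS' or 'H:MM:SS' into a timedelta."""
    seconds = 0
    for part in label.strip().split(":"):
        # Each field to the left is worth 60 times the one after it.
        seconds = seconds * 60 + int(part)
    return timedelta(seconds=seconds)

print(parse_duration("2:22"))     # 0:02:22
print(parse_duration("1:02:03"))  # 1:02:03
```

Inside a custom query, the result could be wrapped in `pd.to_timedelta(...)` to keep the column type consistent with the output above.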
