Harvest - A toolkit for extracting posts and post metadata from web forums

Automatically extracting forum posts and their metadata is challenging because forums do not expose their content in a standardized structure. Harvest performs this task reliably for many web forums and offers a simple way to extract their data.

Installation

At the command line:

$ pip install harvest-webforum

To install from the latest sources instead:

$ git clone https://github.com/fhgr/harvest.git
$ cd harvest
$ python3 setup.py install

Python library

Embedding harvest into your code is easy, as outlined below:

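from urllib.request import urlopen, Request
from harvest import extract_data

# Some forums reject requests that lack a browser-like User-Agent header.
USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0"

# Download the forum thread as text (the page is assumed to be UTF-8 encoded).
url = "https://forum.videolan.org/viewtopic.php?f=14&t=145604"
req = Request(url, headers={'User-Agent': USER_AGENT})
with urlopen(req) as response:
    html = response.read().decode('utf-8')

# Extract the posts and post metadata from the downloaded page.
result = extract_data(html, url)
print(result)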
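
When processing several threads, it may help to wrap the download step so that one unreachable or oddly encoded page does not abort the whole run. The following is a minimal sketch using only the standard library. The helpers `fetch_html` and `extract_many`, and their `opener`, `fetch` and `extractor` parameters, are hypothetical names introduced for illustration and are not part of harvest. The extractor is passed in as a callable with the `(html, url)` argument order shown above.

```python
from urllib.request import Request, urlopen

USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0"


def fetch_html(url, opener=urlopen, timeout=30):
    """Download a page and decode it with the charset the server declares.

    Hypothetical helper for this sketch; not part of harvest.
    """
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with opener(req, timeout=timeout) as response:
        # Fall back to UTF-8 when the Content-Type header names no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_many(urls, extractor, fetch=fetch_html):
    """Call extractor(html, url) for each URL and collect results and failures.

    Hypothetical helper for this sketch; not part of harvest.
    """
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = extractor(fetch(url), url)
        except (OSError, ValueError) as exc:
            # urllib's URLError and HTTPError are subclasses of OSError;
            # ValueError covers malformed URLs.
            failures[url] = exc
    return results, failures
```

Assuming harvest is installed, this could be used as `results, failures = extract_many([url], extract_data)` after `from harvest import extract_data`. Keeping the extractor as a parameter means the download logic can be tested without network access.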

WEB-FORUM-52 gold standard

The corpus currently contains gold standard documents from 52 different web forums. These documents are also used by harvest's integration tests.

Publication