Heritrix @ 3b4487c
Plone @ 28882b4
PrintServer @ 3de60a3 Pywb submodules added May 23, 2016
PrototypeReplayBarHTML @ 1e0a0aa Submodule Prototype bar added May 23, 2016
PwaArchive-access
PwaLogsMiner @ 183cd5c
PwaLucene PwaLucene added to project root May 24, 2016
PwaProcessor @ 2811d63 Submodules added May 23, 2016
PwaSpellchecker
ScreenShotServer @ 80a452a
TestCollection @ d911a5e Submodules PwaLogsMiner, scripts and TestCollection added May 23, 2016
functional-tests @ e1f4ef2
pywb-opensearch-cdx @ 027ac57
scripts @ d20a0ea updating submdules May 30, 2016
.gitignore
.gitmodules
README.md
Report.pdf Updated Report.pdf - now contains more info about pywb running in PWA… Jul 8, 2016
barra.html
favicon.png
index.jsp updating logo homepage prize winners Jul 2, 2018
robustifyDiagram.png

README.md

pwa-technologies

The main goal of the Portuguese Web Archive (PWA) is to preserve and provide access to web content that is no longer available online. While developing the PWA information retrieval (IR) system, we faced limitations in search speed, quality of results, scalability and usability. To cope with these, we modified the archive-access project (http://archive-access.sourceforge.net/) to support our web archive IR requirements. The code of Nutchwax, Nutch and Wayback was adapted to meet these requirements. Several optimizations were added, such as simplifying the way document versions are searched, and several bottlenecks were resolved. The PWA search engine is a public service at http://archive.pt and a research platform for web archiving. Like its predecessor Nutch, it runs on Hadoop clusters for distributed computing, following the map-reduce paradigm. Its major features include fast full-text search, URL search, phrase search, faceted search (date, format, site), and sorting by relevance and date. The PWA search engine is highly scalable, and its architecture is flexible enough to allow different configurations to be deployed for different needs. Currently, it serves a full-text searchable archive collection of billions of documents dating back to 1996.
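
To make the map-reduce indexing idea and the facet and date-sort features more concrete, here is a minimal in-memory sketch in Python. It is not the PWA code and does not use Hadoop, Nutch or Lucene. The sample documents, the `YYYYMMDD` date strings and the function names (`map_phase`, `reduce_phase`, `build_index`, `search`) are all assumptions made for illustration.

```python
from itertools import groupby

# Hypothetical toy collection: (url, capture_date, mime_type, text).
DOCS = [
    ("http://example.pt/", "19961013", "text/html", "portuguese web archive"),
    ("http://example.pt/", "20050101", "text/html", "web archive search"),
    ("http://example.pt/a.pdf", "20100520", "application/pdf", "archive report"),
]


def map_phase(doc_id, doc):
    """Map step: emit one (term, doc_id) pair per distinct term."""
    _url, _date, _mime, text = doc
    for term in set(text.split()):
        yield term, doc_id


def reduce_phase(term, doc_ids):
    """Reduce step: merge all doc ids for a term into a sorted posting list."""
    return term, sorted(doc_ids)


def build_index(docs):
    """Run map, shuffle (sort by key) and reduce in a single process."""
    pairs = [kv for i, d in enumerate(docs) for kv in map_phase(i, d)]
    pairs.sort()  # stands in for the shuffle: groups equal terms together
    index = {}
    for term, group in groupby(pairs, key=lambda kv: kv[0]):
        key, postings = reduce_phase(term, [doc_id for _, doc_id in group])
        index[key] = postings
    return index


def search(index, docs, term, mime=None, newest_first=True):
    """Look up a term, optionally filter by a format facet, sort by date."""
    hits = [docs[i] for i in index.get(term, [])]
    if mime is not None:
        hits = [d for d in hits if d[2] == mime]
    return sorted(hits, key=lambda d: d[1], reverse=newest_first)


INDEX = build_index(DOCS)
```

Calling `search(INDEX, DOCS, "archive", mime="text/html")` returns only the HTML captures that contain the term, newest first. In a real cluster the map and reduce steps would run as distributed jobs over many machines rather than in a single process.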