DSpace Link Extractor

This repository contains software that extracts links to external resources from DSpace bitstream documents.

Prerequisites

A file containing links to the DSpace sitemaps to process (dspace-urls.txt in the run commands).

Algorithm

For each sitemap listed in the input file, parse the DSpace sitemap.
For each sitemap entry, download the page and extract the relevant URLs from its HTML using a regex that matches bitstream links.
For each bitstream URL, download the document, extract its links using the tikalinkextract software, and save the extracted links to a file that mirrors the URL's path structure.
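
The first two steps can be sketched with standard shell tools. This is only an illustration, not the project's Java code: the helper names sitemap_locs and bitstream_links are hypothetical, and the href pattern is an assumption about what a bitstream link looks like; the regex used by the extractor may differ. Sitemap index files and compressed sitemaps are not handled here.

```shell
# Hypothetical helpers, not taken from the repository.

# Print every <loc> value found in a sitemap read from stdin.
sitemap_locs() {
  grep -oE '<loc>[^<]+</loc>' | sed -E 's#</?loc>##g'
}

# Print the href values that contain /bitstream/ from HTML read from stdin.
# Assumed pattern; the project's own regex may be stricter or looser.
bitstream_links() {
  grep -oE 'href="[^"]*/bitstream/[^"]*"' | sed -E 's/^href="//; s/"$//'
}
```

Both helpers read from standard input, so they can be fed from a downloaded sitemap or item page, for example with curl -s piped into sitemap_locs.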

For example, the links extracted from a bitstream URL are saved to a file whose path mirrors that URL:

From URL: http://repositorio-aberto.up.pt/bitstream/10216/63886/2/90220.pdf
To file: output/repositorio-aberto.up.pt/bitstream/10216/63886/2/90220.pdf_seeds.txt
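
The mapping in this example can be reproduced with a small shell function. This is a sketch inferred from the single example above: seed_path_for_url is a hypothetical name, and how the extractor treats ports, query strings or unusual characters is not covered by the README, so none of that is handled here.

```shell
# Hypothetical helper: build the seed file path for a bitstream URL.
# $1 = output directory, $2 = bitstream URL
seed_path_for_url() {
  # Drop the scheme (everything up to and including "://"), then prefix the
  # output directory and append the _seeds.txt suffix.
  printf '%s/%s_seeds.txt\n' "$1" "${2#*://}"
}

seed_path_for_url output "http://repositorio-aberto.up.pt/bitstream/10216/63886/2/90220.pdf"
# prints output/repositorio-aberto.up.pt/bitstream/10216/63886/2/90220.pdf_seeds.txt
```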

Dependencies

The extractor uses the tikalinkextract software and a Tika server.

Start the Tika server in a separate shell:

java -mx1000m -jar tools/tika-server-1.20.jar --port=9998
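
Before starting the extractor it may help to check that the Tika server is answering. The function below is a sketch, not part of the repository: wait_for_tika is a hypothetical name, and it assumes curl is installed and that the server answers a GET on /tika with its usual greeting.

```shell
# Hypothetical helper: poll the Tika server once per second until it answers.
# $1 = port (default 9998), $2 = maximum number of attempts (default 30)
wait_for_tika() (
  port="${1:-9998}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl fail on HTTP errors; a successful GET means the server is up.
    if curl -sf "http://localhost:${port}/tika" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
)
```

Calling wait_for_tika 9998 30 returns success as soon as the server responds and fails after roughly 30 seconds otherwise.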

Build

mvn clean package

Run

Run the DSpace link extractor, appending its output to a log file (add a trailing & to run it in the background):

java -jar target/dspace-link-extractor-0.1-SNAPSHOT.jar dspace-urls.txt output >> dspace.log 2>&1

To process only the entries changed since a specific date, add the date as a third argument in yyyy-MM-dd format:

java -jar target/dspace-link-extractor-0.1-SNAPSHOT.jar dspace-urls.txt output 2019-01-01 >> dspace.log 2>&1
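
The effect of the date argument can be illustrated as follows. This is a sketch under assumptions: the README does not say how the extractor implements the filter, so the comparison against each sitemap entry's lastmod value and the inclusive boundary are guesses, and changed_since is a hypothetical name. Dates in yyyy-MM-dd form sort correctly as plain strings, which is what the comparison relies on.

```shell
# Hypothetical helper: read "URL lastmod" pairs from stdin and print the URLs
# whose lastmod date is on or after the given yyyy-MM-dd date.
# $1 = date in yyyy-MM-dd format
changed_since() {
  # Only the first 10 characters of lastmod are compared, so values that
  # carry a time part (for example 2019-01-01T08:00:00Z) also work.
  awk -v since="$1" 'substr($2, 1, 10) >= since { print $1 }'
}
```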

Finish

When the crawl has finished, you can remove all the 'handle' folders, because the seeds are in the 'bitstream' folders.

find output -maxdepth 2 -name handle -exec rm -rf {} \;

Concatenate all the seeds into a single file:

find output/ -type f -name "*_seeds.txt" -exec cat {} \; >> seeds.txt

Remove mail links and filter out duplicates:

grep -Ev "^mail" seeds.txt | sort -u > seeds_uniq.txt
