ada-url/url-various-datasets

Various URL Datasets

These are collections of URLs for benchmarking purposes.

  • files/node_files.txt: all source files from a given Node.js snapshot as URLs (43415 URLs).
  • files/linux_files.txt: all files from a Linux system as URLs (169312 URLs).
  • wikipedia/wikipedia_100k.txt: 100k URLs from a snapshot of all Wikipedia articles (March 6th, 2023).
  • others/kasztp.txt: test URLs from https://github.com/kasztp/URL_Shortener (MIT License) (48009 URLs).
  • others/userbait.txt: test URLs from https://github.com/userbait/phishing_sites_detector (unknown copyright) (11430 URLs).
  • top100/top100.txt: unique URLs extracted from a crawl of the 100 most-visited websites.
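
As a rough sketch of how one of these files might feed a benchmark, the Python snippet below loads a dataset and times a single parsing pass over it. It assumes each file holds one URL per line, which this README does not state, and it uses the standard library's `urllib.parse.urlsplit` purely as a stand-in parser. The helper names `load_urls` and `time_parser` are hypothetical.

```python
import time
from pathlib import Path
from urllib.parse import urlsplit


def load_urls(path):
    """Read a dataset file, assuming one URL per line (an assumption)."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    # Drop surrounding whitespace and skip empty lines.
    return [line.strip() for line in text.splitlines() if line.strip()]


def time_parser(urls, parse=urlsplit):
    """Return (parsed_count, failed_count, elapsed_seconds) for one pass."""
    parsed = failed = 0
    start = time.perf_counter()
    for url in urls:
        try:
            parse(url)
            parsed += 1
        except ValueError:
            # urlsplit raises ValueError on some malformed inputs.
            failed += 1
    elapsed = time.perf_counter() - start
    return parsed, failed, elapsed
```

For instance, `time_parser(load_urls("files/node_files.txt"))` would report how many entries parsed, how many were rejected, and how long the pass took. Passing a different callable as `parse` lets the same loop time another parser.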

Disclaimer: This repository is developed and released for research purposes only.

  • This project reshares some publicly available datasets. When in doubt, investigate the copyright of the files you want to use.
  • There may be errors and duplicates in these files.
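
Because the files may contain errors and duplicates, it can be useful to filter them before measuring anything. The following is a minimal sketch rather than a recommended procedure: the function `clean_urls` is hypothetical, and treating an entry as an error when `urllib.parse.urlsplit` raises or finds no scheme is an assumption made here, not a rule from this repository.

```python
from urllib.parse import urlsplit


def clean_urls(lines):
    """Split lines into (unique_ok, rejected), keeping first-seen order."""
    seen = set()
    unique_ok, rejected = [], []
    for raw in lines:
        url = raw.strip()
        if not url or url in seen:
            continue  # skip blank lines and exact duplicates
        seen.add(url)
        try:
            parts = urlsplit(url)
        except ValueError:
            rejected.append(url)
            continue
        # Assumption: an entry without a scheme counts as an error.
        (unique_ok if parts.scheme else rejected).append(url)
    return unique_ok, rejected
```

Whether to drop duplicates depends on the benchmark: keeping them preserves the original distribution of the dataset, while removing them avoids timing the same input repeatedly.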


License

MIT (LICENSE-MIT) and Apache-2.0 (LICENSE_APACHE) licenses found.
