cdi

Author: Jian Wu
Affiliation: College of IST at Penn State University
Reference (bibtex): @inproceedings{wu2012cdi, author = {Wu, Jian and Teregowda, Pradeep and Khabsa, Madian and Carman, Stephen and Jordan, Douglas and San Pedro Wandelmer, Jose and Lu, Xin and Mitra, Prasenjit and Giles, C. Lee}, title = {Web crawler middleware for search engine digital libraries: a case study for citeseerX}, booktitle = {Proceedings of the twelfth international workshop on Web information and data management}, series = {WIDM '12}, year = {2012}, isbn = {978-1-4503-1720-7}, location = {Maui, Hawaii, USA}, pages = {57--64}, numpages = {8}, url = {http://doi.acm.org/10.1145/2389936.2389949}, doi = {10.1145/2389936.2389949}, acmid = {2389949}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {information retrieval, ingestion, middleware, search engine, web crawling}, }

This a software package designed for the CiteSeerX digital library search engine. The main function, cdi.py, is used to import a collection of documents from a crawling job to the CiteSeerX crawl database and repository. There are other crawl related modules such as the csxcrawl_schedule.py, which is used to select seed URLs injected to the crawler.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

cdi

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

cdi

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages