# URLs

Given a study carrel, output information regarding URLs found in it.

Just as the Toolbox extracts features such as ngrams, parts-of-speech, and named entities, it also extracts URLs from carrel content. Knowing which URLs exist in a carrel helps characterize the type of content in the carrel, and it helps address a perennial task, namely, "Find more like this one." This notebook demonstrates how to output, count, and tabulate URLs.

Our standard carrel (homer) includes zero URLs, so a different carrel is used here. More specifically, this notebook uses a carrel named 'ital-2006-2010', which includes about 160 scholarly journal articles on the topic of computers and libraries. The articles include many URLs.

Also, please, "Don't let the perfect be the enemy of the good." Feature extraction is not perfect. Because of the way word processors format text, many of the extracted URLs are not well-formed, and therefore many (but not most) of them are invalid.
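
To make "not well-formed" concrete, here is a minimal sketch of a heuristic check written with Python's standard library only. It is not part of the Toolbox; the function name `looks_suspect` and the three rules it applies are assumptions chosen for illustration, and the sample values are hand-copied strings rather than the result of a Toolbox call.

```python
from urllib.parse import urlsplit

SOFT_HYPHEN = "\u00ad"  # invisible hyphenation mark left behind by some word processors


def looks_suspect(url):
    """Return True if the URL shows an obvious sign of damage (heuristic only)."""
    host = urlsplit(url).hostname or ""
    if not host:                # nothing after the scheme, e.g. "http://"
        return True
    if "." not in host:         # host has a single label, e.g. "http://xdoclet"
        return True
    if SOFT_HYPHEN in url:      # hidden hyphenation mark somewhere in the URL
        return True
    return False


# hand-copied sample values (hypothetical input, not a Toolbox result)
samples = [
    "http://www.zoomify.com/",
    "http://xdoclet",
    "http://www.oclc.org/product\u00ad",
    "http://",
]
suspects = [url for url in samples if looks_suspect(url)]
print(suspects)
```

A check like this only catches the most obvious damage. A truncated host such as `www.zope` still contains a dot and would pass, so treat the result as a rough filter, not as validation.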

In [1]:
# configure
CARREL = 'ital-2006-2010'
FILTER = 'oclc'

In [2]:
# require
import rdr

In [3]:
# check to see if the carrel exists; if not, then download the carrel
try    : rdr.checkForCarrel( CARREL )
except : rdr.download( CARREL )


  library. Are you sure you entered its name correctly? Try 'rdr
  catalog' to make sure.

  Alternatively, maybe you have moved the library and your settings
  are not up-to-date. If so, then use 'rdr get -s local' and/or
  'rdr set -s local' to rectify the issue.


  INFO: Downloading remote study carrel...
  INFO: Saving study carrel...
  INFO: Unziping study carrel...
  INFO: Done.


In [4]:
# output a list of all the URLs and their frequencies
results = rdr.urls( CARREL, count=True )
print( results )

http://zthes.z3950.org/	1
http://zakros.ucsd.edu/~trohrer/	1
http://xml.cover	1
http://xml.com/pub/a/2002/11/06/	1
http://xdoclet	1
http://www5.oclc	1
http://www4.auto	1
http://www2002.org/CDROM/refereed/338/index.html	1
http://www2002.org/CDROM/	1
http://www2002	1
http://www2.ulcc.ac.uk/unesco/concept/MT_MT_2.55	1
http://www2.ulcc.ac.uk/unesco/concept/MT_2.75	1
http://www2.ulcc.ac.uk/unesco/concept/MT_2.65	1
http://www2.ulcc.ac.uk/unesco/concept/MT_2.60	1
http://www10.org/cdrom/papers/489	1
http://www10.org/cdrom/papers/466/	1
http://www1.cs.columbia.edu/~andreas/	1
http://www.zope	1
http://www.zoomify.com/	1
http://www.yorku.ca/caml/review/33-3/	1
http://www.yorku.ca/caml/	1
http://www.worldcat.org/devnet/wiki/Services	1
http://www.win.tue.nl/SW­EL/2006/	1
http://www.wikipedia	1
http://www.well.com/~doctorow/metacrap.htm	1
http://www.website	1
http://www.webjunction.org/	1
http://www.webjunction	1
http://www.webchoir.com/	1
http://www.washington.edu/	1
http://www.wascsenior	1
http://

In [5]:
# count, tabulate, and filter the URLs
results = rdr.urls( CARREL, count=True, like=FILTER )
print( results )

http://www5.oclc	1
http://www.oclc.org/worldcatlocal/about/213941usf_some_findings_about_worldcat_local.pdf	1
http://www.oclc.org/us/en/news/	1
http://www.oclc.org/us/en/	1
http://www.oclc.org/support/	1
http://www.oclc.org/research/projects/termservices/default	1
http://www.oclc.org/research/projects/frbr/algo-	1
http://www.oclc.org/research/projects/	1
http://www.oclc.org/research/presentations/	1
http://www.oclc.org/research/	1
http://www.oclc.org/reports/pdfs/studentperceptions.pdf	1
http://www.oclc.org/reports/onlinecatalogs/fullreport.pdf	1
http://www.oclc.org/reports/onlinecatalogs/default.htm	1
http://www.oclc.org/reports/2005perceptions.htm	1
http://www.oclc.org/reports/	1
http://www.oclc.org/programs/publications/reports/2007-04	1
http://www.oclc.org/programs/publications/reports/2007-03	1
http://www.oclc.org/product­	1
http://www.oclc.org/productworks/wcwiki.htm	1
http://www.oclc.org/news/publications/news	1
http://www.oclc	1
http://orweblog.oclc.org/archives/000927	1
http:/

In [6]:
# output a list of all the domains of the URLs and their frequencies
results = rdr.urls( CARREL, count=True, select='domain' )
print( results )

www	40
www.oclc.org	27
www.w3.org	23
www.dlib.org	19
exch.mail.umd.edu	17
hdl.handle.net	13
myee.bol.ucla.edu	13
www.loc.gov	13
en.wikipedia.org	12
www.ala.org	9
digital.library.unlv.edu	8
dublincore.org	8
www.ii.fsu.edu	8
www.library.unlv.edu	8
www.ifla.org	7
crl.acrl.org	6
www.adobe.com	6
www.facebook.com	6
www.libraryjournal.com	6
www.tandfonline.com	6
books.google.com	5
creativecommons.org	5
del.icio.us	5
en.wikipedia	5
hdl.handle	5
wordnet.princeton.edu	5
www.cs.umass.edu	5
www.techsource.ala.org	5
www.ugr.es	5
	4
files.eprints.org	4
ital-ica.blogspot.com	4
journal.code4lib.org	4
journal.lib.uoguelph.ca	4
libraries.universityofcalifornia.edu	4
prezi.com	4
search.cpan.org	4
search.ebscohost.com	4
wiki.dlib.indiana.edu	4
www.arl.org	4
www.comp.lancs.ac.uk	4
www.eionet.eu.int	4
www.eprints.org	4
www.lib.ncsu.edu	4
www.slideshare.net	4
www.techcrunch.com	4
www.techsmith.com	4
www2.ulcc.ac.uk	4
appdev.hsclib.sunysb.edu	3
arxiv.org	3
aurarialibrary.worldcat	3
blogs.library.duke.edu	3
ca

In [7]:
# count, tabulate, and filter the domains
results = rdr.urls( CARREL, count=True, like=FILTER, select='domain' )
print( results )

www.oclc.org	27
orweblog.oclc	2
orweblog.oclc.org	2
ddcresearch.oclc.org	1
www.oclc	1
www5.oclc	1
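
The tabulated output above can be post-processed with ordinary Python. The following is a sketch only: it assumes the result prints as tab-delimited text, one `value<TAB>count` pair per line, as the output above suggests. The Toolbox's actual return type is not confirmed here, and both the `parse_counts` helper and the `sample` string are hypothetical, with the sample values copied by hand from the listing above.

```python
from collections import Counter

# hand-copied from the filtered domain listing; a stand-in for real Toolbox output
sample = (
    "www.oclc.org\t27\n"
    "orweblog.oclc\t2\n"
    "orweblog.oclc.org\t2\n"
    "ddcresearch.oclc.org\t1\n"
    "www.oclc\t1\n"
    "www5.oclc\t1\n"
)


def parse_counts(text):
    """Turn tab-delimited 'value<TAB>count' lines into a Counter."""
    counts = Counter()
    for line in text.splitlines():
        # split on the last tab so that the count is always the final field
        value, sep, count = line.rpartition("\t")
        if not sep or not count.strip().isdigit():
            continue  # skip blank or truncated lines
        counts[value] += int(count)
    return counts


counts = parse_counts(sample)
total = sum(counts.values())
print(total)
print(counts.most_common(1))
```

Once the counts are in a `Counter`, it becomes straightforward to total them, to list the most common values, or to merge entries that are evidently fragments of the same domain, such as `orweblog.oclc` and `orweblog.oclc.org`.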
