WeSearch_DataCollection

Background

We are seeking to collect user-generated text to support the evaluation of parser adaptation across domains and genres. The proposal specifies five types of sources: Open Access Research Literature, Wikipedia, Technology Blogs, Product Reviews and User Forums.

Potential Data Sources

Open Access Research Literature

The [http://acl-arc.comp.nus.edu.sg/ ACL Anthology Reference Corpus], a snapshot of the ACL Anthology content up to February 2007. It comprises 10,921 articles in PDF, together with text dumps from [http://pdfbox.apache.org/ pdfbox]. The data is available in ~jread/data/acl-arc.
The [http://books.nips.cc/ Proceedings of Advances in Neural Information Processing Systems], 24 volumes of PDFs.
The [http://www.ncbi.nlm.nih.gov/pubmed PubMed] database, indexing approximately 3,000,000 free full text English biomedical articles (and 4,000 Norwegian free full text articles).

Wikis

The WeScience corpus, composed of Wikipedia articles in the domain of Natural Language Processing.
[http://www.thinkwiki.org/wiki/ThinkWiki ThinkWiki], a collection of reference materials and HOWTOs for ThinkPad users, with a particular focus on Linux. All information found on this wiki is published under the GNU Free Documentation License.
[http://wiki.laptop.org/go/The_OLPC_Wiki The One Laptop Per Child Wiki] describes work and ideas related to the One Laptop Per Child project. The content is available under Creative Commons Attribution 3.0.
[http://askdrwiki.com/mediawiki/index.php?title=Physician_Medical_Wiki Dr Wiki] is a nonprofit educational web site made by physicians for physicians, medical students, and healthcare providers. Content is available under Creative Commons Attribution-Non Commercial-Share Alike 3.0 Unported.

Product Reviews

[http://www.cs.cornell.edu/people/pabo/movie-review-data/ Polarity 2.0], a collection of 2,000 movie reviews originally posted on Usenet. Reviews are classified according to sentiment, and tokenised by sentence. Capitalisation information has been removed. About 1,400,000 tokens in 64,000 sentences.
Bing Liu's [http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html Amazon Product Review Data] contains about 5.8 million customer product reviews from Amazon (about 980 million words). No licensing information is given, but Amazon's website says: "Amazon grants you a limited license to access and make personal use of this site and not to download (other than page caching) or modify it, or any portion of it, except with express written consent of Amazon. This license does not include any resale or commercial use of this site or its contents; any collection and use of any product listings, descriptions, or prices; any derivative use of this site or its contents; any downloading or copying of account information for the benefit of another merchant; or any use of data mining, robots, or similar data gathering and extraction tools. This site or any portion of this site may not be reproduced, duplicated, copied, sold, resold, visited, or otherwise exploited for any commercial purpose without express written consent of Amazon." The data is available in ~jread/data/lui.
The [http://www.cs.jhu.edu/~mdredze/datasets/sentiment/ Multi-Domain Sentiment Dataset] also contains Amazon product reviews in several domains. The processed reviews are distributed as feature vectors. The unprocessed data is available in ~jread/data/mdsd.
[http://www.epinions.com/ Epinions] collects consumer reviews in the following domains: cars, books, movies, music, computers, electronics, gifts, home/garden, kids/family, office supply, sports and travel. Their usage policy lists "Using any automated means to access the site or collect any information from the site" as inappropriate, but it may be worth sending an email, as [http://www.trustlet.org/wiki/Extended_Epinions_dataset Paolo Massa] obtained a dump directly from Epinions (but unfortunately did not retain the textual data).
http://www.category5.tv/product_reviews/ (Creative Commons).
http://www.geek.com/ states "© 2010 Geeknet, Inc.", but the copyright details make no mention of redistribution.
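
Token and sentence counts such as those quoted for Polarity 2.0 above can be checked, or computed for the other review collections, with a short script. The following is only a minimal sketch, assuming the text has already been split into one sentence per line with tokens separated by whitespace; the function name and the sample sentences are hypothetical and not part of any of the distributions listed here.

```python
from typing import Iterable, Tuple


def count_sentences_and_tokens(lines: Iterable[str]) -> Tuple[int, int]:
    """Count sentences and tokens in pre-tokenised text.

    Assumes one sentence per line and whitespace-separated tokens.
    Blank lines are ignored.
    """
    sentences = 0
    tokens = 0
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        sentences += 1
        tokens += len(parts)
    return sentences, tokens


if __name__ == "__main__":
    # Hypothetical sample; a real run would iterate over an open file instead.
    sample = [
        "the film is great .",
        "",
        "i did not like the ending , though .",
    ]
    n_sentences, n_tokens = count_sentences_and_tokens(sample)
    print(n_sentences, "sentences,", n_tokens, "tokens")
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, so a large collection does not have to be read into memory at once.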

Blogs

The [http://www.icwsm.org/data/ Spinn3r Blog Dataset] contains 44 million blog posts, with metadata including original site and topic tags. The [http://icwsm.org/2009/data/icwsm-spinn3r.pdf usage agreement] prohibits redistribution of the content. A sample is available at ~jread/data/spinn3r-sample.xml.
Glasgow distributes the [http://ir.dcs.gla.ac.uk/test_collections/ Blogs06] collection of 3.2 million blog articles. It costs £400 and is subject to a user agreement prohibiting redistribution.
The [http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm Blog Authorship Corpus], 140 million words in posts collected from blogger.com. Annotated with blogger gender and age. No license information provided. Available at ~jread/data/koppel.
[http://slashdot.org/ Slashdot] states "All trademarks and copyrights on this page are owned by their respective owners. Comments are owned by the Poster. The Rest © 1997-2010 Geeknet, Inc.", but the copyright details make no mention of redistribution.
Linux: http://embraceubuntu.com/ http://linuxhelp.blogspot.com/ http://planet.ubuntu.com/ http://www.ubuntuhq.com/ http://polishlinux.org/ http://www.ubuntu-unleashed.com/ http://www.linuxscrew.com/ http://www.fsckin.com/ http://www.ubuntugeek.com/ http://bashcurescancer.com/ http://tweako.com/section/Linux http://www.markshuttleworth.com/ http://ubuntu.philipcasey.com/
NLP/IR: http://nlpers.blogspot.com/ http://lingpipe-blog.com/ http://apperceptual.wordpress.com/ http://arnoldit.com/wordpress/ http://blog.codalism.com/ http://researchonsearch.blogspot.com/ http://anand.typepad.com/datawocky/ http://battellemedia.com/ http://www.searchenginecaffe.com/
ML: http://anyall.org/blog/ http://blog.smola.org/ http://datamining.typepad.com http://www.dataists.com/ http://mark.reid.name/iem/ http://mlstat.wordpress.com/ http://hunch.net/ http://www.machinedlearnings.com/ http://thenoisychannel.com/

Mailing Lists

DELPH-IN
MOSES
The [http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html Usenet Corpus], consisting of 28 million documents (each between 500 and 500,000 words in length) from around 48,000 different newsgroups. Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Canada License.

User Forums

The [http://blog.stackoverflow.com/category/cc-wiki-dump/ Stack Overflow Data Dump] is from a question/answer website with content relating to cooking, game development, gaming, mathematics, photography, server faults, stack apps, programming, system administration, Ubuntu, web applications and web administration. Shared under Creative Commons Attribution Share Alike 2.5 Generic.
[http://www.linuxquestions.org LinuxQuestions], no license declaration.
[http://www.ubuntuforums.org UbuntuForums], license owned by Canonical, but no restrictions specified.
[http://www.nabble.com/ Nabble] is a free forum hosting service with no license declaration. Many different topics and several languages.

Related Work

Baldwin, T., Martinez, D., Penman, R. B., Kim, S. N., Lui, M., Wang, L. and MacKinlay, A. (2010) [http://www.aclweb.org/anthology/W/W10/W10-0508.pdf Intelligent Linux Information Access by Data Mining: the ILIAD Project]. Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media.
Nichols, E., Murakami, K., Inui, K., and Matsumoto, Y. (2009) [http://cl.naist.jp/~eric-n/papers/bscorpus-pacling2009-paper.pdf Constructing a Scientific Blog Corpus for Information Credibility Analysis]. Proceedings of PACLING 2009.
Weimer, M., Gurevych, I., and Mühlhäuser, M. (2007) [http://www.aclweb.org/anthology-new/P/P07/P07-2032.pdf Automatically Assessing the Post Quality in Online Discussions on Software]. Proceedings of the ACL 2007 Demo and Poster Sessions.
