
WeSearch_DataCollection

JonathonRead edited this page Nov 2, 2011 · 51 revisions

Background

We are seeking to collect user-generated text to support the evaluation of parser adaptation across domains and genres. The proposal specifies five types of sources: Open Access Research Literature, Wikipedia, Technology Blogs, Product Reviews and User Forums.

Collected Data

NLP blogs were obtained in mid-April from the following sites:

Linux blogs were also downloaded in mid-April, from:

Linux forums were extracted from the Unix & Linux subset of the April 2011 Stack Exchange Creative Commons Dump. In this set, a text corresponds to a post (be it a question or an answer). If necessary, threads can be reconstructed by using the primary/new id xref file.

Linux reviews are from http://www.softpedia.com/reviews/linux/. They possibly require some manual cleaning: each review typically ends with a sentence like 'Check out these screenshots'.

The Linux wiki set was created following the method used for WikiWoods.

All data and scripts are in /ltg/jread/workspace/wesearch/data-collection. The content has been extracted by finding the most specific element that contains all the relevant text (for example, blog posts typically contain some element with an attribute indicating that it is the content element). All mark-up related to rendering has been retained for now. Sentences were obtained using the tokenizer (as used in creating WikiWoods).
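As an illustration of the extraction idea, here is a minimal sketch using only the Python standard library. The page snippet, the `entry-content` class name and the XPath are hypothetical, and the sketch assumes well-formed XHTML; real blog pages would usually need a lenient HTML parser, and this is not the script actually used for the collection.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: a sidebar plus one element marked as the content element.
PAGE = """<html><body>
  <div id="sidebar"><p>Archive links</p></div>
  <div class="entry-content"><p>Parsing <b>web text</b> is hard.</p><p>Second paragraph.</p></div>
</body></html>"""

# Hypothetical XPath naming the most specific element that holds all relevant text.
CONTENT_XPATHS = [".//div[@class='entry-content']"]


def extract_content(page, xpaths):
    """Return the serialised elements matched by any XPath, with mark-up retained."""
    root = ET.fromstring(page)
    matches = []
    for xpath in xpaths:
        for element in root.findall(xpath):
            # tostring() also emits the whitespace that follows the element; strip it.
            matches.append(ET.tostring(element, encoding="unicode").strip())
    return matches


content = extract_content(PAGE, CONTENT_XPATHS)
```

Keeping the XPaths in a per-site list means a new blog can be supported by adding one expression rather than changing the code.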

Section         Items   Coverage   Length   Ambiguity   Time   Tokens    Types
NLP, wiki       11558   86.4%      18.0     10859       8.2    238059    19396
NLP, blog       46106   81.9%      15.5     8158        6.1    838592    41771
Linux, wiki     40738   85.0%      18.5     12407       9.6    843082    45783
Linux, blog     92280   83.7%      11.1     5151        3.9    1000683   48511
Linux, review   14761   84.6%      18.1     10610       7.5    304672    13158
Linux, forum    85743   74.8%      11.0     4885        3.1    1115412   56673

Corpus statistics for each section. Coverage shows what percentage of items received an analysis (using the unadapted parser 'out of the box'), and ambiguity and time give an indication of average parsing complexity (for the 'vanilla' parser configuration). Tokens shows the token count of each section and types is the number of unique, non-punctuation tokens seen per section.
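The token and type counts can be reproduced along the following lines. This is a minimal sketch that assumes items are already tokenised and whitespace-separated, and that a punctuation token is one made up entirely of ASCII punctuation characters; the counts in the table may have been computed with a different tokenisation.

```python
import string


def token_and_type_counts(items):
    """Return (token count, number of unique non-punctuation tokens)."""
    tokens = [token for item in items for token in item.split()]
    types = {
        token
        for token in tokens
        # Assumption: a token is punctuation if every character is ASCII punctuation.
        if not all(char in string.punctuation for char in token)
    }
    return len(tokens), len(types)


counts = token_and_type_counts(["The kernel boots .", "The kernel panics !"])
```

Types are counted case-sensitively here, so "The" and "the" would be two distinct types.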

Data Preparation

  • Given an HTML document, extract elements specified by a set of XPaths.

  • Segment into sentences using a tokenizer adapted to handle HTML tags: P, LI, PRE and DIV force line breaks.

  • Simplify by:

    • removing non-human
    • removing superfluous whitespace
    • removing comments
    • removing some attributes (e.g. HREF)
    • ersatzing CODE and IMG
  • Filter CODE and IMG if they occur in isolation. Filter OL, UL, TABLE.

  • Number items with CSDDDDDDIIIII0

  • Create line-oriented itsdb import files with only one source and up to 1,000 items. Do not split documents across profiles.
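Some of the steps above can be sketched as follows. This is an illustration under stated assumptions, not the scripts in the data-collection directory: the regular expressions, the `<code/>` and `<img/>` placeholders, the reading of IIIII as a zero-padded per-document item counter, and the function names are all hypothetical.

```python
import re


def simplify(markup):
    """Apply a few of the simplification steps to one HTML fragment."""
    markup = re.sub(r"<!--.*?-->", "", markup, flags=re.DOTALL)  # remove comments
    markup = re.sub(r'\s+href="[^"]*"', "", markup, flags=re.IGNORECASE)  # remove HREF
    # Ersatz CODE and IMG: replace them with fixed placeholders (assumed form).
    markup = re.sub(r"<code\b[^>]*>.*?</code>", "<code/>", markup,
                    flags=re.DOTALL | re.IGNORECASE)
    markup = re.sub(r"<img\b[^>]*>", "<img/>", markup, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", markup).strip()  # collapse superfluous whitespace


def item_id(document_id, item_number):
    """Compose a CSDDDDDDIIIII0 identifier from a CSDDDDDD document id."""
    # Assumption: IIIII is a zero-padded item counter, followed by a literal 0.
    return f"{document_id}{item_number:05d}0"


def batch_into_profiles(documents, max_items=1000):
    """Group (document_id, items) pairs from one source into profiles.

    A profile holds at most max_items items and a document is never split;
    a document larger than max_items gets a profile of its own (assumption).
    """
    profiles, current, size = [], [], 0
    for document_id, items in documents:
        if current and size + len(items) > max_items:
            profiles.append(current)
            current, size = [], 0
        current.append((document_id, items))
        size += len(items)
    if current:
        profiles.append(current)
    return profiles


example_id = item_id("12000042", 7)
example_profiles = batch_into_profiles(
    [("12000001", ["x"] * 600), ("12000002", ["x"] * 500), ("12000003", ["x"] * 400)]
)
```

Because each profile holds only one source, `batch_into_profiles` would be called once per source rather than on the pooled documents.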

Output:

  • Profile import files as lists of IDs and items.
  • Pointer file for each profile, with lines corresponding to items, with item start index in source document, and lists of pairs (start, length) indicating deletions.
  • Cross-reference file with document ids CSDDDDDD and source URL.
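The pointer file makes it possible to map a position in a simplified item back to the source document. The sketch below shows one way to do that, assuming the (start, length) pairs are absolute character offsets into the source document and do not overlap; the actual file layout and offset convention may differ, and the function name is hypothetical.

```python
def to_source_offset(item_offset, item_start, deletions):
    """Map a character offset within a simplified item to the source document.

    item_start: index in the source document where the item begins.
    deletions: (start, length) pairs of spans removed from the source
               (assumed to be absolute source offsets, non-overlapping).
    """
    source_offset = item_start + item_offset
    for start, length in sorted(deletions):
        if start <= source_offset:
            # The deleted span lies at or before this position: skip over it.
            source_offset += length
        else:
            break
    return source_offset


# Hypothetical example: the HREF attribute was deleted from the item.
SOURCE = 'Intro. See <a href="x">this</a> page.'
ITEM = 'See <a>this</a> page.'
offset = to_source_offset(ITEM.index("this"), 7, [(13, 9)])
```

Here `SOURCE[offset:offset + 4]` recovers the word "this" at its original position.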

Other Potential Sources

Open Access Research Literature

Wikis

  • The WeScience corpus composed of Wikipedia articles in the domain of Natural Language Processing.

  • ThinkWiki, a collection of reference materials and HOWTOs for ThinkPad users, with a particular focus on Linux. All information found on this wiki is published under the GNU Free Documentation License.

  • The One Laptop Per Child Wiki describes work and ideas related to the One Laptop Per Child project. The content is available under Creative Commons Attribution 3.0.

  • Dr Wiki is a nonprofit educational web site made by physicians for physicians, medical students, and healthcare providers. Content is available under Creative Commons Attribution-Non Commercial-Share Alike 3.0 Unported.

Product Reviews

  • Polarity 2.0, a collection of 2,000 movie reviews originally posted on Usenet. Reviews are classified according to sentiment, and tokenised by sentence. Capitalisation information has been removed. About 1,400,000 tokens in 64,000 sentences.

  • Bing Liu's Amazon Product Review Data contains about 5.8 million customer product reviews from Amazon (about 980 million words). Licensing information is not mentioned, but Amazon's website says "Amazon grants you a limited license to access and make personal use of this site and not to download (other than page caching) or modify it, or any portion of it, except with express written consent of Amazon. This license does not include any resale or commercial use of this site or its contents; any collection and use of any product listings, descriptions, or prices; any derivative use of this site or its contents; any downloading or copying of account information for the benefit of another merchant; or any use of data mining, robots, or similar data gathering and extraction tools. This site or any portion of this site may not be reproduced, duplicated, copied, sold, resold, visited, or otherwise exploited for any commercial purpose without express written consent of Amazon." The data is available in ~jread/data/lui.

  • The Multi-Domain Sentiment Dataset also contains Amazon product reviews in several domains. The processed reviews are distributed as feature vectors. The unprocessed data is available in ~jread/data/mdsd.

  • Epinions collects consumer reviews in the domains of: cars, books, movies, music, computers, electronics, gifts, home/garden, kids/family, office supply, sports and travel. Their usage policy states that "Using any automated means to access the site or collect any information from the site" is inappropriate, but perhaps it's worth sending an email, as Paolo Massa obtained a dump directly from Epinions (but unfortunately did not retain the textual data).

  • http://www.category5.tv/product_reviews/ (creative commons).

  • http://www.geek.com/ "© 2010 Geeknet, Inc. "... but copyright details make no mention of redistribution.

  • http://www.reviewlinux.com/. Creative Commons licensed reviews of Linux distributions.

  • http://news.softpedia.com/cat/Reviews/Linux-software/ 'enthusiast'-generated content. Need to request copyright exemption.

Blogs

Mailing Lists

  • DELPH-IN

  • MOSES

  • The Usenet Corpus, consisting of 28 million documents (each between 500 and 500,000 words in length) from around 48,000 different newsgroups. Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Canada License.

User Forums

  • The Stack Overflow Data Dump is from a question/answer website with content relating to cooking, game development, gaming, mathematics, photography, server faults, stack apps, programming, system administration, Ubuntu, web applications and web administration. Shared under Creative Commons Attribution Share Alike 2.5 Generic.

  • LinuxQuestions, no license declaration.

  • UbuntuForums, license owned by Canonical, but no restrictions specified.

  • Nabble is a free forum hosting service with no license declaration. Many different topics and several languages.

Related Work
