extract-product
Author: Greg Durrett (http://www.cs.utexas.edu/~gdurrett/)

This system identifies products being bought and sold in forum posts.
It has several modes of operation, but most commonly it identifies the single
most prominent product in a given post, represented by choosing a particular
word or noun phrase from that post.


Prerequisites:

This project is written primarily in Scala. The easiest way to build it is with SBT:
https://github.com/harrah/xsbt/wiki/Getting-Started-Setup

Run
sbt assembly

which will compile everything and build a runnable jar.

You can also import the project into Eclipse and use the Scala IDE plug-in for Eclipse (http://scala-ide.org),
or use IntelliJ, which supports importing projects based on build.sbt.



Preprocessing data

The product extraction system operates on data that has been tokenized, part-of-speech tagged, and dependency
parsed. To turn raw data into this form, use the code in the preprocessing subdirectory. First, build the
project there with sbt assembly. Then, run.sh shows how to tokenize and parse sample files using the
Stanford CoreNLP toolkit. This script first downloads the model files from the Stanford NLP website,
then runs the tokenizer, then runs the parser.
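
For reference, this tokenize/tag/parse step corresponds to a single Stanford CoreNLP invocation along
the following lines. This is only a sketch: the exact jar names, annotator list, and output format used
by run.sh may differ, and sample-input.txt stands in for one of your raw files.

# Hypothetical invocation; jar names depend on the CoreNLP version downloaded.
java -Xmx4g -cp "stanford-corenlp-3.5.2.jar:stanford-corenlp-3.5.2-models.jar" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,depparse -file sample-input.txt -outputFormat conll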



Running the product extractor

Once you have built the project, run

java -Xmx2g -cp target/scala-2.11/extract-products-assembly-1.jar edu.berkeley.forum.main.SystemRunner

This runs the system on the data in ../sample-data by default and writes the results to output/. You can
verify that these match sample-output, which contains the expected results (see the example diff command
after the file descriptions below). Three files are produced:

products.csv: a list of product extractions, one line per file, giving the product
extracted for that file. The columns are:
col1: file name
col2: whole noun phrase of the extracted product
col3: head word
col4: canonicalized head (lowercased and stemmed)
col5: "full semantic product": the canonicalized head (e.g. "installs"), potentially preceded by a premodifier (e.g. "malware")
col6: fine product category (if available)
col7: coarse product category (if available)

heads.txt: Counts of head words (column 3 of the csv)

refined-heads.txt: Counts of head words with additional refinement surrounding common categories.
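
To verify the output against sample-output on a Unix-like system, you can compare the two directories
directly; no output from diff means they match (adjust paths to wherever sample-output lives in your checkout):

diff -r output sample-output

As a further sanity check, you can tabulate column 3 of products.csv yourself and compare against
heads.txt. This is a sketch that assumes the fields contain no embedded commas:

# Count occurrences of each head word (column 3), most frequent first.
cut -d, -f3 output/products.csv | sort | uniq -c | sort -rn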

The system is controlled with the following command-line arguments:

-inputDir: A flat directory containing forum posts, one per file
-outputDir: The directory to write output files to
-modelPath: path to the trained model; defaults to models/bothmodel-doclevel.ser.gz. Change this if you wish to use a model you have trained yourself.

(More arguments are available; in fact, every argument in the classes with a main method (SystemRunner,
Trainer) can be set from the command line.)
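
For example, a hypothetical invocation on your own data, assuming arguments are passed as -name value
pairs (my-posts and my-output are placeholder directory names):

# Run the extractor on your own directory of posts.
java -Xmx2g -cp target/scala-2.11/extract-products-assembly-1.jar edu.berkeley.forum.main.SystemRunner -inputDir my-posts -outputDir my-output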