JSON support, addjars-maven-plugin, Java Twokenizer

JSON support - uses jackson-core, use the "-input_format json" flag on runTagger.sh
addjars-maven-plugin - makes it easier to add libraries without editing pom files... unfortunately it requires Maven 3.0.3+
Java Twokenizer - I ported the Scala Twokenizer to Java because it's faster and so I could get rid of the Scala library dependencies.
brendano · Jun 22, 2012 · 67fded0
1 parent ee854a0
commit 67fded0

Showing 28 changed files with 508 additions and 412,786 deletions.
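
Note on the JSON input mode named in the commit message: judging from the getLine(JsonParser) method added to BasicFileIO.java in this commit, "-input_format json" makes the tagger read the value of each "text" field from the input rather than treating each line as a raw tweet. The sketch below illustrates that idea only. It is not the commit's code and does not use jackson-core: it is a dependency-free stand-in that finds "text" string values with a regular expression and decodes their JSON escapes. The class name TweetTextSketch and the method extractTexts are hypothetical, and the sketch assumes well-formed JSON input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of ark-tweet-nlp.
public class TweetTextSketch {

    // Matches "text": "<value>", where <value> may contain backslash escapes.
    private static final Pattern TEXT_FIELD =
        Pattern.compile("\"text\"\\s*:\\s*\"((?:\\\\.|[^\"\\\\])*)\"");

    /** Returns every "text" string value found in the input, in order of appearance. */
    public static List<String> extractTexts(String json) {
        List<String> texts = new ArrayList<String>();
        Matcher m = TEXT_FIELD.matcher(json);
        while (m.find()) {
            texts.add(unescape(m.group(1)));
        }
        return texts;
    }

    /** Decodes JSON string escapes; assumes the escapes are well-formed. */
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            char e = s.charAt(++i); // the regex guarantees a character follows each backslash
            switch (e) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case 'r': out.append('\r'); break;
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 'u': // four hex digits follow, e.g. \u00e9
                    out.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                    i += 4;
                    break;
                default: out.append(e); // covers \", \\ and \/
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "{\"id\": 1, \"text\": \"Hello \\\"world\\\" :)\", \"user\": {\"name\": \"x\"}}";
        System.out.println(extractTexts(line)); // prints [Hello "world" :)]
    }
}
```

Like the getLine(JsonParser) method in the diff, this returns any field named "text" regardless of nesting depth. A streaming parser such as the one the commit uses is the more robust choice for real input; the regular expression here only keeps the example self-contained.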
diff --git a/README b/README
@@ -1,61 +1,63 @@
-CMU ARK Twitter Part-of-Speech Tagger v0.2
-http://www.ark.cs.cmu.edu/TweetNLP/
-
-Basic usage
------------
-
-Requires Java 6.  To run the tagger:
-
-    ./runTagger.sh -input example_tweets.txt -output tagged_tweets.txt
-
-The output should match tagged_tweets_expected.txt.
-
-Advanced usage
---------------
-
-We include a pre-compiled .jar of the tagger so you hopefully don't need to
-compile it.  But if you need to recompile, do:
-  mvn install
-
-To train and evalute the tagger, see:
-  ark-tweet-nlp/src/main/java/edu/cmu/cs/lti/ark/ssl/pos/SemiSupervisedPOSTagger.java
-  scripts/train.sh and scripts/test.sh
-
-Contents
---------
- * runTagger.sh       is the script you probably want
- * lib/               dependencies
- * ark-tweet-nlp/src  the project code itself (mostly java, and one bit of scala)
-
-Information
------------
-This tagger is described in the following paper.  Please cite it if you write a
-research paper using this software.
-
-  Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
-  Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills,
-  Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith
-  In Proceedings of the Annual Meeting of the Association for Computational
-  Linguistics, companion volume, Portland, OR, June 2011.
-  http://www.ark.cs.cmu.edu/TweetNLP/gimpel+etal.acl11.pdf
-
-The software is licensed under Apache 2.0 (see LICENSE file).
-
-Version 0.2 of the tagger differs from version 0.1 in the following ways:
-
-* The tokenizer has been improved and integrated with the tagger in a single Java program.
-
-* The new tokenizer was run on the 1,827 tweets used for the annotation effort and the
-annotations were adapted for tweets with differing tokenizations. The revised annotations
-are contained in a companion v0.2 release of the data (twpos-data-v0.2).
-
-* The tagging model is trained on ALL of the available annotated data in twpos-data-v0.2.
-The model in v0.1 was only trained on the training set.
-
-* The tokenizer/tagger is integrated with Twitter's text commons annotations API.
-
-Contact
--------
-Please contact Brendan O'Connor (brenocon@cmu.edu) and Kevin Gimpel (kgimpel@cs.cmu.edu)
-if you encounter any problems.
-
+CMU ARK Twitter Part-of-Speech Tagger v0.2.1
+http://www.ark.cs.cmu.edu/TweetNLP/
+
+Basic usage
+-----------
+
+Requires Java 6.  To run the tagger:
+
+    ./runTagger.sh -input example_tweets.txt -output tagged_tweets.txt
+	./runTagger.sh -input barackobama.txt -input_format json -output tagged_barackobama.txt
+
+The outputs should match tagged_tweets_expected.txt and barackobamaexpected.txt respectively.
+
+Advanced usage
+--------------
+
+We include a pre-compiled .jar of the tagger so you hopefully don't need to
+compile it.  But if you need to recompile, do:
+  mvn install
+NOTE: requires Maven 3.0.3+
+
+To train and evalute the tagger, see:
+  ark-tweet-nlp/src/main/java/edu/cmu/cs/lti/ark/ssl/pos/SemiSupervisedPOSTagger.java
+  scripts/train.sh and scripts/test.sh
+
+Contents
+--------
+ * runTagger.sh       is the script you probably want
+ * lib/               dependencies
+ * ark-tweet-nlp/src  the project code itself (all java)
+
+Information
+-----------
+This tagger is described in the following paper.  Please cite it if you write a
+research paper using this software.
+
+  Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
+  Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills,
+  Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith
+  In Proceedings of the Annual Meeting of the Association for Computational
+  Linguistics, companion volume, Portland, OR, June 2011.
+  http://www.ark.cs.cmu.edu/TweetNLP/gimpel+etal.acl11.pdf
+
+The software is licensed under Apache 2.0 (see LICENSE file).
+
+Version 0.2 of the tagger differs from version 0.1 in the following ways:
+
+* The tokenizer has been improved and integrated with the tagger in a single Java program.
+
+* The new tokenizer was run on the 1,827 tweets used for the annotation effort and the
+annotations were adapted for tweets with differing tokenizations. The revised annotations
+are contained in a companion v0.2 release of the data (twpos-data-v0.2).
+
+* The tagging model is trained on ALL of the available annotated data in twpos-data-v0.2.
+The model in v0.1 was only trained on the training set.
+
+* The tokenizer/tagger is integrated with Twitter's text commons annotations API.
+
+Contact
+-------
+Please contact Brendan O'Connor (brenocon@cmu.edu) and Kevin Gimpel (kgimpel@cs.cmu.edu)
+if you encounter any problems.
+
diff --git a/ark-tweet-nlp/pom.xml b/ark-tweet-nlp/pom.xml
@@ -12,40 +12,6 @@
     </properties>
     <build>
         <plugins>
-            <plugin>
-                <groupId>org.scala-tools</groupId>
-                <artifactId>maven-scala-plugin</artifactId>
-                <version>2.15.2</version>
-                <executions>
-                    <execution>
-                        <id>compile</id>
-                        <goals>
-                            <goal>compile</goal>
-                        </goals>
-                        <phase>compile</phase>
-                    </execution>
-                    <execution>
-                        <id>test-compile</id>
-                        <goals>
-                            <goal>testCompile</goal>
-                        </goals>
-                        <phase>test-compile</phase>
-                    </execution>
-                    <execution>
-                        <phase>process-resources</phase>
-                        <goals>
-                            <goal>compile</goal>
-                        </goals>
-                    </execution>
-                </executions>
-                <configuration>
-                    <scalaVersion>2.9.0</scalaVersion>
-                    <jvmArgs>
-                        <jvmArg>-Xms64m</jvmArg>
-                        <jvmArg>-Xmx1024m</jvmArg>
-                    </jvmArgs>
-                </configuration>
-            </plugin>
             <plugin>
                 <groupId>org.apache.maven.plugins</groupId>
                 <artifactId>maven-compiler-plugin</artifactId>
@@ -67,7 +33,7 @@
                             <addClasspath>true</addClasspath>
                         </manifest>
                     </archive>
-                    <outputDirectory>target/bin</outputDirectory>
+                    <outputDirectory>../bin</outputDirectory>
                 </configuration>
             </plugin>
             <plugin>
@@ -80,12 +46,31 @@
                             <goal>copy-dependencies</goal>
                         </goals>
                         <configuration>
-                            <outputDirectory>target/bin</outputDirectory>
+                            <outputDirectory>../bin</outputDirectory>
                             <includeScope>runtime</includeScope>
                         </configuration>
                     </execution>
                 </executions>
             </plugin>
+			<plugin>
+				<groupId>com.googlecode.addjars-maven-plugin</groupId>
+				<artifactId>addjars-maven-plugin</artifactId>
+				<version>1.0.3</version>
+				<executions>
+					<execution>
+						<goals>
+							<goal>add-jars</goal>
+						</goals>
+						<configuration>
+							<resources>
+								<JarResource>
+									<directory>../lib</directory>
+								</JarResource>
+							</resources>						
+						</configuration>
+					</execution>
+				</executions>
+			</plugin>			
         </plugins>
     </build>
     <repositories>
@@ -94,23 +79,11 @@
             <name>jboss-maven2-release-repository</name>
             <url>https://oss.sonatype.org/content/repositories/JBoss</url>
         </repository>
-        <repository>
-            <id>scala-tools.org</id>
-            <name>Scala-tools Maven2 Repository</name>
-            <url>http://scala-tools.org/repo-releases</url>
-        </repository>
         <repository>
             <id>twitter</id>
             <url>http://maven.twttr.com/</url>
         </repository>
     </repositories>
-    <pluginRepositories>
-        <pluginRepository>
-            <id>scala-tools.org</id>
-            <name>Scala-tools Maven2 Repository</name>
-            <url>http://scala-tools.org/repo-releases</url>
-        </pluginRepository>
-    </pluginRepositories>
     <dependencies>
         <dependency>
             <groupId>commons-codec</groupId>
@@ -127,11 +100,6 @@
             <artifactId>twitter-text</artifactId>
             <version>1.4.1</version>
         </dependency>
-        <dependency>
-            <groupId>org.scala-lang</groupId>
-            <artifactId>scala-library</artifactId>
-            <version>2.9.0</version>
-        </dependency>
         <dependency>
             <groupId>org.apache.lucene</groupId>
             <artifactId>lucene-core</artifactId>
@@ -148,21 +116,6 @@
             <version>10.0.1</version>
         </dependency>
         <!-- START locally distributed libs -->
-        <!-- END locally distributed libs -->
-        <!-- START testing dependecies -->
-        <dependency>
-            <groupId>junit</groupId>
-            <artifactId>junit</artifactId>
-            <version>4.8.2</version>
-            <scope>test</scope>
-        </dependency>
-        <dependency>
-            <groupId>org.hamcrest</groupId>
-            <artifactId>hamcrest-all</artifactId>
-            <version>1.1</version>
-            <scope>test</scope>
-        </dependency>
-        <!-- END testing dependecies -->
         <dependency>
         	<groupId>net.sf.jargs</groupId>
         	<artifactId>jargs</artifactId>
@@ -180,6 +133,21 @@
         	<artifactId>jackson-core</artifactId>
         	<version>2.1.0-SNAPSHOT</version>
         	<type>pom</type>
+        </dependency>        
+        <!-- END locally distributed libs -->
+        <!-- START testing dependecies -->
+        <dependency>
+            <groupId>junit</groupId>
+            <artifactId>junit</artifactId>
+            <version>4.8.2</version>
+            <scope>test</scope>
+        </dependency>
+        <dependency>
+            <groupId>org.hamcrest</groupId>
+            <artifactId>hamcrest-all</artifactId>
+            <version>1.1</version>
+            <scope>test</scope>
         </dependency>
+        <!-- END testing dependecies -->
     </dependencies>
 </project>
diff --git a/ark-tweet-nlp/src/main/java/edu/cmu/cs/lti/ark/ssl/util/BasicFileIO.java b/ark-tweet-nlp/src/main/java/edu/cmu/cs/lti/ark/ssl/util/BasicFileIO.java
@@ -5,6 +5,9 @@
 import java.util.zip.GZIPInputStream;
 import java.util.zip.GZIPOutputStream;
 
+import com.fasterxml.jackson.core.JsonParseException;
+import com.fasterxml.jackson.core.JsonParser;
+
 public class BasicFileIO {
 
 	/*
@@ -79,6 +82,32 @@ public static String getLine(BufferedReader bReader) {
 		}
 		return null;
 	}
+
+	public static String getLine(JsonParser jParse) {
+		//returns the next "text" field or null if none left
+		try {
+			while(jParse.getText()!=null){
+			    if ("text".equals(jParse.getCurrentName())) {
+			    	jParse.nextToken(); // move to value
+			    	String tweet = jParse.getText();
+			    	jParse.nextToken();
+			    	return tweet;
+			    }
+			    jParse.nextToken();
+			}
+		} catch(JsonParseException e){
+			e.printStackTrace();
+			log.severe("Error parsing JSON.");
+			System.exit(-1);			  
+		  }
+		catch(IOException e) {
+			e.printStackTrace();
+			log.severe("Could not read line from file.");
+			System.exit(-1);
+		}
+
+		return null;	//jParse is null (EOF)	
+	}
 
 	public static void writeLine(BufferedWriter bWriter, String line) {
 		try {