version 0.1-0

cran · Jun 7, 2017 · 694ccf4
commit 694ccf4

Showing 29 changed files with 55,131 additions and 0 deletions.
diff --git a/ChangeLog b/ChangeLog
@@ -0,0 +1,3 @@
+Release 0.1-0:
+  * Added score().
+  * Added multi-corpus parallelism via OpenMP.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -0,0 +1,27 @@
+Package: meanr
+Type: Package
+Title: Basic Sentiment Analysis Scorer
+Version: 0.1-0
+Description: A popular technique in text analysis today is sentiment analysis, 
+    or trying to determine the overall emotional attitude of a piece of text
+    (positive or negative).  We provide a new, basic implementation of a common
+    method for computing sentiment, whereby words are scored as positive or
+    negative according to a "dictionary", and then an average of those scores
+    for the document is produced.  The package uses the 'Hu' and 'Liu' sentiment
+    dictionary for assigning sentiment.
+License: BSD 2-clause License + file LICENSE
+Depends: R (>= 3.0.0)
+LazyData: yes
+LazyLoad: yes
+NeedsCompilation: yes
+ByteCompile: yes
+Authors@R: c(person("Drew", "Schmidt", role=c("aut", "cre"), 
+    email="wrathematics@gmail.com"))
+Maintainer: Drew Schmidt <wrathematics@gmail.com>
+URL: https://github.com/wrathematics/meanr
+BugReports: https://github.com/wrathematics/meanr/issues
+RoxygenNote: 6.0.1
+Packaged: 2017-06-07 00:55:33 UTC; mschmid3
+Author: Drew Schmidt [aut, cre]
+Repository: CRAN
+Date/Publication: 2017-06-07 05:34:24 UTC
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,2 @@
+YEAR: 2016-2017
+COPYRIGHT HOLDER: Drew Schmidt
diff --git a/MD5 b/MD5
@@ -0,0 +1,28 @@
+6ce23289ee3e810358d184de4303c09b *ChangeLog
+bce2bd0237c58a32e1219bc9d3d5758f *DESCRIPTION
+97ee9bd0a8d8ea998d470cd3df01391f *LICENSE
+d4d9df789e6601e992c577535c418514 *NAMESPACE
+be480688d606c6a88fd90540113df43c *R/meanr-package.r
+4258a6fe3fcd310ccc384d289b63599f *R/meanr.nthreads.r
+d5d68b9c5177a404ac3dc07c973079aa *R/score.r
+69b50329afc39e9763dca0dd2661dc81 *README.md
+4ebe6d22d2b7a0e071e649c97d7c23e6 *cleanup
+b5530b8555dc17eeb91776d98f5a0798 *configure
+c6dd3e936cff072a8b121aee6f53477d *configure.ac
+d41d8cd98f00b204e9800998ecf8427e *configure.win
+1bb01aa7ab651f8512f4c1fba171afaa *inst/CITATION
+eef52dce481f6b54222a915c294d5889 *man/meanr-package.Rd
+b9823d3ad4553f326d1e82670fb7d3f8 *man/meanr.nthreads.Rd
+ef74e3503ccfd589f63298e2f9ef07d5 *man/score.Rd
+95e3011e37d9dde0d75f3a3819b2acd3 *src/Makevars
+4f5835e95f25efff49554c474ecee49d *src/global.c
+c1f447e37ac6efe6dcd9d2cbed62b845 *src/hashtable/neghash.h
+0a6d1caef0603775d9b7f870d7b9b740 *src/hashtable/poshash.h
+f419b8d8cbd840d3bf82eba5c3b064df *src/include/RNACI.h
+531e126d0a1fde18833a3edb18827b67 *src/include/reactor.h
+216b5b1f0a6d8c7b0515ba396ad5090d *src/include/safeomp.h
+9e0d08e456002046dc89dea5c5fd0cd9 *src/meanr_native.c
+a5697d9b417c73ef3d38c69c3851ac0c *src/meanr_nthreads.c
+348588922857ad48ebd8bb8281facaf7 *src/score.c
+5beae7b13364e09da44c060c2818ca19 *tests/degenerate_cases.r
+62b442f823e90485ae18176cfb54094c *tests/score.r
diff --git a/NAMESPACE b/NAMESPACE
@@ -0,0 +1,5 @@
+# Generated by roxygen2: do not edit by hand
+
+export(meanr.nthreads)
+export(score)
+useDynLib(meanr, R_score, R_meanr_nthreads)
diff --git a/R/meanr-package.r b/R/meanr-package.r
@@ -0,0 +1,19 @@
+#' meanr: Basic Sentiment Analysis Scorer
+#' 
+#' A popular technique in text analysis today is sentiment
+#' analysis, or trying to determine the overall emotional
+#' attitude of a piece of text (positive or negative).
+#' We provide a new, basic implementation of a common
+#' method for computing sentiment, whereby words are scored
+#' as positive or negative according to a "dictionary", and
+#' then an average of those scores for the document is produced.
+#' The package uses the Hu and Liu sentiment dictionary for
+#' assigning sentiment.
+#' 
+#' @useDynLib meanr, R_score, R_meanr_nthreads
+#' 
+#' @name meanr-package
+#' @docType package
+#' @author Drew Schmidt \email{wrathematics AT gmail.com}
+#' @keywords Package
+NULL
diff --git a/R/meanr.nthreads.r b/R/meanr.nthreads.r
@@ -0,0 +1,13 @@
+#' meanr.nthreads
+#' 
+#' Returns the number of cores + hyperthreads on the system.  The function
+#' respects the environment variable \code{OMP_NUM_THREADS}.
+#' 
+#' @return
+#' An integer; the number of threads.
+#' 
+#' @export
+meanr.nthreads <- function()
+{
+  .Call(R_meanr_nthreads)
+}
diff --git a/R/score.r b/R/score.r
@@ -0,0 +1,52 @@
+#' score
+#' 
+#' Computes the sentiment score: the sum of the scores of the positive (+1)
+#' and negative (-1) words.  The scorer is vectorized so that it returns one
+#' row per input text.
+#' 
+#' Preprocessing is largely unnecessary.  For example, the scorer ignores
+#' case and punctuation.  That said, preprocessing probably won't hurt.
+#' 
+#' @details
+#' The scorer uses OpenMP to process the input strings in parallel.
+#' 
+#' The function uses the Hu and Liu sentiment dictionary (same as everybody
+#' else) available here:
+#' https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
+#' 
+#' @param s
+#' A string or vector of strings.
+#' @param nthreads
+#' Number of threads to use. By default it will use the total number of
+#' cores + hyperthreads.
+#' 
+#' @return
+#' A dataframe, consisting of columns "positive", "negative", "score", and "wc".
+#' With the exception of "score", these are counts; that is, "positive" is the
+#' number of positive sentiment words, "negative" is the number of negative
+#' sentiment words, and "wc" is the wordcount (total number of words).
+#' 
+#' @examples
+#' \dontrun{
+#' library(meanr)
+#' s1 = "Abundance abundant accessable."
+#' s2 = "Banana apple orange."
+#' s3 = "Abnormal abolish abominable."
+#' s = c(s1, s2, s3)
+#' 
+#' # as separate 'documents'
+#' score(s)
+#' 
+#' # as one document
+#' score(paste0(s, collapse=" "))
+#' }
+#' 
+#' @references
+#' Hu, M., & Liu, B. (2004). Mining opinion features in customer
+#' reviews. National Conference on Artificial Intelligence.
+#' 
+#' @seealso
+#' \code{\link{meanr.nthreads}}
+#' 
+#' @export
+score <- function(s, nthreads=meanr.nthreads()) .Call(R_score, s, nthreads)
diff --git a/README.md b/README.md
@@ -0,0 +1,76 @@
+# meanr
+
+* **Version:** 0.1-0
+* **Status:** [![Build Status](https://travis-ci.org/wrathematics/meanr.png)](https://travis-ci.org/wrathematics/meanr)
+* **License:** [BSD 2-Clause](http://opensource.org/licenses/BSD-2-Clause)
+* **Author:** Drew Schmidt
+
+
+**meanr** is an R package performing basic sentiment analysis.  Its main method, `score()`, computes sentiment as a simple sum of the counts of positive (+1) and negative (-1) sentiment words in a piece of text.  More sophisticated techniques are available to R, for example in the **qdap** package's `polarity()` function.  This package uses [the Hu and Liu sentiment dictionary](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), same as everybody else.
+
+**meanr** is significantly faster than everything else I tried (which was actually the motivation for its creation), but I don't claim to have tried everything.  However, the method is merely a dictionary lookup, so, unlike more sophisticated methods, it ignores word context.  On the other hand, the more sophisticated tools are very slow.  If you have a large volume of text, I believe there is value in getting a "first glance" at the data, and **meanr** allows you to do this very quickly.
+
+
+
+## Installation
+
+<!-- You can install the stable version from CRAN using the usual `install.packages()`:
+
+```r
+install.packages("meanr")
+``` -->
+
+The development version is maintained on GitHub, and can easily be installed by any of the packages that offer installations from GitHub:
+
+```r
+### Pick your preference
+devtools::install_github("wrathematics/meanr")
+ghit::install_github("wrathematics/meanr")
+remotes::install_github("wrathematics/meanr")
+```
+
+
+
+## Example Usage
+
+I have a dataset that, for legal reasons, I cannot describe, much less provide.  You can think of it like a collection of tweets (they are not tweets).  But take my word for it that it's real, English-language text.  The data is in the form of a vector of strings, which we'll call `x`.
+
+```r
+x = readRDS("x.rds")
+
+length(x)
+## [1] 655760
+
+sum(nchar(x))
+## [1] 162663972
+
+library(meanr)
+system.time(s <- score(x))
+##  user  system elapsed 
+## 1.072   0.000   0.285 
+
+head(s)
+##   positive negative score  wc
+## 1        2        0     2  32
+## 2        5        0     5  29
+## 3        4        2     2  67
+## 4       12        3     9 203
+## 5        8        2     6 101
+## 6        4        3     1  99
+```
+
+
+
+## How It Works
+
+The `score()` function receives a vector of strings, and operates on each one as follows:
+
+1. The maximum string length is found, and a buffer of that size is allocated.
+2. The string is copied to the buffer.
+3. All punctuation is removed. All characters are converted to lowercase.
+4. Score sentiment:
+    - Tokenize words as collections of chars separated by a space.
+    - Check if the word is positive; if not, check if it is negative; if not, then it's assumed to be neutral.  Each check is a lookup in one of two hash tables built from Hu and Liu's dictionary.
+    - If the word is in the table, get its value from the hash table (positive words have value 1, negative words -1) and update the various counts.  Otherwise, the word is "neutral" (score of 0).
+
+This is all done in four passes over each string, one pass per enumerated item above.  The hash tables use perfect hash functions generated by gperf.
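
The steps listed under "How It Works" in the README can be approximated in plain R. The sketch below is not the package's implementation, which is written in C and uses gperf-generated perfect hash tables. The function `score_sketch()` and the word vectors `positive_words` and `negative_words` are hypothetical stand-ins for illustration, with a handful of words in place of the full Hu and Liu lists. The buffer allocation and copy (steps 1 and 2) have no R counterpart and are omitted.

```r
# Hypothetical, tiny stand-ins for the Hu and Liu positive and negative lists.
positive_words <- c("abundance", "abundant", "accessable")
negative_words <- c("abnormal", "abolish", "abominable")

# Illustrative only: mirrors the shape of score()'s result (one row per text).
score_sketch <- function(s) {
  rows <- lapply(s, function(text) {
    # Step 3: remove punctuation and convert to lowercase.
    cleaned <- tolower(gsub("[[:punct:]]", "", text))

    # Step 4: tokenize on single spaces and drop empty tokens.
    words <- strsplit(cleaned, " ", fixed = TRUE)[[1]]
    words <- words[nzchar(words)]

    # Positive words count +1, negative words -1, everything else 0.
    positive <- sum(words %in% positive_words)
    negative <- sum(words %in% negative_words)

    data.frame(positive = positive, negative = negative,
               score = positive - negative, wc = length(words))
  })
  do.call(rbind, rows)
}

result <- score_sketch(c("Abundance abundant accessable.",
                         "Banana apple orange.",
                         "Abnormal abolish abominable."))
print(result)
```

With these toy word lists, the three example strings from the `score()` documentation come out as fully positive, neutral, and fully negative respectively. Vector membership via `%in%` is used here only for brevity; it does not reproduce the speed of the hashed lookups the README describes.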
diff --git a/cleanup b/cleanup
@@ -0,0 +1,6 @@
+#! /bin/sh
+rm -rf ./src/*.dylib
+rm -rf ./src/*.so*
+rm -rf ./src/*.o
+rm -rf ./src/*.d
+rm -rf ./src/*.dll