version 0.1-0
wrathematics authored and cran-robot committed Jun 7, 2017
0 parents commit 694ccf4
Showing 29 changed files with 55,131 additions and 0 deletions.
3 changes: 3 additions & 0 deletions ChangeLog
@@ -0,0 +1,3 @@
Release 0.1-0:
* Added score().
* Added multi-corpus parallelism via OpenMP.
27 changes: 27 additions & 0 deletions DESCRIPTION
@@ -0,0 +1,27 @@
Package: meanr
Type: Package
Title: Basic Sentiment Analysis Scorer
Version: 0.1-0
Description: A popular technique in text analysis today is sentiment analysis,
    or trying to determine the overall emotional attitude of a piece of text
    (positive or negative). We provide a new, basic implementation of a common
    method for computing sentiment, whereby words are scored as positive or
    negative according to a "dictionary", and then an average of those scores
    for the document is produced. The package uses the 'Hu' and 'Liu' sentiment
    dictionary for assigning sentiment.
License: BSD 2-clause License + file LICENSE
Depends: R (>= 3.0.0)
LazyData: yes
LazyLoad: yes
NeedsCompilation: yes
ByteCompile: yes
Authors@R: c(person("Drew", "Schmidt", role=c("aut", "cre"),
    email="wrathematics@gmail.com"))
Maintainer: Drew Schmidt <wrathematics@gmail.com>
URL: https://github.com/wrathematics/meanr
BugReports: https://github.com/wrathematics/meanr/issues
RoxygenNote: 6.0.1
Packaged: 2017-06-07 00:55:33 UTC; mschmid3
Author: Drew Schmidt [aut, cre]
Repository: CRAN
Date/Publication: 2017-06-07 05:34:24 UTC
2 changes: 2 additions & 0 deletions LICENSE
@@ -0,0 +1,2 @@
YEAR: 2016-2017
COPYRIGHT HOLDER: Drew Schmidt
28 changes: 28 additions & 0 deletions MD5
@@ -0,0 +1,28 @@
6ce23289ee3e810358d184de4303c09b *ChangeLog
bce2bd0237c58a32e1219bc9d3d5758f *DESCRIPTION
97ee9bd0a8d8ea998d470cd3df01391f *LICENSE
d4d9df789e6601e992c577535c418514 *NAMESPACE
be480688d606c6a88fd90540113df43c *R/meanr-package.r
4258a6fe3fcd310ccc384d289b63599f *R/meanr.nthreads.r
d5d68b9c5177a404ac3dc07c973079aa *R/score.r
69b50329afc39e9763dca0dd2661dc81 *README.md
4ebe6d22d2b7a0e071e649c97d7c23e6 *cleanup
b5530b8555dc17eeb91776d98f5a0798 *configure
c6dd3e936cff072a8b121aee6f53477d *configure.ac
d41d8cd98f00b204e9800998ecf8427e *configure.win
1bb01aa7ab651f8512f4c1fba171afaa *inst/CITATION
eef52dce481f6b54222a915c294d5889 *man/meanr-package.Rd
b9823d3ad4553f326d1e82670fb7d3f8 *man/meanr.nthreads.Rd
ef74e3503ccfd589f63298e2f9ef07d5 *man/score.Rd
95e3011e37d9dde0d75f3a3819b2acd3 *src/Makevars
4f5835e95f25efff49554c474ecee49d *src/global.c
c1f447e37ac6efe6dcd9d2cbed62b845 *src/hashtable/neghash.h
0a6d1caef0603775d9b7f870d7b9b740 *src/hashtable/poshash.h
f419b8d8cbd840d3bf82eba5c3b064df *src/include/RNACI.h
531e126d0a1fde18833a3edb18827b67 *src/include/reactor.h
216b5b1f0a6d8c7b0515ba396ad5090d *src/include/safeomp.h
9e0d08e456002046dc89dea5c5fd0cd9 *src/meanr_native.c
a5697d9b417c73ef3d38c69c3851ac0c *src/meanr_nthreads.c
348588922857ad48ebd8bb8281facaf7 *src/score.c
5beae7b13364e09da44c060c2818ca19 *tests/degenerate_cases.r
62b442f823e90485ae18176cfb54094c *tests/score.r
5 changes: 5 additions & 0 deletions NAMESPACE
@@ -0,0 +1,5 @@
# Generated by roxygen2: do not edit by hand

export(meanr.nthreads)
export(score)
useDynLib(meanr, R_score, R_meanr_nthreads)
19 changes: 19 additions & 0 deletions R/meanr-package.r
@@ -0,0 +1,19 @@
#' meanr: Basic Sentiment Analysis Scorer
#'
#' A popular technique in text analysis today is sentiment
#' analysis, or trying to determine the overall emotional
#' attitude of a piece of text (positive or negative).
#' We provide a new, basic implementation of a common
#' method for computing sentiment, whereby words are scored
#' as positive or negative according to a "dictionary", and
#' then an average of those scores for the document is produced.
#' The package uses the Hu and Liu sentiment dictionary for
#' assigning sentiment.
#'
#' @useDynLib meanr, R_score, R_meanr_nthreads
#'
#' @name meanr-package
#' @docType package
#' @author Drew Schmidt \email{wrathematics AT gmail.com}
#' @keywords Package
NULL
13 changes: 13 additions & 0 deletions R/meanr.nthreads.r
@@ -0,0 +1,13 @@
#' meanr.nthreads
#'
#' Returns the number of cores + hyperthreads on the system. The function
#' respects the environment variable \code{OMP_NUM_THREADS}.
#'
#' @return
#' An integer; the number of threads.
#'
#' @export
meanr.nthreads <- function()
{
  .Call(R_meanr_nthreads)
}
52 changes: 52 additions & 0 deletions R/score.r
@@ -0,0 +1,52 @@
#' score
#'
#' Computes the sentiment score: the number of positive-scored words minus the
#' number of negative-scored words. The scorer is vectorized so that it will
#' return one row per input text.
#'
#' Preprocessing is largely unnecessary. For example, the scorer ignores
#' case and punctuation. That said, preprocessing probably won't hurt.
#'
#' @details
#' The scorer uses OpenMP to score multiple texts in parallel.
#'
#' The function uses the Hu and Liu sentiment dictionary (same as everybody
#' else) available here:
#' https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
#'
#' @param s
#' A string or vector of strings.
#' @param nthreads
#' Number of threads to use. By default it will use the total number of
#' cores + hyperthreads.
#'
#' @return
#' A dataframe, consisting of columns "positive", "negative", "score", and "wc".
#' With the exception of "score", these are counts; that is, "positive" is the
#' number of positive sentiment words, "negative" is the number of negative
#' sentiment words, and "wc" is the wordcount (total number of words).
#'
#' @examples
#' \dontrun{
#' library(meanr)
#' s1 = "Abundance abundant accessable."
#' s2 = "Banana apple orange."
#' s3 = "Abnormal abolish abominable."
#' s = c(s1, s2, s3)
#'
#' # as separate 'documents'
#' score(s)
#'
#' # as one document
#' score(paste0(s, collapse=" "))
#' }
#'
#' @references
#' Hu, M., & Liu, B. (2004). Mining opinion features in customer
#' reviews. National Conference on Artificial Intelligence.
#'
#' @seealso
#' \code{\link{meanr.nthreads}}
#'
#' @export
score <- function(s, nthreads=meanr.nthreads()) .Call(R_score, s, nthreads)
76 changes: 76 additions & 0 deletions README.md
@@ -0,0 +1,76 @@
# meanr

* **Version:** 0.1-0
* **Status:** [![Build Status](https://travis-ci.org/wrathematics/meanr.png)](https://travis-ci.org/wrathematics/meanr)
* **License:** [BSD 2-Clause](http://opensource.org/licenses/BSD-2-Clause)
* **Author:** Drew Schmidt


**meanr** is an R package performing basic sentiment analysis. Its main method, `score()`, computes sentiment as a simple sum of the counts of positive (+1) and negative (-1) sentiment words in a piece of text. More sophisticated techniques are available to R, for example in the **qdap** package's `polarity()` function. This package uses [the Hu and Liu sentiment dictionary](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), same as everybody else.

**meanr** is significantly faster than everything else I tried (which was actually the motivation for its creation), but I don't claim to have tried everything. I believe the package is quite fast. However, the method is merely a dictionary lookup, so, unlike more sophisticated methods, it ignores word context. On the other hand, the more sophisticated tools are very slow. If you have a large volume of text, I believe there is value in getting a "first glance" at the data, and **meanr** allows you to do this very quickly.



## Installation

<!-- You can install the stable version from CRAN using the usual `install.packages()`:
```r
install.packages("meanr")
``` -->

The development version is maintained on GitHub, and can easily be installed by any of the packages that offer installations from GitHub:

```r
### Pick your preference
devtools::install_github("wrathematics/meanr")
ghit::install_github("wrathematics/meanr")
remotes::install_github("wrathematics/meanr")
```



## Example Usage

I have a dataset that, for legal reasons, I cannot describe, much less provide. You can think of it like a collection of tweets (they are not tweets). But take my word for it that it's real, English-language text. The data is in the form of a vector of strings, which we'll call `x`.

```r
x = readRDS("x.rds")

length(x)
## [1] 655760

sum(nchar(x))
## [1] 162663972

library(meanr)
system.time(s <- score(x))
##    user  system elapsed
##   1.072   0.000   0.285

head(s)
##   positive negative score  wc
## 1        2        0     2  32
## 2        5        0     5  29
## 3        4        2     2  67
## 4       12        3     9 203
## 5        8        2     6 101
## 6        4        3     1  99
```
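
The relationship between the columns can be checked by hand: each `score` is the `positive` count minus the `negative` count. Below is a minimal sketch in C (the language of the package's `src/` files) that encodes the six rows printed by `head(s)` and verifies this. The names `row_t`, `HEAD_ROWS`, `sentiment_score()`, and `rows_consistent()` are made up for illustration and are not part of the package.

```c
#include <stddef.h>

/* One output row; field names mirror the data frame columns. */
typedef struct {
  int positive;
  int negative;
  int score;
  int wc;
} row_t;

/* The six rows printed by head(s) in the example. */
static const row_t HEAD_ROWS[] = {
  { 2, 0, 2,  32},
  { 5, 0, 5,  29},
  { 4, 2, 2,  67},
  {12, 3, 9, 203},
  { 8, 2, 6, 101},
  { 4, 3, 1,  99}
};

/* Each positive word contributes +1 and each negative word -1. */
static int sentiment_score(int positive, int negative)
{
  return positive - negative;
}

/* Returns 1 if every row's score equals positive - negative, else 0. */
static int rows_consistent(const row_t *rows, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (rows[i].score != sentiment_score(rows[i].positive, rows[i].negative))
      return 0;
  return 1;
}
```

Note that `wc` plays no part in the score; neutral words only increase the word count.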



## How It Works

The `score()` function receives a vector of strings, and operates on each one as follows:

1. The maximum string length is found, and a buffer of that size is allocated.
2. The string is copied to the buffer.
3. All punctuation is removed. All characters are converted to lowercase.
4. Score sentiment:
    - Tokenize words as collections of chars separated by a space.
    - Check if the word is positive; if not, check if it is negative; if not, then it's assumed to be neutral. Each check is a lookup in one of two hash tables (one for positive words, one for negative) built from Hu and Liu's dictionaries.
    - If the word is in a table, get its value from the hash table (positive words have value 1, negative words -1) and update the various counts. Otherwise, the word is "neutral" (score of 0).

This is all done in four passes over each string, one pass per enumerated item above. The hash tables use perfect hash functions generated by gperf.
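
The steps above can be sketched in C for a single string. This is a rough illustration under stated assumptions, not the package's actual source: `sentiment_t`, `score_text()`, and `in_list()` are hypothetical names, and the two tiny word lists with a linear search stand in for the gperf-generated hash tables. Because it handles one string at a time, the buffer is sized to that string rather than to the longest string in the input vector.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Result fields, named after the columns returned by score(). */
typedef struct {
  int positive;
  int negative;
  int score;
  int wc;
} sentiment_t;

/* ASSUMPTION: small stand-in word lists for illustration only. The package
   looks words up in perfect hash tables over the full dictionaries. */
static const char *const POSITIVE[] = {"abundance", "abundant"};
static const char *const NEGATIVE[] = {"abnormal", "abolish", "abominable"};
#define COUNT(a) (sizeof (a) / sizeof (a)[0])

/* Linear search; a stand-in for the hash table lookup. */
static int in_list(const char *word, const char *const *list, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (strcmp(word, list[i]) == 0)
      return 1;
  return 0;
}

/* Scores one string. Returns 0 on success, -1 if allocation fails. */
static int score_text(const char *text, sentiment_t *out)
{
  assert(text != NULL && out != NULL);

  /* Steps 1-2: allocate a buffer and copy the string into it. */
  size_t len = strlen(text);
  char *buf = malloc(len + 1);
  if (buf == NULL)
    return -1;
  memcpy(buf, text, len + 1);

  /* Step 3: drop punctuation and lowercase the rest, in place. */
  size_t j = 0;
  for (size_t i = 0; i < len; i++) {
    unsigned char c = (unsigned char) buf[i];
    if (!ispunct(c))
      buf[j++] = (char) tolower(c);
  }
  buf[j] = '\0';

  /* Step 4: split on spaces and look each word up. strtok() is not
     thread-safe, so a parallel version would need a re-entrant tokenizer. */
  out->positive = out->negative = out->score = out->wc = 0;
  for (char *tok = strtok(buf, " "); tok != NULL; tok = strtok(NULL, " ")) {
    out->wc++;
    if (in_list(tok, POSITIVE, COUNT(POSITIVE))) {
      out->positive++;
      out->score += 1;
    } else if (in_list(tok, NEGATIVE, COUNT(NEGATIVE))) {
      out->negative++;
      out->score -= 1;
    }
    /* Otherwise the word is neutral and only the word count changes. */
  }

  free(buf);
  return 0;
}
```

With these stand-in lists, calling `score_text("Abundance, abundant; ABNORMAL banana!", &r)` fills `r` with 2 positive words, 1 negative word, a score of 1, and a word count of 4.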
6 changes: 6 additions & 0 deletions cleanup
@@ -0,0 +1,6 @@
#! /bin/sh
rm -rf ./src/*.dylib
rm -rf ./src/*.so*
rm -rf ./src/*.o
rm -rf ./src/*.d
rm -rf ./src/*.dll
