Commit
0 parents
commit 694ccf4
Showing 29 changed files with 55,131 additions and 0 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
Release 0.1-0: | ||
* Added score(). | ||
* Added multi-corpus parallelism via OpenMP. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
Package: meanr | ||
Type: Package | ||
Title: Basic Sentiment Analysis Scorer | ||
Version: 0.1-0 | ||
Description: A popular technique in text analysis today is sentiment analysis, | ||
or trying to determine the overall emotional attitude of a piece of text | ||
(positive or negative). We provide a new, basic implementation of a common | ||
method for computing sentiment, whereby words are scored as positive or | ||
negative according to a "dictionary", and then an average of those scores | ||
for the document is produced. The package uses the 'Hu' and 'Liu' sentiment | ||
dictionary for assigning sentiment. | ||
License: BSD 2-clause License + file LICENSE | ||
Depends: R (>= 3.0.0) | ||
LazyData: yes | ||
LazyLoad: yes | ||
NeedsCompilation: yes | ||
ByteCompile: yes | ||
Authors@R: c(person("Drew", "Schmidt", role=c("aut", "cre"), | ||
email="wrathematics@gmail.com")) | ||
Maintainer: Drew Schmidt <wrathematics@gmail.com> | ||
URL: https://github.com/wrathematics/meanr | ||
BugReports: https://github.com/wrathematics/meanr/issues | ||
RoxygenNote: 6.0.1 | ||
Packaged: 2017-06-07 00:55:33 UTC; mschmid3 | ||
Author: Drew Schmidt [aut, cre] | ||
Repository: CRAN | ||
Date/Publication: 2017-06-07 05:34:24 UTC |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
YEAR: 2016-2017 | ||
COPYRIGHT HOLDER: Drew Schmidt |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
6ce23289ee3e810358d184de4303c09b *ChangeLog | ||
bce2bd0237c58a32e1219bc9d3d5758f *DESCRIPTION | ||
97ee9bd0a8d8ea998d470cd3df01391f *LICENSE | ||
d4d9df789e6601e992c577535c418514 *NAMESPACE | ||
be480688d606c6a88fd90540113df43c *R/meanr-package.r | ||
4258a6fe3fcd310ccc384d289b63599f *R/meanr.nthreads.r | ||
d5d68b9c5177a404ac3dc07c973079aa *R/score.r | ||
69b50329afc39e9763dca0dd2661dc81 *README.md | ||
4ebe6d22d2b7a0e071e649c97d7c23e6 *cleanup | ||
b5530b8555dc17eeb91776d98f5a0798 *configure | ||
c6dd3e936cff072a8b121aee6f53477d *configure.ac | ||
d41d8cd98f00b204e9800998ecf8427e *configure.win | ||
1bb01aa7ab651f8512f4c1fba171afaa *inst/CITATION | ||
eef52dce481f6b54222a915c294d5889 *man/meanr-package.Rd | ||
b9823d3ad4553f326d1e82670fb7d3f8 *man/meanr.nthreads.Rd | ||
ef74e3503ccfd589f63298e2f9ef07d5 *man/score.Rd | ||
95e3011e37d9dde0d75f3a3819b2acd3 *src/Makevars | ||
4f5835e95f25efff49554c474ecee49d *src/global.c | ||
c1f447e37ac6efe6dcd9d2cbed62b845 *src/hashtable/neghash.h | ||
0a6d1caef0603775d9b7f870d7b9b740 *src/hashtable/poshash.h | ||
f419b8d8cbd840d3bf82eba5c3b064df *src/include/RNACI.h | ||
531e126d0a1fde18833a3edb18827b67 *src/include/reactor.h | ||
216b5b1f0a6d8c7b0515ba396ad5090d *src/include/safeomp.h | ||
9e0d08e456002046dc89dea5c5fd0cd9 *src/meanr_native.c | ||
a5697d9b417c73ef3d38c69c3851ac0c *src/meanr_nthreads.c | ||
348588922857ad48ebd8bb8281facaf7 *src/score.c | ||
5beae7b13364e09da44c060c2818ca19 *tests/degenerate_cases.r | ||
62b442f823e90485ae18176cfb54094c *tests/score.r |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# Generated by roxygen2: do not edit by hand | ||
|
||
export(meanr.nthreads) | ||
export(score) | ||
useDynLib(meanr, R_score, R_meanr_nthreads) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
#' meanr: Basic Sentiment Analysis Scorer | ||
#' | ||
#' A popular technique in text analysis today is sentiment | ||
#' analysis, or trying to determine the overall emotional | ||
#' attitude of a piece of text (positive or negative). | ||
#' We provide a new, basic implementation of a common | ||
#' method for computing sentiment, whereby words are scored | ||
#' as positive or negative according to a "dictionary", and | ||
#' then an average of those scores for the document is produced. | ||
#' The package uses the Hu and Liu sentiment dictionary for | ||
#' assigning sentiment. | ||
#' | ||
#' @useDynLib meanr, R_score, R_meanr_nthreads | ||
#' | ||
#' @name meanr-package | ||
#' @docType package | ||
#' @author Drew Schmidt \email{wrathematics AT gmail.com} | ||
#' @keywords Package | ||
NULL |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
#' meanr.nthreads | ||
#' | ||
#' Returns the number of cores + hyperthreads on the system. The function | ||
#' respects the environment variable \code{OMP_NUM_THREADS}. | ||
#' | ||
#' @return | ||
#' An integer; the number of threads. | ||
#' | ||
#' @export | ||
meanr.nthreads <- function() | ||
{ | ||
.Call(R_meanr_nthreads) | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
#' score | ||
#' | ||
#' Computes the sentiment score: the number of positive words (each scored +1) | ||
#' minus the number of negative words (each scored -1). The scorer is vectorized, | ||
#' so it returns one row per input text. | ||
#' | ||
#' Preprocessing is largely unnecessary. For example, the scorer ignores | ||
#' case and punctuation. That said, preprocessing probably won't hurt. | ||
#' | ||
#' @details | ||
#' The scorer uses OpenMP for parallelism; the number of threads is set by | ||
#' \code{nthreads}. | ||
#' | ||
#' The function uses the Hu and Liu sentiment dictionary (same as everybody | ||
#' else) available here: | ||
#' https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html | ||
#' | ||
#' @param s | ||
#' A string or vector of strings. | ||
#' @param nthreads | ||
#' Number of threads to use. By default it will use the total number of | ||
#' cores + hyperthreads. | ||
#' | ||
#' @return | ||
#' A dataframe, consisting of columns "positive", "negative", "score", and "wc". | ||
#' With the exception of "score", these are counts; that is, "positive" is the | ||
#' number of positive sentiment words, "negative" is the number of negative | ||
#' sentiment words, and "wc" is the wordcount (total number of words). | ||
#' | ||
#' @examples | ||
#' \dontrun{ | ||
#' library(meanr) | ||
#' s1 = "Abundance abundant accessable." | ||
#' s2 = "Banana apple orange." | ||
#' s3 = "Abnormal abolish abominable." | ||
#' s = c(s1, s2, s3) | ||
#' | ||
#' # as separate 'documents' | ||
#' score(s) | ||
#' | ||
#' # as one document | ||
#' score(paste0(s, collapse=" ")) | ||
#' } | ||
#' | ||
#' @references | ||
#' Hu, M., & Liu, B. (2004). Mining opinion features in customer | ||
#' reviews. National Conference on Artificial Intelligence. | ||
#' | ||
#' @seealso | ||
#' \code{\link{meanr.nthreads}} | ||
#' | ||
#' @export | ||
score <- function(s, nthreads=meanr.nthreads()) .Call(R_score, s, nthreads) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
# meanr | ||
|
||
* **Version:** 0.1-0 | ||
* **Status:** [![Build Status](https://travis-ci.org/wrathematics/meanr.png)](https://travis-ci.org/wrathematics/meanr) | ||
* **License:** [BSD 2-Clause](http://opensource.org/licenses/BSD-2-Clause) | ||
* **Author:** Drew Schmidt | ||
|
||
|
||
**meanr** is an R package performing basic sentiment analysis. Its main method, `score()`, computes sentiment as a simple sum of the counts of positive (+1) and negative (-1) sentiment words in a piece of text. More sophisticated techniques are available in R, for example in the **qdap** package's `polarity()` function. This package uses [the Hu and Liu sentiment dictionary](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), same as everybody else. | ||
|
||
**meanr** is significantly faster than everything else I tried (which was actually the motivation for its creation), but I don't claim to have tried everything. However, the method is merely a dictionary lookup, so unlike more sophisticated methods it ignores word context. On the other hand, the more sophisticated tools are very slow. If you have a large volume of text, I believe there is value in getting a "first glance" at the data, and **meanr** allows you to do this very quickly. | ||
|
||
|
||
|
||
## Installation | ||
|
||
<!-- You can install the stable version from CRAN using the usual `install.packages()`: | ||
```r | ||
install.packages("meanr") | ||
``` --> | ||
|
||
The development version is maintained on GitHub, and can easily be installed by any of the packages that offer installations from GitHub: | ||
|
||
```r | ||
### Pick your preference | ||
devtools::install_github("wrathematics/meanr") | ||
ghit::install_github("wrathematics/meanr") | ||
remotes::install_github("wrathematics/meanr") | ||
``` | ||
|
||
|
||
|
||
## Example Usage | ||
|
||
I have a dataset that, for legal reasons, I cannot describe, much less provide. You can think of it like a collection of tweets (they are not tweets). But take my word for it that it's real English-language text. The data is in the form of a vector of strings, which we'll call `x`. | ||
|
||
```r | ||
x = readRDS("x.rds") | ||
|
||
length(x) | ||
## [1] 655760 | ||
|
||
sum(nchar(x)) | ||
## [1] 162663972 | ||
|
||
library(meanr) | ||
system.time(s <- score(x)) | ||
## user system elapsed | ||
## 1.072 0.000 0.285 | ||
|
||
head(s) | ||
## positive negative score wc | ||
## 1 2 0 2 32 | ||
## 2 5 0 5 29 | ||
## 3 4 2 2 67 | ||
## 4 12 3 9 203 | ||
## 5 8 2 6 101 | ||
## 6 4 3 1 99 | ||
``` | ||
|
||
|
||
|
||
## How It Works | ||
|
||
The `score()` function receives a vector of strings, and operates on each one as follows: | ||
|
||
1. The maximum string length is found, and a buffer of that size is allocated. | ||
2. The string is copied to the buffer. | ||
3. All punctuation is removed. All characters are converted to lowercase. | ||
4. Score sentiment: | ||
- Tokenize words as collections of chars separated by a space. | ||
- Check if the word is positive; if not, check if it is negative; if not, then it's assumed to be neutral. Each check is a lookup in one of two hash tables built from Hu and Liu's dictionaries. | ||
- If the word is in the table, get its value from the hash table (positive words have value 1, negative words -1) and update the various counts. Otherwise, the word is "neutral" (score of 0). | ||
|
||
This is all done in four passes over each string, one for each of the enumerated items above. The hash tables use perfect hash functions generated by gperf. |
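The steps described in "How It Works" can be mimicked in a few lines of base R. This is only a rough sketch for intuition, not the package's implementation: the real scorer is written in C and uses gperf-generated hash tables over the full Hu and Liu dictionary. The names `score_sketch`, `pos_words`, and `neg_words` are hypothetical, and the two word lists are tiny stand-ins chosen for the example.

```r
# Hypothetical stand-in word lists; the package uses the full Hu and Liu dictionary.
pos_words <- c("good", "great", "abundant")
neg_words <- c("bad", "abnormal", "abolish")

# Illustrative re-implementation of the scoring steps in plain R.
score_sketch <- function(s) {
  # Remove punctuation, then convert to lowercase
  clean <- tolower(gsub("[[:punct:]]", "", s))

  # Tokenize: words are runs of characters separated by spaces
  tokens <- strsplit(clean, " +")

  rows <- lapply(tokens, function(w) {
    w <- w[nzchar(w)]                   # drop empty tokens from leading spaces
    positive <- sum(w %in% pos_words)   # each positive word counts +1
    negative <- sum(w %in% neg_words)   # each negative word counts -1
    data.frame(positive = positive, negative = negative,
               score = positive - negative, wc = length(w))
  })

  # One row per input text
  do.call(rbind, rows)
}

res <- score_sketch(c("Great, abundant food!", "Bad and abnormal.", "Banana apple orange."))
print(res)
```

With these stand-in lists, the three texts score 2, -2, and 0 respectively, each with a word count of 3. Unlike this sketch, which relies on vectorized `%in%` lookups, the package works on a character buffer and is parallelized with OpenMP, which is where its speed comes from.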
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
#! /bin/sh | ||
rm -rf ./src/*.dylib | ||
rm -rf ./src/*.so* | ||
rm -rf ./src/*.o | ||
rm -rf ./src/*.d | ||
rm -rf ./src/*.dll |