
Merge branch 'master' of github.com:cathywu/Sentiment-Analysis

commit ebb67916f2dbc362214d213079b1bd374d62f31c 2 parents ec276c1 + de4484d
cathywu authored February 05, 2012
egpaper_final.tex (25 changes)
@@ -6,6 +6,7 @@
 \usepackage{graphicx}
 \usepackage{amsmath}
 \usepackage{amssymb}
+\usepackage{url}
 
 % Include other packages here, before hyperref.
 
@@ -51,11 +52,14 @@ \section{Introduction}
 
 Sentiment analysis, broadly speaking, is the set of techniques that allows detection of emotional content in text. This has a variety of applications: it is commonly used by trading algorithms to process news articles, as well as by corporations to better respond to consumer service needs. Similar techniques can also be applied to other text analysis problems, like spam filtering.
 
+The source code described in this paper is available at \url{https://github.com/cathywu/Sentiment-Analysis}.
+
 \section{Previous Work}
 
-We set out to replicate Pang's work from 2002 on using classical knowledge-free supervised machine learning techniques to perform sentiment classification. They used the machine learning methods (Naive Bayes, maximum entropy classification, and support vector machines), methods commonly used for topic classification, to explore the difference between and sentiment classification in documents. Pang cited a number of related works, but they mostly pertain to classifying documents on criteria weakly tied to sentiment or using knowledge-based sentiment classification methods. We used a similar dataset, as released by the authors, and made efforts to use the same libraries and pre-processing techniques.
+We set out to replicate Pang's work \cite{Pang} from 2002 on using classical knowledge-free supervised machine learning techniques to perform sentiment classification. They applied machine learning methods commonly used for topic classification (Naive Bayes, maximum entropy classification, and support vector machines) to explore the difference between topic classification and sentiment classification in documents. Pang cited a number of related works, but they mostly pertain to classifying documents on criteria weakly tied to sentiment or to using knowledge-based sentiment classification methods. We used a similar dataset, as released by the authors, and made efforts to use the same libraries and pre-processing techniques.
+
+In addition to replicating Pang's work as closely as we could, we extended the work by exploring an additional dataset, additional preprocessing techniques, and combinations of classifiers. We tested how well classifiers trained on Pang's dataset extended to reviews in another domain. Although Pang limited many of his tests to the 16165 most common ngrams, modern processors have lifted this computational constraint, so we additionally tested on all ngrams. We used a newer parameter estimation algorithm called Limited-Memory Variable Metric (L-BFGS) \cite{Liu} for maximum entropy classification, whereas Pang used the Improved Iterative Scaling method. We also implemented and tested the effect of term frequency-inverse document frequency (TF-IDF) on classification results.
 
-In addition to replicating Pang's work as closely as we could, we extended the work by exploring an additional dataset, additional preprocessing techniques, and combining classifiers. We tested how well classifiers trained on Pang's dataset extended to reviews in another domain. Although Pang limited many of his tests to use only the 16165 most common ngrams, advanced processors have lifted this computational constraint, and so we additionally tested on all ngrams. We used a newer parameter estimation algorithm called Limited-Memory Variable Metric (L-BFGS) for maximum entropy classification. Pang used the Improved Iterative Scaling method. We also implemented and tested the effect of term frequency-inverse document frequency (TF-IDF) on classification results.
 
 \section{The User Review Domain}
 For our experiments, we worked with movie reviews. Our data source was Pang's released dataset (http://www.cs.cornell.edu/people/pabo/movie-review-data/) from their 2004 publication. The dataset contains 1000 positive reviews and 1000 negative reviews, each labeled with their true sentiment. The original data source was the Internet Movie Database (IMDb).
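
As a concrete illustration of the TF-IDF weighting tested above, a minimal sketch (illustrative only, assuming raw counts for term frequency and a log inverse document frequency; this is not the repository's implementation):

import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights for tokenized documents: tf is the raw count of a
    term in a document; idf = log(N / df) down-weights terms that occur
    in many documents (df) relative to the corpus size N."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

# "a" and "movie" appear in both toy reviews, so their weight is 0
print(tfidf([["a", "great", "movie"], ["a", "terrible", "movie"]]))

Presence features correspond to clipping each term count at 1 before weighting.
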
@@ -67,7 +71,7 @@ \section{The User Review Domain}
 
 \section{Machine Learning Methods}
 \subsection{The Naive Bayes Classifier}
-The Naive Bayes classifier is an extremely simple classifier that relies on Bayesian probability and the assumption that feature probabilities are independent of one another.
+The Naive Bayes classifier \cite{Manning} is an extremely simple classifier that relies on Bayesian probability and the assumption that feature probabilities are independent of one another.
 Bayes' Rule gives:
 $$
 P(C | F_1, F_2, \ldots, F_n)
@@ -98,18 +102,16 @@ \subsection{The Naive Bayes Classifier}
 
 While the Naive Bayes classifier seems very simple, it is observed to have high predictive power; in our tests, it performed competitively with the more sophisticated classifiers we used. The Bayes classifier can also be implemented very efficiently. Its independence assumption means that it does not fall prey to the curse of dimensionality, and its running time is linear in the size of the input.
 
-[http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html]
-
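
To make the decision rule concrete, a minimal multinomial Naive Bayes sketch (add-one smoothing is our assumption; the paper does not specify its smoothing):

import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes: pick argmax_c P(c) * prod_i P(f_i | c),
    computed in log space. Training and prediction are linear in the
    size of the input."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.priors = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for doc, c in zip(docs, labels):
            self.counts[c].update(doc)
            self.vocab.update(doc)
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}

    def predict(self, doc):
        def log_posterior(c):
            # add-one smoothing keeps unseen words from zeroing the product
            return self.priors[c] + sum(
                math.log((self.counts[c][w] + 1) / (self.totals[c] + len(self.vocab)))
                for w in doc)
        return max(self.classes, key=log_posterior)

nb = NaiveBayes()
nb.fit([["great", "fun"], ["awful", "boring"]], ["pos", "neg"])
print(nb.predict(["great", "boring", "fun"]))  # -> pos
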
 \subsection{The Maximum Entropy Classifier}
 
-Maximum Entropy is a general-purpose machine learning technique that provides the least biased estimate possible based on the given information. In other words, “it is maximally noncommittal with regards to missing information” [src]. Importantly, it makes no conditional independence assumption between features, as the Naive Bayes classifier does.
+Maximum Entropy is a general-purpose machine learning technique that provides the least biased estimate possible based on the given information. In other words, ``it is maximally noncommittal with regards to missing information'' \cite{Jaynes}. Importantly, it makes no conditional independence assumption between features, as the Naive Bayes classifier does.
 
 Maximum entropy's estimate of $P(c|d)$ takes the following exponential form:
 $$P(c|d) = \frac{1}{Z(d)} \exp\left(\sum_i \lambda_{i,c} F_{i,c}(d,c)\right)$$
 
 The $\lambda_{i,c}$'s are feature-weight parameters, where a large $\lambda_{i,c}$ means that $f_i$ is considered a strong indicator for class $c$. We use 30 iterations of Limited-Memory Variable Metric (L-BFGS) parameter estimation. Pang used the Improved Iterative Scaling (IIS) method, but L-BFGS, a method invented after their paper was published, was found to outperform both IIS and generalized iterative scaling (GIS), yet another parameter estimation method.
 
-We used Zhang Le's (2004) Package Maximum Entropy Modeling Toolkit for Python and C++ [link] [src], with no special configuration.
+We used Zhang Le's Maximum Entropy Modeling Toolkit for Python and C++ \cite{Le}, with no special configuration.
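
A small sketch of evaluating the exponential form above for a document once the lambda weights are known (the weights below are hypothetical; in the paper the estimation itself is handled by the toolkit's L-BFGS routine):

import math

def maxent_prob(features, lambdas, classes):
    """Evaluate P(c|d) = exp(sum_i lambda_{i,c} F_{i,c}(d,c)) / Z(d)
    for the active (binary) features of one document d."""
    scores = {c: math.exp(sum(lambdas.get((f, c), 0.0) for f in features))
              for c in classes}
    z = sum(scores.values())  # Z(d), the normalizing constant
    return {c: s / z for c, s in scores.items()}

# hypothetical weights: "great" indicates pos, "awful" indicates neg
lambdas = {("great", "pos"): 1.2, ("awful", "neg"): 1.5}
print(maxent_prob({"great", "awful"}, lambdas, ["pos", "neg"]))
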
 
 \subsection{The Support Vector Machine Classifier}
 
@@ -121,7 +123,7 @@ \subsection{The Support Vector Machine Classifier}
 $$\forall i, \zeta_i \ge 0$$
 $$\forall i, y_i (\vec{x}_i^T \cdot \vec{B} + B_0) \ge 1 - \zeta_i$$
 
-For this paper, we use the PyML implementation of SVMs, which uses the liblinear optimizer to actually find the separating hyperplane. Of the three classifiers, this was the slowest to train, as it suffers from the curse of dimensionalit
+For this paper, we use the PyML implementation of SVMs \cite{PyML}, which uses the liblinear optimizer to find the separating hyperplane. Of the three classifiers, this was the slowest to train, as it suffers from the curse of dimensionality.
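
PyML usage is not shown in the diff; as a stand-in, the sketch below trains a soft-margin linear SVM on binary presence features with scikit-learn's LinearSVC, which is likewise backed by the liblinear optimizer (an illustrative substitute, not the repository's code):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["a great movie", "an awful movie"]  # toy stand-ins for reviews
labels = [1, -1]                            # 1 = positive, -1 = negative

# binary=True yields presence (0/1) features rather than frequencies
X = CountVectorizer(binary=True).fit_transform(docs)

# C trades off margin width against the slack variables zeta_i
clf = LinearSVC(C=1.0)
clf.fit(X, labels)
print(clf.predict(X))  # -> [ 1 -1]
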
 
 \section{Experimental Setup}
 We used documents from the movie review dataset and ran 3-fold cross validation in a number of test configurations. We ignored case and treated punctuation marks as separate lexical items.
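
A minimal sketch of the setup this paragraph describes, assuming a regex tokenizer that lowercases and splits off punctuation, and round-robin assignment of documents to the three folds (both assumptions; the repository's exact scheme is not shown):

import re

def tokenize(text):
    """Lowercase, and treat punctuation marks as separate lexical items."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())

def three_fold_accuracy(docs, labels, train_and_score):
    """Round-robin 3-fold cross validation; returns the mean score."""
    folds = [(docs[i::3], labels[i::3]) for i in range(3)]
    scores = []
    for i in range(3):
        test_docs, test_labels = folds[i]
        train_docs = [d for j in range(3) if j != i for d in folds[j][0]]
        train_labels = [l for j in range(3) if j != i for l in folds[j][1]]
        scores.append(train_and_score(train_docs, train_labels, test_docs, test_labels))
    return sum(scores) / 3

print(tokenize("This movie wasn't bad!"))
# -> ['this', 'movie', 'wasn', "'", 't', 'bad', '!']
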
@@ -138,7 +140,7 @@ \subsection{Feature Counting Method}
 
 \subsection{Conditional Independence Assumption}
 
-The Bayes classifier depends on a conditional independence assumption, meaning that the model it predicts assumes that the probability of a given word is independent of the other words. Clearly, this assumption does not hold. Nevertheless, the Bayes classifier functions well, in part because the positive and negative correlations between features tend to cancel each other out [http://www.cs.unb.ca/profs/hzhang/publications/FLAIRS04ZhangH.pdf].
+The Bayes classifier depends on a conditional independence assumption, meaning that the model it predicts assumes that the probability of a given word is independent of the other words. Clearly, this assumption does not hold. Nevertheless, the Bayes classifier functions well, in part because the positive and negative correlations between features tend to cancel each other out \cite{Zhang}.
 
 We found a huge difference between the results of Naive Bayes and Maximum Entropy for positive testing accuracy and negative testing accuracy. Maximum Entropy, which makes no unfounded assumptions about the data, gave very similar results for positive and negative tests, with a 0.2\% difference on average. On the other hand, positive and negative results from Naive Bayes, which assumes conditional independence, vary by 27.5\% on average, with the worst cases occurring in test configurations using frequency, averaging a 40\% difference. These disparities suggest that the movie dataset does not satisfy the conditional independence assumption.
 
@@ -166,7 +168,8 @@ \subsection{Position Tagging}
 Position tagging was not helpful. For bigrams, it harmed performance by around 5\% in most cases, and for unigrams, it made no difference. If reviews do not actually follow the specified model, or if the model has no bearing on where the relevant data is, position tagging will be harmful because it increases the dimensionality of the input without increasing the information content. We suspect that is the case here.
 
 \subsection{Part of Speech Tagging}
-We appended POS tags to every word using Oliver Mason's Qtag program [src]. This serves as a rough way to disambiguate words that may hold different meanings in different contexts. For example, it would distinguish the different uses of “love” in ``I love this movie'' versus ``This is a love story.'' However, it turns out that word disambiguation is a much more complicated problem, as POS says nothing to distinguish between the meaning of cold in ``I was a bit cold during the movie'' and ``The cold murderer chilled my heart.''
+
+We appended POS tags to every word using Oliver Mason's Qtag program \cite{qtag}. This serves as a rough way to disambiguate words that may hold different meanings in different contexts. For example, it would distinguish the different uses of ``love'' in ``I love this movie'' versus ``This is a love story.'' However, it turns out that word disambiguation is a much more complicated problem, as POS tags say nothing to distinguish between the meaning of ``cold'' in ``I was a bit cold during the movie'' and ``The cold murderer chilled my heart.''
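
Qtag is a Java tool whose invocation is not shown here; purely as an illustration of the transformation, the sketch below appends tags using NLTK's tagger as a stand-in (requires the nltk package and its averaged_perceptron_tagger model):

from nltk import pos_tag  # stand-in for Qtag, not the repository's pipeline

def tag_tokens(tokens):
    """Append each word's POS tag: ['i', 'love', ...] -> ['i_PRP', 'love_VBP', ...]."""
    return [f"{word}_{tag}" for word, tag in pos_tag(tokens)]

print(tag_tokens(["i", "love", "this", "movie"]))
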
 
 Part of speech tagging was not very helpful for unigram results; in fact, the NB classifier did slightly worse with parts of speech tagged when using unigrams. However, when using bigrams, the MaxEnt and SVM classifiers did significantly better, achieving 3-4\% better accuracy with part of speech tagging when measuring frequency and presence information.
 
@@ -182,7 +185,7 @@ \subsection{Majority Voting}
 Majority voting in some cases provided a small but significant improvement over the classifiers alone; combining Bayes, MaxEnt, and SVM classifiers over the same data provided a three to four percent boost over the best of the individual classifiers alone.
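
A sketch of the majority-vote rule (assuming each trained classifier exposes a predict method returning 'pos' or 'neg'; with three classifiers and two labels a majority always exists):

from collections import Counter

def majority_vote(classifiers, doc):
    """Label predicted by most of the classifiers,
    e.g. majority_vote([nb, maxent, svm], doc) -> 'pos' or 'neg'."""
    votes = Counter(clf.predict(doc) for clf in classifiers)
    return votes.most_common(1)[0][0]
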
 
 \subsection{Neighboring Domain Data}
-Mostly out of curiosity, we wanted to see how our test configurations will perform when training on the movie dataset and testing on the Yelp dataset, an external out-of-domain dataset. We preprocessed the Yelp dataset such that it matched the format of the movie dataset and selected 1000 of each of the 1-5 star rating reviews. For evaluation purposes, we scored the accuracy on only 1-star and 5-star reviews, giving our testbed only high-confidence negative and positive reviews, respectively. The score was simply the average of the two accuracies.
+Mostly out of curiosity, we wanted to see how our test configurations would perform when training on the movie dataset and testing on the Yelp dataset, an external out-of-domain dataset. We preprocessed the Yelp dataset \cite{yelp} so that it matched the format of the movie dataset and selected 1000 reviews at each of the 1-5 star ratings. For evaluation purposes, we scored accuracy on only 1-star and 5-star reviews, giving our testbed only high-confidence negative and positive reviews, respectively. The score was simply the average of the two accuracies.
 
 Across the board, the classifiers had a harder time with the Yelp dataset as compared to the movie dataset, performing between 56.0\% and 75.2\%. The respective lowest and highest performing configurations scored at 67.0\% and 84.0\% on the movie dataset.
 
fpbib.bib (126 changes)
@@ -1,94 +1,64 @@
+@inproceedings{Pang,
+  author =    {Bo Pang and Lillian Lee and Shivakumar Vaithyanathan},
+  title =     {Thumbs up? {Sentiment} Classification using Machine Learning Techniques},
+  booktitle = {Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
+  pages =     {79--86},
+  year =      2002
+}
+
+@inproceedings{Zhang,
+  author =    {Harry Zhang},
+  title =     {The Optimality of Naive Bayes},
+  booktitle = {American Association for Artificial Intelligence},
+  year =      2004
+}
+
+@article{Liu,
+  author =  {Dong C. Liu and Jorge Nocedal},
+  title =   {On the Limited Memory BFGS Method for Large Scale Optimization},
+  journal = {Mathematical Programming},
+  volume =  45,
+  pages =   {503--528},
+  year =    1989
+}
+
+@book{Manning,
+  author =    {Christopher D. Manning and Prabhakar Raghavan and Hinrich Schütze},
+  title =     {Introduction to Information Retrieval},
+  publisher = {Cambridge University Press},
+  year =      2008
+}
+
+@article{Jaynes,
+  author =  {E. T. Jaynes},
+  title =   {Information Theory and Statistical Mechanics},
+  journal = {The Physical Review},
+  volume =  106,
+  year =    1957
+}
+
+@misc{Le,
+  author =       {Zhang Le},
+  title =        {Maximum Entropy Modeling Toolkit for Python and C++},
+  year =         2011,
+  howpublished = {\url{http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html}}
+}
+
+@misc{PyML,
+  author =       {Asa Ben-Hur},
+  title =        {PyML - Machine Learning in Python},
+  year =         2011,
+  howpublished = {\url{http://pyml.sourceforge.net/}}
+}
+
+@misc{qtag,
+  author =       {Oliver Mason},
+  title =        {QTag},
+  howpublished = {\url{http://phrasys.net/uob/om/software}}
+}
+
+@misc{yelp,
+  author =       {Yelp},
+  title =        {Yelp Academic Dataset},
+  howpublished = {\url{http://www.yelp.com/academic_dataset}}
+}
