%%HTML
<link rel="stylesheet" type="text/css" href="custom.css">

<div class="supertitle">
Presentation of the Ph.D. Thesis
</div>

<div class="title">
Sentiment Analysis of German Twitter
</div>

<div class="author">
Uladzimir Sidarenka<br/>
(Wladimir Sidorenko)
</div>

<div class="institution">
University of Potsdam
</div>

<div class="date">
July 12, 2019
</div>

<h1 class="chapter">Chapter I: Introduction</h1>

<strong><em>Sentiment Analysis</em></strong> is a field of knowledge that deals with the analysis of emotions, sentiments, evaluations, and attitudes (Liu, 2012).

<img src="img/may_tears.jpg"/>
<img src="img/trump_anger.jpg"/>
<img src="img/erdogan.jpg"/>
<img src="img/merkel_laugh.jpg"/>
<img src="img/macron_putin.jpg"/>
<img src="img/medvedev_rain.jpg"/>

Main difficulties of sentiment analysis:
* There can be different opinions (good, bad, mixed);
* That can be expressed by different people (children and adults, common people and celebrities);
* On different topics (family, work, politics);
* In different languages (English, Chinese, German);
* In different communication channels (newspapers, emails, social media);
* At different language levels (via words, sentences [utterances], or a complete discourse).

<strong><em>Twitter</em></strong> is an American online news and social networking service on which users post and interact with short messages (up to 140 [280] characters) known as <em>"tweets"</em>. Registered users can post, like, and retweet tweets, whereas unregistered users can only read them. 

**TODO:** German Sentiment + Twitter

# Research Questions

* Can we apply opinion mining methods devised for standard English to German Twitter?

* Which groups of approaches are best suited for which sentiment tasks?

* By how much do word- and discourse-level analyses affect message-level sentiment classification?

* Does text normalization help analyze sentiments?

* Can we do better than existing methods?

In order to answer these questions, we need:

* data;
* baselines;
* solutions.

<h1 class="chapter">Chapter II: Data</h1>

# Challenges

* Get as many sentiments as possible;

* Still try keeping the bias of the data sample low.

# Selection Criteria

* Content-Related Criteria (based on the information contained in the tweet);

* Formal Criteria (based on the form of tweet).

## Content-Related Criteria

As criteria that could help us get more opinions, we considered topic and form of the
tweets, assuming that some subjects, especially social or political issues, would be more
amenable to subjective statements. Because we started creating the corpus in spring 2013,
obvious choices of opinion-rich topics to us were the papal conclave, which took place in
March of that year, and the German federal elections, which were held in autumn. Since
both of these events implied some form of voting, we decided to counterbalance the election
specifics by including general political discussions as the third subject in our dataset. Finally,
to obey the second principle, i.e., to keep the corpus bias low, we sampled the rest of the
data from casual everyday conversations without any prefiltering.

* Federal Elections 2013;
* Papal Conclave 2013;
* General Political Discussions;
* Everyday Conversations.

## Formal Criteria

In the next step, I divided all tweets of the same topic into three groups based on the
following formal criteria:
* I put all messages that contained at least one polar term from the sentiment lexicon
of Remus et al. (2010) into the first group;
* Microblogs that did not satisfy the first condition, but had at least one exclamation
mark or emoticon were allocated to the second group;
* All remaining microblogs were assigned to the third category.
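
The three-way split above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `POLAR_TERMS` is a tiny hypothetical stand-in for the lexicon of Remus et al. (2010), and `EMOTICON_RE` is a rough pattern of my own, not the one used to build the corpus.

```python
import re

# Hypothetical stand-in for the polar terms of Remus et al. (2010);
# the real lexicon is far larger.
POLAR_TERMS = {"gut", "schlecht", "schön", "hässlich"}

# Rough emoticon pattern (an assumption, not the pattern used for the corpus).
EMOTICON_RE = re.compile(r"[:;=][-o*']?[)(\]\[DPp|]")
TOKEN_RE = re.compile(r"\w+")


def formal_group(tweet):
    """Return 1, 2, or 3 according to the three formal criteria."""
    tokens = {tok.lower() for tok in TOKEN_RE.findall(tweet)}
    # Group 1: at least one polar term from the lexicon.
    if tokens & POLAR_TERMS:
        return 1
    # Group 2: no polar term, but an exclamation mark or emoticon.
    if "!" in tweet or EMOTICON_RE.search(tweet):
        return 2
    # Group 3: everything else.
    return 3


for text in ("Das Wetter ist heute schlecht", "Morgen ist Wahl :)", "Morgen ist Wahl"):
    print(formal_group(text), text)
```

Because the checks run in order, a tweet with both a polar term and an exclamation mark lands in the first group, which matches the wording of the second criterion.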

# Annotation Scheme

The annotation scheme for the corpus includes the following elements:

* **sentiments**, which were defined as *polar subjective evaluative opinions about people, entities, or events*, e.g.:
<pre>
   <sentiment>Mir hat die letzte Folge von Games of Thrones gar nicht gefallen.</sentiment>
   <translation><sentiment>I absolutely didn't like the last episode of Game of Thrones.</sentiment></translation>
</pre>
With this element, we associated the following types of attributes, which also had to be specified by the experts:
  * ***polarity*** with possible values *positive*, *negative*, and *comparative*;
  * ***intensity*** with possible values *weak*, *medium*, and *strong*;
  * as well as the boolean attribute ***sarcasm***.

* **targets**, which are entities or events evaluated by opinions.
<pre>
   [Mir hat [die letzte Folge von Games of Thrones]_target gar nicht gefallen.]_sentiment
   [I absolutely didn't like [the last episode of Game of Thrones]_target.]_sentiment
</pre>
This tag has the following attributes:
  * the boolean property ***preferred***, which, however, only had to be used in comparative opinions;
  * and two link attributes: ***anaph-ref***, which connects a pronominal target to its antecedent, and ***sentiment-ref***, which links the target to its respective opinion if the target is located at the intersection of two sentiments.

* **sources**, which mark the immediate authors of opinions.
<pre>
   [[Mir]_source hat [die letzte Folge von Games of Thrones]_target gar nicht gefallen.]_sentiment
   [[I]_source absolutely didn't like [the last episode of Game of Thrones]_target.]_sentiment
</pre>
Sources have only one possible attribute, ***sentiment-ref***, defined in the same way as above.

In addition to the three core opinion-level elements (sentiments, sources, and targets), I also defined a set of word-level items that had to be labeled by the annotators and were supposed to ease automatic sentiment analysis.  These were:

* ***polar terms***, which are defined as words or idioms that have a distinguishable evaluative
lexical meaning (e.g., "ekelhaft" [*disgusting*], "lieben" [*to love*], "Held" [*hero*], "wie die Pest meiden" [*to
avoid like the plague*]).
<pre>
   [[Mir]_source hat [die letzte Folge von Games of Thrones]_target gar nicht [gefallen]_polar_term.]_sentiment
   [[I]_source absolutely didn't [like]_polar_term [the last episode of Game of Thrones]_target.]_sentiment
</pre>
The attributes of these elements are:
  * polarity (*positive*, *negative*);
  * intensity (*weak*, *medium*, *strong*);
  * sarcasm (*true*, *false*);
  * subjective-fact (*true*, *false*);
  * uncertain (*true*, *false*);
  * sentiment-ref.

* ***intensifiers*** and ***diminishers*** are elements that increase (reduce) the expressivity and subjective sense of a polar term (e.g., "sehr" [*very*], "super" [*super*], "stark" [*strongly*], "kaum" [*hardly*]).
The attributes of these elements are:
  * degree (*medium*, *strong*);
  * polar-term-ref.

* and, finally, ***negations***, which are grammatical or lexical means that reverse the semantic orientation of a polar term (e.g., "nicht" [*not*]).
<pre>
   [[Mir]_source hat [die letzte Folge von Games of Thrones]_target gar [nicht]_negation [gefallen]_polar_term.]_sentiment
   [[I]_source absolutely did[n't]_negation [like]_polar_term [the last episode of Game of Thrones]_target.]_sentiment
</pre>
The only attribute of this element is:
  * polar-term.

# Evaluation Metrics

To estimate the inter-annotator agreement, I adopted the popular κ metric (Cohen, 1960). Following standard practice, I computed this metric as:

$$\kappa = \frac{p_o - p_c}{1 - p_c},$$

where $p_o$ denotes the observed agreement and $p_c$ is the agreement by chance.

The **observed agreement** is normally estimated as:
$$p_o = \frac{T - A_1 + M_1 - A_2 + M_2}{T},$$
where $T$ means the total number of tokens; $A_1$ and $A_2$ are the numbers of tokens annotated by the first and second annotator respectively; and $M_1$ and $M_2$ denote the numbers of tokens with matching annotations.

The **agreement by chance** is computed as:
$$p_c = c_1\times c_2 + (1 - c_1)\times(1 - c_2),$$
where $c_1$ and $c_2$ are the proportions of tokens annotated with the given class in the first and
second annotation respectively, i.e., $c_1 = \frac{A_1}{T}$ and $c_2 = \frac{A_2}{T}$.

<p class="annotator1">Annotation 1:</p>
[Mein Vater hasst [dieses schöne Buch]_sentiment .]_sentiment

[My father hates [this nice book]_sentiment .]_sentiment

<p class="annotator2">Annotation 2:</p>
Mein [Vater hasst [dieses schöne Buch]_sentiment .]_sentiment

My [father hates [this nice book]_sentiment .]_sentiment

**Binary $\kappa$**

$A_1 = 10$

$A_2 = 10$

$M_1 = A_1$

$M_2 = A_2$

$\kappa = 1$

**Proportional $\kappa$**

$A_1 = 7$

$A_2 = 6$

$M_1 = 6$

$M_2 = 6$

$\kappa = 0$
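
The formulas for $p_o$, $p_c$, and $\kappa$ translate directly into code. The sketch below is a plain transcription of these three definitions; the function names are my own, and the token count $T = 7$ for the example sentence (six words plus the final period) is an assumption that the slide does not state explicitly.

```python
def observed_agreement(T, A1, M1, A2, M2):
    """p_o = (T - A1 + M1 - A2 + M2) / T, as defined above."""
    return (T - A1 + M1 - A2 + M2) / T


def chance_agreement(T, A1, A2):
    """p_c = c1 * c2 + (1 - c1) * (1 - c2) with c_i = A_i / T."""
    c1, c2 = A1 / T, A2 / T
    return c1 * c2 + (1 - c1) * (1 - c2)


def kappa(T, A1, M1, A2, M2):
    """Cohen's kappa from token counts; undefined when p_c equals 1."""
    p_o = observed_agreement(T, A1, M1, A2, M2)
    p_c = chance_agreement(T, A1, A2)
    return (p_o - p_c) / (1 - p_c)


# Proportional example from above, assuming T = 7 tokens.
print(kappa(T=7, A1=7, M1=6, A2=6, M2=6))
```

With these counts, $p_o$ and $p_c$ both equal $6/7$, so the result is zero, in line with the proportional $\kappa$ shown above.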

# Stage I: Initial Annotation

<table>
  <thead>
      <tr>
    <td>Element</td>
    <td colspan=5>Binary $\kappa$</td>
    <td colspan=5>Proportional $\kappa$</td>
          </tr>
<tr>
<td></td>
<td>$M_1$ </td>
<td>$A_1$ </td>
<td>$M_2$ </td>
<td>$A_2$ </td>
<td>$\kappa$</td>
<td>$M_1$</td>
<td>$A_1$</td>
<td>$M_2$</td>
<td>$A_2$</td>
<td>$\kappa$</td>
</tr>
  </thead>
<tbody>
    <tr>
<td>Sentiment</td>
<td>4,215</td>
<td>7,070</td>
<td>3,484</td>
<td>9,827</td>
<td>38.05</td>
<td>3,269</td>
<td>6,812</td>
<td>3,269</td>
<td>9,796</td>
<td>31.21</td>
    </tr>
    <tr>
<td>Target</td>
<td>1,103</td>
<td>1,943</td>
<td>1,217</td>
<td>4,162</td>
<td>35.48</td>
<td>898</td>
<td>1,905</td>
<td>898</td>
<td>4,148</td>
<td>26.85</td>
    </tr>
    <tr>
<td>Source</td>
<td>159</td>
<td>445</td>
<td>156</td>
<td>456</td>
<td>34.53</td>
<td>153</td>
<td>439</td>
<td>153</td>
<td>456</td>
<td>33.75</td>
    </tr>
    <tr>
<td>Polar Term</td>
<td>1,951</td>
<td>2,854</td>
<td>2,029</td>
<td>3,188</td>
<td>64.29</td>
<td>1,902</td>
<td>2,851</td>
<td>1,902</td>
<td>3,180</td>
<td>61.36</td>
    </tr>
    <tr>
<td>Intensifier</td>
<td>57</td>
<td>101</td>
<td>59</td>
<td>123</td>
<td>51.71</td>
<td>57</td>
<td>101</td>
<td>57</td>
<td>123</td>
<td>50.81</td>
    </tr>
    <tr>
<td>Diminisher</td>
<td>3</td>
<td>10</td>
<td>3</td>
<td>8</td>
<td>33.32</td>
<td>3</td>
<td>10</td>
<td>3</td>
<td>8</td>
<td>33.32</td>
    </tr>
    <tr>
<td>Negation</td>
<td>21</td>
<td>63</td>
<td>21</td>
<td>83</td>
<td>28.69</td>
<td>21</td>
<td>63</td>
<td>21</td>
<td>83</td>
<td>28.69</td>
    </tr>
</tbody>
</table>


# Stage II: Adjudication Step

<table>
  <thead>
      <tr>
    <td>Element</td>
    <td colspan=5>Binary $\kappa$</td>
    <td colspan=5>Proportional $\kappa$</td>
          </tr>
<tr>
<td></td>
<td>$M_1$ </td>
<td>$A_1$ </td>
<td>$M_2$ </td>
<td>$A_2$ </td>
<td>$\kappa$</td>
<td>$M_1$</td>
<td>$A_1$</td>
<td>$M_2$</td>
<td>$A_2$</td>
<td>$\kappa$</td>
</tr>
  </thead>
<tbody>
<tr>
<td>Sentiment</td>
<td>8,198</td>
<td>8,530</td>
<td>8,260</td>
<td>14,034</td>
<td>67.92</td>
<td>7,435</td>
<td>8,243</td>
<td>7,435</td>
<td>13,714</td>
<td>61.94</td>
</tr>
<tr>
<td>Target</td>
<td>3,088</td>
<td>3,407</td>
<td>2,814</td>
<td>5,303</td>
<td>65.66</td>
<td>2,554</td>
<td>3,326</td>
<td>2,554</td>
<td>5,212</td>
<td>57.27</td>
</tr>
<tr>
<td>Source</td>
<td>573</td>
<td>690</td>
<td>545</td>
<td>837</td>
<td>72.91</td>
<td>539</td>
<td>676</td>
<td>539</td>
<td>833</td>
<td>71.12</td>
</tr>
<tr>
<td>Polar Term</td>
<td>3,164</td>
<td>3,298</td>
<td>3,261</td>
<td>4,134</td>
<td>85.68</td>
<td>3,097</td>
<td>3,290</td>
<td>3,097</td>
<td>4,121</td>
<td>82.64</td>
</tr>
<tr>
<td>Intensifier</td>
<td>111</td>
<td>219</td>
<td>113</td>
<td>180</td>
<td>56.01</td>
<td>111</td>
<td>219</td>
<td>111</td>
<td>180</td>
<td>55.51</td>
</tr>
<tr>
<td>Diminisher</td>
<td>9</td>
<td>16</td>
<td>10</td>
<td>16</td>
<td>59.37</td>
<td>9</td>
<td>16</td>
<td>9</td>
<td>15</td>
<td>58.05</td>
</tr>
<tr>
<td>Negation</td>
<td>68</td>
<td>84</td>
<td>67</td>
<td>140</td>
<td>60.21</td>
<td>67</td>
<td>83</td>
<td>67</td>
<td>140</td>
<td>60.03</td>
</tr>
</tbody>
</table>

# Stage III: Final Annotation

<table>
  <thead>
      <tr>
    <td>Element</td>
    <td colspan=5>Binary $\kappa$</td>
    <td colspan=5>Proportional $\kappa$</td>
          </tr>
<tr>
<td></td>
<td>$M_1$ </td>
<td>$A_1$ </td>
<td>$M_2$ </td>
<td>$A_2$ </td>
<td>$\kappa$</td>
<td>$M_1$</td>
<td>$A_1$</td>
<td>$M_2$</td>
<td>$A_2$</td>
<td>$\kappa$</td>
</tr>
  </thead>
<tbody>
<tr>
<td>Sentiment</td>
<td>14,748</td>
<td>15,929</td>
<td>14,969</td>
<td>26,047</td>
<td>65.03</td>
<td>13,316</td>
<td>15,375</td>
<td>13,316</td>
<td>25,352</td>
<td>58.82</td>
</tr>
<tr>
<td>Target</td>
<td>5,765</td>
<td>6,629</td>
<td>5,292</td>
<td>9,852</td>
<td>64.76</td>
<td>4,789</td>
<td>6,462</td>
<td>4,789</td>
<td>9,659</td>
<td>56.61</td>
</tr>
<tr>
<td>Source</td>
<td>966</td>
<td>1,207</td>
<td>910</td>
<td>1,619</td>
<td>65.99</td>
<td>898</td>
<td>1,180</td>
<td>898</td>
<td>1,604</td>
<td>64.1</td>
</tr>
<tr>
<td>Polar Term</td>
<td>5,574</td>
<td>5,989</td>
<td>5,659</td>
<td>7,419</td>
<td>82.83</td>
<td>5,441</td>
<td>5,977</td>
<td>5,441</td>
<td>7,395</td>
<td>80.29</td>
</tr>
<tr>
<td>Intensifier</td>
<td>192</td>
<td>432</td>
<td>194</td>
<td>338</td>
<td>49.97</td>
<td>192</td>
<td>432</td>
<td>192</td>
<td>338</td>
<td>49.71</td>
</tr>
<tr>
<td>Diminisher</td>
<td>16</td>
<td>30</td>
<td>17</td>
<td>34</td>
<td>51.55</td>
<td>16</td>
<td>30</td>
<td>16</td>
<td>33</td>
<td>50.78</td>
</tr>
<tr>
<td>Negation</td>
<td>111</td>
<td>132</td>
<td>110</td>
<td>243</td>
<td>58.87</td>
<td>110</td>
<td>131</td>
<td>110</td>
<td>242</td>
<td>58.92</td>
</tr>
</tbody>
</table>

# Inter-Annotator Agreement on Attributes

<table class="narrow">
  <thead>
    <tr>
      <td>Element</td>
      <td>Polarity $\kappa$</td>
      <td>Intensity $\alpha$</td>
    </tr>
  </thead>
<tbody>
<tr>
<td>Sentiment</td>
<td>58.8 </td>
<td>73.54</td>
</tr>
<tr>
<td>Polar Term</td>
<td>87.12</td>
<td>78.79</td>
</tr>
</tbody>
</table>

# Qualitative Analysis

<div class="example">
  <p class="annotator1">Annotation 1:</p>
  @TinaPannes immerhin ist die #afd nicht dabei ,
  <div class="translation">@TinaPannes at least the #afd is not part of it ,</div>

  <p class="annotator2">Annotation 2:</p>
  @TinaPannes <sentiment><target>immerhin ist die #afd nicht dabei</target> ,</sentiment>

  <div class="translation">@TinaPannes <sentiment><target>at least the #afd is not part of it</target> ,</sentiment></div>
</div>

<example>
<p class="annotator1">Annotation 1:</p>
<sentiment>Koalition wirft der SPD <target>Blockadehaltung</target> vor</sentiment>
<div class="translation"><sentiment>Coalition accuses the SPD of <target>blocking politics</target></sentiment></div>

<p class="annotator2">Annotation 2:</p>
<sentiment>Koalition wirft <target>der SPD</target> Blockadehaltung vor</sentiment>
<div class="translation"><sentiment>Coalition accuses <target>the SPD</target> of blocking politics</sentiment></div>
</example>

<example>
<p class="annotator1">Annotation 1:</p>
Syrien vor dem Angriff&mdash;bringen diese Bomben den Frieden?

<div class="translation">Syria facing an attack&mdash;will these bombs bring peace?</div>

<p class="annotator2">Annotation 2:</p>
Syrien vor dem <polarterm>Angriff</polarterm>&mdash;bringen diese <polarterm>Bomben</polarterm> den <polarterm>Frieden</polarterm>?

<div class="translation">Syria facing an <polarterm>attack</polarterm>&mdash;will these <polarterm>bombs</polarterm> bring <polarterm>peace</polarterm>?</div>
</example>

# Effect of the Selection Criteria

<img src="images/sentiment_stat.png" class="heatmap"><img src="images/emo-expression_stat.png" class="heatmap"/>

# Effect of the Selection Criteria

<img src="images/sentiment_agreement.png" class="heatmap"/><img src="images/emo-expression_agreement.png" class="heatmap"/>

# Effect of the Selection Criteria

<table>
<thead>
<tr>
<td rowspan="2">Selection Criteria</td>
<td colspan="4">Correlation Coefficients</td>
</tr>
<tr>
<td> # of elements </td>
<td> agreement </td>
<td> # of elements </td>
<td> agreement</td>
</tr>
</thead>
<tbody>
<tr>
<td>Federal Elections</td>
<td>0.312</td>
<td>0.169</td>
<td>0.356</td>
<td>0.289</td>
</tr>
<tr>
<td>Papal Conclave</td>
<td>0.149</td>
<td>0.124</td>
<td>0.182</td>
<td>0.264</td>
</tr>
<tr>
<td>Political Discussions</td>
<td>0.195</td>
<td>0.148</td>
<td>0.218</td>
<td>0.244</td>
</tr>
<tr>
<td>General Conversations</td>
<td>0.183</td>
<td>0.19</td>
<td>0.372</td>
<td>0.452</td>
</tr>
<tr>
<td>Polar Terms</td>
<td>0.445</td>
<td>0.352</td>
<td>0.38</td>
<td>0.301</td>
</tr>
<tr>
<td>Emoticons</td>
<td>0.127</td>
<td>0.096</td>
<td>0.47</td>
<td>0.615</td>
</tr>
<tr>
<td>Random</td>
<td>0.216</td>
<td>0.134</td>
<td>0.143</td>
<td>0.138</td>
</tr>
</tbody>
</table>

# Summary

* I have presented a comprehensive, manually labeled corpus of 7,992 German tweets;
* All microblogs in this collection were sampled from four different topics (political elections, papal conclave, general political conversations, and everyday small talk);
* Afterwards, two human experts annotated these messages with sentiments, sources, targets, polar terms, their intensifiers, diminishers, and negations;
* As it turned out, marking these elements poses a significant challenge even to professional linguists, with their mutual inter-annotator agreement on sentiments hardly reaching 35% (which is generally considered low);
* These difficulties, however, can be largely reduced if we let the annotators resolve their conflicting cases (the IAA on sentiments, in this case, rises to about 62% for proportional $\kappa$);
* After adjudication, the inter-rater reliability remains at a consistently high level (sentiment IAA $\approx$ 59%);
* The remaining disagreement cases mostly represent ambiguous or controversial statements;
* Finally, we could see a significant correlation between the initial selection criteria and the number and reliability of annotated sentiments.

<h1 class="chapter">Chapter III: Lexicons</h1>

# Lexicon Types

Main types of sentiment lexicons are:

* manual;

* semi-automatic;

* automatic:
  * dictionary-based;
  * corpus-based;
  * word-embedding&ndash;based ones.

# Semi-Automatic Lexicons

* **German Polarity Clues** (GPC; Waltinger, 2010), which contains 10,141 polar terms
from the English sentiment lexicons Subjectivity Clues (Wilson et al., 2005) and
SentiSpin (Takamura et al., 2005) that were automatically translated into German and
then manually revised by the author. Apart from that, Waltinger also manually
enriched these translations with their frequent synonyms and 290 negated phrases;

* **SentiWS** (SWS; Remus et al., 2010), which includes 1,818 positively and 1,650
negatively connoted terms along with their part-of-speech tags and inflections, which results
in a total of 32,734 word forms. As in the previous case, the authors obtained the initial
entries for their resource by translating an English polarity list (the General Inquirer
lexicon) and then manually correcting these translations. In addition to this, they
expanded the translated set with words and phrases that frequently co-occurred with
positive and negative seed terms in a corpus of 10,200 customer reviews or in the
German Collocation Dictionary (Quasthoff, 2010);

* and, finally, the only lexicon that was not obtained through translation, the
**Zurich Polarity List** (ZPL; Clematide and Klenner, 2010), which features 8,000
subjective entries extracted from GermaNet synsets (Hamp and Feldweg, 1997). These
synsets had been manually annotated by human experts with their prior polarities.
Since the authors, however, found the number of polar adjectives obtained this way
to be insufficient for their classification experiments, they automatically enriched this
lexicon with more attributive terms, using the collocation method of Hatzivassiloglou
and McKeown (1997).

# Results of Semi-Automatic Lexicons

<table>
<thead>
<tr>
<td rowspan="2">Lexicon</td>
<td colspan="3">Positive Expressions</td>
<td colspan="3">Negative Expressions</td>
<td colspan="3">Neutral Terms</td>
<td rowspan="2">Macro-$F_1$</td>
<td rowspan="2">Micro-$F_1$</td>
</tr>
<tr>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
</tr>
</thead>
<tr>
<td>GPC</td>
<td>0.209</td>
<td>0.535</td>
<td>0.301</td>
<td>0.195</td>
<td>0.466</td>
<td>0.275</td>
<td>0.983</td>
<td>0.923</td>
<td>0.952</td>
<td>0.509</td>
<td>0.906 </td>
</tr>
<tr>
<td>SWS</td>
<td>0.335</td>
<td>0.435</td>
<td>0.379</td>
<td>0.484</td>
<td>0.344</td>
<td>0.402</td>
<td>0.977</td>
<td>0.975</td>
<td>0.976</td>
<td>0.586</td>
<td>0.952</td>
</tr>
<tr>
<td>ZPL</td>
<td>0.411</td>
<td>0.424</td>
<td>0.417</td>
<td>0.38</td>
<td>0.352</td>
<td>0.366</td>
<td>0.977</td>
<td>0.979</td>
<td>0.978</td>
<td>0.587</td>
<td>0.955 </td>
</tr>
<tr>
<td>GPC $\cap$ SWS $\cap$ ZPL</td>
<td><b>0.527</b></td>
<td>0.372</td>
<td><b>0.436</b></td>
<td><b>0.618</b></td>
<td>0.244</td>
<td>0.35</td>
<td>0.973</td>
<td><b>0.99</b></td>
<td><b>0.982</b></td>
<td><b>0.589</b></td>
<td><b>0.964</b></td>
</tr>
<tr>
<td>GPC $\cup$ SWS $\cup$ ZPL</td>
<td>0.202</td>
<td>0.562</td>
<td>0.297</td>
<td>0.195</td>
<td>0.532</td>
<td>0.286</td>
<td>0.985</td>
<td>0.917</td>
<td>0.95</td>
<td>0.51</td>
<td>0.901 </td>
</tr>
</table>
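
As a reading aid for the table above, the following sketch recomputes two cells of the GPC row. It assumes the standard definition of $F_1$ as the harmonic mean of precision and recall and takes macro-$F_1$ as the unweighted mean of the three class-wise scores; the function names are mine. Micro-$F_1$ cannot be recomputed this way because it needs the underlying token counts, which the table does not show.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(class_f1_scores):
    """Unweighted mean of the per-class F1 scores."""
    return sum(class_f1_scores) / len(class_f1_scores)


# GPC row: precision and recall of positive expressions,
# then the three class-wise F1 scores (positive, negative, neutral).
gpc_positive_f1 = f1(0.209, 0.535)
gpc_macro_f1 = macro_f1([0.301, 0.275, 0.952])
print(round(gpc_positive_f1, 3), round(gpc_macro_f1, 3))  # 0.301 0.509
```

Both values agree with the GPC row, which suggests that the macro-average in the table weights the positive, negative, and neutral classes equally.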

# Dictionary-Based Lexicons

* Hu and Liu (2004);

* Blair-Goldensohn et al. (2008);

* Kim and Hovy (2004, 2006);

* Esuli and Sebastiani (2006a);

* Rao and Ravichandran (2009).

# Results of Dictionary-Based Methods

<table>
<thead>
<tr>
<td rowspan="2">Lexicon</td>
<td rowspan="2"># of Terms</td>
<td colspan="3">Positive Expressions</td>
<td colspan="3">Negative Expressions</td>
<td colspan="3">Neutral Terms</td>
<td rowspan="2">Macro-$F_1$</td>
<td rowspan="2">Micro-$F_1$</td>
</tr>
<tr>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
</tr>
</thead>
<tr>
<td>Seed Set</td>
<td>20</td>
<td>0.771</td>
<td>0.102</td>
<td>0.18</td>
<td>0.568</td>
<td>0.017</td>
<td>0.033</td>
<td>0.963</td>
<td>0.999</td>
<td>0.981</td>
<td>0.398</td>
<td>0.962</td>
</tr>
<tr>
<td>HL</td>
<td>5,745</td>
<td>0.161</td>
<td>0.266</td>
<td>0.2</td>
<td>0.2</td>
<td>0.133</td>
<td>0.16</td>
<td>0.969</td>
<td>0.96</td>
<td>0.965</td>
<td>0.442</td>
<td>0.93</td>
</tr>
<tr>
<td>BG</td>
<td>1,895</td>
<td>0.503</td>
<td>0.232</td>
<td>0.318</td>
<td>0.285</td>
<td>0.093</td>
<td>0.14</td>
<td>0.968</td>
<td>0.991</td>
<td>0.979</td>
<td>0.479</td>
<td>0.959</td>
</tr>
<tr>
<td>KH</td>
<td>356</td>
<td>0.716</td>
<td>0.159</td>
<td>0.261</td>
<td>0.269</td>
<td>0.044</td>
<td>0.076</td>
<td>0.965</td>
<td>0.997</td>
<td>0.981</td>
<td>0.439</td>
<td>0.962</td>
</tr>
<tr>
<td>ES</td>
<td>39,181</td>
<td>0.042</td>
<td>0.564</td>
<td>0.078</td>
<td>0.033</td>
<td>0.255</td>
<td>0.059</td>
<td>0.981</td>
<td>0.689</td>
<td>0.81</td>
<td>0.315</td>
<td>0.644</td>
</tr>
<tr>
<td>RR$_{mincut}$</td>
<td>8,060</td>
<td>0.07</td>
<td>0.422</td>
<td>0.12</td>
<td>0.216</td>
<td>0.073</td>
<td>0.109</td>
<td>0.972</td>
<td>0.873</td>
<td>0.92</td>
<td>0.383</td>
<td>0.849</td>
</tr>
<tr>
<td>RR$_{lbl-prop}$</td>
<td>1,105</td>
<td>0.567</td>
<td>0.176</td>
<td>0.269</td>
<td>0.571</td>
<td>0.046</td>
<td>0.085</td>
<td>0.965</td>
<td>0.997</td>
<td>0.981</td>
<td>0.445</td>
<td>0.962</td>
</tr>
<tr>
<td>AR</td>
<td>23</td>
<td>0.768</td>
<td>0.1</td>
<td>0.176</td>
<td>0.568</td>
<td>0.017</td>
<td>0.033</td>
<td>0.963</td>
<td>0.999</td>
<td>0.981</td>
<td>0.397</td>
<td>0.962</td>
</tr>
<tr>
<td>HL $\cap$ BG $\cap$ RR$_{lbl-prop}$</td>
<td>752</td>
<td>0.601</td>
<td>0.165</td>
<td>0.259</td>
<td>0.567</td>
<td>0.045</td>
<td>0.084</td>
<td>0.965</td>
<td>0.997</td>
<td>0.981</td>
<td>0.441</td>
<td>0.962</td>
</tr>
<tr>
<td>HL $\cup$ BG $\cup$ RR$_{lbl-prop}$</td>
<td>6,258</td>
<td>0.166</td>
<td>0.288</td>
<td>0.21</td>
<td>0.191</td>
<td>0.146</td>
<td>0.165</td>
<td>0.97</td>
<td>0.958</td>
<td>0.964</td>
<td>0.446</td>
<td>0.929</td>
</tr>
</table>

# Corpus-Based Lexicons

* Takamura et al. (2005);

* Velikovich et al. (2010);

* Kiritchenko et al. (2014);

* Severyn and Moschitti (2015).

# Results of Corpus-Based Methods

<table>
<thead>
<tr>
<td rowspan="2">Lexicon</td>
<td rowspan="2"># of Terms</td>
<td colspan="3">Positive Expressions</td>
<td colspan="3">Negative Expressions</td>
<td colspan="3">Neutral Terms</td>
<td rowspan="2">Macro-$F_1$</td>
<td rowspan="2">Micro-$F_1$</td>
</tr>
<tr>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
</tr>
</thead>
<tr>
<td>Seed Set</td>
<td>20</td>
<td>0.771</td>
<td>0.102</td>
<td>0.18</td>
<td>0.568</td>
<td>0.017</td>
<td>0.033</td>
<td>0.963</td>
<td>0.999</td>
<td>0.981</td>
<td>0.398</td>
<td>0.962</td>
</tr>
<tr>
<td>TKM</td>
<td>920</td>
<td>0.646</td>
<td>0.134</td>
<td>0.221</td>
<td>0.565</td>
<td>0.029</td>
<td>0.055</td>
<td>0.964</td>
<td>0.998</td>
<td>0.981</td>
<td>0.419</td>
<td>0.962</td>
</tr>
<tr>
<td>VEL</td>
<td>60</td>
<td>0.764</td>
<td>0.102</td>
<td>0.18</td>
<td>0.568</td>
<td>0.017</td>
<td>0.033</td>
<td>0.963</td>
<td>0.999</td>
<td>0.98</td>
<td>0.398</td>
<td>0.962</td>
</tr>
<tr>
<td>KIR</td>
<td>320</td>
<td>0.386</td>
<td>0.106</td>
<td>0.166</td>
<td>0.568</td>
<td>0.017</td>
<td>0.033</td>
<td>0.963</td>
<td>0.996</td>
<td>0.979</td>
<td>0.393</td>
<td>0.959</td>
</tr>
<tr>
<td>SEV</td>
<td>60</td>
<td>0.68</td>
<td>0.102</td>
<td>0.177</td>
<td>0.568</td>
<td>0.017</td>
<td>0.033</td>
<td>0.963</td>
<td>0.999</td>
<td>0.981</td>
<td>0.397</td>
<td>0.962</td>
</tr>
<tr>
<td>TKM $\cap$ VEL $\cap$ SEV</td>
<td>20</td>
<td>0.771</td>
<td>0.102</td>
<td>0.18</td>
<td>0.568</td>
<td>0.017</td>
<td>0.033</td>
<td>0.963</td>
<td>0.999</td>
<td>0.981</td>
<td>0.398</td>
<td>0.962</td>
</tr>
<tr>
<td>TKM $\cup$ VEL $\cup$ SEV</td>
<td>1,020</td>
<td>0.593</td>
<td>0.134</td>
<td>0.218</td>
<td>0.565</td>
<td>0.029</td>
<td>0.055</td>
<td>0.964</td>
<td>0.998</td>
<td>0.98</td>
<td>0.418</td>
<td>0.962</td>
</tr>
</table>

# NWE-Based Lexicons

* Tang et al. (2014);

* Vo and Zhang (2016);

* <div class="new">Nearest Centroids;</div>

* <div class="new">$k$-Nearest Neighbors;</div>

* <div class="new">Principal Component Analysis;</div>

* <div class="new">Linear Projection.</div>

# Results of NWE-Based Methods

<table>
<thead>
<tr>
<td rowspan="2">Lexicon</td>
<td rowspan="2"># of Terms</td>
<td colspan="3">Positive Expressions</td>
<td colspan="3">Negative Expressions</td>
<td colspan="3">Neutral Terms</td>
<td rowspan="2">Macro-$F_1$</td>
<td rowspan="2">Micro-$F_1$</td>
</tr>
<tr>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
</tr>
</thead>
<tr>
<td>Seed Set</td>
<td>20</td>
<td>0.771</td>
<td>0.102</td>
<td>0.18</td>
<td>0.568</td>
<td>0.017</td>
<td>0.033</td>
<td>0.963</td>
<td>0.999</td>
<td>0.981</td>
<td>0.398</td>
<td>0.962</td>
</tr>
<tr>
<td>TNG</td>
<td>1,600</td>
<td>0.088</td>
<td>0.153</td>
<td>0.112</td>
<td>0.193</td>
<td>0.155</td>
<td>0.172</td>
<td>0.966</td>
<td>0.953</td>
<td>0.959</td>
<td>0.414</td>
<td>0.921</td>
</tr>
<tr>
<td>VO</td>
<td>40</td>
<td>0.117</td>
<td>0.115</td>
<td>0.116</td>
<td>0.541</td>
<td>0.017</td>
<td>0.033</td>
<td>0.963</td>
<td>0.98</td>
<td>0.971</td>
<td>0.374</td>
<td>0.944</td>
</tr>
<tr>
<td>NC</td>
<td>5,200</td>
<td>0.771</td>
<td>0.102</td>
<td>0.18</td>
<td>0.568</td>
<td>0.017</td>
<td>0.033</td>
<td>0.963</td>
<td>0.999</td>
<td>0.981</td>
<td>0.398</td>
<td>0.962</td>
</tr>
<tr>
<td>$k$-NN</td>
<td>420</td>
<td>0.486</td>
<td>0.182</td>
<td>0.265</td>
<td>0.65</td>
<td>0.091</td>
<td>0.16</td>
<td>0.966</td>
<td>0.995</td>
<td>0.98</td>
<td>0.468</td>
<td>0.961</td>
</tr>
<tr>
<td>PCA</td>
<td>40</td>
<td>0.771</td>
<td>0.102</td>
<td>0.18</td>
<td>0.529</td>
<td>0.017</td>
<td>0.033</td>
<td>0.963</td>
<td>0.999</td>
<td>0.981</td>
<td>0.398</td>
<td>0.962</td>
</tr>
<tr>
<td>LP</td>
<td>6,340</td>
<td>0.741</td>
<td>0.156</td>
<td>0.257</td>
<td>0.436</td>
<td>0.088</td>
<td>0.147</td>
<td>0.966</td>
<td>0.998</td>
<td>0.982</td>
<td>0.462</td>
<td>0.963</td>
</tr>
</table>

## Effect of Word Embeddings

* word2vec;

* task-specific;

* task-specific + word2vec;

* task-specific least-squares mapping.

## Effect of Word Embeddings

<img src="img/potts_embeddings.png" alt="t-SNE visualization of different word-embedding types"/>

## Effect of Word Embeddings

<table>
<thead>
<tr>
<td rowspan="2">Lexicon</td>
<td colspan="4">Embedding Type</td>
</tr>
<tr>
<td>word2vec</td>
<td>task-specific + word2vec</td>
<td>task-specific + least squares</td>
<td>task-specific</td>
</tr>
</thead>
<tbody>
<tr>
<td>NC</td>
<td>0.398</td>
<td>0.398</td>
<td><div class="best">0.401</div></td>
<td>0.399</td>
</tr>
<tr>
<td>$k$-NN</td>
<td><div class="best">0.468</div></td>
<td>0.43</td>
<td>0.398</td>
<td>0.392</td>
</tr>
<tr>
<td>PCA</td>
<td>0.398</td>
<td>0.398</td>
<td>0.404</td>
<td><div class="best">0.409</div></td>
</tr>
<tr>
<td>LP</td>
<td><div class="best">0.462</div></td>
<td>0.441</td>
<td>0.398</td>
<td>0.399</td>
</tr>
</tbody>
</table>

## Effect of Vector Normalization

* mean and length normalization: $\vec{v}^* = \frac{\frac{\vec{v}}{\left\lVert\vec{v}\right\rVert} - \vec{\mu}^*}{\vec{\sigma}^*}$;

* mean normalization: $\vec{v}^* = \frac{\vec{v} - \vec{\mu}}{\vec{\sigma}}$;

* length normalization: $\vec{v}^* = \frac{\vec{v}}{\left\lVert\vec{v}\right\rVert}$;

* no normalization.
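
The three normalization variants listed above can be sketched in plain Python. Two points are assumptions on my part rather than something the slide states: $\vec{\mu}$ and $\vec{\sigma}$ are taken per dimension over all vectors of the vocabulary, and $\vec{\sigma}$ is the population standard deviation. In the combined variant, $\vec{\mu}^*$ and $\vec{\sigma}^*$ are computed on the length-normalized vectors.

```python
import math


def length_normalize(v):
    """v / ||v||; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_normalize(vectors):
    """(v - mu) / sigma, with mu and sigma computed per dimension."""
    n = len(vectors)
    dims = range(len(vectors[0]))
    mu = [sum(v[d] for v in vectors) / n for d in dims]
    sigma = [
        math.sqrt(sum((v[d] - mu[d]) ** 2 for v in vectors) / n) for d in dims
    ]
    # Constant dimensions (sigma == 0) are mapped to 0 to avoid division by zero.
    return [
        [(v[d] - mu[d]) / sigma[d] if sigma[d] > 0 else 0.0 for d in dims]
        for v in vectors
    ]


def mean_and_length_normalize(vectors):
    """Length normalization first, then mean normalization of the result."""
    return mean_normalize([length_normalize(v) for v in vectors])


toy_vectors = [[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]]
print(length_normalize(toy_vectors[0]))
print(mean_and_length_normalize(toy_vectors))
```

After the combined normalization, every dimension has zero mean and unit variance over the toy vocabulary, while the individual vectors no longer have unit length.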

## Effect of Vector Normalization

<table>
<thead>
<tr>
<td rowspan="2">SLG Method</td>
<td colspan="4">Vector Normalization</td>
</tr>
<tr>
<td>mean normalization + length normalization</td>
<td>mean normalization</td>
<td>length normalization</td>
<td>no normalization</td>
</tr>
</thead>
<tbody>
<tr>
<td>NC</td>
<td>0.398</td>
<td>0.398</td>
<td>0.398</td>
<td>0.398</td>
</tr>
<tr>
<td>$k$-NN</td>
<td>0.468</td>
<td>0.418</td>
<td>0.467</td>
<td>0.417</td>
</tr>
<tr>
<td>PCA</td>
<td>0.398</td>
<td>0.396</td>
<td>0.398</td>
<td>0.396</td>
</tr>
<tr>
<td>LP</td>
<td>0.462</td>
<td>0.416</td>
<td>0.461</td>
<td>0.442</td>
</tr>
</tbody>
</table>

## Effect of Seed Sets

<table>
<thead>
<tr>
<td>Seed Set</td>
<td>Cardinality</td>
<td>Part of Speech</td>
<td>Examples</td>
</tr>
</thead>
<tbody>
<tr>
<td>Hu and Liu (2004)</td>
<td>14 positive, 15 negative, and 10 neutral terms</td>
<td>adjectives</td>
<td class="example">fantastisch, lieb, sympathisch, böse, dumm, schwierig</td>
</tr>
<tr>
<td>Kim and Hovy (2004)</td>
<td>60 positive, 60 negative, and 60 neutral terms</td>
<td>any</td>
<td class="example">fabelhaft, Hoffnung, lieben, hässlich, Missbrauch, töten</td>
</tr>
<tr>
<td>Esuli and Sebastiani (2006)</td>
<td>16 positive, 35 negative, and 4,122 neutral terms</td>
<td>any</td>
<td class="example">angenehm, ausgezeichnet, freundlich, arm, bedauernswert, dürftig</td>
</tr>
<tr>
<td>Remus et al. (2010)</td>
<td>12 positive, 12 negative, and 10 neutral terms</td>
<td>adjectives</td>
<td class="example">gut, schön, richtig, schlecht, unschön, falsch</td>
</tr>
</tbody>
</table>


## Effect of Seed Sets on Dictionary-Based Methods
<img src="img/sentilex-dict-alt-seed-sets.png" alt="Effect of Seed Sets on Dictionary-Based Methods"/>

## Effect of Seed Sets on Corpus-Based Methods
<img src="img/sentilex-crp-alt-seed-sets.png" alt="Effect of Seed Sets on Corpus-Based Methods"/>

## Effect of Seed Sets on NWE-Based Methods
<img src="img/sentilex-nwe-alt-seed-sets.png" alt="Effect of Seed Sets on NWE-Based Methods"/>

# Summary

* semi-automatic translations of common English polarity lists
  notably outperform purely automatic SLG methods, which are applied
  to German data directly;

* despite their allegedly worse ability to accommodate new
  domains, dictionary-based approaches are still better than
  corpus-based systems;

* a potential weakness of these algorithms though is their
  dependence on manually
  annotated linguistic resources, which might not necessarily be
  present for every language;

* in this regard, a viable alternative to dictionary-based methods
  are SLG systems that induce polar lexicons from neural word
  embeddings;

* with at least two of these methods ($k$-NN and linear
  projection), I was able to establish a new state of the art for
  the macro- and micro-averaged $F_1$-scores of automatically induced
  sentiment lexicons;

* I also checked how different types of embeddings affected the
  performance of NWE-based SLG systems, noticing that the $k$-NN and
  linear projection methods worked best with standard word2vec
  vectors, while nearest centroids and PCA yielded better results when
  using task-specific representations;

* all NWE-based approaches benefit from
  mean-scaling and length normalization of the input vectors, getting
  an improvement by up to 5% in their macro-averaged $F_1$-scores;

* finally, an extensive evaluation of various sets of seed terms
  revealed that the results of almost all tested SLG algorithms
  crucially depend on the quality of their initial seeds, with larger
  and more balanced seed sets typically leading to much higher scores.

# Chapter IV: Aspect-Based Sentiment Analysis

## Task

Given an input sentence $x_1, x_2, \ldots, x_n$, we need to automatically find textual spans of sentiments, sources, and targets, i.e., to assign a label $y_i\in\{\mathrm{SNT}, \mathrm{SRC}, \mathrm{TRG}, \mathrm{NON}\}$ to each token $x_i$ in the sentence.

<div class="example">
  TODO provide an example
  <div class="translation">TODO</div>
</div>

Since this problem involves simultaneous prediction of tags for multiple inter-connected random variables ($y_1, y_2, \ldots, y_n$), it is commonly considered as a structured-prediction task, viz. as a sequence-labeling problem (SLP), and addressed with two common SLP methods:

* probabilistic graphical models (conditional random fields [CRFs]);

* recurrent neural networks (long short-term memory [LSTM] or gated recurrent units [GRUs]).

# Evaluation Metrics

<div class="example">
  TODO: provide an example
  <div class="translation">TODO</div>
</div>

Possible ways to measure the prediction quality of label spans:

1. Exact match;

2. Binary overlap;

3. Proportional overlap (Johansson and Moschitti, 2010):
Given two sets of manually and automatically tagged spans ($\mathcal{S}$ and
$\widehat{\mathcal{S}}$, respectively), Johansson and Moschitti
estimate the precision of automatic assignment as:
$$P(\mathcal{S}, \widehat{\mathcal{S}}) = \frac{C(\mathcal{S},
    \widehat{\mathcal{S}})}{|\widehat{\mathcal{S}}|},$$
where $C(\mathcal{S},\widehat{\mathcal{S}})$ stands for the proportion
of overlapping tokens across all pairs of manually ($s_i$) and
automatically ($s_j$) annotated spans:
$$C(\mathcal{S}, \widehat{\mathcal{S}}) = \sum_{s_i \in
    \mathcal{S}}\sum_{s_j \in \widehat{\mathcal{S}}}c(s_i, s_j),$$
and the $|\widehat{\mathcal{S}}|$ term denotes the total number of
spans automatically labeled with the given tag.
Similarly, the recall of this assignment is estimated as:
$$R(\mathcal{S}, \widehat{\mathcal{S}}) = \frac{C(\mathcal{S},
    \widehat{\mathcal{S}})}{|\mathcal{S}|}.$$
Using these two values, one can normally compute the $F_1$-measure as:
$$F_1 = 2\times\frac{P \times R}{P + R}.$$
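
The proportional-overlap scores can be illustrated with a minimal Python sketch. It assumes that the pairwise term $c(s_i, s_j)$ is the number of shared tokens divided by the length of the span being judged (the predicted span for precision, the gold span for recall), which is one reading of Johansson and Moschitti's definition. The function names and the toy spans are hypothetical and are not taken from the thesis code.

```python
def span_coverage(s, s_ref):
    """Fraction of the tokens of s_ref that are also covered by s (assumed c)."""
    return len(s & s_ref) / len(s_ref)


def set_coverage(spans_a, spans_b):
    """Sum of pairwise coverages c(a, b) over all pairs of spans."""
    return sum(span_coverage(a, b) for a in spans_a for b in spans_b)


def proportional_prf(gold, pred):
    """Proportional-overlap precision, recall, and F1 for a single tag."""
    precision = set_coverage(gold, pred) / len(pred) if pred else 0.0
    recall = set_coverage(pred, gold) / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Spans are sets of token indices (toy data, not taken from the corpus).
gold = [frozenset({0, 1, 2, 3})]
pred = [frozenset({2, 3, 4, 5}), frozenset({8})]
p, r, f = proportional_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

In this toy case, one predicted span covers half of the gold span and the other misses it entirely, so the metric gives partial credit where exact match would give none.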

# Conditional Random Fields

TODO: Definition

Conditional random fields are an undirected discriminative probabilistic graphical model, which tries to find the most likely 

## Features

For my experiments, I devised the following types of features:
* **formal**, which included:
  * the initial three characters of each token,
  * its last three characters,
  * general spelling class of the word (e.g., alphanumeric, digit, or punctuation);
* **morphological**, which encompassed:
  * the part-of-speech tag of the analyzed token,
  * grammatical case and gender of inflectable PoS types,
  * degree of comparison for adjectives,
  * mood, tense, and person forms of verbs;
* **lexical**, which comprised:
  * the actual lemma and form of the token (using one-hot encoding),
  * its polarity class (positive, negative, or neutral), obtained from the Zurich Polarity Lexicon (Clematide and Klenner, 2010);
* **syntactic**, which were:
  * the dependency relation via which token $x_i$ was connected to its parent,
  * two binary attributes that showed whether the previous token in the sentence was the parent or the child of the current word,
  * the dependency relation of the previous token in the sentence to its parent + the dependency relation of the current token to its ancestor,
  * the dependency link of the next token + the dependency relation of the current token to its parent;
* **lexico-syntactic**, which included:
  * the lemma of syntactic parent;
  * the part-of-speech tag and polarity class of the grandparent in the syntactic tree;
  * the lemma of the child node + dependency relation between the current token and its child;
  * the PoS tag of the child node + its dependency relation + the PoS tag of the current token;
  * the lemma of the child node + its dependency relation + the lemma of the current token;
  * the overall polarity of syntactic children, which was computed by summing up the polarity scores of all immediate dependents, and checking whether the resulting value was greater than, less than, or equal to zero.
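
As an illustration of the simplest group, the formal features could be extracted roughly as follows. This is only a sketch: the feature names `initChar` and `trailChar` mirror the ones that appear in the feature ranking on a later slide, whereas `spellClass`, the lowercasing, and the exact inventory of spelling classes are assumptions made for this example.

```python
import string


def spelling_class(token):
    """Coarse spelling class of a token (the class inventory is an assumption)."""
    if token.isdigit():
        return "digit"
    if all(ch in string.punctuation for ch in token):
        return "punct"
    if token.isalnum():
        return "alnum"
    return "other"


def formal_features(token):
    """Formal features of one token as a feature-name -> value dictionary."""
    return {
        "initChar": token[:3].lower(),    # initial three characters
        "trailChar": token[-3:].lower(),  # last three characters
        "spellClass": spelling_class(token),
    }


feats = [formal_features(tok) for tok in ["Rettungsschirme", "2013", "!"]]
for f in feats:
    print(f)
```

Dictionaries of this shape can be passed to most CRF toolkits after merging them with the morphological, lexical, and syntactic attributes of the same token.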

## Results

<table>
<thead>
<tr>
<td rowspan="2">Data Set</td>
<td colspan="3">Sentiment</td>
<td colspan="3">Source</td>
<td colspan="3">Target</td>
<td rowspan="2">Macro-$F_1$</td>
</tr>
<tr>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision </td>
<td>Recall</td>
<td>$F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td>Training Set</td>
<td>0.949</td>
<td>0.908</td>
<td>0.928</td>
<td>0.903</td>
<td>0.87</td>
<td>0.886</td>
<td>0.933</td>
<td>0.865</td>
<td>0.898</td>
<td>0.904</td>
</tr>
<tr>
<td>Test Set</td>
<td>0.37</td>
<td>0.28</td>
<td>0.319</td>
<td>0.305</td>
<td>0.244</td>
<td>0.271</td>
<td>0.304</td>
<td>0.244</td>
<td>0.271</td>
<td>0.287</td>
</tr>
</tbody>
</table>

## Feature Analysis (Ablation Tests)

<table>
<thead>
<tr>
<td rowspan="2">Element</td>
<td rowspan="2">Original $F_1$-Score </td>
<td colspan="5">$F_1$-Score after Feature Removal</td>
</tr>
<tr>
<td>Formal</td>
<td>Morphological</td>
<td>Lexical</td>
<td>Syntactic</td>
<td>Lexico-Syntactic</td>
</tr>
</thead>
<tbody>
<tr>
<td>Sentiment</td>
<td>0.346</td>
<td>0.343 (&minus;0.003)</td>
<td>0.344 (&minus;0.002)</td>
<td>0.326 (&minus;0.02)</td>
<td>0.345 (&minus;0.001)</td>
<td>0.324 (&minus;0.022)</td>
</tr>
<tr>
<td>Source</td>
<td>0.309</td>
<td>0.321 (+0.012)</td>
<td>0.313 (+0.004)</td>
<td>0.265 (&minus;0.044)</td>
<td>0.359 (+0.05)</td>
<td>0.271 (&minus;0.038)</td>
</tr>
<tr>
<td>Target</td>
<td>0.26</td>
<td>0.282 (+0.022)</td>
<td>0.252 (&minus;0.008)</td>
<td>0.263 (+0.003)</td>
<td>0.233 (&minus;0.027)</td>
<td>0.263 (+0.003)</td>
</tr>
</tbody>
</table>

## Feature Analysis (Top-10 Features)

<table>
<thead>
<tr>
<td rowspan="2">Rank</td>
<td colspan="2">State Features</td>
<td colspan="2">Transition Features</td>
</tr>
<tr>
<td>Feature </td>
<td>Score </td>
<td>Feature </td>
<td>Score</td>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>prntLemma=meiste $\rightarrow$ TRG</td>
<td>18.68</td>
<td>NON $\rightarrow$ TRG</td>
<td>-7.01</td>
</tr>
<tr>
<td>2</td>
<td>prntLemma=rettungsschirme $\rightarrow$ TRG</td>
<td>18.3</td>
<td>NON $\rightarrow$ SRC</td>
<td>-6.85</td>
</tr>
<tr>
<td>3</td>
<td>initChar=sty $\rightarrow$ NON</td>
<td>-16.04</td>
<td>NON $\rightarrow$ SNT</td>
<td>-5.39</td>
</tr>
<tr>
<td>4</td>
<td>form=meisten $\rightarrow$ NON</td>
<td>15.99</td>
<td>TRG $\rightarrow$ SRC</td>
<td>-2.99</td>
</tr>
<tr>
<td>5</td>
<td>prntLemma=urlauberin $\rightarrow$ SNT</td>
<td>14.74</td>
<td>NON $\rightarrow$ NON</td>
<td>2.69</td>
</tr>
<tr>
<td>6</td>
<td>lemma=anfechten  $\rightarrow$ SNT</td>
<td>14.07</td>
<td>SRC $\rightarrow$ NON</td>
<td>-2.59</td>
</tr>
<tr>
<td>7</td>
<td>form=thomasoppermann  $\rightarrow$ TRG</td>
<td>13.44</td>
<td>SNT $\rightarrow$ SNT</td>
<td>2.54</td>
</tr>
<tr>
<td>8</td>
<td>form=bezeichnete $\rightarrow$ SNT</td>
<td>13.25</td>
<td>TRG $\rightarrow$ TRG</td>
<td>2.31</td>
</tr>
<tr>
<td>9</td>
<td>deprel[0]|deprel[1]=NK|AMS $\rightarrow$ NON</td>
<td>12.92</td>
<td>SRC $\rightarrow$ SRC</td>
<td>2.19</td>
</tr>
<tr>
<td>10</td>
<td>trailChar=te. $\rightarrow$ NON</td>
<td>12.77</td>
<td>SRC $\rightarrow$ TRG</td>
<td>-2.07</td>
</tr>
</tbody>
</table>

# Recurrent Neural Networks

TODO: Definition

## LSTM

The main components of a long short-term memory network are:

* An **input gate** $\vec{i}^{(t)}$, which controls how much input information will be used for the current prediction:
$$\vec{i}^{(t)} = \sigma\left(W_i\cdot \vec{x}^{(t)} + U_i \cdot \vec{h}^{(t-1)} + \vec{b}_i\right);$$
where $\sigma$ denotes the sigmoid function; $W_i$, $U_i$, and $\vec{b}_i$ are the model's parameters; and $\vec{x}^{(t)}$ and $\vec{h}^{(t-1)}$ stand for the current input and the previous hidden state, respectively;

* A **forget gate** $\vec{f}^{(t)}$, which controls how much previous information will be used for the current prediction:
$$\vec{f}^{(t)} = \sigma\left(W_f\cdot \vec{x}^{(t)} + U_f \cdot \vec{h}^{(t-1)} + \vec{b}_f\right);$$

* An **intermediate update vector** $\widetilde{c}^{(t)}$:
$$\widetilde{c}^{(t)} = \tanh\left(W_c\cdot \vec{x}^{(t)} + U_c \cdot \vec{h}^{(t-1)} + \vec{b}_c\right);$$

* The **final update vector** $\vec{c}^{(t)}$, which is a combination of the intermediate and previous updates:
$$\vec{c}^{(t)} = \vec{i}^{(t)} \odot \widetilde{c}^{(t)} + \vec{f}^{(t)} \odot \vec{c}^{(t-1)};$$

Using the final update, we can then compute the new **hidden state** $\vec{h}^{(t)}$ and **output vector** $\vec{o}^{(t)}$ for position $t$:
$$\vec{o}^{(t)} = \sigma\left(W_o\cdot \vec{x}^{(t)} + U_o \cdot \vec{h}^{(t-1)} + V_o \cdot \vec{c}^{(t)} + \vec{b}_o\right),$$
$$\vec{h}^{(t)} = \vec{o}^{(t)} \odot \tanh(\vec{c}^{(t)}).$$
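
A single LSTM step following these equations can be sketched in NumPy as shown below. The sketch assumes that the forget gate has its own parameters (called `W_f`, `U_f`, and `b_f` here); the dictionary layout, the toy dimensions, and the random Gaussian initialization are illustrative choices rather than the setup used in the experiments.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; p maps (hypothetical) parameter names to arrays."""
    i = sigmoid(p["W_i"] @ x + p["U_i"] @ h_prev + p["b_i"])        # input gate
    f = sigmoid(p["W_f"] @ x + p["U_f"] @ h_prev + p["b_f"])        # forget gate
    c_tilde = np.tanh(p["W_c"] @ x + p["U_c"] @ h_prev + p["b_c"])  # intermediate update
    c = i * c_tilde + f * c_prev                                    # final update
    o = sigmoid(p["W_o"] @ x + p["U_o"] @ h_prev + p["V_o"] @ c + p["b_o"])
    h = o * np.tanh(c)                                              # new hidden state
    return h, c, o


def init_params(d_in, d_hid, rng):
    """Small random parameters for a toy run (not the thesis initialization)."""
    p = {}
    for g in "ifco":
        p[f"W_{g}"] = rng.normal(scale=0.1, size=(d_hid, d_in))
        p[f"U_{g}"] = rng.normal(scale=0.1, size=(d_hid, d_hid))
        p[f"b_{g}"] = np.zeros(d_hid)
    p["V_o"] = rng.normal(scale=0.1, size=(d_hid, d_hid))
    return p


rng = np.random.default_rng(0)
params = init_params(d_in=5, d_hid=4, rng=rng)
h, c = np.zeros(4), np.zeros(4)
for x in rng.normal(size=(3, 5)):  # a toy sequence of three input vectors
    h, c, o = lstm_step(x, h, c, params)
print(h)
```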


## GRU

A gated recurrent unit network is fundamentally similar to LSTM in that it also features:

* An **input gate** $\vec{i}^{(t)}$:
$$\vec{i}^{(t)} = \sigma\left(W_i\cdot \vec{x}^{(t)} + U_i \cdot \vec{h}^{(t-1)} + \vec{b}_i\right);$$

* A **forget gate** $\vec{f}^{(t)}$:
$$\vec{f}^{(t)} = \sigma\left(W_f\cdot \vec{x}^{(t)} + U_f \cdot \vec{h}^{(t-1)} + \vec{b}_f\right);$$

* An **update vector** $\widetilde{c}^{(t)}$:
$$\widetilde{c}^{(t)} = \tanh\left(W_c\cdot \vec{x}^{(t)} + U_c \cdot \left(\vec{f}^{(t)} \odot \vec{h}^{(t-1)}\right)  + \vec{b}_c\right);$$

But in contrast to LSTM, this update vector is used directly to compute the **hidden state** $\vec{h}^{(t)}$, which simultaneously serves as the **output** of the model:
$$  \vec{h}^{(t)} = \vec{i}^{(t)} \odot \vec{h}^{(t-1)} + \left(\vec{1} -\vec{i}^{(t)}\right) \odot \widetilde{c}^{(t)}.$$
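
The GRU update can be sketched in the same way. As before, the separate forget-gate parameters (`W_f`, `U_f`, `b_f`), the parameter dictionary, and the toy dimensions are assumptions made for illustration only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x, h_prev, p):
    """One GRU step; the hidden state doubles as the output of the unit."""
    i = sigmoid(p["W_i"] @ x + p["U_i"] @ h_prev + p["b_i"])  # input gate
    f = sigmoid(p["W_f"] @ x + p["U_f"] @ h_prev + p["b_f"])  # forget gate
    c_tilde = np.tanh(p["W_c"] @ x + p["U_c"] @ (f * h_prev) + p["b_c"])
    # Interpolate between the previous state and the update vector.
    return i * h_prev + (1.0 - i) * c_tilde


rng = np.random.default_rng(0)
d_in, d_hid = 5, 4
params = {}
for g in "ifc":
    params[f"W_{g}"] = rng.normal(scale=0.1, size=(d_hid, d_in))
    params[f"U_{g}"] = rng.normal(scale=0.1, size=(d_hid, d_hid))
    params[f"b_{g}"] = np.zeros(d_hid)

h = np.zeros(d_hid)
for x in rng.normal(size=(3, d_in)):  # a toy sequence of three input vectors
    h = gru_step(x, h, params)
print(h)
```

Because the new state is a gated interpolation, an input gate close to one simply copies the previous hidden state forward, which is what lets the unit carry context across long spans.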

## Training

Since the size of our tagset (four tags) was too small to keep relevant context information in the hidden and output states ($\vec{h}^{(t)}$ and $\vec{o}^{(t)}$), I set the size of these vectors to 100 and multiplied them by a matrix $O\in\mathbb{R}^{4 \times 100}$ to obtain the final prediction:
$$\hat{y}^{(t)} = \operatorname{argmax}(O\cdot\vec{o}^{(t)}).$$

Due to a high imbalance of target classes (with most of the words having the tag NON), I **upsampled** subjective tweets by randomly repeating microblogs that contained sentiments until I got an equal proportion of subjective and objective messages in the training set.  Furthermore, I used the *hinge loss* as the **objective function**:
$$L = \sum_{i}^{N}\sum_{t=0}^{\lvert\vec{x}_i\rvert}\max\left(0, c + \max\limits_{y'\neq y}\vec{p}_{t,y'} - \vec{p}_{t,y}\right) + \alpha \left\lVert{}O\right\rVert^2_2$$
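
The data-dependent part of this loss can be illustrated with a few lines of plain Python. The sketch below omits the regularization term $\alpha\lVert O\rVert^2_2$, and both the margin value (`margin=1.0`, standing in for $c$) and the toy scores are assumptions, since the slides do not state them.

```python
def token_hinge_loss(scores, gold, margin=1.0):
    """max(0, c + max_{y' != y} p_{y'} - p_y) for a single token."""
    best_wrong = max(s for y, s in enumerate(scores) if y != gold)
    return max(0.0, margin + best_wrong - scores[gold])


def sequence_hinge_loss(score_seq, gold_seq, margin=1.0):
    """Sum of the token-level hinge losses over one message."""
    return sum(token_hinge_loss(s, y, margin) for s, y in zip(score_seq, gold_seq))


TAGS = ["SNT", "SRC", "TRG", "NON"]
# Toy scores for two tokens; each row holds one score per tag in TAGS.
scores = [[2.0, 0.1, 0.3, 0.5],
          [0.2, 0.1, 0.4, 0.6]]
gold = [TAGS.index("SNT"), TAGS.index("TRG")]
loss = sequence_hinge_loss(scores, gold)
print(loss)
```

The first token is scored correctly with a sufficient margin and contributes nothing, whereas the second one is penalized because a wrong tag outscores the gold tag.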

Finally, I initialized all matrix parameters to random **orthogonal** matrices and used **uniform He sampling** (He et al., 2015) to initialize the bias terms.  Afterwards, I ran the training for **256 epochs**, using the **RMSProp algorithm** (Tieleman and Hinton, 2012) for optimization and selecting the parameters that yielded the highest macro-averaged $F_1$-score on the development data during training.

## Results

<table>
<thead>
<tr>
<td rowspan="2">Data Set</td>
<td colspan="3">Sentiment</td>
<td colspan="3">Source</td>
<td colspan="3">Target</td>
<td rowspan="2">Macro-$F_1$</td>
</tr>
<tr>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td>Training Set</td>
<td>0.49<div class="stddev">0.16</div></td>
<td>0.75<div class="stddev">0.01</div></td>
<td>0.58<div class="stddev">0.13</div></td>
<td>0.45<div class="stddev">0.05</div></td>
<td>0.63<div class="stddev">0.12</div></td>
<td>0.52<div class="stddev">0.08</div></td>
<td>0.41<div class="stddev">0.11</div></td>
<td>0.73<div class="stddev">0.06</div></td>
<td>0.52<div class="stddev">0.11</div></td>
<td>0.54<div class="stddev">0.11</div></td>
</tr>
<tr>
<td>Test Set</td>
<td>0.29<div class="stddev">0.03</div></td>
<td>0.31<div class="stddev">0.11</div></td>
<td>0.29<div class="stddev">0.03</div></td>
<td>0.25<div class="stddev">0.02</div></td>
<td>0.31<div class="stddev">0.0</div></td>
<td>0.27<div class="stddev">0.01</div></td>
<td>0.23<div class="stddev">0.02</div></td>
<td>0.25<div class="stddev">0.05</div></td>
<td>0.24<div class="stddev">0.01</div></td>
</tr>
<tr>
<td>Training Set</td>
<td>0.51<div class="stddev">0.08</div></td>
<td>0.66<div class="stddev">0.05</div></td>
<td>0.57<div class="stddev">0.03</div></td>
<td>0.42<div class="stddev">0.03</div></td>
<td>0.62<div class="stddev">0.05</div></td>
<td>0.5<div class="stddev">0.03</div></td>
<td>0.47<div class="stddev">0.11</div></td>
<td>0.63<div class="stddev">0.11</div></td>
<td>0.52<div class="stddev">0.04</div></td>
<td>0.53<div class="stddev">0.03</div></td>
</tr>
<tr>
<td>Test Set</td>
<td>0.3<div class="stddev">0.01</div></td>
<td>0.26<div class="stddev">0.06</div></td>
<td>0.28<div class="stddev">0.03</div></td>
<td>0.22<div class="stddev">0.03</div></td>
<td>0.28<div class="stddev">0.02</div></td>
<td>0.24<div class="stddev">0.02</div></td>
<td>0.24<div class="stddev">0.03</div></td>
<td>0.21<div class="stddev">0.07</div></td>
<td>0.22<div class="stddev">0.03</div></td>
<td>0.25<div class="stddev">0.01</div></td>
</tr>
</tbody>
</table>

## Effect of Word Embeddings

<table>
<thead>
<tr>
<td rowspan="2">RNN</td>
<td colspan="3">Sentiment</td>
<td colspan="3">Source</td>
<td colspan="3">Target</td>
<td rowspan="2">Macro-$F_1$</td>
</tr>
<tr>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td colspan="11" class="table-subheader">Task-Specific Embeddings</td>
</tr>
<tr>
<td>LSTM</td>
<td>0.283</td>
<td>0.288</td>
<td>0.278</td>
<td>0.293</td>
<td>0.372</td>
<td>0.328</td>
<td>0.254</td>
<td>0.27</td>
<td>0.259</td>
<td>0.288</td>
</tr>
<tr>
<td>GRU</td>
<td>0.287</td>
<td>0.246</td>
<td>0.263</td>
<td>0.287</td>
<td>0.405</td>
<td>0.335</td>
<td>0.252</td>
<td>0.205</td>
<td>0.216</td>
<td>0.271</td>
</tr>
<tr>
<td colspan="11" class="table-subheader">Least-Squares Embeddings</td>
</tr>
<tr>
<td>LSTM</td>
<td>0.268</td>
<td>0.37</td>
<td>0.307</td>
<td>0.261</td>
<td>0.414</td>
<td>0.314</td>
<td>0.223</td>
<td>0.275</td>
<td>0.245</td>
<td>0.289</td>
</tr>
<tr>
<td>GRU</td>
<td>0.256</td>
<td>0.341</td>
<td>0.291</td>
<td>0.267</td>
<td>0.395</td>
<td>0.318</td>
<td>0.229</td>
<td>0.262</td>
<td>0.245</td>
<td>0.285</td>
</tr>
<tr>
<td colspan="11" class="table-subheader">Word2Vec Embeddings</td>
</tr>
<tr>
<td>LSTM</td>
<td>0.291</td>
<td>0.329</td>
<td>0.309</td>
<td>0.2</td>
<td>0.311</td>
<td>0.244</td>
<td>0.221</td>
<td>0.219</td>
<td>0.22</td>
<td>0.257</td>
</tr>
<tr>
<td>GRU</td>
<td>0.273</td>
<td>0.355</td>
<td>0.301</td>
<td>0.207</td>
<td>0.353</td>
<td>0.257</td>
<td>0.213</td>
<td>0.26</td>
<td>0.233</td>
<td>0.264</td>
</tr>
</tbody>
</table>

# Evaluation

## Annotation Scheme (Example)

**Broad Interpretation:**
<div class="example">
  <div class="seniment"><div class="target">Francis</div> makes a <div class="intensifier">very</div> <div class="emoexpression">good</div> impression on <div class="source">me</div>!<div class="emoexpression">:)</div></div>

  $\rightarrow$

  Francis/TRG makes/SNT a/SNT very/SNT good/SNT impression/SNT on/SNT me/SRC !/SNT :)/SNT
</div>

**Narrow Interpretation:**
<div class="example">
  <div class="seniment"><div class="target">Francis</div> makes a <div class="intensifier">very</div> <div class="emoexpression">good</div> impression on <div class="source">me</div>!<div class="emoexpression">:)</div></div>

  $\rightarrow$

  Francis/TRG makes/NON a/NON very/NON good/SNT impression/NON on/NON me/SRC !/NON :)/SNT
</div>

## Annotation Scheme (Results)

<table>
<thead>
<tr>
<td rowspan="2">Method</td>
<td colspan="3">Sentiment</td>
<td colspan="3">Source</td>
<td colspan="3">Target</td>
<td rowspan="2">Macro-$F_1$</td>
</tr>
<tr>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td colspan="11">Broad Interpretation</td>
</tr>
<tr>
<td>CRF</td>
<td>0.38</td>
<td>0.32</td>
<td>0.34</td>
<td>0.3</td>
<td>0.33</td>
<td>0.31</td>
<td>0.29</td>
<td>0.23</td>
<td>0.26</td>
<td>0.31</td>
</tr>
<tr>
<td>LSTM</td>
<td>0.28</td>
<td>0.29</td>
<td>0.28</td>
<td>0.29</td>
<td>0.37</td>
<td>0.33</td>
<td>0.25</td>
<td>0.27</td>
<td>0.26</td>
<td>0.29</td>
</tr>
<tr>
<td>GRU</td>
<td>0.29</td>
<td>0.25</td>
<td>0.26</td>
<td>0.29</td>
<td>0.4</td>
<td>0.34</td>
<td>0.25</td>
<td>0.21</td>
<td>0.22</td>
<td>0.27</td>
</tr>
<tr>
<td colspan="11">Narrow Interpretation</td>
</tr>
<tr>
<td>CRF</td>
<td>0.59</td>
<td>0.64</td>
<td>0.62</td>
<td>0.26</td>
<td>0.23</td>
<td>0.24</td>
<td>0.22</td>
<td>0.20</td>
<td>0.21</td>
<td>0.36</td>
</tr>
<tr>
<td>LSTM</td>
<td>0.62</td>
<td>0.65</td>
<td>0.63</td>
<td>0.3</td>
<td>0.35</td>
<td>0.32</td>
<td>0.26</td>
<td>0.14</td>
<td>0.18</td>
<td>0.38</td>
</tr>
<tr>
<td>GRU</td>
<td>0.62</td>
<td>0.63</td>
<td>0.62</td>
<td>0.28</td>
<td>0.33</td>
<td>0.3</td>
<td>0.23</td>
<td>0.24</td>
<td>0.23</td>
<td>0.38</td>
</tr>
</tbody>
</table>

# Evaluation

## Graph Structure of the Models

* first- and higher-order linear chain CRFs;
* first- and higher-order semi-Markov CRFs;
* tree-structured CRFs.

TODO: Visualization

## Graph Structure of the Models (Results)

<table>
<thead>
<tr>
<td rowspan="2">Element</td>
<td colspan="9">Structure</td>
</tr>
<tr>
<td>lcCRF$^1$</td>
<td>lcCRF$^2$</td>
<td>lcCRF$^3$</td>
<td>lcCRF$^4$</td>
<td>smCRF$^1$</td>
<td>smCRF$^2$</td>
<td>smCRF$^3$</td>
<td>smCRF$^4$</td>
<td>trCRF$^1$</td>
</tr>
</thead>
<tbody>
<tr>
<td colspan="10">Training Set</td>
</tr>
<tr>
<td>Sentiment</td>
<td>0.928</td>
<td>0.919</td>
<td>0.922</td>
<td>0.925</td>
<td>0.931</td>
<td>0.931</td>
<td>0.933</td>
<td>0.931</td>
<td>0.906</td>
</tr>
<tr>
<td>Source</td>
<td>0.887</td>
<td>0.876</td>
<td>0.89</td>
<td>0.901</td>
<td>0.869</td>
<td>0.886</td>
<td>0.874</td>
<td>0.878</td>
<td>0.881</td>
</tr>
<tr>
<td>Target</td>
<td>0.898</td>
<td>0.811</td>
<td>0.816</td>
<td>0.827</td>
<td>0.813</td>
<td>0.827</td>
<td>0.815</td>
<td>0.817</td>
<td>0.876</td>
</tr>
<tr>
<td colspan="10">Development Set</td>
</tr>
<tr>
<td>Sentiment</td>
<td>0.345</td>
<td>0.334</td>
<td>0.332</td>
<td>0.335</td>
<td>0.395</td>
<td>0.385</td>
<td>0.389</td>
<td>0.378</td>
<td>0.331</td>
</tr>
<tr>
<td>Source</td>
<td>0.313</td>
<td>0.32</td>
<td>0.272</td>
<td>0.304</td>
<td>0.298</td>
<td>0.282</td>
<td>0.287</td>
<td>0.291</td>
<td>0.223</td>
</tr>
<tr>
<td>Target</td>
<td>0.258</td>
<td>0.235</td>
<td>0.24</td>
<td>0.229</td>
<td>0.287</td>
<td>0.309</td>
<td>0.301</td>
<td>0.292</td>
<td>0.243</td>
</tr>
</tbody>
</table>

## Graph Structure of the Models (Results)

<table>
<thead>
<tr>
<td rowspan="2">Element</td>
<td colspan="8">Structure</td>
</tr>
<tr>
<td>lcLSTM$^1$</td>
<td>lcLSTM$^2$</td>
<td>lcLSTM$^3$</td>
<td>lcGRU$^1$</td>
<td>lcGRU$^2$</td>
<td>lcGRU$^3$</td>
<td>trLSTM$^1$</td>
<td>trGRU$^1$</td>
</tr>
</thead>
<tbody>
<tr>
<td colspan="9">Training Set</td>
</tr>
<tr>
<td>Sentiment</td>
<td>0.584</td>
<td>0.559</td>
<td>0.54</td>
<td>0.57</td>
<td>0.587</td>
<td>0.606</td>
<td>0.43</td>
<td>0.518</td>
</tr>
<tr>
<td>Source</td>
<td>0.525</td>
<td>0.458</td>
<td>0.424</td>
<td>0.503</td>
<td>0.546</td>
<td>0.548</td>
<td>0.317</td>
<td>0.372 </td>
</tr>
<tr>
<td>Target</td>
<td>0.521</td>
<td>0.513</td>
<td>0.501</td>
<td>0.519</td>
<td>0.544</td>
<td>0.605</td>
<td>0.305</td>
<td>0.425</td>
</tr>
<tr>
<td colspan="9">Development Set</td>
</tr>
<tr>
<td>Sentiment</td>
<td>0.278</td>
<td>0.285</td>
<td>0.281</td>
<td>0.335</td>
<td>0.252</td>
<td>0.253</td>
<td>0.314</td>
<td>0.292</td>
</tr>
<tr>
<td>Source</td>
<td>0.328</td>
<td>0.314</td>
<td>0.303</td>
<td>0.263</td>
<td>0.298</td>
<td>0.306</td>
<td>0.256</td>
<td>0.262</td>
</tr>
<tr>
<td>Target</td>
<td>0.259</td>
<td>0.218</td>
<td>0.222</td>
<td>0.216</td>
<td>0.219</td>
<td>0.188</td>
<td>0.205</td>
<td>0.193</td>
</tr>
</tbody>
</table>

# Evaluation

## Effect of Text Normalization

<table>
<thead>
<tr>
<td rowspan="2">Data Set</td>
<td colspan="3">Sentiment</td>
<td colspan="3">Source</td>
<td colspan="3">Target</td>
<td rowspan="2">Macro-$F_1$</td>
</tr>
<tr>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td colspan="11">With Normalization</td>
</tr>
<tr>
<td>CRF</td>
<td>0.376</td>
<td>0.319</td>
<td>0.345</td>
<td>0.298</td>
<td>0.33</td>
<td>0.313</td>
<td>0.293</td>
<td>0.231</td>
<td>0.258</td>
<td>0.305</td>
</tr>
<tr>
<td>LSTM</td>
<td>0.283</td>
<td>0.288</td>
<td>0.278</td>
<td>0.293</td>
<td>0.372</td>
<td>0.328</td>
<td>0.254</td>
<td>0.27</td>
<td>0.259</td>
<td>0.288</td>
</tr>
<tr>
<td>GRU</td>
<td>0.287</td>
<td>0.246</td>
<td>0.263</td>
<td>0.287</td>
<td>0.405</td>
<td>0.335</td>
<td>0.252</td>
<td>0.205</td>
<td>0.216</td>
<td>0.271</td>
</tr>
<tr>
<td colspan="11">Without Normalization</td>
</tr>
<tr>
<td>CRF</td>
<td>0.301</td>
<td>0.278</td>
<td>0.289</td>
<td>0.276</td>
<td>0.3</td>
<td>0.287</td>
<td>0.255</td>
<td>0.23</td>
<td>0.242</td>
<td>0.273</td>
</tr>
<tr>
<td>LSTM</td>
<td>0.274</td>
<td>0.252</td>
<td>0.261</td>
<td>0.284</td>
<td>0.367</td>
<td>0.32</td>
<td>0.237</td>
<td>0.241</td>
<td>0.237</td>
<td>0.273</td>
</tr>
<tr>
<td>GRU</td>
<td>0.266</td>
<td>0.245</td>
<td>0.252</td>
<td>0.296</td>
<td>0.369</td>
<td>0.328</td>
<td>0.232</td>
<td>0.268</td>
<td>0.245</td>
<td>0.275</td>
</tr>
</tbody>
</table>

# Summary and Conclusions

* CRFs can learn meaningful weights for state and
  transition features, although different feature types might have
  different effects on the classification of opinion elements: whereas
  sentiments benefited from all features used in our
  experiments, sources profited most from lexical and
  complex (lexico-syntactic) attributes, and targets were positively
  influenced by morphological and syntactic features only;

* Apart from that, we analyzed the effect of different embedding
  types on the net results of RNN systems, finding that least-squares
  embeddings yield the best overall scores for these methods;

* Furthermore, even higher prediction scores for
  sentiments can be achieved by narrowing the spans of
  these elements to polar terms.  This, however, might negatively
  affect the classification of sources and
  targets;

* Even though context seems to play an important role, redefining
  models' structures by increasing the order of their dependencies or
  performing inference over trees instead of linear sequences does not
  bring much improvement.  I could, however, still outperform the
  results of traditional first-order linear-chain CRFs with their
  first- and second-order semi-Markov modifications;

* In the final step, I estimated the effect of text normalization
  by rerunning all experiments with original (unnormalized) tweets.
  This test showed that preprocessing is an extremely helpful
  procedure, which might improve the results of ABSA methods by up to
  3%.

# Chapter V: Message-Level Sentiment Analysis

## Task

Given an input tweet $t$, our task is to automatically determine the polarity (positive 😊, negative ☹️, or neutral 😐) of that message.

<div class="example">
  TODO Provide an example
  <div class="translation">TODO</div>
</div>

## Evaluation Metrics

To estimate the quality of compared systems, I rely on two established evaluation metrics that are commonly used to measure the MLSA results:

* **macro-averaged $F_1$-score** over two main polarity classes (positive and negative):
$$F_1 = \frac{F_{pos} + F_{neg}}{2} $$

* and **micro-averaged $F_1$-score** over all three semantic orientations (positive, negative, and neutral), which essentially corresponds to accuracy.
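
Both metrics can be computed in a few lines of Python. The snippet below is a self-contained sketch on toy labels (not corpus data); the function names are hypothetical.

```python
def class_f1(gold, pred, label):
    """F1-score of a single class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


def macro_f1_pos_neg(gold, pred):
    """Macro-average over the positive and negative classes only."""
    return (class_f1(gold, pred, "positive") + class_f1(gold, pred, "negative")) / 2


def micro_f1(gold, pred):
    """With exactly one label per message, micro-F1 reduces to accuracy."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


gold = ["positive", "positive", "negative", "neutral", "neutral", "negative"]
pred = ["positive", "neutral", "negative", "neutral", "positive", "neutral"]
print(macro_f1_pos_neg(gold, pred), micro_f1(gold, pred))
```

The macro-averaged score ignores how well the neutral class itself is predicted, so a system that labels everything as neutral gets a macro-$F_1$ of zero despite a possibly high accuracy.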

## Data Preparation

As in the previous experiments, I
* preprocessed all tweets with the text normalization system of Sidarenka et al. (2013),
* tokenized them with an adjusted version of Christopher Potts' tokenizer,
* lemmatized and assigned part-of-speech tags to these tokens with the TreeTagger of Schmid (1995),
* obtained morphological and syntactic analyses with the Mate dependency parser (Bohnet et al., 2013),
* and again split the data into training, development, and test sets in a 70-10-20 percent ratio.

## Data Preparation (Inference of Gold Labels)

Since the PotTS corpus did not provide explicit message-level gold labels, I used a simple heuristic rule to derive these labels from the existing annotations.  In particular:

* I assigned the positive (negative) label to the microblogs that had exclusively positive (negative) sentiments;
* Messages that did not have any sentiments, but had exclusively positive (negative) polar terms were also assigned the respective label;
* Tweets that featured sentiments of both polarities or had both positive and negative polar terms were considered as mixed and excluded from our experiments;
* Finally, all remaining microblogs were regarded as neutral.
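
These rules can be written down as a short function. The sketch below assumes that the rules are applied in the order in which they are listed, so that a tweet with exclusively positive sentiments keeps the positive label even if its polar terms are mixed, and that the annotations of a tweet have already been reduced to two collections of polarity strings. Both the function name and this input format are hypothetical.

```python
def infer_message_label(sentiment_polarities, polar_term_polarities):
    """Derive a message-level label from the polarities of annotated spans.

    Both arguments are iterables of "positive" / "negative" strings.
    """
    sentiments = set(sentiment_polarities)
    polar_terms = set(polar_term_polarities)
    if len(sentiments) == 1:                      # only one sentiment polarity
        return next(iter(sentiments))
    if not sentiments and len(polar_terms) == 1:  # no sentiments, one term polarity
        return next(iter(polar_terms))
    if len(sentiments) > 1 or len(polar_terms) > 1:
        return "mixed"                            # excluded from the experiments
    return "neutral"


print(infer_message_label(["positive"], ["positive", "positive"]))
print(infer_message_label([], ["negative"]))
print(infer_message_label(["positive", "negative"], []))
print(infer_message_label([], []))
```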

## Data Preparation (Inference of Gold Labels)

<div class="example">
Ich finde den Papst <div class="emoexpression positive">putzig</div> <div class="emoexpression positive">🙂</div>
<div class="translation">I find the Pope <div class="emoexpression positive">cute</div> <div class="emoexpression positive">🙂</div></div>
<div class="label positive">positive</div>
</div>

<div class="example">
<div class="emoexpression negative">typisch</div> Bayern kaum ist der neue Papst da und schon haben sie ihn <div class="emoexpression negative">in der Tasche</div>&#8230;
<div class="translation"><div class="emoexpression negative">typical</div> Bavaria: the new Pope has hardly arrived, and they already have him <div class="emoexpression negative">in their pocket</div>&#8230;</div>
<div class="label negative">negative</div>
</div>

<div class="example">
Unser Park, unser Geld, unsere Stadt! -NICHT unser Finanzminister! <div class="emoexpression positive">🙂</div> #schmid #spd #s21 #btw13
<div class="translation">Our park, our money, our city! -NOT our Finance Minister! <div class="emoexpression positive">🙂</div> #schmid #spd #s21 #btw13</div>
<div class="label positive erroneous">positive</div>
</div>

<div class="example">
Auf die Lobby-FDP von heute kann Deutschland verzichten&#8230;
<div class="translation">Germany can go without today's lobby FDP</div>
<div class="label neutral erroneous">neutral</div>
</div>

## Data Preparation (SB10k)

The SB10k dataset comprises a total of 9,738 microblogs, which were sampled from a
larger snapshot of 5M German tweets gathered between August and November 2013. To
ensure lexical diversity and proportional polarity distribution in this corpus, the authors
first grouped all posts of this snapshot into 2,500 clusters using k-means with unigram
features. Afterwards, from each of these groups, they selected tweets that contained at least
one positive or one negative term from the German Polarity Clues lexicon (Waltinger, 2010), and then let three human experts annotate these microblogs with their message-level polarity (positive, negative, neutral, or mixed). Unfortunately, due to the restrictions of
Twitter’s terms of use, I could only retrieve 7,476 tweets of this collection, which, however, still represents
a substantial part of the original dataset.

## Data Preparation (Statistics)

<table>
<caption>Polarity class distribution in PotTS and SB10k<br/>
(&#42; – the mixed polarity was excluded from our experiments)</caption>
<thead>
<tr>
<td rowspan="2">Dataset</td>
<td colspan="4">Polarity Class</td>
<td colspan="2">Label Agreement</td>
</tr>
<tr>
<td>Positive</td>
<td>Negative</td>
<td>Neutral</td>
<td>Mixed*</td>
<td>$\alpha$ </td>
<td>$\kappa$</td>
</tr>
</thead>
<tbody>
<tr>
<td>PotTS</td>
<td>3,380</td>
<td>1,541</td>
<td>2,558</td>
<td>513</td>
<td>0.66</td>
<td>0.4</td>
</tr>
<tr>
<td>SB10k</td>
<td>1,717</td>
<td>1,130</td>
<td>4,629</td>
<td>0</td>
<td>0.39</td>
<td class="NA">NA</td>
</tr>
<!--
<tr>
<td>GTS</td>
<td>3,326,829</td>
<td>350,775</td>
<td>19,453,669</td>
<td>73,776</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
</tr>
-->
</tbody>
</table>

## Methods

* lexicon-based methods;
* machine-learning-based methods;
* deep-learning-based methods.

## Lexicon-Based Methods

Given a tweet $t$, a lexicon-based method determines the overall polarity of that tweet $y\in\{\mathrm{positive},\mathrm{negative},\mathrm{neutral}\}$ by summing the scores of polar terms (taken from a sentiment lexicon), possibly modifying these scores by some context factors (intensifiers, downtoners, negations).
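
The general idea can be illustrated with a deliberately small sketch. The lexicon entries, the modifier lists, their weights, and the two-token negation window below are all toy assumptions; the code does not reproduce any of the published systems compared on these slides.

```python
# Toy resources (hypothetical values, not taken from any real lexicon).
LEXICON = {"gut": 1.0, "putzig": 0.8, "ekelhaft": -1.0, "schlecht": -0.9}
NEGATIONS = {"nicht", "kein", "keine"}
INTENSIFIERS = {"sehr": 1.5, "total": 1.5}
DOWNTONERS = {"etwas": 0.5}


def lexicon_polarity(tokens, threshold=0.0):
    """Sum context-adjusted scores of polar terms and map the sum to a label."""
    total = 0.0
    for idx, tok in enumerate(tokens):
        score = LEXICON.get(tok.lower())
        if score is None:
            continue
        prev = tokens[idx - 1].lower() if idx > 0 else ""
        prev2 = tokens[idx - 2].lower() if idx > 1 else ""
        # Intensifiers and downtoners scale the score of the next polar term.
        score *= INTENSIFIERS.get(prev, DOWNTONERS.get(prev, 1.0))
        # A negation within the two preceding tokens flips the polarity.
        if prev in NEGATIONS or prev2 in NEGATIONS:
            score = -score
        total += score
    if total > threshold:
        return "positive"
    if total < -threshold:
        return "negative"
    return "neutral"


print(lexicon_polarity(["Das", "ist", "sehr", "gut"]))
print(lexicon_polarity(["Das", "ist", "nicht", "gut"]))
print(lexicon_polarity(["Heute", "ist", "Montag"]))
```

Real systems differ mainly in how they weight the lexicon entries and in how far the scope of a negation or an intensifier extends.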

The most well-known lexicon-based systems are the methods of:

* Hu and Liu (2004);

* Taboada et al. (2011);

* Musto et al. (2014);

* Jurek et al. (2015);

* and Kolchyna et al. (2015).

## Lexicon-Based Methods (Results)

<table>
    <caption>Results of lexicon-based MLSA methods<br/>
HL &ndash; Hu and Liu (2004), TBD &ndash; Taboada et al. (2011), MST &ndash; Musto et al. (2014), JRK &ndash; Jurek
et al. (2015), KLCH &ndash; Kolchyna et al. (2015)</caption>
<thead>
<tr>
<td rowspan="2">Method</td>
<td colspan="3">Positive</td>
<td colspan="3">Negative</td>
<td colspan="3">Neutral</td>
<td rowspan="2">Macro-$F_1^{+/-}$</td>
<td rowspan="2">Micro-$F_1$</td>
</tr>
<tr>
<td>Precision </td>
<td>Recall </td>
<td>$F_1$</td>
<td>Precision </td>
<td>Recall </td>
<td>$F_1$</td>
<td>Precision </td>
<td>Recall </td>
<td>$F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td>HL</td>
<td>0.75</td>
<td>0.76</td>
<td>0.76</td>
<td>0.53</td>
<td>0.43</td>
<td>0.47</td>
<td>0.67</td>
<td>0.73</td>
<td>0.69</td>
<td>0.615</td>
<td>0.685</td>
</tr>
<tr>
<td>TBD</td>
<td>0.77</td>
<td>0.71</td>
<td>0.74</td>
<td>0.54</td>
<td>0.39</td>
<td>0.45</td>
<td>0.63</td>
<td>0.77</td>
<td>0.69</td>
<td>0.597</td>
<td>0.674</td>
</tr>
<tr>
<td>MST</td>
<td>0.75</td>
<td>0.72</td>
<td>0.74</td>
<td>0.48</td>
<td>0.47</td>
<td>0.48</td>
<td>0.68</td>
<td>0.72</td>
<td>0.7</td>
<td>0.606</td>
<td>0.675</td>
</tr>
<tr>
<td>JRK</td>
<td>0.6</td>
<td>0.31</td>
<td>0.41</td>
<td>0.42</td>
<td>0.2</td>
<td>0.27</td>
<td>0.43</td>
<td>0.8</td>
<td>0.56</td>
<td>0.339</td>
<td>0.467</td>
</tr>
<tr>
<td>KLCH</td>
<td>0.71</td>
<td>0.72</td>
<td>0.71</td>
<td>0.34</td>
<td>0.17</td>
<td>0.22</td>
<td>0.66</td>
<td>0.82</td>
<td>0.73</td>
<td>0.468</td>
<td>0.651</td>
</tr>
<tr>
<td>HL</td>
<td>0.49</td>
<td>0.62</td>
<td>0.55</td>
<td>0.27</td>
<td>0.33</td>
<td>0.3</td>
<td>0.73</td>
<td>0.62</td>
<td>0.67</td>
<td>0.421</td>
<td>0.577</td>
</tr>
<tr>
<td>TBD</td>
<td>0.48</td>
<td>0.6</td>
<td>0.53</td>
<td>0.24</td>
<td>0.27</td>
<td>0.25</td>
<td>0.72</td>
<td>0.63</td>
<td>0.67</td>
<td>0.393</td>
<td>0.57</td>
</tr>
<tr>
<td>MST</td>
<td>0.45</td>
<td>0.49</td>
<td>0.47</td>
<td>0.29</td>
<td>0.35</td>
<td>0.32</td>
<td>0.7</td>
<td>0.64</td>
<td>0.67</td>
<td>0.395</td>
<td>0.568</td>
</tr>
<tr>
<td>JRK</td>
<td>0.41</td>
<td>0.39</td>
<td>0.4</td>
<td>0.36</td>
<td>0.26</td>
<td>0.3</td>
<td>0.69</td>
<td>0.75</td>
<td>0.72</td>
<td>0.351</td>
<td>0.592</td>
</tr>
<tr>
<td>KLCH</td>
<td>0.39</td>
<td>0.22</td>
<td>0.28</td>
<td>0.34</td>
<td>0.13</td>
<td>0.19</td>
<td>0.66</td>
<td>0.86</td>
<td>0.75</td>
<td>0.235</td>
<td>0.606</td>
</tr>
</tbody>
</table>
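
The two summary columns can be reproduced from gold and predicted labels. The sketch below rests on an assumption that the slide does not state explicitly but that agrees with the rows above (for HL on the first five rows, $(0.76 + 0.47) / 2 = 0.615$): Macro-$F_1^{+/-}$ is taken to be the mean of the positive and negative $F_1$ scores, and Micro-$F_1$ is computed over all three classes, where it coincides with accuracy for single-label data. The function names and the toy labels are illustrative only.

```python
def per_class_scores(gold, pred, label):
    """Return precision, recall, and F1 for a single class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    n_pred = sum(1 for p in pred if p == label)
    n_gold = sum(1 for g in gold if g == label)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1_pos_neg(gold, pred):
    """Mean of the positive and negative F1 (the neutral class is ignored)."""
    f1_pos = per_class_scores(gold, pred, "positive")[2]
    f1_neg = per_class_scores(gold, pred, "negative")[2]
    return (f1_pos + f1_neg) / 2


def micro_f1(gold, pred):
    """Micro-averaged F1 over all classes (equals accuracy here)."""
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)


# Toy data, not taken from PotTS or SB10k.
gold = ["positive", "positive", "negative", "neutral", "neutral", "negative"]
pred = ["positive", "neutral", "negative", "neutral", "positive", "neutral"]

macro = macro_f1_pos_neg(gold, pred)
micro = micro_f1(gold, pred)
print(round(macro, 3), round(micro, 3))
```

Because the macro score skips the neutral class, a system that labels everything as neutral can still reach a respectable Micro-$F_1$ while its Macro-$F_1^{+/-}$ drops to zero.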

## Lexicon-Based Methods (An Error Made by the System of Taboada et al.)

<div class="example">
Der beste Microsoft Knowledgebase-Artikel, den ich je gelesen habe.
<div class="translation">The best Microsoft-Knowledgebase article I've ever read.</div>
Gold Label:<div class="label positive">positive</div>
Predicted Label:<div class="label neutral">neutral*</div>
</div>

## Lexicon-Based Methods (An Error Made by the System of Musto et al.)

<div class="example">
Mensch Meier, Mensch Meier! Das sieht gut aus f&uuml;r die %User:
<div class="translation">Gosh Meier, Gosh Meier! It looks good for the %User:</div>
Gold Label:<div class="label positive">positive</div>
Predicted Label:<div class="label neutral">neutral*</div>
</div>

## Lexicon-Based Methods (An Error Made by the System of Jurek et al.)

<div class="example">
Normal bin ich ja nicht der mensch dwer sich beschwert wegen dem essen aber diese Pizza von Joeys&#8230; boah wie ekelhaft
<div class="translation">Normally I'm not a person who complains about food but this pizza from Joeys&#8230; Boah it's so disgusting</div>
Gold Label:<div class="label negative">negative</div>
Predicted Label:<div class="label positive">positive*</div>
</div>

## Lexicon-Based Methods (Effect of Polarity Changing Factors)

<table>
<caption>
Effect of polarity-changing factors on lexicon-based MLSA methods
</caption>
<thead>
<tr>
<td rowspan="3">Polarity-Changing Factors</td>
<td colspan="10">System Scores</td>
</tr>
<tr>
<td colspan="2">HL</td>
<td colspan="2">TBD</td>
<td colspan="2">MST</td>
<td colspan="2">JRK</td>
<td colspan="2">KLCH</td>
</tr>
<tr>
<td> Macro-$F_1$</td>
<td> Micro-$F_1$</td>
<td> Macro-$F_1$</td>
<td> Micro-$F_1$</td>
<td> Macro-$F_1$</td>
<td> Micro-$F_1$</td>
<td> Macro-$F_1$</td>
<td> Micro-$F_1$</td>
<td> Macro-$F_1$</td>
<td> Micro-$F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td colspan="11">
PotTS
</td>
</tr>
<tr>
<td>All</td>
<td>0.615</td>
<td>0.685</td>
<td>0.593</td>
<td>0.671</td>
<td>0.606</td>
<td>0.675</td>
<td>0.339</td>
<td>0.467</td>
<td>0.468</td>
<td>0.651</td>
</tr>
<tr>
<td>--Negation</td>
<td>0.622</td>
<td>0.691</td>
<td>0.596</td>
<td>0.672</td>
<td>0.641</td>
<td>0.7</td>
<td>0.357</td>
<td>0.473</td>
<td>0.298</td>
<td>0.463</td>
</tr>
<tr>
<td>--Intensification</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td>0.595</td>
<td>0.672</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td>0.339</td>
<td>0.467</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
</tr>
<tr>
<td>--Other Modifiers</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td>0.613</td>
<td>0.684</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
</tr>
<tr>
<td colspan="11">
SB10k
</td>
</tr>
<tr>
<td>All</td>
<td>0.421</td>
<td>0.577</td>
<td>0.392</td>
<td>0.569</td>
<td>0.395</td>
<td>0.568</td>
<td>0.351</td>
<td>0.592</td>
<td>0.235</td>
<td>0.606</td>
</tr>
<tr>
<td>--Negation</td>
<td>0.415</td>
<td>0.576</td>
<td>0.395</td>
<td>0.572</td>
<td>0.381</td>
<td>0.559</td>
<td>0.316</td>
<td>0.586</td>
<td>0.218</td>
<td>0.609</td>
</tr>
<tr>
<td>--Intensification</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td>0.4</td>
<td>0.576</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td>0.352</td>
<td>0.59</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
</tr>
<tr>
<td>--Other Modifiers</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td>0.406</td>
<td>0.566</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
</tr>
</tbody>
</table>

## ML-Based Methods

Given a tweet $t$, a machine-learning&ndash;based method determines the overall polarity of that tweet $y\in\{\mathrm{positive},\mathrm{negative},\mathrm{neutral}\}$ by multiplying the values of its features with their automatically learned coefficients.
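
As a minimal sketch of this decision rule, the snippet below scores each polarity class as a weighted sum of feature values and returns the highest-scoring class. It assumes a plain linear model with one weight vector per class; the feature names and weights are made up for illustration and are not the coefficients learned by any of the systems discussed here.

```python
LABELS = ("positive", "negative", "neutral")


def classify(features, weights, bias=None):
    """Score every label as bias + sum(value * weight) and pick the best one.

    features: dict mapping feature name -> feature value
    weights:  dict mapping label -> dict of feature name -> coefficient
    """
    bias = bias or {}
    scores = {}
    for label in LABELS:
        label_weights = weights.get(label, {})
        scores[label] = bias.get(label, 0.0) + sum(
            value * label_weights.get(name, 0.0)
            for name, value in features.items()
        )
    best = max(LABELS, key=lambda lbl: scores[lbl])
    return best, scores


# Hypothetical coefficients, for illustration only.
weights = {
    "positive": {"geil": 0.8, "%possmiley": 0.6},
    "negative": {"ekelhaft": 0.9},
    "neutral": {"?": 0.5},
}
features = {"geil": 1.0, "?": 1.0}

label, scores = classify(features, weights)
print(label, scores)
```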

### Representatives

* One of the earliest ML-based systems for message-level sentiment classification was proposed by **Gamon (2004)**, who used an SVM classifier with linguistic and surface-level features to distinguish between positive and negative customer feedback. These features included:
  * part-of-speech trigrams,
  * context-free phrase-structure patterns,
  * and part-of-speech information coupled with syntactic relations.

Following the successes of SVMs at various NLP tasks, these systems also rapidly gained ground in the initial runs of the SemEval competition on sentiment analysis in (English) Twitter (Nakov et al., 2013; Rosenthal et al., 2014), with the most prominent representatives being:
* the system of Mohammad et al. (2013), which relied on an extensive set of features:
  * character and token $n$-grams;
  * Brown clusters (Brown et al., 1992);
  * statistics on part-of-speech tags, punctuation marks, and elongated words;
  * numerous sentiment-lexicon features, extracted from both manual and automatic lexicons;

* the system of G&uuml;nther and Furrer (2013), which also relied on an extensive set of manually defined attributes, including:
  * original and lemmatized unigrams;
  * word clusters;
  * and lexicon features (only SentiWordNet [Esuli and Sebastiani, 2005]).
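
To make the surface features above more concrete, here is a small sketch of how character and token $n$-grams could be counted. The naming scheme (elements joined by `-`, spaces shown as a visible blank) only imitates the feature names that appear on the later slides; the function names, the $n$-gram ranges, and the lowercasing step are assumptions of this sketch, not the settings of the original systems.

```python
from collections import Counter

BLANK = "\u2423"  # visible placeholder used instead of a space character


def char_ngrams(text, n_min=3, n_max=5):
    """Count character n-grams of lengths n_min..n_max."""
    chars = [BLANK if ch == " " else ch for ch in text.lower()]
    feats = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(chars) - n + 1):
            feats["-".join(chars[i:i + n])] += 1
    return feats


def token_ngrams(tokens, n_min=1, n_max=2):
    """Count token n-grams of lengths n_min..n_max."""
    toks = [tok.lower() for tok in tokens]
    feats = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(toks) - n + 1):
            feats["-".join(toks[i:i + n])] += 1
    return feats


char_feats = char_ngrams("gut aus", n_min=3, n_max=3)
token_feats = token_ngrams(["Das", "sieht", "gut", "aus"])
print(sorted(char_feats))
print(sorted(token_feats))
```

The resulting counts can be used directly as the feature values of a linear classifier.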

## ML-Based Methods (Results)

<table>
<caption>
Results of machine-learning&ndash;based MLSA methods<br/>
GMN &mdash; Gamon (2004), MHM &mdash; Mohammad et al. (2013), GNT &mdash; Günther et al. (2014)
</caption>
<thead>
<tr>
<td rowspan="2">Method</td>
<td colspan="3">Positive</td>
<td colspan="3">Negative</td>
<td colspan="3">Neutral</td>
<td rowspan="2">Macro-$F_1^{+/-}$</td>
<td rowspan="2">Micro-$F_1$</td>
</tr>
<tr>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td colspan="12">
PotTS
</td>
</tr>
<tr>
<td>GMN</td>
<td>0.67</td>
<td>0.73</td>
<td>0.7</td>
<td>0.35</td>
<td>0.15</td>
<td>0.21</td>
<td>0.6</td>
<td>0.72</td>
<td>0.66</td>
<td>0.453</td>
<td>0.617</td>
</tr>
<tr>
<td>MHM</td>
<td class="best">0.79</td>
<td>0.77</td>
<td class="best">0.78</td>
<td class="best">0.58</td>
<td class="best">0.56</td>
<td class="best">0.57</td>
<td class="best">0.73</td>
<td class="best">0.76</td>
<td class="best">0.74</td>
<td class="best">0.674</td>
<td class="best">0.727</td>
</tr>
<tr>
<td>GNT</td>
<td>0.71</td>
<td class="best">0.8</td>
<td>0.75</td>
<td>0.55</td>
<td>0.45</td>
<td>0.5</td>
<td>0.68</td>
<td>0.63</td>
<td>0.65</td>
<td>0.624</td>
<td>0.673</td>
</tr>
<tr>
<td colspan="12">
SB10k
</td>
</tr>
<tr>
<td>GMN</td>
<td>0.65</td>
<td>0.45</td>
<td>0.53</td>
<td>0.38</td>
<td>0.08</td>
<td>0.13</td>
<td>0.72</td>
<td class="best">0.93</td>
<td>0.81</td>
<td>0.329</td>
<td>0.699</td>
</tr>
<tr>
<td>MHM</td>
<td class="best">0.71</td>
<td class="best">0.65</td>
<td class="best">0.68</td>
<td class="best">0.51</td>
<td class="best">0.4</td>
<td class="best">0.45</td>
<td class="best">0.8</td>
<td>0.87</td>
<td class="best">0.84</td>
<td class="best">0.564</td>
<td class="best">0.752</td>
</tr>
<tr>
<td>GNT</td>
<td>0.67</td>
<td>0.62</td>
<td>0.64</td>
<td>0.44</td>
<td>0.28</td>
<td>0.34</td>
<td>0.78</td>
<td>0.87</td>
<td>0.82</td>
<td>0.491</td>
<td>0.724</td>
</tr>
</tbody>
</table>

## ML-Based Methods (An Error Made by the System of Mohammad et al.)

<div class="example">
das klingt richtig gut! Was f&uuml;r eine hast du denn? (uvu) %PosSmiley3
<div class="translation">It sounds really great. Which one do you have? (uvu) %PosSmiley3</div>
Gold Label:<div class="label positive">positive</div>
Predicted Label:<div class="label neutral">neutral*</div>
</div>

Top-ranking features of this example:

1. &#42; (neutral): 0.131225868029;
2. &#42; (negative): -0.0840804221845;
3. %PoS-CARD (neutral): 0.0833658576233;
4. %PoS-ADJD (neutral): -0.069745190018;
5. t-␣-n (positive): 0.0556721202587.
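
The list above appears to be ordered by the absolute value of the weights, so that strongly negative coefficients rank next to strongly positive ones. A minimal sketch of such a ranking, using the five weights quoted above and a hypothetical helper `top_features`:

```python
def top_features(weighted_features, k=5):
    """Return the k (feature, label, weight) triples with the largest |weight|."""
    ranked = sorted(weighted_features, key=lambda item: abs(item[2]), reverse=True)
    return ranked[:k]


# Weights copied from the list above, deliberately shuffled.
features = [
    ("%PoS-ADJD", "neutral", -0.069745190018),
    ("t-\u2423-n", "positive", 0.0556721202587),
    ("*", "negative", -0.0840804221845),
    ("%PoS-CARD", "neutral", 0.0833658576233),
    ("*", "neutral", 0.131225868029),
]

for rank, (name, label, weight) in enumerate(top_features(features), start=1):
    print(rank, name, label, weight)
```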

## ML-Based Methods (An Error Made by the System of G&uuml;nther et al.)

<div class="example">
Den CDU-Wählern traue ich durchaus zu der FDP 8 bis 9% zu bescheren! Die sind so borniert, nicht nur in Niedersachsen!
<div class="translation">I don't put giving 8 to 9% to the FDP past the CDU-voters! They are so narrow-minded, not only in Lower Saxony!</div>
Gold Label:<div class="label negative">negative</div>
Predicted Label:<div class="label positive">positive*</div>
</div>

The reason for this misclassification is the prevalence of many general features (e.g., 8, nicht-nur_NEG, nur_NEG, etc.) and their strong bias towards the majority class in the PotTS dataset.

## ML-Based Methods (Feature-Ablation Test)

<table>
    <caption>
        Results of the feature-ablation test for ML-based MLSA methods
    </caption>
<thead>
<tr>
<td rowspan="3">Features</td>
<td colspan="6">System Scores</td>
</tr>
<tr>
<td colspan="2">GMN</td>
<td colspan="2">MHM</td>
<td colspan="2">GNT</td>
</tr>
<tr>
<td>Macro-$F_1^{+/-}$</td>
<td>Micro-$F_1$</td>
<td>Macro-$F_1^{+/-}$</td>
<td>Micro-$F_1$</td>
<td>Macro-$F_1^{+/-}$</td>
<td>Micro-$F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td colspan="7">
PotTS
</td>
</tr>   
<tr>
<td>All</td>
<td>0.453</td>
<td>0.617</td>
<td>0.674</td>
<td>0.727</td>
<td>0.624</td>
<td>0.673</td>
</tr>
<tr>
<td>--Constituents</td>
<td>0.388</td>
<td>0.545</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
</tr>
<tr>
<td>--PoS Tags</td>
<td>0.417</td>
<td>0.607</td>
<td>0.669</td>
<td>0.721</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
</tr>
<tr>
<td>--Character Features</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td>0.671</td>
<td>0.734</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
</tr>
<tr>
<td>--Token Features</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td>0.659</td>
<td>0.704</td>
<td>0.0</td>
<td>0.366</td>
</tr>
<tr>
<td>--Automatic Lexicons</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td>0.667</td>
<td>0.717</td>
<td>0.613</td>
<td>0.666</td>
</tr>
<tr>
<td>--Manual Lexicons</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td>0.665</td>
<td>0.715</td>
<td>0.617</td>
<td>0.675</td>
</tr>
<tr>
<td colspan="7">
SB10k
</td>
</tr>
<tr>
<td>All</td>
<td>0.329</td>
<td>0.699</td>
<td>0.564</td>
<td>0.752</td>
<td>0.491</td>
<td>0.724</td>
</tr>
<tr>
<td>--Constituents</td>
<td>0.127</td>
<td>0.646</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
</tr>
<tr>
<td>--PoS Tags</td>
<td>0.301</td>
<td>0.7</td>
<td>0.57</td>
<td>0.757</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
</tr>
<tr>
<td>--Character Features</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td>0.546</td>
<td>0.753</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
</tr>
<tr>
<td>--Token Features</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td>0.559</td>
<td>0.741</td>
<td>0.046</td>
<td>0.62</td>
</tr>
<tr>
<td>--Automatic Lexicons</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td>0.54</td>
<td>0.753</td>
<td>0.517</td>
<td>0.735</td>
</tr>
<tr>
<td>--Manual Lexicons</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
<td>0.553</td>
<td>0.751</td>
<td>0.51</td>
<td>0.739</td>
    </tr>
</tbody>
</table>

## ML-Based Methods (Top-10 Features)

<table>
    <caption>
       Top-10 features learned by ML-based MLSA methods<br />
(sorted by the absolute values of their weights)
    </caption>
<thead>
<tr>
<td rowspan="2">Rank</td>
<td colspan="3">GMN</td>
<td colspan="3">MHM</td>
<td colspan="3">GNT</td>
</tr>
<tr>
<td>Feature</td>
<td>Label</td>
<td>Weight</td>
<td>Feature</td>
<td>Label</td>
<td>Weight</td>
<td>Feature</td>
<td>Label</td>
<td>Weight</td>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>NK-ITJ|</td>
<td>POS</td>
<td>0.457</td>
<td>*</td>
<td>NEUT</td>
<td>0.131</td>
<td>hate</td>
<td>NEG</td>
<td>1.86 </td>
</tr>
<tr>
<td>2</td>
<td>DM-ITJ|</td>
<td>POS</td>
<td>0.334</td>
<td>Last-%QMark-Cnt</td>
<td>NEUT</td>
<td>0.088</td>
<td>sick</td>
<td>NEG</td>
<td>1.7</td>
</tr>
<tr>
<td>3</td>
<td>V-DM-I</td>
<td>POS</td>
<td>0.244</td>
<td>s-c</td>
<td>NEG</td>
<td>0.079</td>
<td>kahretsinn</td>
<td>NEG</td>
<td>1.69</td>
</tr>
<tr>
<td>4</td>
<td>N-NK-I</td>
<td>POS</td>
<td>0.24</td>
<td>*-%possmiley</td>
<td>POS</td>
<td>0.067</td>
<td>dasisaberschade</td>
<td>NEG</td>
<td>1.69</td>
</tr>
<tr>
<td>5</td>
<td>MO-ITJ|</td>
<td>POS</td>
<td>0.211</td>
<td>c-h-e-i-s</td>
<td>NEG</td>
<td>0.064</td>
<td>Anziehen</td>
<td>POS</td>
<td>1.67</td>
</tr>
<tr>
<td>6</td>
<td>A-DM-I</td>
<td>POS</td>
<td>0.196</td>
<td>h-a-h</td>
<td>POS</td>
<td>0.064</td>
<td>&#7452;</td>
<td>POS</td>
<td>1.65</td>
</tr>
<tr>
<td>7</td>
<td>A-MO-I</td>
<td>POS</td>
<td>0.191</td>
<td>t-&blank;-.</td>
<td>NEG</td>
<td>0.064</td>
<td>p&auml;rchenabend</td>
<td>POS</td>
<td>1.65</td>
</tr>
<tr>
<td>8</td>
<td>NK-ITJ</td>
<td>POS</td>
<td>0.165</td>
<td>geil</td>
<td>POS</td>
<td>0.062</td>
<td>derien❤️❤️</td>
<td>POS</td>
<td>1.65</td>
</tr>
<tr>
<td>9</td>
<td>NK-$.</td>
<td>NEUT</td>
<td>0.16</td>
<td>*-?</td>
<td>NEUT</td>
<td>0.062</td>
<td>sch&ouml;n-nicht</td>
<td>POS</td>
<td>1.56</td>
</tr>
<tr>
<td>10</td>
<td>DM-ITJ</td>
<td>POS</td>
<td>0.157</td>
<td>?</td>
<td>NEUT</td>
<td>0.061</td>
<td>applause</td>
<td>POS</td>
<td>1.5</td>
</tr>
</tbody>
</table>

## ML-Based Methods (Effect of Classifiers)

<table>
    <caption>
       Results of ML-based MLSA methods with different classifiers
    </caption>
<thead>
<tr>
<td rowspan="3">Classifier</td>
<td colspan="6">System Scores</td>
</tr>
<tr>
<td colspan="2">GMN</td>
<td colspan="2">MHM</td>
<td colspan="2">GNT</td>
</tr>
<tr>
<td>Macro-$F_1^{+/-}$</td>
<td>Micro-$F_1$</td>
<td>Macro-$F_1^{+/-}$</td>
<td>Micro-$F_1$</td>
<td>Macro-$F_1^{+/-}$</td>
<td>Micro-$F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td colspan="7">
PotTS
</td>
</tr>   
<tr>
<td>SVM</td>
<td>0.453</td>
<td>0.617</td>
<td>0.674</td>
<td>0.727</td>
<td>0.624</td>
<td>0.673</td>
</tr>
<tr>
<td>Naive Bayes</td>
<td>0.432</td>
<td>0.577</td>
<td>0.635</td>
<td>0.675</td>
<td>0.567</td>
<td>0.59</td>
</tr>
<tr>
<td>Logistic Regression</td>
<td>0.431</td>
<td>0.612</td>
<td>0.677</td>
<td>0.741</td>
<td>0.624</td>
<td>0.688</td>
</tr>
<tr>
<td colspan="7">
SB10k
</td>
</tr>   
<tr>
<td>SVM</td>
<td>0.329</td>
<td>0.699</td>
<td>0.564</td>
<td>0.752</td>
<td>0.491</td>
<td>0.724</td>
</tr>
<tr>
<td>Naive Bayes</td>
<td>0.351</td>
<td>0.637</td>
<td>0.516</td>
<td>0.755</td>
<td>0.453</td>
<td>0.675</td>
</tr>
<tr>
<td>Logistic Regression</td>
<td>0.309</td>
<td>0.693</td>
<td>0.553</td>
<td>0.772</td>
<td>0.512</td>
<td>0.75</td>
</tr>
</tbody>
</table>

## DL-Based Methods

Given a tweet $t$, a deep-learning method determines the overall polarity of that tweet $y\in\{\mathrm{positive},\mathrm{negative},\mathrm{neutral}\}$ by taking the embedding vectors of its words as input and passing them through a neural network, which produces a vector of (unnormalized) probabilities, one for each possible polarity class.

The most common representatives of DL-based systems are:

1. The **matrix-space** approach by Yessenalina and Cardie (2011):
$$\xi = \vec{u}^\top\left(\prod_{j=1}^{|x|}W_{w_j}\right)\vec{v},$$
where $W_{w_j}\in\mathbb{R}^{m\times{}m}$ is the matrix representation of word $w_j$;

2. The **deep recursive autoencoder** (RAE) method by Socher et al. (2011):
$$\vec{w}_p = softmax\left(W\begin{bmatrix}
      \vec{w}_l\\
      \vec{w}_r
  \end{bmatrix}\right),$$
where $\vec{w}_p\in\mathbb{R}^n$ stands for the embedding of the parent node, and $\vec{w}_l$ and $\vec{w}_r$ represent the embeddings of its left and right dependents, and $W\in\mathbb{R}^{n\times{}2n}$;

3. The **matrix-vector recursive neural network** (MVRNN) by Socher et al. (2012):
$$\begin{aligned}
  \vec{w}_p &= \tanh\left(W_v \begin{bmatrix}W_r\vec{w}_l\\
      W_l \vec{w}_r\end{bmatrix} \right),\\
  W_p &= W_m \begin{bmatrix}W_l\\
    W_r\end{bmatrix},
\end{aligned}$$
where, in addition to the previously defined word vectors $\vec{w}_p$, $\vec{w}_l$, and $\vec{w}_r$, the authors also associated a matrix parameter with each of these words ($W_p$, $W_l$, $W_r$);

4. The **recursive neural tensor network** (RNTN) by Socher et al. (2013):
$$\vec{w}_p = softmax\left(\begin{bmatrix}
  \vec{w}_l\\
  \vec{w}_r
  \end{bmatrix}^{\top}V^{[1:d]}\begin{bmatrix}
  \vec{w}_l\\
  \vec{w}_r
  \end{bmatrix}
            + W\begin{bmatrix}
  \vec{w}_l\\
  \vec{w}_r
\end{bmatrix}\right),
$$
where $\vec{w}_p$, $\vec{w}_l$, $\vec{w}_r$, and $W$ are defined as before, and $V$ is a $2n\times{}2n\times{}n$-dimensional tensor;

5. The **convolutional system** by Severyn and Moschitti (2015), which was the winner of the SemEval task in 2015;

6. The **attention system** by Baziotis et al. (2017):
<figure>
<img src="img/baziotis.png" alt="Architecture of the neural network proposed by Baziotis et al. (2017)">
<figcaption>
Architecture of the neural network proposed by Baziotis et al. (2017)
</figcaption>
</figure>
in which the output of the attention layer $\vec{a}$ is computed as follows:
$$\vec{a} = \sum_{i=1}^{|\mathbf{x}|}a_i\vec{h}_i,$$
where
$$a_i =\frac{\exp(e_i)}{\sum_{j=1}^{|\mathbf{x}|}\exp(e_j)},$$
such that
$$e_i = \tanh\left(\vec{\alpha}\vec{h}^{(2)}_i + \beta_i\right).$$

The $\vec{\alpha}$ and $\beta_i$ terms in the above equations represent automatically learned attention parameters, and $\vec{h}^{(2)}_i$ represents the concatenated output of the second bidirectional LSTM layer produced at the $i$-th tweet token: $\vec{h}_i^{(2)} = [\vec{h}^{(2)}_{i_{\rightarrow}}, \vec{h}^{(2)}_{i_{\leftarrow}}]\in\mathbb{R}^{300}$.

In the final step, Baziotis et al. multiplied the output of the attention layer with matrix $W$ and selected the label with the highest resulting score:
$$\hat{y} = argmax\left(softmax(W^\top\vec{a})\right).$$
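
The attention equations above can be traced step by step in a few lines of plain Python. This is only a sketch under simplifying assumptions: the hidden states are tiny hand-written vectors rather than Bi-LSTM outputs, and the parameters `alpha`, `beta`, and `W` are made-up values, not learned ones.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def attention(hidden_states, alpha, beta):
    """Return the coefficients a_i and the weighted sum of the hidden states."""
    e = [math.tanh(dot(alpha, h) + b) for h, b in zip(hidden_states, beta)]
    a = softmax(e)
    dim = len(hidden_states[0])
    summary = [sum(a_i * h[d] for a_i, h in zip(a, hidden_states))
               for d in range(dim)]
    return a, summary


def predict(summary, W):
    """Pick the label with the highest softmax score of W^T a."""
    labels = list(W)
    probs = softmax([dot(W[label], summary) for label in labels])
    return labels[probs.index(max(probs))], dict(zip(labels, probs))


# Toy hidden states for a three-token tweet (hypothetical values).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha = [0.5, -0.5]
beta = [0.0, 0.0, 0.0]
W = {"positive": [1.0, 0.0], "negative": [0.0, 1.0], "neutral": [0.0, 0.0]}

coeffs, summary = attention(hidden_states, alpha, beta)
label, probs = predict(summary, W)
print(coeffs, label)
```

The coefficients always sum to one, so the attention output is a convex combination of the hidden states.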

7. Apart from these numerous existing approaches, I have also proposed my own **lexicon-based attention system**, which builds upon the work of Baziotis et al. (2017). In this approach, in addition to the standard attention mechanism over positions
$$\vec{a} = \sum_{i=1}^{|\mathbf{x}|}a_i\vec{h}_i,$$
$$a_i =\frac{\exp(e_i)}{\sum_{j=1}^{|\mathbf{x}|}\exp(e_j)},$$
$$e_i = tanh\left(\vec{\alpha}\vec{h}^{(2)}_i + \beta_i\right),$$
I have introduced two more types of attention: **lexicon-based** and **context-based**.

In the first of these types (i.e., **lexicon-based** attention), I estimate the relevance of the $i$-th tweet token as a globally normalized polarity score of this term (taking this score from the previously introduced linear projection lexicon):
$$\vec{b} = \sum_{i=1}^{|\mathbf{x}|}b_i\vec{h}_i,$$
where
$$b_i = \frac{\exp(f_i)}{\sum_{j=1}^{|\mathbf{x}|}\exp(f_j)},$$
such that
$$f_i = \left\{
  \begin{array}{ll}
    \tanh(abs(V[{w_i}]) + \epsilon) & \textrm{ if } w_i\in V\\
    \tanh(\epsilon) & \, \textrm{otherwise.} \\
  \end{array}
  \right.$$


Since the semantic orientation of a polar term can be overturned by its context, I also use another type of attention (the **context-based** one), which is supposed to assign greater importance to such modifying context elements:
$$ \vec{c} = \sum_{i=1}^{|\mathbf{x}|}c_i\vec{h}_i,$$
where
$$c_i =\frac{\exp(g_i)}{\sum_{j=1}^{|\mathbf{x}|}\exp(g_j)},$$
such that
$$g_i =\tanh\left(C [\vec{w}_i, \vec{b}_p]^\top\right).$$

The $C$ term in the above equation represents a model parameter (context matrix), and the $\vec{b}_p$ value denotes the output of lexicon-based attention produced for the parent word $p$. That way, I hoped to better differentiate cases where a modifying element was changing the meaning of a polar term from the rest of the situations, e.g.:

<div class="example">
Ich mag den neuen Bundesminister <b>nicht</b>.
<div class="translation">I do <b>not</b> like the new federal minister.</div>
</div>    

<div class="example">
Ich gehe heute <b>nicht</b> ins Kino.
<div class="translation">I am <b>not</b> going to the cinema today.</div>
</div>    

Finally, to make the final prediction, I concatenate the outputs of the three attention layers into a single matrix $A\in\mathbb{R}^{3\times 100}$ and multiply it with a vector $\vec{w}\in\mathbb{R}^{1\times{}100}$, applying softmax normalization at the end:
$$\vec{o} = softmax\left(A\vec{w}^\top\right),$$
where 
$$A = \begin{bmatrix}
    \vec{a}\\
    \vec{b}\\
    \vec{c}\end{bmatrix}.
$$
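
The following sketch illustrates the lexicon-based attention weights $b_i$ and the final prediction step in plain Python. It makes several assumptions that are not given on the slide: the value of $\epsilon$, the toy lexicon score, and the two-dimensional attention outputs are all hypothetical, and the context-based attention is passed in as a ready-made vector instead of being computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def lexicon_attention_weights(tokens, lexicon, eps=1e-3):
    """Compute b_i from f_i = tanh(|V[w_i]| + eps), or tanh(eps) if w_i is not in V."""
    f = [math.tanh(abs(lexicon[tok]) + eps) if tok in lexicon else math.tanh(eps)
         for tok in tokens]
    return softmax(f)


def final_output(a, b, c, w):
    """softmax(A w^T), where the rows of A are the three attention outputs."""
    A = [a, b, c]
    return softmax([sum(x * y for x, y in zip(row, w)) for row in A])


# Hypothetical lexicon with a single polar term.
tokens = ["ich", "mag", "den", "minister", "nicht"]
lexicon = {"mag": 0.8}
b_weights = lexicon_attention_weights(tokens, lexicon)

# Hypothetical attention outputs and projection vector.
output = final_output(a=[0.1, 0.2], b=[0.3, 0.1], c=[0.0, 0.5], w=[1.0, -1.0])
print(b_weights, output)
```

Tokens that are missing from the lexicon all receive the same small weight, so the polar term dominates the lexicon-based summary of the tweet.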

<figure>
<img src="img/lba.png" alt="Architecture of the neural network with lexicon- and context-based attention">
<figcaption>
Architecture of the neural network with lexicon- and context-based attention
</figcaption>
</figure>

## DL-Based Methods (Results)

<table>
<caption>
Results of deep-learning&ndash;based MLSA methods<br/>
Y&amp;C &mdash; Yessenalina and Cardie (2011), RAE &mdash; Recursive Auto-Encoder (Socher et al., 2011),
MVRNN &mdash; Matrix-Vector RNN (Socher et al., 2012), RNTN &mdash; Recursive Neural-Tensor Network
(Socher et al., 2013), SEV &mdash; Severyn and Moschitti (2015b), BAZ &mdash; Baziotis et al. (2017),
LBA (1) &mdash; lexicon-based attention with one Bi-LSTM layer, LBA (2) &mdash; lexicon-based attention with
two Bi-LSTM layers
</caption>
<thead>
<tr>
<td rowspan="2">Method</td>
<td colspan="3">Positive</td>
<td colspan="3">Negative</td>
<td colspan="3">Neutral</td>
<td rowspan="2">Macro-$F_1^{+/-}$</td>
<td rowspan="2">Micro-$F_1$</td>
</tr>
<tr>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td colspan="12">
PotTS
</td>
</tr>
<tr>
<td>Y&amp;C</td>
<td>0.45</td>
<td>1.0</td>
<td>0.62</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.308</td>
<td>0.446</td>
</tr>
<tr>
<td>RAE</td>
<td>0.64</td>
<td>0.78</td>
<td>0.7</td>
<td>0.38</td>
<td>0.04</td>
<td>0.08</td>
<td>0.57</td>
<td>0.68</td>
<td>0.62</td>
<td>0.389</td>
<td>0.605</td>
</tr>
<tr>
<td>MVRNN</td>
<td>0.45</td>
<td>1.0</td>
<td>0.62</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.308</td>
<td>0.446</td>
</tr>
<tr>
<td>RNTN</td>
<td>0.45</td>
<td>0.87</td>
<td>0.59</td>
<td>0.19</td>
<td>0.02</td>
<td>0.03</td>
<td>0.32</td>
<td>0.1</td>
<td>0.15</td>
<td>0.312</td>
<td>0.428</td>
</tr>
<tr>
<td>SEV</td>
<td>0.73</td>
<td>0.79</td>
<td>0.76</td>
<td>0.41</td>
<td>0.52</td>
<td>0.46</td>
<td>0.72</td>
<td>0.55</td>
<td>0.62</td>
<td>0.608</td>
<td>0.651</td>
</tr>
<tr>
<td>BAZ</td>
<td>0.45</td>
<td>1.0</td>
<td>0.62</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.308</td>
<td>0.446</td>
</tr>
<tr>
<td>LBA$^{(1)}$</td>
<td>0.82</td>
<td>0.73</td>
<td>0.77</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.56</td>
<td>0.92</td>
<td>0.69</td>
<td>0.387</td>
<td>0.662</td>
</tr>
<tr>
<td>LBA$^{(2)}$</td>
<td>0.45</td>
<td>1.0</td>
<td>0.62</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.308</td>
<td>0.446</td>
</tr>
<tr>
<td colspan="12">
SB10k
</td>
</tr>
<tr>
<td>Y&amp;C</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.62</td>
<td>1.0</td>
<td>0.77</td>
<td>0.0</td>
<td>0.622</td>
</tr>
<tr>
<td>RAE</td>
<td>0.63</td>
<td>0.57</td>
<td>0.6</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.75</td>
<td>0.94</td>
<td>0.83</td>
<td>0.299</td>
<td>0.721</td>
</tr>
<tr>
<td>MVRNN</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.62</td>
<td>1.0</td>
<td>0.77</td>
<td>0.0</td>
<td>0.622</td>
</tr>
<tr>
<td>RNTN</td>
<td>0.2</td>
<td>0.03</td>
<td>0.05</td>
<td>0.07</td>
<td>0.01</td>
<td>0.02</td>
<td>0.62</td>
<td>0.94</td>
<td>0.75</td>
<td>0.033</td>
<td>0.594</td>
</tr>
<tr>
<td>SEV</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.62</td>
<td>1.0</td>
<td>0.77</td>
<td>0.0</td>
<td>0.622</td>
</tr>
<tr>
<td>BAZ</td>
<td>0.75</td>
<td>0.47</td>
<td>0.58</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.71</td>
<td>0.98</td>
<td>0.83</td>
<td>0.291</td>
<td>0.72</td>
</tr>
<tr>
<td>LBA$^{(1)}$</td>
<td>0.72</td>
<td>0.58</td>
<td>0.64</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.74</td>
<td>0.97</td>
<td>0.84</td>
<td>0.321</td>
<td>0.737</td>
</tr>
<tr>
<td>LBA$^{(2)}$</td>
<td>0.76</td>
<td>0.49</td>
<td>0.6</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.72</td>
<td>0.98</td>
<td>0.83</td>
<td>0.298</td>
<td>0.723</td>
</tr>
</tbody>
</table>

## DL-Based Methods (An Error Made by the System of Baziotis et al.)

<div class="example">
Wollte meinen Kleiderschrank aufr&auml;umen ... sitze nun darin und singe Liebeslieder ...
<div class="translation">Wanted to clean up my wardrobe... Now sitting in it and singing love songs...</div>
Gold Label:<div class="label neutral">neutral</div>
Predicted Label:<div class="label positive">positive*</div>
</div>

## DL-Based Methods (An Error Made by the LBA System)

<div class="example">
Gerade super Lust, mit Carls Haaren was zu machen aber ca 300 km Distanz halten mich davon ab.
<div class="translation">Really feel like doing something with Carl's hair right now, but about 300 km of distance are keeping me from it.</div>
Gold Label:<div class="label neutral">neutral</div>
Predicted Label:<div class="label positive">positive*</div>
</div>

## DL-Based Methods (Effect of Word Embeddings)

<table>
<caption>
Results of deep-learning&ndash;based MLSA methods with pretrained word2vec vectors
</caption>
<thead>
<tr>
<td rowspan="2">Method</td>
<td colspan="3">Positive</td>
<td colspan="3">Negative</td>
<td colspan="3">Neutral</td>
<td rowspan="2">Macro-$F_1^{+/-}$</td>
<td rowspan="2">Micro-$F_1$</td>
</tr>
<tr>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td colspan="12">
PotTS
</td>
</tr>
<tr>
<td>RAE</td>
<td>0.61<div class="negdelta">0.03</div></td>
<td>0.61<div class="negdelta">0.17</div></td>
<td>0.61<div class="negdelta">0.09</div></td>
<td>0.22<div class="negdelta">0.16</div></td>
<td>0.01<div class="negdelta">0.03</div></td>
<td>0.03<div class="negdelta">0.05</div></td>
<td>0.48<div class="negdelta">0.09</div></td>
<td>0.72<div class="negdelta">0.04</div></td>
<td>0.57<div class="negdelta">0.05</div></td>
<td>0.32<div class="negdelta">0.07</div></td>
<td>0.54<div class="negdelta">0.07</div></td>
</tr>
<tr>
<td>RNTN</td>
<td>0.45</td>
<td>0.82<div class="negdelta">0.05</div></td>
<td>0.59</td>
<td>0.24<div class="posdelta">0.05</div></td>
<td>0.06<div class="negdelta">0.04</div></td>
<td>0.1<div class="negdelta">0.07</div></td>
<td>0.43<div class="posdelta">0.09</div></td>
<td>0.17<div class="posdelta">0.07</div></td>
<td>0.24<div class="posdelta">0.09</div></td>
<td>0.34<div class="posdelta">0.03</div></td>
<td>0.44<div class="negdelta">0.01</div></td>
</tr>
<tr>
<td>SEV</td>
<td>0.73</td>
<td>0.74<div class="negdelta">0.05</div></td>
<td>0.74<div class="negdelta">0.02</div></td>
<td>0.0<div class="negdelta">0.41</div></td>
<td>0.0<div class="negdelta">0.52</div></td>
<td>0.0<div class="negdelta">0.46</div></td>
<td>0.56<div class="negdelta">0.16</div></td>
<td>0.84<div class="posdelta">0.29</div></td>
<td>0.68<div class="posdelta">0.06</div></td>
<td>0.37<div class="negdelta">0.24</div></td>
<td>0.64<div class="negdelta">0.01</div></td>
</tr>
<tr>
<td>BAZ</td>
<td>0.82<div class="posdelta">0.37</div></td>
<td>0.72<div class="negdelta">0.28</div></td>
<td>0.77<div class="posdelta">0.15</div></td>
<td>0.62<div class="posdelta">0.62</div></td>
<td>0.49<div class="posdelta">0.49</div></td>
<td>0.55<div class="posdelta">0.55</div></td>
<td>0.68<div class="posdelta">0.68</div></td>
<td>0.85<div class="posdelta">0.85</div></td>
<td>0.76<div class="posdelta">0.76</div></td>
<td>0.66<div class="posdelta">0.35</div></td>
<td>0.73<div class="posdelta">0.28</div></td>
</tr>
<tr>
<td>LBA$^{(1)}$</td>
<td>0.76<div class="negdelta">0.06</div></td>
<td>0.84<div class="posdelta">0.11</div></td>
<td>0.79<div class="posdelta">0.02</div></td>
<td>0.6<div class="posdelta">0.6</div></td>
<td>0.56<div class="posdelta">0.56</div></td>
<td>0.58<div class="posdelta">0.58</div></td>
<td>0.75<div class="posdelta">0.19</div></td>
<td>0.68<div class="negdelta">0.24</div></td>
<td>0.72<div class="posdelta">0.03</div></td>
<td>0.69<div class="posdelta">0.3</div></td>
<td>0.73<div class="posdelta">0.07</div></td>
</tr>
<tr>
<td>LBA$^{(2)}$</td>
<td>0.84<div class="posdelta">0.39</div></td>
<td>0.73<div class="negdelta">0.27</div></td>
<td>0.78<div class="posdelta">0.16</div></td>
<td>0.57<div class="posdelta">0.57</div></td>
<td>0.48<div class="posdelta">0.48</div></td>
<td>0.53<div class="posdelta">0.53</div></td>
<td>0.66<div class="posdelta">0.66</div></td>
<td>0.82<div class="posdelta">0.82</div></td>
<td>0.73<div class="posdelta">0.73</div></td>
<td>0.65<div class="posdelta">0.34</div></td>
<td>0.72<div class="posdelta">0.27</div></td>
</tr>
<tr>
<td colspan="12">
SB10k
</td>
</tr>
<tr>
<td>RAE</td>
<td>0.5<div class="negdelta">0.13</div></td>
<td>0.73<div class="posdelta">0.16</div></td>
<td>0.59<div class="negdelta">0.01</div></td>
<td>0.35<div class="posdelta">0.35</div></td>
<td>0.06<div class="posdelta">0.06</div></td>
<td>0.1<div class="posdelta">0.1</div></td>
<td>0.8<div class="posdelta">0.05</div></td>
<td>0.8<div class="negdelta">0.14</div></td>
<td>0.8<div class="negdelta">0.03</div></td>
<td>0.35<div class="posdelta">0.15</div></td>
<td>0.68<div class="negdelta">0.04</div></td>
</tr>
<tr>
<td>RNTN</td>
<td>0.0<div class="negdelta">0.02</div></td>
<td>0.0<div class="negdelta">0.03</div></td>
<td>0.0<div class="negdelta">0.05</div></td>
<td>0.0<div class="negdelta">0.07</div></td>
<td>0.0<div class="negdelta">0.01</div></td>
<td>0.0<div class="negdelta">0.02</div></td>
<td>0.62</td>
<td>1.0<div class="negdelta">0.06</div></td>
<td>0.77<div class="negdelta">0.02</div></td>
<td>0.0<div class="negdelta">0.03</div></td>
<td>0.62<div class="posdelta">0.03</div></td>
</tr>
<tr>
<td>SEV</td>
<td>0.64<div class="posdelta">0.64</div></td>
<td>0.58<div class="posdelta">0.58</div></td>
<td>0.61<div class="posdelta">0.61</div></td>
<td>0.51<div class="posdelta">0.51</div></td>
<td>0.21<div class="posdelta">0.21</div></td>
<td>0.3<div class="posdelta">0.3</div></td>
<td>0.76<div class="posdelta">0.14</div></td>
<td>0.89<div class="negdelta">0.11</div></td>
<td>0.82<div class="posdelta">0.05</div></td>
<td>0.45<div class="posdelta">0.45</div></td>
<td>0.72<div class="posdelta">0.1</div></td>
</tr>
<tr>
<td>BAZ</td>
<td>0.72<div class="posdelta">0.03</div></td>
<td>0.59<div class="posdelta">0.12</div></td>
<td>0.65<div class="posdelta">0.07</div></td>
<td>0.53<div class="posdelta">0.53</div></td>
<td>0.33<div class="posdelta">0.33</div></td>
<td>0.41<div class="posdelta">0.41</div></td>
<td>0.79<div class="posdelta">0.08</div></td>
<td>0.91<div class="negdelta">0.07</div></td>
<td>0.84<div class="posdelta">0.01</div></td>
<td>0.53<div class="posdelta">0.24</div></td>
<td>0.75<div class="posdelta">0.03</div></td>
</tr>
<tr>
<td>LBA$^{(1)}$</td>
<td>0.6<div class="negdelta">0.12</div></td>
<td>0.72<div class="posdelta">0.14</div></td>
<td>0.66<div class="posdelta">0.02</div></td>
<td>0.47<div class="posdelta">0.47</div></td>
<td>0.42<div class="posdelta">0.42</div></td>
<td>0.44<div class="posdelta">0.44</div></td>
<td>0.84<div class="posdelta">0.1</div></td>
<td>0.8<div class="negdelta">0.17</div></td>
<td>0.82<div class="posdelta">0.02</div></td>
<td>0.55<div class="posdelta">0.23</div></td>
<td>0.73<div class="negdelta">0.01</div></td>
</tr>
<tr>
<td>LBA$^{(2)}$</td>
<td>0.72<div class="negdelta">0.04</div></td>
<td>0.57<div class="posdelta">0.08</div></td>
<td>0.64<div class="posdelta">0.04</div></td>
<td>0.55<div class="posdelta">0.55</div></td>
<td>0.39<div class="posdelta">0.39</div></td>
<td>0.46<div class="posdelta">0.46</div></td>
<td>0.79<div class="posdelta">0.07</div></td>
<td>0.9<div class="negdelta">0.08</div></td>
<td>0.84<div class="posdelta">0.01</div></td>
<td>0.55<div class="posdelta">0.25</div></td>
<td>0.75<div class="posdelta">0.03</div></td>
</tr>
</tbody>
</table>


## DL-Based Methods (Effect of Word Embeddings)

<table>
<caption>
Results of deep-learning&ndash;based MLSA methods with least-squares embeddings
</caption>
<thead>
<tr>
<td rowspan="2">Method</td>
<td colspan="3">Positive</td>
<td colspan="3">Negative</td>
<td colspan="3">Neutral</td>
<td rowspan="2">Macro-$F_1^{+/-}$</td>
<td rowspan="2">Micro-$F_1$</td>
</tr>
<tr>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td colspan="12">
PotTS
</td>
</tr>
<tr>
<td>RAE</td>
<td>0.5<div class="negdelta">0.13</div></td>
<td>0.73<div class="posdelta">0.16</div></td>
<td>0.59<div class="negdelta">0.01</div></td>
<td>0.35<div class="posdelta">0.35</div></td>
<td>0.06<div class="posdelta">0.06</div></td>
<td>0.1<div class="posdelta">0.1</div></td>
<td>0.8<div class="posdelta">0.05</div></td>
<td>0.8<div class="negdelta">0.14</div></td>
<td>0.8<div class="negdelta">0.03</div></td>
<td>0.35<div class="posdelta">0.15</div></td>
<td>0.68<div class="negdelta">0.04</div></td>
</tr>
<tr>
<td>RNTN</td>
<td>0.0<div class="negdelta">0.02</div></td>
<td>0.0<div class="negdelta">0.03</div></td>
<td>0.0<div class="negdelta">0.05</div></td>
<td>0.0<div class="negdelta">0.07</div></td>
<td>0.0<div class="negdelta">0.01</div></td>
<td>0.0<div class="negdelta">0.02</div></td>
<td>0.62</td>
<td>1.0<div class="negdelta">0.06</div></td>
<td>0.77<div class="negdelta">0.02</div></td>
<td>0.0<div class="negdelta">0.03</div></td>
<td>0.62<div class="posdelta">0.03</div></td>
</tr>
<tr>
<td>SEV</td>
<td>0.64<div class="posdelta">0.64</div></td>
<td>0.58<div class="posdelta">0.58</div></td>
<td>0.61<div class="posdelta">0.61</div></td>
<td>0.51<div class="posdelta">0.51</div></td>
<td>0.21<div class="posdelta">0.21</div></td>
<td>0.3<div class="posdelta">0.3</div></td>
<td>0.76<div class="posdelta">0.14</div></td>
<td>0.89<div class="negdelta">0.11</div></td>
<td>0.82<div class="posdelta">0.05</div></td>
<td>0.45<div class="posdelta">0.45</div></td>
<td>0.72<div class="posdelta">0.1</div></td>
</tr>
<tr>
<td>BAZ</td>
<td>0.72<div class="posdelta">0.03</div></td>
<td>0.59<div class="posdelta">0.12</div></td>
<td>0.65<div class="posdelta">0.07</div></td>
<td>0.53<div class="posdelta">0.53</div></td>
<td>0.33<div class="posdelta">0.33</div></td>
<td>0.41<div class="posdelta">0.41</div></td>
<td>0.79<div class="posdelta">0.08</div></td>
<td>0.91<div class="negdelta">0.07</div></td>
<td>0.84<div class="posdelta">0.01</div></td>
<td>0.53<div class="posdelta">0.24</div></td>
<td>0.75<div class="posdelta">0.03</div></td>
</tr>
<tr>
<td>LBA$^{(1)}$</td>
<td>0.6<div class="negdelta">0.12</div></td>
<td>0.72<div class="posdelta">0.14</div></td>
<td>0.66<div class="posdelta">0.02</div></td>
<td>0.47<div class="posdelta">0.47</div></td>
<td>0.42<div class="posdelta">0.42</div></td>
<td>0.44<div class="posdelta">0.44</div></td>
<td>0.84<div class="posdelta">0.1</div></td>
<td>0.8<div class="negdelta">0.17</div></td>
<td>0.82<div class="posdelta">0.02</div></td>
<td>0.55<div class="posdelta">0.23</div></td>
<td>0.73<div class="negdelta">0.01</div></td>
</tr>
<tr>
<td>LBA$^{(2)}$</td>
<td>0.72<div class="negdelta">0.04</div></td>
<td>0.57<div class="posdelta">0.08</div></td>
<td>0.64<div class="posdelta">0.04</div></td>
<td>0.55<div class="posdelta">0.55</div></td>
<td>0.39<div class="posdelta">0.39</div></td>
<td>0.46<div class="posdelta">0.46</div></td>
<td>0.79<div class="posdelta">0.07</div></td>
<td>0.9<div class="negdelta">0.08</div></td>
<td>0.84<div class="posdelta">0.01</div></td>
<td>0.55<div class="posdelta">0.25</div></td>
<td>0.75<div class="posdelta">0.03</div></td>
</tr>
<tr>
<td colspan="12">
SB10k
</td>
</tr>
<tr>
<td>RAE</td>
<td>0.5<div class="negdelta">0.13</div></td>
<td>0.73<div class="posdelta">0.16</div></td>
<td>0.59<div class="negdelta">0.01</div></td>
<td>0.35<div class="posdelta">0.35</div></td>
<td>0.06<div class="posdelta">0.06</div></td>
<td>0.1<div class="posdelta">0.1</div></td>
<td>0.8<div class="posdelta">0.05</div></td>
<td>0.8<div class="negdelta">0.14</div></td>
<td>0.8<div class="negdelta">0.03</div></td>
<td>0.35<div class="posdelta">0.15</div></td>
<td>0.68<div class="negdelta">0.04</div></td>
</tr>
<tr>
<td>RNTN</td>
<td>0.0<div class="negdelta">0.02</div></td>
<td>0.0<div class="negdelta">0.03</div></td>
<td>0.0<div class="negdelta">0.05</div></td>
<td>0.0<div class="negdelta">0.07</div></td>
<td>0.0<div class="negdelta">0.01</div></td>
<td>0.0<div class="negdelta">0.02</div></td>
<td>0.62</td>
<td>1.0<div class="negdelta">0.06</div></td>
<td>0.77<div class="negdelta">0.02</div></td>
<td>0.0<div class="negdelta">0.03</div></td>
<td>0.62<div class="posdelta">0.03</div></td>
</tr>
<tr>
<td>SEV</td>
<td>0.64<div class="posdelta">0.64</div></td>
<td>0.58<div class="posdelta">0.58</div></td>
<td>0.61<div class="posdelta">0.61</div></td>
<td>0.51<div class="posdelta">0.51</div></td>
<td>0.21<div class="posdelta">0.21</div></td>
<td>0.3<div class="posdelta">0.3</div></td>
<td>0.76<div class="posdelta">0.14</div></td>
<td>0.89<div class="negdelta">0.11</div></td>
<td>0.82<div class="posdelta">0.05</div></td>
<td>0.45<div class="posdelta">0.45</div></td>
<td>0.72<div class="posdelta">0.1</div></td>
</tr>
<tr>
<td>BAZ</td>
<td>0.72<div class="posdelta">0.03</div></td>
<td>0.59<div class="posdelta">0.12</div></td>
<td>0.65<div class="posdelta">0.07</div></td>
<td>0.53<div class="posdelta">0.53</div></td>
<td>0.33<div class="posdelta">0.33</div></td>
<td>0.41<div class="posdelta">0.41</div></td>
<td>0.79<div class="posdelta">0.08</div></td>
<td>0.91<div class="negdelta">0.07</div></td>
<td>0.84<div class="posdelta">0.01</div></td>
<td>0.53<div class="posdelta">0.24</div></td>
<td>0.75<div class="posdelta">0.03</div></td>
</tr>
<tr>
<td>LBA$^{(1)}$</td>
<td>0.6<div class="negdelta">0.12</div></td>
<td>0.72<div class="posdelta">0.14</div></td>
<td>0.66<div class="posdelta">0.02</div></td>
<td>0.47<div class="posdelta">0.47</div></td>
<td>0.42<div class="posdelta">0.42</div></td>
<td>0.44<div class="posdelta">0.44</div></td>
<td>0.84<div class="posdelta">0.1</div></td>
<td>0.8<div class="negdelta">0.17</div></td>
<td>0.82<div class="posdelta">0.02</div></td>
<td>0.55<div class="posdelta">0.23</div></td>
<td>0.73<div class="negdelta">0.01</div></td>
</tr>
<tr>
<td>LBA$^{(2)}$</td>
<td>0.72<div class="negdelta">0.04</div></td>
<td>0.57<div class="posdelta">0.08</div></td>
<td>0.64<div class="posdelta">0.04</div></td>
<td>0.55<div class="posdelta">0.55</div></td>
<td>0.39<div class="posdelta">0.39</div></td>
<td>0.46<div class="posdelta">0.46</div></td>
<td>0.79<div class="posdelta">0.07</div></td>
<td>0.9<div class="negdelta">0.08</div></td>
<td>0.84<div class="posdelta">0.01</div></td>
<td>0.55<div class="posdelta">0.25</div></td>
<td>0.75<div class="posdelta">0.03</div></td>
</tr>
</tbody>
</table>


# Evaluation

## Distant Supervision (Data Statistics)

<table>
<caption>Polarity class distribution in PotTS, SB10k, and the German Twitter Snapshot (GTS)<br/>
(&#42; – the mixed polarity was excluded from our experiments)</caption>
<thead>
<tr>
<td rowspan="2">Dataset</td>
<td colspan="4">Polarity Class</td>
<td colspan="2">Label Agreement</td>
</tr>
<tr>
<td>Positive</td>
<td>Negative</td>
<td>Neutral</td>
<td>Mixed*</td>
<td>$\alpha$ </td>
<td>$\kappa$</td>
</tr>
</thead>
<tbody>
<tr>
<td>PotTS</td>
<td>3,380</td>
<td>1,541</td>
<td>2,558</td>
<td>513</td>
<td>0.66</td>
<td>0.4</td>
</tr>
<tr>
<td>SB10k</td>
<td>1,717</td>
<td>1,130</td>
<td>4,629</td>
<td>0</td>
<td>0.39</td>
<td class="NA">NA</td>
</tr>
<tr>
<td>GTS</td>
<td>3,326,829</td>
<td>350,775</td>
<td>19,453,669</td>
<td>73,776</td>
<td class="NA">NA</td>
<td class="NA">NA</td>
</tr>
</tbody>
</table>
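
The class counts above also indicate what a trivial majority-class predictor would achieve. The following is a minimal sketch, not part of the thesis code: the names `CLASS_COUNTS`, `class_shares`, and `majority_class` are hypothetical, and it assumes that the test splits follow roughly the same distribution as the full corpora, which the table itself does not state. Under that assumption, a classifier that always predicts the majority class reaches an accuracy equal to that class's share (in a single-label setting, micro-averaged $F_1$ coincides with accuracy).

```python
# Class counts copied from the table above; the mixed class is left out
# because it was excluded from the experiments.
CLASS_COUNTS = {
    "PotTS": {"positive": 3380, "negative": 1541, "neutral": 2558},
    "SB10k": {"positive": 1717, "negative": 1130, "neutral": 4629},
    "GTS": {"positive": 3326829, "negative": 350775, "neutral": 19453669},
}


def class_shares(counts):
    """Return the relative frequency of every polarity class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def majority_class(counts):
    """Return the most frequent class together with its share."""
    shares = class_shares(counts)
    label = max(shares, key=shares.get)
    return label, shares[label]


for name, counts in CLASS_COUNTS.items():
    label, share = majority_class(counts)
    print(f"{name}: majority class = {label} ({share:.3f})")
```

Such shares can serve as a sanity check when a method collapses to predicting a single class.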

## Distant Supervision (Results)

<table>
<caption>Results of MLSA methods with distantly supervised data (first eight rows: PotTS; last eight rows: SB10k)</caption>
<thead>
<tr>
<td rowspan="2">Method</td>
<td colspan="3">Positive</td>
<td colspan="3">Negative</td>
<td colspan="3">Neutral</td>
<td rowspan="2">Macro-$F_1^{+/-}$</td>
<td rowspan="2">Micro-$F_1$</td>
</tr>
<tr>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td>GMN</td>
<td>0.8<div class="posdelta">0.13</div></td>
<td>0.34<div class="negdelta">0.39</div></td>
<td>0.48<div class="negdelta">0.22</div></td>
<td>0.2<div class="negdelta">0.15</div></td>
<td>0.29<div class="posdelta">0.14</div></td>
<td>0.24<div class="negdelta">0.03</div></td>
<td>0.53<div class="negdelta">0.07</div></td>
<td>0.79<div class="posdelta">0.07</div></td>
<td>0.63<div class="negdelta">0.03</div></td>
<td>0.36<div class="negdelta">0.01</div></td>
<td>0.49<div class="negdelta">0.12</div></td>
</tr>
<tr>
<td>MHM</td>
<td>0.86<div class="posdelta">0.07</div></td>
<td>0.59<div class="negdelta">0.18</div></td>
<td>0.7<div class="negdelta">0.08</div></td>
<td>0.31<div class="negdelta">0.27</div></td>
<td>0.39<div class="negdelta">0.17</div></td>
<td>0.35<div class="negdelta">0.22</div></td>
<td>0.55<div class="negdelta">0.18</div></td>
<td>0.68<div class="negdelta">0.08</div></td>
<td>0.61<div class="negdelta">0.13</div></td>
<td>0.52<div class="negdelta">0.15</div></td>
<td>0.59<div class="negdelta">0.14</div></td>
</tr>
<tr>
<td>GNT</td>
<td>0.86<div class="posdelta">0.15</div></td>
<td>0.6<div class="negdelta">0.2</div></td>
<td>0.71<div class="negdelta">0.04</div></td>
<td>0.26<div class="negdelta">0.29</div></td>
<td>0.31<div class="negdelta">0.14</div></td>
<td>0.28<div class="negdelta">0.22</div></td>
<td>0.53<div class="negdelta">0.15</div></td>
<td>0.68<div class="negdelta">0.05</div></td>
<td>0.59<div class="negdelta">0.06</div></td>
<td>0.5<div class="negdelta">0.12</div></td>
<td>0.57<div class="negdelta">0.1</div></td>
</tr>
<tr>
<td>RAE</td>
<td>0.68<div class="posdelta">0.07</div></td>
<td>0.31<div class="negdelta">0.3</div></td>
<td>0.43<div class="negdelta">0.18</div></td>
<td>0.25<div class="posdelta">0.03</div></td>
<td>0.46<div class="posdelta">0.45</div></td>
<td>0.32<div class="posdelta">0.29</div></td>
<td>0.49<div class="posdelta">0.01</div></td>
<td>0.61<div class="negdelta">0.11</div></td>
<td>0.54<div class="negdelta">0.03</div></td>
<td>0.38<div class="posdelta">0.06</div></td>
<td>0.45<div class="negdelta">0.09</div></td>
</tr>
<tr>
<td>SEV</td>
<td>0.87<div class="posdelta">0.14</div></td>
<td>0.51<div class="negdelta">0.23</div></td>
<td>0.64<div class="negdelta">0.1</div></td>
<td>0.27<div class="posdelta">0.27</div></td>
<td>0.49<div class="posdelta">0.49</div></td>
<td>0.35<div class="posdelta">0.35</div></td>
<td>0.55<div class="negdelta">0.01</div></td>
<td>0.58<div class="negdelta">0.26</div></td>
<td>0.56<div class="negdelta">0.12</div></td>
<td>0.49<div class="posdelta">0.12</div></td>
<td>0.53<div class="negdelta">0.11</div></td>
</tr>
<tr>
<td>BAZ</td>
<td>0.0<div class="negdelta">0.82</div></td>
<td>0.0<div class="negdelta">0.72</div></td>
<td>0.0<div class="negdelta">0.77</div></td>
<td>0.19<div class="negdelta">0.43</div></td>
<td>1.0<div class="posdelta">0.51</div></td>
<td>0.32<div class="negdelta">0.23</div></td>
<td>0.0<div class="negdelta">0.68</div></td>
<td>0.0<div class="negdelta">0.85</div></td>
<td>0.0<div class="negdelta">0.76</div></td>
<td>0.16<div class="negdelta">0.5</div></td>
<td>0.19<div class="negdelta">0.43</div></td>
</tr>
<tr>
<td>LBA$^{(1)}$</td>
<td>0.48<div class="negdelta">0.28</div></td>
<td>0.88<div class="posdelta">0.04</div></td>
<td>0.62<div class="negdelta">0.17</div></td>
<td>0.25<div class="negdelta">0.35</div></td>
<td>0.23<div class="negdelta">0.33</div></td>
<td>0.24<div class="negdelta">0.34</div></td>
<td>0.0<div class="negdelta">0.75</div></td>
<td>0.0<div class="negdelta">0.68</div></td>
<td>0.0<div class="negdelta">0.72</div></td>
<td>0.43<div class="negdelta">0.26</div></td>
<td>0.44<div class="negdelta">0.29</div></td>
</tr>
<tr>
<td>LBA$^{(2)}$</td>
<td>0.91<div class="posdelta">0.07</div></td>
<td>0.08<div class="negdelta">0.65</div></td>
<td>0.14<div class="negdelta">0.64</div></td>
<td>0.19<div class="negdelta">0.38</div></td>
<td>0.99<div class="posdelta">0.51</div></td>
<td>0.32<div class="negdelta">0.21</div></td>
<td>0.0<div class="negdelta">0.66</div></td>
<td>0.0<div class="negdelta">0.82</div></td>
<td>0.0<div class="negdelta">0.73</div></td>
<td>0.23<div class="negdelta">0.42</div></td>
<td>0.22<div class="negdelta">0.5</div></td>
</tr>
<tr>
<td>GMN</td>
<td>0.71<div class="posdelta">0.06</div></td>
<td>0.27<div class="negdelta">0.18</div></td>
<td>0.4<div class="negdelta">0.13</div></td>
<td>0.24<div class="negdelta">0.14</div></td>
<td>0.11<div class="posdelta">0.03</div></td>
<td>0.15<div class="posdelta">0.02</div></td>
<td>0.71<div class="negdelta">0.01</div></td>
<td>0.96<div class="posdelta">0.03</div></td>
<td>0.82<div class="negdelta">0.01</div></td>
<td>0.27<div class="negdelta">0.06</div></td>
<td>0.68<div class="negdelta">0.02</div></td>
</tr>
<tr>
<td>MHM</td>
<td>0.77<div class="posdelta">0.06</div></td>
<td>0.4<div class="negdelta">0.25</div></td>
<td>0.53<div class="negdelta">0.15</div></td>
<td>0.61<div class="negdelta">0.1</div></td>
<td>0.1<div class="negdelta">0.3</div></td>
<td>0.18<div class="negdelta">0.27</div></td>
<td>0.71<div class="negdelta">0.09</div></td>
<td>0.97<div class="negdelta">0.1</div></td>
<td>0.82<div class="negdelta">0.02</div></td>
<td>0.35<div class="negdelta">0.21</div></td>
<td>0.71<div class="negdelta">0.04</div></td>
</tr>
<tr>
<td>GNT</td>
<td>0.77<div class="posdelta">0.1</div></td>
<td>0.39<div class="negdelta">0.23</div></td>
<td>0.52<div class="negdelta">0.12</div></td>
<td>0.25<div class="negdelta">0.19</div></td>
<td>0.13<div class="negdelta">0.15</div></td>
<td>0.17<div class="negdelta">0.17</div></td>
<td>0.71<div class="negdelta">0.07</div></td>
<td>0.92<div class="posdelta">0.05</div></td>
<td>0.8<div class="negdelta">0.02</div></td>
<td>0.34<div class="negdelta">0.15</div></td>
<td>0.68<div class="negdelta">0.04</div></td>
</tr>
<tr>
<td>RAE</td>
<td>0.44<div class="negdelta">0.06</div></td>
<td>0.27<div class="negdelta">0.51</div></td>
<td>0.34<div class="negdelta">0.25</div></td>
<td>0.24<div class="negdelta">0.11</div></td>
<td>0.59<div class="posdelta">0.53</div></td>
<td>0.34<div class="posdelta">0.24</div></td>
<td>0.78<div class="negdelta">0.02</div></td>
<td>0.62<div class="negdelta">0.18</div></td>
<td>0.69<div class="negdelta">0.11</div></td>
<td>0.34<div class="negdelta">0.01</div></td>
<td>0.54<div class="negdelta">0.14</div></td>
</tr>
<tr>
<td>SEV</td>
<td>0.64</td>
<td>0.39<div class="negdelta">0.19</div></td>
<td>0.49<div class="negdelta">0.12</div></td>
<td>0.34<div class="negdelta">0.17</div></td>
<td>0.12<div class="negdelta">0.09</div></td>
<td>0.18<div class="negdelta">0.12</div></td>
<td>0.7<div class="negdelta">0.06</div></td>
<td>0.9<div class="posdelta">0.01</div></td>
<td>0.78<div class="negdelta">0.04</div></td>
<td>0.33<div class="negdelta">0.12</div></td>
<td>0.69<div class="negdelta">0.03</div></td>
</tr>
<tr>
<td>BAZ</td>
<td>0.24<div class="negdelta">0.48</div></td>
<td>1.0<div class="posdelta">0.41</div></td>
<td>0.38<div class="negdelta">0.27</div></td>
<td>0.0<div class="negdelta">0.53</div></td>
<td>0.0<div class="negdelta">0.33</div></td>
<td>0.0<div class="negdelta">0.41</div></td>
<td>0.0<div class="negdelta">0.79</div></td>
<td>0.0<div class="negdelta">0.91</div></td>
<td>0.0<div class="negdelta">0.84</div></td>
<td>0.19<div class="negdelta">0.34</div></td>
<td>0.24<div class="negdelta">0.51</div></td>
</tr>
<tr>
<td>LBA$^{(1)}$</td>
<td>0.64<div class="posdelta">0.04</div></td>
<td>0.43<div class="negdelta">0.29</div></td>
<td>0.52<div class="negdelta">0.14</div></td>
<td>0.59<div class="posdelta">0.12</div></td>
<td>0.09<div class="negdelta">0.33</div></td>
<td>0.16<div class="negdelta">0.28</div></td>
<td>0.71<div class="negdelta">0.13</div></td>
<td>0.93<div class="posdelta">0.13</div></td>
<td>0.8<div class="negdelta">0.02</div></td>
<td>0.34<div class="negdelta">0.21</div></td>
<td>0.69<div class="negdelta">0.04</div></td>
</tr>
<tr>
<td>LBA$^{(2)}$</td>
<td>0.0<div class="negdelta">0.72</div></td>
<td>0.0<div class="negdelta">0.57</div></td>
<td>0.0<div class="negdelta">0.64</div></td>
<td>0.14<div class="negdelta">0.41</div></td>
<td>1.0<div class="posdelta">0.61</div></td>
<td>0.25<div class="negdelta">0.21</div></td>
<td>0.0<div class="negdelta">0.79</div></td>
<td>0.0<div class="negdelta">0.9</div></td>
<td>0.0<div class="negdelta">0.84</div></td>
<td>0.12<div class="negdelta">0.43</div></td>
<td>0.14<div class="negdelta">0.61</div></td>
</tr>
</tbody>
</table>
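
To make the relation between the table columns explicit, the sketch below recomputes the $F_1$ columns of one row from its precision and recall. It is an illustration under stated assumptions rather than the evaluation script of the thesis: the function names `f1_score` and `macro_f1_pos_neg` are hypothetical, and the reading of the Macro-$F_1$ column as the mean of the positive and negative $F_1$-scores is inferred from the reported numbers. Because the inputs are already rounded to two digits, the results can differ from the table in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def macro_f1_pos_neg(positive, negative):
    """Average the F1-scores of the positive and negative classes only.

    Both arguments are (precision, recall) pairs; the neutral class is
    deliberately ignored, which is an assumption inferred from the table.
    """
    return (f1_score(*positive) + f1_score(*negative)) / 2.0


# (precision, recall) pairs of the first MHM row in the table above.
mhm_positive = (0.86, 0.59)
mhm_negative = (0.31, 0.39)

print(f"F1 (positive): {f1_score(*mhm_positive):.2f}")
print(f"F1 (negative): {f1_score(*mhm_negative):.2f}")
print(f"Macro-F1 (+/-): {macro_f1_pos_neg(mhm_positive, mhm_negative):.2f}")
```

The same two functions can be applied to any other row to check a reported score against its precision and recall.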

## Lexicons (PotTS)

<div>
<figure class="halfpage">
<img src="img/cgsa_potts_macro_lexicons.png" alt="Macro F-Scores of MLSA Classifiers with Different Lexicons on the PotTS corpus">
<figcaption>Macro-$F_1$</figcaption>
</figure>
<figure class="halfpage">
<img src="img/cgsa_potts_micro_lexicons.png" alt="Micro F-Scores of MLSA Classifiers with Different Lexicons on the PotTS corpus">
<figcaption>Micro-$F_1$</figcaption>
</figure>
Results of MLSA methods with different lexicons on the PotTS corpus
</div>

## Lexicons (SB10k)

<div>
<figure class="halfpage">
<img src="img/cgsa_sb10k_macro_lexicons.png" alt="Macro F-Scores of MLSA Classifiers with Different Lexicons on the SB10k corpus">
<figcaption>Macro-$F_1$</figcaption>
</figure>
<figure class="halfpage">
<img src="img/cgsa_sb10k_micro_lexicons.png" alt="Micro F-Scores of MLSA Classifiers with Different Lexicons on the SB10k corpus">
<figcaption>Micro-$F_1$</figcaption>
</figure>
Results of MLSA methods with different lexicons on the SB10k corpus
</div>

## Text Normalization

<table>
<caption>
Results of MLSA methods without text normalization
</caption>
<thead>
<tr>
<td rowspan="2">Method</td>
<td colspan="3">Positive</td>
<td colspan="3">Negative</td>
<td colspan="3">Neutral</td>
<td rowspan="2">Macro-$F_1^{+/-}$</td>
<td rowspan="2">Micro-$F_1$</td>
</tr>
<tr>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td colspan="12">PotTS</td>
</tr>
<tr>
<td>HL</td>
<td>0.63<div class="negdelta">0.12</div></td>
<td>0.3<div class="negdelta">0.46</div></td>
<td>0.4<div class="negdelta">0.36</div></td>
<td>0.46<div class="negdelta">0.07</div></td>
<td>0.29<div class="negdelta">0.14</div></td>
<td>0.36<div class="negdelta">0.11</div></td>
<td>0.41<div class="negdelta">0.26</div></td>
<td>0.77<div class="posdelta">0.04</div></td>
<td>0.54<div class="negdelta">0.15</div></td>
<td>0.38<div class="negdelta">0.24</div></td>
<td>0.464<div class="negdelta">0.22</div></td>
</tr>
<tr>
<td>TBD</td>
<td>0.65<div class="negdelta">0.12</div></td>
<td>0.24<div class="negdelta">0.47</div></td>
<td>0.36<div class="negdelta">0.38</div></td>
<td>0.46<div class="negdelta">0.08</div></td>
<td>0.27<div class="negdelta">0.12</div></td>
<td>0.34<div class="negdelta">0.11</div></td>
<td>0.41<div class="negdelta">0.22</div></td>
<td>0.83<div class="posdelta">0.06</div></td>
<td>0.55<div class="negdelta">0.14</div></td>
<td>0.348<div class="negdelta">0.25</div></td>
<td>0.457<div class="negdelta">0.22</div></td>
</tr>
<tr>
<td>MST</td>
<td>0.63<div class="negdelta">0.12</div></td>
<td>0.29<div class="negdelta">0.43</div></td>
<td>0.4<div class="negdelta">0.34</div></td>
<td>0.47<div class="negdelta">0.01</div></td>
<td>0.34<div class="negdelta">0.13</div></td>
<td>0.39<div class="negdelta">0.09</div></td>
<td>0.42<div class="negdelta">0.26</div></td>
<td>0.77<div class="posdelta">0.05</div></td>
<td>0.54<div class="negdelta">0.16</div></td>
<td>0.4<div class="negdelta">0.21</div></td>
<td>0.47<div class="negdelta">0.21</div></td>
</tr>
<tr>
<td>JRK</td>
<td>0.44<div class="negdelta">0.16</div></td>
<td>0.22<div class="negdelta">0.09</div></td>
<td>0.29<div class="negdelta">0.12</div></td>
<td>0.14<div class="negdelta">0.28</div></td>
<td>0.06<div class="negdelta">0.14</div></td>
<td>0.08<div class="negdelta">0.19</div></td>
<td>0.36<div class="negdelta">0.07</div></td>
<td>0.7<div class="negdelta">0.1</div></td>
<td>0.47<div class="negdelta">0.09</div></td>
<td>0.19<div class="negdelta">0.15</div></td>
<td>0.36<div class="negdelta">0.11</div></td>
</tr>
<tr>
<td>KLCH</td>
<td>0.61<div class="negdelta">0.1</div></td>
<td>0.23<div class="negdelta">0.49</div></td>
<td>0.33<div class="negdelta">0.38</div></td>
<td>0.33<div class="negdelta">0.01</div></td>
<td>0.21<div class="posdelta">0.04</div></td>
<td>0.26<div class="posdelta">0.04</div></td>
<td>0.41<div class="negdelta">0.25</div></td>
<td>0.82</td>
<td>0.55<div class="negdelta">0.18</div></td>
<td>0.3<div class="negdelta">0.17</div></td>
<td>0.44<div class="negdelta">0.21</div></td>
</tr>
<tr>
<td>GMN</td>
<td>0.59<div class="negdelta">0.08</div></td>
<td>0.77<div class="posdelta">0.04</div></td>
<td>0.66<div class="negdelta">0.04</div></td>
<td>0.37<div class="negdelta">0.02</div></td>
<td>0.14<div class="negdelta">0.01</div></td>
<td>0.2<div class="negdelta">0.01</div></td>
<td>0.57<div class="negdelta">0.03</div></td>
<td>0.55<div class="negdelta">0.17</div></td>
<td>0.56<div class="negdelta">0.1</div></td>
<td>0.43<div class="negdelta">0.02</div></td>
<td>0.57<div class="negdelta">0.05</div></td>
</tr>
<tr>
<td>MHM</td>
<td>0.78<div class="negdelta">0.01</div></td>
<td>0.76<div class="negdelta">0.01</div></td>
<td>0.77<div class="negdelta">0.01</div></td>
<td>0.59<div class="posdelta">0.01</div></td>
<td>0.54<div class="negdelta">0.02</div></td>
<td>0.56<div class="negdelta">0.01</div></td>
<td>0.7<div class="negdelta">0.03</div></td>
<td>0.74<div class="negdelta">0.02</div></td>
<td>0.72<div class="negdelta">0.02</div></td>
<td>0.67<div class="negdelta">0.006</div></td>
<td>0.71<div class="negdelta">0.007</div></td>
</tr>
<tr>
<td>GNT</td>
<td>0.68<div class="negdelta">0.03</div></td>
<td>0.8</td>
<td>0.73<div class="negdelta">0.02</div></td>
<td>0.55</td>
<td>0.43<div class="negdelta">0.02</div></td>
<td>0.48<div class="negdelta">0.02</div></td>
<td>0.67<div class="negdelta">0.01</div></td>
<td>0.59<div class="negdelta">0.04</div></td>
<td>0.62<div class="negdelta">0.03</div></td>
<td>0.61<div class="negdelta">0.017</div></td>
<td>0.65<div class="negdelta">0.02</div></td>
</tr>
<tr>
<td>Y&amp;C</td>
<td>0.45</td>
<td>1.0</td>
<td>0.62</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.31</td>
<td>0.45</td>
</tr>
<tr>
<td>RAE</td>
<td>0.46<div class="negdelta">0.15</div></td>
<td>0.98<div class="posdelta">0.37</div></td>
<td>0.62<div class="posdelta">0.01</div></td>
<td>0.0<div class="negdelta">0.22</div></td>
<td>0.0<div class="negdelta">0.01</div></td>
<td>0.0<div class="negdelta">0.03</div></td>
<td>0.63<div class="posdelta">0.15</div></td>
<td>0.05<div class="negdelta">0.67</div></td>
<td>0.09<div class="negdelta">0.48</div></td>
<td>0.31<div class="negdelta">0.01</div></td>
<td>0.46<div class="negdelta">0.08</div></td>
</tr>
<tr>
<td>MVRNN</td>
<td>0.45</td>
<td>0.92<div class="negdelta">0.08</div></td>
<td>0.6<div class="negdelta">0.02</div></td>
<td>0.08<div class="posdelta">0.08</div></td>
<td>0.01<div class="posdelta">0.01</div></td>
<td>0.01<div class="posdelta">0.01</div></td>
<td>0.26<div class="posdelta">0.26</div></td>
<td>0.03<div class="posdelta">0.03</div></td>
<td>0.06<div class="posdelta">0.06</div></td>
<td>0.31</td>
<td>0.43<div class="negdelta">0.02</div></td>
</tr>
<tr>
<td>RNTN</td>
<td>0.45</td>
<td>0.93<div class="posdelta">0.11</div></td>
<td>0.61<div class="posdelta">0.02</div></td>
<td>0.29<div class="posdelta">0.05</div></td>
<td>0.01<div class="negdelta">0.05</div></td>
<td>0.01<div class="negdelta">0.09</div></td>
<td>0.4<div class="negdelta">0.03</div></td>
<td>0.07<div class="negdelta">0.1</div></td>
<td>0.12<div class="negdelta">0.12</div></td>
<td>0.31<div class="negdelta">0.03</div></td>
<td>0.45<div class="negdelta">0.01</div></td>
</tr>
<tr>
<td>SEV</td>
<td>0.56<div class="negdelta">0.17</div></td>
<td>0.79<div class="posdelta">0.05</div></td>
<td>0.66<div class="negdelta">0.08</div></td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.57<div class="posdelta">0.01</div></td>
<td>0.57<div class="negdelta">0.27</div></td>
<td>0.57<div class="negdelta">0.11</div></td>
<td>0.33<div class="negdelta">0.04</div></td>
<td>0.56<div class="negdelta">0.08</div></td>
</tr>
<tr>
<td>BAZ</td>
<td>0.65<div class="negdelta">0.17</div></td>
<td>0.59<div class="negdelta">0.13</div></td>
<td>0.62<div class="negdelta">0.15</div></td>
<td>0.62</td>
<td>0.22<div class="negdelta">0.27</div></td>
<td>0.32<div class="negdelta">0.23</div></td>
<td>0.5<div class="negdelta">0.18</div></td>
<td>0.74<div class="negdelta">0.11</div></td>
<td>0.6<div class="negdelta">0.16</div></td>
<td>0.47<div class="negdelta">0.19</div></td>
<td>0.57<div class="negdelta">0.16</div></td>
</tr>
<tr>
<td>LBA$^{(1)}$</td>
<td>0.58<div class="negdelta">0.18</div></td>
<td>0.77<div class="negdelta">0.07</div></td>
<td>0.66<div class="negdelta">0.13</div></td>
<td>0.54<div class="negdelta">0.06</div></td>
<td>0.53<div class="negdelta">0.03</div></td>
<td>0.54<div class="negdelta">0.04</div></td>
<td>0.63<div class="negdelta">0.12</div></td>
<td>0.37<div class="negdelta">0.31</div></td>
<td>0.46<div class="negdelta">0.26</div></td>
<td>0.6<div class="negdelta">0.09</div></td>
<td>0.58<div class="negdelta">0.15</div></td>
</tr>
<tr>
<td>LBA$^{(2)}$</td>
<td>0.67<div class="negdelta">0.17</div></td>
<td>0.52<div class="negdelta">0.21</div></td>
<td>0.59<div class="negdelta">0.19</div></td>
<td>0.51<div class="negdelta">0.06</div></td>
<td>0.44<div class="negdelta">0.04</div></td>
<td>0.47<div class="negdelta">0.06</div></td>
<td>0.52<div class="negdelta">0.14</div></td>
<td>0.7<div class="negdelta">0.12</div></td>
<td>0.6<div class="negdelta">0.13</div></td>
<td>0.53<div class="negdelta">0.12</div></td>
<td>0.57<div class="negdelta">0.15</div></td>
</tr>
<tr>
<td colspan="12">SB10k</td>
</tr>
<tr>
<td>HL</td>
<td>0.41<div class="negdelta">0.08</div></td>
<td>0.42<div class="negdelta">0.2</div></td>
<td>0.42<div class="negdelta">0.13</div></td>
<td>0.24<div class="negdelta">0.03</div></td>
<td>0.28<div class="negdelta">0.06</div></td>
<td>0.26<div class="negdelta">0.04</div></td>
<td>0.66<div class="negdelta">0.07</div></td>
<td>0.63<div class="negdelta">0.01</div></td>
<td>0.65<div class="negdelta">0.02</div></td>
<td>0.34<div class="negdelta">0.08</div></td>
<td>0.53<div class="negdelta">0.05</div></td>
</tr>
<tr>
<td>TBD</td>
<td>0.41<div class="negdelta">0.07</div></td>
<td>0.37<div class="negdelta">0.23</div></td>
<td>0.39<div class="negdelta">0.14</div></td>
<td>0.21<div class="negdelta">0.03</div></td>
<td>0.24<div class="negdelta">0.03</div></td>
<td>0.22<div class="negdelta">0.03</div></td>
<td>0.65<div class="negdelta">0.07</div></td>
<td>0.66<div class="posdelta">0.03</div></td>
<td>0.66<div class="negdelta">0.01</div></td>
<td>0.31<div class="negdelta">0.08</div></td>
<td>0.53<div class="negdelta">0.04</div></td>
</tr>
<tr>
<td>MST</td>
<td>0.4<div class="negdelta">0.05</div></td>
<td>0.32<div class="negdelta">0.17</div></td>
<td>0.35<div class="negdelta">0.12</div></td>
<td>0.26<div class="negdelta">0.03</div></td>
<td>0.3<div class="negdelta">0.05</div></td>
<td>0.28<div class="negdelta">0.04</div></td>
<td>0.65<div class="negdelta">0.05</div></td>
<td>0.68<div class="negdelta">0.04</div></td>
<td>0.67</td>
<td>0.32<div class="negdelta">0.08</div></td>
<td>0.54<div class="negdelta">0.03</div></td>
</tr>
<tr>
<td>JRK</td>
<td>0.4<div class="negdelta">0.01</div></td>
<td>0.42<div class="negdelta">0.03</div></td>
<td>0.41<div class="negdelta">0.01</div></td>
<td>0.36</td>
<td>0.26</td>
<td>0.3</td>
<td>0.69</td>
<td>0.72<div class="negdelta">0.03</div></td>
<td>0.71<div class="negdelta">0.01</div></td>
<td>0.36<div class="posdelta">0.01</div></td>
<td>0.59<div class="negdelta">0.006</div></td>
</tr>
<tr>
<td>KLCH</td>
<td>0.42<div class="posdelta">0.03</div></td>
<td>0.21<div class="negdelta">0.01</div></td>
<td>0.28</td>
<td>0.25<div class="negdelta">0.09</div></td>
<td>0.13</td>
<td>0.17<div class="negdelta">0.02</div></td>
<td>0.66</td>
<td>0.86</td>
<td>0.75</td>
<td>0.23<div class="negdelta">0.005</div></td>
<td>0.6<div class="negdelta">0.002</div></td>
</tr>
<tr>
<td>GMN</td>
<td>0.48<div class="negdelta">0.17</div></td>
<td>0.31<div class="negdelta">0.14</div></td>
<td>0.37<div class="negdelta">0.16</div></td>
<td>0.27<div class="negdelta">0.11</div></td>
<td>0.07<div class="negdelta">0.01</div></td>
<td>0.11<div class="negdelta">0.02</div></td>
<td>0.69<div class="negdelta">0.03</div></td>
<td>0.9<div class="negdelta">0.03</div></td>
<td>0.78<div class="negdelta">0.03</div></td>
<td>0.24<div class="negdelta">0.09</div></td>
<td>0.64<div class="negdelta">0.06</div></td>
</tr>
<tr>
<td>MHM</td>
<td>0.67<div class="negdelta">0.04</div></td>
<td>0.62<div class="negdelta">0.03</div></td>
<td>0.65<div class="negdelta">0.03</div></td>
<td>0.59<div class="negdelta">0.08</div></td>
<td>0.42<div class="negdelta">0.02</div></td>
<td>0.49<div class="negdelta">0.04</div></td>
<td>0.8</td>
<td>0.88<div class="negdelta">0.01</div></td>
<td>0.84</td>
<td>0.56<div class="negdelta">0.002</div></td>
<td>0.75<div class="negdelta">0.001</div></td>
</tr>
<tr>
<td>GNT</td>
<td>0.42<div class="negdelta">0.25</div></td>
<td>0.21<div class="negdelta">0.41</div></td>
<td>0.28<div class="negdelta">0.36</div></td>
<td>0.25<div class="negdelta">0.19</div></td>
<td>0.13<div class="negdelta">0.15</div></td>
<td>0.17<div class="negdelta">0.17</div></td>
<td>0.66<div class="negdelta">0.12</div></td>
<td>0.86<div class="negdelta">0.01</div></td>
<td>0.75<div class="negdelta">0.07</div></td>
<td>0.22<div class="negdelta">0.2</div></td>
<td>0.604<div class="negdelta">0.12</div></td>
</tr>
<tr>
<td>Y&amp;C</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.62</td>
<td>1.0</td>
<td>0.77</td>
<td>0.0</td>
<td>0.62</td>
</tr>
<tr>
<td>RAE</td>
<td>0.46<div class="negdelta">0.04</div></td>
<td>0.62<div class="negdelta">0.11</div></td>
<td>0.53<div class="negdelta">0.06</div></td>
<td>0.18<div class="negdelta">0.17</div></td>
<td>0.02<div class="negdelta">0.04</div></td>
<td>0.03<div class="negdelta">0.07</div></td>
<td>0.77<div class="negdelta">0.03</div></td>
<td>0.82<div class="posdelta">0.02</div></td>
<td>0.79<div class="negdelta">0.01</div></td>
<td>0.28<div class="negdelta">0.07</div></td>
<td>0.66<div class="negdelta">0.02</div></td>
</tr>
<tr>
<td>MVRNN</td>
<td>0.19</td>
<td>0.01</td>
<td>0.03</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.62</td>
<td>0.97</td>
<td>0.76</td>
<td>0.01</td>
<td>0.61</td>
</tr>
<tr>
<td>RNTN</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.62</td>
<td>1.0</td>
<td>0.77</td>
<td>0.0</td>
<td>0.62</td>
</tr>
<tr>
<td>SEV</td>
<td>0.58<div class="negdelta">0.06</div></td>
<td>0.39<div class="negdelta">0.19</div></td>
<td>0.47<div class="negdelta">0.14</div></td>
<td>0.23<div class="negdelta">0.28</div></td>
<td>0.05<div class="negdelta">0.16</div></td>
<td>0.08<div class="negdelta">0.22</div></td>
<td>0.7<div class="negdelta">0.06</div></td>
<td>0.92<div class="posdelta">0.03</div></td>
<td>0.8<div class="negdelta">0.02</div></td>
<td>0.27<div class="negdelta">0.18</div></td>
<td>0.67<div class="negdelta">0.05</div></td>
</tr>
<tr>
<td>BAZ</td>
<td>0.69<div class="negdelta">0.03</div></td>
<td>0.54<div class="negdelta">0.16</div></td>
<td>0.6<div class="negdelta">0.05</div></td>
<td>0.36<div class="negdelta">0.17</div></td>
<td>0.49<div class="posdelta">0.16</div></td>
<td>0.41</td>
<td>0.79</td>
<td>0.79<div class="negdelta">0.12</div></td>
<td>0.79<div class="negdelta">0.05</div></td>
<td>0.51<div class="negdelta">0.02</div></td>
<td>0.69<div class="negdelta">0.06</div></td>
</tr>
<tr>
<td>LBA$^{(1)}$</td>
<td>0.24<div class="negdelta">0.36</div></td>
<td>0.86<div class="posdelta">0.14</div></td>
<td>0.38<div class="negdelta">0.28</div></td>
<td>0.45<div class="negdelta">0.02</div></td>
<td>0.45<div class="posdelta">0.03</div></td>
<td>0.45<div class="posdelta">0.01</div></td>
<td>0.69<div class="negdelta">0.15</div></td>
<td>0.01<div class="negdelta">0.79</div></td>
<td>0.02<div class="negdelta">0.8</div></td>
<td>0.41<div class="negdelta">0.14</div></td>
<td>0.27<div class="negdelta">0.46</div></td>
</tr>
<tr>
<td>LBA$^{(2)}$</td>
<td>0.74<div class="negdelta">0.02</div></td>
<td>0.42<div class="negdelta">0.15</div></td>
<td>0.54<div class="negdelta">0.1</div></td>
<td>0.62<div class="posdelta">0.07</div></td>
<td>0.25<div class="negdelta">0.14</div></td>
<td>0.35<div class="negdelta">0.11</div></td>
<td>0.73<div class="negdelta">0.06</div></td>
<td>0.95<div class="posdelta">0.05</div></td>
<td>0.82<div class="negdelta">0.02</div></td>
<td>0.45<div class="negdelta">0.1</div></td>
<td>0.72<div class="negdelta">0.03</div></td>
</tr>
</tbody>
</table>


# Summary and Conclusions

* I have compared three major families of message-level sentiment analysis methods: lexicon-, machine-learning&ndash; and deep-learning&ndash;based ones, finding that the last two groups significantly outperform lexicon-driven systems;

* Surprisingly, among all compared lexicon methods, the simplest one (the classifier of Hu and Liu [2004]) produced the best macro- and micro-averaged $F_1$-results on the PotTS corpus (0.615 and 0.685 respectively) and also yielded the highest macro $F_1$-measure on the SB10k dataset (0.421). Other systems, however, could have improved their scores if they had handled the negation of polar terms better (after switching off the negation component in the method of Musto et al., its macro-$F_1$ on the PotTS corpus increased to 0.641, surpassing the benchmark of Hu and Liu);

* as expected, the ML-based system of Mohammad et al. (2013), the winner of the inaugural run of the SemEval task on sentiment analysis in Twitter (Nakov et al., 2013), also surpassed other ML competitors, achieving highly competitive results: 0.674 macro- and 0.727 micro-$F_1$ on the PotTS data, and 0.564 macro- and 0.752 micro-averaged $F_1$-measure on the SB10k test set;

* as in the previous case, however, these results could have been improved if the classifier dispensed with character-level and part-of-speech features and used logistic regression instead of SVM;

* a much more varied situation was observed with deep-learning–based systems, which frequently fell into always predicting the majority class for all tweets, but sometimes yielded extraordinarily good results, as was the case with our proposed lexicon-based attention system, which attained 0.69 macro-$F_1$ on the PotTS corpus and 0.55 macro-$F_1$ on the SB10k dataset (0.73 and 0.75 micro-$F_1$ respectively), setting a new state of the art for the former data;

* speaking of word embeddings, we should note that almost all DL-based approaches showed fairly low scores when they used randomly initialized task-specific embeddings, but notably improved their results after switching to pre-trained word2vec vectors, and benefited even more from the least-squares fallback;

* against our expectations, we could not overcome the majority class pitfall of DL-based systems after adding more distantly supervised training data, which, in general, only lowered the scores of both ML- and DL-based methods. Since this result contradicts the findings of other authors, we hypothesize that this degradation is primarily due to the differences in the class distributions between automatically and manually labeled tweets;

* on the other hand, we could see that using higher-quality sentiment lexicons (especially manually curated and dictionary-based ones) resulted in further improvements for the systems that relied on this lexical resource;

* last but not least, we proved the utility of the text normalization step, which brought about significant improvements for all tested methods, as confirmed by our last ablation test.

# Chapter VI: Discourse-Aware Sentiment Analysis

## Motivating Examples

<div class="example">
Wollte meinen Kleiderschrank aufr&auml;umen ... sitze nun darin und singe Liebeslieder ...
<div class="translation">Wanted to clean up my wardrobe... Now sitting in it and singing love songs...</div>
Gold Label:<div class="label neutral">neutral</div>
Predicted Label:<div class="label positive">positive*</div>
</div>

<div class="example">
Gerade super Lust, mit Carls Haaren was zu machen aber ca 300 km Distanz halten mich davon ab.
<div class="translation">Really feel like doing something with Carl's hair right now, but a distance of about 300 km is keeping me from it.</div>
Gold Label:<div class="label neutral">neutral</div>
Predicted Label:<div class="label positive">positive*</div>
</div>

## Common Approaches to Discourse Analysis

* Rhetorical Structure Theory (Mann and Thompson, 1988), which divides the text into elementary discourse units (EDUs) and infers a hierarchical structure (typically a tree) between these units;

* PDTB (Prasad et al., 2004), which analyzes the occurrences of (either explicitly mentioned or implicitly assumed) connectives (i.e., lexico-grammatic elements that connect two sentences) in the text and considers text sentences as arguments of these connectives;

* SDRT (Lascarides and Asher, 2001), which is conceptually similar to RST in that it also assumes a hierarchical structure of the text, with the leaves of this structure representing elementary discourse units.  In contrast to RST, however, SDRT allows this structure to be a graph, not just a tree (i.e., a node can have multiple parents and there can be multiple links between the same pair of nodes).

## Data Preparation

* I split all microblogs from the PotTS and SB10k corpora into elementary discourse units with the ML-based discourse segmenter of Sidarenka et al. (2015).  After filtering out all tweets that had only one EDU, I obtained 4,771 messages (12,137 segments) for PotTS and 3,763 microblogs (9,625 segments) for the SB10k corpus.

* In the next step, I assigned polarity scores to the segments of these microblogs with the help of the lexicon-based attention classifier, analyzing each elementary unit in isolation, independently of the rest of the tweet.  I again used the same 70&ndash;10&ndash;20 split into training, development, and test sets as in the previous chapters.

<figure>
<img src="img/dasa_potts_edu_distribution.png" alt="EDU distribution in PotTS">
<figcaption>PotTS</figcaption>
<img src="img/dasa_sb10k_edu_distribution.png" alt="EDU distribution in SB10k">
<figcaption>SB10k</figcaption>
</figure>
<div>
Distribution of elementary discourse units and polarity classes in the training and development sets of PotTS and SB10k
</div>

<div class="example">
[Guinness on Wheelchairs :]$_1$ [Das .]$_2$ [Ist .]$_3$ [Verdammt .]$_4$ [Noch .]$_5$ [Mal .]$_6$ [Einer .]$_7$ [Der .]$_8$ [Besten .]$_9$ [Werbespots .]$_{10}$ [Des .]$_{11}$ [Jahrzehnts .]$_{12}$ [( Auch ...]$_{13}$
<div class="translation">
[Guinness on Wheelchairs :]$_1$ [This .]$_2$ [Is .]$_3$ [Gosh .]$_4$ [Darn .]$_5$ [It .]$_6$ [One .]$_7$ [Of .]$_8$ [The best .]$_9$ [Commercials .]$_{10}$ [Of .]$_{11}$ [The Decade .]$_{12}$ [( Also ...]$_{13}$</div>
</div>

* Finally, I derived RST trees for the segmented tweets with the DPLP discourse parser (Ji and Eisenstein, 2014) that I had previously retrained on the Potsdam Commentary Corpus (PCC 2.0; Stede and Neumann, 2014).  An example of such an automatically derived RST tree is provided below:

TODO: convert twitter-rst.tex to png

## DASA Approaches

* the **No-Discourse** baseline, in which I simply re-use the scores assigned by the LBA classifier to the whole message;

* **Last**, in which I determine the overall polarity of a tweet by taking the LBA scores assigned to its last EDU;

* **Root**, which is conceptually similar to **Last** with the only difference that it infers the polarity of the microblog from the root EDU in the discourse tree instead of the last segment;

* the method of **Wang and Wu** (2013), who determine the semantic orientation of a document as a linear combination of the polarity scores of its EDUs, weighted by automatically learned coefficients:
$$\psi = \sum_{i}p_i\times d_i + b,$$
where $\psi$ is the final sentiment score of the whole document, $p_i$ is the sentiment score of the $i$-th EDU, and $d_i$ and $b$ are automatically learned model parameters, with the former term denoting the strength of the discourse relation via which the $i$-th EDU is connected to its parent;

* **Discourse-Depth Reweighting** (DDR) by Bhatia et al. (2015), in which the authors estimate the relevance $\lambda_i$ of each discourse unit $i$ as:
$$\lambda_i = \max\left(0.5, 1 - \frac{d_i}{6}\right),$$
where $d_i$ stands for the depth of the $i$-th EDU in the document's discourse tree.
After estimating the sentiment score of each unit as:
$$\sigma_i = \vec{\theta}^{\top}\vec{w}_i,$$
where $\vec{w}_i$ represents the feature vector of the $i$-th unit, and $\vec{\theta}$ stands for the corresponding parameters, Bhatia et al. compute the overall polarity of a document $\psi$ as:
$$\psi = \sum_i\lambda_i\vec{\theta}^{\top}\vec{w}_i = \vec{\theta}^{\top}\sum_i\lambda_i\vec{w}_i;$$

* **Rhetorical Recursive Neural Network** (R2N2; Bhatia et al., 2015), where the authors mainly follow the recursive neural network (RNN) approach of Socher et al. (2013), computing the polarity score of each discourse unit recursively as:
$$  \psi_i = \tanh\left(K_n^{(r_i)} \psi_{n(i)} + K_s^{(r_i)}\psi_{s(i)} \right)$$
The $K_n^{(r_i)}$ and $K_s^{(r_i)}$ terms in the above equation stand for the nucleus and satellite coefficients associated with the rhetorical relation $r_i$, and $\psi_{n(i)}$ and $\psi_{s(i)}$ represent sentiment scores of the nucleus and satellite of the $i$-th vertex.
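
The three formulas above can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions: every EDU carries a single scalar polarity score, the tree is given as nested tuples, and all coefficients (`rel_weights`, `k_n`, `k_s`) are made-up values rather than learned parameters. The function names are hypothetical and do not come from any of the cited systems.

```python
import math


def wang_wu_score(edu_scores, rel_weights, bias=0.0):
    """Linear combination psi = sum_i p_i * d_i + b."""
    return sum(p * d for p, d in zip(edu_scores, rel_weights)) + bias


def ddr_weight(depth):
    """Relevance lambda_i = max(0.5, 1 - d_i / 6) of an EDU at the given depth."""
    return max(0.5, 1.0 - depth / 6.0)


def ddr_score(edu_scores, depths):
    """Depth-reweighted sum of per-EDU scores."""
    return sum(ddr_weight(d) * s for s, d in zip(edu_scores, depths))


def r2n2_score(node, k_nucleus, k_satellite):
    """Recursively combine nucleus and satellite scores with tanh.

    A node is either a number (the score of an EDU) or a tuple
    (relation, nucleus_subtree, satellite_subtree).
    """
    if isinstance(node, (int, float)):
        return float(node)
    relation, nucleus, satellite = node
    psi_n = r2n2_score(nucleus, k_nucleus, k_satellite)
    psi_s = r2n2_score(satellite, k_nucleus, k_satellite)
    return math.tanh(k_nucleus[relation] * psi_n + k_satellite[relation] * psi_s)


# Hypothetical toy input: a positive nucleus contrasted with a negative satellite.
tree = ("Contrast", 0.8, -0.6)
k_n = {"Contrast": 1.0}  # made-up nucleus coefficient
k_s = {"Contrast": 0.5}  # made-up satellite coefficient

print(wang_wu_score([0.8, -0.6], [1.0, 0.5]))
print(ddr_score([0.8, -0.6], depths=[0, 1]))
print(r2n2_score(tree, k_n, k_s))
```

Note that the DDR weight never drops below 0.5, so even deeply embedded EDUs keep half of their influence, whereas R2N2 can learn to suppress a satellite entirely through a small $K_s^{(r_i)}$.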

In addition to these existing (baseline) solutions, I also propose three approaches of my **own**:

* **latent Conditional Random Fields** (LCRF):
<figure>
<img src="img/latent-crf-correct.png" alt="Computational path of the probability of the correct label in latent CRF">
<figcaption>
Computational path of the probability of the correct label in latent CRF
</figcaption>
</figure>

<figure>
<img src="img/latent-crf-wrong.png" alt="Computational path of the probability of the wrong label in latent CRF">
<figcaption>
Computational path of the probability of the wrong label in latent CRF
</figcaption>
</figure>

* **latent Marginalized Conditional Random Fields** (LMCRF):

<figure>
<img src="img/latent-mcrf-correct.png" alt="Computational path of the probability of the correct label in latent marginalized CRF">
<figcaption>
Computational path of the probability of the correct label in latent marginalized CRF
</figcaption>
</figure>

<figure>
<img src="img/latent-mcrf-wrong.png" alt="Computational path of the probability of the wrong label in latent marginalized CRF">
<figcaption>
Computational path of the probability of the wrong label in latent marginalized CRF
</figcaption>
</figure>

* the **Recursive Dirichlet Process**:

$$p_{\theta}(\mathbf{x},\mathbf{z})=p_{\theta}(\mathbf{x}|\mathbf{z})p_{\theta}(\mathbf{z})$$

$$\theta_{max} = \underset{\theta}{\operatorname{argmax}}p_{\theta}(\mathbf{x})$$

$$p_{\theta_{max}}(\mathbf{z}|\mathbf{x}) = \frac{p_{\theta_{max}}(\mathbf{x}, \mathbf{z})}{\int p_{\theta_{max}}(\mathbf{x}, \mathbf{z})\mathrm{d}\mathbf{z}}$$

$$q_{\phi}(\mathbf{z})\approx p_{\theta_{max}}(\mathbf{z}|\mathbf{x})$$

Objective function:

$$\mathrm{ELBO}\overset{\Delta}{=}\mathbb{E}_{q_{\phi}(\boldsymbol{z})}\left[\log p_{\theta}(\boldsymbol{x},\boldsymbol{z}) - \log q_{\phi}(\boldsymbol{z})\right]$$
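
To see why maximizing this objective is reasonable, it may help to write out the standard decomposition of the log-evidence. This is a textbook identity of variational inference rather than anything specific to the model here; it assumes that $q_{\phi}(\boldsymbol{z})$ is non-zero only where the posterior is, and the $\mathrm{KL}$ notation is introduced only for this illustration:

$$\begin{aligned}
\log p_{\theta}(\boldsymbol{x})
&= \mathbb{E}_{q_{\phi}(\boldsymbol{z})}\left[\log\frac{p_{\theta}(\boldsymbol{x},\boldsymbol{z})}{p_{\theta}(\boldsymbol{z}|\boldsymbol{x})}\right]\\
&= \mathbb{E}_{q_{\phi}(\boldsymbol{z})}\left[\log p_{\theta}(\boldsymbol{x},\boldsymbol{z}) - \log q_{\phi}(\boldsymbol{z})\right]
 + \mathbb{E}_{q_{\phi}(\boldsymbol{z})}\left[\log q_{\phi}(\boldsymbol{z}) - \log p_{\theta}(\boldsymbol{z}|\boldsymbol{x})\right]\\
&= \mathrm{ELBO} + \mathrm{KL}\left(q_{\phi}(\boldsymbol{z})\,\|\,p_{\theta}(\boldsymbol{z}|\boldsymbol{x})\right).
\end{aligned}$$

Since the KL term is non-negative, the ELBO is a lower bound on $\log p_{\theta}(\boldsymbol{x})$, and for fixed $\theta$, raising it with respect to $\phi$ moves $q_{\phi}$ closer to the true posterior.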

I associate a random variable $z_{j_k} \in \mathbb{R}^{3}_{+}$ with every RST node $j$, which represents the probabilities of polarity classes (negative, neutral, and positive) for that node after seeing its $k$-th child:
$$z_{j_k} \sim Dir(\boldsymbol{\alpha}).$$

I set the initial values of these variables (i.e., $z_{j_0}$) to the scores predicted by the LBA classifier for EDUs and the root nodes, and set them to zeroes for abstract nodes.  Then when analyzing the $k$-th child of that node, I compute the score that comes from that child as follows:
$$\boldsymbol{z}^{*}_k = \textrm{sparsemax}(M_r\boldsymbol{z}_k^{\top}),$$
where the $M_r \sim \mathcal{N}_{3\times3}(\boldsymbol{\mu}_r, \boldsymbol{\Sigma}_r)$ matrix reflects contextual changes introduced by discourse relation $r$ that holds between parent and child.  The initial priors for this parameter are:
$$  \boldsymbol{\mu}_r=\begin{bmatrix}
  1 & 0 & 0\\
  0 & 0.3 & 0\\
  0 & 0 & 1
  \end{bmatrix}$$
and
$$
  \boldsymbol{\Sigma}_r=\begin{bmatrix}
  1 & 1 & 1\\
  1 & 1 & 1\\
  1 & 1 & 1
  \end{bmatrix}.
$$
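
As a rough illustration of the child transformation above, the following sketch uses the prior mean $\boldsymbol{\mu}_r$ in place of a sampled $M_r$ and then projects the result onto the probability simplex with sparsemax. The sorting-based routine is the usual closed-form sparsemax algorithm. The helper names (`sparsemax`, `matvec`, `MU_PRIOR`) and the example child vector are assumptions made for this example, not part of the original implementation.

```python
def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumsum += z_k
        if 1.0 + k * z_k > cumsum:
            # The k largest entries are still in the support.
            tau = (cumsum - 1.0) / k
        else:
            break
    return [max(v - tau, 0.0) for v in z]


def matvec(matrix, vector):
    """Plain matrix-vector product for small dense matrices."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


# Prior mean of M_r from the text: the neutral component is damped by 0.3.
MU_PRIOR = [
    [1.0, 0.0, 0.0],
    [0.0, 0.3, 0.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical child scores in the order (negative, neutral, positive).
child = [0.2, 0.7, 0.1]
z_star = sparsemax(matvec(MU_PRIOR, child))
print(z_star)  # roughly [0.363, 0.373, 0.263]
```

Unlike softmax, sparsemax can return exact zeros: for the input `[2.0, 0.0, 0.0]` all probability mass goes to the first class.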

Then I compute the $\boldsymbol{\alpha}$ parameters as:
$$  \boldsymbol{\alpha}_{j_k} = \boldsymbol{\beta}\odot\boldsymbol{z}^*_k +
  (\boldsymbol{1} -
  \boldsymbol{\beta})\odot\boldsymbol{z}_{j_{k-1}},
$$
where $\boldsymbol{\beta}\in\mathbb{R}^3$ is another multivariate random variable sampled from the Beta distribution $B(5., 5.)$, which controls the amount of information that is passed from a child to its parent.

The only thing that I need to do to the above $\boldsymbol{\alpha}_{j_k}$ term before drawing the actual probability $\boldsymbol{z}_{j_k}$ is to scale this vector by a certain amount in order to reduce the variance of the resulting Dirichlet distribution (if the $\boldsymbol{\alpha}_{j_k}$ vector from the previous equation were kept unchanged, most of its values would lie in the range $[0, 1]$, which would lead to an extremely high variance of the Dirichlet distribution). In particular, I compute this scaling factor as follows:
$$\textrm{scale} = \frac{\xi \times \left(0.1 + \cos\left(\boldsymbol{z}^*_k, \boldsymbol{z}_{j_{k-1}}\right)\right)}{H\left(\boldsymbol{\alpha}_{j_k}\right)},$$
where $\xi$ is a model parameter sampled from a $\chi^2$-distribution: $\xi\sim\chi^2(34)$; 0.1 is a constant used to prevent zero scales in the cases when $\cos\left(\mathbf{z}^*_k, \mathbf{z}_{j_{k-1}}\right)$ is zero; and $H\left(\boldsymbol{\alpha}_{j_k}\right)$ stands for the entropy of the $\boldsymbol{\alpha}_{j_k}$ vector.
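
A single parent update can be simulated with the Python standard library alone. The sketch below is only an approximation of the procedure described above. It draws $\boldsymbol{\beta}$ and $\xi$ from their priors instead of inferring them, and it assumes that "scaling" means multiplying $\boldsymbol{\alpha}_{j_k}$ by the scale factor. It also assumes that $H$ is the Shannon entropy taken over the entries of $\boldsymbol{\alpha}_{j_k}$ as they are, and it guards against a zero entropy with a small constant. The function names are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity; defined as 0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def entropy(p):
    """Shannon entropy of the entries of p (0 * log 0 is treated as 0)."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def rdp_update(z_child, z_prev, rng):
    """One update of a parent node; returns (alpha, scale, z_new)."""
    # beta ~ B(5, 5) per class: how much of the child is passed upwards.
    beta = [rng.betavariate(5.0, 5.0) for _ in z_child]
    alpha = [b * c + (1.0 - b) * p for b, c, p in zip(beta, z_child, z_prev)]
    # xi ~ chi^2(34), which equals Gamma(shape=17, scale=2).
    xi = rng.gammavariate(17.0, 2.0)
    scale = xi * (0.1 + cosine(z_child, z_prev)) / max(entropy(alpha), 1e-6)
    # Dirichlet(scale * alpha) draw via normalized Gamma variates.
    gammas = [rng.gammavariate(max(scale * a, 1e-6), 1.0) for a in alpha]
    total = sum(gammas)
    z_new = [g / total for g in gammas]
    return alpha, scale, z_new


rng = random.Random(0)
# Hypothetical scores in the order (negative, neutral, positive).
alpha, scale, z_new = rdp_update([0.1, 0.2, 0.7], [0.2, 0.3, 0.5], rng)
print(alpha, scale, z_new)
```

When the child and the parent agree (high cosine similarity) and $\boldsymbol{\alpha}_{j_k}$ is peaked (low entropy), the scale grows and the Dirichlet draw stays close to $\boldsymbol{\alpha}_{j_k}$. Disagreement or a flat $\boldsymbol{\alpha}_{j_k}$ leaves more variance in the result.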

Finally, to predict the final polarity label for the whole tweet, I simply draw a value from a Categorical distribution, using the value $\boldsymbol{z}_{\text{Root}}$ as its parameter:
$$y \sim Cat(\mathbf{z}_{\text{Root}}).$$

<figure>
   <img src="img/dirichlet-process.png" alt="Probability distributions of polar classes computed by the Recursive Dirichlet Process">
    <figcaption>
    Probability distributions of polar classes computed by the Recursive Dirichlet Process
    </figcaption>
</figure>

(higher probability regions are highlighted in red; $\boldsymbol{p}_{prnt}$ means the probability of the parent
node [the values in the vector represent the scores for the negative, neutral, and positive polarities respectively]; $\boldsymbol{p}_{chld}$ denotes the probability of the child; and $\boldsymbol{\alpha}$, $\boldsymbol{\mu}$, and $\boldsymbol{\sigma}^2$ represent the parameters of the resulting joint distribution shown in the simplices)

<figure>
   <img src="img/rdp.png" alt="A plate diagram of the Recursive Dirichlet Process">
    <figcaption>
    A plate diagram of the Recursive Dirichlet Process
    </figcaption>
</figure>

## Results

<table>
<caption>
Results of discourse-aware sentiment analysis methods<br/>
(LCRF &mdash; latent conditional random fields, LMCRF &mdash; latent-marginalized conditional random fields, RDP &mdash; recursive Dirichlet process, DDR &mdash; discourse-depth reweighting (Bhatia et al., 2015), R2N2 &mdash; rhetorical recursive neural network (Bhatia et al., 2015), WNG &mdash; the method of Wang and Wu (2013), Last &mdash; polarity determined by the last EDU, Root &mdash; polarity determined by the root EDU(s), No-Discourse &mdash; discourse-unaware classifier)
</caption>
<thead>
<tr>
<td rowspan="2">Method</td>
<td colspan="3">Positive</td>
<td colspan="3">Negative</td>
<td colspan="3">Neutral</td>
<td rowspan="2">Macro-$F_1$</td>
<td rowspan="2">Micro-$F_1$</td>
</tr>
<tr>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
<td>Precision</td>
<td>Recall</td>
<td>$F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td colspan="12">
    PotTS
</td>
</tr>
<tr>
<td>LCRF</td>
<td>0.76</td>
<td>0.79</td>
<td>0.77</td>
<td>0.61</td>
<td>0.53</td>
<td>0.56</td>
<td>0.7</td>
<td>0.71</td>
<td>0.71</td>
<td>0.67</td>
<td>0.709</td>
</tr>
<tr>
<td>LMCRF</td>
<td>0.77</td>
<td>0.77</td>
<td>0.77</td>
<td>0.61</td>
<td>0.54</td>
<td>0.57</td>
<td>0.69</td>
<td>0.74</td>
<td>0.72</td>
<td>0.671</td>
<td>0.712</td>
</tr>
<tr>
<td>RDP</td>
<td>0.73</td>
<td>0.82</td>
<td>0.77</td>
<td>0.61</td>
<td>0.56</td>
<td>0.58</td>
<td>0.73</td>
<td>0.65</td>
<td>0.69</td>
<td>0.678</td>
<td>0.706</td>
</tr>
<tr>
<td>DDR</td>
<td>0.73</td>
<td>0.77</td>
<td>0.75</td>
<td>0.54</td>
<td>0.59</td>
<td>0.56</td>
<td>0.69</td>
<td>0.61</td>
<td>0.65</td>
<td>0.655</td>
<td>0.674</td>
</tr>
<tr>
<td>R2N2</td>
<td>0.74</td>
<td>0.78</td>
<td>0.76</td>
<td>0.59</td>
<td>0.53</td>
<td>0.56</td>
<td>0.68</td>
<td>0.68</td>
<td>0.68</td>
<td>0.657</td>
<td>0.692</td>
</tr>
<tr>
<td>WNG</td>
<td>0.58</td>
<td>0.79</td>
<td>0.67</td>
<td>0.61</td>
<td>0.21</td>
<td>0.31</td>
<td>0.61</td>
<td>0.57</td>
<td>0.59</td>
<td>0.487</td>
<td>0.59</td>
</tr>
<tr>
<td>Last</td>
<td>0.52</td>
<td>0.83</td>
<td>0.64</td>
<td>0.57</td>
<td>0.17</td>
<td>0.26</td>
<td>0.61</td>
<td>0.43</td>
<td>0.5</td>
<td>0.453</td>
<td>0.549</td>
</tr>
<tr>
<td>Root</td>
<td>0.56</td>
<td>0.73</td>
<td>0.64</td>
<td>0.58</td>
<td>0.22</td>
<td>0.32</td>
<td>0.55</td>
<td>0.54</td>
<td>0.54</td>
<td>0.481</td>
<td>0.56</td>
</tr>
<tr>
<td>No-Discourse</td>
<td>0.73</td>
<td>0.82</td>
<td>0.77</td>
<td>0.61</td>
<td>0.56</td>
<td>0.58</td>
<td>0.72</td>
<td>0.66</td>
<td>0.69</td>
<td>0.677</td>
<td>0.706</td>
</tr>
<tr>
<td colspan="12">
    SB10k
</td>
</tr>
<tr>
<td>LCRF</td>
<td>0.64</td>
<td>0.69</td>
<td>0.66</td>
<td>0.45</td>
<td>0.45</td>
<td>0.45</td>
<td>0.82</td>
<td>0.79</td>
<td>0.8</td>
<td>0.557</td>
<td>0.713</td>
</tr>
<tr>
<td>LMCRF</td>
<td>0.64</td>
<td>0.69</td>
<td>0.67</td>
<td>0.45</td>
<td>0.45</td>
<td>0.45</td>
<td>0.82</td>
<td>0.79</td>
<td>0.8</td>
<td>0.56</td>
<td>0.715</td>
</tr>
<tr>
<td>RDP</td>
<td>0.64</td>
<td>0.69</td>
<td>0.66</td>
<td>0.45</td>
<td>0.45</td>
<td>0.45</td>
<td>0.82</td>
<td>0.79</td>
<td>0.8</td>
<td>0.557</td>
<td>0.713</td>
</tr>
<tr>
<td>DDR</td>
<td>0.59</td>
<td>0.63</td>
<td>0.61</td>
<td>0.48</td>
<td>0.44</td>
<td>0.46</td>
<td>0.77</td>
<td>0.76</td>
<td>0.77</td>
<td>0.534</td>
<td>0.681</td>
</tr>
<tr>
<td>R2N2</td>
<td>0.64</td>
<td>0.69</td>
<td>0.66</td>
<td>0.46</td>
<td>0.45</td>
<td>0.45</td>
<td>0.81</td>
<td>0.79</td>
<td>0.8</td>
<td>0.559</td>
<td>0.713</td>
</tr>
<tr>
<td>WNG</td>
<td>0.61</td>
<td>0.63</td>
<td>0.62</td>
<td>0.46</td>
<td>0.29</td>
<td>0.36</td>
<td>0.76</td>
<td>0.82</td>
<td>0.79</td>
<td>0.488</td>
<td>0.693</td>
</tr>
<tr>
<td>Last</td>
<td>0.56</td>
<td>0.55</td>
<td>0.56</td>
<td>0.46</td>
<td>0.29</td>
<td>0.36</td>
<td>0.73</td>
<td>0.8</td>
<td>0.76</td>
<td>0.459</td>
<td>0.661</td>
</tr>
<tr>
<td>Root</td>
<td>0.51</td>
<td>0.55</td>
<td>0.53</td>
<td>0.4</td>
<td>0.3</td>
<td>0.35</td>
<td>0.74</td>
<td>0.76</td>
<td>0.75</td>
<td>0.438</td>
<td>0.64</td>
</tr>
<tr>
<td>No-Discourse</td>
<td>0.64</td>
<td>0.69</td>
<td>0.66</td>
<td>0.45</td>
<td>0.45</td>
<td>0.45</td>
<td>0.82</td>
<td>0.79</td>
<td>0.8</td>
<td>0.557</td>
<td>0.713</td>
</tr>
</tbody>
</table>
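
For readers who want to recompute the last two columns, here is a small sketch of the two averages. Judging from the numbers alone, the macro-$F_1$ in the table appears to be the mean of the positive and negative $F_1$-scores only (for WNG on PotTS, $(0.67 + 0.31) / 2 = 0.49$, close to the reported 0.487). This is an inference from the table, not something stated explicitly above, and the code encodes it as an assumption. Micro-$F_1$ over all three classes coincides with accuracy when every tweet gets exactly one label. The function name and the toy labels are hypothetical.

```python
from collections import Counter

CLASSES = ("positive", "negative", "neutral")
POLAR = ("positive", "negative")  # assumption: macro-F1 averages these only


def f1_scores(gold, pred):
    """Return (per-class F1, macro-F1 over polar classes, micro-F1)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    per_class = {}
    for c in CLASSES:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(per_class[c] for c in POLAR) / len(POLAR)
    # With one label per tweet, micro-F1 over all classes equals accuracy.
    micro = sum(tp.values()) / len(gold) if gold else 0.0
    return per_class, macro, micro


# Hypothetical toy labels, not taken from PotTS or SB10k.
gold = ["positive", "positive", "negative", "neutral", "neutral", "neutral"]
pred = ["positive", "neutral", "negative", "neutral", "neutral", "positive"]
per_class, macro, micro = f1_scores(gold, pred)
print(per_class, macro, micro)
```

Under this reading, a classifier that always predicts the neutral class gets a macro-$F_1$ of 0 while still reaching a micro-$F_1$ equal to the share of neutral tweets.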

# Evaluation

## Base Classifier

Since the scores of the presented discourse-aware sentiment systems crucially depended on the accuracy of the base classifier (the one I use to assign sentiment scores to single EDUs), I decided to rerun the experiments using the best systems from the two other message-level sentiment analysis groups (lexicon- and machine-learning&ndash;based), namely:
* the system of Hu and Liu (2004);
* and the SVM classifier of Mohammad et al. (2013).

<figure>
<img src="img/dasa-potts-bc-macro-F1.png" alt="Macro-$F_1$ Results on the PotTS corpus with Different Base Classifiers">
<figcaption>Macro-$F_1$</figcaption>
<img src="img/dasa-potts-bc-micro-F1.png" alt="Micro-$F_1$ Results on the PotTS corpus with Different Base Classifiers">
<figcaption>Micro-$F_1$</figcaption>
</figure>
<div>
Results of discourse-aware sentiment analysis methods with different base classifiers on the PotTS corpus
</div>

<figure>
<img src="img/dasa-sb10k-bc-macro-F1.png" alt="Macro-$F_1$ Results on the SB10k corpus with Different Base Classifiers">
<figcaption>Macro-$F_1$</figcaption>
<img src="img/dasa-sb10k-bc-micro-F1.png" alt="Micro-$F_1$ Results on the SB10k corpus with Different Base Classifiers">
<figcaption>Micro-$F_1$</figcaption>
</figure>
<div>
Results of discourse-aware sentiment analysis methods with different base classifiers on the SB10k corpus
</div>

## Relation Scheme

Another factor that could significantly affect the results of discourse-aware sentiment methods was the set of discourse relations distinguished by the parsing system. On the one hand, this set considerably affected the quality of discourse parsing (with richer sets typically leading to lower accuracy); on the other hand, it was also important to the sentiment systems (with richer sets allowing them to distinguish more facets).  To check which of these factors had a greater impact on the net results of discourse-aware sentiment methods, I reran the experiments with the following alternative sets:

<table>
<caption>
RST relations used in the original Potsdam Commentary Corpus and different
discourse-aware sentiment methods<br/>
(default relation, which subsumes the rest of the links, is shown in boldface)    
</caption>
<thead>
<tr>
<td>Scheme</td>
<td>Relation Set</td>
<td>Equivalence Classes</td>
</tr>
</thead>
<tbody>
<tr>
<td>Bhatia et al.</td>
<td>{Contrastive, <div class="default-rel">Non-Contrastive</div>}</td>
<td>Contrastive &#x225D; {Antithesis, Antithesis-E, Comparison, Concession, Consequence-S, Contrast, Problem-Solution}.</td>
</tr>
<tr>
<td>Chenlo et al.</td>
<td>{Attribution, Background, Cause, Comparison, Condition, Consequence, Contrast, Elaboration, Enablement, Evaluation, Explanation, Joint, Otherwise, Temporal, <div class="default-rel">Other</div>}</td>
<td></td>
</tr>
<tr>
<td>Heerschop et al.</td>
<td>{Attribution, Background, Cause, Condition, Contrast, Elaboration, Enablement, Explanation, <div class="default-rel">Other</div>}</td>
<td></td>
</tr>
<tr>
<td>PCC</td>
<td>{Antithesis, Background, Cause, Circumstance, Concession, Condition, Conjunction, Contrast, Disjunction, E-Elaboration, Elaboration, Enablement, Evaluation-N, Evaluation-S, Evidence, Interpretation, Joint, Justify,       List, Means, Motivation, Otherwise, Preparation, Purpose, Reason, Restatement, Restatement-MN, Result, Sequence, Solutionhood, Summary, Unconditional, Unless, Unstated-Relation}</td>
<td></td>
</tr>
<tr>
<td>Zhou et al.</td>
<td>{Contrast, Condition, Continuation, Cause, Purpose, <div class="default-rel">Other</div>}</td>
<td>Contrast &#x225D; {Antithesis, Concession, Contrast, Otherwise}<br/>
    Continuation &#x225D; {Continuation, Parallel}<br/>
    Cause &#x225D; {Evidence, Nonvolitional-Cause, Nonvolitional-Result, Volitional-Cause, Volitional-Result}</td>
</tr>
</tbody>
</table>
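
Collapsing a fine-grained relation label into one of the coarser schemes amounts to a lookup with a default class. The sketch below shows this for the schemes of Bhatia et al. and Zhou et al., using the equivalence classes from the table. The function names are hypothetical, the hyphenated spelling `Volitional-Cause` is assumed, and mapping `Condition` and `Purpose` onto themselves is my reading of the table rather than something it states.

```python
BHATIA_CONTRASTIVE = frozenset({
    "Antithesis", "Antithesis-E", "Comparison", "Concession",
    "Consequence-S", "Contrast", "Problem-Solution",
})

ZHOU_CLASSES = {
    "Contrast": frozenset({"Antithesis", "Concession", "Contrast", "Otherwise"}),
    "Continuation": frozenset({"Continuation", "Parallel"}),
    "Cause": frozenset({
        "Evidence", "Nonvolitional-Cause", "Nonvolitional-Result",
        "Volitional-Cause", "Volitional-Result",
    }),
}
# Assumption: these two labels of the Zhou et al. scheme map onto themselves.
ZHOU_IDENTITY = frozenset({"Condition", "Purpose"})


def to_bhatia(relation):
    """Map a relation to the binary scheme; Non-Contrastive is the default."""
    return "Contrastive" if relation in BHATIA_CONTRASTIVE else "Non-Contrastive"


def to_zhou(relation):
    """Map a relation to the Zhou et al. scheme; Other is the default."""
    for label, members in ZHOU_CLASSES.items():
        if relation in members:
            return label
    if relation in ZHOU_IDENTITY:
        return relation
    return "Other"


for rel in ("Concession", "Elaboration", "Evidence"):
    print(rel, to_bhatia(rel), to_zhou(rel))
```

The default class does most of the work in the smaller schemes, which is one reason why coarser relation sets are easier for the parser to predict but carry less information for the sentiment classifier.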


<table>
    <caption>
        Results of the DPLP parser on PCC 2.0 with different relation schemes
    </caption>
<thead>
<tr>
<td>Relation Scheme</td>
<td>Span $F_1$</td>
<td>Nuclearity $F_1$</td>
<td>Relation $F_1$</td>
</tr>
</thead>
<tbody>
<tr>
<td>Bhatia et al.</td>
<td>0.777</td>
<td>0.512</td>
<td>0.396</td>
</tr>
<tr>
<td>Chenlo et al.</td>
<td>0.769</td>
<td>0.505</td>
<td>0.362</td>
</tr>
<tr>
<td>Heerschop et al.</td>
<td>0.774</td>
<td>0.51</td>
<td>0.361</td>
</tr>
<tr>
<td>PCC</td>
<td>0.776</td>
<td>0.534</td>
<td>0.326</td>
</tr>
<tr>
<td>Zhou et al.</td>
<td>0.776</td>
<td>0.501</td>
<td>0.388</td>
</tr>
</tbody>
</table>


<figure>
<img src="img/dasa-potts-macro-F1.png" alt="Macro-$F_1$ Results on the PotTS corpus with Different Relation Schemes">
<figcaption>Macro-$F_1$</figcaption>
<img src="img/dasa-potts-micro-F1.png" alt="Micro-$F_1$ Results on the PotTS corpus with Different Relation Schemes">
<figcaption>Micro-$F_1$</figcaption>
</figure>
<div>
Results of discourse-aware sentiment classifiers for different relation schemes on the
PotTS corpus
</div>

<figure>
<img src="img/dasa-sb10k-macro-F1.png" alt="Macro-$F_1$ Results on the SB10k corpus with Different Relation Schemes">
<figcaption>Macro-$F_1$</figcaption>
<img src="img/dasa-sb10k-micro-F1.png" alt="Micro-$F_1$ Results on the SB10k corpus with Different Relation Schemes">
<figcaption>Micro-$F_1$</figcaption>
</figure>
<div>
Results of discourse-aware sentiment classifiers for different relation schemes on the
SB10k corpus
</div>

# Summary and Conclusions

* I have presented an overview of the most popular approaches to automatic discourse analysis (RST, PDTB, and SDRT) and explained why I think that one of these frameworks (Rhetorical Structure Theory) is more amenable to the purposes of discourse-aware sentiment analysis than the others;

* to substantiate my claims and to see whether the lexicon-based attention system introduced in the previous chapter would indeed benefit from information on discourse structure, I segmented all microblogs from the PotTS and SB10k corpora into elementary discourse units using the SVM-based segmenter of Sidarenka et al. (2015) and parsed these messages with the RST parser of Ji and Eisenstein (2014), which had been previously retrained on the Potsdam Commentary Corpus (Stede and Neumann, 2014);

* afterwards, I estimated the results of existing discourse-aware sentiment methods (the systems of Wang and Wu [2013] and Bhatia et al. [2015]) and also evaluated two simpler baselines (in which I predicted the semantic orientation of a tweet by taking the polarity of its last and root EDUs), getting the best results with the R2N2 solution (0.657 and 0.559 macro-$F_1$ on PotTS and SB10k respectively);

* I could, however, improve on these scores and also outperform the plain LBA system (albeit by a small margin) with three proposed discourse-aware sentiment solutions (latent and latent-marginalized conditional random fields and the Recursive Dirichlet Process), pushing the macro-averaged $F_1$-score on PotTS up to 0.678 and increasing the result on SB10k to 0.56 macro-$F_1$;

* a subsequent evaluation of these approaches with different settings showed that the results of all discourse-aware methods largely correlate with the scores of the base sentiment classifier and also revealed an important drawback of the latent-marginalized CRFs, which failed to predict any positive or negative instance on the test set of the SB10k corpus when trained in combination with the lexicon-based approach of Hu and Liu (2004);

* nevertheless, almost all DASA solutions could improve their scores when tested on richer sets of discourse relations.

# Overall Summary

* In my dissertation, I have presented a corpus of $\approx8,000$ German tweets, which pertain to four different topics (federal elections, papal conclave, general political discussions, and casual everyday conversations) and were sampled according to three formal criteria (tweets containing a polar term, messages having a smiley, and all remaining microblogs).  After annotating the corpus in three steps (initial, adjudication, and final), I have attained a reliable level of inter-annotator agreement for all elements (sentiments, sources, targets, polar terms, downtoners, negations, and intensifiers), finding that both selection criteria (topics and formal traits) significantly affected the distribution of sentiments and polar terms and the reliability of their annotation;

* Then, I compared existing German sentiment lexicons, which were translated from English resources and revised by human experts, with lexicons that were generated automatically from scratch with the help of state-of-the-art dictionary-, corpus-, and word-embedding&ndash;based methods.  An evaluation of these approaches on my corpus showed that semi-automatically translated polarity lists were generally better than the automatically induced ones, reaching 0.587 macro-$F_1$ and attaining a 0.955 micro-$F_1$ score on the prediction of polar terms.  Furthermore, among fully automatic methods, dictionary-based systems showed stronger results than their corpus- and word-embedding&ndash;based competitors, yielding 0.479 macro-$F_1$ and 0.962 micro-$F_1$.  I could, however, improve on the latter metric (pushing it to 0.963) with my proposed linear projection solution, in which I first found a line that maximized the mutual distance between the projections of seed vectors with opposite semantic orientations and then projected the embeddings of all remaining words on that line, considering the distance of these projections to the median as polarity scores of the respective terms;

* In Chapter 4, I turned my attention to aspect-based sentiment analysis, in which I tried to predict the spans of sentiments, targets, and holders of opinions using the two most popular approaches to this task: conditional random fields and recurrent neural networks.  I obtained my best results (0.287 macro-$F_1$) with the first-order linear-chain CRFs and could increase these scores by using alternative topologies of CRFs (second-order linear-chain and semi-Markov CRFs), also boosting the macro-averaged $F_1$ to 0.38 by taking a narrower interpretation of sentiment spans (in which I only assigned the *sentiment* tag to polar terms).  Further evaluation of these methods proved the utility of the text normalization step (which raised the macro-$F_1$ of the CRF method by almost 3%) and task-specific word embeddings with the least-squares fallback (which improved the macro-$F_1$ score of the GRU system by 1.4%);

* Afterwards, in Chapter 5, I addressed one of the most popular objectives in contemporary sentiment analysis: **message-level sentiment analysis** (MLSA).  To get a better overview of the numerous existing systems, I compared three larger families of MLSA methods: **dictionary-**, **machine-learning&ndash;**, and **deep-learning&ndash;based** ones, finding that **the last two groups performed significantly better** than the lexicon-based approaches (the best macro-$F_1$ scores of machine- and deep-learning methods run up to 0.677 and 0.69 respectively, whereas the best lexicon-based solution [Hu and Liu, 2004] only reached 0.641 macro-$F_1$).  Apart from this, I improved the results of many reimplemented approaches by changing their default configuration (e.g., abandoning polarity-changing rules of lexicon-based systems, using alternative classifiers in ML-based systems, or taking the least-squares embeddings for DL-based methods).  In addition to the numerous reimplementations of popular existing algorithms, I also proposed my own solution, **lexicon-based attention** (LBA), in which I tried to unite the lexicon and deep-learning paradigms by taking a bidirectional LSTM network and explicitly pointing its attention to the polar terms that appeared in the analyzed messages.  With this solution, I not only outperformed all alternative DL systems but also improved on the scores of ML-based classifiers, attaining 0.69 macro-$F_1$ and 0.73 micro-$F_1$ on the PotTS corpus.  Similarly to the findings in the previous chapter, I observed a strong positive effect of text normalization and task-specific embeddings with the least-squares approximation;

* Finally, in the last part, I tried to improve the results of the proposed LBA solution by making it aware of the discourse structure.  For this purpose, I segmented all microblogs from the PotTS and SB10k corpora into elementary discourse units (EDUs), analyzed each of these segments individually with the MLSA classifier, and then estimated the overall polarity of a tweet by joining the polarity scores of its EDUs over the RST tree.  I proposed three different ways of doing this joining (latent CRFs, latent-marginalized CRFs, and Recursive Dirichlet Process), obtaining better results than existing discourse-aware sentiment methods and also outperforming the original discourse-unaware baseline.  In the concluding experiments, I further improved these scores by using manually annotated RST trees and richer subsets of discourse relations.
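
The scores quoted in this summary are macro- and micro-averaged $F_1$ values.  As a reminder of how the two averages differ, here is a minimal sketch on made-up labels.  The function name `f1_scores` and the toy data are illustrative assumptions, and the sketch averages over every class it sees, which is not necessarily the exact class set used in the thesis evaluation.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (macro_f1, micro_f1) over all classes seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was the true class but was missed

    def f1(t, p_, n):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when there is nothing to score
        denom = 2 * t + p_ + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 values, so every class counts equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts first, so every message counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy labels, invented for illustration only.
gold = ["positive", "positive", "negative", "neutral", "neutral", "neutral"]
pred = ["positive", "neutral", "negative", "neutral", "neutral", "positive"]
macro, micro = f1_scores(gold, pred)
print(f"macro-F1 = {macro:.3f}, micro-F1 = {micro:.3f}")
# prints: macro-F1 = 0.722, micro-F1 = 0.667
```

Because the macro average weights all classes equally, it is more sensitive to rare classes than the micro average, which is one reason the two numbers reported for the same system can differ noticeably.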

# General Conclusions

* **Can we apply opinion mining methods devised for standard English to German Twitter?**

Yes, we can, but the success of these approaches may vary significantly depending on the task, the size and reliability of the training data, and the evaluation metric that we use.  I can, however, provide a few general rules of thumb:

  * **Prefer methods that are closest to your training objective and that were trained under similar conditions w.r.t. the amount of data, their class distribution and domain;**

  * **Put every single setting of these methods into question**;

  * **Try using manually labeled resources for your target domain, if they are available, but pay attention to the quality of their annotation&mdash;it often matters more than the corpus size;**

  * **Prefer machine-learning methods to hard-coded rules**&mdash;they will automatically penalize their bad components;

  * **Do not use randomly initialized word embeddings for deep-learning systems**&mdash;initialize them with language-model vectors.

* **Which groups of approaches are best suited for which sentiment tasks?**

  * **Sentiment lexicon generation** is more amenable to dictionary-based solutions and my proposed word-embedding&ndash;based algorithms;

  * **Aspect-based sentiment analysis** can be better addressed with probabilistic graphical models, such as conditional random fields;

  * **Message-level sentiment analysis** can be efficiently tackled with both machine- and deep-learning approaches;

  * Finally, **probabilistic graphical models** strike back at discourse-aware opinion mining.

* **How much do word- and discourse-level analyses affect message-level sentiment classification?**

  My evaluation showed that the macro-averaged $F_1$-scores of the proposed lexicon-based attention system varied by up to 14 percentage points (from 0.64 to 0.69 macro-$F_1$ on the PotTS corpus, and from 0.44 to 0.58 on SB10k) depending on the lexicon used.  At the same time, discourse enhancements could only improve the results of LBA by at most 1.5 percentage points (from 0.677 to 0.678 on PotTS, and from 0.557 to 0.572 on SB10k).

* **Does text normalization help analyze sentiments?**

  Yes, it definitely does. Normalization significantly improved the quality of aspect-based and message-level sentiment analyses, boosting the results on the former task by up to 4% and improving the macro-averaged $F_1$-measure of message-level sentiment methods by up to 25%;

* **Can we do better than existing approaches?**

  Yes, we can, with the proposed:

  * **linear-projection algorithm**;

  * **alternative CRF topologies**;

  * **lexicon-based attention network**;

  * and **latent-marginalized CRFs** and **Recursive Dirichlet Process**.
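
Returning to the rule of thumb about word embeddings: the sketch below shows one simple way to initialize an embedding table from pretrained language-model vectors instead of random values.  Everything in it is a hypothetical illustration, including the helper `build_embedding_table`, the toy vectors, and the random fallback for unknown words.  It does not reproduce the least-squares fallback mentioned in the summary.

```python
import random


def build_embedding_table(vocab, pretrained, dim, seed=0):
    """Return (table, n_pretrained) with one vector per vocabulary word.

    Words found in `pretrained` copy their language-model vector; all other
    words fall back to small random values (an assumption of this sketch).
    """
    rng = random.Random(seed)
    table = []
    n_pretrained = 0
    for word in vocab:
        vector = pretrained.get(word)
        if vector is not None and len(vector) == dim:
            table.append(list(vector))  # copy, so later updates leave the source intact
            n_pretrained += 1
        else:
            table.append([rng.uniform(-0.05, 0.05) for _ in range(dim)])
    return table, n_pretrained


# Toy 3-dimensional "language-model" vectors, invented for illustration only.
pretrained = {"gut": [0.8, 0.1, 0.0], "schlecht": [-0.7, 0.2, 0.1]}
vocab = ["gut", "schlecht", "#tatort"]
table, n_pretrained = build_embedding_table(vocab, pretrained, dim=3)
print(f"{n_pretrained} of {len(vocab)} rows initialized from pretrained vectors")
# prints: 2 of 3 rows initialized from pretrained vectors
```

In a real system, the resulting table would be loaded into the embedding layer of the network before training, and only the rows without a pretrained counterpart would start from noise.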

# Contributions

* The **Potsdam Twitter Sentiment Corpus**: https://github.com/WladimirSidorenko/PotTS;

* **Lexicon Generation Methods**: https://github.com/WladimirSidorenko/SentiLex;

* **Text-Normalization Pipeline** and **Aspect-Based Sentiment Methods**: https://github.com/WladimirSidorenko/TextNormalization;

* **MLSA Approaches**: https://github.com/WladimirSidorenko/CGSA;

* **Discourse-Aware Sentiment Systems**: https://github.com/WladimirSidorenko/DASA;

* **Discourse Segmenter**: https://github.com/WladimirSidorenko/DiscourseSegmenter;

* **Retrained Discourse Parser**: https://github.com/WladimirSidorenko/RSTParser;

* **Extended Version of RST Markup Tool**: https://github.com/WladimirSidorenko/RSTTool.