<img src="images/bannerugentdwengo.png" alt="BannerUGentDwengo" width="250"/>

<div>
    <font color=#690027 markdown="1">
<h1>RULE-BASED SENTIMENT ANALYSIS</h1>    </font>
</div>

<div class="alert alert-box alert-success">
Language technologists rely on machine learning models to study sentiment words in given texts. In this notebook, you will get acquainted with the principles of their research. <br> The use of technology is becoming increasingly accessible to workers in a non-technical sector, such as linguists, communication scientists, historians, and lawyers. <br>Thanks to the highly accessible programming language Python, you too will discover some possibilities of language technology.</div>

<div class="alert alert-box alert-warning">
As a preparation for this notebook, it's best to delve into the notebooks 'Strings', 'Lists' and 'Dictionaries'. It's also recommended to familiarize yourself with some programming structures in the notebook 'Structures. Applications with strings, lists and dictionaries'.</div>

<div>
    <font color=#690027 markdown="1">
<h2>1. Principles of rule-based sentiment analysis</h2>    </font>
</div>

For **rule-based** sentiment analysis, you use an (existing) **lexicon**, i.e. a dictionary of sentiment words in which each word is linked to its **polarity** (positive, negative, or neutral).
'Happy', for example, has a positive polarity, 'home banking' a neutral one and 'angry' a negative one. In the lexicon, the polarity is indicated by a real number between -2 and 2. A strictly positive number corresponds to a positive sentiment, a strictly negative number to a negative sentiment and 0 to a neutral sentiment.
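
The link between a polarity value and a sentiment can be sketched in a few lines of Python. The function name `sentiment_label` and the three test values are assumptions made for this illustration; they do not come from the lexicon.

```python
def sentiment_label(polarity):
    """Translate a polarity value into a sentiment label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


# illustrative (assumed) polarity values, not taken from the lexicon
for value in (1.5, 0.0, -1.5):
    print(value, "->", sentiment_label(value))
```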

<div>
<img src="images/schaal.png" alt="Banner" align="center" width="500"/>
</div>

The polarity of a text is given by the sum of the polarities of the sentiment words in that text.

Before you can match sentiment words from a lexicon with the given text (the data), you have to process the data:

1. *read in* the data;
2. *preprocess* the data, i.e. prepare it for *lexicon matching* or for *machine learning*.

Preprocessing includes all the steps necessary to prepare the data for what follows, whether it's a simple lexicon matching, or a complex machine learning system that will be trained on the data. <br>Below is a list of common preprocessing steps.

#### Preprocessing
* **Lowercasing:** all capital letters are replaced with lowercase letters. Lowercasing is necessary because the words in a lexicon are listed without capital letters.
* **Tokenization:** all sentences are split into meaningful units or 'tokens', such as words and punctuation marks. This splitting is based on the spaces present in the sentences; therefore, the words must be separated from each other by a space.
* **Part-of-speech tagging:** each token is assigned its grammatical word category, such as adjective or symbol. Some words can, for example, occur both as a noun and as an adjective. Such a word can also have a different sentiment value depending on its word type.
* **Lemmatization:** all tokens are converted to their lemma or dictionary form (e.g. a noun appears in a dictionary in the singular, and a verb appears there as an infinitive). This dictionary form is then looked up in the lexicon.

#### Example

Given sentence:
- The games were cool icebreakers.

Lowercasing:
- the games were cool icebreakers.

Tokens:
- 'the' 'games' 'were' 'cool' 'icebreakers' '.'

Part-of-speech tags:
- 'the': article;
- 'games': noun;
- 'were': verb;
- 'cool': adjective;
- 'icebreakers': noun;
- '.': punctuation mark (symbol).

Lemmas:
- 'the', 'game', 'be', 'cool', 'icebreaker', '.'

Polarity:
- The polarities of the lemmas are looked up in the lexicon; articles and punctuation marks are not important.
- 'game' has polarity 1, 'be' has polarity 0, 'cool' has polarity 0.8, and 'icebreaker' has polarity 0.
- The polarity of the given sentence is the sum of these polarities, so 1.8.
- 1.8 is a positive number. The sentence evokes a positive sentiment.
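
The calculation in this example can be reproduced with a small sketch. The dictionary `toy_lexicon` is an assumption made for this illustration: it only holds the four polarities mentioned in the example (with `be` as the lemma of 'were') and is much simpler than the real lexicon used further on.

```python
# toy lexicon with only the polarities mentioned in the example (assumed structure: lemma -> polarity)
toy_lexicon = {"game": 1.0, "be": 0.0, "cool": 0.8, "icebreaker": 0.0}

lemmas_example = ["the", "game", "be", "cool", "icebreaker", "."]

# keep only the polarities of the lemmas that occur in the toy lexicon
polarities_example = [toy_lexicon[lemma] for lemma in lemmas_example if lemma in toy_lexicon]

# the polarity of the sentence is the sum of these polarities
polarity_example = sum(polarities_example)
print(round(polarity_example, 2))   # 1.8
```

The article 'the' and the punctuation mark '.' are skipped simply because they are not keys of the toy lexicon.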

<div style='color: #690027;' markdown="1">
    <h2>2. The lexicon</h2></div>

### Importing Modules

Now that you know this, you can almost get started. You first load a few Python modules. <br>To do this, execute the code cell below.

<div class="alert alert-block alert-info"> 
Python is often very intuitive to use and moreover so popular that there are many modules available that one can freely use. A module contains many functions that experienced computer scientists have already programmed for you.</div>

In [None]:
# import modules
import pickle                           # for lexicon
from colorama import Fore, Back         # to be able to print in color
import string                           # for list of punctuation marks
from lexiconhelper import tenelements   # to show ten elements of the lexicon

### Read in Lexicon

Run the code cell below. You don't need to understand the code in this cell in detail.

In [None]:
# read in lexicon
with open("data/new_lexicondict.pickle", "rb") as file:   # file new_lexicondict.pickle in folder data contains the sentiment lexicon
    lexicon = pickle.load(file)

In [None]:
type(lexicon)

In [None]:
# number of elements in lexicon
len(lexicon)

In [None]:
# show ten elements of lexicon
print(tenelements(lexicon))

<div class="alert alert-block alert-info"> 
The lexicon is a <b>dictionary</b> with 10,938 words. For each word, the lexicon provides the word class (part-of-speech tag, 'postag') and the polarity ('polarity').<br>The words in the lexicon are the <b>keys</b> of the dictionary. The <b>values</b> of this dictionary are themselves dictionaries with two keys ("postag" and "polarity"), each of which has a list of at most 2 elements as its value.</div>

Some words from the lexicon:<br><br>  'rhetorical': {'postag': ['ADJ'], 'polarity': [0.0]}, <br> 'swift': {'postag': ['ADJ'], 'polarity': [0.6]}, <br>'balanced': {'postag': ['ADJ'], 'polarity': [1.25]},<br> 'modal': {'postag': ['ADJ'], 'polarity': [0.4]},<br> 'digital': {'postag': ['ADJ'], 'polarity': [0.0]}, <br>'wrong': {'postag': ['ADJ', 'NOUN'], 'polarity': [-0.5, -2.0]}<br>'illiterate': {'postag': ['NOUN'], 'polarity': [-1.0]}<br>'stigmatize': {'postag': ['VERB'], 'polarity': [-2.0]}

From this you can deduce for example:
- the word 'rhetorical' is an adjective that has a neutral polarity in terms of sentiment;
- the word 'swift' is an adjective that has a positive polarity in terms of sentiment;
- the word 'balanced' is an adjective that also has a positive polarity in terms of sentiment, but it is perceived as more positive than 'swift';
- the word 'wrong' can be both an adjective and a noun, both with negative polarity, but as a noun it is perceived to be more negative than as an adjective;
- the word 'illiterate' is a noun that has a negative polarity in terms of sentiment;
- the word 'stigmatize' is a verb that has a negative polarity in terms of sentiment, more negative than 'illiterate'.

So you could also represent the lexicon in the form of a table:<br><br>
<table>
 <thead align="center">
    <tr>
      <td>word</td><td>postag</td><td>polarity</td>     </tr>    
  </thead>
  <tbody align="center">  
      <tr> <td> rhetorical </td> <td> ADJ </td>  <td> 0.0 </td>  </tr>
      <tr> <td> swift </td>      <td> ADJ </td>  <td> 0.6 </td>  </tr>
      <tr> <td> balanced </td>   <td> ADJ </td>  <td> 1.25 </td> </tr>
      <tr> <td> modal </td>      <td> ADJ </td>  <td> 0.4 </td>  </tr>
      <tr> <td> digital </td>    <td> ADJ </td>  <td> 0.0 </td>  </tr>
      <tr> <td> wrong </td>      <td> ADJ </td>  <td> -0.5 </td> </tr>
      <tr> <td> wrong </td>      <td> NOUN </td> <td> -2.0 </td> </tr>
  </tbody>
</table>
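
How the nested dictionary relates to such a table can be sketched as follows. The name `sample_lexicon` and the function `lexicon_rows` are assumptions made for this illustration; the three entries are copied from the words listed above, so the sketch runs without the real lexicon file.

```python
# small sample in the same nested format as the lexicon
sample_lexicon = {
    "rhetorical": {"postag": ["ADJ"], "polarity": [0.0]},
    "swift": {"postag": ["ADJ"], "polarity": [0.6]},
    "wrong": {"postag": ["ADJ", "NOUN"], "polarity": [-0.5, -2.0]},
}


def lexicon_rows(lexicon):
    """Return one (word, postag, polarity) row per word class of each word."""
    rows = []
    for word, entry in lexicon.items():
        # the first postag belongs to the first polarity, the second to the second
        for postag, polarity in zip(entry["postag"], entry["polarity"]):
            rows.append((word, postag, polarity))
    return rows


for row in lexicon_rows(sample_lexicon):
    print(row)
```

A word with two word classes, such as 'wrong', ends up in two rows of the table.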

**Note** that the parts of speech are returned in English.
- *'NOUN'* stands for a noun (substantive);
- *'ADJ'* for an adjective;
- *'ADV'* for an adverb;
- *'DET'* for a determiner;
- *'VERB'* for a verb;
- *'AUX'* for an auxiliary verb;
- *'PRON'* for a pronoun;
- *'PROPN'* for a proper noun, etc.
- Punctuation is assigned *'SYM'*.

You can easily look up the word type and polarity of a word in the lexicon using Python code:
- `lexicon["rhetorical"]["postag"]` outputs `['ADJ']`
- `lexicon["rhetorical"]["polarity"]` outputs `[0.0]`
- `lexicon["error"]["postag"]` outputs `['ADJ', 'NOUN']`
- `lexicon["error"]["polarity"]` outputs `[-0.5, -2.0]`

Test this out:

In [None]:
lexicon["rhetorical"]["postag"]

In [None]:
lexicon["rhetorical"]["polarity"]

In [None]:
lexicon["error"]["postag"]

In [None]:
lexicon["error"]["polarity"]

#### Exercise 2.1:
Search in the lexicon:
- the word type of 'kwekkebekken'

- the word type of 'recommend'

- the polarity of 'jollity'

- the polarity of 'intriguing'

The words that are in the lexicon are called *keys* in Python. You can request them with the instruction `lexicon.keys()`. <br> Execute the following code cells to check if certain sentiment words are a key.

In [None]:
"zieke" in lexicon.keys()

In [None]:
"angry" in lexicon.keys()

#### Exercise 2.2:
Search for a sentiment word that is not in the lexicon.

Answer:

So, you're ready for an application. Step 1: Read and view the data.

<div>
    <font color=#690027 markdown="1">
<h2>3. Application: customer review</h2>    </font>
</div>

In what follows you will perform sentiment analysis on a given review.

### The data

Run the following code cell to read in the review and then view it.

In [None]:
review = "New concept in Gent, but I think it could be better. Most of the cornflakes were just the basic kinds. Also a bit expensive for the amount you get, especially with the toppings they are modest. And if you offer breakfast, also give people more choice for their coffee."

In [None]:
print(review)

You are ready for step 2: perform preprocessing on the review.

<div>
    <font color=#690027 markdown="1">
<h2>4. Preprocessing</h2>    </font>
</div>

### Lowercasing

In this step, you convert the text in the review to lowercase. Lowercasing is necessary because the words in the lexicon are written without capital letters.
The variable `review_lowercase` refers to this converted text.

In [None]:
# convert review text to lowercase text
review_lowercase = review.lower()  # write review in lowercase letters

In [None]:
# show result of lowercasing
print(review_lowercase)

### Tokenization

Now you will have the computer split the review into words and punctuation marks, based on the spaces in the text. These words and punctuation marks are **tokens**.
To generate the tokens automatically, the words and punctuation marks must be separated by a space; a new token then starts after each space. E.g. `Hello, world!` is first rewritten as `Hello , world !`, which yields the four tokens 'Hello', ',', 'world' and '!'.
You will therefore first need to modify the review: there must be a space before and after each punctuation mark.

#### Insert spaces before each punctuation mark

In this step, you place a space before (and after) every punctuation mark in the review.
The variable `review_space` refers to this converted text.

In [None]:
punctuation = string.punctuation
print(punctuation)

In [None]:
# adding spaces to review text
review_space = ""     # empty string
for character in review_lowercase:
    if character not in punctuation:
        review_space = review_space + character
    else:
        review_space = review_space + " " + character + " "     # space before and after a punctuation mark

In [None]:
# show result of adding spaces
print(review_space)

In [None]:
# tokenization
tokens = review_space.split()

In [None]:
# show result of tokenization
print(tokens)

So you get a list of the tokens.
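
The steps so far (lowercasing, inserting spaces, splitting) can also be bundled in one function, so that they can be reused for another text. The function name `tokenize` is an assumption made for this sketch; it is not part of the modules imported in this notebook.

```python
import string


def tokenize(text):
    """Lowercase a text, put spaces around punctuation marks and split on whitespace."""
    spaced = ""
    for character in text.lower():
        if character in string.punctuation:
            spaced = spaced + " " + character + " "   # space before and after a punctuation mark
        else:
            spaced = spaced + character
    return spaced.split()


print(tokenize("Hello, world!"))   # ['hello', ',', 'world', '!']
```

Keep in mind that this simple approach also splits on apostrophes and hyphens, because they are part of `string.punctuation`: "non-alcoholic" becomes the three tokens 'non', '-' and 'alcoholic'.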

Once the text has been tokenized, you can assign a **part-of-speech tag** to each token; for a word, this is the word type.<br>When looking up a token in the lexicon, it is important that you also check the part-of-speech tag. After all, some words can be used both as a noun and as an adjective.
A token is looked up in the lexicon in its dictionary form. Each token must therefore be lemmatized, i.e. reduced to its **lemma** or dictionary form, such as the singular for a noun and the infinitive for a verb. The search for such a dictionary form in the lexicon can then be **automated**.

### Part-of-speech tagging and lemmatization

So you want a list of the part-of-speech tags and the lemmas corresponding to the tokens, and you can then look up these lemmas in the lexicon.

The first token is 'new'. The base form or lemma is `new` and this is an adjective, ADJ.
Take the token ','. The base form or lemma is `,` and this is punctuation, so a symbol, SYM.
Take the token 'could'. The base form or lemma is `can` and this is an auxiliary verb, AUX.
Take the token 'kinds'. The base form or lemma is `kind` and this is a noun, NOUN.

**Note** that the lemma of tokens such as *'could'*, *'better'*, *'were'*, *'kinds'* and *'toppings'* is not the same as the token. With other tokens, such as *'new'*, the lemma and the token coincide.

Create lists of the lemmas and the part-of-speech tags.<br>The order in these lists is important. The first element in one list must match the first element in the other list; the second element in one list must match the second element in the other list; etc. <br>Tip: start with the list of tokens and use 'copy-paste' (CTRL-C, CTRL-V), then only adjust the lemmas that differ from the token. You will have to type in all the part-of-speech tags.

In [None]:
# part-of-speech tags, one per token, in the order of the tokens (one line per sentence of the review)
postags = ['ADJ', 'NOUN', 'ADP', 'PROPN', 'SYM', 'CCONJ', 'PRON', 'VERB', 'PRON', 'AUX', 'AUX', 'ADJ', 'SYM',
           'ADJ', 'ADP', 'DET', 'NOUN', 'AUX', 'ADV', 'DET', 'ADJ', 'NOUN', 'SYM',
           'ADV', 'DET', 'NOUN', 'ADJ', 'ADP', 'DET', 'NOUN', 'PRON', 'VERB', 'SYM', 'ADV', 'ADP', 'DET', 'NOUN', 'PRON', 'AUX', 'ADJ', 'SYM',
           'CCONJ', 'SCONJ', 'PRON', 'VERB', 'NOUN', 'SYM', 'ADV', 'VERB', 'NOUN', 'ADJ', 'NOUN', 'ADP', 'PRON', 'NOUN', 'SYM']
# lemmas, one per token, in the same order
lemmas = ['new', 'concept', 'in', 'Gent', ',', 'but', 'I', 'think', 'it', 'can', 'be', 'good', '.',
          'many', 'of', 'the', 'cornflakes', 'be', 'just', 'the', 'basic', 'kind', '.',
          'also', 'a', 'bit', 'expensive', 'for', 'the', 'amount', 'you', 'get', ',', 'especially', 'with', 'the', 'topping', 'they', 'be', 'modest', '.',
          'and', 'if', 'you', 'offer', 'breakfast', ',', 'also', 'give', 'person', 'much', 'choice', 'for', 'their', 'coffee', '.']

Compiling these lists does take some time as they are largely done manually.
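
Because the lists are typed in by hand, it is easy to forget an element. A small check like the one below can reveal that. The function name `check_alignment` and the short example lists are assumptions made for this sketch; the example is the sentence from section 1, not the full review.

```python
def check_alignment(tokens, lemmas, postags):
    """Raise a ValueError when the three lists do not have the same length."""
    lengths = {"tokens": len(tokens), "lemmas": len(lemmas), "postags": len(postags)}
    if len(set(lengths.values())) != 1:
        raise ValueError("Lists are not aligned: " + str(lengths))
    return True


# short illustrative example
example_tokens = ["the", "games", "were", "cool", "icebreakers", "."]
example_lemmas = ["the", "game", "be", "cool", "icebreaker", "."]
example_postags = ["DET", "NOUN", "VERB", "ADJ", "NOUN", "SYM"]

check_alignment(example_tokens, example_lemmas, example_postags)

# print the three lists side by side to check the order by eye
for token, lemma, postag in zip(example_tokens, example_lemmas, example_postags):
    print(token, lemma, postag)
```

Such a check is useful because `zip` stops at the end of the shortest list without giving a warning, so a missing element would otherwise go unnoticed.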

<div>
    <font color=#690027 markdown="1">
<h2>5. Sentiment lexicon matching</h2>    </font>
</div>

Now that your review is *preprocessed*, you can **determine the sentiment** using the lexicon of sentiment words that you have at your disposal.<br>Here too, you work in two large steps:
- you search the lexicon for the sentiment words in the review, i.e. you check which tokens can be found in the lexicon based on their lemma (you apply 'matching' to the review);
- you look up the polarity of each sentiment word according to its word type in the review.

#### Example 5.1

Consider the lemma `"new"`. You look up this lemma in the sentiment lexicon.

In [None]:
# does "new" exist in the lexicon?
"new" in lexicon

So it is included. Ask for the part-of-speech tag ("postag") and the polarity ("polarity") of `"new"`.

In [None]:
lexicon["new"]["postag"]

In [None]:
lexicon["new"]["polarity"]

#### Exercise 5.1

Do the same for the lemma `"expensive"`.

#### Example 5.2

The index of `"NOUN"` in the list `lexicon["expensive"]["postag"]` is 1.<br>With the following code, you can request the polarity of `"expensive"` as a noun:

In [None]:
lexicon["expensive"]["polarity"][1]

#### Example 5.3

- For the sentiment analysis of the review, you now create a list of the **sentiment words** of the review: the tokens that appear in the lexicon. You refer to this list with the variable `lexiconmatches`.
- You also create a list with the polarities of these tokens. You refer to this list with the variable `polarities`.

For this, you go through all the lemmas one by one. For each lemma that is in the lexicon, you check the word type (part-of-speech tag). If the word type matches, you add the corresponding token to the list of tokens and the corresponding polarity to the list of polarities.
For `"new"`, this means that `"new"` is added to the `lexiconmatches` list and `0.575` to the `polarities` list.
Finally, you add up all the polarities. The sum, `sum(polarities)`, indicates the sentiment of the review.

In [None]:
# search for lexicon matches in the review
lexiconmatches = []     # empty list, to be filled with tokens of the lemmas found in lexicon
polarities = []         # empty list, to be filled with polarities of found tokens

# consider lemmas with corresponding word type and token
for lemma, postag, token in zip(lemmas, postags, tokens):
    # lemma must be present in lexicon
    # only when the lemma and the POS-tag match, there is a match (see for example 'error' as ADJ and 'error' as NOUN)
    if lemma in lexicon.keys() and postag in lexicon[lemma]["postag"]:
        lexiconmatches.append(token)                      # add corresponding token to the list lexiconmatches
        # add corresponding polarity to list of polarities
        if postag == lexicon[lemma]["postag"][0]:
            polarities.append(lexicon[lemma]["polarity"][0])
        else:
            polarities.append(lexicon[lemma]["polarity"][1])

# review polarity
polarity = sum(polarities)

# final decision for this review
if polarity > 0:
    sentiment = "positive"
elif polarity == 0:
    sentiment = "neutral"
elif polarity < 0:
    sentiment = "negative"
print("The polarity of the review is: " + str(polarity))
print("The sentiment of the review is " + sentiment + ".")
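
The matching can also be written as a function, so that it can be applied to other reviews. The sketch below is one possible variant, not the notebook's own code: the function name `match_lexicon` and the dictionary `toy_lexicon` are assumptions, and the position of the polarity is looked up with the list method `index()` instead of comparing with the first part-of-speech tag.

```python
def match_lexicon(tokens, lemmas, postags, lexicon):
    """Return the matched tokens and their polarities as two lists."""
    matches = []
    polarities = []
    for token, lemma, postag in zip(tokens, lemmas, postags):
        entry = lexicon.get(lemma)                       # None if the lemma is not a key
        if entry is not None and postag in entry["postag"]:
            position = entry["postag"].index(postag)     # place of this word type in the list
            matches.append(token)
            polarities.append(entry["polarity"][position])
    return matches, polarities


# toy lexicon in the same nested format as the real lexicon (values from the examples in this notebook)
toy_lexicon = {
    "game": {"postag": ["NOUN"], "polarity": [1.0]},
    "cool": {"postag": ["ADJ"], "polarity": [0.8]},
    "wrong": {"postag": ["ADJ", "NOUN"], "polarity": [-0.5, -2.0]},
}

matches, values = match_lexicon(
    ["the", "games", "were", "cool", "icebreakers", "."],
    ["the", "game", "be", "cool", "icebreaker", "."],
    ["DET", "NOUN", "VERB", "ADJ", "NOUN", "SYM"],
    toy_lexicon,
)
print(matches)   # ['games', 'cool']
print(values)    # [1.0, 0.8]
```

For a word with two word types, the part-of-speech tag decides which polarity is used: `match_lexicon(["wrong"], ["wrong"], ["NOUN"], toy_lexicon)` gives the polarity `-2.0`, not `-0.5`.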

#### Exercise 5.2

Some processes were automated, others you had to do manually. List what was done manually and what was done automatically.

Answer:

Answer:

<div>
    <font color=#690027 markdown="1">
<h2>6. Sentiment lexicon matching: marking</h2>    </font>
</div>

Take a look at the (matched) sentiment words and their polarity by requesting them.

In [None]:
print(lexiconmatches)
print(polarities)

You can **highlight sentiment words** in the given review: green for a positive polarity, red for a negative one, and blue for a neutral one.
For this, you start from the lowercased text in which spaces were inserted around the punctuation marks. In this text, you replace the tokens that are sentiment words with themselves on a colored background. You leave the non-sentiment words untouched.

In [None]:
review_highlighted = review_space    # take review where spaces were inserted
# mark lexiconmatches that appear as a word, not part of a word
for token, polarity in zip(lexiconmatches, polarities):
    if polarity > 0:        # corresponding polarity is positive
        review_highlighted = review_highlighted.replace(" " + token + " ", " " + Back.GREEN + token + Back.RESET + " ")   # mark positive token in green
    elif polarity == 0.0:   # corresponding polarity is neutral
        review_highlighted = review_highlighted.replace(" " + token + " ", " " + Back.BLUE + token + Back.RESET + " ")    # mark neutral token in blue
    elif polarity < 0:      # corresponding polarity is negative
        review_highlighted = review_highlighted.replace(" " + token + " ", " " + Back.RED + token + Back.RESET + " ")     # mark negative token in red
print(review_highlighted)
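
An alternative is to colour the text token by token instead of replacing pieces of the string. The sketch below is a simplified variant under a few assumptions: the function name `highlight` is invented for this illustration, the colours are written as plain ANSI escape codes (the output area is assumed to support them), and every occurrence of a matched token gets the same colour.

```python
# ANSI escape codes for a green, red and blue background, and for resetting the background
GREEN, RED, BLUE, RESET = "\033[42m", "\033[41m", "\033[44m", "\033[49m"


def highlight(tokens, matches, polarities):
    """Return the tokens as one string, with every matched token on a coloured background."""
    polarity_of = dict(zip(matches, polarities))   # token -> polarity (last one wins for repeated tokens)
    parts = []
    for token in tokens:
        if token not in polarity_of:
            parts.append(token)                    # no sentiment word: leave untouched
        elif polarity_of[token] > 0:
            parts.append(GREEN + token + RESET)    # positive polarity
        elif polarity_of[token] < 0:
            parts.append(RED + token + RESET)      # negative polarity
        else:
            parts.append(BLUE + token + RESET)     # neutral polarity
    return " ".join(parts)


print(highlight(["the", "games", "were", "cool", "icebreakers", "."], ["games", "cool"], [1.0, 0.8]))
```

Working token by token also covers a sentiment word at the very start of the text, which has no space in front of it.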

<div>
    <font color=#690027 markdown="1">
<h2>7. Exercise</h2>    </font>
</div>

View the following reviews.
- Choose one.
- Which sentiment do you intuitively associate with it?
- Does the lexicon lead to the same result?

- Sunrisewater is just what Belgium needs. With the obesity problem that is definitely on the rise, we can use all the initiatives to get the youth to drink plain water again instead of that American rubbish! It tastes great and what's even better, you can find it on any street corner! Really wonderful! Especially the pink and yellow is highly recommended.
- Salé & Sucré is known for its super delicious and original cocktails, unfortunately there was no non-alcoholic version available. Our designated driver for the day had to stick to soft drinks.
- It was really fun to get a taste of the Filipino cuisine. The dishes were well put together, the portions were definitely large enough and the flavors were spot on. Worth repeating!
- Cozy atmosphere, delicious coffee, and a beautiful interior. The combination of a study bar and a chat bar is a genius idea! Studying with a tasty cup of coffee, a delicious snack, and together with other students, is extremely motivating. The interior is very calming with little distraction, which is why I've never been able to accomplish so much!
- Wow, what cool restaurants! And the food there is mega delicious.

<div class="alert alert-box alert-success">
Congratulations, you have learned how a rule-based system for sentiment analysis works!</div>

<img src="images/cclic.png" alt="Banner" align="left" width="100"/><br><br>
Notebook Chatbot, see <a href="http://www.aiopschool.be">AI At School</a>, by C. Van Hee, V. Hoste, F. Wyffels, T. Neutens, Z. Van de Staey & N. Gesquière is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.