<img src="images/bannerugentdwengo.png" alt="BannerUGentDwengo" width="250"/>

<div>
    <font color=#690027 markdown="1">
<h1>RULE-BASED SENTIMENT ANALYSIS</h1>    </font>
</div>

<div class="alert alert-box alert-success">
Language technologists rely on machine learning models to study sentiment words in given texts. In this notebook, you will become familiar with the principles of their research. <br> The use of technology is becoming increasingly accessible to workers in non-technical sectors, such as linguists, communication scientists, historians, and lawyers. <br>Thanks to the highly accessible programming language Python, you too will discover some possibilities of language technology.</div>

<div class="alert alert-box alert-warning">
As preparation for this notebook, you should familiarize yourself with the notebooks 'Strings', 'Lists' and 'Dictionaries'. It would also be beneficial for you to familiarize yourself with some programming structures in the notebook 'Structures. Applications with strings, lists and dictionaries'.</div>

<div>
    <font color=#690027 markdown="1">
<h2>1. Principles of rule-based sentiment analysis</h2>    </font>
</div>

For **rule-based** sentiment analysis, you use an (existing) **lexicon**: a dictionary of sentiment words in which each word is linked to its **polarity** (positive, negative or neutral).
'Happy', for example, has a positive polarity, 'home banking' a neutral one, and 'angry' a negative polarity. In the lexicon, polarity is represented by a real number between -2 and 2. A strictly positive number corresponds to a positive sentiment, a strictly negative number to a negative sentiment, and 0 to a neutral sentiment.

<div>
<img src="images/schaal.png" alt="Banner" align="center" width="500"/>
</div>

The polarity of a text is given by the sum of the polarities of the sentiment words in that text.
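
As a minimal sketch of this rule: the function name `sentiment_label` and the sample polarities below are made up for illustration and are not part of the notebook.

```python
def sentiment_label(polarity):
    """Map a polarity score to a sentiment label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

# hypothetical polarities of the sentiment words found in some text
word_polarities = [0.6, -2.0, 0.0, 1.25]

# polarity of the text = sum of the polarities of its sentiment words
text_polarity = sum(word_polarities)
print(text_polarity, sentiment_label(text_polarity))
```

Note that one strongly negative word can outweigh several mildly positive ones, because the polarities are simply added.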

Before you can match sentiment words from a lexicon with the given text (the data), you need to process the data:
-  1) *read in*;
-  2) *preprocessing*, i.e. preparing for *lexicon matching* or for *machine learning*.

Preprocessing includes all steps required to prepare the data for what follows, whether that's simple lexicon matching, or a complicated machine learning system that will be trained on the data. <br>Below you will find a list of common preprocessing steps.

#### Preprocessing
* **Lowercasing:** all capital letters are replaced by lowercase letters. Lowercasing is necessary because the words in the lexicon are written without capital letters.
* **Tokenization:** all sentences are split into meaningful units or 'tokens', such as words and punctuation marks. This split is based on the spaces present in the sentences; therefore, the words must be separated from each other by a space.
* **Part-of-speech tagging:** each token is assigned its grammatical word category, such as adjective or symbol. Some words can occur, for example, both as a noun and as an adjective. Such a word can also have a different sentiment value depending on its word category.
* **Lemmatization:** all tokens are converted to their lemma or dictionary form (for example, a dictionary lists a noun in the singular and a verb as an infinitive). That dictionary form is then looked up in the lexicon.

#### Example:
Given sentence:
- The games were cool icebreakers.

Lowercasing:
- the games were cool icebreakers.

Tokens:
- "the" "games" "were" "cool" "icebreakers" "."

Part-of-speech tags:
- 'the': article;
- 'games': noun;
- 'were': verb;
- 'cool': adjective;
- 'icebreakers': noun;
- '.': punctuation mark (symbol).

Lemmas:
- 'the', 'game', 'be', 'cool', 'icebreaker', '.'

Polarity:
- The polarities of the lemmas are looked up in the lexicon; articles and punctuation marks are not important in this context.
- 'game' has polarity 1, 'be' has polarity 0, 'cool' has polarity 0.8 and 'icebreaker' 0.
- The polarity of the given sentence is the sum of these polarities, so 1.8.
- 1.8 is a positive number. The sentence evokes a positive sentiment.
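
The steps of this example can be sketched in a few lines of Python. This is only an illustration: `toy_lexicon` is a made-up mini lexicon that holds just the polarities mentioned in the example, and the lemmas are typed in by hand.

```python
# toy lexicon with only the polarities from the example (an assumption for illustration)
toy_lexicon = {"game": 1.0, "be": 0.0, "cool": 0.8, "icebreaker": 0.0}

sentence = "The games were cool icebreakers ."   # the full stop is already separated by a space

tokens = sentence.lower().split()                # lowercasing, then tokenization on spaces
lemmas = ["the", "game", "be", "cool", "icebreaker", "."]   # lemmas written out by hand

# add up the polarities of the lemmas that occur in the toy lexicon
polarity = sum(toy_lexicon[lemma] for lemma in lemmas if lemma in toy_lexicon)

print(tokens)
print(round(polarity, 2))
```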

<div style='color: #690027;' markdown="1">
<h2>2. The lexicon</h2></div>

### Importing Modules

Now that you know this, you can almost get started. You first load a few Python modules. <br>To do this, execute the code cell below.

<div class="alert alert-block alert-info"> 
Python is often very intuitive to use and moreover so popular that there are many modules available that one can use freely. A module contains many functions that experienced computer scientists have already programmed for you.</div>

In [None]:
# import modules
import pickle                        # for lexicon
from colorama import Fore, Back      # to be able to print in color
import string                        # for enumerating punctuation marks
from lexiconhelp import tenelements  # helper used below to display ten elements of the lexicon

### Reading in Lexicon

Run the code cell below. You do not need to understand the details of the code in this cell.

In [None]:
# read in lexicon
# the file new_lexicondict.pickle in the data folder contains the sentiment lexicon
with open("data/new_lexicondict.pickle", "rb") as file:
    lexicon = pickle.load(file)
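
If you are curious what `pickle` does: it stores a Python object as bytes and can read it back later. The sketch below shows this with a made-up one-word dictionary and an in-memory buffer instead of a real file, so it does not touch the notebook's data folder.

```python
import io
import pickle

# tiny stand-in for the lexicon (hypothetical content)
toy_lexicon = {"swift": {"postag": ["ADJ"], "polarity": [0.6]}}

buffer = io.BytesIO()                 # in-memory "file", so nothing is written to disk
pickle.dump(toy_lexicon, buffer)      # store the dictionary as bytes
buffer.seek(0)                        # go back to the start before reading
restored = pickle.load(buffer)        # read the dictionary back in

print(restored == toy_lexicon)
```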

In [None]:
# type of lexicon
...

In [None]:
# number of elements in lexicon
...

In [None]:
# display ten elements of lexicon
print(tenelements(lexicon))

<div class="alert alert-block alert-info"> 
The lexicon is a <b>dictionary</b> with 10,938 words. For each word, the lexicon gives the word type (part-of-speech tag, 'postag') and the polarity ('polarity').<br>The words in the lexicon are the <b>keys</b> of the dictionary. The <b>values</b> of this dictionary are themselves dictionaries with two keys ("postag" and "polarity"), each of which has a list with at most 2 elements as its value.</div>

Some words from the lexicon:<br><br> 'rhetorical': {'postag': ['ADJ'], 'polarity': [0.0]}, <br> 'swift': {'postag': ['ADJ'], 'polarity': [0.6]}, <br>'balanced': {'postag': ['ADJ'], 'polarity': [1.25]},<br> 'modal': {'postag': ['ADJ'], 'polarity': [0.4]},<br> 'digital': {'postag': ['ADJ'], 'polarity': [0.0]}, <br>'error': {'postag': ['ADJ', 'NOUN'], 'polarity': [-0.5, -2.0]},<br>'illiterate': {'postag': ['NOUN'], 'polarity': [-1.0]},<br>'stigmatize': {'postag': ['VERB'], 'polarity': [-2.0]}

From this you can deduce, for example:
- the word 'rhetorical' is an adjective that has a neutral polarity in terms of sentiment;
- the word 'swift' is an adjective that has a positive polarity in terms of sentiment;
- the word 'balanced' is an adjective that also has a positive polarity in terms of sentiment, but it is perceived more positively than 'swift';
- the word 'error' can be both an adjective and a noun, both with a negative polarity, but as a noun it is perceived as more negative than as an adjective;
- the word 'illiterate' is a noun that has a negative polarity in terms of sentiment;
- the word 'stigmatize' is a verb that has a negative polarity in terms of sentiment, more negative than 'illiterate'.

So you could also display the lexicon in the form of a table:<br><br>
<table>
 <thead align="center">
    <tr>
<td>word</td><td>postag</td><td>polarity</td>     </tr>    
  </thead>
  <tbody align="center">  
      <tr> <td> rhetorical </td> <td> ADJ </td>  <td> 0.0 </td>  </tr>
      <tr> <td> swift </td>      <td> ADJ </td>  <td> 0.6 </td>  </tr>
      <tr> <td> balanced </td>   <td> ADJ </td>  <td> 1.25 </td> </tr>
      <tr> <td> modal </td>      <td> ADJ </td>  <td> 0.4 </td>  </tr>
      <tr> <td> digital </td>    <td> ADJ </td>  <td> 0.0 </td>  </tr>
      <tr> <td> error </td>      <td> ADJ </td>  <td> -0.5 </td> </tr>
      <tr> <td> error </td>      <td> NOUN </td> <td> -2.0 </td> </tr>
  </tbody>
</table>
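
Turning the nested dictionary into table rows can be sketched as follows. The two-word `toy_lexicon` is a stand-in that copies the structure of the real lexicon; it is only meant as an illustration.

```python
# stand-in with the same structure as the real lexicon (only two entries)
toy_lexicon = {
    "rhetorical": {"postag": ["ADJ"], "polarity": [0.0]},
    "error": {"postag": ["ADJ", "NOUN"], "polarity": [-0.5, -2.0]},
}

rows = []
for word, entry in toy_lexicon.items():
    # a word with two part-of-speech tags yields two rows
    for postag, polarity in zip(entry["postag"], entry["polarity"]):
        rows.append((word, postag, polarity))

for word, postag, polarity in rows:
    print(f"{word:<12}{postag:<6}{polarity:>5}")
```

A word that can have two word types therefore takes up two rows in the table, one per part-of-speech tag.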

**Note** that the parts of speech are returned in English.
-  *'NOUN'* stands for a noun;
-  *'ADJ'* for an adjective;
-  *'ADV'* for an adverb;
-  *'DET'* for an article;
-  *'VERB'* for a verb;
-  *'AUX'* for an auxiliary verb;
-  *'PRON'* for a pronoun;
-  *'PROPN'* for a proper noun, etc.
-  Punctuation marks are assigned *'SYM'*.

You can easily look up the word type and polarity of a word in the lexicon using Python code:
- `lexicon["rhetorical"]["postag"]` outputs `['ADJ']`
- `lexicon["rhetorical"]["polarity"]` outputs `[0.0]`
- `lexicon["error"]["postag"]` outputs `['ADJ', 'NOUN']`
- `lexicon["error"]["polarity"]` outputs `[-0.5, -2.0]`

Test this out:

In [None]:
lexicon["rhetorical"]["postag"]

In [None]:
lexicon["rhetorical"]["polarity"]

In [None]:
lexicon["error"]["postag"]

In [None]:
lexicon["error"]["polarity"]
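
The lookups above only work for words that are in the lexicon; for any other word, Python raises a `KeyError`. The sketch below shows one way to guard against that. It uses a made-up two-word `toy_lexicon` with the same structure as the real one.

```python
# stand-in with the same structure as the real lexicon (only two entries)
toy_lexicon = {
    "rhetorical": {"postag": ["ADJ"], "polarity": [0.0]},
    "error": {"postag": ["ADJ", "NOUN"], "polarity": [-0.5, -2.0]},
}

print(toy_lexicon["error"]["postag"])       # ['ADJ', 'NOUN']
print(toy_lexicon["error"]["polarity"])     # [-0.5, -2.0]

word = "icebreaker"                         # a word that is not in the toy lexicon
if word in toy_lexicon:
    print(toy_lexicon[word]["polarity"])
else:
    print(word, "is not in the lexicon")    # avoids the KeyError of a direct lookup
```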

#### Exercise 2.1:
Search in the lexicon:
- the word type of 'kwekkebekken'

- the word type of 'recommend'

- the polarity of 'jollity'

- the polarity of 'intriguing'

The words in the lexicon are the *keys* of the dictionary. You can retrieve them using the instruction `lexicon.keys()`.<br> Execute the following code cells to check whether certain sentiment words are a key.

In [None]:
"zieke" in lexicon.keys()

In [None]:
"angry" in lexicon.keys()

#### Exercise 2.2:
Look for a sentiment word that is not in the lexicon.

Answer:

So, you are ready for an application. Step 1: read and review the data.

<div>
    <font color=#690027 markdown="1">
<h2>3. Application: customer review</h2>    </font>
</div>

In what follows, you will perform sentiment analysis on a given review.

### The data

Run the following code cell to read in the review.

In [None]:
review = "New concept in Ghent, but in my opinion it could be better. Most of the cornflakes were just the basic types. Also quite expensive for the amount you get, especially with the toppings they are sparing. And if you offer breakfast, also give people more choice for their coffee."

Enter an instruction in the following code cell to view this review.

You are ready for step 2: performing preprocessing on the review.

<div>
    <font color=#690027 markdown="1">
<h2>4. Preprocessing</h2>    </font>
</div>

### Lowercasing
In this step, you convert the text in the review to lowercase letters. Lowercasing is necessary because the words in the lexicon are written without capital letters.
The variable `review_lowercase` refers to this converted text.

In [None]:
# convert review text into lowercase text
review_lowercase = review.lower()  # write review in lowercase letters

In [None]:
# show result of lowercasing

### Tokenization
Now you will have the computer split the review into words and punctuation marks, based on the spaces in the text. These words and punctuation marks are **tokens**.
To generate the tokens automatically, the words and punctuation marks must be separated by a space. After each space, a new token can then be generated. E.g. `Hello, world!` is first written as `Hello , world !` and the four tokens are then: 'Hello', ',', 'world' and '!'.
You will therefore need to adjust the review first: there must be a space before and after each punctuation mark.

#### Enter spaces before each punctuation mark
In this step, you place a space before (and after) each punctuation mark in the review.
The variable `review_space` refers to this converted text.

In [None]:
punctuation = ...

In [None]:
# adding spaces to review text
review_space = ""     # empty string
...

In [None]:
# show result of adding spaces
print(review_space)
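
One possible way to add the spaces, shown on a short made-up text rather than on the review. It assumes that `string.punctuation` is a suitable collection of punctuation marks; the variable names `text` and `spaced` are only used in this sketch.

```python
import string

text = "hello, world!"
spaced = ""                                   # start from an empty string
for character in text:
    if character in string.punctuation:       # e.g. ',' or '!'
        spaced = spaced + " " + character + " "
    else:
        spaced = spaced + character

print(spaced)
```

This approach can produce two spaces in a row, for instance between a comma and the next word. That is harmless as long as the text is later split on whitespace.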

Now you want to obtain a list of all tokens in the review.

In [None]:
# tokenization
tokens = ...

In [None]:
# show result of tokenization
print(tokens)

So you get a list of the tokens.
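
Splitting on spaces can be done with the string method `split()`. A small sketch on a made-up text in which the punctuation marks are already surrounded by spaces:

```python
spaced = "hello ,  world ! "      # punctuation is already surrounded by spaces
tokens = spaced.split()           # split on whitespace; repeated spaces are ignored
print(tokens)                     # ['hello', ',', 'world', '!']

# splitting on exactly one space keeps empty strings where two spaces meet
print(spaced.split(" "))
```

Calling `split()` without an argument is the more forgiving choice here, because it does not produce empty tokens.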

Once the text has been tokenized, you can assign a **part-of-speech tag** to each token, which is the word type for a word.<br>When looking up a token in the lexicon, it's important to also check the part-of-speech tag. After all, some words can be used both as a noun and as an adjective.
Searching for a token in the lexicon is done in its dictionary form. Thus, every token must be lemmatized, in other words, it must be reduced to its **lemma** or dictionary form, such as singular for a noun and the infinitive for a verb. Searching for such a dictionary form in the lexicon can then be **automated**.

### Part-of-speech tagging and lemmatization

So you want a list of the part-of-speech tags and of the lemmas that belong to the tokens, and then you can look up these lemmas in the lexicon.

The first token is 'new'. The base form or lemma is `new` and this is an adjective, ADJ.
Take the token ','. The base form or lemma is `,` and this is a punctuation mark, thus a symbol, SYM.
Take the token 'could'. The base form or lemma is `can` and this is a verb, VERB.
Take the token 'types'. The base form or lemma is `type` and this is a noun, NOUN.

**Note** that the lemma of tokens such as *'were'*, *'could'*, *'types'* and *'toppings'* is not the same as the token. With other tokens, such as *'new'*, the lemma and the token coincide.

Create lists of the lemmas and the part-of-speech tags.<br>The order in these lists is important. The first element in one list must correspond to the first element in the other list; the second element in one list must correspond to the second element in the other list; etc. <br>Start from the list of tokens.

In [None]:
postags = ['ADJ', ...]
lemmas = ['new', ...]

Creating these lists does take some time as they are largely done manually.
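
Because these lists are typed in by hand, it is worth checking that they line up. The sketch below does this for a short made-up sentence, not for the review; the tags follow the abbreviations used in the lexicon.

```python
tokens  = ["the", "games", "were", "cool", "icebreakers", "."]
lemmas  = ["the", "game", "be", "cool", "icebreaker", "."]
postags = ["DET", "NOUN", "VERB", "ADJ", "NOUN", "SYM"]

# all three lists must have the same length, otherwise zip() silently drops items
assert len(tokens) == len(lemmas) == len(postags)

# zip() walks through the lists in parallel: first elements together, then second, etc.
for token, lemma, postag in zip(tokens, lemmas, postags):
    print(f"{token:<12}{lemma:<12}{postag}")
```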

<div>
    <font color=#690027 markdown="1">
<h2>5. Sentiment lexicon matching</h2>    </font>
</div>

Now that your review is *preprocessed*, you can **determine the sentiment** using the sentiment lexicon that you have available.<br>Here too, you work in two major steps:
- you look up the sentiment words of the review in the lexicon, in other words, you check which tokens can be found in the lexicon based on their lemma (you apply 'matching' to the review);
- you retrieve the polarity of each sentiment word according to its word type in the review.

#### Example 5.1
Consider the lemma `"new"`. You look up this lemma in the sentiment lexicon.

In [None]:
# is "new" in the lexicon?

So it's in there. Query the part-of-speech tag ("postag") and the polarity ("polarity") of `"new"`.

#### Exercise 5.1
Do the same for the lemma `"expensive"`.

#### Example 5.2
The index of `"NOUN"` in the list of part-of-speech tags is 1.<br>With the following code, you request the polarity of `"expensive"` as a noun:

In [None]:
lexicon["expensive"]["polarity"][1]
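
Instead of reading off the index yourself, you can let Python find it with the list method `index()`. A minimal sketch, using the entry shown earlier for 'error' as a stand-alone dictionary (the names `entry` and `position` are only used here):

```python
# the lexicon entry shown earlier for 'error'
entry = {"postag": ["ADJ", "NOUN"], "polarity": [-0.5, -2.0]}

position = entry["postag"].index("NOUN")    # position of 'NOUN' in the list of tags
print(position)                             # 1
print(entry["polarity"][position])          # -2.0
```

This works because the two lists are ordered in the same way: the polarity at a given position belongs to the part-of-speech tag at that position.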

#### Example 5.3
- For the sentiment analysis of the review, you now make a list of the **sentiment words** of the review: the tokens whose lemma appears in the lexicon. You refer to this list with the variable `lexiconmatches`.
- You also create a list with the polarities of these tokens. Refer to this list with the variable `polarities`.

You go through all the lemmas one by one. For each lemma that is in the lexicon, you check the word type (part-of-speech tag). If the word type matches, you add the corresponding token to the list `lexiconmatches` and the corresponding polarity to the list `polarities`.
For `"new"`, this means that `"new"` is added to the `lexiconmatches` list and `0.575` to the `polarities` list.
Finally, you add all the polarities together. The sum, `sum(polarities)`, gives the sentiment of the review.

In [None]:
# search for lexicon matches in the review
lexiconmatches = []       # empty list, to be filled with tokens of the lemmas found in lexicon
polarities = []           # empty list, to be filled with polarities of found tokens

# consider lemmas with corresponding part of speech and token
for lemma, postag, token in zip(lemmas, postags, tokens):
    ...

# review polarity
polarity = sum(polarities)

# final decision for this review
if ...:
    sentiment = "positive"
elif ...:
    sentiment = "neutral"
...

print("The polarity of the review is: " ...)
print("The sentiment of the review is " ...)
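
To see the matching idea at work on a small scale, here is a sketch on a made-up sentence with a made-up three-word `toy_lexicon`. It is one possible way to write the loop, not necessarily the intended solution for the review.

```python
# stand-in lexicon with the same structure as the real one (values for illustration only)
toy_lexicon = {
    "game": {"postag": ["NOUN"], "polarity": [1.0]},
    "cool": {"postag": ["ADJ"], "polarity": [0.8]},
    "error": {"postag": ["ADJ", "NOUN"], "polarity": [-0.5, -2.0]},
}

tokens  = ["the", "cool", "games", "contained", "an", "error", "."]
lemmas  = ["the", "cool", "game", "contain", "a", "error", "."]
postags = ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN", "SYM"]

lexiconmatches = []
polarities = []
for lemma, postag, token in zip(lemmas, postags, tokens):
    # a match requires both the lemma and its word type to be in the lexicon
    if lemma in toy_lexicon and postag in toy_lexicon[lemma]["postag"]:
        position = toy_lexicon[lemma]["postag"].index(postag)
        lexiconmatches.append(token)
        polarities.append(toy_lexicon[lemma]["polarity"][position])

print(lexiconmatches)
print(polarities)
print(round(sum(polarities), 2))
```

For 'error', the polarity of the noun is used, because the token is tagged as a noun in this sentence.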

#### Exercise 5.2
Some steps were already automated, others you had to do manually. List what was done manually and what was done automatically.

Answer:

Answer:

<div>
    <font color=#690027 markdown="1">
<h2>6. Sentiment lexicon matching: marking</h2>    </font>
</div>

Take a look at the (matched) sentiment words and their polarity by requesting them.

In [None]:
print(lexiconmatches)
print(polarities)

You can **highlight the sentiment words** in the provided review: green for a positive polarity, red for a negative one, and blue for a neutral one.
For this, you start from the original text. In this text, you replace the tokens that are sentiment words with themselves on a colored background. You leave the non-sentiment words alone.

In [None]:
review_highlighted = review_space    # take review where spaces were added
# mark lexicon matches that occur as a word, not part of a word
...

print(review_highlighted)
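
The highlighting idea can be sketched on a short made-up text. To keep the sketch free of extra modules, it writes out the ANSI escape codes directly; these are the codes that colorama provides as `Back.GREEN`, `Back.RED`, `Back.BLUE` and `Style.RESET_ALL`. The text, the matches and the polarities are illustration data.

```python
# ANSI escape codes for colored backgrounds and for resetting the color
GREEN, RED, BLUE, RESET = "\033[42m", "\033[41m", "\033[44m", "\033[0m"

text = "the cool games contained an error ."
matches = ["cool", "games", "error"]
polarities = [0.8, 1.0, -2.0]

highlighted = " " + text + " "    # pad so the first and last token are also surrounded by spaces
for word, polarity in zip(matches, polarities):
    if polarity > 0:
        color = GREEN
    elif polarity < 0:
        color = RED
    else:
        color = BLUE
    # replace only whole tokens: the spaces keep 'cool' from matching inside 'cooler'
    highlighted = highlighted.replace(" " + word + " ", " " + color + word + RESET + " ")

print(highlighted.strip())
```

Searching for the word with a space on either side is what makes sure that only complete tokens are colored.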

<div>
    <font color=#690027 markdown="1">
<h2>7. Exercise</h2>    </font>
</div>

Check out the following reviews.
- Choose one.
- Which sentiment do you intuitively associate with it?
- Does the lexicon lead to the same result?

Sunrisewater is just what Belgium needs. With the obesity problem that is indeed on the rise, we can use all initiatives to get the youth to drink regular water again instead of that American junk! It tastes great and what's even better is that you can find it on every street corner! Truly amazing! Especially the pink and yellow is highly recommended.
Salé & Sucré is known for its super delicious and original cocktails, unfortunately there was no non-alcoholic variant available. Our designated driver had to stick with soft drinks.
It was super fun to get a taste of the Filipino cuisine. The dishes were well put together, the portions were definitely large enough and the flavors were absolutely spot on. Worth repeating!
Cozy atmosphere, tasty coffee and a beautiful interior. The combination of a study bar and a chat bar is a brilliant idea! Studying with a nice cup of coffee, a delicious snack and together with other students, is extremely motivating. The interior is very relaxing with little distraction, allowing me to get more done than ever before!
Wow, what cool restaurants! And the food is super delicious there.

<div class="alert alert-box alert-success">
Congratulations, you have learned how a rule-based system for sentiment analysis works!</div>

<img src="images/cclic.png" alt="Banner" align="left" width="100"/><br><br>
Notebook Chatbot, see <a href="http://www.aiopschool.be">AI at School</a>, by C. Van Hee, V. Hoste, F. Wyffels, T. Neutens, Z. Van de Staey & N. Gesquière is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.