<img src="images/bannerugentdwengo.png" alt="BannerUGentDwengo" width="250"/>

<div>
    <font color=#690027 markdown="1">
<h1>AI MODEL FOR SENTIMENT ANALYSIS</h1>    </font>
</div>

<div class="alert alert-box alert-success">
In this notebook, you will investigate sentiment words in given texts (the data) using artificial intelligence (AI). You will use a <em>machine learning</em> model that was trained on annotated texts; it can tokenize a text with high accuracy and determine the part-of-speech tag and the lemma of each token. You then use a <em>rule-based AI system</em> to determine the sentiment of the given text.</div>

In the previous notebook, 'Rule-based sentiment analysis', you were introduced to the principles of rule-based sentiment analysis:
- You use an (existing) **lexicon** or dictionary that links words to their **polarity** (positive, negative, or neutral).
- Before you can match sentiment words from a lexicon with the data, you need to read in the data and **preprocess** it.
- Common preprocessing steps are **lowercasing**, **tokenization**, **part-of-speech tagging**, and **lemmatization**.

In the previous notebook, not all steps were automated. Lemmatization and part-of-speech tagging had to be done manually.

<div class="alert alert-box alert-success">
In this notebook, you will fully <b>automate</b> the sentiment analysis. In other words, you will let the computer do the work: the computer will preprocess the data with a <em>machine learning model (ML model)</em>, match the tokens with the given lexicon using a <em>rule-based AI system</em>, and finally decide on the sentiment of the given text.</div>

### Loading Modules, Model and Lexicon

Before you get started, first provide the necessary tools:
- You import the necessary modules (you only need to do this once). <br>These modules contain functions and methods that will facilitate your research. Much is already pre-programmed, so you can work with fairly simple instructions.
- You load a machine learning model to use later.
- You read in a sentiment lexicon.

To do this, execute the three code cells below. You do not need to understand the code in these cells.

In [None]:
# import modules
import pickle                     # for lexicon
from colorama import Fore, Back   # to be able to display in color
import spacy                      # for machine learning preprocessing model

In [None]:
# load machine learning model
nlp = spacy.load("nl_core_news_sm")    # nlp stands for Natural Language Processing

In [None]:
# read in lexicon, file 'new_lexicondict.pickle' contains sentiment lexicon
with open("data/new_lexicondict.pickle", "rb") as file:
    lexicon = pickle.load(file)
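The matching code later in this notebook indexes the lexicon as `lexicon[lemma]["postag"]` and `lexicon[lemma]["polarity"]`, which suggests that each lemma maps to two parallel lists. The sketch below illustrates that assumed structure with a toy dictionary; the name `toy_lexicon`, its entries, and its polarity values are invented for illustration and are not taken from the real lexicon file.

```python
# Toy stand-in for the real lexicon (assumption: same shape as the pickled lexicon).
# The entries and polarity values are invented for illustration only.
toy_lexicon = {
    "expensive": {"postag": ["ADJ"], "polarity": [-1.0]},
    "wrong": {"postag": ["ADJ", "NOUN"], "polarity": [-1.0, -0.5]},
}

# Each part-of-speech tag of a lemma is paired with the polarity at the same position.
for lemma, entry in toy_lexicon.items():
    for postag, polarity in zip(entry["postag"], entry["polarity"]):
        print(lemma, postag, polarity)
```

Storing the tags and polarities side by side is what allows one lemma, such as 'wrong', to carry a different polarity as an adjective than as a noun.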

So, you're ready for step 1: read in and review the data.

<div>
    <font color=#690027 markdown="1">
<h2>1. Read the data</h2>    </font>
</div>

For this task, you will work with the same **customer review** as in the notebook 'Rule-based sentiment analysis'.

Step 1: Execute the following code cell to read in the review and then view it.

In [None]:
review = "New concept in Ghent, but I think it could be better. Most of the cornflakes were just the basic types. Also a bit expensive for the amount you get, especially with the toppings they are stingy. And if you offer breakfast, at least give people a bit more choice for their coffee."
print(review)

You are ready for step 2.
In the following, you let the computer do all the preprocessing on the review. Lowercasing was already automated in 'Rule-based sentiment analysis', so you reuse that code.
You should not add spaces to the text, because the machine learning model takes care of the tokenization. Part-of-speech tagging and lemmatization are now also automated with the help of the model.
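To see why tokenization is left to a model, consider what happens when a text is simply split on spaces. The sketch below uses only Python's built-in `str.split`; the variable names `sentence` and `naive_tokens` are chosen for this illustration.

```python
# Splitting on whitespace is a naive way to tokenize a text.
sentence = "new concept in ghent, but i think it could be better."
naive_tokens = sentence.split()

# Punctuation stays attached to the neighbouring word, e.g. 'ghent,' and 'better.'
print(naive_tokens)
```

A word with punctuation attached, such as 'better.', would not be found in a lexicon that lists 'better', which is one reason to separate punctuation into tokens of its own.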

<div>
    <font color=#690027 markdown="1">
<h2>2. Preprocessing</h2>    </font>
</div>

### Lowercasing

In [None]:
# convert review text to lowercase text
review_lowercase = review.lower()

### Tokenization, part-of-speech tagging, and lemmatization

The review is **tokenized**, and each token is assigned a **part-of-speech** tag and a **lemma**. This happens automatically with the help of a previously trained model with an accuracy of 93%.
To do this, you pass the review (in lowercase) to the ML model `nlp`.

In [None]:
# input review in lowercase into model
review_preprocessed = nlp(review_lowercase)

The tokens of the review have now been determined and for each token, the word (or punctuation) itself, the part-of-speech tag, and the dictionary form (lemma) are stored in an object referred to by `review_preprocessed`. <br>You can now request the characteristics of tokens: the word/punctuation via the instruction `token.text`, the part of speech via `token.pos_` and the dictionary form via `token.lemma_`.

#### Show the part of speech and dictionary form of each token

In [None]:
# tokens
for token in review_preprocessed:
    print(token.text)

In [None]:
# part-of-speech tag of each token
for token in review_preprocessed:
    print(token.text + ": " + token.pos_)

In [None]:
# lemma of each token
for token in review_preprocessed:
    print(token.text + ": " + token.lemma_)

### Create lists of the tokens, lemmas, and part-of-speech tags

In 'Rule-based sentiment analysis', the lists of lemmas and part-of-speech tags were created manually. Now this can be done automatically, because all the necessary information is collected in the object to which the variable `review_preprocessed` refers.

In [None]:
# listing
tokens = []
lemmas = []
postags = []
for token in review_preprocessed:
    tokens.append(token.text)      # add each token to the list of tokens
    lemmas.append(token.lemma_)    # add each lemma to the list of lemmas
    postags.append(token.pos_)     # add each part-of-speech tag to list of postags
# showing lists
print("tokens:")
print(tokens)
print("lemmas:")
print(lemmas)
print("part-of-speech tags:")
print(postags)
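The same three lists can also be built with list comprehensions. The sketch below runs without the model by using a hypothetical stand-in for the preprocessed review: `Token` is a `namedtuple` that merely mimics the attribute names `text`, `lemma_`, and `pos_`, and the example tokens, lemmas, and tags are invented.

```python
from collections import namedtuple

# Hypothetical stand-in for the tokens produced by the model (not a spaCy object).
Token = namedtuple("Token", ["text", "lemma_", "pos_"])
toy_doc = [
    Token("toppings", "topping", "NOUN"),
    Token("are", "be", "AUX"),
    Token("stingy", "stingy", "ADJ"),
    Token(".", ".", "PUNCT"),
]

# One list comprehension per attribute replaces the loop with three append calls.
toy_tokens = [token.text for token in toy_doc]
toy_lemmas = [token.lemma_ for token in toy_doc]
toy_postags = [token.pos_ for token in toy_doc]

# The lists stay aligned: position i in each list describes the same token.
print(list(zip(toy_tokens, toy_lemmas, toy_postags)))
```

Because the three lists share the same order, `zip` can later walk through them together, which is what the lexicon matching relies on.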

<div>
    <font color=#690027 markdown="1">
<h2>3. Sentiment lexicon matching</h2>    </font>
</div>

Now that the review has been *preprocessed*, you can determine the sentiment using the sentiment lexicon that you have available. This step was already automated in 'Rule-based sentiment analysis', so you reuse the code from that notebook.

In [None]:
# search for matches with lexicon in review
lexiconmatches = []       # empty list, to be filled with tokens of the lemmas found in lexicon
polarities = []         # empty list, to be filled with polarities of found tokens
# consider lemmas with corresponding word type and token
for lemma, postag, token in zip(lemmas, postags, tokens):
    # lemma must be present in lexicon
    # only when the lemma and the POS-tag match, there is a match (see for example 'wrong' as ADJ and 'wrong' as NOUN)
    if lemma in lexicon.keys() and postag in lexicon[lemma]["postag"]:
        lexiconmatches.append(token)                      # add corresponding token to lexiconmatches list
        # add corresponding polarity to list of polarities
        if postag == lexicon[lemma]["postag"][0]:
            polarities.append(lexicon[lemma]["polarity"][0])
        else:
            polarities.append(lexicon[lemma]["polarity"][1])
# review polarity
polarity = sum(polarities)
# final decision for this review
if polarity > 0:
    sentiment = "positive"
elif polarity == 0:
    sentiment = "neutral"
elif polarity < 0:
    sentiment = "negative"
print("The polarity of the review is: " + str(polarity))
print("The sentiment of the review is " + sentiment + ".")
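The matching and decision steps can also be wrapped in two small functions and tried on toy data. This is a minimal sketch, not the notebook's own code: the function names `match_lexicon` and `decide_sentiment`, the toy lexicon, and its polarity values are all assumptions made for illustration. It looks up the polarity with `list.index` instead of checking positions 0 and 1 explicitly, which gives the same result for entries with one or two tags.

```python
def match_lexicon(lemmas, postags, tokens, lexicon):
    """Return the matched tokens and their polarities."""
    matches, polarities = [], []
    for lemma, postag, token in zip(lemmas, postags, tokens):
        entry = lexicon.get(lemma)
        # A match requires both the lemma and its part-of-speech tag to be in the lexicon.
        if entry is None or postag not in entry["postag"]:
            continue
        position = entry["postag"].index(postag)
        matches.append(token)
        polarities.append(entry["polarity"][position])
    return matches, polarities


def decide_sentiment(polarities):
    """Map the summed polarity to 'positive', 'negative' or 'neutral'."""
    total = sum(polarities)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


# Toy data, invented for illustration only.
toy_lexicon = {
    "good": {"postag": ["ADJ"], "polarity": [0.5]},
    "expensive": {"postag": ["ADJ"], "polarity": [-1.0]},
    "stingy": {"postag": ["ADJ"], "polarity": [-1.0]},
    "wrong": {"postag": ["ADJ", "NOUN"], "polarity": [-1.0, -0.5]},
}
toy_tokens = ["good", "coffee", "but", "expensive", "and", "stingy"]
toy_lemmas = ["good", "coffee", "but", "expensive", "and", "stingy"]
toy_postags = ["ADJ", "NOUN", "CCONJ", "ADJ", "CCONJ", "ADJ"]

toy_matches, toy_polarities = match_lexicon(toy_lemmas, toy_postags, toy_tokens, toy_lexicon)
print(toy_matches, toy_polarities, decide_sentiment(toy_polarities))
```

Putting the logic in functions makes it possible to apply the same steps to several reviews without copying the cell.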

<div>
    <font color=#690027 markdown="1">
<h2>4. Exercise: Sentiment lexicon matching on own review</h2>    </font>
</div>

You can also do this for a self-written review and compare the system's output with your own annotation.

In [None]:
# place self-written review between quotation marks, so adjust given string
self_written_review = "Hopefully this will be a fun notebook!"
# fill in polarity between quotation marks (positive, negative, neutral), so also adjust given string here
label = "positive"
# next steps: show review and apply nlp() to it
print(self_written_review)
review = nlp(self_written_review.lower())
# collect the text, lemma and part-of-speech tag of every token in lists
tokens = []
lemmas = []
postags = []
for token in review:
    tokens.append(token.text)
    lemmas.append(token.lemma_)
    postags.append(token.pos_)
# show lists
print("tokens:")
print(tokens)
print("lemmas:")
print(lemmas)
print("part-of-speech tags:")
print(postags)

Now that the preprocessing is complete, you can search for matches with the lexicon again.

In [None]:
# search for matches with lexicon in review
lexiconmatches = []       # empty list, to be filled with tokens of the lemmas found in lexicon
polarities = []         # empty list, to be filled with polarities of found tokens
# consider lemmas with corresponding word class and token
for lemma, postag, token in zip(lemmas, postags, tokens):
    # lemma must be present in lexicon
    # only when the lemma and the POS-tag match, there is a match (see for example 'wrong' as ADJ and 'wrong' as NOUN)
    if lemma in lexicon.keys() and postag in lexicon[lemma]["postag"]:
        lexiconmatches.append(token)                      # add corresponding token to the lexiconmatches list
        # add corresponding polarity to list of polarities
        if postag == lexicon[lemma]["postag"][0]:
            polarities.append(lexicon[lemma]["polarity"][0])
        else:
            polarities.append(lexicon[lemma]["polarity"][1])
# review polarity
polarity = sum(polarities)
# final decision for this review
if polarity > 0:
    sentiment = "positive"
elif polarity == 0:
    sentiment = "neutral"
elif polarity < 0:
    sentiment = "negative"
print("The polarity of the review is: " + str(polarity))
print("The sentiment of the review is " + sentiment + ".")
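The system's decision can also be checked against your own annotation in code. The sketch below is one possible way to do that; the helper name `agrees` is hypothetical, and the example values stand in for the notebook's `sentiment` and `label` variables.

```python
def agrees(system_sentiment, own_label):
    """Return True when the system's decision equals the annotator's label."""
    # Ignore differences in capitalisation and surrounding spaces.
    return system_sentiment.strip().lower() == own_label.strip().lower()


# Example values standing in for the notebook's `sentiment` and `label`.
example_sentiment = "neutral"
example_label = "positive"

if agrees(example_sentiment, example_label):
    print("The system agrees with the annotation.")
else:
    print("The system says " + example_sentiment + ", the annotation says " + example_label + ".")
```

In the notebook itself, calling `agrees(sentiment, label)` after the cell above would perform the same check on your own review.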

Compare the final decision of the rule-based system with your own annotation. Did the system get it right? Why/why not, do you think?

Answer:

<img src="images/cclic.png" alt="Banner" align="left" width="100"/><br><br>
Notebook Chatbot, see <a href="http://www.aiopschool.be">AI at School</a>, by C. Van Hee, V. Hoste, F. wyffels, Z. Van de Staey & N. Gesquière is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.