# AdjectivePhraseTagger

A class that tags simple adjective phrases in the **Text** object.

## Usage

In [1]:
from adjective_phrase_tagger.adj_phrase_tagger import AdjectivePhraseTagger
from estnltk import Text

Create a **Text** object and an **AdjectivePhraseTagger** object, then tag adjective phrases as a new layer of the **Text** object.

In [2]:
tagger = AdjectivePhraseTagger(return_layer=True) # return_layer=True returns only the adjective phrase layer
sent = Text("Peaaegu 8-aastane koer oli väga energiline ja mänguhimuline.")
tagger.tag(sent)

[{'adverb_class': 'doubt',
  'adverb_weight': 0.7,
  'end': 17,
  'intersects_with_verb': False,
  'lemmas': ['peaaegu', '8aastane'],
  'measurement_adj': True,
  'start': 0,
  'text': 'Peaaegu 8-aastane',
  'type': 'adjective'},
 {'adverb_class': 'strong_intensifier',
  'adverb_weight': 2,
  'end': 59,
  'intersects_with_verb': False,
  'lemmas': ['väga', 'energiline', 'ja', 'mänguhimuline'],
  'measurement_adj': False,
  'start': 27,
  'text': 'väga energiline ja mänguhimuline',
  'type': 'adjective'}]

### Attributes that are given with the adjective phrases:

**type** is the type of the phrase: 
* **adjective**: adjective is in its 'normal' (aka positive) form
* **comparative**: contains a comparative adjective
* **participle**: contains an adjective derived from a verb

**measurement_adj** indicates that the adjective in the phrase contains a number or some other type of measurement.

**intersects_with_verb** signifies whether the found adjective phrase intersects with a verb phrase in the text; this happens mostly in the case of participles as in the following sentence:

In [3]:
tagger.tag(Text("Ta oli väga üllatunud."))

[{'adverb_class': 'strong_intensifier',
  'adverb_weight': 2,
  'end': 21,
  'intersects_with_verb': True,
  'lemmas': ['väga', 'üllatunud'],
  'measurement_adj': False,
  'start': 7,
  'text': 'väga üllatunud',
  'type': 'participle'}]
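
These attributes make it easy to filter the tagger output. Below is a minimal sketch, assuming the phrases are plain dictionaries shaped like the outputs above. The helper name `select_phrases` is hypothetical and not part of the tagger.

```python
# Phrases copied from the outputs above, reduced to the fields used here.
phrases = [
    {'text': 'Peaaegu 8-aastane', 'type': 'adjective',
     'measurement_adj': True, 'intersects_with_verb': False},
    {'text': 'väga energiline ja mänguhimuline', 'type': 'adjective',
     'measurement_adj': False, 'intersects_with_verb': False},
    {'text': 'väga üllatunud', 'type': 'participle',
     'measurement_adj': False, 'intersects_with_verb': True},
]


def select_phrases(phrases, **conditions):
    """Hypothetical helper: keep phrases whose attributes equal the given values."""
    return [p for p in phrases
            if all(p.get(key) == value for key, value in conditions.items())]


# Phrases that neither overlap a verb phrase nor describe a measurement.
plain = select_phrases(phrases, intersects_with_verb=False, measurement_adj=False)
print([p['text'] for p in plain])
```

Filtering on `intersects_with_verb` may be useful when participles that belong to a verb phrase should not be counted as adjectives.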

**adverb_class** marks the intensity of the adverb in the phrase. Each class is also assigned a weight (**adverb_weight**) reflecting its intensity. There are currently 6 classes, with the following weights:
* diminisher: 0.5
* doubt: 0.7
* affirmation: 1.5
* strong_intensifier: 2
* surprise: 3
* excess: 3

Not all adverbs have been assigned to a class, so some phrases have _unknow_ as both **adverb_class** and **adverb_weight**.
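
Because the weight is not always a number, code that multiplies by it may need a fallback. The sketch below restates the class weights listed above as a dictionary and adds a hypothetical helper, `numeric_weight`. Treating a missing or non-numeric weight as the neutral value 1.0 is an assumption made for this sketch, not something the tagger does.

```python
# Class weights as listed above.
CLASS_WEIGHTS = {
    'diminisher': 0.5,
    'doubt': 0.7,
    'affirmation': 1.5,
    'strong_intensifier': 2,
    'surprise': 3,
    'excess': 3,
}


def numeric_weight(phrase, default=1.0):
    """Hypothetical helper: return the adverb weight of a phrase as a number.

    Falls back to `default` when the phrase has no adverb weight or when
    the weight is not numeric (an assumption of this sketch).
    """
    weight = phrase.get('adverb_weight')
    if isinstance(weight, (int, float)) and not isinstance(weight, bool):
        return weight
    return default


print(numeric_weight({'adverb_class': 'doubt',
                      'adverb_weight': CLASS_WEIGHTS['doubt']}))
print(numeric_weight({'lemmas': ['hea']}))  # no adverb: neutral weight
```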

### Example

Adjective phrases can be used for sentiment analysis, that is, for determining the polarity of a text. This is often done using adjectives alone, but phrases consisting of an adverb and an adjective can give more precise results, because the adverb in such a phrase usually acts as an intensifier. For this purpose, the most frequent adverbs are already divided into classes and assigned weights based on their intensifying properties (see above).

To illustrate this, let's build a very simple system for sentiment analysis.
For this, we can use the hinnavaatlus.csv dataset, which contains user reviews and their ratings (positive, negative or neutral).

First, let's extract adjectives from the user reviews and create separate frequency lists of adjectives appearing in positive and negative reviews.

In [4]:
import csv
from collections import defaultdict

# Frequency of each adjective per review label: adjectives[label][lemma] -> count
adjectives = defaultdict(lambda: defaultdict(int))

with open('data/hinnavaatlus.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')

    for row in reader:
        tagged = tagger.tag(Text(row[1]))  # row[1] holds the review text
        label = row[2]                     # row[2] holds the rating label
        for tag in tagged:
            if len(tag) > 0:
                adj = tag['lemmas'][-1]    # the last lemma is taken as the adjective
                adjectives[label][adj] += 1
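
The counting step can be tried without the dataset or the tagger. Here is a sketch on made-up tagger output. The `tagged_reviews` list and its contents are invented for illustration.

```python
from collections import defaultdict

# Toy stand-in for (tagger output, label) pairs; invented for illustration.
tagged_reviews = [
    ([{'lemmas': ['väga', 'hea']}, {'lemmas': ['kiire']}], 'Positiivne'),
    ([{'lemmas': ['hea']}], 'Positiivne'),
    ([{'lemmas': ['liiga', 'aeglane']}], 'Negatiivne'),
]

adjectives = defaultdict(lambda: defaultdict(int))
for tagged, label in tagged_reviews:
    for phrase in tagged:
        # Same convention as above: the last lemma is taken as the adjective.
        adjectives[label][phrase['lemmas'][-1]] += 1

# Adjectives per label, highest count first.
for label, counts in adjectives.items():
    print(label, sorted(counts.items(), key=lambda kv: -kv[1]))
```

Note that taking only the last lemma means a coordinated phrase such as 'väga energiline ja mänguhimuline' contributes a single adjective to the counts.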

Of course, not every adjective used in a positive review is itself positive, and the same holds for negative reviews. To overcome this problem, we can use the [**volcanoplot**](https://github.com/estnltk/volcanoplot) tool (tutorial [**here**](https://github.com/estnltk/volcanoplot/blob/master/docs/postimees_tutorial.ipynb)), which visualises the two lists and helps us find the words that are over-represented in each. For this, we need to save both lexicons into CSV files.

In [5]:
with open("neg.csv", "w", newline="") as fout:  # newline="" as the csv module recommends
    writer = csv.writer(fout, dialect = 'excel')
    for row in adjectives['Negatiivne']:
        writer.writerow([row, adjectives['Negatiivne'][row]])

In [6]:
with open("pos.csv", "w", newline="") as fout:  # newline="" as the csv module recommends
    writer = csv.writer(fout, dialect = 'excel')
    for row in adjectives['Positiivne']:
        writer.writerow([row, adjectives['Positiivne'][row]])
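
The two cells above differ only in the label and the file name, so they could share one function. Here is a sketch with a hypothetical helper, `write_frequencies`. Sorting the rows by frequency is an optional choice made here for readability. It is demonstrated on an in-memory buffer so that it runs without the dataset.

```python
import csv
import io


def write_frequencies(counts, fout):
    """Hypothetical helper: write (word, count) rows, most frequent first."""
    writer = csv.writer(fout, dialect='excel')
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([word, count])


# In-memory demo; for a real file, open it with newline='' before passing it in.
buffer = io.StringIO()
write_frequencies({'kiire': 1, 'hea': 2}, buffer)
print(buffer.getvalue())
```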

From **volcanoplot** we save two lexicons: one for positive words (data/positive.txt) and one for negative words (data/negative.txt).
Now let's decide that an adjective appearing in the positive lexicon has a score of 1 and an adjective in the negative lexicon has a score of -1. Adjectives not present in either lexicon have a score of 0.

In [7]:
# Load each lexicon as a set for fast membership tests.
with open("data/negative.txt", "r") as fin:
    negative = set(word.strip() for word in fin)

with open("data/positive.txt", "r") as fin:
    positive = set(word.strip() for word in fin)

Now we can assign a score to each adjective and compute scores for phrases consisting of an adverb and an adjective by multiplying the score of the adjective by the weight of the preceding adverb. By summing the scores of all the phrases in a review, we can calculate the polarity of the review.
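
Before running this on the whole dataset, the rule can be checked on a single review. The sketch below uses tiny invented lexicons and a hypothetical helper, `phrase_score`. Treating a non-numeric adverb weight as 1 is an assumption of the sketch.

```python
def phrase_score(phrase, positive, negative):
    """Sketch: polarity of the last lemma times the adverb weight (1 if absent)."""
    adjective = phrase['lemmas'][-1]
    if adjective in positive:
        polarity = 1
    elif adjective in negative:
        polarity = -1
    else:
        return 0
    weight = phrase.get('adverb_weight', 1)
    # Assumption of this sketch: non-numeric weights count as neutral.
    if not isinstance(weight, (int, float)):
        weight = 1
    return polarity * weight


# Tiny invented lexicons, for illustration only.
positive = {'hea'}
negative = {'halb'}

review = [
    {'lemmas': ['väga', 'hea'], 'adverb_weight': 2},  # 1 * 2
    {'lemmas': ['halb']},                             # -1, no adverb
    {'lemmas': ['väike']},                            # in neither lexicon
]
scores = [phrase_score(p, positive, negative) for p in review]
print(scores, sum(scores))
```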

In [8]:
with open('data/hinnavaatlus.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    review_scores = {}
    
    for idx, row in enumerate(reader):
        tagged = tagger.tag(Text(row[1]))
        total_score = []
        
        if idx < 10:
            print(row[1])
        
        for i in tagged:
            if i['lemmas'][-1] in positive:
                if 'adverb_weight' in i:
                    score = 1*i['adverb_weight']
                else:
                    score = 1
                    
            elif i['lemmas'][-1] in negative:
                if 'adverb_weight' in i:
                    score = -1*i['adverb_weight']
                else:
                    score = -1    
                    
            else:
                score = 0
            if idx < 10:
                print(i['lemmas'], ' ', score)
            total_score.append(score)
            
        if idx < 10:
            print("Total score: ", str(sum(total_score)))
            print("-----------------------------")

        review_scores[row[1]] = sum(total_score)

Kommentaar
Total score:  0
-----------------------------
Väike, aga tubli firma!
['väike']   0
['tubli']   0
Total score:  0
-----------------------------
väga hea firma
['väga', 'hea']   2
Total score:  2
-----------------------------
Viimasel ajal pole midagi halba öelda, aga samas ei konkureeri nad kuidagi Genneti, Ordiga ei hindadelt ega teeninduselt. Toorikute ja tindi ostmiseks samas hea koht ja kuna müüjaid on rohkem valima hakatud, siis võiks 2 ikka ära panna - tuleks 3 kui hinadele ei pandaks kirvest ja toodete saadavus oleks parem.
['viimane']   0
['halb']   0
['hea']   1
['hakatud']   -1
['parem']   0
Total score:  0
-----------------------------
Fotode kvaliteet väga pro ja "jjk" seal töötamise ajal leiti ikka paljudele asjadele väga meeldivad lahendused. Samas hilisem läpaka ost sujus ka väga meeldivalt - sain esialgse rahas ostusoovi vormistada ümber järelmaksule...äärmiselt asjalik teenindus.
['hilisem']   1
['esialgne']   1
['äärmiselt', 'asjalik']   2
['väga', 'meeldiv

As we saved the reviews and their scores to the dict **review_scores**, we can sort it and find reviews that have the highest and lowest scores.

In [9]:
from collections import OrderedDict
sorted_scores = OrderedDict(sorted(review_scores.items(), key=lambda t: t[1], reverse = True))
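
The same ranking can be sketched on toy data with a plain sorted list, from which both ends are available by slicing. The scores below are invented for illustration.

```python
# Toy scores, invented for illustration.
review_scores = {'review a': 2, 'review b': -3, 'review c': 0, 'review d': 5}

# Sort once by score, highest first; slicing then gives both ends.
ranked = sorted(review_scores.items(), key=lambda item: item[1], reverse=True)
most_positive = ranked[:2]
most_negative = ranked[-2:][::-1]  # lowest score first

print(most_positive)
print(most_negative)
```

Because **review_scores** is keyed by the review text, identical reviews share a single entry, and the score stored last wins.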

Let's print the 5 most positive reviews:

In [10]:
for idx, i in enumerate(sorted_scores):
    if idx < 5:
        print(i, sorted_scores[i])
        print()

Väga viisakas koht, helitasin tellisin Creative Elite Pro sealt. Saadeti korralik arve sealt, järgmine päev oli postis ja siis järgmine päev juba kohal. Väga korralik.Pakk oli korralik, paks ja hea. Helikaart oli värske mitte kribitud ja kole. Minul on hea kogemus, väha head hinnad ja kiire asjaajamine. 11

Lisaks vahelduseks ka ühe veidi positiivsema kommentaari. Ostsin sealt läpaka. Alguses oli küll veidi segadust, sest esimeses kohas oli soovitud mudel olemas, kuid pakendi kleeps oli lõhutud ning taaskleebitud. Igaks juhuks nõudsin sisu näidata ja kohe oli aru saada, et asi varem lahti käinud. Läpaka must läikiv pind oli paksult käejälgi täis ja ühest nurgast avastasin korraliku kukkumisjälje. Seepeale küsisin, et kas neil on mõnes salongis täiesti avamata pakendit. Helistati ja saadeti järgmisse kohta, kus küll oli antud mark olemas aga vale mudel. Kolmandas kohas lõpuks leiti see õige mudel ja tehti isegi 500.- alet. Kokkuvõttes jäin rahule, ringisõidetud aja ja bensukulu tasuti. 

And the 5 most negative reviews:

In [11]:
for idx, i in enumerate(OrderedDict(reversed(list(sorted_scores.items())))):
    if idx < 5:
        print(i, sorted_scores[i])
        print()

Minu esimene tellimus sellest firmast jäi kohe ka viimaseks. Lugu oli selline (eile, s.o. 11. jaan. 2013. Tellisin 6. jaan. kolm asja, millest üks oli videokaart. Tolle viimase tarneajaks anti 2-7 tööpäeva. Okei, läks viis, sain kullerilt kauba kätte. Paki avamisel ilmnes, et tellitud videokaardi asemel oli pakki pandud hoopis teise firma ja märksa nõrgema hinnaklassi toode. Loomulikult reedese päeva õhtupoolikul, mil reageerimiseks juba ülinapilt aega. Saatelehel ilutses täiesti tuimalt tellitud, mitte muudetud toote nimi. Polnud midagi parata, tuli ise Õismäele kohale sõita, et asja uurida. Õnneks veel jõudsin vaatamata ummikutele. Kaks umbkeelset teenindajat maigutasid ainult suud. Küsisin, kuidas saab nii räigelt eksida - vale oli mitte ainult detaili mudel, vaid ka firmanimi. Vastuseks kuulsin mingit pobinat, et jah tuli neid õigeid siia vaid üks ja kellegi teise tellimus olnud justkui vaja kiiremini täita. Mida häma! Pakuti, et 6 päeva pärast saaksin õige kätte. Kui see pole petu