# Exercise 2: unsupervised polarity system
1. Get the first synset (the most frequent one) of each word, restricted to one of the following alternatives:
    - [nouns, verbs, adjectives and adverbs],
    - [nouns, adjectives and adverbs],
    - [only adjectives]
2. Sum all the positive scores and negative ones to get the polarity
3. Apply the system to the movie reviews corpus and give the accuracy
4. Give some conclusions about the work

In [1]:
from nltk.corpus import movie_reviews as mr
from nltk.corpus import wordnet as wn
from nltk import pos_tag
from nltk.corpus import sentiwordnet as swn

In [2]:
def get_valid_pairs(original_pairs):
    """
    Produces a list containing the word-tag pairs that are valid
    for the WordNet synset analysis (only nouns, verbs, adjectives and adverbs),
    and converts each POS tag to the corresponding WordNet POS tag
    """
    valid_pairs = []
    for word, tag in original_pairs:
        #if word is a noun
        if tag.startswith('N'):
            valid_pairs.append((word, wn.NOUN))
        #if word is a verb
        elif tag.startswith('V'):
            valid_pairs.append((word, wn.VERB))
        #if word is an adjective
        elif tag.startswith('J'):
            valid_pairs.append((word, wn.ADJ))
        #if word is an adverb
        elif tag.startswith('R'):
            valid_pairs.append((word, wn.ADV))
    return valid_pairs

def unsupervised_polarity_system(words):
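    # tag every token of the review with its Penn Treebank POS tag
    pos = pos_tag(words)
    # keep only the WordNet-compatible pairs; the set counts each distinct (word, tag) pair once
    valid_pairs = set(get_valid_pairs(pos))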
    
    polarity = 0
    for word, tag in valid_pairs:
        synsets = wn.synsets(word, tag)
        if len(synsets) > 0:
            synset = synsets[0]
            sentiSynset = swn.senti_synset(synset.name())
            polarity += sentiSynset.pos_score() - sentiSynset.neg_score()
    return polarity

In [3]:
print('Negatives:')
count_neg = 0
for fileid in mr.fileids('neg'):
    words = mr.words(fileid)
    polarity = unsupervised_polarity_system(words)
    if polarity < 0:
        count_neg += 1
print('Total negatives: ', len(mr.fileids('neg')))
print('Correctly identified as negatives: ', count_neg)
print('Percentage of correctly identified negatives: ', count_neg/len(mr.fileids('neg')))
print()

count_pos = 0
print('Positives:')
for fileid in mr.fileids('pos'):
    words = mr.words(fileid)
    polarity = unsupervised_polarity_system(words)
    if polarity > 0:
        count_pos += 1
print('Total positives: ', len(mr.fileids('pos')))
print('Correctly identified as positives: ', count_pos)
print('Percentage of correctly identified positives: ', count_pos/len(mr.fileids('pos')))

Negatives:
Total negatives:  1000
Correctly identified as negatives:  284
Percentage of correctly identified negatives:  0.284

Positives:
Total positives:  1000
Correctly identified as positives:  881
Percentage of correctly identified positives:  0.881


In [4]:
accuracy = (count_neg+count_pos)/(len(mr.fileids('neg'))+len(mr.fileids('pos')))
print(accuracy)

0.5825


### Conclusions
We built our system using all the POS categories accepted by WordNet (nouns, verbs, adjectives and adverbs). It identifies positive reviews quite well (88% correctly identified) but performs very poorly on negative reviews (28% correctly identified).
This might be because we used every POS category to compute the sentiment.
In most cases, a text with a negative sentiment contains only a few words that actually signal that it is negative. As an example, take one of the sentences used in the theory class:
*Donald Trump’s administration: “Government by the worst men.”*
The only word likely to have a high negative score is "worst". If we analyze this sentence taking all the words into account, the polarity will be higher than we would like, because the other words can raise it, and the sentence will probably be classified as positive. In contrast, if we analyzed only adjectives, for example, the sentence would probably be identified correctly as negative.
In conclusion, taking all the words of a text into account is probably not the best way to do sentiment analysis.
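The other alternatives listed in the exercise (dropping verbs, or keeping only adjectives) were not run here. A minimal sketch of how the filtering step could be parameterized follows. The name `get_valid_pairs_for` and its `allowed` parameter are hypothetical and not part of the code above; the letters `'n'`, `'v'`, `'a'`, `'r'` are the values of `wn.NOUN`, `wn.VERB`, `wn.ADJ` and `wn.ADV`, and the tags in the example are written by hand rather than produced by `pos_tag`.

```python
# Map the first letter of a Penn Treebank tag to a WordNet POS letter
# ('n', 'v', 'a', 'r' are the values of wn.NOUN, wn.VERB, wn.ADJ, wn.ADV).
PENN_PREFIX_TO_WORDNET = {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}

def get_valid_pairs_for(tagged_words, allowed=('n', 'v', 'a', 'r')):
    """
    Hypothetical variant of get_valid_pairs: keeps only the pairs
    whose WordNet POS is in `allowed`.
    """
    pairs = []
    for word, tag in tagged_words:
        wn_pos = PENN_PREFIX_TO_WORDNET.get(tag[:1])
        if wn_pos in allowed:
            pairs.append((word, wn_pos))
    return pairs

# Hand-written tags for the example sentence (an assumption, not pos_tag output)
tagged = [('Government', 'NN'), ('by', 'IN'), ('the', 'DT'),
          ('worst', 'JJS'), ('men', 'NNS')]

all_pos = get_valid_pairs_for(tagged)
only_adjectives = get_valid_pairs_for(tagged, allowed=('a',))
print(all_pos)
print(only_adjectives)
```

With `allowed=('a',)` only "worst" is kept, so the polarity of the sentence would depend on that word alone. Whether this actually improves the accuracy on the movie reviews corpus would have to be measured.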

Since a random model would have a 50 percent chance of success, one could think that our model, with an accuracy of 58.25%, behaves like a random classifier. However, under a binomial distribution (1165 correct classifications out of 2000 reviews, with success probability 0.5 for a random classifier), we can say with more than 99.9% confidence that our model does have some discriminatory power.
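The binomial argument can be checked with a short computation. The sketch below is one possible way to do it, assuming a one-sided test against a classifier that guesses with probability 0.5; the function name `tail_probability_fair_coin` is ours, and exact integer arithmetic is used because `0.5 ** 2000` underflows to zero in floating point.

```python
from fractions import Fraction
from math import comb

def tail_probability_fair_coin(successes, trials):
    """P(X >= successes) for X ~ Binomial(trials, 0.5), computed exactly."""
    favourable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return Fraction(favourable, 2 ** trials)

correct = 284 + 881   # correctly classified negatives + positives
total = 2000          # reviews in the corpus

p_value = float(tail_probability_fair_coin(correct, total))
print(p_value)
```

The resulting probability is far below 0.001, which is consistent with the 99.9% confidence stated above.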