# The goal of this notebook is to calculate the impact of single terms (unigrams) on the polarity scores of sentences, publications, journals and publication sources

Several approaches are used to approximate the influence of a single term.

In [2]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer, SentiText, VaderConstants

## Variant A - Using just the values from the valence lexicon

This does not take into account proximity to degree adverbs (https://en.wiktionary.org/wiki/Category:English_degree_adverbs), negations, or the other adjustments made by the Vader implementation.  
Out-of-vocabulary words are treated as neutral terms.
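
To make this limitation concrete, here is a minimal sketch of a context-free lookup. It assumes a two-entry toy lexicon (`TOY_LEXICON`, using the values for `good` and `bad` quoted further down in this notebook) instead of the real Vader lexicon, and the helper name `lexicon_valence` is hypothetical.

```python
# Toy excerpt (assumption for illustration); the real lexicon is
# available as SentimentIntensityAnalyzer().lexicon
TOY_LEXICON = {"good": 1.9, "bad": -2.5}

def lexicon_valence(term, lexicon=TOY_LEXICON):
    """Context-free lookup: out-of-vocabulary terms are treated as neutral (0)."""
    return lexicon.get(term.lower(), 0)

# "good" receives the same valence whether it stands alone,
# follows a negation, or follows a degree adverb
plain = [lexicon_valence(t) for t in ["good"]]
negated = [lexicon_valence(t) for t in ["not", "good"]]
boosted = [lexicon_valence(t) for t in ["very", "good"]]
print(plain, negated, boosted)
```

In this sketch `not good` and `very good` both attribute `1.9` to `good`, which is exactly the simplification Variant A accepts.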

In [5]:
sid = SentimentIntensityAnalyzer()
vader_constants = VaderConstants()
print("Top 10 positive terms")
print([(x, sid.lexicon[x]) for x in sorted(sid.lexicon, key=sid.lexicon.get, reverse=True)[:10]])
print("Top 10 negative terms")
print([(x, sid.lexicon[x]) for x in sorted(sid.lexicon, key=sid.lexicon.get)[:10]])

Top 10 positive terms
[('aml', 3.4), ('ilu', 3.4), ('ily', 3.4), ('magnificently', 3.4), ('lya', 3.3), ('ecstacy', 3.3), ('euphoria', 3.3), ('sweetheart', 3.3), ('143', 3.2), ('best', 3.2)]
Top 10 negative terms
[('rapist', -3.9), ('raping', -3.8), ('slavery', -3.8), ('fu', -3.7), ('kill', -3.7), ('murder', -3.7), ('rape', -3.7), ('terrorist', -3.7), ('hatefulness', -3.6), ('hell', -3.6)]


##### The top 10 positive words already indicate that Vader is better suited to domains like social media: several abbreviations of 'I love you' appear at the top

#### Calculate per sentence polarity impact
Here I normalize the impact in a similar way to the Vader implementation (see https://www.nltk.org/api/nltk.sentiment.html#nltk.sentiment.vader.SentimentIntensityAnalyzer.score_valence). Punctuation amplifiers are applied at the sentence level, so I do not include them in the term-based calculation.  
Note that the nltk Vader implementation adds `1` to positive terms and `-1` to negative terms, and counts each neutral term as `1`, to account for the impact of neutral terms. For example, the sentence `good or bad` is first mapped to the lexicon values `[1.9, 0, -2.5]` and then translated to the polarity impact scores `[2.9, 1, -3.5]`. The sum of absolute values (in this case `7.4`) is used for normalization.  
For neutral terms the polarity impact is fixed to `0`. This is a simplification: in the actual implementation, removing a neutral term would increase the polarity impact of the non-neutral terms. For `good` the calculated impact is `2.9/7.4`; for `bad` it is `-3.5/7.4`.  
The polarity impact of a term is multiplied by the number of its occurrences in the sentence.
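
The `good or bad` arithmetic can be reproduced with a short, self-contained sketch. It assumes a toy lexicon (`TOY_LEXICON`) holding only the two values quoted above, and the helper name `polarity_impact` is hypothetical; it illustrates the normalization, not the nltk code itself.

```python
# Toy lexicon with the two values from the example (assumption for illustration)
TOY_LEXICON = {"good": 1.9, "bad": -2.5}

def polarity_impact(valence):
    # shift away from zero by 1; neutral terms (valence 0) become 1
    return valence + 1 if valence >= 0 else valence - 1

tokens = ["good", "or", "bad"]
valences = [TOY_LEXICON.get(t, 0) for t in tokens]   # lexicon values
impacts = [polarity_impact(v) for v in valences]     # shifted impact scores
total = sum(abs(i) for i in impacts)                 # normalization sum

# neutral terms count towards the total, but their own impact is fixed to 0
normalized = {
    t: (i / total if v != 0 else 0)
    for t, v, i in zip(tokens, valences, impacts)
}
print({t: round(x, 3) for t, x in normalized.items()})
```

The neutral term `or` contributes `1` to the normalization sum, which is why `good` ends up at `2.9/7.4` rather than `2.9/6.4`.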

In [17]:
def clean_text(sentence):
    # uses SentiText from nltk's Vader implementation: a very simple NLP pre-processing step that mainly removes punctuation
    sentitext = SentiText(sentence, vader_constants.PUNC_LIST, vader_constants.REGEX_REMOVE_PUNCTUATION)
    return [x.lower() for x in sentitext.words_and_emoticons]

def get_polarity_impact(term):
    valence = sid.lexicon.get(term) or 0
    return valence + 1 if valence >= 0 else valence - 1  # neutral terms reduce impact of non-neutral terms

def get_sentence_polarity_impact_map(sentence):
    term_scores = dict()
    cleaned_sentence = clean_text(sentence)
    
    # aggregate term polarity impact per unique term
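    for term in cleaned_sentence:
        if term not in term_scores:
            term_scores[term] = 0
        term_scores[term] += get_polarity_impact(term)
    
    total_impact_sum = sum(abs(x) for x in term_scores.values())
    # neutral terms count towards the normalization sum, but their own impact is fixed to 0
    # (checking the lexicon instead of `term_impact != 1` also covers neutral terms that occur more than once)
    term_scores.update({term: term_impact/total_impact_sum if sid.lexicon.get(term) else 0 for term, term_impact in term_scores.items()})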
    
    return term_scores

For simple sentences this gives a good explanation.

In [26]:
get_sentence_polarity_impact_map("Well, this is bad")

{'well': 0.27631578947368424, 'this': 0, 'is': 0, 'bad': -0.46052631578947373}

In [27]:
sid.polarity_scores("Well, this is bad")

{'neg': 0.461, 'neu': 0.263, 'pos': 0.276, 'compound': -0.34}
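
The two outputs above line up: summing the positive impacts gives `pos`, summing the absolute negative impacts gives `neg`, and the remainder is `neu`. The sketch below checks this on the impact map copied from the output of `get_sentence_polarity_impact_map`. The helper name `aggregate_impacts` is hypothetical, and the correspondence should only be expected for simple sentences like this one, where no negation, degree adverb or punctuation emphasis is involved.

```python
# per-term impacts copied from the output of get_sentence_polarity_impact_map above
impact_map = {'well': 0.27631578947368424, 'this': 0, 'is': 0, 'bad': -0.46052631578947373}

def aggregate_impacts(term_impacts):
    """Collapse per-term impacts into sentence-level pos/neg/neu shares."""
    pos = sum(v for v in term_impacts.values() if v > 0)
    neg = sum(-v for v in term_impacts.values() if v < 0)
    neu = 1 - pos - neg  # share of the normalization sum taken by neutral terms
    return {"neg": round(neg, 3), "neu": round(neu, 3), "pos": round(pos, 3)}

print(aggregate_impacts(impact_map))
# {'neg': 0.461, 'neu': 0.263, 'pos': 0.276}
```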