# MultiLexScaled - sentiment analysis (mark up individual texts) (2021-12-10)

_by A. Maurits van der Veen_  

_Modification history:_  
_2021-12-03 - Convert to csv lexica; use newest versions of lexica, as publicly available_  
_2021-12-10 - Clean up & streamline for GitHub repo_  

This notebook applies sentiment analysis to a small set of texts. It displays these texts in annotated form, showing (either in parentheses or in colour) which words are used in a text's sentiment calculation.

This notebook is useful for gaining insight into how MultiLexScaled works and how individual texts are scored. For larger-scale sentiment analysis, please use the `pandas` or `file-based` versions of this notebook (part of the same repo).


### 0. Set-up

Specify the main folders containing corpora, notebooks, and code files, as well as the code modules required.


In [None]:
STAIRfolder = '/Users/username/STAIR/'


In [None]:
# Code files to import
import sys
sys.path.append(STAIRfolder + 'Code')
import os
import csv
import numpy as np
import sty        # needed for coloured output.
                  # (if not present, install from command line with 'pip install sty', then restart kernel)

# local code modules -> these should be in the folder just specified (or otherwise locatable by python)
import tokenization
import valence
import calibrate
    
# Print summary version info (for fuller info, simply print sys.version)
print('You are using python version {}.'.format(sys.version.split()[0]))

Next, specify where to find the sentiment analysis lexica and the calibration file, along with their names. 


In [None]:
SAfolder = STAIRfolder + 'Corpora/Lexica/English/MultiLexScaled/'

In [None]:
lexica = {'HuLiu':          SAfolder + 'HuLiu/opinion-lexicon-English/HuLiu_lexiconX.csv',
          'LabMT_filtered': SAfolder + 'labMT/labMT_lexicon_filtered.csv',
          'LexicoderSD':    SAfolder + 'Lexicoder/LSDaug2015/LSD_lexiconX.csv',
          'MPQA':           SAfolder + 'MPQA 2.0/opinionfinderv2.0/lexicons/MPQA_lexicon.csv',
          'NRC':            SAfolder + 'NRC/NRC-Emotion-Lexicon-v0.92/NRC_lexicon.csv',
          'SOCAL':          SAfolder + 'SO-CAL/English (from GitHub)/SO-CAL_lexiconX.csv',
          'SWN_filtered':   SAfolder + 'SWN/SWN_lexicon_filtered0.1.csv',
          'WordStat':       SAfolder + 'WordStat/WSD 2.0/WordStat_lexicon2X.csv',
         } 
lexnames = sorted(lexica.keys())

# If not using modifiers, just set modifierlex to None
modifierlex = SAfolder + 'SO-CAL/English (from GitHub)/SO-CAL_modifiersX.csv'
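
Before loading, it can help to confirm that every path above points to an existing file, since a mistyped folder name otherwise only surfaces as an error during loading. The following is an optional sketch; the helper name `find_missing_files` is hypothetical and is not part of the `valence` module.

```python
import os

def find_missing_files(paths):
    """Return the sorted labels whose path is not an existing file.

    `paths` maps a label (e.g. a lexicon name) to a file path.
    Hypothetical helper for illustration; not part of the local code modules.
    """
    return sorted(name for name, path in paths.items() if not os.path.isfile(path))

# Example use with the dictionary defined above (uncomment inside the notebook):
# missing = find_missing_files(lexica)
# if missing:
#     print('Lexicon files not found for:', ', '.join(missing))
```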


In [None]:
# Load lexica & modifier info
lexica_used = [valence.load_lex(lexfile) for lexname, lexfile in sorted(lexica.items())]
# A truthiness test handles both None and an empty string (len(None) would raise a TypeError)
mods = valence.load_lex(modifierlex) if modifierlex else {}


In [None]:
# Identify the calibration pathname
calibrationfolder = SAfolder + 'Calibration/'
calibrationfile = calibrationfolder + 'Calibration_US_2021-12-10.csv'


In [None]:
wild = '*'  # Wildcard character for lexica with wildcard entries (LexicoderSD, WordStat)

### 1. Specify texts to analyze

For a small set of texts, it is easiest to type or paste them directly into the list below. For a larger set, read the texts in from a file instead.

In [None]:
# Specify text or texts to analyze.

texts = ["He was not hardly happy with the unforeseen unfortunate but not awful outcome",

         "Things can go from worst to worse to bad to very mediocre to not bad to good to better to best",
         
         "While British articles are more negative than American articles when measured against an identical yardstick , the two countries do indeed parallel one another somewhat more closely when each is measured against its own national media landscape",
        ] 
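
For reading texts from a file, a minimal sketch might look like the following. It assumes a plain-text file with one text per line; the helper `read_texts` and the file name `texts.txt` are hypothetical, and the throwaway file is created only so that the example runs on its own.

```python
import os
import tempfile

def read_texts(path, encoding='utf-8'):
    """Read one text per line from `path`, skipping blank lines."""
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]

# Demonstration with a throwaway file; in practice, pass your own path.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, 'texts.txt')   # hypothetical file name
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('The outcome was not bad\n\nThings went from bad to worse\n')
    demo_texts = read_texts(demo_path)

print(demo_texts)
```

Inside the notebook, assigning the result to `texts` instead of `demo_texts` would feed the file's contents into the remaining cells.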


In [None]:
# Clean & pre-tokenize texts (separate out punctuation &c.)

texts = [tokenization.punctuationPreprocess(text) for text in texts]
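
To give a rough idea of what "separating out punctuation" means, here is an illustrative stand-in. It is not the implementation of `tokenization.punctuationPreprocess`, which may treat contractions, hyphens, ellipses, and numbers differently; the function name `separate_punctuation` is an assumption made for this sketch.

```python
import re

def separate_punctuation(text):
    """Put spaces around common punctuation marks, then collapse whitespace.

    Illustrative stand-in only; the real pre-tokenizer may behave differently.
    """
    spaced = re.sub(r'([.,;:!?()"])', r' \1 ', text)
    return ' '.join(spaced.split())

print(separate_punctuation('Not bad, all things considered!'))
# Not bad , all things considered !
```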


### 2. Calculate valence

#### 2.1. Specify parameters

We can specify words to ignore (for example, key search terms that might also appear in a valence lexicon), as well as special punctuation to skip (standard punctuation will be skipped automatically). The latter will not be included in the word count; the former will.


In [None]:
ignorewords = set()                  # Valenced words to ignore, if any, but include in wordcount
words2skip = set(('.', ',', '...'))  # Words to skip altogether (usually just punctuation)

# Negation words, to combine with modifiers/intensifiers such as 'very' or 'hardly' in adjusting valence
negaters = ('not', 'no', 'neither', 'nor', 'nothing', 'never', 'none', 
            'nowhere', 'noone', 'nobody',
            'lack', 'lacked', 'lacking', 'lacks', 'missing', 'without')
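
The difference between the two sets can be seen in a toy example. This is only a sketch of the behaviour described above (ignored words still count towards text length, skipped tokens do not); the helper names and lexicon entries are made up, and `valence.getValence` may implement the details differently.

```python
def count_words(tokens, words2skip):
    """Count tokens that contribute to text length (everything not skipped)."""
    return sum(1 for tok in tokens if tok not in words2skip)

def valenced_tokens(tokens, lexicon, ignorewords, words2skip):
    """Return the tokens that would contribute valence under a toy lexicon."""
    return [tok for tok in tokens
            if tok not in words2skip and tok not in ignorewords and tok in lexicon]

toy_lexicon = {'happy': 1, 'crisis': -1}   # made-up entries
toy_tokens = 'the crisis left nobody happy .'.split()
toy_ignore = {'crisis'}                     # e.g. a search term used to select the texts
toy_skip = {'.', ',', '...'}

print(count_words(toy_tokens, toy_skip))                                # 5
print(valenced_tokens(toy_tokens, toy_lexicon, toy_ignore, toy_skip))   # ['happy']
```

Here `crisis` is counted as a word but contributes no valence, while the final full stop is neither counted nor scored.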


#### 2.2. Valence calculation and mark-up


In [None]:
# Generate a list of all keys across our lexica, but remove ignorewords
allterms = valence.allkeys(lexica_used)
ignoreX = set(ignorewords) - allterms  # Words to skip separately because not in any lexica
allterms -= set(ignorewords)  # Update allterms to ignore words in our ignore set

# Generate flags indicating whether a lexicon has wildcards
wildlexicon = [valence.haswilds(lex) for lex in lexica_used]
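
As an illustration of why wildcard lexica need separate handling, the sketch below matches a word against entries such as `abandon*`. It assumes the wildcard only ever appears as the final character of an entry; the functions `has_wildcards` and `lookup` are hypothetical and are not the `valence` module's own `haswilds` or matching code.

```python
def has_wildcards(lexicon, wild='*'):
    """True if any lexicon key ends with the wildcard character."""
    return any(key.endswith(wild) for key in lexicon)

def lookup(word, lexicon, wild='*'):
    """Return the valence for `word`: exact match first, then the longest
    matching wildcard prefix; None if nothing matches."""
    if word in lexicon:
        return lexicon[word]
    # Try progressively shorter prefixes, e.g. 'abandon*' for 'abandoned'
    for end in range(len(word), 0, -1):
        key = word[:end] + wild
        if key in lexicon:
            return lexicon[key]
    return None

toy_wild_lexicon = {'abandon*': -1, 'good': 1}   # made-up entries

print(has_wildcards(toy_wild_lexicon))        # True
print(lookup('abandoned', toy_wild_lexicon))  # -1
print(lookup('table', toy_wild_lexicon))      # None
```

A lexicon without wildcard entries needs only the exact dictionary lookup, which is presumably why a per-lexicon flag is worth computing once up front.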


In [None]:
# Calculate valence data and mark up each text indicating modifier/negation words and valence words
results = [valence.getValence(text.lower(), lexica_used, wildlexicon, wild, 
                              allterms, ignoreX, words2skip,
                              modifiers=mods, negaters=negaters,
                              scaling='words', flagwords=True) \
           for text in texts]

# Separate results
valencedata, markeduptexts = zip(*results)
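
The `zip(*results)` idiom turns a list of pairs into a pair of tuples. A small stand-alone illustration, using made-up placeholder values rather than real output from `valence.getValence`:

```python
# Each toy result is a (valence data, marked-up text) pair; all values are placeholders.
toy_results = [([12, 0.5], 'marked-up text A'),
               ([3, 1.0], 'marked-up text B')]

toy_data, toy_marked = zip(*toy_results)

print(toy_data)    # ([12, 0.5], [3, 1.0])
print(toy_marked)  # ('marked-up text A', 'marked-up text B')
```

Note that this unpacking raises a `ValueError` if the list of results is empty, so at least one text is needed.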


### 3. Calibration

Our calibration is based on newspaper articles. This has two important implications, which may appear counter-intuitive, for the valence of short test sentences:

1. The mean valence of a newspaper article in our representative corpus, as measured by every single one of our lexica, is greater than zero. (This is shown in the list of 'means' displayed when the calibration data are loaded in the next code snippet.) As a result, a sentence that has no valenced words in it at all is (comparatively) negative, rather than neutral, as one might expect.  

2. The average newspaper article in our representative corpus is 743 words long. We scale the valence sums by text length, since a single positive word in a 743-word text is far less noticeable than that same word by itself. As a result, a 'text' consisting of that single positive word gets a strikingly high valence score. To put it differently, the single word 'happy' gets a score of over 63, but so do 'happy happy' and 'happy happy happy ...' ad infinitum. Our valence scores are therefore best interpreted as though the text were repeated back-to-back up to the average length of a newspaper article. Because it is hard to imagine what a text would look (and 'feel') like when repeated in this way, we also report a second value: the calibrated valence multiplied by the ratio of the test text's length to the average text length for which the scaler was generated.
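
The first point can be illustrated with a toy calculation. The sketch assumes that calibration centres a lexicon's raw score on the corpus mean and divides by a standard deviation, as a standard scaler would; the numbers are invented, and the `calibrate` module may apply further adjustments.

```python
def standardize(raw, mean, stdev):
    """Z-score a raw valence against reference-corpus statistics."""
    return (raw - mean) / stdev

toy_mean, toy_stdev = 0.02, 0.01   # hypothetical corpus statistics for one lexicon

no_valence = standardize(0.0, toy_mean, toy_stdev)     # text without valenced words
average = standardize(toy_mean, toy_mean, toy_stdev)   # text at the corpus mean

print(no_valence < 0)   # True: "neutral" wording lands below the corpus average
print(average)          # 0.0
```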

In [None]:
# Load calibration data & display some info about it
neutralscaler, featurenames, nrfeatures, nravailable, stdev_adj, descriptor = \
        calibrate.load_scaler_fromcsv(calibrationfile, includevar=True, displayinfo=True)


In [None]:
# Apply calibration
calvalences = calibrate.calibrate_valences(valencedata, neutralscaler, stdev_adj, 
                                           firstvalencecol=1, showcomponents=False)


#### 3.1. Display results

Here we display each text three ways. First comes a parenthesis-based markup, in which a value in parentheses immediately follows each intensifier (an asterisk) and each valenced word (its lexicon presence). Next comes a coloured markup, and finally the valence value. Note that some lexica contain the same word with opposite valences. For example, the word `meaning` appears in one lexicon with a negative valence and in another with a positive valence. Unlike the parenthesis-based markup, the coloured markup is based on the net presence in lexica, so `meaning` will appear as though it is not valenced at all.
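
To make the idea of net presence concrete, here is a small sketch. It assumes net presence is the number of lexica scoring a word positively minus the number scoring it negatively; the function name, the toy lexica, and their values are all made up, and `valence.formatvalencemarking` may compute this differently.

```python
def net_presence(word, lexica):
    """Number of lexica scoring `word` positive minus the number scoring it negative."""
    positive = sum(1 for lex in lexica if lex.get(word, 0) > 0)
    negative = sum(1 for lex in lexica if lex.get(word, 0) < 0)
    return positive - negative

toy_lexica = [{'meaning': 0.4, 'good': 1.0},
              {'meaning': -0.3, 'good': 0.8},
              {'good': 0.6}]

print(net_presence('meaning', toy_lexica))  # 0
print(net_presence('good', toy_lexica))     # 3
```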

The variable `scaleparam` holds the mean text length (in words) of the calibration corpus. Each calibrated valence is multiplied by the ratio of the text's length to this parameter (see the second point at the start of the calibration section above).

In [None]:
scaleparam = 743  # mean nr. words per text in scaler corpus (743 for US rep. corpus)

rescaledvalences = [calval * (len(text.split())/scaleparam) \
                    for calval, text in zip(calvalences, texts)]
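
A quick worked example of this rescaling, using an illustrative calibrated value of 63.0 (the calibration section mentions a score of "over 63" for the single word 'happy'; the exact figure depends on the calibration file). The helper `length_rescale` is a hypothetical name for the same arithmetic as in the cell above.

```python
def length_rescale(calibrated, n_words, mean_length=743):
    """Scale a calibrated valence by text length relative to the corpus mean length."""
    return calibrated * n_words / mean_length

toy_calibrated = 63.0   # illustrative value, not an actual model output
for n_words in (1, 2, 743):
    print(n_words, round(length_rescale(toy_calibrated, n_words), 3))
# 1 0.085
# 2 0.17
# 743 63.0
```

A one-word text thus contributes very little to an average-length article, while a text that is itself of average length keeps its calibrated valence unchanged.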

In [None]:
# Display each sentence twice, first with markup in parentheses and then in color 
# Follow this by the valence value
for markeduptext, calibratedvalence, rescaledvalence in zip(markeduptexts, calvalences, rescaledvalences):
    print(markeduptext, '\n')
    print(valence.formatvalencemarking(markeduptext, format='color'), '\n')
    print('=> Calibrated valence: {:5.2f} if taken as full text;\n {:27.2f} contribution to overall valence of an average-length text\n\n'.format(
          calibratedvalence, rescaledvalence))

### Done!