# MultiLexScaled - sentiment analysis (file-based) (2021-12-10)

_by A. Maurits van der Veen_  

_Modification history:_  
_2021-12-03 - Convert to csv lexica; use newest versions of lexica, as publicly available_  
_2021-12-10 - Clean up & streamline for GitHub repo_  

This notebook applies sentiment analysis to a corpus. The corpus is file-based and does not need to fit into memory all at once. Intermediate stages (cleaned text, individual valences, individual calibrated valences) are stored as separate output files.
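
As a rough illustration of what "file-based" means here, the sketch below reads a CSV corpus one row at a time instead of loading it whole. It is not the code used by the local modules; the helper name `stream_rows` and the in-memory demo data are assumptions made for this example.

```python
import csv
import io

def stream_rows(handle, has_header=True):
    """Yield one CSV row at a time so the whole corpus never sits in memory."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # discard the header row, if there is one
    for row in reader:
        yield row

# Tiny in-memory stand-in for a corpus file (hypothetical contents)
demo = io.StringIO("id,text\n1,good news\n2,bad news\n")
nr_rows = sum(1 for _ in stream_rows(demo))
print(nr_rows)
```

With a real corpus, the same generator could be fed an open file handle (opened with `newline=''`, as the `csv` module recommends).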


### 0. Set-up

Import necessary code modules; specify location of sentiment analysis lexica and associated files; specify corpus location.


In [None]:
STAIRfolder = '/Users/username/STAIR/'

In [None]:
# Code files to import
import sys
sys.path.append(STAIRfolder + 'Code')
import os
import csv
import numpy as np
from datetime import datetime

# local code modules -> these should be in the folder just specified (or otherwise locatable by python)
import tokenization
import valence
import calibrate

# Print summary version info (for fuller info, simply print sys.version)
print('You are using python version {}.'.format(sys.version.split()[0]))

Next, specify where to find the sentiment analysis lexica and the calibration file, along with their names. 


In [None]:
SAfolder = STAIRfolder + 'Corpora/Lexica/English/MultiLexScaled/'

In [None]:
lexica = {'HuLiu':          SAfolder + 'HuLiu/opinion-lexicon-English/HuLiu_lexiconX.csv',
          'LabMT_filtered': SAfolder + 'labMT/labMT_lexicon_filtered.csv',
          'LexicoderSD':    SAfolder + 'Lexicoder/LSDaug2015/LSD_lexiconX.csv',
          'MPQA':           SAfolder + 'MPQA 2.0/opinionfinderv2.0/lexicons/MPQA_lexicon.csv',
          'NRC':            SAfolder + 'NRC/NRC-Emotion-Lexicon-v0.92/NRC_lexicon.csv',
          'SOCAL':          SAfolder + 'SO-CAL/English (from GitHub)/SO-CAL_lexiconX.csv',
          'SWN_filtered':   SAfolder + 'SWN/SWN_lexicon_filtered0.1.csv',
          'WordStat':       SAfolder + 'WordStat/WSD 2.0/WordStat_lexicon2X.csv',
         } 
lexnames = sorted(lexica.keys())

# If not using modifiers, just set modifierlex to None
modifierlex = SAfolder + 'SO-CAL/English (from GitHub)/SO-CAL_modifiersX.csv'


In [None]:
# Load lexica & modifier info
lexica_used = [valence.load_lex(lexfile) for lexname, lexfile in sorted(lexica.items())]
mods = valence.load_lex(modifierlex) if modifierlex else {}  # also works when modifierlex is None
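
Because the lexicon paths above are long and easy to mistype, it can help to confirm that every file exists before loading. The sketch below is one way to do that; `find_missing` is a hypothetical helper (not part of the `valence` module), and the demo uses a temporary folder with made-up file names.

```python
import os
import tempfile

def find_missing(paths):
    """Return the names (sorted) whose file path does not exist on disk."""
    return sorted(name for name, path in paths.items()
                  if not os.path.isfile(path))

# Demo with a temporary folder standing in for SAfolder (hypothetical file names)
with tempfile.TemporaryDirectory() as folder:
    present = os.path.join(folder, 'present.csv')
    open(present, 'w').close()  # create an empty file
    demo = {'A': present, 'B': os.path.join(folder, 'absent.csv')}
    missing = find_missing(demo)
print(missing)
```

In this notebook, the equivalent check would be `find_missing(lexica)`, run before the lexica are loaded.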


In [None]:
# Identify the calibration pathname
calibrationfolder = SAfolder + 'Calibration/'
calibrationfile = calibrationfolder + 'Calibration_US_2021-12-10.csv'


#### 0.1 Corpus location & file names

Identify the folder in which the corpus is to be found (and into which any new files will be saved).


In [None]:
# Specify corpus location
projectfolder = STAIRfolder + 'Corpora/Media/Neutral/Corpus/US/'
corpusfilestem = projectfolder + 'US'

### 1. Preprocess text

Pre-tokenize text to make sure punctuation does not affect sentiment calculation.

In [None]:
# Generate clean file(s) from dataset
# The default output is a file with the suffix _clean that contains 2 columns (id, cleanedtext) and no headers
# See the code in tokenization.py for other options

textcols = (10, 12)  # columns containing text (will be combined)
rawsuffix = '_dedup'
cleansuffix = '_clean'

tokenization.preprocess_texts(corpusfilestem + rawsuffix + '.csv', 
                              corpusfilestem + cleansuffix + '.csv',
                              textcols=textcols, inheader=True, lang='english',
                              stripspecial=False, stripcomma=False)
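
To see why pre-tokenization matters, consider that a token such as `great!` would not match a lexicon entry `great`. The sketch below separates common punctuation from words with a regular expression. It only illustrates the idea and is not the logic in `tokenization.py`; the function name and the set of punctuation marks are assumptions.

```python
import re

def split_punctuation(text):
    """Put spaces around common punctuation so each mark becomes its own token."""
    # Match an ellipsis first so '...' stays a single token
    spaced = re.sub(r'(\.\.\.|[.,!?;:()"])', r' \1 ', text)
    return spaced.split()

tokens = split_punctuation('Good, not great!')
print(tokens)
```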


### 2. Calculate valence

#### 2.1 Specify parameters

We can specify words to ignore (for example, key search terms that might also appear in a valence lexicon), as well as special punctuation to skip (standard punctuation will be skipped automatically). The latter will not be included in the word count; the former will.

In [None]:
ignorewords = set()                  # Valenced words to ignore, if any, but include in wordcount
words2skip = set(('.', ',', '...'))  # Words to skip altogether (usually just punctuation)

# Negation words, to combine with modifiers/intensifiers such as 'very' or 'hardly' in adjusting valence
negaters = ('not', 'no', 'neither', 'nor', 'nothing', 'never', 'none', 
            'nowhere', 'noone', 'nobody',
            'lack', 'lacked', 'lacking', 'lacks', 'missing', 'without')
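
The difference between ignored and skipped words can be sketched as follows: skipped tokens are dropped before counting, while ignored tokens count as words but contribute no valence. This toy function is an assumption-laden illustration, not the implementation in `valence.py`; the lexicon values and the function name are made up, and modifiers and negaters are left out.

```python
def count_and_score(tokens, lexicon, ignore=frozenset(), skip=frozenset()):
    """Return (word count, summed valence) for a list of tokens."""
    nrwords = 0
    total = 0.0
    for token in tokens:
        if token in skip:
            continue              # skipped: not counted, not scored
        nrwords += 1
        if token in ignore:
            continue              # ignored: counted, but not scored
        total += lexicon.get(token, 0.0)
    return nrwords, total

toy_lexicon = {'good': 1.0, 'crisis': -1.0}  # hypothetical valences
tokens = ['the', 'crisis', 'ended', '.', 'good', 'news', '.']
result = count_and_score(tokens, toy_lexicon, ignore={'crisis'}, skip={'.'})
print(result)
```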


#### 2.2 Valence calculation


In [None]:
cleansuffix = '_clean'
cleantextcols = (1,)
valencesuffix = '_vals'  # suffix for file to contain text-level valence data


In [None]:
corpusfile = corpusfilestem + cleansuffix + '.csv' 
valencefile = corpusfilestem + valencesuffix + '.csv'

valence.calc_corpus_valence(corpusfile, valencefile, 
                            lexnames, lexica_used, mods, 
                            textcols=cleantextcols, modify=True, negaters=negaters, 
                            ignore=ignorewords, skip=words2skip, header=False,
                            need2tokenize=False, makelower=True, skippunct=True,
                            nrjobs=4)
        

### 3. Calibrate

Now we calibrate our valences. We can either calibrate against the parameters calculated from another corpus we assume to be neutral, or we can calibrate against ourselves, simply standardizing to have a mean of 0 and a standard deviation of 1.

To calibrate against an existing set of calibration parameters, set `extcalibrate` to `True` and specify `calibrationfile`. The calibration file contains the scaling parameters (mean, std. dev.) for each individual lexicon, as well as the standard deviation of their average, which we divide by as the final calibration step. The code snippet below loads the scaler and displays some information about it.

To calibrate a corpus against itself (as here), set `extcalibrate` to `False`. To reuse the resulting calibration parameters for additional corpora, set `savescaler` to `True`.
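
The arithmetic of self-calibration can be sketched in a few lines: standardize each lexicon's valences, average across lexica for each text, then divide by the standard deviation of that average. This is a minimal sketch and not the code in `calibrate.py`; the function name, the toy numbers, and the use of the population standard deviation are assumptions.

```python
from statistics import fmean, pstdev

def calibrate_against_self(rows):
    """rows: one list per text, holding one valence per lexicon."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stdevs = [pstdev(col) for col in columns]
    # Standardize each lexicon (mean 0, std. dev. 1), then average per text
    averages = [fmean([(v - m) / s for v, m, s in zip(row, means, stdevs)])
                for row in rows]
    stdev_adj = pstdev(averages)  # std. dev. of the average across lexica
    return [a / stdev_adj for a in averages], stdev_adj

# Toy valences for 4 texts and 2 lexica (made-up numbers)
toy = [[0.1, 1.0], [0.3, 2.0], [0.2, 4.0], [0.6, 3.0]]
combined, stdev_adj = calibrate_against_self(toy)
print(round(fmean(combined), 6), round(pstdev(combined), 6))
```

After the final division, the combined score itself has a mean of 0 and a standard deviation of 1, which makes it comparable across corpora calibrated the same way.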


In [None]:
# Load calibration file, as needed. Set extcalibrate to False to calibrate the corpus against itself
extcalibrate = False

if extcalibrate:
    # Load calibration data
    neutralscaler, featurenames, nrfeatures, nravailable, stdev_adj, descriptor = \
            calibrate.load_scaler_fromcsv(calibrationfile, includevar=True, displayinfo=True)
else:
    neutralscaler = ''  # dummy value
    stdev_adj = 1       # if not using pre-set calibration, also don't do any scale adjustment
    scalersuffix = '_newscaler'
    print('No scaler loaded -> will calibrate corpus against itself.')

Now perform the calibration.

In [None]:
calibratedsuffix = '_cal' # suffix for file containing calibrated valence data

wordcountcol = 'nrwords'
keepcols = []  # word count info is automatically retained, because used as a filter/scaler

valencefile = corpusfilestem + valencesuffix + '.csv'

scaler, new_stdev_adj, nrtexts = \
        calibrate.calibrate_features(valencefile, lexnames, 
                                     neutralscaler, stdev_adj=stdev_adj,
                                     filtercol=wordcountcol, keepcols=keepcols,
                                     missing=-999, outsuffix=calibratedsuffix)


In [None]:
# Optionally, save this scaler
# (give temporary name & descriptor; can always change later)

savescaler = False  # set to True to save the newly generated scaler
scalername = 'newcorpus'
newcalibrationfile = calibrationfolder + 'Calibration_new_temporary.csv'

if savescaler:
    descriptor = 'New scaler for {} based on {} texts. Generated: {}'.format(
                     scalername, nrtexts, datetime.now())
    # 'scaler' is the scaler returned by calibrate.calibrate_features above
    calibrate.write_scaler_tocsv(newcalibrationfile, scaler, featurenames=lexnames,
                                 name=scalername, descriptor=descriptor, stdev_adj=new_stdev_adj)

    # Describe the newly saved scaler (by reloading it)
    print('\nScaler pathname: {}\n'.format(newcalibrationfile))
    neutralscaler, featurenames, nrfeatures, nravailable, stdev_adj, descriptor = \
            calibrate.load_scaler_fromcsv(newcalibrationfile, includevar=True, displayinfo=True)


### Done!