# MultiLexScaled - sentiment analysis (pandas) (2021-12-10)

_by A. Maurits van der Veen_  

_Modification history:_  
_2021-12-03 - Convert to csv lexica; use newest versions of lexica, as publicly available_  
_2021-12-10 - Clean up & streamline for GitHub repo_  

This notebook applies sentiment analysis to a corpus. The corpus is loaded into a pandas dataframe and all the calculations are done on the dataframe. Depending on memory and processing power, for corpora larger than 50,000 texts or so it may be preferable to use the file-based version of sentiment analysis, which saves intermediate stages (cleaned text, individual valences, individual calibrated valences) as separate output files.


### 0. Set-up

Import necessary code modules; specify location of sentiment analysis lexica and associated files; specify corpus location.


In [None]:
STAIRfolder = '/Users/username/STAIR/'

In [None]:
# Code files to import
import sys
sys.path.append(STAIRfolder + 'Code')
import os
import csv
import numpy as np
from datetime import datetime
import pandas as pd

# local code modules -> these should be in the folder just specified (or otherwise locatable by python)
import tokenization
import valence
import calibrate

# Print summary version info (for fuller info, simply print sys.version)
print('You are using python version {}.'.format(sys.version.split()[0]))

Next, specify where to find the sentiment analysis lexica and the calibration file, along with their names. 


In [None]:
SAfolder = STAIRfolder + 'Corpora/Lexica/English/MultiLexScaled/'

In [None]:
lexica = {'HuLiu':          SAfolder + 'HuLiu/opinion-lexicon-English/HuLiu_lexiconX.csv',
          'LabMT_filtered': SAfolder + 'labMT/labMT_lexicon_filtered.csv',
          'LexicoderSD':    SAfolder + 'Lexicoder/LSDaug2015/LSD_lexiconX.csv',
          'MPQA':           SAfolder + 'MPQA 2.0/opinionfinderv2.0/lexicons/MPQA_lexicon.csv',
          'NRC':            SAfolder + 'NRC/NRC-Emotion-Lexicon-v0.92/NRC_lexicon.csv',
          'SOCAL':          SAfolder + 'SO-CAL/English (from GitHub)/SO-CAL_lexiconX.csv',
          'SWN_filtered':   SAfolder + 'SWN/SWN_lexicon_filtered0.1.csv',
          'WordStat':       SAfolder + 'WordStat/WSD 2.0/WordStat_lexicon2X.csv',
         } 
lexnames = sorted(lexica.keys())

# If not using modifiers, just set modifierlex to None
modifierlex = SAfolder + 'SO-CAL/English (from GitHub)/SO-CAL_modifiersX.csv'


In [None]:
# Load lexica & modifier info
lexica_used = [valence.load_lex(lexfile) for lexname, lexfile in sorted(lexica.items())]
mods = valence.load_lex(modifierlex) if modifierlex else {}  # also handles modifierlex = None


In [None]:
# Identify the calibration pathname
calibrationfolder = SAfolder + 'Calibration/'
calibrationfile = calibrationfolder + 'Calibration_US_2021-12-10.csv'


#### 0.1 Corpus location & file names

Identify the corpus file to analyze. The cell below reads a single CSV file into a dataframe and renames its `DocNr` column to `id`, which serves as the identifier column in the rest of the notebook. Results are saved to the same folder at the end.

In [None]:
# Specify input corpus
corpusfile = STAIRfolder + 'Corpora/Media/Neutral/Corpus/US/US_dedup.csv'

# Load into dataframe
df = pd.read_csv(corpusfile, index_col=False)
df.rename(columns={'DocNr': 'id',}, inplace=True)

### 1. Preprocess text

Pre-tokenize text to make sure punctuation does not affect sentiment calculation.
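
As a rough illustration of why this step matters, the sketch below pads punctuation with spaces so that a word such as `good` is no longer fused to a following `?` or `,` and can match a lexicon entry. The function `separate_punctuation` is hypothetical and much simpler than `tokenization.punctuationPreprocess`; it only shows the general idea.

```python
import re

def separate_punctuation(text):
    """Put spaces around punctuation marks so each becomes its own token."""
    # Pad ellipses (runs of periods) and other common punctuation marks with spaces
    spaced = re.sub(r'(\.{2,}|[.,;:!?()"])', r' \1 ', text)
    # Collapse the resulting runs of whitespace into single spaces
    return ' '.join(spaced.split())

example = 'Great news, finally... but is it "good"?'
tokens = separate_punctuation(example).split()
print(tokens)
```

After this step, `news` and `good` appear as separate tokens, while the punctuation marks can be skipped during the valence calculation.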

In [None]:
# Generate cleaned text from the dataset
# The text columns are joined (separated by ' . ') and the result is stored in a new 'cleantext' column
# See the code in tokenization.py for other options

textcols = ('Title', 'Text')  # columns containing text (will be combined)

df['cleantext'] = df.apply(lambda row: tokenization.punctuationPreprocess(' . '.join([str(row[col]) for col in textcols])), axis=1)


### 2. Calculate valence

#### 2.1. Specify parameters

We can specify words to ignore (for example, key search terms that might also appear in a valence lexicon), as well as special punctuation to skip (standard punctuation will be skipped automatically). The latter will not be included in the word count; the former will.
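
The difference between ignored and skipped words can be sketched with a toy scorer. This is a minimal illustration, not the logic in `valence.py`: the function `score_tokens` and the toy lexicon are made up for the example.

```python
def score_tokens(tokens, lexicon, ignore=frozenset(), skip=frozenset()):
    """Return (summed valence, word count) for a list of tokens."""
    total, nrwords = 0.0, 0
    for token in tokens:
        if token in skip:
            continue              # skipped tokens do not count as words
        nrwords += 1              # ignored words still count as words...
        if token in ignore:
            continue              # ...but contribute no valence
        total += lexicon.get(token, 0.0)
    return total, nrwords

toy_lexicon = {'great': 1.0, 'crisis': -1.0}   # made-up valences
tokens = 'the crisis response was great .'.split()

plain = score_tokens(tokens, toy_lexicon, skip={'.'})
ignoring = score_tokens(tokens, toy_lexicon, ignore={'crisis'}, skip={'.'})
print(plain, ignoring)
```

Ignoring `crisis` changes the summed valence but leaves the word count at 5, whereas the skipped `.` is never counted.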

In [None]:
ignorewords = set()                  # Valenced words to ignore, if any, but include in wordcount
words2skip = set(('.', ',', '...'))  # Words to skip altogether (usually just punctuation)

# Negation words, to combine with modifiers/intensifiers such as 'very' or 'hardly' in adjusting valence
negaters = ('not', 'no', 'neither', 'nor', 'nothing', 'never', 'none', 
            'nowhere', 'noone', 'nobody',
            'lack', 'lacked', 'lacking', 'lacks', 'missing', 'without')


In [None]:
# Generate a list of all keys across our lexica, but remove ignorewords
allterms = valence.allkeys(lexica_used)
ignoreX = set(ignorewords) - allterms  # Words to skip separately because not in any lexica
allterms -= set(ignorewords)  # Update allterms to ignore words in our ignore set

# Generate flags indicating whether a lexicon has wildcards
wildlexicon = [valence.haswilds(lex) for lex in lexica_used]
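
The wildcard flags matter because a lexicon without wildcards can be searched with plain dictionary lookups, while a lexicon with wildcards needs slower pattern matching. The sketch below illustrates this under the assumption that wildcard entries use `*` (for example `abandon*`). The helpers `has_wildcards` and `lookup` are hypothetical and do not reproduce `valence.haswilds` or the matching code in `valence.py`.

```python
from fnmatch import fnmatchcase

def has_wildcards(lexicon):
    """True if any lexicon entry contains a '*' wildcard."""
    return any('*' in term for term in lexicon)

def lookup(word, lexicon, wild):
    """Return the valence of word, trying wildcard patterns only if needed."""
    if word in lexicon:
        return lexicon[word]          # exact match: cheap dictionary lookup
    if wild:
        for term, value in lexicon.items():
            if '*' in term and fnmatchcase(word, term):
                return value          # first matching pattern wins
    return 0.0                        # word not in this lexicon

toy_lexica = [{'abandon*': -1.0, 'good': 1.0},   # made-up entries
              {'good': 0.8, 'bad': -0.7}]
wild_flags = [has_wildcards(lex) for lex in toy_lexica]
scores = [lookup('abandoned', lex, wild) for lex, wild in zip(toy_lexica, wild_flags)]
print(wild_flags, scores)
```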


#### 2.2 Valence calculation


In [None]:
# Specify columns of interest
idcol = 'id'
textcols = ('cleantext',)

valence_df = valence.calc_valences(df, idcol, textcols, lexnames, lexica_used, 
                                   wildlexicon, allterms, mods,
                                   modify=True, negaters=negaters,
                                   ignore=ignoreX, skip=words2skip,
                                   need2tokenize=False,  # Text already cleaned does not need to be tokenized
                                   makelower=True, skippunct=True)


### 3. Calibrate

Now we calibrate our valences. We can either calibrate against the parameters calculated from another corpus we assume to be neutral, or we can calibrate against ourselves, simply standardizing to have a mean of 0 and a standard deviation of 1.

To calibrate against an existing set of calibration parameters, set `extcalibrate` to `True` and specify `calibrationfile`. The calibration file contains the scaling parameters (mean, std. dev.) for each individual lexicon, as well as the standard deviation of their average, which we divide by as the final calibration step. The code snippet below loads the scaler and displays some information about it.

To calibrate a corpus against itself, set `extcalibrate` to `False`. To reuse the resulting calibration parameters for additional corpora, set `savescaler` to `True`.
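
The arithmetic behind these steps can be sketched with the standard library alone. This is a minimal illustration of the description above, not the code in `calibrate.py`: the helper names, the toy valences, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev

def fit_scaler(columns):
    """Return {lexicon: (mean, std. dev.)} computed from reference valences."""
    return {name: (mean(vals), pstdev(vals)) for name, vals in columns.items()}

def average_standardized(columns, scaler):
    """Standardize each lexicon's valences, then average across lexica per text."""
    names = sorted(columns)
    zcols = [[(v - scaler[n][0]) / scaler[n][1] for v in columns[n]] for n in names]
    return [mean(row) for row in zip(*zcols)]

# Made-up valences for four texts from two lexica on different scales
reference = {'lexA': [0.0, 1.0, 2.0, 3.0], 'lexB': [10.0, 10.0, 30.0, 30.0]}

scaler = fit_scaler(reference)
averaged = average_standardized(reference, scaler)
stdev_adj = pstdev(averaged)                    # std. dev. of the averaged valence
calibrated = [a / stdev_adj for a in averaged]  # final calibration step

print(round(stdev_adj, 3), [round(c, 3) for c in calibrated])
```

Because the lexica do not agree perfectly, the average of their standardized scores has a standard deviation below 1, which is why the final division is needed. To calibrate a different corpus against the reference, the same `scaler` and `stdev_adj` would be applied to that corpus's valences instead of being recomputed.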


#### 3.1. Load & apply scaler


In [None]:
# Load calibration file, as needed. Set to False to calibrate based on each corpus itself
extcalibrate = True

if extcalibrate:
    # Load calibration data
    neutralscaler, featurenames, nrfeatures, nravailable, stdev_adj, descriptor = \
            calibrate.load_scaler_fromcsv(calibrationfile, includevar=True, displayinfo=True)
else:
    neutralscaler = ''  # dummy value
    stdev_adj = 1       # if not using pre-set calibration, also don't do any scale adjustment
    scalersuffix = '_newscaler'
    print('No scaler loaded -> will calibrate corpus against itself.')

Now perform the calibration.

In [None]:
calibrated_df, scaler, stdv_adj = \
    calibrate.calibrate_valencedata(valence_df, lexnames, neutralscaler, stdev_adj,
                                    filtercol='nrwords', missing=-999)


In [None]:
# Optionally, save this scaler
# (give temporary name & descriptor; can always change later)

savescaler = False  # set to True to save the newly generated scaler
scalername = 'newcorpus'
newcalibrationfile = calibrationfolder + 'Calibration_new_temporary.csv'

if savescaler:
    nrtexts = len(calibrated_df)  # number of texts in the calibrated data
    descriptor = 'New scaler for {} based on {} texts. Generated: {}'.format(
                     scalername, nrtexts, datetime.now())
    calibrate.write_scaler_tocsv(newcalibrationfile, scaler, featurenames=lexnames,
                                 name=scalername, descriptor=descriptor, stdev_adj=stdv_adj)


#### 3.2. Drop individual lexicon valence data, and merge into original df

In [None]:
df = df.merge(calibrated_df[['id', 'avg_valence']], on = 'id', how = 'left')
len(df)  # Double-check length still the same
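
The length check above guards against a left merge adding rows, which happens when an `id` appears more than once in the right-hand dataframe. As a sketch on made-up data, pandas can enforce this directly through the `validate` argument of `merge`; the dataframes `texts` and `scores` are purely illustrative.

```python
import pandas as pd

texts = pd.DataFrame({'id': [1, 2, 3], 'Title': ['a', 'b', 'c']})   # made-up corpus
scores = pd.DataFrame({'id': [1, 2], 'avg_valence': [0.4, -1.2]})    # id 3 has no score

# validate='one_to_one' raises pandas.errors.MergeError if 'id' is duplicated on either side
merged = texts.merge(scores, on='id', how='left', validate='one_to_one')

assert len(merged) == len(texts)   # a left merge on unique keys keeps every row
missing = int(merged['avg_valence'].isna().sum())
print(len(merged), missing)
```

Texts without a matching score keep their row and receive a missing value in `avg_valence`, so counting missing values is a useful second check.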

In [None]:
# Store results, if desired
outputfile = STAIRfolder + 'Corpora/Media/Neutral/Corpus/US/US_with_clean&valence.csv'
df.to_csv(outputfile, index=False)  # index=False avoids writing the row index as an extra unnamed column


### Done!