# eMFDscore Tutorial

© Media Neuroscience Lab  
October 2020

***

This notebook provides a tutorial on how to use eMFDscore to extract various moral information metrics from textual input.  
Specifically, this tutorial shows the reader how to use the eMFDscore tool effectively, either on the command line (macOS and Linux) or in Python (Windows, macOS, and Linux).  
In addition, this tutorial also demonstrates which scoring options are appropriate for particular tasks.  
For more detailed background information on the eMFD, please consult the respective [publication](https://link.springer.com/article/10.3758/s13428-020-01433-0).

Finally, when using eMFDscore, please consider "starring" the GitHub repository and citing the following article: 

Hopp, F. R., Fisher, J. T., Cornell, D., Huskey, R., & Weber, R. (2020). The extended Moral Foundations Dictionary (eMFD):  
Development and applications of a crowd-sourced approach to extracting moral intuitions from text.   
_Behavior Research Methods_, https://doi.org/10.3758/s13428-020-01433-0

***

To run this tutorial interactively, you should clone the eMFDscore GitHub repository and follow the installation instructions below.

## 1. Set-up Your Environment

eMFDscore requires a Python installation (v3.7+). If your machine does not have Python installed,  
we recommend installing Python by downloading and installing either Anaconda or Miniconda for your OS.

As a best practice, we recommend installing eMFDscore into a dedicated conda environment.  
Hence, you should first create a virtual environment by executing the following command in your terminal:

`$ conda create -n emfd python=3.7` 

Once the environment has been created, activate it via (on conda 4.4 and later, `conda activate emfd` is the equivalent command):

`$ source activate emfd`

Next, you must install spaCy, which is the main natural language processing backend that eMFDscore is built on:

`$ conda install -c conda-forge spacy`  
`$ python -m spacy download en_core_web_sm`

Finally, you can install eMFDscore by copying, pasting, and executing the following command:

`$ pip install https://github.com/medianeuroscience/emfdscore/archive/master.zip`

In addition, if you plan to run eMFDscore in an interactive Python environment (IPython) or in Jupyter notebooks, we encourage you to install JupyterLab into the `emfd` environment:  
`$ conda install -c conda-forge jupyterlab`
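
Before moving on, it can help to confirm that the environment contains what the rest of this tutorial expects. The sketch below is not part of eMFDscore: `check_environment` is a hypothetical helper that only asks Python whether each package can be located, and the import names (`spacy`, `en_core_web_sm`, `emfdscore`, plus the inspection packages used later) are assumed from the commands above and the imports further down.

```python
from importlib.util import find_spec

# Import names assumed from the install commands above and the imports used later on.
REQUIRED = ["spacy", "en_core_web_sm", "emfdscore"]
OPTIONAL = ["pandas", "numpy", "seaborn", "matplotlib"]


def check_environment(names):
    """Return {import_name: True/False} without importing the packages themselves."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_environment(REQUIRED + OPTIONAL).items():
        print(f"{name:15s} {'found' if found else 'MISSING'}")
```

If a required package is reported as missing, the most likely cause is that the commands were run outside the activated `emfd` environment.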



## 2. Using eMFDscore

eMFDscore is a versatile tool that can be run either from the command line or directly from Python.  
Note that if you are on a **Windows** machine, you must run eMFDscore from a Python environment. 

In this tutorial, we will load a few packages to inspect the metrics computed by eMFDscore.  
These packages must be available in your conda environment to follow along, but they are not required for  
eMFDscore to run properly. 

In [1]:
import pandas as pd 
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

### Options for Document Scoring

With eMFDscore, you have several options to extract moral information metrics from textual corpora.   
Below, we go over these options one by one.  

When scoring documents with the extended Moral Foundations Dictionary (eMFD; default in eMFDscore),  
you must decide how you would like to use the eMFD for scoring textual documents.  

As a reminder, in the eMFD, each of the 3020 words is assigned the following scores: 
- `Foundation Probabilities`: Each word is assigned 5 probabilities that denote the likelihood  
that this word is associated with each of the five moral foundations identified by Moral  
Foundations Theory. For example, the word "kill" has a care probability of 0.4 and  
a loyalty probability of 0.24, meaning that there is a 40% chance that a coder highlighted a context in which the word "kill" appeared  
with the care-harm foundation and a 24% chance that this context was highlighted with the loyalty-betrayal foundation.

- `Sentiment Scores`: Each word is assigned 5 sentiment scores that denote the average sentiment  
of the foundation context in which this word appeared. For example, the word "kill" has an average  
"care_sent" of -0.69, meaning that the "care-harm" highlights in which "kill" appeared had a negative  
average sentiment of -0.69.

***

Based on these scores, there are two options for how the foundation probabilities can be "mapped" when scoring a  
document (flag `prob_map` below):

1. Use `all` probabilities per word in the eMFD (option `all`):  
=> Using all five foundation probabilities assumes that each word is an indicator for multiple foundations, with the probabilities as weights. 
2. Assign a `single` probability to each word in the eMFD according to the foundation with the highest probability (option `single`):  
=> Each word only indicates **one** foundation (the one with the highest foundation probability), and each time this word is found,  
the score for the respective foundation is increased by that word's foundation probability.
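
The difference between the two mappings can be sketched in a few lines. This is an illustration of the idea, not eMFDscore's actual implementation: the function names are hypothetical, the toy lexicon is invented apart from the care (0.40) and loyalty (0.24) values for "kill" quoted above, and averaging over the matched words is an assumption.

```python
FOUNDATIONS = ["care", "fairness", "loyalty", "authority", "sanctity"]

# Toy lexicon. Only the care (0.40) and loyalty (0.24) values for "kill" come from
# the text above; every other number is made up for illustration.
TOY_EMFD = {
    "kill": {"care": 0.40, "fairness": 0.10, "loyalty": 0.24, "authority": 0.15, "sanctity": 0.05},
    "honor": {"care": 0.10, "fairness": 0.15, "loyalty": 0.30, "authority": 0.35, "sanctity": 0.20},
}


def map_probabilities(word_probs, prob_map):
    """Return the foundation weights a word contributes under `all` or `single`."""
    if prob_map == "all":
        return dict(word_probs)
    if prob_map == "single":
        top = max(word_probs, key=word_probs.get)  # foundation with the highest probability
        return {f: (p if f == top else 0.0) for f, p in word_probs.items()}
    raise ValueError(f"unknown prob_map: {prob_map!r}")


def score_tokens(tokens, lexicon, prob_map):
    """Average the mapped weights over the tokens found in the lexicon (assumed normalization)."""
    totals = {f: 0.0 for f in FOUNDATIONS}
    matches = 0
    for tok in tokens:
        if tok in lexicon:
            matches += 1
            for f, p in map_probabilities(lexicon[tok], prob_map).items():
                totals[f] += p
    return {f: (totals[f] / matches if matches else 0.0) for f in FOUNDATIONS}


tokens = ["they", "kill", "for", "honor"]
print(score_tokens(tokens, TOY_EMFD, "all"))
print(score_tokens(tokens, TOY_EMFD, "single"))
```

Under `all`, both toy words contribute to every foundation; under `single`, "kill" counts only toward care and "honor" only toward authority, so the remaining foundations stay at zero.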

***

In addition, you can decide whether you want eMFDscore to return the average sentiment for each  
foundation, or whether you would like eMFDscore to split each foundation  
into a `vice` and a `virtue` category (flag `output_metrics` below):

1. Return the average `sentiment` for each foundation (option `sentiment`) 
2. Split foundations into a `vice-virtue` category (option `vice-virtue`). 

The vice-virtue split is accomplished by considering, for each word, the average sentiment of each foundation  
and then assigning the word to "virtue" if the foundation sentiment is positive,  
or to "vice" if the sentiment is negative.  
For instance, when using the `all` option for the `prob_map` flag above, a word's foundation probabilities  
will be translated into five `virtue` scores (i.e., care, fairness, loyalty, authority, and sanctity)   
if the word's sentiment for these foundations is positive, whereas a word whose sentiment for each  
foundation is negative will be assigned five `vice` scores (i.e., harm, cheating, betrayal, subversion, and degradation). 
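
As a rough sketch of this rule (again not the library's own code), the function below routes each foundation probability to a `<foundation>.virtue` or `<foundation>.vice` key depending on the sign of the matching sentiment score. The function name is hypothetical, the loyalty sentiment for "kill" is invented, and the handling of a sentiment of exactly zero is an arbitrary choice, since the text does not specify it.

```python
def split_vice_virtue(probs, sents):
    """Assign each foundation probability to a virtue or a vice key by sentiment sign."""
    out = {}
    for foundation, p in probs.items():
        sentiment = sents[foundation]
        # Positive sentiment -> virtue, negative -> vice.
        # A sentiment of exactly 0 goes to neither (arbitrary choice for this sketch).
        out[f"{foundation}.virtue"] = p if sentiment > 0 else 0.0
        out[f"{foundation}.vice"] = p if sentiment < 0 else 0.0
    return out


# "kill": care probability, loyalty probability, and care sentiment are quoted in the text above;
# the loyalty sentiment is hypothetical.
kill_probs = {"care": 0.40, "loyalty": 0.24}
kill_sents = {"care": -0.69, "loyalty": -0.50}

print(split_vice_virtue(kill_probs, kill_sents))
```

For this toy entry, both probabilities end up in the vice keys (`care.vice`, `loyalty.vice`) because both sentiment scores are negative.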

***

Based on the above, there are four different ways in which the eMFD can be used.  
The usage and typical use case of each option are explained below. 

#### eMFDscore Command-Line Options

A typical command for eMFDscore specifies the following:

`$ emfdscore [INPUT_FILE] [OUTPUT_FILE] [SCORING_METHOD] [DICT_TYPE] [PROB_MAP] [OUTPUT_METRICS]`

When using eMFDscore, several inputs need to be defined in a specific order: 

- [INPUT_FILE] = The path to a CSV file in which the first column contains the document texts to be scored.  
  Each row should reflect its own document. See the template_input.csv for an example file format.
  
  
- [OUTPUT_FILE] = Specifies the file name of the generated output csv.


- [SCORING_METHOD] = Currently, eMFDscore offers four scoring algorithms:
    - `bow` is a classical Bag-of-Words approach in which the algorithm simply searches for word matches between document texts and the specified dictionary.
    - `pat` (in development) relies on named entity recognition and syntactic dependency parsing. For each document, the algorithm first extracts all mentioned entities.  
    Next, for each entity, eMFDscore extracts words that pertain to 1) moral verbs for which the entity is an agent argument (Agent verbs), 2) moral verbs for  
    which the entity is the patient, theme, or other argument (Patient verbs), and 3) other moral attributes (e.g., adjectival modifiers and appositives).  
    Only the `emfd` is currently supported for PAT extraction. This method is also more computationally expensive and thus has a longer execution time.
    - `wordlist` is a simple scoring algorithm that lets users examine the moral content of individual words. This scoring method expects a CSV where each row corresponds  
    to a unique word. Note: The wordlist scoring algorithm does not perform any tokenization or preprocessing on the wordlists.   
    For a more fine-grained moral content extraction, users are encouraged to use either the `bow` or the `pat` methodology.
    - `gdelt.ngrams` is designed for the Global Database of Events, Language, and Tone Television Ngram dataset.   
    This scoring method expects a unigram (1gram) input text file from GDELT and will score each unprocessed (untokenized) unigram with the eMFD.
    
    
- [DICT_TYPE] = Declares which dictionary is applied to score documents. In its current version, eMFDscore lets users choose between three dictionaries:
    - `emfd` = extended Moral Foundations Dictionary (eMFD)
    - `mfd2` = Moral Foundations Dictionary 2.0 (Frimer et al., 2017; https://osf.io/xakyw/ )
    - `mfd` = original Moral Foundations Dictionary (https://moralfoundations.org/othermaterials)


- When choosing the eMFD, the following two additional flags must also be defined:
    - [PROB_MAP]: How are the foundation probabilities mapped when scoring a document? 
        - `all` : use all probabilities per word in the eMFD
        - `single`: Assign a single probability to each word in the eMFD according to the foundation with the highest probability
           
    - [OUTPUT_METRICS]: Which metrics are returned? 
        - `sentiment`: Return the average sentiment for each foundation
        - `vice-virtue`: Split foundations into a vice-virtue category
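
To make the argument order concrete, the sketch below assembles the command line for each of the four eMFD combinations. `build_command` is a hypothetical helper, not part of eMFDscore, and the output file names are an assumed naming scheme; only the argument order and the accepted values are taken from the list above.

```python
import shlex
from itertools import product

PROB_MAPS = ["all", "single"]
OUTPUT_METRICS = ["sentiment", "vice-virtue"]


def build_command(input_csv, output_csv, scoring_method, dict_type,
                  prob_map=None, output_metrics=None):
    """Assemble the positional arguments in the order described above."""
    cmd = ["emfdscore", input_csv, output_csv, scoring_method, dict_type]
    if dict_type == "emfd":
        # The two extra flags are only defined for the eMFD.
        if prob_map not in PROB_MAPS or output_metrics not in OUTPUT_METRICS:
            raise ValueError("the eMFD requires valid prob_map and output_metrics values")
        cmd += [prob_map, output_metrics]
    return cmd


for prob_map, metrics in product(PROB_MAPS, OUTPUT_METRICS):
    out_csv = f"{prob_map}-{metrics}.csv"  # hypothetical output file name
    cmd = build_command("emfdscore/template_input.csv", out_csv, "bow", "emfd", prob_map, metrics)
    print(" ".join(shlex.quote(part) for part in cmd))
```

On macOS or Linux, the returned list could be handed to `subprocess.run(cmd, check=True)` to execute the command from a script.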

#### Scoring Documents with the eMFD

Below, we illustrate the various text scoring options in eMFDscore.  
For this purpose, we will be using a CSV file in which each row corresponds to  
a news article text. 

In [2]:
template_input = pd.read_csv('emfdscore/template_input.csv', header=None)
template_input.head()

Unnamed: 0,0
0,The Iraqi government's assault to retake the c...
1,WASHINGTON -- North Korea now has the capabili...
2,TEL AVIV – An Egyptian journalist wrote an op-...
3,What was life like for Russians such as Tsar N...
4,President Obama's former national security adv...
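
To score your own texts, an input file with the same layout (one document per row, text in the first column, no header row) is needed. The following is a minimal sketch using only the standard library; `write_input_csv` and the two example sentences are hypothetical.

```python
import csv
import io


def write_input_csv(texts, handle):
    """Write one document per row, with the text in the first column and no header row."""
    writer = csv.writer(handle)
    for text in texts:
        # csv.writer quotes fields that contain commas or quotation marks.
        writer.writerow([text])


# Hypothetical example documents.
docs = ["The soldiers protected the village.", 'She said, "they betrayed us."']

buffer = io.StringIO()
write_input_csv(docs, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="", encoding="utf-8")` as `handle`.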


#### 1. Use All Probabilities per Word and Return Sentiment Scores 

This option should be used when one wants to extract the overall, holistic moral signal from a document.  
Note that because each word contributes all five of its foundation probabilities, the resulting foundation scores  
are more highly correlated with one another, making this method less suitable when one wants to
- use the foundation probabilities as predictor variables in statistical models
- discriminate which foundations are more or less represented in a text.  

For these cases, options (2) and (4) below should be preferred. 

##### Using Python 

In [3]:
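from emfdscore.scoring import score_docs 

# Score every row of the input dataframe
num_docs = len(template_input)

DICT_TYPE = 'emfd'          # dictionary: 'emfd', 'mfd2', or 'mfd'
PROB_MAP = 'all'            # 'all' or 'single'
SCORE_METHOD = 'bow'        # bag-of-words scoring
OUT_METRICS = 'sentiment'   # 'sentiment' or 'vice-virtue'
OUT_CSV_PATH = 'all-sent.csv'

# Score the documents and save the resulting dataframe as a CSV file
df = score_docs(template_input,DICT_TYPE,PROB_MAP,SCORE_METHOD,OUT_METRICS,num_docs)
df.to_csv(OUT_CSV_PATH, index=False)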

Processed: 500 100% |❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤| Elapsed Time: 0:00:39 Time:  0:00:39


##### Using the Command Line

In [4]:
%%bash
emfdscore emfdscore/template_input.csv all-sent.csv bow emfd all sentiment

Running eMFDscore
Total number of input texts to be scored: 500
Scoring completed.


Processed: 500 100% |❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤| Elapsed Time: 0:00:38 Time:  0:00:38


In [5]:
# Inspect output 
all_sent = pd.read_csv('all-sent.csv')
all_sent.head()

Unnamed: 0,care_p,fairness_p,loyalty_p,authority_p,sanctity_p,care_sent,fairness_sent,loyalty_sent,authority_sent,sanctity_sent,moral_nonmoral_ratio,f_var,sent_var
0,0.135488,0.103978,0.097914,0.104486,0.083446,-0.151964,-0.122982,-0.101907,-0.102609,-0.110386,2.417722,0.000361,0.000433
1,0.112714,0.087392,0.090776,0.098351,0.069435,-0.143459,-0.061835,-0.046722,-0.064888,-0.088798,1.669903,0.00025,0.001441
2,0.114769,0.103932,0.094078,0.096782,0.089199,-0.157342,-0.132494,-0.074078,-0.091602,-0.098686,0.791411,9.9e-05,0.001125
3,0.090146,0.085675,0.086087,0.092897,0.070704,-0.069739,-0.013825,0.019261,-0.030559,-0.051676,1.051948,7.4e-05,0.001184
4,0.093715,0.088627,0.094254,0.094601,0.07087,-0.109248,-0.057367,-0.013941,-0.037166,-0.086024,1.464481,0.000102,0.001437


As the above dataframe illustrates, each document is assigned 5 foundation probabilities that  
denote the average probability of each document belonging to one of the five moral foundations and  
5 sentiment scores that describe the average sentiment of detected moral words for that foundation.  

In addition, eMFDscore always returns the detected moral-to-nonmoral word ratio (`moral_nonmoral_ratio`)  
along with the variance across the foundation scores (`f_var`) and, for the `sentiment` output, the variance across the sentiment scores (`sent_var`). 
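
As a quick sanity check on how to read `f_var` and `sent_var`, the values for the first document can be recomputed from its five foundation probabilities and five sentiment scores. The use of the sample variance (n - 1 in the denominator) is an assumption inferred from the displayed numbers, not something stated by eMFDscore.

```python
from statistics import variance

# First row of the output shown above.
foundation_p = [0.135488, 0.103978, 0.097914, 0.104486, 0.083446]
foundation_sent = [-0.151964, -0.122982, -0.101907, -0.102609, -0.110386]

# statistics.variance computes the sample variance (n - 1 in the denominator).
f_var = variance(foundation_p)
sent_var = variance(foundation_sent)

print(round(f_var, 6), round(sent_var, 6))
```

Rounded to six decimals, both results agree with the `f_var` (0.000361) and `sent_var` (0.000433) columns of that row.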

***

#### 2. Assign Single Probability per Word and Return Sentiment Scores 

##### Using Python 

In [6]:
from emfdscore.scoring import score_docs 

num_docs = len(template_input)

DICT_TYPE = 'emfd'
PROB_MAP = 'single'
SCORE_METHOD = 'bow'
OUT_METRICS = 'sentiment'
OUT_CSV_PATH = 'single-sent.csv'

df = score_docs(template_input,DICT_TYPE,PROB_MAP,SCORE_METHOD,OUT_METRICS,num_docs)
df.to_csv(OUT_CSV_PATH, index=False)

Processed: 500 100% |❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤| Elapsed Time: 0:00:41 Time:  0:00:41


##### Using the Command Line

In [7]:
%%bash
emfdscore emfdscore/template_input.csv single-sent.csv bow emfd single sentiment

Running eMFDscore
Total number of input texts to be scored: 500
Scoring completed.


Processed: 500 100% |❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤| Elapsed Time: 0:00:37 Time:  0:00:37


In [8]:
# Inspect output 
single_sent = pd.read_csv('single-sent.csv')
single_sent.head()

Unnamed: 0,care_p,fairness_p,loyalty_p,authority_p,sanctity_p,care_sent,fairness_sent,loyalty_sent,authority_sent,sanctity_sent,moral_nonmoral_ratio,f_var,sent_var
0,0.077193,0.027206,0.021363,0.031303,0.002395,-0.083485,-0.009629,0.001919,-0.018307,-0.000161,2.417722,0.000764,0.001249
1,0.06465,0.016021,0.029611,0.028144,0.002409,-0.083103,-0.002864,-0.010275,-0.015872,-0.001872,1.669903,0.000536,0.001169
2,0.056653,0.034186,0.025006,0.021561,0.01678,-0.093639,-0.01331,-0.001213,-0.012562,-0.017463,0.791411,0.000249,0.001398
3,0.024518,0.025494,0.021833,0.041092,0.012537,-0.032878,0.001872,0.004439,0.000244,-0.005171,1.051948,0.000106,0.000233
4,0.035749,0.028048,0.028456,0.036979,0.006093,-0.03227,-0.018531,0.003611,-0.003783,-0.002514,1.464481,0.000154,0.000212


Here, the returned output has the same format as in option (1) above, but every word was assigned  
only one probability score when scoring the document. 

***

#### 3. Use All Probabilities per Word and Return Vice-Virtue Scores

##### Using Python

In [9]:
from emfdscore.scoring import score_docs 

num_docs = len(template_input)

DICT_TYPE = 'emfd'
PROB_MAP = 'all'
SCORE_METHOD = 'bow'
OUT_METRICS = 'vice-virtue'
OUT_CSV_PATH = 'all-vv.csv'

df = score_docs(template_input,DICT_TYPE,PROB_MAP,SCORE_METHOD,OUT_METRICS,num_docs)
df.to_csv(OUT_CSV_PATH, index=False)

Processed: 500 100% |❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤| Elapsed Time: 0:00:38 Time:  0:00:38


##### Using the Command Line

In [10]:
%%bash
emfdscore emfdscore/template_input.csv all-vv.csv bow emfd all vice-virtue

Running eMFDscore
Total number of input texts to be scored: 500
Scoring completed.


Processed: 500 100% |❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤| Elapsed Time: 0:00:37 Time:  0:00:37


In [11]:
# Inspect output 
all_vv = pd.read_csv('all-vv.csv')
all_vv.head()

Unnamed: 0,care.virtue,fairness.virtue,loyalty.virtue,authority.virtue,sanctity.virtue,care.vice,fairness.vice,loyalty.vice,authority.vice,sanctity.vice,moral_nonmoral_ratio,f_var
0,0.021779,0.021534,0.034007,0.022967,0.019689,0.112803,0.082147,0.063907,0.080426,0.062221,2.417722,0.001079
1,0.019605,0.029767,0.035892,0.029407,0.027932,0.092908,0.056936,0.05383,0.067442,0.039905,1.669903,0.000504
2,0.020196,0.028297,0.030178,0.031877,0.025163,0.093601,0.075635,0.063162,0.063838,0.063394,0.791411,0.000644
3,0.023391,0.037495,0.040412,0.034631,0.028299,0.066477,0.047443,0.045462,0.058114,0.041576,1.051948,0.000168
4,0.021151,0.033509,0.043392,0.046873,0.022972,0.072061,0.055008,0.050768,0.046851,0.047825,1.464481,0.000229


As this output illustrates, each document is assigned 10 foundation scores that indicate the degree to which  
each document reflects either a `virtue` or a `vice` foundation. In this option, every word in the eMFD is assigned  
five probability scores when scoring the document. 
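
One simple way to read such a row is to rank its ten categories. The sketch below does this for the first document shown above; `rank_categories` is a hypothetical helper, and ranking by raw score is just one possible way to summarize the output.

```python
# First row of the vice-virtue output shown above.
row = {
    "care.virtue": 0.021779, "fairness.virtue": 0.021534, "loyalty.virtue": 0.034007,
    "authority.virtue": 0.022967, "sanctity.virtue": 0.019689,
    "care.vice": 0.112803, "fairness.vice": 0.082147, "loyalty.vice": 0.063907,
    "authority.vice": 0.080426, "sanctity.vice": 0.062221,
}


def rank_categories(scores, top_n=3):
    """Return the top_n (category, score) pairs, highest score first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


for category, score in rank_categories(row):
    print(f"{category:16s} {score:.6f}")
```

For this document, the three highest-scoring categories are all vice categories, led by `care.vice`.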


***

#### 4. Assign Single Probability per Word and Return Vice-Virtue Scores

##### Using Python

In [12]:
from emfdscore.scoring import score_docs 

num_docs = len(template_input)

DICT_TYPE = 'emfd'
PROB_MAP = 'single'
SCORE_METHOD = 'bow'
OUT_METRICS = 'vice-virtue'
OUT_CSV_PATH = 'single-vv.csv'

df = score_docs(template_input,DICT_TYPE,PROB_MAP,SCORE_METHOD,OUT_METRICS,num_docs)
df.to_csv(OUT_CSV_PATH, index=False)

Processed: 500 100% |❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤| Elapsed Time: 0:00:39 Time:  0:00:39


##### Using the Command Line

In [13]:
%%bash
emfdscore emfdscore/template_input.csv single-vv.csv bow emfd single vice-virtue

Running eMFDscore
Total number of input texts to be scored: 500
Scoring completed.


Processed: 500 100% |❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤| Elapsed Time: 0:00:37 Time:  0:00:37


In [14]:
# Inspect output 
single_vv = pd.read_csv('single-vv.csv')
single_vv.head()

Unnamed: 0,care.virtue,fairness.virtue,loyalty.virtue,authority.virtue,sanctity.virtue,care.vice,fairness.vice,loyalty.vice,authority.vice,sanctity.vice,moral_nonmoral_ratio,f_var
0,0.010051,0.007087,0.01229,0.006542,0.000785,0.067142,0.020119,0.009073,0.024761,0.001609,2.417722,0.000379
1,0.005495,0.006633,0.007176,0.006936,0.000807,0.059155,0.009388,0.022435,0.021208,0.001602,1.669903,0.000304
2,0.002963,0.01101,0.009599,0.005253,0.001987,0.05369,0.023176,0.015406,0.016309,0.014793,0.791411,0.000224
3,0.003055,0.01307,0.010188,0.013894,0.002241,0.021463,0.012423,0.011644,0.027198,0.010296,1.051948,5.6e-05
4,0.008476,0.009681,0.012314,0.022428,0.002606,0.027273,0.018368,0.016142,0.014551,0.003487,1.464481,6.2e-05


This output is similar to option (3), although in this approach, every word in the eMFD was assigned only  
one score according to the foundation with the highest probability. 

***

#### Scoring Documents with the MFD2

##### Using Python

In [15]:
from emfdscore.scoring import score_docs 

num_docs = len(template_input)

DICT_TYPE = 'mfd2'
PROB_MAP = ''
SCORE_METHOD = 'bow'
OUT_METRICS = ''
OUT_CSV_PATH = 'mfd2.csv'

df = score_docs(template_input,DICT_TYPE,PROB_MAP,SCORE_METHOD,OUT_METRICS,num_docs)
df.to_csv(OUT_CSV_PATH, index=False)

Processed: 500 100% |❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤| Elapsed Time: 0:00:40 Time:  0:00:40


##### Using the Command Line

In [16]:
%%bash
emfdscore emfdscore/template_input.csv mfd2.csv bow mfd2

Running eMFDscore
Total number of input texts to be scored: 500
Scoring completed.


Processed: 500 100% |❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤| Elapsed Time: 0:00:38 Time:  0:00:38


In [17]:
# Inspect output 
mfd2 = pd.read_csv('mfd2.csv')
mfd2.head()

Unnamed: 0,care.virtue,care.vice,authority.virtue,fairness.vice,fairness.virtue,loyalty.vice,loyalty.virtue,sanctity.virtue,authority.vice,sanctity.vice,moral_nonmoral_ratio,f_var
0,0.142857,0.642857,0.071429,0.0,0.0,0.0,0.071429,0.071429,0.0,0.0,0.054688,0.038776
1,0.0,0.272727,0.454545,0.0,0.0,0.0,0.272727,0.0,0.0,0.0,0.041667,0.028375
2,0.05,0.3,0.1,0.05,0.1,0.0,0.2,0.0,0.0,0.2,0.073529,0.010556
3,0.0,0.0,0.333333,0.0,0.0,0.0,0.444444,0.0,0.222222,0.0,0.029316,0.028669
4,0.0,0.068966,0.517241,0.0,0.034483,0.0,0.37931,0.0,0.0,0.0,0.06872,0.035262


As can be seen, the output for the MFD2 is similar to the output obtained when scoring documents with the eMFD  
and the `vice-virtue` option of the `output_metrics` flag. 
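
One difference in scale is worth noting when comparing the two outputs. In the rows displayed in this tutorial, the ten MFD2 category scores of a document add up to 1 (within rounding), whereas the eMFD vice-virtue scores do not. The check below uses only values shown above; reading the MFD2 scores as shares of the matched dictionary words is an inference from these numbers, not a documented property.

```python
# Category scores of the first two documents in the MFD2 output shown above.
mfd2_rows = [
    [0.142857, 0.642857, 0.071429, 0.0, 0.0, 0.0, 0.071429, 0.071429, 0.0, 0.0],
    [0.0, 0.272727, 0.454545, 0.0, 0.0, 0.0, 0.272727, 0.0, 0.0, 0.0],
]

# First document of the eMFD `all` / `vice-virtue` output, for comparison.
emfd_vv_row = [0.021779, 0.021534, 0.034007, 0.022967, 0.019689,
               0.112803, 0.082147, 0.063907, 0.080426, 0.062221]

mfd2_sums = [sum(row) for row in mfd2_rows]
emfd_sum = sum(emfd_vv_row)

print([round(s, 4) for s in mfd2_sums], round(emfd_sum, 4))
```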

***

#### Scoring Documents with the MFD

##### Using Python

In [18]:
from emfdscore.scoring import score_docs 

num_docs = len(template_input)

DICT_TYPE = 'mfd'
PROB_MAP = ''
SCORE_METHOD = 'bow'
OUT_METRICS = ''
OUT_CSV_PATH = 'mfd.csv'

df = score_docs(template_input,DICT_TYPE,PROB_MAP,SCORE_METHOD,OUT_METRICS,num_docs)
df.to_csv(OUT_CSV_PATH, index=False)

Processed: 500 100% |❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤| Elapsed Time: 0:01:08 Time:  0:01:08


##### Using the Command Line

In [19]:
%%bash
emfdscore emfdscore/template_input.csv mfd.csv bow mfd

Running eMFDscore
Total number of input texts to be scored: 500
Scoring completed.


Processed: 500 100% |❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤| Elapsed Time: 0:01:06 Time:  0:01:06


In [20]:
# Inspect output 
mfd = pd.read_csv('mfd.csv')
mfd.head()

Unnamed: 0,care.virtue,fairness.virtue,loyalty.virtue,authority.virtue,sanctity.virtue,care.vice,fairness.vice,loyalty.vice,authority.vice,sanctity.vice,moral,moral_nonmoral_ratio,f_var
0,0.166667,0.083333,0.0,0.0,0.0,0.583333,0.0,0.083333,0.0,0.0,0.083333,0.046512,0.033102
1,0.125,0.0,0.25,0.5,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.029963,0.027083
2,0.0,0.125,0.125,0.125,0.0,0.5,0.0,0.0,0.0,0.125,0.0,0.028169,0.023611
3,0.0,0.0,0.5,0.25,0.0,0.083333,0.0,0.0,0.083333,0.0,0.083333,0.039474,0.026929
4,0.16,0.0,0.36,0.24,0.0,0.12,0.04,0.0,0.04,0.0,0.08,0.058685,0.01536


Likewise, the output for the MFD is in the same format as for the MFD2, except that  
the MFD also has a category `moral` under which general moral words are grouped.

***

### Questions or Concerns?

For any questions or concerns, please open an [issue](https://github.com/medianeuroscience/emfdscore/issues) on the GitHub repository.  