# ItEM - Italian EMotive lexicon

ItEM is a high-coverage emotion lexicon for Italian in which each target term is assigned an association score for each of the basic emotions defined in Plutchik's (1994) taxonomy: **JOY, SADNESS, ANGER, FEAR, TRUST, DISGUST, SURPRISE, ANTICIPATION**.

![Plutchik's wheel of emotions](images/1181px-Plutchik-wheel.png "Plutchik's wheel of emotions")

## Contents of this repository
 
 - the list of seed words collected in Passaro et al., 2015. The seeds are provided both as lemmas and tokens;
 
 - the pre-compiled emotive lexicon described in Passaro and Lenci (2016) and referred to as _SintParModel_;
 
 - the pre-compiled emotive lexicon built by exploiting count vectors extracted from FB-NEWS15 (Passaro et al., 2016); 
 
 - a simplified implementation of ItEM that can be used to create new lexica from a list of seeds and a list of word embeddings.

In [1]:
from IPython.display import Markdown, display
import pandas as pd

display(Markdown(f'### The list of seed words collected in Passaro et al., 2015'))
pd.read_csv("seeds/ItEM.elicitated.lemmas.txt", sep = "\t")



### The list of seed words collected in Passaro et al., 2015

Unnamed: 0,Seed,Emotion,PoS
0,colorato,gioia,A
1,esuberante,gioia,A
2,gaudio,gioia,S
3,esultanza,gioia,S
4,libero,gioia,A
...,...,...,...
550,anticipazione,attese,S
551,prospettiva,attese,S
552,sognare,attese,A
553,attesa,attese,S


In [2]:
display(Markdown(f'### The pre-compiled emotive lexicon described in Passaro and Lenci (2016), referred to as _SintParModel_'))
pd.read_csv("pre-compiled-lexica/ItEM.SintParModel.cos", sep = "\t")


### The pre-compiled emotive lexicon described in Passaro and Lenci (2016), referred to as _SintParModel_

Unnamed: 0,emotion,word,cosine
0,gioia,gioia-s,0.843964
1,gioia,giubilo-s,0.795281
2,gioia,letizia-s,0.779542
3,gioia,esultanza-s,0.758201
4,gioia,gaudio-s,0.754662
...,...,...,...
239931,attese,folico-a,0.001367
239932,attese,rilevante-a,0.000935
239933,attese,necessario-a,0.000000
239934,attese,compensativi-a,0.000000


In [3]:
display(Markdown(f'### The pre-compiled emotive lexicon built from FB-NEWS15 (Passaro et al., 2016)'))
pd.read_csv("pre-compiled-lexica/ItEM.FBNEWS15.cos", sep = "\t")

### The pre-compiled emotive lexicon built from FB-NEWS15 (Passaro et al., 2016)

Unnamed: 0,emotion,word,cosine
0,gioia,festoso-a,0.647874
1,gioia,euforico-a,0.622582
2,gioia,esilarante-a,0.622579
3,gioia,gaio-a,0.617334
4,gioia,divertito-a,0.614806
...,...,...,...
239941,attese,muitos-s,0.004284
239942,attese,faço-s,0.003667
239943,attese,caminho-s,0.001646
239944,attese,weapons-s,0.001640
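
Both pre-compiled lexica share the same tab-separated layout: an unnamed index column followed by `emotion`, `word` and `cosine`. The sketch below shows one way to get the highest-scoring words per emotion from such a file. It is only illustrative: it parses an in-memory sample made of rows shown above instead of the real file, and `top_words` is a hypothetical helper, not part of this repository.

```python
import csv
import io
from collections import defaultdict

# In-memory sample copied from the FB-NEWS15 rows shown above;
# with the real file, open "pre-compiled-lexica/ItEM.FBNEWS15.cos" instead.
sample = (
    "\temotion\tword\tcosine\n"
    "0\tgioia\tfestoso-a\t0.647874\n"
    "1\tgioia\teuforico-a\t0.622582\n"
    "2\tgioia\tesilarante-a\t0.622579\n"
    "239944\tattese\tweapons-s\t0.001640\n"
)

def top_words(handle, n=2):
    """Return the n highest-scoring (word, cosine) pairs for each emotion."""
    by_emotion = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        by_emotion[row["emotion"]].append((row["word"], float(row["cosine"])))
    return {
        emotion: sorted(pairs, key=lambda p: p[1], reverse=True)[:n]
        for emotion, pairs in by_emotion.items()
    }

lexicon_top = top_words(io.StringIO(sample))
print(lexicon_top["gioia"])
```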


## Citation 


```
@inproceedings{DBLP:conf/lrec/PassaroL16,
  author    = {Lucia C. Passaro and
               Alessandro Lenci},
  editor    = {Nicoletta Calzolari and
               Khalid Choukri and
               Thierry Declerck and
               Sara Goggi and
               Marko Grobelnik and
               Bente Maegaard and
               Joseph Mariani and
               H{\'{e}}l{\`{e}}ne Mazo and
               Asunci{\'{o}}n Moreno and
               Jan Odijk and
               Stelios Piperidis},
  title     = {Evaluating Context Selection Strategies to Build Emotive Vector Space
               Models},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources
               and Evaluation {LREC} 2016, Portoro{\v{z}}, Slovenia, May 23-28, 2016},
  publisher = {European Language Resources Association {(ELRA)}},
  year      = {2016},
  url       = {http://www.lrec-conf.org/proceedings/lrec2016/summaries/637.html},
  timestamp = {Mon, 19 Aug 2019 15:22:52 +0200},
  biburl    = {https://dblp.org/rec/conf/lrec/PassaroL16.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

## Create your own emotive lexicon

This section provides a simplified implementation of ItEM. Starting from a list of seed words and the word embeddings for the desired vocabulary, it walks through the creation of a new emotive lexicon. You can build your own vector space (with either count-based or prediction-based embeddings) or use pre-trained vectors.
The only constraint is that the seed words must be part of the chosen vector space model (i.e. the space must contain embeddings for them).
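
The constraint above can be checked before building the lexicon. A minimal sketch, assuming the vocabulary is available as any container that supports `in` (a plain set stands in for the vector space here); `missing_seeds` is a hypothetical helper and not part of `item_functions`.

```python
def missing_seeds(seed_keys, vocabulary):
    """Return the seed keys that have no embedding in the vocabulary, sorted."""
    return sorted(key for key in set(seed_keys) if key not in vocabulary)

# Toy data: a real run would use the seed keys and the model vocabulary.
seed_keys = ["gaudio-s", "esultanza-s", "libero-a"]
vocabulary = {"gaudio-s", "libero-a", "abate-s"}

absent = missing_seeds(seed_keys, vocabulary)
print(absent)
```

An empty result means every seed can be looked up in the space.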


### What you need

#### 1. Import code and libraries

In [4]:
from IPython.display import Markdown, display
import pandas as pd

from utils import item_functions as item



#### 2. A list of seed words

- A list of seed lemmas or tokens in a tab-separated file.
- The PoS column is optional, but its presence determines the format of the seed key (i.e. the unique key that links a word to its embedding):
  - if populated, the seed key is formatted as "Seed-PoS" (e.g. "agitarsi-v");
  - otherwise, the seed key is the "Seed" column as is (e.g. "agitarsi").
- The words in the vector space must be formatted in the same way as the seed keys ("Seed-PoS" vs "Seed"); otherwise the embeddings cannot be looked up.
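
The key-building rule described in this list can be sketched as follows. This is an illustration only: `make_seed_key` is a hypothetical helper, and lowercasing the PoS tag is an assumption based on the "agitarsi-v" example; the actual logic lives in `item.load_seeds`.

```python
def make_seed_key(seed, pos=None):
    """Build the key used to look up a seed's embedding.

    With a PoS tag the key is "seed-pos"; without one it is the seed itself.
    """
    if pos is None or not pos.strip():
        return seed
    # Assumption: the PoS suffix is lowercased, as in "agitarsi-v".
    return f"{seed}-{pos.strip().lower()}"

print(make_seed_key("agitarsi", "V"))
print(make_seed_key("agitarsi"))
```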


In [5]:
display(Markdown(f'#### A dataframe containing seed lemmas'))
pd.read_csv("seeds/ItEM.elicitated.lemmas.txt", sep = "\t")


#### A dataframe containing seed lemmas

Unnamed: 0,Seed,Emotion,PoS
0,colorato,gioia,A
1,esuberante,gioia,A
2,gaudio,gioia,S
3,esultanza,gioia,S
4,libero,gioia,A
...,...,...,...
550,anticipazione,attese,S
551,prospettiva,attese,S
552,sognare,attese,A
553,attesa,attese,S


In [6]:
display(Markdown(f'#### A dataframe containing seed tokens'))

pd.read_csv("seeds/ItEM.elicitated.tokens.txt", sep = "\t")


#### A dataframe containing seed tokens

Unnamed: 0,Seed,Emotion
0,colorato,gioia
1,colorata,gioia
2,colorate,gioia
3,colorati,gioia
4,esuberantissimo,gioia
...,...,...
8108,supponessimo,attese
8109,supporrebbero,attese
8110,supporrà,attese
8111,supponevi,attese


#### 3. Word vectors

The file of word embeddings must be in the `word2vec C` format:

 - the first row contains the number of embeddings and their dimension;
 - each row is space-separated and contains the word in position 0, followed by its vector.

In [7]:
display(Markdown(f'29992 300\n\nabate-s -0.505393460602235 -3.0702947386964445 [...] 1.1265833671394883\n\nabbaglio-s -0.5159974598163771 -2.5505094487548488 [...] 0.24188720476360617\n\nabbandono-s -0.0903591412413367 -0.5987769584177935 [...] 0.022376339858185584'))




29992 300

abate-s -0.505393460602235 -3.0702947386964445 [...] 1.1265833671394883

abbaglio-s -0.5159974598163771 -2.5505094487548488 [...] 0.24188720476360617

abbandono-s -0.0903591412413367 -0.5987769584177935 [...] 0.022376339858185584
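
To make the layout concrete, the sketch below parses a tiny text in this format with plain Python. It is not how `item.load_vectors` works; `parse_word2vec_text` and the three-dimensional toy vectors are made up for the example.

```python
import io

def parse_word2vec_text(handle):
    """Parse word2vec text format: a "count dim" header, then "word v1 ... vdim" rows."""
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        if not line.strip():
            continue  # tolerate blank lines
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values, got {len(parts) - 1}")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != count:
        raise ValueError(f"header declares {count} vectors, found {len(vectors)}")
    return vectors

# Toy file: 2 words with 3 dimensions each (made-up values).
toy = io.StringIO("2 3\nabate-s -0.5 -3.0 1.1\nabbaglio-s -0.5 -2.5 0.2\n")
vectors = parse_word2vec_text(toy)
print(vectors["abate-s"])
```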

## Define the variables

In [8]:
# The seeds file and the vectors file, formatted as described above.
seedsFile = "seeds/ItEM.elicitated.lemmas.txt"
vectorsFile = "vectors/COUNT.repubblica.itwac.TOP240.vec"

#other available options here:
#seedsFile = "seeds/ItEM.elicitated.tokens.txt"
#vectorsFile = "vectors/PREDICT-wiki.it.vec"

# The cosine similarity threshold used to write the output.
cosine_threshold = 0.2

# The output file formatted as the pre-compiled lexica described above.
outputFile = f"output/ItEM.{cosine_threshold}.cos"


### Load word embeddings 

Word embeddings are loaded with the `load_vectors` function of `item`. The function takes as input a file of embeddings in the `word2vec C` format and loads them into a Gensim `KeyedVectors` object.


In [9]:
vectorModel = item.load_vectors(vectorsFile)

### Create the centroid vectors

First, the seeds are loaded via the `item.load_seeds` function, which takes as input a seeds file such as the ones described above and returns a dictionary in which each key is a centroid name and each value is the set of seeds associated with that centroid.

Then, the centroid vectors are computed with the `item.get_centroid_vectors` function, which takes as arguments the dictionary `centroids` and the vector space model `vectorModel`.

The function returns three aligned lists whose length equals the number of distinct centroid keys in the seeds file:

1. indexes of the centroids
2. names of the centroids (e.g. emotions)
3. vectors for the centroids

In [10]:
centroids = item.load_seeds(seedsFile)
centroidIdxs, centroidNames, centroidVectors = item.get_centroid_vectors(centroids, vectorModel)
display(Markdown(f'#### Emotive centroids and the number of seeds used to generate them: \n\n{[(k, len(centroids[k])) for k in centroids]}\n'))


#### Emotive centroids and the number of seeds used to generate them: 

[('gioia', 61), ('rabbia', 77), ('sorpresa', 60), ('disgusto', 80), ('paura', 78), ('tristezza', 77), ('fiducia', 62), ('attese', 60)]
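
Conceptually, a centroid summarizes the embeddings of the seeds of one emotion. The sketch below assumes that the centroid is the element-wise mean of the seed vectors and that seeds missing from the vector space are skipped. Both are assumptions about `item.get_centroid_vectors`, whose actual implementation may differ (for instance in normalization). `mean_centroid` and the toy vectors are hypothetical.

```python
def mean_centroid(seed_keys, vectors):
    """Element-wise mean of the vectors of the seeds found in `vectors`."""
    found = [vectors[key] for key in seed_keys if key in vectors]
    if not found:
        raise ValueError("no seed has an embedding in the vector space")
    return [sum(values) / len(found) for values in zip(*found)]

# Toy two-dimensional space (made-up values); "giubilo-s" has no embedding.
toy_vectors = {"gaudio-s": [1.0, 0.0], "esultanza-s": [0.0, 1.0]}
toy_centroids = {"gioia": {"gaudio-s", "esultanza-s", "giubilo-s"}}

centroid_vectors = {
    emotion: mean_centroid(seeds, toy_vectors)
    for emotion, seeds in toy_centroids.items()
}
print(centroid_vectors["gioia"])
```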


### Compute the similarity matrix

Finally, the similarity between each target word in the vector space and each of the centroids is computed via `item.get_similatity_matrix`.
The function takes as arguments the vector model and the centroid vectors computed above.
It returns four objects; the last three are aligned lists whose length equals the number of distinct words in the vector space model:

1. the similarity matrix of shape (n_words, n_centroids)
2. list of word embeddings in the model
3. list of ids for word embeddings in the model
4. list of words in the model


In [11]:
simMat, wordVectors, wordIdxs, wordNames = item.get_similatity_matrix(vectorModel,centroidVectors)
display(Markdown(f'#### The Similarity matrix'))
simMat


#### The Similarity matrix

array([[ 0.03573099,  0.12702479,  0.00631439, ...,  0.03567612,
        -0.02278734,  0.04787419],
       [ 0.27003698,  0.17806558,  0.21718146, ...,  0.17445193,
         0.18694588,  0.27295645],
       [ 0.20503877,  0.16525191,  0.13290617, ...,  0.24653084,
         0.16014088,  0.11859763],
       ...,
       [ 0.06784207,  0.05267173,  0.04871502, ..., -0.00293882,
         0.05363663,  0.03163469],
       [ 0.16169409,  0.05611053,  0.12567384, ...,  0.0716342 ,
         0.07090872,  0.10072668],
       [ 0.07470506,  0.12240172,  0.06900523, ...,  0.11972307,
         0.04162689,  0.04148881]])

## Visualize output

In [12]:
df = pd.DataFrame(simMat, index = wordNames,columns = centroidNames)
display(Markdown(f'#### A dataframe containing the Similarity Matrix'))
df

#### A dataframe containing the Similarity Matrix

Unnamed: 0,gioia,rabbia,sorpresa,disgusto,paura,tristezza,fiducia,attese
abate-s,0.035731,0.127025,0.006314,0.038059,0.046227,0.035676,-0.022787,0.047874
abbaglio-s,0.270037,0.178066,0.217181,0.097142,0.198298,0.174452,0.186946,0.272956
abbandono-s,0.205039,0.165252,0.132906,0.168139,0.177458,0.246531,0.160141,0.118598
abbassamento-s,0.139864,0.151562,0.111983,0.137748,0.151613,0.189760,0.144598,0.215484
abbattimento-s,0.127449,-0.066286,0.092745,0.081225,0.066279,0.011238,0.062617,0.117573
...,...,...,...,...,...,...,...,...
zoppo-a,0.044778,0.094590,0.049796,0.152718,0.125573,0.117387,0.058295,0.134320
zuccherato-a,0.114434,0.084678,0.059379,0.033473,0.038127,0.005117,0.065668,0.040402
zuccherino-a,0.067842,0.052672,0.048715,0.036028,0.032904,-0.002939,0.053637,0.031635
zuccheroso-a,0.161694,0.056111,0.125674,0.158395,0.075962,0.071634,0.070909,0.100727


### Get the emotive scores for a single word

In [13]:
w = 'urlo-s'

#For tokenized spaces:
#w = 'urlo'

display(Markdown(f'Emotive scores for **{w}**'))
df.loc[w]

Emotive scores for **urlo-s**

gioia        0.274092
rabbia       0.203188
sorpresa     0.139861
disgusto     0.092078
paura        0.130088
tristezza    0.165643
fiducia      0.155741
attese       0.115909
Name: urlo-s, dtype: float64
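
The scores can also be ranked to find the emotions most associated with the word. The sketch below uses a plain dictionary filled with the values printed above, so it runs without the vector space; with the dataframe, `df.loc[w].sort_values(ascending=False)` gives the same ordering.

```python
# Scores for 'urlo-s', copied from the output above.
scores = {
    "gioia": 0.274092, "rabbia": 0.203188, "sorpresa": 0.139861,
    "disgusto": 0.092078, "paura": 0.130088, "tristezza": 0.165643,
    "fiducia": 0.155741, "attese": 0.115909,
}

# Emotions ordered from the strongest to the weakest association.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[:3])
```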

### Visualize the data sorted by a single emotion

In [14]:
emotion = 'gioia'
display(Markdown(f'Dataframe sorted by **{emotion}**'))

df.sort_values(by=[emotion],ascending=False)

Dataframe sorted by **gioia**

Unnamed: 0,gioia,rabbia,sorpresa,disgusto,paura,tristezza,fiducia,attese
allegro-a,0.672834,0.218464,0.381791,0.238858,0.259723,0.193698,0.350170,0.264201
gioioso-a,0.670783,0.222633,0.411944,0.221171,0.275872,0.264995,0.334075,0.272663
spensierato-a,0.663996,0.224257,0.357196,0.241643,0.279935,0.260837,0.355128,0.250717
entusiasta-a,0.591442,0.388864,0.419100,0.261177,0.387491,0.283486,0.583237,0.361841
festoso-a,0.590573,0.169645,0.284783,0.168762,0.209057,0.189748,0.262713,0.191822
...,...,...,...,...,...,...,...,...
aggiungete-v,-0.173104,-0.053790,-0.089117,0.008733,-0.011488,-0.115068,-0.030349,-0.132705
unire-v,-0.187639,-0.056820,-0.103996,0.028380,-0.015152,-0.132972,-0.055544,-0.132836
aggiungere-v,-0.210280,-0.067320,-0.114451,0.019027,-0.033565,-0.162857,-0.080599,-0.124044
tritare-v,-0.217291,-0.058656,-0.144433,0.020543,-0.024580,-0.112895,-0.064962,-0.159197


### Filtering data by emotion

In [15]:
emotion = 'gioia'

display(Markdown(f'Dataframe by **{emotion}**'))
emo_df = df[emotion]
emo_df = pd.DataFrame(emo_df, columns = [emotion])
emo_df[emo_df[emotion] >= cosine_threshold]

Dataframe by **gioia**

Unnamed: 0,gioia
abbaglio-s,0.270037
abbandono-s,0.205039
abbiate-s,0.246437
abbondanza-s,0.370326
abbraccio-s,0.262834
...,...
volubile-a,0.249351
voluto-a,0.222829
vorticoso-a,0.288817
votato-a,0.214935


## Writing the output

The output is written via the `item.writeOutput` function. The function takes the following arguments:


1. the dataframe `df` containing the words and their cosine similarity to each emotion
2. the path of the output file `outputFile`
3. the cosine similarity threshold `cosine_threshold`

The function writes the dataframe as a tab-separated file and returns the output file path.

In [16]:
df = pd.DataFrame(simMat, columns = centroidNames)
df.insert(0,'word',wordNames)
outputFile = item.writeOutput(df, outputFile, cosine_threshold)
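
As a rough illustration of the output layout, the sketch below turns a small wide table (one column per emotion) into `emotion`, `word`, `cosine` rows, keeps only the scores at or above the threshold and writes them tab-separated. It is an assumption about what `item.writeOutput` produces, based on the pre-compiled lexica shown earlier. `to_long_rows` is a hypothetical helper, the sort order is a choice made for the example, and the output goes to an in-memory buffer rather than a file.

```python
import csv
import io

def to_long_rows(scores_by_word, threshold):
    """Flatten {word: {emotion: cosine}} into (emotion, word, cosine) rows.

    Rows below the threshold are dropped; the rest are sorted by emotion,
    then by decreasing cosine.
    """
    rows = [
        (emotion, word, cosine)
        for word, scores in scores_by_word.items()
        for emotion, cosine in scores.items()
        if cosine >= threshold
    ]
    return sorted(rows, key=lambda r: (r[0], -r[2]))

# Values copied from the similarity dataframe shown earlier (two emotions only).
wide = {
    "abate-s": {"gioia": 0.035731, "rabbia": 0.127025},
    "abbaglio-s": {"gioia": 0.270037, "rabbia": 0.178066},
    "abbandono-s": {"gioia": 0.205039, "rabbia": 0.165252},
}

rows = to_long_rows(wide, threshold=0.2)
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["emotion", "word", "cosine"])
writer.writerows(rows)
print(buffer.getvalue())
```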
    

# Different emotive spaces by PoS (only for PoS-tagged data)

For some applications, it may be useful to build the centroid vectors based on the Part-of-Speech. 
This method was applied in Passaro et al. (2015) to construct different emotive centroids for each PoS (namely nouns, verbs and adjectives), on the assumption that the context that best captures the meaning of a word depends on the type of word being represented. In this setting, the emotive scores of a new word are computed as the cosine similarity between the vector of the target lemma and the centroid vector of the corresponding PoS, i.e. the centroid built from the embeddings of the seed lemmas that share the PoS of the target lemma.

The functions used to build the emotive space are the same as those described above, but they are invoked separately for each PoS in a for loop.

In [17]:
poses = ['s','v','a']

for pos in poses:

    # Restrict the vector space to the words tagged with the current PoS.
    # Note: `.vocab` is the Gensim 3.x attribute; Gensim 4 exposes `key_to_index` instead.
    posVocab = [w for w in vectorModel.vocab if w.endswith(f'-{pos}')]
    posModel = {w: vectorModel[w] for w in posVocab}

    # Keep only the seeds tagged with the current PoS before building the centroids.
    posCentroids = item.load_seeds(seedsFile)
    posCentroids = {key: {a for a in value if a.endswith(f'-{pos}')} for key, value in posCentroids.items()}
    posCentroidIdxs, posCentroidNames, posCentroidVectors = item.get_centroid_vectors(posCentroids, posModel)
    display(Markdown(f'#### Emotive centroids for the Pos {pos.upper()} and number of seeds used to build them: \n\n{[(k, len(posCentroids[k])) for k in posCentroids]}\n'))

    posSimMat, posWordVectors, posWordIdxs, posWordNames = item.get_similatity_matrix(posModel,posCentroidVectors)

    # Write one lexicon per PoS.
    outputFile = f"output/ItEM-{pos.upper()}-{cosine_threshold}.cos"
    df = pd.DataFrame(posSimMat, columns = posCentroidNames)
    df.insert(0,'word',posWordNames)
    outputFile = item.writeOutput(df, outputFile, cosine_threshold)
    

#### Emotive centroids for the Pos S and number of seeds used to build them: 

[('gioia', 26), ('rabbia', 30), ('sorpresa', 17), ('disgusto', 21), ('paura', 20), ('tristezza', 21), ('fiducia', 21), ('attese', 22)]


#### Emotive centroids for the Pos V and number of seeds used to build them: 

[('gioia', 16), ('rabbia', 15), ('sorpresa', 18), ('disgusto', 21), ('paura', 22), ('tristezza', 23), ('fiducia', 16), ('attese', 23)]


#### Emotive centroids for the Pos A and number of seeds used to build them: 

[('gioia', 19), ('rabbia', 32), ('sorpresa', 25), ('disgusto', 38), ('paura', 36), ('tristezza', 33), ('fiducia', 25), ('attese', 15)]


## Differences between the two approaches

With centroids that are not PoS-specific, the cosine similarity between target words and centroids tends to be lower. This is because vectors of words with the same PoS are closer to each other in the vector space than vectors of words with different PoS. As a consequence, a centroid built from seeds of different PoS may be less representative of the words, due to the sparsity of data points in the high-dimensional vector space.

In [18]:
generalDf = pd.DataFrame(simMat, index = wordNames,columns = centroidNames)
# posSimMat comes from the last iteration of the loop above, i.e. the adjective ('a') space.
posSpecificDf = pd.DataFrame(posSimMat, index = posWordNames,columns = posCentroidNames)

w = 'spensierato-a'
display(Markdown(f'Emotive scores for **{w}** in the _full_ emotive space'))


display(generalDf.loc[w].sort_values())

display(Markdown(f'Emotive scores for **{w}** in the _pos-specific_ emotive space'))
display(posSpecificDf.loc[w].sort_values())



Emotive scores for **spensierato-a** in the _full_ emotive space

rabbia       0.224257
disgusto     0.241643
attese       0.250717
tristezza    0.260837
paura        0.279935
fiducia      0.355128
sorpresa     0.357196
gioia        0.663996
Name: spensierato-a, dtype: float64

Emotive scores for **spensierato-a** in the _pos-specific_ emotive space

disgusto     0.123351
rabbia       0.162872
paura        0.238634
attese       0.249335
tristezza    0.280354
sorpresa     0.312565
fiducia      0.334119
gioia        0.787650
Name: spensierato-a, dtype: float64
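
To quantify the difference for this word, the two sets of scores can be subtracted emotion by emotion. The sketch below uses the values printed above in plain dictionaries; with the dataframes, `posSpecificDf.loc[w] - generalDf.loc[w]` yields the same differences.

```python
# Scores for 'spensierato-a', copied from the two outputs above.
general = {
    "rabbia": 0.224257, "disgusto": 0.241643, "attese": 0.250717,
    "tristezza": 0.260837, "paura": 0.279935, "fiducia": 0.355128,
    "sorpresa": 0.357196, "gioia": 0.663996,
}
pos_specific = {
    "disgusto": 0.123351, "rabbia": 0.162872, "paura": 0.238634,
    "attese": 0.249335, "tristezza": 0.280354, "sorpresa": 0.312565,
    "fiducia": 0.334119, "gioia": 0.787650,
}

# Positive values mean the PoS-specific space gives a higher score.
delta = {
    emotion: round(pos_specific[emotion] - general[emotion], 6)
    for emotion in general
}

for emotion, value in sorted(delta.items(), key=lambda pair: pair[1], reverse=True):
    print(f"{emotion:<10}{value:+.6f}")
```

For this word, the PoS-specific space raises the scores for gioia and tristezza and lowers the others.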

# References

[Passaro et al. (2015): ItEM: A Vector Space Model to Bootstrap an Italian Emotive Lexicon](https://arpi.unipi.it/retrieve/handle/11568/766226/80602/clic-2015-2.pdf)

[Passaro and Lenci (2016): Evaluating Context Selection Strategies to Build Emotive Vector Space Models](http://colinglab.humnet.unipi.it/wp-content/uploads/2012/12/Passaro_Lenci_LREC2016.pdf)

[Passaro et al. (2016): FB-NEWS15: A Topic-Annotated Facebook Corpus for Emotion Detection and Sentiment Analysis](http://colinglab.humnet.unipi.it/wp-content/uploads/2012/12/passaro_etal_CLIC2016.pdf)