Skip to content
master
Switch branches/tags
Go to file
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 

How Gender and Skin Tone Modifiers Affect Emoji Semantics in Twitter

Francesco Barbieri and Jose Camacho Collados

The following repository includes the code and pre-trained embeddings from the paper How Gender and Skin Tone Modifiers Affect Emoji Semantics in Twitter (*SEM 2018).

Use our embeddings

We release the two sets of 100-dimensional SW2V embeddings trained on Twitter (USA-based, English):

  1. Word, base emoji and modifier embeddings. The vocabulary includes words (e.g. house, car, ...), base emojis (without sex or skin tone modifiers, e.g. 👍), and modifiers (e.g. male/female, or light/dark skin tone). Download embeddings here [~300 MB]

  2. Word and emoji (base and modified) embeddings. The vocabulary includes words (e.g. house, car, ...) and emojis, both base (without sex or skin tone modifiers, e.g. 👍), and with modifiers (e.g. 👍🏻,👍🏽,👍🏿). Download embeddings here [~300 MB]

Notes:

  • All words are lowercased.
  • For obtaining the original emoji and modifier encoding from the embeddings, you can use the following mapping (tab separated: frequency ranking, emoji, cldr, emoji code with modifiers, emoji code without modifiers).

When you run example.py (with python3) the output should be the following:

Train New Embeddings

We used the original SW2V code for training the embeddings: http://lcl.uniroma1.it/sw2v/ . We ran the code from the terminal as follows (these are the same parameters used in our experiments):

  1. Word, base emoji and modifier embeddings:
INPUT="tweets.txt"
OUTPUT="word_emoji_embedding_s0.bin"
sw2v -train $INPUT -output $OUTPUT -cbow 1 -size 100 -window 6 -negative 0 -hs 1 -threads 1 -binary 1 -iter 5 -update 0 -senses 0 -synsets_input 1 -synsets_target 1
  1. Word and emoji (base and modified) embeddings:
INPUT="tweets.txt"
OUTPUT="word_emoji_embedding_s1.bin"
sw2v -train $INPUT -output $OUTPUT -cbow 1 -size 100 -window 6 -negative 0 -hs 1 -threads 1 -binary 1 -iter 5 -update 0 -senses 1 -synsets_input 1 -synsets_target 1

The provided models are freely available under Creative Commons CC BY 3.0, using the reference below for attribution:

@InProceedings{barbieri:sem2018,
  author = 	"Barbieri, Francesco
		and Camacho-Collados, Jose",
  title = 	"How Gender and Skin Tone Modifiers Affect Emoji Semantics in Twitter",
  booktitle = 	"Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"101--106",
  location = 	"New Orleans, Louisiana",
  url = 	"http://aclweb.org/anthology/S18-2011"
}

About

*sem paper 2018 - models and code

Resources

Releases

No releases published

Packages

No packages published

Languages