Corpora for vulgar and censored tweets annotated for sentiment

ericholgate/vulgartwitter

Expressively vulgar: The socio-dynamics of vulgarity and its effects on sentiment analysis in social media

Please see the paper cited below for details regarding annotation and modeling.

Citation:

@InProceedings{cachola2018vulgar,
  author    = {Cachola, Isabel  and  Holgate, Eric  and  Preo\c{t}iuc-Pietro, Daniel  and  Li, Junyi Jessy},
  title     = {Expressively vulgar: The socio-dynamics of vulgarity and its effects on sentiment analysis in social media},
  booktitle = {Proceedings of the 27th International Conference on Computational Linguistics},
  year      = {2018},
  pages     = {2927--2938},
  url       = {http://aclweb.org/anthology/C18-1248},
}

Use of the data presented here must abide by the Twitter Terms of Service and Developer Policy.

This repository includes a bi-LSTM that predicts sentiment values and can optionally use vulgarity features.

The three possible vulgarity features are: (1) masking, (2) insertion, and (3) concatenation.
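As a rough illustration of what these three features could look like at the token level, here is a minimal Python sketch. It assumes that masking replaces each vulgar word with a special token, insertion adds a special token after each vulgar word, and concatenation appends a vulgar-word count to a feature vector. The lexicon, the `<VULGAR>` token, and all function names are hypothetical and are not taken from this repository; the paper gives the actual definitions.

```python
VULGAR_TOKEN = "<VULGAR>"  # hypothetical special token, not from the repository


def mask(tokens, lexicon):
    """Replace every token found in the lexicon with the special token."""
    return [VULGAR_TOKEN if t.lower() in lexicon else t for t in tokens]


def insert(tokens, lexicon):
    """Keep every token and add the special token after each lexicon match."""
    out = []
    for t in tokens:
        out.append(t)
        if t.lower() in lexicon:
            out.append(VULGAR_TOKEN)
    return out


def concatenate(features, tokens, lexicon):
    """Append the number of lexicon matches to an existing feature vector."""
    count = sum(1 for t in tokens if t.lower() in lexicon)
    return list(features) + [float(count)]
```

In this sketch, masking and insertion change the token sequence fed to the model, while concatenation leaves the tokens untouched and extends a feature vector instead.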

First, run clean_data.py to prepare the data set for modeling. By default, clean_data.py reads the original data set from ./data/coling_twitter_data.tsv; if your file is located elsewhere, pass its path with the --data_set flag. The cleaned data is saved to ./data/cleaned_data.tsv.

Example usage: python3 clean_data.py --data_set=./data/coling_twitter_data.tsv
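After cleaning, it can be useful to glance at the first few rows of the output before training. The sketch below is one way to do that with the standard library; the helper name `preview_tsv` is hypothetical, and it makes no assumption about the column layout or whether the file has a header row.

```python
import csv
import os


def preview_tsv(path, n=3):
    """Return the first n rows of a tab-separated file as lists of fields."""
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quote characters inside tweets from being
        # interpreted as field delimiters.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for i, row in enumerate(reader):
            if i >= n:
                break
            rows.append(row)
    return rows


if __name__ == "__main__":
    cleaned = "./data/cleaned_data.tsv"  # output path named in this README
    if os.path.exists(cleaned):
        for row in preview_tsv(cleaned):
            print(row)
```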

After cleaning the data, run bilstm.py.

Required parameters:

  • train=path to training data set
  • validation_data=path to validation data set
  • initial_embed_weights=path to initial embedding weights
  • prefix=prefix to save model

For initial embedding weights, we use 200d CBOW embeddings pre-trained on 50M tweets (Astudillo et al., 2015).

Optional parameters:

  • rnndim=<rnn dimension, default=128>
  • dropout=<dropout rate, default=0.2>
  • maxsentlen=<maximum length of tweets by number of words, default=60>
  • num_cat=<number of categories, default=5>
  • lr=<learning rate, default=0.001>
  • only_testing=<boolean if you only want to load a saved model, default=False>
  • concat=<boolean if using concat method, default=False>
  • insert=<boolean if using insert method, default=False>
  • mask=<boolean if using mask method, default=False>

Example usage: python3 bilstm.py --train=<path> --validation_data=<path> --initial_embed_weights=<path> --test=<path> --prefix=example --concat=True
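To make the parameters and defaults listed above concrete, here is a minimal sketch of a command-line parser that would accept them. It is an assumption about how such flags could be handled with argparse, not a copy of bilstm.py; `str2bool` and `build_parser` are hypothetical names, and `--test` is treated as optional because a test set is described as optional in this README.

```python
import argparse


def str2bool(value):
    """Parse strings such as 'True' or 'False' into real booleans."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    p = argparse.ArgumentParser(description="bi-LSTM sentiment model (sketch)")
    # Required parameters, as listed in this README.
    p.add_argument("--train", required=True)
    p.add_argument("--validation_data", required=True)
    p.add_argument("--initial_embed_weights", required=True)
    p.add_argument("--prefix", required=True)
    # Optional test set (assumed optional).
    p.add_argument("--test", default=None)
    # Optional parameters with the defaults given in this README.
    p.add_argument("--rnndim", type=int, default=128)
    p.add_argument("--dropout", type=float, default=0.2)
    p.add_argument("--maxsentlen", type=int, default=60)
    p.add_argument("--num_cat", type=int, default=5)
    p.add_argument("--lr", type=float, default=0.001)
    for flag in ("only_testing", "concat", "insert", "mask"):
        p.add_argument(f"--{flag}", type=str2bool, default=False)
    return p
```

The explicit `str2bool` converter is used because argparse's `type=bool` would treat any non-empty string, including "False", as true.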

Returns:

  • Saves model as h5 and json files to ./training
  • Prints summary of model

If a test set is provided:

  • Saves predictions of test set to ./training/predictions
  • Prints micro mean absolute error
  • Prints macro mean absolute error
  • Prints per class mean absolute error
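For readers unfamiliar with the three error measures, the sketch below computes them under the usual definitions: micro MAE averages the absolute error over all examples, per class MAE averages it within each gold class, and macro MAE is the unweighted mean of the per class values. Whether bilstm.py uses exactly these definitions is an assumption, and `mae_report` is a hypothetical helper name.

```python
from collections import defaultdict


def mae_report(y_true, y_pred):
    """Return (micro MAE, macro MAE, per-class MAE dict) for ordinal labels."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    micro = sum(errors) / len(errors)

    # Group absolute errors by the gold label of each example.
    by_class = defaultdict(list)
    for t, e in zip(y_true, errors):
        by_class[t].append(e)

    per_class = {c: sum(es) / len(es) for c, es in sorted(by_class.items())}
    macro = sum(per_class.values()) / len(per_class)
    return micro, macro, per_class
```

Macro MAE gives every class equal weight, so it is less dominated by frequent sentiment values than micro MAE.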
