Twitter hashtag prediction


This repository provides a character-level encoder/trainer for social media posts. See the Tweet2Vec paper for details.

Two models from the paper are implemented here: the character-level tweet2vec and a word-level baseline. Each lives in its own directory, together with instructions on how to run it. General information about prerequisites and data format can be found below.


Prerequisites

  • Python 2.7
  • Theano and all dependencies (latest)
  • Lasagne (latest)
  • Numpy
  • Possibly other packages; if an import fails, install the missing package with pip install

Data and Preprocessing

Unfortunately, we are not allowed to release the data used in the paper's experiments due to licensing restrictions. Instead, we describe the data format and preprocessing here:

  1. Preprocessing - We replace HTML tags, usernames, and URLs in the tweet text with special tokens. Hashtags are also removed from the body of a tweet, and re-tweets are discarded. Example code is provided in misc/.

  2. Encoding File Format - If you have a bunch of posts that you want to embed into a vector space, use the scripts provided. The input file must contain one tweet per line (make sure you preprocess these first). An example is provided in misc/encoder_example.txt.

  3. Training File Format - To train the models from scratch, use the scripts provided. The input file must contain one (hashtag, tweet) pair per line, separated by a tab. There should be only one tag per line; for tweets with multiple tags, split them into separate lines. See misc/trainer_example.txt for an example.

  4. Test/Validation File Format - After training the model, you can test it on a held-out set using the scripts provided. The file has the same format as the training file, except that it can have multiple tags per tweet, separated by commas. See misc/tester_example.txt for an example.


If you are running on a CPU, make sure to prefix every command with THEANO_FLAGS=device=cpu,floatX=float32.
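If you launch the scripts from Python rather than from a shell, the same flags can be passed through the environment of the child process. The helper below is a minimal sketch; the function name `theano_cpu_env` is hypothetical and not part of this repository.

```python
import os


def theano_cpu_env(base_env=None):
    """Return a copy of the environment with THEANO_FLAGS set for CPU runs."""
    env = dict(os.environ if base_env is None else base_env)
    env["THEANO_FLAGS"] = "device=cpu,floatX=float32"
    return env
```

The returned dictionary can be handed to `subprocess.call(..., env=theano_cpu_env())` when starting one of the training or encoding scripts. The calling process's own environment is left unchanged.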


Contributors

Bhuwan Dhingra, Dylan Fitzpatrick, Zhong Zhou, Michael Muehl. Special thanks to Yun Fu for the preprocessing JAR file.

If you end up using this code, please cite the following paper:

Dhingra, Bhuwan, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W. Cohen. "Tweet2Vec: Character-Based Distributed Representations for Social Media." ACL (2016).

  author    = {Dhingra, Bhuwan  and  Zhou, Zhong  and  Fitzpatrick, Dylan  and  Muehl, Michael  and  Cohen, William},
  title     = {Tweet2Vec: Character-Based Distributed Representations for Social Media},
  booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {269--274},
  url       = {}

Report bugs and missing info to bdhingraATandrewDOTcmuDOTedu (replace AT, DOT appropriately).