aditya parashar edited this page Dec 23, 2016 · 2 revisions

Welcome to the NLP-project wiki! Reddit released a dataset containing all of its ~1.7 billion publicly available comments. The full dataset is an unwieldy 1+ terabyte uncompressed, so I used a subset of the comments for my project, which still contains 54 million comments, plenty for exercising natural language processing. The subset I used, provided by Kaggle, is here. A brief description of the dataset: the database has one table, May2015, with the following fields:

  • created_utc
  • ups
  • subreddit_id
  • link_id
  • name
  • score_hidden
  • author_flair_css_class
  • author_flair_text
  • subreddit
  • id
  • removal_reason
  • gilded
  • downs
  • archived
  • author
  • score
  • retrieved_on
  • body
  • distinguished
  • edited
  • controversiality
  • parent_id
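Since the Kaggle download is a single SQLite database with one table, the fields above can be queried directly with Python's built-in `sqlite3` module. The sketch below is a minimal, hypothetical example (the function name `top_subreddits` and the choice of query are mine, not part of the dataset); it counts comments per subreddit using the `subreddit` field:

```python
import sqlite3

def top_subreddits(conn, limit=5):
    """Return (subreddit, comment_count) pairs, most active first.

    Assumes `conn` is an open sqlite3 connection to the Kaggle dump,
    which has a single table named May2015 with the fields listed above.
    """
    cur = conn.execute(
        "SELECT subreddit, COUNT(*) AS n FROM May2015 "
        "GROUP BY subreddit ORDER BY n DESC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()
```

To use it against the downloaded file (assuming it is saved as `database.sqlite`), open a connection first, e.g. `top_subreddits(sqlite3.connect("database.sqlite"))`. The same pattern works for pulling the `body` field in batches for NLP preprocessing.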