Reddit corpus reader? #13

honnibal · 2016-05-05T21:31:08Z

I've thought for a while about how to give people a small reddit corpus reader. I don't want to start a spacy.corpora package, but maintaining a reddit_corpus package is sort of annoying. Maybe this is a good place for it?

I usually just do something like this:

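import bz2
import re

import ujson

def iter_comments(loc, limit=-1):
    # Stream comment bodies from a bz2 file holding one JSON object per line.
    # Yields at most `limit` comments; a negative limit means no limit.
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            if i == limit:
                break
            yield ujson.loads(line)['body']

# Markdown link whose target is the literal placeholder %%URL (not used in strip_meta).
url_re = re.compile(r'\[([^]]+)\]\(%%URL\)')
# Markdown link with an http(s) target; group 1 captures the link text.
link_re = re.compile(r'\[([^]]+)\]\(https?://[^\)]+\)')
space_re = re.compile(r'\s+')

def strip_meta(text):
    # Replace markdown links with their link text.
    text = link_re.sub(r'\1', text)
    # Collapse runs of whitespace into a single space.
    text = space_re.sub(' ', text)
    # Unescape the HTML entities for angle brackets.
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    # Drop markdown markers for code, emphasis, and strikethrough.
    text = text.replace('`', '').replace('*', '').replace('~', '')
    return text.strip()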
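
To see how the two helpers fit together, here is a minimal, self-contained sketch that writes two made-up comments to a temporary bz2 file and streams them back through the cleaner. It assumes, as the snippet above implies, that the dump holds one JSON object per line with a 'body' field. The sample records and the file name are hypothetical, and the standard-library json module stands in for ujson so that the sketch has no third-party dependency.

```python
import bz2
import json
import os
import re
import tempfile

link_re = re.compile(r'\[([^]]+)\]\(https?://[^\)]+\)')
space_re = re.compile(r'\s+')

def iter_comments(loc, limit=-1):
    # Yield at most `limit` comment bodies; a negative limit means no limit.
    with bz2.open(loc, 'rt', encoding='utf-8') as file_:
        for i, line in enumerate(file_):
            if i == limit:
                break
            yield json.loads(line)['body']

def strip_meta(text):
    text = link_re.sub(r'\1', text)        # keep link text, drop the URL
    text = space_re.sub(' ', text)         # collapse whitespace runs
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = text.replace('`', '').replace('*', '').replace('~', '')
    return text.strip()

# Hypothetical sample data: two invented comments, one JSON object per line.
sample = [
    {'body': 'See [the docs](https://example.com/docs) for *details*.'},
    {'body': '&gt; quoted   text\n\nand a `code` reply'},
]
path = os.path.join(tempfile.mkdtemp(), 'sample_comments.bz2')
with bz2.open(path, 'wt', encoding='utf-8') as f:
    for record in sample:
        f.write(json.dumps(record) + '\n')

cleaned = [strip_meta(body) for body in iter_comments(path)]
first_only = [strip_meta(body) for body in iter_comments(path, limit=1)]
print(cleaned)
# ['See the docs for details.', '> quoted text and a code reply']
```

Because iter_comments is a generator, nothing beyond the current line is held in memory, which matters for a dump the size of the full comment archive.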

bdewilde · 2016-05-05T22:09:01Z

Hey @honnibal! Absolutely, this would be a great thing to include. Could you point me to where you download the raw data from? I'll write code to parse and stream it from disk.

honnibal · 2016-05-08T09:35:58Z

https://archive.org/details/2015_reddit_comments_corpus

bdewilde · 2016-05-31T21:02:58Z

Hi @honnibal, quick question for you: In the code snippet you posted, there are url_re and link_re regexes, which are similar, but you only use the latter. Was the former just a typo, or does it serve some other purpose? (I'm finally getting around to writing a RedditReader class.) Thanks!

bdewilde added the enhancement label May 5, 2016

bdewilde mentioned this issue Jun 1, 2016

consistent corpora readers for reddit and wikipedia #18

Merged

bdewilde closed this as completed Jun 20, 2016
