Reddit corpus reader? #13

Closed
honnibal opened this issue May 5, 2016 · 3 comments

honnibal commented May 5, 2016

I've thought for a while about how to give people a small Reddit corpus reader. I don't want to start a spacy.corpora package, but maintaining a separate reddit_corpus package is sort of annoying. Maybe this is a good place for it?

I usually just do something like this:

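import bz2
import re

import ujson


def iter_comments(loc, limit=-1):
    # Stream comment bodies from a bz2 file with one JSON object per line.
    # A negative limit means "no limit"; otherwise stop after `limit` comments.
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            if limit >= 0 and i >= limit:
                break
            yield ujson.loads(line)['body']


url_re = re.compile(r'\[([^]]+)\]\(%%URL\)')
link_re = re.compile(r'\[([^]]+)\]\(https?://[^\)]+\)')
space_re = re.compile(r'\s+')


def strip_meta(text):
    # Replace markdown links with their anchor text, then collapse whitespace.
    text = link_re.sub(r'\1', text)
    text = space_re.sub(' ', text)
    # Unescape angle brackets and drop markdown markers.
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = text.replace('`', '').replace('*', '').replace('~', '')
    return text.strip()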
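
For illustration, here is a minimal end-to-end sketch of the two helpers above. It makes a few assumptions that are not in the original snippet: it uses the standard-library json module in place of ujson, it treats `limit` as a maximum number of comments, and the sample records and temporary file are invented for the example.

```python
import bz2
import json
import os
import re
import tempfile

link_re = re.compile(r'\[([^]]+)\]\(https?://[^\)]+\)')
space_re = re.compile(r'\s+')


def iter_comments(loc, limit=-1):
    # Open in text mode so each line is a str; a negative limit means no limit.
    with bz2.open(loc, 'rt', encoding='utf8') as file_:
        for i, line in enumerate(file_):
            if limit >= 0 and i >= limit:
                break
            yield json.loads(line)['body']


def strip_meta(text):
    text = link_re.sub(r'\1', text)
    text = space_re.sub(' ', text)
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = text.replace('`', '').replace('*', '').replace('~', '')
    return text.strip()


# Hypothetical records, shaped like the corpus lines (one JSON object per line).
records = [
    {'body': 'See [the docs](https://example.com) for *details*.'},
    {'body': '&gt; quoted\n\nreply with `code`'},
    {'body': 'third comment'},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.bz2')
    with bz2.open(path, 'wt', encoding='utf8') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')
    # Read back only the first two comments and clean them.
    cleaned = [strip_meta(body) for body in iter_comments(path, limit=2)]

print(cleaned)
```

The generator keeps memory use flat regardless of file size, since only one line is decoded at a time.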

bdewilde commented May 5, 2016

Hey @honnibal! Absolutely, this would be a great thing to include. Could you point me to where you download the raw data from? I'll write code to parse and stream it from disk.


honnibal commented May 8, 2016

https://archive.org/details/2015_reddit_comments_corpus

bdewilde commented

Hi @honnibal, quick question for you: In the code snippet you posted, there are url_re and link_re regexes, which are similar, but you only use the latter. Was the former just a typo, or does it serve some other purpose? (I'm finally getting around to writing a RedditReader class.) Thanks!
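
To make the difference between the two patterns concrete, here is a small sketch of what each one matches. The sample strings are invented for illustration, and the thread does not say what url_re was intended for; the check below only shows that it matches a link whose target is the literal text %%URL, while link_re matches a link with an http or https target.

```python
import re

url_re = re.compile(r'\[([^]]+)\]\(%%URL\)')
link_re = re.compile(r'\[([^]]+)\]\(https?://[^\)]+\)')

# Hypothetical inputs: one link with a literal placeholder, one with a real URL.
with_placeholder = 'see [this post](%%URL)'
with_real_url = 'see [this post](https://example.com/a)'

# Each tuple is (matched by url_re, matched by link_re).
placeholder_hits = (bool(url_re.search(with_placeholder)),
                    bool(link_re.search(with_placeholder)))
real_url_hits = (bool(url_re.search(with_real_url)),
                 bool(link_re.search(with_real_url)))

print(placeholder_hits, real_url_hits)
```

Both patterns capture the anchor text in group 1, so either could be used with a `\1` substitution to keep the link text and drop the target.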
