I've thought for a while about how to give people a small reddit corpus reader. I don't want to start a spacy.corpora package, but maintaining a reddit_corpus package is sort of annoying. Maybe this is a good place for it?
I usually just do something like this:
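import bz2
import re
import ujson

def iter_comments(loc, limit=-1):
    # Stream comment bodies from a bz2 file, one JSON object per line.
    # limit=-1 (the default) reads the whole file.
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            if i == limit:
                break
            yield ujson.loads(line)['body']

url_re = re.compile(r'\[([^]]+)\]\(%%URL\)')
link_re = re.compile(r'\[([^]]+)\]\(https?://[^\)]+\)')
space_re = re.compile(r'\s+')

def strip_meta(text):
    # Replace Markdown links with their link text.
    text = link_re.sub(r'\1', text)
    # Collapse runs of whitespace into a single space.
    text = space_re.sub(' ', text)
    # Unescape the HTML entities for > and <.
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    # Drop Markdown emphasis, code and strikethrough markers.
    text = text.replace('`', '').replace('*', '').replace('~', '')
    return text.strip()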
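For anyone who wants to try the two helpers together, here is a minimal sketch rather than a finished reader. It assumes the dump is a bz2 file with one JSON object per line and a 'body' field, as in the snippet above. It swaps in the standard-library json module for ujson so it runs without extra installs, the name iter_clean_comments is hypothetical, and the sample records are made up for the demonstration.

```python
import bz2
import json
import os
import re
import tempfile

link_re = re.compile(r'\[([^]]+)\]\(https?://[^\)]+\)')
space_re = re.compile(r'\s+')


def strip_meta(text):
    # Same cleanup steps as the snippet above.
    text = link_re.sub(r'\1', text)
    text = space_re.sub(' ', text)
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = text.replace('`', '').replace('*', '').replace('~', '')
    return text.strip()


def iter_clean_comments(loc, limit=-1):
    # Hypothetical helper: stream comment bodies and clean each one.
    # limit=-1 means "no limit".
    with bz2.open(loc, 'rt', encoding='utf8') as file_:
        for i, line in enumerate(file_):
            if i == limit:
                break
            yield strip_meta(json.loads(line)['body'])


# Made-up sample records, written to a temporary .bz2 file.
samples = [
    {'body': 'See [the docs](https://example.com/docs) for *details*.'},
    {'body': '&gt; quoted   text\n\nreply'},
    {'body': 'third comment'},
]
fd, path = tempfile.mkstemp(suffix='.bz2')
os.close(fd)
with bz2.open(path, 'wt', encoding='utf8') as f:
    for record in samples:
        f.write(json.dumps(record) + '\n')

cleaned = list(iter_clean_comments(path, limit=2))
os.remove(path)
print(cleaned)
```

Because the reader is a generator, only one line is decompressed and parsed at a time, so it should cope with dumps that do not fit in memory.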
Hey @honnibal! Absolutely, this would be a great thing to include. Could you point me to where you download the raw data from? I'll write code to parse and stream it from disk.
Hi @honnibal, quick question for you: In the code snippet you posted, there are url_re and link_re regexes, which are similar, but you only use the latter. Was the former just a typo, or does it serve some other purpose? (I'm finally getting around to writing a RedditReader class.) Thanks!