scrape mode should not duplicate tweets #7

Closed
nichtich opened this issue Dec 13, 2013 · 6 comments

Comments

@nichtich

After running twarc for two days I analyzed the output and found that it downloads the same tweets over and over again. The script should hold a set of known tweet ids and only emit tweets that have not been written before.
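A minimal sketch of the suggestion above, not twarc's actual code: `emit_new_tweets` and `seen_ids` are hypothetical names, and the sketch assumes each tweet is a dict carrying an `id_str` field, as in Twitter's JSON. Scrape mode may expose the id differently.

```python
import json


def emit_new_tweets(tweets, seen_ids, out):
    """Write each tweet as one JSON line unless its id was already written.

    tweets   -- iterable of dicts, each assumed to have an "id_str" key
    seen_ids -- set of ids already emitted; updated in place
    out      -- any object with a write() method (file, StringIO, ...)
    Returns the number of tweets written by this call.
    """
    written = 0
    for tweet in tweets:
        tweet_id = tweet["id_str"]
        if tweet_id in seen_ids:
            # Already written during an earlier pass; skip the duplicate.
            continue
        seen_ids.add(tweet_id)
        out.write(json.dumps(tweet) + "\n")
        written += 1
    return written
```

The same `seen_ids` set would have to be reused across every polling pass for this to help. It also grows for as long as the script runs, so a multi-day run might need to cap it or persist it to disk.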

@edsu
Member

edsu commented Dec 13, 2013

That's weird, it was designed to not behave that way, and I've seen it working properly in the past. Can you share your log file?

@ruebot
Member

ruebot commented Dec 18, 2013

Hey! I've been noticing this same exact problem. You can see a whole bunch of duplicate tweets here. I've also gist'd up my twarc.log, and this is how I ran it: `./twarc.py --scrape "#freedaleaskey"`. Finally, here is the output of a grep query on the file for a particular line.

@edsu
Member

edsu commented Dec 18, 2013

@ruebot thanks for replicating the bug; I haven't had time to look into it yet, but will do so shortly.

@ruebot
Member

ruebot commented Dec 18, 2013

Thanks! @edsu++

@edsu
Member

edsu commented Dec 18, 2013

I just ran a simple test with `twarc.py "#code4lib"` and did not see duplication. @nichtich were you using `--scrape` mode by any chance, like @ruebot?

@nichtich
Author

Yes, my query was `./twarc.py --scrape '@nichtich'`. I just confirmed the bug on a new installation.

@edsu edsu closed this as completed in ac14b9d Dec 19, 2013