Consistent Data Loss over Multiple Pulls #22

Closed
thb5018 opened this issue Feb 6, 2020 · 5 comments

Comments

thb5018 commented Feb 6, 2020

I ran 14 datasets through the tool and each returned a dataset with roughly 33% data loss. However, I've noticed that screenshots of other pulls have differing values. Is my consistent loss due to my Developer Account/App permissions or is it just chance and simply due to the dataset?

Member

edsu commented Feb 6, 2020

33% is high, but not unheard of. If you like, I can verify this if you share the tweet IDs for one of the datasets.

Author

thb5018 commented Feb 6, 2020

Sure thing. I appreciate you taking a look. Thanks!
TweetIds_640k.txt

Member

edsu commented Feb 7, 2020

@thb5018 Here's what I got using twarc instead of Hydrator: 411,724 tweets out of 639,716, or about 64.4% hydration. How does that compare to what you saw with Hydrator?

Author

thb5018 commented Feb 7, 2020

@edsu That is pretty similar. I had 411,909 and 411,908 on two different requests. It appears that it may just be that the collection is somewhat volatile. Thanks for running it through twarc. I wanted to try that, but had trouble inputting my keys into the program.
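
For anyone comparing runs the same way, here is a minimal Python sketch of the arithmetic, using only the counts quoted in this thread. The function name `hydration_rate` and the run labels are made up for illustration; nothing here is part of Hydrator or twarc.

```python
def hydration_rate(hydrated: int, requested: int) -> float:
    """Return the fraction of requested tweet IDs that came back hydrated."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return hydrated / requested


# Counts reported in this thread; the labels are illustrative only.
REQUESTED = 639_716
RUNS = {
    "twarc": 411_724,
    "request 1": 411_909,
    "request 2": 411_908,
}

for name, hydrated in RUNS.items():
    rate = hydration_rate(hydrated, REQUESTED)
    # Print the share that was hydrated and the share that is missing.
    print(f"{name}: {rate:.1%} hydrated, {1 - rate:.1%} missing")
```

The three runs differ by fewer than 200 tweets out of roughly 640,000 IDs, which fits the idea that the gap comes from the collection itself rather than from the tool or the account.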

Member

edsu commented Feb 7, 2020

Ok, I'm glad that things seem to be similar. One thing you can do if you are interested in working with the original data is privately reach out to the person who collected it and see if they are willing to share it with you for research purposes. Let me know if you have trouble figuring out contact information if the dataset is in the DocNow Catalog.

edsu closed this as completed Feb 7, 2020