converting to csv before completion #34
Comments
Hi @blas-ko -- it really ought to be possible to hydrate directly to CSV for people who don't want the JSON at all. Also, I think it would help in situations like this to write to a compressed file. Unfortunately, until #15 is resolved you will notice degraded performance as Hydrator works its way into very large tweet id files. Hopefully it will be resolved soon though. Since you are using Linux, I'm guessing you may have some familiarity with the command line? If that is the case, you can try our other tool, twarc, which is better suited to working with large files and gives you more control over how things are written. For example, you can hydrate your ids and write them as a gzip-compressed CSV file with the following command:
If you run this on a dedicated VM in a tmux or screen session, you can let it run for as long as it needs to. You can also sample the output as it is written by streaming the data to another program:
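For readers who already have line-oriented JSON on disk, the idea of writing a gzip-compressed CSV can be sketched with the Python standard library alone. This is only an illustration and not the twarc command referred to above: the function name `jsonl_to_gzip_csv` is hypothetical, and the default field names (`id_str`, `created_at`, `full_text`) are assumptions about what each JSON object contains.

```python
import csv
import gzip
import json


def jsonl_to_gzip_csv(lines, out_path, fields=("id_str", "created_at", "full_text")):
    """Write one CSV row per JSON line and return the number of rows written.

    `lines` is any iterable of strings, each assumed to hold one JSON object.
    The field names are assumptions; missing fields become empty cells.
    """
    written = 0
    # "wt" opens the gzip stream in text mode; newline="" is what csv expects.
    with gzip.open(out_path, "wt", encoding="utf-8", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(fields)
        for line in lines:
            line = line.strip()
            if not line:
                continue
            try:
                tweet = json.loads(line)
            except json.JSONDecodeError:
                # A file that is still being written may end in a partial line.
                continue
            writer.writerow([tweet.get(name, "") for name in fields])
            written += 1
    return written
```

Because it accepts any iterable of lines, the function can be fed an open file or `sys.stdin`, so it can also run on a JSON file that is still growing; unparseable partial lines are skipped rather than treated as errors.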
Hey @edsu! Thanks for pointing me to twarc! It's great :). I've been able to install twarc on my personal computer successfully. However, on the server I have access to (where I don't have root privileges) I have only been able to import twarc from Python, but not run it from the command line (I get a Should this be addressed in a separate issue? Let me know! Thanks!
@blas-ko Did you install on your server with
@blas-ko Just following up, if you did install with
(omit the 3 from python3 if you are using another version) I suspect that the directory you see in the output is not in your PATH. If you add it to your PATH then typing twarc on the command line will work. Let me know if you need any help adjusting your PATH to include that directory.
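A quick way to check the PATH theory from inside Python is sketched below. It assumes a Unix-style layout in which per-user installs place scripts in `<user base>/bin`; the helper name `find_user_script` is hypothetical, and this is not the command mentioned in the comment above.

```python
import os
import shutil
import site


def find_user_script(name="twarc"):
    """Return (path found on PATH or None, user scripts dir, whether that dir is on PATH)."""
    # shutil.which mimics the shell's lookup of a command name on PATH.
    on_path = shutil.which(name)
    # Assumption: on Unix, per-user installs put scripts in <user base>/bin.
    user_bin = os.path.join(site.getuserbase(), "bin")
    path_dirs = os.environ.get("PATH", "").split(os.pathsep)
    return on_path, user_bin, user_bin in path_dirs


if __name__ == "__main__":
    found, user_bin, listed = find_user_script()
    print("twarc on PATH:", found)
    print("user scripts directory:", user_bin, "(on PATH)" if listed else "(not on PATH)")
```

If the first value is `None` and the directory is reported as not on PATH, adding that directory to PATH is the likely fix.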
Hey @edsu! I installed it both as normally I tried the I found the file in
Any ideas on what could be happening? Maybe I shouldn't have created an alias? Sorry for all the mess I'm making!
Wow, that's a new one! I am glad you figured out where the command was installed. It looks like your operating system's default encoding is not UTF-8, which is unusual these days but not unheard of. I recall you are working on a shared system? If so, you may not have control over the default encoding. Could you try setting this in your shell before you run twarc?
If that works you might want to add it to your ~/.profile so you don't have to remember to do it every time you open a new terminal session.
https://docs.python.org/3.8/using/cmdline.html#environment-variables
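To confirm whether the default encoding really is something other than UTF-8, a small diagnostic like the following can be run on the server. It is a sketch using only standard-library calls; the helper names `report_encodings` and `is_utf8` are hypothetical.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python picked up from the environment."""
    return {
        "preferred": locale.getpreferredencoding(False),
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
    }


def is_utf8(name):
    """True if an encoding name is some spelling of UTF-8 (utf8, UTF-8, utf_8)."""
    return bool(name) and name.lower().replace("-", "").replace("_", "") == "utf8"


if __name__ == "__main__":
    for label, encoding in report_encodings().items():
        print(f"{label}: {encoding} ({'ok' if is_utf8(encoding) else 'not UTF-8'})")
```

Running it before and after changing the shell environment shows whether the setting took effect.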
It worked!!!
Nice, please feel free to open new issues here or over in the twarc repository if you run into more issues.
Hey again!
The Hydrator is working great on Linux. However, it would be great to have the option to convert the JSON files to CSV at any point during hydration, not only when it has completed. I'm trying to hydrate 40+ million tweets, and the .json file takes up a lot of space that I'll just send to the trash after completion.
Another option is to split the hydration into chunks of customizable size (or number of tweets) and to be able to convert each chunk to CSV instead of waiting for the whole thing to finish.
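Until something like this exists in the tool, the chunking idea can be approximated outside of it by splitting the tweet id file beforehand. The sketch below is not a Hydrator feature; it assumes one id per line, and the names `chunked`, `split_id_file`, and the `chunk-00000.txt` file naming are hypothetical.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def split_id_file(path, size, prefix="chunk"):
    """Split an id file (assumed: one id per line) into numbered chunk files."""
    paths = []
    with open(path, encoding="utf-8") as src:
        # Read lazily so a file with tens of millions of ids is never held in memory.
        ids = (line.strip() for line in src if line.strip())
        for number, chunk in enumerate(chunked(ids, size)):
            out_path = f"{prefix}-{number:05d}.txt"
            with open(out_path, "w", encoding="utf-8") as out:
                out.write("\n".join(chunk) + "\n")
            paths.append(out_path)
    return paths
```

Each chunk file can then be hydrated and converted on its own, so the intermediate JSON for one chunk can be deleted before the next one starts.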
Not sure if this is easy to do or not, let me know if I can help somehow.
Blas