converting to csv before completion #34

Closed
blas-ko opened this issue Apr 20, 2020 · 8 comments

Comments

@blas-ko

blas-ko commented Apr 20, 2020

Hey again!

The hydrator is working great on Linux. However, it would be great to have the option to convert the JSON files to CSV at any point during the hydration, not only when it's completed. I'm trying to hydrate 40+ M tweets, and the .json file, which I'll delete after completion, takes up a lot of space.

Another option is to split the hydration into chunks of customizable size (or number of tweets) and be able to convert each chunk to CSV instead of waiting for the whole thing to finish.

Not sure if this is easy to do or not, let me know if I can help somehow.

Blas
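
To make the chunking idea concrete, here is a minimal sketch of splitting a stream of tweet ids into fixed-size batches. It is not how Hydrator works; the function name `chunked` and the chunk size are made up for illustration, and each batch would still need to be hydrated and converted separately.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` non-empty tweet ids.

    Hypothetical helper for illustration; not part of Hydrator or twarc.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    # Strip whitespace and skip blank lines so a trailing newline is harmless.
    cleaned = (line.strip() for line in ids if line.strip())
    while True:
        chunk = list(islice(cleaned, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Stand-in for the lines of an ids file, one tweet id per line.
    sample = ["1\n", "2\n", "\n", "3\n", "4\n", "5\n"]
    for number, chunk in enumerate(chunked(sample, 2)):
        print(number, chunk)
```

Because the ids are consumed lazily, the same function works on an open file handle without loading all 40+ M ids into memory.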

@edsu
Member

edsu commented Apr 21, 2020

Hi @blas-ko -- it really ought to be possible to hydrate directly to CSV for people who don't want the JSON at all. Also, I think it would help in situations like this to write to a compressed file.

Unfortunately, until #15 is resolved you will notice degraded performance as Hydrator works its way into very large tweet id files. Hopefully it will be resolved soon though.

Since you are using Linux, I'm guessing you may have some familiarity with the command line? If so, you can try our other tool, twarc, which is better suited to large files and gives you more control over how things are written.

For example, you can hydrate your ids and write them to a gzip-compressed CSV file with the following command:

twarc --format csv hydrate ids.txt | gzip - > tweets.csv.gz
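
For anyone who would rather do this from Python than from a shell pipeline, the following is a rough sketch of writing tweets straight to a gzip-compressed CSV using only the standard library. It does not reproduce the columns of twarc's own CSV output. The function name `write_csv_gz` and the chosen field names are assumptions for illustration, and the `tweets` iterable is assumed to come from something like twarc's `Twarc.hydrate`; here it is replaced by hand-made dictionaries so the example runs without credentials.

```python
import csv
import gzip
import os
import tempfile
from typing import Any, Dict, Iterable, Sequence


def write_csv_gz(tweets: Iterable[Dict[str, Any]], path: str,
                 fields: Sequence[str]) -> int:
    """Write the selected fields of each tweet to a gzip-compressed CSV.

    Hypothetical helper for illustration. Returns the number of rows written.
    """
    count = 0
    # newline="" lets the csv module control line endings itself.
    with gzip.open(path, "wt", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        for tweet in tweets:
            # Missing fields become empty cells; other keys are dropped.
            writer.writerow({name: tweet.get(name, "") for name in fields})
            count += 1
    return count


if __name__ == "__main__":
    # Hand-made stand-ins for hydrated tweets; field names are illustrative.
    fake_tweets = [
        {"id_str": "1", "full_text": "first tweet", "lang": "en"},
        {"id_str": "2", "full_text": "second tweet"},
    ]
    out_path = os.path.join(tempfile.mkdtemp(), "tweets.csv.gz")
    written = write_csv_gz(fake_tweets, out_path, ["id_str", "full_text", "lang"])
    print(written, "rows written to", out_path)
```

Since rows are written one at a time, no intermediate JSON file is needed and memory use stays flat regardless of how many tweets are hydrated.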

If you run this on a dedicated vm in a tmux or screen session you can let it run for as long as it needs to. You can also sample it as it is written by just streaming the data to another program:

zcat tweets.csv.gz | analyze.jl
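
As a stand-in for the analysis program at the end of that pipe, here is a small sketch that reads the compressed CSV row by row and tallies one column. The function name `count_by_column` and the column names in the demo are assumptions. A file that is still being written has no gzip end-of-stream marker yet, which Python reports as `EOFError`, so the sketch simply keeps whatever it has tallied up to that point.

```python
import csv
import gzip
import os
import tempfile
from collections import Counter


def count_by_column(path: str, column: str) -> Counter:
    """Tally the values of one CSV column, reading the gzip file row by row.

    Hypothetical helper for illustration.
    """
    counts: Counter = Counter()
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        try:
            for row in csv.DictReader(handle):
                # A missing or empty cell is counted under the empty string.
                counts[row.get(column) or ""] += 1
        except EOFError:
            # Truncated gzip stream (e.g. still being written):
            # keep what was tallied so far.
            pass
    return counts


if __name__ == "__main__":
    # Build a tiny compressed CSV so the example runs on its own.
    demo_path = os.path.join(tempfile.mkdtemp(), "tweets.csv.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8", newline="") as out:
        out.write("id_str,lang\n1,en\n2,es\n3,en\n")
    print(count_by_column(demo_path, "lang"))
```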

@blas-ko
Author

blas-ko commented Apr 22, 2020

Hey @edsu! Thanks for pointing me to twarc! It's great :).

I've been able to install twarc on my personal computer successfully. However, on the server I have access to (where I don't have root privileges) I have only been able to import twarc from Python, not run it from the command line (I get a Command 'twarc' not found message).

Should this be addressed in a separate issue? Let me know! Thanks!

@edsu
Member

edsu commented Apr 22, 2020

@blas-ko Did you install on your server with pip install --user twarc?

@edsu
Member

edsu commented Apr 22, 2020

@blas-ko Just following up, if you did install with --user you should be able to find the twarc executable in your "user base". This location is platform dependent, but you can find it by running this at the command line:

python3 -m site --user-base

(omit the 3 from python3 if you are using another version)

I suspect that the directory you see on output is not in your PATH. If you add it to your PATH then typing twarc on the command line will work. Let me know if you need any help adjusting your PATH to include that directory.
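
The same check can be done from Python, which may be handy since importing works on that server. This is a sketch that assumes a POSIX-style layout, where user-installed scripts live in a `bin` directory under the user base; the helper names `user_scripts_dir` and `is_on_path` are made up for illustration.

```python
import os
import site


def user_scripts_dir() -> str:
    """Return the scripts directory under the user base.

    Assumes a POSIX-style layout (<user base>/bin); hypothetical helper.
    """
    return os.path.join(site.getuserbase(), "bin")


def is_on_path(directory: str, path_value: str) -> bool:
    """Check whether `directory` is one of the entries of a PATH-style string."""
    target = os.path.normpath(directory)
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(
        os.path.normpath(os.path.expanduser(entry)) == target
        for entry in entries
    )


if __name__ == "__main__":
    scripts = user_scripts_dir()
    print("scripts directory:", scripts)
    print("on PATH:", is_on_path(scripts, os.environ.get("PATH", "")))
```

If the second line prints `False`, adding that directory to `PATH` in your shell startup file should make the `twarc` command visible.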

@blas-ko
Author

blas-ko commented Apr 23, 2020

Hey @edsu! I installed it both the normal way, pip3 install twarc, and with the user flag, pip3 install --user twarc.

I tried python3 -m site --user-base with no success.

I found the file in ~/.local/bin/twarc and created an alias to it, and now I'm able to run it from the command line. However, when I try to configure twarc via twarc configure, I get the following error:

Traceback (most recent call last):
  File "/home/kolic/.local/bin/twarc", line 11, in <module>
    sys.exit(main())
  File "/home/kolic/.local/lib/python3.6/site-packages/twarc/command.py", line 219, in main
    t.configure()
  File "/home/kolic/.local/lib/python3.6/site-packages/twarc/client.py", line 939, in configure
    print('\n\u2728 \u2728 \u2728  Happy twarcing! \u2728 \u2728 \u2728\n')
UnicodeEncodeError: 'ascii' codec can't encode character '\u2728' in position 1: ordinal not in range(128)

Any ideas on what could be happening? Maybe I shouldn't have done an alias?

Sorry for all the mess I'm making!

@edsu
Member

edsu commented Apr 23, 2020

Wow, that's a new one! I am glad you figured out where the command was installed. It looks like your operating system's default encoding is not UTF-8, which is unusual these days but not unheard of.

I recall you are working on a shared system? So you may not have control over the default encoding. Could you try setting this in your shell before you run twarc?

export PYTHONIOENCODING="utf-8"

If that works you might want to add it to your ~/.profile so you don't have to remember to do it every time you open a new terminal session.

https://docs.python.org/3.8/using/cmdline.html#environment-variables
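
To see why the setting helps, here is a small sketch that imitates what `print()` does when standard output uses a given encoding. `PYTHONIOENCODING` overrides the encoding Python picks for stdin, stdout and stderr, so switching it from ASCII to UTF-8 is what lets the sparkles character through. The helper name `encode_like_stdout` is made up for illustration, and an in-memory buffer stands in for the real terminal.

```python
import io

# The same kind of text that twarc's configure step printed in the traceback.
SPARKLES = "\u2728 Happy twarcing! \u2728"


def encode_like_stdout(text: str, encoding: str) -> bytes:
    """Write text through a text stream that uses `encoding`.

    Hypothetical helper: an in-memory buffer stands in for stdout.
    Raises UnicodeEncodeError if the encoding cannot represent the text.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


if __name__ == "__main__":
    for encoding in ("ascii", "utf-8"):
        try:
            # Printing the bytes repr keeps the demo safe on ASCII terminals.
            print(encoding, "->", encode_like_stdout(SPARKLES, encoding))
        except UnicodeEncodeError as error:
            print(encoding, "-> failed:", error.reason)
```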

@blas-ko
Author

blas-ko commented Apr 24, 2020

It worked!!!
Thanks a ton! :)

@edsu
Member

edsu commented Apr 24, 2020

Nice, please feel free to open new issues here or over in the twarc repository if you run into more issues.

edsu closed this as completed Apr 24, 2020