converting to csv before completion #34

blas-ko · 2020-04-20T19:01:41Z

Hey again!

Hydrator is working great on Linux. However, it would be great to have the option to convert the JSON file to CSV at any point during hydration, not only once it has completed. I'm trying to hydrate 40+ M tweets, and the .json file takes up a lot of space even though I'll delete it after completion.

Another option is to split the hydration into chunks of customizable size (or number of tweets) and to be able to convert each chunk to CSV instead of waiting for the whole thing to finish.

Not sure if this is easy to do or not, let me know if I can help somehow.

Blas

edsu · 2020-04-21T15:17:28Z

Hi @blas-ko -- it really ought to be possible to hydrate directly to CSV for people who don't want the JSON at all. Also, I think it would help in situations like this to write to a compressed file.

Unfortunately until #15 is resolved you will notice degraded performance as Hydrator works its way into very large tweet id files. Hopefully it will be resolved soon though.

Since you are using Linux I'm guessing you may have some familiarity with the command line? If that is the case, for working with large files, and having more control over how things are written you can try our other tool twarc.

For example if you want you can hydrate your ids and write them as a gzip compressed CSV file with the following command:

twarc --format csv hydrate ids.txt | gzip - > tweets.csv.gz

If you run this on a dedicated vm in a tmux or screen session you can let it run for as long as it needs to. You can also sample it as it is written by just streaming the data to another program:

zcat tweets.csv.gz | analyze.jl
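On the chunking idea from the original request, here is a minimal Python sketch, assuming ids.txt holds one tweet id per line. The helper names `chunked` and `write_chunks`, the file-name prefix, and the chunk size are illustrative only and are not part of twarc or Hydrator.

```python
from itertools import islice


def chunked(lines, size):
    """Yield lists of at most `size` non-empty, stripped lines."""
    if size < 1:
        raise ValueError("size must be at least 1")
    cleaned = (line.strip() for line in lines if line.strip())
    while True:
        batch = list(islice(cleaned, size))
        if not batch:
            return
        yield batch


def write_chunks(ids_path, size, prefix="ids-chunk"):
    """Write each batch to prefix-00000.txt, prefix-00001.txt, ... (hypothetical naming)."""
    paths = []
    with open(ids_path) as ids:
        for n, batch in enumerate(chunked(ids, size)):
            path = f"{prefix}-{n:05d}.txt"
            with open(path, "w") as out:
                out.write("\n".join(batch) + "\n")
            paths.append(path)
    return paths
```

Each resulting file could then be passed to the twarc command above in place of ids.txt, so every chunk ends up as its own compressed CSV.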

blas-ko · 2020-04-22T09:12:54Z

Hey @edsu! Thanks for pointing me to twarc! It's great :).

I've been able to install twarc on my personal computer successfully. However, on the server I have access to (where I don't have root privileges), I have only been able to import twarc from Python, not run it from the command line (I get a Command 'twarc' not found message).

Should this be addressed in a separate issue? Let me know! Thanks!

edsu · 2020-04-22T12:16:39Z

@blas-ko Did you install on your server with pip install --user twarc?

edsu · 2020-04-22T14:21:51Z

@blas-ko Just following up, if you did install with --user you should be able to find the twarc executable in your "user base". This location is platform dependent, but you can find it by running this at the command line:

python3 -m site --user-base

(omit the 3 from python3 if you are using another version)

I suspect that the directory you see on output is not in your PATH. If you add it to your PATH then typing twarc on the command line will work. Let me know if you need any help adjusting your PATH to include that directory.
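As a rough way to check this, the following Python sketch reports the user base's script directory and whether it appears in PATH. It assumes a Linux layout where `pip install --user` places scripts in `<user base>/bin`; the helper name `on_path` is made up for this example.

```python
import os
import site


def on_path(directory, path_value):
    """Return True if `directory` is one of the entries in a PATH-style string."""
    target = os.path.normpath(directory)
    entries = [os.path.normpath(p) for p in path_value.split(os.pathsep) if p]
    return target in entries


if __name__ == "__main__":
    # Assumption: on Linux, pip install --user puts scripts in <user base>/bin.
    user_bin = os.path.join(site.getuserbase(), "bin")
    print(user_bin, on_path(user_bin, os.environ.get("PATH", "")))
```

If the second value printed is False, that directory is the one to add to PATH.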

blas-ko · 2020-04-23T16:06:57Z

Hey @edsu! I installed it both the normal way, with pip3 install twarc, and with the user flag, pip3 install --user twarc.

I tried python3 -m site --user-base with no success.

I found the file in ~/.local/bin/twarc and created an alias to it, and now I'm able to run it from the command line. However, when I try to configure twarc via twarc configure, I get the following error:

Traceback (most recent call last):
  File "/home/kolic/.local/bin/twarc", line 11, in <module>
    sys.exit(main())
  File "/home/kolic/.local/lib/python3.6/site-packages/twarc/command.py", line 219, in main
    t.configure()
  File "/home/kolic/.local/lib/python3.6/site-packages/twarc/client.py", line 939, in configure
    print('\n\u2728 \u2728 \u2728  Happy twarcing! \u2728 \u2728 \u2728\n')
UnicodeEncodeError: 'ascii' codec can't encode character '\u2728' in position 1: ordinal not in range(128)

Any ideas on what could be happening? Maybe I shouldn't have done an alias?

Sorry for all the mess I'm making!

edsu · 2020-04-23T19:44:44Z

Wow, that's a new one! I am glad you figured out where the command was installed. It looks like your operating system's default encoding is not UTF-8, which is unusual these days, but not unheard of.

I recall you are working on a shared system? So you may not have control over the default encoding. Could you try setting this in your shell before you run twarc?

export PYTHONIOENCODING="utf-8"

If that works you might want to add it to your ~/.profile so you don't have to remember to do it every time you open a new terminal session.

https://docs.python.org/3.8/using/cmdline.html#environment-variables
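To see why the traceback above ends in a UnicodeEncodeError, here is a small Python sketch that checks whether the sparkles character can be encoded under a given codec and reports the encoding Python picked for standard output. The helper name `can_encode` is made up for this example.

```python
import sys


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    sparkles = "\u2728"  # the character from twarc's "Happy twarcing!" message
    print("stdout encoding:", sys.stdout.encoding)
    print("ascii:", can_encode(sparkles, "ascii"))
    print("utf-8:", can_encode(sparkles, "utf-8"))
```

Running it before and after setting PYTHONIOENCODING should show the stdout encoding change if the variable is being picked up.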

blas-ko · 2020-04-24T20:58:50Z

It worked!!!
Thanks a ton! :)

edsu · 2020-04-24T21:13:07Z

Nice, please feel free to open new issues here or over in the twarc repository if you run into more problems.

edsu closed this as completed Apr 24, 2020
