Read JSON in chunks #235

parasml · 2020-05-14T17:15:14Z

I am reading my JSON file in chunks because it is too large to load at once, using the code below:

df = wr.s3.read_json(path1, chunksize=2, lines=True)

The returned df is a generator.

I am struggling to save these chunks (df) in Parquet format. Could you please share how to achieve this, or how to collect the chunks into a DataFrame?

Thanks,
Prashan

parasml · 2020-05-15T08:36:19Z

Thanks for all the help in the past.

I just tried the code below, but I am still not able to read the JSON lines in a for loop. Please suggest a fix.

code:

df = wr.s3.read_json(path1, chunksize=5, lines=True)
i = 0
for row in df:
    print("row = ", row)

    print("row = ", json.loads(row))
    print("--------------")

=======================================

Regards,
Prashan
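
Editor's sketch: in the loop above, each item yielded by the reader is expected to be an already-parsed chunk of records (the reply below names its loop variable `df`), so passing it to `json.loads` should not be necessary. The standard-library analogy below illustrates the idea of chunked JSON Lines reading. It is not the library's implementation; `iter_json_lines` and the sample data are hypothetical names introduced only for illustration.

```python
import io
import json
from itertools import islice
from typing import Iterator, List, TextIO


def iter_json_lines(handle: TextIO, chunksize: int) -> Iterator[List[dict]]:
    """Yield lists of up to `chunksize` parsed records from a JSON Lines stream."""
    # Parse lazily: one JSON document per non-blank line.
    records = (json.loads(line) for line in handle if line.strip())
    while True:
        chunk = list(islice(records, chunksize))
        if not chunk:
            return
        yield chunk


# Hypothetical sample data standing in for the S3 object.
sample = io.StringIO('{"id": 1}\n{"id": 2}\n{"id": 3}\n')
chunks = list(iter_json_lines(sample, chunksize=2))
for chunk in chunks:
    # Each chunk is already parsed, so no further json.loads call is needed.
    print(chunk)
```

The last chunk can be shorter than `chunksize`, which is worth keeping in mind when the consumer assumes fixed-size batches.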

igorborgest · 2020-05-15T18:03:39Z

Actually, I found a bug in the JSON reading with chunksize, which I've just fixed.

Would you mind testing our development branch?

pip install git+https://github.com/awslabs/aws-data-wrangler.git@dev

Example:

for df in wr.s3.read_json(paths, lines=True, chunksize=1):
    print(df)

I would like to make sure that it is fixed in our next version, 1.2.0. Thanks!
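
Editor's sketch: the original question asked how to save each chunk to Parquet. One possible pattern is to write every chunk to its own numbered file as it arrives, so that only one chunk is held in memory at a time. The helper below is hypothetical (`save_chunks`, `write_chunk`, the `part-00000.parquet` naming, and the bucket path are all assumptions) and uses an in-memory stand-in for the writer so that it runs without S3 access.

```python
from typing import Any, Callable, Iterable, List


def save_chunks(
    chunks: Iterable[Any],
    write_chunk: Callable[[Any, str], None],
    prefix: str,
) -> List[str]:
    """Write each chunk to its own numbered file under `prefix`; return the paths."""
    paths = []
    for index, chunk in enumerate(chunks):
        # Hypothetical naming scheme: one zero-padded part file per chunk.
        path = f"{prefix}part-{index:05d}.parquet"
        write_chunk(chunk, path)
        paths.append(path)
    return paths


# Demo with an in-memory dict standing in for the real Parquet writer.
written = {}
paths = save_chunks(
    chunks=(["row-a", "row-b"], ["row-c"]),
    write_chunk=lambda chunk, path: written.__setitem__(path, chunk),
    prefix="s3://my-bucket/output/",
)
print(paths)
```

With awswrangler, the chunks would come from `wr.s3.read_json(paths, lines=True, chunksize=...)`, and `write_chunk` could wrap a call such as `wr.s3.to_parquet(df=chunk, path=path)`. If the data fits in memory after all, `pd.concat(chunks, ignore_index=True)` would collect the chunks into a single DataFrame instead.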

igorborgest · 2020-05-20T12:11:02Z

Released in version 1.2.0.

parasml added the "question" (Further information is requested) label on May 14, 2020

igorborgest self-assigned this May 15, 2020

igorborgest added the "bug" (Something isn't working), "minor release" (Will be addressed in the next minor release), and "WIP" (Work in progress) labels on May 15, 2020

igorborgest added this to the 1.2.0 milestone May 15, 2020

igorborgest closed this as completed May 20, 2020

igorborgest added the "bug" (Something isn't working) label and removed the "bug" (Something isn't working) and "WIP" (Work in progress) labels on May 20, 2020