## JSON files

A JSON file stores data structures and objects in JavaScript Object Notation (JSON), a standard text-based data interchange format. It is widely used to transmit data between a web application and a server. JSON files are lightweight, human-readable, and editable in any text editor.
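The files read in this section hold one JSON object per line (the JSON Lines layout, hence `lines=True` in the `pd.read_json` calls). As a minimal sketch of what that means, the snippet below parses two made-up records with the standard `json` module. The records only imitate the shape of the meteo data; their contents are illustrative assumptions, not real observations.

```python
import json

# Two hypothetical records in JSON Lines form: one JSON object per line.
raw = (
    '{"network": "agrmet", "date": "2012-01-01T00:00:00Z", '
    '"data": [{"vars": {"B01019": {"v": "Albareto"}}}]}\n'
    '{"network": "spdsra", "date": "2012-01-31T19:00:00Z", '
    '"data": [{"vars": {"B01019": {"v": "Capaccio"}}}]}\n'
)

# Parse each non-empty line independently of the others.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

# Nested JSON objects and arrays become plain Python dicts and lists.
names = [r["data"][0]["vars"]["B01019"]["v"] for r in records]
print(names)
```

Because every line is a complete JSON document, a reader can stop after any line, which is what makes chunked reading possible.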

In [1]:
import pandas as pd
from pathlib import Path

In [2]:
dir_file = Path.cwd()  # notebook working directory (__file__ is not defined in a notebook)

In [13]:
dir_data = dir_file / 'data' / 'Storico meteo'

year = 2012
fns = dir_data.glob(f'meteo-{year}-*.json')
ff = sorted(fns)

In [14]:
%%time
# DO NOT RUN THIS CELL: it loads the whole file into memory at once and takes minutes
df = pd.read_json(ff[0], lines=True)

CPU times: user 46.7 s, sys: 2min 17s, total: 3min 4s
Wall time: 4min 5s


In [15]:
df

Unnamed: 0,version,network,ident,lon,lat,date,data
0,0.1,agrmet,,1095670,4470214,2012-01-01 00:00:00+00:00,"[{'vars': {'B01019': {'v': 'Albareto'}, 'B0119..."
1,0.1,agrmet,,1095670,4470214,2012-01-01 00:15:00+00:00,"[{'vars': {'B01019': {'v': 'Albareto'}, 'B0119..."
2,0.1,agrmet,,1095670,4470214,2012-01-01 00:30:00+00:00,"[{'vars': {'B01019': {'v': 'Albareto'}, 'B0119..."
3,0.1,agrmet,,1095670,4470214,2012-01-01 00:45:00+00:00,"[{'vars': {'B01019': {'v': 'Albareto'}, 'B0119..."
4,0.1,agrmet,,1095670,4470214,2012-01-01 01:00:00+00:00,"[{'vars': {'B01019': {'v': 'Albareto'}, 'B0119..."
...,...,...,...,...,...,...,...
1045505,0.1,spdsra,,1189186,4392525,2012-01-31 19:00:00+00:00,"[{'vars': {'B01019': {'v': 'Capaccio'}, 'B0119..."
1045506,0.1,spdsra,,1189186,4392525,2012-01-31 20:00:00+00:00,"[{'vars': {'B01019': {'v': 'Capaccio'}, 'B0119..."
1045507,0.1,spdsra,,1189186,4392525,2012-01-31 21:00:00+00:00,"[{'vars': {'B01019': {'v': 'Capaccio'}, 'B0119..."
1045508,0.1,spdsra,,1189186,4392525,2012-01-31 22:00:00+00:00,"[{'vars': {'B01019': {'v': 'Capaccio'}, 'B0119..."


In [16]:
%%time
# chunksize is the number of rows (JSON lines) per chunk
reader = pd.read_json(ff[0], lines=True, chunksize=10000)
df_list = []
for c in reader:
    c = c.drop(columns=['version', 'ident', 'network']).set_index('date')
    # Take the 'B05001' value from the first data block of each row;
    # rows without that key get None, so the list always matches the chunk length
    c['w_speed'] = [x[0]['vars'].get('B05001', {}).get('v') for x in c['data']]
    df_list.append(c.drop(columns='data'))
df = pd.concat(df_list)

CPU times: user 36.8 s, sys: 59.5 s, total: 1min 36s
Wall time: 2min 5s
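One detail of the chunked loop deserves care: a list comprehension that filters out rows lacking the key produces fewer values than the chunk has rows, and assigning it as a column then fails. The sketch below shows one way to keep the lengths aligned by returning a missing value instead. It runs on three made-up JSON lines held in memory; the helper `extract` and the sample values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical JSON Lines data; the second record has no 'B05001' key.
raw = (
    '{"date": "2012-01-01T00:00:00Z", "data": [{"vars": {"B05001": {"v": 1.5}}}]}\n'
    '{"date": "2012-01-01T00:15:00Z", "data": [{"vars": {"B01019": {"v": "Albareto"}}}]}\n'
    '{"date": "2012-01-01T00:30:00Z", "data": [{"vars": {"B05001": {"v": 2.0}}}]}\n'
)


def extract(blocks, key="B05001"):
    """Return the 'v' value of `key` from the first block, or None if absent."""
    return blocks[0]["vars"].get(key, {}).get("v")


parts = []
# With chunksize set, read_json returns an iterator of DataFrames.
with pd.read_json(io.StringIO(raw), lines=True, chunksize=2) as reader:
    for chunk in reader:
        chunk = chunk.set_index("date")
        # One value per row, so the new column always fits the chunk.
        chunk["w_speed"] = chunk["data"].map(extract)
        parts.append(chunk.drop(columns="data"))

small = pd.concat(parts)
print(small)
```

Rows without the key end up with a missing value in `w_speed`, which can later be dropped or filled explicitly.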


### Dask


Dask parallelizes Python code. Its Futures interface scales arbitrary functions and for loops, while `dask.dataframe`, used below, splits a large dataset into partitions that can be processed independently.

In [32]:
import dask.dataframe as dd

In [128]:
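%%time
ddf = dd.read_json(ff[0], blocksize=5000000) # blocksize is the size in bytes of each block
ddf = ddf.drop(columns=['version', 'ident', 'network'], axis=1)
list_var = list()
# Loop variable renamed from `np` to avoid shadowing the usual NumPy alias
for i in range(ddf.npartitions):
    ddfp = ddf.partitions[i]
    # set_index sorts the partition by 'date', which is costly in Dask
    ddfp = ddfp.set_index('date')
    # compute() materializes this partition's 'data' column as a pandas Series
    value_speed = [x[0]['vars']['B05001']['v'] for x in ddfp.data.compute().values]
    list_var.append(value_speed)
#ddf = ddf.drop(columns=['data'], axis=1)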

CPU times: user 1min 26s, sys: 2min 57s, total: 4min 23s
Wall time: 5min 32s


#### Parallelization!