# ARPA Weather Station Data
This notebook is used for:
1) testing ARPA API functionalities
2) testing functions to be implemented in the plugin
3) testing candidate libraries and evaluating their performance

First, you need an app token.
Go to the Open Data Lombardia website (https://dati.lombardia.it/), sign up, and open your profile settings. <br>
![image.png](attachment:4bce9d53-93b0-462d-9403-47bc99e1bc35.png) <br>
Edit your profile and open the "Opzioni per lo sviluppatore" (developer options) tab, then create a new App Token.
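
Rather than pasting the token into the notebook, it could be read from the environment. A minimal sketch, assuming a hypothetical environment variable named `ARPA_APP_TOKEN` (this name is an illustration, not part of the Open Data Lombardia setup):

```python
import os

def load_arpa_token(env_var="ARPA_APP_TOKEN"):
    """Return the app token stored in the given environment variable.

    The variable name is a hypothetical choice for this sketch.
    Returns None when the variable is unset or empty.
    """
    token = os.environ.get(env_var, "").strip()
    return token or None
```

The returned value could then be assigned to `arpa_token` in place of the string literal.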


Useful Sodapy example script: https://github.com/xmunoz/sodapy/blob/master/examples/soql_queries.py

In [1]:
arpa_token = "riTLzYVRVdDaQtUkxDDaHRgJi" 

In [2]:
print(arpa_token)

riTLzYVRVdDaQtUkxDDaHRgJi


Pandas must be installed even if Dask is installed first.

In [3]:
from sodapy import Socrata
import pandas as pd
from datetime import datetime, timedelta
import requests
from io import BytesIO
from zipfile import ZipFile
import os
import dask.dataframe as dd

In [4]:
stationsId = "nf78-nj6b" # Select meteo stations dataset containing positions and information about sensors

In [5]:
def connect_ARPA_api(token):
    """
    Function to connect to ARPA API.

        Parameters:
            - token (str): the ARPA token obtained from Open Data Lombardia website

        Returns:
            - client: client session
            
    """
    client = Socrata("www.dati.lombardia.it", app_token=token)

    return client


client = connect_ARPA_api(arpa_token)
stations_info = client.get_all(stationsId)

In [6]:
stations_df = pd.DataFrame(stations_info)
stations_df

Unnamed: 0,idsensore,tipologia,unit_dimisura,idstazione,nomestazione,quota,provincia,datastart,storico,cgb_nord,cgb_est,lng,lat,location,:@computed_region_6hky_swhk,:@computed_region_ttgh_9sm5,datastop
0,10373,Precipitazione,mm,687,Ferno v.Di Dio,215,VA,2007-08-13T00:00:00.000,N,5051773,481053,8.756970445453431,45.61924377994763,"{'latitude': '45.61924377994763', 'longitude':...",1,1,
1,10376,Precipitazione,mm,706,Lecco v.Sora,272,LC,2008-07-22T00:00:00.000,N,5078987,531045,9.399950344681852,45.86374884127965,"{'latitude': '45.86374884127965', 'longitude':...",10,10,
2,10377,Temperatura,°C,706,Lecco v.Sora,272,LC,2008-07-22T00:00:00.000,N,5078987,531045,9.399950344681852,45.86374884127965,"{'latitude': '45.86374884127965', 'longitude':...",10,10,
3,10381,Umidità Relativa,%,706,Lecco v.Sora,272,LC,2008-07-22T00:00:00.000,N,5078987,531045,9.399950344681852,45.86374884127965,"{'latitude': '45.86374884127965', 'longitude':...",10,10,
4,10382,Radiazione Globale,W/m²,706,Lecco v.Sora,272,LC,2008-07-31T00:00:00.000,N,5078987,531045,9.399950344681852,45.86374884127965,"{'latitude': '45.86374884127965', 'longitude':...",10,10,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1227,9869,Umidità Relativa,%,672,Cornale v.Libertà,74,PV,2005-07-28T00:00:00.000,N,4987406,493238,8.914144599002409,45.04007657202963,"{'latitude': '45.04007657202963', 'longitude':...",7,7,
1228,9933,Precipitazione,mm,677,Cremona Via Fatebenefratelli,43,CR,2006-04-10T00:00:00.000,N,4999315,582066,10.043836158369393,45.14254063221695,"{'latitude': '45.14254063221695', 'longitude':...",8,8,
1229,9935,Radiazione Globale,W/m²,677,Cremona Via Fatebenefratelli,43,CR,2006-04-10T00:00:00.000,N,4999315,582066,10.043836158369393,45.14254063221695,"{'latitude': '45.14254063221695', 'longitude':...",8,8,
1230,9938,Temperatura,°C,677,Cremona Via Fatebenefratelli,43,CR,2006-04-10T00:00:00.000,N,4999315,582066,10.043836158369393,45.14254063221695,"{'latitude': '45.14254063221695', 'longitude':...",8,8,


Consideration: inside the plugin, the data type of each column must be set according to the corresponding QGIS data type.
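
As a first step towards that conversion, the string columns returned by the API can be cast to proper pandas dtypes. A minimal sketch on two sample rows shaped like `stations_df`; the helper `cast_station_columns` is hypothetical, the choice of numeric and date columns is an assumption based on the table above, and the mapping to QGIS field types is not shown:

```python
import pandas as pd

# Two sample rows shaped like the API response, where every value is a string
sample = pd.DataFrame({
    "idsensore": ["10373", "10377"],
    "tipologia": ["Precipitazione", "Temperatura"],
    "quota": ["215", "272"],
    "lng": ["8.756970445453431", "9.399950344681852"],
    "lat": ["45.61924377994763", "45.86374884127965"],
    "datastart": ["2007-08-13T00:00:00.000", "2008-07-22T00:00:00.000"],
})

def cast_station_columns(df):
    """Return a copy with numeric and date columns converted from strings."""
    out = df.copy()
    # Integer-like columns: nullable Int64 keeps missing values as <NA>
    for col in ("idsensore", "quota"):
        out[col] = pd.to_numeric(out[col], errors="coerce").astype("Int64")
    # Coordinates as floating point numbers
    for col in ("lng", "lat"):
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Dates as pandas timestamps; unparseable values become NaT
    out["datastart"] = pd.to_datetime(out["datastart"], errors="coerce")
    return out

typed = cast_station_columns(sample)
print(typed.dtypes)
```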

In [7]:
temperature_sensors = (stations_df.loc[(stations_df['tipologia'] == 'Temperatura') & (stations_df['storico'] == 'N')]).idsensore.tolist()

In [8]:
len(temperature_sensors)

197

**Request time series data**

In [41]:
def req_ARPA_start_end_API(client):
    """
    Function to request the start and the end date of data available in the ARPA API.

      Parameters:
        - client: the client session

      Returns: 
        - start_API_date (str): starting date for available data inside the API.
        - end_API_date (str): ending date for available data inside the API.
        
    """
    weather_sensor_id = "647i-nhxk" #Weather sensors id
    query = """ select MAX(data), MIN(data) limit 9999999999999999"""

    min_max_dates = client.get(weather_sensor_id, query=query)[0] #Get max and min dates from the list
    
    # Start (minimum) and end (maximum) dates from the dict returned by the API
    start_API_date = min_max_dates['MIN_data']
    end_API_date = min_max_dates['MAX_data']
    
    # Convert to datetime and add 1 day to the end date so that all values in the last day are included
    # (e.g. otherwise 20/01/2023 23:10:00 would be left out, because the request would stop at 20/01/2023 00:00:00)
    start_API_date = datetime.strptime(start_API_date, "%Y-%m-%dT%H:%M:%S.%f")
    end_API_date = datetime.strptime(end_API_date, "%Y-%m-%dT%H:%M:%S.%f")+timedelta(days=1)
    
    #Convert to string in year-month-day format, accepted by ARPA query
    start_API_date = start_API_date.strftime("%Y-%m-%d")
    end_API_date = end_API_date.strftime("%Y-%m-%d")

    return start_API_date, end_API_date

In [42]:
start_date, end_date = req_ARPA_start_end_API(client)
print("The data from the API are available from: " + start_date + " up to: " + end_date)

The data from the API are available from: 2023-01-13 up to: 2023-01-19
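
The one-day shift applied to the end date can be checked in isolation. A minimal sketch, using sample timestamps and a hypothetical helper `to_query_bounds` that mirrors the conversion done in `req_ARPA_start_end_API`:

```python
from datetime import datetime, timedelta

API_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"  # timestamp format returned by the API

def to_query_bounds(min_ts, max_ts):
    """Convert API min/max timestamps into yyyy-mm-dd query bounds.

    The end bound is pushed to the following day so that a date-only
    upper limit still covers every reading of the last day.
    """
    start = datetime.strptime(min_ts, API_FORMAT)
    end = datetime.strptime(max_ts, API_FORMAT) + timedelta(days=1)
    return start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d")

start, end = to_query_bounds("2023-01-13T00:00:00.000", "2023-01-18T06:00:00.000")
print(start, end)  # 2023-01-13 2023-01-19
```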


In [11]:
def req_ARPA_data_API(client, start_date, end_date):
    """
    Function to request data from available weather sensors in the ARPA API using a query.

      Parameters:
        - client: the client session
        - start_date (str): the start date in yyyy-mm-dd format
        - end_date (str): the end date in yyyy-mm-dd format

      Returns: 
        - time_series: time series of values requested with the query for all sensors
        
    """
    weather_sensor_id = "647i-nhxk"
    query = """
      select
          *
      where data between \'{}\' and \'{}\' limit 9999999999999999
      """.format(start_date, end_date)

    time_series = client.get(weather_sensor_id, query=query)

    return time_series

In [12]:
sensors_values_API = req_ARPA_data_API(client, start_date, end_date)
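
This request downloads every sensor in the time window. If only some sensors are needed, the filter could be pushed into the query itself. A minimal sketch of a hypothetical helper `build_sensor_query`; it assumes that `idsensore` is stored as text in the dataset (hence the quoted values) and that the `in (...)` operator is accepted by the SoQL version served by the portal:

```python
def build_sensor_query(start_date, end_date, sensor_ids, limit=1_000_000):
    """Build a SoQL query restricted to a date range and a list of sensors."""
    if not sensor_ids:
        raise ValueError("sensor_ids must contain at least one sensor id")
    # Quote each id because the column is assumed to be text;
    # single quotes inside a value are doubled to keep the literal valid
    id_list = ", ".join("'{}'".format(str(s).replace("'", "''")) for s in sensor_ids)
    return (
        "select * "
        "where data between '{}' and '{}' "
        "and idsensore in ({}) "
        "limit {}"
    ).format(start_date, end_date, id_list, limit)

query = build_sensor_query("2023-01-13", "2023-01-19", ["10377", "9938"])
print(query)
```

The resulting string could be passed as `query=` to `client.get`, in the same way as in `req_ARPA_data_API`.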

In [13]:
print("Total length of the list containing the values from ARPA sensors:", len(sensors_values_API))

Total length of the list containing the values from ARPA sensors: 335993


In [14]:
sensors_values_df_API = pd.DataFrame(sensors_values_API)

In [15]:
sensors_values_df_API

Unnamed: 0,idsensore,data,valore,idoperatore,stato
0,2116,2023-01-16T06:30:00.000,0,4,VA
1,14207,2023-01-16T05:50:00.000,0,1,VA
2,9074,2023-01-16T09:10:00.000,1.4,3,VA
3,8122,2023-01-16T05:20:00.000,0,4,VA
4,11190,2023-01-16T06:10:00.000,43,1,VA
...,...,...,...,...,...
335988,4032,2023-01-18T03:30:00.000,1.9,1,VA
335989,11648,2023-01-18T00:40:00.000,2.6,3,VA
335990,32237,2023-01-18T03:10:00.000,260,3,VA
335991,14589,2023-01-18T01:10:00.000,201,3,VA


In [16]:
sensors_values_df_API['data'] = pd.to_datetime(sensors_values_df_API['data'])
sensors_values_df_API = sensors_values_df_API.sort_values(by='data', ascending=False).reset_index(drop=True)
print(sensors_values_df_API.head())

  idsensore                data valore idoperatore stato
0     30523 2023-01-18 06:00:00   -0.4           1    VA
1     30523 2023-01-18 05:55:00   -0.2           1    VA
2     19307 2023-01-18 05:50:00   -2.1           1    VA
3     30539 2023-01-18 05:50:00      0           4    VA
4     19354 2023-01-18 05:50:00   -1.2           1    VA
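
The values returned by the API appear to be strings (note `0` rather than `0.0` in the output above), so they need converting before any numeric analysis. A minimal sketch on a few sample rows, keeping only the readings whose sensor id appears in a list such as `temperature_sensors`; the ids chosen for `temperature_ids` are placeholders:

```python
import pandas as pd

readings = pd.DataFrame({
    "idsensore": ["30523", "19307", "30539"],
    "data": ["2023-01-18T06:00:00.000", "2023-01-18T05:50:00.000", "2023-01-18T05:50:00.000"],
    "valore": ["-0.4", "-2.1", "0"],
})
temperature_ids = ["30523", "19307"]  # placeholder stand-in for temperature_sensors

# Convert the string columns to datetime and float
readings["data"] = pd.to_datetime(readings["data"])
readings["valore"] = pd.to_numeric(readings["valore"], errors="coerce")

# Keep only the rows belonging to the selected sensors
temperature_values = readings[readings["idsensore"].isin(temperature_ids)]
print(temperature_values["valore"].mean())
```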


# Values from csv files

In [17]:
def download_extract_csv_from_year(year):
    """
    Function for selecting the correct link for downloading zipped .csv meteorological data from ARPA sensors and extracting it.

    For older data it is necessary to download these .csv files, which contain the time series of the meteorological sensors.

            Parameters:
                year(str): the selected year for downloading the .csv file containing the meteorological sensors time series

            Returns:
                None
    """
    
    #Create a dict with years and link to the zip folder on Open Data Lombardia
    switcher = {
        '2022': "https://www.dati.lombardia.it/download/mvvc-nmzv/application%2Fzip",
        '2021': "https://www.dati.lombardia.it/download/49n9-866s/application%2Fzip",
        '2020': "https://www.dati.lombardia.it/download/erjn-istm/application%2Fzip",
        '2019': "https://www.dati.lombardia.it/download/wrhf-6ztd/application%2Fzip",
        '2018': "https://www.dati.lombardia.it/download/sfbe-yqe8/application%2Fzip",
        '2017': "https://www.dati.lombardia.it/download/vx6g-atiu/application%2Fzip"
    }
    
    #Select the url and make request
    url = switcher[year]
    filename = 'meteo_'+str(year)+'.zip'
    
    print(('Downloading {filename} -> Started. It might take a while... Please wait!').format(filename = filename))
    req = requests.get(url)
    
    # Writing the file to the local file system
    with open(filename,'wb') as output_file:
        output_file.write(req.content)
    print(('Downloading {filename} -> Completed').format(filename = filename))
    
    #Loading the .zip and creating a zip object
    with ZipFile(filename, 'r') as zObject:
        # Extracting all the members of the zip into a specific location
        zObject.extractall()
    
    csv_file=str(year)+'.csv'
    print(("File unzipped: {filename}").format(filename=filename))
    print(("File csv saved: {filename}").format(filename=csv_file))
    
    #Remove the zip folder
    if os.path.exists(filename):
        os.remove(filename)
    else:
        print("The file does not exist")

In [18]:
year = 2022
csv_file = str(year)+'.csv'

In [19]:
download_extract_csv_from_year(str(year))

Downloading meteo_2022.zip -> Started. It might take a while... Please wait!
Downloading meteo_2022.zip -> Completed
File unzipped: meteo_2022.zip
File csv saved: 2022.csv
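
The `BytesIO` import at the top of the notebook would allow the archive to be unpacked without writing the .zip to disk first. A minimal sketch with a hypothetical helper `extract_zip_bytes`; the archive is built in memory, with a made-up row, only to keep the example runnable offline, whereas in the notebook the bytes would come from `req.content`:

```python
import os
import tempfile
from io import BytesIO
from zipfile import ZipFile

def extract_zip_bytes(content, target_dir="."):
    """Extract a zip archive held in memory and return the member names."""
    with ZipFile(BytesIO(content)) as archive:
        archive.extractall(target_dir)
        return archive.namelist()

# Build a tiny archive in memory to stand in for the downloaded bytes
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("2022.csv", "IdSensore,Data,Valore\n3,2022-01-01 00:00:00,33\n")

with tempfile.TemporaryDirectory() as tmp_dir:
    members = extract_zip_bytes(buffer.getvalue(), tmp_dir)
    extracted_ok = os.path.exists(os.path.join(tmp_dir, "2022.csv"))

print(members, extracted_ok)
```

For archives as large as the yearly files this keeps the whole download in memory, so writing the .zip to disk as done in `download_extract_csv_from_year` may remain preferable.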


In [37]:
df = dd.read_csv(csv_file, blocksize=1e8, usecols=['IdSensore','Data','Valore'])  # 100MB chunks
#df

In [38]:
df['Data'] = dd.to_datetime(df['Data'])

In [39]:
df = df.set_index('Data')  # .compute() is left out here because it takes too long

In [40]:
df

Unnamed: 0_level_0,IdSensore,Valore
npartitions=19,Unnamed: 1_level_1,Unnamed: 2_level_1
2022-01-01 00:00:00,int64,float64
2022-01-14 15:10:00,...,...
...,...,...
2022-12-02 02:40:00,...,...
2022-12-22 11:50:00,...,...


Reference on grouping and resampling time series data: https://medium.com/nerd-for-tech/grouping-and-sampling-time-series-data-2bafe98302ab

In [29]:
res_avg = df.groupby('IdSensore').resample("D").min()  # NOTE: despite the name, this is the daily minimum per sensor, not the average

In [30]:
sensors_list = res_avg['IdSensore'].unique().tolist()

In [31]:
test_date = datetime.strptime('2022-07-04', '%Y-%m-%d')
res_avg

Unnamed: 0_level_0,Unnamed: 1_level_0,IdSensore,Valore
IdSensore,Data,Unnamed: 2_level_1,Unnamed: 3_level_1
3,2022-01-01,3.0,33.0
3,2022-01-02,3.0,34.0
3,2022-01-03,3.0,36.0
3,2022-01-04,3.0,38.0
3,2022-01-05,3.0,37.0
...,...,...,...
32413,2022-12-18,,
32413,2022-12-19,,
32413,2022-12-20,,
32413,2022-12-21,32413.0,0.0


In [32]:
res_avg.loc[(res_avg['IdSensore'] == 12755) & (res_avg.index.get_level_values('Data') == test_date)]

Unnamed: 0_level_0,Unnamed: 1_level_0,IdSensore,Valore
IdSensore,Data,Unnamed: 2_level_1,Unnamed: 3_level_1
12755,2022-07-04,12755.0,4.9
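
The grouping and daily resampling step can be reproduced on a small pandas frame. A minimal sketch with made-up readings for two sensors, computing the daily minimum and the daily mean per sensor; selecting `Valore` before resampling is a choice made here so that the sensor id is not aggregated along with the values:

```python
import pandas as pd

data = pd.DataFrame({
    "IdSensore": [3, 3, 3, 7, 7],
    "Data": pd.to_datetime([
        "2022-01-01 00:00", "2022-01-01 12:00", "2022-01-02 00:00",
        "2022-01-01 06:00", "2022-01-01 18:00",
    ]),
    "Valore": [33.0, 35.0, 34.0, 1.0, 3.0],
}).set_index("Data")

# Group by sensor, then resample each group's values to daily frequency
per_sensor = data.groupby("IdSensore")["Valore"]
daily_min = per_sensor.resample("D").min()
daily_mean = per_sensor.resample("D").mean()

# Both results are Series indexed by (IdSensore, Data)
print(daily_min)
print(daily_mean)
```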
