# Starting point: Load data and explore

Our data exploration and analysis journey often starts with a notebook like this one. 
* We load the data, explore it, and try to understand it. 
* We might also do some cleaning and feature engineering. 
* Visualizations may help us understand the data.

This is actually the notebook that started this tutorial ;-)

## Load data from Open Meteo API

[Open Meteo](https://open-meteo.com/) is a great source of weather data: it provides weather forecasts as well as historical interpretive model runs.

In the code below, we use the archive (historical) data API in a very simple way.

In [1]:
import httpx
import pandas as pd

In [3]:
# Forecast API variant; not used below, because the next cell overwrites `params`.
FORECAST_URI = "https://api.open-meteo.com/v1/forecast"
params = dict(
    latitude=50.1003,
    longitude=14.2555,
    hourly="temperature_2m",
    start_date=pd.Timestamp("2023-07-01").date().isoformat(),
    end_date=pd.Timestamp("2023-07-09").date().isoformat(),
    models=["jma_seamless", "icon_d2", "meteofrance_arpege_europe"],
)

In [4]:
ARCHIVE_URI = "https://archive-api.open-meteo.com/v1/archive"
params = dict(
    latitude=50.1003,
    longitude=14.2555,
    hourly="temperature_2m",
    start_date=pd.Timestamp("2010-07-01").date().isoformat(),
    end_date=pd.Timestamp("2010-07-09").date().isoformat(),
    models=["best_match", "era5", "era5_land", "cerra"],
)

In [5]:
params["hourly"] = [
    "temperature_2m",
    "relativehumidity_2m",
    "dewpoint_2m",
    "apparent_temperature",
    "precipitation",
    "rain",
    # "showers",
    "snowfall",
    "weathercode",
    "pressure_msl",
    "surface_pressure",
    "cloudcover",
    "cloudcover_low",
    "cloudcover_mid",
    "cloudcover_high",
    "et0_fao_evapotranspiration",
    "vapor_pressure_deficit",
    "windspeed_10m",
    "windspeed_100m",
    "winddirection_10m",
    "winddirection_100m",
    "windgusts_10m",
    "soil_temperature_0_to_7cm",
    "soil_temperature_7_to_28cm",
    "soil_temperature_28_to_100cm",
    "soil_temperature_100_to_255cm",
    "soil_moisture_0_to_7cm",
    "soil_moisture_7_to_28cm",
    "soil_moisture_28_to_100cm",
    "soil_moisture_100_to_255cm",
    "is_day",
    "shortwave_radiation",
    "direct_radiation",
    "diffuse_radiation",
    "direct_normal_irradiance",
]

In [6]:
params["hourly"]

['temperature_2m',
 'relativehumidity_2m',
 'dewpoint_2m',
 'apparent_temperature',
 'precipitation',
 'rain',
 'snowfall',
 'weathercode',
 'pressure_msl',
 'surface_pressure',
 'cloudcover',
 'cloudcover_low',
 'cloudcover_mid',
 'cloudcover_high',
 'et0_fao_evapotranspiration',
 'vapor_pressure_deficit',
 'windspeed_10m',
 'windspeed_100m',
 'winddirection_10m',
 'winddirection_100m',
 'windgusts_10m',
 'soil_temperature_0_to_7cm',
 'soil_temperature_7_to_28cm',
 'soil_temperature_28_to_100cm',
 'soil_temperature_100_to_255cm',
 'soil_moisture_0_to_7cm',
 'soil_moisture_7_to_28cm',
 'soil_moisture_28_to_100cm',
 'soil_moisture_100_to_255cm',
 'is_day',
 'shortwave_radiation',
 'direct_radiation',
 'diffuse_radiation',
 'direct_normal_irradiance']

In [7]:
response = httpx.get(ARCHIVE_URI, params=params)
response

<Response [200 OK]>
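
The request above passes lists for `hourly` and `models`. To see roughly what query string such list-valued parameters turn into, here is a small standard-library sketch. It assumes that httpx serializes a list value as a repeated key (`models=a&models=b`), which `urlencode(..., doseq=True)` mimics; the shortened parameter set is only illustrative.

```python
from urllib.parse import urlencode

# Shortened, illustrative version of the parameters used above.
params = dict(
    latitude=50.1003,
    longitude=14.2555,
    hourly=["temperature_2m", "rain"],
    models=["best_match", "era5"],
)

# doseq=True expands each list into one key=value pair per item.
query = urlencode(params, doseq=True)
print(query)
# latitude=50.1003&longitude=14.2555&hourly=temperature_2m&hourly=rain&models=best_match&models=era5
```

Every requested quantity and every model ends up as a separate query item, which is why the response contains one column per quantity and model.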

In [8]:
pd.DataFrame(response.json()["hourly"])

| | time | temperature_2m_best_match | relativehumidity_2m_best_match | dewpoint_2m_best_match | apparent_temperature_best_match | precipitation_best_match | rain_best_match | snowfall_best_match | weathercode_best_match | pressure_msl_best_match | ... | windspeed_10m_cerra | windspeed_100m_cerra | winddirection_10m_cerra | winddirection_100m_cerra | windgusts_10m_cerra | is_day_cerra | shortwave_radiation_cerra | direct_radiation_cerra | diffuse_radiation_cerra | direct_normal_irradiance_cerra |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2010-07-01T00:00 | 16.1 | 85 | 13.6 | 16.8 | 0.0 | 0.0 | 0.0 | 0 | 1018.4 | ... | 5.4 | 13.0 | 256 | 2 | 6.8 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2010-07-01T01:00 | 15.1 | 89 | 13.3 | 15.5 | 0.0 | 0.0 | 0.0 | 0 | 1018.3 | ... | 6.1 | 11.9 | 264 | 4 | 7.9 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2010-07-01T02:00 | 14.5 | 90 | 12.9 | 14.6 | 0.0 | 0.0 | 0.0 | 0 | 1018.3 | ... | 6.1 | 14.4 | 270 | 358 | 7.9 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2010-07-01T03:00 | 14.2 | 91 | 12.8 | 14.3 | 0.0 | 0.0 | 0.0 | 1 | 1018.4 | ... | 4.3 | 15.1 | 266 | 10 | 7.2 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2010-07-01T04:00 | 14.6 | 90 | 13.0 | 15.2 | 0.0 | 0.0 | 0.0 | 1 | 1018.5 | ... | 5.8 | 10.8 | 250 | 16 | 7.2 | 1 | 11.0 | 0.0 | 11.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 211 | 2010-07-09T19:00 | 23.8 | 54 | 13.9 | 23.8 | 0.0 | 0.0 | 0.0 | 0 | 1020.1 | ... | 10.4 | 16.6 | 104 | 104 | 19.1 | 1 | 22.0 | 11.0 | 11.0 | 95.7 |
| 212 | 2010-07-09T20:00 | 22.0 | 57 | 13.0 | 21.7 | 0.0 | 0.0 | 0.0 | 0 | 1020.7 | ... | 6.8 | 15.1 | 108 | 118 | 14.8 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 213 | 2010-07-09T21:00 | 20.7 | 60 | 12.7 | 20.2 | 0.0 | 0.0 | 0.0 | 0 | 1021.0 | ... | 2.9 | 17.3 | 108 | 132 | 9.7 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 214 | 2010-07-09T22:00 | 19.5 | 65 | 12.6 | 19.0 | 0.0 | 0.0 | 0.0 | 0 | 1020.9 | ... | 0.4 | 16.6 | 156 | 152 | 3.2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |


## Bottom line

We loaded the data with minimal effort. However:
* the data column names mix quantities and model names
  * hence the data is not in a tidy (long) format
* we don't do any validation of the data schema, such as column names, types, units, etc.
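
To make the first point concrete, here is a minimal sketch of how the wide columns could be reshaped into a tidy (long) frame with pandas. The helper `split_column`, the list `MODELS`, and the tiny `wide` frame with its values are all made up for illustration; the sketch assumes that every column name ends with `_<model>`, as in the multi-model output above.

```python
import pandas as pd

# Assumed to be known up front, e.g. taken from params["models"].
MODELS = ["best_match", "era5"]

# Toy stand-in for the wide frame above; the values are illustrative only.
wide = pd.DataFrame(
    {
        "time": ["2010-07-01T00:00", "2010-07-01T01:00"],
        "temperature_2m_best_match": [16.1, 15.1],
        "temperature_2m_era5": [16.0, 15.0],
        "precipitation_best_match": [0.0, 0.0],
        "precipitation_era5": [0.0, 0.1],
    }
)


def split_column(name: str, models: list[str]) -> tuple[str, str]:
    """Split e.g. 'temperature_2m_era5' into ('temperature_2m', 'era5')."""
    # Try longer model names first so that the most specific suffix wins.
    for model in sorted(models, key=len, reverse=True):
        suffix = f"_{model}"
        if name.endswith(suffix):
            return name[: -len(suffix)], model
    raise ValueError(f"No known model suffix in column {name!r}")


# One row per (time, original column), then split the column name in two.
melted = wide.melt(id_vars="time", var_name="column", value_name="value")
parts = [split_column(column, MODELS) for column in melted["column"]]
melted["quantity"] = [quantity for quantity, _ in parts]
melted["model"] = [model for _, model in parts]
melted["time"] = pd.to_datetime(melted["time"])

tidy = melted[["time", "model", "quantity", "value"]]
print(tidy)
```

In the tidy frame, each row holds one value for one time, model, and quantity, so filtering by model or pivoting by quantity no longer requires parsing column names.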

If the code we are writing is supposed to grow and be used by others or in production systems,
we should make it more resilient to bugs, errors, and changes on the data provider's side
(e.g. API changes).
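
As a first step in that direction, the sketch below checks the `hourly` part of a response before it is turned into a DataFrame. The function name `check_hourly` and the toy payload are hypothetical. The sketch assumes the payload is a dictionary of equally long lists with a `time` key and `<quantity>_<model>` keys. Because the truncated output above suggests that not every model provides every quantity (the `cerra` columns appear to skip the soil variables), absent combinations are reported rather than treated as errors.

```python
def check_hourly(hourly: dict, quantities: list[str], models: list[str]) -> set[str]:
    """Raise ValueError on schema problems; return requested columns that are absent."""
    if "time" not in hourly:
        raise ValueError("missing 'time' column")

    expected = {f"{quantity}_{model}" for quantity in quantities for model in models}
    actual = set(hourly) - {"time"}

    # Columns we never asked for hint at an API change or a naming mismatch.
    unexpected = actual - expected
    if unexpected:
        raise ValueError(f"unexpected columns: {sorted(unexpected)}")

    # Every series must have exactly one value per timestamp.
    n_rows = len(hourly["time"])
    bad_length = sorted(name for name, values in hourly.items() if len(values) != n_rows)
    if bad_length:
        raise ValueError(f"columns with wrong length: {bad_length}")

    return expected - actual


# Toy payload with illustrative values, not real API output.
payload = {
    "time": ["2010-07-01T00:00", "2010-07-01T01:00"],
    "temperature_2m_era5": [16.0, 15.0],
}
missing = check_hourly(payload, ["temperature_2m", "rain"], ["era5"])
print(missing)
```

With the real response, the call would look like `check_hourly(response.json()["hourly"], params["hourly"], params["models"])`. Checks of types and units would need more than this, for example a dedicated schema validation library.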