# 🌍 Open-Meteo - Data Engineering Project

## 🎯 Project Objective

This project aims to build a robust **Data Engineering pipeline** to extract, transform, and store weather data using the [Open-Meteo API](https://open-meteo.com/en/docs), while following industry best practices:

- Incremental vs Full extraction
- Upsert / Merge / Overwrite strategies
- Metadata tracking
- Delta table compaction and optimization

The collected data will serve as a foundation for future analytics, dashboards, or machine learning projects.

---

## 📡 Data Source: Open-Meteo API

The Open-Meteo API offers a wide range of endpoints (weather, climate change, marine forecasts, air quality, satellite radiation, flood monitoring, and more). This project focuses on the weather and geocoding endpoints.
 

---

## 🧩 Endpoints used in this project

### 🏙️ a) Geocoding

**Purpose:** Get geographic data from a city name (latitude, longitude, etc.)

- **Extraction type:** Static but not full (hybrid)
- **Strategy:** `MERGE` — incrementally enrich the city database without retrieving all cities at once.



### 🌤️ b) Current Weather

**Purpose:** Get current weather conditions for a specific city.

- **Extraction type:** Incremental
- **Strategy:** `MERGE` (each run's snapshot is merged into the existing table, matching rows on `Date`, `Time`, and `City`).



### 📅 c) Daily Forecast

**Purpose:** Get daily weather forecast data.

- **Extraction type:** Incremental
- **Strategy:** `MERGE` — store and accumulate forecasts, allowing future comparison with real values.



### ⏰ d) Hourly Forecast

**Purpose:** Get hourly weather forecast data.

- **Extraction type:** Incremental
- **Strategy:** `UPSERT` — update forecast values over time, as they may change throughout the day.



### 🕰️ e) Historical Weather

**Purpose:** Retrieve historical weather data for a given date range.

- **Extraction type:** Incremental
- **Strategy:** `MERGE` — extract data progressively (year by year) to avoid API limitations.



### 🔢 f) Weather Codes

**Purpose:** Manually input weather code descriptions (not available through API).

- **Extraction type:** Manual
- **Strategy:** `OVERWRITE` — static reference table.

---

## 🔮 Potential Future Extensions

- 🌊 Integrate flood monitoring data  
- 🌫️ Add air quality tracking  
- 🛰️ Use satellite radiation information  

---


In [1]:
#Import libraries
import sys
import os
import time
# Add the path to the modules directory
my_current_loc = os.getcwd()
print(my_current_loc)
my_modules_dir = "/Users/focus_profond/UTN/Data_engineering/proyecto/UTN_data_engineering_project/Entrega_Final/Modules"
os.chdir(my_modules_dir)

#Importing personal modules
from DF_functions import *
from openmeteo_API import *


#Returning to the main directory
#my_main_dir = "/Users/focus_profond/UTN/Data_engineering/proyecto/UTN_data_engineering_project/Entrega"
os.chdir(my_current_loc)
os.chdir('../')


/Users/focus_profond/UTN/Data_engineering/proyecto/UTN_data_engineering_project/Entrega_Final/Notebooks
Module executed in: /Users/focus_profond/UTN/Data_engineering/proyecto/UTN_data_engineering_project/utn_env/bin/python


### 🧱 **STORAGE: BRONZE LAYER**

In the **Bronze layer**, we store raw data with minimal processing. All data is saved under a single main folder: `OpenMeteo/`.

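- Partitioning is minimal: reference tables are unpartitioned, and each weather table is partitioned by a single date column (`Date`, `Requested_Date`, or `Historical_Year`).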
- Each function related to an API endpoint performs data extraction and minimal transformation (only what's necessary for saving).
- This layer acts as the **raw zone**, preserving the original data as received from the source.
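
The layout of the Bronze layer can be summarized as a small configuration object. This is only a sketch: the paths, key columns, and partition columns are copied from the storage cells of this notebook, while the `BronzeTable` class, the `BRONZE_TABLES` dictionary, and the `merge_predicate` helper are hypothetical names introduced for illustration and are not part of the project's modules.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

BRONZE_ROOT = "Data/Bronze/OpenMeteo"


@dataclass(frozen=True)
class BronzeTable:
    """Hypothetical description of one Bronze table (illustration only)."""
    path: str                     # folder that holds the Delta table
    strategy: str                 # "merge", "upsert" or "overwrite"
    key_columns: Tuple[str, ...]  # columns compared in the merge predicate
    partition_by: Optional[str]   # single partition column, or None


BRONZE_TABLES = {
    "geolocation": BronzeTable(
        f"{BRONZE_ROOT}/Others/Geolocation", "merge", ("id",), None),
    "weather_code": BronzeTable(
        f"{BRONZE_ROOT}/Others/WeatherCode", "overwrite", (), None),
    "current": BronzeTable(
        f"{BRONZE_ROOT}/Current", "merge", ("Date", "Time", "City"), "Date"),
    "forecast_daily": BronzeTable(
        f"{BRONZE_ROOT}/Forecast/Daily", "merge",
        ("Requested_Date", "City", "forecast_day"), "Requested_Date"),
    "forecast_hourly": BronzeTable(
        f"{BRONZE_ROOT}/Forecast/Hourly", "upsert",
        ("Requested_Date", "City", "Forecast_Date", "Forecast_Hour"),
        "Requested_Date"),
    "historical_daily": BronzeTable(
        f"{BRONZE_ROOT}/Historical/Daily", "merge",
        ("City", "Historical_Date"), "Historical_Year"),
}


def merge_predicate(table: BronzeTable) -> str:
    """Build a 'target.x = source.x AND ...' predicate from the key columns."""
    return " AND ".join(f"target.{c} = source.{c}" for c in table.key_columns)
```

Keeping this information in one place would make it easier to check that every endpoint has a defined key and write strategy.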


#### 🌍 **1.1 GEOCODING**

From a **city name**, we retrieve various geographic and administrative details using the **geocoding endpoint** of the Open-Meteo API.

Returned information includes:

- `latitude`, `longitude`, `elevation`, `population`, `postcodes`
- Administrative divisions:
  - `admin1`, `admin2`, `admin3`, `admin4` (names of hierarchical administrative areas)
  - `admin1_id`, `admin2_id`, `admin3_id`, `admin4_id` (identifiers for those areas)

> This is treated as **semi-static reference data**, incrementally enriched using a `MERGE` strategy as new cities are requested.
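
As a rough illustration of what a geocoding request and its flattening might look like, here is a minimal sketch. It does not reproduce `get_geolocation_openmeteo`, whose implementation lives in the `openmeteo_API` module and is not shown here. The helper names `build_geocoding_url` and `flatten_geocoding` are hypothetical, and the `rank` field is an assumed stand-in: how the notebook's `probability` column is derived is not documented in this section.

```python
from urllib.parse import urlencode

# Public Open-Meteo geocoding search endpoint
GEOCODING_URL = "https://geocoding-api.open-meteo.com/v1/search"


def build_geocoding_url(city: str, nb_results: int = 2) -> str:
    """Return the search URL for a city name (hypothetical helper)."""
    params = {"name": city, "count": nb_results, "language": "en", "format": "json"}
    return f"{GEOCODING_URL}?{urlencode(params)}"


def flatten_geocoding(payload: dict) -> list:
    """Turn a decoded JSON response into one flat dict per result.

    The list-valued 'postcodes' field is dropped, and a 1-based 'rank'
    (position in the response) is added as an assumed ordering column.
    A payload without a 'results' key yields an empty list.
    """
    rows = []
    for rank, result in enumerate(payload.get("results", []), start=1):
        row = {key: value for key, value in result.items() if key != "postcodes"}
        row["rank"] = rank
        rows.append(row)
    return rows
```

The URL can then be fetched with any HTTP client and the decoded JSON passed to `flatten_geocoding`.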


In [2]:
list_of_cities = [
    'Buenos Aires', 'Rio de Janeiro', 'Marseille',
    'Brussels', 'Namur', 'Montreal',
    'Barcelona', 'New York', 'Chicago',
    'Sao Paulo', 'Toronto', 'Melbourne', 'London', 'Mexico City',
    'Lima', 'La Paz', 'Boston', 'Kinshasa', 'biu',
]

# Collect one DataFrame per city, then concatenate them once
geo_frames = []
for city in list_of_cities:
    # get_geolocation_openmeteo comes from the custom openmeteo_API module
    geo_frames.append(get_geolocation_openmeteo(city, nb_results=2))
my_geo_df = pd.concat(geo_frames)

# Drop the 'postcodes' column, which is not needed and is problematic
my_geo_df = my_geo_df.drop(columns=['postcodes'])

In [3]:
#STORING THE DATA
name_folder = 'Data/Bronze/OpenMeteo/Others/Geolocation'
partition_cols = None
predicate = "target.id = source.id"

save_new_data_as_delta(my_geo_df,name_folder,predicate= predicate, partition_cols=partition_cols, layer = 'Bronze', source= 'open-meteo-geoloc', author ='Augustin')

#Verifying the data of the bronze layer
my_dt = DeltaTable(name_folder).to_pandas()
my_dt.head()

Unnamed: 0,id,name,latitude,longitude,elevation,feature_code,country_code,admin1_id,timezone,population,...,country,admin1,probability,admin2_id,admin2,admin3_id,admin4_id,admin3,admin4,__index_level_0__
0,3418226,Bildudalur,65.6853,-23.59992,9999.0,PPL,IS,3426185,Atlantic/Reykjavik,170.0,...,Iceland,Westfjords,2,3426215.0,Vesturbyggð,,,,,1
1,2346995,Biu,10.61285,12.19458,762.0,PPLA2,NG,2346794,Africa/Lagos,95005.0,...,Nigeria,Borno State,1,8659843.0,Biu,9412374.0,,Sulumthla,,0
2,3936456,Lima,-12.04318,-77.02824,152.0,PPLC,PE,3936451,America/Lima,7737002.0,...,Peru,Lima,1,,,,,,,0
3,5160783,Lima,40.74255,-84.10523,268.0,PPLA2,US,5165418,America/New_York,37873.0,...,United States,Ohio,2,5145576.0,Allen,5160804.0,,City of Lima,,1
4,3911925,La Paz,-16.5,-68.15,3782.0,PPLG,BO,3911924,America/La_Paz,812799.0,...,Bolivia,Departamento de La Paz,2,,,,,,,1


#### 🌦️ **1.2 WEATHER CODE**

This part refers to the **interpretation of weather codes**, such as:
- 0 = Clear sky
- 1 = Mainly clear
- 45 = Fog
- 61 = Slight rain, etc.

There is **no endpoint** for this.  
We manually insert a reference table describing the codes.

> Since this table is manually created and rarely changes, we use an **overwrite** strategy.  
**Strategy:** `OVERWRITE`

In [4]:
# Build the weather code reference table by hand
weather_code_data = {
    "Code": [
        "0", "1, 2, 3", "45, 48", "51, 53, 55", "56, 57", "61, 63, 65",
        "66, 67", "71, 73, 75", "77", "80, 81, 82", "85, 86", "95 *", "96, 99 *"
    ],
    "Description": [
        "Clear sky",
        "Mainly clear, partly cloudy, overcast",
        "Fog, depositing rime fog",
        "Drizzle Light intensity, Drizzle Moderate intensity, Drizzle Dense intensity",
        "Freezing Drizzle Light intensity, Freezing Drizzle Dense intensity",
        "Rain Slight, Rain Moderate, Rain Heavy",
        "Freezing Rain Light, Freezing Rain Heavy",
        "Snow fall Slight, Snow fall Moderate, Snow fall Heavy",
        "Snow grains",
        "Rain showers Slight, Rain showers Moderate, Rain showers Violent",
        "Snow showers slight, Snow showers heavy",
        "Thunderstorm Slight or moderate",
        "Thunderstorm with slight hail, Thunderstorm with heavy hail"
    ]
}
df = pd.DataFrame(weather_code_data)

#STORING THE DATA
name_folder = 'Data/Bronze/OpenMeteo/Others/WeatherCode'
mode = 'overwrite'
partition_cols = None
save_data_as_delta(df,name_folder,mode=mode, partition_cols=partition_cols,layer = 'Bronze', source= 'open-meteo-weathercode', author ='Augustin')

#Verifying the data of the bronze layer
my_dt = DeltaTable(name_folder).to_pandas()
my_dt.head()

| | Code | Description |
|---|---|---|
| 0 | 0 | Clear sky |
| 1 | 1, 2, 3 | Mainly clear, partly cloudy, overcast |
| 2 | 45, 48 | Fog, depositing rime fog |
| 3 | 51, 53, 55 | Drizzle Light intensity, Drizzle Moderate inte... |
| 4 | 56, 57 | Freezing Drizzle Light intensity, Freezing Dri... |
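
Because the reference table groups several codes per row, joining it to a numeric `weather_code` column needs one entry per code. The sketch below shows one possible way to expand a grouped row. The `explode_weather_codes` helper is hypothetical. It assumes that codes and descriptions are comma-separated in the same order, and that a trailing `*` is only a footnote marker.

```python
def explode_weather_codes(codes: str, descriptions: str) -> dict:
    """Map each individual code of a grouped row to its own description.

    'codes' looks like "1, 2, 3" or "96, 99 *"; 'descriptions' is the
    matching comma-separated text. If the two lists differ in length,
    every code receives the full description instead.
    """
    code_list = [int(part.strip(" *")) for part in codes.split(",")]
    desc_list = [part.strip() for part in descriptions.split(",")]
    if len(code_list) != len(desc_list):
        return {code: descriptions.strip() for code in code_list}
    return dict(zip(code_list, desc_list))
```

Applying this to every row of the table and merging the resulting dictionaries would give a lookup keyed by a single integer code.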


#### ☁️ **1.3 CURRENT WEATHER**

Using the **current weather endpoint**, we retrieve the **real-time weather conditions** for a given city.

Data includes (non-exhaustive):  
- Temperature, humidity, wind speed, cloud cover, visibility, etc.

> Each run performs a **full extraction** of the current conditions for every city.  
The resulting snapshot is **merged** into the existing table, matching rows on `Date`, `Time`, and `City`.  
**Strategy:** `MERGE`
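
To make the request concrete, here is a minimal sketch of how a current-weather call could be built and flattened into one row. It is not the implementation of `get_current_weather`, which is not shown in this notebook. The helper names are hypothetical, and the response shape (a `current` object with `time` and `interval` fields next to the requested variables) is an assumption about the forecast endpoint's JSON.

```python
from urllib.parse import urlencode

# Public Open-Meteo forecast endpoint, which also serves current conditions
FORECAST_URL = "https://api.open-meteo.com/v1/forecast"


def build_current_weather_url(latitude: float, longitude: float, variables: list) -> str:
    """Return a forecast URL requesting the given 'current' variables."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "current": ",".join(variables),
    }
    # safe="," keeps the variable list readable instead of encoding commas
    return f"{FORECAST_URL}?{urlencode(params, safe=',')}"


def current_to_row(city: str, payload: dict) -> dict:
    """Flatten the assumed 'current' object into a single record."""
    current = dict(payload["current"])         # copy, so the payload is untouched
    day, _, clock = current.pop("time").partition("T")
    current.pop("interval", None)              # sampling interval, not a measurement
    return {"Date": day, "Time": clock, "City": city, **current}
```

The `Date`, `Time`, and `City` fields mirror the columns used later as the merge key for the current-weather table.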

In [3]:
# Retrieve the cities for which we already have latitude and longitude (the same approach is used in the rest of the notebook).
name_folder = 'Data/Bronze/OpenMeteo/Others/Geolocation'
my_dt = DeltaTable(name_folder).to_pandas()
# A city name can return several results, so we keep only the result with the highest probability
list_of_cities_name = list(my_dt.loc[my_dt['probability']==1, 'name'].unique())


my_current_dict = {}
my_current_df = pd.DataFrame()
for city in list_of_cities_name:

    #use of my own function from openmeteo_API module
    my_df = get_current_weather(city)
    my_current_df = pd.concat([my_current_df,my_df])
my_current_df.head()

Unnamed: 0,Date,Time,City,longitude,latitude,surface_pressure,snowfall,is_day,wind_gusts_10m,pressure_msl,showers,apparent_temperature,wind_direction_10m,cloud_cover,rain,relative_humidity_2m,wind_speed_10m,weather_code,precipitation,temperature_2m
0,2025-04-17,19:59,Biu,12.19458,10.61285,926.992493,0.0,0.0,18.359999,1010.099976,0.0,22.967268,41.185837,0.0,0.0,24.0,7.653705,0.0,0.0,25.6
0,2025-04-17,19:59,Lima,-77.02824,-12.04318,993.527222,0.0,0.0,15.48,1011.5,0.0,23.905848,204.775116,0.0,0.0,80.0,5.154415,0.0,0.0,21.700001
0,2025-04-17,19:59,La Paz,-110.3005,24.14437,1005.647766,0.0,1.0,16.199999,1010.5,0.0,30.718632,270.0,0.0,0.0,27.0,9.36,0.0,0.0,31.9055
0,2025-04-17,19:59,Kinshasa,15.31357,-4.32758,978.338318,0.0,0.0,5.76,1010.400024,0.0,32.045074,188.13002,100.0,0.0,95.0,2.545584,45.0,0.0,25.75
0,2025-04-17,19:59,Boston,-71.05977,42.35843,1013.312744,0.0,0.0,47.519997,1015.700012,0.0,-1.700215,13.736293,100.0,0.0,92.0,16.676977,3.0,0.0,2.622


In [4]:
#STORING THE DATA
name_folder = 'Data/Bronze/OpenMeteo/Current'
predicate = "target.Date = source.Date AND target.Time = source.Time AND target.City = source.City"
partition_cols = "Date"

save_new_data_as_delta(my_current_df,name_folder,predicate = predicate, partition_cols=partition_cols, layer = 'Bronze', source= 'open-meteo-current', author ='Augustin')


#Verifying the data of the bronze layer
my_dt = DeltaTable(name_folder).to_pandas()
my_dt.head()

Unnamed: 0,Date,Time,City,longitude,latitude,surface_pressure,snowfall,is_day,wind_gusts_10m,pressure_msl,...,apparent_temperature,wind_direction_10m,cloud_cover,rain,relative_humidity_2m,wind_speed_10m,weather_code,precipitation,temperature_2m,__index_level_0__
0,2025-04-17,19:59,Melbourne,144.96332,-37.814,1015.948181,0.0,0.0,29.519999,1018.200012,...,18.726562,289.983185,42.0,0.0,45.0,4.213692,1.0,0.0,19.950001,0
1,2025-04-17,19:59,Buenos Aires,-58.37723,-34.61315,1016.168091,0.0,1.0,37.439999,1018.400024,...,22.793194,69.717354,83.0,0.0,73.0,17.654688,3.0,0.0,22.65,0
2,2025-04-17,19:59,Marseille,5.38107,43.29695,1018.201477,0.0,1.0,18.359999,1021.799988,...,16.605967,244.536697,0.0,0.0,61.0,8.373386,0.0,0.0,18.1085,0
3,2025-04-17,19:59,Montreal,-73.58781,45.50884,998.746521,0.0,1.0,14.04,1027.0,...,0.00396,222.137527,0.0,0.0,52.0,10.195057,0.0,0.0,2.881,0
4,2025-04-17,19:59,Kinshasa,15.31357,-4.32758,978.338318,0.0,0.0,5.76,1010.400024,...,32.045074,188.13002,100.0,0.0,95.0,2.545584,45.0,0.0,25.75,0
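
The merge on `Date`, `Time`, and `City` can be pictured with plain Python records. This sketch assumes that `save_new_data_as_delta` only inserts rows whose key is not yet present, which its name suggests but which is not confirmed here. `merge_insert_only` is a hypothetical helper and does not touch any Delta table.

```python
def merge_insert_only(target: list, source: list, keys: tuple) -> list:
    """Return target plus the source rows whose key is not already present.

    Rows are dicts; 'keys' lists the columns compared by the predicate.
    Existing rows are never modified.
    """
    existing = {tuple(row[k] for k in keys) for row in target}
    merged = list(target)
    for row in source:
        key = tuple(row[k] for k in keys)
        if key not in existing:
            merged.append(row)
            existing.add(key)  # also de-duplicates within the source batch
    return merged
```

With this behavior, running the extraction twice within the same minute would leave the table unchanged, while a later run would add a new snapshot per city.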


#### 📅 **1.4 FORECAST DAILY WEATHER**

Using the **forecast daily endpoint**, we retrieve the **weather forecast for upcoming days**.

Data includes daily weather predictions such as:  
- Temperature, precipitation, humidity, snowfall, etc.

> This is an **incremental extraction**:  
We **merge** the new forecast data with the existing one to preserve historical predictions.  
This allows future analysis of **forecast accuracy**.  
**Strategy:** `MERGE`

In [7]:
name_folder = 'Data/Bronze/OpenMeteo/Others/Geolocation'
my_dt = DeltaTable(name_folder).to_pandas()
list_of_cities_name = list(my_dt.loc[my_dt['probability']==1, 'name'].unique())


my_forecast_daily_df = pd.DataFrame()
for city in list_of_cities_name:

    #use of my own function from openmeteo_API module
    my_df = get_forecast_daily_weather(city, forecast_days=7)
    my_forecast_daily_df = pd.concat([my_forecast_daily_df,my_df])
my_forecast_daily_df.head()

Unnamed: 0,Requested_Date,City,forecast_day,weather_code,apparent_temperature_min,sunshine_duration,rain_sum,precipitation_probability_max,shortwave_radiation_sum,temperature_2m_max,...,uv_index_clear_sky_max,snowfall_sum,wind_gusts_10m_max,daylight_duration,apparent_temperature_max,precipitation_sum,precipitation_hours,wind_direction_10m_dominant,latitude,longitude
0,2025-04-12,Biu,2025-04-12 00:00:00+00:00,3.0,21.073042,39960.453125,0.0,0.0,26.219999,36.323498,...,8.6,0.0,46.079998,44429.03125,34.464893,0.0,0.0,34.963619,10.61285,12.19458
1,2025-04-12,Biu,2025-04-13 00:00:00+00:00,0.0,18.847942,41872.453125,0.0,0.0,28.26,36.523499,...,8.6,0.0,54.360001,44463.355469,34.716145,0.0,0.0,58.123203,10.61285,12.19458
2,2025-04-12,Biu,2025-04-14 00:00:00+00:00,3.0,18.443285,41903.117188,0.0,0.0,27.799999,37.4235,...,8.6,0.0,50.399998,44497.414062,36.80666,0.0,0.0,63.976074,10.61285,12.19458
3,2025-04-12,Biu,2025-04-15 00:00:00+00:00,3.0,22.393799,39693.480469,0.0,3.0,27.26,38.573498,...,8.6,0.0,42.839996,44531.171875,36.625225,0.0,0.0,66.236183,10.61285,12.19458
4,2025-04-12,Biu,2025-04-16 00:00:00+00:00,2.0,23.345686,42075.976562,0.0,10.0,28.16,38.823498,...,8.6,0.0,35.279999,44564.605469,36.508705,0.0,0.0,72.37191,10.61285,12.19458


In [8]:
#STORING THE DATA
name_folder = 'Data/Bronze/OpenMeteo/Forecast/Daily'
# The merge predicate matches on Requested_Date, City and forecast_day.
predicate = "target.Requested_Date = source.Requested_Date AND target.City = source.City AND target.forecast_day = source.forecast_day"
partition_cols = ["Requested_Date"]

save_new_data_as_delta(my_forecast_daily_df,name_folder,predicate= predicate, partition_cols=partition_cols,layer = 'Bronze', source= 'open-meteo-forecast-daily', author ='Augustin')

#Verifying the data of the bronze layer
my_dt = DeltaTable(name_folder).to_pandas()
my_dt.head()

Unnamed: 0,Requested_Date,City,forecast_day,weather_code,apparent_temperature_min,sunshine_duration,rain_sum,precipitation_probability_max,shortwave_radiation_sum,temperature_2m_max,...,snowfall_sum,wind_gusts_10m_max,daylight_duration,apparent_temperature_max,precipitation_sum,precipitation_hours,wind_direction_10m_dominant,latitude,longitude,__index_level_0__
0,2025-04-12,Biu,2025-04-13 00:00:00+00:00,0.0,18.847942,41872.453125,0.0,0.0,28.26,36.523499,...,0.0,54.360001,44463.355469,34.716145,0.0,0.0,58.123203,10.61285,12.19458,1
1,2025-04-12,Biu,2025-04-15 00:00:00+00:00,3.0,22.393799,39693.480469,0.0,3.0,27.26,38.573498,...,0.0,42.839996,44531.171875,36.625225,0.0,0.0,66.236183,10.61285,12.19458,3
2,2025-04-12,Biu,2025-04-12 00:00:00+00:00,3.0,21.073042,39960.453125,0.0,0.0,26.219999,36.323498,...,0.0,46.079998,44429.03125,34.464893,0.0,0.0,34.963619,10.61285,12.19458,0
3,2025-04-12,Biu,2025-04-14 00:00:00+00:00,3.0,18.443285,41903.117188,0.0,0.0,27.799999,37.4235,...,0.0,50.399998,44497.414062,36.80666,0.0,0.0,63.976074,10.61285,12.19458,2
4,2025-04-12,Biu,2025-04-16 00:00:00+00:00,2.0,23.345686,42075.976562,0.0,10.0,28.16,38.823498,...,0.0,35.279999,44564.605469,36.508705,0.0,0.0,72.37191,10.61285,12.19458,4
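
Keeping every forecast version makes it possible to measure accuracy by lead time later on. The following sketch shows one way this comparison could be done once observed values are available. The `mean_abs_error_by_lead` helper is hypothetical, dates are assumed to be ISO strings (the stored `forecast_day` column is a timestamp and would need converting first), and the numbers in the usage example are invented for illustration.

```python
from collections import defaultdict
from datetime import date


def mean_abs_error_by_lead(forecasts: list, observed: dict) -> dict:
    """Average absolute error of temperature_2m_max, grouped by lead time.

    'forecasts' holds dicts with City, Requested_Date, forecast_day and
    temperature_2m_max; 'observed' maps (City, forecast_day) to the value
    actually measured. Lead time is the number of days between the request
    and the forecast day. Forecasts without an observation are skipped.
    """
    errors = defaultdict(list)
    for row in forecasts:
        actual = observed.get((row["City"], row["forecast_day"]))
        if actual is None:
            continue
        lead = (date.fromisoformat(row["forecast_day"])
                - date.fromisoformat(row["Requested_Date"])).days
        errors[lead].append(abs(row["temperature_2m_max"] - actual))
    return {lead: sum(vals) / len(vals) for lead, vals in sorted(errors.items())}


# Illustrative values only
forecasts = [
    {"City": "Biu", "Requested_Date": "2025-04-12", "forecast_day": "2025-04-14", "temperature_2m_max": 37.0},
    {"City": "Biu", "Requested_Date": "2025-04-13", "forecast_day": "2025-04-14", "temperature_2m_max": 36.0},
]
observed = {("Biu", "2025-04-14"): 36.5}
print(mean_abs_error_by_lead(forecasts, observed))
```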


#### ⏰ **1.5 FORECAST HOURLY WEATHER**

This endpoint gives us **hourly-level weather forecasts** for the coming hours or days.

> This is also an **incremental extraction**, but we apply an **UPSERT strategy**:  
Forecasts may change during the day, so we update existing entries with the **latest forecast version**.

>**Strategy:** `UPSERT`

In [9]:
name_folder = 'Data/Bronze/OpenMeteo/Others/Geolocation'
my_dt = DeltaTable(name_folder).to_pandas()
list_of_cities_name = list(my_dt.loc[my_dt['probability']==1, 'name'].unique())


my_forecast_hourly_df = pd.DataFrame()
for city in list_of_cities_name:
    
    #use of my own function from openmeteo_API module
    my_df = get_forecast_hourly_weather(city, forecast_days=5)
    my_forecast_hourly_df = pd.concat([my_forecast_hourly_df,my_df])
my_forecast_hourly_df.head()

Unnamed: 0,Requested_Date,City,Forecast_Date,Forecast_Hour,longitude,latitude,soil_moisture_27_to_81cm,soil_moisture_9_to_27cm,soil_moisture_3_to_9cm,soil_moisture_1_to_3cm,...,snow_depth,snowfall,showers,rain,precipitation,precipitation_probability,apparent_temperature,dew_point_2m,relative_humidity_2m,temperature_2m
0,2025-04-12,Biu,2025-04-12,00:00,12.19458,10.61285,0.156,0.129,0.113,0.091,...,0.0,0.0,0.0,0.0,0.0,0.0,24.232775,5.882994,26.0,27.023499
1,2025-04-12,Biu,2025-04-12,01:00,12.19458,10.61285,0.156,0.129,0.113,0.09,...,0.0,0.0,0.0,0.0,0.0,0.0,23.446596,5.33067,26.0,26.373499
2,2025-04-12,Biu,2025-04-12,02:00,12.19458,10.61285,0.156,0.129,0.113,0.09,...,0.0,0.0,0.0,0.0,0.0,0.0,22.868982,4.905665,26.0,25.873499
3,2025-04-12,Biu,2025-04-12,03:00,12.19458,10.61285,0.156,0.129,0.113,0.089,...,0.0,0.0,0.0,0.0,0.0,0.0,22.25379,4.005498,25.0,25.473499
4,2025-04-12,Biu,2025-04-12,04:00,12.19458,10.61285,0.156,0.129,0.112,0.089,...,0.0,0.0,0.0,0.0,0.0,0.0,21.712257,3.582189,25.0,24.973499


In [10]:
#STORING THE DATA
name_folder = 'Data/Bronze/OpenMeteo/Forecast/Hourly'
predicate = "target.Requested_Date = source.Requested_Date AND target.City = source.City AND target.Forecast_Date = source.Forecast_Date AND target.Forecast_Hour = source.Forecast_Hour"
partition_cols = ["Requested_Date"]


upsert_data_as_delta(my_forecast_hourly_df,name_folder,predicate= predicate, partition_cols=partition_cols, layer = 'Bronze', source= 'open-meteo-forecast-hourly', author ='Augustin')

#Verifying the data of the bronze layer
my_dt = DeltaTable(name_folder).to_pandas()
my_dt.head()

Unnamed: 0,Requested_Date,City,Forecast_Date,Forecast_Hour,longitude,latitude,soil_moisture_27_to_81cm,soil_moisture_9_to_27cm,soil_moisture_3_to_9cm,soil_moisture_1_to_3cm,...,snowfall,showers,rain,precipitation,precipitation_probability,apparent_temperature,dew_point_2m,relative_humidity_2m,temperature_2m,__index_level_0__
0,2025-04-12,Buenos Aires,2025-04-12,05:00,-58.37723,-34.61315,0.406,0.392,0.38,0.377,...,0.0,0.0,0.0,0.0,8.0,20.897532,17.854734,93.0,19.013,53
1,2025-04-12,Buenos Aires,2025-04-12,08:00,-58.37723,-34.61315,0.406,0.392,0.381,0.382,...,0.0,0.1,0.0,0.1,5.0,20.415752,17.627798,94.0,18.613001,56
2,2025-04-12,Buenos Aires,2025-04-12,17:00,-58.37723,-34.61315,0.406,0.39,0.381,0.386,...,0.0,0.3,0.0,0.3,24.0,24.342426,17.615198,71.0,23.163,65
3,2025-04-12,Buenos Aires,2025-04-12,23:00,-58.37723,-34.61315,0.405,0.39,0.387,0.389,...,0.0,0.0,0.0,0.0,13.0,21.265154,16.943434,82.0,20.113001,71
4,2025-04-12,Buenos Aires,2025-04-13,05:00,-58.37723,-34.61315,0.405,0.39,0.387,0.385,...,0.0,0.0,0.0,0.0,2.0,12.778257,10.0425,78.0,13.813,77
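
The difference from an insert-only merge is that matching rows are replaced. The sketch below illustrates the idea on plain Python records. `upsert` is a hypothetical helper, not the project's `upsert_data_as_delta`, whose internals are not shown here.

```python
def upsert(target: list, source: list, keys: tuple) -> list:
    """Return a new list where source rows replace matching target rows.

    Rows are dicts; two rows match when they agree on every column in
    'keys'. Unmatched source rows are appended.
    """
    position = {tuple(row[k] for k in keys): i for i, row in enumerate(target)}
    merged = [dict(row) for row in target]
    for row in source:
        key = tuple(row[k] for k in keys)
        if key in position:
            merged[position[key]] = dict(row)   # update the existing entry
        else:
            position[key] = len(merged)
            merged.append(dict(row))            # insert a new entry
    return merged
```

Since `Requested_Date` is part of the key used in this notebook, only a re-run on the same day replaces values. Forecasts requested on different days are kept side by side.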


#### 📈 **1.6 HISTORICAL WEATHER**

The **historical weather endpoint** allows us to download weather data from the past, over a given time period.

Because the API limits the volume of data per request, we fetch data **year by year** (or in smaller chunks when needed).

> This is an **incremental extraction**, where we **merge** the new data with the previously collected historical records.  
**Strategy:** `MERGE`

In [11]:
name_folder = 'Data/Bronze/OpenMeteo/Others/Geolocation'
my_dt = DeltaTable(name_folder).to_pandas()
list_of_cities_name = list(my_dt.loc[my_dt['probability']==1, 'name'].unique())

my_historical_weather_df = pd.DataFrame()
for city in list_of_cities_name:
    
    #use of my own function from openmeteo_API module
    my_df = get_daily_historical_weather(city,'2013-01-01','2013-12-31' )
    my_historical_weather_df = pd.concat([my_historical_weather_df,my_df])
    # Pause between requests to stay within the API limits
    time.sleep(10)
my_historical_weather_df.head()

Unnamed: 0,City,Historical_Date,Historical_Year,Historical_Month,Historical_Day,longitude,latitude,wind_direction_10m_dominant,precipitation_hours,precipitation_sum,...,showers_sum,uv_index_max,sunrise,temperature_2m_max,shortwave_radiation_sum,precipitation_probability_max,rain_sum,sunshine_duration,apparent_temperature_min,weather_code
0,Biu,2013-01-01,2013,1,Tue,12.19458,10.61285,28.564484,0.0,0.0,...,0.0,,0,27.939499,21.610001,,0.0,38293.28125,10.068772,1.0
1,Biu,2013-01-02,2013,1,Wed,12.19458,10.61285,34.51646,0.0,0.0,...,0.0,,0,29.189499,21.52,,0.0,38327.6875,11.017301,2.0
2,Biu,2013-01-03,2013,1,Thu,12.19458,10.61285,27.206942,0.0,0.0,...,0.0,,0,30.8895,21.559999,,0.0,38362.042969,12.120928,0.0
3,Biu,2013-01-04,2013,1,Fri,12.19458,10.61285,22.969246,0.0,0.0,...,0.0,,0,31.3395,21.629999,,0.0,38396.253906,13.027481,3.0
4,Biu,2013-01-05,2013,1,Sat,12.19458,10.61285,17.909935,0.0,0.0,...,0.0,,0,30.9895,21.719999,,0.0,38430.21875,12.617382,2.0


In [12]:
#STORING THE DATA
name_folder = 'Data/Bronze/OpenMeteo/Historical/Daily'
predicate = """target.City = source.City AND target.Historical_Date = source.Historical_Date"""
partition_cols = ["Historical_Year"]

save_new_data_as_delta(my_historical_weather_df,name_folder,predicate= predicate, partition_cols=partition_cols,layer = 'Bronze', source= 'open-meteo-historical-daily', author ='Augustin')

#verifying the data of the bronze layer
my_dt = DeltaTable(name_folder).to_pandas()
my_dt.head(5).sort_values(by=['City','Historical_Year'])

Unnamed: 0,City,Historical_Date,Historical_Year,Historical_Month,Historical_Day,longitude,latitude,wind_direction_10m_dominant,precipitation_hours,precipitation_sum,...,uv_index_max,sunrise,temperature_2m_max,shortwave_radiation_sum,precipitation_probability_max,rain_sum,sunshine_duration,apparent_temperature_min,weather_code,__index_level_0__
0,Biu,2013-01-08,2013,1,Tue,12.19458,10.61285,28.587063,0.0,0.0,...,,0,29.4895,20.870001,,0.0,37466.636719,11.340882,3.0,7
1,Biu,2013-01-17,2013,1,Thu,12.19458,10.61285,25.49852,0.0,0.0,...,,0,32.789501,21.83,,0.0,38816.105469,14.404625,3.0,16
2,Biu,2013-02-01,2013,2,Fri,12.19458,10.61285,39.115417,0.0,0.0,...,,0,26.289499,23.27,,0.0,39233.816406,9.076269,3.0,31
3,Biu,2013-02-09,2013,2,Sat,12.19458,10.61285,25.914753,0.0,0.0,...,,0,33.789501,23.91,,0.0,39399.445312,13.243525,3.0,39
4,Biu,2013-02-10,2013,2,Sun,12.19458,10.61285,21.718843,0.0,0.0,...,,0,35.289501,23.790001,,0.0,39417.515625,13.986423,3.0,40
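
Fetching year by year means splitting a date range into calendar-year chunks. Here is a minimal sketch of that split. The `yearly_chunks` helper is hypothetical and only prepares the date pairs. It makes no API calls.

```python
from datetime import date


def yearly_chunks(start: str, end: str) -> list:
    """Split an inclusive ISO date range into (start, end) pairs per year."""
    first, last = date.fromisoformat(start), date.fromisoformat(end)
    if first > last:
        raise ValueError("start must not be after end")
    chunks = []
    for year in range(first.year, last.year + 1):
        chunk_start = max(first, date(year, 1, 1))    # clip to the requested start
        chunk_end = min(last, date(year, 12, 31))     # clip to the requested end
        chunks.append((chunk_start.isoformat(), chunk_end.isoformat()))
    return chunks
```

Each pair could then be passed to `get_daily_historical_weather(city, start, end)`, with a pause between calls as in the cell above.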


**📊 CHECK BRONZE TABLE STATS: Rows, Nulls, Duplicates**

In [13]:
name_folder = 'Data/_meta/metadata_table'
my_dt = DeltaTable(name_folder).to_pandas()
my_dt = my_dt[my_dt['layer']=='Bronze']
# Optional: display the results with the table names
row_counts_per_table = pd.DataFrame({
    "layer": my_dt["layer"],
    "table_name": my_dt["table_name"],
    "table_path": my_dt["table_path"],
    "total_rows": my_dt['total_rows'],
    "rows_with_at_least_one_nulls": my_dt['rows_with_nulls'],
    "rows_duplicated": my_dt['rows_duplicated']
})
row_counts_per_table.head(10)



| | layer | table_name | table_path | total_rows | rows_with_at_least_one_nulls | rows_duplicated |
|---|---|---|---|---|---|---|
| 0 | Bronze | Daily | Data/Bronze/OpenMeteo/Historical/Daily | 69049 | 69049 | 0 |
| 1 | Bronze | Hourly | Data/Bronze/OpenMeteo/Forecast/Hourly | 4632 | 0 | 0 |
| 2 | Bronze | Daily | Data/Bronze/OpenMeteo/Forecast/Daily | 231 | 0 | 0 |
| 3 | Bronze | Current | Data/Bronze/OpenMeteo/Current | 19 | 0 | 0 |
| 4 | Bronze | WeatherCode | Data/Bronze/OpenMeteo/Others/WeatherCode | 13 | 0 | 0 |
| 5 | Bronze | Geolocation | Data/Bronze/OpenMeteo/Others/Geolocation | 37 | 36 | 0 |
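
The three statistics shown above can be illustrated on plain Python records. How the project's metadata table computes them is not shown in this notebook, so this is only a sketch under assumptions: `table_stats` is a hypothetical helper, a value counts as null when it is `None` or `NaN`, and a row counts as duplicated when an identical row appeared earlier.

```python
import math


def _is_null(value) -> bool:
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def table_stats(rows: list) -> dict:
    """Count total rows, rows with at least one null, and duplicated rows."""
    rows_with_nulls = sum(
        1 for row in rows if any(_is_null(v) for v in row.values())
    )
    seen = set()
    duplicated = 0
    for row in rows:
        # repr() makes NaN values comparable between rows
        fingerprint = tuple((key, repr(row[key])) for key in sorted(row))
        if fingerprint in seen:
            duplicated += 1
        else:
            seen.add(fingerprint)
    return {
        "total_rows": len(rows),
        "rows_with_nulls": rows_with_nulls,
        "rows_duplicated": duplicated,
    }
```

On a pandas DataFrame, comparable counts can be obtained with `df.isna().any(axis=1).sum()` and `df.duplicated().sum()`.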


In [14]:
name_folder = 'Data/_meta/metadata_table'
my_dt = DeltaTable(name_folder).to_pandas()
my_dt = my_dt[my_dt['layer']=='Bronze']

In [15]:
export_metadata_to_excel(layer='Bronze')

✅ Metadata successfully exported to: logs/2025-04-12/bronze_metadata_20h43.xlsx
