### We want to evaluate the quality of our processed data along the syntactic accuracy (where possible), completeness and consistency dimensions


In [20]:
import pandas as pd

In [21]:
weather_data = pd.read_csv("./hourlyWeather_final.csv")

In [22]:
weather_data.head()

,Unnamed: 0,datetime,temperature_2m,relative_humidity_2m,dew_point_2m,apparent_temperature,precipitation,rain,weather_code,pressure_msl,...,soil_temperature_0_to_7cm,soil_temperature_7_to_28cm,soil_moisture_0_to_7cm,soil_moisture_7_to_28cm,shortwave_radiation,direct_radiation,diffuse_radiation,direct_normal_irradiance,global_tilted_irradiance,terrestrial_radiation
0,0,2020-01-01 01:00:00,-0.2,88.0,-2.0,-3.7,0.0,0.0,0.0,1030.6,...,0.6,4.4,0.364,0.371,0.0,0.0,0.0,0.0,0.0,0.0
1,1,2020-01-01 02:00:00,-0.9,88.0,-2.7,-4.4,0.0,0.0,0.0,1030.4,...,0.6,4.2,0.364,0.371,0.0,0.0,0.0,0.0,0.0,0.0
2,2,2020-01-01 03:00:00,-1.3,88.0,-3.0,-4.7,0.0,0.0,0.0,1030.2,...,0.6,4.1,0.364,0.371,0.0,0.0,0.0,0.0,0.0,0.0
3,3,2020-01-01 04:00:00,-1.0,89.0,-2.6,-4.8,0.0,0.0,0.0,1029.9,...,0.6,3.9,0.364,0.371,0.0,0.0,0.0,0.0,0.0,0.0
4,4,2020-01-01 05:00:00,-1.0,90.0,-2.4,-4.8,0.0,0.0,0.0,1029.7,...,0.6,3.8,0.364,0.371,0.0,0.0,0.0,0.0,0.0,0.0


#### Syntactic Accuracy for temperatures

By checking the historical maximum and minimum temperatures recorded in Milan, we can define a reference domain of plausible values and compare the observed range against it.

In [23]:
# check ranges of values for all numerical columns
weather_data.describe()

,Unnamed: 0,temperature_2m,relative_humidity_2m,dew_point_2m,apparent_temperature,precipitation,rain,weather_code,pressure_msl,surface_pressure,...,soil_temperature_0_to_7cm,soil_temperature_7_to_28cm,soil_moisture_0_to_7cm,soil_moisture_7_to_28cm,shortwave_radiation,direct_radiation,diffuse_radiation,direct_normal_irradiance,global_tilted_irradiance,terrestrial_radiation
count,35063.0,35063.0,35063.0,35063.0,35063.0,35063.0,35063.0,35063.0,35063.0,35063.0,...,35063.0,35063.0,35063.0,35063.0,35063.0,35063.0,35063.0,35063.0,35063.0,35063.0
mean,17531.0,13.961464,72.404586,8.299373,12.899792,0.12406,0.121527,8.418247,1016.597008,999.219673,...,15.098195,15.015158,0.29682,0.291737,162.222457,108.180361,54.042096,196.643781,162.222457,307.303348
std,10121.960581,8.750944,20.02337,7.22491,10.506086,0.614884,0.610636,18.657324,8.054851,7.820031,...,8.927564,7.959172,0.086908,0.081235,237.131476,182.898753,72.004669,277.24342,237.131476,388.12222
min,0.0,-5.6,11.0,-15.6,-9.2,0.0,0.0,0.0,982.1,965.1,...,-4.0,0.1,0.104,0.124,0.0,0.0,0.0,0.0,0.0,0.0
25%,8765.5,6.8,57.0,2.8,4.2,0.0,0.0,0.0,1012.1,995.0,...,7.5,8.0,0.23,0.221,0.0,0.0,0.0,0.0,0.0,0.0
50%,17531.0,13.7,76.0,8.4,12.4,0.0,0.0,1.0,1016.5,999.3,...,14.6,14.5,0.32,0.306,7.0,0.0,6.0,0.0,7.0,52.2
75%,26296.5,20.8,90.0,14.5,21.5,0.0,0.0,3.0,1021.4,1003.9,...,22.2,22.3,0.368,0.362,280.0,153.0,100.0,400.35,280.0,584.4
max,35062.0,37.4,100.0,24.5,40.2,24.9,24.9,75.0,1045.8,1027.0,...,40.4,32.9,0.438,0.43,912.0,786.0,401.0,954.8,912.0,1225.4
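The summary reports temperatures between -5.6 and 37.4 °C. A minimal sketch of an explicit domain check follows. The bounds `TEMP_MIN_C` and `TEMP_MAX_C` are placeholder assumptions, not Milan's documented records, and the helper `out_of_domain` and the toy frame `sample` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical reference domain in °C; replace with the documented
# historical extremes for Milan before relying on the result.
TEMP_MIN_C = -20.0
TEMP_MAX_C = 45.0

def out_of_domain(df, column, low, high):
    """Return the rows whose value in `column` lies outside [low, high]."""
    # between() is inclusive on both ends; missing values count as violations
    return df[~df[column].between(low, high)]

# Toy frame standing in for weather_data
sample = pd.DataFrame({"temperature_2m": [-5.6, 13.7, 37.4, 61.0]})
violations = out_of_domain(sample, "temperature_2m", TEMP_MIN_C, TEMP_MAX_C)
print(len(violations), "value(s) outside the reference domain")
```

On the real data, the same call with `weather_data` in place of `sample` would return an empty frame if every temperature falls inside the chosen domain.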


#### Completeness

Let's check missing values and object completeness

In [24]:
weather_data.isna().sum().sum()

0
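A single grand total would not say which column is affected if it were ever non-zero. As a small sketch, the share of non-missing values can also be reported per column; `column_completeness` and the toy frame `sample` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

def column_completeness(df):
    """Fraction of non-missing values per column, lowest first."""
    return df.notna().mean().sort_values()

# Toy frame standing in for weather_data, with one missing temperature
sample = pd.DataFrame({
    "temperature_2m": [1.0, None, 3.0, 4.0],
    "rain": [0.0, 0.0, 0.0, 0.0],
})
completeness = column_completeness(sample)
print(completeness)
```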

In [25]:
# Compute the number of days covered by the dataset (1 January 2020 to 31 December 2023)
import datetime
# The difference between 31 December 2019 and 31 December 2023 equals the number of days
# from 1 January 2020 through 31 December 2023, both included
delta = datetime.date(2023,12,31) - datetime.date(2019,12,31)
# since the data has hourly granularity, we should have 24 rows per day
print("For a complete object representation we should have "+str(delta.days*24)+" rows in the dataset")

For a complete object representation we should have 35064 rows in the dataset


In [26]:
#Number of rows:
weather_data.shape[0]

35063

For object completeness, we are missing one row: shifting the values forward by an hour (to local time) dropped the row for midnight of the first day.

With 35063 of the 35064 expected rows (about 99.997%), the dataset can still be considered satisfactory.
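To see which timestamp is absent rather than only how many, one option is to compare the observed timestamps with an expected hourly index. This is a sketch on toy data: `missing_timestamps` is a hypothetical helper, and it deliberately ignores daylight saving time, so on the full dataset the expected index would have to be built in the local timezone first.

```python
import pandas as pd

ONE_HOUR = pd.Timedelta(hours=1)

def missing_timestamps(actual, start, end):
    """Hourly timestamps in [start, end] that are absent from `actual`."""
    expected = pd.date_range(start, end, freq=ONE_HOUR)
    return expected.difference(pd.DatetimeIndex(actual))

# Toy series: the first day of the dataset without its midnight row
actual = pd.date_range("2020-01-01 01:00", "2020-01-01 23:00", freq=ONE_HOUR)
missing = missing_timestamps(actual, "2020-01-01 00:00", "2020-01-01 23:00")
print(list(missing))
```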

In [27]:
# Checking whether one day appears more than 24 times in the dataset
weather_data["date"] = pd.to_datetime(weather_data["datetime"]).dt.date
weather_data.groupby('date').count().sort_values('datetime',ascending=False).max()

Unnamed: 0                    25
datetime                      25
temperature_2m                25
relative_humidity_2m          25
dew_point_2m                  25
apparent_temperature          25
precipitation                 25
rain                          25
weather_code                  25
pressure_msl                  25
surface_pressure              25
cloud_cover                   25
et0_fao_evapotranspiration    25
vapour_pressure_deficit       25
wind_speed_10m                25
wind_speed_100m               25
wind_direction_10m            25
wind_direction_100m           25
wind_gusts_10m                25
soil_temperature_0_to_7cm     25
soil_temperature_7_to_28cm    25
soil_moisture_0_to_7cm        25
soil_moisture_7_to_28cm       25
shortwave_radiation           25
direct_radiation              25
diffuse_radiation             25
direct_normal_irradiance      25
global_tilted_irradiance      25
terrestrial_radiation         25
dtype: int64

In [28]:
#find the dates that appear more than 24 times
weather_data.groupby('date').count().sort_values('datetime',ascending=False).head()

date,Unnamed: 0,datetime,temperature_2m,relative_humidity_2m,dew_point_2m,apparent_temperature,precipitation,rain,weather_code,pressure_msl,...,soil_temperature_0_to_7cm,soil_temperature_7_to_28cm,soil_moisture_0_to_7cm,soil_moisture_7_to_28cm,shortwave_radiation,direct_radiation,diffuse_radiation,direct_normal_irradiance,global_tilted_irradiance,terrestrial_radiation
2023-10-29,25,25,25,25,25,25,25,25,25,25,...,25,25,25,25,25,25,25,25,25,25
2022-10-30,25,25,25,25,25,25,25,25,25,25,...,25,25,25,25,25,25,25,25,25,25
2021-10-31,25,25,25,25,25,25,25,25,25,25,...,25,25,25,25,25,25,25,25,25,25
2020-10-25,25,25,25,25,25,25,25,25,25,25,...,25,25,25,25,25,25,25,25,25,25
2021-12-31,24,24,24,24,24,24,24,24,24,24,...,24,24,24,24,24,24,24,24,24,24


The dates with 25 rows are the days on which DST ended (the last Sunday of October): the clocks were set back, adding an hour to the day.

In [29]:
weather_data.groupby('date').count().sort_values('datetime',ascending=True)[:6]

date,Unnamed: 0,datetime,temperature_2m,relative_humidity_2m,dew_point_2m,apparent_temperature,precipitation,rain,weather_code,pressure_msl,...,soil_temperature_0_to_7cm,soil_temperature_7_to_28cm,soil_moisture_0_to_7cm,soil_moisture_7_to_28cm,shortwave_radiation,direct_radiation,diffuse_radiation,direct_normal_irradiance,global_tilted_irradiance,terrestrial_radiation
2020-01-01,23,23,23,23,23,23,23,23,23,23,...,23,23,23,23,23,23,23,23,23,23
2020-03-29,23,23,23,23,23,23,23,23,23,23,...,23,23,23,23,23,23,23,23,23,23
2021-03-28,23,23,23,23,23,23,23,23,23,23,...,23,23,23,23,23,23,23,23,23,23
2022-03-27,23,23,23,23,23,23,23,23,23,23,...,23,23,23,23,23,23,23,23,23,23
2023-03-26,23,23,23,23,23,23,23,23,23,23,...,23,23,23,23,23,23,23,23,23,23
2022-09-05,24,24,24,24,24,24,24,24,24,24,...,24,24,24,24,24,24,24,24,24,24


The dates with fewer than 24 rows are the days on which DST started (the last Sunday of March), removing an hour from the day, plus the first day of the dataset, where we lost an hour by shifting to local time.
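The DST explanation can be cross-checked independently of the dataset. The sketch below builds a timezone-aware hourly index and counts the hours in each local calendar day. It assumes the dataset's local time is `Europe/Rome`, which the notebook does not state explicitly.

```python
import pandas as pd

# Assumption: the local time of the dataset is Europe/Rome (Milan)
hours = pd.date_range(
    "2020-01-01 00:00", "2023-12-31 23:00",
    freq=pd.Timedelta(hours=1), tz="Europe/Rome",
)

# Count the hourly timestamps that fall on each local calendar day
per_day = pd.Series(hours.date).value_counts()
short_days = sorted(per_day[per_day == 23].index)
long_days = sorted(per_day[per_day == 25].index)

print("Total hours:", len(hours))
print("23-hour days:", [str(d) for d in short_days])
print("25-hour days:", [str(d) for d in long_days])
```

Under that assumption, the 23-hour and 25-hour days are the same March and October dates listed in the two tables, and the total is the 35064 rows expected earlier. The only remaining difference is 2020-01-01, which has 23 rows in the dataset because of the dropped midnight row.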

#### Consistency

Let's check if time formats are respected and consistent.

In [30]:
import re

# Define the regex pattern of our apparent format (yyyy-mm-dd hh:mm:ss)
pattern = r"^(2019|202[0-3])-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]) ([01]\d|2[0-3]):00:00$"

# The pattern rejects out-of-range fields (e.g. month 13, hour 24, non-zero minutes),
# but it does not check calendar validity: 2020-02-31 23:00:00 would still match

# Check which entries in the 'datetime' column match the pattern
valid_entries = weather_data['datetime'].apply(lambda x: bool(re.match(pattern, x)))

# Count how many rows match the pattern
num_valid_entries = valid_entries.sum()

In [31]:
print("Number of rows that don't respect our datetime format :", weather_data.shape[0] - num_valid_entries)

Number of rows that don't respect our datetime format : 0
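The regex only constrains each field's range, so a date such as 2020-02-31 would still pass it. A possible complementary check is to parse the strings with `pd.to_datetime` and `errors="coerce"`, which turns anything that is not a real datetime into `NaT`. The helper `invalid_datetimes` and the toy series are hypothetical and only illustrate the idea.

```python
import pandas as pd

def invalid_datetimes(values, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the entries of `values` that do not parse as real datetimes."""
    parsed = pd.to_datetime(values, format=fmt, errors="coerce")
    # Unparseable entries (and entries that were already missing) become NaT
    return values[parsed.isna()]

sample = pd.Series([
    "2020-02-29 23:00:00",  # valid: 2020 is a leap year
    "2020-02-31 10:00:00",  # matches the regex, but the date does not exist
    "2020-01-01 24:00:00",  # hour out of range
])
bad = invalid_datetimes(sample)
print(bad.tolist())
```

Applied to `weather_data['datetime']`, an empty result would confirm that every timestamp is also a valid calendar date.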


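In conclusion, to the extent possible, we have verified that our data meets the checks for our quality dimensions of interest: it contains no missing values, 35063 of the 35064 expected hourly rows, and every timestamp follows the expected format.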