# Data extraction and processing: SMHI weather data

**Author: Jakob Nyström, 5563**

In this notebook we fetch data on temperature and precipitation from the SMHI meteorology API. We then calculate different rolling averages for these variables and create a data set that can be joined with the lake chemistry data.

In [12]:
import requests
import json
import pandas as pd
import pyproj
from geopy.distance import geodesic
import csv
from datetime import datetime
from io import StringIO
from math import ceil
import matplotlib.pyplot as plt
from typing import Tuple

In [13]:
import jupyter_black

jupyter_black.load(lab=False)

## 1. Find available stations for parameters of interest

Not all SMHI stations record every weather parameter, so the first step is to identify which stations provide which data. We are interested in daily average temperature (parameter 2) and daily precipitation (parameter 5). The lists fetched here include both active and inactive stations.

In [14]:
def extract_station_data(url: str) -> pd.DataFrame:
    """
    Extracts required data from each station in a JSON object obtained from a given URL.
    This function sends a request to the provided URL and retrieves data in JSON format.
    It then extracts specific information for each station from the JSON object. The
    extracted information is stored in a DataFrame and returned.

    Args:
        url: The SMHI API URL to fetch JSON data from.

    Returns:
        - A dataframe containing extracted station information if successful.
        - A string error message if there's an issue with the request.
    """
    try:
        # Send request and get data in json format
        response = requests.get(url)
        response.raise_for_status()
        data = response.json()

        # Extract the required information for each station
        stations = []
        for station in data["station"]:
            station_data = {
                "Name": station.get("name"),
                "Owner": station.get("owner"),
                "Owner category": station.get("ownerCategory"),
                "Station type": station.get("measuringStations"),
                "Id": station.get("id"),
                "Altitude": station.get("height"),
                "Latitude": station.get("latitude"),
                "Longitude": station.get("longitude"),
                "Active": station.get("active"),
            }
            # Append information to the list
            stations.append(station_data)

        # Return the information as a dataframe
        return pd.DataFrame(stations)

    except requests.RequestException as e:
        return f"Error: {e}"
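
To see what the extraction loop does without calling the API, the sketch below applies the same `.get()` lookups to a small hand-written payload. The payload is hypothetical: only the field names come from the function above, the station values are copied from the table in section 1.1, and `parse_stations` is an illustrative helper, not part of the notebook. One entry deliberately lacks `height` to show that `.get()` yields `None` instead of raising.

```python
import json
from typing import Any, Dict, List

# Hypothetical payload for illustration only: the field names are the ones
# read by extract_station_data, but this is not a verbatim SMHI response.
SAMPLE_PAYLOAD = """
{"station": [
  {"name": "Abisko Aut", "id": 188790, "height": 392.303,
   "latitude": 68.3538, "longitude": 18.8164, "active": true},
  {"name": "Abisko", "id": 188800,
   "latitude": 68.3538, "longitude": 18.8166, "active": false}
]}
"""


def parse_stations(payload: str) -> List[Dict[str, Any]]:
    """Apply the same .get() lookups as the notebook to a JSON string."""
    data = json.loads(payload)
    return [
        {
            "Name": station.get("name"),
            "Id": station.get("id"),
            "Altitude": station.get("height"),  # None when the key is absent
            "Latitude": station.get("latitude"),
            "Longitude": station.get("longitude"),
            "Active": station.get("active"),
        }
        for station in data["station"]
    ]


stations = parse_stations(SAMPLE_PAYLOAD)
active_names = [s["Name"] for s in stations if s["Active"]]
print(active_names)
```

Missing keys therefore end up as empty cells in the resulting dataframe rather than stopping the extraction.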

### 1.1. Daily temperature

In [15]:
# Get stations that have daily average temp
url = "https://opendata-download-metobs.smhi.se/api/version/latest/parameter/2.json"
df_temp_stations = extract_station_data(url)
df_temp_stations.head()

Unnamed: 0,Name,Owner,Owner category,Station type,Id,Altitude,Latitude,Longitude,Active
0,Abelvattnet Aut,SMHI,CLIMATE,CORE,154860,665.0,65.53,14.97,False
1,Abisko,SMHI,CLIMATE,CORE,188800,393.38,68.3538,18.8166,False
2,Abisko Aut,Polarforskningssekretariatet,CLIMATE,CORE,188790,392.303,68.3538,18.8164,True
3,Abraur,SMHI,CLIMATE,CORE,158990,368.079,65.9857,18.9195,False
4,Adelsnäs,SMHI,CLIMATE,CORE,85600,97.0,58.1998,15.9802,False


In [16]:
df_temp_stations.shape

(914, 9)

### 1.2. Daily precipitation

In [17]:
# Get stations that have daily precipitation
url = "https://opendata-download-metobs.smhi.se/api/version/latest/parameter/5.json"
df_precip_stations = extract_station_data(url)
df_precip_stations.head()

Unnamed: 0,Name,Owner,Owner category,Station type,Id,Altitude,Latitude,Longitude,Active
0,Aapua,SMHI,CLIMATE,CORE,173010,210.0,66.8656,23.4951,False
1,Abborrberg,SMHI,CLIMATE,CORE,157010,550.0,65.4833,16.6,False
2,Abelvattnet Aut,SMHI,CLIMATE,CORE,154860,665.0,65.53,14.97,False
3,Abild,SMHI,CLIMATE,CORE,62570,145.0,56.9512,12.7909,False
4,Abisko,SMHI,CLIMATE,CORE,188800,393.38,68.3538,18.8166,False


In [18]:
df_precip_stations.shape

(2110, 9)

**Conclusion:** There are 914 stations that measure temperature on a daily basis and 2110 that measure precipitation, counting both active and inactive stations.

## 2. Match SMHI weather stations to lakes based on coordinates

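There is no explicit link between the lake sites in the chemistry data set and the SMHI weather stations. We use coordinates to find the closest station to each lake, once for temperature and once for precipitation, so each lake is matched to two stations. Distances are computed with geopy's `geodesic` function, which returns an ellipsoidal distance rather than the spherical Haversine distance, although the helper function keeps the name `haversine_distance`.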

In [19]:
# Load lakes dataset
df_lake = pd.read_csv("../data/lake_chem_data_clean.csv")
df_lake.head()

Unnamed: 0,MD-MVM Id,Survey station,Latitude,Longitude,County,Municipality,MS_CD C3,Sample date,Sample year,Sample month,...,Water temp (°C),Tot-N (µg/l N),Abs_F 420 (/m),SUVA_254 (m*l/mg),Organic N (µg/l N),Inorganic N (µg/l N),TOC:TON (mol/l),Organic P (µg/l N),Inorganic P (µg/l N),TOC:TOP (mol/l)
0,54,Spjutsjön,60.638793,15.445276,Dalarnas län,Falun,WA42559716,2001-03-28,2001,3,...,0.6,409.0,1.06,,201.0,208.0,61.513265,7.0,402.0,7809.079772
1,54,Spjutsjön,60.638793,15.445276,Dalarnas län,Falun,WA42559716,2001-05-21,2001,5,...,10.2,360.0,1.02,,256.0,104.0,37.362236,4.0,356.0,10571.725918
2,54,Spjutsjön,60.638793,15.445276,Dalarnas län,Falun,WA42559716,2001-08-22,2001,8,...,18.6,195.0,0.58,,185.0,10.0,107.18553,4.0,191.0,21916.992757
3,54,Spjutsjön,60.638793,15.445276,Dalarnas län,Falun,WA42559716,2001-10-15,2001,10,...,10.3,383.0,0.7,,353.0,30.0,28.417294,6.0,377.0,7391.613243
4,54,Spjutsjön,60.638793,15.445276,Dalarnas län,Falun,WA42559716,2002-02-26,2002,2,...,1.5,385.0,0.7,,203.0,182.0,55.161258,7.0,378.0,7072.374133


In [20]:
df_lake.shape

(8974, 51)

In [21]:
# Only keep columns from lake data that we need for joins
df_lake = df_lake[
    ["MD-MVM Id", "Survey station", "Latitude", "Longitude", "Sample date"]
]
df_lake.head()

Unnamed: 0,MD-MVM Id,Survey station,Latitude,Longitude,Sample date
0,54,Spjutsjön,60.638793,15.445276,2001-03-28
1,54,Spjutsjön,60.638793,15.445276,2001-05-21
2,54,Spjutsjön,60.638793,15.445276,2001-08-22
3,54,Spjutsjön,60.638793,15.445276,2001-10-15
4,54,Spjutsjön,60.638793,15.445276,2002-02-26


In [22]:
df_lake.shape

(8974, 5)

In [23]:
# Filter SMHI data to keep only active stations
df_temp_stations = df_temp_stations.loc[df_temp_stations["Active"] == True]
df_precip_stations = df_precip_stations.loc[df_precip_stations["Active"] == True]
print(df_temp_stations.shape, df_precip_stations.shape)

(238, 9) (564, 9)


In [24]:
# Function to calculate the distance between two sets of coordinates
def haversine_distance(
    coord_1: Tuple[float, float], coord_2: Tuple[float, float]
) -> float:
    """
    Calculates the distance in kilometers between two (latitude, longitude)
    tuples, rounded to one decimal. Despite the function name, geopy's
    geodesic() returns an ellipsoidal distance rather than the spherical
    Haversine distance.
    """
    distance = geodesic(coord_1, coord_2).kilometers
    return round(distance, 1)
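
For comparison, the Haversine (great-circle) formula itself can be written with the standard library alone. This is a minimal sketch assuming a spherical Earth with a mean radius of 6371 km. The name `haversine_km` and the radius constant are illustrative choices, not part of the notebook, which calls geopy's `geodesic` instead. The coordinates are those of Spjutsjön and Falun-Lugnet from the tables in this notebook.

```python
from math import asin, cos, radians, sin, sqrt
from typing import Tuple

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def haversine_km(coord_1: Tuple[float, float], coord_2: Tuple[float, float]) -> float:
    """Great-circle distance in km between two (latitude, longitude) pairs."""
    lat1, lon1 = map(radians, coord_1)
    lat2, lon2 = map(radians, coord_2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


lake = (60.638793, 15.445276)  # Spjutsjön
station = (60.6185, 15.6574)  # Falun-Lugnet
distance_km = haversine_km(lake, station)
print(round(distance_km, 1))
```

Over the tens of kilometers involved here, the spherical result should land close to the geodesic distance reported for this pair (11.8 km), so either method is likely adequate for picking the nearest station.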

In [25]:
def find_closest_station(
    df_lake: pd.DataFrame, df_other: pd.DataFrame, parameter: str
) -> pd.DataFrame:
    """
    Find the closest station in 'df_other' for each entry in 'df_lake'
    based on latitude and longitude and attach relevant information.

    Args:
        df_lake: Dataframe containing the lake chemistry data.
        df_other: Dataframe containing SMHI station data, either for
            temperature or precipitation.
        parameter: Specifier for temperature or precipitation.

    Returns:
        df_lake: The 'df_lake' dataframe with added information about
            the closest station from 'df_other'.
    """
    # Iterate through each station in lake data and calculate
    # distance to SMHI stations in other dataframe
    for index, row in df_lake.iterrows():
        distances = df_other.apply(
            lambda x: haversine_distance(
                (row["Latitude"], row["Longitude"]), (x["Latitude"], x["Longitude"])
            ),
            axis=1,
        )

        # Select the SMHI stations with shortest distance
        min_distance_idx = distances.idxmin()
        closest_station = df_other.loc[min_distance_idx, "Id"]
        df_lake.at[index, f"{parameter} station id"] = closest_station.astype(int)
        df_lake.at[index, f"{parameter} station dist"] = distances[min_distance_idx]

    # Join in name of SMHI stations
    df_lake = df_lake.join(
        df_other[["Id", "Name"]].set_index("Id"), on=f"{parameter} station id"
    ).rename(columns={"Name": f"{parameter} station name"})
    df_lake[f"{parameter} station id"] = df_lake[f"{parameter} station id"].astype(int)

    return df_lake
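
The loop above recomputes all station distances for every sample row, even though each lake appears many times with identical coordinates. One possible optimization, sketched here with plain Python and a spherical distance, is to cache the result per distinct coordinate pair. The helper names (`great_circle_km`, `nearest_station`) and the cache are hypothetical. The station and lake coordinates are taken from the tables in this notebook.

```python
from math import asin, cos, radians, sin, sqrt
from typing import Dict, List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude)


def great_circle_km(a: Coord, b: Coord) -> float:
    """Spherical distance in km, assuming a 6371 km Earth radius."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def nearest_station(lake: Coord, stations: Dict[int, Coord]) -> Tuple[int, float]:
    """Return (station id, distance in km) of the station closest to the lake."""
    best_id = min(stations, key=lambda sid: great_circle_km(lake, stations[sid]))
    return best_id, round(great_circle_km(lake, stations[best_id]), 1)


stations = {105370: (60.6185, 15.6574), 127310: (62.5246, 17.441)}

# One entry per sample row; lakes repeat, so coordinates repeat too.
sample_coords: List[Coord] = [
    (60.638793, 15.445276),  # Spjutsjön
    (60.638793, 15.445276),  # Spjutsjön again
    (62.330102, 16.987209),  # V. Rännöbodsjön
]

cache: Dict[Coord, Tuple[int, float]] = {}
for coord in sample_coords:
    if coord not in cache:  # compute each distinct lake only once
        cache[coord] = nearest_station(coord, stations)

matches = [cache[coord] for coord in sample_coords]
print(matches)
```

With 8974 sample rows but far fewer distinct lakes, this kind of caching would cut the number of distance calculations substantially without changing which station is selected.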

### 2.1. Match closest stations that have temperature data 

In [26]:
# Run the matching algorithm
df_lake_temp = find_closest_station(df_lake, df_temp_stations, parameter="Temp")
df_lake_temp.head()

Unnamed: 0,MD-MVM Id,Survey station,Latitude,Longitude,Sample date,Temp station id,Temp station dist,Temp station name
0,54,Spjutsjön,60.638793,15.445276,2001-03-28,105370,11.8,Falun-Lugnet
1,54,Spjutsjön,60.638793,15.445276,2001-05-21,105370,11.8,Falun-Lugnet
2,54,Spjutsjön,60.638793,15.445276,2001-08-22,105370,11.8,Falun-Lugnet
3,54,Spjutsjön,60.638793,15.445276,2001-10-15,105370,11.8,Falun-Lugnet
4,54,Spjutsjön,60.638793,15.445276,2002-02-26,105370,11.8,Falun-Lugnet


In [27]:
df_lake_temp.shape

(8974, 8)

### 2.2. Match closest stations that have precipitation data 

In [28]:
df_lake_precip = find_closest_station(
    df_lake_temp, df_precip_stations, parameter="Precip"
)
df_lake_precip.head()

Unnamed: 0,MD-MVM Id,Survey station,Latitude,Longitude,Sample date,Temp station id,Temp station dist,Temp station name,Precip station id,Precip station dist,Precip station name
0,54,Spjutsjön,60.638793,15.445276,2001-03-28,105370,11.8,Falun-Lugnet,105470,11.7,Bjursås
1,54,Spjutsjön,60.638793,15.445276,2001-05-21,105370,11.8,Falun-Lugnet,105470,11.7,Bjursås
2,54,Spjutsjön,60.638793,15.445276,2001-08-22,105370,11.8,Falun-Lugnet,105470,11.7,Bjursås
3,54,Spjutsjön,60.638793,15.445276,2001-10-15,105370,11.8,Falun-Lugnet,105470,11.7,Bjursås
4,54,Spjutsjön,60.638793,15.445276,2002-02-26,105370,11.8,Falun-Lugnet,105470,11.7,Bjursås


In [29]:
df_lake_precip.shape

(8974, 11)

### 2.3. Join data and evaluate results

In [30]:
# Add the temperature station coordinates
df_temp_stations = df_temp_stations.rename(
    columns={"Latitude": "Temp station lat", "Longitude": "Temp station long"}
)
df_lake = df_lake_precip.join(
    df_temp_stations[["Id", "Temp station lat", "Temp station long"]].set_index("Id"),
    on="Temp station id",
)

# Add the precipitation station coordinates
df_precip_stations = df_precip_stations.rename(
    columns={"Latitude": "Precip station lat", "Longitude": "Precip station long"}
)
df_lake = df_lake.join(
    df_precip_stations[["Id", "Precip station lat", "Precip station long"]].set_index(
        "Id"
    ),
    on="Precip station id",
)
df_lake.head()

Unnamed: 0,MD-MVM Id,Survey station,Latitude,Longitude,Sample date,Temp station id,Temp station dist,Temp station name,Precip station id,Precip station dist,Precip station name,Temp station lat,Temp station long,Precip station lat,Precip station long
0,54,Spjutsjön,60.638793,15.445276,2001-03-28,105370,11.8,Falun-Lugnet,105470,11.7,Bjursås,60.6185,15.6574,60.7438,15.4625
1,54,Spjutsjön,60.638793,15.445276,2001-05-21,105370,11.8,Falun-Lugnet,105470,11.7,Bjursås,60.6185,15.6574,60.7438,15.4625
2,54,Spjutsjön,60.638793,15.445276,2001-08-22,105370,11.8,Falun-Lugnet,105470,11.7,Bjursås,60.6185,15.6574,60.7438,15.4625
3,54,Spjutsjön,60.638793,15.445276,2001-10-15,105370,11.8,Falun-Lugnet,105470,11.7,Bjursås,60.6185,15.6574,60.7438,15.4625
4,54,Spjutsjön,60.638793,15.445276,2002-02-26,105370,11.8,Falun-Lugnet,105470,11.7,Bjursås,60.6185,15.6574,60.7438,15.4625


In [31]:
# Summarize distances (km) between lakes and their closest temperature stations
print(f"Mean: {df_lake['Temp station dist'].mean()}")
print(f"Min: {df_lake['Temp station dist'].min()}")
print(f"Max: {df_lake['Temp station dist'].max()}")

Mean: 21.552663249387116
Min: 1.0
Max: 48.0


In [32]:
# Summarize distances (km) between lakes and their closest precipitation stations
print(f"Mean: {df_lake['Precip station dist'].mean()}")
print(f"Min: {df_lake['Precip station dist'].min()}")
print(f"Max: {df_lake['Precip station dist'].max()}")

Mean: 13.427067082683308
Min: 1.0
Max: 45.4


In [33]:
# Check individual stations and their matches
df_lake.loc[df_lake["Survey station"] == "V. Rännöbodsjön"]

Unnamed: 0,MD-MVM Id,Survey station,Latitude,Longitude,Sample date,Temp station id,Temp station dist,Temp station name,Precip station id,Precip station dist,Precip station name,Temp station lat,Temp station long,Precip station lat,Precip station long
6620,134,V. Rännöbodsjön,62.330102,16.987209,2001-02-15,127310,31.9,Sundsvall-Timrå Flygplats,2222812,2.6,Bällsta,62.5246,17.441,62.3446,17.027
6621,134,V. Rännöbodsjön,62.330102,16.987209,2001-05-17,127310,31.9,Sundsvall-Timrå Flygplats,2222812,2.6,Bällsta,62.5246,17.441,62.3446,17.027
6622,134,V. Rännöbodsjön,62.330102,16.987209,2001-08-16,127310,31.9,Sundsvall-Timrå Flygplats,2222812,2.6,Bällsta,62.5246,17.441,62.3446,17.027
6623,134,V. Rännöbodsjön,62.330102,16.987209,2001-10-24,127310,31.9,Sundsvall-Timrå Flygplats,2222812,2.6,Bällsta,62.5246,17.441,62.3446,17.027
6624,134,V. Rännöbodsjön,62.330102,16.987209,2002-02-20,127310,31.9,Sundsvall-Timrå Flygplats,2222812,2.6,Bällsta,62.5246,17.441,62.3446,17.027
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6702,134,V. Rännöbodsjön,62.330102,16.987209,2021-08-03,127310,31.9,Sundsvall-Timrå Flygplats,2222812,2.6,Bällsta,62.5246,17.441,62.3446,17.027
6703,134,V. Rännöbodsjön,62.330102,16.987209,2021-10-21,127310,31.9,Sundsvall-Timrå Flygplats,2222812,2.6,Bällsta,62.5246,17.441,62.3446,17.027
6704,134,V. Rännöbodsjön,62.330102,16.987209,2022-02-15,127310,31.9,Sundsvall-Timrå Flygplats,2222812,2.6,Bällsta,62.5246,17.441,62.3446,17.027
6705,134,V. Rännöbodsjön,62.330102,16.987209,2022-05-09,127310,31.9,Sundsvall-Timrå Flygplats,2222812,2.6,Bällsta,62.5246,17.441,62.3446,17.027


## 3. Get SMHI data for the matching weather stations

With the closest stations identified for all lakes, we download their observations from SMHI. We use the corrected-archive period and keep dates from 2000-01-01 through 2022-12-31.

In [34]:
def download_csv_data(station_id: int, parameter: int) -> pd.DataFrame:
    """
    Download and process CSV data for a specific station and parameter from
    the SMHI Open Data API.

    Args:
        station_id: The ID of the weather station.
        parameter: The parameter identifier (2 for temp, 5 for precipitation).

    Returns:
        df or None: A dataframe containing weather data if successful, or None
            if the request fails.
    """
    # Base URL for SMHI Open Data API
    BASE_URL = "https://opendata-download-metobs.smhi.se/api/version/latest"

    endpoint = f"{BASE_URL}/parameter/{parameter}/station/{station_id}/period/corrected-archive/data.csv"
    response = requests.get(endpoint)

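    if parameter == 2:
        param_col_header = "Temp (°C)"
    elif parameter == 5:
        param_col_header = "Precip (mm)"
    else:
        # Guard against an undefined column header further down
        raise ValueError(f"Unsupported parameter: {parameter}")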

    if response.status_code == 200:
        # Store returned data in dataframe
        data = StringIO(response.text)
        df = pd.read_csv(
            data,
            sep=";",
            skiprows=50,
            usecols=[2, 3, 4],
            names=["Date", param_col_header, "Quality"],
        )
        df["Date"] = pd.to_datetime(df["Date"])
        df = df.loc[(df["Date"] >= "2000-01-01") & (df["Date"] <= "2022-12-31")]
        return df
    else:
        print(
            f"Failed to fetch data for station {station_id} and parameter {parameter}"
        )
        return None
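
The fixed `skiprows=50` assumes that every file has a preamble of the same length. If that does not hold for some station, rows could be lost or misread. A more defensive option is to search for the header line and read from there. The sketch below shows this on a made-up sample. The sample text, the `From` marker, and the helper `read_after_header` are all hypothetical stand-ins and not verbatim SMHI output, so a real file should be inspected before adopting this approach.

```python
import csv
from typing import List, Tuple

# Hypothetical file content: a variable-length preamble followed by a header
# row and data rows. Column positions 2-4 mirror usecols=[2, 3, 4] above.
SAMPLE_TEXT = """Station name;Station id
Example station;12345

Parameter;Unit
Temperature;celsius

From;To;Date;Value;Quality
2000-01-01 00:00:01;2000-01-02 00:00:00;2000-01-01;-9.2;Y
2000-01-02 00:00:01;2000-01-03 00:00:00;2000-01-02;-0.1;Y
"""


def read_after_header(text: str, marker: str) -> List[Tuple[str, float, str]]:
    """Return (date, value, quality) rows found after the header line."""
    lines = text.splitlines()
    header_idx = next(
        (i for i, line in enumerate(lines) if line.startswith(marker)), None
    )
    if header_idx is None:
        raise ValueError(f"No header line starting with {marker!r}")
    reader = csv.reader(lines[header_idx + 1 :], delimiter=";")
    # Skip blank or short lines instead of failing on them
    return [(row[2], float(row[3]), row[4]) for row in reader if len(row) >= 5]


rows = read_after_header(SAMPLE_TEXT, marker="From")
print(rows)
```

The same idea carries over to pandas by passing the detected index as `skiprows` instead of a hard-coded number.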

### 3.1. Temperature data

In [35]:
# Download and store temperature data
all_data = []
temp_stations = df_lake["Temp station id"].unique()

for station in temp_stations:
    df = download_csv_data(station, 2)
    if df is not None:
        df["Temp station id"] = station
        all_data.append(df)

# Combine all DataFrames into a single DataFrame
df_temp = pd.concat(all_data, ignore_index=True)
df_temp.head()

Unnamed: 0,Date,Temp (°C),Quality,Temp station id
0,2000-01-01,-9.2,Y,105370
1,2000-01-02,-0.1,Y,105370
2,2000-01-03,-3.2,Y,105370
3,2000-01-04,2.4,Y,105370
4,2000-01-05,-5.5,Y,105370


### 3.2. Precipitation data

In [36]:
# Download and store precipitation data
all_data = []
precip_stations = df_lake["Precip station id"].unique()

for station in precip_stations:
    df = download_csv_data(station, 5)
    if df is not None:
        df["Precip station id"] = station
        all_data.append(df)

# Combine all DataFrames into a single DataFrame
df_precip = pd.concat(all_data, ignore_index=True)
df_precip.head()

Failed to fetch data for station 2212878 and parameter 5


Unnamed: 0,Date,Precip (mm),Quality,Precip station id
0,2016-07-10,2.8,G,105470
1,2016-07-11,16.2,G,105470
2,2016-07-12,3.8,G,105470
3,2016-07-13,18.7,G,105470
4,2016-07-14,21.1,G,105470


## 4. Calculate rolling averages and create final data set

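Rolling statistics for the weather variables are calculated over trailing windows of 1, 2, 4, 12, and 52 weeks: the mean for temperature, and both the mean and the accumulated sum for precipitation. A value is only produced once a full window of observations is available, so the first rows of each station are empty. For each lake and sample date, the rolling values on that date are kept in the final data set.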
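
As a small worked example of the window logic, the sketch below computes a trailing 7-value mean in plain Python, returning `None` until a full window is available (the equivalent of `min_periods` equal to the window length). The function `rolling_mean` is an illustrative helper, not part of the notebook. The input values are the first eight daily temperatures of station 53300 as shown in the `df_temp` output.

```python
from typing import List, Optional


def rolling_mean(values: List[float], window: int) -> List[Optional[float]]:
    """Trailing mean over `window` values; None until the window is full."""
    result: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)  # not enough observations yet
        else:
            chunk = values[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result


temps = [0.4, 0.7, -2.8, -3.0, -1.5, 1.4, 1.7, 4.0]
weekly_avg = rolling_mean(temps, window=7)
print(weekly_avg)
```

The seventh and eighth results correspond to the first two non-empty `Temp 1w avg` values in the output (about -0.443 and 0.071). Note that a window defined as a number of rows only equals a calendar period when the daily series has no gaps. If a station has missing days, a time-based window on a datetime index (for example `"7D"` in pandas) is an alternative worth considering.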

In [37]:
# Calculate rolling average temperatures and
# rolling cumulative precipitation

# Select time windows
weeks = [1, 2, 4, 12, 52]

# Sort dataframes by date
df_temp = df_temp.sort_values(by=["Temp station id", "Date"])
df_precip = df_precip.sort_values(by=["Precip station id", "Date"])

# Iterate through all windows
for w in weeks:
    # Rolling average temperatures
    df_temp[f"Temp {w}w avg"] = (
        df_temp.groupby("Temp station id")["Temp (°C)"]
        .rolling(w * 7, min_periods=w * 7)
        .mean()
        .reset_index(drop=True)
        .values
    )
    df_temp = df_temp.sort_values(by=["Temp station id", "Date"]).reset_index(drop=True)

    # Rolling average daily precipitation
    df_precip[f"Precip {w}w avg"] = (
        df_precip.groupby("Precip station id")["Precip (mm)"]
        .rolling(w * 7, min_periods=w * 7)
        .mean()
        .reset_index(drop=True)
        .values
    )

    # Accumulated precipitation
    df_precip[f"Precip {w}w acc"] = (
        df_precip.groupby("Precip station id")["Precip (mm)"]
        .rolling(w * 7, min_periods=w * 7)
        .sum()
        .reset_index(drop=True)
        .values
    )
    df_precip = df_precip.sort_values(by=["Precip station id", "Date"]).reset_index(
        drop=True
    )

In [38]:
df_temp.head(20)

Unnamed: 0,Date,Temp (°C),Quality,Temp station id,Temp 1w avg,Temp 2w avg,Temp 4w avg,Temp 12w avg,Temp 52w avg
0,2008-01-01,0.4,Y,53300,,,,,
1,2008-01-02,0.7,Y,53300,,,,,
2,2008-01-03,-2.8,Y,53300,,,,,
3,2008-01-04,-3.0,Y,53300,,,,,
4,2008-01-05,-1.5,Y,53300,,,,,
5,2008-01-06,1.4,Y,53300,,,,,
6,2008-01-07,1.7,Y,53300,-0.442857,,,,
7,2008-01-08,4.0,Y,53300,0.071429,,,,
8,2008-01-09,3.6,Y,53300,0.485714,,,,
9,2008-01-10,4.0,Y,53300,1.457143,,,,


In [39]:
df_precip.head(20)

Unnamed: 0,Date,Precip (mm),Quality,Precip station id,Precip 1w avg,Precip 1w acc,Precip 2w avg,Precip 2w acc,Precip 4w avg,Precip 4w acc,Precip 12w avg,Precip 12w acc,Precip 52w avg,Precip 52w acc
0,2000-01-01,1.9,G,53410,,,,,,,,,,
1,2000-01-02,0.0,G,53410,,,,,,,,,,
2,2000-01-03,0.3,G,53410,,,,,,,,,,
3,2000-01-04,4.9,G,53410,,,,,,,,,,
4,2000-01-05,0.4,G,53410,,,,,,,,,,
5,2000-01-06,3.1,G,53410,,,,,,,,,,
6,2000-01-07,0.0,G,53410,1.514286,10.6,,,,,,,,
7,2000-01-08,0.0,G,53410,1.242857,8.7,,,,,,,,
8,2000-01-09,0.2,G,53410,1.271429,8.9,,,,,,,,
9,2000-01-10,0.0,G,53410,1.228571,8.6,,,,,,,,


In [40]:
# Join temp and precip data to dataframe with lake observations

df_lake["Sample date"] = pd.to_datetime(df_lake["Sample date"])

# Temperature data
df_lake = df_lake.join(
    df_temp.drop(["Temp (°C)", "Quality"], axis="columns").set_index(
        ["Temp station id", "Date"]
    ),
    on=["Temp station id", "Sample date"],
    how="left",
)
df_lake.head()

# Precipitation data
df_lake = df_lake.join(
    df_precip.drop(["Precip (mm)", "Quality"], axis="columns").set_index(
        ["Precip station id", "Date"]
    ),
    on=["Precip station id", "Sample date"],
    how="left",
)
df_lake.head()

Unnamed: 0,MD-MVM Id,Survey station,Latitude,Longitude,Sample date,Temp station id,Temp station dist,Temp station name,Precip station id,Precip station dist,...,Precip 1w avg,Precip 1w acc,Precip 2w avg,Precip 2w acc,Precip 4w avg,Precip 4w acc,Precip 12w avg,Precip 12w acc,Precip 52w avg,Precip 52w acc
0,54,Spjutsjön,60.638793,15.445276,2001-03-28,105370,11.8,Falun-Lugnet,105470,11.7,...,,,,,,,,,,
1,54,Spjutsjön,60.638793,15.445276,2001-05-21,105370,11.8,Falun-Lugnet,105470,11.7,...,,,,,,,,,,
2,54,Spjutsjön,60.638793,15.445276,2001-08-22,105370,11.8,Falun-Lugnet,105470,11.7,...,,,,,,,,,,
3,54,Spjutsjön,60.638793,15.445276,2001-10-15,105370,11.8,Falun-Lugnet,105470,11.7,...,,,,,,,,,,
4,54,Spjutsjön,60.638793,15.445276,2002-02-26,105370,11.8,Falun-Lugnet,105470,11.7,...,,,,,,,,,,
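
The empty precipitation columns above are what a left join produces when no weather row exists for a given station and sample date. The sketch below mimics the join with a dictionary keyed on (station id, date). The precipitation values are the first two rows shown for station 105470 in section 3.2, whose series appears to start in July 2016. The second sample date is hypothetical and only included to show a successful match.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[int, str]  # (station id, ISO date)

# Daily precipitation (mm) for station 105470 as shown in section 3.2
weather: Dict[Key, float] = {
    (105470, "2016-07-10"): 2.8,
    (105470, "2016-07-11"): 16.2,
}

# The first sample comes from the lake table; the second date is hypothetical.
samples: List[Key] = [(105470, "2001-03-28"), (105470, "2016-07-11")]

# A left join keeps every sample and fills None where no weather row matches
joined: List[Optional[float]] = [weather.get(key) for key in samples]
missing_share = sum(value is None for value in joined) / len(joined)
print(joined, missing_share)
```

Checking the share of unmatched rows per column in the final dataframe would show how many lake samples lack weather data, which may matter when the data set is used for modelling.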


In [41]:
df_lake.shape

(8974, 30)

## 5. Save processed data

Save the processed data as a CSV file in the repository's data folder, overwriting any existing file with the same name.

In [42]:
# Save the file in the data folder
df_lake.to_csv("../data/weather_data_clean.csv", index=False)