## Install required dependencies

In [2]:
#!pip install -r requirements.txt



## Connect to CrateDB

You need to provide a connection string to your CrateDB database cluster,
optionally using the environment variable `CRATEDB_CONNECTION_STRING`.

This example uses a CrateDB instance on your workstation, which you can start by
running [CrateDB using Docker]. Alternatively, you can also connect to a cluster
running on [CrateDB Cloud].

[CrateDB Cloud]: https://console.cratedb.cloud/
[CrateDB using Docker]: https://crate.io/docs/crate/tutorials/en/latest/basic/index.html#docker

In [10]:
import os
import sqlalchemy as sa

# For a CrateDB instance on your workstation, use:
#CONNECTION_STRING = os.environ.get(
#    "CRATEDB_CONNECTION_STRING",
#    "crate://crate@localhost/",
#)

# For CrateDB Cloud, use:
CONNECTION_STRING = os.environ.get(
    "CRATEDB_CONNECTION_STRING",
    "crate://USER:PASSWORD@CLUSTER.cratedb.net/?ssl=true",
)

engine = sa.create_engine(CONNECTION_STRING, echo=os.environ.get('INFO'))
connection = engine.connect()
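
The fallback logic above can be checked without a running cluster. The following is a minimal sketch, not part of the original notebook: the helper names `resolve_connection_string` and `mask_password` are hypothetical, and the masking step is only an illustration of how to avoid printing credentials when logging the connection string.

```python
import os
from urllib.parse import urlsplit

# Default for a local CrateDB instance, as in the commented-out cell above.
LOCAL_DEFAULT = "crate://crate@localhost/"


def resolve_connection_string(environ=os.environ, default=LOCAL_DEFAULT):
    """Return CRATEDB_CONNECTION_STRING from the environment, or the default."""
    return environ.get("CRATEDB_CONNECTION_STRING", default)


def mask_password(connection_string):
    """Replace the password part of a connection URL with '***' for logging."""
    parts = urlsplit(connection_string)
    if parts.password is None:
        return connection_string
    return connection_string.replace(f":{parts.password}@", ":***@", 1)


# Placeholder cloud URL, copied from the cell above.
cloud = "crate://USER:PASSWORD@CLUSTER.cratedb.net/?ssl=true"
print(mask_password(resolve_connection_string({"CRATEDB_CONNECTION_STRING": cloud})))
print(mask_password(resolve_connection_string({})))
```

Passing a plain dictionary instead of `os.environ` keeps the sketch independent of the real environment.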

## Create the Tables

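CrateDB uses SQL, a powerful and familiar language for database management. The `CREATE TABLE` command defines tables with columns tailored to a dataset. As an illustration, the following statements create two tables for device readings and device info data, including nested `OBJECT(DYNAMIC)` columns; the notebook cell after them uses the same command to create the `weather_data` table that the rest of this notebook works with: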

```sql
CREATE TABLE IF NOT EXISTS devices_readings (
   "ts" TIMESTAMP WITH TIME ZONE,
   "device_id" TEXT,
   "battery" OBJECT(DYNAMIC) AS (
      "level" BIGINT,
      "status" TEXT,
      "temperature" DOUBLE PRECISION
   ),
   "cpu" OBJECT(DYNAMIC) AS (
      "avg_1min" DOUBLE PRECISION,
      "avg_5min" DOUBLE PRECISION,
      "avg_15min" DOUBLE PRECISION
   ),
   "memory" OBJECT(DYNAMIC) AS (
      "free" BIGINT,
      "used" BIGINT
   )
);

CREATE TABLE IF NOT EXISTS devices_info (
   "device_id" TEXT,
   "api_version" TEXT,
   "manufacturer" TEXT,
   "model" TEXT,
   "os_name" TEXT
);
```

In [118]:
_ = connection.execute(sa.text(
    """
    CREATE TABLE IF NOT EXISTS weather_data (
        "timestamp" TIMESTAMP,
        "location" VARCHAR,
        "temperature" DOUBLE,
        "humidity" DOUBLE,
        "wind_speed" DOUBLE
    )
    """))

## Loading Initial Data

Let us load an initial set of data using the `COPY FROM` SQL statement.

The result contains information about the successfully written rows and any errors that might have occurred.

We expect output like
`[({'id': <SOME_ID>, 'name': <SOME_NAME>}, 'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/data_weather.csv.gz', 70000, 0, {})]`

This result shows that 70000 rows have been loaded successfully and that no errors occurred.
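
To read such a result programmatically, the counts can be totalled over all rows. This is a minimal sketch, assuming the statement returns summary rows in the shape of the expected output shown above (in CrateDB, this is the shape produced by `COPY FROM ... RETURN SUMMARY`). The helper name `summarize_copy_result` and the node values in the sample row are placeholders.

```python
def summarize_copy_result(rows):
    """Total the success and error counts over all summary rows.

    Each row is assumed to be (node, uri, success_count, error_count, errors).
    """
    loaded = sum(row[2] for row in rows)
    failed = sum(row[3] for row in rows)
    return loaded, failed


# Sample row in the shape of the expected output; node id and name are placeholders.
sample = [(
    {"id": "SOME_ID", "name": "SOME_NAME"},
    "https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/data_weather.csv.gz",
    70000,
    0,
    {},
)]

loaded, failed = summarize_copy_result(sample)
print(f"{loaded} rows loaded, {failed} rows failed")
```

On a multi-node cluster the summary may contain one row per node, which is why the sketch sums over all rows instead of reading only the first one.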

In [30]:
result = connection.execute(sa.text(
    """
    COPY weather_data
    FROM 'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/data_weather.csv.gz'
    WITH (format='csv', compression='gzip', empty_string_as_null=true)
    RETURN SUMMARY
    """))
# Display the summary: node, URI, success count, error count, errors
result.fetchall()

After inserting data, it is recommended to `ANALYZE` the tables to supply the query optimizer with the necessary statistics. We also execute a `REFRESH` statement to ensure that the data is up to date.

In [31]:
_ = connection.execute(sa.text("REFRESH TABLE weather_data"))
_ = connection.execute(sa.text("ANALYZE"))

## Querying data into a dataframe

In this first example, we want to have a look at the data and query it into a pandas dataframe.

CrateDB stores timestamps as big integers representing milliseconds since the Unix epoch. For better readability, we convert them to datetime values in Python.
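
The conversion itself is plain arithmetic and can be sketched with the standard library alone. The helper name `epoch_ms_to_datetime` is hypothetical, and the sample value is one of the raw timestamps that appears in this notebook's query output.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_ms_to_datetime(ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    # Adding a timedelta avoids the rounding that float division could introduce.
    return UNIX_EPOCH + timedelta(milliseconds=ms)


print(epoch_ms_to_datetime(1672532406000).isoformat())
# → 2023-01-01T00:20:06+00:00
```

The notebook cells use `pd.to_datetime(..., unit='ms')` for the same purpose, applied to a whole column at once.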

We will see weather data from multiple cities at different timestamps. These sample data already show that some data is missing - we will interpolate it later on.

In [125]:
import numpy as np
import pandas as pd

query = "SELECT * FROM weather_data ORDER BY timestamp LIMIT 10"
df = pd.read_sql(query, CONNECTION_STRING)
df['timestamp'] = np.array(pd.to_datetime(df['timestamp'], unit='ms'))
df.head(10)


Unnamed: 0,timestamp,location,temperature,humidity,wind_speed,attributes
0,2023-01-01 00:00:40,Vienna,,,12.364712,
1,2023-01-01 00:05:43,Zurich,19.921922,,10.053553,
2,2023-01-01 00:10:01,Zurich,20.149884,45.397803,13.094305,
3,2023-01-01 00:15:00,Dornbirn,19.840251,97.157829,0.927458,
4,2023-01-01 00:20:06,Berlin,20.103726,81.005533,9.731158,
5,2023-01-01 00:25:24,Vienna,20.329405,90.035633,7.045718,
6,2023-01-01 00:30:02,Dornbirn,20.485222,90.23342,4.534103,
7,2023-01-01 00:35:53,Berlin,19.855271,79.228996,0.714025,
8,2023-01-01 00:40:40,Dornbirn,19.744811,48.965236,11.88162,
9,2023-01-01 00:45:14,Berlin,20.276,32.238157,7.677409,


## A few sample queries

CrateDB is built for fast aggregation using the columnar storage to speed up queries. For example, calculate the average temperature for each location by using the AVG aggregation function:

In [127]:
query = """
SELECT 
    location, 
    AVG(temperature) AS avg_temp
FROM weather_data
GROUP BY location;
"""

df = pd.read_sql(query, CONNECTION_STRING)
df.head(10)

Unnamed: 0,location,avg_temp
0,Redwood City,25.933538
1,Dornbirn,24.080325
2,Berlin,23.956854
3,Vienna,24.018788
4,Zurich,23.99455


Computing basic averages is nothing special, but what if you need to answer more detailed questions? For example, you might want to know the highest and lowest temperature for each place and when each occurred.

Simple groupings might not be enough, but thankfully, CrateDB has enhanced tools for time series data. You can use the `max_by(returned_value, maximized_value)` or `min_by(returned_value, minimized_value)` function, which returns a value (like the time) when another value (like the temperature) is at its maximum or minimum, respectively.
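
The semantics of these two functions can be sketched in plain Python. This is only an illustration of the idea, not CrateDB's implementation: the readings are made-up values, and skipping rows whose compared value is missing is an assumption of this sketch.

```python
def max_by(rows, returned, compared):
    """Return the `returned` field of the row whose `compared` field is largest."""
    candidates = [row for row in rows if row[compared] is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row[compared])[returned]


def min_by(rows, returned, compared):
    """Return the `returned` field of the row whose `compared` field is smallest."""
    candidates = [row for row in rows if row[compared] is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda row: row[compared])[returned]


# Hypothetical readings for one location; timestamps are simple counters.
readings = [
    {"timestamp": 1, "temperature": 20.1},
    {"timestamp": 2, "temperature": None},
    {"timestamp": 3, "temperature": 35.5},
    {"timestamp": 4, "temperature": 9.5},
]

print(max_by(readings, "timestamp", "temperature"))  # → 3
print(min_by(readings, "timestamp", "temperature"))  # → 4
```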

Let’s put this to use with the following query:

In [130]:
query = """
SELECT location,
    max(temperature) AS max_temp,
    max_by(timestamp, temperature) AS time_of_max_temp,
    min(temperature) AS min_temp,
    min_by(timestamp, temperature) AS time_of_min_temp
FROM weather_data
GROUP BY location
"""

df_temperature = pd.read_sql(query, CONNECTION_STRING)

# milliseconds to datetime for better readability
df_temperature['time_of_max_temp'] = np.array(pd.to_datetime(df_temperature['time_of_max_temp'], unit='ms'))
df_temperature['time_of_min_temp'] = np.array(pd.to_datetime(df_temperature['time_of_min_temp'], unit='ms'))

df_temperature.head(10)

Unnamed: 0,location,max_temp,time_of_max_temp,min_temp,time_of_min_temp
0,Redwood City,37.491913,2023-04-27 06:45:14,11.501863,2023-08-31 18:00:41.000
1,Dornbirn,35.485976,2023-04-05 06:00:20,9.52796,2023-08-19 18:45:51.000
2,Berlin,35.492248,2023-04-16 06:00:19,9.50499,2023-08-08 18:35:24.000
3,Vienna,35.489541,2023-04-23 06:00:55,9.2,2023-11-17 20:01:48.646
4,Zurich,35.48786,2023-04-12 06:55:58,9.52336,2023-08-02 18:35:07.000


### Exploring the data further

Let us further explore the data, visualize it, and step by step introduce more advanced CrateDB features.

In [55]:
from datetime import datetime
import plotly.express as px
import numpy as np
import warnings
# Suppress pandas FutureWarnings that clutter the output
warnings.simplefilter("ignore", category=FutureWarning)

In [38]:
query = "SELECT * FROM weather_data ORDER BY location, timestamp"
df_weather = pd.read_sql(query, CONNECTION_STRING)
df_weather.head(10)

Unnamed: 0,timestamp,location,temperature,humidity,wind_speed
0,1672532406000,Berlin,20.103726,81.005533,9.731158
1,1672532406000,Berlin,20.103726,81.005533,9.731158
2,1672532406000,Berlin,20.103726,81.005533,9.731158
3,1672533353000,Berlin,19.855271,79.228996,0.714025
4,1672533353000,Berlin,19.855271,79.228996,0.714025
5,1672533353000,Berlin,19.855271,79.228996,0.714025
6,1672533914000,Berlin,20.276,32.238157,7.677409
7,1672533914000,Berlin,20.276,32.238157,7.677409
8,1672533914000,Berlin,20.276,32.238157,7.677409
9,1672536055000,Berlin,21.301758,55.687224,14.534997


First, create a plot of the data to get a visual impression. We will facet by location to see the complete data set.

In order to further analyze the data, please zoom in so that you only see a few days of data (just click and drag with the mouse to zoom in).

In [199]:
# Creating a line chart for temperature, humidity, and wind_speed
df_weather['timestamp'] = pd.to_datetime(df_weather['timestamp'], unit='ms')
fig = px.line(df_weather, x='timestamp', y=['temperature', 'humidity', 'wind_speed'], 
              facet_col='location', title='Temperature, Humidity, and Wind Speed')
fig.update_xaxes(tickangle=90)
fig.show()

In order to better visualize the results, we will plot a smaller range of data of just two days.

In [198]:
# Plot again with two days of data
fig = px.line(df_weather[df_weather['timestamp'].between(datetime(2023, 6, 1), datetime(2023, 6, 3))], x='timestamp', y=['temperature', 'humidity', 'wind_speed'], 
              facet_col='location', title='Temperature, Humidity, and Wind Speed')
fig.update_xaxes(tickangle=90)
fig.show()

## Interpolation

You have probably observed by now that there are gaps in the dataset for certain metrics. Such occurrences are common, perhaps due to a sensor malfunction or disconnection. To address this, the missing values need to be filled in.

This example query introduces multiple additional features of CrateDB:

- **Common Table Expressions:** CTEs provide a way to reference subqueries by name within the primary query. The subqueries effectively act as temporary tables or views for the duration of the primary query. This can improve the readability of SQL code, as it breaks down complicated queries into smaller parts.
- **Window Functions:** CrateDB supports the [OVER](https://cratedb.com/docs/crate/reference/en/latest/general/builtins/window-functions.html#window-definition-over)  clause to enable the execution of [window functions](https://cratedb.com/docs/crate/reference/en/latest/general/builtins/window-functions.html#window-functions). Paired with the [`IGNORE NULLS`](https://cratedb.com/docs/crate/reference/en/latest/general/builtins/window-functions.html#window-functions) clause, null values are excluded from the window function executions. The window functions that support this option are: [`lead(arg [, offset [, default] ])`](https://cratedb.com/docs/crate/reference/en/latest/general/builtins/window-functions.html#window-functions-lead), [`lag(arg [, offset [, default] ])`](https://cratedb.com/docs/crate/reference/en/latest/general/builtins/window-functions.html#window-functions-lag), [`first_value(arg)`](https://cratedb.com/docs/crate/reference/en/latest/general/builtins/window-functions.html#window-functions-first-value), `last_value(arg)`, and [`nth_value(arg, number)`](https://cratedb.com/docs/crate/reference/en/latest/general/builtins/window-functions.html#window-functions-nth-value). We utilize window functions to spot the next and prior non-null temperature recordings, and then compute the arithmetic mean to bridge the gap.
- **Window Definition:** We use the named window `location_window` to partition the data by location and order it by location and timestamp, so that the correct previous and next rows are identified.
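
The gap-filling logic described in the list above can be sketched in plain Python for a single metric of a single location. This is an illustration under stated assumptions, not the query itself: the function name `interpolate_gaps` and the sample values are made up, and a gap at the very start or end of the series stays empty because it has no neighbor on one side.

```python
def interpolate_gaps(values):
    """Fill None entries with the mean of the nearest non-null neighbors.

    `values` is assumed to be ordered by timestamp for one location.
    Neighbors are looked up in the raw input, not in already filled values.
    """
    result = []
    for i, value in enumerate(values):
        if value is not None:
            result.append(value)
            continue
        # Previous and next non-null readings, like LAG/LEAD with IGNORE NULLS.
        prev = next((v for v in reversed(values[:i]) if v is not None), None)
        nxt = next((v for v in values[i + 1:] if v is not None), None)
        if prev is None or nxt is None:
            result.append(None)
        else:
            result.append((prev + nxt) / 2)
    return result


print(interpolate_gaps([None, 19.0, None, None, 21.0, None]))
# → [None, 19.0, 20.0, 20.0, 21.0, None]
```

Both consecutive gaps receive the same value because the mean is taken over the nearest original readings on each side, not over previously filled values.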

In [148]:
query = """
WITH OrderedData AS (
    SELECT timestamp,
           location,
           temperature,
           COALESCE(LAG(temperature, 1) IGNORE NULLS OVER location_window, temperature) AS prev_temperature,
           COALESCE(LEAD(temperature, 1) IGNORE NULLS OVER location_window, temperature) AS next_temperature,
           humidity,
           COALESCE(LAG(humidity, 1) IGNORE NULLS OVER location_window, humidity) AS prev_humidity,
           COALESCE(LEAD(humidity, 1) IGNORE NULLS OVER location_window, humidity) AS next_humidity,
           wind_speed,
           COALESCE(LAG(wind_speed, 1) IGNORE NULLS OVER location_window, wind_speed) AS prev_wind_speed,
           COALESCE(LEAD(wind_speed, 1) IGNORE NULLS OVER location_window, wind_speed) AS next_wind_speed
    FROM weather_data
    WINDOW location_window AS (partition by location ORDER BY location, timestamp)
)
SELECT timestamp,
       location,
       COALESCE(temperature, (prev_temperature + next_temperature) / 2) as temperature,
       COALESCE(humidity, (prev_humidity + next_humidity) / 2) as humidity,
       COALESCE(wind_speed, (prev_wind_speed + next_wind_speed) / 2) as wind_speed
FROM OrderedData
ORDER BY location, timestamp
"""

#pd.set_option('display.max_rows', 100)

df_weather_interpolated = pd.read_sql(query, CONNECTION_STRING)
df_weather_interpolated['timestamp'] = pd.to_datetime(df_weather_interpolated['timestamp'], unit='ms')
df_weather_interpolated.head(10)

Unnamed: 0,timestamp,location,temperature,humidity,wind_speed
0,2023-01-01 00:20:06,Berlin,20.103726,81.005533,9.731158
1,2023-01-01 00:35:53,Berlin,19.855271,79.228996,0.714025
2,2023-01-01 00:45:14,Berlin,20.276,32.238157,7.677409
3,2023-01-01 01:20:55,Berlin,21.301758,55.687224,14.534997
4,2023-01-01 01:30:51,Berlin,21.034967,35.431482,13.160233
5,2023-01-01 02:25:24,Berlin,22.785129,76.821308,7.833615
6,2023-01-01 02:50:45,Berlin,22.395786,50.281067,12.518645
7,2023-01-01 03:00:59,Berlin,23.914804,40.42473,12.035399
8,2023-01-01 03:30:02,Berlin,23.183628,50.596705,12.961223
9,2023-01-01 03:50:15,Berlin,23.736812,32.566016,6.995728


In [149]:
# Plot again for two days of data
fig = px.line(df_weather_interpolated[df_weather_interpolated['timestamp'].between(datetime(2023, 6, 1), datetime(2023, 6, 3))], x='timestamp', y=['temperature', 'humidity', 'wind_speed'], 
              facet_col='location', title='Temperature, Humidity, and Wind Speed')
fig.update_xaxes(tickangle=90)
fig.show()

## Create a View to Use the Calculated Features

We want to use the interpolated data in the following steps, so we store the query as a view. This can be particularly useful if you want to use additional features for downstream applications.

In [151]:
query = """
CREATE OR REPLACE VIEW weather_data_interpolated AS (
    WITH OrderedData AS (
        SELECT timestamp,
               location,
               temperature,
               COALESCE(LAG(temperature, 1) IGNORE NULLS OVER location_window, temperature) AS prev_temperature,
               COALESCE(LEAD(temperature, 1) IGNORE NULLS OVER location_window, temperature) AS next_temperature,
               humidity,
               COALESCE(LAG(humidity, 1) IGNORE NULLS OVER location_window, humidity) AS prev_humidity,
               COALESCE(LEAD(humidity, 1) IGNORE NULLS OVER location_window, humidity) AS next_humidity,
               wind_speed,
               COALESCE(LAG(wind_speed, 1) IGNORE NULLS OVER location_window, wind_speed) AS prev_wind_speed,
               COALESCE(LEAD(wind_speed, 1) IGNORE NULLS OVER location_window, wind_speed) AS next_wind_speed
        FROM weather_data
        WINDOW location_window AS (partition by location ORDER BY location, timestamp)
    )
    SELECT timestamp,
           location,
           COALESCE(temperature, (prev_temperature + next_temperature) / 2) as temperature,
           COALESCE(humidity, (prev_humidity + next_humidity) / 2) as humidity,
           COALESCE(wind_speed, (prev_wind_speed + next_wind_speed) / 2) as wind_speed
    FROM OrderedData
    ORDER BY location, timestamp
)
"""

_ = connection.execute(sa.text(query))

## Moving Averages

Moving averages are a widely used statistical technique in time series analysis, smoothing out short-term fluctuations and highlighting longer-term trends or cycles. We will use the previously created view of interpolated data. We will create moving averages for a range of 10 and 20 readings.
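
In SQL, a trailing moving average is typically expressed with a window frame of the form `ROWS BETWEEN n PRECEDING AND CURRENT ROW`. The following minimal sketch mirrors that frame in plain Python; the function name `trailing_mean` and the sample values are hypothetical. Note that such a frame covers up to `n + 1` rows, since it includes the current row.

```python
def trailing_mean(values, preceding):
    """Average each value with up to `preceding` values before it.

    Mirrors AVG(...) OVER (... ROWS BETWEEN preceding PRECEDING AND CURRENT ROW):
    at the start of the series the frame is simply shorter.
    """
    result = []
    for i in range(len(values)):
        frame = values[max(0, i - preceding): i + 1]
        result.append(sum(frame) / len(frame))
    return result


print(trailing_mean([1.0, 2.0, 3.0, 4.0], preceding=1))
# → [1.0, 1.5, 2.5, 3.5]
```

When the data holds several locations, the function would be applied to each location's series separately, which corresponds to a `PARTITION BY location` in the window definition.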

In [155]:
query = """
SELECT 
    timestamp, 
    location,
    temperature,
    AVG(temperature) OVER w_10 AS temperature_ma_10,
    AVG(temperature) OVER w_20 AS temperature_ma_20,
    humidity,
    AVG(humidity) OVER w_10 AS humidity_ma_10,
    AVG(humidity) OVER w_20 AS humidity_ma_20,
    wind_speed,
    AVG(wind_speed) OVER w_10 AS wind_speed_ma_10,
    AVG(wind_speed) OVER w_20 AS wind_speed_ma_20
FROM 
    weather_data_interpolated
WINDOW
    w_10 AS (PARTITION BY location ORDER BY timestamp ROWS BETWEEN 9 PRECEDING AND CURRENT ROW),
    w_20 AS (PARTITION BY location ORDER BY timestamp ROWS BETWEEN 19 PRECEDING AND CURRENT ROW)
ORDER BY location, timestamp
"""

df_weather_ma = pd.read_sql(query, CONNECTION_STRING)
df_weather_ma['timestamp'] = pd.to_datetime(df_weather_ma['timestamp'], unit='ms')

# Plot for two days of data
fig = px.line(df_weather_ma[df_weather_ma['timestamp'].between(datetime(2023, 6, 1), datetime(2023, 6, 3))], 
              x='timestamp', y=['temperature', 'humidity', 'humidity_ma_10', 'wind_speed', 'wind_speed_ma_10'], 
              facet_col='location', title='Temperature, Humidity, and Wind Speed', color_discrete_map={
                 "temperature": "#636EFA",
                 "humidity": "#EF553B",
                 "wind_speed": "#00CC96"
             })
fig.update_xaxes(tickangle=90)
fig.show()

## Arbitrary JSON data as measurements

We can add additional metadata, or even measurements as JSON data in CrateDB. It can follow a dynamic schema, i.e. the data you insert can vary per sensor type / weather station. The automatic indexing capabilities make that data available for aggregations and search immediately.

In [163]:
# Run this once if the `attributes` column does not exist yet:
# result = connection.execute(sa.text('ALTER TABLE weather_data ADD COLUMN attributes OBJECT(DYNAMIC)'))

result = connection.execute(sa.text(
    """
    INSERT INTO weather_data (timestamp, location, temperature, humidity, wind_speed, attributes) VALUES (
    CURRENT_TIMESTAMP, 'Vienna', 9.2, 80, 7.2, '{"measurement_accuracy": "+/- 0.5°C",
        "data_source": "Local Automated Station",
        "weather_condition_code": "Clear",
        "sensor_information": {
            "software_version": "v2.4.1",
            "maintenance_record": "2023-11-01",
            "data_collection_method": "Automatic"
        }}')
    """
))

result = connection.execute(sa.text(
    """
    INSERT INTO weather_data (timestamp, location, temperature, humidity, wind_speed, attributes) VALUES (
    CURRENT_TIMESTAMP, 'Munich', 10.2, 65, 4.1, '{"measurement_accuracy": "+/- 0.5°C",
        "data_source": "Local Automated Station",
        "weather_condition_code": "Clear",
        "sensor_information": {
            "software_version": "v1.2.3",
            "maintenance_record": "2023-11-01",
            "data_collection_method": "Automatic"
        }}')
    """
))

Querying the data is as easy as inserting it: attributes of JSON objects are accessed using square brackets, including in nested JSON objects.

For example, `attributes['weather_condition_code'] = 'Clear'` can be used in a `WHERE` clause, and filtering for `attributes['sensor_information']['software_version'] = 'v1.2.3'` returns only values from sensors with that software version.

We can either transform the results to "regular columns" at query time or leverage the powerful `json_normalize` function of pandas.
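
What flattening a nested object into "regular columns" means can be sketched in plain Python. This is a simplified illustration, not the pandas implementation: the helper name `flatten` is hypothetical, and the sample object is a shortened version of the attributes inserted above. Nested keys are joined with a dot, the same naming that `json_normalize` uses for its columns.

```python
def flatten(obj, prefix=""):
    """Flatten nested dictionaries into one level with dot-separated keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Shortened version of the attributes inserted above.
attributes = {
    "weather_condition_code": "Clear",
    "sensor_information": {
        "software_version": "v1.2.3",
        "data_collection_method": "Automatic",
    },
}

# Nested access with brackets, analogous to the SQL subscript syntax.
print(attributes["sensor_information"]["software_version"])  # → v1.2.3
print(flatten(attributes))
```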

A SQL query could look like this:

```sql
SELECT timestamp, 
       location, 
       temperature, 
       attributes['weather_condition_code'] AS weather_condition_code, 
       attributes['sensor_information']['software_version'] AS software_version 
FROM weather_data 
ORDER BY location, timestamp;
```

In [166]:
query = "SELECT * FROM weather_data WHERE attributes['weather_condition_code'] = 'Clear' ORDER BY timestamp DESC"
df_weather_att = pd.read_sql(query, CONNECTION_STRING)
df_weather_att['timestamp'] = pd.to_datetime(df_weather_att['timestamp'], unit='ms')

df_weather_att

Unnamed: 0,timestamp,location,temperature,humidity,wind_speed,attributes
0,2023-11-18 05:20:45.820,Munich,10.2,65.0,4.1,"{'measurement_accuracy': '+/- 0.5°C', 'sensor_..."
1,2023-11-18 05:20:45.730,Vienna,9.2,80.0,7.2,"{'measurement_accuracy': '+/- 0.5°C', 'sensor_..."
2,2023-11-18 05:17:34.410,Munich,10.2,65.0,4.1,"{'measurement_accuracy': '+/- 0.5°C', 'mainten..."
3,2023-11-18 05:17:34.354,Vienna,9.2,80.0,7.2,"{'measurement_accuracy': '+/- 0.5°C', 'mainten..."
4,2023-11-17 20:01:48.646,Vienna,9.2,80.0,7.2,"{'measurement_accuracy': '+/- 0.5°C', 'mainten..."


In [168]:
df_weather_att = df_weather_att.join(pd.json_normalize(df_weather_att.pop('attributes')))
df_weather_att

Unnamed: 0,timestamp,location,temperature,humidity,wind_speed,measurement_accuracy,weather_condition_code,data_source,sensor_information.software_version,sensor_information.maintenance_record,sensor_information.data_collection_method,maintenance_record,software_version,data_collection_method
0,2023-11-18 05:20:45.820,Munich,10.2,65.0,4.1,+/- 0.5°C,Clear,Local Automated Station,v1.2.3,2023-11-01,Automatic,,,
1,2023-11-18 05:20:45.730,Vienna,9.2,80.0,7.2,+/- 0.5°C,Clear,Local Automated Station,v2.4.1,2023-11-01,Automatic,,,
2,2023-11-18 05:17:34.410,Munich,10.2,65.0,4.1,+/- 0.5°C,Clear,Local Automated Station,,,,2023-11-01,v1.2.3,Automatic
3,2023-11-18 05:17:34.354,Vienna,9.2,80.0,7.2,+/- 0.5°C,Clear,Local Automated Station,,,,2023-11-01,v2.4.1,Automatic
4,2023-11-17 20:01:48.646,Vienna,9.2,80.0,7.2,+/- 0.5°C,Clear,Local Automated Station,,,,2023-11-01,v2.4.1,Automatic


## JOINS

When designing your core data model, you will often end up with time series data in one table and the corresponding metadata in another one.

Without database joins, you can easily end up loading large amounts of data into your application just for the purpose of doing joins there, putting huge stress on your infrastructure.

Let us extend the schema with another table that describes the weather stations and execute a few example joins. 

**Hint:** Use the `EXPLAIN` statement to analyze the execution plan. In order to push down aggregations below joins, the usage of common table expressions is recommended.
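
The query shape recommended in the hint, aggregating inside a common table expression and joining afterwards, can be sketched as follows. To keep the example runnable without a cluster, it uses SQLite from Python's standard library as a stand-in for CrateDB, with heavily reduced versions of the two tables and made-up temperature values. It only illustrates the structure of the query; execution plans and performance characteristics of SQLite say nothing about CrateDB.

```python
import sqlite3

# In-memory SQLite database as a stand-in; the schema is a reduced assumption.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE weather_data (location TEXT, temperature REAL);
    CREATE TABLE weather_stations (city TEXT, station_id TEXT);
    INSERT INTO weather_data VALUES ('Berlin', 20.0), ('Berlin', 22.0), ('Vienna', 18.0);
    INSERT INTO weather_stations VALUES ('Berlin', 'BER001'), ('Vienna', 'VIE005');
""")

# Aggregate first inside the CTE, then join the small aggregated result.
query = """
WITH avg_per_location AS (
    SELECT location, AVG(temperature) AS avg_temp
    FROM weather_data
    GROUP BY location
)
SELECT s.station_id, a.location, a.avg_temp
FROM avg_per_location a
JOIN weather_stations s ON a.location = s.city
ORDER BY a.location
"""

rows = con.execute(query).fetchall()
print(rows)
# → [('BER001', 'Berlin', 21.0), ('VIE005', 'Vienna', 18.0)]
```

The join then operates on one row per location instead of on every reading, which is the effect the hint aims for.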

In [None]:
query = """
CREATE TABLE weather_stations (
    city TEXT,
    station_id TEXT,
    geopoint GEO_POINT,
    additional_info OBJECT(DYNAMIC) AS (
        altitude INT,
        established_year INT,
        station_manager TEXT,
        area_covered TEXT,
        equipment_type TEXT
    )
)
"""
_ = connection.execute(sa.text(query))

query = """
INSERT INTO weather_stations (city, station_id, geopoint, additional_info) VALUES 
('Berlin', 'BER001', 'POINT(13.4050 52.5200)', '{"altitude": 34, "established_year": 1985, "station_manager": "Dr. Klaus Weber", "area_covered": "Greater Berlin", "equipment_type": "Automatic Weather Station"}'),
('Dornbirn', 'DOR002', 'POINT(9.7438 47.4132)', '{"altitude": 437, "established_year": 1990, "station_manager": "Mag. Elisabeth Hofer", "area_covered": "Vorarlberg Region", "equipment_type": "Synoptic Weather Station"}'),
('Zurich', 'ZUR003', 'POINT(8.5417 47.3769)', '{"altitude": 408, "established_year": 1978, "station_manager": "Herr Lukas Müller", "area_covered": "Canton of Zurich", "equipment_type": "Climatological Weather Station"}'),
('Redwood City', 'RED004', 'POINT(-122.2364 37.4852)', '{"altitude": 7, "established_year": 2003, "station_manager": "Dr. Emily Johnson", "area_covered": "San Mateo County", "equipment_type": "Hydro-Meteorological Station"}'),
('Vienna', 'VIE005', 'POINT(16.3738 48.2082)', '{"altitude": 151, "established_year": 1982, "station_manager": "Dr. Franz Schmidt", "area_covered": "Vienna Metropolitan Area", "equipment_type": "Radar Weather Station"}');
"""
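_ = connection.execute(sa.text(query))
# Make the inserted rows visible to the join query below
_ = connection.execute(sa.text("REFRESH TABLE weather_stations"))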

In [197]:
query = """
    SELECT timestamp, location, temperature, humidity, wind_speed, additional_info['altitude'] as altitude, geopoint
    FROM weather_data LEFT JOIN weather_stations ON (weather_data.location = weather_stations.city)
    LIMIT 100
"""
df_weather_stations = pd.read_sql(query, CONNECTION_STRING)
df_weather_stations['timestamp'] = pd.to_datetime(df_weather_stations['timestamp'], unit='ms')

df_weather_stations.head(10)

Unnamed: 0,timestamp,location,temperature,humidity,wind_speed,altitude,geopoint
0,2023-01-01 03:50:15,Berlin,23.736812,32.566016,6.995728,34,"[13.405, 52.52]"
1,2023-01-01 09:05:31,Berlin,23.308959,95.373535,5.512651,34,"[13.405, 52.52]"
2,2023-01-01 11:40:57,Berlin,21.069977,95.415336,0.916734,34,"[13.405, 52.52]"
3,2023-01-01 12:05:11,Berlin,20.241718,82.016328,7.78017,34,"[13.405, 52.52]"
4,2023-01-01 15:25:26,Berlin,16.201996,51.027113,9.48928,34,"[13.405, 52.52]"
5,2023-01-01 16:55:41,Berlin,15.969215,59.613926,14.564787,34,"[13.405, 52.52]"
6,2023-01-01 17:40:16,Berlin,14.884941,82.128397,2.001783,34,"[13.405, 52.52]"
7,2023-01-01 18:25:35,Berlin,14.802825,67.792081,14.41607,34,"[13.405, 52.52]"
8,2023-01-01 19:05:52,Berlin,15.391499,41.888298,9.668151,34,"[13.405, 52.52]"
9,2023-01-01 19:45:39,Berlin,15.177686,66.65146,9.748926,34,"[13.405, 52.52]"
