Commit

Merge pull request #35 from citrusvanilla/issue-32--compact_storage
update README, docs
citrusvanilla committed Mar 21, 2023
2 parents f5fc7cc + a0a3c2c commit 0e2448f
Showing 7 changed files with 122 additions and 113 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/build.yml
@@ -20,6 +20,8 @@ jobs:
run: |
pip install --upgrade pip
pip install -r requirements.txt
- name: Check README
run: rstcheck README.rst
- name: Check code formatting
run: black --check tinyflux/ tests/ examples/
- name: Check code style
9 changes: 7 additions & 2 deletions README.rst
@@ -20,7 +20,7 @@ Recent Updates
**************

v0.3.0 (2023-3-21)
^^^^^^^^^^^^^^^^^^
==================

* Tag and field keys can be compacted when using CSVStorage, saving potentially many bytes per Point (resolves issue #32).
* Fixed bug that causes tag values of '' to be serialized as "_none" (resolves issue #33).
@@ -126,6 +126,11 @@ The `examples <https://github.com/citrusvanilla/tinyflux/tree/master/examples>`_
2. `Local Analytics Workflow with a TinyFlux Database <https://github.com/citrusvanilla/tinyflux/blob/master/examples/2_analytics_workflow.ipynb>`_
3. `TinyFlux as a MQTT Datastore for IoT Devices <https://github.com/citrusvanilla/tinyflux/blob/master/examples/3_iot_datastore_with_mqtt.py>`_

Tips
****

Check out some tips for working with TinyFlux `here <https://tinyflux.readthedocs.io/en/latest/tips.html>`_.


TinyFlux Across the Internet
****************************
@@ -141,7 +146,7 @@ Contributing

New ideas, new developer tools, improvements, and bugfixes are always welcome. Follow these guidelines before getting started:

1. Make sure to read `Getting Started <https://tinyflux.readthedocs.io/en/latest/getting-started.html>`_ and the `Contributing <https://tinyflux.readthedocs.io/en/latest/contributing-philosophy.html>`_ section of the documentation.
1. Make sure to read `Getting Started <https://tinyflux.readthedocs.io/en/latest/getting-started.html>`_ and the `Contributing Tooling and Conventions <https://tinyflux.readthedocs.io/en/latest/contributing-tooling.html>`_ section of the documentation.
2. Check GitHub for `existing open issues <https://github.com/citrusvanilla/tinyflux/issues>`_, `open a new issue <https://github.com/citrusvanilla/tinyflux/issues/new>`_ or `start a new discussion <https://github.com/citrusvanilla/tinyflux/discussions/new>`_.
3. To get started on a pull request, fork the repository on GitHub, create a new branch, and make updates.
4. Write unit tests, ensure the code is 100% covered, update documentation where necessary, and format and style the code correctly.
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -19,7 +19,7 @@
# -- Project information -----------------------------------------------------

project = "TinyFlux"
copyright = "2022, Justin Fung"
copyright = "2023, Justin Fung"
author = "Justin Fung"

# The full version, including alpha/beta/rc tags
4 changes: 2 additions & 2 deletions docs/source/index.rst
@@ -1,5 +1,5 @@
Get started with
================
Getting started with
====================

.. image:: https://github.com/citrusvanilla/tinyflux/blob/master/artwork/tinyfluxdb-dark.png?raw=true#gh-light-mode-only
:width: 500px
140 changes: 70 additions & 70 deletions docs/source/time.rst
@@ -15,86 +15,86 @@ To illustrate the way time is handled in TinyFlux, below are the five ways time

1. ``time`` is not set by the user when the Point is initialized so its default value is ``None``. AFTER it is inserted into TinyFlux, it is assigned a UTC timestamp corresponding to the time of insertion.

>>> from tinyflux import Point, TinyFlux
>>> db = TinyFlux("my_db.csv") # an empty db
>>> p = Point()
>>> p.time is None
True
>>> db.insert(p)
>>> p.time
datetime.datetime(2021, 10, 30, 13, 53, 552872, tzinfo=datetime.timezone.utc)
>>> from tinyflux import Point, TinyFlux
>>> db = TinyFlux("my_db.csv") # an empty db
>>> p = Point()
>>> p.time is None
True
>>> db.insert(p)
>>> p.time
datetime.datetime(2021, 10, 30, 13, 53, 552872, tzinfo=datetime.timezone.utc)

2. ``time`` is set with a value, but it is not a ``datetime`` object. TinyFlux raises an exception.

>>> Point(time="2022-01-01")
ValueError: Time must be datetime object.
>>> Point(time="2022-01-01")
ValueError: Time must be datetime object.

3. ``time`` is set with a ``datetime`` object that is "timezone-naive". TinyFlux considers this time to be local to the timezone of the computer that is running TinyFlux and will convert this time to UTC using the ``astimezone`` method of the ``datetime`` object upon insertion. This will lead to confusion down the road if TinyFlux is running on a remote computer, or if the user is annotating data for points corresponding to places in other timezones.

>>> from datetime import datetime
>>> # Example: Our computer is in California, but we are working with a dataset of
>>> # air quality measurements for Beijing, China.
>>> # Here, AQI was measured at 1pm local time in Beijing on Aug 28, 2021.
>>> p = Point(
... time=datetime(2021, 8, 28, 13, 0), # 1pm, datetime-naive
... tags={"city": "beijing"},
... fields={"aqi": 118}
... )
>>> p.time
datetime.datetime(2021, 8, 28, 13, 0)
>>> # Insert the point into the database.
>>> db.insert(p)
>>> # The point is cast to UTC, assuming the time was local to California, not Beijing.
>>> p.time
datetime.datetime(2021, 8, 28, 20, 0, tzinfo=datetime.timezone.utc)
>>> from datetime import datetime
>>> # Example: Our computer is in California, but we are working with a dataset of
>>> # air quality measurements for Beijing, China.
>>> # Here, AQI was measured at 1pm local time in Beijing on Aug 28, 2021.
>>> p = Point(
... time=datetime(2021, 8, 28, 13, 0), # 1pm, datetime-naive
... tags={"city": "beijing"},
... fields={"aqi": 118}
... )
>>> p.time
datetime.datetime(2021, 8, 28, 13, 0)
>>> # Insert the point into the database.
>>> db.insert(p)
>>> # The point is cast to UTC, assuming the time was local to California, not Beijing.
>>> p.time
datetime.datetime(2021, 8, 28, 20, 0, tzinfo=datetime.timezone.utc)


4. ``time`` is set with a ``datetime`` object that is timezone-aware, but the timezone is not UTC. TinyFlux casts the time to UTC for internal storage and retrieval, and the original timezone is lost (it is up to the user to cast the time back to the original timezone after retrieval).

>>> from tinyflux import Point, TinyFlux
>>> from datetime import datetime
>>> from zoneinfo import ZoneInfo
>>> db = TinyFlux("my_db.csv") # an empty db
>>> la_point = Point(
... time=datetime(2000, 1, 1, tzinfo=ZoneInfo("US/Pacific")),
... tags={"city": "Los Angeles"},
... fields={"temp_f": 54.0}
... )
>>> ny_point = Point(
... time=datetime(2000, 1, 1, tzinfo=ZoneInfo("US/Eastern")),
... tags={"city": "New York City"},
... fields={"temp_f": 15.0}
... )
>>> db.insert_multiple([la_point, ny_point])
>>> # Notice the time attributes no longer carry the timezone information:
>>> la_point.time
datetime.datetime(2000, 1, 1, 8, 0, tzinfo=datetime.timezone.utc)
>>> ny_point.time
datetime.datetime(2000, 1, 1, 5, 0, tzinfo=datetime.timezone.utc)

.. hint::

If you need to keep the original, non-UTC timezone along with the dataset, consider adding a ``tag`` to your point indicating the timezone, for easier conversion after retrieval. TinyFlux will not assume nor attempt to store the timezone of your data for you.
>>> from tinyflux import Point, TinyFlux
>>> from datetime import datetime
>>> from zoneinfo import ZoneInfo
>>> db = TinyFlux("my_db.csv") # an empty db
>>> la_point = Point(
... time=datetime(2000, 1, 1, tzinfo=ZoneInfo("US/Pacific")),
... tags={"city": "Los Angeles"},
... fields={"temp_f": 54.0}
... )
>>> ny_point = Point(
... time=datetime(2000, 1, 1, tzinfo=ZoneInfo("US/Eastern")),
... tags={"city": "New York City"},
... fields={"temp_f": 15.0}
... )
>>> db.insert_multiple([la_point, ny_point])
>>> # Notice the time attributes no longer carry the timezone information:
>>> la_point.time
datetime.datetime(2000, 1, 1, 8, 0, tzinfo=datetime.timezone.utc)
>>> ny_point.time
datetime.datetime(2000, 1, 1, 5, 0, tzinfo=datetime.timezone.utc)

.. hint::

If you need to keep the original, non-UTC timezone along with the dataset, consider adding a ``tag`` to your point indicating the timezone, for easier conversion after retrieval. TinyFlux will not assume nor attempt to store the timezone of your data for you.

5. ``time`` is set with a ``datetime`` object that is timezone-aware and the timezone is UTC. This is the easiest way to handle time. If needed, information about the original timezone can be stored in a tag.

>>> from datetime import datetime, timezone
>>> from tinyflux import TinyFlux, Point
>>> from zoneinfo import ZoneInfo
>>> # Time now is 10am in Los Angeles, which is 6pm UTC:
>>> t = datetime.now(timezone.utc)
>>> t
datetime.datetime(2022, 11, 9, 18, 0, 0, tzinfo=datetime.timezone.utc)
>>> # Store the time in UTC, but keep the timezone as a tag for later use.
>>> p = Point(
... time=t,
... tags={"room": "bedroom", "timezone": "America/Los_Angeles"},
... fields={"temp": 72.0}
... )
>>> # Time is still UTC:
>>> p.time
datetime.datetime(2022, 11, 9, 18, 0, 0, tzinfo=datetime.timezone.utc)
>>> # To cast back to local time in Los Angeles:
>>> la_timezone = ZoneInfo(p.tags["timezone"])
>>> p.time.astimezone(la_timezone)
datetime.datetime(2022, 11, 9, 10, 0, tzinfo=zoneinfo.ZoneInfo(key='America/Los_Angeles'))
>>> from datetime import datetime, timezone
>>> from tinyflux import TinyFlux, Point
>>> from zoneinfo import ZoneInfo
>>> # Time now is 10am in Los Angeles, which is 6pm UTC:
>>> t = datetime.now(timezone.utc)
>>> t
datetime.datetime(2022, 11, 9, 18, 0, 0, tzinfo=datetime.timezone.utc)
>>> # Store the time in UTC, but keep the timezone as a tag for later use.
>>> p = Point(
... time=t,
... tags={"room": "bedroom", "timezone": "America/Los_Angeles"},
... fields={"temp": 72.0}
... )
>>> # Time is still UTC:
>>> p.time
datetime.datetime(2022, 11, 9, 18, 0, 0, tzinfo=datetime.timezone.utc)
>>> # To cast back to local time in Los Angeles:
>>> la_timezone = ZoneInfo(p.tags["timezone"])
>>> p.time.astimezone(la_timezone)
datetime.datetime(2022, 11, 9, 10, 0, tzinfo=zoneinfo.ZoneInfo(key='America/Los_Angeles'))
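
The pitfall in case 3 can be avoided by attaching the correct timezone to a naive ``datetime`` before it ever reaches TinyFlux. The following is a minimal sketch using only the standard library. The helper name ``to_utc`` is hypothetical and not part of TinyFlux, and Beijing is modeled here as a fixed UTC+8 offset, which is an assumption that holds because China does not observe daylight saving time.

```python
from datetime import datetime, timedelta, timezone, tzinfo

# Assumption: a fixed UTC+8 offset stands in for Beijing local time.
BEIJING = timezone(timedelta(hours=8), name="CST")


def to_utc(naive_local: datetime, tz: tzinfo) -> datetime:
    """Interpret a naive datetime as wall-clock time in ``tz`` and convert it to UTC.

    Hypothetical helper for illustration; not part of the TinyFlux API.
    """
    if naive_local.tzinfo is not None:
        raise ValueError("expected a timezone-naive datetime")
    # replace() attaches the timezone without shifting the clock reading;
    # astimezone() then converts that instant to UTC.
    return naive_local.replace(tzinfo=tz).astimezone(timezone.utc)


# 1pm in Beijing on Aug 28, 2021 is 5am UTC on the same day.
measured_at = to_utc(datetime(2021, 8, 28, 13, 0), BEIJING)
print(measured_at.isoformat())  # 2021-08-28T05:00:00+00:00
```

A value produced this way is already timezone-aware and in UTC, so it falls under case 5 rather than case 3 when passed as the ``time`` of a Point.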
77 changes: 39 additions & 38 deletions docs/source/tips.rst
@@ -3,44 +3,6 @@ Tips for TinyFlux

Below are some tips to get the most out of TinyFlux.

Optimizing Queries
^^^^^^^^^^^^^^^^^^

Unlike TinyDB, TinyFlux never pulls the entirety of its data into memory (unless the ``.all()`` method is called). This has the benefit of reducing the memory footprint of the database, but means that database operations are usually I/O bound. By using an index, TinyFlux is able to construct a matching set of items from the storage layer without actually reading any of those items. For database operations that return Points, TinyFlux iterates over the storage, collects the items that belong in the set, deserializes them, and finally returns them to the caller.

This ultimately means that the smaller the set of matches, the less I/O TinyFlux must perform.

.. hint::

Queries that return smaller sets of matches perform best.

.. warning::

Resist the urge to build your own time range query using the ``.map()`` query method. This will result in slow queries. Instead, use two ``TimeQuery`` instances combined with the ``&`` or ``|`` operator.


Keeping The Index Intact
^^^^^^^^^^^^^^^^^^^^^^^^

TinyFlux must build an index when it is initialized as it currently does not save the index upon closing. If the workflow for the session is read-only, then the index state will never be modified. If, however, a TinyFlux session consists of a mix of writes and reads, then the index will become invalid if at any time, a Point is inserted out of time order.

>>> from tinyflux import TinyFlux, Point
>>> from datetime import datetime, timedelta, timezone
>>> db = TinyFlux("my_db.csv")
>>> t = datetime.now(timezone.utc) # current time
>>> db.insert(Point(time=t))
>>> db.index.valid
True
>>> db.insert(Point(time=t - timedelta(hours=1))) # a Point out of time order
>>> db.index.valid
False

If ``auto_index`` is set to ``True`` (the default setting), then the next read will rebuild the index, which may simply look like a very slow query. For smaller datasets, reindexing is usually not noticeable.

.. hint::

If possible, Points should be inserted into TinyFlux in time-order.


Saving Space
^^^^^^^^^^^^
@@ -82,3 +44,42 @@ For example, if a TinyFlux database currently holds Points for two separate meas
.. hint::

When queries and indexes slow down a workflow, consider creating separate databases. Or, just migrate to InfluxDB.


Optimizing Queries
^^^^^^^^^^^^^^^^^^

Unlike TinyDB, TinyFlux never pulls the entirety of its data into memory (unless the ``.all()`` method is called). This has the benefit of reducing the memory footprint of the database, but means that database operations are usually I/O bound. By using an index, TinyFlux is able to construct a matching set of items from the storage layer without actually reading any of those items. For database operations that return Points, TinyFlux iterates over the storage, collects the items that belong in the set, deserializes them, and finally returns them to the caller.

This ultimately means that the smaller the set of matches, the less I/O TinyFlux must perform.

.. hint::

Queries that return smaller sets of matches perform best.

.. warning::

Resist the urge to build your own time range query using the ``.map()`` query method. This will result in slow queries. Instead, use two ``TimeQuery`` instances combined with the ``&`` or ``|`` operator.


Keeping The Index Intact
^^^^^^^^^^^^^^^^^^^^^^^^

TinyFlux must build an index when it is initialized as it currently does not save the index upon closing. If the workflow for the session is read-only, then the index state will never be modified. If, however, a TinyFlux session consists of a mix of writes and reads, then the index will become invalid if at any time, a Point is inserted out of time order.

>>> from tinyflux import TinyFlux, Point
>>> from datetime import datetime, timedelta, timezone
>>> db = TinyFlux("my_db.csv")
>>> t = datetime.now(timezone.utc) # current time
>>> db.insert(Point(time=t))
>>> db.index.valid
True
>>> db.insert(Point(time=t - timedelta(hours=1))) # a Point out of time order
>>> db.index.valid
False

If ``auto_index`` is set to ``True`` (the default setting), then the next read will rebuild the index, which may simply look like a very slow query. For smaller datasets, reindexing is usually not noticeable.

.. hint::

If possible, Points should be inserted into TinyFlux in time-order.
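
One way to follow the hint above is to sort a batch by timestamp before inserting it. The sketch below is standard-library only and rests on a few assumptions: plain ``(time, value)`` tuples stand in for Points, the helper ``is_time_ordered`` is hypothetical, and the actual insert call is left out.

```python
from datetime import datetime, timedelta, timezone


def is_time_ordered(times):
    """Return True if every timestamp is >= the one before it."""
    return all(earlier <= later for earlier, later in zip(times, times[1:]))


t = datetime(2023, 3, 21, 12, 0, tzinfo=timezone.utc)

# A batch that arrived out of order; tuples stand in for Points here.
batch = [
    (t, 72.0),
    (t - timedelta(hours=1), 71.5),
    (t + timedelta(minutes=5), 72.3),
]

# Sort by timestamp so that the batch itself is in time order
# before it is handed to the database.
ordered = sorted(batch, key=lambda record: record[0])

print(is_time_ordered([time for time, _ in batch]))    # False
print(is_time_ordered([time for time, _ in ordered]))  # True
```

Sorting only guarantees order within the batch. A batch whose earliest timestamp precedes the latest Point already stored would still be out of time order relative to the database.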
1 change: 1 addition & 0 deletions requirements.txt
@@ -4,6 +4,7 @@ flake8-docstrings
mypy
pytest
pytest-cov
rstcheck
sphinx
sphinx_autodoc_typehints
sphinx_rtd_theme