
feat: Bump psycopg2 to psycopg3 for all Postgres components #4303

Merged: 21 commits into feast-dev:master on Jul 1, 2024

Conversation

job-almekinders
Contributor

What this PR does / why we need it:

This PR upgrades the psycopg2 dependency to the newer psycopg3 dependency. See here for more information on the differences between the two versions.

This is the first of two PRs required to enable async feature retrieval for the Postgres Online Store.

Additional remarks:

The changes in this commit are related to the linter. In psycopg3, stricter type hints on the Cursor object require handling cases where cursor.description might be None. Although psycopg2 could also return None for this, it wasn't previously accounted for.
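For illustration, the guard now looks roughly like this (a sketch, assuming a psycopg3 cursor; fetch_columns is a hypothetical helper, and ZeroColumnQueryResult is the exception introduced further down in this PR):

    def fetch_columns(cur, query: str) -> list[str]:
        # psycopg3 types cursor.description as Optional, so the None case
        # must be handled explicitly before the column names are read.
        cur.execute(query)
        if cur.description is None:
            raise ZeroColumnQueryResult(query)
        return [field.name for field in cur.description]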

Which issue(s) this PR fixes:

First of two PRs required to fix #4260

Contributor Author

@job-almekinders left a comment:

Some additional clarifications from my side!

Comment on lines +401 to +406
class ZeroRowsQueryResult(Exception):
def __init__(self, query: str):
super().__init__(f"This query returned zero rows:\n{query}")


class ZeroColumnQueryResult(Exception):
def __init__(self, query: str):
super().__init__(f"This query returned zero columns:\n{query}")
Contributor Author

Exceptions used for the stricter handling of psycopg3's type hints.

Comment on lines +338 to +343
query = f"""
SELECT
MIN({entity_df_event_timestamp_col}) AS min,
MAX({entity_df_event_timestamp_col}) AS max
FROM ({entity_df}) AS tmp_alias
"""
Contributor Author

No updates here, only re-formatting the query

@@ -64,57 +75,56 @@ def online_write_batch(
Tuple[EntityKeyProto, Dict[str, ValueProto], datetime, Optional[datetime]]
],
progress: Optional[Callable[[int], Any]],
batch_size: int = 5000,
Contributor Author

Make the batch size configurable, addressing #4036.
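For reference, a minimal sketch of the batching pattern this parameter controls (_write_in_batches and its exact arguments are illustrative, not the PR's actual code):

    from itertools import islice
    from typing import Any, Callable, Iterable, Optional, Tuple

    def _write_in_batches(
        cur,
        sql_query,
        rows: Iterable[Tuple],
        batch_size: int,
        progress: Optional[Callable[[int], Any]] = None,
    ) -> None:
        # Split the rows into chunks of batch_size so the progress callback
        # can be invoked after each batch rather than once at the very end.
        it = iter(rows)
        while batch := list(islice(it, batch_size)):
            cur.executemany(sql_query, batch)
            if progress:
                progress(len(batch))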

Comment on lines +80 to +121
-with self._get_conn(config) as conn, conn.cursor() as cur:
-    insert_values = []
-    for entity_key, values, timestamp, created_ts in data:
-        entity_key_bin = serialize_entity_key(
-            entity_key,
-            entity_key_serialization_version=config.entity_key_serialization_version,
-        )
-        timestamp = _to_naive_utc(timestamp)
-        if created_ts is not None:
-            created_ts = _to_naive_utc(created_ts)
-
-        for feature_name, val in values.items():
-            vector_val = None
-            if config.online_store.pgvector_enabled:
-                vector_val = get_list_val_str(val)
-            insert_values.append(
-                (
-                    entity_key_bin,
-                    feature_name,
-                    val.SerializeToString(),
-                    vector_val,
-                    timestamp,
-                    created_ts,
-                )
-            )
-    # Control the batch so that we can update the progress
-    batch_size = 5000
+# Format insert values
+insert_values = []
+for entity_key, values, timestamp, created_ts in data:
+    entity_key_bin = serialize_entity_key(
+        entity_key,
+        entity_key_serialization_version=config.entity_key_serialization_version,
+    )
+    timestamp = _to_naive_utc(timestamp)
+    if created_ts is not None:
+        created_ts = _to_naive_utc(created_ts)
+
+    for feature_name, val in values.items():
+        vector_val = None
+        if config.online_store.pgvector_enabled:
+            vector_val = get_list_val_str(val)
+        insert_values.append(
+            (
+                entity_key_bin,
+                feature_name,
+                val.SerializeToString(),
+                vector_val,
+                timestamp,
+                created_ts,
+            )
+        )
+
+# Create insert query
+sql_query = sql.SQL(
+    """
+    INSERT INTO {}
+    (entity_key, feature_name, value, vector_value, event_ts, created_ts)
+    VALUES (%s, %s, %s, %s, %s, %s)
+    ON CONFLICT (entity_key, feature_name) DO
+    UPDATE SET
+        value = EXCLUDED.value,
+        vector_value = EXCLUDED.vector_value,
+        event_ts = EXCLUDED.event_ts,
+        created_ts = EXCLUDED.created_ts;
+    """
+).format(sql.Identifier(_table_id(config.project, table)))
+
+# Push data in batches to online store
Contributor Author

No changes here, only moving code further up in the function to make it more readable.

"""
INSERT INTO {}
(entity_key, feature_name, value, vector_value, event_ts, created_ts)
VALUES (%s, %s, %s, %s, %s, %s)
Contributor Author

First of two actual changes to this function:

We need to explicitly set the number of placeholder values, one per column.

-    cur_batch,
-    page_size=batch_size,
-)
+cur.executemany(sql_query, cur_batch)
Contributor Author

Second of two actual changes to this function:

The psycopg2.extras.execute_values functionality is removed in psycopg3. The maintainer of psycopg3 advises using executemany instead. See psycopg/psycopg#576 and psycopg/psycopg#114
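Roughly, the migration pattern looks like this (a sketch with an illustrative table and rows; cur is an open cursor, and psycopg 3.1+ can additionally pipeline the executions internally):

    rows = [(1, "a"), (2, "b")]

    # psycopg2: execute_values expands a single VALUES %s placeholder itself.
    # from psycopg2.extras import execute_values
    # execute_values(cur, "INSERT INTO t (id, name) VALUES %s", rows)

    # psycopg3: executemany with one explicit placeholder per column.
    cur.executemany("INSERT INTO t (id, name) VALUES (%s, %s)", rows)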

Comment on lines +185 to +187
values_dict[
row[0] if isinstance(row[0], bytes) else row[0].tobytes()
].append(row[1:])
Contributor Author

Only call tobytes() when row[0] is not already of type bytes: a plain bytes object has no tobytes() method, so calling it unconditionally raises an AttributeError.
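In other words (hypothetical helper name):

    def _as_bytes(value) -> bytes:
        # psycopg3 can return BYTEA columns as memoryview rather than bytes;
        # only memoryview provides tobytes(), hence the isinstance check.
        return value if isinstance(value, bytes) else value.tobytes()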

Comment on lines +35 to +56
def _get_conninfo(config: PostgreSQLConfig) -> str:
"""Get the `conninfo` argument required for connection objects."""
return (
f"postgresql://{config.user}"
f":{config.password}"
f"@{config.host}"
f":{int(config.port)}"
f"/{config.database}"
)


def _get_conn_kwargs(config: PostgreSQLConfig) -> Dict[str, Any]:
"""Get the additional `kwargs` required for connection objects."""
return {
"sslmode": config.sslmode,
"sslkey": config.sslkey_path,
"sslcert": config.sslcert_path,
"sslrootcert": config.sslrootcert_path,
"options": "-c search_path={}".format(config.db_schema or config.user),
}


Contributor Author

Helper functions to prevent code duplication in the above methods.
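A sketch of how these helpers would feed psycopg3 connection objects, assuming the psycopg and psycopg_pool packages (the surrounding function names are illustrative, not the PR's exact code):

    import psycopg
    from psycopg_pool import ConnectionPool

    def _get_connection(config: PostgreSQLConfig) -> psycopg.Connection:
        # Singleton connection: conninfo URI plus expanded keyword arguments.
        return psycopg.connect(
            conninfo=_get_conninfo(config),
            **_get_conn_kwargs(config),
        )

    def _get_connection_pool(config: PostgreSQLConfig) -> ConnectionPool:
        # Pooled connections: psycopg_pool takes the extra connection kwargs
        # as a named `kwargs` argument instead of expanding them.
        return ConnectionPool(
            conninfo=_get_conninfo(config),
            kwargs=_get_conn_kwargs(config),
        )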

Comment on lines +75 to +82
+nr_columns = df.shape[1]
+placeholders = ", ".join(["%s"] * nr_columns)
+query = f"INSERT INTO {table_name} VALUES ({placeholders})"
+values = df.replace({np.NaN: None}).to_numpy().tolist()
+
 with _get_conn(config) as conn, conn.cursor() as cur:
     cur.execute(_df_to_create_table_sql(df, table_name))
-    psycopg2.extras.execute_values(
-        cur,
-        f"""
-        INSERT INTO {table_name}
-        VALUES %s
-        """,
-        df.replace({np.NaN: None}).to_numpy(),
-    )
+    cur.executemany(query, values)
Contributor Author

Moved the parsing of variables further up the function. Additionally:

  1. Again, we need to replace execute_values with executemany.
  2. Again, we need to explicitly set the number of placeholders. Since this function must handle a dynamic number of columns, we build the placeholders variable from the column count.

Comment on lines +42 to +46
@pytest.mark.parametrize(
"conn_type",
[ConnectionType.singleton, ConnectionType.pool],
ids=lambda v: f"conn_type:{v}",
)
Contributor Author

Test both ConnectionTypes

@@ -20,32 +20,38 @@ charset-normalizer==3.3.2
# via requests
click==8.1.7
# via
# feast (setup.py)
Collaborator

@tokoko commented Jun 21, 2024:

Did you use lock-python-dependencies-all to generate these files? The feast (setup.py) lines shouldn't have been added, I think.

Contributor Author

Yes, I used that command indeed!

Contributor Author

Do you have any thoughts on what might be causing this and how to resolve it?

Collaborator

Not sure, honestly; I'll try to look into it. We can still merge regardless: it's not a blocker, just a bunch of extra line changes in the PR.

Collaborator

@tokoko left a comment:

LGTM

@job-almekinders
Contributor Author

Hey @franciscojavierarceo, would you perhaps be able to do another pass on this PR? :)

@HaoXuAI
Collaborator

HaoXuAI commented Jun 26, 2024

Looks like the other PR I merged caused conflicts here. Mind fixing it, then we can merge?

@job-almekinders
Contributor Author

Looks like the other PR I merged caused conflicts here. Mind fixing it, then we can merge?

I just pushed the update! @HaoXuAI

Commits pushed (each signed off by Job Almekinders <job.almekinders@teampicnic.com>):

Set connection read only
Addition
Use new ConnectionPool
Pass kwargs as named argument
Use executemany over execute_values
Remove not-required open argument in psycopg.connect
Improve
Use SpooledTemporaryFile
Use max_size and add docstring
Properly write with StringIO
Utils: Use SpooledTemporaryFile over StringIO object
Add replace
Fix df_to_postgres_table
Remove import
Utils
Add log statement
Lint: Fix _to_arrow_internal
Lint: Fix _get_entity_df_event_timestamp_range
Update exception
Use ZeroColumnQueryResult
Update warning
Fix
Format warning
Add typehints
Use better variable name
@job-almekinders
Contributor Author

job-almekinders commented Jun 27, 2024

We already tried to install and run this feature branch in one of our downstream projects. We noticed an edge case where the required dependency combination of psycopg and SQLAlchemy does not work properly. This is related to this discussion.

One of Feast's requirements is SQLAlchemy>=1, which allows Feast to install either version 1 or 2. With this feature branch, the psycopg2 dependency is replaced with psycopg3. The path parameter of the RegistryConfig is passed to SQLAlchemy directly; however, the psycopg3 driver is only available for SQLAlchemy>=2.
If users in downstream projects have an additional dependency on SQLAlchemy<2, Feast can still be installed properly, but with SQLAlchemy==1, which does not support the psycopg3 driver. Since psycopg2 is no longer a dependency of Feast, such users will not be able to connect to the registry out of the box.

The question now is how to handle this. We think there are a number of potential solutions:

  1. Update the validator, adding additional version checks for SQLAlchemy and/or psycopg2/3, and based on those decide which driver to use (a rough sketch of this check follows below). For example:
    a) If SQLAlchemy>=2, use psycopg3 (which is always installed with Feast's postgres extras).
    b) If SQLAlchemy<2, check for the availability of psycopg2.
    If psycopg2 is available, add it to the path prefix and continue.
    If psycopg2 is not available, raise an exception and instruct the user to add it to their own dependencies.
  2. Users have to install the psycopg2 dependency themselves in their downstream project and use the postgresql+psycopg2 prefix for the path parameter of the RegistryConfig. Potentially, we could also decide to drop the added validator and instead document to users how to handle this.
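For illustration, a rough sketch of what option 1 could look like (names and placement are hypothetical; this is not part of the PR):

    from importlib.util import find_spec

    import sqlalchemy

    def _postgres_registry_driver() -> str:
        # Hypothetical check: pick the SQLAlchemy driver prefix based on
        # the installed SQLAlchemy major version.
        if int(sqlalchemy.__version__.split(".")[0]) >= 2:
            # psycopg3 is always installed with Feast's postgres extras.
            return "postgresql+psycopg"
        if find_spec("psycopg2") is not None:
            return "postgresql+psycopg2"
        raise ImportError(
            "SQLAlchemy<2 cannot use the psycopg3 driver; install psycopg2 "
            "or upgrade SQLAlchemy to >=2."
        )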

There are more ways to tackle this issue, but I'm curious to hear your thoughts on what you think might be the best way forward.

@tokoko I would love to hear your thoughts as well on this matter if you have the time :)

@HaoXuAI
Collaborator

HaoXuAI commented Jun 27, 2024

We already tried to install and run this feature branch in one of our downstream projects. […]

Any reason we are not able to bump SQLAlchemy to 2.0+?

@tokoko
Collaborator

tokoko commented Jun 27, 2024

@job-almekinders Thanks for pointing that out. However, I don't believe this warrants any action from us. Those kinds of diamond dependencies are, unfortunately, just part of life for libraries like this. I'm sure there are a few similar ones lurking around that we haven't noticed yet.

If a user has a hard dependency on SQLAlchemy 1 and uses Postgres in Feast only for the registry, they are free to forgo the postgres extra, install psycopg2 instead, and set up the SQL registry path accordingly.

Any reason we are not able to bump SQLAlchemy to 2.0+?

I guess we can, but as long as we don't absolutely need to, it's better to leave it as is, so as not to cause unnecessary diamond dependency problems for downstream libraries.

@TomSteenbergen
Contributor

Any reason we are not able to bump SQLAlchemy to 2.0+?

@HaoXuAI This will probably break the Snowflake integration, as SQLAlchemy 2 isn't supported yet by the Snowflake Python libraries. See this relevant issue: snowflakedb/snowflake-sqlalchemy#380

@job-almekinders
Contributor Author

@job-almekinders Thanks for pointing that out. However, I don't believe this warrants any action from us. Those kinds of diamond dependencies are, unfortunately, just part of life for libraries like this. I'm sure there are a few similar ones lurking around that we haven't noticed yet.

If a user has a hard dependency on SQLAlchemy 1 and uses Postgres in Feast only for the registry, they are free to forgo the postgres extra, install psycopg2 instead, and set up the SQL registry path accordingly.

That makes sense!

If this is not a blocker, then I think we are good to move forward with this PR, at least from our end :)

@HaoXuAI
Collaborator

HaoXuAI commented Jun 27, 2024

Any reason we are not able to bump SQLAlchemy to 2.0+?

@HaoXuAI This will probably break the Snowflake integration, as SQLAlchemy 2 isn't supported yet by the Snowflake Python libraries. See this relevant issue: snowflakedb/snowflake-sqlalchemy#380

I think Snowflake can use the snowflake-python module, which doesn't depend on SQLAlchemy.
Also, since we are on Python 3.9+, there shouldn't be many cases still depending on SQLAlchemy 1.4?

@tokoko
Collaborator

tokoko commented Jul 1, 2024

I think Snowflake can use the snowflake-python module, which doesn't depend on SQLAlchemy.

I think he meant the Snowflake-backed SQL registry, not a separate Snowflake registry.

Also, since we are on Python 3.9+, there shouldn't be many cases still depending on SQLAlchemy 1.4?

I'm fine with removing EOL versions if that will get us anything. Probably best to follow up in a different issue, though.

@HaoXuAI This is good to merge, right?

@HaoXuAI HaoXuAI merged commit 9451d9c into feast-dev:master Jul 1, 2024
16 checks passed