
feat: Bump psycopg2 to psycopg3 for all Postgres components #4303

Merged: 21 commits into feast-dev:master on Jul 1, 2024

Conversation

job-almekinders
Contributor

What this PR does / why we need it:

This PR upgrades the psycopg2 dependency to the newer psycopg3 dependency. See here for more information on the differences between the two versions.

This is the first of two PRs required to enable async feature retrieval for the Postgres Online Store.

Additional remarks:

The changes in this commit are related to the linter. In psycopg3, stricter type hints on the Cursor object require handling cases where cursor.description might be None. Although psycopg2 could also return None for this, it wasn't previously accounted for.
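For illustration, the guard now looks roughly like this (a sketch, assuming a psycopg3 cursor; fetch_columns is a hypothetical helper, and ZeroColumnQueryResult is the exception introduced further down in this PR):

    def fetch_columns(cur, query: str) -> list[str]:
        # psycopg3 types cursor.description as Optional, so the None case
        # must be handled explicitly before the column names are read.
        cur.execute(query)
        if cur.description is None:
            raise ZeroColumnQueryResult(query)
        return [field.name for field in cur.description]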

Which issue(s) this PR fixes:

First of two PRs required to fix #4260

Contributor Author

@job-almekinders left a comment:

Some additional clarifications from my side!

Comment on lines +401 to +406
class ZeroRowsQueryResult(Exception):
def __init__(self, query: str):
super().__init__(f"This query returned zero rows:\n{query}")


class ZeroColumnQueryResult(Exception):
def __init__(self, query: str):
super().__init__(f"This query returned zero columns:\n{query}")
Contributor Author

Exceptions used for the stricter handling of psycopg3's type hints.

Comment on lines +338 to +343
query = f"""
SELECT
MIN({entity_df_event_timestamp_col}) AS min,
MAX({entity_df_event_timestamp_col}) AS max
FROM ({entity_df}) AS tmp_alias
"""
Contributor Author

No updates here, only re-formatting the query

@@ -64,57 +75,56 @@ def online_write_batch(
Tuple[EntityKeyProto, Dict[str, ValueProto], datetime, Optional[datetime]]
],
progress: Optional[Callable[[int], Any]],
batch_size: int = 5000,
Contributor Author

Make the batch size configurable, addressing #4036.
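For reference, a minimal sketch of the batching pattern this parameter controls (_write_in_batches and its exact arguments are illustrative, not the PR's actual code):

    from itertools import islice
    from typing import Any, Callable, Iterable, Optional, Tuple

    def _write_in_batches(
        cur,
        sql_query,
        rows: Iterable[Tuple],
        batch_size: int,
        progress: Optional[Callable[[int], Any]] = None,
    ) -> None:
        # Split the rows into chunks of batch_size so the progress callback
        # can be invoked after each batch rather than once at the very end.
        it = iter(rows)
        while batch := list(islice(it, batch_size)):
            cur.executemany(sql_query, batch)
            if progress:
                progress(len(batch))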

Comment on lines +80 to +121
-with self._get_conn(config) as conn, conn.cursor() as cur:
-    insert_values = []
-    for entity_key, values, timestamp, created_ts in data:
-        entity_key_bin = serialize_entity_key(
-            entity_key,
-            entity_key_serialization_version=config.entity_key_serialization_version,
-        )
-        timestamp = _to_naive_utc(timestamp)
-        if created_ts is not None:
-            created_ts = _to_naive_utc(created_ts)
-
-        for feature_name, val in values.items():
-            vector_val = None
-            if config.online_store.pgvector_enabled:
-                vector_val = get_list_val_str(val)
-            insert_values.append(
-                (
-                    entity_key_bin,
-                    feature_name,
-                    val.SerializeToString(),
-                    vector_val,
-                    timestamp,
-                    created_ts,
-                )
-            )
-    # Control the batch so that we can update the progress
-    batch_size = 5000
+# Format insert values
+insert_values = []
+for entity_key, values, timestamp, created_ts in data:
+    entity_key_bin = serialize_entity_key(
+        entity_key,
+        entity_key_serialization_version=config.entity_key_serialization_version,
+    )
+    timestamp = _to_naive_utc(timestamp)
+    if created_ts is not None:
+        created_ts = _to_naive_utc(created_ts)
+
+    for feature_name, val in values.items():
+        vector_val = None
+        if config.online_store.pgvector_enabled:
+            vector_val = get_list_val_str(val)
+        insert_values.append(
+            (
+                entity_key_bin,
+                feature_name,
+                val.SerializeToString(),
+                vector_val,
+                timestamp,
+                created_ts,
+            )
+        )
+
+# Create insert query
+sql_query = sql.SQL(
+    """
+    INSERT INTO {}
+    (entity_key, feature_name, value, vector_value, event_ts, created_ts)
+    VALUES (%s, %s, %s, %s, %s, %s)
+    ON CONFLICT (entity_key, feature_name) DO
+    UPDATE SET
+        value = EXCLUDED.value,
+        vector_value = EXCLUDED.vector_value,
+        event_ts = EXCLUDED.event_ts,
+        created_ts = EXCLUDED.created_ts;
+    """
+).format(sql.Identifier(_table_id(config.project, table)))
+
+# Push data in batches to online store
Contributor Author

No changes here, only moving code further up in the function to make it more readable.

"""
INSERT INTO {}
(entity_key, feature_name, value, vector_value, event_ts, created_ts)
VALUES (%s, %s, %s, %s, %s, %s)
Contributor Author

First of two actual changes to this function:

We need to explicitly set the number of placeholder values, one per column.

-    cur_batch,
-    page_size=batch_size,
-)
+cur.executemany(sql_query, cur_batch)
Contributor Author

Second of two actual changes to this function:

The psycopg2.extras.execute_values functionality is removed in psycopg3. The maintainer of psycopg3 advises using executemany instead. See psycopg/psycopg#576 and psycopg/psycopg#114
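Roughly, the migration pattern looks like this (a sketch with an illustrative table and rows; cur is an open cursor, and psycopg 3.1+ can additionally pipeline the executions internally):

    rows = [(1, "a"), (2, "b")]

    # psycopg2: execute_values expands a single VALUES %s placeholder itself.
    # from psycopg2.extras import execute_values
    # execute_values(cur, "INSERT INTO t (id, name) VALUES %s", rows)

    # psycopg3: executemany with one explicit placeholder per column.
    cur.executemany("INSERT INTO t (id, name) VALUES (%s, %s)", rows)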

Comment on lines +185 to +187
values_dict[
row[0] if isinstance(row[0], bytes) else row[0].tobytes()
].append(row[1:])
Contributor Author

Only call tobytes() when row[0] is not already of type bytes: a plain bytes object has no tobytes() method, so calling it unconditionally raises an AttributeError.
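In other words (hypothetical helper name):

    def _as_bytes(value) -> bytes:
        # psycopg3 can return BYTEA columns as memoryview rather than bytes;
        # only memoryview provides tobytes(), hence the isinstance check.
        return value if isinstance(value, bytes) else value.tobytes()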

Comment on lines +35 to +56
def _get_conninfo(config: PostgreSQLConfig) -> str:
"""Get the `conninfo` argument required for connection objects."""
return (
f"postgresql://{config.user}"
f":{config.password}"
f"@{config.host}"
f":{int(config.port)}"
f"/{config.database}"
)


def _get_conn_kwargs(config: PostgreSQLConfig) -> Dict[str, Any]:
"""Get the additional `kwargs` required for connection objects."""
return {
"sslmode": config.sslmode,
"sslkey": config.sslkey_path,
"sslcert": config.sslcert_path,
"sslrootcert": config.sslrootcert_path,
"options": "-c search_path={}".format(config.db_schema or config.user),
}


Contributor Author

Helper functions to prevent code duplication in the above methods.
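A sketch of how these helpers would feed psycopg3 connection objects, assuming the psycopg and psycopg_pool packages (the surrounding function names are illustrative, not the PR's exact code):

    import psycopg
    from psycopg_pool import ConnectionPool

    def _get_connection(config: PostgreSQLConfig) -> psycopg.Connection:
        # Singleton connection: conninfo URI plus expanded keyword arguments.
        return psycopg.connect(
            conninfo=_get_conninfo(config),
            **_get_conn_kwargs(config),
        )

    def _get_connection_pool(config: PostgreSQLConfig) -> ConnectionPool:
        # Pooled connections: psycopg_pool takes the extra connection kwargs
        # as a named `kwargs` argument instead of expanding them.
        return ConnectionPool(
            conninfo=_get_conninfo(config),
            kwargs=_get_conn_kwargs(config),
        )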

Comment on lines +75 to +82
+nr_columns = df.shape[1]
+placeholders = ", ".join(["%s"] * nr_columns)
+query = f"INSERT INTO {table_name} VALUES ({placeholders})"
+values = df.replace({np.NaN: None}).to_numpy().tolist()
+
 with _get_conn(config) as conn, conn.cursor() as cur:
     cur.execute(_df_to_create_table_sql(df, table_name))
-    psycopg2.extras.execute_values(
-        cur,
-        f"""
-        INSERT INTO {table_name}
-        VALUES %s
-        """,
-        df.replace({np.NaN: None}).to_numpy(),
-    )
+    cur.executemany(query, values)
Contributor Author

Moved the parsing of variables further up the function. Additionally:

  1. Again, we need to replace execute_values with executemany.
  2. Again, we need to explicitly set the number of placeholders. Since this function must handle a dynamic number of columns, we build the placeholders variable from the column count.

Comment on lines +42 to +46
@pytest.mark.parametrize(
"conn_type",
[ConnectionType.singleton, ConnectionType.pool],
ids=lambda v: f"conn_type:{v}",
)
Contributor Author

Test both ConnectionTypes

@@ -20,32 +20,38 @@ charset-normalizer==3.3.2
# via requests
click==8.1.7
# via
# feast (setup.py)
Collaborator

@tokoko commented Jun 21, 2024:

Did you use lock-python-dependencies-all to generate these files? The feast (setup.py) lines shouldn't have been added, I think.

Contributor Author

Yes, I used that command indeed!

Contributor Author

Do you have any thoughts on what might be causing this and how to resolve it?

Collaborator

Not sure, honestly; I'll try to look into it. We can still merge regardless: it's not a blocker, just a bunch of extra line changes in the PR.

Collaborator

@tokoko left a comment:

LGTM

@job-almekinders
Contributor Author

Hey @franciscojavierarceo, would you perhaps be able to do another pass on this PR? :)

@HaoXuAI
Collaborator

HaoXuAI commented Jun 26, 2024

Looks like the other PR I merged caused conflicts here. Mind fixing it, then we can merge?

@job-almekinders
Contributor Author

Looks like the other PR I merged caused conflicts here. Mind fixing it, then we can merge?

I just pushed the update! @HaoXuAI

Commits pushed (each signed off by Job Almekinders <job.almekinders@teampicnic.com>):

Set connection read only
Addition
Use new ConnectionPool
Pass kwargs as named argument
Use executemany over execute_values
Remove not-required open argument in psycopg.connect
Improve
Use SpooledTemporaryFile
Use max_size and add docstring
Properly write with StringIO
Utils: Use SpooledTemporaryFile over StringIO object
Add replace
Fix df_to_postgres_table
Remove import
Utils
Add log statement
Lint: Fix _to_arrow_internal
Lint: Fix _get_entity_df_event_timestamp_range
Update exception
Use ZeroColumnQueryResult
Update warning
Fix
Format warning
Add typehints
Use better variable name
@job-almekinders
Contributor Author

job-almekinders commented Jun 27, 2024

We already tried to install and run this feature branch in one of our downstream projects. We noticed an edge case where the required dependency combination of psycopg and SQLAlchemy does not work properly. This is related to this discussion.

One of Feast's requirements is SQLAlchemy>=1, which allows Feast to install either version 1 or 2. With this feature branch, the psycopg2 dependency is replaced with psycopg3. The path parameter of the RegistryConfig is passed to SQLAlchemy directly; however, the psycopg3 driver is only available for SQLAlchemy>=2.
If users in downstream projects have an additional dependency on SQLAlchemy<2, Feast can still be installed properly, but with SQLAlchemy==1, which does not support the psycopg3 driver. Since psycopg2 is no longer a dependency of Feast, such users will not be able to connect to the registry out of the box.

The question now is how to handle this. We think there are a number of potential solutions:

  1. Update the validator, adding additional version checks for SQLAlchemy and/or psycopg2/3, and based on those decide which driver to use (a rough sketch of this check follows below). For example:
    a) If SQLAlchemy>=2, use psycopg3 (which is always installed with Feast's postgres extras).
    b) If SQLAlchemy<2, check for the availability of psycopg2.
    If psycopg2 is available, add it to the path prefix and continue.
    If psycopg2 is not available, raise an exception and instruct the user to add it to their own dependencies.
  2. Users have to install the psycopg2 dependency themselves in their downstream project and use the postgresql+psycopg2 prefix for the path parameter of the RegistryConfig. Potentially, we could also decide to drop the added validator and instead document to users how to handle this.
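For illustration, a rough sketch of what option 1 could look like (names and placement are hypothetical; this is not part of the PR):

    from importlib.util import find_spec

    import sqlalchemy

    def _postgres_registry_driver() -> str:
        # Hypothetical check: pick the SQLAlchemy driver prefix based on
        # the installed SQLAlchemy major version.
        if int(sqlalchemy.__version__.split(".")[0]) >= 2:
            # psycopg3 is always installed with Feast's postgres extras.
            return "postgresql+psycopg"
        if find_spec("psycopg2") is not None:
            return "postgresql+psycopg2"
        raise ImportError(
            "SQLAlchemy<2 cannot use the psycopg3 driver; install psycopg2 "
            "or upgrade SQLAlchemy to >=2."
        )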

There are more ways to tackle this issue, but I'm curious to hear your thoughts on what you think might be the best way forward.

@tokoko I would love to hear your thoughts as well on this matter if you have the time :)

@HaoXuAI
Collaborator

HaoXuAI commented Jun 27, 2024

We already tried to install and run this feature branch in one of our downstream projects. […]

Any reason we are not able to bump SQLAlchemy to 2.0+?

@tokoko
Collaborator

tokoko commented Jun 27, 2024

@job-almekinders Thanks for pointing that out. However, I don't believe this warrants any action from us. Those kinds of diamond dependencies are, unfortunately, just part of life for libraries like this. I'm sure there are a few similar ones lurking around that we haven't noticed yet.

If a user has a hard dependency on SQLAlchemy 1 and uses Postgres in Feast only for the registry, they are free to forgo the postgres extra, install psycopg2 instead, and set up the SQL registry path accordingly.

Any reason we are not able to bump SQLAlchemy to 2.0+?

I guess we can, but as long as we don't absolutely need to, it's better to leave it as is, so as not to cause unnecessary diamond dependency problems for downstream libraries.

@TomSteenbergen
Contributor

Any reason we are not able to bump SQLAlchemy to 2.0+?

@HaoXuAI This will probably break the Snowflake integration, as SQLAlchemy 2 isn't supported yet by the Snowflake Python libraries. See this relevant issue: snowflakedb/snowflake-sqlalchemy#380

@job-almekinders
Contributor Author

@job-almekinders Thanks for pointing that out. However, I don't believe this warrants any action from us. Those kinds of diamond dependencies are, unfortunately, just part of life for libraries like this. I'm sure there are a few similar ones lurking around that we haven't noticed yet.

If a user has a hard dependency on SQLAlchemy 1 and uses Postgres in Feast only for the registry, they are free to forgo the postgres extra, install psycopg2 instead, and set up the SQL registry path accordingly.

That makes sense!

If this is not a blocker, then I think we are good to move forward with this PR, at least from our end :)

@HaoXuAI
Collaborator

HaoXuAI commented Jun 27, 2024

Any reason we are not able to bump SQLAlchemy to 2.0+?

@HaoXuAI This will probably break the Snowflake integration, as SQLAlchemy 2 isn't supported yet by the Snowflake Python libraries. See this relevant issue: snowflakedb/snowflake-sqlalchemy#380

I think Snowflake can use the snowflake-python module, which doesn't depend on SQLAlchemy.
Also, since we are on Python 3.9+, there shouldn't be many cases still depending on SQLAlchemy 1.4?

@tokoko
Collaborator

tokoko commented Jul 1, 2024

I think Snowflake can use the snowflake-python module, which doesn't depend on SQLAlchemy.

I think he meant the Snowflake-backed SQL registry, not a separate Snowflake registry.

Also, since we are on Python 3.9+, there shouldn't be many cases still depending on SQLAlchemy 1.4?

I'm fine with removing EOL versions if that will get us anything. Probably best to follow up in a different issue, though.

@HaoXuAI This is good to merge, right?

@HaoXuAI HaoXuAI merged commit 9451d9c into feast-dev:master Jul 1, 2024
16 checks passed