🚨 Weaviate destination: Add embedding capabilities, overwrite and dedup support, API key auth mode and available on Airbyte Cloud #30151

Merged
merged 61 commits into from Sep 28, 2023
Changes from 54 commits
83c97bb
better error message for misconfigured text fields
Sep 4, 2023
de0cf30
improve message
Sep 4, 2023
5a88145
improve error message
Sep 4, 2023
6c74055
refactor vector db cdk helpers a bit
Sep 4, 2023
6b7a3b0
make most of the things work
Sep 5, 2023
2ff7617
Automated Commit - Formatting Changes
flash1293 Sep 5, 2023
a414d88
adjust component interaction
Sep 5, 2023
3cc529a
Merge branch 'flash1293/ai-cdk-improvements' into flash1293/weaviate-…
Sep 5, 2023
b47eadf
Merge branch 'flash1293/weaviate-rewrite' of github.com:airbytehq/air…
Sep 5, 2023
139c0ab
make no_embedding mode work
Sep 5, 2023
51ca514
Automated Commit - Formatting Changes
flash1293 Sep 5, 2023
347f333
format fixes
Sep 5, 2023
5d76df2
Merge branch 'flash1293/ai-cdk-improvements' into flash1293/weaviate-…
Sep 5, 2023
52c2095
Merge branch 'flash1293/weaviate-rewrite' of github.com:airbytehq/air…
Sep 5, 2023
17a5c30
docs, auto-create classes
Sep 5, 2023
a5c5ff8
review comments
Sep 5, 2023
716abc8
Merge remote-tracking branch 'origin/master' into flash1293/ai-cdk-im…
Sep 6, 2023
bd0f3c2
revert formatting
Sep 6, 2023
3ce6334
fix unit tests
Sep 6, 2023
db2f04e
Merge branch 'flash1293/ai-cdk-improvements' into flash1293/weaviate-…
Sep 6, 2023
aa8461a
work on tests
Sep 6, 2023
4084ddb
work on tests
Sep 6, 2023
3c2f06c
Merge remote-tracking branch 'origin/master' into flash1293/weaviate-…
Sep 6, 2023
70fe870
make most things works
Sep 8, 2023
49d0963
Merge remote-tracking branch 'origin/master' into flash1293/weaviate-…
Sep 8, 2023
006d277
documentation
Sep 8, 2023
6738efd
Merge remote-tracking branch 'origin/master' into flash1293/weaviate-…
Sep 11, 2023
80f480d
fix metadata and dockerfile
Sep 11, 2023
89b9d50
fix small things in weaviate destination
Sep 11, 2023
6e6331c
try more retries
Sep 11, 2023
d238c52
try to fix integration tests
Sep 11, 2023
8de45e2
Merge remote-tracking branch 'origin/master' into flash1293/weaviate-…
Sep 12, 2023
2ccd705
review comments
Sep 12, 2023
2a61557
respect reserved property names
Sep 12, 2023
7986eaa
Merge branch 'master' into flash1293/weaviate-rewrite
Sep 14, 2023
cb4ef90
Merge remote-tracking branch 'upstream/master' into flash1293/weaviat…
Sep 18, 2023
e26aaaf
adjust based on feedback
Sep 18, 2023
cd0acb2
fix integration tests
Sep 18, 2023
5c3e877
bump changelog
Sep 18, 2023
5bff5ed
enable on cloud
Sep 18, 2023
037803d
fix breaking change message
Sep 18, 2023
e2c3cc0
set index correctly for _ab_record_id field
Sep 18, 2023
a626f7d
format
Sep 18, 2023
5c82f1a
Merge remote-tracking branch 'origin/master' into flash1293/weaviate-…
Sep 25, 2023
60d8144
adjust metadata
Sep 25, 2023
53e80d4
Automated Commit - Formatting Changes
flash1293 Sep 25, 2023
3eeb341
update cdk
Sep 25, 2023
c9eaf17
Merge branch 'flash1293/weaviate-rewrite' of github.com:airbytehq/air…
Sep 25, 2023
c7497eb
fix
Sep 25, 2023
3a6b0e2
disallow no auth on cloud
Sep 25, 2023
47c682d
fix test
Sep 25, 2023
ab9dcdd
Merge remote-tracking branch 'upstream/master' into flash1293/weaviat…
Sep 26, 2023
828855d
set to certified
Sep 26, 2023
e7de851
Merge branch 'master' into flash1293/weaviate-rewrite
Sep 27, 2023
c7a0b8f
Update docs/integrations/destinations/weaviate.md
Sep 27, 2023
751025e
Update docs/integrations/destinations/weaviate-migrations.md
Sep 27, 2023
f1c33a3
Update docs/integrations/destinations/weaviate-migrations.md
Sep 27, 2023
68f7a94
update cdk
Sep 27, 2023
359cd06
Merge branch 'flash1293/weaviate-rewrite' of github.com:airbytehq/air…
Sep 27, 2023
8c6c583
chunk as configured
Sep 28, 2023
f8d9237
fix format
Sep 28, 2023
21 changes: 12 additions & 9 deletions airbyte-integrations/connectors/destination-weaviate/Dockerfile
Collaborator

Should this logic be moved to a post-build shell script? I heard Dockerfiles are on their way out.

(Just a question for consideration; I wouldn't block on this.)

Contributor Author

To my knowledge this new Dockerfile is closer to the "default logic", so I expect it to not cause any problems.

@@ -1,16 +1,19 @@
-FROM python:3.9.11-alpine3.15 as base
+FROM python:3.10-slim as base

 # build and load all requirements
 FROM base as builder
 WORKDIR /airbyte/integration_code

 # upgrade pip to the latest version
-RUN apk --no-cache upgrade \
-    && pip install --upgrade pip \
-    && apk --no-cache add tzdata build-base \
-    && apk add libffi-dev

 COPY setup.py ./

+RUN pip install --upgrade pip
+
+# This is required because the current connector dependency is not compatible with the CDK version
+# An older CDK version will be used, which depends on pyYAML 5.4, for which we need to pin Cython to <3.0
+# As of today the CDK version that satisfies the main dependency requirements, is 0.1.80 ...
+RUN pip install --prefix=/install "Cython<3.0" "pyyaml~=5.4" --no-build-isolation
+
 # install necessary packages to a temporary folder
 RUN pip install --prefix=/install .

@@ -25,7 +28,7 @@ COPY --from=builder /usr/share/zoneinfo/Etc/UTC /etc/localtime
 RUN echo "Etc/UTC" > /etc/timezone

 # bash is installed for more convenient debugging.
-RUN apk --no-cache add bash
+RUN apt-get install bash

 # copy payload code only
 COPY main.py ./
@@ -34,5 +37,5 @@ COPY destination_weaviate ./destination_weaviate
 ENV AIRBYTE_ENTRYPOINT "python /airbyte/integration_code/main.py"
 ENTRYPOINT ["python", "/airbyte/integration_code/main.py"]

-LABEL io.airbyte.version=0.1.1
-LABEL io.airbyte.name=airbyte/destination-weaviate
+LABEL io.airbyte.version=0.2.0
+LABEL io.airbyte.name=airbyte/destination-weaviate
@@ -0,0 +1,7 @@
acceptance_tests:
  spec:
    tests:
      - spec_path: integration_tests/spec.json
        backward_compatibility_tests_config:
          disable_for_version: "0.2.0"
connector_image: airbyte/destination-weaviate:dev
@@ -0,0 +1,2 @@
#!/usr/bin/env sh
source "$(git rev-parse --show-toplevel)/airbyte-integrations/bases/connector-acceptance-test/acceptance-test-docker.sh"

This file was deleted.

@@ -0,0 +1,141 @@
#
# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
#

from typing import List, Literal, Union

import dpath.util
from airbyte_cdk.destinations.vector_db_based.config import (
    AzureOpenAIEmbeddingConfigModel,
    CohereEmbeddingConfigModel,
    FakeEmbeddingConfigModel,
    FromFieldEmbeddingConfigModel,
    OpenAIEmbeddingConfigModel,
    ProcessingConfigModel,
)
from airbyte_cdk.utils.spec_schema_transformations import resolve_refs
from pydantic import BaseModel, Field


class UsernamePasswordAuth(BaseModel):
    mode: Literal["username_password"] = Field("username_password", const=True)
    username: str = Field(..., title="Username", description="Username for the Weaviate cluster", order=1)
    password: str = Field(..., title="Password", description="Password for the Weaviate cluster", airbyte_secret=True, order=2)

    class Config:
        title = "Username/Password"
        schema_extra = {"description": "Authenticate using username and password (suitable for self-managed Weaviate clusters)"}


class NoAuth(BaseModel):
    mode: Literal["no_auth"] = Field("no_auth", const=True)

    class Config:
        title = "No Authentication"
        schema_extra = {
            "description": "Do not authenticate (suitable for locally running test clusters, do not use for clusters with public IP addresses)"
        }


class TokenAuth(BaseModel):
    mode: Literal["token"] = Field("token", const=True)
    token: str = Field(..., title="API Token", description="API Token for the Weaviate instance", airbyte_secret=True)

    class Config:
        title = "API Token"
        schema_extra = {"description": "Authenticate using an API token (suitable for Weaviate Cloud)"}


class Header(BaseModel):
    header_key: str = Field(..., title="Header Key")
    value: str = Field(..., title="Header Value", airbyte_secret=True)


class WeaviateIndexingConfigModel(BaseModel):
    host: str = Field(
        ...,
        title="Public Endpoint",
        order=1,
        description="The public endpoint of the Weaviate cluster.",
        examples=["https://my-cluster.weaviate.network"],
    )
    auth: Union[TokenAuth, UsernamePasswordAuth, NoAuth] = Field(
        ..., title="Authentication", description="Authentication method", discriminator="mode", type="object", order=2
    )
    batch_size: int = Field(title="Batch Size", description="The number of records to send to Weaviate in each batch", default=128)
    text_field: str = Field(title="Text Field", description="The field in the object that contains the embedded text", default="text")
    default_vectorizer: str = Field(
        title="Default Vectorizer",
        description="The vectorizer to use if new classes need to be created",
        default="none",
        enum=[
            "none",
            "text2vec-cohere",
            "text2vec-huggingface",
            "text2vec-openai",
            "text2vec-palm",
            "text2vec-contextionary",
            "text2vec-transformers",
            "text2vec-gpt4all",
        ],
    )
    additional_headers: List[Header] = Field(
        title="Additional headers",
        description="Additional HTTP headers to send with every request.",
        default=[],
        examples=[{"header_key": "X-OpenAI-Api-Key", "value": "my-openai-api-key"}],
    )

    class Config:
        title = "Indexing"
        schema_extra = {
            "group": "indexing",
            "description": "Indexing configuration",
        }


class NoEmbeddingConfigModel(BaseModel):
    mode: Literal["no_embedding"] = Field("no_embedding", const=True)

    class Config:
        title = "No external embedding"
        schema_extra = {
            "description": "Do not calculate and pass embeddings to Weaviate. Suitable for clusters with configured vectorizers to calculate embeddings within Weaviate or for classes that should only support regular text search."
        }


class ConfigModel(BaseModel):
    processing: ProcessingConfigModel
    embedding: Union[
        NoEmbeddingConfigModel,
        AzureOpenAIEmbeddingConfigModel,
        OpenAIEmbeddingConfigModel,
        CohereEmbeddingConfigModel,
        FromFieldEmbeddingConfigModel,
        FakeEmbeddingConfigModel,
    ] = Field(..., title="Embedding", description="Embedding configuration", discriminator="mode", group="embedding", type="object")
    indexing: WeaviateIndexingConfigModel

    class Config:
        title = "Weaviate Destination Config"
        schema_extra = {
            "groups": [
                {"id": "processing", "title": "Processing"},
                {"id": "embedding", "title": "Embedding"},
                {"id": "indexing", "title": "Indexing"},
            ]
        }

    @staticmethod
    def remove_discriminator(schema: dict) -> None:
        """pydantic adds "discriminator" to the schema for oneOfs, which is not treated right by the platform as we inline all references"""
        dpath.util.delete(schema, "properties/*/discriminator")
        dpath.util.delete(schema, "properties/**/discriminator")

    @classmethod
    def schema(cls):
        """we're overriding the schema classmethod to enable some post-processing"""
        schema = super().schema()
        schema = resolve_refs(schema)
        cls.remove_discriminator(schema)
        return schema
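
Note on the schema post-processing above: the `remove_discriminator` docstring says that the `discriminator` entries pydantic emits for `oneOf` fields have to be stripped once all references are inlined. The effect can be sketched with the standard library alone. This is a simplified illustration, not the connector's implementation: the helper name `remove_key_everywhere` and the trimmed sample schema are hypothetical, and the recursion removes the key at every nesting level rather than following the two `dpath` glob patterns used in the diff.

```python
from typing import Any


def remove_key_everywhere(node: Any, key: str = "discriminator") -> None:
    """Recursively delete `key` from nested dicts and lists, in place."""
    if isinstance(node, dict):
        node.pop(key, None)
        for value in node.values():
            remove_key_everywhere(value, key)
    elif isinstance(node, list):
        for item in node:
            remove_key_everywhere(item, key)


# Hypothetical, heavily trimmed schema; the real spec has many more fields.
schema = {
    "properties": {
        "embedding": {
            "discriminator": {"propertyName": "mode"},
            "oneOf": [{"title": "No external embedding"}, {"title": "OpenAI"}],
        },
        "indexing": {
            "properties": {
                "auth": {
                    "discriminator": {"propertyName": "mode"},
                    "oneOf": [{"title": "API Token"}, {"title": "No Authentication"}],
                }
            }
        },
    }
}

remove_key_everywhere(schema)
```

After the call, both the top-level `embedding` entry and the nested `indexing.auth` entry keep their `oneOf` options but no longer carry a `discriminator` key. One caveat of this simplification: a property that is itself named `discriminator` would also be dropped.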