fix(datahub-ingestion): remove old jars, sync pyspark version #9217

Merged
7 changes: 4 additions & 3 deletions docker/datahub-ingestion-base/build.gradle
@@ -9,20 +9,21 @@ ext {
     docker_registry = rootProject.ext.docker_registry == 'linkedin' ? 'acryldata' : docker_registry
     docker_repo = 'datahub-ingestion-base'
     docker_dir = 'datahub-ingestion-base'
+    docker_target = project.getProperties().getOrDefault("dockerTarget", "slim")
 
     revision = 2 // increment to trigger rebuild
 }
 
 docker {
-    name "${docker_registry}/${docker_repo}:v${version}-slim"
-    version "v${version}-slim"
+    name "${docker_registry}/${docker_repo}:v${version}-${docker_target}"
+    version "v${version}-${docker_target}"
     dockerfile file("${rootProject.projectDir}/docker/${docker_dir}/Dockerfile")
     files fileTree(rootProject.projectDir) {
         include "docker/${docker_dir}/*"
     }.exclude {
         i -> i.file.isHidden() || i.file == buildDir
     }
-    buildArgs([APP_ENV: 'slim'])
+    buildArgs([APP_ENV: docker_target])
 }
 tasks.getByName('docker').dependsOn('build')

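For illustration, a hedged sketch of how the new dockerTarget property might be supplied when invoking the docker task; the :docker:datahub-ingestion-base project path is an assumption based on the directory layout, not confirmed by this diff:

    # Default build: dockerTarget falls back to "slim" via getOrDefault
    ./gradlew :docker:datahub-ingestion-base:docker

    # Assumed override: -P sets the Gradle project property read by project.getProperties()
    ./gradlew :docker:datahub-ingestion-base:docker -PdockerTarget=full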
16 changes: 14 additions & 2 deletions docker/datahub-ingestion/Dockerfile
@@ -22,10 +22,22 @@ ENV PATH="/datahub-ingestion/.local/bin:$PATH"
 FROM base as slim-install
 RUN pip install --no-cache --user ".[base,datahub-rest,datahub-kafka,snowflake,bigquery,redshift,mysql,postgres,hive,clickhouse,glue,dbt,looker,lookml,tableau,powerbi,superset,datahub-business-glossary]"
 
-FROM base as full-install
+FROM base as full-install-build
+
+USER 0
+RUN apt-get update && apt-get install -y -qq maven
+
+USER datahub
+COPY ./docker/datahub-ingestion/pyspark_jars.sh .
+
 RUN pip install --no-cache --user ".[base]" && \
     pip install --no-cache --user "./airflow-plugin[acryl-datahub-airflow-plugin]" && \
-    pip install --no-cache --user ".[all]"
+    pip install --no-cache --user ".[all]" && \
+    ./pyspark_jars.sh
Collaborator:
If you have numbers for this, what does it do to install sizes?

Collaborator Author:
No expected (or observed) impact; the full image is still ~1.08 GB.

+FROM base as full-install
+
+COPY --from=full-install-build /datahub-ingestion/.local /datahub-ingestion/.local
 
 FROM base as dev-install
 # Dummy stage for development. Assumes code is built on your machine and mounted to this image.
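As a hedged illustration of the new multi-stage layout (image tags and the repo-root build context are assumptions, and any required build args are omitted), the stages could be targeted directly with docker build:

    # Assumed: stop at the build stage that installs maven and runs pyspark_jars.sh
    docker build -f docker/datahub-ingestion/Dockerfile --target full-install-build -t datahub-ingestion:full-build .

    # Assumed: build the final full image, which only copies .local from full-install-build
    docker build -f docker/datahub-ingestion/Dockerfile --target full-install -t datahub-ingestion:full .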
22 changes: 22 additions & 0 deletions docker/datahub-ingestion/pyspark_jars.sh
@@ -0,0 +1,22 @@
#!/bin/bash

set -ex

HADOOP_CLIENT_DEPENDENCY="${HADOOP_CLIENT_DEPENDENCY:-org.apache.hadoop:hadoop-client:3.3.6}"
ZOOKEEPER_DEPENDENCY="${ZOOKEEPER_DEPENDENCY:-org.apache.zookeeper:zookeeper:3.7.2}"
PYSPARK_JARS="$(python -m site --user-site)/pyspark/jars"

# Remove conflicting versions
echo "Removing version conflicts from $PYSPARK_JARS"
CONFLICTS="zookeeper hadoop- slf4j-"
for jar in $CONFLICTS; do
    rm "$PYSPARK_JARS/$jar"*.jar
done

# Fetch dependencies
mvn dependency:get -Dtransitive=true -Dartifact="$HADOOP_CLIENT_DEPENDENCY"
mvn dependency:get -Dtransitive=true -Dartifact="$ZOOKEEPER_DEPENDENCY"

# Move to pyspark location
echo "Moving jars to $PYSPARK_JARS"
find "$HOME/.m2" -type f -name "*.jar" -exec mv {} "$PYSPARK_JARS/" \;
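A hedged usage sketch for the script above, assuming it runs where maven and the user-site pyspark install already exist (as in the full-install-build stage); the override coordinates below are illustrative only:

    # Default run: fetches hadoop-client 3.3.6 and zookeeper 3.7.2
    ./pyspark_jars.sh

    # Assumed override of the Maven coordinates via the environment (versions shown are illustrative)
    HADOOP_CLIENT_DEPENDENCY="org.apache.hadoop:hadoop-client:3.3.4" \
    ZOOKEEPER_DEPENDENCY="org.apache.zookeeper:zookeeper:3.7.1" \
    ./pyspark_jars.sh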
4 changes: 2 additions & 2 deletions metadata-ingestion/setup.py
@@ -242,7 +242,7 @@
 }
 
 data_lake_profiling = {
-    "pydeequ==1.1.0",
+    "pydeequ~=1.1.0",
     "pyspark~=3.3.0",
 }

@@ -256,7 +256,7 @@
 databricks = {
     # 0.1.11 appears to have authentication issues with azure databricks
     "databricks-sdk>=0.9.0",
-    "pyspark",
+    "pyspark~=3.3.0",
     "requests",
 }

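For context on the pin, PEP 440's compatible-release operator keeps installs on the 3.3.x line; a minimal hedged check (assumes network access to a package index):

    # "pyspark~=3.3.0" is equivalent to ">=3.3.0,<3.4.0", so only 3.3.x patch releases are accepted
    pip install "pyspark~=3.3.0"
    python -c "import pyspark; print(pyspark.__version__)"   # expect a 3.3.x version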