This repository was archived by the owner on May 17, 2024. It is now read-only.

Conversation

@sirupsen (Contributor) commented May 10, 2022:

The dev setup right now is pretty unreliable. I'm trying to vastly simplify it and make it more robust. I'm commenting inline on why I've made certain changes.


It's highly recommended that all involved columns are indexed.

## Development Setup
Contributor Author:

It's too easy to miss it otherwise. Let's just have one big README.

Collaborator:

Are you sure that we only want to have doc-based instructions for a setup? Check DAT-3098

@sirupsen (Contributor Author), May 10, 2022:

I would love a script, the reason for a doc-based approach for now is (1) these currently take too long to be one script, (2) too hard to test, because it takes too long to test, and (3) the existing script didn't work (because of (2)). We are not mature enough for a script, unfortunately.


run_sql("RM @~/ratings.csv.gz")
run_sql("PUT file://ml-25m/ratings.csv @~")
run_sql("PUT file://dev/ml-25m/ratings.csv @~")
Contributor Author:

The commands in the README straight up didn't work, because the paths assumed you were cd(1)'ed into the directory.

--default-authentication-plugin=mysql_native_password
--innodb-buffer-pool-size=8G
--innodb_io_capacity=2000
--innodb_log_file_size=1G
Contributor Author:

Unless you have a huge computer, these settings will make things way worse. MySQL OOM'ed on my machine this morning because the buffer pool size is bigger than the memory I allow Docker. We have to err towards robustness over performance here; these settings are too idiosyncratic.
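
For reference, a more conservative override along these lines might look like the following. The values here are illustrative assumptions, not taken from this PR; the point is that they fit comfortably inside a typical Docker Desktop memory allowance:

```yaml
# Hypothetical docker-compose service with modest MySQL memory settings.
services:
  mysql:
    container_name: mysql
    image: mysql:oracle
    command:
      - --default-authentication-plugin=mysql_native_password
      - --innodb-buffer-pool-size=256M   # well under a typical Docker memory cap
      - --innodb-log-file-size=128M
```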

- '5432'
env_file:
- dev.env
- dev/dev.env
Contributor Author:

You should be able to run docker-compose from the top-level directory, not have to cd around to develop. Too annoying.


[tool.poetry.dev-dependencies]
protobuf = "^3.20.1"
mysql-connector-python = "*"
Contributor Author:

It's too easy to miss installing the extras for development otherwise, and it's annoying to have to repeat them. If someone knows how to make them mandatory in dev but optional in prod, let me know!
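
For context, the trade-off is roughly between listing the drivers as optional extras versus as dev-dependencies. A sketch of the two shapes (package names from this diff; the structure is standard Poetry, version pins are illustrative):

```toml
# Option A: optional extras -- users opt in, but devs must remember the -E flags.
[tool.poetry.dependencies]
mysql-connector-python = {version = "*", optional = true}

[tool.poetry.extras]
mysql = ["mysql-connector-python"]

# Option B: dev-dependencies -- installed by a plain `poetry install` in dev,
# never shipped with production installs of the package.
[tool.poetry.dev-dependencies]
mysql-connector-python = "*"
```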

Collaborator:

However, this will create errors, since you are not also enforcing the installation of the system dependencies. You cannot simply install these packages without the correct dependencies installed. This is why we also had the Docker setup, so people aren't forced to install dependencies in order to use it.

Collaborator:

There is a reason that the Docker image installs python3-dev libpq-dev python3-setuptools gcc. If your system does not have those, you can't install data-diff with mysql-connector-python or psycopg2, for example.
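
For reference, those system packages would typically be installed in the image before the Python drivers. A hedged sketch, assuming a Debian-based base image (the base image and Python version here are illustrative, not from this PR):

```dockerfile
# Assumes a Debian/Ubuntu base; package names as cited in the comment above.
FROM python:3.10-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3-dev libpq-dev python3-setuptools gcc \
    && rm -rf /var/lib/apt/lists/*
# Only after this can psycopg2 / mysql-connector-python build from source.
```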

Contributor Author:

To be clear, I see the Docker container as a production environment or a quick way to test locally—not a development environment (I should add something to the docs about the Docker container)

I do not think the Docker setup can replace this as a development environment (I'm not sure if you are implying this, but for completeness), because developing inside of Docker is, in my experience, miserable: having to rebuild the container at times, not having the CLI tools from your laptop, etc. That's why I do not think it should be in docker-compose at all, since that's what it implies.

As for the system libraries, that's a really good point. I see 2 options:

(1) Continue to install these as dev dependencies, but as part of the doc installs, ensure that people run the correct brew install
(2) Revert back to extras, but with (1)

I am defaulting to (1) because to run the test suite, you will need all the bindings soon. Furthermore, not every database uses native libraries, as many are converging on protobuf, mysql protocol, or the postgres protocol—so I don't see it as a never-ending nightmare. If it spins out of control, I'd then suggest we move to (2)

What do you think?

Contributor:

We're talking about someone who wants to develop this tool, and use our env setup for it, but doesn't want to install mysql? I'm not seeing the point here.

For using the tool, yes, we want to make it easy to install. But for development, I think it's okay to raise the bar a little bit. The tests won't run without mysql anyway.

Contributor Author:

@erezsh so to be clear, you're voting for (1)? People will run mysql through Docker, so they won't have the bindings by default; we will just provide them the instructions for it.

Contributor Author:

You're on Windows though, right?

Collaborator:

OK, let's try (1). But then we need to test it rigorously with colleagues, to see whether they can get it up and running based solely on the doc, before we ask customers to do it.

Contributor Author:

👍🏻 I'll fold those instructions into the README and get this merged

Contributor:

Yes, I'm voting 1. Asking to install libraries is common for public dev setups.

Maybe we can provide a nice script for them that does it for ubuntu/mac

Yes, I use windows, but I run data-diff on WSL

self.connection.query("DROP TABLE IF EXISTS ratings_test", None)
self.connection.query("DROP TABLE IF EXISTS ratings_test2", None)
self.preql.load("./tests/setup.pql")
self.preql.commit()
Contributor Author:

Make sure tables are created before creating the segment class
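
As an illustration of the ordering constraint (using sqlite3 here rather than the project's own connection class, so the names are stand-ins): anything that introspects the schema must run only after the tables exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Recreate the test table first...
conn.execute("DROP TABLE IF EXISTS ratings_test")
conn.execute("CREATE TABLE ratings_test (id INTEGER PRIMARY KEY, rating REAL)")
conn.commit()

# ...and only then construct anything that reads the schema
# (standing in for the TableSegment class here).
cols = [row[1] for row in conn.execute("PRAGMA table_info(ratings_test)")]
print(cols)  # ['id', 'rating']
```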

@sirupsen sirupsen requested review from cfernhout and erezsh May 10, 2022 14:50
# For now while we are in heavy development we install the latest with Poetry
# and execute directly with Poetry. Later, we'll move to the released Pip package.
RUN poetry install -E preql -E mysql -E pgsql -E snowflake
ENTRYPOINT ["poetry", "run", "python3", "-m", "data_diff"]
Contributor Author:

This avoids having to type out all the arguments

Collaborator:

So this would work and ENTRYPOINT "poetry run python3 -m data_diff" wouldn't?

Contributor Author:

They'd both work. I've just been bitten many times by quoting issues with the string syntax, so I'm in the habit of using the exec form now. It's what Docker recommends, too.
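
The practical difference between the two forms: the exec (JSON array) form runs the binary directly, while the shell form wraps the command in `/bin/sh -c`, which changes how signals reach the process and how `docker run` arguments are appended. Illustrative sketch only:

```dockerfile
# Exec form: no shell involved; `docker run <image> --help`
# appends `--help` cleanly to the command.
ENTRYPOINT ["poetry", "run", "python3", "-m", "data_diff"]

# Shell form: runs under `/bin/sh -c`; appended arguments from
# `docker run` are ignored, and PID 1 is the shell, not the app.
# ENTRYPOINT poetry run python3 -m data_diff
```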

Collaborator:

Did you test this so that it works? How are you providing it with the arguments like DB1_URI etc?

Contributor Author:

[screenshot: CleanShot 2022-05-11 at 10 44 54@2x]

WORKDIR /app
# For now while we are in heavy development we install the latest with Poetry
# and execute directly with Poetry. Later, we'll move to the released Pip package.
RUN poetry install -E preql -E mysql -E pgsql -E snowflake
Contributor Author:

We can probably remove the extras here, since they should be installed by default in the dev env.


self.table = TableSegment(TestDiffTables.connection,
(self.table_name, ),
('ratings_test', ),
Contributor Author:

The test tables were renamed without the tests being fixed

@erezsh (Contributor) left a comment:

Looks okay

Use cases:
- Validate that a table was copied properly
- Be alerted before your customer finds out, or your report is wrong
- Validate that your replication mechnism is working correctly
Contributor:

Spelling error (mechnism)

@cfernhout (Collaborator) left a comment:

Also check the comment from Erez with spelling mistake.


mysql:
container_name: mysql
image: mysql:oracle
# fsync less aggressively for insertion perf for test setup
Collaborator:

It's unclear what this comment is pointing to.

Contributor Author:

Basically all the configuration you put in is for this, but I will refer to it explicitly 👍🏻



preql -f dev/prepare_db.pql mysql://mysql:Password1@127.0.0.1:3306/mysql
preql -f dev/prepare_db.psq snowflake://<uri>
preql -f dev/prepare_db.psq mssql://<uri>
preql -f dev/prepare_db_bigquery.pql bigquery:///<project> # Bigquery has its own
Collaborator:

Suggested change:
- preql -f dev/prepare_db_bigquery.pql bigquery:///<project> # Bigquery has its own
+ preql -f dev/prepare_db_bigquery.pql bigquery:///<project> # Bigquery has its own script



@sirupsen sirupsen merged commit 0097e06 into master May 11, 2022
@sirupsen sirupsen deleted the dev-env-fixes branch May 11, 2022 14:52
nolar pushed a commit that referenced this pull request Apr 6, 2023