-
Notifications
You must be signed in to change notification settings - Fork 299
dev: simplify #18
dev: simplify #18
Conversation
|
|
||
| It's highly recommended that all involved columns are indexed. | ||
|
|
||
| ## Development Setup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's too easy to miss it otherwise. Let's just have one big README.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure that we only want to have doc-based instructions for a setup? Check DAT-3098
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love a script, the reason for a doc-based approach for now is (1) these currently take too long to be one script, (2) too hard to test, because it takes too long to test, and (3) the existing script didn't work (because of (2)). We are not mature enough for a script, unfortunately.
|
|
||
| run_sql("RM @~/ratings.csv.gz") | ||
| run_sql("PUT file://ml-25m/ratings.csv @~") | ||
| run_sql("PUT file://dev/ml-25m/ratings.csv @~") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commands in the README straight up didn't work, because the paths assumed you were cd(1)'ed into the directory.
| --default-authentication-plugin=mysql_native_password | ||
| --innodb-buffer-pool-size=8G | ||
| --innodb_io_capacity=2000 | ||
| --innodb_log_file_size=1G |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless you have a huge computer these settings will make things way worse. MySQL OOM'ed on my machine (errors I got this morning) because the buffer pool size is bigger than I allow Docker. We have to err towards robustness over performance here. These settings are too idiosyncratic
| - '5432' | ||
| env_file: | ||
| - dev.env | ||
| - dev/dev.env |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should run docker-compose at the top-level directory, not have to cd around to develop. Too annoying.
|
|
||
| [tool.poetry.dev-dependencies] | ||
| protobuf = "^3.20.1" | ||
| mysql-connector-python = "*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's too easy to miss to install the extras otherwise for development. It's annoying to have to repeat them. If someone knows how to make them mandatory in dev, but optional in prod, let me know!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, this will create errors. Since you are not also enforcing the installation of dependencies! You cannot simply install these packages without the correct dependencies installed. This is why we also had this docker setup. So people aren't forces to install dependencies in order to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a reason that the docker installs python3-dev libpq-dev python3-setuptools gcc. If your system does not have those, you can't install data-diff with mysql-connector-python or psycopg2 for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, I see the Docker container as a production environment or a quick way to test locally—not a development environment (I should add something to the docs about the Docker container)
I do not think the Docker setup can replace this as a development environment (I'm not sure if you are implying this, but for completeness), because developing inside of Docker is a miserable experience in my experience. Having to rebuild container at times, not having the CLI tools from your laptop, etc. That's why I do not think it should be in docker-compose at all, since that's what it implies.
As for the system libraries, that's a really good point. I see 2 options:
(1) Continue to install these as dev dependencies, but as part of the doc installs, ensure that people run the correct brew install
(2) Revert back to extras, but with (1)
I am defaulting to (1) because to run the test suite, you will need all the bindings soon. Furthermore, not every database uses native libraries, as many are converging on protobuf, mysql protocol, or the postgres protocol—so I don't see it as a never-ending nightmare. If it spins out of control, I'd then suggest we move to (2)
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're talking about someone who wants to develop this tool, and use our env setup for it, but doesn't want to install mysql? I'm not seeing the point here.
For using the tool, yes, we want to make it easy to install. But for development, I think it's okay to raise the bar a little bit. The tests won't run without mysql anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@erezsh so to be clear, you're voting for (1)? People will run mysql through Docker, so they won't have the bindings by default—so we will just provide them the instructions for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're on Windows though, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let's try (1). But then we need to test it rigorously with colleagues it they can get it up and running, solely based on the doc, before we ask customers to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻 I'll fold those instructions into the README and get this merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm voting 1. Asking to install libraries is common for public dev setups.
Maybe we can provide a nice script for them that does it for ubuntu/mac
Yes, I use windows, but I run data-diff on WSL
| self.connection.query("DROP TABLE IF EXISTS ratings_test", None) | ||
| self.connection.query("DROP TABLE IF EXISTS ratings_test2", None) | ||
| self.preql.load("./tests/setup.pql") | ||
| self.preql.commit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure tables are created before creating the segment class
| # For now while we are in heavy development we install the latest with Poetry | ||
| # and execute directly with Poetry. Later, we'll move to the released Pip package. | ||
| RUN poetry install -E preql -E mysql -E pgsql -E snowflake | ||
| ENTRYPOINT ["poetry", "run", "python3", "-m", "data_diff"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This avoids having to type out all the arguments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this would work and ENTRYPOINT "poetry run python3 -m data_diff" wouldn't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They'd both work. Just been bit many times by string syntax from quoting, so in this habit now. It's what they recommend too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you test this so that it works? How are you providing it with the arguments like DB1_URI etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| WORKDIR /app | ||
| # For now while we are in heavy development we install the latest with Poetry | ||
| # and execute directly with Poetry. Later, we'll move to the released Pip package. | ||
| RUN poetry install -E preql -E mysql -E pgsql -E snowflake |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably can remove extras since they should be done by default in the dev env
|
|
||
| self.table = TableSegment(TestDiffTables.connection, | ||
| (self.table_name, ), | ||
| ('ratings_test', ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test tables were renamed without the tests being fixed
erezsh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks okay
| Use cases: | ||
| - Validate that a table was copied properly | ||
| - Be alerted before your customer finds out, or your report is wrong | ||
| - Validate that your replication mechnism is working correctly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling error (mechnism)
cfernhout
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also check the comment from Erez with spelling mistake.
| # For now while we are in heavy development we install the latest with Poetry | ||
| # and execute directly with Poetry. Later, we'll move to the released Pip package. | ||
| RUN poetry install -E preql -E mysql -E pgsql -E snowflake | ||
| ENTRYPOINT ["poetry", "run", "python3", "-m", "data_diff"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this would work and ENTRYPOINT "poetry run python3 -m data_diff" wouldn't?
| mysql: | ||
| container_name: mysql | ||
| image: mysql:oracle | ||
| # fsync less aggressively for insertion perf for test setup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unclear where this commend is pointing to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically all the configuration you put in is this, but I will refer to it 👍🏻
|
|
||
| [tool.poetry.dev-dependencies] | ||
| protobuf = "^3.20.1" | ||
| mysql-connector-python = "*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, this will create errors. Since you are not also enforcing the installation of dependencies! You cannot simply install these packages without the correct dependencies installed. This is why we also had this docker setup. So people aren't forces to install dependencies in order to use it.
|
|
||
| [tool.poetry.dev-dependencies] | ||
| protobuf = "^3.20.1" | ||
| mysql-connector-python = "*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a reason that the docker installs python3-dev libpq-dev python3-setuptools gcc. If your system does not have those, you can't install data-diff with mysql-connector-python or psycopg2 for example.
| preql -f dev/prepare_db.pql mysql://mysql:Password1@127.0.0.1:3306/mysql | ||
| preql -f dev/prepare_db.psq snowflake://<uri> | ||
| preql -f dev/prepare_db.psq mssql://<uri> | ||
| preql -f dev/prepare_db_bigquery.pql bigquery:///<project> # Bigquery has its own |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| preql -f dev/prepare_db_bigquery.pql bigquery:///<project> # Bigquery has its own | |
| preql -f dev/prepare_db_bigquery.pql bigquery:///<project> # Bigquery has its own script |
|
|
||
| It's highly recommended that all involved columns are indexed. | ||
|
|
||
| ## Development Setup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure that we only want to have doc-based instructions for a setup? Check DAT-3098

The dev setup right now is pretty unreliable. I'm trying to vastly simplify it and make it more robust. I'm commenting inline why I've made certain changes.