Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML of Pipeline example in tutorial incomplete #21457

Closed
1 of 2 tasks
KevinYanesG opened this issue Feb 9, 2022 · 22 comments
Closed
1 of 2 tasks

HTML of Pipeline example in tutorial incomplete #21457

KevinYanesG opened this issue Feb 9, 2022 · 22 comments
Labels

Comments

@KevinYanesG
Copy link
Contributor

Describe the issue with documentation

Apache Version: 2.2.3

OS:macOS Catalina Version 10.15.7

Apache Airflow Provider versions:

Deployment:

Web browser: Google Chrome Version 98.0.4758.80 (Official Build) (x86_64)
Also happened with Safari Version 13.1.3 (15609.4.1)

What happened:

The Pipeline example of the tutorial (in the website) looked like the following:
image

I was wondering why there was so little information about the connection to Postgres and why there were no import statements in the code.

How to solve the problem

Then, I realized that in the GitHub repo the tutorial file was way more extensive:
image

I marked in red the parts of the tutorial that can't be seen on the website. There are probably more missing details.

Thanks in advance!

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@KevinYanesG KevinYanesG added kind:bug This is a clearly a bug kind:documentation labels Feb 9, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented Feb 9, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@KevinYanesG
Copy link
Contributor Author

KevinYanesG commented Feb 9, 2022

Also:
I think that in the full version of the code the import should remain as from airflow.providers.postgres.hooks.postgres import PostgresHook.
from airflow.hooks.postgres import PostgresHook gives me an ImportError.

@ashb
Copy link
Member

ashb commented Feb 9, 2022

from airflow.hooks.postgres import PostgresHook gives me an ImportError.

You need to install the postgres provider -- that is also missing from that tutorial https://airflow.apache.org/docs/apache-airflow-providers-postgres/2.4.0/#installation

@KevinYanesG
Copy link
Contributor Author

KevinYanesG commented Feb 9, 2022

You need to install the postgres provider -- that is also missing from that tutorial https://airflow.apache.org/docs/apache-airflow-providers-postgres/2.4.0/#installation

Thanks! But I actually already have the provider installed, since from airflow.providers.postgres.hooks.postgres import PostgresHook works for me without problems.

I just thought that from airflow.hooks.postgres is wrong because it doesn't exist, or at least I didn't found it.

The closest module is: airflow.hooks.postgres_hook (https://airflow.apache.org/docs/apache-airflow/1.10.6/_api/airflow/hooks/postgres_hook/index.html), from which you can import PostgresHook

@ashb
Copy link
Member

ashb commented Feb 9, 2022

Oh sorry, airflow.providers.postgres.hooks.postgres is the new module path https://airflow.apache.org/docs/apache-airflow-providers-postgres/2.4.0/_api/airflow/providers/postgres/hooks/postgres/index.html

@KevinYanesG
Copy link
Contributor Author

KevinYanesG commented Feb 9, 2022

One last thing:
In the "valid" version of the pipeline example it is said:

Create a Employee table in Postgres using this:
CREATE TABLE "Employees" ...

For me (someone who didn't work with Postgres) it wasn't trivial to get the example to run, because I didn't know how to explicitly create the database. Finally, I did it with the following code:

    def get_data():
        # NOTE: configure this as appropriate for your airflow environment
        data_path = "/opt/airflow/dags/files/employees.csv"

        url = "https://raw.githubusercontent.com/apache/airflow/main/docs/apache-airflow/pipeline_example.csv"

        response = requests.request("GET", url)

        with open(data_path, "w+") as file:
            file.write(response.text)

        postgres_hook = PostgresHook(postgres_conn_id="postgres_default")
        conn = postgres_hook.get_conn()
        cur = conn.cursor()

        cur.execute('''CREATE TABLE "Employees" 
        (
            "Serial Number" NUMERIC PRIMARY KEY,
            "Company Name" TEXT,
            "Employee Markme" TEXT,
            "Description" TEXT,
            "Leave" INTEGER
        );

        CREATE TABLE "Employees_temp"
        (
            "Serial Number" NUMERIC PRIMARY KEY,
            "Company Name" TEXT,
            "Employee Markme" TEXT,
            "Description" TEXT,
            "Leave" INTEGER
        ); ''')


        with open(data_path, "r") as file:
            cur.copy_expert(
                "COPY \"Employees_temp\" FROM STDIN WITH CSV HEADER DELIMITER AS ',' QUOTE '\"'",
                file,
            )
        conn.commit()

I don't know if this is the intended/optimal way to create the DB, but the DAG's run didn't fail. Maybe it would be easier for bloody beginners like me to include the part of creating the DB explicitly in the tutorial's code.

@potiuk
Copy link
Member

potiuk commented Feb 9, 2022

Thanks for posting those. But I will close the ticket because the main reason why you see the difference is because you are looking at "stable" version of the docs released together with Airflow when the stable version of Airflow is released (which is 2.2.3 currently)

So what is published in 2.2.3 version of the docs (which you can see at the top-left selection of the page) is this: https://github.com/apache/airflow/blob/2.2.3/docs/apache-airflow/tutorial.rst

The main version of the docs is "development' version which might (or might not) be released with the new version and might (or might not) contain improvements and new features. The changes you pointed at from main are actually cherry-picked to v2-2-test (which will become 2.2.4 soon - about a week - when we release it): https://github.com/apache/airflow/blob/v2-2-test/docs/apache-airflow/tutorial.rst

I actually looked through all documentation-only changes we had in main (development versions). When the changes were just "improving" the documentation rather than adding new feature docs, i cherry-picked all such changes (and Jed - our release manager cherry-picked some before as well).

I just thought that from airflow.hooks.postgres is wrong because it doesn't exist, or at least I didn't found it.

This is indeed wrong. The easiest way for you to fix it is just go to the tutorial page and click "Suggest a change on this page" (bottomn-right of the page). It will open you the Pull Request with the "main" version of the page and you will be able to easily create a PR (even more easily than creating this issue was. You can this way become one of the > 1900 contributors to Airlfow (because Airflow is created by the contributors like you. This would be a fantastic first contribution to correct it!

I don't know if this is the intended/optimal way to create the DB, but the DAG's run didn't fail. Maybe it would be easier for bloody beginners like me to include the part of creating the DB explicitly in the tutorial's code.

You should follow Quick start if you are bloody beginner. This is who it is intended for. And It explains what happens and what is going on includng the steps to initialize DB (airflow db init is the right command),

@potiuk potiuk closed this as completed Feb 9, 2022
@potiuk potiuk added the invalid label Feb 9, 2022
@ashb
Copy link
Member

ashb commented Feb 9, 2022

http://apache-airflow-docs.s3-website.eu-central-1.amazonaws.com/docs/apache-airflow/latest/tutorial.html#pipeline-example for the rendering of the latest doc.

The import is still wrong in that example, so if you fancy a PR @KevinYanesG (no need for an issue, can just go to PR fixing it) we can guide you through it.

@potiuk
Copy link
Member

potiuk commented Feb 9, 2022

The import is still wrong in that example, so if you fancy a PR @KevinYanesG (no need for an issue, can just go to PR fixing it) we can guide you through it.

Absolutely.

@KevinYanesG
Copy link
Contributor Author

KevinYanesG commented Feb 10, 2022

But I will close the ticket because the main reason why you see the difference is because you are looking at "stable" version of the docs released together with Airflow when the stable version of Airflow is released (which is 2.2.3 currently)

Thanks for the extensive responses! I wasn't aware of that. In any case, it's good that you are going to include these details into the next version.

This is indeed wrong. The easiest way for you to fix it is just go to the tutorial page and click "Suggest a change on this page" (bottomn-right of the page)

I will be happy to do that :) I hope I can manage to do it before the end of the week.

You should follow Quick start if you are bloody beginner. This is who it is intended for. And It explains what happens and what is going on includng the steps to initialize DB (airflow db init is the right command),

I actually followed the Docker Quickstart and not the standalone one. Thus, I didn't saw the db init command. Although, I just had a look at it and it rather seems to be for initializing the postgres Backend, right? I was talking about how to explicitly create the Employee tables from the tutorial, which I did with the cur.execute('''CREATE TABLE "Employees ...") method. Is there another way to do it in a Docker environment?

@potiuk
Copy link
Member

potiuk commented Feb 10, 2022

I actually followed the Docker Quickstart and not the standalone one. Thus, I didn't saw the db init command. Although, I just had a look at it and it rather seems to be for initializing the postgres Backend, right? I was talking about how to explicitly create the Employee tables from the tutorial, which I did with the cur.execute('''CREATE TABLE "Employees ...") method. Is there another way to do it in a Docker environment?

Aaah My Bad. Indeed - silly me. Yeah by all means - adding those would be a nice thing for the tutorial.

Thanks for looking at those and attempts to clarify it all !

BTW. If you add the PRs shortly I am super-happy to cherry-pick those PRs to 2.2.4 as well :).

@potiuk
Copy link
Member

potiuk commented Feb 10, 2022

I actually followed the Docker Quickstart and not the standalone one. Thus, I didn't saw the db init command. Although, I just had a look at it and it rather seems to be for initializing the postgres Backend, right? I was talking about how to explicitly create the Employee tables from the tutorial, which I did with the cur.execute('''CREATE TABLE "Employees ...") method. Is there another way to do it in a Docker environment?

Actually there IS a better way!

We have the command:

airflow db shell

That drops you into the DB command line environment that is configured. So you could - I think - even redirect a creation script to the command - that would be much nicer, because it could be done as a single command:

airflow db shell <<EOF

CREATE ....

EOF

@potiuk
Copy link
Member

potiuk commented Feb 10, 2022

And in docker-compoe Quickstart you could likely combine it with https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#running-the-cli-commands

@KevinYanesG
Copy link
Contributor Author

KevinYanesG commented Feb 10, 2022

Thanks to you!

I tried

airflow db shell <<EOF ...

But unfortunately, after trying many different things, I didn't bring it to work, probably because of little experience in Airflow / Postgres.

A very brief summary of what I tried after deleting the cur.execute()code:

  • ./airflow.sh airflow db reset
  • ./airflow.sh airflow db init(just in case)
  • ./airflow.sh airflow db shell <<EOF ...
  • run the DAG
  • Got the error
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "Employees_temp_pkey"
ERROR:  relation "Employees_temp" already exists

(I wonder why this error didn't appear when using the cur.execute())

  • Tried to solve it with https://stackoverflow.com/a/21639138, but when pg_get_serial_sequence("Employees", "Serial Number") there was no output, the solution to this seems to go to deep into PostgresSQL for me to understand it in a reduced amount of time

(I also tried to run the sql commands directly in the shell but also didn't work)

@potiuk
Copy link
Member

potiuk commented Feb 10, 2022

ERROR: relation "Employees_temp" already exists

Isn't that because you already created it before :)?

@potiuk
Copy link
Member

potiuk commented Feb 10, 2022

just run (in reverse order from creation) at the beginning of the script

DROP TABLE EMPLOYEES_TEMP IF EXISTS CASCADE;
DROP TABLE EMPLOYEES IF EXISTS CASCADE;

https://www.geeksforgeeks.org/postgresql-drop-table/#:~:text=PostgreSQL%20has%20a%20DROP%20TABLE,table%20permanently%20from%20the%20database.

@KevinYanesG
Copy link
Contributor Author

KevinYanesG commented Feb 10, 2022

I actually already tried with justDROP TABLE <name>, it didn't work either. Now with the extra parameters:

./airflow.sh airflow db shell <<EOF

DROP TABLE EMPLOYEES_TEMP IF EXISTS CASCADE;
DROP TABLE EMPLOYEES IF EXISTS CASCADE;
CREATE TABLE "Employees" 
(
    "Serial Number" NUMERIC PRIMARY KEY,
    "Company Name" TEXT,
    "Employee Markme" TEXT,
    "Description" TEXT,
    "Leave" INTEGER
);


CREATE TABLE "Employees_temp"
(
    "Serial Number" NUMERIC PRIMARY KEY,
    "Company Name" TEXT,
    "Employee Markme" TEXT,
    "Description" TEXT,
    "Leave" INTEGER
);

EOF

After running the DAG I still get

psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "Employees_temp_pkey"
DETAIL:  Key ("Serial Number")=(2) already exists.
CONTEXT:  COPY Employees_temp, line 2

@potiuk
Copy link
Member

potiuk commented Feb 10, 2022

psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "Employees_temp_pkey"
DETAIL:  Key ("Serial Number")=(2) already exists.
CONTEXT:  COPY Employees_temp, line 2

This error indicates that the input file contain a duplicate (or maybe Postgres has case-sensitive names and the DROP did not actually drop the table)?

@potiuk
Copy link
Member

potiuk commented Feb 10, 2022

Here is the right Blog about it :

https://blog.xojo.com/2016/09/28/about-postgresql-case-sensitivity/

Seems that when you initially create Postgres tables with "" the name remains case sensitive. So I think Just dropping the quotes from names of the tabels and using it consistenlty UpperCase or lower case should help.

@potiuk
Copy link
Member

potiuk commented Feb 10, 2022

(BTW. Of course I had no idea about it before I googled it).

@KevinYanesG
Copy link
Contributor Author

KevinYanesG commented Feb 11, 2022

Thanks, that's a good hint I will probably get to it in the beginning/middle of next week.

@KevinYanesG
Copy link
Contributor Author

Ok, I got it to work, thanks again for the suggestions, for some reason the EOF method didn't work. But accessing the db shell directly and creating the tables one by one did work. I also needed to change the whole code to be in capital letters. I will open a pull request ASAP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants