Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(airflow): refactor extract_astro_blogs method #176

Merged
merged 1 commit into from
Dec 6, 2023

Conversation

Lee-W
Copy link
Collaborator

@Lee-W Lee-W commented Nov 27, 2023

When parsing links, parse the article tag to get the date to filter. Break the loop if any dates are filtered in an iteration.

closes: #116

Copy link

cloudflare-pages bot commented Nov 27, 2023

Deploying with  Cloudflare Pages  Cloudflare Pages

Latest commit: 273e6dc
Status: ✅  Deploy successful!
Preview URL: https://a2fd9648.ask-astro.pages.dev
Branch Preview URL: https://refactor-blog-extraction.ask-astro.pages.dev

View logs

Copy link
Collaborator

@sunank200 sunank200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lee-W how have you tested this? Requesting you to run the DAG once and test this change

@Lee-W
Copy link
Collaborator Author

Lee-W commented Nov 30, 2023

I ran these function and compare the results. Let me run the tests on DAG as well

@Lee-W
Copy link
Collaborator Author

Lee-W commented Nov 30, 2023

@sunank200 I've tested with local DAGs

圖片

@Lee-W Lee-W requested a review from sunank200 December 1, 2023 00:35
@sunank200
Copy link
Collaborator

sunank200 commented Dec 1, 2023

@sunank200 I've tested with local DAGs

圖片

@Lee-W For each of these changes in ingestion. We need to do the following:

  1. Ingest the data to the dev database with a change in ingestion code.
  2. Request @vatsrahul1001 to check if the retrieval performance as degraded or not.

Copy link
Collaborator

@sunank200 sunank200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking this change till this is tested by @vatsrahul1001

@Lee-W
Copy link
Collaborator Author

Lee-W commented Dec 1, 2023

Ingest the data to the dev database with a change in ingestion code.

May I know where is the dev database and dev airflow environment we can ingest? also what is the change you mentioned? do you mean the code change in this PR?

@mpgreg
Copy link
Contributor

mpgreg commented Dec 1, 2023

One thing to think about here... The current code uses source-specific extract. The extract function is specific to astro blogs. Alternatively the code at https://github.com/mpgreg/ask-astro/blob/main/airflow/dags/ingestion/ask-astro-load-html.py has one extract function which scrapes any webpage given some parameters. I think it will generalize pretty well but have not really tested. The main problem for astro blogs is there is no generalizable way to specify a cutoff date.

Perhaps that is okay. We can check with Juliana if there is some specific reason we don't want to include older blogs.

@Lee-W
Copy link
Collaborator Author

Lee-W commented Dec 1, 2023

One thing to think about here... The current code uses source-specific extract. The extract function is specific to astro blogs. Alternatively the code at https://github.com/mpgreg/ask-astro/blob/main/airflow/dags/ingestion/ask-astro-load-html.py has one extract function which scrapes any webpage given some parameters. I think it will generalize pretty well but have not really tested. The main problem for astro blogs is there is no generalizable way to specify a cutoff date.

Perhaps that is okay. We can check with Juliana if there is some specific reason we don't want to include older blogs.

I guess the reason behind it might be the same as what we did on StackOverflow? But I'm ok with it as well. I'll hold the work on this PR. Till we have a decision

@sunank200
Copy link
Collaborator

One thing to think about here... The current code uses source-specific extract. The extract function is specific to astro blogs. Alternatively the code at https://github.com/mpgreg/ask-astro/blob/main/airflow/dags/ingestion/ask-astro-load-html.py has one extract function which scrapes any webpage given some parameters. I think it will generalize pretty well but have not really tested. The main problem for astro blogs is there is no generalizable way to specify a cutoff date.
Perhaps that is okay. We can check with Juliana if there is some specific reason we don't want to include older blogs.

I guess the reason behind it might be the same as what we did on StackOverflow? But I'm ok with it as well. I'll hold the work on this PR. Till we have a decision

@Lee-W can you setup a call for this please and lets have a decision on this? After that this change should be tested with fresh ingestion on dev database.

@Lee-W
Copy link
Collaborator Author

Lee-W commented Dec 4, 2023

One thing to think about here... The current code uses source-specific extract. The extract function is specific to astro blogs. Alternatively the code at https://github.com/mpgreg/ask-astro/blob/main/airflow/dags/ingestion/ask-astro-load-html.py has one extract function which scrapes any webpage given some parameters. I think it will generalize pretty well but have not really tested. The main problem for astro blogs is there is no generalizable way to specify a cutoff date.
Perhaps that is okay. We can check with Juliana if there is some specific reason we don't want to include older blogs.

I guess the reason behind it might be the same as what we did on StackOverflow? But I'm ok with it as well. I'll hold the work on this PR. Till we have a decision

@Lee-W can you setup a call for this please and lets have a decision on this? After that this change should be tested with fresh ingestion on dev database.

I just sent a message to the channel for discussion. But IMHO, this might not even be directly part of the PR.

@Lee-W
Copy link
Collaborator Author

Lee-W commented Dec 5, 2023

as discussed earlier, we still need this cutoff date. will proceeded testing on this one

@Lee-W
Copy link
Collaborator Author

Lee-W commented Dec 5, 2023

I changed the host and pointed to the dev WEAVIATE. @vatsrahul1001 Could you please help with the testing? Thanks!

@Lee-W
Copy link
Collaborator Author

Lee-W commented Dec 5, 2023

@vatsrahul1001 We no longer need to test it on cloud. Thanks!

@sunank200 I've tested it and send you the result. Could you please take a look? Thanks!

Copy link
Collaborator

@sunank200 sunank200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Lee-W Lee-W merged commit 32e8ddc into main Dec 6, 2023
8 checks passed
@Lee-W Lee-W deleted the refactor-blog-extraction branch December 6, 2023 05:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

extract_astro_blogs should check for date before extract
5 participants