
✨ New Connector Idea: "GitHub Files" Source #22

Open
aaronsteers opened this issue May 20, 2024 · 14 comments · May be fixed by airbytehq/airbyte#40621
aaronsteers commented May 20, 2024

Desired Connector Spec

This new connector would allow users to get file data from a github (or generic git) repo, and apply our CDK parser logic (especially unstructured parser) along with glob patterns, to get data and content from a github repo into an Airbyte data pipeline.


Ishankoradia commented Jun 3, 2024

Hi @aaronsteers, this is what @siddhant3030 and I were thinking; please let me know what you think.
Let's say a git repo has 10 files.

  1. The connector takes as input a git repo URL and its credentials.
  2. The connector reads all files in the repo and extracts their content (via some repo content API).
  3. We generate records with schema [{content: "text content from the API in the step above", file_name: "source.py"}], adding more rows based on the glob patterns (this part is a bit unclear; an explanation would be great).
  4. So basically we will have one stream per repo.
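The mapping in step 3 could be sketched roughly like this. This is purely illustrative (the function and field names are not from any existing connector); the only assumed fact is that GitHub's contents API returns file bodies base64-encoded under a `content` key.

```python
import base64

def entry_to_record(entry: dict) -> dict:
    """Map one GitHub contents-API file entry to the record shape from step 3.

    The contents API (GET /repos/{owner}/{repo}/contents/{path}) returns
    file bodies base64-encoded under the "content" key.
    """
    text = base64.b64decode(entry.get("content", "")).decode("utf-8", errors="replace")
    return {"content": text, "file_name": entry["path"]}
```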

It would be great if you could shed light on the questions below:

  1. What is unclear is why we need a new connector. Can't we add another stream to the existing GitHub connector that reads content from the files?
  2. Is the motivation behind this connector to give users the ability to plug their git content into an LLM for retrieval?
  3. Does having one stream per file make sense here?
  4. Could you give an example of the glob patterns you are referring to?


TLazarevic commented Jun 3, 2024

@aaronsteers
This one is interesting as well, but we would also appreciate an answer to the question above:

What is unclear is why we need a new connector. Can't we add another stream to the existing GitHub connector that reads content from the files?

@aaronsteers

@Ishankoradia - response inline:

  1. The connector takes as input a git repo URL and its credentials.

👍

  2. The connector reads all files in the repo and extracts their content (via some repo content API).

👍

  3. We generate records with schema [{content: "text content from the API in the step above", file_name: "source.py"}], adding more rows based on the glob patterns (this part is a bit unclear; an explanation would be great).

Yes, as you suggest, it would include content and the (relative) file name, and hopefully also created_at and last_modified_at.

Since this is related to LLM use cases, it might make sense to try to match the schema of "unstructured" sources like source-google-drive.

If easy to include, some other fields like last_commit_sha could be helpful also.

  4. So basically we will have one stream per repo.

👍

It would be great if you could shed light on the questions below:

  1. What is unclear is why we need a new connector. Can't we add another stream to the existing GitHub connector that reads content from the files?

TL;DR: This part is your call. I don't feel strongly about it either way.

The thought process is that it was probably easier to not have to worry about impacts related to modifying the existing connector. And if created as a separate connector, we could always incorporate back into the GitHub source later on.

If adding to the existing GitHub connector, we can't change the auth or other options.

Another consideration here is that we may want to go all-in on the unstructured documents paradigm or files-type extractor paradigm. There is some existing prior art here that you could reuse - notably source-s3 and source-google-drive, for example. In those cases, the config is file-centric, allowing stream definition per glob and different globs/streams using different parsers.
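To make the file-centric paradigm concrete, here is a hypothetical config shape loosely modeled on source-s3 and source-google-drive. Every field name in it is an assumption for illustration, not the actual spec of either connector:

```python
# Hypothetical config, loosely modeled on source-s3 / source-google-drive.
# All field names here are illustrative assumptions, not the real spec.
config = {
    "repository": "airbytehq/airbyte",
    "credentials": {"personal_access_token": "<redacted>"},
    "streams": [
        {"name": "docs", "globs": ["docs/**/*.md"], "format": {"filetype": "unstructured"}},
        {"name": "code", "globs": ["**/*.py"], "format": {"filetype": "unstructured"}},
    ],
}

def stream_names(config: dict) -> list:
    """Each glob group defines its own stream, each free to use a different parser."""
    return [stream["name"] for stream in config["streams"]]
```

The point of this shape is that each glob group becomes an independently configurable stream, which is how the existing file-based sources let different globs use different parsers.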

  2. Is the motivation behind this connector to give users the ability to plug their git content into an LLM for retrieval?

Yes! Goal here is to make git content available to LLMs.

  3. Does having one stream per file make sense here?

I think one stream per repo (or else one stream per glob?) is probably the right call.

  4. Could you give an example of the glob patterns you are referring to?

Sorry for the confusion on this point. Here are some example glob patterns, keeping in mind the LLM use cases.

  1. * (or **/*?) - get all files from the repo. Helpful if you just want all of the files' content.
  2. *.md - get all Markdown documentation pages from the repo.
  3. *.py (or another language-specific filter) - gather code files for code analysis, code summaries, and/or AI-assisted code creation.
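As a quick illustration of how those patterns select files, here is a sketch using the standard library's `fnmatch` (the file list is made up; real connectors may use stricter glob semantics):

```python
from fnmatch import fnmatch

FILES = ["README.md", "docs/intro.md", "src/source.py", "assets/logo.png"]

def matching(glob: str) -> list:
    # Note: fnmatch's "*" also crosses "/" boundaries, so "*.md" behaves
    # like a shell's "**/*.md"; a real connector would pin this down.
    return [path for path in FILES if fnmatch(path, glob)]
```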


Ishankoradia commented Jun 4, 2024

@aaronsteers thanks for getting back super quick. Appreciate all the answers.

Summarizing for our understanding

  • Agreed, a separate connector makes testing easier and we don't have to worry about open/closed-principle design to avoid breaking existing things.
  • The motivation is clear to us now.
  • One stream per glob pattern makes more sense, as you mentioned.
  • The schema could have things like last_commit_sha. Eventually a record could look like
    [{content: "text content from the api in above step", relative_file_name: "src/source.py", last_modified_at: "2023-01-01", created_on: "2024-01-01", last_commit_sha: "asdasda..."}]
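That summarized record shape could be pinned down with a JSON-Schema-style declaration. The field names come from the summary above; which fields end up required is a guess:

```python
# Field names from the discussion above; the "required" set is an assumption.
record_schema = {
    "type": "object",
    "properties": {
        "content": {"type": "string"},
        "relative_file_name": {"type": "string"},
        "created_on": {"type": "string", "format": "date"},
        "last_modified_at": {"type": "string", "format": "date"},
        "last_commit_sha": {"type": "string"},
    },
    "required": ["content", "relative_file_name"],
}
```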

Have two more questions

  1. Will this be a Python CDK connector?
  2. In your initial comment, what do you mean by CDK parser logic?

Could you assign this to me? It's very interesting.


aaronsteers commented Jun 4, 2024

Have two more questions

  1. Will this be a Python CDK connector?

Yes, if that works for you. 👍 We have a CDK "extra" for file-based connectors, which may be helpful.
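If that route works, the file-based pieces ship as a CDK extra. The exact extra name may vary by CDK version, so treat this as a sketch:

```shell
# Assumed extra name; check the airbyte-cdk release notes for the exact spelling.
pip install 'airbyte-cdk[file-based]'
```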

  2. In your initial comment, what do you mean by CDK parser logic?

There's a common CDK backend for sources such as Google Drive and S3, which I'll screenshot below for comparison.

[Screenshot: source-s3 configuration example]

[Screenshot: source-google-drive configuration example]

Could you assign this to me? It's very interesting.

Absolutely. You've got first dibs as the first person to reply. If the above sounds good to you, simply confirm and @marcosmarxm will assign it to you.


Ishankoradia commented Jun 4, 2024

Perfect, sounds good to me @aaronsteers. Thanks for getting back quickly.
I will look into the resources you shared.
This should be exciting. You can assign it to me.

@aaronsteers

@Ishankoradia - We are very excited also - thanks for taking this on. I've assigned it to you!

@avirajsingh7

@Ishankoradia if you want to drop this issue, can I work on it?

@Ishankoradia

Hey @avirajsingh7, I am still working on this and am deep into it; I am planning to submit a first draft of the PR in the coming week. Thanks!

@aaronsteers

Hi, @Ishankoradia - Any update or questions on this before the end of the event? I'll check in tomorrow (Saturday) in case I can help in any way. Thanks!

@Ishankoradia Ishankoradia linked a pull request Jun 29, 2024 that will close this issue
@Ishankoradia

Hi @aaronsteers, thanks for checking in. This is the draft PR I am working on.
I have hardcoded a public repo for testing and have been working on fitting this into the unstructured file paradigm. It would be great if you could give some early feedback. (Since it's a draft PR, I haven't added all the information to the GitHub ticket yet; I will do so once it's close to completion.)

I have a few questions, and places where I am stuck/confused.

  • I am trying to use the file-based CDK's unstructured paradigm. I am having difficulty understanding the unstructured format's processing model, APIProcessingConfigModel. Could you explain how this works? Why is it asking for an API key? Ideally I would just use the authentication info entered above (personal access token or OAuth) and call GitHub's content API to fetch file contents.
  • I am trying to fit my GitHub files connector's needs into this unstructured paradigm. How do I do away with the other formats, since I won't be needing them? Does it require overriding the base model?
  • I am wondering what advantages I get from using the file-based CDK over creating my own connector from scratch.

@Ishankoradia

Hi @aaronsteers, I would still like to continue work on this, if it's alright with you all. It would be great if you could answer my questions above. Thanks.

@aaronsteers

@Ishankoradia - I'm afraid I do not have a clear answer to your question regarding the API key. To my understanding, it should not require one, although a key for Unstructured.io was intended as an optional input for high-resolution image processing. (I don't think that feature is live as of today.)

As for leveraging the CDK versus writing from scratch, the goal was to retain the parsing capabilities of the CDK, in order to be able to parse files like JPG, DOC, PDF, CSV, Excel, etc. Since most files in a git repo are already text-formatted, I don't see that as a hard requirement, although there are (probably?) some benefits to leveraging those paradigms.
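If the connector skips the CDK parsers, a cheap text-vs-binary check could stand in for them. This is a common heuristic, not anything the CDK provides:

```python
def looks_like_text(data: bytes) -> bool:
    """Heuristic: NUL bytes or invalid UTF-8 suggest a binary file (e.g. images)."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```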

While the deadline for hackathon contributions has now passed (the last day was July 3), I'd be happy to keep supporting this if you'd like to see it through to completion.

Either way, we thank you for your contribution and for your participation. 🙏

@Ishankoradia

Hi @aaronsteers, thanks for getting back. I don't mind the hackathon being over.

I would like to continue working on this to completion; it would be great to get your support while I do.

I am going to try to make a push and see if I can pull something from a public repo in the connector, keeping in mind that leveraging the CDK is not a strict requirement.

If you have any ideas/suggestions, I am happy to hear them.
