
✨ New Connector Idea: "GitHub Files" Source #22

Open
aaronsteers opened this issue May 20, 2024 · 14 comments · May be fixed by airbytehq/airbyte#40621
aaronsteers commented May 20, 2024

Desired Connector Spec

This new connector would allow users to get file data from a github (or generic git) repo, and apply our CDK parser logic (especially unstructured parser) along with glob patterns, to get data and content from a github repo into an Airbyte data pipeline.


Ishankoradia commented Jun 3, 2024

Hi @aaronsteers, this is what @siddhant3030 and I were thinking; please let me know what you think.
Let's say a git repo has 10 files.

  1. The connector takes as input a git repo URL and its credentials.
  2. The connector reads all files in the repo and extracts their content (via some repo content API).
  3. We generate records with schema [{content: "text content from the API in the step above", file_name: "source.py"}], adding more rows based on the glob patterns (this part is a bit unclear; an explanation would be great).
  4. So basically we will have one stream per repo.
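The mapping in step 3 could be sketched roughly like this. This is purely illustrative (the function and field names are not from any existing connector); the only assumed fact is that GitHub's contents API returns file bodies base64-encoded under a `content` key.

```python
import base64

def entry_to_record(entry: dict) -> dict:
    """Map one GitHub contents-API file entry to the record shape from step 3.

    The contents API (GET /repos/{owner}/{repo}/contents/{path}) returns
    file bodies base64-encoded under the "content" key.
    """
    text = base64.b64decode(entry.get("content", "")).decode("utf-8", errors="replace")
    return {"content": text, "file_name": entry["path"]}
```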

It would be great if you could shed light on the questions below:

  1. What is unclear is why we need a new connector. Can't we add another stream to the existing GitHub connector that reads content from the files?
  2. Is the motivation behind this connector to give users the ability to plug their git content into an LLM for retrieval?
  3. Does having one stream per file make sense here?
  4. Could you give an example of the glob patterns you are referring to?


TLazarevic commented Jun 3, 2024

@aaronsteers
This one is interesting as well, but we would also appreciate an answer to the question above:

What is unclear is why we need a new connector. Can't we add another stream to the existing GitHub connector that reads content from the files?

@aaronsteers

@Ishankoradia - response inline:

  1. The connector takes as input a git repo URL and its credentials.

👍

  2. The connector reads all files in the repo and extracts their content (via some repo content API).

👍

  3. We generate records with schema [{content: "text content from the API in the step above", file_name: "source.py"}], adding more rows based on the glob patterns (this part is a bit unclear; an explanation would be great).

Yes, as you suggest, it would include content and the (relative) file name, and hopefully also created_at and last_modified_at.

Since this is related to LLM use cases, it might make sense to try to match the schema of "unstructured" sources like source-google-drive.

If easy to include, some other fields like last_commit_sha could be helpful also.

  4. So basically we will have one stream per repo.

👍

It would be great if you could shed light on the questions below:

  1. What is unclear is why we need a new connector. Can't we add another stream to the existing GitHub connector that reads content from the files?

TL;DR: This part is your call. I don't feel strongly about it either way.

The thought process is that it was probably easier to not have to worry about impacts related to modifying the existing connector. And if created as a separate connector, we could always incorporate back into the GitHub source later on.

If adding to the existing GitHub connector, we can't change the auth or other options.

Another consideration here is that we may want to go all-in on the unstructured documents paradigm or files-type extractor paradigm. There is some existing prior art here that you could reuse - notably source-s3 and source-google-drive, for example. In those cases, the config is file-centric, allowing stream definition per glob and different globs/streams using different parsers.
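To make the file-centric paradigm concrete, here is a hypothetical config shape loosely modeled on source-s3 and source-google-drive. Every field name in it is an assumption for illustration, not the actual spec of either connector:

```python
# Hypothetical config, loosely modeled on source-s3 / source-google-drive.
# All field names here are illustrative assumptions, not the real spec.
config = {
    "repository": "airbytehq/airbyte",
    "credentials": {"personal_access_token": "<redacted>"},
    "streams": [
        {"name": "docs", "globs": ["docs/**/*.md"], "format": {"filetype": "unstructured"}},
        {"name": "code", "globs": ["**/*.py"], "format": {"filetype": "unstructured"}},
    ],
}

def stream_names(config: dict) -> list:
    """Each glob group defines its own stream, each free to use a different parser."""
    return [stream["name"] for stream in config["streams"]]
```

The point of this shape is that each glob group becomes an independently configurable stream, which is how the existing file-based sources let different globs use different parsers.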

  2. Is the motivation behind this connector to give users the ability to plug their git content into an LLM for retrieval?

Yes! Goal here is to make git content available to LLMs.

  3. Does having one stream per file make sense here?

I think one stream per repo (or else one stream per glob?) is probably the right call.

  4. Could you give an example of the glob patterns you are referring to?

Sorry for the confusion on this point. Here are some example glob patterns, keeping in mind the LLM use cases.

  1. * (or **/*?) - get all files from the repo. Helpful if you just want all of the files' content.
  2. *.md - get all Markdown documentation pages from the repo.
  3. *.py (or another language-specific filter) - gather code files for code analysis, code summaries, and/or AI-assisted code creation.
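As a quick illustration of how those patterns select files, here is a sketch using the standard library's `fnmatch` (the file list is made up; real connectors may use stricter glob semantics):

```python
from fnmatch import fnmatch

FILES = ["README.md", "docs/intro.md", "src/source.py", "assets/logo.png"]

def matching(glob: str) -> list:
    # Note: fnmatch's "*" also crosses "/" boundaries, so "*.md" behaves
    # like a shell's "**/*.md"; a real connector would pin this down.
    return [path for path in FILES if fnmatch(path, glob)]
```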


Ishankoradia commented Jun 4, 2024

@aaronsteers thanks for getting back super quick. Appreciate all the answers.

Summarizing for our understanding

  • Agreed, a separate connector makes testing easier and we don't have to worry about open/closed-principle design to avoid breaking existing things.
  • The motivation is clear to us now.
  • One stream per glob pattern makes more sense, as you mentioned.
  • The schema could have things like last_commit_sha. Eventually a record could look like
    [{content: "text content from the api in above step", relative_file_name: "src/source.py", last_modified_at: "2023-01-01", created_on: "2024-01-01", last_commit_sha: "asdasda..."}]
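That summarized record shape could be pinned down with a JSON-Schema-style declaration. The field names come from the summary above; which fields end up required is a guess:

```python
# Field names from the discussion above; the "required" set is an assumption.
record_schema = {
    "type": "object",
    "properties": {
        "content": {"type": "string"},
        "relative_file_name": {"type": "string"},
        "created_on": {"type": "string", "format": "date"},
        "last_modified_at": {"type": "string", "format": "date"},
        "last_commit_sha": {"type": "string"},
    },
    "required": ["content", "relative_file_name"],
}
```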

Have two more questions

  1. Will this be a Python CDK connector?
  2. In your initial comment, what do you mean by CDK parser logic?

Could you assign this to me? It's very interesting.


aaronsteers commented Jun 4, 2024

Have two more questions

  1. Will this be a Python CDK connector?

Yes, if that works for you. 👍 We have a CDK "extra" for file-based connectors, which may be helpful.
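If that route works, the file-based pieces ship as a CDK extra. The exact extra name may vary by CDK version, so treat this as a sketch:

```shell
# Assumed extra name; check the airbyte-cdk release notes for the exact spelling.
pip install 'airbyte-cdk[file-based]'
```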

  2. In your initial comment, what do you mean by CDK parser logic?

There's a common CDK backend for sources such as Google Drive and S3, which I'll screenshot below for comparison.

[Screenshot: source-s3 configuration example]

[Screenshot: source-google-drive configuration example]

Could you assign this to me? It's very interesting.

Absolutely. You've got first dibs as the first person to reply. If the above sounds good to you, simply confirm and @marcosmarxm will assign it to you.


Ishankoradia commented Jun 4, 2024

Perfect, sounds good to me @aaronsteers. Thanks for getting back quickly.
I will look into the resources you shared.
This should be exciting. You can assign it to me.

@aaronsteers

@Ishankoradia - We are very excited also - thanks for taking this on. I've assigned it to you!

@avirajsingh7

@Ishankoradia if you want to drop this issue, can I work on it?

@Ishankoradia

Hey @avirajsingh7, I am still working on this and am deep into it; I am planning to submit a first draft of the PR in the coming week. Thanks!

@aaronsteers

Hi, @Ishankoradia - Any update or questions on this before the end of the event? I'll check in tomorrow (Saturday) in case I can help in any way. Thanks!

@Ishankoradia Ishankoradia linked a pull request Jun 29, 2024 that will close this issue
@Ishankoradia

Hi @aaronsteers, thanks for checking in. This is the draft PR I am working on.
I have hardcoded a public repo for testing and have been working on fitting this into the unstructured file paradigm. It would be great if you could give some early feedback. (Since it's a draft PR, I haven't added all the information to the GitHub ticket yet; I will do so once it's close to completion.)

I have a few questions, and places where I am stuck/confused.

  • I am trying to use the file-based CDK's unstructured paradigm. I am having difficulty understanding the unstructured format's processing model, APIProcessingConfigModel. Could you explain how this works? Why is it asking for an API key? Ideally I would just use the authentication info entered above (personal access token or OAuth) and call GitHub's content API to fetch file contents.
  • I am trying to fit my GitHub files connector's needs into this unstructured paradigm. How do I do away with the other formats, since I won't be needing them? Does it require overriding the base model?
  • I am wondering what advantages I get from using the file-based CDK over creating my own connector from scratch.

@Ishankoradia

Hi @aaronsteers, I would still like to continue work on this, if it's alright with you all. It would be great if you could answer my questions above. Thanks.

@aaronsteers

@Ishankoradia - I'm afraid I do not have a clear answer to your question regarding the API key. To my understanding, it should not require one, although a key for Unstructured.io was intended as an optional input for high-resolution image processing. (I don't think that feature is live as of today.)

As for leveraging the CDK versus writing from scratch, the goal was to retain the parsing capabilities of the CDK, in order to be able to parse files like JPG, DOC, PDF, CSV, Excel, etc. Since most files in a git repo are already text-formatted, I don't see that as a hard requirement, although there are (probably?) some benefits to leveraging those paradigms.
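If the connector skips the CDK parsers, a cheap text-vs-binary check could stand in for them. This is a common heuristic, not anything the CDK provides:

```python
def looks_like_text(data: bytes) -> bool:
    """Heuristic: NUL bytes or invalid UTF-8 suggest a binary file (e.g. images)."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```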

While the deadline for hackathon contributions has now passed (the last day was July 3), I'd be happy to keep supporting this if you'd like to see it through to completion.

Either way, we thank you for your contribution and for your participation. 🙏

@Ishankoradia

Hi @aaronsteers, thanks for getting back. I don't mind the hackathon being over.

I would like to continue working on this to completion; it would be great to get your support while I do.

I am going to try to make a push and see if I can pull something from a public repo in the connector, keeping in mind that leveraging the CDK is not a strict requirement.

If you have any ideas/suggestions, I am happy to hear them.
