Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use permalinks in ecosystem diff references #5704

Merged
merged 6 commits into from
Jul 12, 2023
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
50 changes: 30 additions & 20 deletions scripts/check_ecosystem.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,11 +45,11 @@ async def clone(self: Self, checkout_dir: Path) -> AsyncIterator[Path]:
"""Shallow clone this repository to a temporary directory."""
if checkout_dir.exists():
logger.debug(f"Reusing {self.org}:{self.repo}")
yield Path(checkout_dir)
yield await self._get_commit(checkout_dir)
Copy link
Member Author

@zanieb zanieb Jul 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... because Repository is immutable (and I liked that it was immutable) we can't just get the commit SHA on clone and stash it on the Repository object. Instead, we yield it and it's attached to the Diff object for later display. It also turns out that the previously yielded Path was identical to the input argument so I removed that.

We probably ought to just have the ecosystem checks pinned to a specific SHA that's updated on a schedule by a bot and then we can skip the retrieval and just assign it to the Repository at creation time. That seems pretty low priority though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also happy to hear alternative approaches if anyone has ideas! This seemed liked the simplest way to retain immutability but this was my first time looking at this script.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could replace the value:

self._replace(ref=await self._get_commit(checkout_dir))

But, otherwise I think this is fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how we'd get the new Repository object back to the relevant owner without a bunch of changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, didn't think about that (_replace returns a new object). I think this is fine then. We can iterate it later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the commit SHA to the Repository object as that's where I think it should be logically? I think adding it to ref should be the same meaning as branches are just pointers to specific commits.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will also mean that the repo.url_for function/call need not be changed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that first but we cannot because it's immutable. We definitely could decide to make it mutable, but I liked the immutable design. More commentary at #5704 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed that comment. Give me a minute.

return

logger.debug(f"Cloning {self.org}:{self.repo}")
git_command = [
git_clone_command = [
"git",
"clone",
"--config",
Expand All @@ -60,39 +60,48 @@ async def clone(self: Self, checkout_dir: Path) -> AsyncIterator[Path]:
"--no-tags",
]
if self.ref:
git_command.extend(["--branch", self.ref])
git_clone_command.extend(["--branch", self.ref])

git_command.extend(
git_clone_command.extend(
[
f"https://github.com/{self.org}/{self.repo}",
checkout_dir,
],
)

process = await create_subprocess_exec(
*git_command,
git_clone_process = await create_subprocess_exec(
*git_clone_command,
Comment on lines -72 to +73
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed these before extracting _get_commit into a separate function. I don't think the extra clarity hurts though.

env={"GIT_TERMINAL_PROMPT": "0"},
)

status_code = await process.wait()
status_code = await git_clone_process.wait()

logger.debug(
f"Finished cloning {self.org}/{self.repo} with status {status_code}",
)
yield await self._get_commit(checkout_dir)

yield Path(checkout_dir)

def url_for(self: Self, path: str, lnum: int | None = None) -> str:
"""Return the GitHub URL for the given path and line number, if given."""
def url_for(self: Self, commit_sha: str, path: str, lnum: int | None = None) -> str:
"""Return the GitHub URL for the given commit, path, and line number, if given."""
# Default to main branch
url = (
f"https://github.com/{self.org}/{self.repo}"
f"/blob/{self.ref or 'main'}/{path}"
)
url = f"https://github.com/{self.org}/{self.repo}" f"/blob/{commit_sha}/{path}"
zanieb marked this conversation as resolved.
Show resolved Hide resolved
if lnum:
url += f"#L{lnum}"
return url

async def _get_commit(self: Self, checkout_dir: Path) -> str:
"""Return the commit sha for the repository in the checkout directory."""
git_sha_process = await create_subprocess_exec(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use subprocess.check_output with text=True here, it will also check the output status automatically

check_output(["git", "rev-parse", "--show-toplevel"], text=True).strip(),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I'd rather not make a blocking call in an async function though 👿

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair

*["git", "rev-parse", "HEAD"],
cwd=str(checkout_dir),
stdout=PIPE,
)
git_sha_stdout, _ = await git_sha_process.communicate()
assert (
await git_sha_process.wait() == 0
), f"Failed to retrieve commit sha at {checkout_dir}"
return git_sha_stdout.decode().strip()


REPOSITORIES: list[Repository] = [
Repository("apache", "airflow", "main", select="ALL"),
Expand Down Expand Up @@ -169,6 +178,7 @@ class Diff(NamedTuple):

removed: set[str]
added: set[str]
source_sha: str

def __bool__(self: Self) -> bool:
"""Return true if this diff is non-empty."""
Expand Down Expand Up @@ -203,13 +213,13 @@ async def compare(
assert ":" not in repo.org
assert ":" not in repo.repo
checkout_dir = Path(checkout_parent).joinpath(f"{repo.org}:{repo.repo}")
async with repo.clone(checkout_dir) as path:
async with repo.clone(checkout_dir) as checkout_sha:
try:
async with asyncio.TaskGroup() as tg:
check1 = tg.create_task(
check(
ruff=ruff1,
path=path,
path=checkout_dir,
name=f"{repo.org}/{repo.repo}",
select=repo.select,
ignore=repo.ignore,
Expand All @@ -220,7 +230,7 @@ async def compare(
check2 = tg.create_task(
check(
ruff=ruff2,
path=path,
path=checkout_dir,
name=f"{repo.org}/{repo.repo}",
select=repo.select,
ignore=repo.ignore,
Expand All @@ -237,7 +247,7 @@ async def compare(
elif line.startswith("+ "):
added.add(line[2:])

return Diff(removed, added)
return Diff(removed, added, checkout_sha)


def read_projects_jsonl(projects_jsonl: Path) -> dict[tuple[str, str], Repository]:
Expand Down Expand Up @@ -379,7 +389,7 @@ async def limited_parallelism(coroutine: T) -> T:
continue

pre, inner, path, lnum, post = match.groups()
url = repo.url_for(path, int(lnum))
url = repo.url_for(diff.source_sha, path, int(lnum))
print(f"{pre} <a href='{url}'>{inner}</a> {post}")
print("</pre>")

Expand Down