<a href="https://colab.research.google.com/github/fmind/gitworks/blob/main/GitWorks_Automatically_Review_GitHub_Projects_with_Your_Guidelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SETUP

## GitHub

1.  **Get Token:**
    * Generate a **Fine-grained Personal Access Token** from [GitHub Developer Settings](https://github.com/settings/personal-access-tokens).
    * Grant access to the target repository (or all repositories).
    * Required **Repository Permissions**:
        * `Contents`: **Read-only**
        * `Issues`: **Read & write** (needed if `CREATE_ISSUE` is `True`)
        * `Metadata`: **Read-only**
    * **Copy the token immediately** after generation.

2.  **Store Token in Colab:**
    * Click the **Key icon** (🔑) in the left sidebar to open Secrets.
    * Add a secret named `GITWORKS_GITHUB_ACCESS_TOKEN`.
    * Paste your copied GitHub Token as the value.
    * Ensure "Notebook access" is **enabled**.

## Gemini

1.  **Get API Key:**
    * Obtain an API key from [Google AI Studio](https://makersuite.google.com/app/apikey).
    * You might need to create a new project first.
    * **Copy the API key immediately**.

2.  **Store Key in Colab:**
    * Click the **Key icon** (🔑) in the left sidebar to open Secrets.
    * Add a new secret named `GITWORKS_GEMINI_API_KEY`.
    * Paste your copied Gemini API Key as the value.
    * Ensure "Notebook access" is **enabled**.
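Both secrets are read later with `userdata.get`. As a minimal sketch (the helper name `require_secret` and its error message are illustrative, not part of the notebook), a small wrapper can fail early with a readable message when a secret is empty or was never stored:

```python
from typing import Callable, Optional


def require_secret(name: str, getter: Callable[[str], Optional[str]]) -> str:
    """Return the secret called `name`, or raise if it is missing or empty.

    `getter` is any callable that maps a secret name to its value;
    in Colab this would be `userdata.get` (assumption: it is called as-is).
    """
    value = getter(name)
    if not value:
        raise RuntimeError(
            f'Secret "{name}" is missing or empty; add it in the Colab Secrets panel.'
        )
    return value


# Local demonstration with a plain dictionary standing in for the Colab secret store.
fake_store = {"GITWORKS_GITHUB_ACCESS_TOKEN": "example-token"}
token = require_secret("GITWORKS_GITHUB_ACCESS_TOKEN", fake_store.get)
```

In the notebook itself, the equivalent call would be `require_secret("GITWORKS_GEMINI_API_KEY", userdata.get)`.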


# CONFIGS

In [None]:
# @title Generative AI

MODEL = "gemini-2.0-flash" # @param {"type":"string"}
TEMPERATURE = 0.0 # @param {"type":"slider","min":0,"max":2,"step":0.1}
MAX_OUTPUT_TOKENS = 10000 # @param {"type":"integer"}

In [None]:
# @title App

REPOSITORY = "fmind/mlops-python-package" # @param {"type":"string"}
CREATE_ISSUE = True # @param {"type":"boolean"}

# INSTALLS

## Python

In [None]:
%pip install PyGithub google-genai pydantic

# IMPORTS

## Internal

In [None]:
import io
import typing as t
from pathlib import Path

## External

In [None]:
import github as gh
import pydantic as pdt
from google import genai
from IPython import display
from google.colab import userdata
from google.genai import types as gt

# SECRETS

## GitHub

In [None]:
GITHUB_ACCESS_TOKEN = userdata.get("GITWORKS_GITHUB_ACCESS_TOKEN")

## Gen AI

In [None]:
GEMINI_API_KEY = userdata.get("GITWORKS_GEMINI_API_KEY")

# SERVICES

## GitHub

In [None]:
github_auth = gh.Auth.Token(GITHUB_ACCESS_TOKEN)
github = gh.Github(auth=github_auth)

## Gen AI

In [None]:
genai_client = genai.Client(api_key=GEMINI_API_KEY)

# CONTENTS

## Guidelines

In [None]:
guidelines = """
## MLOps Code Repository Checklist

This checklist helps assess the maturity of an MLOps project based on artifacts and configurations found within its GitHub repository.

---

### Level 1: Prototype

_Focus: Basic functionality, primarily for project actors._

- **Repository Initialization:** `.git` directory exists, indicating version control is used.
- **Basic Code Structure:** Source code files exist (e.g., `.py` files or notebooks).
- **Initial README:** A basic `README.md` file exists, perhaps with a project title and brief description.
- **Environment/Dependency Listing (Basic):** A `requirements.txt` or initial `pyproject.toml` might exist, listing key dependencies.

---

### Level 2: Alpha

_Focus: Improved structure, basic validation, ready for a selected few._

- **Package Structure:** Code organized into a package structure (e.g., within a `src/` directory with `__init__.py`).
- **`pyproject.toml`:** Project metadata, dependencies, and tool configurations defined in `pyproject.toml`.
- **`.gitignore`:** File exists and excludes common unnecessary files (e.g., `.venv`, `__pycache__`, cache directories, secrets).
- **Basic Linting/Formatting Config:** Configuration for linters/formatters (e.g., Ruff) present in `pyproject.toml`.
- **Basic Testing Setup:** `tests/` directory exists with test files (e.g., `test_*.py`). `pytest` configuration might be present in `pyproject.toml`.
- **Pre-commit Hooks Config:** `.pre-commit-config.yaml` exists, potentially configured with basic hooks (e.g., whitespace, syntax checks, basic formatting/linting).
- **Basic Containerization:** A `Dockerfile` exists for building a container image.
- **Basic Documentation:** Docstrings present in key functions/classes. An expanded `README.md` with setup and usage instructions.
- **License File:** `LICENSE.txt` (or similar) file exists.
- **Task Automation (Optional):** A `justfile` or `tasks/` directory with automation scripts might exist.

---

### Level 3: Beta

_Focus: Robust validation, CI/CD, basic releases, ready for a larger audience (low guarantee)._

- **Typing Configuration:** Type checking tool (e.g., `mypy`) configured in `pyproject.toml`. Evidence of type hints in function/method signatures in the code.
- **Comprehensive Testing:** Increased number of tests in `tests/`. Test coverage tool (e.g., `pytest-cov`) configured in `pyproject.toml`, potentially with a minimum coverage target.
- **CI/CD Workflows (Checks):** Workflow files exist in `.github/workflows/` that automate checks (linting, type checking, testing, security scanning) on pull requests.
- **Security Scanning Config:** Configuration for security scanners (e.g., `bandit`) present in `pyproject.toml` or CI workflow. `Dependabot` configuration (`dependabot.yml`) exists in `.github/`.
- **Centralized Configurations:** Configuration files (e.g., YAML, TOML) exist, separate from code (e.g., in a `confs/` directory). Code uses a library (like OmegaConf) to load these configurations.
- **Entrypoints Defined:** Script entrypoints defined in `pyproject.toml` (`[project.scripts]`).
- **Basic Experiment Tracking Config:** Configuration or usage of an experiment tracking tool (like MLflow) visible in code or config files (e.g., `mlflow.set_tracking_uri`).
- **Basic Model Registry Config:** Code includes steps to register models using a tool (like MLflow) (e.g., `mlflow.register_model`).
- **Changelog:** `CHANGELOG.md` file exists.
- **Contribution Guidelines:** `CONTRIBUTING.md` file exists.
- **Reproducibility Basics:** Fixed random seeds used in relevant code sections. MLflow Project file (`MLproject`) exists.

---

### Level 4: GA (General Availability)

_Focus: Rigorous processes, full automation, high guarantees for a large audience._

- **Enforced Test Coverage:** CI workflow enforces a high minimum test coverage percentage (e.g., >80%).
- **CI/CD Workflows (Build/Publish):** Workflow files exist in `.github/workflows/` that automate building artifacts (e.g., wheel files, Docker images) and publishing them on releases.
- **Deterministic Builds:** Build process (e.g., in `justfile` or CI workflow) uses mechanisms like `--require-hashes` or lock files (`uv.lock`) to ensure deterministic package builds.
- **Formal Release Management:** Git tags exist corresponding to release versions following a schema (e.g., SemVer). Release notes are present on the repository's releases page (visible in the GitHub UI rather than in the code, although the tags themselves are verifiable).
- **Comprehensive Documentation:** Generated API documentation exists (e.g., in `docs/` folder, potentially hosted on GitHub Pages). README includes badges for build status, coverage, etc.
- **Code of Conduct:** `CODE_OF_CONDUCT.md` file exists.
- **Monitoring/Evaluation Artifacts:** Code includes jobs or scripts for model evaluation (e.g., using `mlflow.evaluate` or tools like `Evidently`) and potentially generates evaluation reports or artifacts.
- **Lineage Tracking:** Use of lineage tracking features (e.g., `mlflow.log_input` with MLflow Datasets) visible in code.
- **Explainability Artifacts:** Code includes jobs or scripts to generate model explanations (e.g., using SHAP) and saves these as artifacts.
- **Infrastructure Metrics Logging:** Use of system metrics logging (e.g., `mlflow.start_run(log_system_metrics=True)`) visible in code.
- **Project Template Usage (Optional):** Evidence of project generation from a template (e.g., presence of `.cruft.json` if using Cruft).
"""

## Repository

In [None]:
repository = github.get_repo(REPOSITORY)
repository

## Contents

In [None]:
contents = []
# Breadth-first traversal of the repository tree, starting at the root.
pending = repository.get_contents("")
while pending:
    content = pending.pop(0)
    if content.type == "dir":
        # Each directory costs one extra API call to list its entries.
        pending.extend(repository.get_contents(content.path))
    else:
        contents.append(content)
contents
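The traversal above issues one API call per directory. As a hedged alternative sketch, PyGithub's `Repository.get_git_tree` accepts `recursive=True` and returns the whole tree in a single call. The helper name `list_blob_paths` is hypothetical, and passing a branch name as the tree reference is an assumption based on the GitHub REST API accepting a SHA or a ref name. The sketch only lists file paths; the text of each file still has to be fetched separately, and very large repositories may return an incomplete listing.

```python
from typing import Any, List


def list_blob_paths(repository: Any, ref: str) -> List[str]:
    """List the paths of all files (git blobs) reachable from `ref`.

    `repository` is expected to behave like a PyGithub `Repository`:
    it must expose `get_git_tree(sha, recursive=...)`, whose result
    has a `tree` list of elements with `path` and `type` attributes.
    """
    git_tree = repository.get_git_tree(ref, recursive=True)
    # Directories are reported with type "tree"; files with type "blob".
    return [element.path for element in git_tree.tree if element.type == "blob"]
```

In the notebook, a call such as `list_blob_paths(repository, repository.default_branch)` would return paths that can then be passed to `repository.get_contents(path)` one by one.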

## String

In [None]:
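buffer = io.StringIO()
for content in contents:
    path = content.path
    try:
        # Binary files (e.g., images) cannot be decoded as UTF-8 and are reported, then skipped.
        text = content.decoded_content.decode()
        part = f"--- file: {path} ---\n{text}\n"
        buffer.write(part)
    except Exception as error:
        print(f'[ERROR] Path: "{path}", Error: {error}')
# Single prompt string: every readable file, each preceded by a header with its path.
string = buffer.getvalue()
print('Characters:', len(string))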
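The character count printed above grows with every file in the repository. As a minimal sketch for keeping the prompt small, files can be filtered by suffix and size before decoding. The helper `should_include`, the suffix list, and the 200,000-byte threshold are illustrative assumptions, not values from the notebook.

```python
from pathlib import PurePosixPath
from typing import AbstractSet

# Assumed defaults: common binary formats and an arbitrary size cap.
SKIP_SUFFIXES = frozenset({".png", ".jpg", ".jpeg", ".gif", ".ico", ".pdf", ".zip"})
MAX_BYTES = 200_000


def should_include(
    path: str,
    size: int,
    max_bytes: int = MAX_BYTES,
    skip_suffixes: AbstractSet[str] = SKIP_SUFFIXES,
) -> bool:
    """Return True when a file is small enough and not a known binary format."""
    suffix = PurePosixPath(path).suffix.lower()
    return suffix not in skip_suffixes and size <= max_bytes
```

With PyGithub content objects, which expose `path` and `size`, the loop could start with `if not should_include(content.path, content.size): continue`.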

# ANALYSIS

## Instructions

In [None]:
instructions = f"""
You are a Senior Software Engineer.
Given the following guidelines, give a detailed review of the repository content.
Provide a general summary, then list the guidelines that need improvement and how to address each one.

{guidelines}
"""

## Data Class

In [None]:
class GitHubIssue(pdt.BaseModel):
    """GitHub Issue."""
    title: str
    body: str

## Review

In [None]:
review = genai_client.models.generate_content(
    model=MODEL,
    contents=string,
    config=gt.GenerateContentConfig(
        temperature=TEMPERATURE,
        max_output_tokens=MAX_OUTPUT_TOKENS,
        system_instruction=instructions,
        response_mime_type='application/json',
        response_schema=GitHubIssue,
    ),
)
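print('Input tokens:', review.usage_metadata.prompt_token_count)
print('Output tokens:', review.usage_metadata.candidates_token_count)
# `parsed` is None when the response does not match the GitHubIssue schema (e.g., truncated output).
if review.parsed is None:
    raise ValueError(f"Response could not be parsed as GitHubIssue: {review.text!r}")
display.display(display.Markdown(f"# {review.parsed.title}"))
display.display(display.Markdown(review.parsed.body))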

In [None]:
if CREATE_ISSUE:
    issue = repository.create_issue(title=review.parsed.title, body=review.parsed.body)
    print('Issue created:', issue.html_url)
else:
    print('Issue not created')
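With `MAX_OUTPUT_TOKENS` set high, a generated review can be long. The sketch below trims the body before `create_issue` is called. The 65,536-character limit is an assumption about GitHub's maximum issue body length, and the helper `fit_issue_body` and its notice text are hypothetical.

```python
# Assumed GitHub limit for an issue body, in characters.
GITHUB_BODY_LIMIT = 65_536


def fit_issue_body(
    body: str,
    limit: int = GITHUB_BODY_LIMIT,
    notice: str = "\n\n_[Review truncated]_",
) -> str:
    """Return `body` unchanged if it fits, else cut it and append a notice."""
    if len(body) <= limit:
        return body
    # Reserve room for the notice so the result is exactly `limit` characters.
    return body[: limit - len(notice)] + notice
```

In the cell above, the call would become `repository.create_issue(title=review.parsed.title, body=fit_issue_body(review.parsed.body))`.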