# Ingest GitHub Repos for RAG

**R.A.G.** -- Retrieval Augmented Generation

<img src="https://docs.llamaindex.ai/en/stable/_images/basic_rag.png" width="50%">

Resources:
- [llama_index](https://docs.llamaindex.ai/en/stable/index.html) - Data Embedding
- [Ollama](https://ollama.ai/) - Local LLM Wrapper

## Setup

### Configs

In [74]:
USERNAME = 'fairbanksio'
MODEL = 'llama2'
OLLAMA_URL = 'http://192.168.4.5:11434'

### Installs & Imports

In [23]:
%pip install -q -U python-dotenv psycopg2-binary hvac llama-index torch transformers python-pptx Pillow pypdf

Note: you may need to restart the kernel to use updated packages.


In [64]:
import json
import os
import shutil
import subprocess
import warnings

import hvac
import psycopg2
import requests

from dotenv import load_dotenv
from llama_index import SimpleDirectoryReader, VectorStoreIndex, ServiceContext, set_global_service_context

load_dotenv()

# Disable all warnings
warnings.simplefilter("ignore")

### Set a fake OpenAI API Key

DO NOT CHANGE -- `llama_index` checks that `OPENAI_API_KEY` is set, but the key is never used because the LLM and the embeddings both run locally.

In [35]:
os.environ["OPENAI_API_KEY"] = "sk-abc123"

### Connect to Vault

In [13]:
client = hvac.Client(
    url=os.getenv('VAULT_API'),
    token=os.getenv('VAULT_TOKEN'),
)

print(client.is_authenticated())

True


##### Get GitHub Secrets

In [14]:
try:
    secret_resp = client.secrets.kv.v2.read_secret_version(
        mount_point='kv', 
        path='github', 
        raise_on_deleted_version=False
    )
    
    if secret_resp['data'] is not None:
        secret_values = secret_resp['data']['data']
        for secret, value in secret_values.items():
            os.environ[str(secret)] = str(value)
    else:
        print("The secret does not exist.")
except hvac.exceptions.InvalidPath:
    print("The path is invalid or the permission is denied.")
except hvac.exceptions.Forbidden:
    print("The permission is denied.")
except hvac.exceptions.VaultError as e:
    print(f"Vault error occurred: {e}")

##### Get Postgres Secrets

In [15]:
try:
    secret_resp = client.secrets.kv.v2.read_secret_version(
        mount_point='kv', 
        path='postgres', 
        raise_on_deleted_version=False
    )
    
    if secret_resp['data'] is not None:
        secret_values = secret_resp['data']['data']
        for secret, value in secret_values.items():
            os.environ[str(secret)] = str(value)
    else:
        print("The secret does not exist.")
except hvac.exceptions.InvalidPath:
    print("The path is invalid or the permission is denied.")
except hvac.exceptions.Forbidden:
    print("The permission is denied.")
except hvac.exceptions.VaultError as e:
    print(f"Vault error occurred: {e}")
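
The two Vault cells above differ only in the `path` argument. A minimal sketch of a shared helper, assuming the same `kv` mount and the same KV v2 response shape used above. The name `load_secret_into_env` is hypothetical and not part of the notebook, and the only client call it relies on is the `read_secret_version` call already shown:

```python
import os


def load_secret_into_env(client, path, mount_point="kv"):
    """Copy every key/value pair of a KV v2 secret into os.environ.

    `client` is assumed to be an authenticated hvac.Client (or any object
    exposing the same read_secret_version call). Returns the exported keys.
    """
    secret_resp = client.secrets.kv.v2.read_secret_version(
        mount_point=mount_point,
        path=path,
        raise_on_deleted_version=False,
    )

    # KV v2 nests the payload under data -> data; treat a missing payload as empty
    payload = (secret_resp.get("data") or {}).get("data") or {}
    for secret, value in payload.items():
        os.environ[str(secret)] = str(value)
    return sorted(payload)
```

With a helper like this, both secrets could be loaded with `for path in ("github", "postgres"): load_secret_into_env(client, path)`, wrapped in the same `try`/`except` handling as the cells above.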

### Scrape GitHub Repos

In [16]:
def get_github_repos(username, token):
    url = f'https://api.github.com/users/{username}/repos'
    headers = {'Authorization': f'token {token}'}
    
    all_repos = []
    page = 1

    while True:
        params = {'page': page, 'per_page': 100}
        response = requests.get(url, headers=headers, params=params)

        if response.status_code == 200:
            repos = response.json()
            if not repos:
                break  # No more repositories
            all_repos.extend(repos)
            page += 1
        else:
            print(f"Error fetching repositories. Status code: {response.status_code}")
            return None

    repo_names = [repo['name'] for repo in all_repos if not repo['fork']]
    return repo_names

In [17]:
def save_to_json(data, filename):
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=2)

In [75]:
REPOS = get_github_repos(USERNAME, os.getenv("PERSONAL_TOKEN"))

if REPOS:
    print(f"Found {len(REPOS)} GitHub Repositories created by {USERNAME}:\n")
    for repo in REPOS:
        print(f"{repo}")

    # Save to JSON file with username as filename
    save_to_json(REPOS, f'{USERNAME}_repos.json')
    print(f"\nRepository list saved to '{USERNAME}_repos.json'")
else:
    print(f"Unable to fetch GitHub repositories for {USERNAME}")

Found 35 GitHub Repositories created by fairbanksio:

Blink
digital-ocean-billing-paypal
ExpressAPI
ExpressGQL
f5-client
f5-fetch-data
f5-get-posts
f5oclock
Factorio-Mod-Updater
flux-gitops-apps
GoHTTP
helm-charts
jenkins-docker
Jetson-Camera
k6
kube-cluster
next-skeleton
notebooks
PayPal-IPN-Listener
PayPal-Payment-Generator
PayPal-Sandbox-Dashboard
react-register
react-skeleton
Roadrunner
rtsp-nvr
site-status
slack-history-export
Spot
tf-iac-apps
tf-iac-cluster
tf-iac-demo
tf-iac-infra
tiles-api
tiles-client
uptime-monitor

Repository list saved to 'fairbanksio_repos.json'


### Clone Repos

In [76]:
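def clone_repos(repos, target_folder):
    for repo in repos:
        # Pass the arguments as a list so nothing is interpreted by a shell
        clone_command = ["git", "clone", "-q", f"https://github.com/{USERNAME}/{repo}.git", f"{target_folder}/{repo}"]
        try:
            # check=True raises CalledProcessError when git exits with a non-zero status
            subprocess.run(clone_command, check=True)
        except (subprocess.CalledProcessError, OSError) as err:
            print(f"Failed to clone {repo}: {err}")
            continue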

##### Set where repos should be cloned temporarily for conversion

In [77]:
REPO_DIRECTORY = os.path.join(os.getcwd(), "temp", USERNAME)
print(REPO_DIRECTORY)

/home/jovyan/temp/fairbanksio


##### Start the Cloning Process

In [78]:
clone_repos(REPOS, REPO_DIRECTORY)

### Connect to Postgres

In [27]:
try:
    connection = psycopg2.connect(
        host=os.getenv("POSTGRES_HOST"),
        port=os.getenv("POSTGRES_PORT"),
        dbname=os.getenv("POSTGRES_DBNAME"),
        user=os.getenv("POSTGRES_USER"),
        password=os.getenv("POSTGRES_PASSWORD")
    )

    # Create a cursor object to interact with the database
    cursor = connection.cursor()

    if cursor:
        print("Database Cursor: Ready")

except psycopg2.Error as err:
    print(f"Database Error: {err}")

Database Cursor: Ready


### Setup Ollama

In [73]:
from llama_index.llms import Ollama

llm = Ollama(
    model=MODEL,
    base_url=OLLAMA_URL,
    request_timeout=60.0
)

resp = llm.complete("Hello World")
print(resp)

Hello there! It's nice to meet you. How are you today? Is there something I can help you with or would you like to chat?


## Load Files and Create Search Index

### Configuration for Local Embeddings & Local LLM

In [55]:
service_context = ServiceContext.from_defaults(
    embed_model="local:BAAI/bge-large-en",
    chunk_size=1024,
    llm=llm
)

set_global_service_context(service_context)

### Load Files from Directory

In [40]:
files = SimpleDirectoryReader(input_dir=REPO_DIRECTORY, recursive=True)
documents = files.load_data(show_progress=True)
print(f"Loaded {len(documents):,} documents")

Loading files: 100%|██████████| 1537/1537 [00:14<00:00, 103.70file/s]

Loaded 1,780 documents





### Build a File Index

In [42]:
index = VectorStoreIndex.from_documents(
    documents, 
    service_context=service_context,
    show_progress=True
)
query_engine = index.as_query_engine()

Parsing nodes:   0%|          | 0/1780 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/398 [00:00<?, ?it/s]

### Remove Temp Files

In [62]:
try:
    shutil.rmtree(REPO_DIRECTORY)
    print(f"Successfully removed {REPO_DIRECTORY}")
except OSError as e:
    print(f"Error: {e}")

Successfully removed /home/jovyan/temp/jonfairbanks


## Query the Index

In [63]:
query_engine = index.as_query_engine(
    similarity_top_k=5,  # Retrieve the 5 most similar chunks as context
    service_context=service_context
)
response = query_engine.query("Are there any improvements that could be made to the Yo codebase?")
print(response)

Yes, there are several improvements that could be made to the Yo codebase. Here are some suggestions based on the provided context information:

1. **Modularize the code**: The current codebase is quite monolithic, with many functions and variables scattered throughout the file. Modularizing the code would make it easier to maintain and update in the future. One approach could be to create separate modules for different components of the Yo system, such as the server, client, and database layers.
2. **Use TypeScript**: TypeScript is a superset of JavaScript that adds static typing and other features. Using TypeScript could help catch errors earlier in the development process and make the codebase more maintainable.
3. **Implement proper error handling**: The current codebase does not have comprehensive error handling, which can lead to unexpected behavior when encountering errors. Implementing proper error handling mechanisms, such as error objects or middleware functions, would help h
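
`similarity_top_k=5` asks the retriever for the five chunks whose embeddings are closest to the query embedding. A toy sketch of that ranking step, assuming cosine similarity and hand-made 3-dimensional vectors. The real index uses `BAAI/bge-large-en` embeddings and llama_index internals, so `cosine_similarity`, `top_k_similar`, and the sample data below are hypothetical illustrations only:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, chunks, k):
    """Return the k (text, score) pairs with the highest similarity to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up chunk embeddings, for illustration only
chunks = [
    ("README: project overview", [0.9, 0.1, 0.0]),
    ("server.js: route handlers", [0.2, 0.8, 0.1]),
    ("LICENSE", [0.0, 0.1, 0.9]),
]
query = [1.0, 0.2, 0.0]

for text, score in top_k_similar(query, chunks, k=2):
    print(f"{score:.3f}  {text}")
```

A larger `k` gives the LLM more context to draw on, at the cost of a longer prompt.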

## Save Index Data

In [17]:
# Example: Execute a SQL query
# cursor.execute("SELECT version();")

# Fetch the result
# result = cursor.fetchone()
# print("PostgreSQL version:", result)

# if connection:
#     cursor.close()
#     connection.close()
#     print("Connection closed.")