# NAICS GitHub Repository Classifier - Inference Demo

This notebook demonstrates how to use the trained model to classify GitHub repositories into North American Industry Classification System (NAICS) sectors.

**Model:** RoBERTa-base fine-tuned on GitHub repository data  
**Performance:** ~79% F1 Score on test set  
**Classes:** 19 NAICS industry sectors

## 1. Setup

First, let's import the necessary libraries and load the trained model.

In [None]:
import sys
from pathlib import Path

# Add project root to path (assumes the notebook's working directory is one level below the project root)
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# Import project modules
from src.inference import load_trained_model, predict_naics, format_repository_input
from src.naics_mapping import get_naics_description, NAICS_CODE_TO_DESCRIPTION

print("Imports successful!")

In [None]:
# Load the trained model
MODEL_PATH = project_root / "models" / "roberta-base-naics-classifier"

print(f"Loading model from: {MODEL_PATH}")
model, tokenizer, label_mappings = load_trained_model(MODEL_PATH)
print("Model loaded successfully!")
print(f"Number of classes: {len(label_mappings['label2id'])}")

## 2. NAICS Codes Reference

Here are all the NAICS industry sectors the model can predict. Codes that share a description are listed only once:

In [None]:
import pandas as pd

# Display NAICS codes
naics_df = pd.DataFrame([
    {"Code": code, "Description": desc}
    for code, desc in NAICS_CODE_TO_DESCRIPTION.items()
]).drop_duplicates(subset=["Description"])

print("NAICS Industry Sectors:")
naics_df

## 3. Make Predictions

### Option A: Quick prediction with formatted text

In [None]:
# Example: Banking API
text = "Repository: bank-api | Description: Banking API for financial transactions | README: This API provides secure banking operations including account management, transfers, and payment processing."

result = predict_naics(
    text=text,
    model=model,
    tokenizer=tokenizer,
    label_mappings=label_mappings
)

print(f"Input: {text[:100]}...")
print(f"\nPredicted NAICS Code: {result['predicted_naics']}")
print(f"Description: {get_naics_description(result['predicted_naics'])}")
print(f"Confidence: {result['confidence']:.2%}")
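
The `confidence` value printed above is most likely the softmax probability of the top-scoring class, which is the usual convention for classifiers of this kind. `predict_naics` is defined in `src/inference.py` and its internals are not shown in this notebook, so the following is only a sketch of that common approach. The logits and the three-entry `example_id2label` mapping are made up for illustration, and `softmax` and `top_prediction` are hypothetical helper names.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(logits, id2label):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"predicted_naics": id2label[best], "confidence": probs[best]}

# Hypothetical values: three classes instead of the model's 19.
example_id2label = {0: "11", 1: "52", 2: "62"}
example = top_prediction([0.5, 3.0, 1.0], example_id2label)
print(f"{example['predicted_naics']} ({example['confidence']:.2%})")
```

Under this reading, a confidence close to 1/19 would mean the model is barely preferring one sector over the others.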

### Option B: Prediction from repository components

In [None]:
# Build input from separate components
text = format_repository_input(
    repo_name="hospital-ehr",
    description="Electronic health records system for hospitals",
    topics="healthcare; medical; ehr; hipaa",
    readme="Complete hospital management with patient records, appointment scheduling, medical billing, and HIPAA compliance."
)

result = predict_naics(
    text=text,
    model=model,
    tokenizer=tokenizer,
    label_mappings=label_mappings
)

print(f"Formatted input: {text}")
print(f"\nPredicted NAICS Code: {result['predicted_naics']}")
print(f"Description: {get_naics_description(result['predicted_naics'])}")
print(f"Confidence: {result['confidence']:.2%}")
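
The formatted input printed above appears to follow the same `Label: value | Label: value` pattern as the hand-written text in Option A. The real `format_repository_input` lives in `src/inference.py` and is not shown here. A hypothetical stand-in, assuming blank fields are skipped and topics are labeled `Topics:` (an assumption, since Option A has no topics field), might look like this:

```python
def sketch_format_input(repo_name=None, description=None, topics=None, readme=None):
    """Hypothetical stand-in for format_repository_input (not the project code)."""
    fields = [
        ("Repository", repo_name),
        ("Description", description),
        ("Topics", topics),  # label assumed for illustration
        ("README", readme),
    ]
    # Keep only fields that were provided and are not blank.
    parts = [
        f"{label}: {value.strip()}"
        for label, value in fields
        if value and value.strip()
    ]
    return " | ".join(parts)

sketch_text = sketch_format_input(
    repo_name="bank-api",
    description="Banking API for financial transactions",
)
print(sketch_text)
```

Comparing the output of a sketch like this against the string printed by the real function is a quick way to check which fields the model actually sees.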

## 4. Helper Function for Easy Predictions

Use this function to quickly classify any repository:

In [None]:
def classify_repository(repo_name=None, description=None, topics=None, readme=None):
    """
    Classify a GitHub repository into a NAICS industry sector.
    
    Args:
        repo_name: Repository name (e.g., "bank-api")
        description: Repository description
        topics: Topics/tags (comma or semicolon separated)
        readme: README content
    
    Returns:
        Dictionary with prediction results
    """
    text = format_repository_input(
        repo_name=repo_name,
        description=description,
        topics=topics,
        readme=readme
    )
    
    result = predict_naics(
        text=text,
        model=model,
        tokenizer=tokenizer,
        label_mappings=label_mappings
    )
    
    return {
        "naics_code": result["predicted_naics"],
        "industry": get_naics_description(result["predicted_naics"]),
        "confidence": f"{result['confidence']:.2%}"
    }

print("classify_repository() function ready!")

## 5. Try It Yourself!

Modify the values below to classify your own repositories:

In [None]:
# Classify your repository
result = classify_repository(
    repo_name="my-awesome-project",
    description="Your project description here",
    topics="topic1; topic2; topic3",
    readme="Your README content here. Include details about what your project does."
)

print("Classification Result:")
print(f"  NAICS Code: {result['naics_code']}")
print(f"  Industry: {result['industry']}")
print(f"  Confidence: {result['confidence']}")

## 6. Batch Predictions

Classify several repositories in a loop and collect the results in a table:

In [None]:
# List of repositories to classify
repositories = [
    {
        "repo_name": "bank-api",
        "description": "Banking API for financial transactions",
        "readme": "Secure banking operations including account management and transfers."
    },
    {
        "repo_name": "farm-tracker",
        "description": "Agricultural management software",
        "readme": "Track crops, livestock, and farm operations."
    },
    {
        "repo_name": "school-portal",
        "description": "Student management system",
        "readme": "Manage student enrollment, grades, and courses."
    },
    {
        "repo_name": "logistics-app",
        "description": "Shipping and delivery tracking",
        "readme": "Real-time package tracking and route optimization for delivery services."
    },
    {
        "repo_name": "movie-streaming",
        "description": "Video streaming platform",
        "readme": "Stream movies and TV shows with recommendations."
    }
]

# Classify all
results = []
for repo in repositories:
    result = classify_repository(**repo)
    result["repo_name"] = repo["repo_name"]
    results.append(result)

# Display as table
results_df = pd.DataFrame(results)[["repo_name", "naics_code", "industry", "confidence"]]
results_df
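
Because `classify_repository()` returns `confidence` as a formatted string, the column sorts alphabetically rather than numerically (for example, `"9.50%"` would sort above `"85.00%"`). The sketch below converts the strings back to fractions and flags uncertain rows for manual review. The helper names, the 0.60 threshold, and the sample rows are arbitrary illustrations, not values from the project.

```python
def confidence_to_float(confidence):
    """Turn a string like '87.50%' into the fraction 0.875."""
    return float(confidence.rstrip("%")) / 100

def flag_low_confidence(rows, threshold=0.60):
    """Return copies of rows sorted by confidence (lowest first) with a review flag."""
    scored = [
        {**row, "confidence_value": confidence_to_float(row["confidence"])}
        for row in rows
    ]
    for row in scored:
        row["needs_review"] = row["confidence_value"] < threshold
    return sorted(scored, key=lambda row: row["confidence_value"])

# Made-up rows in the same shape that classify_repository() returns.
sample_rows = [
    {"repo_name": "bank-api", "naics_code": "52", "confidence": "91.20%"},
    {"repo_name": "misc-utils", "naics_code": "54", "confidence": "9.75%"},
]
for row in flag_low_confidence(sample_rows):
    print(row["repo_name"], row["confidence_value"], row["needs_review"])
```

In this notebook, `flag_low_confidence(results)` would apply the same check to the batch results collected above.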

## 7. Load and Classify from CSV/Parquet

If your repository data is stored in a CSV or Parquet file, load it into a DataFrame and classify each row:

In [None]:
# Example: load from a Parquet file
# Uncomment the block below and set the path to your data

# import pandas as pd
# 
# # Load your data
# df = pd.read_parquet("path/to/your/data.parquet")
# # or: df = pd.read_csv("path/to/your/data.csv")
# 
# # Classify each repository
# predictions = []
# for _, row in df.iterrows():
#     result = classify_repository(
#         repo_name=row.get("name_repo") or row.get("repo"),
#         description=row.get("description"),
#         topics=row.get("topics"),
#         readme=row.get("readme_content")
#     )
#     predictions.append(result)
# 
# # Add predictions to dataframe
# df["predicted_naics"] = [p["naics_code"] for p in predictions]
# df["predicted_industry"] = [p["industry"] for p in predictions]
# df["confidence"] = [p["confidence"] for p in predictions]
# 
# # Save results
# df.to_csv("predictions.csv", index=False)

print("Uncomment the code above to classify repositories from a file.")
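
One caveat with the commented loop above: pandas represents missing cells as `NaN`, which is truthy, so `row.get("name_repo") or row.get("repo")` falls through to `"repo"` only when the `name_repo` column is absent, not when the cell is empty. Below is a minimal sketch of a cleaning step, written against plain dictionaries so it can be tried without a data file. The helper names `clean_value` and `row_to_kwargs` are hypothetical; the column names are the ones used in the commented code.

```python
import math

def clean_value(value):
    """Return a stripped string, or None for missing values (None, NaN, blank)."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    text = str(value).strip()
    return text or None

def row_to_kwargs(row):
    """Map one record (a dict) to classify_repository() keyword arguments."""
    return {
        "repo_name": clean_value(row.get("name_repo")) or clean_value(row.get("repo")),
        "description": clean_value(row.get("description")),
        "topics": clean_value(row.get("topics")),
        "readme": clean_value(row.get("readme_content")),
    }

# Made-up record with a NaN name and a blank description.
record = {"name_repo": float("nan"), "repo": "farm-tracker", "description": "  "}
cleaned = row_to_kwargs(record)
print(cleaned)
```

With a DataFrame, `classify_repository(**row_to_kwargs(row.to_dict()))` could replace the call inside the loop.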

---

## Model Information

- **Base Model:** RoBERTa-base
- **Training Data:** 2,538 GitHub repositories with NAICS labels
- **Test F1 Score:** 79.28%
- **Test Accuracy:** 81.30%
- **Number of Classes:** 19 NAICS sectors
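
The F1 score and accuracy listed above differ because they weight errors differently. The notebook does not say which F1 averaging was used; the sketch below assumes macro averaging (the unweighted mean of per-class F1) and uses a made-up four-item example to show how the two metrics are computed.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels for illustration only (not project data).
toy_true = ["52", "52", "62", "11"]
toy_pred = ["52", "62", "62", "11"]
print(f"accuracy={accuracy(toy_true, toy_pred):.4f}  macro_f1={macro_f1(toy_true, toy_pred):.4f}")
```

Macro F1 gives every sector equal weight regardless of how many repositories it contains, so weak performance on small sectors affects it more than it affects accuracy.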

For questions or issues, contact the repository maintainer.