HW_04_202601 #176

@jeanpool1415

Homework Assignment: GitHub Repository Intelligence with LLMs and BERT

Course: Machine Learning / NLP / Applied AI
Total Score: 20 points

  • Technical implementation: 12 points

  • Presentation video: 8 points

Deadline: Friday — 11:59 PM


Description

The goal of this assignment is to build a complete weak-supervision NLP pipeline using:

  • GitHub API

  • Large Language Models (LLMs)

  • BERT-based models

  • Repository metadata

  • Open-source ecosystem signals

You will create a system capable of analyzing GitHub repositories and classifying them according to one of the following project tracks:


Available Project Tracks

Track A — Hiring-Oriented Repository Intelligence

Build a system that evaluates whether a repository reflects work expected from:

  • Intern-level engineering

  • Junior-level engineering

  • Senior-level engineering

  • Lead/Architect-level engineering

  • Template/Boilerplate/Replica repository

  • Low-value repository not worth detailed review

The objective is NOT to judge the developer directly.

The objective is:

estimate the engineering maturity and complexity reflected by the repository itself.

This can help:

  • recruiters,

  • engineering managers,

  • startups,

  • accelerators,

  • and technical screening systems.

The challenge is determining:

  • which repository signals matter,

  • how to summarize them,

  • and how to define engineering maturity categories.


Track B — Technology Innovation & Ecosystem Tracking

Build a system capable of identifying:

  • emerging technologies,

  • mature ecosystems,

  • declining technologies,

  • and experimental or niche technical areas

using GitHub repository activity and metadata.

The objective is NOT to predict whether code is “good.”

The objective is:

analyze repository and ecosystem signals to understand technological momentum and innovation trends.

Examples:

  • AI agents

  • vector databases

  • cybersecurity tooling

  • blockchain infrastructure

  • robotics frameworks

  • MLOps platforms

This can help:

  • investors,

  • researchers,

  • governments,

  • consulting firms,

  • and technology analysts.

The challenge is determining:

  • which GitHub signals represent innovation,

  • how to define “growth” or “decline,”

  • and how to convert repository behavior into measurable technological trends.
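One possible way to make "growth" or "decline" measurable is to compare event counts in two consecutive time windows. The sketch below is only an illustration: the function name `momentum`, the 90-day window, and the 20% threshold are assumptions, not part of the assignment, and any countable event (stars, commits, issues) could be passed in.

```python
from datetime import datetime, timedelta


def momentum(event_dates, now, window_days=90, threshold=0.2):
    """Classify an activity trend by comparing two consecutive time windows.

    event_dates: datetimes of any countable event (stars, commits, issues).
    threshold: relative change treated as meaningful (0.2 = 20%, an assumption).
    """
    window = timedelta(days=window_days)
    recent = sum(1 for d in event_dates if now - window <= d <= now)
    previous = sum(1 for d in event_dates if now - 2 * window <= d < now - window)
    if previous == 0:
        # No baseline to compare against: treat any recent activity as growth.
        return "growing" if recent > 0 else "inactive"
    change = (recent - previous) / previous
    if change > threshold:
        return "growing"
    if change < -threshold:
        return "declining"
    return "stable"
```

Changing `window_days` or `threshold` changes which repositories count as growing, which is the kind of sensitivity the assignment asks you to discuss.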


Main Objective

Build a complete pipeline that:

  1. Collects repository information from the GitHub API

  2. Creates repository summaries/features

  3. Uses an LLM to generate weak labels

  4. Fine-tunes a BERT-based classifier

  5. Evaluates model performance

  6. Explains the business and analytical value of the system

You are NOT given:

  • fixed categories,

  • fixed prompts,

  • fixed features,

  • or fixed methodologies.

Those decisions are part of the assignment.


Expected Repository Structure

Create a repository named exactly:

For Track A:

github_hiring_repository_intelligence

For Track B:

github_technology_innovation_tracking

with the following structure:

repository_name/
│
├── app.py                              # Streamlit app
├── README.md                           # Project explanation
├── requirements.txt                    # Dependencies
│
├── src/
│   ├── github_collector.py             # GitHub API extraction
│   ├── preprocessing.py                # Cleaning and transformations
│   ├── summarization.py                # Repository summary generation
│   ├── llm_labeling.py                 # Weak labeling with LLMs
│   ├── train.py                        # BERT fine-tuning
│   ├── evaluation.py                   # Metrics and validation
│   ├── visualization.py                # Graphs and analysis
│   └── utils.py                        # Helper functions
│
├── data/
│   ├── raw/
│   ├── processed/
│   ├── labeled/
│   └── splits/
│
├── models/
│   └── trained_models/
│
├── output/
│   ├── figures/
│   ├── tables/
│   └── metrics/
│
└── video/
    └── link.txt

Required Pipeline

Your project must contain the following stages.


Stage 1 — GitHub Data Collection

You must use the GitHub API to collect repository information.

You are free to choose repositories and sampling strategies.

You may use:

  • REST API

  • GraphQL API

You must explain:

  • how repositories were selected,

  • why they were selected,

  • and how selection may affect the results.


Minimum Required Features

You must extract at least 6 repository-level signals.

Examples include:

  • number of contributors

  • commit frequency

  • stars/forks

  • issue activity

  • pull request activity

  • release frequency

  • README characteristics

  • workflow/CI presence

  • dependency updates

  • repository topics/tags

  • repository age

  • last activity date

You are encouraged to experiment with additional signals.
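As a rough sketch of what signal extraction could look like, the code below reads the JSON returned by the REST endpoint `GET /repos/{owner}/{repo}` and derives six repository-level signals from it. The helper names `fetch_repo`, `parse_time`, and `extract_signals` are hypothetical. Signals such as contributors, commit frequency, or releases live on other endpoints and are not covered here.

```python
import json
import urllib.request
from datetime import datetime, timezone


def fetch_repo(owner, repo, token=None):
    """Fetch one repository's metadata from the GitHub REST API."""
    request = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}",
        headers={"Accept": "application/vnd.github+json"},
    )
    if token:
        # A personal access token raises the rate limit; it is optional here.
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def parse_time(value):
    """Parse GitHub's ISO 8601 timestamps such as 2024-01-31T12:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def extract_signals(payload, now=None):
    """Turn a repository payload into a flat dict of six signals."""
    now = now or datetime.now(timezone.utc)
    return {
        "full_name": payload["full_name"],
        "stars": payload["stargazers_count"],
        "forks": payload["forks_count"],
        # GitHub counts open pull requests inside open_issues_count.
        "open_issues": payload["open_issues_count"],
        "topics": payload.get("topics", []),
        "age_days": (now - parse_time(payload["created_at"])).days,
        "days_since_push": (now - parse_time(payload["pushed_at"])).days,
    }
```

A typical call would be `extract_signals(fetch_repo("owner", "repo", token))`, with the raw payload saved under `data/raw/` before any transformation.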


Stage 2 — Repository Representation

You must convert repository information into a format usable by:

  • LLMs

  • and BERT models

This may include:

  • textual summaries,

  • structured prompts,

  • concatenated metadata,

  • or hybrid representations.

Example:

Repository has 15 contributors, active CI/CD workflows,
weekly commits, regular releases, and extensive documentation.

You must justify:

  • why your representation is useful,

  • and why it may help classification.
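A textual summary like the example above could be generated from a dictionary of signals with a small template function. This is a minimal sketch: the function name `summarize` and the dictionary keys (`contributors`, `has_ci`, `releases`, and so on) are assumptions chosen for illustration, and a hybrid representation would add or replace fields.

```python
def summarize(signals):
    """Render a dict of repository signals as one short English paragraph."""
    topics = ", ".join(signals["topics"]) or "none"
    ci = "active CI/CD workflows" if signals["has_ci"] else "no CI/CD workflows"
    return (
        f"Repository {signals['full_name']} has {signals['contributors']} contributors, "
        f"{signals['stars']} stars and {signals['forks']} forks, {ci}, "
        f"and {signals['releases']} releases. "
        f"Last push was {signals['days_since_push']} days ago. Topics: {topics}."
    )
```

Keeping the template fixed means the LLM and the BERT model see the same wording for every repository, so differences in the text come from the signals rather than from phrasing.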


Stage 3 — Weak Labeling with LLMs

You must use an LLM to generate labels for the training dataset.

Examples:

  • OpenAI

  • Claude

  • DeepSeek

  • Gemini

  • Mistral

  • Qwen

The LLM acts as:

the initial annotator of repository categories.

You must:

  • explain your prompt design,

  • justify your category definitions,

  • and discuss limitations of LLM-generated labels.
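The labeling step can be kept provider-independent by separating prompt construction and answer parsing from the actual API call. The sketch below shows only those two pieces. The label set is a hypothetical example loosely based on Track A (the assignment leaves categories to you), and `build_prompt` and `parse_label` are illustrative names. The call to whichever LLM you choose would sit between them.

```python
import re

# Hypothetical label set; define and justify your own categories.
LABELS = ["intern", "junior", "senior", "lead", "template", "low_value"]


def build_prompt(summary, labels=LABELS):
    """Build a single-label classification prompt for one repository summary."""
    options = ", ".join(labels)
    return (
        "You are annotating GitHub repositories.\n"
        f"Choose exactly one label from: {options}.\n"
        "Answer with the label only.\n\n"
        f"Repository summary: {summary}"
    )


def parse_label(response, labels=LABELS):
    """Map a free-text LLM reply to a known label, or None if unclear."""
    tokens = re.findall(r"[a-z_]+", response.lower())
    matches = {token for token in tokens if token in labels}
    # Zero or several distinct labels in the reply means it cannot be trusted.
    return matches.pop() if len(matches) == 1 else None
```

Replies that parse to `None` are worth counting and reporting, since they are one concrete measure of how noisy the weak labels are.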


Stage 4 — Train / Validation / Test Split

You must create:

  • Train dataset

  • Validation dataset

  • Test dataset

Suggested split:

  • 70% train

  • 15% validation

  • 15% test

The test dataset must remain unseen during training.
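A stratified split keeps the category proportions similar across the three sets, which matters when some weak labels are rare. The standard-library sketch below illustrates the idea; in a real project, scikit-learn's `train_test_split` with its `stratify` argument would typically do this job. The function name and the `label` key are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", fractions=(0.70, 0.15, 0.15), seed=42):
    """Split rows into train/validation/test while preserving label proportions."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # whatever remains goes to test
    return train, val, test
```

Writing the three lists to `data/splits/` once, and never re-splitting afterwards, is a simple way to keep the test set unseen.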


Stage 5 — Fine-Tuning a BERT-Based Model

You must fine-tune one lightweight transformer model.

Recommended options:

  • DistilBERT

  • ModernBERT

  • MiniLM

  • DeBERTa-v3-small

The objective is NOT massive-scale training.

The objective is:

learn how weak supervision pipelines work in realistic AI systems.

Input:

  • repository representations

Output:

  • repository category prediction
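The input/output contract above could be wired up with HuggingFace Transformers roughly as follows. This is a sketch under several assumptions: rows are dicts with `text` and `label` keys, the checkpoint `distilbert-base-uncased` is used, the optional `datasets` package is installed, and the hyperparameters are placeholders rather than recommendations. The function names are hypothetical.

```python
def build_label_maps(labels):
    """Create the label<->id mappings used by a classification head."""
    ordered = sorted(set(labels))
    label2id = {label: index for index, label in enumerate(ordered)}
    id2label = {index: label for label, index in label2id.items()}
    return label2id, id2label


def fine_tune(train_rows, val_rows, model_name="distilbert-base-uncased",
              output_dir="models/trained_models"):
    """Fine-tune a small transformer on (text, label) rows and return the Trainer."""
    # Heavy dependencies are imported here so the helper above works without them.
    from datasets import Dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    # Every validation label must also appear in the training rows.
    label2id, id2label = build_label_maps(row["label"] for row in train_rows)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def to_dataset(rows):
        dataset = Dataset.from_dict({
            "text": [row["text"] for row in rows],
            "label": [label2id[row["label"]] for row in rows],
        })
        # Fixed-length padding keeps the default data collator sufficient.
        return dataset.map(
            lambda batch: tokenizer(batch["text"], truncation=True,
                                    padding="max_length", max_length=256),
            batched=True,
        )

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(label2id), id2label=id2label, label2id=label2id
    )
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,               # placeholder value
        per_device_train_batch_size=16,   # placeholder value
        learning_rate=2e-5,               # placeholder value
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=to_dataset(train_rows),
                      eval_dataset=to_dataset(val_rows))
    trainer.train()
    return trainer
```

Storing `id2label` in the model config means saved checkpoints return category names rather than bare integers at prediction time.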


Stage 6 — Evaluation and Error Analysis

You must evaluate:

  • Accuracy

  • Precision

  • Recall

  • F1-score

You must also:

  • analyze common errors,

  • compare categories,

  • discuss weak points,

  • and explain possible improvements.
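To make the four metrics and the error analysis concrete, the sketch below computes accuracy, per-class precision, recall, and F1, and the confusion counts from two label lists using only the standard library. In practice scikit-learn's `classification_report` and `confusion_matrix` would usually replace it; the function name `per_class_metrics` is an assumption.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return accuracy, per-class precision/recall/F1, and confusion counts."""
    labels = sorted(set(y_true) | set(y_pred))
    # (true label, predicted label) -> count, i.e. a sparse confusion matrix.
    pairs = Counter(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(pairs[(other, label)] for other in labels if other != label)
        fn = sum(pairs[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(pairs[(label, label)] for label in labels) / len(y_true)
    return accuracy, report, pairs
```

The off-diagonal entries of `pairs` with the highest counts are a natural starting point for the error analysis. Because the reference labels are themselves LLM-generated, these scores measure agreement with the weak labels, not ground truth.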


Track A — Required Analytical Questions

Question 1 — Engineering Maturity

Which repository signals appear most associated with:

  • intern-level repositories,

  • junior-level repositories,

  • senior-level repositories,

  • or lead-level repositories?

You must justify your reasoning.


Question 2 — Low-Value or Replica Repositories

How can repositories that are:

  • duplicated,

  • template-based,

  • unfinished,

  • or low-value

be differentiated from repositories showing meaningful engineering complexity?

You must define your logic.


Question 3 — Hiring Signal Interpretation

Why might your classification system be useful for:

  • recruiters,

  • startups,

  • technical interview pipelines,

  • or engineering managers?

You must explain:

  • business value,

  • limitations,

  • and ethical considerations.


Question 4 — Methodological Sensitivity

How do results change when:

  • repository features change,

  • prompts change,

  • or category definitions change?

You must compare:

  • one baseline approach

  • and one alternative approach.
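One simple way to compare a baseline against an alternative is to label the same repositories under both setups and measure how often the labels agree and which categories shift. The sketch below assumes two aligned lists of labels; `compare_labelings` is a hypothetical name, and the same comparison works for changed features, prompts, or category definitions.

```python
from collections import Counter


def compare_labelings(baseline, alternative):
    """Summarize how much two labelings of the same repositories differ."""
    if len(baseline) != len(alternative):
        raise ValueError("both labelings must cover the same repositories")
    # Count each (baseline label, alternative label) pair that changed.
    changed = Counter((b, a) for b, a in zip(baseline, alternative) if b != a)
    agreement = 1 - sum(changed.values()) / len(baseline)
    return agreement, changed.most_common()
```

A low agreement rate concentrated in one pair of categories suggests that the boundary between those two definitions is the fragile part of the methodology.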


Track B — Required Analytical Questions

Question 1 — Technology Momentum

Which repository signals appear associated with:

  • emerging technologies,

  • mature ecosystems,

  • declining technologies,

  • or experimental/niche areas?

You must justify your reasoning.


Question 2 — Innovation Signals

What types of GitHub activity appear to indicate:

  • technological growth,

  • ecosystem expansion,

  • or declining interest?

You must explain:

  • why you selected those signals,

  • and their limitations.


Question 3 — Business and Economic Value

Why could this system be useful for:

  • investors,

  • consulting firms,

  • governments,

  • or technology researchers?

You must explain:

  • business value,

  • practical applications,

  • and limitations.


Question 4 — Methodological Sensitivity

How do results change when:

  • repository features,

  • growth definitions,

  • or prompts

change?

You must compare:

  • one baseline approach

  • and one alternative approach.


Technical Requirements

Your project must include:

  • HuggingFace Transformers

  • pandas

  • scikit-learn

  • matplotlib

  • seaborn

  • Streamlit

Optional:

  • PyTorch

  • datasets

  • accelerate

  • plotly


Streamlit Application

Score: 4 points

Your Streamlit app must contain exactly 4 tabs.


Tab 1 — Problem & Methodology

Include:

  • project objective

  • repository selection methodology

  • GitHub signals used

  • prompt strategy

  • dataset construction

  • limitations


Tab 2 — Exploratory Analysis

Include:

  • repository statistics

  • category distributions

  • signal comparisons

  • selected visualizations

You must explain:

  • why those visualizations were selected,

  • and what analytical insight they provide.


Tab 3 — Model Results

Include:

  • evaluation metrics

  • confusion matrix

  • category performance

  • baseline vs alternative comparison


Tab 4 — Interactive Repository Exploration

Include:

  • repository search/filtering

  • category predictions

  • metadata exploration

  • model prediction examples
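The four tabs described above map directly onto Streamlit's `st.tabs`, which takes a list of titles and returns one container per tab. The skeleton below is a sketch only: the page title and placeholder text are assumptions, and `render` is a hypothetical function name.

```python
TAB_TITLES = [
    "Problem & Methodology",
    "Exploratory Analysis",
    "Model Results",
    "Interactive Repository Exploration",
]


def render():
    """Draw the four-tab layout; meant to be called at the end of app.py."""
    import streamlit as st  # imported lazily so TAB_TITLES can be reused elsewhere

    st.title("GitHub Repository Intelligence")
    tabs = st.tabs(TAB_TITLES)
    for tab, title in zip(tabs, TAB_TITLES):
        with tab:
            st.header(title)
            st.write("Content for this tab goes here.")
```

In `app.py`, a final line calling `render()` followed by `streamlit run app.py` would display the layout; each placeholder would then be replaced with the content listed for that tab.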


README.md Must Include

  • What does the project do?

  • Which track was selected?

  • What repositories were analyzed?

  • Which GitHub signals were used?

  • How were repository summaries created?

  • How were prompts designed?

  • How was the dataset split?

  • Which BERT model was used?

  • What were the final metrics?

  • What are the main limitations?

  • What are the possible business applications?

  • How to run the project?

  • How to run the Streamlit app?


Explanatory Video

Score: 8 points

Create a video of at most 5 minutes.

The video is NOT a coding walkthrough.

The video must be presented as:

a pitch of the idea and system.

The goal is to communicate:

  • analytical thinking,

  • business understanding,

  • AI pipeline design,

  • and practical usefulness.


The Video Must Explain

1. Problem Definition

What real-world problem are you solving?


2. Repository Signals

What GitHub information did you collect?

Examples:

  • contributors

  • commits

  • issue activity

  • releases

  • repository topics

  • workflow files

Why do you believe these are meaningful signals?


3. LLM Weak Labeling

What did you feed to the LLM?
Why do you think the LLM can help classify repositories?


4. Classification Logic

Which categories did you define?
Why are those categories useful?


5. Business Value

Who could use this system?
Why would it matter in reality?

Possible examples:

  • recruiting

  • investment analysis

  • technology trend analysis

  • startup evaluation

  • ecosystem monitoring


6. Model Performance

Show:

  • basic metrics,

  • confusion matrix,

  • examples of correct/incorrect predictions.


Important

The presentation should focus on:

  • ideas,

  • methodology,

  • reasoning,

  • and business usefulness.

Do NOT spend the presentation showing code line-by-line.


GitHub Workflow (MANDATORY)

❌ Do not work directly on main
✅ Create development branches
✅ Use descriptive commits
✅ Merge through Pull Requests

Example branches:

feature/github-scraping
feature/llm-labeling
feature/bert-training
feature/streamlit-dashboard

Grading Rubric

Technical Implementation — 12 points

Criteria | Points
-- | --
GitHub data collection pipeline | 2 pts
Repository representation and preprocessing | 2 pts
LLM weak labeling methodology | 2 pts
BERT fine-tuning pipeline | 2 pts
Evaluation and error analysis | 2 pts
Streamlit app completeness | 2 pts

Checklist Before Submitting

  • Repository has the correct name

  • GitHub API was used

  • At least 6 repository signals were extracted

  • LLM weak labeling was implemented

  • Train/validation/test split exists

  • BERT model was fine-tuned

  • Evaluation metrics are included

  • Streamlit app contains exactly 4 tabs

  • README explains methodology and findings

  • Video link exists in video/link.txt

  • Work was done using branches and Pull Requests

  • Repository is reproducible


Final Note

This assignment is intentionally designed to evaluate:

  • analytical reasoning,

  • AI system design,

  • weak supervision understanding,

  • and business thinking.

The most important part is NOT achieving the highest accuracy.

The most important part is being able to justify:

  • why you selected certain GitHub signals,

  • why your prompts make sense,

  • why your categories are meaningful,

  • and why your system could be useful in reality.
