HW_04_202601

<html>
<body>
<html><head></head><body><h1>Homework Assignment: GitHub Repository Intelligence with LLMs and BERT</h1>Course: Machine Learning / NLP / Applied AI<br>Total Score: 20 points<ul><li>Technical implementation: 12 points</li><li>Presentation video: 8 points</li></ul>Deadline: Friday — 11:59 PM<hr><h1>Description</h1>The goal of this assignment is to build a complete weak-supervision NLP pipeline using:<ul><li>GitHub API</li><li>Large Language Models (LLMs)</li><li>BERT-based models</li><li>Repository metadata</li><li>Open-source ecosystem signals</li></ul>You will create a system capable of analyzing GitHub repositories and classifying them according to one of the following project tracks:<hr><h1>Available Project Tracks</h1><h2>Track A — Hiring-Oriented Repository Intelligence</h2>Build a system that evaluates whether a repository reflects work expected from:<ul><li>Intern-level engineering</li><li>Junior-level engineering</li><li>Senior-level engineering</li><li>Lead/Architect-level engineering</li><li>Template/Boilerplate/Replica repository</li><li>Low-value repository not worth detailed review</li></ul>The objective is NOT to judge the developer directly. The objective is:<blockquote>estimate the engineering maturity and complexity reflected by the repository itself.</blockquote>This can help:<ul><li>recruiters,</li><li>engineering managers,</li><li>startups,</li><li>accelerators,</li><li>and technical screening systems.</li></ul>The challenge is determining:<ul><li>which repository signals matter,</li><li>how to summarize them,</li><li>and how to define engineering maturity categories.</li></ul><hr><h2>Track B — Technology Innovation &amp; Ecosystem Tracking</h2>Build a system capable of identifying:<ul><li>emerging technologies,</li><li>mature ecosystems,</li><li>declining technologies,</li><li>and experimental or niche technical areas</li></ul>using GitHub repository activity and metadata. The objective is NOT to predict whether code is “good.” The objective 
is:<blockquote>analyze repository and ecosystem signals to understand technological momentum and innovation trends.</blockquote>Example technology areas:<ul><li>AI agents</li><li>vector databases</li><li>cybersecurity tooling</li><li>blockchain infrastructure</li><li>robotics frameworks</li><li>MLOps platforms</li></ul>This can help:<ul><li>investors,</li><li>researchers,</li><li>governments,</li><li>consulting firms,</li><li>and technology analysts.</li></ul>The challenge is determining:<ul><li>which GitHub signals represent innovation,</li><li>how to define “growth” or “decline,”</li><li>and how to convert repository behavior into measurable technological trends.</li></ul><hr><h1>Main Objective</h1>Build a complete pipeline that:<ol><li>Collects repository information from the GitHub API</li><li>Creates repository summaries/features</li><li>Uses an LLM to generate weak labels</li><li>Fine-tunes a BERT-based classifier</li><li>Evaluates model performance</li><li>Explains the business and analytical value of the system</li></ol>You are NOT given:<ul><li>fixed categories,</li><li>fixed prompts,</li><li>fixed features,</li><li>or fixed methodologies.</li></ul>Those decisions are part of the assignment.<hr><h1>Expected Repository Structure</h1>Create a repository named exactly as shown for your track. For Track A:<pre><code class="language-bash">github_hiring_repository_intelligence
</code></pre>For Track B:<pre><code class="language-bash">github_technology_innovation_tracking
</code></pre>with the following structure:<pre><code class="language-bash">repository_name/
│
├── app.py # Streamlit app
├── README.md # Project explanation
├── requirements.txt # Dependencies
│
├── src/
│ ├── github_collector.py # GitHub API extraction
│ ├── preprocessing.py # Cleaning and transformations
│ ├── summarization.py # Repository summary generation
│ ├── llm_labeling.py # Weak labeling with LLMs
│ ├── train.py # BERT fine-tuning
│ ├── evaluation.py # Metrics and validation
│ ├── visualization.py # Graphs and analysis
│ └── utils.py # Helper functions
│
├── data/
│ ├── raw/
│ ├── processed/
│ ├── labeled/
│ └── splits/
│
├── models/
│ └── trained_models/
│
├── output/
│ ├── figures/
│ ├── tables/
│ └── metrics/
│
└── video/
 └── link.txt
</code></pre><hr><h1>Required Pipeline</h1>Your project must contain the following stages.<hr><h1>Stage 1 — GitHub Data Collection</h1>You must use the GitHub API to collect repository information. You are free to choose repositories and sampling strategies. You may use:<ul><li>REST API</li><li>GraphQL API</li></ul>You must explain:<ul><li>how repositories were selected,</li><li>why they were selected,</li><li>and how selection may affect the results.</li></ul><hr><h1>Minimum Required Features</h1>You must extract at least 6 repository-level signals. Examples include:<ul><li>number of contributors</li><li>commit frequency</li><li>stars/forks</li><li>issue activity</li><li>pull request activity</li><li>release frequency</li><li>README characteristics</li><li>workflow/CI presence</li><li>dependency updates</li><li>repository topics/tags</li><li>repository age</li><li>last activity date</li></ul>You are encouraged to experiment with additional signals.<hr><h1>Stage 2 — Repository Representation</h1>You must convert repository information into a format usable by:<ul><li>LLMs</li><li>and BERT models</li></ul>This may include:<ul><li>textual summaries,</li><li>structured prompts,</li><li>concatenated metadata,</li><li>or hybrid representations.</li></ul>Example:<pre><code class="language-text">Repository has 15 contributors, active CI/CD workflows,
weekly commits, regular releases, and extensive documentation.
</code></pre>You must justify:<ul><li>why your representation is useful,</li><li>and why it may help classification.</li></ul><hr><h1>Stage 3 — Weak Labeling with LLMs</h1>You must use an LLM to generate labels for the training dataset. Examples:<ul><li>OpenAI</li><li>Claude</li><li>DeepSeek</li><li>Gemini</li><li>Mistral</li><li>Qwen</li></ul>The LLM acts as:<blockquote>the initial annotator of repository categories.</blockquote>You must:<ul><li>explain your prompt design,</li><li>justify your category definitions,</li><li>and discuss limitations of LLM-generated labels.</li></ul><hr><h1>Stage 4 — Train / Validation / Test Split</h1>You must create:<ul><li>Train dataset</li><li>Validation dataset</li><li>Test dataset</li></ul>Suggested split:<ul><li>70% train</li><li>15% validation</li><li>15% test</li></ul>The test dataset must remain unseen during training.<hr><h1>Stage 5 — Fine-Tuning a BERT-Based Model</h1>You must fine-tune one lightweight transformer model. Recommended options:<ul><li>DistilBERT</li><li>ModernBERT</li><li>MiniLM</li><li>DeBERTa-v3-small</li></ul>The objective is NOT massive-scale training. The objective is:<blockquote>learn how weak supervision pipelines work in realistic AI systems.</blockquote>Input:<ul><li>repository representations</li></ul>Output:<ul><li>repository category prediction</li></ul><hr><h1>Stage 6 — Evaluation and Error Analysis</h1>You must evaluate:<ul><li>Accuracy</li><li>Precision</li><li>Recall</li><li>F1-score</li></ul>You must also:<ul><li>analyze common errors,</li><li>compare categories,</li><li>discuss weak points,</li><li>and explain possible improvements.</li></ul><hr><h1>Track A — Required Analytical Questions</h1><h2>Question 1 — Engineering Maturity</h2>Which repository signals appear most associated with:<ul><li>intern-level repositories,</li><li>junior-level repositories,</li><li>senior-level repositories,</li><li>or lead-level repositories?</li></ul>You must justify your reasoning.<hr><h2>Question 2 — 
Low-Value or Replica Repositories</h2>How can repositories that are:<ul><li>duplicated,</li><li>template-based,</li><li>unfinished,</li><li>or low-value</li></ul>be differentiated from repositories showing meaningful engineering complexity? You must define your logic.<hr><h2>Question 3 — Hiring Signal Interpretation</h2>Why might your classification system be useful for:<ul><li>recruiters,</li><li>startups,</li><li>technical interview pipelines,</li><li>or engineering managers?</li></ul>You must explain:<ul><li>business value,</li><li>limitations,</li><li>and ethical considerations.</li></ul><hr><h2>Question 4 — Methodological Sensitivity</h2>How do results change when:<ul><li>repository features change,</li><li>prompts change,</li><li>or category definitions change?</li></ul>You must compare:<ul><li>one baseline approach</li><li>and one alternative approach.</li></ul><hr><h1>Track B — Required Analytical Questions</h1><h2>Question 1 — Technology Momentum</h2>Which repository signals appear associated with:<ul><li>emerging technologies,</li><li>mature ecosystems,</li><li>declining technologies,</li><li>or experimental/niche areas?</li></ul>You must justify your reasoning.<hr><h2>Question 2 — Innovation Signals</h2>What types of GitHub activity appear to indicate:<ul><li>technological growth,</li><li>ecosystem expansion,</li><li>or declining interest?</li></ul>You must explain:<ul><li>why you selected those signals,</li><li>and their limitations.</li></ul><hr><h2>Question 3 — Business and Economic Value</h2>Why could this system be useful for:<ul><li>investors,</li><li>consulting firms,</li><li>governments,</li><li>or technology researchers?</li></ul>You must explain:<ul><li>business value,</li><li>practical applications,</li><li>and limitations.</li></ul><hr><h2>Question 4 — Methodological Sensitivity</h2>How do results change when:<ul><li>repository features,</li><li>growth definitions,</li><li>or prompts</li></ul>change? You must compare:<ul><li>one baseline 
approach</li><li>and one alternative approach.</li></ul><hr><h1>Technical Requirements</h1>Your project must include:<ul><li>HuggingFace Transformers</li><li>pandas</li><li>scikit-learn</li><li>matplotlib</li><li>seaborn</li><li>Streamlit</li></ul>Optional:<ul><li>PyTorch</li><li>datasets</li><li>accelerate</li><li>plotly</li></ul><hr><h1>Streamlit Application</h1><h2>Score: 4 points</h2>Your Streamlit app must contain exactly 4 tabs.<hr><h2>Tab 1 — Problem &amp; Methodology</h2>Include:<ul><li>project objective</li><li>repository selection methodology</li><li>GitHub signals used</li><li>prompt strategy</li><li>dataset construction</li><li>limitations</li></ul><hr><h2>Tab 2 — Exploratory Analysis</h2>Include:<ul><li>repository statistics</li><li>category distributions</li><li>signal comparisons</li><li>selected visualizations</li></ul>You must explain:<ul><li>why those visualizations were selected,</li><li>and what analytical insight they provide.</li></ul><hr><h2>Tab 3 — Model Results</h2>Include:<ul><li>evaluation metrics</li><li>confusion matrix</li><li>category performance</li><li>baseline vs alternative comparison</li></ul><hr><h2>Tab 4 — Interactive Repository Exploration</h2>Include:<ul><li>repository search/filtering</li><li>category predictions</li><li>metadata exploration</li><li>model prediction examples</li></ul><hr><h1>README.md Must Include</h1><ul><li>What does the project do?</li><li>Which track was selected?</li><li>What repositories were analyzed?</li><li>Which GitHub signals were used?</li><li>How were repository summaries created?</li><li>How were prompts designed?</li><li>How was the dataset split?</li><li>Which BERT model was used?</li><li>What were the final metrics?</li><li>What are the main limitations?</li><li>What are the possible business applications?</li><li>How to run the project?</li><li>How to run the Streamlit app?</li></ul><hr><h1>Explanatory Video</h1><h2>Score: 8 points</h2>Create a video of:<ul><li>maximum 5 
minutes</li></ul>The video is NOT a coding walkthrough. The video must be presented as:<blockquote>a pitch of the idea and system.</blockquote>The goal is to communicate:<ul><li>analytical thinking,</li><li>business understanding,</li><li>AI pipeline design,</li><li>and practical usefulness.</li></ul><hr><h1>The Video Must Explain</h1><h2>1. Problem Definition</h2>What real-world problem are you solving?<hr><h2>2. Repository Signals</h2>What GitHub information did you collect? Examples:<ul><li>contributors</li><li>commits</li><li>issue activity</li><li>releases</li><li>repository topics</li><li>workflow files</li></ul>Why do you believe these are meaningful signals?<hr><h2>3. LLM Weak Labeling</h2>What did you feed to the LLM? Why do you think the LLM can help classify repositories?<hr><h2>4. Classification Logic</h2>Which categories did you define? Why are those categories useful?<hr><h2>5. Business Value</h2>Who could use this system? Why would it matter in reality? Possible examples:<ul><li>recruiting</li><li>investment analysis</li><li>technology trend analysis</li><li>startup evaluation</li><li>ecosystem monitoring</li></ul><hr><h2>6. Model Performance</h2>Show:<ul><li>basic metrics,</li><li>confusion matrix,</li><li>examples of correct/incorrect predictions.</li></ul><hr><h2>Important</h2>The presentation should focus on:<ul><li>ideas,</li><li>methodology,</li><li>reasoning,</li><li>and business usefulness.</li></ul>Do NOT spend the presentation showing code line-by-line.<hr><h1>GitHub Workflow (MANDATORY)</h1><ul><li>❌ Do not work directly on <code inline="">main</code></li><li>✅ Create development branches</li><li>✅ Use descriptive commits</li><li>✅ Merge through Pull Requests</li></ul>Example branches:<pre><code class="language-bash">feature/github-scraping
feature/llm-labeling
feature/bert-training
feature/streamlit-dashboard
</code></pre><hr><h1>Grading Rubric</h1><h2>Technical Implementation — 12 points</h2>
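<table>
<thead><tr><th>Criteria</th><th>Points</th></tr></thead>
<tbody>
<tr><td>GitHub data collection pipeline</td><td>2 pts</td></tr>
<tr><td>Repository representation and preprocessing</td><td>2 pts</td></tr>
<tr><td>LLM weak labeling methodology</td><td>2 pts</td></tr>
<tr><td>BERT fine-tuning pipeline</td><td>2 pts</td></tr>
<tr><td>Evaluation and error analysis</td><td>2 pts</td></tr>
<tr><td>Streamlit app completeness</td><td>2 pts</td></tr>
</tbody>
</table>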
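
As a hedged illustration of the "Repository representation and preprocessing" criterion, the sketch below turns a dictionary of repository signals into a one-sentence summary in the spirit of the Stage 2 example. The function name `build_summary`, the field names (`contributors`, `has_ci`, `commits_per_week`, `releases_per_year`, `readme_words`) and every threshold are assumptions chosen for illustration; the assignment does not prescribe them.

```python
from typing import Any, Mapping


def build_summary(signals: Mapping[str, Any]) -> str:
    """Render hypothetical repository signals as one plain-text sentence."""
    parts = [f"{signals['contributors']} contributors"]
    parts.append("active CI/CD workflows" if signals["has_ci"] else "no CI/CD workflows")
    # Assumed threshold: at least one commit per week counts as "weekly".
    parts.append("weekly commits" if signals["commits_per_week"] >= 1 else "infrequent commits")
    # Assumed threshold: four or more releases per year counts as "regular".
    parts.append("regular releases" if signals["releases_per_year"] >= 4 else "few releases")
    # Assumed threshold: a README of 500+ words counts as "extensive".
    parts.append("extensive documentation" if signals["readme_words"] >= 500 else "minimal documentation")
    return "Repository has " + ", ".join(parts[:-1]) + ", and " + parts[-1] + "."


# Made-up values for demonstration only, not real repository data.
example = {
    "contributors": 15,
    "has_ci": True,
    "commits_per_week": 3.0,
    "releases_per_year": 6,
    "readme_words": 1200,
}
summary = build_summary(example)
print(summary)
```

The same string could serve as both the LLM prompt input and the BERT input, though a structured or hybrid representation may work better depending on the chosen track.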

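The 70% / 15% / 15% split suggested in Stage 4 can be sketched with the standard library alone. This is a minimal example under stated assumptions: `split_dataset`, its default fractions, the seed, and the placeholder `repo_ids` are hypothetical names for illustration, and the sketch does not stratify by label.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    rows: Sequence[T],
    train_frac: float = 0.70,
    val_frac: float = 0.15,
    seed: int = 42,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle once with a fixed seed, then cut into train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, held out from training
    return train, val, test


repo_ids = [f"repo_{i}" for i in range(100)]  # placeholder identifiers
train, val, test = split_dataset(repo_ids)
print(len(train), len(val), len(test))
```

With imbalanced weak labels, a stratified split (for example via scikit-learn's `train_test_split` with its `stratify` argument) may be a better choice so that rare categories appear in all three sets.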
<hr><h1>Checklist Before Submitting</h1><ul class="contains-task-list"><li class="task-list-item"><input type="checkbox" disabled=""> Repository has the correct name</li><li class="task-list-item"><input type="checkbox" disabled=""> GitHub API was used</li><li class="task-list-item"><input type="checkbox" disabled=""> At least 6 repository signals were extracted</li><li class="task-list-item"><input type="checkbox" disabled=""> LLM weak labeling was implemented</li><li class="task-list-item"><input type="checkbox" disabled=""> Train/validation/test split exists</li><li class="task-list-item"><input type="checkbox" disabled=""> BERT model was fine-tuned</li><li class="task-list-item"><input type="checkbox" disabled=""> Evaluation metrics are included</li><li class="task-list-item"><input type="checkbox" disabled=""> Streamlit app contains exactly 4 tabs</li><li class="task-list-item"><input type="checkbox" disabled=""> README explains methodology and findings</li><li class="task-list-item"><input type="checkbox" disabled=""> Video link exists in <code inline="">video/link.txt</code></li><li class="task-list-item"><input type="checkbox" disabled=""> Work was done using branches and Pull Requests</li><li class="task-list-item"><input type="checkbox" disabled=""> Repository is reproducible</li></ul><hr><h1>Final Note</h1>This assignment is intentionally designed to evaluate:<ul><li>analytical reasoning,</li><li>AI system design,</li><li>weak supervision understanding,</li><li>and business thinking.</li></ul>The most important part is NOT achieving the highest accuracy. The most important part is being able to justify:<ul><li>why you selected certain GitHub signals,</li><li>why your prompts make sense,</li><li>why your categories are meaningful,</li><li>and why your system could be useful in reality.</li></ul></body></html>
</body>
</html>

HW_04_202601 #176

