## Assignment: Evaluation and Robustness of Open-Source Language Models

**Background:**  
This assignment deepens your practical understanding of *open-source large language models (LLMs)*. Rather than training models from scratch, you will select, use, and robustly evaluate LLMs for text generation, focusing on reproducible experiments and critical analysis.

### Instructions and Point Breakdown

#### 1. **Model Selection and Setup (2 points)**

- Review the curated list of **open-source LLMs** in the provided notebook (e.g., Llama 3.1 8B, Gemma 7B, Phi-3 Mini, StableLM 7B).
- Select **two different models** that differ in architecture, parameter count, or both.
- Load each model and its tokenizer in code, using the Hugging Face `transformers` library.
- Briefly justify your model choices (consider parameter size, hardware requirements, and relevant use cases).
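
The loading step can be sketched as follows. This is a minimal sketch, not the required implementation: the Hub IDs in `MODEL_IDS` are assumptions chosen for illustration (use the IDs listed in the provided notebook, and note that some models require accepting a license on the Hub), and `load_model` and `generate` are hypothetical helper names. Greedy decoding is used here only to keep runs reproducible.

```python
# Assumed Hub IDs for illustration only; replace with the IDs from the course notebook.
MODEL_IDS = {
    "phi3-mini": "microsoft/Phi-3-mini-4k-instruct",
    "gemma-7b": "google/gemma-7b",
}


def load_model(model_id: str):
    """Load a causal LM and its tokenizer from the Hugging Face Hub."""
    # Imported lazily so this cell can be defined before the heavy dependency is loaded.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # torch_dtype="auto" keeps the dtype stored in the checkpoint (often half precision).
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    model.eval()
    return tokenizer, model


def generate(tokenizer, model, prompt: str, max_new_tokens: int = 64) -> str:
    """Return only the newly generated text for a single prompt (greedy decoding)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Slice off the prompt tokens so the completion is not mixed with the input.
    completion_ids = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(completion_ids, skip_special_tokens=True)
```

A typical call would be `tokenizer, model = load_model(MODEL_IDS["phi3-mini"])` followed by `generate(tokenizer, model, "Describe a typical data science degree.")`. Loading a 7B or 8B model this way needs substantial memory, which is one of the hardware considerations to mention in your justification.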

#### 2. **Prompt Design and Data Augmentation (3 points)**

- Create a base set of **10 distinct prompts** on a common theme (for example, university programs; Columbia, MO; or another topic of your choice).
- Generate completions from both models using the same set of prompts.
- Augment the prompt set by applying at least **two augmentation techniques**:
  - Paraphrase (manually or using a paraphrasing model)
  - Adding typographical errors or slang
  - Varying instruction style or context (e.g., first-person vs. third-person)
- For each augmentation, generate completions with both models.
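
Two of the augmentation techniques above (typographical errors and instruction-style variation) might be implemented roughly as below. The function names, the swap-adjacent-letters typo model, and the template wording are all assumptions made for this sketch; paraphrasing is left out because it depends on the paraphrasing model you choose.

```python
import random


def add_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap two adjacent interior letters in a word, with probability `rate` per word."""
    rng = random.Random(seed)  # seeded so the augmented prompt set is reproducible
    noisy = []
    for word in text.split():
        if len(word) > 3 and rng.random() < rate:
            # Pick an interior position so the first and last characters stay in place.
            i = rng.randrange(1, len(word) - 2)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        noisy.append(word)
    return " ".join(noisy)


def restyle(prompt: str, style: str) -> str:
    """Wrap a prompt in a first-person or third-person framing (hypothetical templates)."""
    templates = {
        "first_person": "I would like to know: {p}",
        "third_person": "A student asks the following question: {p}",
    }
    return templates[style].format(p=prompt)


def build_prompt_table(prompts: list) -> list:
    """Return one row per (prompt, variant) so both models see identical inputs."""
    rows = []
    for idx, p in enumerate(prompts):
        rows.append({"prompt_id": idx, "variant": "original", "text": p})
        rows.append({"prompt_id": idx, "variant": "typos",
                     "text": add_typos(p, rate=0.3, seed=idx)})
        rows.append({"prompt_id": idx, "variant": "third_person",
                     "text": restyle(p, "third_person")})
    return rows
```

Keeping a `prompt_id` and `variant` column on every row makes it straightforward to join completions back to their source prompt when you compare models later.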

#### 3. **Qualitative and Quantitative Output Analysis (2 points)**

- Define appropriate **metrics** for output analysis, such as:
  - Length of completion
  - N-gram diversity
  - Factual accuracy (if applicable)
  - Readability scores
- Build a **Markdown table** comparing the chosen metrics across models and augmentations for at least 10 samples.
- Include sample outputs from both original and augmented prompts for illustration.
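
As an illustration of the metrics listed above, the sketch below computes completion length and n-gram diversity (the distinct-n ratio: unique n-grams divided by total n-grams) and renders rows as a Markdown table. Whitespace tokenization and the helper names are simplifying assumptions; readability scores and factual accuracy are not covered here and would need `textstat` or manual checking.

```python
def distinct_n(text: str, n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams over whitespace tokens (0.0 if none)."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def compute_metrics(completion: str) -> dict:
    """Collect simple per-completion metrics in a flat dictionary."""
    return {
        "length_tokens": len(completion.split()),
        "distinct_1": round(distinct_n(completion, 1), 3),
        "distinct_2": round(distinct_n(completion, 2), 3),
    }


def to_markdown_table(rows: list) -> str:
    """Render a list of dictionaries (all with the same keys) as a Markdown table."""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join(":---" for _ in headers) + "|",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)


# Usage with placeholder text (not real model output).
rows = [
    {"model": "model_a", "variant": "original", **compute_metrics("the cat sat on the mat")},
    {"model": "model_a", "variant": "typos", **compute_metrics("the cat the cat the cat")},
]
print(to_markdown_table(rows))
```

The printed string can be pasted into a Markdown cell, or displayed directly in the notebook with `IPython.display.Markdown`.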

#### 4. **Robustness Testing (2 points)**

- Select a subset of your prompts and **introduce adversarial or noisy edits** (e.g., scrambled words, excessive punctuation, misleading context).
- Analyze *performance degradation* or changes in output quality across models using your metrics.
- Visualize differences using a plot (e.g., bar chart showing metric differences across original vs. noisy prompts for both models).
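
One possible shape for the robustness step is sketched below: two noise functions, a per-model metric difference, and a grouped bar chart. The function names and the specific noise choices (shuffling word order, appending repeated punctuation to every word) are assumptions for illustration; misleading context is better written by hand.

```python
import random


def scramble_words(text: str, seed: int = 0) -> str:
    """Shuffle word order while keeping the same set of words."""
    rng = random.Random(seed)
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


def add_punctuation_noise(text: str, marks: str = "!?", repeat: int = 3) -> str:
    """Append a repeated punctuation string to every word."""
    return " ".join(word + marks * repeat for word in text.split())


def metric_deltas(original: dict, noisy: dict) -> dict:
    """Noisy minus original for each model; negative values mean the metric dropped."""
    return {model: noisy[model] - original[model] for model in original}


def plot_original_vs_noisy(original: dict, noisy: dict, metric_name: str):
    """Grouped bar chart of one metric per model, original vs. noisy prompts."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    models = list(original)
    positions = list(range(len(models)))
    width = 0.35
    fig, ax = plt.subplots()
    ax.bar([p - width / 2 for p in positions], [original[m] for m in models],
           width, label="original")
    ax.bar([p + width / 2 for p in positions], [noisy[m] for m in models],
           width, label="noisy")
    ax.set_xticks(positions)
    ax.set_xticklabels(models)
    ax.set_ylabel(metric_name)
    ax.legend()
    return fig
```

Here `original` and `noisy` are dictionaries mapping a model name to that model's mean metric value over the chosen prompt subset, for example `plot_original_vs_noisy({"model_a": 0.8, "model_b": 0.7}, {"model_a": 0.6, "model_b": 0.65}, "distinct_2")` with values you computed yourself.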

#### 5. **Technical Reflection (1 point)**

- In a Markdown cell, answer:
  - Which model was more robust to prompt noise/augmentation, and why?
  - Which augmentation technique most impacted output quality, and how?
  - Suggest another way to further evaluate or benchmark open-source LLMs.

### Submission Requirements

- **Jupyter Notebook** containing:
  - All code and comments for Sections 1–4
  - Outputs: completions, metrics, tables, and plots
  - Reflection (Section 5)
- Use Python libraries such as `transformers`, `nltk` or `textstat` (for readability scoring), `sklearn`, `pandas`, and `matplotlib` as needed.

**Grading Rubric:**

| Section                       | Points |
|:------------------------------|:------:|
| Model selection/setup         | 2      |
| Prompt design/augmentation    | 3      |
| Output analysis               | 2      |
| Robustness testing            | 2      |
| Quality of reflection         | 1      |
| **Total**                     | **10** |