# Day 1 - Fine-Tuning Large Language Models: From Inference to Training

### Summary
This lecture marks a significant shift from model **inference** (using pre-trained models) to **model training**, emphasizing that this is where the "real" advanced work begins in LLM engineering. The initial focus of training will be on the critical, though less glamorous, aspect of **data preparation**: cleaning, curating, and understanding datasets. The core project for the upcoming weeks involves **fine-tuning a frontier model** to predict product prices based on their descriptions, a task chosen for its clear measurability and the surprising effectiveness of modern LLMs in handling such traditionally regression-oriented problems.

### Highlights
-   **Transition from Inference to Training**: The course is moving from techniques that optimize the *use* of pre-trained models (like RAG, prompt chaining) to methods that *modify and improve* the models themselves by adjusting their internal parameters based on new data. This allows for deeper, more nuanced adaptation to specific tasks.
-   **Primacy of Data in Training**: Before any training can occur, meticulous data work is essential. This includes understanding, visualizing, cleaning, and curating the dataset to ensure it's high quality and suitable for the model. This step is foundational for successful model training.
-   **Transfer Learning as a Practical Approach**: Training large language models from scratch is prohibitively expensive (often costing over $100 million). **Transfer learning** allows leveraging the knowledge of pre-trained models by further training them on smaller, specialized datasets, a process known as **fine-tuning**. This makes advanced model customization accessible.
-   **Fine-Tuning for Specialized Tasks**: Fine-tuning enables the adaptation of general-purpose foundation models to perform specific tasks more effectively. The lecture mentions techniques like **QLoRA** to make this process manageable in terms of computational resources.
-   **Commercial Project: Product Price Estimation**: The practical project involves building a model to estimate the price of various products (electronics, appliances, etc.) based solely on their text descriptions. This serves as a concrete application for learning fine-tuning.
-   **Why a "Regression-like" Problem for a Generative AI Course?**: Although price prediction is traditionally a regression task (predicting a number), modern frontier LLMs have shown remarkable emergent capabilities in solving such quantitative problems effectively, sometimes outperforming specialized regression models. This makes it a relevant and challenging task for generative AI.
-   **Measurability of Success**: A key advantage of the chosen price prediction problem is the ease and clarity of measuring success. The accuracy of a predicted price is directly understandable, unlike the often subjective or complex metrics (e.g., perplexity, BLEU scores) used for traditional text generation tasks like translation or summarization.
-   **Course Structure Context**: This new phase (Week 6: Fine-tuning a Frontier Model) builds upon previous weeks covering frontier models, APIs, Hugging Face, model selection, code generation, and RAG, and leads towards building custom open-source models in Week 7.

### Conceptual Understanding
-   **Transfer Learning and Fine-Tuning**
    1.  **Why is this concept important?** It democratizes the ability to create high-performing, specialized AI models. Without it, only organizations with massive computational resources and datasets could develop sophisticated LLMs. Transfer learning allows smaller teams or individuals to adapt existing powerful models to their specific needs.
    2.  **How does it connect to real-world tasks, problems, or applications?** It's used extensively in various fields: adapting a general language model to understand medical text for healthcare applications, tuning a model for specific legal jargon, or creating a customer service bot that understands a particular company's products and policies. In this lecture's context, it's for making a general model an expert at product price estimation.
    3.  **Which related techniques or areas should be studied alongside this concept?** Understanding base model architectures (e.g., Transformers), techniques for efficient fine-tuning (like LoRA, QLoRA, adapter methods), dataset curation for specialized tasks, and evaluation metrics relevant to the fine-tuning objective.

-   **Using Generative AI for Regression-like Tasks**
    1.  **Why is this concept important?** It expands the applicability of LLMs beyond pure text generation. If LLMs can effectively handle tasks traditionally requiring different model types (like regression), they can become more versatile tools in a data scientist's toolkit, simplifying workflows by potentially using a single powerful model for multiple task types.
    2.  **How does it connect to real-world tasks, problems, or applications?** Beyond price prediction, this could apply to forecasting sales figures from market reports, estimating risk scores from textual descriptions, or predicting property values from listing details. It's valuable wherever numerical outputs can be derived from textual input.
    3.  **Which related techniques or areas should be studied alongside this concept?** Prompt engineering for structured output (e.g., forcing JSON responses containing numerical fields), evaluating LLMs on regression metrics (MAE, RMSE), comparing LLM performance against traditional regression models (e.g., linear regression, gradient boosting machines), and understanding the mechanisms by which LLMs perform numerical reasoning (even if not fully transparent).
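
When a generative model is used for a regression-like task, its reply is text, so the number has to be pulled out of that text before MAE or RMSE can be computed. The following is a minimal sketch of that step. The function name `extract_price` and the sample replies are illustrative assumptions, not code from the lecture.

```python
import re

def extract_price(reply: str) -> float:
    """Return the first number found in a model reply, or 0.0 if there is none."""
    # Drop dollar signs and thousands separators so "$1,299.99" becomes "1299.99"
    cleaned = reply.replace("$", "").replace(",", "")
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    return float(match.group()) if match else 0.0

# Hypothetical model replies
print(extract_price("Cost: $1,299.99"))
print(extract_price("I would estimate about 45 dollars"))
print(extract_price("no idea"))
```

Returning 0.0 for an unparseable reply is one possible design choice: it keeps the evaluation loop running and makes such failures show up as large errors rather than as exceptions.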

### Reflective Questions
1.  **Application:** Which specific dataset or project could benefit from this concept of fine-tuning an LLM for a regression-like task?
    * *Answer:* A project analyzing customer reviews to predict a "satisfaction score" (e.g., 1-5) could benefit. Fine-tuning an LLM on a dataset of reviews paired with their scores could capture nuanced textual cues better than traditional sentiment analysis for a more accurate numerical prediction.

2.  **Teaching:** How would you explain "transfer learning" in LLMs to a junior colleague, using one concrete example?
    * *Answer:* Imagine a chef who is an expert in general French cuisine (the pre-trained model); transfer learning is like giving them a short apprenticeship in a specific Italian pastry shop (the new, smaller dataset) so they become excellent at making cannolis (the specialized task) without forgetting their French cooking skills.

3.  **Extension:** Given the plan to fine-tune for price prediction, what related technique or area should you explore next to potentially improve the model or understand its limitations?
    * *Answer:* Exploring "explainability techniques" for LLMs (e.g., analyzing attention weights or using LIME/SHAP adapted for transformers) would be valuable next. This could help understand which parts of the product description most influence the predicted price, providing insights for debugging or improving the model's reasoning.

# Day 1 - Finding and Crafting Datasets for LLM Fine-Tuning: Sources & Techniques

### Summary
This lecture emphasizes that sourcing and meticulously crafting data is the most essential, albeit less glamorous, part of model training. It outlines various data sources, from vital proprietary company data and public repositories like Kaggle and Hugging Face, to synthetic data and specialized curation services. The specific dataset chosen for the course project is a large Amazon review dataset from Hugging Face, rich in product descriptions and prices, which will be processed through a multi-stage workflow involving investigation, parsing, visualization, quality assessment, curation, and saving.

### Highlights
-   **Primacy of Proprietary Data**: For any business problem, the first and most crucial data source is the company's own proprietary data, as it directly pertains to the specific challenges and nuances the model needs to learn.
-   **Leveraging Public and Community Datasets**: Platforms like Kaggle and Hugging Face are invaluable resources, offering a wide array of datasets contributed by the community. The course will use an Amazon product review dataset from Hugging Face for its product price prediction project.
-   **Strategic Use of Synthetic Data**: LLMs can generate synthetic data, which can be useful for training smaller, more cost-effective models or when real data is scarce. However, fine-tuning a frontier model on data that the same model generated is likely to be counterproductive, since the data contains nothing the model does not already know.
-   **Data Curation as a Multi-Step Process**: Effective data preparation involves a structured approach:
    1.  **Investigating**: Initial exploration to understand data fields, completeness, and basic quality.
    2.  **Parsing**: Converting raw data into a more manageable and structured format.
    3.  **Visualizing**: Graphically representing data to understand distributions, ranges, and potential biases (e.g., price distribution).
    4.  **Assessing Quality**: Deeper analysis of data limitations and issues.
    5.  **Curating**: Making decisions to refine the dataset, such as excluding problematic data or addressing imbalances.
    6.  **Saving**: Storing the prepared dataset in a suitable location (e.g., Hugging Face Hub) for training.
-   **Importance of Data Quality for Training**: The quality and structure of the training data directly impact the model's performance. Decisions made during curation, like handling missing values or skewed distributions, are critical for building an effective model.

### Conceptual Understanding
-   **Synthetic Data Generation**
    1.  **Why is this concept important?** Synthetic data can bootstrap projects where real data is scarce, sensitive, or expensive to obtain. It allows for the creation of large, tailored datasets for specific training needs, potentially covering edge cases not present in existing real-world data.
    2.  **How does it connect to real-world tasks, problems, or applications?** It's used in scenarios like training autonomous vehicles (simulating rare traffic incidents), augmenting medical imaging datasets without compromising patient privacy, or creating diverse examples for training fraud detection models. For LLMs, it can generate conversational data or specific instruction-following examples.
    3.  **Which related techniques or areas should be studied alongside this concept?** Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), techniques for prompting large language models for data generation, data augmentation strategies, and methods for evaluating the quality and representativeness of synthetic data.

-   **Data Curation Workflow**
    1.  **Why is this concept important?** A systematic data curation workflow ensures that the data used for training is of high quality, relevant, and appropriately formatted, which is fundamental for the success of any machine learning model. "Garbage in, garbage out" is a core tenet; this workflow mitigates that risk.
    2.  **How does it connect to real-world tasks, problems, or applications?** This process is universal in data science. Whether building a customer churn model, a medical diagnostic tool, or a financial forecasting system, data scientists must rigorously investigate, clean, transform (parse/visualize), and select (curate) their data before model training.
    3.  **Which related techniques or areas should be studied alongside this concept?** Exploratory Data Analysis (EDA), data cleaning techniques (handling missing values, outliers, inconsistencies), feature engineering, data transformation and normalization, bias detection and mitigation strategies, and version control for datasets (e.g., DVC).

### Reflective Questions
1.  **Application:** Which specific dataset or project could benefit from the described data curation steps, particularly visualization and addressing imbalances?
    * *Answer:* A project to predict loan default risk using historical loan application data could greatly benefit. Visualizing features like income distribution or past defaults might reveal significant imbalances (e.g., far fewer defaults than non-defaults), which would need addressing during curation (e.g., through oversampling, undersampling, or synthetic data generation) to prevent the model from becoming biased.

2.  **Teaching:** How would you explain the importance of the "Curating" step in the data workflow to a junior colleague, using one concrete example?
    * *Answer:* Imagine you're training a model to identify ripe apples from photos, but 75% of your initial photos are of unripe green apples. If you don't "curate" by balancing this (e.g., finding more ripe apple photos or downsampling unripe ones), the model might become very good at spotting green apples but terrible at identifying perfectly ripe red ones, as it learned from a skewed perspective.

# Day 1 - Data Curation Techniques for Fine-Tuning LLMs on Product Descriptions

### Summary
This session dives into the practical first steps of data curation for the product price prediction project, using the "appliances" category from a large Amazon reviews dataset hosted on Hugging Face. The process involves loading the data, inspecting its structure and content (like `title`, `description`, `price`), and identifying initial data quality issues such as missing prices and varied formats for descriptive fields. Key exploratory steps include programmatically determining the proportion of items with valid prices and then visualizing the distributions of both combined text lengths and product prices using histograms, which reveal important characteristics like long tails in text length and significant price skewness towards cheaper items.

### Highlights
-   **Dataset Source and Subset**: The project uses the product metadata from a large Amazon reviews dataset hosted on Hugging Face, specifically focusing on the "appliances" category (containing 94,000 items) to manage training time and complexity.
-   **Initial Data Inspection**: Examination of individual data points reveals key fields: `title`, `description` (a list, can be empty), `features` (a list), `details` (a string containing JSON), and `price`. Early checks showed some items lack price information.
-   **Data Type Variety**: The descriptive fields exhibit different data structures. For instance, `description` and `features` are lists, while `details` is a JSON string requiring parsing to be used as a dictionary.
-   **Assessing Price Data Availability**: An initial programmatic check was performed to count items with valid (non-null, non-zero) prices. The speaker anecdotally concluded that about half of the "appliances" subset had price information, which was deemed sufficient for proceeding.
-   **Text Length Analysis**: The total character count from `title`, `description`, `features`, and `details` was calculated for items with prices. A histogram of these lengths showed a distribution with a notable peak and a long tail of very lengthy descriptions.
-   **Implications of Long Text Sequences**: The long tail in text lengths poses challenges for training, as models have maximum token limits. Longer sequences increase memory requirements for self-hosted models and costs for API-based frontier models, suggesting a need for a length cutoff.
-   **Price Distribution Analysis**: A histogram of prices for appliances revealed a highly skewed distribution, with a large number of low-cost items and a few very expensive outliers (e.g., a $21,000 professional microwave).
-   **Impact of Skewed Price Data**: The predominance of low-cost items could bias the model during training, making it less accurate for mid-range or high-priced products. The mean price can be misleadingly pulled up by high-value outliers.
-   **Python F-string Tip**: A useful Python tip was shared for formatting numbers with thousand separators in f-strings using the `:,` format specifier (e.g., `f"{value:,}"`).
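
The f-string tip and the point about outliers pulling up the mean can be seen together in a short sketch. The price list below is made up for illustration (mostly cheap items plus one expensive outlier) and is not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical prices: five cheap items and one very expensive outlier
prices = [12.99, 18.50, 24.00, 32.00, 45.00, 21000.00]

item_count = 94000
print(f"Items in subset: {item_count:,}")      # thousands separator via ':,'
print(f"Mean price:   ${mean(prices):,.2f}")   # ':,' can be combined with '.2f'
print(f"Median price: ${median(prices):,.2f}")
```

With data like this the mean lands far above what a typical item costs, while the median stays close to the bulk of the items, which is why the median is often the more informative summary for skewed prices.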

### Conceptual Understanding
-   **Histograms for Data Exploration**
    1.  **Why is this concept important?** Histograms provide a visual representation of the distribution of numerical data, allowing data scientists to quickly understand its underlying frequency distribution, central tendency, spread, and shape (e.g., skewness, presence of outliers). This is crucial for identifying potential data issues or characteristics that might affect model training.
    2.  **How does it connect to real-world tasks, problems, or applications?** In the context of the lecture, plotting histograms for text length helps decide on sequence length limits for models, while price histograms reveal skewness that might require data transformation or specialized sampling techniques for effective model training in price prediction. Generally, it's used in any data analysis to understand variable distributions – e.g., age distribution of customers, sensor reading variations, income levels.
    3.  **Which related techniques or areas should be studied alongside this concept?** Descriptive statistics (mean, median, mode, standard deviation, variance, quartiles), other visualization plots (box plots, density plots, scatter plots), data transformation techniques (log transformation for skewed data), and outlier detection methods.

-   **Impact of Skewed Data Distributions on Training**
    1.  **Why is this concept important?** Skewed distributions, where data is heavily concentrated on one side, can lead to biased models. A model trained on data with many cheap items and few expensive ones (as in the price example) might become very good at predicting cheap prices but perform poorly on expensive ones because it hasn't seen enough examples.
    2.  **How does it connect to real-world tasks, problems, or applications?** This is common in many domains: fraud detection (many non-fraudulent transactions, few fraudulent), medical diagnosis (many healthy patients, few with a rare disease), or quality control (many good products, few defects). Addressing skew is vital for building fair and accurate models.
    3.  **Which related techniques or areas should be studied alongside this concept?** Techniques for handling imbalanced datasets such as oversampling minority classes (e.g., SMOTE), undersampling majority classes, using different evaluation metrics (e.g., F1-score, AUC-PR instead of accuracy), cost-sensitive learning, and data augmentation.
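
As a rough illustration of the log transformation mentioned above, the sketch below compares how far the mean sits from the median before and after applying `log1p` to a small, made-up price list. The lecture does not prescribe this transformation for the project; it is shown only as one common way to reduce skew.

```python
import math
from statistics import mean, median

# Hypothetical skewed prices (not from the dataset)
prices = [12.99, 18.50, 24.00, 32.00, 45.00, 21000.00]

# log1p(x) = log(1 + x); it compresses large values far more than small ones
log_prices = [math.log1p(p) for p in prices]

ratio_raw = mean(prices) / median(prices)
ratio_log = mean(log_prices) / median(log_prices)
print(f"mean/median before: {ratio_raw:.1f}")
print(f"mean/median after:  {ratio_log:.1f}")

# expm1 inverts log1p, so values can be mapped back to dollars
recovered = [math.expm1(v) for v in log_prices]
```

A mean-to-median ratio near 1 suggests a more symmetric distribution. If a model were trained on log prices, its outputs would need the inverse transform before being reported in dollars.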

### Code Examples
The transcript describes the following key coding steps:

1.  **Loading the dataset from Hugging Face:**
    * The `load_dataset` function from the `datasets` library is called with the dataset's Hugging Face identifier and the name of the category subset (here, appliances). The transcript does not spell out the exact identifier or configuration name (the speaker just says "Amazon reviews"), so the values below are placeholders.
    * Example (conceptual, based on the description):
        ```python
        from datasets import load_dataset

        # Placeholders: replace with the identifier and subset name used in the course code.
        DATASET_ID = "amazon-reviews-2023"
        SUBSET = "appliances"

        dataset = load_dataset(DATASET_ID, SUBSET)
        # Without a split= argument this returns a DatasetDict keyed by split;
        # with one, it returns a single Dataset.
        print(dataset)  # the lecture reports 94,000 items in this subset
        ```

2.  **Inspecting data structure and content:**
    * Accessing individual data points (e.g., `dataset[0]`).
    * Printing specific fields like `data_point['title']`, `data_point['description']`, `data_point['price']`.
    * Using `json.loads()` to parse stringified JSON from the `details` field.

3.  **Counting items with valid prices:**
    * Iterating through the dataset.
    * Using a `try-except` block to handle potential errors if a price is missing or not a number.
    * Checking if `price is not None` and `float(price) > 0`.

4.  **Calculating total text length:**
    * For each item, concatenating or summing the lengths of `title`, `description` (handling if it's a list), `features` (handling if it's a list), and `details`.
    * Storing these lengths in a list.

5.  **Plotting histograms with Matplotlib:**
    * Using `matplotlib.pyplot.hist()` to visualize distributions of text lengths and prices.
    * Customizing plots with titles and labels (e.g., `plt.title()`, `plt.xlabel()`, `plt.ylabel()`).
    * Example (conceptual for histogram):
        ```python
        import matplotlib.pyplot as plt
        # Assuming 'lengths' is a list of character counts
        plt.hist(lengths, bins=50) 
        plt.title("Distribution of Text Lengths")
        plt.xlabel("Number of Characters")
        plt.ylabel("Frequency")
        plt.show()
        ```
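
Steps 3 and 4 above can be sketched as follows. The sample records are hypothetical and only mimic the field shapes described in this section (`description` and `features` as lists, `details` as a JSON string, `price` possibly missing). The helper names `valid_price` and `text_length` are also assumptions.

```python
# Hypothetical records shaped like the fields described above
sample_items = [
    {"title": "Ice maker", "description": ["Compact unit"], "features": ["26 lbs/day"],
     "details": '{"Brand": "Acme"}', "price": "129.99"},
    {"title": "Filter", "description": [], "features": [], "details": "{}", "price": "None"},
    {"title": "Knob", "description": [], "features": ["Fits most ranges"], "details": "{}",
     "price": None},
]

def valid_price(datapoint):
    """Return the price as a positive float, or None if it is missing or unusable."""
    try:
        price = float(datapoint["price"])
    except (KeyError, TypeError, ValueError):
        return None
    return price if price > 0 else None

def text_length(datapoint):
    """Total characters across title, description, features and details."""
    length = len(datapoint["title"])
    length += sum(len(s) for s in datapoint["description"])
    length += sum(len(s) for s in datapoint["features"])
    length += len(datapoint["details"] or "")
    return length

priced = [d for d in sample_items if valid_price(d) is not None]
lengths = [text_length(d) for d in priced]
print(f"{len(priced):,} of {len(sample_items):,} items have a price "
      f"({len(priced) / len(sample_items):.1%})")
```

The resulting `lengths` list is what would be passed to `plt.hist` in step 5.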

### Reflective Questions
1.  **Application:** Given the observed skewness in product prices within the "appliances" category, what specific strategy would you propose for the data curation phase to mitigate its impact on model training, and why?
    * *Answer:* One strategy would be to apply log transformation to the prices before training the model. This can compress the range of very high prices and expand the range of low prices, making the distribution more symmetrical and potentially helping the model learn more effectively across different price points.

2.  **Teaching:** How would you explain to a junior colleague why a "long tail" in the histogram of text lengths is a concern for training LLMs, using one concrete example?
    * *Answer:* Imagine trying to fit variously sized photos into standard frames; most photos (text descriptions) fit standard short/medium frames (token limits), but a few very long panoramic photos (long-tail descriptions) won't fit or require very large, expensive custom frames (exceed token limits, increase processing cost/memory). We need to decide if we'll crop those long photos or find a way to summarize them to fit.

3.  **Extension:** After discovering that the `details` field is a JSON string, what potential data quality issues might you anticipate encountering when parsing this field across 94,000 items, and how might you prepare to handle them?
    * *Answer:* I'd anticipate issues like malformed JSON strings (e.g., syntax errors), inconsistent key names within the JSON across different products, or varied data types for the same conceptual key. To prepare, I would wrap the `json.loads()` call in a `try-except` block to catch parsing errors, implement logging for failed parses, and plan to inspect unique keys and their value types after successfully parsing a sample to standardize the extracted information.
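
The handling described in the last answer might look like the sketch below: a parser that tolerates bad input, plus a tally of key names to expose inconsistencies such as `Brand` versus `brand`. The sample strings and the helper name `parse_details` are made up for illustration.

```python
import json
from collections import Counter

def parse_details(raw):
    """Parse a details string; return a dict, or None if it is not a valid JSON object."""
    try:
        parsed = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        # TypeError covers None; JSONDecodeError covers malformed strings
        return None
    return parsed if isinstance(parsed, dict) else None

# Hypothetical raw values for the details field
raw_details = [
    '{"Brand": "Acme", "Color": "White"}',
    '{"brand": "Acme"}',       # inconsistent key casing
    '{"Brand": "Acme", ',      # malformed JSON
    None,                      # missing value
]

key_counts = Counter()
failures = 0
for raw in raw_details:
    details = parse_details(raw)
    if details is None:
        failures += 1
        continue
    key_counts.update(details.keys())

print(f"Failed to parse: {failures}")
print(key_counts.most_common())
```

Reviewing the key tally on a sample before processing all items makes it easier to decide which keys to normalize or drop.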

# Day 1 - Optimizing Training Data: Scrubbing Techniques for LLM Fine-Tuning

### Summary
This lecture details the data curation phase, where raw data points from the Hugging Face dataset are transformed into structured Python objects called `Item`. This `Item` class, defined in a separate `items.py` module, encapsulates logic for cleaning text, removing irrelevant information like part numbers, and formatting the data into consistent prompts with a controlled token length (around 180 tokens, guided by the Llama 3.1 8B tokenizer). This process generates distinct training prompts (including the product's price) and test prompts (omitting the price) to prepare the data for fine-tuning a model to predict product costs.

### Highlights
-   **Structured Data Objects (`Item` Class)**: Data curation involves converting raw dataset entries into instances of a custom `Item` class. This class stores `title`, `price`, `category`, `token_count`, and the generated `prompt`, centralizing data processing logic.
-   **Modular Code for Reusability**: The `Item` class is defined in a separate `items.py` file, promoting cleaner notebooks and allowing the data processing logic to be reused across different parts of the project.
-   **Token-Aware Prompt Engineering**: Prompts are engineered to have a maximum token count (around 178-180 tokens). This limit is determined using the Llama 3.1 8B tokenizer to ensure consistency for both current frontier model fine-tuning and future open-source model training, while also managing costs and computational resources.
-   **Targeted Data Cleaning (`scrub_details`, `scrub`)**: Specific functions are implemented to "scrub" the text data:
    * Removing common irrelevant phrases (e.g., "batteries included," "manufacturer").
    * Normalizing whitespace and removing unusual characters using regular expressions.
-   **Data-Driven Rule for Part Number Removal**: A key cleaning step, discovered through manual data inspection, involves removing words that are 8+ characters long and contain a digit. These are often part numbers that consume many tokens but add little value to price prediction.
-   **Generation of Training and Test Prompts**:
    * **Training prompts** are formatted to include the product description and its price (e.g., "How much does this cost to the nearest dollar? [Description] Cost: $[Price]").
    * **Test prompts** include the same description but omit the price, allowing the model's prediction accuracy to be evaluated.
-   **Consistent Token Length**: After processing, the resulting prompts are tightly packed with useful information, averaging around 176 tokens and not exceeding ~178 tokens, ensuring efficient use of the token budget.
-   **Impact on Price Distribution**: The curation process, including filtering by token count and cleaning, resulted in a dataset where the average price is around $100 and the maximum price is about $11,000. The previously noted $21,000 outlier was removed, though the distribution remains skewed towards lower-priced items.
-   **Iterative Refinement Encouraged**: The lecture stresses that such cleaning logic is often developed iteratively by digging into the data, identifying patterns (like irrelevant part numbers), and implementing targeted solutions. Students are encouraged to review the `items.py` code.
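
The part-number rule from the highlights above (drop words that are 8 or more characters long and contain a digit) can be tried on its own. This is a minimal sketch with a made-up product string; the function name `remove_part_numbers` is an assumption and the course's `items.py` may implement the rule differently.

```python
def remove_part_numbers(text: str) -> str:
    """Drop words that are 8+ characters long and contain at least one digit."""
    words = text.split()
    kept = [w for w in words if len(w) < 8 or not any(ch.isdigit() for ch in w)]
    return " ".join(kept)

# Hypothetical product text
sample = "Replacement filter WPW10295370A for refrigerator model 4396508 fits Whirlpool"
print(remove_part_numbers(sample))
```

Here `WPW10295370A` is removed, while `4396508` (only 7 characters) and `refrigerator` (long, but no digit) are kept, which shows both halves of the rule at work.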

### Conceptual Understanding
-   **Tokenization and Prompt Engineering for Fixed Lengths**
    1.  **Why is this concept important?** LLMs process text as tokens and have maximum input sequence lengths (context windows). Engineering prompts to fit a specific token limit ensures that no information is lost due to truncation by the model, optimizes processing costs (as many APIs charge per token), and makes training more memory-efficient, especially for self-hosted models.
    2.  **How does it connect to real-world tasks, problems, or applications?** This is critical in any application where LLMs are used with constrained resources or budgets. For fine-tuning, consistent input length helps stabilize training. For inference, it ensures predictable performance and cost, whether in chatbots, document summarizers, or, as in this case, price prediction models.
    3.  **Which related techniques or areas should be studied alongside this concept?** Different tokenization strategies (BPE, WordPiece, SentencePiece), techniques for text truncation and summarization to fit context windows, methods for estimating token counts, and the impact of context window size on model performance.

-   **Iterative Data Cleaning Based on Exploration**
    1.  **Why is this concept important?** Real-world datasets are often messy and contain irrelevant or misleading information. A predefined, static cleaning script rarely suffices. Iterative cleaning, driven by deep data exploration and understanding the specific task, is crucial for creating high-quality data that significantly improves model performance.
    2.  **How does it connect to real-world tasks, problems, or applications?** This is a fundamental practice in all data science projects. For example, in sentiment analysis, one might iteratively refine rules to remove boilerplate text from reviews; in sales forecasting, one might identify and handle anomalous data entries. The lecture's example of removing part numbers based on their structure and irrelevance to price is a prime illustration.
    3.  **Which related techniques or areas should be studied alongside this concept?** Exploratory Data Analysis (EDA), regular expressions for pattern matching and text manipulation, data profiling tools, anomaly detection, and a mindset of continuous data quality improvement.
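
To make the idea of fitting text to a token budget concrete, here is a small sketch. It uses a stand-in tokenizer that treats each whitespace-separated word as one token so that the example runs without downloading a model. The lecture uses the Llama 3.1 8B tokenizer instead, and real tokenizers usually produce more tokens than words. `WhitespaceTokenizer` and `truncate_to_budget` are hypothetical names.

```python
class WhitespaceTokenizer:
    """Stand-in tokenizer: one token per whitespace-separated word."""

    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)

def truncate_to_budget(text, tokenizer, max_tokens):
    """Return (text, token_count), cutting the text to at most max_tokens tokens."""
    tokens = tokenizer.encode(text)
    if len(tokens) <= max_tokens:
        return text, len(tokens)
    kept = tokens[:max_tokens]
    return tokenizer.decode(kept), len(kept)

tok = WhitespaceTokenizer()
short, n = truncate_to_budget("compact countertop ice maker with self cleaning mode", tok, 5)
print(short, n)
```

With a Hugging Face tokenizer the same pattern applies through its `encode` and `decode` methods, typically passing `add_special_tokens=False` to `encode` so that start-of-sequence tokens are not counted against the budget.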

### Code Examples
The transcript describes the functionality and structure of an `Item` class within an `items.py` module. Key aspects include:

1.  **`Item` Class Definition**:
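    ```python
    # items.py
    # Conceptual sketch reconstructed from the lecture's description; the
    # course's actual items.py may differ in names and details.
    import re

    from transformers import AutoTokenizer  # for the Llama tokenizer

    BASE_MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B"  # or similar (gated model)
    TOKENIZER = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)
    MAX_TOKENS = 180  # example limit
    QUESTION = "How much does this cost to the nearest dollar?"
    REMOVALS = ["batteries included", "manufacturer"]  # example phrases only


    class Item:
        def __init__(self):
            self.title = None
            self.price = None
            self.category = None
            self.token_count = 0
            self.prompt = ""            # training prompt (includes the price)
            self.test_prompt_text = ""  # test prompt (price omitted)

        def scrub_details(self, text_details):
            # Remove common phrases that carry no pricing signal
            for phrase in REMOVALS:
                text_details = re.sub(re.escape(phrase), "", text_details, flags=re.IGNORECASE)
            return text_details

        def scrub(self, text):
            # Replace unusual characters (illustrative set) and collapse whitespace
            text = re.sub(r"[\[\]{}\"【】]", " ", text)
            text = re.sub(r"\s+", " ", text).strip()
            # Drop words of 8+ characters containing a digit (likely part numbers)
            words = text.split(" ")
            kept = [w for w in words if len(w) < 8 or not any(c.isdigit() for c in w)]
            return " ".join(kept)

        def parse(self, raw_data_point, category_name):
            # Returns True if a prompt was built, False if the price is unusable
            try:
                price = float(raw_data_point.get('price'))
            except (TypeError, ValueError):
                return False
            if price <= 0:
                return False
            self.title = raw_data_point.get('title') or ""
            self.price = price
            self.category = category_name

            # Combine the text fields; description and features are lists
            parts = list(raw_data_point.get('description') or [])
            parts += list(raw_data_point.get('features') or [])
            details = raw_data_point.get('details')
            if details:
                parts.append(self.scrub_details(details))
            combined_text = self.scrub(" ".join(parts))

            # Simplified truncation: only the description is cut to MAX_TOKENS,
            # so the full prompt ends up slightly longer than that.
            tokens = TOKENIZER.encode(combined_text, add_special_tokens=False)[:MAX_TOKENS]
            combined_text = TOKENIZER.decode(tokens)

            # Construct both prompts from the same description
            description_for_prompt = "Title: " + self.title + " Description: " + combined_text
            self.test_prompt_text = f"{QUESTION}\n{description_for_prompt}\nCost: $"
            self.prompt = f"{self.test_prompt_text}{self.price:.0f}"

            self.token_count = len(TOKENIZER.encode(self.prompt, add_special_tokens=False))
            return True
    ```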

2.  **Usage in Notebook**:
    ```python
    # notebook.ipynb
    import matplotlib.pyplot as plt
    from items import Item

    # raw_dataset_with_prices is assumed to hold the data points that passed
    # the earlier price check; it is not defined in this snippet.
    curated_items = []
    for data_point in raw_dataset_with_prices:
        item_obj = Item()
        if item_obj.parse(data_point, "appliances"):  # truthy when parsing succeeded
            curated_items.append(item_obj)

    # Inspect one training prompt and its matching test prompt
    print(curated_items[0].prompt)
    print(curated_items[0].test_prompt_text)

    # Plot the distribution of prompt token counts
    token_counts = [item.token_count for item in curated_items]
    plt.hist(token_counts, bins=30)
    plt.show()
    ```

### Reflective Questions
1.  **Application:** If you were tasked with adapting this `Item` class and curation process for a dataset of real estate listings to predict property prices, what specific "data scrubbing" rules, similar to the part number removal, might you investigate and potentially implement?
    * *Answer:* For real estate, I'd investigate removing agent contact information or boilerplate agency disclaimers often found in listings, as these are irrelevant to the intrinsic property value and consume tokens. I'd also look for and standardize or remove overly specific showing instructions or phrases like "motivated seller," which might not generalize well.

2.  **Teaching:** How would you explain to a junior colleague the rationale behind creating two separate prompts (`prompt` for training and `test_prompt_text` for testing) in the `Item` class?
    * *Answer:* During training, we want the model to learn the connection between the description and the actual price, so the training prompt includes the price as the target. For testing, we need to see if the model can predict the price on its own without seeing the answer, so the test prompt presents only the description and asks for the price, mimicking a real prediction scenario.

3.  **Extension:** The current method truncates text to meet the ~180 token limit. What alternative strategy could be employed if crucial price-determining information was often found at the end of very long descriptions, and what would be its trade-offs?
    * *Answer:* An alternative strategy could be to use a summarization model to condense the long descriptions into a shorter text that fits the token limit while trying to retain key information. The trade-off is added complexity and potential cost of the summarization step, and the risk that the summarization might inadvertently remove or alter nuanced details crucial for accurate price prediction.

# Day 1 - Evaluating LLM Performance: Model-Centric vs Business-Centric Metrics


### Summary
This lecture recaps the paramount importance of data quality in model performance, asserting it has a more significant impact than extensive hyperparameter tuning. It then delves into performance evaluation, distinguishing between model-centric metrics (like training loss and Root Mean Squared Log Error - RMSLE) and human-understandable, business-centric metrics (such as average absolute price difference or the percentage of "good enough" estimates). The price prediction project benefits from clear business metrics that directly reflect model accuracy, which will be crucial for comparing different models and approaches.

### Highlights
-   **Data Quality is Key**: The most effective way to improve model performance is by enhancing the quality of the input data, often yielding more significant gains than hyperparameter tuning.
-   **Model-Centric vs. Business-Centric Metrics**: Performance evaluation uses two types of metrics:
    * **Model-centric metrics** (e.g., training/validation loss, RMSLE, MSE) are technical measures indicating how well the model is learning and performing mathematically.
    * **Business-centric metrics** (e.g., average absolute price difference, percentage of "good enough" estimates) reflect the model's real-world utility and are understandable to stakeholders.
-   **Specific Metrics for Price Prediction**:
    * **RMSLE (Root Mean Squared Log Error)**: A model-centric metric that balances absolute and percentage errors, suitable for data with a wide range of values like prices.
    * **Average Absolute Price Difference**: A simple, human-understandable business metric showing the raw difference between predicted and actual prices.
    * **Percentage of "Good Enough" Estimates**: A business metric where success is defined by predictions falling within a predefined absolute or percentage tolerance (e.g., within $40 or 20%). This will be a key metric for the project.
-   **Advantages of the Chosen Project's Metrics**: The price prediction task allows for straightforward, human-understandable business metrics that are directly tied to the model's output, facilitating clear assessment of its performance.
-   **Training and Validation Loss**: These are fundamental model-centric metrics monitored during training. A decreasing training loss shows the model is fitting the training data, while a decreasing validation loss indicates that the improvement carries over to unseen data.
-   **Limitations of MSE and Simple Differences**: Mean Squared Error (MSE) can be heavily skewed by large errors on expensive items. Simple absolute or percentage differences have their own biases (absolute difference penalizes expensive items more; percentage difference can be harsh on cheap items).
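
As a minimal sketch, the two business-centric metrics above could be computed as follows. The function names and sample prices are hypothetical. Counting an estimate as "good enough" when it is within $40 *or* within 20% of the actual price is an assumption based on the example tolerances mentioned above, not the course's implementation.

```python
def average_absolute_difference(predictions, actuals):
    """Mean of |prediction - actual|, in dollars."""
    errors = [abs(p - a) for p, a in zip(predictions, actuals)]
    return sum(errors) / len(errors)


def good_enough_rate(predictions, actuals, abs_tol=40.0, pct_tol=0.20):
    """Fraction of predictions within abs_tol dollars OR pct_tol of the actual price.

    The default tolerances are illustrative assumptions.
    """
    hits = 0
    for p, a in zip(predictions, actuals):
        error = abs(p - a)
        if error <= abs_tol or error <= pct_tol * a:
            hits += 1
    return hits / len(predictions)


# Hypothetical predictions and actual prices, for illustration only
predictions = [950.0, 110.0, 30.0, 500.0]
actuals = [1000.0, 60.0, 25.0, 300.0]

print(average_absolute_difference(predictions, actuals))
print(good_enough_rate(predictions, actuals))
```

In this made-up sample, the $50 miss on the $1000 item passes the percentage test, while the same $50 miss on the $60 item fails both tests. That is the kind of distinction a plain average absolute difference hides.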

### Conceptual Understanding
-   **Model-Centric vs. Business-Centric Metrics**
    1.  **Why is this concept important?** It bridges the gap between technical model development and real-world impact. While data scientists use model-centric metrics to optimize and debug models, business stakeholders need to understand performance in terms of tangible outcomes and value.
    2.  **How does it connect to real-world tasks, problems, or applications?** In a customer churn prediction model, 'AUC-ROC' might be a model-centric metric, while 'reduction in customer churn rate' or 'cost savings from retained customers' would be business-centric. For the lecture's price prediction, 'training loss' is model-centric, while 'average dollar amount the predictions are off by' is business-centric.
    3.  **Which related techniques or areas should be studied alongside this concept?** Key Performance Indicators (KPIs) definition, stakeholder communication, cost-benefit analysis, and aligning model objectives with business goals. Understanding the limitations of each type of metric is also crucial.

-   **Root Mean Squared Log Error (RMSLE)**
    1.  **Why is this concept important?** RMSLE is particularly useful when predicting positive numerical values that span a large range and when relative errors matter more than absolute ones. Because the error is measured in log space, it behaves roughly like a relative error: a given dollar miss on an expensive item counts for less than the same dollar miss on a cheap item. This makes RMSLE less sensitive than MSE to errors that are large in absolute terms but small in relative terms. It is also asymmetric in dollar terms: for the same absolute miss, an under-prediction is penalized more heavily than an over-prediction.
    2.  **How does it connect to real-world tasks, problems, or applications?** It's commonly used in competitions and real-world scenarios involving predictions of things like sales, demand, or income, where a 10-unit error on a 100-unit item is more significant than a 10-unit error on a 1000-unit item. In the lecture's context, it appropriately evaluates price predictions across a wide range of product costs.
    3.  **Which related techniques or areas should be studied alongside this concept?** Other regression evaluation metrics (MAE, MSE, R-squared), log transformation of target variables, understanding the impact of data distribution on metric choice, and error analysis. The formula is $\text{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\log(p_i + 1) - \log(a_i + 1))^2}$, where $p_i$ is the prediction, $a_i$ is the actual value, and $n$ is the number of data points. The `+1` keeps the logarithm defined when a prediction or actual value is zero.
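
The RMSLE formula can be sketched in a few lines of Python. This is an illustration rather than the course's code: the function name `rmsle` and the sample prices are hypothetical, and the natural logarithm is assumed because the formula does not specify a base.

```python
import math


def rmsle(predictions, actuals):
    """Root Mean Squared Log Error, using the natural log of (value + 1)."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2  # log1p(x) computes log(x + 1)
        for p, a in zip(predictions, actuals)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# The same $50 miss weighs more on a cheap item than on an expensive one
print(rmsle([110.0], [60.0]))      # $50 error on a $60 item
print(rmsle([1050.0], [1000.0]))   # $50 error on a $1000 item

# For the same absolute miss, under-prediction is penalized more than over-prediction
print(rmsle([50.0], [100.0]))      # under by $50
print(rmsle([150.0], [100.0]))     # over by $50
```

In each pair, the first call prints the larger value, which reflects the metric's sensitivity to relative error and its heavier penalty on under-prediction.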

### Reflective Questions
1.  **Application:** If you were building a model to predict the number of daily rentals for a bike-sharing service, which type of metric (model-centric or business-centric) would you prioritize when communicating with city officials, and what specific metric example would you use?
    * *Answer:* I would prioritize a business-centric metric. For example, I might use the "average percentage error in predicting daily demand," as this directly relates to their operational concerns like bike availability and rebalancing, and is more intuitive than a raw loss value.

2.  **Teaching:** How would you explain to a junior colleague why "Average Absolute Price Difference" might be a good starting metric for the e-commerce price prediction project, but also mention one of its limitations?
    * *Answer:* It's a great starting point because it's simple for anyone to understand – "our model is off by an average of $X." However, a limitation is that a $50 error on a $1000 TV (5%) looks the same as a $50 error on a $60 toaster (83%), so it doesn't capture the relative severity equally across different price points.
