# LLM Interview Topics

## Architecture & Training

### Transformer Architecture (attention mechanisms)

### Pre-training vs Fine-tuning

### Training Objectives (next token prediction)

### Context Window and Position Embeddings

### Tokenization Strategies

### Model Scaling Laws

### Parameter Efficient Fine-tuning (LoRA, QLoRA, Prefix Tuning)

### Distillation

What is knowledge distillation?

When did knowledge distillation appear as a technique?
* The ideas behind knowledge distillation (KD) date back to 2006, when Bucilă, Caruana, and Niculescu-Mizil showed in “Model Compression” that an ensemble of models could be compressed into a single smaller model without much loss in accuracy. A cumbersome model (like an ensemble) could thus be replaced by a lean model that is easier to deploy.
* In 2015, Geoffrey Hinton, Oriol Vinyals, and Jeff Dean coined the term “distillation” in their paper “Distilling the Knowledge in a Neural Network”. The term refers to the process of transferring knowledge from a large, complex model or ensemble to a smaller, faster model, called the distilled model.
    * Instead of training the smaller model only on the correct answers (hard labels), the authors proposed also training it on the probability distribution produced by the large model (soft targets).
    * This helps the smaller model learn not just what the right answer is, but also how confident the large model is about each option. The soft targets come from a softmax whose temperature is raised to flatten the distribution.
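
The soft-target idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training recipe: the names `softmax` and `distillation_loss`, the "teacher"/"student" labels (common shorthand for the large and the distilled model), and the default values `temperature=2.0` and `alpha=0.5` are assumptions chosen for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over raw scores; a higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term (illustrative weights)."""
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Cross-entropy between the teacher's and the student's softened distributions.
    soft_term = -sum(t * math.log(s) for t, s in zip(teacher_soft, student_soft))
    # Ordinary cross-entropy against the ground-truth label, at temperature 1.
    hard_term = -math.log(softmax(student_logits)[true_index])
    # Scaling by T^2 keeps the two terms on a comparable scale, as Hinton et al. suggest.
    return alpha * temperature ** 2 * soft_term + (1 - alpha) * hard_term
```

A student whose logits match the teacher's gets a lower loss than one that ranks the classes in the opposite order, which is the signal that hard labels alone would only partly provide.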

Types of knowledge distillation

Improved algorithms

Distillation scaling laws

Benefits

Not without limitations

Real-world effective use cases (why OpenAI got mad at DeepSeek)

---
## Generation Control

### Temperature and Top-p Sampling

### Prompt Engineering Techniques

### Few-shot Learning

### In-context Learning

### Chain-of-Thought Prompting

### Hallucination Mitigation



---
## LLM Evaluation - Metrics

### Precision@K

<ins>Precision@K</ins>: A common metric to evaluate the performance of ranking algorithms (search systems, RAG, etc.). It is the share of the top-K recommended items that are relevant. As such, it answers the question: out of the top-K items suggested, how many are actually relevant to the user?
* K is the cut-off you choose for the evaluation. It depends on how many items you expect your user to interact with.
* Relevance is a binary label that is specific to your use case. For example, you might measure an item as relevant if a user clicks on it, adds it to a shopping cart, or purchases it.
* <ins>Used in</ins>: ranking algorithms (search systems, RAG)
* <ins>Range</ins>: 0-1, a higher value means better performance
* <ins>Limitations</ins>:
    * Only reflects the number of relevant items in the top K, doesn't evaluate the ranking quality inside of K

<ins>Equation</ins>: $\text{Precision@K} = \frac{\text{Number of relevant items in K}}{\text{Total number of items in K}}$
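
As a minimal sketch of the equation above: the function name `precision_at_k` and the document IDs are made up for illustration, and the denominator is taken to be K even when fewer than K items were returned (one common convention, assumed here).

```python
def precision_at_k(ranked_items, relevant_items, k):
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    # Assumption: divide by k, even if the system returned fewer than k items.
    return hits / k


ranked = ["d1", "d2", "d3", "d4", "d5"]   # system output, best first
relevant = {"d1", "d3", "d9"}             # binary relevance labels
print(precision_at_k(ranked, relevant, 5))  # 0.4 (2 relevant items out of 5)
```

Swapping `d1` and `d3` in `ranked` leaves the result unchanged, which illustrates the limitation noted above: the metric ignores the order inside the top K.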


### Recall@K 

<ins>Recall@K</ins>: A common metric to evaluate the performance of ranking algorithms (search systems, RAG, etc.). It measures the share of all relevant items that are captured within the top K positions.
* <ins>Used in</ins>: ranking algorithms (search systems, RAG)
* <ins>Equation</ins>: $\text{Recall@K} = \frac{\text{Number of relevant items in K}}{\text{Total number of relevant items}}$
* <ins>Range</ins>: 0-1, a higher value means better performance
* <ins>Limitations</ins>:
    * Only reflects the number of relevant items in the top K, doesn't evaluate the ranking quality inside of K
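
A minimal sketch of Recall@K, which divides the number of relevant items found in the top K by the total number of relevant items. The function name `recall_at_k`, the document IDs, and the choice to return 0.0 when there are no relevant items are assumptions made for this example.

```python
def recall_at_k(ranked_items, relevant_items, k):
    """Share of all relevant items that appear in the top-k ranked items."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if not relevant_items:
        return 0.0  # assumption: define recall as 0 when nothing is relevant
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    return hits / len(relevant_items)


ranked = ["d1", "d2", "d3", "d4", "d5"]     # system output, best first
relevant = {"d1", "d3", "d9", "d10"}        # every relevant item in the corpus
print(recall_at_k(ranked, relevant, 3))     # 0.5 (2 of the 4 relevant items)
```

Unlike Precision@K, the denominator here requires knowing every relevant item in the corpus, not only the ones that were retrieved.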

### F-Score@K

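<ins>F-Score@K</ins>: A common metric to evaluate the performance of ranking algorithms (search systems, RAG, etc.). It provides a balanced measure of Precision@K and Recall@K by taking their harmonic mean.
* <ins>Used in</ins>: ranking algorithms (search systems, RAG)
* <ins>Equation</ins>: $\text{F-Score@K} = \frac{2 \cdot \text{Precision@K} \cdot \text{Recall@K}}{\text{Precision@K} + \text{Recall@K}}$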
* <ins>Range</ins>: 0-1, a higher value means better performance
* <ins>Limitations</ins>:
    * Only reflects the number of relevant items in the top K, doesn't evaluate the ranking quality inside of K
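
A minimal sketch of F-Score@K as the harmonic mean of Precision@K and Recall@K. The function name `f_score_at_k` is made up for illustration, and the optional `beta` parameter (the general F-beta form, where `beta=1` weights precision and recall equally) is an assumption added here. The ranked list is assumed to contain no duplicates.

```python
def f_score_at_k(ranked_items, relevant_items, k, beta=1.0):
    """Harmonic-style mean of Precision@K and Recall@K (beta=1 gives F1)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when nothing relevant was found
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d9", "d10"}
score = f_score_at_k(ranked, relevant, 3)  # precision 2/3, recall 1/2
```

With these inputs the score is 4/7, roughly 0.571, which sits between the recall (0.5) and the precision (about 0.667) and closer to the lower of the two.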

### Apology Rate

<ins>Apology Rate</ins>: 
* <ins>Used in</ins>: 
* <ins>Equation</ins>: 
* <ins>Range</ins>: 
* <ins>Limitations</ins>:

### No Response Rate

<ins>No Response Rate</ins>: 
* <ins>Used in</ins>: 
* <ins>Equation</ins>: 
* <ins>Range</ins>: 
* <ins>Limitations</ins>:

### Perplexity

<ins>Perplexity</ins>: 
* <ins>Used in</ins>: 
* <ins>Equation</ins>: 
* <ins>Range</ins>: 
* <ins>Limitations</ins>:

### ROUGE Scores

<ins>ROUGE Scores</ins>: 
* <ins>Used in</ins>: 
* <ins>Equation</ins>: 
* <ins>Range</ins>: 
* <ins>Limitations</ins>:

### BLEU Score

<ins>BLEU Score</ins>: 
* <ins>Used in</ins>: 
* <ins>Equation</ins>: 
* <ins>Range</ins>: 
* <ins>Limitations</ins>:

### Intelligence

<ins>Intelligence</ins>: 
* <ins>Used in</ins>: 
* <ins>Equation</ins>: 
* <ins>Range</ins>: 
* <ins>Limitations</ins>:

### Relevance

<ins>Relevance</ins>: 
* <ins>Used in</ins>: 
* <ins>Equation</ins>: 
* <ins>Range</ins>: 
* <ins>Limitations</ins>:

---
## Benchmark Metrics

### ELO

<ins>ELO</ins>: 
* <ins>Used in</ins>: 
* <ins>Equation</ins>: 
* <ins>Range</ins>: 
* <ins>Limitations</ins>:

---

## LLM Eval Techniques

### LLM-as-a-judge

### Human Evaluation Methods

### Evaluation Data Sets
* Benchmark Datasets (MMLU, BigBench, HumanEval)

### A/B Testing Between Chatbots

### Bias Detection

---
## LLM Interpretability

### Dictionary Learning

<ins>Dictionary Learning</ins>: a classical machine learning technique that Anthropic's interpretability team scaled up, using sparse autoencoders, to uncover millions of recurring neuron activation patterns or “features” in an LLM and match them to human concepts. Think of it as building a rough glossary for the brain of an LLM.
* <ins>Used in</ins>: 
* <ins>Citations</ins>: 
    * [Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)
* <ins>Range</ins>: 
* <ins>Limitations</ins>:

### Monosemanticity

<ins>Monosemanticity</ins>: the idea that a single neuron pattern might align with a single meaning: a clean signal, not a messy blend. Theoretical work laid out by Anthropic's interpretability team.
* <ins>Used in</ins>: 
* <ins>Citations</ins>: 
    * [Towards Monosemanticity: Decomposing Language Models With Dictionary Learning](https://www.anthropic.com/research/towards-monosemanticity-decomposing-language-models-with-dictionary-learning)
* <ins>Range</ins>: 
* <ins>Limitations</ins>:

### Attribution Graphs

<ins>Attribution Graphs</ins>: A technique used by Anthropic's interpretability team to trace how Claude 3.5 Haiku reasons across multiple steps – writing poems, diagnosing patients, even planning ahead. The findings make one thing clear: these models aren’t just completing sentences. They’re constructing thoughts.
* <ins>Used in</ins>: 
* <ins>Citations</ins>: 
    * [On the Biology of a Large Language Model](https://transformer-circuits.pub/2025/attribution-graphs/biology.html)
* <ins>Range</ins>: 
* <ins>Limitations</ins>:

---
## Optimization & Deployment

### Quantization Techniques (4-bit, 8-bit)

### Model Distillation

### Prompt Caching

### Model Merging

### Inference Optimization

### Load Balancing

### Latency Management

### Cost Optimization

---

## Safety & Ethics

### Content Filtering

### Output Sanitization

### Jailbreak Prevention

### Data Privacy


---
## Retrieval Augmented Generation (RAG)

<ins>RAG</ins>: an architectural framework, introduced in 2020 by Meta AI Research (then Facebook AI Research), that enhances LLMs by incorporating external data retrieval mechanisms. This integration gives LLMs access to current, relevant information, thereby addressing a limitation of traditional generative models, which rely solely on static training data. By retrieving pertinent documents or data points in response to specific queries, RAG grounds the generated output in source material, which makes it more likely to be contextually appropriate and factually accurate and reduces the incidence of outdated or erroneous information. This capability is particularly beneficial in applications such as customer support and knowledge management, where timely and precise responses are critical.

The primary methods employed in RAG involve a two-stage process: 
1. Retrieving relevant information from a curated set of external sources.
2. Utilizing this information to inform the generation of responses. 

This dual approach allows RAG to dynamically augment the generative capabilities of LLMs with up-to-date context, enhancing their performance across various tasks. Techniques such as vector-based retrieval and query expansion are commonly used to improve the relevance and accuracy of the retrieved information. Furthermore, RAG systems can be designed to include mechanisms for citation and source attribution, enabling users to verify the accuracy of the generated content and fostering trust in AI outputs.

### Issues with RAG
Despite its advantages, implementing RAG poses several challenges that organizations must navigate. One significant hurdle is the complexity of integrating retrieval systems with generative models, which requires specialized knowledge in both natural language processing and information retrieval. Additionally, the effectiveness of a RAG system is heavily dependent on the quality and reliability of the external data sources it utilizes; poor-quality data can lead to misleading outputs or propagate inaccuracies. Latency issues can also arise during retrieval operations, particularly when accessing large datasets or multiple sources simultaneously, potentially impacting user experience in time-sensitive applications.

### RAG Implementation Process (2 Steps)

#### 1. Retrieval QA Chain: Ingestion
1. Ingest documents
2. Split documents into chunks
3. Convert chunks into vectors via an embedding model
4. Create an index of vectors (Vector Store)
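
Step 2 is the part of ingestion that is easiest to show in isolation. The sketch below splits text into overlapping word-based chunks. The function name `split_into_chunks` and the default sizes are assumptions for illustration; production systems often chunk by tokens, sentences, or document structure instead.

```python
def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks that overlap by `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some words twice.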

#### 2. Retrieval QA Chain: Query Time
1. User queries model
2. Convert user query into vector using embedding model
3. Conduct similarity search using query vector and vector store
4. Retrieve top-k relevant documents
5. LLM generates response using documents as additional context
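
The query-time steps above can be sketched end to end with a toy setup. Everything here is an assumption for illustration: `embed` is a bag-of-words stand-in for a real embedding model, the vector store is a plain Python list, and `build_prompt` only assembles the text that would be sent to an LLM (no model is called).

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, vector_store, k=2):
    """Steps 2-4: embed the query, score every chunk, return the top-k chunks."""
    query_vector = embed(query)
    scored = [(cosine_similarity(query_vector, vector), chunk)
              for chunk, vector in vector_store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


def build_prompt(query, context_chunks):
    """Step 5 (input side): put the retrieved chunks in front of the question."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


chunks = [
    "LoRA adds low-rank adapter matrices to frozen weights",
    "Perplexity measures how well a model predicts text",
    "Quantization stores weights in 4-bit or 8-bit precision",
]
vector_store = [(chunk, embed(chunk)) for chunk in chunks]  # built at ingestion time
question = "which precision does quantization use"
prompt = build_prompt(question, retrieve(question, vector_store, k=1))
```

Here the quantization chunk is the only one sharing words with the question, so it is the one placed in the prompt. A real system would replace the linear scan with an approximate nearest-neighbour index and the word counts with dense embeddings.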

### Secondary Concepts
* REALM Technique (RAG during pretraining)
* Speculative RAG (Google Research)
* Corrective RAG
* Self-RAG (Allen AI)
* Fusion-RAG
* Graph RAG (Microsoft Research)
* Hypothetical Document Embeddings (HyDE) - a retrieval technique, published after RAG, that embeds an LLM-generated hypothetical answer instead of the raw query
* RAG vs. Fine-Tuning (UC Berkeley paper, retrieval augmented fine tuning)




---
## Multimodal Retrieval Augmented Generation (MLLM RAG)

### Retrieval QA Chain: Ingestion
Same flow as standard RAG, but for images do the following:
* Extract images from files
* Send images to an MLLM to generate an image description
* Create embeddings of each image with a multimodal embedding model
* Create embeddings of the MLLM-generated text description with a text embedding model
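
The image branch above can be sketched as a small data structure plus a loop. This is only an outline under stated assumptions: `ImageRecord` and `ingest_images` are hypothetical names, and `describe_image`, `embed_image`, and `embed_text` are placeholders for whatever MLLM and embedding models you use; they are passed in as callables so that no specific library API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class ImageRecord:
    source: str                 # where the image was extracted from
    description: str            # MLLM-generated description of the image
    image_vector: List[float]   # multimodal embedding of the image itself
    text_vector: List[float]    # text embedding of the description


def ingest_images(
    images: Iterable[Tuple[str, bytes]],
    describe_image: Callable[[bytes], str],
    embed_image: Callable[[bytes], List[float]],
    embed_text: Callable[[str], List[float]],
) -> List[ImageRecord]:
    """Build one record per extracted image, holding both embeddings."""
    records = []
    for source, image_bytes in images:
        description = describe_image(image_bytes)
        records.append(ImageRecord(
            source=source,
            description=description,
            image_vector=embed_image(image_bytes),
            text_vector=embed_text(description),
        ))
    return records
```

Storing both vectors lets a text query match an image either through the description's text embedding or directly through the multimodal embedding.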



---

## Agents

### Model Context Protocol (MCP)
* [A Deep Dive Into MCP and the Future of AI Tooling](https://a16z.com/a-deep-dive-into-mcp-and-the-future-of-ai-tooling/)

--- 

## Product Sense - Good vs. Bad LLM Use Cases

### Good LLM Use Cases

### Bad LLM Use Cases