# Review of DSC 340: Recurrent Neural Networks, Transformers, and Large Language Models

## Recurrent Neural Networks
### DSC 340 Course Material
* [DSC 340 Lecture Notes on RNNs](DSC340Notes/RNNLectureNotes.ipynb)
* [DSC 340 Lecture Notes on RNNs for NLP](DSC340Notes/RNN_NLP.ipynb)
* [DSC 340 Assignment on RNNs](DSC340Notes/RNN_HW.ipynb)
* [DSC 340 Assignment on RNNs for NLP](DSC340Notes/RNN_NLP_HW%20.ipynb)

### External Resources
* [NLP From Scratch: Classifying Names with a Character-Level RNN](https://docs.pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)
* [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
* [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)
* [Awesome Recurrent Neural Networks](https://github.com/kjw0612/awesome-rnn) (List of various resources)
* [Stanford CS224N Natural Language Processing with Deep Learning](https://www.youtube.com/playlist?list=PLoROMvodv4rOaMFbaqxPDoLWjDaRAdP9D)
* [Introduction to RNNs for NLP](https://web.eecs.utk.edu/~hqi/deeplearning/lecture15-rnn-nlp.pdf)
* [Introduction to RNNs](https://courses.physics.illinois.edu/cs447/fa2020/Slides/Lecture11.pdf)
* [Deep Learning, NLP, and Representations](https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)

### Most Important Concepts
* **RNNs vs. Feedforward Neural Networks**: RNNs and standard feedforward neural networks differ in structure (see the DSC 340 lecture notes on RNNs), and so they differ in how they treat data. A feedforward network learns a mapping from inputs x to outputs y but treats each data point as independent of the others. An RNN is designed for sequential data: it carries information forward from one step to the next through a hidden state. This lets an RNN recognize patterns that unfold over time, which makes RNNs well suited to time series data, audio analysis, and natural language processing.
* **Hidden States**: Hidden states in recurrent neural networks (RNNs) act as the model’s internal memory, allowing it to retain information from earlier steps in a sequence and use that information to influence later predictions. At each time step, the hidden state is updated based on both the current input and the previous hidden state, creating a chain of dependencies that helps the network capture patterns such as rhythm, grammar, or trends over time. This mechanism gives RNNs their strength: they can model context, temporal relationships, and long‑range structure in data in a way that feed‑forward networks cannot.
* **LSTM and GRU Layers**: Long Short‑Term Memory (LSTM) and Gated Recurrent Unit (GRU) layers are advanced variants of recurrent neural networks designed to address the limitations of standard RNNs, especially the problem of vanishing gradients during long sequences. Both introduce gating mechanisms that regulate how information is stored, forgotten, and passed forward through time. LSTMs use three gates—input, forget, and output—to carefully control their internal cell state, allowing them to preserve information over long intervals. GRUs streamline this design into two gates—update and reset—making them computationally simpler while still capturing long‑range dependencies effectively. Because of these gating structures, LSTM and GRU layers outperform basic RNNs on tasks involving complex temporal patterns, such as language modeling or time‑series prediction.
* **Tokenization and Embedding**: Tokenization is the process of breaking text into parts, usually at the word, subword, or character level. Each unique token is then assigned an integer to represent it. Often, but not always, the token that occurs most frequently is assigned 1, the second most frequent is assigned 2, and so on. Tokenization breaks large blocks of text into smaller building blocks so that machine learning models can find patterns among the tokens. The numerical encoding is needed because a neural network operates on numbers, not raw text. Embedding is the process of converting these discrete tokens into continuous numerical vectors that a model can learn from. The vectors capture semantic relationships by placing tokens with similar meanings closer together in the embedding space, which helps neural networks recognize patterns in language more effectively. The length of the vectors is a hyperparameter chosen by the user.
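
The hidden-state update described in the list above can be sketched in a few lines of plain Python. This is a minimal illustration of a vanilla RNN cell, not the course's implementation: the function names `rnn_step` and `run_rnn`, the tanh activation, and the toy weights are assumptions chosen for readability. In practice a library layer such as PyTorch's `torch.nn.RNN` would normally be used instead.

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN update: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b_h)."""
    h_t = []
    for i in range(len(b_h)):
        total = b_h[i]
        total += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def run_rnn(sequence, W_xh, W_hh, b_h):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = [0.0] * len(b_h)  # initial hidden state: nothing remembered yet
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states

# Hypothetical toy weights: 1 input feature, 2 hidden units.
W_xh = [[0.5], [-0.3]]
W_hh = [[0.1, 0.2], [0.0, 0.4]]
b_h = [0.0, 0.0]

# Both sequences end with the same input (0.0) but start differently.
seq_a = [[1.0], [0.0]]
seq_b = [[0.0], [0.0]]
states_a = run_rnn(seq_a, W_xh, W_hh, b_h)
states_b = run_rnn(seq_b, W_xh, W_hh, b_h)

print(states_a[-1])
print(states_b[-1])
```

The two sequences share the same final input, yet their final hidden states differ because the first sequence's earlier input is still reflected in the state. A feedforward network given only the final input would produce the same output for both.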

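Tokenization and embedding lookup can be illustrated with a small word-level example. The sketch below is one possible approach under stated assumptions: the helper names (`build_vocab`, `encode`, `make_embedding_table`) are hypothetical, index 0 is reserved here for unknown words, and the embedding vectors are random starting values. In a real model the vectors are learned during training, for example through a layer such as PyTorch's `torch.nn.Embedding`.

```python
import random
from collections import Counter

def build_vocab(texts):
    """Map each word to an integer, with the most frequent word getting 1.

    Index 0 is reserved for unknown words (an assumption of this sketch).
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count; break ties alphabetically so the result is stable.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered, start=1)}

def encode(text, vocab):
    """Convert text to a list of token ids, using 0 for out-of-vocabulary words."""
    return [vocab.get(word, 0) for word in text.lower().split()]

def make_embedding_table(vocab_size, dim, seed=0):
    """Create one random vector per token id (ids 0..vocab_size).

    dim is the embedding length, a hyperparameter chosen by the user.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(vocab_size + 1)]

corpus = ["the cat sat", "the dog sat", "the cat ran"]
vocab = build_vocab(corpus)
ids = encode("the cat barked", vocab)

table = make_embedding_table(len(vocab), dim=4)
vectors = [table[i] for i in ids]  # embedding lookup is just row indexing

print(vocab)
print(ids)
```

Here "the" is the most frequent word and receives id 1, while "barked" never appears in the corpus and falls back to id 0.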

## Transformers and Large Language Models

### DSC 340 Course Material
* [DSC 340 Slides on Transformers and Large Language Models](DSC340Notes/TransformersAndLargeLanguageModels.pdf)
* [DSC 340 Notes on Transformers](DSC340Notes/Transformers.pdf)
* [DSC 340 Homework on Transformers](DSC340Notes/Transformer_HW.ipynb)
* [DSC 340 Homework on Large Language Models](DSC340Notes/LLM_Shakespeare_HW.ipynb)

### External Resources
* [Introduction to Attention, Transformers and Large Language Models - Part 1](https://communities.sas.com/t5/SAS-Communities-Library/Introduction-to-Attention-Transformers-and-Large-Language-Models/ta-p/932212)
* [How do Transformers work?](https://huggingface.co/learn/llm-course/en/chapter1/4)
* [Attention and Augmented Recurrent Neural Networks](https://distill.pub/2016/augmented-rnns/)
* [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
* [Transformers, the tech behind LLMs](https://www.youtube.com/watch?v=wjZofJX0v4M)
* [Attention in transformers, step-by-step](https://www.youtube.com/watch?v=eMlx5fFNoYc)
* [Large Language Models explained briefly](https://www.youtube.com/watch?v=LPZh9BOjkQs)
* [Intro to Large Language Models](https://www.youtube.com/watch?v=zjkBMFhNj_g)

### Most Important Concepts
* **Architecture of a Transformer**: A transformer is a neural network that processes sequences using attention instead of recurrence. The original design has two main parts: an encoder, which reads and processes the input, and a decoder, which generates the output. (Many large language models use only the decoder stack.) Each part is a stack of layers that combine self-attention, which determines how strongly each word in a sentence should influence every other word, with a small feed-forward network for further processing. Because a transformer looks at all words at once rather than one at a time, it can be trained in parallel across the sequence and can handle long-range relationships that older models like RNNs struggle with.
* **Attention Mechanism**: The attention mechanism is a tool that helps a neural network decide which parts of an input matter most at any moment. Instead of treating every word or feature as equally important, attention assigns different weights to different elements, allowing the model to “focus” on the information that best supports its prediction. This makes attention especially powerful for tasks like language understanding, where the meaning of one word often depends on another word far away in the sentence. By dynamically highlighting relevant pieces of information, attention helps modern architectures—such as transformers—capture long‑range relationships more effectively than older models like RNNs or CNNs.
* **Multi-Headed Attention Mechanism**: Multi‑headed attention is a key component of transformer models that allows them to look at different parts of a sentence in multiple ways at the same time. Instead of using a single attention calculation, the model splits its attention into several “heads,” with each head learning to focus on different relationships—such as syntactic structure, long‑distance dependencies, or subtle contextual cues. Each head produces its own representation of how tokens relate to one another, and these representations are then combined to give the model a richer, more nuanced understanding of the input. This parallel attention process makes transformers far more powerful than earlier single‑path attention or recurrent models, enabling them to capture complex patterns efficiently and accurately.
* **Positional Embedding**: Self-attention by itself has no notion of word order: it compares every token with every other token regardless of where they appear in the sequence. Positional embeddings address this by giving the model information about each token's position. A position-dependent vector is combined with (typically added to) each token's embedding before the first attention layer. These vectors can be fixed, such as the sinusoidal encodings used in "Attention Is All You Need," or learned during training like any other embedding. With position information included, the model can distinguish sentences that contain the same words in a different order, such as "the dog chased the cat" and "the cat chased the dog."
* **Large Language Models**: A large language model is a neural network made of many transformer layers that process text as sequences of tokens. Each token is turned into a numerical representation and combined with information about its position in the sentence. Using self-attention, the model looks at how tokens relate to one another to understand context. The model is trained to predict the next token in a sequence, which allows it to learn patterns in language and generate meaningful text.
* **Decoding Strategies**: Greedy decoding is the simplest decoding strategy used in language generation: at each step, the model selects the single most probable next token. This makes it fast and deterministic but often prone to repetitive output. More advanced strategies aim to improve the results. Beam search keeps several candidate sequences at each step instead of one, which lets it follow multiple promising paths and choose a sequence with higher overall probability, though at higher computational cost. Sampling-based approaches like top-k and top-p sampling introduce controlled randomness by sampling from a subset of likely tokens. Top-k limits the choices to the k most probable tokens, while top-p selects from the smallest set of tokens whose cumulative probability exceeds a threshold p. Temperature scaling reshapes the probability distribution before a token is chosen: a temperature below 1 sharpens the distribution and makes output less random, while a temperature above 1 flattens it and makes output more varied.
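
The weighting idea behind attention can be made concrete with scaled dot-product attention, the formulation used in "Attention Is All You Need." The following is a minimal single-head sketch in plain Python. As a simplifying assumption, the queries, keys, and values are all set to the raw token vectors; a real transformer first applies separate learned projections to produce each of them, and multi-headed attention runs several such computations in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """For each query, weight every value by how well its key matches the query."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors (hypothetical values), used as Q, K, and V alike.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(X, X, X)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how much one token attends to every token in the sequence. The first token places more weight on itself and the third token, which overlap with it, than on the second token, which does not.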

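Before attention is applied, each token's representation is combined with information about its position. One well-known way to do this is the fixed sinusoidal encoding from "Attention Is All You Need," sketched below. This is only one option, and many models learn their position vectors instead. The function names and the toy embedding values are assumptions made for illustration.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build one position vector per position.

    Even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use
    cos of the same angle, where i indexes each (sin, cos) pair.
    """
    pe = []
    for pos in range(seq_len):
        row = []
        for dim in range(d_model):
            pair = dim // 2
            angle = pos / (10000 ** (2 * pair / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def add_positions(token_embeddings):
    """Add the position vector to each token embedding, element by element."""
    seq_len, d_model = len(token_embeddings), len(token_embeddings[0])
    pe = sinusoidal_positional_encoding(seq_len, d_model)
    return [[e + p for e, p in zip(emb, pos)]
            for emb, pos in zip(token_embeddings, pe)]

# The same (hypothetical) token embedding repeated at three positions.
same_token = [0.5, 0.5, 0.5, 0.5]
embedded = add_positions([same_token, same_token, same_token])

for row in embedded:
    print([round(v, 3) for v in row])
```

Although the same token appears three times, each copy ends up with a different vector after the position information is added, which is what lets the model tell the occurrences apart.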

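The decoding strategies can be compared on a single next-token distribution. The sketch below covers greedy selection, temperature scaling, top-k, and top-p for one decoding step, while beam search is left out because it tracks whole sequences over many steps. The logits are made-up values and the function names are hypothetical. In practice a library's generation utilities would usually handle this.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Pick the index of the single most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k_candidates(probs, k):
    """Keep the indices of the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]

def top_p_candidates(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability exceeds p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative > p:
            break
    return kept

def sample_from(probs, candidates, rng):
    """Sample one candidate in proportion to its probability."""
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Made-up logits for a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax_with_temperature(logits)

greedy_token = greedy(probs)
k_set = top_k_candidates(probs, k=3)
p_set = top_p_candidates(probs, p=0.8)
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)

rng = random.Random(0)
sampled_token = sample_from(probs, p_set, rng)
print(greedy_token, k_set, p_set, sampled_token)
```

Greedy decoding always returns the same token for a given distribution, while the sampled token can vary from run to run unless the random seed is fixed, as it is here.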


