# Large Language Models

## Recap

### VAEs, GANs, and Diffusion Models

- **Variational Autoencoders (VAEs)**: Encode data into a latent space and decode it back, allowing for generation of new data.

- **Generative Adversarial Networks (GANs)**: Consist of a generator and a discriminator, where the generator creates data and the discriminator evaluates it, leading to improved generation over time.

- **Diffusion Models**: Start with noise and iteratively refine it to generate data, often producing high-quality outputs.



## Transformers

### Key Concepts

- **Self-Attention**: Mechanism that lets each token in a sequence weigh the importance of every other token, enabling the model to capture long-range dependencies.

- **Positional Encoding**: Adds information about the position of each token in the sequence, since self-attention on its own is insensitive to token order.

- **Multi-Head Attention**: Uses multiple attention mechanisms in parallel, allowing the model to focus on different parts of the input simultaneously.
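
To make the self-attention idea concrete, here is a minimal, dependency-free sketch of single-head scaled dot-product attention. It is an illustration only: the toy embeddings are made up, and the token vectors are used directly as queries, keys, and values, whereas a real transformer first passes them through learned projection matrices.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, attention = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        attention.append(weights)
    return outputs, attention


# Three toy token embeddings of dimension 2 (made-up numbers, assumption).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attention = self_attention(tokens, tokens, tokens)
for row in attention:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence. Multi-head attention would run several such computations in parallel, each with its own projections, and concatenate the results.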


### Architecture

- **Encoder-Decoder Structure**: The transformer consists of an encoder that processes the input and a decoder that generates the output. Each is a stack of layers combining self-attention and feed-forward networks; decoder layers additionally attend to the encoder output (cross-attention).
- **Layer Normalization**: Applied to stabilize and speed up training by normalizing the inputs to each layer.
- **Feed-Forward Networks**: Each layer includes a fully connected feed-forward network that processes the output of the self-attention mechanism.
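
As a rough sketch of how the feed-forward network and layer normalization fit together, the following shows one sublayer applied to a single token vector. It assumes the post-norm arrangement with a residual connection, `LayerNorm(x + FFN(x))`, and omits the learned scale and shift of layer normalization; the weights are made-up numbers, not trained values.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def feed_forward(x, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between, applied to one token."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


def feed_forward_sublayer(x, w1, b1, w2, b2):
    """Residual connection around the feed-forward network, then layer norm."""
    ffn_out = feed_forward(x, w1, b1, w2, b2)
    residual = [xi + fi for xi, fi in zip(x, ffn_out)]
    return layer_norm(residual)


# Toy dimensions: model width 3, hidden width 4 (hypothetical values).
x = [1.0, 2.0, 3.0]
w1 = [[0.1, 0.2, 0.3], [-0.1, 0.0, 0.1], [0.5, -0.5, 0.0], [0.0, 0.3, -0.2]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[0.2, -0.1, 0.0, 0.1], [0.0, 0.3, 0.1, -0.2], [0.1, 0.1, 0.1, 0.1]]
b2 = [0.0, 0.0, 0.0]

y = feed_forward_sublayer(x, w1, b1, w2, b2)
print([round(v, 3) for v in y])
```

The output keeps the same width as the input, which is what allows many such layers to be stacked.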

### Training
- **Masked Language Modeling**: A training objective, used by encoder models such as BERT, where some tokens in the input are masked and the model learns to predict them from the surrounding unmasked context.
- **Next Sentence Prediction**: An auxiliary task, also introduced with BERT, where the model predicts whether one sentence follows another, helping it learn relationships between sentences.
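
The data preparation for masked language modeling can be sketched in a few lines. This is a simplified illustration: the helper name `mask_tokens` and the whitespace tokenization are assumptions, and BERT's actual recipe is slightly more involved (it selects about 15% of tokens and replaces some of them with random or unchanged tokens instead of `[MASK]`).

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens and keep the originals as labels."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)  # the model is trained to recover this token
        else:
            masked.append(token)
            labels.append(None)  # no loss is computed at this position
    return masked, labels


sentence = "the cat sat on the mat".split()
masked, labels = mask_tokens(sentence, mask_prob=0.5)
print(masked)
print(labels)
```

During training, the model receives `masked` as input and is penalized only at positions where `labels` is not `None`.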

### Applications

- **Natural Language Processing (NLP)**: Transformers are widely used in tasks such as translation, summarization, and question answering.
- **Computer Vision**: Transformers have been adapted for image processing tasks, such as object detection and segmentation.
- **Reinforcement Learning**: Transformers can be used to model policies and value functions in reinforcement learning settings.


### Generative Pre-trained Transformer (GPT)

Another type of generative model that has gained popularity in recent years is the Generative Pre-trained Transformer (GPT). GPT is a language model that generates text by repeatedly predicting the next token (roughly, the next word or word piece) in a sequence.

The key idea is to pre-train a transformer on a large corpus of text with this next-token objective. The pre-trained model can then be fine-tuned on a specific task, such as text generation in a particular style or text classification.



Like the other generative models in this course, GPT learns an underlying distribution of the data, here a distribution over the next token given the preceding text, and generates new text by sampling from it one token at a time. So, although its output can be very convincing, the model is only producing what it estimates is likely to come next given its input; it is not understanding the text in the way a human would.
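
Sampling from a learned next-token distribution can be illustrated with a toy example. The scores below are made up for a hypothetical prompt such as "The cat sat on the", and the `temperature` parameter is a common sampling control that the text above does not discuss; treat this as a sketch of the idea rather than how any particular model is implemented.

```python
import math
import random


def next_token_probs(logits, temperature=1.0):
    """Convert raw scores into a probability distribution (softmax)."""
    scaled = {token: score / temperature for token, score in logits.items()}
    peak = max(scaled.values())
    exps = {token: math.exp(s - peak) for token, s in scaled.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


def sample_next_token(logits, temperature=1.0, rng=None):
    """Draw one token at random according to its probability."""
    rng = rng or random.Random()
    probs = next_token_probs(logits, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Made-up scores a model might assign to candidate next tokens.
logits = {"mat": 2.0, "sofa": 1.0, "moon": -1.0}

probs = next_token_probs(logits)
sharp = next_token_probs(logits, temperature=0.1)
print({t: round(p, 3) for t, p in probs.items()})
print({t: round(p, 3) for t, p in sharp.items()})
print(sample_next_token(logits, rng=random.Random(0)))
```

Lowering the temperature concentrates probability on the highest-scoring token, which makes the output more deterministic; raising it makes the output more varied.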



Nevertheless, GPT has been shown to generate high-quality text and has been applied to a wide range of tasks, such as text generation, summarization, and classification.


# Large Language Models (LLMs)

## Key Concepts

- **Pre-training and Fine-tuning**: LLMs are typically pre-trained on large datasets to learn general language patterns and then fine-tuned on specific tasks to improve performance.
- **Transfer Learning**: LLMs leverage knowledge learned from one task to improve performance on another, often requiring less data for fine-tuning.
- **Zero-shot, One-shot, and Few-shot Learning**: LLMs can perform new tasks with no task-specific examples (zero-shot) or with one or a few examples supplied directly in the prompt (one-shot, few-shot), relying on their pre-trained knowledge rather than additional training.
- **Contextual Understanding**: LLMs can understand and generate text based on the context provided, allowing them to produce coherent and contextually relevant responses.
- **Scalability**: LLMs can be scaled up by increasing the number of parameters, leading to improved performance on various tasks, but also requiring more computational resources.
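
The difference between zero-shot and few-shot use comes down to what is placed in the prompt. The sketch below builds both kinds of prompt for a sentiment task; the `Review:` / `Sentiment:` layout and the helper name `build_prompt` are illustrative assumptions, not a required format.

```python
def build_prompt(instruction, examples, query):
    """Assemble a prompt from an instruction, optional examples, and a query."""
    lines = [instruction, ""]
    for text, label in examples:
        # Each solved example shows the model the expected input/output pattern.
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final entry is left open for the model to complete.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


instruction = "Classify the sentiment of each review as positive or negative."
examples = [
    ("The field course was excellent.", "positive"),
    ("The dataset was full of errors.", "negative"),
]
query = "The lecture notes were clear and helpful."

zero_shot = build_prompt(instruction, [], query)
few_shot = build_prompt(instruction, examples, query)
print(few_shot)
```

Either string can be passed to a text generation model as its prompt; no model weights are updated in either case.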

## Applications
- **Text Generation**: LLMs can generate coherent and contextually relevant text, making them suitable for applications like chatbots, content creation, and storytelling.
- **Question Answering**: LLMs can answer questions based on provided context or general knowledge, making them useful for customer support and information retrieval.
- **Translation**: LLMs can translate text between languages, leveraging their understanding of language structure and semantics.
- **Summarization**: LLMs can condense long texts into shorter summaries, making them useful for news aggregation and content curation.
- **Sentiment Analysis**: LLMs can analyze text to determine sentiment, helping businesses understand customer feedback and opinions.
- **Code Generation**: LLMs can generate code snippets based on natural language descriptions, assisting developers in writing software.

## Challenges and Limitations
- **Bias and Fairness**: LLMs can inherit biases present in the training data, leading to biased outputs. Addressing bias and ensuring fairness is a significant challenge.
- **Interpretability**: LLMs are often seen as "black boxes," making it difficult to understand how they arrive at specific outputs. Improving interpretability is an ongoing area of research.
- **Resource Intensive**: Training and deploying LLMs require significant computational resources, making them less accessible for smaller organizations.
- **Ethical Concerns**: The potential for misuse of LLMs, such as generating misleading or harmful content, raises ethical concerns that need to be addressed.


## Example usage of LLMs in practice

Let's run a small pre-trained language model using the Hugging Face Transformers library. This example demonstrates how to use a pre-trained model for text generation.


In [1]:
from transformers import pipeline
# Load a pre-trained text generation model
generator = pipeline('text-generation', model='gpt2')


In [2]:
# Generate text based on a prompt
prompt = "Once upon a time in a galaxy far, far away"
generated_text = generator(prompt, max_length=50, num_return_sequences=1, truncation=True)

# Print the generated text
print(generated_text[0]['generated_text'])


Once upon a time in a galaxy far, far away, I became a powerful hero, and in doing so made more lives possible than I'd ever imagine.

I know for a fact that other people didn't realise I had been a hero


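This code snippet uses the Hugging Face Transformers library to load a pre-trained GPT-2 model and generate text based on a given prompt. The `pipeline` function simplifies the process of using pre-trained models for various tasks, including text generation.

The next example uses the `mlx_lm` library to load an instruction-tuned Qwen2.5 model and lets it call a Python function as a tool.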


In [3]:
import json
from mlx_lm import generate, load
from mlx_lm.models.cache import make_prompt_cache

# Specify the checkpoint
checkpoint = "mlx-community/Qwen2.5-32B-Instruct-4bit"

# Load the corresponding model and tokenizer
model, tokenizer = load(path_or_hf_repo=checkpoint)

# An example tool, make sure to include a docstring and type hints
def multiply(a: float, b: float):
    """
    A function that multiplies two numbers

    Args:
        a: The first number to multiply
        b: The second number to multiply
    """
    return a * b


tools = {"multiply": multiply}




In [4]:

# Specify the prompt and conversation history
prompt = "Multiply 12234585 and 48838483920."
messages = [{"role": "user", "content": prompt}]

prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tools=list(tools.values())
)

prompt_cache = make_prompt_cache(model)

# Generate the initial tool call:
response = generate(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    max_tokens=2048,
    verbose=True,
    prompt_cache=prompt_cache,
)
print(response)


<tool_call>
{"name": "multiply", "arguments": {"a": 12234585, "b": 48838483920}}
</tool_call>
Prompt: 221 tokens, 113.483 tokens-per-sec
Generation: 42 tokens, 18.302 tokens-per-sec
Peak memory: 18.803 GB
<tool_call>
{"name": "multiply", "arguments": {"a": 12234585, "b": 48838483920}}
</tool_call>


In [5]:
# Parse the tool call:
# (Note, the tool call format is model specific)
tool_open = "<tool_call>"
tool_close = "</tool_call>"
start_tool = response.find(tool_open) + len(tool_open)
end_tool = response.find(tool_close)
tool_call = json.loads(response[start_tool:end_tool].strip())
tool_result = tools[tool_call["name"]](**tool_call["arguments"])

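# Put the tool result in the prompt. Only the new tool message is templated
# here; the earlier turns are carried over through prompt_cache.
messages = [{"role": "tool", "name": tool_call["name"], "content": tool_result}]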
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Generate the final response:
response = generate(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    max_tokens=2048,
    verbose=True,
    prompt_cache=prompt_cache,
)

The product of 12234585 and 48838483920 is 597518582790373200.
Prompt: 56 tokens, 78.884 tokens-per-sec
Generation: 47 tokens, 13.570 tokens-per-sec
Peak memory: 18.803 GB
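
The parsing step in the cell above assumes the response contains exactly one well-formed tool call. A slightly more defensive variant is sketched below. It assumes the same `<tool_call>` tag format (which, as noted, is model specific), and the helper name `extract_tool_calls` is hypothetical rather than part of `mlx_lm`. It runs on a copy of the response text, so it does not need the model loaded.

```python
import json
import re

# Matches the JSON payload between <tool_call> tags, across line breaks.
TOOL_CALL_PATTERN = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return every parseable tool call in the text, skipping malformed ones."""
    calls = []
    for payload in TOOL_CALL_PATTERN.findall(text):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore payloads that are not valid JSON
        if isinstance(call, dict) and "name" in call:
            calls.append(call)
    return calls


def multiply(a: float, b: float):
    """Multiply two numbers."""
    return a * b


tools = {"multiply": multiply}

# The response text produced by the model in the example above.
response = (
    "<tool_call>\n"
    '{"name": "multiply", "arguments": {"a": 12234585, "b": 48838483920}}\n'
    "</tool_call>"
)

for call in extract_tool_calls(response):
    if call["name"] in tools:  # only run tools that were explicitly registered
        result = tools[call["name"]](**call.get("arguments", {}))
        print(call["name"], "->", result)
```

Checking the tool name against the `tools` dictionary before calling it avoids a `KeyError` if the model asks for a tool that was never registered.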


In [6]:
list(tools.values())

[<function __main__.multiply(a: float, b: float)>]


## Multimodal Models
- **Definition**: Multimodal models are designed to process and generate data from multiple modalities, such as text, images, and audio. They can understand and generate content that combines different types of information.

- **Tokenization**: Multimodal models often use specialized tokenization techniques to handle different types of data. For example, images may be tokenized into patches, while text is tokenized into words or subwords.

- **Cross-Modal Attention**: These models utilize attention mechanisms that allow them to focus on relevant parts of different modalities simultaneously, enabling them to learn relationships between text and images or other modalities.


- **Applications**: Multimodal models can be used for tasks like image captioning, video analysis, and cross-modal retrieval, where understanding the relationship between different modalities is crucial.

- **Examples**: CLIP (Contrastive Language-Image Pre-training) learns a shared embedding space for images and text, which supports tasks such as zero-shot image classification and image-text retrieval. DALL-E generates images from textual descriptions.
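
The patch-based tokenization of images mentioned above can be sketched without any libraries. This is a minimal illustration on a tiny single-channel "image" stored as nested lists; the function name `image_to_patches` is an assumption, and real models typically work on multi-channel arrays and then project each flattened patch to an embedding vector.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image into non-overlapping, flattened square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    patches = []
    # Walk over the image in row-major order, one patch at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 toy image whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = image_to_patches(image, patch_size=2)
for patch in patches:
    print(patch)
```

Each flattened patch plays the role that a word or subword token plays for text, so the same attention layers can operate on both.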


## What about scientific data?

- **Scientific Data**: LLMs can be adapted to handle scientific data, such as research papers, experimental results, and datasets.
- **Applications in Science**: Potential uses include automated literature review, data extraction from scientific texts, hypothesis generation based on existing research, and support for data analysis.

However, LLMs cannot yet work reliably with scientific data, and many challenges remain before they can be used effectively in this domain.

In collaboration with experts in Computer Science, we are currently working on a project to adapt LLMs to scientific data, focusing on improving their ability to understand and generate scientific content. One of the main challenges is encoding the scientific data in a useful way, since it often involves complex structures and relationships that are not easily captured by standard text-based models.

## Conclusion
Large Language Models (LLMs) represent a significant advancement in the field of artificial intelligence, enabling machines to understand and generate human-like text. Their applications span various domains, from natural language processing to computer vision and beyond. However, challenges such as bias, interpretability, and resource requirements remain critical areas for ongoing research and development.

As LLMs continue to evolve, their potential to transform industries and improve human-computer interaction is immense. The future of LLMs lies in addressing these challenges while expanding their capabilities to handle more complex and diverse data types, including scientific data.


## Final remarks

Through this course, we have covered a wide range of topics in deep learning for geo/environmental sciences, from the basics of neural networks to advanced topics like self-supervised learning and generative models. I hope you have gained a good understanding of these topics and how they can be applied to geospatial and environmental data.



We only scratched the surface of deep learning in this course, and there is much more to learn and explore. I encourage you to continue learning and experimenting with deep learning techniques and apply them to your own research or projects.



If you have any questions or need further clarification on any of the topics covered in this course, please feel free to reach out to me. I'm always happy to help and discuss deep learning with you.


### Evals


Please fill out the course evaluation form. Your feedback is valuable and will help me improve the course for future students.