# Deep Learning for Geo/Environmental sciences

<center><img src="../logo_2.png" alt="logo" width="500"/></center>

<em>*Created with ChatGPT</em>

## Lecture 15: Large Language Models

 - [Recap](#Recap)
 - [Transformers](#Transformers)
 - [Large Language Models](#Large-Language-Models)
 - [Examples](#Examples)

## Recap

### VAEs, GANs, and Diffusion Models

- **Variational Autoencoders (VAEs)**: Encode data into a latent space and decode it back, allowing for generation of new data.

- **Generative Adversarial Networks (GANs)**: Consist of a generator and a discriminator, where the generator creates data and the discriminator evaluates it, leading to improved generation over time.

- **Diffusion Models**: Start with noise and iteratively refine it to generate data, often producing high-quality outputs.



## Transformers


![Transformers](./_images/transformer.png)

### Transformers - Key Concepts



**Self-Attention**: Mechanism that allows the model to weigh the importance of different words in a sequence, enabling it to capture long-range dependencies.

This is important for understanding context in language, as it allows the model to focus on relevant parts of the input when making predictions, even if they are far apart in the sequence.
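
As a minimal sketch of the idea (not a full implementation), single-head scaled dot-product self-attention can be written in a few lines of NumPy. The dimensions and the random projection matrices `w_q`, `w_k`, `w_v` below are illustrative assumptions; in a real model they are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) token representations.
    w_q, w_k, w_v: (d_model, d_k) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

# Toy sizes and random weights (assumptions for illustration only)
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Row `i` of `weights` says how strongly token `i` attends to every token in the sequence, regardless of how far apart they are.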



**Positional Encoding**: Adds information about the position of words in a sequence, since transformers do not inherently understand order.

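For example, in a sentence like "The cat sat on the mat," the word "the" appears twice. Without positional encoding both occurrences would look identical to the model, and the sentence would be indistinguishable from "The mat sat on the cat." Positional encoding lets the model tell these cases apart, which is crucial for understanding the meaning of the sentence.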
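
One common choice is the sinusoidal encoding from the original Transformer paper; many models learn their positional embeddings instead. The sketch below assumes an even `d_model` and uses a random toy embedding table (hypothetical) to show that the two occurrences of "the" become distinguishable once positions are added.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) array; d_model is assumed to be even."""
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # even indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe

tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = {word: i for i, word in enumerate(sorted(set(tokens)))}

# Toy embedding table with random values (an assumption for illustration)
rng = np.random.default_rng(0)
d_model = 8
embedding_table = rng.normal(size=(len(vocab), d_model))
embeddings = embedding_table[[vocab[t] for t in tokens]]

pe = sinusoidal_positional_encoding(len(tokens), d_model)
x = embeddings + pe

# Both "the" tokens share an embedding, but differ after adding positions.
print(np.allclose(embeddings[0], embeddings[4]), np.allclose(x[0], x[4]))
```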



**Multi-Head Attention**: Uses multiple attention mechanisms in parallel, allowing the model to focus on different parts of the input simultaneously.

This enables the model to capture various aspects of the input, such as different meanings or relationships between words, enhancing its understanding of complex sentences.
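
A rough NumPy sketch of multi-head attention follows. It assumes `d_model` is divisible by `num_heads` and uses random weight matrices (`w_q`, `w_k`, `w_v`, `w_o`) in place of learned ones; masking and dropout are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); all weight matrices: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                   # one attention map per head
    heads = weights @ v                                  # (heads, seq, d_head)
    # Concatenate the heads back to (seq_len, d_model) and mix them with w_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Toy sizes and random weights (assumptions for illustration only)
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)
```

Each head works on its own slice of the representation, so the heads can learn different attention patterns over the same input.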



### Transformers - Architecture

![Transformer Architecture](./_images/transformer_architecture.png)

Notes:
- **Tokenization**: The input text is split into tokens, which can be words or subwords. Each token is then mapped to an integer ID from a fixed vocabulary.
- **Embedding Layer**: Converts token IDs into dense vectors that capture semantic meaning. In transformers these embeddings are usually learned jointly with the rest of the model, rather than taken from pre-trained static embeddings such as Word2Vec or GloVe.
- **Positional Encoding**: Since transformers do not have a built-in sense of order, positional encodings are added to the token embeddings to provide information about the position of each token in the sequence.
- **Attention Mechanism**: The core of the transformer, where each token attends to all other tokens in the sequence. This allows the model to weigh the importance of different tokens based on their context.
- **Feed-Forward Networks**: Each layer includes a fully connected feed-forward network that processes the output of the self-attention mechanism.
- **Softmax Layer**: At the output, a softmax layer is used to convert the final representations into probabilities for each token in the vocabulary, allowing for tasks like language modeling or text generation.

Other important components:
- **Layer Normalization**: Applied to stabilize and speed up training by normalizing the inputs to each layer.
- **Residual Connections**: Allow gradients to flow through the network more easily, helping to mitigate the vanishing gradient problem in deep networks.
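
The feed-forward network, residual connection, and layer normalization can be sketched together as one sublayer. This sketch follows the "post-norm" ordering of the original Transformer (normalize after adding the residual); many recent models normalize before the sublayer instead. Learned scale and shift parameters of layer normalization are omitted, and all sizes and weights are toy assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learned scale and shift parameters are omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise MLP: the same two-layer ReLU network is applied to every token.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def ffn_sublayer(x, w1, b1, w2, b2):
    # Residual connection around the feed-forward network, then layer norm.
    return layer_norm(x + feed_forward(x, w1, b1, w2, b2))

# Toy sizes and random weights (assumptions for illustration only)
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 8, 32
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

y = ffn_sublayer(x, w1, b1, w2, b2)
print(y.shape)
```

Because the input `x` is added back before normalization, the sublayer only has to learn a correction to its input, which is what lets gradients pass through many stacked layers.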


### Training

Training transformers typically involves large datasets and significant computational resources. The training process includes:
- **Pre-training**: The model is trained on a large corpus of text to learn general language representations. This phase usually relies on self-supervised tasks, where the training signal comes from the text itself rather than from human labels.
- **Fine-tuning**: The pre-trained model is then fine-tuned on a specific task or dataset, such as sentiment analysis or translation, using supervised learning.



Some example pre-training tasks include:
- **Causal Language Modeling**: The model predicts the next token in a sequence given the previous tokens, which helps it learn the structure and flow of language. This is the objective used by GPT-style models.
- **Masked Language Modeling**: Randomly masks some tokens in a sentence and trains the model to predict them based on the surrounding context, allowing it to learn relationships between words. This objective is used by BERT-style models.
- **Next Sentence Prediction**: The model learns to predict whether two sentences follow each other in the text, which helps it understand relationships between sentences.
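
The first two objectives differ mainly in how inputs and targets are built from the same token sequence. The sketch below uses made-up token IDs, a hypothetical `MASK_ID`, and a 15% masking rate (the rate used by BERT); the value `-100` for ignored positions is a common convention, not a requirement.

```python
import numpy as np

token_ids = np.array([11, 42, 7, 99, 23, 5])  # hypothetical token IDs

# Causal language modeling: predict token t+1 from the tokens up to t.
clm_inputs, clm_targets = token_ids[:-1], token_ids[1:]

# A causal mask keeps position i from attending to later positions j > i.
n = len(clm_inputs)
causal_mask = np.tril(np.ones((n, n), dtype=bool))

# Masked language modeling: hide a random subset and predict only those tokens.
MASK_ID = 0                                   # hypothetical [MASK] token ID
rng = np.random.default_rng(0)
num_to_mask = max(1, round(0.15 * len(token_ids)))
masked_positions = rng.choice(len(token_ids), size=num_to_mask, replace=False)
is_masked = np.zeros(len(token_ids), dtype=bool)
is_masked[masked_positions] = True

mlm_inputs = np.where(is_masked, MASK_ID, token_ids)
mlm_targets = np.where(is_masked, token_ids, -100)  # -100 marks positions with no loss

print("CLM inputs :", clm_inputs)
print("CLM targets:", clm_targets)
print("MLM inputs :", mlm_inputs)
print("MLM targets:", mlm_targets)
```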

## Large Language Models (LLMs)

Large Language Models (LLMs) are a class of deep learning models designed to understand and generate human language. They are built on the transformer architecture and are trained on vast amounts of text data to learn patterns, structures, and semantics of language.

LLMs are characterized by their large number of parameters, often in the billions, which allows them to capture complex language features and relationships. They can perform a wide range of natural language processing tasks, including text generation, translation, summarization, and question answering.
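
To get a feel for where these parameter counts come from, here is a back-of-the-envelope estimate for a GPT-style decoder. It is only an approximation: it assumes a feed-forward hidden size of `4 * d_model` and tied input/output embeddings, and it ignores biases, layer norms, and positional embeddings. The configuration is that of the smallest GPT-2 model (12 layers, width 768, vocabulary of 50257 tokens).

```python
def approx_transformer_params(n_layers, d_model, vocab_size):
    """Rough parameter count for a GPT-style decoder (see assumptions above)."""
    attention = 4 * d_model**2         # Q, K, V and output projections
    feed_forward = 8 * d_model**2      # two matrices with hidden size 4 * d_model
    embeddings = vocab_size * d_model  # token embedding table
    return n_layers * (attention + feed_forward) + embeddings

gpt2_small = approx_transformer_params(n_layers=12, d_model=768, vocab_size=50257)
print(f"{gpt2_small / 1e6:.0f}M parameters")  # prints "124M parameters"
```

Since the per-layer cost grows with `d_model**2`, widening and deepening the network quickly pushes the total into the billions.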

## Large Language Models (LLMs) - successes
These models learn an underlying distribution of language, enabling them to generate coherent and contextually relevant text based on the input they receive. They can also adapt to various tasks with minimal fine-tuning, making them versatile tools in the field of natural language processing. 
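
Concretely, at each step the model outputs one score (logit) per vocabulary token, and the next token is drawn from the resulting distribution. The sketch below uses a made-up four-word vocabulary and made-up logits; the `temperature` parameter is a common sampling control, shown here as an illustration rather than as how any particular model is configured.

```python
import numpy as np

def next_token_probs(logits, temperature=1.0):
    """Turn raw scores into a probability distribution over the vocabulary."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()       # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

vocab = ["rain", "snow", "sun", "fog"]   # toy vocabulary (hypothetical)
logits = [2.0, 1.0, 0.5, -1.0]           # toy model scores (hypothetical)

rng = np.random.default_rng(0)
for temperature in (0.5, 1.0, 2.0):
    probs = next_token_probs(logits, temperature)
    token = vocab[rng.choice(len(vocab), p=probs)]
    print(f"T={temperature}: p={np.round(probs, 2)}, sampled '{token}'")

# Lower temperature concentrates probability on the most likely token.
cold_probs = next_token_probs(logits, 0.5)
warm_probs = next_token_probs(logits, 2.0)
```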

One of the key successes of LLMs is their ability to scale effectively: performance on many language tasks tends to improve as the model size (number of parameters) increases, provided the training data and compute grow with it. This scalability is a significant factor in the development and deployment of LLMs.

## Large Language Models (LLMs) - challenges
Although the output of LLMs can be highly convincing, it is important to note that they do not possess true understanding or consciousness; they generate text based on learned patterns rather than comprehension.

LLMs can also inherit biases present in the training data, leading to biased outputs. Addressing these biases and ensuring fairness in their applications is a significant challenge.

There are also ethical considerations surrounding the use of LLMs, including concerns about misinformation, privacy, and the potential for misuse in generating harmful content.

## Example usage of LLMs in practice

Let's run a small pre-trained LLM using the Hugging Face Transformers library. This example demonstrates how to use a pre-trained model (GPT-2) for text generation.


In [None]:
from transformers import pipeline
# Load a pre-trained text generation model
generator = pipeline('text-generation', model='gpt2')

In [None]:
# Generate text based on a prompt
prompt = "Once upon a time in a galaxy far, far away"
generated_text = generator(prompt, max_length=50, num_return_sequences=1, truncation=True)

# Print the generated text
print(generated_text[0]['generated_text'])


## Example with tool usage

Models that have been trained to use tools can perform tasks that require external knowledge or actions, such as searching the web, performing calculations, or interacting with APIs. This powerful capability allows them to provide more accurate and contextually relevant responses.
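
The mechanics can be sketched without a model: the model emits a structured request, our code parses it, runs the matching Python function, and the result is passed back. In the sketch below the `<tool_call>` tag format is an assumption (it matches the Qwen example later in this section, but the format is model specific), `run_tool_calls` is a hypothetical helper, and `fake_output` is a hand-written stand-in for model output.

```python
import json
import re

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

TOOLS = {"multiply": multiply}
TOOL_CALL_PATTERN = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def run_tool_calls(model_output: str) -> list:
    """Execute every tool call found in the text and return the results."""
    results = []
    for payload in TOOL_CALL_PATTERN.findall(model_output):
        call = json.loads(payload.strip())
        tool = TOOLS.get(call["name"])
        if tool is None:
            raise KeyError(f"Unknown tool: {call['name']}")
        results.append(tool(**call["arguments"]))
    return results

# Hand-written stand-in for what a model might emit (not real model output)
fake_output = '<tool_call>\n{"name": "multiply", "arguments": {"a": 6, "b": 7}}\n</tool_call>'
print(run_tool_calls(fake_output))  # prints [42]
```

The arithmetic is done by ordinary Python rather than by the model, which is why tool use helps with tasks such as exact multiplication of large numbers.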

The GPT-2 example above used the Hugging Face `pipeline` function, which simplifies the process of using pre-trained models for various tasks, including text generation. The example below instead uses the `mlx_lm` library to load a 4-bit quantized Qwen2.5 instruct model and gives it access to a simple `multiply` tool.


In [None]:
import json
from mlx_lm import generate, load
from mlx_lm.models.cache import make_prompt_cache

# Specify the checkpoint
checkpoint = "mlx-community/Qwen2.5-32B-Instruct-4bit"

# Load the corresponding model and tokenizer
model, tokenizer = load(path_or_hf_repo=checkpoint)

# An example tool, make sure to include a docstring and type hints
def multiply(a: float, b: float):
    """
    A function that multiplies two numbers

    Args:
        a: The first number to multiply
        b: The second number to multiply
    """
    return a * b


tools = {"multiply": multiply}


In [None]:

# Specify the prompt and conversation history
prompt = "Multiply 12234585 and 48838483920."
messages = [{"role": "user", "content": prompt}]

prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tools=list(tools.values())
)

prompt_cache = make_prompt_cache(model)

# Generate the initial tool call:
response = generate(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    max_tokens=2048,
    verbose=True,
    prompt_cache=prompt_cache,
)
print(response)


In [None]:
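# Parse the tool call:
# (Note, the tool call format is model specific)
tool_open = "<tool_call>"
tool_close = "</tool_call>"
# Fail with a clear message if the model answered without calling a tool
if tool_open not in response or tool_close not in response:
    raise ValueError("No tool call found in the model response")
start_tool = response.find(tool_open) + len(tool_open)
end_tool = response.find(tool_close)
tool_call = json.loads(response[start_tool:end_tool].strip())
# Look up the requested tool by name and call it with the parsed arguments
tool_result = tools[tool_call["name"]](**tool_call["arguments"])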

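# Put the tool result in the prompt. Only the new message is templated here;
# the earlier turns are expected to be held in prompt_cache.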
messages = [{"role": "tool", "name": tool_call["name"], "content": tool_result}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Generate the final response:
response = generate(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    max_tokens=2048,
    verbose=True,
    prompt_cache=prompt_cache,
)

Let's try a more complex example, where the answer requires several arithmetic steps:


In [None]:
prompt = "How many parameters does a NN with 50 inputs, 5 hidden layers, each 128 wide, and 2 outputs have?"
messages = [{"role": "user", "content": prompt}]

prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tools=list(tools.values())
)

prompt_cache = make_prompt_cache(model)

# Generate the initial tool call:
response = generate(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    max_tokens=2048,
    verbose=True,
    prompt_cache=prompt_cache,
)
print(response)

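# Parse the tool call:
# (Note, the tool call format is model specific)
tool_open = "<tool_call>"
tool_close = "</tool_call>"
# Fail with a clear message if the model answered without calling a tool
if tool_open not in response or tool_close not in response:
    raise ValueError("No tool call found in the model response")
start_tool = response.find(tool_open) + len(tool_open)
end_tool = response.find(tool_close)
tool_call = json.loads(response[start_tool:end_tool].strip())
# Look up the requested tool by name and call it with the parsed arguments
tool_result = tools[tool_call["name"]](**tool_call["arguments"])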

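# Put the tool result in the prompt. Only the new message is templated here;
# the earlier turns are expected to be held in prompt_cache.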
messages = [{"role": "tool", "name": tool_call["name"], "content": tool_result}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Generate the final response:
response = generate(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    max_tokens=2048,
    verbose=True,
    prompt_cache=prompt_cache,
)
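
To judge the model's answer to the question above, it helps to have a reference value. The sketch below assumes a plain fully connected network with a bias on every layer, which is one reasonable reading of the prompt; `mlp_param_count` is a helper written for this check, not part of any library.

```python
def mlp_param_count(n_inputs, hidden_widths, n_outputs):
    """Weights plus biases of a fully connected network."""
    sizes = [n_inputs, *hidden_widths, n_outputs]
    # Each layer has fan_in * fan_out weights and fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

expected = mlp_param_count(n_inputs=50, hidden_widths=[128] * 5, n_outputs=2)
print(expected)  # prints 72834
```

The total breaks down as 6528 parameters for the input layer, 4 × 16512 for the hidden-to-hidden layers, and 258 for the output layer.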

## Multimodal Models

LLMs are now evolving into multimodal models, which can process and generate content across different types of data, such as text, images, and audio. These models leverage the strengths of transformers to handle multiple modalities simultaneously, enabling them to understand and generate complex content that combines various forms of information.



A key challenge in multimodal models is effectively integrating and aligning different types of data, such as text and images, to ensure that the model can understand the relationships between them. 



The first step in this process is to tokenize and encode each modality separately, allowing the model to learn representations that capture the unique characteristics of each type of data. 



The model then uses cross-modal attention mechanisms to learn how different modalities relate to each other, enabling it to generate coherent outputs that combine information from multiple sources.



For example, CLIP (Contrastive Language-Image Pre-training) learns a shared embedding space for text and images, which supports tasks such as zero-shot image classification and image-text retrieval, while DALL-E generates images from textual descriptions.
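
The contrastive idea behind CLIP can be sketched as follows: embeddings of matching image-text pairs are pulled together and mismatched pairs are pushed apart. This is a simplified NumPy illustration, not CLIP's actual code. The embeddings are random stand-ins for encoder outputs, and the fixed `temperature` of 0.07 is an assumption (in CLIP the temperature is a learned parameter).

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matching pair."""
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch)
    idx = np.arange(len(logits))
    # Matching pairs sit on the diagonal; apply cross-entropy in both directions.
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_image_to_text + loss_text_to_image) / 2, logits

# Random stand-ins for encoder outputs (assumptions for illustration only)
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 16))
text_emb = image_emb + 0.1 * rng.normal(size=(4, 16))  # roughly aligned pairs

aligned_loss, logits = clip_style_loss(image_emb, text_emb)
shuffled_loss, _ = clip_style_loss(image_emb, text_emb[::-1])
print(f"aligned: {aligned_loss:.3f}, shuffled: {shuffled_loss:.3f}")
```

The loss is low when each image is most similar to its own caption and high when the pairs are scrambled, which is the signal that aligns the two modalities.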


## What about scientific data?

Along these lines, there is growing interest in adapting LLMs to handle scientific data. Scientific data often involves complex structures, relationships, and domain-specific knowledge that differ from general language tasks.



With sufficient training and adaptation, LLMs could potentially revolutionize how we interact with scientific literature and data. Some potential applications include:
- **Automated Literature Review**: LLMs could assist researchers in quickly summarizing and extracting key information from vast amounts of scientific literature, saving time and effort in the research process.
- **Data Extraction**: They could be used to extract relevant data from scientific texts, such as experimental results or methodologies, enabling more efficient data analysis and integration.
- **Hypothesis Generation**: LLMs could help generate new hypotheses based on existing research, potentially leading to novel discoveries and insights.



However, adapting LLMs to scientific data presents several challenges:
- **Domain-Specific Knowledge**: Scientific texts often contain specialized terminology and concepts that require a deep understanding of the domain. LLMs need to be trained on domain-specific data to effectively handle these nuances.
- **Data Structure**: Scientific data can be structured in various formats, such as tables, graphs, and equations. LLMs need to be able to process and understand these different formats to extract meaningful information.
- **Interpretability**: Scientific research often requires a high level of interpretability and explainability. LLMs need to provide clear and understandable explanations for their outputs, especially when generating hypotheses or summarizing complex data.


In collaboration with experts in Computer Science, we are currently working on a project to adapt LLMs to scientific data, focusing on improving their ability to understand and generate scientific content.

One of the main challenges is encoding scientific data in a useful way, since it often involves complex structures and relationships that are not easily captured by standard text-based models.



## Conclusion
Large Language Models (LLMs) represent a significant advancement in the field of artificial intelligence, enabling machines to understand and generate human-like text. Their applications span various domains, from natural language processing to computer vision and beyond. However, challenges such as bias, interpretability, and resource requirements remain critical areas for ongoing research and development.



As LLMs continue to evolve, their potential to transform industries and improve human-computer interaction is immense. The future of LLMs lies in addressing these challenges while expanding their capabilities to handle more complex and diverse data types, including scientific data.


## SIO209 - Final remarks

Through this course, we have covered a wide range of topics in deep learning for geo/environmental sciences, from the basics of neural networks to advanced topics like self-supervised learning and generative models. I hope you have gained a good understanding of these topics and how they can be applied to geospatial and environmental data.



We only scratched the surface of deep learning in this course, and there is much more to learn and explore. I encourage you to keep learning and experimenting with deep learning techniques, and to apply them to your own research or projects.



If you have any questions or need further clarification on any of the topics covered in this course, please feel free to reach out to me. I'm always happy to help and discuss deep learning with you.


![I Learned Deep Learning](./_images/learned_deep_learning.png)

### Evals


Please fill out the course evaluation form to provide feedback on the course and help me improve it for future iterations. Your feedback is valuable and will help me make this course better for future students.