# Python programs

This is an introductory lab where you will run a Python program to create an AI-powered farm management chat assistant using a <a href="https://stanford-cs324.github.io/winter2022/lectures/introduction/" target="_blank">language model</a>.

The goal of this lab is to provide a brief introduction to the Python programming language; to demonstrate some programming concepts such as data types, data structures, and functions; and to illustrate why programming is a useful skill for solving problems and implementing data-driven solutions.

## A Python program

* A program is a series of statements that are executed to complete a task
* Here, the statements are written using the syntax of the Python programming language
* Programs complete tasks by loading, transforming, and visualising data
* Data in programs can be of different types (numeric, text, boolean)
* Data can be combined into data structures to represent complex concepts and objects
* Functions operate on data to complete tasks
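
As a minimal sketch of how these ideas fit together, the short program below uses data of different types, a data structure, and a function. All names and values are made up for illustration and are not part of the farm assistant built later in this lab.

```python
# Data of different types (the values are illustrative only)
crop = "wheat"          # text (str)
area_ha = 120.5         # numeric (float)
irrigated = False       # boolean (bool)

# A data structure groups related data values together
yields_t_per_ha = [2.1, 2.4, 1.9]


# A function operates on data to complete a task
def mean(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


average_yield = mean(yields_t_per_ha)
print(f"Average {crop} yield: {average_yield:.2f} t/ha")
```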

### Run the labs

You can run the labs locally on your machine or you can use cloud environments provided by Google Colab. **If you're working with Google Colab, be aware that your sessions are temporary and you'll need to take care to save, back up, and download your work.**

<a href="https://colab.research.google.com/github/geog3300-agri3003/coursebook/blob/main/docs/notebooks/week-1_1.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Setup

### Setup Hugging Face token

1. Create a Hugging Face account:

<img src="https://github.com/geog3300-agri3003/coursebook/raw/main/docs/img/hf-1-annotated.png" alt="HF account" width="75%">

Once you have created a Hugging Face account, you can gain access to models. Many LLMs require you to accept terms and conditions. For this exercise we will be working with Google's Gemma 2 2b it model (`2b` stands for 2 billion parameters and `it` stands for instruction tuned). Agree to the model's terms and conditions and usage policy <a href="https://huggingface.co/google/gemma-2-2b-it" target="_blank">here</a>.

2. Create an *Access Token*:

<img src="https://github.com/geog3300-agri3003/coursebook/raw/main/docs/img/hf-2-annotated.png" alt="HF account" width="75%">

3. Click on *Create new token*

<img src="https://github.com/geog3300-agri3003/coursebook/raw/main/docs/img/hf-3-annotated.png" alt="HF account" width="75%">

4. Set the token permissions

You can initially set token permissions to *Read*, which has read access to all your resources. This is a good option for getting started.

<img src="https://github.com/geog3300-agri3003/coursebook/raw/main/docs/img/hf-4-read-annotated.png" alt="HF account" width="75%">

However, as you start developing resources such as models and datasets and using Hugging Face in different environments, it's a good idea to create access tokens with fine-grained permissions that grant just enough access to complete the tasks associated with the token.

<img src="https://github.com/geog3300-agri3003/coursebook/raw/main/docs/img/hf-4-annotated.png" alt="HF account" width="75%">

5. If you have selected fine-grained permissions, you will need to add the repositories (models) that you want the token to grant access to.

<img src="https://github.com/geog3300-agri3003/coursebook/raw/main/docs/img/hf-5-annotated.png" alt="HF account" width="75%">

6. Click *Create token* to generate the access token.

<img src="https://github.com/geog3300-agri3003/coursebook/raw/main/docs/img/hf-6-annotated.png" alt="HF account" width="75%">

7. Copy the access token. **This is your only opportunity to do this - keep a record of the token (in a secure location)**. If you lose your token, it's easy to generate a new one.

<img src="https://github.com/geog3300-agri3003/coursebook/raw/main/docs/img/hf-7-annotated.png" alt="HF account" width="75%">

8. In Google Colab, click on the *key* icon in the left-hand sidebar.

<img src="https://github.com/geog3300-agri3003/coursebook/raw/main/docs/img/hf-8-annotated.png" alt="HF account" width="75%">

9. Add your Hugging Face access token with the name `HF_TOKEN` and make sure that *Notebook access* is checked. **Restart your Google Colab session to load the token into your environment**.

<img src="https://github.com/geog3300-agri3003/coursebook/raw/main/docs/img/hf-9-annotated.png" alt="HF account" width="50%">
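
If you want to check that the secret is visible to your notebook, something like the following sketch may help. It assumes you are in Google Colab, where `google.colab.userdata` provides access to secrets; the helper name `get_hf_token` and the fallback to an `HF_TOKEN` environment variable outside Colab are assumptions made for illustration, not part of the lab.

```python
import os


def get_hf_token():
    """Return the Hugging Face token, or None if it cannot be found."""
    try:
        # Only available inside a Google Colab session
        from google.colab import userdata
    except ImportError:
        # Outside Colab: fall back to an environment variable (an assumption)
        return os.environ.get("HF_TOKEN")
    try:
        return userdata.get("HF_TOKEN")
    except Exception:
        # The secret is missing or notebook access has not been granted
        return None


token = get_hf_token()
# Never print the token itself; only report whether it was found
print("HF_TOKEN found" if token else "HF_TOKEN not found")
```

Printing only whether the token was found, rather than its value, avoids leaking the token into a saved notebook.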

### Setup runtime

**This lab will only work (quickly) using Google Colab with a *T4 GPU* runtime type.**

**Before running any code, set the runtime type to *T4 GPU*.**

<img src="https://github.com/geog3300-agri3003/coursebook/raw/main/docs/img/week-1-colab-runtime-1.jpg" alt="colab runtime menu" width="50%">

<p></p>

<img src="https://github.com/geog3300-agri3003/coursebook/raw/main/docs/img/week-1-colab-runtime-2.jpg" alt="colab runtime menu" width="50%">
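
One way to confirm that the GPU runtime is active is to ask `torch` whether a CUDA device is visible. This is a minimal sketch; the helper name `describe_runtime` is made up for illustration, and the check for whether `torch` is installed is only there so the snippet also runs outside Colab.

```python
import importlib.util


def describe_runtime():
    """Return a short description of the compute device available to torch."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    if torch.cuda.is_available():
        # Report the name of the first visible GPU (e.g. a T4 in Colab)
        return f"GPU available: {torch.cuda.get_device_name(0)}"
    return "No GPU detected; running on CPU"


print(describe_runtime())
```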



### Import modules

A package is a collection of files containing Python code that we can use in our program. Files containing Python code are called modules (a package is a collection of modules). By importing Python packages, we can make our programs shorter and reuse code that has already been written to complete a task.

For example, the pandas package provides a DataFrame structure for storing tabular data and specific functions for working with tabular data.

We need to import the modules into our Python program before they can be used.

Here, we import a package called `os`, which contains tools for working with the operating system (e.g. creating paths to folders and files).

We also import `torch` and `transformers`, which are machine learning packages.

Importing these packages into our program means we can re-use code that has already been written to work with deep learning models that can receive and generate natural language. We'll use these tools in our program to create an AI farm chat assistant.

```python
import os
import torch
from transformers import pipeline
```
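
To illustrate what an imported package gives us, here is a small sketch using `os` to build and split a file path. The folder and file names are hypothetical and are not used elsewhere in the lab.

```python
import os

# Build a path from its parts; os.path.join inserts the right separator
# for the operating system (the folder and file names are hypothetical)
data_path = os.path.join("data", "farm", "yields.csv")
print(data_path)

# Split the path back into its folder and file name
folder, filename = os.path.split(data_path)
print(folder)
print(filename)
```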

## Using `pipeline` objects

To create a `pipeline` object, call the `pipeline()` function and pass in the task (`"text-generation"` here), the model (`"google/gemma-2-2b-it"`), and some additional arguments. Set the `device_map` argument to `"auto"` to run the model on Colab's GPU when it is available.

An object in Python programs is a container for related data and functions. Data is information related to a particular concept, represented by the object, and functions do things with data to complete a task.

Creating the `pipeline` object will download the model to your Colab environment, which will take a moment or two. Here, we are using <a href="https://huggingface.co/google/gemma-2-2b-it" target="_blank">Google's Gemma 2 2b it model</a>. Follow this link to make yourself aware of the terms of use for this model.

```python
pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
```
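
The idea of an object as a container for related data and functions can be sketched with a small class of our own. The `Paddock` class below is hypothetical and much simpler than a `pipeline` object; it is only meant to show data (attributes) and functions (methods) living together.

```python
class Paddock:
    """A hypothetical object that groups paddock data with functions that use it."""

    def __init__(self, name, area_ha, crop):
        # Data stored on the object (its attributes)
        self.name = name
        self.area_ha = area_ha
        self.crop = crop

    def describe(self):
        # A function attached to the object (a method) that uses its data
        return f"{self.name}: {self.area_ha} ha of {self.crop}"


north = Paddock("North paddock", 42.0, "canola")
print(north.describe())
```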

## Functions

Our `pipe` object can be treated as a function. You can spot functions by `()` appearing after the function name. Functions can take input data (the function's arguments) and use these inputs to complete their task.

Functions can return data back to the Python program, display data on your screen, or save data to disk.

The general pattern for executing a function / method is:

1. The function is called and any data are passed in as arguments inside `()`
2. The function performs operations on the data passed in
3. The function returns the result of operating on the data
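
The three steps above can be sketched with a small function of our own. The function name `hectares_to_acres` is made up for illustration; the comments are numbered to match the steps.

```python
def hectares_to_acres(hectares):
    """Convert an area in hectares to acres."""
    # 2. The function performs operations on the data passed in
    acres = hectares * 2.47105
    # 3. The function returns the result of operating on the data
    return acres


# 1. The function is called and data is passed in as an argument inside ()
area_acres = hectares_to_acres(10)
print(area_acres)
```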

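The `pipe()` function takes in text data as a prompt (here we ask `"What is canola?"`) and a number for the `max_new_tokens` parameter, which limits how many new tokens the model may generate.

It returns a list of results, which is referenced by the variable `outputs`. The text that the Gemma model generated in response to the prompt is stored under the `"generated_text"` key of the first result; we reference this text with the variable `response`.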

**The <a href="https://blog.google/technology/developers/gemma-open-models/">Gemma</a> model we are using is a reduced-size, open large language model. Therefore, its responses might not be as accurate as ones generated by larger models available online (e.g. Gemini or ChatGPT). We would probably not use this version of the Gemma model directly in an agricultural chat assistant; we would fine-tune or train an LLM to be more accurate with agricultural topics. However, it is a good example of how we can use Python to build a program to provide a useful service or solve a task. In this case, we're building a program that would let people use AI to get more information about farm management.**

**It is also your responsibility to only provide agriculturally relevant prompts to the model. You should not create offensive, personal, or controversial prompts. This is also a more general principle when working with big datasets, machine learning models and when developing technology. It is important to do so in a way that causes no harm.**

```python
prompt = "What is canola?"

outputs = pipe(prompt, max_new_tokens=512)
response = outputs[0]["generated_text"].strip()
print(response)
```
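
To see how the generated text is pulled out of the result, it can help to unpack the expression `outputs[0]["generated_text"].strip()` step by step. The sketch below uses a hand-written stand-in with the same shape as the result (a list containing a dictionary); it is not real model output, and the variable names are chosen so they do not overwrite the ones used in the lab.

```python
# A hand-written stand-in with the same shape as the result of pipe()
# (this is not real model output)
example_outputs = [
    {"generated_text": "  What is canola? Canola is an oilseed crop.  \n"}
]

first_result = example_outputs[0]           # first element of the list (a dict)
raw_text = first_result["generated_text"]   # value stored under the key
example_response = raw_text.strip()         # remove leading/trailing whitespace
print(example_response)
```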

#### Recap quiz

<details>
    <summary><b>Is <code>print()</code> a function?</b></summary>

Yes, `print()` is a function. You can tell it's a function because of the parentheses after the function name. The fact that print is a verb also indicates it's a function - something is being done (nouns are often used to indicate data). `print()` takes in data from our Python program and prints a representation of this data on our display. Above, we pass in the output from generating text with our Gemma model and print it on the display.
</details>

## Data

Programs generally complete their tasks by doing things with data. Data can be scientific data which can be analysed and visualised. However, in a computer program data refers to any information the program needs to run. For example, it could be text data such as file paths or file names or URLs for resources on the internet.

There are different types of data. Python provides support for the following built-in data types:

* `float` - storing floating point numeric values.
* `int` - storing integer numeric values.
* `str` - storing text data as strings.
* `bool` - storing True or False as boolean values.
* `None` - representing null (missing) values; its type is `NoneType`.
* `bytes` - storing raw binary data.
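
As a quick sketch, the built-in `type()` function reports the data type of a value. The variable names and values below are illustrative only.

```python
rainfall_mm = 12.5        # float
paddock_count = 4         # int
crop = "canola"           # str
is_irrigated = True       # bool
soil_test = None          # None (no value recorded yet)
raw_bytes = b"\x00\x01"   # bytes

# type() returns the data type of each value
print(type(rainfall_mm))
print(type(paddock_count))
print(type(crop))
print(type(is_irrigated))
print(type(soil_test))
print(type(raw_bytes))
```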

Our AI-powered farm assistant program requires two key pieces of data: text prompts from the user (i.e. questions or requests for information) and text outputs from the LLM. Here, we'll be storing text data as <a href="https://realpython.com/python-strings/" target="_blank">string (`str`)</a> type data in our program.

In Python, we create `str` type data by placing text inside quotation marks `"a string"`.

Let's create some different text prompts as `str` type data.

```python
prompt = "Drought is affecting my wheat crop. What strategies can I use to protect crop yields?"

outputs = pipe(prompt, max_new_tokens=512)
response = outputs[0]["generated_text"].strip()
print(response)
```
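
Because prompts are ordinary `str` values, they can also be built from other data. The sketch below uses an f-string to insert variables into a prompt; the variable names and values are made up for illustration and the prompt is not sent to the model.

```python
crop = "barley"
region = "Queensland"

# An f-string inserts the values of variables into the text
example_prompt = f"When should I plant {crop} in {region}?"

print(example_prompt)
print(type(example_prompt))
print(len(example_prompt))  # number of characters in the string
```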

#### Recap quiz

**Can you create a new prompt of `str` type data to ask the model about pests affecting bananas?**

```python
## ADD CODE HERE
```

<details>
    <summary><b>answer</b></summary>

```python
prompt = "What insects and pests affect banana crops?"

outputs = pipe(prompt, max_new_tokens=512)
response = outputs[0]["generated_text"].strip()
print(response)
```
</details>

## Data structures

Python provides a series of built-in data structures that can be used to group together and store related data.

You can think of a data structure as a container for related data values in a Python program. For example, a DataFrame is often used to store a tabular dataset, similar to how you would store data in a spreadsheet in Excel.

A commonly used data structure in Python is a `list`. A `list` stores an ordered sequence of data values.

A Python `list` is created by placing data values inside square brackets `[]`. If we have many prompts or questions we want to ask our farm assistant, we can store them all in a list.

```python
prompts = [
    "When are wheat crops ready for harvest in Western Australia?",
    "What is harvest weed seed control?",
]
```

`prompts` is a list of two elements. Each element is a prompt for our LLM of `str` type data. Elements in a Python `list` have an index position starting at 0 for the first element. We can access elements of a list using their index. For example, to access the second element of the list we would execute the following statement:

`prompts[1]`

Note that index position 1 corresponds to the second element. Indexing starts at 0.

```python
print(f"the first element in the list is {prompts[0]}")
print(f"the second element in the list is {prompts[1]}")
```
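
A few related operations are sketched below on a separate, made-up list called `crops` (so the `prompts` list is left unchanged): `len()` counts the elements, negative indices count back from the end, and asking for an index that does not exist raises an `IndexError`.

```python
crops = ["wheat", "barley", "canola"]

print(len(crops))   # number of elements in the list
print(crops[0])     # first element
print(crops[-1])    # negative indices count back from the end

# Asking for an index that does not exist raises an IndexError
try:
    crops[3]
except IndexError as error:
    print(f"IndexError: {error}")
```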

We can add items to our list using a `list` object's `append()` method. Note that `append()` is a function: you can see the `()` after its name.

To use the `append()` method, we pass in the data we wish to append to the end of the list. Let's add another prompt to our list: `"What herbicides are used in Western Australia?"`

```python
prompts.append("What herbicides are used in Western Australia?")
print(prompts)
```
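
One detail worth knowing is that `append()` changes the list in place and returns `None`, so its result should not be assigned back to the list. The sketch below shows this on a separate, made-up list called `crops`, and also shows `extend()`, which adds every element of another list.

```python
crops = ["wheat", "barley"]

# append() changes the list in place and returns None
result = crops.append("canola")
print(result)
print(crops)

# extend() adds every element of another list to the end
crops.extend(["oats", "lupins"])
print(len(crops))
```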

#### Recap quiz

**Can you add another item to `prompts`? The item to add is `"What grain crops are grown in France?"`.**

```python
## ADD CODE HERE
```

<details>
    <summary><b>answer</b></summary>

```python
prompts.append("What grain crops are grown in France?")
prompts
```
</details>

## Flow control

Flow control refers to the order in which the statements that make up a program are executed. A common flow control tool is the `for` loop. A `for` loop allows us to iterate over the items in a sequence; for example, we can loop over the items in a list. To loop over the items in `prompts` and print the current item we would execute:

```python
for p in prompts:
    print(p)
```
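
If you also want to number the items as you loop, the built-in `enumerate()` function yields a counter alongside each item. This is a small sketch on a made-up list called `crops`.

```python
crops = ["wheat", "barley", "canola"]

# enumerate() yields each item together with a counter (starting at 1 here)
for number, crop in enumerate(crops, start=1):
    print(f"{number}. {crop}")
```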

## Putting it all together: a Python program

Now we can put together the various concepts we've covered above (data types, data structures, functions, and flow control) to create a simple program that can store multiple prompts or questions from a user, pass these prompts into our LLM, and return answers.

This is a basic template of a Python program to create an AI-powered farm chat assistant app. If we were expanding this, we might build a user interface (e.g. a web form) where users could enter prompts that would be added to the `prompts` list. We should also use a model that is trained and tested for the application domain we're working in. However, this is a small illustration of how to write a Python program to complete a task (in this case building a small prototype application for a farm chat assistant).

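```python
# Loop over every prompt stored in the list
for p in prompts:
    print("Question for the farm assistant:")
    # Pass the current prompt to the model and extract the generated text
    outputs = pipe(p, max_new_tokens=512)
    response = outputs[0]["generated_text"].strip()
    print(response)
    # Print a separator between responses
    print("")
    print("***************")
    print("")
```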
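
One possible next step is to wrap the loop in a function so it can be reused and tried without loading the model. The sketch below is an assumption about how that might look, not part of the lab: `ask_all` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that returns placeholder text instead of calling the LLM.

```python
def ask_all(questions, generate):
    """Pass each question to `generate` and collect (question, answer) pairs."""
    answers = []
    for question in questions:
        answers.append((question, generate(question)))
    return answers


def fake_generate(question):
    # Hypothetical stand-in for the model so the sketch runs without a GPU
    return f"(placeholder answer to: {question})"


results = ask_all(
    ["What is canola?", "What is harvest weed seed control?"],
    fake_generate,
)
for question, answer in results:
    print("Question:", question)
    print("Answer:", answer)
    print("***************")
```

In the lab itself, the stand-in could be swapped for a small function that calls `pipe(question, max_new_tokens=512)` and returns `outputs[0]["generated_text"].strip()`.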