# Python programs

This is an introductory lab where you will run a Python program to create an AI-powered farm management chat assistant using a <a href="https://stanford-cs324.github.io/winter2022/lectures/introduction/" target="_blank">language model</a>.

The goal of this lab is to provide a brief introduction to the Python programming language; demonstrate some programming concepts such as data types, data structures, and functions; and to illustrate why programming is a useful skill for solving problems and implementing data-driven solutions.

This lab is based on this <a href="https://keras.io/examples/generative/gpt2_text_generation_with_kerasnlp/" target="_blank">code example from Keras</a>.

## A Python program

* A program is a series of statements that are executed to complete a task
* Here, the statements are written using the syntax of the Python programming language
* Programs complete tasks by loading, transforming, and visualising data
* Data in programs can be of different types (numeric, text, boolean)
* Data can be combined into data structures to represent complex concepts and objects
* Functions operate on data to complete tasks

### Run the labs

You can run the labs locally on your machine or you can use cloud environments provided by Google Colab. **If you're working with Google Colab, be aware that your sessions are temporary and you'll need to take care to save, back up, and download your work.**

<a href="https://colab.research.google.com/github/geog3300-agri3003/coursebook/blob/main/docs/notebooks/week-1_1.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Setup

### Setup runtime

**This lab will only work (quickly) using Google Colab with a *T4 GPU* runtime type.**

**Before running any code, set the runtime type to *T4 GPU*.**

<img src="https://github.com/geog3300-agri3003/coursebook/raw/main/docs/img/week-1-colab-runtime-1.jpg" alt="colab runtime menu" width="50%">

<p></p>

<img src="https://github.com/geog3300-agri3003/coursebook/raw/main/docs/img/week-1-colab-runtime-2.jpg" alt="colab runtime menu" width="50%">


### Install packages

A package is a collection of files containing Python code that we can use in our program. Files containing Python code are called modules (a package is a collection of modules). By importing Python packages, we can make our programs shorter and reuse code that has already been written to complete a task.

For example, the pandas package provides a DataFrame structure for storing tabular data and specific functions for working with tabular data.
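As a small illustration of reusing code from a module, the sketch below imports `statistics`, a module from Python's standard library (so nothing needs to be installed), and uses its `mean()` function instead of writing the averaging code ourselves. The yield values and variable names are made up for illustration.

```python
# statistics is part of Python's standard library, so no install is needed
import statistics

# hypothetical wheat yields (tonnes per hectare); made-up illustrative values
wheat_yields = [2.0, 3.0, 4.0]

# reuse the mean() function that has already been written for us
average_yield = statistics.mean(wheat_yields)
print(f"average yield: {average_yield}")
```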

Here, we are installing some Python packages from the <a href="https://keras.io/getting_started/" target="_blank">Keras</a> project that provide tools for working with deep learning models (the kind of model behind ChatGPT) and artificial intelligence.

The `!pip install ...` syntax installs (downloads) the Python code from these packages into our Python environment (i.e. the computer we are working on).

In [None]:
# install in this order - https://keras.io/getting_started/
"""
Critically, you should reinstall Keras 3 after installing KerasNLP.
This is a temporary step while TensorFlow is pinned to Keras 2, and will no longer be necessary after TensorFlow 2.16.
The cause is that keras-nlp depends on tensorflow-text, which will install tensorflow==2.15, which will overwrite your Keras installation with keras==2.15.
"""
!pip install --upgrade keras-cv
!pip install --upgrade keras-nlp
!pip install --upgrade keras

### Import modules

Earlier, we installed the packages in our environment; this downloaded the Python code (called modules) in the packages onto our computer. Now we import the modules into our Python program where they can be used.

Here, we import a package called `os`, which contains tools for working with the operating system (e.g. creating paths to folders and files).

We also import `keras_nlp` (Keras natural language processing), which contains tools for working with natural language (e.g. written and spoken text) in Python programs that use artificial intelligence.

Finally, we import `keras`, which is a Python package for artificial intelligence. Specifically, it provides an interface to a range of deep learning models.

Importing these packages into our program means we can reuse code that has already been written to work with deep learning models that can receive and generate natural language (such as ChatGPT). We'll use these tools in our program to create an AI farm chat assistant.

In [None]:
import os

os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" or "torch"

import keras_nlp
import keras

keras.mixed_precision.set_global_policy("mixed_float16")

## Load a GPT model

First, we create a large language model object referenced by the variable `gpt2_lm`. This model is loaded from the `keras_nlp` package that we imported into our program.

Specifically, `preprocessor` is an object that converts text into tokens (numeric representations of text) and back again. `gpt2_lm` is a GPT2 large language model that can take in input text and generate output text in response.

An **object** in Python programs is a container for related data and functions. Data is information related to a particular concept, represented by the object, and functions do things with data to complete a task.
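A minimal sketch of this idea uses a Python string, which is itself an object: it holds data (the text) and has functions (methods) that do things with that data. The variable names here are illustrative only.

```python
# a str object holds data (the text) ...
crop = "canola"

# ... and has functions (methods) that do things with that data
crop_upper = crop.upper()  # returns a new string in upper case
starts_with_c = crop.startswith("c")  # returns True or False

print(crop_upper)
print(starts_with_c)
```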

In [None]:
# To speed up text generation, we use a preprocessor with a sequence length
# of 512 instead of the full length of 1024.
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=512,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=preprocessor
)
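To make the idea of tokens more concrete, here is a toy sketch that converts text into numbers and back again. It assumes a simple word-level vocabulary built from the input text itself; this is not the tokenisation scheme GPT2 actually uses, and all names are hypothetical.

```python
# a toy word-level tokeniser; NOT the scheme the GPT2 preprocessor uses
text = "what is canola"

# build a hypothetical vocabulary mapping each unique word to an integer ID
words = text.split()
vocabulary = {word: token_id for token_id, word in enumerate(sorted(set(words)))}

# text -> tokens (numbers)
tokens = [vocabulary[word] for word in words]

# tokens -> text, using the reversed vocabulary
id_to_word = {token_id: word for word, token_id in vocabulary.items()}
decoded = " ".join(id_to_word[token_id] for token_id in tokens)

print(tokens)
print(decoded)
```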

## Functions

Our `gpt2_lm` object has a method (another name for a function that belongs to an object) called `generate()`. You can spot functions by the `()` appearing after the function name. Functions can take input data (these are the function's arguments) and use these inputs to complete their task.

Functions can return data back to the Python program, display data on your screen, or save data to disk.

The general pattern for executing a function / method is:

1. The function is called and any data are passed in as arguments inside `()`
2. The function performs operations on the data passed in
3. The function returns the result of operating on the data
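The three steps above can be sketched with a small function of our own. `make_prompt()` is a hypothetical helper written for illustration; it is not part of Keras or this lab's program.

```python
def make_prompt(crop, topic):
    # 2. the function performs operations on the data passed in
    prompt = f"What {topic} affect {crop} crops?"
    # 3. the function returns the result of operating on the data
    return prompt


# 1. the function is called and data are passed in as arguments inside ()
banana_prompt = make_prompt("banana", "pests")
print(banana_prompt)
```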

The `generate()` function takes in text data as a prompt to the `inputs` parameter (here we ask `"What is canola?"`) and a number to the `max_length` parameter.

It returns the output text that the GPT2 model has generated in response to the prompt. This output text is referenced by the variable `output`.

**The <a href="https://openai.com/research/better-language-models" target="_blank">GPT2 model</a> we are using is an early large language model (a precursor to ChatGPT and modern models). Therefore, it might generate nonsense responses; this is especially so as this version of the model is a small open source release for research and experimentation. There are techniques we can use to improve a language model's performance for specific tasks, but these are advanced topics beyond the scope of this lab. However, it is a good example of how we can use Python to build a program to provide a useful service or solve a task. In this case, we're building a program that would let people use AI to get more information about farm management.**

**It is also your responsibility to only provide agriculturally relevant prompts to the model. You should not create offensive, personal, or controversial prompts. This is also a more general principle when working with big datasets, machine learning models, and when developing technology: it is important to do so in a way that causes no harm.**

In [None]:
output = gpt2_lm.generate("What is canola?", max_length=50)
print("\nGPT-2 output:")
print(output)

#### Recap quiz

<details>
    <summary><b>Is <code>print()</code> a function?</b></summary>

Yes, `print()` is a function. You can tell it's a function because of the parentheses after the function name. The fact that print is a verb also indicates it's a function - something is being done (nouns are often used to indicate data). `print()` takes in data from our Python program and prints a representation of this data on our display. Above, we pass in the output from generating text with our GPT2 model and print it on the display.
</details>

## Data

Programs generally complete their tasks by doing things with data. Data can be scientific data which can be analysed and visualised. However, in a computer program data refers to any information the program needs to run. For example, it could be text data such as file paths or file names or URLs for resources on the internet.

There are different types of data. Python provides support for the following built-in data types:

* `float` - storing floating point numeric values.
* `int` - storing integer numeric values.
* `str` - storing text data as strings.
* `bool` - storing True or False as boolean values.
* `None` - storing null values.
* `bytes` - storing raw binary data.
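As a quick sketch, each of the types listed above can be created and checked with the built-in `type()` function. The variable names and values are made up for illustration.

```python
rainfall_mm = 12.5  # float
paddock_count = 4  # int
crop_name = "wheat"  # str
is_irrigated = False  # bool
harvest_date = None  # None (no value recorded yet)
raw_data = b"wheat"  # bytes

# type() reports the data type of each value
print(type(rainfall_mm))
print(type(paddock_count))
print(type(crop_name))
print(type(is_irrigated))
print(type(harvest_date))
print(type(raw_data))
```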

Our AI-powered farm assistant program requires two key pieces of data: text prompts from the user (i.e. questions or requests for information) and text outputs from the GPT2 model. Here, we'll be storing text data as <a href="https://realpython.com/python-strings/" target="_blank">string (`str`)</a> type data in our program.

In Python, we create `str` type data by placing text inside quotation marks `"a string"`.

Let's create some different text prompts as `str` type data.

In [None]:
prompt = "Drought is affecting my wheat crop"
print(f"the data type of prompt is {type(prompt)}")

And we can pass this prompt into our GPT2 model.

In [None]:
output = gpt2_lm.generate(prompt, max_length=50)
print("\nGPT-2 output:")
print(output)

#### Recap quiz

**Can you create a new prompt of `str` type data to ask the model about pests affecting bananas?**

In [None]:
## ADD CODE HERE

<details>
    <summary><b>answer</b></summary>

```python
new_prompt = "Insects and pests affecting banana crops"

## NOTE: we pass in the variable new_prompt to the generate() function
output = gpt2_lm.generate(new_prompt, max_length=75)
print("\nGPT-2 output:")
print(output)
```
</details>

## Data structures

Python provides a series of built-in data structures that can be used to group together and store related data.

You can think of a data structure as a container for related data values in a Python program. For example, a DataFrame is often used to store a tabular dataset, similar to how you would store data in a spreadsheet in Excel.

A commonly used data structure in Python is a `list`. A `list` stores an ordered sequence of data values.

A Python `list` is created by placing data values inside square brackets `[]`. If we have many prompts or questions we want to ask our farm assistant, we can store them all in a list.

In [None]:
prompts_list = [
    "When are wheat crops ready for harvest?",
    "Why do we need fertiliser for fruit tree crops?",
]

`prompts_list` is a list of two elements. Each element is a prompt for our GPT2 model of `str` type data. Elements in a Python `list` have an index position starting at 0 for the first element. We can access elements of a list using their index. For example, to access the second element of the list we would execute the following statement:

`prompts_list[1]`

Note that index position 1 corresponds to the second element. Indexing starts at 0.

In [None]:
print(f"the first element in the list is {prompts_list[0]}")
print(f"the second element in the list is {prompts_list[1]}")

We can add items to our list using a `list` object's `append()` method. Note, this is a function as you can see the `()` after append.

To use the `append()` method, we pass in the data we wish to append to the end of the list. Let's add another prompt to our list: `"What herbicides can we use in Australia?"`

In [None]:
prompts_list.append("What herbicides can we use in Australia?")
print(prompts_list)
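After appending, it can be useful to check how many elements a list holds and what the last element is. The sketch below uses a separate, made-up list called `crops` so that it runs on its own; `len()` is a built-in function, and a negative index counts back from the end of the list.

```python
# a stand-alone list so this sketch runs on its own
crops = ["wheat", "canola"]
crops.append("barley")

print(len(crops))  # the number of elements in the list
print(crops[-1])  # a negative index counts back from the end of the list
```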

#### Recap quiz

**Can you add another item to `prompts_list`? The item to add is `"What grain crops are grown in France?"`.**

In [None]:
## ADD CODE HERE

<details>
    <summary><b>answer</b></summary>

```python
prompts_list.append("What grain crops are grown in France?")
prompts_list
```
</details>

## Flow control

Flow control refers to the order in which the statements that make up a program are executed. A common flow control tool is the `for` loop. A `for` loop allows us to iterate over items in a sequence; for example, we can loop over the items in a list. To loop over the items in `prompts_list` and print the current item we would execute:

In [None]:
for prompt in prompts_list:
    print(prompt)
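If we also want each item's index position while looping, one option is Python's built-in `enumerate()` function. This is a small sketch using a made-up `crops` list; the indented statements in the loop body run once for every item.

```python
crops = ["wheat", "canola", "barley"]
labelled_crops = []

# enumerate() yields the index position and the item on each pass of the loop
for position, crop in enumerate(crops):
    label = f"{position}: {crop}"
    labelled_crops.append(label)
    print(label)
```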

## Putting it all together: a Python program

Now we can put together the various concepts we've covered above (data types, data structures, functions, and flow control) to create a simple program that can store multiple prompts or questions from a user, pass these prompts into our GPT2 model, and return answers.

This is a basic template of a Python program to create an AI-powered farm chat assistant app. If we were expanding this, we might build a user interface (e.g. a web form) where users could enter prompts that would be added to the `prompts_list`. We should also use a model that is trained and tested for the application domain we're working in. However, this is a small illustration of how to write a Python program to complete a task (in this case building a small prototype application for a farm chat assistant).

In [None]:
for prompt in prompts_list:
    print("Question for the farm assistant:")
    output = gpt2_lm.generate(prompt, max_length=75)
    print(output)
    print("")
    print("***************")
    print("")
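One possible next step is to wrap the loop in a function that stores each answer alongside its prompt. The sketch below is an assumption about how this could be organised, not part of the original lab: `ask_farm_assistant()` and `placeholder_generate()` are hypothetical names, and `placeholder_generate()` only returns fixed text so the sketch can run without loading a model.

```python
def placeholder_generate(prompt, max_length=75):
    # hypothetical stand-in for gpt2_lm.generate(); returns fixed text, not a model response
    return f"{prompt} ... (generated text would appear here)"


def ask_farm_assistant(prompts, generate):
    # store each prompt alongside the text generated in response
    answers = {}
    for prompt in prompts:
        answers[prompt] = generate(prompt, max_length=75)
    return answers


example_prompts = [
    "When are wheat crops ready for harvest?",
    "Why do we need fertiliser for fruit tree crops?",
]

answers = ask_farm_assistant(example_prompts, placeholder_generate)
for prompt, answer in answers.items():
    print(f"Question: {prompt}")
    print(f"Answer: {answer}")
    print("")
```

To use the real model, you could pass `gpt2_lm.generate` in place of `placeholder_generate`.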