# 01 – Introduction to Colab and Jupyter

FB2NEP – Nutritional Epidemiology and Public Health

In the epidemiology component of FB2NEP, we will use **Jupyter notebooks** (in particular **Google Colab**) as a practical environment to:

- run small examples,
- inspect and summarise datasets,
- produce simple plots,
- and introduce core ideas of reproducible and transparent analysis.

The purpose of this notebook is to give you a first overview of:

1. How the teaching notebooks are stored in GitHub and opened in **Google Colab**.
2. How to run and edit notebook cells.
3. A brief introduction to **Python** and basic programme structure.
4. How to import and use core libraries: **NumPy**, **pandas**, and **Matplotlib**.
5. How to load and explore a small **hippo dietary survey** dataset.
6. Why notebooks are helpful for **reproducibility and open science**.

You do not need prior programming experience. The examples are small, explained line by line, and you will use them mainly as tools to understand epidemiological ideas.

## 1. Where are the notebooks and how do I open them in Colab?

All teaching notebooks for FB2NEP live in a **read-only GitHub repository**:

- GitHub repository:  
  https://github.com/ggkuhnle/fb2nep-epi
- Published site (easier browsing):  
  https://ggkuhnle.github.io/fb2nep-epi/

You will usually access notebooks via the **published site**. For each notebook there is a link or badge labelled something like **"Open in Colab"**.

Typical workflow during the module:

1. Go to the published site and navigate to the notebook for the week.
2. Click the **"Open in Colab"** link or badge.
3. Colab will open the notebook in your browser.
4. At the top of the notebook Colab may display a warning such as:
   > This notebook was not authored by Google.
   
   This is a standard message. In the context of this module it simply means that the notebook comes from the course repository, not from Google. You can safely choose **"Run anyway"** for FB2NEP notebooks.
5. Once the notebook is open in Colab, use **File → Save a copy in Drive** to create **your own copy**. All your edits will then be stored in your Google Drive.

The original notebooks in GitHub remain unchanged. You cannot accidentally damage them. You work in your own copy.

## 2. Notebook basics: cells and Markdown (very briefly)

A notebook consists of **cells** arranged from top to bottom.

- **Code cells** contain Python code and produce outputs such as numbers, tables, or plots.
- **Text cells** (Markdown cells) contain formatted text for headings, lists, and explanations.

To run a code cell:
1. Click inside the cell.
2. Press **Shift + Enter** (or click the small play button on the left in Colab).

The output will appear directly below the cell.

**Markdown** is a lightweight markup language that controls basic formatting (headings, lists, bold, italics). In this module you need only a very small subset, and you can look it up when needed. A concise reference is available at:

- https://www.markdownguide.org/basic-syntax/

In the rest of this notebook we will focus on code cells and data handling.

### 2.1 First code cell: a simple message

Run the cell below. It prints a short message and demonstrates the basic **code → output** pattern.

In [None]:
# Run this cell (Shift + Enter)
print("Hello, FB2NEP")

#### Try it

- Change the text inside the quotation marks and run the cell again.
- Add a second line, for example:

```python
print(2 + 3)
```

Run the cell again and observe that both lines of output appear under the cell.

## 3. A very brief introduction to Python

This section introduces three ideas that are useful throughout FB2NEP:

1. What a **Python programme** is and why **indentation** matters.
2. What **libraries** are and how to use them.
3. Very basic **programme structure**: a condition (`if`) and a loop (`for`).

### 3.1 What is a Python programme?

A Python programme is a sequence of **statements** that will be executed from top to bottom. In a notebook the programme is effectively the combination of all code cells that you run.

Important points:

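- Python runs statements in **line order**: within a cell, earlier lines run before later lines. Across cells, what matters is the order in which you run them, not their position on the page.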
- Python uses **indentation** (spaces at the beginning of a line) to define structure. Indentation is not cosmetic formatting; it is part of the language.
- Lines starting with `#` are **comments** and are ignored by Python. They are for humans.

The small example below shows indentation and comments.

In [None]:
# A tiny example that uses indentation and a comment.

hippo_age = 12  # age in years

if hippo_age >= 10:
    # This line is indented and belongs to the 'if' block.
    print("This is an older hippo.")
else:
    # This line belongs to the 'else' block.
    print("This is a younger hippo.")

### 3.2 Conditions and loops (basic structure)

The two most common control structures are:

- **Condition**: `if condition: ... else: ...` to choose between two branches.
- **Loop**: `for item in collection: ...` to repeat an action for each element of a sequence.

The following example uses a `for` loop to look at several hippo ages.

In [None]:
# Example: loop over a list of hippo ages.

hippo_ages = [3, 7, 12]

for age in hippo_ages:
    if age >= 10:
        print("Age", age, "→ older hippo")
    else:
        print("Age", age, "→ younger hippo")
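
To see why indentation is part of the language rather than decoration, the following minimal sketch compares two loops that differ only in how far one line is indented. The ages are the same illustrative values as above; the variable names `older_count` and `total_count` are introduced here purely for illustration.

```python
# Illustrative ages (same values as in the loop example above).
hippo_ages = [3, 7, 12]

# Version A: the counting line is indented under the 'if',
# so it runs only for ages of 10 or more.
older_count = 0
for age in hippo_ages:
    if age >= 10:
        older_count = older_count + 1

# Version B: the counting line is indented under the 'for' only,
# so it runs once for every age in the list.
total_count = 0
for age in hippo_ages:
    if age >= 10:
        pass  # 'pass' does nothing; it keeps the 'if' block valid
    total_count = total_count + 1

print("Older hippos:", older_count)
print("All hippos:", total_count)
```

The two versions contain almost the same statements, but they count different things because the indentation decides which block each line belongs to.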

### 3.3 Libraries and how to use them

The Python standard library covers general-purpose tasks (for example file handling and basic mathematics), but most data analysis tools live in separate **libraries** that you import when you need them.

In FB2NEP we will mainly use three libraries:

| Library | Typical import | Main purpose |
|--------|-----------------|--------------|
| NumPy  | `import numpy as np` | Fast numerical operations and random numbers |
| pandas | `import pandas as pd` | Reading, cleaning, and summarising tabular data |
| Matplotlib | `import matplotlib.pyplot as plt` | Creating plots |

We will now import these libraries. In Colab they are already installed.

In [None]:
# Only run this cell if Colab reports that a library is missing.
# In the teaching environment this step is usually not necessary.
%pip install numpy pandas matplotlib --quiet

In [None]:
# Import the core libraries used in this module.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Print versions (useful for reproducibility and debugging).
import sys
import matplotlib
print("Python:", sys.version.split()[0])
print("NumPy:", np.__version__)
print("pandas:", pd.__version__)
print("Matplotlib:", matplotlib.__version__)

### 3.4 Objects, attributes, and methods (very briefly)

Libraries such as pandas and Matplotlib are built around **objects**. Examples include:

- a pandas **DataFrame** (a table),
- a pandas **Series** (a single column),
- a Matplotlib **Figure** (a plot canvas).

Objects usually provide:

- **attributes** (properties that you access with a dot, for example `hippos.shape`),
- **methods** (functions that belong to the object and are called with parentheses, for example `hippos.head()` or `hippos.describe()`).

You will see this in practice when we work with the hippo dietary survey.
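
As a preview, here is a minimal sketch of the difference between an attribute and a method. It assumes a small table called `demo` that is built by hand for illustration only; it is not the survey file used later.

```python
import pandas as pd

# Hypothetical mini-table for illustration (not the survey file).
demo = pd.DataFrame({
    "name": ["Helga", "Bruno", "Ama"],
    "age_years": [5, 12, 9],
})

# Attribute: no parentheses; it reports a stored property.
print(demo.shape)  # (number of rows, number of columns)

# Method: parentheses; it carries out a computation and returns a result.
print(demo["age_years"].mean())

# type() shows what kind of object you are working with.
print(type(demo))               # a DataFrame (the whole table)
print(type(demo["age_years"]))  # a Series (a single column)
```

A practical rule of thumb: if you forget the parentheses after a method name, Python shows a description of the method instead of running it.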

## 4. Hippo dietary survey: creating and loading a small dataset

For this module we will use a small **hippo dietary survey** as a toy example. In the repository it is stored as a CSV file at:

- `data/hippo_diet_survey.csv`

Each row represents one hippo. Columns include for example:

- `hippo_id` – unique identifier,
- `name` – hippo name,
- `age_years` – age in years,
- `habitat` – for example `River`, `Lake`, `Zoo`,
- `fruit_portions` – fruit portions per day,
- `veg_portions` – vegetable portions per day,
- `grass_kg` – kilograms of grass consumed per day.

In the actual teaching repository this file will already exist. In Colab, however, we may need to **recreate** it so that the example is self-contained. The cell below constructs a small DataFrame and writes it to `data/hippo_diet_survey.csv`.

In [None]:
# Create the hippo dietary survey dataset and save it as data/hippo_diet_survey.csv
# In the GitHub repository the file will already exist. This cell ensures that
# the dataset is present when running the notebook in a fresh Colab environment.

import os

# Ensure that the data directory exists.
os.makedirs("data", exist_ok=True)

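# Fix a random seed. The values below are typed in by hand, so the seed
# has no effect here; it matters only if random values are added later.
np.random.seed(11088)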

hippo_data = pd.DataFrame({
    "hippo_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": [
        "Helga", "Bruno", "Ama", "Kofi",
        "Lina", "Otto", "Sita", "Milo"
    ],
    "age_years": [5, 12, 9, 15, 3, 18, 7, 11],
    "habitat": [
        "River", "River", "Lake", "Zoo",
        "Lake", "River", "Zoo", "Lake"
    ],
    "fruit_portions": [2, 1, 3, 1, 2, 1, 3, 2],
    "veg_portions": [3, 4, 2, 5, 3, 4, 2, 3],
    "grass_kg": [40.5, 55.0, 48.2, 50.0, 37.8, 60.3, 45.0, 42.5]
})

hippo_path = "data/hippo_diet_survey.csv"
hippo_data.to_csv(hippo_path, index=False)

print(f"Saved hippo dataset to {hippo_path}")
hippo_data

### 4.1 Loading the hippo dataset from CSV

In a typical workflow the data file already exists and you only need to **read** it. The code below reads `data/hippo_diet_survey.csv` into a pandas **DataFrame** called `hippos` and shows the first few rows.

In [None]:
# Read the dataset from the CSV file.

hippo_path = "data/hippo_diet_survey.csv"
hippos = pd.read_csv(hippo_path)

# head() is a method: it returns the first few rows.
hippos.head()

### 4.2 Inspecting the data and variables

The `hippos` object is a pandas **DataFrame**. It has attributes and methods that help you to understand the structure of the data.

Commonly used methods and attributes include:

- `hippos.shape` – attribute with number of rows and columns.
- `hippos.columns` – attribute with column names.
- `hippos.info()` – method that prints data types and non-null counts for each column (a quick way to spot missing values).
- `hippos.describe()` – method with summary statistics for numeric columns.

Run the cell below and examine the output carefully.

In [None]:
# Inspect the structure of the hippo dataset.

print("Shape (rows, columns):", hippos.shape)

print("\nColumn names:")
print(hippos.columns.tolist())

print("\nBasic information:")
hippos.info()

print("\nSummary statistics for numeric columns:")
hippos.describe()

### 4.3 Using methods to summarise the hippo data

Many operations are available as **methods**. For example:

- `hippos["fruit_portions"].mean()` computes the mean of the `fruit_portions` column.
- `hippos.groupby("habitat")["grass_kg"].mean()` computes the mean grass intake per habitat.

These methods are part of the **object-oriented** design of pandas: the DataFrame and Series objects provide the relevant functionality via the dot notation.

In [None]:
# Mean fruit portions per day (all hippos).
mean_fruit = hippos["fruit_portions"].mean()
print(f"Mean fruit portions per day (all hippos): {mean_fruit:.2f}")

# Mean grass intake per habitat.
mean_grass_by_habitat = hippos.groupby("habitat")["grass_kg"].mean()
print("\nMean grass intake (kg per day) by habitat:")
print(mean_grass_by_habitat)

# Example of a simple condition on the DataFrame: hippos older than 10 years.
older_hippos = hippos[hippos["age_years"] > 10]
print("\nNumber of hippos older than 10 years:", len(older_hippos))
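
The filter `hippos[hippos["age_years"] > 10]` in the cell above works in two steps, which the following minimal sketch spells out. It assumes a small hand-built table called `demo` and a helper variable `is_older`; both names are introduced here for illustration only.

```python
import pandas as pd

# Hypothetical mini-table for illustration (not the survey file).
demo = pd.DataFrame({
    "name": ["Helga", "Bruno", "Ama", "Kofi"],
    "age_years": [5, 12, 9, 15],
})

# Step 1: the comparison returns one True/False value per row.
is_older = demo["age_years"] > 10
print(is_older.tolist())

# Step 2: using that Series inside [] keeps only the rows marked True.
older = demo[is_older]
print(older["name"].tolist())

# Because True counts as 1, sum() gives the number of matching rows.
print(is_older.sum())
```

Writing the condition on its own line first can make longer filters easier to read and to check.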

### 4.4 Plotting the hippo data

We can now create a simple plot using **Matplotlib**. A common pattern is:

1. Prepare a summary table in pandas.
2. Pass the summary values to Matplotlib.

Below we create a bar chart of **mean grass intake by habitat**.

In [None]:
# Prepare the summary again (for clarity).
mean_grass_by_habitat = hippos.groupby("habitat")["grass_kg"].mean()

# Create a bar chart.
plt.figure()
plt.bar(mean_grass_by_habitat.index, mean_grass_by_habitat.values)
plt.xlabel("Habitat")
plt.ylabel("Mean grass intake (kg per day)")
plt.title("Hippo grass intake by habitat")
plt.xticks(rotation=15)
plt.tight_layout()
plt.show()

#### Try it

Using the `hippos` DataFrame:

1. Compute the mean fruit portions per habitat using `groupby` and `mean`.
2. Create a bar chart for mean fruit portions by habitat.
3. Change the title and axis labels of the plot so that they describe your new chart.

Optional:
- Check for missing values using `hippos.isna().sum()`.
- Select only hippos from one habitat, for example river hippos:  
  `river_hippos = hippos[hippos["habitat"] == "River"]`.
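
The hippo dataset constructed in this notebook has no missing values, so the optional check returns zero for every column. To see what the check looks like when something is missing, here is a minimal sketch that assumes a hypothetical mini-table `demo` with one value deliberately left out.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table with one deliberately missing value (np.nan).
demo = pd.DataFrame({
    "habitat": ["River", "Lake", "Zoo"],
    "fruit_portions": [2.0, np.nan, 3.0],
})

# isna() marks each cell as True (missing) or False (present);
# sum() then counts the True values in each column.
print(demo.isna().sum())

# By default, mean() skips missing values rather than failing.
print(demo["fruit_portions"].mean())
```

Because summaries such as `mean()` silently skip missing values, it is good practice to count them first so that you know how many observations a summary is based on.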

## 5. Reproducibility and open-science principles

One key reason to use notebooks and version-controlled repositories (GitHub) in nutritional epidemiology is **reproducibility**.

In a reproducible analysis:

- The path from data to results is visible.
- Another researcher (or your future self) can rerun the analysis and obtain the same numbers and plots.
- Important choices (for example exclusion criteria, variable definitions) are documented in text near the code.

Notebooks help with this because they combine:

- code (what you did),
- outputs (what you obtained),
- and explanations (why you did it).

The teaching notebooks for FB2NEP are stored in a **Git repository** on GitHub. Git records the history of changes over time. In later parts of your degree you may use Git directly for your own projects.

For now, a few simple good practices are sufficient:

- Keep notebooks and data in a consistent folder structure.
- Use clear, descriptive variable names in code.
- Record decisions in short Markdown notes.
- Fix a random seed (`np.random.seed(...)`) when you use random numbers, so that results are repeatable.
- When possible, use open formats such as CSV for data and share both data (if appropriate) and analysis code.

In [None]:
# Small demonstration of a fixed random seed.
# Run this cell several times and check that the numbers stay the same.

np.random.seed(11088)
values = np.random.normal(loc=0, scale=1, size=5)
print("Random values:", values)
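
The cell above uses `np.random.seed`, which sets one global seed for the whole notebook. For new code, the NumPy documentation recommends creating an explicit generator object with `np.random.default_rng` instead. The following is a minimal sketch of that alternative, not the approach used in the FB2NEP notebooks; the names `rng_a` and `rng_b` are chosen here for illustration.

```python
import numpy as np

# The seed value 11088 is reused from the cell above.
rng_a = np.random.default_rng(11088)
rng_b = np.random.default_rng(11088)

values_a = rng_a.normal(loc=0, scale=1, size=5)
values_b = rng_b.normal(loc=0, scale=1, size=5)

# Two generators created with the same seed produce the same sequence.
print(np.array_equal(values_a, values_b))
```

The numbers produced this way differ from those of `np.random.seed(11088)` because the generator uses a different algorithm, so choose one approach per analysis and keep to it.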

## 6. Recap

In this introductory notebook you have:

- seen how the FB2NEP notebooks live in a GitHub repository and are opened in **Google Colab**,
- run and edited simple Python code cells,
- learned that indentation and line order matter for Python programmes,
- imported and briefly used the core libraries **NumPy**, **pandas**, and **Matplotlib**,
- created and loaded a small **hippo dietary survey** dataset from a CSV file,
- inspected the dataset with methods such as `.head()`, `.info()`, `.describe()`,
- used methods such as `.mean()` and `.groupby()` to summarise variables,
- produced a simple bar chart from summarised data,
- and discussed how notebooks and Git support **reproducible and transparent** analyses.

These elements will recur throughout the FB2NEP epidemiology materials. The aim is that the tools become familiar so that you can concentrate on the underlying nutritional and epidemiological questions.

---
## Appendix (optional): running notebooks locally

If you prefer, you can also run the notebooks on your own computer instead of Colab. This requires a local installation of Python and Jupyter.

Two common approaches are:

1. **Conda / Miniconda** (recommended for beginners):

   ```bash
   conda create -n fb2nep python=3.11 -y
   conda activate fb2nep
   conda install jupyterlab numpy pandas matplotlib -y
   jupyter lab
   ```

2. **`venv` and `pip`**:

   ```bash
   python -m venv fb2nep
   # macOS / Linux
   source fb2nep/bin/activate
   # Windows (PowerShell)
   fb2nep\Scripts\activate
   pip install jupyterlab numpy pandas matplotlib
   jupyter lab
   ```

Key concepts:

- **Environment**: an isolated Python installation with its own set of packages.
- **Kernel**: the Python process that executes the code of a notebook.
- **Working directory**: the folder from which the notebook reads and writes files.
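
The working directory matters because relative paths such as `data/hippo_diet_survey.csv` are resolved against it. The following minimal sketch, using only the standard library, shows one way to check where a notebook is running and whether the data file can be found from there; the variable names are illustrative.

```python
from pathlib import Path

# Folder that relative paths are resolved against.
cwd = Path.cwd()
print("Working directory:", cwd)

# Relative path used earlier in this notebook.
data_file = Path("data/hippo_diet_survey.csv")

# exists() reports whether the file can be found from the working directory.
if data_file.exists():
    print("Found:", data_file.resolve())
else:
    print("Not found. Check the working directory or recreate the file.")
```

If the file is reported as not found, the most common causes are that the notebook was started from a different folder or that the dataset cell in section 4 has not been run yet.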
