# 👩‍💻 Notebook 0 – Getting Started

Welcome to the **Data Analysis Toolkit for Food and Nutrition Sciences**!  
Before we dive into exciting topics like nutrient analysis and clinical trials, it’s essential to set up a working environment you can rely on.

---

## 🎯 Objectives

By the end of this notebook, you will:

- Understand the different ways you can run Python (locally or online)
- Set up and verify your Python environment
- Install the required libraries
- Test everything with a simple example

## 🐍 What is Python?

Python is a **programming language** — a way of giving instructions to a computer.  
It's widely used in data science, web development, automation, and more.

In this course, we use Python to:
- Read and clean nutrition datasets
- Calculate statistics and visualise trends
- Build simple models and make predictions

Python is known for being:
- **Readable** 📝 – Code looks a lot like English
- **Flexible** 🔧 – You can use it for almost anything
- **Popular** 🌍 – There’s a huge community and lots of free resources

---

<details>
<summary>📘 Click to expand: More about Python (for curious minds)</summary>

### 🧠 Why is it called Python?

Python is named after the comedy group *Monty Python*, not the snake.  
You’ll often see silly or creative examples in the Python community – it’s part of the charm!

### 🧰 What can Python do?

Beyond data analysis, Python can also:
- Power websites (e.g. Instagram, Reddit)
- Control robots and IoT devices
- Train machine learning models
- Automate repetitive tasks (e.g. renaming files)

### 🔄 What does Python code look like?

```python
# This is Python code!
name = "hippo"
print(f"Hello, {name}!")
```

### 📎 Want to explore more?

- [Python for Beginners (python.org)](https://www.python.org/about/gettingstarted/)
- [What is Python? (realpython.com)](https://realpython.com/what-is-python/)
- [Python Data Science Handbook: Preface (opens in Google Colab)](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/00.00-Preface.ipynb)

</details>


## 🌍 Running Python: Your Options

There are two common ways to run Python for data analysis:

### 🖥️ 1. **Local installation**
You install Python, Jupyter, and other libraries directly on your computer. This gives you full control and works well for advanced users.

### ☁️ 2. **Google Colab (Recommended for beginners)**
This free, web-based platform runs Python in the cloud — no installation required.

- It's ideal for beginners or anyone working on shared machines (e.g. university PCs).
- All you need is a Google account.

This toolkit is designed to work **seamlessly in Google Colab**, but you can also download it and run it locally if you prefer.


## 🔍 What is a Python "environment"?

A Python environment is a collection of installed tools and packages. Think of it as your lab bench:

- Python is the bench itself.
- Packages like `pandas`, `numpy`, and `matplotlib` are your tools.
- You can create custom environments to keep tools separate for different projects.

Colab already provides a pre-configured environment — we'll just add a few extra tools.
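To see which environment a notebook is actually using, you can ask Python itself. The cell below is a minimal sketch that relies only on the standard library. The variable name `in_colab` is just illustrative, and looking for `google.colab` in `sys.modules` is a common convention rather than an official API.

```python
import sys

# Which Python is this notebook running on?
print("Python version:", sys.version.split()[0])
print("Interpreter path:", sys.executable)

# Colab loads its own `google.colab` module at start-up, so its presence
# in sys.modules is a widely used (unofficial) way to detect Colab.
in_colab = "google.colab" in sys.modules
print("Running in Google Colab:", in_colab)
```

If the interpreter path points somewhere unexpected on a local machine, the notebook is probably using a different environment from the one you installed your packages into.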


<details>
<summary>🦛</summary>

Just like a hippo needs the right waterhole to cool off, you need the right environment to analyse your data comfortably. Let’s get you set up!
</details>


## 📂 Loading Data: Different Ways to Do It

Before we begin analysing data, we need to **load it into our Python environment**.

There are a few common ways to do this in Colab or Jupyter:

### 🧳 Option 1: Load from the Internet (Recommended)
If your data is stored in a GitHub repository (like this project), you can automatically download and use it in Google Colab.  
This is great because:

- You don’t need to upload files manually
- Everyone in your group sees the same structure

We’ll start by trying to **clone the whole GitHub repository**, just like downloading a suitcase full of datasets and notebooks.

### 📁 Option 2: Upload Manually
If cloning fails or you're using your own file, you can upload it manually from your computer.  
This is helpful if:

- You're working with private data
- You're just testing out a quick idea

<details>
<summary>🦛</summary>

Think of it like this:

- The GitHub repository is your **shared hippo pantry**
- Uploading a file manually is like **bringing your own snacks**
</details>

We'll now run a code cell that first **tries the automatic method**, and falls back to manual upload if needed.  
Don’t worry — it explains everything along the way!


```python
# Setup for Google Colab: fetch the dataset automatically, or upload it manually
import os
from google.colab import files

# Module folder and dataset used by this notebook
MODULE = '01_infrastructure'
DATASET = 'hippo_diets.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)  # relative to the working directory

# Step 1: Attempt to clone the repository (automatic method)
# Note: cloning is skipped when the folder already exists, so re-running this
#       cell does not trigger 'fatal: destination path already exists'.
try:
    print('Attempting to clone repository...')
    if os.path.exists(BASE_PATH):
        print('Repository already exists, skipping clone.')
    else:
        # Clone into BASE_PATH explicitly so the result does not depend on the current directory
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git {BASE_PATH}

    # Debug: print the directory structure
    print('Listing repository contents:')
    !ls {BASE_PATH}
    print('Listing notebooks directory contents:')
    !ls {BASE_PATH}/notebooks

    # Check that the module directory exists
    if not os.path.exists(MODULE_PATH):
        raise FileNotFoundError(f'Module directory {MODULE_PATH} not found. Check the repository structure.')

    # Set the working directory to the notebook's folder
    os.chdir(MODULE_PATH)

    # Verify that the dataset is accessible
    if not os.path.exists(DATASET_PATH):
        raise FileNotFoundError(f'Dataset {DATASET} not found after cloning.')
    print(f'Dataset found: {DATASET_PATH} 🦛')
except Exception as e:
    print(f'Automatic setup failed: {e}')
    print('Falling back to manual upload option...')

    # Step 2: Manual upload option
    print(f'Please upload {DATASET} manually.')
    print('1. Click the "Choose Files" button below.')
    print(f'2. Select {DATASET} from your local machine.')
    print(f'3. The file will be saved to {DATASET_PATH} in the current working directory.')

    # Create the data directory if it doesn't exist
    os.makedirs('data', exist_ok=True)

    # Prompt the user to upload the dataset
    uploaded = files.upload()

    # Check that the expected file was uploaded, then save it into data/
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} to {DATASET_PATH} 🦛')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')

# Install required packages for this notebook
%pip install pandas numpy
print('Python environment ready.')
```
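The cell above depends on `google.colab`, so it is intended for Colab only. If you run the notebook locally after downloading the repository, something like the sketch below may help you confirm where the dataset is. `find_dataset` is a hypothetical helper written for this example, not part of the course repository, and it assumes the file sits in a `data` folder in the current directory or one of its parent folders.

```python
from pathlib import Path


def find_dataset(name, start=None):
    """Return the path to data/<name> in `start` or a parent folder, or None."""
    start = Path(start) if start is not None else Path.cwd()
    # Look in the starting folder first, then walk up through its parents
    for folder in [start, *start.parents]:
        candidate = folder / "data" / name
        if candidate.is_file():
            return candidate
    return None


path = find_dataset("hippo_diets.csv")
if path is None:
    print("hippo_diets.csv not found; check that you downloaded the data folder.")
else:
    print(f"Dataset found at {path}")
```

Once a path is found, you can pass it directly to `pd.read_csv(path)`.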

## 📦 Installing and Using Python Packages

Python is powerful, but it doesn’t come with everything built-in.  
That’s where **packages** come in — they’re like apps you install to give Python superpowers!

---

### 🛠️ What is a package?

A package is a collection of code written by others that you can reuse in your own projects.

Think of them as:

- 🧰 Specialised tools you add to your data analysis workbench
- 📚 Cheat codes that help you do complex things with just a few lines

---

### 📦 In this notebook, we’ll install and use:

- **`pandas`** – Makes working with data tables easy, like using a spreadsheet in Python
- **`numpy`** – Adds powerful maths and statistics tools (great for calculations!)
- **`matplotlib`** – Lets you create simple graphs and plots

We’ll use `%pip install` to make sure these are available in your current environment.  
(`%pip` is a notebook command that works in both Google Colab and Jupyter. It behaves like the terminal command `pip`, but installs into the environment the notebook is running in.)

---

<details>
<summary>🦛</summary>

*Even hippos appreciate the right tools for the job. Let’s load ours!*
</details>


```python
# Install core packages (works in Colab and in local Jupyter)
%pip install pandas numpy matplotlib

# Import the packages under their conventional short names
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
print('Your data analysis environment is ready!')
```

```text
Your data analysis environment is ready!
```
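If you want to confirm exactly which versions were installed, the standard library can report them. This is a minimal sketch. `installed_version` is a hypothetical helper defined here for illustration, while `importlib.metadata` itself ships with Python 3.8 and later.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the status of each package used in this notebook
for name in ["pandas", "numpy", "matplotlib"]:
    found = installed_version(name)
    print(f"{name}: {found or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` can be added by re-running the `%pip install` cell.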


## ✅ Test Your Setup

Let’s check that:

- Your Python environment works
- Required packages are installed
- You can read a file and make a plot

We'll use a small dataset called `hippo_diets.csv`, which contains sample records from a fictional hippo nutrition study. 🦛

---

### 📚 What will this code do?

1. **Read a CSV file**  
   We use `pd.read_csv()` to load the dataset into a `DataFrame` — a special table-like structure in Python that’s great for analysis.

2. **Print the first row**  
   This lets us quickly check that the data loaded correctly.

3. **Make a scatter plot**  
   We’ll plot `Protein` against `Calories` to get a feel for the data.
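To make step 1 more concrete, here is a small, self-contained sketch of what `pd.read_csv()` produces. It reads CSV text from memory instead of a file, so it runs anywhere. The first data row is copied from this notebook's sample output; rows `H2` and `H3` are invented purely for illustration and are not from the real dataset. The name `df_demo` is also just for this example.

```python
import io

import pandas as pd

# Row H1 comes from the notebook's sample output.
# Rows H2 and H3 are made up for illustration only.
csv_text = """ID,Calories,Protein,Date
H1,2500,80.5,2024-01-01
H2,2700,85.0,2024-01-02
H3,2300,75.5,2024-01-03
"""

# read_csv accepts a file path or any file-like object
df_demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(df_demo.shape)   # (number of rows, number of columns)
print(df_demo.dtypes)  # the data type pandas inferred for each column

# Check that the columns needed for the plot are present
missing = {"Calories", "Protein"} - set(df_demo.columns)
print("Missing columns:", missing or "none")
```

Running the same `shape`, `dtypes`, and column checks on the real `df` is a quick way to catch a wrongly named or half-loaded file before plotting.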



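```python
# Load the dataset into a DataFrame
df = pd.read_csv('data/hippo_diets.csv')

# Show the first row to confirm the data loaded correctly
print(df.head(1))

# Scatter plot of protein against calories
plt.scatter(df['Calories'], df['Protein'])
plt.xlabel('Calories')
plt.ylabel('Protein (g)')
plt.title('Sample Hippo Diet Data')
plt.show()
```

```text
   ID  Calories  Protein        Date
0  H1      2500     80.5  2024-01-01
```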


## ✅ Conclusion

🎉 **Success!** You’ve verified that your Python environment is working properly:

- ✅ You’ve installed key packages  
- ✅ Loaded your first dataset  
- ✅ Created a simple visualisation

You're now fully set up and ready to begin exploring data in the exciting world of food and nutrition science. 🦛

---

### 🚀 What’s Next?

Head to **Notebook 1.1** to begin your journey into data science environments — you’ll learn how to think like a data scientist and explore how we actually work with data.

---

### 📚 Helpful Resources

- 🔧 [Install Anaconda (for local setup)](https://www.anaconda.com/products/distribution)  
- ☁️ [Google Colab Documentation](https://colab.research.google.com/)  
- 📦 [Course Repository on GitHub](https://github.com/ggkuhnle/data-analysis-projects)
