# 📊 3.2 Importing Data

Before you can analyse data, you first need to **import** it into Python. This notebook introduces the most common ways of bringing nutrition data into your workflow.

Think of it like shopping for ingredients 🛒: you can’t cook (analyse) until you’ve brought the food (data) into your kitchen (Python).

***
## 🎯 Objectives
By the end of this notebook you should be able to:
- Import data from **CSV** and **Excel** files using pandas.
- Recognise other common formats: TSV, JSON, databases, APIs.
- Verify the data after import (shape, column names, first rows of the table).
- Apply these skills to `hippo_nutrients.csv`.


In [None]:
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files


MODULE = '03_data_handling'
DATASET = 'hippo_nutrients.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

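try:
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    os.chdir(MODULE_PATH)
    # Raise an informative error instead of a bare assert (which has no message)
    if not os.path.exists(DATASET_PATH):
        raise FileNotFoundError(f'{DATASET_PATH} not found in {MODULE_PATH}')
    print(f'Dataset found: {DATASET_PATH} ✅')
except Exception as e:
    # Any failure above (clone, chdir, missing file) falls back to a manual upload
    print(f'Automatic setup failed ({e}); please upload {DATASET} manually.')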
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} ✅')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')


print('Environment ready.')

Install additional libraries.

In [None]:
%pip install pandas openpyxl sqlalchemy -q
import numpy as np
import pandas as pd

## 📂 Importing CSV Files

CSV (Comma-Separated Values) is one of the most common formats in nutrition research. It is plain text with one row per line and commas between columns, so it is easy to share and can be opened in Excel or any text editor.

Let’s load `hippo_nutrients.csv`.

In [None]:
df_csv = pd.read_csv('data/hippo_nutrients.csv')
print(f'Shape: {df_csv.shape}')
print(f'Columns: {df_csv.columns.tolist()}')
display(df_csv.head())
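
Shape, column names and the first rows are a good start, but it can also help to check the inferred data types and count missing values straight after import. The snippet below is a minimal sketch of those checks on a tiny in-memory CSV. The column names and values are made up for illustration and are not necessarily those of `hippo_nutrients.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for a small CSV file. The column names and values
# are illustrative assumptions, not the real contents of hippo_nutrients.csv.
csv_text = """ID,Nutrient,Value
H1,Iron,8.2
H2,Calcium,1200.5
H3,Iron,
"""

# io.StringIO lets pandas read the text as if it were a file on disk
df_check = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df_check.shape              # number of rows and columns
column_names = df_check.columns.tolist()     # column labels as a plain list
column_types = df_check.dtypes               # data type inferred per column
missing_per_column = df_check.isna().sum()   # missing values per column

print(f'Rows: {n_rows}, columns: {n_cols}')
print(f'Columns: {column_names}')
print(column_types)
print(missing_per_column)
```

The same four checks can be run on `df_csv` by swapping in that DataFrame.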

## 📑 Importing Excel Files

Excel is common in labs and public health datasets. You may encounter files with **multiple sheets**, formatting quirks, or missing values.

Example (requires `openpyxl`):

```python
# %pip install openpyxl  # uncomment if needed in Colab
# df_excel = pd.read_excel('data/hippo_nutrients.xlsx', sheet_name='Sheet1')
```

For this demo we’ll reuse the CSV DataFrame but pretend it came from Excel.

In [None]:
df_excel = df_csv.copy()
print(f'Excel shape: {df_excel.shape}')
display(df_excel.head())
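
To see how a workbook with multiple sheets behaves, one option is to write a small Excel file to a temporary folder and read it back. This is only a sketch: the sheet names, column names and values are invented for the example, and it assumes `openpyxl` is installed (as in the install cell above). Passing `sheet_name=None` to `pd.read_excel` returns a dictionary with one DataFrame per sheet.

```python
import os
import tempfile

import pandas as pd

# Two small made-up tables; names and values are illustrative only
df_intake = pd.DataFrame({'ID': ['H1', 'H2'], 'Value': [8.2, 1200.5]})
df_units = pd.DataFrame({'Nutrient': ['Iron', 'Calcium'], 'Unit': ['mg', 'mg']})

with tempfile.TemporaryDirectory() as tmp_dir:
    xlsx_path = os.path.join(tmp_dir, 'demo_nutrients.xlsx')  # hypothetical file name

    # Write each DataFrame to its own sheet in one workbook
    with pd.ExcelWriter(xlsx_path, engine='openpyxl') as writer:
        df_intake.to_excel(writer, sheet_name='Intake', index=False)
        df_units.to_excel(writer, sheet_name='Units', index=False)

    # sheet_name=None loads every sheet into a dict: {sheet name: DataFrame}
    sheets = pd.read_excel(xlsx_path, sheet_name=None)

for name, sheet_df in sheets.items():
    print(f'Sheet {name!r}: shape {sheet_df.shape}')
```

Checking the sheet names first is a quick way to spot a workbook where the data is not on the sheet you expected.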

## 🔗 Other Common Data Sources

### TSV (Tab-Separated Values)
```python
df_tsv = pd.read_csv('data/example.tsv', sep='\t')
```
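
Since `data/example.tsv` is only a placeholder, here is a small sketch that runs without any file. It reads a made-up tab-separated string twice, once with `sep='\t'` and once with the default comma separator, to show why checking the shape after import matters. The column names and values are invented for illustration.

```python
import io

import pandas as pd

# Made-up tab-separated text standing in for a TSV file
tsv_text = "ID\tNutrient\tValue\nH1\tIron\t8.2\nH2\tCalcium\t1200.5\n"

# Correct: tell pandas that columns are separated by tabs
df_tsv_demo = pd.read_csv(io.StringIO(tsv_text), sep='\t')

# Wrong separator: pandas looks for commas, finds none,
# and puts each whole line into a single column
df_wrong_sep = pd.read_csv(io.StringIO(tsv_text))

print(f'With sep="\\t": {df_tsv_demo.shape}')
print(f'With default sep: {df_wrong_sep.shape}')
```

A DataFrame with a single column and tab characters in its header is a typical sign that the separator was wrong.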

### JSON (e.g. from APIs)
```python
df_json = pd.read_json('data/example.json')
```
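
JSON from APIs is often nested, and in that case `pd.read_json` alone may not give a tidy table. The sketch below uses made-up records (field names and numbers are purely illustrative) to show a flat JSON array read with `pd.read_json`, and nested records flattened with `pd.json_normalize`.

```python
import io

import pandas as pd

# Flat JSON array of records: read_json can turn this into a table directly.
# Wrapping the string in io.StringIO makes pandas treat it like a file.
flat_json = '[{"nutrient": "Iron", "unit": "mg"}, {"nutrient": "Calcium", "unit": "mg"}]'
df_flat = pd.read_json(io.StringIO(flat_json))

# Nested records (illustrative values): each record holds a sub-dictionary
nested_records = [
    {'nutrient': 'Iron', 'reference': {'unit': 'mg', 'amount': 10}},
    {'nutrient': 'Calcium', 'reference': {'unit': 'mg', 'amount': 20}},
]

# json_normalize flattens the sub-dictionary into dotted column names
df_nested = pd.json_normalize(nested_records)

print(df_flat)
print(df_nested.columns.tolist())
```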

### SQL Databases (clinical or survey data)
```python
from sqlalchemy import create_engine
engine = create_engine('sqlite:///data/nutrition.db')
df_sql = pd.read_sql('SELECT * FROM hippo_nutrients', engine)
```
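
The example above needs an existing `nutrition.db` file. For a version that runs anywhere, the sketch below swaps SQLAlchemy for Python's built-in `sqlite3` module and an in-memory database, which pandas also accepts. The table contents are made up, and the `?` placeholder with `params` is one way to pass a filter value without pasting it into the SQL string.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database: nothing is written to disk
conn = sqlite3.connect(':memory:')

# Made-up rows standing in for a real nutrient table (illustrative only)
df_source = pd.DataFrame({
    'ID': ['H1', 'H2', 'H3'],
    'Nutrient': ['Iron', 'Calcium', 'Iron'],
    'Value': [8.2, 1200.5, 7.9],
})
df_source.to_sql('hippo_nutrients', conn, index=False)

# '?' is SQLite's placeholder; the value is supplied separately via params
query = 'SELECT * FROM hippo_nutrients WHERE Nutrient = ?'
df_iron = pd.read_sql(query, conn, params=('Iron',))

conn.close()
print(df_iron)
```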

### Web APIs (advanced)
Many nutrition datasets are accessible via APIs. Example:
```python
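import requests
url = 'https://api.example.com/nutrients'  # placeholder URL
response = requests.get(url, timeout=10)   # avoid waiting forever on a slow server
response.raise_for_status()                # stop early on HTTP errors (4xx/5xx)
data = response.json()
df_api = pd.DataFrame(data)                # assumes the API returns a list of records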
```

👉 You don’t need to master all these at once, but it’s important to know they exist!

## 🧪 Exercise 1: CSV Practice
1. Load `hippo_nutrients.csv` into a DataFrame.
2. Print the first 5 rows.
3. How many unique nutrients are in the dataset?

## 🧪 Exercise 2: Explore Other Formats
Imagine you receive data from collaborators:
- NDNS data in a TSV file.
- A JSON file with nutrient metadata.

How would you adapt the code above to import these files?

💡 *You don’t need the real files now—just sketch out the pandas command you’d use.*

## ✅ Conclusion
In this notebook you:
- Imported CSV data and saw how to read Excel files.
- Learned about other formats (TSV, JSON, SQL, APIs).
- Practised verifying shapes, columns, and contents after import.

👉 Next up: **3.3 Data Cleaning** — because imported data is rarely perfect, and cleaning is where the real fun begins!

***
**Resources**:
- [Pandas I/O Documentation](https://pandas.pydata.org/docs/user_guide/io.html)
- [OpenPyXL Documentation](https://openpyxl.readthedocs.io/)
- [SQLAlchemy](https://docs.sqlalchemy.org/)
- [Requests (Python HTTP for APIs)](https://docs.python-requests.org/)
- Repository: [github.com/ggkuhnle/data-analysis-projects](https://github.com/ggkuhnle/data-analysis-projects)