# 01 – Welcome to Python, Colab, and Data Handling

In this notebook, we will:
- Learn how to load and explore data in Python
- Understand how to handle variables, data types, and missing data
- Prepare for working with real public health nutrition datasets

👉 Before starting, make sure you've completed `00_playground.ipynb`.


In [None]:
import pandas as pd              # for working with tabular data
import numpy as np               # for numerical operations
import matplotlib.pyplot as plt  # for plotting
import seaborn as sns            # for nicer statistical visualisations

# Configure plots
sns.set(style="whitegrid")


## 📊 Load the dataset

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/ggkuhnle/FB2NEP_datascience/main/data/fb2nep_data.csv')
df.head()
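After loading, it can help to check the size of the table and how pandas read any empty fields. The following is a minimal sketch that uses a small in-memory CSV in place of the course file; the column names and values are made up for illustration and are not taken from `fb2nep_data.csv`.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a real file (hypothetical columns and values)
csv_text = """id,age,sex,bmi
1,34,female,22.5
2,51,male,NA
3,,female,27.1
"""

toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names
print(toy.head(2))        # first two rows

# Empty fields and the text "NA" are read as missing values (NaN) by default
print(toy.isna().sum())
```

The same calls (`.shape`, `.columns`, `.head()`) work unchanged on a DataFrame loaded from a URL.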

## 🧾 Basic Summary of the Data

In [None]:
df.info()
df.describe(include='all')

## 🔍 Understanding Data Types in Pandas

- `object`: usually text (strings)
- `int64`: whole numbers
- `float64`: decimal numbers
- `bool`: `True` or `False`
- `category`: memory-efficient alternative to `object` for columns with a small, fixed set of values (e.g., male/female)

In [None]:
df.dtypes
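To illustrate the data types listed above, here is a short sketch on a toy DataFrame. The columns are hypothetical and only loosely resemble what a nutrition dataset might contain. It shows how to inspect types and how to convert a text column to `category`.

```python
import pandas as pd

# Toy data with hypothetical columns
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "age": [34, 51, 46, 29],
    "bmi": [22.5, 30.1, 27.1, 24.8],
    "smoker": [False, True, False, False],
})

print(toy.dtypes)  # one data type per column

# Convert a text column with few distinct values to 'category'
toy["sex"] = toy["sex"].astype("category")
print(toy["sex"].dtype)
print(list(toy["sex"].cat.categories))  # the fixed set of allowed values
```

Converting to `category` is optional; it mainly pays off for large tables and for plotting or modelling functions that treat categories specially.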

## ⚠️ Understanding Missing Data

In real datasets, it's common to encounter **missing values**. In pandas these usually appear as `NaN` and represent unknown or unrecorded data.

### Why does missing data matter?

- It can reduce statistical **power**
- It may introduce **bias** if not handled correctly
- It influences how you interpret results and which methods are valid

---

### 🔍 Types of Missingness

Understanding *why* data are missing helps determine how to handle them. There are three main types:

#### 1. **MCAR** – Missing Completely at Random  
The missingness is independent of both observed and unobserved data.  
> Example: A questionnaire page gets lost in the printer for some participants.

→ Analyses remain unbiased (but may lose power).

---

#### 2. **MAR** – Missing at Random  
The missingness is related to observed data but, once those are accounted for, **not** to the missing value itself.  
> Example: Older participants are less likely to report income, but age is recorded.

→ Can be handled via methods like multiple imputation or model-based adjustment.

---

#### 3. **MNAR** – Missing Not at Random  
The missingness depends on unobserved values: the reason data are missing is related to the missing value itself.  
> Example: People with very high alcohol intake choose not to report it.

→ This is the most problematic and requires advanced methods or sensitivity analysis.
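The three mechanisms can be compared in a small simulation. This is a sketch on made-up data, not the course dataset: the variable names, effect sizes, and missingness probabilities are all assumptions chosen for illustration. It compares the mean of the observed values with the true mean under each mechanism.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Simulated data (assumed relationship): intake rises with age
age = rng.uniform(20, 80, n)
intake = 10 + 0.2 * age + rng.normal(0, 3, n)
sim = pd.DataFrame({"age": age, "intake": intake})

# MCAR: every value has the same 30% chance of being missing
mcar = rng.random(n) < 0.3
# MAR: the chance of missingness depends only on observed age
mar = rng.random(n) < np.where(age > 50, 0.5, 0.1)
# MNAR: the chance of missingness depends on the intake value itself
mnar = rng.random(n) < np.where(intake > intake.mean(), 0.5, 0.1)

true_mean = sim["intake"].mean()
observed_means = {
    name: sim.loc[~mask, "intake"].mean()
    for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]
}
print(f"True mean intake: {true_mean:.2f}")
for name, value in observed_means.items():
    print(f"{name}: mean of observed values = {value:.2f}")

# Under MAR, using the observed variable (age) removes the bias:
# within the older group, observed and true means agree closely
older = sim["age"] > 50
true_older = sim.loc[older, "intake"].mean()
mar_older = sim.loc[older & ~mar, "intake"].mean()
print(f"Older group: true = {true_older:.2f}, observed under MAR = {mar_older:.2f}")
```

In this setup the naive overall mean is close to the truth only under MCAR. Under MAR it is off, but the bias disappears once age is taken into account, which is what imputation and model-based adjustment exploit. Under MNAR no observed variable explains the missingness, so the bias cannot be removed this way.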

---

### 💡 In practice

Before doing any statistical analysis:
- Use `df.isna().sum()` to check for missing data
- Consider *why* values might be missing
- Choose an appropriate strategy (drop, impute, model)

We'll explore these in context throughout this module.


In [None]:
df.isna().sum()
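Counts alone can be hard to judge, so it is often useful to look at the percentage of missing values and at whether missingness differs between groups. A minimal sketch on a toy DataFrame with hypothetical columns and values:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "age": [34, np.nan, 46, 29, 62],
    "sex": ["female", "male", "female", "male", "female"],
    "bmi": [22.5, 30.1, np.nan, np.nan, 26.0],
})

missing_count = toy.isna().sum()       # number of missing values per column
missing_pct = toy.isna().mean() * 100  # percentage of missing values per column
summary = pd.DataFrame({"n_missing": missing_count, "pct_missing": missing_pct})
print(summary)

# A first check on *why* values might be missing:
# does the proportion of missing bmi differ by sex?
print(toy["bmi"].isna().groupby(toy["sex"]).mean())
```

A clear difference between groups would suggest the data are not MCAR, although it cannot distinguish MAR from MNAR.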

Let's introduce some missing values completely at random (MCAR), so that we can practise handling them.

In [None]:
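df_missing = df.copy()  # work on a copy so the original data stay unchanged

# NaN can only be stored in a float column, so convert age first
# (this has no effect if the column is already float)
df_missing['age'] = df_missing['age'].astype(float)

# Introduce some missing values in the copy: 2% of age, 3% of bmi
df_missing.loc[df_missing.sample(frac=0.02, random_state=11088).index, 'age'] = np.nan
df_missing.loc[df_missing.sample(frac=0.03, random_state=42).index, 'bmi'] = np.nan

What do we find now?

In [None]:
df_missing.isna().sum()

### Replace or remove missing data?

In [None]:
# Example: fill missing age with the median of the observed ages
df_missing['age'] = df_missing['age'].fillna(df_missing['age'].median())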
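To make the choice between removing and replacing concrete, the sketch below applies both to a toy DataFrame with hypothetical values. The `age_was_missing` flag column is an illustrative addition, not part of the course dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "age": [30.0, 40.0, np.nan, 50.0, 90.0],
    "bmi": [22.0, np.nan, 27.0, 24.0, 31.0],
})

# Option 1: drop rows with a missing value in any column (complete-case analysis)
dropped = toy.dropna()

# Option 2: drop only rows where one specific column is missing
dropped_age = toy.dropna(subset=["age"])

# Option 3: impute with the median, keeping a flag for the filled-in values
imputed = toy.copy()
imputed["age_was_missing"] = imputed["age"].isna()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

print(len(toy), len(dropped), len(dropped_age))  # rows kept by each approach
print(imputed)
```

Dropping rows loses information, while median imputation keeps every row but makes the variable look less variable than it really is. Which trade-off is acceptable depends on how much is missing and why.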

## 🧮 Subsetting Data

In [None]:
# Filter for females only
df_female = df[df['sex'] == 'female']
df_female.head()
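Filters can also combine several conditions or select specific columns at the same time. A short sketch on toy data with hypothetical columns and values:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female"],
    "age": [34, 51, 46, 29, 62],
    "bmi": [22.5, 30.1, 27.1, 24.8, 26.0],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
older_females = toy[(toy["sex"] == "female") & (toy["age"] >= 45)]

# .loc selects rows and columns in one step
males_age_bmi = toy.loc[toy["sex"] == "male", ["age", "bmi"]]

# .query expresses the same filter as a string
same_as_first = toy.query("sex == 'female' and age >= 45")

print(older_females)
print(males_age_bmi)
```

Python's `and` and `or` keywords do not work element-wise on pandas columns, which is why `&` and `|` are used in the first form.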

## 🧠 Exercise

Make a plot of one of the variables using `sns.histplot`, `sns.boxplot`, or `sns.countplot`.

Then answer:
- What is 1 **strength** of this dataset?
- What is 1 **limitation**?

✍️ Write your answers in a text cell below.

## 🧪 Playground – try your own code below

In [None]:
# Write your own code here!