# Python basics: working with data

this notebook covers basic Python operations for data manipulation using `pandas`.

**dataset:** `data.csv` contains simulated health data with the following columns:
- `id`: unique identifier
- `zip_code`: Seattle area ZIP code  
- `age`: age in years
- `smoking`: binary variable (0 = non-smoker, 1 = smoker)
- `years_smoked`: number of years smoked (missing, read by pandas as `NaN`, if non-smoker)

## 1. setup: import libraries

first, we import the libraries we'll use. Run this cell with `Shift+Enter`.

In [None]:
import pandas as pd
import numpy as np

## 2. reading data

use `pd.read_csv()` to read in a CSV file. this returns a pandas **DataFrame**, a table with labeled columns that works much like a data frame in R.

In [None]:
# read the data
df = pd.read_csv("data.csv")

# view the first few rows
df.head()

In [None]:
# check the shape (rows, columns)
print(f"Shape: {df.shape}")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

In [None]:
# view column names
df.columns

In [None]:
# view data types
df.dtypes
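
The data types shown above are inferred by `pd.read_csv()`, and the inference can be overridden per column. The sketch below assumes you would rather keep `zip_code` as text than as an integer, which is a common choice for identifiers. The three rows are made up and stand in for `data.csv`; they are not the real file.

```python
import io

import pandas as pd

# hypothetical three-row sample standing in for data.csv (values are made up)
csv_text = """id,zip_code,age,smoking,years_smoked
1,98101,34,0,
2,98105,67,1,40
3,98115,52,1,12
"""

# default parsing: zip_code is inferred as an integer column
df_default = pd.read_csv(io.StringIO(csv_text))

# dtype= overrides the inference for the named column only
df_text_zip = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})

print(df_default.dtypes)
print(df_text_zip.dtypes)
```

In this sample the empty `years_smoked` field is read as a missing value, which is why that column comes back as floating point rather than integer.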

## 3. selecting columns

there are several ways to select columns in pandas:

In [None]:
# select a single column (returns a Series)
ages = df['age']
print(type(ages))
ages.head()

In [None]:
# select multiple columns (returns a DataFrame)
subset = df[['id', 'age', 'smoking']]
print(type(subset))
subset.head()
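
The difference between single and double brackets is easy to miss, so here is a small sketch on made-up data that uses the same column names as `data.csv`. It also shows `.loc`, one more way to select columns by label. The toy values are assumptions for illustration only.

```python
import pandas as pd

# toy data (made-up values) with the same column names as data.csv
df = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [34, 67, 52],
    "smoking": [0, 1, 1],
})

as_series = df["age"]                 # single brackets -> Series
as_frame = df[["age"]]                # list with one name -> one-column DataFrame
via_loc = df.loc[:, ["id", "age"]]    # .loc by label: all rows, two columns

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(via_loc.shape)
```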

## 4. basic statistics

pandas has built-in methods for common statistics:

In [None]:
# summary statistics for all numeric columns
df.describe()

In [None]:
# individual statistics
print(f"mean age: {df['age'].mean():.2f}")
print(f"median age: {df['age'].median():.2f}")
print(f"std dev of age: {df['age'].std():.2f}")
print(f"min age: {df['age'].min()}")
print(f"max age: {df['age'].max()}")

In [None]:
# count values
df['smoking'].value_counts()

## 5. filtering rows

use boolean conditions to filter rows:

In [None]:
# filter to only smokers
smokers = df[df['smoking'] == 1]
print(f"number of smokers: {len(smokers)}")
smokers.head()

In [None]:
# filter with multiple conditions (use & for AND, | for OR)
# wrap each condition in parentheses: & and | bind more tightly than == and >
older_smokers = df[(df['smoking'] == 1) & (df['age'] > 65)]
print(f"number of smokers over 65: {len(older_smokers)}")
older_smokers.head()

In [None]:
# filter using .query() method (cleaner syntax)
older_smokers = df.query("smoking == 1 and age > 65")
older_smokers.head()
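
Each condition used above is itself a boolean Series, so it can be stored in a variable and combined with others. The sketch below, on made-up rows, adds two operations not shown above: `~` for NOT and `.isin()` for membership in a list. The variable names are only illustrative.

```python
import pandas as pd

# toy data (made-up values) with the same column names as data.csv
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "zip_code": [98101, 98105, 98115, 98101],
    "age": [34, 67, 52, 71],
    "smoking": [0, 1, 1, 0],
})

is_smoker = df["smoking"] == 1    # boolean Series, one value per row
over_65 = df["age"] > 65

smoker_or_older = df[is_smoker | over_65]           # OR
non_smokers = df[~is_smoker]                        # NOT
in_zips = df[df["zip_code"].isin([98101, 98115])]   # membership test

print(len(smoker_or_older), len(non_smokers), len(in_zips))
```

Naming the masks first also sidesteps the parentheses issue, because each comparison is evaluated before the masks are combined.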

## 6. creating new variables

add new columns by assigning to a new column name:

In [None]:
# create a copy so we don't modify the original
df_modified = df.copy()

# create a new column
df_modified['age_group'] = np.where(df_modified['age'] >= 65, 'senior', 'non-senior')
df_modified.head(10)

In [None]:
# create a column from arithmetic
df_modified['age_decades'] = df_modified['age'] / 10
df_modified[['age', 'age_decades']].head()

In [None]:
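# create multiple categories with pd.cut()
# bins are right-inclusive by default: (0, 40], (40, 60], (60, 80], (80, 100]
# ages outside these bins (0 or below, above 100) become NaN
df_modified['age_category'] = pd.cut(
    df_modified['age'],
    bins=[0, 40, 60, 80, 100],
    labels=['Young', 'Middle', 'Senior', 'Elderly']
)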
df_modified['age_category'].value_counts()
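
To see how `pd.cut()` treats values that sit exactly on a bin edge, the sketch below runs the same bins and labels on a handful of made-up boundary ages. It relies on the default `right=True`, meaning each bin includes its upper edge.

```python
import pandas as pd

# made-up ages chosen to land on or just past the bin edges
ages = pd.Series([40, 41, 60, 80, 100, 101])

categories = pd.cut(
    ages,
    bins=[0, 40, 60, 80, 100],
    labels=["Young", "Middle", "Senior", "Elderly"],
)

print(pd.DataFrame({"age": ages, "category": categories}))
```

Here 40 lands in `Young` and 60 in `Middle`, while 101 falls outside every bin and gets a missing value. Passing `right=False` would make each bin include its lower edge instead.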

## 7. renaming columns

In [None]:
# rename a column
df_renamed = df.rename(columns={'age': 'age_in_years'})
df_renamed.head()

In [None]:
# rename multiple columns at once
df_renamed = df.rename(columns={
    'age': 'age_in_years',
    'smoking': 'is_smoker'
})
df_renamed.head()

## 8. handling missing values

In [None]:
# check for missing values
df.isna().sum()

In [None]:
# filter to rows with missing years_smoked
missing_years = df[df['years_smoked'].isna()]
print(f"rows with missing years_smoked: {len(missing_years)}")
missing_years.head()

In [None]:
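# drop rows with any missing values
# note: years_smoked is missing for non-smokers (see the column list above),
# so this also removes every non-smoker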
df_complete = df.dropna()
print(f"original rows: {len(df)}")
print(f"complete cases: {len(df_complete)}")

In [None]:
# fill missing values
df_filled = df.copy()
df_filled['years_smoked'] = df_filled['years_smoked'].fillna(0)
df_filled.isna().sum()
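
Dropping every incomplete row can remove more than intended when one column is missing by design. A minimal sketch of a narrower alternative, `dropna(subset=...)`, is below. The rows are made up, and the missing `age` value is an assumption added for illustration; the real file may have none.

```python
import numpy as np
import pandas as pd

# toy data (made-up values); the missing age is an assumption for illustration
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [34, 67, np.nan, 71],
    "smoking": [0, 1, 1, 0],
    "years_smoked": [np.nan, 40, 12, np.nan],
})

all_complete = df.dropna()               # drops any row with a missing value
age_known = df.dropna(subset=["age"])    # only requires age to be present

print(len(all_complete), len(age_known))
print(df["years_smoked"].mean())         # missing values are skipped by default
```

Only one row survives the plain `dropna()`, while three survive when only `age` has to be present. The mean of `years_smoked` is computed over the two non-missing values.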

## 9. grouping and aggregating

In [None]:
# group by smoking status and calculate mean age
df.groupby('smoking')['age'].mean()

In [None]:
# multiple aggregations
df.groupby('smoking')['age'].agg(['mean', 'median', 'std', 'count'])

In [None]:
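# group by multiple columns
# rebuild df_modified so this cell runs on its own
# (this replaces the df_modified created in section 6)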
df_modified = df.copy()
df_modified['age_group'] = np.where(df_modified['age'] >= 65, 'senior', 'non-senior')
df_modified.groupby(['smoking', 'age_group']).size()
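
When a summary needs statistics from more than one column, pandas supports named aggregation: each keyword passed to `.agg()` becomes an output column, defined by a `(column, function)` pair. The sketch below uses made-up rows, and the output column names (`n`, `mean_age`, `max_years_smoked`) are illustrative choices.

```python
import numpy as np
import pandas as pd

# toy data (made-up values) with the same column names as data.csv
df = pd.DataFrame({
    "age": [34, 67, 52, 71],
    "smoking": [0, 1, 1, 0],
    "years_smoked": [np.nan, 40, 12, np.nan],
})

summary = (
    df.groupby("smoking")
    .agg(
        n=("age", "count"),                        # non-missing ages per group
        mean_age=("age", "mean"),
        max_years_smoked=("years_smoked", "max"),
    )
    .reset_index()                                 # turn the group key back into a column
)

print(summary)
```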

## 10. sorting

In [None]:
# sort by age (ascending)
df.sort_values('age').head()

In [None]:
# sort by age (descending)
df.sort_values('age', ascending=False).head()

In [None]:
# sort by multiple columns
df.sort_values(['smoking', 'age'], ascending=[True, False]).head(10)
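
Two details of `sort_values()` are worth seeing on a small example: it returns a new DataFrame and leaves the original alone, and the rows keep their original index labels. The sketch below uses three made-up rows, and `reset_index(drop=True)` is shown as one way to renumber the result.

```python
import pandas as pd

# toy data (made-up values)
df = pd.DataFrame({"id": [1, 2, 3], "age": [52, 34, 67]})

by_age = df.sort_values("age")                # new DataFrame; df is unchanged
renumbered = by_age.reset_index(drop=True)    # replace old labels with 0..n-1

print(by_age.index.tolist())
print(renumbered.index.tolist())
print(df["age"].tolist())
```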

---
# practice problems

try these exercises on your own!

### Problem 1

Calculate the mean age of smokers vs non-smokers. Which group is older on average?

In [None]:
# Your code here


### Problem 2

Create a new column called `pack_years` that equals `years_smoked * 1.5` for smokers and `0` for non-smokers.

In [None]:
# Your code here


### Problem 3

Find the 5 most common ZIP codes in the dataset.

In [None]:
# Your code here


### Problem 4

Filter the data to only include smokers aged 30 to 50 (inclusive), then calculate the mean `years_smoked` for this group.

In [None]:
# Your code here


### Problem 5

Create a summary table showing the count and mean age by ZIP code, sorted by count (most to least).

In [None]:
# Your code here
