# Preliminaries and Digressions

This notebook collects topics that are either preliminaries *or* digressions: material that may supplement your understanding but isn't strictly necessary for this course. I'll try to be clear about which is which, and I may split these notes out as they grow.

## Python Basics

### Syntax

Let me assume that if you don't know Python, you at least know R. Then, there are only a few surprises. 

* Use `=` for assignment instead of `<-`
* Use `**` for exponentiation instead of `^`. So $2^3$ is coded as `2**3`. 
* Python is zero-indexed. The element at the beginning of the list `x = [9, 7]` is the value `9`, and it is accessed as `x[0]`. This also means that, though `x` has length two, the last element is at `x[1]`. Indices generally run over the interval $0 \leq i < \text{length of array}$. This is how [Dijkstra wanted it](https://www.cs.utexas.edu/~EWD/ewd08xx/EWD831.PDF).

In action:

In [1]:
# Assignment and exponentiation
x = 5  # In R: x <- 5
y = 2**3  # In R: 2^3
print(f"x = {x}")
print(f"2 to the power of 3 = {y}")

# Zero-indexing example
grades = [85, 92, 78, 95]
print(f"\nFirst grade (index 0): {grades[0]}")
print(f"Last grade (index 3): {grades[3]}")
# grades[4] would give an error!

x = 5
2 to the power of 3 = 8

First grade (index 0): 85
Last grade (index 3): 95
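
The same half-open convention ($0 \leq i < n$) shows up in `range` and in slicing, which is where R users tend to get bitten. Here is a small sketch that reuses the `grades` list from above; nothing in it is specific to this course.

```python
grades = [85, 92, 78, 95]

# range(n) yields 0, 1, ..., n-1: exactly the valid indices of a length-n list
indices = list(range(len(grades)))
print(indices)            # [0, 1, 2, 3]

# Slices are half-open too: start is included, stop is excluded
first_two = grades[0:2]   # elements at index 0 and 1
rest = grades[2:]         # elements at index 2 and 3
print(first_two, rest)    # [85, 92] [78, 95]

# Negative indices count from the end, so -1 is the last element
last = grades[-1]
print(last)               # 95

# Going past the end raises IndexError (R would quietly return NA)
try:
    grades[4]
except IndexError as err:
    print(f"IndexError: {err}")
```

A handy consequence of the half-open rule is that `grades[:k]` and `grades[k:]` always split the list into two pieces with no overlap and nothing left out.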


**Quick note on f-strings**: The `f"text {variable}"` syntax is Python's way of inserting variables into text. It's like `paste0("text ", variable)` in R but cleaner. Just put an `f` before the quotes and wrap variables in `{}`.
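
f-strings also accept a format specification after a colon inside the braces, which controls decimals, thousands separators, and percentages. The sketch below uses a few illustrative values; the variable names are my own and not part of any dataset.

```python
population_m = 39.5
gdp_per_capita = 88700
unemployment_share = 0.039   # a proportion, not a percentage

one_decimal = f"{population_m:.1f}M"        # fixed number of decimals
with_commas = f"${gdp_per_capita:,.0f}"     # thousands separator, no decimals
as_percent = f"{unemployment_share:.1%}"    # multiplies by 100 and appends %

print(one_decimal)   # 39.5M
print(with_commas)   # $88,700
print(as_percent)    # 3.9%
```

Note that the `%` format multiplies by 100, so it expects a proportion like `0.039`. If your number is already in percent units (say `3.9`), use `.1f` and write the `%` sign yourself.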

Here are a few more differences that will save you debugging time:

In [2]:
# Data structures: R vectors become Python lists
scores = [85, 92, 78]  # Like c(85, 92, 78) in R

# R's named lists become Python dictionaries
student = {"name": "Alex", "grade": 92}  # Like list(name="Alex", grade=92); c() would coerce 92 to "92"
print(f"Student name: {student['name']}")

# Functions are defined differently
def calculate_mean(numbers):
    """Calculate the mean of a list of numbers"""
    return sum(numbers) / len(numbers)

# Using the function
mean_score = calculate_mean(scores)
print(f"\nMean score: {mean_score}")

Student name: Alex

Mean score: 85.0


### Where to run the code?
You can probably get by just using Google Colab notebooks for everything in this course. However, I recommend working in [**VS Code**](https://code.visualstudio.com/) or [**JupyterLab**](https://github.com/jupyterlab/jupyterlab-desktop) instead. I usually write notebooks in JupyterLab and write standalone py files in VS Code or Sublime. AI integration is better in VS Code. Cursor is built off of VS Code for even more AI integration. 

VS Code is nice because you can split the screen with files and a terminal session. In the terminal, I run Claude Code. 

If you find yourself using a similar workflow, writing both notebooks and `.py` files, I recommend the IPython magics `%run` and `%load`. For a file called `foo.py`, `%run foo.py` executes the file in a fresh namespace and then makes the variables it defines available in your notebook. `%run -i foo.py` instead runs it in the notebook's own namespace, meaning it can use variables already defined in the notebook. `%load foo.py` pastes the contents of the file into the current cell so you can edit and run it there.

If you discover an even better workflow, let me know!


### What else do I need to install? 

If you are working outside of a cloud environment, you'll need to install Python packages.

**Option 1: Using pip**
```bash
pip install -r requirements.txt
```

**Option 2: Using conda environments**

A conda environment is an isolated workspace with its own packages and dependencies. This keeps different projects' requirements separate.

```bash
# Create environment from the course's environment.yml file
conda env create -f environment.yml

# Activate the environment
conda activate mlss

# Deactivate when done
conda deactivate
```

The environment.yml file in the repository root includes NumPy, pandas, scikit-learn, PyTorch, and Jupyter.
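
After installing, it can be worth checking that the packages are visible to the Python you are actually running. This is a minimal sketch using only the standard library. The list of names is my assumption based on the packages mentioned above, so adjust it to match your setup.

```python
from importlib.util import find_spec

# Import names, which can differ from install names
# (scikit-learn is imported as sklearn, PyTorch as torch).
required = ["numpy", "pandas", "sklearn", "torch"]

def check_installed(module_names):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in module_names}

status = check_installed(required)
for name, ok in status.items():
    print(f"{name:<10} {'found' if ok else 'MISSING'}")
```

If something shows up as missing even though you installed it, the usual culprit is that the notebook kernel is running a different environment from the one you installed into.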

## Data Manipulation with Python

Tabular data is usually stored in DataFrames, using either `pandas` or `polars`. `pandas` is more common, and AI tools will default to it when writing code. `polars` is somewhat new, new enough that ChatGPT still makes basic mistakes when writing `polars` code. We'll basically ignore `polars` for simplicity, but I mention it because my industry workflow is `polars`-first: I sometimes work with data that is too big for `pandas` (>100M rows given the tech I'm working with). I've heard R users say `polars` makes more sense to them than `pandas`.
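
If you are coming from dplyr, here is a rough sketch of how a few common verbs look in `pandas`. The data is made up for illustration, and the R lines in the comments are approximate equivalents rather than exact translations.

```python
import pandas as pd

# Toy data, made up for illustration
df = pd.DataFrame({
    "student": ["Alex", "Blake", "Casey", "Drew"],
    "section": ["A", "A", "B", "B"],
    "score": [85, 92, 78, 95],
})

# dplyr: filter(df, score > 80)
high = df[df["score"] > 80]

# dplyr: mutate(df, passed = score >= 80)
df = df.assign(passed=df["score"] >= 80)

# dplyr: df |> group_by(section) |> summarise(mean(score))
by_section = df.groupby("section")["score"].mean()

print(high)
print(by_section)
```

The result of `groupby(...).mean()` is a Series indexed by `section`, so `by_section["A"]` pulls out the mean for section A.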

## American Time Use Survey (ATUS)

The American Time Use Survey is a federal survey sponsored by the Bureau of Labor Statistics and conducted by the U.S. Census Bureau. Respondents report their activities for a 24-hour period, creating time diaries. The survey draws from households that completed the Current Population Survey and has been conducted annually since 2003.

### Available Microdata Files

ATUS releases several microdata files each year:

- **resp** - Respondent file with demographics and labor force characteristics  
- **act** - Activity file with one record per activity in the diary day
- **sum** - Summary file with total minutes per activity category
- **who** - Who file indicating who was present during activities
- **rost** - Roster file with household member information
- **cps** - CPS file with Current Population Survey data for ATUS households

Each file can be merged using the case ID (TUCASEID).
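
One thing to watch for: a file with one record per activity has many rows per case ID, so merging it onto a one-row-per-respondent file repeats the respondent's columns. Below is a sketch with toy stand-ins for the two kinds of file. The values are made up, and `age` and `minutes` are hypothetical column names, not real ATUS variables.

```python
import pandas as pd

# Toy stand-ins: one row per respondent vs. one row per activity
resp = pd.DataFrame({"TUCASEID": [1, 2], "age": [34, 51]})
act = pd.DataFrame({
    "TUCASEID": [1, 1, 1, 2, 2],
    "minutes": [480, 60, 900, 600, 840],
})

# validate= raises a MergeError if the key is not unique on the "one" side
merged = pd.merge(resp, act, on="TUCASEID", how="inner", validate="one_to_many")
print(merged)

# Respondent-level columns now repeat once per activity, so collapse back
# to one row per respondent before computing respondent-level statistics
per_resp = merged.groupby("TUCASEID", as_index=False)["minutes"].sum()
print(per_resp)
```

Passing `validate` is cheap insurance: if the merge is not the shape you thought it was, you get an error instead of a silently inflated row count.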

### Downloading ATUS Files with statwrap

The `statwrap` package provides convenience functions for accessing ATUS data:

In [ ]:
from statwrap.atus import get_microdata_link, read_zip

# Get download links for different file types
print("Respondent file (2023):", get_microdata_link('resp', 2023))
print("Activity file (2023):", get_microdata_link('act', 2023))
print("Summary file (2023):", get_microdata_link('sum', 2023))

# For earlier years
print("\nRespondent file (2020):", get_microdata_link('resp', 2020))

After downloading files manually, use `read_zip` to load them:

In [ ]:
# Assuming files are downloaded to current directory
# resp_df = read_zip('atusresp-0323.zip')
# sum_df = read_zip('atussum-0323.zip')
# act_df = read_zip('atusact-0323.zip')

### Merging ATUS Files

A common analysis combines respondent characteristics with time use summaries:

In [ ]:
# Example merge between respondent and summary files
# Uncomment after downloading both files:

# import pandas as pd
# resp_df = read_zip('atusresp-0323.zip')
# sum_df = read_zip('atussum-0323.zip')

# # Merge on case ID
# merged_df = pd.merge(resp_df, sum_df, on='TUCASEID', how='inner')

# # Example analysis: Average work time by sex
# work_cols = ['t050101', 't050102', 't050103']  # Work activities
# merged_df['work_minutes'] = merged_df[work_cols].sum(axis=1)
# merged_df.groupby('TESEX')['work_minutes'].mean()
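
To see the same steps run without downloading anything, here is a sketch on toy data. All values are made up, and `weight` is a hypothetical column I added to show what happens when both frames share a column besides the key. If you merge them as-is, pandas renames the duplicates with `_x` and `_y` suffixes, so I drop the repeated column from one side first.

```python
import pandas as pd

# Made-up stand-ins for the respondent and summary files
resp_toy = pd.DataFrame({
    "TUCASEID": [1, 2, 3, 4],
    "weight": [1.0, 1.0, 1.0, 1.0],   # hypothetical shared column
})
sum_toy = pd.DataFrame({
    "TUCASEID": [1, 2, 3, 4],
    "weight": [1.0, 1.0, 1.0, 1.0],   # same column appears here too
    "TESEX": [1, 2, 1, 2],
    "t050101": [480, 0, 300, 420],
    "t050102": [0, 0, 60, 0],
    "t050103": [0, 30, 0, 0],
})

# Keep the key plus only the columns that are new, so nothing gets a suffix
new_cols = ["TUCASEID"] + [c for c in sum_toy.columns if c not in resp_toy.columns]
merged_toy = pd.merge(resp_toy, sum_toy[new_cols], on="TUCASEID",
                      how="inner", validate="one_to_one")

# Same analysis as above: total work minutes, then the mean by sex
work_cols = ["t050101", "t050102", "t050103"]
merged_toy["work_minutes"] = merged_toy[work_cols].sum(axis=1)
result = merged_toy.groupby("TESEX")["work_minutes"].mean()
print(result)
```

With these toy numbers, the group means come out to 420 and 225 minutes.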

## Digression: Making DataFrames Look Nice with Stylers

Pandas has a `.style` attribute that lets you format DataFrames for better visual presentation. This is especially useful when presenting results or exploring data patterns. Here are three useful methods:

In [7]:
import pandas as pd

# First, let's create a sample dataset with several numeric columns
results_df = pd.DataFrame({
    'State': ['California', 'Texas', 'Florida', 'New York', 'Pennsylvania'],
    'Population_M': [39.5, 29.5, 21.8, 19.5, 12.8],
    'GDP_per_capita': [88700, 71900, 55700, 95500, 72800],
    'Unemployment_rate': [3.9, 3.4, 2.8, 3.7, 3.5],
    'College_grad_pct': [35.0, 32.2, 31.3, 37.5, 33.1]
})

print("Our data:")
results_df

Our data:


Unnamed: 0,State,Population_M,GDP_per_capita,Unemployment_rate,College_grad_pct
0,California,39.5,88700,3.9,35.0
1,Texas,29.5,71900,3.4,32.2
2,Florida,21.8,55700,2.8,31.3
3,New York,19.5,95500,3.7,37.5
4,Pennsylvania,12.8,72800,3.5,33.1


### 1. Format: Control how numbers are displayed

In [8]:
# Format numbers for better readability
styled = results_df.style.format({
    'Population_M': '{:.1f}M',          # One decimal place with M suffix
    'GDP_per_capita': '${:,.0f}',       # Dollar sign with commas
    'Unemployment_rate': '{:.1f}%',     # Already in percent units; '{:.1%}' would multiply by 100 and show 390.0%
    'College_grad_pct': '{:.1f}%'       # Just add % sign (don't multiply)
})
styled

Unnamed: 0,State,Population_M,GDP_per_capita,Unemployment_rate,College_grad_pct
0,California,39.5M,"$88,700",3.9%,35.0%
1,Texas,29.5M,"$71,900",3.4%,32.2%
2,Florida,21.8M,"$55,700",2.8%,31.3%
3,New York,19.5M,"$95,500",3.7%,37.5%
4,Pennsylvania,12.8M,"$72,800",3.5%,33.1%


### 2. Background gradient: Visualize patterns with color

Use the `axis` parameter to scale the color gradient relative to the other values in each column (`axis=0`), the other values in each row (`axis=1`), or all values in the table (`axis=None`).

In [9]:
# Apply color gradients to see patterns
numeric_cols = ['Population_M', 'GDP_per_capita', 'Unemployment_rate', 'College_grad_pct']
results_df.style.background_gradient(cmap='Greens', subset=numeric_cols, axis=0)

Unnamed: 0,State,Population_M,GDP_per_capita,Unemployment_rate,College_grad_pct
0,California,39.5,88700,3.9,35.0
1,Texas,29.5,71900,3.4,32.2
2,Florida,21.8,55700,2.8,31.3
3,New York,19.5,95500,3.7,37.5
4,Pennsylvania,12.8,72800,3.5,33.1


In [10]:
results_df.style.background_gradient(cmap='coolwarm_r', subset=numeric_cols, axis=None)

Unnamed: 0,State,Population_M,GDP_per_capita,Unemployment_rate,College_grad_pct
0,California,39.5,88700,3.9,35.0
1,Texas,29.5,71900,3.4,32.2
2,Florida,21.8,55700,2.8,31.3
3,New York,19.5,95500,3.7,37.5
4,Pennsylvania,12.8,72800,3.5,33.1


### 3. Bar: Add mini bar charts inside cells

Again, use the `axis` parameter to scale the bar lengths relative to the other values in each column (`axis=0`), the other values in each row (`axis=1`), or all values in the table (`axis=None`).

In [11]:
# Add horizontal bars to visualize relative magnitudes
numeric_cols = ['Population_M', 'GDP_per_capita', 'Unemployment_rate', 'College_grad_pct']
results_df.style.bar(subset=numeric_cols, color='lightblue')

Unnamed: 0,State,Population_M,GDP_per_capita,Unemployment_rate,College_grad_pct
0,California,39.5,88700,3.9,35.0
1,Texas,29.5,71900,3.4,32.2
2,Florida,21.8,55700,2.8,31.3
3,New York,19.5,95500,3.7,37.5
4,Pennsylvania,12.8,72800,3.5,33.1


In [12]:
# axis=0 is the default, so this matches the previous cell
results_df.style.bar(subset=numeric_cols, color='lightblue', axis=0)

Unnamed: 0,State,Population_M,GDP_per_capita,Unemployment_rate,College_grad_pct
0,California,39.5,88700,3.9,35.0
1,Texas,29.5,71900,3.4,32.2
2,Florida,21.8,55700,2.8,31.3
3,New York,19.5,95500,3.7,37.5
4,Pennsylvania,12.8,72800,3.5,33.1


In [13]:
results_df.style.bar(subset=numeric_cols, color='lightblue', axis=1)

Unnamed: 0,State,Population_M,GDP_per_capita,Unemployment_rate,College_grad_pct
0,California,39.5,88700,3.9,35.0
1,Texas,29.5,71900,3.4,32.2
2,Florida,21.8,55700,2.8,31.3
3,New York,19.5,95500,3.7,37.5
4,Pennsylvania,12.8,72800,3.5,33.1


In [14]:
results_df.style.bar(subset=numeric_cols, color='lightblue', axis=None)

Unnamed: 0,State,Population_M,GDP_per_capita,Unemployment_rate,College_grad_pct
0,California,39.5,88700,3.9,35.0
1,Texas,29.5,71900,3.4,32.2
2,Florida,21.8,55700,2.8,31.3
3,New York,19.5,95500,3.7,37.5
4,Pennsylvania,12.8,72800,3.5,33.1
