# Docker Fundamentals for Computational Social Science

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Understand what Docker is and why containerization is valuable for reproducible research
* Learn the basic Docker workflow: writing Dockerfiles, building images, and running containers
* Practice creating and modifying Docker containers for Python applications
* Experience hands-on Docker development through live builds and modifications
* Discuss best practices, common challenges, and when to use Docker in research projects
* Understand how Docker can enhance your final project's reproducibility and deployment
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
🎬 **Demo**: Showing off something more advanced, so you know what Docker can be used for!<br> 

### Sections
1. [What is Docker?](#section1)
2. [Live Build: Creating Your First Container](#section2)
3. [Hands-On: Modifying and Rebuilding](#section3)
4. [Show and Tell: Real-World Use Cases](#section4)
5. [Discussion: Benefits, Challenges, and Best Practices](#section5)
6. [Docker for Your Final Project](#section6)

Welcome to our Docker workshop! Today we'll explore containerization technology and how it can make your computational social science research more reproducible, portable, and shareable. Docker has become an essential tool in modern software development and research, allowing you to package your applications with all their dependencies into lightweight, portable containers.

💡 **Note**: If you don't have Docker installed on your machine, don't worry! You can follow along with the demonstrations and pair up with someone who has it installed for the hands-on portions.

<a id='section1'></a>

# What is Docker?

Docker is a **containerization platform** that packages an application together with everything it needs to run: code, runtime, system tools, libraries, and settings. A container can feel like a very lightweight virtual machine, but it is really an isolated process that shares the host's kernel instead of booting its own operating system.

## Why Use Docker in Research?

🔔 **Question**: Before we dive in, think about a time when you or a colleague tried to run someone else's code. What challenges did you encounter?

Common research challenges Docker helps solve:

- **"It works on my machine"** - Different operating systems, Python versions, or package versions can break code
- **Dependency hell** - Complex dependencies that conflict with each other
- **Reproducibility** - Ensuring your research can be replicated years later
- **Collaboration** - Making it easy for teammates to run your code
- **Deployment** - Moving from local development to cloud or production environments

## Key Docker Concepts

- **Image**: A read-only template with instructions for creating a container (like a recipe)
- **Container**: A running instance of an image (like a cake made from the recipe)
- **Dockerfile**: A text file with instructions to build an image
- **Registry**: A storage and distribution system for Docker images (like Docker Hub)

## Docker vs Virtual Machines

While virtual machines virtualize entire operating systems, Docker containers share the host OS kernel, making them much more lightweight and faster to start.

💡 **Tip**: Think of Docker containers as "shipping containers" for your code - they provide a standardized way to package and transport applications across different environments.

<a id='section2'></a>

# 🎬 Live Build: Creating Your First Container

Let's walk through creating a simple Python application and containerizing it step by step. We'll start with a basic "Hello World" script and gradually add complexity.

## Step 1: Create a Simple Python Script

First, let's create a simple Python script that demonstrates basic functionality:

In [None]:
# Let's create our hello.py script content
hello_script = '''
#!/usr/bin/env python3
"""A simple hello world script for Docker demonstration."""

import os
import sys
import datetime

def main():
    print("Hello from Docker!")
    print(f"Current time: {datetime.datetime.now()}")
    print(f"Running on: {os.uname().sysname if hasattr(os, 'uname') else 'Unknown OS'}")
    print(f"Python version: {sys.version}")
    
    # Read a simple data file if it exists
    try:
        with open('data.txt', 'r') as f:
            content = f.read().strip()
            print(f"Data file contains: {content}")
    except FileNotFoundError:
        print("No data file found - creating one...")
        with open('data.txt', 'w') as f:
            f.write("Hello from the containerized world!")
        print("Data file created successfully!")

if __name__ == "__main__":
    main()
'''

print("Our hello.py script:")
print(hello_script)
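
The cell above only prints the script. Before an image can be built, that text has to exist as a real file next to the Dockerfile. Here is a minimal sketch of one way to do that, assuming a hypothetical helper named `write_project_file` and a throwaway temporary folder; in the workshop you would point it at your project folder and pass `hello_script` instead of the stand-in text.

```python
from pathlib import Path
import tempfile


def write_project_file(project_dir, filename, content):
    """Write `content` to project_dir/filename and return the path.

    Hypothetical helper for this workshop, not part of Docker.
    Leading blank lines are stripped so a shebang stays on line 1.
    """
    project = Path(project_dir)
    project.mkdir(parents=True, exist_ok=True)
    target = project / filename
    target.write_text(content.lstrip("\n"), encoding="utf-8")
    return target


# Stand-in script text; in the notebook you would pass hello_script.
demo_script = '''
#!/usr/bin/env python3
print("Hello from Docker!")
'''

demo_dir = tempfile.mkdtemp(prefix="docker-demo-")
saved = write_project_file(demo_dir, "hello.py", demo_script)
print(f"Wrote {saved}")
```

The same helper can write the `Dockerfile` and `requirements.txt` text shown later in this notebook.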

## Step 2: Understanding the Dockerfile

Now let's create a Dockerfile to containerize our application. A Dockerfile contains instructions for building a Docker image:

In [None]:
# Let's examine our Dockerfile step by step
dockerfile_content = '''
# Use an official Python runtime as the base image
FROM python:3.10-slim

# Set metadata
LABEL maintainer="Your Name <your.email@example.com>"
LABEL description="A simple Python hello world application"

# Set the working directory inside the container
WORKDIR /app

# Copy our Python script into the container
COPY hello.py .

# Make the script executable (optional here, since CMD invokes python directly)
RUN chmod +x hello.py

# Define the command to run when the container starts
CMD ["python", "hello.py"]
'''

print("Our Dockerfile:")
print(dockerfile_content)

### Breaking Down the Dockerfile

Let's understand each instruction:

- **`FROM python:3.10-slim`**: Start with a lightweight Python 3.10 base image
- **`LABEL`**: Add metadata to describe the image
- **`WORKDIR /app`**: Set the working directory where commands will run
- **`COPY hello.py .`**: Copy our script from the host to the container
- **`RUN chmod +x hello.py`**: Execute a command during image build
- **`CMD ["python", "hello.py"]`**: Default command to run when container starts
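
To make the "one instruction per step" structure concrete, here is a small sketch that splits Dockerfile text into instruction and argument pairs. The function name `parse_dockerfile` is hypothetical, and the parsing is deliberately simplified: it skips comments and blank lines and joins backslash-continued lines, while the real Dockerfile format has additional rules.

```python
def parse_dockerfile(text):
    """Split Dockerfile text into (INSTRUCTION, arguments) pairs.

    Simplified sketch: ignores blank lines and comments and joins
    lines that end with a backslash. Not a full Dockerfile parser.
    """
    instructions = []
    pending = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("\\"):
            # Continuation: keep collecting until the logical line ends
            pending += line[:-1].rstrip() + " "
            continue
        keyword, _, args = (pending + line).partition(" ")
        pending = ""
        instructions.append((keyword.upper(), args.strip()))
    return instructions


example = """
# Use an official Python runtime as the base image
FROM python:3.10-slim
WORKDIR /app
COPY hello.py .
CMD ["python", "hello.py"]
"""

for keyword, args in parse_dockerfile(example):
    print(f"{keyword:<8} {args}")
```

Each pair corresponds to one build step, which is also the unit Docker uses when it caches layers.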

## Step 3: Building the Docker Image

Now we'll build our Docker image. The build process follows these steps:

```bash
# Navigate to the directory containing your Dockerfile and hello.py
cd /path/to/your/project

# Build the image with a tag name
docker build -t hello-app .
```

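The `-t hello-app` flag gives our image a name (tag). The final `.` sets the build context to the current directory: Docker sends those files to the builder and, by default, looks for a file named `Dockerfile` at the root of that context.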
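
If you prefer to stay inside the notebook, the same build can be launched from Python. This is a sketch under a few assumptions: the helper names `docker_build_command` and `build_image` are made up for this workshop, and the build only runs when the `docker` command is found on your PATH.

```python
import shutil
import subprocess


def docker_build_command(tag, context="."):
    """Return the argument list for `docker build -t <tag> <context>`."""
    return ["docker", "build", "-t", tag, context]


def build_image(tag, context="."):
    """Run the build if the docker CLI is available; return True on success."""
    if shutil.which("docker") is None:
        print("Docker CLI not found; skipping build.")
        return False
    result = subprocess.run(docker_build_command(tag, context))
    return result.returncode == 0


# Show the command that would be executed, without running it
print(" ".join(docker_build_command("hello-app")))
```

Calling `build_image("hello-app")` from the folder that holds your Dockerfile is then equivalent to typing the command in a terminal.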

## Step 4: Running the Container

Once the image is built, we can run it as a container:

```bash
# Run the container
docker run hello-app

# Run with an interactive terminal attached (output is shown either way;
# -it matters when the program reads input or you want a shell)
docker run -it hello-app

# Run and automatically remove container when it stops
docker run --rm hello-app
```

🔔 **Question**: What do you think will happen when we run this container? What output would you expect to see?

<a id='section3'></a>

# 🥊 Hands-On: Modifying and Rebuilding

Now it's your turn to practice! We'll modify our application to require additional packages and see how Docker handles dependencies.

## Challenge 1: Add Package Dependencies

Let's create a more complex version of our script that requires external packages like `numpy` and `pandas`:

In [None]:
# Enhanced version with data analysis capabilities
# (raw string so the \n escapes below stay literal in the script text)
enhanced_script = r'''
#!/usr/bin/env python3
"""Enhanced script with data analysis capabilities."""

import os
import datetime
import numpy as np
import pandas as pd

def analyze_data():
    """Perform simple data analysis."""
    # Create sample data
    data = {
        'date': pd.date_range('2024-01-01', periods=10),
        'value': np.random.randn(10).cumsum(),
        'category': np.random.choice(['A', 'B', 'C'], 10)
    }
    
    df = pd.DataFrame(data)
    
    print("\nSample Data Analysis:")
    print(f"Dataset shape: {df.shape}")
    print(f"Mean value: {df['value'].mean():.2f}")
    print(f"Value range: {df['value'].min():.2f} to {df['value'].max():.2f}")
    print("\nCategory counts:")
    print(df['category'].value_counts())
    
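    # Save results into output/ so a mounted volume can expose them on the host
    os.makedirs('output', exist_ok=True)
    df.to_csv('output/analysis_results.csv', index=False)
    print("\nResults saved to output/analysis_results.csv")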

def main():
    print("Enhanced Hello from Docker!")
    print(f"Current time: {datetime.datetime.now()}")
    print(f"NumPy version: {np.__version__}")
    print(f"Pandas version: {pd.__version__}")
    
    # Perform data analysis
    analyze_data()

if __name__ == "__main__":
    main()
'''

print("Enhanced script with dependencies:")
print(enhanced_script)

## Challenge 2: Create a Requirements File

For Python applications with dependencies, it's best practice to use a `requirements.txt` file:

In [None]:
# Create requirements.txt content
requirements = '''
numpy>=1.24.0
pandas>=2.0.0
'''

print("requirements.txt:")
print(requirements)
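
A `>=` range lets a later build install newer releases, so two images built from the same file can differ. When reproducibility matters, a common approach is to pin exact versions. Here is a minimal sketch, assuming a hypothetical helper `pin_requirements` that reads the versions installed in the current environment through the standard library's `importlib.metadata`.

```python
from importlib import metadata


def pin_requirements(packages, lookup=metadata.version):
    """Return 'name==version' lines for each package name.

    `lookup` maps a package name to its installed version. Packages that
    are not installed get a comment line instead of a guessed version.
    """
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={lookup(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name}: not installed here, version unknown")
    return lines


print("\n".join(pin_requirements(["numpy", "pandas"])))
```

The printed lines can be pasted into `requirements.txt`; the exact versions depend on what is installed on your machine.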

## Challenge 3: Update the Dockerfile

Now we need to modify our Dockerfile to install the dependencies:

In [None]:
# Updated Dockerfile with dependencies
enhanced_dockerfile = '''
# Use an official Python runtime as the base image
FROM python:3.10-slim

# Set metadata
LABEL maintainer="Your Name <your.email@example.com>"
LABEL description="Enhanced Python application with data analysis"

# Set the working directory inside the container
WORKDIR /app

# Copy requirements first (for better Docker layer caching)
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy our Python script into the container
COPY enhanced_hello.py .

# Create a directory for output files
RUN mkdir -p /app/output

# Define the command to run when the container starts
CMD ["python", "enhanced_hello.py"]
'''

print("Enhanced Dockerfile:")
print(enhanced_dockerfile)

### Key Improvements in the Enhanced Dockerfile

- **Copy requirements.txt first**: This enables Docker layer caching. As long as `requirements.txt` is unchanged, Docker reuses the cached install layer, even when you edit the script
- **`--no-cache-dir`**: Reduces image size by not storing pip cache
- **Create output directory**: Prepares a location for generated files
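
The caching point is easier to see with a toy model. The sketch below is a simplification and not Docker's actual implementation: it assumes each layer's cache key depends on the previous layer's key, the instruction text, and the content of any copied file. The function names are hypothetical.

```python
import hashlib


def layer_keys(steps):
    """Compute a cache key per build step.

    Simplified model: each key hashes the previous key, the instruction
    text, and the content of any file the step copies.
    """
    keys = []
    previous = ""
    for instruction, file_content in steps:
        digest = hashlib.sha256(
            f"{previous}|{instruction}|{file_content}".encode("utf-8")
        ).hexdigest()
        keys.append(digest)
        previous = digest
    return keys


def rebuilt_layers(old_steps, new_steps):
    """Return the instructions whose layers cannot be reused from cache."""
    old_keys = layer_keys(old_steps)
    new_keys = layer_keys(new_steps)
    return [
        step[0]
        for step, old, new in zip(new_steps, old_keys, new_keys)
        if old != new
    ]


def steps_for(script_text, requirements_text="numpy\npandas\n"):
    """Build steps mirroring the enhanced Dockerfile, with file contents."""
    return [
        ("FROM python:3.10-slim", ""),
        ("COPY requirements.txt .", requirements_text),
        ("RUN pip install --no-cache-dir -r requirements.txt", ""),
        ("COPY enhanced_hello.py .", script_text),
    ]


# Editing only the script leaves the dependency layers cached
print(rebuilt_layers(steps_for("print('v1')"), steps_for("print('v2')")))
```

In this model, changing the script invalidates only the final `COPY`, while changing `requirements.txt` invalidates the install step and everything after it. That is why the slow `pip install` step is placed before the frequently edited script.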

## Your Turn: Build and Test

Now practice the build and run process:

```bash
# Build the enhanced image
docker build -t enhanced-hello-app .

# Run the enhanced container
docker run --rm enhanced-hello-app

# Run with volume mounting to see output files
# (quotes keep the path intact if your folder name contains spaces)
docker run --rm -v "$(pwd)/output":/app/output enhanced-hello-app
```
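
The `$(pwd)` syntax belongs to Unix-style shells and is not available in every terminal (for example, the classic Windows command prompt). As a shell-independent sketch, the command can be assembled in Python. The helper name `docker_run_command` is hypothetical; the `--rm` and `-v` flags are the same ones used above.

```python
from pathlib import Path


def docker_run_command(image, host_dir=None, container_dir="/app/output", remove=True):
    """Build the argument list for `docker run`, optionally with a bind mount."""
    command = ["docker", "run"]
    if remove:
        command.append("--rm")
    if host_dir is not None:
        # With -v, a bare relative name is treated as a named volume,
        # so resolve the host folder to an absolute path first.
        host_path = Path(host_dir).resolve()
        command += ["-v", f"{host_path}:{container_dir}"]
    command.append(image)
    return command


print(docker_run_command("enhanced-hello-app", host_dir="output"))
```

Passing the resulting list to `subprocess.run` executes the container, provided Docker is installed.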

🥊 **Challenge**: Try extending the project on your own:
1. Add a new Python package (like `matplotlib` for plotting)
2. Update the requirements.txt file
3. Rebuild the image
4. Test that it works

💡 **Tip**: If you don't have Docker installed, pair up with someone who does, or just follow along with the instructor's demonstration.

<a id='section4'></a>

# Show and Tell: Real-World Use Cases

Let's hear from anyone who has experience using Docker in research or industry!

🔔 **Question**: Has anyone here used Docker before? What was your use case?

## Common Research Applications

Here are some ways researchers use Docker:

### 1. Reproducible Research Environments
```dockerfile
# Example: Creating a standardized research environment
FROM jupyter/scipy-notebook:latest

# Install additional research packages
RUN pip install gensim spacy transformers
RUN python -m spacy download en_core_web_sm

# Copy research notebooks
COPY notebooks/ /home/jovyan/work/
```

### 2. Web Scraping Applications
```dockerfile
# Example: Containerized web scraper
FROM python:3.10-slim

RUN apt-get update && apt-get install -y \
    chromium-driver \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY scraper.py .
CMD ["python", "scraper.py"]
```

### 3. API Services
```dockerfile
# Example: Containerized Flask API
FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY app.py .
EXPOSE 5000
CMD ["flask", "run", "--host=0.0.0.0"]
```

### 4. Machine Learning Model Deployment
```dockerfile
# Example: Containerized ML model
FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy trained model
COPY model.pkl .
COPY predict.py .

CMD ["python", "predict.py"]
```

## Industry Examples

- **Netflix**: Uses containers for microservices architecture
- **Airbnb**: Containerizes data science workflows
- **Spotify**: Deploys ML models using containers
- **Academic Institutions**: Share reproducible research environments

💡 **Share Your Experience**: If you've used Docker in research, internships, or projects, please share:
- What problem were you trying to solve?
- How did Docker help?
- What challenges did you face?
- Any tips for beginners?

<a id='section5'></a>

# Discussion: Benefits, Challenges, and Best Practices

Let's discuss the pros and cons of using Docker in research contexts.

## Benefits of Docker in Research

### ✅ Reproducibility
- **Consistent environments** across different machines and operating systems
- **Version control for environments** - your Dockerfile is version controlled
- **Long-term preservation** - an archived image can still run years later, whereas rebuilding from a Dockerfile with unpinned versions may pull newer packages

### ✅ Collaboration
- **Easy sharing** - colleagues can run your exact environment
- **Onboarding** - new team members get productive faster
- **Cross-platform** - works on Windows, Mac, and Linux

### ✅ Deployment
- **Cloud deployment** - easy to deploy to AWS, Google Cloud, etc.
- **Scaling** - can easily run multiple instances
- **Isolation** - applications don't interfere with each other

## Challenges with Docker

### ⚠️ Learning Curve
- **New concepts** - images, containers, volumes, networks
- **Command line** - requires comfort with terminal/command prompt
- **Debugging** - troubleshooting containers can be tricky

### ⚠️ Resource Usage
- **Disk space** - images can be large (especially with ML libraries)
- **Memory** - running containers uses RAM
- **Build time** - initial builds can be slow

### ⚠️ Complexity for Simple Tasks
- **Overkill** - might be unnecessary for simple scripts
- **Administrative rights** - some systems require admin access to install Docker
- **Networking** - can be complex for multi-container applications

## Best Practices for Research

### 🎯 Start Small
- Begin with simple applications
- Use official base images when possible
- Keep Dockerfiles readable and well-commented

### 🎯 Optimize for Size and Speed
```dockerfile
# Use slim or alpine base images
FROM python:3.10-slim

# Copy requirements first for better caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Use .dockerignore to exclude unnecessary files
# Clean up in the same RUN command
RUN apt-get update && apt-get install -y \
    build-essential \
    && pip install some-package \
    && apt-get remove -y build-essential \
    && apt-get autoremove -y \
    && rm -rf /var/lib/apt/lists/*
```

### 🎯 Security Considerations
```dockerfile
# Don't run as root user
RUN useradd -m researcher
USER researcher

# Don't include secrets in images
# Use environment variables or mounted volumes instead
```
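
On the application side, keeping secrets out of the image usually means reading them at run time. Here is a minimal sketch, assuming a hypothetical variable name `API_TOKEN` and a hypothetical helper `get_required_setting`; the value would be supplied when the container starts, for example with `docker run -e API_TOKEN=<value> <image>`.

```python
import os


def get_required_setting(name, environ=None):
    """Read a required setting from environment variables.

    Raises a clear error instead of falling back to a hard-coded secret.
    """
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        raise RuntimeError(
            f"Missing environment variable {name}; "
            "pass it when starting the container instead of baking it into the image"
        )
    return value


# Demo with a stand-in environment; real code would omit the second argument
demo_env = {"API_TOKEN": "example-token"}
token = get_required_setting("API_TOKEN", demo_env)
print("Token loaded from environment")
```

Failing loudly when the variable is missing avoids the temptation to commit a default secret to the repository or image.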

### 🎯 Documentation
- **README files** with clear build and run instructions
- **Comments in Dockerfiles** explaining complex steps
- **Version tags** for reproducibility
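
As a small illustration of the version-tag advice, the sketch below flags `FROM` lines that use no tag or the `latest` tag. The function name `unpinned_base_images` is hypothetical and the check is simplified: it treats a digest (`@sha256:...`) as pinned, and it would also flag a `FROM` that refers to an earlier build stage by name.

```python
def unpinned_base_images(dockerfile_text):
    """Return base images in FROM lines that use no tag or `latest`.

    Simplified sketch, not a full Dockerfile parser.
    """
    flagged = []
    for line in dockerfile_text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Skip option flags such as --platform=... that precede the image
        candidates = [p for p in parts[1:] if not p.startswith("--")]
        if not candidates:
            continue
        image = candidates[0]
        if image == "scratch" or "@" in image:
            continue  # empty base image, or pinned by digest
        # Look for the tag only in the last path segment, so a registry
        # port like localhost:5000/name is not mistaken for a tag
        tag = image.rsplit("/", 1)[-1].partition(":")[2]
        if not tag or tag == "latest":
            flagged.append(image)
    return flagged


example = "FROM jupyter/scipy-notebook:latest\nFROM python:3.10-slim\n"
print(unpinned_base_images(example))
```

An image built today from a `latest` tag may differ from one built next year, so a pinned tag or digest makes the Dockerfile a more faithful record of the environment.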

## Container Orchestration (Advanced Topic)

For complex applications with multiple containers, you might encounter:

- **Docker Compose** - Define multi-container applications
- **Kubernetes** - Container orchestration at scale
- **Docker Swarm** - Docker's native clustering solution

These tools are beyond our scope today, but worth knowing they exist!

🔔 **Discussion Questions**:
1. When might Docker be overkill for a research project?
2. How could Docker help with reproducibility in computational social science?
3. What concerns would you have about adopting Docker in your research?

<a id='section6'></a>

# Docker for Your Final Project

Let's discuss how Docker could enhance your final projects and potentially earn you bonus points!

## When to Consider Docker for Your Project

### ✅ Good Candidates for Dockerization
- **Complex dependencies** - Multiple Python packages, specific versions
- **System dependencies** - Requires specific software or libraries
- **Web applications** - Flask/Django apps, APIs, dashboards
- **Data processing pipelines** - Automated workflows
- **Machine learning models** - Especially with TensorFlow, PyTorch, etc.
- **Reproducible analysis** - Want others to easily replicate your work

### ❌ When Docker Might Be Overkill
- **Simple Jupyter notebooks** - Especially if using standard packages
- **Basic data analysis** - With only pandas, matplotlib, etc.
- **Exploratory work** - Still figuring out requirements
- **Time constraints** - Learning Docker while doing research

## Sample Final Project Docker Setup

Here's how you might structure a final project with Docker:

### Project Structure
```
my-final-project/
├── Dockerfile
├── requirements.txt
├── README.md
├── data/
│   └── (your datasets)
├── notebooks/
│   └── analysis.ipynb
├── src/
│   ├── data_collection.py
│   ├── analysis.py
│   └── visualization.py
└── output/
    └── (generated results)
```
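
If you want to create this layout quickly, here is a short sketch using `pathlib`. The helper name `scaffold_project` is hypothetical, and the demo writes into a temporary folder so nothing in your working directory is touched.

```python
from pathlib import Path
import tempfile

PROJECT_DIRS = ["data", "notebooks", "src", "output"]
PROJECT_FILES = ["Dockerfile", "requirements.txt", "README.md"]


def scaffold_project(root):
    """Create the folders and empty top-level files of the layout above."""
    root = Path(root)
    for folder in PROJECT_DIRS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    for filename in PROJECT_FILES:
        # touch() leaves existing files untouched apart from their timestamp
        (root / filename).touch(exist_ok=True)
    return root


demo_root = scaffold_project(Path(tempfile.mkdtemp()) / "my-final-project")
print(sorted(p.name for p in demo_root.iterdir()))
```

Running it against an existing project is safe in the sense that it never overwrites file contents; it only creates what is missing.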

### Sample Dockerfile for Final Project
```dockerfile
FROM jupyter/scipy-notebook:latest

# Install additional packages for your project
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy project files
COPY --chown=jovyan:users . /home/jovyan/work/

# Set working directory
WORKDIR /home/jovyan/work

# Expose Jupyter port
EXPOSE 8888

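# Start Jupyter
# WARNING: the empty token and password disable authentication. That is
# convenient for a local demo, but never expose this port on a shared or
# public machine with these settings.
CMD ["start-notebook.sh", "--NotebookApp.token=''", "--NotebookApp.password=''"]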
```

### Sample README Instructions
````markdown
# My Final Project: Social Media Sentiment Analysis

## Quick Start with Docker

1. Clone this repository
2. Build the Docker image:
   ```bash
   docker build -t my-project .
   ```
3. Run the container:
   ```bash
   docker run -p 8888:8888 -v $(pwd):/home/jovyan/work my-project
   ```
4. Open your browser to http://localhost:8888
5. Open and run `notebooks/analysis.ipynb`

## Without Docker

1. Install Python 3.10+
2. Install requirements: `pip install -r requirements.txt`
3. Run: `jupyter notebook`
````

## Bonus Points Criteria

To earn bonus points for Docker usage in your final project:

### ⭐ Basic Docker Implementation (2-3 bonus points)
- Working Dockerfile that builds successfully
- Clear instructions in README
- Container runs your analysis without errors

### ⭐⭐ Advanced Docker Usage (4-5 bonus points)
- Optimized Dockerfile (multi-stage builds, small image size)
- Docker Compose for multi-service setup
- Volume mounting for data persistence
- Environment variables for configuration

### ⭐⭐⭐ Production-Ready Containerization (6+ bonus points)
- Deployed to cloud platform (AWS, Google Cloud, etc.)
- Web interface (Streamlit, Flask, etc.) running in container
- Automated CI/CD pipeline
- Comprehensive documentation

## Getting Started Tips

1. **Start early** - Don't leave Docker for the last minute
2. **Begin simple** - Get a basic container working first
3. **Test thoroughly** - Make sure others can actually run your container
4. **Document everything** - Clear instructions are crucial
5. **Ask for help** - Use office hours if you get stuck

## Alternative: GitHub Codespaces

If Docker feels too complex, consider GitHub Codespaces:
- Cloud-based development environment
- Configured via `.devcontainer/devcontainer.json`
- Runs your environment in a container behind the scenes, so you get similar reproducibility benefits without installing Docker locally

🔔 **Question**: Are you considering using Docker for your final project? What type of project are you working on?

💡 **Remember**: Docker is **optional** for your final project. Use it if it adds value, but don't let it distract from your core research goals!

<div class="alert alert-success">

## ❗ Key Points

* **Docker containers** provide lightweight, portable environments for your applications
* **Reproducibility** is Docker's biggest benefit for research - ensuring your work can be replicated
* **Dockerfiles** define how to build images with step-by-step instructions
* **Best practices** include using official base images, optimizing for size, and good documentation
* **Learning curve** exists, but the investment pays off for complex projects
* **Start simple** - begin with basic containers before attempting advanced features
* **Optional for final projects** - use Docker if it adds value, not just for bonus points
* **Alternatives exist** - GitHub Codespaces, virtual environments, etc. can also help with reproducibility

</div>

## Next Steps

1. **Practice** - Try containerizing a simple Python script
2. **Explore** - Look at existing Docker images on Docker Hub
3. **Plan** - Consider if Docker would benefit your final project
4. **Learn more** - Docker documentation, tutorials, and courses
5. **Connect** - Join Docker communities and forums

## Additional Resources

- [Docker Official Documentation](https://docs.docker.com/)
- [Docker Hub](https://hub.docker.com/) - Repository of pre-built images
- [Play with Docker](https://labs.play-with-docker.com/) - Browser-based Docker playground
- [Awesome Docker](https://github.com/veggiemonk/awesome-docker) - Curated list of Docker resources
- [Docker for Data Science](https://github.com/jupyter/docker-stacks) - Jupyter Docker images

Thanks for joining today's Docker workshop! Feel free to reach out during office hours if you have questions about implementing Docker in your projects.