## Overview

A **package** is a collection of modules organized in a directory hierarchy. Packages allow you to structure large projects logically and avoid naming conflicts.

In this notebook, we'll explore:

1. What packages are and how they differ from modules
2. The role of `__init__.py`
3. Creating and using packages
4. Nested packages (subpackages)
5. Relative and absolute imports
6. Best practices for package structure
7. Real-world package examples

## Modules vs Packages

| Concept | Definition | Example |
|---------|------------|----------|
| **Module** | A single Python file (`.py`) | `statistics.py` |
| **Package** | A directory containing modules | `data_analysis/` (with `__init__.py`) |
| **Subpackage** | A package inside another package | `data_analysis/preprocessing/` |

**Key Point**: A package is essentially a directory with an `__init__.py` file (can be empty).
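
The distinction can be checked at runtime. The following is a minimal sketch using two standard-library imports as stand-ins (`json` is a package, `math` is a plain module); the helper name `is_package` is an assumption made for this illustration:

```python
import json     # a standard-library package (a directory of modules)
import math     # a standard-library module (not a package)
import pkgutil


def is_package(module):
    """A module object is a package if it has a __path__ attribute."""
    return hasattr(module, "__path__")


print(is_package(json))  # True
print(is_package(math))  # False

# List the modules that live inside the json package directory.
submodules = sorted(info.name for info in pkgutil.iter_modules(json.__path__))
print(submodules)
```

The `__path__` attribute is what lets Python look inside the directory for submodules, which is why only packages have it.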

## Understanding `__init__.py`

### Purpose of `__init__.py`

The `__init__.py` file serves three main purposes:

1. **Marks a directory as a Python package** (required in Python < 3.3, optional but recommended in Python ≥ 3.3)
2. **Executes initialization code** when the package is imported
3. **Controls what's exported** from the package

**Note**: In Python 3.3+, namespace packages exist (directories without `__init__.py`), but using `__init__.py` is still best practice.
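
To make the difference concrete, here is a small sketch that builds two throwaway packages in a temporary directory, one with `__init__.py` and one without. The names `demo_regular_pkg` and `demo_namespace_pkg` are hypothetical and exist only for this demonstration:

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Regular package: its __init__.py runs when the package is first imported.
    regular = root / "demo_regular_pkg"
    regular.mkdir()
    (regular / "__init__.py").write_text("INITIALIZED = True\n")

    # Namespace package: a bare directory with no __init__.py.
    (root / "demo_namespace_pkg").mkdir()

    # Make the temporary directory importable just long enough to import both.
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        import demo_regular_pkg
        import demo_namespace_pkg
    finally:
        sys.path.remove(tmp)

# The initialization code in __init__.py has run.
print(demo_regular_pkg.INITIALIZED)  # True

# Both are packages, but the namespace package has no file behind it.
print(hasattr(demo_namespace_pkg, "__path__"))  # True
print(getattr(demo_namespace_pkg, "__file__", None))  # None
```

Because a namespace package has no `__init__.py`, it has nowhere to put a version string, initialization code, or re-exported names, which is one reason regular packages remain the common choice.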

### Simple Package Structure

We've created a simple package called `mathtools` in the same directory. Here's the structure:

```
modules/
├── mathtools/
│   ├── __init__.py
│   ├── basic.py
│   └── advanced.py
```

The files contain:
- **`__init__.py`**: Package initialization and exports
- **`basic.py`**: Simple operations (add, subtract)
- **`advanced.py`**: More complex operations (power, factorial)
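
The source of these files is not reproduced in this notebook. As a rough sketch, assuming straightforward implementations, the two modules might look like the following. The function bodies and signatures are assumptions chosen to match the calls used later on:

```python
# --- mathtools/basic.py (hypothetical contents) ---
def add(a, b):
    """Return the sum of a and b."""
    return a + b


def subtract(a, b):
    """Return a minus b."""
    return a - b


# --- mathtools/advanced.py (hypothetical contents) ---
def power(base, exponent):
    """Return base raised to exponent."""
    return base ** exponent


def factorial(n):
    """Return n! for a non-negative integer n."""
    if n < 0:
        raise ValueError("factorial is undefined for negative numbers")
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
```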

## Using the Package

```python
# Import specific modules from the package
from mathtools import basic, advanced

# Use functions
print(basic.add(5, 3))         # 8
print(advanced.factorial(5))   # 120

# Alternative: import specific functions
from mathtools.basic import add, subtract
from mathtools.advanced import power

print(add(10, 5))      # 15
print(power(2, 8))     # 256
```

## Controlling Package Exports with `__init__.py`

### Making Imports More Convenient

Our `mathtools/__init__.py` exposes functions at the package level, so you can use them more conveniently:

```python
import mathtools

# Functions available at package level!
print(mathtools.add(5, 3))      # 8
print(mathtools.factorial(5))   # 120
print(mathtools.__version__)    # 1.0.0

# Compare to importing from submodules
# from mathtools.basic import add
# from mathtools.advanced import factorial
```
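
The mechanism behind this is a handful of re-exporting imports in `__init__.py`. The sketch below builds a stand-in package in a temporary directory to show it end to end. It uses the hypothetical name `mathtools_demo` so it cannot clash with the real `mathtools`, and the file contents are assumptions, not the actual source:

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical __init__.py: re-export selected names at package level.
INIT_SOURCE = textwrap.dedent('''
    """Stand-in package initializer (a sketch, not the real mathtools file)."""
    from .basic import add, subtract
    from .advanced import power, factorial

    __version__ = "1.0.0"
    __all__ = ["add", "subtract", "power", "factorial"]
''')

BASIC_SOURCE = textwrap.dedent('''
    def add(a, b):
        return a + b

    def subtract(a, b):
        return a - b
''')

ADVANCED_SOURCE = textwrap.dedent('''
    import math

    def power(base, exponent):
        return base ** exponent

    def factorial(n):
        return math.factorial(n)
''')

with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "mathtools_demo"
    pkg.mkdir()
    (pkg / "__init__.py").write_text(INIT_SOURCE)
    (pkg / "basic.py").write_text(BASIC_SOURCE)
    (pkg / "advanced.py").write_text(ADVANCED_SOURCE)

    # Make the temporary directory importable just long enough to import it.
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        import mathtools_demo
    finally:
        sys.path.remove(tmp)

print(mathtools_demo.add(5, 3))      # 8
print(mathtools_demo.factorial(5))   # 120
print(mathtools_demo.__version__)    # 1.0.0
print(mathtools_demo.__all__)
```

`__all__` only affects `from mathtools_demo import *`. Names imported in `__init__.py` are reachable as attributes of the package whether or not they are listed there.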

## Nested Packages (Subpackages)

### Complex Package Structure

For larger projects, you'll need hierarchical organization:

```
data_analysis/
├── __init__.py
├── io/
│   ├── __init__.py
│   ├── readers.py
│   └── writers.py
├── preprocessing/
│   ├── __init__.py
│   ├── cleaning.py
│   └── transformation.py
├── statistics/
│   ├── __init__.py
│   ├── descriptive.py
│   └── inferential.py
└── visualization/
    ├── __init__.py
    ├── plots.py
    └── charts.py
```

### Example File Contents

**`data_analysis/__init__.py`**
```python
"""
Data Analysis Package

A comprehensive package for data analysis tasks including:
- Data I/O operations
- Preprocessing and cleaning
- Statistical analysis
- Data visualization
"""

__version__ = '2.0.0'
__author__ = 'Data Science Team'

# Import key functions to package level
from .preprocessing.cleaning import remove_outliers, fill_missing
from .statistics.descriptive import mean, median, std_dev

__all__ = [
    'remove_outliers',
    'fill_missing',
    'mean',
    'median',
    'std_dev'
]
```

**`data_analysis/preprocessing/cleaning.py`**
```python
"""
Data cleaning utilities.
"""

def remove_outliers(data, method='iqr', threshold=1.5):
    """
    Remove outliers from dataset.
    
    Parameters
    ----------
    data : list
        Numerical data
    method : str
        Method to detect outliers ('iqr' or 'zscore')
    threshold : float
        Threshold for outlier detection
    
    Returns
    -------
    list
        Data with outliers removed
    """
    # Implementation here
    pass

def fill_missing(data, strategy='mean'):
    """
    Fill missing values in dataset.
    
    Parameters
    ----------
    data : list
        Data with potential None values
    strategy : str
        Strategy for filling ('mean', 'median', 'mode', 'zero')
    
    Returns
    -------
    list
        Data with missing values filled
    """
    # Implementation here
    pass
```
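
The function bodies above are left as placeholders. One possible way to fill them in, using only the standard-library `statistics` module, is sketched below. The choice of quartile method, the handling of each strategy, and the error behavior are all assumptions made for illustration, not the package's actual implementation:

```python
import statistics


def remove_outliers(data, method="iqr", threshold=1.5):
    """Sketch: drop values outside the IQR fence or beyond a z-score limit."""
    if method == "iqr":
        # statistics.quantiles with n=4 returns the three quartile cut points.
        q1, _, q3 = statistics.quantiles(data, n=4)
        spread = q3 - q1
        low, high = q1 - threshold * spread, q3 + threshold * spread
        return [x for x in data if low <= x <= high]
    if method == "zscore":
        center = statistics.mean(data)
        scale = statistics.stdev(data)
        if scale == 0:
            return list(data)  # all values identical, nothing to remove
        return [x for x in data if abs(x - center) / scale <= threshold]
    raise ValueError(f"unknown method: {method!r}")


def fill_missing(data, strategy="mean"):
    """Sketch: replace None entries with a value derived from the rest."""
    present = [x for x in data if x is not None]
    if strategy == "zero":
        fill = 0
    elif strategy == "mean":
        fill = statistics.mean(present)
    elif strategy == "median":
        fill = statistics.median(present)
    elif strategy == "mode":
        fill = statistics.mode(present)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if x is None else x for x in data]


print(remove_outliers([1, 2, 3, 4, 5, 100]))                     # [1, 2, 3, 4, 5]
print(fill_missing([10, 20, None, 30, 40, None, 50], "median"))  # [10, 20, 30, 30, 40, 30, 50]
```

In this sketch an input with no non-missing values raises `statistics.StatisticsError` for the mean, median, and mode strategies. A real implementation would need to decide how to handle that case.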

**`data_analysis/statistics/descriptive.py`**
```python
"""
Descriptive statistics functions.
"""

def mean(data):
    """Calculate arithmetic mean."""
    return sum(data) / len(data)

def median(data):
    """Calculate median."""
    sorted_data = sorted(data)
    n = len(sorted_data)
    mid = n // 2
    if n % 2 == 0:
        return (sorted_data[mid - 1] + sorted_data[mid]) / 2
    return sorted_data[mid]

def std_dev(data):
    """Calculate sample standard deviation."""
    m = mean(data)
    variance = sum((x - m) ** 2 for x in data) / (len(data) - 1)
    return variance ** 0.5
```

### Using Nested Packages

```python
# Option 1: Import from package level (if exposed in __init__.py)
import data_analysis
result = data_analysis.mean([1, 2, 3, 4, 5])

# Option 2: Import from subpackage
from data_analysis.statistics import descriptive
result = descriptive.mean([1, 2, 3, 4, 5])

# Option 3: Import specific function
from data_analysis.statistics.descriptive import mean, median
avg = mean([1, 2, 3, 4, 5])
mid = median([1, 2, 3, 4, 5])

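# Option 4: Import the subpackage itself. `statistics.descriptive` is reachable
# here only because data_analysis/__init__.py already imported that submodule;
# importing a package does not load its submodules automatically.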
from data_analysis import statistics
result = statistics.descriptive.mean([1, 2, 3, 4, 5])
```

## Relative vs Absolute Imports

### Absolute Imports

Specify the full dotted path, starting from the top-level package:

```python
# In data_analysis/preprocessing/transformation.py

# Import from sibling module
from data_analysis.preprocessing.cleaning import remove_outliers

# Import from different subpackage
from data_analysis.statistics.descriptive import mean

def normalize(data):
    """Normalize data after removing outliers."""
    cleaned = remove_outliers(data)
    avg = mean(cleaned)
    # ... normalization logic
```

**Pros:**
- Clear and explicit
- Easy to understand import origin
- Work from any module, as long as the top-level package is importable (installed or on `sys.path`)

**Cons:**
- Verbose
- Harder to refactor (package rename affects all imports)

### Relative Imports

Use dots to indicate relative position:

```python
# In data_analysis/preprocessing/transformation.py

# Import from a sibling module in the same package
from .cleaning import remove_outliers

# Import from a sibling subpackage (go up one level, then down)
from ..statistics.descriptive import mean

# Import a sibling module as a whole
from . import cleaning

def normalize(data):
    cleaned = remove_outliers(data)
    avg = mean(cleaned)
    # ... normalization logic
```

**Syntax:**
- `.` = current package
- `..` = parent package
- `...` = grandparent package
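
One way to see how the dots are interpreted is `importlib.util.resolve_name`, a standard-library helper that converts a relative name plus the current package into an absolute name. The sketch below applies it to the `data_analysis` layout from earlier; nothing is actually imported, so the package does not need to exist:

```python
from importlib.util import resolve_name

# The package that contains data_analysis/preprocessing/transformation.py
current_package = "data_analysis.preprocessing"

# One dot: stay in the current package.
print(resolve_name(".cleaning", current_package))
# data_analysis.preprocessing.cleaning

# Two dots: go up to the parent package, then down again.
print(resolve_name("..statistics.descriptive", current_package))
# data_analysis.statistics.descriptive

# Three dots would climb above the top-level package, which is an error.
try:
    resolve_name("...other", current_package)
except (ImportError, ValueError) as exc:  # ValueError on Python versions before 3.9
    print("cannot resolve:", exc)
```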

**Pros:**
- Concise
- Easier to refactor
- Package can be renamed without changing internal imports

**Cons:**
- Only work inside packages; a file run directly as a script (for example `python transformation.py`) cannot use them
- Can be confusing with deep nesting
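
The first limitation is easy to reproduce. The sketch below creates a tiny throwaway package (the name `demo_pkg` and its files are hypothetical) and runs the same file twice: once directly as a script and once as a module with `python -m`:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_pkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "helper.py").write_text("VALUE = 42\n")
    (pkg / "main.py").write_text("from .helper import VALUE\nprint(VALUE)\n")

    # Run the file directly: Python sees a top-level script with no parent
    # package, so the relative import cannot be resolved.
    as_script = subprocess.run(
        [sys.executable, str(pkg / "main.py")],
        capture_output=True, text=True, cwd=tmp,
    )

    # Run it as a module from the directory that contains the package:
    # the parent package is known, so the relative import works.
    as_module = subprocess.run(
        [sys.executable, "-m", "demo_pkg.main"],
        capture_output=True, text=True, cwd=tmp,
    )

print("script exit code:", as_script.returncode)
print("script error:", as_script.stderr.strip().splitlines()[-1])
print("module output:", as_module.stdout.strip())
```

The direct run fails with an `ImportError` about a relative import with no known parent package, while the `-m` form prints the value. This is why scripts that live inside a package are usually launched with `python -m package.module`.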

### Best Practice: Mix Both

**Within a package**: Use relative imports
```python
# In data_analysis/preprocessing/transformation.py
from . import cleaning
from ..statistics import descriptive
```

**From outside the package**: Use absolute imports
```python
# In main.py
from data_analysis.preprocessing import cleaning
from data_analysis.statistics import descriptive
```

## Real-World Package Example: NumPy-like Structure

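Let's examine how a real package such as NumPy is organized. The tree below is a simplified illustration; the actual file and directory names differ between NumPy versions: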

```
numpy/
├── __init__.py           # Main package initialization
├── core/                 # Core functionality
│   ├── __init__.py
│   ├── numeric.py       # Basic array operations
│   └── multiarray.py    # Array implementation
├── linalg/               # Linear algebra
│   ├── __init__.py
│   └── linalg.py
├── random/               # Random number generation
│   ├── __init__.py
│   └── mtrand.py
├── fft/                  # Fast Fourier Transform
│   ├── __init__.py
│   └── fftpack.py
└── testing/              # Testing utilities
    ├── __init__.py
    └── utils.py
```

**Key observations:**
- Logical grouping by functionality
- Each subpackage has clear purpose
- Common functions exposed at top level
- Deep functionality available in subpackages

## Best Practices for Package Structure

### 1. Logical Organization

Group related functionality together:

```
✅ Good:
data_pipeline/
├── ingestion/       # Data loading
├── validation/      # Data quality checks
├── transformation/  # Data transformations
└── export/          # Data export

❌ Bad:
data_pipeline/
├── module1.py       # Unclear purpose
├── helpers.py       # Too generic
├── utils.py         # What kind of utils?
└── misc.py          # Catch-all
```

### 2. Flat is Better Than Nested

Don't over-nest packages:

```
✅ Good (2-3 levels):
project/
└── analysis/
    ├── preprocessing/
    └── statistics/

❌ Too deep (4+ levels):
project/
└── company/
    └── department/
        └── team/
            └── project/
                └── analysis/
```

### 3. Clear Naming

Package names should be:
- **Lowercase**: `data_analysis` not `DataAnalysis`
- **Short**: `stats` better than `statistical_analysis_tools`
- **Descriptive**: `preprocessing` not `prep` or `utils`
- **No hyphens**: `data_io` not `data-io`
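
Some of these rules are enforced by the language itself, because a package name must be a valid identifier to be usable in an `import` statement. A minimal sketch of a name checker follows; the function name `check_package_name` and the exact set of checks are assumptions for illustration:

```python
import keyword


def check_package_name(name):
    """Return a list of problems with a proposed package name (empty if none)."""
    problems = []
    if not name.isidentifier():
        problems.append("not a valid Python identifier")
    if keyword.iskeyword(name):
        problems.append("is a reserved keyword")
    if name != name.lower():
        problems.append("contains uppercase letters")
    return problems


for candidate in ["data_analysis", "DataAnalysis", "data-io", "class"]:
    print(candidate, check_package_name(candidate))
```

A hyphenated name such as `data-io` fails the identifier check because `import data-io` would be parsed as a subtraction. The uppercase check reflects convention rather than a language requirement.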

### 4. Documentation Structure

Complete package with documentation:

```
mypackage/
├── README.md              # Overview, installation, quick start
├── LICENSE                # License information
├── setup.py               # Package installation (newer projects often use pyproject.toml)
├── requirements.txt       # Dependencies
├── docs/                  # Detailed documentation
│   ├── index.md
│   ├── api.md
│   └── examples.md
├── tests/                 # Test suite
│   ├── __init__.py
│   ├── test_module1.py
│   └── test_module2.py
├── examples/              # Example scripts
│   ├── basic_usage.py
│   └── advanced_usage.py
└── mypackage/             # Actual package
    ├── __init__.py
    ├── module1.py
    └── module2.py
```

## Practical Example: Data Analysis Package

We've created a complete, well-structured package called `datatools`:

```
modules/
├── datatools/
│   ├── __init__.py
│   ├── cleaning/
│   │   ├── __init__.py
│   │   ├── missing.py
│   │   └── outliers.py
│   └── stats/
│       ├── __init__.py
│       └── descriptive.py
```

This package demonstrates:
- Nested package structure (subpackages)
- Proper `__init__.py` files at each level
- Relative imports between modules (e.g., `from ..stats.descriptive import mean`)
- Exposed API at the top level
- Complete documentation

### Using the Package

```python
# Option 1: Use top-level imports (most convenient)
import datatools

data = [10, 20, None, 30, 40, None, 50]
cleaned = datatools.fill_missing(data, strategy='median')
avg = datatools.mean(cleaned)
print(f"Average: {avg}")

# Option 2: Import specific submodules
from datatools.cleaning import missing
from datatools.stats import descriptive

info = missing.detect_missing(data)
print(f"\nMissing: {info['count']} values ({info['percentage']:.1f}%)")

# Option 3: Import specific functions
from datatools import fill_missing, mean, std_dev

cleaned = fill_missing(data)
print(f"\nMean: {mean(cleaned):.2f}")
print(f"Std Dev: {std_dev(cleaned):.2f}")
```
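
The helper `detect_missing` is called above but its source is not shown. Judging only from how its result is used (`info['count']` and `info['percentage']`), it might look something like this sketch; the body and the handling of an empty list are assumptions:

```python
def detect_missing(data):
    """Sketch: summarize how many entries in data are None."""
    count = sum(1 for x in data if x is None)
    total = len(data)
    # Avoid dividing by zero for an empty list.
    percentage = 100 * count / total if total else 0.0
    return {"count": count, "percentage": percentage}


data = [10, 20, None, 30, 40, None, 50]
info = detect_missing(data)
print(f"Missing: {info['count']} values ({info['percentage']:.1f}%)")
# Missing: 2 values (28.6%)
```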

## Exercises

### Exercise 1: Design a Package Structure

Design a package structure for a machine learning project that includes:
- Data loading (CSV, JSON, SQL)
- Data preprocessing (scaling, encoding, splitting)
- Model training (linear regression, decision tree)
- Model evaluation (metrics, visualization)
- Utilities (logging, config)

Draw the directory structure and explain your design decisions.

### Exercise 2: Implement a Simple Package

Create a package called `texttools` with the following structure:

```
texttools/
├── __init__.py
├── analysis/
│   ├── __init__.py
│   └── word_count.py
└── transformation/
    ├── __init__.py
    └── case.py
```

Implement:
- `word_count.py`: Functions for counting words, characters, sentences
- `case.py`: Functions for case transformations (title_case, snake_case, camel_case)
- Proper `__init__.py` files that expose commonly-used functions
- Use relative imports between modules
- Include docstrings for all functions

### Exercise 3: Refactor Code into a Package

Given this monolithic script, refactor it into a well-organized package:

```python
# messy_analysis.py

def load_data(filename):
    # Load CSV
    pass

def clean_data(data):
    # Remove missing values
    pass

def normalize_data(data):
    # Normalize to 0-1 range
    pass

def calculate_mean(data):
    # Calculate mean
    pass

def calculate_std(data):
    # Calculate std dev
    pass

def plot_histogram(data):
    # Create histogram
    pass

def save_results(data, filename):
    # Save to CSV
    pass
```

Create a package structure that logically groups these functions.

### Exercise 4: Relative vs Absolute Imports

Given this package structure:

```
analytics/
├── __init__.py
├── core/
│   ├── __init__.py
│   ├── math_ops.py
│   └── data_ops.py
└── advanced/
    ├── __init__.py
    └── ml.py
```

For the file `analytics/advanced/ml.py`, write:
1. How to import `math_ops` using absolute import
2. How to import `math_ops` using relative import
3. How to import `data_ops` using both methods
4. Which method would you prefer and why?

## Key Takeaways

✅ **Packages vs Modules:**
- Package = directory with `__init__.py`
- Module = single `.py` file
- Packages organize related modules

✅ **`__init__.py` Purposes:**
- Marks directory as package
- Runs initialization code
- Controls what's exported
- Can expose submodule functions at package level

✅ **Import Strategies:**
- **Absolute imports**: Full dotted path from the top-level package
- **Relative imports**: Use `.` and `..` for navigation
- Within package: prefer relative
- From outside: use absolute

✅ **Best Practices:**
- Logical organization by functionality
- 2-3 levels deep maximum
- Clear, descriptive names
- Comprehensive documentation
- Expose common functions at top level