## Overview

Creating your own modules is essential for organizing larger projects. In this notebook, we'll learn:

1. How to create and use custom modules
2. Understanding `__name__` and `__main__`
3. Module initialization and variables
4. The `__pycache__` directory and `.pyc` files
5. Best practices for module design
6. Documentation and docstrings

**Note**: Modules live in separate `.py` files rather than in notebook cells, so we'll use a combination of:
- Conceptual examples
- File creation demonstrations
- Practical exercises you can do outside the notebook

## What Makes a File a Module?

**Any Python file (`.py`) is automatically a module!**

If you create a file called `my_functions.py`, you can import it as:
```python
import my_functions
```

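The module name is the filename without the `.py` extension, so the filename must be a valid Python identifier. A file named `my-functions.py`, for example, cannot be loaded with a plain `import` statement.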
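As a minimal sketch of this rule, the cell below writes a small `my_functions.py` into a temporary folder and imports it. The `double` function and the temporary folder are assumptions made for illustration only; in a real project the file would normally sit next to your script or notebook.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway folder to hold the example module
module_dir = Path(tempfile.mkdtemp())

# Write a tiny module file; `double` is a hypothetical example function
(module_dir / "my_functions.py").write_text(
    "def double(x):\n"
    "    return 2 * x\n"
)

# Python searches the folders listed in sys.path when importing
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()  # make sure the new file is noticed

import my_functions

print(my_functions.__name__)    # my_functions
print(my_functions.double(21))  # 42
```

The same steps work from a notebook cell, which makes it possible to experiment with real module files without leaving Jupyter.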

## Example Module Structure

Let's create a simple statistics module. Here's what `stats_utils.py` would look like:

```python
# stats_utils.py
"""
A simple statistics utility module.

This module provides basic statistical functions for data analysis.
"""

def mean(data):
    """Calculate the arithmetic mean of a dataset."""
    if not data:
        raise ValueError("Cannot calculate mean of empty dataset")
    return sum(data) / len(data)

def median(data):
    """Calculate the median of a dataset."""
    if not data:
        raise ValueError("Cannot calculate median of empty dataset")
    sorted_data = sorted(data)
    n = len(sorted_data)
    mid = n // 2
    if n % 2 == 0:
        return (sorted_data[mid - 1] + sorted_data[mid]) / 2
    return sorted_data[mid]

def variance(data):
    """Calculate the sample variance."""
    if len(data) < 2:
        raise ValueError("Variance requires at least 2 data points")
    m = mean(data)
    return sum((x - m) ** 2 for x in data) / (len(data) - 1)

# Module-level variable
VERSION = "1.0.0"
AUTHOR = "Data Analysis Student"

# This code runs when the module is imported
print(f"stats_utils module loaded (version {VERSION})")
```

## Using Your Module

After creating `stats_utils.py`, you could use it like this:

```python
# main.py
import stats_utils

data = [10, 20, 30, 40, 50]
print(f"Mean: {stats_utils.mean(data)}")
print(f"Median: {stats_utils.median(data)}")
print(f"Version: {stats_utils.VERSION}")
```

Output:
```
stats_utils module loaded (version 1.0.0)
Mean: 30.0
Median: 30.0
Version: 1.0.0
```

## Understanding `__name__` and `__main__`

### The `__name__` Variable

Every Python module has a special variable called `__name__`:

- When a file is **run directly**, `__name__` is set to `"__main__"`
- When a file is **imported**, `__name__` is set to the module's name

This allows us to write code that behaves differently depending on how it's used.

In [None]:
# In a notebook, __name__ is always '__main__'
print(f"Current __name__: {__name__}")

### Practical Example

Here's how you'd structure `stats_utils.py` to include tests:

```python
# stats_utils.py
def mean(data):
    """Calculate the arithmetic mean."""
    return sum(data) / len(data)

def median(data):
    """Calculate the median."""
    sorted_data = sorted(data)
    n = len(sorted_data)
    mid = n // 2
    if n % 2 == 0:
        return (sorted_data[mid - 1] + sorted_data[mid]) / 2
    return sorted_data[mid]

# This code only runs when the file is executed directly
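if __name__ == "__main__":
    print("Running module tests...")
    
    # Test data
    test_data = [1, 2, 3, 4, 5]
    
    # Show the results
    print(f"Test data: {test_data}")
    print(f"Mean: {mean(test_data)}")
    print(f"Median: {median(test_data)}")
    
    # Check the results; a failing assert raises AssertionError
    assert mean(test_data) == 3
    assert median(test_data) == 3
    assert median([1, 2, 3, 4]) == 2.5
    
    print("\nAll tests passed!")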
```

**Behavior:**
- Running `python stats_utils.py` → Tests execute
- Importing `import stats_utils` → Tests don't execute
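The two behaviors can be observed from a single cell. The sketch below is an illustration under a few assumptions: `name_demo.py` is a hypothetical file written to a temporary folder, and `runpy.run_path(..., run_name="__main__")` stands in for running `python name_demo.py` from a terminal.

```python
import importlib
import runpy
import sys
import tempfile
from pathlib import Path

module_dir = Path(tempfile.mkdtemp())
module_path = module_dir / "name_demo.py"  # hypothetical module name

# The module simply records what __name__ was while its code ran
module_path.write_text(
    "seen_name = __name__\n"
    "ran_as_script = (__name__ == '__main__')\n"
)

# Case 1: import the file as a module
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
import name_demo

print("imported:", name_demo.seen_name, name_demo.ran_as_script)
# imported: name_demo False

# Case 2: execute the file as a script.
# runpy.run_path returns the globals the script ended up with.
script_globals = runpy.run_path(str(module_path), run_name="__main__")

print("run directly:", script_globals["seen_name"], script_globals["ran_as_script"])
# run directly: __main__ True
```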

## Module Initialization

### Code Execution on Import

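When you import a module, **all code at the module level executes immediately**. This happens only once per interpreter session, even if you import the module multiple times, because Python caches the loaded module in `sys.modules`. In a notebook, this means that edits to a `.py` file are not picked up by a repeated `import` until you restart the kernel or call `importlib.reload()`.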

```python
# config.py
print("Initializing configuration...")

DATABASE_URL = "localhost:5432"
MAX_CONNECTIONS = 10
DEBUG_MODE = True

print("Configuration loaded")
```

First import:
```python
import config  # Prints "Initializing configuration..." then "Configuration loaded"
```

Second import:
```python
import config  # Prints nothing (already initialized)
```
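To see the run-once behavior in action, the following sketch counts how often a module's top-level code runs. It assumes a hypothetical `config_demo.py` written to a temporary folder, and it captures the module's printed output so the runs can be counted.

```python
import contextlib
import importlib
import io
import sys
import tempfile
from pathlib import Path

module_dir = Path(tempfile.mkdtemp())
(module_dir / "config_demo.py").write_text(  # hypothetical module name
    "print('Initializing configuration...')\n"
    "MAX_CONNECTIONS = 10\n"
)
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

captured = io.StringIO()
with contextlib.redirect_stdout(captured):
    import config_demo             # first import: module code runs
    import config_demo             # already in sys.modules: nothing runs
    importlib.reload(config_demo)  # explicit reload: module code runs again

run_count = captured.getvalue().count("Initializing configuration...")
print("Module body ran", run_count, "times")                    # 2 times
print("Cached in sys.modules:", "config_demo" in sys.modules)   # True
```

`importlib.reload()` is mainly useful during interactive work, for example after editing a module while a notebook kernel is still running.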

### Module-Level Variables

Variables defined at the module level act as **shared state** across all imports:

```python
# counter.py
count = 0

def increment():
    global count
    count += 1
    return count

def get_count():
    return count
```

Usage:
```python
import counter

print(counter.increment())  # 1
print(counter.increment())  # 2
print(counter.get_count())  # 2
```
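One subtlety is worth a quick experiment: `from module import name` copies the current value into a new name, so later changes inside the module are not visible through it. The sketch below reuses the counter code above under the hypothetical filename `counter_demo.py`, written to a temporary folder.

```python
import importlib
import sys
import tempfile
from pathlib import Path

module_dir = Path(tempfile.mkdtemp())
(module_dir / "counter_demo.py").write_text(  # hypothetical module name
    "count = 0\n"
    "\n"
    "def increment():\n"
    "    global count\n"
    "    count += 1\n"
    "    return count\n"
)
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

import counter_demo
from counter_demo import count  # binds a new name to the current value (0)

counter_demo.increment()
counter_demo.increment()

print(counter_demo.count)  # 2: attribute lookup sees the module's current state
print(count)               # 0: the imported name still refers to the old value
```

When shared state matters, accessing it as `module.variable` (or through a function such as `get_count()`) avoids this surprise.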

## The `__pycache__` Directory

### What is `__pycache__`?

When Python imports a module for the first time:

1. It reads the `.py` file
2. Compiles it to **bytecode** (intermediate representation)
3. Saves the bytecode to `__pycache__/module.cpython-XY.pyc`

**Benefits:**
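- Faster loading on subsequent imports (the compile step is skipped; the code itself does not run any faster)
- Python compares the cached file with the source (by default using the source's modification time and size) and recompiles when the source has changed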

**Example structure:**
```
project/
├── main.py
├── stats_utils.py
└── __pycache__/
    └── stats_utils.cpython-312.pyc
```
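The cache location can be inspected from Python itself. This sketch assumes a stand-in `stats_utils.py` in a temporary folder and compiles it explicitly with `py_compile`; a normal import would trigger the same compilation automatically.

```python
import importlib.util
import py_compile
import sys
import tempfile
from pathlib import Path

project_dir = Path(tempfile.mkdtemp())
source_path = project_dir / "stats_utils.py"
source_path.write_text("VERSION = '1.0.0'\n")  # stand-in module contents

# Where Python expects the cached bytecode for this source file
expected_pyc = importlib.util.cache_from_source(str(source_path))

# Compile explicitly; the return value is the path of the bytecode file
compiled_pyc = py_compile.compile(str(source_path), doraise=True)

print("Cache tag:", sys.implementation.cache_tag)  # e.g. cpython-312
print("Bytecode file:", Path(compiled_pyc).name)
print("Matches expected location:", compiled_pyc == expected_pyc)
```

The cache tag in the filename is what lets several Python versions share one `__pycache__` folder without overwriting each other's files.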

### Should You Commit `__pycache__`?

**No!** Add this to your `.gitignore`:

```
__pycache__/
*.pyc
*.pyo
```

Bytecode files are:
- Specific to the Python version and implementation that created them
- Automatically regenerated
- Not human-readable

## Best Practices for Module Design

### 1. Clear Module Purpose

Each module should have a **single, well-defined purpose**.

**Good:**
```
data_cleaning.py      # Functions for cleaning data
visualization.py      # Plotting and visualization
descriptive_stats.py  # Statistical calculations (a name like statistics.py can shadow the standard library's statistics module)
```

**Bad:**
```
utils.py              # Too generic, becomes a dumping ground
stuff.py              # Unclear purpose
helpers.py            # What kind of helpers?
```

### 2. Module Documentation

Always include a **module docstring** at the top:

```python
# data_preprocessing.py
"""
Data preprocessing utilities for machine learning projects.

This module provides functions for:
- Handling missing values
- Encoding categorical variables
- Scaling numerical features
- Outlier detection and treatment

Author: Your Name
Date: 2024-01-15
Version: 1.0.0
"""
```

### 3. Function Documentation

Every public function should have a docstring:

```python
def normalize_data(data, method='minmax'):
    """
    Normalize numerical data using specified method.
    
    Parameters
    ----------
    data : list or array-like
        Numerical data to normalize
    method : str, optional
        Normalization method. Options: 'minmax', 'zscore'
        Default is 'minmax'
    
    Returns
    -------
    list
        Normalized data
    
    Raises
    ------
    ValueError
        If method is not recognized or data is empty
    
    Examples
    --------
    >>> normalize_data([1, 2, 3, 4, 5])
    [0.0, 0.25, 0.5, 0.75, 1.0]
    """
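    # One possible implementation; the 'zscore' branch assumes the
    # population standard deviation (divide by n)
    if not data:
        raise ValueError("Cannot normalize empty data")
    if method == 'minmax':
        low, high = min(data), max(data)
        span = high - low
        # Constant data has no spread; map every value to 0.0
        return [(x - low) / span if span else 0.0 for x in data]
    if method == 'zscore':
        m = sum(data) / len(data)
        std = (sum((x - m) ** 2 for x in data) / len(data)) ** 0.5
        return [(x - m) / std if std else 0.0 for x in data]
    raise ValueError(f"Unknown method: {method!r}")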
```
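An `Examples` section written with `>>>` prompts can double as a test, because the standard-library `doctest` module can run it. The sketch below uses a simplified, hypothetical `scale_to_unit` function rather than the full `normalize_data` above, and runs its docstring example programmatically.

```python
import doctest


def scale_to_unit(data):
    """
    Rescale numbers to the range [0, 1] (min-max scaling).

    Examples
    --------
    >>> scale_to_unit([1, 2, 3, 4, 5])
    [0.0, 0.25, 0.5, 0.75, 1.0]
    """
    lowest, highest = min(data), max(data)
    return [(x - lowest) / (highest - lowest) for x in data]


# Collect the examples from the docstring and run them
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(scale_to_unit, name="scale_to_unit",
                        globs={"scale_to_unit": scale_to_unit}):
    runner.run(test)

results = runner.summarize(verbose=False)
print(f"{results.attempted} example(s) run, {results.failed} failed")
```

In a module file, a common alternative is to call `doctest.testmod()` inside the `if __name__ == "__main__":` block so the examples run whenever the file is executed directly.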

### 4. Private vs Public Interface

Use naming conventions to indicate intended use:

```python
# data_utils.py

# Public function (meant to be used)
def load_data(filename):
    """Load data from a file."""
    return _parse_file(filename)

# Private function (internal helper)
def _parse_file(filename):
    """Internal function to parse file contents."""
    # Implementation
    pass

# Private variable
_BUFFER_SIZE = 1024

# Public constant
MAX_RETRY_ATTEMPTS = 3
```

**Convention**: Names starting with `_` are "private" (not truly private, but signal internal use)

### 5. The `__all__` Variable

Control what's exported with `from module import *`:

```python
# stats_utils.py

# Explicitly define public API
__all__ = ['mean', 'median', 'variance', 'std_dev']

def mean(data):
    """Calculate mean."""
    return sum(data) / len(data)

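def median(data):
    """Calculate median."""
    sorted_data = sorted(data)
    mid = len(sorted_data) // 2
    if len(sorted_data) % 2 == 0:
        return (sorted_data[mid - 1] + sorted_data[mid]) / 2
    return sorted_data[mid]

def variance(data):
    """Calculate sample variance."""
    m = mean(data)
    return sum((x - m) ** 2 for x in data) / (len(data) - 1)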

def std_dev(data):
    """Calculate standard deviation."""
    return variance(data) ** 0.5

def _internal_helper():
    """This won't be imported with 'from stats_utils import *'."""
    pass
```
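The effect of `__all__` is easiest to see by comparing two modules. This is a small sketch under stated assumptions: `with_all.py` and `without_all.py` are hypothetical files written to a temporary folder, and `star_import` is a helper defined here only to list what `from module import *` brings in.

```python
import importlib
import sys
import tempfile
from pathlib import Path

module_dir = Path(tempfile.mkdtemp())

# Hypothetical module that declares its public API explicitly
(module_dir / "with_all.py").write_text(
    "__all__ = ['mean']\n"
    "def mean(data):\n"
    "    return sum(data) / len(data)\n"
    "def maximum(data):\n"
    "    return max(data)\n"
    "def _internal_helper():\n"
    "    pass\n"
)

# Hypothetical module without __all__
(module_dir / "without_all.py").write_text(
    "def mean(data):\n"
    "    return sum(data) / len(data)\n"
    "def maximum(data):\n"
    "    return max(data)\n"
    "def _internal_helper():\n"
    "    pass\n"
)

sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()


def star_import(module_name):
    """Return the names that `from module_name import *` brings in."""
    namespace = {}
    exec(f"from {module_name} import *", namespace)
    return sorted(name for name in namespace if name != "__builtins__")


print(star_import("with_all"))     # ['mean']
print(star_import("without_all"))  # ['maximum', 'mean']

# Underscore names are only hidden from *, not truly private
import with_all
print(callable(with_all._internal_helper))  # True
```

Without `__all__`, a star import skips names that start with an underscore; with `__all__`, only the listed names are imported, even if other public functions exist.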

## Practical Example: Data Validation Module

Let's design a complete module for data validation:

```python
# data_validators.py
"""
Data validation utilities for data analysis projects.

This module provides functions to validate data quality,
check data types, and identify potential issues.
"""

__all__ = ['check_missing', 'check_duplicates', 'check_outliers', 
           'validate_numeric_range']

# Module constants
DEFAULT_OUTLIER_THRESHOLD = 3  # Standard deviations
_VERSION = "1.0.0"

def check_missing(data):
    """
    Check for missing values in dataset.
    
    Parameters
    ----------
    data : list
        Dataset to check
    
    Returns
    -------
    dict
        Dictionary with 'count' and 'percentage' of missing values
    """
    missing_count = sum(1 for x in data if x is None)
    percentage = (missing_count / len(data)) * 100 if data else 0
    
    return {
        'count': missing_count,
        'percentage': percentage,
        'has_missing': missing_count > 0
    }

def check_duplicates(data):
    """
    Check for duplicate values in dataset.
    
    Parameters
    ----------
    data : list
        Dataset to check
    
    Returns
    -------
    dict
        Dictionary with duplicate count and unique values count
    """
    unique_count = len(set(data))
    duplicate_count = len(data) - unique_count
    
    return {
        'total': len(data),
        'unique': unique_count,
        'duplicates': duplicate_count,
        'has_duplicates': duplicate_count > 0
    }

def check_outliers(data, threshold=None):
    """
    Identify outliers using z-score method.
    
    Parameters
    ----------
    data : list
        Numerical dataset
    threshold : float, optional
        Z-score threshold for outliers (default: 3)
    
    Returns
    -------
    dict
        Dictionary with outlier information
    """
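    if threshold is None:
        threshold = DEFAULT_OUTLIER_THRESHOLD
    if not data:
        # Nothing to check; also avoids dividing by zero below
        return {'count': 0, 'outliers': [], 'threshold': threshold}
    
    mean_val = sum(data) / len(data)
    # Population variance (divides by n, not n - 1)
    variance = sum((x - mean_val) ** 2 for x in data) / len(data)
    std_dev = variance ** 0.5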
    
    outliers = []
    for i, value in enumerate(data):
        z_score = abs((value - mean_val) / std_dev) if std_dev > 0 else 0
        if z_score > threshold:
            outliers.append({'index': i, 'value': value, 'z_score': z_score})
    
    return {
        'count': len(outliers),
        'outliers': outliers,
        'threshold': threshold
    }

def validate_numeric_range(data, min_val=None, max_val=None):
    """
    Validate that all values are within specified range.
    
    Parameters
    ----------
    data : list
        Numerical dataset
    min_val : float, optional
        Minimum allowed value
    max_val : float, optional
        Maximum allowed value
    
    Returns
    -------
    dict
        Validation results
    """
    out_of_range = []
    
    for i, value in enumerate(data):
        if min_val is not None and value < min_val:
            out_of_range.append({'index': i, 'value': value, 'reason': 'below_min'})
        elif max_val is not None and value > max_val:
            out_of_range.append({'index': i, 'value': value, 'reason': 'above_max'})
    
    return {
        'valid': len(out_of_range) == 0,
        'out_of_range_count': len(out_of_range),
        'violations': out_of_range
    }

def _get_version():
    """Internal function to get module version."""
    return _VERSION

# Demo run: executes only when the file is run directly, not on import
if __name__ == "__main__":
    print("Data Validators Module - Test Suite")
    print("=" * 40)
    
    # Test data
    test_data = [10, 20, None, 30, 20, 100, 25, 30]
    
    # Run tests
    print("\nTest Data:", test_data)
    
    print("\nMissing Values Check:")
    print(check_missing(test_data))
    
    print("\nDuplicates Check:")
    print(check_duplicates([x for x in test_data if x is not None]))
    
    print("\nOutliers Check:")
    numeric_data = [x for x in test_data if x is not None]
    print(check_outliers(numeric_data))
    
    print("\nRange Validation (10-50):")
    print(validate_numeric_range(numeric_data, min_val=10, max_val=50))
    
    print("\nAll tests completed!")
```

## Simulating Module Usage

Let's simulate the functions from the module above in this notebook:

In [None]:
# Simulating the data_validators module

def check_missing(data):
    """Check for missing values in dataset."""
    missing_count = sum(1 for x in data if x is None)
    percentage = (missing_count / len(data)) * 100 if data else 0
    return {
        'count': missing_count,
        'percentage': percentage,
        'has_missing': missing_count > 0
    }

def check_duplicates(data):
    """Check for duplicate values in dataset."""
    unique_count = len(set(data))
    duplicate_count = len(data) - unique_count
    return {
        'total': len(data),
        'unique': unique_count,
        'duplicates': duplicate_count,
        'has_duplicates': duplicate_count > 0
    }

# Test the functions
test_data = [10, 20, None, 30, 20, 100, 25, 30, None]

print("Test Data:", test_data)
print("\nMissing Values:")
print(check_missing(test_data))

numeric_data = [x for x in test_data if x is not None]
print("\nDuplicates:")
print(check_duplicates(numeric_data))

## Module Organization Tips

### File Structure Example

For a data analysis project:

```
my_project/
├── main.py                    # Main script
├── data/                      # Data files
│   ├── raw/
│   └── processed/
├── modules/                   # Custom modules
│   ├── __init__.py           # Makes it a package
│   ├── data_loading.py       # Data loading utilities
│   ├── data_cleaning.py      # Cleaning functions
│   ├── statistics.py         # Statistical calculations
│   └── visualization.py      # Plotting functions
├── tests/                     # Test files
│   ├── test_data_loading.py
│   └── test_statistics.py
├── requirements.txt           # Dependencies
└── README.md                  # Documentation
```
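With this layout, `main.py` would import from the `modules` package using dotted names. The sketch below rebuilds a tiny part of the structure in a temporary folder; the `drop_missing` function is a hypothetical placeholder, since the contents of `data_cleaning.py` are not specified here.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Recreate a small slice of the project layout in a throwaway folder
project_dir = Path(tempfile.mkdtemp())
package_dir = project_dir / "modules"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")  # marks the folder as a package
(package_dir / "data_cleaning.py").write_text(
    "def drop_missing(data):\n"  # hypothetical cleaning function
    "    return [x for x in data if x is not None]\n"
)

# main.py lives in project_dir, so that folder must be on the search path
sys.path.insert(0, str(project_dir))
importlib.invalidate_caches()

from modules import data_cleaning
from modules.data_cleaning import drop_missing

cleaned = drop_missing([10, None, 20])
print(data_cleaning.__name__)  # modules.data_cleaning
print(cleaned)                 # [10, 20]
```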

### Naming Conventions

| Type | Convention | Example |
|------|------------|----------|
| Module names | lowercase_with_underscores | `data_utils.py` |
| Function names | lowercase_with_underscores | `calculate_mean()` |
| Class names | CapitalizedWords | `DataProcessor` |
| Constants | UPPERCASE_WITH_UNDERSCORES | `MAX_ITERATIONS` |
| Private | _leading_underscore | `_internal_helper()` |

## Exercises

### Exercise 1: Create a Temperature Conversion Module

Create a file called `temperature.py` with:

1. Functions to convert between Celsius, Fahrenheit, and Kelvin
2. A module docstring explaining its purpose
3. Docstrings for all functions
4. Module constants for absolute zero in each scale
5. A `__main__` block with tests
6. An `__all__` list defining the public API

**Hint**: Implement at least these functions:
- `celsius_to_fahrenheit(c)`
- `fahrenheit_to_celsius(f)`
- `celsius_to_kelvin(c)`
- `kelvin_to_celsius(k)`

### Exercise 2: Data Transformation Module

Create `data_transform.py` with functions to:

1. `normalize(data, method='minmax')` - Normalize data using min-max or z-score
2. `remove_outliers(data, method='iqr')` - Remove outliers using IQR or z-score
3. `fill_missing(data, strategy='mean')` - Fill missing values with mean, median, or mode
4. `bin_data(data, n_bins=5)` - Discretize continuous data into bins

Include proper documentation and a test suite in the `__main__` block.

### Exercise 3: Understanding `__name__`

Create two files:

**File 1: `greeter.py`**
```python
def greet(name):
    return f"Hello, {name}!"

print(f"greeter.py __name__ = {__name__}")

if __name__ == "__main__":
    print("Running greeter.py directly")
    print(greet("World"))
```

**File 2: `main.py`**
```python
import greeter

print(f"main.py __name__ = {__name__}")
print(greeter.greet("Python"))
```

Run both files and explain the output:
1. What happens when you run `python greeter.py`?
2. What happens when you run `python main.py`?
3. Why is the output different?

### Exercise 4: Module Variables and State

Create `statistics_tracker.py` that maintains state across calls:

```python
# statistics_tracker.py
# Module-level variables to track statistics
_data_points = []
_calculation_count = 0

def add_data_point(value):
    """Add a data point to the tracker."""
    # Implement this
    pass

def get_mean():
    """Calculate mean of all data points."""
    # Implement this
    # Increment _calculation_count
    pass

def get_stats():
    """Get all statistics including calculation count."""
    # Implement this
    pass

def reset():
    """Reset all tracked data."""
    # Implement this
    pass
```

Test it by:
1. Adding multiple data points
2. Calculating statistics multiple times
3. Checking that the calculation count increases
4. Resetting and verifying state is cleared

## Key Takeaways

✅ **Creating Modules:**
- Any `.py` file is a module
- Module name = filename without `.py`
- Code executes on first import

✅ **`__name__` and `__main__`:**
- `__name__ == "__main__"` when file is run directly
- `__name__ == "module_name"` when imported
- Use for testing code in modules

✅ **Best Practices:**
- Single, clear purpose per module
- Comprehensive documentation
- Define public API with `__all__`
- Use `_prefix` for private functions
- Follow PEP 8 naming conventions

✅ **Module Compilation:**
- Python creates `.pyc` files in `__pycache__`
- Speeds up subsequent imports
- Don't commit to version control

## What's Next?

In the next notebook, we'll explore **Packages**: organizing multiple related modules into a hierarchical structure, understanding `__init__.py`, and building a complete package for data analysis.