<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:10px solid #70d498ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
Tiny Helper Scripts (setup-only, optional)
</summary>


**scripts/verify_data.py** (shape + target + dtype spot-check)

```python
from pathlib import Path
import pandas as pd, yaml

csv = Path("data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")
schema = yaml.safe_load(Path("configs/schema.yaml").read_text())

df = pd.read_csv(csv)
expected = schema["expected_dtypes"]

missing = [c for c in expected if c not in df.columns]
extra = [c for c in df.columns if c not in expected]

print("Missing columns:", missing)
print("Extra columns:", extra)
print("Shape:", df.shape)
print("Target present:", "Churn" in df.columns)
```

Run:

```bash
python -m scripts.check_paths
python -m scripts.verify_data
```
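The description above mentions a dtype spot-check, while the script only compares column names. One way to sketch that comparison, assuming `expected_dtypes` maps column names to pandas dtype strings (the layout of `configs/schema.yaml` is not shown here, so both that mapping and the helper name `diff_dtypes` are assumptions):

```python
from typing import Dict, Tuple


def diff_dtypes(actual: Dict[str, str], expected: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {column: (actual_dtype, expected_dtype)} for columns whose dtypes differ.

    Columns absent from `actual` are skipped; the script above reports them separately.
    """
    return {
        col: (actual[col], want)
        for col, want in expected.items()
        if col in actual and actual[col] != want
    }


# With a DataFrame, `actual` would be {c: str(t) for c, t in df.dtypes.items()}.
# The values below are made-up illustration data.
actual = {"tenure": "int64", "TotalCharges": "object", "Churn": "object"}
expected = {"tenure": "int64", "TotalCharges": "float64", "Churn": "object"}
print(diff_dtypes(actual, expected))
```

Keeping the comparison as a pure function over two dictionaries makes it easy to test without reading the CSV.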

## 14) Optional tiny scripts (handy helpers)

`scripts/check_paths.py`

```python
from pathlib import Path
paths = ["data/raw","data/processed","outputs/figures","outputs/reports","models"]
for p in paths:
    print(Path(p).resolve(), "✓" if Path(p).exists() else "✗")
```

`scripts/quick_profile.py`

```python
import pandas as pd
from pathlib import Path
csv = Path("data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df = pd.read_csv(csv)
print(df.shape)
print(df.isnull().sum().sort_values(ascending=False).head(10))
```

Run:

```bash
python -m scripts.check_paths
python -m scripts.quick_profile
```

# Complete Guide: Setting Up Clean, Reusable Python Code for Data Science Projects
## From Messy Notebooks to Production-Ready Code

---

## 🎯 Quick Start Checklist

Before diving into details, here's what you'll set up:
- [ ] Project structure with clear separation of concerns
- [ ] Virtual environment for dependency isolation
- [ ] Version control with Git
- [ ] Configuration management system
- [ ] Logging framework
- [ ] Testing infrastructure
- [ ] Documentation standards
- [ ] Code formatting and linting tools
- [ ] Reproducibility measures

---

## 📂 1. Project Structure

### Recommended Directory Layout

```
your-data-science-project/
│
├── data/                       # Data storage (gitignored)
│   ├── raw/                   # Original, immutable data
│   ├── interim/               # Intermediate transformations
│   ├── processed/             # Final, analysis-ready data
│   └── external/              # External data sources
│
├── src/                       # Source code for the project
│   ├── __init__.py
│   ├── data/                 # Data loading and processing
│   │   ├── __init__.py
│   │   ├── load_data.py
│   │   ├── clean_data.py
│   │   └── validate_data.py
│   │
│   ├── features/             # Feature engineering
│   │   ├── __init__.py
│   │   ├── build_features.py
│   │   └── feature_selection.py
│   │
│   ├── models/               # Model training and prediction
│   │   ├── __init__.py
│   │   ├── train_model.py
│   │   ├── predict_model.py
│   │   └── evaluate_model.py
│   │
│   ├── visualization/        # Visualization functions
│   │   ├── __init__.py
│   │   └── visualize.py
│   │
│   └── utils/                # Utility functions
│       ├── __init__.py
│       ├── config.py
│       ├── logger.py
│       └── helpers.py
│
├── notebooks/                 # Jupyter notebooks
│   ├── 01_data_exploration.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_experiments.ipynb
│
├── tests/                     # Test files
│   ├── __init__.py
│   ├── test_data.py
│   ├── test_features.py
│   └── test_models.py
│
├── configs/                   # Configuration files
│   ├── config.yaml           # Main configuration
│   ├── logging_config.yaml   # Logging configuration
│   └── model_params.yaml     # Model parameters
│
├── outputs/                   # Generated outputs (gitignored)
│   ├── figures/              # Generated graphics
│   ├── models/               # Trained model files
│   └── reports/              # Generated reports
│
├── docs/                      # Documentation
│   ├── data_dictionary.md    # Data documentation
│   ├── model_card.md         # Model documentation
│   └── api_reference.md      # Code documentation
│
├── scripts/                   # Standalone scripts
│   ├── download_data.py
│   ├── train_pipeline.py
│   └── generate_report.py
│
├── .env.example              # Example environment variables
├── .gitignore                # Git ignore file
├── requirements.txt          # Project dependencies
├── requirements-dev.txt      # Development dependencies
├── setup.py                  # Package setup file
├── README.md                 # Project documentation
├── Makefile                  # Automation commands
└── pyproject.toml            # Modern Python project config
```

### Creating the Structure

```bash
#!/bin/bash
# Create the project structure. Save as: create_project_structure.sh
# (the shebang must be the first line of the file to take effect)
set -e

PROJECT_NAME="your-data-science-project"

# Create main directories (continuation lines must not be indented,
# otherwise the brace expansion breaks)
mkdir -p "$PROJECT_NAME"/{data/{raw,interim,processed,external},\
src/{data,features,models,visualization,utils},\
notebooks,tests,configs,outputs/{figures,models,reports},\
docs,scripts}

# Create __init__.py files in src/ and every package below it
find "$PROJECT_NAME/src" -type d -exec touch {}/__init__.py \;
touch "$PROJECT_NAME/tests/__init__.py"

# Create essential files
touch "$PROJECT_NAME"/{README.md,.gitignore,requirements.txt,\
requirements-dev.txt,setup.py,pyproject.toml,Makefile,.env.example}

echo "Project structure created for $PROJECT_NAME"
```
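The shell script relies on Bash brace expansion, which is not available in every shell or on Windows. A portable sketch of the same scaffold using only `pathlib` might look like this; the function name `create_project` is an assumption, and the directory and file lists are copied from the layout above:

```python
from pathlib import Path

# Directory names mirror the recommended layout.
DIRECTORIES = [
    "data/raw", "data/interim", "data/processed", "data/external",
    "src/data", "src/features", "src/models", "src/visualization", "src/utils",
    "notebooks", "tests", "configs",
    "outputs/figures", "outputs/models", "outputs/reports",
    "docs", "scripts",
]
FILES = [
    "README.md", ".gitignore", "requirements.txt", "requirements-dev.txt",
    "setup.py", "Makefile", ".env.example", "pyproject.toml",
]


def create_project(root: Path) -> None:
    """Create the directory tree, package markers, and empty top-level files."""
    for rel in DIRECTORIES:
        (root / rel).mkdir(parents=True, exist_ok=True)
    # src/, each directory directly below it, and tests/ become packages.
    packages = [root / "src", root / "tests"]
    packages += [p for p in (root / "src").iterdir() if p.is_dir()]
    for pkg in packages:
        (pkg / "__init__.py").touch()
    for name in FILES:
        (root / name).touch()
```

Calling `create_project(Path("your-data-science-project"))` is safe to repeat, because existing directories and files are left in place.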

---

## 🔧 2. Environment Setup

### Step 1: Create Virtual Environment

```bash
# Using venv (built-in)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Or using conda
conda create -n myproject python=3.9
conda activate myproject

# Or using poetry (modern approach)
pip install poetry
poetry new myproject
poetry install
```
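After activating an environment, it can help to confirm that Python is really running inside it before installing anything. A minimal check, assuming a `venv` or virtualenv environment (the helper name `in_virtualenv` is illustrative):

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # venv points sys.prefix at the environment, while sys.base_prefix
    # keeps pointing at the base installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

This comparison targets `venv`-style environments. Conda environments generally need a different check, for example inspecting the `CONDA_DEFAULT_ENV` environment variable.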

### Step 2: Requirements Management

**requirements.txt** - Core dependencies
```txt
# Data manipulation
pandas==2.0.3
numpy==1.24.3

# Machine learning
scikit-learn==1.3.0
xgboost==1.7.6

# Visualization
matplotlib==3.7.2
seaborn==0.12.2
plotly==5.15.0

# Configuration
pyyaml==6.0
python-dotenv==1.0.0

# Data validation
great-expectations==0.17.12
pandera==0.16.1

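# Utilities
tqdm==4.65.0
joblib==1.3.1

# Also imported by the code in this guide (experiment tracking, Parquet I/O);
# pin these to the versions you test with
mlflow
pyarrow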
```

**requirements-dev.txt** - Development dependencies
```txt
# Testing
pytest==7.4.0
pytest-cov==4.1.0
pytest-mock==3.11.1

# Code quality
black==23.7.0
flake8==6.0.0
pylint==2.17.4
mypy==1.4.1
isort==5.12.0

# Pre-commit hooks
pre-commit==3.3.3

# Documentation
sphinx==7.1.1
sphinx-rtd-theme==1.3.0

# Notebooks
jupyter==1.0.0
nbqa==1.7.0
nbstripout==0.6.1
```
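Pinned versions only help reproducibility if the active environment actually matches them. The following sketch compares `name==version` pins with what is installed, using the standard library's `importlib.metadata`; the helper names are assumptions, and the parser deliberately ignores anything other than exact `==` pins:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Tuple


def parse_pins(lines: List[str]) -> Dict[str, str]:
    """Extract {package: version} from 'name==version' lines, skipping comments."""
    pins = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if "==" in line:
            name, pinned = line.split("==", 1)
            pins[name.strip()] = pinned.strip()
    return pins


def find_mismatches(pins: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (package, pinned, installed) for every pin the environment does not satisfy."""
    mismatches = []
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = "not installed"
        if installed != pinned:
            mismatches.append((name, pinned, installed))
    return mismatches


sample = ["# Data manipulation", "pandas==2.0.3", "numpy==1.24.3", ""]
print(parse_pins(sample))
```

In a project, `parse_pins(Path("requirements.txt").read_text().splitlines())` would feed `find_mismatches`, and a non-empty result signals that the environment has drifted from the pins.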

### Step 3: Setup Configuration

**setup.py** - Make your project installable
```python
from setuptools import setup, find_packages

with open("README.md", "r", encoding="utf-8") as fh:
    long_description = fh.read()

with open("requirements.txt", "r", encoding="utf-8") as fh:
    requirements = [line.strip() for line in fh if line.strip() and not line.startswith("#")]

setup(
    name="your-data-science-project",
    version="0.1.0",
    author="Your Name",
    author_email="your.email@example.com",
    description="A clean data science project",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/yourusername/your-project",
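    # The code in this guide imports `src.<module>`, so install `src` itself
    # as the package instead of mapping its subdirectories to top-level names
    packages=find_packages(include=["src", "src.*"]),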
    classifiers=[
        "Development Status :: 3 - Alpha",
        "Intended Audience :: Developers",
        "Topic :: Scientific/Engineering :: Artificial Intelligence",
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.8",
        "Programming Language :: Python :: 3.9",
        "Programming Language :: Python :: 3.10",
    ],
    python_requires=">=3.8",
    install_requires=requirements,
    extras_require={
        "dev": ["pytest>=7.0", "black>=23.0", "flake8>=6.0"],
    },
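    entry_points={
        "console_scripts": [
            # Resolves only if `scripts` is an installed package (it needs an
            # __init__.py and an entry in `packages`) and train_pipeline.py
            # defines main()
            "train-model=scripts.train_pipeline:main",
        ],
    },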
)

# Install in development mode
# pip install -e .
```

---

## 📝 3. Code Organization Principles

### 3.1 Data Module Structure

**src/data/load_data.py**
```python
"""Data loading module with validation and caching."""

import logging
from pathlib import Path
from typing import Optional, Dict, Any
import pandas as pd
import yaml
from functools import lru_cache

logger = logging.getLogger(__name__)

class DataLoader:
    """Handle data loading with validation and caching."""
    
    def __init__(self, config_path: str = "configs/config.yaml"):
        """Initialize with configuration."""
        self.config = self._load_config(config_path)
        self.data_dir = Path(self.config['data']['base_dir'])
    
    @staticmethod
    def _load_config(config_path: str) -> Dict[str, Any]:
        """Load configuration from YAML file."""
        with open(config_path, 'r') as f:
            return yaml.safe_load(f)
    
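    # Caveat: lru_cache on a method keys the cache on `self` and keeps the
    # instance alive. Every caller receives the same cached DataFrame, so
    # copy it before mutating.
    @lru_cache(maxsize=1)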
    def load_raw_data(self, 
                     filename: str,
                     validate: bool = True) -> pd.DataFrame:
        """
        Load raw data with caching and validation.
        
        Args:
            filename: Name of the file to load
            validate: Whether to validate data after loading
            
        Returns:
            Loaded DataFrame
            
        Raises:
            FileNotFoundError: If file doesn't exist
            ValueError: If validation fails
        """
        filepath = self.data_dir / 'raw' / filename
        
        if not filepath.exists():
            raise FileNotFoundError(f"Data file not found: {filepath}")
        
        logger.info(f"Loading data from {filepath}")
        
        # Detect file type and load accordingly
        if filepath.suffix == '.csv':
            df = pd.read_csv(filepath, **self.config['data'].get('csv_params', {}))
        elif filepath.suffix == '.parquet':
            df = pd.read_parquet(filepath)
        elif filepath.suffix in ['.xlsx', '.xls']:
            df = pd.read_excel(filepath)
        else:
            raise ValueError(f"Unsupported file type: {filepath.suffix}")
        
        logger.info(f"Loaded {len(df)} rows and {len(df.columns)} columns")
        
        if validate:
            self._validate_data(df)
        
        return df
    
    def _validate_data(self, df: pd.DataFrame) -> None:
        """Validate loaded data against schema."""
        required_columns = self.config['data'].get('required_columns', [])
        
        missing = set(required_columns) - set(df.columns)
        if missing:
            raise ValueError(f"Missing required columns: {missing}")
        
        # Additional validation
        if df.empty:
            raise ValueError("DataFrame is empty")
        
        if df.duplicated().any():
            logger.warning(f"Found {df.duplicated().sum()} duplicate rows")
        
        logger.info("Data validation passed")
    
    def save_processed_data(self, 
                           df: pd.DataFrame, 
                           filename: str,
                           compress: bool = True) -> None:
        """Save processed data efficiently."""
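        filepath = self.data_dir / 'processed' / filename
        filepath.parent.mkdir(parents=True, exist_ok=True)
        
        # Write Parquet when compression is requested or the name already
        # ends in .parquet; otherwise fall back to CSV
        if compress or filename.endswith('.parquet'):
            filepath = filepath.with_suffix('.parquet')
            df.to_parquet(filepath, compression='snappy')
        else:
            df.to_csv(filepath, index=False)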
        
        logger.info(f"Saved processed data to {filepath}")
```
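Caching a method with `lru_cache` ties the cache to the instance and hands every caller the same mutable object. An alternative sketch keeps a per-instance dictionary and returns a copy on each call. The class `CachedLoader` and its `read` callable are hypothetical stand-ins, not part of `DataLoader`; with a DataFrame, `df.copy()` would take the place of `copy.deepcopy`:

```python
import copy
from typing import Any, Callable, Dict


class CachedLoader:
    """Cache load results per instance and return a copy on every call."""

    def __init__(self, read: Callable[[str], Any]):
        self._read = read                # hypothetical reader, e.g. pd.read_csv
        self._cache: Dict[str, Any] = {}
        self.reads = 0                   # counts real reads, for illustration

    def load(self, filename: str) -> Any:
        if filename not in self._cache:
            self._cache[filename] = self._read(filename)
            self.reads += 1
        # Copy so that callers cannot modify the cached object.
        return copy.deepcopy(self._cache[filename])


loader = CachedLoader(read=lambda name: {"file": name, "rows": [1, 2, 3]})
first = loader.load("a.csv")
first["rows"].append(4)          # mutating the result...
second = loader.load("a.csv")    # ...does not affect later loads
print(second["rows"], loader.reads)
```

The copy costs memory on each call, so this trade-off suits small and medium datasets better than very large ones.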

### 3.2 Feature Engineering Module

**src/features/build_features.py**
```python
"""Feature engineering with pipeline approach."""

from typing import List, Optional, Tuple
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
import logging

logger = logging.getLogger(__name__)

class FeatureEngineer:
    """Centralized feature engineering."""
    
    def __init__(self, config: dict):
        """Initialize with configuration."""
        self.config = config
        self.feature_pipeline = None
        self.feature_names = []
    
    def create_features(self, 
                       df: pd.DataFrame,
                       target_col: Optional[str] = None) -> pd.DataFrame:
        """
        Create all features for the dataset.
        
        Args:
            df: Input DataFrame
            target_col: Target column to exclude from features
            
        Returns:
            DataFrame with engineered features
        """
        df_features = df.copy()
        
        # Temporal features
        if self.config.get('create_temporal_features', True):
            df_features = self._create_temporal_features(df_features)
        
        # Aggregation features
        if self.config.get('create_aggregation_features', True):
            df_features = self._create_aggregation_features(df_features)
        
        # Interaction features
        if self.config.get('create_interaction_features', True):
            df_features = self._create_interaction_features(df_features)
        
        # Log transform skewed features
        if self.config.get('log_transform_skewed', True):
            df_features = self._log_transform_skewed_features(df_features)
        
        # Record feature names
        self.feature_names = [col for col in df_features.columns 
                             if col != target_col]
        
        logger.info(f"Created {len(self.feature_names)} features")
        
        return df_features
    
    @staticmethod
    def _create_temporal_features(df: pd.DataFrame) -> pd.DataFrame:
        """Create time-based features."""
        date_columns = df.select_dtypes(include=['datetime64']).columns
        
        for col in date_columns:
            df[f'{col}_year'] = df[col].dt.year
            df[f'{col}_month'] = df[col].dt.month
            df[f'{col}_day'] = df[col].dt.day
            df[f'{col}_dayofweek'] = df[col].dt.dayofweek
            df[f'{col}_quarter'] = df[col].dt.quarter
            df[f'{col}_is_weekend'] = df[col].dt.dayofweek.isin([5, 6]).astype(int)
        
        return df
    
    def _create_aggregation_features(self, 
                                    df: pd.DataFrame) -> pd.DataFrame:
        """Create aggregation-based features."""
        numerical_cols = df.select_dtypes(include=[np.number]).columns
        
        if len(numerical_cols) > 1:
            # Statistical aggregations
            df['numerical_mean'] = df[numerical_cols].mean(axis=1)
            df['numerical_std'] = df[numerical_cols].std(axis=1)
            df['numerical_max'] = df[numerical_cols].max(axis=1)
            df['numerical_min'] = df[numerical_cols].min(axis=1)
        
        return df
    
    @staticmethod
    def _create_interaction_features(df: pd.DataFrame) -> pd.DataFrame:
        """Create interaction features between columns."""
        # Example: Create ratios for numerical columns
        numerical_cols = df.select_dtypes(include=[np.number]).columns
        
        for i, col1 in enumerate(numerical_cols):
            for col2 in numerical_cols[i+1:]:
                # Skip all-zero denominators; the 1e-8 offset prevents division
                # by zero but yields very large ratios wherever col2 == 0
                if (df[col2] != 0).any():
                    df[f'{col1}_div_{col2}'] = df[col1] / (df[col2] + 1e-8)
                    df[f'{col1}_mult_{col2}'] = df[col1] * df[col2]
        
        return df
    
    @staticmethod
    def _log_transform_skewed_features(df: pd.DataFrame, 
                                      threshold: float = 0.75) -> pd.DataFrame:
        """Apply log transformation to skewed features."""
        numerical_cols = df.select_dtypes(include=[np.number]).columns
        
        for col in numerical_cols:
            skewness = df[col].skew()
            if abs(skewness) > threshold:
                if (df[col] > 0).all():  # Only if all values are positive
                    df[f'{col}_log'] = np.log1p(df[col])
                    logger.debug(f"Log transformed {col} (skewness: {skewness:.2f})")
        
        return df

class CustomTransformer(BaseEstimator, TransformerMixin):
    """Custom sklearn transformer for pipeline integration."""
    
    def __init__(self, function):
        self.function = function
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return self.function(X)
```
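The ratio features in `_create_interaction_features` add `1e-8` to the denominator, which turns a zero denominator into a ratio on the order of 1e8 times the numerator. A possible alternative, shown here as a sketch rather than a drop-in replacement, marks those rows as missing so that a later imputation step can handle them. The helper name `safe_ratio` and the sample values are illustrative:

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, yielding NaN wherever the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)


df = pd.DataFrame({"TotalCharges": [100.0, 50.0, 0.0], "tenure": [10, 0, 5]})
df["charges_per_month"] = safe_ratio(df["TotalCharges"], df["tenure"])
print(df["charges_per_month"].tolist())
```

Whether NaN or a clipped value is preferable depends on the model: tree-based libraries such as XGBoost accept missing values directly, while linear models need imputation first.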

### 3.3 Model Module

**src/models/train_model.py**
```python
"""Model training with experiment tracking."""

import logging
from typing import Dict, Any, Tuple, Optional
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix
import joblib
import mlflow
import mlflow.sklearn
from datetime import datetime
import json

logger = logging.getLogger(__name__)

class ModelTrainer:
    """Handle model training with tracking and versioning."""
    
    def __init__(self, 
                 model_config: Dict[str, Any],
                 experiment_name: str = "default_experiment"):
        """Initialize trainer with configuration."""
        self.config = model_config
        self.experiment_name = experiment_name
        self.model = None
        self.metrics = {}
        
        # Setup MLflow
        mlflow.set_experiment(experiment_name)
    
    def train(self,
             X_train: pd.DataFrame,
             y_train: pd.Series,
             X_val: Optional[pd.DataFrame] = None,
             y_val: Optional[pd.Series] = None) -> Any:
        """
        Train model with experiment tracking.
        
        Args:
            X_train: Training features
            y_train: Training target
            X_val: Validation features (optional)
            y_val: Validation target (optional)
            
        Returns:
            Trained model
        """
        with mlflow.start_run():
            # Log parameters
            mlflow.log_params(self.config['model_params'])
            
            # Initialize model
            model_class = self._get_model_class()
            self.model = model_class(**self.config['model_params'])
            
            # Train model
            logger.info(f"Training {self.config['model_type']} model")
            self.model.fit(X_train, y_train)
            
            # Evaluate model
            if X_val is not None and y_val is not None:
                self.metrics = self._evaluate_model(X_val, y_val)
                
                # Log metrics
                for metric_name, metric_value in self.metrics.items():
                    mlflow.log_metric(metric_name, metric_value)
            
            # Cross-validation
            if self.config.get('cross_validate', True):
                cv_scores = self._cross_validate(X_train, y_train)
                mlflow.log_metric('cv_mean_score', cv_scores.mean())
                mlflow.log_metric('cv_std_score', cv_scores.std())
            
            # Log model
            mlflow.sklearn.log_model(
                self.model,
                "model",
                registered_model_name=f"{self.experiment_name}_model"
            )
            
            # Save model locally
            self._save_model()
            
            logger.info(f"Training completed. Metrics: {self.metrics}")
            
        return self.model
    
    def _get_model_class(self):
        """Get model class from configuration."""
        from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
        from sklearn.linear_model import LogisticRegression
        from xgboost import XGBClassifier
        
        model_classes = {
            'random_forest': RandomForestClassifier,
            'gradient_boosting': GradientBoostingClassifier,
            'logistic_regression': LogisticRegression,
            'xgboost': XGBClassifier
        }
        
        model_type = self.config['model_type']
        if model_type not in model_classes:
            raise ValueError(f"Unknown model type: {model_type}")
        
        return model_classes[model_type]
    
    def _evaluate_model(self, 
                       X_val: pd.DataFrame, 
                       y_val: pd.Series) -> Dict[str, float]:
        """Evaluate model performance."""
        from sklearn.metrics import (
            accuracy_score, precision_score, recall_score,
            f1_score, roc_auc_score
        )
        
        y_pred = self.model.predict(X_val)
        y_proba = self.model.predict_proba(X_val)[:, 1] if hasattr(self.model, 'predict_proba') else None
        
        metrics = {
            'accuracy': accuracy_score(y_val, y_pred),
            'precision': precision_score(y_val, y_pred, average='weighted'),
            'recall': recall_score(y_val, y_pred, average='weighted'),
            'f1': f1_score(y_val, y_pred, average='weighted')
        }
        
        if y_proba is not None and len(np.unique(y_val)) == 2:
            metrics['roc_auc'] = roc_auc_score(y_val, y_proba)
        
        # Log confusion matrix
        cm = confusion_matrix(y_val, y_pred)
        logger.info(f"Confusion Matrix:\n{cm}")
        
        return metrics
    
    def _cross_validate(self, 
                       X: pd.DataFrame, 
                       y: pd.Series,
                       cv: int = 5) -> np.ndarray:
        """Perform cross-validation."""
        skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
        scores = cross_val_score(
            self.model, X, y, 
            cv=skf, 
            scoring=self.config.get('scoring', 'accuracy')
        )
        
        logger.info(f"Cross-validation scores: {scores}")
        logger.info(f"Mean CV score: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
        
        return scores
    
    def _save_model(self) -> None:
        """Save model and metadata."""
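        from pathlib import Path
        
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        # joblib.dump does not create missing directories
        Path("outputs/models").mkdir(parents=True, exist_ok=True)
        model_path = f"outputs/models/model_{timestamp}.pkl"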
        
        # Save model
        joblib.dump(self.model, model_path)
        logger.info(f"Model saved to {model_path}")
        
        # Save metadata
        metadata = {
            'timestamp': timestamp,
            'model_type': self.config['model_type'],
            'parameters': self.config['model_params'],
            'metrics': self.metrics,
            'feature_names': self.config.get('feature_names', [])
        }
        
        metadata_path = f"outputs/models/metadata_{timestamp}.json"
        with open(metadata_path, 'w') as f:
            json.dump(metadata, f, indent=2, default=str)
        
        logger.info(f"Metadata saved to {metadata_path}")
```
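`_save_model` writes a `metadata_<timestamp>.json` file next to each model, so the most recent run can be located by file name alone. A small sketch of that lookup follows; the function name `load_latest_metadata` is an assumption, and it relies on the `%Y%m%d_%H%M%S` timestamp format used above:

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def load_latest_metadata(model_dir: Path) -> Optional[Dict[str, Any]]:
    """Return the newest metadata_<timestamp>.json in model_dir, or None if there is none."""
    # The %Y%m%d_%H%M%S timestamp sorts chronologically as plain text.
    candidates = sorted(model_dir.glob("metadata_*.json"))
    if not candidates:
        return None
    with open(candidates[-1], "r", encoding="utf-8") as f:
        return json.load(f)
```

For example, `load_latest_metadata(Path("outputs/models"))` returns the dictionary saved by the latest training run, including its `metrics` and `parameters` entries.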

---

## 🔐 4. Configuration Management

### 4.1 Main Configuration File

**configs/config.yaml**
```yaml
# Project configuration
project:
  name: "telco-churn-analysis"
  version: "1.0.0"
  description: "Customer churn prediction"
  author: "Your Name"

# Data configuration
data:
  base_dir: "data"
  raw_file: "telco_customer_churn.csv"
  processed_file: "telco_processed.parquet"
  
  # Column definitions
  target_column: "Churn"
  id_column: "customerID"
  
  # Required columns for validation
  required_columns:
    - customerID
    - gender
    - tenure
    - MonthlyCharges
    - TotalCharges
    - Churn
  
  # CSV reading parameters
  csv_params:
    encoding: "utf-8"
    sep: ","
    na_values: ["", " ", "NA", "N/A", "null"]

# Feature engineering configuration
features:
  create_temporal_features: true
  create_aggregation_features: true
  create_interaction_features: true
  log_transform_skewed: true
  
  # Categorical encoding
  encoding_method: "one_hot"  # options: one_hot, label, target
  
  # Feature selection
  selection_method: "mutual_info"  # options: mutual_info, chi2, anova
  n_features_to_select: 20

# Model configuration
model:
  model_type: "xgboost"  # options: random_forest, gradient_boosting, logistic_regression, xgboost
  
  # Model parameters
  model_params:
    n_estimators: 100
    max_depth: 5
    learning_rate: 0.1
    random_state: 42
  
  # Training configuration
  test_size: 0.2
  validation_size: 0.2
  cross_validate: true
  cv_folds: 5
  scoring: "roc_auc"
  
  # Hyperparameter tuning
  hyperparameter_tuning:
    enabled: true
    method: "grid_search"  # options: grid_search, random_search, bayesian
    n_iter: 50  # for random search
    param_grid:
      n_estimators: [50, 100, 200]
      max_depth: [3, 5, 7]
      learning_rate: [0.01, 0.1, 0.3]

# Paths
paths:
  logs: "logs"
  outputs: "outputs"
  models: "outputs/models"
  figures: "outputs/figures"
  reports: "outputs/reports"

# Logging configuration
logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
  file: "logs/project.log"
```

### 4.2 Environment Variables

**.env.example**
```bash
# Database connections
DB_HOST=localhost
DB_PORT=5432
DB_NAME=myproject
DB_USER=username
DB_PASSWORD=password

# API Keys
API_KEY=your-api-key-here
SECRET_KEY=your-secret-key-here

# Cloud storage
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
S3_BUCKET=your-bucket-name

# MLflow tracking
MLFLOW_TRACKING_URI=http://localhost:5000
MLFLOW_EXPERIMENT_NAME=telco_churn

# Environment: development, staging, or production
# (comment kept on its own line because not every .env parser strips inline comments)
ENVIRONMENT=development
DEBUG=True
```
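Values read from `.env` always arrive as strings, so `DEBUG=True` and `DEBUG=False` are both truthy if passed straight to `bool()`. A hedged sketch of explicit parsing follows; the helper names `env_bool` and `env_int` and the accepted spellings of "true" are assumptions:

```python
import os
from typing import Mapping


def env_bool(name: str, default: bool = False, env: Mapping[str, str] = os.environ) -> bool:
    """Interpret an environment variable as a boolean."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name: str, default: int, env: Mapping[str, str] = os.environ) -> int:
    """Interpret an environment variable as an integer."""
    raw = env.get(name)
    return default if raw is None else int(raw)


# A plain dict stands in for os.environ so the example has no side effects.
example = {"DEBUG": "True", "DB_PORT": "5432"}
print(env_bool("DEBUG", env=example), env_int("DB_PORT", 0, env=example))
```

Passing the mapping as a parameter keeps the helpers testable without modifying the real process environment.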

### 4.3 Configuration Loader

**src/utils/config.py**
```python
"""Configuration management utilities."""

import os
from pathlib import Path
from typing import Dict, Any, Optional
import yaml
from dotenv import load_dotenv
import logging

logger = logging.getLogger(__name__)

class Config:
    """Centralized configuration management."""
    
    _instance = None
    
    def __new__(cls):
        """Singleton pattern for configuration."""
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._initialized = False
        return cls._instance
    
    def __init__(self):
        """Initialize configuration."""
        if self._initialized:
            return
        
        self._initialized = True
        self.project_root = Path(__file__).parent.parent.parent
        
        # Load environment variables
        self._load_env()
        
        # Load YAML configuration
        self.config = self._load_yaml_config()
        
        # Override with environment variables
        self._override_with_env()
        
    def _load_env(self) -> None:
        """Load environment variables from .env file."""
        env_file = self.project_root / '.env'
        if env_file.exists():
            load_dotenv(env_file)
            logger.info(f"Loaded environment variables from {env_file}")
    
    def _load_yaml_config(self, 
                         config_file: str = 'configs/config.yaml') -> Dict[str, Any]:
        """Load YAML configuration file."""
        config_path = self.project_root / config_file
        
        if not config_path.exists():
            logger.warning(f"Config file not found: {config_path}")
            return {}
        
        with open(config_path, 'r') as f:
            # safe_load returns None for an empty file; fall back to a dict
            config = yaml.safe_load(f) or {}
        
        logger.info(f"Loaded configuration from {config_path}")
        return config
    
    def _override_with_env(self) -> None:
        """Override configuration with environment variables."""
        # Example: Override database configuration
        if os.getenv('DB_HOST'):
            self.config.setdefault('database', {})['host'] = os.getenv('DB_HOST')
        
        if os.getenv('ENVIRONMENT'):
            self.config['environment'] = os.getenv('ENVIRONMENT')
    
    def get(self, key: str, default: Any = None) -> Any:
        """Get configuration value by key (supports nested keys)."""
        keys = key.split('.')
        value = self.config
        
        for k in keys:
            if isinstance(value, dict):
                value = value.get(k)
            else:
                return default
            
            if value is None:
                return default
        
        return value
    
    @property
    def data_dir(self) -> Path:
        """Get data directory path."""
        return self.project_root / self.get('data.base_dir', 'data')
    
    @property
    def output_dir(self) -> Path:
        """Get output directory path."""
        return self.project_root / self.get('paths.outputs', 'outputs')
    
    def get_model_params(self) -> Dict[str, Any]:
        """Get model parameters."""
        return self.get('model.model_params', {})

# Create global config instance
config = Config()
```
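The dotted-key lookup in `Config.get` is easiest to follow in isolation. The standalone sketch below walks the same kind of nested dictionary; `get_nested` is an illustrative name, and unlike `Config.get` it returns a stored `None` as-is instead of replacing it with the default:

```python
from typing import Any, Dict


def get_nested(config: Dict[str, Any], key: str, default: Any = None) -> Any:
    """Walk a dotted key such as 'model.model_params.max_depth' through nested dicts."""
    value: Any = config
    for part in key.split("."):
        if not isinstance(value, dict) or part not in value:
            return default
        value = value[part]
    return value


config = {"model": {"model_type": "xgboost", "model_params": {"max_depth": 5}}}
print(get_nested(config, "model.model_params.max_depth"))
print(get_nested(config, "model.missing.key", default="n/a"))
```

With the `Config` class above, the equivalent call is `config.get('model.model_params.max_depth')`.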

---

## 📊 5. Logging Setup

**src/utils/logger.py**
```python
"""Logging configuration and utilities."""

import logging
import logging.config
import sys
from pathlib import Path
from typing import Optional
import yaml
from datetime import datetime

def setup_logging(
    config_path: Optional[str] = None,
    default_level: int = logging.INFO,
    log_dir: str = "logs"
) -> None:
    """
    Setup logging configuration.
    
    Args:
        config_path: Path to logging configuration file
        default_level: Default logging level
        log_dir: Directory for log files
    """
    # Create log directory (including missing parents for nested paths)
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    
    if config_path and Path(config_path).exists():
        # Load from configuration file
        with open(config_path, 'r') as f:
            config = yaml.safe_load(f)
        logging.config.dictConfig(config)
    else:
        # Default configuration
        log_file = Path(log_dir) / f"app_{datetime.now():%Y%m%d}.log"
        
        logging.basicConfig(
            level=default_level,
            format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
            handlers=[
                logging.FileHandler(log_file),
                logging.StreamHandler(sys.stdout)
            ]
        )
    
    # Reduce noise from third-party libraries
    logging.getLogger('matplotlib').setLevel(logging.WARNING)
    logging.getLogger('urllib3').setLevel(logging.WARNING)
    
    logger = logging.getLogger(__name__)
    logger.info("Logging initialized")

class LoggerMixin:
    """Mixin to add logging to any class."""
    
    @property
    def logger(self):
        """Get logger for the class."""
        name = '.'.join([
            self.__class__.__module__,
            self.__class__.__name__
        ])
        return logging.getLogger(name)

# Logging configuration file
# configs/logging_config.yaml
"""
version: 1
disable_existing_loggers: false

formatters:
  default:
    format: '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
  detailed:
    format: '%(asctime)s - %(name)s - %(levelname)s - %(funcName)s:%(lineno)d - %(message)s'

handlers:
  console:
    class: logging.StreamHandler
    level: INFO
    formatter: default
    stream: ext://sys.stdout
  
  file:
    class: logging.handlers.RotatingFileHandler
    level: DEBUG
    formatter: detailed
    filename: logs/app.log
    maxBytes: 10485760  # 10MB
    backupCount: 5
  
  error_file:
    class: logging.handlers.RotatingFileHandler
    level: ERROR
    formatter: detailed
    filename: logs/errors.log
    maxBytes: 10485760  # 10MB
    backupCount: 5

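loggers:
  src:
    level: DEBUG
    # propagate is off, so error_file is attached here; otherwise src.* errors never reach logs/errors.log
    handlers: [console, file, error_file]
    propagate: no

  src.models:
    level: INFO
    handlers: [console, file, error_file]
    propagate: no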

root:
  level: INFO
  handlers: [console, file, error_file]
"""
```
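
As a usage illustration, here is a minimal sketch of how `LoggerMixin` might be attached to a class. The mixin is repeated so the snippet runs on its own; `ChurnReport` and its `build` method are hypothetical names, not part of the project.

```python
import logging


class LoggerMixin:
    """Same mixin as above, repeated so this snippet is self-contained."""

    @property
    def logger(self):
        name = '.'.join([self.__class__.__module__, self.__class__.__name__])
        return logging.getLogger(name)


class ChurnReport(LoggerMixin):  # hypothetical example class
    def build(self, n_rows: int) -> str:
        # The logger name is "<module>.ChurnReport", so log lines can be
        # filtered per class in the logging configuration.
        self.logger.info("Building report for %d rows", n_rows)
        return f"report({n_rows})"


logging.basicConfig(level=logging.INFO)
report = ChurnReport()
summary = report.build(100)
logger_name = report.logger.name
```

Because the logger name is derived from the module and class, a `loggers:` entry such as `src.models` in the YAML configuration would apply to every class defined under that package.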

---

## 🧪 6. Testing Infrastructure

### 6.1 Test Structure

**tests/test_data.py**
```python
"""Tests for data module."""

import pytest
import pandas as pd
import numpy as np
from pathlib import Path
import tempfile
from src.data.load_data import DataLoader
from src.data.clean_data import DataCleaner

class TestDataLoader:
    """Test data loading functionality."""
    
    @pytest.fixture
    def sample_data(self):
        """Create sample data for testing."""
        return pd.DataFrame({
            'id': range(1, 101),
            'value': np.random.randn(100),
            'category': np.random.choice(['A', 'B', 'C'], 100)
        })
    
    @pytest.fixture
    def temp_csv_file(self, sample_data, tmp_path):
        """Write the sample data to a CSV file in pytest's per-test temp directory."""
        csv_path = tmp_path / 'sample.csv'
        sample_data.to_csv(csv_path, index=False)
        return csv_path
    
    def test_load_csv_file(self, temp_csv_file):
        """Test loading CSV file."""
        loader = DataLoader()
        # NOTE: only the file name is passed, so this assumes DataLoader resolves
        # names against a raw-data directory that points at the temp folder;
        # adapt this call to however your loader locates files.
        df = loader.load_raw_data(Path(temp_csv_file).name)
        
        assert df is not None
        assert len(df) == 100
        assert list(df.columns) == ['id', 'value', 'category']
    
    def test_load_nonexistent_file(self):
        """Test loading non-existent file raises error."""
        loader = DataLoader()
        
        with pytest.raises(FileNotFoundError):
            loader.load_raw_data('nonexistent.csv')
    
    def test_validate_data_with_missing_columns(self, sample_data):
        """Test validation with missing required columns."""
        loader = DataLoader()
        loader.config = {'data': {'required_columns': ['id', 'missing_column']}}
        
        with pytest.raises(ValueError, match="Missing required columns"):
            loader._validate_data(sample_data)
    
    @pytest.mark.parametrize("file_extension,reader_method", [
        ('.csv', 'read_csv'),
        ('.parquet', 'read_parquet'),
        ('.xlsx', 'read_excel')
    ])
    def test_file_type_detection(self, file_extension, reader_method, monkeypatch):
        """Test correct reader is used for different file types."""
        loader = DataLoader()
        
        # Mock the reader methods
        mock_called = {'called': False}
        
        def mock_reader(*args, **kwargs):
            mock_called['called'] = True
            return pd.DataFrame()
        
        monkeypatch.setattr(pd, reader_method, mock_reader)
        
        # This would need actual implementation in the loader
        # Just showing the test structure

class TestDataCleaner:
    """Test data cleaning functionality."""
    
    @pytest.fixture
    def dirty_data(self):
        """Create data with quality issues."""
        return pd.DataFrame({
            'id': [1, 2, 2, 3, 4],  # Rows 1 and 2 are exact duplicates
            'value': [10, 20, 20, None, 40],  # Missing value
            'text': ['  hello  ', 'WORLD', 'WORLD', None, '  ']  # Needs cleaning
        })
    
    def test_remove_duplicates(self, dirty_data):
        """Test duplicate removal."""
        cleaner = DataCleaner()
        cleaned = cleaner.remove_duplicates(dirty_data)
        
        assert len(cleaned) == 4
        assert not cleaned.duplicated().any()
    
    def test_handle_missing_values(self, dirty_data):
        """Test missing value handling."""
        cleaner = DataCleaner()
        
        # Test different strategies
        filled = cleaner.handle_missing(dirty_data, strategy='mean')
        assert filled['value'].isna().sum() == 0
        
        # Only the row with id 3 has missing values, so dropping leaves 4 of 5 rows
        dropped = cleaner.handle_missing(dirty_data, strategy='drop')
        assert len(dropped) == 4
    
    def test_clean_text_columns(self, dirty_data):
        """Test text cleaning."""
        cleaner = DataCleaner()
        cleaned = cleaner.clean_text(dirty_data, columns=['text'])
        
        assert cleaned['text'].iloc[0] == 'hello'
        assert cleaned['text'].iloc[1] == 'world'
        assert cleaned['text'].iloc[4] == ''
```
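
The parametrized `test_file_type_detection` above is only a skeleton, since the loader's dispatch logic is not shown. One way the loader could choose a reader is a suffix lookup, sketched below. `READER_BY_SUFFIX` and `reader_name_for` are hypothetical names, and the real `DataLoader` may work differently.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the name of a pandas reader function
READER_BY_SUFFIX = {
    ".csv": "read_csv",
    ".parquet": "read_parquet",
    ".xlsx": "read_excel",
}


def reader_name_for(path) -> str:
    """Return the pandas reader name for a file, based on its suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return READER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None


chosen = reader_name_for("data/raw/customers.CSV")
```

A loader built this way could call `getattr(pd, reader_name_for(path))(path)`, which is exactly the lookup that `monkeypatch.setattr(pd, reader_method, mock_reader)` intercepts. The test could then load a file with each extension and assert that `mock_called['called']` is `True`.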

### 6.2 Test Configuration

**pytest.ini**
```ini
[pytest]
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
addopts = 
    -v
    --cov=src
    --cov-report=html
    --cov-report=term-missing
    --tb=short
    --strict-markers
markers =
    slow: marks tests as slow (deselect with '-m "not slow"')
    integration: marks tests as integration tests
    unit: marks tests as unit tests
```

**conftest.py**
```python
"""Shared test fixtures and configuration."""

import pytest
import pandas as pd
import numpy as np
from pathlib import Path
import sys

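# Add the project root to the path so `from src...` imports resolve
# (assumes this file lives in tests/; unnecessary after `pip install -e .`)
sys.path.insert(0, str(Path(__file__).parent.parent))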

@pytest.fixture(scope='session')
def test_data_dir():
    """Get test data directory."""
    return Path(__file__).parent / 'test_data'

@pytest.fixture
def sample_telco_data():
    """Create sample telco churn data."""
    np.random.seed(42)
    n_samples = 1000
    
    return pd.DataFrame({
        'customerID': [f'ID_{i:04d}' for i in range(n_samples)],
        'tenure': np.random.randint(0, 72, n_samples),
        'MonthlyCharges': np.random.uniform(20, 120, n_samples),
        'TotalCharges': np.random.uniform(100, 8000, n_samples),
        'Contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples),
        'PaymentMethod': np.random.choice([
            'Electronic check', 'Mailed check', 
            'Bank transfer', 'Credit card'
        ], n_samples),
        'Churn': np.random.choice(['Yes', 'No'], n_samples, p=[0.3, 0.7])
    })

@pytest.fixture(autouse=True)
def reset_singleton():
    """Reset singleton instances between tests."""
    from src.utils.config import Config
    Config._instance = None
    yield
    Config._instance = None
```
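
The `reset_singleton` fixture matters because a singleton keeps its state for the whole test session. The sketch below shows the leak it prevents. This `Config` class is a hypothetical stand-in, assuming the real `src.utils.config.Config` stores its single instance in `_instance`, as the fixture implies.

```python
class Config:
    """Hypothetical stand-in for src.utils.config.Config (a singleton)."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.values = {}
        return cls._instance


first = Config()
first.values["model.test_size"] = 0.5    # one test mutates shared state

second = Config()                         # same object, so the change leaks
leaked = second.values.get("model.test_size")

Config._instance = None                   # what the autouse fixture does
fresh = Config()                          # a new, clean instance
```

Resetting both before and after the `yield` protects each test from state left behind by an earlier test, and cleans up after a test that fails midway.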

---

## 🎨 7. Code Quality Tools

### 7.1 Pre-commit Configuration

**.pre-commit-config.yaml**
```yaml
repos:
  # Remove trailing whitespace
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
        args: ['--maxkb=1000']
      - id: check-json
      - id: check-merge-conflict
      - id: debug-statements

  # Black formatting
  - repo: https://github.com/psf/black
    rev: 23.7.0
    hooks:
      - id: black
        language_version: python3.9

  # isort import sorting
  - repo: https://github.com/PyCQA/isort
    rev: 5.12.0
    hooks:
      - id: isort
        args: ["--profile", "black"]

  # Flake8 linting
  - repo: https://github.com/PyCQA/flake8
    rev: 6.0.0
    hooks:
      - id: flake8
        args: ['--max-line-length=100', '--ignore=E203,W503']

  # Type checking with mypy
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.4.1
    hooks:
      - id: mypy
        # types-all is unmaintained and commonly fails to install; list only the stub packages you need
        additional_dependencies: [types-PyYAML]
        args: [--ignore-missing-imports]

  # Jupyter notebook cleaning
  - repo: https://github.com/kynan/nbstripout
    rev: 0.6.1
    hooks:
      - id: nbstripout

# Install pre-commit hooks
# pre-commit install
```

### 7.2 Makefile for Automation

**Makefile**
```makefile
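# Placeholder image name used by the Docker targets; override with `make docker-build PROJECT_NAME=...`
PROJECT_NAME ?= my-project

.PHONY: help setup test clean lint format run data-download data-process train evaluate predict docker-build docker-run docs quality-check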

help:
	@echo "Available commands:"
	@echo "  make setup    - Set up the development environment"
	@echo "  make test     - Run tests"
	@echo "  make lint     - Run linting"
	@echo "  make format   - Format code"
	@echo "  make clean    - Clean up temporary files"
	@echo "  make run      - Run the main pipeline"

setup:
	python -m venv venv
	. venv/bin/activate && pip install --upgrade pip
	. venv/bin/activate && pip install -r requirements.txt
	. venv/bin/activate && pip install -r requirements-dev.txt
	. venv/bin/activate && pip install -e .
	. venv/bin/activate && pre-commit install
	@echo "Setup complete! Activate with: source venv/bin/activate"

test:
	pytest tests/ -v --cov=src --cov-report=html

lint:
	flake8 src/ tests/
	pylint src/
	mypy src/

format:
	black src/ tests/
	isort src/ tests/

clean:
	find . -type f -name "*.pyc" -delete
	find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true
	find . -type d -name "*.egg-info" -exec rm -rf {} + 2>/dev/null || true
	rm -rf .pytest_cache
	rm -rf .coverage
	rm -rf htmlcov
	rm -rf .mypy_cache

run:
	python scripts/train_pipeline.py

# Data pipeline commands
data-download:
	python scripts/download_data.py

data-process:
	python scripts/process_data.py

# Model commands
train:
	python scripts/train_model.py

evaluate:
	python scripts/evaluate_model.py

predict:
	python scripts/predict.py

# Docker commands
docker-build:
	docker build -t $(PROJECT_NAME) .

docker-run:
	docker run -it --rm -v $(PWD):/app $(PROJECT_NAME)

# Documentation
docs:
	sphinx-build -b html docs/ docs/_build

# Quality checks
quality-check: lint test
	@echo "All quality checks passed!"
```

---

## 📚 8. Documentation Standards

### 8.1 README Template

**README.md**
````markdown
# Project Name

Brief description of what the project does and its purpose.

## 🚀 Quick Start

```bash
# Clone the repository
git clone https://github.com/username/project.git
cd project

# Set up environment
make setup

# Run the pipeline
make run
```

## 📋 Prerequisites

- Python 3.8+
- Virtual environment tool (venv, conda, or poetry)
- Git

## 🛠️ Installation

### Option 1: Using Make
```bash
make setup
```

### Option 2: Manual Setup
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
```

## 📂 Project Structure

```
project/
├── src/          # Source code
├── tests/        # Test files
├── notebooks/    # Jupyter notebooks
├── configs/      # Configuration files
├── data/         # Data files (not in version control)
└── outputs/      # Generated outputs
```

## 🔧 Configuration

1. Copy `.env.example` to `.env` and fill in your values
2. Modify `configs/config.yaml` as needed

## 📊 Usage

### Training a Model
```python
from src.models import ModelTrainer
from src.data import DataLoader

# Load data
loader = DataLoader()
data = loader.load_raw_data('data.csv')

# Train model
trainer = ModelTrainer(config)
model = trainer.train(X_train, y_train)
```

### Making Predictions
```python
from src.models import predict

predictions = predict(model, new_data)
```

## 🧪 Testing

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=src

# Run specific test file
pytest tests/test_data.py
```

## 📈 Results

Brief description of model performance and key findings.

| Metric | Value |
|--------|-------|
| Accuracy | 0.95 |
| Precision | 0.93 |
| Recall | 0.92 |
| F1 Score | 0.92 |

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing`)
5. Open a Pull Request

## 📝 License

This project is licensed under the MIT License - see LICENSE file for details.

## 👥 Authors

- Your Name - Initial work

## 🙏 Acknowledgments

- Hat tip to anyone whose code was used
- Inspiration sources
- References
````

### 8.2 Docstring Standards

```python
"""
Module docstring describing the module's purpose.

This module provides functionality for X, Y, and Z.
It is designed to be used as part of the larger system.

Example:
    Basic usage of this module::
    
        from mymodule import MyClass
        
        obj = MyClass()
        result = obj.process(data)

Attributes:
    MODULE_CONSTANT (int): Description of module constant

Todo:
    * Add support for feature X
    * Optimize performance of function Y
"""

from typing import Any, Dict


def function_with_docstring(param1: str, 
                           param2: int = 10,
                           **kwargs) -> Dict[str, Any]:
    """
    Brief one-line description of function.
    
    Longer description explaining what the function does,
    any important details about its behavior, and when
    to use it.
    
    Args:
        param1: Description of param1
        param2: Description of param2 with default value
        **kwargs: Additional keyword arguments:
            - option1 (bool): Description of option1
            - option2 (str): Description of option2
    
    Returns:
        Description of return value, including type and
        structure if complex.
        
        Example return structure:
        {
            'status': 'success',
            'data': [...],
            'metadata': {...}
        }
    
    Raises:
        ValueError: If param1 is empty
        TypeError: If param2 is not an integer
    
    Example:
        >>> result = function_with_docstring("test", param2=20)
        >>> result['status']
        'success'
    
    Note:
        This function has side effects on X.
        
    See Also:
        related_function: Does something similar
        OtherClass: Related class
    """
    pass
```
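
To make the template concrete, here is a small filled-in instance. `churn_rate` is a hypothetical helper invented for illustration; the point is that the `Example:` section is written so `doctest` can execute it, which keeps the documentation honest.

```python
from typing import Sequence


def churn_rate(labels: Sequence[str], positive: str = "Yes") -> float:
    """
    Compute the share of customers labelled as churned.

    Args:
        labels: Churn labels, one per customer.
        positive: Label that counts as churn.

    Returns:
        Fraction of labels equal to ``positive``, between 0 and 1.

    Raises:
        ValueError: If ``labels`` is empty.

    Example:
        >>> churn_rate(["Yes", "No", "No", "No"])
        0.25
    """
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(label == positive for label in labels) / len(labels)


if __name__ == "__main__":
    import doctest

    doctest.testmod()  # runs the Example section as a test
```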

---

## 🚀 9. Putting It All Together

### Complete Working Example

**scripts/train_pipeline.py**
```python
#!/usr/bin/env python
"""
Complete training pipeline script.

This script orchestrates the entire machine learning pipeline from
data loading through model training and evaluation.
"""

import logging
from pathlib import Path
import click
import mlflow

from src.utils.config import config
from src.utils.logger import setup_logging
from src.data.load_data import DataLoader
from src.data.clean_data import DataCleaner
from src.features.build_features import FeatureEngineer
from src.models.train_model import ModelTrainer
from src.models.evaluate_model import ModelEvaluator

# Setup logging
setup_logging()
logger = logging.getLogger(__name__)

@click.command()
@click.option('--config-path', default='configs/config.yaml', 
              help='Path to configuration file')
@click.option('--data-path', default=None,
              help='Override data path from config')
@click.option('--experiment-name', default='default',
              help='MLflow experiment name')
@click.option('--debug', is_flag=True,
              help='Run in debug mode')
def main(config_path, data_path, experiment_name, debug):
    """Run the complete training pipeline."""
    
    try:
        logger.info("="*60)
        logger.info("Starting training pipeline")
        logger.info("="*60)
        
        # Load configuration
        logger.info("Loading configuration")
        # config is already loaded as singleton
        
        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
        
        # Step 1: Load data
        logger.info("Step 1: Loading data")
        loader = DataLoader(config_path)
        
        data_file = data_path or config.get('data.raw_file')
        df = loader.load_raw_data(data_file)
        logger.info(f"Loaded {len(df)} rows")
        
        # Step 2: Clean data
        logger.info("Step 2: Cleaning data")
        cleaner = DataCleaner()
        df = cleaner.clean(df)
        logger.info(f"Cleaned data: {len(df)} rows remaining")
        
        # Step 3: Feature engineering
        logger.info("Step 3: Engineering features")
        engineer = FeatureEngineer(config.get('features', {}))
        df = engineer.create_features(df, target_col=config.get('data.target_column'))
        logger.info(f"Created {len(engineer.feature_names)} features")
        
        # Step 4: Split data
        logger.info("Step 4: Splitting data")
        from sklearn.model_selection import train_test_split
        
        target_col = config.get('data.target_column')
        X = df[engineer.feature_names]
        y = df[target_col]
        
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, 
            test_size=config.get('model.test_size', 0.2),
            random_state=42,
            stratify=y
        )
        
        X_train, X_val, y_train, y_val = train_test_split(
            X_train, y_train,
            test_size=config.get('model.validation_size', 0.2),
            random_state=42,
            stratify=y_train
        )
        
        logger.info(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
        
        # Step 5: Train model
        logger.info("Step 5: Training model")
        trainer = ModelTrainer(
            config.get('model', {}),
            experiment_name=experiment_name
        )
        
        model = trainer.train(X_train, y_train, X_val, y_val)
        logger.info("Model training completed")
        
        # Step 6: Evaluate model
        logger.info("Step 6: Evaluating model")
        evaluator = ModelEvaluator()
        metrics = evaluator.evaluate(model, X_test, y_test)
        
        logger.info("="*60)
        logger.info("Pipeline completed successfully!")
        logger.info(f"Final metrics: {metrics}")
        logger.info("="*60)
        
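    except Exception as e:
        logger.error(f"Pipeline failed: {str(e)}", exc_info=True)
        # click discards the callback's return value in standalone mode,
        # so raise SystemExit to give the shell a non-zero exit code
        raise SystemExit(1)

if __name__ == '__main__':
    main()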
```
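
One detail of Step 4 is easy to misread: the validation split is taken from what remains after the test split, not from the full dataset. The arithmetic can be sketched as follows, with `split_fractions` being a hypothetical helper used only to show the calculation.

```python
def split_fractions(test_size: float, validation_size: float):
    """Return (train, val, test) shares of the full dataset for a two-stage split."""
    test_share = test_size
    remainder = 1.0 - test_size                # left for train + val after the first split
    val_share = remainder * validation_size    # second split is taken from the remainder
    train_share = remainder - val_share
    return train_share, val_share, test_share


train_share, val_share, test_share = split_fractions(test_size=0.2, validation_size=0.2)
print(f"train={train_share:.0%}, val={val_share:.0%}, test={test_share:.0%}")
# prints: train=64%, val=16%, test=20%
```

With the default values from the script, the model therefore trains on 64% of the rows, not 60%. To get a 60/20/20 split, `model.validation_size` would need to be 0.25.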

---

## 🎓 Best Practices Summary

### Do's ✅

1. **Always use version control** - Commit early and often
2. **Write tests first** - TDD helps design better code
3. **Document as you go** - Future you will thank you
4. **Use type hints** - Makes code self-documenting
5. **Keep functions small** - Single responsibility principle
6. **Handle errors gracefully** - Never let errors pass silently
7. **Use configuration files** - No hardcoded values
8. **Log everything important** - Debugging will be easier
9. **Profile before optimizing** - Measure, don't guess
10. **Review your own code** - After a break, review with fresh eyes

### Don'ts ❌

1. **Don't commit data files** - Use .gitignore
2. **Don't use global variables** - Pass parameters explicitly
3. **Don't ignore warnings** - They often indicate problems
4. **Don't copy-paste code** - Extract common functionality
5. **Don't skip testing** - Technical debt accumulates quickly
6. **Don't use print for debugging** - Use proper logging
7. **Don't hardcode paths** - Use configuration or Path objects
8. **Don't ignore code style** - Consistency matters
9. **Don't optimize prematurely** - Working code first
10. **Don't work without version control** - Even for experiments
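
Two of the points above (no hardcoded paths, logging instead of `print`) can be combined in a few lines. This is a minimal sketch: the `SETTINGS` dictionary, its keys, and the logger name are assumptions made for illustration. In the guide's layout the values would come from `configs/config.yaml`.

```python
import logging
from pathlib import Path

logger = logging.getLogger("pipeline")  # hypothetical logger name

# Hypothetical settings; normally loaded from a configuration file
SETTINGS = {"data": {"raw_dir": "data/raw", "raw_file": "churn.csv"}}


def raw_data_path(settings: dict) -> Path:
    """Build the raw-data path from configuration instead of hardcoding it."""
    data_cfg = settings["data"]
    path = Path(data_cfg["raw_dir"]) / data_cfg["raw_file"]
    logger.debug("Resolved raw data path: %s", path)  # instead of print()
    return path


path = raw_data_path(SETTINGS)
```

Changing the dataset location then means editing one configuration value, and the debug message can be switched on or off through the log level without touching the code.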

---

## 🚦 Getting Started Checklist

```bash
# 1. Create project structure
bash create_project_structure.sh

# 2. Initialize git
cd your-data-science-project
git init
git add .
git commit -m "Initial commit"

# 3. Set up virtual environment
python -m venv venv
source venv/bin/activate

# 4. Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt
pip install -e .

# 5. Set up pre-commit hooks
pre-commit install

# 6. Run initial tests
pytest

# 7. Start coding!
```

---

*This guide provides a comprehensive foundation for setting up professional, maintainable data science projects. Adapt and modify based on your specific needs, but maintain the core principles of clean, reusable code.*


```python
# Cell 2: Environment Setup (Clean)
# Core imports and configuration
import sys
from pathlib import Path
import yaml

# Add project path
HERE = Path().resolve()
sys.path.insert(0, str(HERE.parent / "src"))

# Data science stack
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Project modules
from utils.loader import DataLoader
from utils.preprocessor import clean_telco_data
from utils.stats import (
    test_numerical_vs_churn,
    test_categorical_vs_churn,
    identify_risk_segments
)

# Load configuration
with open('../config/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

print("✅ Environment setup complete")
```

```python
# Cell 3: Data Loading & Validation
# Load and prepare data using modular functions
loader = DataLoader(config)
df_raw, load_report = loader.load_data(config['data']['raw_path'])
df_clean = clean_telco_data(df_raw)

# Data quality summary
print(f"Dataset: {df_clean.shape[0]:,} customers, {df_clean.shape[1]} features")
print(f"Churn rate: {(df_clean['Churn'] == 'Yes').mean()*100:.1f}%")
print(f"Missing values: {df_clean.isnull().sum().sum()}")

# Save processed data
processed_path = Path(config['data']['processed_path'])
processed_path.parent.mkdir(parents=True, exist_ok=True)
df_clean.to_csv(processed_path, index=False)
print(f"✅ Clean data saved to {processed_path}")
```

```python
# Cell 4: Statistical Testing Framework
# Define features to test
numerical_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
categorical_features = ['Contract', 'PaymentMethod', 'InternetService']

# Initialize results storage
statistical_results = {
    'numerical': {},
    'categorical': {}
}

print("🔬 Running Statistical Tests")
print("=" * 40)
```

```python
# Cell 5: Numerical Feature Analysis
# Test numerical features
for feature in numerical_features:
    result = test_numerical_vs_churn(df_clean, feature, 'Churn')
    statistical_results['numerical'][feature] = result
    
    print(f"\n{feature.upper()}:")
    print(f"  Test: {result['test_used']}")
    print(f"  P-value: {result['p_value']:.4e}")
    print(f"  Effect size: {result['cohens_d']:.3f} ({result['effect_size']})")
    print(f"  Significant: {'✅' if result['significant'] else '❌'}")
```

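
The `cohens_d` value printed above comes from `test_numerical_vs_churn`, whose implementation is not shown here. As a point of reference, the textbook pooled-standard-deviation form of Cohen's d can be sketched with the standard library alone. The project function may compute it differently, and the tenure values below are made up.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def effect_label(d: float) -> str:
    """Cohen's conventional thresholds: 0.2 small, 0.5 medium, 0.8 large."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Made-up tenure values in months, for illustration only
churned = [1, 2, 3, 4, 5]
retained = [3, 4, 5, 6, 7]
d = cohens_d(churned, retained)
label = effect_label(d)
```

Reporting the effect size next to the p-value matters on a dataset of this size, where even very small differences can reach statistical significance.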
```python
# Cell 6: Categorical Feature Analysis
# Test categorical features
for feature in categorical_features:
    result = test_categorical_vs_churn(df_clean, feature, 'Churn')
    statistical_results['categorical'][feature] = result
    
    print(f"\n{feature.upper()}:")
    print(f"  Chi-square: {result['chi2_statistic']:.2f}")
    print(f"  P-value: {result['p_value']:.4e}")
    print(f"  Cramér's V: {result['cramers_v']:.3f}")
    print(f"  Highest risk: {result['highest_risk_category']}")
```

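
Cramér's V rescales the chi-square statistic to a 0 to 1 range so that associations can be compared across features with different numbers of categories. The sketch below computes it from a contingency table of counts without any continuity correction. The counts are invented, and `test_categorical_vs_churn` may use a corrected variant.

```python
from math import sqrt


def cramers_v(table) -> float:
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-square: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(table), len(table[0]))  # smaller table dimension
    return sqrt(chi2 / (n * (k - 1)))


# Made-up counts: rows = contract type, columns = (churned, retained)
counts = [
    [40, 60],  # month-to-month
    [10, 90],  # one year
]
v = cramers_v(counts)
```

For these counts the chi-square statistic is 24 with n = 200, so V is the square root of 0.12, roughly 0.35.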
```python
# Cell 7: Key Findings Visualization
# Create focused visualizations for significant findings
significant_features = []

# Identify significant results
for category, results in statistical_results.items():
    for feature, result in results.items():
        if result['significant']:
            significant_features.append((feature, result))

print(f"📊 Visualizing {len(significant_features)} significant findings")

# Create subplot grid
n_features = len(significant_features)
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for i, (feature, result) in enumerate(significant_features[:4]):
    # Your visualization code here
    pass

plt.tight_layout()
plt.show()
```

```python
# Cell 8: Risk Segmentation
# Business-focused risk analysis
risk_segments = identify_risk_segments(df_clean)

print("🎯 HIGH-RISK CUSTOMER SEGMENTS")
print("=" * 40)

# Sort by risk level and revenue impact
high_risk_segments = {k: v for k, v in risk_segments.items() 
                     if v['risk_level'] == 'HIGH'}

for segment_name, data in high_risk_segments.items():
    print(f"\n{segment_name.upper()}:")
    print(f"  Size: {data['size']:,} customers ({data['percentage_of_base']:.1f}%)")
    print(f"  Churn Rate: {data['churn_rate']:.1f}%")
    print(f"  Revenue at Risk: ${data.get('monthly_revenue_at_risk', 0):,.0f}/month")
```

```python
# Cell 9: Executive Summary & Recommendations
# Business intelligence summary
print("📋 EXECUTIVE SUMMARY")
print("=" * 50)

# Calculate total impact
total_revenue_at_risk = sum(
    segment.get('monthly_revenue_at_risk', 0) 
    for segment in risk_segments.values()
)

print(f"\n💰 BUSINESS IMPACT:")
print(f"   Total Monthly Revenue at Risk: ${total_revenue_at_risk:,.0f}")
print(f"   Annualized Impact: ${total_revenue_at_risk * 12:,.0f}")

print(f"\n🎯 TOP 3 RECOMMENDATIONS:")

# Generate recommendations from significant findings
recommendations = []
for feature, result in significant_features:
    if feature == 'Contract' and result['significant']:
        recommendations.append({
            'priority': 1,
            'action': 'Contract Incentive Program',
            'rationale': f"Month-to-month customers have {result['churn_rates_by_category']['Month-to-month']*100:.1f}% churn rate",
            'expected_impact': '20% reduction in contract-related churn'
        })

# Display top recommendations
for i, rec in enumerate(recommendations[:3], 1):
    print(f"\n   {i}. {rec['action']}")
    print(f"      Rationale: {rec['rationale']}")
    print(f"      Expected Impact: {rec['expected_impact']}")
```

```python
# Cell 10: Technical Appendix (Optional)
# Detailed statistical results for technical stakeholders
import json

print("📊 DETAILED STATISTICAL RESULTS")
print("=" * 40)

# Export detailed results
results_export = {
    'summary': {
        'total_features_tested': len(numerical_features) + len(categorical_features),
        'significant_findings': len(significant_features),
        'alpha_level': 0.05
    },
    'detailed_results': statistical_results
}

# Save results for reporting (create the output folder first so open() cannot fail on a fresh checkout)
results_path = Path('../results/statistical_analysis_results.json')
results_path.parent.mkdir(parents=True, exist_ok=True)
with open(results_path, 'w') as f:
    json.dump(results_export, f, indent=2, default=str)

print("✅ Results exported for technical documentation")
```

# 🎯 Key Improvements for Level 3
## 1. Separation of Concerns
### ❌ Mixed exploration and analysis
```python
tenure_0_customers = df_clean[df_clean['tenure'] == 0]
print(tabulate(tenure_0_customers, headers='keys', tablefmt='psql'))
```

### ✅ Focused analysis only
```python
result = test_numerical_vs_churn(df_clean, 'tenure', 'Churn')
print(f"Tenure analysis: p={result['p_value']:.4e}, d={result['cohens_d']:.3f}")
```
## 2. Professional Output Formatting
### ❌ Basic print statements
```python
print("Tenure Analysis Results:")
print(f"  Test used: {tenure_results['test_used']}")
```
### ✅ Structured, scannable output
```python
print("🔬 STATISTICAL TEST RESULTS")
print("=" * 30)
print(f"Feature: {feature}")
print(f"Test: {result['test_used']}")
print(f"Significance: {'✅ Significant' if result['significant'] else '❌ Not significant'}")
```

## 3. Result-Oriented Structure
### ❌ Process-focused
"First let's load the data, then clean it, then test it..."

### ✅ Results-focused
"Key Finding: Contract type significantly predicts churn (p<0.001)"
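
A results-first sentence like this can be generated directly from a test-result dictionary. The helper below is a sketch: `format_finding` is a hypothetical name, the only key it assumes is `p_value` (as used in the cells above), and the p-value in the example is made up.

```python
def format_finding(feature: str, result: dict, alpha: float = 0.05) -> str:
    """Turn a test-result dict into a one-line, results-first statement."""
    p = result["p_value"]
    # Very small p-values are reported as a bound rather than in scientific notation
    p_text = "p<0.001" if p < 0.001 else f"p={p:.3f}"
    if p < alpha:
        return f"Key Finding: {feature} significantly predicts churn ({p_text})"
    return f"No significant association between {feature} and churn ({p_text})"


headline = format_finding("Contract type", {"p_value": 2.3e-12})
```

Placing such a headline at the top of a notebook section lets a reader get the conclusion first and consult the supporting cells only when needed.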

</details>

The following lightweight, **Level-3-friendly Python script** reads your YAML checklist and reports overall completion progress. You can drop it into any notebook or run it as a standalone utility.

---

## 🧩 `progress_tracker.py` (or notebook cell)

```python
import yaml
from pathlib import Path

def checklist_progress(yaml_path, section_name):
    """
    Reads a YAML checklist (like project_plan.yaml) and reports
    completion percentage + remaining unchecked tasks.
    """

    # Load YAML
    with open(yaml_path, "r") as f:
        plan = yaml.safe_load(f)

    section = plan.get(section_name)
    if not section:
        print(f"❌ Section '{section_name}' not found in {yaml_path}")
        return

    # Flatten all subtasks
    def flatten(tasks):
        items = []
        for task in tasks:
            if isinstance(task, dict):
                # Nested structure (one main task with subtasks)
                for key, subtasks in task.items():
                    if isinstance(subtasks, list):
                        items.append(key)
                        items.extend(flatten(subtasks))
                    else:
                        items.append(key)
            elif isinstance(task, str):
                items.append(task)
        return items

    # Count completed vs. total from the flattened task strings, so the
    # tally and the "remaining" list below always agree
    items = flatten(section.get("tasks", []))
    done = [t for t in items if "[x]" in t.lower()]
    remaining = [t for t in items if "[ ]" in t]
    completed, total = len(done), len(done) + len(remaining)

    percent = round((completed / total) * 100, 1) if total else 0
    print(f"\n📊 Progress for '{section_name}': {completed}/{total} tasks complete ({percent}%)")

    # Optional: list remaining unchecked items
    print("\n📝 Remaining Tasks:")
    for task in remaining:
        # Remove only the checkbox marker; str.strip('- [ ]') strips a character
        # set and could also eat brackets or dashes that belong to the task text
        print(f"  - {task.replace('[ ]', '', 1).strip(' -')}")

# Example usage:
# checklist_progress("project_plan.yaml", "02_Data_Validation_and_Cleaning")
```

---

## 🧠 How to Use

1. Save your YAML (from the previous step) as `project_plan.yaml` in your Level_3 directory.
2. Paste this code into a notebook cell or script.
3. Run:

   ```python
   checklist_progress("project_plan.yaml", "02_Data_Validation_and_Cleaning")
   ```
4. You'll see output like:

   ```
   📊 Progress for '02_Data_Validation_and_Cleaning': 27/64 tasks complete (42.2%)

   📝 Remaining Tasks:
     - 3.4 Decide drop / imputation strategy
     - 8.5 Flag anomalies and prepare issue log
     - 15.3 Save dataset → data/processed/telco_clean.csv
   ```

---

### 🧩 Optional Bonus (if you want to scale this later)

You can:

* Loop over **all sections** to print total project progress.
* Integrate it into a **Makefile**, **pre-commit hook**, or **CI/CD** to track progress automatically.
* Output progress to a small JSON for dashboards.
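
The first and last ideas can be sketched together: loop over every section and emit a JSON summary. The code below works on an already-parsed plan (for example the result of `yaml.safe_load`), so it needs only the standard library. `count_checkboxes`, `project_progress`, and the sample plan structure are hypothetical and assume each section keeps its checklist under a `tasks` key.

```python
import json


def count_checkboxes(node):
    """Recursively count (done, total) checkbox markers in strings, lists and dicts."""
    if isinstance(node, str):
        if "[x]" in node.lower():
            return 1, 1
        if "[ ]" in node:
            return 0, 1
        return 0, 0
    if isinstance(node, dict):
        children = list(node.keys()) + list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        children = []
    done = total = 0
    for child in children:
        d, t = count_checkboxes(child)
        done, total = done + d, total + t
    return done, total


def project_progress(plan: dict) -> dict:
    """Summarize every section of an already-parsed plan."""
    summary = {}
    for name, section in plan.items():
        if not isinstance(section, dict):
            continue  # skip top-level entries that are not sections
        done, total = count_checkboxes(section.get("tasks", []))
        percent = round(100 * done / total, 1) if total else 0.0
        summary[name] = {"done": done, "total": total, "percent": percent}
    return summary


# Hypothetical plan structure, for illustration only
plan = {
    "01_Setup": {"tasks": ["[x] Create folders", "[x] Install deps"]},
    "02_Cleaning": {"tasks": [{"[x] Inspect": ["[x] Shapes", "[ ] Dtypes"]}, "[ ] Save"]},
}
report = project_progress(plan)
print(json.dumps(report, indent=2))
```

Writing the same dictionary to a file with `json.dump` would give a dashboard or CI job a stable artifact to read.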

