# 6.3 Git Project Management and GitHub for PySpark Development

This notebook demonstrates best practices for using Git and GitHub in PySpark data engineering projects, with special focus on Databricks Repos integration.

## Learning Objectives
- Master Git workflows for data engineering teams
- Integrate Databricks Repos with Git repositories
- Implement effective branching strategies
- Conduct code reviews for PySpark code
- Set up CI/CD pipelines for data pipelines
- Manage notebooks and Python modules in version control
- Collaborate effectively on PySpark projects

In [None]:
# For local development: Uncomment the next line
# %run 00_Environment_Setup.ipynb

## Why Git and GitHub Matter for Data Engineering

**Version control is critical** for data engineering:

**Benefits**:
1. **Collaboration**: Multiple data engineers can work on the same codebase
2. **History**: Track changes to pipelines over time
3. **Rollback**: Revert to working versions when issues arise
4. **Code Review**: Peer review of data transformation logic
5. **CI/CD**: Automated testing and deployment
6. **Documentation**: Commit messages document evolution

**Challenges in Data Engineering**:
- Large notebooks can be difficult to diff
- Embedded cell outputs (including images) in notebooks produce noisy diffs and merge conflicts
- Data pipelines have complex dependencies
- Testing requires data and compute resources

**Solution**: Databricks Repos + GitHub + proper Git workflows

## Databricks Repos Integration

Databricks Repos provides native Git integration within the Databricks workspace:

In [None]:
print("=== Databricks Repos Integration ===")

repos_overview = """
DATABRICKS REPOS FEATURES
=========================

What is Databricks Repos?
- Git repository integration directly in Databricks workspace
- Supports GitHub, GitLab, Bitbucket, Azure DevOps
- Visual Git operations (commit, push, pull, branch, merge)
- Notebook and Python file support
- Collaborative development environment

Setting Up a Repo:
1. Navigate to Repos in Databricks workspace
2. Click "Add Repo"
3. Connect to Git provider (GitHub, GitLab, etc.)
4. Clone repository or create new
5. Start developing with full Git integration

Supported Operations:
- Clone: Import existing repository
- Commit: Save changes with commit message
- Push: Upload changes to remote
- Pull: Download changes from remote
- Branch: Create feature branches
- Merge: Integrate branches
- Conflict Resolution: Visual diff and merge tools

File Types Supported:
- Notebooks (.py, .sql, .scala, .r)
- Python modules (.py)
- Configuration files (.yaml, .json, .txt)
- Documentation (.md)
- Any text file

Best Practices:
- One repo per project or pipeline group
- Separate notebooks from Python modules
- Use .gitignore for generated files
- Clear README.md for onboarding
- Structured directory layout
"""

print(repos_overview)

## Git Branching Strategies

Different branching strategies for different team sizes and workflows:

In [None]:
print("=== Git Branching Strategies ===")

branching_strategies = """
1. GITHUB FLOW (Recommended for Most Teams)
============================================

Structure:
  main (production)
    |
    +-- feature/add-customer-segmentation
    +-- feature/fix-revenue-calculation
    +-- hotfix/data-quality-issue

Workflow:
1. Branch from main for every change
2. Name branches descriptively (feature/*, bugfix/*, hotfix/*)
3. Commit frequently with clear messages
4. Open PR when ready for review
5. Review, test, and merge to main
6. Deploy main to production
7. Delete feature branch after merge

Pros:
- Simple and easy to understand
- Fast feedback loops
- Continuous deployment friendly
- Clear separation of work

Cons:
- Requires good CI/CD and testing
- Main must always be deployable
- Not ideal for multiple production versions

Best For:
- Small to medium teams
- Continuous deployment
- Databricks environments (dev/staging/prod clusters)


2. GIT FLOW (For Complex Release Cycles)
==========================================

Structure:
  main (production)
    |
  develop (integration)
    |
    +-- feature/feature-a
    +-- feature/feature-b
    |
  release/v1.2.0 (release prep)
    |
  hotfix/critical-bug (emergency fixes)

Workflow:
1. Develop branch for ongoing work
2. Feature branches from develop
3. Merge features back to develop
4. Create release branch from develop
5. Test and fix bugs in release branch
6. Merge release to main and tag version
7. Merge release back to develop
8. Hotfixes branch from main, merge to both

Pros:
- Clear separation of concerns
- Supports multiple release versions
- Organized release management
- Good for scheduled releases

Cons:
- More complex workflow
- Slower feedback loops
- Overhead for small teams
- Multiple merge points

Best For:
- Large teams
- Scheduled release cycles
- Multiple production versions
- Strict change control


3. TRUNK-BASED DEVELOPMENT (For Fast-Moving Teams)
====================================================

Structure:
  main (always deployable)
    |
    +-- short-lived branches (< 1 day)

Workflow:
1. All work directly on main or very short-lived branches
2. Commit multiple times per day
3. Use feature flags for incomplete features
4. Automated testing on every commit
5. Continuous deployment to production

Pros:
- Simplest possible workflow
- Fastest integration
- Minimal merge conflicts
- Forces small, incremental changes

Cons:
- Requires excellent CI/CD
- Needs feature flags for large features
- High discipline required
- Not suitable for all team sizes

Best For:
- Elite performing teams
- Mature CI/CD practices
- Microservices architectures
- High-trust environments


RECOMMENDATION FOR DATA ENGINEERING
===================================

Use GitHub Flow with environment branches:

  main (production)
    |
  staging (pre-production testing)
    |
  develop (integration testing)
    |
    +-- feature/* (individual features)
    +-- bugfix/* (bug fixes)
    +-- hotfix/* (emergency fixes)

Workflow:
1. Feature branch from develop
2. Merge to develop, test with dev data
3. Merge to staging, test with prod-like data
4. Merge to main, deploy to production
5. Tag releases for rollback capability
"""

print(branching_strategies)
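The naming conventions above can be enforced mechanically, for example in a pre-push hook or a CI step. The following is a minimal sketch, not part of any Git or GitHub tooling: the function name `is_valid_branch_name` is hypothetical, and the slug rule (lowercase letters, digits, and hyphens) is an assumption added for illustration.

```python
import re

# Prefixes come from the branch naming conventions in the text above.
ALLOWED_PREFIXES = ("feature", "bugfix", "hotfix")

# Assumption for this sketch: slugs are lowercase words joined by hyphens.
BRANCH_PATTERN = re.compile(
    r"^(?:" + "|".join(ALLOWED_PREFIXES) + r")/[a-z0-9]+(?:-[a-z0-9]+)*$"
)

# Long-lived branches from the recommended layout.
LONG_LIVED_BRANCHES = {"main", "staging", "develop"}


def is_valid_branch_name(name: str) -> bool:
    """Return True for long-lived branches or well-formed short-lived ones."""
    return name in LONG_LIVED_BRANCHES or bool(BRANCH_PATTERN.match(name))


for candidate in [
    "feature/add-customer-segmentation",
    "hotfix/data-quality-issue",
    "my_branch",
]:
    print(candidate, "->", is_valid_branch_name(candidate))
```

A team using Git Flow would likely extend the prefix list with `release`, which this sketch deliberately leaves out.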

## Git Workflow Best Practices

Step-by-step workflow for common operations:

In [None]:
print("=== Git Workflow Commands ===")

git_workflow = """
STARTING A NEW FEATURE
======================

# 1. Ensure main is up to date
git checkout main
git pull origin main

# 2. Create feature branch with descriptive name
git checkout -b feature/add-revenue-analytics

# Branch naming conventions:
# - feature/short-description
# - bugfix/issue-description
# - hotfix/critical-fix-description
# - experiment/hypothesis-name

# 3. Make changes to your code
# (Edit files in Databricks or locally)

# 4. Check status of changes
git status

# 5. Review changes before committing
git diff

# 6. Stage files for commit
git add notebooks/revenue_analytics.py
git add src/transformations/revenue.py
# Or stage all changes:
git add .

# 7. Commit with descriptive message
git commit -m "Add revenue analytics transformation

- Implement revenue calculation by customer segment
- Add data quality checks for revenue data
- Include unit tests for transformation functions"

# 8. Push to remote repository
# First push on this branch: -u sets the upstream for later pushes
git push -u origin feature/add-revenue-analytics

# Subsequent pushes on the same branch:
git push origin feature/add-revenue-analytics


COMMITTING CHANGES
==================

# ✅ GOOD: Small, focused commits
git commit -m "Add customer segmentation logic"
git commit -m "Add unit tests for segmentation"
git commit -m "Update documentation for segmentation"

# ❌ BAD: Large, unfocused commits
git commit -m "Various changes"
git commit -m "Fix stuff"
git commit -m "WIP"

# Good commit message structure:
git commit -m "<type>: <short summary (50 chars max)>

<optional detailed description>

<optional references to issues/tickets>"

# Commit types:
# - feat: New feature
# - fix: Bug fix
# - refactor: Code restructuring
# - test: Adding tests
# - docs: Documentation changes
# - perf: Performance improvements
# - style: Formatting changes


UPDATING YOUR BRANCH
====================

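# Keep feature branch up to date with main
git checkout feature/add-revenue-analytics
git fetch origin

# Choose ONE of the following two options.
# Option A: rebase (linear history, rewrites the branch's commits)
git rebase origin/main
# Option B: merge (preserves history, creates a merge commit)
git merge origin/main

# Resolve conflicts if any
# 1. Edit conflicted files
# 2. Mark as resolved:
git add <conflicted-file>
# 3. Continue rebase:
git rebase --continue
# Or for merge:
git commit

# Push updated branch
# After a rebase, a force push is required; --force-with-lease refuses
# to overwrite commits someone else pushed in the meantime.
git push --force-with-lease origin feature/add-revenue-analytics
# After a merge, a normal push is enough:
# git push origin feature/add-revenue-analytics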


CREATING A PULL REQUEST
=======================

# 1. Push your branch to remote (if not already)
git push origin feature/add-revenue-analytics

# 2. Go to GitHub repository
# 3. Click "Pull requests" → "New pull request"
# 4. Select base: main, compare: feature/add-revenue-analytics
# 5. Fill in PR template:

--- PR template start ---
PR Title: Add revenue analytics transformation

## Summary
This PR adds a new revenue analytics transformation pipeline that calculates
revenue by customer segment.

## Changes
- New transformation: `calculate_revenue_by_segment()`
- Data quality checks for revenue data
- Unit tests with 90% coverage
- Documentation updates

## Testing
- [x] Unit tests pass
- [x] Integration tests pass
- [x] Manual testing on dev cluster
- [x] Data quality checks verified

## Checklist
- [x] Code follows project style guidelines
- [x] Tests added for new functionality
- [x] Documentation updated
- [x] No breaking changes

## Related Issues
Closes #123
--- PR template end ---

# 6. Request reviewers
# 7. Address review comments
# 8. Merge when approved

"""

print(git_workflow)
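The commit message structure above (`<type>: <short summary>`, 50 characters max, blank line before the body) is simple enough to check automatically. The sketch below is one possible check, for instance in a `commit-msg` hook; the function name `check_commit_message` and the exact wording of the problem strings are hypothetical.

```python
import re

# Commit types listed in the workflow text above.
COMMIT_TYPES = ("feat", "fix", "refactor", "test", "docs", "perf", "style")
MAX_SUMMARY_LENGTH = 50


def check_commit_message(message: str) -> list[str]:
    """Return a list of problems; an empty list means the message passes."""
    lines = message.strip().splitlines()
    if not lines:
        return ["message is empty"]

    problems = []
    match = re.match(r"^(\w+): (.+)$", lines[0])
    if match is None:
        problems.append("subject must look like '<type>: <summary>'")
    else:
        commit_type, summary = match.groups()
        if commit_type not in COMMIT_TYPES:
            problems.append(f"unknown type '{commit_type}'")
        if len(summary) > MAX_SUMMARY_LENGTH:
            problems.append(f"summary longer than {MAX_SUMMARY_LENGTH} characters")

    # A body, if present, must be separated from the subject by a blank line.
    if len(lines) > 1 and lines[1].strip():
        problems.append("leave a blank line between subject and body")
    return problems


print(check_commit_message("feat: Add customer segmentation logic"))
print(check_commit_message("Fix stuff"))
```

Returning a list of problems rather than a boolean lets a hook print every issue at once instead of failing on the first.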

## .gitignore for PySpark Projects

Essential files to exclude from version control:

In [None]:
print("=== .gitignore for PySpark Projects ===")

gitignore_content = """
# .gitignore for PySpark/Databricks projects

# ========================================
# Python
# ========================================
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# ========================================
# Virtual Environments
# ========================================
venv/
ENV/
env/
.venv/
pipenv/

# ========================================
# Jupyter Notebooks
# ========================================
.ipynb_checkpoints/
*-checkpoint.ipynb
*.ipynb_checkpoints

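# Notebook outputs (optional - team decision)
# Note: ignoring *.ipynb excludes the notebooks themselves, not just
# their outputs. To version notebooks without outputs, strip outputs
# with a tool such as nbstripout instead.
# Uncomment only if notebooks should not be versioned at all:
# *.ipynb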

# ========================================
# Spark and PySpark
# ========================================
spark-warehouse/
metastore_db/
derby.log
*.log
*.parquet
*.orc
*.avro

# Local Spark data
data/raw/
data/processed/
data/temp/
tmp/
output/

# ========================================
# Delta Lake
# ========================================
_delta_log/
*.checkpoint.parquet

# ========================================
# Databricks
# ========================================
.databricks/
.databrickscfg
databricks_cli.log

# ========================================
# IDE and Editor
# ========================================
.vscode/
.idea/
*.swp
*.swo
*~
.DS_Store
Thumbs.db

# ========================================
# Testing
# ========================================
.pytest_cache/
.coverage
htmlcov/
.tox/
coverage.xml
*.cover
.hypothesis/

# ========================================
# Secrets and Configuration
# ========================================
*.env
.env.*
secrets.yaml
secrets.json
credentials.json
*.key
*.pem
*.p12

# Keep example configs
!.env.example
!secrets.yaml.example

# ========================================
# Documentation
# ========================================
docs/_build/
site/

# ========================================
# Project-Specific
# ========================================
# Add any project-specific files to ignore
# local_data/
# experiments/
"""

print(gitignore_content)

print("\n✅ Best practices for .gitignore:")
print("  - Exclude generated files (outputs, logs)")
print("  - Never commit secrets or credentials")
print("  - Exclude large data files (use .gitattributes for LFS)")
print("  - Keep example config files with .example extension")
print("  - Team decision on notebook outputs")
print("  - Document why files are ignored (comments)")
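A `.gitignore` only helps if nobody force-adds an ignored file, so some teams also scan the staged file list before committing. The sketch below reuses the secret patterns from the block above and matches them against file names only. It is a simplified stand-in and does not reproduce full `.gitignore` semantics (no directory rules, no general negation). The function name `find_risky_paths` is hypothetical; in practice the input could come from `git diff --cached --name-only`.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns copied from the "Secrets and Configuration" block above.
SECRET_PATTERNS = [
    "*.env", ".env.*", "secrets.yaml", "secrets.json",
    "credentials.json", "*.key", "*.pem", "*.p12",
]

# Example files that the .gitignore explicitly re-includes.
ALLOWED_EXAMPLES = {".env.example", "secrets.yaml.example"}


def find_risky_paths(paths):
    """Return the paths whose file name matches a secret-like pattern."""
    risky = []
    for path in paths:
        name = PurePosixPath(path).name
        if name in ALLOWED_EXAMPLES:
            continue
        if any(fnmatchcase(name, pattern) for pattern in SECRET_PATTERNS):
            risky.append(path)
    return risky


staged = ["src/app.py", ".env", "config/.env.prod", ".env.example", "keys/server.pem"]
print(find_risky_paths(staged))
```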

## Code Review Best Practices

Effective code review for PySpark projects:

In [None]:
print("=== Code Review Best Practices ===")

code_review_guide = """
CODE REVIEW CHECKLIST FOR PYSPARK
==================================

Functional Correctness
----------------------
□ Does the code solve the stated problem?
□ Are edge cases handled?
□ Are transformations idempotent (safe to retry)?
□ Are pure functions truly pure (no side effects)?
□ Is business logic correct and validated?

Data Quality
------------
□ Are null values handled appropriately?
□ Is data validation implemented?
□ Are schema contracts explicit?
□ Are data quality checks in place?
□ Is error handling comprehensive?

Performance
-----------
□ Are built-in functions used (not UDFs)?
□ Is unnecessary data shuffling avoided?
□ Are partitions properly managed?
□ Is caching used appropriately?
□ Are broadcast joins used for small tables?
□ Is column pruning applied?

Code Quality
------------
□ Are functions small and focused?
□ Is code readable and well-structured?
□ Are variable names descriptive?
□ Is complex logic extracted to named functions?
□ Are magic numbers avoided (use constants)?
□ Is code DRY (Don't Repeat Yourself)?

Testing
-------
□ Are unit tests included?
□ Is test coverage adequate (>80%)?
□ Are integration tests included?
□ Are tests independent and repeatable?
□ Do tests use realistic data?

Documentation
-------------
□ Are docstrings present for all functions?
□ Are complex algorithms explained?
□ Is README updated if needed?
□ Are breaking changes documented?
□ Is the PR description clear?

Security & Compliance
---------------------
□ Are credentials properly secured?
□ Are secrets not hardcoded?
□ Is PII data handled correctly?
□ Are access patterns appropriate?
□ Does code comply with data governance policies?


REVIEW COMMENT EXAMPLES
=======================

✅ Good Comments (Constructive)
--------------------------------

"Consider using F.when() here instead of a UDF for better performance.
UDFs have serialization overhead. Would this work?

    return df.withColumn('category',
        F.when(F.col('amount') > 1000, 'high')
         .otherwise('low')
    )
"

"This function looks great! One suggestion: could you add a docstring 
with an example? It would help other team members understand the 
expected input/output schema."

"Nice refactoring! I like how you extracted the complex logic into 
separate functions. This will be much easier to test."

"Question: What happens if the input DataFrame is empty? Should we 
add a check or is that handled elsewhere?"


❌ Bad Comments (Unhelpful)
---------------------------

"This is wrong." 
# Better: Explain what's wrong and suggest a fix

"I wouldn't do it this way."
# Better: Explain why and show alternative

"This won't scale."
# Better: Explain performance concerns with evidence

"Change this."
# Better: Explain reasoning and suggest improvements


RESPONDING TO REVIEWS
=====================

As Author:
----------
✅ Thank reviewers for their time
✅ Ask clarifying questions
✅ Explain your reasoning when disagreeing
✅ Mark resolved comments after addressing
✅ Re-request review after major changes

❌ Don't be defensive
❌ Don't ignore feedback
❌ Don't argue without data

As Reviewer:
------------
✅ Be respectful and constructive
✅ Ask questions, don't demand changes
✅ Provide examples and alternatives
✅ Distinguish between "must fix" and "nice to have"
✅ Approve when requirements are met

❌ Don't nitpick style (use linters)
❌ Don't block on personal preferences
❌ Don't delay reviews unnecessarily


REVIEW TURNAROUND TIMES
=======================

Target SLAs:
- Initial review: Within 24 hours
- Follow-up reviews: Within 4 hours
- Hotfix reviews: Within 1 hour

Tips for Fast Reviews:
- Keep PRs small (<400 lines)
- Provide context in description
- Self-review before requesting
- Tag specific reviewers
- Use draft PRs for early feedback
"""

print(code_review_guide)
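The "keep PRs small (<400 lines)" tip can be turned into an automated warning. The sketch below sums added and deleted lines from the text produced by `git diff --numstat`, which prints one tab-separated `added  deleted  path` row per file and uses `-` for binary files. The function name `count_changed_lines` and the sample data are illustrative assumptions.

```python
MAX_CHANGED_LINES = 400  # threshold from the review tips above


def count_changed_lines(numstat_output: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` text."""
    total = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        added, deleted = parts[0], parts[1]
        # Binary files are reported as "-" and are not counted.
        if added.isdigit() and deleted.isdigit():
            total += int(added) + int(deleted)
    return total


# Hypothetical numstat output for a three-file change.
sample = (
    "120\t30\tsrc/transformations/revenue.py\n"
    "200\t75\ttests/unit/test_revenue.py\n"
    "-\t-\tdocs/diagram.png\n"
)
changed = count_changed_lines(sample)
verdict = "OK" if changed < MAX_CHANGED_LINES else "consider splitting this PR"
print(f"{changed} lines changed: {verdict}")
```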

## CI/CD Pipeline with GitHub Actions

Automated testing and deployment for PySpark projects:

In [None]:
print("=== CI/CD Pipeline Example ===")

github_actions_workflow = """
# .github/workflows/ci.yml
# CI/CD Pipeline for PySpark Project

name: CI/CD Pipeline

on:
  push:
    branches: [ main, develop, staging ]
  pull_request:
    branches: [ main, develop ]

env:
  PYTHON_VERSION: '3.10'
  SPARK_VERSION: '3.4.1'

jobs:
  # ================================
  # Linting and Code Quality
  # ================================
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      
      - name: Install linting tools
        run: |
          pip install black ruff mypy
      
      - name: Run Black (code formatting)
        run: black --check src/ tests/
      
      - name: Run Ruff (linting)
        run: ruff check src/ tests/
      
      - name: Run MyPy (type checking)
        run: mypy src/

  # ================================
  # Unit Tests
  # ================================
  test:
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      
      - name: Install Java (for PySpark)
        uses: actions/setup-java@v3
        with:
          distribution: 'temurin'
          java-version: '11'
      
      - name: Cache pip dependencies
        uses: actions/cache@v3
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install -r requirements-dev.txt
      
      - name: Run unit tests
        run: |
          pytest tests/unit/ -v --cov=src --cov-report=xml --cov-report=html
      
      - name: Upload coverage reports
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml
          flags: unittests
      
      - name: Check coverage threshold
        run: |
          coverage report --fail-under=80

  # ================================
  # Integration Tests
  # ================================
  integration-test:
    runs-on: ubuntu-latest
    needs: test
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      
      - name: Install Java
        uses: actions/setup-java@v3
        with:
          distribution: 'temurin'
          java-version: '11'
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install -r requirements-dev.txt
      
      - name: Run integration tests
        run: |
          pytest tests/integration/ -v --durations=10

  # ================================
  # Build Package
  # ================================
  build:
    runs-on: ubuntu-latest
    needs: [test, integration-test]
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      
      - name: Install build tools
        run: |
          pip install build wheel
      
      - name: Build wheel package
        run: |
          python -m build
      
      - name: Upload artifacts
        uses: actions/upload-artifact@v3
        with:
          name: dist-package
          path: dist/

  # ================================
  # Deploy to Development
  # ================================
  deploy-dev:
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/develop'
    environment: development
    steps:
      - uses: actions/checkout@v3
      
      - name: Download artifacts
        uses: actions/download-artifact@v3
        with:
          name: dist-package
          path: dist/
      
      - name: Install Databricks CLI
        run: |
          pip install databricks-cli
      
      - name: Deploy to Databricks Dev
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_DEV_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_DEV_TOKEN }}
        run: |
          # Upload wheel to Databricks
          databricks fs cp dist/*.whl dbfs:/Volumes/dev/libraries/python/ --overwrite
          
          # Update job configuration
          # databricks jobs reset --job-id <job-id> --json-file job-config.json

  # ================================
  # Deploy to Production
  # ================================
  deploy-prod:
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - uses: actions/checkout@v3
      
      - name: Download artifacts
        uses: actions/download-artifact@v3
        with:
          name: dist-package
          path: dist/
      
      - name: Install Databricks CLI
        run: |
          pip install databricks-cli
      
      - name: Deploy to Databricks Prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_PROD_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_PROD_TOKEN }}
        run: |
          databricks fs cp dist/*.whl dbfs:/Volumes/prod/libraries/python/ --overwrite
      
      - name: Create GitHub Release
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          tag_name: v${{ github.run_number }}
          release_name: Release v${{ github.run_number }}
          body: |
            Automated release from main branch
          draft: false
          prerelease: false
"""

print(github_actions_workflow)

print("\n✅ CI/CD Best Practices:")
print("  - Run tests on every PR")
print("  - Require passing tests before merge")
print("  - Use caching to speed up builds")
print("  - Separate dev and prod deployments")
print("  - Use GitHub environments for approvals")
print("  - Tag releases for rollback capability")
print("  - Monitor deployment success")
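To make the coverage gate in the workflow less opaque, the sketch below shows roughly what such a gate checks. It reads the `line-rate` attribute from the root element of a Cobertura-style `coverage.xml` (the format written by `--cov-report=xml`) and compares it with the threshold. In the pipeline itself `coverage report --fail-under=80` already does this; the function names and the sample XML here are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def coverage_percent(xml_text: str) -> float:
    """Return overall line coverage (0-100) from coverage.xml content."""
    root = ET.fromstring(xml_text)
    return float(root.attrib["line-rate"]) * 100


def meets_threshold(xml_text: str, threshold: float = 80.0) -> bool:
    """Return True when line coverage is at or above the threshold."""
    return coverage_percent(xml_text) >= threshold


# Hypothetical, heavily trimmed report; a real file has many more elements.
sample_report = '<coverage line-rate="0.85" branch-rate="0"></coverage>'
print(round(coverage_percent(sample_report), 1), meets_threshold(sample_report))
```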

## Managing Notebooks in Git

Special considerations for Jupyter notebooks:

In [None]:
print("=== Managing Notebooks in Git ===")

notebook_management = """
CHALLENGES WITH NOTEBOOKS IN GIT
=================================

Problems:
1. JSON format makes diffs hard to read
2. Cell outputs create large diffs
3. Cell execution order can vary
4. Merge conflicts are difficult to resolve
5. Binary outputs (images) aren't diffable


SOLUTION 1: Clear Outputs Before Commit
========================================

Manual approach:
- Jupyter: Kernel → Restart & Clear Output
- Databricks: Clear → Clear All Outputs

Automated approach with nbstripout:

# Install nbstripout
pip install nbstripout

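# Configure for repository (one-time setup, run inside the repo)
nbstripout --install --attributes .gitattributes

# This registers the filter in .git/config and writes to .gitattributes:
*.ipynb filter=nbstripout
*.zpln filter=nbstripout
*.ipynb diff=ipynb

# Without --attributes, the rules go to .git/info/attributes, which is
# local and not shared. Commit .gitattributes so the whole team gets the
# rules; each collaborator still runs the install command once.
# Outputs are then stripped automatically when notebooks are staged.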

Pros:
- Cleaner diffs
- Smaller repo size
- Easier merges

Cons:
- Lose output history
- Need to re-run to see results


SOLUTION 2: Reviewable Notebook Outputs
========================================

Use ReviewNB for GitHub:
- Visual diff tool for notebooks
- Shows side-by-side comparisons
- Displays cell outputs
- Integrates with GitHub PRs

Setup:
1. Install ReviewNB GitHub app
2. Authorize for your repository
3. Automatic notebook diffs in PRs

Alternative: nbdime
- Command-line notebook diff tool
- Git integration available

pip install nbdime
nbdime config-git --enable


SOLUTION 3: Convert to Python Scripts
======================================

Use Jupytext for dual format:

# Install jupytext
pip install jupytext

# Pair notebook with .py file
jupytext --set-formats ipynb,py:percent notebook.ipynb

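# Creates notebook.py alongside notebook.ipynb
# Saving in Jupyter (with Jupytext installed) updates both files.
# From the command line, sync the pair explicitly:
jupytext --sync notebook.ipynb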

Benefits:
- .py files are easy to diff
- Can edit in IDE or notebook
- Better for code review
- Merge conflicts easier to resolve

Workflow:
1. Develop in .ipynb (interactive)
2. Save syncs to .py (version control)
3. Review .py in PRs (readable)
4. Collaborators can use either format


SOLUTION 4: Modular Code Approach (Recommended)
================================================

Separate notebooks from logic:

project/
  src/
    transformations/  # Pure Python modules
      cleaning.py
      business_logic.py
    
  notebooks/
    exploration/      # Keep outputs, don't commit
    production/       # Strip outputs, commit

Workflow:
1. Core logic in .py modules (easy to version)
2. Notebooks import and use modules
3. Exploratory notebooks in separate folder
4. Production notebooks are minimal (mostly imports)

Benefits:
- Clean separation of concerns
- .py files are Git-friendly
- Easy to test modules
- Notebooks focus on orchestration


NOTEBOOK BEST PRACTICES
=======================

✅ DO:
- Keep notebooks focused and concise
- Extract reusable logic to .py modules
- Use clear cell descriptions
- Run "Restart & Run All" before committing
- Clear outputs for production notebooks
- Keep exploratory notebooks separate

❌ DON'T:
- Commit notebooks with errors
- Include large outputs (images, DataFrames)
- Use notebooks for complex logic
- Commit notebooks with cell execution out of order
- Include sensitive data in outputs
"""

print(notebook_management)
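To see why clearing outputs shrinks diffs, it helps to look at what is removed. A `.ipynb` file is JSON, and each code cell carries an `outputs` list and an `execution_count`. The sketch below clears both using only the standard library. It is a simplified illustration of the idea, not how nbstripout is implemented, and the function name `strip_outputs` is hypothetical.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed."""
    cleaned = copy.deepcopy(notebook)  # leave the caller's dict untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A tiny in-memory notebook used only for demonstration.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["print('hi')"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hi\n"]}
            ],
        },
    ],
}

stripped = strip_outputs(example)
print(json.dumps(stripped["cells"][1], indent=2))
```

After stripping, the only parts of a cell that change between commits are its source and metadata, which is what reviewers usually care about.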

## Collaboration Patterns

Team workflows for effective collaboration:

In [None]:
print("=== Collaboration Patterns ===")

collaboration_guide = """
PAIR PROGRAMMING FOR DATA ENGINEERING
=====================================

Effective pair programming:

Driver-Navigator Pattern:
- Driver: Types code in Databricks notebook
- Navigator: Reviews logic, suggests improvements
- Switch roles every 30 minutes

Benefits:
- Knowledge sharing
- Fewer bugs
- Better design decisions
- Real-time code review

Best for:
- Complex transformations
- Learning new patterns
- Onboarding new team members
- Critical production code


ASYNC COLLABORATION
===================

Using GitHub for async work:

1. Clear Communication:
   - Detailed PR descriptions
   - Inline code comments
   - Design documents in repo
   - Architecture Decision Records (ADRs)

2. Documentation:
   - README.md for project overview
   - CONTRIBUTING.md for dev setup
   - docs/ for detailed guides
   - Docstrings in all functions

3. Issue Tracking:
   - Use GitHub Issues for tasks
   - Link PRs to issues
   - Use labels for categorization
   - Use milestones for sprints

4. Project Boards:
   - Kanban board for workflow
   - Columns: Backlog, In Progress, Review, Done
   - Automate card movement


HANDLING CONFLICTS
==================

Prevention:
- Small, frequent commits
- Short-lived feature branches
- Regular rebasing from main
- Clear code ownership

Resolution:
- Communicate with teammate
- Understand both changes
- If unsure, review the conflict with the other author instead of guessing
- Test after resolving
- Ask for help if complex


KNOWLEDGE SHARING
=================

Team Practices:
1. Code reviews as learning opportunities
2. Regular knowledge sharing sessions
3. Internal documentation wiki
4. Lunch & learn presentations
5. Recorded demos of complex features
6. Onboarding checklist and buddy system

Documentation:
- Architecture diagrams
- Data lineage documentation
- Runbooks for common tasks
- Troubleshooting guides
- Best practices guide


RELEASE MANAGEMENT
==================

Semantic Versioning:
- MAJOR.MINOR.PATCH (e.g., 2.3.1)
- MAJOR: Breaking changes
- MINOR: New features (backward compatible)
- PATCH: Bug fixes

Release Process:
1. Create release branch from main
2. Test thoroughly on staging
3. Create GitHub release with notes
4. Tag commit with version number
5. Deploy to production
6. Monitor for issues
7. Keep previous version for rollback

Release Notes Template:
---
## v2.3.0 - 2024-01-15

### Added
- New customer segmentation pipeline
- Revenue analytics dashboard integration

### Changed
- Improved performance of aggregation queries (2x faster)
- Updated dependencies (see requirements.txt)

### Fixed
- Fixed null handling in revenue calculation
- Resolved data quality check false positives

### Deprecated
- Old segmentation logic (will be removed in v3.0.0)

### Migration Guide
- Update import: from `old_module` to `new_module`
- Run migration script: `python scripts/migrate_v2_to_v3.py`
---
"""

print(collaboration_guide)
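The semantic versioning rules above translate directly into code: bumping one part resets every part to its right. The following minimal sketch assumes plain `MAJOR.MINOR.PATCH` strings with no pre-release or build suffixes; the function name `bump_version` is hypothetical.

```python
def bump_version(version: str, part: str) -> str:
    """Return the next MAJOR.MINOR.PATCH version for the given change type."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":  # breaking changes
        return f"{major + 1}.0.0"
    if part == "minor":  # new, backward-compatible features
        return f"{major}.{minor + 1}.0"
    if part == "patch":  # bug fixes
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"part must be 'major', 'minor', or 'patch', got {part!r}")


for change in ("patch", "minor", "major"):
    print(change, "->", bump_version("2.3.1", change))
```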

## Summary

**Key Takeaways:**

1. **Databricks Repos Integration**:
   - Native Git support in Databricks workspace
   - Visual Git operations (commit, push, pull, branch)
   - Collaborative development environment
   - Supports multiple Git providers

2. **Branching Strategies**:
   - GitHub Flow: Simple, continuous deployment
   - Git Flow: Complex releases, multiple versions
   - Trunk-Based: Fast-moving teams with mature CI/CD
   - Recommendation: GitHub Flow with environment branches

3. **Git Workflows**:
   - Small, focused commits with clear messages
   - Feature branches for every change
   - Pull requests for code review
   - Regular rebasing to stay current

4. **Code Review**:
   - Comprehensive checklists for PySpark code
   - Constructive, helpful comments
   - Focus on correctness, performance, quality
   - Fast turnaround times (< 24 hours)

5. **CI/CD Automation**:
   - Automated testing on every PR
   - Linting and code quality checks
   - Build and deploy pipelines
   - Environment-specific deployments

6. **Notebook Management**:
   - Clear outputs before commit (nbstripout)
   - Use ReviewNB for visual diffs
   - Convert to Python scripts (Jupytext)
   - Modular code approach (recommended)

7. **Collaboration**:
   - Pair programming for complex features
   - Async collaboration with clear communication
   - Knowledge sharing practices
   - Release management with semantic versioning

**Best Practices for Git/GitHub in PySpark**:
- Integrate Databricks Repos early in project
- Use consistent branching strategy across team
- Write clear commit messages and PR descriptions
- Automate testing and deployment
- Extract logic to .py modules for better version control
- Clear notebook outputs before committing
- Conduct thorough code reviews
- Document everything (README, docstrings, ADRs)
- Tag releases for easy rollback
- Foster collaborative culture

This completes Section 6 on project structure and collaboration! You now have a complete understanding of how to build, organize, and collaborate on production-grade PySpark projects.

## Exercise

Practice Git and GitHub workflows:

1. Set up a Databricks Repos connection to a GitHub repository
2. Create a feature branch for a new transformation
3. Make commits following best practices
4. Write a comprehensive pull request description
5. Set up a .gitignore file for your project
6. Create a GitHub Actions workflow for testing
7. Practice code review on a teammate's PR
8. Set up nbstripout for notebook output management

In [None]:
# Your exercise notes here

# 1. Databricks Repos setup checklist
repos_setup = """
□ Navigate to Repos in Databricks
□ Click "Add Repo"
□ Connect to GitHub
□ Clone repository
□ Verify files are accessible
"""

# 2. Your feature branch name
feature_branch = "feature/your-feature-name"

# 3. Your commit messages
commit_messages = """
feat: Add customer segmentation logic

- Implement RFM segmentation algorithm
- Add data quality checks
- Include unit tests with 85% coverage
"""

# 4. Your PR description template
pr_description = """
## Summary
[Your summary]

## Changes
- [Change 1]
- [Change 2]

## Testing
- [x] Unit tests pass
- [ ] Integration tests pass

## Checklist
- [ ] Code follows style guide
- [ ] Tests added
- [ ] Documentation updated
"""

# Print your exercise plan
# print(repos_setup)
# print(commit_messages)
# print(pr_description)