diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..f0e837c --- /dev/null +++ b/.gitignore @@ -0,0 +1,77 @@ +# Python +__pycache__/ +*.py[cod] +*$py.class +*.so +.Python +build/ +develop-eggs/ +dist/ +downloads/ +eggs/ +.eggs/ +lib/ +lib64/ +parts/ +sdist/ +var/ +wheels/ +*.egg-info/ +.installed.cfg +*.egg +MANIFEST + +# Virtual environments +venv/ +env/ +ENV/ +env.bak/ +venv.bak/ +.venv/ + +# IDEs +.vscode/ +.idea/ +*.swp +*.swo +*~ +.DS_Store + +# Jupyter Notebook +.ipynb_checkpoints +*.ipynb_checkpoints/ + +# Database files +*.db +*.sqlite +*.sqlite3 +etl_output.db + +# Data files (optional - uncomment if you don't want to track data files) +# *.csv +# *.json +# *.xlsx +# *.parquet + +# Logs +*.log +logs/ + +# Environment variables +.env +.env.local + +# Testing +.pytest_cache/ +.coverage +htmlcov/ +.tox/ + +# OS +Thumbs.db +.DS_Store + +# Temporary files +tmp/ +temp/ +*.tmp diff --git a/01-python-fundamentals/README.md b/01-python-fundamentals/README.md new file mode 100644 index 0000000..8611448 --- /dev/null +++ b/01-python-fundamentals/README.md @@ -0,0 +1,56 @@ +# Python Fundamentals + +Welcome to the Python Fundamentals section! This is where your journey begins. + +## πŸ“š What You'll Learn + +- Python syntax and basic data types +- Control structures (if, for, while) +- Functions and modules +- Object-oriented programming basics +- Error handling +- File operations + +## πŸ“– Lessons + +1. [Getting Started with Python](lessons/01-getting-started.md) +2. [Variables and Data Types](lessons/02-variables-datatypes.md) +3. [Control Flow](lessons/03-control-flow.md) +4. [Functions](lessons/04-functions.md) +5. [Object-Oriented Programming](lessons/05-oop-basics.md) +6. [Error Handling](lessons/06-error-handling.md) +7. [File I/O](lessons/07-file-io.md) + +## πŸ’» Examples + +Check the `examples/` folder for working code examples that demonstrate each concept. + +## ✏️ Exercises + +Complete the exercises in the `exercises/` folder to practice what you've learned. Solutions are provided, but try to solve them on your own first! + +## ⏱️ Estimated Time + +2-4 weeks, depending on your prior programming experience and time commitment. 
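+
+To give you a feel for where these topics lead, here is a minimal sketch of the todo-list project idea described further down this page. The file name and task structure are illustrative, not prescribed:
+
+```python
+# Sketch only: persist a small task list to disk with the json module.
+import json
+
+TASKS_FILE = "tasks.json"  # illustrative file name
+
+
+def load_tasks():
+    """Return the saved task list, or an empty list on first run."""
+    try:
+        with open(TASKS_FILE) as f:
+            return json.load(f)
+    except FileNotFoundError:
+        return []
+
+
+def save_tasks(tasks):
+    """Write the task list back to disk as JSON."""
+    with open(TASKS_FILE, "w") as f:
+        json.dump(tasks, f, indent=2)
+
+
+tasks = load_tasks()
+tasks.append({"title": "Finish lesson 1", "done": False})
+save_tasks(tasks)
+print(f"Saved {len(tasks)} task(s).")
+```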
+ +## βœ… Completion Checklist + +- [ ] Complete all lessons +- [ ] Run all examples +- [ ] Solve all exercises +- [ ] Build a small project using concepts learned + +## 🎯 Project Idea + +Build a simple command-line todo list application that: +- Adds tasks +- Removes tasks +- Marks tasks as complete +- Saves tasks to a file +- Loads tasks from a file + +## πŸ“š Additional Resources + +- [Python Official Tutorial](https://docs.python.org/3/tutorial/) +- [Real Python - Python Basics](https://realpython.com/tutorials/basics/) +- [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/) diff --git a/01-python-fundamentals/examples/hello_world.py b/01-python-fundamentals/examples/hello_world.py new file mode 100644 index 0000000..de87ded --- /dev/null +++ b/01-python-fundamentals/examples/hello_world.py @@ -0,0 +1,34 @@ +""" +Basic Python Examples - Hello World and Simple Operations +""" + +# Simple print statement +print("Hello, Data Engineer!") + +# Print with multiple values +print("Welcome to", "Python", "Programming") + +# Basic arithmetic +print("\n=== Basic Arithmetic ===") +print("2 + 3 =", 2 + 3) +print("10 - 4 =", 10 - 4) +print("5 * 6 =", 5 * 6) +print("20 / 4 =", 20 / 4) +print("17 // 5 =", 17 // 5) # Integer division +print("17 % 5 =", 17 % 5) # Modulus +print("2 ** 8 =", 2 ** 8) # Exponentiation + +# String operations +print("\n=== String Operations ===") +name = "Data Engineer" +print("Hello,", name) +print("Length of name:", len(name)) +print("Uppercase:", name.upper()) +print("Lowercase:", name.lower()) + +# Comments +# This is a single-line comment +""" +This is a multi-line comment +or a docstring +""" diff --git a/01-python-fundamentals/exercises/README.md b/01-python-fundamentals/exercises/README.md new file mode 100644 index 0000000..0348bf6 --- /dev/null +++ b/01-python-fundamentals/exercises/README.md @@ -0,0 +1,263 @@ +# Python Fundamentals Exercises + +## πŸ“ Overview + +These exercises are designed to reinforce the concepts you've learned in the Python Fundamentals section. Start with Exercise 1 and work your way through them sequentially. + +## 🎯 How to Use These Exercises + +1. Read the exercise description +2. Try to solve it on your own first +3. Test your solution +4. Compare with the provided solution (if available) +5. 
Understand any differences + +## πŸ“š Exercise List + +### Exercise 1: Variables and Data Types +**Difficulty**: ⭐ Beginner + +Create a program that: +- Stores your name, age, and city in variables +- Prints a formatted message using these variables +- Calculates and prints your age in months and days + +**Example Output**: +``` +Name: Alice +Age: 25 years (300 months, 9125 days) +City: New York +``` + +**Skills Practiced**: Variables, basic math, string formatting + +--- + +### Exercise 2: Number Calculator +**Difficulty**: ⭐ Beginner + +Write a simple calculator that: +- Takes two numbers as input +- Performs addition, subtraction, multiplication, and division +- Prints all results + +**Example**: +```python +# Input: 10, 5 +# Output: +# Addition: 15 +# Subtraction: 5 +# Multiplication: 50 +# Division: 2.0 +``` + +**Skills Practiced**: Variables, arithmetic operators, input/output + +--- + +### Exercise 3: Grade Calculator +**Difficulty**: ⭐⭐ Intermediate + +Create a program that: +- Takes a score (0-100) as input +- Determines the letter grade: + - A: 90-100 + - B: 80-89 + - C: 70-79 + - D: 60-69 + - F: Below 60 +- Prints the grade and a message + +**Skills Practiced**: If/elif/else, comparison operators + +--- + +### Exercise 4: List Operations +**Difficulty**: ⭐⭐ Intermediate + +Write a program that: +- Creates a list of numbers +- Finds the sum, average, minimum, and maximum +- Prints all results + +**Example**: +```python +numbers = [10, 25, 30, 15, 40] +# Output: +# Sum: 120 +# Average: 24.0 +# Minimum: 10 +# Maximum: 40 +``` + +**Skills Practiced**: Lists, loops, built-in functions + +--- + +### Exercise 5: String Manipulator +**Difficulty**: ⭐⭐ Intermediate + +Create a program that: +- Takes a sentence as input +- Counts the number of words +- Counts the number of vowels +- Converts to uppercase and lowercase +- Reverses the string + +**Skills Practiced**: Strings, string methods, loops + +--- + +### Exercise 6: Number Guessing Game +**Difficulty**: ⭐⭐ Intermediate + +Build a game that: +- Generates a random number between 1 and 100 +- Asks the user to guess +- Provides hints (too high/too low) +- Counts number of guesses +- Congratulates on correct guess + +**Skills Practiced**: Loops, conditionals, random module + +--- + +### Exercise 7: Shopping List Manager +**Difficulty**: ⭐⭐⭐ Advanced + +Create a program that: +- Allows adding items to a shopping list +- Allows removing items +- Allows viewing all items +- Allows clearing the list +- Uses a menu system + +**Skills Practiced**: Lists, loops, functions, user input + +--- + +### Exercise 8: Contact Book +**Difficulty**: ⭐⭐⭐ Advanced + +Build a simple contact book that: +- Stores contacts (name, phone, email) +- Allows adding new contacts +- Allows searching by name +- Allows displaying all contacts +- Uses dictionaries to store data + +**Skills Practiced**: Dictionaries, functions, user input + +--- + +### Exercise 9: File Word Counter +**Difficulty**: ⭐⭐⭐ Advanced + +Write a program that: +- Reads a text file +- Counts total words +- Counts unique words +- Finds most common words +- Writes results to a new file + +**Skills Practiced**: File I/O, string processing, dictionaries + +--- + +### Exercise 10: Mini Project - Todo List Application +**Difficulty**: ⭐⭐⭐⭐ Challenge + +Build a command-line todo list app that: +- Adds tasks +- Marks tasks as complete +- Deletes tasks +- Views all tasks +- Saves to file +- Loads from file on start + +**Features**: +- Menu-driven interface +- Data persistence +- Input validation +- Error 
handling + +**Skills Practiced**: All Python fundamentals + +--- + +## πŸ§ͺ Testing Your Solutions + +### Basic Testing +```python +# Test with different inputs +# Check edge cases +# Verify output matches expected + +# Example: +def add(a, b): + return a + b + +# Test +assert add(2, 3) == 5 +assert add(-1, 1) == 0 +assert add(0, 0) == 0 +print("All tests passed!") +``` + +## πŸ’‘ Tips for Success + +1. **Start Simple**: Get basic functionality working first +2. **Test Often**: Test after each small change +3. **Read Errors**: Error messages tell you what's wrong +4. **Use Print**: Print statements help debug +5. **Take Breaks**: Step away if stuck +6. **Ask for Help**: Use communities when truly stuck + +## πŸ“ Submission Guidelines + +When practicing: +1. Create a file for each exercise (e.g., `exercise_01.py`) +2. Add comments explaining your approach +3. Test with multiple inputs +4. Compare with solution (if provided) +5. Refactor to improve code quality + +## 🎯 Bonus Challenges + +For each exercise, try to: +- Add input validation +- Handle errors gracefully +- Add more features +- Optimize your code +- Write cleaner code + +## πŸ“š Additional Practice + +After completing these exercises: +1. **LeetCode Easy**: Try Python easy problems +2. **HackerRank**: Python basics track +3. **Codewars**: 8 kyu and 7 kyu challenges +4. **Exercism**: Python track with mentoring + +## πŸ† Next Steps + +Once you've completed all exercises: +- Move to `02-python-data-engineering` +- Start building small projects +- Contribute your own exercises +- Help others learn + +## πŸ“– Solutions + +Solutions are available in the `solutions/` folder, but try to solve exercises on your own first! Learning happens when you struggle through problems. + +## 🀝 Getting Help + +If you're stuck: +1. Review the relevant lesson +2. Check Python documentation +3. Search for similar problems online +4. Ask specific questions in communities +5. Look at the solution as a last resort + +Good luck with your exercises! diff --git a/01-python-fundamentals/lessons/01-getting-started.md b/01-python-fundamentals/lessons/01-getting-started.md new file mode 100644 index 0000000..d3264b3 --- /dev/null +++ b/01-python-fundamentals/lessons/01-getting-started.md @@ -0,0 +1,162 @@ +# Getting Started with Python + +## What is Python? + +Python is a high-level, interpreted programming language known for its simplicity and readability. It's one of the most popular languages for data engineering, data science, and general-purpose programming. + +## Why Python for Data Engineering? + +- **Easy to Learn**: Clear and readable syntax +- **Extensive Libraries**: Rich ecosystem for data manipulation (Pandas, NumPy) +- **Cross-platform**: Works on Windows, Mac, and Linux +- **Large Community**: Extensive resources and support +- **Integration**: Works well with databases and data tools + +## Installing Python + +### Windows +1. Download Python from [python.org](https://www.python.org/downloads/) +2. Run the installer (check "Add Python to PATH") +3. 
Verify installation: `python --version` + +### Mac +```bash +# Using Homebrew +brew install python3 +python3 --version +``` + +### Linux +```bash +# Ubuntu/Debian +sudo apt update +sudo apt install python3 python3-pip + +# Verify +python3 --version +``` + +## Your First Python Program + +Create a file called `hello.py`: + +```python +print("Hello, Data Engineer!") +``` + +Run it: +```bash +python hello.py +``` + +## Python Interactive Shell + +You can also use Python interactively: + +```bash +python +>>> print("Hello!") +Hello! +>>> 2 + 2 +4 +>>> exit() +``` + +## Setting Up Your Development Environment + +### Option 1: VS Code (Recommended) +1. Install [VS Code](https://code.visualstudio.com/) +2. Install Python extension +3. Create a workspace for your projects + +### Option 2: PyCharm +1. Install [PyCharm Community Edition](https://www.jetbrains.com/pycharm/) +2. Create a new Python project + +### Option 3: Jupyter Notebook +Great for data exploration: +```bash +pip install jupyter +jupyter notebook +``` + +## Virtual Environments + +Always use virtual environments for your projects: + +```bash +# Create a virtual environment +python -m venv myenv + +# Activate it +# Windows: +myenv\Scripts\activate +# Mac/Linux: +source myenv/bin/activate + +# Install packages +pip install pandas + +# Deactivate +deactivate +``` + +## Basic Python Syntax + +### Comments +```python +# This is a single-line comment + +""" +This is a +multi-line comment +or docstring +""" +``` + +### Print Statement +```python +print("Hello World") +print("Value:", 42) +``` + +### Basic Arithmetic +```python +# Addition +print(5 + 3) # 8 + +# Subtraction +print(10 - 4) # 6 + +# Multiplication +print(3 * 4) # 12 + +# Division +print(15 / 3) # 5.0 + +# Integer Division +print(15 // 4) # 3 + +# Modulus +print(15 % 4) # 3 + +# Exponentiation +print(2 ** 3) # 8 +``` + +## Next Steps + +Now that you have Python installed and running, proceed to the next lesson on variables and data types. + +## Practice Exercise + +1. Install Python on your computer +2. Set up VS Code or your preferred editor +3. Create a Python file that prints your name +4. Use the Python interactive shell to calculate: (10 + 5) * 3 +5. Create a virtual environment for this course + +## Additional Resources + +- [Python.org Beginner's Guide](https://wiki.python.org/moin/BeginnersGuide) +- [Real Python - Installation & Setup](https://realpython.com/installing-python/) diff --git a/02-python-data-engineering/README.md b/02-python-data-engineering/README.md new file mode 100644 index 0000000..60aa102 --- /dev/null +++ b/02-python-data-engineering/README.md @@ -0,0 +1,204 @@ +# Python for Data Engineering + +This section focuses on using Python for data engineering tasks - the practical skills you'll use daily as a data engineer. + +## πŸ“š What You'll Learn + +- Working with Pandas for data manipulation +- Reading and writing various file formats (CSV, JSON, Parquet, Excel) +- API interactions and web scraping +- Data cleaning and transformation +- Working with dates and times +- Connecting to databases with Python +- Error handling in data pipelines + +## πŸ“– Lessons + +1. [Introduction to Pandas](lessons/01-pandas-intro.md) +2. [Data Cleaning](lessons/02-data-cleaning.md) +3. [File Formats](lessons/03-file-formats.md) +4. [Working with APIs](lessons/04-apis.md) +5. [Database Connections](lessons/05-database-connections.md) +6. [Date and Time Handling](lessons/06-datetime.md) +7. 
[Data Validation](lessons/07-data-validation.md) + +## πŸ’» Examples + +The `examples/` folder contains practical code examples: +- `pandas_basics.py` - Pandas fundamentals +- `csv_processing.py` - CSV file operations +- `json_handling.py` - Working with JSON +- `api_requests.py` - API interactions +- `database_operations.py` - Database connectivity + +## ✏️ Exercises + +Practice exercises in `exercises/` folder: +- Data cleaning challenges +- File format conversions +- API data extraction +- Database operations +- Real-world scenarios + +## πŸ› οΈ Required Libraries + +```bash +# Install required packages +pip install pandas numpy +pip install requests +pip install openpyxl # for Excel files +pip install pyarrow # for Parquet files +pip install sqlalchemy psycopg2-binary +``` + +## ⏱️ Estimated Time + +4-6 weeks with hands-on practice + +## βœ… Completion Checklist + +- [ ] Master Pandas basics +- [ ] Work with CSV, JSON, and Excel files +- [ ] Make API requests +- [ ] Connect to databases +- [ ] Clean and transform real datasets +- [ ] Handle errors properly +- [ ] Complete all exercises + +## 🎯 Project Ideas + +### Project 1: Data Pipeline +Build a pipeline that: +- Fetches data from an API +- Cleans and transforms the data +- Saves to database and CSV + +### Project 2: Data Integration +Combine data from: +- Multiple CSV files +- JSON API +- Database tables +- Output: Clean, unified dataset + +### Project 3: Automated Report +Create a script that: +- Reads data from database +- Performs analysis +- Generates Excel report +- Sends email notification + +## πŸ“Š Real-World Scenarios + +### E-commerce Data Processing +- Process order data from CSV +- Validate customer information +- Calculate metrics +- Load into database + +### API Data Extraction +- Fetch weather data from API +- Parse JSON responses +- Store in structured format +- Handle rate limits and errors + +### Log File Analysis +- Read server log files +- Parse and extract information +- Identify patterns +- Generate reports + +## πŸ”‘ Key Skills + +### Data Manipulation with Pandas +```python +import pandas as pd + +# Read data +df = pd.read_csv('data.csv') + +# Basic operations +df.head() +df.info() +df.describe() + +# Filtering +df[df['age'] > 30] + +# Grouping +df.groupby('category')['sales'].sum() + +# Transformation +df['new_column'] = df['old_column'] * 2 +``` + +### File Operations +```python +# CSV +df = pd.read_csv('file.csv') +df.to_csv('output.csv', index=False) + +# JSON +df = pd.read_json('file.json') +df.to_json('output.json') + +# Excel +df = pd.read_excel('file.xlsx') +df.to_excel('output.xlsx', index=False) + +# Parquet +df = pd.read_parquet('file.parquet') +df.to_parquet('output.parquet') +``` + +### API Requests +```python +import requests + +response = requests.get('https://api.example.com/data') +data = response.json() +df = pd.DataFrame(data) +``` + +### Database Operations +```python +from sqlalchemy import create_engine + +engine = create_engine('postgresql://user:pass@localhost/db') +df = pd.read_sql('SELECT * FROM table', engine) +df.to_sql('new_table', engine, if_exists='replace') +``` + +## πŸ’‘ Best Practices + +1. **Read Documentation**: Pandas docs are excellent +2. **Use Vectorization**: Avoid loops when possible +3. **Memory Management**: Be aware of large datasets +4. **Error Handling**: Always handle exceptions +5. **Data Validation**: Validate before processing +6. **Type Hints**: Use type hints in functions +7. 
**Testing**: Write tests for data transformations + +## πŸ“š Additional Resources + +- [Pandas Documentation](https://pandas.pydata.org/docs/) +- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) +- [Real Python - Pandas Tutorials](https://realpython.com/learning-paths/pandas-data-science/) +- [Kaggle Learn - Pandas](https://www.kaggle.com/learn/pandas) + +## Common Pitfalls to Avoid + +1. **Chained Indexing**: Use `.loc` instead +2. **Modifying During Iteration**: Use `.apply()` or vectorization +3. **Not Checking Data Types**: Always verify dtypes +4. **Ignoring Missing Data**: Handle NaN values properly +5. **Memory Issues**: Use chunking for large files +6. **Silent Failures**: Add logging and error handling + +## Next Steps + +After completing this section, you'll be able to: +- Build data ingestion pipelines +- Process various data formats +- Interact with APIs and databases +- Handle real-world data issues +- Write production-quality Python code for data engineering diff --git a/02-python-data-engineering/examples/pandas_basics.py b/02-python-data-engineering/examples/pandas_basics.py new file mode 100644 index 0000000..5fe1690 --- /dev/null +++ b/02-python-data-engineering/examples/pandas_basics.py @@ -0,0 +1,185 @@ +""" +Pandas Basics - Essential Operations for Data Engineers +""" + +import pandas as pd +import numpy as np + +# Creating DataFrames +print("=" * 50) +print("Creating DataFrames") +print("=" * 50) + +# From dictionary +data = { + 'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'], + 'age': [25, 30, 35, 28, 32], + 'city': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston'], + 'salary': [70000, 85000, 90000, 75000, 80000] +} +df = pd.DataFrame(data) +print("\nDataFrame from dictionary:") +print(df) + +# Basic information +print("\n" + "=" * 50) +print("Basic DataFrame Information") +print("=" * 50) +print("\nShape:", df.shape) +print("\nColumns:", df.columns.tolist()) +print("\nData types:") +print(df.dtypes) +print("\nBasic statistics:") +print(df.describe()) + +# Selecting data +print("\n" + "=" * 50) +print("Selecting Data") +print("=" * 50) + +# Select a column +print("\nNames column:") +print(df['name']) + +# Select multiple columns +print("\nNames and ages:") +print(df[['name', 'age']]) + +# Select rows by condition +print("\nPeople older than 30:") +print(df[df['age'] > 30]) + +# Multiple conditions +print("\nPeople older than 30 with salary > 80000:") +print(df[(df['age'] > 30) & (df['salary'] > 80000)]) + +# Sorting +print("\n" + "=" * 50) +print("Sorting Data") +print("=" * 50) + +print("\nSorted by age (ascending):") +print(df.sort_values('age')) + +print("\nSorted by salary (descending):") +print(df.sort_values('salary', ascending=False)) + +# Adding new columns +print("\n" + "=" * 50) +print("Adding New Columns") +print("=" * 50) + +df['monthly_salary'] = df['salary'] / 12 +df['senior'] = df['age'] > 30 +print(df) + +# Grouping and aggregation +print("\n" + "=" * 50) +print("Grouping and Aggregation") +print("=" * 50) + +# Group by senior status +print("\nAverage salary by senior status:") +print(df.groupby('senior')['salary'].mean()) + +# Multiple aggregations +print("\nMultiple statistics by senior status:") +print(df.groupby('senior')['salary'].agg(['mean', 'min', 'max', 'count'])) + +# Handling missing data +print("\n" + "=" * 50) +print("Handling Missing Data") +print("=" * 50) + +# Create DataFrame with missing values +df_missing = pd.DataFrame({ + 'A': [1, 2, np.nan, 4], + 'B': [5, np.nan, np.nan, 8], 
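+    # note: 'A' and 'B' are upcast to float64 because np.nan is a float;
+    # 'C' below stays int64 since it has no missing values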
+ 'C': [9, 10, 11, 12] +}) + +print("\nDataFrame with missing values:") +print(df_missing) + +print("\nCheck for missing values:") +print(df_missing.isnull()) + +print("\nCount of missing values per column:") +print(df_missing.isnull().sum()) + +print("\nDrop rows with any missing values:") +print(df_missing.dropna()) + +print("\nFill missing values with 0:") +print(df_missing.fillna(0)) + +print("\nFill missing values with column mean:") +print(df_missing.fillna(df_missing.mean())) + +# Merging DataFrames +print("\n" + "=" * 50) +print("Merging DataFrames") +print("=" * 50) + +# Create two DataFrames +df1 = pd.DataFrame({ + 'employee_id': [1, 2, 3], + 'name': ['Alice', 'Bob', 'Charlie'], + 'department': ['Sales', 'IT', 'HR'] +}) + +df2 = pd.DataFrame({ + 'employee_id': [1, 2, 4], + 'salary': [70000, 85000, 75000] +}) + +print("\nDataFrame 1:") +print(df1) +print("\nDataFrame 2:") +print(df2) + +print("\nInner join:") +print(pd.merge(df1, df2, on='employee_id', how='inner')) + +print("\nLeft join:") +print(pd.merge(df1, df2, on='employee_id', how='left')) + +print("\nOuter join:") +print(pd.merge(df1, df2, on='employee_id', how='outer')) + +# Apply functions +print("\n" + "=" * 50) +print("Applying Functions") +print("=" * 50) + +# Apply function to a column +df['name_length'] = df['name'].apply(len) +print("\nAdded name length column:") +print(df[['name', 'name_length']]) + +# Apply custom function +def categorize_salary(salary): + if salary < 75000: + return 'Low' + elif salary < 85000: + return 'Medium' + else: + return 'High' + +df['salary_category'] = df['salary'].apply(categorize_salary) +print("\nSalary categories:") +print(df[['name', 'salary', 'salary_category']]) + +print("\n" + "=" * 50) +print("String Operations") +print("=" * 50) + +# String methods +df['name_upper'] = df['name'].str.upper() +df['city_lower'] = df['city'].str.lower() +print("\nString transformations:") +print(df[['name', 'name_upper', 'city', 'city_lower']]) + +# Filtering with string methods +print("\nNames containing 'a':") +print(df[df['name'].str.contains('a', case=False)]) diff --git a/03-sql-fundamentals/README.md b/03-sql-fundamentals/README.md new file mode 100644 index 0000000..baf6664 --- /dev/null +++ b/03-sql-fundamentals/README.md @@ -0,0 +1,131 @@ +# SQL Fundamentals + +Welcome to SQL Fundamentals! Here you'll learn the essential skills for working with relational databases. + +## πŸ“š What You'll Learn + +- Basic SQL query syntax +- Filtering and sorting data +- Joining tables +- Aggregate functions +- Grouping data +- Subqueries and CTEs + +## πŸ“– Lessons + +1. [Introduction to SQL and Databases](lessons/01-intro-to-sql.md) +2. [SELECT Statements](lessons/02-select-statements.md) +3. [Filtering with WHERE](lessons/03-where-clause.md) +4. [Sorting and Limiting Results](lessons/04-order-limit.md) +5. [Joins](lessons/05-joins.md) +6. [Aggregate Functions](lessons/06-aggregates.md) +7. [GROUP BY and HAVING](lessons/07-groupby-having.md) +8. [Subqueries](lessons/08-subqueries.md) +9. 
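[Common Table Expressions (CTEs)](lessons/09-ctes.md)
+
+As a preview of lesson 9, a CTE lets you name an intermediate result and keep a complex query readable. The example below runs against the sample `sales` and `customers` tables described in the next section; the `amount` and `name` columns are assumed for illustration:
+
+```sql
+-- Total spend per customer, via a named intermediate result
+WITH customer_totals AS (
+    SELECT customer_id, SUM(amount) AS total_spend
+    FROM sales
+    GROUP BY customer_id
+)
+SELECT c.name, t.total_spend
+FROM customers c
+JOIN customer_totals t ON t.customer_id = c.customer_id
+ORDER BY t.total_spend DESC;
+```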
+
+## πŸ’» Practice Database
+
+We'll use a sample database with the following tables:
+
+- `employees` - Employee information
+- `departments` - Department details
+- `projects` - Project information
+- `sales` - Sales transactions
+- `customers` - Customer data
+
+## πŸ—„οΈ Setting Up Your Database
+
+### Using SQLite (Easiest for Beginners)
+```bash
+# SQLite comes pre-installed on macOS and most Linux systems
+sqlite3 practice.db
+```
+
+### Using PostgreSQL (Recommended for Production)
+```bash
+# Install PostgreSQL
+# Ubuntu/Debian
+sudo apt install postgresql
+
+# Mac
+brew install postgresql
+brew services start postgresql
+
+# Create a database
+createdb learning_db
+psql learning_db
+```
+
+## πŸ“ Sample Queries
+
+Check the `queries/` folder for example SQL queries organized by topic.
+
+## ✏️ Exercises
+
+Complete the exercises in the `exercises/` folder. Each exercise includes:
+- Problem description
+- Sample data
+- Expected output
+- Solution (try solving on your own first!)
+
+## ⏱️ Estimated Time
+
+3-4 weeks of consistent practice
+
+## βœ… Completion Checklist
+
+- [ ] Complete all lessons
+- [ ] Run all sample queries
+- [ ] Solve all exercises
+- [ ] Create your own practice database
+- [ ] Write 50+ SQL queries
+
+## 🎯 Project Idea
+
+Build a sample e-commerce database with:
+- Products table
+- Orders table
+- Customers table
+- Order items table
+
+Write queries to:
+- Find top-selling products
+- Calculate revenue by month
+- Identify best customers
+- Analyze product categories
+
+## πŸ“š Additional Resources
+
+- [SQLZoo](https://sqlzoo.net/) - Interactive SQL tutorial
+- [PostgreSQL Tutorial](https://www.postgresqltutorial.com/)
+- [LeetCode SQL Problems](https://leetcode.com/problemset/database/)
+- [Mode SQL Tutorial](https://mode.com/sql-tutorial/)
+
+## πŸ”‘ Key SQL Concepts
+
+### Basic Query Structure
+```sql
+SELECT column1, column2
+FROM table_name
+WHERE condition
+ORDER BY column1
+LIMIT 10;
+```
+
+### Common Data Types
+- `INTEGER` / `INT` - Whole numbers
+- `DECIMAL` / `NUMERIC` - Decimal numbers
+- `VARCHAR(n)` - Variable-length text
+- `TEXT` - Long text
+- `DATE` - Date values
+- `TIMESTAMP` - Date and time
+- `BOOLEAN` - True/false
+
+### SQL Keywords to Know
+- `SELECT` - Retrieve data
+- `FROM` - Specify table
+- `WHERE` - Filter rows
+- `JOIN` - Combine tables
+- `GROUP BY` - Group rows
+- `HAVING` - Filter groups
+- `ORDER BY` - Sort results
+- `LIMIT` - Restrict number of rows
diff --git a/03-sql-fundamentals/queries/01-basic-selects.sql b/03-sql-fundamentals/queries/01-basic-selects.sql
new file mode 100644
index 0000000..5c065b0
--- /dev/null
+++ b/03-sql-fundamentals/queries/01-basic-selects.sql
@@ -0,0 +1,56 @@
+-- Basic SELECT Queries
+-- This file contains examples of basic SELECT statements
+
+-- Select all columns from a table
+SELECT * FROM employees;
+
+-- Select specific columns
+SELECT first_name, last_name, email
+FROM employees;
+
+-- Select with column aliases
+SELECT
+    first_name AS "First Name",
+    last_name AS "Last Name",
+    salary AS "Annual Salary"
+FROM employees;
+
+-- Select distinct values (remove duplicates)
+SELECT DISTINCT department_id
+FROM employees;
+
+SELECT DISTINCT city, country
+FROM customers;
+
+-- Select with calculations
+SELECT
+    first_name,
+    last_name,
+    salary,
+    salary * 1.1 AS salary_with_raise,
+    salary / 12 AS monthly_salary
+FROM employees;
+
+-- Concatenate strings
+SELECT
+    first_name || ' ' || last_name AS full_name,
+    email
+FROM employees;
+
+-- Using CONCAT
function (in some SQL dialects) +SELECT + CONCAT(first_name, ' ', last_name) AS full_name, + email +FROM employees; + +-- Select with LIMIT (restrict number of rows) +SELECT first_name, last_name +FROM employees +LIMIT 5; + +-- Select current date/time +SELECT CURRENT_DATE; +SELECT CURRENT_TIMESTAMP; + +-- Select literal values +SELECT 'Hello' AS greeting, 42 AS answer; diff --git a/03-sql-fundamentals/queries/02-where-clause.sql b/03-sql-fundamentals/queries/02-where-clause.sql new file mode 100644 index 0000000..b4440bb --- /dev/null +++ b/03-sql-fundamentals/queries/02-where-clause.sql @@ -0,0 +1,99 @@ +-- WHERE Clause Examples +-- Filtering data with various conditions + +-- Basic equality +SELECT * FROM employees +WHERE department_id = 5; + +-- Not equal +SELECT * FROM employees +WHERE department_id != 5; +-- or +SELECT * FROM employees +WHERE department_id <> 5; + +-- Comparison operators +SELECT first_name, last_name, salary +FROM employees +WHERE salary > 50000; + +SELECT * FROM employees +WHERE hire_date >= '2020-01-01'; + +-- BETWEEN operator +SELECT first_name, last_name, salary +FROM employees +WHERE salary BETWEEN 40000 AND 60000; + +-- IN operator (match any value in a list) +SELECT * FROM employees +WHERE department_id IN (1, 3, 5); + +SELECT * FROM products +WHERE category IN ('Electronics', 'Clothing', 'Books'); + +-- LIKE operator (pattern matching) +-- % matches any sequence of characters +-- _ matches any single character + +-- Names starting with 'J' +SELECT * FROM employees +WHERE first_name LIKE 'J%'; + +-- Names ending with 'son' +SELECT * FROM employees +WHERE last_name LIKE '%son'; + +-- Names containing 'ar' +SELECT * FROM employees +WHERE first_name LIKE '%ar%'; + +-- Email addresses from gmail +SELECT * FROM employees +WHERE email LIKE '%@gmail.com'; + +-- Names with exactly 4 characters +SELECT * FROM employees +WHERE first_name LIKE '____'; + +-- NULL checks +SELECT * FROM employees +WHERE manager_id IS NULL; + +SELECT * FROM employees +WHERE phone_number IS NOT NULL; + +-- Combining conditions with AND +SELECT * FROM employees +WHERE salary > 50000 + AND department_id = 3; + +-- Combining conditions with OR +SELECT * FROM employees +WHERE department_id = 1 + OR department_id = 5; + +-- Using AND with OR (use parentheses for clarity) +SELECT * FROM employees +WHERE (department_id = 1 OR department_id = 5) + AND salary > 50000; + +-- NOT operator +SELECT * FROM employees +WHERE NOT department_id = 5; + +SELECT * FROM employees +WHERE department_id NOT IN (1, 2, 3); + +-- Complex conditions +SELECT + first_name, + last_name, + salary, + department_id +FROM employees +WHERE + (salary BETWEEN 40000 AND 70000) + AND department_id IN (2, 4, 6) + AND hire_date >= '2019-01-01' + AND email LIKE '%@company.com'; diff --git a/03-sql-fundamentals/queries/03-joins.sql b/03-sql-fundamentals/queries/03-joins.sql new file mode 100644 index 0000000..b905133 --- /dev/null +++ b/03-sql-fundamentals/queries/03-joins.sql @@ -0,0 +1,193 @@ +-- SQL Joins - Combining Data from Multiple Tables +-- Demonstrates different types of joins with examples + +-- Sample data structure (for reference): +-- employees: employee_id, first_name, last_name, department_id, manager_id +-- departments: department_id, department_name, location +-- projects: project_id, project_name, budget +-- project_assignments: employee_id, project_id, hours_worked + +-- ============================================ +-- INNER JOIN +-- Returns only rows that have matches in both tables +-- 
============================================ + +-- Basic inner join +SELECT + e.first_name, + e.last_name, + d.department_name +FROM employees e +INNER JOIN departments d ON e.department_id = d.department_id; + +-- Join with additional conditions +SELECT + e.first_name, + e.last_name, + d.department_name, + d.location +FROM employees e +INNER JOIN departments d ON e.department_id = d.department_id +WHERE d.location = 'New York'; + +-- ============================================ +-- LEFT JOIN (LEFT OUTER JOIN) +-- Returns all rows from left table and matching rows from right table +-- ============================================ + +-- Find all employees and their departments (including employees without departments) +SELECT + e.first_name, + e.last_name, + d.department_name +FROM employees e +LEFT JOIN departments d ON e.department_id = d.department_id; + +-- Find employees who are not assigned to any department +SELECT + e.first_name, + e.last_name +FROM employees e +LEFT JOIN departments d ON e.department_id = d.department_id +WHERE d.department_id IS NULL; + +-- ============================================ +-- RIGHT JOIN (RIGHT OUTER JOIN) +-- Returns all rows from right table and matching rows from left table +-- ============================================ + +-- Find all departments and their employees (including departments with no employees) +SELECT + d.department_name, + e.first_name, + e.last_name +FROM employees e +RIGHT JOIN departments d ON e.department_id = d.department_id; + +-- Find departments with no employees +SELECT + d.department_name +FROM employees e +RIGHT JOIN departments d ON e.department_id = d.department_id +WHERE e.employee_id IS NULL; + +-- ============================================ +-- FULL OUTER JOIN +-- Returns all rows when there's a match in either table +-- ============================================ + +-- Find all employees and departments (including unmatched records from both) +SELECT + e.first_name, + e.last_name, + d.department_name +FROM employees e +FULL OUTER JOIN departments d ON e.department_id = d.department_id; + +-- ============================================ +-- SELF JOIN +-- Joining a table to itself +-- ============================================ + +-- Find employees and their managers +SELECT + e.first_name || ' ' || e.last_name AS employee, + m.first_name || ' ' || m.last_name AS manager +FROM employees e +LEFT JOIN employees m ON e.manager_id = m.employee_id; + +-- ============================================ +-- MULTIPLE JOINS +-- Joining more than two tables +-- ============================================ + +-- Find employees, their departments, and projects +SELECT + e.first_name, + e.last_name, + d.department_name, + p.project_name, + pa.hours_worked +FROM employees e +INNER JOIN departments d ON e.department_id = d.department_id +INNER JOIN project_assignments pa ON e.employee_id = pa.employee_id +INNER JOIN projects p ON pa.project_id = p.project_id; + +-- ============================================ +-- JOIN with Aggregate Functions +-- ============================================ + +-- Count employees per department +SELECT + d.department_name, + COUNT(e.employee_id) AS employee_count +FROM departments d +LEFT JOIN employees e ON d.department_id = e.department_id +GROUP BY d.department_name +ORDER BY employee_count DESC; + +-- Total hours worked by employee on all projects +SELECT + e.first_name, + e.last_name, + SUM(pa.hours_worked) AS total_hours +FROM employees e +INNER JOIN project_assignments pa ON e.employee_id = 
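pa.employee_id
+GROUP BY e.employee_id, e.first_name, e.last_name
+HAVING SUM(pa.hours_worked) > 100;
+
+-- note: the hours threshold must live in HAVING rather than WHERE, because
+-- WHERE filters individual rows before aggregation while HAVING filters the
+-- aggregated groups; a condition on SUM(...) is only valid in HAVING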
+
+-- ============================================
+-- CROSS JOIN
+-- Cartesian product of two tables (all possible combinations)
+-- ============================================
+
+-- Create all possible employee-project combinations (use carefully!)
+SELECT
+    e.first_name,
+    e.last_name,
+    p.project_name
+FROM employees e
+CROSS JOIN projects p;
+
+-- Practical use: generate a date range for each employee
+-- (assuming you have a date_range calendar table)
+SELECT
+    e.first_name,
+    d.date
+FROM employees e
+CROSS JOIN date_range d
+WHERE d.date BETWEEN '2024-01-01' AND '2024-01-31';
+
+-- ============================================
+-- JOIN with USING clause
+-- When column names are the same in both tables
+-- ============================================
+
+-- Instead of: ON e.department_id = d.department_id
+-- You can use: USING (department_id)
+SELECT
+    e.first_name,
+    e.last_name,
+    d.department_name
+FROM employees e
+INNER JOIN departments d USING (department_id);
+
+-- ============================================
+-- Complex JOIN Example
+-- ============================================
+
+-- Find employees working on high-budget projects in specific locations
+SELECT
+    e.first_name || ' ' || e.last_name AS employee_name,
+    d.department_name,
+    d.location,
+    p.project_name,
+    p.budget,
+    pa.hours_worked
+FROM employees e
+INNER JOIN departments d ON e.department_id = d.department_id
+INNER JOIN project_assignments pa ON e.employee_id = pa.employee_id
+INNER JOIN projects p ON pa.project_id = p.project_id
+WHERE p.budget > 100000
+    AND d.location IN ('New York', 'San Francisco')
+ORDER BY p.budget DESC, pa.hours_worked DESC;
diff --git a/04-advanced-sql/README.md b/04-advanced-sql/README.md
new file mode 100644
index 0000000..22be85e
--- /dev/null
+++ b/04-advanced-sql/README.md
@@ -0,0 +1,160 @@
+# Advanced SQL
+
+Welcome to Advanced SQL! Here you'll learn optimization, database design, and advanced query techniques.
+
+## πŸ“š What You'll Learn
+
+- Database design and normalization
+- Indexes and query optimization
+- Window functions
+- Stored procedures and functions
+- Transactions and concurrency
+- Performance tuning
+- Advanced query patterns
+
+## πŸ“– Lessons
+
+1. [Database Design Principles](lessons/01-database-design.md)
+2. [Normalization](lessons/02-normalization.md)
+3. [Indexes and Performance](lessons/03-indexes.md)
+4. [Window Functions](lessons/04-window-functions.md)
+5. [Stored Procedures](lessons/05-stored-procedures.md)
+6. [Transactions](lessons/06-transactions.md)
+7. [Query Optimization](lessons/07-query-optimization.md)
+8. 
[Advanced Patterns](lessons/08-advanced-patterns.md) + +## πŸ’» Sample Queries + +Check the `queries/` folder for advanced SQL examples: +- Window functions +- Complex aggregations +- Recursive queries +- Performance optimization examples + +## ⏱️ Estimated Time + +3-4 weeks with hands-on practice + +## βœ… Completion Checklist + +- [ ] Understand database normalization +- [ ] Design efficient database schemas +- [ ] Use indexes effectively +- [ ] Write window functions +- [ ] Create stored procedures +- [ ] Understand transactions +- [ ] Optimize slow queries +- [ ] Complete all exercises + +## 🎯 Key Concepts + +### Window Functions +```sql +-- Running total +SELECT + date, + amount, + SUM(amount) OVER (ORDER BY date) AS running_total +FROM sales; + +-- Ranking +SELECT + employee_name, + salary, + RANK() OVER (ORDER BY salary DESC) AS salary_rank +FROM employees; + +-- Partitioned aggregation +SELECT + department, + employee_name, + salary, + AVG(salary) OVER (PARTITION BY department) AS dept_avg +FROM employees; +``` + +### Indexes +```sql +-- Create index +CREATE INDEX idx_employee_name ON employees(last_name, first_name); + +-- Create unique index +CREATE UNIQUE INDEX idx_employee_email ON employees(email); + +-- Create partial index +CREATE INDEX idx_active_employees +ON employees(department_id) +WHERE status = 'active'; +``` + +### Common Table Expressions (CTEs) +```sql +-- Simple CTE +WITH high_earners AS ( + SELECT * FROM employees + WHERE salary > 80000 +) +SELECT department_id, COUNT(*) +FROM high_earners +GROUP BY department_id; + +-- Recursive CTE (org hierarchy) +WITH RECURSIVE employee_hierarchy AS ( + SELECT employee_id, name, manager_id, 1 AS level + FROM employees + WHERE manager_id IS NULL + + UNION ALL + + SELECT e.employee_id, e.name, e.manager_id, eh.level + 1 + FROM employees e + JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id +) +SELECT * FROM employee_hierarchy; +``` + +## πŸ’‘ Best Practices + +1. **Indexing**: Index foreign keys and frequently queried columns +2. **Query Design**: Avoid SELECT *, use specific columns +3. **Joins**: Use appropriate join types +4. **Transactions**: Keep them short and focused +5. **Testing**: Test queries with production-like data volumes +6. **Documentation**: Document complex queries +7. **Monitoring**: Track slow queries + +## πŸ” Query Optimization Tips + +1. **Use EXPLAIN**: Analyze query execution plans +2. **Limit Result Sets**: Use WHERE clauses effectively +3. **Avoid Functions in WHERE**: Can prevent index usage +4. **Use Joins Instead of Subqueries**: Often faster +5. **Proper Data Types**: Use appropriate types for columns +6. **Batch Operations**: Bulk inserts instead of row-by-row +7. 
**Connection Pooling**: Reuse database connections + +## πŸ“š Additional Resources + +- [Use The Index, Luke](https://use-the-index-luke.com/) +- [PostgreSQL Performance Tuning](https://wiki.postgresql.org/wiki/Performance_Optimization) +- [SQL Server Execution Plans](https://www.red-gate.com/simple-talk/databases/sql-server/performance-sql-server/execution-plans/) + +## Real-World Scenarios + +### Scenario 1: Slow Dashboard Query +- Analyze execution plan +- Add appropriate indexes +- Rewrite query to reduce joins +- Consider materialized views + +### Scenario 2: Concurrent Updates +- Implement proper transactions +- Handle deadlocks +- Use appropriate isolation levels +- Design for concurrency + +### Scenario 3: Large Data Imports +- Use bulk insert methods +- Disable indexes during import +- Rebuild indexes after import +- Use transactions appropriately diff --git a/05-data-engineering/README.md b/05-data-engineering/README.md new file mode 100644 index 0000000..594aeb0 --- /dev/null +++ b/05-data-engineering/README.md @@ -0,0 +1,129 @@ +# Data Engineering Concepts + +Welcome to the Data Engineering section! Here you'll learn the core concepts and practices of data engineering. + +## πŸ“š What You'll Learn + +- ETL vs ELT processes +- Data pipeline architecture +- Data warehousing concepts +- Data quality and validation +- Data modeling +- Workflow orchestration +- Version control for data projects + +## πŸ“– Lessons + +1. [Introduction to Data Engineering](lessons/01-intro-data-engineering.md) +2. [ETL vs ELT](lessons/02-etl-vs-elt.md) +3. [Data Pipelines](lessons/03-data-pipelines.md) +4. [Data Warehousing](lessons/04-data-warehousing.md) +5. [Data Quality](lessons/05-data-quality.md) +6. [Data Modeling](lessons/06-data-modeling.md) +7. [Workflow Orchestration](lessons/07-orchestration.md) +8. [Version Control with Git](lessons/08-version-control.md) + +## πŸ—οΈ Projects + +### Project 1: Simple ETL Pipeline +Build an ETL pipeline that: +- Extracts data from CSV files +- Transforms and cleans the data +- Loads it into a database + +### Project 2: Data Quality Framework +Create a data quality checking system that: +- Validates data types +- Checks for null values +- Identifies duplicates +- Generates quality reports + +### Project 3: Automated Data Pipeline +Build an automated pipeline that: +- Runs on a schedule +- Processes incoming data +- Handles errors gracefully +- Sends notifications + +## ⏱️ Estimated Time + +4-6 weeks with hands-on projects + +## βœ… Completion Checklist + +- [ ] Understand ETL vs ELT +- [ ] Build a basic ETL pipeline +- [ ] Design a data warehouse schema +- [ ] Implement data quality checks +- [ ] Use Git for version control +- [ ] Complete all projects + +## 🎯 Real-World Scenarios + +### Scenario 1: E-commerce Analytics +Design a data pipeline for an e-commerce company that: +- Ingests order data from multiple sources +- Processes customer behavior data +- Creates aggregated reports +- Feeds a dashboard + +### Scenario 2: IoT Data Processing +Build a system to: +- Collect sensor data +- Clean and validate readings +- Store time-series data efficiently +- Generate alerts for anomalies + +## πŸ”‘ Key Concepts + +### ETL Process +1. **Extract**: Pull data from source systems +2. **Transform**: Clean, validate, and reshape data +3. 
**Load**: Store data in target system + +### Data Pipeline Components +- **Source**: Where data comes from +- **Ingestion**: How data is collected +- **Processing**: Data transformation logic +- **Storage**: Where data is stored +- **Orchestration**: How pipeline steps are coordinated + +### Data Quality Dimensions +- **Accuracy**: Is the data correct? +- **Completeness**: Is all required data present? +- **Consistency**: Is data consistent across sources? +- **Timeliness**: Is data up-to-date? +- **Validity**: Does data follow business rules? + +## πŸ› οΈ Tools You'll Use + +- **Python**: For data processing +- **Pandas**: For data manipulation +- **SQL**: For data querying +- **Git**: For version control +- **SQLite/PostgreSQL**: For data storage + +## πŸ“š Additional Resources + +- [The Data Engineering Cookbook](https://github.com/andkret/Cookbook) +- [Fundamentals of Data Engineering (Book)](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/) +- [Data Engineering Weekly Newsletter](https://www.dataengineeringweekly.com/) + +## πŸ’‘ Best Practices + +1. **Documentation**: Document your pipelines thoroughly +2. **Testing**: Test your data pipelines +3. **Monitoring**: Monitor pipeline health and data quality +4. **Idempotency**: Design pipelines to be rerunnable +5. **Error Handling**: Handle failures gracefully +6. **Scalability**: Design for growth +7. **Security**: Protect sensitive data + +## πŸŽ“ Career Path + +Understanding these concepts prepares you for roles like: +- Data Engineer +- ETL Developer +- Data Pipeline Engineer +- Analytics Engineer +- Data Platform Engineer diff --git a/05-data-engineering/projects/simple-etl/etl_example.py b/05-data-engineering/projects/simple-etl/etl_example.py new file mode 100644 index 0000000..0c0e15d --- /dev/null +++ b/05-data-engineering/projects/simple-etl/etl_example.py @@ -0,0 +1,159 @@ +""" +Simple ETL Pipeline Example +Demonstrates Extract, Transform, Load process +""" + +import csv +import sqlite3 +from datetime import datetime + + +def extract_data(csv_file): + """ + Extract data from CSV file + + Args: + csv_file: Path to CSV file + + Returns: + List of dictionaries containing the data + """ + print(f"[{datetime.now()}] Extracting data from {csv_file}...") + data = [] + + try: + with open(csv_file, 'r') as file: + csv_reader = csv.DictReader(file) + for row in csv_reader: + data.append(row) + print(f"[{datetime.now()}] Extracted {len(data)} records") + return data + except FileNotFoundError: + print(f"Error: File {csv_file} not found") + return [] + + +def transform_data(data): + """ + Transform and clean the data + + Args: + data: List of dictionaries to transform + + Returns: + Transformed data + """ + print(f"[{datetime.now()}] Transforming data...") + transformed = [] + + for record in data: + # Example transformations + transformed_record = { + 'id': int(record.get('id', 0)), + 'name': record.get('name', '').strip().title(), + 'email': record.get('email', '').strip().lower(), + 'age': int(record.get('age', 0)) if record.get('age') else None, + 'city': record.get('city', '').strip().title(), + 'processed_date': datetime.now().strftime('%Y-%m-%d') + } + + # Data quality checks + if transformed_record['email'] and '@' in transformed_record['email']: + transformed.append(transformed_record) + else: + print(f"Skipping invalid record: {record}") + + print(f"[{datetime.now()}] Transformed {len(transformed)} valid records") + return transformed + + +def load_data(data, db_name='etl_output.db'): + """ + Load data 
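(idempotently, via INSERT OR REPLACE)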
into SQLite database + + Args: + data: List of dictionaries to load + db_name: Name of the database file + """ + print(f"[{datetime.now()}] Loading data into database...") + + # Connect to database (creates if doesn't exist) + conn = sqlite3.connect(db_name) + cursor = conn.cursor() + + # Create table if it doesn't exist + cursor.execute(''' + CREATE TABLE IF NOT EXISTS users ( + id INTEGER PRIMARY KEY, + name TEXT NOT NULL, + email TEXT UNIQUE NOT NULL, + age INTEGER, + city TEXT, + processed_date TEXT + ) + ''') + + # Insert data + inserted = 0 + for record in data: + try: + cursor.execute(''' + INSERT OR REPLACE INTO users (id, name, email, age, city, processed_date) + VALUES (?, ?, ?, ?, ?, ?) + ''', ( + record['id'], + record['name'], + record['email'], + record['age'], + record['city'], + record['processed_date'] + )) + inserted += 1 + except sqlite3.Error as e: + print(f"Error inserting record {record['id']}: {e}") + + conn.commit() + conn.close() + + print(f"[{datetime.now()}] Loaded {inserted} records into database") + + +def run_etl_pipeline(source_file, target_db='etl_output.db'): + """ + Run the complete ETL pipeline + + Args: + source_file: Path to source CSV file + target_db: Name of target database + """ + print(f"\n{'='*50}") + print("Starting ETL Pipeline") + print(f"{'='*50}\n") + + start_time = datetime.now() + + # Extract + raw_data = extract_data(source_file) + + if not raw_data: + print("No data to process. Exiting.") + return + + # Transform + clean_data = transform_data(raw_data) + + # Load + load_data(clean_data, target_db) + + end_time = datetime.now() + duration = (end_time - start_time).total_seconds() + + print(f"\n{'='*50}") + print(f"ETL Pipeline Completed in {duration:.2f} seconds") + print(f"{'='*50}\n") + + +if __name__ == "__main__": + # Example usage + # Create a sample CSV file first or replace with your file + run_etl_pipeline('sample_data.csv') diff --git a/06-advanced-topics/README.md b/06-advanced-topics/README.md new file mode 100644 index 0000000..a347409 --- /dev/null +++ b/06-advanced-topics/README.md @@ -0,0 +1,283 @@ +# Advanced Topics in Data Engineering + +This section covers advanced concepts and technologies that modern data engineers use in production environments. + +## πŸ“š What You'll Learn + +- Introduction to Apache Spark +- Cloud data platforms (AWS, GCP, Azure) +- Data streaming concepts +- Containerization with Docker +- Testing data pipelines +- CI/CD for data engineering +- Data governance and security + +## πŸ“– Lessons + +1. [Introduction to Big Data](lessons/01-big-data-intro.md) +2. [Apache Spark Basics](lessons/02-spark-basics.md) +3. [Cloud Platforms Overview](lessons/03-cloud-platforms.md) +4. [Data Streaming](lessons/04-data-streaming.md) +5. [Docker for Data Engineers](lessons/05-docker.md) +6. [Testing Data Pipelines](lessons/06-testing.md) +7. [CI/CD](lessons/07-cicd.md) +8. 
[Data Governance](lessons/08-data-governance.md) + +## 🎯 Projects + +### Project 1: Dockerized ETL Pipeline +- Package ETL pipeline in Docker +- Use Docker Compose for multi-container setup +- Include database and application + +### Project 2: Cloud Data Pipeline +- Build pipeline on cloud platform +- Use managed services +- Implement monitoring + +### Project 3: Streaming Data Pipeline +- Process real-time data +- Use message queues +- Handle high throughput + +## ⏱️ Estimated Time + +6-8 weeks for comprehensive understanding + +## βœ… Completion Checklist + +- [ ] Understand big data concepts +- [ ] Learn Spark basics +- [ ] Explore cloud platforms +- [ ] Build a Docker container +- [ ] Understand streaming concepts +- [ ] Implement testing +- [ ] Set up CI/CD pipeline +- [ ] Complete capstone project + +## πŸ”‘ Key Technologies + +### Apache Spark +```python +from pyspark.sql import SparkSession + +# Create Spark session +spark = SparkSession.builder \ + .appName("DataEngineering") \ + .getOrCreate() + +# Read data +df = spark.read.csv("data.csv", header=True) + +# Transform +df_transformed = df.filter(df.age > 25) \ + .groupBy("city") \ + .count() + +# Write +df_transformed.write.parquet("output/") +``` + +### Docker +```dockerfile +# Dockerfile for Python app +FROM python:3.9-slim + +WORKDIR /app + +COPY requirements.txt . +RUN pip install -r requirements.txt + +COPY . . + +CMD ["python", "etl_pipeline.py"] +``` + +### Docker Compose +```yaml +# docker-compose.yml +version: '3.8' + +services: + postgres: + image: postgres:13 + environment: + POSTGRES_PASSWORD: password + ports: + - "5432:5432" + + etl_app: + build: . + depends_on: + - postgres + environment: + DB_HOST: postgres +``` + +## ☁️ Cloud Platforms + +### AWS Services for Data Engineering +- **S3**: Object storage +- **RDS**: Managed databases +- **Redshift**: Data warehouse +- **Glue**: ETL service +- **Lambda**: Serverless compute +- **Kinesis**: Streaming data + +### GCP Services +- **Cloud Storage**: Object storage +- **Cloud SQL**: Managed databases +- **BigQuery**: Data warehouse +- **Dataflow**: Stream/batch processing +- **Cloud Functions**: Serverless +- **Pub/Sub**: Messaging + +### Azure Services +- **Blob Storage**: Object storage +- **Azure SQL**: Managed databases +- **Synapse Analytics**: Data warehouse +- **Data Factory**: ETL/ELT +- **Functions**: Serverless +- **Event Hubs**: Streaming + +## πŸ§ͺ Testing Data Pipelines + +### Unit Testing +```python +import pytest +import pandas as pd + +def test_data_transformation(): + # Arrange + input_data = pd.DataFrame({ + 'name': ['Alice', 'Bob'], + 'age': [25, 30] + }) + + # Act + result = transform_data(input_data) + + # Assert + assert len(result) == 2 + assert 'age_group' in result.columns +``` + +### Integration Testing +```python +def test_database_connection(): + engine = create_engine(TEST_DB_URL) + conn = engine.connect() + assert conn is not None + conn.close() + +def test_etl_pipeline(): + # Run entire pipeline on test data + run_pipeline(test_source, test_target) + # Verify results + result = read_from_target() + assert result.shape[0] > 0 +``` + +## πŸ”„ CI/CD Example + +### GitHub Actions Workflow +```yaml +name: Data Pipeline CI + +on: [push, pull_request] + +jobs: + test: + runs-on: ubuntu-latest + + steps: + - uses: actions/checkout@v2 + + - name: Set up Python + uses: actions/setup-python@v2 + with: + python-version: 3.9 + + - name: Install dependencies + run: | + pip install -r requirements.txt + pip install pytest + + - name: Run tests + run: 
pytest + + - name: Run linter + run: pylint *.py +``` + +## πŸ“Š Data Streaming Concepts + +### Key Concepts +- **Event Stream**: Continuous flow of events +- **Message Queue**: Buffer between producers and consumers +- **Consumer Group**: Multiple consumers processing stream +- **Offset**: Position in stream +- **Windowing**: Time-based aggregations + +### Example Technologies +- **Apache Kafka**: Distributed streaming platform +- **RabbitMQ**: Message broker +- **AWS Kinesis**: Managed streaming +- **Google Pub/Sub**: Messaging service + +## πŸ’‘ Best Practices + +1. **Containerization**: Use Docker for consistency +2. **Testing**: Test at multiple levels +3. **Monitoring**: Implement comprehensive monitoring +4. **Documentation**: Document architecture and decisions +5. **Security**: Follow security best practices +6. **Cost Optimization**: Monitor and optimize cloud costs +7. **Scalability**: Design for growth + +## πŸ“š Additional Resources + +### Books +- "Learning Spark" by Matei Zaharia +- "Streaming Systems" by Tyler Akidau +- "Docker Deep Dive" by Nigel Poulton + +### Online +- [Apache Spark Documentation](https://spark.apache.org/docs/latest/) +- [Docker Documentation](https://docs.docker.com/) +- [AWS Data Analytics](https://aws.amazon.com/big-data/datalakes-and-analytics/) + +### Certifications +- AWS Certified Data Analytics +- Google Professional Data Engineer +- Microsoft Certified: Azure Data Engineer +- Databricks Certified Data Engineer + +## Career Advancement + +Mastering these topics prepares you for: +- Senior Data Engineer +- Data Platform Engineer +- Big Data Engineer +- Cloud Data Engineer +- MLOps Engineer + +## Capstone Project Ideas + +1. **Real-time Analytics Dashboard** + - Stream data from API + - Process with Spark Streaming + - Store in time-series database + - Visualize in real-time + +2. **Cloud Data Warehouse** + - Design star schema + - Implement on cloud platform + - Build ETL pipeline + - Add data quality checks + +3. **Containerized Pipeline** + - Full ETL pipeline in Docker + - Orchestrated with Airflow + - Automated testing + - CI/CD deployment diff --git a/07-projects/README.md b/07-projects/README.md new file mode 100644 index 0000000..ca5849a --- /dev/null +++ b/07-projects/README.md @@ -0,0 +1,293 @@ +# Capstone Projects + +This section contains comprehensive projects that bring together everything you've learned. Each project simulates real-world data engineering scenarios. + +## 🎯 Projects Overview + +### Project 1: ETL Pipeline +Build a complete Extract, Transform, Load pipeline for processing sales data. + +### Project 2: Data Warehouse +Design and implement a dimensional data warehouse for analytics. + +### Project 3: Real-Time Dashboard +Create a system that processes streaming data and displays real-time metrics. + +## πŸ“‹ Prerequisites + +Before starting these projects, you should have completed: +- Python Fundamentals +- Python for Data Engineering +- SQL Fundamentals +- Data Engineering Concepts + +## πŸš€ How to Approach These Projects + +1. **Understand Requirements**: Read project specs carefully +2. **Plan Architecture**: Design before coding +3. **Start Simple**: Build MVP first +4. **Iterate**: Add features incrementally +5. **Test**: Validate at each step +6. **Document**: Explain your design decisions +7. **Refactor**: Improve code quality +8. **Deploy**: Make it production-ready + +## Project 1: ETL Pipeline + +### Overview +Build an automated ETL pipeline that processes e-commerce data from multiple sources. 
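+
+Before you read the detailed requirements, here is one possible shape for the finished pipeline. It is a sketch only: the file names, column names, and connection string are placeholders to adapt, not part of the spec:
+
+```python
+# Illustrative skeleton only: adapt sources, columns, and targets to your data.
+import pandas as pd
+from sqlalchemy import create_engine
+
+
+def extract() -> pd.DataFrame:
+    """Pull raw data from example CSV and JSON sources."""
+    orders = pd.read_csv("data/orders.csv")
+    customers = pd.read_json("data/customers.json")
+    return orders.merge(customers, on="customer_id", how="left")
+
+
+def transform(df: pd.DataFrame) -> pd.DataFrame:
+    """Validate and clean: drop incomplete rows, normalize dates."""
+    df = df.dropna(subset=["order_id", "customer_id"])
+    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
+    return df[df["order_date"].notna()]
+
+
+def load(df: pd.DataFrame) -> None:
+    """Write the cleaned rows to a PostgreSQL table."""
+    engine = create_engine("postgresql://user:password@localhost/ecommerce")
+    df.to_sql("orders_clean", engine, if_exists="replace", index=False)
+
+
+if __name__ == "__main__":
+    load(transform(extract()))
+```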
+ +### Objectives +- Extract data from CSV, JSON, and API +- Clean and validate data +- Transform for analytics +- Load into database +- Schedule regular updates +- Handle errors gracefully + +### Requirements +- Python 3.8+ +- PostgreSQL or SQLite +- Pandas +- SQLAlchemy + +### Skills Practiced +- Data extraction from multiple sources +- Data cleaning and validation +- Database operations +- Error handling +- Logging +- Scheduling + +### Deliverables +- Working ETL pipeline +- Documentation +- Unit tests +- Configuration files +- README with setup instructions + +### Success Criteria +- Pipeline runs without errors +- Data quality checks pass +- Handles edge cases +- Well-documented code +- Tests cover main functionality + +--- + +## Project 2: Data Warehouse + +### Overview +Design and implement a data warehouse using dimensional modeling for a fictional retail company. + +### Objectives +- Design star schema +- Create dimension and fact tables +- Build ETL to populate warehouse +- Write analytical queries +- Optimize for performance +- Document design decisions + +### Requirements +- PostgreSQL (or similar) +- Python for ETL +- Understanding of dimensional modeling +- SQL knowledge + +### Skills Practiced +- Database design +- Dimensional modeling +- Data warehouse concepts +- ETL development +- Query optimization +- Performance tuning + +### Deliverables +- Database schema (ERD) +- ETL scripts +- Sample analytical queries +- Documentation +- Performance analysis + +### Success Criteria +- Properly normalized dimensions +- Efficient fact table design +- Working ETL process +- Optimized queries +- Clear documentation + +--- + +## Project 3: Real-Time Dashboard + +### Overview +Create a system that ingests streaming data, processes it, and displays real-time metrics on a dashboard. + +### Objectives +- Simulate or connect to data stream +- Process data in real-time +- Store processed data +- Create visualization dashboard +- Handle high throughput +- Implement monitoring + +### Requirements +- Python +- Database (PostgreSQL/TimescaleDB) +- Message queue (optional) +- Visualization tool (Plotly/Dash/Grafana) + +### Skills Practiced +- Stream processing +- Real-time data handling +- Data visualization +- System design +- Performance optimization + +### Deliverables +- Data ingestion service +- Processing pipeline +- Dashboard application +- Documentation +- Demo video + +### Success Criteria +- Handles data in real-time +- Low latency processing +- Responsive dashboard +- Scalable design +- Proper error handling + +--- + +## πŸ“š Additional Project Ideas + +### Beginner Projects +1. **CSV Data Cleaner**: Tool to clean messy CSV files +2. **Database Backup Script**: Automate database backups +3. **Log File Parser**: Extract insights from log files +4. **Data Quality Checker**: Validate data against rules + +### Intermediate Projects +5. **API Data Aggregator**: Collect data from multiple APIs +6. **Automated Report Generator**: Generate daily/weekly reports +7. **Data Version Control**: Track changes in datasets +8. **Multi-Source Data Integration**: Combine different data sources + +### Advanced Projects +9. **Data Lakehouse**: Implement data lake and warehouse +10. **ML Pipeline**: Data pipeline for machine learning +11. **Data Observability Platform**: Monitor data quality and pipelines +12. 
**Change Data Capture (CDC)**: Track database changes + +## πŸ’‘ Tips for Success + +### Planning +- Sketch architecture diagrams +- List requirements clearly +- Break into small tasks +- Estimate time needed + +### Development +- Use version control (Git) +- Commit frequently +- Write tests as you go +- Document as you code + +### Best Practices +- Follow coding standards +- Handle errors properly +- Add logging +- Use configuration files +- Keep credentials secure + +### Testing +- Test with sample data first +- Validate edge cases +- Performance test with realistic data +- Test failure scenarios + +### Documentation +- Explain design decisions +- Document setup process +- Provide usage examples +- Include troubleshooting guide + +## πŸŽ“ Learning Outcomes + +After completing these projects, you will: +- Have portfolio projects for job applications +- Understand full data engineering lifecycle +- Know how to design data systems +- Be comfortable with production concepts +- Have experience with real-world challenges + +## πŸ“ Project Presentation + +For each project, prepare: +1. **Problem Statement**: What you're solving +2. **Architecture Diagram**: System design +3. **Technology Stack**: Tools used +4. **Demo**: Working demonstration +5. **Challenges**: What you learned +6. **Future Improvements**: What's next + +## 🀝 Getting Help + +If you get stuck: +1. Review relevant lessons +2. Check documentation +3. Search Stack Overflow +4. Ask in communities +5. Review similar projects on GitHub + +## 🌟 Showcase Your Work + +- Push to GitHub with good README +- Write blog post about your project +- Create demo video +- Add to your portfolio +- Share on LinkedIn + +## πŸ“Š Evaluation Rubric + +### Code Quality (25%) +- Clean, readable code +- Proper structure +- Comments and docstrings +- Follows best practices + +### Functionality (25%) +- Meets requirements +- Works as expected +- Handles edge cases +- Error handling + +### Design (20%) +- Good architecture +- Scalable solution +- Efficient implementation +- Proper data modeling + +### Testing (15%) +- Unit tests included +- Test coverage +- Tests pass +- Edge cases covered + +### Documentation (15%) +- Clear README +- Setup instructions +- Architecture explained +- Usage examples + +## Next Steps + +1. Choose a project that interests you +2. Read the detailed requirements +3. Plan your approach +4. Start building +5. Iterate and improve +6. Share your work + +Good luck with your projects! These will form the foundation of your data engineering portfolio. diff --git a/07-projects/etl-pipeline/README.md b/07-projects/etl-pipeline/README.md new file mode 100644 index 0000000..ba7c65a --- /dev/null +++ b/07-projects/etl-pipeline/README.md @@ -0,0 +1,360 @@ +# Project 1: E-commerce ETL Pipeline + +## πŸ“‹ Project Overview + +Build a production-ready ETL pipeline that processes e-commerce sales data from multiple sources, validates and transforms it, and loads it into a database for analytics. + +## 🎯 Objectives + +1. Extract data from multiple sources (CSV, JSON, API) +2. Implement data validation and quality checks +3. Transform data for analytics +4. Load data into a PostgreSQL database +5. Handle errors and edge cases +6. Implement logging and monitoring +7. 
Make the pipeline schedulable + +## πŸ“Š Data Sources + +### Source 1: Orders CSV +Daily exports of order data: +``` +order_id,customer_id,order_date,total_amount,status +1001,5001,2024-01-15,99.99,completed +1002,5002,2024-01-15,149.50,pending +``` + +### Source 2: Product Catalog API +REST API endpoint: `/api/products` +```json +{ + "products": [ + { + "product_id": "P001", + "name": "Laptop", + "category": "Electronics", + "price": 999.99 + } + ] +} +``` + +### Source 3: Customer Data JSON +Customer information updates: +```json +{ + "customer_id": 5001, + "name": "John Doe", + "email": "john@example.com", + "signup_date": "2023-01-10" +} +``` + +## πŸ—οΈ Architecture + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Sources │────▢│ ETL Process │────▢│ Database β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ β”‚ β”‚ + β”œβ”€ CSV Files β”œβ”€ Extract β”œβ”€ PostgreSQL + β”œβ”€ JSON Files β”œβ”€ Transform β”œβ”€ Staging Tables + └─ REST API β”œβ”€ Load └─ Final Tables + └─ Validate +``` + +## πŸ“ Project Structure + +``` +etl-pipeline/ +β”œβ”€β”€ README.md +β”œβ”€β”€ requirements.txt +β”œβ”€β”€ config/ +β”‚ β”œβ”€β”€ config.yaml +β”‚ └── logging_config.yaml +β”œβ”€β”€ data/ +β”‚ β”œβ”€β”€ input/ +β”‚ β”‚ β”œβ”€β”€ orders/ +β”‚ β”‚ β”œβ”€β”€ customers/ +β”‚ β”‚ └── products/ +β”‚ └── output/ +β”œβ”€β”€ src/ +β”‚ β”œβ”€β”€ __init__.py +β”‚ β”œβ”€β”€ extract/ +β”‚ β”‚ β”œβ”€β”€ __init__.py +β”‚ β”‚ β”œβ”€β”€ csv_extractor.py +β”‚ β”‚ β”œβ”€β”€ json_extractor.py +β”‚ β”‚ └── api_extractor.py +β”‚ β”œβ”€β”€ transform/ +β”‚ β”‚ β”œβ”€β”€ __init__.py +β”‚ β”‚ β”œβ”€β”€ data_cleaner.py +β”‚ β”‚ β”œβ”€β”€ data_validator.py +β”‚ β”‚ └── data_transformer.py +β”‚ β”œβ”€β”€ load/ +β”‚ β”‚ β”œβ”€β”€ __init__.py +β”‚ β”‚ └── database_loader.py +β”‚ β”œβ”€β”€ utils/ +β”‚ β”‚ β”œβ”€β”€ __init__.py +β”‚ β”‚ β”œβ”€β”€ logger.py +β”‚ β”‚ β”œβ”€β”€ db_connection.py +β”‚ β”‚ └── config_loader.py +β”‚ └── pipeline.py +β”œβ”€β”€ tests/ +β”‚ β”œβ”€β”€ __init__.py +β”‚ β”œβ”€β”€ test_extract.py +β”‚ β”œβ”€β”€ test_transform.py +β”‚ β”œβ”€β”€ test_load.py +β”‚ └── test_pipeline.py +β”œβ”€β”€ sql/ +β”‚ β”œβ”€β”€ schema.sql +β”‚ └── queries.sql +└── main.py +``` + +## πŸ”§ Technical Requirements + +### Required Software +- Python 3.8+ +- PostgreSQL 12+ +- Git + +### Python Libraries +``` +pandas>=1.3.0 +sqlalchemy>=1.4.0 +psycopg2-binary>=2.9.0 +requests>=2.26.0 +pyyaml>=5.4.0 +python-dotenv>=0.19.0 +pytest>=7.0.0 +``` + +## πŸ“ Implementation Steps + +### Phase 1: Setup (Week 1) +- [ ] Set up project structure +- [ ] Install dependencies +- [ ] Set up database +- [ ] Create configuration files +- [ ] Set up logging + +### Phase 2: Extract (Week 1-2) +- [ ] Implement CSV extractor +- [ ] Implement JSON extractor +- [ ] Implement API extractor +- [ ] Handle extraction errors +- [ ] Write extraction tests + +### Phase 3: Transform (Week 2-3) +- [ ] Implement data validation +- [ ] Implement data cleaning +- [ ] Implement transformations +- [ ] Add data quality checks +- [ ] Write transformation tests + +### Phase 4: Load (Week 3) +- [ ] Create database schema +- [ ] Implement database loader +- [ ] Handle loading errors +- [ ] Implement upsert logic +- [ ] Write loading tests + +### Phase 5: Integration (Week 4) +- [ ] Connect all components +- [ ] Implement orchestration +- [ ] Add comprehensive logging 
+
+- [ ] Handle end-to-end errors
+- [ ] Write integration tests
+
+### Phase 6: Production Ready (Week 4)
+- [ ] Add configuration management
+- [ ] Implement monitoring
+- [ ] Add scheduling capability
+- [ ] Create documentation
+- [ ] Performance optimization
+
+## πŸ§ͺ Testing Strategy
+
+### Unit Tests
+Test individual components:
+- Extractors
+- Transformers
+- Loaders
+- Utilities
+
+### Integration Tests
+Test component interactions:
+- Extract β†’ Transform
+- Transform β†’ Load
+- End-to-end pipeline
+
+### Data Quality Tests
+Validate data:
+- Schema validation
+- Data type checks
+- Null value checks
+- Duplicate detection
+- Business rule validation
+
+## πŸ“Š Database Schema
+
+### Staging Tables
+```sql
+CREATE TABLE staging_orders (
+    order_id VARCHAR(50),
+    customer_id VARCHAR(50),
+    order_date VARCHAR(50),
+    total_amount VARCHAR(50),
+    status VARCHAR(50),
+    loaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+```
+
+### Final Tables
+```sql
+-- Create customers and products first so the foreign key
+-- reference from orders resolves.
+CREATE TABLE customers (
+    customer_id INTEGER PRIMARY KEY,
+    name VARCHAR(100) NOT NULL,
+    email VARCHAR(100) UNIQUE NOT NULL,
+    signup_date DATE,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+
+CREATE TABLE products (
+    product_id VARCHAR(50) PRIMARY KEY,
+    name VARCHAR(200) NOT NULL,
+    category VARCHAR(50),
+    price DECIMAL(10,2) NOT NULL,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+
+CREATE TABLE orders (
+    order_id INTEGER PRIMARY KEY,
+    customer_id INTEGER REFERENCES customers(customer_id),
+    order_date DATE NOT NULL,
+    total_amount DECIMAL(10,2) NOT NULL,
+    status VARCHAR(20) NOT NULL,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+```
+
+## πŸ” Data Quality Checks
+
+1. **Completeness**: All required fields present
+2. **Validity**: Data types and formats correct
+3. **Consistency**: Cross-field validation
+4. **Accuracy**: Values within expected ranges
+5. **Uniqueness**: No duplicate keys
+6. **Timeliness**: Data is current
+
+## πŸ“ˆ Monitoring and Logging
+
+### Log Levels
+- **INFO**: Pipeline start/stop, phase transitions
+- **WARNING**: Data quality issues, missing data
+- **ERROR**: Processing failures
+- **DEBUG**: Detailed processing information
+
+### Metrics to Track
+- Records processed
+- Records failed
+- Processing time
+- Data quality scores
+- Error rates
+
+## πŸš€ Running the Pipeline
+
+### Setup
+```bash
+# Clone repository
+git clone <repository-url>
+cd etl-pipeline
+
+# Create virtual environment
+python -m venv venv
+source venv/bin/activate  # or venv\Scripts\activate on Windows
+
+# Install dependencies
+pip install -r requirements.txt
+
+# Set up database
+psql -U postgres -f sql/schema.sql
+
+# Configure
+cp config/config.example.yaml config/config.yaml
+# Edit config.yaml with your settings
+```
+
+### Execution
+```bash
+# Run full pipeline
+python main.py
+
+# Run specific phase
+python main.py --phase extract
+python main.py --phase transform
+python main.py --phase load
+
+# Run with date range
+python main.py --start-date 2024-01-01 --end-date 2024-01-31
+
+# Dry run (no database writes)
+python main.py --dry-run
+```
+
+## πŸ“– Documentation Requirements
+
+1. **README**: Project overview and setup
+2. **Architecture Diagram**: System design
+3. **API Documentation**: If creating APIs
+4. **Configuration Guide**: How to configure
+5. **Troubleshooting**: Common issues and solutions
+6. 
**Code Comments**: Inline documentation
+
+## 🎯 Success Criteria
+
+- [ ] Pipeline processes all three data sources
+- [ ] Data quality checks are implemented
+- [ ] Errors are handled gracefully
+- [ ] Logging provides adequate information
+- [ ] Tests achieve >80% coverage
+- [ ] Documentation is complete
+- [ ] Code follows Python best practices
+- [ ] Pipeline runs without manual intervention
+
+## 🌟 Bonus Features
+
+- **Incremental Loading**: Only process new/changed data
+- **Parallel Processing**: Process multiple files simultaneously
+- **Email Notifications**: Alert on failures
+- **Dashboard**: Visualize pipeline metrics
+- **Containerization**: Package in Docker
+- **Cloud Deployment**: Deploy to AWS/GCP/Azure
+
+## πŸ“š Resources
+
+- [SQLAlchemy Documentation](https://docs.sqlalchemy.org/)
+- [Pandas User Guide: Indexing and Selecting Data](https://pandas.pydata.org/docs/user_guide/indexing.html)
+- [Python Logging Best Practices](https://docs.python.org/3/howto/logging.html)
+- [PostgreSQL Documentation](https://www.postgresql.org/docs/)
+
+## 🀝 Getting Help
+
+- Review previous lessons on ETL
+- Check Stack Overflow for specific errors
+- Refer to library documentation
+- Ask in data engineering communities
+
+## πŸ“ Submission Guidelines
+
+When completed, your repository should include:
+1. All source code
+2. Requirements file
+3. Database schema
+4. Sample data (or data generator)
+5. Test suite
+6. Complete documentation
+7. Demo video (optional)
+
+Good luck building your ETL pipeline!
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 0000000..c812f5d
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,146 @@
+# Contributing to Data Engineer Learning Path
+
+Thank you for your interest in contributing! This document provides guidelines for contributing to this learning repository.
+
+## How to Contribute
+
+### Reporting Issues
+- Check if the issue already exists
+- Provide a clear description
+- Include relevant details (lesson number, code snippets, etc.)
+
+### Suggesting Improvements
+- Open an issue with the "enhancement" label
+- Describe the improvement clearly
+- Explain why it would be valuable
+
+### Adding Content
+
+#### Adding Lessons
+1. Fork the repository
+2. Create a new branch (`git checkout -b add-lesson-topic`)
+3. Add your lesson in the appropriate directory
+4. Follow the existing lesson format
+5. Include examples and exercises
+6. Update the section README.md
+7. Submit a pull request
+
+#### Adding Examples
+- Place in the appropriate `examples/` folder
+- Include comments explaining the code
+- Make sure code runs without errors
+- Add a docstring at the top of the file
+
+#### Adding Exercises
+- Place in the appropriate `exercises/` folder
+- Include clear problem description
+- Provide sample input/output
+- Include solution in a separate file (e.g., `exercise_name_solution.py`)
+
+## Code Style Guidelines
+
+### Python Code
+- Follow PEP 8 style guide
+- Use meaningful variable names
+- Add comments for complex logic
+- Include docstrings for functions
+- Keep functions focused and small
+
+### SQL Code
+- Use uppercase for SQL keywords
+- Indent subqueries
+- Add comments explaining complex queries
+- Format for readability
+
+### Markdown
+- Use headers appropriately (# for main title, ## for sections, etc.) 
+- Include code blocks with language specification +- Add blank lines between sections +- Use bullet points for lists + +## Content Guidelines + +### Lessons +- Start with clear learning objectives +- Progress from simple to complex +- Include practical examples +- End with exercises or a project +- Link to additional resources + +### Examples +- Should be runnable without modification (when possible) +- Include error handling +- Demonstrate best practices +- Keep focused on one concept + +### Exercises +- Should reinforce lesson concepts +- Provide varying difficulty levels +- Include hints for difficult problems +- Solution should include explanation + +## Pull Request Process + +1. **Create a descriptive PR title** + - Good: "Add lesson on pandas groupby operations" + - Bad: "Update files" + +2. **Describe your changes** + - What was added/changed + - Why the change is needed + - Any relevant context + +3. **Ensure quality** + - Code runs without errors + - Markdown renders correctly + - No typos or grammar issues + - Links work correctly + +4. **Wait for review** + - Respond to feedback + - Make requested changes + - Be patient and respectful + +## Testing Your Contributions + +### Python Code +```bash +# Run the code to ensure it works +python your_script.py + +# Check for syntax errors +python -m py_compile your_script.py +``` + +### SQL Code +- Test queries in a database +- Verify results are correct +- Check for syntax errors + +### Markdown +- Preview in VS Code or GitHub +- Verify links work +- Check formatting + +## Code of Conduct + +- Be respectful and inclusive +- Welcome newcomers +- Provide constructive feedback +- Focus on the content, not the person +- Help create a positive learning environment + +## Questions? + +If you have questions about contributing: +1. Check existing issues and discussions +2. Open a new issue with your question +3. Tag it with "question" + +## Recognition + +Contributors will be acknowledged in the repository. Thank you for helping others learn! + +## License + +By contributing, you agree that your contributions will be licensed under the same license as this project (MIT License). diff --git a/FAQ.md b/FAQ.md new file mode 100644 index 0000000..19f2002 --- /dev/null +++ b/FAQ.md @@ -0,0 +1,232 @@ +# Frequently Asked Questions (FAQ) + +## General Questions + +### Q: Do I need prior programming experience? +**A:** No! This learning path starts from the basics. However, basic computer literacy is expected. + +### Q: How long will it take to complete? +**A:** It depends on your pace: +- **Full-time (40 hrs/week)**: 3-4 months +- **Part-time (15 hrs/week)**: 6-9 months +- **Casual (5 hrs/week)**: 12+ months + +### Q: Is this learning path free? +**A:** Yes! All materials in this repository are free. However, some recommended resources (books, courses) may have costs. + +### Q: What's the job market like for data engineers? +**A:** Data engineering is in high demand with competitive salaries. Entry-level positions typically require portfolio projects and internship experience. + +### Q: Can I skip sections? +**A:** Not recommended. Each section builds on previous ones. However, if you already know Python, you can move through it quickly. + +## Technical Questions + +### Q: Which Python version should I use? +**A:** Python 3.8 or higher. We recommend using the latest stable version (3.11 or 3.12). + +### Q: Windows, Mac, or Linux? +**A:** Any! All examples work on all platforms. Linux is common in production, but start with what you have. 
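+
+The one platform difference you'll hit early is virtual-environment activation; everything else in this path runs the same way on all three systems:
+
+```bash
+python -m venv venv          # identical on every OS
+source venv/bin/activate     # macOS / Linux
+venv\Scripts\activate        # Windows
+```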
+ +### Q: SQLite or PostgreSQL? +**A:** Start with SQLite (easier setup), then move to PostgreSQL (more features, production-ready). + +### Q: Do I need a powerful computer? +**A:** No. Basic specs are fine: +- 4GB RAM minimum (8GB recommended) +- 20GB free disk space +- Any modern processor + +### Q: How do I install Python? +**A:** See [Getting Started Guide](GETTING_STARTED.md) and the [Python Installation Lesson](01-python-fundamentals/lessons/01-getting-started.md). + +## Learning Questions + +### Q: I'm stuck on an exercise. What should I do? +**A:** +1. Read the error message carefully +2. Review the relevant lesson +3. Search for the error online +4. Ask in communities (provide details) +5. Check the solution (as last resort) + +### Q: How much time should I spend daily? +**A:** +- **Minimum**: 30 minutes (to maintain consistency) +- **Ideal**: 1-2 hours +- **Quality over quantity**: Focused 1 hour beats distracted 3 hours + +### Q: Should I take notes? +**A:** Yes! Taking notes helps retention. Keep a learning journal to track progress and challenges. + +### Q: When should I start building projects? +**A:** Start small projects early! Even simple programs help you learn. Complete the capstone projects after finishing relevant sections. + +### Q: How do I know if I'm ready to move to the next section? +**A:** You should be able to: +- Explain key concepts +- Complete most exercises independently +- Build a small project using the skills + +## Career Questions + +### Q: What jobs can I get after completing this? +**A:** +- Junior Data Engineer +- ETL Developer +- Data Pipeline Engineer +- Analytics Engineer +- BI Developer (with additional skills) + +### Q: What's the typical salary? +**A:** Varies by location and experience: +- **Entry-level**: $60k-$90k +- **Mid-level**: $90k-$130k +- **Senior**: $130k-$180k+ +(US market, adjust for your location) + +### Q: Do I need a degree? +**A:** Not always. Many companies hire based on skills and portfolio. However, some companies require a degree. A strong portfolio can compensate. + +### Q: Should I get certified? +**A:** Certifications can help but aren't required. Focus on: +1. Building strong portfolio +2. Understanding concepts deeply +3. Then consider certifications for specific tools/platforms + +### Q: How important is the capstone project? +**A:** Very! Employers want to see you can build real systems. Quality projects in your portfolio are crucial. + +## Tool Questions + +### Q: VS Code or PyCharm? +**A:** Either works great: +- **VS Code**: Free, lightweight, extensible +- **PyCharm**: More Python-specific features +- Try both, use what feels better + +### Q: Do I need to learn Docker? +**A:** Eventually, yes. It's covered in advanced topics. But master the basics first. + +### Q: Should I learn AWS, GCP, or Azure? +**A:** Learn cloud concepts first, then pick one: +- **AWS**: Most popular, lots of resources +- **GCP**: Strong data/ML tools +- **Azure**: Good for enterprise +Start with one, concepts transfer to others. + +### Q: What about Apache Spark? +**A:** Important for big data, but not required initially. It's covered in advanced topics after you're comfortable with Python and SQL. + +## Practice Questions + +### Q: Where can I practice SQL? +**A:** +- **LeetCode**: Database section +- **HackerRank**: SQL challenges +- **SQLZoo**: Interactive tutorials +- **Mode Analytics**: SQL tutorials with real data + +### Q: Where can I practice Python? 
+**A:** +- **LeetCode**: Python problems +- **HackerRank**: Python track +- **Codewars**: Community challenges +- **Exercism**: Mentor-supported practice + +### Q: How do I get real datasets to practice? +**A:** +- **Kaggle**: Thousands of datasets +- **data.gov**: Government data +- **GitHub**: Awesome datasets repositories +- **APIs**: Public APIs for real-time data + +## Troubleshooting + +### Q: My code isn't working but I don't see an error +**A:** +- Check indentation (Python is sensitive) +- Verify variable names (case-sensitive) +- Add print statements to debug +- Use Python debugger (pdb) + +### Q: I get "ModuleNotFoundError" +**A:** +```bash +# Install the missing module +pip install module_name + +# Make sure you're in the right virtual environment +which python # Should show your venv path +``` + +### Q: PostgreSQL connection fails +**A:** +- Is PostgreSQL running? `sudo service postgresql status` +- Check connection details (host, port, password) +- Verify database exists +- Check firewall settings + +### Q: Git push fails +**A:** +- Check you're on correct branch: `git branch` +- Pull first: `git pull origin branch_name` +- Verify credentials +- Check repository permissions + +## Community Questions + +### Q: Where can I get help? +**A:** +- **Reddit**: r/learnprogramming, r/dataengineering +- **Discord**: Python Discord, DataTalks.Club +- **Stack Overflow**: Ask specific questions +- **GitHub Issues**: For problems with this repo + +### Q: How can I contribute? +**A:** See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines. Contributions are welcome! + +### Q: Can I share my solutions? +**A:** Yes! Sharing helps others learn. Fork the repo, add your solutions, share on social media. + +### Q: Is there a study group? +**A:** Check the repository discussions or create your own! Many learners find study partners in Discord servers. + +## Next Steps Questions + +### Q: I finished everything. What's next? +**A:** Congratulations! πŸŽ‰ +1. Build more projects +2. Contribute to open source +3. Learn advanced topics (Airflow, Spark, etc.) +4. Prepare for interviews +5. Apply for jobs +6. Keep learning! + +### Q: What should I learn after this? +**A:** Depends on your interests: +- **Big Data**: Apache Spark, Hadoop +- **Cloud**: Deep dive into AWS/GCP/Azure +- **Orchestration**: Apache Airflow, Prefect +- **Streaming**: Kafka, Flink +- **ML Engineering**: MLOps, model deployment + +### Q: How do I prepare for interviews? +**A:** +1. Review system design +2. Practice SQL and Python problems +3. Prepare to explain your projects +4. Study common data engineering patterns +5. Research the company +6. Practice behavioral questions + +## Still Have Questions? + +- **Open an Issue**: For technical problems with the repository +- **Start a Discussion**: For learning questions and sharing +- **Join Communities**: Connect with other learners +- **Contact**: Check repository for contact information + +--- + +Remember: Every expert was once a beginner asking these same questions. Don't be afraid to ask for help! diff --git a/GETTING_STARTED.md b/GETTING_STARTED.md new file mode 100644 index 0000000..5ae9a84 --- /dev/null +++ b/GETTING_STARTED.md @@ -0,0 +1,358 @@ +# Getting Started with Data Engineer Learning Path + +Welcome! This guide will help you get started on your journey to becoming a data engineer. + +## 🎯 Is This For You? 
+ +This learning path is perfect if you: +- Want to become a data engineer +- Have basic computer skills +- Are willing to learn and practice regularly +- Can dedicate 10-15 hours per week +- Enjoy working with data and solving problems + +No prior programming experience is required, but it's helpful! + +## πŸ“… Time Commitment + +### Full-Time Study (40 hours/week) +- Complete path in 3-4 months +- Intensive learning +- Quick career transition + +### Part-Time Study (10-15 hours/week) +- Complete path in 6-9 months +- Balance with work/life +- Sustainable pace + +### Casual Learning (5-10 hours/week) +- Complete path in 12+ months +- Flexible schedule +- At your own pace + +## πŸ—ΊοΈ Your Learning Journey + +### Month 1-2: Python Foundations +**Goal**: Learn Python basics + +**What you'll do**: +- Install Python and VS Code +- Learn variables, loops, functions +- Write simple programs +- Complete exercises + +**Time**: 2-4 hours/day + +**Milestone**: Build a command-line todo app + +### Month 3-4: Python for Data +**Goal**: Use Python for data tasks + +**What you'll do**: +- Learn Pandas for data manipulation +- Work with CSV, JSON, Excel files +- Make API requests +- Process real datasets + +**Time**: 2-4 hours/day + +**Milestone**: Build data cleaning script + +### Month 3-5: SQL Fundamentals +**Goal**: Master database queries + +**What you'll do** (overlaps with Python): +- Learn SELECT, JOIN, GROUP BY +- Practice with real databases +- Write complex queries +- Optimize query performance + +**Time**: 1-2 hours/day + +**Milestone**: Design and query your own database + +### Month 5-7: Data Engineering Concepts +**Goal**: Understand data pipelines + +**What you'll do**: +- Learn ETL processes +- Build data pipelines +- Implement data quality checks +- Use version control + +**Time**: 2-3 hours/day + +**Milestone**: Build complete ETL pipeline + +### Month 7-9: Advanced Topics +**Goal**: Learn production tools + +**What you'll do**: +- Docker basics +- Cloud platforms intro +- Testing and CI/CD +- Workflow orchestration + +**Time**: 2-3 hours/day + +**Milestone**: Deploy containerized pipeline + +### Month 9+: Projects & Job Search +**Goal**: Build portfolio and find job + +**What you'll do**: +- Complete capstone projects +- Build portfolio +- Prepare for interviews +- Apply for jobs + +## πŸš€ Week 1 Action Plan + +### Day 1: Setup +- [ ] Install Python 3.8+ +- [ ] Install VS Code +- [ ] Install Git +- [ ] Create GitHub account +- [ ] Clone this repository + +### Day 2: Python Basics +- [ ] Read "Getting Started with Python" lesson +- [ ] Write your first Python program +- [ ] Complete basic syntax exercises +- [ ] Watch a Python tutorial video + +### Day 3: More Python +- [ ] Learn about variables and data types +- [ ] Practice with numbers and strings +- [ ] Do 5 coding exercises +- [ ] Start a learning journal + +### Day 4: Control Flow +- [ ] Learn if/else statements +- [ ] Learn loops (for, while) +- [ ] Write programs using control flow +- [ ] Complete 5 more exercises + +### Day 5: Functions +- [ ] Learn to write functions +- [ ] Understand parameters and return values +- [ ] Practice with function exercises +- [ ] Refactor previous code into functions + +### Day 6: Practice & Review +- [ ] Review all concepts from the week +- [ ] Complete remaining exercises +- [ ] Start a small project +- [ ] Join a coding community + +### Day 7: Rest & Plan +- [ ] Review your progress +- [ ] Plan next week +- [ ] Read about SQL +- [ ] Set up PostgreSQL (optional) + +## πŸ’» Required Software 
Setup + +### 1. Python +```bash +# Verify installation +python --version # Should be 3.8+ +pip --version +``` + +### 2. Code Editor (VS Code) +- Download from [code.visualstudio.com](https://code.visualstudio.com/) +- Install Python extension +- Install Git extension + +### 3. Git +```bash +# Verify installation +git --version +``` + +### 4. Database (Start with SQLite, add PostgreSQL later) +```bash +# SQLite is built into Python +python -c "import sqlite3; print('SQLite ready!')" +``` + +## πŸ“– Daily Study Routine + +### Option 1: Morning Learner (Before Work) +- **6:00-7:00 AM**: Study new concepts +- **7:00-7:30 AM**: Practice exercises +- **Evening**: Review and practice (30 min) + +### Option 2: Evening Learner (After Work) +- **7:00-8:00 PM**: Study new concepts +- **8:00-9:00 PM**: Practice and exercises +- **Weekend**: Projects and review + +### Option 3: Full-Time Student +- **9:00-11:00 AM**: Study new material +- **11:00-12:00 PM**: Practice exercises +- **1:00-3:00 PM**: Project work +- **3:00-4:00 PM**: Review and community + +## πŸ“š Learning Resources + +### Primary: This Repository +Follow the structured path here + +### Supplementary +- **Video**: YouTube Python tutorials +- **Practice**: LeetCode, HackerRank +- **Community**: Reddit r/learnprogramming +- **Documentation**: Official Python docs + +## 🎯 Setting Goals + +### Short-Term (1-2 weeks) +- Complete Python basics +- Write 10 simple programs +- Join online community + +### Medium-Term (1-3 months) +- Complete Python and SQL sections +- Build 3 small projects +- Start GitHub portfolio + +### Long-Term (6-12 months) +- Complete all sections +- Build 3 capstone projects +- Land data engineering job + +## πŸ“ Tracking Progress + +### Keep a Learning Journal +```markdown +# Date: 2024-01-15 +## What I Learned +- Python functions +- Return values +- Default parameters + +## What I Built +- Calculator program +- Temperature converter + +## Challenges +- Understanding scope +- Debugging errors + +## Tomorrow's Goals +- Learn about lists +- Practice with data structures +``` + +### Use GitHub +- Commit code daily +- Track your streak +- Build your portfolio +- Show your progress + +## 🀝 Getting Help + +### When Stuck +1. Read error messages carefully +2. Check documentation +3. Search Stack Overflow +4. Ask in communities +5. Take a break, come back fresh + +### Communities +- **Reddit**: r/learnprogramming, r/dataengineering +- **Discord**: Python Discord, DataTalks.Club +- **Stack Overflow**: Ask and answer questions +- **LinkedIn**: Connect with data engineers + +## πŸ’‘ Study Tips + +### Effective Learning +1. **Code Every Day**: Even 30 minutes +2. **Type, Don't Copy**: Type all examples +3. **Build Projects**: Apply what you learn +4. **Teach Others**: Explain concepts +5. **Review Regularly**: Revisit old topics +6. **Take Breaks**: Rest is important +7. 
**Stay Consistent**: Daily practice beats cramming
+
+### Avoid Common Pitfalls
+- ❌ Tutorial hell (watching without doing)
+- ❌ Learning too many things at once
+- ❌ Skipping fundamentals
+- ❌ Not practicing enough
+- ❌ Giving up when stuck
+- ❌ Comparing to others
+
+### Do This Instead
+- βœ… Build while learning
+- βœ… Focus on one topic at a time
+- βœ… Master basics first
+- βœ… Practice daily
+- βœ… Embrace challenges
+- βœ… Track your own progress
+
+## πŸŽ“ Study Groups
+
+### Find Study Partners
+- Local meetups
+- Online study groups
+- Discord servers
+- LinkedIn groups
+
+### Start Your Own
+- Invite friends to learn
+- Set weekly goals
+- Share progress
+- Support each other
+
+## πŸ“Š Progress Milestones
+
+### Beginner
+- βœ… Installed all required software
+- βœ… Written first Python program
+- βœ… Completed 10 exercises
+- βœ… Built first small project
+
+### Intermediate
+- βœ… Comfortable with Python basics
+- βœ… Can write SQL queries
+- βœ… Built data processing script
+- βœ… Used Git and GitHub
+
+### Advanced
+- βœ… Built complete ETL pipeline
+- βœ… Understand databases well
+- βœ… Can use cloud services
+- βœ… Ready for job interviews
+
+## πŸš€ Ready to Start?
+
+1. **Star this repository** on GitHub
+2. **Fork it** to your account
+3. **Clone** to your computer
+4. **Start with** 01-python-fundamentals
+5. **Code along** with examples
+6. **Complete** exercises
+7. **Build** projects
+8. **Share** your progress
+
+## πŸ“ž Questions?
+
+- Open an issue in this repository
+- Join our community discussions
+- Check the [FAQ](FAQ.md)
+
+## πŸŽ‰ Welcome!
+
+You're at the beginning of an exciting journey. Data engineering is a rewarding career with great opportunities. Take it one step at a time, practice consistently, and don't give up when things get challenging.
+
+**Remember**: Every expert was once a beginner. You can do this!
+
+---
+
+Ready to begin? Head over to **[01-python-fundamentals](01-python-fundamentals/README.md)** and start learning!
+
+**Happy Learning! πŸš€**
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..49ba495
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 Data Engineer Learning Path Contributors
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE. 
diff --git a/README.md b/README.md index b068193..62c7cd5 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,231 @@ -# data_engineer_learning_python_sql_path -data_engineer_learning_python_sql_path +# Data Engineer Learning Path - Python & SQL + +A comprehensive learning path for aspiring data engineers, covering essential Python programming and SQL database skills needed for modern data engineering roles. + +## πŸ“š Table of Contents + +- [Overview](#overview) +- [Getting Started](#getting-started) +- [Learning Path](#learning-path) +- [Prerequisites](#prerequisites) +- [Repository Structure](#repository-structure) +- [How to Use This Repository](#how-to-use-this-repository) +- [Resources](#resources) +- [FAQ](#faq) +- [Contributing](#contributing) + +## 🎯 Overview + +This repository provides a structured learning path for data engineers, focusing on: +- **Python Programming**: From basics to advanced data manipulation +- **SQL Databases**: Query writing, optimization, and database design +- **Data Engineering Concepts**: ETL/ELT, data pipelines, and data warehousing +- **Practical Projects**: Hands-on exercises and real-world scenarios + +## πŸš€ Getting Started + +**New to programming or data engineering?** Start here: + +πŸ‘‰ **[GETTING STARTED GUIDE](GETTING_STARTED.md)** - Your roadmap to begin learning + +This guide includes: +- Time commitments and study plans +- Week 1 action plan +- Software setup instructions +- Daily study routines +- Tips for success + +## πŸ›£οΈ Learning Path + +### Phase 1: Python Fundamentals (2-4 weeks) +- [ ] Python basics: variables, data types, control structures +- [ ] Functions and modules +- [ ] Object-oriented programming +- [ ] Error handling and debugging +- [ ] File I/O operations + +### Phase 2: Python for Data Engineering (4-6 weeks) +- [ ] Data structures: lists, dictionaries, sets, tuples +- [ ] Working with libraries: NumPy, Pandas +- [ ] Data manipulation and transformation +- [ ] API interactions and web scraping +- [ ] Working with CSV, JSON, and XML files + +### Phase 3: SQL Fundamentals (3-4 weeks) +- [ ] Basic SQL queries: SELECT, WHERE, ORDER BY +- [ ] Joins: INNER, LEFT, RIGHT, FULL OUTER +- [ ] Aggregate functions and GROUP BY +- [ ] Subqueries and CTEs (Common Table Expressions) +- [ ] Window functions + +### Phase 4: Advanced SQL (3-4 weeks) +- [ ] Database design and normalization +- [ ] Indexes and query optimization +- [ ] Stored procedures and functions +- [ ] Transactions and ACID properties +- [ ] Working with different SQL databases (PostgreSQL, MySQL, SQLite) + +### Phase 5: Data Engineering Concepts (4-6 weeks) +- [ ] ETL vs ELT processes +- [ ] Data pipelines and orchestration +- [ ] Data warehousing concepts +- [ ] Data quality and validation +- [ ] Version control with Git + +### Phase 6: Advanced Topics (6-8 weeks) +- [ ] Working with big data tools (introduction to Spark) +- [ ] Cloud platforms (AWS, GCP, Azure basics) +- [ ] Data streaming concepts +- [ ] Docker and containerization +- [ ] Testing and CI/CD for data pipelines + +### Phase 7: Practical Projects +- [ ] Build an ETL pipeline +- [ ] Create a data warehouse +- [ ] Implement data quality checks +- [ ] Build a dashboard with real-time data + +## πŸ“‹ Prerequisites + +- Basic computer literacy +- Understanding of basic programming concepts (helpful but not required) +- Willingness to learn and practice regularly +- A computer with internet access + +### Required Software +- Python 3.8 or higher +- A SQL database (PostgreSQL recommended, SQLite for 
beginners) +- Git for version control +- A code editor (VS Code, PyCharm, or similar) + +## πŸ“ Repository Structure + +``` +. +β”œβ”€β”€ 01-python-fundamentals/ # Python basics and fundamentals +β”‚ β”œβ”€β”€ lessons/ # Theory and explanations +β”‚ β”œβ”€β”€ examples/ # Code examples +β”‚ └── exercises/ # Practice exercises +β”‚ +β”œβ”€β”€ 02-python-data-engineering/ # Python for data tasks +β”‚ β”œβ”€β”€ lessons/ +β”‚ β”œβ”€β”€ examples/ +β”‚ └── exercises/ +β”‚ +β”œβ”€β”€ 03-sql-fundamentals/ # Basic SQL concepts +β”‚ β”œβ”€β”€ lessons/ +β”‚ β”œβ”€β”€ queries/ # SQL query examples +β”‚ └── exercises/ +β”‚ +β”œβ”€β”€ 04-advanced-sql/ # Advanced SQL topics +β”‚ β”œβ”€β”€ lessons/ +β”‚ β”œβ”€β”€ queries/ +β”‚ └── exercises/ +β”‚ +β”œβ”€β”€ 05-data-engineering/ # Data engineering concepts +β”‚ β”œβ”€β”€ lessons/ +β”‚ β”œβ”€β”€ projects/ +β”‚ └── exercises/ +β”‚ +β”œβ”€β”€ 06-advanced-topics/ # Advanced data engineering +β”‚ β”œβ”€β”€ lessons/ +β”‚ β”œβ”€β”€ projects/ +β”‚ └── exercises/ +β”‚ +β”œβ”€β”€ 07-projects/ # Capstone projects +β”‚ β”œβ”€β”€ etl-pipeline/ +β”‚ β”œβ”€β”€ data-warehouse/ +β”‚ └── real-time-dashboard/ +β”‚ +└── resources/ # Additional resources + β”œβ”€β”€ books.md + β”œβ”€β”€ courses.md + └── tools.md +``` + +## πŸš€ How to Use This Repository + +1. **Clone the repository** + ```bash + git clone https://github.com/fabianomalves/data_engineer_learning_python_sql_path.git + cd data_engineer_learning_python_sql_path + ``` + +2. **Install dependencies** (when needed) + ```bash + # Create a virtual environment + python -m venv venv + source venv/bin/activate # On Windows: venv\Scripts\activate + + # Install required packages + pip install -r requirements.txt + ``` + +3. **Follow the learning path sequentially** + - Start with Phase 1 and progress through each phase + - Complete exercises before moving to the next section + - Work on projects to apply your knowledge + +4. **Practice regularly** + - Code daily, even if just for 30 minutes + - Review concepts regularly + - Build your own projects alongside the provided ones + +5. **Track your progress** + - Check off completed topics in the learning path + - Keep a learning journal + - Share your progress and projects + +## πŸ“– Resources + +### Books +- "Python for Data Analysis" by Wes McKinney +- "SQL Performance Explained" by Markus Winand +- "Designing Data-Intensive Applications" by Martin Kleppmann + +### Online Platforms +- DataCamp +- Coursera +- LeetCode (for SQL practice) +- HackerRank + +### Documentation +- [Python Official Documentation](https://docs.python.org/3/) +- [PostgreSQL Documentation](https://www.postgresql.org/docs/) +- [Pandas Documentation](https://pandas.pydata.org/docs/) + +## ❓ FAQ + +Have questions? Check out our **[Frequently Asked Questions](FAQ.md)** covering: +- Getting started +- Technical setup +- Learning strategies +- Career advice +- Troubleshooting + +## 🀝 Contributing + +Contributions are welcome! If you'd like to contribute: + +1. Fork the repository +2. Create a new branch (`git checkout -b feature/improvement`) +3. Make your changes +4. Commit your changes (`git commit -am 'Add new feature'`) +5. Push to the branch (`git push origin feature/improvement`) +6. Create a Pull Request + +See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed guidelines. + +## πŸ“ License + +This project is open source and available under the MIT License. 
+ +## ⭐ Acknowledgments + +This learning path is designed to help aspiring data engineers build a strong foundation in Python and SQL, the two most essential skills for modern data engineering roles. + +--- + +**Happy Learning! πŸš€** + +For questions or discussions, please open an issue in this repository. diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..dab89ea --- /dev/null +++ b/requirements.txt @@ -0,0 +1,34 @@ +# Python Data Engineering Learning Path Requirements + +# Core data manipulation libraries +pandas>=1.5.0 +numpy>=1.24.0 + +# Database connectivity +sqlalchemy>=2.0.0 +psycopg2-binary>=2.9.0 + +# File format support +openpyxl>=3.1.0 # Excel files +pyarrow>=12.0.0 # Parquet files + +# API and web +requests>=2.31.0 + +# Configuration +python-dotenv>=1.0.0 +pyyaml>=6.0 + +# Testing +pytest>=7.4.0 +pytest-cov>=4.1.0 + +# Data validation +great-expectations>=0.17.0 + +# Utilities +python-dateutil>=2.8.0 + +# Optional: For advanced examples +# apache-airflow>=2.7.0 # Uncomment if learning Airflow +# pyspark>=3.4.0 # Uncomment if learning Spark diff --git a/resources/books.md b/resources/books.md new file mode 100644 index 0000000..2ca34e1 --- /dev/null +++ b/resources/books.md @@ -0,0 +1,134 @@ +# Recommended Books for Data Engineers + +## Python Programming + +### Beginner +1. **"Python Crash Course" by Eric Matthes** + - Perfect for complete beginners + - Hands-on projects + - Clear explanations + +2. **"Automate the Boring Stuff with Python" by Al Sweigart** + - Practical Python applications + - Free to read online + - Great for automation tasks + +### Intermediate +3. **"Python for Data Analysis" by Wes McKinney** + - Written by the creator of Pandas + - Essential for data manipulation + - Real-world examples + +4. **"Fluent Python" by Luciano Ramalho** + - Deep dive into Python + - Best practices + - Advanced features + +## SQL and Databases + +5. **"SQL Performance Explained" by Markus Winand** + - Query optimization + - Index strategies + - Database-agnostic + +6. **"Learning SQL" by Alan Beaulieu** + - Comprehensive introduction + - Practical examples + - Covers MySQL + +7. **"PostgreSQL: Up and Running" by Regina Obe and Leo Hsu** + - PostgreSQL specific + - Quick start guide + - Best practices + +## Data Engineering + +8. **"Designing Data-Intensive Applications" by Martin Kleppmann** + - Must-read for data engineers + - Covers fundamental concepts + - Architecture patterns + +9. **"Fundamentals of Data Engineering" by Joe Reis and Matt Housley** + - Modern data engineering practices + - Lifecycle approach + - Tool-agnostic + +10. **"The Data Warehouse Toolkit" by Ralph Kimball and Margy Ross** + - Dimensional modeling + - Data warehouse design + - Industry standard + +11. **"Data Pipelines Pocket Reference" by James Densmore** + - Quick reference guide + - Pipeline patterns + - Best practices + +## System Design and Architecture + +12. **"Building Microservices" by Sam Newman** + - Microservices architecture + - Relevant for distributed systems + - Practical guidance + +13. **"Site Reliability Engineering" by Google** + - Production systems + - Monitoring and reliability + - Free online + +## Additional Topics + +14. **"The Pragmatic Programmer" by David Thomas and Andrew Hunt** + - Software craftsmanship + - Best practices + - Career development + +15. **"Clean Code" by Robert C. 
Martin** + - Code quality + - Refactoring + - Maintainability + +## Online Resources + +### Free Books +- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) +- [SQL Murder Mystery](https://mystery.knightlab.com/) - Interactive SQL learning +- [The Data Engineering Cookbook](https://github.com/andkret/Cookbook) + +### Documentation +- [Python Official Docs](https://docs.python.org/3/) +- [PostgreSQL Docs](https://www.postgresql.org/docs/) +- [Pandas Documentation](https://pandas.pydata.org/docs/) + +## Reading Plan + +### Month 1-2: Python Foundations +- Python Crash Course +- Automate the Boring Stuff + +### Month 3-4: Data Manipulation +- Python for Data Analysis +- Learning SQL + +### Month 5-6: Advanced Topics +- Designing Data-Intensive Applications +- Fundamentals of Data Engineering + +### Ongoing +- SQL Performance Explained (reference) +- The Pragmatic Programmer (reference) + +## Tips for Reading Technical Books + +1. **Code Along**: Type out examples as you read +2. **Take Notes**: Summarize key concepts +3. **Do Exercises**: Complete all practice problems +4. **Apply**: Use concepts in your own projects +5. **Review**: Revisit difficult chapters +6. **Discuss**: Join study groups or forums + +## Where to Buy + +- [O'Reilly Learning Platform](https://learning.oreilly.com/) - Subscription access to many books +- Amazon (Kindle or Physical) +- [Manning Publications](https://www.manning.com/) - Often has sales +- Local Library - Many libraries have O'Reilly access diff --git a/resources/cheatsheet.md b/resources/cheatsheet.md new file mode 100644 index 0000000..cec0da4 --- /dev/null +++ b/resources/cheatsheet.md @@ -0,0 +1,572 @@ +# Data Engineer Cheatsheet + +Quick reference for common Python and SQL operations used in data engineering. + +## Python Basics + +### Variables and Data Types +```python +# Numbers +integer = 42 +floating = 3.14 +complex_num = 3 + 4j + +# Strings +text = "Hello" +multi_line = """Multiple +lines""" + +# Boolean +is_true = True +is_false = False + +# None +empty = None +``` + +### Lists +```python +# Create +my_list = [1, 2, 3, 4, 5] +mixed = [1, "two", 3.0, True] + +# Access +first = my_list[0] +last = my_list[-1] +slice = my_list[1:4] # [2, 3, 4] + +# Modify +my_list.append(6) # Add to end +my_list.insert(0, 0) # Insert at position +my_list.remove(3) # Remove first occurrence +popped = my_list.pop() # Remove and return last + +# Common operations +length = len(my_list) +sorted_list = sorted(my_list) +reversed_list = list(reversed(my_list)) +``` + +### Dictionaries +```python +# Create +person = {"name": "Alice", "age": 30, "city": "NYC"} + +# Access +name = person["name"] +age = person.get("age", 0) # With default + +# Modify +person["email"] = "alice@example.com" +person.update({"phone": "123-456-7890"}) + +# Iterate +for key, value in person.items(): + print(f"{key}: {value}") +``` + +### Control Flow +```python +# If-else +if x > 0: + print("Positive") +elif x < 0: + print("Negative") +else: + print("Zero") + +# For loop +for i in range(5): + print(i) + +for item in my_list: + print(item) + +# While loop +while x > 0: + x -= 1 + +# List comprehension +squares = [x**2 for x in range(10)] +evens = [x for x in range(10) if x % 2 == 0] +``` + +### Functions +```python +# Basic function +def greet(name): + return f"Hello, {name}!" + +# Default arguments +def greet(name="World"): + return f"Hello, {name}!" 
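+
+# Variable-length arguments (*args / **kwargs) -- a common pattern
+# in pipeline helpers; the function name here is illustrative
+def report(*args, **kwargs):
+    print(args)    # positional arguments as a tuple
+    print(kwargs)  # keyword arguments as a dict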
+ +# Multiple return values +def stats(numbers): + return min(numbers), max(numbers), sum(numbers) + +# Lambda function +square = lambda x: x**2 +``` + +## Pandas + +### DataFrame Creation +```python +import pandas as pd + +# From dictionary +df = pd.DataFrame({ + 'A': [1, 2, 3], + 'B': [4, 5, 6] +}) + +# From CSV +df = pd.read_csv('file.csv') + +# From SQL +df = pd.read_sql('SELECT * FROM table', connection) +``` + +### Data Selection +```python +# Columns +df['column_name'] +df[['col1', 'col2']] + +# Rows +df.iloc[0] # By position +df.loc[0] # By label +df.iloc[0:5] # First 5 rows +df.head(10) # First 10 rows +df.tail(10) # Last 10 rows + +# Conditional +df[df['age'] > 30] +df[(df['age'] > 30) & (df['city'] == 'NYC')] +``` + +### Data Manipulation +```python +# Add column +df['new_col'] = df['col1'] + df['col2'] + +# Drop column +df = df.drop('column_name', axis=1) + +# Rename +df = df.rename(columns={'old': 'new'}) + +# Sort +df = df.sort_values('column_name') +df = df.sort_values(['col1', 'col2'], ascending=[True, False]) + +# Group by +grouped = df.groupby('category')['value'].sum() +grouped = df.groupby('category').agg({ + 'value': ['sum', 'mean', 'count'] +}) +``` + +### Data Cleaning +```python +# Handle missing values +df.isnull().sum() # Count nulls +df = df.dropna() # Drop rows with nulls +df = df.fillna(0) # Fill with value +df = df.fillna(df.mean()) # Fill with mean + +# Remove duplicates +df = df.drop_duplicates() +df = df.drop_duplicates(subset=['column']) + +# Data types +df.dtypes # Check types +df['col'] = df['col'].astype(int) # Convert type +``` + +### Merging DataFrames +```python +# Merge (SQL-like joins) +merged = pd.merge(df1, df2, on='key') +merged = pd.merge(df1, df2, on='key', how='left') + +# Concat (append) +combined = pd.concat([df1, df2], axis=0) # Rows +combined = pd.concat([df1, df2], axis=1) # Columns +``` + +## SQL + +### Basic Queries +```sql +-- Select +SELECT column1, column2 FROM table; +SELECT * FROM table; +SELECT DISTINCT city FROM customers; + +-- Where +SELECT * FROM table WHERE age > 25; +SELECT * FROM table WHERE age > 25 AND city = 'NYC'; +SELECT * FROM table WHERE age BETWEEN 20 AND 30; +SELECT * FROM table WHERE city IN ('NYC', 'LA', 'SF'); +SELECT * FROM table WHERE name LIKE 'A%'; + +-- Order By +SELECT * FROM table ORDER BY age DESC; +SELECT * FROM table ORDER BY age, name; + +-- Limit +SELECT * FROM table LIMIT 10; +SELECT * FROM table LIMIT 10 OFFSET 20; +``` + +### Joins +```sql +-- Inner Join +SELECT a.*, b.name +FROM orders a +INNER JOIN customers b ON a.customer_id = b.id; + +-- Left Join +SELECT a.*, b.name +FROM orders a +LEFT JOIN customers b ON a.customer_id = b.id; + +-- Multiple Joins +SELECT o.order_id, c.name, p.product_name +FROM orders o +JOIN customers c ON o.customer_id = c.id +JOIN products p ON o.product_id = p.id; +``` + +### Aggregations +```sql +-- Basic aggregates +SELECT COUNT(*) FROM table; +SELECT SUM(amount) FROM orders; +SELECT AVG(salary) FROM employees; +SELECT MIN(price), MAX(price) FROM products; + +-- Group By +SELECT city, COUNT(*) as count +FROM customers +GROUP BY city; + +SELECT department, AVG(salary) as avg_salary +FROM employees +GROUP BY department +HAVING AVG(salary) > 50000; +``` + +### Subqueries +```sql +-- In WHERE clause +SELECT * FROM employees +WHERE salary > (SELECT AVG(salary) FROM employees); + +-- In FROM clause +SELECT dept, avg_sal +FROM ( + SELECT department as dept, AVG(salary) as avg_sal + FROM employees + GROUP BY department +) subquery +WHERE avg_sal > 60000; +``` + +### Window 
Functions +```sql +-- Running total +SELECT date, amount, + SUM(amount) OVER (ORDER BY date) as running_total +FROM sales; + +-- Ranking +SELECT name, salary, + RANK() OVER (ORDER BY salary DESC) as rank +FROM employees; + +-- Partition +SELECT department, name, salary, + AVG(salary) OVER (PARTITION BY department) as dept_avg +FROM employees; +``` + +### Data Modification +```sql +-- Insert +INSERT INTO table (col1, col2) VALUES (val1, val2); +INSERT INTO table VALUES (val1, val2, val3); + +-- Update +UPDATE table SET col1 = val1 WHERE condition; + +-- Delete +DELETE FROM table WHERE condition; + +-- Create Table +CREATE TABLE users ( + id SERIAL PRIMARY KEY, + name VARCHAR(100) NOT NULL, + email VARCHAR(100) UNIQUE, + created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP +); +``` + +## Database Operations in Python + +### SQLAlchemy +```python +from sqlalchemy import create_engine + +# Connect +engine = create_engine('postgresql://user:pass@localhost/db') + +# Read +df = pd.read_sql('SELECT * FROM table', engine) +df = pd.read_sql_table('table_name', engine) +df = pd.read_sql_query('SELECT * FROM table WHERE x > 5', engine) + +# Write +df.to_sql('table_name', engine, if_exists='replace', index=False) +# if_exists: 'fail', 'replace', 'append' +``` + +### Psycopg2 (PostgreSQL) +```python +import psycopg2 + +# Connect +conn = psycopg2.connect( + host="localhost", + database="mydb", + user="user", + password="password" +) + +# Execute +cursor = conn.cursor() +cursor.execute("SELECT * FROM table") +rows = cursor.fetchall() + +# With parameters +cursor.execute("SELECT * FROM table WHERE id = %s", (id,)) + +# Commit and close +conn.commit() +cursor.close() +conn.close() +``` + +## File Operations + +### CSV +```python +# Read +df = pd.read_csv('file.csv') +df = pd.read_csv('file.csv', sep=';', encoding='utf-8') + +# Write +df.to_csv('output.csv', index=False) +df.to_csv('output.csv', sep='\t', encoding='utf-8') +``` + +### JSON +```python +# Read +df = pd.read_json('file.json') +df = pd.read_json('file.json', orient='records') + +# Write +df.to_json('output.json', orient='records', indent=2) +``` + +### Excel +```python +# Read +df = pd.read_excel('file.xlsx', sheet_name='Sheet1') + +# Write +df.to_excel('output.xlsx', sheet_name='Data', index=False) + +# Multiple sheets +with pd.ExcelWriter('output.xlsx') as writer: + df1.to_excel(writer, sheet_name='Sheet1') + df2.to_excel(writer, sheet_name='Sheet2') +``` + +### Parquet +```python +# Read +df = pd.read_parquet('file.parquet') + +# Write +df.to_parquet('output.parquet', compression='snappy') +``` + +## Date/Time Operations + +### Python datetime +```python +from datetime import datetime, timedelta + +# Current +now = datetime.now() +today = datetime.today() + +# Create +dt = datetime(2024, 1, 15, 10, 30) + +# Parse +dt = datetime.strptime('2024-01-15', '%Y-%m-%d') + +# Format +formatted = dt.strftime('%Y-%m-%d %H:%M:%S') + +# Arithmetic +tomorrow = now + timedelta(days=1) +last_week = now - timedelta(weeks=1) +``` + +### Pandas datetime +```python +# Convert to datetime +df['date'] = pd.to_datetime(df['date_string']) + +# Extract components +df['year'] = df['date'].dt.year +df['month'] = df['date'].dt.month +df['day'] = df['date'].dt.day +df['dayofweek'] = df['date'].dt.dayofweek + +# Date arithmetic +df['next_week'] = df['date'] + pd.Timedelta(weeks=1) + +# Resample (time series) +df.set_index('date').resample('D').mean() # Daily average +df.set_index('date').resample('M').sum() # Monthly sum +``` + +## Common Patterns + +### ETL Pattern +```python 
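+# Setup assumed by this sketch: pandas plus a SQLAlchemy engine
+# (the sqlite URL is a stand-in -- point it at your own database)
+import pandas as pd
+from sqlalchemy import create_engine
+
+engine = create_engine('sqlite:///etl_output.db')
+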
+def extract_data(source):
+    """Extract data from source"""
+    return pd.read_csv(source)
+
+def transform_data(df):
+    """Clean and transform data"""
+    df = df.dropna()
+    df['new_col'] = df['col1'] + df['col2']
+    return df
+
+def load_data(df, target):
+    """Load data to target"""
+    # engine as created in the SQLAlchemy section above
+    df.to_sql(target, engine, if_exists='replace', index=False)
+
+# Run ETL
+df = extract_data('input.csv')
+df = transform_data(df)
+load_data(df, 'output_table')
+```
+
+### Error Handling
+```python
+try:
+    df = pd.read_csv('file.csv')
+except FileNotFoundError:
+    print("File not found")
+except pd.errors.ParserError:
+    print("Error parsing CSV")
+except Exception as e:
+    print(f"Unexpected error: {e}")
+finally:
+    print("Cleanup")
+```
+
+### Logging
+```python
+import logging
+
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+)
+
+logger = logging.getLogger(__name__)
+
+logger.info("Processing started")
+logger.warning("Warning message")
+logger.error("Error occurred")
+```
+
+## Git Commands
+
+```bash
+# Clone
+git clone <repository-url>
+
+# Status
+git status
+
+# Add files
+git add file.py
+git add .
+
+# Commit
+git commit -m "Message"
+
+# Push
+git push origin branch_name
+
+# Pull
+git pull origin branch_name
+
+# Branch
+git branch new_branch
+git checkout new_branch
+git checkout -b new_branch  # Create and switch
+
+# Merge
+git merge branch_name
+
+# View history
+git log
+git log --oneline
+```
+
+## Docker Commands
+
+```bash
+# Build
+docker build -t myapp .
+
+# Run
+docker run myapp
+docker run -p 5000:5000 myapp
+docker run -d myapp  # Detached
+
+# List
+docker ps     # Running
+docker ps -a  # All
+
+# Stop/Start
+docker stop container_id
+docker start container_id
+
+# Remove
+docker rm container_id
+docker rmi image_id
+
+# Docker Compose (newer Docker versions also accept `docker compose`)
+docker-compose up
+docker-compose up -d
+docker-compose down
+docker-compose logs
+```
+
+---
+
+This cheatsheet covers the most common operations. Bookmark it for quick reference!
diff --git a/resources/courses.md b/resources/courses.md
new file mode 100644
index 0000000..429e521
--- /dev/null
+++ b/resources/courses.md
@@ -0,0 +1,252 @@
+# Online Courses and Learning Platforms
+
+## Comprehensive Learning Platforms
+
+### DataCamp
+- **Data Engineer Career Track**
+  - Structured learning path
+  - Interactive exercises
+  - Hands-on projects
+  - Certificate upon completion
+
+### Coursera
+- **IBM Data Engineering Professional Certificate**
+  - 13 courses
+  - Real-world projects
+  - Industry-recognized certificate
+
+- **Google Data Analytics Professional Certificate**
+  - Beginner-friendly
+  - SQL and analysis skills
+  - Portfolio projects
+
+### Udacity
+- **Data Engineering Nanodegree**
+  - Project-based learning
+  - Mentor support
+  - Industry partnerships
+  - Advanced topics
+
+## Python Courses
+
+### Free
+1. **Python for Everybody (Coursera)**
+   - Dr. Chuck Severance
+   - Beginner-friendly
+   - University of Michigan
+
+2. **CS50's Introduction to Programming with Python (edX)**
+   - Harvard University
+   - Free certification available
+   - High quality
+
+3. **freeCodeCamp Python Courses (YouTube)**
+   - Multiple full courses
+   - Completely free
+   - Various skill levels
+
+### Paid
+4. **Complete Python Bootcamp (Udemy)**
+   - Jose Portilla
+   - Comprehensive coverage
+   - Practical projects
+
+5. **Python for Data Science and Machine Learning (Udemy)**
+   - Jose Portilla
+   - Focus on data libraries
+   - Hands-on projects
+
+## SQL Courses
+
+### Free
+1. **Khan Academy - Intro to SQL**
+   - Interactive lessons
+   - Beginner-friendly
+   - Query practice
+
+2. **SQLBolt**
+   - Interactive SQL lessons
+   - Progressive difficulty
+   - Great for beginners
+
+3. **Mode SQL Tutorial**
+   - SQL for data analysis
+   - Real datasets
+   - Advanced topics
+
+### Paid
+4. **The Complete SQL Bootcamp (Udemy)**
+   - PostgreSQL focus
+   - Practical examples
+   - Assessment tests
+
+5. **DataCamp SQL Fundamentals Track**
+   - Multiple courses
+   - Interactive exercises
+   - Progressive learning
+
+## Data Engineering Specific
+
+### Free
+1. **AWS Data Engineering Fundamentals (Coursera)**
+   - Cloud data engineering
+   - AWS services
+   - Free to audit
+
+2. **Microsoft Azure Data Engineering (Coursera)**
+   - Azure-specific
+   - Cloud technologies
+   - Free to audit
+
+### Paid
+3. **Data Engineering on Google Cloud Platform (Coursera)**
+   - GCP focus
+   - Professional certificate
+   - Hands-on labs
+
+4. **The Complete Hands-On Introduction to Apache Airflow (Udemy)**
+   - Workflow orchestration
+   - Practical projects
+   - Industry tool
+
+## Practice Platforms
+
+### Coding Practice
+1. **LeetCode**
+   - SQL problems
+   - Algorithm practice
+   - Interview preparation
+
+2. **HackerRank**
+   - Python challenges
+   - SQL challenges
+   - Certification tests
+
+3. **Codewars**
+   - Community challenges
+   - Multiple languages
+   - Progressive difficulty
+
+4. **Exercism**
+   - Mentor-supported
+   - Language tracks
+   - Free practice
+
+### SQL Specific
+5. **SQLZoo**
+   - Interactive SQL tutorials
+   - Progressive difficulty
+   - Multiple SQL dialects
+
+6. **SQL Murder Mystery**
+   - Learn through games
+   - Engaging format
+   - Good for beginners
+
+## YouTube Channels
+
+### Data Engineering
+1. **Data Engineering TV**
+   - Industry insights
+   - Tool reviews
+   - Best practices
+
+2. **Seattle Data Guy**
+   - Career advice
+   - Technical tutorials
+   - Industry trends
+
+### Python
+3. **Corey Schafer**
+   - Clear explanations
+   - Python tutorials
+   - Practical examples
+
+4. **Real Python**
+   - Python tutorials
+   - Best practices
+   - Various skill levels
+
+### SQL
+5. **Socratica - SQL**
+   - Clear, concise lessons
+   - Professional production
+   - Beginner-friendly
+
+## Learning Path Recommendations
+
+### Complete Beginner (0-3 months)
+1. Python for Everybody (Coursera)
+2. Khan Academy SQL
+3. DataCamp Python Fundamentals
+
+### Intermediate (3-6 months)
+1. Python for Data Analysis course
+2. Complete SQL Bootcamp
+3. Practice on LeetCode/HackerRank
+
+### Advanced (6-12 months)
+1. Data Engineering Nanodegree (Udacity)
+2. Cloud platform specialization
+3. Apache Airflow course
+
+## Certification Paths
+
+### Entry Level
+- **DataCamp Data Engineer Track**
+- **HackerRank Python/SQL Certificates**
+
+### Professional
+- **AWS Certified Data Analytics**
+- **Google Professional Data Engineer**
+- **Microsoft Certified: Azure Data Engineer**
+
+### Advanced
+- **Databricks Certified Data Engineer**
+- **Snowflake SnowPro Core Certification**
+
+## Tips for Online Learning
+
+1. **Set a Schedule**: Dedicate specific hours each week
+2. **Take Notes**: Summarize key concepts
+3. **Code Along**: Practice while watching
+4. **Build Projects**: Apply what you learn
+5. **Join Communities**: Connect with other learners
+6. **Review Regularly**: Revisit difficult topics
+7. **Stay Consistent**: Daily practice beats cramming
+
+## Free vs Paid
+
+### When Free is Enough
+- Learning basics
+- Exploring new topics
+- Casual learning
+- Budget constraints
+
+### When to Consider Paid
+- Structured learning path needed
+- Want certification
+- Need mentor support
+- Serious career change
+
+## Community Learning
+
+### Forums and Communities
+- Reddit: r/dataengineering, r/learnpython
+- Stack Overflow
+- Discord servers for data engineering
+- LinkedIn groups
+
+### Study Groups
+- Local meetups
+- Online study groups
+- Code review sessions
+- Peer learning
+
+## Budget-Friendly Options
+
+1. **Library Access**: Many public libraries offer free access to Coursera/LinkedIn Learning
+2. **Financial Aid**: Coursera offers financial aid
+3. **Free Trials**: Try platforms before committing
+4. **Company Benefits**: Check if your employer offers a learning budget
+5. **YouTube**: Vast free content available
diff --git a/resources/tools.md b/resources/tools.md
new file mode 100644
index 0000000..287638c
--- /dev/null
+++ b/resources/tools.md
@@ -0,0 +1,337 @@
+# Essential Tools for Data Engineers
+
+## Development Environment
+
+### Code Editors and IDEs
+
+#### Visual Studio Code (Recommended for Beginners)
+- **Free and open source**
+- **Extensions**: Python, SQL, Git
+- **Features**: Debugging, integrated terminal, Git integration
+- **Download**: [code.visualstudio.com](https://code.visualstudio.com/)
+
+#### PyCharm
+- **Professional**: Paid, full-featured
+- **Community**: Free, perfect for Python
+- **Features**: Advanced debugging, database tools, refactoring
+- **Download**: [jetbrains.com/pycharm](https://www.jetbrains.com/pycharm/)
+
+#### Jupyter Notebook / JupyterLab
+- **Interactive Python environment**
+- **Great for**: Data exploration, documentation
+- **Install**: `pip install jupyter`
+
+### Terminal and Command Line
+
+#### Windows
+- **PowerShell**
+- **Windows Terminal** (modern, recommended)
+- **Git Bash** (Unix-like commands on Windows)
+- **WSL** (Windows Subsystem for Linux)
+
+#### Mac/Linux
+- **Terminal** (built-in)
+- **iTerm2** (Mac, advanced features)
+- **Zsh** with Oh My Zsh (enhanced shell)
+
+## Version Control
+
+### Git
+- **Essential skill** for all developers
+- **Commands**: clone, commit, push, pull, branch, merge
+- **Download**: [git-scm.com](https://git-scm.com/)
+
+### GitHub
+- **Code hosting** and collaboration
+- **Portfolio**: Showcase your projects
+- **Learning**: Explore open source projects
+
+### Alternatives
+- **GitLab**: Similar to GitHub, strong built-in CI/CD
+- **Bitbucket**: Integrates with Atlassian tools
+
+## Databases
+
+### SQLite
+- **Best for**: Learning, small projects
+- **Advantages**: No server, file-based, simple
+- **Built-in** to Python (see the snippet at the end of this section)
+
+### PostgreSQL (Recommended)
+- **Best for**: Production, learning advanced SQL
+- **Features**: Full-featured, reliable, open source
+- **Download**: [postgresql.org](https://www.postgresql.org/)
+
+### MySQL/MariaDB
+- **Popular**: Widely used in web applications
+- **Similar to**: PostgreSQL in many ways
+
+### Database Tools
+
+#### DBeaver (Recommended)
+- **Free and open source**
+- **Supports**: All major databases
+- **Features**: Query builder, ER diagrams, data export
+- **Download**: [dbeaver.io](https://dbeaver.io/)
+
+#### pgAdmin
+- **PostgreSQL specific**
+- **Official tool**
+- **Full-featured**
+
+#### DataGrip (JetBrains)
+- **Paid, professional**
+- **Multi-database support**
+- **Advanced features**
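+Because `sqlite3` ships with Python's standard library, you can try SQL with zero setup. A minimal sketch (the `example.db` file and `users` table are just illustrations):
+
+```python
+import sqlite3
+
+# Creates example.db on first use; no server required
+conn = sqlite3.connect("example.db")
+cur = conn.cursor()
+cur.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
+cur.execute("INSERT INTO users (name) VALUES (?)", ("Ada",))  # parameterized insert
+conn.commit()
+print(cur.execute("SELECT * FROM users").fetchall())
+conn.close()
+```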
+## Python Libraries
+
+### Essential for Data Engineering
+
+```bash
+# Data manipulation
+pip install pandas
+pip install numpy
+
+# Database connectivity
+pip install psycopg2-binary  # PostgreSQL
+pip install pymysql          # MySQL
+pip install sqlalchemy       # ORM and database abstraction
+
+# Data validation
+pip install great-expectations
+pip install pandera
+
+# API interactions
+pip install requests
+pip install httpx
+
+# File formats
+pip install openpyxl  # Excel
+pip install pyarrow   # Parquet
+pip install lxml      # XML
+
+# Configuration
+pip install python-dotenv  # Environment variables
+pip install pyyaml         # YAML files
+
+# Testing
+pip install pytest
+pip install pytest-cov
+
+# Date/Time
+pip install python-dateutil
+```
+
+### Advanced Libraries
+
+```bash
+# Workflow orchestration (Airflow is best installed with a pinned constraints file; see its docs)
+pip install apache-airflow
+
+# Data transformation (testing and docs built in)
+pip install dbt-core
+
+# Cloud SDKs
+pip install boto3                 # AWS
+pip install google-cloud-storage  # GCP
+pip install azure-storage-blob    # Azure
+
+# Big data
+pip install pyspark
+
+# Logging and monitoring
+pip install loguru
+```
+
+## Workflow Orchestration
+
+### Apache Airflow
+- **Industry standard** for data pipelines
+- **Features**: Scheduling, monitoring, retries
+- **Python-based** DAGs (Directed Acyclic Graphs); a minimal DAG sketch appears just before the Package Management section below
+
+### Alternatives
+- **Prefect**: Modern, Pythonic
+- **Dagster**: Data-aware orchestration
+- **Luigi**: Spotify's workflow manager
+
+## Containerization
+
+### Docker
+- **Essential** for modern data engineering
+- **Benefits**: Consistent environments, easy deployment
+- **Download**: [docker.com](https://www.docker.com/)
+
+### Docker Compose
+- **Multi-container** applications
+- **Great for**: Local development
+- **Included** with Docker Desktop
+
+## Cloud Platforms
+
+### AWS (Amazon Web Services)
+- **Services**: S3, RDS, Redshift, Glue, Lambda
+- **Most popular** cloud platform
+- **Free tier** available
+
+### Google Cloud Platform (GCP)
+- **Services**: BigQuery, Cloud Storage, Dataflow
+- **Strong** data and ML offerings
+- **Free tier** available
+
+### Microsoft Azure
+- **Services**: Azure SQL, Data Factory, Synapse
+- **Enterprise focused**
+- **Free tier** available
+
+## Data Quality and Testing
+
+### Great Expectations
+- **Data validation** framework
+- **Documentation** generation
+- **Integration** with pipelines
+
+### Pytest
+- **Testing framework**
+- **Essential** for production code
+- **Simple** and powerful
+
+### dbt (data build tool)
+- **SQL-based** transformations
+- **Testing** built-in
+- **Documentation** generation
+
+## Monitoring and Logging
+
+### Logging
+- **Python logging** module (built-in)
+- **Loguru**: Modern logging library
+- **Structured logging**: JSON logs
+
+### Monitoring Tools
+- **Prometheus**: Metrics collection
+- **Grafana**: Visualization
+- **ELK Stack**: Elasticsearch, Logstash, Kibana
+
+## Documentation
+
+### Markdown
+- **Standard** for documentation
+- **Easy to learn**
+- **Supported everywhere**
+
+### Sphinx
+- **Python documentation** generator
+- **Used by** Python itself
+- **Professional** output
+
+### Draw.io / Diagrams.net
+- **Free diagramming** tool
+- **Architecture diagrams**
+- **Data flow diagrams**
+
+## Productivity Tools
+
+### Task Management
+- **Notion**: All-in-one workspace
+- **Trello**: Kanban boards
+- **Todoist**: Simple todo lists
+
+### Note Taking
+- **Obsidian**: Markdown-based
+- **Notion**: Rich features
+- **Jupyter notebooks**: Code + notes
+
+### Communication
+- **Slack**: Team communication
+- **Discord**: Communities
+- **Stack Overflow**: Q&A
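+As referenced in the Workflow Orchestration section above, here is a minimal Airflow DAG sketch. It assumes Airflow 2.x (2.4+ for the `schedule` argument); the `example_etl` DAG and its tasks are placeholders, not a real pipeline:
+
+```python
+# minimal_dag.py -- place in Airflow's dags/ folder (assumes Airflow 2.x)
+from datetime import datetime
+
+from airflow import DAG
+from airflow.operators.python import PythonOperator
+
+def extract():
+    print("extracting...")
+
+def transform():
+    print("transforming...")
+
+with DAG(
+    dag_id="example_etl",
+    start_date=datetime(2024, 1, 1),
+    schedule="@daily",  # schedule_interval in Airflow < 2.4
+    catchup=False,
+) as dag:
+    extract_task = PythonOperator(task_id="extract", python_callable=extract)
+    transform_task = PythonOperator(task_id="transform", python_callable=transform)
+    extract_task >> transform_task  # extract runs before transform
+```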
+## Package Management
+
+### pip
+- **Default** Python package manager
+- **Essential** for installing libraries
+
+### conda
+- **Environment** and package manager
+- **Good for**: Data science
+- **Includes**: Non-Python dependencies
+
+### Poetry
+- **Modern** dependency management
+- **Better** dependency resolution
+- **Recommended** for projects
+
+## Recommended Setup for Beginners
+
+1. **Install**: Python 3.8+
+2. **Install**: VS Code
+3. **Install**: Git
+4. **Install**: PostgreSQL
+5. **Install**: Docker (when ready)
+6. **Create**: GitHub account
+7. **Install**: DBeaver
+8. **Learn**: Basic terminal commands
+
+## Recommended Setup for Advanced Users
+
+1. All beginner tools
+2. **Add**: PyCharm Professional
+3. **Add**: Docker and Docker Compose
+4. **Add**: Apache Airflow (in Docker)
+5. **Add**: Cloud platform CLI (AWS/GCP/Azure)
+6. **Add**: Monitoring tools
+7. **Add**: CI/CD tools (GitHub Actions)
+
+## Tool Selection Tips
+
+1. **Start Simple**: Don't overwhelm yourself
+2. **Master Basics**: Before moving to advanced tools
+3. **Open Source First**: Try free tools before paid
+4. **Community**: Choose tools with active communities
+5. **Job Market**: Consider what employers use
+6. **Personal Preference**: Use what works for you
+
+## Learning Resources
+
+### Practice Environments
+- **Google Colab**: Free Jupyter notebooks
+- **Kaggle**: Datasets and notebooks
+- **Replit** (formerly Repl.it): Online IDE
+
+### Sandboxes
+- **DB Fiddle**: Online SQL practice
+- **SQLite Online**: Browser-based SQLite
+- **PythonAnywhere**: Host Python apps for free
+
+## Cost Considerations
+
+### Free Forever
+- Python
+- VS Code
+- Git
+- PostgreSQL
+- SQLite
+- DBeaver
+- Most Python libraries
+
+### Free Tier (Limited)
+- AWS, GCP, Azure
+- GitHub (unlimited public repos)
+- Docker Hub
+
+### Worth Paying For
+- PyCharm Professional (student discounts available)
+- Cloud resources (for production)
+- Courses and books
+- Monitoring services (for production)
+
+## Next Steps
+
+1. **Install core tools**: Python, editor, Git, database
+2. **Set up environment**: Virtual environments, Git config (see the sketch below)
+3. **Practice**: Build small projects
+4. **Explore**: Try new tools as you learn
+5. **Share**: Show your work on GitHub
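+A minimal setup sketch for step 2 (assumes Python 3 and Git are already installed; the package names are only examples):
+
+```bash
+# Create and activate an isolated environment for a project
+python -m venv .venv             # or python3 on some systems
+source .venv/bin/activate        # Windows: .venv\Scripts\activate
+pip install --upgrade pip
+pip install pandas sqlalchemy
+
+# One-time Git identity configuration
+git config --global user.name "Your Name"
+git config --global user.email "you@example.com"
+```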