<a href="https://colab.research.google.com/github/blacktalenthubs/data-engineering-track/blob/main/Python_for_data_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Table of Contents**

1. [Introduction to Data Structures](#1-introduction-to-data-structures)
    - [Lists](#lists)
    - [Sets](#sets)
    - [Dictionaries](#dictionaries)
2. [Defining and Invoking Functions](#2-defining-and-invoking-functions)
3. [Using Data Structures with Functions](#3-using-data-structures-with-functions)
4. [Modules and Packages](#4-modules-and-packages)
    - [Importing Built-in Modules](#importing-built-in-modules)
    - [Creating and Importing Custom Modules](#creating-and-importing-custom-modules)
5. [Comprehensive Examples and Problems](#5-comprehensive-examples-and-problems)
6. [Conclusion](#6-conclusion)
7. [Further Practice](#7-further-practice)
8. [Additional Resources](#8-additional-resources)

## **1. Introduction to Data Structures**

Data structures are fundamental constructs that allow you to store and organize data efficiently. Understanding how to use them is crucial for data engineering tasks such as data manipulation, transformation, and storage.

### **Lists**

**Definition:** A list is an ordered, mutable collection of items that can be of different types.

**Characteristics:**

- **Ordered:** Elements have a defined sequence.
- **Mutable:** Elements can be added, removed, or changed.
- **Allows duplicate elements.**

In [None]:
# Creating a list of integers
numbers = [10, 20, 30, 40, 50]

# Accessing elements
first_number = numbers[0]
last_number = numbers[-1]

# Modifying elements
numbers[2] = 35

# Adding elements
numbers.append(60)

# Removing elements
numbers.remove(20)

# Resulting list
print(numbers)
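Beyond single-element access, lists support slicing and comprehensions, both of which return new lists rather than changing the original. The sketch below is illustrative only; the `readings` list is a hypothetical stand-in for the list built above.

```python
readings = [10, 35, 40, 50, 60]

# Slicing returns a new list; the end index is exclusive
first_three = readings[:3]
last_two = readings[-2:]

# A list comprehension builds a new list from an existing one
doubled = [value * 2 for value in readings]

print(first_three)  # [10, 35, 40]
print(last_two)     # [50, 60]
print(doubled)      # [20, 70, 80, 100, 120]
```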

### **Sets**

**Definition:** A set is an unordered collection of unique items.

**Characteristics:**

- **Unordered:** No indexing or order of elements.
- **Mutable:** Elements can be added or removed.
- **No duplicate elements.**

In [None]:
# Creating a set of strings
unique_ids = {'id1', 'id2', 'id3'}

# Adding an element
unique_ids.add('id4')

# Attempting to add a duplicate element
unique_ids.add('id2')

# Removing an element
unique_ids.discard('id1')

# Resulting set
print(unique_ids)
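Sets also support union, intersection, and difference, which are useful when comparing two collections. A minimal sketch, assuming two hypothetical ID sets (`batch_a` and `batch_b`) that are not part of the example above:

```python
batch_a = {'id1', 'id2', 'id3'}
batch_b = {'id3', 'id4'}

in_both = batch_a & batch_b      # intersection
in_either = batch_a | batch_b    # union
only_in_a = batch_a - batch_b    # difference

# sorted() is used only to get a stable display order; sets themselves are unordered
print(sorted(in_both))    # ['id3']
print(sorted(in_either))  # ['id1', 'id2', 'id3', 'id4']
print(sorted(only_in_a))  # ['id1', 'id2']
```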

### **Dictionaries**

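**Definition:** A dictionary is a mutable collection of key-value pairs in which each key maps to one value.

**Characteristics:**

- **Insertion-ordered (Python 3.7+):** Iteration follows the order in which keys were added; earlier versions did not guarantee any order.
- **Mutable:** Entries can be added, modified, or removed.
- **Keys must be unique and hashable** (for example strings, numbers, or tuples of immutable values).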

In [None]:
# Creating a dictionary
employee = {
    'id': 'E001',
    'name': 'Alice',
    'department': 'Engineering'
}

# Accessing values
employee_name = employee['name']

# Modifying values
employee['department'] = 'Data Science'

# Adding a new key-value pair
employee['position'] = 'Data Engineer'

# Removing a key-value pair
del employee['id']

# Resulting dictionary
print(employee)
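Accessing a missing key with square brackets raises `KeyError`, so lookups on keys that may be absent are usually done with `.get()`. The sketch below also shows a membership test and iteration; the `manager` key is a hypothetical example of a missing field.

```python
employee = {'name': 'Alice', 'department': 'Data Science'}

# .get() returns a default instead of raising KeyError for a missing key
manager = employee.get('manager', 'unassigned')

# Membership tests check keys, not values
has_name = 'name' in employee

# .items() yields key-value pairs in insertion order (Python 3.7+)
pairs = [f"{key}={value}" for key, value in employee.items()]

print(manager)   # unassigned
print(has_name)  # True
print(pairs)     # ['name=Alice', 'department=Data Science']
```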

## **2. Defining and Invoking Functions**

**Definition:** A function is a reusable block of code that performs a specific task.

**Syntax:**

```python
def function_name(parameters):
    # Function body
    return result
```

In [None]:
def calculate_average(numbers):
    total = sum(numbers)
    count = len(numbers)
    average = total / count
    return average

data_points = [10, 20, 30, 40, 50]
average_value = calculate_average(data_points)
print(average_value)

## **3. Using Data Structures with Functions**

Functions can accept data structures as parameters and return them as outputs, enabling complex data manipulations.

**Example: Filtering Data from a List**

In [None]:
def filter_even_numbers(numbers):
    even_numbers = []
    for num in numbers:
        if num % 2 == 0:
            even_numbers.append(num)
    return even_numbers

data = [1, 2, 3, 4, 5, 6]
result = filter_even_numbers(data)
print(result)

**Example: Updating a Dictionary within a Function**

In [None]:
def update_employee_record(employee, updates):
    for key, value in updates.items():
        employee[key] = value
    return employee

employee_record = {'id': 'E002', 'name': 'Bob', 'department': 'Marketing'}
updates = {'department': 'Sales', 'position': 'Sales Manager'}
updated_record = update_employee_record(employee_record, updates)
print(updated_record)
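Note that `update_employee_record` modifies the dictionary passed to it, so `employee_record` and `updated_record` refer to the same object. If the caller's data should stay unchanged, one option is to build a new dictionary instead. A minimal sketch; the function name `updated_copy` is hypothetical.

```python
def updated_copy(employee, updates):
    # {**a, **b} builds a new dict; keys in `updates` override those in `employee`
    return {**employee, **updates}

original = {'id': 'E002', 'name': 'Bob', 'department': 'Marketing'}
changed = updated_copy(original, {'department': 'Sales', 'position': 'Sales Manager'})

print(original['department'])  # Marketing
print(changed['department'])   # Sales
```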

## **4. Modules and Packages**

Modules and packages help organize code into manageable and reusable components.

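### **Importing Built-in Modules**

Python's standard library ships with modules such as `csv` that can be imported directly, with no installation step.

In [None]: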
import csv

def read_csv_file(filename):
    data = []
    with open(filename, mode='r') as file:
        csv_reader = csv.DictReader(file)
        for row in csv_reader:
            data.append(row)
    return data

# Assumes 'employees.csv' already exists in the working directory
csv_data = read_csv_file('employees.csv')
print(csv_data)

### **Creating and Importing Custom Modules**

**Problem:** You have utility functions for data cleaning that you want to reuse across different scripts.

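**Solution:** Put the functions in a file named `data_cleaning.py` and load it with `import data_cleaning` from any script in the same directory. The cell below defines and exercises the functions inline.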

In [None]:

def remove_null_values(data_list):
    return [item for item in data_list if item is not None]

def standardize_text(text):
    return text.strip().lower()

raw_data = ['  Alice ', None, 'Bob', '  ', 'Charlie', None]
clean_data = remove_null_values(raw_data)
standardized_data = [standardize_text(name) for name in clean_data if name.strip()]
print(standardized_data)
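To show the import step itself, the sketch below writes the two functions to a `data_cleaning.py` file and then imports it. Writing the file from code and placing it in a temporary directory are assumptions made only to keep the example self-contained; in a real project the file would simply sit next to the script that imports it.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Source for the module; normally this lives in data_cleaning.py on disk
module_source = '''\
def remove_null_values(data_list):
    return [item for item in data_list if item is not None]


def standardize_text(text):
    return text.strip().lower()
'''

# Write the module into a temporary directory (assumption for this sketch)
module_dir = Path(tempfile.mkdtemp())
(module_dir / 'data_cleaning.py').write_text(module_source)

# Make the directory importable, then import the module by name
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
import data_cleaning

names = data_cleaning.remove_null_values(['  Alice ', None, 'Bob'])
cleaned_names = [data_cleaning.standardize_text(name) for name in names]
print(cleaned_names)  # ['alice', 'bob']
```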

## **5. Comprehensive Examples and Problems**

Let's apply what we've learned to more complex, real-world data problems.

### **Problem 1: Data Aggregation Using Functions and Dictionaries**

**Scenario:** You have transaction data containing user IDs and purchase amounts. You need to calculate the total amount spent by each user.

In [None]:
transactions = [
    {'user_id': 'U001', 'amount': 250},
    {'user_id': 'U002', 'amount': 150},
    {'user_id': 'U001', 'amount': 200},
    {'user_id': 'U003', 'amount': 300},
    {'user_id': 'U002', 'amount': 100},
]
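One possible approach is to accumulate amounts in a dictionary keyed by user ID. The sketch below uses `collections.defaultdict` so missing keys start at zero; the function name `total_spent_by_user` is hypothetical, and the data is repeated so the block runs on its own.

```python
from collections import defaultdict

def total_spent_by_user(transactions):
    totals = defaultdict(int)  # missing user IDs start at 0
    for transaction in transactions:
        totals[transaction['user_id']] += transaction['amount']
    return dict(totals)

transactions = [
    {'user_id': 'U001', 'amount': 250},
    {'user_id': 'U002', 'amount': 150},
    {'user_id': 'U001', 'amount': 200},
    {'user_id': 'U003', 'amount': 300},
    {'user_id': 'U002', 'amount': 100},
]

user_totals = total_spent_by_user(transactions)
print(user_totals)  # {'U001': 450, 'U002': 250, 'U003': 300}
```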

### **Problem 2: Data Deduplication Using Sets**

**Scenario:** You have a list of email addresses collected from various sources, and you need to remove duplicates before sending out a newsletter.

In [None]:
emails = [
    'user1@example.com',
    'user2@example.com',
    'user3@example.com',
    'user2@example.com',
    'user4@example.com',
    'user1@example.com',
]
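Converting the list to a set removes duplicates but loses the original order. If order matters, `dict.fromkeys` keeps the first occurrence of each address. A minimal sketch of both options, with the data repeated so the block runs on its own:

```python
emails = [
    'user1@example.com',
    'user2@example.com',
    'user3@example.com',
    'user2@example.com',
    'user4@example.com',
    'user1@example.com',
]

# Option 1: a set drops duplicates; iteration order is not guaranteed
unique_set = set(emails)

# Option 2: dict keys are unique and keep insertion order (Python 3.7+)
unique_in_order = list(dict.fromkeys(emails))

print(len(unique_set))  # 4
print(unique_in_order)
# ['user1@example.com', 'user2@example.com', 'user3@example.com', 'user4@example.com']
```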

### **Problem 3: Nested Data Processing with Functions and Dictionaries**

**Scenario:** You have JSON data representing users and their associated orders. You need to extract a list of all products ordered by a specific user.
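The notebook does not include data for this problem, so the sketch below assumes a hypothetical JSON layout (a `users` list, each user holding `orders`, each order holding a `products` list). The function name `products_for_user` is likewise an illustration, not a prescribed solution.

```python
import json

# Hypothetical sample data; the real schema may differ
raw_json = '''
{
  "users": [
    {"user_id": "U001", "orders": [
      {"order_id": "O100", "products": ["Laptop", "Mouse"]},
      {"order_id": "O101", "products": ["Monitor"]}
    ]},
    {"user_id": "U002", "orders": [
      {"order_id": "O102", "products": ["Keyboard"]}
    ]}
  ]
}
'''

def products_for_user(data, user_id):
    products = []
    for user in data['users']:
        if user['user_id'] == user_id:
            for order in user['orders']:
                products.extend(order['products'])
    return products

data = json.loads(raw_json)
print(products_for_user(data, 'U001'))  # ['Laptop', 'Mouse', 'Monitor']
print(products_for_user(data, 'U999'))  # []
```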

## **6. Conclusion**

In this module, we've explored how to use fundamental data structures—**lists**, **sets**, and **dictionaries**—in conjunction with **functions** and **modules** to solve real-world data problems. Understanding these concepts is essential for data engineers, as they form the backbone of data manipulation and processing tasks.

**Key Takeaways:**

- **Lists** are ideal for ordered collections that may contain duplicates.
- **Sets** are useful for storing unique elements and performing set operations.
- **Dictionaries** are powerful for storing and accessing data via key-value pairs.
- **Functions** promote code reusability and organization.
- **Modules** help structure your codebase for better maintainability.

## **7. Further Practice**

To deepen your understanding, try solving the following problems:

### **Exercise 1: Data Transformation with Functions and Lists**

**Task:** Convert product prices to USD.

**Data:**

```python
products = [
    {'id': 'P001', 'price': 100, 'currency': 'EUR'},
    {'id': 'P002', 'price': 200, 'currency': 'GBP'},
    {'id': 'P003', 'price': 300, 'currency': 'USD'},
]
exchange_rates = {'EUR': 1.1, 'GBP': 1.3, 'USD': 1.0}
```

### **Exercise 2: Counting Frequency with Dictionaries**

**Task:** Count the frequency of each word.

**Data:**

```python
words = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
```

### **Exercise 3: Set Operations for Data Comparison**

**Task:** Find users who logged in on both days.

**Data:**

```python
day1_users = {'U001', 'U002', 'U003', 'U004'}
day2_users = {'U003', 'U004', 'U005', 'U006'}
```

## **8. Additional Resources**

- **Python Official Documentation**
  - [Data Structures](https://docs.python.org/3/tutorial/datastructures.html)
  - [Defining Functions](https://docs.python.org/3/tutorial/controlflow.html#defining-functions)
  - [Modules](https://docs.python.org/3/tutorial/modules.html)
- **Books**
  - *Python Crash Course* by Eric Matthes
  - *Learning Python* by Mark Lutz
- **Online Tutorials**
  - [W3Schools Python Tutorial](https://www.w3schools.com/python/)
  - [Real Python Tutorials](https://realpython.com/)

# **Object-Oriented Programming (OOP) in Python for Data Engineering**

In this module, we'll explore the fundamentals of **Object-Oriented Programming (OOP)** in Python. You'll learn how to model data using classes and objects, create reusable code with methods, and use OOP principles to build data structures that can be serialized into formats like CSV and JSON.

## **Key Topics Covered:**
1. [Introduction to OOP Concepts](#1-introduction-to-oop-concepts)
    - [Classes and Objects](#11-classes-and-objects)
    - [Attributes and Methods](#12-attributes-and-methods)
2. [Modeling Data Using Classes](#2-modeling-data-using-classes)
3. [Generating Data Using OOP](#3-generating-data-using-oop)
    - [Creating a CSV File from Class Instances](#31-creating-a-csv-file-from-class-instances)
    - [Converting Class Instances to JSON](#32-converting-class-instances-to-json)
4. [Serialization and Deserialization](#4-serialization-and-deserialization)

## **1. Introduction to OOP Concepts**

OOP is a programming paradigm based on the concept of "objects," which are instances of classes. Classes define the structure and behavior of objects, making it easier to model real-world data.

### **1.1 Classes and Objects**

- A **Class** is a blueprint for creating objects. It defines a set of attributes (data) and methods (behavior).
- An **Object** is an instance of a class.

**Example: Defining a Basic Class and Creating Objects**


In [None]:

# Defining a basic class
class Car:
    def __init__(self, make, model, year):
        self.make = make
        self.model = model
        self.year = year

# Creating objects (instances of the Car class)
car1 = Car("Toyota", "Corolla", 2020)
car2 = Car("Honda", "Civic", 2019)

print(car1.make, car1.model, car1.year)
print(car2.make, car2.model, car2.year)

### **1.2 Attributes and Methods**

- **Attributes** are variables associated with an object (e.g., `make`, `model`, `year` for a `Car` class).
- **Methods** are functions that belong to a class and define behaviors for the objects.

**Example: Adding Methods to a Class**

In [None]:

# Extending the Car class to include a method
class Car:
    def __init__(self, make, model, year):
        self.make = make
        self.model = model
        self.year = year

    # Method to return a formatted description
    def description(self):
        return f"{self.year} {self.make} {self.model}"

# Creating and using the Car objects
car1 = Car("Toyota", "Corolla", 2020)
car2 = Car("Honda", "Civic", 2019)

print(car1.description())  # Output: 2020 Toyota Corolla
print(car2.description())  # Output: 2019 Honda Civic

## **2. Modeling Data Using Classes**

Classes are particularly useful for modeling data in a structured way, such as representing a data schema in a data engineering context.

**Example: Modeling a Database Schema**

Let's define a `Book` class to represent books in a library system, including attributes like `title`, `author`, `isbn`, and `year`.

In [None]:

class Book:
    def __init__(self, title, author, isbn, year):
        self.title = title
        self.author = author
        self.isbn = isbn
        self.year = year

    def get_info(self):
        return f"'{self.title}' by {self.author} (ISBN: {self.isbn}, Year: {self.year})"

# Creating book instances
book1 = Book("1984", "George Orwell", "1234567890", 1949)
book2 = Book("To Kill a Mockingbird", "Harper Lee", "0987654321", 1960)

print(book1.get_info())
print(book2.get_info())
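For classes that mainly hold data, the standard library's `dataclasses` module can generate `__init__`, `__repr__`, and equality checks automatically. The sketch below is an alternative to the hand-written `Book` class, not a replacement for it; the name `BookRecord` is hypothetical.

```python
from dataclasses import dataclass, asdict

@dataclass
class BookRecord:
    title: str
    author: str
    isbn: str
    year: int

record = BookRecord("1984", "George Orwell", "1234567890", 1949)

# The generated __repr__ lists every field
print(record)

# asdict() converts the instance to a plain dictionary, ready for CSV or JSON export
record_dict = asdict(record)
print(record_dict)
```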

## **3. Generating Data Using OOP**

One powerful use of OOP in data engineering is to generate and manipulate structured data easily. You can use classes to create data models and then export this data into formats like CSV or JSON.

### **3.1 Creating a CSV File from Class Instances**

**Example: Generating a CSV File for Employee Data**

In [None]:
import csv

# Defining the Employee class
class Employee:
    def __init__(self, emp_id, name, department, salary):
        self.emp_id = emp_id
        self.name = name
        self.department = department
        self.salary = salary

    def to_dict(self):
        return {
            'emp_id': self.emp_id,
            'name': self.name,
            'department': self.department,
            'salary': self.salary
        }

# Creating employee instances
employees = [
    Employee(1, "Alice", "Engineering", 90000),
    Employee(2, "Bob", "Sales", 80000),
    Employee(3, "Charlie", "Marketing", 85000)
]

# Writing employee data to a CSV file
csv_filename = 'employees.csv'
with open(csv_filename, mode='w', newline='') as csv_file:
    fieldnames = ['emp_id', 'name', 'department', 'salary']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    writer.writeheader()
    for emp in employees:
        writer.writerow(emp.to_dict())

print(f"Data successfully written to {csv_filename}")
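Reading the data back is the mirror image of writing it. Because `csv.DictReader` returns every value as a string, numeric fields need converting. The sketch below parses CSV text from an in-memory buffer so it runs without a file; the `from_row` class method is a hypothetical addition to `Employee`.

```python
import csv
import io

class Employee:
    def __init__(self, emp_id, name, department, salary):
        self.emp_id = emp_id
        self.name = name
        self.department = department
        self.salary = salary

    @classmethod
    def from_row(cls, row):
        # DictReader yields strings, so convert the numeric columns explicitly
        return cls(int(row['emp_id']), row['name'], row['department'], int(row['salary']))

# In-memory stand-in for the contents of employees.csv
csv_text = (
    "emp_id,name,department,salary\n"
    "1,Alice,Engineering,90000\n"
    "2,Bob,Sales,80000\n"
)

reader = csv.DictReader(io.StringIO(csv_text))
loaded_employees = [Employee.from_row(row) for row in reader]

print(loaded_employees[0].name, loaded_employees[0].salary)  # Alice 90000
print(len(loaded_employees))                                 # 2
```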

### **3.2 Converting Class Instances to JSON**

**Example: Serializing a List of Objects to JSON**

In [None]:

import json

# Defining a Product class
class Product:
    def __init__(self, product_id, name, price, stock):
        self.product_id = product_id
        self.name = name
        self.price = price
        self.stock = stock

# Creating a list of Product instances
products = [
    Product(101, "Laptop", 1500.00, 50),
    Product(102, "Smartphone", 800.00, 100),
    Product(103, "Tablet", 400.00, 200)
]

# Converting Product instances to dictionaries
products_list = [product.__dict__ for product in products]

# Serializing the list to JSON
json_data = json.dumps(products_list, indent=4)
print(json_data)

## **4. Serialization and Deserialization**

**Serialization** is the process of converting an object into a format that can be easily saved or transmitted (e.g., converting a Python object to a JSON string or CSV row).  
**Deserialization** is the reverse process, where data from formats like JSON or CSV is converted back into Python objects.

**Example: Saving and Loading Application Settings**

```python
import json


class Settings:
    def __init__(self, theme, notifications_enabled):
        self.theme = theme
        self.notifications_enabled = notifications_enabled

    def save_to_json(self, filename):
        data = self.__dict__
        with open(filename, 'w') as json_file:
            json.dump(data, json_file)

    @classmethod
    def load_from_json(cls, filename):
        with open(filename, 'r') as json_file:
            data = json.load(json_file)
            return cls(**data)

# Creating a settings object
app_settings = Settings(theme="dark", notifications_enabled=True)

# Saving settings to a file
app_settings.save_to_json('settings.json')

# Loading settings from a file
loaded_settings = Settings.load_from_json('settings.json')
print(loaded_settings.__dict__)
```



## **Conclusion**

In this section, you've learned how to use **OOP concepts** in Python to:

- Define and use classes and objects.
- Model data in a structured way using classes.
- Generate structured data, and save it to formats like CSV and JSON.
- Serialize and deserialize data for persistence and transmission.

OOP allows you to model real-world entities as classes and objects, making your code modular, reusable, and easy to maintain.

## **Further Practice**

### **Exercise 1: Create a Student Class**

1. Define a `Student` class with attributes like `student_id`, `name`, `major`, and `gpa`.
2. Add methods to calculate and update the `gpa`.
3. Create several instances of `Student` and serialize them into JSON.

### **Exercise 2: Generate CSV Data for an E-Commerce System**

1. Create a `Customer` class with attributes like `customer_id`, `name`, `email`, and `address`.
2. Create a `Product` class with attributes like `product_id`, `name`, `price`, and `category`.
3. Write code to generate a list of customers and products and save the data to two separate CSV files.

---