Data Structures and Functions in Python for Data Engineering

Welcome to this module on Data Structures and Functions in Python, designed for data engineering students. We’ll cover three fundamental data structures (lists, sets, and dictionaries) and show how they work together with functions and modules. Each topic comes with a definition, a practical example, and worked problems based on realistic data scenarios.

Table of Contents

	1.	Introduction to Data Structures
		•	Lists
		•	Sets
		•	Dictionaries
	2.	Defining and Invoking Functions
	3.	Using Data Structures with Functions
	4.	Modules and Packages
		•	Importing Built-in Modules
		•	Creating and Importing Custom Modules
	5.	Comprehensive Examples and Problems
	6.	Conclusion

1. Introduction to Data Structures

Data structures are fundamental constructs that allow you to store and organize data efficiently. Understanding how to use them is crucial for data engineering tasks such as data manipulation, transformation, and storage.

Lists

Definition: A list is an ordered, mutable collection of items that can be of different types.

Characteristics:

	•	Ordered: Elements have a defined sequence.
	•	Mutable: Elements can be added, removed, or changed.
	•	Allows duplicate elements.

Example:

# Creating a list of integers
numbers = [10, 20, 30, 40, 50]

# Accessing elements
first_number = numbers[0]
last_number = numbers[-1]

# Modifying elements
numbers[2] = 35

# Adding elements
numbers.append(60)

# Removing elements
numbers.remove(20)

# Resulting list
print(numbers)

Output:

[10, 35, 40, 50, 60]
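
A few other list operations come up constantly in data work. The sketch below uses made-up values (readings is a hypothetical name, not part of the lesson data) to show slicing, sorting without changing the original, and the fact that remove() deletes only the first matching element.

```python
# Hypothetical sample data containing a duplicate value
readings = [30, 10, 20, 10, 50]

# Slicing returns a new list; the original is unchanged
first_three = readings[:3]

# sorted() returns a new sorted list; list.sort() would sort in place instead
ordered = sorted(readings)

# remove() deletes only the first occurrence of the value
readings.remove(10)

# Membership test
has_fifty = 50 in readings

print(first_three)  # [30, 10, 20]
print(ordered)      # [10, 10, 20, 30, 50]
print(readings)     # [30, 20, 10, 50]
print(has_fifty)    # True
```

Because lists allow duplicates, the second 10 is still present after the call to remove().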

Sets

Definition: A set is an unordered collection of unique items.

Characteristics:

	•	Unordered: No indexing or order of elements.
	•	Mutable: Elements can be added or removed.
	•	No duplicate elements.

Example:

# Creating a set of strings
unique_ids = {'id1', 'id2', 'id3'}

# Adding an element
unique_ids.add('id4')

# Attempting to add a duplicate element
unique_ids.add('id2')

# Removing an element
unique_ids.discard('id1')

# Resulting set
print(unique_ids)

Output (element order may vary, since sets are unordered):

{'id2', 'id3', 'id4'}
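
Sets also support the usual mathematical set operations, which are handy for comparing two collections of identifiers. The following is a minimal sketch with invented data (source_ids and target_ids are hypothetical names); sorted() is used only to make the printed order predictable.

```python
# Hypothetical IDs present in a source system and in a target table
source_ids = {'id1', 'id2', 'id3', 'id4'}
target_ids = {'id3', 'id4', 'id5'}

in_both = source_ids & target_ids            # intersection
missing_in_target = source_ids - target_ids  # difference
all_ids = source_ids | target_ids            # union

# sorted() returns a list in a stable, readable order
print(sorted(in_both))            # ['id3', 'id4']
print(sorted(missing_in_target))  # ['id1', 'id2']
print(sorted(all_ids))            # ['id1', 'id2', 'id3', 'id4', 'id5']
```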

Dictionaries

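Definition: A dictionary is a mutable collection of key-value pairs in which each key maps to exactly one value.

Characteristics:

	•	Insertion-ordered: Since Python 3.7, dictionaries preserve insertion order. Earlier versions did not guarantee any order.
	•	Mutable: Entries can be added, modified, or removed.
	•	Keys must be unique and hashable (for example strings, numbers, or tuples of immutable values).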

Example:

# Creating a dictionary
employee = {
    'id': 'E001',
    'name': 'Alice',
    'department': 'Engineering'
}

# Accessing values
employee_name = employee['name']

# Modifying values
employee['department'] = 'Data Science'

# Adding a new key-value pair
employee['position'] = 'Data Engineer'

# Removing a key-value pair
del employee['id']

# Resulting dictionary
print(employee)

Output:

{'name': 'Alice', 'department': 'Data Science', 'position': 'Data Engineer'}
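
Real records often lack some keys, and accessing a missing key with square brackets raises KeyError. As a small illustration (the 'manager' key and the 'unassigned' default are assumptions made up for this sketch), get() and the in operator handle that case, and items() walks the pairs in insertion order.

```python
# Hypothetical record; the 'manager' key is deliberately missing
employee = {'name': 'Alice', 'department': 'Data Science'}

# employee['manager'] would raise KeyError; get() returns a default instead
manager = employee.get('manager', 'unassigned')

# Membership tests check keys, not values
has_name = 'name' in employee

# items() yields (key, value) pairs in insertion order (Python 3.7+)
pairs = [f"{key}={value}" for key, value in employee.items()]

print(manager)   # unassigned
print(has_name)  # True
print(pairs)     # ['name=Alice', 'department=Data Science']
```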

2. Defining and Invoking Functions

Definition: A function is a reusable block of code that performs a specific task.

Syntax:

def function_name(parameters):
    # Function body
    return result

Example:

def calculate_average(numbers):
    total = sum(numbers)
    count = len(numbers)
    average = total / count
    return average

data_points = [10, 20, 30, 40, 50]
average_value = calculate_average(data_points)
print(average_value)

Output:

30.0
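
One edge case worth noting: calculate_average raises ZeroDivisionError when given an empty list, because len(numbers) is 0. One possible way to handle this, sketched below, is a variant with a default parameter. The name calculate_average_safe and the default argument are assumptions added for illustration, not part of the lesson's code.

```python
def calculate_average_safe(numbers, default=None):
    """Return the mean of numbers, or `default` when the input is empty."""
    if not numbers:  # an empty list would otherwise cause ZeroDivisionError
        return default
    return sum(numbers) / len(numbers)

print(calculate_average_safe([10, 20, 30, 40, 50]))  # 30.0
print(calculate_average_safe([]))                    # None
print(calculate_average_safe([], default=0.0))       # 0.0
```

Whether an empty input should return a default or raise an error depends on the pipeline; returning None makes the missing value explicit to the caller.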

3. Using Data Structures with Functions

Functions can accept data structures as parameters and return them as outputs, enabling complex data manipulations.

Example: Filtering Data from a List

def filter_even_numbers(numbers):
    even_numbers = []
    for num in numbers:
        if num % 2 == 0:
            even_numbers.append(num)
    return even_numbers

data = [1, 2, 3, 4, 5, 6]
result = filter_even_numbers(data)
print(result)

Output:

[2, 4, 6]
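
The same filter can be written more compactly with a list comprehension. This is an equivalent sketch rather than a replacement; filter_even_numbers_compact is a hypothetical name chosen to keep it distinct from the function above.

```python
def filter_even_numbers_compact(numbers):
    # Build the filtered list in a single expression
    return [num for num in numbers if num % 2 == 0]

data = [1, 2, 3, 4, 5, 6]
print(filter_even_numbers_compact(data))  # [2, 4, 6]
```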

Example: Updating a Dictionary within a Function

def update_employee_record(employee, updates):
    # Note: this modifies the caller's dictionary in place and returns the same object
    for key, value in updates.items():
        employee[key] = value
    return employee

employee_record = {'id': 'E002', 'name': 'Bob', 'department': 'Marketing'}
updates = {'department': 'Sales', 'position': 'Sales Manager'}
updated_record = update_employee_record(employee_record, updates)
print(updated_record)

Output:

{'id': 'E002', 'name': 'Bob', 'department': 'Sales', 'position': 'Sales Manager'}
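
Note that update_employee_record changes the dictionary passed to it, so after the call employee_record and updated_record refer to the same object. When the original record must stay intact, one option is to build a new dictionary instead. The sketch below assumes a hypothetical helper named updated_copy; it makes a shallow copy, so nested lists or dictionaries would still be shared.

```python
def updated_copy(employee, updates):
    # {**a, **b} builds a new dict; keys in `updates` override those in `employee`
    return {**employee, **updates}

original = {'id': 'E002', 'name': 'Bob', 'department': 'Marketing'}
changes = {'department': 'Sales', 'position': 'Sales Manager'}

new_record = updated_copy(original, changes)
print(original['department'])    # Marketing (unchanged)
print(new_record['department'])  # Sales
print(new_record is original)    # False
```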

4. Modules and Packages

Modules and packages help organize code into manageable and reusable components.

Importing Built-in Modules

Python offers a wide range of built-in modules that can be readily used.

Example: Using the csv Module

import csv

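def read_csv_file(filename):
    data = []
    # newline='' lets the csv module handle line endings itself, as its documentation recommends
    with open(filename, mode='r', newline='') as file:
        csv_reader = csv.DictReader(file)
        for row in csv_reader:
            data.append(row)  # each row is a dict keyed by the header names
    return data

# Assumes a file named employees.csv exists in the working directory
csv_data = read_csv_file('employees.csv')
print(csv_data)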
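
To experiment with csv.DictReader without a file on disk, the text can be wrapped in io.StringIO, which behaves like an open file. The CSV content below is invented for illustration, and read_csv_text is a hypothetical helper name.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from a file
sample_csv = "id,name,department\nE001,Alice,Engineering\nE002,Bob,Marketing\n"

def read_csv_text(text):
    # io.StringIO wraps a string in a file-like object that csv can read
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)

rows = read_csv_text(sample_csv)
print(rows[0]['name'])              # Alice
print([row['id'] for row in rows])  # ['E001', 'E002']
print(len(rows))                    # 2
```

Every value produced by the csv module is a string, so numeric columns need an explicit conversion such as int() or float() before arithmetic.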

Creating and Importing Custom Modules

Problem: You have utility functions for data cleaning that you want to reuse across different scripts.

Solution: Create a custom module named data_cleaning.py.

Content of data_cleaning.py:

def remove_null_values(data_list):
    return [item for item in data_list if item is not None]

def standardize_text(text):
    return text.strip().lower()

Using the Custom Module:

from data_cleaning import remove_null_values, standardize_text

raw_data = ['  Alice ', None, 'Bob', '  ', 'Charlie', None]
clean_data = remove_null_values(raw_data)
standardized_data = [standardize_text(name) for name in clean_data if name.strip()]
print(standardized_data)

Output:

['alice', 'bob', 'charlie']
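
The import above works only if data_cleaning.py is somewhere Python looks for modules, typically the directory of the running script or notebook. The sketch below demonstrates that lookup mechanism in a self-contained way: it writes a small module to a temporary directory, adds that directory to sys.path, and imports it. The module name data_cleaning_demo is hypothetical, chosen so it does not clash with a real data_cleaning.py.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Source for a small module; the name data_cleaning_demo is hypothetical
module_source = '''
def standardize_text(text):
    return text.strip().lower()
'''

# Write the module file into a temporary directory
module_dir = tempfile.mkdtemp()
Path(module_dir, 'data_cleaning_demo.py').write_text(module_source)

# Python searches the directories listed in sys.path when resolving imports
sys.path.insert(0, module_dir)
importlib.invalidate_caches()  # make sure the newly created file is noticed
data_cleaning_demo = importlib.import_module('data_cleaning_demo')

print(data_cleaning_demo.standardize_text('  Alice '))  # alice
```

In a normal project the module file would simply sit next to the script that imports it, and no sys.path change would be needed.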

5. Comprehensive Examples and Problems

Let’s apply what we’ve learned to more complex, real-world data problems.

Problem 1: Data Aggregation Using Functions and Dictionaries

Scenario: You have transaction data containing user IDs and purchase amounts. You need to calculate the total amount spent by each user.

Data:

transactions = [
    {'user_id': 'U001', 'amount': 250},
    {'user_id': 'U002', 'amount': 150},
    {'user_id': 'U001', 'amount': 200},
    {'user_id': 'U003', 'amount': 300},
    {'user_id': 'U002', 'amount': 100},
]

Solution:

def aggregate_user_purchases(transactions):
    user_totals = {}
    for transaction in transactions:
        user_id = transaction['user_id']
        amount = transaction['amount']
        if user_id in user_totals:
            user_totals[user_id] += amount
        else:
            user_totals[user_id] = amount
    return user_totals

totals = aggregate_user_purchases(transactions)
print(totals)

Output:

{'U001': 450, 'U002': 250, 'U003': 300}
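
The if/else check for an existing key is a common pattern, and the standard library's collections.defaultdict can remove it. The following is an alternative sketch, not the lesson's reference solution; aggregate_with_defaultdict and the sample list are hypothetical names and data.

```python
from collections import defaultdict

def aggregate_with_defaultdict(transactions):
    # defaultdict(int) creates missing keys with an initial value of 0
    user_totals = defaultdict(int)
    for transaction in transactions:
        user_totals[transaction['user_id']] += transaction['amount']
    return dict(user_totals)  # convert back to a plain dict for callers

sample = [
    {'user_id': 'U001', 'amount': 250},
    {'user_id': 'U002', 'amount': 150},
    {'user_id': 'U001', 'amount': 200},
]
print(aggregate_with_defaultdict(sample))  # {'U001': 450, 'U002': 150}
```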

Problem 2: Data Deduplication Using Sets

Scenario: You have a list of email addresses collected from various sources, and you need to remove duplicates before sending out a newsletter.

Data:

emails = [
    'user1@example.com',
    'user2@example.com',
    'user3@example.com',
    'user2@example.com',
    'user4@example.com',
    'user1@example.com',
]

Solution:

def remove_duplicate_emails(email_list):
    unique_emails = set(email_list)
    return list(unique_emails)

unique_email_list = remove_duplicate_emails(emails)
print(unique_email_list)

Output (order may vary due to the nature of sets):

['user2@example.com', 'user4@example.com', 'user1@example.com', 'user3@example.com']
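
If the original order of the addresses matters, converting to a set loses it. One common alternative, sketched here under the assumption of Python 3.7 or later, relies on dictionaries preserving insertion order. The function name and sample addresses are made up for illustration.

```python
def remove_duplicates_keep_order(email_list):
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(email_list))

sample_emails = ['b@example.com', 'a@example.com', 'b@example.com', 'c@example.com']
print(remove_duplicates_keep_order(sample_emails))
# ['b@example.com', 'a@example.com', 'c@example.com']
```

Each address appears once, at the position of its first occurrence.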

Problem 3: Nested Data Processing with Functions and Dictionaries

Scenario: You have JSON-style data (nested lists and dictionaries) representing users and their associated orders. You need to extract a list of all products ordered by a specific user.

Data:

users = [
    {
        'user_id': 'U001',
        'name': 'Alice',
        'orders': [
            {'order_id': 'O1001', 'products': ['P001', 'P002']},
            {'order_id': 'O1002', 'products': ['P003']},
        ],
    },
    {
        'user_id': 'U002',
        'name': 'Bob',
        'orders': [
            {'order_id': 'O1003', 'products': ['P002', 'P004']},
        ],
    },
]

Solution:

def get_user_products(user_id, users):
    for user in users:
        if user['user_id'] == user_id:
            products = []
            for order in user['orders']:
                products.extend(order['products'])
            return products
    return []

alice_products = get_user_products('U001', users)
print(alice_products)

Output:

['P001', 'P002', 'P003']
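
get_user_products scans the whole list on every call. If products are needed for many users, it may be worth building a dictionary once and looking users up by key. The sketch below is one possible approach; build_product_index, sample_users, and the user without an 'orders' key are assumptions added for illustration.

```python
def build_product_index(users):
    # Map each user_id to a flat list of that user's products
    index = {}
    for user in users:
        products = []
        for order in user.get('orders', []):  # tolerate users with no 'orders' key
            products.extend(order['products'])
        index[user['user_id']] = products
    return index

sample_users = [
    {'user_id': 'U001', 'orders': [{'order_id': 'O1001', 'products': ['P001', 'P002']},
                                   {'order_id': 'O1002', 'products': ['P003']}]},
    {'user_id': 'U002', 'orders': [{'order_id': 'O1003', 'products': ['P002', 'P004']}]},
    {'user_id': 'U003'},  # hypothetical user with no orders recorded
]

index = build_product_index(sample_users)
print(index['U001'])          # ['P001', 'P002', 'P003']
print(index['U003'])          # []
print(index.get('U999', []))  # [] for an unknown user
```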

6. Conclusion

In this module, we’ve explored how to use fundamental data structures—lists, sets, and dictionaries—in conjunction with functions and modules to solve real-world data problems. Understanding these concepts is essential for data engineers, as they form the backbone of data manipulation and processing tasks.

Key Takeaways:

	•	Lists are ideal for ordered collections that may contain duplicates.
	•	Sets are useful for storing unique elements and performing set operations.
	•	Dictionaries are powerful for storing and accessing data via key-value pairs.
	•	Functions promote code reusability and organization.
	•	Modules help structure your codebase for better maintainability.

By mastering these tools, you’ll be well-equipped to handle complex data engineering challenges.

Further Practice

To deepen your understanding, try solving the following problems:

	1.	Data Transformation with Functions and Lists
Write a function that takes a list of dictionaries representing products with prices in different currencies and returns a new list with prices converted to USD.
Data:

products = [
    {'id': 'P001', 'price': 100, 'currency': 'EUR'},
    {'id': 'P002', 'price': 200, 'currency': 'GBP'},
    {'id': 'P003', 'price': 300, 'currency': 'USD'},
]

Exchange Rates:

exchange_rates = {'EUR': 1.1, 'GBP': 1.3, 'USD': 1.0}


	2.	Counting Frequency with Dictionaries
Given a list of words, write a function that returns a dictionary with each word and its frequency count.
Data:

words = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']


	3.	Set Operations for Data Comparison
You have two lists of user IDs representing users who have logged in over two different days. Find the users who logged in on both days.
Data:

day1_users = ['U001', 'U002', 'U003', 'U004']
day2_users = ['U003', 'U004', 'U005', 'U006']



Additional Resources

	•	Python Official Documentation
		•	Data Structures
		•	Defining Functions
		•	Modules
	•	Books
		•	Python Crash Course by Eric Matthes
		•	Learning Python by Mark Lutz
	•	Online Tutorials
		•	W3Schools Python Tutorial
		•	Real Python Tutorials

Feel free to explore these concepts further and apply them to your data engineering projects. Practice is key to mastery. Happy coding!