### **Introduction to Functions**

Functions make our code reusable, modular, and easier to debug. Instead of writing the same code over and over, we define a function once and call it whenever needed.


In [4]:
def greet():
    print("Hello, welcome to Python functions!")
    
greet()


Hello, welcome to Python functions!


`def` – This tells Python we are defining a function.

`greet()` – This is the function call (it runs the function).

In [8]:
def greet(name):
    print(f"Hello, {name}! Welcome to Python functions.")

greet("Alice")
greet("Bob")


Hello, Alice! Welcome to Python functions.
Hello, Bob! Welcome to Python functions.


In data engineering, column names in datasets are often messy. Let’s write a function that ensures they are consistent.

In [11]:
def clean_column_name(column_name):
    return column_name.strip().lower().replace(" ", "_")

print(clean_column_name("  Customer Name  "))  # Output: customer_name

customer_name


**An improved version**

In [22]:
import re

def clean_column_name(column_name):
    """
    Cleans a column name by:
    - Converting to lowercase
    - Replacing spaces and special characters with underscores
    - Collapsing repeated underscores into one
    - Stripping leading/trailing underscores
    """
    column_name = column_name.lower()  # Convert to lowercase
    column_name = re.sub(r'[^a-z0-9_]', '_', column_name)  # Replace special chars with _
    column_name = re.sub(r'_{2,}', '_', column_name)  # Replace multiple underscores with single _
    return column_name.strip('_')  # Remove leading/trailing underscores

def clean_column_names(columns):
    """
    Cleans a list of column names.
    """
    return [clean_column_name(col) for col in columns]

# Example usage:
raw_columns = ["Employee ID#", "Full Name!", "Salary ($)", "Hire Date", "Dept_Code"]
cleaned_columns = clean_column_names(raw_columns)
print(cleaned_columns)


['employee_id', 'full_name', 'salary', 'hire_date', 'dept_code']


### **Function Arguments (Positional & Keyword Arguments)**
<hr></hr>

**Positional vs Keyword Arguments**

In [19]:
def order_coffee(size, coffee_type):  # coffee_type avoids shadowing the built-in type()
    print(f"Here is your {size} {coffee_type} coffee.")

order_coffee("large", "latte")  # Positional arguments
order_coffee(coffee_type="espresso", size="small")  # Keyword arguments


Here is your large latte coffee.
Here is your small espresso coffee.


**Default Argument Values**

In [24]:
def order_coffee(size="medium", coffee_type="black"):  # coffee_type avoids shadowing the built-in type()
    print(f"Here is your {size} {coffee_type} coffee.")

order_coffee()  # Uses defaults
order_coffee("small")  # Overrides size, keeps coffee_type as "black"
order_coffee(coffee_type="cappuccino")  # Keeps size as "medium"


Here is your medium black coffee.
Here is your small black coffee.
Here is your medium cappuccino coffee.


#### **Data Engineering Use Case: Log Processing Function**

In [27]:
def parse_log_entry(log, separator="|"):
    return log.strip().split(separator)

log_entry = "2025-02-24|INFO|Data pipeline started"
print(parse_log_entry(log_entry))  # Output: ['2025-02-24', 'INFO', 'Data pipeline started']


['2025-02-24', 'INFO', 'Data pipeline started']
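
Because `separator` has a default value, callers only need to pass it when a log uses a different delimiter. The sketch below repeats the function so it runs on its own; the comma-separated entry is a made-up sample, not data from a real pipeline.

```python
def parse_log_entry(log, separator="|"):
    # Same function as above, repeated so this cell runs on its own
    return log.strip().split(separator)

# Hypothetical sample entries, made up for illustration
pipe_entry = "2025-02-24|INFO|Data pipeline started\n"
comma_entry = "2025-02-24,ERROR,Data pipeline failed"

default_fields = parse_log_entry(pipe_entry)                # uses the default "|"
comma_fields = parse_log_entry(comma_entry, separator=",")  # overrides the default

print(default_fields)
print(comma_fields)
```

Passing `separator` by keyword keeps the call readable even if more optional parameters are added later.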


In [18]:
import logging
import os
import re
from typing import Optional

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def detect_separator(line: str) -> Optional[str]:
    """
    Detects the most likely separator in a given line.
    
    Args:
        line (str): A single line of text.
    
    Returns:
        str: Detected separator or None if no common separator is found.
    """
    common_separators = ["|", ",", ";", "\t"]
    separator_counts = {sep: line.count(sep) for sep in common_separators}
    
    # Return the separator with the highest occurrence
    return max(separator_counts, key=separator_counts.get) if any(separator_counts.values()) else None

def process_log_file(file_path: str):
    """
    Reads a log file and processes it into a structured format, dynamically detecting separators per line.
    
    Args:
        file_path (str): Relative or absolute path to the log file.
    
    Returns:
        list[dict]: List of log entries as dictionaries.
    """
    log_entries = []
    
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Log file not found: {file_path}")
    
    with open(file_path, "r", encoding="utf-8") as file:
        lines = file.readlines()

    # Detect the separator from the header line (an empty file has no header)
    header_line = lines[0].strip() if lines else ""
    separator = detect_separator(header_line)

    if separator is None:
        raise ValueError("Could not determine a valid separator in the header line.")

    # Extract column headers
    headers = [col.strip() for col in header_line.split(separator)]

    for line in lines[1:]:
        line = line.strip()
        if not line:
            continue  # Skip blank lines instead of emitting empty entries

        # Detect the separator per line; fall back to the header's separator
        # so a line without any known separator is not split on whitespace
        current_separator = detect_separator(line) or separator

        values = [val.strip() for val in line.split(current_separator)]
        log_entries.append(dict(zip(headers, values)))

    return log_entries



In [19]:
import os
file_path = os.path.join(os.getcwd(), "data", "logs.txt")
print(file_path)

log_entries = process_log_file(file_path)

for entry in log_entries:
    print(entry)

C:\Users\fpicaso\envs\notebooks\data\logs.txt
{'timestamp': '2025-02-24 10:15:30', 'level': 'INFO', 'message': 'User logged in', 'user': 'john_doe'}
{'timestamp': '2025-02-24 10:18:20', 'level': 'ERROR', 'message': 'Database connection failed', 'user': 'admin'}
{'timestamp': '2025-02-24 10:25:00', 'level': 'INFO', 'message': 'Password changed', 'user': 'john_doe'}
{'timestamp': '2025-02-24 10:30:10', 'level': 'ERROR', 'message': 'Application crash', 'user': 'system'}
{'timestamp': '2025-02-24 10:35:45', 'level': 'DEBUG', 'message': 'Configuration updated', 'user': 'dev_user'}
{'timestamp': '2025-02-24 10:45:55', 'level': 'INFO', 'message': 'User logged out', 'user': 'jane_doe'}
{'timestamp': '2025-02-24 10:50:30', 'level': 'ERROR', 'message': 'Server timeout', 'user': 'admin'}
{'timestamp': '2025-02-24 10:55:15', 'level': 'INFO', 'message': 'Backup completed', 'user': 'backup_user'}
{'timestamp': '2025-02-24 10:20:45', 'level': 'INFO', 'message': 'File uploaded', 'user': 'john_doe'}
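
When every line shares one separator, much of the header and `zip` work can be delegated to the standard library. The sketch below uses `csv.DictReader` on an in-memory string; the pipe-separated header and the two rows are assumptions modeled on the output above, not the real contents of `logs.txt`.

```python
import csv
import io

# Hypothetical log content, kept in memory so no file is needed
sample = (
    "timestamp|level|message|user\n"
    "2025-02-24 10:15:30|INFO|User logged in|john_doe\n"
    "2025-02-24 10:18:20|ERROR|Database connection failed|admin\n"
)

# DictReader treats the first row as the header and maps each later row to it
reader = csv.DictReader(io.StringIO(sample), delimiter="|")
entries = list(reader)

for entry in entries:
    print(entry)
```

This trades the per-line separator detection for simplicity, so it only fits files with a single, known delimiter.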

### **`*args`, `**kwargs` & Argument Unpacking**

One of the best features of Python functions is their extremely flexible parameter handling. Closely related is the use of `*` and `**` to unpack iterables and mappings into separate arguments when we call a function.

**`*args`: Accepting multiple positional arguments**

`*args` allows a function to accept any number of positional arguments (i.e., arguments passed without specifying a keyword).

- `*args` collects all extra positional arguments passed to a function into a **tuple**.
- You can loop through `args` like a normal tuple.

In [20]:
def sum_numbers(*args):
    return sum(args)

print(sum_numbers(10, 20, 30))  # Output: 60


60


In [22]:
def greet(*args):
    for name in args:
        print(f"Hello, {name}!")

greet("Alice", "Bob", "Charlie")  


Hello, Alice!
Hello, Bob!
Hello, Charlie!


**Combining `*args` with Positional Arguments**

- `first_name` and `last_name` are regular positional arguments.
- `*hobbies` collects any extra positional arguments into a tuple.

In [24]:
def introduce(first_name, last_name, *hobbies):
    print(f"Name: {first_name} {last_name}")
    
    if hobbies:
        print("Hobbies:")
        for hobby in hobbies:
            print(f"- {hobby}")

introduce("Alice", "Johnson", "Reading", "Hiking", "Painting")


Name: Alice Johnson
Hobbies:
- Reading
- Hiking
- Painting


**`**kwargs`: Accepting multiple keyword arguments**

`**kwargs` allows a function to accept any number of keyword arguments (i.e., arguments passed as key=value pairs).


- `**kwargs` collects all extra keyword arguments into a **dictionary**.
- You can access the dictionary normally or loop through it.

In [25]:
def display_info(**kwargs):
    for key, value in kwargs.items():
        print(f"{key}: {value}")

display_info(name="Alice", age=25, city="New York")


name: Alice
age: 25
city: New York


### **Combining `*args` and `**kwargs`**

You can use `*args` and `**kwargs` together in the same function.

- `"Welcome to the event!"` is bound to the regular parameter `greeting`.
- `"Alice"` and `"Bob"` go into `args` as a tuple → `("Alice", "Bob")`.
- `age=30`, `city="New York"`, `profession="Engineer"` go into `kwargs` as a dictionary.

In [26]:
def describe_person(greeting, *args, **kwargs):
    print(greeting)
    
    print("\nPositional Arguments (args):")
    for item in args:
        print(f"- {item}")
    
    print("\nKeyword Arguments (kwargs):")
    for key, value in kwargs.items():
        print(f"{key}: {value}")

describe_person(
    "Welcome to the event!",
    "Alice", "Bob",  # *args
    age=30, city="New York", profession="Engineer"  # **kwargs
)


Welcome to the event!

Positional Arguments (args):
- Alice
- Bob

Keyword Arguments (kwargs):
age: 30
city: New York
profession: Engineer
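
The `*` and `**` operators also work in the other direction: at the call site they unpack an iterable into positional arguments and a mapping into keyword arguments. A minimal sketch, where `connect` and its parameters are hypothetical names used only for illustration:

```python
def connect(host, port, timeout=10):
    """Hypothetical function used only to show unpacking."""
    return f"{host}:{port} (timeout={timeout}s)"

address = ("localhost", 5432)  # an iterable of positional values
options = {"timeout": 30}      # a mapping of keyword values

# * spreads the tuple into positional arguments, ** spreads the dict into keywords
summary = connect(*address, **options)
print(summary)  # localhost:5432 (timeout=30s)
```

The call is equivalent to `connect("localhost", 5432, timeout=30)`.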


### **Using `*args` and `**kwargs` for Function Wrapping**

A common use case for `*args` and `**kwargs` is function decorators or wrappers, where they allow passing arbitrary arguments to another function.

- The decorator `log_function_call` wraps `add()`, allowing any arguments to be passed dynamically.
- `*args` and `**kwargs` ensure flexibility for different function signatures.

In [28]:
def log_function_call(func):
    def wrapper(*args, **kwargs):
        print(f"Calling function {func.__name__} with:")
        print(f"  Positional args: {args}")
        print(f"  Keyword args: {kwargs}")
        result = func(*args, **kwargs)
        print(f"  Result: {result}")
        return result
    return wrapper

@log_function_call
def add(a, b):
    print("add function")
    return a + b

add(5, 10)


Calling function add with:
  Positional args: (5, 10)
  Keyword args: {}
add function
  Result: 15


15
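
One caveat with a plain wrapper is that the decorated function takes on the wrapper's name and docstring. The sketch below is a variation on the decorator above, not a replacement for it: it adds `functools.wraps` from the standard library so that `add` keeps its own metadata.

```python
import functools

def log_function_call(func):
    @functools.wraps(func)  # copy func's __name__ and __doc__ onto wrapper
    def wrapper(*args, **kwargs):
        print(f"Calling function {func.__name__} with args={args}, kwargs={kwargs}")
        return func(*args, **kwargs)
    return wrapper

@log_function_call
def add(a, b):
    """Return the sum of a and b."""
    return a + b

result = add(5, b=10)
print(result)        # 15
print(add.__name__)  # add (without functools.wraps this would be "wrapper")
```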

**Sample Use Case:** Generate tags

In [29]:
def tag(name, *content, class_=None, **attrs):
    """Generate one or more HTML tags"""
    if class_ is not None:
        attrs['class'] = class_
    attr_pairs = (f' {attr}="{value}"' for attr, value
                    in sorted(attrs.items()))
    attr_str = ''.join(attr_pairs)
    if content:
        elements = (f'<{name}{attr_str}>{c}</{name}>'
                    for c in content)
        return '\n'.join(elements)
    else:
        return f'<{name}{attr_str} />'

In [35]:
print(tag('br'))
print("-------------------")
print(tag('p', 'hello'))
print("-------------------")
print(tag('p', 'hello', 'world'))
print("-------------------")
print(tag('p', 'hello', id=33))
print("-------------------")
print(tag('p', 'hello', 'world', class_='sidebar'))
print("-------------------")
print(tag(content='testing', name="img"))
print("-------------------")
my_tag = {'name': 'img', 'title': 'Sunset Boulevard', 'src': 'sunset.jpg', 'class': 'framed'}
print(tag(**my_tag))

<br />
-------------------
<p>hello</p>
-------------------
<p>hello</p>
<p>world</p>
-------------------
<p id="33">hello</p>
-------------------
<p class="sidebar">hello</p>
<p class="sidebar">world</p>
-------------------
<img content="testing" />
-------------------
<img class="framed" src="sunset.jpg" title="Sunset Boulevard" />
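
In `tag`, the `class_` parameter comes after `*content`, so it can only be passed by keyword; any extra positional value is collected into `content` instead. The same rule is shown below with `make_label`, a hypothetical helper written for this illustration.

```python
def make_label(text, *parts, sep=" "):
    """Hypothetical helper: sep follows *parts, so it is keyword-only."""
    return sep.join((text, *parts))

joined = make_label("a", "b", "c")           # all three are positional
dashed = make_label("a", "b", "c", sep="-")  # sep must be named
swallowed = make_label("a", "b", "-")        # "-" is just another part, not sep

print(joined)     # a b c
print(dashed)     # a-b-c
print(swallowed)  # a b -
```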


### **Return Values, Mutability & Side Effects**

A function can return a value using `return`. This value can then be used elsewhere in the program.

In [36]:
def add_numbers(a, b):
    return a + b  # This function "returns" a value

result = add_numbers(5, 3)
print(result)  # Output: 8


8


In [37]:
def get_coordinates():
    return (10.5, 20.3)  # Tuple return

x, y = get_coordinates()
print(f"x: {x}, y: {y}")


x: 10.5, y: 20.3


#### **Immutable Type (Returning a New Value)**

In [39]:
def uppercase(text: str) -> str:
    return text.upper()

message = "hello"
new_message = uppercase(message)

print(message)      # "hello" (unchanged)
print(new_message)  # "HELLO" (new value)


hello
HELLO


#### **Mutable Type (Modifying In-Place)**

In [40]:
def double_numbers(numbers: list[int]) -> None:
    for i in range(len(numbers)):
        numbers[i] *= 2  # Modifies the original list

values = [1, 2, 3]
double_numbers(values)
print(values)  # [2, 4, 6] (modified!)


[2, 4, 6]


**Make a copy of the parameter "locally" in the function to prevent modification**

In [42]:
"""
Not the correct solution
"""
def modify_list(numbers: list[int]) -> list[int]:
    numbers_copy = numbers  # This still points to the same list!
    numbers_copy.append(99) # This modifies the original list
    return numbers_copy

values = [1, 2, 3]
new_values = modify_list(values)

print(values)      # [1, 2, 3, 99]  (Oops! The original changed)
print(new_values)  # [1, 2, 3, 99]


[1, 2, 3, 99]
[1, 2, 3, 99]


In [44]:
def modify_list_safe(numbers: list[int]) -> list[int]:
    numbers_copy = numbers.copy()  # Creates a shallow copy
    numbers_copy.append(99)  # Modifies the copy, not the original
    return numbers_copy

values = [1, 2, 3]
new_values = modify_list_safe(values)

print(values)      # [1, 2, 3]  (Unchanged)
print(new_values)  # [1, 2, 3, 99]  (New list)


[1, 2, 3]
[1, 2, 3, 99]


**Nested Lists (Deep Copy)**

If the list contains other mutable objects (e.g., lists inside lists), `.copy()` won't be enough: it copies only the top-level list, so the inner lists are still shared with the original.
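
The sharing can be seen directly in a short sketch (the variable names are illustrative):

```python
matrix = [[1, 2, 3], [4, 5, 6]]
shallow = matrix.copy()    # new outer list, same inner lists

shallow.append([7, 8, 9])  # only the copy gets a new row
shallow[0].append(99)      # the inner list is shared, so both see the change

print(matrix)   # [[1, 2, 3, 99], [4, 5, 6]]
print(shallow)  # [[1, 2, 3, 99], [4, 5, 6], [7, 8, 9]]
print(shallow[0] is matrix[0])  # True
```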

For nested structures, use `copy.deepcopy()`:

In [45]:
import copy

def modify_nested_list_safe(numbers: list[list[int]]) -> list[list[int]]:
    numbers_copy = copy.deepcopy(numbers)  # Deep copy ensures full isolation
    numbers_copy[0].append(99)  # Modify copy
    return numbers_copy

values = [[1, 2, 3], [4, 5, 6]]
new_values = modify_nested_list_safe(values)

print(values)      # [[1, 2, 3], [4, 5, 6]]  (Unchanged)
print(new_values)  # [[1, 2, 3, 99], [4, 5, 6]]  (New copy)


[[1, 2, 3], [4, 5, 6]]
[[1, 2, 3, 99], [4, 5, 6]]


#### **Side Effects**

A side effect occurs when a function modifies state outside its own scope (a global variable, a file, a mutable argument) rather than only returning a value.

In [46]:
usernames = []

def add_user(username: str) -> None:
    usernames.append(username)  # Side effect: modifies global list

add_user("alice")
print(usernames)  # ["alice"]


['alice']
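
For comparison, a side-effect-free variant takes the list as an argument and returns a new one. This is a sketch of one possible alternative; `with_user` is a hypothetical name, and the list is built with unpacking so the input is never mutated.

```python
def with_user(usernames: list[str], username: str) -> list[str]:
    """Hypothetical side-effect-free variant: returns a new list."""
    return [*usernames, username]

usernames = []
updated = with_user(usernames, "alice")

print(usernames)  # []
print(updated)    # ['alice']
```

The caller decides what to do with the result, which makes the function easier to test in isolation.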


**Logging to a file - Issues**

In [48]:
def log_message(message: str) -> None:
    """Logs a message to a hardcoded file."""
    with open("app.log", "a") as log_file:  # Hardcoded file
        log_file.write(message + "\n")

log_message("User logged in")  # Always writes to app.log

Because the file name is hardcoded inside the function, every call writes to `app.log`. The function cannot be pointed at a different destination, and testing it means touching the real file system.

**Solution: Use Dependency Injection (DI)**

In [49]:
from typing import TextIO
import io

def log_message(message: str, log_file: TextIO) -> None:
    """Logs a message to the provided file-like object."""
    log_file.write(message + "\n")

# Using a real file
with open("app.log", "a") as f:
    log_message("User logged in", f)  # Writes to app.log

# Using an in-memory file for testing (NO file I/O!)
fake_file = io.StringIO()
log_message("Test log entry", fake_file)

# Verify the content written
print(fake_file.getvalue())  # Output: "Test log entry\n"


Test log entry

