# 🐍 Python Training - Level 1 - Data Manipulation and ETL Thinking

This notebook builds the core Python skills that **Data Engineers** and **Data Analysts** use daily, and it doubles as interview preparation.  
It parallels the SQL Level 1 workbook, focusing on thinking in records (rows of a dataset), transformations, and data validation, but in **pure Python** with no third-party libraries.

Now that you can access and loop through data (Level 0), we’ll practice writing concise, reusable, and reliable code for everyday data-engineering tasks.

---

## Topics covered
- List & dict comprehensions  
- `map()` and `filter()`  
- Lambda functions  
- `Counter`, `defaultdict`  
- `enumerate()` and `zip()`  
- Iterators and generators  
- Functions & error handling  
- Sorting and dates

## Exercise 1 - List Comprehensions

**Business question:**  
"Can we simplify loops to extract confirmed bookings faster?"  
**Direct question:**  
From a list of bookings, build a list of booking_ids where status == "confirmed".  
**Goal:**  
Replace manual loops with concise list comprehensions.

In [1]:
# Consider this list of dictionaries, each one a booking record.
bookings_list = [
    {"id": 1, "status": "confirmed"},
    {"id": 2, "status": "cancelled"},
    {"id": 3, "status": "confirmed"},
]

In [2]:
# From a list of bookings, use List Comprehension to build a list of booking_ids where status == "confirmed".
confirmed_ids_list_comp = [i["id"] for i in bookings_list if i["status"] == "confirmed"]
print("List of bookings with confirmed status: ", confirmed_ids_list_comp)

List of bookings with confirmed status:  [1, 3]
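
For comparison, here is a minimal sketch of the manual loop that the comprehension replaces, using the same sample data. The names `confirmed_ids_loop` and `confirmed_ids_comp` are illustrative and not part of the original exercise.

```python
bookings_list = [
    {"id": 1, "status": "confirmed"},
    {"id": 2, "status": "cancelled"},
    {"id": 3, "status": "confirmed"},
]

# Manual loop: create an empty list, test each record, append the matches
confirmed_ids_loop = []
for booking in bookings_list:
    if booking["status"] == "confirmed":
        confirmed_ids_loop.append(booking["id"])

# List comprehension: the same three steps in a single expression
confirmed_ids_comp = [b["id"] for b in bookings_list if b["status"] == "confirmed"]

print(confirmed_ids_loop == confirmed_ids_comp)  # True
```

Both versions build the same list; the comprehension just puts the output expression, the loop, and the condition on one line.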


## Exercise 2 - Functions

**Business question:**  
"How can we reuse logic without rewriting loops?"  
**Direct question:**  
Create a function that takes a booking and returns `True` if it’s confirmed.  
**Goal:**  
Learn how to define and call functions.

In [3]:
# We reuse the same bookings_list as in Exercise 1

bookings_list = [
    {"id": 1, "status": "confirmed"},
    {"id": 2, "status": "cancelled"},
    {"id": 3, "status": "confirmed"},
]

In [4]:
# First, define a function that takes one element of the bookings list (a dictionary holding one booking record)
# and returns True if the booking's status is "confirmed".

def is_confirmed(booking):
    return booking["status"] == "confirmed"  # the comparison itself evaluates to True or False

In [5]:
# Now, iterate through the list with a for loop and print True or False for each booking record

for i in bookings_list:
    print(i["id"], ": ", is_confirmed(i) )

1 :  True
2 :  False
3 :  True


## Exercise 3 - Lambda Functions (Generic Example)

**Business question:**  
"What if we need a one-line function for something simple?"  
**Direct question:**  
Write a lambda that adds 10 to a number, and another that joins first name (Alice) and last name (Mora).  
**Goal:**  
Learn lambda syntax before using it with data structures.

In [6]:
# Create a Lambda Function that adds 10 to a number
add_ten_lambda_function = lambda x: x + 10

# Now, call the function with the number 2 and print the result
# Note: the print needs to be in the same code block!
print(add_ten_lambda_function(2))

12


In [7]:
# Now, create a Lambda Function that joins first name (Alice) and last name (Mora)
join_names_lambda_function = lambda first, last: f"{first} {last}"

# Now, call the function with first = "Alice" and last = "Mora" and print the result
# Note: the print needs to be in the same code block!
print(join_names_lambda_function("Alice", "Mora"))



Alice Mora


## Exercise 4 - Lambda Functions in Practice

**Business question:**  
"Can we check booking status without defining a full function?"  
**Direct question:**  
Use a lambda function to test whether a booking is confirmed. It should return `True` if confirmed, for each booking.  
**Goal:**  
Apply lambda to real data instead of just numbers.

In [8]:
# Consider the same bookings_list
bookings_list = [
    {"id": 1, "status": "confirmed"},
    {"id": 2, "status": "cancelled"},
    {"id": 3, "status": "confirmed"},
]

In [9]:
# Create a Lambda Function that does the same as the is_confirmed function (returns True if status is confirmed)
is_confirmed_lambda_function = lambda x: x["status"] == "confirmed"

# Now, call this function for each record of the bookings_list and print the result
# Note: the print needs to be in the same code block!
for i in bookings_list:
    print(i["id"], ": ", is_confirmed_lambda_function(i))


1 :  True
2 :  False
3 :  True


## Understanding Iterables, Iterators, and Generators

### 1. **Iterable**  
An **iterable** is any object that you can loop over.  

Examples:  
- `list`, `tuple`, `str`, `dict`, `set`, `range()`  

👉 Think of it as a *collection* of data that can be traversed.

```python
for n in [10, 20, 30]:
    print(n)
```

### 2. **Iterator**  
An **iterator** is what Python actually uses to get values one at a time from an iterable.

You can create one by calling `iter()` on an iterable, and then call `next()` to move step-by-step.

```python
nums = [10, 20, 30]
it = iter(nums)
print(next(it))   # 10
print(next(it))   # 20
print(next(it))   # 30
```

After the last element, calling `next(it)` again raises a `StopIteration` exception.
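
A minimal sketch of two common ways to deal with an exhausted iterator: catching `StopIteration` explicitly, or passing a default value as the second argument to `next()`. The variable names `values`, `exhausted`, and `fallback` are illustrative.

```python
nums = [10, 20, 30]
it = iter(nums)

# Consume all three values
values = [next(it), next(it), next(it)]

# Option 1: catch the exception explicitly
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

# Option 2: give next() a default to return instead of raising
fallback = next(it, None)

print(values, exhausted, fallback)  # [10, 20, 30] True None
```

A `for` loop handles `StopIteration` for you, which is why you rarely see it in everyday code.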

### 3. **Generator**  
A **generator** is a special type of iterator that you write yourself with the `yield` keyword. Instead of building a list in memory, it **produces** values one by one, only when needed.

```python
def number_stream():
    for n in [10, 20, 30]:
        yield n   # yield returns one item, pauses, and remembers where it left off

stream = number_stream()
print(next(stream))  # 10
print(next(stream))  # 20
print(next(stream))  # 30
```

Generators are automatically iterators: you can loop over them or call `next()` directly.
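
A short sketch of what "automatically iterators" means in practice, reusing `number_stream()` from above. Calling `iter()` on a generator returns the generator itself, and a loop picks up wherever `next()` left off. The names `same_object`, `first`, and `rest` are illustrative.

```python
def number_stream():
    for n in [10, 20, 30]:
        yield n

stream = number_stream()

# A generator is its own iterator: iter() hands back the same object
same_object = iter(stream) is stream

first = next(stream)          # take one value manually
rest = [n for n in stream]    # the loop continues from the second value

print(same_object, first, rest)  # True 10 [20, 30]
```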

**Why It Matters**  
Functions like `map()` and `filter()`, as well as generator functions, don’t create full lists right away.  
They return iterators, meaning they produce data on demand, one element at a time. This is called lazy evaluation.

Why this is useful:
- Saves memory for large datasets
- Enables real-time or streaming data processing
- You can always convert to a list when you actually need all results:

```python
list(map(...))
list(filter(...))
```
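
A concrete sketch of lazy evaluation with made-up numbers: nothing is computed until `list()` asks for the values, and the iterator is empty after it has been consumed once. The variable names are illustrative.

```python
numbers = [1, 2, 3, 4, 5, 6]

# Neither line below computes anything yet; both only set up iterators
doubled = map(lambda n: n * 2, numbers)
over_four = filter(lambda n: n > 4, doubled)

first_pass = list(over_four)    # values are produced here, on demand
second_pass = list(over_four)   # the iterator is already exhausted

print(first_pass)   # [6, 8, 10, 12]
print(second_pass)  # []
```

If you need the results more than once, store the list from the first pass instead of re-reading the iterator.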

## Exercise 5 - Iterables vs Iterators (Conceptual Example)

**Business question:**  
"How does Python read through a list one element at a time?"  
**Direct question:**  
Turn a list into an iterator and use `next()` manually to see how it works.  
**Goal:**  
Understand what `iter()` and `next()` actually do.

In [10]:
# Consider a list of numbers
numbers = [10, 20, 30]   # This list is an *iterable* in this case

# The built-in function iter() returns an iterator that walks through the items of the iterable
it = iter(numbers)       # `it` is now an iterator over the list `numbers`

# Pull the items out one by one manually: each next() call returns the next item of the list
print(next(it))   # 10
print(next(it))   # 20
print(next(it))   # 30

# Calling next(it) again now would raise StopIteration: the iterator is "exhausted", i.e., it has no items left.
# print(next(it)) 

10
20
30


## Exercise 6 - Creating Your Own Generator

**Business question:**  
"Our API returns thousands of transactions. I don’t want to load everything into memory - I just want to process one transaction at a time, in order."  

**Direct question:**  
Write a generator function `stream_payments()` that yields one payment record at a time. Then:  
1. Grab the first payment with `next()`.  
2. Loop through the rest.

**Goal:**  
Understand how `yield` lets you build your own iterator (a lazy data stream).

In [11]:
# Consider this stream of payments
payments_list = [
        {"payment_id": 101, "user": "alice", "amount": 29.90},
        {"payment_id": 102, "user": "bob",   "amount": 12.00},
        {"payment_id": 103, "user": "carol", "amount": 7.50},
    ]

In [12]:
# Create the generator function: calling it returns a generator object (an iterator)
def stream_payments():
    """
    A simple generator that yields one payment dictionary at a time.
    Instead of returning a full list, it pauses at each `yield`
    until the next item is requested.
    """
    for p in payments_list:
        # "yield" sends one record at a time
        yield p

In [13]:
# Calling the generator function creates a generator object, which is itself an iterator; no code inside the function has run yet
payments_iter = stream_payments() 

print(payments_iter)  # Result: <generator object ...>

# This is still not the payment record! We need to use next() on it!

<generator object stream_payments at 0x000001E4AA088880>


In [14]:
# Pull items manually from the iterator
first_payment_record = next(payments_iter)
print("First payment record:", first_payment_record)

print("Second payment record:", next(payments_iter))
print("Third payment record:", next(payments_iter))

# Calling next() again here would raise StopIteration
# next(payments_iter)

First payment record: {'payment_id': 101, 'user': 'alice', 'amount': 29.9}
Second payment record: {'payment_id': 102, 'user': 'bob', 'amount': 12.0}
Third payment record: {'payment_id': 103, 'user': 'carol', 'amount': 7.5}


In [15]:
# Now, let's be more efficient and loop over a generator (typical ETL usage), printing each value of each key separately

# Here we are iterating over an *iterator* that yields dictionaries.
# In each iteration, `i` is one dictionary with keys:
#   "payment_id", "user", and "amount".
# We access each key explicitly while printing its values.

for i in stream_payments():
    print("Processing payment:", i["payment_id"], "amount:", i["amount"])

Processing payment: 101 amount: 29.9
Processing payment: 102 amount: 12.0
Processing payment: 103 amount: 7.5
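
Generators can also be chained, so each record flows through several steps without any intermediate list being built. The sketch below is one possible shape for such a pipeline, not the only one. The functions `stream_payments_from()` and `only_at_least()` and the 10.0 threshold are hypothetical and introduced only for illustration.

```python
payments_list = [
    {"payment_id": 101, "user": "alice", "amount": 29.90},
    {"payment_id": 102, "user": "bob",   "amount": 12.00},
    {"payment_id": 103, "user": "carol", "amount": 7.50},
]

def stream_payments_from(source):
    """Yield one payment at a time from any iterable passed in as an argument."""
    for payment in source:
        yield payment

def only_at_least(stream, min_amount):
    """Yield only the payments whose amount is at least min_amount."""
    for payment in stream:
        if payment["amount"] >= min_amount:
            yield payment

# Build the pipeline: nothing runs until we start consuming it
pipeline = only_at_least(stream_payments_from(payments_list), min_amount=10.0)

large_ids = [p["payment_id"] for p in pipeline]
print(large_ids)  # [101, 102]
```

Passing the source in as a parameter, rather than reading a global list, lets the same generator work on a file, an API response, or test data.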


## Exercise 7 - `map()`, `filter()`, `enumerate()`, and `zip()`

**Business question:**  
"Our marketing team has spend and clicks per channel. We want:  
(a) cost per click,  
(b) keep only efficient channels, and  
(c) produce a clean ranked report."  

**Direct question:**  
Given parallel lists of `channels`, `spend`, and `clicks`:
1. Use `zip()` to combine them into records.  
2. Use `map()` to compute CPC = spend / clicks for each channel.  
3. Use `filter()` to keep only channels with CPC < 1.00.  
4. Use `enumerate()` to assign a ranking number starting from 1.

**Goal:**  
Practice the “Python data pipeline mindset”: transform → filter → label.

In [16]:
# Consider these lists

channels_ls = ["instagram", "tiktok", "email", "youtube"]
spend_ls    = [120.0,       80.0,      10.0,    200.0]
clicks_ls   = [150,         40,        30,      100]

In [17]:
# Step 1. Combine with zip() - Creating a list of tuples
# This turns the three parallel lists above into "records" (one tuple per channel)

combined_list = list(zip(channels_ls, spend_ls, clicks_ls))

In [18]:
# Print this combined list
print(combined_list)

[('instagram', 120.0, 150), ('tiktok', 80.0, 40), ('email', 10.0, 30), ('youtube', 200.0, 100)]


In [19]:
# Check the type
print(type(combined_list))

<class 'list'>


In [20]:
# Step 2. Create a function to compute CPC
# This function will take one record (a tuple) and return a dictionary with channel, spend, clicks, and cpc
def compute_cpc(record):
    i_channel, i_spends, i_clicks = record
    return {"channel": i_channel, "spend": i_spends, "clicks": i_clicks, "cpc": i_spends / i_clicks}

# The map() function applies compute_cpc to each item in combined_list
# It returns an iterator, so we convert it to a list
record_with_cpc_list = list(map(compute_cpc, combined_list))
print("Records with CPC:")
for row in record_with_cpc_list:
    print(row)

Records with CPC:
{'channel': 'instagram', 'spend': 120.0, 'clicks': 150, 'cpc': 0.8}
{'channel': 'tiktok', 'spend': 80.0, 'clicks': 40, 'cpc': 2.0}
{'channel': 'email', 'spend': 10.0, 'clicks': 30, 'cpc': 0.3333333333333333}
{'channel': 'youtube', 'spend': 200.0, 'clicks': 100, 'cpc': 2.0}
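
Note that `compute_cpc` divides by the click count, so a channel with zero clicks would raise `ZeroDivisionError`. Here is a minimal sketch of one way to guard against that, assuming an undefined CPC is best represented as `None`. The function `compute_cpc_safe()` and the `"podcast"` record are hypothetical and not part of the exercise data.

```python
def compute_cpc_safe(record):
    """Like compute_cpc, but returns cpc=None when there are no clicks."""
    channel, spend, clicks = record
    cpc = spend / clicks if clicks else None   # avoid dividing by zero
    return {"channel": channel, "spend": spend, "clicks": clicks, "cpc": cpc}

# Hypothetical sample: the second record has zero clicks
records = [("instagram", 120.0, 150), ("podcast", 50.0, 0)]

results = list(map(compute_cpc_safe, records))
for row in results:
    print(row["channel"], row["cpc"])
```

Any later step that compares `cpc` to a number would then need to skip or handle the `None` rows first.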


In [21]:
# Step 3. Filter efficient channels (CPC < 1)
# Define a function that returns True if cpc < 1.0
def is_efficient(row):
    return row["cpc"] < 1.0

filtered_list = list(filter(is_efficient, record_with_cpc_list))

In [22]:
# Print the filtered list
print(filtered_list)

[{'channel': 'instagram', 'spend': 120.0, 'clicks': 150, 'cpc': 0.8}, {'channel': 'email', 'spend': 10.0, 'clicks': 30, 'cpc': 0.3333333333333333}]


In [23]:
# Print each element of the filtered list
print("Filtered channels (CPC < 1.0):")
for i in filtered_list:
    print(i)

Filtered channels (CPC < 1.0):
{'channel': 'instagram', 'spend': 120.0, 'clicks': 150, 'cpc': 0.8}
{'channel': 'email', 'spend': 10.0, 'clicks': 30, 'cpc': 0.3333333333333333}


In [None]:
# Step 4. Rank results with enumerate()

# Enumerate will add a ranking number to each element of the filtered list.
# enumerate() produces (index, value) pairs
# It works like a for loop but automatically provides a counter.
# A counter is a number that increments with each iteration.
# Example: iterate over any iterable while keeping track of the index.

In [25]:
# Generic example of how enumerate() works. Consider this list.
fruits = ["apple", "banana", "cherry"]

In [26]:
# enumerate() returns pairs like: (0, 'apple'), (1, 'banana'), (2, 'cherry')
# The 'start=1' parameter makes counting start from 1 instead of 0.

for index, fruit in enumerate(fruits, start=1):
    print(index, fruit)

1 apple
2 banana
3 cherry


In [None]:
# Applying enumerate() to our efficient filtered channels list
# Enumerate will add a ranking number to each efficient channel. It works like a for loop but adds a counter.
# Note: the rank follows the current order of filtered_list; the list is not sorted by CPC here.

ranked_report = []
for rank, row in enumerate(filtered_list, start=1):
    # Here, 'row' is a dictionary from our filtered list,
    # and 'rank' is an automatically increasing integer.
    ranked_report.append({ # construct a new dictionary with rank, channel, and cpc
        "rank": rank, # the ranking number
        "channel": row["channel"], # the channel name, which is accessed from the row dictionary
        "cpc": round(row["cpc"], 3) # rounding cpc to 3 decimal places
    })

print("\nRanked report:")
for row in ranked_report:
    print(row)


Ranked report:
{'rank': 1, 'channel': 'instagram', 'cpc': 0.8}
{'rank': 2, 'channel': 'email', 'cpc': 0.333}
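
`enumerate()` numbers items in whatever order they arrive, so the report above ranks channels in list order rather than by CPC. If the intent is to put the cheapest channel first, one option is to sort before enumerating. This is a minimal sketch, assuming a lower CPC should rank higher. The names `sorted_by_cpc` and `ranked_by_cpc` are illustrative.

```python
filtered_list = [
    {"channel": "instagram", "spend": 120.0, "clicks": 150, "cpc": 120.0 / 150},
    {"channel": "email", "spend": 10.0, "clicks": 30, "cpc": 10.0 / 30},
]

# sorted() returns a new list ordered by the key; the lambda picks the cpc value
sorted_by_cpc = sorted(filtered_list, key=lambda row: row["cpc"])

# Number the rows after sorting, so rank 1 is the lowest CPC
ranked_by_cpc = [
    {"rank": rank, "channel": row["channel"], "cpc": round(row["cpc"], 3)}
    for rank, row in enumerate(sorted_by_cpc, start=1)
]

for row in ranked_by_cpc:
    print(row)
```

With this ordering, `email` (CPC 0.333) takes rank 1 and `instagram` (CPC 0.8) takes rank 2.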
