# Python for Data Engineering – Live Coding Session

This notebook covers:

- Poetry & project structure
- Python core data structures
- NumPy for efficient computation
- Pandas for tabular data manipulation
- FastAPI basics to expose data as APIs


## 1. Poetry & Project Structure

> Run these in your terminal (not in the notebook).

```bash
poetry init
poetry add numpy pandas fastapi uvicorn jupyter pydantic
```

Project layout:

```
my_project/
├── pyproject.toml
├── notebooks/
├── src/
├── data/
└── app.py
```


## 2. Python Essentials Refresher

```python
# Lists and comprehensions
numbers = list(range(10))
print(f"This is a list -> {numbers}")
squares = [n**2 for n in numbers]
print(f"This is another list -> {squares}")
even_squares = [n for n in squares if n % 2 == 0]
print(f"Yet another list -> {even_squares}")

# Dictionaries
students = {"alice": 95, "bob": 90, "charlie": 85}
print(f"This is a dictionary -> {students}")
# Sort by score, highest first (dicts preserve insertion order in Python 3.7+)
sorted_students = dict(sorted(students.items(), key=lambda item: item[1], reverse=True))
print(f"This is a sorted dictionary -> {sorted_students}")

# Sets and tuples
unique_scores = set(students.values())
print(f"This is a set -> {unique_scores}")
example_tuple = ("alice", 85)
print(f"This is a tuple -> {example_tuple}")

# Enumerate and zip
names = ["alice", "bob", "charlie"]
grades = [95, 90, 85]
for i, (n, g) in enumerate(zip(names, grades)):
    print(f"{i}: {n} scored {g}")
```

```text
This is a list -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
This is another list -> [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Yet another list -> [0, 4, 16, 36, 64]
This is a dictionary -> {'alice': 95, 'bob': 90, 'charlie': 85}
This is a sorted dictionary -> {'alice': 95, 'bob': 90, 'charlie': 85}
This is a set -> {90, 85, 95}
This is a tuple -> ('alice', 85)
0: alice scored 95
1: bob scored 90
2: charlie scored 85
```
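
A common next step with these structures is grouping records by a key. The following is a minimal sketch, assuming a hypothetical list of `(name, score)` tuples that is not part of the session data; it combines `defaultdict`, a dict comprehension, and `Counter`.

```python
from collections import Counter, defaultdict

# Hypothetical sample records (name, score); not from the session data
records = [("alice", 95), ("bob", 90), ("alice", 80), ("charlie", 85), ("bob", 70)]

# Group scores by name using a dict of lists
scores_by_name = defaultdict(list)
for name, score in records:
    scores_by_name[name].append(score)

# Dict comprehension: average score per name
averages = {name: sum(s) / len(s) for name, s in scores_by_name.items()}

# Counter: how many records each name has
counts = Counter(name for name, _ in records)

print(dict(scores_by_name))
print(averages)
print(counts)
```

`defaultdict(list)` avoids checking whether a key exists before appending, which keeps the grouping loop short.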


## 3. NumPy Basics

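```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])
print("Matrix:\n", matrix)
print("Transpose:\n", matrix.T)
# For 2-D arrays, np.dot is matrix multiplication (same as matrix @ matrix)
print("Dot product:\n", np.dot(matrix, matrix))

# Random data and filtering
# Integers in [1, 100); unseeded, so your values will differ from the output shown
data = np.random.randint(1, 100, size=(5, 5))
print("Random Data:\n", data)
# Boolean mask: returns a flat 1-D array of the matching values
print("Values > 50:\n", data[data > 50])
```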


```text
Matrix:
 [[1 2]
 [3 4]]
Transpose:
 [[1 3]
 [2 4]]
Dot product:
 [[ 7 10]
 [15 22]]
Random Data:
 [[53 33 67 48 30]
 [28 22 50 72  5]
 [99 65 95 41 85]
 [93 41 63 93 82]
 [29 17 48 16 62]]
Values > 50:
 [53 67 72 99 65 95 85 93 63 93 82 62]
```
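
Because the random data above is unseeded, each run prints different values. The sketch below assumes a seeded generator (the seed `42` is arbitrary) to make a run repeatable, and adds three operations the session did not show: aggregation along an axis, `np.where`, and broadcasting.

```python
import numpy as np

# Seeded generator so the same values appear on every run (seed 42 is arbitrary)
rng = np.random.default_rng(42)
data = rng.integers(1, 100, size=(5, 5))  # upper bound is exclusive

# Aggregate along an axis: axis=0 collapses rows (one value per column)
col_means = data.mean(axis=0)
# axis=1 collapses columns (one value per row)
row_max = data.max(axis=1)

# Vectorized conditional: cap values above 50 without a Python loop
capped = np.where(data > 50, 50, data)

# Broadcasting: subtract each column's mean from every row
centered = data - col_means

print(data)
print(col_means, row_max)
print(capped.max())
```

Each of these operations runs in compiled code over the whole array, which is usually much faster than an equivalent Python `for` loop.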


## 4. Pandas Basics

```python
import pandas as pd

# Simulated customer data
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eric'],
    'signup_date': pd.to_datetime(['2023-01-15', '2023-02-01', '2023-03-20', '2023-04-05', '2023-04-05']),
    'region': ['North', 'South', 'North', 'East', 'East']
})

# Simulated orders data
orders = pd.DataFrame({
    'order_id': range(1, 11),
    'customer_id': [101, 102, 101, 103, 102, 104, 104, 101, 103, 104],
    'amount': [100, 150, 200, 130, 170, 120, 180, 90, 200, 220],
    'order_date': pd.to_datetime([
        '2023-03-01', '2023-03-05', '2023-03-07', '2023-03-15', '2023-03-22',
        '2023-03-25', '2023-04-01', '2023-04-02', '2023-04-04', '2023-04-10'
    ])
})

# 1. Merge customer and order data (one row per order)
df = orders.merge(customers, on='customer_id', how='left')

# 2. Add a calculated column: days since signup
df['days_since_signup'] = (df['order_date'] - df['signup_date']).dt.days

# 3. Filter orders placed within 30 days of signup
# Note: negative values (orders dated before signup in this simulated data)
# also pass this filter
early_orders = df[df['days_since_signup'] <= 30]

# 4. Aggregate: total amount spent, average order value, and order count per customer
agg = df.groupby('customer_id').agg(
    total_spent=('amount', 'sum'),
    avg_order_value=('amount', 'mean'),
    order_count=('order_id', 'count')
).reset_index()

# 5. Region-based analysis: total revenue per region
region_revenue = df.groupby('region')['amount'].sum().sort_values(ascending=False)

# 6. Time-based trend: total sales by week
df['week'] = df['order_date'].dt.to_period('W')
weekly_sales = df.groupby('week')['amount'].sum()

# 7. Identify customers with no orders (left join with indicator, i.e. an anti-join)
all_customers = pd.DataFrame({'customer_id': customers['customer_id']})
customers_with_orders = pd.DataFrame({'customer_id': orders['customer_id'].unique()})
no_orders_raw = all_customers.merge(customers_with_orders, on='customer_id', how='left', indicator=True)
no_orders_filtered = no_orders_raw[no_orders_raw['_merge'] == 'left_only']

# OUTPUT
print("\nMerged DataFrame:\n", df.head())
print("\nEarly Orders (within 30 days of signup):\n", early_orders)
print("\nCustomer Aggregates:\n", agg)
print("\nRevenue by Region:\n", region_revenue)
print("\nWeekly Sales Trend:\n", weekly_sales)
print("\nCustomers with No Orders before filtering:\n", no_orders_raw)
print("\nCustomers with No Orders:\n", no_orders_filtered)
```



```text
Merged DataFrame:
    order_id  customer_id  amount order_date     name signup_date region  \
0         1          101     100 2023-03-01    Alice  2023-01-15  North
1         2          102     150 2023-03-05      Bob  2023-02-01  South
2         3          101     200 2023-03-07    Alice  2023-01-15  North
3         4          103     130 2023-03-15  Charlie  2023-03-20  North
4         5          102     170 2023-03-22      Bob  2023-02-01  South

   days_since_signup                   week
0                 45  2023-02-27/2023-03-05
1                 32  2023-02-27/2023-03-05
2                 51  2023-03-06/2023-03-12
3                 -5  2023-03-13/2023-03-19
4                 49  2023-03-20/2023-03-26

Early Orders (within 30 days of signup):
    order_id  customer_id  amount order_date     name signup_date region  \
3         4          103     130 2023-03-15  Charlie  2023-03-20  North
5         6          104     120 2023-03-25    Diana  2023-0
```
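
The early-orders output includes a row with `days_since_signup` of -5, because a `<= 30` filter also admits orders dated before signup. The sketch below, which repeats a small subset of the sample data so it runs on its own, shows one way to restrict the window to 0 through 30 days with `Series.between`. It also shows `isin` as a shorter alternative for finding customers with no orders. The `validate` argument is an optional safety check, not something the session used.

```python
import pandas as pd

# Small subset of the session's sample data, repeated so the block runs on its own
customers = pd.DataFrame({
    "customer_id": [103, 104, 105],
    "signup_date": pd.to_datetime(["2023-03-20", "2023-04-05", "2023-04-05"]),
})
orders = pd.DataFrame({
    "order_id": [4, 9, 6, 10],
    "customer_id": [103, 103, 104, 104],
    "order_date": pd.to_datetime(["2023-03-15", "2023-04-04", "2023-03-25", "2023-04-10"]),
})

# validate= raises MergeError if a customer_id is duplicated on the right side
df = orders.merge(customers, on="customer_id", how="left", validate="many_to_one")
df["days_since_signup"] = (df["order_date"] - df["signup_date"]).dt.days

# Keep only orders placed 0 to 30 days after signup (both bounds inclusive)
early_orders = df[df["days_since_signup"].between(0, 30)]

# Anti-join alternative: customers whose id never appears in orders
no_orders = customers[~customers["customer_id"].isin(orders["customer_id"])]

print(early_orders[["order_id", "days_since_signup"]])
print(no_orders["customer_id"].tolist())
```

Whether pre-signup orders should count as "early" depends on the business definition, so treat the lower bound of 0 as an assumption.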

## 5. Pandas Advanced

```python
# Number each customer's orders by date.
# Equivalent to ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date);
# the result aligns back to df by index, so the original row order is kept.
df['order_rank'] = (
    df.sort_values(['customer_id', 'order_date'])
      .groupby('customer_id')
      .cumcount() + 1
)

# rank(method='first') gives the same numbering (as floats); method='dense'
# gives DENSE_RANK-style ranks, and pct=True returns ranks as fractions.
# df['order_rank'] = df.groupby('customer_id')['order_date'].rank(method='first')

# Show the first two orders per customer
top_orders = df[df['order_rank'] <= 2].sort_values(['customer_id', 'order_rank'])

# OUTPUT
print("\nData with ROW_NUMBER-like column:\n", df[['customer_id', 'order_id', 'order_date', 'order_rank']].sort_values(['customer_id', 'order_rank']))
print("\nFirst 2 Orders per Customer:\n", top_orders[['customer_id', 'order_id', 'order_date', 'order_rank']])
```


```text
Data with ROW_NUMBER-like column:
    customer_id  order_id order_date  order_rank
0          101         1 2023-03-01           1
2          101         3 2023-03-07           2
7          101         8 2023-04-02           3
1          102         2 2023-03-05           1
4          102         5 2023-03-22           2
3          103         4 2023-03-15           1
8          103         9 2023-04-04           2
5          104         6 2023-03-25           1
6          104         7 2023-04-01           2
9          104        10 2023-04-10           3

First 2 Orders per Customer:
    customer_id  order_id order_date  order_rank
0          101         1 2023-03-01           1
2          101         3 2023-03-07           2
1          102         2 2023-03-05           1
4          102         5 2023-03-22           2
3          103         4 2023-03-15           1
8          103         9 2023-04-04           2
5          104         6 2023-03-25           1
6          104       
```
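
The sample orders contain no ties, so the different ranking methods would all produce the same numbers. The sketch below uses a hypothetical frame, not the session data, in which one customer has two orders on the same date, to show how `rank(method=...)` maps onto the SQL window functions. The column names `row_number`, `sql_rank`, and `dense_rank` are illustrative.

```python
import pandas as pd

# Hypothetical orders with a tie: customer 1 has two orders on the same date
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2023-03-01", "2023-03-01", "2023-03-07", "2023-03-05", "2023-03-22"]
    ),
    "amount": [100, 200, 90, 150, 170],
})

by_customer = orders.groupby("customer_id")["order_date"]

# method="first" breaks ties by row order, like ROW_NUMBER()
orders["row_number"] = by_customer.rank(method="first").astype(int)
# method="min" gives tied rows the same rank and leaves a gap, like RANK()
orders["sql_rank"] = by_customer.rank(method="min").astype(int)
# method="dense" gives tied rows the same rank with no gap, like DENSE_RANK()
orders["dense_rank"] = by_customer.rank(method="dense").astype(int)

# Largest order per customer, similar to keeping rank 1 when ordering by amount
top_order = orders.loc[orders.groupby("customer_id")["amount"].idxmax()]

print(orders)
print(top_order[["customer_id", "amount"]])
```

`rank` returns floats, so the `astype(int)` calls are only there to make the columns easier to read.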

## 6. FastAPI Example (Save as `app.py`)

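```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# Response schema; FastAPI validates and documents responses against it
class Customer(BaseModel):
    id: int
    name: str
    total_spent: float

@app.get("/")
def root():
    return {"message": "Welcome to the Data API"}

# Returns hard-coded sample records for the demo
@app.get("/customers/top", response_model=list[Customer])
def top_customers():
    return [
        {"id": 102, "name": "Bob", "total_spent": 320},
        {"id": 101, "name": "Alice", "total_spent": 300},
        {"id": 103, "name": "Charlie", "total_spent": 130}
    ]

@app.get("/customer/{customer_id}", response_model=Customer)
def get_customer(customer_id: int):
    if customer_id == 101:
        return {"id": 101, "name": "Alice", "total_spent": 300}
    # Respond with HTTP 404 rather than a 200 response carrying an error body
    raise HTTPException(status_code=404, detail="Customer not found")
```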
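
The `/customers/top` handler returns hard-coded records. As a rough sketch of how such records could instead be computed from the sample orders in the pandas section, the helper below (the name `top_customers_payload` is hypothetical) aggregates spend per customer and returns JSON-ready dictionaries. The sample data is repeated so the block runs on its own, and the computed totals differ from the hard-coded placeholders.

```python
import pandas as pd

# Sample data repeated from the pandas section so the block runs on its own
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eric"],
})
orders = pd.DataFrame({
    "customer_id": [101, 102, 101, 103, 102, 104, 104, 101, 103, 104],
    "amount": [100, 150, 200, 130, 170, 120, 180, 90, 200, 220],
})


def top_customers_payload(n: int = 3) -> list[dict]:
    """Return the n highest-spending customers as JSON-ready dicts."""
    totals = (
        orders.groupby("customer_id", as_index=False)["amount"].sum()
        .rename(columns={"amount": "total_spent"})
        .merge(customers, on="customer_id", how="left")
        .nlargest(n, "total_spent")
    )
    # Cast NumPy scalars to plain Python types so the result serializes cleanly
    return [
        {"id": int(r.customer_id), "name": r.name, "total_spent": float(r.total_spent)}
        for r in totals.itertuples(index=False)
    ]


print(top_customers_payload())
```

A handler could then simply `return top_customers_payload()`. In a real service the frames would more likely be loaded from `data/` or a database than defined inline.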


## 7. Running FastAPI Locally

1. Save the FastAPI code to `app.py`
2. Run this in your terminal (prefix the command with `poetry run` if the Poetry environment is not activated):

```bash
uvicorn app:app --reload
```

3. Open your browser at http://127.0.0.1:8000/docs to test the API interactively.


