Day 4 – Python exercises (loops + mini “pipeline”)

Q1. Loop over list with conditional logging
You have a list of file sizes (rows per file):

python
file_rows = [1200, 0, 890, 5, 2300]
file_names = ["orders.csv", "empty.csv", "titanic.csv", "tiny.csv", "payments.csv"]
Task:
Loop over both lists together and print:

"OK" if rows ≥ 100

"WARNING: almost empty" if rows is between 1 and 99

"ERROR: empty file" if rows is 0

Hint:

Use zip(file_names, file_rows) to combine.

Inside loop, use if / elif / else.

In [1]:
file_rows = [1200 , 0 , 890 , 5 ,2300]
file_names = ["orders.csv" , "empty.csv", "titanic.csv", "tiny.csv", "payments.csv"]

In [11]:
for name, rows in zip(file_names, file_rows):
    if rows >= 100:
        print(f"\n File Name : {name} --> OK")
    elif 1 <= rows <= 99:
        print(f"\n File Name : {name} --> WARNING : ALMOST EMPTY")
    else:
        # rows == 0 (a negative count would also land here)
        print(f"\n File Name : {name} --> ERROR : EMPTY FILE")


 File Name : orders.csv --> OK

 File Name : empty.csv --> ERROR : EMPTY FILE

 File Name : titanic.csv --> OK

 File Name : tiny.csv --> WARNING : ALMOST EMPTY

 File Name : payments.csv --> OK
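
The same branching can also be moved into a small function, which makes it easier to check without reading printed output. This is only a sketch: the helper name `classify_rows` is hypothetical, and treating negative counts as invalid is an assumption, since the exercise only covers 0 and above.

```python
def classify_rows(rows: int) -> str:
    """Map a row count to a status label (hypothetical helper)."""
    if rows >= 100:
        return "OK"
    elif rows >= 1:
        return "WARNING: almost empty"
    elif rows == 0:
        return "ERROR: empty file"
    # Assumption: a negative count means the count itself is wrong.
    return "ERROR: invalid row count"


file_rows = [1200, 0, 890, 5, 2300]
file_names = ["orders.csv", "empty.csv", "titanic.csv", "tiny.csv", "payments.csv"]

# zip pairs each file name with its row count, position by position.
statuses = {name: classify_rows(rows) for name, rows in zip(file_names, file_rows)}

for name, status in statuses.items():
    print(f"{name}: {status}")
```

Returning the label instead of printing it keeps the rule in one place, so the loop only has to decide how to display it.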


Q2. Loop over dict config and build a summary list
You have a config like:

python
sources = {
    "orders":  {"type": "db",   "priority": 1},
    "titanic": {"type": "file", "priority": 2},
    "logs":    {"type": "file", "priority": 3},
}
Task:

Loop over sources.items() and create a list of strings like:
"orders (db) -> priority 1"

Store them in a list called summary.

At the end, print summary.

Hint:

Start with summary = [].

In the loop, build one f-string and append it.

In [12]:
sources = {
    "orders":  {"type": "db",   "priority": 1},
    "titanic": {"type": "file", "priority": 2},
    "logs":    {"type": "file", "priority": 3},
}


In [30]:
summary = []
for key,value in sources.items():
    # Single quotes inside the f-string: reusing double quotes is a SyntaxError before Python 3.12.
    line = f"{key} ({value['type']}) -> priority {value['priority']}"
    summary.append(line)
print(summary)


['orders (db) -> priority 1', 'titanic (file) -> priority 2', 'logs (file) -> priority 3']
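
As a possible variation, the same list can be built with a list comprehension. The sketch below also sorts by priority, which the task does not ask for; it is included only to show that the order does not have to depend on how the dict was written.

```python
sources = {
    "orders":  {"type": "db",   "priority": 1},
    "titanic": {"type": "file", "priority": 2},
    "logs":    {"type": "file", "priority": 3},
}

# Sort the (name, config) pairs by priority, then format each pair.
summary = [
    f"{name} ({cfg['type']}) -> priority {cfg['priority']}"
    for name, cfg in sorted(sources.items(), key=lambda item: item[1]["priority"])
]

print(summary)
```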


Q3. Nested loops: table → columns → rule
You already did table → columns. Now extend:

python
table_columns = {
    "passengers": ["Name", "Age", "Fare"],
    "survival":   ["Survived", "Pclass", "Sex"]
}
Task:

Using the Titanic DataFrame df:

For each table:

For each column:

If column is numeric (Age, Fare, etc.), print min and max.

Else (string), print number of distinct values.

Format example:

Table passengers, column Age: min=0.42, max=80.0

Table passengers, column Name: distinct=891

Hints:

You can check numeric vs string using df[col].dtype.

Use df[col].min(), df[col].max(), df[col].nunique().

If any column doesn’t exist in df, you can continue or print “not found”.
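
The hints above can be sketched on a tiny stand-in DataFrame before running them on the real data. The rows below are made up for illustration and are not Titanic values, and `describe_column` is a hypothetical helper name.

```python
import pandas as pd

# Made-up rows standing in for the Titanic DataFrame.
demo_df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Ann"],
    "age": [22.0, 38.0, None, 4.0],
})


def describe_column(frame: pd.DataFrame, table: str, col: str) -> str:
    """Return one report line for a column (hypothetical helper)."""
    # Check membership first: frame[col] raises KeyError if col is missing.
    if col not in frame.columns:
        return f"Table {table}, column {col}: not found"
    if pd.api.types.is_numeric_dtype(frame[col]):
        # min() and max() skip missing values by default.
        return f"Table {table}, column {col}: min={frame[col].min()}, max={frame[col].max()}"
    # nunique() also ignores missing values by default.
    return f"Table {table}, column {col}: distinct={frame[col].nunique()}"


lines = [describe_column(demo_df, "passengers", c) for c in ["name", "age", "cabin"]]
for line in lines:
    print(line)
```

Checking `col in frame.columns` before touching the column is what makes the "not found" case reachable; a trailing `else` after the dtype checks never sees a missing column.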

In [31]:
import pandas as pd
table_columns = {
    "passengers": ["name", "age", "fare"],
    "survival":   ["survived", "pclass", "sex"]
}
df = pd.read_csv("/Users/darvikkunalbanda/DATA_ENGINEERING/cloud_learnings/data/titanic.csv")

In [42]:
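for table, cols in table_columns.items():
    print(f"\n === {table} ===")
    for col in cols:
        # df[col] raises KeyError for a missing column, so check first.
        if col not in df.columns:
            print(f"Table {table}, column {col}: not found")
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            print(f"Table {table}, column {col}: min={df[col].min()}, max={df[col].max()}")
        else:
            # Non-numeric column: show the distinct values, then their count.
            print(df[col].dropna().unique())
            print(f"Table {table}, column {col}: distinct={df[col].nunique()}")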


 === passengers ===
['Braund, Mr. Owen Harris'
 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'
 'Heikkinen, Miss. Laina' 'Futrelle, Mrs. Jacques Heath (Lily May Peel)'
 'Allen, Mr. William Henry' 'McCarthy, Mr. Timothy J'
 'Palsson, Master. Gosta Leonard'
 'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)'
 'Nasser, Mrs. Nicholas (Adele Achem)' 'Sandstrom, Miss. Marguerite Rut'
 'Bonnell, Miss. Elizabeth' 'Saundercock, Mr. William Henry'
 'Andersson, Mr. Anders Johan' 'Vestrom, Miss. Hulda Amanda Adolfina'
 'Hewlett, Mrs. (Mary D Kingcome) ' 'Rice, Master. Eugene'
 'Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)'
 'Fynney, Mr. Joseph J' 'Beesley, Mr. Lawrence'
 'McGowan, Miss. Anna "Annie"' 'Sloper, Mr. William Thompson'
 'Palsson, Miss. Torborg Danira'
 'Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)'
 'Fortune, Mr. Charles Alexander' 'Uruchurtu, Don. Manuel E'
 'Wheadon, Mr. Edward H' 'Meyer, Mr. Edgar Joseph'
 'Holverson, Mr. Alexander Oskar' 'Cann, Mr. 

Q4. Nested loops + building a result dict (harder)
Goal: simulate data-quality metrics per column.

Using the same table_columns:

Task:

Build a dictionary dq_report shaped like:

python
{
  "passengers": {
      "Name": {"nulls": X, "distinct": Y},
      "Age":  {"nulls": A, "distinct": B}
  },
  "survival": { ... }
}
Where:

nulls = number of nulls in that column.

distinct = number of distinct values.

Hints:

Start with dq_report = {}.

Outer loop over tables; for each table, set dq_report[table] = {}.

Inner loop over columns; compute metrics and assign:
dq_report[table][col] = {"nulls": nulls, "distinct": distinct}.

At the end, print dq_report or pretty-print one table’s section.

In [46]:
import pprint

dq_report = {}

for table, cols in table_columns.items():
    dq_report[table] = {}

    for col in cols:
        if col in df.columns:
            nulls = int(df[col].isnull().sum())
            distinct = int(df[col].nunique(dropna=True))
        else:
            nulls = "N/A"
            distinct = "N/A"

        dq_report[table][col] = {"nulls": nulls, "distinct": distinct}

print(dq_report)

pprint.pprint(dq_report["passengers"])



{'passengers': {'name': {'nulls': 0, 'distinct': 714}, 'age': {'nulls': 0, 'distinct': 88}, 'fare': {'nulls': 0, 'distinct': 220}}, 'survival': {'survived': {'nulls': 0, 'distinct': 2}, 'pclass': {'nulls': 0, 'distinct': 3}, 'sex': {'nulls': 0, 'distinct': 2}}}
{'age': {'distinct': 88, 'nulls': 0},
 'fare': {'distinct': 220, 'nulls': 0},
 'name': {'distinct': 714, 'nulls': 0}}
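
A nested dict like this can be hard to scan once it grows. One option, sketched below, is to flatten it into one row per (table, column) pair and print the rows as an aligned table. The dict literal copies the values from the output above; the column widths are arbitrary choices.

```python
# Values copied from the printed dq_report above.
dq_report = {
    "passengers": {
        "name": {"nulls": 0, "distinct": 714},
        "age": {"nulls": 0, "distinct": 88},
        "fare": {"nulls": 0, "distinct": 220},
    },
    "survival": {
        "survived": {"nulls": 0, "distinct": 2},
        "pclass": {"nulls": 0, "distinct": 3},
        "sex": {"nulls": 0, "distinct": 2},
    },
}

# Nested comprehension: outer loop over tables, inner loop over columns.
rows = [
    (table, col, metrics["nulls"], metrics["distinct"])
    for table, cols in dq_report.items()
    for col, metrics in cols.items()
]

# Left-align the text columns, right-align the numeric ones.
print(f"{'table':<12}{'column':<10}{'nulls':>6}{'distinct':>10}")
for table, col, nulls, distinct in rows:
    print(f"{table:<12}{col:<10}{nulls:>6}{distinct:>10}")
```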
