<a href="https://colab.research.google.com/github/bhuguvi26/Copy-of-A-Comprehensive-ETL-Workflow-with-Python-for-Data-Engineers/blob/main/Copy_of_A_Comprehensive_ETL_Workflow_with_Python_for_Data_Engineers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:

from google.colab import files
print("📂 Please upload your CSV, JSON, and XML files...")
uploaded = files.upload()

# ---------- Imports ----------
import json, logging
import xml.etree.ElementTree as ET
from pathlib import Path
from datetime import datetime, timezone
import pandas as pd

# ---------- Constants ----------
INCH_TO_METER = 0.0254
POUND_TO_KG = 0.45359237
OUTPUT_FILE = "transformed_data.csv"
LOG_FILE = "etl_log.txt"
SEARCH_DIRS = [Path("/content"), Path(".")]

# ---------- Logging ----------
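logging.basicConfig(
    filename=LOG_FILE,
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    filemode="a",
    encoding="utf-8",  # log messages contain emoji
    # basicConfig is a no-op when the root logger already has handlers,
    # which is typically the case in notebooks; force=True replaces them.
    force=True,
)
logging.info("✅ Logging initialized")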

# ---------- Helpers ----------
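def get_files(ext):
    """Return each *.ext file in SEARCH_DIRS once, excluding the ETL output."""
    # Resolve paths so a file reachable through two search dirs
    # (e.g. "/content" and "." pointing at the same folder) is not read twice.
    found = {p.resolve() for d in SEARCH_DIRS for p in d.glob(f"*.{ext}")}
    return sorted(p for p in found if p.name != OUTPUT_FILE)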

# ---------- Extract ----------
def extract_csv(file):
    logging.info(f"📂 Reading CSV: {file}")
    return pd.read_csv(file)

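def extract_json(file):
    logging.info(f"📂 Reading JSON: {file}")
    rows = []
    # Input is JSON Lines: one JSON object per line.
    with open(file, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines, which json.loads rejects
                rows.append(json.loads(line))
    return pd.DataFrame(rows)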

def extract_xml(file):
    logging.info(f"📂 Reading XML: {file}")
    tree = ET.parse(file)
    root = tree.getroot()
    rows = []
    for p in root.findall("person"):
        rows.append({
            "name": p.find("name").text,
            "height": float(p.find("height").text),
            "weight": float(p.find("weight").text)
        })
    return pd.DataFrame(rows)

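def extract_all():
    # Pair each extension with its extractor and read every matching file.
    extractors = {"csv": extract_csv, "json": extract_json, "xml": extract_xml}
    dfs = [extract(f) for ext, extract in extractors.items() for f in get_files(ext)]

    # pd.concat raises an opaque ValueError on an empty list; fail clearly instead.
    if not dfs:
        raise FileNotFoundError("No CSV, JSON, or XML input files found")

    df = pd.concat(dfs, ignore_index=True)
    logging.info(f"✅ Extracted {len(df)} rows")
    return df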

# ---------- Transform ----------
def transform(df):
    logging.info("🔧 Transforming...")
    df = df.copy()
    # Convert inches -> meters and pounds -> kilograms, rounded to 3 decimals.
    df["height_m"] = (df["height"] * INCH_TO_METER).round(3)
    df["weight_kg"] = (df["weight"] * POUND_TO_KG).round(3)
    logging.info("✅ Transformation complete")
    return df

# ---------- Load ----------
def load(df):
    df.to_csv(OUTPUT_FILE, index=False)
    logging.info(f"💾 Saved to {OUTPUT_FILE}")

# ---------- Run ETL ----------
start = datetime.now(timezone.utc)
logging.info("🚀 ETL START")

df = extract_all()
df = transform(df)
load(df)

logging.info(f"🎯 ETL Finished in {datetime.now(timezone.utc) - start}")

print("\n✅ ETL Completed Successfully! Preview:")
display(df.head())

print("\n📁 Output file created:", OUTPUT_FILE)
print("📝 Log file:", LOG_FILE)


📂 Please upload your CSV, JSON, and XML files...


Saving source1.csv to source1.csv
Saving source1.json to source1.json
Saving source1.xml to source1.xml
Saving source2.csv to source2.csv
Saving source2.json to source2.json
Saving source2.xml to source2.xml
Saving source3.csv to source3.csv
Saving source3.json to source3.json
Saving source3.xml to source3.xml

✅ ETL Completed Successfully! Preview:


Unnamed: 0,name,height,weight,height_m,weight_kg
0,alex,65.78,112.99,1.671,51.251
1,ajay,71.52,136.49,1.817,61.911
2,alice,69.4,153.03,1.763,69.413
3,ravi,68.22,142.34,1.733,64.564
4,joe,67.79,144.3,1.722,65.453



📁 Output file created: transformed_data.csv
📝 Log file: etl_log.txt


# ReadMe
📊 ETL Pipeline in Python: CSV, JSON, XML | Google Colab

This project demonstrates a complete Extract, Transform, Load (ETL) workflow using Python in Google Colab.
The pipeline extracts data from CSV, JSON, and XML files, converts height and weight from imperial to metric units, and loads the combined data into a CSV file for analytics or database storage.

## 🚀 Project Overview
### ✅ Objective

Build a production-style ETL pipeline that:

- Extracts data from multiple formats (CSV, JSON, XML)
- Transforms:
  - Height → meters
  - Weight → kilograms
- Logs all ETL steps
- Saves the final dataset to transformed_data.csv

### ✅ Skills Used

- Python
- Pandas
- File handling (CSV, JSON, XML)
- Data transformation
- Logging for ETL tracking

## 📁 Input Data Formats
### CSV Example
name,height,weight
alex,65.78,112.99
ajay,71.52,136.49
alice,69.4,153.03

### JSON Example
{"name":"jack","height":68.70,"weight":123.30}
{"name":"tom","height":69.80,"weight":141.49}
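
The JSON input above is in JSON Lines form: one object per line rather than a single array. A minimal sketch of parsing that layout with only the standard library, using the two sample records; `parse_json_lines` is a hypothetical helper name used for illustration, not a function from the project script:

```python
import io
import json

def parse_json_lines(stream):
    """Parse one JSON object per non-blank line (hypothetical helper)."""
    return [json.loads(line) for line in stream if line.strip()]

# The two sample records from the README, held in an in-memory file.
sample = io.StringIO(
    '{"name":"jack","height":68.70,"weight":123.30}\n'
    '{"name":"tom","height":69.80,"weight":141.49}\n'
)
records = parse_json_lines(sample)
print(records[0]["name"], records[1]["weight"])
```

Each resulting dict becomes one row once the list is passed to `pd.DataFrame`.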

### XML Example
<data>
   <person>
      <name>simon</name>
      <height>67.90</height>
      <weight>112.37</weight>
   </person>
</data>
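
A rough sketch of how a document shaped like the sample above can be turned into row dicts with `xml.etree.ElementTree`. The helper name `parse_people` is an assumption for illustration, and the code assumes every `<person>` has all three child elements:

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<data>
   <person>
      <name>simon</name>
      <height>67.90</height>
      <weight>112.37</weight>
   </person>
</data>
"""

def parse_people(xml_text):
    """Return one dict per <person> element (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for person in root.findall("person"):
        rows.append({
            "name": person.findtext("name"),
            # Element text is always a string, so numbers are converted here.
            "height": float(person.findtext("height")),
            "weight": float(person.findtext("weight")),
        })
    return rows

people = parse_people(SAMPLE_XML)
print(people)
```

Converting to `float` at parse time keeps the XML rows numerically compatible with the CSV and JSON rows when they are concatenated.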

## 📦 Output
transformed_data.csv preview:

| name | height | weight | height_m | weight_kg |
|------|--------|--------|----------|-----------|
| alex | 65.78 | 112.99 | 1.671 | 51.251 |
| ajay | 71.52 | 136.49 | 1.817 | 61.911 |
| alice | 69.40 | 153.03 | 1.763 | 69.413 |

### Generated Log File

etl_log.txt contains timestamped logs for each ETL phase.
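
As an illustration of what "timestamped logs for each ETL phase" can look like in code, here is a small sketch that writes one entry per phase using the same format string as the script. The logger name `etl_demo`, the helper `make_etl_logger`, and the message wording are assumptions; the sketch writes to a temporary directory rather than the real etl_log.txt:

```python
import logging
import tempfile
from pathlib import Path

def make_etl_logger(log_path):
    """Build a file logger with timestamp and level (hypothetical helper)."""
    logger = logging.getLogger("etl_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep demo messages out of the root logger
    handler = logging.FileHandler(log_path, mode="a", encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s [%(levelname)s] %(message)s")
    )
    logger.addHandler(handler)
    return logger, handler

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "etl_log.txt"
    logger, handler = make_etl_logger(log_path)
    for phase in ("Extract", "Transform", "Load"):
        logger.info("%s phase finished", phase)
    # Close the handler so the file is flushed before it is read back.
    handler.close()
    logger.removeHandler(handler)
    log_lines = log_path.read_text(encoding="utf-8").splitlines()

print(len(log_lines))
```

A dedicated named logger avoids interfering with handlers that a notebook environment may already have attached to the root logger.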

## ⚙️ ETL Workflow
### 1️⃣ Extract

Reads all uploaded .csv, .json, and .xml files and combines them into a single DataFrame.
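
One thing to watch in the extract step: the script searches both `/content` and the current directory, which are typically the same folder in Colab, so a naive glob can pick up each file twice. A minimal sketch of de-duplicated discovery, assuming a hypothetical helper `discover` and a throwaway temporary directory in place of the real upload folder:

```python
import tempfile
from pathlib import Path

def discover(ext, search_dirs, exclude=()):
    """List each *.ext file once, even if search dirs overlap (hypothetical helper)."""
    # resolve() maps different spellings of the same file to one absolute path.
    found = {p.resolve() for d in search_dirs for p in Path(d).glob(f"*.{ext}")}
    return sorted(p for p in found if p.name not in exclude)

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "source1.csv").write_text("name,height,weight\n")
    (base / "transformed_data.csv").write_text("")
    (base / "sub").mkdir()
    # The same directory listed twice, the second time via a redundant ".." hop.
    dirs = [base, base / "sub" / ".."]
    csv_names = [
        p.name for p in discover("csv", dirs, exclude={"transformed_data.csv"})
    ]

print(csv_names)
```

Excluding the output file by name keeps a previous run's transformed_data.csv from being fed back in as input.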

### 2️⃣ Transform

Height (inches → meters), rounded to 3 decimal places:
height_m = height * 0.0254

Weight (lbs → kg), rounded to 3 decimal places:
weight_kg = weight * 0.45359237
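
The two formulas can be checked by hand on the first sample row (alex: 65.78 in, 112.99 lb). A small sketch in plain Python; `to_metric` is a hypothetical helper name, while the constants are the ones the script defines:

```python
INCH_TO_METER = 0.0254
POUND_TO_KG = 0.45359237

def to_metric(height_in, weight_lb):
    """Convert inches and pounds to meters and kilograms, rounded to 3 decimals."""
    return round(height_in * INCH_TO_METER, 3), round(weight_lb * POUND_TO_KG, 3)

height_m, weight_kg = to_metric(65.78, 112.99)
print(height_m, weight_kg)  # 1.671 51.251
```

These values match the `height_m` and `weight_kg` columns shown for alex in the output preview.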

### 3️⃣ Load

Saves the result to:

transformed_data.csv

## 📎 Running the Project in Google Colab
### Step 1: Upload Files
from google.colab import files
uploaded = files.upload()

### Step 2: Run the Complete ETL Script

Run the single-cell ETL code provided in this project.

## 🧠 Key Learnings

- ETL automation in Python
- Parsing structured data files
- Real-world logging practices
- Data cleaning & unit conversion

## ✅ Project Status

- ✔ Completed
- ✔ Tested with real data
- ✔ Production-style logging & modularity