# ETL Pipeline for Job Data Consolidation

This notebook implements an interactive ETL pipeline. Upload one or more CSV files of job data, and the notebook extracts a numeric `job_id` from each file's URL column, merges the files on that ID, and reports how complete each column is.

## 1. Upload CSV Files

Use the button below to select and upload one or more CSV files from your computer. After uploading, proceed to the next step to process the files.

In [None]:
import ipywidgets as widgets
from IPython.display import display

uploader = widgets.FileUpload(
    accept='.csv',
    multiple=True,
    description='Upload CSVs'
)

display(uploader)

## 2. Process and Load Uploaded Data

This step reads the content of the uploaded files into pandas DataFrames. It will display the first few rows of each loaded DataFrame.

In [None]:
import pandas as pd
import io

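dfs = []  # defined up front so later cells do not hit a NameError when nothing is uploaded

if not uploader.value:
    print("Please upload at least one CSV file in the step above.")
else:
    # Assumes ipywidgets 8+, where uploader.value is a tuple of dicts;
    # in ipywidgets 7 it is a dict keyed by filename instead.
    for file_info in uploader.value:
        content = file_info['content']  # raw file bytes
        df = pd.read_csv(io.BytesIO(content))
        dfs.append(df)
        print(f"Loaded {file_info['name']}:")
        display(df.head())
        print("\n")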
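
The shape of `uploader.value` depends on the installed ipywidgets version, which is easy to trip over. The sketch below is one way to normalize both shapes before parsing. `load_uploads` is a hypothetical helper, not part of the cell above, and the two layouts described in its docstring are assumptions to verify against your installed version.

```python
import io

import pandas as pd


def load_uploads(value):
    """Return a list of (name, DataFrame) pairs from a FileUpload value.

    Hypothetical helper. Assumes ipywidgets 8+ yields a tuple of dicts with
    'name' and 'content' keys, and ipywidgets 7 yields a dict that maps each
    filename to a dict with a 'content' key.
    """
    if isinstance(value, dict):
        # ipywidgets 7 style: {filename: {'content': bytes, ...}}
        items = [(name, info['content']) for name, info in value.items()]
    else:
        # ipywidgets 8 style: ({'name': ..., 'content': memoryview, ...}, ...)
        items = [(info['name'], info['content']) for info in value]

    # bytes() accepts both bytes and memoryview, so one code path covers both
    return [(name, pd.read_csv(io.BytesIO(bytes(content)))) for name, content in items]


# Simulated upload in the ipywidgets 8 shape, so the sketch runs without a widget
fake_value = (
    {'name': 'jobs.csv',
     'content': memoryview(b"title,url\nEngineer,https://example.com/jobs/101\n")},
)
loaded = load_uploads(fake_value)
print(loaded[0][0], loaded[0][1].shape)
```

Passing `uploader.value` instead of `fake_value` would give the same list of named DataFrames for real uploads.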

## 3. Extract Job ID

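Next, we extract a numeric `job_id` from a URL column in each DataFrame. The code picks the first column whose name contains 'url' or 'link' (case-insensitive) and takes the first run of digits in each URL as the ID. Rows whose URL contains no digits get a missing `job_id` and trigger a warning. The URL column is dropped once the ID has been extracted.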

In [None]:
import re

def find_url_column(df):
    for col in df.columns:
        if 'url' in col.lower() or 'link' in col.lower():
            return col
    return None

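def extract_job_id(df, url_column):
    if url_column is None:
        print("Warning: Could not find a URL column. Skipping job ID extraction.")
        # Use NaN rather than None so job_id stays numeric, like the other DataFrames at merge time
        df['job_id'] = float('nan')
        return df
    
    # Take the first run of digits in each URL; URLs without digits yield NaN
    df['job_id'] = df[url_column].str.extract(r'(\d+)', expand=False)
    df['job_id'] = pd.to_numeric(df['job_id'], errors='coerce')
    return df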

processed_dfs = []
for i, df in enumerate(dfs):
    url_col = find_url_column(df)
    df = extract_job_id(df.copy(), url_col)
    
    # Warn (rather than fail) when some rows have no job_id
    missing_ids = df['job_id'].isnull().sum()
    if missing_ids > 0:
        print(f"Warning: Found {missing_ids} rows in DataFrame {i} with a missing job_id after extraction.")
    
    # Drop the original URL column if it was found
    if url_col:
        df = df.drop(columns=[url_col])
        
    processed_dfs.append(df)
    print(f"Processed DataFrame {i} with job_id:")
    display(df)
    print("\n")
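
Because the pattern `(\d+)` captures the first run of digits anywhere in the URL, a digit in the host name or in an earlier path segment wins over the actual ID. The sketch below shows this on made-up URLs and compares it with a stricter pattern. The `/jobs/<id>` layout is an assumption for illustration, so adapt the pattern to the URLs in your own files.

```python
import pandas as pd

# Made-up URLs; the /jobs/<id> layout is an assumption for illustration
urls = pd.Series([
    "https://example.com/jobs/12345",
    "https://jobs2.example.com/jobs/678",   # digit in the host name
    "https://example.com/about",            # no ID at all
])

# Pattern used in the notebook: first run of digits anywhere in the string
first_digits = pd.to_numeric(urls.str.extract(r'(\d+)', expand=False), errors='coerce')

# Stricter alternative: only digits that directly follow "/jobs/"
after_jobs = pd.to_numeric(urls.str.extract(r'/jobs/(\d+)', expand=False), errors='coerce')

print(pd.DataFrame({'url': urls, 'first_digits': first_digits, 'after_jobs': after_jobs}))
```

For the second URL the loose pattern returns 2 instead of 678, which would silently mismatch rows at merge time.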

## 4. Merge Datasets

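All datasets are now merged into a single DataFrame with an outer join on `job_id`, so a job that appears in only some of the files is kept and its missing columns are filled with `NaN`.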

In [None]:
from functools import reduce

if not processed_dfs:
    print("No data to merge.")
else:
    # Merge all dataframes in the list
    merged_df = reduce(lambda left, right: pd.merge(left, right, on='job_id', how='outer'), processed_dfs)
    
    # Sanity check: an outer join keeps every input row, so the result cannot be shorter than the largest input
    assert len(merged_df) >= max(len(df) for df in processed_dfs), "Merge failed unexpectedly."

    print("Merged Dataset:")
    display(merged_df)
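
To see which rows actually matched, `pd.merge` accepts `indicator=True`, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. A minimal sketch on two made-up DataFrames (the names `titles`, `salaries`, and their columns are assumptions for illustration):

```python
import pandas as pd

# Two made-up inputs that share only job_id 2
titles = pd.DataFrame({'job_id': [1, 2], 'title': ['Engineer', 'Analyst']})
salaries = pd.DataFrame({'job_id': [2, 3], 'salary': [50000, 60000]})

# Same outer join as in the notebook, plus a _merge column recording each row's origin
checked = pd.merge(titles, salaries, on='job_id', how='outer', indicator=True)
print(checked)

# Count matched and unmatched rows
print(checked['_merge'].value_counts())
```

With more than two files, apply this to one pair at a time, since each merge would try to add its own `_merge` column.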

## 5. Data Completeness Testing

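Finally, we test the merged data for completeness. Empty and whitespace-only strings are treated as missing. The report lists, for each column, the number of missing values and the percentage of populated rows, and prints a warning for any column with more than 5% of its values missing.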

In [None]:
import numpy as np

def check_completeness(df):
    # Treat empty or whitespace-only strings as missing values
    df = df.replace(r'^\s*$', np.nan, regex=True)
    
    missing_values = df.isnull().sum()
    total_rows = len(df)
    completeness_percentage = ((total_rows - missing_values) / total_rows) * 100
    
    completeness_summary = pd.DataFrame({
        'missing_values': missing_values,
        'completeness_percentage': completeness_percentage
    })
    
    # Warning for columns with >5% missing values
    for index, row in completeness_summary.iterrows():
        if (100 - row['completeness_percentage']) > 5:
            print(f"Warning: Column '{index}' has more than 5% missing values ({100 - row['completeness_percentage']:.2f}%).")
            
    return completeness_summary

if 'merged_df' in locals():
    completeness_report = check_completeness(merged_df)
    print("Data Completeness Report:")
    display(completeness_report)
    print("\n--- Markdown Summary ---")
    # to_markdown() requires the optional `tabulate` package
    print(completeness_report.to_markdown())
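
As a quick check of the arithmetic, the same percentages can be computed in one expression, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a made-up DataFrame (`sample` and its columns are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data: 'title' has one blank string; 'salary' has one NaN and one whitespace-only value
sample = pd.DataFrame({
    'job_id': [1, 2, 3, 4],
    'title': ['Engineer', '', 'Analyst', 'Manager'],
    'salary': ['50000', np.nan, '   ', '70000'],
})

# Treat empty or whitespace-only strings as missing, as check_completeness does
cleaned = sample.replace(r'^\s*$', np.nan, regex=True)

# notna() gives a boolean mask; its column mean is the fraction of populated rows
completeness = cleaned.notna().mean() * 100
print(completeness)
```

Here `salary` comes out half populated, so `check_completeness` would flag it, along with `title`, for exceeding the 5% threshold.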