# RoofLink Data ETL Pipeline (V2)

Welcome! This notebook is a tool to help you clean and process raw data exported from RoofLink. It will guide you through uploading your CSV files, cleaning them, merging them into a single dataset, and exporting the result.

**V2 Enhancements:** This version includes smarter data handling, a data quality flagging system (`is_complete`), new calculated fields for analytics, and a comprehensive data quality report.

**Instructions:**
1. Click on a code cell (the boxes with `[ ]:` next to them).
2. Press **Shift + Enter** to run the cell.
3. Run the cells in order from top to bottom.

### Step 1: Check Required Packages

In [None]:
import sys, importlib.util, importlib.metadata
from IPython.display import display, HTML

print("Performing package validation...")
# Versions are listed for reference; this check only confirms that each package is installed.
required = {'pandas': '1.3.5', 'numpy': '1.21.5', 'ipywidgets': '7.6.5'}
all_ok = True
for pkg, ver in required.items():
    if importlib.util.find_spec(pkg) is not None:
        display(HTML(f"✅ <b>{pkg}</b>: Installed (Version: {importlib.metadata.version(pkg)})"))
    else:
        all_ok = False
        display(HTML(f'❌ <b>{pkg}</b>: Not found. Please install.'))
if not all_ok:
    display(HTML('<b style="color:red;">❌ Missing packages. Please install them.</b>'))
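The check above confirms that each package is installed but does not compare the installed version with the one listed. A minimal sketch of such a comparison follows, assuming the listed versions are meant as minimums (the notebook does not say so). The helpers `parse_version` and `check_minimum` are hypothetical names introduced here for illustration.

```python
import importlib.metadata


def parse_version(text):
    """Turn '1.21.5' into (1, 21, 5); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_minimum(pkg, minimum):
    """Return (ok, installed_version); installed_version is None if pkg is missing."""
    try:
        installed = importlib.metadata.version(pkg)
    except importlib.metadata.PackageNotFoundError:
        return False, None
    return parse_version(installed) >= parse_version(minimum), installed


# Assumption: the versions from Step 1 are treated as minimum versions.
required = {'pandas': '1.3.5', 'numpy': '1.21.5', 'ipywidgets': '7.6.5'}
for pkg, minimum in required.items():
    ok, installed = check_minimum(pkg, minimum)
    status = "OK" if ok else "missing or older than listed"
    print(f"{pkg}: installed={installed}, listed={minimum} -> {status}")
```

Comparing integer tuples avoids the string-comparison pitfall where `'1.3.5'` would sort after `'1.21.5'`. Pre-release suffixes such as `rc1` are simply truncated in this sketch.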

### Step 2: Set Up the Pipeline Engine

In [None]:
print("Importing ETL logic and setting up environment...")
import sys
if '.' not in sys.path:
    sys.path.append('.')

try:
    from etl_pipeline_logic import assess_raw_data, clean_csv_data, generate_data_quality_report
    import ipywidgets as widgets
    import pandas as pd
    import numpy as np
    import io
    from functools import reduce
    print("✅ V2 ETL engine imported successfully.")
    pipeline_ready = True
except ImportError as e:
    print(f"❌ Critical import error: {e}. Ensure 'etl_pipeline_logic.py' is in the same directory.")
    pipeline_ready = False

### Step 3: Upload Your CSV Files

In [None]:
if pipeline_ready:
    uploader = widgets.FileUpload(accept='.csv', multiple=True, description='Upload CSVs')
    display(uploader)
else:
    print("Cannot proceed. Pipeline setup failed in the previous step.")

### Step 4: Assess Raw Data

In [None]:
if 'uploader' in locals() and uploader.value:
    assessments = {}
    print("Assessing uploaded files...")
    # Assumes uploader.value yields dicts with 'name' and 'content' keys
    # (ipywidgets 8.x layout; 7.x returns a dict keyed by file name instead).
    for file_upload in uploader.value:
        file_name = file_upload['name']
        file_content = file_upload['content']
        assessments[file_name] = assess_raw_data(file_content, file_name)
    print("\n✅ Assessment phase complete.")
else:
    print("Please upload files in the previous step before assessing.")
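The loop above reads `file_upload['name']` from each element of `uploader.value`, which matches the ipywidgets 8.x layout (a tuple of dicts). Under ipywidgets 7.x, the version listed in Step 1, `uploader.value` is a dict keyed by file name, so the same loop would iterate over plain strings and fail. A minimal sketch of a helper that accepts both layouts follows. `normalize_uploads` is a hypothetical name, and whether `assess_raw_data` and `clean_csv_data` accept plain `bytes` content is an assumption.

```python
def normalize_uploads(value):
    """Return a list of {'name': str, 'content': bytes} dicts from FileUpload.value."""
    if isinstance(value, dict):
        # ipywidgets 7.x: {file_name: {'metadata': {...}, 'content': bytes}}
        items = [(name, info["content"]) for name, info in value.items()]
    else:
        # ipywidgets 8.x: tuple of dicts with 'name' and 'content' (a memoryview)
        items = [(entry["name"], entry["content"]) for entry in value]
    # bytes() copies a memoryview into bytes and leaves bytes unchanged in value.
    return [{"name": name, "content": bytes(content)} for name, content in items]


# Stand-ins for the two layouts, so the sketch runs without the widget itself.
v7_style = {"jobs.csv": {"metadata": {"name": "jobs.csv"}, "content": b"job_id\n1\n"}}
v8_style = ({"name": "jobs.csv", "content": memoryview(b"job_id\n1\n")},)

for upload in normalize_uploads(v8_style):
    print(upload["name"], len(upload["content"]), "bytes")
```

With this helper, the loops in Steps 4 and 5 could iterate over `normalize_uploads(uploader.value)` and keep the rest of their bodies unchanged.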

### Step 5: Clean and Merge Data

In [None]:
if 'uploader' in locals() and uploader.value:
    cleaned_dfs = []
    print("Starting data cleaning and merging process...")
    for file_upload in uploader.value:
        file_name = file_upload['name']
        file_content = file_upload['content']
        assessment = assessments.get(file_name)
        cleaned_df = clean_csv_data(file_content, file_name, assessment)
        # Keep only dataframes that have at least one non-null job_id to merge on
        if cleaned_df is not None and 'job_id' in cleaned_df.columns and cleaned_df['job_id'].notna().any():
            cleaned_dfs.append(cleaned_df)
        else:
            print(f"⚠️ Warning: Could not clean or find job_ids in '{file_name}'. It will be excluded from the merge.")

    final_cleaned_df = None
    if not cleaned_dfs:
        print("❌ No valid dataframes with job_ids were produced. Merge step skipped.")
    elif len(cleaned_dfs) == 1:
        final_cleaned_df = cleaned_dfs[0]
        print("\n✅ Only one valid dataframe. No merge needed.")
    else:
        print(f"\nAttempting to merge {len(cleaned_dfs)} cleaned dataframes...")
        try:
            indexed_dfs = [df.set_index('job_id') for df in cleaned_dfs]
            final_merged = indexed_dfs[0]
            # combine_first keeps values from earlier files and fills their gaps from later ones
            for other in indexed_dfs[1:]:
                final_merged = final_merged.combine_first(other)
            final_cleaned_df = final_merged.reset_index()
            print(f"✅ Successfully merged dataframes into a final dataset with shape {final_cleaned_df.shape}.")
        except Exception as e:
            print(f"❌ Error during merging: {e}")
else:
    print("Please upload files first.")
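The merge above relies on `DataFrame.combine_first`, which aligns two frames on their index, keeps the caller's non-null values, and fills the remaining gaps from the other frame. A small illustration with made-up data follows. The `customer` and `total` columns are hypothetical and are not taken from a real RoofLink export.

```python
import pandas as pd

# Two toy "cleaned" frames sharing the job_id key (illustrative data only).
jobs = pd.DataFrame(
    {"job_id": [1, 2], "customer": ["Ann", None]}
).set_index("job_id")
invoices = pd.DataFrame(
    {"job_id": [2, 3], "customer": ["Bob", "Cy"], "total": [100.0, 200.0]}
).set_index("job_id")

# Values from `jobs` win; its missing customer for job 2 is filled from `invoices`,
# and job 3 plus the `total` column are added from `invoices`.
merged = jobs.combine_first(invoices).reset_index()
print(merged)
```

Because the first frame takes priority wherever it has a value, the order in which files are uploaded can affect the result when two files disagree on the same `job_id`.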

### Step 6: Display Final Data Preview

In [None]:
if 'final_cleaned_df' in locals() and final_cleaned_df is not None:
    print("--- PREVIEW OF FINAL MERGED DATA ---")
    # Displaying a sample of both complete and incomplete rows for comparison
    print("\nComplete Rows (is_complete = True):")
    display(final_cleaned_df[final_cleaned_df['is_complete'] == True].head())
    print("\nIncomplete Rows (is_complete = False):")
    display(final_cleaned_df[final_cleaned_df['is_complete'] == False].head())
else:
    print("No cleaned data available to display.")

### Step 7: Generate Data Quality Report

In [None]:
if 'final_cleaned_df' in locals() and final_cleaned_df is not None:
    print("--- DATA QUALITY REPORT ---")
    quality_report_df = generate_data_quality_report(final_cleaned_df)

    # Save the full report to a CSV file
    report_filename = 'data_quality_report.csv'
    quality_report_df.to_csv(report_filename, index=False)
    print(f"✅ Full data quality report saved to '{report_filename}'.")

    # Display the report in the notebook
    display(quality_report_df)
else:
    print("No cleaned data available to report on.")

### Step 8: Export Cleaned Data to CSV

In [None]:
if 'final_cleaned_df' in locals() and final_cleaned_df is not None:
    output_filename = 'cleaned_rooflink_data.csv'
    try:
        final_cleaned_df.to_csv(output_filename, index=False)
        print(f"✅ Success! Your cleaned data has been saved as '{output_filename}'.")
    except Exception as e:
        print(f"❌ Error saving file: {e}")
else:
    print("No cleaned data available to export.")