# RoofLink Data ETL Pipeline

Welcome! This notebook is a tool to help you clean and process raw data exported from RoofLink. It will guide you through uploading your CSV files, cleaning them, merging them into a single dataset, and exporting the result.

**Instructions:**
1. Click on a code cell (the boxes with `[ ]:` next to them).
2. Press **Shift + Enter** to run the cell.
3. Run the cells in order from top to bottom.

### Step 1: Check Required Packages
This first step checks to make sure all the necessary software packages are installed. If you see a green checkmark for each package, you're good to go!

In [None]:
import sys, importlib.util, importlib.metadata
from IPython.display import display, HTML

print("Performing package validation...")
# Versions are listed for reference only; the loop below checks presence, not version.
required = {'pandas': '1.3.5', 'numpy': '1.21.5', 'ipywidgets': '7.6.5'}
all_ok = True
for pkg, ver in required.items():
    if importlib.util.find_spec(pkg) is not None:
        display(HTML(f"✅ <b>{pkg}</b>: Installed (Version: {importlib.metadata.version(pkg)})"))
    else:
        all_ok = False
        display(HTML(f'❌ <b>{pkg}</b>: Not found. Please install.'))
if not all_ok:
    display(HTML('<b style="color:red;">❌ Missing packages. Please install them.</b>'))
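
The cell above confirms that each package is present, but it does not compare the installed version with the numbers listed in `required`. Here is a minimal sketch of such a comparison. It assumes the listed versions are meant as minimums (the notebook does not say so) and that versions are simple dotted numbers. The helper names `version_tuple`, `meets_minimum`, and `check_package` are hypothetical and not part of the notebook.

```python
import importlib.metadata
import re


def version_tuple(version):
    """Turn '1.21.5' into (1, 21, 5), using the leading digits of each part.

    Parsing stops at the first part that has no leading digits.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.3' and '1.3.0' compare as equal.
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


def check_package(pkg, minimum):
    """Return (found, ok) for a package name (hypothetical helper)."""
    try:
        installed = importlib.metadata.version(pkg)
    except importlib.metadata.PackageNotFoundError:
        return False, False
    return True, meets_minimum(installed, minimum)
```

Comparing tuples of integers rather than raw strings matters here: as strings, `'1.21.5'` sorts before `'1.3.5'`, which would wrongly flag a newer numpy as too old.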

### Step 2: Set Up the Pipeline Engine
This cell imports the main data cleaning logic from the `etl_pipeline_logic.py` script. You should see a success message below.

In [None]:
print("Importing ETL logic and setting up environment...")
import sys
# Ensure the current directory is in the system path so the notebook can find the logic script.
if '.' not in sys.path:
    sys.path.append('.')

try:
    # Now, we can import our custom logic.
    from etl_pipeline_logic import assess_raw_data, clean_csv_data, generate_cleaning_report
    import ipywidgets as widgets
    import pandas as pd
    import io
    from functools import reduce
    print("✅ ETL engine imported successfully.")
    pipeline_ready = True
except ImportError as e:
    print(f"❌ Critical import error: {e}. Ensure 'etl_pipeline_logic.py' is in the same directory.")
    pipeline_ready = False

### Step 3: Upload Your CSV Files
Click the 'Upload CSVs' button below to select one or more of your data files. You can select multiple files at once.

In [None]:
if pipeline_ready:
    uploader = widgets.FileUpload(accept='.csv', multiple=True, description='Upload CSVs')
    display(uploader)
else:
    print("Cannot proceed. Pipeline setup failed in the previous step.")

### Step 4: Assess Raw Data
This step performs a quick check on your uploaded files to understand their structure before we start cleaning. The results are used in the next step.

In [None]:
if 'uploader' in locals() and uploader.value:
    assessments = {}
    print("Assessing uploaded files...")
    # Note: reading each upload via ['name'] and ['content'] matches the
    # ipywidgets 8.x value format (a sequence of dicts). In 7.x the value is
    # a dict keyed by file name, so this loop would need adjusting there.
    for file_upload in uploader.value:
        file_name = file_upload['name']
        file_content = file_upload['content']
        assessments[file_name] = assess_raw_data(file_content, file_name)
    print("\n✅ Assessment phase complete. You can see detailed logs in 'etl_cleaning_log.txt'.")
else:
    print("Please upload files in the previous step before assessing.")
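
The loop above reads each upload through `['name']` and `['content']`, which fits the ipywidgets 8.x value format, while Step 1 lists ipywidgets 7.6.5. Here is a minimal sketch of one way to read both formats. It assumes 7.x returns a dict keyed by file name and 8.x returns a tuple of dicts whose content may be a `memoryview`. The helper name `iter_uploads` is hypothetical. Whether `assess_raw_data` expects plain `bytes` is not stated in this notebook, so treat the conversion as an assumption too.

```python
def iter_uploads(value):
    """Yield (file_name, content_bytes) pairs from a FileUpload-style value.

    Assumed shapes:
      - ipywidgets 7.x: {file_name: {'metadata': {...}, 'content': bytes}}
      - ipywidgets 8.x: ({'name': ..., 'content': memoryview, ...}, ...)
    """
    if isinstance(value, dict):
        pairs = ((name, entry["content"]) for name, entry in value.items())
    else:
        pairs = ((entry["name"], entry["content"]) for entry in value)
    for name, content in pairs:
        # bytes() accepts both bytes and memoryview, so callers see one type.
        yield name, bytes(content)


# Example with made-up data standing in for uploader.value:
old_style = {"jobs.csv": {"metadata": {"name": "jobs.csv"}, "content": b"job_id,status"}}
new_style = ({"name": "jobs.csv", "content": memoryview(b"job_id,status")},)

old_result = list(iter_uploads(old_style))
new_result = list(iter_uploads(new_style))
```

With a helper like this, the loops in Steps 4 and 5 could iterate over `iter_uploads(uploader.value)` instead of indexing each entry directly.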

### Step 5: Clean and Merge Data
This is the main event! The pipeline will now clean each of your files, then merge them together into a single dataset using the `job_id`.

**Pay close attention to the output of this cell.** It will print detailed logs about the cleaning process, including warnings about missing data or other issues it finds.

In [None]:
if 'uploader' in locals() and uploader.value:
    cleaned_dfs = []
    
    print("Starting data cleaning and merging process...")
    
    for file_upload in uploader.value:
        file_name = file_upload['name']
        file_content = file_upload['content']
        
        assessment = assessments.get(file_name)
        cleaned_df = clean_csv_data(file_content, file_name, assessment)
        
        if cleaned_df is not None and 'job_id' in cleaned_df.columns and cleaned_df['job_id'].notna().any():
            cleaned_dfs.append(cleaned_df)
        else:
            print(f"⚠️ Warning: Could not clean or find job_ids in '{file_name}'. It will be excluded from the merge.")

    final_cleaned_df = None
    if not cleaned_dfs:
        print("❌ No valid dataframes with job_ids were produced. Merge step skipped.")
    elif len(cleaned_dfs) == 1:
        final_cleaned_df = cleaned_dfs[0]
        print("\n✅ Only one valid dataframe. No merge needed.")
    else:
        print(f"\nAttempting to merge {len(cleaned_dfs)} cleaned dataframes...")
        try:
            # Set job_id as the index for all dataframes before combining.
            indexed_dfs = [df.set_index('job_id') for df in cleaned_dfs if 'job_id' in df.columns]
            
            # Use combine_first to intelligently merge, filling NaNs from one dataframe with data from the next.
            final_merged = indexed_dfs[0]
            for i in range(1, len(indexed_dfs)):
                final_merged = final_merged.combine_first(indexed_dfs[i])
            
            final_cleaned_df = final_merged.reset_index()
            print(f"✅ Successfully merged dataframes into a final dataset with shape {final_cleaned_df.shape}.")
        except Exception as e:
            print(f"❌ Error during merging: {e}")
else:
    print("Please upload files first.")
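
To see what the `combine_first` merge in the cell above does, here is a small sketch with two made-up tables (the column names other than `job_id` are invented for illustration). It folds the list with `functools.reduce`, which gives the same result as the explicit loop in the cell.

```python
from functools import reduce

import pandas as pd

# Two tiny, made-up tables that share the job_id key.
jobs = pd.DataFrame(
    {"job_id": [1, 2], "customer": ["Ann", None], "amount": [50.0, 75.0]}
).set_index("job_id")
invoices = pd.DataFrame(
    {"job_id": [2, 3], "customer": ["Bob", "Cy"], "amount": [100.0, 250.0]}
).set_index("job_id")

# Earlier frames win; later frames only fill cells that are still missing
# and contribute rows whose job_id has not been seen yet.
merged = reduce(lambda left, right: left.combine_first(right), [jobs, invoices])
merged = merged.reset_index()

print(merged)
```

For job 2, the missing customer is filled from the second table, but the amount from the first table is kept. In other words, the order of `cleaned_dfs` decides which value survives when two files disagree on the same cell.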

### Step 6: Display Final Data Preview
Let's take a look at the first few rows of the final, cleaned, and merged dataset.

In [None]:
if 'final_cleaned_df' in locals() and final_cleaned_df is not None:
    print("--- PREVIEW OF FINAL MERGED DATA ---")
    display(final_cleaned_df.head())
else:
    print("No cleaned data available to display.")

### Step 7: Export Cleaned Data to CSV
This final step will save your cleaned and merged data into a new file named `cleaned_rooflink_data.csv` in the same directory as this notebook.

In [None]:
if 'final_cleaned_df' in locals() and final_cleaned_df is not None:
    output_filename = 'cleaned_rooflink_data.csv'
    try:
        final_cleaned_df.to_csv(output_filename, index=False)
        print(f"✅ Success! Your cleaned data has been saved as '{output_filename}'.")
    except Exception as e:
        print(f"❌ Error saving file: {e}")
else:
    print("No cleaned data available to export.")
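
If you want to confirm that an export like the one above can be read back cleanly, here is a minimal sketch of a round-trip check. It uses a made-up sample table and a temporary folder rather than your real output file.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for final_cleaned_df.
sample = pd.DataFrame({"job_id": [101, 102], "status": ["open", "closed"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_rooflink_data.csv")
    # index=False keeps the row index out of the file, as in the export cell.
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(sample.columns)
same_rows = len(reloaded) == len(sample)
print(f"Columns match: {same_columns}, row count matches: {same_rows}")
```

Without `index=False`, the saved file would gain an extra unnamed column holding the row numbers, and the column check here would fail.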