## Compare Datasets Notebook

This notebook compares the column headers of two usage datasets, one from Azure and one from AWS. It performs the following tasks:

- **Alphabetical Reordering:**  
  The columns from each dataset are sorted alphabetically to enable a standardized comparison irrespective of the original column order.

- **Difference Identification:**  
  Once sorted, any discrepancies between the two headers are highlighted. This includes:
  - Columns that exist in one dataset but are missing in the other.
  - Columns whose names differ between the datasets (for example, after a rename). These show up as an unmatched name on each side and may indicate changes or misalignments in data ingestion and transformation processes.

- **Developer Guidance:**  
  This comparison helps ensure that changes in the upstream data schema are reflected in our processing pipelines. It is especially useful when:
  - Integrating new data sources.
  - Updating ETL processes.
  - Validating changes between different environments or versions of the data.

Please update this notebook as needed when schema changes are detected to maintain data consistency across our reporting systems.
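
The comparison described above can be sketched on plain lists of column names, without loading any files. This is a minimal illustration, not the notebook's actual implementation: the function name `compare_headers`, the `cutoff` value, and the sample headers are all hypothetical. The rename detection is a heuristic based on string similarity from the standard library's `difflib`, so treat its output as a hint to review by hand.

```python
from difflib import get_close_matches


def compare_headers(left_columns, right_columns, cutoff=0.8):
    """Compare two header lists after sorting them alphabetically.

    `cutoff` is an assumed similarity threshold (0 to 1) for flagging
    names that might be renames of each other.
    """
    left_set, right_set = set(left_columns), set(right_columns)

    # Sorting makes the result independent of the original column order.
    only_left = sorted(left_set - right_set)
    only_right = sorted(right_set - left_set)

    # Heuristic: an unmatched name on the left that closely resembles an
    # unmatched name on the right may be a rename rather than a new column.
    possible_renames = {}
    for name in only_left:
        matches = get_close_matches(name, only_right, n=1, cutoff=cutoff)
        if matches:
            possible_renames[name] = matches[0]

    return {
        "common": sorted(left_set & right_set),
        "only_left": only_left,
        "only_right": only_right,
        "possible_renames": possible_renames,
    }


# Hypothetical headers, not taken from the real CSV files.
result = compare_headers(
    ["usage_date", "workspace_id", "tags"],
    ["workspaceId", "usage_date", "region"],
)
for key, value in result.items():
    print(f"{key}: {value}")
```

With real data, the same function could be called as `compare_headers(df_azure.columns, df_aws.columns)` once the dataframes are loaded. Note that `sorted` is case-sensitive, so uppercase names sort before lowercase ones.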

In [None]:
import pandas as pd
import os

# Define file paths
azure_filepath = '../data/usage_data_w_tagsv.csv'
aws_filepath = '../data/usage_data_aws_workspace_v1.1.csv'

# Check if files exist
if not os.path.exists(azure_filepath):
    raise FileNotFoundError(f"No such file or directory: '{azure_filepath}'")
if not os.path.exists(aws_filepath):
    raise FileNotFoundError(f"No such file or directory: '{aws_filepath}'")

# Load the CSV files into DataFrames
df_azure = pd.read_csv(azure_filepath)
df_aws = pd.read_csv(aws_filepath)

# Extract the column sets for each dataframe
columns_azure = set(df_azure.columns)
columns_aws = set(df_aws.columns)

# Compare columns
common_columns = columns_azure.intersection(columns_aws)
unique_to_azure = columns_azure - columns_aws
unique_to_aws = columns_aws - columns_azure

# Print sorted lists so the output is alphabetical and stable across runs
print("Common Columns:")
print(sorted(common_columns))
print("\nColumns unique to usage_data_w_tagsv.csv (Azure):")
print(sorted(unique_to_azure))
print("\nColumns unique to usage_data_aws_workspace_v1.1.csv (AWS):")
print(sorted(unique_to_aws))

In [None]:
df_azure

In [None]:
df_aws
