**Understanding industry classifications**

1. We must first determine which industry classification is best to use and understand the nuances of each one.

In [67]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Set paths here. Original file locations, kept for reference:
# C:\Users\clint\Desktop\compstak-ra\src\exploration_1.ipynb
# C:\Users\clint\Desktop\compstak-ra\data\data\university-of-british-columbia-leases-2025-04-02.csv
# C:\Users\clint\Desktop\compstak-ra\data\data\university-of-british-columbia-sales-2025-04-02.csv
# C:\Users\clint\Desktop\compstak-ra\data\data\cb_2018_us_state_5m\cb_2018_us_state_5m.shp
path = 'C:/Users/clint/Desktop/data/data/'
path_census = 'C:/Users/clint/Desktop/data/cb_2018_us_state_5m/'

# Load the data
sales = pd.read_csv(path + 'university-of-british-columbia-sales-2025-04-02.csv')
leases = pd.read_csv(path + 'university-of-british-columbia-leases-2025-04-02.csv')


Columns (8,52,73,77) have mixed types. Specify dtype option on import or set low_memory=False.


Columns (15,35,37,59,61,67,71,73,76,77,78,80,82,85) have mixed types. Specify dtype option on import or set low_memory=False.



In [68]:
# Get unique property types in each dataset
leases_prop_types = leases['Property Type'].unique()
sales_prop_types = sales['Property Type'].unique()
leases_prop_subtypes = leases['Property Subtype'].unique()
sales_prop_subtypes = sales['Property Subtype'].unique()

print("Leases Property Types:")
print(leases_prop_types)

print("\nSales Property Types:")
print(sales_prop_types)

Leases Property Types:
['Office' 'Land' 'Multi-Family' 'Retail' 'Mixed-Use' 'Hotel' 'Other'
 'Industrial' nan]

Sales Property Types:
['Office' 'Land' nan 'Multi-Family' 'Industrial' 'Retail' 'Other' 'Hotel'
 'Mixed-Use']


In [69]:
# Analyzing unique buildings in the sales dataset

# Count unique Property IDs in the sales dataframe
unique_buildings_sales = sales['Property Id'].nunique()
total_sales_records = len(sales)

print(f"Number of unique buildings (Property IDs) in sales data: {unique_buildings_sales}")
print(f"Total number of sales records: {total_sales_records}")
print(f"Ratio of unique buildings to total records: {unique_buildings_sales / total_sales_records:.4f}")

# Check for duplicate Property IDs - buildings with multiple sales records
property_id_counts = sales['Property Id'].value_counts()
buildings_with_multiple_sales = (property_id_counts > 1).sum()
max_sales_per_building = property_id_counts.max()

print(f"\nNumber of buildings with multiple sales records: {buildings_with_multiple_sales}")
print(f"Maximum number of sales records for a single building: {max_sales_per_building}")

Number of unique buildings (Property IDs) in sales data: 500230
Total number of sales records: 563653
Ratio of unique buildings to total records: 0.8875

Number of buildings with multiple sales records: 52928
Maximum number of sales records for a single building: 53
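
The duplicate check above relies on `value_counts`. As a minimal sketch of the same logic, the standard-library `Counter` can reproduce the three quantities (unique buildings, buildings with more than one record, and the maximum number of records per building). The IDs below are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical property IDs, one entry per sales record (not real data)
property_ids = [101, 102, 102, 103, 103, 103, 104]

# Records per building, analogous to value_counts()
counts = Counter(property_ids)

unique_buildings = len(counts)  # analogous to nunique()
multi_sale_buildings = sum(1 for n in counts.values() if n > 1)
max_sales_per_building = max(counts.values())

print(f"Unique buildings: {unique_buildings}")
print(f"Buildings with multiple records: {multi_sale_buildings}")
print(f"Maximum records for one building: {max_sales_per_building}")
print(f"Unique-to-total ratio: {unique_buildings / len(property_ids):.4f}")
```

A ratio close to 1 means most buildings appear only once, which is what the sales output (0.8875) suggests.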


In [70]:
# Compare with leases dataset
unique_buildings_leases = leases['Property Id'].nunique()
total_leases_records = len(leases)

print(f"Number of unique buildings (Property IDs) in leases data: {unique_buildings_leases}")
print(f"Total number of lease records: {total_leases_records}")
print(f"Ratio of unique buildings to total records: {unique_buildings_leases / total_leases_records:.4f}")

# Buildings that appear in both datasets
buildings_in_sales = set(sales['Property Id'].unique())
buildings_in_leases = set(leases['Property Id'].unique())
buildings_in_both = buildings_in_sales.intersection(buildings_in_leases)

print(f"\nNumber of buildings that appear in both sales and leases datasets: {len(buildings_in_both)}")
print(f"Percentage of sales buildings also in leases: {len(buildings_in_both)/len(buildings_in_sales):.2%}")
print(f"Percentage of leases buildings also in sales: {len(buildings_in_both)/len(buildings_in_leases):.2%}")

Number of unique buildings (Property IDs) in leases data: 333701
Total number of lease records: 1168997
Ratio of unique buildings to total records: 0.2855

Number of buildings that appear in both sales and leases datasets: 74308
Percentage of sales buildings also in leases: 14.85%
Percentage of leases buildings also in sales: 22.27%


In [71]:
# Calculate total unique buildings in our dataset (no duplicates)
total_unique_buildings = len(set(sales['Property Id']).union(set(leases['Property Id'])))
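
The union computed above can be cross-checked with the inclusion-exclusion identity, using only the counts already printed by the earlier cells. This is a sanity-check sketch, not part of the original analysis.

```python
# Figures copied from the cell outputs above
unique_sales = 500_230
unique_leases = 333_701
in_both = 74_308

# |A ∪ B| = |A| + |B| - |A ∩ B|
total_unique = unique_sales + unique_leases - in_both
print(f"Total unique buildings: {total_unique:,}")

# Overlap expressed as a share of each dataset
share_of_sales = in_both / unique_sales
share_of_leases = in_both / unique_leases
print(f"Share of sales buildings also in leases: {share_of_sales:.2%}")
print(f"Share of leases buildings also in sales: {share_of_leases:.2%}")
```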


## Dataset Coverage Analysis: Comparing to U.S. Commercial Building Stock

Let's assess how representative our dataset is by comparing it to the total U.S. commercial building stock.

In [72]:
# Reference figure: approximately 5.9 million commercial buildings in the United States
# (in line with the EIA's 2018 CBECS estimate)
total_us_commercial_buildings = 5_900_000

# Calculate what percentage of all U.S. commercial buildings our dataset represents
dataset_coverage_percentage = (total_unique_buildings / total_us_commercial_buildings) * 100

print(f"Total unique buildings in our dataset: {total_unique_buildings:,}")
print(f"Total commercial buildings in the U.S.: {total_us_commercial_buildings:,}")
print(f"Our dataset represents {dataset_coverage_percentage:.2f}% of all U.S. commercial buildings")

# Breakdown by property type
print("\nBreakdown by property type:")
sales_property_type_counts = sales['Property Type'].value_counts()
leases_property_type_counts = leases['Property Type'].value_counts()

print("\nSales dataset property types:")
print(sales_property_type_counts)

print("\nLeases dataset property types:")
print(leases_property_type_counts)

Total unique buildings in our dataset: 759,623
Total commercial buildings in the U.S.: 5,900,000
Our dataset represents 12.87% of all U.S. commercial buildings

Breakdown by property type:

Sales dataset property types:
Property Type
Retail          173102
Industrial      112989
Multi-Family     92561
Office           88399
Land             50548
Other             8907
Hotel             5500
Mixed-Use         3028
Name: count, dtype: int64

Leases dataset property types:
Property Type
Office          659897
Industrial      296444
Retail          172828
Multi-Family     15391
Mixed-Use         6636
Land              3037
Other             2118
Hotel              643
Name: count, dtype: int64
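
The coverage figure depends heavily on the denominator. The sketch below recomputes it against the 5.9 million building count used in this cell and against the 13.1 million parcel total that the notebook uses in its later comparison table. Both denominators come from the notebook itself; which one is more appropriate is a judgment call.

```python
total_unique_buildings = 759_623   # from the output above
us_buildings = 5_900_000           # building count used in this cell
us_parcels = 13_100_000            # parcel count used in the later comparison table

coverage = {}
for label, denominator in [("buildings", us_buildings), ("parcels", us_parcels)]:
    coverage[label] = total_unique_buildings / denominator * 100
    print(f"Coverage relative to U.S. {label}: {coverage[label]:.2f}%")
```

Because buildings and parcels are different units, the two percentages are not directly comparable and are best read as rough bounds.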


## U.S. Commercial Buildings by Type: Internet Data vs Our Dataset

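We found online estimates of U.S. commercial property counts by type. The estimates use different units (retail establishments, buildings, land parcels) and different reference years (2018 to 2025), so the coverage percentages below are rough order-of-magnitude comparisons rather than exact shares.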

In [73]:
# Internet data on estimated U.S. commercial buildings by type with updated figures
internet_data = {
    'Retail': 1070000,       # 1.07 million brick-and-mortar retail establishments as of 2023
    'Industrial': 350000,     # Includes warehouses, manufacturing facilities, etc.
    'Office': 569311,         # Includes Class A, B, and C office buildings as of 2018
    'Multi-Family': 5200000,  # Includes duplexes, triplexes, and apartment buildings
    'Hotel': 116873,         # Number of hotels and motels as of 2025
    'Mixed-Use': 580000,     # Buildings combining residential with commercial use (apartments within)
    'Land': 13100000,        # Total commercial land parcels across all U.S. states
    'Other': None,            # Includes education, medical, religious facilities (included in total)
}

# Total commercial properties nationwide
total_commercial_properties = 13100000  # Total number of commercial property parcels nationwide

# Notes for each category
category_notes = {
    'Retail': 'Number of brick-and-mortar retail establishments as of 2023.',
    'Industrial': 'Includes warehouses, manufacturing facilities, etc.',
    'Office': 'Includes Class A, B, and C office buildings as of 2018.',
    'Multi-Family': 'Includes duplexes, triplexes, and apartment buildings.',
    'Hotel': 'Number of hotels and motels as of 2025.',
    'Mixed-Use': 'Buildings combining residential with commercial use.',
    'Land': 'Total commercial property parcels across all U.S. states.',
    'Other': 'Includes education, medical, religious, and other facilities.',
}

# Create a mapping between our categories and internet data categories
# Some categories might need to be mapped differently based on your dataset
category_mapping = {
    'Office': 'Office',
    'Industrial': 'Industrial',
    'Retail': 'Retail',
    'Multi-Family': 'Multi-Family',
    'Hotel': 'Hotel',
    'Land': 'Land',
    'Mixed-Use': 'Mixed-Use',
    'Other': 'Other'
}

# Add emoji icons for visual appeal
category_emoji = {
    'Retail': '🛍️',
    'Industrial': '🏭',
    'Office': '🏢',
    'Multi-Family': '🏘️',
    'Hotel': '🏨',
    'Mixed-Use': '🏙️',
    'Land': '🌱',
    'Other': '🧩'
}

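# Get all unique property types from both datasets
# Note: unique() keeps NaN, and NaN never compares equal to itself, so the loop below
# produces a 'nan' category with a count of 0
property_types = set(sales['Property Type'].unique()) | set(leases['Property Type'].unique())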

# Count buildings in our dataset by property type
our_data_counts = {}
for prop_type in property_types:
    sales_ids = set(sales[sales['Property Type'] == prop_type]['Property Id'].unique())
    leases_ids = set(leases[leases['Property Type'] == prop_type]['Property Id'].unique())
    unique_ids = sales_ids.union(leases_ids)
    our_data_counts[prop_type] = len(unique_ids)

# Create a comparison table
comparison_data = []
for category, count in our_data_counts.items():
    # Map our category to internet category if available
    internet_category = category_mapping.get(category, 'Other')
    internet_count = internet_data.get(internet_category)
    icon = category_emoji.get(internet_category, '')
    note = category_notes.get(internet_category, '')
    
    coverage_pct = None
    if internet_count is not None and internet_count > 0:
        coverage_pct = (count / internet_count) * 100
    
    comparison_data.append({
        'Category': f"{icon} {category}",
        'Our Dataset Count': f"{count:,}",
        'U.S. Estimated Count': f"{internet_count:,}" if internet_count is not None else 'Included in total',
        'Coverage (%)': f"{coverage_pct:.4f}%" if coverage_pct is not None else 'N/A',
        'Notes': note
    })

# Add a total row
comparison_data.append({
    'Category': '📦 Total',
    'Our Dataset Count': f"{total_unique_buildings:,}",
    'U.S. Estimated Count': f"{total_commercial_properties:,}",
    'Coverage (%)': f"{(total_unique_buildings / total_commercial_properties) * 100:.4f}%",
    'Notes': 'Total number of commercial property parcels nationwide.'
})

# Convert to DataFrame for better display
comparison_df = pd.DataFrame(comparison_data)
display(comparison_df)

| | Category | Our Dataset Count | U.S. Estimated Count | Coverage (%) | Notes |
|---|---|---|---|---|---|
| 0 | 🏨 Hotel | 4969 | 116873 | 4.2516% | Number of hotels and motels as of 2025. |
| 1 | 🌱 Land | 49992 | 13100000 | 0.3816% | Total commercial property parcels across all U... |
| 2 | 🧩 Other | 9794 | Included in total | | Includes education, medical, religious, and ot... |
| 3 | 🧩 nan | 0 | Included in total | | Includes education, medical, religious, and ot... |
| 4 | 🛍️ Retail | 240247 | 1070000 | 22.4530% | Number of brick-and-mortar retail establishmen... |
| 5 | 🏘️ Multi-Family | 84092 | 5200000 | 1.6172% | Includes duplexes, triplexes, and apartment bu... |
| 6 | 🏭 Industrial | 188239 | 350000 | 53.7826% | Includes warehouses, manufacturing facilities,... |
| 7 | 🏙️ Mixed-Use | 3707 | 580000 | 0.6391% | Buildings combining residential with commercia... |
| 8 | 🏢 Office | 141594 | 569311 | 24.8711% | Includes Class A, B, and C office buildings as... |
| 9 | 📦 Total | 759623 | 13100000 | 5.7986% | Total number of commercial property parcels na... |
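
The `nan` row with a count of 0 in the table is a consequence of how NaN compares. The sketch below shows the behaviour with plain Python floats and a toy list (hypothetical values, not the dataset): an equality filter against NaN matches nothing, so missing labels need to be dropped or handled explicitly.

```python
import math

nan = float("nan")

# Hypothetical property-type column with missing values
property_types = ["Office", "Retail", nan, "Office", nan]

# NaN never compares equal to anything, including itself
nan_equals_itself = nan == nan
print(nan_equals_itself)

# An equality filter against NaN therefore matches no rows
matches = [p for p in property_types if p == nan]
print(len(matches))

# Dropping NaN before looping over categories avoids the empty row
labelled = {p for p in property_types
            if not (isinstance(p, float) and math.isnan(p))}
print(sorted(labelled))
```

In pandas, calling `dropna()` on the column before `unique()` would have the same effect.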


## Visualizing the Difference Between US Estimated Count and Our Dataset Count (Excluding Land and Multi-Family)

The charts below visualize the gap between the estimated number of U.S. commercial buildings and our dataset's coverage. Land and Multi-Family are excluded so that the comparison focuses on commercial property types.

In [74]:
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display, HTML

# Extract data for plotting (excluding Land and Multi-Family)
plot_data_filtered = []
categories_filtered = []
us_counts_filtered = []
our_counts_filtered = []

excluded_categories = ['Land', 'Multi-Family']  # Categories to exclude

for item in comparison_data[:-1]:  # Skip the total row
    category = item['Category'].split(' ', 1)[1]  # Remove emoji
    
    # Skip excluded categories
    if category in excluded_categories:
        continue
        
    categories_filtered.append(category)
    
    # Parse the US count
    us_count_str = item['U.S. Estimated Count']
    if us_count_str != 'Included in total':
        us_count = int(us_count_str.replace(',', ''))
    else:
        us_count = 0
    us_counts_filtered.append(us_count)
    
    # Parse our dataset count
    our_count = int(item['Our Dataset Count'].replace(',', ''))
    our_counts_filtered.append(our_count)
    
    plot_data_filtered.append({
        'Category': category,
        'US Estimated Count': us_count,
        'Our Dataset Count': our_count
    })

# Create a DataFrame for plotting
plot_df_filtered = pd.DataFrame(plot_data_filtered)

# Filter out categories with zero US count
plot_df_filtered = plot_df_filtered[plot_df_filtered['US Estimated Count'] > 0]

# Calculate the difference and percentage
plot_df_filtered['Difference'] = plot_df_filtered['US Estimated Count'] - plot_df_filtered['Our Dataset Count']
plot_df_filtered['Coverage %'] = (plot_df_filtered['Our Dataset Count'] / plot_df_filtered['US Estimated Count']) * 100

# Sort by difference magnitude
plot_df_filtered = plot_df_filtered.sort_values('Difference', ascending=False)

print(f"Analysis excluding: {excluded_categories}")

Analysis excluding: ['Land', 'Multi-Family']


In [75]:
# Create a bar chart comparing US Estimated vs Our Dataset counts (excluding Land and Multi-Family)
fig_filtered = px.bar(plot_df_filtered, 
                      x='Category', 
                      y=['US Estimated Count', 'Our Dataset Count'],
                      title='US Estimated Count vs Our Dataset Count by Property Type (Excluding Land & Multi-Family)',
                      barmode='group',
                      height=600,
                      log_y=True,  # Using log scale due to large differences
                      labels={'value': 'Number of Buildings (log scale)', 'variable': 'Data Source'})

fig_filtered.update_layout(
    xaxis_title='Property Type',
    yaxis_title='Number of Buildings (log scale)',
    legend_title='Data Source',
    font=dict(size=12)
)

fig_filtered.show()

# Create a bar chart showing the difference (excluding Land and Multi-Family)
fig2_filtered = px.bar(plot_df_filtered,
                       x='Category',
                       y='Difference',
                       title='Difference Between US Estimated Count and Our Dataset Count (Excluding Land & Multi-Family)',
                       color='Difference',
                       color_continuous_scale='Reds',
                       height=500)

fig2_filtered.update_layout(
    xaxis_title='Property Type',
    yaxis_title='Difference (US Estimated - Our Dataset)',
    font=dict(size=12)
)

fig2_filtered.show()

In [76]:
# Create a visualization for coverage percentage (excluding Land and Multi-Family)
fig3_filtered = px.bar(plot_df_filtered,
                       x='Category',
                       y='Coverage %',
                       title='Dataset Coverage as Percentage of US Estimated Count (Excluding Land & Multi-Family)',
                       color='Coverage %',
                       color_continuous_scale='Blues',
                       height=500)

fig3_filtered.update_layout(
    xaxis_title='Property Type',
    yaxis_title='Coverage (%)',
    font=dict(size=12)
)

# Add a horizontal line for the unweighted mean coverage across categories
avg_coverage_filtered = plot_df_filtered['Coverage %'].mean()
fig3_filtered.add_shape(
    type='line',
    line=dict(dash='dash', color='red', width=2),
    y0=avg_coverage_filtered,
    y1=avg_coverage_filtered,
    x0=-0.5,
    x1=len(plot_df_filtered) - 0.5
)

fig3_filtered.add_annotation(
    text=f'Average Coverage: {avg_coverage_filtered:.2f}%',
    x=len(plot_df_filtered)/2,
    y=avg_coverage_filtered + 0.5,
    showarrow=False,
    font=dict(color='red')
)

fig3_filtered.show()
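
The dashed line above is the unweighted mean of the per-category coverage percentages, which gives a small category such as Hotel the same weight as Retail. As a sketch for comparison, the pooled coverage (total dataset count divided by total estimated count) can be computed from the same figures in the comparison table.

```python
# Counts copied from the comparison table (Land and Multi-Family excluded)
us_counts = {"Retail": 1_070_000, "Mixed-Use": 580_000, "Office": 569_311,
             "Industrial": 350_000, "Hotel": 116_873}
our_counts = {"Retail": 240_247, "Mixed-Use": 3_707, "Office": 141_594,
              "Industrial": 188_239, "Hotel": 4_969}

# Per-category coverage in percent
coverage = {k: our_counts[k] / us_counts[k] * 100 for k in us_counts}

# Unweighted mean: every category counts equally
unweighted_mean = sum(coverage.values()) / len(coverage)

# Pooled coverage: every building counts equally
pooled = sum(our_counts.values()) / sum(us_counts.values()) * 100

print(f"Unweighted mean coverage: {unweighted_mean:.2f}%")
print(f"Pooled coverage:          {pooled:.2f}%")
```

The two numbers happen to be close here, but they can diverge sharply when a very large category has unusually low or high coverage.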

In [77]:
# Create a table with the numerical differences (excluding Land and Multi-Family)
summary_table_filtered = plot_df_filtered.copy()
summary_table_filtered['Absolute Difference'] = summary_table_filtered['Difference'].abs()
summary_table_filtered['Gap Ratio'] = summary_table_filtered['US Estimated Count'] / summary_table_filtered['Our Dataset Count']

# Format the numbers for better readability
summary_table_filtered['US Estimated Count'] = summary_table_filtered['US Estimated Count'].apply(lambda x: f'{x:,}')
summary_table_filtered['Our Dataset Count'] = summary_table_filtered['Our Dataset Count'].apply(lambda x: f'{x:,}')
summary_table_filtered['Difference'] = summary_table_filtered['Difference'].apply(lambda x: f'{x:,}')
summary_table_filtered['Coverage %'] = summary_table_filtered['Coverage %'].apply(lambda x: f'{x:.4f}%')
summary_table_filtered['Gap Ratio'] = summary_table_filtered['Gap Ratio'].apply(lambda x: f'{x:,.2f}x')

# Sort by absolute difference
summary_table_filtered = summary_table_filtered.sort_values('Absolute Difference', ascending=False)

# Select and reorder columns for display
display_columns = ['Category', 'US Estimated Count', 'Our Dataset Count', 'Difference', 'Coverage %', 'Gap Ratio']
display(summary_table_filtered[display_columns])

| | Category | US Estimated Count | Our Dataset Count | Difference | Coverage % | Gap Ratio |
|---|---|---|---|---|---|---|
| 3 | Retail | 1070000 | 240247 | 829753 | 22.4530% | 4.45x |
| 5 | Mixed-Use | 580000 | 3707 | 576293 | 0.6391% | 156.46x |
| 6 | Office | 569311 | 141594 | 427717 | 24.8711% | 4.02x |
| 4 | Industrial | 350000 | 188239 | 161761 | 53.7826% | 1.86x |
| 0 | Hotel | 116873 | 4969 | 111904 | 4.2516% | 23.52x |
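
The three derived columns in this table are simple functions of the two counts. The sketch below recomputes them in plain Python from the table's own figures, which also makes clear that the gap ratio is the reciprocal of the coverage fraction.

```python
# (category, US estimated count, our dataset count) copied from the table above
rows = [
    ("Retail", 1_070_000, 240_247),
    ("Mixed-Use", 580_000, 3_707),
    ("Office", 569_311, 141_594),
    ("Industrial", 350_000, 188_239),
    ("Hotel", 116_873, 4_969),
]

summary = {}
for category, us_count, our_count in rows:
    summary[category] = {
        "difference": us_count - our_count,          # buildings not covered
        "coverage_pct": our_count / us_count * 100,  # share covered, in percent
        "gap_ratio": us_count / our_count,           # equals 100 / coverage_pct
    }

for category, s in summary.items():
    print(f"{category:<11} {s['difference']:>9,} "
          f"{s['coverage_pct']:8.4f}% {s['gap_ratio']:8.2f}x")
```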


## Export Filtered Analysis Charts

Let's export the filtered analysis charts (excluding Land and Multi-Family categories) as image files.

In [78]:
# Export the filtered figures (export_plot is a helper that is not defined in this section)
try:
    print("Exporting filtered comparison bar chart...")
    export_plot(fig_filtered, "filtered_us_vs_dataset_comparison")
    
    print("\nExporting filtered difference bar chart...")
    export_plot(fig2_filtered, "filtered_difference_chart")
    
    print("\nExporting filtered coverage percentage chart...")
    export_plot(fig3_filtered, "filtered_coverage_percentage")
    
except Exception as e:
    print(f"Error exporting plots: {e}")
    print("\nTo fix this issue, please run the following in a terminal:")
    print("pip install -U plotly kaleido")
    print("\nThen restart the kernel and run this cell again.")

Exporting filtered comparison bar chart...
Exported to:
- ../Images/plots\filtered_us_vs_dataset_comparison_20250425_110034.png
- ../Images/plots\filtered_us_vs_dataset_comparison_20250425_110034.jpg
- ../Images/plots\filtered_us_vs_dataset_comparison_20250425_110034.html

Exporting filtered difference bar chart...
Exported to:
- ../Images/plots\filtered_difference_chart_20250425_110034.png
- ../Images/plots\filtered_difference_chart_20250425_110034.jpg
- ../Images/plots\filtered_difference_chart_20250425_110034.html

Exporting filtered coverage percentage chart...
Exported to:
- ../Images/plots\filtered_coverage_percentage_20250425_110034.png
- ../Images/plots\filtered_coverage_percentage_20250425_110034.jpg
- ../Images/plots\filtered_coverage_percentage_20250425_110034.html
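
`export_plot` is not defined in this section, but the output suggests that it writes each figure as PNG, JPG, and HTML with a timestamp suffix. The sketch below covers only the file-naming step under that assumption; `build_export_paths` is a hypothetical name, and the actual writing would presumably go through Plotly's `write_image` and `write_html`.

```python
from datetime import datetime
from pathlib import Path


def build_export_paths(name, out_dir="../Images/plots",
                       formats=("png", "jpg", "html"), now=None):
    """Return one timestamped output path per format (hypothetical helper)."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return [Path(out_dir) / f"{name}_{stamp}.{ext}" for ext in formats]


# Fixed timestamp so the result is reproducible
paths = build_export_paths("filtered_difference_chart",
                           now=datetime(2025, 4, 25, 11, 0, 34))
for p in paths:
    print(p)
```

Building paths with `pathlib` also avoids the mixed `/` and `\` separators visible in the output above.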


## Pie Chart Comparison: Our Dataset Count vs US Estimated Count

The pie charts below compare the share of each property type in our dataset with its share in the U.S. estimates. Differences between the two distributions show which property types are over- or under-represented in our data.

In [79]:
# Create pie charts to compare percentage distributions
import plotly.subplots as sp

# Prepare data for pie charts (using the filtered dataset excluding Land and Multi-Family)
# plot_df_filtered is still numeric (only summary_table_filtered was string-formatted),
# so the casts below are just a safeguard
numeric_data = plot_df_filtered.copy()
numeric_data['US Estimated Count'] = numeric_data['US Estimated Count'].astype(int)
numeric_data['Our Dataset Count'] = numeric_data['Our Dataset Count'].astype(int)

# Calculate total counts for percentages
total_us = numeric_data['US Estimated Count'].sum()
total_ours = numeric_data['Our Dataset Count'].sum()

# Add percentage columns for labels
numeric_data['US Percentage'] = (numeric_data['US Estimated Count'] / total_us) * 100
numeric_data['Our Percentage'] = (numeric_data['Our Dataset Count'] / total_ours) * 100

# Sort by US Estimated Count descending for better visualization
numeric_data = numeric_data.sort_values('US Estimated Count', ascending=False)

# Create a subplot with two pie charts side by side
fig_pies = sp.make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]],
                          subplot_titles=['US Estimated Distribution', 'Our Dataset Distribution'])

# Add the US Estimated pie chart
fig_pies.add_trace(
    go.Pie(
        labels=numeric_data['Category'],
        values=numeric_data['US Estimated Count'],
        textinfo='label+percent',
        insidetextorientation='radial',
        pull=[0.05] * len(numeric_data),  # Pull slices slightly apart
        marker=dict(line=dict(color='#FFFFFF', width=1)),
        name='US Estimated'
    ),
    row=1, col=1
)

# Add Our Dataset pie chart
fig_pies.add_trace(
    go.Pie(
        labels=numeric_data['Category'],
        values=numeric_data['Our Dataset Count'],
        textinfo='label+percent',
        insidetextorientation='radial',
        pull=[0.05] * len(numeric_data),  # Pull slices slightly apart
        marker=dict(line=dict(color='#FFFFFF', width=1)),
        name='Our Dataset'
    ),
    row=1, col=2
)

# Update layout
fig_pies.update_layout(
    title_text='Percentage Distribution Comparison: US Estimated vs Our Dataset (Excluding Land & Multi-Family)',
    title_font_size=18,
    height=600,
    width=1000,
    legend=dict(orientation='h', yanchor='bottom', y=-0.1, xanchor='center', x=0.5),
    annotations=[
        dict(text=f'Total: {total_us:,} properties', x=0.18, y=-0.05, showarrow=False, font_size=12),
        dict(text=f'Total: {total_ours:,} properties', x=0.82, y=-0.05, showarrow=False, font_size=12)
    ]
)

# Show the pie charts
fig_pies.show()
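
The percentages shown in the two pies are each category's share of its own total. A minimal sketch of that calculation, using the counts from the comparison table, makes the over- and under-representation explicit as a difference in percentage points. The helper name `shares` is introduced here for illustration only.

```python
us_counts = {"Retail": 1_070_000, "Mixed-Use": 580_000, "Office": 569_311,
             "Industrial": 350_000, "Hotel": 116_873}
our_counts = {"Retail": 240_247, "Mixed-Use": 3_707, "Office": 141_594,
              "Industrial": 188_239, "Hotel": 4_969}


def shares(counts):
    """Return each category's share of the total, in percent."""
    total = sum(counts.values())
    return {k: v / total * 100 for k, v in counts.items()}


us_share = shares(us_counts)
our_share = shares(our_counts)

for category in us_counts:
    diff = our_share[category] - us_share[category]
    print(f"{category:<11} US {us_share[category]:5.1f}%  "
          f"ours {our_share[category]:5.1f}%  diff {diff:+5.1f} pts")
```

A positive difference means the category takes up a larger slice of our dataset than of the U.S. estimates.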

In [80]:
# Create one pie chart per category showing the coverage gap
# Prepare data for the coverage gap pie charts
gap_data = []

for idx, row in numeric_data.iterrows():
    category = row['Category']
    us_count = row['US Estimated Count']
    our_count = row['Our Dataset Count']
    coverage_pct = row['Coverage %']
    gap = us_count - our_count
    
    gap_data.append({
        'Category': category,
        'Covered': our_count,
        'Gap': gap,
        'Coverage %': coverage_pct
    })

# Create a new DataFrame
gap_df = pd.DataFrame(gap_data)

# Create a separate pie chart for each category showing coverage vs gap
# (px.pie builds a new figure on every loop iteration, so no figure is created up front)

# Create custom colorscale for the coverage percentages
colors = px.colors.sequential.Blues[3:] # Get blues color scale, start from a bit darker

for i, row in gap_df.iterrows():
    # Calculate the coverage and gap as percentages of the total US estimated count for this category
    total = row['Covered'] + row['Gap']
    coverage_percent = (row['Covered'] / total) * 100 if total > 0 else 0
    gap_percent = 100 - coverage_percent
    
    # Calculate color intensity based on coverage percentage
    color_idx = min(int(coverage_percent / 25), len(colors)-1)
    coverage_color = colors[color_idx]
    
    # Build a standalone pie chart for this category
    fig_coverage = px.pie(
        values=[row['Covered'], row['Gap']],
        names=['Covered', 'Gap'],
        title=f"Coverage for {row['Category']}: {coverage_percent:.2f}%",
        color_discrete_sequence=[coverage_color, 'lightgray'],
    )
    
    # Add the total count as annotation
    fig_coverage.update_layout(
        annotations=[
            dict(text=f'Total: {total:,}', x=0.5, y=-0.1, showarrow=False)
        ]
    )
    
    fig_coverage.show()

In [81]:
# Create a sunburst diagram to show hierarchical relationships
# Prepare data for the sunburst chart
sunburst_data = []

for idx, row in numeric_data.iterrows():
    category = row['Category']
    us_count = row['US Estimated Count']
    our_count = row['Our Dataset Count']
    gap = us_count - our_count
    
    # Add entry for our dataset count
    sunburst_data.append({
        'id': f"{category}-Covered",
        'parent': category,
        'name': 'Covered',
        'value': our_count
    })
    
    # Add entry for gap
    sunburst_data.append({
        'id': f"{category}-Gap",
        'parent': category,
        'name': 'Gap',
        'value': gap
    })
    
    # Add parent category
    sunburst_data.append({
        'id': category,
        'parent': '',
        'name': category,
        'value': 0  # Parent nodes don't need a value
    })

# Create the sunburst chart
fig_sunburst = px.sunburst(
    pd.DataFrame(sunburst_data),
    ids='id',
    parents='parent',
    values='value',
    names='name',
    title='Coverage Analysis by Property Type (Sunburst Visualization)',
    color_discrete_sequence=px.colors.qualitative.Set3,
    height=700,
)

fig_sunburst.update_layout(
    title_font_size=18,
)

fig_sunburst.show()
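
The sunburst above is driven by a flat list of records in which every node has an `id` and names its `parent`. The sketch below rebuilds that structure for two categories and checks that each parent reference resolves and that the IDs are unique. `sunburst_records` is a hypothetical helper name. Parent nodes carry a value of 0 on the assumption that Plotly's default `branchvalues` setting ("remainder") sizes a parent from its children plus its own value.

```python
def sunburst_records(counts):
    """Build id/parent/value records; counts maps category -> (us_count, our_count)."""
    records = []
    for category, (us_count, our_count) in counts.items():
        # Root-level node for the category itself
        records.append({"id": category, "parent": "", "name": category, "value": 0})
        # Two leaves: what the dataset covers and what is missing
        records.append({"id": f"{category}-Covered", "parent": category,
                        "name": "Covered", "value": our_count})
        records.append({"id": f"{category}-Gap", "parent": category,
                        "name": "Gap", "value": us_count - our_count})
    return records


records = sunburst_records({"Office": (569_311, 141_594), "Hotel": (116_873, 4_969)})
ids = {r["id"] for r in records}

# Every non-root node must point at an existing id, and ids must be unique
parents_ok = all(r["parent"] == "" or r["parent"] in ids for r in records)
ids_unique = len(ids) == len(records)
print(parents_ok, ids_unique)
```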

## Sunburst Visualization Including All Categories

The sunburst below adds Land and Multi-Family back in to show the complete picture. Categories without a U.S. estimate (such as Other) are still left out.

In [82]:
# Create a sunburst diagram with all categories (including Land and Multi-Family)

# Extract data for all property types (including Land and Multi-Family)
all_plot_data = []

for item in comparison_data[:-1]:  # Skip the total row
    category = item['Category'].split(' ', 1)[1]  # Remove emoji
    
    # Parse the US count
    us_count_str = item['U.S. Estimated Count']
    if us_count_str != 'Included in total':
        us_count = int(us_count_str.replace(',', ''))
    else:
        us_count = 0
    
    # Parse our dataset count
    our_count = int(item['Our Dataset Count'].replace(',', ''))
    
    all_plot_data.append({
        'Category': category,
        'US Estimated Count': us_count,
        'Our Dataset Count': our_count
    })

# Create a DataFrame for plotting
all_plot_df = pd.DataFrame(all_plot_data)

# Filter out categories with zero US count
all_plot_df = all_plot_df[all_plot_df['US Estimated Count'] > 0]

# Calculate the difference and percentage
all_plot_df['Difference'] = all_plot_df['US Estimated Count'] - all_plot_df['Our Dataset Count']
all_plot_df['Coverage %'] = (all_plot_df['Our Dataset Count'] / all_plot_df['US Estimated Count']) * 100

# Sort by difference magnitude
all_plot_df = all_plot_df.sort_values('Difference', ascending=False)

# Prepare data for the complete sunburst chart
complete_sunburst_data = []

for idx, row in all_plot_df.iterrows():
    category = row['Category']
    us_count = row['US Estimated Count']
    our_count = row['Our Dataset Count']
    gap = us_count - our_count
    
    # Add entry for our dataset count
    complete_sunburst_data.append({
        'id': f"{category}-Covered",
        'parent': category,
        'name': 'Covered',
        'value': our_count
    })
    
    # Add entry for gap
    complete_sunburst_data.append({
        'id': f"{category}-Gap",
        'parent': category,
        'name': 'Gap',
        'value': gap
    })
    
    # Add parent category
    complete_sunburst_data.append({
        'id': category,
        'parent': '',
        'name': category,
        'value': 0  # Parent nodes don't need a value
    })

# Calculate total counts for labeling
total_us_all = all_plot_df['US Estimated Count'].sum()
total_our_all = all_plot_df['Our Dataset Count'].sum()
total_gap_all = total_us_all - total_our_all

# Create the complete sunburst chart
fig_sunburst_all = px.sunburst(
    pd.DataFrame(complete_sunburst_data),
    ids='id',
    parents='parent',
    values='value',
    names='name',
    title='Complete Coverage Analysis by Property Type (Including Land & Multi-Family)',
    color_discrete_sequence=px.colors.qualitative.Bold,
    height=800,
    width=1000,
)

# Update layout with more details
fig_sunburst_all.update_layout(
    title_font_size=20,
    margin=dict(t=80, b=20, l=20, r=20),
)

# Add hover data for better information display
fig_sunburst_all.update_traces(
    hovertemplate='<b>%{label}</b><br>Count: %{value:,}<br>Percentage: %{percentRoot:.2%}<extra></extra>'
)

# Show the complete sunburst
fig_sunburst_all.show()
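
Slice sizes in this chart are proportional to the U.S. estimates, so the largest categories dominate. The sketch below computes each category's share of the root from the estimates listed earlier in the notebook. Note that the Land figure is the same 13.1 million used as the nationwide parcel total, so summing it with the other categories likely double counts.

```python
# U.S. estimates copied from internet_data (Other has no estimate and is excluded)
us_estimates = {
    "Land": 13_100_000,
    "Multi-Family": 5_200_000,
    "Retail": 1_070_000,
    "Mixed-Use": 580_000,
    "Office": 569_311,
    "Industrial": 350_000,
    "Hotel": 116_873,
}

root_total = sum(us_estimates.values())
root_share = {k: v / root_total for k, v in us_estimates.items()}

for category, share in sorted(root_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:<13} {share:7.2%}")
```

Land and Multi-Family together account for well over four fifths of the root, which is a practical reason the earlier charts exclude them.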

In [83]:
# Export the complete sunburst visualization
try:
    print("Exporting complete sunburst visualization...")
    export_path = export_plot(fig_sunburst_all, "complete_coverage_sunburst")
    print(f"\nExported to: {export_path}")
except Exception as e:
    print(f"Error exporting plot: {e}")
    print("\nMake sure you have the kaleido package installed:")
    print("pip install -U kaleido")

Exporting complete sunburst visualization...
Exported to:
- ../Images/plots\complete_coverage_sunburst_20250425_110035.png
- ../Images/plots\complete_coverage_sunburst_20250425_110035.jpg
- ../Images/plots\complete_coverage_sunburst_20250425_110035.html

Exported to: ../Images/plots\complete_coverage_sunburst_20250425_110035.png


In [84]:
# Add a dropdown to switch between sunburst and treemap views
# (downloading is done through the camera icon in the Plotly toolbar)
fig_sunburst_all.update_layout(
    updatemenus=[{
        'buttons': [{
            'args': ['type', 'sunburst'],
            'label': 'Sunburst',
            'method': 'restyle'
        }, {
            'args': ['type', 'treemap'],
            'label': 'Treemap',
            'method': 'restyle'
        }],
        'direction': 'down',
        'showactive': True,
        'x': 1.0,
        'xanchor': 'right',
        'y': 1.15,
        'yanchor': 'top'
    }],
)

# Add instructions on how to download
fig_sunburst_all.add_annotation(
    text="Click the camera icon in the toolbar to download this image",
    x=0.5, y=1.05,
    xref="paper", yref="paper",
    showarrow=False,
    font=dict(size=14, color="red"),
    bgcolor="rgba(255, 255, 255, 0.7)"
)

# Show with config that includes download options
fig_sunburst_all.show(config=config)
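
`config` is passed to `show` above but is not defined in this section. As a hedged sketch, a Plotly configuration dictionary that controls the toolbar download might look like the following. The specific values (file name, size, scale) are assumptions chosen for illustration, not the notebook's actual settings.

```python
# Hypothetical stand-in for the notebook's `config`, which is not shown in this section
config = {
    "displaylogo": False,  # hide the Plotly logo in the toolbar
    "toImageButtonOptions": {
        "format": "png",                           # format used by the camera icon
        "filename": "complete_coverage_sunburst",  # assumed download file name
        "height": 800,
        "width": 1000,
        "scale": 2,                                # multiplier for a sharper image
    },
}

print(sorted(config))
```

The dictionary is plain data, so it can be defined once near the top of the notebook and reused by every `show(config=config)` call.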

In [85]:
# Export the pie charts
try:
    print("Exporting percentage distribution pie charts...")
    export_plot(fig_pies, "filtered_percentage_distribution_pies")
    
    print("\nExporting sunburst visualization...")
    export_plot(fig_sunburst, "filtered_coverage_sunburst")
    
except Exception as e:
    print(f"Error exporting plots: {e}")
    print("\nTo fix this issue, please run the following in a terminal:")
    print("pip install -U plotly kaleido")
    print("\nThen restart the kernel and run this cell again.")

Exporting percentage distribution pie charts...
Exported to:
- ../Images/plots\filtered_percentage_distribution_pies_20250425_110035.png
- ../Images/plots\filtered_percentage_distribution_pies_20250425_110035.jpg
- ../Images/plots\filtered_percentage_distribution_pies_20250425_110035.html

Exporting sunburst visualization...
Exported to:
- ../Images/plots\filtered_coverage_sunburst_20250425_110036.png
- ../Images/plots\filtered_coverage_sunburst_20250425_110036.jpg
- ../Images/plots\filtered_coverage_sunburst_20250425_110036.html


In [86]:
# Create a downloadable version of the filtered comparison plot

# Recreate the plot and add an annotation pointing to the toolbar's download (camera) icon
fig_download_filtered = px.bar(plot_df_filtered, 
                               x='Category', 
                               y=['US Estimated Count', 'Our Dataset Count'],
                               title='US Estimated Count vs Our Dataset Count (Excluding Land & Multi-Family)',
                               barmode='group',
                               height=600,
                               log_y=True,
                               labels={'value': 'Number of Buildings (log scale)', 'variable': 'Data Source'})

fig_download_filtered.update_layout(
    xaxis_title='Property Type',
    yaxis_title='Number of Buildings (log scale)',
    legend_title='Data Source',
    font=dict(size=12),
    title_font_size=20
)

# Add instructions on how to download
fig_download_filtered.add_annotation(
    text="Click the camera icon in the toolbar to download this image",
    x=0.5, y=1.1,
    xref="paper", yref="paper",
    showarrow=False,
    font=dict(size=14, color="red"),
    bgcolor="rgba(255, 255, 255, 0.7)"
)

# Show with config that includes download options
fig_download_filtered.show(config=config)