<a href="https://colab.research.google.com/github/dave-killough/databricks-colab/blob/main/05B_US_Giving_Treemaps.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# US Giving Treemaps Module
This notebook is modularized for import by the [datamachine](https://pypi.org/project/datamachine/) package, so the designated code here can be reused in another Python context as if it came from a traditional Python package.  The module is defined in this first part, and an example of calling it from a different Python context appears further down in the TESTS part.

Key adaptations to modularize a notebook for datamachine include:

*   Creating a static `_Data` class for caching data and storing variables that would otherwise be global.  
*   Marking every cell to be excluded from the imported module with a `#test` comment at the top of the cell.
*   Hiding classes, variables, and functions by prefixing them with an underscore as in `_get_color()`.  This adaptation helps to clean up the code completion and help assistance in the calling Python context.
*   Placing `import` statements within the functions that use them.  This further cleans up the code completion and help assistance in the calling Python context.
*   Documenting exposed functions for help assistance in the calling Python context.
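
The adaptations listed above can be illustrated with a minimal sketch of a module-style cell. Every name in it (`_Settings`, `_format_name`, `greet`) is hypothetical and chosen only for illustration; none of them is part of this notebook or of datamachine.

```python
class _Settings:
    """Static holder for values that would otherwise be module-level globals."""
    greeting = "Hello"  # hypothetical shared value


def _format_name(name):
    """Hidden helper: the leading underscore keeps it out of code completion."""
    import string  # imported inside the function, so the module does not expose it

    return string.capwords(name)


def greet(name):
    """Return a greeting for `name`.

    This is the only function meant for the calling context, so it carries
    the docstring that help() would display there.
    """
    return f"{_Settings.greeting}, {_format_name(name)}!"


message = greet("ada lovelace")
```

A cell that exercises `greet()` during development would start with `#test` so that it stays out of the imported module.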




## _Data
This static class provides cached data storage, module-level values shared by the functions in the module, and a set of accessor functions. Here the same data is held as a Spark dataframe for parallel processing and as a pandas dataframe for chart creation.

In [1]:
class _Data:

    eo990spark = None # used for distributed rendering and making eo990pandas
    eo990pandas = None # used by _sector_table() and _giving_treemap_sectors()

    def get_eo990spark(spark):
        if _Data.eo990spark is not None: # cached
            return _Data.eo990spark
        _Data.eo990pandas = None # clear cache
        _Data.eo990spark = spark.sql("""
            SELECT EIN, NTEE3, ntee1.SECTOR AS SECTOR1, eo990.SECTOR,
                ReturnTypeCd, TaxPeriodEndDt, BusinessName, CITY, STATE,
                PYContributionsGrantsAmt, CYContributionsGrantsAmt,
                PYGrantsAndSimilarPaidAmt, CYGrantsAndSimilarPaidAmt,
                PYTotalProfFndrsngExpnsAmt, CYTotalProfFndrsngExpnsAmt,
                CYTotalFundraisingExpenseAmt,
                TotalEmployeeCnt, TotalVolunteersCnt
            FROM eo990
            JOIN gcst AS ntee1 ON LEFT(eo990.NTEE3,1) = ntee1.code
            JOIN gcst AS ntee3 ON LEFT(eo990.NTEE3,3) = ntee3.code
            WHERE CYContributionsGrantsAmt > 0
        """)
        return _Data.eo990spark

    def get_eo990pandas(spark):
        if _Data.eo990pandas is not None: # cached
            return _Data.eo990pandas
        # build from the spark dataframe, whether or not it is already cached
        _Data.eo990pandas = _Data.get_eo990spark(spark).toPandas()
        return _Data.eo990pandas


## _get_color()
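
The red, white, and green mapping that this helper performs can be approximated in plain Python. The sketch below is not the matplotlib-based implementation in the next cell: `blend_hex` is a hypothetical name, it interpolates linearly between three fixed stops, and it clamps out-of-range input, which is an assumption made for the example. A 256-step colormap may return slightly different values between the stops.

```python
def blend_hex(value):
    """Map a value in [-100, 100] to a hex color: red -> white -> dark green."""
    stops = [(255, 0, 0), (255, 255, 255), (0, 128, 0)]
    value = max(-100.0, min(100.0, float(value)))  # clamp (assumption for this sketch)
    position = (value + 100.0) / 100.0             # 0.0 .. 2.0 across the two segments
    segment = min(int(position), 1)                # 0: red->white, 1: white->green
    t = position - segment                         # fraction travelled within the segment
    start, end = stops[segment], stops[segment + 1]
    rgb = [round(s + (e - s) * t) for s, e in zip(start, end)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


lowest = blend_hex(-100)   # pure red
neutral = blend_hex(0)     # white
highest = blend_hex(100)   # dark green
```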

In [2]:
def _get_color(value):
    """
    Function to return the color based on an input value from -100 to 100.
    -100 is mapped to red, 0 is mapped to white, and 100 is mapped to green.
    """

    import matplotlib as mpl
    import matplotlib.colors as mcolors

    if value < -100 or value > 100:
        return "Input value must be between -100 and 100"

    # Create custom colormap
    colors = [(1, 0, 0), (1, 1, 1), (0, 0.5, 0)]  # R -> W -> G
    cmap_name = 'custom'
    cm = mcolors.LinearSegmentedColormap.from_list(cmap_name, colors, N=256)

    # Normalize the input value to range [0,1]
    norm = mpl.colors.Normalize(vmin=-100, vmax=100)
    rgb = cm(norm(value))  # Get RGB color
    hex_color = mcolors.rgb2hex(rgb) # Convert RGB to hex

    return hex_color

## _sector_table()
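
The growth figure that colors each table row is a year-over-year percentage, set to 0 when there is no prior-year amount and capped at 100 so that a few outliers do not wash out the color scale. A minimal plain-Python sketch of that rule follows; `capped_growth` is a hypothetical helper, while the cell below applies the same arithmetic to whole columns with `np.where`.

```python
def capped_growth(py_amount, cy_amount, cap=100.0):
    """Percent change from prior year to current year, capped from above."""
    if py_amount == 0:
        return 0.0  # no baseline, so growth is reported as flat
    growth = (cy_amount - py_amount) * 100.0 / py_amount
    return min(growth, cap)  # only the upper side is capped


steady = capped_growth(200, 250)     # 25% growth
outlier = capped_growth(100, 400)    # 300% growth, capped
new_money = capped_growth(0, 50)     # no prior-year amount
decline = capped_growth(200, 100)    # negative growth is left as-is
```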

In [3]:
def _sector_table(spark):

    import numpy as np
    import pandas as pd

    df = _Data.get_eo990pandas(spark)

    c1 = df['NTEE3'] != 'Z99' # hide unknown for now
    c2 = df['CYContributionsGrantsAmt'] > 0 # used for cell size
    df2 = df[c1 & c2][
        ['EIN','NTEE3','SECTOR',
         'PYContributionsGrantsAmt','CYContributionsGrantsAmt']
    ].groupby(['NTEE3','SECTOR']
    ).agg(
        count=('EIN', 'count'),
        CYamount=('CYContributionsGrantsAmt', 'sum'),
        PYamount=('PYContributionsGrantsAmt', 'sum')
    ).reset_index() # move NTEE3 & SECTOR to columns
    df2['mil'] = (df2['CYamount'] / 1_000_000).round().astype(int)
    df2['growth'] = np.where(
        df2['PYamount'] == 0, 0,
        (df2['CYamount'] - df2['PYamount']) * 100.0 / df2['PYamount']
    )
    df2['growth'] = np.where(
        df2['growth'] > 100, 100,
        df2['growth']).astype(float)
    df2['color'] = df2['growth'].apply(_get_color)

    df3 = df2.copy() #merge(m.gcst, left_on='NTEE3', right_on='code')
    #df3['count'] = df3['EIN'].apply(lambda x: "{:,}".format(x))
    chunks = [] # array of chunks
    #display(df3)
    for index, row in df3.iterrows():
        amount = int(row['CYamount'])
        growth = row['growth']
        count = row['count']
        desc_title = f'${amount:,.0f} ({growth:,.2f}%)'
        count_title = f'{count:,.0f} organizations'
        chunk = f"""<tr style="background-color: {row['color']};">
<td class="description" title="{row['SECTOR']}">{row['SECTOR']}</td>
<td class="org_amount" title="{desc_title}">{row['mil']}</td>
<td class="org_count" title="{count_title}">{row['count']}</td>
<td class="link"><button onclick="loadIframe('{row['NTEE3']}')">
{row['NTEE3']}</button></td></tr>"""
        chunks.append(chunk) # joining array later is faster
    sectors_html = '\n'.join(chunks)
    table_html = f"""\
<table id="myTable" border="1">
<tr>
    <th class="description">sector</th>
    <th class="org_amount">mil$</th>
    <th class="org_count">orgs</th>
    <th class="link">view</th>
</tr>{sectors_html}</table>
"""
    return table_html

## _giving_treemap_sectors()

In [4]:
def _giving_treemap_sectors(spark, kind='growth', ntee3='', publish=False):

    import numpy as np
    import pandas as pd
    import plotly.express as px

    df = _Data.get_eo990pandas(spark)

    crit = df['CYContributionsGrantsAmt'] > 0
    df2 = df[crit].copy().sort_values(
        'CYContributionsGrantsAmt', ascending=False)
    df2['PYContributionsRetained'] = (
        df2['PYContributionsGrantsAmt'] - df2['PYGrantsAndSimilarPaidAmt'])
    df2['CYContributionsRetained'] = (
        df2['CYContributionsGrantsAmt'] - df2['CYGrantsAndSimilarPaidAmt'])

    rename_dict = {
        'ReturnTypeCd': 'Return',
        'TaxPeriodEndDt': 'TaxPeriodEnd',
        'BusinessName': 'Name',
        'CITY': 'City',
        'STATE': 'State',
        'PYContributionsGrantsAmt': 'PY Contributions Grants Amt',
        'CYContributionsGrantsAmt': 'CY Contributions Grants Amt',
        'PYGrantsAndSimilarPaidAmt': 'PY Grants And Similar Paid Amt',
        'CYGrantsAndSimilarPaidAmt': 'CY Grants And Similar Paid Amt',
        'PYTotalProfFndrsngExpnsAmt': 'PY Total Prof Fndrsng Expns Amt',
        'CYTotalProfFndrsngExpnsAmt': 'CY Total Prof Fndrsng Expns Amt',
        'CYTotalFundraisingExpenseAmt': 'CY Total Fundraising Expense Amt',
        'TotalEmployeeCnt': 'Total Employee Count',
        'TotalVolunteersCnt': 'Total Volunteers Count',
        'PYContributionsRetained': 'PY Contributions Retained',
        'CYContributionsRetained': 'CY Contributions Retained'
    }
    df2 = df2.rename(columns=rename_dict)

    df2["level"] = pd.cut( df2["CY Contributions Grants Amt"],
        [-np.inf, 999_999.999, 2_499_999.999, 9_999_999.999,
               49_999_999.999, np.inf],
        labels=[
            '<b>Under $1M</b>', '<b>$1M-<$2.5M</b>', '<b>$2.5M-<$10M</b>',
            '<b>$10M-<$50M</b>','<b>$50M and Up</b>']
    )
    df2['NTEE1'] = df2['NTEE3'].str[0:1]

    dfg = df2.groupby(['NTEE1', 'NTEE3', 'SECTOR1', 'SECTOR']).agg(**{
        'PY Contributions Grants Amt': ('PY Contributions Grants Amt', 'sum'),
        'CY Contributions Grants Amt': ('CY Contributions Grants Amt', 'sum'),
        'Count': ('EIN', 'count')
    }).reset_index()

    dfg['CY Contributions Grants Growth'] = np.where(
        dfg['PY Contributions Grants Amt'] == 0, 0,
        (dfg['CY Contributions Grants Amt']
         - dfg['PY Contributions Grants Amt'])
        * 100.0 / dfg['PY Contributions Grants Amt']
    )
    dfg['CY Contributions Grants Growth'] = np.where(
        dfg['CY Contributions Grants Growth'] > 100, 100,
        dfg['CY Contributions Grants Growth']).astype(float)
    dfg['amount'] = dfg['CY Contributions Grants Amt']
    dfg['growth'] = dfg['CY Contributions Grants Growth']

    dfg['activity'] = '<b>' + dfg['SECTOR1'] + '</b>'
    dfg['sector'] = dfg['SECTOR'] + ' (' + dfg['NTEE3']  + ')'
    #display(dfg.head(10))
    #pio.renderers.default = m.renderer
    fig = px.treemap(
        dfg,
        #labels='NTEE3',
        path=[px.Constant("U.S. Electronic 990 Filers"), 'activity', 'sector'],
        values='amount',
        color='growth',
        color_continuous_scale=['red','white','green'],
        range_color=[-100, 100],
        custom_data=['Count','NTEE3']
    )
    fig.update_layout(margin = dict(t=0, l=0, r=0, b=0))
    fig.update_traces(marker_line_width = 0, tiling_pad = 1)
    # highlight current NTEE3
    labels = list(fig.data[0].labels)
    if ntee3 != '':
        high = df[df['NTEE3'] == ntee3]['SECTOR'].unique()[0] + f' ({ntee3})'
        lc = ['rgb(141, 141, 255)'
              if label == high else 'black' for label in labels]
        lw = [3
              if label == high else 0       for label in labels]
        fig.data[0].marker.line.color = lc
        fig.data[0].marker.line.width = lw
    fig.data[0].texttemplate = """\
%{label}<br>$%{value:,.0f}<br>%{customdata[0]:,.0f}<br><a
href="main_guide.html?sector=%{customdata[1]}"
style="cursor: help; color: blue;"
rel="noopener noreferrer">🔗</a>"""
    hovertemplate = """\
%{label}<br>$%{value:,.0f}<br>%{customdata[0]:,.0f}"""
    #fig.data[0].hovertemplate = hovertemplate
    fig.update_yaxes(range=[-100, 100])
    fig.update_traces(hovertemplate = hovertemplate)
    fig.update_traces(
        textfont=dict(size=12), # Font size for other levels
        insidetextfont=dict(size=12), # Font size for top level
        #maxdepth=1, # 0 means only the top level will have a different size
    )
    return fig

## _giving_treemap_nonprofits()
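
To keep each sector chart readable, this function gives an individual box only to the largest organizations and collapses the rest into a "Total" box per level. The cutoff selection can be sketched in plain Python as below. `pick_cutoff` is a hypothetical name, and the final fallback (taking the `threshold`-th largest amount) reflects an assumption about the intent rather than a statement of what the cell does.

```python
def pick_cutoff(amounts, threshold=800,
                tiers=(1_000_000, 2_500_000, 10_000_000, 50_000_000)):
    """Return the smallest amount that still earns an organization its own box."""
    if len(amounts) < threshold:
        return 0  # few enough organizations: show them all
    for tier in tiers:
        # use the first tier that leaves fewer than `threshold` boxes
        if sum(1 for amount in amounts if amount >= tier) < threshold:
            return tier
    # even the top tier is too crowded: keep only the `threshold` largest
    return sorted(amounts, reverse=True)[threshold - 1]


# a small threshold makes the behavior easy to follow
amounts = [5_000_000, 3_000_000, 1_500_000, 900_000, 200_000]
tiered = pick_cutoff(amounts, threshold=3)
everything = pick_cutoff([1, 2], threshold=3)
crowded = pick_cutoff([60_000_000, 70_000_000, 80_000_000, 90_000_000], threshold=3)
```

In the first call, three organizations reach the $1M tier, which is not below the threshold of 3, so the next tier is chosen.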

In [9]:
def _giving_treemap_nonprofits(df): # called by parallel process

    import numpy as np
    import pandas as pd
    import plotly.express as px

    ntee3 = df['NTEE3'].iloc[0]
    sector_desc = df['SECTOR'].iloc[0]

    # create groupings by contribution amount
    cy_amt_col = 'CYContributionsGrantsAmt'
    df["level"] = pd.cut(df[cy_amt_col],
        [-np.inf, 999_999.999, 2_499_999.999, 9_999_999.999,
               49_999_999.999, np.inf],
        labels=[
            '<b>Under $1M</b>', '<b>$1M-<$2.5M</b>', '<b>$2.5M-<$10M</b>',
            '<b>$10M-<$50M</b>','<b>$50M and Up</b>']
    )

    # limit to the largest organizations
    threshold = 800
    if df.shape[0] < threshold:
        cutoff = 0
    elif df[df[cy_amt_col] >=  1_000_000].shape[0] < threshold:
        cutoff =  1_000_000
    elif df[df[cy_amt_col] >=  2_500_000].shape[0] < threshold:
        cutoff =  2_500_000
    elif df[df[cy_amt_col] >= 10_000_000].shape[0] < threshold:
        cutoff = 10_000_000
    elif df[df[cy_amt_col] >= 50_000_000].shape[0] < threshold:
        cutoff = 50_000_000
    else:
        # smallest amount among the `threshold` largest organizations
        cutoff = df[cy_amt_col].nlargest(threshold).iloc[-1]
    df["nonprofit"] = np.where(df[cy_amt_col] >= cutoff,df["EIN"], "Total")

    # regroup to collapse smaller organizations into one box
    adf2 = df.groupby(['level','nonprofit']).agg(
        py_amount=('PYContributionsGrantsAmt', 'sum'),
        cy_amount=('CYContributionsGrantsAmt', 'sum'),
        count=('EIN', 'count'),
    )

    # calculate contribution growth
    adf2 = adf2[adf2["cy_amount"] > 0].copy()
    adf2['py_amount'] = adf2['py_amount'].round()
    adf2['cy_amount'] = adf2['cy_amount'].round()
    adf2["growth"] = np.where(adf2["py_amount"] == 0,0,
        ( adf2["cy_amount"] - adf2["py_amount"]) * 100.0
        / adf2["py_amount"])

    # cap growth to 100% for color balance
    adf2["growth"] = np.where(
        adf2["growth"] > 100, 100, adf2["growth"]).astype(float)

    adf2 = adf2.reset_index()

    adf2 = adf2.merge(
        df[['EIN','BusinessName', 'CITY', 'STATE']],
        left_on='nonprofit', right_on='EIN', how='left')
    # assign the result back; in-place fillna on a column selection is unreliable
    adf2['BusinessName'] = adf2['BusinessName'].fillna('')
    adf2['CITY'] = adf2['CITY'].fillna('')
    adf2['STATE'] = adf2['STATE'].fillna('')
    adf2['EIN'] = adf2['EIN'].fillna('Total')
    adf2["link"] = np.where(adf2["nonprofit"] == "Total","","🔗")

    # chart it in plotly!
    sector_desc = f'{sector_desc} ({ntee3})'
    fig = px.treemap(
        adf2,
        path=[px.Constant(sector_desc), 'level', 'EIN'],
        values='cy_amount',
        color='growth',
        color_continuous_scale=['red','white','green'],
        range_color=[-100, 100],
        color_continuous_midpoint=0,
        custom_data=['BusinessName', 'CITY', 'STATE', 'link']
    )

    fig.update_layout(margin = dict(t=0, l=0, r=0, b=0))
    fig.update_traces(marker_line_width = 0, tiling_pad = 1)
    labels = list(fig.data[0].labels)

    fig.data[0].hovertemplate = """\
%{customdata[0]}<br>$%{value:,.0f}
"""
    fig.data[0].texttemplate = """\
%{customdata[0]}<br>$%{value:,.0f}<br><a
href="https://projects.propublica.org/nonprofits/organizations/%{label}"
style="cursor: help; color: blue;" rel="noopener noreferrer"
>%{customdata[3]}</a>"""

    return fig

## _main_guide_html()

In [6]:
def _main_guide_html(spark):

    from IPython.core.display import HTML

    return HTML(f"""\
<!DOCTYPE html>
<html lang="en">
<head>
    <link rel="stylesheet"
    href="https://fonts.googleapis.com/css2?family=Open+Sans&display=swap" />
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>US Giving Treemap</title>
    <style>
        * {{ box-sizing: border-box; }}
        body, html {{ margin: 0; padding: 0; height: 100%; }}
        body {{ font-family: 'Open Sans', sans-serif; }}
        #container {{ display: flex; height: 100%; }}
        #left-sidebar {{
            display: flex; flex-direction: column;
            width: 300px; min-width: 300px; background-color: #ffffff;
            padding: 0px; height: 100%;
        }}
        #right-content {{
            flex-grow: 1; background-color: #e9e9e9; padding: 0px; height: 100%;
        }}
        #treemap-iframe {{ width: 99%; height: 99%; border: none; }}
        #header {{

        }}
        #narrative {{
            flex-grow: 1; background-color: pink;
            padding: 5px 10px 5px 10px; height: 300px; overflow-y: auto;
            background-color: #dddddd; line-height: 18px;
        }}
        .ntee, .description {{
            text-align: left;
        }}
        .description {{
            max-width: 148px;
            white-space: nowrap;
            overflow: hidden;
            text-overflow: ellipsis;
        }}
        .descriptionX:hover {{
            overflow: visible;
            white-space: normal;
            height:auto;  /* just added this line */
        }}
        .org_count, .org_amount {{
            text-align: right;
        }}
        .link {{
            text-align: center;
        }}
        #myTable {{
            font-size: 12px; border-collapse: collapse; width: 100%;
        }}
    </style>
</head>
<body>
    <div id="container">
        <div id="left-sidebar">
            <div id="header">
                <img height=64 style="margin-left: 5px;"
                    src="https://storage.googleapis.com/benevolentmachines/bm_full2.svg">
                <div style="margin-left: 10px; height: 28px;">
                    <b>US GIVING TREEMAP</b>
                    <button class="help" style="margin-left: 8px;" onclick="loadIframe('')">Help</button>
                    <button class="all" style="margin-left: 7px;" onclick="loadIframe('')">ALL</button>
                    </div>
            </div>
            <div id="narrative">

    {_sector_table(spark)}


            </div>
        </div>
        <div id="right-content">
            <iframe id="treemap-iframe" src=""></iframe>
        </div>
    </div>
    <script>
        function loadIframe(ntee3) {{
            var url = "main_treemap.html"
            if (ntee3 != '') {{
                url = "sector_treemap_" + ntee3 + ".html"
            }}
            var iframe = document.getElementById('treemap-iframe');
            if (iframe) {{
                iframe.src = url;
            }} else {{
                console.error('iframe with id "treemap-iframe" not found');
            }}
        }}
        window.addEventListener('resize', function() {{
            var iframe = document.getElementById('treemap-iframe');
            iframe.style.width = (window.innerWidth - 305) + 'px';
            iframe.style.height = (window.innerHeight - 5) + 'px';
        }});

        document.addEventListener("DOMContentLoaded", function() {{
            var queryString = window.location.search;
            var newUrl = 'main_treemap.html';
            var urlParams = new URLSearchParams(queryString);
            if (urlParams.has('sector')) {{
                var sector = urlParams.get('sector');
                var sector3 = sector ? sector.substring(0, 3) : '';
                var newUrl = 'sector_treemap_'
                    + encodeURIComponent(sector3) + '.html';
            }}
            var iframe = document.getElementById('treemap-iframe');
            if (iframe) {{
                iframe.src = newUrl;
            }} else {{
                console.error('iframe with id "treemap-iframe" not found');
            }}

            var headers = document.querySelectorAll("#myTable th");
            var table = document.querySelector("#myTable");
            var rows = Array.from(table.rows).slice(1); // Exclude the header row

            headers.forEach(function(header, index) {{
                header.addEventListener("click", function() {{
                    var sortedRows = rows.sort(function(a, b) {{
                        var aValue = a.cells[index].innerText;
                        var bValue = b.cells[index].innerText;
                        if (!isNaN(aValue) && !isNaN(bValue)) {{
                            return aValue - bValue;
                        }}
                        return aValue.localeCompare(bValue);
                    }});

                    // Toggle between ascending and descending
                    if (header.getAttribute("data-order") === "asc") {{
                        sortedRows.reverse();
                        header.setAttribute("data-order", "desc");
                    }} else {{
                        header.setAttribute("data-order", "asc");
                    }}

                    // Append the sorted rows to the table
                    sortedRows.forEach(function(row) {{
                        table.tBodies[0].appendChild(row);
                    }});
                }});
            }});
        }});
    </script>
</body>
</html>
""")

## _treemap_replaces()

In [7]:
def _treemap_replaces(s):
    s = s.replace('<body>', '<body style="margin:0; overflow:hidden">')
    s = s.replace('</head>',
        '<style>.svg-container { overflow: hidden; }</style></head>')
    return s

## us_giving_treemaps()
This function is exposed to the calling Python context.  It generates the static HTML files with embedded Plotly charts that work together to make an interactive analytics output called a **Data Garden**.  Because it can be invoked through datamachine, the function can be injected into another notebook that participates in a backend data pipeline.
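
The per-sector files are produced by grouping rows by key and rendering one file per group in parallel. The sketch below shows that pattern on a single machine with a thread pool from the standard library in place of Spark, so it is a local analog and not the Spark code in the next cell. The row layout, `render_group`, and `render_all` are assumptions made for the example.

```python
import tempfile
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def render_group(folder, key, rows):
    """Write one small HTML file for a group and return its path."""
    path = Path(folder) / f"sector_treemap_{key}.html"
    total = sum(amount for _, amount in rows)
    html = f"<html><body>{key}: {len(rows)} rows, total {total}</body></html>"
    path.write_text(html, encoding="utf-8")
    return str(path)


def render_all(rows, folder):
    """Group (key, amount) rows by key and render each group concurrently."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(render_group, folder, key, group)
                   for key, group in groups.items()]
        return sorted(future.result() for future in futures)


rows = [("A20", 100), ("B40", 250), ("A20", 50)]  # hypothetical (NTEE3, amount) rows
out_folder = tempfile.mkdtemp()
paths = render_all(rows, out_folder)
```

Each group is independent of the others, which is what makes the work easy to distribute.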



In [14]:
def us_giving_treemaps(spark, folder):

    import os
    import shutil
    import pandas as pd

    print("draft_giving_treemap() ->")
    if os.path.isdir(folder):
        shutil.rmtree(folder)
    os.makedirs(folder)

    with open(f'{folder}/main_guide.html', 'w', encoding='utf_8') as f:
        f.write(_main_guide_html(spark).data)
        print('main_guide.html written')

    with open(f'{folder}/main_treemap.html', 'w', encoding='utf_8') as f:
        fig = _giving_treemap_sectors(spark, kind='growth',ntee3='')
        #fig.write_image("publish/_main.jpg", width=1200, height=628) # linkedin
        src = fig.to_html(full_html=True, include_plotlyjs='cdn')
        #print(fig.to_json(pretty=True))
        f.write(_treemap_replaces(src))
        print('main_treemap.html written')

    chart_df = _Data.get_eo990spark(spark)
    #df = chart_df.repartition('NTEE3')
    rdd = chart_df.rdd.keyBy(lambda row: row.NTEE3)
    grouped_rdd = rdd.groupByKey()
    print("Number of NTEE3: ", grouped_rdd.count())
    #print("List of NTEE3s: ", sorted(grouped_rdd.keys().collect()))
    COLUMNS = spark.sparkContext.broadcast(chart_df.columns)

    def process_group(iterator):
        pandas_df = pd.DataFrame(list(iterator), columns=COLUMNS.value)
        ntee3 = pandas_df['NTEE3'].iloc[0]

        fig = _giving_treemap_nonprofits(pandas_df) # create chart figure
        path = f'{folder}/sector_treemap_{ntee3}.html'
        with open(path, 'w', encoding='utf_8') as f:
            #f.write("testing")
            src = fig.to_html(full_html=True, include_plotlyjs='cdn')
            f.write(_treemap_replaces(src))

        return path

    result_rdd = grouped_rdd.mapValues(process_group)
    collected_results = result_rdd.collect()

    status = 0
    return status # will be passed via datamachine ??

# TESTS
This part sets up the packages, data, and test runs needed to operate the module for development purposes.  The `#test` comment at the top of each cell excludes the cell from execution in the calling context.  However, the contents of these cells are still imported in commented form, so do not put sensitive information here or anywhere else in this notebook.
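
One way to picture this behavior is a filter that comments out marked cells before the notebook source is executed. The sketch below is purely illustrative and is not datamachine's actual implementation; `module_source` and the sample cells are hypothetical.

```python
def module_source(cells, marker="#test"):
    """Join cell sources into module source, commenting out marked cells."""
    parts = []
    for cell in cells:
        if cell.lstrip().startswith(marker):
            # keep the text but disable it, so it survives only as comments
            cell = "\n".join("# " + line for line in cell.splitlines())
        parts.append(cell)
    return "\n\n".join(parts)


cells = [
    "def double(x):\n    return 2 * x",
    "#test\nprint(double(21))",
]
source = module_source(cells)
namespace = {}
exec(source, namespace)  # defines double(); the test cell does not run
result = namespace["double"](21)
```

The test cell's text is still present in `source`, which is why anything written in a test cell should be treated as visible to whoever imports the notebook.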

## Install packages

In [None]:
#test
%pip install pyspark==3.5.0
%pip install datamachine

## Imports

In [12]:
#test
import io
import shutil
import zipfile
import requests
from pyspark.sql import SparkSession
import datamachine as dm

## Load data

In [13]:
#test
use_local_file = False
filename = "us_giving_treemap_data.zip"
if use_local_file is True:
    print(f"loading from local file: {filename}")
    zip_archive = zipfile.ZipFile(filename, 'r')
else:
    bucket = "https://storage.googleapis.com/benevolentmachines"
    blob = f"{bucket}/{filename}"
    print("loading from online bucket")
    response = requests.get(blob)
    zip_archive = zipfile.ZipFile(io.BytesIO(response.content))
zip_archive.extractall()
spark = SparkSession.builder.appName("us-giving-treemap").getOrCreate()
loaded_gcst_df = spark.read.parquet("transfer_gcst")
loaded_eo990_df = spark.read.parquet("transfer_eo990")
loaded_gcst_df.createOrReplaceTempView("gcst")
loaded_eo990_df.createOrReplaceTempView("eo990")

loading from online bucket


## Test Module

In [15]:
#test
nb = "https://colab.research.google.com/drive/1V_WVv0lrYOhFDvJ5NlLpP_idsOJMAmbk"
nbm = dm.import_notebook(nb)  # loads this notebook
status = nbm.us_giving_treemaps(spark, folder="draft")

draft_giving_treemap() ->
main_guide.html written
main_treemap.html written
Number of NTEE3:  758


## Zip Test Results
This cell creates a `draft.zip` file containing the outputs that you can download and browse on your own computer.

In [16]:
#test
shutil.make_archive(base_name="draft", format="zip", root_dir="draft")

'/content/draft.zip'