In [98]:
#| echo: false
from IPython.display import FileLink, IFrame

Create an HTML file showing how values have changed across two text lists

![](visual_text_diff.png)

As the colors show, we can see what was deleted, added, or changed.

In this case the first four meta descriptions were completely changed (red became green).
In the last three we have a much more nuanced and interesting diff. It shows us exactly what was changed/deleted between the first and second crawls.

On row 6, "April" became "October" and "closed" became "open" for example. Note that the highlighting is per character:

`closed`  
`open`

The "o" and "e" are common in both strings, so they are not highlighted. The other letters are.
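
To see why only some letters are highlighted, here is a minimal sketch using `difflib.SequenceMatcher` from the standard library. It illustrates character-level matching in general and is not the exact code path `HtmlDiff` runs, so the highlighting in the generated file may differ in detail.

```python
from difflib import SequenceMatcher

old, new = "closed", "open"

# Compare the two words character by character.
matcher = SequenceMatcher(None, old, new)

# Characters shared by both words, in order; these would stay unhighlighted.
common = [old[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]

# Everything else is reported as a delete, insert, or replace operation.
edits = [
    (tag, old[i1:i2], new[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]

print(common)
print(edits)
```

The matched characters are the ones left plain in the diff, while the remaining spans are the ones that get a color.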

Note that hovering over the row numbers will show you the URL that is being compared. They are also clickable, in case you want to actually visit the page.

![](diff_mouseover.png)
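
The clickable row numbers come from a plain text substitution on the generated HTML. The sketch below shows the idea on a single hand-written table cell. The fragment, the URL, and the use of `html.escape` are assumptions for illustration; the real markup produced by `HtmlDiff` carries more attributes.

```python
import re
from html import escape

# Hand-written fragment resembling one diff row; the real markup may differ.
row = '<td class="diff_header">1</td><td>old text</td>'
url = "https://example.com/page?a=1&b=2"  # hypothetical URL

# Escape &, <, > and quotes so the URL is safe inside an HTML attribute.
safe_url = escape(url)
anchor = f'><a href="{safe_url}" title="{safe_url}"><b>1</b></a></td'

# A function replacement keeps re.sub from interpreting backslashes in the URL.
linked = re.sub(r">1</td", lambda _match: anchor, row)
print(linked)
```

The `title` attribute is what the browser shows on hover, and the `href` makes the number a link to the compared page.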


## Some examples to explore:

In [95]:
#| echo: false
display(FileLink('meta_desc_diff.html'))
display(FileLink('h2_diff.html'))
display(FileLink('title_diff.html'))
display(FileLink('og_description_diff.html'))

In [None]:
import re
from difflib import HtmlDiff
import pandas as pd
pd.options.display.max_columns = None
df1 = pd.read_parquet('nasa_crawl.parquet')
df2 = pd.read_parquet('nasa_crawl3.parquet')

def compare(df1, df2, column, keep_equal=False):
    compare_df = pd.merge(
        df1[["url", column]], df2[["url", column]], left_on="url", right_on="url"
    ).assign(changed=lambda df: df[f"{column}_x"].ne(df[f"{column}_y"]))
    if ("int" in str(df1[column].dtype).lower()) or (
        "float" in str(df1[column].dtype).lower()
    ):
        compare_df["diff"] = compare_df[f"{column}_y"].sub(compare_df[f"{column}_x"])
        compare_df["diff_perc"] = compare_df["diff"].div(compare_df[f"{column}_x"])
    compare_df = compare_df.dropna(thresh=compare_df.shape[1])
    if keep_equal:
        return compare_df.reset_index(drop=True)
    else:
        return (
            compare_df[compare_df["changed"]]
            .drop("changed", axis=1)
            .reset_index(drop=True)
        )

legend_table = """
    <table>
        <tbody>
            <tr>
                <td><strong>Colors:</strong></td>
                <td style="background-color:hsl(120,100%,83%); border-style: solid; border-color: #efefef;"><span>Added</span></td>
                <td style="background-color:hsl(60,100%,73%); border-style: solid; border-color: #efefef;"><span>Changed</span></td>
                <td style="background-color:hsl(0,100%,83%); border-style: solid; border-color: #efefef;"><span>Deleted</span></td>
            </tr>
        </tbody>
    </table>
    <br>
"""

table_style = """
<style type="text/css">
    table.diff {
    font-family:Menlo;
    border:medium;
    width:97%;

    }
    tr {
  border-bottom: 1px solid #efefef;
}
    table td {
    padding: 2px;
    word-break: break-word;
    }
    .diff_header {background-color:#e0e0e0}
    td.diff_header {text-align:right}
    .diff_next {background-color:#c0c0c0; word-wrap: normal; word-break: break-word;}
    .diff_add {background-color:#aaffaa ; word-wrap: normal; word-break: break-word;}
    .diff_chg {background-color:#ffff77 ; word-wrap: normal; word-break: break-word;}
    .diff_sub {background-color:#ffaaaa ; word-wrap: normal; word-break: break-word;}

table td a:hover:after {
  content: attr(data-title);
  position: absolute;
  font: 10px verdana;
  top: -110%;
  left: 0;
  background: #ace;
  color: black;
  box-sizing: border-box;
  border: 1px solid gray;
  border-radius: 20%;
  padding: 3px;
}
</style>
"""


def diff(diff_df, output_file):
    header1 = diff_df.columns[1].rsplit('_', maxsplit=1)[0] + ' X'
    header2 = diff_df.columns[1].rsplit('_', maxsplit=1)[0] + ' Y'

    htmldiff = HtmlDiff()
    html_str = htmldiff.make_file(
        fromdesc=header1,
        todesc=header2,
        fromlines=diff_df.iloc[:, 1],
        tolines=diff_df.iloc[:, 2]
    )
    html_str = re.sub('<style type=.*</style>', table_style, html_str, flags=re.DOTALL)
    html_str = re.sub(' nowrap="nowrap"', '', html_str, flags=re.DOTALL)
    html_str = re.sub('<body>', '<body><div align="center">', html_str, flags=re.DOTALL)
    html_str = re.sub('</body>', '</div></body>', html_str, flags=re.DOTALL)
    html_str = re.sub('<table class="diff" summary="Legends">.*</table>', '', html_str, flags=re.DOTALL)

    for i, url in enumerate(diff_df['url'], start=1):
        html_str = re.sub(f'>{i}</td', f'><a href="{url}" title="{url}"><b>{i}</b></a></td', html_str, flags=re.DOTALL)
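    # HtmlDiff declares utf-8 in the generated page by default, so write the file with the same encoding
    with open(output_file, 'w', encoding='utf-8') as htmlfile: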
        print(legend_table + html_str, file=htmlfile)

diff(compare(df1, df2, 'meta_desc'), output_file= 'meta_desc_diff.html')

## Usage

First create your comparison DataFrame with the `compare` function:

In [94]:
#| code-fold: false
comparison_df = compare(df1, df2, column='meta_desc')
comparison_df

|   | url | meta_desc_x | meta_desc_y |
|---|-----|-------------|-------------|
| 0 | https://www.nasa.gov/stem-content/amateur-radi... | ARISS-US is accepting proposals from U.S. scho... | Students have the opportunity to learn about s... |
| 1 | https://www.nasa.gov/womens-history-month/ | https://www.youtube.com/watch?v=5VPxyMmQRwA ht... | https://www.youtube.com/watch?v=5VPxyMmQRwA |
| 2 | https://www.nasa.gov/procurement/ | Upcoming 2024 Leadership Engagements Date Even... | Upcoming 2024 Leadership Engagements Date Even... |
| 3 | https://www.nasa.gov/reference/lsp-primary-lau... | Mission: PsycheVehicle: SpaceX Falcon HeavyLau... | Mission: PACEVehicle: SpaceX Falcon 9Launch Si... |
| 4 | https://www.nasa.gov/foia/foia-reports/chief-f... | Current Chief FOIA Officer Report Chief FOIA O... | Current Chief FOIA Officer Report Chief FOIA O... |
| 5 | https://www.nasa.gov/directorates/esdmd/hhp/ae... | The application window for the April 2024 sess... | The application window for the October 2024 se... |
| 6 | https://science.nasa.gov/mission/kepler/in-depth | Key Facts Nation United States of America (USA... | Key Facts Nation United States of America (USA... |


### What the function does:
1. Get the common URLs in both crawls
2. Compare and only display the values that have changed in the selected `column`
3. Optionally set `keep_equal=True` if you want to get all values, even if they are the same. This can be useful in showing which URLs are common across both crawls.
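
The steps above can be sketched on toy data. The two small DataFrames below, their URLs, and the `title` column are made up for illustration; only the merge-then-filter pattern mirrors what `compare` does.

```python
import pandas as pd

# Two hypothetical crawls that share two of their three URLs.
first = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    "title": ["Home", "About", "Contact"],
})
second = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b", "https://example.com/d"],
    "title": ["Home", "About us", "Blog"],
})

# Step 1: an inner merge keeps only the URLs present in both crawls.
merged = pd.merge(first, second, on="url")

# Step 2: flag the rows where the value differs between the crawls.
merged["changed"] = merged["title_x"].ne(merged["title_y"])

# Step 3: keep only the changed rows (the default, keep_equal=False behavior).
changed_only = merged[merged["changed"]].drop(columns="changed").reset_index(drop=True)
print(changed_only)
```

Here `merged` corresponds to what `keep_equal=True` would return, and `changed_only` to the default output.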

Comparing URLs will have its own functionality later, as this is a crucial aspect of comparing crawls.

Now that we have our `comparison_df` we can simply feed it to the `diff` function:

In [97]:
#| code-fold: false
diff(comparison_df, output_file='meta_description_diff.html')

Display as an iframe within the notebook, or open as a standalone HTML document in your browser:

In [96]:
#| echo: false
IFrame(src="meta_description_diff.html", width=1200, height=700)
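
For reference, the standard-library step that `diff` builds on can be tried on its own. This is a minimal sketch, assuming two short hand-written lists in place of the crawl columns; the strings and the header labels are made up for illustration, and none of the styling or link substitutions from `diff` are applied.

```python
from difflib import HtmlDiff

# Hypothetical values standing in for the same column from two crawls.
first_crawl = ["The application window is closed in April 2024", "Unchanged line"]
second_crawl = ["The application window is open in October 2024", "Unchanged line"]

# make_file returns a complete HTML document as a string,
# with a side-by-side table and difflib's default styles and legend.
html = HtmlDiff().make_file(
    fromlines=first_crawl,
    tolines=second_crawl,
    fromdesc="first_crawl",
    todesc="second_crawl",
)

print(len(html))
```

Writing `html` to a file and opening it in a browser shows the unstyled default table, which makes it easier to see what the custom CSS and substitutions in `diff` change.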