# Text Similarity

*EarlyPrint*'s [Discovery Engine](https://earlyprint.org/lab/tool_discovery_engine.html?which_to_do=find_texts&eebo_tcp_id=A43441&n_results=35&tfidf_weight=6&mallet_weight=6&tag_weight=6) allows you to find a set of texts similar to any text in our corpus. It does this by using some basic measures of text similarity, and it's easy to use if you're interested in finding similar texts across the entire early modern corpus.

But you might be interested in finding similarity across a smaller subset of the corpus. In this tutorial, we'll calculate similarity across the same set of 1666 texts that we used in the [TF-IDF tutorial](https://earlyprint.org/jupyterbook/tf_idf.html). You could easily do the same with any subset of texts that you've gathered using the [Metadata tutorial](https://earlyprint.org/jupyterbook/metadata.html).

This tutorial is meant as a companion to an explanation of text similarity that I wrote for *The Programming Historian*:

> [Understanding and Using Common Similarity Measures for Text Analysis](https://programminghistorian.org/en/lessons/common-similarity-measures)

The article uses the same 1666 corpus as its example, but here we'll work directly with the *EarlyPrint* XML instead of with plaintext files. For full explanations of the different similarity measures and how they're used, please use that piece as a guide.

First, we'll import the necessary libraries. [n.b. In the *Programming Historian* tutorial, I use `scipy`'s implementation of pairwise distances. For simplicity's sake, here we're using scikit-learn's built-in `pairwise_distances` function.]

In [1]:
import glob
import pandas as pd
from lxml import etree
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import pairwise_distances
from collections import Counter

Next we use `glob` to get our list of files and isolate the filekeys to use later. This is the complete list of texts we're working with in this example. You may have a different directory or filepath for your own files.

In [2]:
# Use the glob library to create a list of file names
filenames = glob.glob("1666_texts/*.xml")
# Parse those filenames to create a list of file keys (ID numbers)
# You'll use these later on.
filekeys = [f.split('/')[-1].split('.')[0] for f in filenames]
print(filekeys)

['B02845', 'A51130', 'A36358', 'A28171', 'A51877', 'A60482', 'A32566', 'A35206', 'A35114', 'A32207', 'A39345', 'A25743', 'A86466', 'A61929', 'B03114', 'A32916', 'A70852', 'B01661', 'A61594', 'A35608', 'A64861', 'A61503', 'A79302', 'A62436', 'A38556', 'A32751', 'A63370', 'A57484', 'A92820', 'A39246', 'A87622', 'A66752', 'A26426', 'A26249', 'A55410', 'A46087', 'A31237', 'A61867', 'A61891', 'B05835', 'A28989', 'A31124', 'A80818', 'A65296', 'A30203', 'A55387', 'A59325', 'B06022', 'A56381', 'A61600', 'A66777', 'A39714', 'A44801', 'A71109', 'A49213', 'A43020', 'A45206', 'A95690', 'A60606', 'A23770', 'A52519', 'A44938', 'A64258', 'A70867', 'A35851', 'A56390', 'B02572', 'A91186', 'A59229', 'B05308', 'A30143', 'A46046', 'B03376', 'B03317', 'A47095', 'B01318', 'B03106', 'A44879', 'A54070', 'A70287', 'A28209', 'B04153', 'A29017', 'A70866', 'A47367', 'A44334', 'B03109', 'B02123', 'A42533', 'A42537', 'A44627', 'A93280', 'A38792', 'B06375', 'A67572', 'A46030', 'A32581', 'A44478', 'A47379', 'A41072',
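
If your file paths look different (on Windows, for instance, `glob` may return paths with backslashes, which the `split('/')` approach above does not handle), one alternative is to let `pathlib` extract the file key. This is only a sketch: the two example paths are copied from the output above, and the names `example_filenames` and `example_filekeys` are hypothetical.

```python
from pathlib import Path

# Two example paths, shaped like the ones glob returns above
example_filenames = ["1666_texts/B02845.xml", "1666_texts/A51130.xml"]

# Path(...).stem drops the directory and the final extension,
# using whatever path separator the current operating system expects
example_filekeys = [Path(f).stem for f in example_filenames]
print(example_filekeys)
```

In the cell above, this would mean replacing the list comprehension with `[Path(f).stem for f in filenames]`.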

## Get Features

In order to measure similarity between texts, you need features of those texts to measure. The [Discovery Engine](https://earlyprint.org/lab/tool_discovery_engine.html?which_to_do=find_texts&eebo_tcp_id=A43441&n_results=35&tfidf_weight=6&mallet_weight=6&tag_weight=6) calculates similarity across three distinct sets of features for the same texts: TF-IDF weights for word counts, LDA Topic Modeling results, and XML tag structures. As our example here, we'll use TF-IDF.

The code below is taken directly from the [TF-IDF Tutorial](https://earlyprint.org/jupyterbook/tf_idf.html), where you'll find a full explanation of what it does. We loop through each text, extract words, count them, and convert those counts to TF-IDF values. There's one key difference: below we use [L2 normalization](https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm) on our TF-IDF transformation. Normalizing values helps us account for very long or very short texts that may skew our similarity results.
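
To see why normalization matters, here is a minimal sketch with made-up counts (not drawn from the corpus). It uses scikit-learn's `normalize` function to apply the same kind of row-wise L2 normalization that `TfidfTransformer(norm='l2')` applies after weighting. The two toy "texts" use words in the same proportions, but one is ten times longer than the other.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical word counts: same proportions, very different lengths
counts = np.array([
    [3.0, 4.0, 0.0],    # a short text
    [30.0, 40.0, 0.0],  # a long text with ten times the counts
])

# L2 normalization divides each row by its Euclidean length
normalized = normalize(counts, norm="l2")

print(normalized)                          # both rows are now identical
print(np.linalg.norm(normalized, axis=1))  # each row has length 1
```

After normalization the two rows are the same, so a distance measure no longer treats the texts as different simply because one is longer.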

In [3]:
# Create an empty list to put all our texts into
all_tokenized = []

# Then you can loop through the files
for f in filenames:
    parser = etree.XMLParser(collect_ids=False) # Create a parser object that skips XML IDs (in this case they just slow things down)
    tree = etree.parse(f, parser) # Parse each file into an XML tree
    xml = tree.getroot() # Get the XML from that tree
    
    # Now we can use lxml to find all the w tags       
    word_tags = xml.findall(".//{*}w")
    # In this next line you'll do several things at once to create a list of words for each text
    # 1. Loop through each word: for word in word_tags
    # 2. Make sure the tag has a word at all: if word.text is not None
    # 3. Get the regularized form of the word: word.get('reg', word.text)
    # 4. Make sure all the words are in lowercase: .lower()
    words = [word.get('reg', word.text).lower() for word in word_tags if word.text is not None]
    # Then we add these results to a master list
    all_tokenized.append(words)
    
# We can count all the words in each text in one line of code
all_counted = [Counter(a) for a in all_tokenized]

# To prepare this data for TF-IDF transformation, we need to put it into a different form, a DataFrame, using pandas.
df = pd.DataFrame(all_counted, index=filekeys).fillna(0)

# First we need to create an "instance" of the transformer, with the proper settings.
# Normalization is set to 'l2'
tfidf = TfidfTransformer(norm='l2', sublinear_tf=True)
# I am choosing to turn on sublinear term frequency scaling, which takes the log of
# term frequencies and can help to de-emphasize function words like pronouns and articles. 
# You might make a different choice depending on your corpus.

# Once we've created the instance, we can "transform" our counts
results = tfidf.fit_transform(df)

# Make results readable using Pandas
readable_results = pd.DataFrame(results.toarray(), index=df.index, columns=df.columns) # Convert information back to a DataFrame
readable_results

,the,dutch,gazette,or,sheet,of,wildfire,that,fired,fleet,...,hip-gouts,fistulas,sacrolumbi,tennis-balls,cocks-stones,brawn,anomalcus',over-precise,vindicator,astel
B02845,0.064873,0.082339,0.062394,0.031587,0.051094,0.059271,0.101218,0.052543,0.044035,0.035789,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
A51130,0.035911,0.032350,0.000000,0.026113,0.000000,0.033307,0.000000,0.031193,0.000000,0.012859,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
A36358,0.034892,0.013093,0.000000,0.026334,0.000000,0.033336,0.000000,0.026627,0.000000,0.013580,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
A28171,0.019985,0.000000,0.000000,0.015833,0.000000,0.019550,0.000000,0.018441,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
A51877,0.083391,0.000000,0.000000,0.023618,0.000000,0.066251,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
A60948,0.043362,0.000000,0.000000,0.028516,0.000000,0.041129,0.000000,0.038093,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
A53818,0.116119,0.000000,0.000000,0.064717,0.000000,0.107895,0.000000,0.055733,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
A57156,0.038091,0.000000,0.000000,0.026983,0.000000,0.037324,0.000000,0.031922,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
A65985,0.026458,0.000000,0.000000,0.021861,0.000000,0.024689,0.000000,0.024865,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


## Calculate Distance

Below we'll calculate three different distance metrics (Euclidean distance, "cityblock" distance, and cosine distance) and create a DataFrame for each one. For explanations of each metric, and for a discussion of the difference between similarity and distance, you can refer to the [*Programming Historian* tutorial](https://programminghistorian.org/en/lessons/common-similarity-measures), which goes into these topics in detail.

Euclidean distance is first, because it's the default metric for `sklearn`'s `pairwise_distances` function:
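
Before running these on the full corpus, it can help to try the three metrics on a tiny example. The sketch below uses three made-up, L2-normalized vectors; the variable names (`toy`, `toy_euclidean`, and so on) are hypothetical and not part of the tutorial's code.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import normalize

# Hypothetical feature vectors for three texts, L2-normalized by row
toy = normalize(np.array([
    [1.0, 2.0, 0.0],
    [2.0, 1.0, 1.0],
    [0.0, 0.0, 3.0],
]), norm="l2")

# One square matrix of pairwise distances per metric
toy_euclidean = pairwise_distances(toy, metric="euclidean")
toy_cityblock = pairwise_distances(toy, metric="cityblock")
toy_cosine = pairwise_distances(toy, metric="cosine")

# For unit-length rows, squared Euclidean distance is twice the cosine distance
same_ranking = np.allclose(toy_euclidean ** 2, 2 * toy_cosine)
print(same_ranking)
```

Because the TF-IDF rows in this tutorial are L2-normalized, the same relationship shows up in the results below: the Euclidean distance between B02845 and A51130 is 1.276904, and its square (about 1.6305) is twice their cosine distance of 0.815242. Euclidean and cosine distance will therefore rank texts in the same order here, while cityblock distance can rank them differently.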

In [4]:
euclidean = pairwise_distances(results)
euclidean_df = pd.DataFrame(euclidean, index=df.index, columns=df.index)
euclidean_df

,B02845,A51130,A36358,A28171,A51877,A60482,A32566,A35206,A35114,A32207,...,A59614,A53049,A32567,A38630,A32559,A60948,A53818,A57156,A65985,A41955
B02845,0.000000,1.276904,1.282336,1.327388,1.385534,1.334611,1.338054,1.285088,1.357166,1.338225,...,1.310353,1.323486,1.338434,1.234394,1.346497,1.310042,1.346778,1.320946,1.296652,1.313003
A51130,1.276904,0.000000,1.200168,1.197267,1.380927,1.244560,1.325529,1.246744,1.300878,1.318610,...,1.285184,1.209513,1.317437,1.229994,1.336612,1.218881,1.346623,1.226397,1.197699,1.221007
A36358,1.282336,1.200168,0.000000,1.229307,1.374540,1.276676,1.330634,1.255257,1.319624,1.333068,...,1.302456,1.247914,1.332685,1.258395,1.339762,1.247283,1.356161,1.249754,1.203273,1.266901
A28171,1.327388,1.197267,1.229307,0.000000,1.377672,1.108460,1.343706,1.276891,1.237915,1.336067,...,1.294902,1.098583,1.332020,1.289052,1.350955,1.172351,1.350303,1.132169,1.137134,1.186429
A51877,1.385534,1.380927,1.374540,1.377672,0.000000,1.382244,1.380671,1.364325,1.382631,1.364649,...,1.362772,1.384687,1.322868,1.376365,1.373189,1.370114,1.380516,1.375525,1.386100,1.371331
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
A60948,1.310042,1.218881,1.247283,1.172351,1.370114,1.236151,1.313828,1.274511,1.293992,1.311999,...,1.273916,1.220130,1.307373,1.277805,1.329701,0.000000,1.333518,1.195987,1.235246,1.221432
A53818,1.346778,1.346623,1.356161,1.350303,1.380516,1.360853,1.313472,1.335803,1.371737,1.310435,...,1.327658,1.363753,1.300872,1.345072,1.345671,1.333518,0.000000,1.342397,1.358598,1.337482
A57156,1.320946,1.226397,1.249754,1.132169,1.375525,1.216030,1.327618,1.286547,1.274297,1.311981,...,1.291568,1.230670,1.311937,1.289614,1.338420,1.195987,1.342397,0.000000,1.211370,1.222339
A65985,1.296652,1.197699,1.203273,1.137134,1.386100,1.233905,1.345823,1.261121,1.305451,1.333307,...,1.301967,1.222347,1.332329,1.264718,1.353246,1.235246,1.358598,1.211370,0.000000,1.232445


Next is cityblock distance:

In [5]:
cityblock = pairwise_distances(results, metric='cityblock')
cityblock_df = pd.DataFrame(cityblock, index=df.index, columns=df.index)
cityblock_df

,B02845,A51130,A36358,A28171,A51877,A60482,A32566,A35206,A35114,A32207,...,A59614,A53049,A32567,A38630,A32559,A60948,A53818,A57156,A65985,A41955
B02845,0.000000,55.729687,51.974010,87.676697,28.857979,88.715280,27.394476,38.905840,85.810841,29.641572,...,34.261523,83.708689,29.228824,31.819858,28.179513,51.192696,26.983699,55.597272,65.165487,64.106420
A51130,55.729687,0.000000,64.822780,91.902515,55.152083,97.926767,53.550342,59.311930,101.112287,55.056478,...,57.298261,89.232443,54.486365,55.686694,54.017029,64.833257,53.223271,67.947342,74.297465,74.267718
A36358,51.974010,64.822780,0.000000,92.879475,50.456637,98.423786,49.290019,55.873211,99.577443,51.205557,...,54.612407,90.693265,50.681598,52.951746,49.793157,63.913262,49.260386,66.967821,72.675023,75.664139
A28171,87.676697,91.902515,92.879475,0.000000,84.423598,97.693334,83.580663,89.582403,113.942085,85.058437,...,86.391556,93.869316,84.456824,87.392112,83.996037,86.722342,83.022954,84.980525,91.087179,92.806112
A51877,28.857979,55.152083,50.456637,84.423598,0.000000,84.998300,21.284498,36.366100,80.559471,22.910382,...,30.387239,80.807050,21.801254,31.711135,21.450491,48.118558,20.612170,52.296205,63.950709,61.429944
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
A60948,51.192696,64.833257,63.913262,86.722342,48.118558,93.510020,46.450501,55.058915,95.593820,47.952769,...,50.148169,87.239523,47.347027,51.897636,46.965439,0.000000,46.041379,61.250843,74.062192,70.586691
A53818,26.983699,53.223271,49.260386,83.022954,20.612170,83.814390,18.472033,34.861591,79.653275,20.768595,...,28.371200,79.604647,19.847619,29.867763,20.029364,46.041379,0.000000,50.393832,62.340451,59.569190
A57156,55.597272,67.947342,66.967821,84.980525,52.296205,93.743856,50.875350,59.830095,96.076857,51.987987,...,55.057412,90.299650,51.697250,56.405675,51.460379,61.250843,50.393832,0.000000,74.114647,72.968215
A65985,65.165487,74.297465,72.675023,91.087179,63.950709,102.454809,62.790934,68.307826,107.964458,64.138075,...,66.734171,96.003553,63.719435,65.396731,63.271991,74.062192,62.340451,74.114647,0.000000,81.737142


And finally cosine distance, which is usually (but not always) preferable for text similarity:

In [6]:
cosine = pairwise_distances(results, metric='cosine')
cosine_df = pd.DataFrame(cosine, index=df.index, columns=df.index)
cosine_df

,B02845,A51130,A36358,A28171,A51877,A60482,A32566,A35206,A35114,A32207,...,A59614,A53049,A32567,A38630,A32559,A60948,A53818,A57156,A65985,A41955
B02845,0.000000,0.815242,0.822193,0.880979,0.959852,0.890593,0.895194,0.825725,0.920950,0.895423,...,0.858512,0.875808,0.895703,0.761865,0.906527,0.858105,0.906906,0.872449,0.840653,0.861988
A51130,0.815242,0.000000,0.720202,0.716725,0.953480,0.774465,0.878513,0.777185,0.846142,0.869367,...,0.825849,0.731461,0.867820,0.756443,0.893266,0.742836,0.906697,0.752025,0.717242,0.745429
A36358,0.822193,0.720202,0.000000,0.755597,0.944681,0.814951,0.885293,0.787835,0.870704,0.888536,...,0.848196,0.778645,0.888025,0.791779,0.897481,0.777857,0.919586,0.780943,0.723933,0.802519
A28171,0.880979,0.716725,0.755597,0.000000,0.948989,0.614342,0.902773,0.815225,0.766216,0.892537,...,0.838385,0.603443,0.887138,0.830828,0.912540,0.687204,0.911659,0.640904,0.646536,0.703807
A51877,0.959852,0.953480,0.944681,0.948989,0.000000,0.955299,0.953126,0.930692,0.955834,0.931133,...,0.928574,0.958679,0.874990,0.947190,0.942824,0.938607,0.952913,0.946034,0.960637,0.940274
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
A60948,0.858105,0.742836,0.777857,0.687204,0.938607,0.764035,0.863072,0.812189,0.837208,0.860671,...,0.811431,0.744358,0.854613,0.816393,0.884052,0.000000,0.889135,0.715193,0.762917,0.745949
A53818,0.906906,0.906697,0.919586,0.911659,0.952913,0.925960,0.862604,0.892185,0.940831,0.858620,...,0.881337,0.929910,0.846134,0.904610,0.905416,0.889135,0.000000,0.901015,0.922894,0.894429
A57156,0.872449,0.752025,0.780943,0.640904,0.946034,0.739364,0.881285,0.827602,0.811916,0.860647,...,0.834074,0.757275,0.860590,0.831553,0.895684,0.715193,0.901015,0.000000,0.733708,0.747056
A65985,0.840653,0.717242,0.723933,0.646536,0.960637,0.761261,0.905620,0.795213,0.852101,0.888854,...,0.847559,0.747067,0.887550,0.799756,0.915638,0.762917,0.922894,0.733708,0.000000,0.759460


## Reading Results

Now that we have DataFrames of all our distance results, we can easily look at the texts that are most similar (i.e. closest in distance) to a text of our choice. We'll use the same example as in the TF-IDF tutorial: Margaret Cavendish's *The Blazing World*.

In [7]:
top5_cosine = cosine_df.nsmallest(6, 'A53049')['A53049'][1:]
print(top5_cosine)

A29017    0.570971
A28171    0.603443
A57484    0.620828
A60482    0.635795
A56381    0.637656
Name: A53049, dtype: float64
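
The `[1:]` slice above assumes that the chosen text always sorts first, at distance 0 from itself. If you plan to repeat this lookup for many texts, one possible alternative is a small helper that removes the text by its label instead. The function name `closest_texts` and the toy IDs below are hypothetical, used only for illustration.

```python
import pandas as pd

def closest_texts(distance_df, text_id, n=5):
    """Return the n texts with the smallest distance to text_id, excluding text_id itself."""
    return distance_df[text_id].drop(text_id).nsmallest(n)

# A toy distance matrix with made-up IDs and distances
ids = ["T1", "T2", "T3", "T4"]
toy_df = pd.DataFrame(
    [[0.0, 0.2, 0.9, 0.5],
     [0.2, 0.0, 0.4, 0.7],
     [0.9, 0.4, 0.0, 0.3],
     [0.5, 0.7, 0.3, 0.0]],
    index=ids, columns=ids)

nearest = closest_texts(toy_df, "T1", n=2)
print(nearest)
```

With the DataFrames from this tutorial, the equivalent call would be `closest_texts(cosine_df, 'A53049')`.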


We now have a list of text IDs and their cosine distances, but this list is hard to interpret without more information. We can use the techniques from the [Metadata tutorial](https://earlyprint.org/jupyterbook/metadata.html) to get a DataFrame of metadata for all the 1666 texts:

In [8]:
# Get the full list of metadata files
# (You'll change this line based on where the files are on your computer)
metadata_files = glob.glob("../../epmetadata/header/*.xml")
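nsmap={'tei': 'http://www.tei-c.org/ns/1.0'}
parser = etree.XMLParser(collect_ids=False) # Define the parser here so this cell doesn't depend on the loop above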

all_metadata = [] # Empty list for data
index = [] # Empty list for TCP IDs
for f in metadata_files: # Loop through each file
    tcp_id = f.split("/")[-1].split("_")[0] # Get TCP ID from filename
    if tcp_id in filekeys:
        metadata = etree.parse(f, parser) # Create lxml tree for metadata
        title = metadata.find(".//tei:sourceDesc//tei:title", namespaces=nsmap).text # Get title

        # Get author (if there is one)
        try:
            author = metadata.find(".//tei:sourceDesc//tei:author", namespaces=nsmap).text
        except AttributeError:
            author = None

        # Get date (if there is one that isn't a range)
        try:
            date = metadata.find(".//tei:sourceDesc//tei:date", namespaces=nsmap).get("when")
        except AttributeError:
            date = None

        # Add dictionary of data to data list
        all_metadata.append({'title':title,'author':author,'date':date})

        # Add TCP ID to index list
        index.append(tcp_id)


# Create DataFrame with data and indices
metadata_df = pd.DataFrame(all_metadata, index=index)
metadata_df

,title,author,date
A48797,"Wonders no miracles, or, Mr. Valentine Greatra...","Lloyd, David, 1635-1692.",1666
A44938,"A fast-sermon, preached to the Lords in the Hi...","Hall, George, 1612?-1668.",1666
A35608,The Case of Cornelius Bee and his partners Ric...,,1666
A52328,The pernicious consequences of the new heresie...,"Nicole, Pierre, 1625-1695.",1666
A26426,Advertisement be [sic] Agnes Campbel relict of...,"Campbel, Agnes.",1666
...,...,...,...
A66752,Ecchoes from the sixth trumpet. The first part...,"Wither, George, 1588-1667.",1666
A30143,"Grace abounding to the chief of sinners, or, A...","Bunyan, John, 1628-1688.",1666
A32207,His Majesties declaration Charles R.,England and Wales. Sovereign (1660-1685 : Char...,1666
A49213,The French Kings declaration of a vvar against...,France. Sovereign (1643-1715 : Louis XIV),1666


And we can combine this with our cosine distance results to see the metadata for the texts most similar to *The Blazing World*:

In [9]:
metadata_df.loc[top5_cosine.index, ['author','title','date']]

,author,title,date
A29017,"Boyle, Robert, 1627-1691.","The origine of formes and qualities, (accordin...",1666
A28171,"Binning, Hugh, 1627-1653.",The common principiles of Christian religion c...,1667
A57484,"Rochefort, César de, b. 1605.","The history of the Caribby-islands, viz, Barba...",1666
A60482,"Smith, John, 1630-1679.",Gērochomia vasilikē King Solomons portraiture ...,1666
A56381,"Parker, Samuel, 1640-1688.",An account of the nature and extent of the div...,1666
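
It can also be handy to keep the distance scores next to the metadata. The sketch below shows one way to do that with `assign`, using a small stand-in for the real DataFrames. The IDs, authors, truncated titles, and distances are copied from the output above; the variable names `top5_example`, `metadata_example`, and `combined` are hypothetical.

```python
import pandas as pd

# Cosine distances to A53049, copied from the results above
top5_example = pd.Series(
    {"A29017": 0.570971, "A28171": 0.603443, "A57484": 0.620828},
    name="A53049",
)

# A small stand-in for metadata_df (titles are truncated, as in the display above)
metadata_example = pd.DataFrame(
    {
        "author": [
            "Bunyan, John, 1628-1688.",
            "Boyle, Robert, 1627-1691.",
            "Binning, Hugh, 1627-1653.",
            "Rochefort, César de, b. 1605.",
        ],
        "title": [
            "Grace abounding to the chief of sinners, or, A...",
            "The origine of formes and qualities, (accordin...",
            "The common principiles of Christian religion c...",
            "The history of the Caribby-islands, viz, Barba...",
        ],
    },
    index=["A30143", "A29017", "A28171", "A57484"],
)

# Select the matching rows, then attach the distance and the
# corresponding similarity (1 minus the cosine distance) as new columns
combined = metadata_example.loc[top5_example.index].assign(
    cosine_distance=top5_example,
    cosine_similarity=1 - top5_example,
)
print(combined)
```

With the real data, the same pattern would be `metadata_df.loc[top5_cosine.index].assign(cosine_distance=top5_cosine)`.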


You now have all the tools you need to create your own mini [Discovery Engine](https://earlyprint.org/lab/tool_discovery_engine.html?which_to_do=find_texts&eebo_tcp_id=A43441&n_results=35&tfidf_weight=6&mallet_weight=6&tag_weight=6), one focused on exactly the texts you care most about. For more on how to interpret these results and things to watch out for when calculating similarity, refer again to [The Programming Historian](https://programminghistorian.org/en/lessons/common-similarity-measures).