<a href="https://colab.research.google.com/github/dasmiq/passim/blob/seriatim/docs/passim_quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting started with `passim`

The [Passim package](https://github.com/dasmiq/passim) is a Python library and set of command-line tools for analyzing documents for text reuse. This notebook walks through the simplest way to install it and presents a couple of case studies with a small corpus. One of the main reasons to use Passim, however, is that it scales to larger corpora by distributing work across clusters of machines with the [Apache Spark](https://spark.apache.org/) library. Although we don't discuss such large deployments here, we do show a simple example of using Spark to manipulate Passim's output.

Since Spark runs Java code at runtime, we first check that a working `java` executable is available.

In [1]:
!java --version

openjdk 11.0.17 2022-10-18
OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu218.04)
OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu218.04, mixed mode, sharing)


The easiest way to install Passim is to have Python's `pip` tool download it directly from the GitHub repository.

In [2]:
!pip install git+https://github.com/dasmiq/passim.git@seriatim#egg=passim
# If you're on a shared machine, use --user to install in your home directory.
# !pip install --user git+https://github.com/dasmiq/passim.git@seriatim#egg=passim

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting passim
  Cloning https://github.com/dasmiq/passim.git (to revision seriatim) to /tmp/pip-install-e9rhau61/passim_6283e294a97d41ccac79a1680260278e
  Running command git clone --filter=blob:none --quiet https://github.com/dasmiq/passim.git /tmp/pip-install-e9rhau61/passim_6283e294a97d41ccac79a1680260278e
  Running command git checkout -b seriatim --track origin/seriatim
  Switched to a new branch 'seriatim'
  Branch 'seriatim' set up to track remote branch 'seriatim' from 'origin'.
  Resolved https://github.com/dasmiq/passim.git to commit af9972c0ac3fc4f5d7b4f3920d2e858801b48b08
  Preparing metadata (setup.py) ... done
Collecting pyspark>=3.0.1
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 281.4/281.4 MB 4.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
C

If the installation succeeds, you should be able to run the `seriatim` command-line program to see what options are available. (See the documentation for some more detail.) Depending on how you run `pip`, you may need to add Passim's installed location to your `PATH` environment variable.
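If the command is not found, a quick check from Python can show whether it is a `PATH` problem. The following is a minimal sketch using only the standard library; the helper name `find_command` is ours, and the assumption that `pip install --user` places scripts in the user base's `bin` directory holds on typical Linux and macOS setups but not necessarily elsewhere.

```python
import os
import shutil
import site

def find_command(name):
    """Return the full path of `name` if it is on the PATH, else None."""
    return shutil.which(name)

# Assumption: with `pip install --user` on Linux/macOS, console scripts
# are usually installed under <user base>/bin.
user_bin = os.path.join(site.getuserbase(), 'bin')

location = find_command('seriatim')
if location is None:
    print(f'seriatim not found on PATH; try adding {user_bin} to PATH')
else:
    print(f'seriatim found at {location}')
```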

In [3]:
!seriatim --help

https://repos.spark-packages.org/ added as a remote repository with the name: repo-1
:: loading settings :: url = jar:file:/usr/local/lib/python3.8/dist-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-9f5c00bb-16ee-4aa1-9529-5aebd46c7eac;1.0
	confs: [default]
	found graphframes#graphframes;0.8.0-spark3.0-s_2.12 in spark-packages
	found org.slf4j#slf4j-api;1.7.16 in central
downloading https://repos.spark-packages.org/graphframes/graphframes/0.8.0-spark3.0-s_2.12/graphframes-0.8.0-spark3.0-s_2.12.jar ...
	[SUCCESSFUL ] graphframes#graphframes;0.8.0-spark3.0-s_2.12!graphframes.jar (522ms)
downloading https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.16/slf4j-api-1.7.16.jar ...
	[SUCCESSFUL ] org.slf4j#slf4j-api;1.7.16!slf4j-api.jar (43ms)
:: r

Now let's work with some text. First, we load some useful libraries.

In [4]:
import json
from urllib.request import urlopen
import sys
import passim

We'll work with the Project Gutenberg transcriptions of two poems by Longfellow.

In [5]:
childrens_hour = """THE CHILDREN'S HOUR

Between the dark and the daylight,
  When the night is beginning to lower,
Comes a pause in the day's occupations,
 That is known as the Children's Hour.

I hear in the chamber above me
  The patter of little feet,
The sound of a door that is opened,
  And voices soft and sweet.

From my study I see in the lamplight,
  Descending the broad hall stair,
Grave Alice, and laughing Allegra,
  And Edith with golden hair.

A whisper, and then a silence:
  Yet I know by their merry eyes
They are plotting and planning together
  To take me by surprise.

A sudden rush from the stairway,
  A sudden raid from the hall!
By three doors left unguarded
  They enter my castle wall!

They climb up into my turret
  O'er the arms and back of my chair;
If I try to escape, they surround me;
  They seem to be everywhere.

They almost devour me with kisses,
  Their arms about me entwine,
Till I think of the Bishop of Bingen
  In his Mouse-Tower on the Rhine!

Do you think, o blue-eyed banditti,
  Because you have scaled the wall,
Such an old mustache as I am
  Is not a match for you all!

I have you fast in my fortress,
  And will not let you depart,
But put you down into the dungeon
  In the round-tower of my heart.

And there will I keep you forever,
  Yes, forever and a day,
Till the walls shall crumble to ruin,
  And moulder in dust away!
"""

reaper = """THE REAPER AND THE FLOWERS.

There is a Reaper, whose name is Death,
  And, with his sickle keen,
He reaps the bearded grain at a breath,
  And the flowers that grow between.

"Shall I have naught that is fair?" saith he;
  "Have naught but the bearded grain?
Though the breath of these flowers is sweet to me,
  I will give them all back again."

He gazed at the flowers with tearful eyes,
  He kissed their drooping leaves;
It was for the Lord of Paradise
  He bound them in his sheaves.

"My Lord has need of these flowerets gay,"
  The Reaper said, and smiled;
"Dear tokens of the earth are they,
  Where he was once a child.

"They shall all bloom in fields of light,
  Transplanted by my care,
And saints, upon their garments white,
  These sacred blossoms wear."

And the mother gave, in tears and pain,
  The flowers she most did love;
She knew she should find them all again
  In the fields of light above.

O, not in cruelty, not in wrath,
  The Reaper came that day;
'T was an angel visited the green earth,
  And took the flowers away.
"""

Passim's input consists of one or more _documents_ spread across one or more input files. Whether you define documents as articles, pages, books, or something else will depend on your research question. The minimal document is a record containing a unique identifier in the `id` field and its text in the `text` field. (These field names are configurable.) Other fields will be treated as metadata and passed through to the output unchanged (unless we use the `--field` option discussed below). For now, we assign both of these poems to the group 'gutenberg', which we will use later.

In [6]:
docs = [{'id': 'childrens_hour', 'group': 'gutenberg', 'text': childrens_hour},
        {'id': 'reaper', 'group': 'gutenberg', 'text': reaper}]
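Because every record needs a unique `id` and a `text` field, it can be worth checking a list of records before writing it out. This is a small sketch of such a check; `check_records` is a hypothetical helper, not part of Passim, and it assumes the default field names.

```python
def check_records(records, id_field='id', text_field='text'):
    """Return a list of human-readable problems found in `records`."""
    problems = []
    seen = set()
    for i, rec in enumerate(records):
        if id_field not in rec:
            problems.append(f'record {i}: missing {id_field!r}')
        elif rec[id_field] in seen:
            problems.append(f'record {i}: duplicate id {rec[id_field]!r}')
        else:
            seen.add(rec[id_field])
        if not isinstance(rec.get(text_field), str):
            problems.append(f'record {i}: missing or non-string {text_field!r}')
    return problems

# Toy records: the second repeats an id, the third has no text.
sample = [{'id': 'a', 'group': 'gutenberg', 'text': 'some text'},
          {'id': 'a', 'group': 'gutenberg', 'text': 'other text'},
          {'id': 'b', 'group': 'gutenberg'}]
problems = check_records(sample)
print(problems)
```

An empty list means the records passed both checks.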

Next, we grab the raw OCR text of two newspaper pages from the Library of Congress's [_Chronicling America_](https://chroniclingamerica.loc.gov/) database and append them to the list of document records.

In [7]:
id = '/lccn/sn83045462/1860-08-24/ed-1/seq-4'
docs.append({'id': id, 'group': '/lccn/sn83045462',
             'text': urlopen('https://chroniclingamerica.loc.gov' + id + '/ocr.txt').read().decode('utf-8')})

In [8]:
id = '/lccn/sn85025007/1860-08-25/ed-1/seq-1'
docs.append({'id': id, 'group': '/lccn/sn85025007',
             'text': urlopen('https://chroniclingamerica.loc.gov' + id + '/ocr.txt').read().decode('utf-8')})
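The two cells above repeat the same pattern, so for more than a handful of pages it may be convenient to wrap it in a function. The sketch below is one way to do that; the names `ocr_url`, `page_group`, and `fetch_page` are ours, and deriving the group from the first two path components simply mirrors the values used above.

```python
from urllib.request import urlopen

BASE = 'https://chroniclingamerica.loc.gov'

def ocr_url(page_id):
    """Build the URL of the raw OCR text for a page identifier."""
    return BASE + page_id + '/ocr.txt'

def page_group(page_id):
    """Use the newspaper prefix, e.g. /lccn/sn83045462, as the group."""
    return '/'.join(page_id.split('/')[:3])

def fetch_page(page_id, timeout=30):
    """Download one page's OCR text and wrap it in a document record."""
    with urlopen(ocr_url(page_id), timeout=timeout) as resp:
        text = resp.read().decode('utf-8')
    return {'id': page_id, 'group': page_group(page_id), 'text': text}
```

With these helpers, each additional page becomes a single call such as `docs.append(fetch_page(page_id))`.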

To finish our preparation of the input, we write these four records to a single [JSON Lines](https://jsonlines.org/) file, i.e., one JSON record per line. In general, you can spread your input across as many files and directories as you need, especially in a large project. In addition to JSON, Apache Spark also reads CSV, Parquet, and some other binary formats.

In [9]:
with open('in.json', 'w') as f:
  for d in docs:
    print(json.dumps(d), file=f)
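Since the poems and OCR pages contain many newlines, it is reasonable to wonder whether each record really ends up on one line. The sketch below, using toy records and a temporary file of our own choosing, illustrates that `json.dumps` escapes embedded newlines, so the file can be read back one record per line.

```python
import json
import os
import tempfile

# Toy records with an embedded newline and a non-ASCII character.
records = [{'id': 'a', 'text': 'line one\nline two'},
           {'id': 'b', 'text': 'caf\u00e9'}]

path = os.path.join(tempfile.mkdtemp(), 'sample.json')
with open(path, 'w', encoding='utf-8') as f:
    for rec in records:
        f.write(json.dumps(rec) + '\n')

# Each physical line of the file holds exactly one record.
with open(path, encoding='utf-8') as f:
    raw_lines = f.read().splitlines()
readback = [json.loads(line) for line in raw_lines]

print(len(raw_lines), readback == records)
```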

Now, we run the default text-reuse detection process using the `seriatim` command. Since this notebook cell might be run multiple times, we delete the output directory first. (Passim reuses existing output to save time.)

In [10]:
# Delete old output
!rm -r out_cluster
!seriatim in.json out_cluster >& out_cluster.err

rm: cannot remove 'out_cluster': No such file or directory
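The `rm` message above is harmless: it only says there was no old output to delete on the first run. If you prefer to avoid it, `rm -rf` stays silent when the directory is missing, or the deletion can be done from Python. The following sketch uses a throwaway directory and a helper name of our own (`remove_output`) to show the idea.

```python
import os
import shutil
import tempfile

def remove_output(path):
    """Delete an output directory, doing nothing if it does not exist."""
    shutil.rmtree(path, ignore_errors=True)

# Demonstrate on a throwaway directory rather than real Passim output.
demo = os.path.join(tempfile.mkdtemp(), 'out_demo')
remove_output(demo)                      # not there yet: no error
os.makedirs(os.path.join(demo, 'tmp'))
remove_output(demo)                      # now removed, including contents
print(os.path.exists(demo))
```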


To start with, we define a simple function to read the JSON Lines output.

In [11]:
# Read one JSON record per line
def read_jsonl(f):
  res = []
  for line in f:
    res.append(json.loads(line))
  return res

By default, Passim puts the main output in the `out.json` subdirectory. Intermediate files are in `tmp`.

In [12]:
!ls out_cluster

out.json  tmp


When read in all at once, Passim's output is a flat array of `dict` records. Each record contains a `cluster` identifier. In this case, there are only three output records, and they all share the same cluster ID, since _The Children's Hour_ appears in three of the four input documents. The `text` field contains the reused substring of the original document.

In [21]:
import glob, itertools
list(itertools.chain.from_iterable([read_jsonl(open(f)) for f in glob.glob('out_cluster/out.json/*json')]))

[{'uid': 4069945004964849451,
  'cluster': 0,
  'begin': 21,
  'end': 1346,
  'boiler': False,
  'src': [{'uid': -7966282239007996852, 'begin': 0, 'end': 1361}],
  'size': 3,
  'pboiler': 0.0,
  'group': '/lccn/sn83045462',
  'id': '/lccn/sn83045462/1860-08-24/ed-1/seq-4',
  'text': "\nTHE CHILDREN'S HOUR.\nBT W. L059V1LL0W.\nBet-reen the dark ?.<i tae d?yh?ht.\nWhen the nuht is becinning t > low?r?\nConn ft p?t)M in thiimf ooeop*'ion?.\nTaftt n known ft* the Children'* Hoar.\nI he*- in th* chamber above m?\n'? ae patt-r of little feet;\nThe round ot? door thM i* opened.\nAnd voice* ?of. ? .d sweet.\nFrom bit stadv I e?e in the UmB ight,\n^D<>.?seorlin? the brovl ball ?Uir.\nuiav? Aiion ?nu inurun* a iogi?.\nAnd Edith with golden hair.\nA whisper and then a siienc;\nYet I Know by th- me ry eyet\nThe> are plot ing and planning togeth\nTo take me bj surprise.\nA sudden rush from the stairway,\nA sadden raid from the hall'\nBt ?hra~ doors led unguarded.\nThey outer my castle wall!\nThey c
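With more than one cluster, it helps to group the flat records by their `cluster` field. The sketch below reads every part file matching a pattern, closing each file as it goes, and collects document IDs per cluster. The part-file names and toy records are made up for the demonstration; only the `cluster` and `id` field names come from the output shown above.

```python
import glob
import json
import os
import tempfile
from collections import defaultdict

def read_jsonl_files(pattern):
    """Read all JSON Lines files matching `pattern` into one list."""
    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding='utf-8') as f:
            records.extend(json.loads(line) for line in f if line.strip())
    return records

def by_cluster(records):
    """Map each cluster ID to the list of document IDs it contains."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec['cluster']].append(rec['id'])
    return dict(clusters)

# Toy part files standing in for out_cluster/out.json/*json
demo_dir = tempfile.mkdtemp()
parts = {'part-0.json': [{'cluster': 0, 'id': 'doc_a'},
                         {'cluster': 0, 'id': 'doc_b'}],
         'part-1.json': [{'cluster': 1, 'id': 'doc_c'}]}
for name, recs in parts.items():
    with open(os.path.join(demo_dir, name), 'w', encoding='utf-8') as f:
        for rec in recs:
            f.write(json.dumps(rec) + '\n')

clusters = by_cluster(read_jsonl_files(os.path.join(demo_dir, '*json')))
print(clusters)
```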

Running Passim in its default mode returns, as we just saw, a set of clusters with shared passages. This mode is useful if we do not know _a priori_ which passages will be reused.

If, however, we know something more about the structure of our data, we can tell Passim to consider only certain links between documents. For instance, in our toy corpus of four documents, two of them are clean transcripts of poems and the other two are noisy OCR transcriptions of full newspaper pages. 

If we want to focus on the line-by-line alignment of certain documents, we can instead get output in `--docwise` mode and consider only those alignments where the first "source" text is in group 'gutenberg' and the second "target" text is not. The `--field` and `--filterpairs` options that specify these constraints use SQL syntax. When considering a pair of records, Passim keeps the original field names for the source document and appends a `2` to the target document's field names. In this case, we tell Passim to pay attention to the `group` field and to align source documents whose `group` field is 'gutenberg' with target documents whose `group` field is not 'gutenberg'.

In [100]:
!rm -r out_docwise
!seriatim --docwise --field group --filterpairs "group = 'gutenberg' AND group2 <> 'gutenberg'" in.json out_docwise >& out_docwise.err

# This command produces the same results but would be faster on a large corpus
# because it defines a new boolean field `ref` instead of using string comparisons.
#!seriatim --docwise --field "(group = 'gutenberg') AS ref" --filterpairs 'ref AND !ref2' in.json out_docwise >& out_docwise.err
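To make the pair filter concrete, here is a plain-Python rendering of the same predicate applied to the four documents in our toy corpus. This is only a conceptual sketch of which source/target pairs the SQL expression admits, not a description of how Passim evaluates it internally; `keep_pair` is a name we made up.

```python
from itertools import permutations

docs_meta = [
    {'id': 'childrens_hour', 'group': 'gutenberg'},
    {'id': 'reaper', 'group': 'gutenberg'},
    {'id': '/lccn/sn83045462/1860-08-24/ed-1/seq-4', 'group': '/lccn/sn83045462'},
    {'id': '/lccn/sn85025007/1860-08-25/ed-1/seq-1', 'group': '/lccn/sn85025007'},
]

def keep_pair(src, tgt):
    # Python version of: group = 'gutenberg' AND group2 <> 'gutenberg'
    # (`group` is the source's field, `group2` the target's).
    return src['group'] == 'gutenberg' and tgt['group'] != 'gutenberg'

# Consider every ordered (source, target) pair of distinct documents.
pairs = [(s['id'], t['id'])
         for s, t in permutations(docs_meta, 2) if keep_pair(s, t)]
for src_id, tgt_id in pairs:
    print(src_id, '->', tgt_id)
```

Of the twelve ordered pairs, only the four that go from a poem to a newspaper page pass the filter.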

We can use the same simple helper function to get the output of this run.

In [101]:
dw = read_jsonl(open(glob.glob('out_docwise/out.json/*json')[0]))

Each output record is now an entire target document broken into lines. The `--filterpairs` option we used ensures that the target documents are those not in group 'gutenberg', i.e., the targets are the OCR'd newspapers. We ignore the other fields and only grab the lines that are aligned to a source document. This is indicated by the presence of the `wits` (i.e., witnesses) field in the line record.

In [102]:
[[line for line in doc['lines'] if 'wits' in line] for doc in dw]

[[{'begin': 48491,
   'text': 'Between the dark and the daylight,\n',
   'wits': [{'group': 'gutenberg',
     'id': 'childrens_hour',
     'begin': 21,
     'alg': 'Between the dark and the daylight,\n',
     'alg2': 'Between the dark and the daylight,\n',
     'matches': 35,
     'text': 'Between the dark and the daylight,\n'}]},
  {'begin': 48526,
   'text': 'When the night h beginning to lower,\n',
   'wits': [{'group': 'gutenberg',
     'id': 'childrens_hour',
     'begin': 56,
     'alg': '  When the night is beginning to lower,\n',
     'alg2': '--When the night -h beginning to lower,\n',
     'matches': 36,
     'text': '  When the night is beginning to lower,\n'}]},
  {'begin': 48563,
   'text': "Comes a pause in tho day's occupations,\n",
   'wits': [{'group': 'gutenberg',
     'id': 'childrens_hour',
     'begin': 96,
     'alg': "Comes a pause in the day's occupations,\n",
     'alg2': "Comes a pause in tho day's occupations,\n",
     'matches': 39,
     'text': "Comes a pau
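In the records shown above, the `alg` and `alg2` strings have equal length, with `-` marking gaps, and `matches` agrees with the number of positions where the two strings carry the same character. Assuming that reading of the fields is right, a per-line agreement rate can be computed as in this sketch (`count_matches` is our own helper).

```python
def count_matches(alg, alg2):
    """Count positions where two aligned strings have the same character."""
    return sum(a == b for a, b in zip(alg, alg2))

# The second aligned line from the output above.
ref = '  When the night is beginning to lower,\n'
ocr = '--When the night -h beginning to lower,\n'

n = count_matches(ref, ocr)
print(n, len(ref), n / len(ref))  # 36 of 40 aligned positions agree
```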

For the small dataset in this demo, it's easy enough to read one file at a time. To work with bigger datasets spread over multiple files, it is more convenient to use Apache Spark itself.

In [73]:
from pyspark.sql import SparkSession, Row, DataFrame
from pyspark.sql.functions import (col, explode)
import pyspark.sql.functions as f
spark = SparkSession.builder.appName('Passim Demo').getOrCreate()

We read in the JSON output from the docwise Passim run.

In [63]:
dwdata = spark.read.json('out_docwise/out.json')

Spark infers the schema of this data, so we can see the available fields and their types.

In [72]:
dwdata.printSchema()

root
 |-- group: string (nullable = true)
 |-- id: string (nullable = true)
 |-- lines: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- begin: long (nullable = true)
 |    |    |-- text: string (nullable = true)
 |    |    |-- wits: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- alg: string (nullable = true)
 |    |    |    |    |-- alg2: string (nullable = true)
 |    |    |    |    |-- begin: long (nullable = true)
 |    |    |    |    |-- group: string (nullable = true)
 |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |-- matches: long (nullable = true)
 |    |    |    |    |-- ref: boolean (nullable = true)
 |    |    |    |    |-- text: string (nullable = true)
 |-- text: string (nullable = true)
 |-- uid: long (nullable = true)



As above, we focus on those lines that are aligned to a reference text. The functions in the Spark Python API correspond to operations in relational algebra such as select, filter, etc. (This style of working with data frames may be familiar to users of pandas in Python or dplyr in R.)

In [97]:
lines = dwdata.select('id', f.explode('lines').alias('line')
  ).filter(col('line')['wits'].isNotNull()
  ).select('id', col('line.begin'),
           col('line.wits')[0]['alg'].alias('ref'),
           col('line.wits')[0]['alg2'].alias('ocr')
  ).sort('id', 'begin')
[r.asDict(True) for r in lines.limit(20).collect()]

[{'id': '/lccn/sn83045462/1860-08-24/ed-1/seq-4',
  'begin': 21,
  'ref': "THE CHILDREN'S HOUR-\n",
  'ocr': "THE CHILDREN'S HOUR.\n"},
 {'id': '/lccn/sn83045462/1860-08-24/ed-1/seq-4',
  'begin': 42,
  'ref': '--\n---------------',
  'ocr': 'BT W. L059V1LL0W.\n'},
 {'id': '/lccn/sn83045462/1860-08-24/ed-1/seq-4',
  'begin': 60,
  'ref': 'Bet-ween the dark -and the daylight,\n  ',
  'ocr': 'Bet‐reen the dark ?.<i tae d?y-h?ht--.\n'},
 {'id': '/lccn/sn83045462/1860-08-24/ed-1/seq-4',
  'begin': 96,
  'ref': 'When the night is beginning t-o lower,\n',
  'ocr': 'When the n-uht is becinning t > low?r?\n'},
 {'id': '/lccn/sn83045462/1860-08-24/ed-1/seq-4',
  'begin': 134,
  'ref': "Comes a pause in the day's occupations,\n",
  'ocr': "Conn ft p?t)M in th---iimf ooeop*'ion?.\n"},
 {'id': '/lccn/sn83045462/1860-08-24/ed-1/seq-4',
  'begin': 171,
  'ref': " That is known -as the Children's Hour.\n",
  'ocr': "-Taftt n known ft* the Children'* Hoar.\n"},
 {'id': '/lccn/sn83045462/1860-08-24/ed-
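For readers new to Spark, it may help to see roughly the same query written as ordinary Python loops over the list-of-dicts form of the docwise output. This is a sketch on a toy record of our own, not a replacement for the Spark version, which is what scales to many files. One difference is noted in the comments: the plain `if wits` test also skips empty lists.

```python
def aligned_lines(docs):
    """Flatten docwise records into one row per aligned line."""
    rows = []
    for doc in docs:
        for line in doc['lines']:            # like explode('lines')
            wits = line.get('wits')
            if wits:                         # like the isNotNull filter, but
                rows.append({                # also skips empty lists
                    'id': doc['id'],
                    'begin': line['begin'],
                    'ref': wits[0]['alg'],   # first witness only
                    'ocr': wits[0]['alg2'],
                })
    return sorted(rows, key=lambda r: (r['id'], r['begin']))

# Toy record shaped like the docwise output; the values are invented.
toy = [{'id': 'page_1',
        'lines': [{'begin': 20, 'text': 'b\n',
                   'wits': [{'alg': 'b\n', 'alg2': 'b\n'}]},
                  {'begin': 10, 'text': 'unaligned\n'},
                  {'begin': 0, 'text': 'a\n',
                   'wits': [{'alg': 'a\n', 'alg2': 'o\n'}]}]}]
rows = aligned_lines(toy)
print(rows)
```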