# Python for Text Similarities 1

**(C) 2017-2023 by [Damir Cavar](http://damir.cavar.me/)**

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-notebooks).

**Version:** 1.3, January 2023

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

This is a tutorial related to the discussion of text similarities in the textbook [Machine Learning: The Art and Science of Algorithms that Make Sense of Data](https://www.cs.bris.ac.uk/~flach/mlbook/) by [Peter Flach](https://www.cs.bris.ac.uk/~flach/).

This tutorial was developed as part of the material for my course Machine Learning for Computational Linguistics in the [Computational Linguistics Program](http://cl.indiana.edu/) of the [Department of Linguistics](http://www.indiana.edu/~lingdept/) at [Indiana University](https://www.indiana.edu/).

## Requirements

The following code examples presuppose a running [Python 3.x](https://python.org/) environment with [Jupyter Lab](https://jupyter.org/) and [NLTK](https://www.nltk.org/) installed.

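To install [NLTK](https://www.nltk.org/) and its data, follow the instructions on the [NLTK Documentation](https://www.nltk.org/data.html) page. You will also need to install the data packages required by the example code below: `word_tokenize` relies on the Punkt tokenizer models (the `punkt` package, or `punkt_tab` in recent NLTK releases).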

## Jaccard coefficient

To calculate the Jaccard coefficient we prepare two texts:

In [1]:
text1 = """Our medicine cures baldness. No diagnostics needed.
           We guarantee Fast Viagra delivery.
           We can provide Human growth hormone. The cheapest Life
           Insurance with us. You can Lose weight with this treatment.
           Our Medicine now and No medical exams necessary.
           Our Online pharmacy is the best.  This cream Removes
           wrinkles and Reverses aging.
           One treatment and you will Stop snoring.  We sell Valium
           and Viagra.
           Our Vicodin will help with Weight loss. Cheap Xanax."""
text2 = """Dear ,
           we sell the cheapest and best Viagra on the planet. Our delivery is
           guaranteed confident and cheap.
        """

We import the *word_tokenize* function from the [NLTK](https://www.nltk.org/) package. We then convert the token list of each text to a set of types, that is, the distinct tokens that occur in it.

In [3]:
from nltk import word_tokenize

types1 = set(word_tokenize(text1))
types2 = set(word_tokenize(text2))
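
The difference between tokens and types can be illustrated with a toy sentence. This is only a minimal sketch: it uses `str.split()` as a stand-in tokenizer so that it runs without any NLTK data, which is an assumption of this example and not equivalent to `word_tokenize`. The names `toy_tokens` and `toy_types` are illustrative.

```python
# Toy example: tokens are all running words, types are the distinct ones.
sentence = "the cat saw the dog and the dog saw the cat"

toy_tokens = sentence.split()  # stand-in tokenizer, NOT nltk.word_tokenize
toy_types = set(toy_tokens)    # duplicates collapse into a single type

print(len(toy_tokens))    # number of tokens: 11
print(len(toy_types))     # number of types: 5
print(sorted(toy_types))  # sets are unordered, so sort for a stable display
```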

The types in the first text are:

In [4]:
print(types1)

{'wrinkles', 'medicine', 'exams', 'necessary', 'This', 'Reverses', 'cheapest', 'the', 'pharmacy', 'weight', 'growth', 'cures', 'will', 'Xanax', 'You', 'The', 'Vicodin', 'with', 'needed', 'Life', 'now', 'and', 'Medicine', 'Human', 'loss', 'us', 'you', 'Valium', 'cream', 'aging', 'sell', 'Lose', 'treatment', 'help', 'snoring', 'baldness', 'hormone', 'delivery', 'can', 'Our', 'guarantee', 'Insurance', 'Removes', 'best', 'Fast', 'Cheap', 'diagnostics', 'We', 'medical', 'One', 'No', 'Online', 'this', 'Weight', 'provide', '.', 'Viagra', 'is', 'Stop'}
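
Note that the set above keeps capitalized and lowercase spellings apart, for example 'Medicine' and 'medicine' or 'This' and 'this'. Whether such variants should be merged depends on the application; the code in this tutorial does not merge them. The following sketch shows what case folding would do, assuming a small hand-picked sample of the types printed above (the names `sample_types` and `folded_types` are illustrative):

```python
# A hand-picked sample of types from the output above.
sample_types = {"Medicine", "medicine", "This", "this", "Weight", "weight", "Viagra"}

# Lowercasing every type merges spellings that differ only in case.
folded_types = {t.lower() for t in sample_types}

print(len(sample_types))  # 7 case-sensitive types
print(len(folded_types))  # 4 types after case folding
```

Since case folding changes both the intersection and the union, it would generally change the resulting coefficient as well.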


We can generate the intersection of the two sets of types in the following way:

In [5]:
print(set.intersection(types1, types2))

{'best', 'sell', 'and', 'cheapest', 'the', 'delivery', '.', 'Viagra', 'is', 'Our'}
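
The same intersection can also be written with the `&` operator or with the `intersection` method of a set object. A minimal sketch with two small hand-made sets (the names `a`, `b`, and `common` are illustrative and not part of the tutorial code):

```python
# Two small hand-made sets of types, for illustration only.
a = {"we", "sell", "cheap", "Viagra", "."}
b = {"Our", "Viagra", "is", "cheap", "."}

common = set.intersection(a, b)

# All three spellings produce the same set.
assert common == a & b == a.intersection(b)

print(sorted(common))  # sorted only to get a stable display order
```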


To calculate the Jaccard coefficient, we divide the size of the intersection of the two sets of types by the size of their union:

In [6]:
lenIntersect = len(set.intersection(types1, types2))
lenUnion = len(set.union(types1, types2))

print(lenIntersect / lenUnion)

0.14925373134328357
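
The calculation can be wrapped in a small helper function. This is a sketch under stated assumptions: the function name `jaccard` and the convention of returning 0.0 when both sets are empty (where the union has size zero and the division is undefined) are choices made for this example, not something the tutorial defines.

```python
def jaccard(set_a, set_b):
    """Return |A & B| / |A | B| for two sets."""
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty sets get a coefficient of 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)


# Two of the four distinct elements are shared: 2 / 4 = 0.5
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))
# Identical sets give the maximum value 1.0
print(jaccard({"x"}, {"x"}))
```

Calling `jaccard(types1, types2)` with the two sets built earlier should reproduce the value printed above.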


This division is equivalent to $\frac{\text{types in both sets}}{(\text{types in set 1})\,+\,(\text{types in set 2})\,-\,(\text{types in both sets})}$, because the types shared by both sets would otherwise be counted twice in the union.
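
The formula can be checked by hand with the counts from this example. The first set has 59 types (the set printed above) and the intersection has 10. The count of 18 types for the second text is not printed in the notebook; it is inferred here from the reported result, so treat it as an assumption:

$$J = \frac{10}{59 + 18 - 10} = \frac{10}{67} \approx 0.1493$$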
