# String comparisons

In this notebook we use [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy), a popular library for string comparisons. It builds on the Python standard library module [difflib](https://docs.python.org/3/library/difflib.html). For more information on the available methods and how they differ, see the blog post [FuzzyWuzzy: Fuzzy String Matching in Python](https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/).

> **See also:**
> 
> [textacy](https://github.com/chartbeat-labs/textacy)

## 1. Installation

With [Spack](../productive/envs/spack/index.rst) you can provide `fuzzywuzzy` and the optional `python-levenshtein` library in your kernel:

```console
$ spack env activate python-38
$ spack install py-fuzzywuzzy@0.18.0%gcc@11.2.0+speedup
```

Alternatively, you can install the two libraries with other package managers, for example:

```console
$ pipenv install "fuzzywuzzy[speedup]"
```

## 2. Import

In [1]:
from fuzzywuzzy import fuzz, process

## 3. Example

In [2]:
berlin = ['Berlin, Germany', 
          'Berlin, Deutschland', 
          'Berlin', 
          'Berlin, DE']

## String similarity

The similarity of the first two strings `'Berlin, Germany'` and `'Berlin, Deutschland'` seems low:

In [3]:
fuzz.ratio(berlin[0], berlin[1])

65
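How such a score comes about can be sketched with the standard library alone. The sketch assumes the pure-Python backend, where the score is derived from `difflib.SequenceMatcher`. `simple_ratio` is a hypothetical helper name, not part of fuzzywuzzy, and with `python-levenshtein` installed the library's scores may differ slightly.

```python
from difflib import SequenceMatcher


def simple_ratio(s1, s2):
    """Return a similarity score from 0 to 100, computed as 2 * M / T.

    M is the number of matching characters found by SequenceMatcher,
    T is the total number of characters in both strings.
    """
    matcher = SequenceMatcher(None, s1, s2)
    return int(round(100 * matcher.ratio()))


score = simple_ratio('Berlin, Germany', 'Berlin, Deutschland')
print(score)
```

Apart from the common prefix `'Berlin, '`, only a few scattered characters match: 11 matching characters out of 34 in total give 2 · 11 / 34 ≈ 0.65.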

## Partial string similarity

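Strings that match only in part are a common problem, for example when one string contains additional words. To get around this, fuzzywuzzy uses a heuristic called _best partial_: the shorter string is compared with substrings of the same length in the longer string, and the best score is returned.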

In [4]:
fuzz.partial_ratio(berlin[0], berlin[1])

60
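The idea behind the best-partial heuristic can be sketched as a brute-force sliding window. This is a simplified illustration built on `difflib`, not the library's implementation; `best_partial_ratio` is a hypothetical name. The library narrows down the candidate windows instead of trying all of them, so its results may differ in edge cases.

```python
from difflib import SequenceMatcher


def best_partial_ratio(s1, s2):
    """Score the shorter string against every equally long window of the longer one."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    width = len(shorter)
    best = 0.0
    # Slide a window of len(shorter) characters over the longer string
    for start in range(len(longer) - width + 1):
        window = longer[start:start + width]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


print(best_partial_ratio('Berlin', 'Berlin, Germany'))
print(best_partial_ratio('Berlin, Germany', 'Berlin, Deutschland'))
```

The first call finds `'Berlin'` as an exact window and yields the maximum score. In the second call no window of `'Berlin, Deutschland'` comes close to `'Berlin, Germany'`, so the partial score does not improve on the plain ratio.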

## Token sorting

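In token sorting, each string is split into tokens, the tokens are sorted alphabetically and then reassembled into a string before the comparison. The example below uses the related `token_set_ratio`, which additionally separates the tokens shared by both strings from the remaining ones, so that a string whose tokens are all contained in the other one scores 100: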

In [5]:
fuzz.ratio(berlin[1], berlin[2])

48

In [6]:
fuzz.token_set_ratio(berlin[1], berlin[2])

100
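Both token-based approaches can be sketched with the standard library. The function names are hypothetical, and the tokenisation with `\w+` is an assumption that only approximates the preprocessing fuzzywuzzy applies, so treat this as an illustration rather than the library's code.

```python
import re
from difflib import SequenceMatcher


def _ratio(s1, s2):
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def _tokens(s):
    # Lower-case and split on anything that is not a letter, digit or underscore
    return re.findall(r'\w+', s.lower())


def token_sort_sketch(s1, s2):
    """Compare the strings after sorting their tokens alphabetically."""
    t1, t2 = _tokens(s1), _tokens(s2)
    if not t1 or not t2:
        return 0
    return _ratio(' '.join(sorted(t1)), ' '.join(sorted(t2)))


def token_set_sketch(s1, s2):
    """Compare the shared tokens with each string's shared-plus-remaining tokens."""
    t1, t2 = set(_tokens(s1)), set(_tokens(s2))
    if not t1 or not t2:
        return 0
    shared = ' '.join(sorted(t1 & t2))
    combined1 = (shared + ' ' + ' '.join(sorted(t1 - t2))).strip()
    combined2 = (shared + ' ' + ' '.join(sorted(t2 - t1))).strip()
    return max(_ratio(shared, combined1),
               _ratio(shared, combined2),
               _ratio(combined1, combined2))


print(token_sort_sketch('Germany, Berlin', 'Berlin, Germany'))
print(token_set_sketch('Berlin, Deutschland', 'Berlin'))
```

Both calls give the maximum score: sorting makes the word order irrelevant, and in the second call the only token of `'Berlin'` is shared with the other string.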

## Further information

In [7]:
fuzz.ratio?

## Extract from a list

In [8]:
choices = ['Germany',
           'Deutschland',
           'France', 
           'United Kingdom',
           'Great Britain', 
           'United States']

In [9]:
process.extract('DE', choices, limit=2)

[('Deutschland', 90), ('Germany', 45)]

In [10]:
process.extract('Vereinigtes Königreich', choices)

[('United Kingdom', 51),
 ('United States', 41),
 ('Germany', 39),
 ('Great Britain', 35),
 ('Deutschland', 31)]

In [11]:
process.extractOne('frankreich', choices)

('France', 62)

In [12]:
process.extractOne('U.S.', choices)

('United States', 86)
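Conceptually, extracting from a list means scoring every choice against the query and keeping the best results. The following sketch shows this with a simple `difflib`-based scorer. `simple_scorer` and `extract_sketch` are hypothetical names, and the library's default scorer is more elaborate, so the scores will not match those above.

```python
import heapq
from difflib import SequenceMatcher


def simple_scorer(query, choice):
    """Case-insensitive similarity from 0 to 100 (a stand-in for the library's scorer)."""
    matcher = SequenceMatcher(None, query.lower(), choice.lower())
    return int(round(100 * matcher.ratio()))


def extract_sketch(query, choices, scorer=simple_scorer, limit=5):
    """Return up to `limit` (choice, score) pairs, best score first."""
    scored = ((choice, scorer(query, choice)) for choice in choices)
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


countries = ['Germany', 'Deutschland', 'France',
             'United Kingdom', 'Great Britain', 'United States']
print(extract_sketch('frankreich', countries, limit=1))
```

Passing the scorer as a parameter keeps the ranking logic independent of how similarity is measured, so any of the ratios shown earlier could be plugged in instead.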

## Known ports

FuzzyWuzzy has also been ported to other languages. Here are some known ports:

* Java: [xpresso](https://github.com/WantedTechnologies/xpresso)
* Java: [xdrop fuzzywuzzy](https://github.com/xdrop/fuzzywuzzy)
* Rust: [fuzzyrusty](https://github.com/logannc/fuzzywuzzy-rs)
* JavaScript: [fuzzball.js](https://github.com/nol13/fuzzball.js)
* C++: [tmplt fuzzywuzzy](https://github.com/Tmplt/fuzzywuzzy)
* C#: [FuzzySharp](https://github.com/BoomTownRoi/BoomTown.FuzzySharp)
* Go: [go-fuzzywuzzy](https://github.com/paul-mannino/go-fuzzywuzzy)
* Pascal: [FuzzyWuzzy.pas](https://github.com/DavidMoraisFerreira/FuzzyWuzzy.pas)
* Kotlin: [FuzzyWuzzy-Kotlin](https://github.com/willowtreeapps/fuzzywuzzy-kotlin)
* R: [fuzzywuzzyR](https://github.com/mlampros/fuzzywuzzyR)