In [None]:
#This cell just installs dependencies -- setting up things that we need

!pip install matplotlib
!pip install seaborn

#You can see helper.py in the folder view on the left.
#It contains functions that this notebook uses.
#This lets us hide distracting details/complexity, but you can
#look inside it if you want to know how these things are working.
import helper

import requests
from tqdm.auto import tqdm
from jsonpath_ng.ext import parse as json_parse
from collections import Counter
from copy import deepcopy
import numpy as np
import matplotlib.pyplot as plt
import seaborn
import re
import sys #needed for the sys.stderr error messages in later cells
import math
import random
import pandas as pd
from wordcloud import WordCloud

from IPython.display import JSON as json_display
from IPython.core.display import Markdown

seaborn.set_theme(rc={'figure.figsize':(20,10)})

RESTARTING... LET'S TRY USING THE API
Statistical methods like topic modelling work best with large volumes of data.
To keep this example reasonably small (and therefore fast), we'll work with a set of books that is small but still large enough to give reasonable results.
To begin with, we'll use the catalogue API to search for "typhoid".

(see https://developers.wellcomecollection.org/docs/examples for much more about working with the API)

In [None]:
catalogue_base_url = 'https://api.wellcomecollection.org/catalogue/v2/'

response = requests.get(
    catalogue_base_url + 'works',
    params={
        'include': 'identifiers,subjects,production,items', #items is the one that will allow us to find the text
        'pageSize': 100,
        'query': 'typhoid',
    },
)
if response.status_code != 200:
  print('error', file = sys.stderr)
response_data = response.json()
for k, v in response_data.items():
    if k == 'results': continue #there will be loads of this
    print(f'{k}: {v}')


Last time I updated this cell I got 1099 `totalResults`. Your results may differ, depending upon how Wellcome's collection has changed in the meantime. Anyway, this feels like a nice number of texts to start working with. Let's learn some more about them. We'll start by downloading the catalogue data for all of the pages of results.
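
The paging loop in the next cell follows each `nextPage` link until there are none left. Here is a minimal, network-free sketch of that pattern. The names `collect_results` and `fetch` are hypothetical, introduced only for illustration, and the fake pages just mimic the `results` and `nextPage` keys seen in the API response.

```python
def collect_results(first_page, fetch):
    """Gather 'results' from first_page and from every page reachable via 'nextPage'.

    fetch is any callable that takes a URL and returns the decoded JSON page.
    (This is an assumption for the sketch; the notebook calls requests.get directly.)
    """
    results = list(first_page['results'])
    page = first_page
    while 'nextPage' in page:
        page = fetch(page['nextPage'])
        results.extend(page['results'])
    return results

# Fake three-page response, keyed by made-up URLs
fake_pages = {
    'page2': {'results': ['c', 'd'], 'nextPage': 'page3'},
    'page3': {'results': ['e']},
}
first = {'results': ['a', 'b'], 'nextPage': 'page2'}

all_results = collect_results(first, fake_pages.__getitem__)
print(all_results)
```

Passing the fetcher in as a parameter keeps the loop testable without touching the real API.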

In [None]:
#let's have a progress bar
catalogue_bar = tqdm(
  unit = 'pages',
  total = response_data['totalPages'],
  desc = 'downloading catalogue data',
)

#We already got the first page of results in the previous cell
catalogue_bar.update(1)
works = response_data['results']

#Now we'll add all of the other pages of results to the list "works"
while 'nextPage' in response_data:
  response = requests.get(response_data['nextPage'])
  catalogue_bar.update(1)
  if response.status_code != 200:
    print('error', file = sys.stderr)
  response_data = response.json()
  works.extend(response_data['results'])


Now that we have all of the catalogue data for our "typhoid" works, let's get a sense of what this covers. We'll just look at the contents of the first record.

In [None]:
json_display(works[0], expanded = True)

This is quite a lot of data! We are interested in text about typhoid, so let's focus on the type of work that this is (is it something written, or something else, like a drawing or a photograph?) and the subject matter. We can use JSONPath to look this up.

We'll start with the "type" of the work. The last entry in the above JSON is workType. The label and type look relevant. Let's examine the values that these can take across the whole collection.

In [None]:
#This cell uses dumpCount from helper.py.
#dumpCount takes a JSONPath query and a list of JSON objects
#It prints text describing query results

print('workType types:')
helper.dumpCount('$.workType.type', works)
print()
print('workType labels:')
helper.dumpCount('$.workType.label', works)

We check the "type" just to reassure ourselves about guesses about the data model. At time of writing, the only type is "Format", which encourages me to believe that I don't need to worry about this value and just think about the labels. If you need to be sure that you're using the data model correctly you might need to learn more about this.

We can see that there is a range of types of works in our results. At the time of writing, 3/4 of the works are books and several others are of types that could reasonably have text (e.g. "Archives and manuscripts", "Student dissertations", "E-books", "Manuscripts", "Journals").

Given that text is provided by an OCR pipeline, it is only printed texts that are likely to have online text available. So we filter down (for now) to just books, e-books and journals.

In [None]:
#This filters down to a list of just books, e-books and journals.
#We cannot use JSONPath to do this because JSONPath can only check values for the
#purpose of filtering lists, and works appears to JSONPath code as single JSON objects
#(re e.g. https://stackoverflow.com/a/43737750)

printed_works = list(filter(lambda x: x['workType']['label'] == 'Books' or
                                      x['workType']['label'] == 'E-books' or
                                      x['workType']['label'] == 'Journals', works))

Working out the catalogue subject of a work is more complicated. Works in Wellcome Collection are classified according to a range of schemes. If we look in the above JSON, we can also see that the structure is fairly complex, involving a mixture of "Subjects" and "Concepts". Rather than unpick all this, we'll just look at a part of the structure to get a sense of how things are classified. We'll stick with the printed works here, and we'll limit it to just subjects now. We'll just show subjects that apply to at least 5% of the works.

In [None]:
print('Subjects')
#label of every member of the subjects array which has a type of Subject
helper.dumpCount('$.subjects[?(@.type=="Subject")].label', printed_works, 0.05)

Notice that we now have two columns of numbers.

Before, we were dealing with formats. Each catalogue entry refers to a single physical object --- a book, a journal, a picture, etc etc --- and so it has only one format.

Now, we are dealing with subjects. Each catalogue entry may have more than one subject.

The "entries" column is counting catalogue entries. Because an entry can have multiple subjects, the sum of times that we see each subject, across all subjects in the corpus, is going to be higher than the number of entries in the corpus.

Let's unpack this with a small example. To begin, we can look at the subjects in a sample of ten works.

In [None]:
#My original list was just the first 10 works in printed_works, but the catalogue might change.
#So now I get the same works by identifier, but I've changed the last two to make a better example.
sample = helper.works_by_ids(printed_works,
                             ['bxa3fqrw','f56ccxnd','jf55amap','pw7sr9zn','q5pqqysq','qzy6ufxp','rxyt9ncw','vqhzjwd5','ab2ncfmj', 'sqwwchy7'])

display(Markdown('\n'.join(helper.dump_labels(sample, '$.subjects[?(@.type=="Subject")]', 'subjects', '^Typhoid [Ff]ever$'))))


What we see here is a total of 10 printed works. These have varying numbers of subjects, totalling 18 (3 * 1 + 6 * 2 + 1 * 3 = 18).

I've highlighted all uses of "Typhoid Fever" as a subject. You may notice that there are two different ways of identifying the general subject of "typhoid fever" -- one spelling fever with a capital F and one with a lower case f. These two spellings also have distinct IDs. If we wanted to find all books with this general subject then we would have to use both spellings. Even then, we would have to watch out for cases like that copy of "On typhoid fever", which has both spellings.

You may also notice that there are two copies of William Thomson's "On typhoid fever". As it happens, one of these copies has two different "Typhoid fever" subjects, and one has only one of them.
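
To make the "both spellings" point concrete, here is a small sketch of a case-insensitive subject check that counts each work once even if it carries both spellings. The function name `has_subject` and the toy works are invented for illustration; only the `subjects`, `type` and `label` keys come from the catalogue data above.

```python
def has_subject(work, wanted_label):
    """True if the work has a Subject whose label matches wanted_label, ignoring case."""
    wanted = wanted_label.lower()
    return any(
        s.get('type') == 'Subject' and s.get('label', '').lower() == wanted
        for s in work.get('subjects', [])
    )

# Toy works (made up): w1 carries both spellings, w2 only one, w3 neither
toy_works = [
    {'id': 'w1', 'subjects': [{'type': 'Subject', 'label': 'Typhoid Fever'},
                              {'type': 'Subject', 'label': 'Typhoid fever'}]},
    {'id': 'w2', 'subjects': [{'type': 'Subject', 'label': 'Typhoid fever'}]},
    {'id': 'w3', 'subjects': [{'type': 'Subject', 'label': 'Sanitation'}]},
]

matching_ids = [w['id'] for w in toy_works if has_subject(w, 'typhoid fever')]
print(matching_ids)
```

Because the check is per work, w1 appears once in the result despite matching twice.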

Let's now run our dumpCount function over the same ten works, first to get the titles, then to get the subjects.

In [None]:
helper.dumpCount('$.title', sample)

Running it on the titles shows us that "On typhoid fever" appears twice and all of the others appear once. "On typhoid fever" is therefore 20% of the sample.

The numbers here are out of 10, because there are 10 works.

We only get one list of numbers because the number of titles equals the number of works, so a second list would just be exactly the same as the first. dumpCount is written not to give us two lists when this happens.

Now let's look at subjects.

In [None]:
helper.dumpCount('$.subjects[?(@.type=="Subject")].label', sample)

There are different numbers of works (10) and subjects (18), as we saw above. Because of this we get two lists: the "entries" list calculates percentages as a proportion of the number of works and the "hits" list calculates percentages as a proportion of the number of subjects.
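
The two denominators can be sketched in a few lines. This is not the code in helper.py, which is not shown here. It is a guess at the counting idea, with a hypothetical function name and made-up labels: "entries" counts works that carry a label, "hits" counts every occurrence of a label.

```python
from collections import Counter

def entries_and_hits(label_lists):
    """Return ({label: (works_with_label, occurrences)}, n_works, n_hits)."""
    entries = Counter()
    hits = Counter()
    for labels in label_lists:
        hits.update(labels)          # every occurrence counts as a hit
        entries.update(set(labels))  # each work counts at most once per label
    n_works = len(label_lists)
    n_hits = sum(hits.values())
    table = {label: (entries[label], hits[label]) for label in hits}
    return table, n_works, n_hits

# Three toy works with five subject labels between them
toy = [
    ['Typhoid Fever', 'Sanitation'],
    ['Typhoid Fever'],
    ['Water Supply', 'Typhoid Fever'],
]

table, n_works, n_hits = entries_and_hits(toy)
for label, (e, h) in table.items():
    print(f'{label}: {e}/{n_works} works, {h}/{n_hits} subject occurrences')
```

With more labels than works, the "by work" percentages can sum to more than 100% while the "by hit" percentages cannot.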

The left-hand "entries" column is still counting "by work" --- each number is out of 10, the number of works.

4/10 works, or 40% of all works, have the subject "Typhoid Fever", and another 3/10, or 30% of all works, have the subject "Typhoid fever" (upper-case vs lower-case "f").

The first work, "Typhoid fever and chronic typhoid carriers", has the subjects "Typhoid Fever - epidemiology" and "Typhoid Fever - transmission", so it effectively appears twice in the left-hand column, once for each subject. All works will be counted once in this column for each subject that they have. Because of this, the total of entries in the left-hand column is greater than 10 --- in fact it will be 18, the total number of subjects. If we count up the percentages in this column, they will come to 180% (1 * 40 + 1 * 30 + 11 * 10 = 40 + 30 + 110 = 180).

The right-hand "hits" column is counting "by subject" --- each number is out of 18, the total number of subjects possessed by all of the books. Just as "On typhoid fever" appeared twice in our lists of titles, some subjects appear more than once when we list all of the subjects of all of the books. If we count up the percentages in this column, they will come to 100% (1 * 22 + 1 * 17 + 11 * 6 = 22 + 17 + 66 = 105 -- it comes out a little high because of rounding errors, but we can see that it is really 100% by adding up the hits: 4 + 3 + 11 * 1 = 18, and 18/18 gives 100%).
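
The arithmetic above can be checked directly. This sketch uses only the counts quoted in the text (one subject with 4 hits, one with 3, eleven with 1, across 10 works) and assumes percentages are rounded to whole numbers before display.

```python
hit_counts = [4, 3] + [1] * 11   # hits per distinct subject label, from the text
n_works = 10
n_hits = sum(hit_counts)         # total number of subjects across all works

# Whole-number percentages, as they would appear in each column
entries_pct = [round(100 * n / n_works) for n in hit_counts]
hits_pct = [round(100 * n / n_hits) for n in hit_counts]

print(n_hits)              # total subjects
print(sum(entries_pct))    # "entries" column total
print(sum(hits_pct))       # "hits" column total after rounding
print(100 * sum(hit_counts) / n_hits)  # exact "hits" total before rounding
```

The rounded "hits" column overshoots because 16.7 and 5.6 both round up, and the second of those is repeated eleven times.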

Note also that the inconsistent nature of the data leads to some misrepresentation. If we normalize by case, the proportions will change a little --- let's try that.

In [None]:
#normalize the subjects in a rough and ready way -- this normalizes label case but might not be consistent in other attributes, such as id
#this is good enough for present purposes

normalized_works = [deepcopy(work) for work in sample] #misnomer: we are only normalizing subject label
for n_w in normalized_works:
  subjects = n_w['subjects']
  seen = set()
  normalized_subjects = []
  for subject in subjects:
    if subject['type'] != 'Subject': continue #if it is not actually a subject, move on to the next subject
    lowered = subject['label'].lower()
    if lowered in seen: continue #if this work already has a subject with this label, move on to the next subject
    subject['label'] = lowered
    seen.add(lowered)
    normalized_subjects.append(subject)
  n_w['subjects'] = normalized_subjects
display(Markdown('\n'.join(helper.dump_labels(normalized_works, '$.subjects[?(@.type=="Subject")]', 'subjects', '^typhoid fever$'))))


Our general "Typhoid Fever" subject is now consistently "typhoid fever".

We still have our same ten titles but now only 16 subjects because "Typhoid fever: a history" and the first copy of "On typhoid fever" no longer have the same subject label listed twice with different cases.

So let's perform the same analysis with this slightly cleaner data.

In [None]:
helper.dumpCount('$.subjects[?(@.type=="Subject")].label', normalized_works)

We can now easily see that 60% of the works have the most general "typhoid fever" subject, which also accounts for 38% of all of the subjects covered.

This might be roughly what we would expect in a corpus based on a search for "typhoid".

There is more we could do to clean this data, and to make sure of our assumptions about the data model (for example, I am assuming that one work cannot have two completely identical subjects) but let's get back to a sense of the collection.

You might recall that there is more than one way of talking about classifications in the data model, so let's take a quick look at that back in our fuller set of printed_works. We'll again look at "subjects" applying to at least 5% of books in our printed_works, but now we'll look at "concepts" too.

In [None]:
print('Subjects')
#label of every member of the subjects array which has a type of Subject
helper.dumpCount('$.subjects[?(@.type=="Subject")].label', printed_works, 0.05)
print()
print('Concepts')
#label of every node at any depth beneath subjects which has a type of concept
helper.dumpCount('$.subjects..*[?(@.type=="Concept")].label', printed_works, 0.05)

We have not applied that lower-casing of subject labels to printed_works, just to that sample that we copied out of it. So the difference between "Typhoid Fever" and "Typhoid fever" is back with us. Still, this gives us a bit of an impression of the state of our corpus.

Straight away, we can see that 9% of these works have no subjects and 10% have no concepts.

We can also see that the subjects and concepts are quite similar. The concepts look maybe finer-grained, but I'm guessing at this point.

If you want to see more of the subjects/concepts in the collection, make the number at the end of each `dumpCount` call smaller, or remove it to see all of them. There will be a lot.

One difficulty may be that Wellcome's catalogue includes texts belonging to other collections. These could be classified in different ways.

So let's assume that we are interested in searching works actually held by Wellcome itself and limit down to them.

The way that was suggested to me to do this was to look for works held on either open shelves or in closed stores. This seems to make sense, although perhaps it needs a tweak for purely digital works such as E-books.

For purposes of this notebook we won't worry about the question of digital works, so let's filter down our `printed_works`.
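
The "held by Wellcome" test can also be written without JSONPath. The sketch below assumes each work has an `availabilities` list of objects with an `id` field, as the queries in the next cell imply. The function name and the toy works are made up.

```python
# Availability ids treated as meaning "physically held by Wellcome"
WELLCOME_HELD = {'open-shelves', 'closed-stores'}

def held_by_wellcome(work):
    """True if any of the work's availability ids is in WELLCOME_HELD."""
    ids = {a.get('id') for a in work.get('availabilities', [])}
    return bool(ids & WELLCOME_HELD)

# Toy works (invented) covering the cases we care about
toy_works = [
    {'id': 'w1', 'availabilities': [{'id': 'online'}]},
    {'id': 'w2', 'availabilities': [{'id': 'closed-stores'}, {'id': 'online'}]},
    {'id': 'w3', 'availabilities': [{'id': 'open-shelves'}]},
    {'id': 'w4', 'availabilities': []},
]

held = [w['id'] for w in toy_works if held_by_wellcome(w)]
print(held)
```

Using a set of accepted ids makes it easy to add `online` later if purely digital works should be included.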

In [None]:
print('All availability ids within printed_works:')
helper.dumpCount('$.availabilities[*].id', printed_works)
print()

open_searcher   = json_parse("$.availabilities[?(@.id=='open-shelves')].id")
closed_searcher = json_parse("$.availabilities[?(@.id=='closed-stores')].id")

wellcome_printed = list(filter(lambda x: len(open_searcher.find(x)) > 0 or len(closed_searcher.find(x)) > 0, printed_works))
print(f'{len(wellcome_printed)}/{len(printed_works)} printed works are available in closed and/or open stores (therefore held by Wellcome itself)')
print('These break down as:')
helper.dumpCount('$.workType.label', wellcome_printed)

We can see that `printed_works` uses the availabilities `online`, `closed-stores`, and `open-shelves`.

485 `printed_works` are held by Wellcome itself. We have stored these in the list `wellcome_printed`. Nearly all of these works are books.

Now that we have done this, we can look again at concepts and subjects, to see what the coverage is like for the particular works that we are interested in.

In [None]:
print('Subjects')
#label of every member of the subjects array which has a type of Subject
helper.dumpCount('$.subjects[?(@.type=="Subject")].label', wellcome_printed, 0.05)

print()
print('Concepts')
#label of every node at any depth beneath subjects which has a type of concept
helper.dumpCount('$.subjects..*[?(@.type=="Concept")].label', wellcome_printed, 0.05)

At time of writing, the subjects `Typhoid Fever - epidemiology` and `Typhoid fever` are both well represented among printed works held at Wellcome. 30% of works have the subject `Typhoid Fever - epidemiology` and 28% have the subject `Typhoid fever`. It may therefore be that 58% of the whole corpus has one or other of these subjects --- but some works might have both subjects, so we can't be sure about that.

A very large 76% of works have the concept `Typhoid Fever`. As 16% of works also have no concepts at all, this should mean that only 8% of our corpus both has at least one concept and does not have the concept `Typhoid Fever`.

As there are works missing concepts, lack of classification information is not explained only by a work not being held at Wellcome -- in fact, the proportion of missing subjects/concepts has increased now that we have removed non-Wellcome works.

We could keep digging, but let's leave subjects and concepts there.

We'll take a look at one more example of catalogue data before we move on to actually looking at the text: let's get a sense of when these works were published.

In [None]:
print("Dates of printed Wellcome works, by frequency (min 1%)")
helper.dumpCount('$.production[*].dates[*].label', wellcome_printed, 0.01)

Although this lists only a few dates, it is enough for us to see that the date format is not completely consistent. Sometimes we get a year, sometimes we get a year inside square brackets, perhaps indicating some uncertainty.

Let's see how many of the dates do not consist entirely of numbers.

In [None]:
wellcome_printed_dates = helper.list_by_jsonpath('$.production[*].dates[*].label', wellcome_printed)
not_a_number = Counter([x for x in wellcome_printed_dates if not x.isnumeric()])
print(sorted(not_a_number.keys()))
print()
total = len(wellcome_printed_dates)
bad_total = not_a_number.total()
print(f'Total "bad" dates: {bad_total}/{total} ({100*(bad_total/total):3.0f}%)')

We can see that there are quite a few "square brackets" cases, but also date ranges, copyright symbols, question marks and occasionally snippets of text.

At time of writing, 28% of dates are not numbers, which seems like quite a lot.

Let's try just stripping out square brackets: this discards some information that might be important, but for now we just want a rough sense of the date range.

In [None]:
debracketed_dates = [x.strip('[]') for x in wellcome_printed_dates]
not_a_number = Counter([x for x in debracketed_dates if not x.isnumeric()])
print(sorted(not_a_number.keys()))
print()
total = len(debracketed_dates)
bad_total = not_a_number.total()
print(f'Total "bad" dates: {bad_total}/{total} ({100*(bad_total/total):3.0f}%)')

At time of writing, this reduces the "bad date" proportion to 10%.

We could do more, but let's carry on with this data.
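
One way to "do more" would be to pull the first plausible four-digit year out of each label with a regular expression. This is only a sketch: the labels below are invented examples of the kinds of problem described above, and taking the first year of a range is an arbitrary choice that loses information.

```python
import re

# A four-digit year from 1000 to 2099 that is not part of a longer run of digits
YEAR = re.compile(r'(?<!\d)(1\d{3}|20\d{2})(?!\d)')

def first_year(label):
    """Return the first plausible year in label as an int, or None if there is none."""
    match = YEAR.search(label)
    return int(match.group(1)) if match else None

# Invented labels: plain year, brackets, question mark, range, copyright symbol, partial date, text
labels = ['1897', '[1902]', '[1880?]', '1870-1875', '©1995', '[18--]', 'n.d.']
years = [first_year(x) for x in labels]
print(years)
```

Labels that yield `None` would still need to be inspected by hand or dropped.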

In [None]:
print("Dates of printed Wellcome works, roughly chronologically ordered, discarding non-numbers:")
sorted_dates = sorted([int(x) for x in debracketed_dates if x.isnumeric()])
print(sorted_dates)

That gives us a sense of the corpus: we can see that the works span from 1762 to the modern day.

18 and 19 are questionable numbers. These could refer to very early works but my first guess is that these are the first two digits of a century.

Let's just drop those two numbers and then look at this data in a more visual form.
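
The cumulative chart in the next cell relies on a running total, which is only meaningful if the years are visited in ascending order. Here is a minimal sketch of that calculation with made-up years, using `itertools.accumulate` from the standard library.

```python
from collections import Counter
from itertools import accumulate

years = [1901, 1880, 1901, 1895, 1880, 1880]  # made-up publication years

counts = Counter(years)
ordered_years = sorted(counts)  # ascending, so the running total is cumulative
running = accumulate(counts[y] for y in ordered_years)
cumulative = dict(zip(ordered_years, running))

print(cumulative)
```

The final value always equals the total number of dated works, which is a handy sanity check.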

In [None]:
#Only take numbers in the range 1000 - 2100
#This will drop the 18 and 19 that I see, but also should continue to work if the underlying data changes
sorted_dates = [x for x in sorted_dates if x > 1000 and x < 2100]

counted_dates = Counter(sorted_dates)
total = 0
cumulative = {}
for year in sorted(counted_dates): #visit years in ascending order; set iteration order is not guaranteed
  total += counted_dates[year]
  cumulative[year] = total

xlim = (helper.down(sorted_dates[0], 50), helper.up(sorted_dates[-1], 50))
xticks = range(xlim[0], xlim[1] + 1, 25)

ylim = (0, helper.up(counted_dates.most_common(1)[0][1], 2))
ax = seaborn.scatterplot(counted_dates)
ax.set(
  title = 'Publications per year',
  xlabel = 'Year',
  ylabel = 'Works',
  ylim = ylim,
  yticks = range(ylim[0], ylim[1] + 1, 2),
  xlim = xlim,
  xticks = xticks,
)
plt.show()

ylim = (0, helper.up(total, 50))
ax = seaborn.lineplot(cumulative)
ax.set(
  title = 'Cumulative publications per year',
  xlabel = 'Year',
  ylabel = 'Cumulative works',
  ylim = ylim,
  yticks = range(ylim[0], ylim[1] + 1, 25),
  xlim = xlim,
  xticks = xticks,
)
plt.show()

As I write, these charts show that the printed works in Wellcome's own collection that are returned for a search on 'typhoid' were published mainly between the late 1800s and early 1900s -- discounting all the ones with dates that were not easy to work with.

Now we have explored a little of what the catalogue can tell us and learned a bit about the texts in the corpus.

The next step is to find out which ones actually have digitised text available.

In [None]:
#print(helper.list_by_jsonpath('$..*[?(url)]', wellcome_printed)) #broader search for all fields named 'url'
#print(helper.list_by_jsonpath('$..*[?(url=~"presentation")]', wellcome_printed)) #broader search for fields named 'url' with 'presentation' somewhere in their value

#This one returns all URLs that point to IIIF manifest
#But it loses the context: I do not know which catalogue data it refers to
#urls = helper.list_by_jsonpath('$.items[*].locations[?(@.locationType.id="iiif-presentation")].url', wellcome_printed)#[?(.locationType.id="iiif-presentation")]', wellcome_printed)

#This bit of JSONPath finds every member of the `items` array that ultimately turns out to contain a IIIF location
#It then gets the grandparent of that item, which should be the data structure that we started with
#In this way, we can filter our list of printed Wellcome works to just those for which we have OCR'd text
#TODO: If there are multiple URLs, do I get multiple entries here? Would mean more than one manifest for a work, which I don't think actually happens.
wellcome_ocrd = helper.list_by_jsonpath('$.items[?(@.locations[*].locationType.id="iiif-presentation")].`parent`.`parent`', wellcome_printed)
#print(wellcome_ocrd)
print(f'Of {len(wellcome_printed)} printed works held by Wellcome, {len(wellcome_ocrd)} have a IIIF manifest.')

#TODO Will wellcome_ocrd always be a list of single-element lists?
first_ocrd = wellcome_ocrd[0]
#print(first_ocrd)
#TODO A lot of single-element assumptions in that date lookup -- better just to get it to print "all" the dates, though I suspect that the assumption is generally right
print(f'The first of these in the list is id {first_ocrd["id"]} -- {first_ocrd["title"]}, published {first_ocrd["production"][0]["dates"][0]["label"]}')
first_manifest_url = helper.list_by_jsonpath('$.items[*].locations[?(@.locationType.id="iiif-presentation")].url', [first_ocrd])
for url in first_manifest_url: #I only expect this list to contain one element, but in principle there could be more
  print(f'Its IIIF manifest is at {url}')
first_manifest_url = first_manifest_url[0] #I only expect one URL. In any case, we will assume that the first URL contains the information that we want.
                         #If more than one URL gets printed by the above loop then we should check that assumption.

Feel free to click on that manifest and take a look. Next we will use the information in it to extract the OCR'd text.

In [None]:
response = requests.get(first_manifest_url)
if response.status_code != 200:
  print('error', file = sys.stderr)

first_text_urls = helper.list_by_jsonpath('$.sequences[*].rendering[?(@.format="text/plain")].@id', [response.json()])
for url in first_text_urls:
  print(f'Plain text at {url}')
first_text_url = first_text_urls[0]

Again, there should only be one URL, and you can click on it if you would like to take a look. As a large body of unformatted text, it is not very easy to read.
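
For readers who find the JSONPath hard to follow, the same lookup can be sketched as plain dictionary traversal. The manifest below is a toy with invented example.org URLs. It only mimics the `sequences`, `rendering`, `format` and `@id` keys used in the query above, and the function name is hypothetical.

```python
def plain_text_urls(manifest):
    """Return the @id of every rendering with format text/plain, in document order."""
    urls = []
    for sequence in manifest.get('sequences', []):
        renderings = sequence.get('rendering', [])
        if isinstance(renderings, dict):
            # Defensive: tolerate a single rendering object instead of a list
            renderings = [renderings]
        for rendering in renderings:
            if rendering.get('format') == 'text/plain':
                urls.append(rendering['@id'])
    return urls

# Toy manifest (invented URLs) with one PDF and one plain-text rendering
toy_manifest = {
    'sequences': [{
        'rendering': [
            {'@id': 'https://example.org/pdf/b0000000', 'format': 'application/pdf'},
            {'@id': 'https://example.org/text/b0000000', 'format': 'text/plain'},
        ],
    }],
}

text_urls = plain_text_urls(toy_manifest)
print(text_urls)
```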

Now let's get that text so that we can do something with it.

In [None]:
response = requests.get(first_text_url)
if response.status_code != 200:
  print('error', file = sys.stderr)
first_ocr_text = response.text
first_ocr_sentences = first_ocr_text.split('.') #an approximation to a list of sentences from the book
print("Some randomly-selected sentences, just to prove that we have got some text from a work:")
print()
print()
print('.\n\n'.join(random.choices(first_ocr_sentences, k=10)))

Now we'll generate some other analyses. To make sure that we're looking at the same thing, I'm going to switch to getting the text that was the first work in the list when I ran this notebook --- it may or may not be the first work in the list for you.

We'll start by getting that text, in the same way that we did above, but all in one cell this time. We'll also take the opportunity to define some functions that we can reuse later.
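
The next cell leans on `helper.expect_one` to turn a one-element list into a single value. The real implementation lives in helper.py and is not shown here. The sketch below is only a guess at the behaviour its name suggests: return the sole item, and fail loudly otherwise.

```python
def expect_one(items):
    """Return the only element of items, or raise ValueError if there is not exactly one.

    This is an assumed behaviour, not a copy of helper.expect_one.
    """
    items = list(items)
    if len(items) != 1:
        raise ValueError(f'expected exactly one item, got {len(items)}')
    return items[0]

print(expect_one(['https://example.org/manifest']))  # made-up URL

try:
    expect_one([])
except ValueError as error:
    print(f'refused: {error}')
```

Failing early like this turns a silent single-element assumption into an explicit check.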

In [None]:
def description(catalogue_entry):
  return f'id {catalogue_entry["id"]} -- {catalogue_entry["title"]}, published {catalogue_entry["production"][0]["dates"][0]["label"]}'

def get_iiif_manifest(catalogue_entry):
  manifest_url = helper.expect_one(helper.list_by_jsonpath('$.items[*].locations[?(@.locationType.id="iiif-presentation")].url', [catalogue_entry]))
  response = helper.get(manifest_url)
  return response.json()

def get_plain_text(manifest):    
  text_url = helper.expect_one(helper.list_by_jsonpath('$.sequences[*].rendering[?(@.format="text/plain")].@id', [manifest]))
  response = helper.get(text_url)
  return response.text

#Find the example that I am using
example_work = helper.expect_one([x for x in wellcome_ocrd if x['id']=='kbspden9'])
print(f'The example text is {description(example_work)}')

example_text = get_plain_text(get_iiif_manifest(example_work))

example_sentences = example_text.split('.') #an approximation to a list of sentences from the book
print("Some randomly-selected sentences, just to prove that we have got some text from a work:")
print()
print()
print('.\n\n'.join(random.choices(example_sentences, k=10)))

Now we shall generate a word cloud from that work's text. The word cloud will ignore common words that do not convey much information (stop words).
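
To show what stop word removal means, here is a tiny frequency count using only the standard library. The stop word list and the sample text are made up, and the word cloud library's own text processing does more than this.

```python
import re
from collections import Counter

STOPWORDS = {'the', 'of', 'and', 'a', 'in', 'is', 'was', 'to'}  # tiny illustrative list

def word_frequencies(text, stopwords=STOPWORDS):
    """Count lower-cased words in text, skipping anything in stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

toy_text = 'The fever was severe. The cause of the fever is in the water.'
freq = word_frequencies(toy_text)
print(freq.most_common(3))
```

Without the stop word filter, "the" would top the table and tell us nothing about the subject matter.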

In [None]:
#If you would like to generate a cloud with different settings, adjust some of these parameters
wc = WordCloud(width=1024, height=800, min_font_size=20, 
               margin=75, #space between words
               background_color='white', color_func=lambda *args, **kwargs: 'black', #white background, black text
               max_words=10,
#               stopwords=set() #uncomment this line to prevent stopword removal, or populate the set to give your own stopwords. The default stopwords are here: https://github.com/amueller/word_cloud/blob/main/wordcloud/stopwords.
)
example_freq = wc.process_text(example_text)
wc.generate_from_frequencies(example_freq) #we could just do wc.generate(example_text), but this way we get the frequency table, too
display(pd.DataFrame(Counter(example_freq).most_common()[0:wc.max_words], columns=['word','count']).set_index('word'))
wc.to_image()

The table shows the frequencies of the top words.

The cloud shows the same thing. It indicates that "fever" is the most frequently occurring word in the text, with "case", "one" and "â" also coming up a lot.

It might be interesting to compare the table of frequencies with the cloud --- word clouds are nice and intuitive, but do these sizes really reflect the proportions in the table (and does it matter if they are not quite right)? Just something to think about.

Anyway, that "â" is curious. It might well be a transcription error, so let's look for it in context.

In [None]:
a_circumflex_mentions = [x for x in example_sentences if 'â' in x.lower()]
print(f'{len(a_circumflex_mentions)}/{len(example_sentences)} sentences contain "â"')
print('\nHere are the first 10 of them:\n')
print('.\n\n'.join(a_circumflex_mentions[0:10]))
#print('.\n\n'.join(random.choices(a_circumflex_mentions, k=10)))

We can see that some of these are likely to be apostrophes or quote marks, as in "Budd's original essay" or "the 'Lancet,'". Others are harder to interpret. Using the URLs generated by the next cell, we can look up the work, and searchable images of its pages, in Wellcome's catalogue.

There are likely better ways to do this, but I have not delved into those parts of the API --- I was mainly interested in getting the whole text of a book.

In [None]:
print('Catalogue:      https://wellcomecollection.org/search/works?query=' + example_work['id'])
print('Digitised work: https://wellcomecollection.org/works/' + example_work['id'] + '/items')

Using the latter URL to search for some of the above sentences, I see that 'â' seems to show up as either an apostrophe/quote mark or an em-dash. So I'm going to assume it is getting rendered in place of some kinds of punctuation. Let's see if we can clean this out of our text, and what that does to the word cloud.
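
One possible explanation, which I have not confirmed for this text, is an encoding mismatch: curly quotes and dashes stored as UTF-8 but decoded as Latin-1 each turn into "â" followed by two invisible control characters. The sketch below reproduces that effect on a made-up string and shows that re-decoding recovers the punctuation.

```python
# Made-up sample with a curly apostrophe, a dash and curly single quotes
original = 'Budd\u2019s essay \u2014 the \u2018Lancet\u2019'

# Simulate the suspected fault: UTF-8 bytes read as if they were Latin-1
garbled = original.encode('utf-8').decode('latin-1')
print(garbled.count('â'))  # one "â" per curly quote or dash

# Reverse the fault: recover the bytes, then decode them as UTF-8
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired == original)
```

If this is what is happening, setting `response.encoding = 'utf-8'` before reading `response.text` might avoid the problem at source, whereas deleting "â" alone would leave the invisible characters behind. That would need checking against the actual response.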

In [None]:
cleaned_text = example_text.replace('â', '')
cleaned_freq = wc.process_text(cleaned_text)
wc.generate_from_frequencies(cleaned_freq) #we could just do wc.generate(cleaned_text), but this way we get the frequency table, too
display(pd.DataFrame(Counter(cleaned_freq).most_common()[0:10], columns=['word','count']).set_index('word'))
wc.to_image()

OK, now "may" is showing up as one of our top words. I'd argue that it may also be quite meaningless, and I could go and look in a similar manner and, if appropriate, remove it too.

But let's move on for the moment. Let's assume that the text is now "clean enough" with our one little "a circumflex" clean up and try something else.