# Chapter 3: Basic text processing

-- *A Python Course for the Humanities by Folgert Karsdorp and Maarten van Gompel*

---

The previous chapter has hopefully whetted your appetite. In this chapter we will focus on one of the most important tasks in Humanities research: text processing. One of the goals of text processing is to clean up your data in preparation for some kind of data analysis. Another common goal is to convert a given text collection to a different format. In this chapter we will provide you with the necessary tools to work with collections of texts, clean them, and perform some rudimentary analyses on them.

In [None]:
# In this notebook, we will work with the following excerpt of a novel:
text = """Emma by Jane Austen 1816

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection."""

---

#### Quiz!

Just to recap some of the stuff we learnt in the previous chapter. Can you write code that defines the variable `number_of_es` and counts how many times the letter *e* occurs in `text`? (Hint: use a `for` loop and an `if` statement)

In [None]:
number_of_es = 0
# insert your code here

# The following test should print True if your code is correct 
print(number_of_es == 78)

---

## Writing our first function

In the quiz above, you probably wrote a loop that iterates over all characters in `text` and adds 1 to `number_of_es` each time the program finds the letter *e*. Counting objects in a text is a very common thing to do. Therefore, Python provides the convenient method `count`. This method operates on strings (`somestring.count(argument)`) and takes as argument the object you want to count. Using this method, the solution to the quiz above can now be rewritten as follows:

In [None]:
number_of_es = text.count('e')
print(number_of_es)

In fact, `count` takes as argument any string you would like to find. We could just as well count how often the determiner `an` occurs:

In [None]:
print(text.count('an'))

The string `an` is found 12 times in our text. Does that mean that the word *an* occurs 12 times in our text? Go ahead. Count it yourself. In fact, *an* occurs only twice... Think about this. Why does Python print 12?

If we want to count how often the word *an* occurs in the text and not the string `an`, we could surround *an* with spaces, like the following:

In [None]:
print(text.count(' an '))

Although this gets the job done in this particular case, it is generally not a very solid way of counting words in a text. What if there are instances of *an* that are not followed by a space, for example because they are at the end of a line? Then we would need to query the text multiple times, once for each possible context of *an*. For that reason, we're going to approach the problem using a different, more sophisticated strategy.
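
To make the problem concrete, here is a minimal sketch on a made-up two-line string (the variable `sample` and its contents are our own illustration, not part of the novel). It shows that counting the bare string also matches inside longer words, while the space-padded variant misses words at the start or end of a line:

```python
# Hypothetical sample text, chosen only to illustrate the two failure modes
sample = "an apple and\nan orange"

# Matches 'an' anywhere, including inside 'and' and 'orange'
substring_hits = sample.count('an')
print(substring_hits)  # 4

# Requires a space on both sides, so both real occurrences are missed:
# the first starts the string, the second follows a newline
spaced_hits = sample.count(' an ')
print(spaced_hits)  # 0
```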

Recall the function `split` from the previous chapter. What does this function do? It operates on a string, splits it on whitespace (space, tab, end of line), and returns a list of smaller strings (or words):

In [None]:
print(text.split())
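
As a small illustration of what "splitting on whitespace" means, the sketch below uses a hypothetical string (`sample`, our own example) that mixes double spaces, a tab and a newline. Note that punctuation stays attached to the words, which matters when we compare words later on:

```python
# Hypothetical sample mixing two spaces, a tab and a newline
sample = "Emma  Woodhouse,\thandsome,\nclever"

# split() without arguments treats any run of whitespace as one separator
tokens = sample.split()
print(tokens)  # ['Emma', 'Woodhouse,', 'handsome,', 'clever']
```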

---

#### Quiz!

All the things you have learnt so far should enable you to write code that counts how often a certain item occurs in a list. Write some code that defines the variable `number_of_hits` and counts how often the word *in* (assigned to `item_to_count`) occurs in the list of words called `words`.

In [None]:
words = text.split()
number_of_hits = 0
item_to_count = 'in'
# insert your code here

# The following test should print True if your code is correct 
print(number_of_hits == 3)

---

We will go through the previous quiz step by step. We would like to know how often the preposition *in* occurs in our text. As a first step we will split the string `text` into a list of words:

In [None]:
words = text.split()

Next we define a variable `number_of_hits` and set it to zero.

In [None]:
number_of_hits = 0

The final step is to loop over all words in `words` and add 1 to `number_of_hits` if we find a word that is equal to `in`:

In [None]:
item_to_count = 'in'
for word in words:
    if word == item_to_count:
        number_of_hits += 1
print(number_of_hits)

Now, say we would like to know how often the word *of* occurs in our text. We could adapt the previous lines of code to search for the word *of*, but what if we also would like to count the number of times *the* occurs, and *house* and *had* and... It would be really cumbersome to repeat all these lines of code for each particular search term we have. Programming is supposed to reduce our workload, not increase it. Just like the function `count` for strings, we would like to have a function that operates on lists, takes as argument the object we would like to count and returns the number of times this object occurs in our list.

In this and the previous chapter you have already seen lots of functions (e.g., `print`, `len`, `sorted`). A function does something, often based on some argument you pass to it, and may return a result. You are not just limited to using existing functions in the standard library but you can write your own functions.

In fact, you *must* write your own functions. Separating your problem into sub-problems and writing a function for each of those is an immensely important part of well-structured programming. Functions are defined using the `def` keyword; they take a name and optionally a number of parameters.

    def some_name(optional_parameters):
        # code here
        return result

The `return` statement returns a value back to the caller and immediately ends the execution of the function. 
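
The fact that `return` immediately ends the function can be seen in the following sketch. The function name `first_long_word` and its parameters are hypothetical, invented for this illustration only; it also uses the special value `None` to signal that nothing was found.

```python
def first_long_word(words, min_length):
    # Walk through the words in order
    for word in words:
        if len(word) >= min_length:
            # The function stops right here; later words are never inspected
            return word
    # Only reached when no word was long enough
    return None

result = first_long_word(['a', 'most', 'affectionate', 'indulgent'], 5)
print(result)  # affectionate
```

Although *indulgent* is also long enough, it is never returned, because the function has already finished at *affectionate*.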

Going back to our problem, we want to write a function called `count_in_list`. It takes two arguments: (1) the object we would like to count and (2) the list in which we want to count that object. Let's write down the function definition in Python:

    def count_in_list(item_to_count, list_to_search):
    
Do you understand all the syntax and keywords in the definition above? Now all we need to do is to add the lines of code we wrote before to the body of this function:

In [None]:
def count_in_list(item_to_count, list_to_search): 
    number_of_hits = 0                            
    for item in list_to_search:                   
        if item == item_to_count:                 
            number_of_hits += 1                   
    return number_of_hits                         

All code should be familiar to you, except the `return` keyword. The `return` keyword tells Python to hand back the value of `number_of_hits` as the result of calling the function. OK, let's go through our function one more time, just to make sure you really understand all of it.

1. First we define a function using `def` and give it the name `count_in_list` (line 1);
2. This function takes two arguments: `item_to_count` and `list_to_search` (line 1);
3. Within the function, we define a variable `number_of_hits` and assign to it the value zero, since at that stage we haven't found anything yet (line 2);
4. We loop over all words in `list_to_search` (line 3);
5. If we find a word that is equal to `item_to_count` (line 4), we add 1 to `number_of_hits` (line 5);
6. Return the result of `number_of_hits` (line 6).

(Note: this function does exactly what `list.count(elem)` does in a single line, as documented [here](https://docs.python.org/3/library/stdtypes.html#typesseq-common). However, it is useful to understand how it works, as will become clear later).
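
As a quick check of that claim, the sketch below repeats the function and compares it with the built-in `count` method of lists on a short word list (`sample_words` is our own example, borrowed from the excerpt's wording):

```python
def count_in_list(item_to_count, list_to_search):
    number_of_hits = 0
    for item in list_to_search:
        if item == item_to_count:
            number_of_hits += 1
    return number_of_hits

# Hypothetical short word list for the comparison
sample_words = "her mother had died too long ago for her".split()

ours = count_in_list('her', sample_words)
builtin = sample_words.count('her')
print(ours, builtin)  # 2 2
```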

Let's test our little function! We will first count how often the word *an* occurs in our list of words `words`.

In [None]:
print(count_in_list('an', words))

---

#### Quiz!

Using the function we defined, print how often the word *the* occurs in our text.

In [None]:
# insert your code here

---

## Intermezzo: dictionaries

In the previous chapter you acquainted yourself with the `list` data structure. A dictionary is another data structure which can contain multiple items. However, its items consist of keys and values, and the dictionary allows you to look up a *value* given its *key*. It is called a dictionary because internally it is organized in such a way that looking up an item can be done quickly, just as in an actual dictionary. Although the name dictionary is a metaphor and the keys and values can be anything, it happens to be very useful when working with words! Let's define one:

In [None]:
my_dict = {'book': 'physical objects consisting of a number of pages bound together',
           'sword': 'a cutting or thrusting weapon that has a long metal blade',
           'pie': 'dish baked in pastry-lined pan often with a pastry top'}

Take a close look at the new syntax. Notice the curly braces and the colons. Keys are located at the left side of the colon; values at the right side. Key-value pairs (items) are separated by commas. For readability each item is on a separate line, but we could also write the whole dictionary on a single line. (In general, between parentheses, brackets, and braces you may always add extra whitespace and newlines to improve readability). To look up the value of a given key, we 'index' the dictionary using that key:

In [None]:
print(my_dict['sword'])

We say 'index', because we use the same syntax with square brackets when indexing lists or strings. The difference is that we don't use a position number to index a dictionary, but a key. Like lists, dictionaries are mutable which means we can add, remove, and change entries.

Instead of storing definitions as in an actual dictionary, we can also store other values, such as frequencies. Let's define an empty dictionary and add some words and frequencies to it. Note the syntax for adding a new entry:

In [None]:
frequencies = {}
frequencies['pride'] = 8
frequencies['prejudice'] = 9

In a way this is similar to what we have seen before when we changed items in our book list. However, instead of using an index to refer to a position, we refer to a key. Can you imagine why this is so useful?

Existing items can be changed or removed as follows:

In [None]:
frequencies['pride'] += 1
frequencies.pop('prejudice')  # pop will remove the item and return its value
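
To see what these two operations leave behind, here is a self-contained sketch that starts from the same two entries. The variable name `removed` is our own addition, used to hold the value that `pop` returns:

```python
frequencies = {'pride': 8, 'prejudice': 9}

# Change an existing value in place
frequencies['pride'] += 1

# Remove an entry and keep the value it had
removed = frequencies.pop('prejudice')

print(removed)      # 9
print(frequencies)  # {'pride': 9}
```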

#### Quiz!

Update the frequency dictionary with some other words. Try to print out the frequency you gave for one of the words.


In [None]:
# insert your code here

### Conditions and loops with dictionaries

Just as with lists, we have to take care that items we refer to in a dictionary do actually exist. To check if a key is in a dictionary, we can use the `in` operator:

In [None]:
if 'pride' in frequencies:
    print('Found it !')

In [None]:
if 'supercalifragilisticexpialidocious' in frequencies:
    print('Found it !')

If we do not perform this test, we may get a `KeyError`, i.e., a specific error telling us that the dictionary does not have the requested key:

In [None]:
print(frequencies['supercalifragilisticexpialidocious'])
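
One way to avoid this error is to guard the lookup with the `in` test shown above. The sketch below assumes a small hypothetical `frequencies` dictionary and treats a missing word as having frequency 0, which is our own choice of fallback. The built-in `dict.get` method, which this chapter has not introduced, offers the same fallback in a single call:

```python
frequencies = {'pride': 9}  # hypothetical contents for this illustration
word = 'supercalifragilisticexpialidocious'

# Guarded lookup: only index the dictionary when the key exists
if word in frequencies:
    freq = frequencies[word]
else:
    freq = 0  # assumption: an unseen word counts as frequency 0

# Equivalent one-liner using dict.get with a default value
freq_via_get = frequencies.get(word, 0)

print(freq, freq_via_get)  # 0 0
```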

Since dictionaries are iterable objects, we can iterate over their entries just as we can iterate over a list. When we loop over a dictionary, we get the keys of the dictionary:

In [None]:
for word in frequencies:
    print(f'{word} has frequency {frequencies[word]}')

Often, we want to know both the key and the value. There is a shortcut to get pairs of `key, value` in a loop:

In [None]:
for word, freq in frequencies.items():
    print(f'{word} has frequency {freq}')

#### What we have learnt

To finish this section, here is an overview of the new concepts and functions you have learnt. Make sure you understand them all.

- Dictionaries: `dict`, `{}`
- indexing or accessing keys of dictionaries
- adding items to a dictionary
- `KeyError`
- `key in dict`
- `for key in dict`
- `for key, value in dict.items()`

---

## A more general count function

Our function `count_in_list` is a convenient piece of code allowing us to easily count how often certain items occur in a given list. Now what if we would like to find out for all words in our text how often they occur? Then it would still be quite cumbersome to call our function for each unique word. We would like to have a function that takes as argument a particular list and counts for each unique item in that list how often it occurs. There are multiple ways of writing such a function. We will show you two ways of doing it.

### A count function (take 1)

We will use a dictionary to write the function `counter` that takes as argument a list and returns a `dict` with a key for each unique item and a value corresponding to the number of times it occurs in the list. We will first write some code without the function declaration. If that works, we will add it, just as before, to the body of a function.

We start with defining a variable `counts` which is an empty dictionary:

In [None]:
counts = {}

Next we will loop over all words in our list `words`. For each word, we check whether the dictionary already contains it. If not, we add the word to the dictionary and initialize the count to 0. Then, we add 1 to its value.

In [None]:
for word in words:
    if word not in counts:
        counts[word] = 0
    counts[word] += 1
print(counts)

Now that our code is working, we can add it to a function. We define the function `counter` using the `def` keyword. It takes one argument (`list_to_search`).

In [None]:
def counter(list_to_search):
    counts = {}
    for word in list_to_search:
        if word not in counts:
            counts[word] = 0
        counts[word] += 1
    return counts

Hopefully we are boring you, but let's go through this function step by step.

1. We define a function using `def` and give it the name `counter` (line 1);
2. This function takes a single argument `list_to_search` which is the list we want to search through (line 1);
3. Next we define a variable `counts` which is an empty dictionary (line 2);
4. We loop over all words in `list_to_search` (line 3);
5. If the word is not in `counts`, we add the word to the dictionary and assign it the value 0 (lines 4-5);
6. Now the word is guaranteed to be in `counts`; we look up its current value and add 1 to it (line 6);
7. Return the result of `counts` (line 7).

Let's try out our new function!

In [None]:
print(counter(words))
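
Because the output for the full excerpt is rather long, it can help to try the function on a tiny list first. The sketch below repeats the function and runs it on a hypothetical five-word list (`sample_words` is our own example). As with `list.count` earlier, the standard library already offers something similar: `collections.Counter`, which we use here only to double-check our result.

```python
from collections import Counter

def counter(list_to_search):
    counts = {}
    for word in list_to_search:
        if word not in counts:
            counts[word] = 0
        counts[word] += 1
    return counts

# Hypothetical short list, small enough to verify by hand
sample_words = ['her', 'mother', 'her', 'place', 'her']

counts = counter(sample_words)
print(counts)  # {'her': 3, 'mother': 1, 'place': 1}

# The standard-library Counter arrives at the same frequencies
print(counts == dict(Counter(sample_words)))  # True
```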

---

In [None]:
# Excerpt for quiz below. Run this cell before doing the quiz.
chapter1 = """CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.  Between _them_ it was more the intimacy
of sisters.  Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness of her temper had hardly allowed
her to impose any restraint; and the shadow of authority being
now long passed away, they had been living together as friend and
friend very mutually attached, and Emma doing just what she liked;
highly esteeming Miss Taylor's judgment, but directed chiefly by
her own.

The real evils, indeed, of Emma's situation were the power of having
rather too much her own way, and a disposition to think a little
too well of herself; these were the disadvantages which threatened
alloy to her many enjoyments.  The danger, however, was at present
so unperceived, that they did not by any means rank as misfortunes
with her.

Sorrow came--a gentle sorrow--but not at all in the shape of any
disagreeable consciousness.--Miss Taylor married.  It was Miss
Taylor's loss which first brought grief.  It was on the wedding-day
of this beloved friend that Emma first sat in mournful thought
of any continuance.  The wedding over, and the bride-people gone,
her father and herself were left to dine together, with no prospect
of a third to cheer a long evening.  Her father composed himself
to sleep after dinner, as usual, and she had then only to sit
and think of what she had lost.

The event had every promise of happiness for her friend.  Mr. Weston
was a man of unexceptionable character, easy fortune, suitable age,
and pleasant manners; and there was some satisfaction in considering
with what self-denying, generous friendship she had always wished
and promoted the match; but it was a black morning's work for her.
The want of Miss Taylor would be felt every hour of every day.
She recalled her past kindness--the kindness, the affection of sixteen
years--how she had taught and how she had played with her from five
years old--how she had devoted all her powers to attach and amuse
her in health--and how nursed her through the various illnesses
of childhood.  A large debt of gratitude was owing here; but the
intercourse of the last seven years, the equal footing and perfect
unreserve which had soon followed Isabella's marriage, on their
being left to each other, was yet a dearer, tenderer recollection.
She had been a friend and companion such as few possessed: intelligent,
well-informed, useful, gentle, knowing all the ways of the family,
interested in all its concerns, and peculiarly interested in herself,
in every pleasure, every scheme of hers--one to whom she could speak
every thought as it arose, and who had such an affection for her
as could never find fault.

How was she to bear the change?--It was true that her friend was
going only half a mile from them; but Emma was aware that great must
be the difference between a Mrs. Weston, only half a mile from them,
and a Miss Taylor in the house; and with all her advantages,
natural and domestic, she was now in great danger of suffering
from intellectual solitude.  She dearly loved her father, but he
was no companion for her.  He could not meet her in conversation,
rational or playful.

The evil of the actual disparity in their ages (and Mr. Woodhouse had
not married early) was much increased by his constitution and habits;
for having been a valetudinarian all his life, without activity
of mind or body, he was a much older man in ways than in years;
and though everywhere beloved for the friendliness of his heart
and his amiable temper, his talents could not have recommended him
at any time.

Her sister, though comparatively but little removed by matrimony,
being settled in London, only sixteen miles off, was much beyond
her daily reach; and many a long October and November evening must
be struggled through at Hartfield, before Christmas brought the next
visit from Isabella and her husband, and their little children,
to fill the house, and give her pleasant society again.

Highbury, the large and populous village, almost amounting to a town,
to which Hartfield, in spite of its separate lawn, and shrubberies,
and name, did really belong, afforded her no equals.  The Woodhouses
were first in consequence there.  All looked up to them.  She had
many acquaintance in the place, for her father was universally civil,
but not one among them who could be accepted in lieu of Miss
Taylor for even half a day.  It was a melancholy change; and Emma
could not but sigh over it, and wish for impossible things,
till her father awoke, and made it necessary to be cheerful.
His spirits required support.  He was a nervous man, easily depressed;
fond of every body that he was used to, and hating to part with them;
hating change of every kind.  Matrimony, as the origin of change,
was always disagreeable; and he was by no means yet reconciled
to his own daughter's marrying, nor could ever speak of her but
with compassion, though it had been entirely a match of affection,
when he was now obliged to part with Miss Taylor too; and from
his habits of gentle selfishness, and of being never able to
suppose that other people could feel differently from himself,
he was very much disposed to think Miss Taylor had done as sad
a thing for herself as for them, and would have been a great deal
happier if she had spent all the rest of her life at Hartfield.
Emma smiled and chatted as cheerfully as she could, to keep him
from such thoughts; but when tea came, it was impossible for him
not to say exactly as he had said at dinner,

"Poor Miss Taylor!--I wish she were here again.  What a pity it
is that Mr. Weston ever thought of her!"

"I cannot agree with you, papa; you know I cannot.  Mr. Weston is such
a good-humoured, pleasant, excellent man, that he thoroughly deserves
a good wife;--and you would not have had Miss Taylor live with us
for ever, and bear all my odd humours, when she might have a house of her own?"

"A house of her own!--But where is the advantage of a house of her own?
This is three times as large.--And you have never any odd humours,
my dear."

"How often we shall be going to see them, and they coming to see
us!--We shall be always meeting! _We_ must begin; we must go and pay
wedding visit very soon."

"My dear, how am I to get so far? Randalls is such a distance.
I could not walk half so far."

"No, papa, nobody thought of your walking.  We must go in the carriage,
to be sure."

"The carriage! But James will not like to put the horses to for
such a little way;--and where are the poor horses to be while we
are paying our visit?"

"They are to be put into Mr. Weston's stable, papa.  You know we
have settled all that already.  We talked it all over with Mr. Weston
last night.  And as for James, you may be very sure he will always like
going to Randalls, because of his daughter's being housemaid there.
I only doubt whether he will ever take us anywhere else.  That was
your doing, papa.  You got Hannah that good place.  Nobody thought
of Hannah till you mentioned her--James is so obliged to you!"

"I am very glad I did think of her.  It was very lucky, for I would
not have had poor James think himself slighted upon any account;
and I am sure she will make a very good servant: she is a civil,
pretty-spoken girl; I have a great opinion of her.  Whenever I see her,
she always curtseys and asks me how I do, in a very pretty manner;
and when you have had her here to do needlework, I observe she
always turns the lock of the door the right way and never bangs it.
I am sure she will be an excellent servant; and it will be a great
comfort to poor Miss Taylor to have somebody about her that she is
used to see.  Whenever James goes over to see his daughter, you know,
she will be hearing of us.  He will be able to tell her how we
all are."

Emma spared no exertions to maintain this happier flow of ideas,
and hoped, by the help of backgammon, to get her father tolerably
through the evening, and be attacked by no regrets but her own.
The backgammon-table was placed; but a visitor immediately afterwards
walked in and made it unnecessary.

Mr. Knightley, a sensible man about seven or eight-and-thirty, was not
only a very old and intimate friend of the family, but particularly
connected with it, as the elder brother of Isabella's husband.
He lived about a mile from Highbury, was a frequent visitor,
and always welcome, and at this time more welcome than usual,
as coming directly from their mutual connexions in London.  He had
returned to a late dinner, after some days' absence, and now walked
up to Hartfield to say that all were well in Brunswick Square.
It was a happy circumstance, and animated Mr. Woodhouse for some time.
Mr. Knightley had a cheerful manner, which always did him good;
and his many inquiries after "poor Isabella" and her children were
answered most satisfactorily.  When this was over, Mr. Woodhouse
gratefully observed, "It is very kind of you, Mr. Knightley, to come
out at this late hour to call upon us.  I am afraid you must have
had a shocking walk."

"Not at all, sir.  It is a beautiful moonlight night; and so mild
that I must draw back from your great fire."

"But you must have found it very damp and dirty.  I wish you may
not catch cold."

"Dirty, sir! Look at my shoes.  Not a speck on them."

"Well! that is quite surprising, for we have had a vast deal
of rain here.  It rained dreadfully hard for half an hour
while we were at breakfast.  I wanted them to put off the wedding."

"By the bye--I have not wished you joy.  Being pretty well aware
of what sort of joy you must both be feeling, I have been in no hurry
with my congratulations; but I hope it all went off tolerably well.
How did you all behave? Who cried most?"

"Ah! poor Miss Taylor! 'Tis a sad business."

"Poor Mr. and Miss Woodhouse, if you please; but I cannot possibly
say `poor Miss Taylor.' I have a great regard for you and Emma;
but when it comes to the question of dependence or independence!--At
any rate, it must be better to have only one to please than two."

"Especially when _one_ of those two is such a fanciful, troublesome creature!"
said Emma playfully.  "That is what you have in your head,
I know--and what you would certainly say if my father were not by."

"I believe it is very true, my dear, indeed," said Mr. Woodhouse,
with a sigh.  "I am afraid I am sometimes very fanciful and troublesome."

"My dearest papa! You do not think I could mean _you_, or suppose
Mr. Knightley to mean _you_.  What a horrible idea! Oh no! I meant
only myself.  Mr. Knightley loves to find fault with me, you know--
in a joke--it is all a joke.  We always say what we like to one another."

Mr. Knightley, in fact, was one of the few people who could see
faults in Emma Woodhouse, and the only one who ever told her of them:
and though this was not particularly agreeable to Emma herself,
she knew it would be so much less so to her father, that she would
not have him really suspect such a circumstance as her not being
thought perfect by every body.

"Emma knows I never flatter her," said Mr. Knightley, "but I
meant no reflection on any body.  Miss Taylor has been used
to have two persons to please; she will now have but one.
The chances are that she must be a gainer."

"Well," said Emma, willing to let it pass--"you want to hear
about the wedding; and I shall be happy to tell you, for we all
behaved charmingly.  Every body was punctual, every body in their
best looks: not a tear, and hardly a long face to be seen.  Oh no;
we all felt that we were going to be only half a mile apart,
and were sure of meeting every day."

"Dear Emma bears every thing so well," said her father.
"But, Mr. Knightley, she is really very sorry to lose poor Miss Taylor,
and I am sure she _will_ miss her more than she thinks for."

Emma turned away her head, divided between tears and smiles.
"It is impossible that Emma should not miss such a companion,"
said Mr. Knightley.  "We should not like her so well as we do, sir,
if we could suppose it; but she knows how much the marriage is to
Miss Taylor's advantage; she knows how very acceptable it must be,
at Miss Taylor's time of life, to be settled in a home of her own,
and how important to her to be secure of a comfortable provision,
and therefore cannot allow herself to feel so much pain as pleasure.
Every friend of Miss Taylor must be glad to have her so happily
married."

"And you have forgotten one matter of joy to me," said Emma,
"and a very considerable one--that I made the match myself.
I made the match, you know, four years ago; and to have it take place,
and be proved in the right, when so many people said Mr. Weston would
never marry again, may comfort me for any thing."

Mr. Knightley shook his head at her.  Her father fondly replied,
"Ah! my dear, I wish you would not make matches and foretell things,
for whatever you say always comes to pass.  Pray do not make any
more matches."

"I promise you to make none for myself, papa; but I must, indeed,
for other people.  It is the greatest amusement in the world! And
after such success, you know!--Every body said that Mr. Weston would
never marry again.  Oh dear, no! Mr. Weston, who had been a widower
so long, and who seemed so perfectly comfortable without a wife,
so constantly occupied either in his business in town or among his
friends here, always acceptable wherever he went, always cheerful--
Mr. Weston need not spend a single evening in the year alone if he did
not like it.  Oh no! Mr. Weston certainly would never marry again.
Some people even talked of a promise to his wife on her deathbed,
and others of the son and the uncle not letting him.  All manner
of solemn nonsense was talked on the subject, but I believed none
of it.

"Ever since the day--about four years ago--that Miss Taylor and I
met with him in Broadway Lane, when, because it began to drizzle,
he darted away with so much gallantry, and borrowed two umbrellas
for us from Farmer Mitchell's, I made up my mind on the subject.
I planned the match from that hour; and when such success has blessed
me in this instance, dear papa, you cannot think that I shall leave
off match-making."

"I do not understand what you mean by `success,'" said Mr. Knightley.
"Success supposes endeavour.  Your time has been properly and
delicately spent, if you have been endeavouring for the last four
years to bring about this marriage.  A worthy employment for a young
lady's mind! But if, which I rather imagine, your making the match,
as you call it, means only your planning it, your saying to yourself
one idle day, `I think it would be a very good thing for Miss Taylor
if Mr. Weston were to marry her,' and saying it again to yourself
every now and then afterwards, why do you talk of success? Where
is your merit? What are you proud of? You made a lucky guess;
and _that_ is all that can be said."

"And have you never known the pleasure and triumph of a lucky guess?--
I pity you.--I thought you cleverer--for, depend upon it a lucky
guess is never merely luck.  There is always some talent in it.
And as to my poor word `success,' which you quarrel with, I do not
know that I am so entirely without any claim to it.  You have drawn
two pretty pictures; but I think there may be a third--a something
between the do-nothing and the do-all. If I had not promoted Mr. Weston's
visits here, and given many little encouragements, and smoothed
many little matters, it might not have come to any thing after all.
I think you must know Hartfield enough to comprehend that."

"A straightforward, open-hearted man like Weston, and a rational,
unaffected woman like Miss Taylor, may be safely left to manage their
own concerns.  You are more likely to have done harm to yourself,
than good to them, by interference."

"Emma never thinks of herself, if she can do good to others,"
rejoined Mr. Woodhouse, understanding but in part.  "But, my dear,
pray do not make any more matches; they are silly things, and break up
one's family circle grievously."

"Only one more, papa; only for Mr. Elton.  Poor Mr. Elton! You
like Mr. Elton, papa,--I must look about for a wife for him.
There is nobody in Highbury who deserves him--and he has been
here a whole year, and has fitted up his house so comfortably,
that it would be a shame to have him single any longer--and I thought
when he was joining their hands to-day, he looked so very much as if
he would like to have the same kind office done for him! I think
very well of Mr. Elton, and this is the only way I have of doing
him a service."

"Mr. Elton is a very pretty young man, to be sure, and a very
good young man, and I have a great regard for him.  But if you
want to shew him any attention, my dear, ask him to come
and dine with us some day.  That will be a much better thing.
I dare say Mr. Knightley will be so kind as to meet him."

"With a great deal of pleasure, sir, at any time," said Mr. Knightley,
laughing, "and I agree with you entirely, that it will be a much
better thing.  Invite him to dinner, Emma, and help him to the best
of the fish and the chicken, but leave him to chuse his own wife.
Depend upon it, a man of six or seven-and-twenty can take care
of himself."
"""

#### Quiz!

Let's put together some of the things we have learnt so far. You are given a new excerpt, `chapter1`. Convert it to a list of words and count how often the word *Emma* occurs in the text. Assign the result to the variable `emma_count`.

In [None]:
emma_count = 0
# insert your code here

# The following test should print True if your code is correct 
print(emma_count == 13)

---

### A count function (take 2)

Let's train our function-writing skills a little more. We are going to write another counting function, this time using a slightly different strategy. Recall our function `count_in_list`. It takes as arguments the item we want to count and the list to search, and it returns the number of times this item occurs in the list. If we call this function for each unique word in `words`, we obtain a list of frequencies, quite similar to the one we get from the function `counter`. What would happen if we simply called `count_in_list` on each word in `words`?

In [None]:
words = text.split()

for word in words:
    print(word, count_in_list(word, words))

As you can see, we obtain the frequency of each word token in `words`, whereas we would like to have it only once for each unique word form. The challenge is thus to come up with a way to convert our list of words into a structure that contains only unique words. For this, Python provides a convenient data structure called `set`. It takes as argument some iterable (e.g. a list) and returns a new object containing only unique items:

In [None]:
x = ['a', 'a', 'b', 'b', 'c', 'c', 'c']
unique_x = set(x)
print(unique_x)
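
One detail that is easy to miss: a set does not keep its items in any particular order, so the printed order may differ from run to run. The following is a minimal sketch, assuming we want a stable order for display; the name `sorted_unique_x` is introduced here purely for illustration.

```python
x = ['a', 'a', 'b', 'b', 'c', 'c', 'c']
unique_x = set(x)

# A set has no fixed order, so we sort it when we need a stable order.
# sorted() returns a new list and leaves the set itself untouched.
sorted_unique_x = sorted(unique_x)
print(sorted_unique_x)

# Compare the number of items in the list with the number of unique items.
print(len(x), len(unique_x))
```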

Using `set` we can iterate over all unique words in our word list and print the corresponding frequency:

In [None]:
unique_words = set(words)
for word in unique_words:
    print(word, count_in_list(word, words))

We wrap the lines of code above into the function `counter2`:

In [None]:
def counter2(list_to_search):
    unique_words = set(list_to_search)
    for word in unique_words:
        print(word, count_in_list(word, list_to_search))

A final check to see whether our function behaves correctly:

In [None]:
counter2(words)

---

#### Quiz!

We have written two functions `counter` and `counter2`, both used to count for each unique item in a particular list how often it occurs in that list. Can you come up with some pros and cons for each function? Why is `counter2` better than `counter` or why is `counter` better than `counter2`?

*Double click this cell and write down your answer.*

---

## Text clean up

In the previous section we wrote code to compute a frequency distribution of the words in a text stored on our computer. The method `split` is a quick and dirty way of splitting a string into a list of words. However, if we look through the frequency distributions, we notice quite a lot of noise. For instance, the pronoun *her* occurs 4 times, but we also find `'her.'` occurring once and the capitalized `'Her'`, also once. Of course we would like to add those counts to that of *her*. The basic tokenization of our text using `split` is fast and simple, but it leaves us with noisy and incorrect frequency distributions.
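
The noise described above can be reproduced on a single sentence. The sketch below uses a made-up sentence (`sample` is not part of the novel excerpt) and the list method `count` as a stand-in for our own `count_in_list`.

```python
# A made-up sentence, used only to illustrate the problem.
sample = "Her mother wrote to her sister, but nobody wrote to her."

# split() only cuts at whitespace, so punctuation stays attached
# to the neighbouring word and capitalization is preserved.
sample_tokens = sample.split()
print(sample_tokens)

# The same pronoun ends up under three different forms.
print(sample_tokens.count('her'))
print(sample_tokens.count('her.'))
print(sample_tokens.count('Her'))
```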

There are essentially two strategies to follow to correct our frequency distributions. The first is to come up with a better procedure for splitting our text into words. The second is to clean up our text and pass this clean result to the convenient `split` method. For now we will follow the second path.

Some words in our text are capitalized. To convert all letters to lower case, Python provides the method `lower`. It operates on strings:

In [None]:
x = 'Emma'
x_lower = x.lower()
print(x_lower)

We can apply this method to our complete text to obtain a completely lowercased text, using:

In [None]:
text_lower = text.lower()
print(text_lower)

This solves our problem with miscounting capitalized words, leaving us with the problem of punctuation. The `replace` method is just what we're looking for. It takes two arguments: (1) the string we would like to find and (2) the string we want to replace the first argument with:

In [None]:
x = 'Please. remove. all. periods. from. this. sentence.'
x = x.replace('.', '')
print(x)

Notice that we replace all periods with an empty string, written as `''`. The method `replace` returns a copy of the string with all occurrences replaced; the original string itself is not changed.
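
Because `replace` hands back a new string, forgetting to assign the result is a common slip. A small sketch of the difference (the variable names `original` and `cleaned` are chosen here for illustration):

```python
original = 'Please. remove. all. periods.'

# Calling replace without storing the result has no visible effect:
# the returned copy is simply thrown away.
original.replace('.', '')
print(original)

# Assigning the result to a name keeps the cleaned copy.
cleaned = original.replace('.', '')
print(cleaned)
```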

---

#### Quiz!

Write code to lowercase and remove all commas in the following short text:

In [None]:
short_text = 'Commas, as it turns out, are terribly overrated.'
# insert your code here

# The following test should print True if your code is correct 
print(short_text == 'commas as it turns out are terribly overrated.')

---

We would like to remove all punctuation from a text, not just periods and commas. We will write a function called `remove_punc` that removes all (simple) punctuation from a text. Again, there are many ways in which we can write this function. We will show you two of them. The first strategy is to repeatedly call `replace` on the same string, each time replacing a different punctuation character with an empty string.

In [None]:
def remove_punc(text):
    punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'
    for marker in punctuation:
        text = text.replace(marker, '')
    return text

short_text = 'Commas, as it turns out, are overrated. Periods, however, even more so!'
print(remove_punc(short_text))

The second strategy achieves the same result without using the built-in method `replace`. Remember that a string consists of characters. We can loop over a string, accessing each character in turn. Each time we find a punctuation marker, we skip to the next character.

We want to collect the remaining characters. However, remember that strings are immutable: we cannot change a string in place, and we would rather not build a new string for every character we add. Therefore we collect the result as a list of characters. In the end, we convert this list to a new string using ``''.join()``. This joins all the elements of the list together into a single string (optionally with a separator, but in this case we use the empty separator `''`).
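
Before applying this to our text, here is a small sketch of how `join` behaves with and without a separator. The lists `characters` and `name_parts` are made up for this illustration.

```python
characters = ['E', 'm', 'm', 'a']

# With an empty separator the items are glued together directly.
word = ''.join(characters)
print(word)

# The string on which join is called is placed between the items.
hyphenated = '-'.join(characters)
print(hyphenated)

name_parts = ['Emma', 'Woodhouse']
full_name = ' '.join(name_parts)
print(full_name)
```

With that in place, the loop described above can be written as follows: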

In [None]:
punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'
clean_characters = []
for character in text:
    if character not in punctuation:
        clean_characters.append(character)
clean_text = ''.join(clean_characters)

The code above displays a very common pattern: we start with some collection of items, we want to do something to each item and/or filter the items, and finally the result should be collected in a new list. Therefore Python provides a handy shorthand, which is called a *list comprehension*:

In [None]:
clean_characters = [character for character in text if character not in punctuation]
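
A list comprehension can transform items, filter them, or do both at once. The sketch below shows the three variants on a short, made-up word list (`sample_words` and the result names are chosen for illustration only).

```python
sample_words = ['Emma', 'by', 'Jane', 'Austen', '1816']

# Transform only: apply lower() to every item.
lowercased = [word.lower() for word in sample_words]
print(lowercased)

# Filter only: keep the items that are longer than two characters.
long_words = [word for word in sample_words if len(word) > 2]
print(long_words)

# Transform and filter in one expression.
long_lowercased = [word.lower() for word in sample_words if len(word) > 2]
print(long_lowercased)
```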

The square brackets can be left out when the comprehension is the only argument of a function call (Python calls the result a *generator expression*). Our new function can now be written as follows:

In [None]:
def remove_punc2(text):
    punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'
    return ''.join(character for character in text
                   if character not in punctuation)

short_text = 'Commas, as it turns out, are overrated. Periods, however, even more so!'
print(remove_punc2(short_text))

---

#### Quiz!

1) Can you come up with any pros or cons for each of the two functions above?

*Write your answer here* (double click me)

2) Now it is time to put everything together. We want to write a function `clean_text` that takes as argument a text represented by a string. The function should return this string with all punctuation removed and all characters lowercased.

In [None]:
def clean_text(text):
    # insert your code here
    
# The following test should print True if your code is correct 
short_text = 'Commas, as it turns out, are overrated. Periods, however, even more so!'
print(clean_text(short_text) == 
      'commas as it turns out are overrated periods however even more so')

3) This last exercise puts everything together. We want you to take the excerpt `chapter1`, clean up the text, and recompute the frequency distribution. Assign to `woodhouse_counts` the number of times the name *Woodhouse* occurs in the text.

In [None]:
woodhouse_counts = 0
# insert your code here

# The following test should print True if your code is correct 
print(woodhouse_counts == 8)

---

<p><small><a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Python Programming for the Humanities</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://fbkarsdorp.github.io/python-course" property="cc:attributionName" rel="cc:attributionURL">http://fbkarsdorp.github.io/python-course</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. Based on a work at <a xmlns:dct="http://purl.org/dc/terms/" href="https://github.com/fbkarsdorp/python-course" rel="dct:source">https://github.com/fbkarsdorp/python-course</a>.</small></p>