## Analysis of a text file

The purpose of this section is to write functions with a practical purpose: finding out something about the words in a text file. Let's set a few goals for this session. If we have a text file, let's try to find out:

* The number of occurrences of the word 'and'
* The number of occurrences of a word we supply to the function
* The number of occurrences of *all* words in the text

Run the cell below to define the function.

In [None]:
def count_and(filename):
    # Open the file. Here f is a "file object", which acts like a bookmark in the file
    f = open(filename,"r")
    # Read all the text from the file
    text = f.read()
    # Count all the occurrences of the word "and"
    total = 0
    for word in text.split(): # for each word ....
        # .... if we find the word, add one to the total
        if word == "and":
            total += 1
    return total

Now run the next cell to test it.

In [None]:
print(count_and("reason.txt"))
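
To see what `count_and` is working with, it can help to look at what `split()` produces. The short sketch below uses a made-up sentence (not taken from `reason.txt`) and shows two things worth keeping in mind: punctuation stays attached to words, and the comparison with `==` is case-sensitive.

```python
# A made-up sentence, used only for illustration
sample = "And so it goes, and on and on."

# split() with no arguments breaks the text at any whitespace
words = sample.split()
print(words)

# Count exact matches only: "And" (capital A) is not equal to "and"
total = 0
for word in words:
    if word == "and":
        total += 1
print(total)
```

Here the first word, "And", is not counted, and "goes," keeps its comma.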

## Exercise: Count a different word

Write a function that counts the word "therefore" instead. Feel free to use the function above as a base for this. The start of the function definition is written below, to start you off.

In [None]:
def count_therefore(filename):
    # Open the file. Here f is a "file object", which acts like a bookmark in the file
    f = open(filename,"r")
    # Read all the text from the file
    text = f.read()
    # ADD MORE CODE HERE

Again, remember that when you've finished editing it, you have to use Shift-Enter to tell Jupyter to re-run the cell. Now try it out:

In [None]:
print(count_therefore("reason.txt"))

## Exercise: Count *any* word

Here's another copy of the function. This time, change it so that you can ask it to count any word. Rather than having the target word specified in the code, it will be passed in to the function. So for example, calling:

```python
count_any(filename, 'serious')
```

will count the occurrences of the word 'serious' in the given file.

In [None]:
def count_any(filename):
    # Open the file. Here f is a "file object", which acts like a bookmark in the file
    f = open(filename,"r")
    # Read all the text from the file
    text = f.read()
    # ADD MORE CODE HERE

Now try it out -- once you've run it for "serious", go back and try it with a few different words.

In [None]:
print(count_any("reason.txt", "serious"))

## Exercise: Count *all* words

Before this exercise you should be familiar with Python dictionaries. If you're not, please see Worksheet A2.

We could do this by:

* getting a big list of words
* calling count_any with each word in turn

This would be very time-consuming, though, as it would go through the whole file once for each word! So let's try a different approach.
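
As a rough illustration of the one-pass idea, here is a sketch that counts every word in a short list using a plain dictionary. The list is made up for the example; it stands in for the words you would get from a file.

```python
# A made-up list of words, standing in for the contents of a file
words = ["to", "be", "or", "not", "to", "be"]

counts = {}
for word in words:
    # get() returns 0 if the word has not been seen yet
    counts[word] = counts.get(word, 0) + 1

print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Each word is visited once, however many different words there are.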

Fortunately, Python provides us with a simple way of counting values. Take a look at the documentation here:

https://docs.python.org/3/library/collections.html#collections.Counter

Try this:

```python
from collections import Counter
c = Counter()
```

Then add some items to your counter:

```python
c.update(['chips', 'peas', 'beans', 'carrots', 'chips', 'carrots', 'chips', 'spinach', 'carrots'])
```

The counter has counted these items, but `update()` itself doesn't display a result. If you print your counter,

```python
print(c)
```

you'll see that it's similar to a dictionary. You can get values out in the same way as a dictionary. For example, to see the number of occurrences counted of "chips":

```python
print(c["chips"])
```
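
Two further behaviours of `Counter` are useful to know. The sketch below, using a shorter made-up list, shows that calling `update()` again adds to the existing counts, and that asking for an item that was never counted gives 0 rather than an error.

```python
from collections import Counter

c = Counter()
c.update(['chips', 'peas', 'chips'])
c.update(['peas', 'beans'])  # a second update adds to the existing counts

print(c['chips'])    # 2
print(c['peas'])     # 2
print(c['spinach'])  # 0, because 'spinach' was never counted
```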

Finally, let's write a function to count the number of occurrences for each word in a file.

Change the code below to return a Counter that has been updated with all the words in the file. You don't need a loop for this function.

In [None]:
def count_all(filename):
    # Open the file. Here f is a "file object", which acts like a bookmark in the file
    f = open(filename,"r")
    # Read all the text from the file
    text = f.read()
    # ADD MORE CODE HERE

Now run your code:

In [None]:
print(count_all("reason.txt"))

There's a lot there! Fortunately you can get the most common words and their frequencies from the Counter, using its `most_common()` method. The documentation is here:

https://docs.python.org/3/library/collections.html#collections.Counter.most_common

Try it:

In [None]:
c = count_all("reason.txt")
print(c.most_common())
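
`most_common()` also accepts an optional number, which limits the result to that many entries. Here is a small sketch using the vegetable list from earlier rather than the file:

```python
from collections import Counter

c = Counter()
c.update(['chips', 'peas', 'beans', 'carrots', 'chips', 'carrots', 'chips', 'spinach', 'carrots'])

# Ask for just the two most common items
top_two = c.most_common(2)
print(top_two)  # [('chips', 3), ('carrots', 3)]

# Each entry is a (word, count) pair, so it can be unpacked in a loop
for word, count in top_two:
    print(word, count)
```

Items with equal counts appear in the order they were first counted, which is why 'chips' comes before 'carrots' here.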

## Extra exercise: Cleaning up

In the counts there will be "words" that throw off the count. For example, the first real sentence in `reason.txt` is:

> Human reason, in one sphere of its cognition, is called upon to consider
questions, which it cannot decline, as they are presented by its own
nature, but which it cannot answer, as they transcend every faculty of
the mind.

Unless you've specially accounted for it, the Counter will see the last word in that sentence as "mind." -- which means when "mind" occurs in the middle of a sentence elsewhere, it will be counted separately, with two entries in the Counter. You can check this:

```python
c = count_all("reason.txt")
print(c["mind."])
print(c["mind"])
```

The same thing will happen where there is a comma after a word -- for example "cognition" (and others) in the example sentence above.

In order to fix this it would be good to "clean up" the words before they are counted. Have a look at Python's string methods for ways of doing this:

https://docs.python.org/3/library/stdtypes.html#string-methods

There are a few ways to do this. Here are some things to consider:

* Consider writing a separate function to clean text, which you will call from `count_all()`. Then you can run it separately when testing.
* Can you clean up the whole text, before you split it into separate words?
* Are there any cases where cleaning up the text could have ambiguous results? (Example: how will you deal with hyphenated words?)
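
As a starting point, here is a minimal sketch of one possible approach: cleaning each word after splitting. The helper name `clean_word` is made up for this example, and converting to lowercase is an extra assumption, so that "Human" and "human" would be counted together. It is not the only way to do it.

```python
import string

def clean_word(word):
    # Hypothetical helper: remove punctuation from both ends of the word,
    # then convert it to lowercase
    return word.strip(string.punctuation).lower()

print(clean_word("mind."))       # mind
print(clean_word("cognition,"))  # cognition
print(clean_word("Human"))       # human
print(clean_word("well-known"))  # well-known
```

Note that `strip()` only removes characters from the ends of a string, so the hyphen inside "well-known" is left alone. Whether that is what you want is one of the decisions to make.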

## Extra exercise: The longest word

Create a function to find and return the longest word. As in the first couple of exercises, start by opening the file, and create a loop through all the words in the file. Keep track of the current longest word, and its length. At the end, return the longest word.
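
The "keep track of the current longest" pattern can be sketched on a small made-up list of words, leaving the file handling for you to add:

```python
# A made-up list of words, standing in for the words in a file
words = ["reason", "cognition", "mind", "transcend"]

longest = ""
for word in words:
    # Only replace the current longest if this word is strictly longer
    if len(word) > len(longest):
        longest = word

print(longest)  # cognition
```

Because the comparison uses `>`, the first of two equally long words is kept ("cognition" and "transcend" both have nine letters). Using `>=` instead would keep the last one.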