# Working With Text

Text in Python has the data type “string.” That means that text is just a sequence of characters. When analysing text programmatically, we are generally not interested in formatting, but just in the raw text itself. As a consequence, you cannot simply open a Word document or something like that in Python. The text format that we usually work with is `.txt`.

Text can be stored in different encodings. These specify how characters (e.g., “a”, but also “á” and even “道”) are converted into the zeros and ones that the computer stores. The most widely used encoding today is UTF-8, which can represent every character of the Unicode standard. You should make sure that your texts are always saved as UTF-8 in order to prevent issues when reading them in Python.
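
To make the idea of an encoding more tangible, here is a minimal sketch that converts the three example characters into UTF-8 bytes. The variable names `samples`, `encoded` and `roundtrip` are only chosen for this illustration. It suggests that a plain ASCII letter needs a single byte, while the other two characters need more than one.

```python
# Encode three sample characters as UTF-8 and inspect the resulting bytes.
samples = ['a', 'á', '道']
encoded = {char: char.encode('utf-8') for char in samples}

for char, raw in encoded.items():
    # len(raw) is the number of bytes UTF-8 uses for this character
    print(char, raw, len(raw))

# Decoding with the same encoding restores the original text.
roundtrip = '道'.encode('utf-8').decode('utf-8')
print(roundtrip)
```

If a file is decoded with a different encoding than the one it was saved in, these byte sequences are interpreted wrongly, which is the typical source of garbled characters.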

That said, let’s look at how to work with text in Python. In this directory, there is a file called “heimskringla_preface.txt”. It is the beginning of the Heimskringla, an Old Norse collection of sagas about the Norwegian kings, available from [Project Gutenberg](http://www.gutenberg.org/ebooks/598).

In order to work with the text in Python, we have to get it from the file stored on the computer’s hard drive into a Python variable that we can work with. It is good practice not to keep the file open all the time. Since we can do everything we want with the text once it is stored in a variable, we can close the file directly after reading its content. So the steps are simply:

1. Open file.
2. Read content.
3. Close file.
4. Analyse text.

In [1]:
textfile = open('heimskringla_preface.txt')
text = textfile.read()
textfile.close()

A more elegant variant, though perhaps less obvious at first sight, is to use a `with` statement, which keeps the file open only as long as we need it and lets Python close it automatically afterwards:

In [2]:
with open('heimskringla_preface.txt') as textfile:
    text = textfile.read()
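
Since the text should be stored as UTF-8, it can help to say so explicitly when opening the file. Without the `encoding` argument, many Python versions fall back to a platform-dependent default. The following sketch writes a small throwaway file first so that it runs on its own. The file name `example.txt` and its content are placeholders and not part of the tutorial data.

```python
import os
import tempfile

# Create a small throwaway file so that the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, 'w', encoding='utf-8') as outfile:
    outfile.write('Snorri wrote about Óláfr.\n')

# Naming the encoding makes the result independent of the platform default.
with open(path, encoding='utf-8') as textfile:
    example_text = textfile.read()

print(example_text)
print(textfile.closed)  # the with block has already closed the file
```

The same `encoding='utf-8'` argument could be added to the `open` calls used in this tutorial.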

The variable `text` is now a large string, containing the complete content of the file. It wouldn’t make sense to print it completely, so we just check its length and peek at its beginning.

In [3]:
len(text)

3076

In [4]:
sample_text = text[0:200]
sample_text

'PREFACE OF SNORRE STURLASON.\n\nIn this book I have had old stories written down, as I have heard\nthem told by intelligent people, concerning chiefs who have have held\ndominion in the northern countries'

In [5]:
print(sample_text)

PREFACE OF SNORRE STURLASON.

In this book I have had old stories written down, as I have heard
them told by intelligent people, concerning chiefs who have have held
dominion in the northern countries


A complete text is often difficult to work with, so we usually want to split it into smaller parts. A starting point might be lines:

In [6]:
lines = sample_text.splitlines()
lines

['PREFACE OF SNORRE STURLASON.',
 '',
 'In this book I have had old stories written down, as I have heard',
 'them told by intelligent people, concerning chiefs who have have held',
 'dominion in the northern countries']

Once we have split the text into pieces, we can start to inspect it piece by piece. For example, we might want to extract a table of contents in order to get an overview of the text. Since we know that headings are always printed in uppercase characters in this edition, we can go through all the lines and only print those that are headings:

In [7]:
for line in lines:
    if line.isupper():
        print(line)

PREFACE OF SNORRE STURLASON.
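
It may not be obvious why the empty line is skipped and why the full stop in the heading does no harm. The sketch below probes `str.isupper()` with a few lines. The list `candidates` is made up for this illustration. The method returns `True` only if the string contains at least one cased character and all cased characters are uppercase. Punctuation and digits are ignored.

```python
# A few hypothetical lines to probe how str.isupper() behaves.
candidates = ['PREFACE OF SNORRE STURLASON.', '', 'In this book', '1066']

for candidate in candidates:
    # repr() makes the empty string visible in the output
    print(repr(candidate), candidate.isupper())

results = [candidate.isupper() for candidate in candidates]
```

A line consisting only of digits, such as a year, would therefore not be mistaken for a heading by this rule.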


When we work with texts, we often have to clean them up. Some parts of this work require manual investigation, but many parts can be described as simple rules. Such tasks are perfect for Python. In this case, we might decide to convert the headings from uppercase to the more usual title case, but otherwise leave the text unchanged:

In [8]:
for line in lines:
    if line.isupper():
        line = line.title()
    print(line)

Preface Of Snorre Sturlason.

In this book I have had old stories written down, as I have heard
them told by intelligent people, concerning chiefs who have have held
dominion in the northern countries
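
One caveat with `title()` is that it treats every non-letter as a word boundary, so it also capitalises the letter after an apostrophe. The following sketch compares it with `string.capwords` from the standard library, which only splits at whitespace. The heading used here is invented for the illustration and does not occur in the file.

```python
import string

heading = "SNORRE'S PREFACE"  # hypothetical heading, not from the file

title_cased = heading.title()           # capitalises after the apostrophe as well
capworded = string.capwords(heading)    # capitalises only at whitespace boundaries

print(title_cased)
print(capworded)
```

Note that `string.capwords` also collapses runs of whitespace into single spaces, so it is not a drop-in replacement in every situation.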


Often, we want to transform data this way. And instead of printing it, we want to store the result for further processing. One good way to express such data conversions in Python is the so-called *list comprehension*. List comprehensions are rules that describe how to transform a list of values into another list of values.

In [9]:
test = ['This', 'is', 'a', 'test']

A typical task when working with data is *mapping*, i.e. transforming each element of a list into another element. If we want to transform each element of this list, e.g. into uppercase, the usual way would be:

1. To create a new list (e.g. `test_transformed`) which is empty in the beginning,
2. to loop through the original list,
3. to transform each element, and
4. to append the transformed element to the new list.

In order to visualise how the new list gets filled up, I print the state of `test_transformed` after each iteration of the loop.

In [10]:
test_transformed = []
for word in test:
    word_transformed = word.upper()
    test_transformed.append(word_transformed)
    print(test_transformed)

['THIS']
['THIS', 'IS']
['THIS', 'IS', 'A']
['THIS', 'IS', 'A', 'TEST']


A shorter version uses a list comprehension. It builds a new list from a given list, in effect using an embedded loop.

In [11]:
test_transformed = [word.upper() for word in test]
test_transformed

['THIS', 'IS', 'A', 'TEST']

A second typical transformation task is *filtering*, i.e. selecting only those elements from a list that match certain criteria. Here, we want to select only those words from our example list `test` that have more than two letters. The classical way would be to embed an `if` statement into our loop:

In [12]:
test_filtered = []
for word in test:
    if len(word) > 2:
        test_filtered.append(word)
    print(test_filtered)

['This']
['This']
['This']
['This', 'test']


Again, a list comprehension can make this easier by adding the filter criterion directly to the list generation procedure:

In [13]:
test_filtered = [word for word in test if len(word) > 2]
test_filtered

['This', 'test']

It is also possible to combine both steps, i.e. to filter list elements and to transform those elements that get included into the new list:

In [14]:
test_filtered_transformed = [word.title() for word in test if len(word) > 2]
test_filtered_transformed

['This', 'Test']

### Task

Write a list comprehension that selects only headings from the list stored in the variable `lines` and transforms them from uppercase to title case. Store the result in a new variable `headings`.

## Working With Words

Usually, the unit of a text that we want to work with is not a line, but a word. Words are generally the smallest unit of a text that carries meaning (ignoring compounds for the moment). Splitting text into words is called “tokenisation,” as words are also called tokens in linguistics. And it is a surprisingly difficult task. In many languages—but not all, e.g. not in Chinese—words are usually separated by spaces. We can use that to our advantage.

In [15]:
words = sample_text.split()
words[0:10]

['PREFACE',
 'OF',
 'SNORRE',
 'STURLASON.',
 'In',
 'this',
 'book',
 'I',
 'have',
 'had']

This works reasonably well, but not perfectly: for example, punctuation characters are still attached to the preceding word. A way to improve the results is to use so-called “regular expressions.” These are rules that allow us to search for patterns in texts. So one way of improving the result is to look only at word characters and to skip punctuation marks and other things.

In [16]:
import re
words = re.findall(r'\w+', sample_text)  # raw string, so that \w reaches the regex engine unchanged
words[0:10]

['PREFACE',
 'OF',
 'SNORRE',
 'STURLASON',
 'In',
 'this',
 'book',
 'I',
 'have',
 'had']

Since the built-in Python functions are not sufficient here, we imported additional functionality from another module. Python ships with many modules by default, like the `re` module for regular expressions. Other modules come with extended Python distributions like Anaconda. And many more can be installed from the [Python Package Index](https://pypi.python.org/pypi). All functions from the `re` module can be called using the prefix "`re.`".

The `findall` function looks for all occurrences of a given pattern in the text. The pattern `\w+` says: find a group of one or more (`+`) word characters (`\w`, i.e. letters, digits, and the underscore `_`). Regular expressions are often helpful. We cannot go into detail here, but you can consult the [Python documentation](https://docs.python.org/3/howto/regex.html#regex-howto) if you want to learn more.
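
To see concretely what `\w+` does and does not match, here is a small sketch on an invented test sentence. It also shows one possible alternative pattern, `[^\W\d_]+`, which keeps only letters by excluding digits and underscores from the word characters. Both the sentence and the variable names are assumptions made for this illustration.

```python
import re

example = "Snorre's saga_1 has 2 parts."  # hypothetical test sentence

# \w+ : runs of letters, digits and underscores
word_chars = re.findall(r'\w+', example)

# [^\W\d_]+ : word characters that are neither digits nor underscores, i.e. letters
letters_only = re.findall(r'[^\W\d_]+', example)

print(word_chars)
print(letters_only)
```

In both cases the apostrophe splits “Snorre's” into two tokens, which illustrates why tokenisation is harder than it first appears.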

Using this approach, we can then count the words in the text. Python has a handy tool for this, the `Counter` class. We just need to import the `collections` module.

In [17]:
import collections

words = re.findall(r'\w+', text)
counts = collections.Counter(words)
counts.most_common(10)

[('the', 36),
 ('of', 29),
 ('and', 27),
 ('in', 13),
 ('to', 11),
 ('a', 10),
 ('his', 9),
 ('be', 8),
 ('their', 7),
 ('have', 6)]
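
Note that `Counter` compares strings exactly, so “The” and “the” are counted as different words. If that is not wanted, one option is to convert every word to lowercase before counting. The sketch below demonstrates the difference on a made-up word list. `tokens` is a hypothetical stand-in for the real `words` list.

```python
import collections

tokens = ['The', 'the', 'saga', 'THE', 'Saga']  # hypothetical word list

# Counting the strings as they are keeps differently cased forms apart.
case_sensitive = collections.Counter(tokens)

# Lowercasing first merges them into one entry per word.
case_folded = collections.Counter(token.lower() for token in tokens)

print(case_sensitive['the'])
print(case_folded.most_common(2))
```

Whether lowercasing is appropriate depends on the question at hand, since it also merges proper names with ordinary words.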

It is probably not surprising that words like articles and prepositions appear quite often in the text, but they add relatively little to understanding it. So it makes sense to exclude them from the list. The simplest way to do so is to use a stopword list. The file `stopwords.txt` contains such a list for English, and we can read it into a Python list.

In [18]:
with open('stopwords.txt') as stopwordfile:
    stopwords = stopwordfile.readlines()
    stopwords = [word.strip() for word in stopwords]

stopwords[0:5]

['a', 'about', 'above', 'after', 'again']

The second step, stripping each entry, is necessary because the strings returned by `readlines()` still end with the line break character that is present in the text file.
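
The effect of `readlines()` and `strip()` can be checked without a real file. In this sketch, `io.StringIO` stands in for an opened text file, and the three words are placeholders rather than the actual content of `stopwords.txt`.

```python
import io

# io.StringIO behaves like a text file that has been opened for reading.
fake_file = io.StringIO('a\nabout\nabove\n')

raw_lines = fake_file.readlines()                   # each entry keeps its trailing '\n'
clean_lines = [line.strip() for line in raw_lines]  # strip() removes surrounding whitespace

print(raw_lines)
print(clean_lines)
```

Without the stripping step, a comparison such as `'a' in raw_lines` would fail, because the list contains `'a\n'` rather than `'a'`.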

Now we can filter the words in the text:

In [19]:
words = [word for word in words if word not in stopwords]
counts = collections.Counter(words)
counts.most_common(10)

[('time', 4),
 ('written', 3),
 ('poem', 3),
 ('chiefs', 3),
 ('raised', 3),
 ('reckoned', 3),
 ('son', 3),
 ('true', 3),
 ('family', 3),
 ('people', 3)]
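
Two refinements might be worth considering here, shown as a sketch on invented miniature data. First, converting the stopword list into a `set` avoids scanning the whole list for every word. Second, comparing the lowercased word catches capitalised stopwords such as “The”, assuming the stopword list itself is all lowercase. The names `tokens`, `stopword_list` and `stopword_set` are hypothetical.

```python
import collections

# Hypothetical miniature data; in the tutorial these come from the two text files.
tokens = ['The', 'chiefs', 'of', 'the', 'north', 'and', 'the', 'chiefs']
stopword_list = ['a', 'and', 'of', 'the']

# Membership tests on a set do not have to scan every element.
stopword_set = set(stopword_list)

# Compare the lowercased token, but keep the original spelling in the result.
content_words = [token for token in tokens if token.lower() not in stopword_set]

print(content_words)
print(collections.Counter(content_words).most_common(2))
```

For a short preface the speed difference is negligible, but it can matter once the same filter is applied to a whole corpus.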

This is already helpful, but of course a simple frequency list is not the most advanced text analysis technique. If you would like to learn about more powerful, though also more complex, techniques, I recommend the [tatom](https://de.dariah.eu/tatom/) tutorial at DARIAH.