# Introduction to Text

This chapter covers how to use code to work with text as data. It has benefitted from the [Python String Format Cookbook](https://mkaz.blog/code/python-string-format-cookbook/).

Before we get to the good stuff, we need to talk about string encodings. Whether you're using code or a text editor (Notepad, Word, Pages, Visual Studio Code), every bit of text that you see has an encoding behind the scenes that tells the computer how to display the underlying data. There is no such thing as 'plain' text: all text on computers is the result of an encoding. Often, a computer programme (email reader, Word, whatever) will guess the encoding and show you what it thinks the text should look like. But it doesn't always know, or get it right: *that is what is happening when you get an email or open a file full of weird symbols and question marks*. If a computer doesn't know whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), it simply cannot display it correctly and you get gibberish.

When it comes to encodings, there are just two things to remember: i) you should use UTF-8, the most widely used encoding of Unicode and the de facto international standard; ii) the Windows operating system has tended to use either Latin 1 or Windows 1252.

[Unicode](https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code. The Unicode specifications are continually revised and updated to add new languages and symbols.
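As a small illustration of what "a unique code" means in practice, Python's built-in `ord` and `chr` convert between a character and its Unicode code point, and `str.encode` shows the bytes that UTF-8 uses to store it. The characters chosen here are arbitrary examples.

```python
# Every character has a unique Unicode code point (an integer).
code_a = ord("a")          # code point of the Latin letter a
code_alpha = ord("α")      # code point of the Greek letter alpha
char_from_code = chr(945)  # go back from a code point to a character

# UTF-8 stores each code point as between one and four bytes.
euro_bytes = "€".encode("utf-8")

print(code_a, code_alpha, char_from_code)
print(euro_bytes, len(euro_bytes))
```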

If you do text analysis on Windows, you will sometimes see weird symbols appearing because Windows sometimes replaces Unicode characters with similar-looking ones, e.g. $\alpha$ becomes a. In other cases, Windows may substitute non-representable characters with question marks, other symbols, or `\uxxxx`, where `xxxx` is a series of numbers and letters. The good news is that more modern versions of Windows are moving toward UTF-8. Take special care when saving CSV files containing text on a Windows machine using Excel; unless you specify it, the text may not be saved in UTF-8. If you lose track of a file's encoding and re-save it with the wrong one, you could lose data.

Hopefully you'll never have to worry about string encodings. But if you *do* see weird symbols appearing in your text, at least you'll know that there's an encoding problem and will know where to start Googling. You can find a much more in-depth explanation of text encodings [here](https://kunststube.net/encoding/).
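To make the problem concrete, here is a minimal sketch of how an encoding mismatch produces gibberish. The word used is an arbitrary example; the point is that the same bytes read back differently depending on which encoding is assumed.

```python
original = "café"

# Encode the text to bytes using UTF-8.
raw_bytes = original.encode("utf-8")

# Decoding with a different encoding silently produces gibberish.
wrong = raw_bytes.decode("latin-1")

# Decoding with the encoding that was used to write the bytes recovers the text.
right = raw_bytes.decode("utf-8")

print(raw_bytes)
print(wrong)
print(right)
```

When reading or writing files, passing `encoding="utf-8"` explicitly to the built-in `open` function avoids relying on the platform default.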

## Strings

Strings are the basic data type for text in Python. They can be of any length. A string can be signalled by single quote marks or double quote marks like so:

`'text'`

or


`"text"`

We can put this into a variable like so:


In [1]:
var = "banana"

Now, if we check the type of the variable:

In [2]:
type(var)

str

We see that it is `str`, which is short for string.

Strings in Python can be indexed, so we can get certain characters out by using square brackets to say which positions we would like.

In [3]:
var[:3]

'ban'

The usual slicing tricks that apply to lists work for strings too, i.e. the positions you want to get can be retrieved using the `var[start:stop:step]` syntax. Here's an example of getting every other character from the string starting from the 2nd position.

In [5]:
var[1::2]

'aaa'
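A few more slices, as a quick sketch of how the `start`, `stop`, and `step` parts combine. The variable name `word` is used here only to keep this illustration separate from `var`.

```python
word = "banana"

first_three = word[:3]   # positions 0, 1, 2
middle = word[2:5]       # positions 2, 3, 4 (stop is exclusive)
last_three = word[-3:]   # negative positions count from the end
every_other = word[::2]  # positions 0, 2, 4

print(first_three, middle, last_three, every_other)
```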

Strings do have many similarities to lists but one way in which they are different is that they are immutable. This means commands like `var[0] = "B"` will result in an error (a `TypeError`). If you want to change a single character, you will have to replace the entire string. In this example, the command to do that would be `var = "Banana"`.
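The following sketch shows the error that item assignment raises, and one way of building a replacement string from slices rather than typing it out again. The name `word` is an illustrative stand-in for `var`.

```python
word = "banana"

try:
    word[0] = "B"  # strings do not support item assignment
except TypeError as err:
    error_message = str(err)
    print(f"TypeError: {error_message}")

# Build a new string from the pieces we want and rebind the name.
word = "B" + word[1:]
print(word)
```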

Like lists, you can find the length of a string using `len`:

In [7]:
len(var)

6

The `+` operator concatenates two or more strings:

In [10]:
second_word = 'panther'
first_word = 'black'
print(first_word + " " + second_word)

black panther

Note that we added a space between the two words so that the phrase reads correctly. Another way of achieving the same end, which scales better to many words (if you have them in a list), is:


In [11]:
" ".join([first_word, second_word])

'black panther'

Three useful string methods to know about are `upper`, `lower`, and `title`. Let's see what they do:


In [19]:
var = 'input TEXT'
var_list = [var.upper(), var.lower(), var.title()]
print(var_list)

['INPUT TEXT', 'input text', 'Input Text']


```{admonition} Exercise
Reverse the string `"gnirts desrever a si sihT"` using indexing operations.
```

While we're using `print()`, it's worth knowing that it has a few tricks. If we have a list, we can unpack it with `*` and print out the entries with a given separator:


In [37]:
print(*var_list, sep="; and \n")

INPUT TEXT; and 
input text; and 
Input Text


To turn variables of other kinds into strings, use the `str()` function, for example:

In [12]:
'A boolean is either ' + str(True) + ' or ' + str(False)  + ', there are only ' + str(2) + ' options.'

'A boolean is either True or False, there are only 2 options.'

In this example two boolean variables and one integer variable were converted to strings. `str` generally makes an intelligent guess at how you'd like to convert your non-string type variable into a string type. You can pass a variable or a literal value to `str`.

### F-strings

The example above is quite verbose. Another way of combining strings with variables is via *f-strings*. A simple f-string looks like this:

In [22]:
variable = 15.32399
print(f"You scored {variable}")

You scored 15.32399


This is similar to calling `str` on `variable` and using `+` for concatenation, but much shorter to write. You can add expressions to f-strings too:

In [25]:
print(f"You scored {variable**2}")

You scored 234.8246695201


This also works with functions.
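For instance, function and method calls can sit inside the braces just like any other expression. In this minimal sketch, `double` is a hypothetical helper defined only for illustration, and `score` and `name` are made-up values.

```python
def double(x):
    """Hypothetical helper used only for this illustration."""
    return 2 * x


score = 15
name = "banana"

# Built-in functions, user-defined functions, and methods all work in the braces.
message = f"Doubled score: {double(score)}; name length: {len(name)}; shouting: {name.upper()}"
print(message)
```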

In the example with the squared score, the number that came out had a lot of (probably) uninteresting decimal places. So how do we polish the printed output? You can pass more information to the f-string to get the output formatted just the way you want. Let's say we wanted two decimal places and a sign:

In [26]:
print(f"You scored {variable:+.2f}")

You scored +15.32


There is a whole range of formatting options for numbers, as shown in the following table:

| Number     | Format    | Output       | Description                                   |
|------------|-----------|--------------|-----------------------------------------------|
| 15.32347   | `{:.2f}`  | `15.32`      | Format float 2 decimal places                 |
| 15.32347   | `{:+.2f}` | `+15.32`     | Format float 2 decimal places with sign       |
| -1         | `{:+.2f}` | `-1.00`      | Format float 2 decimal places with sign       |
| 15.32347   | `{:.0f}`  | `15`         | Format float with no decimal places           |
| 3          | `{:0>2d}` | `03`         | Pad number with zeros (left padding, width 2) |
| 3          | `{:*<4d}` | `3***`       | Pad number with `*` (right padding, width 4)  |
| 13         | `{:*<4d}` | `13**`       | Pad number with `*` (right padding, width 4)  |
| 1000000    | `{:,}`    | `1,000,000`  | Number format with comma separator            |
| 0.25       | `{:.1%}`  | `25.0%`      | Format percentage                             |
| 1000000000 | `{:.2e}`  | `1.00e+09`   | Exponent notation                             |
| 12         | `{:10d}`  | `        12` | Right aligned (default, width 10)             |
| 12         | `{:<10d}` | `12        ` | Left aligned (width 10)                       |
| 12         | `{:^10d}` | `    12    ` | Center aligned (width 10)                     |
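The format codes in the table are written in the style used by `str.format`, with nothing before the colon. In an f-string, the value goes before the colon instead. The short sketch below tries a handful of the table's rows both ways; the variable names are chosen just for this illustration.

```python
value = 15.32347

# In an f-string, the value comes before the colon and the format code after it.
two_dp = f"{value:.2f}"
signed = f"{value:+.2f}"
zero_padded = f"{3:0>2d}"
thousands = f"{1000000:,}"
percent = f"{0.25:.1%}"
exponent = f"{1000000000:.2e}"
centred = f"{12:^10d}"

# The same format code also works with str.format, exactly as written in the table.
two_dp_format = "{:.2f}".format(value)

for result in [two_dp, signed, zero_padded, thousands, percent, exponent]:
    print(result)
print(repr(centred))  # repr makes the padding spaces visible
```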

As well as using this page interactively through the Colab and Binder links at the top of the page, or downloading this page and using it on your own computer, you can play around with some of these options over at [this link](https://www.python-utils.com/).

### Special characters

Python has a `string` module that comes with some useful built-in strings and characters. For example:

In [27]:
import string

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

gives you all of the punctuation,

In [28]:
string.ascii_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

returns all of the basic letters in the 'ASCII' encoding (with `.ascii_lowercase` and `.ascii_uppercase` variants), and

In [29]:
string.digits

'0123456789'

gives you the numbers from 0 to 9. Finally, though less impressive visually, `string.whitespace` gives a string containing all of the different (there is more than one!) types of whitespace.
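These constants become handy when filtering text. The sketch below shows one possible approach, using `str.maketrans` and `str.translate` to strip punctuation and a simple membership test to pick out digits. The sample sentence is made up for illustration.

```python
import string

text = "Hello, world! It's 9am."

# Build a translation table that deletes every punctuation character.
remove_punct = str.maketrans("", "", string.punctuation)
no_punct = text.translate(remove_punct)

# Keep only the characters that appear in string.digits.
digits_only = "".join(ch for ch in text if ch in string.digits)

print(no_punct)
print(digits_only)
print(repr(string.whitespace))  # repr makes the invisible characters visible
```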

There are other special characters around; in fact, we already met the most famous of them: "\n" for new line. To actually print "\n" we have to 'escape' the backslash by adding another backslash:

In [44]:
print('Here is a \n new line')
print('Here is an \\n escaped new line ')

Here is a 
 new line
Here is an \n escaped new line 


The table below shows the most important escape sequences:

| Code 	| Result          	|
|------	|-----------------	|
| `\'`   	| Single Quote (useful if using `'` for strings)   	|
| `\"`      | Double Quote (useful if using `"` for strings)   	|
| `\\`   	| Backslash       	|
| `\n`   	| New Line        	|
| `\r`   	| Carriage Return 	|
| `\t`   	| Tab             	|

## Cleaning Text



## Processing Text
