# String Manipulation

One place where the Python language really shines is in the manipulation of strings.
This section will cover some of Python's built-in string methods and formatting operations, before moving on to a quick guide to the extremely useful subject of *regular expressions*.
Such string manipulation patterns come up often in the context of data science work, and are one big perk of Python in this context.

Strings in Python can be defined using either single or double quotation marks (they are functionally equivalent):

In [1]:
x = 'a string'
y = "a string"
x == y

True

In addition, it is possible to define multi-line strings using a triple-quote syntax:

In [2]:
multiline = """
one
two
three
"""

With this, let's take a quick tour of some of Python's string manipulation tools.

## Simple String Manipulation in Python

For basic manipulation of strings, Python's built-in string methods can be extremely convenient.
If you have a background working in C or another low-level language, you will likely find the simplicity of Python's methods extremely refreshing.
We introduced Python's string type and a few of these methods earlier; here we'll dive a bit deeper.

### Formatting strings: Adjusting case

Python makes it quite easy to adjust the case of a string.
Here we'll look at the ``upper()``, ``lower()``, ``capitalize()``, ``title()``, and ``swapcase()`` methods, using the following messy string as an example:

In [3]:
fox = "tHe qUICk bROWn fOx."

To convert the entire string into upper-case or lower-case, you can use the ``upper()`` or ``lower()`` methods respectively:

In [4]:
fox.upper()

'THE QUICK BROWN FOX.'

In [5]:
fox.lower()

'the quick brown fox.'

A common formatting need is to capitalize just the first letter of each word, or perhaps the first letter of each sentence.
This can be done with the ``title()`` and ``capitalize()`` methods:

In [6]:
fox.title()

'The Quick Brown Fox.'

In [7]:
fox.capitalize()

'The quick brown fox.'

The cases can be swapped using the ``swapcase()`` method:

In [8]:
fox.swapcase()

'ThE QuicK BrowN FoX.'
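One place these case methods often show up is in comparing strings without regard to case. The following is a minimal sketch of that idea; the example strings are made up for illustration, and ``casefold()`` is a related built-in method (not covered above) intended for this kind of caseless matching:

```python
a = "Quick Brown Fox"
b = "quick BROWN fox"

# A direct comparison is case-sensitive
exact_match = (a == b)

# Normalizing both sides to lower-case ignores the difference in case
same_ignoring_case = (a.lower() == b.lower())

# casefold() is a more aggressive variant of lower(); for example,
# it maps the German "ß" to "ss", which lower() leaves unchanged
folded = "Straße".casefold()
lowered = "Straße".lower()

print(exact_match, same_ignoring_case, folded, lowered)
```

For plain ASCII text ``lower()`` and ``casefold()`` give the same result; the difference only appears with certain non-English characters.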

### Formatting strings: Adding and removing spaces

Another common need is to remove spaces (or other characters) from the beginning or end of the string.
The basic method of removing characters is the ``strip()`` method, which strips whitespace from the beginning and end of the line:

In [9]:
line = '         this is the content         '
line.strip()

'this is the content'

To remove whitespace from only the right or only the left, use ``rstrip()`` or ``lstrip()`` respectively:

In [10]:
line.rstrip()

'         this is the content'

In [11]:
line.lstrip()

'this is the content         '

To remove characters other than spaces, you can pass the characters you want removed to the ``strip()`` method:

In [12]:
num = "000000000000435"
num.strip('0')

'435'
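One detail worth keeping in mind is that the argument to ``strip()`` is treated as a *set* of characters to remove, not as a literal prefix or suffix. Here is a short sketch of that behavior, using made-up example strings:

```python
# Every leading/trailing character found in the argument is removed,
# regardless of the order in which the characters appear
decorated = "--==435==--"
core = decorated.strip("-=")        # '435'

# lstrip() and rstrip() accept the same kind of argument,
# but work on only one side of the string
left_only = "000435000".lstrip("0")   # '435000'
right_only = "000435000".rstrip("0")  # '000435'

# Because the argument is a set of characters, stripping stops
# at the first character that is not in the set
domain = "www.example.com".strip("w.moc")  # 'example'

print(core, left_only, right_only, domain)
```

The last line shows why ``strip()`` is not a safe way to remove a fixed prefix such as ``'www.'``: it would also eat into any following characters that happen to be in the set.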

The opposite of this operation, adding spaces or other characters, can be accomplished using the ``center()``, ``ljust()``, and ``rjust()`` methods.

For example, we can use the ``center()`` method to center a given string within a given number of spaces:

In [13]:
line = "this is the content"
line.center(30)

'     this is the content      '

Similarly, ``ljust()`` and ``rjust()`` will left-justify or right-justify the string within spaces of a given length:

In [14]:
line.ljust(30)

'this is the content           '

In [15]:
line.rjust(30)

'           this is the content'

Each of these methods also accepts an optional fill character, which is used in place of spaces to pad the string.
For example:

In [16]:
'435'.rjust(10, '0')

'0000000435'

Because zero-filling is such a common need, Python also provides ``zfill()``, which is a special method to pad a string on the left with zeros:

In [17]:
'435'.zfill(10)

'0000000435'
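The two approaches differ when the string starts with a sign. As a small sketch, using ``'-42'`` as an illustrative input:

```python
# zfill() is sign-aware: the zeros are inserted after a leading '+' or '-'
signed_zfill = '-42'.zfill(6)       # '-00042'

# rjust() simply pads on the left, so the sign ends up in the middle
signed_rjust = '-42'.rjust(6, '0')  # '000-42'

# If the string is already at least as long as the requested width,
# it is returned unchanged
unchanged = '12345'.zfill(3)        # '12345'

print(signed_zfill, signed_rjust, unchanged)
```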

### Finding and replacing substrings

If you want to find occurrences of a certain character or substring in a string, the ``find()``/``rfind()``, ``index()``/``rindex()``, and ``replace()`` methods are the best built-in methods.

``find()`` and ``index()`` are very similar, in that they search for the first occurrence of a character or substring within a string, and return the index of the substring:

In [18]:
line = 'the quick brown fox jumped over a lazy dog'
line.find('fox')

16

In [19]:
line.index('fox')

16

The only difference between ``find()`` and ``index()`` is their behavior when the search string is not found; ``find()`` returns ``-1``, while ``index()`` raises a ``ValueError``:

In [20]:
line.find('bear')

-1

In [21]:
line.index('bear')

ValueError: substring not found
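Which of the two to use depends on how you want to handle a missing substring. The sketch below shows one way to work with each; ``locate`` is a hypothetical helper written just for this illustration, not a built-in:

```python
line = 'the quick brown fox jumped over a lazy dog'

def locate(text, sub):
    """Return the index of sub in text, or None if it is absent."""
    try:
        return text.index(sub)
    except ValueError:
        # index() signals "not found" with an exception
        return None

found = locate(line, 'fox')     # 16
missing = locate(line, 'bear')  # None

# With find(), the -1 result must be checked explicitly;
# used directly as an index, -1 would silently refer to the last character
position = line.find('bear')
has_bear = position != -1

# If only a yes/no answer is needed, the "in" operator is the simplest test
contains_fox = 'fox' in line

print(found, missing, has_bear, contains_fox)
```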

The related ``rfind()`` and ``rindex()`` work similarly, except they search from the end of the string rather than the beginning, returning the index of the last occurrence:

In [22]:
line.rfind('a')

35

For the special case of checking for a substring at the beginning or end of a string, Python provides the ``startswith()`` and ``endswith()`` methods:

In [23]:
line.endswith('dog')

True

In [24]:
line.startswith('fox')

False
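Both methods also accept a tuple of candidate strings and return ``True`` if any of them matches. Here is a brief sketch, using a made-up list of file names:

```python
# Hypothetical file names, for illustration only
filenames = ["report.csv", "notes.txt", "data.tsv", "image.png"]

# Passing a tuple checks several possible endings in one call
tabular = [name for name in filenames if name.endswith((".csv", ".tsv"))]

# The same works for prefixes
line = 'the quick brown fox jumped over a lazy dog'
starts_with_article = line.startswith(('a ', 'an ', 'the '))

print(tabular, starts_with_article)
```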

To go one step further and replace a given substring with a new string, you can use the ``replace()`` method.
Here, let's replace ``'brown'`` with ``'red'``:

In [25]:
line.replace('brown', 'red')

'the quick red fox jumped over a lazy dog'

The ``replace()`` method returns a new string (the original is left unchanged), and will replace all occurrences of the input:

In [26]:
line.replace('o', '--')

'the quick br--wn f--x jumped --ver a lazy d--g'
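If you want to limit the number of replacements, ``replace()`` takes an optional third argument giving the maximum count. A quick sketch:

```python
line = 'the quick brown fox jumped over a lazy dog'

# Replace only the first occurrence of 'o'
first_only = line.replace('o', '--', 1)

# Replace the first two occurrences
first_two = line.replace('o', '--', 2)

# Strings are immutable, so the original is untouched
print(first_only)
print(first_two)
print(line)
```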

For a more flexible approach to this ``replace()`` functionality, see the discussion of regular expressions in [Flexible Pattern Matching with Regular Expressions](#Flexible-Pattern-Matching-with-Regular-Expressions).

### Splitting and partitioning strings

If you would like to find a substring *and then* split the string based on its location, the ``partition()`` and/or ``split()`` methods are what you're looking for.
Both will return a sequence of substrings.

The ``partition()`` method returns a tuple with three elements: the substring before the first instance of the split-point, the split-point itself, and the substring after:

In [27]:
line.partition('fox')

('the quick brown ', 'fox', ' jumped over a lazy dog')

The ``rpartition()`` method is similar, but searches from the right of the string.

The ``split()`` method is perhaps more useful; it finds *all* instances of the split-point and returns the substrings in between.
The default is to split on any whitespace, returning a list of the individual words in a string:

In [28]:
line.split()

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']
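``split()`` can also be given an explicit separator and a maximum number of splits. The sketch below uses a made-up comma-separated record to show both, along with one difference from the default whitespace behavior:

```python
# Hypothetical log record, for illustration only
record = "2024-01-15,error,disk full, retrying"

# Split on commas, but at most twice, so the comma inside
# the message is left alone
date, level, message = record.split(",", 2)

# With an explicit separator, adjacent separators produce empty strings...
with_sep = "a,,b".split(",")   # ['a', '', 'b']

# ...whereas the default whitespace split collapses runs of whitespace
default = "a  b".split()       # ['a', 'b']

print(date, level, message, with_sep, default)
```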

A related method is ``splitlines()``, which splits on newline characters.
Let's do this with a Haiku, popularly attributed to the 17th-century poet Matsuo Bashō:

In [29]:
haiku = """matsushima-ya
aah matsushima-ya
matsushima-ya"""

haiku.splitlines()

['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']

Note that if you would like to undo a ``split()``, you can use the ``join()`` method, which returns a string built by placing a separator between the elements of an iterable of strings:

In [30]:
'--'.join(['1', '2', '3'])

'1--2--3'

A common pattern is to use the special character ``"\n"`` (newline) to join together lines that have been previously split, and recover the input:

In [31]:
print("\n".join(['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']))

matsushima-ya
aah matsushima-ya
matsushima-ya
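Combining ``split()`` and ``join()`` is also a handy way to clean up irregular whitespace. Here is a minimal sketch with a made-up messy string:

```python
messy = "  the   quick\tbrown \n fox  "

# split() with no arguments breaks on any run of whitespace and
# discards leading/trailing whitespace; join() then rebuilds the
# string with exactly one space between words
clean = " ".join(messy.split())

# join() requires strings, so other types must be converted first
numbers = ", ".join(str(n) for n in [1, 2, 3])

print(clean)
print(numbers)
```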


## Format Strings

In the preceding methods, we have learned how to extract values from strings, and to manipulate strings themselves into desired formats.
Another use of string methods is to manipulate string *representations* of values of other types.
Of course, string representations can always be found using the ``str()`` function; for example:

In [32]:
pi = 3.14159
str(pi)

'3.14159'

For more complicated formats, you might be tempted to use string arithmetic as outlined in [Basic Python Semantics: Operators](04-Semantics-Operators.ipynb):

In [33]:
"The value of pi is " + str(pi)

'The value of pi is 3.14159'

A more flexible way to do this is to use *format strings*, which are strings with special markers (noted by curly braces) into which string-formatted values will be inserted.
Here is a basic example:

In [34]:
"The value of pi is {}".format(pi)

'The value of pi is 3.14159'

Inside the ``{}`` marker you can also include information on exactly *what* you would like to appear there.
If you include a number, it will refer to the index of the argument to insert:

In [35]:
"""First letter: {0}. Last letter: {1}.""".format('A', 'Z')

'First letter: A. Last letter: Z.'

If you include a string, it will refer to the key of any keyword argument:

In [36]:
"""First letter: {first}. Last letter: {last}.""".format(last='Z', first='A')

'First letter: A. Last letter: Z.'

Finally, for numerical inputs, you can include format codes which control how the value is converted to a string.
For example, to print a number in floating-point format with three digits after the decimal point, you can use the following:

In [37]:
"pi = {0:.3f}".format(pi)

'pi = 3.142'

As before, here the "``0``" refers to the index of the value to be inserted.
The "``:``" marks that format codes will follow.
The "``.3f``" encodes the desired precision: three digits beyond the decimal point, floating-point format.
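To give a flavor of what else the format codes can do, here is a short sketch of a few other commonly used ones. The values are made up for illustration, and the final line uses an *f-string*, a related syntax (available in Python 3.6 and later) that accepts the same format codes:

```python
pi = 3.14159

# Minimum width of 10 characters, two digits after the decimal point
padded = "{:10.2f}".format(pi)                 # '      3.14'

# Integer zero-padded to a width of 5
zero_filled = "{:05d}".format(42)              # '00042'

# Comma as a thousands separator
grouped = "{:,}".format(1234567)               # '1,234,567'

# Percentage with one digit after the decimal point
percent = "{:.1%}".format(0.256)               # '25.6%'

# Left- and right-alignment within a width of 6
aligned = "{:<6}|{:>6}".format("ab", "cd")     # 'ab    |    cd'

# The same format codes work inside an f-string
inline = f"pi = {pi:.3f}"                      # 'pi = 3.142'

print(padded, zero_filled, grouped, percent, aligned, inline, sep="\n")
```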

This style of format specification is very flexible, and the examples here barely scratch the surface of the formatting options available.
For more information on the syntax of these format strings, see the [Format Specification](https://docs.python.org/3/library/string.html#formatspec) section of Python's online documentation.