# String manipulation



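``ljust()`` and ``rjust()`` left-justify or right-justify the string by padding it with spaces up to a given width. An optional second argument sets the fill character, and a string that is already longer than the requested width is returned unchanged: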

In [5]:
line = "Some words here"

print(line.ljust(30))
print(line.rjust(30))
# Can add a character as fill
print(line.rjust(25, '|'))
print(line.rjust(5, '|'))

Some words here               
               Some words here
||||||||||Some words here
Some words here


Fun fact: JavaScript long had no built-in equivalent of `rjust`, which led to a widely installed npm library called [left-pad](https://www.theregister.com/2016/03/23/npm_left_pad_chaos/). When the library was unpublished in 2016, it briefly broke the builds of thousands of projects that depended on it.

### Finding and replacing substrings

If you want to find occurrences of a certain substring in a string, there are the ``find()``, ``index()``, and ``replace()`` methods.

The only difference between ``find()`` and ``index()`` is their behavior when the search string is not found; ``find()`` returns ``-1``, while ``index()`` raises a ``ValueError``:

In [10]:
line = 'the quick brown fox jumped over a lazy dog'
print(line.find('bear')) # returns -1 if not found
print(line.find('fox')) # returns the starting position of the word
line.replace('brown', 'red')

-1
16


'the quick red fox jumped over a lazy dog'
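
The cell above only exercises ``find()``. As a minimal sketch of the difference described earlier, the following contrasts the two methods on a missing substring (the names `position` and `index_found` are just illustrative):

```python
line = 'the quick brown fox jumped over a lazy dog'

# find() signals "not found" with -1, so the result has to be checked explicitly
position = line.find('bear')
print(position)   # -1

# index() raises ValueError instead, so a miss cannot be silently ignored
try:
    line.index('bear')
    index_found = True
except ValueError:
    index_found = False
print(index_found)   # False

# When the substring is present, both methods return the same position
print(line.find('fox') == line.index('fox'))   # True
```

Which one to prefer is a judgment call: ``find()`` suits cases where a miss is expected, while ``index()`` suits cases where a miss indicates a bug.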

For the special case of checking for a substring at the beginning or end of a string, Python provides the ``startswith()`` and ``endswith()`` methods:

In [9]:
line.endswith('dog')
line.startswith('fox')

False
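
Note that a notebook cell only displays the value of its last expression, which is why a single ``False`` appears above. A small sketch that prints each result explicitly, and also shows that both methods accept a tuple of candidates:

```python
line = 'the quick brown fox jumped over a lazy dog'

# Print each result so that neither one is hidden
print(line.endswith('dog'))     # True
print(line.startswith('fox'))   # False: the string starts with 'the'

# A tuple of candidates returns True if any one of them matches
print(line.startswith(('a', 'the')))   # True
```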

### Splitting

The ``split()`` method finds *all* instances of the split-point and returns the substrings in between.
The default is to split on any whitespace, returning a list of the individual words in a string:

In [28]:
line.split()

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']
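
To illustrate splitting on an explicit split-point rather than whitespace, here is a short sketch; the string `record` is hypothetical data made up for the example:

```python
record = "alpha,beta,,gamma"  # hypothetical comma-separated data

# Every comma is a split-point, so the empty field between the two commas is kept
fields = record.split(",")
print(fields)   # ['alpha', 'beta', '', 'gamma']

# The optional second argument limits the number of splits performed
head_and_rest = record.split(",", 1)
print(head_and_rest)   # ['alpha', 'beta,,gamma']
```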

A related method is ``splitlines()``, which splits on newline characters.
Let's do this with a Haiku, popularly attributed to the 17th-century poet Matsuo Bashō:

In [29]:
haiku = """matsushima-ya
aah matsushima-ya
matsushima-ya"""

haiku.splitlines()

['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']

### Joining

The ``join()`` method returns a string built by joining the items of an iterable, using the string it is called on as the separator:

In [30]:
'--'.join(['1', '2', '3'])

'1--2--3'
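
One detail worth knowing is that ``join()`` expects every item to already be a string. The sketch below (with an illustrative list called `numbers`) shows the error raised for integers and one common way around it:

```python
numbers = [1, 2, 3]

# Joining non-string items directly raises TypeError
try:
    '--'.join(numbers)
    joined_directly = True
except TypeError:
    joined_directly = False
print(joined_directly)   # False

# Converting each item with str() first gives the expected result
joined = '--'.join(str(n) for n in numbers)
print(joined)   # 1--2--3
```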

A common pattern is to use the special character ``"\n"`` (newline) to join together lines that have been previously split, and recover the input:

In [31]:
print("\n".join(['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']))

matsushima-ya
aah matsushima-ya
matsushima-ya
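
As a quick check of the round trip, the sketch below splits the haiku and joins it again. It also shows one caveat: a trailing newline is dropped by ``splitlines()``, so in that case the input is not recovered exactly.

```python
haiku = """matsushima-ya
aah matsushima-ya
matsushima-ya"""

# Split into lines, then join them back with newlines
lines = haiku.splitlines()
rebuilt = "\n".join(lines)
print(rebuilt == haiku)   # True

# A trailing newline does not survive the round trip
with_trailing = haiku + "\n"
rebuilt_trailing = "\n".join(with_trailing.splitlines())
print(rebuilt_trailing == with_trailing)   # False
```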


## Format Strings

In the preceding methods, we have learned how to extract values from strings, and to manipulate strings themselves into desired formats.
Another use of string methods is to manipulate string *representations* of values of other types.
Of course, string representations can always be found using the ``str()`` function; for example:

In [2]:
pi = 3.14159
str(pi)

'3.14159'

For numerical inputs, f-strings (string literals prefixed with ``f``) let you embed a value in braces and follow it with a format code that controls how the value is converted to a string.
For example, to print a number as a floating point with three digits after the decimal point, you can use the following:

In [3]:
f"pi = {pi:.3f}"

'pi = 3.142'
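
A few more format codes, as a small sketch rather than a complete reference. The values other than `pi` are made up for illustration:

```python
pi = 3.14159

two_decimals = f"{pi:.2f}"      # fixed-point, two digits after the decimal point
padded = f"{pi:10.3f}"          # same idea, right-aligned in a field of width 10
scientific = f"{pi:.2e}"        # scientific notation
thousands = f"{1234567:,}"      # comma as a thousands separator
percent = f"{0.256:.1%}"        # multiply by 100 and append a percent sign

print(two_decimals)   # 3.14
print(repr(padded))   # '     3.142'
print(scientific)     # 3.14e+00
print(thousands)      # 1,234,567
print(percent)        # 25.6%
```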

# Regular Expressions

In [4]:
!ls *.ipynb

14-Strings-and-Regular-Expressions.ipynb
Workshop Python files.ipynb
python files and exceptions.ipynb


[Regular expressions](https://en.wikipedia.org/wiki/Regular_expression) generalize this "wildcard" idea (the ``*`` in the command above matches any run of characters) to a wide range of flexible string-matching syntaxes.

In [39]:
import re
regex = re.compile(r'\s+')  # raw string, so the backslash reaches the regex engine unchanged
regex.split(line)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

Here we've first *compiled* a regular expression, then used it to *split* a string.

Just as the `str.split()` method returns a list of substrings, the regular expression ``split()`` method returns a list of the substrings between matches to the input pattern.

In this case, the input is ``"\s+"``: "``\s``" is a special sequence that matches any whitespace character (space, tab, newline, etc.), and the "``+``" is a character that indicates *one or more* of the entity preceding it.
Thus, the regular expression matches any substring consisting of one or more whitespace characters.
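
To see why matching a run of whitespace is useful, the sketch below compares the regular expression with splitting on a single literal space. The string `messy` is made up for the example:

```python
import re

messy = "the  quick\tbrown\nfox"   # two spaces, a tab, and a newline
regex = re.compile(r'\s+')

# Each run of whitespace, whatever its makeup, counts as one split-point
regex_parts = regex.split(messy)
print(regex_parts)   # ['the', 'quick', 'brown', 'fox']

# Splitting on a single literal space misses the tab and newline
# and leaves an empty string between the two adjacent spaces
space_parts = messy.split(' ')
print(space_parts)   # ['the', '', 'quick\tbrown\nfox']
```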

The ``split()`` method here is basically a convenience routine built upon this *pattern matching* behavior; more fundamental is the ``match()`` method, which will tell you whether the beginning of a string matches the pattern:

In [40]:
for s in ["     ", "abc  ", "  abc"]:
    if regex.match(s):
        print(repr(s), "matches")
    else:
        print(repr(s), "does not match")

'     ' matches
'abc  ' does not match
'  abc' matches
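
Because ``match()`` only looks at the beginning of the string, it helps to compare it with two related methods of compiled patterns: ``search()``, which looks anywhere in the string, and ``fullmatch()``, which requires the whole string to match. A minimal sketch:

```python
import re

regex = re.compile(r'\s+')
s = "abc  "

# match() anchors at the start, so trailing whitespace is not found
match_result = regex.match(s)
print(match_result)   # None

# search() scans the whole string and finds the trailing whitespace
search_span = regex.search(s).span()
print(search_span)   # (3, 5)

# fullmatch() succeeds only if the entire string matches the pattern
all_whitespace = regex.fullmatch("     ") is not None
partly_whitespace = regex.fullmatch("  abc") is not None
print(all_whitespace, partly_whitespace)   # True False
```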


### Advanced regexing

Regular expressions form a full **language** for describing patterns in strings. As an example, here is a pattern that matches simple email addresses:

In [46]:
email = re.compile(r'\w+@\w+\.[a-z]{3}')  # raw string avoids invalid-escape warnings

Using this, if we're given a line from a document, we can quickly extract things that look like email addresses:

In [47]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email.findall(text)

['guido@python.org', 'guido@google.com']

(Note that these addresses are entirely made up; there are probably better ways to get in touch with Guido).

We can do further operations, like replacing these email addresses with another string, perhaps to hide addresses in the output:

In [48]:
email.sub('--@--.--', text)

'To email Guido, try --@--.-- or the older address --@--.--.'

Finally, note that if you really want to match *any* email address, the preceding regular expression is far too simple.
For example, it only allows addresses whose name and domain are made of word characters (letters, digits, and underscores), followed by a suffix of three lowercase letters.
So, for example, the period in the following address means that we only find part of it:

In [49]:
email.findall('barack.obama@whitehouse.gov')

['obama@whitehouse.gov']

This goes to show how unforgiving regular expressions can be if you're not careful!
If you search around online, you can find some suggestions for regular expressions that will match *all* valid emails, but beware: they are much more involved than the simple expression used here!
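
As a hedged illustration of how such patterns evolve, the sketch below widens the character class for the name so that periods are allowed. The pattern `email_with_dots` is an assumption made up for this example, not a recommended email validator, and the second address is hypothetical:

```python
import re

# Assumption for illustration: allow periods as well as word characters in the name
email_with_dots = re.compile(r'[\w.]+@\w+\.[a-z]{3}')

# The dotted name is now captured in full
dotted = email_with_dots.findall('barack.obama@whitehouse.gov')
print(dotted)   # ['barack.obama@whitehouse.gov']

# The domain part is still too simple: a multi-part domain gets truncated
truncated = email_with_dots.findall('user@mail.example.co.uk')
print(truncated)   # ['user@mail.exa']
```

Each fix tends to expose another case the pattern misses, which is why the fully general expressions end up so involved.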