# String Manipulation and Regular Expressions

In [2]:
x = 'a string'
y = "a string"
if x == y:
    print("they are the same")

they are the same


In addition, it is possible to define multi-line strings using a triple-quote syntax:

In [3]:
multiline = """
one 
two 
three 
"""

## Simple String Manipulation in Python

### Formatting strings: Adjusting case

Python makes it quite easy to adjust the case of a string.
Here we'll look at the ``upper()``, ``lower()``, ``capitalize()``, ``title()``, and ``swapcase()`` methods, using the following string as an example:

In [4]:
text = "What does the fox say?"

To convert the entire string into upper-case or lower-case, you can use the ``upper()`` or ``lower()`` methods respectively:

In [8]:
text.upper()

'WHAT DOES THE FOX SAY?'

In [9]:
text.lower()

'what does the fox say?'

A common formatting need is to capitalize just the first letter of each word, or just the first letter of the string.
This can be done with the ``title()`` and ``capitalize()`` methods, respectively; note that ``capitalize()`` also lower-cases the rest of the string:

In [10]:
text.title()

'What Does The Fox Say?'

In [11]:
text.capitalize()

'What does the fox say?'
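
As a small sketch of two edge cases worth keeping in mind (the sample strings and variable names below are made up for illustration), ``capitalize()`` only touches the first character of the whole string, and ``title()`` treats an apostrophe as a word boundary:

```python
# capitalize() upper-cases only the first character of the string
# and lower-cases everything else, including later sentences.
two_sentences = "the fox barks. the Dog listens."
capitalized = two_sentences.capitalize()
print(capitalized)  # The fox barks. the dog listens.

# title() starts a new "word" after any non-letter, such as an apostrophe.
possessive = "the fox's den"
titled = possessive.title()
print(titled)  # The Fox'S Den
```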

The cases can be swapped using the ``swapcase()`` method:

In [12]:
text.swapcase()

'wHAT DOES THE FOX SAY?'

In [13]:
text

'What does the fox say?'
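
The last cell shows that ``text`` is unchanged: strings are immutable, so each of these methods returns a new string. A minimal sketch of what that means in practice (the names ``shouted`` and ``same`` are illustrative only):

```python
text = "What does the fox say?"

# The method returns a new string; the original is left alone.
shouted = text.upper()
print(shouted)  # WHAT DOES THE FOX SAY?
print(text)     # What does the fox say?

# A common use of lower(): comparing strings while ignoring case.
same = "FOX".lower() == "fox".lower()
print(same)  # True
```

To keep a converted value, assign the result to a name, for example ``text = text.upper()``.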

### Formatting strings: Adding and removing spaces

The ``strip()`` method removes whitespace from both ends of a string:

In [19]:
line = '         hello         '
line.strip()

'hello'

To remove whitespace from only the right or only the left, use ``rstrip()`` or ``lstrip()`` respectively:

In [20]:
line.rstrip()

'         hello'

In [21]:
line.lstrip()

'hello         '

To remove characters other than spaces, you can pass the desired character to the ``strip()`` method:

In [22]:
num = "000000000011111125"
num.strip('0').strip('1')

'25'
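
The argument to ``strip()`` is treated as a set of characters to remove, not as a prefix or suffix to match. The sketch below illustrates this with the same ``num`` value plus one made-up string (``tagged``):

```python
num = "000000000011111125"

# Passing several characters strips any of them from both ends,
# so the two chained calls above can be written as one.
stripped = num.strip("01")
print(stripped)  # 25

# Because the argument is a set of characters, lstrip("abc") removes
# every leading 'a', 'b', or 'c', not just one "abc" prefix.
tagged = "abcabc-data"
left_stripped = tagged.lstrip("abc")
print(left_stripped)  # -data
```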

### Finding and replacing substrings

If you want to find or replace occurrences of a certain character or substring in a string, the built-in ``find()``/``rfind()``, ``index()``/``rindex()``, and ``replace()`` methods are the tools to reach for.

``find()`` and ``index()`` are very similar, in that they search for the first occurrence of a character or substring within a string, and return the index of the substring:

In [23]:
line = 'What does the fox say?'
line.find('fox')

14

In [24]:
line[19:21]

'ay'

In [25]:
line.index('fox')

14

The only difference between ``find()`` and ``index()`` is their behavior when the search string is not found; ``find()`` returns ``-1``, while ``index()`` raises a ``ValueError``:

In [26]:
line.find('bear')

-1

In [27]:
line.index('bear')

ValueError: substring not found
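
Which of the two to use depends on how you want to handle a missing substring. The following sketch shows both styles; ``describe_position`` is a hypothetical helper written only for this illustration:

```python
line = 'What does the fox say?'

def describe_position(text, target):
    """Hypothetical helper: report where `target` starts in `text`."""
    pos = text.find(target)
    if pos == -1:  # find() signals "not found" with -1
        return f"{target!r} not found"
    return f"{target!r} starts at index {pos}"

found = describe_position(line, 'fox')
missing = describe_position(line, 'bear')
print(found)    # 'fox' starts at index 14
print(missing)  # 'bear' not found

# index() signals "not found" with an exception instead.
try:
    line.index('bear')
    message = "found"
except ValueError as err:
    message = str(err)
print(message)  # substring not found

# If only a yes/no answer is needed, the `in` operator is enough.
has_fox = 'fox' in line
print(has_fox)  # True
```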

The related ``rfind()`` and ``rindex()`` work similarly, except they search for the first occurrence from the end rather than the beginning of the string:

In [28]:
line.rfind('a')

19
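
``find()`` also accepts an optional start position, which makes it possible to walk through every occurrence. A short sketch, with variable names chosen only for illustration:

```python
line = 'What does the fox say?'

first_a = line.find('a')              # search from the left
last_a = line.rfind('a')              # search from the right
next_a = line.find('a', first_a + 1)  # resume just after the first hit
print(first_a, last_a, next_a)  # 2 19 19

# Collect the index of every 'o' by repeatedly resuming the search.
positions = []
start = line.find('o')
while start != -1:
    positions.append(start)
    start = line.find('o', start + 1)
print(positions)  # [6, 15]
```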

For the special case of checking for a substring at the beginning or end of a string, Python provides the ``startswith()`` and ``endswith()`` methods:

In [29]:
line.endswith('dog')

False

In [30]:
line.startswith('fox')

False
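
Both methods also accept a tuple of candidates and an optional start position. A brief sketch using the same ``line`` (the result names are illustrative):

```python
line = 'What does the fox say?'

# A single suffix
is_question = line.endswith('?')
print(is_question)  # True

# A tuple of prefixes: True if the string starts with any of them
starts_with_wh = line.startswith(('What', 'Who', 'Where'))
print(starts_with_wh)  # True

# An optional start index: does 'fox' begin at position 14?
fox_at_14 = line.startswith('fox', 14)
print(fox_at_14)  # True
```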

To go one step further and replace a given substring with a new string, you can use the ``replace()`` method.
Here, let's replace ``'fox'`` with ``'dog'``:

In [31]:
line.replace('fox', 'dog')

'What does the dog say?'

The ``replace()`` method returns a new string, and will replace all occurrences of the input:

In [32]:
line.replace('o', '--')

'What d--es the f--x say?'
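
If only some occurrences should change, ``replace()`` takes an optional third argument giving the maximum number of replacements. A minimal sketch:

```python
line = 'What does the fox say?'

# Replace only the first 'o'
first_only = line.replace('o', '--', 1)
print(first_only)  # What d--es the fox say?

# The original string is untouched, since replace() returns a new string.
print(line)  # What does the fox say?
```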

### Splitting and partitioning strings

If you would like to find a substring *and then* split the string based on its location, the ``partition()`` and/or ``split()`` methods are what you're looking for.
Both will return a sequence of substrings.

The ``partition()`` method returns a tuple with three elements: the substring before the first instance of the split-point, the split-point itself, and the substring after:

In [34]:
line.partition('fox')

('What does the ', 'fox', ' say?')

The ``rpartition()`` method is similar, but searches from the right of the string.
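
The difference between the two shows up when the split-point occurs more than once, or not at all. The sketch below uses a made-up ``setting`` string alongside the same ``line``:

```python
setting = 'key=value=more'

# partition() splits at the first '=', rpartition() at the last one.
from_left = setting.partition('=')
from_right = setting.rpartition('=')
print(from_left)   # ('key', '=', 'value=more')
print(from_right)  # ('key=value', '=', 'more')

# When the split-point is missing, a 3-tuple is still returned:
# the whole string plus two empty strings.
line = 'What does the fox say?'
not_found = line.partition('bear')
print(not_found)  # ('What does the fox say?', '', '')
```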

The ``split()`` method is perhaps more useful; it finds *all* instances of the split-point and returns the substrings in between.
The default is to split on any whitespace, returning a list of the individual words in a string:

In [35]:
line.split()

['What', 'does', 'the', 'fox', 'say?']
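
``split()`` also accepts an explicit separator and a maximum number of splits. The following sketch (with made-up sample strings) highlights how an explicit separator differs from the whitespace default:

```python
# With an explicit separator, consecutive separators yield empty strings.
fields = 'a,b,,c'.split(',')
print(fields)  # ['a', 'b', '', 'c']

# With no argument, runs of whitespace count as one separator and
# leading/trailing whitespace is ignored.
words = '  a   b '.split()
print(words)  # ['a', 'b']

# maxsplit limits the number of splits, leaving the rest intact.
line = 'What does the fox say?'
head_and_rest = line.split(maxsplit=1)
print(head_and_rest)  # ['What', 'does the fox say?']
```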

A related method is ``splitlines()``, which splits on newline characters.
Let's do this with the full lyric of *What does the fox say?*

In [40]:
song = """Dog goes "woof"
Cat goes "meow"
Bird goes "tweet"
And mouse goes "squeek"
Cow goes "moo"
Frog goes "croak"
And the elephant goes "toot"
Ducks say "quack"
And fish go "blub"
And the seal goes "ow ow ow"
"""

song.splitlines()

['Dog goes "woof"',
 'Cat goes "meow"',
 'Bird goes "tweet"',
 'And mouse goes "squeek"',
 'Cow goes "moo"',
 'Frog goes "croak"',
 'And the elephant goes "toot"',
 'Ducks say "quack"',
 'And fish go "blub"',
 'And the seal goes "ow ow ow"']

Note that if you would like to undo a ``split()``, you can use the ``join()`` method, which returns a string built from a splitpoint and an iterable such as a **list**.

In [41]:
'--'.join(['My', 'name', 'is'])

'My--name--is'

Or, with a single space as the separator:

In [42]:
' '.join(['My','name','is'])

'My name is'

A common pattern is to use the special character ``"\n"`` (newline) to join together lines that have been previously split, and recover the input (apart from a trailing newline, which ``splitlines()`` drops):

In [46]:
song_lyrics = song.splitlines()
print(song_lyrics)

['Dog goes "woof"', 'Cat goes "meow"', 'Bird goes "tweet"', 'And mouse goes "squeek"', 'Cow goes "moo"', 'Frog goes "croak"', 'And the elephant goes "toot"', 'Ducks say "quack"', 'And fish go "blub"', 'And the seal goes "ow ow ow"']


In [47]:
print("\n".join(song_lyrics))

Dog goes "woof"
Cat goes "meow"
Bird goes "tweet"
And mouse goes "squeek"
Cow goes "moo"
Frog goes "croak"
And the elephant goes "toot"
Ducks say "quack"
And fish go "blub"
And the seal goes "ow ow ow"
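
Two details of this round trip are easy to miss, sketched below with a shortened two-line ``song`` (the other names are illustrative): ``splitlines()`` discards the final newline, and ``join()`` only accepts strings.

```python
song = 'Dog goes "woof"\nCat goes "meow"\n'

lines = song.splitlines()
rejoined = "\n".join(lines)

# The trailing newline is not restored by join()...
exact_match = rejoined == song
print(exact_match)  # False

# ...so it has to be added back for an exact round trip.
match_with_newline = rejoined + "\n" == song
print(match_with_newline)  # True

# join() requires string items; convert other types first.
numbers = [1, 2, 3]
joined_numbers = ", ".join(str(n) for n in numbers)
print(joined_numbers)  # 1, 2, 3
```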


## Format Strings

In the preceding sections, we have learned how to extract values from strings, and to manipulate strings themselves into desired formats.
Another use of string methods is to manipulate string *representations* of values of other types.
Of course, string representations can always be found using the ``str()`` function; for example:

In [48]:
salary = 1000000
str(salary)

'1000000'

In [53]:
print ("My salary is " + salary)

TypeError: can only concatenate str (not "int") to str

In [54]:
"My salary is " + str(salary)

'My salary is 1000000'

A more flexible way to do this is to use *format strings*, which are strings with special markers (noted by curly braces) into which string-formatted values will be inserted.
Here is a basic example:

In [55]:
"My salary is {}".format(salary)

'My salary is 1000000'

Inside the ``{}`` marker you can also include information on exactly *what* you would like to appear there.
If you include a number, it will refer to the index of the argument to insert:

In [56]:
"""First letter: {0}. Last letter: {1}.""".format('A', 'Z')

'First letter: A. Last letter: Z.'

If you include a string, it will refer to the key of any keyword argument:

In [57]:
"""First letter: {first}. Last letter: {last}.""".format(last='Z', first='A')

'First letter: A. Last letter: Z.'
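
Two consequences of this indexing are sketched below: a positional index can be reused, and keyword arguments can come from a dictionary. The ``letters`` dictionary is a made-up example:

```python
# The same positional argument may appear more than once.
repeated = "{0}-{1}-{0}".format('A', 'Z')
print(repeated)  # A-Z-A

# Keyword arguments can be unpacked from a dictionary with **.
letters = {'first': 'A', 'last': 'Z'}
from_dict = "First letter: {first}. Last letter: {last}.".format(**letters)
print(from_dict)  # First letter: A. Last letter: Z.
```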

Finally, for numerical inputs, you can include format codes which control how the value is converted to a string.
For example, to print a number in floating-point format with a thousands separator and two digits after the decimal point, you can use the following:

In [58]:
"My salary is {0:,.2f}".format(salary)

'My salary is 1,000,000.00'

Here the "``0``" refers to the index of the value to be inserted.
The "``:``" marks that format codes will follow, and the "``,``" requests a comma as the thousands separator.
The "``.2f``" encodes the desired precision: two digits beyond the decimal point, floating-point format.
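
A few more format codes, as a non-exhaustive sketch (the values are arbitrary and chosen only to show the effect of each code):

```python
salary = 1000000

with_commas = "{:,}".format(salary)         # thousands separator only
two_places = "{:.2f}".format(3.14159)       # fixed-point, 2 decimals
right_aligned = "{:>10}".format("right")    # pad to width 10, align right
left_aligned = "{:<10}|".format("left")     # pad to width 10, align left
percentage = "{:.1%}".format(0.256)         # multiply by 100, add '%'
zero_padded = "{:08.3f}".format(3.14159)    # total width 8, zero-filled

print(with_commas)          # 1,000,000
print(two_places)           # 3.14
print(repr(right_aligned))  # '     right'
print(repr(left_aligned))   # 'left      |'
print(percentage)           # 25.6%
print(zero_padded)          # 0003.142
```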


Python 3.6 and later offer a more concise way to write format strings, the *f-string*. Instead of

```
"string {}".format(variable)
```
we can write

```
f"string {variable}"
```

In [60]:
f"My salary is {salary:,.2f}"

'My salary is 1,000,000.00'
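
The braces of an f-string can hold any expression, not just a variable name, and the same format codes apply after the colon. A small sketch, where ``raise_rate`` and ``animal`` are made-up values:

```python
salary = 1000000
raise_rate = 0.05
animal = "fox"

# An arithmetic expression with a format code
after_raise = f"After a raise: {salary * (1 + raise_rate):,.2f}"
print(after_raise)  # After a raise: 1,050,000.00

# A method call inside the braces
shout = f"The {animal.upper()} says?"
print(shout)  # The FOX says?

# Doubled braces produce literal braces in the output.
literal = f"{{animal}} is replaced by {animal}"
print(literal)  # {animal} is replaced by fox
```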

This style of format specification is very flexible, and the examples here barely scratch the surface of the formatting options available.
For more information on the syntax of these format strings, see the [Format Specification](https://docs.python.org/3/library/string.html#formatspec) section of Python's online documentation.

**[PROGRAMMING EXERCISE]**

Consider the string "$$&HelloWorld,123,ThisIsUniv.Ai!!". Count the uppercase letters, lowercase letters, digits, and special characters it contains.


In [12]:
test_string = '$$&HelloWorld,123,ThisIsUniv.Ai!!'

In [13]:
counter_dict = {'upper':0,'lower':0,'special_char':0,'num_char':0}

In [14]:
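for char in test_string:
    # Classify each character once, in a single if/elif chain
    if char.isupper():
        counter_dict['upper'] += 1
    elif char.islower():
        counter_dict['lower'] += 1
    elif char.isdigit():
        counter_dict['num_char'] += 1
    # Raw string so the backslash is a literal character, not an escape;
    # characters outside this set (such as ',' and '.') are not counted
    elif char in r'[@_!#$%^&*()<>?/\|}{~:]':
        counter_dict['special_char'] += 1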
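
The solution above counts only the special characters listed explicitly. As an alternative sketch, the hypothetical helper ``count_character_types`` below assumes that *any* non-whitespace character that is not a letter or digit counts as special, so commas and periods are included:

```python
def count_character_types(text):
    """Hypothetical helper: tally character classes in `text`."""
    counts = {'upper': 0, 'lower': 0, 'num_char': 0, 'special_char': 0}
    for char in text:
        if char.isupper():
            counts['upper'] += 1
        elif char.islower():
            counts['lower'] += 1
        elif char.isdigit():
            counts['num_char'] += 1
        elif not char.isspace():
            # Assumption: everything else that is not whitespace is "special"
            counts['special_char'] += 1
    return counts

test_string = '$$&HelloWorld,123,ThisIsUniv.Ai!!'
result = count_character_types(test_string)
print(result)
# {'upper': 6, 'lower': 16, 'num_char': 3, 'special_char': 8}
```

Which definition of "special character" is appropriate depends on the exercise; the explicit list gives tighter control, while this version avoids having to enumerate every symbol.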