## <center> Lecture 8 </center>
## <center> Strings </center>

## Formatting strings
* We have previously covered the formatting of numeric strings:

In [1]:
print(f'{3.14156:0.2f}')

3.14


```f``` tells Python that we are creating a _formatted_ string

```{3.14156:0.2f}``` tells Python to format the number as a ```float``` with 2 digits after the decimal point
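* As a small sketch of how the precision works, the cell below formats the same number with different precisions; the variable names are illustrative only:

```python
value = 3.14156

# The number after the dot sets how many digits follow the decimal point.
two_places = f'{value:.2f}'
four_places = f'{value:.4f}'
zero_places = f'{value:.0f}'

print(two_places)   # 3.14
print(four_places)  # 3.1416
print(zero_places)  # 3
```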

* The ```d``` (decimal, i.e. base-10 integer) presentation type only works with integers; applying it to a ```float``` raises an error:

In [2]:
print(f'{3.14156:1d}')

ValueError: Unknown format code 'd' for object of type 'float'
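* One possible way around this error, assuming that truncating the fractional part is acceptable, is to convert the value with ```int()``` before formatting:

```python
value = 3.14156

# 'd' only accepts integers, so convert first (int() truncates toward zero).
as_int = f'{int(value):d}'
print(as_int)  # 3

# Applying 'd' directly to a float raises ValueError.
try:
    f'{value:d}'
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```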

## Formatting integers
* Integer types can be formatted according to:

In [None]:
f'{1979:d}'

* ```d``` tells Python that what precedes ```:``` is a number that we want displayed as an integer

* Even though its usage is rare, we can also format _characters_:

In [None]:
f'{76:c}{79+32:c}{81+32:c}{68:c}'

* The number ```76``` maps to the character ```L```
* The number ```111``` maps to the character ```o```
* The number ```113``` maps to the character ```q```
* The number ```68``` maps to the character ```D```
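* The mapping between numbers and characters can also be explored with the built-in functions ```chr()``` and ```ord()```. The sketch below uses illustrative variable names and shows why adding 32 to an upper-case ASCII code gives the lower-case letter:

```python
codes = [76, 111, 113, 68]

# The 'c' presentation type and chr() both turn a code point into a character.
via_format = ''.join(f'{code:c}' for code in codes)
via_chr = ''.join(chr(code) for code in codes)
print(via_format)  # LoqD
print(via_chr)     # LoqD

# ord() goes the other way: from a character to its code point.
print(ord('L'))  # 76

# Lower-case ASCII letters sit 32 places after their upper-case forms.
offset = ord('o') - ord('O')
print(offset)  # 32
```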

## Formatting strings
* For strings, the default _presentation type_ is ```s```:

In [None]:
f'{"Loquacious":s}{"D":s}'

* Note that the quotes around the inner string must differ from the quotes enclosing the f-string (here, double quotes inside single quotes); Python 3.12 and later relax this restriction
* Because ```s``` is the default, we can implement the above more simply:

In [None]:
f'{"Loquacious"}{"D"}'

## Formatting with scientific notation
* Very small or very large numbers are more easily expressed in _scientific_ notation:

In [None]:
f'{300000000:.2e}'

* The ```.2e``` tells Python to display the number in scientific notation with 2 digits after the decimal point
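* A short sketch of the ```e``` presentation type with a large and a small number; the variable names and the small value are made up for illustration:

```python
speed_of_light = 300000000
small = 0.00000123

# The precision controls the digits shown after the decimal point.
large_two = f'{speed_of_light:.2e}'
large_zero = f'{speed_of_light:.0e}'
small_two = f'{small:.2e}'

print(large_two)   # 3.00e+08
print(large_zero)  # 3e+08
print(small_two)   # 1.23e-06
```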

## Field widths
* We can tell Python to display a value within a certain _field width_:

In [None]:
f'{3.141592:10f}'

* Notice that the string ```3.141592``` has been right-justified in a string of 10 character places
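* The padding is easier to see with ```repr()``` and ```len()```. The sketch below also shows, as a side note, that strings are left-justified by default and that a width can be combined with a precision:

```python
# Numbers are right-justified within the field by default.
padded = f'{3.141592:10f}'
print(repr(padded))  # '  3.141592'
print(len(padded))   # 10

# Strings are left-justified within the field by default.
word = f'{"LoqD":10s}'
print(repr(word))    # 'LoqD      '

# Width and precision can be combined: 10 wide, 2 digits after the point.
rounded = f'{3.141592:10.2f}'
print(repr(rounded))  # '      3.14'
```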

## Python's older string formatting method
* The f-string syntax described above is relatively new to Python; it was introduced in Python 3.6
* Formerly, it was advised to use the string method ```format()```:

In [None]:
'{:0.2f}'.format(3.14156)

* ```'{:0.2f}'``` is a template string whose ```{}``` placeholder contains the format specification
* ```.format``` is a call to the ```format()``` method
* ```3.14156``` is the argument of ```format()```, which is the number that we want to insert into the placeholder
* The f-string syntax is more compact, which is why it is now generally preferred
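* To compare the two approaches side by side, the sketch below formats the same value both ways; the template text in the second half is invented for illustration:

```python
value = 3.14156

# Both syntaxes accept the same format specification.
old_style = '{:0.2f}'.format(value)
new_style = f'{value:0.2f}'
print(old_style == new_style)  # True

# format() fills several placeholders in order, which keeps it useful
# when the template is stored separately from the values.
template = '{} is roughly {:0.2f}'
message = template.format('pi', value)
print(message)  # pi is roughly 3.14
```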

## Concatenating strings
* Strings can be joined to one another, or _concatenated_
* There are at least two syntaxes for this operation:

In [None]:
s1 = 'Loquacious'
s2 = 'D is in da house'
s3 = s1 + s2
print(s3)

In [None]:
s1 += s2
print(s1)

* ```+=``` tells Python to "add" string ```s2``` to ```s1```
* Adding here is synonymous with joining
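* Note that concatenation does not insert any separator. A minimal sketch, assuming we want a space between the two pieces:

```python
s1 = 'Loquacious'
s2 = 'D is in da house'

# Plain concatenation joins the strings exactly as they are.
joined = s1 + s2
print(joined)  # LoquaciousD is in da house

# Any separator has to be added explicitly.
spaced = s1 + ' ' + s2
print(spaced)  # Loquacious D is in da house
```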

## Repeating strings
* Strings can also be repeated:

In [None]:
s1 = 'LoqD '
s2 = s1*5
print(s2)

In [None]:
s1 = 'LoqD '
s1 *= 5
print(s1)

* ```s1*=5``` tells Python to "multiply" s1 by 5

## Removing white space before and after strings
* Removing white space is a common operation when working with string data
* The string method ```strip()``` removes leading and trailing white space:

In [None]:
s1 = '   Loquacious D is in da house.   '
s2 = s1.strip()
print("Before strip(): ", s1)
print("After strip(): ", s2)

* We can check the lengths of ```s1``` and ```s2``` to verify the removal:

In [None]:
len(s1)

In [None]:
len(s2)

* Q: How many spaces appeared at the front and back of ```s1``` ?
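* One way to check your answer, sketched here with the related methods ```lstrip()``` and ```rstrip()``` (which strip only the left or right side), is to compare lengths before and after stripping:

```python
s1 = '   Loquacious D is in da house.   '

# lstrip() removes only leading white space, rstrip() only trailing.
leading = len(s1) - len(s1.lstrip())
trailing = len(s1) - len(s1.rstrip())
print(leading, trailing)

# Together they account for everything that strip() removes.
print(leading + trailing == len(s1) - len(s1.strip()))  # True
```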

## Converting to lower or upper case
* Strings provide the ```lower()``` and ```upper()``` methods for converting text to all lower case or ALL CAPS:

In [None]:
s1 = 'Loquacious D is in da house'
print(s1.lower())
print(s1.upper())

## String comparisons
* Two strings are equal if and only if _every_ character matches
* String comparisons are case-sensitive!

In [None]:
s1 = 'Loquacious D'
s2 = 'loquacious d'
print(s1==s2)

In [None]:
s1 = 'Loquacious D'
s2 = 'Loquacious D'
print(s1==s2)

## Converting to ALL CAPS before checking strings for equality
* In practice, you often want to compare strings without considering case
* A common technique is to first convert the strings to upper case:

In [None]:
s1 = 'Loquacious D'
s2 = 'loquacious d'
s1u = s1.upper()
s2u = s2.upper()
print(s1u==s2u)
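* If this comparison is needed in several places, it could be wrapped in a small helper. The function name ```equal_ignoring_case``` is hypothetical and not part of Python:

```python
def equal_ignoring_case(a, b):
    """Return True if a and b match once both are converted to upper case."""
    return a.upper() == b.upper()


print(equal_ignoring_case('Loquacious D', 'loquacious d'))  # True
print(equal_ignoring_case('Loquacious D', 'Loquacious E'))  # False
```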

## Counting occurrences of a _substring_ is accomplished with ```count()```

In [None]:
s1 = 'She sells seashells by the seashore'
s1.count('sea')

## Locating the occurrences of a substring
* We can use the ```index()``` method to find the starting index of the first occurrence of a substring:

In [None]:
s1.index('sea')

* Let's verify that "sea" really is located at position 10

In [None]:
s1[10:13]
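* ```index()``` reports only the first occurrence and raises a ```ValueError``` when the substring is absent. A possible sketch for collecting every occurrence uses the related method ```find()```, which returns ```-1``` instead of raising an error; the loop below is one of several ways to do this:

```python
s1 = 'She sells seashells by the seashore'

positions = []
start = s1.find('sea')
while start != -1:
    positions.append(start)
    # Resume the search just past the previous match.
    start = s1.find('sea', start + 1)

print(positions)  # [10, 27]
```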

## Identifying whether a string contains a substring
* The ```in``` operator tells you whether a string contains a substring:

In [None]:
"sea" in s1

In [None]:
"Loquacious" in s1

## Finding and replacing substrings
* The string method ```replace()``` takes two arguments:
    * the substring to be replaced
    * the substring to replace it with
* It is common to replace _delimiters_, for example commas with spaces:

In [None]:
s1 = '12,34,45,56,67'
s2 = s1.replace(',',' ')
print("Before replace:", s1)
print("After replace:", s2)
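* Two details are worth sketching: ```replace()``` returns a new string and leaves the original untouched, and an optional third argument limits how many replacements are made:

```python
s1 = '12,34,45,56,67'

# All commas are replaced; s1 itself is not modified.
s2 = s1.replace(',', ' ')
print(s1)  # 12,34,45,56,67
print(s2)  # 12 34 45 56 67

# The optional count argument replaces only the first match here.
first_only = s1.replace(',', ' ', 1)
print(first_only)  # 12 34,45,56,67
```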

## _Tokenizing_ a string with ```split()```
* The method ```split()``` decomposes a string into a list of elements
* The argument specifies the delimiter at which the string is split:

In [None]:
names = "Billy;Bob;Cassie;Delilah;Earnest"
tokens = names.split(';')
for token in tokens:
    print(token)

* ```tokens = names.split(';')``` returns a list:

In [None]:
type(tokens)
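* The tokens are always strings, even when they look like numbers. A minimal sketch, reusing the comma-separated numbers from the ```replace()``` example, converts them to integers; it also shows that ```split()``` without an argument splits on runs of white space:

```python
numbers = '12,34,45,56,67'

# Each token is a string, so convert before doing arithmetic.
values = [int(token) for token in numbers.split(',')]
total = sum(values)
print(values)  # [12, 34, 45, 56, 67]
print(total)   # 214

# With no argument, split() separates on any run of white space.
words = 'She  sells   seashells'.split()
print(words)  # ['She', 'sells', 'seashells']
```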

## Joining strings
* The ```join()``` method performs the opposite of ```split()```: it merges tokens back into a string
* Let's rebuild our string of names, but this time separated by spaces instead of semicolons:

In [None]:
' '.join(tokens)
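* As a quick check that ```join()``` really undoes ```split()```, the sketch below performs a round trip with the original delimiter:

```python
names = "Billy;Bob;Cassie;Delilah;Earnest"
tokens = names.split(';')

# Joining with the same delimiter restores the original string.
rejoined = ';'.join(tokens)
print(rejoined == names)  # True

# Joining with a different delimiter changes only the separator.
spaced = ' '.join(tokens)
print(spaced)  # Billy Bob Cassie Delilah Earnest
```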

## Splitting long strings into lines
* The method ```splitlines()``` returns a list containing each line of a string
* Let's consider a simple haiku:

In [None]:
my_long_string = "Loquacious D is  \n just a name but the man \n is stranger than words"

In [None]:
haiku_lines = my_long_string.splitlines()

In [None]:
for line in haiku_lines:
    print(line)
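* The haiku string has extra spaces around each line break, and ```splitlines()``` keeps them. One possible cleanup, combining it with ```strip()``` from earlier in the lecture:

```python
my_long_string = "Loquacious D is  \n just a name but the man \n is stranger than words"

# Strip the stray spaces left at the start and end of each line.
clean_lines = [line.strip() for line in my_long_string.splitlines()]

for line in clean_lines:
    print(line)
```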

## Regular expressions
* Sometimes it is not the exact string, but just a _pattern_, that you are looking for
* For example, one may want to search for Twitter handles or email addresses on social media
* In these cases, we may ask Python to find substrings that match the pattern
* Regular expressions are frequently used when automatically collecting data from the web, a practice known as _web scraping_

## Matching zip codes
* As a first example, consider identifying US zip codes:

In [None]:
import re
# Raw string (r'...') so that the backslashes reach the regex engine unchanged
pattern = r'\d\d\d\d\d'

candidates = ['90210', '10031', '9486']

for c in candidates:
    if re.fullmatch(pattern, c):
        print(c, "is a valid zip code")
    else:
        print(c, "is not a valid zip code")

* The module ```re``` provides us with regular expression functionality
* The pattern ```\d\d\d\d\d``` tells Python that we expect zip codes to have 5 consecutive digits with no spaces
* The call ```fullmatch(pattern, c)``` tests whether the ```pattern``` matches the entire string ```c```
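* To see why ```fullmatch()``` matters here, the sketch below compares it with ```re.match()```, which only requires the pattern to match at the start of the string. The six-digit candidate is made up for illustration, and ```\d{5}``` is shorthand for five ```\d``` in a row:

```python
import re

pattern = r'\d{5}'   # equivalent to \d\d\d\d\d
too_long = '902101'  # six digits, so not a valid five-digit zip code

# match() succeeds because the first five characters fit the pattern.
prefix_ok = re.match(pattern, too_long) is not None
print(prefix_ok)  # True

# fullmatch() fails because one character is left over.
whole_ok = re.fullmatch(pattern, too_long) is not None
print(whole_ok)  # False
```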

## Character classes
* ```\d``` is a so-called _character class_ for digits
* Other character classes are:

* ```\D```: any character that is not a digit
* ```\s```: any whitespace character (space, tab, newline)
* ```\S```: any non-white space character
* ```\w```: any letter, digit, or underscore
* ```\W```: any character that is not a letter, digit, or underscore
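* A small sketch of these classes in action, using ```re.findall()``` (which returns every match of a pattern in a string) on a made-up sample string:

```python
import re

sample = 'Zip 90210_ok!'

digits = re.findall(r'\d', sample)
spaces = re.findall(r'\s', sample)
word_chars = ''.join(re.findall(r'\w', sample))
non_word = re.findall(r'\W', sample)

print(digits)      # ['9', '0', '2', '1', '0']
print(spaces)      # [' ']
print(word_chars)  # Zip90210_ok
print(non_word)    # [' ', '!']
```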

* Another example: testing for Twitter handles
* Let's first make a list of strings that will contain possible handles to test for validity

In [None]:
candidates = ['@the_loquacious_one', '#loquaciousLife', '@pommieboss']

In [None]:
for c in candidates:
    if re.fullmatch(r'@\w*', c):
        print(c, "is a valid Twitter handle")
    else:
        print(c, "is not a valid Twitter handle")

* The ```*``` tells Python that we are allowing _any_ number (zero or more) of "word characters" ```\w```

* If we want to limit the minimum or maximum length of the Twitter handle, we can use the following regular expression:

In [None]:
candidates = ['@A','@the_loquacious_one', '#loquaciousLife', '@pommieboss']
for c in candidates:
    if re.fullmatch(r'@\w{3,100}', c):
        print(c, "is a valid Twitter handle")
    else:
        print(c, "is not a valid Twitter handle")

* Because we have specified a minimum of 3 characters after the ```@``` sign, the first element of ```candidates``` is correctly identified as invalid
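* The quantifiers can be compared directly. The sketch below uses illustrative candidate strings and also includes ```+``` (one or more), a standard quantifier not used above:

```python
import re

# '*' allows zero word characters, so a bare '@' is accepted.
star_bare = re.fullmatch(r'@\w*', '@') is not None
print(star_bare)  # True

# '+' requires at least one word character, so a bare '@' is rejected.
plus_bare = re.fullmatch(r'@\w+', '@') is not None
print(plus_bare)  # False

# '{3,100}' requires between 3 and 100 word characters.
too_short = re.fullmatch(r'@\w{3,100}', '@A') is not None
long_enough = re.fullmatch(r'@\w{3,100}', '@abc') is not None
print(too_short)    # False
print(long_enough)  # True
```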