# Strings

https://pyformat.info


Some useful string methods. Let us consider the following string:

```python
x = 'This is a string.'
```

- `x.split(sep=None)`
  - Returns a list of the substrings of `x` delimited by `sep`.
    If `sep` is not specified or is `None`, any run of
    whitespace is a separator and empty strings are
    removed from the result.

- `x.upper()`
  - Returns the string `x` with all characters uppercase.

- `x.lower()`
  - Returns the string `x` with all characters in lowercase.
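The three methods listed above can be tried on the example string as follows. This is a minimal sketch; the variable names `words`, `shout` and `whisper` are only illustrative.

```python
x = 'This is a string.'

# split() with no arguments splits on any run of whitespace
words = x.split()
print(words)

# upper() and lower() return new strings; x itself is not modified
shout = x.upper()
whisper = x.lower()
print(shout)
print(whisper)
print(x)
```

Strings are immutable, so each call builds a new object and the original `x` can still be used afterwards.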


In [240]:
x = 'This is a string'
type(x)

str

In [241]:
words = x.split()
words

['This', 'is', 'a', 'string']

In [478]:
# by default split uses any run of whitespace as separator
x = 'We can split separate sentences. Using sep argument.'
x.split()

['We', 'can', 'split', 'separate', 'sentences.', 'Using', 'sep', 'argument.']

In [479]:
# We can define custom separators
x = 'We can split separate sentences. Using sep argument'
x.split(sep=".")

['We can split separate sentences', ' Using sep argument']

## Formatting strings

We can format strings using the `x.format` method. This method allows us to insert values into the string `x`. We write placeholders `{}` inside the string `x` at the positions where the values should appear, and pass the values as arguments to `format`.

Let us see some examples


In [353]:
names = ['David', 'Jaquim', 'Michael']
ages = [19, 30, 50]

In [378]:
for name,age in zip(names,ages):
    print('name: {}\t age: {}'.format(name,age))

name: David	 age: 19
name: Jaquim	 age: 30
name: Michael	 age: 50


#### Placeholders and digit formatting

We can specify how a number is formatted inside a placeholder, for example with a thousands separator, with zero padding to a minimum width, or with a fixed number of decimal places.

In [454]:
print('big number {} hard to read'.format(10**10))

big number 10000000000 hard to read


In [532]:
print('big number {:,} better with commas'.format(10**10))

big number 10,000,000,000 better with commas


In [535]:
print('big number {:,.3f} also with decimal places'.format(10**10))

big number 10,000,000,000.000 also with decimal places


In [397]:
for i in range(5,15):
    print('number {:02}'.format(i))

number 05
number 06
number 07
number 08
number 09
number 10
number 11
number 12
number 13
number 14


In [477]:
# Decide the number of decimals
for i in range(5,15):
    print('number 2 decimals {:.2f}'.format(i/3.),end="\t")
    print('number 3 decimals {:.3f}'.format(i/3.))

number 2 decimals 1.67	number 3 decimals 1.667
number 2 decimals 2.00	number 3 decimals 2.000
number 2 decimals 2.33	number 3 decimals 2.333
number 2 decimals 2.67	number 3 decimals 2.667
number 2 decimals 3.00	number 3 decimals 3.000
number 2 decimals 3.33	number 3 decimals 3.333
number 2 decimals 3.67	number 3 decimals 3.667
number 2 decimals 4.00	number 3 decimals 4.000
number 2 decimals 4.33	number 3 decimals 4.333
number 2 decimals 4.67	number 3 decimals 4.667


#### Placeholders with integer values 
We can also put integers inside the placeholders; each integer refers to the position of an argument passed to `format`. This can be useful in a variety of situations. For example, if we want to print a repeated value inside a string, we don't need to pass it to the format method several times.

In [355]:
# We don't need to do this
for name,age in zip(names,ages):
    print('name: {}\t name again: {} \t age: {}'.format(name, name, age))

name: David	 name again: David 	 age: 19
name: Jaquim	 name again: Jaquim 	 age: 30
name: Michael	 name again: Michael 	 age: 50


In [357]:
# We can simply use placeholders with integers inside to refer
# to the position of the input of the format method.
for name,age in zip(names,ages):
    print('name: {0}\t name again: {0} \t age: {1}'.format(name,age))

name: David	 name again: David 	 age: 19
name: Jaquim	 name again: Jaquim 	 age: 30
name: Michael	 name again: Michael 	 age: 50


#### Placeholders with keyword arguments

We can also use keyword arguments inside the format method. By doing so we don't need to take into account the order in which the inputs of `format` are passed.


In [362]:
for n,a in zip(names,ages):
    print('name: {name}\t \t age: {age}'.format(age=a, name=n))

name: David	 	 age: 19
name: Jaquim	 	 age: 30
name: Michael	 	 age: 50


#### Placeholders with dictionary inputs

We can pass dictionaries in the `format` method and use the keys of the dictionaries inside the placeholders.

In [376]:
names = ['David', 'Jaquim', 'Michael']
ages = [19, 30, 50]

d = []

for n,a in zip(names,ages):
    d.append({'name':n, 'age':a})

In [498]:
for d_k in d:
    print('name: {name}\t \t age: {age}'.format(**d_k))

name: David	 	 age: 19
name: Jaquim	 	 age: 30
name: Michael	 	 age: 50


## Date strings

In [537]:
import datetime
my_date = datetime.datetime(2017,10,5,10)

In [538]:
'The date was {m.day}/{m.month}/{m.year}'.format(m=my_date)

'The date was 5/10/2017'
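As a complement to the attribute access above, `datetime` objects also accept `strftime`-style codes after the colon in a placeholder. The sketch below assumes the same `my_date` value; the names `padded` and `with_time` are only illustrative.

```python
import datetime

my_date = datetime.datetime(2017, 10, 5, 10)

# The codes after the colon are handed to the datetime object,
# which interprets them like strftime does
padded = 'The date was {:%d/%m/%Y}'.format(my_date)
with_time = 'The date was {:%Y-%m-%d %H:%M}'.format(my_date)

print(padded)
print(with_time)
```

Unlike `{m.day}/{m.month}`, the `%d` and `%m` codes zero-pad the day and month to two digits.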

## Encodings


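- **`b'cafe'`** creates a `bytes` literal, that is, a sequence of raw bytes (only ASCII characters are allowed in the literal).
- **`u'cafe'`** creates a unicode string; in Python 3 this is the same as the plain `str` literal `'cafe'`.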

#### Encoding in utf-8, utf-16, utf-32

We can use `x.encode('utf-8')`, `x.encode('utf-16')`, `x.encode('utf-32')` to encode `x` in utf-8, utf-16 or utf-32 respectively.


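The table below shows how `utf-8` stores a code point in 1 to 4 bytes. The `x` positions hold the bits of the code point, and the `Bits` column counts how many of them are available.
```
Bytes  Bits  Byte representation
1      7     0xxxxxxx
2      11    110xxxxx  10xxxxxx
3      16    1110xxxx  10xxxxxx  10xxxxxx
4      21    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
```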
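The byte patterns in the table can be checked directly. The following sketch encodes one sample character per row (the four characters are chosen here only for illustration) and prints each byte as 8 bits.

```python
# One sample character per row of the table (chosen for illustration)
samples = ['a', 'é', '€', '😀']

for ch in samples:
    encoded = ch.encode('utf-8')
    # Show each byte as 8 bits to compare with the patterns in the table
    bits = ' '.join(format(byte, '08b') for byte in encoded)
    print('U+{:04X}  {} byte(s)  {}'.format(ord(ch), len(encoded), bits))

# Number of bytes used by each sample character
byte_counts = [len(ch.encode('utf-8')) for ch in samples]
print(byte_counts)
```

The leading bits of the first byte (`0`, `110`, `1110`, `11110`) tell a decoder how many bytes belong to the character, and every continuation byte starts with `10`.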


In [22]:
'café'.encode('utf-8')

b'caf\xc3\xa9'

In [23]:
'café'.encode('utf-16')

b'\xff\xfec\x00a\x00f\x00\xe9\x00'

In [24]:
'café'.encode('utf-32')

b'\xff\xfe\x00\x00c\x00\x00\x00a\x00\x00\x00f\x00\x00\x00\xe9\x00\x00\x00'

In [32]:
# Different encodings give different representations in binary
# The different representations use a different amount of bytes
# (sys.getsizeof also counts the fixed overhead of the bytes object)
import sys 

# x_ascii = 'café'.encode('ascii') fails: accented characters are not present in ascii
x8  = 'café'.encode('utf-8')
x16 = 'café'.encode('utf-16')
x32 = 'café'.encode('utf-32')

[sys.getsizeof(x) for x in [x8,x16,x32]]

[38, 43, 53]

In [33]:
[x for x in [x8,x16,x32]]

[b'caf\xc3\xa9',
 b'\xff\xfec\x00a\x00f\x00\xe9\x00',
 b'\xff\xfe\x00\x00c\x00\x00\x00a\x00\x00\x00f\x00\x00\x00\xe9\x00\x00\x00']

In [25]:
ex  = 'cafe'.encode('ascii')
sys.getsizeof(ex)

37
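Note that `sys.getsizeof` reports the memory used by the whole `bytes` object, not just the encoded data. A minimal sketch separating the two, assuming the CPython interpreter (the names `payload` and `overhead` are only illustrative):

```python
import sys

x8 = 'café'.encode('utf-8')
x16 = 'café'.encode('utf-16')
x32 = 'café'.encode('utf-32')

# len() counts only the encoded bytes
payload = [len(b) for b in [x8, x16, x32]]
print(payload)

# sys.getsizeof() also counts the object's own bookkeeping;
# the exact overhead is an implementation detail of the interpreter
overhead = [sys.getsizeof(b) - len(b) for b in [x8, x16, x32]]
print(overhead)
```

The `utf-16` and `utf-32` results include a byte order mark (2 and 4 bytes respectively) in front of the encoded characters.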

We can choose the type of a string literal by preceding it with a prefix:

- `u` for unicode (`str`)
- `b` for bytes (ASCII characters only)



In [35]:
u'café'

'café'

In [36]:
b'café'

SyntaxError: bytes can only contain ASCII literal characters. (<ipython-input-36-26fd0a3ede45>, line 1)

## Unicode data 

Unicode is a standard that maps characters (symbols) to numbers. Unicode supports over a million symbols (or characters). Each character is assigned a number, called a code point. Code points are written in Python as `\uXXXX`, where XXXX is the number in four-digit hexadecimal form (code points above `FFFF` use the eight-digit form `\UXXXXXXXX`).

A font (like 'Times New Roman') is a mapping from characters to the images (glyphs) used to draw them.

We can manipulate unicode strings as 'normal' strings.
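A short sketch of the relation between characters and code points, using the built-in functions `ord` and `chr`; the variable names are only illustrative.

```python
# ord() gives the code point of a character, chr() goes the other way
code_point = ord('a')
print(hex(code_point))
print(chr(0x61))

# An escape sequence and the literal character are the same string
same = '\u0061' == 'a'
print(same)

# The usual string operations also work on non-ASCII text
word = 'caf\u00e9'  # 'café', written with an escape for the accented letter
print(len(word), word.upper(), word[::-1])
```

`len` counts characters (code points) here, not bytes, so the accented letter counts once.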

Some  important things to consider:

- When reading binary data from files, expect bytes and decode them with **`b.decode('utf-8')`**.

- When writing text back to a binary file, encode the string `s` with **`s.encode('utf-8')`**.

- Avoid using **`str()`** or  **`bytes()`** without an encoding to convert between types.
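The points above can be sketched as a round trip through a file opened in binary mode. The temporary directory and the file name `example.txt` are hypothetical and used only for illustration.

```python
import os
import tempfile

text = 'caf\u00e9'  # 'café'

# Hypothetical file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Encode explicitly before writing raw bytes
with open(path, 'wb') as f:
    f.write(text.encode('utf-8'))

# Reading in binary mode returns bytes, which we decode explicitly
with open(path, 'rb') as f:
    raw = f.read()
decoded = raw.decode('utf-8')

print(raw)
print(decoded)
```

Opening the file in text mode with `open(path, encoding='utf-8')` performs the same decoding implicitly; naming the encoding avoids depending on the platform default.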

In [22]:
some_unicode_char = u'\u0061'

In [23]:
some_unicode_char 

'a'

In [28]:
print(some_unicode_char)

a


In [45]:
def test(a:int):
    return 2*a

In [46]:
test(23)

46

## Checking properties of strings and characters



In [312]:
'A'.isupper(), 'a'.isupper(), 'a'.isdigit(), '1'.isdigit()

(True, False, False, True)
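A few more built-in checks in the same family are sketched below. The sample string `'Room 101'` and the counter names are only illustrative.

```python
# Each method returns True only if every character passes the check
print('abc'.isalpha(), 'abc1'.isalpha())
print('abc1'.isalnum(), 'abc 1'.isalnum())
print('abc'.islower(), 'Abc'.islower())
print(' \t\n'.isspace(), ''.isspace())

# Applied character by character, the checks can be used for counting
text = 'Room 101'
n_digits = sum(ch.isdigit() for ch in text)
n_upper = sum(ch.isupper() for ch in text)
print(n_digits, n_upper)
```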


# String similarity with fuzzywuzzy

In [290]:
import fuzzywuzzy

In [293]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [302]:
x = 'this is a string'
y = 'this is a string!'
z = 'this is also an string'

In [311]:
fuzz.ratio(x, y), fuzz.ratio(x, z)

(97, 84)
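If `fuzzywuzzy` is not installed, a comparable score can be sketched with the standard library module `difflib`. The helper `similarity` below is a hypothetical stand-in written for this example, not fuzzywuzzy's implementation, and its scores may differ from those of `fuzz.ratio`.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score based on difflib's ratio (illustrative helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

x = 'this is a string'
y = 'this is a string!'
z = 'this is also an string'

# Identical strings score 100; more edits give a lower score
print(similarity(x, x))
print(similarity(x, y), similarity(x, z))
```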