# Strings

## Splitting Strings 

In [1]:
import re

### Split a string into fields, but the delimiters (and spacing around them) aren’t consistent throughout the string.

In [2]:
line = 'asdf fjdk; afed, fjek,asdf, foo'
re.split(r'[;,\s]\s*', line)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

`[]` - Matches any `one` of the characters inside the brackets
`;`, `,`, `\s` - Semicolon, comma, and whitespace delimiters
`*` - Matches `zero or more` repetitions of the preceding element

`[;,\s]` - Matches a single delimiter character: a semicolon, a comma, or a whitespace character

`[;,\s]\s*` - Matches a delimiter followed by any amount of extra whitespace, so the spacing after a delimiter is consumed along with it
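
One consequence of this pattern is that back-to-back delimiters produce empty fields, because each match consumes only one delimiter character. A minimal sketch, using a made-up input string `messy`, shows the effect and one possible way to filter the empty fields out:

```python
import re

# Hypothetical input with two semicolons in a row
messy = 'a, b;; c'

fields = re.split(r'[;,\s]\s*', messy)
# The second ';' starts a new match right after the first one,
# which leaves an empty string between them
print(fields)      # ['a', 'b', '', 'c']

# Drop the empty fields if they are not wanted
non_empty = [f for f in fields if f]
print(non_empty)   # ['a', 'b', 'c']
```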

### Split the string but capture the delimiters

In [3]:
re.split(r'(;|,|\s)\s*', line)

['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']

`()` - Parentheses create a capture group, so the matched delimiters are included in the result
`|` - The pipe (alternation) operator acts as an `OR` condition between expressions

To use parentheses for grouping but get the same output as with brackets, make the group non-capturing with `?:`
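
Keeping the delimiters is useful when the string has to be rebuilt later. The following is a sketch of one way to do that, assuming fields and delimiters alternate in the result as they do in the output above. The names `values`, `delimiters` and `rebuilt` are illustrative only.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'
fields = re.split(r'(;|,|\s)\s*', line)

values = fields[::2]               # every even index holds a field
delimiters = fields[1::2] + ['']   # every odd index holds a delimiter; pad the end

# Re-join each value with the delimiter that followed it
rebuilt = ''.join(v + d for v, d in zip(values, delimiters))
print(rebuilt)   # asdf fjdk;afed,fjek,asdf,foo
```

The whitespace that followed each delimiter is not restored, because only the captured part of the match is kept.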

In [4]:
re.split(r'(?:,|;|\s)\s*', line)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

## Matching Text at Start or End of string

In [5]:
file = 'spam.txt'
print(file.endswith('txt'))
print(file.startswith('new'))

True
False


In [6]:
url = 'http://localhost:8888/notebooks/Python%20String%20and%20Text.ipynb'
url.startswith('http')

True

In [7]:
from urllib.request import urlopen
import pprint
def read_data(url):
    if url.startswith(('http', 'https', 'ftp')):
        return urlopen(url).read()
    else:
        with open(url) as u:
            return u.read()

url = 'http://www.pythontutor.com/visualize.html#mode=edit'
pprint.pprint(read_data(url)[:50])

b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Trans'


To test against multiple choices, pass them as a `tuple`, as in `url.startswith(('http', 'https', 'ftp'))`. A list or a set is not accepted and raises a `TypeError`.
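
If the choices happen to be stored in a list, they can be converted with `tuple()` first. A small sketch, where the `choices` list and the example URL are assumptions made for illustration:

```python
choices = ['http:', 'https:', 'ftp:']   # hypothetical list of prefixes
url = 'http://www.python.org'

try:
    url.startswith(choices)             # a list is not accepted
except TypeError as exc:
    print('TypeError:', exc)

is_remote = url.startswith(tuple(choices))   # convert to a tuple first
print(is_remote)   # True
```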

In [8]:
re.match('http:|https:|ftp:', url)

<re.Match object; span=(0, 5), match='http:'>

### Matching string using Shell wildcard pattern

In [9]:
from fnmatch import fnmatch, fnmatchcase
print(fnmatch('foo.txt', '*.txt'))
print(fnmatch('foo.txt', '*?o.txt'))
print(fnmatch('Dat23.csv', 'Dat[0-9]*'))

True
True
True


In [10]:
names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']
[n for n in names if fnmatch(n, 'Dat*.csv')]

['Dat1.csv', 'Dat2.csv']

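`fnmatch` follows the case-sensitivity rules of the underlying operating system, so on `Windows` the match is case-insensitive. To force case-sensitive matching on any platform, use `fnmatchcase`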

In [11]:
print(fnmatch('foo.txt', '*.TXT'))
print(fnmatchcase('foo.txt', '*.TXT'))

True
False


In [12]:
addresses = [
'5412 N CLARK ST',
'1060 W ADDISON ST',
'1039 W GRANVILLE AVE',
'2122 N CLARK ST',
'4802 N BROADWAY',
]

[addr for addr in addresses if fnmatchcase(addr, '* ST')]

['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']

In [13]:
[addr for addr in addresses if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]

['5412 N CLARK ST']

### Matching and Searching for Text Patterns

In [14]:
text = 'yeah, but no, but yeah, but no, but yeah'

In [15]:
text == 'yeah'

False

In [16]:
text.startswith('yeah')

True

In [17]:
text.endswith('yeah')

True

In [18]:
'no' in text

True

In [19]:
text.find('no')

10

`startswith`, `endswith`, `in`, and `find` are good for simple literal matching. For more complex patterns, use regular expressions (the `re` module)

In [20]:
text1 = '11/27/2012'
text2 = 'Nov 27, 2012'

if re.match(r'\d+/\d+/\d+',text1):
    print('True')
else:
    print('False')
    
if re.match(r'\d+/\d+/\d+',text2):
    print('True')
else:
    print('False')

True
False


`\d` matches a digit
`+` matches one or more occurrences of the preceding element

`\d+` matches 1 or more digits
`\d+/` matches 1 or more digits followed by `/`, e.g. `11/`

If you want to perform many matches with the same pattern, it is better to precompile the pattern using `compile`

`match` checks for a match only at the start of the string. To find all occurrences anywhere in the string, use `findall`

In [21]:
date_pattern = re.compile(r'\d+/\d+/\d+')

if date_pattern.match(text1):
    print('True')
else:
    print('False')

True


In [22]:
text = 'Today is 02/02/2020. New year will start on 01/01/2021.'
date_pattern.findall(text)

['02/02/2020', '01/01/2021']

### Capturing patterns in Group

In [23]:
date_pattern = re.compile(r'(\d+)/(\d+)/(\d+)')
a = date_pattern.match(text1)
a

<re.Match object; span=(0, 10), match='11/27/2012'>

In [24]:
print(a.group(0))
print(a.group(1))
print(a.group(2))
print(a.group(3))
print(a.group())
print(a.groups())

11/27/2012
11
27
2012
11/27/2012
('11', '27', '2012')


In [25]:
a_all = date_pattern.findall(text)
a_all

[('02', '02', '2020'), ('01', '01', '2021')]

In [26]:
for month, day, year in a_all:
    print(f'{month}-{day}-{year}')

02-02-2020
01-01-2021


In [27]:
for m in date_pattern.finditer(text):
    print(m.groups())

('02', '02', '2020')
('01', '01', '2021')


`match()` always tries to find the match at the start of a string.

Use the `findall()` method to find all occurrences.

Compile the pattern with `compile()` if the same pattern is used many times.

`group()` or `group(0)` returns the complete match; pass a group number (`1`, `2`, ...) to get an individual captured group.

`groups()` returns all the captured groups as a tuple.

`finditer()` returns the matches one at a time as match objects.
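
The difference between matching at the start and matching anywhere is easy to overlook. The sketch below contrasts `match()` with `search()` and `fullmatch()`, two other methods of compiled patterns in the standard `re` module; the sample strings are made up for illustration.

```python
import re

date_pattern = re.compile(r'(\d+)/(\d+)/(\d+)')

# match() only succeeds if the pattern is found at position 0
at_start = date_pattern.match('Today is 02/02/2020.')
print(at_start)                # None

# search() scans forward and returns the first match anywhere
first = date_pattern.search('Today is 02/02/2020.')
print(first.group())           # 02/02/2020

# match() does not require the pattern to consume the whole string
partial = date_pattern.match('11/27/2012abcdef')
print(partial.group())         # 11/27/2012

# fullmatch() requires the entire string to match
whole = date_pattern.fullmatch('11/27/2012abcdef')
print(whole)                   # None
```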

## Search and replace text

In [28]:
text = 'yeah, but no, but yeah, but no, but yeah'

In [29]:
text.replace('yeah', 'yup')

'yup, but no, but yup, but no, but yup'

For complex patterns use `sub()` from the `re` module

In [30]:
text = 'Today is 02/02/2020. New year will start on 01/01/2021.'

In [31]:
re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)

'Today is 2020-02-02. New year will start on 2021-01-01.'

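`\3`, `\1` and `\2` in the replacement refer to the captured groups by number, so `'\3-\1-\2'` reorders month/day/year into year-month-day

As with `match()`, it is also possible to compile the pattern first and then call `sub()` on the compiled pattern

To get the number of substitutions made along with the new text, use the `subn()` method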

In [32]:
date_pattern.subn(r'\3-\1-\2', text)

('Today is 2020-02-02. New year will start on 2021-01-01.', 2)

In [33]:
text_sub, n = date_pattern.subn(r'\3-\1-\2', text)
print(text_sub)
print(n)

Today is 2020-02-02. New year will start on 2021-01-01.
2
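
Numbered backreferences such as `\3-\1-\2` can become hard to read as a pattern grows. As an alternative, the `re` module supports named groups. The sketch below assumes the dates are in month/day/year order; the group names `month`, `day` and `year` are chosen for illustration.

```python
import re

text = 'Today is 02/02/2020. New year will start on 01/01/2021.'

# Each group gets a name via (?P<name>...)
named_pattern = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')

# \g<name> refers to a named group in the replacement string
result = named_pattern.sub(r'\g<year>-\g<month>-\g<day>', text)
print(result)   # Today is 2020-02-02. New year will start on 2021-01-01.
```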


### Searching and Replacing Case-Insensitive pattern

In [34]:
text = 'UPPER PYTHON, lower python, Mixed Python'

In [35]:
re.findall('python', text, flags=re.IGNORECASE)

['PYTHON', 'python', 'Python']

In [36]:
re.sub('python', 'snake', text, flags=re.IGNORECASE)

'UPPER snake, lower snake, Mixed snake'

`flags=re.IGNORECASE` makes the operation *case-insensitive*
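
A limitation of the substitution above is that the replacement text does not follow the case of the text it replaces. One possible workaround is to pass a function to `re.sub()` instead of a string. In the sketch below, `matchcase` is a hypothetical helper, not part of the `re` module.

```python
import re

def matchcase(word):
    """Return a replacement function that mimics the case of each match."""
    def replace(m):
        found = m.group()
        if found.isupper():
            return word.upper()
        elif found.islower():
            return word.lower()
        elif found[0].isupper():
            return word.capitalize()
        return word
    return replace

text = 'UPPER PYTHON, lower python, Mixed Python'
result = re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
print(result)   # UPPER SNAKE, lower snake, Mixed Snake
```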

### Finding Shortest possible match

In [37]:
str_pat = re.compile(r'\"(.*)\"')
text2 = 'Computer says "no." Phone says "yes."'
str_pat.findall(text2)

['no." Phone says "yes.']

`r'\"(.*)\"'` is attempting to match text enclosed inside quotes.

`\"` matches the opening double quote, `(.*)` matches zero or more characters except newline, and the final `\"` matches the closing double quote

However, the `*` operator in a regular expression is `greedy`, so matching is based on finding the longest possible match.

To fix this, add the `?` modifier `after the *` operator in the pattern to perform `non-greedy` search

In [38]:
str_pat = re.compile(r'\"(.*?)\"')
str_pat.findall(text2)

['no.', 'yes.']
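
Another way to get the shortest match for quoted text is to exclude the quote character from what may appear between the quotes. This is a sketch of that alternative using a negated character class, not a replacement for the non-greedy form:

```python
import re

text2 = 'Computer says "no." Phone says "yes."'

# [^"]* matches any run of characters that are not double quotes,
# so the match cannot run past the closing quote
quoted = re.findall(r'"([^"]*)"', text2)
print(quoted)   # ['no.', 'yes.']
```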

### Multiline Patterns

In [39]:
text2 = '''/*This is line 1
 This is line2*/
'''

The `.` (dot) operator in regex matches any single character `except newline`. Hence we need to add the newline explicitly as an alternative using the `|` operator

In [40]:
multi = re.compile(r'/\*((.|\n)*?)\*/')
multi.findall(text2)

[('This is line 1\n This is line2', '2')]

To avoid capturing the inner group, make it non-capturing with `?:`

In [41]:
multi = re.compile(r'/\*((?:.|\n)*?)\*/')
multi.findall(text2)

['This is line 1\n This is line2']

Using `re.DOTALL` makes the `.` operator match newline as well

In [42]:
multi = re.compile(r'/\*(.*?)\*/', re.DOTALL)
multi.findall(text2)

['This is line 1\n This is line2']

## Stripping unwanted characters

`strip()` strips characters (whitespace by default) from both ends of the string
`lstrip()` strips from the left side
`rstrip()` strips from the right side

In [43]:
s = ' hello world \n'

In [44]:
s.strip()

'hello world'

In [45]:
s.lstrip()

'hello world \n'

In [46]:
s.rstrip()

' hello world'

In [47]:
t = '-----hello====='

In [48]:
t.lstrip('-')

'hello====='

In [49]:
t.rstrip('=')

'-----hello'

`strip()` removes characters only from the ends of the string, not from the middle
To remove such characters from the middle, use the `replace()` method or `re.sub()`

In [50]:
' hello world \n'.replace(' ','')

'helloworld\n'

In [51]:
re.sub(r'\s+', '', ' hello world \n')

'helloworld'
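
The argument to `strip()` is treated as a set of characters, so several different characters can be removed from both ends in one call. A short sketch follows; the `raw_lines` list is made-up sample data standing in for lines read from a file.

```python
t = '-----hello====='

# Any mix of '-' and '=' is removed from both ends
both = t.strip('-=')
print(both)      # hello

# A common use: clean up lines of text as they are processed
raw_lines = ['  alpha \n', '\tbeta\n', 'gamma  ']
cleaned = [ln.strip() for ln in raw_lines]
print(cleaned)   # ['alpha', 'beta', 'gamma']
```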

## Combining and Concatenating Strings

In [52]:
parts = ['Is', 'Chicago', 'Not', 'Chicago?']

To combine the words or characters from a list, use the `join()` method

In [53]:
print(''.join(parts))
print(' '.join(parts))

IsChicagoNotChicago?
Is Chicago Not Chicago?


In [54]:
%%timeit
''.join(parts)

108 ns ± 0.976 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [55]:
print('\n'.join(parts))

Is
Chicago
Not
Chicago?


In [56]:
%%timeit
s = ''
for i in parts:
    s+=i

250 ns ± 1.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


Building the string with `+=` in a `for` loop is slower than the `join` method

Strings can be concatenated directly using the `+` operator, and adjacent string literals are joined automatically

In [57]:
'Hello' + ' World'

'Hello World'

In [58]:
'Hello' ' World'

'Hello World'

In [59]:
'Hello'" World"

'Hello World'

In [60]:
data = ['ACME', 50, 91.1]
' '.join(str(i) for i in data)

'ACME 50 91.1'

# Numbers

### Rounding Numerical Values

In [61]:
round(1.23 , 2)

1.23

In [62]:
round(1.23 , 1)

1.2

In [63]:
round(1.23 , 0)

1.0

In [64]:
round(1.23 , 4)

1.23

In [65]:
round(4.563, 2)

4.56

In [66]:
round(4.563, 1)

4.6

In [67]:
print(round(1.5))
print(round(2.5))

2
2


`round()` takes up to 2 arguments: the number to round and, optionally, the number of decimal places

If the second argument is not passed and the value is exactly halfway between two integers as shown above, it is rounded to the `nearest even integer`

If the fractional part is greater than one half, like *.51* or *.55*, it is rounded to the `next integer` as shown below

In [68]:
print(round(1.55))
print(round(2.55))

2
3
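
When ties should always round away from zero instead of to the nearest even value, one option is the `decimal` module with `ROUND_HALF_UP`. The helper `round_half_up` below is a hypothetical name, and the sketch assumes the input can safely be converted through `str()`.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(value, exp='1'):
    """Round value to the precision of exp, with ties going away from zero."""
    # str() keeps the decimal digits as they are written
    return Decimal(str(value)).quantize(Decimal(exp), rounding=ROUND_HALF_UP)

print(round(2.5), round_half_up(2.5))   # 2 3
print(round(1.5), round_half_up(1.5))   # 2 2
print(round_half_up(4.565, '0.01'))     # 4.57
```

The result is a `Decimal`, so convert it with `int()` or `float()` if a built-in numeric type is needed.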


In [69]:
print(round(1627734, -1))
print(round(1627735, -1))
print(round(16.27735, 0))


1627730
1627740
16.0


If the number of digits passed is negative, rounding takes place at the tens, hundreds, thousands, and so on.

If the number of digits passed is zero, the value is rounded to a whole number but returned as a `float`

### Performing Accurate Decimal Calculations

In [70]:
a = 4.2
b = 2.1
a + b

6.300000000000001

In [71]:
print(4.2+2.1 == 6.3)
print(a + b == 6.3)

False
False


A well-known issue with floating-point numbers is that they can’t accurately represent all base-10 decimals.
Mathematically, the comparisons above should be True, but the small representation error makes them False

These errors are a `feature` of the `underlying CPU and the IEEE 754 arithmetic` performed by its floating-point unit. Since Python’s float data type stores data using the native representation, there’s nothing you can do to avoid such errors if you write your code using float instances.

If you want more accuracy (and are willing to give up some performance), you can use the decimal module:

In [72]:
from decimal import Decimal

In [73]:
a = Decimal('4.2')
b = Decimal('2.1')
print(a + b)

6.3


In [74]:
a+b

Decimal('6.3')

In [75]:
a+b == Decimal('6.3')

True
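
When switching to `Decimal` is not practical, two functions from the standard `math` module can help with ordinary floats: `math.isclose()` compares with a tolerance, and `math.fsum()` sums with less accumulated error. A brief sketch, where the `nums` list is made-up data chosen to expose the problem:

```python
import math

# Compare with a tolerance instead of using ==
close = math.isclose(4.2 + 2.1, 6.3)
print(close)              # True

# Small values can be lost when added to very large ones
nums = [1.23e+18, 1, -1.23e+18]
print(sum(nums))          # 0.0 (the 1 disappears)
print(math.fsum(nums))    # 1.0
```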

### Formatting numbers for output

In [76]:
x = 1234.56789

#### Precision and accuracy

In [77]:
format(x, '.2f')

'1234.57'

In [78]:
format(x, '0.2f')

'1234.57'

In [79]:
format(x, '0.3f')

'1234.568'

In [80]:
format(x, '0.3%')

'123456.789%'

In [81]:
format(-x, '0.3f')

'-1234.568'

#### Alignment

In [82]:
#Left Justify
format(x, '<10.1f')

'1234.6    '

In [83]:
#Right Justify
format(x, '>10.1f')

'    1234.6'

In [84]:
#Center
format(x, '^10.1f')

'  1234.6  '

#### Inclusion of thousands separator

In [85]:
format(x, ',')

'1,234.56789'

In [86]:
format(x, '0,.2f')

'1,234.57'

#### Exponent

In [87]:
format(x, 'e')

'1.234568e+03'

In [88]:
format(x, '0.2e')

'1.23e+03'

In [89]:
format(x, '0.2E')

'1.23E+03'

#### Using format in print

In [90]:
x = 0.1234
y = 0.5678

In [91]:
print('x is {:0.2f}'.format(x))

x is 0.12


In [92]:
print('x is {:0.2%}'.format(x))

x is 12.34%


In [93]:
print('x is {0:0.2f} and y is {1:0.3f}'.format(x, y))

x is 0.12 and y is 0.568


In [94]:
print('y is {1:0.2f} and x is {0:0.3f}'.format(x, y))

y is 0.57 and x is 0.123


In [95]:
print('Name is {name} and age is {age}'.format(name='Python', age='3'))

Name is Python and age is 3


#### New format method (f-strings)

In [96]:
print(f'x is {x:0.2f}')

x is 0.12


In [97]:
print(f'x is {x:0.2%}')

x is 12.34%


In [98]:
print(f'x is {x:0.2f} and y is {y:0.3f}')
print(f'y is {y:0.2f} and x is {x:0.3f}')

x is 0.12 and y is 0.568
y is 0.57 and x is 0.123


In [99]:
name = 'Python'
age = 3
print(f'Name is {name} and age is {age}')

Name is Python and age is 3


### Working with Bin, Oct, Hex

In [100]:
x = 255

#### Using functions

In [101]:
bin(x)

'0b11111111'

In [102]:
oct(x)

'0o377'

In [103]:
hex(x)

'0xff'

#### Using format

In [104]:
format(x, 'b')

'11111111'

In [105]:
format(x, 'o')

'377'

In [106]:
format(x, 'x')

'ff'

#### Signed Numbers

In [107]:
bin(-x)

'-0b11111111'

In [108]:
format(-x, 'b')

'-11111111'

In [109]:
oct(-x)

'-0o377'

In [110]:
format(-x, 'o')

'-377'

In [111]:
hex(-x)

'-0xff'

In [112]:
format(-x, 'x')

'-ff'

#### Converting binary, hex, oct to integer

In [113]:
int(bin(x), 2)

255

In [114]:
int(oct(x), 8)

255

In [115]:
int(hex(x), 16)

255
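
`int()` also accepts digit strings without a prefix, and a base of `0` tells it to infer the base from the prefix. A short sketch of both forms:

```python
# The prefix is optional when the base is given explicitly
print(int('ff', 16))           # 255
print(int('0xff', 16))         # 255

# Base 0 infers the base from the 0b / 0o / 0x prefix
print(int('0o377', 0))         # 255
print(int('0b11111111', 0))    # 255
```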

### Complex Values

In [116]:
a = complex(2 ,3)
print(a)
print(type(a))

(2+3j)
<class 'complex'>


In [117]:
b = 1+2j
print(b)
print(type(b))

(1+2j)
<class 'complex'>


In [118]:
print(a + b)
print(a - b)
print(a * b)
print(a / b)

(3+5j)
(1+1j)
(-4+7j)
(1.6-0.2j)


In [119]:
a.real

2.0

In [120]:
a.imag

3.0

In [121]:
a.conjugate()

(2-3j)

In [122]:
abs(a)

3.605551275463989

#### Additional complex math

In [123]:
import cmath

In [124]:
cmath.sin(a)

(9.15449914691143-4.168906959966565j)

In [125]:
cmath.cos(a)

(-4.189625690968807-9.109227893755337j)

In [126]:
cmath.exp(a)

(-7.315110094901103+1.0427436562359045j)

Python’s standard mathematical functions do not produce complex values by default, so it is unlikely that such a value would accidentally show up in your code.

In [127]:
import math
math.sqrt(-1)

ValueError: math domain error

If you want complex numbers to be produced as a result, you have to explicitly use cmath or declare the use of a complex type in libraries that know about them

In [128]:
cmath.sqrt(-1)

1j

### Inf and NaNs

Python has no special syntax to represent these special floating-point values, but they can be created using `float()`.

In [129]:
a = float('inf')
b = float('-inf')
c = float('nan')

In [130]:
a

inf

In [131]:
b

-inf

In [132]:
c

nan

In [133]:
a + 1

inf

In [134]:
b - 2

-inf

In [135]:
a + b

nan

In [136]:
c + 1

nan

#### Testing for infinity and NaN

In [137]:
math.isinf(a)

True

In [138]:
math.isinf(b)

True

In [139]:
math.isnan(c)

True

In [140]:
math.isnan(a)

False

In [141]:
math.isinf(c)

False

In [142]:
2/a

0.0

In [143]:
a1 = float('inf')
a2 = float('inf')
a1 == a2

True

In [144]:
a1 = float('nan')
a2 = float('nan')
a1 == a2

False
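
Because a NaN never compares equal to anything, including itself, `==` cannot be used to detect it. A sketch of filtering NaN values out of a list with `math.isnan()`, where `readings` is made-up sample data:

```python
import math

readings = [1.5, float('nan'), 2.5, float('nan')]   # hypothetical data

# r == float('nan') is always False, so test with math.isnan() instead
clean = [r for r in readings if not math.isnan(r)]
print(clean)        # [1.5, 2.5]

nan = float('nan')
print(nan != nan)   # True: a NaN is not equal even to itself
```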

### Random 

In [145]:
import random

In [146]:
values = [i for i in range(10)]
values

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

#### Choose random

In [147]:
random.choice(values)

6

In [148]:
random.choice(values)

6

In [149]:
random.choice(values)

6

#### Shuffle

In [150]:
random.shuffle(values)
values

[4, 8, 0, 9, 5, 2, 7, 3, 6, 1]

#### Produce random numbers

Integers

In [151]:
random.randint(1, 100)

45

In [152]:
random.randint(1, 100)

27

In [153]:
random.randint(-100, 100)

38

In [154]:
random.randrange(1, 10)

6

Floats from 0 up to, but not including, 1

In [155]:
random.random()

0.7298385718248317

In [156]:
random.random()

0.7650240187045345

In [157]:
random.random()

0.0864472874845198

#### Generate list of random numbers

In [158]:
[random.randint(0,9) for i in range(10)]

[8, 0, 3, 6, 4, 9, 4, 4, 7, 2]

In [159]:
[random.randrange(0,9) for i in range(10)]

[2, 6, 3, 4, 3, 4, 7, 4, 8, 7]

In [160]:
[random.randrange(0,9, 2) for i in range(10)]

[6, 4, 8, 0, 4, 0, 8, 2, 6, 0]

In [161]:
[random.randrange(1,9, 2) for i in range(10)]

[5, 1, 7, 7, 3, 7, 7, 7, 1, 1]

The difference between `randint` and `randrange` is that `randrange` accepts a **step size**, like `range`. Also like `range`, the **stop** value is not included in `randrange`, whereas `randint` includes both endpoints.
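
The bounds can be checked empirically. The sketch below uses a separate `random.Random` instance with an arbitrary seed so that repeated runs draw the same numbers; the seed value and the sample size are assumptions made for illustration.

```python
import random

rng = random.Random(12345)   # arbitrary seed, used only for repeatable runs

inclusive = [rng.randint(0, 9) for _ in range(1000)]       # 0 through 9
exclusive = [rng.randrange(0, 9) for _ in range(1000)]     # 0 through 8
odd_only = [rng.randrange(1, 9, 2) for _ in range(1000)]   # 1, 3, 5 or 7

print(all(0 <= v <= 9 for v in inclusive))   # True
print(all(0 <= v <= 8 for v in exclusive))   # True
print(set(odd_only) <= {1, 3, 5, 7})         # True
```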