# Fundamentals of String manipulation
_______________

In [2]:
str_1 = 'immutable'
str_2 = 'Shall I Compare thee to a Summer\'s Day?' 
str_3 = '    2 plus 70 equals seventy 2     '
str_4 = 'CYP2C19'
str_5 = '794613852'

* strings refer to immutable sequences of characters. 
* reassigning characters in a string is not supported.  
* once a string object is created it cannot be changed in place; it can only be replaced by a new string. 

In [3]:
str_1[5]

'a'

In [4]:
str_1[5]='t'

TypeError: 'str' object does not support item assignment
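
Since item assignment fails, one common workaround is to build a new string from slices of the old one. A minimal sketch (the name `patched` is only illustrative):

```python
str_1 = 'immutable'

# Build a new string: everything before index 5, the new character, everything after it.
patched = str_1[:5] + 't' + str_1[6:]

print(patched)   # immuttble
print(str_1)     # immutable (the original object is unchanged)
```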

## 1- A quick survey of the most useful string methods

In [5]:
print('upper():      '+str_2.upper())
print('lower()       '+str_2.lower())
print('swapcase():   '+str_2.swapcase())

upper():      SHALL I COMPARE THEE TO A SUMMER'S DAY?
lower()       shall i compare thee to a summer's day?
swapcase():   sHALL i cOMPARE THEE TO A sUMMER'S dAY?


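* `str.find(Sub)` and `str.index(Sub)` have similar behavior: both return the index of the first occurrence of `Sub` in a string.  
* they differ only when `Sub` is absent: `find` returns `-1`, while `index` raises a `ValueError`.  
* `str.rfind(Sub)` and `str.rindex(Sub)` return the index of the last occurrence of `Sub`, i.e. the first one found when searching from the right.  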

In [6]:
str_2

"Shall I Compare thee to a Summer's Day?"

In [7]:
# the first 'e' is at index 14 and the last 'e' is at index 30 (both counted from the left)
str_2.index('e'), str_2.find('e')

(14, 14)

In [8]:
str_2.rindex('e'), str_2.rfind('e')

(30, 30)

In [9]:
str_2.index('e t'), str_2.rindex('e t')

(14, 19)
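
The two methods only behave differently when the substring is missing. The sketch below searches for `'z'`, which does not occur in `str_2`; the variable `outcome` is only there to make the result easy to inspect.

```python
str_2 = "Shall I Compare thee to a Summer's Day?"

# find() signals a missing substring by returning -1 ...
position = str_2.find('z')
print(position)   # -1

# ... whereas index() raises ValueError, which has to be caught.
try:
    str_2.index('z')
    outcome = 'found'
except ValueError:
    outcome = 'not found'

print(outcome)    # not found
```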

* `str.replace(old_string, new_string)` returns a copy of the string with every occurrence of `old_string` replaced by `new_string`.

In [None]:
str_2.replace('Shall','Would')

* `str.capitalize()` returns a copy with the first character capitalized and the rest lowercased.

In [None]:
str_1.capitalize()

* `str.strip()` to eliminate leading and trailing white spaces.

In [10]:
str_3.strip()

'2 plus 70 equals seventy 2'
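
`strip()` has one-sided variants, and all three accept an optional set of characters to remove instead of whitespace. A short sketch (the `'--CYP2C19--'` string is a made-up example):

```python
str_3 = '    2 plus 70 equals seventy 2     '

left_only = str_3.lstrip()    # removes leading whitespace only
right_only = str_3.rstrip()   # removes trailing whitespace only

# Passing characters strips those characters from both ends instead of whitespace.
trimmed = '--CYP2C19--'.strip('-')

print(repr(left_only))
print(repr(right_only))
print(trimmed)   # CYP2C19
```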

In [None]:
# reminder
print('str_1:    ',str_1)
print('str_2:    ',str_2)
print('str_3:    ',str_3)
print('str_4:    ',str_4)
print('str_5:    ',str_5)

* `str.isalnum()` checks whether a string is entirely alphanumeric characters
* `str.isalpha()` checks whether a string is entirely alphabetic characters   
* `str.isdigit()` checks whether a string consists of digits only

In [11]:
str_1.isalnum(), str_2.isalnum(), str_3.isalnum(), str_4.isalnum(), str_5.isalnum()

(True, False, False, True, True)

In [12]:
str_1.isalpha(), str_2.isalpha(), str_3.isalpha(), str_4.isalpha(), str_5.isalpha()

(True, False, False, False, False)

In [13]:
str_5.isdigit()

True
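
It may help to remember that `isdigit()` looks at characters, not at numeric values, so signs and decimal points make it return `False`. A small sketch with made-up candidate strings:

```python
candidates = ['794613852', '-5', '3.14', '']

# Map each candidate string to the result of isdigit().
results = {candidate: candidate.isdigit() for candidate in candidates}

for candidate, is_digit in results.items():
    print(repr(candidate), is_digit)
```

Only the first candidate passes; the empty string also returns `False`.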

* strings can be concatenated using `+`

In [None]:
str_1+'_____'+str_2+'_____'+str_4

* `str.center(width[, fillchar])` returns a copy of the string centered within a field of length `width`. `fillchar` specifies the padding character (a space by default).

In [14]:
str_4.center(80,'-')

'------------------------------------CYP2C19-------------------------------------'

* `str.count(sub, start, end)` returns the number of non-overlapping occurrences of the substring `sub` between index `start` and `end`. 

In [15]:
phrase = 'in times like these, it is helpful to remember that there have always been times like these'

phrase.count('times like these',0, len(phrase))

2
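
Two details of `count()` are easy to miss: `start` narrows the search window, and matches never overlap. A brief sketch (the `'aaaa'` string is a made-up example):

```python
phrase = 'in times like these, it is helpful to remember that there have always been times like these'

# Start counting at index 10, which is past the first 'times'.
late_count = phrase.count('times', 10)

# Matches are non-overlapping: 'aaaa' contains 'aa' twice, not three times.
overlap_count = 'aaaa'.count('aa')

print(late_count, overlap_count)   # 1 2
```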

## 2- The `str.split()` and `str.join()` methods
Two of the most essential methods for string manipulation.
* `str.split(sep=None, maxsplit=-1)`     
splits a string into a list of substrings at a specified separator (any run of whitespace by default). `maxsplit` is the maximum number of splits desired; the default of -1 performs every possible split.      


* `' '.join([list])`     
joins a list of strings, placing the delimiter given between the two quotation marks between consecutive items.  

In [16]:
str_2_split = str_2.split()
str_2_split

['Shall', 'I', 'Compare', 'thee', 'to', 'a', "Summer's", 'Day?']

In [17]:
# join and separate with + sign
str_2_join='+'.join(str_2_split)
str_2_join

"Shall+I+Compare+thee+to+a+Summer's+Day?"

In [18]:
# we can split based on any character
str_2_split = str_2.split('e')
str_2_split

['Shall I Compar', ' th', '', ' to a Summ', "r's Day?"]

In [19]:
str_2_join='-_-'.join(str_2_split)
str_2_join

"Shall I Compar-_- th-_--_- to a Summ-_-r's Day?"

* `str.rsplit()` splits a string starting from the right side. The difference from `str.split()` is only visible when the argument `maxsplit` is not left at its default

In [20]:
# comparing maxsplit assignment
str_2.split()

['Shall', 'I', 'Compare', 'thee', 'to', 'a', "Summer's", 'Day?']

In [21]:
str_2.split(maxsplit=3), str_2.rsplit(maxsplit=3)

(['Shall', 'I', 'Compare', "thee to a Summer's Day?"],
 ['Shall I Compare thee to', 'a', "Summer's", 'Day?'])

In [None]:
'q u e r t y'.split(maxsplit=3), 'q u e r t y'.rsplit(maxsplit=3)
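
A typical use for `rsplit()` with `maxsplit=1` is peeling off the last piece of a string, such as a file extension. The file name below is hypothetical:

```python
filename = 'sonnet_18.backup.txt'   # hypothetical file name

# Split once from the right: everything before the last dot, and the extension.
stem, extension = filename.rsplit('.', maxsplit=1)

# Split once from the left: the first piece, and everything after the first dot.
first, rest = filename.split('.', maxsplit=1)

print(stem, extension)   # sonnet_18.backup txt
print(first, rest)       # sonnet_18 backup.txt
```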

### How to split a string with no spaces (a word) into single characters?  
* by converting it to a list! 

In [22]:
# this has no effect on string with no delimiters or separators
'immutable'.split()

['immutable']

In [23]:
list('immutable')

['i', 'm', 'm', 'u', 't', 'a', 'b', 'l', 'e']

In [None]:
''.join(['i', 'm', 'm', 'u', 't', 'a', 'b', 'l', 'e'])

* this deconstruction allows modification of characters in a string using list operations such as item assignment. 

<span style='color:blue'>Change `2C19` to `2D19` in <U>str_4</U> and call it <U>str_6</U> </span>  

In [24]:
str_4

'CYP2C19'

In [27]:
str_6 = list(str_4)
str_6[4] = 'D'
str_6 = ''.join(str_6)

In [28]:
str_6

'CYP2D19'

* `chr(int)` is a built-in function that returns the string representing the character whose Unicode code point is the integer `int`.    
* `ord(str)` is the inverse function of `chr(int)`: it returns the code point of a single character.    

In [29]:
chr_list = ''.join([chr(i) for i in range(32,127)])
chr_list

' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
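
The inverse relationship between `ord()` and `chr()` can be checked with a quick round trip. This is only a sketch; the names `codes` and `restored` are illustrative.

```python
code_point = ord('A')       # 65
letter = chr(code_point)    # 'A'

# Convert every character to its code point, then back again.
codes = [ord(ch) for ch in 'CYP2C19']
restored = ''.join(chr(n) for n in codes)

print(codes)      # [67, 89, 80, 50, 67, 49, 57]
print(restored)   # CYP2C19
```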

* this is handy for simple message encoding (a toy cipher rather than real encryption)! 
* the following is a message:  

In [31]:
encrypted_message = \
[7744, 13456, 1369, 12769, 13456, 13225, 11664, 1369, 10404, 13225, 11025, 1369, 14641, 11881, \
 10404, 13225, 12544, 14400, 1369, 11449, 13456, 14161, 1369, 10404, 12769, 12769, 1369, 14641, \
 11881, 11236, 1369, 11449, 12100, 14400, 11881, 1369, 2500, 1369, 7744, 13456, 1369, 14400, 10404,\
 11025, 1369, 14641, 11881, 10404, 14641, 1369, 12100, 14641, 1369, 14400, 11881, 13456, 14884, 12769, \
 11025, 1369, 10816, 13456, 12996, 11236, 1369, 14641, 13456, 1369, 14641, 11881, 12100, 14400, 1369, \
 2500, 1369, 8464, 11236, 1369, 14641, 14161, 12100, 11236, 11025, 1369, 14641, 13456, 1369, 15376, \
 10404, 14161, 13225, 1369, 15876, 13456, 14884, 1369, 10404, 12769, 12769, 1369, 10609, 14884, 14641,\
 1369, 13456, 11881, 1369, 11025, 11236, 10404, 14161, 1444]

<span style='color:blue'> To decrypt this message, compute     
`chr(sqrt(n) - 5)` for every integer `n` in the encrypted message, then join the resulting characters.   
</span>

In [32]:
for num in encrypted_message:
    letter = chr(int((num**(.5)) - 5))
    print(letter, end='')

So long and thanks for all the fish - So sad that it should come to this - We tried to warn you all but oh dear!
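
For completeness, the encoding direction can be sketched as the reverse of the rule above: add 5 to each code point, then square it. The function names `encode` and `decode` are hypothetical, and `math.isqrt` assumes Python 3.8 or later.

```python
import math

def encode(message):
    """Map each character to (code point + 5) squared."""
    return [(ord(ch) + 5) ** 2 for ch in message]

def decode(numbers):
    """Invert encode(): integer square root, subtract 5, convert back with chr()."""
    return ''.join(chr(math.isqrt(n) - 5) for n in numbers)

secret = encode('So long')
print(secret)           # [7744, 13456, 1369, 12769, 13456, 13225, 11664]
print(decode(secret))   # So long
```

The first seven numbers match the start of `encrypted_message`. Using an integer square root avoids relying on floating-point rounding.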

&nbsp;

## 3- The `.format()` method
* a string method introduced in Python 3.0 (and also available from Python 2.6).   
* strings that call `.format()` contain replacement fields `{}`. 
* anything not within the curly braces is considered literal text. 
* mostly used with the `print()` function to insert a calculated value into a string to be printed.

In [None]:
"{},{},{} are letters".format('a','b','c')

* we can choose the order of objects in the expression

In [None]:
"{1},{2},{0} are letters".format('a','b','c')

In [None]:
"{1},{2},{0} are letters".format(*'abc')

In [None]:
"{0}{1}{0}".format('abra','cad')

In [None]:
mae=0.8225648511

"The mean absolute error is {0:.4f}".format(mae)

* a replacement field can carry a format specification after a colon, `{index:format_spec}`, which is useful for floats (e.g. `{0:.4f}`).

In [None]:
# padding with zeros. The width (6 here) is the minimum length of the complete output, decimal point included.

'Earth\'s gravitational acceleration is: {:06.2f} ft/s/s'.format(32.1740)

In [None]:
mae=0.8225648511
mse=1.3265458556

"the mae = {1:.5f}\nthe mse = {0:.5f}".format(mse, mae)

In [None]:
print("the mae = {1:08.5f}\nthe mse = {0:08.5f}".format(mse, mae))

* padding and alignment

In [None]:
'{:<30}'.format('left aligned')

In [None]:
'{:>30}'.format('right aligned')

In [None]:
print("the mae = {1:>50,.5f}\nthe mse = {0:>50,.5f}".format(mse, mae))

In [None]:
print("the mae = {1:->50,.5f}\nthe mse = {0:.>50,.5f}".format(mse, mae))

* values looked up in a dictionary can be passed to `.format` to assign and arrange them

In [None]:
pt = {1:'pepperoni', 2:'sausage',3:'onions',4:'cheese',5:'olives',6:'green peppers'}

"Standard pizza toppings include: {}, {} and {}".format(pt[1],pt[2],pt[4])

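The lookup can also happen inside the replacement field itself, or the dictionary can be unpacked into named fields. A minimal sketch; the dictionary `named` and its keys are made up for illustration:

```python
pt = {1: 'pepperoni', 2: 'sausage', 3: 'onions', 4: 'cheese'}

# Index into the dictionary from inside the replacement field (digits are treated as integer keys).
by_index = "Toppings: {0[1]}, {0[2]} and {0[4]}".format(pt)

# With string keys, ** unpacking turns the keys into field names.
named = {'first': 'pepperoni', 'second': 'sausage'}
by_name = "Toppings: {first} and {second}".format(**named)

print(by_index)   # Toppings: pepperoni, sausage and cheese
print(by_name)    # Toppings: pepperoni and sausage
```
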
* before `.format()` was introduced, the standard formatting method was the `%` operator (`%`-formatting).   
* `%`-formatting continues to be supported in later versions but offers less flexibility. 

You will encounter old-style `%` statements very often:

In [None]:
#     %s stands for string
'%s,%s,%s are letters' % ('a','b','c')

In [None]:
#     %d stands for a decimal integer
'%d,%d,%d are numbers' % (-1,5,9)

In [None]:
#     reordering positional arguments is not supported in old formatting
"the mae = %.5f\nthe mse = %.5f" % (mae, mse)

In [None]:
#    padding with zeros or spaces works, but an arbitrary fill character is not supported in old formatting
print("the mae = %010.5f\nthe mse = %010.5f" % (mae, mse))
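
Although positional arguments cannot be reordered, `%`-formatting does accept a dictionary with `%(key)` fields, which allows values to be placed in any order. A short sketch (the dictionary name `scores` is illustrative):

```python
mae = 0.8225648511
mse = 1.3265458556

# Mapping keys in parentheses pick values from the dictionary by name.
scores = {'mae': mae, 'mse': mse}
old_style = "the mse = %(mse).5f\nthe mae = %(mae).5f" % scores

print(old_style)
```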


<a href='https://pyformat.info/'>Link</a> for additional information and further comparison between `%`-formatting and the `.format()` method

## 4- File Input/Output

* We want to access the file `sonnet_18.txt` located in the folder `/lecture_01/files`  
* Use `pwd()` to view the current directory (`pwd` is an IPython/Jupyter magic command, not part of plain Python)   

In [None]:
pwd()

Since the `files` folder holding `sonnet_18.txt` is inside `lecture_01` (where the notebook is located), a relative path is enough and it is not necessary to define the entire path.
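
Outside a notebook, the same information is available from the standard `os` module. The sketch below assumes the folder layout described above and only builds the path; it does not open the file.

```python
import os

# Plain-Python equivalent of pwd: the directory relative paths are resolved against.
current_dir = os.getcwd()

# os.path.join inserts the right separator for the operating system.
sonnet_path = os.path.join('files', 'sonnet_18.txt')

print(current_dir)
print(sonnet_path)
```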

In [None]:
path = 'files/'

In [None]:
f = open(path+'sonnet_18.txt','r')
for line in f:
    print(line, end='')

# close file    
f.close()

* Since Python borrows many of its basic conventions from the C programming language, you will find `open('filename','r')` and `f.close()` to be very similar to `fopen` and `fclose` in C and C++.  
* This also applies to other expressions such as the `%`-formatting, specifically in Python v2.7 and earlier.   
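
A common alternative to calling `f.close()` by hand is the `with` statement, which closes the file automatically. The sketch below writes a small throwaway file first so that it runs anywhere; the file name and contents are made up.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on sonnet_18.txt.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'w') as f:
    f.write('first line\nsecond line\n')

# The file is closed when the with-block ends, even if an exception is raised inside it.
with open(demo_path, 'r') as f:
    contents = f.read()   # read() returns the whole file as one string

print(contents, end='')
print(f.closed)   # True
```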

In [None]:
f = open(path+'sonnet_18.txt','r')
# .readlines() reads all lines into a list; looping over it carries out the same task.
for line in f.readlines():
    print(line, end='')

# close file    
f.close()

* It is possible to capture the content of the text file in an object.

In [None]:
text=str() #an empty string object. 
f = open(path+'sonnet_18.txt','r')
for line in f.readlines():
    text=text+str(line)

# close file    
f.close()
print(text)

- `open` takes a second argument called the mode, which defaults to `'r'` (read)   
                open('filename', 'r') or 'w', 'rb', 'wb', 'r+', 'a', 'rt'
                read, write, read binary, write binary, read-write, append, read text
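
The difference between `'w'` and `'a'` is easiest to see by running them in sequence. A minimal sketch on a temporary file (the file name is hypothetical):

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(demo_path, 'w') as f:   # 'w' creates the file, or empties an existing one
    f.write('one\n')

with open(demo_path, 'a') as f:   # 'a' adds to the end of the file
    f.write('two\n')

with open(demo_path) as f:        # no mode given, so the default 'r' is used
    after_append = f.read()

with open(demo_path, 'w') as f:   # opening with 'w' again discards the old content
    f.write('three\n')

with open(demo_path, 'r') as f:
    after_rewrite = f.read()

print(repr(after_append))    # 'one\ntwo\n'
print(repr(after_rewrite))   # 'three\n'
```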

* however, notice that the `text` object itself is one continuous string from beginning to end, with the lines separated by the newline escape character `\n`.  
* this string is difficult to manipulate.

In [None]:
text

* the following is a better way to read the file and capture it line by line    

In [None]:
with open(path+'sonnet_18.txt', 'r') as f:
    lines = list(f)
lines
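
Each captured line still ends with `\n`. If the text is already held in a single string, `splitlines()` gives the same line-by-line view, with or without the newline characters. The two-line string below is only a stand-in for the file contents:

```python
text = "Shall I compare thee to a summer's day?\nThou art more lovely and more temperate:\n"

with_newlines = text.splitlines(keepends=True)   # same shape as a list built from a file object
clean_lines = text.splitlines()                  # newline characters removed

print(with_newlines)
print(clean_lines)
```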

&nbsp; 

* to write a text file to disk, first we create the file by opening it in write mode, then we transfer the lines one by one into the open file.

In [None]:
my_text=\
'So long and thanks for all the fish\n\
So sad that it should come to this\n\
We tried to warn you all but oh dear\n\
You may not share our intellect\n\
Which might explain your disrespect\n\
For all the natural wonders that\n\
grow around you\n\
So long, so long and thanks\n\
for all the fish'

In [None]:
print(my_text)

In [None]:
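outfile = open(path+'my_text.txt', 'w')
# iterating over the string itself would yield single characters, so split it into lines first
for line in my_text.splitlines(keepends=True):
    print(line,file=outfile, end='')
outfile.close()
print('done')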

### `print()` in Python

* `print()` is an intuitive function in Python; however, it is good to remember that `print()` takes a couple of useful arguments. 

`print(*objects, sep=' ', end='\n')`

In [None]:
print(str_1,  str_2, str_3)

* `sep` is useful when two or more objects are passed to a single print call. 

In [None]:
print(str_1,  str_2, str_3, sep = '....')

In [None]:
word_list = ['foo', 'bar', '423','gronk', 'hello kitty', 'sling', 'drag', '8' ]

for word in word_list:
    print(word, end = '\n\n')
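
The `sep`, `end` and `file` arguments can be combined. The sketch below unpacks a list with `*` and redirects the output to an in-memory buffer, which stands in for a real file here; the shortened `word_list` is only for illustration.

```python
import io

word_list = ['foo', 'bar', '423', 'gronk']

# * unpacks the list so each word is a separate argument, joined by sep.
print(*word_list, sep=' | ')   # foo | bar | 423 | gronk

# file= sends the output somewhere other than the screen.
buffer = io.StringIO()
print(*word_list, sep=', ', end='!\n', file=buffer)
captured = buffer.getvalue()

print(repr(captured))   # 'foo, bar, 423, gronk!\n'
```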