String operations documentation
https://docs.python.org/3/library/string.html
    

Strings can be either Unicode strings (str) or 8-bit strings (bytes).

In [2]:
print('Hello' + 'World')

HelloWorld


In [3]:
print('Hello' * 8)

HelloHelloHelloHelloHelloHelloHelloHello


In [6]:
x = 'Happy'
y = 'Puppies'
z = [x,y]

'-'.join(z)

'Happy-Puppies'


We can use the split method to turn a string into a list, splitting on a separator that we designate (any run of whitespace if no separator is given).


In [7]:
a = 'They ate the mystery meat. It tasted like chicken. sgsd fdasgs g afgs gs.'

# print(a.split())
print(a.split('.'))
# print(a.split('m'))

['They ate the mystery meat', ' It tasted like chicken', ' sgsd fdasgs g afgs gs', '']
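The difference between calling split with and without a separator is easy to miss, so here is a small sketch. The sample sentence is made up for illustration (note the double space after "ate"), and the `maxsplit` argument is shown as an optional extra.

```python
# Hypothetical sample sentence with a double space after "ate".
sentence = 'They ate  the mystery meat. It tasted like chicken.'

# No argument: split on runs of whitespace, never producing empty strings.
words = sentence.split()

# Explicit separator: every single occurrence splits, so empty strings can appear.
by_space = sentence.split(' ')

# maxsplit limits how many splits are performed (here: only the first '.').
first_split = sentence.split('.', 1)

print(words)
print(by_space)
print(first_split)
```

With an explicit `' '` separator the double space yields an empty string in the result, which is usually why the no-argument form is preferred for splitting into words.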


In [8]:
b = 'There is no business like show business.'

print(b.startswith('T'))
print(b.startswith('There'))
print(b.startswith('there'))

True
True
False


In [9]:
string = '''dasa
                dafdff'''

In [10]:
x = '''

Hi 
My NMe =
asd
asd




'''
x

'\n\nHi \nMy NMe =\nasd\nasd\n\n\n\n\n'


Note that startswith and endswith are boolean operations: each returns True or False as a result.


In [11]:
b = 'There is no business like show business.'
print(b.endswith('.'))
print(b.endswith('business'))
print(b.endswith('Business.'))

True
False
False
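Because these checks are case sensitive, a common workaround is to normalize the case first. The sketch below also shows that startswith and endswith accept a tuple of candidates. The variable names are only illustrative.

```python
b = 'There is no business like show business.'

# A tuple of prefixes: True if the string starts with any of them.
starts_any = b.startswith(('Here', 'There'))

# Lowercase both sides for a case-insensitive suffix check.
ends_ci = b.lower().endswith('Business.'.lower())

# The same idea works for the `in` operator.
contains_ci = 'Business'.lower() in b.lower()

print(starts_any, ends_ci, contains_ci)
```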


In [12]:
'like' in b   # evaluated, but not displayed: only a cell's last expression is echoed
print('business' in b)
print('Business' in b)

True
False


In [13]:
c = 'shE HaD a maRveLoUs aSsoRtmeNt of PUPPETS.'

print(c.lower())
print(c.upper())
print(c.capitalize())
print(c.title())

she had a marvelous assortment of puppets.
SHE HAD A MARVELOUS ASSORTMENT OF PUPPETS.
She had a marvelous assortment of puppets.
She Had A Marvelous Assortment Of Puppets.


In [14]:
d = '        II   I have a tendency to leave trailing spaces. II'

# print(d.strip('I '))
# print(d.lstrip())
print(d.rstrip('I'))

        II   I have a tendency to leave trailing spaces. 
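The argument to strip, lstrip, and rstrip is treated as a set of characters to remove, not as a prefix or suffix. A minimal sketch with a shortened, made-up version of the string above shows how the variants differ:

```python
# Shortened, hypothetical variant of the string used above.
d = '  II  I have spaces. II'

no_ws = d.strip()                 # removes whitespace from both ends
no_i_ws = d.strip('I ')           # removes any mix of 'I' and ' ' from both ends
right_i = d.rstrip('I')           # removes only trailing 'I' characters
chained = d.rstrip('I').rstrip()  # trailing 'I's first, then trailing whitespace

# repr() makes leading and trailing spaces visible.
for value in (no_ws, no_i_ws, right_i, chained):
    print(repr(value))
```

Note that `strip('I ')` also removes the pronoun "I" at the start of the sentence, because every leading character that belongs to the set is stripped until a different character is found.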


In [16]:
e = 'I thought the movie was wonderful!'
print(e.replace('wonderful', 'horrible'))
print(e.replace('wonderful', 'just OK'))

I thought the movie was horrible!
I thought the movie was just OK!


In [None]:
string = "This is Ironhack."
string.replace('.','')

#### Note: these techniques are important in data wrangling/cleaning, especially in text mining
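As a rough illustration of how these methods combine in a cleaning step, here is a minimal sketch. The helper name `clean_text` and the choice of steps are assumptions made for this example, not a fixed recipe.

```python
import string

def clean_text(raw):
    """Hypothetical helper: lowercase, drop ASCII punctuation, collapse whitespace."""
    lowered = raw.lower()
    # A translation table with a third argument deletes those characters.
    no_punct = lowered.translate(str.maketrans('', '', string.punctuation))
    # split() followed by join() collapses runs of whitespace into single spaces.
    return ' '.join(no_punct.split())

cleaned = clean_text('  This is   Ironhack.  ')
print(cleaned)
```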


### REGEX Regular Expressions:

Documentation https://docs.python.org/3/library/re.html

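Regular expressions are pattern-matching operations on strings.

Why do we need them? String methods such as split and replace only work with fixed text, whereas a regular expression describes a whole family of strings with a single pattern.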

Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.
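The restriction can be seen in a short sketch: a str pattern works on str, a bytes pattern works on bytes, and mixing the two raises a TypeError. The sample text is made up for the example.

```python
import re

text = 'My neighbor has 5 dogs.'   # str
data = text.encode('utf-8')         # bytes

str_hits = re.findall(r'[0-9]', text)     # str pattern on a str
bytes_hits = re.findall(rb'[0-9]', data)  # bytes pattern on bytes

mixed_error = None
try:
    re.findall(rb'[0-9]', text)           # bytes pattern on a str
except TypeError as err:
    mixed_error = err

print(str_hits, bytes_hits)
print('TypeError:', mixed_error)
```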



In [None]:
import re

In [None]:
text = 'My neighbor, Mr. Rogers, has 5 dogs, neigh m neigh.'
pattern = 'neigh'
print(re.findall(pattern, text))

In [None]:
text = 'My neighbor, Mr. Rogers, has 5 dogs.'
pattern = 'neigh'
print(re.findall(pattern, text))

In [42]:
text = 'My neighbor, Mr. Rogers, has 5 dogs.'
print(re.findall('[Mbn]', text))   # Note that matching is case sensitive
''.join(re.findall('[Mbn]', text))

['M', 'n', 'b', 'M']


'MnbM'

In [43]:
x = ['qrq', 'gsgshs', 'gstwqer']
' '.join(x)

'qrq gsgshs gstwqer'


Character ranges inside square brackets are predefined sets that we can use as shortcuts:


In [44]:
print(re.findall('[a-z]', text))

['y', 'n', 'e', 'i', 'g', 'h', 'b', 'o', 'r', 'r', 'o', 'g', 'e', 'r', 's', 'h', 'a', 's', 'd', 'o', 'g', 's']


In [45]:
print(re.findall('[A-Z]', text))

['M', 'M', 'R']


In [46]:
print(re.findall('[a-zA-Z]', text))

['M', 'y', 'n', 'e', 'i', 'g', 'h', 'b', 'o', 'r', 'M', 'r', 'R', 'o', 'g', 'e', 'r', 's', 'h', 'a', 's', 'd', 'o', 'g', 's']


In [47]:
print(re.findall('[0-9]', text))

['5']



You can also narrow or widen the range covered by a predefined set, as shown below:
    

In [52]:
print(re.findall('[w-z]', text))

['y']



Predefined sets can be combined inside a single pair of brackets:


In [53]:
text = 'My neighbor, Mr. Rogers, has 5 dogs.'
print(re.findall('[a-zA-Z0-9]', text))

['M', 'y', 'n', 'e', 'i', 'g', 'h', 'b', 'o', 'r', 'M', 'r', 'R', 'o', 'g', 'e', 'r', 's', 'h', 'a', 's', '5', 'd', 'o', 'g', 's']


QUICK QUESTION

What is the difference between the last piece of code and this code?

print(re.findall('[A-Z a-z]', text))  #Note the extra space that's added


A leading ^ inside the brackets gives the complement of the predefined set:


In [54]:
print(re.findall('[^A-Za-z]', text))   # A-Za-z, not A-z, which would also cover [ \ ] ^ _ `

[' ', ',', ' ', '.', ' ', ',', ' ', ' ', '5', ' ', '.']
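A range such as `A-z` is easy to write by accident. It follows character code order, so it also covers the six punctuation characters that sit between `Z` and `a`. The sketch below uses a made-up sample string to show the difference from `A-Za-z`:

```python
import re

sample = 'a_b[c]'   # hypothetical string containing '_', '[' and ']'

wide = re.findall('[A-z]', sample)           # letters plus [ \ ] ^ _ `
letters = re.findall('[A-Za-z]', sample)     # letters only
wide_rest = re.findall('[^A-z]', sample)     # complement of the wide range
true_rest = re.findall('[^A-Za-z]', sample)  # complement of the letters

print(wide)       # ['a', '_', 'b', '[', 'c', ']']
print(letters)    # ['a', 'b', 'c']
print(wide_rest)  # []
print(true_rest)  # ['_', '[', ']']
```

For the sentence used in this notebook both patterns give the same result, because it contains none of those six characters.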


#### Special Sequences / Escape Sequences


These are shortcuts (also known as shorthand character classes) such as \w, \d, and \s that let us write an expression such as [a-zA-Z0-9_] in a much shorter form.


In [55]:
print(re.findall(r'\w', text))   # raw string, so Python does not treat \w as a string escape

['M', 'y', 'n', 'e', 'i', 'g', 'h', 'b', 'o', 'r', 'M', 'r', 'R', 'o', 'g', 'e', 'r', 's', 'h', 'a', 's', '5', 'd', 'o', 'g', 's']
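A few of the other special sequences follow the same idea. This is a small sketch using the same sentence; the variable names are only illustrative.

```python
import re

text = 'My neighbor, Mr. Rogers, has 5 dogs.'

digits = re.findall(r'\d', text)      # \d: any digit
spaces = re.findall(r'\s', text)      # \s: any whitespace character
words = re.findall(r'\w+', text)      # \w+: runs of letters, digits, or '_'
non_words = re.findall(r'\W', text)   # \W: anything that \w does not match

print(digits)
print(len(spaces))
print(words)
print(non_words)
```

Adding `+` after `\w` groups consecutive characters, so the result is a list of words rather than single characters.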


In [56]:
print(re.split(',', text))

['My neighbor', ' Mr. Rogers', ' has 5 dogs.']


In [None]:
#print(re.split('[\A]', text))   # left commented out: \A is not valid inside a character class and raises re.error

In [58]:
text

'My neighbor, Mr. Rogers, has 5 dogs.'

In [61]:
print(re.split('[0-9] ', text))

['My neighbor, Mr. Rogers, has ', 'dogs.']


In [59]:
print(re.sub('[0-9]', '100', text))

My neighbor, Mr. Rogers, has 100 dogs.
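The pattern `[0-9]` replaces one digit at a time, which matters as soon as a number has more than one digit. The sketch below uses an extended, made-up sentence to compare it with `\d+`, and also shows the optional `count` argument and a function as the replacement.

```python
import re

# Hypothetical extension of the sentence above, with a two-digit number.
text = 'My neighbor, Mr. Rogers, has 5 dogs and 12 cats.'

per_digit = re.sub(r'[0-9]', '100', text)            # each digit is replaced separately
per_number = re.sub(r'\d+', '100', text)             # each whole number is replaced once
first_only = re.sub(r'\d+', '100', text, count=1)    # only the first match is replaced

# A function receives each match object and returns the replacement text.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)

print(per_digit)
print(per_number)
print(first_only)
print(doubled)
```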
