## III. String Manipulation


### 1. String Methods

In [1]:
# Use split() to separate a string
string = "a, b, c, d"
string.split(',')

['a', ' b', ' c', ' d']

In [2]:
# split() is often combined with strip() to trim whitespace
string_pieces = string.split(',')
print(string_pieces)
string_pieces_cleaned = [x.strip() for x in string_pieces]
print(string_pieces_cleaned)

['a', ' b', ' c', ' d']
['a', 'b', 'c', 'd']


In [3]:
# Use + to concatenate strings
string = "I" + " " + "like" + " " + "pizza."
print(string)

I like pizza.


In [7]:
# Use join() to concatenate a list of strings with a delimiter
names = ["Alex", "Brian", "Charlie", "Douglas"]
string = ", ".join(names)
print(string)

Alex, Brian, Charlie, Douglas


In [12]:
# Use in, find() and index() to detect or locate a substring
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print("DEF" in alphabet)
print(alphabet.find("Alex")) # find() will return -1 if the substring does not exist
print(alphabet.index("DEF")) # index() raises ValueError if the substring does not exist
# print(alphabet.find("abc"))

True
-1
3
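
The difference between `find()` and `index()` only shows up when the substring is missing. A minimal sketch, reusing `alphabet` from the cell above; the helper name `locate` is made up for illustration and is not part of the original notebook:

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def locate(text, sub):
    """Hypothetical helper: return the position of sub in text, or None if it is absent."""
    try:
        return text.index(sub)  # index() raises ValueError when sub is missing
    except ValueError:
        return None

print(alphabet.find("abc"))     # find() signals a miss with -1
print(locate(alphabet, "abc"))  # the ValueError from index() is turned into None
print(locate(alphabet, "DEF"))
```

`find()` is convenient for a quick check, while `index()` is the safer choice when a missing substring should stop the program instead of silently producing -1.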


In [13]:
# Extract the substring from alphabet starting at index 10 and ending just before index 20 (the end index is exclusive)
substring = alphabet[10:20]
print(substring)

KLMNOPQRST


In [14]:
# count() returns the number of occurrences of a substring
print(alphabet.count("DEF"))
print(string.count(" "))

1
3


In [15]:
# replace() is used to replace one substring with another
print(string.replace("Alex", "Alexander"))

Alexander, Brian, Charlie, Douglas


In [16]:
# replace() can also be used to delete a substring:
print(string.replace(", ", ""))

AlexBrianCharlieDouglas


### 2. Regular Expressions
**Regular expressions** provide a flexible way to search for or match complex string patterns in text. Python's built-in `re` module applies regular expressions to strings. Let's have a look at some examples.

In [19]:
import re
# Example 1: Split a string on a variable amount of whitespace
string = "a  b    c    d \t e  \n  f   g"
print(string)
# string.split(' ') # This does not work: it yields empty strings and keeps \t and \n
pieces = re.split(r'\s+', string) # \s matches any whitespace character, + means one or more. The raw string (r'...') keeps the backslash intact.
print(pieces)
print(pieces)

a  b    c    d 	 e  
  f   g
['a', 'b', 'c', 'd', 'e', 'f', 'g']
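
To see why `split(' ')` falls short here, the sketch below compares it with `split()` called without an argument and with `re.split()`. The variable names `by_space`, `by_default` and `by_regex` are chosen for illustration only:

```python
import re

string = "a  b    c    d \t e  \n  f   g"

by_space = string.split(' ')         # splits at every single space, so runs of spaces yield empty strings
by_default = string.split()          # no argument: splits on runs of any whitespace
by_regex = re.split(r'\s+', string)  # regex equivalent for this input

print(by_space[:4])
print(by_default)
print(by_default == by_regex)
```

For this input the last two approaches agree. They differ when the string has leading or trailing whitespace: `split()` drops it, while `re.split()` returns an empty string at that end.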


Useful `re` functions:
- findall()
- search()
- split()
- sub()

In [20]:
re.findall(r'\s+', string)

['  ', '    ', '    ', ' \t ', '  \n  ', '   ']

In [22]:
match = re.search(r'\s+', string)
print("Substring:", match.group())
print("Location:", match.span())
print("Start:", match.start())
print("End:", match.end())

Substring:   
Location: (1, 3)
Start: 1
End: 3


In [23]:
re.sub(r'\s+', ',', string)

'a,b,c,d,e,f,g'

**Construct a regular expression:**

[Reference](https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285)

1. Anchors
    - ^The: **Starts with** The
    - day\$: **Ends with** day
2. Quantifiers:
    - ab\s\*: ab followed by **zero or more** whitespace characters
    - ab\s+: ab followed by **one or more** whitespace characters
    - ab\s?: ab followed by **zero or one** whitespace character
    - ab\s{2}: ab followed by **exactly 2** whitespace characters
    - ab\s{2,5}: ab followed by **2 to 5** whitespace characters (no space inside the braces)
    - ab\s{2,}: ab followed by **2 or more** whitespace characters
3. OR operator
    - a(b|c): a followed by **b or c**
    - a[bc]: same as above, but without capturing the b or c
4. Character classes
    - \d: a single digit
    - \w: a single word character (letter, digit, or underscore)
    - \s: a single whitespace character
    - .: any character except a newline
    - \D: a single non-digit
    - \W: a single non-word character (not a letter, digit, or underscore)
    - \S: a single non-whitespace character
5. Bracket expression
    - [a-c]: a or b or c
    - [0-7]: a digit between 0 and 7
    - [^a-c]: any character except a, b, or c
6. Greedy match
    - <.+>: one or more characters enclosed in <>, **expanding as far as possible** (the quantifiers \*, +, and {} are greedy)
7. Capturing:
    - a(bc): **capture** the group with value bc
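
The building blocks above can be tried out directly with the `re` functions from earlier. This is a small sketch on made-up sample strings; the lazy variant `+?` does not appear in the list and is included only as a contrast to the greedy match:

```python
import re

text = "The <b>first</b> day and the <i>last</i> day"

# 1. Anchors: ^ ties the match to the start of the string, $ to the end
starts_with_the = re.search(r'^The', text) is not None
ends_with_day = re.search(r'day$', text) is not None

# 2. Quantifiers: {2,} needs at least two whitespace characters after "ab"
two_or_more = re.findall(r'ab\s{2,}', "ab c ab  c")

# 3. OR operator: findall() returns the captured group for a(b|c)
#    and the whole match for a[bc]
with_group = re.findall(r'a(b|c)', "ab ac ad")
with_class = re.findall(r'a[bc]', "ab ac ad")

# 5. Bracket expression: ^ inside [] negates the set
not_a_to_c = re.findall(r'[^a-c]', "abcd1")

# 6. Greedy vs. lazy: a ? after + makes the match stop at the first possible ">"
greedy = re.findall(r'<.+>', text)
lazy = re.findall(r'<.+?>', text)

# 7. Capturing: group(0) is the whole match, group(1) the first parenthesized group
capture = re.search(r'a(bc)', "xxabcxx")

print(starts_with_the, ends_with_day)
print(two_or_more)
print(with_group, with_class)
print(not_a_to_c)
print(greedy)
print(lazy)
print(capture.group(0), capture.group(1))
```

The greedy pattern returns a single match that runs from the first `<` to the last `>`, whereas the lazy one returns each tag separately.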

In [None]:
# Example 2: Remove additional spaces
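
One possible way to approach this example, assuming "additional spaces" means runs of two or more spaces that should collapse into a single space. The sample sentence and the names `messy` and `cleaned` are assumptions, since the cell provides no input:

```python
import re

# Assumed sample input; the original cell does not provide one
messy = "I   like    pizza  and   pasta."

# ' {2,}' matches two or more consecutive spaces; each run is replaced by one space
cleaned = re.sub(r' {2,}', ' ', messy)
print(cleaned)
```

Using `\s{2,}` instead would also collapse tabs and newlines, which may or may not be wanted.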



In [25]:
# Example 3: Extract Social Security Number
string = "123-45-6789"
pattern = r"(\d{3})-(\d{2})-(\d{4})"
regex = re.compile(pattern)
match = regex.match(string)
print(match.groups())

('123', '45', '6789')


In [None]:
# Example 4: Extract info from email addresses
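
A possible sketch for this example, following the same capture-group approach as Example 3. The addresses are made up, and the pattern is a deliberately simplified assumption that does not cover every valid email address:

```python
import re

# Assumed sample addresses; the original cell does not provide any
emails = ["alex@example.com", "brian.smith@mail.example.org"]

# Three groups: user name, domain name, and domain suffix (simplified pattern)
pattern = r"([\w.+-]+)@([\w.-]+)\.([A-Za-z]{2,})"
regex = re.compile(pattern)

results = []
for address in emails:
    match = regex.fullmatch(address)  # the whole address must fit the pattern
    if match:
        results.append(match.groups())

print(results)
```

Because `[\w.-]+` is greedy, it first consumes the entire domain and then backtracks just far enough for the final `\.([A-Za-z]{2,})` to match, so the last group ends up holding only the suffix.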