# Strings

## Formatting & String Manipulation

#### F-Strings - Format Specifiers
F-strings are used for string formatting & interpolation. A placeholder '{ }' in an f-string holds a value, & that value can be formatted. Placeholders should contain the value, followed by a colon ':', then the format specifier.

In [6]:
# f-string for decimal rounding - .2f format specifier
f'{17.489523:.2f}'

'17.49'

In [8]:
# f-strings for formatting an integer as a decimal (base-10) value - d format specifier
f'{10:d}'

'10'

In [12]:
from decimal import Decimal
# f-strings for formatting to Exponential (scientific notation) - e format specifier
f'{Decimal("10000000000000000.0"):.3e}'

'1.000e+16'

In [19]:
# f-strings right & left alignment - < > format specifiers
f'[{"Hello World":>20}]'

'[         Hello World]'

In [21]:
# f-strings centering alignment - ^ format specifier
f'[{"Hello World":^20}]'

'[    Hello World     ]'
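Format specifiers can also be combined. A minimal sketch with a few other commonly used specifiers; the sample values and variable names are made up for illustration:

```python
# A few more format specifiers (sample values chosen for illustration)
amount = 1234567.891
ratio = 0.256

with_commas = f'{amount:,.2f}'   # thousands separator plus 2 decimal places
as_percent = f'{ratio:.1%}'      # multiply by 100 and append a percent sign
padded = f'[{42:>8d}]'           # integer right-aligned in a field of width 8
filled = f'[{"Hi":*^10}]'        # centered in width 10, padded with '*'

print(with_commas)  # 1,234,567.89
print(as_percent)   # 25.6%
print(padded)       # [      42]
print(filled)       # [****Hi****]
```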

In [27]:
# multiple placeholders - format()
# '{} {}'.format('Amanda', 'Cyan')
'{last} {first}'.format(first='Amanda', last='Gray')

'Gray Amanda'

#### strip() - Stripping Whitespace
Can remove leading & trailing whitespace with strip(), or just leading with lstrip() or just trailing with rstrip()

In [39]:
# removing leading & trailing whitespace
sentence = '\t \n This is a test string. \t\t \n'
sentence.strip()

'This is a test string.'
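lstrip() & rstrip() are mentioned above but not shown. A short sketch on the same sample string (the variable names are illustrative):

```python
# Same sample string as in the cell above
sentence = '\t \n This is a test string. \t\t \n'

left_stripped = sentence.lstrip()   # removes leading whitespace only
right_stripped = sentence.rstrip()  # removes trailing whitespace only

print(repr(left_stripped))   # 'This is a test string. \t\t \n'
print(repr(right_stripped))  # '\t \n This is a test string.'
```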

#### capitalize() & title() - Changing Character Case

In [42]:
# capitalizing String's first character
'happy birthday'.capitalize()

'Happy birthday'

In [45]:
# capitalizing first character of every word in String
'strings: a deeper look'.title()

'Strings: A Deeper Look'

#### Comparison Operations with Strings
Characters in strings are compared by their integer code points (see ord()). Uppercase letters have lower values than lowercase letters, so they compare as less than lowercase letters.

In [79]:
# see the values of uppercase 'A' vs lowercase 'a'
print(f'A: {ord("A")}; a: {ord("a")}')

A: 65; a: 97


In [81]:
# string comparison
'Orange' <= 'orange'

True
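One practical consequence is that a default sort puts capitalized words first. A small sketch, assuming a made-up word list, that compares the default order with a case-insensitive sort using str.casefold:

```python
words = ['orange', 'Apple', 'banana', 'Cherry']

# Default sort compares code points, so capitalized words come first
default_order = sorted(words)

# Sorting with a case-folded key ignores case
caseless_order = sorted(words, key=str.casefold)

print(default_order)   # ['Apple', 'Cherry', 'banana', 'orange']
print(caseless_order)  # ['Apple', 'banana', 'Cherry', 'orange']

# Case-insensitive equality check
same_word = 'Orange'.casefold() == 'orange'.casefold()
print(same_word)       # True
```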

## Working with Substrings

#### count() - Counting Occurrences of Substrings
Count the occurrences of a substring within a greater string with count('substring',start_index,end_index)

In [1]:
# count occurrences of 'to', starting the search at index 12
sentence = 'to be or not to be that is the question'
sentence.count('to',12)

1
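To make the optional arguments concrete, here is a sketch on the same sentence. The end index is exclusive, as with slicing:

```python
sentence = 'to be or not to be that is the question'

total = sentence.count('to')              # whole string
first_part = sentence.count('to', 0, 12)  # only characters 0..11
too_short = sentence.count('to', 12, 14)  # slice ' t' cannot contain 'to'
just_enough = sentence.count('to', 12, 15)  # slice ' to' contains it once

print(total, first_part, too_short, just_enough)  # 2 1 0 1
```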

#### index() - Locating Substrings
Both index() & rindex() return the index of a substring (for the first/last occurrence, respectively) or raise a ValueError if it is not found.

In [3]:
# index of first occurrence
sentence.index('be')

3

In [4]:
# index of last occurrence
sentence.rindex('be')

16
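Because index() raises when nothing is found, a lookup that might fail needs either a try/except or the related find() method, which returns -1 instead. A minimal sketch ('xyz' is just a substring that is not present):

```python
sentence = 'to be or not to be that is the question'

# find() returns -1 rather than raising when the substring is missing
position = sentence.find('xyz')
print(position)  # -1

# index() raises ValueError for the same lookup
try:
    sentence.index('xyz')
    found = True
except ValueError:
    found = False

print(found)  # False
```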

#### String Contains a Substring?
Can use the 'in' keyword to determine whether a string is contained in another string. Alternatively, can test whether a string starts with or ends with a substring using startswith() and endswith().

In [14]:
# using 'in' keyword
'that' in sentence

True

In [16]:
# checking start of a String
sentence.startswith('be')

False

In [18]:
# checking end of a String
sentence.endswith('question')

True

#### replace() - Replacing Substrings
Replace substrings using replace('target','replace_with').

In [22]:
# replacing tabs
values = '1\t2\t3\t4\t5'
values.replace('\t',',')

'1,2,3,4,5'

#### split() & join() - Splitting and Joining
- split('delimiter')
- 'delimiter'.join(list)

In [25]:
# splitting string into tokens
letters = 'A, B, C, D'
letters.split(', ')

['A', 'B', 'C', 'D']

In [31]:
# joining list back into string
' | '.join(['A','B','C'])

'A | B | C'
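Two variations that may be useful: split() accepts a maximum number of splits, and calling it with no delimiter splits on any run of whitespace. A sketch with illustrative sample strings:

```python
letters = 'A, B, C, D'

# limit the number of splits to 2; the remainder stays in one piece
first_two = letters.split(', ', 2)
print(first_two)  # ['A', 'B', 'C, D']

# with no delimiter, split() breaks on any run of whitespace
messy = '  A   B \t C  '
tokens = messy.split()
print(tokens)     # ['A', 'B', 'C']

# join the cleaned tokens back together
rejoined = ', '.join(tokens)
print(rejoined)   # A, B, C
```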

#### partition() - Partitioning Strings
partition() can be used to split a string into a tuple of three strings based on the method's separator argument

In [36]:
# partitioning String with separator ':'
'Amanda: 89, 97, 92'.partition(': ')

('Amanda', ': ', '89, 97, 92')

In [59]:
# partitioning url using right partition
url = 'http://www.deitel.com/books/PyCDS/table_of_contents.html'
rest_of_url, separator, document = url.rpartition('/')
document

'table_of_contents.html'
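When the separator is missing, both methods still return three strings, which makes unpacking safe. A short sketch of where the original text ends up in each case:

```python
# partition() keeps the whole string in the first slot
head_first = 'Amanda'.partition(': ')
print(head_first)  # ('Amanda', '', '')

# rpartition() keeps the whole string in the last slot
tail_last = 'Amanda'.rpartition(': ')
print(tail_last)   # ('', '', 'Amanda')
```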

## Character Testing & Regular Expressions

#### Characters & Character-Testing Methods
Python provides string methods for testing whether a string matches certain characteristics. Some of these include:
- isalnum()
- isalpha()
- isdecimal()
- isdigit()
- isidentifier()
- islower()
- isnumeric()
- isspace()
- istitle()

For more check out: https://www.w3schools.com/python/python_ref_string.asp

In [66]:
# checking for digits (negative number: the '-' is not a digit)
'-27'.isdigit()

False

In [68]:
# checking for digit (positive)
'27'.isdigit()

True

In [70]:
# checking for alpha-numeric
'A9876'.isalnum()

True
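isdecimal(), isdigit() & isnumeric() agree on ordinary digits but differ on other Unicode characters. A sketch using a superscript two and a one-half fraction as sample characters:

```python
samples = {
    'plain digits': '27',
    'superscript two': '\u00b2',
    'one half': '\u00bd',
}

# collect (isdecimal, isdigit, isnumeric) for each sample
results = {
    label: (text.isdecimal(), text.isdigit(), text.isnumeric())
    for label, text in samples.items()
}

for label, flags in results.items():
    print(label, flags)
# plain digits (True, True, True)
# superscript two (False, True, True)
# one half (False, False, True)
```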

#### Raw Strings
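Raw strings (prefixed with r) treat each backslash as a literal character rather than the start of an escape sequence. Used to make code more readable & often when working with regular expressions.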

In [87]:
# regular string
file_path = 'C:\\MyFolder\\MySubFolder\\MyFile.txt'
file_path

'C:\\MyFolder\\MySubFolder\\MyFile.txt'

In [89]:
# raw string
file_path = r'C:\MyFolder\MySubFolder\MyFile.txt'
file_path

'C:\\MyFolder\\MySubFolder\\MyFile.txt'
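The two cells above produce the same string. The difference only shows when a backslash would otherwise start an escape sequence, as in this small sketch:

```python
regular = '\n'   # one character: a newline
raw = r'\n'      # two characters: a backslash and the letter n

print(len(regular), len(raw))  # 1 2

# a raw string is just another way to write doubled backslashes
same_pattern = r'\d{5}' == '\\d{5}'
print(same_pattern)            # True
```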

#### Regular Expressions
Regular expressions describe a search pattern for matching characters in other strings. Useful for:
- Extracting data from unstructured text
- Ensuring data is in the correct format before processing
- Transforming/cleaning data through removing or reformatting data

In [97]:
import re

# matching strings
pattern = '02215'
'Match' if re.fullmatch(pattern, '02215') else 'No match'

'Match'

Regular Expression Metacharacters: https://www.w3schools.com/python/gloss_python_regex_metacharacters.asp

In [107]:
# \d is a character class representing a digit (0-9)
# checking to see if '02215' is 5 digits (true)
'Valid' if re.fullmatch(r'\d{5}', '02215') else 'Invalid'

'Valid'

In [123]:
# checking to see if word starts with capital letter, followed by zero or more lowercase letters
'Valid' if re.fullmatch('[A-Z][a-z]*', 'Wally') else 'Invalid'

'Valid'

In [127]:
# when a custom character class starts with a caret (^), the class matches any
#    character that's not specified
# here we check if the character 'a' is NOT a lowercase letter (false)
'Match' if re.fullmatch('[^a-z]', 'a') else 'No match'

'No match'

In [129]:
# metacharacters in a custom character class are treated as literal characters
'Match' if re.fullmatch('[*+$]', '*') else 'No match'

'Match'

In [131]:
# * vs + Quantifier -> * for zero or more, + for one or more
# here we check for one capital followed by at least one lowercase
'Valid' if re.fullmatch('[A-Z][a-z]+', 'E') else 'Invalid'

'Invalid'

In [135]:
# ? Quantifier matches zero or one occurrences of a subexpression
# here we check for the word labelled or labeled, where the second 'l' is optional
'Match' if re.fullmatch('labell?ed', 'labeled') else 'No match'

'Match'

In [137]:
# {n} quantifier matches exactly n occurrences of a subexpression, {n,} matches at least n &
#    {n,m} matches between n & m (inclusive)
# here we check if the string is between 3-6 digits (false)
'Match' if re.fullmatch(r'\d{3,6}', '12') else 'No match'

'No match'
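The three brace quantifiers can be compared side by side. In this sketch, `matches` is a hypothetical helper written for the example, not part of the re module:

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

exactly_three = matches(r'\d{3}', '123')         # exactly 3 digits
too_many = matches(r'\d{3}', '1234')             # 4 digits is not exactly 3
at_least_three = matches(r'\d{3,}', '1234')      # 3 or more digits
out_of_range = matches(r'\d{3,6}', '1234567')    # 7 digits exceeds the maximum

print(exactly_three, too_many, at_least_three, out_of_range)
# True False True False
```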

In [141]:
# substitute strings with re.sub('target_substring','replacement','string')
# additionally may use count=# to specify number of replacements
re.sub(r'\t',',','1\t2\t3\t4')

'1,2,3,4'

In [147]:
# tokenize a string with re.split('delimiter','string')
# here our delimiter is a comma followed by zero or more whitespace
# additionally may use maxsplit=# to specify maximum number of splits
re.split(r',\s*','1, 2, 3,4, 5,6,7,8')

['1', '2', '3', '4', '5', '6', '7', '8']

In [151]:
# may find first match anywhere in a string with re.search('substring','string')
result = re.search('Python','Python is fun')

# re.search() returns a match object (if substring found) so we use
#    the match object's group() method to extract the actual matched string
result.group() if result else 'not found'

'Python'

In [153]:
# $ at the end of a regular expression is an anchor indicating that the expression matches only at the end of a string
result = re.search('Python$','Python is fun')

result.group() if result else 'not found'

'not found'
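The ^ anchor works the same way for the start of a string. A sketch of both anchors on the same sentence; `found` is a hypothetical helper for the example:

```python
import re

text = 'Python is fun'

def found(pattern):
    """Hypothetical helper: True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

starts_with_python = found('^Python')  # 'Python' is at the start
ends_with_fun = found('fun$')          # 'fun' is at the end
starts_with_fun = found('^fun')        # 'fun' is not at the start

print(starts_with_python, ends_with_fun, starts_with_fun)  # True True False
```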

In [155]:
# search string for every matching substring & return list of matching substrings
#    using re.findall('substring','string')
# here we find all phone numbers in a string using re
contact = 'Wally White, Home: 555-555-1234, Work: 555-555-4321'

re.findall(r'\d{3}-\d{3}-\d{4}', contact)

['555-555-1234', '555-555-4321']

In [157]:
# to save memory, re.finditer('substring','string') performs the same search as re.findall
#    but lazily returns one match at a time
# note that finditer() yields match objects, so we use group() to extract each matched string
for phone in re.finditer(r'\d{3}-\d{3}-\d{4}',contact):
    print(phone.group())

555-555-1234
555-555-4321


In [166]:
# can capture substrings of a match by wrapping parts of the pattern in parentheses
text = 'Charlie Cyan, e-mail: demol@deitel.com'

# our pattern has two capture groups: first a name of two words (each starting with a
#    capital letter), then an email given by one or more word characters, an '@', one or
#    more word characters, a '.', and 3 more word characters
pattern = r'([A-Z][a-z]+ [A-Z][a-z]+), e-mail: (\w+@\w+\.\w{3})'
result = re.search(pattern, text)

# again, search returns a match object, we use its groups() method to extract the captured substrings as a tuple
# note: groups() different from group()
result.groups()

('Charlie Cyan', 'demol@deitel.com')

In [168]:
# groups() returns the captured substrings as a tuple, group() with no argument returns the entire match as one string
result.group()

'Charlie Cyan, e-mail: demol@deitel.com'

In [170]:
# when a match contains multiple capture groups, can select one by passing its
#    number to group() (numbering starts at 1)
result.group(2)

'demol@deitel.com'
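Numbered groups can get hard to read as patterns grow. One alternative is named groups with the (?P<name>...) syntax. The sketch below reuses the pattern above; the group names `name` and `email` are chosen for this example:

```python
import re

text = 'Charlie Cyan, e-mail: demol@deitel.com'

# same pattern as above, with each capture group given a name
pattern = (r'(?P<name>[A-Z][a-z]+ [A-Z][a-z]+), '
           r'e-mail: (?P<email>\w+@\w+\.\w{3})')
result = re.search(pattern, text)

if result:
    person = result.group('name')   # select a group by name
    fields = result.groupdict()     # all named groups as a dictionary
    print(person)  # Charlie Cyan
    print(fields)  # {'name': 'Charlie Cyan', 'email': 'demol@deitel.com'}
```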

## Data Science

#### Data Cleaning
Data cleaning is a difficult & messy process. Some common data transformations include:
- removing unnecessary data and features
- combining related features
- sampling data to obtain a representative subset
- standardizing data formats
- grouping data and more

Bad data & missing values can significantly impact data analysis. One common approach is to substitute a reasonable value for a missing one.

In [4]:
import pandas as pd

# example - using regular expressions to validate data
zips = pd.Series({'Boston':'02215', 'Miami':'3310'})
zips

Boston    02215
Miami      3310
dtype: object

In [10]:
# using match() with RE to check for values that start with 5 digits
zips.str.match(r'\d{5}')

Boston     True
Miami     False
dtype: bool
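As far as I know, str.match() follows re.match() semantics, meaning it is anchored at the start only, so a value with extra trailing characters would still pass. A stdlib-only sketch of the difference, using a made-up six-digit value:

```python
import re

too_long = '022159'  # six digits, not a valid 5-digit ZIP

# match() only requires the pattern to match at the start of the string
prefix_ok = re.match(r'\d{5}', too_long) is not None

# fullmatch() requires the entire string to match
whole_ok = re.fullmatch(r'\d{5}', too_long) is not None

print(prefix_ok, whole_ok)  # True False
```

If the whole value must be exactly five digits, a full match (or an explicit `$` anchor in the pattern) is the stricter check.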

In [12]:
# example - using contains() to see if a value contains a substring
# here we check if our data (cities) contains two capital letters (a state abbreviation)
cities = pd.Series(['Boston, MA 02215', 'Miami, FL 33101'])

cities.str.contains(r'[A-Z]{2}')

0    True
1    True
dtype: bool

#### Data Reformatting

In [17]:
# example data containing name, email & phone
contacts = [['Mike Green', 'demo1@deitel.com', '5555555555'],
           ['Sue Brown', 'demo2@deitel.com', '5555551234']]

contactsdf = pd.DataFrame(contacts, columns=['Name', 'Email', 'Phone'])

contactsdf

         Name             Email       Phone
0  Mike Green  demo1@deitel.com  5555555555
1   Sue Brown  demo2@deitel.com  5555551234


In [19]:
# getting the formatted version of a phone number (including hyphens)
import re

def get_formatted_phone(value):
    result = re.fullmatch(r'(\d{3})(\d{3})(\d{4})', value)
    return '-'.join(result.groups()) if result else value

formatted_phone = contactsdf['Phone'].map(get_formatted_phone)

formatted_phone

0    555-555-5555
1    555-555-1234
Name: Phone, dtype: object
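The same reformatting can also be written as a single re.sub() call with backreferences (\1, \2, \3) to the capture groups. This is an alternative sketch, not the approach used in the cell above, and `format_phone` is a hypothetical name:

```python
import re

def format_phone(value):
    """Hypothetical helper: insert hyphens into a 10-digit phone string."""
    # \1, \2 and \3 refer to the three capture groups in the pattern;
    # strings that do not match are returned unchanged
    return re.sub(r'^(\d{3})(\d{3})(\d{4})$', r'\1-\2-\3', value)

formatted = format_phone('5555551234')
unchanged = format_phone('12345')

print(formatted)  # 555-555-1234
print(unchanged)  # 12345
```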

In [23]:
# replacing the 'Phone' column in the table with formatted form
contactsdf['Phone'] = formatted_phone

contactsdf

         Name             Email         Phone
0  Mike Green  demo1@deitel.com  555-555-5555
1   Sue Brown  demo2@deitel.com  555-555-1234
